CN107437412B - Acoustic model processing method, voice synthesis method, device and related equipment - Google Patents

Acoustic model processing method, voice synthesis method, device and related equipment

Info

Publication number: CN107437412B (grant of application publication CN107437412A)
Application number: CN201610353978.5A
Authority: CN (China)
Prior art keywords: spectrum, model, processing, processed, magnitude
Legal status: Active (granted)
Other languages: Chinese (zh)
Inventors: 宋阳 (Song Yang), 陈伟 (Chen Wei)
Original and current assignee: Beijing Sogou Technology Development Co Ltd
Filed by: Beijing Sogou Technology Development Co Ltd (application CN201610353978.5A)


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00: Speech synthesis; Text to speech systems
    • G10L 13/02: Methods for producing synthetic speech; Speech synthesisers
    • G10L 13/04: Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L 13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03: Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L 25/18: Speech or voice analysis techniques in which the extracted parameters are spectral information of each sub-band
    • G10L 25/21: Speech or voice analysis techniques in which the extracted parameters are power information

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephonic Communication Services (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention belongs to the field of data processing and discloses an acoustic model processing method, a speech synthesis method, corresponding apparatuses and related equipment, which are used to solve the technical problem of poor sound quality of synthesized speech in the prior art. The method comprises the following steps: acquiring preset parameters of a spectrum model in a speech model; converting the preset parameters of the spectrum model into a magnitude spectrum; performing adaptive post-processing on the magnitude spectrum to obtain a processed magnitude spectrum; and converting the processed magnitude spectrum back into preset parameters of the spectrum model, thereby obtaining the processed spectrum model. The technical effect of improving speech synthesis quality is achieved.

Description

Acoustic model processing method, voice synthesis method, device and related equipment
Technical Field
The present invention relates to the field of data processing, and in particular to an acoustic model processing method, a speech synthesis method, corresponding apparatuses and related equipment.
Background
Mainstream off-line speech synthesis systems are based on HMM (Hidden Markov Model) parametric speech synthesis. A speech model is first trained, and speech synthesis is then realized through that speech model. Referring to FIG. 1, establishing the speech model comprises the following steps:
step S101: obtaining a corpus;
step S102: extracting acoustic parameters from the corpora in the corpus;
step S103: performing context-dependent HMM-GMM modeling on the acoustic parameters and the corresponding prosodic texts in the corpus to obtain a speech model, wherein the modeled objects comprise the spectrum, the fundamental frequency and the duration.
After the speech model is built, referring to FIG. 2, speech can be synthesized by the following steps:
step S201: acquiring a text to be synthesized;
step S202: analyzing the context information of the text to be synthesized;
step S203: performing model prediction on the context information through the speech model to obtain the corresponding acoustic parameters, wherein the acoustic parameters comprise the spectrum and fundamental frequency information;
step S204: the acoustic parameters are synthesized into speech by the vocoder.
Speech synthesized by this scheme suffers from poor sound quality, resulting in a poor user experience.
Disclosure of Invention
The invention provides an acoustic model processing method, a speech synthesis method, corresponding apparatuses and related equipment, aiming to solve the technical problem of poor sound quality of synthesized speech in the prior art.
In a first aspect, an embodiment of the present invention provides an acoustic model processing method, including:
acquiring preset parameters of a spectrum model in a voice model;
converting the preset parameters of the frequency spectrum model into a magnitude spectrum;
carrying out self-adaptive post-processing on the amplitude spectrum to obtain a processed amplitude spectrum;
and converting the processed magnitude spectrum into preset parameters of the frequency spectrum model, and further obtaining the processed frequency spectrum model.
Optionally, the converting the preset parameter of the spectrum model into a magnitude spectrum includes:
converting the static parameters of the mean part of the spectrum model into the magnitude spectrum.
Optionally, the performing adaptive post-processing on the magnitude spectrum to obtain a processed magnitude spectrum includes performing adaptive post-processing on the magnitude spectrum by the following formula:
$$S_{new}(z)=S_{ori}(z)\,\frac{S_{ori}(z/\beta)}{S_{ori}(z/\alpha)}$$

wherein S_new(z) represents the processed magnitude spectrum;
S_ori(z) represents the magnitude spectrum before processing;
S_ori(z/β) represents the magnitude spectrum obtained by scaling S_ori(z) by a factor of β in the z-plane;
and S_ori(z/α) represents the magnitude spectrum obtained by scaling S_ori(z) by a factor of α in the z-plane.
Optionally, the performing adaptive post-processing on the magnitude spectrum to obtain a processed magnitude spectrum further includes:
judging, for each processed amplitude spectrum, whether the calculated processed amplitude spectrum lies within the range between a preset maximum value and a preset minimum value;
when the processed amplitude spectrum is smaller than the preset minimum value, taking the preset minimum value as the processed amplitude spectrum;
and when the processed amplitude spectrum is larger than the preset maximum value, taking the preset maximum value as the processed amplitude spectrum.
Optionally, after the performing adaptive post-processing on the magnitude spectrum to obtain a processed magnitude spectrum, the method further includes:
performing spectrum energy uniformization processing on the processed magnitude spectrum;
and the converting the processed magnitude spectrum into preset parameters of the spectrum model includes:
converting the magnitude spectrum subjected to the spectrum energy uniformization processing into preset parameters of the spectrum model.
Optionally, the method further includes:
obtaining a text to be synthesized for voice synthesis;
determining acoustic parameters of the text to be synthesized based on the voice model;
and synthesizing the voice data of the text to be synthesized through the acoustic parameters.
In a second aspect, an embodiment of the present invention provides a speech synthesis method, including:
obtaining a text to be synthesized for voice synthesis;
determining the spectrum parameters of the text to be synthesized based on a spectrum model in a voice model, wherein the spectrum model is a spectrum model subjected to adaptive post-processing, and the adaptive post-processing process comprises the following steps: converting the preset parameters of the frequency spectrum model into a magnitude spectrum, carrying out self-adaptive post-processing on the magnitude spectrum to obtain a processed magnitude spectrum, and converting the processed magnitude spectrum into the preset parameters of the frequency spectrum model;
and synthesizing the voice data of the text to be synthesized through the frequency spectrum parameters.
Optionally, before the determining the spectral parameters of the text to be synthesized based on the spectral model in the speech model, the method further includes:
performing adaptive post-processing of the spectrum model locally at the client device; and/or
Receiving the spectrum model after adaptive post-processing from the server.
In a third aspect, an embodiment of the present invention provides an acoustic model processing apparatus, including:
the acquisition module is used for acquiring preset parameters of a spectrum model in the voice model;
the first conversion module is used for converting the preset parameters of the frequency spectrum model into a magnitude spectrum;
the first obtaining module is used for carrying out self-adaptive post-processing on the amplitude spectrum to obtain a processed amplitude spectrum;
and the second conversion module is used for converting the processed magnitude spectrum into preset parameters of the frequency spectrum model so as to obtain the processed frequency spectrum model.
In a fourth aspect, an embodiment of the present invention provides a speech synthesis apparatus, including:
a third obtaining module, configured to obtain a text to be synthesized for speech synthesis;
a second determining module, configured to determine a spectrum parameter of the text to be synthesized based on a spectrum model in a speech model, where the spectrum model is a spectrum model that is subjected to adaptive post-processing, and the adaptive post-processing includes the following steps: converting the preset parameters of the frequency spectrum model into a magnitude spectrum, carrying out self-adaptive post-processing on the magnitude spectrum to obtain a processed magnitude spectrum, and converting the processed magnitude spectrum into the preset parameters of the frequency spectrum model;
and the second synthesis module is used for synthesizing the voice data of the text to be synthesized through the frequency spectrum parameters.
In a fifth aspect, embodiments of the present invention provide a processing apparatus comprising a memory, and one or more programs, wherein the one or more programs are stored in the memory, and configured for execution by the one or more processors, the one or more programs including instructions for:
acquiring preset parameters of a spectrum model in a voice model;
converting the preset parameters of the frequency spectrum model into a magnitude spectrum;
carrying out self-adaptive post-processing on the amplitude spectrum to obtain a processed amplitude spectrum;
and converting the processed magnitude spectrum into preset parameters of the frequency spectrum model, and further obtaining the processed frequency spectrum model.
In a sixth aspect, embodiments of the present invention provide a processing apparatus comprising a memory, and one or more programs, wherein the one or more programs are stored in the memory and configured for execution by the one or more processors, the one or more programs including instructions for:
obtaining a text to be synthesized for voice synthesis;
determining the spectrum parameters of the text to be synthesized based on a spectrum model in a voice model, wherein the spectrum model is a spectrum model subjected to adaptive post-processing, and the adaptive post-processing process comprises the following steps: converting the preset parameters of the frequency spectrum model into a magnitude spectrum, carrying out self-adaptive post-processing on the magnitude spectrum to obtain a processed magnitude spectrum, and converting the processed magnitude spectrum into the preset parameters of the frequency spectrum model;
and synthesizing the voice data of the text to be synthesized through the frequency spectrum parameters.
The invention has the following beneficial effects:
in the embodiments of the present invention, the speech model is processed in the following manner: preset parameters of a spectrum model in the speech model are acquired; the preset parameters of the spectrum model are then converted into a magnitude spectrum; adaptive post-processing is performed on the magnitude spectrum to obtain a processed magnitude spectrum; and the processed magnitude spectrum is converted back into preset parameters of the spectrum model, thereby obtaining the processed spectrum model. Because the adaptive post-processing acts on the preset parameters of the spectrum model, the desired signal in the spectrum model is enhanced and interfering signals are reduced, so the quality of the synthesized speech can be improved when speech data is subsequently generated through the speech model;
in addition, the object of the adaptive post-processing in this scheme is the magnitude spectrum, which is a universal spectrum representation into which various spectrum parameters can be converted; the scheme therefore applies to any spectrum parameters, and different adaptive post-processing schemes are not required for different spectrum parameters (such as line spectrum pairs or Mel cepstra), so the scheme offers strong compatibility for the adaptive post-processing of spectrum parameters;
moreover, the scheme performs the adaptive post-processing on the spectrum model in the speech model in advance, so no adaptive post-processing is needed after the acoustic parameters are generated later, which reduces the time consumed in synthesizing speech data with the speech model.
Drawings
FIG. 1 is a flow chart of prior art modeling of speech;
FIG. 2 is a flow chart of synthesizing speech data in the prior art;
FIG. 3 is a flow chart of a method of acoustic model processing according to a first aspect of an embodiment of the present invention;
FIG. 4 is a flow chart of synthesizing speech data in the acoustic model processing method according to the first aspect of the embodiment of the present invention;
FIG. 5 is a flow chart of a speech synthesis method according to a second aspect of the embodiments of the present invention;
FIG. 6 is a block diagram of an acoustic model processing apparatus according to the third aspect of the embodiments of the present invention;
FIG. 7 is a block diagram of a speech synthesis apparatus according to a fourth aspect of the embodiment of the present invention;
FIG. 8 is a block diagram of an electronic device shown in accordance with an example embodiment;
fig. 9 is a schematic structural diagram of a server in an embodiment of the present invention.
Detailed Description
The invention provides an acoustic model processing method, a speech synthesis method, corresponding apparatuses and related equipment, aiming to solve the technical problem of poor sound quality of synthesized speech in the prior art.
In order to solve the technical problems, the general idea of the embodiment of the present application is as follows:
in order to better understand the technical solutions of the present invention, the following detailed descriptions of the technical solutions of the present invention are provided with the accompanying drawings and the specific embodiments, and it should be understood that the specific features in the embodiments and the examples of the present invention are the detailed descriptions of the technical solutions of the present invention, and are not limitations of the technical solutions of the present invention, and the technical features in the embodiments and the examples of the present invention may be combined with each other without conflict.
The speech model is processed in the following manner: preset parameters of a spectrum model in the speech model are acquired; the preset parameters of the spectrum model are then converted into a magnitude spectrum; adaptive post-processing is performed on the magnitude spectrum to obtain a processed magnitude spectrum; and the processed magnitude spectrum is converted back into preset parameters of the spectrum model, thereby obtaining the processed spectrum model. Because the adaptive post-processing acts on the preset parameters of the spectrum model, the desired signal in the spectrum model is enhanced and interfering signals are reduced, so the quality of the synthesized speech can be improved when speech data is subsequently generated through the speech model.

In addition, the object of the adaptive post-processing in this scheme is the magnitude spectrum, which is a universal spectrum representation into which various spectrum parameters can be converted; the scheme therefore applies to any spectrum parameters, and different adaptive post-processing schemes are not required for different spectrum parameters (such as line spectrum pairs or Mel cepstra), so the scheme offers strong compatibility for the adaptive post-processing of spectrum parameters.

Moreover, the scheme performs the adaptive post-processing on the spectrum model in the speech model in advance, so no adaptive post-processing is needed after the acoustic parameters are generated later, which reduces the time consumed in synthesizing speech data with the speech model.

In order to better understand the above technical solutions, they are described in detail below with reference to the accompanying drawings and specific embodiments. It should be understood that the specific features in the embodiments and examples of the present invention are detailed explanations of the technical solutions of the present invention rather than limitations on them, and the technical features in the embodiments and examples may be combined with each other without conflict.
In a first aspect, an embodiment of the present invention provides an acoustic model processing method, please refer to fig. 3, including:
step S301: acquiring preset parameters of a spectrum model in a voice model;
step S302: converting the preset parameters of the frequency spectrum model into a magnitude spectrum;
step S303: carrying out self-adaptive post-processing on the amplitude spectrum to obtain a processed amplitude spectrum;
step S304: and converting the processed magnitude spectrum into preset parameters of the frequency spectrum model, and further obtaining the processed frequency spectrum model.
For example, the solution may be applied to a server or to a client device; the embodiments of the present invention are not limited in this respect. The client device is, for example, a mobile phone, a notebook computer, a tablet computer, a PC or the like.
In step S301, the speech model includes, for example, a spectrum model, a fundamental frequency model, a duration model and the like. The spectrum model generally includes a probability density part and a decision tree part, wherein the probability density part comprises a mean and a variance, and the mean and the variance each comprise static parameters and dynamic parameters. The preset parameters of the spectrum model are, for example, the static parameters of the mean part; they may also include the dynamic parameters, and the embodiments of the present invention are not limited.
In step S302, the preset parameters of the spectrum model may be converted into a magnitude spectrum by:
when the preset parameters of the spectrum model are line spectrum pair parameters, assume the format is K, l(1), l(2), …, l(V). When V is an even number, the magnitude spectrum S(ω) is:

$$S(\omega)=\frac{K}{\sqrt{2^{V}\left[\cos^{2}\frac{\omega}{2}\prod_{i=2,4,\ldots,V}\bigl(\cos\omega-\cos l(i)\bigr)^{2}+\sin^{2}\frac{\omega}{2}\prod_{i=1,3,\ldots,V-1}\bigl(\cos\omega-\cos l(i)\bigr)^{2}\right]}}$$

when V is an odd number, the magnitude spectrum S(ω) is:

$$S(\omega)=\frac{K}{\sqrt{2^{V-1}\left[\prod_{i=1,3,\ldots,V}\bigl(\cos\omega-\cos l(i)\bigr)^{2}+\sin^{2}\omega\prod_{i=2,4,\ldots,V-1}\bigl(\cos\omega-\cos l(i)\bigr)^{2}\right]}}$$
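As a worked illustration, the two formulas can be evaluated directly on a frequency grid. The following Python sketch is not part of the patent; the function name, the grid and the toy gain and line spectral frequencies are illustrative assumptions:

```python
import numpy as np

def lsp_to_magnitude(K, lsf, omega):
    """Evaluate S(omega) from the gain K and the line spectral frequencies
    lsf = [l(1), ..., l(V)] in radians, using the even/odd-order formulas."""
    V = len(lsf)
    cos_w = np.cos(omega)[:, None]
    odd = np.prod((cos_w - np.cos(lsf[0::2])) ** 2, axis=1)   # i = 1, 3, ...
    even = np.prod((cos_w - np.cos(lsf[1::2])) ** 2, axis=1)  # i = 2, 4, ...
    if V % 2 == 0:
        denom = 2.0 ** V * (np.cos(omega / 2) ** 2 * even
                            + np.sin(omega / 2) ** 2 * odd)
    else:
        denom = 2.0 ** (V - 1) * (odd + np.sin(omega) ** 2 * even)
    return K / np.sqrt(denom)

omega = np.linspace(1e-3, np.pi - 1e-3, 512)      # dense grid over (0, pi)
S = lsp_to_magnitude(1.0, np.array([0.3, 0.9, 1.7, 2.5]), omega)
```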
when the preset parameters of the spectrum model are Mel cepstrum parameters, assume the format is c_a(0), c_a(1), …, c_a(V), where the warping factor a is known and is typically set to 0.42 when the spectrum model is derived from audio at a sampling rate of 16 kHz. The cepstrum is first solved by the recursive all-pass frequency transformation below, applied with warping factor -a, where v represents the currently processed dimension; iterating i = -V, -V+1, …, 0 with the initial condition c^{(-V-1)}(v) = 0:

$$c^{(i)}(v)=\begin{cases}c_{a}(-i)-a\,c^{(i-1)}(0), & v=0\\(1-a^{2})\,c^{(i-1)}(0)-a\,c^{(i-1)}(1), & v=1\\c^{(i-1)}(v-1)-a\bigl(c^{(i-1)}(v)-c^{(i)}(v-1)\bigr), & 2\le v\le V\end{cases}$$

the unwarped cepstrum is then c_0(v) = c^{(0)}(v);
then, the magnitude spectrum is obtained through a Fourier transform followed by an exponential function with the natural constant e as its base.
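For concreteness, here is a minimal Python sketch of the whole conversion. freqt implements the recursive all-pass frequency transformation given above (the classic recursion also used by the SPTK freqt tool); the FFT size and output order are illustrative assumptions:

```python
import numpy as np

def freqt(c_in, order_out, alpha):
    """All-pass frequency warping of a cepstrum; alpha < 0 unwarps a
    Mel cepstrum back to a linear-frequency cepstrum."""
    c_out = np.zeros(order_out + 1)
    for i in range(len(c_in) - 1, -1, -1):   # feed c_in[V], ..., c_in[0]
        prev = c_out.copy()
        c_out[0] = c_in[i] + alpha * prev[0]
        if order_out >= 1:
            c_out[1] = (1.0 - alpha ** 2) * prev[0] + alpha * prev[1]
        for v in range(2, order_out + 1):
            c_out[v] = prev[v - 1] + alpha * (prev[v] - c_out[v - 1])
    return c_out

def mcep_to_magnitude(mcep, alpha=0.42, n_fft=1024):
    """Mel cepstrum -> linear cepstrum -> Fourier transform -> exp."""
    c = freqt(np.asarray(mcep, dtype=float), n_fft // 2, -alpha)
    log_mag = np.fft.rfft(c, n_fft).real    # log magnitude at n_fft/2+1 bins
    return np.exp(log_mag)
```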
In a specific implementation, the preset parameters of the spectrum model form an M × Y matrix, and each Y-dimensional row of the matrix is converted into a Y-dimensional magnitude spectrum according to the above scheme; in steps S302 to S304, one Y-dimensional magnitude spectrum is processed at a time, for a total of M passes, as sketched below.
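A row-wise driver loop matching this description might look as follows. This is a sketch, and process_row is a hypothetical callable standing for the chain of steps S302 to S304 applied to a single Y-dimensional vector:

```python
import numpy as np

def postprocess_mean_matrix(means, process_row):
    """means: an M x Y matrix of spectrum-model parameters. Each of the M
    rows is pushed through process_row (conversion to a magnitude spectrum,
    adaptive post-processing, conversion back), i.e. M passes in total."""
    return np.stack([process_row(row) for row in means])
```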
In step S303, the Y-dimensional amplitude spectrum may be subjected to adaptive post-processing by the following formula:
$$S_{new}(z)=S_{ori}(z)\,\frac{S_{ori}(z/\beta)}{S_{ori}(z/\alpha)}$$

wherein S_new(z) represents the processed magnitude spectrum;
S_ori(z) represents the magnitude spectrum before processing;
S_ori(z/β) represents the magnitude spectrum obtained by scaling S_ori(z) by a factor of β in the z-plane;
and S_ori(z/α) represents the magnitude spectrum obtained by scaling S_ori(z) by a factor of α in the z-plane.
Generally, α and β can be set empirically. The larger the value of β - α, the more pronounced the sound-quality enhancement of the synthesized speech; however, too large a value of β - α may make the synthesis unstable, for example, the synthesized speech may be distorted.
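For concreteness, the following minimal Python sketch applies the above post-processing to one Y-dimensional magnitude spectrum. It relies on the fact that replacing z by z/r multiplies the n-th coefficient of the cepstrum of log S by r^n, so the scaled spectra can be evaluated on the unit circle. The values α = 0.6 and β = 0.8, and the assumption that the spectrum is strictly positive and sampled at n_fft/2 + 1 points on [0, π], are illustrative and not taken from the patent:

```python
import numpy as np

def adaptive_postfilter(S_ori, alpha=0.6, beta=0.8):
    """Sketch of S_new(z) = S_ori(z) * S_ori(z/beta) / S_ori(z/alpha) for a
    strictly positive magnitude spectrum sampled on [0, pi]."""
    n_fft = 2 * (len(S_ori) - 1)
    # One-sided (minimum-phase) cepstrum of the log-magnitude spectrum.
    c = np.fft.irfft(np.log(S_ori), n_fft)[: n_fft // 2 + 1]
    c[1:-1] *= 2.0
    n = np.arange(len(c))

    def scaled(r):
        # Magnitude of S_ori(z/r) on the unit circle: c(n) -> c(n) * r**n.
        return np.exp(np.fft.rfft(c * r ** n, n_fft).real)

    return S_ori * scaled(beta) / scaled(alpha)
```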
In a specific implementation, after the adaptive post-processing is performed in the above manner, the range of the magnitude spectrum transformation may be limited in order to stabilize the synthesis effect. That is, the performing adaptive post-processing on the magnitude spectrum to obtain a processed magnitude spectrum further includes: judging, for each processed magnitude spectrum, whether the calculated processed magnitude spectrum lies within the range between a preset maximum value and a preset minimum value; when the processed magnitude spectrum is smaller than the preset minimum value, taking the preset minimum value as the processed magnitude spectrum; and when the processed magnitude spectrum is larger than the preset maximum value, taking the preset maximum value as the processed magnitude spectrum.
For example, the preset maximum value may be a set fixed value or a preset multiple of S_ori(z); likewise, the preset minimum value may be a set fixed value or a preset multiple of S_ori(z); the embodiments of the present invention are not limited.
If the preset maximum value and the preset minimum value are both preset multiples of S_ori(z), the transformation range of the magnitude spectrum can be defined by the following formula. Suppose the value of the y-th dimension of S_ori(z) is s_ori and the value of the y-th dimension of S_new(z) is s_new, where 1 ≤ y ≤ Y. Then:

$$s_{new}=\begin{cases}mindata\cdot s_{ori}, & s_{new}<mindata\cdot s_{ori}\\ s_{new}, & mindata\cdot s_{ori}\le s_{new}\le maxdata\cdot s_{ori}\\ maxdata\cdot s_{ori}, & s_{new}>maxdata\cdot s_{ori}\end{cases}$$
the mindata and the maxdata can be set according to experience, generally, the larger the maxdata-mindata value is, the more obvious the sound quality enhancement effect of the synthesized voice is, but the larger the maxdata-mindata value is, the more unstable the synthesis effect can be caused. The maxdata-mindata can take on values of 7-10, for example: 8. 9, 10, etc., in this case, it can not only ensure the stability of the synthesized effect, but also achieve a better enhancement effect on the sound quality of the synthesized voice.
If the preset maximum value and the preset minimum value are both set fixed values, the transformation range of the magnitude spectrum can be defined by the following formula. Suppose the value of the y-th dimension of S_new(z) is s_new, where 1 ≤ y ≤ Y. Then:

$$s_{new}=\begin{cases}mindata, & s_{new}<mindata\\ s_{new}, & mindata\le s_{new}\le maxdata\\ maxdata, & s_{new}>maxdata\end{cases}$$
Similarly, mindata and maxdata can be set empirically; in general, the larger the value of maxdata - mindata, the more pronounced the sound-quality enhancement of the synthesized speech, but too large a value may make the synthesis unstable. Here too, maxdata - mindata may take a value of 7 to 10, such as 8, 9 or 10, which both keeps the synthesis stable and achieves a good enhancement of the sound quality of the synthesized speech.
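Both variants of the range limiting reduce to a clamp per dimension. In this sketch the concrete bounds are illustrative choices with maxdata - mindata = 8, not values from the patent:

```python
import numpy as np

def limit_range(s_new, s_ori=None, mindata=0.5, maxdata=8.5):
    """Clamp each dimension of the processed magnitude spectrum, either to
    fixed bounds (when s_ori is None) or to bounds proportional to the
    magnitude spectrum before processing."""
    if s_ori is None:
        return np.clip(s_new, mindata, maxdata)
    return np.clip(s_new, mindata * s_ori, maxdata * s_ori)
```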
As an alternative embodiment, to keep the synthesis effect stable it is further necessary to keep the spectrum energy consistent before and after the adaptive post-processing. That is, after the adaptive post-processing of the magnitude spectrum to obtain a processed magnitude spectrum, the method further includes: performing spectrum energy uniformization processing on the processed magnitude spectrum.
The spectrum energy can be kept consistent before and after the adaptive post-processing through the following formula:

$$S'_{new}(z)=S_{new}(z)\,\sqrt{\frac{\sum_{y=1}^{Y}s_{ori}^{2}}{\sum_{y=1}^{Y}s_{new}^{2}}}$$

wherein S'_new(z) represents the magnitude spectrum after the spectrum energy uniformization processing;
S_new(z) represents the magnitude spectrum before the spectrum energy uniformization processing;
and S_ori(z) represents the magnitude spectrum before the adaptive post-processing.
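Reading the spectrum energy as the sum of squared magnitudes over the Y dimensions, an assumption consistent with the formula above, the uniformization is a single scalar gain:

```python
import numpy as np

def equalize_energy(s_new, s_ori):
    """Rescale the post-processed magnitude spectrum so that its energy
    matches that of the magnitude spectrum before adaptive post-processing."""
    gain = np.sqrt(np.sum(s_ori ** 2) / np.sum(s_new ** 2))
    return gain * s_new
```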
In step S304, the magnitude spectrum may be converted into preset parameters of the spectrum model by the following method:
when the preset parameters of the spectrum model are line spectrum pair parameters, first take the logarithm with base e of the magnitude spectrum and obtain the cepstrum parameters c_0(v) through an inverse Fourier transform; then solve the generalized cepstrum parameters c_{-1}(v) according to the following recursion, where v represents the currently processed dimension:

$$c_{-1}(v)=c_{0}(v)-\sum_{k=1}^{v-1}\frac{k}{v}\,c_{0}(k)\,c_{-1}(v-k)$$
then perform gain normalization to obtain the linear predictive coding parameters, apply the z-transform to the linear predictive coding parameters, solve for their zeros on the unit circle, and the angular frequency values corresponding to those zeros are the line spectrum pair parameters.
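The last step, from linear predictive coding parameters to line spectrum pairs, can be sketched with the usual sum/difference polynomial construction and numerical root finding. The gain-normalization step preceding it is omitted here, and the example coefficients are an arbitrary stable toy predictor:

```python
import numpy as np

def lpc_to_lsf(a):
    """Given A(z) = 1 + a[1] z^-1 + ... + a[V] z^-V (a[0] == 1), return the
    line spectral frequencies: the angles of the unit-circle zeros of
    P(z) = A(z) + z^-(V+1) A(1/z) and Q(z) = A(z) - z^-(V+1) A(1/z)."""
    a = np.asarray(a, dtype=float)
    p = np.concatenate([a, [0.0]]) + np.concatenate([[0.0], a[::-1]])
    q = np.concatenate([a, [0.0]]) - np.concatenate([[0.0], a[::-1]])
    angles = np.concatenate([np.angle(np.roots(p)), np.angle(np.roots(q))])
    # Keep one angle per conjugate pair and drop the trivial roots at 0, pi.
    return np.sort(angles[(angles > 1e-9) & (angles < np.pi - 1e-9)])

print(lpc_to_lsf([1.0, -1.2, 0.9, -0.3, 0.1]))  # four ascending frequencies
```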
When the preset parameters of the spectrum model are Mel cepstrum parameters, first take the logarithm with base e of the magnitude spectrum, then obtain the cepstrum parameters through an inverse Fourier transform, assuming the format c_0(0), c_0(1), …, c_0(V); finally solve the Mel cepstrum by the recursive all-pass frequency transformation below, where the warping factor a is known (typically set to 0.42 when the spectrum model is derived from audio at a sampling rate of 16 kHz) and v represents the currently processed dimension; iterating i = -V, -V+1, …, 0 with the initial condition c^{(-V-1)}(v) = 0:

$$c^{(i)}(v)=\begin{cases}c_{0}(-i)+a\,c^{(i-1)}(0), & v=0\\(1-a^{2})\,c^{(i-1)}(0)+a\,c^{(i-1)}(1), & v=1\\c^{(i-1)}(v-1)+a\bigl(c^{(i-1)}(v)-c^{(i)}(v-1)\bigr), & 2\le v\le V\end{cases}$$

and the Mel cepstrum is c_a(v) = c^{(0)}(v).
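Reusing the freqt helper from the sketch after step S302, the whole inverse conversion can be summarized as follows; the order and warping factor are illustrative:

```python
import numpy as np

def magnitude_to_mcep(S, order=24, alpha=0.42):
    """Magnitude spectrum -> natural logarithm -> inverse Fourier transform
    to a one-sided cepstrum -> warp with freqt(+alpha) to a Mel cepstrum.
    freqt is the recursion defined in the earlier sketch."""
    n_fft = 2 * (len(S) - 1)
    c = np.fft.irfft(np.log(S), n_fft)[: n_fft // 2 + 1]
    c[1:-1] *= 2.0  # fold the symmetric cepstrum to its one-sided form
    return freqt(c, order, alpha)
```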
if the spectrum energy uniformization processing is not performed on the magnitude spectrum before, the magnitude spectrum after the adaptive post-processing is directly converted into preset parameters of a spectrum model in step S304; if the amplitude spectrum is subjected to the spectrum energy uniformization processing, in step S304, the amplitude spectrum subjected to the spectrum energy uniformization processing is converted into the preset parameters of the spectrum model.
In a specific implementation, after the processed spectrum model is obtained based on step S304, speech data may be synthesized through the speech model that includes the processed spectrum model. Referring to FIG. 4, speech data may be synthesized through the following steps:
step S401: obtaining a text to be synthesized for voice synthesis;
step S402: determining acoustic parameters of the text to be synthesized based on the voice model;
step S403: and synthesizing the voice data of the text to be synthesized through the acoustic parameters.
In step S401, the text to be synthesized is, for example, text input by a user, text corresponding to a prompt tone generated by the client device, or text of an electronic book; it may also be text in any other form, which the embodiments of the present invention neither enumerate exhaustively nor limit.
In step S402, the text to be synthesized may be first subjected to context analysis, so as to analyze context information of the text to be synthesized, and then model prediction is performed on the context through the speech model, so as to obtain corresponding acoustic parameters, where the acoustic parameters include: frequency spectrum, fundamental frequency information, duration, etc.
In step S403, the acoustic parameters determined in step S402 may be synthesized by a vocoder to obtain the corresponding speech data. After the speech data is synthesized, it may also be output in various ways, for example: outputting the speech data through the sound output device of the client device, or sending the speech data to another client device for that client device to output, and the like.

In a second aspect, based on the same inventive concept, an embodiment of the present invention provides a speech synthesis method, referring to FIG. 5, including:
step S501: obtaining a text to be synthesized for voice synthesis;
step S502: determining the spectrum parameters of the text to be synthesized based on a spectrum model in a voice model, wherein the spectrum model is a spectrum model subjected to adaptive post-processing, and the adaptive post-processing process comprises the following steps: converting the preset parameters of the frequency spectrum model into a magnitude spectrum, carrying out self-adaptive post-processing on the magnitude spectrum to obtain a processed magnitude spectrum, and converting the processed magnitude spectrum into the preset parameters of the frequency spectrum model;
step S503: and synthesizing the voice data of the text to be synthesized through the frequency spectrum parameters.
In step S501, the text to be synthesized has already been described above and is therefore not described again here.
In step S502, how the spectrum model subjected to adaptive post-processing is obtained has already been described in detail in the first aspect of the embodiments of the present invention and is not described again here. The adaptively post-processed spectrum model can be obtained in various ways; two of them are described below, although specific implementations are of course not limited to these two cases.
First, adaptive post-processing of the spectral model is performed locally at the client device.
Second, the spectrum model after adaptive post-processing is received from the server.
In step S503, other parameters of the speech data may also be obtained through other models included in the speech model, for example: obtaining a fundamental frequency parameter of a text to be synthesized through a fundamental frequency model, obtaining a duration parameter of the text to be synthesized through a duration model and the like, and then synthesizing voice data of the text to be synthesized through acoustic parameters such as the fundamental frequency parameter, the duration parameter, the frequency spectrum parameter and the like.
For how to synthesize the speech data of the text to be synthesized by the acoustic parameters, the description is omitted here because it has already been described above.
As can be seen from the above analysis, in the embodiments of the present invention the spectrum model is processed as follows: first, the preset parameters of the spectrum model (for example, the static parameters of the mean part) are converted into a magnitude spectrum; then adaptive post-processing is performed on the magnitude spectrum, the range of the magnitude spectrum transformation is limited in order to stabilize the synthesis effect, and the magnitude spectrum energy is adjusted to match the energy before processing; finally, the processed magnitude spectrum is converted back into the preset parameters of the spectrum model, while the other parts of the spectrum model are kept unchanged.
In a third aspect, based on the same inventive concept, an embodiment of the present invention provides an acoustic model processing apparatus, please refer to fig. 6, including:
an obtaining module 60, configured to obtain preset parameters of a spectrum model in a speech model;
a first conversion module 61, configured to convert preset parameters of the spectrum model into a magnitude spectrum;
a first obtaining module 62, configured to perform adaptive post-processing on the magnitude spectrum to obtain a processed magnitude spectrum;
and a second conversion module 63, configured to convert the processed magnitude spectrum into preset parameters of the spectrum model, so as to obtain the processed spectrum model.
Optionally, the first conversion module 61 is configured to:
and converting the static parameters of the mean part of the frequency spectrum model into the amplitude spectrum.
Optionally, the first obtaining module 62 is configured to perform adaptive post-processing on the magnitude spectrum according to the following formula:
$$S_{new}(z)=S_{ori}(z)\,\frac{S_{ori}(z/\beta)}{S_{ori}(z/\alpha)}$$

wherein S_new(z) represents the processed magnitude spectrum;
S_ori(z) represents the magnitude spectrum before processing;
S_ori(z/β) represents the magnitude spectrum obtained by scaling S_ori(z) by a factor of β in the z-plane;
and S_ori(z/α) represents the magnitude spectrum obtained by scaling S_ori(z) by a factor of α in the z-plane.
Optionally, the first obtaining module 62 includes:
the judgment unit is used for judging whether the processed amplitude spectrum obtained by calculation is located in the range of a preset maximum value and a preset minimum value or not according to each processed amplitude spectrum;
the first determining unit is used for taking the preset minimum value as the processed amplitude spectrum when the processed amplitude spectrum is smaller than the preset minimum value;
and the second determining unit is used for taking the preset maximum value as the processed amplitude spectrum when the processed amplitude spectrum is larger than the preset maximum value.
Optionally, the apparatus further comprises:
the first processing module is used for carrying out frequency spectrum energy uniformization processing on the processed magnitude spectrum;
the second conversion module is used for converting the magnitude spectrum subjected to the spectrum energy uniformization processing into preset parameters of the spectrum model.
Optionally, the apparatus further comprises:
the second obtaining module is used for obtaining a text to be synthesized for voice synthesis;
the first determining module is used for determining the acoustic parameters of the text to be synthesized based on the voice model;
and the first synthesis module is used for synthesizing the voice data of the text to be synthesized through the acoustic parameters.
Since the acoustic model processing apparatus described in the third aspect of the embodiments of the present invention is a device used to implement the acoustic model processing method described in the first aspect, a person skilled in the art can understand its specific structure and modifications based on that method, and they are therefore not described again here; every device used to implement the acoustic model processing method described in the first aspect falls within the scope the embodiments of the present invention intend to protect.
In a fourth aspect, based on the same inventive concept, an embodiment of the present invention provides a speech synthesis apparatus, please refer to fig. 7, including:
a third obtaining module 70, configured to obtain a text to be synthesized for speech synthesis;
a second determining module 71, configured to determine a spectrum parameter of the text to be synthesized based on a spectrum model in a speech model, where the spectrum model is a spectrum model that is subjected to adaptive post-processing, and the adaptive post-processing includes the following steps: converting the preset parameters of the frequency spectrum model into a magnitude spectrum, carrying out self-adaptive post-processing on the magnitude spectrum to obtain a processed magnitude spectrum, and converting the processed magnitude spectrum into the preset parameters of the frequency spectrum model;
and a second synthesis module 72, configured to synthesize the speech data of the text to be synthesized through the spectrum parameters.
Optionally, the apparatus further comprises:
the second processing module is used for performing self-adaptive post-processing on the frequency spectrum model locally on the client equipment; and/or
A receiving module, configured to receive the spectrum model subjected to adaptive post-processing from the server.
Since the speech synthesis apparatus described in the fourth aspect of the embodiments of the present invention is a device used to implement the speech synthesis method described in the second aspect, a person skilled in the art can understand its specific structure and modifications based on that method, and they are therefore not described again here; every device used to implement the speech synthesis method described in the second aspect falls within the scope the embodiments of the present invention intend to protect.
In a fifth aspect, based on the same inventive concept, an embodiment of the present invention provides a processing device, which may be an electronic device or a server, including a memory and one or more programs, where the one or more programs are stored in the memory, and configured to be executed by one or more processors, where the one or more programs include instructions for:
acquiring preset parameters of a spectrum model in a voice model;
converting the preset parameters of the frequency spectrum model into a magnitude spectrum;
carrying out self-adaptive post-processing on the amplitude spectrum to obtain a processed amplitude spectrum;
and converting the processed magnitude spectrum into preset parameters of the frequency spectrum model, and further obtaining the processed frequency spectrum model.
Since the electronic device introduced in the fifth aspect of the embodiments of the present invention is a device used to implement the acoustic model processing method introduced in the first aspect, a person skilled in the art can understand its specific structure and modifications based on that method, and they are therefore not described again here; every electronic device used to implement the acoustic model processing method introduced in the first aspect falls within the scope the embodiments of the present invention intend to protect.
In a sixth aspect, based on the same inventive concept, an embodiment of the present invention provides a processing device, which may be an electronic device or a server, including a memory and one or more programs, where the one or more programs are stored in the memory, and configured to be executed by one or more processors, where the one or more programs include instructions for:
obtaining a text to be synthesized for voice synthesis;
determining the spectrum parameters of the text to be synthesized based on a spectrum model in a voice model, wherein the spectrum model is a spectrum model subjected to adaptive post-processing, and the adaptive post-processing process comprises the following steps: converting the preset parameters of the frequency spectrum model into a magnitude spectrum, carrying out self-adaptive post-processing on the magnitude spectrum to obtain a processed magnitude spectrum, and converting the processed magnitude spectrum into the preset parameters of the frequency spectrum model;
and synthesizing the voice data of the text to be synthesized through the frequency spectrum parameters.
Since the electronic device introduced in the sixth aspect of the embodiments of the present invention is a device used to implement the speech synthesis method introduced in the second aspect, a person skilled in the art can understand its specific structure and modifications based on that method, and they are therefore not described again here; every electronic device used to implement the speech synthesis method introduced in the second aspect falls within the scope the embodiments of the present invention intend to protect.
FIG. 8 is a block diagram illustrating an electronic device 800 implementing an acoustic model processing method (or speech synthesis method) in accordance with an example embodiment. For example, the electronic device 800 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.
Referring to fig. 8, electronic device 800 may include one or more of the following components: processing component 802, memory 804, power component 806, multimedia component 808, audio component 810, input/output (I/O) interface 812, sensor component 814, and communication component 816.
The processing component 802 generally controls overall operation of the electronic device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing elements 802 may include one or more processors 820 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 802 can include one or more modules that facilitate interaction between the processing component 802 and other components. For example, the processing component 802 can include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operation at the device 800. Examples of such data include instructions for any application or method operating on the electronic device 800, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 804 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
The power supply component 806 provides power to the various components of the electronic device 800. The power components 806 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the electronic device 800.
The multimedia component 808 includes a screen that provides an output interface between the electronic device 800 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 808 includes a front facing camera and/or a rear facing camera. The front camera and/or the rear camera may receive external multimedia data when the electronic device 800 is in an operation mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a Microphone (MIC) configured to receive external audio signals when the electronic device 800 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 804 or transmitted via the communication component 816. In some embodiments, audio component 810 also includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor assembly 814 includes one or more sensors for providing status assessments of various aspects of the electronic device 800. For example, the sensor assembly 814 may detect an open/closed state of the device 800 and the relative positioning of components (such as the display and keypad of the electronic device 800); it may also detect a change in the position of the electronic device 800 or of a component of the electronic device 800, the presence or absence of user contact with the electronic device 800, the orientation or acceleration/deceleration of the electronic device 800, and a change in the temperature of the electronic device 800. The sensor assembly 814 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate wired or wireless communication between the electronic device 800 and other devices. The electronic device 800 may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast associated information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communications component 816 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the electronic device 800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.
In an exemplary embodiment, a non-transitory computer-readable storage medium comprising instructions, such as the memory 804 comprising instructions, executable by the processor 820 of the electronic device 800 to perform the above-described method is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
A non-transitory computer readable storage medium in which instructions, when executed by a processor of an electronic device, enable the electronic device to perform a method of acoustic model processing, the method comprising:
acquiring preset parameters of a spectrum model in a voice model;
converting the preset parameters of the frequency spectrum model into a magnitude spectrum;
carrying out self-adaptive post-processing on the amplitude spectrum to obtain a processed amplitude spectrum;
and converting the processed magnitude spectrum into preset parameters of the frequency spectrum model, and further obtaining the processed frequency spectrum model.
A non-transitory computer readable storage medium in which instructions, when executed by a processor of an electronic device, enable the electronic device to perform a method of speech synthesis, the method comprising:
obtaining a text to be synthesized for voice synthesis;
determining the spectrum parameters of the text to be synthesized based on a spectrum model in a voice model, wherein the spectrum model is a spectrum model subjected to adaptive post-processing, and the adaptive post-processing process comprises the following steps: converting the preset parameters of the frequency spectrum model into a magnitude spectrum, carrying out self-adaptive post-processing on the magnitude spectrum to obtain a processed magnitude spectrum, and converting the processed magnitude spectrum into the preset parameters of the frequency spectrum model;
and synthesizing the voice data of the text to be synthesized through the frequency spectrum parameters.
Fig. 9 is a schematic structural diagram of a server in an embodiment of the present invention. The server 1900 may vary widely by configuration or performance and may include one or more Central Processing Units (CPUs) 1922 (e.g., one or more processors) and memory 1932, one or more storage media 1930 (e.g., one or more mass storage devices) storing applications 1942 or data 1944. Memory 1932 and storage medium 1930 can be, among other things, transient or persistent storage. The program stored in the storage medium 1930 may include one or more modules (not shown), each of which may include a series of instructions operating on a server. Still further, a central processor 1922 may be provided in communication with the storage medium 1930 to execute a series of instruction operations in the storage medium 1930 on the server 1900.
The server 1900 may also include one or more power supplies 1926, one or more wired or wireless network interfaces 1950, one or more input-output interfaces 1958, one or more keyboards 1956, and/or one or more operating systems 1941, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and the like.
A non-transitory computer readable storage medium in which instructions, when executed by a central processor of a server, enable the server to perform a method of acoustic model processing, the method comprising:
acquiring preset parameters of a spectrum model in a voice model;
converting the preset parameters of the frequency spectrum model into a magnitude spectrum;
carrying out self-adaptive post-processing on the amplitude spectrum to obtain a processed amplitude spectrum;
and converting the processed magnitude spectrum into preset parameters of the frequency spectrum model, and further obtaining the processed frequency spectrum model.
A non-transitory computer readable storage medium in which instructions, when executed by a central processor of a server, enable the server to perform a method of speech synthesis, the method comprising:
obtaining a text to be synthesized for voice synthesis;
determining the spectrum parameters of the text to be synthesized based on a spectrum model in a voice model, wherein the spectrum model is a spectrum model subjected to adaptive post-processing, and the adaptive post-processing process comprises the following steps: converting the preset parameters of the frequency spectrum model into a magnitude spectrum, carrying out self-adaptive post-processing on the magnitude spectrum to obtain a processed magnitude spectrum, and converting the processed magnitude spectrum into the preset parameters of the frequency spectrum model;
and synthesizing the voice data of the text to be synthesized through the frequency spectrum parameters.
One or more embodiments of the invention have at least the following beneficial effects:
because in the embodiment of the present invention, the processing is performed on the speech model in the following manner: acquiring preset parameters of a spectrum model in a voice model; then, converting the preset parameters of the frequency spectrum model into a magnitude spectrum; carrying out self-adaptive post-processing on the amplitude spectrum to obtain a processed amplitude spectrum; the processed magnitude spectrum is converted into preset parameters of the frequency spectrum model, and then the processed frequency spectrum model is obtained, and self-adaptive post-processing is carried out aiming at the preset parameters in the frequency spectrum model, so that an expected signal in the frequency spectrum model is enhanced and interference signals are reduced, and the quality of synthesized voice can be improved when voice data are generated through the voice model subsequently;
in addition, the object of the adaptive post-processing in this scheme is the magnitude spectrum. Because the magnitude spectrum is a universal spectral representation into which various spectral parameters (such as line spectrum pairs and mel-cepstra) can be converted, the scheme applies to any spectral parameterization and does not require a different adaptive post-processing method for each parameter type, giving it strong compatibility for adaptive post-processing of spectral parameters;
moreover, the scheme performs adaptive post-processing on the spectrum model in the voice model in advance, so no adaptive post-processing is needed after acoustic parameters are generated at synthesis time, which reduces the time required to synthesize voice data with the voice model.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (18)

1. An acoustic model processing method, comprising:
acquiring preset parameters of a spectrum model in a voice model;
converting the preset parameters of the spectrum model into a magnitude spectrum;
performing adaptive post-processing on the magnitude spectrum to obtain a processed magnitude spectrum, including: performing adaptive post-processing on the magnitude spectrum by the following formula:

$$\hat{S}(\omega) = S(\omega)\cdot\frac{S_{\beta}(\omega)}{S_{\alpha}(\omega)}$$

wherein $\hat{S}(\omega)$ represents the processed magnitude spectrum; $S(\omega)$ represents the magnitude spectrum before processing; $S_{\alpha}(\omega)$ represents the magnitude spectrum obtained by scale-transforming $z$ in the z-plane to $\alpha$ times its original value; and $S_{\beta}(\omega)$ represents the magnitude spectrum obtained by scale-transforming $z$ in the z-plane to $\beta$ times its original value;

and converting the processed magnitude spectrum into preset parameters of the spectrum model, thereby obtaining the processed spectrum model.
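Outside the claim language, the z-plane scale transformation has a convenient cepstral-domain reading: if $\log S(z) = \sum_n c[n] z^{-n}$, then substituting $z \to z/\gamma$ weights the cepstrum as $c[n] \to \gamma^{n} c[n]$. The Python sketch below implements the ratio form reconstructed above under that reading; the direction of the ratio and the values $\alpha = 0.5$ and $\beta = 0.8$ are assumptions borrowed from classical formant-enhancing postfilters, not values disclosed in the patent.

```python
import numpy as np

def scaled_magnitude(cep, gamma, fft_len=512):
    """Magnitude spectrum after the z-plane scale transformation z -> z/gamma,
    realized as exponential cepstrum weighting c[n] -> gamma**n * c[n]."""
    weighted = cep * gamma ** np.arange(len(cep))
    padded = np.zeros(fft_len)
    padded[:len(cep)] = weighted
    return np.exp(np.fft.rfft(padded).real)

def adaptive_postfilter(cep, alpha=0.5, beta=0.8, fft_len=512):
    """One concrete reading of the claimed formula: S_hat = S * S_beta / S_alpha."""
    s = scaled_magnitude(cep, 1.0, fft_len)         # unmodified spectrum (gamma = 1)
    s_alpha = scaled_magnitude(cep, alpha, fft_len)
    s_beta = scaled_magnitude(cep, beta, fft_len)
    return s * s_beta / s_alpha                     # formant-enhancing ratio
```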
2. The method of claim 1, wherein converting the preset parameters of the spectrum model into a magnitude spectrum comprises:
converting static parameters of a mean part of the spectrum model into the magnitude spectrum.
3. The method of claim 1, wherein performing adaptive post-processing on the magnitude spectrum to obtain a processed magnitude spectrum further comprises:
judging, for each processed magnitude spectrum, whether the calculated processed magnitude spectrum lies within a range bounded by a preset maximum value and a preset minimum value;
when the processed magnitude spectrum is smaller than the preset minimum value, taking the preset minimum value as the processed magnitude spectrum;
and when the processed magnitude spectrum is larger than the preset maximum value, taking the preset maximum value as the processed magnitude spectrum.
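Claim 3's range check maps directly onto a clamp; a one-line sketch, with illustrative bounds (the patent does not disclose the preset values):

```python
import numpy as np

def clamp_magnitude(mag, mag_min=1e-4, mag_max=1e4):
    """Replace out-of-range processed magnitudes with the preset min/max."""
    return np.clip(mag, mag_min, mag_max)
```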
4. The method of claim 1, wherein after performing adaptive post-processing on the magnitude spectrum to obtain a processed magnitude spectrum, the method further comprises:
performing spectral energy uniformization processing on the processed magnitude spectrum;
wherein converting the processed magnitude spectrum into preset parameters of the spectrum model comprises:
converting the magnitude spectrum subjected to the spectral energy uniformization processing into preset parameters of the spectrum model.
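One plausible reading of the spectral energy uniformization in claim 4 is rescaling the processed spectrum so that its total energy matches the pre-processing spectrum; a sketch under that assumption:

```python
import numpy as np

def equalize_energy(mag_processed, mag_original):
    """Rescale so the processed magnitude spectrum carries the same total
    energy as the original (an assumed definition of 'uniformization')."""
    gain = np.sqrt(np.sum(mag_original ** 2) / np.sum(mag_processed ** 2))
    return mag_processed * gain
```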
5. The method of any of claims 1-4, wherein the method further comprises:
obtaining a text to be synthesized for voice synthesis;
determining acoustic parameters of the text to be synthesized based on the voice model;
and synthesizing the voice data of the text to be synthesized through the acoustic parameters.
6. A method of speech synthesis, comprising:
obtaining a text to be synthesized for voice synthesis;
determining spectrum parameters of the text to be synthesized based on a spectrum model in a voice model, wherein the spectrum model is a spectrum model subjected to adaptive post-processing, and the adaptive post-processing comprises: converting preset parameters of the spectrum model into a magnitude spectrum, and performing adaptive post-processing on the magnitude spectrum to obtain a processed magnitude spectrum, including: performing adaptive post-processing on the magnitude spectrum by the following formula:

$$\hat{S}(\omega) = S(\omega)\cdot\frac{S_{\beta}(\omega)}{S_{\alpha}(\omega)}$$

wherein $\hat{S}(\omega)$ represents the processed magnitude spectrum; $S(\omega)$ represents the magnitude spectrum before processing; $S_{\alpha}(\omega)$ represents the magnitude spectrum obtained by scale-transforming $z$ in the z-plane to $\alpha$ times its original value; and $S_{\beta}(\omega)$ represents the magnitude spectrum obtained by scale-transforming $z$ in the z-plane to $\beta$ times its original value;

and synthesizing voice data of the text to be synthesized using the spectrum parameters.
7. The method of claim 6, wherein before the determining spectral parameters of the text to be synthesized based on a spectral model in the speech model, the method further comprises:
performing adaptive post-processing of the spectrum model locally at the client device; and/or
receiving the spectrum model after adaptive post-processing from a server.
8. An acoustic model processing apparatus, comprising:
the acquisition module is used for acquiring preset parameters of a spectrum model in the voice model;
the first conversion module is used for converting the preset parameters of the spectrum model into a magnitude spectrum;
a first obtaining module, configured to perform adaptive post-processing on the magnitude spectrum to obtain a processed magnitude spectrum, including: performing adaptive post-processing on the magnitude spectrum by the following formula:

$$\hat{S}(\omega) = S(\omega)\cdot\frac{S_{\beta}(\omega)}{S_{\alpha}(\omega)}$$

wherein $\hat{S}(\omega)$ represents the processed magnitude spectrum; $S(\omega)$ represents the magnitude spectrum before processing; $S_{\alpha}(\omega)$ represents the magnitude spectrum obtained by scale-transforming $z$ in the z-plane to $\alpha$ times its original value; and $S_{\beta}(\omega)$ represents the magnitude spectrum obtained by scale-transforming $z$ in the z-plane to $\beta$ times its original value;

and a second conversion module, configured to convert the processed magnitude spectrum into preset parameters of the spectrum model, thereby obtaining the processed spectrum model.
9. The apparatus of claim 8, wherein the first conversion module is configured to:
convert static parameters of a mean part of the spectrum model into the magnitude spectrum.
10. The apparatus of claim 8, wherein the first obtaining module comprises:
a judging unit, configured to judge, for each processed magnitude spectrum, whether the calculated processed magnitude spectrum lies within a range bounded by a preset maximum value and a preset minimum value;
a first determining unit, configured to take the preset minimum value as the processed magnitude spectrum when the processed magnitude spectrum is smaller than the preset minimum value;
and a second determining unit, configured to take the preset maximum value as the processed magnitude spectrum when the processed magnitude spectrum is larger than the preset maximum value.
11. The apparatus of claim 8, wherein the apparatus further comprises:
a first processing module, configured to perform spectral energy uniformization processing on the processed magnitude spectrum;
wherein the second conversion module is configured to convert the magnitude spectrum subjected to the spectral energy uniformization processing into preset parameters of the spectrum model.
12. The apparatus of any of claims 8 to 11, further comprising:
the second obtaining module is used for obtaining a text to be synthesized for voice synthesis;
the first determining module is used for determining the acoustic parameters of the text to be synthesized based on the voice model;
and the first synthesis module is used for synthesizing the voice data of the text to be synthesized through the acoustic parameters.
13. A speech synthesis apparatus, comprising:
a third obtaining module, configured to obtain a text to be synthesized for speech synthesis;
a second determining module, configured to determine spectrum parameters of the text to be synthesized based on a spectrum model in a speech model, wherein the spectrum model is a spectrum model subjected to adaptive post-processing, and the adaptive post-processing comprises: converting preset parameters of the spectrum model into a magnitude spectrum, and performing adaptive post-processing on the magnitude spectrum to obtain a processed magnitude spectrum, including: performing adaptive post-processing on the magnitude spectrum by the following formula:

$$\hat{S}(\omega) = S(\omega)\cdot\frac{S_{\beta}(\omega)}{S_{\alpha}(\omega)}$$

wherein $\hat{S}(\omega)$ represents the processed magnitude spectrum; $S(\omega)$ represents the magnitude spectrum before processing; $S_{\alpha}(\omega)$ represents the magnitude spectrum obtained by scale-transforming $z$ in the z-plane to $\alpha$ times its original value; and $S_{\beta}(\omega)$ represents the magnitude spectrum obtained by scale-transforming $z$ in the z-plane to $\beta$ times its original value;

and a second synthesis module, configured to synthesize voice data of the text to be synthesized using the spectrum parameters.
14. The apparatus of claim 13, wherein the apparatus further comprises:
a second processing module, configured to perform adaptive post-processing on the spectrum model locally on the client device; and/or
a receiving module, configured to receive the spectrum model subjected to adaptive post-processing from a server.
15. A processing apparatus comprising a memory, one or more processors, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs including instructions for:
acquiring preset parameters of a spectrum model in a voice model;
converting the preset parameters of the spectrum model into a magnitude spectrum;
performing adaptive post-processing on the magnitude spectrum to obtain a processed magnitude spectrum, including: performing adaptive post-processing on the magnitude spectrum by the following formula:

$$\hat{S}(\omega) = S(\omega)\cdot\frac{S_{\beta}(\omega)}{S_{\alpha}(\omega)}$$

wherein $\hat{S}(\omega)$ represents the processed magnitude spectrum; $S(\omega)$ represents the magnitude spectrum before processing; $S_{\alpha}(\omega)$ represents the magnitude spectrum obtained by scale-transforming $z$ in the z-plane to $\alpha$ times its original value; and $S_{\beta}(\omega)$ represents the magnitude spectrum obtained by scale-transforming $z$ in the z-plane to $\beta$ times its original value;

and converting the processed magnitude spectrum into preset parameters of the spectrum model, thereby obtaining the processed spectrum model.
16. A processing apparatus comprising a memory, one or more processors, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs including instructions for:
obtaining a text to be synthesized for voice synthesis;
determining spectrum parameters of the text to be synthesized based on a spectrum model in a voice model, wherein the spectrum model is a spectrum model subjected to adaptive post-processing, and the adaptive post-processing comprises: converting preset parameters of the spectrum model into a magnitude spectrum, and performing adaptive post-processing on the magnitude spectrum to obtain a processed magnitude spectrum, including: performing adaptive post-processing on the magnitude spectrum by the following formula:

$$\hat{S}(\omega) = S(\omega)\cdot\frac{S_{\beta}(\omega)}{S_{\alpha}(\omega)}$$

wherein $\hat{S}(\omega)$ represents the processed magnitude spectrum; $S(\omega)$ represents the magnitude spectrum before processing; $S_{\alpha}(\omega)$ represents the magnitude spectrum obtained by scale-transforming $z$ in the z-plane to $\alpha$ times its original value; and $S_{\beta}(\omega)$ represents the magnitude spectrum obtained by scale-transforming $z$ in the z-plane to $\beta$ times its original value;

and synthesizing voice data of the text to be synthesized using the spectrum parameters.
17. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, is adapted to carry out the method steps of any of claims 1 to 5.
18. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, is adapted to carry out the method steps of any of claims 6 to 7.
CN201610353978.5A 2016-05-25 2016-05-25 Acoustic model processing method, voice synthesis method, device and related equipment Active CN107437412B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610353978.5A CN107437412B (en) 2016-05-25 2016-05-25 Acoustic model processing method, voice synthesis method, device and related equipment


Publications (2)

Publication Number Publication Date
CN107437412A (en) 2017-12-05
CN107437412B (en) 2021-06-29

Family

ID=60452931

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610353978.5A Active CN107437412B (en) 2016-05-25 2016-05-25 Acoustic model processing method, voice synthesis method, device and related equipment

Country Status (1)

Country Link
CN (1) CN107437412B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110580910B (en) * 2018-06-08 2024-04-26 北京搜狗科技发展有限公司 Audio processing method, device, equipment and readable storage medium
CN110930977B (en) * 2019-11-12 2022-07-08 北京搜狗科技发展有限公司 Data processing method and device and electronic equipment
CN110931045A (en) * 2019-12-20 2020-03-27 重庆大学 Audio feature generation method based on convolutional neural network
CN115798455B (en) * 2023-02-07 2023-06-02 深圳元象信息科技有限公司 Speech synthesis method, system, electronic device and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102610236A (en) * 2012-02-29 2012-07-25 山东大学 Method for improving voice quality of throat microphone
CN102938254A (en) * 2012-10-24 2013-02-20 中国科学技术大学 Voice signal enhancement system and method
CN104318927A (en) * 2014-11-04 2015-01-28 东莞市北斗时空通信科技有限公司 Anti-noise low-bitrate speech coding method and decoding method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103065631B (en) * 2013-01-24 2015-07-29 华为终端有限公司 A kind of method of speech recognition, device


Also Published As

Publication number Publication date
CN107437412A (en) 2017-12-05

Similar Documents

Publication Publication Date Title
CN109801644B (en) Separation method, separation device, electronic equipment and readable medium for mixed sound signal
CN107705783B (en) Voice synthesis method and device
CN110136692B (en) Speech synthesis method, apparatus, device and storage medium
CN111583944A (en) Sound changing method and device
CN111508511A (en) Real-time sound changing method and device
CN110097890B (en) Voice processing method and device for voice processing
CN107871494B (en) Voice synthesis method and device and electronic equipment
CN107437412B (en) Acoustic model processing method, voice synthesis method, device and related equipment
CN109410973B (en) Sound changing processing method, device and computer readable storage medium
CN111326138A (en) Voice generation method and device
CN110931028B (en) Voice processing method and device and electronic equipment
CN115273831A (en) Voice conversion model training method, voice conversion method and device
CN110610720B (en) Data processing method and device and data processing device
CN111104807A (en) Data processing method and device and electronic equipment
CN115039169A (en) Voice instruction recognition method, electronic device and non-transitory computer readable storage medium
CN110580910B (en) Audio processing method, device, equipment and readable storage medium
CN112951202B (en) Speech synthesis method, apparatus, electronic device and program product
CN113345452B (en) Voice conversion method, training method, device and medium of voice conversion model
CN113409765B (en) Speech synthesis method and device for speech synthesis
CN111667842B (en) Audio signal processing method and device
CN110428828B (en) Voice recognition method and device for voice recognition
CN109102810B (en) Voiceprint recognition method and device
CN108345590B (en) Translation method, translation device, electronic equipment and storage medium
CN112818841A (en) Method and related device for recognizing user emotion
CN111063365B (en) Voice processing method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant