CN107437412B - Acoustic model processing method, voice synthesis method, device and related equipment - Google Patents

Acoustic model processing method, voice synthesis method, device and related equipment

Info

Publication number: CN107437412B (grant of application publication CN107437412A)
Application number: CN201610353978.5A
Authority: CN (China)
Prior art keywords: spectrum, model, processing, processed, magnitude
Legal status: Active (granted)
Other languages: Chinese (zh)
Inventors: 宋阳 (Song Yang), 陈伟 (Chen Wei)
Original and current assignee: Beijing Sogou Technology Development Co Ltd
Filed by: Beijing Sogou Technology Development Co Ltd (application CN201610353978.5A)


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00: Speech synthesis; Text to speech systems
    • G10L 13/02: Methods for producing synthetic speech; Speech synthesisers
    • G10L 13/04: Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L 13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03: Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L 25/18: Speech or voice analysis techniques in which the extracted parameters are spectral information of each sub-band
    • G10L 25/21: Speech or voice analysis techniques in which the extracted parameters are power information

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephonic Communication Services (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention belongs to the field of data processing and discloses an acoustic model processing method, a speech synthesis method, corresponding apparatuses and related equipment, which are used to solve the technical problem of poor sound quality of synthesized speech in the prior art. The method comprises the following steps: acquiring preset parameters of a spectrum model in a speech model; converting the preset parameters of the spectrum model into a magnitude spectrum; performing adaptive post-processing on the magnitude spectrum to obtain a processed magnitude spectrum; and converting the processed magnitude spectrum back into preset parameters of the spectrum model, thereby obtaining the processed spectrum model. The technical effect of improving speech synthesis quality is achieved.

Description

Acoustic model processing method, voice synthesis method, device and related equipment
Technical Field
The present invention relates to the field of data processing, and in particular to an acoustic model processing method, a speech synthesis method, corresponding apparatuses and related equipment.
Background
Mainstream off-line speech synthesis systems are based on HMM (Hidden Markov Model) parametric speech synthesis. A speech model is first trained, and speech synthesis is then realized through that speech model. Referring to FIG. 1, establishing the speech model comprises the following steps:
step S101: obtaining a corpus;
step S102: extracting acoustic parameters from the corpora in the corpus;
step S103: performing context-dependent HMM-GMM modeling on the acoustic parameters and the corresponding prosodic texts in the corpus to obtain a speech model, wherein the modeled objects comprise the spectrum, the fundamental frequency and the duration.
After the speech model is built, referring to FIG. 2, speech can be synthesized by the following steps:
step S201: acquiring a text to be synthesized;
step S202: analyzing the context information of the text to be synthesized;
step S203: performing model prediction on the context information through the speech model to obtain the corresponding acoustic parameters, wherein the acoustic parameters comprise the spectrum and fundamental frequency information;
step S204: the acoustic parameters are synthesized into speech by the vocoder.
Speech synthesized by this scheme suffers from poor sound quality, resulting in a poor user experience.
Disclosure of Invention
The invention provides an acoustic model processing method, a speech synthesis method, corresponding apparatuses and related equipment, aiming to solve the technical problem of poor sound quality of synthesized speech in the prior art.
In a first aspect, an embodiment of the present invention provides an acoustic model processing method, including:
acquiring preset parameters of a spectrum model in a voice model;
converting the preset parameters of the frequency spectrum model into a magnitude spectrum;
carrying out self-adaptive post-processing on the amplitude spectrum to obtain a processed amplitude spectrum;
and converting the processed magnitude spectrum into preset parameters of the frequency spectrum model, and further obtaining the processed frequency spectrum model.
Optionally, the converting the preset parameter of the spectrum model into a magnitude spectrum includes:
converting the static parameters of the mean part of the spectrum model into the magnitude spectrum.
Optionally, the performing adaptive post-processing on the magnitude spectrum to obtain a processed magnitude spectrum includes performing adaptive post-processing on the magnitude spectrum by the following formula:
$$S_{new}(z)=S_{ori}(z)\,\frac{S_{ori}(z/\beta)}{S_{ori}(z/\alpha)}$$

wherein S_new(z) represents the processed magnitude spectrum;
S_ori(z) represents the magnitude spectrum before processing;
S_ori(z/β) represents the magnitude spectrum obtained by scaling S_ori(z) by a factor of β in the z-plane;
and S_ori(z/α) represents the magnitude spectrum obtained by scaling S_ori(z) by a factor of α in the z-plane.
Optionally, the performing adaptive post-processing on the magnitude spectrum to obtain a processed magnitude spectrum further includes:
judging, for each processed amplitude spectrum, whether the calculated processed amplitude spectrum lies within the range between a preset maximum value and a preset minimum value;
when the processed amplitude spectrum is smaller than the preset minimum value, taking the preset minimum value as the processed amplitude spectrum;
and when the processed amplitude spectrum is larger than the preset maximum value, taking the preset maximum value as the processed amplitude spectrum.
Optionally, after the performing adaptive post-processing on the magnitude spectrum to obtain a processed magnitude spectrum, the method further includes:
performing spectrum energy uniformization processing on the processed magnitude spectrum;
and the converting the processed magnitude spectrum into preset parameters of the spectrum model includes:
converting the magnitude spectrum subjected to the spectrum energy uniformization processing into preset parameters of the spectrum model.
Optionally, the method further includes:
obtaining a text to be synthesized for voice synthesis;
determining acoustic parameters of the text to be synthesized based on the voice model;
and synthesizing the voice data of the text to be synthesized through the acoustic parameters.
In a second aspect, an embodiment of the present invention provides a speech synthesis method, including:
obtaining a text to be synthesized for voice synthesis;
determining the spectrum parameters of the text to be synthesized based on a spectrum model in a voice model, wherein the spectrum model is a spectrum model subjected to adaptive post-processing, and the adaptive post-processing process comprises the following steps: converting the preset parameters of the frequency spectrum model into a magnitude spectrum, carrying out self-adaptive post-processing on the magnitude spectrum to obtain a processed magnitude spectrum, and converting the processed magnitude spectrum into the preset parameters of the frequency spectrum model;
and synthesizing the voice data of the text to be synthesized through the frequency spectrum parameters.
Optionally, before the determining the spectral parameters of the text to be synthesized based on the spectral model in the speech model, the method further includes:
performing adaptive post-processing of the spectrum model locally at the client device; and/or
Receiving the spectrum model after adaptive post-processing from the server.
In a third aspect, an embodiment of the present invention provides an acoustic model processing apparatus, including:
the acquisition module is used for acquiring preset parameters of a spectrum model in the voice model;
the first conversion module is used for converting the preset parameters of the frequency spectrum model into a magnitude spectrum;
the first obtaining module is used for carrying out self-adaptive post-processing on the amplitude spectrum to obtain a processed amplitude spectrum;
and the second conversion module is used for converting the processed magnitude spectrum into preset parameters of the frequency spectrum model so as to obtain the processed frequency spectrum model.
In a fourth aspect, an embodiment of the present invention provides a speech synthesis apparatus, including:
a third obtaining module, configured to obtain a text to be synthesized for speech synthesis;
a second determining module, configured to determine a spectrum parameter of the text to be synthesized based on a spectrum model in a speech model, where the spectrum model is a spectrum model that is subjected to adaptive post-processing, and the adaptive post-processing includes the following steps: converting the preset parameters of the frequency spectrum model into a magnitude spectrum, carrying out self-adaptive post-processing on the magnitude spectrum to obtain a processed magnitude spectrum, and converting the processed magnitude spectrum into the preset parameters of the frequency spectrum model;
and the second synthesis module is used for synthesizing the voice data of the text to be synthesized through the frequency spectrum parameters.
In a fifth aspect, embodiments of the present invention provide a processing apparatus comprising a memory, and one or more programs, wherein the one or more programs are stored in the memory, and configured for execution by the one or more processors, the one or more programs including instructions for:
acquiring preset parameters of a spectrum model in a voice model;
converting the preset parameters of the frequency spectrum model into a magnitude spectrum;
carrying out self-adaptive post-processing on the amplitude spectrum to obtain a processed amplitude spectrum;
and converting the processed magnitude spectrum into preset parameters of the frequency spectrum model, and further obtaining the processed frequency spectrum model.
In a sixth aspect, embodiments of the present invention provide a processing apparatus comprising a memory, and one or more programs, wherein the one or more programs are stored in the memory and configured for execution by the one or more processors, the one or more programs including instructions for:
obtaining a text to be synthesized for voice synthesis;
determining the spectrum parameters of the text to be synthesized based on a spectrum model in a voice model, wherein the spectrum model is a spectrum model subjected to adaptive post-processing, and the adaptive post-processing process comprises the following steps: converting the preset parameters of the frequency spectrum model into a magnitude spectrum, carrying out self-adaptive post-processing on the magnitude spectrum to obtain a processed magnitude spectrum, and converting the processed magnitude spectrum into the preset parameters of the frequency spectrum model;
and synthesizing the voice data of the text to be synthesized through the frequency spectrum parameters.
The invention has the following beneficial effects:
in the embodiments of the present invention, the speech model is processed in the following manner: preset parameters of a spectrum model in the speech model are acquired; the preset parameters of the spectrum model are then converted into a magnitude spectrum; adaptive post-processing is performed on the magnitude spectrum to obtain a processed magnitude spectrum; and the processed magnitude spectrum is converted back into preset parameters of the spectrum model, thereby obtaining the processed spectrum model. Because the adaptive post-processing acts on the preset parameters of the spectrum model, the desired signal in the spectrum model is enhanced and interfering signals are reduced, so the quality of the synthesized speech can be improved when speech data is subsequently generated through the speech model;
in addition, the object of the adaptive post-processing in this scheme is the magnitude spectrum, which is a universal spectrum representation into which various spectrum parameters can be converted; the scheme therefore applies to any spectrum parameters, and different adaptive post-processing schemes are not required for different spectrum parameters (such as line spectrum pairs or Mel cepstra), so the scheme offers strong compatibility for the adaptive post-processing of spectrum parameters;
moreover, the scheme performs the adaptive post-processing on the spectrum model in the speech model in advance, so no adaptive post-processing is needed after the acoustic parameters are generated later, which reduces the time consumed in synthesizing speech data with the speech model.
Drawings
FIG. 1 is a flow chart of prior art modeling of speech;
FIG. 2 is a flow chart of synthesizing speech data in the prior art;
FIG. 3 is a flow chart of a method of acoustic model processing according to a first aspect of an embodiment of the present invention;
FIG. 4 is a flow chart of synthesizing speech data in the acoustic model processing method according to the first aspect of the embodiment of the present invention;
FIG. 5 is a flow chart of a speech synthesis method according to a second aspect of the embodiments of the present invention;
FIG. 6 is a block diagram of an acoustic model processing apparatus according to the third aspect of the embodiments of the present invention;
FIG. 7 is a block diagram of a speech synthesis apparatus according to a fourth aspect of the embodiment of the present invention;
FIG. 8 is a block diagram of an electronic device shown in accordance with an example embodiment;
fig. 9 is a schematic structural diagram of a server in an embodiment of the present invention.
Detailed Description
The invention provides an acoustic model processing method, a speech synthesis method, corresponding apparatuses and related equipment, aiming to solve the technical problem of poor sound quality of synthesized speech in the prior art.
In order to solve the technical problems, the general idea of the embodiment of the present application is as follows:
in order to better understand the technical solutions of the present invention, the following detailed descriptions of the technical solutions of the present invention are provided with the accompanying drawings and the specific embodiments, and it should be understood that the specific features in the embodiments and the examples of the present invention are the detailed descriptions of the technical solutions of the present invention, and are not limitations of the technical solutions of the present invention, and the technical features in the embodiments and the examples of the present invention may be combined with each other without conflict.
The speech model is processed in the following manner: preset parameters of a spectrum model in the speech model are acquired; the preset parameters of the spectrum model are then converted into a magnitude spectrum; adaptive post-processing is performed on the magnitude spectrum to obtain a processed magnitude spectrum; and the processed magnitude spectrum is converted back into preset parameters of the spectrum model, thereby obtaining the processed spectrum model. Because the adaptive post-processing acts on the preset parameters of the spectrum model, the desired signal in the spectrum model is enhanced and interfering signals are reduced, so the quality of the synthesized speech can be improved when speech data is subsequently generated through the speech model.

In addition, the object of the adaptive post-processing in this scheme is the magnitude spectrum, which is a universal spectrum representation into which various spectrum parameters can be converted; the scheme therefore applies to any spectrum parameters, and different adaptive post-processing schemes are not required for different spectrum parameters (such as line spectrum pairs or Mel cepstra), so the scheme offers strong compatibility for the adaptive post-processing of spectrum parameters.

Moreover, the scheme performs the adaptive post-processing on the spectrum model in the speech model in advance, so no adaptive post-processing is needed after the acoustic parameters are generated later, which reduces the time consumed in synthesizing speech data with the speech model.

In order to better understand the above technical solutions, they are described in detail below with reference to the accompanying drawings and specific embodiments. It should be understood that the specific features in the embodiments and examples of the present invention are detailed explanations of the technical solutions of the present invention rather than limitations on them, and the technical features in the embodiments and examples may be combined with each other without conflict.
In a first aspect, an embodiment of the present invention provides an acoustic model processing method, please refer to fig. 3, including:
step S301: acquiring preset parameters of a spectrum model in a voice model;
step S302: converting the preset parameters of the frequency spectrum model into a magnitude spectrum;
step S303: carrying out self-adaptive post-processing on the amplitude spectrum to obtain a processed amplitude spectrum;
step S304: and converting the processed magnitude spectrum into preset parameters of the frequency spectrum model, and further obtaining the processed frequency spectrum model.
For example, the solution may be applied to a server or to a client device; the embodiments of the present invention are not limited in this respect. The client device is, for example, a mobile phone, a notebook computer, a tablet computer, a PC or the like.
In step S301, the speech model includes, for example, a spectrum model, a fundamental frequency model, a duration model and the like. The spectrum model generally includes a probability density part and a decision tree part, wherein the probability density part comprises a mean and a variance, and the mean and the variance each comprise static parameters and dynamic parameters. The preset parameters of the spectrum model are, for example, the static parameters of the mean part; they may also include the dynamic parameters, and the embodiments of the present invention are not limited.
In step S302, the preset parameters of the spectrum model may be converted into a magnitude spectrum by:
when the preset parameters of the spectrum model are line spectrum pair parameters, assume the format is K, l(1), l(2), …, l(V). When V is an even number, the magnitude spectrum S(ω) is:

$$S(\omega)=\frac{K}{\sqrt{2^{V}\left[\cos^{2}\frac{\omega}{2}\prod_{i=2,4,\ldots,V}\bigl(\cos\omega-\cos l(i)\bigr)^{2}+\sin^{2}\frac{\omega}{2}\prod_{i=1,3,\ldots,V-1}\bigl(\cos\omega-\cos l(i)\bigr)^{2}\right]}}$$

when V is an odd number, the magnitude spectrum S(ω) is:

$$S(\omega)=\frac{K}{\sqrt{2^{V-1}\left[\prod_{i=1,3,\ldots,V}\bigl(\cos\omega-\cos l(i)\bigr)^{2}+\sin^{2}\omega\prod_{i=2,4,\ldots,V-1}\bigl(\cos\omega-\cos l(i)\bigr)^{2}\right]}}$$
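As a worked illustration, the two formulas can be evaluated directly on a frequency grid. The following Python sketch is not part of the patent; the function name, the grid and the toy gain and line spectral frequencies are illustrative assumptions:

```python
import numpy as np

def lsp_to_magnitude(K, lsf, omega):
    """Evaluate S(omega) from the gain K and the line spectral frequencies
    lsf = [l(1), ..., l(V)] in radians, using the even/odd-order formulas."""
    V = len(lsf)
    cos_w = np.cos(omega)[:, None]
    odd = np.prod((cos_w - np.cos(lsf[0::2])) ** 2, axis=1)   # i = 1, 3, ...
    even = np.prod((cos_w - np.cos(lsf[1::2])) ** 2, axis=1)  # i = 2, 4, ...
    if V % 2 == 0:
        denom = 2.0 ** V * (np.cos(omega / 2) ** 2 * even
                            + np.sin(omega / 2) ** 2 * odd)
    else:
        denom = 2.0 ** (V - 1) * (odd + np.sin(omega) ** 2 * even)
    return K / np.sqrt(denom)

omega = np.linspace(1e-3, np.pi - 1e-3, 512)      # dense grid over (0, pi)
S = lsp_to_magnitude(1.0, np.array([0.3, 0.9, 1.7, 2.5]), omega)
```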
when the preset parameters of the spectrum model are Mel cepstrum parameters, assume the format is c_a(0), c_a(1), …, c_a(V), where the warping factor a is known and is typically set to 0.42 when the spectrum model is derived from audio at a sampling rate of 16 kHz. The cepstrum is first solved by the recursive all-pass frequency transformation below, applied with warping factor -a, where v represents the currently processed dimension; iterating i = -V, -V+1, …, 0 with the initial condition c^{(-V-1)}(v) = 0:

$$c^{(i)}(v)=\begin{cases}c_{a}(-i)-a\,c^{(i-1)}(0), & v=0\\(1-a^{2})\,c^{(i-1)}(0)-a\,c^{(i-1)}(1), & v=1\\c^{(i-1)}(v-1)-a\bigl(c^{(i-1)}(v)-c^{(i)}(v-1)\bigr), & 2\le v\le V\end{cases}$$

the unwarped cepstrum is then c_0(v) = c^{(0)}(v);
then, the magnitude spectrum is obtained through a Fourier transform followed by an exponential function with the natural constant e as its base.
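For concreteness, here is a minimal Python sketch of the whole conversion. freqt implements the recursive all-pass frequency transformation given above (the classic recursion also used by the SPTK freqt tool); the FFT size and output order are illustrative assumptions:

```python
import numpy as np

def freqt(c_in, order_out, alpha):
    """All-pass frequency warping of a cepstrum; alpha < 0 unwarps a
    Mel cepstrum back to a linear-frequency cepstrum."""
    c_out = np.zeros(order_out + 1)
    for i in range(len(c_in) - 1, -1, -1):   # feed c_in[V], ..., c_in[0]
        prev = c_out.copy()
        c_out[0] = c_in[i] + alpha * prev[0]
        if order_out >= 1:
            c_out[1] = (1.0 - alpha ** 2) * prev[0] + alpha * prev[1]
        for v in range(2, order_out + 1):
            c_out[v] = prev[v - 1] + alpha * (prev[v] - c_out[v - 1])
    return c_out

def mcep_to_magnitude(mcep, alpha=0.42, n_fft=1024):
    """Mel cepstrum -> linear cepstrum -> Fourier transform -> exp."""
    c = freqt(np.asarray(mcep, dtype=float), n_fft // 2, -alpha)
    log_mag = np.fft.rfft(c, n_fft).real    # log magnitude at n_fft/2+1 bins
    return np.exp(log_mag)
```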
In a specific implementation, the preset parameters of the spectrum model form an M × Y matrix, and each Y-dimensional row of the matrix is converted into a Y-dimensional magnitude spectrum according to the above scheme; in steps S302 to S304, one Y-dimensional magnitude spectrum is processed at a time, for a total of M passes, as sketched below.
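A row-wise driver loop matching this description might look as follows. This is a sketch, and process_row is a hypothetical callable standing for the chain of steps S302 to S304 applied to a single Y-dimensional vector:

```python
import numpy as np

def postprocess_mean_matrix(means, process_row):
    """means: an M x Y matrix of spectrum-model parameters. Each of the M
    rows is pushed through process_row (conversion to a magnitude spectrum,
    adaptive post-processing, conversion back), i.e. M passes in total."""
    return np.stack([process_row(row) for row in means])
```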
In step S303, the Y-dimensional amplitude spectrum may be subjected to adaptive post-processing by the following formula:
$$S_{new}(z)=S_{ori}(z)\,\frac{S_{ori}(z/\beta)}{S_{ori}(z/\alpha)}$$

wherein S_new(z) represents the processed magnitude spectrum;
S_ori(z) represents the magnitude spectrum before processing;
S_ori(z/β) represents the magnitude spectrum obtained by scaling S_ori(z) by a factor of β in the z-plane;
and S_ori(z/α) represents the magnitude spectrum obtained by scaling S_ori(z) by a factor of α in the z-plane.
Generally, α and β can be set empirically. The larger the value of β - α, the more pronounced the sound-quality enhancement of the synthesized speech; however, too large a value of β - α may make the synthesis unstable, for example, the synthesized speech may be distorted.
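For concreteness, the following minimal Python sketch applies the above post-processing to one Y-dimensional magnitude spectrum. It relies on the fact that replacing z by z/r multiplies the n-th coefficient of the cepstrum of log S by r^n, so the scaled spectra can be evaluated on the unit circle. The values α = 0.6 and β = 0.8, and the assumption that the spectrum is strictly positive and sampled at n_fft/2 + 1 points on [0, π], are illustrative and not taken from the patent:

```python
import numpy as np

def adaptive_postfilter(S_ori, alpha=0.6, beta=0.8):
    """Sketch of S_new(z) = S_ori(z) * S_ori(z/beta) / S_ori(z/alpha) for a
    strictly positive magnitude spectrum sampled on [0, pi]."""
    n_fft = 2 * (len(S_ori) - 1)
    # One-sided (minimum-phase) cepstrum of the log-magnitude spectrum.
    c = np.fft.irfft(np.log(S_ori), n_fft)[: n_fft // 2 + 1]
    c[1:-1] *= 2.0
    n = np.arange(len(c))

    def scaled(r):
        # Magnitude of S_ori(z/r) on the unit circle: c(n) -> c(n) * r**n.
        return np.exp(np.fft.rfft(c * r ** n, n_fft).real)

    return S_ori * scaled(beta) / scaled(alpha)
```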
In a specific implementation, after the adaptive post-processing is performed in the above manner, the range of the magnitude spectrum transformation may be limited in order to stabilize the synthesis effect. That is, the performing adaptive post-processing on the magnitude spectrum to obtain a processed magnitude spectrum further includes: judging, for each processed magnitude spectrum, whether the calculated processed magnitude spectrum lies within the range between a preset maximum value and a preset minimum value; when the processed magnitude spectrum is smaller than the preset minimum value, taking the preset minimum value as the processed magnitude spectrum; and when the processed magnitude spectrum is larger than the preset maximum value, taking the preset maximum value as the processed magnitude spectrum.
For example, the preset maximum value may be a set fixed value or a preset multiple of S_ori(z); likewise, the preset minimum value may be a set fixed value or a preset multiple of S_ori(z); the embodiments of the present invention are not limited.
If the preset maximum value and the preset minimum value are both preset multiples of S_ori(z), the transformation range of the magnitude spectrum can be defined by the following formula. Suppose the value of the y-th dimension of S_ori(z) is s_ori and the value of the y-th dimension of S_new(z) is s_new, where 1 ≤ y ≤ Y. Then:

$$s_{new}=\begin{cases}mindata\cdot s_{ori}, & s_{new}<mindata\cdot s_{ori}\\ s_{new}, & mindata\cdot s_{ori}\le s_{new}\le maxdata\cdot s_{ori}\\ maxdata\cdot s_{ori}, & s_{new}>maxdata\cdot s_{ori}\end{cases}$$
the mindata and the maxdata can be set according to experience, generally, the larger the maxdata-mindata value is, the more obvious the sound quality enhancement effect of the synthesized voice is, but the larger the maxdata-mindata value is, the more unstable the synthesis effect can be caused. The maxdata-mindata can take on values of 7-10, for example: 8. 9, 10, etc., in this case, it can not only ensure the stability of the synthesized effect, but also achieve a better enhancement effect on the sound quality of the synthesized voice.
If the preset maximum value and the preset minimum value are both set fixed values, the transformation range of the magnitude spectrum can be defined by the following formula. Suppose the value of the y-th dimension of S_new(z) is s_new, where 1 ≤ y ≤ Y. Then:

$$s_{new}=\begin{cases}mindata, & s_{new}<mindata\\ s_{new}, & mindata\le s_{new}\le maxdata\\ maxdata, & s_{new}>maxdata\end{cases}$$
Similarly, mindata and maxdata can be set empirically; in general, the larger the value of maxdata - mindata, the more pronounced the sound-quality enhancement of the synthesized speech, but too large a value may make the synthesis unstable. Here too, maxdata - mindata may take a value of 7 to 10, such as 8, 9 or 10, which both keeps the synthesis stable and achieves a good enhancement of the sound quality of the synthesized speech.
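Both variants of the range limiting reduce to a clamp per dimension. In this sketch the concrete bounds are illustrative choices with maxdata - mindata = 8, not values from the patent:

```python
import numpy as np

def limit_range(s_new, s_ori=None, mindata=0.5, maxdata=8.5):
    """Clamp each dimension of the processed magnitude spectrum, either to
    fixed bounds (when s_ori is None) or to bounds proportional to the
    magnitude spectrum before processing."""
    if s_ori is None:
        return np.clip(s_new, mindata, maxdata)
    return np.clip(s_new, mindata * s_ori, maxdata * s_ori)
```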
As an alternative embodiment, to keep the synthesis effect stable it is further necessary to keep the spectrum energy consistent before and after the adaptive post-processing. That is, after the adaptive post-processing of the magnitude spectrum to obtain a processed magnitude spectrum, the method further includes: performing spectrum energy uniformization processing on the processed magnitude spectrum.
The spectrum energy can be kept consistent before and after the adaptive post-processing through the following formula:

$$S'_{new}(z)=S_{new}(z)\,\sqrt{\frac{\sum_{y=1}^{Y}s_{ori}^{2}}{\sum_{y=1}^{Y}s_{new}^{2}}}$$

wherein S'_new(z) represents the magnitude spectrum after the spectrum energy uniformization processing;
S_new(z) represents the magnitude spectrum before the spectrum energy uniformization processing;
and S_ori(z) represents the magnitude spectrum before the adaptive post-processing.
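Reading the spectrum energy as the sum of squared magnitudes over the Y dimensions, an assumption consistent with the formula above, the uniformization is a single scalar gain:

```python
import numpy as np

def equalize_energy(s_new, s_ori):
    """Rescale the post-processed magnitude spectrum so that its energy
    matches that of the magnitude spectrum before adaptive post-processing."""
    gain = np.sqrt(np.sum(s_ori ** 2) / np.sum(s_new ** 2))
    return gain * s_new
```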
In step S304, the magnitude spectrum may be converted into preset parameters of the spectrum model by the following method:
when the preset parameters of the spectrum model are line spectrum pair parameters, first take the logarithm with base e of the magnitude spectrum and obtain the cepstrum parameters c_0(v) through an inverse Fourier transform; then solve the generalized cepstrum parameters c_{-1}(v) according to the following recursion, where v represents the currently processed dimension:

$$c_{-1}(v)=c_{0}(v)-\sum_{k=1}^{v-1}\frac{k}{v}\,c_{0}(k)\,c_{-1}(v-k)$$
then perform gain normalization to obtain the linear predictive coding parameters, apply the z-transform to the linear predictive coding parameters, solve for their zeros on the unit circle, and the angular frequency values corresponding to those zeros are the line spectrum pair parameters.
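The last step, from linear predictive coding parameters to line spectrum pairs, can be sketched with the usual sum/difference polynomial construction and numerical root finding. The gain-normalization step preceding it is omitted here, and the example coefficients are an arbitrary stable toy predictor:

```python
import numpy as np

def lpc_to_lsf(a):
    """Given A(z) = 1 + a[1] z^-1 + ... + a[V] z^-V (a[0] == 1), return the
    line spectral frequencies: the angles of the unit-circle zeros of
    P(z) = A(z) + z^-(V+1) A(1/z) and Q(z) = A(z) - z^-(V+1) A(1/z)."""
    a = np.asarray(a, dtype=float)
    p = np.concatenate([a, [0.0]]) + np.concatenate([[0.0], a[::-1]])
    q = np.concatenate([a, [0.0]]) - np.concatenate([[0.0], a[::-1]])
    angles = np.concatenate([np.angle(np.roots(p)), np.angle(np.roots(q))])
    # Keep one angle per conjugate pair and drop the trivial roots at 0, pi.
    return np.sort(angles[(angles > 1e-9) & (angles < np.pi - 1e-9)])

print(lpc_to_lsf([1.0, -1.2, 0.9, -0.3, 0.1]))  # four ascending frequencies
```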
When the preset parameters of the spectrum model are Mel cepstrum parameters, first take the logarithm with base e of the magnitude spectrum, then obtain the cepstrum parameters through an inverse Fourier transform, assuming the format c_0(0), c_0(1), …, c_0(V); finally solve the Mel cepstrum by the recursive all-pass frequency transformation below, where the warping factor a is known (typically set to 0.42 when the spectrum model is derived from audio at a sampling rate of 16 kHz) and v represents the currently processed dimension; iterating i = -V, -V+1, …, 0 with the initial condition c^{(-V-1)}(v) = 0:

$$c^{(i)}(v)=\begin{cases}c_{0}(-i)+a\,c^{(i-1)}(0), & v=0\\(1-a^{2})\,c^{(i-1)}(0)+a\,c^{(i-1)}(1), & v=1\\c^{(i-1)}(v-1)+a\bigl(c^{(i-1)}(v)-c^{(i)}(v-1)\bigr), & 2\le v\le V\end{cases}$$

and the Mel cepstrum is c_a(v) = c^{(0)}(v).
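Reusing the freqt helper from the sketch after step S302, the whole inverse conversion can be summarized as follows; the order and warping factor are illustrative:

```python
import numpy as np

def magnitude_to_mcep(S, order=24, alpha=0.42):
    """Magnitude spectrum -> natural logarithm -> inverse Fourier transform
    to a one-sided cepstrum -> warp with freqt(+alpha) to a Mel cepstrum.
    freqt is the recursion defined in the earlier sketch."""
    n_fft = 2 * (len(S) - 1)
    c = np.fft.irfft(np.log(S), n_fft)[: n_fft // 2 + 1]
    c[1:-1] *= 2.0  # fold the symmetric cepstrum to its one-sided form
    return freqt(c, order, alpha)
```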
if the spectrum energy uniformization processing is not performed on the magnitude spectrum before, the magnitude spectrum after the adaptive post-processing is directly converted into preset parameters of a spectrum model in step S304; if the amplitude spectrum is subjected to the spectrum energy uniformization processing, in step S304, the amplitude spectrum subjected to the spectrum energy uniformization processing is converted into the preset parameters of the spectrum model.
In a specific implementation, after the processed spectrum model is obtained based on step S304, speech data may be synthesized through the speech model that includes the processed spectrum model. Referring to FIG. 4, speech data may be synthesized through the following steps:
step S401: obtaining a text to be synthesized for voice synthesis;
step S402: determining acoustic parameters of the text to be synthesized based on the voice model;
step S403: and synthesizing the voice data of the text to be synthesized through the acoustic parameters.
In step S401, the text to be synthesized is, for example, text input by a user, text corresponding to a prompt tone generated by the client device, or text of an electronic book; it may also be text in any other form, which the embodiments of the present invention neither enumerate exhaustively nor limit.
In step S402, the text to be synthesized may be first subjected to context analysis, so as to analyze context information of the text to be synthesized, and then model prediction is performed on the context through the speech model, so as to obtain corresponding acoustic parameters, where the acoustic parameters include: frequency spectrum, fundamental frequency information, duration, etc.
In step S403, the acoustic parameters determined in step S402 may be synthesized by a vocoder to obtain the corresponding speech data. After the speech data is synthesized, it may also be output in various ways, for example: outputting the speech data through the sound output device of the client device, or sending the speech data to another client device for that client device to output, and the like.

In a second aspect, based on the same inventive concept, an embodiment of the present invention provides a speech synthesis method, referring to FIG. 5, including:
step S501: obtaining a text to be synthesized for voice synthesis;
step S502: determining the spectrum parameters of the text to be synthesized based on a spectrum model in a voice model, wherein the spectrum model is a spectrum model subjected to adaptive post-processing, and the adaptive post-processing process comprises the following steps: converting the preset parameters of the frequency spectrum model into a magnitude spectrum, carrying out self-adaptive post-processing on the magnitude spectrum to obtain a processed magnitude spectrum, and converting the processed magnitude spectrum into the preset parameters of the frequency spectrum model;
step S503: and synthesizing the voice data of the text to be synthesized through the frequency spectrum parameters.
In step S501, the text to be synthesized has already been described above and is therefore not described again here.
In step S502, how the spectrum model subjected to adaptive post-processing is obtained has already been described in detail in the first aspect of the embodiments of the present invention and is not described again here. The adaptively post-processed spectrum model can be obtained in various ways; two of them are described below, although specific implementations are of course not limited to these two cases.
First, adaptive post-processing of the spectral model is performed locally at the client device.
Second, the spectrum model after adaptive post-processing is received from the server.
In step S503, other parameters of the speech data may also be obtained through other models included in the speech model, for example: obtaining a fundamental frequency parameter of a text to be synthesized through a fundamental frequency model, obtaining a duration parameter of the text to be synthesized through a duration model and the like, and then synthesizing voice data of the text to be synthesized through acoustic parameters such as the fundamental frequency parameter, the duration parameter, the frequency spectrum parameter and the like.
For how to synthesize the speech data of the text to be synthesized by the acoustic parameters, the description is omitted here because it has already been described above.
As can be seen from the above analysis, in the embodiments of the present invention the spectrum model is processed as follows: first, the preset parameters of the spectrum model (for example, the static parameters of the mean part) are converted into a magnitude spectrum; then adaptive post-processing is performed on the magnitude spectrum, the range of the magnitude spectrum transformation is limited in order to stabilize the synthesis effect, and the magnitude spectrum energy is adjusted to match the energy before processing; finally, the processed magnitude spectrum is converted back into the preset parameters of the spectrum model, while the other parts of the spectrum model are kept unchanged.
In a third aspect, based on the same inventive concept, an embodiment of the present invention provides an acoustic model processing apparatus, please refer to fig. 6, including:
an obtaining module 60, configured to obtain preset parameters of a spectrum model in a speech model;
a first conversion module 61, configured to convert preset parameters of the spectrum model into a magnitude spectrum;
a first obtaining module 62, configured to perform adaptive post-processing on the magnitude spectrum to obtain a processed magnitude spectrum;
and a second conversion module 63, configured to convert the processed magnitude spectrum into preset parameters of the spectrum model, so as to obtain the processed spectrum model.
Optionally, the first conversion module 61 is configured to:
and converting the static parameters of the mean part of the frequency spectrum model into the amplitude spectrum.
Optionally, the first obtaining module 62 is configured to perform adaptive post-processing on the magnitude spectrum according to the following formula:
$$S_{new}(z)=S_{ori}(z)\,\frac{S_{ori}(z/\beta)}{S_{ori}(z/\alpha)}$$

wherein S_new(z) represents the processed magnitude spectrum;
S_ori(z) represents the magnitude spectrum before processing;
S_ori(z/β) represents the magnitude spectrum obtained by scaling S_ori(z) by a factor of β in the z-plane;
and S_ori(z/α) represents the magnitude spectrum obtained by scaling S_ori(z) by a factor of α in the z-plane.
Optionally, the first obtaining module 62 includes:
the judgment unit is used for judging whether the processed amplitude spectrum obtained by calculation is located in the range of a preset maximum value and a preset minimum value or not according to each processed amplitude spectrum;
the first determining unit is used for taking the preset minimum value as the processed amplitude spectrum when the processed amplitude spectrum is smaller than the preset minimum value;
and the second determining unit is used for taking the preset maximum value as the processed amplitude spectrum when the processed amplitude spectrum is larger than the preset maximum value.
Optionally, the apparatus further comprises:
the first processing module is used for carrying out frequency spectrum energy uniformization processing on the processed magnitude spectrum;
the second conversion module is used for converting the magnitude spectrum subjected to the spectrum energy uniformization processing into preset parameters of the spectrum model.
Optionally, the apparatus further comprises:
the second obtaining module is used for obtaining a text to be synthesized for voice synthesis;
the first determining module is used for determining the acoustic parameters of the text to be synthesized based on the voice model;
and the first synthesis module is used for synthesizing the voice data of the text to be synthesized through the acoustic parameters.
Since the acoustic model processing apparatus described in the third aspect of the embodiments of the present invention is a device used to implement the acoustic model processing method described in the first aspect, a person skilled in the art can understand its specific structure and modifications based on that method, and they are therefore not described again here; every device used to implement the acoustic model processing method described in the first aspect falls within the scope the embodiments of the present invention intend to protect.
In a fourth aspect, based on the same inventive concept, an embodiment of the present invention provides a speech synthesis apparatus, please refer to fig. 7, including:
a third obtaining module 70, configured to obtain a text to be synthesized for speech synthesis;
a second determining module 71, configured to determine a spectrum parameter of the text to be synthesized based on a spectrum model in a speech model, where the spectrum model is a spectrum model that is subjected to adaptive post-processing, and the adaptive post-processing includes the following steps: converting the preset parameters of the frequency spectrum model into a magnitude spectrum, carrying out self-adaptive post-processing on the magnitude spectrum to obtain a processed magnitude spectrum, and converting the processed magnitude spectrum into the preset parameters of the frequency spectrum model;
and a second synthesis module 72, configured to synthesize the speech data of the text to be synthesized through the spectrum parameters.
Optionally, the apparatus further comprises:
the second processing module is used for performing self-adaptive post-processing on the frequency spectrum model locally on the client equipment; and/or
A receiving module, configured to receive the spectrum model subjected to adaptive post-processing from the server.
Since the speech synthesis apparatus described in the fourth aspect of the embodiments of the present invention is a device used to implement the speech synthesis method described in the second aspect, a person skilled in the art can understand its specific structure and modifications based on that method, and they are therefore not described again here; every device used to implement the speech synthesis method described in the second aspect falls within the scope the embodiments of the present invention intend to protect.
In a fifth aspect, based on the same inventive concept, an embodiment of the present invention provides a processing device, which may be an electronic device or a server, including a memory and one or more programs, where the one or more programs are stored in the memory, and configured to be executed by one or more processors, where the one or more programs include instructions for:
acquiring preset parameters of a spectrum model in a voice model;
converting the preset parameters of the frequency spectrum model into a magnitude spectrum;
carrying out self-adaptive post-processing on the amplitude spectrum to obtain a processed amplitude spectrum;
and converting the processed magnitude spectrum into preset parameters of the frequency spectrum model, and further obtaining the processed frequency spectrum model.
Since the electronic device introduced in the fifth aspect of the embodiments of the present invention is a device used to implement the acoustic model processing method introduced in the first aspect, a person skilled in the art can understand its specific structure and modifications based on that method, and they are therefore not described again here; every electronic device used to implement the acoustic model processing method introduced in the first aspect falls within the scope the embodiments of the present invention intend to protect.
In a sixth aspect, based on the same inventive concept, an embodiment of the present invention provides a processing device, which may be an electronic device or a server, including a memory and one or more programs, where the one or more programs are stored in the memory, and configured to be executed by one or more processors, where the one or more programs include instructions for:
obtaining a text to be synthesized for voice synthesis;
determining the spectrum parameters of the text to be synthesized based on a spectrum model in a voice model, wherein the spectrum model is a spectrum model subjected to adaptive post-processing, and the adaptive post-processing process comprises the following steps: converting the preset parameters of the frequency spectrum model into a magnitude spectrum, carrying out self-adaptive post-processing on the magnitude spectrum to obtain a processed magnitude spectrum, and converting the processed magnitude spectrum into the preset parameters of the frequency spectrum model;
and synthesizing the voice data of the text to be synthesized through the frequency spectrum parameters.
Since the electronic device introduced in the sixth aspect of the embodiments of the present invention is a device used to implement the speech synthesis method introduced in the second aspect, a person skilled in the art can understand its specific structure and modifications based on that method, and they are therefore not described again here; every electronic device used to implement the speech synthesis method introduced in the second aspect falls within the scope the embodiments of the present invention intend to protect.
FIG. 8 is a block diagram illustrating an electronic device 800 implementing an acoustic model processing method (or speech synthesis method) in accordance with an example embodiment. For example, the electronic device 800 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.
Referring to fig. 8, electronic device 800 may include one or more of the following components: processing component 802, memory 804, power component 806, multimedia component 808, audio component 810, input/output (I/O) interface 812, sensor component 814, and communication component 816.
The processing component 802 generally controls overall operation of the electronic device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing elements 802 may include one or more processors 820 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 802 can include one or more modules that facilitate interaction between the processing component 802 and other components. For example, the processing component 802 can include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operation at the device 800. Examples of such data include instructions for any application or method operating on the electronic device 800, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 804 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
The power supply component 806 provides power to the various components of the electronic device 800. The power components 806 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the electronic device 800.
The multimedia component 808 includes a screen that provides an output interface between the electronic device 800 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 808 includes a front facing camera and/or a rear facing camera. The front camera and/or the rear camera may receive external multimedia data when the electronic device 800 is in an operation mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a Microphone (MIC) configured to receive external audio signals when the electronic device 800 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 804 or transmitted via the communication component 816. In some embodiments, audio component 810 also includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor assembly 814 includes one or more sensors for providing status assessments of various aspects of the electronic device 800. For example, the sensor assembly 814 may detect an open/closed state of the device 800 and the relative positioning of components (such as the display and keypad of the electronic device 800); it may also detect a change in the position of the electronic device 800 or of a component of the electronic device 800, the presence or absence of user contact with the electronic device 800, the orientation or acceleration/deceleration of the electronic device 800, and a change in the temperature of the electronic device 800. The sensor assembly 814 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate wired or wireless communication between the electronic device 800 and other devices. The electronic device 800 may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast associated information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communications component 816 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the electronic device 800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.
In an exemplary embodiment, a non-transitory computer-readable storage medium comprising instructions, such as the memory 804 comprising instructions, executable by the processor 820 of the electronic device 800 to perform the above-described method is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
A non-transitory computer readable storage medium in which instructions, when executed by a processor of an electronic device, enable the electronic device to perform a method of acoustic model processing, the method comprising:
acquiring preset parameters of a spectrum model in a voice model;
converting the preset parameters of the frequency spectrum model into a magnitude spectrum;
carrying out self-adaptive post-processing on the amplitude spectrum to obtain a processed amplitude spectrum;
and converting the processed magnitude spectrum into preset parameters of the frequency spectrum model, and further obtaining the processed frequency spectrum model.
A non-transitory computer readable storage medium in which instructions, when executed by a processor of an electronic device, enable the electronic device to perform a method of speech synthesis, the method comprising:
obtaining a text to be synthesized for voice synthesis;
determining the spectrum parameters of the text to be synthesized based on a spectrum model in a voice model, wherein the spectrum model is a spectrum model subjected to adaptive post-processing, and the adaptive post-processing process comprises the following steps: converting the preset parameters of the frequency spectrum model into a magnitude spectrum, carrying out self-adaptive post-processing on the magnitude spectrum to obtain a processed magnitude spectrum, and converting the processed magnitude spectrum into the preset parameters of the frequency spectrum model;
and synthesizing the voice data of the text to be synthesized through the frequency spectrum parameters.
Fig. 9 is a schematic structural diagram of a server in an embodiment of the present invention. The server 1900 may vary widely by configuration or performance and may include one or more Central Processing Units (CPUs) 1922 (e.g., one or more processors) and memory 1932, one or more storage media 1930 (e.g., one or more mass storage devices) storing applications 1942 or data 1944. Memory 1932 and storage medium 1930 can be, among other things, transient or persistent storage. The program stored in the storage medium 1930 may include one or more modules (not shown), each of which may include a series of instructions operating on a server. Still further, a central processor 1922 may be provided in communication with the storage medium 1930 to execute a series of instruction operations in the storage medium 1930 on the server 1900.
The server 1900 may also include one or more power supplies 1926, one or more wired or wireless network interfaces 1950, one or more input-output interfaces 1958, one or more keyboards 1956, and/or one or more operating systems 1941, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and the like.
A non-transitory computer readable storage medium in which instructions, when executed by a central processor of a server, enable the server to perform a method of acoustic model processing, the method comprising:
acquiring preset parameters of a spectrum model in a voice model;
converting the preset parameters of the frequency spectrum model into a magnitude spectrum;
carrying out self-adaptive post-processing on the amplitude spectrum to obtain a processed amplitude spectrum;
and converting the processed magnitude spectrum into preset parameters of the frequency spectrum model, and further obtaining the processed frequency spectrum model.
A non-transitory computer readable storage medium in which instructions, when executed by a central processor of a server, enable the server to perform a method of speech synthesis, the method comprising:
obtaining a text to be synthesized for voice synthesis;
determining the spectrum parameters of the text to be synthesized based on a spectrum model in a voice model, wherein the spectrum model is a spectrum model subjected to adaptive post-processing, and the adaptive post-processing process comprises the following steps: converting the preset parameters of the frequency spectrum model into a magnitude spectrum, carrying out self-adaptive post-processing on the magnitude spectrum to obtain a processed magnitude spectrum, and converting the processed magnitude spectrum into the preset parameters of the frequency spectrum model;
and synthesizing the voice data of the text to be synthesized through the frequency spectrum parameters.
One or more embodiments of the invention have at least the following beneficial effects:
because in the embodiment of the present invention, the processing is performed on the speech model in the following manner: acquiring preset parameters of a spectrum model in a voice model; then, converting the preset parameters of the frequency spectrum model into a magnitude spectrum; carrying out self-adaptive post-processing on the amplitude spectrum to obtain a processed amplitude spectrum; the processed magnitude spectrum is converted into preset parameters of the frequency spectrum model, and then the processed frequency spectrum model is obtained, and self-adaptive post-processing is carried out aiming at the preset parameters in the frequency spectrum model, so that an expected signal in the frequency spectrum model is enhanced and interference signals are reduced, and the quality of synthesized voice can be improved when voice data are generated through the voice model subsequently;
in addition, the object of the adaptive post-processing in this scheme is the magnitude spectrum. Because the magnitude spectrum is a universal spectral representation into which various spectral parameters (such as line spectrum pairs and mel-cepstra) can be converted, the scheme applies to any spectral parameterization and does not require a different adaptive post-processing method for each parameter type, giving it strong compatibility for adaptive post-processing of spectral parameters;
moreover, the scheme performs adaptive post-processing on the spectrum model in the voice model in advance, so no adaptive post-processing is needed after acoustic parameters are generated at synthesis time, which reduces the time required to synthesize voice data with the voice model.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (18)

1. An acoustic model processing method, comprising:
acquiring preset parameters of a spectrum model in a voice model;
converting the preset parameters of the spectrum model into a magnitude spectrum;
performing adaptive post-processing on the magnitude spectrum to obtain a processed magnitude spectrum, including: performing adaptive post-processing on the magnitude spectrum by the following formula:

$$\hat{S}(\omega) = S(\omega)\cdot\frac{S_{\beta}(\omega)}{S_{\alpha}(\omega)}$$

wherein $\hat{S}(\omega)$ represents the processed magnitude spectrum; $S(\omega)$ represents the magnitude spectrum before processing; $S_{\alpha}(\omega)$ represents the magnitude spectrum obtained by scale-transforming $z$ in the z-plane to $\alpha$ times its original value; and $S_{\beta}(\omega)$ represents the magnitude spectrum obtained by scale-transforming $z$ in the z-plane to $\beta$ times its original value;

and converting the processed magnitude spectrum into preset parameters of the spectrum model, thereby obtaining the processed spectrum model.
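Outside the claim language, the z-plane scale transformation has a convenient cepstral-domain reading: if $\log S(z) = \sum_n c[n] z^{-n}$, then substituting $z \to z/\gamma$ weights the cepstrum as $c[n] \to \gamma^{n} c[n]$. The Python sketch below implements the ratio form reconstructed above under that reading; the direction of the ratio and the values $\alpha = 0.5$ and $\beta = 0.8$ are assumptions borrowed from classical formant-enhancing postfilters, not values disclosed in the patent.

```python
import numpy as np

def scaled_magnitude(cep, gamma, fft_len=512):
    """Magnitude spectrum after the z-plane scale transformation z -> z/gamma,
    realized as exponential cepstrum weighting c[n] -> gamma**n * c[n]."""
    weighted = cep * gamma ** np.arange(len(cep))
    padded = np.zeros(fft_len)
    padded[:len(cep)] = weighted
    return np.exp(np.fft.rfft(padded).real)

def adaptive_postfilter(cep, alpha=0.5, beta=0.8, fft_len=512):
    """One concrete reading of the claimed formula: S_hat = S * S_beta / S_alpha."""
    s = scaled_magnitude(cep, 1.0, fft_len)         # unmodified spectrum (gamma = 1)
    s_alpha = scaled_magnitude(cep, alpha, fft_len)
    s_beta = scaled_magnitude(cep, beta, fft_len)
    return s * s_beta / s_alpha                     # formant-enhancing ratio
```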
2. The method of claim 1, wherein converting the preset parameters of the spectrum model into a magnitude spectrum comprises:
converting static parameters of a mean part of the spectrum model into the magnitude spectrum.
3. The method of claim 1, wherein performing adaptive post-processing on the magnitude spectrum to obtain a processed magnitude spectrum further comprises:
judging, for each processed magnitude spectrum, whether the calculated processed magnitude spectrum lies within a range bounded by a preset maximum value and a preset minimum value;
when the processed magnitude spectrum is smaller than the preset minimum value, taking the preset minimum value as the processed magnitude spectrum;
and when the processed magnitude spectrum is larger than the preset maximum value, taking the preset maximum value as the processed magnitude spectrum.
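Claim 3's range check maps directly onto a clamp; a one-line sketch, with illustrative bounds (the patent does not disclose the preset values):

```python
import numpy as np

def clamp_magnitude(mag, mag_min=1e-4, mag_max=1e4):
    """Replace out-of-range processed magnitudes with the preset min/max."""
    return np.clip(mag, mag_min, mag_max)
```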
4. The method of claim 1, wherein after performing adaptive post-processing on the magnitude spectrum to obtain a processed magnitude spectrum, the method further comprises:
performing spectral energy uniformization processing on the processed magnitude spectrum;
wherein converting the processed magnitude spectrum into preset parameters of the spectrum model comprises:
converting the magnitude spectrum subjected to the spectral energy uniformization processing into preset parameters of the spectrum model.
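One plausible reading of the spectral energy uniformization in claim 4 is rescaling the processed spectrum so that its total energy matches the pre-processing spectrum; a sketch under that assumption:

```python
import numpy as np

def equalize_energy(mag_processed, mag_original):
    """Rescale so the processed magnitude spectrum carries the same total
    energy as the original (an assumed definition of 'uniformization')."""
    gain = np.sqrt(np.sum(mag_original ** 2) / np.sum(mag_processed ** 2))
    return mag_processed * gain
```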
5. The method of any of claims 1-4, wherein the method further comprises:
obtaining a text to be synthesized for voice synthesis;
determining acoustic parameters of the text to be synthesized based on the voice model;
and synthesizing the voice data of the text to be synthesized through the acoustic parameters.
6. A method of speech synthesis, comprising:
obtaining a text to be synthesized for voice synthesis;
determining spectrum parameters of the text to be synthesized based on a spectrum model in a voice model, wherein the spectrum model is a spectrum model subjected to adaptive post-processing, and the adaptive post-processing comprises: converting preset parameters of the spectrum model into a magnitude spectrum, and performing adaptive post-processing on the magnitude spectrum to obtain a processed magnitude spectrum, including: performing adaptive post-processing on the magnitude spectrum by the following formula:

$$\hat{S}(\omega) = S(\omega)\cdot\frac{S_{\beta}(\omega)}{S_{\alpha}(\omega)}$$

wherein $\hat{S}(\omega)$ represents the processed magnitude spectrum; $S(\omega)$ represents the magnitude spectrum before processing; $S_{\alpha}(\omega)$ represents the magnitude spectrum obtained by scale-transforming $z$ in the z-plane to $\alpha$ times its original value; and $S_{\beta}(\omega)$ represents the magnitude spectrum obtained by scale-transforming $z$ in the z-plane to $\beta$ times its original value;

and synthesizing voice data of the text to be synthesized using the spectrum parameters.
7. The method of claim 6, wherein before the determining spectral parameters of the text to be synthesized based on a spectral model in the speech model, the method further comprises:
performing adaptive post-processing of the spectrum model locally at the client device; and/or
receiving the spectrum model after adaptive post-processing from a server.
8. An acoustic model processing apparatus, comprising:
the acquisition module is used for acquiring preset parameters of a spectrum model in the voice model;
the first conversion module is used for converting the preset parameters of the spectrum model into a magnitude spectrum;
a first obtaining module, configured to perform adaptive post-processing on the magnitude spectrum to obtain a processed magnitude spectrum, including: performing adaptive post-processing on the magnitude spectrum by the following formula:

$$\hat{S}(\omega) = S(\omega)\cdot\frac{S_{\beta}(\omega)}{S_{\alpha}(\omega)}$$

wherein $\hat{S}(\omega)$ represents the processed magnitude spectrum; $S(\omega)$ represents the magnitude spectrum before processing; $S_{\alpha}(\omega)$ represents the magnitude spectrum obtained by scale-transforming $z$ in the z-plane to $\alpha$ times its original value; and $S_{\beta}(\omega)$ represents the magnitude spectrum obtained by scale-transforming $z$ in the z-plane to $\beta$ times its original value;

and a second conversion module, configured to convert the processed magnitude spectrum into preset parameters of the spectrum model, thereby obtaining the processed spectrum model.
9. The apparatus of claim 8, wherein the first conversion module is configured to:
convert static parameters of a mean part of the spectrum model into the magnitude spectrum.
10. The apparatus of claim 8, wherein the first obtaining module comprises:
a judging unit, configured to judge, for each processed magnitude spectrum, whether the calculated processed magnitude spectrum lies within a range bounded by a preset maximum value and a preset minimum value;
a first determining unit, configured to take the preset minimum value as the processed magnitude spectrum when the processed magnitude spectrum is smaller than the preset minimum value;
and a second determining unit, configured to take the preset maximum value as the processed magnitude spectrum when the processed magnitude spectrum is larger than the preset maximum value.
11. The apparatus of claim 8, wherein the apparatus further comprises:
a first processing module, configured to perform spectral energy uniformization processing on the processed magnitude spectrum;
wherein the second conversion module is configured to convert the magnitude spectrum subjected to the spectral energy uniformization processing into preset parameters of the spectrum model.
12. The apparatus of any of claims 8 to 11, further comprising:
the second obtaining module is used for obtaining a text to be synthesized for voice synthesis;
the first determining module is used for determining the acoustic parameters of the text to be synthesized based on the voice model;
and the first synthesis module is used for synthesizing the voice data of the text to be synthesized through the acoustic parameters.
13. A speech synthesis apparatus, comprising:
a third obtaining module, configured to obtain a text to be synthesized for speech synthesis;
a second determining module, configured to determine spectrum parameters of the text to be synthesized based on a spectrum model in a speech model, wherein the spectrum model is a spectrum model subjected to adaptive post-processing, and the adaptive post-processing comprises: converting preset parameters of the spectrum model into a magnitude spectrum, and performing adaptive post-processing on the magnitude spectrum to obtain a processed magnitude spectrum, including: performing adaptive post-processing on the magnitude spectrum by the following formula:

$$\hat{S}(\omega) = S(\omega)\cdot\frac{S_{\beta}(\omega)}{S_{\alpha}(\omega)}$$

wherein $\hat{S}(\omega)$ represents the processed magnitude spectrum; $S(\omega)$ represents the magnitude spectrum before processing; $S_{\alpha}(\omega)$ represents the magnitude spectrum obtained by scale-transforming $z$ in the z-plane to $\alpha$ times its original value; and $S_{\beta}(\omega)$ represents the magnitude spectrum obtained by scale-transforming $z$ in the z-plane to $\beta$ times its original value;

and a second synthesis module, configured to synthesize voice data of the text to be synthesized using the spectrum parameters.
14. The apparatus of claim 13, wherein the apparatus further comprises:
a second processing module, configured to perform adaptive post-processing on the spectrum model locally on the client device; and/or
a receiving module, configured to receive the spectrum model subjected to adaptive post-processing from a server.
15. A processing apparatus comprising a memory, one or more processors, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs including instructions for:
acquiring preset parameters of a spectrum model in a voice model;
converting the preset parameters of the spectrum model into a magnitude spectrum;
performing adaptive post-processing on the magnitude spectrum to obtain a processed magnitude spectrum, including: performing adaptive post-processing on the magnitude spectrum by the following formula:

$$\hat{S}(\omega) = S(\omega)\cdot\frac{S_{\beta}(\omega)}{S_{\alpha}(\omega)}$$

wherein $\hat{S}(\omega)$ represents the processed magnitude spectrum; $S(\omega)$ represents the magnitude spectrum before processing; $S_{\alpha}(\omega)$ represents the magnitude spectrum obtained by scale-transforming $z$ in the z-plane to $\alpha$ times its original value; and $S_{\beta}(\omega)$ represents the magnitude spectrum obtained by scale-transforming $z$ in the z-plane to $\beta$ times its original value;

and converting the processed magnitude spectrum into preset parameters of the spectrum model, thereby obtaining the processed spectrum model.
16. A processing apparatus comprising a memory, one or more processors, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs including instructions for:
obtaining a text to be synthesized for voice synthesis;
determining spectrum parameters of the text to be synthesized based on a spectrum model in a voice model, wherein the spectrum model is a spectrum model subjected to adaptive post-processing, and the adaptive post-processing comprises: converting preset parameters of the spectrum model into a magnitude spectrum, and performing adaptive post-processing on the magnitude spectrum to obtain a processed magnitude spectrum, including: performing adaptive post-processing on the magnitude spectrum by the following formula:

$$\hat{S}(\omega) = S(\omega)\cdot\frac{S_{\beta}(\omega)}{S_{\alpha}(\omega)}$$

wherein $\hat{S}(\omega)$ represents the processed magnitude spectrum; $S(\omega)$ represents the magnitude spectrum before processing; $S_{\alpha}(\omega)$ represents the magnitude spectrum obtained by scale-transforming $z$ in the z-plane to $\alpha$ times its original value; and $S_{\beta}(\omega)$ represents the magnitude spectrum obtained by scale-transforming $z$ in the z-plane to $\beta$ times its original value;

and synthesizing voice data of the text to be synthesized using the spectrum parameters.
17. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, is adapted to carry out the method steps of any of claims 1 to 5.
18. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, is adapted to carry out the method steps of any of claims 6 to 7.
CN201610353978.5A 2016-05-25 2016-05-25 Acoustic model processing method, voice synthesis method, device and related equipment Active CN107437412B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610353978.5A CN107437412B (en) 2016-05-25 2016-05-25 Acoustic model processing method, voice synthesis method, device and related equipment


Publications (2)

Publication Number Publication Date
CN107437412A (en) 2017-12-05
CN107437412B (en) 2021-06-29

Family

ID=60452931

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610353978.5A Active CN107437412B (en) 2016-05-25 2016-05-25 Acoustic model processing method, voice synthesis method, device and related equipment

Country Status (1)

Country Link
CN (1) CN107437412B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110580910B (en) * 2018-06-08 2024-04-26 北京搜狗科技发展有限公司 Audio processing method, device, equipment and readable storage medium
CN110930977B (en) * 2019-11-12 2022-07-08 北京搜狗科技发展有限公司 Data processing method and device and electronic equipment
CN110931045A (en) * 2019-12-20 2020-03-27 重庆大学 Audio feature generation method based on convolutional neural network
CN115798455B (en) * 2023-02-07 2023-06-02 深圳元象信息科技有限公司 Speech synthesis method, system, electronic device and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102610236A (en) * 2012-02-29 2012-07-25 山东大学 Method for improving voice quality of throat microphone
CN102938254A (en) * 2012-10-24 2013-02-20 中国科学技术大学 Voice signal enhancement system and method
CN104318927A (en) * 2014-11-04 2015-01-28 东莞市北斗时空通信科技有限公司 Anti-noise low-bitrate speech coding method and decoding method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103065631B (en) * 2013-01-24 2015-07-29 华为终端有限公司 A kind of method of speech recognition, device


Also Published As

Publication number Publication date
CN107437412A (en) 2017-12-05

Similar Documents

Publication Publication Date Title
CN109801644B (en) Separation method, separation device, electronic equipment and readable medium for mixed sound signal
CN107705783B (en) Voice synthesis method and device
CN110136692B (en) Speech synthesis method, apparatus, device and storage medium
CN111583944A (en) Sound changing method and device
CN111508511A (en) Real-time sound changing method and device
CN110097890B (en) Voice processing method and device for voice processing
CN107871494B (en) Voice synthesis method and device and electronic equipment
CN107437412B (en) Acoustic model processing method, voice synthesis method, device and related equipment
CN109410973B (en) Sound changing processing method, device and computer readable storage medium
CN111326138A (en) Voice generation method and device
CN110931028B (en) Voice processing method and device and electronic equipment
CN115273831A (en) Voice conversion model training method, voice conversion method and device
CN110610720B (en) Data processing method and device and data processing device
CN111104807A (en) Data processing method and device and electronic equipment
CN115039169A (en) Voice instruction recognition method, electronic device and non-transitory computer readable storage medium
CN110580910B (en) Audio processing method, device, equipment and readable storage medium
CN112951202B (en) Speech synthesis method, apparatus, electronic device and program product
CN113345452B (en) Voice conversion method, training method, device and medium of voice conversion model
CN113409765B (en) Speech synthesis method and device for speech synthesis
CN111667842B (en) Audio signal processing method and device
CN110428828B (en) Voice recognition method and device for voice recognition
CN109102810B (en) Voiceprint recognition method and device
CN108345590B (en) Translation method, translation device, electronic equipment and storage medium
CN112818841A (en) Method and related device for recognizing user emotion
CN111063365B (en) Voice processing method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant