CN109712632B - Voice coding and decoding method and device - Google Patents


Info

Publication number
CN109712632B
Authority
CN
China
Prior art keywords: frame, frequency, voice, window, time
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201711008611.0A
Other languages
Chinese (zh)
Other versions
CN109712632A (en
Inventor
袁豪磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Tencent Cloud Computing Beijing Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Tencent Cloud Computing Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd, Tencent Cloud Computing Beijing Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN201711008611.0A priority Critical patent/CN109712632B/en
Publication of CN109712632A publication Critical patent/CN109712632A/en
Application granted granted Critical
Publication of CN109712632B publication Critical patent/CN109712632B/en

Landscapes

  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

An embodiment of the invention discloses a voice coding method comprising the following steps: obtaining the pitch frequency of a voice signal; determining the analysis time of the voice signal according to the pitch frequency; calculating a time domain compensation window according to the analysis time and the pitch frequency, and determining a time domain smooth spectrum of the voice signal according to the time domain compensation window; calculating a frequency domain compensation window according to the time domain compensation window, a preset triangular window and the pitch frequency; and determining the frequency domain smooth spectrum of the voice signal according to the time domain smooth spectrum, the preset triangular window and the frequency domain compensation window. Embodiments of the invention also disclose a voice decoding method and related devices. Adopting the embodiments of the invention can improve the tone quality of synthesized voice.

Description

Voice coding and decoding method and device
Technical Field
The present invention relates to the field of speech processing technologies, and in particular, to a speech encoding and decoding method and apparatus.
Background
Text To Speech (TTS) is a technique that converts text information into speech output. The most widely used TTS schemes today are based on vocoder-based parametric speech synthesis, which converts predicted acoustic parameters into speech. Early TTS methods relied on developers manually designing synthesis rules for specific languages. As technology developed, multimedia data pairing text with speech became increasingly easy to obtain, making it possible to implement TTS with large amounts of text-audio paired data (a "corpus") as training data. Modern TTS systems synthesize speech by combining large corpora with machine learning methods: an algorithm learns the internal mapping from text to speech from real data (the corpus) and, when new text is input, converts it into speech according to the learned mapping, thereby avoiding dependence on manually designed rules.
With continuing advances in technology and ever-growing corpus data, parametric TTS systems now mainly use deep neural networks for acoustic modeling, and the accuracy and generalization capability of acoustic models have improved greatly. The main factor affecting the tone quality of TTS synthesis is therefore no longer quality loss in the model training stage, but the quality of the vocoder's coding and decoding.
In prior art solutions, the speech waveform is generally decomposed into pitch frequency and spectrum features, which are then output to an acoustic model for modeling. Since speech is a time-continuous signal, it generally must be sliced into frames over time and the frames input to a vocoder for coding, so as to obtain the pitch frequency and spectrum. The fixed frame length is generally between 5 ms and 30 ms. However, this fixed framing approach results in significant degradation of synthesized speech quality.
Disclosure of Invention
Embodiments of the invention provide a voice coding and decoding method and device that can solve the problem of poor tone quality in synthesized voice.
The invention provides a voice coding method in a first aspect, which comprises the following steps:
obtaining the fundamental tone frequency of a voice signal;
determining the analysis time of the voice signal according to the fundamental tone frequency;
calculating the time domain compensation window according to the analysis time and the fundamental tone frequency, and determining a time domain smooth spectrum of the voice signal according to the time domain compensation window;
calculating a frequency domain compensation window according to the time domain compensation window, a preset triangular window and the fundamental tone frequency;
and determining the frequency domain smooth spectrum of the voice signal according to the time domain smooth spectrum, the preset triangular window and the frequency domain compensation window.
Wherein the determining a time-domain smoothed spectrum of the speech signal according to the time-domain compensation window comprises:
multiplying the voice signal by the time domain compensation window, and calculating the voice signal after windowing;
performing Fourier transform on the windowed voice signal;
and calculating the square of the module of the windowed voice signal after the Fourier transform to obtain a time domain smooth spectrum of the voice signal.
Wherein the determining an analysis time of the speech signal according to the pitch frequency comprises:
calculating the duration of an analysis window according to the fundamental tone frequency;
and taking the sum of the time lengths of the N analysis windows as the analysis time of the voice signal.
Wherein, according to the time domain compensation window, a preset triangular window and the pitch frequency, calculating a frequency domain compensation window comprises:
calculating the convolution of the time domain compensation window and the preset triangular window to obtain a window function;
establishing an analysis matrix based on the window function, and determining an analysis vector according to the analysis matrix;
and calculating a frequency domain compensation window according to the analysis vector and the pitch frequency.
Wherein, the determining the frequency domain smooth spectrum of the speech signal according to the time domain smooth spectrum, the preset triangular window and the frequency domain compensation window comprises:
and performing convolution operation on the time domain smooth spectrum, the preset triangular window and the frequency domain compensation window, and taking the result of the convolution operation as the frequency domain smooth spectrum.
A second aspect of the present invention provides a speech decoding method, including:
acquiring a frequency spectrum of a voice signal;
determining a frame type of each frame of a voice frame in the voice signal;
determining a time domain signal of each frame of the voice frame in the voice signal according to the frequency spectrum of the voice signal and the frame type of each frame of the voice frame;
and superposing the time domain signals of each frame of the voice frame to obtain the voice signals.
Wherein the determining the time domain signal of each frame of the speech frame in the speech signal according to the frequency spectrum of the speech signal and the frame type of each frame of the speech frame comprises:
determining the noise type of the additive phase noise according to the frame type of each frame of the voice frame;
and calculating the time domain signal of each frame of the voice frame according to the noise type of the additive phase noise and the frequency spectrum of the voice signal at each synthesis time point.
Wherein the determining the noise type of the additive phase noise according to the frame type of each frame of the voice frame comprises:
if the frame type of the voice frame is unvoiced, the noise type of the additive phase noise is white noise;
and if the frame type of the voice frame is voiced, the noise type of the additive phase noise is colored noise.
Wherein the speech signal s(t) = Σ_n s_n(t − T_n), where T_n = Σ_{k=1}^{n−1} 1/f_0(t = t_k) denotes the sum of all pitch periods before the current time t_n, s_n(t − T_n) represents the time domain signal of the speech frame at time t − T_n, and f_0(t = t_n) represents the pitch frequency at time t = t_n.
Accordingly, a third aspect of the present invention provides a speech encoding apparatus comprising:
the obtaining module is used for obtaining the fundamental tone frequency of the voice signal;
a processing module, configured to determine an analysis time of the speech signal according to the pitch frequency;
the processing module is further configured to calculate the time domain compensation window according to the analysis time and the pitch frequency, and determine a time domain smooth spectrum of the speech signal according to the time domain compensation window;
the processing module is further configured to calculate a frequency domain compensation window according to the time domain compensation window, a preset triangular window and the pitch frequency;
the processing module is further configured to determine a frequency domain smoothed spectrum of the speech signal according to the time domain smoothed spectrum, the preset triangular window, and the frequency domain compensation window.
Wherein the processing module is specifically configured to: multiplying the voice signal by the time domain compensation window, and calculating the voice signal after windowing; performing Fourier transform on the windowed voice signal; and calculating the square of the module of the windowed voice signal after the Fourier transform to obtain a time domain smooth spectrum of the voice signal.
Wherein the processing module is specifically configured to: calculating the duration of an analysis window according to the fundamental tone frequency; and taking the sum of the time lengths of the N analysis windows as the analysis time of the voice signal.
Wherein the processing module is specifically configured to: calculating the convolution of the time domain compensation window and the preset triangular window to obtain a window function;
establishing an analysis matrix based on the window function, and determining an analysis vector according to the analysis matrix;
and calculating a frequency domain compensation window according to the analysis vector and the fundamental tone frequency.
Wherein the processing module is specifically configured to:
and performing convolution operation on the time domain smooth spectrum, the preset triangular window and the frequency domain compensation window, and taking the result of the convolution operation as the frequency domain smooth spectrum.
Accordingly, a fourth aspect of the present invention provides a speech decoding apparatus comprising:
the acquisition module is used for acquiring the frequency spectrum of the voice signal;
the determining module is used for determining the frame type of each frame of voice frame in the voice signal;
the processing module is used for determining a time domain signal of each frame of voice frame in the voice signal according to the frequency spectrum of the voice signal and the frame type of each frame of voice frame; and superposing the time domain signals of each frame of the voice frame to obtain the voice signals.
Wherein the processing module is specifically configured to:
determining the noise type of the additive phase noise according to the frame type of each frame of the voice frame;
and calculating the time domain signal of each frame of the voice frame according to the noise type of the additive phase noise and the frequency spectrum of the voice signal at each synthesis time point.
Wherein the processing module is specifically configured to:
if the frame type of the voice frame is unvoiced, the noise type of the additive phase noise is white noise;
and if the frame type of the voice frame is voiced, the noise type of the additive phase noise is colored noise.
Wherein the speech signal s(t) = Σ_n s_n(t − T_n), where T_n = Σ_{k=1}^{n−1} 1/f_0(t = t_k) denotes the sum of all pitch periods before the current time t_n, s_n(t − T_n) represents the time domain signal of the speech frame at time t − T_n, and f_0(t = t_n) represents the pitch frequency at time t = t_n.
A fifth aspect of the present invention provides a speech encoding apparatus comprising an interface circuit, a memory, and a processor, wherein the memory stores a set of program codes therein, and the processor is configured to call the program codes stored in the memory to perform the following operations:
obtaining a fundamental tone frequency of a voice signal;
determining the analysis time of the voice signal according to the fundamental tone frequency;
calculating the time domain compensation window according to the analysis time and the fundamental tone frequency, and determining a time domain smooth spectrum of the voice signal according to the time domain compensation window;
calculating a frequency domain compensation window according to the time domain compensation window, a preset triangular window and the fundamental tone frequency;
and determining the frequency domain smooth spectrum of the voice signal according to the time domain smooth spectrum, the preset triangular window and the frequency domain compensation window.
A sixth aspect of the present invention provides a speech decoding device comprising an interface circuit, a memory, and a processor, wherein the memory stores a set of program codes therein, and the processor is configured to call the program codes stored in the memory to perform the following operations:
acquiring a frequency spectrum of a voice signal;
determining a frame type of each frame of a voice frame in the voice signal;
determining a time domain signal of each frame of the voice frame in the voice signal according to the frequency spectrum of the voice signal and the frame type of each frame of the voice frame;
and superposing the time domain signals of each frame of the voice frame to obtain the voice signals.
Yet another aspect of the present application provides a computer-readable storage medium having stored thereon a plurality of instructions adapted to be loaded by a processor and to perform the method of the above-described aspects.
Yet another aspect of the present application provides a computer program product containing instructions which, when run on a computer, cause the computer to perform the method of the above-described aspects.
In the embodiment of the invention, the pitch frequency of a voice signal is first obtained; the analysis time of the voice signal is determined according to the pitch frequency; a time domain compensation window is then calculated according to the analysis time and the pitch frequency, and a time domain smooth spectrum of the voice signal is determined according to the time domain compensation window; next, a frequency domain compensation window is calculated according to the time domain compensation window, a preset triangular window and the pitch frequency; and finally, the frequency domain smooth spectrum of the voice signal is determined according to the time domain smooth spectrum, the preset triangular window and the frequency domain compensation window. Time domain aliasing between adjacent time points is eliminated through the time domain compensation window, and frequency domain aliasing between adjacent frequency points is eliminated through the frequency domain compensation window, thereby improving the coding and decoding quality of the voice and the tone quality of voice synthesis.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a schematic diagram of a speech synthesis system according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a speech encoding method according to an embodiment of the present invention;
FIG. 3 is a flowchart illustrating a speech decoding method according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of a speech encoding apparatus according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of a speech decoding apparatus according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of a speech encoding apparatus according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram of a speech decoding apparatus according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
TTS technology already has practical product applications in news and novel synthesis, for example "Tencent News" and "Penguin FM": a piece of news or fiction text is input and synthesized audio is output. In addition, TTS can be combined with artificial intelligence technologies (speech recognition, human-computer dialogue) to enable spoken dialogue between machines and people. As shown in fig. 1, fig. 1 is a schematic diagram of a speech synthesis system according to an embodiment of the present invention. The speech synthesis system comprises user equipment 101 and a processing server 102; the user equipment may include mobile phone terminals, smart speakers, and other human-computer interaction devices. A user can send a voice signal to the user equipment 101, for example to query the weather or a news topic. After receiving the voice signal, the user equipment 101 encodes it and sends the encoded signal to the processing server 102. After receiving the signal, the processing server 102 decodes it, performs speech recognition, obtains the user's query result, and returns the result to the user equipment 101, which displays it and broadcasts it to the user by voice. To address the problem of low quality in synthesized speech, an improved coding and decoding scheme for speech synthesis is proposed below.
Referring to fig. 2, fig. 2 is a flowchart illustrating a speech encoding method according to an embodiment of the present invention. The method in the embodiment of the invention comprises the following steps:
s201, obtaining a fundamental tone frequency of the voice signal. The pitch frequency is the reciprocal of the pitch period.
In a specific implementation, the pitch frequency describes the periodic vibration of the vocal cords in the larynx during voicing; it appears as short-time periodicity in the time domain waveform and as slow temporal variation of the instantaneous frequency in the frequency domain. The pitch frequency can therefore be extracted from the time domain and frequency domain characteristics of the voice signal, for example via time domain waveform correlation or frequency domain spectral correlation. On the time domain waveform, the pitch period can be determined by comparing the similarity between the original signal and shifted copies of itself. In the frequency domain, the non-sinusoidal periodicity of the voice signal appears as a harmonic structure, and the pitch frequency is the spacing between adjacent harmonics.
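As an illustrative sketch of the time domain correlation approach described above, the following Python snippet estimates the pitch frequency by comparing the signal with shifted copies of itself and picking the lag with the highest correlation (the function name, search bounds, and simple peak-picking are assumptions for illustration, not the patent's implementation):

```python
import math

def estimate_pitch(signal, fs, fmin=40.0, fmax=500.0):
    """Pick the lag whose shifted copy best matches the signal;
    that lag is the pitch period, and fs / lag is the pitch frequency."""
    lag_min = int(fs / fmax)              # shortest candidate period, in samples
    lag_max = int(fs / fmin)              # longest candidate period, in samples
    n = len(signal)
    best_lag, best_r = lag_min, float("-inf")
    for lag in range(lag_min, min(lag_max, n // 2) + 1):
        # correlation between the signal and its copy shifted by `lag`
        r = sum(signal[i] * signal[i + lag] for i in range(n - lag))
        if r > best_r:
            best_r, best_lag = r, lag
    return fs / best_lag                  # pitch frequency in Hz

fs = 8000
sig = [math.sin(2 * math.pi * 200 * t / fs) for t in range(800)]
f0 = estimate_pitch(sig, fs)              # close to 200 Hz for a 200 Hz tone
```

The 40–500 Hz search band mirrors the voiced-frame frequency thresholds mentioned later in the decoding embodiment.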
The frame types of the speech frames in the voice signal include unvoiced and voiced. Voiced and unvoiced sounds differ greatly in their production characteristics: a voiced signal has significant periodicity caused by vocal cord vibration, while an unvoiced signal does not.
S202, determining the analysis time of the voice signal according to the fundamental tone frequency.
In a specific implementation, the duration of an analysis window can be calculated according to the pitch frequency, and the sum of the durations of the N analysis windows is taken as the analysis time of the voice signal. For example, taking each analysis window to span one pitch period, the analysis time T = Σ_{n=1}^{N} 1/f_n, where f_n is the pitch frequency of the n-th frame and N is the frame number of the speech signal.
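Assuming, as in the example above, that each analysis window spans one pitch period 1/f_n, the analysis time can be sketched in Python as:

```python
def analysis_time(pitch_hz):
    """Sum of the N per-frame analysis-window durations, where each
    window is assumed (for illustration) to last one pitch period 1/f_n."""
    return sum(1.0 / f for f in pitch_hz)

T = analysis_time([200.0, 250.0, 200.0])  # three frames -> 5 ms + 4 ms + 5 ms
```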
S203, calculating a time domain compensation window according to the analysis time and the fundamental tone frequency, and determining a time domain smooth spectrum of the voice signal according to the time domain compensation window.
In a particular implementation, the time domain compensation window
Figure BDA0001445001740000072
wherein the analysis time
Figure BDA0001445001740000073
f_n is the pitch frequency, and * denotes convolution.
In addition, the voice signal can be multiplied by the time domain compensation window to obtain the windowed voice signal; a Fourier transform is then applied to the windowed signal; and finally the square of the modulus of the transformed signal gives the time domain smooth spectrum of the voice signal. For example, the time domain smooth spectrum S_t = |F(s_t · w_t)|², where s_t is the voice signal, w_t is the time domain compensation window, and F denotes the Fourier transform.
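The three steps above (windowing, Fourier transform, squared modulus) can be sketched directly in Python; a Hann window stands in for the time domain compensation window, and a plain O(N²) DFT is used for clarity:

```python
import cmath
import math

def power_spectrum(frame, window):
    """|F{s_t . w_t}|^2: multiply the frame by the window, take the
    DFT, and square the modulus of each bin."""
    x = [s * w for s, w in zip(frame, window)]
    n = len(x)
    spec = []
    for k in range(n):
        xk = sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        spec.append(abs(xk) ** 2)
    return spec

n = 8
frame = [math.cos(2 * math.pi * t / n) for t in range(n)]
hann = [0.5 - 0.5 * math.cos(2 * math.pi * t / n) for t in range(n)]
spec = power_spectrum(frame, hann)   # real, non-negative, symmetric bins
```

For a real input frame the resulting spectrum is non-negative and conjugate-symmetric, matching what a smoothed power spectrum should look like.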
And S204, calculating a frequency domain compensation window according to the time domain compensation window, a preset triangular window and the fundamental tone frequency.
In a specific implementation, a convolution of a time domain compensation window and a preset triangular window can be calculated to obtain a window function; then establishing an analysis matrix based on the window function, and determining an analysis vector according to the analysis matrix; and finally, calculating a frequency domain compensation window according to the analysis vector and the fundamental tone frequency.
For example, a preset triangular window A(ω) may be convolved with the time domain compensation window W(ω) to obtain a window function AW(ω), where A(ω) is the preset triangular window and W(ω) is the time domain compensation window, both in frequency-domain form. Shifting AW(ω) yields AW((i + j)ω_0), and the analysis matrix H = [H_ij] is defined by H_ij = AW((i + j)ω_0) with i ∈ [−M, M] and j ∈ [−N, N], so that H is a matrix of 2M + 1 rows and 2N + 1 columns and H_ij is the element in row i and column j. After the analysis matrix is computed, the analysis vector U = (HᵀH)⁻¹Hᵀδ, where δ is a vector of length 2M + 1 representing the desired frequency response: for an ideal frequency domain compensation window, δ has a response of 1 at the center frequency point and 0 at the other frequency points, that is, δ = [δ_{−M}, …, δ_0, …, δ_M] = [0, …, 0, 1, 0, …, 0]. Finally, once the vector U has been obtained, the frequency domain compensation window is calculated as
Figure BDA0001445001740000081
where u_k denotes the k-th element of the vector U.
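The least-squares step U = (HᵀH)⁻¹Hᵀδ can be sketched with NumPy. The window function `AW` below is a hypothetical triangular stand-in (the patent's actual AW(ω) comes from the convolution above); it is used only to build an analysis matrix of the stated shape:

```python
import numpy as np

# Least-squares design of the analysis vector U = (H^T H)^{-1} H^T delta:
# delta asks for response 1 at the centre frequency bin and 0 elsewhere.
M, N = 3, 2
omega0 = 1.0

def AW(w):
    # Hypothetical stand-in for the window function AW(omega): a
    # triangular response, positive over the sampled range, for illustration.
    return max(0.0, 1.0 - abs(w) / ((M + N) * omega0 + 1.0))

H = np.array([[AW((i + j) * omega0)
               for j in range(-N, N + 1)]
              for i in range(-M, M + 1)])        # (2M+1) x (2N+1)
delta = np.zeros(2 * M + 1)
delta[M] = 1.0                                   # 1 at the centre bin, 0 elsewhere

U, *_ = np.linalg.lstsq(H, delta, rcond=None)    # solves (H^T H) U = H^T delta
```

`np.linalg.lstsq` returns the same minimum-norm least-squares solution as the closed-form normal equations when H has full column rank.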
And S205, determining the frequency domain smooth spectrum of the voice signal according to the time domain smooth spectrum, the preset triangular window and the frequency domain compensation window.
In a specific implementation, a convolution operation can be performed on the time domain smooth spectrum, the preset triangular window and the frequency domain compensation window, and the result of the convolution operation taken as the frequency domain smooth spectrum. For example, S_f = S_t * W_A * W_f, where S_f is the frequency domain smooth spectrum, S_t is the time domain smooth spectrum, W_A is the preset triangular window in the form of a time-domain representation, W_f is the frequency domain compensation window, and * denotes convolution.
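A minimal sketch of the cascaded convolution S_f = S_t * W_A * W_f, with toy sequences standing in for the actual spectrum and windows:

```python
def convolve(a, b):
    """Full linear convolution of two sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out

S_t = [0.0, 1.0, 4.0, 1.0, 0.0]   # toy time domain smooth spectrum
W_A = [0.25, 0.5, 0.25]           # toy triangular window (sums to 1)
W_f = [1.0]                        # identity compensation window for the sketch
S_f = convolve(convolve(S_t, W_A), W_f)
```

Because both windows sum to 1 here, the convolution redistributes energy across neighbouring bins without changing the total, which is the smoothing behaviour the method relies on.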
In the embodiment of the invention, after encoding the voice signal sent by the user in the manner described above, the user equipment sends the encoded signal to the processing server. The processing server decodes the received signal, performs voice recognition on it, queries the corresponding text information, and returns it to the user equipment, which prompts the user by voice broadcast. Because this coding scheme is used in the speech synthesis process, a more stable smooth spectrum is obtained, making speech synthesis more accurate.
In the embodiment of the invention, the pitch frequency of a voice signal is first obtained; the analysis time of the voice signal is determined according to the pitch frequency; a time domain compensation window is then calculated according to the analysis time and the pitch frequency, and a time domain smooth spectrum of the voice signal is determined according to the time domain compensation window; next, a frequency domain compensation window is calculated according to the time domain compensation window, a preset triangular window and the pitch frequency; and finally, the frequency domain smooth spectrum of the voice signal is determined according to the time domain smooth spectrum, the preset triangular window and the frequency domain compensation window. Time domain aliasing between adjacent time points is eliminated through the time domain compensation window, and frequency domain aliasing between adjacent frequency points is eliminated through the frequency domain compensation window, thereby improving the coding and decoding quality of voice and the tone quality of voice synthesis.
Referring to fig. 3, fig. 3 is a flowchart illustrating a speech decoding method according to an embodiment of the present invention. The method in the embodiment of the invention comprises the following steps:
s301, acquiring the frequency spectrum of the voice signal.
The embodiment of the present invention is the inverse process of the previous embodiment: in the previous embodiment the voice signal was encoded to extract its pitch frequency and obtain its spectrum; in this embodiment the voice signal is reconstructed from them.
S302, determining the frame type of each frame of the voice frame in the voice signal. The frame types of the voice frame include unvoiced and voiced. Voiced and unvoiced sounds differ greatly in their production characteristics, and different frame types call for different decoding modes: a voiced signal has a pronounced periodicity caused by vocal cord vibration, while an unvoiced signal does not.
In a specific implementation, the frequency of each frame of the voice frame may be obtained and compared with two thresholds: if the frequency of the voice frame is greater than a first preset threshold and less than a second preset threshold, the frame type of the voice frame is voiced; if the frequency is less than the first preset threshold or greater than the second preset threshold, the frame type is unvoiced. The second preset threshold is greater than the first preset threshold; the first preset threshold includes but is not limited to 40 Hz, and the second preset threshold includes but is not limited to 500 Hz.
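The threshold test can be sketched as a small helper (the 40 Hz and 500 Hz defaults follow the example thresholds above):

```python
def frame_type(pitch_hz, lo=40.0, hi=500.0):
    """Classify a frame as voiced when its pitch frequency falls strictly
    inside (lo, hi); frequencies outside that band are treated as unvoiced."""
    return "voiced" if lo < pitch_hz < hi else "unvoiced"

types = [frame_type(f) for f in (220.0, 30.0, 800.0)]
```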
S303, calculating a time domain signal of each frame of the voice frame in the voice signal according to the frequency spectrum of the voice signal and the frame type of each frame of the voice frame.
In specific implementation, the noise type of the additive phase noise can be determined according to the frame type of each frame of the voice frame; then, the time domain signal of each frame of speech frame is calculated according to the noise type of the additive phase noise and the frequency spectrum of the speech signal at each synthesis time point.
Further, if the frame type of the voice frame is unvoiced, the noise type of the additive phase noise is white noise; if the frame type of the voice frame is voiced, the noise type of the additive phase noise is colored noise.
For example, the reconstructed speech signal of each frame
Figure BDA0001445001740000091
where S(ω, t_n) denotes the spectrum of the voice signal at synthesis time point t_n, and φ(ω) is the additive phase noise. If the frame type of the voice frame is unvoiced, φ(ω) is Gaussian white noise; if the frame type is voiced, φ(ω) is colored noise.
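A sketch of choosing the additive phase noise by frame type: Gaussian white noise for unvoiced frames, and for voiced frames a first-order low-pass smoothing of white noise as an illustrative stand-in for colored noise (the text does not specify the coloring filter, so this filter is an assumption):

```python
import random

def phase_noise(frame_is_voiced, n, seed=0):
    """Additive phase noise phi(omega): Gaussian white noise for unvoiced
    frames; for voiced frames, a crudely 'colored' variant obtained by
    first-order low-pass smoothing of the white noise."""
    rng = random.Random(seed)
    white = [rng.gauss(0.0, 1.0) for _ in range(n)]
    if not frame_is_voiced:
        return white
    colored, prev = [], 0.0
    for w in white:
        prev = 0.9 * prev + 0.1 * w   # emphasises low frequencies
        colored.append(prev)
    return colored

white = phase_noise(frame_is_voiced=False, n=200)
colored = phase_noise(frame_is_voiced=True, n=200)
```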
S304, the time domain signals of each frame of the voice frame are superposed to obtain voice signals.
In a specific implementation, the voice signal s(t) = Σ_n s_n(t − T_n), where T_n = Σ_{k=1}^{n−1} 1/f_0(t = t_k) denotes the sum of all pitch periods before the current time t_n, s_n(t − T_n) represents the time domain signal of the voice frame at time t − T_n, and f_0(t = t_n) represents the pitch frequency at time t = t_n.
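The superposition can be sketched as pitch-synchronous overlap-add: each frame is placed at an offset equal to the sum of all previous pitch periods and the overlapping samples are added (frame contents and pitch values here are toy data for illustration):

```python
def overlap_add(frames, pitch_hz, fs):
    """Place frame n at offset t_n = sum of the previous pitch periods
    (in samples) and add overlapping samples together."""
    offsets, t = [], 0.0
    for f0 in pitch_hz:
        offsets.append(int(round(t * fs)))
        t += 1.0 / f0                 # advance by one pitch period
    total = offsets[-1] + len(frames[-1])
    out = [0.0] * total
    for frame, off in zip(frames, offsets):
        for i, s in enumerate(frame):
            out[off + i] += s
    return out

fs = 8000
frames = [[1.0] * 60, [1.0] * 60]     # two toy 60-sample frames
sig = overlap_add(frames, [200.0, 200.0], fs)   # 200 Hz -> 40-sample hop
```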
In the embodiment of the invention, after receiving the voice signal, the processing server decodes it in the manner described above, performs voice recognition on the decoded signal, searches the corresponding text information from a voice library, and sends the text information to the user equipment, which synthesizes it into voice information and broadcasts it to the user. Because this decoding scheme is adopted, the accuracy of voice recognition is improved and voice synthesis is more accurate.
In the embodiment of the invention, the time domain signal of each frame of voice frame is calculated according to the frequency spectrum of the voice signal at each synthesis time point, and then the time domain signal of each frame of voice frame is superposed to obtain the reconstructed voice signal.
As shown in fig. 4, fig. 4 is a schematic structural diagram of a speech encoding apparatus according to an embodiment of the present invention. As shown in the figures, the apparatus in the embodiment of the present invention includes:
an obtaining module 401 is configured to obtain a pitch frequency of the speech signal.
In a specific implementation, the pitch frequency describes the periodic vibration of the vocal cords during voicing; it appears as short-time periodicity in the time-domain waveform and as temporal variation of the instantaneous frequency in the frequency domain. The pitch frequency can therefore be extracted from the time-domain and frequency-domain characteristics of the speech signal, for example via time-domain waveform correlation or frequency-domain spectral correlation. In the time domain, the pitch period can be determined by comparing the similarity between the original signal and its shifted copies. In the frequency domain, the non-sinusoidal periodicity of the speech signal appears as a harmonic structure, and the pitch frequency is the spacing between adjacent harmonics.
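The time-domain approach above — comparing the signal with its shifted copies — can be sketched as a minimal autocorrelation pitch estimator. The function name and the sample values in the example are illustrative, not from the patent:

```python
import numpy as np

def estimate_pitch(signal, fs, fmin=40.0, fmax=500.0):
    """Estimate the pitch frequency via time-domain autocorrelation:
    the lag maximizing the correlation between the signal and its
    shifted copy, within a plausible pitch range, gives the period."""
    sig = signal - np.mean(signal)
    corr = np.correlate(sig, sig, mode="full")[len(sig) - 1:]
    lo, hi = int(fs / fmax), int(fs / fmin)  # lag range for fmax..fmin
    lag = lo + np.argmax(corr[lo:hi])
    return fs / lag
```

For a clean 200 Hz sine sampled at 8 kHz, the correlation peaks at a lag of 40 samples, giving 8000/40 = 200 Hz.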
The frame types of the speech frames in the speech signal include unvoiced and voiced. Voiced and unvoiced sounds differ greatly in their production characteristics: a voiced signal exhibits pronounced periodicity caused by vocal-cord vibration, while an unvoiced signal does not.
A processing module 402, configured to determine an analysis time of the speech signal according to the pitch frequency.
In a specific implementation, the duration of each analysis window can be calculated from the pitch frequency, and the sum of the durations of the N analysis windows is taken as the analysis time of the speech signal. For example, the analysis time is given by the equation shown in
Figure BDA0001445001740000101
where f_n is the pitch frequency and N is the frame number of the speech signal.
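Under one reading of the pictured equation — each analysis window lasting one pitch period 1/f_n, which is an assumption made for this sketch — the analysis time could be accumulated as:

```python
def analysis_time(pitch_freqs):
    """Sum of the N per-frame analysis-window durations, taking each
    window's duration as one pitch period 1/f_n (assumed relation)."""
    return sum(1.0 / f for f in pitch_freqs)
```

So for two frames with pitch frequencies 100 Hz and 200 Hz, the analysis time would be 0.010 s + 0.005 s = 0.015 s.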
The processing module 402 is further configured to calculate a time domain compensation window according to the analysis time and the pitch frequency, and determine a time domain smoothed spectrum of the speech signal according to the time domain compensation window. In a specific implementation, the time domain compensation window is given by the equation shown in
Figure BDA0001445001740000111
where the auxiliary term is shown in
Figure BDA0001445001740000112
f_n is the pitch frequency, and * denotes convolution.
In addition, the speech signal can be multiplied by the time domain compensation window to obtain the windowed speech signal; the windowed speech signal is then Fourier-transformed, and the squared modulus of the transform is taken to obtain the time domain smoothed spectrum of the speech signal. For example, the time domain smoothed spectrum is given by the equation shown in
Figure BDA0001445001740000113
where s_t is the speech signal, w_t is the time domain compensation window, and the operator shown in
Figure BDA0001445001740000114
denotes the Fourier transform.
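The three steps above (windowing, Fourier transform, squared modulus) can be sketched directly. The window passed in stands in for the time-domain compensation window; the Hanning window in the test is only a placeholder:

```python
import numpy as np

def time_domain_smoothed_spectrum(s_t, w_t):
    """Multiply the signal by the time-domain compensation window,
    Fourier-transform, and take the squared modulus |F{s_t * w_t}|^2."""
    return np.abs(np.fft.rfft(s_t * w_t)) ** 2
```

The result is a non-negative power-spectrum-like quantity, one value per frequency bin.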
The processing module 402 is further configured to calculate a frequency domain compensation window according to the time domain compensation window, a preset triangular window, and the pitch frequency.
In a specific implementation, a convolution of a time domain compensation window and a preset triangular window can be calculated to obtain a window function; then establishing an analysis matrix based on the window function, and determining an analysis vector according to the analysis matrix; and finally, calculating a frequency domain compensation window according to the analysis vector and the fundamental tone frequency.
For example, a preset triangular window A(ω) may be convolved with the time domain compensation window W(ω) to obtain a window function AW(ω), where A(ω) is the preset triangular window in frequency-domain form and W(ω) is the time domain compensation window in frequency-domain form. Substituting (i + j)ω₀ into AW(ω) gives AW((i + j)ω₀), and the analysis matrix is H = [H_ij] with H_ij = AW((i + j)ω₀), i ∈ [−M, M], j ∈ [−N, N], so that H is a matrix of 2M + 1 rows and 2N + 1 columns and H_ij is the element in row i and column j. After the analysis matrix is calculated, the analysis vector U = (HᵀH)⁻¹Hᵀδ is computed, where δ is a vector of length 2M + 1 representing the desired frequency response: for an ideal frequency domain compensation window, δ has a response of 1 at the centre frequency point and 0 at the other frequency points, i.e. δ = [δ₋M, …, δ₀, …, δ_M] = [0, …, 0, 1, 0, …, 0]. Finally, after the vector U is obtained, the frequency domain compensation window is calculated as shown in
Figure BDA0001445001740000115
where u_k denotes the kth element of the vector U.
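The least-squares construction of the analysis vector — H built from AW((i + j)ω₀) and U = (HᵀH)⁻¹Hᵀδ — can be sketched as below. The window function AW passed in is a placeholder (the tests use a sinc), since the patent's actual AW(ω) comes from the convolution described above:

```python
import numpy as np

def analysis_vector(AW, omega0, M, N):
    """Build H_ij = AW((i + j) * omega0) for i in [-M, M], j in [-N, N]
    and solve the least-squares problem U = (H^T H)^{-1} H^T delta,
    where delta is 1 at the centre frequency point and 0 elsewhere."""
    i = np.arange(-M, M + 1)[:, None]
    j = np.arange(-N, N + 1)[None, :]
    H = AW((i + j) * omega0)          # (2M+1) x (2N+1) analysis matrix
    delta = np.zeros(2 * M + 1)
    delta[M] = 1.0                    # desired ideal frequency response
    U, *_ = np.linalg.lstsq(H, delta, rcond=None)
    return U
```

`np.linalg.lstsq` returns the same minimiser as the normal-equation form (HᵀH)⁻¹Hᵀδ while being numerically better conditioned.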
The processing module 402 is further configured to determine a frequency domain smoothed spectrum of the speech signal according to the time domain smoothed spectrum, a preset triangular window, and a frequency domain compensation window.
In a specific implementation, a convolution operation can be performed on the time domain smoothed spectrum, the preset triangular window, and the frequency domain compensation window, and the result of the convolution taken as the frequency domain smoothed spectrum. For example, S_f = S_t * W_A * W_f, where S_f is the frequency domain smoothed spectrum, S_t is the time domain smoothed spectrum, W_A is the preset triangular window in time-domain form, W_f is the frequency domain compensation window, and * denotes convolution.
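The final smoothing step S_f = S_t * W_A * W_f amounts to two successive convolutions. A minimal sketch, using same-length convolution (the boundary handling is an implementation choice, not specified by the text):

```python
import numpy as np

def frequency_domain_smoothed_spectrum(S_t, W_A, W_f):
    """Convolve the time-domain smoothed spectrum with the preset
    triangular window and then with the frequency-domain
    compensation window: S_f = S_t * W_A * W_f."""
    return np.convolve(np.convolve(S_t, W_A, mode="same"), W_f, mode="same")
```

With an impulse spectrum and a small triangular window, the output is simply the triangular window centred on the impulse, which makes the smoothing effect easy to see.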
In the embodiment of the invention, after the user equipment encodes the speech signal uttered by the user according to the above encoding mode, it sends the encoded speech signal to the processing server. After receiving the speech signal, the processing server decodes it, performs speech recognition on the decoded signal, queries the corresponding text information, and returns the text information to the user equipment, which prompts the user by voice broadcast. Because this encoding mode is adopted in the speech synthesis process, a more stable smoothed spectrum is obtained and speech synthesis is more accurate.
In the embodiment of the invention, the pitch frequency of the speech signal is first obtained, and the analysis time of the speech signal is determined according to the pitch frequency. A time domain compensation window is then calculated according to the analysis time and the pitch frequency, and the time domain smoothed spectrum of the speech signal is determined according to the time domain compensation window. Next, a frequency domain compensation window is calculated according to the time domain compensation window, the preset triangular window and the pitch frequency. Finally, the frequency domain smoothed spectrum of the speech signal is determined according to the time domain smoothed spectrum, the preset triangular window and the frequency domain compensation window. Time-domain aliasing between adjacent time points is eliminated by the time domain compensation window, and frequency-domain aliasing between adjacent frequency points is eliminated by the frequency domain compensation window, thereby improving the speech coding and decoding quality and the sound quality of synthesized speech.
Referring to fig. 5, fig. 5 is a schematic structural diagram of a speech decoding apparatus according to an embodiment of the present invention. The device in the embodiment of the invention comprises:
an obtaining module 501 is configured to obtain a frequency spectrum of the voice signal.
A determining module 502, configured to determine a frame type of each frame of a speech frame in a speech signal.
The frame types of the speech frame include unvoiced and voiced. Voiced and unvoiced sounds differ greatly in their production characteristics, and speech frames of different frame types are decoded differently. The key difference is that a voiced signal exhibits pronounced periodicity caused by vocal-cord vibration.
In a specific implementation, the frequency of each speech frame may be obtained and compared against two thresholds: if the frequency of the speech frame is greater than a first preset threshold and less than a second preset threshold, the frame type of the speech frame is voiced; if the frequency is less than the first preset threshold or greater than the second preset threshold, the frame type is unvoiced. The second preset threshold is greater than the first preset threshold; the first preset threshold includes but is not limited to 40 Hz, and the second preset threshold includes but is not limited to 500 Hz.
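The threshold test above reduces to a small predicate, using the 40 Hz and 500 Hz example thresholds given in the text (the function name is illustrative):

```python
def classify_frame(frequency_hz, low=40.0, high=500.0):
    """Voiced if the frame frequency lies strictly between the first
    and second preset thresholds, unvoiced otherwise."""
    return "voiced" if low < frequency_hz < high else "unvoiced"
```

This matches the roughly 40–500 Hz range of human pitch: frequencies outside it are treated as aperiodic (unvoiced) content.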
The processing module 503 is configured to calculate a time domain signal of each frame of the speech frame in the speech signal according to the frequency spectrum of the speech signal and the frame type of each frame of the speech frame; and superposing the time domain signals of each frame of the voice frame to obtain voice signals.
In a specific implementation, the noise type of the additive phase noise can be determined according to the frame type of each speech frame; the time-domain signal of each speech frame is then calculated according to the noise type of the additive phase noise and the spectrum of the speech signal at each synthesis time point.
Further, if the frame type of the speech frame is unvoiced, the noise type of the additive phase noise is white noise; if the frame type of the speech frame is voiced, the noise type of the additive phase noise is colored noise.
For example, the reconstructed speech signal of each frame is given by the equation shown in
Figure BDA0001445001740000131
where S(ω, t_n) denotes the spectrum of the speech signal at the synthesis time point t_n and φ(ω) is the additive phase noise. If the frame type of the speech frame is unvoiced, φ(ω) is white Gaussian noise; if the frame type of the speech frame is voiced, φ(ω) is colored noise.
The speech signal is given by the equation shown in
Figure BDA0001445001740000132
where the term shown in
Figure BDA0001445001740000133
denotes the sum of all pitch periods before the current time t_n, s_n(t − T) denotes the time-domain signal of the speech frame at time t − T, and f_0(t = t_n) denotes the pitch frequency at time t = t_n.
In the embodiment of the invention, after receiving the speech signal, the processing server decodes it according to the above decoding mode, performs speech recognition on the decoded signal, retrieves the corresponding text information from a speech library, and sends the text information to the user equipment; the user equipment then synthesizes the text information into speech and broadcasts it to the user. Because this decoding mode is adopted, the accuracy of speech recognition is improved and speech synthesis is more accurate.
In the embodiment of the invention, the time domain signal of each frame of voice frame is calculated according to the frequency spectrum of the voice signal at each synthesis time point, and then the time domain signal of each frame of voice frame is superposed to obtain the reconstructed voice signal.
Referring to fig. 6, fig. 6 is a schematic structural diagram of a speech encoding apparatus according to an embodiment of the present invention. As shown, the apparatus may include: at least one encoder 601 (e.g., a CPU), at least one communication interface 602, at least one memory 603, and at least one communication bus 604, where the communication bus 604 is used to enable connection and communication between these components. The communication interface 602 of the device in this embodiment is used for signaling or data communication with other node devices. The memory 603 may be a high-speed RAM memory or a non-volatile memory (e.g., at least one disk memory), and may optionally be at least one storage device located remotely from the encoder 601. A set of program codes is stored in the memory 603, and the encoder 601 invokes the program codes in the memory 603 to perform the following operations:
Obtaining the fundamental tone frequency of a voice signal;
determining the analysis time of the voice signal according to the fundamental tone frequency;
calculating a time domain compensation window according to the analysis time and the fundamental tone frequency, and determining a time domain smooth spectrum of the voice signal according to the time domain compensation window;
calculating a frequency domain compensation window according to the time domain compensation window, a preset triangular window and the fundamental tone frequency;
and determining the frequency domain smooth spectrum of the voice signal according to the time domain smooth spectrum, a preset triangular window and a frequency domain compensation window.
Wherein, the encoder 601 is further configured to perform the following operation steps:
multiplying the voice signal by a time domain compensation window, and calculating the voice signal after windowing;
carrying out Fourier transform on the windowed voice signal;
and calculating the square of the module of the windowed voice signal after Fourier transform to obtain the time domain smooth spectrum of the voice signal.
Wherein, the encoder 601 is further configured to perform the following operation steps:
calculating the duration of an analysis window according to the fundamental tone frequency;
and taking the sum of the time durations of the N analysis windows as the analysis time of the voice signal.
Wherein, the encoder 601 is further configured to perform the following operation steps:
calculating the convolution of the time domain compensation window and a preset triangular window to obtain a window function;
establishing an analysis matrix based on the window function, and determining an analysis vector according to the analysis matrix;
and calculating a frequency domain compensation window according to the analysis vector and the fundamental tone frequency.
Wherein, the encoder 601 is further configured to perform the following operation steps:
and performing convolution operation on the time domain smooth spectrum, a preset triangular window and a frequency domain compensation window, and taking the result of the convolution operation as the frequency domain smooth spectrum.
Referring to fig. 7, fig. 7 is a schematic structural diagram of a speech decoding apparatus according to an embodiment of the present invention. As shown, the apparatus may include: at least one decoder 701 (e.g., a CPU), at least one communication interface 702, at least one memory 703, and at least one communication bus 704, where the communication bus 704 is used to enable connection and communication between these components. The communication interface 702 of the device in this embodiment is used for signaling or data communication with other node devices. The memory 703 may be a high-speed RAM memory or a non-volatile memory (e.g., at least one disk memory), and may optionally be at least one storage device located remotely from the decoder 701. A set of program codes is stored in the memory 703, and the decoder 701 invokes the program codes in the memory 703 to perform the following operations:
Acquiring a frequency spectrum of a voice signal;
determining the frame type of each frame of a voice frame in a voice signal;
calculating a time domain signal of each frame of the voice frame in the voice signal according to the frequency spectrum of the voice signal and the frame type of each frame of the voice frame;
and superposing the time domain signals of each frame of the voice frame to obtain the voice signal.
The decoder 701 is further configured to perform the following operation steps:
determining the noise type of the additional phase noise according to the frame type of each frame of the voice frame;
and calculating the time domain signal of each frame of the voice frame according to the noise type of the additional phase noise and the frequency spectrum of the voice signal at each synthesis time point.
The decoder 701 is further configured to perform the following operation steps:
if the frame type of the voice frame is unvoiced, the noise type added with the phase noise is white noise;
if the frame type of the speech frame is voiced, the noise type of the additive phase noise is colored noise.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, it may be realized in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. The procedures or functions according to the embodiments of the present application are generated in whole or in part when the computer program instructions are loaded and executed on a computer. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wirelessly (e.g., infrared, radio, microwave). The computer-readable storage medium can be any available medium accessible by a computer, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
It should be noted that, for simplicity of description, the above method embodiments are described as a series of acts, but those skilled in the art will recognize that the present invention is not limited by the described order of acts, as some steps may be performed in other orders or concurrently in accordance with the invention. Further, those skilled in the art will appreciate that the embodiments described in this specification are preferred embodiments, and that the acts and modules involved are not necessarily required by the invention.
The speech encoding and decoding method and related apparatus provided by the embodiments of the present invention are described in detail above; specific examples are applied herein to explain the principles and implementations of the present invention, and the description of the above embodiments is only intended to help understand the method and its core idea. Meanwhile, for a person skilled in the art, there may be variations in the specific implementations and the application scope according to the idea of the present invention. In summary, the content of this specification should not be construed as limiting the present invention.

Claims (15)

1. A method of speech coding, the method comprising:
obtaining the fundamental tone frequency of a voice signal;
determining the analysis time of the voice signal according to the fundamental tone frequency;
calculating a time domain compensation window according to the analysis time and the fundamental tone frequency, and determining a time domain smooth spectrum of the voice signal according to the time domain compensation window;
calculating a frequency domain compensation window according to the time domain compensation window, a preset triangular window and the fundamental tone frequency;
determining a frequency domain smooth spectrum of the voice signal according to the time domain smooth spectrum, the preset triangular window and the frequency domain compensation window; and the processing server for receiving the voice signal subjected to coding processing jointly calculates a time domain signal of each frame of voice frame in the voice signal according to the frequency domain smooth spectrum and the frame type of each frame of voice frame in the voice signal after determining the frame type of each frame of voice frame in the voice signal, and superposes the time domain signal of each frame of voice frame to obtain the voice signal.
2. The method of claim 1, wherein the determining a time-domain smoothed spectrum of the speech signal according to the time-domain compensation window comprises:
multiplying the voice signal by the time domain compensation window, and calculating the voice signal after windowing;
performing Fourier transform on the windowed voice signal;
and calculating the square of the module of the windowed voice signal after the Fourier transform to obtain a time domain smooth spectrum of the voice signal.
3. The method of claim 1, wherein said determining an analysis time of said speech signal based on said pitch frequency comprises:
calculating the time length of an analysis window according to the fundamental tone frequency;
and taking the sum of the time lengths of the N analysis windows as the analysis time of the voice signal.
4. The method of claim 1, wherein the calculating a frequency-domain compensation window based on the time-domain compensation window, a preset triangular window, and the pitch frequency comprises:
calculating the convolution of the time domain compensation window and the preset triangular window to obtain a window function;
establishing an analysis matrix based on the window function, and determining an analysis vector according to the analysis matrix;
and calculating a frequency domain compensation window according to the analysis vector and the pitch frequency.
5. The method according to any one of claims 1-4, wherein the determining the frequency-domain smoothed spectrum of the speech signal according to the time-domain smoothed spectrum, the preset triangular window and the frequency-domain compensation window comprises:
and performing convolution operation on the time domain smooth spectrum, the preset triangular window and the frequency domain compensation window, and taking the result of the convolution operation as the frequency domain smooth spectrum.
6. A method for speech decoding, the method comprising:
acquiring a frequency spectrum of a voice signal; the frequency spectrum refers to a frequency domain smooth spectrum of the voice signal determined according to a time domain smooth spectrum, a preset triangular window and a frequency domain compensation window; the frequency domain compensation window is obtained by calculation based on a time domain compensation window, the preset triangular window and the fundamental tone frequency of the voice signal; the time domain smoothing spectrum is determined based on the time domain compensation window, and the time domain compensation window is obtained through calculation according to analysis time and the fundamental tone frequency; the analysis time is determined based on the pitch frequency;
determining a frame type of each frame of a voice frame in the voice signal;
determining a time domain signal of each frame of the voice frame in the voice signal according to the frequency spectrum of the voice signal and the frame type of each frame of the voice frame;
and superposing the time domain signals of each frame of the voice frame to obtain the voice signals.
7. The method of claim 6, wherein the determining the time-domain signal for each frame of speech frames in the speech signal based on the frequency spectrum of the speech signal and the frame type of the each frame of speech frames comprises:
determining the noise type of the additional phase noise according to the frame type of each frame of the voice frame;
and calculating the time domain signal of each frame of the voice frame according to the noise type of the additional phase noise and the frequency spectrum of the voice signal at each synthesis time point.
8. The method of claim 7, wherein determining a noise type for additive phase noise based on the frame type of each frame of speech comprises:
if the frame type of the voice frame is unvoiced, the noise type of the additive phase noise is white noise;
and if the frame type of the voice frame is voiced, the noise type of the additional phase noise is colored noise.
9. The method according to any one of claims 6-8, wherein the speech signal is given by the equation shown in
Figure FDA0003669270600000031
where the term shown in
Figure FDA0003669270600000032
denotes the sum of all pitch periods before the current time t_n, s_n(t − T) denotes the time-domain signal of the speech frame at time t − T, and f_0(t = t_n) denotes the pitch frequency at time t = t_n.
10. An apparatus for speech coding, the apparatus comprising:
the obtaining module is used for obtaining the fundamental tone frequency of the voice signal;
a processing module, configured to determine an analysis time of the speech signal according to the pitch frequency;
the processing module is further configured to calculate a time domain compensation window according to the analysis time and the pitch frequency, and determine a time domain smooth spectrum of the speech signal according to the time domain compensation window;
the processing module is further configured to calculate a frequency domain compensation window according to the time domain compensation window, a preset triangular window and the pitch frequency;
the processing module is further configured to determine a frequency domain smooth spectrum of the speech signal according to the time domain smooth spectrum, the preset triangular window, and the frequency domain compensation window; and the processing server for receiving the voice signal subjected to coding processing jointly calculates a time domain signal of each frame of voice frame in the voice signal according to the frequency domain smooth spectrum and the frame type of each frame of voice frame in the voice signal after determining the frame type of each frame of voice frame in the voice signal, and superposes the time domain signal of each frame of voice frame to obtain the voice signal.
11. The apparatus of claim 10, wherein the processing module is specifically configured to:
and performing convolution operation on the time domain smooth spectrum, the preset triangular window and the frequency domain compensation window, and taking the result of the convolution operation as the frequency domain smooth spectrum.
12. An apparatus for speech decoding, the apparatus comprising:
the acquisition module is used for acquiring the frequency spectrum of the voice signal; the frequency spectrum refers to a frequency domain smooth spectrum of the voice signal determined according to a time domain smooth spectrum, a preset triangular window and a frequency domain compensation window; the frequency domain compensation window is obtained by calculation based on a time domain compensation window, the preset triangular window and the fundamental tone frequency of the voice signal; the time domain smoothing spectrum is determined based on the time domain compensation window, and the time domain compensation window is obtained through calculation according to analysis time and the fundamental tone frequency; the analysis time is determined based on the pitch frequency;
the determining module is used for determining the frame type of each frame of voice frame in the voice signal;
the processing module is used for determining a time domain signal of each frame of voice frame in the voice signal according to the frequency spectrum of the voice signal and the frame type of each frame of voice frame; and superposing the time domain signals of each frame of the voice frame to obtain the voice signals.
13. The apparatus of claim 12, wherein the processing module is specifically configured to:
determining the noise type of the additional phase noise according to the frame type of each frame of the voice frame;
and calculating the time domain signal of each frame of the voice frame according to the noise type of the additional phase noise and the frequency spectrum of the voice signal at each synthesis time point.
14. The apparatus of claim 13, wherein the processing module is specifically configured to:
if the frame type of the voice frame is unvoiced, the noise type of the additive phase noise is white noise;
and if the frame type of the voice frame is voiced, the noise type of the additional phase noise is colored noise.
15. A computer-readable storage medium storing a plurality of instructions adapted to be loaded by a processor and to perform the method according to any one of claims 1 to 9.
CN201711008611.0A 2017-10-25 2017-10-25 Voice coding and decoding method and device Active CN109712632B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711008611.0A CN109712632B (en) 2017-10-25 2017-10-25 Voice coding and decoding method and device


Publications (2)

Publication Number Publication Date
CN109712632A CN109712632A (en) 2019-05-03
CN109712632B true CN109712632B (en) 2022-07-12

Family

ID=66252090


Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101272366A (en) * 2007-03-23 2008-09-24 联发科技股份有限公司 Signal generation device and its relevant method


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
SPEECH REPRESENTATION AND TRANSFORMATION USING ADAPTIVE; Hideki Kawahara; 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing; 2002-08-06 *
Wideband speech coding algorithm based on adaptive weighted spectral interpolation (基于自适应加权谱内插的宽带语音编码算法); Ling Zhenhua et al.; Journal of Data Acquisition and Processing (数据采集与处理); 2005-03-31; pp. 29-32, figures 1 and 4 *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant