CN109712632B - Voice coding and decoding method and device - Google Patents


Info

Publication number
CN109712632B
Authority
CN
China
Prior art keywords: frame, frequency, voice, window, time
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201711008611.0A
Other languages
Chinese (zh)
Other versions
CN109712632A (en
Inventor
袁豪磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Tencent Cloud Computing Beijing Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Tencent Cloud Computing Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd, Tencent Cloud Computing Beijing Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN201711008611.0A priority Critical patent/CN109712632B/en
Publication of CN109712632A publication Critical patent/CN109712632A/en
Application granted granted Critical
Publication of CN109712632B publication Critical patent/CN109712632B/en

Landscapes

  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

An embodiment of the invention discloses a voice coding method comprising the following steps: obtaining the pitch frequency of a voice signal; determining the analysis time of the voice signal according to the pitch frequency; calculating a time domain compensation window according to the analysis time and the pitch frequency, and determining a time domain smooth spectrum of the voice signal according to the time domain compensation window; calculating a frequency domain compensation window according to the time domain compensation window, a preset triangular window and the pitch frequency; and determining the frequency domain smooth spectrum of the voice signal according to the time domain smooth spectrum, the preset triangular window and the frequency domain compensation window. Embodiments of the invention also disclose a voice decoding method and related devices. Adopting the embodiments of the invention can improve the tone quality of synthesized voice.

Description

Voice coding and decoding method and device
Technical Field
The present invention relates to the field of speech processing technologies, and in particular, to a speech encoding and decoding method and apparatus.
Background
Text To Speech (TTS) is a technique that converts text information into speech output. The most widely used TTS schemes today are based on vocoder-based parametric speech synthesis, which converts predicted acoustic parameters into speech. Early TTS methods relied on developers manually designing synthesis rules for specific languages. As technology developed, multimedia data pairing text with speech became increasingly easy to obtain, making it possible to implement TTS with large amounts of text-audio paired data (a "corpus") as training data. Modern TTS systems synthesize speech by combining large corpora with machine learning methods: an algorithm learns the internal mapping from text to speech from real data (the corpus) and, when new text is input, converts it into speech according to the learned mapping, thereby avoiding dependence on manually designed rules.
With continuing advances in technology and ever-growing corpus data, parametric TTS systems now mainly use deep neural networks for acoustic modeling, and the accuracy and generalization capability of acoustic models have improved greatly. The main factor affecting the tone quality of TTS synthesis is therefore no longer quality loss in the model training stage, but the quality of the vocoder's coding and decoding.
In prior art solutions, the speech waveform is generally decomposed into pitch frequency and spectrum features, which are then output to an acoustic model for modeling. Since speech is a time-continuous signal, it generally must be sliced into frames over time and the frames input to a vocoder for coding, so as to obtain the pitch frequency and spectrum. The fixed frame length is generally between 5 ms and 30 ms. However, this fixed framing approach results in significant degradation of synthesized speech quality.
Disclosure of Invention
Embodiments of the invention provide a voice coding and decoding method and device that can solve the problem of poor tone quality in synthesized voice.
The invention provides a voice coding method in a first aspect, which comprises the following steps:
obtaining the fundamental tone frequency of a voice signal;
determining the analysis time of the voice signal according to the fundamental tone frequency;
calculating the time domain compensation window according to the analysis time and the fundamental tone frequency, and determining a time domain smooth spectrum of the voice signal according to the time domain compensation window;
calculating a frequency domain compensation window according to the time domain compensation window, a preset triangular window and the fundamental tone frequency;
and determining the frequency domain smooth spectrum of the voice signal according to the time domain smooth spectrum, the preset triangular window and the frequency domain compensation window.
Wherein the determining a time-domain smoothed spectrum of the speech signal according to the time-domain compensation window comprises:
multiplying the voice signal by the time domain compensation window, and calculating the voice signal after windowing;
performing Fourier transform on the windowed voice signal;
and calculating the square of the module of the windowed voice signal after the Fourier transform to obtain a time domain smooth spectrum of the voice signal.
Wherein the determining an analysis time of the speech signal according to the pitch frequency comprises:
calculating the duration of an analysis window according to the fundamental tone frequency;
and taking the sum of the time lengths of the N analysis windows as the analysis time of the voice signal.
Wherein, according to the time domain compensation window, a preset triangular window and the pitch frequency, calculating a frequency domain compensation window comprises:
calculating the convolution of the time domain compensation window and the preset triangular window to obtain a window function;
establishing an analysis matrix based on the window function, and determining an analysis vector according to the analysis matrix;
and calculating a frequency domain compensation window according to the analysis vector and the pitch frequency.
Wherein, the determining the frequency domain smooth spectrum of the speech signal according to the time domain smooth spectrum, the preset triangular window and the frequency domain compensation window comprises:
and performing convolution operation on the time domain smooth spectrum, the preset triangular window and the frequency domain compensation window, and taking the result of the convolution operation as the frequency domain smooth spectrum.
A second aspect of the present invention provides a speech decoding method, including:
acquiring a frequency spectrum of a voice signal;
determining a frame type of each frame of a voice frame in the voice signal;
determining a time domain signal of each frame of the voice frame in the voice signal according to the frequency spectrum of the voice signal and the frame type of each frame of the voice frame;
and superposing the time domain signals of each frame of the voice frame to obtain the voice signals.
Wherein the determining the time domain signal of each frame of the speech frame in the speech signal according to the frequency spectrum of the speech signal and the frame type of each frame of the speech frame comprises:
determining the noise type of the additive phase noise according to the frame type of each frame of the voice frame;
and calculating the time domain signal of each frame of the voice frame according to the noise type of the additive phase noise and the frequency spectrum of the voice signal at each synthesis time point.
Wherein the determining the noise type of the additive phase noise according to the frame type of each frame of the voice frame comprises:
if the frame type of the voice frame is unvoiced, the noise type of the additive phase noise is white noise;
and if the frame type of the voice frame is voiced, the noise type of the additive phase noise is colored noise.
Wherein the speech signal s(t) = Σ_n s_n(t − T_n), where T_n = Σ_{k=1}^{n−1} 1/f_0(t = t_k) denotes the sum of all pitch periods before the current time t_n, s_n(t − T_n) represents the time domain signal of the speech frame at time t − T_n, and f_0(t = t_n) represents the pitch frequency at time t = t_n.
Accordingly, a third aspect of the present invention provides a speech encoding apparatus comprising:
the obtaining module is used for obtaining the fundamental tone frequency of the voice signal;
a processing module, configured to determine an analysis time of the speech signal according to the pitch frequency;
the processing module is further configured to calculate the time domain compensation window according to the analysis time and the pitch frequency, and determine a time domain smooth spectrum of the speech signal according to the time domain compensation window;
the processing module is further configured to calculate a frequency domain compensation window according to the time domain compensation window, a preset triangular window and the pitch frequency;
the processing module is further configured to determine a frequency domain smoothed spectrum of the speech signal according to the time domain smoothed spectrum, the preset triangular window, and the frequency domain compensation window.
Wherein the processing module is specifically configured to: multiplying the voice signal by the time domain compensation window, and calculating the voice signal after windowing; performing Fourier transform on the windowed voice signal; and calculating the square of the module of the windowed voice signal after the Fourier transform to obtain a time domain smooth spectrum of the voice signal.
Wherein the processing module is specifically configured to: calculating the duration of an analysis window according to the fundamental tone frequency; and taking the sum of the time lengths of the N analysis windows as the analysis time of the voice signal.
Wherein the processing module is specifically configured to: calculating the convolution of the time domain compensation window and the preset triangular window to obtain a window function;
establishing an analysis matrix based on the window function, and determining an analysis vector according to the analysis matrix;
and calculating a frequency domain compensation window according to the analysis vector and the fundamental tone frequency.
Wherein the processing module is specifically configured to:
and performing convolution operation on the time domain smooth spectrum, the preset triangular window and the frequency domain compensation window, and taking the result of the convolution operation as the frequency domain smooth spectrum.
Accordingly, a fourth aspect of the present invention provides a speech decoding apparatus comprising:
the acquisition module is used for acquiring the frequency spectrum of the voice signal;
the determining module is used for determining the frame type of each frame of voice frame in the voice signal;
the processing module is used for determining a time domain signal of each frame of voice frame in the voice signal according to the frequency spectrum of the voice signal and the frame type of each frame of voice frame; and superposing the time domain signals of each frame of the voice frame to obtain the voice signals.
Wherein the processing module is specifically configured to:
determining the noise type of the additive phase noise according to the frame type of each frame of the voice frame;
and calculating the time domain signal of each frame of the voice frame according to the noise type of the additive phase noise and the frequency spectrum of the voice signal at each synthesis time point.
Wherein the processing module is specifically configured to:
if the frame type of the voice frame is unvoiced, the noise type of the additive phase noise is white noise;
and if the frame type of the voice frame is voiced, the noise type of the additive phase noise is colored noise.
Wherein the speech signal s(t) = Σ_n s_n(t − T_n), where T_n = Σ_{k=1}^{n−1} 1/f_0(t = t_k) denotes the sum of all pitch periods before the current time t_n, s_n(t − T_n) represents the time domain signal of the speech frame at time t − T_n, and f_0(t = t_n) represents the pitch frequency at time t = t_n.
A fifth aspect of the present invention provides a speech encoding apparatus comprising an interface circuit, a memory, and a processor, wherein the memory stores a set of program codes therein, and the processor is configured to call the program codes stored in the memory to perform the following operations:
obtaining a fundamental tone frequency of a voice signal;
determining the analysis time of the voice signal according to the fundamental tone frequency;
calculating the time domain compensation window according to the analysis time and the fundamental tone frequency, and determining a time domain smooth spectrum of the voice signal according to the time domain compensation window;
calculating a frequency domain compensation window according to the time domain compensation window, a preset triangular window and the fundamental tone frequency;
and determining the frequency domain smooth spectrum of the voice signal according to the time domain smooth spectrum, the preset triangular window and the frequency domain compensation window.
A sixth aspect of the present invention provides a speech decoding device comprising an interface circuit, a memory, and a processor, wherein the memory stores a set of program codes therein, and the processor is configured to call the program codes stored in the memory to perform the following operations:
acquiring a frequency spectrum of a voice signal;
determining a frame type of each frame of a voice frame in the voice signal;
determining a time domain signal of each frame of the voice frame in the voice signal according to the frequency spectrum of the voice signal and the frame type of each frame of the voice frame;
and superposing the time domain signals of each frame of the voice frame to obtain the voice signals.
Yet another aspect of the present application provides a computer-readable storage medium having stored thereon a plurality of instructions adapted to be loaded by a processor and to perform the method of the above-described aspects.
Yet another aspect of the present application provides a computer program product containing instructions which, when run on a computer, cause the computer to perform the method of the above-described aspects.
In the embodiment of the invention, the pitch frequency of a voice signal is first obtained; the analysis time of the voice signal is determined according to the pitch frequency; a time domain compensation window is then calculated according to the analysis time and the pitch frequency, and a time domain smooth spectrum of the voice signal is determined according to the time domain compensation window; next, a frequency domain compensation window is calculated according to the time domain compensation window, a preset triangular window and the pitch frequency; and finally, the frequency domain smooth spectrum of the voice signal is determined according to the time domain smooth spectrum, the preset triangular window and the frequency domain compensation window. Time domain aliasing between adjacent time points is eliminated through the time domain compensation window, and frequency domain aliasing between adjacent frequency points is eliminated through the frequency domain compensation window, thereby improving the coding and decoding quality of the voice and the tone quality of voice synthesis.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a schematic diagram of a speech synthesis system according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a speech encoding method according to an embodiment of the present invention;
FIG. 3 is a flowchart illustrating a speech decoding method according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of a speech encoding apparatus according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of a speech decoding apparatus according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of a speech encoding apparatus according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram of a speech decoding apparatus according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
TTS technology already has practical product applications in news and novel synthesis, for example "Tencent News" and "Penguin FM": a piece of news or fiction text is input and synthesized audio is output. In addition, TTS can be combined with artificial intelligence technologies (speech recognition, human-computer dialogue) to enable spoken dialogue between machines and people. As shown in fig. 1, fig. 1 is a schematic diagram of a speech synthesis system according to an embodiment of the present invention. The speech synthesis system comprises user equipment 101 and a processing server 102; the user equipment may include mobile phone terminals, smart speakers, and other human-computer interaction devices. A user can send a voice signal to the user equipment 101, for example to query the weather or a news topic. After receiving the voice signal, the user equipment 101 encodes it and sends the encoded signal to the processing server 102. After receiving the signal, the processing server 102 decodes it, performs speech recognition, obtains the user's query result, and returns the result to the user equipment 101, which displays it and broadcasts it to the user by voice. To address the problem of low quality in synthesized speech, an improved coding and decoding scheme for speech synthesis is proposed below.
Referring to fig. 2, fig. 2 is a flowchart illustrating a speech encoding method according to an embodiment of the present invention. The method in the embodiment of the invention comprises the following steps:
s201, obtaining a fundamental tone frequency of the voice signal. The pitch frequency is the reciprocal of the pitch period.
In a specific implementation, the pitch frequency describes the periodic vibration of the vocal cords in the larynx during voicing; it appears as short-time periodicity in the time domain waveform and as slow temporal variation of the instantaneous frequency in the frequency domain. The pitch frequency can therefore be extracted from the time domain and frequency domain characteristics of the voice signal, for example via time domain waveform correlation or frequency domain spectral correlation. On the time domain waveform, the pitch period can be determined by comparing the similarity between the original signal and shifted copies of itself. In the frequency domain, the non-sinusoidal periodicity of the voice signal appears as a harmonic structure, and the pitch frequency is the spacing between adjacent harmonics.
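As an illustrative sketch of the time domain correlation approach described above, the following Python snippet estimates the pitch frequency by comparing the signal with shifted copies of itself and picking the lag with the highest correlation (the function name, search bounds, and simple peak-picking are assumptions for illustration, not the patent's implementation):

```python
import math

def estimate_pitch(signal, fs, fmin=40.0, fmax=500.0):
    """Pick the lag whose shifted copy best matches the signal;
    that lag is the pitch period, and fs / lag is the pitch frequency."""
    lag_min = int(fs / fmax)              # shortest candidate period, in samples
    lag_max = int(fs / fmin)              # longest candidate period, in samples
    n = len(signal)
    best_lag, best_r = lag_min, float("-inf")
    for lag in range(lag_min, min(lag_max, n // 2) + 1):
        # correlation between the signal and its copy shifted by `lag`
        r = sum(signal[i] * signal[i + lag] for i in range(n - lag))
        if r > best_r:
            best_r, best_lag = r, lag
    return fs / best_lag                  # pitch frequency in Hz

fs = 8000
sig = [math.sin(2 * math.pi * 200 * t / fs) for t in range(800)]
f0 = estimate_pitch(sig, fs)              # close to 200 Hz for a 200 Hz tone
```

The 40–500 Hz search band mirrors the voiced-frame frequency thresholds mentioned later in the decoding embodiment.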
The frame types of the speech frames in the voice signal include unvoiced and voiced. Voiced and unvoiced sounds differ greatly in their production characteristics: a voiced signal has significant periodicity caused by vocal cord vibration, while an unvoiced signal does not.
S202, determining the analysis time of the voice signal according to the fundamental tone frequency.
In a specific implementation, the duration of an analysis window can be calculated according to the pitch frequency, and the sum of the durations of the N analysis windows is taken as the analysis time of the voice signal. For example, taking each analysis window to span one pitch period, the analysis time T = Σ_{n=1}^{N} 1/f_n, where f_n is the pitch frequency of the n-th frame and N is the frame number of the speech signal.
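Assuming, as in the example above, that each analysis window spans one pitch period 1/f_n, the analysis time can be sketched in Python as:

```python
def analysis_time(pitch_hz):
    """Sum of the N per-frame analysis-window durations, where each
    window is assumed (for illustration) to last one pitch period 1/f_n."""
    return sum(1.0 / f for f in pitch_hz)

T = analysis_time([200.0, 250.0, 200.0])  # three frames -> 5 ms + 4 ms + 5 ms
```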
S203, calculating a time domain compensation window according to the analysis time and the fundamental tone frequency, and determining a time domain smooth spectrum of the voice signal according to the time domain compensation window.
In a particular implementation, the time domain compensation window
Figure BDA0001445001740000072
wherein the analysis time
Figure BDA0001445001740000073
f_n is the pitch frequency, and * denotes convolution.
In addition, the voice signal can be multiplied by the time domain compensation window to obtain the windowed voice signal; a Fourier transform is then applied to the windowed signal; and finally the square of the modulus of the transformed signal gives the time domain smooth spectrum of the voice signal. For example, the time domain smooth spectrum S_t = |F(s_t · w_t)|², where s_t is the voice signal, w_t is the time domain compensation window, and F denotes the Fourier transform.
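The three steps above (windowing, Fourier transform, squared modulus) can be sketched directly in Python; a Hann window stands in for the time domain compensation window, and a plain O(N²) DFT is used for clarity:

```python
import cmath
import math

def power_spectrum(frame, window):
    """|F{s_t . w_t}|^2: multiply the frame by the window, take the
    DFT, and square the modulus of each bin."""
    x = [s * w for s, w in zip(frame, window)]
    n = len(x)
    spec = []
    for k in range(n):
        xk = sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        spec.append(abs(xk) ** 2)
    return spec

n = 8
frame = [math.cos(2 * math.pi * t / n) for t in range(n)]
hann = [0.5 - 0.5 * math.cos(2 * math.pi * t / n) for t in range(n)]
spec = power_spectrum(frame, hann)   # real, non-negative, symmetric bins
```

For a real input frame the resulting spectrum is non-negative and conjugate-symmetric, matching what a smoothed power spectrum should look like.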
And S204, calculating a frequency domain compensation window according to the time domain compensation window, a preset triangular window and the fundamental tone frequency.
In a specific implementation, a convolution of a time domain compensation window and a preset triangular window can be calculated to obtain a window function; then establishing an analysis matrix based on the window function, and determining an analysis vector according to the analysis matrix; and finally, calculating a frequency domain compensation window according to the analysis vector and the fundamental tone frequency.
For example, a preset triangular window A(ω) may be convolved with the time domain compensation window W(ω) to obtain a window function AW(ω), where A(ω) is the preset triangular window and W(ω) is the time domain compensation window, both in frequency-domain form. Shifting AW(ω) yields AW((i + j)ω_0), and the analysis matrix H = [H_ij] is defined by H_ij = AW((i + j)ω_0) with i ∈ [−M, M] and j ∈ [−N, N], so that H is a matrix of 2M + 1 rows and 2N + 1 columns and H_ij is the element in row i and column j. After the analysis matrix is computed, the analysis vector U = (HᵀH)⁻¹Hᵀδ, where δ is a vector of length 2M + 1 representing the desired frequency response: for an ideal frequency domain compensation window, δ has a response of 1 at the center frequency point and 0 at the other frequency points, that is, δ = [δ_{−M}, …, δ_0, …, δ_M] = [0, …, 0, 1, 0, …, 0]. Finally, once the vector U has been obtained, the frequency domain compensation window is calculated as
Figure BDA0001445001740000081
where u_k denotes the k-th element of the vector U.
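The least-squares step U = (HᵀH)⁻¹Hᵀδ can be sketched with NumPy. The window function `AW` below is a hypothetical triangular stand-in (the patent's actual AW(ω) comes from the convolution above); it is used only to build an analysis matrix of the stated shape:

```python
import numpy as np

# Least-squares design of the analysis vector U = (H^T H)^{-1} H^T delta:
# delta asks for response 1 at the centre frequency bin and 0 elsewhere.
M, N = 3, 2
omega0 = 1.0

def AW(w):
    # Hypothetical stand-in for the window function AW(omega): a
    # triangular response, positive over the sampled range, for illustration.
    return max(0.0, 1.0 - abs(w) / ((M + N) * omega0 + 1.0))

H = np.array([[AW((i + j) * omega0)
               for j in range(-N, N + 1)]
              for i in range(-M, M + 1)])        # (2M+1) x (2N+1)
delta = np.zeros(2 * M + 1)
delta[M] = 1.0                                   # 1 at the centre bin, 0 elsewhere

U, *_ = np.linalg.lstsq(H, delta, rcond=None)    # solves (H^T H) U = H^T delta
```

`np.linalg.lstsq` returns the same minimum-norm least-squares solution as the closed-form normal equations when H has full column rank.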
And S205, determining the frequency domain smooth spectrum of the voice signal according to the time domain smooth spectrum, the preset triangular window and the frequency domain compensation window.
In a specific implementation, a convolution operation can be performed on the time domain smooth spectrum, the preset triangular window and the frequency domain compensation window, and the result of the convolution operation taken as the frequency domain smooth spectrum. For example, S_f = S_t * W_A * W_f, where S_f is the frequency domain smooth spectrum, S_t is the time domain smooth spectrum, W_A is the preset triangular window in the form of a time-domain representation, W_f is the frequency domain compensation window, and * denotes convolution.
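A minimal sketch of the cascaded convolution S_f = S_t * W_A * W_f, with toy sequences standing in for the actual spectrum and windows:

```python
def convolve(a, b):
    """Full linear convolution of two sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out

S_t = [0.0, 1.0, 4.0, 1.0, 0.0]   # toy time domain smooth spectrum
W_A = [0.25, 0.5, 0.25]           # toy triangular window (sums to 1)
W_f = [1.0]                        # identity compensation window for the sketch
S_f = convolve(convolve(S_t, W_A), W_f)
```

Because both windows sum to 1 here, the convolution redistributes energy across neighbouring bins without changing the total, which is the smoothing behaviour the method relies on.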
In the embodiment of the invention, after encoding the voice signal sent by the user in the manner described above, the user equipment sends the encoded signal to the processing server. The processing server decodes the received signal, performs voice recognition on it, queries the corresponding text information, and returns it to the user equipment, which prompts the user by voice broadcast. Because this coding scheme is used in the speech synthesis process, a more stable smooth spectrum is obtained, making speech synthesis more accurate.
In the embodiment of the invention, the pitch frequency of a voice signal is first obtained; the analysis time of the voice signal is determined according to the pitch frequency; a time domain compensation window is then calculated according to the analysis time and the pitch frequency, and a time domain smooth spectrum of the voice signal is determined according to the time domain compensation window; next, a frequency domain compensation window is calculated according to the time domain compensation window, a preset triangular window and the pitch frequency; and finally, the frequency domain smooth spectrum of the voice signal is determined according to the time domain smooth spectrum, the preset triangular window and the frequency domain compensation window. Time domain aliasing between adjacent time points is eliminated through the time domain compensation window, and frequency domain aliasing between adjacent frequency points is eliminated through the frequency domain compensation window, thereby improving the coding and decoding quality of voice and the tone quality of voice synthesis.
Referring to fig. 3, fig. 3 is a flowchart illustrating a speech decoding method according to an embodiment of the present invention. The method in the embodiment of the invention comprises the following steps:
s301, acquiring the frequency spectrum of the voice signal.
The embodiment of the present invention is the inverse process of the previous embodiment: in the previous embodiment the voice signal was encoded to extract its pitch frequency and obtain its spectrum; in this embodiment the voice signal is reconstructed from them.
S302, determining the frame type of each frame of the voice frame in the voice signal. The frame types of the voice frame include unvoiced and voiced. Voiced and unvoiced sounds differ greatly in their production characteristics, and different frame types call for different decoding modes: a voiced signal has a pronounced periodicity caused by vocal cord vibration, while an unvoiced signal does not.
In a specific implementation, the frequency of each frame of the voice frame may be obtained and compared with two thresholds: if the frequency of the voice frame is greater than a first preset threshold and less than a second preset threshold, the frame type of the voice frame is voiced; if the frequency is less than the first preset threshold or greater than the second preset threshold, the frame type is unvoiced. The second preset threshold is greater than the first preset threshold; the first preset threshold includes but is not limited to 40 Hz, and the second preset threshold includes but is not limited to 500 Hz.
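The threshold test can be sketched as a small helper (the 40 Hz and 500 Hz defaults follow the example thresholds above):

```python
def frame_type(pitch_hz, lo=40.0, hi=500.0):
    """Classify a frame as voiced when its pitch frequency falls strictly
    inside (lo, hi); frequencies outside that band are treated as unvoiced."""
    return "voiced" if lo < pitch_hz < hi else "unvoiced"

types = [frame_type(f) for f in (220.0, 30.0, 800.0)]
```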
S303, calculating a time domain signal of each frame of the voice frame in the voice signal according to the frequency spectrum of the voice signal and the frame type of each frame of the voice frame.
In specific implementation, the noise type of the additive phase noise can be determined according to the frame type of each frame of the voice frame; then, the time domain signal of each frame of speech frame is calculated according to the noise type of the additive phase noise and the frequency spectrum of the speech signal at each synthesis time point.
Further, if the frame type of the voice frame is unvoiced, the noise type of the additive phase noise is white noise; if the frame type of the voice frame is voiced, the noise type of the additive phase noise is colored noise.
For example, the reconstructed speech signal of each frame
Figure BDA0001445001740000091
where S(ω, t_n) denotes the spectrum of the voice signal at synthesis time point t_n, and φ(ω) is the additive phase noise. If the frame type of the voice frame is unvoiced, φ(ω) is Gaussian white noise; if the frame type is voiced, φ(ω) is colored noise.
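A sketch of choosing the additive phase noise by frame type: Gaussian white noise for unvoiced frames, and for voiced frames a first-order low-pass smoothing of white noise as an illustrative stand-in for colored noise (the text does not specify the coloring filter, so this filter is an assumption):

```python
import random

def phase_noise(frame_is_voiced, n, seed=0):
    """Additive phase noise phi(omega): Gaussian white noise for unvoiced
    frames; for voiced frames, a crudely 'colored' variant obtained by
    first-order low-pass smoothing of the white noise."""
    rng = random.Random(seed)
    white = [rng.gauss(0.0, 1.0) for _ in range(n)]
    if not frame_is_voiced:
        return white
    colored, prev = [], 0.0
    for w in white:
        prev = 0.9 * prev + 0.1 * w   # emphasises low frequencies
        colored.append(prev)
    return colored

white = phase_noise(frame_is_voiced=False, n=200)
colored = phase_noise(frame_is_voiced=True, n=200)
```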
S304, the time domain signals of each frame of the voice frame are superposed to obtain voice signals.
In a specific implementation, the voice signal s(t) = Σ_n s_n(t − T_n), where T_n = Σ_{k=1}^{n−1} 1/f_0(t = t_k) denotes the sum of all pitch periods before the current time t_n, s_n(t − T_n) represents the time domain signal of the voice frame at time t − T_n, and f_0(t = t_n) represents the pitch frequency at time t = t_n.
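The superposition can be sketched as pitch-synchronous overlap-add: each frame is placed at an offset equal to the sum of all previous pitch periods and the overlapping samples are added (frame contents and pitch values here are toy data for illustration):

```python
def overlap_add(frames, pitch_hz, fs):
    """Place frame n at offset t_n = sum of the previous pitch periods
    (in samples) and add overlapping samples together."""
    offsets, t = [], 0.0
    for f0 in pitch_hz:
        offsets.append(int(round(t * fs)))
        t += 1.0 / f0                 # advance by one pitch period
    total = offsets[-1] + len(frames[-1])
    out = [0.0] * total
    for frame, off in zip(frames, offsets):
        for i, s in enumerate(frame):
            out[off + i] += s
    return out

fs = 8000
frames = [[1.0] * 60, [1.0] * 60]     # two toy 60-sample frames
sig = overlap_add(frames, [200.0, 200.0], fs)   # 200 Hz -> 40-sample hop
```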
In the embodiment of the invention, after receiving the voice signal, the processing server decodes it in the manner described above, performs voice recognition on the decoded signal, searches the corresponding text information from a voice library, and sends the text information to the user equipment, which synthesizes it into voice information and broadcasts it to the user. Because this decoding scheme is adopted, the accuracy of voice recognition is improved and voice synthesis is more accurate.
In the embodiment of the invention, the time domain signal of each frame of voice frame is calculated according to the frequency spectrum of the voice signal at each synthesis time point, and then the time domain signal of each frame of voice frame is superposed to obtain the reconstructed voice signal.
As shown in fig. 4, fig. 4 is a schematic structural diagram of a speech encoding apparatus according to an embodiment of the present invention. As shown in the figures, the apparatus in the embodiment of the present invention includes:
an obtaining module 401 is configured to obtain a pitch frequency of the speech signal.
In a specific implementation, the pitch frequency describes the periodic vibration of the vocal cords during voicing; it appears as short-time periodicity in the time-domain waveform and as temporal variation of the instantaneous frequency in the frequency domain. The pitch frequency can therefore be extracted from the time-domain and frequency-domain characteristics of the speech signal, for example via time-domain waveform correlation or frequency-domain spectral correlation. In the time domain, the pitch period can be determined by comparing the similarity between the original signal and its shifted copies. In the frequency domain, the non-sinusoidal periodicity of the speech signal appears as a harmonic structure, and the pitch frequency is the spacing between adjacent harmonics.
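The time-domain approach above — comparing the signal with its shifted copies — can be sketched as a minimal autocorrelation pitch estimator. The function name and the sample values in the example are illustrative, not from the patent:

```python
import numpy as np

def estimate_pitch(signal, fs, fmin=40.0, fmax=500.0):
    """Estimate the pitch frequency via time-domain autocorrelation:
    the lag maximizing the correlation between the signal and its
    shifted copy, within a plausible pitch range, gives the period."""
    sig = signal - np.mean(signal)
    corr = np.correlate(sig, sig, mode="full")[len(sig) - 1:]
    lo, hi = int(fs / fmax), int(fs / fmin)  # lag range for fmax..fmin
    lag = lo + np.argmax(corr[lo:hi])
    return fs / lag
```

For a clean 200 Hz sine sampled at 8 kHz, the correlation peaks at a lag of 40 samples, giving 8000/40 = 200 Hz.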
The frame types of the speech frames in the speech signal include unvoiced and voiced. Voiced and unvoiced sounds differ greatly in their production characteristics: a voiced signal exhibits pronounced periodicity caused by vocal-cord vibration, while an unvoiced signal does not.
A processing module 402, configured to determine an analysis time of the speech signal according to the pitch frequency.
In a specific implementation, the duration of each analysis window can be calculated from the pitch frequency, and the sum of the durations of the N analysis windows is taken as the analysis time of the speech signal. For example, the analysis time is given by the equation shown in
Figure BDA0001445001740000101
where f_n is the pitch frequency and N is the frame number of the speech signal.
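Under one reading of the pictured equation — each analysis window lasting one pitch period 1/f_n, which is an assumption made for this sketch — the analysis time could be accumulated as:

```python
def analysis_time(pitch_freqs):
    """Sum of the N per-frame analysis-window durations, taking each
    window's duration as one pitch period 1/f_n (assumed relation)."""
    return sum(1.0 / f for f in pitch_freqs)
```

So for two frames with pitch frequencies 100 Hz and 200 Hz, the analysis time would be 0.010 s + 0.005 s = 0.015 s.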
The processing module 402 is further configured to calculate a time domain compensation window according to the analysis time and the pitch frequency, and determine a time domain smoothed spectrum of the speech signal according to the time domain compensation window. In a specific implementation, the time domain compensation window is given by the equation shown in
Figure BDA0001445001740000111
where the auxiliary term is shown in
Figure BDA0001445001740000112
f_n is the pitch frequency, and * denotes convolution.
In addition, the speech signal can be multiplied by the time domain compensation window to obtain the windowed speech signal; the windowed speech signal is then Fourier-transformed, and the squared modulus of the transform is taken to obtain the time domain smoothed spectrum of the speech signal. For example, the time domain smoothed spectrum is given by the equation shown in
Figure BDA0001445001740000113
where s_t is the speech signal, w_t is the time domain compensation window, and the operator shown in
Figure BDA0001445001740000114
denotes the Fourier transform.
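The three steps above (windowing, Fourier transform, squared modulus) can be sketched directly. The window passed in stands in for the time-domain compensation window; the Hanning window in the test is only a placeholder:

```python
import numpy as np

def time_domain_smoothed_spectrum(s_t, w_t):
    """Multiply the signal by the time-domain compensation window,
    Fourier-transform, and take the squared modulus |F{s_t * w_t}|^2."""
    return np.abs(np.fft.rfft(s_t * w_t)) ** 2
```

The result is a non-negative power-spectrum-like quantity, one value per frequency bin.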
The processing module 402 is further configured to calculate a frequency domain compensation window according to the time domain compensation window, a preset triangular window, and the pitch frequency.
In a specific implementation, a convolution of a time domain compensation window and a preset triangular window can be calculated to obtain a window function; then establishing an analysis matrix based on the window function, and determining an analysis vector according to the analysis matrix; and finally, calculating a frequency domain compensation window according to the analysis vector and the fundamental tone frequency.
For example, a preset triangular window A(ω) may be convolved with the time domain compensation window W(ω) to obtain a window function AW(ω), where A(ω) is the preset triangular window in frequency-domain form and W(ω) is the time domain compensation window in frequency-domain form. Substituting (i + j)ω₀ into AW(ω) gives AW((i + j)ω₀), and the analysis matrix is H = [H_ij] with H_ij = AW((i + j)ω₀), i ∈ [−M, M], j ∈ [−N, N], so that H is a matrix of 2M + 1 rows and 2N + 1 columns and H_ij is the element in row i and column j. After the analysis matrix is calculated, the analysis vector U = (HᵀH)⁻¹Hᵀδ is computed, where δ is a vector of length 2M + 1 representing the desired frequency response: for an ideal frequency domain compensation window, δ has a response of 1 at the centre frequency point and 0 at the other frequency points, i.e. δ = [δ₋M, …, δ₀, …, δ_M] = [0, …, 0, 1, 0, …, 0]. Finally, after the vector U is obtained, the frequency domain compensation window is calculated as shown in
Figure BDA0001445001740000115
where u_k denotes the kth element of the vector U.
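The least-squares construction of the analysis vector — H built from AW((i + j)ω₀) and U = (HᵀH)⁻¹Hᵀδ — can be sketched as below. The window function AW passed in is a placeholder (the tests use a sinc), since the patent's actual AW(ω) comes from the convolution described above:

```python
import numpy as np

def analysis_vector(AW, omega0, M, N):
    """Build H_ij = AW((i + j) * omega0) for i in [-M, M], j in [-N, N]
    and solve the least-squares problem U = (H^T H)^{-1} H^T delta,
    where delta is 1 at the centre frequency point and 0 elsewhere."""
    i = np.arange(-M, M + 1)[:, None]
    j = np.arange(-N, N + 1)[None, :]
    H = AW((i + j) * omega0)          # (2M+1) x (2N+1) analysis matrix
    delta = np.zeros(2 * M + 1)
    delta[M] = 1.0                    # desired ideal frequency response
    U, *_ = np.linalg.lstsq(H, delta, rcond=None)
    return U
```

`np.linalg.lstsq` returns the same minimiser as the normal-equation form (HᵀH)⁻¹Hᵀδ while being numerically better conditioned.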
The processing module 402 is further configured to determine a frequency domain smoothed spectrum of the speech signal according to the time domain smoothed spectrum, a preset triangular window, and a frequency domain compensation window.
In a specific implementation, a convolution operation can be performed on the time domain smoothed spectrum, the preset triangular window, and the frequency domain compensation window, and the result of the convolution taken as the frequency domain smoothed spectrum. For example, S_f = S_t * W_A * W_f, where S_f is the frequency domain smoothed spectrum, S_t is the time domain smoothed spectrum, W_A is the preset triangular window in time-domain form, W_f is the frequency domain compensation window, and * denotes convolution.
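The final smoothing step S_f = S_t * W_A * W_f amounts to two successive convolutions. A minimal sketch, using same-length convolution (the boundary handling is an implementation choice, not specified by the text):

```python
import numpy as np

def frequency_domain_smoothed_spectrum(S_t, W_A, W_f):
    """Convolve the time-domain smoothed spectrum with the preset
    triangular window and then with the frequency-domain
    compensation window: S_f = S_t * W_A * W_f."""
    return np.convolve(np.convolve(S_t, W_A, mode="same"), W_f, mode="same")
```

With an impulse spectrum and a small triangular window, the output is simply the triangular window centred on the impulse, which makes the smoothing effect easy to see.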
In the embodiment of the invention, after the user equipment encodes the speech signal uttered by the user according to the above encoding mode, it sends the encoded speech signal to the processing server. After receiving the speech signal, the processing server decodes it, performs speech recognition on the decoded signal, queries the corresponding text information, and returns the text information to the user equipment, which prompts the user by voice broadcast. Because this encoding mode is adopted in the speech synthesis process, a more stable smoothed spectrum is obtained and speech synthesis is more accurate.
In the embodiment of the invention, the pitch frequency of the speech signal is first obtained, and the analysis time of the speech signal is determined according to the pitch frequency. A time domain compensation window is then calculated according to the analysis time and the pitch frequency, and the time domain smoothed spectrum of the speech signal is determined according to the time domain compensation window. Next, a frequency domain compensation window is calculated according to the time domain compensation window, the preset triangular window and the pitch frequency. Finally, the frequency domain smoothed spectrum of the speech signal is determined according to the time domain smoothed spectrum, the preset triangular window and the frequency domain compensation window. Time-domain aliasing between adjacent time points is eliminated by the time domain compensation window, and frequency-domain aliasing between adjacent frequency points is eliminated by the frequency domain compensation window, thereby improving the speech coding and decoding quality and the sound quality of synthesized speech.
Referring to fig. 5, fig. 5 is a schematic structural diagram of a speech decoding apparatus according to an embodiment of the present invention. The device in the embodiment of the invention comprises:
an obtaining module 501 is configured to obtain a frequency spectrum of the voice signal.
A determining module 502, configured to determine a frame type of each frame of a speech frame in a speech signal.
The frame types of the speech frame include unvoiced and voiced. Voiced and unvoiced sounds differ greatly in their production characteristics, and speech frames of different frame types are decoded differently. The key difference is that a voiced signal exhibits pronounced periodicity caused by vocal-cord vibration.
In a specific implementation, the frequency of each speech frame may be obtained and compared against two thresholds: if the frequency of the speech frame is greater than a first preset threshold and less than a second preset threshold, the frame type of the speech frame is voiced; if the frequency is less than the first preset threshold or greater than the second preset threshold, the frame type is unvoiced. The second preset threshold is greater than the first preset threshold; the first preset threshold includes but is not limited to 40 Hz, and the second preset threshold includes but is not limited to 500 Hz.
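The threshold test above reduces to a small predicate, using the 40 Hz and 500 Hz example thresholds given in the text (the function name is illustrative):

```python
def classify_frame(frequency_hz, low=40.0, high=500.0):
    """Voiced if the frame frequency lies strictly between the first
    and second preset thresholds, unvoiced otherwise."""
    return "voiced" if low < frequency_hz < high else "unvoiced"
```

This matches the roughly 40–500 Hz range of human pitch: frequencies outside it are treated as aperiodic (unvoiced) content.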
The processing module 503 is configured to calculate a time domain signal of each frame of the speech frame in the speech signal according to the frequency spectrum of the speech signal and the frame type of each frame of the speech frame; and superposing the time domain signals of each frame of the voice frame to obtain voice signals.
In a specific implementation, the noise type of the additive phase noise can be determined according to the frame type of each speech frame; the time-domain signal of each speech frame is then calculated according to the noise type of the additive phase noise and the spectrum of the speech signal at each synthesis time point.
Further, if the frame type of the speech frame is unvoiced, the noise type of the additive phase noise is white noise; if the frame type of the speech frame is voiced, the noise type of the additive phase noise is colored noise.
For example, the reconstructed speech signal of each frame is given by the equation shown in
Figure BDA0001445001740000131
where S(ω, t_n) denotes the spectrum of the speech signal at the synthesis time point t_n and φ(ω) is the additive phase noise. If the frame type of the speech frame is unvoiced, φ(ω) is white Gaussian noise; if the frame type of the speech frame is voiced, φ(ω) is colored noise.
The speech signal is given by the equation shown in
Figure BDA0001445001740000132
where the term shown in
Figure BDA0001445001740000133
denotes the sum of all pitch periods before the current time t_n, s_n(t − T) denotes the time-domain signal of the speech frame at time t − T, and f_0(t = t_n) denotes the pitch frequency at time t = t_n.
In the embodiment of the invention, after receiving the speech signal, the processing server decodes it according to the above decoding mode, performs speech recognition on the decoded signal, retrieves the corresponding text information from a speech library, and sends the text information to the user equipment; the user equipment then synthesizes the text information into speech and broadcasts it to the user. Because this decoding mode is adopted, the accuracy of speech recognition is improved and speech synthesis is more accurate.
In the embodiment of the invention, the time domain signal of each frame of voice frame is calculated according to the frequency spectrum of the voice signal at each synthesis time point, and then the time domain signal of each frame of voice frame is superposed to obtain the reconstructed voice signal.
Referring to fig. 6, fig. 6 is a schematic structural diagram of a speech encoding apparatus according to an embodiment of the present invention. As shown, the apparatus may include: at least one encoder 601 (e.g., a CPU), at least one communication interface 602, at least one memory 603, and at least one communication bus 604, where the communication bus 604 is used to enable connection and communication between these components. The communication interface 602 of the device in this embodiment is used for signaling or data communication with other node devices. The memory 603 may be a high-speed RAM memory or a non-volatile memory (e.g., at least one disk memory), and may optionally be at least one storage device located remotely from the encoder 601. A set of program codes is stored in the memory 603, and the encoder 601 invokes the program codes in the memory 603 to perform the following operations:
Obtaining the fundamental tone frequency of a voice signal;
determining the analysis time of the voice signal according to the fundamental tone frequency;
calculating a time domain compensation window according to the analysis time and the fundamental tone frequency, and determining a time domain smooth spectrum of the voice signal according to the time domain compensation window;
calculating a frequency domain compensation window according to the time domain compensation window, a preset triangular window and the fundamental tone frequency;
and determining the frequency domain smooth spectrum of the voice signal according to the time domain smooth spectrum, a preset triangular window and a frequency domain compensation window.
Wherein, the encoder 601 is further configured to perform the following operation steps:
multiplying the voice signal by a time domain compensation window, and calculating the voice signal after windowing;
carrying out Fourier transform on the windowed voice signal;
and calculating the square of the module of the windowed voice signal after Fourier transform to obtain the time domain smooth spectrum of the voice signal.
Wherein, the encoder 601 is further configured to perform the following operation steps:
calculating the duration of an analysis window according to the fundamental tone frequency;
and taking the sum of the time durations of the N analysis windows as the analysis time of the voice signal.
Wherein, the encoder 601 is further configured to perform the following operation steps:
calculating the convolution of the time domain compensation window and a preset triangular window to obtain a window function;
establishing an analysis matrix based on the window function, and determining an analysis vector according to the analysis matrix;
and calculating a frequency domain compensation window according to the analysis vector and the fundamental tone frequency.
Wherein, the encoder 601 is further configured to perform the following operation steps:
and performing convolution operation on the time domain smooth spectrum, a preset triangular window and a frequency domain compensation window, and taking the result of the convolution operation as the frequency domain smooth spectrum.
Referring to fig. 7, fig. 7 is a schematic structural diagram of a speech decoding apparatus according to an embodiment of the present invention. As shown, the apparatus may include: at least one decoder 701 (e.g., a CPU), at least one communication interface 702, at least one memory 703, and at least one communication bus 704, where the communication bus 704 is used to enable connection and communication between these components. The communication interface 702 of the device in this embodiment is used for signaling or data communication with other node devices. The memory 703 may be a high-speed RAM memory or a non-volatile memory (e.g., at least one disk memory), and may optionally be at least one storage device located remotely from the decoder 701. A set of program codes is stored in the memory 703, and the decoder 701 invokes the program codes in the memory 703 to perform the following operations:
Acquiring a frequency spectrum of a voice signal;
determining the frame type of each frame of a voice frame in a voice signal;
calculating a time domain signal of each frame of the voice frame in the voice signal according to the frequency spectrum of the voice signal and the frame type of each frame of the voice frame;
and superposing the time domain signals of each frame of the voice frame to obtain the voice signal.
The decoder 701 is further configured to perform the following operation steps:
determining the noise type of the additional phase noise according to the frame type of each frame of the voice frame;
and calculating the time domain signal of each frame of the voice frame according to the noise type of the additional phase noise and the frequency spectrum of the voice signal at each synthesis time point.
The decoder 701 is further configured to perform the following operation steps:
if the frame type of the voice frame is unvoiced, the noise type added with the phase noise is white noise;
if the frame type of the speech frame is voiced, the noise type of the additive phase noise is colored noise.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, it may be realized in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. The procedures or functions according to the embodiments of the present application are generated in whole or in part when the computer program instructions are loaded and executed on a computer. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wirelessly (e.g., infrared, radio, microwave). The computer-readable storage medium can be any available medium accessible by a computer, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
It should be noted that, for simplicity of description, the above method embodiments are described as a series of acts, but those skilled in the art will recognize that the present invention is not limited by the described order of acts, as some steps may be performed in other orders or concurrently in accordance with the invention. Further, those skilled in the art will appreciate that the embodiments described in this specification are preferred embodiments, and that the acts and modules involved are not necessarily required by the invention.
The speech encoding and decoding method and related apparatus provided by the embodiments of the present invention are described in detail above; specific examples are applied herein to explain the principles and implementations of the present invention, and the description of the above embodiments is only intended to help understand the method and its core idea. Meanwhile, for a person skilled in the art, there may be variations in the specific implementations and the application scope according to the idea of the present invention. In summary, the content of this specification should not be construed as limiting the present invention.

Claims (15)

1. A method of speech coding, the method comprising:
obtaining the fundamental tone frequency of a voice signal;
determining the analysis time of the voice signal according to the fundamental tone frequency;
calculating a time domain compensation window according to the analysis time and the fundamental tone frequency, and determining a time domain smooth spectrum of the voice signal according to the time domain compensation window;
calculating a frequency domain compensation window according to the time domain compensation window, a preset triangular window and the fundamental tone frequency;
determining a frequency domain smooth spectrum of the voice signal according to the time domain smooth spectrum, the preset triangular window and the frequency domain compensation window; and the processing server for receiving the voice signal subjected to coding processing jointly calculates a time domain signal of each frame of voice frame in the voice signal according to the frequency domain smooth spectrum and the frame type of each frame of voice frame in the voice signal after determining the frame type of each frame of voice frame in the voice signal, and superposes the time domain signal of each frame of voice frame to obtain the voice signal.
2. The method of claim 1, wherein the determining a time-domain smoothed spectrum of the speech signal according to the time-domain compensation window comprises:
multiplying the voice signal by the time domain compensation window, and calculating the voice signal after windowing;
performing Fourier transform on the windowed voice signal;
and calculating the square of the module of the windowed voice signal after the Fourier transform to obtain a time domain smooth spectrum of the voice signal.
3. The method of claim 1, wherein said determining an analysis time of said speech signal based on said pitch frequency comprises:
calculating the time length of an analysis window according to the fundamental tone frequency;
and taking the sum of the time lengths of the N analysis windows as the analysis time of the voice signal.
4. The method of claim 1, wherein the calculating a frequency-domain compensation window based on the time-domain compensation window, a preset triangular window, and the pitch frequency comprises:
calculating the convolution of the time domain compensation window and the preset triangular window to obtain a window function;
establishing an analysis matrix based on the window function, and determining an analysis vector according to the analysis matrix;
and calculating a frequency domain compensation window according to the analysis vector and the pitch frequency.
5. The method according to any one of claims 1-4, wherein the determining the frequency-domain smoothed spectrum of the speech signal according to the time-domain smoothed spectrum, the preset triangular window and the frequency-domain compensation window comprises:
and performing convolution operation on the time domain smooth spectrum, the preset triangular window and the frequency domain compensation window, and taking the result of the convolution operation as the frequency domain smooth spectrum.
6. A method for speech decoding, the method comprising:
acquiring a frequency spectrum of a voice signal; the frequency spectrum refers to a frequency domain smooth spectrum of the voice signal determined according to a time domain smooth spectrum, a preset triangular window and a frequency domain compensation window; the frequency domain compensation window is obtained by calculation based on a time domain compensation window, the preset triangular window and the fundamental tone frequency of the voice signal; the time domain smoothing spectrum is determined based on the time domain compensation window, and the time domain compensation window is obtained through calculation according to analysis time and the fundamental tone frequency; the analysis time is determined based on the pitch frequency;
determining a frame type of each frame of a voice frame in the voice signal;
determining a time domain signal of each frame of the voice frame in the voice signal according to the frequency spectrum of the voice signal and the frame type of each frame of the voice frame;
and superposing the time domain signals of each frame of the voice frame to obtain the voice signals.
7. The method of claim 6, wherein the determining the time-domain signal for each frame of speech frames in the speech signal based on the frequency spectrum of the speech signal and the frame type of the each frame of speech frames comprises:
determining the noise type of the additional phase noise according to the frame type of each frame of the voice frame;
and calculating the time domain signal of each frame of the voice frame according to the noise type of the additional phase noise and the frequency spectrum of the voice signal at each synthesis time point.
8. The method of claim 7, wherein determining a noise type for additive phase noise based on the frame type of each frame of speech comprises:
if the frame type of the voice frame is unvoiced, the noise type of the additive phase noise is white noise;
and if the frame type of the voice frame is voiced, the noise type of the additional phase noise is colored noise.
9. The method according to any one of claims 6-8, wherein the speech signal is given by the equation shown in
Figure FDA0003669270600000031
where the term shown in
Figure FDA0003669270600000032
denotes the sum of all pitch periods before the current time t_n, s_n(t − T) denotes the time-domain signal of the speech frame at time t − T, and f_0(t = t_n) denotes the pitch frequency at time t = t_n.
10. An apparatus for speech coding, the apparatus comprising:
the obtaining module is used for obtaining the fundamental tone frequency of the voice signal;
a processing module, configured to determine an analysis time of the speech signal according to the pitch frequency;
the processing module is further configured to calculate a time domain compensation window according to the analysis time and the pitch frequency, and determine a time domain smooth spectrum of the speech signal according to the time domain compensation window;
the processing module is further configured to calculate a frequency domain compensation window according to the time domain compensation window, a preset triangular window and the pitch frequency;
the processing module is further configured to determine a frequency domain smooth spectrum of the speech signal according to the time domain smooth spectrum, the preset triangular window, and the frequency domain compensation window; and the processing server for receiving the voice signal subjected to coding processing jointly calculates a time domain signal of each frame of voice frame in the voice signal according to the frequency domain smooth spectrum and the frame type of each frame of voice frame in the voice signal after determining the frame type of each frame of voice frame in the voice signal, and superposes the time domain signal of each frame of voice frame to obtain the voice signal.
11. The apparatus of claim 10, wherein the processing module is specifically configured to:
and performing convolution operation on the time domain smooth spectrum, the preset triangular window and the frequency domain compensation window, and taking the result of the convolution operation as the frequency domain smooth spectrum.
12. An apparatus for speech decoding, the apparatus comprising:
the acquisition module is used for acquiring the frequency spectrum of the voice signal; the frequency spectrum refers to a frequency domain smooth spectrum of the voice signal determined according to a time domain smooth spectrum, a preset triangular window and a frequency domain compensation window; the frequency domain compensation window is obtained by calculation based on a time domain compensation window, the preset triangular window and the fundamental tone frequency of the voice signal; the time domain smoothing spectrum is determined based on the time domain compensation window, and the time domain compensation window is obtained through calculation according to analysis time and the fundamental tone frequency; the analysis time is determined based on the pitch frequency;
the determining module is used for determining the frame type of each frame of voice frame in the voice signal;
the processing module is used for determining a time domain signal of each frame of voice frame in the voice signal according to the frequency spectrum of the voice signal and the frame type of each frame of voice frame; and superposing the time domain signals of each frame of the voice frame to obtain the voice signals.
13. The apparatus of claim 12, wherein the processing module is specifically configured to:
determining the noise type of the additional phase noise according to the frame type of each frame of the voice frame;
and calculating the time domain signal of each frame of the voice frame according to the noise type of the additional phase noise and the frequency spectrum of the voice signal at each synthesis time point.
14. The apparatus of claim 13, wherein the processing module is specifically configured to:
if the frame type of the voice frame is unvoiced, the noise type of the additive phase noise is white noise;
and if the frame type of the voice frame is voiced, the noise type of the additional phase noise is colored noise.
15. A computer-readable storage medium storing a plurality of instructions adapted to be loaded by a processor and to perform the method according to any one of claims 1 to 9.
CN201711008611.0A 2017-10-25 2017-10-25 Voice coding and decoding method and device Active CN109712632B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711008611.0A CN109712632B (en) 2017-10-25 2017-10-25 Voice coding and decoding method and device


Publications (2)

Publication Number Publication Date
CN109712632A CN109712632A (en) 2019-05-03
CN109712632B true CN109712632B (en) 2022-07-12

Family

ID=66252090


Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101272366A (en) * 2007-03-23 2008-09-24 联发科技股份有限公司 Signal generation device and its relevant method


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
SPEECH REPRESENTATION AND TRANSFORMATION USING ADAPTIVE; Hideki Kawahara; 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing; 2002-08-06 *
Wideband speech coding algorithm based on adaptive weighted spectral interpolation (基于自适应加权谱内插的宽带语音编码算法); Ling Zhenhua et al.; Journal of Data Acquisition and Processing (数据采集与处理); 2005-03-31; pp. 29-32, figures 1 and 4 *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant