CN112397087B - Formant envelope estimation method, formant envelope estimation device, speech processing method, speech processing device, storage medium and terminal - Google Patents


Info

Publication number
CN112397087B
CN112397087B (granted from application CN202011272148.2A; published as CN112397087A)
Authority
CN
China
Prior art keywords
frequency
formant
voice signal
voice
frequency domain
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011272148.2A
Other languages
Chinese (zh)
Other versions
CN112397087A (en)
Inventor
張維城
姜双双
戚萌
董斐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Spreadtrum Communications Shanghai Co Ltd
Original Assignee
Spreadtrum Communications Shanghai Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Spreadtrum Communications Shanghai Co Ltd filed Critical Spreadtrum Communications Shanghai Co Ltd
Priority to CN202011272148.2A priority Critical patent/CN112397087B/en
Publication of CN112397087A publication Critical patent/CN112397087A/en
Application granted granted Critical
Publication of CN112397087B publication Critical patent/CN112397087B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 — Speech or voice analysis techniques not restricted to a single one of groups G10L15/00–G10L21/00
    • G10L25/03 — Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/18 — Speech or voice analysis techniques in which the extracted parameters are spectral information of each sub-band
    • G10L25/27 — Speech or voice analysis techniques characterised by the analysis technique

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Telephone Function (AREA)

Abstract

A formant envelope estimation method, a speech processing method and device, a storage medium, and a terminal. The formant envelope estimation method comprises the following steps: acquiring a frequency domain voice signal; determining a plurality of spectral peaks of the frequency domain voice signal within an estimated frequency range, wherein the estimated frequency range is determined according to the vocal tract characteristics of a preset sound source and the perceived frequency of the preset sound source; and performing an interpolation operation on the plurality of spectral peaks to obtain the formant envelope of the frequency domain voice signal. The technical scheme of the application ensures estimation accuracy while keeping formant estimation simple and easy to implement.

Description

Formant envelope estimation method, formant envelope estimation device, speech processing method, speech processing device, storage medium and terminal
Technical Field
The present application relates to the field of speech processing technologies, and in particular, to a formant envelope estimation method, a formant envelope estimation device, a speech processing method, a speech processing device, a storage medium, and a terminal.
Background
Basic physical characteristics of human speech include pitch, loudness, formants, timbre, and sound duration. Traditionally, speech has been divided into two major categories: vowels and consonants. Vowels are generally associated with higher energies and stronger periodicity. The relative importance of vowels and consonants in speech perception has been the subject of many studies. In studies using spoken sentences in the presence of background noise, it has been shown that vowels play a more important role in word recognition than consonants. In the presence of noise, the vowels will carry more speech information, probably because formant cues are robust even in noise. Due to the resonance of the vocal tract, the formant frequencies correspond to peaks in the short-time energy spectrum of voiced sounds. Formants are one of the main cues for vowel perception, as well as other factors such as spectral shape and formant ratio. Multidimensional analysis of the perceived vowel space has determined that the two dimensions that account for the greatest difference in perceived space correspond to the first two formant frequencies.
In general, formant frequencies extracted from a speech signal depend relatively little on the individual speaker and on context. In addition, the properties of the human voice vary within a bounded range and can be carefully measured and mapped onto the vocabulary system. Thus, speech features containing formant information are widely used in speech recognition tasks. Formant information extracted by conventional feature processing, such as mel-frequency cepstral coefficients (Mel Frequency Cepstral Coefficient, MFCC) or perceptual linear prediction (Perceptual Linear Prediction, PLP), is easily corrupted by noise. Many techniques for accurately estimating or enhancing formant frequencies have been proposed over the decades, falling into three main categories: (1) LPC-based; (2) cepstrum-based; (3) Gaussian-mixture-model-based. The detail of the envelope (e.g., the sharpness of its peaks and valleys) is limited by the order of the cepstrum/LPC approximation. The Gaussian mixture model is applied to a sinusoidal model of the harmonic structure; however, such a model has energy only at frequencies corresponding to integer multiples of the fundamental frequency (F0), so it is difficult to recover the transfer characteristics between adjacent harmonics.
However, conventional LPC/cepstrum methods do not describe speech formants accurately in frequency and amplitude. The accuracy of an envelope estimated by LPC analysis or cepstrum smoothing is limited by its order, and the sinusoidal model is limited to harmonic frequencies. The results of all these methods can be improved by iterative estimation using side information (e.g., the spectra of neighboring frames), but at the cost of more computation and memory resources.
Disclosure of Invention
The technical problem addressed by the application is how to ensure estimation accuracy while keeping formant estimation simple and easy to implement.
In order to solve the above technical problem, an embodiment of the present application provides a formant envelope estimation method, including: acquiring a frequency domain voice signal; determining a plurality of spectral peaks of the frequency domain voice signal within an estimated frequency range, wherein the estimated frequency range is determined according to the vocal tract characteristics of a preset sound source and the perceived frequency of the preset sound source; and performing an interpolation operation on the plurality of spectral peaks to obtain the formant envelope of the frequency domain voice signal.
Optionally, the interpolating the plurality of spectral peaks includes: performing a linear interpolation operation on the plurality of spectral peaks to obtain the formant envelope of the frequency domain voice signal.

Optionally, the interpolating the plurality of spectral peaks includes: performing higher-order interpolation (for example, quadratic or cubic interpolation) on the plurality of spectral peaks to obtain the formant envelope of the frequency domain voice signal.
Optionally, the determining the plurality of spectral peaks of the frequency domain voice signal within the estimated frequency range includes: dividing the frequency domain voice signal into a preset number of frequency bands within the estimated frequency range; and determining, within each frequency band, the position with the largest spectral value as the spectral peak of that band.
Optionally, the acquiring the frequency domain voice signal includes: acquiring collected multi-frame time domain voice signals; and converting each frame of time domain voice signal into a corresponding frequency domain voice signal.
In order to solve the above technical problem, an embodiment of the present application also discloses a speech processing method, including: acquiring a collected voice signal; calculating the formant envelope of the voice signal, the formant envelope being obtained by: acquiring a frequency domain voice signal, determining a plurality of spectral peaks of the frequency domain voice signal within an estimated frequency range, wherein the estimated frequency range is determined according to the vocal tract characteristics of a preset sound source and the perceived frequency of the preset sound source, and performing an interpolation operation on the plurality of spectral peaks to obtain the formant envelope of the frequency domain voice signal; extracting the formants of the voice signal according to the formant envelope of the voice signal; and performing speech recognition or speech synthesis on the voice signal according to the formants of the voice signal.
Optionally, after acquiring the collected voice signal, the method further includes: determining whether the voice signal contains speech of a preset sound source category.
The embodiment of the application also discloses a formant envelope estimation device, including: a voice signal acquisition module, configured to acquire a frequency domain voice signal; a spectral peak determining module, configured to determine a plurality of spectral peaks of the frequency domain voice signal within an estimated frequency range, wherein the estimated frequency range is determined according to the vocal tract characteristics of a preset sound source and the perceived frequency of the preset sound source; and a formant envelope calculation module, configured to perform an interpolation operation on the plurality of spectral peaks to obtain the formant envelope of the frequency domain voice signal.
The embodiment of the application also discloses a speech processing device, including: a voice signal acquisition module, configured to acquire a collected voice signal; a formant envelope determination module, configured to calculate the formant envelope of the voice signal, the formant envelope being obtained by: acquiring a frequency domain voice signal, determining a plurality of spectral peaks of the frequency domain voice signal within an estimated frequency range, wherein the estimated frequency range is determined according to the vocal tract characteristics of a sound source and the perceived frequency of the sound source, and performing an interpolation operation on the plurality of spectral peaks to obtain the formant envelope of the frequency domain voice signal; a formant extraction module, configured to extract the formants of the voice signal according to the formant envelope of the voice signal; and a speech processing module, configured to perform speech recognition or speech synthesis on the voice signal according to the formants of the voice signal.
The embodiment of the application also discloses a storage medium, on which a computer program is stored, which when being executed by a processor, performs the steps of the formant envelope estimation method or the steps of the voice processing method.
The embodiment of the application also discloses a terminal which comprises a memory and a processor, wherein the memory stores a computer program which can be run on the processor, and the processor executes the steps of the formant envelope estimation method or the steps of the voice processing method when running the computer program.
Compared with the prior art, the technical scheme of the embodiment of the application has the following beneficial effects:
In the technical scheme of the application, an estimated frequency range is preset, and a plurality of spectral peaks are determined within that range for calculating the formant envelope. Because the estimated frequency range is determined according to the vocal tract characteristics of the preset sound source and the perceived frequency of the preset sound source, the determined formant envelope is guaranteed to contain the formants within that frequency range, ensuring the accuracy of subsequent formant estimation. In addition, since the formant envelope only needs to be calculated within the estimated frequency range, computation over all frequency bands of the voice signal is avoided, iterative operations are avoided, computing resources are saved, and computational efficiency is improved.
Drawings
FIG. 1 is a flow chart of a formant envelope estimation method according to an embodiment of the present application;
FIG. 2 is a flow chart of a method of speech processing according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a specific application scenario according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of a formant envelope estimation device according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of a speech processing device according to an embodiment of the present application.
Detailed Description
As described in the background, the accuracy of an envelope estimated by LPC analysis or cepstrum smoothing in the prior art is limited by its order, and the sinusoidal model is limited to harmonics. In addition, accurate formant estimation is time-consuming and computationally expensive. The trade-off between estimation accuracy and time spent must be faced at the very beginning of system design, particularly in mobile systems, where computing and storage resources are severely limited.
In the technical scheme of the application, an estimated frequency range is preset, and a plurality of spectral peaks are determined within that range for calculating the formant envelope. Because the estimated frequency range is determined according to the vocal tract characteristics of the preset sound source and the perceived frequency of the preset sound source, the determined formant envelope is guaranteed to contain the formants within that frequency range, ensuring the accuracy of subsequent formant estimation. In addition, since the formant envelope only needs to be calculated within the estimated frequency range, computation over all frequency bands of the voice signal is avoided, iterative operations are avoided, computing resources are saved, and computational efficiency is improved.
The "formants" in the embodiments of the present application refer to the regions of the frequency domain signal where sound energy is relatively concentrated. They determine the quality (timbre) of a sound, and the formants of a signal can be used to determine which speaker uttered the voice signal.
The formant envelope refers to the curve formed by connecting the amplitude maxima corresponding to different frequencies in the frequency domain signal; it represents the voice characteristics of the speaker in the current segment.
In order that the above objects, features and advantages of the application will be readily understood, a more particular description of the application will be rendered by reference to specific embodiments thereof which are illustrated in the appended drawings.
Fig. 1 is a flowchart of a formant envelope estimation method according to an embodiment of the present application.
The formant envelope estimation method provided by the embodiment of the application may be executed by the formant envelope estimation device provided by the embodiment. The device may be implemented in software and/or hardware and integrated in the device that executes the method, which may be an intelligent terminal, such as a smartphone, tablet, or handheld computer, configured with any application program capable of processing voice signals.
Specifically, the formant envelope estimation method may include the steps of:
step S101: acquiring a frequency domain voice signal;
step S102: determining a plurality of spectral peaks of the frequency domain voice signal within an estimated frequency range, wherein the estimated frequency range is determined according to the vocal tract characteristics of a preset sound source and the perceived frequency of the preset sound source;
step S103: and carrying out interpolation operation on the plurality of frequency spectrum peaks to obtain formant envelopes of the frequency domain voice signals.
It should be noted that the serial numbers of the steps in the present embodiment do not represent a limitation on the execution sequence of the steps.
In an implementation of step S101, the original speech signal recorded by the user and captured by a speech acquisition device may be obtained. Since the original speech signal is typically a time domain signal, it needs to be converted to obtain the frequency domain voice signal. Specifically, the frequency domain voice signal may be obtained by converting a time domain voice signal, or retrieved directly from a database that stores already-converted frequency domain voice signals.
In one non-limiting embodiment, step S101 may include the following steps: acquiring collected multi-frame time domain voice signals; and converting each frame of time domain voice signal into a corresponding frequency domain voice signal.
Specifically, the Fourier transform converts a time domain signal into a frequency domain signal, so that information that cannot be clearly observed in the time domain can be analyzed in the frequency domain. A time domain voice signal uttered by the user contains different frequency content at different times. If the entire original voice signal were Fourier-transformed directly, the resulting frequency domain signal would be a single spectrum computed over all the voice information in the whole time range; it could neither reflect the frequency characteristics of a local time interval nor allow analysis of the frequency domain information in different time periods. Therefore, in this embodiment, the short-time Fourier transform may be applied to the original voice signal to obtain the frequency domain information corresponding to different time periods (for example, different frames). The short-time Fourier transform represents the frequency domain feature at a certain time by the spectrum of the segment of the speech signal that falls within a predetermined time window.
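The framing-and-transform step described above can be sketched as follows. This is an illustrative assumption, not code from the patent: the function name and parameters are invented, a Hann window is assumed, and a naive O(N²) DFT stands in for the FFT used in practice.

```python
import cmath
import math

def stft_magnitudes(signal, frame_len, hop):
    """Split a time-domain signal into Hann-windowed frames and return the
    one-sided magnitude spectrum of each frame (naive DFT, for illustration)."""
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        # Apply a Hann window to the current frame.
        frame = [signal[start + n] * (0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_len - 1)))
                 for n in range(frame_len)]
        # One-sided discrete Fourier transform (bins 0 .. frame_len // 2).
        spectrum = [sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                        for n in range(frame_len))
                    for k in range(frame_len // 2 + 1)]
        frames.append([abs(x) for x in spectrum])
    return frames
```

For example, a pure sine at 8 cycles per 64-sample frame yields a single frame whose magnitude spectrum peaks at bin 8.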
In an implementation of step S102, an estimated frequency range may be preset, determined by the vocal tract characteristics of a preset sound source and the perceived frequency of the preset sound source — for example, designed according to the human perception range and vocal tract characteristics. The estimated frequency range provides the bandwidth of the frequency band that is sensitive to formants. By locating the calculated spectral peaks within the estimated frequency range, a high correlation between the selected spectral peaks and the formant frequencies can be ensured. Specifically, a spectral peak is the position within a frequency band where the spectral value is largest.
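A minimal sketch of this per-band peak-picking step is given below. The function name, the band layout, and the example values are illustrative assumptions, not values from the patent.

```python
def band_peaks(magnitudes, bin_hz, f_lo, f_hi, n_bands):
    """Divide the estimated frequency range [f_lo, f_hi] into n_bands equal
    bands and pick the largest-magnitude bin in each band.
    magnitudes: per-bin magnitude spectrum; bin_hz: width of one bin in Hz.
    Returns a list of (bin_index, magnitude) pairs, one per band."""
    lo = int(f_lo / bin_hz)            # first bin inside the estimated range
    hi = int(f_hi / bin_hz)            # first bin past the estimated range
    width = max(1, (hi - lo) // n_bands)
    peaks = []
    for b in range(n_bands):
        start = lo + b * width
        end = min(start + width, hi)
        k = max(range(start, end), key=lambda i: magnitudes[i])
        peaks.append((k, magnitudes[k]))
    return peaks
```

With 100 Hz bins and an assumed 300–700 Hz range split into two bands, each band contributes exactly one peak, so the number of envelope support points is fixed in advance.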
Since each spectral peak is a discrete numerical point, in the implementation of step S103, the formant envelope of the frequency-domain speech signal may be obtained by performing an interpolation operation on the spectral peak.
In one specific example, a linear interpolation operation may be performed directly on the plurality of spectral peaks to obtain the formant envelope of the frequency domain voice signal.
In another specific example, to obtain a smoother formant envelope, higher-order interpolation, for example quadratic or cubic interpolation, may be performed on the plurality of spectral peaks to obtain the formant envelope of the frequency domain voice signal.
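The interpolation step can be sketched with the simple linear variant; this is an illustrative assumption (function name and boundary handling are invented), and the embodiment may equally use quadratic or cubic interpolation for a smoother envelope.

```python
def envelope_from_peaks(peaks, n_bins):
    """Linearly interpolate between (bin, magnitude) spectral peaks to form a
    formant envelope over n_bins bins; values outside the outermost peaks are
    held constant at the nearest peak's magnitude."""
    peaks = sorted(peaks)
    env = []
    for k in range(n_bins):
        if k <= peaks[0][0]:
            env.append(peaks[0][1])
        elif k >= peaks[-1][0]:
            env.append(peaks[-1][1])
        else:
            # Find the pair of peaks bracketing bin k and interpolate.
            for (k0, m0), (k1, m1) in zip(peaks, peaks[1:]):
                if k0 <= k <= k1:
                    t = (k - k0) / (k1 - k0)
                    env.append(m0 + t * (m1 - m0))
                    break
    return env
```

For peaks at bins 2 and 6 with magnitudes 4.0 and 8.0, the envelope rises linearly between them and stays flat outside.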
The embodiment of the application constitutes a preprocessing stage before voice feature extraction: the determined formant envelope is used for subsequent feature extraction, i.e., for determining the formants of the voice signal.
In the embodiment of the application, an estimated frequency range is preset, and a plurality of spectral peaks are determined within that range for calculating the formant envelope. Because the estimated frequency range is determined according to the vocal tract characteristics of the preset sound source and the perceived frequency of the preset sound source, the determined formant envelope is guaranteed to contain the formants within that frequency range, ensuring the accuracy of subsequent formant estimation. In addition, since the formant envelope only needs to be calculated within the estimated frequency range, computation over all frequency bands of the voice signal is avoided, iterative operations are avoided, computing resources are saved, and computational efficiency is improved.
Referring to fig. 2, the embodiment of the application also discloses a voice processing method. The voice processing method can be executed by the voice processing device provided by the embodiment of the application, the device can be realized by software and/or hardware, and the device is integrated in a device for executing the method, and the device can be an intelligent terminal such as a smart phone, a tablet, a palm computer and the like, which is configured with any application program capable of processing voice signals.
The voice processing method may include the steps of:
step S201: acquiring an acquired voice signal;
step S202: calculating a formant envelope of the speech signal;
step S203: extracting formants of the voice signal according to formant envelopes of the voice signal;
step S204: and performing voice recognition or voice synthesis on the voice signal according to the formants of the voice signal.
Embodiments of the present application can use speech analysis to enhance vowel formants.
The specific implementation of step S202 may refer to the embodiment shown in fig. 1, and will not be described herein.
In an implementation of step S203, the formants of the voice signal may be extracted using the formant envelope of the voice signal. Specifically, the formants may be determined using a mel-frequency cepstral coefficient (Mel Frequency Cepstral Coefficient, MFCC) or perceptual linear prediction (Perceptual Linear Prediction, PLP) algorithm, where the input is the formant envelope of the voice signal and the output is the voice feature, i.e., the formants of the voice signal.
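For reference, MFCC front ends of the kind mentioned above map linear frequency to the mel scale before cepstral analysis. The standard HTK-style mapping (a well-known formula, shown here only as background; the patent does not specify which variant it uses) is:

```python
import math

def hz_to_mel(f_hz):
    """HTK-style mel-scale mapping used by common MFCC front ends."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    """Inverse mapping, useful for placing mel filterbank edges in Hz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
```

The mapping is approximately linear below 1 kHz and logarithmic above, which is why MFCC features emphasize the low-frequency region where the first formants lie.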
It should be noted that, regarding the specific algorithm adopted in step S203, any algorithm may be implemented in the prior art, which is not limited in this embodiment of the present application.
Further, in an implementation of step S204, the calculated formants may be used to perform speech recognition or speech synthesis on the voice signal, enhancing the performance of those tasks.
In a non-limiting embodiment, the following steps may be further included after step S201 shown in fig. 2: and judging whether the voice signal has the voice of the preset sound source category or not.
The embodiment of the application can determine whether speech of a preset sound source category, such as human voice, is present, specifically by means of a voice activity detection program. After human speech is detected, the subsequent speech processing steps continue.
In addition, after step S201 the voice signal may be further segmented into a plurality of short-time frames, specifically by segmenting with a preset time window.
Experiments show that the method can improve the accuracy of a small Gaussian mixture model (GMM) / hidden Markov model (Hidden Markov Model, HMM) speech recognition system by 8%, with additive noise at signal-to-noise ratios down to 0 dB.
Referring specifically to fig. 3: in the spectrogram shown in fig. 3, noise is added to the speech signal at -15 dB; the abscissa represents frequency and the ordinate represents energy. Curve a represents the formant envelope calculated by the embodiment of the application, while curves b and c represent prior-art formant envelopes calculated by LPC analysis and cepstrum smoothing, respectively. The second and third formants of the speech signal lie at frequencies f1 and f2. It can be seen from fig. 3 that, relative to curves b and c, the formant envelope shown by curve a substantially retains the energy of the second and third formants. That is, with the embodiment of the application, the energy estimates at the second and third formant locations are more accurate than in the prior art.
Furthermore, as the additive noise in the voice signal increases, the formant envelope estimated by the embodiment of the application better preserves the energy of the second and third formants.
Referring to fig. 4, the embodiment of the application also discloses a formant envelope estimation device 40. The formant envelope estimation device 40 may include:
a voice signal acquisition module 401, configured to acquire a frequency domain voice signal;
a spectral peak determining module 402, configured to determine a plurality of spectral peaks of the frequency domain speech signal within an estimated frequency range, where the estimated frequency range is determined according to a channel characteristic of a preset sound source and a perceived frequency of the preset sound source;
the formant envelope calculation module 403 is configured to perform interpolation operation on the plurality of spectral peaks to obtain a formant envelope of the frequency domain speech signal.
Preprocessing of the speech signal can be achieved by means of the formant envelope estimation means 40, and the calculated formant envelope can be used for processing the speech signal, for example for speech recognition or speech synthesis.
For more details of the working principle and the working manner of the formant envelope estimation device 40, reference may be made to the related descriptions in fig. 1 to 3, which are not repeated here.
Referring to fig. 5, the embodiment of the present application further discloses a voice processing device 50, where the voice processing device 50 includes:
the voice signal acquisition module 501 is configured to acquire an acquired voice signal;
a formant envelope determination module 502, configured to calculate a formant envelope of the speech signal, where the formant envelope is obtained by: acquiring a frequency domain voice signal, determining a plurality of frequency spectrum peaks of the frequency domain voice signal in an estimated frequency range, wherein the estimated frequency range is determined according to the sound channel characteristics of a sound source and the perceived frequency of the sound source, and performing interpolation operation on the plurality of frequency spectrum peaks to obtain a formant envelope of the frequency domain voice signal;
a formant extraction module 503, configured to extract formants of the speech signal according to formant envelopes of the speech signal;
the voice processing module 504 is configured to perform voice recognition or voice synthesis on the voice signal according to the formants of the voice signal.
In particular, the formant envelope determination module 502 may be the formant envelope estimation device 40, or the formant envelope determination module 502 obtains its calculated formant envelope from the formant envelope estimation device 40.
For more details of the operation principle and the operation manner of the speech processing device 50, reference may be made to the related descriptions in fig. 1 to 3, which are not repeated here.
The embodiment of the application also discloses a storage medium, which is a computer readable storage medium, and a computer program is stored on the storage medium, and the computer program can execute the steps of the method shown in fig. 1 or fig. 2 when running. The storage medium may include ROM, RAM, magnetic or optical disks, and the like. The storage medium may also include a non-volatile memory (non-volatile) or a non-transitory memory (non-transitory) or the like.
The embodiment of the application also discloses a terminal, which can comprise a memory and a processor, wherein the memory stores a computer program capable of running on the processor. The processor may perform the steps of the methods shown in fig. 1 or fig. 2 when running the computer program. The terminal comprises, but is not limited to, a mobile phone, a computer, a tablet personal computer and other terminal equipment.
It should be appreciated that the above-described processor may be a general purpose processor, a digital signal processor (digital signal processor, DSP), an application specific integrated circuit (application specific integrated circuit, ASIC), a field programmable gate array (field programmable gate array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, discrete hardware components, a system on chip (SoC), a central processing unit (central processing unit, CPU), a network processor (network processor, NP), a digital signal processing circuit, a microcontroller (micro controller unit, MCU), a programmable logic device (programmable logic device, PLD), or another integrated chip. The methods, steps, and logic blocks disclosed in the embodiments of the present application may be implemented or performed by such a processor. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present application may be embodied directly in a hardware decoding processor, or in a combination of hardware and software modules in a decoding processor. The software modules may be located in a random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, registers, or another storage medium well known in the art. The storage medium is located in a memory, and the processor reads the information in the memory and, in combination with its hardware, performs the steps of the above method.
It should also be understood that the memory referred to in embodiments of the present application may be volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory. The nonvolatile memory may be a read-only memory (ROM), a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), or a flash memory. The volatile memory may be a random access memory (RAM), which acts as an external cache. By way of example, and not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct Rambus RAM (DR RAM). It should be noted that the memory of the systems and methods described herein is intended to comprise, without being limited to, these and any other suitable types of memory.
Note that when the processor is a general-purpose processor, a DSP, an ASIC, an FPGA or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components, the memory (storage module) may be integrated into the processor.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, and are not repeated herein.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
Although the present application is disclosed above, the present application is not limited thereto. Various changes and modifications may be made by one skilled in the art without departing from the spirit and scope of the application, and the scope of the application should therefore be determined by the appended claims.

Claims (8)

1. A method of speech processing, comprising:
acquiring a collected speech signal;
calculating a formant envelope of the speech signal, the formant envelope being obtained by: acquiring a frequency-domain speech signal; determining a plurality of spectral peaks of the frequency-domain speech signal within an estimated frequency range, wherein the estimated frequency range is determined according to the vocal tract characteristics of a preset sound source and the perceived frequency of the preset sound source; and performing an interpolation operation on the plurality of spectral peaks to obtain the formant envelope of the frequency-domain speech signal, wherein the frequency-domain speech signal is divided into a preset number of frequency bands within the estimated frequency range, and the position with the maximum spectral magnitude in each frequency band is determined as the spectral peak in that frequency band, the spectral peak being related to the frequency of a formant;
extracting formants of the speech signal according to the formant envelope of the speech signal;
and performing speech recognition or speech synthesis on the speech signal according to the formants of the speech signal.
2. The speech processing method of claim 1, wherein interpolating the plurality of spectral peaks comprises:
performing a linear interpolation operation on the plurality of spectral peaks to obtain the formant envelope of the frequency-domain speech signal.
3. The speech processing method of claim 1, wherein interpolating the plurality of spectral peaks comprises:
performing multi-element interpolation on the plurality of spectral peaks to obtain the formant envelope of the frequency-domain speech signal.
4. The method of claim 1, wherein the acquiring the frequency-domain speech signal comprises:
acquiring a collected multi-frame time-domain speech signal;
converting each frame of the time-domain speech signal into a corresponding frequency-domain speech signal.
5. The method according to claim 1, wherein after the acquiring the collected speech signal, the method further comprises:
determining whether the speech signal contains speech of a preset sound source category.
6. A speech processing apparatus, comprising:
a speech signal acquisition module, configured to acquire a collected speech signal;
a formant envelope determination module, configured to calculate a formant envelope of the speech signal, the formant envelope being obtained by: acquiring a frequency-domain speech signal; determining a plurality of spectral peaks of the frequency-domain speech signal within an estimated frequency range, wherein the estimated frequency range is determined according to the vocal tract characteristics of a sound source and the perceived frequency of the sound source; and performing an interpolation operation on the plurality of spectral peaks to obtain the formant envelope of the frequency-domain speech signal, wherein the formant envelope determination module is further configured to divide the frequency-domain speech signal into a preset number of frequency bands within the estimated frequency range, and to determine the position with the maximum spectral magnitude in each frequency band as the spectral peak in that frequency band, the spectral peak being related to the frequency of a formant;
a formant extraction module, configured to extract formants of the speech signal according to the formant envelope of the speech signal;
and a speech processing module, configured to perform speech recognition or speech synthesis on the speech signal according to the formants of the speech signal.
7. A storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, performs the steps of the speech processing method of any one of claims 1 to 5.
8. A terminal comprising a memory and a processor, the memory having stored thereon a computer program executable on the processor, wherein the processor, when running the computer program, performs the steps of the speech processing method according to any one of claims 1 to 5.
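The estimation procedure recited in claim 1 — divide the estimated frequency range into a preset number of bands, take the largest-magnitude bin in each band as a spectral peak, then interpolate between the peaks — can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name, the default 90–4000 Hz range, and the band count of 8 are assumptions supplied for the example.

```python
def formant_envelope(mags, sample_rate, n_fft, f_lo=90.0, f_hi=4000.0, n_bands=8):
    """Sketch of claim 1: per-band peak picking plus linear interpolation.

    mags: magnitude spectrum of one frame (e.g. abs of an rFFT), one value per bin.
    The defaults (90-4000 Hz, 8 bands) are illustrative, not from the patent.
    """
    bin_hz = sample_rate / n_fft                      # width of one FFT bin in Hz
    lo_bin = int(f_lo / bin_hz)
    hi_bin = min(int(f_hi / bin_hz), len(mags) - 1)
    band_width = max(1, (hi_bin - lo_bin + 1) // n_bands)

    peaks = []                                        # (bin index, magnitude) per band
    for b in range(n_bands):
        start = lo_bin + b * band_width
        end = hi_bin + 1 if b == n_bands - 1 else start + band_width
        k = max(range(start, end), key=lambda i: mags[i])
        peaks.append((k, mags[k]))

    # Linear interpolation between adjacent peaks (the claim 2 variant);
    # bins outside the span of the peaks keep their original magnitude.
    env = list(mags)
    for (k0, m0), (k1, m1) in zip(peaks, peaks[1:]):
        for k in range(k0, k1 + 1):
            t = (k - k0) / (k1 - k0) if k1 > k0 else 0.0
            env[k] = m0 + t * (m1 - m0)
    return peaks, env
```

The formants themselves would then be read off from the peak positions, as in the claim's extraction step.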
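Claim 3's "multi-element interpolation" is not defined further in this text; one possible reading is polynomial interpolation through the spectral peaks, sketched here with Lagrange's formula. The function name is illustrative and this reading is an assumption, not the patent's stated method.

```python
def lagrange_eval(points, x):
    """Evaluate at x the unique polynomial passing through the given
    (frequency, magnitude) points -- one hedged reading of claim 3."""
    total = 0.0
    for i, (xi, yi) in enumerate(points):
        term = yi
        for j, (xj, _) in enumerate(points):
            if j != i:
                term *= (x - xj) / (xi - xj)   # Lagrange basis factor
        total += term
    return total
```

With many peaks a piecewise or spline variant would usually be preferred in practice, since high-degree polynomials oscillate between nodes.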
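Claim 4's frame-by-frame conversion to the frequency domain can be sketched as a short-time Fourier analysis. The frame length, hop size, and Hann window below are common defaults added for the example; the patent text does not specify them.

```python
import numpy as np

def frames_to_spectra(x, frame_len=256, hop=128):
    """Claim 4 sketch: split a time-domain signal into overlapping frames
    and convert each frame to a magnitude spectrum."""
    spectra = []
    for start in range(0, len(x) - frame_len + 1, hop):
        frame = np.asarray(x[start:start + frame_len], dtype=float)
        frame = frame * np.hanning(frame_len)       # taper to reduce spectral leakage
        spectra.append(np.abs(np.fft.rfft(frame)))  # frame_len//2 + 1 bins per frame
    return spectra
```

Each returned spectrum is then a candidate input to the envelope estimation of claim 1.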
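Claim 5 requires deciding whether the collected signal contains speech of the preset sound source category, but the text does not spell out the detector. A simple short-time energy gate is a hypothetical stand-in; the function name and threshold are assumptions for illustration only.

```python
def has_voice(frame, energy_threshold=0.01):
    """Hypothetical stand-in for claim 5's check: gate on mean
    short-time energy of one frame of samples."""
    energy = sum(s * s for s in frame) / len(frame)
    return energy > energy_threshold
```

Real systems would typically combine energy with spectral or model-based cues to distinguish the target sound source category from background noise.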
CN202011272148.2A 2020-11-13 2020-11-13 Formant envelope estimation method, formant envelope estimation device, speech processing method, speech processing device, storage medium and terminal Active CN112397087B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011272148.2A CN112397087B (en) 2020-11-13 2020-11-13 Formant envelope estimation method, formant envelope estimation device, speech processing method, speech processing device, storage medium and terminal

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011272148.2A CN112397087B (en) 2020-11-13 2020-11-13 Formant envelope estimation method, formant envelope estimation device, speech processing method, speech processing device, storage medium and terminal

Publications (2)

Publication Number Publication Date
CN112397087A CN112397087A (en) 2021-02-23
CN112397087B true CN112397087B (en) 2023-10-31

Family

ID=74601040

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011272148.2A Active CN112397087B (en) 2020-11-13 2020-11-13 Formant envelope estimation method, formant envelope estimation device, speech processing method, speech processing device, storage medium and terminal

Country Status (1)

Country Link
CN (1) CN112397087B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112687277B (en) * 2021-03-15 2021-06-18 北京远鉴信息技术有限公司 Method and device for determining voice formant, electronic equipment and readable storage medium

Citations (3)

Publication number Priority date Publication date Assignee Title
CN108831485A (en) * 2018-06-11 2018-11-16 东北师范大学 Method for distinguishing speek person based on sound spectrograph statistical nature
CN110663080A (en) * 2017-02-13 2020-01-07 法国国家科研中心 Method and apparatus for dynamically modifying the timbre of speech by frequency shifting of spectral envelope formants
CN111737515A (en) * 2020-07-22 2020-10-02 深圳市声扬科技有限公司 Audio fingerprint extraction method and device, computer equipment and readable storage medium

Family Cites Families (2)

Publication number Priority date Publication date Assignee Title
DE112012006876B4 (en) * 2012-09-04 2021-06-10 Cerence Operating Company Method and speech signal processing system for formant-dependent speech signal amplification
CN105679321B (en) * 2016-01-29 2020-05-19 宇龙计算机通信科技(深圳)有限公司 Voice recognition method, device and terminal

Patent Citations (3)

Publication number Priority date Publication date Assignee Title
CN110663080A (en) * 2017-02-13 2020-01-07 法国国家科研中心 Method and apparatus for dynamically modifying the timbre of speech by frequency shifting of spectral envelope formants
CN108831485A (en) * 2018-06-11 2018-11-16 东北师范大学 Method for distinguishing speek person based on sound spectrograph statistical nature
CN111737515A (en) * 2020-07-22 2020-10-02 深圳市声扬科技有限公司 Audio fingerprint extraction method and device, computer equipment and readable storage medium

Non-Patent Citations (1)

Title
"Piecewise linear prediction algorithm for formant frequency estimation of speech signals" (in Chinese); Chen Ning et al.; Computer Engineering and Applications; 2009-10-01; sections 1-4 *

Also Published As

Publication number Publication date
CN112397087A (en) 2021-02-23

Similar Documents

Publication Publication Date Title
Shrawankar et al. Techniques for feature extraction in speech recognition system: A comparative study
US9368103B2 (en) Estimation system of spectral envelopes and group delays for sound analysis and synthesis, and audio signal synthesis system
US8280724B2 (en) Speech synthesis using complex spectral modeling
US9031834B2 (en) Speech enhancement techniques on the power spectrum
US9728182B2 (en) Method and system for generating advanced feature discrimination vectors for use in speech recognition
US20210193149A1 (en) Method, apparatus and device for voiceprint recognition, and medium
Mowlaee et al. Interspeech 2014 special session: Phase importance in speech processing applications
CN108305639B (en) Speech emotion recognition method, computer-readable storage medium and terminal
CN108682432B (en) Speech emotion recognition device
US9076446B2 (en) Method and apparatus for robust speaker and speech recognition
US8942977B2 (en) System and method for speech recognition using pitch-synchronous spectral parameters
RU2427044C1 (en) Text-dependent voice conversion method
CN112599148A (en) Voice recognition method and device
Hasan et al. Preprocessing of continuous bengali speech for feature extraction
US20020065649A1 (en) Mel-frequency linear prediction speech recognition apparatus and method
AU2015411306A1 (en) Method for forming the excitation signal for a glottal pulse model based parametric speech synthesis system
Savchenko Method for reduction of speech signal autoregression model for speech transmission systems on low-speed communication channels
CN112397087B (en) Formant envelope estimation method, formant envelope estimation device, speech processing method, speech processing device, storage medium and terminal
Khonglah et al. Speech enhancement using source information for phoneme recognition of speech with background music
Makhijani et al. Speech enhancement using pitch detection approach for noisy environment
Eyben et al. Acoustic features and modelling
JP4571871B2 (en) Speech signal analysis method and apparatus for performing the analysis method, speech recognition apparatus using the speech signal analysis apparatus, program for executing the analysis method, and storage medium thereof
Ricotti Multitapering and a wavelet variant of MFCC in speech recognition
CN111862931A (en) Voice generation method and device
JP4537821B2 (en) Audio signal analysis method, audio signal recognition method using the method, audio signal section detection method, apparatus, program and recording medium thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant