CN113571078A - Noise suppression method, device, medium, and electronic apparatus


Info

Publication number
CN113571078A
Authority
CN
China
Prior art keywords
voice signal
processing
frequency spectrum
feature
noise suppression
Prior art date
Legal status
Granted
Application number
CN202110129579.1A
Other languages
Chinese (zh)
Other versions
CN113571078B (en)
Inventor
鲍枫
刘志鹏
李岳鹏
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202110129579.1A priority Critical patent/CN113571078B/en
Publication of CN113571078A publication Critical patent/CN113571078A/en
Application granted granted Critical
Publication of CN113571078B publication Critical patent/CN113571078B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 - Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0208 - Noise filtering
    • G10L 21/0216 - Noise filtering characterised by the method used for estimating noise
    • G10L 21/0232 - Processing in the frequency domain

Abstract

The disclosure provides a noise suppression method, apparatus, medium, and electronic device. The method comprises the following steps: acquiring low-frequency spectrum characteristics and high-frequency spectrum characteristics of an original voice signal, and performing characteristic combination processing on the low-frequency spectrum characteristics and the high-frequency spectrum characteristics to obtain frequency band energy characteristics; determining a current frame voice signal and a previous frame voice signal in an original voice signal, and performing linear domain transformation processing on the current frame voice signal and the previous frame voice signal to obtain a frequency spectrum characteristic parameter; performing correlation calculation on the frequency spectrum characteristic parameters and the frequency band energy characteristics to obtain cepstrum characteristics, and performing dimension reduction mapping processing on the cepstrum characteristics to obtain dimension reduction characteristics; and performing feature fusion processing on the dimensionality reduction features and the cepstrum features to obtain gain information, and performing noise suppression processing on the gain information to obtain a noise reduction voice signal of the original voice signal. The method and the device ensure the noise suppression effect and efficiency of key noise types, and greatly reduce the complexity of noise suppression.

Description

Noise suppression method, device, medium, and electronic apparatus
Technical Field
The present disclosure relates to the field of audio processing technologies, and in particular, to a noise suppression method, a noise suppression apparatus, a computer-readable medium, and an electronic device.
Background
In conferencing and similar audio communication software, microphone pop noise is a very common noise signal. Existing approaches suppress pop noise only incidentally, as part of conventional noise reduction processing.
However, conventional noise reduction processing has high complexity and lacks suppression means specific to pop noise, so the suppression effect on pop noise cannot be guaranteed.
In view of the above, there is a need in the art to develop a new noise suppression method and apparatus.
It should be noted that the information disclosed in the above background section is only for enhancement of understanding of the technical background of the present application, and thus may include information that does not constitute prior art known to those of ordinary skill in the art.
Disclosure of Invention
The present disclosure is directed to a noise suppression method, a noise suppression device, a computer readable medium, and an electronic device, so as to overcome the technical problems of high complexity and poor effect of noise suppression at least to some extent.
Additional features and advantages of the disclosure will be set forth in the detailed description which follows, or in part will be obvious from the description, or may be learned by practice of the disclosure.
According to an aspect of an embodiment of the present disclosure, there is provided a noise suppressing method including: acquiring low-frequency spectrum characteristics and high-frequency spectrum characteristics of an original voice signal, and performing characteristic combination processing on the low-frequency spectrum characteristics and the high-frequency spectrum characteristics to obtain frequency band energy characteristics;
determining a current frame voice signal and a previous frame voice signal in the original voice signal, and performing linear domain transformation processing on the current frame voice signal and the previous frame voice signal to obtain a frequency spectrum characteristic parameter;
performing correlation calculation on the frequency spectrum characteristic parameters and the frequency band energy characteristics to obtain cepstrum characteristics, and performing dimension reduction mapping processing on the cepstrum characteristics to obtain dimension reduction characteristics;
and performing feature fusion processing on the dimensionality reduction features and the cepstrum features to obtain gain information, and performing noise suppression processing on the gain information to obtain a noise reduction voice signal of the original voice signal.
According to an aspect of an embodiment of the present disclosure, there is provided a noise suppressing apparatus including: the characteristic combination module is configured to acquire low-frequency spectrum characteristics and high-frequency spectrum characteristics of an original voice signal and perform characteristic combination processing on the low-frequency spectrum characteristics and the high-frequency spectrum characteristics to obtain frequency band energy characteristics;
the conversion processing module is configured to determine a current frame voice signal and a previous frame voice signal in the original voice signal, and perform linear domain conversion processing on the current frame voice signal and the previous frame voice signal to obtain a frequency spectrum characteristic parameter;
the dimension reduction mapping module is configured to perform correlation calculation on the frequency spectrum characteristic parameters and the frequency band energy characteristics to obtain cepstrum characteristics, and perform dimension reduction mapping processing on the cepstrum characteristics to obtain dimension reduction characteristics;
and the noise suppression module is configured to perform feature fusion processing on the dimensionality reduction features and the cepstrum features to obtain gain information, and perform noise suppression processing on the gain information to obtain a noise reduction voice signal of the original voice signal.
In some embodiments of the present disclosure, based on the above technical solutions, the noise suppression module includes: the fusion processing submodule is configured to perform single fusion processing on the dimensionality reduction feature and the cepstrum feature to obtain a single fusion feature, and perform advanced fusion processing on the cepstrum feature and the single fusion feature to obtain an advanced fusion feature;
and the connection processing submodule is configured to perform full connection processing on the advanced fusion characteristics to obtain gain information.
In some embodiments of the present disclosure, based on the above technical solutions, the noise suppression module includes: the loss calculation submodule is configured to acquire standard gain information corresponding to the original voice signal, and perform gain loss calculation on the gain information and the standard gain information to obtain a gain loss value;
and the gain loss submodule is configured to perform noise suppression processing by using the gain information based on the gain loss value to obtain a noise-reduced voice signal of the original voice signal.
In some embodiments of the present disclosure, based on the above technical solutions, the noise suppression module includes: and the inverse transformation submodule is configured to perform inverse linear domain transformation processing on the gain information to obtain a noise reduction voice signal of the original voice signal.
In some embodiments of the present disclosure, based on the above technical solutions, the feature combining module includes: the energy characteristic submodule is configured to perform nonlinear domain transformation processing on the high-frequency spectrum characteristic to obtain a nonlinear energy characteristic;
and the combination processing sub-module is configured to perform feature combination processing on the low-frequency spectrum features and the nonlinear energy features to obtain frequency band energy features.
In some embodiments of the present disclosure, based on the above technical solution, the dimension reduction mapping module includes: a correlation calculation submodule configured to perform a cross-correlation calculation on the characteristic real part parameter and the characteristic imaginary part parameter to obtain a cross-correlation parameter;
and the energy correlation submodule is configured to perform energy correlation calculation on the cross-correlation parameter and the frequency band energy characteristics to obtain cepstrum characteristics.
In some embodiments of the present disclosure, based on the above technical solutions, the feature combining module includes: and the linear transformation submodule is configured to acquire an original voice signal and perform linear domain transformation processing on the original voice signal to obtain a low-frequency spectrum characteristic and a high-frequency spectrum characteristic.
According to an aspect of the embodiments of the present disclosure, there is provided a computer readable medium, on which a computer program is stored, which when executed by a processor implements a noise suppression method as in the above technical solution.
According to an aspect of an embodiment of the present disclosure, there is provided an electronic apparatus including: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to perform the noise suppression method as in the above solution via execution of the executable instructions.
In the technical scheme provided by the embodiment of the disclosure, on one hand, the original voice signal is divided into the low-frequency spectrum characteristic and the high-frequency spectrum characteristic for subsequent noise suppression processing, so that the suppression of the noise in the low-frequency region is more targeted, meanwhile, the noise in the high-frequency region can be suppressed, the noise suppression effect and efficiency of key noise types are ensured, and the noise suppression processing of other frequency bands is also considered; on the other hand, noise suppression processing is carried out on the gain information to obtain a noise reduction voice signal, so that the complexity of noise suppression is greatly reduced, and further the user experience is improved when the noise reduction voice signal is output.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure. It is to be understood that the drawings in the following description are merely exemplary of the disclosure, and that other drawings may be derived from those drawings by one of ordinary skill in the art without the exercise of inventive faculty. In the drawings:
fig. 1 schematically illustrates an architecture diagram of an exemplary system to which the disclosed solution applies;
FIG. 2 schematically illustrates a flow chart of steps of a method of noise suppression in some embodiments of the present disclosure;
FIG. 3 schematically illustrates a flow chart of steps of a method of feature combination processing in some embodiments of the present disclosure;
FIG. 4 schematically illustrates a flow chart of steps of a method of relevance computation in some embodiments of the present disclosure;
FIG. 5 schematically illustrates a flow chart of steps of a method of feature fusion processing in some embodiments of the present disclosure;
FIG. 6 schematically illustrates a flow chart of steps of a method of noise suppression processing in some embodiments of the present disclosure;
FIG. 7 schematically illustrates a model framework diagram for training a noise suppression model in an application scenario in accordance with some embodiments of the present disclosure;
FIG. 8 schematically illustrates a comparison between an original speech signal and a noise-reduced speech signal in an application scenario in some embodiments of the present disclosure;
FIG. 9 schematically illustrates a block diagram of a noise suppression device in some embodiments of the present disclosure;
FIG. 10 schematically illustrates a structural diagram of a computer system suitable for use with an electronic device that implements an embodiment of the disclosure.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the disclosure. One skilled in the relevant art will recognize, however, that the subject matter of the present disclosure can be practiced without one or more of the specific details, or with other methods, components, devices, steps, and so forth. In other instances, well-known methods, devices, implementations, or operations have not been shown or described in detail to avoid obscuring aspects of the disclosure.
The block diagrams shown in the figures are functional entities only and do not necessarily correspond to physically separate entities. I.e. these functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor means and/or microcontroller means.
The flow charts shown in the drawings are merely illustrative and do not necessarily include all of the contents and operations/steps, nor do they necessarily have to be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.
In the related art, voice noise reduction is a technique for removing or suppressing noise from audio in which a target voice and noise are mixed, so as to obtain the target voice. Here, suppression means controlling or avoiding the noise.
In conferencing and similar audio communication software, microphone pop noise is a very common noise signal. Pop noise is caused by plosives produced while speaking. Existing suppression approaches generally run a neural network algorithm such as a Long Short-Term Memory network (LSTM) for conventional noise reduction and suppress pop noise as part of that processing. However, conventional noise reduction has high complexity and lacks suppression means specific to pop noise, so the processing effect on pop noise cannot be guaranteed.
Based on the problems existing in the above schemes, the present disclosure provides a noise suppression method, a noise suppression apparatus, a computer readable medium, and an electronic device based on artificial intelligence and cloud technology.
Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human Intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
Artificial intelligence is a comprehensive discipline spanning a wide range of fields, covering both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
The key technologies of Speech Technology are automatic speech recognition (ASR), text-to-speech synthesis (TTS), and voiceprint recognition. Enabling computers to listen, see, speak, and feel is the development direction of future human-computer interaction, and voice is expected to become one of the most promising interaction modes.
Machine Learning (ML) is a multi-disciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and more. It specializes in studying how a computer can simulate or implement human learning behavior to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence, the fundamental way to make computers intelligent, and is applied across all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from instruction.
Cloud technology refers to a hosting technology for unifying serial resources such as hardware, software, and network in a wide area network or a local area network to realize calculation, storage, processing, and sharing of data.
Cloud technology is a general term for the network, information, integration, management platform, and application technologies applied under the cloud computing business model; it can form resource pools to be used on demand, flexibly and conveniently. Cloud computing will become an important supporting technology: background services of technical network systems, such as video websites, image websites, and other web portals, require large amounts of computing and storage resources. With the rapid development of the internet industry, each item may carry its own identification mark that must be transmitted to a background system for logic processing; data at different levels are processed separately, and all kinds of industry data need strong system background support, which can only be realized through cloud computing.
The cloud conference is an efficient, convenient and low-cost conference form based on a cloud computing technology. A user can share voice, data files and videos with teams and clients all over the world quickly and efficiently only by performing simple and easy-to-use operation through an internet interface, and complex technologies such as transmission and processing of data in a conference are assisted by a cloud conference service provider to operate.
At present, domestic cloud conferencing mainly focuses on service content delivered in the Software as a Service (SaaS) mode, including telephone, network, and video service forms; video conferencing based on cloud computing is called a cloud conference.
In the cloud conference era, data transmission, processing and storage are all processed by computer resources of video conference manufacturers, users do not need to purchase expensive hardware and install complicated software, and efficient teleconferencing can be performed only by opening a browser and logging in a corresponding interface.
The cloud conference system supports dynamic multi-server cluster deployment and provides multiple high-performance servers, greatly improving conference stability, security, and usability. In recent years, video conferencing has become popular with many users because it greatly improves communication efficiency, continuously reduces communication costs, and upgrades internal management, and it is widely used in government, military, transportation, finance, operators, education, enterprises, and other fields. After video conferencing adopts cloud computing, its convenience, speed, and ease of use become even more attractive, which will certainly stimulate a new wave of video conference applications.
The noise suppression method of this disclosure, utilizing artificial intelligence and cloud technology, targets noise in the low-frequency region more specifically while still suppressing noise in the high-frequency region. It thereby ensures the noise suppression effect and efficiency for key noise types while also covering other frequency bands, greatly reduces the complexity of noise suppression, and improves the user experience when the noise-reduced speech signal is output.
Fig. 1 shows an exemplary system architecture diagram to which the disclosed solution is applied.
As shown in fig. 1, the system architecture 100 may include a terminal 110, a network 120, and a server side 130. Wherein the terminal 110 and the server 130 are connected through the network 120.
The terminal 110 may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, and the like. Network 120 may be any type of communications medium capable of providing a communications link between terminal 110 and server 130, such as a wired communications link, a wireless communications link, or a fiber optic cable, and the like, without limitation. The server 130 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a network service, cloud communication, middleware service, a domain name service, a security service, a CDN, a big data and artificial intelligence platform, and the like.
Specifically, the terminal 110 may obtain a low-frequency spectrum feature and a high-frequency spectrum feature of the original voice signal, and perform feature combination processing on the low-frequency spectrum feature and the high-frequency spectrum feature to obtain a band energy feature. Then, determining a current frame voice signal and a previous frame voice signal in the original voice signal, and performing linear domain transformation processing on the current frame voice signal and the previous frame voice signal to obtain a frequency spectrum characteristic parameter. Further, performing correlation calculation on the frequency spectrum characteristic parameters and the frequency band energy characteristics to obtain cepstrum characteristics, and performing dimension reduction mapping processing on the cepstrum characteristics to obtain dimension reduction characteristics. And finally, performing feature fusion processing on the dimensionality reduction features and the cepstrum features to obtain gain information, and performing noise suppression processing on the gain information to obtain a noise reduction voice signal of the original voice signal.
In addition, the noise suppression method in the embodiment of the present disclosure may be applied to a terminal, and may also be applied to a server, which is not particularly limited in the present disclosure.
The embodiments of the present disclosure are mainly illustrated by applying the noise suppression method to the terminal 110.
The following detailed description of the noise suppression method, the noise suppression device, the computer readable medium, and the electronic apparatus according to the present disclosure is provided in conjunction with the specific embodiments.
Fig. 2 schematically illustrates a flow chart of steps of a noise suppression method in some embodiments of the present disclosure, and as shown in fig. 2, the noise suppression method may mainly include the following steps:
and S210, acquiring low-frequency spectrum characteristics and high-frequency spectrum characteristics of the original voice signal, and performing characteristic combination processing on the low-frequency spectrum characteristics and the high-frequency spectrum characteristics to obtain frequency band energy characteristics.
Step S220, determining a current frame voice signal and a previous frame voice signal in the original voice signal, and performing linear domain transformation processing on the current frame voice signal and the previous frame voice signal to obtain frequency spectrum characteristic parameters.
And S230, performing correlation calculation on the frequency spectrum characteristic parameters and the frequency band energy characteristics to obtain cepstrum characteristics, and performing dimension reduction mapping processing on the cepstrum characteristics to obtain dimension reduction characteristics.
And S240, carrying out feature fusion processing on the dimensionality reduction features and the cepstrum features to obtain gain information, and carrying out noise suppression processing on the gain information to obtain a noise reduction voice signal of the original voice signal.
In the exemplary embodiment of the disclosure, on one hand, the original voice signal is divided into the low-frequency spectrum feature and the high-frequency spectrum feature for subsequent noise suppression processing, so that the suppression of the noise in the low-frequency region is more targeted, meanwhile, the noise in the high-frequency region can be suppressed, the noise suppression effect and efficiency of key noise types are ensured, and the noise suppression processing of other frequency bands is also considered; on the other hand, noise suppression processing is carried out on the gain information to obtain a noise reduction voice signal, so that the complexity of noise suppression is greatly reduced, and further the user experience is improved when the noise reduction voice signal is output.
The respective steps of the noise suppression method are explained in detail below.
In step S210, a low-frequency spectrum feature and a high-frequency spectrum feature of the original speech signal are obtained, and a feature combination process is performed on the low-frequency spectrum feature and the high-frequency spectrum feature to obtain a band energy feature.
In an exemplary embodiment of the present disclosure, the original speech signal may be a noisy speech signal. The original speech signal may be a speech signal captured in a real environment by an audio capture device, such as a microphone.
For example, in a video conference scenario, a microphone captures the voice signal produced when a participant speaks. While acquiring the voice signal, the microphone may also pick up a noise signal, which may be environmental noise, microphone pop, or the like; this exemplary embodiment is not particularly limited in this respect.
Microphone pop is caused by plosive bursts produced during speech. Specifically, while audio is being captured, pop noise is generated when the burst of air squeezed out by a participant's lips strikes the microphone diaphragm. Words containing plosive consonants such as "p" or "b" in particular produce an airflow with energy roughly equivalent to a 60 mph wind; such a large burst of energy acting on the diaphragm degrades the voice quality and affects the overall quality of the original audio.
Further, the original speech signal is subjected to linear domain transformation processing to obtain corresponding high-frequency spectrum characteristics and low-frequency spectrum characteristics.
In an alternative embodiment, an original speech signal is obtained, and linear domain transformation processing is performed on the original speech signal to obtain a low-frequency spectrum feature and a high-frequency spectrum feature.
Here, the linear-domain transform processing on the original speech signal may be a conversion of the signal from the time domain to the frequency domain; for example, Fast Fourier Transform (FFT) processing may be performed on the original speech signal.
The FFT algorithm converts the time domain into the frequency domain and is in fact a fast algorithm for the Discrete Fourier Transform (DFT). In digital signal processing, the frequency-domain characteristics of a signal are usually obtained with an FFT algorithm. The purpose of the transform is to represent the same time-domain signal in the frequency domain, where its characteristics can be analyzed more easily.
After the original speech signal is processed by the FFT algorithm, a series of complex numbers is obtained; these complex numbers are the amplitude features (not the amplitudes themselves) of the original speech signal in the corresponding frequency domain. These amplitude features are the spectral features of the original speech signal.
The frequency spectrum is an abbreviation of frequency spectrum density and is a distribution curve of frequency. The complex oscillation is decomposed into harmonic oscillations with different amplitudes and frequencies, and the pattern of the amplitude of the harmonic oscillation arranged according to the frequency is a frequency spectrum. Frequency spectrum is widely used in acoustic, optical and radio technologies.
Moreover, since the pop-noise signal is usually a low-frequency signal below 500 Hz (hertz), 500 Hz is used as the dividing point to split the spectral features of the original speech signal into low-frequency spectral features and high-frequency spectral features.
The spectral features of the original speech signal cover 0-8000 Hz, collected at a sampling rate of 16000 Hz, so the low-frequency spectral features span 0-500 Hz and the high-frequency spectral features span 500-8000 Hz.
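For concreteness, this split can be sketched as follows in Python with numpy; the function name, the framing, and the 512-point FFT size (taken from the 512-point FFT mentioned later in this description) are illustrative assumptions, not the patented implementation.

```python
import numpy as np

def split_spectrum(frame, sample_rate=16000, n_fft=512):
    """Transform one speech frame to the frequency domain and split it at
    500 Hz into low-frequency and high-frequency spectral features."""
    spectrum = np.fft.rfft(frame, n=n_fft)               # complex amplitude features
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)  # 0-8000 Hz bin centers
    low_band = spectrum[freqs < 500]                     # 0-500 Hz, kept linear
    high_band = spectrum[freqs >= 500]                   # 500-8000 Hz, mapped to Bark later
    return low_band, high_band
```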
After the low-frequency spectrum feature and the high-frequency spectrum feature of the original voice signal are obtained, the low-frequency spectrum feature and the high-frequency spectrum feature can be subjected to feature combination processing to obtain a frequency band energy feature.
In an alternative embodiment, fig. 3 shows a flow chart of the steps of a method of feature combination processing, which, as shown in fig. 3, comprises at least the following steps: in step S310, nonlinear domain transform processing is performed on the high-frequency spectrum feature to obtain a nonlinear energy feature.
The nonlinear domain transformation process may be a process of converting the high frequency spectrum characteristics of the frequency domain into Bark domain.
The Bark domain is a psychoacoustic measure of sound. Because of the particular configuration of the cochlea of the human ear, the human auditory system produces a series of Critical bands (Critical bands). The critical frequency band is a sound frequency band, and a sound signal in the same critical frequency band is easily masked, that is, the sound signal in the critical frequency band is easily masked by another signal with large energy and close frequency, so that the human auditory system cannot receive the sound signal. If the sound signal is converted from the frequency domain into the critical frequency bands, each critical frequency band becomes a Bark band, that is, the sound signal is converted from the frequency domain into the Bark domain.
Specifically, the nonlinear domain transform process can refer to formula (1):
Bark(f) = 13 × arctan(0.00076 × f) + 3.5 × arctan((f / 7500)²)    (1)
where arctan is the arctangent function, f is a frequency in the high-frequency spectral features of the original speech signal, and Bark(f) is its Bark-domain representation.
The nonlinear energy features of the high-frequency spectral features are obtained through the calculation of formula (1). The nonlinear energy features can be represented by 15 Bark bands, which sparsifies the high-frequency spectral features. The Bark domain has a compression effect on high-frequency spectral features and an amplification effect on low-frequency spectral features; however, to process the pop-noise signal specifically, the low-frequency spectral features of the original speech signal may be left unconverted to the Bark domain.
In step S320, a feature combination process is performed on the low-frequency spectrum feature and the nonlinear energy feature to obtain a band energy feature.
Since nonlinear-domain transform processing is not performed on the low-frequency spectral features, they are not converted into the Bark domain and are not sparsified; they are directly represented by 15 bands, which are treated as Bark bands. In this case the low-frequency spectral features are obtained by linear-domain transform processing with a 512-point FFT algorithm.
Further, the low-frequency spectrum characteristics represented by the 15 Bark bands and the nonlinear energy characteristics divided into the 15 Bark bands are combined to obtain the band energy characteristics of the 30 Bark bands.
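A sketch of this combination step follows, assuming the Zwicker form of the Bark mapping for formula (1) and per-band energies computed as summed squared magnitudes; the exact band-edge construction is an illustrative assumption.

```python
import numpy as np

def bark(f):
    # Assumed Zwicker form of formula (1)
    return 13.0 * np.arctan(0.00076 * f) + 3.5 * np.arctan((f / 7500.0) ** 2)

def band_energy_features(low_band, high_band, sample_rate=16000, n_fft=512):
    """Combine 15 linear low-frequency bands (0-500 Hz) with 15 Bark-spaced
    high-frequency bands (500-8000 Hz) into 30 band energy features."""
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    low_freqs, high_freqs = freqs[freqs < 500], freqs[freqs >= 500]

    # 0-500 Hz: 15 equal-width bands, no Bark sparsification
    low_edges = np.linspace(0.0, 500.0, 16)
    low_idx = np.digitize(low_freqs, low_edges[1:-1])
    low_energy = np.bincount(low_idx, np.abs(low_band) ** 2, minlength=15)

    # 500-8000 Hz: 15 bands equally spaced on the Bark scale (sparsification)
    high_edges = np.interp(np.linspace(bark(500.0), bark(8000.0), 16),
                           bark(high_freqs), high_freqs)
    high_idx = np.digitize(high_freqs, high_edges[1:-1])
    high_energy = np.bincount(high_idx, np.abs(high_band) ** 2, minlength=15)

    return np.concatenate([low_energy, high_energy])     # 30 Bark-band energies
```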
In the exemplary embodiment, the nonlinear energy characteristics and the low-frequency spectrum characteristics after the nonlinear domain transformation processing are combined, and not only is the high-frequency spectrum characteristics subjected to the sparsification processing, so that low-frequency noise can be better suppressed subsequently, but also the complexity of noise suppression is further reduced, and the noise suppression efficiency is improved.
In step S220, a current frame speech signal and a previous frame speech signal are determined in the original speech signal, and a linear domain transformation process is performed on the current frame speech signal and the previous frame speech signal to obtain a spectral feature parameter.
In an exemplary embodiment of the present disclosure, a frame may be determined in an original speech signal as a current frame speech signal, and a previous frame of the current frame speech signal is continuously determined in the original speech signal as a previous frame speech signal, so as to perform linear domain transform processing on the current frame speech signal and the previous frame speech signal to obtain spectral feature parameters.
The linear domain transformation processing of the current frame speech signal and the previous frame speech signal can also be realized by an FFT algorithm. Specifically, formula (2) can be referred to:
FFT(t, f) = x(t, f) + i × y(t, f)    (2)
where FFT(t, f) denotes the spectral features of the current frame and previous frame speech signals in the frequency domain, expressed as the complex vector x + yi; x denotes the real part and y denotes the imaginary part of the corresponding spectral feature.
The real part and the imaginary part of the frequency spectrum characteristics of the current frame voice signal and the previous frame voice signal are corresponding frequency spectrum characteristic parameters.
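A minimal sketch of this step, again with numpy; the frame inputs and the FFT size are assumptions.

```python
import numpy as np

def spectral_feature_parameters(current_frame, previous_frame, n_fft=512):
    """Linear-domain (FFT) transform of the current and previous frames,
    returning the real-part and imaginary-part feature parameters of formula (2)."""
    cur = np.fft.rfft(current_frame, n=n_fft)
    prev = np.fft.rfft(previous_frame, n=n_fft)
    return (cur.real, cur.imag), (prev.real, prev.imag)
```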
In step S230, a correlation calculation is performed on the spectral feature parameters and the band energy features to obtain cepstrum features, and dimension reduction mapping processing is performed on the cepstrum features to obtain dimension reduction features.
In an exemplary embodiment of the present disclosure, after obtaining the spectral feature parameter and the band energy feature, a correlation calculation may be performed on the spectral feature parameter and the band energy feature to obtain a cepstrum feature.
In an alternative embodiment, the spectral feature parameters include a real feature parameter and an imaginary feature parameter, and fig. 4 shows a flow chart of steps of a method of correlation calculation, as shown in fig. 4, the method at least including the following steps: in step S410, a cross-correlation calculation is performed on the real part parameter and the imaginary part parameter to obtain a cross-correlation parameter.
The cross-correlation calculation for the characteristic real part parameter and the characteristic imaginary part parameter can be referred to formula (3):
r_xy[l] = Σ_n x[n] × y[n − l]    (3)
where r_xy[l] measures the degree of correlation between the energy sequence x[n] and the energy sequence y[n − l]; the larger r_xy[l] is, the stronger the correlation between x[n] and y[n − l]. Substituting the real-part and imaginary-part feature parameters into formula (3) yields the cross-correlation parameter of the current frame speech signal and the previous frame speech signal.
In step S420, a cepstrum feature is obtained by performing correlation calculation on the cross-correlation parameter and the band energy feature.
After the cross-correlation parameters are obtained, correlation calculation can be performed on the cross-correlation parameters and the band energy characteristics to obtain cepstrum characteristics.
Specifically, the band energy features are first squared, and the squared band energy features are then divided by the cross-correlation parameter to obtain the cepstral features. Other cepstral feature calculation methods are also possible; this exemplary embodiment is not particularly limited in this respect.
The cepstral features may be Bark-Frequency Cepstral Coefficients (BFCC). BFCC is a commonly used feature parameter based on the characteristics of human auditory perception, and it can describe the energy distribution of sound over frequency.
In the exemplary embodiment, the cepstrum feature can be obtained by performing correlation calculation on the characteristic real part parameter, the characteristic imaginary part parameter and the band energy feature, and a data base is provided for noise suppression.
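This calculation can be sketched as below. The text does not fully specify which sequences enter formula (3) or the lag l, so the pairing of a real-part sequence with an imaginary-part sequence and the single-lag evaluation are assumptions.

```python
import numpy as np

def cross_correlation(x, y, lag=1):
    """Cross-correlation parameter r_xy[l] of formula (3) for a single lag l."""
    n = np.arange(lag, len(x))
    return float(np.sum(x[n] * y[n - lag]))

def cepstral_features(band_energy, real_part, imag_part, lag=1):
    """Square the band energy features, then divide by the cross-correlation
    parameter, as described above; no zero-guard is specified in the text."""
    r = cross_correlation(real_part, imag_part, lag)
    return band_energy ** 2 / r
```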
After obtaining the cepstral features, a dimension reduction mapping process may be performed on the cepstral features.
Specifically, the cepstrum feature may be input to an activation function layer, so that the activation function layer performs a dimension reduction mapping process on the cepstrum feature. For example, the cepstral features may be 30-dimensional vectors, and 20-dimensional reduced features may be obtained after input to the activation function layer. The activation function of the activation function layer may be a Tanh function, or may be other activation functions, which is not particularly limited in this exemplary embodiment.
It should be noted that the number of nodes of the activation function layer may be changed according to actual requirements, for example, to 30.
In step S240, feature fusion processing is performed on the dimensionality reduction feature and the cepstrum feature to obtain gain information, and noise suppression processing is performed on the gain information to obtain a noise reduction speech signal of the original speech signal.
In an exemplary embodiment of the present disclosure, after obtaining the dimension reduction feature, the dimension reduction feature and the cepstrum feature may be subjected to feature fusion processing to obtain gain information. It is to be noted that the feature fusion process may be a two-layer feature fusion process.
In an alternative embodiment, fig. 5 shows a flow chart of the steps of a method of feature fusion processing, as shown in fig. 5, the method comprising at least the steps of: in step S510, a single fusion process is performed on the dimensionality reduction feature and the cepstrum feature to obtain a single fusion feature, and an advanced fusion process is performed on the cepstrum feature and the single fusion feature to obtain an advanced fusion feature.
Specifically, the single fusion processing on the dimension-reduced features and the cepstral features may be performed by inputting them into a Gated Recurrent Unit (GRU), so that the GRU performs feature fusion processing on the two to obtain the single fusion features.
The GRU is a newer generation of recurrent neural network, similar to the Long Short-Term Memory network (LSTM). A GRU has no cell state and uses the hidden state to convey information. It has only two gates, a reset gate and an update gate: the reset gate determines how much past information to forget, and the update gate determines which information to discard and which new information to add.
The GRU used for this feature fusion processing can be a GRU ReLU layer, where ReLU (Rectified Linear Unit) is a nonlinear activation function, typically a ramp function or one of its variants. This GRU can output a 30-dimensional single fusion feature.
It should be noted that the number of nodes of the gated loop unit may be modified according to actual requirements, for example, 50 nodes may be used instead.
Further, advanced fusion processing is carried out on the single fusion features and the cepstrum features to obtain advanced fusion features.
Specifically, the advanced fusion processing may be performed by inputting the single fusion features and the cepstral features into another GRU, so that this GRU performs the advanced fusion processing on them.
This GRU can also be a GRU ReLU layer, and it can output a 60-dimensional advanced fusion feature.
It should be noted that the number of nodes of the gated loop unit may also be modified according to actual requirements, for example, 50 nodes may be used instead.
In step S520, the advanced fusion feature is fully connected to obtain gain information.
After the advanced fusion feature is obtained, the advanced fusion feature may be subjected to full-connection processing to obtain gain information.
Specifically, the advanced fusion feature may be input into a full connection layer, so that the full connection layer performs full connection processing on the advanced fusion feature.
The full connection processing may be implemented in the dense (fully connected) layer of a deep learning network, where each node is connected to all nodes of the previous layer; that is, the 60-dimensional advanced fusion features are combined to obtain 30-dimensional gain information matching the 30 Bark bands.
In the exemplary embodiment, the gain information for noise suppression can be obtained by performing feature fusion processing twice on the dimensionality reduction feature and the cepstrum feature, so that the processing flow of noise suppression is greatly simplified, and the complexity of noise suppression is reduced.
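The whole gain-estimation network described above (Tanh dimension-reduction layer, two GRU ReLU fusion layers, fully connected output) can be sketched as a small Keras model. The skip connections from the cepstral features, the sequence handling, and the sigmoid output activation are assumptions consistent with, but not stated by, the description.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_noise_suppression_model(n_bands=30):
    """Sketch of the described network: 30-d BFCC in, 30-d band gains out."""
    bfcc = layers.Input(shape=(None, n_bands))             # 30-d cepstral features
    reduced = layers.Dense(20, activation="tanh")(bfcc)    # dimension-reduction mapping
    single = layers.GRU(30, activation="relu", return_sequences=True)(
        layers.Concatenate()([reduced, bfcc]))             # single fusion features
    advanced = layers.GRU(60, activation="relu", return_sequences=True)(
        layers.Concatenate()([bfcc, single]))              # advanced fusion features
    # Output activation is not specified in the text; sigmoid keeps gains in [0, 1].
    gains = layers.Dense(n_bands, activation="sigmoid")(advanced)
    return tf.keras.Model(bfcc, gains)
```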
After the gain information is obtained, noise suppression processing may be performed on the gain information to obtain a noise-reduced speech signal corresponding to the original speech signal.
In an alternative embodiment, fig. 6 shows a flow chart of the steps of a method of noise suppression processing, which, as shown in fig. 6, comprises at least the following steps: in step S610, standard gain information corresponding to the original speech signal is obtained, and a gain loss value is obtained by performing gain loss calculation on the gain information and the standard gain information.
When the original speech signal used in the training process is a noisy speech signal, corresponding standard gain information can be obtained at the same time for training the activation function layer, the two GRU layers, and the fully connected layer. The standard gain information is the true gain information of the original speech signal, comprising the true gain values applied to its different frequency bands. A true gain value may be calculated from the energy values of the noise-free signal and the noise signal within the original speech signal: the two energy values are summed, and the energy value of the noise-free signal is divided by the result of the summation to obtain the true gain value.
To calculate the difference between the standard gain information and the predicted gain information, a loss function may be constructed by which a gain loss value between the standard gain information and the gain information is calculated.
Specifically, the loss function may be constructed based on the distance loss between the estimated standard gain information and the gain information, and the distance may be an euclidean distance, a cosine distance, a Mean Squared Error (MSE for short), and the like, which is not particularly limited in this exemplary embodiment.
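A sketch of the true-gain and gain-loss calculation, using the MSE distance named above; the epsilon guard against division by zero is an added assumption.

```python
import numpy as np

def standard_gain(clean_energy, noise_energy, eps=1e-12):
    """True per-band gain: noise-free energy divided by the summed
    noise-free + noise energy, as described above."""
    return clean_energy / (clean_energy + noise_energy + eps)

def gain_loss(predicted_gain, true_gain):
    """Mean squared error between predicted and standard gain information."""
    return float(np.mean((predicted_gain - true_gain) ** 2))
```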
In step S620, noise suppression processing is performed using the gain information based on the gain loss value to obtain a noise-reduced speech signal of the original speech signal.
When the gain loss value meets the preset condition, for example, the gain loss value is smaller than the corresponding loss threshold value, it indicates that the training of the activation function layer, the two layers of GRUs and the full connection layer is successful, and the gain information can be utilized to perform noise suppression processing to obtain the noise reduction voice signal.
And when the gain loss value does not satisfy the preset condition, for example, the gain loss value is greater than or equal to the corresponding loss threshold, the parameters of the activation function layer, the two layers of GRUs and the full-link layer may be continuously adjusted to minimize the value of the loss function, thereby obtaining the trained activation function layer, the two layers of GRUs and the full-link layer.
In the exemplary embodiment, in the training process, whether noise suppression is performed by using the gain information can be determined by using the standard gain information, so that the accuracy of the training noise suppression model and the noise suppression effect are ensured.
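Training then reduces to fitting the model sketched earlier against the standard gains under the MSE gain loss. A minimal sketch follows; the optimizer, epoch count, and stand-in arrays are assumptions.

```python
import numpy as np

model = build_noise_suppression_model()       # from the earlier model sketch
model.compile(optimizer="adam", loss="mse")   # MSE distance as the gain loss
# Hypothetical training data: (batch, time, 30) BFCC features and standard gains.
bfcc_batch = np.random.rand(8, 100, 30).astype("float32")   # stand-in features
gain_batch = np.random.rand(8, 100, 30).astype("float32")   # stand-in true gains
model.fit(bfcc_batch, gain_batch, epochs=10, verbose=0)
```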
In the application process, the noise suppression processing can be directly carried out by using the gain information without the judgment process of the standard gain information.
In an alternative embodiment, the gain information is subjected to inverse linear-domain transform processing to obtain a noise-reduced speech signal of the original speech signal.
The inverse linear-domain transform processing may be an Inverse Fast Fourier Transform (IFFT). The IFFT uses a fast Fourier transform algorithm to transform the gain-weighted spectrum from the frequency domain back to the time domain, obtaining a noise-reduced speech signal in which the noise is suppressed.
The noise-reduced speech signal is a normal vocal signal, that is, the vocal signal with the pop component removed: the voice a speaker normally produces when speaking or singing.
In the present exemplary embodiment, the gain information is subjected to inverse linear transformation processing to obtain a noise reduction voice signal, and the noise reduction voice signal can be used for output playing, so as to improve the auditory perception of the user.
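A sketch of this synthesis step; expanding the 30 band gains back to per-FFT-bin gains before the inverse transform (here by band lookup) is an assumption, since the text does not specify the expansion.

```python
import numpy as np

def apply_gains_and_synthesize(noisy_frame, band_gains, band_edges,
                               sample_rate=16000, n_fft=512):
    """Weight each FFT bin by its band's gain, then IFFT back to the time
    domain to obtain the noise-reduced frame."""
    spectrum = np.fft.rfft(noisy_frame, n=n_fft)
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    # Map every bin to one of the 30 bands delimited by band_edges (31 edges).
    bin_band = np.clip(np.digitize(freqs, band_edges) - 1, 0, len(band_gains) - 1)
    return np.fft.irfft(spectrum * band_gains[bin_band], n=n_fft)
```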
The noise suppression method provided in the embodiments of the present disclosure is described in detail below with reference to a specific application scenario.
Fig. 7 shows a model framework diagram of training a noise suppression model in an application scenario, and as shown in fig. 7, in step S710, cepstral feature samples are input.
The cepstral feature samples are obtained by dividing the 0-8000 Hz spectrum of the speech signal samples into 30 Bark bands.
Specifically, for wideband speech sampled at 16000 Hz, the 0-8000 Hz spectrum of each speech signal sample is divided into 30 Bark bands, i.e., sparsification is performed.
Since pop noise exists mainly in the low-frequency region, the processing of the low-frequency part of the speech signal samples is emphasized. Bark-domain sparsification is not applied to the 0-500 Hz band; that band is directly represented by 15 bands, equivalent to linear-domain transform processing with a 512-point FFT algorithm. Only the 500-8000 Hz band is sparsified, also into 15 Bark bands. Feature combination of the low-frequency and high-frequency parts of the speech signal samples thus yields the band energy features over 30 Bark bands.
In step S720, dimension reduction mapping processing is performed by using the Tanh function in the fully connected layer.
A current frame speech signal and a previous frame speech signal are determined among the speech signal samples, and linear-domain transform processing is performed on both. Specifically, the FFT algorithm may be used to perform this linear-domain transform. The real part and the imaginary part of the current frame and previous frame speech signals are thus obtained as the spectral feature parameters.
Further, cross-correlation calculation is carried out on the characteristic real part parameters and the characteristic imaginary part parameters to obtain cross-correlation parameters, and correlation calculation is carried out on the cross-correlation parameters and the frequency band energy characteristics to obtain cepstrum characteristics.
Specifically, cross-correlation parameters are calculated according to a formula (3), then, the frequency band energy characteristics are squared, and the frequency band energy characteristics after the square calculation are divided by the cross-correlation parameters, so that cepstrum characteristics can be obtained. And the cepstral feature may be BFCC.
And inputting the BFCC into an activation function layer so that the activation function layer performs dimension reduction mapping processing on the cepstrum characteristics. The cepstral features may be 30-dimensional vectors, while 20-dimensional reduced features may result after input to the activation function layer. Wherein, the activation function of the activation function layer may be a Tanh function.
In step S730, a single fusion process is performed with the ReLU function in the gated loop unit.
The single fusion processing on the dimension-reduced features and the cepstral features may be performed by inputting them into the GRU, so that the GRU performs feature fusion processing on the two to obtain the single fusion features.
The GRU used for this feature fusion processing can be a GRU ReLU layer, where ReLU (Rectified Linear Unit) is a nonlinear activation function, typically a ramp function or one of its variants. This GRU can output a 30-dimensional single fusion feature.
In step S740, the advanced fusion processing is performed using the ReLU function in the gated loop unit.
The advanced fusion processing may be performed by inputting the single fusion features and the cepstral features into another GRU, so that this GRU performs the advanced fusion processing on them.
This GRU can also be a GRU ReLU layer, and it can output a 60-dimensional advanced fusion feature.
In step S750, full connection processing is performed in the full connection layer.
Because the two GRU layers have small node counts, 30 and 60 respectively, the gain information can be obtained with a 30-dimensional fully connected layer.
Specifically, the advanced fusion feature may be input into a full connection layer, so that the full connection layer performs full connection processing on the advanced fusion feature.
The full connection processing may be implemented by a dense layer of the deep learning network, in which each node is connected to all nodes of the previous layer; that is, the 60-dimensional advanced fusion features are combined to obtain 30-dimensional gain information matching the 30 Bark bands.
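Putting steps S720 through S750 together, the gain-estimation network can be sketched end to end with the layer sizes stated above (a 20-dimensional Tanh layer, GRU ReLU layers of 30 and 60 units, and a 30-dimensional fully connected output). The framework choice, the sequence handling, and the sigmoid on the output gains are assumptions not fixed by the text.

```python
# Minimal Keras sketch of the described network; layer sizes follow the text.
import tensorflow as tf
from tensorflow.keras import layers, Model

cepstrum = layers.Input(shape=(None, 30), name="bfcc")   # 30-dim BFCC per frame
reduced = layers.Dense(20, activation="tanh")(cepstrum)  # step S720: 30 -> 20
gru1_in = layers.concatenate([reduced, cepstrum])        # single fusion input
gru1 = layers.GRU(30, activation="relu",
                  return_sequences=True)(gru1_in)        # step S730
gru2_in = layers.concatenate([gru1, cepstrum])           # advanced fusion input
gru2 = layers.GRU(60, activation="relu",
                  return_sequences=True)(gru2_in)        # step S740
gains = layers.Dense(30, activation="sigmoid")(gru2)     # step S750: per-band gains
model = Model(cepstrum, gains)
```

With these sizes the network has on the order of tens of thousands of parameters, which is consistent with the roughly 50 KB quantized model size reported below.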
When the gain loss between the gain information and the standard gain information is small enough, the noise suppression model training is complete and the model can be used for online enhancement.
In the online enhancement process, the 30-dimensional BFCC features of the original speech signal to be enhanced are likewise extracted and input into the trained noise suppression model, so that the 30-dimensional gain information calculated by the noise suppression model can be used to recover a 'clean' noise-reduced speech signal.
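The recovery step can be sketched as follows, reusing the hypothetical edges_hz band layout from the earlier sketch; the interpolation from 30 band gains to per-bin gains is an assumption, since the text does not specify how the gains are expanded before the inverse transformation.

```python
# Sketch: apply per-band gains to a noisy frame and invert the FFT.
import numpy as np

def apply_band_gains(frame: np.ndarray, gains: np.ndarray,
                     edges_hz: np.ndarray, fs: int = 16000,
                     n_fft: int = 512) -> np.ndarray:
    spectrum = np.fft.rfft(frame, n=n_fft)
    bins = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    centers = 0.5 * (edges_hz[:-1] + edges_hz[1:])       # band center frequencies
    bin_gains = np.interp(bins, centers, gains)          # band -> bin interpolation
    return np.fft.irfft(spectrum * bin_gains, n=n_fft)   # noise-reduced frame
```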
The noise-reduced speech signal is a normal unvoiced signal, that is, the unvoiced component of the speech signal other than the mic-pop (plosive) signal: the unvoiced sound normally produced when a speaker is talking or singing.
Once the noise suppression model is online, it can be applied to any service scenario that requires noise reduction of speech signals, such as voice conferences, video conferences, audio recording, and video recording.
It should be noted that the number of nodes in the activation function layer and the GRU in the noise suppression model may be changed according to the actual situation, for example, the number of nodes in the activation function layer is 30, and the number of nodes in the GRU is 50.
In addition, the gated recurrent units may also be replaced with other neural network units such as LSTM (Long Short-Term Memory), DNN (Deep Neural Networks), or CNN (Convolutional Neural Networks), which is not limited in this exemplary embodiment.
Fig. 8 shows a schematic comparison between the original speech signal and the noise-reduced speech signal in an application scenario. As shown in Fig. 8, the noise of the original speech signal is concentrated mainly in the low-frequency region, and this low-frequency noise is clearly removed in the noise-reduced speech signal, indicating a good noise suppression effect. Meanwhile, the parameters of the noise suppression model quantize to about 50 KB (kilobytes), which greatly reduces the complexity of the noise suppression algorithm.
Based on the above application scenarios, on the one hand, the noise suppression method provided by the embodiments of the present disclosure divides the original speech signal into low-frequency spectral features and high-frequency spectral features for subsequent noise suppression processing. This makes the suppression of low-frequency noise more targeted while high-frequency noise can still be suppressed, ensuring the suppression effect and efficiency for the key noise types without neglecting the other frequency bands. On the other hand, performing noise suppression processing with the gain information to obtain the noise-reduced speech signal greatly reduces the complexity of noise suppression, which in turn improves the user experience when the noise-reduced speech signal is output.
It should be noted that although the various steps of the methods of the present disclosure are depicted in the drawings in a particular order, this does not require or imply that these steps must be performed in this particular order, or that all of the depicted steps must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions, etc.
Embodiments of the apparatus of the present disclosure are described below, which may be used to perform the noise suppression method in the above-described embodiments of the present disclosure. For details that are not disclosed in the embodiments of the apparatus of the present disclosure, please refer to the embodiments of the noise suppression method described above in the present disclosure.
Fig. 9 schematically shows a block diagram of a noise suppression device in some embodiments of the present disclosure, and as shown in fig. 9, the noise suppression device 900 may mainly include: a feature combining module 910, a transformation processing module 920, a dimension reduction mapping module 930, and a noise suppression module 940.
The feature combination module 910 is configured to obtain low-frequency spectrum features and high-frequency spectrum features of an original voice signal, and perform feature combination processing on the low-frequency spectrum features and the high-frequency spectrum features to obtain band energy features; a transform processing module 920, configured to determine a current frame speech signal and a previous frame speech signal in an original speech signal, and perform linear domain transform processing on the current frame speech signal and the previous frame speech signal to obtain a spectral feature parameter; the dimension reduction mapping module 930 is configured to perform correlation calculation on the frequency spectrum characteristic parameters and the frequency band energy characteristics to obtain cepstrum characteristics, and perform dimension reduction mapping processing on the cepstrum characteristics to obtain dimension reduction characteristics; and a noise suppression module 940 configured to perform feature fusion processing on the dimensionality reduction features and the cepstrum features to obtain gain information, and perform noise suppression processing on the gain information to obtain a noise-reduced voice signal of the original voice signal.
In some embodiments of the disclosure, a noise suppression module comprises: the fusion processing submodule is configured to perform single fusion processing on the dimensionality reduction feature and the cepstrum feature to obtain a single fusion feature, and perform advanced fusion processing on the cepstrum feature and the single fusion feature to obtain an advanced fusion feature;
and the connection processing submodule is configured to perform full connection processing on the advanced fusion characteristics to obtain gain information.
In some embodiments of the disclosure, a noise suppression module comprises: the loss calculation submodule is configured to acquire standard gain information corresponding to the original voice signal, and perform gain loss calculation on the gain information and the standard gain information to obtain a gain loss value;
and the gain loss submodule is configured to perform noise suppression processing by using the gain information based on the gain loss value to obtain a noise-reduced voice signal of the original voice signal.
In some embodiments of the disclosure, the noise suppression module comprises: and the inverse transformation submodule is configured to perform inverse linear domain transformation processing on the gain information to obtain a noise reduction voice signal of the original voice signal.
In some embodiments of the disclosure, a feature combination module, comprises: the energy characteristic submodule is configured to perform nonlinear domain transformation processing on the high-frequency spectrum characteristic to obtain a nonlinear energy characteristic;
and the combination processing sub-module is configured to perform feature combination processing on the low-frequency spectrum features and the nonlinear energy features to obtain frequency band energy features.
In some embodiments of the disclosure, the dimension reduction mapping module comprises: the correlation calculation submodule is configured to perform cross-correlation calculation on the characteristic real part parameter and the characteristic imaginary part parameter to obtain a cross-correlation parameter;
and the energy correlation submodule is configured to perform energy correlation calculation on the cross-correlation parameters and the band energy characteristics to obtain cepstrum characteristics.
In some embodiments of the disclosure, a feature combination module, comprises: and the linear transformation submodule is configured to acquire an original voice signal and perform linear domain transformation processing on the original voice signal to obtain a low-frequency spectrum characteristic and a high-frequency spectrum characteristic.
The specific details of the noise suppression device provided in each embodiment of the present disclosure have been described in detail in the corresponding method embodiment, and therefore are not described herein again.
FIG. 10 illustrates a schematic structural diagram of a computer system suitable for use in implementing an electronic device of an embodiment of the present disclosure.
It should be noted that the computer system 1000 of the electronic device shown in fig. 10 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 10, the computer system 1000 includes a Central Processing Unit (CPU)1001 that can perform various appropriate actions and processes according to a program stored in a Read-Only Memory (ROM) 1002 or a program loaded from a storage section 1008 into a Random Access Memory (RAM) 1003. In the RAM 1003, various programs and data necessary for system operation are also stored. The CPU1001, ROM 1002, and RAM 1003 are connected to each other via a bus 1004. An Input/Output (I/O) interface 1005 is also connected to the bus 1004.
The following components are connected to the I/O interface 1005: an input section 1006 including a keyboard, a mouse, and the like; an output section 1007 including a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and a speaker; a storage portion 1008 including a hard disk and the like; and a communication section 1009 including a Network interface card such as a LAN (Local Area Network) card, a modem, or the like. The communication section 1009 performs communication processing via a network such as the internet. The driver 1010 is also connected to the I/O interface 1005 as necessary. A removable medium 1011 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 1010 as necessary, so that a computer program read out therefrom is mounted into the storage section 1008 as necessary.
In particular, the processes described in the various method flowcharts may be implemented as computer software programs, according to embodiments of the present disclosure. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication part 1009 and/or installed from the removable medium 1011. When the computer program is executed by a Central Processing Unit (CPU)1001, various functions defined in the system of the present application are executed.
It should be noted that the computer readable medium shown in the embodiments of the present disclosure may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a Read-Only Memory (ROM), an Erasable Programmable Read-Only Memory (EPROM), a flash Memory, an optical fiber, a portable Compact Disc Read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer-readable signal medium may include a propagated data signal with computer-readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wired, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
It should be noted that although in the above detailed description several modules or units of the device for action execution are mentioned, such a division is not mandatory. Indeed, the features and functionality of two or more modules or units described above may be embodied in one module or unit, according to embodiments of the present disclosure. Conversely, the features and functions of one module or unit described above may be further divided into embodiments by a plurality of modules or units.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which may be a personal computer, a server, a touch terminal, or a network device, etc.) to execute the method according to the embodiments of the present disclosure.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. A method of noise suppression, the method comprising:
acquiring low-frequency spectrum characteristics and high-frequency spectrum characteristics of an original voice signal, and performing characteristic combination processing on the low-frequency spectrum characteristics and the high-frequency spectrum characteristics to obtain frequency band energy characteristics;
determining a current frame voice signal and a previous frame voice signal in the original voice signal, and performing linear domain transformation processing on the current frame voice signal and the previous frame voice signal to obtain a frequency spectrum characteristic parameter;
performing correlation calculation on the frequency spectrum characteristic parameters and the frequency band energy characteristics to obtain cepstrum characteristics, and performing dimension reduction mapping processing on the cepstrum characteristics to obtain dimension reduction characteristics;
and performing feature fusion processing on the dimensionality reduction features and the cepstrum features to obtain gain information, and performing noise suppression processing on the gain information to obtain a noise reduction voice signal of the original voice signal.
2. The method according to claim 1, wherein the performing feature fusion processing on the dimensionality reduction feature and the cepstrum feature to obtain gain information includes:
performing single fusion processing on the dimensionality reduction features and the cepstrum features to obtain single fusion features, and performing advanced fusion processing on the cepstrum features and the single fusion features to obtain advanced fusion features;
and carrying out full connection processing on the advanced fusion characteristics to obtain gain information.
3. The method according to claim 1, wherein the performing noise suppression processing on the gain information to obtain the noise-reduced speech signal of the original speech signal comprises:
acquiring standard gain information corresponding to the original voice signal, and performing gain loss calculation on the gain information and the standard gain information to obtain a gain loss value;
and based on the gain loss value, carrying out noise suppression processing by using the gain information to obtain a noise reduction voice signal of the original voice signal.
4. The method according to claim 1, wherein the performing noise suppression processing on the gain information to obtain the noise-reduced speech signal of the original speech signal comprises:
and carrying out inverse linear domain transformation processing on the gain information to obtain a noise reduction voice signal of the original voice signal.
5. The method according to claim 1, wherein the performing a feature combination process on the low-frequency spectral feature and the high-frequency spectral feature to obtain a band energy feature comprises:
carrying out nonlinear domain transformation processing on the high-frequency spectrum characteristic to obtain a nonlinear energy characteristic;
and carrying out feature combination processing on the low-frequency spectrum feature and the nonlinear energy feature to obtain a frequency band energy feature.
6. The noise suppression method according to claim 1, wherein the spectral feature parameters include a characteristic real part parameter and a characteristic imaginary part parameter,
the calculating the correlation between the spectrum characteristic parameter and the band energy characteristic to obtain a cepstrum characteristic includes:
performing cross-correlation calculation on the characteristic real part parameter and the characteristic imaginary part parameter to obtain a cross-correlation parameter;
and performing energy correlation calculation on the cross-correlation parameters and the frequency band energy characteristics to obtain cepstrum characteristics.
7. The noise suppression method according to any one of claims 1 to 6, wherein the obtaining of the low-frequency spectral feature and the high-frequency spectral feature of the original speech signal comprises:
the method comprises the steps of obtaining an original voice signal, and carrying out linear domain transformation processing on the original voice signal to obtain low-frequency spectrum characteristics and high-frequency spectrum characteristics.
8. A noise suppression apparatus, characterized in that the apparatus comprises:
the characteristic combination module is configured to acquire low-frequency spectrum characteristics and high-frequency spectrum characteristics of an original voice signal and perform characteristic combination processing on the low-frequency spectrum characteristics and the high-frequency spectrum characteristics to obtain frequency band energy characteristics;
the conversion processing module is configured to determine a current frame voice signal and a previous frame voice signal in the original voice signal, and perform linear domain conversion processing on the current frame voice signal and the previous frame voice signal to obtain a frequency spectrum characteristic parameter;
the dimension reduction mapping module is configured to perform correlation calculation on the frequency spectrum characteristic parameters and the frequency band energy characteristics to obtain cepstrum characteristics, and perform dimension reduction mapping processing on the cepstrum characteristics to obtain dimension reduction characteristics;
and the noise suppression module is configured to perform feature fusion processing on the dimensionality reduction features and the cepstrum features to obtain gain information, and perform noise suppression processing on the gain information to obtain a noise reduction voice signal of the original voice signal.
9. A computer-readable medium, on which a computer program is stored which, when being executed by a processor, carries out the noise suppression method according to any one of claims 1 to 7.
10. An electronic device, comprising:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform the noise suppression method of any one of claims 1 to 7 via execution of the executable instructions.