CN113571078B - Noise suppression method, device, medium and electronic equipment


Info

Publication number: CN113571078B (grant of application CN113571078A)
Authority: CN (China)
Prior art keywords: voice signal, feature, processing, frequency spectrum, noise suppression
Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Application number: CN202110129579.1A
Other languages: Chinese (zh)
Other versions: CN113571078A
Inventors: 鲍枫, 刘志鹏, 李岳鹏
Current Assignee: Tencent Technology Shenzhen Co Ltd (the listed assignees may be inaccurate; Google has not performed a legal analysis)
Original Assignee: Tencent Technology Shenzhen Co Ltd
Events: application filed by Tencent Technology Shenzhen Co Ltd with priority to CN202110129579.1A; publication of application CN113571078A; application granted and published as CN113571078B

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208: Noise filtering
    • G10L21/0216: Noise filtering characterised by the method used for estimating noise
    • G10L21/0232: Processing in the frequency domain

Abstract

The disclosure provides a noise suppression method, apparatus, medium, and electronic device. The method comprises the following steps: acquiring low-frequency spectrum features and high-frequency spectrum features of an original speech signal, and performing feature combination processing on them to obtain band energy features; determining a current-frame speech signal and a previous-frame speech signal in the original speech signal, and performing linear-domain transformation processing on them to obtain spectral feature parameters; performing correlation calculation on the spectral feature parameters and the band energy features to obtain cepstrum features, and performing dimension-reduction mapping processing on the cepstrum features to obtain dimension reduction features; and performing feature fusion processing on the dimension reduction features and the cepstrum features to obtain gain information, and performing noise suppression processing with the gain information to obtain a noise-reduced speech signal of the original speech signal. The method ensures the noise suppression effect and efficiency for the key noise type while greatly reducing the complexity of noise suppression.

Description

Noise suppression method, device, medium and electronic equipment
Technical Field
The present disclosure relates to the field of audio processing technology, and in particular, to a noise suppression method, a noise suppression apparatus, a computer readable medium, and an electronic device.
Background
In audio communication software such as conferencing applications, microphone pop noise is a very common noise signal. The usual way to suppress pop noise is to handle it incidentally while performing conventional noise reduction processing.
However, the conventional noise reduction processing has high complexity and provides no dedicated means for suppressing pop noise, so the suppression effect on pop noise cannot be guaranteed.
In view of this, there is a need in the art to develop a new noise suppression method and apparatus.
It should be noted that the information disclosed in the foregoing background section is only for enhancing the understanding of the background of the application, and may therefore include information that does not constitute prior art already known to a person of ordinary skill in the art.
Disclosure of Invention
The present disclosure is directed to a noise suppression method, a noise suppression apparatus, a computer-readable medium, and an electronic device, so as to overcome, at least to some extent, the technical problems of high noise suppression complexity and poor suppression effect.
Other features and advantages of the present disclosure will be apparent from the following detailed description, or may be learned in part by the practice of the disclosure.
According to an aspect of the embodiments of the present disclosure, there is provided a noise suppression method including: acquiring low-frequency spectrum features and high-frequency spectrum features of an original speech signal, and performing feature combination processing on the low-frequency spectrum features and the high-frequency spectrum features to obtain band energy features;
determining a current-frame speech signal and a previous-frame speech signal in the original speech signal, and performing linear-domain transformation processing on the current-frame speech signal and the previous-frame speech signal to obtain spectral feature parameters;
performing correlation calculation on the spectral feature parameters and the band energy features to obtain cepstrum features, and performing dimension-reduction mapping processing on the cepstrum features to obtain dimension reduction features;
and performing feature fusion processing on the dimension reduction features and the cepstrum features to obtain gain information, and performing noise suppression processing with the gain information to obtain a noise-reduced speech signal of the original speech signal.
According to an aspect of the embodiments of the present disclosure, there is provided a noise suppression apparatus including: a feature combination module configured to acquire low-frequency spectrum features and high-frequency spectrum features of an original speech signal, and perform feature combination processing on them to obtain band energy features;
a transformation processing module configured to determine a current-frame speech signal and a previous-frame speech signal in the original speech signal, and perform linear-domain transformation processing on them to obtain spectral feature parameters;
a dimension-reduction mapping module configured to perform correlation calculation on the spectral feature parameters and the band energy features to obtain cepstrum features, and perform dimension-reduction mapping processing on the cepstrum features to obtain dimension reduction features;
a noise suppression module configured to perform feature fusion processing on the dimension reduction features and the cepstrum features to obtain gain information, and perform noise suppression processing with the gain information to obtain a noise-reduced speech signal of the original speech signal.
In some embodiments of the present disclosure, based on the above technical solutions, the noise suppression module includes: the fusion processing sub-module is configured to perform single fusion processing on the dimension reduction feature and the cepstrum feature to obtain a single fusion feature, and perform advanced fusion processing on the cepstrum feature and the single fusion feature to obtain an advanced fusion feature;
and the connection processing sub-module is configured to perform full connection processing on the advanced fusion characteristics to obtain gain information.
In some embodiments of the present disclosure, based on the above technical solutions, the noise suppression module includes: the loss calculation sub-module is configured to acquire standard gain information corresponding to the original voice signal, and perform gain loss calculation on the gain information and the standard gain information to obtain a gain loss value;
and the gain loss submodule is configured to perform noise suppression processing by using the gain information based on the gain loss value to obtain a noise-reduced voice signal of the original voice signal.
In some embodiments of the present disclosure, based on the above technical solutions, the noise suppression module includes: and the inverse transformation submodule is configured to perform inverse linear domain transformation processing on the gain information to obtain a noise-reduced voice signal of the original voice signal.
In some embodiments of the present disclosure, based on the above technical solutions, the feature combination module includes: the energy characteristic submodule is configured to perform nonlinear domain transformation processing on the high-frequency spectrum characteristic to obtain a nonlinear energy characteristic;
and the combination processing sub-module is configured to perform feature combination processing on the low-frequency spectrum feature and the nonlinear energy feature to obtain a frequency band energy feature.
In some embodiments of the present disclosure, based on the above technical solutions, the dimension-reduction mapping module includes: the correlation calculation sub-module is configured to perform cross-correlation calculation on the characteristic real part parameter and the characteristic imaginary part parameter to obtain a cross-correlation parameter;
And the energy correlation sub-module is configured to perform energy correlation calculation on the cross-correlation parameter and the frequency band energy characteristic to obtain a cepstrum characteristic.
In some embodiments of the present disclosure, based on the above technical solutions, the feature combination module includes: the linear transformation sub-module is configured to acquire an original voice signal, and perform linear domain transformation processing on the original voice signal to obtain a low-frequency spectrum characteristic and a high-frequency spectrum characteristic.
According to an aspect of the embodiments of the present disclosure, there is provided a computer-readable medium having stored thereon a computer program which, when executed by a processor, implements a noise suppression method as in the above technical solutions.
According to an aspect of the embodiments of the present disclosure, there is provided an electronic device including: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to perform the noise suppression method as in the above technical solution via execution of the executable instructions.
In the technical solutions provided by the embodiments of the present disclosure, on the one hand, the original speech signal is divided into low-frequency spectrum features and high-frequency spectrum features for subsequent noise suppression processing, which makes the suppression of low-frequency noise more targeted while noise in the high-frequency region can still be suppressed; this ensures the suppression effect and efficiency for the key noise type while also covering noise suppression in the other frequency bands. On the other hand, performing noise suppression processing with the gain information to obtain the noise-reduced speech signal greatly reduces the complexity of noise suppression, which in turn improves the user experience when the noise-reduced speech signal is output.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure. It will be apparent to those of ordinary skill in the art that the drawings in the following description are merely examples of the disclosure and that other drawings may be derived from them without undue effort. In the drawings:
FIG. 1 schematically illustrates an architecture diagram of an exemplary system to which the disclosed technique is applied;
FIG. 2 schematically illustrates a flow chart of steps of a noise suppression method in some embodiments of the present disclosure;
FIG. 3 schematically illustrates a flow chart of steps of a method of feature combination processing in some embodiments of the present disclosure;
FIG. 4 schematically illustrates a flow chart of steps of a method of correlation computation in some embodiments of the present disclosure;
FIG. 5 schematically illustrates a flow chart of steps of a method of feature fusion processing in some embodiments of the present disclosure;
FIG. 6 schematically illustrates a flow chart of steps of a method of noise suppression processing in some embodiments of the present disclosure;
FIG. 7 schematically illustrates a model framework diagram for training a noise suppression model in an application scenario in some embodiments of the present disclosure;
FIG. 8 schematically illustrates a comparison of an original speech signal and a noise-reduced speech signal in an application scenario in some embodiments of the present disclosure;
FIG. 9 schematically illustrates a block diagram of a noise suppression device in some embodiments of the present disclosure;
FIG. 10 schematically illustrates a structural diagram of a computer system suitable for implementing the electronic device of the embodiments of the present disclosure.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. However, the exemplary embodiments may be embodied in many forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the example embodiments to those skilled in the art.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the disclosure. One skilled in the relevant art will recognize, however, that the disclosed aspects may be practiced without one or more of the specific details, or with other methods, components, devices, steps, etc. In other instances, well-known methods, devices, implementations, or operations are not shown or described in detail to avoid obscuring aspects of the disclosure.
The block diagrams depicted in the figures are merely functional entities and do not necessarily correspond to physically separate entities. That is, the functional entities may be implemented in software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
The flow diagrams depicted in the figures are exemplary only, and do not necessarily include all of the elements and operations/steps, nor must they be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the order of actual execution may be changed according to actual situations.
In the related art, speech noise reduction is a technology for obtaining a target speech from audio in which the target speech is mixed with noise, by removing or suppressing the noise; suppression here means controlling and avoiding the noise.
In audio communication software such as conferencing applications, microphone pop noise is a very common noise signal. Pop noise is caused by plosives during pronunciation. It is usually suppressed incidentally while conventional noise reduction is performed with neural network algorithms such as Long Short-Term Memory (LSTM) networks. However, the conventional noise reduction method has high complexity and no dedicated means for suppressing pop noise, so the processing effect on pop noise cannot be guaranteed.
Based on the problems of the above schemes, the present disclosure provides a noise suppression method, a noise suppression device, a computer-readable medium and an electronic device based on artificial intelligence and cloud technology.
Artificial intelligence (AI) is the theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision-making.
Artificial intelligence is a comprehensive discipline that covers a wide range of fields, at both the hardware level and the software level. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
The key technologies of speech technology (Speech Technology) include automatic speech recognition (ASR), text-to-speech synthesis (TTS), and voiceprint recognition. Enabling computers to listen, see, speak, and feel is the future direction of human-computer interaction, and speech is expected to become one of the best modes of human-computer interaction.
Machine learning (ML) is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other disciplines. It specializes in studying how computers simulate or implement human learning behavior to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve their own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent; it is applied throughout all fields of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from instruction.
Cloud technology refers to a hosting technology that unifies hardware, software, network, and other resources in a wide area network or a local area network to realize computation, storage, processing, and sharing of data.
Cloud technology is also a general term for the network, information, integration, management-platform, and application technologies built on the cloud computing business model; these resources can be pooled and used on demand, flexibly and conveniently. Cloud computing technology will become an important support. Background services of technical network systems, such as video websites, picture websites, and other portals, require large amounts of computing and storage resources. As the internet industry develops, each item may come to carry its own identification mark that must be transmitted to a background system for logical processing; data at different levels will be processed separately, and all kinds of industry data require strong backing systems, which can only be realized through cloud computing.
A cloud conference is an efficient, convenient, and low-cost conference form based on cloud computing technology. Users need only perform simple operations through an internet interface to rapidly and efficiently share voice, data files, and video with teams and customers around the world, while the cloud conference service provider handles the complex technologies such as data transmission and processing within the conference.
At present, domestic cloud conferencing mainly focuses on service content centered on the Software as a Service (SaaS) model, including telephone, network, and video service forms; video conferencing based on cloud computing is called a cloud conference.
In the cloud conference era, the transmission, processing, and storage of data are all handled by the computing resources of video conference providers, so users can hold efficient remote conferences without purchasing expensive hardware or installing complicated software.
The cloud conference system supports dynamic cluster deployment of multiple servers and provides multiple high-performance servers, greatly improving conference stability, security, and usability. In recent years, video conferencing has been welcomed by many users because it greatly improves communication efficiency, continuously reduces communication cost, and upgrades internal management, and it has been widely used in government, military, transportation, finance, operators, education, enterprises, and other fields. After cloud computing is applied, video conferencing becomes even more attractive in its convenience, speed, and ease of use, which will surely stimulate wider use of video conferencing applications.
The noise suppression method using artificial intelligence and cloud technology is more targeted at suppressing noise in the low-frequency region while also suppressing noise in the high-frequency region; it ensures the noise suppression effect and efficiency for the key noise type, covers noise suppression in the other frequency bands, greatly reduces the complexity of noise suppression, and thus improves the user experience when the noise-reduced speech signal is output.
Fig. 1 shows an exemplary system architecture schematic to which the technical solution of the present disclosure is applied.
As shown in fig. 1, the system architecture 100 may include a terminal 110, a network 120, and a server side 130. Wherein the terminal 110 and the server 130 are connected through the network 120.
The terminal 110 may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, etc. Network 120 may be a communication medium of various connection types capable of providing a communication link between terminal 110 and server side 130, such as a wired communication link, a wireless communication link, or a fiber optic cable, etc., and the application is not limited in this regard. The server 130 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs, basic cloud computing services such as big data and artificial intelligence platforms, and the like.
Specifically, the terminal 110 may obtain the low-frequency spectrum features and high-frequency spectrum features of the original speech signal, and perform feature combination processing on them to obtain the band energy features. Then, the current-frame speech signal and the previous-frame speech signal are determined in the original speech signal, and linear-domain transformation processing is performed on them to obtain the spectral feature parameters. Further, correlation calculation is performed on the spectral feature parameters and the band energy features to obtain cepstrum features, and dimension-reduction mapping processing is performed on the cepstrum features to obtain dimension reduction features. Finally, feature fusion processing is performed on the dimension reduction features and the cepstrum features to obtain gain information, and noise suppression processing is performed with the gain information to obtain the noise-reduced speech signal of the original speech signal.
In addition, the noise suppression method in the embodiment of the present disclosure may be applied to a terminal or a server, which is not particularly limited in this disclosure.
The embodiments of the present disclosure are mainly exemplified by the application of the noise suppression method to the terminal 110.
The noise suppression method, the noise suppression apparatus, the computer-readable medium, and the electronic device provided by the present disclosure are described in detail below in connection with the detailed description.
Fig. 2 schematically illustrates a flow chart of steps of a noise suppression method in some embodiments of the present disclosure, as illustrated in fig. 2, the noise suppression method may mainly include the steps of:
S210, acquiring low-frequency spectrum features and high-frequency spectrum features of an original voice signal, and performing feature combination processing on the low-frequency spectrum features and the high-frequency spectrum features to obtain frequency band energy features.
S220, determining a current-frame speech signal and a previous-frame speech signal in the original speech signal, and performing linear-domain transformation processing on them to obtain spectral feature parameters.
S230, performing correlation calculation on the frequency spectrum characteristic parameters and the frequency band energy characteristics to obtain cepstrum characteristics, and performing dimension reduction mapping processing on the cepstrum characteristics to obtain dimension reduction characteristics.
S240, performing feature fusion processing on the dimension reduction features and the cepstrum features to obtain gain information, and performing noise suppression processing on the gain information to obtain a noise reduction voice signal of the original voice signal.
In the exemplary embodiments of the present disclosure, on the one hand, the original speech signal is divided into low-frequency spectrum features and high-frequency spectrum features for subsequent noise suppression processing, which makes the suppression of low-frequency noise more targeted while noise in the high-frequency region can still be suppressed, ensuring the suppression effect and efficiency for the key noise type while also covering noise suppression in the other frequency bands; on the other hand, performing noise suppression processing with the gain information to obtain the noise-reduced speech signal greatly reduces the complexity of noise suppression and improves the user experience when the noise-reduced speech signal is output.
The respective steps of the noise suppression method are described in detail below.
In step S210, the low-frequency spectrum feature and the high-frequency spectrum feature of the original speech signal are obtained, and the low-frequency spectrum feature and the high-frequency spectrum feature are subjected to feature combination processing to obtain the band energy feature.
In an exemplary embodiment of the present disclosure, the original speech signal may be a noisy speech signal. The original speech signal may be a speech signal acquired in a real environment by an audio acquisition device, such as a microphone or the like.
For example, in a video conference scenario, a microphone collects the speech signal produced when a participant speaks. While collecting the speech signal, the microphone may simultaneously pick up noise signals, such as environmental noise or microphone pop, and the exemplary embodiment is not limited thereto.
Microphone pop is caused by plosives during sounding. Specifically, during audio input, pop occurs when the burst of air squeezed out between a participant's lips strikes the microphone diaphragm. In particular, words containing consonants such as "p" or "b" produce an airflow roughly equivalent in energy to a 60-mile-per-hour wind; such large energy acting on the diaphragm produces a strong transient that degrades the quality of the human voice and affects the overall effect of the original audio.
Further, the original voice signal is subjected to linear domain transformation processing to obtain corresponding high-frequency spectrum characteristics and low-frequency spectrum characteristics.
In an alternative embodiment, an original voice signal is obtained, and linear domain transformation processing is performed on the original voice signal to obtain a low-frequency spectrum characteristic and a high-frequency spectrum characteristic.
The linear-domain transformation processing of the original speech signal may transform it from the time domain to the frequency domain. For example, a Fast Fourier Transform (FFT) may be performed on the original speech signal.
The FFT is an algorithm that converts the time domain into the frequency domain, and is in effect a fast algorithm for the Discrete Fourier Transform (DFT). In digital signal processing, the FFT is often needed to obtain the frequency-domain characteristics of a signal. The purpose of the transformation is to represent the same time-domain signal in the frequency domain, where the characteristics of the signal can be analysed more easily.
Therefore, after the original speech signal is processed by the FFT, a series of complex numbers is obtained; these are the amplitude features of the original speech signal in the corresponding frequency bins, not plain amplitude values. The amplitude features are the spectrum features of the original speech signal.
A spectrum is short for frequency spectral density, a distribution curve over frequency. A complex oscillation can be decomposed into harmonic oscillations of different amplitudes and frequencies, and the graph of their amplitudes arranged by frequency is the spectrum. Spectra are widely used in acoustics, optics, and radio technology.
Also, since the speech signal in the pop state is generally a low-frequency signal below 500 Hz (hertz), the spectrum features of the original speech signal can be divided into low-frequency spectrum features and high-frequency spectrum features with 500 Hz as the dividing node.
The spectrum features of the original speech signal cover 0-8000 Hz when acquired at a sampling rate of 16000 Hz, so the low-frequency spectrum features span 0-500 Hz and the high-frequency spectrum features span 500-8000 Hz.
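The split can be illustrated with a short sketch. This is a minimal example rather than the patent's reference implementation; the framing, windowing, and names are assumptions.

```python
import numpy as np

SAMPLE_RATE = 16000
FFT_SIZE = 512  # a 512-point FFT, as mentioned for the low-frequency band below

def split_spectrum(frame: np.ndarray):
    """Return (low, high) magnitude spectra split at the 500 Hz node."""
    spectrum = np.fft.rfft(frame, n=FFT_SIZE)               # complex spectrum, 257 bins
    magnitude = np.abs(spectrum)                            # amplitude features
    freqs = np.fft.rfftfreq(FFT_SIZE, d=1.0 / SAMPLE_RATE)  # bin center frequencies
    low = magnitude[freqs <= 500.0]   # 0-500 Hz: kept at full linear resolution
    high = magnitude[freqs > 500.0]   # 500-8000 Hz: later mapped to Bark bands
    return low, high

frame = np.random.randn(FFT_SIZE)  # stand-in for one windowed speech frame
low_spec, high_spec = split_spectrum(frame)
```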
After the low-frequency spectrum characteristic and the high-frequency spectrum characteristic of the original voice signal are obtained, the low-frequency spectrum characteristic and the high-frequency spectrum characteristic can be subjected to characteristic combination processing to obtain the frequency band energy characteristic.
In an alternative embodiment, fig. 3 shows a flowchart of the steps of a method of feature combination processing, as shown in fig. 3, the method comprising at least the steps of: in step S310, nonlinear domain transformation processing is performed on the high-frequency spectrum characteristic to obtain a nonlinear energy characteristic.
The nonlinear domain transformation processing may be a processing method of converting the high-frequency spectrum characteristic of the frequency domain into the Bark domain.
The Bark domain is a psychoacoustic scale of sound. Because of the specific structure of the human cochlea, the human auditory system exhibits a series of critical bands. A critical band is a frequency band in which sound signals are prone to masking effects: a sound signal within a critical band is easily masked by another signal of higher energy and nearby frequency, so that the human auditory system cannot perceive it. If a sound signal is converted from the frequency domain into these critical bands, each critical band becomes one Bark band; that is, the sound signal is converted from the frequency domain to the Bark domain.
Specifically, the nonlinear domain transformation processing may refer to formula (1):

Bark(f) = 13 × arctan(0.00076 × f) + 3.5 × arctan((f / 7500)²)    (1)

where arctan is the arctangent function, f is the high-frequency spectrum feature (frequency in Hz) of the original speech signal, and Bark(f) is the Bark-domain representation of the original speech signal.
The nonlinear energy features of the high-frequency spectrum features are obtained through the calculation of formula (1). The nonlinear energy features may be represented by 15 Bark bands, which sparsifies the high-frequency spectrum features. The Bark domain clearly has a compression effect on the high-frequency spectrum features and an amplification effect on the low-frequency spectrum features. However, in order to process the pop-state original speech signal in a targeted manner, the low-frequency spectrum features are not converted into the Bark domain.
In step S320, the low-frequency spectrum feature and the nonlinear energy feature are subjected to feature combination processing to obtain a band energy feature.
Since the low-frequency spectrum features are not subjected to nonlinear domain transformation into the Bark domain, they are not sparsified and can be directly represented by 15 Bark bands. Here, the low-frequency spectrum features are obtained by linear-domain transformation processing with a 512-point FFT.
Further, the low-frequency spectrum characteristic represented by the 15 Bark bands and the nonlinear energy characteristic divided into the 15 Bark bands are combined to obtain the band energy characteristics on the 30 Bark bands.
In this exemplary embodiment, combining the nonlinear energy features produced by the nonlinear domain transformation with the untransformed low-frequency spectrum features not only sparsifies the high-frequency spectrum features but also allows low-frequency noise to be suppressed better later, further reducing the complexity of noise suppression and improving its efficiency.
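A hedged sketch of this feature-combination step follows: the 0-500 Hz magnitudes are grouped into 15 uniform bands, while the 500-8000 Hz magnitudes are pooled into 15 Bark-spaced bands using formula (1). The exact band edges are assumptions, not taken from the patent.

```python
import numpy as np

def bark(f):
    # Bark-scale approximation of formula (1)
    return 13.0 * np.arctan(0.00076 * f) + 3.5 * np.arctan((f / 7500.0) ** 2)

def band_energy_features(low_spec, high_spec, high_freqs, n_bands=15):
    """Combine 15 linear low-frequency bands with 15 Bark high-frequency bands."""
    # 0-500 Hz: no Bark sparsification; 15 equal groups of squared magnitudes
    low_bands = [float(np.sum(seg ** 2)) for seg in np.array_split(low_spec, n_bands)]
    # 500-8000 Hz: pool squared magnitudes into 15 bands equally spaced in Bark
    edges = np.linspace(bark(500.0), bark(8000.0), n_bands + 1)
    idx = np.clip(np.digitize(bark(high_freqs), edges) - 1, 0, n_bands - 1)
    high_bands = [float(np.sum((high_spec ** 2)[idx == b])) for b in range(n_bands)]
    return np.asarray(low_bands + high_bands)  # 30-dimensional band energy feature
```

With the `split_spectrum` sketch above, `high_freqs` is simply the frequency vector restricted to the bins above 500 Hz.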
In step S220, the current frame of speech signal and the previous frame of speech signal are determined from the original speech signals, and linear domain transformation is performed on the current frame of speech signal and the previous frame of speech signal to obtain spectral feature parameters.
In an exemplary embodiment of the present disclosure, one frame may be determined in the original speech signal as the current-frame speech signal, and the frame immediately preceding it as the previous-frame speech signal, so that linear-domain transformation processing can be performed on both to obtain the spectral feature parameters.
The mode of performing linear domain transformation processing on the current frame voice signal and the last frame voice signal can also be realized through an FFT algorithm. Specifically, reference may be made to formula (2):
FFT(t,f)=x(t,f)+i×y(t,f) (2)
where FFT(t,f) represents the spectral feature of the current-frame or previous-frame speech signal in the frequency domain and consists of one complex value, i.e., x + yi, in which x represents the real part and y the imaginary part of the corresponding spectral feature.
The real parts and imaginary parts of the spectral features of the current-frame and previous-frame speech signals are the corresponding spectral feature parameters.
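A minimal sketch of formula (2), under the same assumptions as the earlier snippets:

```python
import numpy as np

def spectral_feature_params(cur_frame, prev_frame, fft_size=512):
    """Real and imaginary parts of the current and previous frames (formula (2))."""
    cur = np.fft.rfft(cur_frame, n=fft_size)    # FFT(t, f) = x + i*y
    prev = np.fft.rfft(prev_frame, n=fft_size)
    return (cur.real, cur.imag), (prev.real, prev.imag)
```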
In step S230, correlation calculation is performed on the spectral feature parameters and the band energy features to obtain cepstrum features, and dimension reduction mapping processing is performed on the cepstrum features to obtain dimension reduction features.
In an exemplary embodiment of the present disclosure, after obtaining the spectral feature parameter and the band energy feature, a correlation calculation may be performed on the spectral feature parameter and the band energy feature to obtain a cepstrum feature.
In an alternative embodiment, the spectral characteristic parameters include a characteristic real part parameter and a characteristic imaginary part parameter, fig. 4 shows a flowchart of steps of a method of correlation calculation, as shown in fig. 4, the method at least comprising the steps of: in step S410, a cross-correlation parameter is obtained by performing a cross-correlation calculation on the characteristic real part parameter and the characteristic imaginary part parameter.
The cross-correlation calculation of the characteristic real part parameter and the characteristic imaginary part parameter may refer to formula (3):

r_xy[l] = Σ_n x[n] × y[n - l]    (3)

where r_xy[l] quantifies the degree of correlation between the sequence x[n] and the shifted sequence y[n-l]; the larger r_xy[l] is, the greater the correlation. Substituting the characteristic real part parameter and the characteristic imaginary part parameter into formula (3) yields the cross-correlation parameter of the current-frame and previous-frame speech signals.
In step S420, correlation calculation is performed on the cross-correlation parameter and the band energy characteristic to obtain a cepstrum characteristic.
After the cross-correlation parameters are obtained, the cross-correlation parameters and the frequency band energy characteristics can be subjected to correlation calculation to obtain cepstrum characteristics.
Specifically, the band energy features are squared, and the sum of the squared band energy features is divided by the cross-correlation parameter to obtain the cepstrum feature. Other methods of calculating the cepstrum feature are possible, and this exemplary embodiment is not particularly limited in this respect.
The cepstrum feature may be the Bark-Frequency Cepstral Coefficients (BFCC). BFCC is a commonly used feature parameter based on the characteristics of human auditory perception, and it describes the distribution of sound energy over frequency.
In this exemplary embodiment, the cepstrum feature is obtained by performing correlation calculation on the characteristic real part parameter, the characteristic imaginary part parameter, and the band energy features, providing a data basis for noise suppression.
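The following sketch mirrors the calculation exactly as described above: formula (3) applied to the real and imaginary parts, then the sum of squared band energies divided by the cross-correlation parameter. It is a hedged, literal reading of the text, not a standard BFCC extraction, and the lag value is an assumption.

```python
import numpy as np

def cross_correlation(x, y, lag=0):
    # Formula (3): r_xy[l] = sum_n x[n] * y[n - l]
    n = np.arange(lag, len(x))
    return float(np.sum(x[n] * y[n - lag]))

def cepstrum_feature(real_part, imag_part, band_energy, lag=0):
    # Energy correlation: sum of squared band energies divided by r_xy[l]
    r_xy = cross_correlation(real_part, imag_part, lag)
    return float(np.sum(band_energy ** 2) / r_xy)
```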
After the cepstrum feature is obtained, the cepstrum feature can be subjected to dimension reduction mapping processing.
Specifically, the cepstrum feature may be input to an activation function layer, so that the activation function layer performs dimension reduction mapping processing on the cepstrum feature. For example, the cepstral feature may be a 30-dimensional vector, and a 20-dimensional dimension reduction feature may be obtained after input to the activation function layer. The activation function of the activation function layer may be a Tanh function or other activation functions, which is not limited in this exemplary embodiment.
It should be noted that the number of nodes of the activation function layer may be changed according to actual requirements, for example, to 30.
In step S240, feature fusion processing is performed on the dimension-reduction feature and the cepstrum feature to obtain gain information, and noise suppression processing is performed on the gain information to obtain a noise-reduced voice signal of the original voice signal.
In an exemplary embodiment of the present disclosure, after the dimension-reduction feature is obtained, a feature fusion process may be performed on the dimension-reduction feature and the cepstrum feature to obtain gain information. It should be noted that the feature fusion process may be a two-layer feature fusion process.
In an alternative embodiment, fig. 5 shows a flowchart of the steps of a method of feature fusion processing, as shown in fig. 5, the method comprising at least the steps of: in step S510, a single fusion process is performed on the dimension-reduction feature and the cepstrum feature to obtain a single fusion feature, and a step fusion process is performed on the cepstrum feature and the single fusion feature to obtain a step fusion feature.
Specifically, the single fusion processing of the dimension reduction features and the cepstrum features may be performed by inputting both into a gated recurrent unit (GRU), so that the GRU performs feature fusion processing on them to obtain the single fusion features.
The gated recurrent unit is a newer generation of recurrent neural network, similar to the Long Short-Term Memory (LSTM) network. The GRU has no cell state and uses the hidden state to convey information. It has only two gates, a reset gate and an update gate: the reset gate decides how much past information to forget, and the update gate decides which information to discard and which new information to add.
The gated recurrent unit used for the feature fusion processing may be a GRU ReLU layer, where ReLU (Rectified Linear Unit) is a nonlinear activation, typically the ramp function or one of its variants. This gated recurrent unit may output a 30-dimensional single fusion feature.
It should be noted that the number of nodes of the gate control loop unit may be changed according to actual requirements, for example, to 50.
Further, advanced fusion features are obtained by advanced fusion processing of the single fusion features and the cepstrum features.
Specifically, the advanced fusion processing may be performed by inputting the single fusion features and the cepstrum features into another gated recurrent unit, so that this GRU performs the advanced fusion processing on them.
And the gating cycle unit may also be a GRU ReLU layer. Thus, the gating loop unit may output 60-dimensional advanced fusion features.
It should be noted that the number of the nodes of the gate control loop unit may be changed according to actual requirements, for example, to 50.
In step S520, the advanced fusion feature is fully connected to obtain gain information.
After the advanced fusion feature is obtained, the advanced fusion feature can be subjected to full connection processing to obtain gain information.
Specifically, the advanced fusion feature may be input into a full connection layer, so that the full connection layer performs full connection processing on the advanced fusion feature.
The full connection process may be implemented at the dense layer of the deep learning network. The full connection process may be a process of connecting each node to all nodes of the upper layer, that is, integrating the 60-dimensional advanced fusion features to obtain 30-dimensional gain information to match with 30 Bark bands.
In the present exemplary embodiment, the gain information for noise suppression can be obtained by performing the feature fusion processing on the dimension reduction feature and the cepstrum feature twice, so that the processing flow of noise suppression is greatly simplified, and the complexity of noise suppression is reduced.
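A hedged Keras sketch of the gain network described above: a Tanh dense layer reduces the 30-dimensional cepstrum feature to 20 dimensions, two ReLU GRU layers perform the single and advanced fusion (30 and 60 units), and a final dense layer emits the 30 per-band gains. The layer sizes follow the text; the sigmoid output activation is an assumption (gains in [0, 1]), as are all names.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_gain_model(n_bands=30, reduced_dim=20):
    bfcc = layers.Input(shape=(None, n_bands), name="cepstrum_features")
    reduced = layers.Dense(reduced_dim, activation="tanh")(bfcc)        # dimension reduction
    single = layers.GRU(30, activation="relu", return_sequences=True)(
        layers.Concatenate()([reduced, bfcc]))                          # single fusion
    advanced = layers.GRU(60, activation="relu", return_sequences=True)(
        layers.Concatenate()([single, bfcc]))                           # advanced fusion
    gains = layers.Dense(n_bands, activation="sigmoid", name="band_gains")(advanced)
    return tf.keras.Model(bfcc, gains)

model = build_gain_model()
model.summary()
```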
After gain information is obtained, noise suppression processing may be performed on the gain information to obtain a noise-reduced voice signal corresponding to the original voice signal.
In an alternative embodiment, fig. 6 shows a flow chart of the steps of a method of noise suppression processing, as shown in fig. 6, the method comprising at least the steps of: in step S610, standard gain information corresponding to the original speech signal is acquired, and gain loss calculation is performed on the gain information and the standard gain information to obtain a gain loss value.
When the original speech signal is a noisy speech signal used in the training process, the corresponding standard gain information can be obtained at the same time for training the activation function layer, the two GRU layers, and the fully connected layer. The standard gain information is the true gain information of the original speech signal, which contains the true gain values applied to its different frequency bands. A true gain value may be calculated from the energy of the noise-free signal and the energy of the noise signal in the original speech signal: for example, the two energies are summed, and the energy of the noise-free signal is divided by the result of the summation to obtain the true gain value.
To calculate the difference between the standard gain information and the predicted gain information, a loss function may be constructed from which the gain loss value between the standard gain information and the gain information is calculated.
Specifically, the loss function may be constructed from a distance between the standard gain information and the predicted gain information; the distance may be a Euclidean distance, a cosine distance, a mean squared error (MSE), or the like, which is not particularly limited in this exemplary embodiment.
In step S620, noise suppression processing is performed with the gain information based on the gain loss value to obtain a noise-reduced speech signal of the original speech signal.
When the gain loss value meets the preset condition, for example, the gain loss value is smaller than the corresponding loss threshold value, the training of the activation function layer, the two GRUs and the full-connection layer is successful, and noise suppression processing can be performed by using the gain information to obtain a noise reduction voice signal.
And when the gain loss value does not meet the preset condition, for example, the gain loss value is greater than or equal to the corresponding loss threshold, the parameters of the activation function layer, the two-layer GRU and the full-connection layer can be continuously adjusted to minimize the value of the loss function, so that the activation function layer, the two-layer GRU and the full-connection layer after training is completed are obtained.
In this exemplary embodiment, during training, the standard gain information is used to decide whether the gain information can be used for noise suppression, which ensures the accuracy of the trained noise suppression model and the noise suppression effect.
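A minimal sketch of the training target and loss described above, assuming per-band energies of the clean and noise components are available; the epsilon guard and threshold value are assumptions.

```python
import numpy as np

def standard_gain(clean_energy, noise_energy, eps=1e-12):
    # True gain per band: clean energy divided by (clean + noise) energy
    return clean_energy / (clean_energy + noise_energy + eps)

def gain_loss(predicted_gain, target_gain):
    # MSE distance between predicted gain information and standard gain information
    return float(np.mean((predicted_gain - target_gain) ** 2))

LOSS_THRESHOLD = 1e-3  # illustrative preset condition
meets_condition = gain_loss(np.full(30, 0.90), np.full(30, 0.91)) < LOSS_THRESHOLD
```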
In the application process, the noise suppression processing can be directly performed by using the gain information without a judging process of the standard gain information.
In an alternative embodiment, the gain information is subjected to inverse linear transformation to obtain a noise-reduced speech signal of the original speech signal.
The inverse linear-domain transformation processing of the gain information may be an inverse Fast Fourier Transform (IFFT). The IFFT converts the gained spectrum from the frequency domain back to the time domain, so as to obtain the noise-suppressed, noise-reduced speech signal.
And the noise-reduced speech signal is a normal clean voice signal, that is, the voice signal excluding the pop signal: the voice normally produced when the speaker is talking or singing.
In this exemplary embodiment, the gain information is subjected to inverse linear-domain transformation to obtain the noise-reduced speech signal, which may then be output and played back, improving the user's listening experience.
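A hedged sketch of this synthesis step: the 30 per-band gains are expanded to the FFT bins with a per-bin band-index lookup (an assumption, since the patent only states that the gain information is inverse-transformed), applied to the noisy spectrum, and converted back to the time domain with an IFFT.

```python
import numpy as np

def synthesize(noisy_spectrum, band_gains, bin_to_band):
    """Apply per-band gains to the noisy complex spectrum and IFFT to time domain."""
    gained = noisy_spectrum * band_gains[bin_to_band]  # per-bin gain via band lookup
    return np.fft.irfft(gained)                        # inverse linear-domain transform

# bin_to_band is an integer array mapping each FFT bin to one of the 30 bands,
# e.g. derived from the band edges used in the band_energy_features sketch.
```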
The noise suppression method provided in the embodiments of the present disclosure is described in detail below with reference to a specific application scenario.
Fig. 7 shows a model framework diagram of training a noise suppression model in an application scenario, and as shown in fig. 7, a cepstrum feature sample is input in step S710.
The cepstrum feature samples are obtained by dividing the 0-8000 Hz spectrum of the speech signal samples into 30 Bark bands.
Specifically, for wideband speech with a sampling rate of 16000 Hz, the 0-8000 Hz spectrum of the speech signal samples is divided into 30 Bark bands, i.e., subjected to sparsification processing.
Since the noise mainly lies in the low-frequency region, the processing of low-frequency speech signal samples is emphasized: the 0-500 Hz part of the speech signal samples is not sparsified into the Bark domain and is directly represented by 15 bands, which is equivalent to linear-domain transformation processing with a 512-point FFT; only the 500-8000 Hz part is sparsified, also into 15 Bark bands. Performing feature combination processing on the low-frequency and high-frequency speech signal samples in this way yields the band energy features over 30 bands.
In step S720, the dimension-reduction mapping process is performed by using the Tanh function at the full connection layer.
The current-frame speech signal and the previous-frame speech signal are determined in the speech signal samples and subjected to linear-domain transformation processing. Specifically, the FFT algorithm can be used to transform the current-frame and previous-frame speech signals into the frequency domain, so that their real parts and imaginary parts can be obtained as the spectral feature parameters.
Further, cross-correlation calculation is performed on the characteristic real part parameter and the characteristic imaginary part parameter to obtain a cross-correlation parameter, and correlation calculation is performed on the cross-correlation parameter and the frequency band energy characteristic to obtain a cepstrum characteristic.
Specifically, a cross-correlation parameter is obtained through calculation according to a formula (3), square calculation is carried out on the frequency band energy characteristics, and then the sum of the frequency band energy characteristics after square calculation is divided by the cross-correlation parameter to obtain a cepstrum characteristic. And the cepstrum feature may be BFCC.
And inputting BFCC to an activation function layer so that the activation function layer performs dimension reduction mapping processing on the cepstrum features. The cepstrum feature may be a 30-dimensional vector, and a 20-dimensional dimension reduction feature may be obtained after input to the activation function layer. Wherein the activation function of the activation function layer may be a Tanh function.
In step S730, a single fusion process is performed in the gating loop unit using the ReLU function.
The single fusion processing of the dimension reduction feature and the cepstrum feature may be that the dimension reduction feature and the cepstrum feature are input into the GRU, so that the gating circulation unit performs the feature fusion processing of the dimension reduction feature and the cepstrum feature to obtain the single fusion feature.
And the gated recurrent unit used for the feature fusion processing may be a GRU ReLU layer, where ReLU (Rectified Linear Unit) is a nonlinear activation, typically the ramp function or one of its variants. This gated recurrent unit may output a 30-dimensional single fusion feature.
In step S740, the advanced fusion processing is performed by a gated recurrent unit using the ReLU function.
The advanced fusion processing may be performed by inputting the single fusion features and the cepstrum features into another gated recurrent unit, so that this GRU performs the advanced fusion processing on them.
And the gating cycle unit may also be a GRU ReLU layer. Thus, the gating loop unit may output 60-dimensional advanced fusion features.
In step S750, the full connection process is performed on the full connection layer.
Because the numbers of nodes of the two GRU layers are small, 30 and 60 respectively, the gain information can be obtained with a 30-dimensional full connection layer calculation.
Specifically, the advanced fusion feature may be input into a full connection layer, so that the full connection layer performs full connection processing on the advanced fusion feature.
The full connection processing may be implemented at the dense layer of the deep learning network. Full connection processing connects each node to all nodes of the previous layer; that is, the 60-dimensional advanced fusion feature is integrated to obtain 30-dimensional gain information matching the 30 Bark bands.
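Putting steps S720-S750 together, the following Keras sketch mirrors the described layer sizes (Tanh 30-to-20, GRU-ReLU 30, GRU-ReLU 60, dense 30). The concatenations that feed the cepstrum feature into each fusion stage and the sigmoid on the gain output are assumptions; the patent does not spell out the exact wiring.

import tensorflow as tf
from tensorflow.keras import layers, Model

bfcc = layers.Input(shape=(None, 30))                     # per-frame 30-dim BFCC sequence
reduced = layers.Dense(20, activation='tanh')(bfcc)       # dimension-reduction mapping (S720)
single = layers.GRU(30, activation='relu', return_sequences=True)(
    layers.Concatenate()([reduced, bfcc]))                # single fusion feature (S730)
advanced = layers.GRU(60, activation='relu', return_sequences=True)(
    layers.Concatenate()([single, bfcc]))                 # advanced fusion feature (S740)
gains = layers.Dense(30, activation='sigmoid')(advanced)  # per-band gain information (S750)
model = Model(bfcc, gains)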
When the gain loss value is small enough, it indicates that training of the noise suppression model is complete and the model can be used for online enhancement.
In the online enhancement process, the 30-dimensional BFCC feature of the original voice signal to be enhanced can be extracted and input into the trained noise suppression model, so that the noise suppression model calculates the 30-dimensional gain information used to recover a clean noise-reduced voice signal.
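During online enhancement, each per-band gain can then be applied to the corresponding band of the frame spectrum before the inverse linear-domain transformation. The sketch below reuses the hypothetical band edges from the earlier sketch and omits any smoothing across band boundaries.

import numpy as np

def apply_band_gains(frame, gains, edges, sr=16000, n_fft=512):
    """Scale each of the 30 bands of the frame spectrum by its gain,
    then inverse-transform to obtain the noise-reduced frame."""
    spec = np.fft.rfft(frame, n=n_fft)
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    per_bin = np.ones_like(freqs)
    for i, g in enumerate(gains):
        per_bin[(freqs >= edges[i]) & (freqs < edges[i + 1])] = g
    return np.fft.irfft(spec * per_bin, n=n_fft)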
The noise-reduced voice signal is a normal clean voice signal, that is, the portion of the clean voice signal other than microphone-pop (plosive) signals: the clean voice signal normally produced when the speaker is in a speaking or singing state.
Once the noise suppression model is online, it can be applied to any business scenario with voice-signal noise reduction requirements, such as voice conferences, video conferences, audio recording, video recording, and the like.
It should be noted that the numbers of nodes of the activation function layer and of the GRUs in the noise suppression model may be changed according to practical situations; for example, the activation function layer may have 30 nodes and the GRU 50 nodes.
In addition, this embodiment may alternatively be implemented using neural network units such as LSTM (Long Short-Term Memory), DNN (Deep Neural Network) and CNN (Convolutional Neural Network) alone, which is not particularly limited in this exemplary embodiment.
Fig. 8 shows a schematic comparison between an original voice signal and a noise-reduced voice signal in an application scenario. As shown in fig. 8, the noise of the original voice signal is mainly concentrated in the low-frequency region, and the low-frequency noise in the noise-reduced voice signal is clearly removed, indicating that the noise suppression effect is good. Meanwhile, the parameters of the noise suppression model occupy only about 50 KB (kilobytes) after quantization, which greatly reduces the complexity of the noise suppression algorithm.
Based on the above application scenarios, the noise suppression method provided by the embodiments of the present disclosure, on the one hand, divides the original voice signal into low-frequency spectrum features and high-frequency spectrum features for subsequent noise suppression processing, so that noise suppression in the low-frequency region is more targeted while noise in the high-frequency region can still be suppressed; this ensures the noise suppression effect and efficiency for the key noise types while also taking the other frequency bands into account. On the other hand, performing noise suppression processing on the gain information to obtain the noise-reduced voice signal greatly reduces the complexity of noise suppression, which in turn improves the user experience when the noise-reduced voice signal is output.
It should be noted that although the steps of the methods in the present disclosure are depicted in the accompanying drawings in a particular order, this does not require or imply that the steps must be performed in that particular order, or that all illustrated steps be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step to perform, and/or one step decomposed into multiple steps to perform, etc.
The following describes apparatus embodiments of the present disclosure that may be used to perform the noise suppression methods of the above-described embodiments of the present disclosure. For details not disclosed in the embodiments of the apparatus of the present disclosure, please refer to the embodiments of the noise suppression method described above in the present disclosure.
Fig. 9 schematically illustrates a block diagram of a noise suppression apparatus in some embodiments of the present disclosure, and as shown in fig. 9, a noise suppression apparatus 900 may mainly include: a feature combination module 910, a transform processing module 920, a dimension reduction mapping module 930, and a noise suppression module 940.
The feature combination module 910 is configured to obtain a low-frequency spectrum feature and a high-frequency spectrum feature of the original voice signal, and perform feature combination processing on the low-frequency spectrum feature and the high-frequency spectrum feature to obtain a frequency band energy feature; the transformation processing module 920 is configured to determine a current frame of voice signal and a previous frame of voice signal in the original voice signal, and perform linear domain transformation processing on the current frame of voice signal and the previous frame of voice signal to obtain a frequency spectrum characteristic parameter; the dimension-reducing mapping module 930 is configured to perform correlation calculation on the spectrum feature parameters and the frequency band energy features to obtain cepstrum features, and perform dimension-reducing mapping processing on the cepstrum features to obtain dimension-reducing features; the noise suppression module 940 is configured to perform feature fusion processing on the dimension reduction feature and the cepstrum feature to obtain gain information, and perform noise suppression processing on the gain information to obtain a noise reduction voice signal of the original voice signal.
In some embodiments of the present disclosure, a noise suppression module includes: the fusion processing sub-module is configured to perform single fusion processing on the dimension reduction feature and the cepstrum feature to obtain a single fusion feature, and perform advanced fusion processing on the cepstrum feature and the single fusion feature to obtain an advanced fusion feature;
And the connection processing sub-module is configured to perform full connection processing on the advanced fusion characteristics to obtain gain information.
In some embodiments of the present disclosure, a noise suppression module includes: the loss calculation sub-module is configured to acquire standard gain information corresponding to the original voice signal, and perform gain loss calculation on the gain information and the standard gain information to obtain a gain loss value;
And the gain loss submodule is configured to perform noise suppression processing by using the gain information based on the gain loss value to obtain a noise-reduced voice signal of the original voice signal.
In some embodiments of the present disclosure, the noise suppression module includes: and the reverse conversion sub-module is configured to perform inverse linear domain conversion processing on the gain information to obtain a noise reduction voice signal of the original voice signal.
In some embodiments of the present disclosure, a feature combination module includes: the energy characteristic submodule is configured to perform nonlinear domain transformation processing on the high-frequency spectrum characteristic to obtain a nonlinear energy characteristic;
And the combination processing sub-module is configured to perform feature combination processing on the low-frequency spectrum features and the nonlinear energy features to obtain frequency band energy features.
In some embodiments of the present disclosure, a dimension reduction mapping module includes: the correlation calculation sub-module is configured to perform cross-correlation calculation on the characteristic real part parameter and the characteristic imaginary part parameter to obtain a cross-correlation parameter;
And the energy correlation sub-module is configured to perform energy correlation calculation on the cross-correlation parameters and the frequency band energy characteristics to obtain cepstrum characteristics.
In some embodiments of the present disclosure, a feature combination module includes: the linear transformation sub-module is configured to acquire an original voice signal, and perform linear domain transformation processing on the original voice signal to obtain a low-frequency spectrum characteristic and a high-frequency spectrum characteristic.
Specific details of the noise suppression device provided in each embodiment of the present disclosure have been described in the corresponding method embodiments, and thus are not described herein.
Fig. 10 shows a schematic diagram of a computer system suitable for use in implementing embodiments of the present disclosure.
It should be noted that, the computer system 1000 of the electronic device shown in fig. 10 is only an example, and should not impose any limitation on the functions and the application scope of the embodiments of the present disclosure.
As shown in fig. 10, the computer system 1000 includes a Central Processing Unit (CPU) 1001 that can perform various appropriate actions and processes according to a program stored in a Read-Only Memory (ROM) 1002 or a program loaded from a storage section 1008 into a Random Access Memory (RAM) 1003. In the RAM 1003, various programs and data required for system operation are also stored. The CPU 1001, ROM 1002, and RAM 1003 are connected to each other by a bus 1004. An Input/Output (I/O) interface 1005 is also connected to bus 1004.
The following components are connected to the I/O interface 1005: an input section 1006 including a keyboard, a mouse, and the like; an output section 1007 including a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), a speaker, and the like; a storage section 1008 including a hard disk or the like; and a communication section 1009 including a network interface card such as a LAN (Local Area Network) card, a modem, or the like. The communication section 1009 performs communication processing via a network such as the Internet. The drive 1010 is also connected to the I/O interface 1005 as needed. A removable medium 1011, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is installed on the drive 1010 as needed, so that a computer program read out therefrom is installed into the storage section 1008 as needed.
In particular, according to embodiments of the present disclosure, the processes described in the various method flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flowcharts. In such an embodiment, the computer program may be downloaded and installed from a network via the communication portion 1009, and/or installed from the removable medium 1011. When executed by a Central Processing Unit (CPU) 1001, the computer program performs various functions defined in the system of the present application.
It should be noted that the computer readable medium shown in the embodiments of the present disclosure may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a Read-Only Memory (ROM), an Erasable Programmable Read-Only Memory (EPROM), a flash memory, an optical fiber, a portable Compact Disc Read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present disclosure, however, a computer readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with computer readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electromagnetic signals, optical signals, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wired, etc., or any suitable combination of the foregoing.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
It should be noted that although in the above detailed description several modules or units of a device for action execution are mentioned, such a division is not mandatory. Indeed, the features and functionality of two or more modules or units described above may be embodied in one module or unit in accordance with embodiments of the present disclosure. Conversely, the features and functions of one module or unit described above may be further divided into a plurality of modules or units to be embodied.
From the above description of embodiments, those skilled in the art will readily appreciate that the example embodiments described herein may be implemented in software, or may be implemented in software in combination with the necessary hardware. Thus, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (may be a CD-ROM, a U-disk, a mobile hard disk, etc.) or on a network, and includes several instructions to cause a computing device (may be a personal computer, a server, a touch terminal, or a network device, etc.) to perform the method according to the embodiments of the present disclosure.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any adaptations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains.
It is to be understood that the present disclosure is not limited to the precise arrangements and instrumentalities shown in the drawings, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. A method of noise suppression, the method comprising:
Acquiring low-frequency spectrum characteristics and high-frequency spectrum characteristics of an original voice signal, and carrying out characteristic combination processing on the low-frequency spectrum characteristics and the high-frequency spectrum characteristics to obtain frequency band energy characteristics;
Determining a current frame voice signal and a last frame voice signal in the original voice signal, and performing linear domain transformation processing on the current frame voice signal and the last frame voice signal to obtain a frequency spectrum characteristic parameter;
Performing correlation calculation on the frequency spectrum characteristic parameters and the frequency band energy characteristics to obtain cepstrum characteristics, and performing dimension reduction mapping processing on the cepstrum characteristics to obtain dimension reduction characteristics;
and carrying out feature fusion processing on the dimension reduction features and the cepstrum features to obtain gain information, and carrying out noise suppression processing on the gain information to obtain a noise reduction voice signal of the original voice signal.
2. The noise suppression method according to claim 1, wherein the performing feature fusion processing on the dimension reduction feature and the cepstrum feature to obtain gain information includes:
Performing single fusion processing on the dimension reduction feature and the cepstrum feature to obtain a single fusion feature, and performing advanced fusion processing on the cepstrum feature and the single fusion feature to obtain an advanced fusion feature;
and performing full connection processing on the advanced fusion characteristics to obtain gain information.
3. The noise suppression method according to claim 1, wherein the noise suppression processing of the gain information to obtain the noise-reduced speech signal of the original speech signal includes:
Obtaining standard gain information corresponding to the original voice signal, and performing gain loss calculation on the gain information and the standard gain information to obtain a gain loss value;
And performing noise suppression processing by using the gain information based on the gain loss value to obtain a noise-reduced voice signal of the original voice signal.
4. The noise suppression method according to claim 1, wherein the noise suppression processing of the gain information to obtain the noise-reduced speech signal of the original speech signal includes:
And carrying out inverse linear domain transformation processing on the gain information to obtain a noise reduction voice signal of the original voice signal.
5. The noise suppression method according to claim 1, wherein the feature combination processing of the low-frequency spectrum feature and the high-frequency spectrum feature to obtain a band energy feature includes:
Carrying out nonlinear domain transformation processing on the high-frequency spectrum characteristics to obtain nonlinear energy characteristics;
And carrying out feature combination processing on the low-frequency spectrum feature and the nonlinear energy feature to obtain a frequency band energy feature.
6. The noise suppression method according to claim 1, wherein the spectral feature parameters include a feature real part parameter and a feature imaginary part parameter,
The step of performing correlation calculation on the frequency spectrum characteristic parameters and the frequency band energy characteristics to obtain cepstrum characteristics comprises the following steps:
Performing cross-correlation calculation on the characteristic real part parameter and the characteristic imaginary part parameter to obtain a cross-correlation parameter;
And carrying out energy correlation calculation on the cross-correlation parameters and the frequency band energy characteristics to obtain cepstrum characteristics.
7. The noise suppression method according to any one of claims 1-6, characterized in that the acquiring of the low-frequency spectral features and the high-frequency spectral features of the original speech signal comprises:
and obtaining an original voice signal, and performing linear domain transformation processing on the original voice signal to obtain a low-frequency spectrum characteristic and a high-frequency spectrum characteristic.
8. A noise suppression apparatus, the apparatus comprising:
The feature combination module is configured to acquire low-frequency spectrum features and high-frequency spectrum features of an original voice signal, and perform feature combination processing on the low-frequency spectrum features and the high-frequency spectrum features to obtain frequency band energy features;
The conversion processing module is configured to determine a current frame voice signal and a last frame voice signal in the original voice signal, and perform linear domain conversion processing on the current frame voice signal and the last frame voice signal to obtain frequency spectrum characteristic parameters;
the dimension reduction mapping module is configured to perform correlation calculation on the frequency spectrum characteristic parameters and the frequency band energy characteristics to obtain cepstrum characteristics, and perform dimension reduction mapping processing on the cepstrum characteristics to obtain dimension reduction characteristics;
the noise suppression module is configured to perform feature fusion processing on the dimension reduction features and the cepstrum features to obtain gain information, and perform noise suppression processing on the gain information to obtain a noise reduction voice signal of the original voice signal.
9. A computer readable medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the noise suppression method of any one of claims 1 to 7.
10. An electronic device, comprising:
A processor; and
A memory for storing executable instructions of the processor;
Wherein the processor is configured to perform the noise suppression method of any one of claims 1 to 7 via execution of the executable instructions.