CN114898762A - Real-time voice noise reduction method and device based on target person and electronic equipment - Google Patents

Info

Publication number
CN114898762A
CN114898762A (application number CN202210490036.7A)
Authority
CN
China
Prior art keywords
noise reduction
voice signal
target
real
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210490036.7A
Other languages
Chinese (zh)
Inventor
张超
魏庆凯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Kuaiyu Electronics Co ltd
Original Assignee
Beijing Kuaiyu Electronics Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Kuaiyu Electronics Co ltd filed Critical Beijing Kuaiyu Electronics Co ltd
Priority to CN202210490036.7A priority Critical patent/CN114898762A/en
Publication of CN114898762A publication Critical patent/CN114898762A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Quality & Reliability (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Soundproofing, Sound Blocking, And Sound Damping (AREA)

Abstract

The embodiment of the disclosure discloses a real-time voice noise reduction method and device based on a target person, and electronic equipment. One embodiment of the method comprises: acquiring a target voice signal and a target speaker voice signal; performing feature extraction on the target speaker voice signal to obtain a voiceprint feature of the target speaker; preprocessing the target voice signal to obtain a frequency spectrum feature of the target voice signal; inputting the frequency spectrum feature of the target voice signal and the voiceprint feature of the target speaker into a pre-trained noise reduction neural network to obtain a noise reduction parameter feature of the target voice signal; and post-processing the target voice signal according to the noise reduction parameter feature to obtain a noise-reduced voice signal. This embodiment achieves lightweight noise reduction processing and a better noise reduction effect.

Description

Real-time voice noise reduction method and device based on target person and electronic equipment
Technical Field
The embodiment of the disclosure relates to the technical field of computers, in particular to a real-time voice noise reduction method and device based on a target person and electronic equipment.
Background
With the increasing maturity of communication technology, the volume of voice and video calls keeps growing, and the terminals users employ are diverse, such as personal computers, mobile phones, and edge devices such as microphones. However, because users are in varied environments, background noise of many kinds strongly degrades the quality of voice communication, for example the sound of an indoor air-conditioner fan, keyboard typing, outdoor traffic, birdsong, and the like.
In addition, when there are other people in the communication environment of the user, the speaking voice of the other people can also interfere with the voice signal conversation. In order to make the heard voice signal more clear, the collected voice signal containing noise needs to be subjected to noise reduction processing to filter background noise and other speaking voice.
Disclosure of Invention
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
Some embodiments of the present disclosure provide a method, an apparatus, a storage medium, and a device for real-time voice noise reduction based on a target person to solve the technical problems mentioned in the background section above.
In a first aspect, some embodiments of the present disclosure provide a method for real-time voice noise reduction based on a target person, the method comprising: acquiring a target voice signal and a target speaker voice signal; extracting the characteristics of the voice signal of the target speaker to obtain the voiceprint characteristics of the target speaker; preprocessing the target voice signal to obtain the frequency spectrum characteristic of the target voice signal; inputting the frequency spectrum characteristic of the target voice signal and the voiceprint characteristic of the target speaker into a pre-trained noise reduction neural network to obtain the noise reduction parameter characteristic of the target voice signal; and carrying out post-processing on the target voice signal according to the parameter characteristics to obtain a noise reduction voice signal.
In a second aspect, some embodiments of the present disclosure provide a target person-based real-time speech noise reduction apparatus, the apparatus comprising: an acquisition unit configured to acquire a target speech signal and a target speaker speech signal; the extracting unit is configured to extract the characteristics of the voice signal of the target speaker to obtain the voiceprint characteristics of the target speaker; the preprocessing unit is configured to preprocess the target voice signal to obtain the frequency spectrum characteristic of the target voice signal; the generating unit is configured to input the frequency spectrum characteristic of the target speech signal and the voiceprint characteristic of the target speaker into a pre-trained noise reduction neural network to obtain a noise reduction parameter characteristic of the target speech signal; and the post-processing unit is configured to perform post-processing on the target voice signal according to the parameter characteristics to obtain a noise reduction voice signal.
In a third aspect, an embodiment of the present application provides an electronic device, where the electronic device includes: one or more processors; storage means for storing one or more programs; when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the method as described in any implementation of the first aspect.
In a fourth aspect, the present application provides a computer readable medium, on which a computer program is stored, and the computer program, when executed by a processor, implements the method as described in any implementation manner of the first aspect.
One of the above-described various embodiments of the present disclosure has the following advantageous effects: first, a target speech signal and a target speaker speech signal are obtained; then, feature extraction is performed on the target speaker speech signal to obtain a voiceprint feature of the target speaker; next, the target speech signal is preprocessed to obtain its frequency spectrum feature; further, the frequency spectrum feature and the voiceprint feature are input into a pre-trained noise reduction neural network to obtain a noise reduction parameter feature; and finally, the target speech signal is post-processed according to the parameter feature to obtain a noise-reduced speech signal. In this way, by processing the target speech signal, interference from background noise and other human voices is eliminated from the resulting noise-reduced speech signal, achieving lighter-weight noise reduction processing and a better noise reduction effect.
Drawings
The above and other features, advantages and aspects of various embodiments of the present disclosure will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numbers refer to the same or similar elements. It should be understood that the drawings are schematic and that elements and elements are not necessarily drawn to scale.
FIG. 1 is a schematic diagram of one application scenario of a targeted person-based real-time speech noise reduction method according to some embodiments of the present disclosure;
FIG. 2 is a flow diagram of some embodiments of a targeted person based real-time speech noise reduction method according to the present disclosure;
FIG. 3 is a schematic diagram of a noise-reducing neural network structure, according to some embodiments of the present disclosure;
FIG. 4 is a schematic diagram of a convolution module structure according to some embodiments of the present disclosure;
FIG. 5 is a schematic diagram of a self-attention module structure according to some embodiments of the present disclosure;
FIG. 6 is a schematic block diagram of some embodiments of a targeted person based real-time speech noise reduction apparatus according to the present disclosure;
FIG. 7 is a schematic structural diagram of an electronic device suitable for use in implementing some embodiments of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it is to be understood that the disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.
It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings. The embodiments and features of the embodiments in the present disclosure may be combined with each other without conflict.
It should be noted that the terms "first", "second", and the like in the present disclosure are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence relationship of the functions performed by the devices, modules or units.
It is noted that references to "a", "an", and "the" modifications in this disclosure are intended to be illustrative rather than limiting, and that those skilled in the art will recognize that "one or more" may be used unless the context clearly dictates otherwise.
The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.
The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
FIG. 1 is a schematic diagram of one application scenario of a targeted person-based real-time speech noise reduction method according to some embodiments of the present disclosure.
As shown in fig. 1, the execution body server 101 may obtain a target speech signal 103 and a target speaker speech signal 102, then perform feature extraction on the target speaker speech signal 102 to obtain a voiceprint feature 104 of the target speaker, then preprocess the target speech signal 103 to obtain a feature 105 and a frequency spectrum 106, then input the feature 105 and the voiceprint feature 104 into a pre-trained noise reduction neural network to obtain a noise reduction parameter feature 107, and finally post-process the frequency spectrum 106 according to the parameter feature 107 to obtain a noise-reduced speech signal 108.
It is understood that the real-time voice noise reduction method based on the target person may be executed by a terminal device, or may also be executed by the server 101, and the execution body of the method may also include a device formed by integrating the terminal device and the server 101 through a network, or may also be executed by various software programs. The terminal device may be various electronic devices with information processing capability, including but not limited to a smart phone, a tablet computer, an e-book reader, a laptop portable computer, a desktop computer, and the like. The execution subject may also be embodied as the server 101, software, etc. When the execution subject is software, the software can be installed in the electronic device listed above. It may be implemented, for example, as multiple software or software modules to provide distributed services, or as a single software or software module. And is not particularly limited herein.
It should be understood that the number of servers in fig. 1 is merely illustrative. There may be any number of servers, as desired for implementation.
With continued reference to fig. 2, a flow 200 of some embodiments of a targeted person-based real-time speech noise reduction method according to the present disclosure is shown. The real-time voice noise reduction method based on the target person comprises the following steps:
step 201, obtaining a target voice signal and a target speaker voice signal.
In some embodiments, the execution subject of the target-person-based real-time speech noise reduction method (e.g., the server shown in fig. 1) may obtain the target speech signal and the target speaker speech signal via a wired or wireless connection. It is noted that the signals may also be obtained from a local storage device. In addition, the wireless connection modes may include, but are not limited to, a 3G/4G connection, a WiFi connection, a Bluetooth connection, a WiMAX connection, a ZigBee connection, a UWB (ultra-wideband) connection, and other wireless connection modes now known or developed in the future.
Here, the target speech signal generally refers to a speech signal with noise and a target speaker's voice. The above-mentioned target speaker voice signal generally refers to a voice signal with the voice of the target speaker. The target speaker is usually selected in advance by the user.
Step 202, extracting the characteristics of the voice signal of the target speaker to obtain the voiceprint characteristics of the target speaker.
In some embodiments, based on the target speaker voice signal obtained in step 201, the executing entity (e.g., the server shown in fig. 1) can perform feature extraction on the target speaker voice signal to obtain the voiceprint feature of the target speaker. Here, there are various ways of feature extraction, which are not described herein again.
In some optional implementation manners of some embodiments, the executing entity may input the voice signal of the target speaker to a pre-trained voiceprint feature extraction network to obtain a voiceprint feature of the target speaker, where the voiceprint feature extraction network includes a depth residual layer and a local aggregation vector layer. Here, the depth residual layer (ResNet) is used to extract the features of the segmented speech signal, and the local aggregation vector layer (NetVLAD) is used to aggregate the segmented features.
The voiceprint feature extraction network extracts segment-level speech features with a ResNet network and aggregates them with a NetVLAD layer whose parameters are learned automatically during training, so the extracted voiceprint features are strongly resistant to interference: the voiceprint information can still be accurately extracted even when the target person is in a slightly noisy environment.
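The NetVLAD aggregation step can be sketched as follows; the feature dimension, segment count, and cluster count are illustrative assumptions, and the random projection and centers stand in for the trained parameters:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def netvlad(local_feats, clusters, centers):
    """NetVLAD aggregation: soft-assign each segment feature to K trainable
    cluster centers, accumulate residuals, and normalize.
    local_feats: (T, D) segment features from the ResNet front end
    clusters:    (D, K) soft-assignment projection (trainable)
    centers:     (K, D) cluster centers (trainable)
    Returns a single (K*D,) L2-normalized voiceprint vector."""
    a = softmax(local_feats @ clusters, axis=1)            # (T, K) assignments
    residuals = local_feats[:, None, :] - centers[None]    # (T, K, D)
    vlad = (a[:, :, None] * residuals).sum(axis=0)         # (K, D)
    vlad /= np.linalg.norm(vlad, axis=1, keepdims=True) + 1e-12  # intra-norm
    v = vlad.ravel()
    return v / (np.linalg.norm(v) + 1e-12)

rng = np.random.default_rng(0)
T, D, K = 20, 8, 4            # 20 segments, 8-dim features, 4 clusters (toy sizes)
feats = rng.standard_normal((T, D))
emb = netvlad(feats, rng.standard_normal((D, K)), rng.standard_normal((K, D)))
print(emb.shape)  # (32,)
```

Because the assignments are a differentiable softmax rather than a hard nearest-center choice, the projection and centers can be trained end to end, which is what gives the aggregated voiceprint its robustness.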
Step 203, preprocessing the target voice signal to obtain the frequency spectrum characteristic of the target voice signal.
In some embodiments, the executing entity may perform preprocessing on the target speech signal to obtain a frequency spectrum feature of the target speech signal. Here, the above-described preprocessing operation generally refers to an operation involving feature extraction on a target speech signal and determination of a frequency spectrum feature of the target speech signal. There are various ways to extract the frequency spectrum features and ways to determine the frequency spectrum features of the speech signal, which are not described herein again.
In some optional implementations of some embodiments, the executing entity performs framing, windowing, and a short-time Fourier transform on the target speech signal to obtain the real part and the imaginary part of the frequency spectrum of the target speech signal. Here, the windowing generally uses a Hanning window function.
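The framing, Hanning windowing, and short-time Fourier transform step can be sketched as follows; the frame length and hop size are illustrative assumptions, not values from the patent:

```python
import numpy as np

def stft_frames(x, frame_len=512, hop=128):
    """Frame the signal, apply a Hanning window to each frame, and take the
    FFT, returning the real and imaginary parts of the spectrum."""
    n = 1 + (len(x) - frame_len) // hop
    win = np.hanning(frame_len)
    frames = np.stack([x[i * hop : i * hop + frame_len] * win for i in range(n)])
    spec = np.fft.rfft(frames, axis=1)   # (n_frames, frame_len//2 + 1)
    return spec.real, spec.imag

fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t)          # 1 s of a 440 Hz tone
re, im = stft_frames(x)
print(re.shape)  # (122, 257)
```

The real and imaginary parts produced here are exactly the pair the description feeds into the energy-spectrum and phase-spectrum computations that follow.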
Then, an energy spectrum and a phase spectrum are obtained according to the real part and the imaginary part. As an example, the energy spectrum may be determined according to the following formula:
PCEN(t,f) = (E(t,f)/(M(t,f)+ε)^α + δ)^r - δ^r, where PCEN(t,f) represents the per-channel energy normalization of each frequency band channel, t represents the time-domain sequence, f represents the frequency-domain sequence, ε represents a small constant (as an example, ε may be 10^-8), α represents an energy normalization coefficient, and δ and the exponent r represent dynamic range coefficients. E(t,f) represents the energy spectrum, and M(t,f) represents the smoothed energy spectrum, determined according to the following formula: M(t,f) = (1-s)·M(t-1,f) + s·E(t,f), where s denotes a smoothing coefficient.
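A minimal numpy sketch of the PCEN computation above; the coefficient values used here are fixed examples, whereas in this disclosure parameters such as α, δ, and r are trainable:

```python
import numpy as np

def pcen(E, s=0.025, alpha=0.98, delta=2.0, r=0.5, eps=1e-8):
    """Per-channel energy normalization:
    PCEN(t,f) = (E(t,f)/(M(t,f)+eps)^alpha + delta)^r - delta^r,
    with M the first-order smoothed energy along the time axis."""
    M = np.zeros_like(E)
    M[0] = E[0]
    for t in range(1, len(E)):
        # M(t,f) = (1-s)*M(t-1,f) + s*E(t,f)
        M[t] = (1 - s) * M[t - 1] + s * E[t]
    return (E / (M + eps) ** alpha + delta) ** r - delta ** r

rng = np.random.default_rng(1)
E = rng.random((100, 64)) + 0.1     # toy energy spectrogram (frames x bands)
out = pcen(E)
print(out.shape)  # (100, 64)
```

The division by the smoothed energy M performs the automatic gain control, and the (x + δ)^r - δ^r step performs the dynamic range compression mentioned in the text.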
The execution body may then determine the phase spectrum. As an example, the execution body may determine the correction coefficient to be -2πthv, where t denotes the frame index, h denotes the frame shift, and v denotes the normalized frequency, and shift the compensated phase into the range (-π, π] to obtain a corrected phase spectrum. In this way, correcting and compensating the phase spectrum eliminates the variation of the sine-wave phase with frequency interval and frame shift.
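A sketch of the phase correction under the convention stated above; reading t as the frame index is an assumption on our part, and the modular arithmetic is one common way of wrapping phases into (-π, π]:

```python
import numpy as np

def correct_phase(phase, frame_idx, hop, n_fft):
    """Subtract the expected linear phase advance 2*pi*t*h*v per frame
    (t = frame index, h = frame shift in samples, v = normalized bin
    frequency) and wrap the result back into (-pi, pi]."""
    v = np.arange(phase.shape[-1]) / n_fft            # normalized frequencies
    corrected = phase - 2 * np.pi * frame_idx * hop * v
    # wrap to (-pi, pi]: mirror of the usual [-pi, pi) wrap
    return -((-corrected + np.pi) % (2 * np.pi) - np.pi)

phase = np.linspace(-20, 20, 257)                     # unwrapped toy phases
out = correct_phase(phase, frame_idx=3, hop=128, n_fft=512)
print(out.min() > -np.pi and out.max() <= np.pi)      # True
```

After this correction, a stationary sine wave keeps the same reported phase from frame to frame, which is the property the text says provides more accurate phase information to the network.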
And finally, determining the real part, the imaginary part, the energy spectrum and the phase spectrum as the characteristics of the target speech signal.
In this way, the preprocessing stage not only uses the speech spectrum but also adds trainable per-channel energy normalization and phase compensation as inputs to the network. The per-channel energy normalization applies automatic gain control and dynamic range control to the Mel spectrum and supplements more past information along the time axis; because its parameters are trainable, it adapts better to the noise reduction neural network and improves the overall speech noise reduction effect. The phase compensation eliminates the variation of the sine-wave phase with frequency interval and frame shift, providing more accurate phase information for the neural network.
And step 204, inputting the frequency spectrum characteristic and the voiceprint characteristic into a pre-trained noise reduction neural network to obtain a noise reduction parameter characteristic.
In some embodiments, the execution subject may input the frequency spectrum feature and the voiceprint feature to a pre-trained noise reduction neural network to obtain a noise reduction parameter feature.
Here, the above noise reduction neural network is generally used to characterize the correspondence between the frequency spectrum feature and voiceprint feature on the one hand and the parameter feature on the other. Specifically, the parameter features may be an amplitude mask parameter, a noise mask parameter, a gain parameter, an angle coefficient, and the like.
As an example, the noise reduction neural network may be a correspondence table determined by a researcher based on a large amount of data, the correspondence table has a large amount of sample features and sample voiceprint features and sample parameter features corresponding to the sample features and the sample voiceprint features, and the executing entity may find sample features and sample voiceprint features similar to or identical to the frequency spectrum features and the voiceprint features, and determine the corresponding sample parameter features as the parameter features.
As another example, the noise reduction neural network may be trained according to the following steps: acquiring a training sample set, wherein the training sample set comprises sample characteristics, sample voiceprint characteristics and parameter characteristics corresponding to the sample characteristics and the sample voiceprint characteristics; inputting the sample characteristics and the sample voiceprint characteristics into a model to be trained to obtain sample parameter characteristics; comparing the parameter characteristics with the sample parameter characteristics according to a preset loss function, and determining a loss value; and determining that the model to be trained is trained in response to the loss value meeting a preset condition, and determining the model to be trained as a noise reduction neural network. And adjusting relevant parameters in the model to be trained in response to the loss value not meeting the preset condition.
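The training procedure above (forward pass, loss against sample parameter features, stop when a preset condition on the loss is met, otherwise adjust parameters) can be sketched as follows; the linear stand-in model, MSE loss, learning rate, and threshold are illustrative assumptions, not the patent's actual network or loss:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((200, 16))      # sample features + sample voiceprint features
W_true = rng.standard_normal((16, 4))
Y = X @ W_true                          # corresponding sample parameter features

W = np.zeros((16, 4))                   # model to be trained (linear stand-in)
threshold, lr = 1e-3, 0.05
for step in range(5000):
    pred = X @ W                        # forward pass: predicted parameter features
    loss = np.mean((pred - Y) ** 2)     # preset loss function (MSE here)
    if loss < threshold:                # preset condition met: training is done
        break
    # otherwise adjust the relevant parameters (gradient step)
    W -= lr * 2 * X.T @ (X @ W - Y) / len(X)
print(loss < threshold)
```

The structure of the loop mirrors the description: compare, check the stopping condition, and only adjust parameters when the condition is not yet met.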
In some optional implementations of some embodiments, the noise reduction neural network includes a convolution module, a self-attention module, a gated recurrent unit module, and a deconvolution module connected in sequence.
As shown in fig. 3, the noise reduction neural network may be a neural network composed of a convolution module 301, a self-attention module 302, a gated recurrent unit module 303, and a deconvolution module 304.
The convolution module 301 is shown in fig. 4. The convolution module 301 is composed of 6 convolution units with the same structure but different parameters. Each convolution unit consists, in order, of a point convolution layer 401 (PConv), a batch normalization layer 402 (BatchNorm), a linear rectification function layer 403 (ReLU), a depthwise convolution layer 404 (DSConv), a batch normalization layer 405 (BatchNorm), and a linear rectification function layer 406 (ReLU). The output of the last linear rectification function (ReLU) layer of each convolution unit is sent both to the next layer of the network and, as part of the input, to the corresponding deconvolution unit of the deconvolution module 407. The convolution kernels of the point convolution layers of each convolution unit are all 1, and the step lengths are all 1. The convolution kernels of the depthwise convolution layers of the convolution units are respectively 5, 3, 5 and 3, and the step sizes are respectively 2, 1, 2 and 2.
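The parameter saving from pairing a point convolution with a depthwise convolution, as in the convolution units above, can be checked with a quick count; the channel sizes are illustrative assumptions and biases are ignored:

```python
def standard_conv_params(c_in, c_out, k):
    # standard 1-D convolution: every output channel mixes all input channels
    # across the full kernel width
    return c_in * c_out * k

def pconv_dsconv_params(c_in, c_out, k):
    # point conv (kernel 1) mixes channels; depthwise conv (kernel k)
    # then filters each channel independently
    return c_in * c_out * 1 + c_out * k

c_in, c_out, k = 64, 128, 5
std = standard_conv_params(c_in, c_out, k)   # 40960
sep = pconv_dsconv_params(c_in, c_out, k)    # 8832
print(std, sep, round(std / sep, 1))         # 40960 8832 4.6
```

For these example sizes, the separable pairing needs roughly 4.6 times fewer weights, which is the source of the reduced parameter and computation count claimed for the module.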
The self-attention module 302 is shown in fig. 5. The self-attention module consists of a multi-head self-attention layer 502 (MultiheadAttention), a layer normalization layer 503 (LayerNorm), a point convolution layer 504 (PConv), a batch normalization layer 505 (BatchNorm), and a linear rectification function layer 506 (ReLU). The output of the convolution module is first concatenated with the voiceprint feature 501 of the target speaker's speech signal, fed into the multi-head self-attention layer 502 (MultiheadAttention), passed through the layer normalization layer 503 (LayerNorm), sent into the point convolution layer 504 (PConv), and finally processed by the batch normalization layer 505 (BatchNorm) and the linear rectification function layer 506 (ReLU) before being fed into the next module.
The gated recurrent unit module 303 is composed of a gated recurrent unit layer (GRU), a point convolution layer (PConv), a batch normalization layer (BatchNorm), and a linear rectification function layer (ReLU). The output of the self-attention module is concatenated with the voiceprint feature of the target speaker's speech signal and fed into the gated recurrent unit layer (GRU), then passes in turn through the point convolution layer (PConv), the batch normalization layer (BatchNorm), and the linear rectification function layer (ReLU).
The deconvolution module 304 is composed of 6 deconvolution units with the same structure but different parameters. Each deconvolution unit consists, in order, of a point convolution layer (PConv), a batch normalization layer (BatchNorm), a linear rectification function layer (ReLU), a deconvolution layer (ConvT), a batch normalization layer (BatchNorm), and a linear rectification function layer (ReLU). The first PConv of each deconvolution unit receives input both from the previous layer of the network and from the last ReLU layer of the corresponding convolution unit of the convolution module; the two parts are concatenated to form the overall input. The convolution kernel of the point convolution layer of each deconvolution unit is 1, and the step length is 1. The convolution kernels of the deconvolution layers of the deconvolution units are respectively 3, 5, 3 and 5, and the step sizes are respectively 2, 1, 2, 1 and 2.
In some optional implementations of some embodiments, the self-attention module includes a multi-headed self-attention layer, a normalization layer, a point convolution layer, a batch normalization layer, and/or a linear rectification function layer.
In some optional implementations of some embodiments, the gated recurrent unit module includes a gated recurrent unit layer, a point convolution layer, a batch normalization layer, and/or a linear rectification function layer.
Here, the convolution module of the noise reduction neural network compresses the spectral dimensions. Point convolution and depthwise convolution are used, which reduces the parameter count and computation compared with a standard convolution. The self-attention module synthesizes the frequency-domain features extracted by the convolution module; a self-attention network is used instead of a recurrent neural network because self-attention better captures global information. The gated recurrent unit processes the relationship between successive speech frames and establishes the information flow along the time axis; a gated recurrent unit is used rather than a long short-term memory network, which further reduces the computation of the network. The voiceprint feature vector of the target speaker is additionally injected into both the self-attention module and the gated recurrent unit module, so that the network can learn to retain the target speaker's voice in a targeted manner while eliminating other voices and noise. The deconvolution module restores the compressed frequency-domain features to the original spectral dimensions to output a frequency-domain mask. The deconvolution module adopts a structure symmetric to the convolution module; its input receives not only the output of the previous network layer but also, directly, the output of the corresponding unit of the convolution module, so the gradients are more stable during training and an effective network model is easier to train.
Meanwhile, the whole neural network is lightweight: apart from the voiceprint extractor, the part of the network that must run in real time has only about 700k parameters, so the parameter count is small, the computation is small, system resource occupation is low, and the noise reduction effect is good.
Step 205, performing post-processing on the target voice signal according to the parameter characteristics to obtain a noise-reduced voice signal.
In some embodiments, the executing entity may post-process the target speech signal based on the parameter characteristics obtained in step 204 to obtain a noise-reduced speech signal. Here, post-processing generally refers to the operation of optimizing the spectral features using the parameter characteristics and restoring them to a speech signal.
In some optional implementations of some embodiments, the executing entity may determine the phase-aware mask gain coefficients according to the parameter characteristics. Here, the phase-aware mask gain coefficients generally refer to a speech signal amplitude mask gain coefficient and a speech signal phase mask gain coefficient.
As an example, the parameter characteristics may be an amplitude mask parameter z, a noise mask parameter z_n, a gain parameter φ_β, and angle coefficients φ_1 and φ_2. The executing entity may first calculate, from the gain parameter φ_β, the lower limit β_s of the amplitude-mask gain multiple (the formula appears only as an image in the original publication). An upper limit is then imposed on the gain multiple to obtain the actual amplitude-mask gain β = min(β_s, 1/|sigmoid(z − z_n) − sigmoid(z_n − z)|), where the sigmoid function is sigmoid(x) = 1/(1 + e^(−x)). Further, the speech signal amplitude mask M = β·sigmoid(z − z_n) and the noise amplitude mask M_n = β·sigmoid(z_n − z) can be calculated, along with the cosine term cos(Δθ) = (1 + |M|² − |M_n|²)/(2|M|). Direction coefficients are then calculated from the angle coefficients φ_1 and φ_2, and from them the angular direction ξ (these formulas likewise appear only as images in the original). Thus, the speech signal phase mask e^(jΔθ) = cos(Δθ) + jξ·sin(Δθ) is calculated.
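A minimal NumPy sketch of the mask computation above. The lower limit β_s and the direction sign ξ are taken as given inputs here, since their exact formulas are shown only in the patent figures; the function name and the small numerical safeguards are this sketch's own assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def phase_aware_mask(z, z_n, beta_s, xi):
    """z, z_n: amplitude/noise mask parameters from the network.
    beta_s: gain-multiple lower limit (formula given only in the figure).
    xi: direction sign of the phase rotation.
    Returns the amplitude mask M and a unit-modulus phase mask."""
    d = np.abs(sigmoid(z - z_n) - sigmoid(z_n - z))
    beta = np.minimum(beta_s, 1.0 / np.maximum(d, 1e-12))  # capped gain multiple
    M = beta * sigmoid(z - z_n)        # speech amplitude mask
    M_n = beta * sigmoid(z_n - z)      # noise amplitude mask
    cos_dt = np.clip((1 + M**2 - M_n**2) / (2 * M), -1.0, 1.0)
    sin_dt = np.sqrt(1.0 - cos_dt**2)
    return M, cos_dt + 1j * xi * sin_dt
```

Because cos²(Δθ) + sin²(Δθ) = 1, the returned phase mask has unit modulus by construction: it only rotates the phase, while M scales the amplitude.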
And then, a noise reduction energy spectrum and a noise reduction phase spectrum are generated according to the phase-aware mask gain coefficients, the energy spectrum and the phase spectrum. As an example, the executing entity may apply the speech signal amplitude mask gain coefficient and the speech signal phase mask gain coefficient to the energy spectrum and the phase spectrum, respectively, to obtain the energy spectrum and phase spectrum of the noise-reduced speech signal.
And then, the noise reduction real part and noise reduction imaginary part of the noise reduction frequency spectrum feature are determined according to the noise reduction energy spectrum and the noise reduction phase spectrum, and windowing and an inverse short-time Fourier transform are performed on the noise reduction real and imaginary parts to obtain the noise-reduced speech signal.
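The masking and inverse-transform step can be sketched as a weighted overlap-add in NumPy. The function name and the simplified window normalization are illustrative assumptions, not the patent's exact procedure:

```python
import numpy as np

def reconstruct(mag, phase, amp_mask, phase_mask, win, hop):
    """Apply amplitude and phase masks to a (frames, bins) spectrum and
    overlap-add the inverse FFT frames back into a waveform.
    Window normalization is simplified to division by the summed
    squared synthesis window."""
    spec = (mag * amp_mask) * np.exp(1j * phase) * phase_mask
    frames = np.fft.irfft(spec, axis=1) * win      # windowed inverse FFT
    n = len(win)
    out = np.zeros(hop * (len(frames) - 1) + n)
    norm = np.zeros_like(out)
    for i, f in enumerate(frames):
        out[i * hop:i * hop + n] += f
        norm[i * hop:i * hop + n] += win**2
    return out / np.maximum(norm, 1e-8)
```

With identity masks this analysis/synthesis pair reconstructs the interior of the input signal exactly, which is a useful sanity check before applying real masks.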
The phase-aware mask is a complex mask: it acts not only on the magnitude spectrum but also on the phase spectrum, and the phase information it provides helps recover clean speech better. Moreover, the range of the amplitude mask within the phase-aware mask is not restricted, so more scenarios with different signal-to-noise ratios can be covered.
One of the above-described embodiments of the present disclosure has the following beneficial effects: a target speech signal and a target speaker speech signal are first obtained; feature extraction is then performed on the target speaker speech signal to obtain the voiceprint features of the target speaker; the target speech signal is preprocessed to obtain its frequency spectrum features; the frequency spectrum features and the voiceprint features are then input into a pre-trained noise reduction neural network to obtain noise reduction parameter characteristics; finally, the target speech signal is post-processed according to the parameter characteristics to obtain a noise-reduced speech signal. By processing the target speech signal in this way, interference from background noise and other speakers' voices is removed from the resulting noise-reduced speech signal, achieving lighter-weight noise reduction processing and a better noise reduction effect.
With further reference to fig. 6, as an implementation of the methods shown in the above figures, the present disclosure provides some embodiments of a target person-based real-time speech noise reduction apparatus, which correspond to those of the method embodiments shown in fig. 2, and which may be particularly applied in various electronic devices.
As shown in fig. 6, the target person-based real-time speech noise reduction apparatus 600 of some embodiments includes: an acquisition unit 601, an extraction unit 602, a pre-processing unit 603, a generation unit 604, and a post-processing unit 605. The obtaining unit 601 is configured to obtain a target speech signal and a target speaker speech signal; the extracting unit 602 is configured to perform feature extraction on the target speaker speech signal to obtain the voiceprint features of the target speaker; the preprocessing unit 603 is configured to preprocess the target speech signal to obtain the frequency spectrum features of the target speech signal; the generating unit 604 is configured to input the frequency spectrum features and the voiceprint features into a pre-trained noise reduction neural network to obtain noise reduction parameter features, wherein the noise reduction neural network includes a convolution module, a self-attention module, a gated recurrent unit module, and/or a deconvolution module connected in sequence; the post-processing unit 605 is configured to post-process the target speech signal according to the parameter features to obtain a noise-reduced speech signal.
In some optional implementations of some embodiments, the noise reduction neural network includes a convolution module, a self-attention module, a gated recurrent unit module, and a deconvolution module connected in sequence.
In an alternative implementation of some embodiments, the self-attention module includes a multi-head self-attention layer, a normalization layer, a pointwise convolution layer, a batch normalization layer, and/or a linear rectification function (ReLU) layer.
In an optional implementation of some embodiments, the gated recurrent unit module includes a gated recurrent unit layer, a pointwise convolution layer, a batch normalization layer, and/or a linear rectification function (ReLU) layer.
In an alternative implementation of some embodiments, the pre-processing unit is further configured to: perform framing, windowing and a short-time Fourier transform on the target speech signal to obtain the real part and imaginary part of the frequency spectrum of the target speech signal; obtain an energy spectrum and a phase spectrum from the real and imaginary parts; and take the energy spectrum, the phase spectrum, and the real and imaginary parts of the frequency spectrum as the frequency spectrum features of the target speech signal.
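The preprocessing chain just described (framing, windowing, short-time Fourier transform, then energy and phase spectra) can be sketched in NumPy; the frame length, hop size and function name below are illustrative assumptions:

```python
import numpy as np

def spectrum_features(x, n_fft=512, hop=256):
    """Frame, window and short-time-Fourier-transform a waveform.
    Returns the real part, imaginary part, energy spectrum and phase
    spectrum, each of shape (frames, n_fft // 2 + 1)."""
    win = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * win
                       for i in range(n_frames)])
    spec = np.fft.rfft(frames, axis=1)
    real, imag = spec.real, spec.imag
    energy = real**2 + imag**2        # energy (power) spectrum
    phase = np.arctan2(imag, real)    # phase spectrum
    return real, imag, energy, phase
```

All four outputs are stacked as the network's input features, matching the claim that the energy spectrum, phase spectrum, real part and imaginary part together form the frequency spectrum features.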
In an alternative implementation of some embodiments, the post-processing unit is further configured to: determine the phase-aware mask gain coefficients according to the parameter features; generate a noise reduction energy spectrum and a noise reduction phase spectrum according to the phase-aware mask gain coefficients, the energy spectrum and the phase spectrum; determine the noise reduction real part and noise reduction imaginary part of the noise reduction frequency spectrum feature according to the noise reduction energy spectrum and the noise reduction phase spectrum; and perform windowing and an inverse short-time Fourier transform on the noise reduction real and imaginary parts to obtain a noise-reduced speech signal.
In an alternative implementation of some embodiments, the extraction unit is further configured to: input the target speaker speech signal into a pre-trained voiceprint feature extraction network to obtain the voiceprint features of the target speaker, wherein the voiceprint feature extraction network includes a deep residual layer and a locally aggregated vector layer, the deep residual layer being used to extract features of the segmented speech signal and the locally aggregated vector layer being used to aggregate those features.
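The locally aggregated vector pooling mentioned here (a NetVLAD-style layer) can be sketched as follows. The soft-assignment form, cluster count and feature dimensions are illustrative assumptions, since the patent does not give the layer's equations:

```python
import numpy as np

def vlad_aggregate(feats, centers):
    """feats: (T, D) frame-level features from the residual layers.
    centers: (K, D) learned cluster centers.
    Soft-assigns each frame to the clusters, sums the residuals per
    cluster, and returns an L2-normalized (K*D,) utterance embedding."""
    d2 = ((feats[:, None, :] - centers[None, :, :])**2).sum(-1)   # (T, K)
    a = np.exp(-d2)
    a /= a.sum(1, keepdims=True) + 1e-12                          # soft assignment
    resid = feats[:, None, :] - centers[None, :, :]               # (T, K, D)
    v = (a[:, :, None] * resid).sum(0)                            # (K, D) residual sums
    v /= np.linalg.norm(v, axis=1, keepdims=True) + 1e-8          # intra-normalization
    v = v.ravel()
    return v / (np.linalg.norm(v) + 1e-8)                         # global L2 norm
```

The fixed-length output is what makes a variable-length enrollment utterance usable as a single voiceprint vector injected into the noise reduction network.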
It will be understood that the elements described in the apparatus 600 correspond to various steps in the method described with reference to fig. 2. Thus, the operations, features and resulting advantages described above with respect to the method are also applicable to the apparatus 600 and the units included therein, and are not described herein again.
Referring now to fig. 7, a block diagram of an electronic device (e.g., server in fig. 1) 700 suitable for use in implementing some embodiments of the present disclosure is shown. The electronic device shown in fig. 7 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 7, electronic device 700 may include a processing means (e.g., central processing unit, graphics processor, etc.) 701 that may perform various appropriate actions and processes in accordance with a program stored in a Read-Only Memory (ROM) 702 or a program loaded from a storage device 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data necessary for the operation of the electronic device 700 are also stored. The processing device 701, the ROM 702, and the RAM 703 are connected to each other by a bus 704. An input/output (I/O) interface 705 is also connected to the bus 704.
Generally, the following devices may be connected to the I/O interface 705: input devices 706 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; an output device 707 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; and a communication device 709. The communication means 709 may allow the electronic device 700 to communicate wirelessly or by wire with other devices to exchange data. While fig. 7 illustrates an electronic device 700 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided. Each block shown in fig. 7 may represent one device or may represent multiple devices as desired.
In particular, according to some embodiments of the present disclosure, the processes described above with reference to the flow diagrams may be implemented as computer software programs. For example, some embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In some such embodiments, the computer program may be downloaded and installed from a network via communications means 709, or may be installed from storage 708, or may be installed from ROM 702. The computer program, when executed by the processing device 701, performs the above-described functions defined in the methods of some embodiments of the present disclosure.
It should be noted that the computer readable medium described above in some embodiments of the present disclosure may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In some embodiments of the disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In some embodiments of the present disclosure, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. 
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
In some embodiments, the clients and servers may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed network.
The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device. The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquire a target speech signal and a target speaker speech signal; perform feature extraction on the target speaker speech signal to obtain the voiceprint features of the target speaker; preprocess the target speech signal to obtain the frequency spectrum features of the target speech signal; input the frequency spectrum features and the voiceprint features into a pre-trained noise reduction neural network to obtain noise reduction parameter features, wherein the noise reduction neural network includes a convolution module, a self-attention module, a gated recurrent unit module and/or a deconvolution module connected in sequence; and post-process the target speech signal according to the parameter features to obtain a noise-reduced speech signal.
Computer program code for carrying out operations for embodiments of the present disclosure may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java, Smalltalk and C++, as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in some embodiments of the present disclosure may be implemented by software, and may also be implemented by hardware. The described units may also be provided in a processor, which may be described as: a processor includes an acquisition unit, an extraction unit, a pre-processing unit, a generation unit, and a post-processing unit. The names of the units do not in some cases constitute a limitation of the units themselves; for example, the obtaining unit may also be described as a "unit that obtains a target speech signal and a target speaker speech signal".
The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.
The foregoing description is only of preferred embodiments of the present disclosure and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention in the embodiments of the present disclosure is not limited to technical solutions formed by the specific combination of the above technical features, and should also cover other technical solutions formed by any combination of the above technical features or their equivalents without departing from the above inventive concept, for example, a technical solution formed by replacing the above-mentioned features with (but not limited to) technical features having similar functions disclosed in the embodiments of the present disclosure.

Claims (10)

1. A real-time voice noise reduction method based on a target person comprises the following steps:
acquiring a target voice signal and a target speaker voice signal;
carrying out feature extraction on the voice signal of the target speaker to obtain the voiceprint feature of the target speaker;
preprocessing the target voice signal to obtain the frequency spectrum characteristic of the target voice signal;
inputting the frequency spectrum characteristic of the target voice signal and the voiceprint characteristic of the target speaker into a pre-trained noise reduction neural network to obtain the noise reduction parameter characteristic of the target voice signal;
and post-processing the target voice signal according to the noise reduction parameter characteristics to obtain a noise reduction voice signal.
2. The real-time speech noise reduction method of claim 1, wherein the noise reduction neural network comprises a convolution module, a self-attention module, a gated recurrent unit module, and a deconvolution module connected in sequence.
3. The real-time speech noise reduction method of claim 2, wherein the self-attention module comprises a multi-head self-attention layer, a normalization layer, a pointwise convolution layer, a batch normalization layer and/or a linear rectification function layer.
4. The real-time speech noise reduction method of claim 2, wherein the gated recurrent unit module comprises a gated recurrent unit layer, a pointwise convolution layer, a batch normalization layer and/or a linear rectification function layer.
5. The real-time speech noise reduction method of claim 1, wherein the pre-processing the target speech signal comprises:
performing framing, windowing and a short-time Fourier transform on the target voice signal to obtain a real part and an imaginary part of a frequency spectrum of the target voice signal;
obtaining an energy spectrum and a phase spectrum according to the real part and the imaginary part of the frequency spectrum of the target voice signal;
and taking the energy spectrum, the phase spectrum, and the real part and the imaginary part of the frequency spectrum of the target voice signal as the frequency spectrum characteristics of the target voice signal.
6. The real-time speech noise reduction method of claim 5, wherein the post-processing the target speech signal according to the noise reduction parameter features comprises:
determining a phase perception mask gain coefficient according to the noise reduction parameter characteristics;
generating a noise reduction energy spectrum and a noise reduction phase spectrum according to the phase perception mask gain coefficient, the energy spectrum and the phase spectrum;
determining a real part and an imaginary part of a noise reduction frequency spectrum according to the noise reduction energy spectrum and the noise reduction phase spectrum;
and performing windowing and an inverse short-time Fourier transform on the real part and the imaginary part of the noise reduction frequency spectrum to obtain a noise-reduced voice signal.
7. The real-time speech noise reduction method of claim 1, wherein the feature extracting the target speaker speech signal comprises:
and inputting the voice signal of the target speaker into a pre-trained voiceprint feature extraction network to obtain the voiceprint feature of the target speaker, wherein the voiceprint feature extraction network comprises a deep residual layer and a locally aggregated vector layer, the deep residual layer being used for extracting features of the segmented voice signal and the locally aggregated vector layer being used for aggregating the features of the segmented voice signal.
8. A targeted person based real-time speech noise reduction apparatus comprising:
an acquisition unit configured to acquire a target speech signal and a target speaker speech signal;
the extracting unit is configured to perform feature extraction on the voice signal of the target speaker to obtain the voiceprint feature of the target speaker;
the preprocessing unit is configured to preprocess the target voice signal to obtain the frequency spectrum characteristic of the target voice signal;
the noise reduction unit is configured to input the frequency spectrum characteristic of the target speech signal and the voiceprint characteristic of the target speaker into a pre-trained noise reduction neural network to obtain a noise reduction parameter characteristic of the target speech signal;
and the post-processing unit is configured to perform post-processing on the target voice signal according to the noise reduction parameter characteristics to obtain a noise reduction voice signal.
9. The real-time speech noise reduction apparatus of claim 8, wherein the noise reduction neural network comprises a convolution module, a self-attention module, a gated recurrent unit module, and a deconvolution module connected in sequence.
10. An electronic device, comprising:
one or more processors;
a storage device having one or more programs stored thereon,
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method recited in any of claims 1-7.
CN202210490036.7A 2022-05-07 2022-05-07 Real-time voice noise reduction method and device based on target person and electronic equipment Pending CN114898762A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210490036.7A CN114898762A (en) 2022-05-07 2022-05-07 Real-time voice noise reduction method and device based on target person and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210490036.7A CN114898762A (en) 2022-05-07 2022-05-07 Real-time voice noise reduction method and device based on target person and electronic equipment

Publications (1)

Publication Number Publication Date
CN114898762A true CN114898762A (en) 2022-08-12

Family

ID=82720053

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210490036.7A Pending CN114898762A (en) 2022-05-07 2022-05-07 Real-time voice noise reduction method and device based on target person and electronic equipment

Country Status (1)

Country Link
CN (1) CN114898762A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115394310A (en) * 2022-08-19 2022-11-25 中邮消费金融有限公司 Neural network-based background voice removing method and system
CN115394310B (en) * 2022-08-19 2023-04-07 中邮消费金融有限公司 Neural network-based background voice removing method and system
WO2024082928A1 (en) * 2022-10-21 2024-04-25 腾讯科技(深圳)有限公司 Voice processing method and apparatus, and device and medium
CN116229986A (en) * 2023-05-05 2023-06-06 北京远鉴信息技术有限公司 Voice noise reduction method and device for voiceprint identification task
CN116229986B (en) * 2023-05-05 2023-07-21 北京远鉴信息技术有限公司 Voice noise reduction method and device for voiceprint identification task

Similar Documents

Publication Publication Date Title
CN110600017B (en) Training method of voice processing model, voice recognition method, system and device
US10679612B2 (en) Speech recognizing method and apparatus
CN108564963B (en) Method and apparatus for enhancing voice
WO2021196905A1 (en) Voice signal dereverberation processing method and apparatus, computer device and storage medium
CN114898762A (en) Real-time voice noise reduction method and device based on target person and electronic equipment
Karthik et al. Efficient speech enhancement using recurrent convolution encoder and decoder
US20240038252A1 (en) Sound signal processing method and apparatus, and electronic device
US10141008B1 (en) Real-time voice masking in a computer network
WO2021057239A1 (en) Speech data processing method and apparatus, electronic device and readable storage medium
WO2022183806A1 (en) Voice enhancement method and apparatus based on neural network, and electronic device
CN111597825B (en) Voice translation method and device, readable medium and electronic equipment
CN112786069B (en) Voice extraction method and device and electronic equipment
CN110827808A (en) Speech recognition method, speech recognition device, electronic equipment and computer-readable storage medium
CN109961141A (en) Method and apparatus for generating quantization neural network
CN112750444A (en) Sound mixing method and device and electronic equipment
CN111462727A (en) Method, apparatus, electronic device and computer readable medium for generating speech
CN116403594B (en) Speech enhancement method and device based on noise update factor
CN114783455A (en) Method, apparatus, electronic device and computer readable medium for voice noise reduction
WO2024027295A1 (en) Speech enhancement model training method and apparatus, enhancement method, electronic device, storage medium, and program product
CN111276134A (en) Speech recognition method, apparatus and computer-readable storage medium
CN116959469A (en) Training method and device for voice enhancement model, electronic equipment and storage medium
CN113284504A (en) Attitude detection method and apparatus, electronic device, and computer-readable storage medium
CN113674752A (en) Method and device for reducing noise of audio signal, readable medium and electronic equipment
CN117496990A (en) Speech denoising method, device, computer equipment and storage medium
WO2022213825A1 (en) Neural network-based end-to-end speech enhancement method and apparatus

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination