CN110534123B - Voice enhancement method and device, storage medium and electronic equipment - Google Patents


Info

Publication number
CN110534123B
CN110534123B (application CN201910663257.8A)
Authority
CN
China
Prior art keywords
voice
speech
enhancement
module
preset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910663257.8A
Other languages
Chinese (zh)
Other versions
CN110534123A (en)
Inventor
李晨星
许家铭
徐波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science
Priority to CN201910663257.8A
Publication of CN110534123A
Application granted
Publication of CN110534123B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/04 Segmentation; Word boundary detection
    • G10L15/20 Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Quality & Reliability (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

Embodiments of the invention relate to a speech enhancement method and apparatus, a storage medium, and an electronic device. The method comprises the following steps: calling a voice acquisition device to collect speech in the current environment; processing the speech according to a preset speech processing algorithm to obtain single-channel speech; performing sentence segmentation on the single-channel speech to obtain a segmented speech data stream containing a preset type of sound; inputting the segmented speech data stream into a preset speech enhancement network model to obtain the corresponding enhanced speech; and synthesizing the enhanced speech into speech segments. The method can therefore be applied in multiple scenarios while avoiding the influence of noise; because it takes speech characteristics into account, it avoids introducing distortion and damaging the speech.

Description

Voice enhancement method and device, storage medium and electronic equipment
Technical Field
The embodiment of the invention relates to the technical field of automatic processing of computer information, in particular to a voice enhancement method, a voice enhancement device, a storage medium and electronic equipment.
Background
Speech, the material shell of language, is its external form and the most direct symbolic record of human mental activity; it is also one of the most natural and effective means for users to exchange information. When a speech signal is captured, it is inevitably interfered with by environmental noise, room reverberation, and other speakers. This seriously degrades speech quality and, in turn, the performance of speech recognition, which is what motivated speech enhancement. As a preprocessing step, speech enhancement is an effective way to suppress interference and improve the far-field speech recognition rate.
Speech enhancement is the technique of extracting the useful speech signal from a noise background, suppressing and reducing noise interference when the speech signal is disturbed or even submerged by various noises. In short, it extracts the original speech from noisy speech as cleanly as possible.
In the related art, traditional speech enhancement methods mainly include spectral subtraction, Wiener filtering, and short-time spectral amplitude enhancement based on the minimum mean square error. Although these methods are fast and do not require large-scale training corpora, they depend heavily on noise estimation, apply to only a few scenarios, and cannot take speech characteristics into account, so they inevitably introduce distortion and damage the speech.
Disclosure of Invention
In view of the above, to solve the technical problems or some technical problems, embodiments of the present invention provide a voice enhancement method, apparatus, storage medium, and electronic device.
In a first aspect, an embodiment of the present invention provides a speech enhancement method, where the method includes:
calling voice acquisition equipment to acquire voice in the current environment;
processing the voice according to a preset voice processing algorithm to obtain single-channel voice;
performing sentence segmentation on the single-channel voice to obtain a voice segmented data stream containing preset type sounds;
inputting the voice segment data stream into a preset voice enhancement network model to obtain enhanced voice corresponding to the voice segment data stream;
synthesizing the enhanced speech into speech segments.
In a possible embodiment, the processing the speech according to a preset speech processing algorithm to obtain a single-channel speech includes:
and carrying out A/D conversion on the voice, and sampling according to a preset sampling rate to obtain single-channel voice.
In a possible embodiment, the sentence-segmentation on the single-channel speech to obtain a speech segmented data stream containing a preset type of sound includes:
segmenting sentences of the voice in the single-channel voice within a preset threshold range;
for any frame of voice in the single-channel voice within a preset threshold range, detecting whether preset type voice is contained or not by utilizing a pre-established neural network model;
if the frame voice contains the preset type of sound, the frame voice is reserved;
and combining all the voice frames containing the preset type of voice to obtain the voice segment data stream containing the preset type of voice.
In one possible embodiment, the method further comprises:
and if the frame voice does not contain the preset type of sound, filtering the frame voice.
In a second aspect, an embodiment of the present invention provides a speech enhancement apparatus, where the apparatus includes:
the voice acquisition module is used for calling voice acquisition equipment and acquiring voice in the current environment;
the voice processing module is used for processing the voice according to a preset voice processing algorithm to obtain single-channel voice;
the voice segmentation module is used for segmenting the single-channel voice to obtain a voice segmentation data stream containing preset type sounds;
the voice enhancement module is used for inputting the voice segment data stream into a preset voice enhancement network model to obtain enhanced voice corresponding to the voice segment data stream;
and the voice synthesis module is used for synthesizing the enhanced voice into a voice section.
In one possible implementation, the speech processing module is specifically configured to:
and carrying out A/D conversion on the voice, and sampling according to a preset sampling rate to obtain single-channel voice.
In a possible implementation manner, the speech segmentation module is specifically configured to:
segmenting sentences of the voice in the single-channel voice within a preset threshold range;
for any frame of voice in the single-channel voice within a preset threshold range, detecting whether preset type voice is contained or not by utilizing a pre-established neural network model;
if the frame voice contains the preset type of sound, the frame voice is reserved;
and combining all the voice frames containing the preset type of voice to obtain the voice segment data stream containing the preset type of voice.
In one possible embodiment, the apparatus further comprises:
and the voice filtering module is used for filtering the frame voice if the frame voice does not contain the preset type of sound.
In a third aspect, an embodiment of the present invention provides a storage medium, where one or more programs are stored, and the one or more programs are executable by one or more processors to implement the foregoing speech enhancement method.
In a fourth aspect, an embodiment of the present invention provides an electronic device, including: a processor and a memory, the processor being configured to execute a speech enhancement program stored in the memory to implement the aforementioned speech enhancement method.
According to the technical scheme provided by the embodiment of the present invention, the speech is processed to obtain single-channel speech, the single-channel speech is segmented into a data stream containing the preset type of sound, and the data stream is input into the preset speech enhancement network model. This avoids the influence of noise and, by taking speech characteristics into account, avoids introducing distortion and damaging the speech. The resulting enhanced speech is synthesized into speech segments, enabling multi-scene application.
Drawings
In order to more clearly illustrate the embodiments of the present specification or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description show only some of the embodiments described in this specification; other drawings can be derived from them by those skilled in the art.
FIG. 1 is a flow chart illustrating an implementation of a speech enhancement method according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a speech enhancement apparatus according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
For the convenience of understanding of the embodiments of the present invention, the following description will be further explained with reference to specific embodiments, which are not to be construed as limiting the embodiments of the present invention.
As shown in fig. 1, an implementation flow diagram of a speech enhancement method provided in an embodiment of the present invention is shown, where the method specifically includes the following steps:
and S101, calling voice acquisition equipment to acquire the voice in the current environment.
In the embodiment of the present invention, the current environment may be a far-field noisy acoustic environment, which is not limited by the embodiment of the present invention.
In the current environment, a voice acquisition device such as a microphone is called to collect speech. The collected speech carries both the original voice of the target user and the noise of the current environment; that noise may be the voices of other users, music, impact sounds, and so on in the current environment. Relative to the original voice of the target user, all other sounds can be regarded as noise, which the embodiment of the present invention does not limit.
And S102, processing the voice according to a preset voice processing algorithm to obtain single-channel voice.
The speech collected in step S101 is processed according to a preset speech processing algorithm to obtain single-channel speech. An optional implementation of this processing is as follows:
and carrying out A/D conversion on the voice, and sampling according to a preset sampling rate to obtain single-channel voice. In this case, a/D refers to a circuit that converts an analog signal into a digital signal and is called an analog-to-digital converter.
For example, a microphone is called to collect speech in the current environment; the speech is A/D converted and sampled at a rate of 16,000 Hz, yielding single-channel speech with a 16 kHz sampling rate.
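This kind of preprocessing can be sketched in a few lines of NumPy. The function below is an illustrative stand-in for the patent's unspecified preprocessing algorithm: it averages channels to mono and resamples by linear interpolation (a production system would typically use a polyphase or band-limited resampler); the function name and the 48 kHz input rate are assumptions.

```python
import numpy as np

def to_mono_16k(x, sr, target_sr=16000):
    """Average channels to mono, then resample to `target_sr` by linear
    interpolation (a simple stand-in for proper band-limited resampling)."""
    if x.ndim == 2:
        x = x.mean(axis=1)                       # (samples, channels) -> mono
    n_out = int(round(len(x) * target_sr / sr))  # output sample count
    t_in = np.linspace(0.0, 1.0, num=len(x), endpoint=False)
    t_out = np.linspace(0.0, 1.0, num=n_out, endpoint=False)
    return np.interp(t_out, t_in, x)

stereo = np.random.randn(48000, 2)               # 1 s of 48 kHz stereo input
mono = to_mono_16k(stereo, sr=48000)             # 1 s of 16 kHz mono output
```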
S103, performing sentence segmentation on the single-channel voice to obtain a voice segmented data stream containing preset type sounds;
A neural network model is pre-trained to detect whether each frame of speech contains the preset type of sound, where the preset type of sound refers to the original voice of the target user.
The speech in the single-channel stream is segmented into sentences within a preset threshold range, and for any frame of that speech the pre-trained neural network model detects whether it contains the preset type of sound.
If a frame of speech contains the preset type of sound, the frame is kept; if it does not, the frame is filtered out. In this way, the pre-trained neural network model filters out all frames other than the target user's original voice and leaves only the frames containing the preset type of sound.
All the retained frames containing the preset type of sound are combined to obtain the segmented speech data stream containing the preset type of sound.
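The keep-or-filter loop of S103 can be sketched as follows. The 256-sample frame size and the energy-threshold detector are placeholders: the patent does not fix a frame length, and its detector is a pre-trained neural network rather than an energy gate.

```python
import numpy as np

def segment_speech(wave, is_speech, frame=256):
    """Split `wave` into fixed-size frames, keep only the frames the detector
    flags as containing the preset type of sound, and concatenate the kept
    frames into a single segmented data stream."""
    n = len(wave) // frame
    frames = wave[:n * frame].reshape(n, frame)
    kept = [f for f in frames if is_speech(f)]   # filter out non-speech frames
    return np.concatenate(kept) if kept else np.empty(0)

# Toy energy detector standing in for the pre-trained neural network model:
energy_vad = lambda f: float(np.mean(f ** 2)) > 0.01

wave = np.concatenate([np.ones(256), np.zeros(256), 0.5 * np.ones(256)])
stream = segment_speech(wave, energy_vad)        # middle (silent) frame dropped
```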
S104, inputting the voice segment data stream into a preset voice enhancement network model to obtain enhanced voice corresponding to the voice segment data stream;
the voice enhancement network model performs end-to-end enhancement on the voice segment data stream to obtain enhanced voice, and the voice enhancement network model inputs the voice segment data stream containing preset type sound and outputs the voice segment data stream as the enhanced voice.
In the embodiment of the present invention, the speech enhancement network model is a multi-scale time domain speech enhancement model based on a full-gated convolutional network, and specifically includes an encoder module, an enhancement module, and a decoder module.
Encoder module: encodes the noisy waveform into an intermediate feature space. The input segment is converted into a high-dimensional feature representation by a one-dimensional convolutional neural network.
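Framing the waveform and projecting each frame onto a set of learned filters is exactly a one-dimensional convolution, so the encoder can be sketched as one matrix product per frame. The filter count (64), window (16 samples), and stride (8) below are illustrative assumptions, not values from the patent.

```python
import numpy as np

def encode(wave, basis, win=16, stride=8):
    """Frame the waveform and project every frame onto the filter bank
    `basis` (n_filters x win); this equals a 1-D convolution with kernel
    size `win` and the given stride, yielding one feature vector per frame."""
    n = (len(wave) - win) // stride + 1
    frames = np.stack([wave[i * stride : i * stride + win] for i in range(n)])
    return frames @ basis.T                      # (n_frames, n_filters)

rng = np.random.default_rng(0)
basis = rng.standard_normal((64, 16))            # 64 learned filters of length 16
feats = encode(rng.standard_normal(160), basis)  # high-dimensional representation
```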
Enhancement module: operates on the encoded high-dimensional feature representation through three stages: multi-scale feature extraction, temporal convolution blocks, and multi-scale feature fusion.
Multi-scale feature extraction: gated convolutions of different kernel sizes are applied in parallel to extract and fuse features. Concretely, one-dimensional gated convolution operations with different kernel sizes realize feature extraction at different scales; the outputs of the differently-sized kernels are then spliced together, normalized by layer normalization, and passed on.
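A minimal NumPy sketch of this stage: parallel gated convolutions (a linear branch modulated by a sigmoid gate) with kernel sizes 3, 5, and 7, spliced along the feature axis and layer-normalized. The averaging kernels and the kernel sizes are illustrative; in the actual model both branches use learned weights.

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def conv1d(x, k):
    """'Same'-padded 1-D convolution along time, applied per feature channel."""
    pad = len(k) // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    return np.stack([np.convolve(xp[:, c], k, mode="valid")
                     for c in range(x.shape[1])], axis=1)

def gated_conv(x, k_lin, k_gate):
    """Gated linear unit: the linear branch is modulated by a sigmoid gate."""
    return conv1d(x, k_lin) * sigmoid(conv1d(x, k_gate))

def multi_scale(x, kernel_sizes=(3, 5, 7)):
    """Run gated convolutions of several kernel sizes in parallel, splice the
    outputs, then layer-normalize each time frame across its features."""
    outs = [gated_conv(x, np.ones(k) / k, np.ones(k) / k) for k in kernel_sizes]
    y = np.concatenate(outs, axis=1)
    mu = y.mean(axis=1, keepdims=True)
    sd = y.std(axis=1, keepdims=True)
    return (y - mu) / (sd + 1e-5)

rng = np.random.default_rng(1)
enhanced = multi_scale(rng.standard_normal((20, 4)))  # 20 frames, 4 channels in
```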
Convolution blocks: composed of several temporal convolution blocks. Each block is a full convolutional network based on a time-domain convolutional network (TCN). Within each block the convolution operation is repeated R times while the dilation coefficient of the convolutional network keeps increasing, expanding the receptive field. By enlarging the receptive field, the network can capture long-term information.
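The growth of the receptive field under such a dilation schedule is easy to compute. Assuming the dilation doubles at each of the R repeats (1, 2, 4, ...), a stack of kernel-size-k dilated convolutions spans 1 + (k - 1)(2^R - 1) samples; the helper below makes the two configurations from the experiments concrete (R = 6 and R = 8, with k = 3 assumed, since the patent does not state the kernel size).

```python
def receptive_field(kernel, repeats):
    """Receptive field (in samples) of `repeats` stacked dilated convolutions
    whose dilation doubles each layer: 1, 2, 4, ..., 2**(repeats-1)."""
    rf = 1
    for r in range(repeats):
        rf += (kernel - 1) * (2 ** r)
    return rf

# With a kernel of 3, raising the dilation exponent from 6 to 8 roughly
# quadruples the span of long-term context each block can capture:
small = receptive_field(3, 6)   # 127 samples
large = receptive_field(3, 8)   # 511 samples
```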
Multi-scale feature fusion: convolutional layers at different depths output different kinds of features, such as low-level texture (shallow layers) and semantic cues (deep layers), and these features contribute differently to the final task. Accordingly, in the embodiment of the present invention the output of the last layer is not taken directly as the final output; instead, the output of every time-domain convolution block is extracted and fused into the model's final output. The output features of each block represent a different level of detail, so one connection is established for each block, and different pieces of information are transferred during training; this process is called feature transfer. Because the usefulness of information from other layers is not known in advance, a gating mechanism screens the useful information and controls the information flow. Specifically, the high-level features are transferred step by step to the shallow layers.
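One way to realize this gated, step-by-step top-down transfer is sketched below, under the assumptions that every block's output has the same shape and that the gate is a plain element-wise sigmoid of the deeper features (the patent does not specify the gate's exact form).

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def fuse_top_down(block_outputs):
    """Starting from the deepest block, gate the accumulated high-level
    features and add them into the next shallower block's output, so deep
    semantic cues flow step by step toward the shallow layers."""
    fused = block_outputs[-1]                    # deepest block's features
    for feat in reversed(block_outputs[:-1]):
        gate = sigmoid(fused)                    # screen the useful information
        fused = feat + gate * fused              # pass gated deep info down
    return fused

# Three equally-shaped block outputs, shallow (1.0) to deep (3.0):
outs = [np.full((5, 4), v) for v in (1.0, 2.0, 3.0)]
fused = fuse_top_down(outs)
```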
Decoder module: the inverse of the encoder module. It decodes the feature representation back into speech samples; specifically, the decoding is implemented with a one-dimensional transposed convolution.
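The decoder thus mirrors the encoder: each frame's feature vector is mapped back to a short waveform chunk through a filter bank, and overlapping chunks are summed, which is precisely what a one-dimensional transposed convolution computes. The shapes below (64 filters, 16-sample window, stride 8) are illustrative assumptions matching the earlier encoder sketch, not values from the patent.

```python
import numpy as np

def decode(feats, basis, stride=8):
    """Map each frame's features back to a waveform chunk via the filter
    bank `basis` (n_filters x win) and overlap-add the chunks: a plain
    NumPy stand-in for a 1-D transposed convolution."""
    n = feats.shape[0]
    win = basis.shape[1]
    out = np.zeros(stride * (n - 1) + win)
    for i, f in enumerate(feats):
        out[i * stride : i * stride + win] += f @ basis
    return out

rng = np.random.default_rng(0)
basis = rng.standard_normal((64, 16))
wave = decode(rng.standard_normal((19, 64)), basis)  # 19 frames -> 160 samples
```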
S105, synthesizing the enhanced voice into a voice segment.
The segmented speech data stream is processed by the speech enhancement network model to obtain enhanced speech, and the enhanced speech is synthesized into speech segments.
The speech enhancement method above constructs an efficient multi-scale time-domain speech enhancement model based on a fully-gated convolutional network and uses it to capture the temporal information of the speech signal. A gating mechanism is integrated into the model so that it can learn feature representations at different levels. Rather than selecting the output of the last layer as the final output, the final output is obtained by fusing feature maps of different depths; connections are established between layers of different depths so that information learned in the deep layers can be transferred to the shallow layers, and a further gating mechanism screens the useful information.
To verify the effectiveness of the speech enhancement method of the embodiment, a multi-scale time-domain speech enhancement model based on a fully-gated convolutional network is first constructed with the output of the last convolution block as the final output, 3 convolution blocks, and a convolutional dilation coefficient of 6. On this basis, multi-scale feature fusion and feature transfer are added step by step.
The experimental results show that the model can effectively enhance speech, and that gradually adding feature fusion and feature transfer further improves its performance. Compared with a time-domain convolution baseline, the final model of the embodiment gains 0.12 and 0.01 on PESQ (Perceptual Evaluation of Speech Quality) and STOI (Short-Time Objective Intelligibility), respectively. Furthermore, compared with the noisy speech, the model improves PESQ and STOI by 0.43 and 0.123, respectively.
The configuration with 4 convolution blocks and a convolutional dilation coefficient of 8 performs best: compared with the noisy speech, the optimal model of the embodiment improves PESQ and STOI by 0.54 and 0.125, respectively. The enhancement model not only enhances noisy speech effectively but also outperforms the other reference systems: its performance is superior to both frequency-domain systems and recurrent-neural-network systems. By expanding the receptive field, the multi-scale time-domain model can capture long-term dependencies, which yields a particularly significant improvement in STOI. This shows that, through end-to-end training, a multi-scale time-domain speech enhancement model based on a fully-gated convolutional network can enhance and estimate speech more accurately.
Through the technical scheme described above, the speech is processed into single-channel speech, the single-channel speech is segmented into a data stream containing the preset type of sound, and that data stream is fed into the preset speech enhancement network model. The influence of noise is thereby avoided, and because speech characteristics are taken into account, no distortion is introduced and the speech is not damaged. The resulting enhanced speech is synthesized into speech segments, enabling application in multiple scenarios.
With respect to the method embodiment, an embodiment of the present invention further provides an embodiment of a speech enhancement apparatus, as shown in fig. 2, the apparatus may include: a voice collecting module 210, a voice processing module 220, a voice segmenting module 230, a voice enhancing module 240, and a voice synthesizing module 250.
The voice acquisition module 210 is used for calling a voice acquisition device to collect speech in the current environment;
the voice processing module 220 is configured to process the voice according to a preset voice processing algorithm to obtain a single-channel voice;
the voice segmentation module 230 is configured to perform sentence segmentation on the single-channel voice to obtain a voice segment data stream containing a preset type of sound;
a speech enhancement module 240, configured to input the speech segment data stream into a preset speech enhancement network model, so as to obtain an enhanced speech corresponding to the speech segment data stream;
a speech synthesis module 250, configured to synthesize the enhanced speech into speech segments.
According to a specific embodiment provided by the present invention, the speech processing module 220 is specifically configured to: and carrying out A/D conversion on the voice, and sampling according to a preset sampling rate to obtain single-channel voice.
According to a specific embodiment provided by the present invention, the voice segmentation module 230 is specifically configured to: segmenting sentences of the voice in the single-channel voice within a preset threshold range; for any frame of voice in the single-channel voice within a preset threshold range, detecting whether preset type voice is contained or not by utilizing a pre-established neural network model; if the frame voice contains the preset type of sound, the frame voice is reserved; and combining all the voice frames containing the preset type of voice to obtain the voice segment data stream containing the preset type of voice.
According to a specific embodiment provided by the present invention, the apparatus further comprises:
and a voice filtering module 260, configured to filter the frame of voice if the frame of voice does not contain a preset type of sound.
Fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present invention. The electronic device 300 shown in fig. 3 includes: at least one processor 301, a memory 302, at least one network interface 304, and other user interfaces 303. The various components in the electronic device 300 are coupled together by a bus system 305, which is used to enable communication among the connected components. In addition to a data bus, the bus system 305 includes a power bus, a control bus, and a status signal bus; for clarity of illustration, however, the various buses are all labeled as bus system 305 in fig. 3.
The user interface 303 may include, among other things, a display, a keyboard, or a pointing device (e.g., a mouse, trackball, touch pad, or touch screen).
It will be appreciated that the memory 302 in embodiments of the invention may be volatile memory or non-volatile memory, or may include both volatile and non-volatile memory. The non-volatile memory may be read-only memory (ROM), programmable ROM (PROM), erasable programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. The volatile memory may be random access memory (RAM), which serves as an external cache. By way of example and not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double-data-rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct Rambus RAM (DRRAM). The memory 302 described herein is intended to comprise, without being limited to, these and any other suitable types of memory.
In some embodiments, memory 302 stores the following elements, executable units or data structures, or a subset thereof, or an expanded set thereof: an operating system 3021 and application programs 3022.
The operating system 3021 includes various system programs, such as a framework layer, a core library layer, a driver layer, and the like, and is used for implementing various basic services and processing hardware-based tasks. The application programs 3022 include various application programs such as a media player (MediaPlayer), a Browser (Browser), and the like, for implementing various application services. A program implementing the method of an embodiment of the present invention may be included in the application program 3022.
In the embodiment of the present invention, by calling a program or an instruction stored in the memory 302, specifically, a program or an instruction stored in the application 3022, the processor 301 is configured to execute the method steps provided by the method embodiments, for example, including:
calling voice acquisition equipment to acquire voice in the current environment; processing the voice according to a preset voice processing algorithm to obtain single-channel voice; performing sentence segmentation on the single-channel voice to obtain a voice segmented data stream containing preset type sounds; inputting the voice segment data stream into a preset voice enhancement network model to obtain enhanced voice corresponding to the voice segment data stream; synthesizing the enhanced speech into speech segments.
The method disclosed in the above embodiments of the present invention may be applied to, or implemented by, the processor 301. The processor 301 may be an integrated circuit chip with signal processing capabilities. In implementation, the steps of the above method may be performed by integrated hardware logic circuits or by software instructions in the processor 301. The processor 301 may be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The various methods, steps, and logic blocks disclosed in the embodiments of the present invention may be implemented or performed by it. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The steps of the method disclosed in connection with the embodiments of the present invention may be implemented directly by a hardware decoding processor, or by a combination of hardware and software elements in the decoding processor. The software elements may be located in RAM, flash memory, ROM, PROM, EPROM, or registers, among other storage media well known in the art. The storage medium is located in the memory 302, and the processor 301 reads the information in the memory 302 and completes the steps of the method in combination with its hardware.
It is to be understood that the embodiments described herein may be implemented in hardware, software, firmware, middleware, microcode, or any combination thereof. For a hardware implementation, the processing units may be implemented within one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), general purpose processors, controllers, micro-controllers, microprocessors, other electronic units configured to perform the functions described herein, or a combination thereof.
For a software implementation, the techniques described herein may be implemented by means of units performing the functions described herein. The software codes may be stored in a memory and executed by a processor. The memory may be implemented within the processor or external to the processor.
The electronic device provided in this embodiment may be the electronic device shown in fig. 3, and may perform all the steps of the speech enhancement method shown in fig. 1, so as to achieve the technical effect of the speech enhancement method shown in fig. 1.
An embodiment of the present invention further provides a storage medium (a computer-readable storage medium). The storage medium stores one or more programs. The storage medium may include volatile memory, such as random access memory; it may also include non-volatile memory, such as read-only memory, flash memory, a hard disk, or a solid-state drive; or it may comprise a combination of the above kinds of memory.
The one or more programs in the storage medium are executable by one or more processors to implement the speech enhancement method described above as performed on the speech enhancement device side.
The processor is configured to execute the speech enhancement program stored in the memory to implement the following steps of the speech enhancement method performed on the speech enhancement device side:
calling voice acquisition equipment to acquire voice in the current environment; processing the voice according to a preset voice processing algorithm to obtain single-channel voice; performing sentence segmentation on the single-channel voice to obtain a voice segmented data stream containing preset type sounds; inputting the voice segment data stream into a preset voice enhancement network model to obtain enhanced voice corresponding to the voice segment data stream; synthesizing the enhanced speech into speech segments.
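The device-side steps above can be illustrated with a minimal end-to-end sketch. This is not the disclosed implementation: the function names are hypothetical, the energy threshold stands in for the pre-established detection network, and the identity `enhance` stands in for the preset speech enhancement network model.

```python
import numpy as np

def to_single_channel(audio: np.ndarray) -> np.ndarray:
    """Down-mix a (samples, channels) signal to mono (stand-in for the
    preset speech processing algorithm)."""
    return audio.mean(axis=1) if audio.ndim == 2 else audio

def segment_frames(mono: np.ndarray, frame_len: int = 160) -> list:
    """Split the mono speech into fixed-length frames."""
    n = len(mono) // frame_len
    return [mono[i * frame_len:(i + 1) * frame_len] for i in range(n)]

def contains_target_sound(frame: np.ndarray, threshold: float = 0.01) -> bool:
    """Energy-based stand-in for the pre-established detection network."""
    return float(np.mean(frame ** 2)) > threshold

def enhance(frames: list) -> list:
    """Placeholder for the preset speech enhancement network model."""
    return [f.copy() for f in frames]  # identity: the real model denoises here

def speech_enhancement_pipeline(audio: np.ndarray) -> np.ndarray:
    """Acquire -> mono -> segment -> filter -> enhance -> synthesize."""
    mono = to_single_channel(audio)
    kept = [f for f in segment_frames(mono) if contains_target_sound(f)]
    return np.concatenate(enhance(kept)) if kept else np.zeros(0)
```

In this toy pipeline, silent frames are dropped before enhancement, so the output length equals the number of retained frames times the frame length.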
Those skilled in the art will further appreciate that the various illustrative components and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or a combination of both. To clearly illustrate this interchangeability of hardware and software, the various illustrative components and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and the design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as a departure from the scope of the present invention.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), Read-Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are merely exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (10)

1. A method of speech enhancement, the method comprising:
calling voice acquisition equipment to acquire voice in the current environment;
processing the voice according to a preset voice processing algorithm to obtain single-channel voice;
performing sentence segmentation on the single-channel voice to obtain a voice segmented data stream containing preset type sounds;
inputting the voice segment data stream into a preset voice enhancement network model to obtain enhanced voice corresponding to the voice segment data stream, wherein the voice enhancement network model is a multi-scale time-domain voice enhancement model based on a fully gated convolutional network and comprises an encoder module, an enhancement module, and a decoder module; the encoder module encodes the noisy waveform into an intermediate feature space, converting the input segment into a high-dimensional feature representation through a one-dimensional convolutional neural network; the enhancement module operates on the encoded high-dimensional feature representation and comprises three operations: multi-scale feature extraction, convolution blocks, and multi-scale feature fusion; the decoder module, as the inverse of the encoder module, decodes the feature representation back into voice samples; in the multi-scale feature extraction, features are extracted and fused using gated convolution operations of different kernel sizes; the convolution blocks comprise a plurality of blocks, each adopting a fully convolutional network based on a temporal convolutional network; within each block, the convolution operation is repeated R times while the dilation coefficient of the convolutional network is progressively increased, expanding the receptive field so that the network can capture long-range information; in the multi-scale feature fusion, convolutional neural networks at different levels output different types of features, and the output of each temporal convolution block is extracted and fused into the final output of the voice enhancement network model;
synthesizing the enhanced speech into speech segments.
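The gated, dilated convolutions recited in claim 1 can be illustrated with a small numerical sketch. This is an assumption-laden toy (single channel, shared fixed kernels, plain NumPy loops), not the patented multi-scale model; only the gated activation `tanh(filter) * sigmoid(gate)` and the doubling dilation coefficient, which repeats the convolution R times while expanding the receptive field, follow the claim's description.

```python
import numpy as np

def dilated_conv1d(x, w, dilation):
    """Causal 1-D convolution of signal x with kernel w at the given dilation."""
    k, pad = len(w), (len(w) - 1) * dilation
    xp = np.concatenate([np.zeros(pad), x])  # left-pad so output is causal
    return np.array([sum(w[j] * xp[i + pad - j * dilation] for j in range(k))
                     for i in range(len(x))])

def gated_conv1d(x, w_f, w_g, dilation):
    """Gated activation unit: tanh(filter branch) * sigmoid(gate branch)."""
    f = np.tanh(dilated_conv1d(x, w_f, dilation))
    g = 1.0 / (1.0 + np.exp(-dilated_conv1d(x, w_g, dilation)))
    return f * g

def conv_block(x, w_f, w_g, repeats=3):
    """Repeat the gated convolution R times, doubling the dilation (1, 2, 4, ...)
    to expand the receptive field; a residual connection preserves the input."""
    out = x
    for r in range(repeats):
        out = out + gated_conv1d(out, w_f, w_g, dilation=2 ** r)
    return out
```

With a kernel of size 3 and dilations 1, 2, 4, each output sample depends on a receptive field of 15 past samples, which is how stacked dilated blocks capture long-range information cheaply.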
2. The method of claim 1, wherein processing the speech according to a preset speech processing algorithm to obtain single-channel speech comprises:
performing A/D conversion on the voice and sampling at a preset sampling rate to obtain the single-channel voice.
3. The method of claim 1, wherein the sentence-segmentation of the single-channel speech to obtain a speech segment data stream containing a preset type of sound comprises:
performing sentence segmentation on the voice in the single-channel voice within a preset threshold range;
for any frame of voice within the preset threshold range in the single-channel voice, detecting whether it contains the preset type of sound by using a pre-established neural network model;
if the frame of voice contains the preset type of sound, retaining the frame of voice;
and combining all voice frames containing the preset type of sound to obtain the voice segment data stream containing the preset type of sound.
4. The method of claim 3, further comprising:
if the frame of voice does not contain the preset type of sound, filtering out the frame of voice.
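The keep/filter logic of claims 3 and 4 amounts to a per-frame predicate over the segmented stream. A minimal sketch follows, where `energy_detector` is a hypothetical stand-in for the pre-established neural network model; any callable returning a boolean per frame could be plugged in.

```python
import numpy as np

def energy_detector(frame: np.ndarray, threshold: float = 0.01) -> bool:
    """Hypothetical stand-in for the pre-established detection network."""
    return float(np.mean(np.square(frame))) > threshold

def filter_frames(frames, detector=energy_detector) -> np.ndarray:
    """Retain frames the detector flags as containing the preset sound
    (claim 3), filter out the rest (claim 4), and combine the survivors."""
    kept = [f for f in frames if detector(f)]
    return np.concatenate(kept) if kept else np.zeros(0)
```

Keeping the detector as a parameter mirrors the separation in the claims between the segmentation step and the pre-established detection model.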
5. A speech enhancement apparatus, characterized in that the apparatus comprises:
the voice acquisition module is used for calling voice acquisition equipment and acquiring voice in the current environment;
the voice processing module is used for processing the voice according to a preset voice processing algorithm to obtain single-channel voice;
the voice segmentation module is used for segmenting the single-channel voice to obtain a voice segmentation data stream containing preset type sounds;
the speech enhancement module is used for inputting the speech segment data stream into a preset speech enhancement network model to obtain enhanced speech corresponding to the speech segment data stream, wherein the speech enhancement network model is a multi-scale time domain speech enhancement model based on a full-gated convolutional network, and may include an encoder module, an enhancement module and a decoder module; the encoder module encodes the noise waveform into an intermediate feature space, and the input section is converted into a high-dimensional feature representation form by a one-dimensional convolutional neural network; the enhancement module operates on the encoded high-dimensional feature representation, and comprises three operation processes: multi-scale feature extraction, convolution block and multi-scale feature fusion, a decoder module, an inverse process of the encoding module, decoding the feature representation into a speech sample, the multi-scale feature extraction, extracting and fusing features using gated convolution operations of different sizes, the convolution block: the method comprises the following steps that the method comprises a plurality of convolution blocks, and in each block, a full convolution network based on a time domain convolution network is adopted; in each block, convolution operation is repeated for R times, meanwhile, the expansion coefficient of the convolution network is continuously improved, the receptive field is expanded, and the network can capture long-term information by expanding the receptive field; multi-scale feature fusion, wherein different levels of convolutional neural networks output different types of features, the output of each time domain convolutional block is extracted and fused into the final output of a voice enhancement network model;
and the voice synthesis module is used for synthesizing the enhanced voice into a voice section.
6. The apparatus of claim 5, wherein the speech processing module is specifically configured to:
performing A/D conversion on the voice and sampling at a preset sampling rate to obtain the single-channel voice.
7. The apparatus of claim 5, wherein the speech segmentation module is specifically configured to:
performing sentence segmentation on the voice in the single-channel voice within a preset threshold range;
for any frame of voice within the preset threshold range in the single-channel voice, detecting whether it contains the preset type of sound by using a pre-established neural network model;
if the frame of voice contains the preset type of sound, retaining the frame of voice;
and combining all voice frames containing the preset type of sound to obtain the voice segment data stream containing the preset type of sound.
8. The apparatus of claim 7, further comprising:
the voice filtering module is used for filtering out the frame of voice if the frame of voice does not contain the preset type of sound.
9. An electronic device, comprising: a processor and a memory, the processor being configured to execute a speech enhancement program stored in the memory to implement the speech enhancement method of any of claims 1-4.
10. A storage medium storing one or more programs executable by one or more processors to implement the speech enhancement method of any one of claims 1-4.
CN201910663257.8A 2019-07-22 2019-07-22 Voice enhancement method and device, storage medium and electronic equipment Active CN110534123B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910663257.8A CN110534123B (en) 2019-07-22 2019-07-22 Voice enhancement method and device, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910663257.8A CN110534123B (en) 2019-07-22 2019-07-22 Voice enhancement method and device, storage medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN110534123A CN110534123A (en) 2019-12-03
CN110534123B true CN110534123B (en) 2022-04-01

Family

ID=68660741

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910663257.8A Active CN110534123B (en) 2019-07-22 2019-07-22 Voice enhancement method and device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN110534123B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111312224B (en) * 2020-02-20 2023-04-21 北京声智科技有限公司 Training method and device of voice segmentation model and electronic equipment
CN112509593B (en) * 2020-11-17 2024-03-08 北京清微智能科技有限公司 Speech enhancement network model, single-channel speech enhancement method and system
CN113571074B (en) * 2021-08-09 2023-07-25 四川启睿克科技有限公司 Voice enhancement method and device based on multi-band structure time domain audio frequency separation network

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7392312B1 (en) * 1998-09-11 2008-06-24 Lv Partners, L.P. Method for utilizing visual cue in conjunction with web access
CN102124518A (en) * 2008-08-05 2011-07-13 弗朗霍夫应用科学研究促进协会 Apparatus and method for processing an audio signal for speech enhancement using a feature extraction
CN103794221A (en) * 2012-10-26 2014-05-14 索尼公司 Signal processing device and method, and program
CN106157953A (en) * 2015-04-16 2016-11-23 科大讯飞股份有限公司 continuous speech recognition method and system
CN107845389A (en) * 2017-12-21 2018-03-27 北京工业大学 A kind of sound enhancement method based on multiresolution sense of hearing cepstrum coefficient and depth convolutional neural networks
CN108172238A (en) * 2018-01-06 2018-06-15 广州音书科技有限公司 A kind of voice enhancement algorithm based on multiple convolutional neural networks in speech recognition system
CN108564963A (en) * 2018-04-23 2018-09-21 百度在线网络技术(北京)有限公司 Method and apparatus for enhancing voice
CN108877823A (en) * 2018-07-27 2018-11-23 三星电子(中国)研发中心 Sound enhancement method and device
CN109326299A (en) * 2018-11-14 2019-02-12 平安科技(深圳)有限公司 Sound enhancement method, device and storage medium based on full convolutional neural networks
CN109410974A (en) * 2018-10-23 2019-03-01 百度在线网络技术(北京)有限公司 Sound enhancement method, device, equipment and storage medium
CN109841226A (en) * 2018-08-31 2019-06-04 大象声科(深圳)科技有限公司 A kind of single channel real-time noise-reducing method based on convolution recurrent neural network
CN110010144A (en) * 2019-04-24 2019-07-12 厦门亿联网络技术股份有限公司 Voice signals enhancement method and device
CN110503940A (en) * 2019-07-12 2019-11-26 中国科学院自动化研究所 Sound enhancement method, device, storage medium, electronic equipment

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9881631B2 (en) * 2014-10-21 2018-01-30 Mitsubishi Electric Research Laboratories, Inc. Method for enhancing audio signal using phase information
US11373672B2 (en) * 2016-06-14 2022-06-28 The Trustees Of Columbia University In The City Of New York Systems and methods for speech separation and neural decoding of attentional selection in multi-speaker environments
CN106898350A (en) * 2017-01-16 2017-06-27 华南理工大学 A kind of interaction of intelligent industrial robot voice and control method based on deep learning
US10726858B2 (en) * 2018-06-22 2020-07-28 Intel Corporation Neural network for speech denoising trained with deep feature losses

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Convolutional Neural Turing Machine for Speech Separation; Jen-Tzung Chien et al.; ISCSLP 2018; 20181231; pp. 81-85 *
Single-channel speech enhancement using a deep fully convolutional encoder-decoder network; Shi Wenhua et al.; Journal of Signal Processing; 20190430; pp. 621-640 *

Also Published As

Publication number Publication date
CN110534123A (en) 2019-12-03

Similar Documents

Publication Publication Date Title
CN110503940B (en) Voice enhancement method and device, storage medium and electronic equipment
JP6903129B2 (en) Whispering conversion methods, devices, devices and readable storage media
Pascual et al. SEGAN: Speech enhancement generative adversarial network
JP6198872B2 (en) Detection of speech syllable / vowel / phoneme boundaries using auditory attention cues
CN110534123B (en) Voice enhancement method and device, storage medium and electronic equipment
Bhat et al. A real-time convolutional neural network based speech enhancement for hearing impaired listeners using smartphone
Adeel et al. Lip-reading driven deep learning approach for speech enhancement
CN106486130B (en) Noise elimination and voice recognition method and device
Hsieh et al. Improving perceptual quality by phone-fortified perceptual loss using wasserstein distance for speech enhancement
Shah et al. Time-frequency mask-based speech enhancement using convolutional generative adversarial network
Hsieh et al. Improving perceptual quality by phone-fortified perceptual loss for speech enhancement
CN111667834B (en) Hearing-aid equipment and hearing-aid method
JP2020071482A (en) Word sound separation method, word sound separation model training method and computer readable medium
CN114360561A (en) Voice enhancement method based on deep neural network technology
Abdulatif et al. Investigating cross-domain losses for speech enhancement
CN117542373A (en) Non-air conduction voice recovery system and method
WO2019216187A1 (en) Pitch enhancement device, and method and program therefor
CN114724589A (en) Voice quality inspection method and device, electronic equipment and storage medium
CN112002307B (en) Voice recognition method and device
Dua et al. Noise robust automatic speech recognition: review and analysis
Hu et al. Learnable spectral dimension compression mapping for full-band speech enhancement
CN111048065B (en) Text error correction data generation method and related device
Chit et al. Myanmar continuous speech recognition system using fuzzy logic classification in speech segmentation
JP6794064B2 (en) Model learning device, speech interval detector, their methods and programs
WO2019216192A1 (en) Pitch enhancement device, method and program therefor

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant