CN110534123B - Voice enhancement method and device, storage medium and electronic equipment - Google Patents


Info

Publication number
CN110534123B
CN110534123B (application CN201910663257.8A)
Authority
CN
China
Prior art keywords
voice
speech
enhancement
module
preset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910663257.8A
Other languages
Chinese (zh)
Other versions
CN110534123A (en)
Inventor
李晨星
许家铭
徐波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science
Priority to CN201910663257.8A
Publication of CN110534123A
Application granted
Publication of CN110534123B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/04 Segmentation; Word boundary detection
    • G10L15/20 Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Quality & Reliability (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

Embodiments of the invention relate to a speech enhancement method and apparatus, a storage medium, and an electronic device. The method comprises the following steps: calling a voice acquisition device to collect speech in the current environment; processing the speech according to a preset speech processing algorithm to obtain single-channel speech; performing sentence segmentation on the single-channel speech to obtain a segmented speech data stream containing a preset type of sound; inputting the segmented speech data stream into a preset speech enhancement network model to obtain the corresponding enhanced speech; and synthesizing the enhanced speech into speech segments. The method can therefore be applied in multiple scenarios while avoiding the influence of noise; because it takes speech characteristics into account, it avoids introducing distortion and damaging the speech.

Description

Voice enhancement method and device, storage medium and electronic equipment
Technical Field
The embodiment of the invention relates to the technical field of automatic processing of computer information, in particular to a voice enhancement method, a voice enhancement device, a storage medium and electronic equipment.
Background
Speech, the material shell of language, is its external form and the most direct symbolic record of human mental activity; it is also one of the most natural and effective means for users to exchange information. When a speech signal is captured, it is inevitably interfered with by environmental noise, room reverberation, and other speakers. This seriously degrades speech quality and, in turn, the performance of speech recognition, which is what motivated speech enhancement. As a preprocessing step, speech enhancement is an effective way to suppress interference and improve the far-field speech recognition rate.
Speech enhancement is the technique of extracting the useful speech signal from a noise background, suppressing and reducing noise interference when the speech signal is disturbed or even submerged by various noises. In short, it extracts the original speech from noisy speech as cleanly as possible.
In the related art, traditional speech enhancement methods mainly include spectral subtraction, Wiener filtering, and short-time spectral amplitude enhancement based on the minimum mean square error. Although these methods are fast and do not require large-scale training corpora, they depend heavily on noise estimation, apply to only a few scenarios, and cannot take speech characteristics into account, so they inevitably introduce distortion and damage the speech.
Disclosure of Invention
In view of the above, to solve the technical problems or some technical problems, embodiments of the present invention provide a voice enhancement method, apparatus, storage medium, and electronic device.
In a first aspect, an embodiment of the present invention provides a speech enhancement method, where the method includes:
calling voice acquisition equipment to acquire voice in the current environment;
processing the voice according to a preset voice processing algorithm to obtain single-channel voice;
performing sentence segmentation on the single-channel voice to obtain a voice segmented data stream containing preset type sounds;
inputting the voice segment data stream into a preset voice enhancement network model to obtain enhanced voice corresponding to the voice segment data stream;
synthesizing the enhanced speech into speech segments.
In a possible embodiment, the processing the speech according to a preset speech processing algorithm to obtain a single-channel speech includes:
and carrying out A/D conversion on the voice, and sampling according to a preset sampling rate to obtain single-channel voice.
In a possible embodiment, the sentence-segmentation on the single-channel speech to obtain a speech segmented data stream containing a preset type of sound includes:
segmenting sentences of the voice in the single-channel voice within a preset threshold range;
for any frame of voice in the single-channel voice within a preset threshold range, detecting whether preset type voice is contained or not by utilizing a pre-established neural network model;
if the frame voice contains the preset type of sound, the frame voice is reserved;
and combining all the voice frames containing the preset type of voice to obtain the voice segment data stream containing the preset type of voice.
In one possible embodiment, the method further comprises:
and if the frame voice does not contain the preset type of sound, filtering the frame voice.
In a second aspect, an embodiment of the present invention provides a speech enhancement apparatus, where the apparatus includes:
the voice acquisition module is used for calling voice acquisition equipment and acquiring voice in the current environment;
the voice processing module is used for processing the voice according to a preset voice processing algorithm to obtain single-channel voice;
the voice segmentation module is used for segmenting the single-channel voice to obtain a voice segmentation data stream containing preset type sounds;
the voice enhancement module is used for inputting the voice segment data stream into a preset voice enhancement network model to obtain enhanced voice corresponding to the voice segment data stream;
and the voice synthesis module is used for synthesizing the enhanced voice into a voice section.
In one possible implementation, the speech processing module is specifically configured to:
and carrying out A/D conversion on the voice, and sampling according to a preset sampling rate to obtain single-channel voice.
In a possible implementation manner, the speech segmentation module is specifically configured to:
segmenting sentences of the voice in the single-channel voice within a preset threshold range;
for any frame of voice in the single-channel voice within a preset threshold range, detecting whether preset type voice is contained or not by utilizing a pre-established neural network model;
if the frame voice contains the preset type of sound, the frame voice is reserved;
and combining all the voice frames containing the preset type of voice to obtain the voice segment data stream containing the preset type of voice.
In one possible embodiment, the apparatus further comprises:
and the voice filtering module is used for filtering the frame voice if the frame voice does not contain the preset type of sound.
In a third aspect, an embodiment of the present invention provides a storage medium, where one or more programs are stored, and the one or more programs are executable by one or more processors to implement the foregoing speech enhancement method.
In a fourth aspect, an embodiment of the present invention provides an electronic device, including: a processor and a memory, the processor being configured to execute a speech enhancement program stored in the memory to implement the aforementioned speech enhancement method.
According to the technical scheme provided by the embodiment of the present invention, the speech is processed to obtain single-channel speech, the single-channel speech is segmented into a data stream containing the preset type of sound, and the data stream is input into the preset speech enhancement network model. This avoids the influence of noise and, by taking speech characteristics into account, avoids introducing distortion and damaging the speech. The resulting enhanced speech is synthesized into speech segments, enabling multi-scene application.
Drawings
In order to more clearly illustrate the embodiments of the present specification or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description show only some of the embodiments described in this specification; other drawings can be derived from them by those skilled in the art.
FIG. 1 is a flow chart illustrating an implementation of a speech enhancement method according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a speech enhancement apparatus according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
For the convenience of understanding of the embodiments of the present invention, the following description will be further explained with reference to specific embodiments, which are not to be construed as limiting the embodiments of the present invention.
As shown in fig. 1, an implementation flow diagram of a speech enhancement method provided in an embodiment of the present invention is shown, where the method specifically includes the following steps:
and S101, calling voice acquisition equipment to acquire the voice in the current environment.
In the embodiment of the present invention, the current environment may be a far-field noisy acoustic environment, which is not limited by the embodiment of the present invention.
In the current environment, a voice acquisition device such as a microphone is called to collect speech. The collected speech carries both the original voice of the target user and the noise of the current environment; that noise may be the voices of other users, music, impact sounds, and so on in the current environment. Relative to the original voice of the target user, all other sounds can be regarded as noise, which the embodiment of the present invention does not limit.
And S102, processing the voice according to a preset voice processing algorithm to obtain single-channel voice.
The speech collected in step S101 is processed according to a preset speech processing algorithm to obtain single-channel speech. An optional implementation of this processing is as follows:
and carrying out A/D conversion on the voice, and sampling according to a preset sampling rate to obtain single-channel voice. In this case, a/D refers to a circuit that converts an analog signal into a digital signal and is called an analog-to-digital converter.
For example, a microphone is called to collect speech in the current environment; the speech is A/D converted and sampled at a rate of 16,000 Hz, yielding single-channel speech with a 16 kHz sampling rate.
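This kind of preprocessing can be sketched in a few lines of NumPy. The function below is an illustrative stand-in for the patent's unspecified preprocessing algorithm: it averages channels to mono and resamples by linear interpolation (a production system would typically use a polyphase or band-limited resampler); the function name and the 48 kHz input rate are assumptions.

```python
import numpy as np

def to_mono_16k(x, sr, target_sr=16000):
    """Average channels to mono, then resample to `target_sr` by linear
    interpolation (a simple stand-in for proper band-limited resampling)."""
    if x.ndim == 2:
        x = x.mean(axis=1)                       # (samples, channels) -> mono
    n_out = int(round(len(x) * target_sr / sr))  # output sample count
    t_in = np.linspace(0.0, 1.0, num=len(x), endpoint=False)
    t_out = np.linspace(0.0, 1.0, num=n_out, endpoint=False)
    return np.interp(t_out, t_in, x)

stereo = np.random.randn(48000, 2)               # 1 s of 48 kHz stereo input
mono = to_mono_16k(stereo, sr=48000)             # 1 s of 16 kHz mono output
```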
S103, performing sentence segmentation on the single-channel voice to obtain a voice segmented data stream containing preset type sounds;
A neural network model is pre-trained to detect whether each frame of speech contains the preset type of sound, where the preset type of sound refers to the original voice of the target user.
The speech in the single-channel stream is segmented into sentences within a preset threshold range, and for any frame of that speech the pre-trained neural network model detects whether it contains the preset type of sound.
If a frame of speech contains the preset type of sound, the frame is kept; if it does not, the frame is filtered out. In this way, the pre-trained neural network model filters out all frames other than the target user's original voice and leaves only the frames containing the preset type of sound.
All the retained frames containing the preset type of sound are combined to obtain the segmented speech data stream containing the preset type of sound.
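The keep-or-filter loop of S103 can be sketched as follows. The 256-sample frame size and the energy-threshold detector are placeholders: the patent does not fix a frame length, and its detector is a pre-trained neural network rather than an energy gate.

```python
import numpy as np

def segment_speech(wave, is_speech, frame=256):
    """Split `wave` into fixed-size frames, keep only the frames the detector
    flags as containing the preset type of sound, and concatenate the kept
    frames into a single segmented data stream."""
    n = len(wave) // frame
    frames = wave[:n * frame].reshape(n, frame)
    kept = [f for f in frames if is_speech(f)]   # filter out non-speech frames
    return np.concatenate(kept) if kept else np.empty(0)

# Toy energy detector standing in for the pre-trained neural network model:
energy_vad = lambda f: float(np.mean(f ** 2)) > 0.01

wave = np.concatenate([np.ones(256), np.zeros(256), 0.5 * np.ones(256)])
stream = segment_speech(wave, energy_vad)        # middle (silent) frame dropped
```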
S104, inputting the voice segment data stream into a preset voice enhancement network model to obtain enhanced voice corresponding to the voice segment data stream;
the voice enhancement network model performs end-to-end enhancement on the voice segment data stream to obtain enhanced voice, and the voice enhancement network model inputs the voice segment data stream containing preset type sound and outputs the voice segment data stream as the enhanced voice.
In the embodiment of the present invention, the speech enhancement network model is a multi-scale time domain speech enhancement model based on a full-gated convolutional network, and specifically includes an encoder module, an enhancement module, and a decoder module.
Encoder module: encodes the noisy waveform into an intermediate feature space. The input segment is converted into a high-dimensional feature representation by a one-dimensional convolutional neural network.
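Framing the waveform and projecting each frame onto a set of learned filters is exactly a one-dimensional convolution, so the encoder can be sketched as one matrix product per frame. The filter count (64), window (16 samples), and stride (8) below are illustrative assumptions, not values from the patent.

```python
import numpy as np

def encode(wave, basis, win=16, stride=8):
    """Frame the waveform and project every frame onto the filter bank
    `basis` (n_filters x win); this equals a 1-D convolution with kernel
    size `win` and the given stride, yielding one feature vector per frame."""
    n = (len(wave) - win) // stride + 1
    frames = np.stack([wave[i * stride : i * stride + win] for i in range(n)])
    return frames @ basis.T                      # (n_frames, n_filters)

rng = np.random.default_rng(0)
basis = rng.standard_normal((64, 16))            # 64 learned filters of length 16
feats = encode(rng.standard_normal(160), basis)  # high-dimensional representation
```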
Enhancement module: operates on the encoded high-dimensional feature representation through three stages: multi-scale feature extraction, temporal convolution blocks, and multi-scale feature fusion.
Multi-scale feature extraction: gated convolutions of different kernel sizes are applied in parallel to extract and fuse features. Concretely, one-dimensional gated convolution operations with different kernel sizes realize feature extraction at different scales; the outputs of the differently-sized kernels are then spliced together, normalized by layer normalization, and passed on.
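A minimal NumPy sketch of this stage: parallel gated convolutions (a linear branch modulated by a sigmoid gate) with kernel sizes 3, 5, and 7, spliced along the feature axis and layer-normalized. The averaging kernels and the kernel sizes are illustrative; in the actual model both branches use learned weights.

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def conv1d(x, k):
    """'Same'-padded 1-D convolution along time, applied per feature channel."""
    pad = len(k) // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    return np.stack([np.convolve(xp[:, c], k, mode="valid")
                     for c in range(x.shape[1])], axis=1)

def gated_conv(x, k_lin, k_gate):
    """Gated linear unit: the linear branch is modulated by a sigmoid gate."""
    return conv1d(x, k_lin) * sigmoid(conv1d(x, k_gate))

def multi_scale(x, kernel_sizes=(3, 5, 7)):
    """Run gated convolutions of several kernel sizes in parallel, splice the
    outputs, then layer-normalize each time frame across its features."""
    outs = [gated_conv(x, np.ones(k) / k, np.ones(k) / k) for k in kernel_sizes]
    y = np.concatenate(outs, axis=1)
    mu = y.mean(axis=1, keepdims=True)
    sd = y.std(axis=1, keepdims=True)
    return (y - mu) / (sd + 1e-5)

rng = np.random.default_rng(1)
enhanced = multi_scale(rng.standard_normal((20, 4)))  # 20 frames, 4 channels in
```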
Convolution blocks: composed of several temporal convolution blocks. Each block is a full convolutional network based on a time-domain convolutional network (TCN). Within each block the convolution operation is repeated R times while the dilation coefficient of the convolutional network keeps increasing, expanding the receptive field. By enlarging the receptive field, the network can capture long-term information.
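The growth of the receptive field under such a dilation schedule is easy to compute. Assuming the dilation doubles at each of the R repeats (1, 2, 4, ...), a stack of kernel-size-k dilated convolutions spans 1 + (k - 1)(2^R - 1) samples; the helper below makes the two configurations from the experiments concrete (R = 6 and R = 8, with k = 3 assumed, since the patent does not state the kernel size).

```python
def receptive_field(kernel, repeats):
    """Receptive field (in samples) of `repeats` stacked dilated convolutions
    whose dilation doubles each layer: 1, 2, 4, ..., 2**(repeats-1)."""
    rf = 1
    for r in range(repeats):
        rf += (kernel - 1) * (2 ** r)
    return rf

# With a kernel of 3, raising the dilation exponent from 6 to 8 roughly
# quadruples the span of long-term context each block can capture:
small = receptive_field(3, 6)   # 127 samples
large = receptive_field(3, 8)   # 511 samples
```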
Multi-scale feature fusion: convolutional layers at different depths output different kinds of features, such as low-level texture (shallow layers) and semantic cues (deep layers), and these features contribute differently to the final task. Accordingly, in the embodiment of the present invention the output of the last layer is not taken directly as the final output; instead, the output of every time-domain convolution block is extracted and fused into the model's final output. The output features of each block represent a different level of detail, so one connection is established for each block, and different pieces of information are transferred during training; this process is called feature transfer. Because the usefulness of information from other layers is not known in advance, a gating mechanism screens the useful information and controls the information flow. Specifically, the high-level features are transferred step by step to the shallow layers.
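One way to realize this gated, step-by-step top-down transfer is sketched below, under the assumptions that every block's output has the same shape and that the gate is a plain element-wise sigmoid of the deeper features (the patent does not specify the gate's exact form).

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def fuse_top_down(block_outputs):
    """Starting from the deepest block, gate the accumulated high-level
    features and add them into the next shallower block's output, so deep
    semantic cues flow step by step toward the shallow layers."""
    fused = block_outputs[-1]                    # deepest block's features
    for feat in reversed(block_outputs[:-1]):
        gate = sigmoid(fused)                    # screen the useful information
        fused = feat + gate * fused              # pass gated deep info down
    return fused

# Three equally-shaped block outputs, shallow (1.0) to deep (3.0):
outs = [np.full((5, 4), v) for v in (1.0, 2.0, 3.0)]
fused = fuse_top_down(outs)
```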
Decoder module: the inverse of the encoder module. It decodes the feature representation back into speech samples; specifically, the decoding is implemented with a one-dimensional transposed convolution.
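The decoder thus mirrors the encoder: each frame's feature vector is mapped back to a short waveform chunk through a filter bank, and overlapping chunks are summed, which is precisely what a one-dimensional transposed convolution computes. The shapes below (64 filters, 16-sample window, stride 8) are illustrative assumptions matching the earlier encoder sketch, not values from the patent.

```python
import numpy as np

def decode(feats, basis, stride=8):
    """Map each frame's features back to a waveform chunk via the filter
    bank `basis` (n_filters x win) and overlap-add the chunks: a plain
    NumPy stand-in for a 1-D transposed convolution."""
    n = feats.shape[0]
    win = basis.shape[1]
    out = np.zeros(stride * (n - 1) + win)
    for i, f in enumerate(feats):
        out[i * stride : i * stride + win] += f @ basis
    return out

rng = np.random.default_rng(0)
basis = rng.standard_normal((64, 16))
wave = decode(rng.standard_normal((19, 64)), basis)  # 19 frames -> 160 samples
```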
S105, synthesizing the enhanced voice into a voice segment.
The segmented speech data stream is processed by the speech enhancement network model to obtain enhanced speech, and the enhanced speech is synthesized into speech segments.
The speech enhancement method above constructs an efficient multi-scale time-domain speech enhancement model based on a fully-gated convolutional network and uses it to capture the temporal information of the speech signal. A gating mechanism is integrated into the model so that it can learn feature representations at different levels. Rather than selecting the output of the last layer as the final output, the final output is obtained by fusing feature maps of different depths; connections are established between layers of different depths so that information learned in the deep layers can be transferred to the shallow layers, and a further gating mechanism screens the useful information.
To verify the effectiveness of the speech enhancement method of the embodiment, a multi-scale time-domain speech enhancement model based on a fully-gated convolutional network is first constructed with the output of the last convolution block as the final output, 3 convolution blocks, and a convolutional dilation coefficient of 6. On this basis, multi-scale feature fusion and feature transfer are added step by step.
The experimental results show that the model can effectively enhance speech, and that gradually adding feature fusion and feature transfer further improves its performance. Compared with a time-domain convolution baseline, the final model of the embodiment gains 0.12 and 0.01 on PESQ (Perceptual Evaluation of Speech Quality) and STOI (Short-Time Objective Intelligibility), respectively. Furthermore, compared with the noisy speech, the model improves PESQ and STOI by 0.43 and 0.123, respectively.
The configuration with 4 convolution blocks and a convolutional dilation coefficient of 8 performs best: compared with the noisy speech, the optimal model of the embodiment improves PESQ and STOI by 0.54 and 0.125, respectively. The enhancement model not only enhances noisy speech effectively but also outperforms the other reference systems: its performance is superior to both frequency-domain systems and recurrent-neural-network systems. By expanding the receptive field, the multi-scale time-domain model can capture long-term dependencies, which yields a particularly significant improvement in STOI. This shows that, through end-to-end training, a multi-scale time-domain speech enhancement model based on a fully-gated convolutional network can enhance and estimate speech more accurately.
Through the technical scheme described above, the speech is processed into single-channel speech, the single-channel speech is segmented into a data stream containing the preset type of sound, and that data stream is fed into the preset speech enhancement network model. The influence of noise is thereby avoided, and because speech characteristics are taken into account, no distortion is introduced and the speech is not damaged. The resulting enhanced speech is synthesized into speech segments, enabling application in multiple scenarios.
With respect to the method embodiment, an embodiment of the present invention further provides an embodiment of a speech enhancement apparatus, as shown in fig. 2, the apparatus may include: a voice collecting module 210, a voice processing module 220, a voice segmenting module 230, a voice enhancing module 240, and a voice synthesizing module 250.
The voice acquisition module 210 is used for calling a voice acquisition device to collect speech in the current environment;
the voice processing module 220 is configured to process the voice according to a preset voice processing algorithm to obtain a single-channel voice;
the voice segmentation module 230 is configured to perform sentence segmentation on the single-channel voice to obtain a voice segment data stream containing a preset type of sound;
a speech enhancement module 240, configured to input the speech segment data stream into a preset speech enhancement network model, so as to obtain an enhanced speech corresponding to the speech segment data stream;
a speech synthesis module 250, configured to synthesize the enhanced speech into speech segments.
According to a specific embodiment provided by the present invention, the speech processing module 220 is specifically configured to: and carrying out A/D conversion on the voice, and sampling according to a preset sampling rate to obtain single-channel voice.
According to a specific embodiment provided by the present invention, the voice segmentation module 230 is specifically configured to: segmenting sentences of the voice in the single-channel voice within a preset threshold range; for any frame of voice in the single-channel voice within a preset threshold range, detecting whether preset type voice is contained or not by utilizing a pre-established neural network model; if the frame voice contains the preset type of sound, the frame voice is reserved; and combining all the voice frames containing the preset type of voice to obtain the voice segment data stream containing the preset type of voice.
According to a specific embodiment provided by the present invention, the apparatus further comprises:
and a voice filtering module 260, configured to filter the frame of voice if the frame of voice does not contain a preset type of sound.
Fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present invention. The electronic device 300 shown in fig. 3 includes: at least one processor 301, a memory 302, at least one network interface 304, and other user interfaces 303. The various components in the electronic device 300 are coupled together by a bus system 305, which is used to enable communication among the connected components. In addition to a data bus, the bus system 305 includes a power bus, a control bus, and a status signal bus; for clarity of illustration, however, the various buses are all labeled as bus system 305 in fig. 3.
The user interface 303 may include, among other things, a display, a keyboard, or a pointing device (e.g., a mouse, trackball, touch pad, or touch screen).
It will be appreciated that the memory 302 in embodiments of the invention may be volatile memory or non-volatile memory, or may include both volatile and non-volatile memory. The non-volatile memory may be read-only memory (ROM), programmable ROM (PROM), erasable programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. The volatile memory may be random access memory (RAM), which serves as an external cache. By way of example and not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double-data-rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct Rambus RAM (DRRAM). The memory 302 described herein is intended to comprise, without being limited to, these and any other suitable types of memory.
In some embodiments, memory 302 stores the following elements, executable units or data structures, or a subset thereof, or an expanded set thereof: an operating system 3021 and application programs 3022.
The operating system 3021 includes various system programs, such as a framework layer, a core library layer, a driver layer, and the like, and is used for implementing various basic services and processing hardware-based tasks. The application programs 3022 include various application programs such as a media player (MediaPlayer), a Browser (Browser), and the like, for implementing various application services. A program implementing the method of an embodiment of the present invention may be included in the application program 3022.
In the embodiment of the present invention, by calling a program or an instruction stored in the memory 302, specifically, a program or an instruction stored in the application 3022, the processor 301 is configured to execute the method steps provided by the method embodiments, for example, including:
calling voice acquisition equipment to acquire voice in the current environment; processing the voice according to a preset voice processing algorithm to obtain single-channel voice; performing sentence segmentation on the single-channel voice to obtain a voice segmented data stream containing preset type sounds; inputting the voice segment data stream into a preset voice enhancement network model to obtain enhanced voice corresponding to the voice segment data stream; synthesizing the enhanced speech into speech segments.
The method disclosed in the above embodiments of the present invention may be applied to, or implemented by, the processor 301. The processor 301 may be an integrated circuit chip with signal processing capabilities. In implementation, the steps of the above method may be performed by integrated hardware logic circuits or by software instructions in the processor 301. The processor 301 may be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The various methods, steps, and logic blocks disclosed in the embodiments of the present invention may be implemented or performed by it. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The steps of the method disclosed in connection with the embodiments of the present invention may be implemented directly by a hardware decoding processor, or by a combination of hardware and software elements in the decoding processor. The software elements may be located in RAM, flash memory, ROM, PROM, EPROM, or registers, among other storage media well known in the art. The storage medium is located in the memory 302, and the processor 301 reads the information in the memory 302 and completes the steps of the method in combination with its hardware.
It is to be understood that the embodiments described herein may be implemented in hardware, software, firmware, middleware, microcode, or any combination thereof. For a hardware implementation, the processing units may be implemented within one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), general purpose processors, controllers, micro-controllers, microprocessors, other electronic units configured to perform the functions described herein, or a combination thereof.
For a software implementation, the techniques described herein may be implemented by means of units performing the functions described herein. The software codes may be stored in a memory and executed by a processor. The memory may be implemented within the processor or external to the processor.
The electronic device provided in this embodiment may be the electronic device shown in fig. 3, and may perform all the steps of the speech enhancement method shown in fig. 1, so as to achieve the technical effect of the speech enhancement method shown in fig. 1.
An embodiment of the present invention further provides a storage medium (a computer-readable storage medium). The storage medium stores one or more programs. The storage medium may include volatile memory, such as random access memory; it may also include non-volatile memory, such as read-only memory, flash memory, a hard disk, or a solid-state drive; or it may comprise a combination of the above kinds of memory.
The one or more programs in the storage medium are executable by one or more processors to implement the speech enhancement method described above as performed on the speech enhancement device side.
The processor is configured to execute the speech enhancement program stored in the memory to implement the following steps of the speech enhancement method performed on the speech enhancement device side:
calling voice acquisition equipment to acquire voice in the current environment; processing the voice according to a preset voice processing algorithm to obtain single-channel voice; performing sentence segmentation on the single-channel voice to obtain a voice segmented data stream containing preset type sounds; inputting the voice segment data stream into a preset voice enhancement network model to obtain enhanced voice corresponding to the voice segment data stream; synthesizing the enhanced speech into speech segments.
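The device-side steps above can be illustrated with a minimal end-to-end sketch. This is not the disclosed implementation: the function names are hypothetical, the energy threshold stands in for the pre-established detection network, and the identity `enhance` stands in for the preset speech enhancement network model.

```python
import numpy as np

def to_single_channel(audio: np.ndarray) -> np.ndarray:
    """Down-mix a (samples, channels) signal to mono (stand-in for the
    preset speech processing algorithm)."""
    return audio.mean(axis=1) if audio.ndim == 2 else audio

def segment_frames(mono: np.ndarray, frame_len: int = 160) -> list:
    """Split the mono speech into fixed-length frames."""
    n = len(mono) // frame_len
    return [mono[i * frame_len:(i + 1) * frame_len] for i in range(n)]

def contains_target_sound(frame: np.ndarray, threshold: float = 0.01) -> bool:
    """Energy-based stand-in for the pre-established detection network."""
    return float(np.mean(frame ** 2)) > threshold

def enhance(frames: list) -> list:
    """Placeholder for the preset speech enhancement network model."""
    return [f.copy() for f in frames]  # identity: the real model denoises here

def speech_enhancement_pipeline(audio: np.ndarray) -> np.ndarray:
    """Acquire -> mono -> segment -> filter -> enhance -> synthesize."""
    mono = to_single_channel(audio)
    kept = [f for f in segment_frames(mono) if contains_target_sound(f)]
    return np.concatenate(enhance(kept)) if kept else np.zeros(0)
```

In this toy pipeline, silent frames are dropped before enhancement, so the output length equals the number of retained frames times the frame length.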
Those skilled in the art will further appreciate that the various illustrative components and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or a combination of both. To clearly illustrate this interchangeability of hardware and software, the various illustrative components and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and the design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as a departure from the scope of the present invention.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), Read-Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are merely exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (10)

1. A method of speech enhancement, the method comprising:
calling voice acquisition equipment to acquire voice in the current environment;
processing the voice according to a preset voice processing algorithm to obtain single-channel voice;
performing sentence segmentation on the single-channel voice to obtain a voice segmented data stream containing preset type sounds;
inputting the voice segment data stream into a preset voice enhancement network model to obtain enhanced voice corresponding to the voice segment data stream, wherein the voice enhancement network model is a multi-scale time-domain voice enhancement model based on a fully gated convolutional network and comprises an encoder module, an enhancement module, and a decoder module; the encoder module encodes the noisy waveform into an intermediate feature space, converting the input segment into a high-dimensional feature representation through a one-dimensional convolutional neural network; the enhancement module operates on the encoded high-dimensional feature representation and comprises three operations: multi-scale feature extraction, convolution blocks, and multi-scale feature fusion; the decoder module, as the inverse of the encoder module, decodes the feature representation back into voice samples; in the multi-scale feature extraction, features are extracted and fused using gated convolution operations of different kernel sizes; the convolution blocks comprise a plurality of blocks, each adopting a fully convolutional network based on a temporal convolutional network; within each block, the convolution operation is repeated R times while the dilation coefficient of the convolutional network is progressively increased, expanding the receptive field so that the network can capture long-range information; in the multi-scale feature fusion, convolutional neural networks at different levels output different types of features, and the output of each temporal convolution block is extracted and fused into the final output of the voice enhancement network model;
synthesizing the enhanced speech into speech segments.
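The gated, dilated convolutions recited in claim 1 can be illustrated with a small numerical sketch. This is an assumption-laden toy (single channel, shared fixed kernels, plain NumPy loops), not the patented multi-scale model; only the gated activation `tanh(filter) * sigmoid(gate)` and the doubling dilation coefficient, which repeats the convolution R times while expanding the receptive field, follow the claim's description.

```python
import numpy as np

def dilated_conv1d(x, w, dilation):
    """Causal 1-D convolution of signal x with kernel w at the given dilation."""
    k, pad = len(w), (len(w) - 1) * dilation
    xp = np.concatenate([np.zeros(pad), x])  # left-pad so output is causal
    return np.array([sum(w[j] * xp[i + pad - j * dilation] for j in range(k))
                     for i in range(len(x))])

def gated_conv1d(x, w_f, w_g, dilation):
    """Gated activation unit: tanh(filter branch) * sigmoid(gate branch)."""
    f = np.tanh(dilated_conv1d(x, w_f, dilation))
    g = 1.0 / (1.0 + np.exp(-dilated_conv1d(x, w_g, dilation)))
    return f * g

def conv_block(x, w_f, w_g, repeats=3):
    """Repeat the gated convolution R times, doubling the dilation (1, 2, 4, ...)
    to expand the receptive field; a residual connection preserves the input."""
    out = x
    for r in range(repeats):
        out = out + gated_conv1d(out, w_f, w_g, dilation=2 ** r)
    return out
```

With a kernel of size 3 and dilations 1, 2, 4, each output sample depends on a receptive field of 15 past samples, which is how stacked dilated blocks capture long-range information cheaply.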
2. The method of claim 1, wherein processing the speech according to a preset speech processing algorithm to obtain single-channel speech comprises:
performing A/D conversion on the voice and sampling at a preset sampling rate to obtain the single-channel voice.
3. The method of claim 1, wherein the sentence-segmentation of the single-channel speech to obtain a speech segment data stream containing a preset type of sound comprises:
performing sentence segmentation on the voice in the single-channel voice within a preset threshold range;
for any frame of voice within the preset threshold range in the single-channel voice, detecting whether it contains the preset type of sound by using a pre-established neural network model;
if the frame of voice contains the preset type of sound, retaining the frame of voice;
and combining all voice frames containing the preset type of sound to obtain the voice segment data stream containing the preset type of sound.
4. The method of claim 3, further comprising:
if the frame of voice does not contain the preset type of sound, filtering out the frame of voice.
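The keep/filter logic of claims 3 and 4 amounts to a per-frame predicate over the segmented stream. A minimal sketch follows, where `energy_detector` is a hypothetical stand-in for the pre-established neural network model; any callable returning a boolean per frame could be plugged in.

```python
import numpy as np

def energy_detector(frame: np.ndarray, threshold: float = 0.01) -> bool:
    """Hypothetical stand-in for the pre-established detection network."""
    return float(np.mean(np.square(frame))) > threshold

def filter_frames(frames, detector=energy_detector) -> np.ndarray:
    """Retain frames the detector flags as containing the preset sound
    (claim 3), filter out the rest (claim 4), and combine the survivors."""
    kept = [f for f in frames if detector(f)]
    return np.concatenate(kept) if kept else np.zeros(0)
```

Keeping the detector as a parameter mirrors the separation in the claims between the segmentation step and the pre-established detection model.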
5. A speech enhancement apparatus, characterized in that the apparatus comprises:
the voice acquisition module is used for calling voice acquisition equipment and acquiring voice in the current environment;
the voice processing module is used for processing the voice according to a preset voice processing algorithm to obtain single-channel voice;
the voice segmentation module is used for segmenting the single-channel voice to obtain a voice segmentation data stream containing preset type sounds;
the speech enhancement module is used for inputting the speech segment data stream into a preset speech enhancement network model to obtain enhanced speech corresponding to the speech segment data stream, wherein the speech enhancement network model is a multi-scale time domain speech enhancement model based on a full-gated convolutional network, and may include an encoder module, an enhancement module and a decoder module; the encoder module encodes the noise waveform into an intermediate feature space, and the input section is converted into a high-dimensional feature representation form by a one-dimensional convolutional neural network; the enhancement module operates on the encoded high-dimensional feature representation, and comprises three operation processes: multi-scale feature extraction, convolution block and multi-scale feature fusion, a decoder module, an inverse process of the encoding module, decoding the feature representation into a speech sample, the multi-scale feature extraction, extracting and fusing features using gated convolution operations of different sizes, the convolution block: the method comprises the following steps that the method comprises a plurality of convolution blocks, and in each block, a full convolution network based on a time domain convolution network is adopted; in each block, convolution operation is repeated for R times, meanwhile, the expansion coefficient of the convolution network is continuously improved, the receptive field is expanded, and the network can capture long-term information by expanding the receptive field; multi-scale feature fusion, wherein different levels of convolutional neural networks output different types of features, the output of each time domain convolutional block is extracted and fused into the final output of a voice enhancement network model;
and the voice synthesis module is used for synthesizing the enhanced voice into a voice section.
6. The apparatus of claim 5, wherein the speech processing module is specifically configured to:
performing A/D conversion on the voice and sampling at a preset sampling rate to obtain the single-channel voice.
7. The apparatus of claim 5, wherein the speech segmentation module is specifically configured to:
performing sentence segmentation on the voice in the single-channel voice within a preset threshold range;
for any frame of voice within the preset threshold range in the single-channel voice, detecting whether it contains the preset type of sound by using a pre-established neural network model;
if the frame of voice contains the preset type of sound, retaining the frame of voice;
and combining all voice frames containing the preset type of sound to obtain the voice segment data stream containing the preset type of sound.
8. The apparatus of claim 7, further comprising:
the voice filtering module is used for filtering out the frame of voice if the frame of voice does not contain the preset type of sound.
9. An electronic device, comprising: a processor and a memory, the processor being configured to execute a speech enhancement program stored in the memory to implement the speech enhancement method of any of claims 1-4.
10. A storage medium storing one or more programs executable by one or more processors to implement the speech enhancement method of any one of claims 1-4.
CN201910663257.8A 2019-07-22 2019-07-22 Voice enhancement method and device, storage medium and electronic equipment Active CN110534123B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910663257.8A CN110534123B (en) 2019-07-22 2019-07-22 Voice enhancement method and device, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910663257.8A CN110534123B (en) 2019-07-22 2019-07-22 Voice enhancement method and device, storage medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN110534123A CN110534123A (en) 2019-12-03
CN110534123B true CN110534123B (en) 2022-04-01

Family

ID=68660741

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910663257.8A Active CN110534123B (en) 2019-07-22 2019-07-22 Voice enhancement method and device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN110534123B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111312224B (en) * 2020-02-20 2023-04-21 北京声智科技有限公司 Training method and device of voice segmentation model and electronic equipment
CN112509593B (en) * 2020-11-17 2024-03-08 北京清微智能科技有限公司 Speech enhancement network model, single-channel speech enhancement method and system
CN113571074B (en) * 2021-08-09 2023-07-25 四川启睿克科技有限公司 Voice enhancement method and device based on multi-band structure time domain audio frequency separation network

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7392312B1 (en) * 1998-09-11 2008-06-24 Lv Partners, L.P. Method for utilizing visual cue in conjunction with web access
CN102124518A (en) * 2008-08-05 2011-07-13 弗朗霍夫应用科学研究促进协会 Apparatus and method for processing an audio signal for speech enhancement using a feature extraction
CN103794221A (en) * 2012-10-26 2014-05-14 索尼公司 Signal processing device and method, and program
CN106157953A (en) * 2015-04-16 2016-11-23 科大讯飞股份有限公司 continuous speech recognition method and system
CN107845389A (en) * 2017-12-21 2018-03-27 北京工业大学 A kind of sound enhancement method based on multiresolution sense of hearing cepstrum coefficient and depth convolutional neural networks
CN108172238A (en) * 2018-01-06 2018-06-15 广州音书科技有限公司 A kind of voice enhancement algorithm based on multiple convolutional neural networks in speech recognition system
CN108564963A (en) * 2018-04-23 2018-09-21 百度在线网络技术(北京)有限公司 Method and apparatus for enhancing voice
CN108877823A (en) * 2018-07-27 2018-11-23 三星电子(中国)研发中心 Sound enhancement method and device
CN109326299A (en) * 2018-11-14 2019-02-12 平安科技(深圳)有限公司 Sound enhancement method, device and storage medium based on full convolutional neural networks
CN109410974A (en) * 2018-10-23 2019-03-01 百度在线网络技术(北京)有限公司 Sound enhancement method, device, equipment and storage medium
CN109841226A (en) * 2018-08-31 2019-06-04 大象声科(深圳)科技有限公司 A kind of single channel real-time noise-reducing method based on convolution recurrent neural network
CN110010144A (en) * 2019-04-24 2019-07-12 厦门亿联网络技术股份有限公司 Voice signals enhancement method and device
CN110503940A (en) * 2019-07-12 2019-11-26 中国科学院自动化研究所 Sound enhancement method, device, storage medium, electronic equipment

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9881631B2 (en) * 2014-10-21 2018-01-30 Mitsubishi Electric Research Laboratories, Inc. Method for enhancing audio signal using phase information
US11373672B2 (en) * 2016-06-14 2022-06-28 The Trustees Of Columbia University In The City Of New York Systems and methods for speech separation and neural decoding of attentional selection in multi-speaker environments
CN106898350A (en) * 2017-01-16 2017-06-27 华南理工大学 A kind of interaction of intelligent industrial robot voice and control method based on deep learning
US10726858B2 (en) * 2018-06-22 2020-07-28 Intel Corporation Neural network for speech denoising trained with deep feature losses

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Convolutional Neural Turing Machine for Speech Separation; Jen-Tzung Chien et al.; ISCSLP 2018; 20181231; pp. 81-85 *
Single-channel speech enhancement using a deep fully convolutional encoder-decoder network; Shi Wenhua et al.; Journal of Signal Processing; 20190430; pp. 621-640 *

Also Published As

Publication number Publication date
CN110534123A (en) 2019-12-03

Similar Documents

Publication Publication Date Title
CN110503940B (en) Voice enhancement method and device, storage medium and electronic equipment
JP6903129B2 (en) Whispering conversion methods, devices, devices and readable storage media
Pascual et al. SEGAN: Speech enhancement generative adversarial network
JP6198872B2 (en) Detection of speech syllable / vowel / phoneme boundaries using auditory attention cues
CN110534123B (en) Voice enhancement method and device, storage medium and electronic equipment
Bhat et al. A real-time convolutional neural network based speech enhancement for hearing impaired listeners using smartphone
Adeel et al. Lip-reading driven deep learning approach for speech enhancement
CN106486130B (en) Noise elimination and voice recognition method and device
Hsieh et al. Improving perceptual quality by phone-fortified perceptual loss using wasserstein distance for speech enhancement
Shah et al. Time-frequency mask-based speech enhancement using convolutional generative adversarial network
Hsieh et al. Improving perceptual quality by phone-fortified perceptual loss for speech enhancement
CN111667834B (en) Hearing-aid equipment and hearing-aid method
JP2020071482A (en) Word sound separation method, word sound separation model training method and computer readable medium
CN114360561A (en) Voice enhancement method based on deep neural network technology
Abdulatif et al. Investigating cross-domain losses for speech enhancement
CN117542373A (en) Non-air conduction voice recovery system and method
WO2019216187A1 (en) Pitch enhancement device, and method and program therefor
CN114724589A (en) Voice quality inspection method and device, electronic equipment and storage medium
CN112002307B (en) Voice recognition method and device
Dua et al. Noise robust automatic speech recognition: review and analysis
Hu et al. Learnable spectral dimension compression mapping for full-band speech enhancement
CN111048065B (en) Text error correction data generation method and related device
Chit et al. Myanmar continuous speech recognition system using fuzzy logic classification in speech segmentation
JP6794064B2 (en) Model learning device, speech interval detector, their methods and programs
WO2019216192A1 (en) Pitch enhancement device, method and program therefor

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant