CN110534123A - Speech enhancement method and apparatus, storage medium, and electronic device - Google Patents

Speech enhancement method and apparatus, storage medium, and electronic device

Info

Publication number
CN110534123A
CN110534123A
Authority
CN
China
Prior art keywords
voice
preset
sound
speech
channel
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910663257.8A
Other languages
Chinese (zh)
Other versions
CN110534123B (en)
Inventor
李晨星
许家铭
徐波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science filed Critical Institute of Automation of Chinese Academy of Science
Priority to CN201910663257.8A priority Critical patent/CN110534123B/en
Publication of CN110534123A publication Critical patent/CN110534123A/en
Application granted granted Critical
Publication of CN110534123B publication Critical patent/CN110534123B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/04: Segmentation; Word boundary detection
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/20: Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0208: Noise filtering
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/27: Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00, characterised by the analysis technique
    • G10L 25/30: Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00, characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Quality & Reliability (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

Embodiments of the present invention relate to a speech enhancement method and apparatus, a storage medium, and an electronic device. The method includes: invoking a speech capture device to collect speech in the current environment; processing the speech according to a preset speech processing algorithm to obtain single-channel speech; performing endpoint segmentation on the single-channel speech to obtain a speech segment data stream containing sounds of a preset type; feeding the speech segment data stream into a preset speech enhancement network model to obtain enhanced speech corresponding to the speech segment data stream; and synthesizing the enhanced speech into speech segments. In this way, the method can be applied to more scenarios, avoids the influence of noise, takes speech characteristics into account, and avoids introducing distortion, thereby avoiding damage to the speech.

Description

Speech enhancement method and apparatus, storage medium, and electronic device
Technical field
Embodiments of the present invention relate to the technical field of automatic computer information processing, and in particular to a speech enhancement method and apparatus, a storage medium, and an electronic device.
Background technique
Speech, the material shell of language, is the external form of language and the most direct symbol system for recording human thought; it is one of the most natural and effective means for users to exchange information. While acquiring speech signals, users inevitably suffer interference from ambient noise, room reverberation, and other users, which seriously degrades speech quality and, in turn, the performance of speech recognition. Speech enhancement therefore arose as a front-end processing step and is an effective way to suppress interference and improve far-field speech recognition accuracy.
Speech enhancement refers to the technology of extracting the useful speech signal from a noisy background, and suppressing and reducing noise interference, when the speech signal has been corrupted or even submerged by various kinds of noise. In short, it extracts speech that is as clean as possible from noisy speech.
In the related art, traditional speech enhancement methods mainly include spectral subtraction, Wiener filtering, and minimum mean-square error short-time spectral amplitude enhancement. Although traditional methods have advantages such as high speed and no need for large-scale training corpora, they depend heavily on the estimation of the noise, are applicable to few scenarios, fail to take speech characteristics into account, and inevitably introduce distortion, damaging the speech.
Summary of the invention
In view of this, to solve the above technical problems or some of them, embodiments of the present invention provide a speech enhancement method and apparatus, a storage medium, and an electronic device.
In a first aspect, an embodiment of the present invention provides a speech enhancement method, the method comprising:
invoking a speech capture device to collect speech in the current environment;
processing the speech according to a preset speech processing algorithm to obtain single-channel speech;
performing endpoint segmentation on the single-channel speech to obtain a speech segment data stream containing sounds of a preset type;
feeding the speech segment data stream into a preset speech enhancement network model to obtain enhanced speech corresponding to the speech segment data stream; and
synthesizing the enhanced speech into speech segments.
In a possible embodiment, processing the speech according to the preset speech processing algorithm to obtain single-channel speech comprises:
converting the speech by A/D conversion and sampling it at a preset sample rate to obtain single-channel speech.
In a possible embodiment, performing endpoint segmentation on the single-channel speech to obtain the speech segment data stream containing sounds of a preset type comprises:
performing endpoint segmentation on the speech within a preset threshold range in the single-channel speech;
for any frame of speech within the preset threshold range in the single-channel speech, detecting whether it contains sounds of the preset type using a pre-established neural network model;
if the frame contains sounds of the preset type, retaining the frame; and
combining all speech frames containing sounds of the preset type to obtain the speech segment data stream containing sounds of the preset type.
In a possible embodiment, the method further comprises:
if the frame does not contain sounds of the preset type, filtering out the frame.
In a second aspect, an embodiment of the present invention provides a speech enhancement apparatus, the apparatus comprising:
a speech collection module, configured to invoke a speech capture device to collect speech in the current environment;
a speech processing module, configured to process the speech according to a preset speech processing algorithm to obtain single-channel speech;
a speech segmentation module, configured to perform endpoint segmentation on the single-channel speech to obtain a speech segment data stream containing sounds of a preset type;
a speech enhancement module, configured to feed the speech segment data stream into a preset speech enhancement network model to obtain enhanced speech corresponding to the speech segment data stream; and
a speech synthesis module, configured to synthesize the enhanced speech into speech segments.
In a possible embodiment, the speech processing module is specifically configured to:
convert the speech by A/D conversion and sample it at a preset sample rate to obtain single-channel speech.
In a possible embodiment, the speech segmentation module is specifically configured to:
perform endpoint segmentation on the speech within a preset threshold range in the single-channel speech;
for any frame of speech within the preset threshold range in the single-channel speech, detect whether it contains sounds of the preset type using a pre-established neural network model;
if the frame contains sounds of the preset type, retain the frame; and
combine all speech frames containing sounds of the preset type to obtain the speech segment data stream containing sounds of the preset type.
In a possible embodiment, the apparatus further comprises:
a speech filtering module, configured to filter out the frame if it does not contain sounds of the preset type.
In a third aspect, an embodiment of the present invention provides a storage medium storing one or more programs, where the one or more programs are executable by one or more processors to realize the aforementioned speech enhancement method.
In a fourth aspect, an embodiment of the present invention provides an electronic device, comprising a processor and a memory, where the processor is configured to execute a speech enhancement program stored in the memory to realize the aforementioned speech enhancement method.
In the technical solution provided by the embodiments of the present invention, the speech is processed to obtain single-channel speech; endpoint segmentation is performed on the single-channel speech to obtain a speech segment data stream containing sounds of a preset type; and the speech segment data stream is fed into a preset speech enhancement network model. This avoids the influence of noise, takes speech characteristics into account, and avoids introducing distortion, thereby avoiding damage to the speech. The enhanced speech obtained in this way is synthesized into speech segments, and the method can be applied to more scenarios.
Detailed description of the invention
In order to more clearly illustrate the technical solutions in the embodiments of this specification or in the prior art, the accompanying drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some of the embodiments recorded in this specification; a person of ordinary skill in the art can also obtain other drawings from these drawings.
Fig. 1 is a flow diagram of an implementation of the speech enhancement method of an embodiment of the present invention;
Fig. 2 is a structural schematic diagram of the speech enhancement apparatus of an embodiment of the present invention;
Fig. 3 is a structural schematic diagram of the electronic device of an embodiment of the present invention.
Specific embodiment
In order to make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention are described clearly and completely below in conjunction with the accompanying drawings. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. Based on the embodiments of the present invention, all other embodiments obtained by a person of ordinary skill in the art without creative work shall fall within the protection scope of the present invention.
To facilitate understanding, the embodiments of the present invention are further explained below in conjunction with the accompanying drawings and specific embodiments; the embodiments do not constitute a limitation on the embodiments of the present invention.
As shown in Fig. 1, which is a flow diagram of an implementation of the speech enhancement method provided by an embodiment of the present invention, the method may specifically include the following steps:
S101: invoking a speech capture device to collect speech in the current environment.
In the embodiment of the present invention, the current environment may be a far-field, noisy acoustic environment; the embodiment of the present invention does not limit this.
In the current environment, a speech capture device such as a microphone is invoked to collect speech. The collected speech carries the original speech of the target user and the noise in the current environment. The noise in the current environment may be the speech of other users in the current environment, or music, impact sounds, and the like; relative to the original speech of the target user, all other sounds can be regarded as noise, and the embodiment of the present invention does not limit this.
S102: processing the speech according to a preset speech processing algorithm to obtain single-channel speech.
The speech collected in step S101 above is processed according to the preset speech processing algorithm to obtain single-channel speech. An optional implementation of processing according to the preset speech processing algorithm is given here:
the speech is converted by A/D conversion and sampled at a preset sample rate to obtain single-channel speech. Here, A/D refers to the circuit that converts an analog signal into a digital signal, called an analog-to-digital converter.
For example, a microphone is invoked to collect the speech in the current environment, the speech is converted by A/D conversion and sampled at a sample rate of 16000 Hz, and single-channel speech at a 16000 Hz sample rate is obtained.
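As a sketch of this step under stated assumptions (the linear-interpolation resampler, the 44.1 kHz stereo input, and the function name `to_mono_16k` are illustrative choices, not specified by the patent), downmixing to a single channel and resampling to 16000 Hz might look like:

```python
import numpy as np

def to_mono_16k(samples: np.ndarray, src_rate: int, dst_rate: int = 16000) -> np.ndarray:
    """Mix an (n_samples, n_channels) buffer down to mono and resample it
    by linear interpolation (a crude stand-in for a proper resampler)."""
    if samples.ndim == 2:                   # multi-channel: average the channels
        samples = samples.mean(axis=1)
    n_out = int(len(samples) * dst_rate / src_rate)
    src_t = np.arange(len(samples)) / src_rate
    dst_t = np.arange(n_out) / dst_rate
    return np.interp(dst_t, src_t, samples)

stereo = np.random.randn(44100, 2)          # one second of 44.1 kHz stereo
mono = to_mono_16k(stereo, 44100)
print(len(mono))                            # 16000 samples, i.e. one second at 16 kHz
```

In practice this step would be done by a dedicated resampling library; the sketch only shows the shape of the transformation.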
S103: performing endpoint segmentation on the single-channel speech to obtain the speech segment data stream containing sounds of a preset type.
A neural network model is trained in advance to detect whether each frame of speech contains sounds of a preset type; here, "sounds of a preset type" refers to the original speech of the target user.
Endpoint segmentation is performed on the speech within the preset threshold range in the single-channel speech. For any frame of speech within the preset threshold range in the single-channel speech, the pre-established neural network model is used to detect whether it contains sounds of the preset type.
If the frame contains sounds of the preset type, the frame is retained; if it does not, the frame is filtered out. In this way, the pre-established neural network model filters out speech frames other than the original speech of the target user and leaves the speech frames containing sounds of the preset type.
All speech frames containing sounds of the preset type are combined to obtain the speech segment data stream containing sounds of the preset type.
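A minimal sketch of the frame-level keep/filter logic, assuming a fixed frame length and substituting a simple energy threshold for the pre-established neural-network detector (the real detector, frame size, and threshold are not specified here):

```python
import numpy as np

FRAME = 256  # hypothetical frame length in samples

def frame_signal(x: np.ndarray, frame: int = FRAME) -> np.ndarray:
    """Split a 1-D signal into non-overlapping frames, dropping the tail."""
    n = len(x) // frame
    return x[: n * frame].reshape(n, frame)

def contains_target(frame: np.ndarray) -> bool:
    """Placeholder for the pretrained frame classifier: a simple energy
    threshold stands in for the neural network's decision."""
    return float(np.mean(frame ** 2)) > 0.01

def segment_stream(x: np.ndarray) -> np.ndarray:
    """Keep only frames flagged as target speech and concatenate them."""
    kept = [f for f in frame_signal(x) if contains_target(f)]
    return np.concatenate(kept) if kept else np.empty(0)

x = np.concatenate([np.zeros(512), 0.5 * np.ones(512)])  # silence, then "speech"
print(len(segment_stream(x)))   # 512: only the high-energy frames survive
```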
S104: feeding the speech segment data stream into the preset speech enhancement network model to obtain enhanced speech corresponding to the speech segment data stream.
The speech enhancement network model enhances the speech segment data stream end to end to obtain enhanced speech; its input is the speech segment data stream containing sounds of the preset type, and its output is the enhanced speech.
In the embodiment of the present invention, the speech enhancement network model is a multi-scale time-domain speech enhancement model based on a gated convolutional network, and may specifically include an encoder module, an enhancement module, and a decoder module.
Encoder module: encodes the noisy waveform into an intermediate feature space. The input segment is converted into a high-dimensional feature representation by a one-dimensional convolutional neural network.
Enhancement module: operates on the encoded high-dimensional feature representation and comprises three processes: multi-scale feature extraction, convolution blocks, and multi-scale feature fusion.
Multi-scale feature extraction: features are extracted and merged in parallel using gated convolution operations of different sizes. Specifically, features are extracted with one-dimensional gated convolutions; feature extraction at different scales is realized by gated convolutional networks with different kernel sizes. The outputs of the networks with different kernels are then concatenated, and the features are normalized by layer normalization before being output.
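A toy illustration of one gated multi-scale extraction step, assuming random untrained kernels, a single input channel, and "same" padding; the kernel sizes (3, 5, 9) and the normalization details are illustrative assumptions, not from the patent:

```python
import numpy as np

rng = np.random.default_rng(0)

def gated_conv1d(x: np.ndarray, k: int) -> np.ndarray:
    """One gated 1-D convolution: a linear branch modulated elementwise by
    a sigmoid gate branch of the same kernel size ('same' padding)."""
    w_lin = rng.standard_normal(k) / k
    w_gate = rng.standard_normal(k) / k
    lin = np.convolve(x, w_lin, mode="same")
    gate = 1.0 / (1.0 + np.exp(-np.convolve(x, w_gate, mode="same")))
    return lin * gate

def multi_scale(x: np.ndarray, kernel_sizes=(3, 5, 9)) -> np.ndarray:
    """Run gated convolutions at several kernel sizes, stack the results,
    and apply a layer-normalization-style standardization."""
    feats = np.stack([gated_conv1d(x, k) for k in kernel_sizes])  # (scales, T)
    return (feats - feats.mean()) / (feats.std() + 1e-8)

x = rng.standard_normal(128)
out = multi_scale(x)
print(out.shape)   # (3, 128): one feature row per kernel size
```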
Convolution blocks: composed of several convolution blocks, each a fully convolutional network. Within each block, the convolution operation is repeated R times while the dilation factor of the convolutional network is continuously increased, extending the receptive field. By extending the receptive field, the network can capture long-range temporal information.
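The effect of repeating a convolution while growing the dilation can be shown with a small receptive-field calculation; the kernel size 3 and the doubling dilation schedule 1, 2, 4, ... are assumptions for illustration, since the patent does not fix them here:

```python
def receptive_field(kernel: int, repeats: int) -> int:
    """Receptive field of a stack of 1-D convolutions whose dilation
    doubles at every layer (1, 2, 4, ...), as in the repeated blocks."""
    rf = 1
    for r in range(repeats):
        rf += (kernel - 1) * (2 ** r)
    return rf

# e.g. kernel 3 repeated R = 6 times: dilations 1, 2, 4, 8, 16, 32
print(receptive_field(3, 6))   # 127 samples
```

Each extra repeat roughly doubles the span of input the block can see, which is why increasing the dilation lets the network capture long-range dependencies cheaply.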
Multi-scale feature fusion: convolutional neural networks at different levels output different types of features, such as low-level texture (shallow layers) and semantic cues (deep layers), and these features contribute differently to the final task. Specifically, in the embodiment of the present invention, the output of the last layer is not used directly as the final output; instead, the output of each convolution block is extracted, and these outputs are fused into the final output of the model. The output features of each convolution block represent details at a different level. A connection is established for each block, and the information of different blocks is transmitted during training; this process is called feature transmission. Since the benefit of information from other layers is unknown, a gating mechanism is used to screen useful information and control the information flow. Specifically, high-level features are transferred to the shallow layers step by step.
Decoder module: the inverse process of the encoder module. It decodes the feature representation into speech samples; specifically, the decoding process is realized by a one-dimensional transposed convolution.
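The encoder/enhancement/decoder round trip can be sketched as follows, assuming non-overlapping analysis windows, random untrained weights, and a sigmoid mask standing in for the full enhancement module; the window length and feature dimension are illustrative, not values from the patent:

```python
import numpy as np

rng = np.random.default_rng(1)
WIN, DIM = 16, 32   # hypothetical window length and feature dimension

enc_w = rng.standard_normal((DIM, WIN)) / WIN   # 1-D conv as a framed matmul
dec_w = rng.standard_normal((WIN, DIM)) / DIM   # transposed conv (decoder)

def encode(x: np.ndarray) -> np.ndarray:
    """Encoder: frame the waveform (stride = WIN) and project each frame
    into the intermediate feature space."""
    frames = x.reshape(-1, WIN)
    return frames @ enc_w.T                     # (n_frames, DIM)

def decode(feats: np.ndarray) -> np.ndarray:
    """Decoder: project features back to samples and flatten, the inverse
    shape transformation of encode()."""
    return (feats @ dec_w.T).reshape(-1)

x = rng.standard_normal(256)
feats = encode(x)
mask = 1.0 / (1.0 + np.exp(-feats))             # stand-in for the enhancement module
print(decode(feats * mask).shape)               # (256,), same length as the input
```

The real model uses learned strided and transposed convolutions with overlap; the sketch only shows that the decoder undoes the encoder's shape transformation so the output waveform matches the input length.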
S105: synthesizing the enhanced speech into speech segments.
The speech segment data stream is processed by the above speech enhancement network model to obtain enhanced speech, and the enhanced speech is synthesized into speech segments.
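The overall S101 to S105 pipeline might be wired together as below; every component is a trivial placeholder standing in for the real capture device, single-channel conversion, segmenter, and enhancement network described above:

```python
import numpy as np

# Hypothetical stand-ins for the five steps; each real component would be
# the module described in the corresponding step of the method.
capture    = lambda: np.random.randn(32000)        # S101: raw capture (2 s at 16 kHz)
to_mono    = lambda x: x                           # S102: already mono in this sketch
segment    = lambda x: [x[:16000], x[16000:]]      # S103: fixed split stand-in
enhance    = lambda seg: seg * 0.9                 # S104: placeholder for the model
synthesize = lambda segs: np.concatenate(segs)     # S105: join enhanced segments

speech = capture()
enhanced = synthesize([enhance(s) for s in segment(to_mono(speech))])
print(enhanced.shape)   # (32000,): one enhanced utterance, same length as the input
```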
The speech enhancement method in the embodiment of the present invention constructs an efficient multi-scale time-domain speech enhancement model based on a gated convolutional network and uses it to capture the temporal information of the speech signal. A gating mechanism is integrated into the model so that it can learn feature representations at different levels. Rather than selecting the output of the last layer as the final output, feature maps of different depths are fused, and connections are established between layers of different depths so that the information learned by deep layers can be passed to shallow layers. Another gating mechanism is used to screen useful information.
To verify the effectiveness of the speech enhancement method in the embodiment of the present invention, a multi-scale time-domain speech enhancement model based on a gated convolutional network is first constructed, with the output of the last convolution block as the final output, 3 convolution blocks, and a dilation factor of 6 as the experimental configuration. On this basis, multi-scale feature fusion and feature transmission are added step by step.
The experimental results show that the multi-scale time-domain speech enhancement model based on a gated convolutional network can effectively enhance speech, and that gradually adding feature fusion and feature transmission further improves the performance of the model. Compared with the convolution-based model, the final model in the embodiment of the present invention obtains performance gains of 0.12 in PESQ (Perceptual Evaluation of Speech Quality) and 0.01 in STOI (Short-Time Objective Intelligibility). In addition, compared with the noisy speech, the model improves PESQ and STOI by 0.43 and 0.123 respectively.
The configuration with 4 convolution blocks and a dilation factor of 8 performs best. Compared with the noisy speech, the best model in the embodiment of the present invention achieves improvements of 0.54 in PESQ and 0.125 in STOI. The enhancement model in the embodiment of the present invention not only effectively enhances noisy speech but also outperforms other baseline systems: the multi-scale time-domain model based on a gated convolutional network performs better than frequency-domain systems and systems based on recurrent neural networks. By extending the receptive field, the model can capture long-term dependencies, and it achieves a particularly significant improvement in STOI. This shows that, through end-to-end training, the multi-scale time-domain speech enhancement model based on a gated convolutional network can enhance and estimate speech more accurately.
As described in the technical solutions provided by the embodiments of the present invention above, the speech is processed to obtain single-channel speech; endpoint segmentation is performed on the single-channel speech to obtain a speech segment data stream containing sounds of a preset type; and the speech segment data stream is fed into a preset speech enhancement network model, avoiding the influence of noise, taking speech characteristics into account, and avoiding introducing distortion, thereby avoiding damage to the speech. The enhanced speech obtained in this way is synthesized into speech segments, and the method can be applied to more scenarios.
Corresponding to the method embodiment, an embodiment of the present invention also provides a speech enhancement apparatus. As shown in Fig. 2, the apparatus may include: a speech collection module 210, a speech processing module 220, a speech segmentation module 230, a speech enhancement module 240, and a speech synthesis module 250.
The speech collection module 210 is configured to invoke a speech capture device to collect speech in the current environment;
the speech processing module 220 is configured to process the speech according to a preset speech processing algorithm to obtain single-channel speech;
the speech segmentation module 230 is configured to perform endpoint segmentation on the single-channel speech to obtain a speech segment data stream containing sounds of a preset type;
the speech enhancement module 240 is configured to feed the speech segment data stream into a preset speech enhancement network model to obtain enhanced speech corresponding to the speech segment data stream;
the speech synthesis module 250 is configured to synthesize the enhanced speech into speech segments.
According to a specific embodiment of the present invention, the speech processing module 220 is specifically configured to: convert the speech by A/D conversion and sample it at a preset sample rate to obtain single-channel speech.
According to a specific embodiment of the present invention, the speech segmentation module 230 is specifically configured to: perform endpoint segmentation on the speech within a preset threshold range in the single-channel speech; for any frame of speech within the preset threshold range in the single-channel speech, detect whether it contains sounds of the preset type using the pre-established neural network model; if the frame contains sounds of the preset type, retain the frame; and combine all speech frames containing sounds of the preset type to obtain the speech segment data stream containing sounds of the preset type.
According to a specific embodiment of the present invention, the apparatus further includes:
a speech filtering module 260, configured to filter out the frame if it does not contain sounds of the preset type.
Fig. 3 is a structural schematic diagram of an electronic device provided by an embodiment of the present invention. The electronic device 300 shown in Fig. 3 includes: at least one processor 301, a memory 302, at least one network interface 304, and another user interface 303. The components of the electronic device 300 are coupled together by a bus system 305. It can be understood that the bus system 305 is used to realize the connection and communication between these components; in addition to a data bus, the bus system 305 also includes a power bus, a control bus, and a status signal bus. For clarity of explanation, however, the various buses are all labeled as the bus system 305 in Fig. 3.
The user interface 303 may include a display, a keyboard, or a pointing device (for example, a mouse, a trackball, a touch pad, or a touch screen).
It can be understood that the memory 302 in the embodiment of the present invention may be volatile memory or non-volatile memory, or may include both. The non-volatile memory may be read-only memory (ROM), programmable ROM (PROM), erasable programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. The volatile memory may be random access memory (RAM), which is used as an external cache. By way of illustration and not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), and direct Rambus RAM (DRRAM). The memory 302 described herein is intended to include, but is not limited to, these and any other suitable types of memory.
In some embodiments, the memory 302 stores the following elements, executable units or data structures, or subsets or supersets thereof: an operating system 3021 and application programs 3022.
The operating system 3021 includes various system programs, such as a framework layer, a core library layer, and a driver layer, for realizing various basic services and processing hardware-based tasks. The application programs 3022 include various applications, such as a media player and a browser, for realizing various application services. A program implementing the method of the embodiment of the present invention may be included in the application programs 3022.
In the embodiment of the present invention, by calling a program or instruction stored in the memory 302 (specifically, a program or instruction stored in the application programs 3022), the processor 301 is configured to execute the method steps provided by each method embodiment, for example including:
invoking a speech capture device to collect speech in the current environment; processing the speech according to a preset speech processing algorithm to obtain single-channel speech; performing endpoint segmentation on the single-channel speech to obtain a speech segment data stream containing sounds of a preset type; feeding the speech segment data stream into a preset speech enhancement network model to obtain enhanced speech corresponding to the speech segment data stream; and synthesizing the enhanced speech into speech segments.
The method that the embodiments of the present invention disclose can be applied in processor 301, or be realized by processor 301. Processor 301 may be a kind of IC chip, the processing capacity with signal.During realization, the above method it is each Step can be completed by the integrated logic circuit of the hardware in processor 301 or the instruction of software form.Above-mentioned processing Device 301 can be general processor, digital signal processor (DigitalSignalProcessor, DSP), specific integrated circuit (ApplicationSpecific IntegratedCircuit, ASIC), ready-made programmable gate array (FieldProgrammableGateArray, FPGA) either other programmable logic device, discrete gate or transistor logic Device, discrete hardware components.It may be implemented or execute disclosed each method, step and the logical box in the embodiment of the present invention Figure.General processor can be microprocessor or the processor is also possible to any conventional processor etc..In conjunction with the present invention The step of method disclosed in embodiment, can be embodied directly in hardware decoding processor and execute completion, or use decoding processor In hardware and software unit combination execute completion.Software unit can be located at random access memory, and flash memory, read-only memory can In the storage medium of this fields such as program read-only memory or electrically erasable programmable memory, register maturation.The storage Medium is located at memory 302, and processor 301 reads the information in memory 302, and the step of the above method is completed in conjunction with its hardware Suddenly.
It is understood that embodiments described herein can with hardware, software, firmware, middleware, microcode or its Combination is to realize.For hardware realization, processing unit be may be implemented in one or more specific integrated circuit (Application SpecificIntegratedCircuits, ASIC), digital signal processor (DigitalSignalProcessing, DSP), Digital signal processing appts (DSPDevice, DSPD), programmable logic device (ProgrammableLogicDevice, PLD), Field programmable gate array (Field-ProgrammableGateArray, FPGA), general processor, controller, microcontroller In device, microprocessor, other electronic units for executing herein described function or combinations thereof.
For a software implementation, the techniques described herein may be implemented by units that perform the functions described herein. The software code may be stored in a memory and executed by a processor. The memory may be implemented in the processor or outside the processor.
The electronic device provided in this embodiment may be the electronic device shown in Fig. 3. It can perform all the steps of the speech enhancement method shown in Fig. 1 and thereby achieve the technical effects of the speech enhancement method shown in Fig. 1. For details, please refer to the related description of Fig. 1; for brevity of description, it is not repeated here.
An embodiment of the present invention also provides a storage medium (a computer-readable storage medium). The storage medium here stores one or more programs. The storage medium may include a volatile memory, such as a random access memory; the memory may also include a non-volatile memory, such as a read-only memory, a flash memory, a hard disk, or a solid-state drive; the memory may also include a combination of the above kinds of memory.
The one or more programs in the storage medium can be executed by one or more processors to implement the above speech enhancement method executed on the speech enhancement device side.
The processor is configured to execute the speech enhancement program stored in the memory, so as to implement the following steps of the speech enhancement method executed on the speech enhancement device side:
calling a voice capture device to collect speech in the current environment; processing the speech according to a preset speech processing algorithm to obtain single-channel speech; performing sentence-break segmentation on the single-channel speech to obtain a speech segment data stream containing sounds of a preset type; inputting the speech segment data stream into a preset speech enhancement network model to obtain enhanced speech corresponding to the speech segment data stream; and synthesizing the enhanced speech into speech segments.
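The device-side steps above (capture, single-channel conversion, sentence-break segmentation, enhancement, synthesis) can be sketched as a minimal pipeline. This is an illustrative reconstruction, not the patented implementation: the frame detector and the enhancement function below are simple placeholders standing in for the patent's pre-established neural network model and preset speech enhancement network model, and the 16 kHz sample rate and 10 ms frame length are assumed for illustration.

```python
# Illustrative sketch of the claimed pipeline (not the patented implementation).
# The detector and enhancer are placeholders: the patent's "pre-established
# neural network model" and "preset speech enhancement network model" are not
# specified here, so an energy threshold and an identity mapping stand in.
import numpy as np

FRAME_LEN = 160  # assumed 10 ms frames at an assumed 16 kHz sample rate

def contains_preset_sound(frame: np.ndarray, threshold: float = 0.01) -> bool:
    """Placeholder for the neural-network frame detector: keep frames whose
    RMS energy exceeds a threshold."""
    return float(np.sqrt(np.mean(frame ** 2))) > threshold

def enhance(segment: np.ndarray) -> np.ndarray:
    """Placeholder for the speech enhancement network model."""
    return segment  # a real model would map noisy speech to clean speech

def speech_enhancement_pipeline(single_channel: np.ndarray) -> np.ndarray:
    # Sentence-break segmentation: split into fixed-length frames and retain
    # only frames detected to contain the preset type of sound.
    n_frames = len(single_channel) // FRAME_LEN
    frames = single_channel[: n_frames * FRAME_LEN].reshape(n_frames, FRAME_LEN)
    kept = [f for f in frames if contains_preset_sound(f)]
    if not kept:
        return np.zeros(0, dtype=single_channel.dtype)
    segment_stream = np.concatenate(kept)  # speech segment data stream
    return enhance(segment_stream)         # enhanced speech, ready for synthesis

# Example: 1 s of a 440 Hz tone with a silent gap; silent frames are dropped.
t = np.arange(16000) / 16000.0
signal = np.sin(2 * np.pi * 440 * t)
signal[4000:8000] = 0.0
out = speech_enhancement_pipeline(signal)
```

The energy threshold only makes the control flow concrete; in the patent, the retain-or-filter decision for each frame is made by a trained network, as recited in claims 3 and 4.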
Those skilled in the art should further appreciate that the units and algorithm steps described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above generally in terms of function. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Skilled persons may use different methods to achieve the described function for each specific application, but such implementation should not be considered as going beyond the scope of the present invention.
The steps of the method or algorithm described in connection with the embodiments disclosed herein may be implemented in hardware, in a software module executed by a processor, or in a combination of the two. The software module may reside in a random access memory (RAM), a memory, a read-only memory (ROM), an electrically programmable ROM, an electrically erasable programmable ROM, a register, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium well known in the technical field.
The specific embodiments described above further detail the objectives, technical solutions, and beneficial effects of the present invention. It should be understood that the foregoing is merely a specific embodiment of the present invention and is not intended to limit the protection scope of the present invention. Any modification, equivalent substitution, improvement, and the like made within the spirit and principles of the present invention shall be included within the protection scope of the present invention.

Claims (10)

1. A speech enhancement method, characterized in that the method comprises:
calling a voice capture device to collect speech in a current environment;
processing the speech according to a preset speech processing algorithm to obtain single-channel speech;
performing sentence-break segmentation on the single-channel speech to obtain a speech segment data stream containing sounds of a preset type;
inputting the speech segment data stream into a preset speech enhancement network model to obtain enhanced speech corresponding to the speech segment data stream; and
synthesizing the enhanced speech into speech segments.
2. The method according to claim 1, characterized in that the processing the speech according to a preset speech processing algorithm to obtain single-channel speech comprises:
performing analog-to-digital (A/D) conversion on the speech and sampling at a preset sample rate to obtain the single-channel speech.
3. The method according to claim 1, characterized in that the performing sentence-break segmentation on the single-channel speech to obtain a speech segment data stream containing sounds of a preset type comprises:
performing sentence-break segmentation on speech within a preset threshold range in the single-channel speech;
for any frame of speech within the preset threshold range in the single-channel speech, detecting, by using a pre-established neural network model, whether the frame of speech contains a sound of the preset type;
if the frame of speech contains a sound of the preset type, retaining the frame of speech; and
combining all speech frames containing sounds of the preset type to obtain the speech segment data stream containing sounds of the preset type.
4. The method according to claim 3, characterized in that the method further comprises:
if the frame of speech does not contain a sound of the preset type, filtering out the frame of speech.
5. A speech enhancement apparatus, characterized in that the apparatus comprises:
a speech acquisition module, configured to call a voice capture device to collect speech in a current environment;
a speech processing module, configured to process the speech according to a preset speech processing algorithm to obtain single-channel speech;
a speech segmentation module, configured to perform sentence-break segmentation on the single-channel speech to obtain a speech segment data stream containing sounds of a preset type;
a speech enhancement module, configured to input the speech segment data stream into a preset speech enhancement network model to obtain enhanced speech corresponding to the speech segment data stream; and
a speech synthesis module, configured to synthesize the enhanced speech into speech segments.
6. The apparatus according to claim 5, characterized in that the speech processing module is specifically configured to:
perform analog-to-digital (A/D) conversion on the speech and sample at a preset sample rate to obtain the single-channel speech.
7. The apparatus according to claim 5, characterized in that the speech segmentation module is specifically configured to:
perform sentence-break segmentation on speech within a preset threshold range in the single-channel speech;
for any frame of speech within the preset threshold range in the single-channel speech, detect, by using a pre-established neural network model, whether the frame of speech contains a sound of the preset type;
if the frame of speech contains a sound of the preset type, retain the frame of speech; and
combine all speech frames containing sounds of the preset type to obtain the speech segment data stream containing sounds of the preset type.
8. The apparatus according to claim 7, characterized in that the apparatus further comprises:
a speech filtering module, configured to filter out the frame of speech if the frame of speech does not contain a sound of the preset type.
9. An electronic device, characterized by comprising a processor and a memory, wherein the processor is configured to execute a speech enhancement program stored in the memory, so as to implement the speech enhancement method according to any one of claims 1 to 4.
10. A storage medium, characterized in that the storage medium stores one or more programs, and the one or more programs can be executed by one or more processors, so as to implement the speech enhancement method according to any one of claims 1 to 4.
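Claims 2 and 6 recite reducing the captured audio to single-channel speech at a preset sample rate. As a hedged illustration only (the claims state A/D conversion and a preset sample rate but do not specify an algorithm), the sketch below downmixes a multi-channel buffer to mono and resamples it by linear interpolation; the 48 kHz source and 16 kHz target rates are assumptions.

```python
# Hedged illustration of claims 2/6: obtain single-channel speech at a preset
# sample rate. The channel averaging and the linear-interpolation resampler
# are assumptions; the claims only recite A/D conversion and sampling at a
# preset sample rate.
import numpy as np

def to_single_channel(multichannel: np.ndarray) -> np.ndarray:
    """Average the channels of a (num_samples, num_channels) buffer."""
    return multichannel.mean(axis=1)

def resample(mono: np.ndarray, src_rate: int, dst_rate: int) -> np.ndarray:
    """Resample by linear interpolation. Adequate as a sketch; a production
    resampler would low-pass filter first to avoid aliasing."""
    duration = len(mono) / src_rate
    n_out = int(duration * dst_rate)
    src_t = np.arange(len(mono)) / src_rate
    dst_t = np.arange(n_out) / dst_rate
    return np.interp(dst_t, src_t, mono)

# Example: 0.5 s of 48 kHz two-channel audio down to 16 kHz mono.
stereo = np.random.default_rng(0).standard_normal((24000, 2))
mono_16k = resample(to_single_channel(stereo), 48000, 16000)
```

In practice the A/D conversion itself happens in the capture hardware; the software side, as sketched here, only downmixes and resamples the digitized stream to the preset rate.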
CN201910663257.8A 2019-07-22 2019-07-22 Voice enhancement method and device, storage medium and electronic equipment Active CN110534123B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910663257.8A CN110534123B (en) 2019-07-22 2019-07-22 Voice enhancement method and device, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910663257.8A CN110534123B (en) 2019-07-22 2019-07-22 Voice enhancement method and device, storage medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN110534123A true CN110534123A (en) 2019-12-03
CN110534123B CN110534123B (en) 2022-04-01

Family

ID=68660741

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910663257.8A Active CN110534123B (en) 2019-07-22 2019-07-22 Voice enhancement method and device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN110534123B (en)

Citations (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7392312B1 (en) * 1998-09-11 2008-06-24 Lv Partners, L.P. Method for utilizing visual cue in conjunction with web access
CN102124518A (en) * 2008-08-05 2011-07-13 Fraunhofer-Gesellschaft zur Foerderung der Angewandten Forschung e.V. Apparatus and method for processing an audio signal for speech enhancement using a feature extraction
CN103794221A (en) * 2012-10-26 2014-05-14 Sony Corporation Signal processing device and method, and program
US20160111107A1 (en) * 2014-10-21 2016-04-21 Mitsubishi Electric Research Laboratories, Inc. Method for Enhancing Noisy Speech using Features from an Automatic Speech Recognition System
CN106157953A (en) * 2015-04-16 2016-11-23 iFLYTEK Co., Ltd. Continuous speech recognition method and system
CN106898350A (en) * 2017-01-16 2017-06-27 South China University of Technology Deep-learning-based voice interaction and control method for an intelligent industrial robot
CN107845389A (en) * 2017-12-21 2018-03-27 Beijing University of Technology Speech enhancement method based on multi-resolution auditory cepstral coefficients and a deep convolutional neural network
CN108172238A (en) * 2018-01-06 2018-06-15 Guangzhou Yinshu Technology Co., Ltd. Speech enhancement algorithm based on multiple convolutional neural networks in a speech recognition system
CN108564963A (en) * 2018-04-23 2018-09-21 Baidu Online Network Technology (Beijing) Co., Ltd. Method and apparatus for enhancing voice
CN108877823A (en) * 2018-07-27 2018-11-23 Samsung Electronics (China) R&D Center Speech enhancement method and device
US20190043516A1 (en) * 2018-06-22 2019-02-07 Intel Corporation Neural network for speech denoising trained with deep feature losses
CN109326299A (en) * 2018-11-14 2019-02-12 Ping An Technology (Shenzhen) Co., Ltd. Speech enhancement method, device and storage medium based on a fully convolutional neural network
US20190066713A1 (en) * 2016-06-14 2019-02-28 The Trustees Of Columbia University In The City Of New York Systems and methods for speech separation and neural decoding of attentional selection in multi-speaker environments
CN109410974A (en) * 2018-10-23 2019-03-01 Baidu Online Network Technology (Beijing) Co., Ltd. Speech enhancement method, device, equipment and storage medium
CN109841226A (en) * 2018-08-31 2019-06-04 Elevoc Technology (Shenzhen) Co., Ltd. Single-channel real-time noise reduction method based on a convolutional recurrent neural network
CN110010144A (en) * 2019-04-24 2019-07-12 Xiamen Yealink Network Technology Co., Ltd. Speech signal enhancement method and device
CN110503940A (en) * 2019-07-12 2019-11-26 Institute of Automation, Chinese Academy of Sciences Speech enhancement method, device, storage medium and electronic equipment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Jen-Tzung Chien et al.: "Convolutional Neural Turing Machine for Speech Separation", ISCSLP 2018 *
Shi Wenhua et al.: "Single-channel speech enhancement using a deep fully convolutional encoder-decoder network", Journal of Signal Processing *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111312224A (en) * 2020-02-20 2020-06-19 北京声智科技有限公司 Training method and device of voice segmentation model and electronic equipment
CN111312224B (en) * 2020-02-20 2023-04-21 北京声智科技有限公司 Training method and device of voice segmentation model and electronic equipment
CN112509593A (en) * 2020-11-17 2021-03-16 北京清微智能科技有限公司 Voice enhancement network model, single-channel voice enhancement method and system
CN112509593B (en) * 2020-11-17 2024-03-08 北京清微智能科技有限公司 Speech enhancement network model, single-channel speech enhancement method and system
CN112309411A (en) * 2020-11-24 2021-02-02 深圳信息职业技术学院 Phase-sensitive gated multi-scale void convolutional network speech enhancement method and system
CN112309411B (en) * 2020-11-24 2024-06-11 深圳信息职业技术学院 Phase-sensitive gating multi-scale cavity convolution network voice enhancement method and system
CN113571074A (en) * 2021-08-09 2021-10-29 四川启睿克科技有限公司 Voice enhancement method and device based on multi-band structure time domain audio separation network
CN113571074B (en) * 2021-08-09 2023-07-25 四川启睿克科技有限公司 Voice enhancement method and device based on multi-band structure time domain audio frequency separation network
CN113870887A (en) * 2021-09-26 2021-12-31 平安科技(深圳)有限公司 Single-channel speech enhancement method and device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN110534123B (en) 2022-04-01

Similar Documents

Publication Publication Date Title
CN110534123A (en) Sound enhancement method, device, storage medium, electronic equipment
Macartney et al. Improved speech enhancement with the wave-u-net
Hsieh et al. Improving perceptual quality by phone-fortified perceptual loss using wasserstein distance for speech enhancement
Tan et al. Real-time speech enhancement using an efficient convolutional recurrent network for dual-microphone mobile phones in close-talk scenarios
Al-Ali et al. Enhanced forensic speaker verification using a combination of DWT and MFCC feature warping in the presence of noise and reverberation conditions
Mowlaee et al. Harmonic phase estimation in single-channel speech enhancement using phase decomposition and SNR information
CN110503940A (en) Sound enhancement method, device, storage medium, electronic equipment
Valentini-Botinhao et al. Speech enhancement of noisy and reverberant speech for text-to-speech
CN113823308B (en) Method for denoising voice by using single voice sample with noise
Siedenburg et al. Persistent time-frequency shrinkage for audio denoising
Su et al. Perceptually-motivated environment-specific speech enhancement
Wang et al. Joint noise and mask aware training for DNN-based speech enhancement with sub-band features
KR102198598B1 (en) Method for generating synthesized speech signal, neural vocoder, and training method thereof
US20240013775A1 (en) Patched multi-condition training for robust speech recognition
CN117542373A (en) Non-air conduction voice recovery system and method
CN116741144B (en) Voice tone conversion method and system
Saeki et al. SelfRemaster: Self-supervised speech restoration with analysis-by-synthesis approach using channel modeling
CN112233693B (en) Sound quality evaluation method, device and equipment
Zheng et al. Bandwidth extension WaveNet for bone-conducted speech enhancement
Nikitaras et al. Fine-grained noise control for multispeaker speech synthesis
Barinov et al. Channel compensation for forensic speaker identification using inverse processing
Schmidt et al. Deep neural network based guided speech bandwidth extension
US20240079022A1 (en) General speech enhancement method and apparatus using multi-source auxiliary information
Shahhoud et al. PESQ enhancement for decoded speech audio signals using complex convolutional recurrent neural network
CN114678036B (en) Speech enhancement method, electronic device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant