CN110534123A - Speech enhancement method and apparatus, storage medium, and electronic device - Google Patents
Speech enhancement method and apparatus, storage medium, and electronic device
- Publication number
- CN110534123A CN110534123A CN201910663257.8A CN201910663257A CN110534123A CN 110534123 A CN110534123 A CN 110534123A CN 201910663257 A CN201910663257 A CN 201910663257A CN 110534123 A CN110534123 A CN 110534123A
- Authority
- CN
- China
- Prior art keywords
- voice
- preset
- sound
- speech
- channel
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/04—Segmentation; Word boundary detection
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/20—Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
Abstract
Embodiments of the present invention relate to a speech enhancement method and apparatus, a storage medium, and an electronic device. The method includes: calling a voice capture device to collect speech in the current environment; processing the speech according to a preset speech processing algorithm to obtain single-channel speech; performing sentence-break segmentation on the single-channel speech to obtain a speech segment data stream containing a preset type of sound; inputting the speech segment data stream into a preset speech enhancement network model to obtain enhanced speech corresponding to the speech segment data stream; and synthesizing the enhanced speech into a speech segment. In this way, the method can be applied in more scenarios, avoids the influence of noise, takes speech characteristics into account, and avoids introducing distortion, thereby avoiding damage to the speech.
Description
Technical field
Embodiments of the present invention relate to the field of automatic computer information processing, and in particular to a speech enhancement method and apparatus, a storage medium, and an electronic device.
Background technique
Speech, the material shell of language, is the external form of language and the most direct symbol system for recording the activity of thought; it is one of the most natural and effective means for users to exchange information. While acquiring a speech signal, a user inevitably picks up interference from ambient noise, room reverberation, and other users, which seriously degrades speech quality and in turn the performance of speech recognition. Speech enhancement arose in response: as a pre-processing stage, it is an effective way to suppress interference and improve far-field speech recognition accuracy.
Speech enhancement refers to the technology of extracting the useful speech signal from the noise background, and of suppressing and reducing noise interference, when the speech signal has been corrupted or even drowned out by various kinds of noise. In short, it extracts the cleanest possible original speech from noisy speech.
In the related art, traditional speech enhancement methods mainly include spectral subtraction, Wiener filtering, and short-time spectral amplitude enhancement based on the minimum mean-square error. Although traditional speech enhancement methods have advantages such as high speed and no need for large-scale training corpora, they depend heavily on an estimate of the noise, are applicable in few scenarios, fail to take speech characteristics into account, and inevitably introduce distortion, causing damage to the speech.
Summary of the invention
In view of this, to solve the above technical problems or some of them, embodiments of the present invention provide a speech enhancement method and apparatus, a storage medium, and an electronic device.
In a first aspect, an embodiment of the present invention provides a speech enhancement method, the method comprising:
calling a voice capture device to collect speech in the current environment;
processing the speech according to a preset speech processing algorithm to obtain single-channel speech;
performing sentence-break segmentation on the single-channel speech to obtain a speech segment data stream containing a preset type of sound;
inputting the speech segment data stream into a preset speech enhancement network model to obtain enhanced speech corresponding to the speech segment data stream; and
synthesizing the enhanced speech into a speech segment.
In a possible embodiment, processing the speech according to the preset speech processing algorithm to obtain single-channel speech comprises:
converting the speech through A/D conversion and sampling it at a preset sample rate to obtain single-channel speech.
In a possible embodiment, performing sentence-break segmentation on the single-channel speech to obtain a speech segment data stream containing a preset type of sound comprises:
performing sentence-break segmentation on the speech within a preset threshold range in the single-channel speech;
for any frame of speech within the preset threshold range in the single-channel speech, detecting with a pre-established neural network model whether the frame contains the preset type of sound;
if the frame contains the preset type of sound, retaining the frame; and
combining all speech frames containing the preset type of sound to obtain the speech segment data stream containing the preset type of sound.
In a possible embodiment, the method further comprises:
if the frame does not contain the preset type of sound, filtering out the frame.
In a second aspect, an embodiment of the present invention provides a speech enhancement apparatus, the apparatus comprising:
a voice acquisition module, configured to call a voice capture device and collect speech in the current environment;
a speech processing module, configured to process the speech according to a preset speech processing algorithm to obtain single-channel speech;
a speech segmentation module, configured to perform sentence-break segmentation on the single-channel speech to obtain a speech segment data stream containing a preset type of sound;
a speech enhancement module, configured to input the speech segment data stream into a preset speech enhancement network model to obtain enhanced speech corresponding to the speech segment data stream; and
a speech synthesis module, configured to synthesize the enhanced speech into a speech segment.
In a possible embodiment, the speech processing module is specifically configured to:
convert the speech through A/D conversion and sample it at a preset sample rate to obtain single-channel speech.
In a possible embodiment, the speech segmentation module is specifically configured to:
perform sentence-break segmentation on the speech within a preset threshold range in the single-channel speech;
for any frame of speech within the preset threshold range in the single-channel speech, detect with a pre-established neural network model whether the frame contains the preset type of sound;
if the frame contains the preset type of sound, retain the frame; and
combine all speech frames containing the preset type of sound to obtain the speech segment data stream containing the preset type of sound.
In a possible embodiment, the apparatus further comprises:
a speech filtering module, configured to filter out a frame if it does not contain the preset type of sound.
In a third aspect, an embodiment of the present invention provides a storage medium storing one or more programs, which can be executed by one or more processors to implement the aforementioned speech enhancement method.
In a fourth aspect, an embodiment of the present invention provides an electronic device comprising a processor and a memory, the processor being configured to execute a speech enhancement program stored in the memory to implement the aforementioned speech enhancement method.
In the technical solution provided by the embodiments of the present invention, single-channel speech is obtained by processing the collected speech; sentence-break segmentation is performed on the single-channel speech to obtain a speech segment data stream containing a preset type of sound; and the speech segment data stream is input into a preset speech enhancement network model. This avoids the influence of noise, takes speech characteristics into account, and avoids introducing distortion, thereby avoiding damage to the speech. The enhanced speech obtained in this way is synthesized into a speech segment, enabling application in more scenarios.
Detailed description of the invention
To explain the technical solutions in the embodiments of this specification or in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some of the embodiments recorded in this specification; those of ordinary skill in the art can obtain other drawings from them.
Fig. 1 is a flow diagram of the speech enhancement method of an embodiment of the present invention;
Fig. 2 is a schematic structural diagram of the speech enhancement apparatus of an embodiment of the present invention;
Fig. 3 is a schematic structural diagram of the electronic device of an embodiment of the present invention.
Specific embodiment
To make the objects, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative work shall fall within the protection scope of the present invention.
To facilitate understanding of the embodiments of the present invention, they are further explained below with reference to the drawings and specific embodiments; the embodiments do not constitute a limitation on the present invention.
As shown in Fig. 1, which is a flow diagram of a speech enhancement method provided by an embodiment of the present invention, the method may specifically include the following steps:
S101: call a voice capture device and collect speech in the current environment.
In the embodiments of the present invention, the current environment may be a far-field, noisy acoustic environment; the embodiments of the present invention do not limit this.
In the current environment, a voice capture device such as a microphone is called to collect speech. The collected speech carries both the original speech of the target user and the noise in the current environment. The noise may be the speech of other users in the environment, music, impact sounds, and so on: relative to the original speech of the target user, all other sounds can be regarded as noise. The embodiments of the present invention do not limit this.
S102: process the speech according to a preset speech processing algorithm to obtain single-channel speech.
The speech collected in step S101 is processed according to a preset speech processing algorithm to obtain single-channel speech. An optional implementation of the preset speech processing algorithm is provided here:
the speech is converted through A/D conversion and sampled at a preset sample rate to obtain single-channel speech. Here A/D refers to the circuit that converts an analog signal into a digital signal, i.e., an analog-to-digital converter.
For example, a microphone is called to collect the speech in the current environment; the speech is converted through A/D conversion and sampled at a 16,000 Hz sample rate to obtain single-channel speech at a 16,000 Hz sample rate.
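The A/D and sampling step above can be sketched as follows. This is a minimal illustration, not the patent's implementation: real systems would low-pass filter before decimating, whereas this version simply averages the two channels and keeps every third sample (48 kHz to 16 kHz); the function names and rates are assumptions for illustration.

```python
def to_single_channel(left, right):
    """Average two channels into a mono (single-channel) signal."""
    return [(l + r) / 2.0 for l, r in zip(left, right)]

def decimate(samples, src_rate=48000, dst_rate=16000):
    """Naive resampling: keep every (src_rate // dst_rate)-th sample."""
    step = src_rate // dst_rate
    return samples[::step]

left = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
right = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
mono = to_single_channel(left, right)
mono_16k = decimate(mono)   # every 3rd sample of the mono signal
```

In practice the preset sample rate would be applied by the capture hardware or a proper resampler; the decimation here only shows where the rate change sits in the pipeline.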
S103: perform sentence-break segmentation on the single-channel speech to obtain a speech segment data stream containing a preset type of sound.
A neural network model is trained in advance to detect whether each frame of speech contains the preset type of sound; here the preset type of sound refers to the original speech of the target user.
Sentence-break segmentation is performed on the speech within a preset threshold range in the single-channel speech. For any frame of speech within the preset threshold range, the pre-established neural network model detects whether the frame contains the preset type of sound.
If the frame contains the preset type of sound, the frame is retained; if it does not, the frame is filtered out. In this way the pre-established neural network model filters out speech frames other than those containing the original speech of the target user, leaving the speech frames that contain the preset type of sound.
All speech frames containing the preset type of sound are combined to obtain the speech segment data stream containing the preset type of sound.
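The keep/filter/combine logic of this step can be sketched as below. The actual detector is a trained neural network; here a hypothetical energy threshold stands in for it purely so the control flow is runnable, and `contains_preset_sound` is an assumed name, not from the patent.

```python
def contains_preset_sound(frame, threshold=0.01):
    """Stand-in for the pre-trained detector network: a crude
    mean-energy test over the frame's samples."""
    energy = sum(s * s for s in frame) / len(frame)
    return energy > threshold

def build_segment_stream(frames):
    """Keep frames the detector flags, filter the rest, and
    combine the kept frames into one segment data stream."""
    kept = [f for f in frames if contains_preset_sound(f)]
    stream = []
    for f in kept:
        stream.extend(f)
    return stream

frames = [[0.5, -0.5], [0.0, 0.0], [0.3, 0.3]]
stream = build_segment_stream(frames)   # the silent frame is filtered out
```

The shape of the result matches the description: only frames judged to contain the target user's speech survive into the stream fed to the enhancement model.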
S104: input the speech segment data stream into the preset speech enhancement network model to obtain enhanced speech corresponding to the speech segment data stream.
The speech enhancement network model enhances the speech segment data stream end to end to obtain enhanced speech: the model's input is the speech segment data stream containing the preset type of sound, and its output is the enhanced speech.
In the embodiments of the present invention, the speech enhancement network model is a multi-scale time-domain speech enhancement model based on a gated convolutional network, and may specifically include an encoder module, an enhancement module, and a decoder module.
Encoder module: encodes the noisy waveform into an intermediate feature space. The input segment is converted into a high-dimensional feature representation by a one-dimensional convolutional neural network.
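A toy sketch of what the encoder does: a one-dimensional convolution slides over the waveform, and stacking several kernels yields one feature channel per kernel. The kernel values below are arbitrary illustrative numbers, not trained parameters, and the real model would use strided learned convolutions.

```python
def conv1d(signal, kernel, stride=1):
    """Valid 1-D convolution (cross-correlation form)."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(0, len(signal) - k + 1, stride)]

def encode(waveform, kernels):
    """One output channel per kernel: a crude higher-dimensional
    feature representation of the raw waveform."""
    return [conv1d(waveform, k) for k in kernels]

wave = [1.0, 2.0, 3.0, 4.0]
features = encode(wave, [[1.0, -1.0], [0.5, 0.5]])
# features[0] acts as a difference filter, features[1] as a smoother
```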
Enhancement module: operates on the encoded high-dimensional feature representation and comprises three operations: multi-scale feature extraction, convolution blocks, and multi-scale feature fusion.
Multi-scale feature extraction: gated convolution operations of different sizes are used in parallel to extract and merge features. Specifically, features are extracted with one-dimensional gated convolution operations; feature extraction at different scales is realized by gated convolutional networks with different kernel sizes. The outputs of these networks with different kernels are then spliced together and normalized by layer normalization before being output.
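A hedged sketch of one such multi-scale gated extraction step: each scale applies a gated 1-D convolution (a content branch modulated by a sigmoid gate branch, i.e., a gated linear unit), outputs from two kernel sizes are spliced, then layer-normalized. All weights are illustrative stand-ins for learned parameters, and the splice shown trims lengths crudely rather than padding as a real model would.

```python
import math

def conv1d(x, kernel):
    k = len(kernel)
    return [sum(x[i + j] * kernel[j] for j in range(k))
            for i in range(len(x) - k + 1)]

def gated_conv(x, w_content, w_gate):
    """Gated linear unit: content branch * sigmoid(gate branch)."""
    c = conv1d(x, w_content)
    g = conv1d(x, w_gate)
    return [ci * (1.0 / (1.0 + math.exp(-gi))) for ci, gi in zip(c, g)]

def layer_norm(v, eps=1e-5):
    """Normalize a feature vector to zero mean, unit variance."""
    mean = sum(v) / len(v)
    var = sum((a - mean) ** 2 for a in v) / len(v)
    return [(a - mean) / math.sqrt(var + eps) for a in v]

x = [0.1, 0.4, -0.2, 0.3, 0.0]
small = gated_conv(x, [1.0, 1.0], [0.5, 0.5])             # kernel size 2
large = gated_conv(x, [1.0, 0.0, 1.0], [0.2, 0.2, 0.2])   # kernel size 3
fused = layer_norm(small[:len(large)] + large)            # splice, then normalize
```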
Convolution blocks: the module is composed of several convolution blocks. Within each block a fully convolutional network is used, the convolution operation is repeated R times, and the dilation factor of the convolutional network is continually increased to extend the receptive field. By extending the receptive field, the network can capture long-term information.
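The receptive-field growth from stacked dilated convolutions can be made concrete: with kernel size k, each layer with dilation d adds (k - 1) * d samples of context, so the receptive field is 1 plus that sum over all layers. The kernel size 3 and doubling dilation schedule below are common assumptions for illustration, not values fixed by the description above.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field of a stack of dilated 1-D convolutions."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

# R = 6 repeated convolutions with dilation doubling each time:
dilations = [2 ** i for i in range(6)]   # 1, 2, 4, 8, 16, 32
rf = receptive_field(3, dilations)       # 1 + 2 * 63 = 127 samples
```

This is why increasing the dilation factor cheaply extends how far back in time the block can see: the receptive field grows exponentially with depth while the parameter count grows only linearly.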
Multi-scale feature fusion: convolutional neural networks at different levels output different types of features, such as low-level texture (shallow layers) and semantic clues (deep layers), and these features contribute differently to the final task. Specifically, in the embodiments of the present invention, rather than directly taking the output of the last layer as the final output, the output of each convolution block is extracted and the outputs are fused into the model's final output. The output features of each convolution block represent details at different levels. A connection is established for each block, and the information of the different blocks is transmitted during training; this process is called feature transmission. Since the benefit of information from other layers is unknown, a gating mechanism is used to screen the useful information and control the information flow. Specifically, high-level features are transferred to the shallow layers step by step.
Decoder module: the inverse of the encoder module. It decodes the feature representation back into speech samples; specifically, a one-dimensional transposed convolution realizes the decoding process.
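A minimal sketch of the transposed-convolution decoding step: each feature value scatters a scaled copy of the kernel onto the output at a stride-spaced position, inverting the encoder's striding and recovering waveform-length samples. The stride and kernel values are illustrative assumptions.

```python
def transposed_conv1d(features, kernel, stride=2):
    """Each input value adds kernel * value at position i * stride;
    overlapping contributions accumulate."""
    out_len = (len(features) - 1) * stride + len(kernel)
    out = [0.0] * out_len
    for i, v in enumerate(features):
        for j, w in enumerate(kernel):
            out[i * stride + j] += v * w
    return out

samples = transposed_conv1d([1.0, 2.0], [0.5, 0.5], stride=2)
# -> [0.5, 0.5, 1.0, 1.0]: two features expand to four samples
```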
S105: synthesize the enhanced speech into a speech segment.
The speech segment data stream is processed by the above speech enhancement network model to obtain enhanced speech, and the enhanced speech is synthesized into a speech segment.
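The synthesis step can be sketched as simple in-order concatenation of the enhanced segments. This is a hedged illustration of the data flow only: the patent does not specify the joining method, and a practical system might apply overlap-add or crossfading at segment boundaries instead.

```python
def synthesize(enhanced_segments):
    """Join enhanced segments back into one contiguous speech segment."""
    out = []
    for seg in enhanced_segments:
        out.extend(seg)
    return out

speech = synthesize([[0.1, 0.2], [0.3], [0.4, 0.5]])
# -> [0.1, 0.2, 0.3, 0.4, 0.5]
```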
The speech enhancement method in the embodiments of the present invention constructs an efficient multi-scale time-domain speech enhancement model based on a gated convolutional network and uses it to capture the timing information of the speech signal. A gating mechanism is integrated into the model so that it can learn feature representations at different levels. Instead of selecting the output of the last layer as the final output, feature maps of different depths are fused as the final output: connections are established between layers at different depths, so that information learned by the deep layers can be passed to the shallow layers. Another gating mechanism is used to screen the useful information.
To verify the effectiveness of the speech enhancement method in the embodiments of the present invention, a multi-scale time-domain speech enhancement model based on a gated convolutional network was first constructed, with the output of the last convolution block as the final output, 3 convolution blocks, and a dilation factor of 6 as the experimental configuration. On this basis, multi-scale feature fusion and feature transmission were added step by step.
The experimental results show that the multi-scale time-domain speech enhancement model based on a gated convolutional network can effectively enhance speech, and that gradually adding feature fusion and feature transmission further improves the model's performance. Compared with the convolution-based model, the final model in the embodiments of the present invention obtains performance gains of 0.12 and 0.01 on PESQ (Perceptual Evaluation of Speech Quality) and STOI (Short-Time Objective Intelligibility), respectively. In addition, compared with the noisy speech, the model improves performance on PESQ and STOI by 0.43 and 0.123, respectively.
The experimental configuration with 4 convolution blocks and a dilation factor of 8 performs best: compared with the noisy speech, the best model in the embodiments of the present invention achieves performance improvements of 0.54 and 0.125 on PESQ and STOI, respectively. The enhancement model in the embodiments of the present invention not only effectively enhances noisy speech, but also outperforms other baseline systems: the multi-scale time-domain speech enhancement model based on a gated convolutional network is better than both the frequency-domain system and the recurrent-neural-network system. By extending the receptive field, the model can capture long-term dependencies, and it achieves a particularly significant performance improvement on STOI. This shows that through end-to-end training, the multi-scale time-domain speech enhancement model based on a gated convolutional network can enhance and estimate speech more accurately.
As described above for the technical solution provided by the embodiments of the present invention: single-channel speech is obtained by processing the collected speech; sentence-break segmentation is performed on the single-channel speech to obtain a speech segment data stream containing a preset type of sound; and the speech segment data stream is input into a preset speech enhancement network model. This avoids the influence of noise, takes speech characteristics into account, and avoids introducing distortion, thereby avoiding damage to the speech. The enhanced speech obtained in this way is synthesized into a speech segment, enabling application in more scenarios.
Corresponding to the method embodiment, an embodiment of the present invention also provides a speech enhancement apparatus. As shown in Fig. 2, the apparatus may include: a voice acquisition module 210, a speech processing module 220, a speech segmentation module 230, a speech enhancement module 240, and a speech synthesis module 250.
The voice acquisition module 210 is configured to call a voice capture device and collect speech in the current environment;
the speech processing module 220 is configured to process the speech according to a preset speech processing algorithm to obtain single-channel speech;
the speech segmentation module 230 is configured to perform sentence-break segmentation on the single-channel speech to obtain a speech segment data stream containing a preset type of sound;
the speech enhancement module 240 is configured to input the speech segment data stream into a preset speech enhancement network model to obtain enhanced speech corresponding to the speech segment data stream; and
the speech synthesis module 250 is configured to synthesize the enhanced speech into a speech segment.
According to a specific embodiment provided by the present invention, the speech processing module 220 is specifically configured to: convert the speech through A/D conversion and sample it at a preset sample rate to obtain single-channel speech.
According to a specific embodiment provided by the present invention, the speech segmentation module 230 is specifically configured to: perform sentence-break segmentation on the speech within a preset threshold range in the single-channel speech; for any frame of speech within the preset threshold range, detect with a pre-established neural network model whether the frame contains the preset type of sound; if the frame contains the preset type of sound, retain the frame; and combine all speech frames containing the preset type of sound to obtain the speech segment data stream containing the preset type of sound.
According to a specific embodiment provided by the present invention, the apparatus further comprises:
a speech filtering module 260, configured to filter out a frame if it does not contain the preset type of sound.
Fig. 3 is a schematic structural diagram of an electronic device provided by an embodiment of the present invention. The electronic device 300 shown in Fig. 3 includes: at least one processor 301, a memory 302, at least one network interface 304, and other user interfaces 303. The various components in the electronic device 300 are coupled together by a bus system 305. It can be understood that the bus system 305 is used to realize connection and communication among these components; in addition to a data bus, it also includes a power bus, a control bus, and a status signal bus. For clarity of explanation, however, all the buses are labeled as the bus system 305 in Fig. 3.
The user interface 303 may include a display, a keyboard, or a pointing device (for example, a mouse, a trackball, a touch pad, or a touch screen).
It can be understood that the memory 302 in the embodiments of the present invention may be a volatile memory or a non-volatile memory, or may include both. The non-volatile memory may be a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), or a flash memory. The volatile memory may be a random access memory (RAM), used as an external cache. By way of example but not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct Rambus RAM (DRRAM). The memory 302 described herein is intended to include, but is not limited to, these and any other suitable types of memory.
In some embodiments, the memory 302 stores the following elements, executable units or data structures, or subsets or supersets of them: an operating system 3021 and application programs 3022.
The operating system 3021 includes various system programs, such as a framework layer, a core library layer, and a driver layer, for realizing various basic services and processing hardware-based tasks. The application programs 3022 include various application programs, such as a media player and a browser, for realizing various application services. A program implementing the method of an embodiment of the present invention may be included in the application programs 3022.
In an embodiment of the present invention, by calling a program or instructions stored in the memory 302 (specifically, a program or instructions stored in the application programs 3022), the processor 301 is configured to execute the method steps provided by each method embodiment, for example including:
calling a voice capture device to collect speech in the current environment; processing the speech according to a preset speech processing algorithm to obtain single-channel speech; performing sentence-break segmentation on the single-channel speech to obtain a speech segment data stream containing a preset type of sound; inputting the speech segment data stream into a preset speech enhancement network model to obtain enhanced speech corresponding to the speech segment data stream; and synthesizing the enhanced speech into a speech segment.
The methods disclosed in the above embodiments of the present invention can be applied to, or realized by, the processor 301. The processor 301 may be an integrated circuit chip with signal processing capability. During implementation, each step of the above method can be completed by an integrated logic circuit of hardware in the processor 301 or by instructions in the form of software. The processor 301 may be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components, and can implement or execute the methods, steps, and logic block diagrams disclosed in the embodiments of the present invention. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The steps of the method disclosed in the embodiments of the present invention can be directly embodied as being executed by a hardware decoding processor, or executed by a combination of hardware and software units in a decoding processor. The software unit may be located in a storage medium mature in the field, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory or electrically erasable programmable memory, or a register. The storage medium is located in the memory 302; the processor 301 reads the information in the memory 302 and completes the steps of the above method in combination with its hardware.
It can be understood that the embodiments described herein may be implemented in hardware, software, firmware, middleware, microcode, or a combination thereof. For a hardware implementation, the processing unit may be implemented in one or more application-specific integrated circuits (Application Specific Integrated Circuits, ASIC), digital signal processors (Digital Signal Processing, DSP), digital signal processing devices (DSP Device, DSPD), programmable logic devices (Programmable Logic Device, PLD), field-programmable gate arrays (Field-Programmable Gate Array, FPGA), general-purpose processors, controllers, microcontrollers, microprocessors, other electronic units for performing the functions described herein, or a combination thereof.
For a software implementation, the techniques described herein may be implemented by units that perform the functions described herein. The software code may be stored in a memory and executed by a processor. The memory may be implemented within the processor or external to the processor.
The electronic device provided in this embodiment may be the electronic device shown in Fig. 3, and may perform all the steps of the speech enhancement method shown in Fig. 1, thereby achieving the technical effect of the speech enhancement method shown in Fig. 1. For details, please refer to the description related to Fig. 1; for brevity of description, it is not repeated here.
An embodiment of the present invention also provides a storage medium (a computer-readable storage medium). The storage medium stores one or more programs. The storage medium may include a volatile memory, such as a random access memory; it may also include a non-volatile memory, such as a read-only memory, a flash memory, a hard disk, or a solid-state disk; it may also include a combination of the above kinds of memory.
When the one or more programs in the storage medium are executed by one or more processors, the speech enhancement method executed on the speech enhancement device side described above is implemented.
The processor is configured to execute the speech enhancement program stored in the memory, so as to implement the following steps of the speech enhancement method executed on the speech enhancement device side:
calling a voice capture device to acquire speech in the current environment; processing the speech according to a preset speech processing algorithm to obtain single-channel speech; performing sentence segmentation on the single-channel speech to obtain a speech segment data stream containing a preset type of sound; inputting the speech segment data stream into a preset speech enhancement network model to obtain enhanced speech corresponding to the speech segment data stream; and synthesizing the enhanced speech into speech segments.
Those skilled in the art should further appreciate that the units and algorithm steps described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above generally in terms of function. Whether these functions are implemented in hardware or software depends on the specific application and the design constraints of the technical solution. Skilled artisans may use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of the present invention.
The steps of the method or algorithm described in connection with the embodiments disclosed herein can be implemented with hardware, a software module executed by a processor, or a combination of the two. A software module can reside in a random access memory (RAM), a memory, a read-only memory (ROM), an electrically programmable ROM, an electrically erasable programmable ROM, a register, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium well known in the technical field.
The specific embodiments described above further describe in detail the purpose, technical solutions, and beneficial effects of the present invention. It should be understood that the foregoing is merely a specific embodiment of the present invention and is not intended to limit the protection scope of the present invention. Any modification, equivalent substitution, or improvement made within the spirit and principles of the present invention shall be included within the protection scope of the present invention.
Claims (10)
1. A speech enhancement method, characterized in that the method comprises:
calling a voice capture device to acquire speech in the current environment;
processing the speech according to a preset speech processing algorithm to obtain single-channel speech;
performing sentence segmentation on the single-channel speech to obtain a speech segment data stream containing a preset type of sound;
inputting the speech segment data stream into a preset speech enhancement network model to obtain enhanced speech corresponding to the speech segment data stream;
synthesizing the enhanced speech into speech segments.
2. The method according to claim 1, characterized in that the processing the speech according to a preset speech processing algorithm to obtain single-channel speech comprises:
converting the speech through A/D conversion and sampling it at a preset sample rate to obtain single-channel speech.
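Claim 2 leaves the sampling step abstract. As a hedged sketch, assuming the A/D conversion has already produced a digital signal at some source rate, resampling it to the preset rate could look like the following (linear interpolation is an illustrative choice, not mandated by the claim):

```python
# Minimal sketch of "sample at a preset sample rate" (claim 2), assuming
# an already-digitized signal; linear interpolation is an illustrative
# resampling choice, not the claimed method.
import numpy as np

def resample(signal: np.ndarray, src_rate: int, preset_rate: int) -> np.ndarray:
    """Resample `signal` from src_rate to preset_rate by linear interpolation."""
    duration = len(signal) / src_rate
    n_out = int(round(duration * preset_rate))
    t_src = np.arange(len(signal)) / src_rate   # source sample instants
    t_out = np.arange(n_out) / preset_rate      # preset-rate sample instants
    return np.interp(t_out, t_src, signal)

# One second of a 440 Hz tone at 44.1 kHz, resampled to a preset 16 kHz.
x = np.sin(2 * np.pi * 440 * np.arange(44100) / 44100)
y = resample(x, 44100, 16000)
```

One second of input yields exactly 16 000 output samples at the preset rate, and the first sample is preserved since both time grids start at zero.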
3. The method according to claim 1, characterized in that the performing sentence segmentation on the single-channel speech to obtain a speech segment data stream containing a preset type of sound comprises:
performing sentence segmentation on the speech within a preset threshold range in the single-channel speech;
for any frame of speech within the preset threshold range in the single-channel speech, detecting whether it contains a preset type of sound by using a pre-established neural network model;
if the frame of speech contains the preset type of sound, retaining the frame of speech;
combining all speech frames containing the preset type of sound to obtain the speech segment data stream containing the preset type of sound.
4. The method according to claim 3, characterized in that the method further comprises:
if the frame of speech does not contain the preset type of sound, filtering out the frame of speech.
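The per-frame retain/filter decision of claims 3 and 4 can be illustrated with a toy scorer. The tiny random-weight feed-forward network below merely stands in for the pre-established neural network model; its layer sizes, weights, and 0.5 decision threshold are assumptions for illustration only.

```python
# Toy stand-in for the pre-established neural network model of claims
# 3-4: a two-layer feed-forward scorer with random (untrained) weights.
# Sizes, weights, and the 0.5 threshold are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(8, 400)) * 0.05   # input: one 400-sample frame
b1 = np.zeros(8)
W2 = rng.normal(size=8) * 0.05
b2 = 0.0

def frame_score(frame: np.ndarray) -> float:
    """Sigmoid score in (0, 1): likelihood the frame holds the preset sound."""
    h = np.tanh(W1 @ frame + b1)
    return 1.0 / (1.0 + np.exp(-(W2 @ h + b2)))

def keep_or_filter(frames, threshold=0.5):
    """Claim 3 retains frames at or above threshold; claim 4 filters the rest."""
    return [f for f in frames if frame_score(f) >= threshold]
```

With zero input, every layer output is zero, so the sigmoid returns exactly 0.5 and the frame sits right at the decision boundary; a trained model would be expected to push real frames well away from it.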
5. A speech enhancement apparatus, characterized in that the apparatus comprises:
a voice acquisition module, configured to call a voice capture device to acquire speech in the current environment;
a speech processing module, configured to process the speech according to a preset speech processing algorithm to obtain single-channel speech;
a speech segmentation module, configured to perform sentence segmentation on the single-channel speech to obtain a speech segment data stream containing a preset type of sound;
a speech enhancement module, configured to input the speech segment data stream into a preset speech enhancement network model to obtain enhanced speech corresponding to the speech segment data stream;
a speech synthesis module, configured to synthesize the enhanced speech into speech segments.
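The module structure of the apparatus claim above maps naturally onto a class whose callables play the roles of the five modules. The names and the trivial stand-in functions below are illustrative assumptions, not the claimed apparatus.

```python
# Illustrative mapping of the five claimed modules onto a class; the
# wiring and the trivial stand-in callables are assumptions.
class SpeechEnhancementDevice:
    def __init__(self, capture, process, segment, enhance, synthesize):
        self.acquire = capture        # voice acquisition module
        self.process = process        # speech processing module
        self.segment = segment        # speech segmentation module
        self.enhance = enhance        # speech enhancement module
        self.synthesize = synthesize  # speech synthesis module

    def run(self):
        voice = self.acquire()
        mono = self.process(voice)
        stream = self.segment(mono)
        enhanced = self.enhance(stream)
        return self.synthesize(enhanced)

# Wiring with trivial stand-ins to show the data flow between modules:
device = SpeechEnhancementDevice(
    capture=lambda: [0.1, 0.2, 0.3, 0.0],
    process=lambda v: v,                        # already single-channel
    segment=lambda v: [x for x in v if x > 0],  # keep "voiced" samples
    enhance=lambda s: [2 * x for x in s],
    synthesize=lambda s: s,
)
result = device.run()   # -> [0.2, 0.4, 0.6]
```

The zero sample is filtered out by the segmentation stand-in, and the remaining samples pass through the enhancement stand-in unchanged except for the gain.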
6. The apparatus according to claim 5, characterized in that the speech processing module is specifically configured to:
convert the speech through A/D conversion and sample it at a preset sample rate to obtain single-channel speech.
7. The apparatus according to claim 5, characterized in that the speech segmentation module is specifically configured to:
perform sentence segmentation on the speech within a preset threshold range in the single-channel speech;
for any frame of speech within the preset threshold range in the single-channel speech, detect whether it contains a preset type of sound by using a pre-established neural network model;
if the frame of speech contains the preset type of sound, retain the frame of speech;
combine all speech frames containing the preset type of sound to obtain the speech segment data stream containing the preset type of sound.
8. The apparatus according to claim 7, characterized in that the apparatus further comprises:
a speech filtering module, configured to filter out the frame of speech if the frame of speech does not contain the preset type of sound.
9. An electronic device, characterized by comprising a processor and a memory, the processor being configured to execute a speech enhancement program stored in the memory, so as to implement the speech enhancement method according to any one of claims 1 to 4.
10. A storage medium, characterized in that the storage medium stores one or more programs, and the one or more programs are executable by one or more processors, so as to implement the speech enhancement method according to any one of claims 1 to 4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910663257.8A CN110534123B (en) | 2019-07-22 | 2019-07-22 | Voice enhancement method and device, storage medium and electronic equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110534123A true CN110534123A (en) | 2019-12-03 |
CN110534123B CN110534123B (en) | 2022-04-01 |
Family
ID=68660741
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910663257.8A Active CN110534123B (en) | 2019-07-22 | 2019-07-22 | Voice enhancement method and device, storage medium and electronic equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110534123B (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111312224A (en) * | 2020-02-20 | 2020-06-19 | 北京声智科技有限公司 | Training method and device of voice segmentation model and electronic equipment |
CN112309411A (en) * | 2020-11-24 | 2021-02-02 | 深圳信息职业技术学院 | Phase-sensitive gated multi-scale void convolutional network speech enhancement method and system |
CN112509593A (en) * | 2020-11-17 | 2021-03-16 | 北京清微智能科技有限公司 | Voice enhancement network model, single-channel voice enhancement method and system |
CN113571074A (en) * | 2021-08-09 | 2021-10-29 | 四川启睿克科技有限公司 | Voice enhancement method and device based on multi-band structure time domain audio separation network |
CN113870887A (en) * | 2021-09-26 | 2021-12-31 | 平安科技(深圳)有限公司 | Single-channel speech enhancement method and device, computer equipment and storage medium |
Citations (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7392312B1 (en) * | 1998-09-11 | 2008-06-24 | Lv Partners, L.P. | Method for utilizing visual cue in conjunction with web access |
CN102124518A (en) * | 2008-08-05 | 2011-07-13 | 弗朗霍夫应用科学研究促进协会 | Apparatus and method for processing an audio signal for speech enhancement using a feature extraction |
CN103794221A (en) * | 2012-10-26 | 2014-05-14 | 索尼公司 | Signal processing device and method, and program |
US20160111107A1 (en) * | 2014-10-21 | 2016-04-21 | Mitsubishi Electric Research Laboratories, Inc. | Method for Enhancing Noisy Speech using Features from an Automatic Speech Recognition System |
CN106157953A (en) * | 2015-04-16 | 2016-11-23 | 科大讯飞股份有限公司 | continuous speech recognition method and system |
CN106898350A (en) * | 2017-01-16 | 2017-06-27 | 华南理工大学 | A kind of interaction of intelligent industrial robot voice and control method based on deep learning |
CN107845389A (en) * | 2017-12-21 | 2018-03-27 | 北京工业大学 | A kind of sound enhancement method based on multiresolution sense of hearing cepstrum coefficient and depth convolutional neural networks |
CN108172238A (en) * | 2018-01-06 | 2018-06-15 | 广州音书科技有限公司 | A kind of voice enhancement algorithm based on multiple convolutional neural networks in speech recognition system |
CN108564963A (en) * | 2018-04-23 | 2018-09-21 | 百度在线网络技术(北京)有限公司 | Method and apparatus for enhancing voice |
CN108877823A (en) * | 2018-07-27 | 2018-11-23 | 三星电子(中国)研发中心 | Sound enhancement method and device |
US20190043516A1 (en) * | 2018-06-22 | 2019-02-07 | Intel Corporation | Neural network for speech denoising trained with deep feature losses |
CN109326299A (en) * | 2018-11-14 | 2019-02-12 | 平安科技(深圳)有限公司 | Sound enhancement method, device and storage medium based on full convolutional neural networks |
US20190066713A1 (en) * | 2016-06-14 | 2019-02-28 | The Trustees Of Columbia University In The City Of New York | Systems and methods for speech separation and neural decoding of attentional selection in multi-speaker environments |
CN109410974A (en) * | 2018-10-23 | 2019-03-01 | 百度在线网络技术(北京)有限公司 | Sound enhancement method, device, equipment and storage medium |
CN109841226A (en) * | 2018-08-31 | 2019-06-04 | 大象声科(深圳)科技有限公司 | A kind of single channel real-time noise-reducing method based on convolution recurrent neural network |
CN110010144A (en) * | 2019-04-24 | 2019-07-12 | 厦门亿联网络技术股份有限公司 | Voice signals enhancement method and device |
CN110503940A (en) * | 2019-07-12 | 2019-11-26 | 中国科学院自动化研究所 | Sound enhancement method, device, storage medium, electronic equipment |
Non-Patent Citations (2)
Title |
---|
JEN-TZUNG CHIEN ET AL: "Convolutional Neural Turing Machine for Speech Separation", ISCSLP 2018 * |
SHI WENHUA ET AL: "Single-channel speech enhancement using a deep fully convolutional encoder-decoder network", Journal of Signal Processing (《信号处理》) * |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111312224A (en) * | 2020-02-20 | 2020-06-19 | 北京声智科技有限公司 | Training method and device of voice segmentation model and electronic equipment |
CN111312224B (en) * | 2020-02-20 | 2023-04-21 | 北京声智科技有限公司 | Training method and device of voice segmentation model and electronic equipment |
CN112509593A (en) * | 2020-11-17 | 2021-03-16 | 北京清微智能科技有限公司 | Voice enhancement network model, single-channel voice enhancement method and system |
CN112509593B (en) * | 2020-11-17 | 2024-03-08 | 北京清微智能科技有限公司 | Speech enhancement network model, single-channel speech enhancement method and system |
CN112309411A (en) * | 2020-11-24 | 2021-02-02 | 深圳信息职业技术学院 | Phase-sensitive gated multi-scale void convolutional network speech enhancement method and system |
CN112309411B (en) * | 2020-11-24 | 2024-06-11 | 深圳信息职业技术学院 | Phase-sensitive gating multi-scale cavity convolution network voice enhancement method and system |
CN113571074A (en) * | 2021-08-09 | 2021-10-29 | 四川启睿克科技有限公司 | Voice enhancement method and device based on multi-band structure time domain audio separation network |
CN113571074B (en) * | 2021-08-09 | 2023-07-25 | 四川启睿克科技有限公司 | Voice enhancement method and device based on multi-band structure time domain audio frequency separation network |
CN113870887A (en) * | 2021-09-26 | 2021-12-31 | 平安科技(深圳)有限公司 | Single-channel speech enhancement method and device, computer equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN110534123B (en) | 2022-04-01 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110534123A (en) | Sound enhancement method, device, storage medium, electronic equipment | |
Macartney et al. | Improved speech enhancement with the wave-u-net | |
Hsieh et al. | Improving perceptual quality by phone-fortified perceptual loss using wasserstein distance for speech enhancement | |
Tan et al. | Real-time speech enhancement using an efficient convolutional recurrent network for dual-microphone mobile phones in close-talk scenarios | |
Al-Ali et al. | Enhanced forensic speaker verification using a combination of DWT and MFCC feature warping in the presence of noise and reverberation conditions | |
Mowlaee et al. | Harmonic phase estimation in single-channel speech enhancement using phase decomposition and SNR information | |
CN110503940A (en) | Sound enhancement method, device, storage medium, electronic equipment | |
Valentini-Botinhao et al. | Speech enhancement of noisy and reverberant speech for text-to-speech | |
CN113823308B (en) | Method for denoising voice by using single voice sample with noise | |
Siedenburg et al. | Persistent time-frequency shrinkage for audio denoising | |
Su et al. | Perceptually-motivated environment-specific speech enhancement | |
Wang et al. | Joint noise and mask aware training for DNN-based speech enhancement with sub-band features | |
KR102198598B1 (en) | Method for generating synthesized speech signal, neural vocoder, and training method thereof | |
US20240013775A1 (en) | Patched multi-condition training for robust speech recognition | |
CN117542373A (en) | Non-air conduction voice recovery system and method | |
CN116741144B (en) | Voice tone conversion method and system | |
Saeki et al. | SelfRemaster: Self-supervised speech restoration with analysis-by-synthesis approach using channel modeling | |
CN112233693B (en) | Sound quality evaluation method, device and equipment | |
Zheng et al. | Bandwidth extension WaveNet for bone-conducted speech enhancement | |
Nikitaras et al. | Fine-grained noise control for multispeaker speech synthesis | |
Barinov et al. | Channel compensation for forensic speaker identification using inverse processing | |
Schmidt et al. | Deep neural network based guided speech bandwidth extension | |
US20240079022A1 (en) | General speech enhancement method and apparatus using multi-source auxiliary information | |
Shahhoud et al. | PESQ enhancement for decoded speech audio signals using complex convolutional recurrent neural network | |
CN114678036B (en) | Speech enhancement method, electronic device and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||