CN110503968A - Audio processing method, apparatus, device, and readable storage medium - Google Patents
Audio processing method, apparatus, device, and readable storage medium
- Publication number: CN110503968A
- Application number: CN201810481272.6A
- Authority
- CN
- China
- Prior art keywords
- voice signal
- signal
- speech
- residual
- voice
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
- G10L21/0224—Processing in the time domain
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
- G10L21/0232—Processing in the frequency domain
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
Abstract
The embodiments of the present invention provide an audio processing method, apparatus, device, and readable storage medium. The method comprises: obtaining an input mixed speech signal; performing speech enhancement on the mixed speech signal according to a pre-trained residual network model to obtain a target speech signal; and producing output according to the target speech signal. The embodiments of the present invention solve the vanishing-gradient problem of existing speech enhancement methods based on conventional neural networks, thereby improving the speech enhancement effect.
Description
Technical field
The present invention relates to the field of communication technology, and in particular to an audio processing method, an audio processing apparatus, a device, and a readable storage medium.
Background art
With the rapid development of communication technology, terminals such as mobile phones and tablet computers have become increasingly widespread, bringing great convenience to people's daily life, study, and work.
These terminals can collect speech signals through a microphone and process the collected signals with speech enhancement technology to reduce the influence of noise interference. Speech enhancement refers to the technology of extracting the useful speech signal from the noise background after the speech signal has been disturbed, or even drowned out, by various kinds of noise, thereby suppressing and reducing the noise interference.
At present, terminals usually perform speech enhancement with methods based on conventional neural networks such as deep neural networks (Deep Neural Network, DNN), convolutional neural networks (Convolutional Neural Network, CNN), and long short-term memory networks (Long Short-Term Memory, LSTM). However, speech enhancement methods based on conventional neural networks suffer from the vanishing-gradient problem. For example, for a fully connected DNN, as the network depth increases, that is, as the number of network layers grows, the vanishing-gradient problem becomes increasingly severe; when the number of layers reaches five, for instance, the problem is already quite serious. If the number of layers of the DNN is increased further, the speech enhancement performance of the resulting model not only fails to improve but may even decline, degrading the speech enhancement effect.
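The depth-dependent decay described above can be illustrated with a toy calculation. This is a simplified numerical sketch, not the patent's model: each "layer" is reduced to multiplication by a scalar factor w with |w| < 1 (as happens when saturating activations shrink gradients), so the chain rule multiplies the factors; an identity skip connection, as used in residual networks, adds 1 to each factor and keeps the product from collapsing.

```python
def plain_chain_gradient(w, depth):
    """End-to-end gradient of a plain chain of scalar layers."""
    grad = 1.0
    for _ in range(depth):
        grad *= w            # each layer contributes a factor w
    return grad

def residual_chain_gradient(w, depth):
    """Same chain with an identity skip connection around each layer."""
    grad = 1.0
    for _ in range(depth):
        grad *= (1.0 + w)    # skip path adds 1 to every factor
    return grad

w, depth = 0.5, 20
print(plain_chain_gradient(w, depth))     # vanishes toward zero
print(residual_chain_gradient(w, depth))  # remains usable
```

At depth 20 with w = 0.5 the plain chain's gradient is on the order of 1e-6, while the skip-connected chain's gradient stays far above it, which is the qualitative effect the residual structure exploits.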
Summary of the invention
The technical problem to be solved by the embodiments of the present invention is to provide an audio processing method that improves the speech enhancement effect.
Correspondingly, the embodiments of the present invention also provide an audio processing apparatus, a device, and a readable storage medium to guarantee the implementation and application of the above method.
To solve the above problems, an embodiment of the present invention discloses an audio processing method, comprising: obtaining an input mixed speech signal; performing speech enhancement on the mixed speech signal according to a pre-trained residual network model to obtain a target speech signal; and producing output according to the target speech signal.
Optionally, performing speech enhancement on the mixed speech signal according to the pre-trained residual network model to obtain the target speech signal comprises: performing feature extraction on the mixed speech signal to obtain a speech feature of a target user and speech data, the mixed speech signal including a noise signal and the speech signal of the target user; and performing, according to the speech feature, noise reduction on the speech data through the pre-trained residual network model to obtain the target speech signal corresponding to the target user.
Optionally, the method further comprises: training a residual network model corresponding to a speech feature in advance. Performing noise reduction on the speech data through the pre-trained residual network model according to the speech feature to obtain the target speech signal corresponding to the target user comprises: determining, according to the speech feature of the target user, the residual network model corresponding to the target user; and performing noise reduction on the speech data through the residual network model corresponding to the target user to obtain the target speech signal.
Optionally, performing noise reduction on the speech data through the residual network model corresponding to the target user to obtain the target speech signal comprises: determining the network weight information corresponding to each network layer in the residual network model; performing mapping processing on the speech data according to the network weight information corresponding to each network layer to obtain mapped speech data; and generating the target speech signal based on the mapped speech data and the speech data.
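The "map, then combine with the original data" structure described here is the defining computation of a residual block: the layers compute a mapping F(x) from the speech data x, and the output combines the mapped data with the original data, typically as x + F(x). The following numpy sketch is a minimal illustration; the layer count, tanh activation, and random weights are assumptions for demonstration, not taken from the patent.

```python
import numpy as np

def residual_block(x, weights):
    """Map speech data x through per-layer weights, then add x back."""
    h = x
    for w in weights:
        h = np.tanh(h @ w)        # mapping processing with the layer's weights
    return x + h                  # combine mapped data with the original data

rng = np.random.default_rng(0)
dim = 8
weights = [rng.normal(scale=0.1, size=(dim, dim)) for _ in range(2)]
frame = rng.normal(size=(1, dim))     # one frame of speech data
out = residual_block(frame, weights)
print(out.shape)                      # (1, 8)
```

A useful property of this structure: when the learned mapping contributes nothing (all-zero weights), the block reduces exactly to the identity, so adding depth cannot make the representation worse than simply passing the input through.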
Optionally, performing feature extraction on the mixed speech signal to obtain the speech feature of the target user and the speech data comprises: performing frequency-domain feature extraction on the mixed speech signal to obtain a frequency-domain speech feature of the target user and frequency-domain speech data. Generating the target speech signal based on the mapped speech data and the speech data comprises: decoding the mapped speech data and the frequency-domain speech data to obtain decoded speech data; and performing waveform reconstruction on the decoded speech data according to the frequency-domain speech feature to obtain the target speech signal.
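One common concrete realization of this frequency-domain decode-and-reconstruct step — an illustrative assumption, since the patent does not fix the transform — is to take an FFT of each frame, let the model act on the magnitudes, and rebuild the waveform from the processed magnitudes together with the phase retained from the noisy input:

```python
import numpy as np

def enhance_frame(frame, mask_fn):
    """Enhance one time-domain frame via its frequency-domain magnitude."""
    spec = np.fft.rfft(frame)
    mag, phase = np.abs(spec), np.angle(spec)  # frequency-domain feature and data
    mag = mask_fn(mag)                         # stand-in for the model's mapping
    # waveform reconstruction: processed magnitude + retained noisy phase
    return np.fft.irfft(mag * np.exp(1j * phase), n=len(frame))

t = np.linspace(0, 1, 256, endpoint=False)
frame = np.sin(2 * np.pi * 5 * t)
rebuilt = enhance_frame(frame, mask_fn=lambda m: m)  # identity "model"
print(np.max(np.abs(rebuilt - frame)))               # effectively zero
```

With an identity mapping the reconstruction is exact up to floating-point error, which checks that the analysis/synthesis pair itself introduces no distortion; the enhancement comes entirely from the mapping applied to the magnitudes.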
Optionally, performing feature extraction on the mixed speech signal to obtain the speech feature of the target user and the speech data comprises: performing time-domain feature extraction on the mixed speech signal to obtain a time-domain speech feature of the target user and time-domain speech data. Generating the target speech signal based on the mapped speech data and the speech data comprises: generating the target speech signal using the mapped speech data and the time-domain speech data.
Optionally, training the residual network model corresponding to the speech feature comprises: adding a noise signal to an input speech signal to generate a noisy speech signal; performing feature extraction on the noisy speech signal to obtain the speech feature corresponding to the noisy speech signal; and performing model training according to the noisy speech signal and the speech signal on the basis of a preset residual network structure to generate the residual network model corresponding to the speech feature.
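The noise-addition step of this training procedure can be sketched as mixing a clean signal with a noise signal at a chosen signal-to-noise ratio. The SNR control is an illustrative assumption — the patent only requires that a noise signal be added — and the sine tone stands in for a clean speech recording.

```python
import numpy as np

def make_noisy(clean, noise, snr_db):
    """Scale noise so the clean/noise power ratio matches snr_db, then mix."""
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise

rng = np.random.default_rng(1)
clean = np.sin(2 * np.pi * 440 * np.linspace(0, 1, 16000))  # stand-in for clean speech
noise = rng.normal(size=16000)                              # stand-in noise signal
noisy = make_noisy(clean, noise, snr_db=10.0)  # training input; clean is the target
```

Each (noisy, clean) pair then serves as one training example: the noisy signal is the model input and the clean signal is the regression target.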
Optionally, producing output according to the target speech signal comprises: performing voice output according to the target speech signal; and/or performing speech recognition on the target speech signal to generate a recognition result, and outputting the recognition result.
An embodiment of the present invention also discloses an audio processing apparatus, comprising:
a speech signal obtaining module, configured to obtain an input mixed speech signal;
a speech enhancement module, configured to perform speech enhancement on the mixed speech signal according to a pre-trained residual network model to obtain a target speech signal; and
a speech signal output module, configured to produce output according to the target speech signal.
Optionally, the speech enhancement module comprises:
a feature extraction submodule, configured to perform feature extraction on the mixed speech signal to obtain a speech feature of a target user and speech data, the mixed speech signal including a noise signal and the speech signal of the target user; and
a noise reduction submodule, configured to perform, according to the speech feature, noise reduction on the speech data through the pre-trained residual network model to obtain the target speech signal corresponding to the target user.
Optionally, the apparatus may further comprise a residual network model training module, configured to train a residual network model corresponding to a speech feature in advance. The noise reduction submodule comprises: a residual network model determination unit, configured to determine, according to the speech feature of the target user, the residual network model corresponding to the target user; and a noise reduction unit, configured to perform noise reduction on the speech data through the residual network model corresponding to the target user to obtain the target speech signal.
Optionally, the noise reduction unit comprises:
a network weight information determination subunit, configured to determine the network weight information corresponding to each network layer in the residual network model;
a mapping processing subunit, configured to perform mapping processing on the speech data according to the network weight information corresponding to each network layer to obtain mapped speech data; and
a target speech signal generation subunit, configured to generate the target speech signal based on the mapped speech data and the speech data.
Optionally, the feature extraction submodule is specifically configured to perform frequency-domain feature extraction on the mixed speech signal to obtain a frequency-domain speech feature of the target user and frequency-domain speech data; and the target speech signal generation subunit is specifically configured to decode the mapped speech data and the frequency-domain speech data to obtain decoded speech data, and to perform waveform reconstruction on the decoded speech data according to the frequency-domain speech feature to obtain the target speech signal.
Optionally, the feature extraction submodule is specifically configured to perform time-domain feature extraction on the mixed speech signal to obtain a time-domain speech feature of the target user and time-domain speech data; and the target speech signal generation subunit is specifically configured to generate the target speech signal using the mapped speech data and the time-domain speech data.
Optionally, the residual network model training module comprises:
a noise addition submodule, configured to add a noise signal to an input speech signal to generate a noisy speech signal;
a feature extraction submodule, configured to perform feature extraction on the noisy speech signal to obtain the speech feature corresponding to the noisy speech signal; and
a model training submodule, configured to perform model training according to the noisy speech signal and the speech signal on the basis of a preset residual network structure to generate the residual network model corresponding to the speech feature.
Optionally, the speech signal output module comprises:
a voice output submodule, configured to perform voice output according to the target speech signal; and/or
a speech recognition submodule, configured to perform speech recognition on the target speech signal to generate a recognition result and to output the recognition result.
An embodiment of the present invention also discloses a device comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs including instructions for: obtaining an input mixed speech signal; performing speech enhancement on the mixed speech signal according to a pre-trained residual network model to obtain a target speech signal; and producing output according to the target speech signal.
Optionally, performing speech enhancement on the mixed speech signal according to the pre-trained residual network model to obtain the target speech signal comprises: performing feature extraction on the mixed speech signal to obtain a speech feature of a target user and speech data, the mixed speech signal including a noise signal and the speech signal of the target user; and performing, according to the speech feature, noise reduction on the speech data through the pre-trained residual network model to obtain the target speech signal corresponding to the target user.
Optionally, the one or more programs further include instructions, executed by the one or more processors, for: training a residual network model corresponding to a speech feature in advance. Performing noise reduction on the speech data through the pre-trained residual network model according to the speech feature to obtain the target speech signal corresponding to the target user comprises: determining, according to the speech feature of the target user, the residual network model corresponding to the target user; and performing noise reduction on the speech data through the residual network model corresponding to the target user to obtain the target speech signal.
Optionally, performing noise reduction on the speech data through the residual network model corresponding to the target user to obtain the target speech signal comprises: determining the network weight information corresponding to each network layer in the residual network model; performing mapping processing on the speech data according to the network weight information corresponding to each network layer to obtain mapped speech data; and generating the target speech signal based on the mapped speech data and the speech data.
Optionally, performing feature extraction on the mixed speech signal to obtain the speech feature of the target user and the speech data comprises: performing frequency-domain feature extraction on the mixed speech signal to obtain a frequency-domain speech feature of the target user and frequency-domain speech data. Generating the target speech signal based on the mapped speech data and the speech data comprises: decoding the mapped speech data and the frequency-domain speech data to obtain decoded speech data; and performing waveform reconstruction on the decoded speech data according to the frequency-domain speech feature to obtain the target speech signal.
Optionally, performing feature extraction on the mixed speech signal to obtain the speech feature of the target user and the speech data comprises: performing time-domain feature extraction on the mixed speech signal to obtain a time-domain speech feature of the target user and time-domain speech data. Generating the target speech signal based on the mapped speech data and the speech data comprises: generating the target speech signal using the mapped speech data and the time-domain speech data.
Optionally, training the residual network model corresponding to the speech feature comprises: adding a noise signal to an input speech signal to generate a noisy speech signal; performing feature extraction on the noisy speech signal to obtain the speech feature corresponding to the noisy speech signal; and performing model training according to the noisy speech signal and the speech signal on the basis of a preset residual network structure to generate the residual network model corresponding to the speech feature.
Optionally, producing output according to the target speech signal comprises: performing voice output according to the target speech signal; and/or performing speech recognition on the target speech signal to generate a recognition result, and outputting the recognition result.
An embodiment of the present invention also discloses a readable storage medium. When the instructions in the storage medium are executed by a processor of a device, the device is enabled to execute the audio processing method described in one or more of the embodiments of the present invention.
The embodiments of the present invention include the following advantages: the obtained mixed speech signal can be enhanced through a pre-trained residual network model, which avoids the degradation of the speech enhancement effect caused by increasing the network depth; that is, the embodiments solve the vanishing-gradient problem of existing speech enhancement methods based on conventional neural networks and improve the speech enhancement effect.
Brief description of the drawings
Fig. 1 is a flowchart of the steps of an embodiment of an audio processing method of the present invention;
Fig. 2 is a schematic diagram of performing speech enhancement using a pre-trained residual network model in an example of the present invention;
Fig. 3 is a flowchart of the steps of an alternative embodiment of an audio processing method of the present invention;
Fig. 4 is a schematic diagram of collected mixed speech in an example of the present invention;
Fig. 5 is a structural block diagram of an embodiment of an audio processing apparatus of the present invention;
Fig. 6 is a structural block diagram of a device for audio processing according to an exemplary embodiment;
Fig. 7 is a structural schematic diagram of a device in an embodiment of the present invention.
Detailed description of the embodiments
To make the above objectives, features, and advantages of the present invention clearer and easier to understand, the present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.
At present, existing speech enhancement methods usually perform model training with a conventional neural network and then perform speech enhancement based on the trained neural network model. Network depth is a major factor affecting the performance of a conventional neural network: as the depth keeps increasing, the network suffers from the vanishing-gradient problem, which grows increasingly severe with depth and causes the trained model to produce poor target speech signals, degrading the speech enhancement effect.
One of the core ideas of the embodiments of the present invention is to provide a new audio processing method that performs speech enhancement on the input mixed speech signal with a residual network model, thereby solving the vanishing-gradient problem of existing speech enhancement methods based on conventional neural networks and improving the speech enhancement effect.
Referring to Fig. 1, a flowchart of the steps of an embodiment of an audio processing method of the present invention is shown, which may specifically include the following steps:
Step 102: obtain an input mixed speech signal.
During voice input, an embodiment of the present invention can obtain the input mixed speech signal. The mixed speech signal may include the speech signal that needs speech enhancement, and may specifically include the speech signal of a target user, a noise signal, and so on. The speech signal of the target user may refer to the clean speech signal of the target user speaking, such as the time-domain signal corresponding to the target speaker's speech; the noise signal may refer to the signal corresponding to interfering noise, such as the time-domain signals corresponding to the interfering speech of other speakers, which the embodiments of the present invention do not restrict.
Step 104: perform speech enhancement on the mixed speech signal according to a pre-trained residual network model to obtain a target speech signal.
In an embodiment of the present invention, the obtained mixed speech signal can serve as the input of the pre-trained residual network model; that is, the obtained mixed speech signal can be fed into the pre-trained residual network model so that the residual network model performs speech enhancement on it, removes the interfering noise in the mixed speech signal, and obtains the enhanced target speech signal. The target speech signal may include only the clean speech of the target user and can be used to characterize the signal corresponding to the target user's clean speech, for example the signal corresponding to the clean speech uttered by the target speaker.
In an optional embodiment, after the mixed speech signal is obtained, feature extraction can be performed on it to obtain a speech feature and speech data. The speech data may refer to the noisy speech data remaining after speech feature extraction and may specifically include noise data and the target speech data that needs to be retained. Then, according to the speech feature, noise reduction can be performed on the speech data through the pre-trained residual network model to obtain the enhanced target speech signal. It should be noted that the speech feature may include a time-domain speech feature and/or a frequency-domain speech feature, which the embodiments of the present invention do not restrict. The time-domain speech feature characterizes the speech in the time domain, and the frequency-domain speech feature characterizes the speech in the frequency domain.
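The two feature families just mentioned can be made concrete with a short sketch. The frame length, hop size, and log-magnitude choice below are illustrative assumptions, not values from the patent: time-domain features are simply overlapping frames of the raw waveform, while frequency-domain features are computed per frame with an FFT.

```python
import numpy as np

def frame_signal(x, frame_len=256, hop=128):
    """Time-domain features: overlapping frames of the raw waveform."""
    n = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop:i * hop + frame_len] for i in range(n)])

def freq_features(frames):
    """Frequency-domain features: log-magnitude spectrum of each frame."""
    return np.log1p(np.abs(np.fft.rfft(frames, axis=1)))

x = np.sin(2 * np.pi * 50 * np.linspace(0, 1, 2048))  # stand-in waveform
td = frame_signal(x)        # shape (n_frames, frame_len)
fd = freq_features(td)      # shape (n_frames, frame_len // 2 + 1)
print(td.shape, fd.shape)
```

Either representation can then be fed to the residual network model; the frequency-domain path additionally needs the waveform-reconstruction step described earlier, while the time-domain path can emit samples directly.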
Step 106: produce output according to the target speech signal.
After the enhanced target speech signal is obtained, an embodiment of the present invention can produce output according to it. For example, voice output can be performed according to the target speech signal to play back the clean speech uttered by the user. As another example, speech recognition can be performed on the target speech signal to recognize the clean speech uttered by the user; the recognized clean speech can also be converted into text information, which is then output, for example by displaying the text on the screen of the device or displaying search results corresponding to the text.
In summary, the embodiments of the present invention can perform speech enhancement on the obtained mixed speech signal through a pre-trained residual network model, thereby avoiding the degradation of the speech enhancement effect caused by increasing the network depth; that is, they solve the vanishing-gradient problem of existing speech enhancement methods based on conventional neural networks and improve the speech enhancement effect.
In a concrete implementation, model training can be performed in advance on the basis of a residual network structure according to the speech features of speech signals, so that residual network models corresponding to various speech features are trained and subsequent speech enhancement can be performed with a pre-trained residual network model selected according to the speech feature, guaranteeing the speech enhancement effect. Optionally, the audio processing method of the embodiments of the present invention may further include: training a residual network model corresponding to a speech feature in advance.
Specifically, in the model training stage, a noise signal can be added to the input speech signal to generate a noisy speech signal, and feature extraction can be performed on the noisy speech signal to obtain the corresponding speech feature. Then, for the obtained speech feature, model training can be performed with the generated noisy speech signal according to the preset residual network structure to generate the residual network model corresponding to the speech feature. The input speech signal may refer to a clean speech signal and may specifically include a collected clean speech signal and/or a pre-synthesized clean speech signal; for example, it can be the clean speech signal currently obtained in real time during voice input, the time-domain signal of a prerecorded segment of clean speech, or the time-domain signal of a pre-synthesized segment of clean speech, which the embodiments of the present invention do not restrict.
In an alternative embodiment of the present invention, training the residual network model corresponding to a speech feature may specifically include: adding a noise signal to the input speech signal to generate a noisy speech signal; performing feature extraction on the noisy speech signal to obtain the speech feature corresponding to the noisy speech signal; and carrying out model training according to the preset residual network structure, based on the noisy speech signal and the speech signal, to generate the residual network model corresponding to the speech feature. The preset residual network structure may be configured in advance according to the network structure of a residual network, which the embodiment of the present invention does not restrict.
Specifically, noise can be added to the input clean speech signal; that is, in the training stage, a noise signal can be added to the input speech signal to generate a noisy speech signal. The noise signal may include a simulated noise signal, a pre-collected noise signal, and so on. The simulated noise signal may characterize noise synthesized in advance through speech synthesis technology; the pre-collected noise signal may characterize real noise collected in advance, such as a pre-recorded noise signal.
As an example of the present invention, when real noise has not been collected, a pre-synthesized simulated noise signal can be used to perform noise-adding processing on the input speech signal, so that model training is carried out on the noisy speech signal generated after the noise-adding processing. This avoids the high training cost incurred by collecting a large amount of real noise, reducing the training cost. Of course, when real noise has been collected, the noise signal corresponding to the collected real noise can also be used to perform noise-adding processing on the input speech signal; for example, the collected noise signal alone can be used to add noise to the input speech signal, or a combination of collected real noise and a synthesized simulated noise signal can be used, etc., which this example does not specifically limit.
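The noise-adding stage described above can be sketched as follows. The patent fixes no mixing formula, so this is an illustrative assumption: the noise is scaled to a chosen signal-to-noise ratio before being added, a 440 Hz tone stands in for clean speech, and Gaussian noise stands in for a simulated noise signal.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so the clean/noise power ratio equals `snr_db`,
    then add it to `clean` to form a noisy training sample.
    (Hypothetical mixing rule; the patent does not specify one.)"""
    clean = np.asarray(clean, dtype=np.float64)
    noise = np.asarray(noise, dtype=np.float64)[: len(clean)]
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12  # guard against silent noise
    scale = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clean + scale * noise

rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000.0)  # 1 s synthetic "clean speech"
noise = rng.normal(size=16000)                                # simulated noise signal
noisy = mix_at_snr(clean, noise, snr_db=10.0)                 # noisy speech signal for training
```

The (noisy, clean) pair would then feed the feature extraction and model training steps below.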
Then, feature extraction can be performed on the noisy speech signal after the noise signal is added, obtaining the corresponding speech feature, so that model training can be carried out with the residual network in combination with the speaker's voiceprint feature, obtaining the residual network model corresponding to that speech feature. Specifically, as shown in Fig. 2, for each obtained speech feature, model training can be carried out according to the preset residual network structure using the generated noisy speech signal and the input speech signal, so as to train the residual network model corresponding to each speech feature. The residual network model may include at least three network layers. During model training, the output of each network layer serves not only as the input of the next network layer but can also be fed across layers into other network layers; for example, the output of the first network layer can serve as the input of the second network layer and also as an input of the third network layer, and/or can be fed into even deeper network layers. This updates the weight parameters of each network layer in the residual network model while alleviating the shrinking of gradients, thereby solving the gradient vanishing problem.
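The cross-layer connection described above can be sketched as a single residual block computing H(x) = F(x) + x, where F(x) is the mapping learned by the stacked layers and the added x is the skip connection. This is a minimal NumPy illustration; the two-layer ReLU form of F and the layer sizes are assumptions, not the patent's actual network structure.

```python
import numpy as np

def residual_block(x, w1, w2):
    """H(x) = F(x) + x: two stacked layers whose output is added
    back to the block input via a cross-layer (skip) connection."""
    f = w2 @ np.maximum(w1 @ x, 0.0)  # F(x): the learned residual mapping
    return f + x                      # identity shortcut across the layers

rng = np.random.default_rng(1)
dim = 8
x = rng.normal(size=dim)
w1 = rng.normal(size=(dim, dim)) * 0.1
w2 = rng.normal(size=(dim, dim)) * 0.1
h = residual_block(x, w1, w2)
```

With all weights zero the block reduces to the identity, which is exactly why gradients can flow through many such blocks without vanishing.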
Thus, in the speech enhancement stage, i.e., when performing speech enhancement with the trained residual network models, the residual network model currently to be used can be determined based on the speech feature; noise reduction processing can then be performed, through the determined residual network model, on the speech data obtained after feature extraction, obtaining the target speech signal as shown in Fig. 2, and output can be carried out according to the obtained target speech signal. Here, the speech data may be generated by performing feature extraction on the input mixed speech signal; for example, it may be the frequency-domain speech data obtained after frequency-domain feature extraction on the mixed speech signal, or the time-domain speech data obtained after time-domain feature extraction, which the embodiment of the present invention does not restrict.
Referring to Fig. 3, a flow chart of the steps of an alternative embodiment of an audio processing method of the present invention is shown, which may specifically include the following steps:
Step 302: obtain an input mixed speech signal.
Step 304: perform feature extraction on the mixed speech signal to obtain the speech feature and speech data of a target user, the mixed speech signal including a noise signal and the speech signal of the target user.
Specifically, in the speech enhancement stage, after an input mixed speech signal is detected, the currently detected mixed speech signal can be determined as a signal requiring speech enhancement processing, and the input mixed speech signal can be obtained so that the corresponding speech enhancement task is executed based on it. During execution of the speech enhancement task, feature extraction can be performed on the obtained mixed speech signal to obtain the speech feature and speech data of the target user. The mixed speech signal may include the speech signal of the target user and the noise signal to be removed; for example, it may include the clean speech signal corresponding to the target user speaking and interfering speech signals corresponding to other users speaking.
Step 306: according to the speech feature, perform noise reduction processing on the speech data through a pre-trained residual network model, obtaining the target speech signal corresponding to the target user.
In a concrete implementation, the residual network models obtained by training for different speech features may differ. For example, the residual network model trained for user A's speech feature can suppress the interfering speech signals of other users, such as the interfering speech signal of user B speaking, while retaining the speech signal of user A speaking, achieving the purpose of enhancing user A's speech signal; likewise, the residual network model trained for user B's speech feature can suppress the interfering speech signals of other users, such as user A's, while retaining user B's speech signal, achieving the purpose of enhancing user B's speech signal. Therefore, before noise reduction processing, voiceprint recognition technology can be combined to determine, according to the speech feature of the target user, the residual network model currently to be used, so that noise reduction processing is performed on the speech data through the residual network model corresponding to the target user's speech feature, obtaining the target speech signal corresponding to the target user.
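One hedged way to picture the voiceprint-based model selection: match the extracted speech feature against enrolled voiceprint embeddings and pick the model keyed to the best match. The enrolled embeddings, the cosine-similarity matching rule, and the key names are all hypothetical; the patent does not specify how a speech feature is matched to a model.

```python
import numpy as np

# Hypothetical voiceprint embeddings for users whose residual models were pre-trained.
voiceprints = {
    "user_A": np.array([1.0, 0.0, 0.0]),
    "user_B": np.array([0.0, 1.0, 0.0]),
}

def select_model_key(feature):
    """Return the key of the pre-trained residual model whose enrolled
    voiceprint best matches the extracted speech feature (cosine similarity)."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    return max(voiceprints, key=lambda k: cos(voiceprints[k], feature))

key = select_model_key(np.array([0.9, 0.1, 0.0]))  # closest to user A's voiceprint
```

The returned key would then index into a table of trained residual network models, one per speech feature.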
In an alternative embodiment of the present invention, performing noise reduction processing on the speech data through the pre-trained residual network model according to the speech feature to obtain the target speech signal corresponding to the target user may include: determining, according to the speech feature of the target user, the residual network model corresponding to the target user; and performing noise reduction processing on the speech data through the residual network model corresponding to the target user to obtain the target speech signal. The residual network model corresponding to the target user may be the residual network model trained in advance for the target user's speech feature.
Specifically, after obtaining the speech feature of the target user, the embodiment of the present invention can determine, based on that feature, the residual network model trained in advance for the target user's speech feature, and then perform noise reduction processing on the speech data through the determined residual network model, so as to remove the noise data in the speech data while retaining the target speech data contained in it; the enhanced target speech signal can then be generated based on the retained target speech data. The target speech data may characterize the speech signal of the target user; for example, it may be the frequency-domain data of the clean speech the target user speaks, or the time-domain data of that clean speech.
In the embodiment of the present invention, optionally, performing noise reduction processing on the speech data through the residual network model corresponding to the target user to obtain the target speech signal may specifically include: determining the network weight information corresponding to each network layer in the residual network model; performing mapping processing on the speech data according to the network weight information corresponding to each network layer to obtain mapped speech data; and generating the target speech signal based on the mapped speech data and the speech data.
Specifically, after determining the residual network model, the embodiment of the present invention can determine, based on the residual network structure corresponding to that model, the network weight information of each network layer, and then perform mapping processing on the speech data according to the network weight information of each layer; that is, noise reduction processing is performed on the speech data input to each network layer according to that layer's network weight information, removing the noise data contained in the speech data and obtaining mapped speech data. The network weight information can determine the mapping relationship between the speech data and the mapped speech data. The mapped speech data may characterize the clean speech signal obtained after the noise signal is removed; for example, it may characterize the time-domain signal of the clean speech with the noise removed, or the frequency-domain signal of that clean speech, which the embodiment of the present invention does not restrict. After obtaining the mapped speech data, the embodiment of the present invention can process the mapped speech data and the speech data according to the residual network structure, generating the target speech signal corresponding to the target user.
As an example of the present invention, after speech data x is obtained, x can be mapped according to the network weight information of each network layer in the residual network model, obtaining mapped speech data F(x); target speech data H(x) can then be generated from the mapped speech data F(x) and the speech data x according to the residual network structure, i.e., H(x) = F(x) + x. For example, when the speech data x is 5: if the generated target speech data H(x) is 5.1, the corresponding mapped speech data F(x) is 0.1; if H(x) is 5.2, then F(x) is 0.2. That is, when the target speech data H(x) changes from 5.1 to 5.2, the mapped speech data F(x) changes from 0.1 to 0.2, a 100% change. A minor change in the output is thus made prominent, the corrective effect of the network weight information is clearly reflected, the noise data in the speech data can be better suppressed, and the speech enhancement effect is improved.
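The arithmetic of the example above can be checked directly: the network learns the residual F(x) = H(x) - x, so a 0.1 absolute change in H(x) becomes a 100% relative change in F(x). The function name here is illustrative.

```python
def residual(x, h):
    # F(x) = H(x) - x: the mapping the residual network actually learns
    return h - x

x = 5.0
f1 = residual(x, 5.1)              # 0.1
f2 = residual(x, 5.2)              # 0.2
relative_change = (f2 - f1) / f1   # 1.0, i.e., a 100% change in the residual
```

The same 0.1 shift in H(x) is only a ~2% relative change in H itself, which is why learning F rather than H makes small corrections easier to express.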
In one implementation of the embodiment of the present invention, frequency-domain feature extraction can be performed on the input mixed speech signal, obtaining the frequency-domain speech feature and the corresponding frequency-domain speech data after extraction, and speech enhancement processing is carried out in the frequency domain according to the frequency-domain speech feature and frequency-domain speech data. Optionally, the above step of performing feature extraction on the mixed speech signal to obtain the speech feature and speech data of the target user may include: performing frequency-domain feature extraction on the mixed speech signal to obtain the frequency-domain speech feature and frequency-domain speech data of the target user. Thus, according to the obtained frequency-domain speech feature, speech enhancement can be performed on the frequency-domain speech data, i.e., on the input mixed speech signal, in the frequency domain through the pre-trained residual network model, completing the speech enhancement task in the frequency domain.
The frequency-domain speech data may characterize the noisy speech data in the frequency domain, and may include the noise data and the target speech data in the frequency domain. Optionally, generating the target speech signal based on the mapped speech data and the speech data includes: decoding the mapped speech data and the frequency-domain speech data to obtain decoded speech data; and performing waveform reconstruction on the decoded speech data according to the frequency-domain speech feature to obtain the target speech signal. Specifically, after the frequency-domain speech data is obtained, mapping processing can be performed on it based on the frequency-domain speech feature and according to the network weight information corresponding to each network layer, removing the noise data in the frequency-domain speech data and obtaining mapped speech data; the mapped speech data and the frequency-domain speech data can then be decoded to obtain the corresponding decoded speech data, after which the time-domain waveform corresponding to the decoded speech data can be reconstructed in combination with the target user's speech feature, so that the target speech signal output according to that time-domain waveform carries the speech feature of the target user. That is, waveform reconstruction is performed on the decoded speech data according to the extracted frequency-domain speech feature, generating a time-domain target speech signal that carries the target user's speech feature, guaranteeing the auditory quality after speech enhancement and improving user experience.
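A minimal sketch of the frequency-domain path above: windowed frames are transformed to spectra (feature extraction), a mapping is applied, and overlap-add of the inverse transforms reconstructs the waveform. The residual model's mapping is replaced here by an identity stand-in, and the frame/hop sizes and 440 Hz test tone are assumptions.

```python
import numpy as np

def stft(x, frame=256, hop=128):
    # Frequency-domain feature extraction: windowed frames -> spectra
    win = np.hanning(frame)
    starts = range(0, len(x) - frame + 1, hop)
    return np.array([np.fft.rfft(x[s:s + frame] * win) for s in starts])

def istft(spec, frame=256, hop=128, length=None):
    # Waveform reconstruction: overlap-add of inverse-transformed frames,
    # normalized by the summed squared window
    win = np.hanning(frame)
    out = np.zeros(hop * (len(spec) - 1) + frame)
    norm = np.zeros_like(out)
    for i, s in enumerate(spec):
        out[i * hop:i * hop + frame] += np.fft.irfft(s, frame) * win
        norm[i * hop:i * hop + frame] += win ** 2
    out /= np.maximum(norm, 1e-8)
    return out[:length] if length is not None else out

t = np.arange(2048) / 16000.0
mixed = np.sin(2 * np.pi * 440 * t)        # stand-in for the input mixed speech signal
spec = stft(mixed)                         # frequency-domain speech data
enhanced_spec = spec                       # identity stands in for the residual model's mapping
recon = istft(enhanced_spec, length=2048)  # reconstructed time-domain target speech signal
```

With the identity mapping, the interior of the signal is reconstructed essentially exactly, which is the property a real mapping would rely on: only the spectral modification, not the transform pair, changes the signal.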
Of course, based on the residual network model, the embodiment of the present invention can also perform speech enhancement on the mixed speech signal in other ways, such as carrying out speech enhancement processing on the mixed speech signal in the time domain. In an alternative embodiment of the present invention, performing feature extraction on the mixed speech signal to obtain the speech feature and speech data includes: performing time-domain feature extraction on the mixed speech signal to obtain the time-domain speech feature and time-domain speech data. Thus, according to the obtained time-domain speech feature, speech enhancement can be performed on the time-domain speech data, i.e., on the input mixed speech signal, in the time domain through the pre-trained residual network model, completing the speech enhancement task in the time domain.
The time-domain speech data may characterize the noisy speech data in the time domain, and may include the noise data and the target speech data in the time domain. Optionally, generating the target speech signal based on the mapped speech data and the speech data includes: generating the target speech signal using the mapped speech data and the time-domain speech data. Specifically, after the time-domain speech feature is extracted, mapping processing can be performed on the time-domain speech data based on that feature and according to the network weight information corresponding to each network layer, removing the noise data in the time-domain speech data and obtaining mapped speech data; speech processing can then be performed on the mapped speech data and the time-domain speech data in combination with the time-domain speech feature, generating the target speech signal corresponding to the target user, so that the target speech signal carries the target user's speech feature, guaranteeing the auditory quality after speech enhancement and improving the voice quality after speech enhancement.
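The time-domain path can be sketched the same way: overlapping waveform frames stand in for time-domain feature extraction, an identity function stands in for the residual model's mapping, and overlap-add reconstruction generates the output signal. All parameter choices (frame length, hop, the ramp test signal) are assumptions.

```python
import numpy as np

def frame_signal(x, frame=160, hop=80):
    # "Time-domain feature extraction": overlapping frames of the raw waveform;
    # each frame would feed the time-domain residual network model
    return np.stack([x[s:s + frame] for s in range(0, len(x) - frame + 1, hop)])

def overlap_add(frames, hop=80, length=None):
    # Rebuild the waveform by averaging the overlapping (mapped) frames
    frame = frames.shape[1]
    out = np.zeros(hop * (len(frames) - 1) + frame)
    count = np.zeros_like(out)
    for i, f in enumerate(frames):
        out[i * hop:i * hop + frame] += f
        count[i * hop:i * hop + frame] += 1.0
    out /= np.maximum(count, 1.0)
    return out[:length] if length is not None else out

mixed = np.linspace(-1.0, 1.0, 800)  # stand-in for time-domain speech data
frames = frame_signal(mixed)
mapped = frames                      # identity stands in for the residual model's mapping
target = overlap_add(mapped, length=800)
```

As in the frequency-domain sketch, the frame/reconstruct pair is lossless for the identity mapping, so only the model's per-frame mapping alters the signal.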
Step 308: output according to the target speech signal.
In one optional embodiment, outputting according to the target speech signal may include: performing voice output according to the target speech signal. Specifically, the embodiment of the present invention can be applied to voice-dialogue products in noisy environments, such as a phone watch in a voice-call scenario, so that both parties of a call hear only the clean speech of the main speaker they care about. For example, when a parent uses a phone watch to call a child at an activity, applying the audio processing method provided by the embodiment of the present invention allows the parent to hear only the clear voice of their own child, reducing the influence of other children speaking as well as the influence of noise interference.
Of course, the embodiment of the present invention can also be applied in other scenarios, such as a voice input scenario, a speech recognition scenario, etc., which the embodiment of the present invention does not restrict.
In another optional embodiment, outputting according to the target speech signal may include: performing speech recognition on the target speech signal to generate a recognition result, and outputting the recognition result.
For example, the target speaker's speech is the sentence in the first dashed box 41 in Fig. 4, "Hello everyone, I am Lee XX, very glad to meet you all."; the noise is bird chirping, as in the second dashed box 42 in Fig. 4, "chirp chirp". As shown in Fig. 4, the speech spoken by the target speaker and the noise (i.e., the bird chirping) overlap substantially on the time axis. At the beginning, since there is no chirping yet, the opening words "Hello everyone" spoken by the target speaker are not interfered with and can be heard clearly; however, the following "I am Lee XX" is partially interfered with by the chirping, which may make what the target speaker says unintelligible. In this case, using the audio processing method provided by the embodiment of the present invention, such as one based on an end-to-end speech enhancement model, the interfering chirping can be removed, leaving only the target speech "Hello everyone, I am Lee XX, very glad to meet you all.", thereby achieving the purpose of speech enhancement.
Then, speech recognition can be performed using the enhanced target speech signal, i.e., using the clean speech of the target speaker, so as to recognize the speech the target speaker utters. Combining the above example, speech recognition can be performed on the target speech "Hello everyone, I am Lee XX, very glad to meet you all." output by the speech enhancement model, which can improve the speech recognition effect. Output can then be carried out according to the recognition result, such as outputting the text corresponding to the recognized speech, "Hello everyone, I am Lee XX, very glad to meet you all.", a personal photo of "Lee XX", etc.
To sum up, the embodiment of the present invention can introduce the residual network structure into the speech enhancement task to solve the gradient vanishing problem in that task, so that a residual network model with greater network depth can be trained and used for speech enhancement, improving the speech enhancement effect.
It should be noted that, for simplicity of description, the method embodiments are expressed as a series of action combinations; however, those skilled in the art should understand that the embodiments of the present invention are not limited by the described order of actions, because according to the embodiments of the present invention, some steps may be performed in other orders or simultaneously. Secondly, those skilled in the art should also know that the embodiments described in the specification are all preferred embodiments, and the actions involved are not necessarily required by the embodiments of the present invention.
Referring to Fig. 5, a structural block diagram of an embodiment of an audio processing apparatus of the present invention is shown, which may specifically include the following modules:
a speech signal obtaining module 510, configured to obtain an input mixed speech signal;
a speech enhancement module 520, configured to perform speech enhancement on the mixed speech signal according to a pre-trained residual network model to obtain a target speech signal; and
a speech signal output module 530, configured to output according to the target speech signal.
In an alternative embodiment of the present invention, the speech enhancement module 520 may include the following submodules:
a feature extraction submodule, configured to perform feature extraction on the mixed speech signal to obtain the speech feature and speech data of a target user, the mixed speech signal including a noise signal and the speech signal of the target user; and
a noise reduction processing submodule, configured to perform, according to the speech feature, noise reduction processing on the speech data through the pre-trained residual network model to obtain the target speech signal corresponding to the target user.
In the embodiment of the present invention, optionally, the apparatus may further include a residual network model training module, configured to train in advance the residual network model corresponding to a speech feature. The noise reduction processing submodule includes the following units:
a residual network model determination unit, configured to determine, according to the speech feature of the target user, the residual network model corresponding to the target user; and
a noise reduction processing unit, configured to perform noise reduction processing on the speech data through the residual network model corresponding to the target user to obtain the target speech signal.
In an alternative embodiment of the present invention, the noise reduction processing unit may include the following subunits:
a network weight information determination subunit, configured to determine the network weight information corresponding to each network layer in the residual network model;
a mapping processing subunit, configured to perform mapping processing on the speech data according to the network weight information corresponding to each network layer to obtain mapped speech data; and
a target speech signal generation subunit, configured to generate the target speech signal based on the mapped speech data and the speech data.
In an alternative embodiment of the present invention, the feature extraction submodule is specifically configured to perform frequency-domain feature extraction on the mixed speech signal to obtain the frequency-domain speech feature and frequency-domain speech data of the target user. The target speech signal generation subunit is specifically configured to decode the mapped speech data and the frequency-domain speech data to obtain decoded speech data, and to perform waveform reconstruction on the decoded speech data according to the frequency-domain speech feature to obtain the target speech signal.
In another alternative embodiment of the present invention, the feature extraction submodule is specifically configured to perform time-domain feature extraction on the mixed speech signal to obtain the time-domain speech feature and time-domain speech data of the target user. The target speech signal generation subunit is specifically configured to generate the target speech signal using the mapped speech data and the time-domain speech data.
In an alternative embodiment of the present invention, the residual network model training module may include the following submodules:
a noise adding submodule, configured to add a noise signal to an input speech signal to generate a noisy speech signal;
a feature extraction submodule, configured to perform feature extraction on the noisy speech signal to obtain the speech feature corresponding to the noisy speech signal; and
a model training submodule, configured to carry out model training according to a preset residual network structure and based on the noisy speech signal and the speech signal, generating the residual network model corresponding to the speech feature.
In an alternative embodiment of the present invention, the speech signal output module 530 may include the following submodules:
a voice output submodule, configured to perform voice output according to the target speech signal; and/or
a speech recognition submodule, configured to perform speech recognition on the target speech signal to generate a recognition result, and to output the recognition result.
As for the apparatus embodiment, since it is basically similar to the method embodiment, the description is relatively simple; for relevant parts, refer to the description of the method embodiment.
Fig. 6 is a structural block diagram of a device 600 for audio processing shown according to an exemplary embodiment. For example, the device 600 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, fitness equipment, a personal digital assistant, a server, etc.
Referring to Fig. 6, the device 600 may include one or more of the following components: a processing component 602, a memory 604, a power supply component 606, a multimedia component 608, an audio component 610, an input/output (I/O) interface 612, a sensor component 614, and a communication component 616.
The processing component 602 generally controls the overall operation of the device 600, such as operations associated with display, telephone calls, data communication, camera operation, and recording. The processing component 602 may include one or more processors 620 to execute instructions so as to perform all or part of the steps of the above method. In addition, the processing component 602 may include one or more modules to facilitate interaction between the processing component 602 and other components; for example, the processing component 602 may include a multimedia module to facilitate interaction between the multimedia component 608 and the processing component 602.
The memory 604 is configured to store various types of data to support operation on the device 600. Examples of such data include instructions of any application or method operated on the device 600, contact data, phonebook data, messages, pictures, video, etc. The memory 604 may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk, or optical disk.
The power supply component 606 provides power for the various components of the device 600. The power supply component 606 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the device 600.
The multimedia component 608 includes a screen providing an output interface between the device 600 and the user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, slides, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 608 includes a front camera and/or a rear camera. When the device 600 is in an operation mode, such as a shooting mode or a video mode, the front camera and/or the rear camera can receive external multimedia data. Each front camera and rear camera may be a fixed optical lens system or have focusing and optical zoom capability.
The audio component 610 is configured to output and/or input audio signals. For example, the audio component 610 includes a microphone (MIC); when the device 600 is in an operation mode, such as a call mode, a recording mode, or a voice recognition mode, the microphone is configured to receive external audio signals. The received audio signals may be further stored in the memory 604 or sent via the communication component 616. In some embodiments, the audio component 610 further includes a loudspeaker for outputting audio signals.
The I/O interface 612 provides an interface between the processing component 602 and peripheral interface modules, which may be a keyboard, a click wheel, buttons, etc. These buttons may include, but are not limited to: a home button, volume buttons, a start button, and a lock button.
The sensor component 614 includes one or more sensors for providing state assessments of various aspects for the device 600. For example, the sensor component 614 can detect the on/off state of the device 600 and the relative positioning of components, e.g., the display and keypad of the device 600; the sensor component 614 can also detect a position change of the device 600 or a component of the device 600, the presence or absence of contact between the user and the device 600, the orientation or acceleration/deceleration of the device 600, and a temperature change of the device 600. The sensor component 614 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor component 614 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 614 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 616 is configured to facilitate wired or wireless communication between the device 600 and other devices. The device 600 can access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In one exemplary embodiment, the communication component 616 receives broadcast signals or broadcast-related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communication component 616 further includes a near field communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the device 600 may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components, for performing the above method.
In an exemplary embodiment, a non-transitory computer-readable storage medium including instructions is also provided, for example the memory 604 including instructions, where the above instructions are executable by the processor 620 of the device 600 to complete the above method. For example, the non-transitory computer-readable storage medium may be a ROM, a random-access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
A non-transitory computer-readable storage medium is provided such that, when the instructions in the storage medium are executed by a processor of a device, the device is enabled to perform an audio processing method, the method comprising: obtaining an input mixed speech signal; performing speech enhancement on the mixed speech signal according to a pre-trained residual network model to obtain a target speech signal; and outputting according to the target speech signal.
Fig. 7 is a structural schematic diagram of a device in an embodiment of the present invention. The device 700 may vary considerably depending on configuration or performance, and may include one or more central processing units (CPUs) 722 (e.g., one or more processors), a memory 732, and a storage medium 730 (e.g., one or more mass storage devices) storing one or more application programs 742 or data 744. The memory 732 and the storage medium 730 may provide transient or persistent storage. A program stored in the storage medium 730 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations on the device. Further, the central processing unit 722 may be configured to communicate with the storage medium 730 and execute, on the device 700, the series of instruction operations in the storage medium 730. The device 700 may also include one or more power supplies 726, one or more wired or wireless network interfaces 750, one or more input/output interfaces 758, one or more keyboards 756, and/or one or more operating systems 741, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and the like.
In an exemplary embodiment, the device is configured such that the one or more programs executed by the one or more processors include instructions for performing the following operations: obtaining an input mixed speech signal; performing speech enhancement on the mixed speech signal according to a pre-trained residual network model to obtain a target speech signal; and outputting according to the target speech signal.
Optionally, performing speech enhancement on the mixed speech signal according to the pre-trained residual network model to obtain the target speech signal comprises: performing feature extraction on the mixed speech signal to obtain a voice feature of a target user and speech data, the mixed speech signal including a noise signal and a speech signal of the target user; and performing, according to the voice feature, noise reduction on the speech data through the pre-trained residual network model to obtain the target speech signal corresponding to the target user.
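The enhancement flow just described — feature extraction followed by residual-network noise reduction and reconstruction — can be sketched as follows. This is a minimal illustration, not the patent's implementation: the frame-based `extract_features` and the `denoise_model` callable are placeholders.

```python
import numpy as np

def extract_features(mixed, frame=256):
    """Split a mixed waveform into frames (a stand-in for real feature extraction)."""
    n = len(mixed) // frame
    return mixed[: n * frame].reshape(n, frame)

def enhance(mixed, denoise_model):
    """Extract features, denoise them with the model, and rebuild a waveform."""
    frames = extract_features(mixed)
    clean_frames = denoise_model(frames)  # residual-network noise reduction
    return clean_frames.reshape(-1)       # waveform reconstruction (trivial here)

# identity "model" for demonstration only
mixed = np.random.randn(1024).astype(np.float64)
target = enhance(mixed, lambda f: f)
```

With the identity model the output equals the input; a trained residual model would map noisy frames toward the target user's clean speech.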
Optionally, the one or more programs executed by the one or more processors also include instructions for performing the following operation: pre-training a residual network model corresponding to each voice feature. Performing, according to the voice feature, noise reduction on the speech data through the pre-trained residual network model to obtain the target speech signal corresponding to the target user then comprises: determining, according to the voice feature of the target user, the residual network model corresponding to the target user; and performing noise reduction on the speech data through the residual network model corresponding to the target user to obtain the target speech signal.
Optionally, performing noise reduction on the speech data through the residual network model corresponding to the target user to obtain the target speech signal comprises: determining network weight information corresponding to each network layer in the residual network model; performing mapping processing on the speech data according to the network weight information corresponding to each network layer to obtain mapped speech data; and generating the target speech signal based on the mapped speech data and the speech data.
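Combining the mapped data with the original data is the defining computation of a residual network: each layer uses its weights to produce a mapping F(x), and the layer output is F(x) + x. A minimal NumPy sketch (the random weights and tanh nonlinearity are placeholders, not values from the patent):

```python
import numpy as np

rng = np.random.default_rng(0)

def residual_layer(x, weight):
    """Apply the layer's weights to obtain the mapped data, then add the input back."""
    mapped = np.tanh(x @ weight)  # "mapping processing" with this layer's weights
    return mapped + x             # skip connection: mapped data + original data

x = rng.standard_normal((1, 8))
weights = [rng.standard_normal((8, 8)) * 0.1 for _ in range(3)]
for w in weights:                 # one pass per network layer
    x = residual_layer(x, w)
```

The skip connection is what lets gradients flow through many layers, which is the usual motivation for using residual structures in deep speech-enhancement networks.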
Optionally, performing feature extraction on the mixed speech signal to obtain the voice feature of the target user and the speech data comprises: performing frequency-domain feature extraction on the mixed speech signal to obtain a frequency-domain voice feature of the target user and frequency-domain speech data. Generating the target speech signal based on the mapped speech data and the speech data then comprises: decoding the mapped speech data and the frequency-domain speech data to obtain decoded speech data; and performing waveform reconstruction on the decoded speech data according to the frequency-domain voice feature to obtain the target speech signal.
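A common concrete realization of this frequency-domain variant (the patent does not name one) is an STFT: the magnitude spectrum serves as the frequency-domain speech data to be denoised, while the phase is kept for waveform reconstruction. A NumPy sketch with an identity "denoiser" standing in for the residual network:

```python
import numpy as np

def stft(x, frame=256, hop=128):
    win = np.hanning(frame)
    frames = [x[i:i + frame] * win for i in range(0, len(x) - frame + 1, hop)]
    return np.fft.rfft(np.array(frames), axis=1)

def istft(spec, frame=256, hop=128):
    """Overlap-add reconstruction, normalized by the accumulated squared window."""
    win = np.hanning(frame)
    out = np.zeros(hop * (len(spec) - 1) + frame)
    norm = np.zeros_like(out)
    for i, f in enumerate(np.fft.irfft(spec, n=frame, axis=1)):
        out[i * hop:i * hop + frame] += f * win
        norm[i * hop:i * hop + frame] += win ** 2
    return out / np.maximum(norm, 1e-8)

x = np.sin(2 * np.pi * 440 * np.arange(4096) / 16000)
spec = stft(x)
mag, phase = np.abs(spec), np.angle(spec)      # frequency-domain data + feature
denoised_mag = mag                             # placeholder for residual-network denoising
y = istft(denoised_mag * np.exp(1j * phase))   # waveform reconstruction with kept phase
```

Because the magnitude is untouched here, the interior of `y` reconstructs `x` to floating-point precision; a trained model would alter `denoised_mag` to suppress the noise component.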
Optionally, performing feature extraction on the mixed speech signal to obtain the voice feature of the target user and the speech data comprises: performing time-domain feature extraction on the mixed speech signal to obtain a time-domain voice feature of the target user and time-domain speech data. Generating the target speech signal based on the mapped speech data and the speech data then comprises: generating the target speech signal using the mapped speech data and the time-domain speech data.
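In the time-domain variant no spectral decoding step is needed; the mapped data and the original time-domain frames combine directly into the output waveform. One plausible reading is sketched below; the averaging combination and frame layout are assumptions for illustration, not the patent's specified operation.

```python
import numpy as np

def combine(mapped_frames, time_frames, hop):
    """Merge (mapped + original) frames back into a waveform by overlap-add."""
    frame = time_frames.shape[1]
    out = np.zeros(hop * (len(time_frames) - 1) + frame)
    for i, (m, t) in enumerate(zip(mapped_frames, time_frames)):
        out[i * hop:i * hop + frame] += (m + t) / 2.0
    return out

frames = np.ones((4, 8))
y = combine(np.zeros((4, 8)), frames, hop=8)  # zero mapping -> halved passthrough
```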
Optionally, training the residual network model corresponding to the voice feature comprises: adding a noise signal to an input speech signal to generate a noisy speech signal; performing feature extraction on the noisy speech signal to obtain a voice feature corresponding to the noisy speech signal; and performing model training according to the noisy speech signal and the speech signal with a preset residual network structure, to generate the residual network model corresponding to the voice feature.
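The first training step above — mixing noise into clean speech to produce (noisy, clean) pairs — can be sketched as follows. The SNR-controlled mixing is a standard technique assumed for illustration; the patent does not specify a mixing rule.

```python
import numpy as np

rng = np.random.default_rng(1)

def add_noise(clean, noise, snr_db=5.0):
    """Scale the noise to the requested signal-to-noise ratio and mix it in."""
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise

clean = np.sin(2 * np.pi * np.arange(1000) / 50)
noisy = add_noise(clean, rng.standard_normal(1000))

# the generated pair's measured SNR should match the requested 5 dB
resid = noisy - clean
snr = 10 * np.log10(np.mean(clean ** 2) / np.mean(resid ** 2))
```

Each such pair supplies the model input (`noisy`) and the regression target (`clean`) for fitting the preset residual network structure.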
Optionally, outputting according to the target speech signal comprises: performing speech output according to the target speech signal; and/or performing speech recognition on the target speech signal to generate a recognition result, and outputting the recognition result.
The embodiments in this specification are described in a progressive manner, with each embodiment focusing on its differences from the other embodiments; for identical or similar parts, the embodiments may be referred to one another.
Those skilled in the art should understand that embodiments of the present invention may be provided as a method, an apparatus, or a computer program product. Therefore, the embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the embodiments of the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) containing computer-usable program code.
The embodiments of the present invention are described with reference to flowcharts and/or block diagrams of the method, terminal device (system), and computer program product according to the embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks therein, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing terminal device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing terminal device produce a device for realizing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or another programmable data processing terminal device to operate in a specific manner, so that the instructions stored in the computer-readable memory produce a manufactured article including an instruction device that realizes the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or another programmable data processing terminal device, so that a series of operation steps are executed on the computer or other programmable terminal device to produce computer-implemented processing, and the instructions executed on the computer or other programmable terminal device thus provide steps for realizing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Although preferred embodiments of the present invention have been described, those skilled in the art, once aware of the basic inventive concept, may make additional changes and modifications to these embodiments. Therefore, the appended claims are intended to be interpreted as including the preferred embodiments and all changes and modifications that fall within the scope of the embodiments of the present invention.
Finally, it should be noted that, herein, relational terms such as "first" and "second" are used merely to distinguish one entity or operation from another, without necessarily requiring or implying any actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise", or any other variant thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or terminal device including a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or terminal device. In the absence of further restrictions, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, article, or terminal device that includes the element.
The audio processing method and apparatus, device, and readable storage medium provided by the present invention have been described in detail above. Specific examples are used herein to illustrate the principles and implementations of the present invention, and the descriptions of the above embodiments are merely intended to help understand the method of the present invention and its core idea. Meanwhile, for a person of ordinary skill in the art, there will be changes in the specific implementation and scope of application according to the idea of the present invention. In summary, the contents of this specification should not be construed as limiting the present invention.
Claims (10)
1. An audio processing method, characterized by comprising:
obtaining an input mixed speech signal;
performing speech enhancement on the mixed speech signal according to a pre-trained residual network model to obtain a target speech signal;
outputting according to the target speech signal.
2. The method according to claim 1, characterized in that performing speech enhancement on the mixed speech signal according to the pre-trained residual network model to obtain the target speech signal comprises:
performing feature extraction on the mixed speech signal to obtain a voice feature of a target user and speech data, the mixed speech signal including a noise signal and a speech signal of the target user;
performing, according to the voice feature, noise reduction on the speech data through the pre-trained residual network model to obtain the target speech signal corresponding to the target user.
3. The method according to claim 2, characterized by further comprising:
pre-training a residual network model corresponding to each voice feature;
wherein performing, according to the voice feature, noise reduction on the speech data through the pre-trained residual network model to obtain the target speech signal corresponding to the target user comprises:
determining, according to the voice feature of the target user, the residual network model corresponding to the target user;
performing noise reduction on the speech data through the residual network model corresponding to the target user to obtain the target speech signal.
4. The method according to claim 3, characterized in that performing noise reduction on the speech data through the residual network model corresponding to the target user to obtain the target speech signal comprises:
determining network weight information corresponding to each network layer in the residual network model;
performing mapping processing on the speech data according to the network weight information corresponding to each network layer to obtain mapped speech data;
generating the target speech signal based on the mapped speech data and the speech data.
5. The method according to claim 4, characterized in that:
performing feature extraction on the mixed speech signal to obtain the voice feature of the target user and the speech data comprises: performing frequency-domain feature extraction on the mixed speech signal to obtain a frequency-domain voice feature of the target user and frequency-domain speech data;
generating the target speech signal based on the mapped speech data and the speech data comprises: decoding the mapped speech data and the frequency-domain speech data to obtain decoded speech data, and performing waveform reconstruction on the decoded speech data according to the frequency-domain voice feature to obtain the target speech signal.
6. The method according to claim 4, characterized in that:
performing feature extraction on the mixed speech signal to obtain the voice feature of the target user and the speech data comprises: performing time-domain feature extraction on the mixed speech signal to obtain a time-domain voice feature of the target user and time-domain speech data;
generating the target speech signal based on the mapped speech data and the speech data comprises: generating the target speech signal using the mapped speech data and the time-domain speech data.
7. The method according to any one of claims 3 to 6, characterized in that training the residual network model corresponding to the voice feature comprises:
adding a noise signal to an input speech signal to generate a noisy speech signal;
performing feature extraction on the noisy speech signal to obtain a voice feature corresponding to the noisy speech signal;
performing model training according to the noisy speech signal and the speech signal with a preset residual network structure, to generate the residual network model corresponding to the voice feature.
8. An audio processing apparatus, characterized by comprising:
a speech signal obtaining module, configured to obtain an input mixed speech signal;
a speech enhancement module, configured to perform speech enhancement on the mixed speech signal according to a pre-trained residual network model to obtain a target speech signal;
a speech signal output module, configured to output according to the target speech signal.
9. A device, characterized by comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs including instructions for performing the following operations:
obtaining an input mixed speech signal;
performing speech enhancement on the mixed speech signal according to a pre-trained residual network model to obtain a target speech signal;
outputting according to the target speech signal.
10. A readable storage medium, characterized in that when instructions in the storage medium are executed by a processor of a device, the device is enabled to perform the audio processing method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810481272.6A CN110503968A (en) | 2018-05-18 | 2018-05-18 | A kind of audio-frequency processing method, device, equipment and readable storage medium storing program for executing |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110503968A true CN110503968A (en) | 2019-11-26 |
Family
ID=68583983
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810481272.6A Pending CN110503968A (en) | 2018-05-18 | 2018-05-18 | A kind of audio-frequency processing method, device, equipment and readable storage medium storing program for executing |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110503968A (en) |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20110069625A1 (en) * | 2009-09-23 | 2011-03-24 | Avaya Inc. | Priority-based, dynamic optimization of utilized bandwidth |
CN102811310A (en) * | 2011-12-08 | 2012-12-05 | 苏州科达科技有限公司 | Method and system for controlling voice echo cancellation on network video camera |
US20150142446A1 (en) * | 2013-11-21 | 2015-05-21 | Global Analytics, Inc. | Credit Risk Decision Management System And Method Using Voice Analytics |
CN106887225A (en) * | 2017-03-21 | 2017-06-23 | 百度在线网络技术(北京)有限公司 | Acoustic feature extracting method, device and terminal device based on convolutional neural networks |
CN107180628A (en) * | 2017-05-19 | 2017-09-19 | 百度在线网络技术(北京)有限公司 | Set up the method, the method for extracting acoustic feature, device of acoustic feature extraction model |
CN107274906A (en) * | 2017-06-28 | 2017-10-20 | 百度在线网络技术(北京)有限公司 | Voice information processing method, device, terminal and storage medium |
CN107293288A (en) * | 2017-06-09 | 2017-10-24 | 清华大学 | A kind of residual error shot and long term remembers the acoustic model modeling method of Recognition with Recurrent Neural Network |
CN107578775A (en) * | 2017-09-07 | 2018-01-12 | 四川大学 | A kind of multitask method of speech classification based on deep neural network |
CN107818779A (en) * | 2017-09-15 | 2018-03-20 | 北京理工大学 | A kind of infant's crying sound detection method, apparatus, equipment and medium |
CN108010515A (en) * | 2017-11-21 | 2018-05-08 | 清华大学 | A kind of speech terminals detection and awakening method and device |
Non-Patent Citations (1)
Title |
---|
Zhao Dongbin; Shao Kun; Zhu Yuanheng; Li Dong; Chen Yaran; Wang Haitao; Liu Derong; Zhou Tong; Wang Chenghong: "Review of Deep Reinforcement Learning: Also on the Development of Computer Go", Control Theory & Applications, no. 06, 15 June 2016 (2016-06-15) * |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111081223A (en) * | 2019-12-31 | 2020-04-28 | 广州市百果园信息技术有限公司 | Voice recognition method, device, equipment and storage medium |
CN111081223B (en) * | 2019-12-31 | 2023-10-13 | 广州市百果园信息技术有限公司 | Voice recognition method, device, equipment and storage medium |
CN113409803A (en) * | 2020-11-06 | 2021-09-17 | 腾讯科技(深圳)有限公司 | Voice signal processing method, device, storage medium and equipment |
CN113409803B (en) * | 2020-11-06 | 2024-01-23 | 腾讯科技(深圳)有限公司 | Voice signal processing method, device, storage medium and equipment |
CN112820300A (en) * | 2021-02-25 | 2021-05-18 | 北京小米松果电子有限公司 | Audio processing method and device, terminal and storage medium |
CN112820300B (en) * | 2021-02-25 | 2023-12-19 | 北京小米松果电子有限公司 | Audio processing method and device, terminal and storage medium |
WO2022178970A1 (en) * | 2021-02-26 | 2022-09-01 | 平安科技(深圳)有限公司 | Speech noise reducer training method and apparatus, and computer device and storage medium |
WO2022253003A1 (en) * | 2021-05-31 | 2022-12-08 | 华为技术有限公司 | Speech enhancement method and related device |
CN113611318A (en) * | 2021-06-29 | 2021-11-05 | 华为技术有限公司 | Audio data enhancement method and related equipment |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110503968A (en) | A kind of audio-frequency processing method, device, equipment and readable storage medium storing program for executing | |
CN108198569B (en) | Audio processing method, device and equipment and readable storage medium | |
CN109801644B (en) | Separation method, separation device, electronic equipment and readable medium for mixed sound signal | |
CN108346433A (en) | A kind of audio-frequency processing method, device, equipment and readable storage medium storing program for executing | |
CN106464939B (en) | The method and device of play sound effect | |
CN107705783A (en) | A kind of phoneme synthesizing method and device | |
CN103391347B (en) | A kind of method and device of automatic recording | |
CN110097890A (en) | A kind of method of speech processing, device and the device for speech processes | |
US20130211826A1 (en) | Audio Signals as Buffered Streams of Audio Signals and Metadata | |
CN107992485A (en) | A kind of simultaneous interpretation method and device | |
CN111508511A (en) | Real-time sound changing method and device | |
CN111508531B (en) | Audio processing method and device | |
CN110232909A (en) | A kind of audio-frequency processing method, device, equipment and readable storage medium storing program for executing | |
CN105451056B (en) | Audio and video synchronization method and device | |
CN110197677A (en) | A kind of control method for playing back, device and playback equipment | |
WO2021244056A1 (en) | Data processing method and apparatus, and readable medium | |
Zhang et al. | Sensing to hear: Speech enhancement for mobile devices using acoustic signals | |
CN103973955A (en) | Information processing method and electronic device | |
CN108073572A (en) | Information processing method and its device, simultaneous interpretation system | |
US20240096343A1 (en) | Voice quality enhancement method and related device | |
CN110349578A (en) | Equipment wakes up processing method and processing device | |
CN107886963B (en) | A kind of method, apparatus and electronic equipment of speech processes | |
CN109036404A (en) | Voice interactive method and device | |
CN115273831A (en) | Voice conversion model training method, voice conversion method and device | |
CN106782625B (en) | Audio-frequency processing method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
TA01 | Transfer of patent application right | | Effective date of registration: 20220720. Address after: Room 01, 9th floor, Cyber Building, Building 9, Courtyard 1, Zhongguancun East Road, Haidian District, Beijing 100084. Applicant after: BEIJING SOGOU TECHNOLOGY DEVELOPMENT Co.,Ltd. Address before: Room 01, 9th floor, Cyber Building, Building 9, Courtyard 1, Zhongguancun East Road, Haidian District, Beijing 100084. Applicant before: BEIJING SOGOU TECHNOLOGY DEVELOPMENT Co.,Ltd.; SOGOU (HANGZHOU) INTELLIGENT TECHNOLOGY Co.,Ltd.