CN110503968A - Audio processing method, apparatus, device, and readable storage medium - Google Patents


Info

Publication number
CN110503968A
CN110503968A (application CN201810481272.6A)
Authority
CN
China
Prior art keywords
voice signal
signal
speech
residual
voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810481272.6A
Other languages
Chinese (zh)
Inventor
文仕学 (Wen Shixue)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sogou Technology Development Co Ltd
Original Assignee
Beijing Sogou Technology Development Co Ltd
Sogou Hangzhou Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sogou Technology Development Co Ltd and Sogou Hangzhou Intelligent Technology Co Ltd
Priority to CN201810481272.6A
Publication of CN110503968A
Legal status: Pending


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L21/0224 Processing in the time domain
    • G10L21/0232 Processing in the frequency domain
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/27 Speech or voice analysis techniques characterised by the analysis technique

Abstract

Embodiments of the present invention provide an audio processing method, apparatus, device, and readable storage medium. The method comprises: obtaining an input mixed speech signal; performing speech enhancement on the mixed speech signal according to a pre-trained residual network model to obtain a target speech signal; and producing output according to the target speech signal. The embodiments of the present invention can solve the vanishing-gradient problem of existing speech enhancement methods based on traditional neural networks, improving the speech enhancement effect.

Description

Audio processing method, apparatus, device, and readable storage medium
Technical field
The present invention relates to the field of communication technology, and in particular to an audio processing method, an audio processing apparatus, a device, and a readable storage medium.
Background
With the rapid development of communication technology, terminals such as mobile phones and tablet computers have become increasingly widespread, bringing great convenience to people's daily life, study, and work.
These terminals can collect speech signals through a microphone and process the collected speech signals with speech enhancement techniques to reduce the influence of noise interference. Speech enhancement refers to the technology of extracting the useful speech signal from the noise background and suppressing or reducing the noise interference after the speech signal has been disturbed, or even drowned out, by various kinds of noise.
At present, terminals usually perform speech enhancement with methods based on traditional neural networks such as the deep neural network (Deep Neural Network, DNN), the convolutional neural network (Convolutional Neural Network, CNN), and long short-term memory networks (Long Short-Term Memory, LSTM). However, speech enhancement methods based on traditional neural networks suffer from the vanishing-gradient problem. For example, for a fully connected DNN, as the network depth increases, i.e., as the number of network layers grows, the vanishing-gradient problem becomes increasingly severe; when the number of layers reaches about five, the problem is already quite serious. If the number of layers of the DNN is increased further, the speech enhancement performance obtained by performing speech enhancement with that DNN not only fails to improve but may even decline, degrading the speech enhancement effect.
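The vanishing-gradient behavior described above can be illustrated with a minimal sketch, shown below in PyTorch (the framework, layer width, and sigmoid activation are assumptions for illustration; the patent specifies none of them):

    # Minimal illustration (hypothetical): the gradient reaching the first
    # layer of a plain fully connected network shrinks as depth grows.
    import torch
    import torch.nn as nn

    def first_layer_grad_norm(depth: int) -> float:
        layers = []
        for _ in range(depth):
            layers += [nn.Linear(64, 64), nn.Sigmoid()]
        net = nn.Sequential(*layers)
        x = torch.randn(8, 64)
        net(x).pow(2).mean().backward()
        return net[0].weight.grad.norm().item()

    for depth in (2, 5, 10, 20):
        print(depth, first_layer_grad_norm(depth))
    # The printed norms fall off rapidly with depth: the early layers of a
    # deep plain network receive almost no learning signal.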
Summary of the invention
The technical problem to be solved by the embodiments of the present invention is to provide an audio processing method that improves the speech enhancement effect.
Correspondingly, the embodiments of the present invention also provide an audio processing apparatus, a device, and a readable storage medium, to guarantee the implementation and application of the above method.
To solve the above problems, an embodiment of the present invention discloses an audio processing method, comprising: obtaining an input mixed speech signal; performing speech enhancement on the mixed speech signal according to a pre-trained residual network model to obtain a target speech signal; and producing output according to the target speech signal.
Optionally, performing speech enhancement on the mixed speech signal according to the pre-trained residual network model to obtain the target speech signal comprises: performing feature extraction on the mixed speech signal to obtain a speech feature of a target user and speech data, the mixed speech signal including a noise signal and the speech signal of the target user; and, according to the speech feature, performing noise reduction on the speech data through the pre-trained residual network model to obtain the target speech signal corresponding to the target user.
Optionally, the method further comprises: training in advance the residual network model corresponding to each speech feature. Performing noise reduction on the speech data through the pre-trained residual network model according to the speech feature, to obtain the target speech signal corresponding to the target user, comprises: determining, according to the speech feature of the target user, the residual network model corresponding to the target user; and performing noise reduction on the speech data through the residual network model corresponding to the target user to obtain the target speech signal.
Optionally, performing noise reduction on the speech data through the residual network model corresponding to the target user, to obtain the target speech signal, comprises: determining the network weight information corresponding to each network layer in the residual network model; performing mapping processing on the speech data according to the network weight information corresponding to each network layer to obtain mapped speech data; and generating the target speech signal based on the mapped speech data and the speech data.
Optionally, performing feature extraction on the mixed speech signal to obtain the speech feature of the target user and the speech data comprises: performing frequency-domain feature extraction on the mixed speech signal to obtain a frequency-domain speech feature of the target user and frequency-domain speech data. Generating the target speech signal based on the mapped speech data and the speech data comprises: decoding the mapped speech data and the frequency-domain speech data to obtain decoded speech data; and performing waveform reconstruction on the decoded speech data according to the frequency-domain speech feature to obtain the target speech signal.
Optionally, performing feature extraction on the mixed speech signal to obtain the speech feature of the target user and the speech data comprises: performing time-domain feature extraction on the mixed speech signal to obtain a time-domain speech feature of the target user and time-domain speech data. Generating the target speech signal based on the mapped speech data and the speech data comprises: generating the target speech signal using the mapped speech data and the time-domain speech data.
Optionally, training the residual network model corresponding to a speech feature comprises: adding a noise signal to an input speech signal to generate a noisy speech signal; performing feature extraction on the noisy speech signal to obtain the speech feature corresponding to the noisy speech signal; and performing model training according to the noisy speech signal and the speech signal, following a preset residual network structure, to generate the residual network model corresponding to the speech feature.
Optionally, producing output according to the target speech signal comprises: performing voice output according to the target speech signal; and/or performing speech recognition on the target speech signal to generate a recognition result, and outputting the recognition result.
An embodiment of the present invention also discloses an audio processing apparatus, comprising:
a speech signal obtaining module, configured to obtain an input mixed speech signal;
a speech enhancement module, configured to perform speech enhancement on the mixed speech signal according to a pre-trained residual network model to obtain a target speech signal;
a speech signal output module, configured to produce output according to the target speech signal.
Optionally, the speech enhancement module comprises:
a feature extraction submodule, configured to perform feature extraction on the mixed speech signal to obtain a speech feature of a target user and speech data, the mixed speech signal including a noise signal and the speech signal of the target user;
a noise reduction submodule, configured to perform noise reduction on the speech data through the pre-trained residual network model according to the speech feature, to obtain the target speech signal corresponding to the target user.
Optionally, the apparatus may further comprise: a residual network model training module, configured to train in advance the residual network model corresponding to each speech feature;
wherein the noise reduction submodule comprises: a residual network model determination unit, configured to determine, according to the speech feature of the target user, the residual network model corresponding to the target user; and a noise reduction unit, configured to perform noise reduction on the speech data through the residual network model corresponding to the target user to obtain the target speech signal.
Optionally, the noise reduction unit comprises:
a network weight information determination subunit, configured to determine the network weight information corresponding to each network layer in the residual network model;
a mapping processing subunit, configured to perform mapping processing on the speech data according to the network weight information corresponding to each network layer, to obtain mapped speech data;
a target speech signal generation subunit, configured to generate the target speech signal based on the mapped speech data and the speech data.
Optionally, the feature extraction submodule is specifically configured to perform frequency-domain feature extraction on the mixed speech signal to obtain the frequency-domain speech feature of the target user and frequency-domain speech data;
and the target speech signal generation subunit is specifically configured to decode the mapped speech data and the frequency-domain speech data to obtain decoded speech data, and to perform waveform reconstruction on the decoded speech data according to the frequency-domain speech feature to obtain the target speech signal.
Optionally, the feature extraction submodule is specifically configured to perform time-domain feature extraction on the mixed speech signal to obtain the time-domain speech feature of the target user and time-domain speech data;
and the target speech signal generation subunit is specifically configured to generate the target speech signal using the mapped speech data and the time-domain speech data.
Optionally, the residual network model training module comprises:
a noise addition submodule, configured to add a noise signal to an input speech signal to generate a noisy speech signal;
a feature extraction submodule, configured to perform feature extraction on the noisy speech signal to obtain the speech feature corresponding to the noisy speech signal;
a model training submodule, configured to perform model training according to the noisy speech signal and the speech signal, following a preset residual network structure, to generate the residual network model corresponding to the speech feature.
Optionally, the speech signal output module comprises:
a voice output submodule, configured to perform voice output according to the target speech signal; and/or
a speech recognition submodule, configured to perform speech recognition on the target speech signal to generate a recognition result, and to output the recognition result.
An embodiment of the present invention also discloses a device, comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs including instructions for performing the following operations: obtaining an input mixed speech signal; performing speech enhancement on the mixed speech signal according to a pre-trained residual network model to obtain a target speech signal; and producing output according to the target speech signal.
Optionally, performing speech enhancement on the mixed speech signal according to the pre-trained residual network model to obtain the target speech signal comprises: performing feature extraction on the mixed speech signal to obtain a speech feature of a target user and speech data, the mixed speech signal including a noise signal and the speech signal of the target user; and, according to the speech feature, performing noise reduction on the speech data through the pre-trained residual network model to obtain the target speech signal corresponding to the target user.
Optionally, the one or more programs executed by the one or more processors further include instructions for performing the following operation: training in advance the residual network model corresponding to each speech feature. Performing noise reduction on the speech data through the pre-trained residual network model according to the speech feature, to obtain the target speech signal corresponding to the target user, comprises: determining, according to the speech feature of the target user, the residual network model corresponding to the target user; and performing noise reduction on the speech data through the residual network model corresponding to the target user to obtain the target speech signal.
Optionally, performing noise reduction on the speech data through the residual network model corresponding to the target user, to obtain the target speech signal, comprises: determining the network weight information corresponding to each network layer in the residual network model; performing mapping processing on the speech data according to the network weight information corresponding to each network layer to obtain mapped speech data; and generating the target speech signal based on the mapped speech data and the speech data.
Optionally, performing feature extraction on the mixed speech signal to obtain the speech feature of the target user and the speech data comprises: performing frequency-domain feature extraction on the mixed speech signal to obtain a frequency-domain speech feature of the target user and frequency-domain speech data. Generating the target speech signal based on the mapped speech data and the speech data comprises: decoding the mapped speech data and the frequency-domain speech data to obtain decoded speech data; and performing waveform reconstruction on the decoded speech data according to the frequency-domain speech feature to obtain the target speech signal.
Optionally, performing feature extraction on the mixed speech signal to obtain the speech feature of the target user and the speech data comprises: performing time-domain feature extraction on the mixed speech signal to obtain a time-domain speech feature of the target user and time-domain speech data. Generating the target speech signal based on the mapped speech data and the speech data comprises: generating the target speech signal using the mapped speech data and the time-domain speech data.
Optionally, training the residual network model corresponding to a speech feature comprises: adding a noise signal to an input speech signal to generate a noisy speech signal; performing feature extraction on the noisy speech signal to obtain the speech feature corresponding to the noisy speech signal; and performing model training according to the noisy speech signal and the speech signal, following a preset residual network structure, to generate the residual network model corresponding to the speech feature.
Optionally, producing output according to the target speech signal comprises: performing voice output according to the target speech signal; and/or performing speech recognition on the target speech signal to generate a recognition result, and outputting the recognition result.
An embodiment of the present invention also discloses a readable storage medium; when the instructions in the storage medium are executed by a processor of a device, the device is enabled to perform the audio processing method described in one or more of the embodiments of the present invention.
The embodiments of the present invention include the following advantages:
The embodiments of the present invention can perform speech enhancement on the obtained mixed speech signal through a pre-trained residual network model, thereby avoiding the degradation of the speech enhancement effect caused by increasing the network depth; that is, they solve the vanishing-gradient problem of existing speech enhancement methods based on traditional neural networks and improve the speech enhancement effect.
Brief description of the drawings
Fig. 1 is a flowchart of the steps of an audio processing method embodiment of the present invention;
Fig. 2 is a schematic diagram of performing speech enhancement with a pre-trained residual network model in an example of the present invention;
Fig. 3 is a flowchart of the steps of an alternative audio processing method embodiment of the present invention;
Fig. 4 is a schematic diagram of collected mixed speech in an example of the present invention;
Fig. 5 is a structural block diagram of an audio processing apparatus embodiment of the present invention;
Fig. 6 is a structural block diagram of a device for audio processing according to an exemplary embodiment;
Fig. 7 is a structural schematic diagram of a device in an embodiment of the present invention.
Detailed description of the embodiments
To make the above objects, features, and advantages of the present invention clearer and easier to understand, the present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.
At present, existing speech enhancement methods usually use traditional neural networks for model training and perform speech enhancement based on the trained neural network model. Network depth is a major factor affecting the performance of traditional neural networks. As the network depth keeps increasing, traditional neural networks exhibit the vanishing-gradient problem, which becomes increasingly severe with depth, causes the trained neural network model to produce poor target speech signals, and affects the speech enhancement effect.
One of the core ideas of the embodiments of the present invention is to provide a new audio processing method that performs speech enhancement on the input mixed speech signal with a residual network model, thereby solving the vanishing-gradient problem of existing speech enhancement methods based on traditional neural networks and improving the speech enhancement effect.
Referring to Fig. 1, a flowchart of the steps of an audio processing method embodiment of the present invention is shown; the method may specifically include the following steps:
Step 102: obtain an input mixed speech signal.
In the embodiment of the present invention, the input mixed speech signal can be obtained during voice input. The mixed speech signal may include the speech signal that needs speech enhancement, and may specifically include the speech signal of a target user, a noise signal, and so on. The speech signal of the target user may be the clean speech signal of the target user speaking, such as the time-domain signal corresponding to the target speaker's voice; the noise signal may be the signal corresponding to interference noise, such as the time-domain signal corresponding to interfering speech of other speakers, which is not limited by the embodiment of the present invention.
Step 104: perform speech enhancement on the mixed speech signal according to a pre-trained residual network model to obtain a target speech signal.
In the embodiment of the present invention, the obtained mixed speech signal can serve as the input of the pre-trained residual network model; that is, the obtained mixed speech signal can be fed into the pre-trained residual network model, so that the residual network model performs speech enhancement on the obtained mixed speech signal and removes the interference noise in it, yielding the speech-enhanced target speech signal. The target speech signal may include only the clean speech of the target user, and can characterize the signal corresponding to the target user's clean speech, such as the clean speech signal corresponding to the target speaker's voice.
In an optional embodiment, after the mixed speech signal is obtained, feature extraction can be performed on the mixed speech signal to obtain a speech feature and speech data. The speech data may be the noisy speech data remaining after speech feature extraction, and may specifically include noise data and the target speech data that needs to be retained. Then, according to the speech feature, noise reduction can be performed on the speech data through the pre-trained residual network model to obtain the target speech signal after speech enhancement. It should be noted that the speech feature may include a time-domain speech feature and/or a frequency-domain speech feature, which is not limited by the embodiment of the present invention. The time-domain speech feature can be used to characterize the speech feature in the time domain, and the frequency-domain speech feature can be used to characterize the speech feature in the frequency domain.
Step 106: produce output according to the target speech signal.
In the embodiment of the present invention, after the speech-enhanced target speech signal is obtained, output can be produced according to the target speech signal. For example, voice output can be performed according to the target speech signal to play back the clean speech spoken by the user; as another example, speech recognition can be performed according to the target speech signal to recognize the clean speech spoken by the user, and the recognized clean speech can be converted into text information, which is then output, for example by displaying the text on the device screen or displaying search results corresponding to the text.
In summary, the embodiments of the present invention can perform speech enhancement on the obtained mixed speech signal through a pre-trained residual network model, thereby avoiding the degradation of the speech enhancement effect caused by increasing the network depth; that is, they solve the vanishing-gradient problem of existing speech enhancement methods based on traditional neural networks and improve the speech enhancement effect.
In concrete implementation, model training can be performed in advance according to the residual network structure, based on the speech features of speech signals, so as to train the residual network models corresponding to various speech features; subsequently, speech enhancement can be performed with the pre-trained residual network model selected according to the speech feature, guaranteeing the speech enhancement effect. Optionally, the audio processing method of the embodiment of the present invention may further include: training in advance the residual network model corresponding to each speech feature.
Specifically, in the model training stage, a noise signal can be added to the input speech signal to generate a noisy speech signal, and feature extraction can be performed according to the noisy speech signal to obtain the corresponding speech feature; then, for the obtained speech feature, model training can be performed with the generated noisy speech signal according to a preset residual network structure, generating the residual network model corresponding to that speech feature. The input speech signal may refer to a clean speech signal, and may specifically include a collected clean speech signal and/or a pre-synthesized clean speech signal; for example, it may be the clean speech signal currently obtained in real time during voice input, the time-domain signal of a prerecorded segment of clean speech, or the time-domain signal of a pre-synthesized segment of clean speech, which is not limited by the embodiment of the present invention.
In an optional embodiment of the present invention, training the residual network model corresponding to a speech feature may specifically include: adding a noise signal to the input speech signal to generate a noisy speech signal; performing feature extraction on the noisy speech signal to obtain the speech feature corresponding to the noisy speech signal; and performing model training according to the noisy speech signal and the speech signal, following the preset residual network structure, to generate the residual network model corresponding to the speech feature. The preset residual network structure may be configured in advance according to the network structure of residual networks, which is not limited by the embodiment of the present invention.
Specifically, noise can be added to the input clean speech signal in the training stage; that is, a noise signal can be added to the input speech signal to generate a noisy speech signal. The noise signal may include a simulated noise signal, a noise signal collected in advance, and so on. The simulated noise signal can be used to characterize noise synthesized in advance by speech synthesis techniques; the noise signal collected in advance can be used to characterize real noise collected beforehand, such as a prerecorded noise signal.
As an example of the present invention, when real noise has not been collected, the pre-synthesized simulated noise signal can be used to add noise to the input speech signal, and model training can then be performed with the noisy speech signal generated by the noise addition, avoiding the high model training cost of collecting large amounts of real noise and thus reducing the training cost. Of course, when real noise has been collected, the noise signal corresponding to the collected real noise can also be used to add noise to the input speech signal; for example, the collected noise signal can be used for the noise addition, or partly collected real noise and partly synthesized simulated noise can be used together, and so on, which is not specifically limited in this example. A sketch of this noise-addition step appears below.
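As a minimal sketch of the noise-addition step (the signal-to-noise-ratio parameter and the NumPy implementation are assumptions for illustration; the patent only states that a noise signal is added to the clean speech):

    # Hypothetical sketch: mix a noise recording into clean speech at a
    # chosen SNR to build a (noisy input, clean target) training pair.
    import numpy as np

    def add_noise(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
        """Return `clean` corrupted by `noise` at `snr_db` dB SNR."""
        noise = np.resize(noise, clean.shape)      # repeat/trim to match length
        p_clean = np.mean(clean ** 2)
        p_noise = np.mean(noise ** 2) + 1e-12
        scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
        return clean + scale * noise

    # clean, noise = load_wav("clean.wav"), load_wav("noise.wav")  # hypothetical loader
    # noisy = add_noise(clean, noise, snr_db=5.0)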
Then, feature extraction can be performed according to the noisy speech signal after the noise signal has been added, obtaining the corresponding speech feature, so that model training with the residual network can incorporate the voiceprint feature of the speech and produce the residual network model corresponding to that speech feature. Specifically, as shown in Fig. 2, for each obtained speech feature, model training can be performed, according to the preset residual network structure, with the generated noisy speech signal and the input speech signal, so as to train the residual network model corresponding to each speech feature. The residual network model may include at least three network layers. During model training, the output of each network layer serves not only as the input of the next network layer but can also be fed across layers into other network layers; for example, the output of the first network layer can serve as the input of the second network layer and also as an input of the third network layer, and/or it can be fed into deeper network layers. Updating the weight parameters of each network layer in the residual network model in this way alleviates the shrinking of gradients and thereby solves the vanishing-gradient problem. A sketch of such a structure is given below.
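As a minimal sketch of such a residual structure, each block carries its input across its layers through a skip connection (the layer widths, block count, and PyTorch framework are assumptions; the patent only requires at least three layers with cross-layer connections):

    # Hypothetical sketch: each block outputs F(x) + x via a skip connection.
    import torch
    import torch.nn as nn

    class ResidualBlock(nn.Module):
        def __init__(self, dim: int):
            super().__init__()
            self.body = nn.Sequential(
                nn.Linear(dim, dim), nn.ReLU(),
                nn.Linear(dim, dim),
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return x + self.body(x)   # skip connection: H(x) = F(x) + x

    class ResidualEnhancer(nn.Module):
        """Maps noisy speech features to enhanced speech features."""
        def __init__(self, feat_dim: int = 257, num_blocks: int = 8):
            super().__init__()
            self.blocks = nn.Sequential(
                *[ResidualBlock(feat_dim) for _ in range(num_blocks)])

        def forward(self, noisy_feats: torch.Tensor) -> torch.Tensor:
            return self.blocks(noisy_feats)

Training such a model would minimize a regression loss, for example the mean squared error between the model's output on noisy-speech features and the corresponding clean-speech features.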
Thus, in the speech enhancement stage, i.e., when speech enhancement is performed with the trained residual network model, the residual network model currently to be used can be determined based on the speech feature; noise reduction can then be performed, through the determined residual network model, on the speech data obtained by feature extraction, as shown in Fig. 2, yielding the target speech signal, which is then output. The speech data may be generated by performing feature extraction on the input mixed speech signal; for example, it may be the frequency-domain speech data obtained by frequency-domain feature extraction on the mixed speech signal, or the time-domain speech data obtained by time-domain feature extraction on the mixed speech signal, which is not limited by the embodiment of the present invention.
Referring to Fig. 3, a flowchart of the steps of an alternative audio processing method embodiment of the present invention is shown; the method may specifically include the following steps:
Step 302: obtain an input mixed speech signal.
Step 304: perform feature extraction on the mixed speech signal to obtain the speech feature of the target user and speech data, the mixed speech signal including a noise signal and the speech signal of the target user.
Specifically, in the speech enhancement stage, after the input mixed speech signal is detected, the currently detected mixed speech signal can be determined to be the signal requiring speech enhancement processing, and the input mixed speech signal can be obtained, so that the corresponding speech enhancement task is executed based on the obtained mixed speech signal. During execution of the speech enhancement task, feature extraction can be performed on the obtained mixed speech signal to obtain the speech feature of the target user and the speech data. The mixed speech signal may include the speech signal of the target user together with the noise signal that needs to be removed, such as the clean speech signal corresponding to the target user speaking and the interfering speech signals corresponding to other users speaking.
Step 306: according to the speech feature, perform noise reduction on the speech data through the pre-trained residual network model to obtain the target speech signal corresponding to the target user.
In concrete implementation, the residual network models trained for different speech features may differ. For example, the residual network model trained for the speech feature of user A can suppress the interfering speech signals of other users, such as the interfering speech of user B speaking, while retaining the speech signal of user A speaking, achieving the purpose of enhancing the speech signal of user A; the residual network model trained for the speech feature of user B can suppress the interfering speech of other users, such as user A, while retaining the speech signal of user B speaking, achieving the purpose of enhancing the speech signal of user B. Therefore, before noise reduction, voiceprint recognition technology can be combined to determine, according to the speech feature of the target user, the residual network model currently to be used, so that noise reduction is performed on the speech data through the residual network model corresponding to the target user's speech feature, yielding the target speech signal corresponding to the target user.
In an optional embodiment of the present invention, performing noise reduction on the speech data through the pre-trained residual network model according to the speech feature, to obtain the target speech signal corresponding to the target user, may include: determining, according to the speech feature of the target user, the residual network model corresponding to the target user; and performing noise reduction on the speech data through the residual network model corresponding to the target user to obtain the target speech signal. The residual network model corresponding to the target user may be the residual network model trained in advance for the speech feature of that target user.
Specifically, after obtaining the speech feature of the target user, the embodiment of the present invention can determine, based on the speech feature of the target user, the residual network model trained in advance for that speech feature; noise reduction can then be performed on the speech data through the determined residual network model, so as to remove the noise data in the speech data while retaining the target speech data contained in it, and the speech-enhanced target speech signal can then be generated based on the retained target speech data. The target speech data can be used to characterize the speech signal of the target user; for example, it may be the frequency-domain data of the clean speech spoken by the target user, or the time-domain data of that clean speech. A sketch of the per-user model selection appears below.
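As a minimal sketch of the per-user model selection (the embedding table and cosine-similarity lookup are assumptions for illustration; the patent only states that voiceprint recognition determines the model):

    # Hypothetical sketch: a voiceprint feature selects the pre-trained
    # residual model of the best-matching enrolled user.
    import torch

    def select_model(voiceprint: torch.Tensor,
                     registry: dict[str, tuple[torch.Tensor, torch.nn.Module]]) -> torch.nn.Module:
        """Return the model whose enrolled voiceprint best matches `voiceprint`."""
        best_user, best_score = None, float("-inf")
        for user, (enrolled, _model) in registry.items():
            score = torch.cosine_similarity(voiceprint, enrolled, dim=0).item()
            if score > best_score:
                best_user, best_score = user, score
        return registry[best_user][1]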
In the embodiment of the present invention, optionally, performing noise reduction on the speech data through the residual network model corresponding to the target user, to obtain the target speech signal, may specifically include: determining the network weight information corresponding to each network layer in the residual network model; performing mapping processing on the speech data according to the network weight information corresponding to each network layer, to obtain mapped speech data; and generating the target speech signal based on the mapped speech data and the speech data.
Specifically, after determining the residual network model, the embodiment of the present invention can determine, based on the residual network structure corresponding to that model, the network weight information corresponding to each network layer in the residual network model; mapping processing can then be performed on the speech data according to the network weight information corresponding to each network layer, i.e., noise reduction is performed on the speech data fed into each network layer according to that layer's network weight information, so as to remove the noise data contained in the speech data and obtain the mapped speech data. The network weight information can be used to determine the mapping relationship between the speech data and the mapped speech data. The mapped speech data can be used to characterize the clean speech signal obtained by removing the noise signal; for example, it can characterize the time-domain signal of the clean speech with the noise removed, or the frequency-domain signal of that clean speech, which is not limited by the embodiment of the present invention. After the mapped speech data is obtained, the embodiment of the present invention can process the mapped speech data and the speech data according to the residual network structure to generate the target speech signal corresponding to the target user.
As an example of the present invention, after the speech data x is obtained, x can be mapped according to the network weight information of each network layer in the residual network model to obtain the mapped speech data F(x); then, according to the residual network structure, the target speech data H(x) = F(x) + x can be generated from the mapped speech data F(x) and the speech data x. For example, when the speech data x is 5: if the generated target speech data H(x) is 5.1, the corresponding mapped speech data F(x) is 0.1; if the generated H(x) is 5.2, the corresponding F(x) is 0.2. That is, when H(x) changes from 5.1 to 5.2, F(x) changes from 0.1 to 0.2, a 100% change, so that small changes relative to the speech data x are made prominent. This clearly brings out the corrective effect of the network weight information, so the noise data in the speech data can be suppressed better, improving the speech enhancement effect.
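A short numeric restatement of this example under the residual identity H(x) = F(x) + x (the numbers are those given above):

    # Under H(x) = F(x) + x, the residual F(x) = H(x) - x amplifies small
    # changes in the output relative to the input.
    x = 5.0
    H1, H2 = 5.1, 5.2
    F1, F2 = H1 - x, H2 - x        # residuals: 0.1 and 0.2
    print((H2 - H1) / H1)          # ~0.0196: under a 2% change in H(x)
    print((F2 - F1) / F1)          # ~1.0: a 100% change in F(x)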
In one implementation of the embodiments of the present invention, frequency-domain feature extraction can be performed on the input mixed speech signal to obtain the frequency-domain speech feature and the corresponding frequency-domain speech data, and speech enhancement processing can be carried out in the frequency domain according to the frequency-domain speech feature and the frequency-domain speech data. Optionally, performing feature extraction on the mixed speech signal to obtain the speech feature of the target user and the speech data may include: performing frequency-domain feature extraction on the mixed speech signal to obtain the frequency-domain speech feature of the target user and frequency-domain speech data. Speech enhancement can thus be performed on the input mixed speech signal in the frequency domain through the pre-trained residual network model, applied to the frequency-domain speech data according to the obtained frequency-domain speech feature; that is, the speech enhancement task can be completed in the frequency domain.
The frequency-domain speech data can be used to characterize the noisy speech data in the frequency domain, and may include the noise data and the target speech data in the frequency domain. Optionally, generating the target speech signal based on the mapped speech data and the speech data comprises: decoding the mapped speech data and the frequency-domain speech data to obtain decoded speech data; and performing waveform reconstruction on the decoded speech data according to the frequency-domain speech feature to obtain the target speech signal. Specifically, after the frequency-domain speech data is obtained, mapping processing can be performed on the frequency-domain speech data, based on the frequency-domain speech feature and according to the network weight information corresponding to each network layer, so as to remove the noise data in the frequency-domain speech data and obtain the mapped speech data; the mapped speech data and the frequency-domain speech data can then be decoded to obtain the corresponding decoded speech data, after which the time-domain waveform corresponding to the decoded speech data can be reconstructed in combination with the speech feature of the target user, so that the target speech signal output according to that time-domain waveform carries the speech feature of the target user. In other words, waveform reconstruction is performed on the decoded speech data according to the extracted frequency-domain speech feature, generating the time-domain target speech signal, which carries the speech feature of the target user, guaranteeing the listening quality after speech enhancement and improving the user experience. A sketch of this frequency-domain pipeline appears below.
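As a minimal sketch of the frequency-domain pipeline (the STFT magnitude/phase split and the reuse of the noisy phase for waveform reconstruction are assumptions for illustration; the patent only describes frequency-domain extraction, mapping, decoding, and waveform reconstruction):

    # Hypothetical sketch: STFT features in, residual mapping on the
    # magnitudes, inverse STFT with the noisy phase for reconstruction.
    import torch

    def enhance_frequency_domain(mixed: torch.Tensor,
                                 model: torch.nn.Module,
                                 n_fft: int = 512) -> torch.Tensor:
        window = torch.hann_window(n_fft)
        with torch.no_grad():
            spec = torch.stft(mixed, n_fft, window=window, return_complex=True)
            mag, phase = spec.abs(), spec.angle()   # frequency-domain features
            enhanced_mag = model(mag.T).T           # residual mapping per frame
            # Waveform reconstruction: enhanced magnitude + noisy phase
            enhanced_spec = torch.polar(enhanced_mag, phase)
            return torch.istft(enhanced_spec, n_fft, window=window,
                               length=mixed.shape[-1])

With n_fft = 512 the per-frame feature size is 257, matching the feat_dim of the ResidualEnhancer sketch above.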
Of course, based on the residual network model, the embodiments of the present invention can also perform speech enhancement on the mixed speech signal in other ways, for example by performing speech enhancement processing on the mixed speech signal in the time domain. In an optional implementation of the present invention, performing feature extraction on the mixed speech signal to obtain the speech feature and the speech data comprises: performing time-domain feature extraction on the mixed speech signal to obtain a time-domain speech feature and time-domain speech data. Speech enhancement can thus be performed on the input mixed speech signal in the time domain through the pre-trained residual network model, applied to the time-domain speech data according to the obtained time-domain speech feature; that is, the speech enhancement task can be completed in the time domain.
The time-domain speech data can be used to characterize the noisy speech data in the time domain, and may include the noise data and the target speech data in the time domain. Optionally, generating the target speech signal based on the mapped speech data and the speech data comprises: generating the target speech signal using the mapped speech data and the time-domain speech data. Specifically, after the time-domain speech feature is extracted, mapping processing can be performed on the time-domain speech data, based on the time-domain speech feature and according to the network weight information corresponding to each network layer, so as to remove the noise data in the time-domain speech data and obtain the mapped speech data; speech processing can then be performed on the mapped speech data and the time-domain speech data in combination with the time-domain speech feature, generating the target speech signal corresponding to the target user, so that the target speech signal carries the speech feature of the target user, guaranteeing the listening quality after speech enhancement and improving the speech quality after enhancement. A sketch of this time-domain variant appears below.
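As a minimal sketch of the time-domain variant (the framing parameters and overlap-add reconstruction are assumptions for illustration; the patent only states that enhancement can be completed in the time domain):

    # Hypothetical sketch: the residual model maps framed waveform samples
    # directly; enhanced frames are recombined by overlap-add.
    import torch

    def enhance_time_domain(mixed: torch.Tensor, model: torch.nn.Module,
                            frame: int = 512, hop: int = 256) -> torch.Tensor:
        with torch.no_grad():
            frames = mixed.unfold(0, frame, hop)    # (n_frames, frame)
            enhanced = model(frames)                # residual mapping per frame
            out = torch.zeros_like(mixed)
            norm = torch.zeros_like(mixed)
            for i, f in enumerate(enhanced):        # overlap-add
                out[i * hop:i * hop + frame] += f
                norm[i * hop:i * hop + frame] += 1.0
            return out / norm.clamp(min=1.0)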
Step 308: produce output according to the target speech signal.
In an optional embodiment, producing output according to the target speech signal may include: performing voice output according to the target speech signal. Specifically, the embodiment of the present invention can be applied in voice conversation products used in noisy environments, for example in a phone watch in a voice call scenario, so that both parties to the call hear only the clean speech of the main speaker they care about. For example, when a parent uses a phone watch to call a child who is out at an activity, with the audio processing method provided by the embodiment of the present invention the parent can hear only the clear voice of their own child, reducing the influence of other children speaking, i.e., reducing the influence of noise interference.
Of course, the embodiment of the present invention can also be applied in other scenarios, such as voice input scenarios and speech recognition scenarios, which is not limited by the embodiment of the present invention.
In another optional embodiment, producing output according to the target speech signal may include: performing speech recognition on the target speech signal to generate a recognition result; and outputting the recognition result.
For example, the target speaker's speech is the sentence in the first dashed box 41 in Fig. 4: "Hello everyone, I am Li XX, and I am very glad to meet you all." The noise is birdsong, as in the second dashed box 42 in Fig. 4: "chirp chirp, chirp chirp." As shown in Fig. 4, the speech spoken by the target speaker and the noise (the birdsong) overlap heavily on the time axis. At the beginning there is no birdsong yet, so the opening words "Hello everyone" spoken by the target speaker are not disturbed and can be heard clearly; but the following part spoken by the target speaker, "I am Li XX," is interfered with by the birdsong, so it may not be heard clearly. Here, the audio processing method provided by the embodiment of the present invention, for example using an end-to-end speech enhancement model, can remove the interfering birdsong and leave only the target speech, "Hello everyone, I am Li XX, and I am very glad to meet you all," achieving the purpose of speech enhancement.
Then, speech recognition can be performed with the speech-enhanced target speech signal, i.e., with the clean speech of the target speaker, so as to recognize the speech spoken by the target speaker. Following the above example, speech recognition can be performed on the target speech output by the speech enhancement model, "Hello everyone, I am Li XX, and I am very glad to meet you all," thereby improving the speech recognition effect. Output can then be produced according to the recognition result, for example by outputting the text corresponding to the recognized speech, "Hello everyone, I am Li XX, and I am very glad to meet you all," a personal photograph of "Li XX," and so on.
In summary, the embodiments of the present invention introduce the residual network structure into the speech enhancement task, so as to solve the vanishing-gradient problem in the speech enhancement task; residual network models with greater network depth can then be trained, and speech enhancement can be performed with those residual network models, improving the speech enhancement effect.
It should be noted that, for simplicity of description, the method embodiments are all described as a series of action combinations, but those skilled in the art should understand that the embodiments of the present invention are not limited by the described order of actions, because according to the embodiments of the present invention some steps may be performed in other orders or simultaneously. Secondly, those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and the actions involved are not necessarily required by the embodiments of the present invention.
Referring to Fig. 5, a structural block diagram of an audio processing apparatus embodiment of the present invention is shown; the apparatus may specifically include the following modules:
a speech signal obtaining module 510, configured to obtain an input mixed speech signal;
a speech enhancement module 520, configured to perform speech enhancement on the mixed speech signal according to a pre-trained residual network model to obtain a target speech signal;
a speech signal output module 530, configured to produce output according to the target speech signal.
In an optional embodiment of the present invention, the speech enhancement module 520 may include the following submodules:
a feature extraction submodule, configured to perform feature extraction on the mixed speech signal to obtain a speech feature of a target user and speech data, the mixed speech signal including a noise signal and the speech signal of the target user;
a noise reduction submodule, configured to perform noise reduction on the speech data through the pre-trained residual network model according to the speech feature, to obtain the target speech signal corresponding to the target user.
In the embodiment of the present invention, optionally, the apparatus may also include a residual network model training module, configured to train in advance the residual network model corresponding to each speech feature. The noise reduction submodule includes the following units:
a residual network model determination unit, configured to determine, according to the speech feature of the target user, the residual network model corresponding to the target user;
a noise reduction unit, configured to perform noise reduction on the speech data through the residual network model corresponding to the target user, to obtain the target speech signal.
In an optional embodiment of the present invention, the noise reduction unit may include the following subunits:
a network weight information determination subunit, configured to determine the network weight information corresponding to each network layer in the residual network model;
a mapping processing subunit, configured to perform mapping processing on the speech data according to the network weight information corresponding to each network layer, to obtain mapped speech data;
a target speech signal generation subunit, configured to generate the target speech signal based on the mapped speech data and the speech data.
In an optional embodiment of the present invention, the feature extraction submodule is specifically configured to perform frequency-domain feature extraction on the mixed speech signal to obtain the frequency-domain speech feature of the target user and frequency-domain speech data. The target speech signal generation subunit is specifically configured to decode the mapped speech data and the frequency-domain speech data to obtain decoded speech data, and to perform waveform reconstruction on the decoded speech data according to the frequency-domain speech feature to obtain the target speech signal.
In another optional embodiment of the present invention, the feature extraction submodule is specifically configured to perform time-domain feature extraction on the mixed speech signal to obtain the time-domain speech feature of the target user and time-domain speech data. The target speech signal generation subunit is specifically configured to generate the target speech signal using the mapped speech data and the time-domain speech data.
In an optional embodiment of the present invention, the residual network model training module may include the following submodules:
a noise addition submodule, configured to add a noise signal to an input speech signal to generate a noisy speech signal;
a feature extraction submodule, configured to perform feature extraction on the noisy speech signal to obtain the speech feature corresponding to the noisy speech signal;
a model training submodule, configured to perform model training according to the noisy speech signal and the speech signal, following a preset residual network structure, to generate the residual network model corresponding to the speech feature.
In an optional embodiment of the present invention, the speech signal output module 530 may include the following submodules:
a voice output submodule, configured to perform voice output according to the target speech signal; and/or
a speech recognition submodule, configured to perform speech recognition on the target speech signal to generate a recognition result, and to output the recognition result.
As for the apparatus embodiment, since it is basically similar to the method embodiment, its description is relatively simple; for relevant parts, refer to the description of the method embodiment.
Fig. 6 is a structural block diagram of a device 600 for audio processing according to an exemplary embodiment. For example, the device 600 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, fitness equipment, a personal digital assistant, a server, and so on.
Referring to Fig. 6, the device 600 may include one or more of the following components: a processing component 602, a memory 604, a power component 606, a multimedia component 608, an audio component 610, an input/output (I/O) interface 612, a sensor component 614, and a communication component 616.
Processing component 602 usually control equipment 600 integrated operation, such as with display, telephone call, data communication, phase Machine operation and record operate associated operation.Processing component 602 may include that one or more processors 620 refer to execute It enables, to perform all or part of the steps of the methods described above.In addition, processing component 602 may include one or more modules, just Interaction between processing component 602 and other assemblies.For example, processing component 602 may include multi-media module, it is more to facilitate Interaction between media component 608 and processing component 602.
The memory 604 is configured to store various types of data to support operation of the device 600. Examples of such data include instructions for any application or method operated on the device 600, contact data, phone book data, messages, pictures, videos, and the like. The memory 604 may be implemented by any type of volatile or non-volatile storage device, or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disk.
The power component 606 supplies power to the various components of the device 600. The power component 606 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the device 600.
The multimedia component 608 includes a screen that provides an output interface between the device 600 and the user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensors may sense not only the boundary of a touch or swipe action, but also the duration and pressure associated with the touch or swipe operation. In some embodiments, the multimedia component 608 includes a front camera and/or a rear camera. When the device 600 is in an operating mode, such as a photographing mode or a video mode, the front camera and/or the rear camera can receive external multimedia data. Each front or rear camera may be a fixed optical lens system or have focusing and optical zoom capability.
The audio component 610 is configured to output and/or input audio signals. For example, the audio component 610 includes a microphone (MIC), which is configured to receive external audio signals when the device 600 is in an operating mode, such as a call mode, a recording mode, or a voice recognition mode. The received audio signal may be further stored in the memory 604 or transmitted via the communication component 616. In some embodiments, the audio component 610 further includes a speaker for outputting audio signals.
The I/O interface 612 provides an interface between the processing component 602 and peripheral interface modules, which may be a keyboard, a click wheel, buttons, or the like. These buttons may include, but are not limited to, a home button, volume buttons, a start button, and a lock button.
The sensor component 614 includes one or more sensors for providing status assessments of various aspects of the device 600. For example, the sensor component 614 can detect the open/closed state of the device 600 and the relative positioning of components, such as the display and keypad of the device 600; the sensor component 614 can also detect a change in position of the device 600 or of a component of the device 600, the presence or absence of user contact with the device 600, the orientation or acceleration/deceleration of the device 600, and a change in temperature of the device 600. The sensor component 614 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor component 614 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 614 may also include an accelerometer, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 616 is configured to facilitate wired or wireless communication between the device 600 and other devices. The device 600 can access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In one exemplary embodiment, the communication component 616 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communication component 616 further includes a near field communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the device 600 may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components, for performing the above methods.
In an exemplary embodiment, there is also provided a non-transitory computer-readable storage medium including instructions, such as the memory 604 including instructions, where the instructions are executable by the processor 620 of the device 600 to perform the above methods. For example, the non-transitory computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
There is provided a non-transitory computer-readable storage medium: when the instructions in the storage medium are executed by a processor of a device, the device is enabled to perform an audio processing method, the method comprising: obtaining an input mixed speech signal; performing speech enhancement on the mixed speech signal according to a pre-trained residual network model to obtain a target speech signal; and performing output according to the target speech signal.
Fig. 7 is a structural schematic diagram of a device according to an embodiment of the present invention. The device 700 may vary considerably depending on configuration or performance, and may include one or more central processing units (CPUs) 722 (for example, one or more processors), a memory 732, and one or more storage media 730 (for example, one or more mass storage devices) storing application programs 742 or data 744. The memory 732 and the storage medium 730 may provide transient or persistent storage. A program stored in the storage medium 730 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations on the device. Further, the central processing unit 722 may be configured to communicate with the storage medium 730 and to execute, on the device 700, the series of instruction operations in the storage medium 730.
The device 700 may also include one or more power supplies 726, one or more wired or wireless network interfaces 750, one or more input/output interfaces 758, one or more keyboards 756, and/or one or more operating systems 741, such as Windows Server™, Mac OS X™, Unix™, Linux™, or FreeBSD™.
In an exemplary embodiment, the device is configured to execute, by one or more processors, one or more programs that include instructions for performing the following operations: obtaining an input mixed speech signal; performing speech enhancement on the mixed speech signal according to a pre-trained residual network model to obtain a target speech signal; and performing output according to the target speech signal.
Optionally, performing speech enhancement on the mixed speech signal according to the pre-trained residual network model to obtain the target speech signal comprises: performing feature extraction on the mixed speech signal to obtain a speech feature and speech data of a target user, wherein the mixed speech signal includes a noise signal and the speech signal of the target user; and performing, according to the speech feature, noise reduction processing on the speech data through the pre-trained residual network model to obtain the target speech signal corresponding to the target user.
Optionally, the one or more programs executed by the one or more processors further include instructions for performing the following operation: pre-training the residual network model corresponding to the speech feature. Here, performing, according to the speech feature, noise reduction processing on the speech data through the pre-trained residual network model to obtain the target speech signal corresponding to the target user comprises: determining, according to the speech feature of the target user, the residual network model corresponding to the target user; and performing noise reduction processing on the speech data through the residual network model corresponding to the target user to obtain the target speech signal.
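A minimal sketch of the model-selection step described above, under the assumption that each enrolled user is represented by a stored speech-feature vector and that the profile closest to the extracted feature (here, by cosine similarity) determines which residual network model is applied; the similarity criterion and the data layout are hypothetical, not prescribed by the disclosure.

import numpy as np

def select_user_model(speech_feature, user_profiles):
    # user_profiles: list of (enrolled_feature, residual_model) pairs.
    def cosine(a, b):
        return float(np.dot(a, b) /
                     (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))
    best = max(user_profiles, key=lambda p: cosine(speech_feature, p[0]))
    return best[1]  # the residual network model of the closest enrolled user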
Optionally, performing noise reduction processing on the speech data through the residual network model corresponding to the target user to obtain the target speech signal comprises: determining the network weight information corresponding to each network layer in the residual network model; performing mapping processing on the speech data according to the network weight information corresponding to each network layer to obtain mapped speech data; and generating the target speech signal based on the mapped speech data and the speech data.
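Read together, these steps describe the characteristic residual pattern: the per-layer weights map the speech data, and the target signal is generated from the mapped data together with the original data. A compact sketch under assumed details (square weight matrices, ReLU activations, and a single outer skip connection, none of which are fixed by the disclosure):

import numpy as np

def residual_mapping(voice_data, layer_weights):
    # layer_weights: list of (W, b) pairs, one per network layer, with
    # square W so that the outer skip connection lines up dimensionally.
    h = voice_data
    for W, b in layer_weights:
        h = np.maximum(0.0, W @ h + b)  # mapping processing through one layer
    return voice_data + h               # mapped speech data plus original data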
Optionally, performing feature extraction on the mixed speech signal to obtain the speech feature and speech data of the target user comprises: performing frequency-domain feature extraction on the mixed speech signal to obtain a frequency-domain speech feature and frequency-domain speech data of the target user. Generating the target speech signal based on the mapped speech data and the speech data comprises: decoding the mapped speech data and the frequency-domain speech data to obtain decoded speech data; and performing waveform reconstruction on the decoded speech data according to the frequency-domain speech feature to obtain the target speech signal.
Optionally, performing feature extraction on the mixed speech signal to obtain the speech feature and speech data of the target user comprises: performing time-domain feature extraction on the mixed speech signal to obtain a time-domain speech feature and time-domain speech data of the target user. Generating the target speech signal based on the mapped speech data and the speech data comprises: generating the target speech signal using the mapped speech data and the time-domain speech data.
Optionally, training the residual network model corresponding to the speech feature comprises: adding a noise signal to an input speech signal to generate a noisy speech signal; performing feature extraction on the noisy speech signal to obtain the speech feature corresponding to the noisy speech signal; and performing model training according to a preset residual network structure using the noisy speech signal and the input speech signal, to generate the residual network model corresponding to the speech feature.
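The noise-addition step may be illustrated as mixing a clean utterance with a noise recording at a chosen signal-to-noise ratio. The SNR parameterization below is an assumed detail; the disclosure requires only that a noise signal be added to the input speech signal.

import numpy as np

def add_noise(clean, noise, snr_db=5.0):
    # Tile or trim the noise to the clean signal's length, then scale it so
    # that the mixture has the requested signal-to-noise ratio in decibels.
    noise = np.resize(noise, clean.shape)
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(clean_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return clean + scale * noise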
Optionally, performing output according to the target speech signal comprises: performing speech output according to the target speech signal; and/or performing speech recognition on the target speech signal to generate a recognition result, and outputting the recognition result.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for the same or similar parts among the embodiments, reference may be made to one another.
Those skilled in the art should understand that the embodiments of the present invention may be provided as a method, an apparatus, or a computer program product. Therefore, the embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the embodiments of the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, and optical memory) containing computer-usable program code.
The embodiments of the present invention are described with reference to flowcharts and/or block diagrams of the method, terminal device (system), and computer program product according to the embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing terminal device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing terminal device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or another programmable data processing terminal device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means, which implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or another programmable data processing terminal device, such that a series of operational steps are performed on the computer or other programmable terminal device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable terminal device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Although preferred embodiments of the present invention have been described, those skilled in the art, once apprised of the basic inventive concept, may make additional changes and modifications to these embodiments. Therefore, the appended claims are intended to be interpreted as covering the preferred embodiments and all changes and modifications that fall within the scope of the embodiments of the present invention.
Finally, it should be noted that, herein, relational terms such as "first" and "second" are used merely to distinguish one entity or operation from another entity or operation, without necessarily requiring or implying any actual relationship or order between such entities or operations. Moreover, the terms "include", "comprise", or any other variant thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or terminal device including a series of elements includes not only those elements but also other elements not explicitly listed, or further includes elements inherent to such a process, method, article, or terminal device. In the absence of further limitation, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, article, or terminal device that includes the element.
The audio processing method and apparatus, the device, and the readable storage medium provided by the present invention have been described in detail above. Specific examples are used herein to illustrate the principles and implementations of the present invention, and the descriptions of the above embodiments are intended only to help understand the method of the present invention and its core idea. Meanwhile, for a person of ordinary skill in the art, there will be changes in the specific implementation and scope of application according to the idea of the present invention. In summary, the contents of this specification should not be construed as limiting the present invention.

Claims (10)

1. An audio processing method, characterized by comprising:
obtaining an input mixed speech signal;
performing speech enhancement on the mixed speech signal according to a pre-trained residual network model to obtain a target speech signal; and
performing output according to the target speech signal.
2. The method according to claim 1, wherein performing speech enhancement on the mixed speech signal according to the pre-trained residual network model to obtain the target speech signal comprises:
performing feature extraction on the mixed speech signal to obtain a speech feature and speech data of a target user, wherein the mixed speech signal comprises a noise signal and the speech signal of the target user; and
performing, according to the speech feature, noise reduction processing on the speech data through the pre-trained residual network model to obtain the target speech signal corresponding to the target user.
3. The method according to claim 2, further comprising:
pre-training the residual network model corresponding to the speech feature;
wherein performing, according to the speech feature, noise reduction processing on the speech data through the pre-trained residual network model to obtain the target speech signal corresponding to the target user comprises:
determining, according to the speech feature of the target user, the residual network model corresponding to the target user; and
performing noise reduction processing on the speech data through the residual network model corresponding to the target user to obtain the target speech signal.
4. The method according to claim 3, wherein performing noise reduction processing on the speech data through the residual network model corresponding to the target user to obtain the target speech signal comprises:
determining network weight information corresponding to each network layer in the residual network model;
performing mapping processing on the speech data according to the network weight information corresponding to each network layer to obtain mapped speech data; and
generating the target speech signal based on the mapped speech data and the speech data.
5. The method according to claim 4, wherein:
performing feature extraction on the mixed speech signal to obtain the speech feature and speech data of the target user comprises: performing frequency-domain feature extraction on the mixed speech signal to obtain a frequency-domain speech feature and frequency-domain speech data of the target user; and
generating the target speech signal based on the mapped speech data and the speech data comprises: decoding the mapped speech data and the frequency-domain speech data to obtain decoded speech data, and performing waveform reconstruction on the decoded speech data according to the frequency-domain speech feature to obtain the target speech signal.
6. The method according to claim 4, wherein:
performing feature extraction on the mixed speech signal to obtain the speech feature and speech data of the target user comprises: performing time-domain feature extraction on the mixed speech signal to obtain a time-domain speech feature and time-domain speech data of the target user; and
generating the target speech signal based on the mapped speech data and the speech data comprises: generating the target speech signal using the mapped speech data and the time-domain speech data.
7. The method according to any one of claims 3 to 6, wherein pre-training the residual network model corresponding to the speech feature comprises:
adding a noise signal to an input speech signal to generate a noisy speech signal;
performing feature extraction on the noisy speech signal to obtain the speech feature corresponding to the noisy speech signal; and
performing model training according to a preset residual network structure using the noisy speech signal and the input speech signal, to generate the residual network model corresponding to the speech feature.
8. An audio processing apparatus, characterized by comprising:
a speech signal acquisition module, configured to obtain an input mixed speech signal;
a speech enhancement module, configured to perform speech enhancement on the mixed speech signal according to a pre-trained residual network model to obtain a target speech signal; and
a speech signal output module, configured to perform output according to the target speech signal.
9. A device, characterized by comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and are configured to be executed by one or more processors, the one or more programs including instructions for performing the following operations:
obtaining an input mixed speech signal;
performing speech enhancement on the mixed speech signal according to a pre-trained residual network model to obtain a target speech signal; and
performing output according to the target speech signal.
10. A readable storage medium, wherein, when instructions in the storage medium are executed by a processor of a device, the device is enabled to perform the audio processing method according to any one of claims 1 to 7.
CN201810481272.6A 2018-05-18 2018-05-18 Audio processing method, apparatus and device, and readable storage medium Pending CN110503968A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810481272.6A CN110503968A (en) Audio processing method, apparatus and device, and readable storage medium

Publications (1)

Publication Number Publication Date
CN110503968A true CN110503968A (en) 2019-11-26

Family ID: 68583983

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810481272.6A Pending CN110503968A (en) Audio processing method, apparatus and device, and readable storage medium

Country Status (1)

Country Link
CN (1) CN110503968A (en)

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110069625A1 (en) * 2009-09-23 2011-03-24 Avaya Inc. Priority-based, dynamic optimization of utilized bandwidth
CN102811310A (en) * 2011-12-08 2012-12-05 苏州科达科技有限公司 Method and system for controlling voice echo cancellation on network video camera
US20150142446A1 (en) * 2013-11-21 2015-05-21 Global Analytics, Inc. Credit Risk Decision Management System And Method Using Voice Analytics
CN106887225A (en) * 2017-03-21 2017-06-23 百度在线网络技术(北京)有限公司 Acoustic feature extracting method, device and terminal device based on convolutional neural networks
CN107180628A (en) * 2017-05-19 2017-09-19 百度在线网络技术(北京)有限公司 Set up the method, the method for extracting acoustic feature, device of acoustic feature extraction model
CN107293288A (en) * 2017-06-09 2017-10-24 清华大学 A kind of residual error shot and long term remembers the acoustic model modeling method of Recognition with Recurrent Neural Network
CN107274906A (en) * 2017-06-28 2017-10-20 百度在线网络技术(北京)有限公司 Voice information processing method, device, terminal and storage medium
CN107578775A (en) * 2017-09-07 2018-01-12 四川大学 A kind of multitask method of speech classification based on deep neural network
CN107818779A (en) * 2017-09-15 2018-03-20 北京理工大学 A kind of infant's crying sound detection method, apparatus, equipment and medium
CN108010515A (en) * 2017-11-21 2018-05-08 清华大学 A kind of speech terminals detection and awakening method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHAO Dongbin; SHAO Kun; ZHU Yuanheng; LI Dong; CHEN Yaran; WANG Haitao; LIU Derong; ZHOU Tong; WANG Chenghong: "Review of deep reinforcement learning and discussions on the development of computer Go", Control Theory & Applications, no. 06, 15 June 2016 (2016-06-15) *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111081223A (en) * 2019-12-31 2020-04-28 广州市百果园信息技术有限公司 Voice recognition method, device, equipment and storage medium
CN111081223B (en) * 2019-12-31 2023-10-13 广州市百果园信息技术有限公司 Voice recognition method, device, equipment and storage medium
CN113409803A (en) * 2020-11-06 2021-09-17 腾讯科技(深圳)有限公司 Voice signal processing method, device, storage medium and equipment
CN113409803B (en) * 2020-11-06 2024-01-23 腾讯科技(深圳)有限公司 Voice signal processing method, device, storage medium and equipment
CN112820300A (en) * 2021-02-25 2021-05-18 北京小米松果电子有限公司 Audio processing method and device, terminal and storage medium
CN112820300B (en) * 2021-02-25 2023-12-19 北京小米松果电子有限公司 Audio processing method and device, terminal and storage medium
WO2022178970A1 (en) * 2021-02-26 2022-09-01 平安科技(深圳)有限公司 Speech noise reducer training method and apparatus, and computer device and storage medium
WO2022253003A1 (en) * 2021-05-31 2022-12-08 华为技术有限公司 Speech enhancement method and related device
CN113611318A (en) * 2021-06-29 2021-11-05 华为技术有限公司 Audio data enhancement method and related equipment

Similar Documents

Publication Publication Date Title
CN110503968A (en) A kind of audio-frequency processing method, device, equipment and readable storage medium storing program for executing
CN108198569B (en) Audio processing method, device and equipment and readable storage medium
CN109801644B (en) Separation method, separation device, electronic equipment and readable medium for mixed sound signal
CN108346433A (en) A kind of audio-frequency processing method, device, equipment and readable storage medium storing program for executing
CN106464939B (en) The method and device of play sound effect
CN107705783A (en) A kind of phoneme synthesizing method and device
CN103391347B (en) A kind of method and device of automatic recording
CN110097890A (en) A kind of method of speech processing, device and the device for speech processes
US20130211826A1 (en) Audio Signals as Buffered Streams of Audio Signals and Metadata
CN107992485A (en) A kind of simultaneous interpretation method and device
CN111508511A (en) Real-time sound changing method and device
CN111508531B (en) Audio processing method and device
CN110232909A (en) A kind of audio-frequency processing method, device, equipment and readable storage medium storing program for executing
CN105451056B (en) Audio and video synchronization method and device
CN110197677A (en) A kind of control method for playing back, device and playback equipment
WO2021244056A1 (en) Data processing method and apparatus, and readable medium
Zhang et al. Sensing to hear: Speech enhancement for mobile devices using acoustic signals
CN103973955A (en) Information processing method and electronic device
CN108073572A (en) Information processing method and its device, simultaneous interpretation system
US20240096343A1 (en) Voice quality enhancement method and related device
CN110349578A (en) Equipment wakes up processing method and processing device
CN107886963B (en) A kind of method, apparatus and electronic equipment of speech processes
CN109036404A (en) Voice interactive method and device
CN115273831A (en) Voice conversion model training method, voice conversion method and device
CN106782625B (en) Audio-frequency processing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20220720

Address after: 100084. Room 9, floor 01, cyber building, building 9, building 1, Zhongguancun East Road, Haidian District, Beijing

Applicant after: BEIJING SOGOU TECHNOLOGY DEVELOPMENT Co.,Ltd.

Address before: 100084. Room 9, floor 01, cyber building, building 9, building 1, Zhongguancun East Road, Haidian District, Beijing

Applicant before: BEIJING SOGOU TECHNOLOGY DEVELOPMENT Co.,Ltd.

Applicant before: SOGOU (HANGZHOU) INTELLIGENT TECHNOLOGY Co.,Ltd.
