CN110503968A - Audio processing method, apparatus, device, and readable storage medium - Google Patents


Info

Publication number
CN110503968A
CN110503968A (application CN201810481272.6A)
Authority
CN
China
Prior art keywords
voice signal
signal
speech
residual
voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810481272.6A
Other languages
Chinese (zh)
Inventor
文仕学 (Wen Shixue)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sogou Technology Development Co Ltd
Original Assignee
Beijing Sogou Technology Development Co Ltd
Sogou Hangzhou Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sogou Technology Development Co Ltd and Sogou Hangzhou Intelligent Technology Co Ltd
Priority to CN201810481272.6A
Publication of CN110503968A
Legal status: Pending


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L21/0224 Processing in the time domain
    • G10L21/0232 Processing in the frequency domain
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/27 Speech or voice analysis techniques characterised by the analysis technique

Abstract

Embodiments of the present invention provide an audio processing method, apparatus, device, and readable storage medium. The method comprises: obtaining an input mixed speech signal; performing speech enhancement on the mixed speech signal according to a pre-trained residual network model to obtain a target speech signal; and producing output according to the target speech signal. The embodiments of the present invention can solve the vanishing-gradient problem of existing speech enhancement methods based on traditional neural networks, improving the speech enhancement effect.

Description

Audio processing method, apparatus, device, and readable storage medium
Technical field
The present invention relates to the field of communication technology, and in particular to an audio processing method, an audio processing apparatus, a device, and a readable storage medium.
Background
With the rapid development of communication technology, terminals such as mobile phones and tablet computers have become increasingly widespread, bringing great convenience to people's daily life, study, and work.
These terminals can collect speech signals through a microphone and process the collected speech signals with speech enhancement techniques to reduce the influence of noise interference. Speech enhancement refers to the technology of extracting the useful speech signal from the noise background and suppressing or reducing the noise interference after the speech signal has been disturbed, or even drowned out, by various kinds of noise.
At present, terminals usually perform speech enhancement with methods based on traditional neural networks such as the deep neural network (Deep Neural Network, DNN), the convolutional neural network (Convolutional Neural Network, CNN), and long short-term memory networks (Long Short-Term Memory, LSTM). However, speech enhancement methods based on traditional neural networks suffer from the vanishing-gradient problem. For example, for a fully connected DNN, as the network depth increases, i.e., as the number of network layers grows, the vanishing-gradient problem becomes increasingly severe; when the number of layers reaches about five, the problem is already quite serious. If the number of layers of the DNN is increased further, the speech enhancement performance obtained by performing speech enhancement with that DNN not only fails to improve but may even decline, degrading the speech enhancement effect.
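The vanishing-gradient behavior described above can be illustrated with a minimal sketch, shown below in PyTorch (the framework, layer width, and sigmoid activation are assumptions for illustration; the patent specifies none of them):

    # Minimal illustration (hypothetical): the gradient reaching the first
    # layer of a plain fully connected network shrinks as depth grows.
    import torch
    import torch.nn as nn

    def first_layer_grad_norm(depth: int) -> float:
        layers = []
        for _ in range(depth):
            layers += [nn.Linear(64, 64), nn.Sigmoid()]
        net = nn.Sequential(*layers)
        x = torch.randn(8, 64)
        net(x).pow(2).mean().backward()
        return net[0].weight.grad.norm().item()

    for depth in (2, 5, 10, 20):
        print(depth, first_layer_grad_norm(depth))
    # The printed norms fall off rapidly with depth: the early layers of a
    # deep plain network receive almost no learning signal.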
Summary of the invention
The technical problem to be solved by the embodiments of the present invention is to provide an audio processing method that improves the speech enhancement effect.
Correspondingly, the embodiments of the present invention also provide an audio processing apparatus, a device, and a readable storage medium, to guarantee the implementation and application of the above method.
To solve the above problems, an embodiment of the present invention discloses an audio processing method, comprising: obtaining an input mixed speech signal; performing speech enhancement on the mixed speech signal according to a pre-trained residual network model to obtain a target speech signal; and producing output according to the target speech signal.
Optionally, performing speech enhancement on the mixed speech signal according to the pre-trained residual network model to obtain the target speech signal comprises: performing feature extraction on the mixed speech signal to obtain a speech feature of a target user and speech data, the mixed speech signal including a noise signal and the speech signal of the target user; and, according to the speech feature, performing noise reduction on the speech data through the pre-trained residual network model to obtain the target speech signal corresponding to the target user.
Optionally, the method further comprises: training in advance the residual network model corresponding to each speech feature. Performing noise reduction on the speech data through the pre-trained residual network model according to the speech feature, to obtain the target speech signal corresponding to the target user, comprises: determining, according to the speech feature of the target user, the residual network model corresponding to the target user; and performing noise reduction on the speech data through the residual network model corresponding to the target user to obtain the target speech signal.
Optionally, performing noise reduction on the speech data through the residual network model corresponding to the target user, to obtain the target speech signal, comprises: determining the network weight information corresponding to each network layer in the residual network model; performing mapping processing on the speech data according to the network weight information corresponding to each network layer to obtain mapped speech data; and generating the target speech signal based on the mapped speech data and the speech data.
Optionally, performing feature extraction on the mixed speech signal to obtain the speech feature of the target user and the speech data comprises: performing frequency-domain feature extraction on the mixed speech signal to obtain a frequency-domain speech feature of the target user and frequency-domain speech data. Generating the target speech signal based on the mapped speech data and the speech data comprises: decoding the mapped speech data and the frequency-domain speech data to obtain decoded speech data; and performing waveform reconstruction on the decoded speech data according to the frequency-domain speech feature to obtain the target speech signal.
Optionally, performing feature extraction on the mixed speech signal to obtain the speech feature of the target user and the speech data comprises: performing time-domain feature extraction on the mixed speech signal to obtain a time-domain speech feature of the target user and time-domain speech data. Generating the target speech signal based on the mapped speech data and the speech data comprises: generating the target speech signal using the mapped speech data and the time-domain speech data.
Optionally, training the residual network model corresponding to a speech feature comprises: adding a noise signal to an input speech signal to generate a noisy speech signal; performing feature extraction on the noisy speech signal to obtain the speech feature corresponding to the noisy speech signal; and performing model training according to the noisy speech signal and the speech signal, following a preset residual network structure, to generate the residual network model corresponding to the speech feature.
Optionally, producing output according to the target speech signal comprises: performing voice output according to the target speech signal; and/or performing speech recognition on the target speech signal to generate a recognition result, and outputting the recognition result.
An embodiment of the present invention also discloses an audio processing apparatus, comprising:
a speech signal obtaining module, configured to obtain an input mixed speech signal;
a speech enhancement module, configured to perform speech enhancement on the mixed speech signal according to a pre-trained residual network model to obtain a target speech signal;
a speech signal output module, configured to produce output according to the target speech signal.
Optionally, the speech enhancement module comprises:
a feature extraction submodule, configured to perform feature extraction on the mixed speech signal to obtain a speech feature of a target user and speech data, the mixed speech signal including a noise signal and the speech signal of the target user;
a noise reduction submodule, configured to perform noise reduction on the speech data through the pre-trained residual network model according to the speech feature, to obtain the target speech signal corresponding to the target user.
Optionally, the apparatus may further comprise: a residual network model training module, configured to train in advance the residual network model corresponding to each speech feature;
wherein the noise reduction submodule comprises: a residual network model determination unit, configured to determine, according to the speech feature of the target user, the residual network model corresponding to the target user; and a noise reduction unit, configured to perform noise reduction on the speech data through the residual network model corresponding to the target user to obtain the target speech signal.
Optionally, the noise reduction unit comprises:
a network weight information determination subunit, configured to determine the network weight information corresponding to each network layer in the residual network model;
a mapping processing subunit, configured to perform mapping processing on the speech data according to the network weight information corresponding to each network layer, to obtain mapped speech data;
a target speech signal generation subunit, configured to generate the target speech signal based on the mapped speech data and the speech data.
Optionally, the feature extraction submodule is specifically configured to perform frequency-domain feature extraction on the mixed speech signal to obtain the frequency-domain speech feature of the target user and frequency-domain speech data;
and the target speech signal generation subunit is specifically configured to decode the mapped speech data and the frequency-domain speech data to obtain decoded speech data, and to perform waveform reconstruction on the decoded speech data according to the frequency-domain speech feature to obtain the target speech signal.
Optionally, the feature extraction submodule is specifically configured to perform time-domain feature extraction on the mixed speech signal to obtain the time-domain speech feature of the target user and time-domain speech data;
and the target speech signal generation subunit is specifically configured to generate the target speech signal using the mapped speech data and the time-domain speech data.
Optionally, the residual network model training module comprises:
a noise addition submodule, configured to add a noise signal to an input speech signal to generate a noisy speech signal;
a feature extraction submodule, configured to perform feature extraction on the noisy speech signal to obtain the speech feature corresponding to the noisy speech signal;
a model training submodule, configured to perform model training according to the noisy speech signal and the speech signal, following a preset residual network structure, to generate the residual network model corresponding to the speech feature.
Optionally, the speech signal output module comprises:
a voice output submodule, configured to perform voice output according to the target speech signal; and/or
a speech recognition submodule, configured to perform speech recognition on the target speech signal to generate a recognition result, and to output the recognition result.
An embodiment of the present invention also discloses a device, comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs including instructions for performing the following operations: obtaining an input mixed speech signal; performing speech enhancement on the mixed speech signal according to a pre-trained residual network model to obtain a target speech signal; and producing output according to the target speech signal.
Optionally, performing speech enhancement on the mixed speech signal according to the pre-trained residual network model to obtain the target speech signal comprises: performing feature extraction on the mixed speech signal to obtain a speech feature of a target user and speech data, the mixed speech signal including a noise signal and the speech signal of the target user; and, according to the speech feature, performing noise reduction on the speech data through the pre-trained residual network model to obtain the target speech signal corresponding to the target user.
Optionally, the one or more programs executed by the one or more processors further include instructions for performing the following operation: training in advance the residual network model corresponding to each speech feature. Performing noise reduction on the speech data through the pre-trained residual network model according to the speech feature, to obtain the target speech signal corresponding to the target user, comprises: determining, according to the speech feature of the target user, the residual network model corresponding to the target user; and performing noise reduction on the speech data through the residual network model corresponding to the target user to obtain the target speech signal.
Optionally, performing noise reduction on the speech data through the residual network model corresponding to the target user, to obtain the target speech signal, comprises: determining the network weight information corresponding to each network layer in the residual network model; performing mapping processing on the speech data according to the network weight information corresponding to each network layer to obtain mapped speech data; and generating the target speech signal based on the mapped speech data and the speech data.
Optionally, performing feature extraction on the mixed speech signal to obtain the speech feature of the target user and the speech data comprises: performing frequency-domain feature extraction on the mixed speech signal to obtain a frequency-domain speech feature of the target user and frequency-domain speech data. Generating the target speech signal based on the mapped speech data and the speech data comprises: decoding the mapped speech data and the frequency-domain speech data to obtain decoded speech data; and performing waveform reconstruction on the decoded speech data according to the frequency-domain speech feature to obtain the target speech signal.
Optionally, performing feature extraction on the mixed speech signal to obtain the speech feature of the target user and the speech data comprises: performing time-domain feature extraction on the mixed speech signal to obtain a time-domain speech feature of the target user and time-domain speech data. Generating the target speech signal based on the mapped speech data and the speech data comprises: generating the target speech signal using the mapped speech data and the time-domain speech data.
Optionally, training the residual network model corresponding to a speech feature comprises: adding a noise signal to an input speech signal to generate a noisy speech signal; performing feature extraction on the noisy speech signal to obtain the speech feature corresponding to the noisy speech signal; and performing model training according to the noisy speech signal and the speech signal, following a preset residual network structure, to generate the residual network model corresponding to the speech feature.
Optionally, producing output according to the target speech signal comprises: performing voice output according to the target speech signal; and/or performing speech recognition on the target speech signal to generate a recognition result, and outputting the recognition result.
An embodiment of the present invention also discloses a readable storage medium; when the instructions in the storage medium are executed by a processor of a device, the device is enabled to perform the audio processing method described in one or more of the embodiments of the present invention.
The embodiments of the present invention include the following advantages:
The embodiments of the present invention can perform speech enhancement on the obtained mixed speech signal through a pre-trained residual network model, thereby avoiding the degradation of the speech enhancement effect caused by increasing the network depth; that is, they solve the vanishing-gradient problem of existing speech enhancement methods based on traditional neural networks and improve the speech enhancement effect.
Brief description of the drawings
Fig. 1 is a flowchart of the steps of an audio processing method embodiment of the present invention;
Fig. 2 is a schematic diagram of performing speech enhancement with a pre-trained residual network model in an example of the present invention;
Fig. 3 is a flowchart of the steps of an alternative audio processing method embodiment of the present invention;
Fig. 4 is a schematic diagram of collected mixed speech in an example of the present invention;
Fig. 5 is a structural block diagram of an audio processing apparatus embodiment of the present invention;
Fig. 6 is a structural block diagram of a device for audio processing according to an exemplary embodiment;
Fig. 7 is a structural schematic diagram of a device in an embodiment of the present invention.
Detailed description of the embodiments
To make the above objects, features, and advantages of the present invention clearer and easier to understand, the present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.
At present, existing speech enhancement methods usually use traditional neural networks for model training and perform speech enhancement based on the trained neural network model. Network depth is a major factor affecting the performance of traditional neural networks. As the network depth keeps increasing, traditional neural networks exhibit the vanishing-gradient problem, which becomes increasingly severe with depth, causes the trained neural network model to produce poor target speech signals, and affects the speech enhancement effect.
One of the core ideas of the embodiments of the present invention is to provide a new audio processing method that performs speech enhancement on the input mixed speech signal with a residual network model, thereby solving the vanishing-gradient problem of existing speech enhancement methods based on traditional neural networks and improving the speech enhancement effect.
Referring to Fig. 1, a flowchart of the steps of an audio processing method embodiment of the present invention is shown; the method may specifically include the following steps:
Step 102: obtain an input mixed speech signal.
In the embodiment of the present invention, the input mixed speech signal can be obtained during voice input. The mixed speech signal may include the speech signal that needs speech enhancement, and may specifically include the speech signal of a target user, a noise signal, and so on. The speech signal of the target user may be the clean speech signal of the target user speaking, such as the time-domain signal corresponding to the target speaker's voice; the noise signal may be the signal corresponding to interference noise, such as the time-domain signal corresponding to interfering speech of other speakers, which is not limited by the embodiment of the present invention.
Step 104: perform speech enhancement on the mixed speech signal according to a pre-trained residual network model to obtain a target speech signal.
In the embodiment of the present invention, the obtained mixed speech signal can serve as the input of the pre-trained residual network model; that is, the obtained mixed speech signal can be fed into the pre-trained residual network model, so that the residual network model performs speech enhancement on the obtained mixed speech signal and removes the interference noise in it, yielding the speech-enhanced target speech signal. The target speech signal may include only the clean speech of the target user, and can characterize the signal corresponding to the target user's clean speech, such as the clean speech signal corresponding to the target speaker's voice.
In an optional embodiment, after the mixed speech signal is obtained, feature extraction can be performed on the mixed speech signal to obtain a speech feature and speech data. The speech data may be the noisy speech data remaining after speech feature extraction, and may specifically include noise data and the target speech data that needs to be retained. Then, according to the speech feature, noise reduction can be performed on the speech data through the pre-trained residual network model to obtain the target speech signal after speech enhancement. It should be noted that the speech feature may include a time-domain speech feature and/or a frequency-domain speech feature, which is not limited by the embodiment of the present invention. The time-domain speech feature can be used to characterize the speech feature in the time domain, and the frequency-domain speech feature can be used to characterize the speech feature in the frequency domain.
Step 106: produce output according to the target speech signal.
In the embodiment of the present invention, after the speech-enhanced target speech signal is obtained, output can be produced according to the target speech signal. For example, voice output can be performed according to the target speech signal to play back the clean speech spoken by the user; as another example, speech recognition can be performed according to the target speech signal to recognize the clean speech spoken by the user, and the recognized clean speech can be converted into text information, which is then output, for example by displaying the text on the device screen or displaying search results corresponding to the text.
In summary, the embodiments of the present invention can perform speech enhancement on the obtained mixed speech signal through a pre-trained residual network model, thereby avoiding the degradation of the speech enhancement effect caused by increasing the network depth; that is, they solve the vanishing-gradient problem of existing speech enhancement methods based on traditional neural networks and improve the speech enhancement effect.
In concrete implementation, model training can be performed in advance according to the residual network structure, based on the speech features of speech signals, so as to train the residual network models corresponding to various speech features; subsequently, speech enhancement can be performed with the pre-trained residual network model selected according to the speech feature, guaranteeing the speech enhancement effect. Optionally, the audio processing method of the embodiment of the present invention may further include: training in advance the residual network model corresponding to each speech feature.
Specifically, in the model training stage, a noise signal can be added to the input speech signal to generate a noisy speech signal, and feature extraction can be performed according to the noisy speech signal to obtain the corresponding speech feature; then, for the obtained speech feature, model training can be performed with the generated noisy speech signal according to a preset residual network structure, generating the residual network model corresponding to that speech feature. The input speech signal may refer to a clean speech signal, and may specifically include a collected clean speech signal and/or a pre-synthesized clean speech signal; for example, it may be the clean speech signal currently obtained in real time during voice input, the time-domain signal of a prerecorded segment of clean speech, or the time-domain signal of a pre-synthesized segment of clean speech, which is not limited by the embodiment of the present invention.
In an optional embodiment of the present invention, training the residual network model corresponding to a speech feature may specifically include: adding a noise signal to the input speech signal to generate a noisy speech signal; performing feature extraction on the noisy speech signal to obtain the speech feature corresponding to the noisy speech signal; and performing model training according to the noisy speech signal and the speech signal, following the preset residual network structure, to generate the residual network model corresponding to the speech feature. The preset residual network structure may be configured in advance according to the network structure of residual networks, which is not limited by the embodiment of the present invention.
Specifically, noise can be added to the input clean speech signal in the training stage; that is, a noise signal can be added to the input speech signal to generate a noisy speech signal. The noise signal may include a simulated noise signal, a noise signal collected in advance, and so on. The simulated noise signal can be used to characterize noise synthesized in advance by speech synthesis techniques; the noise signal collected in advance can be used to characterize real noise collected beforehand, such as a prerecorded noise signal.
As an example of the present invention, when real noise has not been collected, the pre-synthesized simulated noise signal can be used to add noise to the input speech signal, and model training can then be performed with the noisy speech signal generated by the noise addition, avoiding the high model training cost of collecting large amounts of real noise and thus reducing the training cost. Of course, when real noise has been collected, the noise signal corresponding to the collected real noise can also be used to add noise to the input speech signal; for example, the collected noise signal can be used for the noise addition, or partly collected real noise and partly synthesized simulated noise can be used together, and so on, which is not specifically limited in this example. A sketch of this noise-addition step appears below.
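As a minimal sketch of the noise-addition step (the signal-to-noise-ratio parameter and the NumPy implementation are assumptions for illustration; the patent only states that a noise signal is added to the clean speech):

    # Hypothetical sketch: mix a noise recording into clean speech at a
    # chosen SNR to build a (noisy input, clean target) training pair.
    import numpy as np

    def add_noise(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
        """Return `clean` corrupted by `noise` at `snr_db` dB SNR."""
        noise = np.resize(noise, clean.shape)      # repeat/trim to match length
        p_clean = np.mean(clean ** 2)
        p_noise = np.mean(noise ** 2) + 1e-12
        scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
        return clean + scale * noise

    # clean, noise = load_wav("clean.wav"), load_wav("noise.wav")  # hypothetical loader
    # noisy = add_noise(clean, noise, snr_db=5.0)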
Then, feature extraction can be performed according to the noisy speech signal after the noise signal has been added, obtaining the corresponding speech feature, so that model training with the residual network can incorporate the voiceprint feature of the speech and produce the residual network model corresponding to that speech feature. Specifically, as shown in Fig. 2, for each obtained speech feature, model training can be performed, according to the preset residual network structure, with the generated noisy speech signal and the input speech signal, so as to train the residual network model corresponding to each speech feature. The residual network model may include at least three network layers. During model training, the output of each network layer serves not only as the input of the next network layer but can also be fed across layers into other network layers; for example, the output of the first network layer can serve as the input of the second network layer and also as an input of the third network layer, and/or it can be fed into deeper network layers. Updating the weight parameters of each network layer in the residual network model in this way alleviates the shrinking of gradients and thereby solves the vanishing-gradient problem. A sketch of such a structure is given below.
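As a minimal sketch of such a residual structure, each block carries its input across its layers through a skip connection (the layer widths, block count, and PyTorch framework are assumptions; the patent only requires at least three layers with cross-layer connections):

    # Hypothetical sketch: each block outputs F(x) + x via a skip connection.
    import torch
    import torch.nn as nn

    class ResidualBlock(nn.Module):
        def __init__(self, dim: int):
            super().__init__()
            self.body = nn.Sequential(
                nn.Linear(dim, dim), nn.ReLU(),
                nn.Linear(dim, dim),
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return x + self.body(x)   # skip connection: H(x) = F(x) + x

    class ResidualEnhancer(nn.Module):
        """Maps noisy speech features to enhanced speech features."""
        def __init__(self, feat_dim: int = 257, num_blocks: int = 8):
            super().__init__()
            self.blocks = nn.Sequential(
                *[ResidualBlock(feat_dim) for _ in range(num_blocks)])

        def forward(self, noisy_feats: torch.Tensor) -> torch.Tensor:
            return self.blocks(noisy_feats)

Training such a model would minimize a regression loss, for example the mean squared error between the model's output on noisy-speech features and the corresponding clean-speech features.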
Thus, in the speech enhancement stage, i.e., when speech enhancement is performed with the trained residual network model, the residual network model currently to be used can be determined based on the speech feature; noise reduction can then be performed, through the determined residual network model, on the speech data obtained by feature extraction, as shown in Fig. 2, yielding the target speech signal, which is then output. The speech data may be generated by performing feature extraction on the input mixed speech signal; for example, it may be the frequency-domain speech data obtained by frequency-domain feature extraction on the mixed speech signal, or the time-domain speech data obtained by time-domain feature extraction on the mixed speech signal, which is not limited by the embodiment of the present invention.
Referring to Fig. 3, a flowchart of the steps of an alternative audio processing method embodiment of the present invention is shown; the method may specifically include the following steps:
Step 302: obtain an input mixed speech signal.
Step 304: perform feature extraction on the mixed speech signal to obtain the speech feature of the target user and speech data, the mixed speech signal including a noise signal and the speech signal of the target user.
Specifically, in the speech enhancement stage, after the input mixed speech signal is detected, the currently detected mixed speech signal can be determined to be the signal requiring speech enhancement processing, and the input mixed speech signal can be obtained, so that the corresponding speech enhancement task is executed based on the obtained mixed speech signal. During execution of the speech enhancement task, feature extraction can be performed on the obtained mixed speech signal to obtain the speech feature of the target user and the speech data. The mixed speech signal may include the speech signal of the target user together with the noise signal that needs to be removed, such as the clean speech signal corresponding to the target user speaking and the interfering speech signals corresponding to other users speaking.
Step 306: according to the speech feature, perform noise reduction on the speech data through the pre-trained residual network model to obtain the target speech signal corresponding to the target user.
In concrete implementation, the residual network models trained for different speech features may differ. For example, the residual network model trained for the speech feature of user A can suppress the interfering speech signals of other users, such as the interfering speech of user B speaking, while retaining the speech signal of user A speaking, achieving the purpose of enhancing the speech signal of user A; the residual network model trained for the speech feature of user B can suppress the interfering speech of other users, such as user A, while retaining the speech signal of user B speaking, achieving the purpose of enhancing the speech signal of user B. Therefore, before noise reduction, voiceprint recognition technology can be combined to determine, according to the speech feature of the target user, the residual network model currently to be used, so that noise reduction is performed on the speech data through the residual network model corresponding to the target user's speech feature, yielding the target speech signal corresponding to the target user.
In an optional embodiment of the present invention, performing noise reduction on the speech data through the pre-trained residual network model according to the speech feature, to obtain the target speech signal corresponding to the target user, may include: determining, according to the speech feature of the target user, the residual network model corresponding to the target user; and performing noise reduction on the speech data through the residual network model corresponding to the target user to obtain the target speech signal. The residual network model corresponding to the target user may be the residual network model trained in advance for the speech feature of that target user.
Specifically, after obtaining the speech feature of the target user, the embodiment of the present invention can determine, based on the speech feature of the target user, the residual network model trained in advance for that speech feature; noise reduction can then be performed on the speech data through the determined residual network model, so as to remove the noise data in the speech data while retaining the target speech data contained in it, and the speech-enhanced target speech signal can then be generated based on the retained target speech data. The target speech data can be used to characterize the speech signal of the target user; for example, it may be the frequency-domain data of the clean speech spoken by the target user, or the time-domain data of that clean speech. A sketch of the per-user model selection appears below.
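As a minimal sketch of the per-user model selection (the embedding table and cosine-similarity lookup are assumptions for illustration; the patent only states that voiceprint recognition determines the model):

    # Hypothetical sketch: a voiceprint feature selects the pre-trained
    # residual model of the best-matching enrolled user.
    import torch

    def select_model(voiceprint: torch.Tensor,
                     registry: dict[str, tuple[torch.Tensor, torch.nn.Module]]) -> torch.nn.Module:
        """Return the model whose enrolled voiceprint best matches `voiceprint`."""
        best_user, best_score = None, float("-inf")
        for user, (enrolled, _model) in registry.items():
            score = torch.cosine_similarity(voiceprint, enrolled, dim=0).item()
            if score > best_score:
                best_user, best_score = user, score
        return registry[best_user][1]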
In the embodiment of the present invention, optionally, performing noise reduction on the speech data through the residual network model corresponding to the target user, to obtain the target speech signal, may specifically include: determining the network weight information corresponding to each network layer in the residual network model; performing mapping processing on the speech data according to the network weight information corresponding to each network layer, to obtain mapped speech data; and generating the target speech signal based on the mapped speech data and the speech data.
Specifically, after determining the residual network model, the embodiment of the present invention can determine, based on the residual network structure corresponding to that model, the network weight information corresponding to each network layer in the residual network model; mapping processing can then be performed on the speech data according to the network weight information corresponding to each network layer, i.e., noise reduction is performed on the speech data fed into each network layer according to that layer's network weight information, so as to remove the noise data contained in the speech data and obtain the mapped speech data. The network weight information can be used to determine the mapping relationship between the speech data and the mapped speech data. The mapped speech data can be used to characterize the clean speech signal obtained by removing the noise signal; for example, it can characterize the time-domain signal of the clean speech with the noise removed, or the frequency-domain signal of that clean speech, which is not limited by the embodiment of the present invention. After the mapped speech data is obtained, the embodiment of the present invention can process the mapped speech data and the speech data according to the residual network structure to generate the target speech signal corresponding to the target user.
As an example of the present invention, after the speech data x is obtained, x can be mapped according to the network weight information of each network layer in the residual network model to obtain the mapped speech data F(x); then, according to the residual network structure, the target speech data H(x) = F(x) + x can be generated from the mapped speech data F(x) and the speech data x. For example, when the speech data x is 5: if the generated target speech data H(x) is 5.1, the corresponding mapped speech data F(x) is 0.1; if the generated H(x) is 5.2, the corresponding F(x) is 0.2. That is, when H(x) changes from 5.1 to 5.2, F(x) changes from 0.1 to 0.2, a 100% change, so that small changes relative to the speech data x are made prominent. This clearly brings out the corrective effect of the network weight information, so the noise data in the speech data can be suppressed better, improving the speech enhancement effect.
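A short numeric restatement of this example under the residual identity H(x) = F(x) + x (the numbers are those given above):

    # Under H(x) = F(x) + x, the residual F(x) = H(x) - x amplifies small
    # changes in the output relative to the input.
    x = 5.0
    H1, H2 = 5.1, 5.2
    F1, F2 = H1 - x, H2 - x        # residuals: 0.1 and 0.2
    print((H2 - H1) / H1)          # ~0.0196: under a 2% change in H(x)
    print((F2 - F1) / F1)          # ~1.0: a 100% change in F(x)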
In one implementation of the embodiments of the present invention, frequency-domain feature extraction can be performed on the input mixed speech signal to obtain the frequency-domain speech feature and the corresponding frequency-domain speech data, and speech enhancement processing can be carried out in the frequency domain according to the frequency-domain speech feature and the frequency-domain speech data. Optionally, performing feature extraction on the mixed speech signal to obtain the speech feature of the target user and the speech data may include: performing frequency-domain feature extraction on the mixed speech signal to obtain the frequency-domain speech feature of the target user and frequency-domain speech data. Speech enhancement can thus be performed on the input mixed speech signal in the frequency domain through the pre-trained residual network model, applied to the frequency-domain speech data according to the obtained frequency-domain speech feature; that is, the speech enhancement task can be completed in the frequency domain.
The frequency-domain speech data can be used to characterize the noisy speech data in the frequency domain, and may include the noise data and the target speech data in the frequency domain. Optionally, generating the target speech signal based on the mapped speech data and the speech data comprises: decoding the mapped speech data and the frequency-domain speech data to obtain decoded speech data; and performing waveform reconstruction on the decoded speech data according to the frequency-domain speech feature to obtain the target speech signal. Specifically, after the frequency-domain speech data is obtained, mapping processing can be performed on the frequency-domain speech data, based on the frequency-domain speech feature and according to the network weight information corresponding to each network layer, so as to remove the noise data in the frequency-domain speech data and obtain the mapped speech data; the mapped speech data and the frequency-domain speech data can then be decoded to obtain the corresponding decoded speech data, after which the time-domain waveform corresponding to the decoded speech data can be reconstructed in combination with the speech feature of the target user, so that the target speech signal output according to that time-domain waveform carries the speech feature of the target user. In other words, waveform reconstruction is performed on the decoded speech data according to the extracted frequency-domain speech feature, generating the time-domain target speech signal, which carries the speech feature of the target user, guaranteeing the listening quality after speech enhancement and improving the user experience. A sketch of this frequency-domain pipeline appears below.
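As a minimal sketch of the frequency-domain pipeline (the STFT magnitude/phase split and the reuse of the noisy phase for waveform reconstruction are assumptions for illustration; the patent only describes frequency-domain extraction, mapping, decoding, and waveform reconstruction):

    # Hypothetical sketch: STFT features in, residual mapping on the
    # magnitudes, inverse STFT with the noisy phase for reconstruction.
    import torch

    def enhance_frequency_domain(mixed: torch.Tensor,
                                 model: torch.nn.Module,
                                 n_fft: int = 512) -> torch.Tensor:
        window = torch.hann_window(n_fft)
        with torch.no_grad():
            spec = torch.stft(mixed, n_fft, window=window, return_complex=True)
            mag, phase = spec.abs(), spec.angle()   # frequency-domain features
            enhanced_mag = model(mag.T).T           # residual mapping per frame
            # Waveform reconstruction: enhanced magnitude + noisy phase
            enhanced_spec = torch.polar(enhanced_mag, phase)
            return torch.istft(enhanced_spec, n_fft, window=window,
                               length=mixed.shape[-1])

With n_fft = 512 the per-frame feature size is 257, matching the feat_dim of the ResidualEnhancer sketch above.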
Of course, based on the residual network model, the embodiments of the present invention can also perform speech enhancement on the mixed speech signal in other ways, for example by performing speech enhancement processing on the mixed speech signal in the time domain. In an optional implementation of the present invention, performing feature extraction on the mixed speech signal to obtain the speech feature and the speech data comprises: performing time-domain feature extraction on the mixed speech signal to obtain a time-domain speech feature and time-domain speech data. Speech enhancement can thus be performed on the input mixed speech signal in the time domain through the pre-trained residual network model, applied to the time-domain speech data according to the obtained time-domain speech feature; that is, the speech enhancement task can be completed in the time domain.
The time-domain speech data can be used to characterize the noisy speech data in the time domain, and may include the noise data and the target speech data in the time domain. Optionally, generating the target speech signal based on the mapped speech data and the speech data comprises: generating the target speech signal using the mapped speech data and the time-domain speech data. Specifically, after the time-domain speech feature is extracted, mapping processing can be performed on the time-domain speech data, based on the time-domain speech feature and according to the network weight information corresponding to each network layer, so as to remove the noise data in the time-domain speech data and obtain the mapped speech data; speech processing can then be performed on the mapped speech data and the time-domain speech data in combination with the time-domain speech feature, generating the target speech signal corresponding to the target user, so that the target speech signal carries the speech feature of the target user, guaranteeing the listening quality after speech enhancement and improving the speech quality after enhancement. A sketch of this time-domain variant appears below.
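As a minimal sketch of the time-domain variant (the framing parameters and overlap-add reconstruction are assumptions for illustration; the patent only states that enhancement can be completed in the time domain):

    # Hypothetical sketch: the residual model maps framed waveform samples
    # directly; enhanced frames are recombined by overlap-add.
    import torch

    def enhance_time_domain(mixed: torch.Tensor, model: torch.nn.Module,
                            frame: int = 512, hop: int = 256) -> torch.Tensor:
        with torch.no_grad():
            frames = mixed.unfold(0, frame, hop)    # (n_frames, frame)
            enhanced = model(frames)                # residual mapping per frame
            out = torch.zeros_like(mixed)
            norm = torch.zeros_like(mixed)
            for i, f in enumerate(enhanced):        # overlap-add
                out[i * hop:i * hop + frame] += f
                norm[i * hop:i * hop + frame] += 1.0
            return out / norm.clamp(min=1.0)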
Step 308: produce output according to the target speech signal.
In an optional embodiment, producing output according to the target speech signal may include: performing voice output according to the target speech signal. Specifically, the embodiment of the present invention can be applied in voice conversation products used in noisy environments, for example in a phone watch in a voice call scenario, so that both parties to the call hear only the clean speech of the main speaker they care about. For example, when a parent uses a phone watch to call a child who is out at an activity, with the audio processing method provided by the embodiment of the present invention the parent can hear only the clear voice of their own child, reducing the influence of other children speaking, i.e., reducing the influence of noise interference.
Of course, the embodiment of the present invention can also be applied in other scenarios, such as voice input scenarios and speech recognition scenarios, which is not limited by the embodiment of the present invention.
In another optional embodiment, producing output according to the target speech signal may include: performing speech recognition on the target speech signal to generate a recognition result; and outputting the recognition result.
For example, the target speaker's speech is the sentence in the first dashed box 41 in Fig. 4: "Hello everyone, I am Li XX, and I am very glad to meet you all." The noise is birdsong, as in the second dashed box 42 in Fig. 4: "chirp chirp, chirp chirp." As shown in Fig. 4, the speech spoken by the target speaker and the noise (the birdsong) overlap heavily on the time axis. At the beginning there is no birdsong yet, so the opening words "Hello everyone" spoken by the target speaker are not disturbed and can be heard clearly; but the following part spoken by the target speaker, "I am Li XX," is interfered with by the birdsong, so it may not be heard clearly. Here, the audio processing method provided by the embodiment of the present invention, for example using an end-to-end speech enhancement model, can remove the interfering birdsong and leave only the target speech, "Hello everyone, I am Li XX, and I am very glad to meet you all," achieving the purpose of speech enhancement.
Then, speech recognition can be performed with the speech-enhanced target speech signal, i.e., with the clean speech of the target speaker, so as to recognize the speech spoken by the target speaker. Following the above example, speech recognition can be performed on the target speech output by the speech enhancement model, "Hello everyone, I am Li XX, and I am very glad to meet you all," thereby improving the speech recognition effect. Output can then be produced according to the recognition result, for example by outputting the text corresponding to the recognized speech, "Hello everyone, I am Li XX, and I am very glad to meet you all," a personal photograph of "Li XX," and so on.
In summary, the embodiments of the present invention introduce the residual network structure into the speech enhancement task, so as to solve the vanishing-gradient problem in the speech enhancement task; residual network models with greater network depth can then be trained, and speech enhancement can be performed with those residual network models, improving the speech enhancement effect.
It should be noted that, for simplicity of description, the method embodiments are all described as a series of action combinations, but those skilled in the art should understand that the embodiments of the present invention are not limited by the described order of actions, because according to the embodiments of the present invention some steps may be performed in other orders or simultaneously. Secondly, those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and the actions involved are not necessarily required by the embodiments of the present invention.
Referring to Fig. 5, a structural block diagram of an audio processing apparatus embodiment of the present invention is shown; the apparatus may specifically include the following modules:
a speech signal obtaining module 510, configured to obtain an input mixed speech signal;
a speech enhancement module 520, configured to perform speech enhancement on the mixed speech signal according to a pre-trained residual network model to obtain a target speech signal;
a speech signal output module 530, configured to produce output according to the target speech signal.
In an optional embodiment of the present invention, the speech enhancement module 520 may include the following submodules:
a feature extraction submodule, configured to perform feature extraction on the mixed speech signal to obtain a speech feature of a target user and speech data, the mixed speech signal including a noise signal and the speech signal of the target user;
a noise reduction submodule, configured to perform noise reduction on the speech data through the pre-trained residual network model according to the speech feature, to obtain the target speech signal corresponding to the target user.
In the embodiment of the present invention, optionally, the apparatus may also include a residual network model training module, configured to train in advance the residual network model corresponding to each speech feature. The noise reduction submodule includes the following units:
a residual network model determination unit, configured to determine, according to the speech feature of the target user, the residual network model corresponding to the target user;
a noise reduction unit, configured to perform noise reduction on the speech data through the residual network model corresponding to the target user, to obtain the target speech signal.
In an optional embodiment of the present invention, the noise reduction unit may include the following subunits:
a network weight information determination subunit, configured to determine the network weight information corresponding to each network layer in the residual network model;
a mapping processing subunit, configured to perform mapping processing on the speech data according to the network weight information corresponding to each network layer, to obtain mapped speech data;
a target speech signal generation subunit, configured to generate the target speech signal based on the mapped speech data and the speech data.
In an optional embodiment of the present invention, the feature extraction submodule is specifically configured to perform frequency-domain feature extraction on the mixed speech signal to obtain the frequency-domain speech feature of the target user and frequency-domain speech data. The target speech signal generation subunit is specifically configured to decode the mapped speech data and the frequency-domain speech data to obtain decoded speech data, and to perform waveform reconstruction on the decoded speech data according to the frequency-domain speech feature to obtain the target speech signal.
In another optional embodiment of the present invention, the feature extraction submodule is specifically configured to perform time-domain feature extraction on the mixed speech signal to obtain the time-domain speech feature of the target user and time-domain speech data. The target speech signal generation subunit is specifically configured to generate the target speech signal using the mapped speech data and the time-domain speech data.
In an optional embodiment of the present invention, the residual network model training module may include the following submodules:
a noise addition submodule, configured to add a noise signal to an input speech signal to generate a noisy speech signal;
a feature extraction submodule, configured to perform feature extraction on the noisy speech signal to obtain the speech feature corresponding to the noisy speech signal;
a model training submodule, configured to perform model training according to the noisy speech signal and the speech signal, following a preset residual network structure, to generate the residual network model corresponding to the speech feature.
In an optional embodiment of the present invention, the speech signal output module 530 may include the following submodules:
a voice output submodule, configured to perform voice output according to the target speech signal; and/or
a speech recognition submodule, configured to perform speech recognition on the target speech signal to generate a recognition result, and to output the recognition result.
As for the apparatus embodiment, since it is basically similar to the method embodiment, its description is relatively simple; for relevant parts, refer to the description of the method embodiment.
Fig. 6 is a structural block diagram of a device 600 for audio processing according to an exemplary embodiment. For example, the device 600 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, fitness equipment, a personal digital assistant, a server, and so on.
Referring to Fig. 6, the device 600 may include one or more of the following components: a processing component 602, a memory 604, a power component 606, a multimedia component 608, an audio component 610, an input/output (I/O) interface 612, a sensor component 614, and a communication component 616.
Processing component 602 usually control equipment 600 integrated operation, such as with display, telephone call, data communication, phase Machine operation and record operate associated operation.Processing component 602 may include that one or more processors 620 refer to execute It enables, to perform all or part of the steps of the methods described above.In addition, processing component 602 may include one or more modules, just Interaction between processing component 602 and other assemblies.For example, processing component 602 may include multi-media module, it is more to facilitate Interaction between media component 608 and processing component 602.
The memory 604 is configured to store various types of data to support operation of the device 600. Examples of such data include instructions for any application or method operated on the device 600, contact data, phone book data, messages, pictures, videos, and the like. The memory 604 may be implemented by any type of volatile or non-volatile storage device, or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disk.
The power component 606 supplies power to the various components of the device 600. The power component 606 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the device 600.
The multimedia component 608 includes a screen that provides an output interface between the device 600 and the user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensors may sense not only the boundary of a touch or swipe action, but also the duration and pressure associated with the touch or swipe operation. In some embodiments, the multimedia component 608 includes a front camera and/or a rear camera. When the device 600 is in an operating mode, such as a photographing mode or a video mode, the front camera and/or the rear camera can receive external multimedia data. Each front or rear camera may be a fixed optical lens system or have focusing and optical zoom capability.
The audio component 610 is configured to output and/or input audio signals. For example, the audio component 610 includes a microphone (MIC), which is configured to receive external audio signals when the device 600 is in an operating mode, such as a call mode, a recording mode, or a voice recognition mode. The received audio signal may be further stored in the memory 604 or transmitted via the communication component 616. In some embodiments, the audio component 610 further includes a speaker for outputting audio signals.
The I/O interface 612 provides an interface between the processing component 602 and peripheral interface modules, which may be a keyboard, a click wheel, buttons, or the like. These buttons may include, but are not limited to, a home button, volume buttons, a start button, and a lock button.
The sensor component 614 includes one or more sensors for providing status assessments of various aspects of the device 600. For example, the sensor component 614 can detect the open/closed state of the device 600 and the relative positioning of components, such as the display and keypad of the device 600; the sensor component 614 can also detect a change in position of the device 600 or of a component of the device 600, the presence or absence of user contact with the device 600, the orientation or acceleration/deceleration of the device 600, and a change in temperature of the device 600. The sensor component 614 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor component 614 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 614 may also include an accelerometer, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 616 is configured to facilitate wired or wireless communication between the device 600 and other devices. The device 600 can access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In one exemplary embodiment, the communication component 616 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communication component 616 further includes a near field communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the device 600 may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components, for performing the above methods.
In an exemplary embodiment, there is also provided a non-transitory computer-readable storage medium including instructions, such as the memory 604 including instructions, where the instructions are executable by the processor 620 of the device 600 to perform the above methods. For example, the non-transitory computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
There is provided a non-transitory computer-readable storage medium: when the instructions in the storage medium are executed by a processor of a device, the device is enabled to perform an audio processing method, the method comprising: obtaining an input mixed speech signal; performing speech enhancement on the mixed speech signal according to a pre-trained residual network model to obtain a target speech signal; and performing output according to the target speech signal.
Fig. 7 is a structural schematic diagram of a device according to an embodiment of the present invention. The device 700 may vary considerably depending on configuration or performance, and may include one or more central processing units (CPUs) 722 (for example, one or more processors), a memory 732, and one or more storage media 730 (for example, one or more mass storage devices) storing application programs 742 or data 744. The memory 732 and the storage medium 730 may provide transient or persistent storage. A program stored in the storage medium 730 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations on the device. Further, the central processing unit 722 may be configured to communicate with the storage medium 730 and to execute, on the device 700, the series of instruction operations in the storage medium 730.
The device 700 may also include one or more power supplies 726, one or more wired or wireless network interfaces 750, one or more input/output interfaces 758, one or more keyboards 756, and/or one or more operating systems 741, such as Windows Server™, Mac OS X™, Unix™, Linux™, or FreeBSD™.
In an exemplary embodiment, the device is configured to execute, by one or more processors, one or more programs that include instructions for performing the following operations: obtaining an input mixed speech signal; performing speech enhancement on the mixed speech signal according to a pre-trained residual network model to obtain a target speech signal; and performing output according to the target speech signal.
Optionally, performing speech enhancement on the mixed speech signal according to the pre-trained residual network model to obtain the target speech signal comprises: performing feature extraction on the mixed speech signal to obtain a speech feature and speech data of a target user, wherein the mixed speech signal includes a noise signal and the speech signal of the target user; and performing, according to the speech feature, noise reduction processing on the speech data through the pre-trained residual network model to obtain the target speech signal corresponding to the target user.
Optionally, the one or more programs executed by the one or more processors further include instructions for performing the following operation: pre-training the residual network model corresponding to the speech feature. Here, performing, according to the speech feature, noise reduction processing on the speech data through the pre-trained residual network model to obtain the target speech signal corresponding to the target user comprises: determining, according to the speech feature of the target user, the residual network model corresponding to the target user; and performing noise reduction processing on the speech data through the residual network model corresponding to the target user to obtain the target speech signal.
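A minimal sketch of the model-selection step described above, under the assumption that each enrolled user is represented by a stored speech-feature vector and that the profile closest to the extracted feature (here, by cosine similarity) determines which residual network model is applied; the similarity criterion and the data layout are hypothetical, not prescribed by the disclosure.

import numpy as np

def select_user_model(speech_feature, user_profiles):
    # user_profiles: list of (enrolled_feature, residual_model) pairs.
    def cosine(a, b):
        return float(np.dot(a, b) /
                     (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))
    best = max(user_profiles, key=lambda p: cosine(speech_feature, p[0]))
    return best[1]  # the residual network model of the closest enrolled user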
Optionally, performing noise reduction processing on the speech data through the residual network model corresponding to the target user to obtain the target speech signal comprises: determining the network weight information corresponding to each network layer in the residual network model; performing mapping processing on the speech data according to the network weight information corresponding to each network layer to obtain mapped speech data; and generating the target speech signal based on the mapped speech data and the speech data.
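Read together, these steps describe the characteristic residual pattern: the per-layer weights map the speech data, and the target signal is generated from the mapped data together with the original data. A compact sketch under assumed details (square weight matrices, ReLU activations, and a single outer skip connection, none of which are fixed by the disclosure):

import numpy as np

def residual_mapping(voice_data, layer_weights):
    # layer_weights: list of (W, b) pairs, one per network layer, with
    # square W so that the outer skip connection lines up dimensionally.
    h = voice_data
    for W, b in layer_weights:
        h = np.maximum(0.0, W @ h + b)  # mapping processing through one layer
    return voice_data + h               # mapped speech data plus original data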
Optionally, performing feature extraction on the mixed speech signal to obtain the speech feature and speech data of the target user comprises: performing frequency-domain feature extraction on the mixed speech signal to obtain a frequency-domain speech feature and frequency-domain speech data of the target user. Generating the target speech signal based on the mapped speech data and the speech data comprises: decoding the mapped speech data and the frequency-domain speech data to obtain decoded speech data; and performing waveform reconstruction on the decoded speech data according to the frequency-domain speech feature to obtain the target speech signal.
Optionally, performing feature extraction on the mixed speech signal to obtain the speech feature and speech data of the target user comprises: performing time-domain feature extraction on the mixed speech signal to obtain a time-domain speech feature and time-domain speech data of the target user. Generating the target speech signal based on the mapped speech data and the speech data comprises: generating the target speech signal using the mapped speech data and the time-domain speech data.
Optionally, training the residual network model corresponding to the speech feature comprises: adding a noise signal to an input speech signal to generate a noisy speech signal; performing feature extraction on the noisy speech signal to obtain the speech feature corresponding to the noisy speech signal; and performing model training according to a preset residual network structure using the noisy speech signal and the input speech signal, to generate the residual network model corresponding to the speech feature.
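The noise-addition step may be illustrated as mixing a clean utterance with a noise recording at a chosen signal-to-noise ratio. The SNR parameterization below is an assumed detail; the disclosure requires only that a noise signal be added to the input speech signal.

import numpy as np

def add_noise(clean, noise, snr_db=5.0):
    # Tile or trim the noise to the clean signal's length, then scale it so
    # that the mixture has the requested signal-to-noise ratio in decibels.
    noise = np.resize(noise, clean.shape)
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(clean_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return clean + scale * noise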
Optionally, performing output according to the target speech signal comprises: performing speech output according to the target speech signal; and/or performing speech recognition on the target speech signal to generate a recognition result, and outputting the recognition result.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for the same or similar parts among the embodiments, reference may be made to one another.
Those skilled in the art should understand that the embodiments of the present invention may be provided as a method, an apparatus, or a computer program product. Therefore, the embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the embodiments of the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, and optical memory) containing computer-usable program code.
The embodiments of the present invention are described with reference to flowcharts and/or block diagrams of the method, terminal device (system), and computer program product according to the embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing terminal device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing terminal device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or another programmable data processing terminal device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means, which implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or another programmable data processing terminal device, such that a series of operational steps are performed on the computer or other programmable terminal device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable terminal device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Although preferred embodiments of the present invention have been described, those skilled in the art, once apprised of the basic inventive concept, may make additional changes and modifications to these embodiments. Therefore, the appended claims are intended to be interpreted as covering the preferred embodiments and all changes and modifications that fall within the scope of the embodiments of the present invention.
Finally, it should be noted that, herein, relational terms such as "first" and "second" are used merely to distinguish one entity or operation from another entity or operation, without necessarily requiring or implying any actual relationship or order between such entities or operations. Moreover, the terms "include", "comprise", or any other variant thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or terminal device including a series of elements includes not only those elements but also other elements not explicitly listed, or further includes elements inherent to such a process, method, article, or terminal device. In the absence of further limitation, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, article, or terminal device that includes the element.
The audio processing method and apparatus, the device, and the readable storage medium provided by the present invention have been described in detail above. Specific examples are used herein to illustrate the principles and implementations of the present invention, and the descriptions of the above embodiments are intended only to help understand the method of the present invention and its core idea. Meanwhile, for a person of ordinary skill in the art, there will be changes in the specific implementation and scope of application according to the idea of the present invention. In summary, the contents of this specification should not be construed as limiting the present invention.

Claims (10)

1. An audio processing method, characterized by comprising:
obtaining an input mixed speech signal;
performing speech enhancement on the mixed speech signal according to a pre-trained residual network model to obtain a target speech signal; and
performing output according to the target speech signal.
2. The method according to claim 1, wherein performing speech enhancement on the mixed speech signal according to the pre-trained residual network model to obtain the target speech signal comprises:
performing feature extraction on the mixed speech signal to obtain a speech feature and speech data of a target user, wherein the mixed speech signal comprises a noise signal and the speech signal of the target user; and
performing, according to the speech feature, noise reduction processing on the speech data through the pre-trained residual network model to obtain the target speech signal corresponding to the target user.
3. The method according to claim 2, further comprising:
pre-training the residual network model corresponding to the speech feature;
wherein performing, according to the speech feature, noise reduction processing on the speech data through the pre-trained residual network model to obtain the target speech signal corresponding to the target user comprises:
determining, according to the speech feature of the target user, the residual network model corresponding to the target user; and
performing noise reduction processing on the speech data through the residual network model corresponding to the target user to obtain the target speech signal.
4. The method according to claim 3, wherein performing noise reduction processing on the speech data through the residual network model corresponding to the target user to obtain the target speech signal comprises:
determining network weight information corresponding to each network layer in the residual network model;
performing mapping processing on the speech data according to the network weight information corresponding to each network layer to obtain mapped speech data; and
generating the target speech signal based on the mapped speech data and the speech data.
5. The method according to claim 4, wherein:
performing feature extraction on the mixed speech signal to obtain the speech feature and speech data of the target user comprises: performing frequency-domain feature extraction on the mixed speech signal to obtain a frequency-domain speech feature and frequency-domain speech data of the target user; and
generating the target speech signal based on the mapped speech data and the speech data comprises: decoding the mapped speech data and the frequency-domain speech data to obtain decoded speech data, and performing waveform reconstruction on the decoded speech data according to the frequency-domain speech feature to obtain the target speech signal.
6. The method according to claim 4, wherein:
performing feature extraction on the mixed speech signal to obtain the speech feature and speech data of the target user comprises: performing time-domain feature extraction on the mixed speech signal to obtain a time-domain speech feature and time-domain speech data of the target user; and
generating the target speech signal based on the mapped speech data and the speech data comprises: generating the target speech signal using the mapped speech data and the time-domain speech data.
7. The method according to any one of claims 3 to 6, wherein pre-training the residual network model corresponding to the speech feature comprises:
adding a noise signal to an input speech signal to generate a noisy speech signal;
performing feature extraction on the noisy speech signal to obtain the speech feature corresponding to the noisy speech signal; and
performing model training according to a preset residual network structure using the noisy speech signal and the input speech signal, to generate the residual network model corresponding to the speech feature.
8. An audio processing apparatus, characterized by comprising:
a speech signal acquisition module, configured to obtain an input mixed speech signal;
a speech enhancement module, configured to perform speech enhancement on the mixed speech signal according to a pre-trained residual network model to obtain a target speech signal; and
a speech signal output module, configured to perform output according to the target speech signal.
9. A device, characterized by comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and are configured to be executed by one or more processors, the one or more programs including instructions for performing the following operations:
obtaining an input mixed speech signal;
performing speech enhancement on the mixed speech signal according to a pre-trained residual network model to obtain a target speech signal; and
performing output according to the target speech signal.
10. A readable storage medium, wherein, when instructions in the storage medium are executed by a processor of a device, the device is enabled to perform the audio processing method according to any one of claims 1 to 7.
CN201810481272.6A 2018-05-18 2018-05-18 Audio processing method, apparatus and device, and readable storage medium Pending CN110503968A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810481272.6A CN110503968A (en) Audio processing method, apparatus and device, and readable storage medium

Publications (1)

Publication Number Publication Date
CN110503968A true CN110503968A (en) 2019-11-26

Family ID: 68583983

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810481272.6A Pending CN110503968A (en) Audio processing method, apparatus and device, and readable storage medium

Country Status (1)

Country Link
CN (1) CN110503968A (en)

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110069625A1 (en) * 2009-09-23 2011-03-24 Avaya Inc. Priority-based, dynamic optimization of utilized bandwidth
CN102811310A (en) * 2011-12-08 2012-12-05 苏州科达科技有限公司 Method and system for controlling voice echo cancellation on network video camera
US20150142446A1 (en) * 2013-11-21 2015-05-21 Global Analytics, Inc. Credit Risk Decision Management System And Method Using Voice Analytics
CN106887225A (en) * 2017-03-21 2017-06-23 百度在线网络技术(北京)有限公司 Acoustic feature extracting method, device and terminal device based on convolutional neural networks
CN107180628A (en) * 2017-05-19 2017-09-19 百度在线网络技术(北京)有限公司 Set up the method, the method for extracting acoustic feature, device of acoustic feature extraction model
CN107293288A (en) * 2017-06-09 2017-10-24 清华大学 A kind of residual error shot and long term remembers the acoustic model modeling method of Recognition with Recurrent Neural Network
CN107274906A (en) * 2017-06-28 2017-10-20 百度在线网络技术(北京)有限公司 Voice information processing method, device, terminal and storage medium
CN107578775A (en) * 2017-09-07 2018-01-12 四川大学 A kind of multitask method of speech classification based on deep neural network
CN107818779A (en) * 2017-09-15 2018-03-20 北京理工大学 A kind of infant's crying sound detection method, apparatus, equipment and medium
CN108010515A (en) * 2017-11-21 2018-05-08 清华大学 A kind of speech terminals detection and awakening method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHAO Dongbin; SHAO Kun; ZHU Yuanheng; LI Dong; CHEN Yaran; WANG Haitao; LIU Derong; ZHOU Tong; WANG Chenghong: "Review of deep reinforcement learning and discussions on the development of computer Go", Control Theory & Applications, no. 06, 15 June 2016 (2016-06-15) *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111081223A (en) * 2019-12-31 2020-04-28 广州市百果园信息技术有限公司 Voice recognition method, device, equipment and storage medium
CN111081223B (en) * 2019-12-31 2023-10-13 广州市百果园信息技术有限公司 Voice recognition method, device, equipment and storage medium
CN113409803A (en) * 2020-11-06 2021-09-17 腾讯科技(深圳)有限公司 Voice signal processing method, device, storage medium and equipment
CN113409803B (en) * 2020-11-06 2024-01-23 腾讯科技(深圳)有限公司 Voice signal processing method, device, storage medium and equipment
CN112820300A (en) * 2021-02-25 2021-05-18 北京小米松果电子有限公司 Audio processing method and device, terminal and storage medium
CN112820300B (en) * 2021-02-25 2023-12-19 北京小米松果电子有限公司 Audio processing method and device, terminal and storage medium
WO2022178970A1 (en) * 2021-02-26 2022-09-01 平安科技(深圳)有限公司 Speech noise reducer training method and apparatus, and computer device and storage medium
WO2022253003A1 (en) * 2021-05-31 2022-12-08 华为技术有限公司 Speech enhancement method and related device
CN113611318A (en) * 2021-06-29 2021-11-05 华为技术有限公司 Audio data enhancement method and related equipment

Similar Documents

Publication Publication Date Title
CN110503968A (en) A kind of audio-frequency processing method, device, equipment and readable storage medium storing program for executing
CN108198569B (en) Audio processing method, device and equipment and readable storage medium
CN109801644B (en) Separation method, separation device, electronic equipment and readable medium for mixed sound signal
CN108346433A (en) A kind of audio-frequency processing method, device, equipment and readable storage medium storing program for executing
CN106464939B (en) The method and device of play sound effect
CN107705783A (en) A kind of phoneme synthesizing method and device
CN103391347B (en) A kind of method and device of automatic recording
CN110097890A (en) A kind of method of speech processing, device and the device for speech processes
US20130211826A1 (en) Audio Signals as Buffered Streams of Audio Signals and Metadata
CN107992485A (en) A kind of simultaneous interpretation method and device
CN111508511A (en) Real-time sound changing method and device
CN111508531B (en) Audio processing method and device
CN110232909A (en) A kind of audio-frequency processing method, device, equipment and readable storage medium storing program for executing
CN105451056B (en) Audio and video synchronization method and device
CN110197677A (en) A kind of control method for playing back, device and playback equipment
WO2021244056A1 (en) Data processing method and apparatus, and readable medium
Zhang et al. Sensing to hear: Speech enhancement for mobile devices using acoustic signals
CN103973955A (en) Information processing method and electronic device
CN108073572A (en) Information processing method and its device, simultaneous interpretation system
US20240096343A1 (en) Voice quality enhancement method and related device
CN110349578A (en) Equipment wakes up processing method and processing device
CN107886963B (en) A kind of method, apparatus and electronic equipment of speech processes
CN109036404A (en) Voice interactive method and device
CN115273831A (en) Voice conversion model training method, voice conversion method and device
CN106782625B (en) Audio-frequency processing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20220720

Address after: 100084. Room 9, floor 01, cyber building, building 9, building 1, Zhongguancun East Road, Haidian District, Beijing

Applicant after: BEIJING SOGOU TECHNOLOGY DEVELOPMENT Co.,Ltd.

Address before: 100084. Room 9, floor 01, cyber building, building 9, building 1, Zhongguancun East Road, Haidian District, Beijing

Applicant before: BEIJING SOGOU TECHNOLOGY DEVELOPMENT Co.,Ltd.

Applicant before: SOGOU (HANGZHOU) INTELLIGENT TECHNOLOGY Co.,Ltd.
