CN108922517A - Method, apparatus and storage medium for training a blind source separation model - Google Patents


Info

Publication number
CN108922517A
Authority
CN
China
Prior art keywords
speech signal
noise
blind source separation
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810717811.1A
Other languages
Chinese (zh)
Inventor
李超
朱唯鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Baidu Online Network Technology Beijing Co Ltd
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201810717811.1A priority Critical patent/CN108922517A/en
Publication of CN108922517A publication Critical patent/CN108922517A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/06: Creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063: Training
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03: characterised by the type of extracted parameters
    • G10L25/24: the extracted parameters being the cepstrum
    • G10L25/27: characterised by the analysis technique
    • G10L25/30: using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Signal Processing (AREA)
  • Evolutionary Computation (AREA)
  • Soundproofing, Sound Blocking, And Sound Damping (AREA)

Abstract

Embodiments of the present invention provide a method, apparatus and storage medium for training a blind source separation model. The method includes: determining a training speech signal by adding noise online according to a noise-adding control parameter, where the noise-adding control parameter is a parameter for controlling the noise that satisfies a preset distribution; and training a convolutional neural network with the training speech signal to obtain a blind source separation model. Embodiments of the present invention can obtain a blind source separation model with better performance, i.e., a model that suppresses the background sound as strongly as possible while damaging the foreground voice as little as possible.

Description

Method, apparatus and storage medium for training a blind source separation model
Technical field
Embodiments of the present invention relate to speech recognition technology, and in particular to a method, apparatus and storage medium for training a blind source separation model.
Background
In recent years, speech recognition technology has been applied more and more widely in fields such as industry, home appliances, communications, automotive electronics, medical care, home services, and consumer electronics. In a quiet environment, the accuracy of speech recognition can reach 97%, exceeding the human auditory system; in a noisy environment, however, its accuracy is still far below that of the human auditory system. The human auditory system can pick out a sound of interest in a noisy environment, a phenomenon known as the "cocktail party effect".
The "cocktail party effect" is technically described as blind source separation, i.e., separating a "foreground voice" of interest from a noisy "background sound" without a reference signal. Blind source separation is essentially a regression model, i.e., a blind source separation model. In existing training of blind source separation models, noise is added offline: the speech is mixed with noise and then saved to a hard disk.
The blind source separation model obtained by the above prior-art training performs poorly, which manifests in the following three cases: 1. the background sound is not eliminated; 2. the foreground voice is eliminated as well; 3. the background sound is not eliminated cleanly and the foreground voice is damaged.
Summary of the invention
Embodiments of the present invention provide a method, apparatus and storage medium for training a blind source separation model, in order to obtain a blind source separation model with better performance, i.e., a model that can suppress the background sound as strongly as possible while damaging the foreground voice as little as possible.
In a first aspect, an embodiment of the present invention provides a method for training a blind source separation model, including: determining a training speech signal by adding noise online according to a noise-adding control parameter, where the noise-adding control parameter is a parameter for controlling the noise that satisfies a preset distribution; and training a convolutional neural network with the training speech signal to obtain a blind source separation model.
In one possible design, the noise-adding control parameter is a signal-to-noise ratio.
In one possible design, the preset distribution is a uniform distribution or a Gaussian distribution.
In one possible design, determining the training speech signal by adding noise online according to the noise-adding control parameter includes: obtaining the noise-adding control parameter, a speech signal and a noise; calculating a mixing coefficient of the speech signal and the noise according to the noise-adding control parameter; and determining the training speech signal according to the mixing coefficient, the speech signal and the noise.
In one possible design, training the convolutional neural network with the training speech signal to obtain the blind source separation model includes:
performing frame segmentation on the training speech signal to obtain multiple frames of speech signal;
training the convolutional neural network with the multiple frames of speech signal to obtain the blind source separation model.
In one possible design, training the convolutional neural network with the multiple frames of speech signal to obtain the blind source separation model includes:
for each frame of speech signal, extracting a feature value of the speech signal in any of the following ways:
mode one: extracting the magnitude spectrum of the speech signal;
mode two: extracting the Mel spectrum of the speech signal;
mode three: extracting the Mel-frequency cepstral coefficients (MFCC) of the speech signal;
feeding the feature value corresponding to the speech signal into the convolutional neural network as its input, and obtaining the blind source separation model by controlling the mean square error of the convolutional neural network.
In a second aspect, an embodiment of the present invention provides an apparatus for training a blind source separation model, including: a determining module, configured to determine a training speech signal by adding noise online according to a noise-adding control parameter, where the noise-adding control parameter is a parameter for controlling the noise that satisfies a preset distribution; and a processing module, configured to train a convolutional neural network with the training speech signal to obtain a blind source separation model.
In one possible design, the noise-adding control parameter is a signal-to-noise ratio.
In one possible design, the preset distribution is a uniform distribution or a Gaussian distribution.
In one possible design, the determining module is specifically configured to:
obtain the noise-adding control parameter, a speech signal and a noise;
calculate a mixing coefficient of the speech signal and the noise according to the noise-adding control parameter;
determine the training speech signal according to the mixing coefficient, the speech signal and the noise.
In one possible design, the processing module includes:
a framing unit, configured to perform frame segmentation on the training speech signal to obtain multiple frames of speech signal;
a training unit, configured to train the convolutional neural network with the multiple frames of speech signal to obtain the blind source separation model.
In one possible design, the training unit is specifically configured to:
for each frame of speech signal, extract the feature value of the speech signal in any of the following ways:
mode one: extract the magnitude spectrum of the speech signal;
mode two: extract the Mel spectrum of the speech signal;
mode three: extract the Mel-frequency cepstral coefficients (MFCC) of the speech signal;
feed the feature value corresponding to the speech signal into the convolutional neural network as its input, and obtain the blind source separation model by controlling the mean square error of the convolutional neural network.
In a third aspect, an embodiment of the present invention provides an apparatus for training a blind source separation model, including: a processor and a memory; the memory stores computer-executable instructions; the processor executes the computer-executable instructions stored in the memory, so that the processor performs the method for training a blind source separation model according to any design of the first aspect.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium having computer-executable instructions stored therein; when executed by a processor, the computer-executable instructions implement the method for training a blind source separation model according to any design of the first aspect.
With the method, apparatus and storage medium for training a blind source separation model provided by embodiments of the present invention, a training speech signal is determined by adding noise online according to a noise-adding control parameter, where the noise-adding control parameter is a parameter for controlling the noise that satisfies a preset distribution; a convolutional neural network is then trained with the training speech signal to obtain a blind source separation model. Since the noise-adding control parameter is a parameter for controlling the noise that satisfies a preset distribution, compared with the prior art, embodiments of the present invention increase the quantity and types of noise by making the noise-adding control parameter satisfy a preset distribution; and adding the noise online makes the blind source separation model easy to adjust.
Brief description of the drawings
In order to explain the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed in the description of the embodiments or of the prior art are briefly introduced below. Obviously, the drawings in the following description show some embodiments of the present invention; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
Fig. 1 is a flowchart of a method for training a blind source separation model provided by an embodiment of the present invention;
Fig. 2 is a flowchart of a method for training a blind source separation model provided by another embodiment of the present invention;
Fig. 3 is a network architecture diagram for training a blind source separation model provided by an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of an apparatus for training a blind source separation model provided by an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of an apparatus for training a blind source separation model provided by another embodiment of the present invention.
Detailed description of the embodiments
Exemplary embodiments are described in detail here, with examples illustrated in the accompanying drawings. Where the following description refers to the drawings, unless otherwise indicated, the same numbers in different drawings denote the same or similar elements. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present invention; rather, they are merely examples of apparatuses and methods consistent with some aspects of the invention as detailed in the appended claims.
The terms "comprising" and "having" and any variations thereof in the description and claims of this specification are intended to cover a non-exclusive inclusion. For example, a process, method, system, product or device that contains a series of steps or units is not limited to the listed steps or units, but optionally further includes steps or units not listed, or optionally further includes other steps or units inherent to such a process, method, product or device.
"First", "second" and the like in the embodiments of the present invention serve only as labels and are not to be understood as indicating or implying an order, a relative importance, or the quantity of the indicated technical feature. "Multiple" means two or more. "And/or" describes an association between associated objects and indicates that three relationships may exist; for example, "A and/or B" may mean: A alone, both A and B, or B alone. The character "/" generally indicates an "or" relationship between the associated objects.
"One embodiment" or "an embodiment" mentioned throughout this specification means that a particular feature, structure or characteristic related to the embodiment is included in at least one embodiment of the present application. Therefore, "in one embodiment" or "in an embodiment" appearing throughout the specification does not necessarily refer to the same embodiment. It should be noted that, as long as they do not conflict, the embodiments of the present invention and the features in the embodiments may be combined with each other.
The inventors found that the necessity of blind source separation lies in the following aspects:
On the one hand, blind source separation can extract the voice of a target speaker from an audio signal in which more than one speaker is talking at the same time. For example, a TV in the living room is broadcasting the news while a user wants to interact by voice with a smart speaker on the coffee table. The smart speaker receives the user's voice request and, at the same time, the anchor's broadcast in the news. That is, at the same moment two people, the user and the news anchor, are speaking, and the smart speaker needs to extract the voice corresponding to the user from the audio signal of the two simultaneous speakers.
On the other hand, blind source separation can separate speech from ambient noise. A typical example is speech recognition in a vehicle. While driving, the microphone of the head unit or of a mobile phone receives all kinds of ambient noise: wind noise, road noise, horns, and so on. Blind source separation can suppress these ambient noises and feed only the enhanced speech to the speech recognition system.
The above examples are relatively ideal cases. Blind source separation is itself a regression model; if the performance of the blind source separation model is not ideal, the following three cases appear:
1. The background sound is not eliminated. That is, the denoising effect of the blind source separation is poor and its ability to suppress noise is low.
2. The foreground voice is eliminated as well. That is, the blind source separation eliminates not only the noise but also the speech.
3. The background sound is not eliminated cleanly and the foreground voice is damaged. This case is the most common: at some time-frequency points the noise is retained, while at other time-frequency points the speech is eliminated.
Therefore, the two most essential abilities of blind source separation are noise suppression and not damaging the speech. A good blind source separation model should suppress the background sound as strongly as possible while damaging the foreground voice as little as possible.
In current training of blind source separation models, noise is added offline: the speech is mixed with noise and then stored on a hard disk. This approach has at least the following two limitations:
1. The types and quantity of noise are limited.
2. The noise-adding scheme is inflexible, and adjusting the blind source separation model accordingly is inefficient.
However, we want the blind source separation model to generalize over various noises, with noise-adding implementations that can vary with the environment.
In view of the above problems, embodiments of the present invention provide a method, apparatus and storage medium for training a blind source separation model. By designing the noise-adding control parameter as a parameter for controlling the noise that satisfies a preset distribution, the types and quantity of noise are increased; and by adding the noise online, the flexibility of the noise-adding scheme is improved, so that the blind source separation model is easy to adjust.
Embodiments of the present invention can be applied to electronic devices with a voice interaction function, such as smart speakers, DuerOS devices, smart TVs and smart refrigerators; the method for training a blind source separation model is broadly applicable.
Detailed embodiments are used below to illustrate how embodiments of the present invention realize online training of a blind source separation model.
Fig. 1 is a flowchart of a method for training a blind source separation model provided by an embodiment of the present invention. This embodiment provides a method for training a blind source separation model; the executing entity of the method may be an apparatus for training a blind source separation model, and the apparatus may be implemented in software and/or hardware.
Specifically, the apparatus for training a blind source separation model may include, but is not limited to, at least one of the following: a user equipment, a network device, and so on. The user equipment may include, but is not limited to, a computer, a smart phone, a personal digital assistant (PDA) and the electronic devices mentioned above. The network device may include, but is not limited to, a single network server, a server group composed of multiple network servers, or a cloud composed of a large number of computers or network servers based on cloud computing, where cloud computing is a kind of distributed computing: one super virtual computer composed of a group of loosely coupled computers. This embodiment is not limited in this respect.
As shown in Fig. 1, the method for training a blind source separation model includes:
S101: determining a training speech signal by adding noise online according to a noise-adding control parameter.
The noise-adding control parameter is a parameter for controlling the noise that satisfies a preset distribution. It can be understood that the noise-adding control parameter satisfies a preset distribution; the preset distribution here refers to a parameter distribution, e.g., a normal distribution, a uniform distribution, an exponential distribution, and so on. The role of the noise-adding control parameter is to control the noise in the training speech signal, e.g., the proportion of the noise or the gain of the noise.
For example, let the training speech signal be denoted x, the actual speech s, the noise n, and the gain of the noise a; then x = s + a × n. When the noise-adding control parameter is used to control the gain of the noise in the training speech signal, it is the a here.
Alternatively, the noise-adding control parameter controls the proportion of the noise in the training speech signal, i.e., the signal-to-noise ratio. According to the calculation formula of the signal-to-noise ratio (SNR), the a that satisfies the current SNR value can be calculated as:
a = std(s) / (std(n) × 10^(snr/20))
where std(·) denotes the standard deviation, std(n) denotes the standard deviation of the noise, std(s) denotes the standard deviation of the actual speech, and snr denotes the current SNR value in dB.
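Illustratively, this relation can be written as a short sketch (illustrative only, not part of the claimed method; the function and variable names are ours):

```python
import numpy as np

def mix_at_snr(s: np.ndarray, n: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix speech s with noise n so that the result has the target SNR in dB.

    Solves 20 * log10(std(s) / (a * std(n))) = snr_db for the noise gain a,
    then returns the training speech signal x = s + a * n.
    """
    a = np.std(s) / (np.std(n) * 10.0 ** (snr_db / 20.0))
    return s + a * n
```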
S102: training a convolutional neural network with the training speech signal to obtain a blind source separation model.
The convolutional neural network (CNN) may include multiple stacked convolutional layers. Illustratively, the bottom convolutional layer of the convolutional neural network includes 257 nodes (i.e., neurons).
In practical applications, the sigmoid function may be used as the activation function of the convolutional neural network, and the mean square error (MSE) may be used as its cost function, but embodiments of the present invention are not limited thereto.
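Illustratively, one training step under this setup might look as follows (a sketch under our own assumptions: the layer shapes, kernel sizes and optimizer are chosen for illustration; only the 257-bin width, the sigmoid activation and the MSE cost come from the text above):

```python
import torch
import torch.nn as nn

# Minimal stand-in for the mask-predicting CNN: 257 spectral bins in and out.
model = nn.Sequential(
    nn.Conv1d(257, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv1d(64, 257, kernel_size=3, padding=1), nn.Sigmoid(),  # mask in [0, 1]
)
criterion = nn.MSELoss()                        # MSE as the cost function
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

features = torch.rand(8, 257, 100)              # (batch, bins, frames), noisy features
target_mask = torch.rand(8, 257, 100)           # placeholder training target

optimizer.zero_grad()
loss = criterion(model(features), target_mask)  # control the mean square error
loss.backward()
optimizer.step()
```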
It should be noted that the length of the training speech signal in embodiments of the present invention may be set according to actual needs. Considering the real-time requirement of speech recognition, the training speech signal should not be set too long; for example, the length of the training speech signal may be set to 10 milliseconds. Alternatively, the training speech signal in S101 is segmented into frames to obtain training speech signals of a preset length, where the preset length is smaller than the length of the training speech signal in S101.
In this step, a series of preprocessing operations, such as the frame segmentation above, may be performed on the training speech signal determined in the previous step; the specific preprocessing is not limited here and is illustrated in the subsequent embodiments.
It should be noted that the test phase is similar to that of any fitting problem based on machine learning and is not repeated here. Note, however, that the test phase does not require additional noise: the training speech signal input to the convolutional neural network already carries noise. Therefore, the network structure of the test phase is a sub-network of the network structure of the training phase.
In embodiments of the present invention, a training speech signal is determined by adding noise online according to a noise-adding control parameter, where the noise-adding control parameter is a parameter for controlling the noise that satisfies a preset distribution; a convolutional neural network is trained with the training speech signal to obtain a blind source separation model. Since the noise-adding control parameter is a parameter for controlling the noise that satisfies a preset distribution, compared with the prior art, embodiments of the present invention increase the quantity and types of noise by making the noise-adding control parameter satisfy a preset distribution; and adding the noise online makes the blind source separation model easy to adjust.
The noise-adding control parameter mentioned in the above embodiment may specifically be a signal-to-noise ratio, and the preset distribution may be a uniform distribution, a Gaussian distribution, or the like. The case where the signal-to-noise ratio satisfies a uniform distribution or a Gaussian distribution is further explained below.
A uniform distribution is an even distribution between two values, e.g., a uniform distribution between 5 dB and 30 dB. The benefit of adding noise with a signal-to-noise ratio that satisfies this preset distribution is that it is direct. The signal-to-noise ratio is calculated as:
SNR = randm(A, B)
where A is the lower bound, 5 dB in this example;
B is the upper bound, 30 dB in this example;
and randm is the uniform-distribution function.
A Gaussian distribution is also known as a normal distribution. For example, the signal-to-noise ratio follows a Gaussian distribution with an expectation of 15 dB and a variance of 5. The signal-to-noise ratio is calculated as:
SNR = gauss(C, D)
where C is the expectation, 15 in this example;
D is the variance, 5 in this example;
and gauss is the Gaussian-distribution function.
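Illustratively, both distributions can be sampled in a few lines (a sketch; randm and gauss above map onto standard library samplers, and note that a variance of 5 corresponds to a standard deviation of sqrt(5)):

```python
import numpy as np

rng = np.random.default_rng()

# Uniform: SNR = randm(A, B) with lower bound A = 5 dB and upper bound B = 30 dB.
snr_uniform = rng.uniform(5.0, 30.0)

# Gaussian: SNR = gauss(C, D) with expectation C = 15 dB and variance D = 5;
# NumPy's normal() takes the standard deviation, hence the square root.
snr_gauss = rng.normal(loc=15.0, scale=np.sqrt(5.0))
```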
In some embodiments, S101, determining the training speech signal by adding noise online according to the noise-adding control parameter, may include: obtaining the noise-adding control parameter, a speech signal and a noise; calculating a mixing coefficient of the speech signal and the noise according to the noise-adding control parameter; and determining the training speech signal according to the mixing coefficient, the speech signal and the noise.
Illustratively, let the training speech signal be denoted x, the speech signal (i.e., the actual speech) s, the noise n, and the gain of the noise a; then x = s + a × n. Here the noise-adding control parameter is the signal-to-noise ratio, used to control the gain a of the noise in the training speech signal, i.e., the mixing coefficient.
It can be understood that, when adding noise online, the electronic device reads three pieces of data: the speech signal, the noise, and the noise-adding control parameter, which here is the signal-to-noise ratio; it then calculates the mixing coefficient a of the speech signal and the noise according to the signal-to-noise ratio, and obtains the training speech signal according to the relation x = s + a × n.
The speech signal and the noise may be of float type, i.e., valued between -1 and 1; alternatively, the speech signal and the noise may be of int type, valued between -32767 and 32767 (16-bit quantization), and so on.
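Illustratively, an online noise-adding loop along these lines can be sketched as a generator (illustrative only; the names are ours, and mix_at_snr refers to the sketch given earlier):

```python
import numpy as np

def online_noisy_pairs(speech_clips, noise_clips, rng=None):
    """Endlessly yield (noisy, clean) pairs mixed on the fly.

    speech_clips / noise_clips: lists of float arrays valued in [-1, 1]
    (int16 samples would first be scaled by 1 / 32767).
    """
    rng = rng or np.random.default_rng()
    while True:
        s = speech_clips[rng.integers(len(speech_clips))]
        n = noise_clips[rng.integers(len(noise_clips))]
        n = np.resize(n, len(s))             # loop or trim the noise to match s
        snr_db = rng.uniform(5.0, 30.0)      # the noise-adding control parameter
        yield mix_at_snr(s, n, snr_db), s    # x = s + a * n, plus the clean target
```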
Fig. 2 is a flowchart of a method for training a blind source separation model provided by another embodiment of the present invention. As shown in Fig. 2, on the basis of the flow shown in Fig. 1, S102, training the convolutional neural network with the training speech signal to obtain the blind source separation model, may include:
S201: performing frame segmentation on the training speech signal to obtain multiple frames of speech signal.
This step corresponds to the framing layer, whose main function is to segment the training speech signal into individual frames of speech signal.
The speech signal may be framed contiguously or with overlapping frames; the frame length is generally 10 to 30 ms. The amount by which successive frames advance is called the frame shift, adjacent frames overlapping by the frame length minus the frame shift; the ratio of frame shift to frame length is generally 0 to 1/2. For example, the frame length may be set to 25 ms and the frame shift to 10 ms.
Optionally, the speech signal is windowed, i.e., multiplied by a window function w(n), forming a windowed speech signal. The purpose of windowing is to reduce the spectral leakage caused by frame segmentation. This is because framing abruptly truncates the training speech signal, which is equivalent to convolving the spectrum of the training speech signal with the spectrum of the window function. Since the side lobes of the window spectrum are high, the spectrum of the training speech signal "smears", i.e., spectral leakage occurs. To this end, a Hamming window may be used: its low side lobes effectively suppress the leakage, it has a smoother low-pass characteristic, and the resulting spectrum is smoother.
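Illustratively, framing with a 25 ms frame, a 10 ms shift and a Hamming window can be sketched as follows (the names and the 16 kHz rate are ours; the signal is assumed to be at least one frame long):

```python
import numpy as np

def frame_signal(x, sr=16000, frame_ms=25, shift_ms=10):
    """Split signal x into overlapping, Hamming-windowed frames."""
    frame_len = int(sr * frame_ms / 1000)   # 25 ms -> 400 samples at 16 kHz
    shift = int(sr * shift_ms / 1000)       # 10 ms -> 160 samples
    window = np.hamming(frame_len)          # low side lobes reduce spectral leakage
    n_frames = 1 + (len(x) - frame_len) // shift
    return np.stack([x[i * shift : i * shift + frame_len] * window
                     for i in range(n_frames)])
```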
S202: training the convolutional neural network with the multiple frames of speech signal to obtain the blind source separation model.
Optionally, this step includes: performing the following processing on each frame of speech signal: extracting the feature value of the speech signal; feeding the feature value of the speech signal into the convolutional neural network as its input; and obtaining the blind source separation model by controlling the mean square error of the convolutional neural network.
The convolutional neural network takes acoustic features as input, so the feature value of each frame of speech signal is also extracted after framing.
Optionally, extracting the feature value of the speech signal in this step may use at least any one of the following implementations:
First implementation: extracting the magnitude spectrum of the speech signal.
Specifically, extracting the magnitude spectrum may include: performing a discrete Fourier transform on the speech signal, e.g., processing the speech signal with the fast Fourier transform (FFT), a fast algorithm for the discrete Fourier transform; then taking the absolute value after the discrete Fourier transform to obtain the magnitude spectrum of the speech signal.
Second implementation: extracting the Mel spectrum (Mel filter banks) of the speech signal.
Third implementation: extracting the Mel-frequency cepstral coefficients (MFCC) of the speech signal.
The overall extraction processes of the Mel spectrum and the MFCC are similar; the MFCC merely adds one step, a DCT (discrete cosine transform). The overall extraction process may include the following steps:
1) Pre-emphasis, whose role is to eliminate the effects caused by the vocal cords and lips during phonation, to compensate the high-frequency part of the speech signal that is suppressed by the articulatory system, and to highlight the high-frequency formants. This step is optional.
2) Short-time Fourier transform (STFT), which yields the spectral features and converts the energy (magnitude) spectrum into a power spectrum (by squaring).
3) Mel filtering: filtering with a Mel filter bank to obtain a spectrum that matches human auditory perception; finally, the units are usually converted to dB by taking the logarithm.
4) DCT: the discrete cosine transform yields the cepstral coefficients, i.e., the MFCC.
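Illustratively, the three feature types can be extracted as below (a sketch using librosa; the file name, sampling rate and FFT sizes are our assumptions, chosen so that a 512-point FFT yields the 257 bins mentioned above):

```python
import numpy as np
import librosa

y, sr = librosa.load("utterance.wav", sr=16000)  # hypothetical input file

# Mode one: magnitude spectrum = |STFT| (512-point FFT -> 257 bins per frame).
mag = np.abs(librosa.stft(y, n_fft=512, hop_length=160, win_length=400))

# Mode two: Mel spectrum via a Mel filter bank, with a log to move toward dB units.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=512, hop_length=160)
log_mel = np.log(mel + 1e-10)

# Mode three: MFCC, i.e., the log-Mel spectrum followed by a DCT.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=512, hop_length=160)
```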
Later, in the test phase, a test speech signal is tested with the blind source separation model obtained in the above embodiments: the feature values of the test speech signal are point-wise multiplied with the blind source separation model to obtain the test result corresponding to the test speech signal. Specifically, the feature value of each frame of the test speech signal is point-wise multiplied, frame by frame, with the blind source separation model, yielding:
y = h .* x
where .* denotes point-wise multiplication, x denotes the feature value, y denotes the test result corresponding to the test speech signal, and h denotes the blind source separation model obtained by training.
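Illustratively (our notation; here h stands for the per-frame output of the trained model, applied to features x of the same shape):

```python
import numpy as np

def apply_model(h: np.ndarray, x: np.ndarray) -> np.ndarray:
    """y = h .* x: point-wise multiply the model output with the features.

    h and x share a shape such as (257, n_frames); NumPy's * operator is
    element-wise, matching the MATLAB-style .* notation above.
    """
    return h * x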
In one implementation, the network architecture for obtaining the above blind source separation model is shown in Fig. 3. Referring to Fig. 3, the network architecture may include a noise-adding layer and a feature extraction layer, where the functions of the feature extraction layer may include framing, feature-value extraction, and computation of the expected ratio-mask value.
The network architecture of the test phase may include a feature extraction layer and a test layer. The functions of the feature extraction layer here include at least framing, feature-value extraction, and use of the expected ratio-mask value. Illustratively, the network architecture of the test phase is, from top to bottom: frame (framing), fft (fast Fourier transform), log (logarithm), conv1d64 (convolution), bn, relu, conv1d64, bn, relu, conv1d64, bn, relu, conv1d64, bn, relu, conv1d64, bn, relu, linear, bn and relu. Each part may refer to the related art and is not repeated here.
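Read literally, that stack could be sketched roughly as follows (our reconstruction; the figure names only the layer types, so the kernel sizes, the 512-point FFT front end and the use of a 1x1 convolution for the "linear" layer are assumptions):

```python
import torch
import torch.nn as nn

class BlindSourceSeparationNet(nn.Module):
    """Sketch of the fft -> log -> (conv1d64, bn, relu) x 5 -> linear, bn, relu stack."""

    def __init__(self, n_bins: int = 257):
        super().__init__()
        blocks, in_ch = [], n_bins
        for _ in range(5):                       # five conv1d64 / bn / relu blocks
            blocks += [nn.Conv1d(in_ch, 64, kernel_size=3, padding=1),
                       nn.BatchNorm1d(64), nn.ReLU()]
            in_ch = 64
        self.convs = nn.Sequential(*blocks)
        self.head = nn.Sequential(nn.Conv1d(64, n_bins, kernel_size=1),  # "linear"
                                  nn.BatchNorm1d(n_bins), nn.ReLU())

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, 512, n_frames) of windowed time-domain frames
        spec = torch.fft.rfft(frames, dim=1).abs()   # "fft": 512 samples -> 257 bins
        feat = torch.log(spec + 1e-8)                # "log"
        return self.head(self.convs(feat))           # per-bin estimate for y = h .* x
```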
Fig. 4 is a schematic structural diagram of an apparatus for training a blind source separation model provided by an embodiment of the present invention. This embodiment of the present invention provides an apparatus for training a blind source separation model, and the apparatus may be implemented in software and/or hardware.
Specifically, the apparatus for training a blind source separation model may include, but is not limited to, at least one of the following: a user equipment, a network device. The user equipment may include, but is not limited to, a computer, a smart phone, a PDA and the electronic devices mentioned above. The network device may include, but is not limited to, a single network server, a server group composed of multiple network servers, or a cloud composed of a large number of computers or network servers based on cloud computing, where cloud computing is a kind of distributed computing: one super virtual computer composed of a group of loosely coupled computers. This embodiment is not limited in this respect.
As shown in Fig. 4, the apparatus 30 for training a blind source separation model includes a determining module 31 and a processing module 32.
The determining module 31 is configured to determine a training speech signal by adding noise online according to a noise-adding control parameter, where the noise-adding control parameter is a parameter for controlling the noise that satisfies a preset distribution.
The processing module 32 is configured to train a convolutional neural network with the training speech signal to obtain a blind source separation model.
The apparatus for training a blind source separation model provided in this embodiment can be used to perform the above method embodiments; its implementation and technical effects are similar and are not repeated here.
Illustratively, the noise-adding control parameter may be a signal-to-noise ratio.
Optionally, the preset distribution may be a uniform distribution, a Gaussian distribution, or the like.
In some embodiments, the determining module 31 may be specifically configured to: obtain the noise-adding control parameter, a speech signal and a noise; calculate a mixing coefficient of the speech signal and the noise according to the noise-adding control parameter; and determine the training speech signal according to the mixing coefficient, the speech signal and the noise.
Further, the processing module 32 may include a framing unit (not shown) and a training unit (not shown), wherein:
the framing unit is configured to perform frame segmentation on the training speech signal to obtain multiple frames of speech signal;
the training unit is configured to train the convolutional neural network with the multiple frames of speech signal to obtain the blind source separation model.
Optionally, the training unit may be specifically configured to:
for each frame of speech signal, extract the feature value of the speech signal in any of the following ways:
mode one: extract the magnitude spectrum of the speech signal;
mode two: extract the Mel spectrum of the speech signal;
mode three: extract the MFCC of the speech signal;
and feed the feature value corresponding to the speech signal into the convolutional neural network as its input, obtaining the blind source separation model by controlling the mean square error of the convolutional neural network.
In the above embodiment, a training speech signal is determined by adding noise online according to a noise-adding control parameter, where the noise-adding control parameter is a parameter for controlling the noise that satisfies a preset distribution; a convolutional neural network is trained with the training speech signal to obtain a blind source separation model. Since the noise-adding control parameter is a parameter for controlling the noise that satisfies a preset distribution, compared with the prior art, this embodiment increases the quantity and types of noise by making the noise-adding control parameter satisfy a preset distribution; and adding the noise online makes the blind source separation model easy to adjust.
Fig. 5 is a schematic structural diagram of an apparatus for training a blind source separation model provided by another embodiment of the present invention. As shown in Fig. 5, the apparatus 40 for training a blind source separation model includes:
a processor 41 and a memory 42;
the memory 42 stores computer-executable instructions;
the processor 41 executes the computer-executable instructions stored in the memory 42, so that the processor 41 performs the method for training a blind source separation model described above.
For the specific implementation process of the processor 41, reference may be made to the above method embodiments; the implementation principles and technical effects are similar and are not repeated here.
Optionally, the apparatus 40 for training a blind source separation model further includes a communication component 43, where the processor 41, the memory 42 and the communication component 43 may be connected via a bus 44.
The embodiment of the present invention also provides a computer-readable storage medium having computer-executable instructions stored therein; when executed by a processor, the computer-executable instructions implement the method for training a blind source separation model described above.
In the above embodiments, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; e.g., the division into modules is only a logical functional division, and there may be other divisions in actual implementation: multiple modules may be combined or integrated into another system, or some features may be ignored or not executed. Further, the mutual couplings or direct couplings or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses or modules, and may be electrical, mechanical or in other forms.
The modules described as separate components may or may not be physically separated, and the components shown as modules may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional modules in the embodiments of the present invention may be integrated into one processing unit, or each module may exist physically alone, or two or more modules may be integrated into one unit. The unit formed by the above modules may be implemented in the form of hardware, or in the form of hardware plus software functional units.
The above integrated module implemented in the form of a software functional module may be stored in a computer-readable storage medium. The software functional module is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) or a processor to perform part of the steps of the methods of the embodiments of the present application.
It should be understood that the above processor may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), etc. A general-purpose processor may be a microprocessor, or any conventional processor. The steps of the method disclosed in the present invention may be directly embodied as being executed by a hardware processor, or executed by a combination of hardware and software modules in the processor.
The memory may include a high-speed RAM memory and may also include a non-volatile memory (NVM), e.g., at least one magnetic disk memory; it may also be a USB flash drive, a removable hard disk, a read-only memory, a magnetic disk or an optical disc.
The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, etc. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, the bus in the drawings of this application is not limited to only one bus or one type of bus.
The above storage medium may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as a static random-access memory (SRAM), an electrically erasable programmable read-only memory (EEPROM), an erasable programmable read-only memory (EPROM), a programmable read-only memory (PROM), a read-only memory (ROM), a magnetic memory, a flash memory, a magnetic disk or an optical disc. A storage medium may be any available medium accessible by a general-purpose or special-purpose computer.
An exemplary storage medium is coupled to the processor so that the processor can read information from, and write information to, the storage medium. Of course, the storage medium may also be an integral part of the processor. The processor and the storage medium may be located in an application-specific integrated circuit (ASIC). Of course, the processor and the storage medium may also exist as discrete components in a terminal or a server.
Those of ordinary skill in the art will appreciate that all or part of the steps of the above method embodiments may be completed by hardware related to program instructions. The aforementioned program may be stored in a computer-readable storage medium; when executed, the program performs the steps of the above method embodiments. The aforementioned storage medium includes various media that can store program code, such as a ROM, a RAM, a magnetic disk or an optical disc.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that they may still modify the technical solutions described in the foregoing embodiments, or make equivalent replacements of some or all of the technical features therein; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (14)

1. A method for training a blind source separation model, comprising:
determining a training speech signal by adding noise online according to a noise-adding control parameter, wherein the noise-adding control parameter is a parameter for controlling the noise that satisfies a preset distribution;
training a convolutional neural network with the training speech signal to obtain a blind source separation model.
2. The method according to claim 1, wherein the noise-adding control parameter is a signal-to-noise ratio.
3. The method according to claim 1, wherein the preset distribution is a uniform distribution or a Gaussian distribution.
4. The method according to any one of claims 1 to 3, wherein determining the training speech signal by adding noise online according to the noise-adding control parameter comprises:
obtaining the noise-adding control parameter, a speech signal and a noise;
calculating a mixing coefficient of the speech signal and the noise according to the noise-adding control parameter;
determining the training speech signal according to the mixing coefficient, the speech signal and the noise.
5. The method according to any one of claims 1 to 3, wherein training the convolutional neural network with the training speech signal to obtain the blind source separation model comprises:
performing frame segmentation on the training speech signal to obtain multiple frames of speech signal;
training the convolutional neural network with the multiple frames of speech signal to obtain the blind source separation model.
6. The method according to claim 5, wherein training the convolutional neural network with the multiple frames of speech signal to obtain the blind source separation model comprises:
for each frame of speech signal, extracting a feature value of the speech signal in any of the following ways:
mode one: extracting the magnitude spectrum of the speech signal;
mode two: extracting the Mel spectrum of the speech signal;
mode three: extracting the Mel-frequency cepstral coefficients (MFCC) of the speech signal;
feeding the feature value corresponding to the speech signal into the convolutional neural network as its input, and obtaining the blind source separation model by controlling the mean square error of the convolutional neural network.
7. An apparatus for training a blind source separation model, comprising:
a determining module, configured to determine a training speech signal by adding noise online according to a noise-adding control parameter, wherein the noise-adding control parameter is a parameter for controlling the noise that satisfies a preset distribution;
a processing module, configured to train a convolutional neural network with the training speech signal to obtain a blind source separation model.
8. The apparatus according to claim 7, wherein the noise-adding control parameter is a signal-to-noise ratio.
9. The apparatus according to claim 7, wherein the preset distribution is a uniform distribution or a Gaussian distribution.
10. The apparatus according to any one of claims 7 to 9, wherein the determining module is specifically configured to:
obtain the noise-adding control parameter, a speech signal and a noise;
calculate a mixing coefficient of the speech signal and the noise according to the noise-adding control parameter;
determine the training speech signal according to the mixing coefficient, the speech signal and the noise.
11. The apparatus according to any one of claims 7 to 9, wherein the processing module comprises:
a framing unit, configured to perform frame segmentation on the training speech signal to obtain multiple frames of speech signal;
a training unit, configured to train the convolutional neural network with the multiple frames of speech signal to obtain the blind source separation model.
12. The apparatus according to claim 11, wherein the training unit is specifically configured to:
for each frame of speech signal, extract the feature value of the speech signal in any of the following ways:
mode one: extract the magnitude spectrum of the speech signal;
mode two: extract the Mel spectrum of the speech signal;
mode three: extract the Mel-frequency cepstral coefficients (MFCC) of the speech signal;
feed the feature value corresponding to the speech signal into the convolutional neural network as its input, and obtain the blind source separation model by controlling the mean square error of the convolutional neural network.
13. An apparatus for training a blind source separation model, comprising: a processor and a memory;
the memory stores computer-executable instructions;
the processor executes the computer-executable instructions stored in the memory, so that the processor performs the method for training a blind source separation model according to any one of claims 1 to 6.
14. A computer-readable storage medium, wherein computer-executable instructions are stored in the computer-readable storage medium, and when executed by a processor, the computer-executable instructions implement the method for training a blind source separation model according to any one of claims 1 to 6.
CN201810717811.1A 2018-07-03 2018-07-03 Method, apparatus and storage medium for training a blind source separation model Pending CN108922517A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810717811.1A CN108922517A (en) Method, apparatus and storage medium for training a blind source separation model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810717811.1A CN108922517A (en) Method, apparatus and storage medium for training a blind source separation model

Publications (1)

Publication Number Publication Date
CN108922517A true CN108922517A (en) 2018-11-30

Family

ID=64423445

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810717811.1A Pending CN108922517A (en) Method, apparatus and storage medium for training a blind source separation model

Country Status (1)

Country Link
CN (1) CN108922517A (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101366078A (en) * 2005-10-06 2009-02-11 Dts公司 Neural network classifier for separating audio sources from a monophonic audio signal
CN101710490A (en) * 2009-11-20 2010-05-19 安徽科大讯飞信息科技股份有限公司 Method and device for compensating noise for voice assessment
US20170178664A1 (en) * 2014-04-11 2017-06-22 Analog Devices, Inc. Apparatus, systems and methods for providing cloud based blind source separation services
CN106297819A (en) * 2015-05-25 2017-01-04 国家计算机网络与信息安全管理中心 A kind of noise cancellation method being applied to Speaker Identification
US9858949B2 (en) * 2015-08-20 2018-01-02 Honda Motor Co., Ltd. Acoustic processing apparatus and acoustic processing method
CN107481731A (en) * 2017-08-01 2017-12-15 百度在线网络技术(北京)有限公司 A kind of speech data Enhancement Method and system
CN107680586A (en) * 2017-08-01 2018-02-09 百度在线网络技术(北京)有限公司 Far field Speech acoustics model training method and system

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110222693A (en) * 2019-06-03 2019-09-10 Method and apparatus for constructing a character recognition model and recognizing characters
WO2021027132A1 (en) * 2019-08-12 2021-02-18 平安科技(深圳)有限公司 Audio processing method and apparatus and computer storage medium
CN111081222A (en) * 2019-12-30 2020-04-28 北京明略软件系统有限公司 Speech recognition method, speech recognition apparatus, storage medium, and electronic apparatus
CN111243573A (en) * 2019-12-31 2020-06-05 深圳市瑞讯云技术有限公司 Voice training method and device
CN112489675A (en) * 2020-11-13 2021-03-12 北京云从科技有限公司 Multi-channel blind source separation method and device, machine readable medium and equipment
CN114067785A (en) * 2022-01-05 2022-02-18 江苏清微智能科技有限公司 Voice deep neural network training method and device, storage medium and electronic device
CN117292703A (en) * 2023-11-24 2023-12-26 国网辽宁省电力有限公司电力科学研究院 Sound source positioning method and device for transformer equipment, electronic equipment and storage medium
CN117292703B (en) * 2023-11-24 2024-03-15 国网辽宁省电力有限公司电力科学研究院 Sound source positioning method and device for transformer equipment, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01: Publication
SE01: Entry into force of request for substantive examination
RJ01: Rejection of invention patent application after publication (application publication date: 20181130)