Single-channel mixed speech separation method based on compressed sensing and K-SVD
Technical field
The present invention relates to speech separation, a special category of speech enhancement, and in particular to a single-channel mixed speech separation method based on compressed sensing and K-SVD. It belongs to the technical field of speech signal processing.
Background technology
Speech is the most convenient, direct, and commonly used means of human communication. In real environments, however, acquired speech signals are inevitably corrupted by ambient noise. This interference degrades the performance of speech processing systems (for example, speech recognition systems) and also impairs the perception and understanding of speech by the human ear. Speech enhancement is therefore necessary. Speech separation is a special class of speech enhancement in which the noise is itself speech, which is generally hard to handle. It is the process of recovering or separating the independent source speech signals using only the observed data collected from a microphone (the mixed speech signal), with the source speech signals and the transmission channel parameters (the mixing process) unknown. Its purpose is to enhance the target speech and suppress the interfering speech. Because the target speech and the interfering speech have similar characteristics, speech separation is the most difficult of the speech enhancement methods. Single-channel mixed speech separation requires separating multiple mutually independent source speech signals from the mixed speech signal collected by a single microphone; since it has the least prior information, it is more difficult still. However, a single microphone is the easiest setup to deploy and the most commonly used, so a breakthrough in single-channel mixed speech separation would have the greatest practical value.
At present there are three main classes of single-channel mixed speech separation methods: methods based on statistical models, computational auditory scene analysis (CASA), and methods based on projection decomposition. Methods based on statistical models rely on training and modeling of the signals and usually have three steps. First, each source speech signal or its feature parameters are modeled, and the parameters of each source model are determined by training. Second, with the mixed speech signal and the source models as known conditions, components of the source speech signals are selected according to a suitable criterion so that they optimally compose the mixed speech signal. Third, the separated source speech is formed directly from the selected components, or a corresponding filter is formed first and each source speech signal is then estimated. CASA separates speech by imitating the human auditory system; its core consists of two parts, segmentation and grouping. Segmentation decomposes the mixed speech into a series of perceptual segments, each of which is required to come from a single source speech signal. Grouping merges the perceptual segments that come from the same source signal to form the stream of that source. Methods based on projection decomposition generally first construct suitable basis functions or a dictionary by machine learning, then estimate the projection vector of each source speech signal under a basis function or dictionary by a probabilistic or optimization method, and finally reconstruct the separated speech signals from the estimated projection vectors and the corresponding basis functions or dictionary.
In terms of separation method: algorithms based on probabilistic statistical models emphasize probabilistic methods, separate the mixed speech on the basis of probabilistic modeling, and require training in advance; CASA emphasizes biological simulation, separates the mixed speech by simulating the human auditory system, and does not require training in advance; algorithms based on projection decomposition emphasize machine learning, construct suitable basis functions or a dictionary by machine learning, separate the mixed speech on that basis, and require training in advance. In terms of separation performance: in general, algorithms based on projection decomposition perform best, algorithms based on probabilistic statistical models come second, and CASA performs worst. In terms of algorithm complexity: CASA separates speech by simulating the human auditory system and must repeatedly adjust the segmentation and regrouping of the speech, so its complexity is the highest. Algorithms based on projection decomposition and algorithms based on probabilistic statistical models both rest on probabilistic models or optimization methods, so their complexity is generally similar. In terms of development potential, each of the three has its own strengths and weaknesses and its own room for development. Although research on single-channel mixed speech separation has achieved certain results, the algorithms are on the whole complex, their performance varies considerably with the source speech signals, and they place special requirements on the training data in the training stage. In general, therefore, their practicality is limited and needs to be improved before they can be applied in practice.
Summary of the invention
The invention provides a single-channel mixed speech separation method based on compressed sensing and K-SVD. Its aim is practicality: a single-channel mixed speech separation method that places no special requirements on the training data and has stable performance, and that can enhance the target speech and reduce the interfering speech. The method exploits the sparsity of speech signals under a K-SVD dictionary and, based on the similarity between the expression for a compressed-sensing observation and the expression for single-channel mixed speech, uses the signal reconstruction method of compressed sensing theory to separate the single-channel mixed speech, thereby enhancing the target speech and suppressing the interfering speech.
To achieve the above object, the present invention adopts the following technical solution:
A single-channel mixed speech separation method based on compressed sensing and K-SVD, characterized in that the method exploits the sparsity of speech signals under a K-SVD dictionary and, based on the similarity between the expression for a compressed-sensing observation and the expression for single-channel mixed speech, uses the signal reconstruction method of compressed sensing theory to separate the single-channel mixed speech, thereby enhancing the target speech and suppressing the interfering speech. The steps are:
1) Using the K-SVD algorithm, divide the three classes of combined training speech (male-male, male-female, and female-female) into frames, and for each class construct from its combined training speech frames one overcomplete dictionary applicable to that class, namely the K-SVD dictionary;
2) Divide the single-channel mixed speech into frames and separate it frame by frame. Based on the constructed K-SVD dictionary and the expression for each single-channel mixed speech frame, and exploiting the similarity between the compressed-sensing observation and the mixed speech expression, use the compressed-sensing signal reconstruction algorithm based on $\ell_0$-norm optimization to estimate the sparse representation of each source speech frame under the K-SVD dictionary, and reconstruct each separated speech frame as the product of this estimate and the K-SVD dictionary;
3) Concatenate the separated speech frames in order to obtain the separated speech signals.
In the above:
1) The specific method for constructing the K-SVD dictionary with the K-SVD algorithm is:
A. Let $x = s_1 + s_2$ be the known single-channel mixed speech, where $s_i$ ($i = 1, 2$) are the unknown source speech signals. Assume that the speakers corresponding to $s_1$ and $s_2$ are known. Divide the training speech of these two speakers into frames of $L$ samples each, with $L = 128$ and no overlap between frames, and denote the resulting training frames by $\{s_{1,\mathrm{train}}^{i}\}_{i=1}^{N_{\mathrm{train}}}$ and $\{s_{2,\mathrm{train}}^{i}\}_{i=1}^{N_{\mathrm{train}}}$. Concatenating $s_{1,\mathrm{train}}^{i}$ and $s_{2,\mathrm{train}}^{i}$ in order gives the combined training speech frames
$s_{\mathrm{train}}^{i} = [(s_{1,\mathrm{train}}^{i})^T, (s_{2,\mathrm{train}}^{i})^T]^T, \quad i = 1, 2, \dots, N_{\mathrm{train}}$
where $s_{1,\mathrm{train}}^{i}$ is the $i$-th training frame of the speaker corresponding to $s_1$, $s_{2,\mathrm{train}}^{i}$ is the $i$-th training frame of the speaker corresponding to $s_2$, and $N_{\mathrm{train}}$ is the number of training frames per speaker, which is the same for both speakers;
B. Train an overcomplete dictionary $Q$ with the K-SVD algorithm such that every combined training speech frame, represented under $Q$, keeps its reconstruction error within a set range while being as sparse as possible. This can be expressed mathematically as:
$\min_{Q, \{\gamma_i\}} \sum_{i=1}^{N_{\mathrm{train}}} \|\gamma_i\|_0 \quad \text{s.t.} \quad \|s_{\mathrm{train}}^{i} - Q\gamma_i\|_2^2 \le \varepsilon, \quad i = 1, 2, \dots, N_{\mathrm{train}}$
where $\gamma_i$ is the sparse representation of $s_{\mathrm{train}}^{i}$ under the dictionary $Q$, and $\varepsilon$ is the preset reconstruction error threshold, set to 0.1;
2) The specific method for framing the single-channel mixed speech and separating it frame by frame is:
Divide the single-channel mixed speech into frames of $L$ samples each, again with $L = 128$ and no overlap between frames, and separate it frame by frame; every frame is separated in the same way. The method is described below using the $j$-th single-channel mixed speech frame as an example:
A. Denote the $j$-th frame of the single-channel mixed speech by
$x_j = s_1^j + s_2^j$
where $s_i^j$ is the $j$-th frame of the source signal $s_i$ ($i = 1, 2$). In matrix form,
$x_j = [I_{L \times L} \;\; I_{L \times L}] \, [(s_1^j)^T, (s_2^j)^T]^T$
where $I_{L \times L}$ is the $L \times L$ identity matrix. Because $Q$ reflects the common properties of all combined training speech frames, and the training process guarantees that all combined training speech frames are sparse under $Q$, the stacked vector $[(s_1^j)^T, (s_2^j)^T]^T$ can also be regarded as sparse under $Q$. Let $\beta$ be its sparse representation under $Q$, that is,
$[(s_1^j)^T, (s_2^j)^T]^T = Q\beta$, with $\|\beta\|_0$ small,
where $\|\cdot\|_0$ denotes the $\ell_0$-norm, the number of nonzero elements in a vector. Defining $P = [I_{L \times L} \;\; I_{L \times L}]$, $x_j$ can be written as
$x_j = PQ\beta$
B. Because the expression for a compressed-sensing observation is closely similar to the expression $x_j = PQ\beta$ for the single-channel mixed speech frame, the compressed-sensing approach of reconstructing a signal's sparse representation from its observation is used to estimate the sparse representation of the stacked source speech frames $[(s_1^j)^T, (s_2^j)^T]^T$ under the K-SVD dictionary:
Let $s = [s(1), s(2), \dots, s(N)]^T$ be a discrete signal of length $N$, and let $\Psi$ be a known basis or dictionary under which $s$ is sparse, that is:
$s = \Psi\alpha$ and $\|\alpha\|_0 \ll N$
where $\alpha$ is the sparse representation of $s$ under $\Psi$. Compressed sensing theory holds that when $s$ is sparse under $\Psi$, $\alpha$, and hence $s$, can be reconstructed almost losslessly from an observation $y = [y(1), y(2), \dots, y(M)]^T$ of suitable dimension, where $y$ is obtained by multiplying $s$ by the observation matrix $\Phi$:
$y = \Phi s = \Phi\Psi\alpha$
where $\Phi$ is an $M \times N$ observation matrix;
Comparing $x_j = PQ\beta$ with $y = \Phi s = \Phi\Psi\alpha$ and regarding $P$ as the observation matrix, the two expressions have essentially the same form: $\Phi$ and $P$ are observation matrices, $\Psi$ and $Q$ are known bases or dictionaries, and $\alpha$ and $\beta$ are sparse representations of the signal under the basis or dictionary. Therefore, following the compressed-sensing approach of reconstructing a signal's sparse representation under a dictionary from its observation by $\ell_0$-norm optimization, the sparse representation of each source speech frame under the K-SVD dictionary is estimated by solving:
$\hat{\beta} = \arg\min_{\beta} \|\beta\|_0 \quad \text{s.t.} \quad x_j = PQ\beta$
This yields $\hat{\beta}$, the estimate of the sparse representation of $[(s_1^j)^T, (s_2^j)^T]^T$ under $Q$, where $\hat{\beta}$ is the optimal solution of the $\ell_0$-norm optimization problem above;
C. Reconstruct the separated speech frames as the product of the K-SVD dictionary $Q$ and the estimated sparse representation $\hat{\beta}$:
$[(\hat{s}_1^j)^T, (\hat{s}_2^j)^T]^T = Q\hat{\beta}$
where $\hat{s}_1^j$ and $\hat{s}_2^j$ are the separated speech frames;
3) Concatenate the separated speech frames in order to obtain the separated speech signals:
$\hat{s}_i = [(\hat{s}_i^1)^T, (\hat{s}_i^2)^T, \dots, (\hat{s}_i^{N_i})^T]^T, \quad i = 1, 2$
where $N_i$ is the total number of frames of the source speech signal $s_i$.
Advantages and beneficial effects of the present invention:
The present invention separates single-channel mixed speech on the basis of compressed sensing theory and K-SVD, enhancing the target speech and suppressing the interfering speech, and has the advantages of practicality and stable performance. When the dictionary is trained with the K-SVD algorithm, only three groups of training speech frames of different types need to be trained (male-male, male-female, and female-female combined training speech frames). This produces three different overcomplete dictionaries, which are used to separate the three corresponding types of mixed speech (male-male, male-female, and female-female mixed speech); there is no need to train a separate dictionary for each source speaker. Because the method places no particularly strict requirements on the training data, it is practical. In addition, the method mainly exploits the sparsity of speech signals under the K-SVD dictionary and makes little use of the characteristics of the individual source speech signals, so the separation results differ little between groups of mixed speech. Simulation experiments also show that the method has stable performance and that its separation results depend little on the source speech signals.
Description of drawings:
Fig. 1 is the system block diagram of the method of the invention;
Fig. 2 is the flowchart of the K-SVD algorithm;
Fig. 3 shows the single-channel mixed speech separation performance based on compressed sensing theory and K-SVD at different input signal-to-noise ratios, measured by the two-speaker average sentence improved signal-to-noise ratio $ISNR_{av}$;
Fig. 4 shows the single-channel mixed speech separation performance based on compressed sensing theory and DCT at different input signal-to-noise ratios, measured by the two-speaker average sentence improved signal-to-noise ratio $ISNR_{av}$;
Fig. 5 gives the meaning of the mean opinion scores;
Fig. 6 shows the mean opinion score of each speaker's separated sentences for the method based on compressed sensing theory and K-SVD at an input signal-to-noise ratio of 0 dB.
Embodiment
Fig. 1 is the system block diagram implementing this scheme. As shown in the figure, the invention first trains an overcomplete dictionary with the K-SVD algorithm and then, based on the constructed K-SVD dictionary, separates the single-channel mixed speech with the compressed-sensing signal reconstruction algorithm based on $\ell_0$-norm optimization.
The speech used in the experiments is sampled at 16 kHz. There are four speakers, two male and two female. For each speaker, speech of 40 Chinese phrases is used to construct the training speech, and speech of 5 randomly chosen Chinese phrases is used as test speech; the test speech differs from the training speech. The single-channel mixed speech $x$ is obtained by superposing two source test utterances $s_1$ and $s_2$, that is, $x = s_1 + s_2$, which gives 100 male-female mixtures, 25 male-male mixtures, and 25 female-female mixtures. All speech is framed with a rectangular window, with a frame length of 128 samples (8 ms) and no overlap between frames.
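The framing and mixing described above can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: the function names are hypothetical, and two details the text does not specify are assumptions, namely that trailing samples that do not fill a whole frame are dropped and that the mixture is truncated to the shorter source.

```python
import numpy as np

FS = 16000  # sampling rate in Hz, as stated in the text
L = 128     # frame length in samples, as stated in the text


def frame_signal(signal, frame_len=L):
    """Split a 1-D signal into non-overlapping rectangular-window frames.

    Assumption: trailing samples that do not fill a whole frame are dropped.
    Returns an array of shape (n_frames, frame_len).
    """
    signal = np.asarray(signal, dtype=float)
    n_frames = len(signal) // frame_len
    return signal[: n_frames * frame_len].reshape(n_frames, frame_len)


def mix(s1, s2):
    """Single-channel mixture x = s1 + s2 (assumption: truncate to the shorter source)."""
    s1 = np.asarray(s1, dtype=float)
    s2 = np.asarray(s2, dtype=float)
    n = min(len(s1), len(s2))
    return s1[:n] + s2[:n]


frame_ms = 1000.0 * L / FS  # frame duration in milliseconds
```

At 16 kHz, 128 samples correspond to 8 ms, which matches the frame duration given in the text.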
To construct the K-SVD dictionaries, three groups of training speech frames are needed: one group for male-female mixed speech (male-female combined training speech frames), one group for male-male mixed speech (male-male combined training speech frames), and one group for female-female mixed speech (female-female combined training speech frames). They are constructed as follows:
1. A male speech frame and a female speech frame are concatenated in order to form a male-female combined training speech frame. Taking one speech frame from each speaker yields four male-female combined training frames, each 256 samples (16 ms) long.
2. Two male speech frames are concatenated in order to form a male-male combined training speech frame. Taking one speech frame from each male speaker yields one male-male combined training frame, 256 samples (16 ms) long.
3. Two female speech frames are concatenated in order to form a female-female combined training speech frame. Taking one speech frame from each female speaker yields one female-female combined training frame, 256 samples (16 ms) long.
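The three constructions above share one operation: the $i$-th frame of one speaker is concatenated with the $i$-th frame of another. A minimal sketch follows; the function name `combined_training_frames` and the column-per-frame output layout are assumptions chosen for illustration.

```python
import numpy as np


def combined_training_frames(frames_a, frames_b):
    """Concatenate the i-th frame of speaker A with the i-th frame of speaker B.

    frames_a, frames_b: arrays of shape (n_frames, L) with the same n_frames.
    Returns an array of shape (2L, n_frames), one combined frame per column
    (a layout chosen here so the result can be used directly as training data).
    """
    frames_a = np.asarray(frames_a, dtype=float)
    frames_b = np.asarray(frames_b, dtype=float)
    if frames_a.shape != frames_b.shape:
        raise ValueError("both speakers must supply the same number of frames")
    return np.hstack([frames_a, frames_b]).T
```

With 128-sample frames, each column has 256 samples, matching the 16 ms combined frame length stated above.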
The K-SVD algorithm is applied to each of the three groups of training speech to construct three overcomplete dictionaries, used to separate male-female, male-male, and female-female mixed speech respectively. The dimension of each atom equals the combined training frame length, and the number of atoms in each dictionary is set to 1024, so each dictionary has dimension 256 × 1024. The initial dictionary is an overcomplete DCT dictionary of dimension 256 × 1024, the reconstruction error threshold $\varepsilon$ is set to 0.1, and the number of iterations is set to 30.
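The text does not state how the overcomplete DCT initial dictionary is built. One common construction, shown here purely as an assumption, samples the DCT-II cosine family on a frequency grid finer than the signal length and normalizes each atom; the function name `overcomplete_dct` is hypothetical.

```python
import numpy as np


def overcomplete_dct(n_rows=256, n_atoms=1024):
    """Overcomplete DCT-style dictionary with unit-norm columns.

    Assumed construction: atom k is cos(pi * k * (n + 0.5) / n_atoms) sampled at
    n = 0..n_rows-1, i.e. the DCT-II cosines on a finer frequency grid.
    """
    n = np.arange(n_rows).reshape(-1, 1)
    k = np.arange(n_atoms).reshape(1, -1)
    D = np.cos(np.pi * k * (n + 0.5) / n_atoms)
    D /= np.linalg.norm(D, axis=0, keepdims=True)  # normalize each atom
    return D
```

When `n_atoms` equals `n_rows`, this reduces to the orthonormal DCT-II basis, which corresponds to the square DCT basis used for comparison later in the text.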
When the overcomplete dictionary is trained with the K-SVD algorithm, the dictionary is updated iteratively so that, under the constructed overcomplete dictionary $Q$, the combined training speech frames $s_{\mathrm{train}}^{i}$ satisfy
$\min_{Q, \{\gamma_i\}} \sum_{i} \|\gamma_i\|_0 \quad \text{s.t.} \quad \|s_{\mathrm{train}}^{i} - Q\gamma_i\|_2^2 \le \varepsilon$
Each iteration has two steps, described here for the $j$-th iteration:
1. Sparse decomposition: keep the dictionary $Q_{j-1}$ obtained in iteration $j-1$ fixed and solve for the sparse representation $\gamma_i^{j}$ of each training frame $s_{\mathrm{train}}^{i}$ under $Q_{j-1}$, that is, solve the optimization problem above with $Q = Q_{j-1}$. This is usually done with the matching pursuit algorithm.
2. Dictionary update: keep $\gamma_i^{j}$ fixed and update the dictionary $Q_{j-1}$ column by column so that the reconstruction error $\sum_{i} \|s_{\mathrm{train}}^{i} - Q\gamma_i^{j}\|_2^2$ is minimized. This can be done with the singular value decomposition. The flowchart of the K-SVD algorithm is shown in Fig. 2.
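The two steps of one iteration can be sketched as below. This is a minimal sketch, not the implementation used in the experiments: it uses orthogonal matching pursuit, a common variant of the matching pursuit named in the text, with the stopping rule $\|y - Dg\|_2 \le \varepsilon$, and the function names `omp` and `ksvd_iteration` are hypothetical.

```python
import numpy as np


def omp(D, y, eps, max_atoms=None):
    """Orthogonal matching pursuit: add atoms greedily until ||y - D g||_2 <= eps."""
    max_atoms = max_atoms or D.shape[0]
    support, coef = [], np.zeros(0)
    residual = y.copy()
    while np.linalg.norm(residual) > eps and len(support) < max_atoms:
        k = int(np.argmax(np.abs(D.T @ residual)))  # best-matching atom
        if k in support:
            break  # no further progress is possible
        support.append(k)
        # Least-squares fit of y on the atoms selected so far.
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coef
    g = np.zeros(D.shape[1])
    if support:
        g[support] = coef
    return g


def ksvd_iteration(D, Y, eps):
    """One K-SVD iteration on training frames Y (one frame per column)."""
    # Step 1: sparse decomposition with the dictionary held fixed.
    G = np.column_stack([omp(D, Y[:, i], eps) for i in range(Y.shape[1])])
    # Step 2: update the dictionary column by column via a rank-1 SVD fit.
    D = D.copy()
    for k in range(D.shape[1]):
        users = np.nonzero(G[k, :])[0]  # frames that use atom k
        if users.size == 0:
            continue  # unused atoms are left unchanged in this sketch
        G[k, users] = 0.0
        E = Y[:, users] - D @ G[:, users]  # error without atom k
        U, S, Vt = np.linalg.svd(E, full_matrices=False)
        D[:, k] = U[:, 0]
        G[k, users] = S[0] * Vt[0, :]
    return D, G
```

Restricting each atom update to the frames that already use that atom keeps the sparsity pattern fixed, so the total reconstruction error cannot increase during the dictionary update step.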
Based on the constructed overcomplete dictionary $Q$, and on the similarity between the expressions for the single-channel mixed speech and the compressed-sensing observation, the single-channel mixed speech $x$ is separated with the compressed-sensing signal reconstruction method based on $\ell_0$-norm optimization. Specifically, the minimum $\ell_0$-norm problem
$\min_{\beta} \|\beta\|_0 \quad \text{s.t.} \quad x = PQ\beta$
is solved first to obtain $\hat{\beta}$, the estimate of the sparse representation of the stacked source signals $[s_1^T, s_2^T]^T$ under $Q$, where $\hat{\beta}$ is the optimal solution of this problem. Solving the minimum $\ell_0$-norm problem requires enumerating all candidate values of $\beta$ that satisfy the constraint and then selecting the one with the fewest nonzero elements, which is very complex and hard to implement. It is therefore usually converted into the minimum $\ell_1$-norm problem, which gives the same solution under suitable conditions on $PQ$ and on the sparsity of $\beta$:
$\min_{\beta} \|\beta\|_1 \quad \text{s.t.} \quad x = PQ\beta$
where $\|\cdot\|_1$ denotes the $\ell_1$-norm, the sum of the absolute values of the elements of a vector. This problem can be regarded as the convexification of the $\ell_0$-norm problem above and is readily solved by linear programming:
$\min_{z} c^T z \quad \text{s.t.} \quad Az = b, \; z \ge 0$
where $A = (PQ, -PQ)$, $b = x$, $c = (1; 1)$ with $1$ denoting an all-ones vector, and $z = (u; v)$ with $\beta = u - v$.
Finally, the separated speech signals are reconstructed as
$[\hat{s}_1^T, \hat{s}_2^T]^T = Q\hat{\beta}$
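The linear programming form can be checked numerically. The sketch below does not solve the program; it only assembles $A$, $b$, and $c$ and verifies that splitting $\beta$ into nonnegative parts $u$ and $v$ with $\beta = u - v$ gives a feasible point whose cost equals $\|\beta\|_1$. The function names are hypothetical, and the variable split is the standard one for this reformulation, stated here as an assumption about the text's intended form.

```python
import numpy as np


def basis_pursuit_lp(P, Q, x):
    """Assemble (A, b, c) for: min c^T z  s.t.  A z = b, z >= 0."""
    PQ = P @ Q
    A = np.hstack([PQ, -PQ])
    b = np.asarray(x, dtype=float)
    c = np.ones(A.shape[1])
    return A, b, c


def split_pos_neg(beta):
    """Write beta = u - v with u, v >= 0 and return z = (u; v)."""
    u = np.maximum(beta, 0.0)
    v = np.maximum(-beta, 0.0)
    return np.concatenate([u, v])


def recover_beta(z):
    """Recover beta = u - v from z = (u; v)."""
    n = z.size // 2
    return z[:n] - z[n:]
```

Any standard linear programming solver could then be applied to `(A, b, c)`, with `recover_beta` mapping its solution back to the sparse representation.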
To evaluate the proposed single-channel mixed speech separation method based on compressed sensing theory and K-SVD, separation experiments were carried out in Matlab on mixed speech at different input signal-to-noise ratios. Separation performance is measured by the two-speaker average sentence improved signal-to-noise ratio $ISNR_{av}$ (Improved Signal to Noise Ratio, ISNR, the improvement in signal-to-noise ratio from before to after separation), as shown in Fig. 3. $ISNR_{av}$ is defined as follows:
$ISNR_{av} = \frac{1}{2}(ISNR_1 + ISNR_2)$
where $ISNR_i$ is the average sentence improved signal-to-noise ratio of the $i$-th speaker,
$ISNR_i = \frac{1}{K} \sum_{k=1}^{K} 10 \log_{10} \frac{\|x^k - s_i^k\|_2^2}{\|\hat{s}_i^k - s_i^k\|_2^2}$
where $x^k$ is the mixed speech signal obtained by superposing $s_1^k$ and $s_2^k$, $s_i^k$ ($k = 1, 2, \dots, K$) is the $k$-th source speech sentence of the $i$-th speaker, $\hat{s}_i^k$ ($k = 1, 2, \dots, K$) is the $k$-th separated speech sentence of the $i$-th speaker, and $K$ is the number of mixed speech sentences.
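A small sketch of the metric follows. It assumes the usual reading of improved signal-to-noise ratio, namely the output SNR of the separated signal minus the input SNR of the mixture, both measured against the same source; the exact formula in the original may differ, and the function names are hypothetical.

```python
import numpy as np


def isnr_db(source, mixture, separated):
    """SNR improvement in dB for one sentence (assumed: output SNR minus input SNR)."""
    noise_in = np.sum((mixture - source) ** 2)     # error energy before separation
    noise_out = np.sum((separated - source) ** 2)  # error energy after separation
    return 10.0 * np.log10(noise_in / noise_out)


def average_isnr(sources, mixtures, separated):
    """Mean ISNR over the K sentences of one speaker."""
    values = [isnr_db(s, x, y) for s, x, y in zip(sources, mixtures, separated)]
    return float(np.mean(values))


def isnr_av(isnr_1, isnr_2):
    """Two-speaker average of the per-speaker ISNR values."""
    return 0.5 * (isnr_1 + isnr_2)
```

For example, if separation reduces the amplitude of the interfering component to one tenth of its original value, the error energy falls by a factor of 100 and the ISNR is 20 dB.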
As Fig. 3 shows, when the separation algorithm based on compressed sensing theory and K-SVD is applied to the three classes of mixed speech (male-male, male-female, and female-female), the average sentence signal-to-noise ratio improves in every case, and the values of $ISNR_{av}$ are similar across classes. This indicates that the algorithm separates all classes of mixed speech to a certain degree, and that its separation performance is stable and depends little on the source speech signals.
To demonstrate the effectiveness of constructing the overcomplete dictionary with the K-SVD algorithm, the K-SVD overcomplete dictionary was replaced with a DCT basis of dimension 256 × 256, and the single-channel mixed speech was separated with the same compressed-sensing signal reconstruction approach. Fig. 4 shows the resulting separation performance based on compressed sensing theory and DCT, using the same test speech.
Comparing Fig. 3 and Fig. 4 shows that separation based on compressed sensing theory and K-SVD outperforms separation based on compressed sensing and DCT. This indicates that separating speech with an overcomplete dictionary constructed by the K-SVD algorithm is more effective than separating it directly with a DCT basis.
To assess the subjective listening quality of the separated speech, the ITU-T P.862 standard was used to evaluate the separated speech obtained by the separation algorithm based on compressed sensing (CS) and K-SVD at the input signal-to-noise ratio of 0 dB. Because the mean opinion score obtained with P.862 ranges from 0 to 4.5, it was converted to the range 1 to 5. The meaning of the mean opinion scores is given in Fig. 5.
The mean opinion score of each speaker's separated sentences for the method based on compressed sensing theory and K-SVD at an input signal-to-noise ratio of 0 dB is shown in Fig. 6.