CN102081928A

CN102081928A - Method for separating single-channel mixed voice based on compressed sensing and K-SVD

Info

Publication number: CN102081928A
Application number: CN2010105566949A
Authority: CN
Inventors: 郭海燕; 杨震
Original assignee: Nanjing Post and Telecommunication University
Current assignee: Nanjing Post and Telecommunication University; Nanjing University of Posts and Telecommunications
Priority date: 2010-11-24
Filing date: 2010-11-24
Publication date: 2011-06-01
Anticipated expiration: 2030-11-24
Also published as: CN102081928B

Abstract

The invention relates to a method for separating single-channel mixed speech based on compressed sensing and K-SVD. The method comprises the following steps: using the K-SVD algorithm on combined training speech frames, a generally applicable overcomplete dictionary (the K-SVD dictionary) is constructed for each of three types of mixed training speech (male-male, male-female and female-female), so that the signals are sparse under the dictionary while the reconstruction error stays within a set range; based on the constructed K-SVD dictionary, and starting from the similarity between the expression of a compressed sensing observation and that of single-channel mixed speech, the single-channel mixed speech is separated with an l₀-norm-optimization-based signal reconstruction algorithm from compressed sensing theory; from the expression of each single-channel mixed speech frame, an estimate of the sparse representation of each source speech frame under the K-SVD dictionary is obtained, and each separated speech frame is reconstructed from this estimate and the K-SVD dictionary; the separated speech frames are then concatenated in order to obtain the separated speech signals.

Description

Single-channel mixed speech separation method based on compressed sensing and K-SVD

Technical field

The present invention relates to speech separation, a special class of speech enhancement, and in particular to a single-channel mixed speech separation method based on compressed sensing and K-SVD. It belongs to the technical field of speech signal processing.

Background technology

Speech is the most direct, most convenient and most commonly used means of human communication. In real environments, however, speech signals are inevitably corrupted by ambient noise when they are acquired. Such interference degrades the performance of speech processing systems (for example, speech recognition systems) and also impairs the perception and understanding of speech by the human ear. Speech enhancement is therefore particularly necessary. Speech separation is a special class of speech enhancement in which the noise is the hard-to-handle case of interfering speech: with the source speech signals and the transmission channel parameters (i.e., the mixing process) unknown, the independent source speech signals are recovered or separated using only the observed data collected by the microphone (i.e., the mixed speech signal). Its purpose is to enhance the target speech and suppress the interfering speech. Because the target speech and the interfering speech have similar characteristics, speech separation is the most difficult of the speech enhancement problems. Single-channel mixed speech separation requires several mutually independent source speech signals to be separated from the mixed speech signal collected by a single microphone; since it has the least prior information, it is harder still. However, a single microphone is the easiest setup to deploy and the most commonly used, so a breakthrough in single-channel mixed speech separation would have the greatest practical value.

At present there are three main classes of single-channel mixed speech separation methods: methods based on statistical models, computational auditory scene analysis (CASA), and methods based on projection decomposition. Methods based on statistical models are built on trained signal models and usually comprise three steps. First, each source speech signal, or its feature parameters, is modeled, and the parameters of each source model are determined by training. Second, with the mixed speech signal and the source models known, components of the source speech signals are selected according to a suitable criterion so that they optimally compose the mixed speech signal. Third, the separated source speech is formed directly from the selected components, or a corresponding filter is first formed and then used to predict each source speech signal. CASA separates speech by imitating the human auditory system; its core consists of two parts, segmentation and grouping. Segmentation decomposes the mixed speech into a series of perceptual segments, each of which is required to come from a single source speech signal. Grouping merges the perceptual segments that come from the same source signal into a stream for that source. Methods based on projection decomposition generally first construct suitable basis functions or a dictionary by machine learning, then predict the projection vector of each source speech signal under the basis functions or dictionary by probabilistic or optimization methods, and finally obtain the separated speech signals by reconstruction from the predicted projection vectors and the corresponding basis functions or dictionary.

In terms of approach: separation algorithms based on probabilistic statistical models emphasize probabilistic methods, separate the mixed speech on the basis of probabilistic modeling, and require prior training; CASA emphasizes biological simulation, separates the mixed speech by simulating the human auditory system, and does not require prior training; separation algorithms based on projection decomposition emphasize machine learning, construct suitable basis functions or a dictionary by machine learning and separate the mixed speech on that basis, and require prior training. In terms of separation performance: generally, algorithms based on projection decomposition perform best, algorithms based on probabilistic statistical models come second, and CASA performs worst. In terms of algorithmic complexity: CASA separates speech by simulating the human auditory system and must repeatedly adjust the segmentation and grouping of the speech, so its complexity is the highest. Algorithms based on projection decomposition and algorithms based on probabilistic statistical models both rely on probabilistic models or optimization methods, so in general their complexity is comparable. In terms of development potential, each of the three has strengths and weaknesses and its own room for development. Although research on single-channel mixed speech separation has achieved certain results, the algorithms are in general fairly complex, their performance varies considerably with the source speech signals, and they place special requirements on the training data in the training stage. On the whole, therefore, their practicality is limited and needs to be improved before they can be applied in practice.

Summary of the invention

The invention provides a single-channel mixed speech separation method based on compressed sensing and K-SVD. With practicality as its main consideration, it aims at a single-channel mixed speech separation method that places no special requirements on the training data and performs stably, so that the target speech is enhanced and the interfering speech is reduced. The method exploits the sparsity of speech signals under a K-SVD dictionary and, based on the similarity between the expression of a compressed sensing observation and that of single-channel mixed speech, uses a signal reconstruction method from compressed sensing theory to separate the single-channel mixed speech, thereby enhancing the target speech and suppressing the interfering speech.

To achieve the above object, the invention adopts the following technical solution:

A single-channel mixed speech separation method based on compressed sensing and K-SVD, characterized in that the method exploits the sparsity of speech signals under a K-SVD dictionary and, based on the similarity between the expression of a compressed sensing observation and that of single-channel mixed speech, uses a signal reconstruction method from compressed sensing theory to separate the single-channel mixed speech, thereby enhancing the target speech and suppressing the interfering speech. The steps are:

1) Using the K-SVD algorithm, split the three classes of combined training speech (male-male, male-female and female-female) into frames, and from the combined training speech frames of each class construct a generally applicable overcomplete dictionary, the K-SVD dictionary;

2) Split the single-channel mixed speech into frames and separate it frame by frame. Based on the constructed K-SVD dictionary and the expression of each single-channel mixed speech frame, and starting from the similarity between the expressions of a compressed sensing observation and of single-channel mixed speech, use the l₀-norm-optimization-based signal reconstruction algorithm from compressed sensing theory to obtain an estimate of the sparse representation of each source speech frame under the K-SVD dictionary, and reconstruct each separated speech frame as the product of the K-SVD dictionary and this estimate;

3) Concatenate the separated speech frames in order to obtain the separated speech signals.

In the above steps:

1) The K-SVD dictionary is constructed with the K-SVD algorithm as follows:

A. Let x = s₁ + s₂ be the known single-channel mixed speech, where s_i (i = 1, 2) are the unknown source speech signals. Assume that the speakers corresponding to s₁ and s₂ are known. Split the training speech of these two speakers into frames of L samples per frame, with L = 128 and no overlap between frames, and denote the resulting training frames by s_{1,{train}_{i}} and s_{2,{train}_{i}}. Concatenating s_{1,{train}_{i}} and s_{2,{train}_{i}} in order gives the combined training speech frame

x_{{train}_{i}} = [\begin{matrix} s_{1,{train}_{i}} \\ s_{2,{train}_{i}} \end{matrix}], \quad i = 1, 2, \cdots, N_{train}

where s_{1,{train}_{i}} is the i-th training speech frame of the speaker corresponding to s₁, s_{2,{train}_{i}} is the i-th training speech frame of the speaker corresponding to s₂, and N_{train} is the number of training speech frames of each speaker; the two speakers have the same number of training speech frames;

B. Train an overcomplete dictionary Q with the K-SVD algorithm such that every combined training speech frame is as sparse as possible under Q while its reconstruction error stays within a set range. This can be written as:

\forall i : \min_{γ_{i}} {\| γ_{i} \|}_{0} \quad s.t. \quad {\| x_{{train}_{i}} - Q γ_{i} \|}_{2} \leq ϵ

where γ_i is the sparse representation of x_{{train}_{i}} under the dictionary Q, and ε is the preset reconstruction error threshold, set to 0.1;
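The framing and concatenation in step A can be sketched as follows. This is a minimal illustration in plain Python, not the inventors' implementation; the function names, the toy signals, and the choice to drop an incomplete final frame are assumptions made here.

```python
def split_frames(signal, frame_len=128):
    """Split a 1-D sequence into non-overlapping frames.

    Assumption: an incomplete frame at the end is dropped.
    """
    n_frames = len(signal) // frame_len
    return [list(signal[k * frame_len:(k + 1) * frame_len]) for k in range(n_frames)]


def combined_training_frames(train_1, train_2, frame_len=128):
    """Stack the i-th frame of speaker 1 on top of the i-th frame of speaker 2."""
    frames_1 = split_frames(train_1, frame_len)
    frames_2 = split_frames(train_2, frame_len)
    n_train = min(len(frames_1), len(frames_2))  # both speakers use the same frame count
    return [frames_1[i] + frames_2[i] for i in range(n_train)]  # list concatenation


# Toy signals standing in for real training speech (hypothetical data).
speaker_1 = [0.01 * n for n in range(300)]
speaker_2 = [-0.02 * n for n in range(300)]
x_train = combined_training_frames(speaker_1, speaker_2)
```

With L = 128, each combined frame has 2L = 256 samples, which is the atom dimension of the dictionary trained in step B.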

2) The single-channel mixed speech is split into frames and separated frame by frame as follows:

The single-channel mixed speech is split into frames of L samples per frame, again with L = 128 and no overlap between frames, and is separated frame by frame; every frame is separated in the same way. The separation of the j-th frame is used below as an example:

A. Let the j-th frame of the single-channel mixed speech be x^j = s₁^j + s₂^j, where s_i^j is the j-th frame of source signal s_i (i = 1, 2). In matrix form:

x^{j} = [\begin{matrix} I_{L \times L} & I_{L \times L} \end{matrix}] [\begin{matrix} s_{1}^{j} \\ s_{2}^{j} \end{matrix}]

where I_{L × L} is the L × L identity matrix. Because Q captures the common structure of all combined training speech frames, and training ensures that all combined training speech frames are sparse under Q, the stacked source frame can also be assumed to be sparse under Q. Let its sparse representation under Q be β, i.e.

[\begin{matrix} s_{1}^{j} \\ s_{2}^{j} \end{matrix}] = Q β, \quad {\| β \|}_{0} \ll 2L

where ‖·‖₀ denotes the l₀-norm, i.e., the number of nonzero elements of a vector. Defining P = [I_{L × L} I_{L × L}], x^j can be written as

x^{j} = P Q β
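The role of P can be checked numerically: multiplying the stacked source frame by P = [I I] simply adds the two halves. The sketch below uses a small frame length for readability (the patent uses L = 128); the helper names and the sample values are illustrative assumptions.

```python
def identity(n):
    """n x n identity matrix as a list of rows."""
    return [[1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]


def mixing_matrix(frame_len):
    """P = [I  I], an L x 2L matrix built by placing two identities side by side."""
    eye = identity(frame_len)
    return [row + row for row in eye]


def matvec(matrix, vector):
    """Plain matrix-vector product."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


L = 4  # small frame length for illustration only
s1 = [1.0, 2.0, 3.0, 4.0]   # hypothetical source frame 1
s2 = [0.5, -1.0, 0.0, 2.0]  # hypothetical source frame 2
P = mixing_matrix(L)
x = matvec(P, s1 + s2)      # s1 + s2 is list concatenation here, i.e. the stacked frame
```

The product equals the sample-wise sum of the two source frames, which is exactly the single-channel mixture of that frame.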

B. The expression of a compressed sensing observation is very similar to the expression x^j = PQβ of the single-channel mixed speech frame above. The compressed sensing approach of reconstructing a signal's sparse representation from an observation is therefore used to estimate the sparse representation, under the K-SVD dictionary, of the stacked source speech frame formed by s₁^j and s₂^j:

Let s = [s(1), s(2), ..., s(N)]^T be a discrete signal of length N, and let Ψ be a known basis or dictionary under which s is sparse, i.e.

s = Ψ α, \quad {\| α \|}_{0} \ll N

where α is the sparse representation of s under Ψ. Compressed sensing theory holds that when s is sparse under Ψ, α, and hence s, can be reconstructed almost losslessly from an observation y = [y(1), y(2), ..., y(M)]^T of suitable dimension, where y is obtained by multiplying s by an observation matrix Φ:

y = Φ s = Φ Ψ α

where Φ is the M × N observation matrix;

Comparing the expressions x^j = PQβ and y = Φs = ΦΨα, and regarding P as the observation matrix, the two have essentially the same form: Φ and P are observation matrices, Ψ and Q are known bases or dictionaries, and α and β are the sparse representations of the signals under those bases or dictionaries. Therefore, following the compressed sensing approach of reconstructing a signal's sparse representation under a dictionary from an observation by l₀-norm optimization, the estimate of the sparse representation of each source speech frame under the K-SVD dictionary is obtained by solving:

\hat{β} = \arg\min_{β} {\| β \|}_{0} \quad s.t. \quad x^{j} = P Q β

This gives the estimate of the sparse representation of the stacked source frame under Q,

\hat{β} = [\begin{matrix} {\hat{β}}_{1} \\ {\hat{β}}_{2} \end{matrix}]

where \hat{β} is the optimal solution of the above l₀-norm optimization problem;

C. Reconstruct each separated speech frame as the product of the K-SVD dictionary Q and the above estimate of the sparse representation:

[\begin{matrix} {\hat{s}}_{1}^{j} \\ {\hat{s}}_{2}^{j} \end{matrix}] = Q [\begin{matrix} {\hat{β}}_{1} \\ {\hat{β}}_{2} \end{matrix}]

where {\hat{s}}_{i}^{j} (i = 1, 2) are the separated speech frames;

3) Concatenate the separated speech frames in order to obtain the separated speech signals:

{\hat{s}}_{i} = [\begin{matrix} {\hat{s}}_{i}^{1} & {\hat{s}}_{i}^{2} & \cdots & {\hat{s}}_{i}^{N_{i}} \end{matrix}]

where N_i is the total number of frames of source speech signal s_i.
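Step C and step 3) amount to one matrix-vector product per frame, a split of the result into two halves, and a concatenation over frames. The sketch below illustrates this with a tiny hand-made dictionary; the dictionary entries, the sparse estimates and the function names are all hypothetical and chosen only so the arithmetic is easy to follow.

```python
def matvec(matrix, vector):
    """Plain matrix-vector product."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def separate_frame(Q, beta_hat, frame_len):
    """Return the two separated frames: the upper and lower halves of Q @ beta_hat."""
    stacked = matvec(Q, beta_hat)
    return stacked[:frame_len], stacked[frame_len:2 * frame_len]


def concatenate_frames(frames):
    """Join frames in order into one signal."""
    signal = []
    for frame in frames:
        signal.extend(frame)
    return signal


L = 2  # toy frame length; the patent uses L = 128
Q = [  # toy 2L x 5 dictionary (hypothetical values)
    [1.0, 0.0, 0.0, 0.0, 1.0],
    [0.0, 1.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0, 0.0],
]
beta_hats = [  # one sparse estimate per mixed frame (hypothetical values)
    [2.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 3.0, 1.0],
]

frames_1, frames_2 = [], []
for beta_hat in beta_hats:
    f1, f2 = separate_frame(Q, beta_hat, L)
    frames_1.append(f1)
    frames_2.append(f2)

s1_hat = concatenate_frames(frames_1)
s2_hat = concatenate_frames(frames_2)
```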

Advantages and beneficial effects of the invention:

The invention separates single-channel mixed speech on the basis of compressed sensing theory and K-SVD, enhancing the target speech and suppressing the interfering speech, and has the advantages of good practicality and stable performance. When the dictionaries are trained with the K-SVD algorithm, only three groups of training speech frames of different types (male-male, male-female and female-female combined training speech frames) need to be trained, giving three different overcomplete dictionaries that are used to separate the three corresponding types of mixed speech (male-male, male-female and female-female mixed speech). No separate dictionary needs to be trained for each source speaker. Because the training data are not subject to especially strict requirements, the method is practical. In addition, the method relies mainly on the sparsity of speech signals under the K-SVD dictionary and makes little use of the characteristics of the individual source speech signals, so the separation results differ little between groups of mixed speech. Simulation experiments also show that the method performs stably and that its separation performance depends only weakly on the source speech signals.

Description of drawings:

Fig. 1 is the system block diagram of the method of the invention;

Fig. 2 is the flowchart of the K-SVD algorithm;

Fig. 3 shows the single-channel mixed speech separation performance based on compressed sensing theory and K-SVD at different input signal-to-noise ratios, measured by the two speakers' average per-sentence improvement in signal-to-noise ratio, ISNR_av;

Fig. 4 shows the single-channel mixed speech separation performance based on compressed sensing theory and DCT at different input signal-to-noise ratios, measured by the two speakers' average per-sentence improvement in signal-to-noise ratio, ISNR_av;

Fig. 5 gives the meaning of the mean opinion scores;

Fig. 6 shows the single-channel mixed speech separation performance based on compressed sensing theory and K-SVD at an input signal-to-noise ratio of 0 dB, measured by the mean opinion score of each speaker's sentences.

Embodiment

Fig. 1 is the system block diagram of this scheme. As shown, the invention first trains an overcomplete dictionary with the K-SVD algorithm and then, based on the constructed K-SVD dictionary, separates the single-channel mixed speech with the l₀-norm-optimization-based signal reconstruction algorithm from compressed sensing.

The speech used in the experiments is sampled at 16 kHz. There are four speakers in total, two male and two female. For each speaker, 40 Chinese phrase utterances are used to build the training speech. For each speaker, 5 Chinese phrase utterances are randomly selected as test speech, and the test speech differs from the training speech. The single-channel mixed speech x is obtained by adding two source test utterances s₁ and s₂, i.e., x = s₁ + s₂, giving 100 male-female mixtures, 25 male-male mixtures and 25 female-female mixtures in total. All signals are framed with a rectangular window, with a frame length of 128 samples (8 ms) and no overlap between frames.
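The mixture counts and the frame duration quoted above follow directly from the experimental setup, as the short check below illustrates. It assumes, as the counts imply, that every test utterance of one speaker is mixed with every test utterance of the other speaker in a pair; the speaker labels are hypothetical.

```python
from itertools import combinations

SAMPLE_RATE_HZ = 16000
FRAME_LEN = 128
SENTENCES_PER_SPEAKER = 5

# Hypothetical speaker labels: two male, two female.
speakers = {"M1": "male", "M2": "male", "F1": "female", "F2": "female"}

counts = {"male-male": 0, "male-female": 0, "female-female": 0}
for a, b in combinations(speakers, 2):
    if speakers[a] == speakers[b]:
        kind = f"{speakers[a]}-{speakers[a]}"
    else:
        kind = "male-female"
    # Every test sentence of one speaker is paired with every test sentence of the other.
    counts[kind] += SENTENCES_PER_SPEAKER * SENTENCES_PER_SPEAKER

frame_ms = 1000 * FRAME_LEN / SAMPLE_RATE_HZ
```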

When the K-SVD dictionaries are constructed, three groups of training speech frames are needed in total: one group for male-female mixed speech (male-female combined training speech frames), one group for male-male mixed speech (male-male combined training speech frames), and one group for female-female mixed speech (female-female combined training speech frames). They are built as follows:

1. Concatenate a male speech frame and a female speech frame in order to build a male-female combined training speech frame. Taking one speech frame from each speaker yields four male-female combined training frames, each 256 samples (16 ms) long.

2. Concatenate two male speech frames in order to build a male-male combined training speech frame. Taking one speech frame from each male speaker yields one male-male combined training frame of 256 samples (16 ms).

3. Concatenate two female speech frames in order to build a female-female combined training speech frame. Taking one speech frame from each female speaker yields one female-female combined training frame of 256 samples (16 ms).

The three groups of training speech are trained separately with the K-SVD algorithm, giving three overcomplete dictionaries that are used for separating male-female, male-male and female-female mixed speech respectively. The atom dimension equals the training frame length and the number of atoms is set to 1024, so each dictionary has dimension 256 × 1024. The initial dictionary is an overcomplete DCT dictionary of dimension 256 × 1024, the reconstruction error threshold ε is set to 0.1, and the number of iterations is set to 30.
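The text does not say how the 256 × 1024 overcomplete DCT initial dictionary is generated. One common construction, shown here purely as an assumption, samples cosines on a frequency grid finer than the signal length, removes the mean of every non-constant atom, and normalises each atom to unit length. The function name and the storage layout (a list of atoms, i.e. dictionary columns) are choices made for this sketch.

```python
import math


def overcomplete_dct(atom_len, n_atoms):
    """Return n_atoms unit-norm cosine atoms, each of length atom_len.

    Assumed construction: atom k samples cos(n * k * pi / n_atoms), n = 0..atom_len-1.
    """
    atoms = []
    for k in range(n_atoms):
        atom = [math.cos(n * k * math.pi / n_atoms) for n in range(atom_len)]
        if k > 0:
            mean = sum(atom) / atom_len
            atom = [a - mean for a in atom]  # keep only the constant atom's DC component
        norm = math.sqrt(sum(a * a for a in atom))
        atoms.append([a / norm for a in atom])
    return atoms


# Dimensions from the embodiment: 256-sample atoms, 1024 atoms.
initial_dictionary = overcomplete_dct(256, 1024)
```

Unit-norm atoms matter for the sparse decomposition step of K-SVD, because greedy pursuit methods compare inner products between the residual and the atoms.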

When an overcomplete dictionary is trained with the K-SVD algorithm, the dictionary is updated iteratively so that every training speech frame satisfies, under the constructed overcomplete dictionary Q,

\forall i : \min_{γ_{i}} {\| γ_{i} \|}_{0} \quad s.t. \quad {\| x_{{train}_{i}} - Q γ_{i} \|}_{2} \leq ϵ

Each iteration has two steps. Taking the j-th iteration as an example:

1. Sparse decomposition: keep the dictionary Q_{j-1} obtained in iteration j-1 fixed and solve for the sparse representation γ_i of each training signal x_{{train}_{i}} under Q_{j-1}. This is done by solving the optimization problem above with Q = Q_{j-1}, usually with a matching pursuit algorithm.

2. Dictionary update: keep the sparse representations γ_i fixed and update Q_{j-1} column by column so that the reconstruction error {\| x_{{train}_{i}} - Q γ_{i} \|}_{2} of the training frames is minimized; this can be done with the singular value decomposition. The flowchart of the K-SVD algorithm is shown in Fig. 2.
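The sparse decomposition step can be illustrated with plain matching pursuit, which greedily picks the atom most correlated with the current residual until the residual norm falls below ε. This is a minimal sketch under the assumption of unit-norm atoms; it is not the inventors' code, and the function name, the `max_atoms` safeguard and the toy dictionary are introduced here for illustration only.

```python
import math


def matching_pursuit(atoms, x, eps=0.1, max_atoms=None):
    """Greedy sparse coding: x ~ sum_k gamma[k] * atoms[k].

    atoms: list of unit-norm atoms (dictionary columns), each the same length as x.
    eps:   stop once the l2 norm of the residual is at most eps.
    """
    if max_atoms is None:
        max_atoms = len(x)  # assumed safeguard against endless iteration
    residual = list(x)
    gamma = [0.0] * len(atoms)
    for _ in range(max_atoms):
        if math.sqrt(sum(r * r for r in residual)) <= eps:
            break
        # Correlation of every atom with the current residual.
        corr = [sum(a * r for a, r in zip(atom, residual)) for atom in atoms]
        k = max(range(len(atoms)), key=lambda i: abs(corr[i]))
        gamma[k] += corr[k]
        residual = [r - corr[k] * a for r, a in zip(residual, atoms[k])]
    return gamma


# Toy dictionary: the standard basis of R^3 plus one diagonal atom (hypothetical).
inv = 1.0 / math.sqrt(2.0)
atoms = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [inv, inv, 0.0]]
x = [2.0 * inv, 2.0 * inv, 0.0]  # exactly twice the diagonal atom
gamma = matching_pursuit(atoms, x, eps=0.1)
```

In this toy case a single atom explains the signal, so the code needs one nonzero coefficient where the standard basis alone would need two. The dictionary update step then refines each atom in turn with the coefficients held fixed.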

Based on the constructed overcomplete dictionary Q, and on the similarity between the expressions of single-channel mixed speech and of a compressed sensing observation, the single-channel mixed speech x is separated with the l₀-norm-optimization-based signal reconstruction method from compressed sensing. Specifically, the minimum l₀-norm problem

\hat{β} = \arg\min_{β} {\| β \|}_{0} \quad s.t. \quad x = P Q β

is solved first, which gives \hat{β}, the estimate of the sparse representation of the stacked source speech under Q, where \hat{β} is the optimal solution of the above minimum l₀-norm problem. Solving the minimum l₀-norm problem requires enumerating all candidate values of β that satisfy the constraint and then picking the one with the fewest nonzero elements, which is very complex and hard to realize. It is therefore usually converted into the corresponding minimum l₁-norm problem, which gives the same solution under suitable sparsity conditions:

\hat{β} = \arg\min_{β} {\| β \|}_{1} \quad s.t. \quad x = P Q β

where ‖·‖₁ denotes the l₁-norm, i.e., the sum of the absolute values of the elements of a vector. This problem can be regarded as the convex relaxation of the minimum l₀-norm problem subject to x = PQβ, and it can be solved conveniently with a linear programming algorithm:

\min_{z} c^{T} z \quad s.t. \quad A z = b, \quad z \geq 0

where A = (PQ, -PQ), b = x, c = (1; 1) is a vector of ones, and z = (u; v) with β = u - v.

Finally, the separated speech signals are reconstructed from Q and \hat{β}:

[\begin{matrix} {\hat{s}}_{1} \\ {\hat{s}}_{2} \end{matrix}] = Q \hat{β}
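The step from the minimum l₁-norm problem to a linear program uses the standard split of β into nonnegative positive and negative parts. A short derivation, with u and v as auxiliary symbols for those parts, is sketched below.

```latex
\begin{aligned}
&\text{Write } \beta = u - v,\quad u \ge 0,\ v \ge 0,\quad
  z = \begin{bmatrix} u \\ v \end{bmatrix}. \\
&PQ\beta = PQu - PQv = \begin{bmatrix} PQ & -PQ \end{bmatrix} z = Az,
  \quad\text{so } x = PQ\beta \iff Az = b \text{ with } b = x. \\
&c^{T} z = \sum_i (u_i + v_i) \;\ge\; \sum_i |u_i - v_i| = \|\beta\|_1,
  \quad\text{with equality when } u_i v_i = 0 \text{ for all } i.
\end{aligned}
```

Because any feasible pair with some u_i and v_i both positive can be improved by subtracting their common part, a minimizer of c^T z has u_i v_i = 0, and its objective value equals the l₁-norm of β = u - v. The linear program therefore returns a minimum l₁-norm solution of x = PQβ.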

For the proposed single-channel mixed speech separation method based on compressed sensing theory and K-SVD, separation experiments were carried out on mixed speech at different input signal-to-noise ratios (SNRs) in a Matlab environment. The separation performance is measured by the two speakers' average per-sentence improvement in signal-to-noise ratio (ISNR, i.e., the improvement in SNR from before to after separation), ISNR_av, as shown in Fig. 3. ISNR_av is defined as:

{ISNR}_{av} = \frac{1}{2} \sum_{i = 1}^{2} {ISNR}_{i}

where ISNR_i is the average per-sentence ISNR of speaker i:

{ISNR}_{i} = \frac{1}{K} \sum_{k = 1}^{K} ( 10 \lg \frac{{(r_{i}^{k})}^{T} r_{i}^{k}}{{(r_{i}^{k} - {\hat{r}}_{i}^{k})}^{T} (r_{i}^{k} - {\hat{r}}_{i}^{k})} - 10 \lg \frac{{(r_{i}^{k})}^{T} r_{i}^{k}}{{(r_{mix}^{k} - r_{i}^{k})}^{T} (r_{mix}^{k} - r_{i}^{k})} )

where r_{mix}^{k} is the mixed speech signal obtained by adding r_{1}^{k} and r_{2}^{k}, r_{i}^{k} (k = 1, 2, ..., K) is the k-th source sentence of speaker i, {\hat{r}}_{i}^{k} (k = 1, 2, ..., K) is the k-th separated sentence of speaker i, and K is the number of mixed sentences.
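The ISNR definition can be turned into a few lines of code: for each sentence, the SNR after separation (source against separation error) minus the SNR before separation (source against the interfering part of the mixture), averaged over sentences and then over the two speakers. The function names and the toy data below are assumptions for illustration; a perfectly separated sentence would give a zero error energy and is not handled here.

```python
import math


def energy(v):
    """Sum of squares of a sequence, i.e. v^T v."""
    return sum(a * a for a in v)


def isnr_sentence(source, separated, mixture):
    """SNR improvement in dB for one sentence."""
    out_error = [s - e for s, e in zip(source, separated)]  # r - r_hat
    in_error = [m - s for m, s in zip(mixture, source)]     # r_mix - r
    snr_out = 10.0 * math.log10(energy(source) / energy(out_error))
    snr_in = 10.0 * math.log10(energy(source) / energy(in_error))
    return snr_out - snr_in


def isnr_speaker(sentences):
    """Average over K sentences; each item is (source, separated, mixture)."""
    return sum(isnr_sentence(*item) for item in sentences) / len(sentences)


def isnr_average(isnr_1, isnr_2):
    """ISNR_av: mean of the two speakers' ISNR values."""
    return 0.5 * (isnr_1 + isnr_2)


# Toy example (hypothetical numbers): input SNR is 0 dB, output SNR is about 20 dB.
source = [1.0, 0.0]
mixture = [1.0, 1.0]
separated = [0.9, 0.0]
isnr_1 = isnr_speaker([(source, separated, mixture)])
```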

As Fig. 3 shows, when the separation algorithm based on compressed sensing theory and K-SVD is applied to the three classes of mixed speech (male-male, male-female and female-female), the average per-sentence SNR improves in every case, and the ISNR_av values are close to one another. This indicates that the algorithm achieves a certain separation effect on all classes of mixed speech, and that its separation performance is stable and depends only weakly on the source speech signals.

To demonstrate the effectiveness of constructing the overcomplete dictionary with the K-SVD algorithm, the K-SVD overcomplete dictionary is replaced by a DCT basis of dimension 256 × 256, and the single-channel mixed speech is separated with the same compressed sensing signal reconstruction approach. Fig. 4 shows the resulting separation performance based on compressed sensing theory and DCT, using the same test speech.

Comparing Fig. 3 and Fig. 4 shows that the separation performance based on compressed sensing theory and K-SVD is better than that based on compressed sensing and DCT. This indicates that separating speech with an overcomplete dictionary constructed by the K-SVD algorithm is more effective than separating speech directly with a DCT basis.

To assess the subjective listening quality of the separated speech, the P.862 standard is used to evaluate the separated speech obtained by the compressed sensing (CS) and K-SVD separation algorithm at an input SNR of 0 dB. Because the mean opinion score obtained with P.862 ranges from 0 to 4.5, it is converted to the range 1 to 5. The meaning of the mean opinion scores is given in Fig. 5.

Fig. 6 shows the mean opinion scores of each speaker's separated sentences, based on compressed sensing theory and K-SVD, at an input SNR of 0 dB.

Claims

1. A single-channel mixed speech separation method based on compressed sensing and K-SVD, characterized in that the method exploits the sparsity of speech signals under a K-SVD dictionary and, based on the similarity between the expression of a compressed sensing observation and that of single-channel mixed speech, uses a signal reconstruction method from compressed sensing theory to separate the single-channel mixed speech, thereby enhancing the target speech and suppressing the interfering speech, the steps being:

2. The single-channel mixed speech separation method based on compressed sensing and K-SVD according to claim 1, characterized in that:

A. Let x = s₁ + s₂ be the known single-channel mixed speech, where s_i (i = 1, 2) are the unknown source speech signals; with the speakers corresponding to s₁ and s₂ known, the training speech of these two speakers is split into frames of L samples per frame, with L = 128 and no overlap between frames, and the resulting training frames are denoted s_{1,{train}_{i}} and s_{2,{train}_{i}}; concatenating s_{1,{train}_{i}} and s_{2,{train}_{i}} in order gives the combined training speech frame

x_{{train}_{i}} = [\begin{matrix} s_{1,{train}_{i}} \\ s_{2,{train}_{i}} \end{matrix}], \quad i = 1, 2, \cdots, N_{train}

where s_{1,{train}_{i}} is the i-th training speech frame of the speaker corresponding to s₁, s_{2,{train}_{i}} is the i-th training speech frame of the speaker corresponding to s₂, and N_{train} is the number of training speech frames of each speaker, the two speakers having the same number of training speech frames;

B. An overcomplete dictionary Q is trained with the K-SVD algorithm such that every combined training speech frame is as sparse as possible under Q while its reconstruction error stays within a set range, expressed as:

\forall i : \min_{γ_{i}} {\| γ_{i} \|}_{0} \quad s.t. \quad {\| x_{{train}_{i}} - Q γ_{i} \|}_{2} \leq ϵ

where γ_i is the sparse representation of x_{{train}_{i}} under the dictionary Q, and ε is the preset reconstruction error threshold, set to 0.1;

The single-channel mixed speech is split into frames of L samples per frame, again with L = 128 and no overlap between frames, and is separated frame by frame, every frame being separated in the same way; for the j-th frame:

A. Let the j-th frame of the single-channel mixed speech be x^j = s₁^j + s₂^j, where s_i^j is the j-th frame of source signal s_i (i = 1, 2); x^j is expressed in matrix form as:

x^{j} = [\begin{matrix} I_{L \times L} & I_{L \times L} \end{matrix}] [\begin{matrix} s_{1}^{j} \\ s_{2}^{j} \end{matrix}]

the stacked source frame is also sparse under Q; let its sparse representation under Q be β, i.e.

[\begin{matrix} s_{1}^{j} \\ s_{2}^{j} \end{matrix}] = Q β, \quad {\| β \|}_{0} \ll 2L

so that, with P = [I_{L × L} I_{L × L}],

x^{j} = P Q β

The estimate of the sparse representation under the K-SVD dictionary is obtained as follows, where

s = Ψ α, \quad {\| α \|}_{0} \ll N

y = Φ s = Φ Ψ α

and Φ is the M × N observation matrix;

Comparing the expressions x^j = PQβ and y = Φs = ΦΨα, and regarding P as the observation matrix, the two have essentially the same form: Φ and P are observation matrices, Ψ and Q are known bases or dictionaries, and α and β are the sparse representations of the signals under those bases or dictionaries; therefore, following the compressed sensing approach of reconstructing a signal's sparse representation under a dictionary from an observation by l₀-norm optimization, the estimate of the sparse representation of each source speech frame under the K-SVD dictionary is obtained by solving:

\hat{β} = \arg\min_{β} {\| β \|}_{0} \quad s.t. \quad x^{j} = P Q β

This gives the estimate of the sparse representation of the stacked source frame under Q,

\hat{β} = [\begin{matrix} {\hat{β}}_{1} \\ {\hat{β}}_{2} \end{matrix}]

where \hat{β} is the optimal solution of the above l₀-norm optimization problem;

The separated speech frames are reconstructed as the product of Q and \hat{β}:

[\begin{matrix} {\hat{s}}_{1}^{j} \\ {\hat{s}}_{2}^{j} \end{matrix}] = Q [\begin{matrix} {\hat{β}}_{1} \\ {\hat{β}}_{2} \end{matrix}]

where {\hat{s}}_{i}^{j} (i = 1, 2) are the separated speech frames;

{\hat{s}}_{i} = [\begin{matrix} {\hat{s}}_{i}^{1} & {\hat{s}}_{i}^{2} & \cdots & {\hat{s}}_{i}^{N_{i}} \end{matrix}]

where N_i is the total number of frames of source speech signal s_i.