CN102081928A - Method for separating single-channel mixed voice based on compressed sensing and K-SVD - Google Patents


Info

Publication number
CN102081928A
Authority
CN
China
Prior art keywords
frame
voice
dictionary
svd
signal
Prior art date
Legal status
Granted
Application number
CN2010105566949A
Other languages
Chinese (zh)
Other versions
CN102081928B (en)
Inventor
郭海燕
杨震
Current Assignee
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing Post and Telecommunication University
Priority date
Filing date
Publication date
Application filed by Nanjing Post and Telecommunication University filed Critical Nanjing Post and Telecommunication University
Priority to CN2010105566949A priority Critical patent/CN102081928B/en
Publication of CN102081928A publication Critical patent/CN102081928A/en
Application granted granted Critical
Publication of CN102081928B publication Critical patent/CN102081928B/en
Active

Landscapes

  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention relates to a method for separating single-channel mixed speech based on compressed sensing and the K-SVD (K-singular value decomposition) dictionary-learning algorithm, comprising the following steps: using the K-SVD algorithm, a universally applicable overcomplete dictionary — the K-SVD dictionary — is constructed from joint training frames for each of the three mixture types (male-male, male-female and female-female), such that each training frame is sparse under its dictionary while the reconstruction error stays within a set range; on the basis of the constructed K-SVD dictionary, and starting from the similarity between the expression of a compressed-sensing observation and that of a single-channel mixture, the mixed speech is separated with an l0-norm-optimization signal-reconstruction algorithm from compressed-sensing theory; from the expression of each single-channel mixed frame, the estimate of the sparse representation of each source frame under the K-SVD dictionary is solved, and each separated frame is reconstructed from that estimate and the K-SVD dictionary; finally the separated frames are concatenated in order to obtain the separated speech signals.

Description

Method for separating single-channel mixed speech based on compressed sensing and K-SVD
Technical field
The present invention relates to a special class of speech enhancement — speech separation — and in particular to a method for separating single-channel mixed speech based on compressed sensing and K-SVD. It belongs to the technical field of speech-signal processing.
Background technology
Speech is the most direct, convenient and common way for humans to communicate. In real environments, however, captured speech is inevitably corrupted by ambient noise, which degrades the performance of speech-processing systems (for example, speech recognizers) on the one hand and human perception and intelligibility on the other. Speech enhancement is therefore essential. Speech separation is a special class of speech enhancement whose "noise" is itself speech, generally the hardest kind of interference to handle: with the source signals and the mixing process unknown, and using only the observed data collected from microphones (the mixed speech), the task is to recover or isolate the mutually independent source speech signals, enhancing the target speech while suppressing the interfering speech. Because the target and the interference share the same character, speech separation is the most difficult of all speech-enhancement problems. Single-channel separation — isolating several mutually independent sources from the mixture collected by a single microphone — is harder still, since the available prior information is minimal. But a single microphone is also the easiest and most common configuration to deploy, so a breakthrough in single-channel mixed-speech separation would have the greatest practical value.
Current single-channel mixed-speech separation falls into three main classes of methods: methods based on statistical models, computational auditory scene analysis, and methods based on projection decomposition. Statistical-model methods are built on trained signal models and usually proceed in three steps: first, model each source signal or its characteristic parameters and fit the model parameters by training; second, taking the mixture and the source models as known, select source-signal components that optimally compose the mixture according to a suitable criterion; third, either form each separated source directly from the selected components, or first build a corresponding filter and then predict each source signal. Computational auditory scene analysis separates speech by imitating the human auditory system; its core consists of two parts, segmentation and grouping. Segmentation decomposes the mixture into a series of sensory segments, each required to originate from a single source signal; grouping merges the segments originating from the same source into the stream of that source. Projection-decomposition methods generally first construct suitable basis functions or a dictionary by machine learning, then predict each source's projection vector under that basis or dictionary by probabilistic or optimization methods, and finally reconstruct the separated speech signals from the predicted projection vectors and the corresponding basis or dictionary.
In terms of methodology: statistical-model algorithms emphasize probabilistic techniques, separating the mixture on the basis of probabilistic modeling, and require prior training; computational auditory scene analysis (CASA) emphasizes biological imitation, separating by simulating the human auditory system, and needs no prior training; projection-decomposition algorithms emphasize machine learning, constructing a suitable basis or dictionary and separating on that basis, and require prior training. In terms of separation performance: in general, projection-decomposition algorithms perform best, statistical-model algorithms come second, and CASA performs worst. In terms of algorithmic complexity: CASA, which simulates the auditory system and must repeatedly adjust the segmentation and regrouping of the speech, has the highest complexity; projection-decomposition and statistical-model algorithms, both based on probabilistic models or optimization methods, are in general roughly comparable. In terms of development potential, each of the three has its own merits, drawbacks and room to grow. Although research on single-channel separation has achieved certain results, existing algorithms are generally complex, their performance varies considerably with the source signals, and they place special requirements on the training data; overall they are not yet practical and leave much to be improved before concrete application.
Summary of the invention
The invention provides a method for separating single-channel mixed speech based on compressed sensing and K-SVD. Its purpose is practicality: a separation method with stable performance and no special requirements on the training data, capable of enhancing the target speech and reducing the interfering speech. The method exploits the sparsity of speech signals under the K-SVD dictionary and the similarity between the expressions of a compressed-sensing observation and of a single-channel mixture, and applies a signal-reconstruction method from compressed-sensing theory to separate the single-channel mixture, thereby enhancing the target speech and suppressing the interfering speech.
For achieving the above object, the present invention has adopted following technical scheme:
A method for separating single-channel mixed speech based on compressed sensing and K-SVD, characterized in that the method exploits the sparsity of speech signals under the K-SVD dictionary and the similarity between the expressions of a compressed-sensing observation and of a single-channel mixture, and applies a signal-reconstruction method from compressed-sensing theory to separate the single-channel mixture, thereby enhancing the target speech and suppressing the interfering speech. The steps taken are:
1) Using the K-SVD algorithm, divide the three classes of joint training speech — male-male, male-female and female-female — into frames, and from the joint training frames construct for each class a universally applicable overcomplete dictionary, the K-SVD dictionary;
2) Divide the single-channel mixed speech into frames and separate it frame by frame. Based on the constructed K-SVD dictionary and the expression of each mixed frame, and starting from the similarity between the expressions of a compressed-sensing observation and of a single-channel mixture, apply the l0-norm-optimization signal-reconstruction algorithm from compressed-sensing theory to estimate the sparse representation of each source frame under the K-SVD dictionary, and reconstruct each separated frame as the product of that estimate and the K-SVD dictionary;
3) Concatenate the separated frames in order to obtain the separated speech signals.
In above-mentioned:
1) The concrete method of constructing the K-SVD dictionary with the K-SVD algorithm is:
A. Let x = s_1 + s_2 denote the known single-channel mixed speech, where s_i (i = 1, 2) are the unknown source speech signals, and assume the speakers corresponding to s_1 and s_2 are known. Divide each speaker's training speech into non-overlapping frames of L samples per frame (L = 128), and denote the source training frames by s_train1^i and s_train2^i. Concatenating them in order gives the joint training frames

x_train^i = [s_train1^i; s_train2^i],  i = 1, 2, …, N_train

where s_train1^i is the i-th training frame of the speaker of s_1, s_train2^i is the i-th training frame of the speaker of s_2, and N_train is the number of training frames per speaker, equal for the two speakers;
B. Train an overcomplete dictionary Q with the K-SVD algorithm so that every joint training frame is as sparse as possible under Q while its reconstruction error stays within the set range, expressed mathematically as

∀i:  min_{γ_i} ‖γ_i‖_0   s.t.   ‖x_train^i − Q γ_i‖_2 ≤ ε

where γ_i is the sparse representation of x_train^i under the dictionary Q, and ε is the reconstruction-error threshold, set to 0.1;
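The constrained sparse-coding problem above — fewest nonzeros subject to a reconstruction-error bound — is usually solved greedily. Below is a minimal numpy sketch of orthogonal matching pursuit with this error-threshold stopping rule; the function name, toy sizes and random dictionary are illustrative only, not taken from the patent.

```python
import numpy as np

def omp_error(Q, x, eps):
    """Greedy sparse coding: add the atom most correlated with the residual,
    refit by least squares, stop once ||x - Q gamma||_2 <= eps."""
    residual = x.copy()
    support = []
    gamma = np.zeros(Q.shape[1])
    while np.linalg.norm(residual) > eps and len(support) < Q.shape[0]:
        k = int(np.argmax(np.abs(Q.T @ residual)))
        if k in support:          # residual already orthogonal to chosen atoms
            break
        support.append(k)
        coef, *_ = np.linalg.lstsq(Q[:, support], x, rcond=None)
        gamma[:] = 0.0
        gamma[support] = coef
        residual = x - Q @ gamma
    return gamma

# toy frame built from two atoms of a random unit-norm dictionary
rng = np.random.default_rng(0)
Q = rng.standard_normal((16, 64))
Q /= np.linalg.norm(Q, axis=0)
x = 0.8 * Q[:, 5] - 1.3 * Q[:, 40]
gamma = omp_error(Q, x, eps=1e-6)
```

In the method itself, Q would be the 256 × 1024 K-SVD dictionary, x a joint training frame, and eps the threshold ε = 0.1.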
2) The concrete method of dividing the single-channel mixture into frames and separating it frame by frame is:
Divide the single-channel mixed speech into non-overlapping frames of L samples per frame (again L = 128) and separate it frame by frame; the method is identical for every frame. Taking the separation of the j-th single-channel mixed frame as an example, the separation of each mixed frame proceeds as follows:
A. Let x^j denote the j-th single-channel mixed frame, so that x^j = s_1^j + s_2^j, where s_i^j is the j-th frame of source signal s_i (i = 1, 2). In matrix form,

x^j = [I_{L×L}  I_{L×L}] [s_1^j; s_2^j]

where I_{L×L} is the L × L identity matrix. Since Q reflects what all joint training frames have in common, and the training guarantees that all joint training frames are sparse under Q, the stacked source frames [s_1^j; s_2^j] can also be assumed to be sparse under Q. Denote their sparse representation under Q by β, i.e.

[s_1^j; s_2^j] = Q β  and  ‖β‖_0 << 2L

where ‖·‖_0 denotes the l0-norm, the number of nonzero elements of a vector. Defining P = [I_{L×L}  I_{L×L}], x^j can then be expressed as

x^j = P Q β
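On toy dimensions, the relation x^j = P Q β and the role of P = [I I] can be checked directly (L = 4 and 12 atoms below are purely illustrative; the patent uses L = 128 and 1024 atoms):

```python
import numpy as np

L, K = 4, 12
rng = np.random.default_rng(0)
Q = rng.standard_normal((2 * L, K))     # dictionary over stacked frame pairs
beta = np.zeros(K)
beta[[1, 7]] = [0.5, -1.0]              # sparse representation, ||beta||_0 = 2 << 2L

stacked = Q @ beta                      # [s1_j; s2_j] = Q beta
P = np.hstack([np.eye(L), np.eye(L)])   # P = [I  I]
x_j = P @ stacked                       # observed mixed frame

# P merely sums the two stacked source frames:
print(np.allclose(x_j, stacked[:L] + stacked[L:]))   # True
```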
B. The expression of a compressed-sensing observation is closely similar to the expression x^j = P Q β of the single-channel mixed frame above, so the compressed-sensing methods that reconstruct a signal's sparse representation from an observation can be used to estimate the sparse representation of the source frames under the K-SVD dictionary:
Let s = [s(1), s(2), …, s(N)]^T be a discrete signal of length N, and Ψ a known basis or dictionary under which s is sparse, that is

s = Ψ α  and  ‖α‖_0 << N

where α is the sparse representation of s under Ψ. Compressed-sensing theory holds that when s is sparse under Ψ, an observation y = [y(1), y(2), …, y(M)]^T of a certain dimension suffices to reconstruct α approximately losslessly, and hence to reconstruct s, where y is obtained by multiplying s by an observation matrix Φ:

y = Φ s = Φ Ψ α

with Φ an M × N observation matrix.
Comparing the expression x^j = P Q β with y = Φ s = Φ Ψ α, and regarding P as an observation matrix, the two forms are essentially identical: Φ and P are observation matrices, Ψ and Q are known bases or dictionaries, and α and β are sparse representations of signals under a basis or dictionary. Therefore, following the compressed-sensing idea of reconstructing a signal's sparse representation under a dictionary from an observation, the l0-norm optimization

min_β ‖β‖_0   s.t.   x^j = P Q β

is solved to obtain the estimate β̂ = [β̂_1; β̂_2] of the sparse representation of [s_1^j; s_2^j] under Q, where β̂ is the optimal solution of the above l0-norm optimization problem;
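The analogy can be made concrete with a toy compressed-sensing observation; the Gaussian observation matrix and the sizes below are illustrative choices, not prescribed by the patent.

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 64, 24
Psi = np.linalg.qr(rng.standard_normal((N, N)))[0]   # orthonormal basis Psi
alpha = np.zeros(N)
alpha[[3, 17, 40]] = [1.0, -2.0, 0.5]                # ||alpha||_0 = 3 << N
s = Psi @ alpha                                      # signal sparse under Psi

Phi = rng.standard_normal((M, N))                    # M x N observation matrix, M < N
y = Phi @ s                                          # observation y = Phi s = Phi Psi alpha

# same template as the mixed frame: observation = matrix x dictionary x sparse vector
assert np.allclose(y, Phi @ Psi @ alpha)
```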
C. Reconstruct the separated speech frames as the product of the K-SVD dictionary Q and the estimated sparse representation:

[ŝ_1^j; ŝ_2^j] = Q [β̂_1; β̂_2]

where ŝ_1^j and ŝ_2^j are the separated speech frames;
3) Concatenate the separated frames in order to obtain the separated speech signals:

ŝ_i = [ŝ_i^1  ŝ_i^2  …  ŝ_i^{N_i}]

where N_i is the total number of frames of source speech signal s_i.
Advantages and beneficial effects of the invention:
The invention separates single-channel mixed speech on the basis of compressed-sensing theory and K-SVD, enhancing the target speech and suppressing the interfering speech, and has the advantages of practicality and stable performance. When training dictionaries with the K-SVD algorithm, only three groups of training frames of different types need be trained — male-male joint training frames, male-female joint training frames and female-female joint training frames — yielding three different overcomplete dictionaries used respectively to separate the three types of mixture (male-male, male-female and female-female); no dictionary need be trained separately for each pair of source speakers. Because no specially demanding requirements are placed on the training data, the single-channel separation method based on compressed sensing and K-SVD is practical. Moreover, the method mainly exploits the sparsity of speech under the K-SVD dictionary and makes little use of the particular characteristics of the individual source signals, so the separation quality differs little from one mixture to another. Simulation experiments likewise show that the method performs stably and that its separation quality depends little on the source signals.
Description of drawings:
Fig. 1 is the system block diagram of the method of the invention;
Fig. 2 is the flow chart of the K-SVD algorithm;
Fig. 3 shows the single-channel separation performance based on compressed sensing and K-SVD at different input SNRs — the two speakers' average sentence SNR improvement ISNR_av;
Fig. 4 shows the single-channel separation performance based on compressed sensing and the DCT at different input SNRs — the two speakers' average sentence SNR improvement ISNR_av;
Fig. 5 gives the meaning of the mean opinion score;
Fig. 6 shows, at an input SNR of 0 dB, the mean opinion score of each speaker's separated sentences for the method based on compressed sensing and K-SVD.
Embodiment
Fig. 1 is the system block diagram realizing this scheme. As shown in the figure, the invention first trains an overcomplete dictionary with the K-SVD algorithm, and then, based on the constructed K-SVD dictionary, separates the single-channel mixed speech with the l0-norm-optimization signal-reconstruction algorithm from compressed sensing.
The speech used in the experiments is sampled at 16 kHz. There are four speakers in total, two male and two female. For each speaker, 40 Chinese sentences are taken to construct the training speech, and 5 further Chinese sentences are chosen at random as test speech, disjoint from the training speech. Each single-channel mixture x is obtained by superposing two source test sentences s_1 and s_2, i.e. x = s_1 + s_2, giving 100 male-female mixtures, 25 male-male mixtures and 25 female-female mixtures in total. All framing uses a rectangular window with a frame length of 128 samples (8 ms) and no overlap between frames.
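The framing convention — rectangular window, 128 samples, no overlap — can be sketched as follows; dropping a trailing partial frame is an assumption here, as the patent does not say how leftover samples are handled.

```python
import numpy as np

def frame(signal, L=128):
    """Split a 1-D signal into non-overlapping rectangular-window frames."""
    n = len(signal) // L
    return signal[:n * L].reshape(n, L)

def deframe(frames):
    """Concatenate frames back in order (step 3 of the method)."""
    return frames.reshape(-1)

x = np.arange(300.0)
f = frame(x)
print(f.shape)          # (2, 128)
```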
When constructing the K-SVD dictionaries, three groups of training frames must be constructed: one group for the male-female mixtures (the male-female joint training frames), one for the male-male mixtures (the male-male joint training frames) and one for the female-female mixtures (the female-female joint training frames). The concrete construction is as follows:
1. Concatenate a male-voice frame and a female-voice frame in order to form a male-female joint training frame. Taking one frame from each speaker, four joint training frames of 256 samples (16 ms) can be constructed.
2. Concatenate two male-voice frames in order to form a male-male joint training frame. Taking one frame from each male speaker, one joint training frame of 256 samples (16 ms) can be constructed.
3. Concatenate two female-voice frames in order to form a female-female joint training frame. Taking one frame from each female speaker, one joint training frame of 256 samples (16 ms) can be constructed.
The K-SVD algorithm is applied to each of the three groups of training speech to construct three overcomplete dictionaries, used respectively to separate the male-female, male-male and female-female mixtures. The atom dimension equals the joint training frame length; the number of atoms is set to 1024, so each dictionary has dimension 256 × 1024. The initial dictionary is a 256 × 1024 overcomplete DCT dictionary, the reconstruction-error threshold ε is set to 0.1, and the number of iterations is set to 30.
When training the overcomplete dictionary with the K-SVD algorithm, the dictionary is updated iteratively so that every training frame satisfies, under the constructed overcomplete dictionary Q,

∀i:  min_{γ_i} ‖γ_i‖_0   s.t.   ‖x_train^i − Q γ_i‖_2 ≤ ε

Each iteration proceeds in two steps, described here for the j-th iteration:
1. Sparse decomposition: keep the dictionary Q_{j−1} obtained in the (j−1)-th iteration fixed, and solve for the sparse representation γ_i^{j−1} of each training signal x_train^i under Q_{j−1} by solving the optimization problem above; this is usually realized with a matching-pursuit algorithm.
2. Dictionary update: keep the γ_i^{j−1} fixed and update Q_{j−1} column by column so that Σ_i ‖x_train^i − Q γ_i^{j−1}‖_2^2 is minimized; this can be realized with a singular-value-decomposition algorithm. The K-SVD algorithm flow chart is shown in Fig. 2.
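The two-step iteration can be sketched in numpy as below. This is a minimal illustration only: it sparse-codes with a fixed number of nonzeros T instead of the error threshold ε, and omits the DCT initialization and the atom-replacement heuristics of the full K-SVD algorithm.

```python
import numpy as np

def ksvd_iteration(X, Q, T):
    """One K-SVD iteration over training frames X (one frame per column):
    (1) sparse-code each frame over the fixed dictionary Q with OMP,
    (2) update each atom (and its coefficients) by a rank-1 SVD of the
        residual restricted to the frames that use it."""
    G = np.zeros((Q.shape[1], X.shape[1]))
    for j in range(X.shape[1]):              # --- step 1: sparse decomposition
        r, sup = X[:, j].copy(), []
        for _ in range(T):
            if np.linalg.norm(r) < 1e-12:
                break
            k = int(np.argmax(np.abs(Q.T @ r)))
            if k in sup:
                break
            sup.append(k)
            c, *_ = np.linalg.lstsq(Q[:, sup], X[:, j], rcond=None)
            r = X[:, j] - Q[:, sup] @ c
        if sup:
            G[sup, j] = c
    for k in range(Q.shape[1]):              # --- step 2: dictionary update
        users = np.nonzero(G[k])[0]          # frames whose code uses atom k
        if users.size == 0:
            continue
        E = X[:, users] - Q @ G[:, users] + np.outer(Q[:, k], G[k, users])
        U, sv, Vt = np.linalg.svd(E, full_matrices=False)
        Q[:, k] = U[:, 0]                    # new unit-norm atom
        G[k, users] = sv[0] * Vt[0]          # matching coefficients
    return Q, G

# one iteration on random toy data, starting from a random unit-norm dictionary
rng = np.random.default_rng(0)
X = rng.standard_normal((8, 40))
Q0 = rng.standard_normal((8, 20))
Q0 /= np.linalg.norm(Q0, axis=0)
Q1, G1 = ksvd_iteration(X, Q0, T=3)
```

Iterating this step (30 times in the embodiment) drives the total fit error Σ_i ‖x_train^i − Q γ_i‖² down.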
Based on the constructed overcomplete dictionary Q, and according to the similarity between the expressions of the single-channel mixture and the compressed-sensing observation, the l0-norm-optimization signal-reconstruction method from compressed sensing is applied to separate the single-channel mixture x. Concretely, the minimum-l0-norm problem

min_β ‖β‖_0   s.t.   x = P Q β

is first solved to obtain the estimate β̂ = [β̂_1; β̂_2] of the sparse representation under Q, where β̂ is its optimal solution. Solving this minimum-l0-norm problem directly would require enumerating all candidate β satisfying the constraint and then selecting the one with the fewest nonzero elements, which is prohibitively complex and hard to realize; it is therefore usually converted into the equivalent minimum-l1-norm problem

min_β ‖β‖_1   s.t.   x = P Q β

where ‖·‖_1 denotes the l1-norm, the sum of the absolute values of the elements of a vector. This problem can be regarded as the convexification of the l0 problem, and is easily realized by a linear-programming algorithm:

min_z c^T z   s.t.   A z = b,  z ≥ 0

where A = (PQ, −PQ), b = x and c = (1; 1) is the all-ones vector, β being recovered as the difference of the two halves of z. Finally, the separated speech frames are reconstructed by

[ŝ_1^j; ŝ_2^j] = Q [β̂_1; β̂_2]
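The linear program above can be sketched with scipy's linprog: splitting β into nonnegative parts u, v with β = u − v gives z = [u; v] ≥ 0, A = (PQ, −PQ) and b = x. The function name and toy problem below are illustrative.

```python
import numpy as np
from scipy.optimize import linprog

def basis_pursuit(M, x):
    """min ||beta||_1  s.t.  M beta = x, via the LP
    min 1^T z  s.t.  [M, -M] z = x, z >= 0,  beta = z[:n] - z[n:]."""
    n = M.shape[1]
    res = linprog(c=np.ones(2 * n),
                  A_eq=np.hstack([M, -M]), b_eq=x,
                  bounds=(0, None))
    return res.x[:n] - res.x[n:]

# toy check: a 1-sparse vector, 4 equations in 10 unknowns
rng = np.random.default_rng(2)
M = rng.standard_normal((4, 10))
beta_true = np.zeros(10)
beta_true[6] = 2.0
beta = basis_pursuit(M, M @ beta_true)
```

In the method itself, M would be the 128 × 1024 product PQ and x a mixed frame.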
For the proposed single-channel separation method based on compressed-sensing theory and K-SVD, we ran separation experiments on mixtures under different input-SNR conditions, in a Matlab environment. The two speakers' average sentence SNR improvement (Improved Signal-to-Noise Ratio, ISNR — the change in SNR from before separation to after), ISNR_av, is used as the index of separation performance, shown in Fig. 3. It is defined as

ISNR_av = (1/2) Σ_{i=1}^{2} ISNR_i

where ISNR_i is the average sentence SNR improvement of the i-th speaker,

ISNR_i = (1/K) Σ_{k=1}^{K} [ 10 lg( (r_i^k)^T r_i^k / ((r_i^k − r̂_i^k)^T (r_i^k − r̂_i^k)) ) − 10 lg( (r_i^k)^T r_i^k / ((r_mix^k − r_i^k)^T (r_mix^k − r_i^k)) ) ]

where r_mix^k is the mixed speech signal obtained by superposing r_1^k and r_2^k, r_i^k (k = 1, 2, …, K) is the k-th source sentence of the i-th speaker, r̂_i^k (k = 1, 2, …, K) is the k-th separated sentence of the i-th speaker, and K is the number of mixed sentences.
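The per-sentence term of ISNR_i is simply the output SNR (clean source vs. separated sentence) minus the input SNR (clean source vs. mixture); a sketch with illustrative variable names:

```python
import numpy as np

def sentence_isnr(src, sep, mix):
    """SNR improvement for one sentence: 10 lg(||r||^2 / ||r - r_hat||^2)
    minus 10 lg(||r||^2 / ||r_mix - r||^2)."""
    p = src @ src
    snr_out = 10.0 * np.log10(p / ((src - sep) @ (src - sep)))
    snr_in = 10.0 * np.log10(p / ((mix - src) @ (mix - src)))
    return snr_out - snr_in

# sanity check: attenuating the interference by a factor of 10 in amplitude
# improves the SNR by exactly 20 dB, whatever the signals are
rng = np.random.default_rng(3)
src = np.sin(np.linspace(0, 20, 1000))
noise = rng.standard_normal(1000)
print(round(sentence_isnr(src, src + 0.1 * noise, src + noise), 6))  # 20.0
```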
As can be seen from Fig. 3, when the separation algorithm based on compressed sensing and K-SVD separates the three classes of mixture (male-male, male-female and female-female), the average sentence SNR improves in every case, and the improvements ISNR_av differ little from one class to another. This shows that the algorithm achieves a definite separation effect on every class of mixture, that its performance is stable, and that it depends little on the source signals.
To demonstrate the effectiveness of the overcomplete dictionary constructed with the K-SVD algorithm, we replaced it with a 256 × 256 DCT basis and separated the single-channel mixtures by the same compressed-sensing signal-reconstruction approach. Fig. 4 gives the resulting separation performance based on compressed sensing and the DCT, on the same test speech. Comparing Fig. 3 and Fig. 4, the separation performance based on compressed sensing and K-SVD is better than that based on compressed sensing and the DCT, showing that separating with an overcomplete dictionary trained by the K-SVD algorithm is more effective than separating directly with the DCT basis.
To measure the subjective sound quality of the separated speech, we used the ITU-T P.862 standard to assess the subjective quality of the speech separated by the compressed-sensing-and-K-SVD algorithm at an input SNR of 0 dB. Since the mean-opinion-score range of the P.862 test is 0 to 4.5, we converted the scores to the range 1 to 5; the meaning of the mean opinion score is given in Fig. 5. The mean opinion scores of each speaker's separated sentences at 0 dB input SNR are shown in Fig. 6.

Claims (2)

1. A method for separating single-channel mixed speech based on compressed sensing and K-SVD, characterized in that the method exploits the sparsity of speech signals under the K-SVD dictionary and the similarity between the expressions of a compressed-sensing observation and of a single-channel mixture, and applies a signal-reconstruction method from compressed-sensing theory to separate the single-channel mixture, thereby enhancing the target speech and suppressing the interfering speech, the steps taken being:
1) using the K-SVD algorithm, dividing the three classes of joint training speech — male-male, male-female and female-female — into frames, and from the joint training frames constructing for each class a universally applicable overcomplete dictionary, the K-SVD dictionary;
2) dividing the single-channel mixed speech into frames and separating it frame by frame: based on the constructed K-SVD dictionary and the expression of each mixed frame, and starting from the similarity between the expressions of a compressed-sensing observation and of a single-channel mixture, applying the l0-norm-optimization signal-reconstruction algorithm from compressed-sensing theory to estimate the sparse representation of each source frame under the K-SVD dictionary, and reconstructing each separated frame as the product of that estimate and the K-SVD dictionary;
3) concatenating the separated frames in order to obtain the separated speech signals.
2. The method for separating single-channel mixed voice based on compressed sensing and K-SVD according to claim 1, characterized in that:
1) the concrete method of constructing the K-SVD dictionary with the K-SVD algorithm is:
A. let x = s1 + s2 denote the known single-channel voice mixture, where si (i = 1, 2) are the unknown source voice signals, and suppose the speakers corresponding to s1 and s2 are known; divide the training voice of the two speakers into non-overlapping frames of L samples per frame, taking L = 128; denote the i-th training voice frame of the speaker of s1 by xtrain1^i and the i-th training voice frame of the speaker of s2 by xtrain2^i, and stack the two to obtain the joint training voice frame

xtrain^i = [xtrain1^i; xtrain2^i],  i = 1, 2, ..., Ntrain

where Ntrain is the number of training voice frames per speaker, identical for the two speakers;
B. use the K-SVD algorithm to train an overcomplete dictionary Q such that every joint training voice frame is reconstructed within a set error range while its representation under the dictionary is as sparse as possible, expressed mathematically as

∀i:  min_{γi} ||γi||_0  s.t.  ||xtrain^i − Q γi||_2 ≤ ε

where γi is the sparse representation of xtrain^i under the dictionary Q, and ε is the chosen reconstruction error threshold, taken as 0.1;
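Step B above (error-constrained sparse coding alternated with SVD-based atom updates) can be sketched in NumPy. This is a minimal illustration of the K-SVD procedure, not the patent's implementation; the function names `omp` and `ksvd` are our own, and greedy orthogonal matching pursuit is used here as one common way to enforce the ε-constrained l0 objective:

```python
import numpy as np

def omp(D, x, err_tol):
    """Greedy orthogonal matching pursuit: add atoms until the residual
    satisfies ||x - D g||_2 <= err_tol (the training constraint, eps = 0.1)."""
    support, coef, residual = [], np.zeros(0), x.copy()
    while np.linalg.norm(residual) > err_tol and len(support) < D.shape[1]:
        support.append(int(np.argmax(np.abs(D.T @ residual))))
        coef, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ coef
    g = np.zeros(D.shape[1])
    g[support] = coef
    return g

def ksvd(X, n_atoms, n_iter=10, err_tol=0.1, seed=0):
    """Train an overcomplete dictionary Q (2L x n_atoms) on the stacked
    joint training frames X (2L x N_train)."""
    rng = np.random.default_rng(seed)
    Q = rng.standard_normal((X.shape[0], n_atoms))
    Q /= np.linalg.norm(Q, axis=0)                      # unit-norm atoms
    for _ in range(n_iter):
        # sparse-coding stage: code every joint training frame under Q
        G = np.column_stack([omp(Q, x, err_tol) for x in X.T])
        for k in range(n_atoms):                        # atom-update stage
            users = np.flatnonzero(G[k])                # frames using atom k
            if users.size == 0:
                continue
            # residual without atom k's contribution, restricted to its users
            E = X[:, users] - Q @ G[:, users] + np.outer(Q[:, k], G[k, users])
            U, s, Vt = np.linalg.svd(E, full_matrices=False)
            Q[:, k], G[k, users] = U[:, 0], s[0] * Vt[0]
    return Q
```

Stacking each pair of training frames column-wise into X (height 2L = 256 for L = 128) and calling `ksvd(X, n_atoms)` would yield the overcomplete dictionary Q used in the separation step.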
2) the concrete method of dividing the single-channel voice mixture into frames and separating it frame by frame is:
divide the single-channel voice mixture into non-overlapping frames of L samples per frame, again taking L = 128, and separate it frame by frame; the separation method is identical for every frame, so consider the j-th frame:
A. denote the j-th single-channel mixture frame by xj = s1^j + s2^j, where si^j is the j-th frame of the source signal si (i = 1, 2); in matrix form this is

xj = [I_{L×L}  I_{L×L}] [s1^j; s2^j]

where I_{L×L} is the L × L identity matrix; since Q captures what all joint training voice frames have in common, and the training process guarantees that all joint training voice frames are sparse under Q, the stacked vector [s1^j; s2^j] can also be assumed to be sparse under Q; denoting its sparse representation under Q by β,

[s1^j; s2^j] = Q β  with  ||β||_0 << 2L

where ||·||_0 is the l0-norm, i.e. the number of non-zero elements of a vector; defining P = [I_{L×L}  I_{L×L}], xj can then be written as

xj = P Q β
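The matrix form above can be checked numerically; a tiny sketch (variable names are ours, with a reduced frame length for readability) builds P = [I I] and confirms that it maps the stacked source frames to their sum:

```python
import numpy as np

L = 4                                  # tiny frame length for illustration (the claim uses L = 128)
P = np.hstack([np.eye(L), np.eye(L)])  # P = [I_LxL  I_LxL], shape L x 2L

rng = np.random.default_rng(0)
s1, s2 = rng.standard_normal(L), rng.standard_normal(L)

x = P @ np.concatenate([s1, s2])       # x_j = P [s1_j; s2_j]
print(np.allclose(x, s1 + s2))         # True: mixing is the frame-wise sum
```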
B. the expression xj = P Q β of the single-channel mixture frame is closely similar to the expression of a compressed sensing observation, so the sparse representation of the source voice frames [s1^j; s2^j] under the K-SVD dictionary is estimated by the compressed sensing method of reconstructing a signal's sparse representation from an observation:
let s = [s(1), s(2), ..., s(N)]^T be a discrete signal of length N, and let Ψ be a known basis or dictionary under which s is sparse, that is

s = Ψ α  with  ||α||_0 << N

where α is the sparse representation of s under Ψ; compressed sensing theory holds that, when s is sparse under Ψ, α, and hence s, can be reconstructed almost losslessly from an observation y = [y(1), y(2), ..., y(M)]^T of a certain dimension, obtained by multiplying s by an observation matrix Φ:

y = Φ s = Φ Ψ α

where Φ is an M × N observation matrix;
comparing xj = P Q β with y = Φ s = Φ Ψ α, and regarding P as an observation matrix, the two forms are essentially identical: Φ and P are observation matrices, Ψ and Q are known bases or dictionaries, and α and β are the sparse representations of the signal under them; therefore, following the compressed sensing approach of reconstructing a signal's sparse representation under a dictionary from an observation, the estimate of the sparse representation of each source voice frame under the K-SVD dictionary is obtained by solving the l0-norm optimization problem

β̂ = argmin_β ||β||_0  s.t.  xj = P Q β

where β̂ = [β̂1; β̂2] is the optimal solution of this l0-norm optimization problem, i.e. the estimate of the sparse representation of [s1^j; s2^j] under Q;
C. the separated voice frames are reconstructed as the product of the K-SVD dictionary Q and the above estimated sparse representation β̂:

[ŝ1^j; ŝ2^j] = Q β̂ = Q [β̂1; β̂2]

where ŝi^j (i = 1, 2) are the separated voice frames;
3) the separated voice frames are concatenated in order to obtain the separated voice signals:

ŝi = [ŝi^1  ŝi^2  ...  ŝi^{Ni}]

where Ni is the total number of frames of the source voice signal si.
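Steps 2) and 3) together can be sketched end to end: code each mixture frame against PQ, split Qβ̂ into the two separated frames, and concatenate. All names are ours, and OMP again stands in for the l0 solver:

```python
import numpy as np

def omp(D, x, tol=1e-8):
    # greedy sparse coding, used as the l0-reconstruction step (one common choice)
    support, coef, residual = [], np.zeros(0), x.copy()
    while np.linalg.norm(residual) > tol and len(support) < D.shape[0]:
        support.append(int(np.argmax(np.abs(D.T @ residual))))
        coef, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ coef
    b = np.zeros(D.shape[1])
    b[support] = coef
    return b

def separate(x, Q, L):
    """Split mixture x into L-sample frames, separate each frame under
    the trained dictionary Q, and concatenate the separated frames."""
    P = np.hstack([np.eye(L), np.eye(L)])
    out1, out2 = [], []
    for j in range(len(x) // L):           # frame-by-frame separation
        x_j = x[j * L:(j + 1) * L]
        beta = omp(P @ Q, x_j)             # sparse code of [s1_j; s2_j] under Q
        s_hat = Q @ beta                   # stacked separated frames [s1_j; s2_j]
        out1.append(s_hat[:L])
        out2.append(s_hat[L:])
    return np.concatenate(out1), np.concatenate(out2)
```

Note that the separated frames always sum back to the mixture, since the reconstruction enforces x_j = P Q β̂; how well each half matches its true source depends on the sparsity of the sources under the trained dictionary.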
CN2010105566949A 2010-11-24 2010-11-24 Method for separating single-channel mixed voice based on compressed sensing and K-SVD Active CN102081928B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2010105566949A CN102081928B (en) 2010-11-24 2010-11-24 Method for separating single-channel mixed voice based on compressed sensing and K-SVD


Publications (2)

Publication Number Publication Date
CN102081928A true CN102081928A (en) 2011-06-01
CN102081928B CN102081928B (en) 2013-03-06

Family

ID=44087851

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2010105566949A Active CN102081928B (en) 2010-11-24 2010-11-24 Method for separating single-channel mixed voice based on compressed sensing and K-SVD

Country Status (1)

Country Link
CN (1) CN102081928B (en)


Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1897113A (en) * 2005-06-03 2007-01-17 索尼株式会社 Audio signal separation device and method thereof
EP1811498A1 (en) * 2006-01-18 2007-07-25 Sony Corporation Speech signal separation apparatus and method
CN101030383A (en) * 2006-03-02 2007-09-05 株式会社日立制作所 Sound source separating device, method, and program


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Journal of Signal Processing (《信号处理》), 20090831, Guo Haiyan; Yang Zhen, "Voiced/unvoiced classification method for single-channel mixed speech based on phase space reconstruction and fundamental frequency analysis", *
Journal of Signal Processing (《信号处理》), 20100331, Ye Lei; Guo Haiyan; Yang Zhen, "Research on anti-noise methods for speaker recognition systems based on compressed-sensing signal reconstruction", *
Journal of Signal Processing (《信号处理》), 20100630, Sun Linhui; Yang Zhen, "Distributed speech compression and reconstruction based on compressed sensing", *
Journal of Nanjing University of Posts and Telecommunications (Natural Science Edition), 20100831, Ye Lei; Yang Zhen, "Speech compression and reconstruction based on compressed sensing", *

Cited By (27)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102332268B (en) * 2011-09-22 2013-03-13 南京工业大学 Speech signal sparse representation method based on self-adaptive redundant dictionary
CN102332268A (en) * 2011-09-22 2012-01-25 Wang Tianjing Speech signal sparse representation method based on self-adaptive redundant dictionary
CN103325381A (en) * 2013-05-29 2013-09-25 吉林大学 Speech separation method based on fuzzy membership function
CN103325381B (en) * 2013-05-29 2015-09-02 吉林大学 A kind of speech separating method based on fuzzy membership functions
CN104217730B (en) * 2014-08-18 2017-07-21 大连理工大学 A kind of artificial speech bandwidth expanding method and device based on K SVD
CN104217730A (en) * 2014-08-18 2014-12-17 大连理工大学 Artificial speech bandwidth expansion method and device based on K-SVD
WO2016155047A1 (en) * 2015-03-30 2016-10-06 福州大学 Method of recognizing sound event in auditory scene having low signal-to-noise ratio
CN104795064B (en) * 2015-03-30 2018-04-13 福州大学 The recognition methods of sound event under low signal-to-noise ratio sound field scape
CN104795064A (en) * 2015-03-30 2015-07-22 福州大学 Recognition method for sound event under scene of low signal to noise ratio
CN106500735A (en) * 2016-11-03 2017-03-15 重庆邮电大学 A kind of FBG signal adaptive restorative procedures based on compressed sensing
CN106500735B (en) * 2016-11-03 2019-03-22 重庆邮电大学 A kind of compressed sensing based FBG signal adaptive restorative procedure
CN107705795A (en) * 2017-09-27 2018-02-16 天津大学 Multichannel audio processing method based on KSVD algorithms
CN108802687A (en) * 2018-06-25 2018-11-13 大连大学 The more sound localization methods of distributed microphone array in reverberation room
CN109040116A (en) * 2018-09-06 2018-12-18 深圳市益鑫智能科技有限公司 A kind of video conferencing system based on cloud server
CN110189761B (en) * 2019-05-21 2021-03-30 哈尔滨工程大学 Single-channel speech dereverberation method based on greedy depth dictionary learning
CN110189761A (en) * 2019-05-21 2019-08-30 哈尔滨工程大学 A kind of single channel speech dereverberation method based on greedy depth dictionary learning
CN110764047B (en) * 2019-10-25 2022-08-02 哈尔滨工程大学 Target angle estimation method for optimizing regular parameters under sparse representation model
CN110764047A (en) * 2019-10-25 2020-02-07 哈尔滨工程大学 Target angle estimation method for optimizing regular parameters under sparse representation model
CN111383652A (en) * 2019-10-25 2020-07-07 南京邮电大学 Single-channel speech enhancement method based on double-layer dictionary learning
CN111383652B (en) * 2019-10-25 2023-09-12 南京邮电大学 Single-channel voice enhancement method based on double-layer dictionary learning
CN112927710A (en) * 2021-01-21 2021-06-08 安徽南瑞继远电网技术有限公司 Power transformer working condition noise separation method based on unsupervised mode
CN112927710B (en) * 2021-01-21 2021-10-26 安徽南瑞继远电网技术有限公司 Power transformer working condition noise separation method based on unsupervised mode
CN113129872B (en) * 2021-04-06 2023-03-14 新疆大学 Voice enhancement method based on deep compressed sensing
CN113129872A (en) * 2021-04-06 2021-07-16 新疆大学 Voice enhancement method based on deep compressed sensing
CN113223032A (en) * 2021-04-27 2021-08-06 武汉纺织大学 Double-sparse decomposition-based complex image Canny edge detection method
CN117221016A (en) * 2023-11-09 2023-12-12 北京亚康万玮信息技术股份有限公司 Data security transmission method in remote connection process
CN117221016B (en) * 2023-11-09 2024-01-12 北京亚康万玮信息技术股份有限公司 Data security transmission method in remote connection process

Also Published As

Publication number Publication date
CN102081928B (en) 2013-03-06

Similar Documents

Publication Publication Date Title
CN102081928B (en) Method for separating single-channel mixed voice based on compressed sensing and K-SVD
Wang et al. TSTNN: Two-stage transformer based neural network for speech enhancement in the time domain
CN110867181B (en) Multi-target speech enhancement method based on SCNN and TCNN joint estimation
CN105957537B (en) One kind being based on L1/2The speech de-noising method and system of sparse constraint convolution Non-negative Matrix Factorization
KR101197407B1 (en) Apparatus and method for separating audio signals
CN106205623B (en) A kind of sound converting method and device
CN102222508A (en) Matrix-transformation-based method for underdetermined blind source separation
CN106847301A (en) A kind of ears speech separating method based on compressed sensing and attitude information
WO2015182379A1 (en) Method for estimating source signals from mixture of source signals
Do et al. Speech source separation using variational autoencoder and bandpass filter
JP6099032B2 (en) Signal processing apparatus, signal processing method, and computer program
Dua et al. Discriminative training using heterogeneous feature vector for Hindi automatic speech recognition system
Li et al. U-shaped transformer with frequency-band aware attention for speech enhancement
CN106356058A (en) Robust speech recognition method based on multi-band characteristic compensation
Grais et al. Single channel speech music separation using nonnegative matrix factorization with sliding windows and spectral masks
Xue A novel english speech recognition approach based on hidden Markov model
Ullah et al. Single channel speech dereverberation and separation using RPCA and SNMF
KR101802444B1 (en) Robust speech recognition apparatus and method for Bayesian feature enhancement using independent vector analysis and reverberation parameter reestimation
CN104036775A (en) Voice recognition system fusing video with audition
CN108875824B (en) Single-channel blind source separation method
JP6910609B2 (en) Signal analyzers, methods, and programs
JP4946330B2 (en) Signal separation apparatus and method
CN103559886B (en) Speech signal enhancing method based on group sparse low-rank expression
Missaoui et al. Blind speech separation based on undecimated wavelet packet-perceptual filterbanks and independent component analysis
KR20130125227A (en) Blind source separation method using harmonic frequency dependency and de-mixing system therefor

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20110601

Assignee: Jiangsu Nanyou IOT Technology Park Ltd.

Assignor: Nanjing Post & Telecommunication Univ.

Contract record no.: 2016320000215

Denomination of invention: Method for separating single-channel mixed voice based on compressed sensing and K-SVD

Granted publication date: 20130306

License type: Common License

Record date: 20161118

LICC Enforcement, change and cancellation of record of contracts on the licence for exploitation of a patent or utility model
EC01 Cancellation of recordation of patent licensing contract

Assignee: Jiangsu Nanyou IOT Technology Park Ltd.

Assignor: Nanjing Post & Telecommunication Univ.

Contract record no.: 2016320000215

Date of cancellation: 20180116

EC01 Cancellation of recordation of patent licensing contract