CN103680517A - Method, device and equipment for processing audio signals - Google Patents

Method, device and equipment for processing audio signals Download PDF

Info

Publication number
CN103680517A
CN103680517A (application CN201310587304.8A)
Authority
CN
China
Prior art keywords
frequency
frequency-domain signal
signal
frame
repeats
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN201310587304.8A
Other languages
Chinese (zh)
Inventor
徐德著
顾凤香
赵翔宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to CN201310587304.8A priority Critical patent/CN103680517A/en
Publication of CN103680517A publication Critical patent/CN103680517A/en
Withdrawn legal-status Critical Current

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H: ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H1/00: Details of electrophonic musical instruments
    • G10H2210/00: Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
    • G10H2210/031: Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal
    • G10H2210/041: Musical analysis based on MFCC [mel-frequency spectral coefficients]
    • G10H2210/056: Musical analysis for extraction or identification of individual instrumental parts, e.g. melody, chords, bass; identification or separation of instrumental parts by their characteristic voices or timbres
    • G10H2250/00: Aspects of algorithms or signal processing methods without intrinsic musical character, yet specifically adapted for or used in electrophonic musical processing
    • G10H2250/131: Mathematical functions for musical analysis, processing, synthesis or composition
    • G10H2250/215: Transforms, i.e. mathematical transforms into domains appropriate for musical signal processing, coding or compression
    • G10H2250/235: Fourier transform; Discrete Fourier Transform [DFT]; Fast Fourier Transform [FFT]

Landscapes

  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Electrophonic Musical Instruments (AREA)

Abstract

The invention discloses a method, an apparatus, and a device for processing audio signals, belonging to the field of audio processing. The method includes: converting a single-channel signal of a song from the time domain to the frequency domain to obtain a first frequency-domain signal comprising a harmonic music component, a percussive music component, and a human voice component; separating a second frequency-domain signal, comprising the percussive music component and the human voice component, from the first frequency-domain signal by the HPSS (harmonic/percussive sound separation) algorithm; and extracting the human voice component from the second frequency-domain signal by the NNMF (nearest neighbours and median filtering) algorithm. The apparatus comprises a converting unit, a separating unit, and an extracting unit; the device is used to implement the method. The method, apparatus, and device improve the quality of human voices extracted from songs.

Description

Method, apparatus, and device for processing audio signals
Technical field
The present invention relates to the field of audio processing, and in particular to a method, an apparatus, and a device for processing audio signals.
Background
Each channel signal of a mono or two-channel song generally contains two kinds of audio signals: vocals and accompaniment. If a user wants to extract the vocals or the accompaniment from a song, a vocal or accompaniment separation technique can be used to extract them.
Taking vocal separation as an example, an existing separation method comprises the following steps: first, converting the left and right channel signals of the song from the time domain to the frequency domain; second, calculating the normalized cross-correlation value of each corresponding frequency-bin pair of the left and right channel signals; third, applying a vocal gain to the mean signal of each corresponding frequency-bin pair of the left and right channel signals, the vocal gain being proportional to the normalized cross-correlation value of the current bin pair; fourth, converting the weighted mean signal of the left and right channel signals from the frequency domain back to the time domain to extract the vocals.
This existing method assigns different gains to the music components according to the correlation of the left and right channel signals of the song. The gains of different frequency bins differ and are mutually independent, with no enforced correlation between them; applying such gains alters the timbre and distorts the vocals. The extracted vocals are therefore of poor quality and cannot meet the requirements of high-quality vocal extraction.
Summary of the invention
To solve the problems of the prior art, embodiments of the present invention provide a method, an apparatus, and a device for processing audio signals. The technical solutions are as follows:
In a first aspect, an embodiment of the present invention provides a method for processing an audio signal, the method comprising:
converting a single-channel signal of a song from the time domain to the frequency domain to obtain a first frequency-domain signal, the first frequency-domain signal comprising a harmonic music component, a percussive music component, and a vocal component;
separating a second frequency-domain signal from the first frequency-domain signal using the harmonic/percussive sound separation (HPSS) algorithm, the second frequency-domain signal comprising the percussive music component and the vocal component;
extracting the vocal component from the second frequency-domain signal using the nearest-neighbours and median filtering (NNMF) algorithm.
In a first implementation of the first aspect, separating the second frequency-domain signal from the first frequency-domain signal using the HPSS algorithm comprises:
taking the magnitude of each frequency bin in the first frequency-domain signal to obtain a first matrix;
median-filtering each row of the first matrix to obtain a second matrix, and median-filtering each column of the first matrix to obtain a third matrix;
separating the second frequency-domain signal from the first frequency-domain signal according to the second matrix and the third matrix by the following formula:
((P.*P)./((H.*H)+(P.*P))).*X
where H denotes the second matrix, P denotes the third matrix, and X denotes the first matrix; ./ denotes element-wise division and .* denotes element-wise multiplication.
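As an illustrative sketch only (not part of the disclosure), the median-filtering separation above can be written in Python with NumPy and SciPy; the function name `hpss_mask`, the kernel lengths, and the small epsilon guard are assumptions not specified in the text:

```python
import numpy as np
from scipy.signal import medfilt2d

def hpss_mask(X_mag, h_kernel=17, p_kernel=17):
    """Sketch of the median-filtering HPSS step.

    X_mag is the first matrix: magnitudes of the first frequency-domain
    signal, frequency bins along rows and frames along columns.
    Kernel lengths are illustrative; the text does not specify them.
    """
    # Median-filter each row (across time) to obtain the second matrix H:
    # harmonic components vary slowly along time.
    H = medfilt2d(X_mag, kernel_size=(1, h_kernel))
    # Median-filter each column (across frequency) to obtain the third
    # matrix P: percussive components are broadband within a frame.
    P = medfilt2d(X_mag, kernel_size=(p_kernel, 1))
    # ((P.*P)./((H.*H)+(P.*P))).*X : element-wise soft mask keeping the
    # percussive-plus-vocal part; eps avoids division by zero.
    eps = 1e-12
    return (P * P) / (H * H + P * P + eps) * X_mag
```

Applying the same mask to the complex first frequency-domain signal instead of its magnitude would retain the phase needed for later resynthesis.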
With reference to the first aspect or the first implementation of the first aspect, in a second implementation, converting the single-channel signal of the song from the time domain to the frequency domain to obtain the first frequency-domain signal comprises:
converting the single-channel signal of the song from the time domain to the frequency domain using the fast Fourier transform (FFT) to obtain the first frequency-domain signal, where the sampling rate of the FFT is 44.1 kHz, the frame length is no less than 8192 points, and the frame shift is half the frame length.
With reference to the first aspect or the first or second implementation of the first aspect, in a third implementation, before extracting the vocal component from the second frequency-domain signal using the NNMF algorithm, the method further comprises:
converting the second frequency-domain signal from the frequency domain to the time domain using the inverse fast Fourier transform (IFFT), and then converting it back from the time domain to the frequency domain using the FFT to obtain a re-transformed second frequency-domain signal, where the sampling rate of this FFT is 44.1 kHz, the frame length is no more than 4096 points, and the frame shift is one quarter of that frame length;
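A minimal sketch of this re-transformation, assuming scipy's `stft`/`istft` as the FFT framework (the window choice and the function name `retransform` are assumptions; the text specifies only the frame lengths and shifts):

```python
import numpy as np
from scipy.signal import stft, istft

def retransform(second_spec, fs=44100, big=8192, small=4096):
    """Inverse-transform the second frequency-domain signal (analysed
    with the long frame and half-frame shift), then re-analyse it with
    a shorter frame (no more than 4096 points) and a quarter-frame
    shift, yielding the re-transformed second frequency-domain signal."""
    # Back to the time domain with the original long-frame settings.
    _, x = istft(second_spec, fs=fs, nperseg=big, noverlap=big // 2)
    # Forward again with the shorter frame and quarter-frame shift.
    _, _, PP = stft(x, fs=fs, nperseg=small, noverlap=small - small // 4)
    return PP
```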
extracting the vocal component from the second frequency-domain signal using the NNMF algorithm then comprises:
extracting the vocal component from the re-transformed second frequency-domain signal using the NNMF algorithm.
With reference to the third implementation of the first aspect, in a fourth implementation, extracting the vocal component from the re-transformed second frequency-domain signal using the NNMF algorithm comprises:
taking the magnitude of each frequency bin in the re-transformed second frequency-domain signal;
traversing each frame of the re-transformed second frequency-domain signal and calculating the similarity between each frame and every other frame of the re-transformed second frequency-domain signal;
obtaining a frequency-domain spectral estimate of the re-transformed second frequency-domain signal according to the similarities;
calculating, by the following formula, the difference in exponential normalized cross-correlation between corresponding frequency-bin pairs of the re-transformed second frequency-domain signal and its frequency-domain spectral estimate,
Q(i, j) = \left( \exp\!\left( -\frac{\left( \log PP(i, j) - \log Y(i, j) \right)^2}{2\lambda^2} \right) \right)^2
and calculating, by the following formula, the weight of the percussive music component according to the difference:
W(i, j) = \begin{cases} 0, & Q(i, j) < 0.85 \\ 1, & Q(i, j) \geq 0.85 \end{cases}
where PP(i, j) denotes the i-th frequency bin of the j-th frame of the re-transformed second frequency-domain signal; Y(i, j) denotes the i-th frequency bin of the j-th frame of the frequency-domain spectral estimate of the re-transformed second frequency-domain signal; Q(i, j) denotes the difference in exponential normalized cross-correlation between the two; W(i, j) denotes the weight of the percussive music component at the i-th frequency bin of the j-th frame of the re-transformed second frequency-domain signal; and λ (rendered "namda" in the original) is a weight factor, λ = 3;
extracting, by the following formula, the vocal component from the re-transformed second frequency-domain signal according to the weight of the percussive music component:
P1=(1-W).*PP
where P1 denotes the vocal component, W denotes the weight of the percussive music component, and PP denotes the re-transformed second frequency-domain signal.
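For illustration, the two formulas above can be sketched as follows; the function name `extract_vocals` is an assumption, and PP and Y are assumed to be strictly positive magnitude matrices so that the logarithms are defined:

```python
import numpy as np

def extract_vocals(PP, Y, lam=3.0, thresh=0.85):
    """Sketch of the Q/W weighting and vocal extraction above.

    PP: magnitudes of the re-transformed second frequency-domain signal
    (bins x frames); Y: its frequency-domain spectral estimate; lam is
    the weight factor (namda = 3 in the text)."""
    # Q(i, j): squared Gaussian of the log-magnitude difference. Where
    # a bin closely matches its similar-frame estimate, Q is near 1.
    Q = np.exp(-(np.log(PP) - np.log(Y)) ** 2 / (2 * lam * lam)) ** 2
    # W(i, j): binary percussive weight, 1 where Q >= 0.85, else 0.
    W = (Q >= thresh).astype(float)
    # P1 = (1 - W) .* PP keeps the vocal component
    # (P2 = W .* PP would give the percussive component).
    return (1.0 - W) * PP
```

Bins that repeat across similar frames (Q high) are attributed to the strongly periodic accompaniment, while varying bins are kept as vocals, matching the reasoning stated in the summary.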
With reference to the fourth implementation of the first aspect, in a fifth implementation, obtaining the frequency-domain spectral estimate of the re-transformed second frequency-domain signal according to the similarities comprises:
obtaining, according to the similarities, a predetermined number of similar frames for each frame;
calculating a spectral estimate for each frame from its predetermined number of similar frames;
assembling the calculated per-frame spectral estimates into the frequency-domain spectral estimate of the re-transformed second frequency-domain signal.
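A sketch of the three steps above; cosine similarity between magnitude spectra and a per-bin median over the k most similar frames are assumptions, since the text says only "similarity" and a "predetermined number" of similar frames:

```python
import numpy as np

def spectral_estimate(PP, k=5):
    """Build the frequency-domain spectral estimate Y frame by frame.

    PP: magnitudes (bins x frames). For each frame, its k most similar
    other frames are found and their per-bin median is taken."""
    n_frames = PP.shape[1]
    unit = PP / (np.linalg.norm(PP, axis=0) + 1e-12)  # normalise frames
    sim = unit.T @ unit               # cosine similarity, frame x frame
    np.fill_diagonal(sim, -np.inf)    # exclude each frame itself
    Y = np.empty_like(PP)
    for j in range(n_frames):
        nearest = np.argsort(sim[j])[-k:]  # k most similar frames
        Y[:, j] = np.median(PP[:, nearest], axis=1)
    return Y
```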
With reference to the fourth implementation of the first aspect, in a sixth implementation, after extracting the vocal component from the re-transformed second frequency-domain signal according to the weight of the percussive music component, the method further comprises:
separating the percussive music component from the re-transformed second frequency-domain signal by the following formula,
P2=W.*PP
where P2 denotes the percussive music component.
With reference to the sixth implementation of the first aspect, in a seventh implementation, after separating the percussive music component from the re-transformed second frequency-domain signal, the method further comprises:
separating the harmonic music component from the first frequency-domain signal;
converting the separated harmonic music component from the frequency domain to the time domain, converting the percussive music component from the frequency domain to the time domain, and synthesizing the converted harmonic music component and the converted percussive music component to obtain the accompaniment component.
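A sketch of this synthesis step under the FFT settings stated earlier (44.1 kHz, 8192-point frames, half-frame shift); scipy's `istft` stands in for the inverse FFT framework, and the function name is an assumption:

```python
import numpy as np
from scipy.signal import stft, istft

def synthesize_accompaniment(harmonic_spec, percussive_spec,
                             fs=44100, nperseg=8192):
    """Convert the separated harmonic and percussive components back to
    the time domain and sum them to obtain the accompaniment."""
    _, h_time = istft(harmonic_spec, fs=fs, nperseg=nperseg,
                      noverlap=nperseg // 2)
    _, p_time = istft(percussive_spec, fs=fs, nperseg=nperseg,
                      noverlap=nperseg // 2)
    # Trim to a common length before summing the two components.
    n = min(len(h_time), len(p_time))
    return h_time[:n] + p_time[:n]
```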
In a second aspect, an embodiment of the present invention provides an apparatus for processing an audio signal, the apparatus comprising:
a converting unit, configured to convert a single-channel signal of a song from the time domain to the frequency domain to obtain a first frequency-domain signal, the first frequency-domain signal comprising a harmonic music component, a percussive music component, and a vocal component;
a separating unit, configured to separate a second frequency-domain signal from the first frequency-domain signal obtained by the converting unit using the HPSS algorithm, the second frequency-domain signal comprising the percussive music component and the vocal component;
an extracting unit, configured to extract the vocal component from the second frequency-domain signal separated by the separating unit using the NNMF algorithm.
In a first implementation of the second aspect, the separating unit is specifically configured to:
take the magnitude of each frequency bin in the first frequency-domain signal obtained by the converting unit to obtain a first matrix;
median-filter each row of the first matrix to obtain a second matrix, and median-filter each column of the first matrix to obtain a third matrix;
separate the second frequency-domain signal from the first frequency-domain signal according to the second matrix and the third matrix by the following formula:
((P.*P)./((H.*H)+(P.*P))).*X
where H denotes the second matrix, P denotes the third matrix, and X denotes the first matrix; ./ denotes element-wise division and .* denotes element-wise multiplication.
With reference to the second aspect or the first implementation of the second aspect, in a second implementation, the converting unit is specifically configured to:
convert the single-channel signal of the song from the time domain to the frequency domain using the fast Fourier transform (FFT) to obtain the first frequency-domain signal, where the sampling rate of the FFT is 44.1 kHz, the frame length is no less than 8192 points, and the frame shift is half the frame length.
With reference to the second aspect or the first or second implementation of the second aspect, in a third implementation, the separating unit is further configured to:
convert the second frequency-domain signal from the frequency domain to the time domain using the inverse fast Fourier transform (IFFT), and then convert it back from the time domain to the frequency domain using the FFT to obtain a re-transformed second frequency-domain signal, where the sampling rate of this FFT is 44.1 kHz, the frame length is no more than 4096 points, and the frame shift is one quarter of that frame length;
the extracting unit being specifically configured to:
extract the vocal component, using the NNMF algorithm, from the re-transformed second frequency-domain signal obtained by the separating unit.
With reference to the third implementation of the second aspect, in a fourth implementation, the extracting unit comprises:
a first obtaining subunit, configured to take the magnitude of each frequency bin in the re-transformed second frequency-domain signal obtained by the separating unit;
a first calculating subunit, configured to traverse each frame of the re-transformed second frequency-domain signal and calculate the similarity between each frame and every other frame of the re-transformed second frequency-domain signal;
a second obtaining subunit, configured to obtain the frequency-domain spectral estimate of the re-transformed second frequency-domain signal according to the similarities calculated by the first calculating subunit;
a second calculating subunit, configured to calculate, by the following formula, the difference in exponential normalized cross-correlation between corresponding frequency-bin pairs of the re-transformed second frequency-domain signal and the frequency-domain spectral estimate obtained by the second obtaining subunit,
Q(i, j) = \left( \exp\!\left( -\frac{\left( \log PP(i, j) - \log Y(i, j) \right)^2}{2\lambda^2} \right) \right)^2
and to calculate, by the following formula, the weight of the percussive music component according to the difference:
W(i, j) = \begin{cases} 0, & Q(i, j) < 0.85 \\ 1, & Q(i, j) \geq 0.85 \end{cases}
where PP(i, j) denotes the i-th frequency bin of the j-th frame of the re-transformed second frequency-domain signal; Y(i, j) denotes the i-th frequency bin of the j-th frame of the frequency-domain spectral estimate of the re-transformed second frequency-domain signal; Q(i, j) denotes the difference in exponential normalized cross-correlation between the two; W(i, j) denotes the weight of the percussive music component at the i-th frequency bin of the j-th frame of the re-transformed second frequency-domain signal; and λ (rendered "namda" in the original) is a weight factor, λ = 3;
an extracting subunit, configured to extract, by the following formula, the vocal component from the re-transformed second frequency-domain signal according to the weight of the percussive music component calculated by the second calculating subunit:
P1=(1-W).*PP
where P1 denotes the vocal component, W denotes the weight of the percussive music component, and PP denotes the re-transformed second frequency-domain signal.
With reference to the fourth implementation of the second aspect, in a fifth implementation, the second obtaining subunit is specifically configured to:
obtain, according to the similarities, a predetermined number of similar frames for each frame;
calculate a spectral estimate for each frame from its predetermined number of similar frames;
assemble the calculated per-frame spectral estimates into the frequency-domain spectral estimate of the re-transformed second frequency-domain signal.
With reference to the fourth implementation of the second aspect, in a sixth implementation, the extracting unit is further configured to:
separate the percussive music component from the re-transformed second frequency-domain signal obtained by the separating unit by the following formula,
P2=W.*PP
where P2 denotes the percussive music component.
With reference to the sixth implementation of the second aspect, in a seventh implementation, the apparatus further comprises:
a synthesizing unit, configured to separate the harmonic music component from the first frequency-domain signal obtained by the converting unit, convert the separated harmonic music component from the frequency domain to the time domain, convert the percussive music component from the frequency domain to the time domain, and synthesize the converted harmonic music component and the converted percussive music component to obtain the accompaniment component.
In a third aspect, an embodiment of the present invention provides a device for processing an audio signal, the device comprising a processor and a memory, the processor being configured to execute the following instructions:
converting a single-channel signal of a song from the time domain to the frequency domain to obtain a first frequency-domain signal, the first frequency-domain signal comprising a harmonic music component, a percussive music component, and a vocal component;
separating a second frequency-domain signal from the first frequency-domain signal using the HPSS algorithm, the second frequency-domain signal comprising the percussive music component and the vocal component;
extracting the vocal component from the second frequency-domain signal using the NNMF algorithm.
The technical solutions provided by the embodiments of the present invention bring the following beneficial effects. The accompaniment of a song is divided into a harmonic music component and a percussive music component: the HPSS algorithm is first used to separate from the song a second frequency-domain signal comprising the percussive music component and the vocal component, and the NNMF algorithm is then used to separate the vocal component from the percussive music component, so that the separated vocal component is cleaner and large accompaniment residues are avoided. Furthermore, extracting the vocal component from the single-channel signal of the song with the NNMF algorithm takes into account the frequency distribution across similar frames, making full use of the strong periodicity of the accompaniment and the rich variability of the vocals; this avoids the damage done to the vocal component when vocals are extracted from the frequency distribution of individual frequency-bin pairs alone. The method is widely applicable, the extracted vocal component is of better quality, and the requirements of high-quality vocal extraction can be met.
Brief description of the drawings
To illustrate the technical solutions in the embodiments of the present invention more clearly, the accompanying drawings needed to describe the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and a person of ordinary skill in the art may derive other drawings from them without creative effort.
Fig. 1 is a flowchart of a method for processing an audio signal provided by Embodiment One of the present invention;
Fig. 2 is a flowchart of another method for processing an audio signal provided by Embodiment Two of the present invention;
Fig. 3 is a schematic diagram of a KTV application scenario provided by Embodiment Two of the present invention;
Fig. 4 is a schematic structural diagram of an apparatus for processing an audio signal provided by Embodiment Three of the present invention;
Fig. 5 is a schematic structural diagram of another apparatus for processing an audio signal provided by Embodiment Four of the present invention;
Fig. 6 is a schematic structural diagram of a device for processing an audio signal provided by Embodiment Five of the present invention.
Detailed description of the embodiments
To make the objects, technical solutions, and advantages of the present invention clearer, the embodiments of the present invention are described in further detail below with reference to the accompanying drawings.
Embodiment One
Referring to Fig. 1, an embodiment of the present invention provides a method for processing an audio signal, the method comprising:
Step 101: convert a single-channel signal of a song from the time domain to the frequency domain to obtain a first frequency-domain signal, the first frequency-domain signal comprising a harmonic music component, a percussive music component, and a vocal component.
The single-channel signal of the song may be the signal of a mono song or the left or right channel signal of a two-channel song. Two-channel songs include two-channel stereo songs.
The harmonic music component and the percussive music component together form the instrumental accompaniment of the song. The harmonic music component comprises sounds produced by instruments such as the piano, and the percussive music component comprises drum beats and striking sounds.
The single-channel signal of the song may be converted from the time domain to the frequency domain using the fast Fourier transform (FFT). Optionally, the sampling rate for the FFT is 44.1 kHz, the frame length is no less than 8192 points, and the frame shift may be half the frame length; for example, for a song sampled at 44.1 kHz, an 8192-point (185.7 ms) frame length and a 4096-point frame shift are used for the FFT.
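The framing just described can be illustrated with scipy's `stft` (an assumption; the text specifies only the FFT parameters). Note that scipy returns a one-sided spectrum of 8192 // 2 + 1 = 4097 bins, whereas the text counts all 8192 FFT bands:

```python
import numpy as np
from scipy.signal import stft

fs = 44100               # 44.1 kHz sampling rate
mono = np.zeros(fs * 2)  # stand-in for a 2-second single-channel signal
# 8192-point (about 185.7 ms) frames with a 4096-point (half-frame) shift.
freqs, times, X = stft(mono, fs=fs, nperseg=8192, noverlap=8192 - 4096)
# X is the two-dimensional "time-frequency" first frequency-domain signal;
# each element is a complex number carrying amplitude and phase.
```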
Step 102: separate a second frequency-domain signal from the first frequency-domain signal using the harmonic/percussive sound separation (HPSS) algorithm.
The second frequency-domain signal comprises the percussive music component and the vocal component.
HPSS algorithms include the median-filtering HPSS algorithm and the spectral diffusion method (complementary spectral diffusion).
Step 103: extract the vocal component from the second frequency-domain signal using the nearest-neighbours and median filtering (NNMF) algorithm.
In the embodiment of the present invention, the accompaniment of the song is divided into a harmonic music component and a percussive music component: the HPSS algorithm is first used to separate from the song a second frequency-domain signal comprising the percussive music component and the vocal component, and the NNMF algorithm is then used to separate the vocal component from the percussive music component, so that the separated vocal component is cleaner and large accompaniment residues are avoided. Furthermore, extracting the vocal component from the single-channel signal of the song with the NNMF algorithm takes into account the frequency distribution across similar frames, making full use of the strong periodicity of the accompaniment and the rich variability of the vocals; this avoids the damage done to the vocal component when vocals are extracted from the frequency distribution of individual frequency-bin pairs alone. The method is widely applicable, the extracted vocal component is of better quality, and the requirements of high-quality vocal extraction can be met.
Embodiment bis-
Referring to Fig. 2, an embodiment of the present invention provides a method for processing an audio signal. The method includes the following steps:
Step 201: Convert the monaural signal of a song from the time domain to the frequency domain to obtain a first frequency-domain signal, where the first frequency-domain signal contains a harmonic music component, a percussive music component, and a vocal component.
Generally, songs include monaural songs and stereo songs. The monaural signal of a song is either the signal of a monaural song or the left/right channel signal of a stereo song.
The harmonic music component and the percussive music component together form the instrumental accompaniment of the song. The harmonic music component includes sounds produced by instruments such as the piano, and the percussive music component includes drum and beat sounds.
Optionally, an FFT may be used to convert the monaural signal of the song from the time domain to the frequency domain to obtain the first frequency-domain signal. The sampling rate for this FFT may be 44.1 kHz, the frame length is not less than 8192 samples, and the frame shift may be one half of the frame length. For example, for a song sampled at 44.1 kHz, an FFT with a frame length of 8192 samples (185.7 ms) and a frame shift of 4096 samples converts the time domain to the frequency domain, yielding two-dimensional time-frequency information. In the first frequency-domain signal, the harmonic music component and the percussive music component differ significantly in spectral characteristics, while the spectral characteristics of the percussive music component and the vocal component are similar.
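As a rough illustration of step 201, the framing and FFT just described can be sketched with NumPy. The Hann window and the real-input FFT are our assumptions; the patent only fixes the sampling rate, frame length, and frame shift.

```python
import numpy as np

def stft(x, frame_len=8192, hop=4096):
    """Split x into overlapping frames and FFT each one.

    The frame length (8192 samples, about 185.7 ms at 44.1 kHz) and the
    half-frame shift follow the text; the Hann window is an assumption.
    Returns the two-dimensional "time-frequency" matrix X(F, N):
    rows are frequency bins, columns are frames.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1).T
```

For one second of 44.1 kHz audio this yields a 4097 × 9 complex matrix: 4097 non-redundant bins of an 8192-point real FFT, and 9 frames at a 4096-sample shift.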
Step 202: Use the HPSS algorithm to separate a second frequency-domain signal from the first frequency-domain signal.
Because the HPSS algorithm can separate signals whose spectra differ greatly, it can be used to separate the harmonic music component from the song. Note that, owing to the characteristics of the HPSS algorithm itself, the FFT frame length used in step 201 must be large (not less than 8192 samples) so that the separated percussive music component and vocal component are relatively clean; at the same time, to keep the computational load acceptable, the FFT frame shift should not be too large (one half of the frame length is chosen).
The second frequency-domain signal contains the percussive music component and the vocal component. Optionally, step 202 includes:
Step 2021: Take the magnitude of each frequency bin in the first frequency-domain signal to obtain a first matrix.
The first frequency-domain signal is two-dimensional time-frequency information. Suppose this time-frequency information is X(F, N), where N is the number of frames in the time dimension and F is the number of frequency bands in the frequency dimension (equal to the frame length). Each element of X(F, N) (one frequency bin of the frequency-domain signal) is a complex number containing the magnitude and phase information of that bin.
Suppose the first matrix obtained by taking the magnitude of each bin of X(F, N) is XX(F, N).
Step 2022: Apply median filtering to each column of the first matrix to obtain a second matrix, and apply median filtering to each row of the first matrix to obtain a third matrix.
Taking the column-wise median filtering of the first matrix as an example, the median filtering process is as follows. Each column of the first matrix is an F-dimensional vector. Suppose the k-th column of the first matrix is the vector x(k) = (x(k1), x(k2), ..., x(kF)). Median filtering x(k) yields an F-dimensional output vector y(k), which is the k-th column of the second matrix: y(k) = (y(k1), y(k2), ..., y(kF)), where
y(ki) = median{x(k, i-l), ..., x(k, i+l)}, l = (order - 1)/2, i = 1, ..., F;
median denotes taking the median, and order is the filter order, which may be 17.
Suppose the second matrix is H(F, N) and the third matrix is P(F, N).
Step 2023: According to the second matrix and the third matrix, separate the second frequency-domain signal from the first frequency-domain signal by the following formula (1):
((P.*P)./((H.*H)+(P.*P))).*X (1)
Here H denotes the second matrix, P denotes the third matrix, and X denotes the first matrix; ./ denotes element-wise division and .* denotes element-wise multiplication (the matrices are multiplied element by element).
The harmonic music component signal separated from the first frequency-domain signal is expressed as ((H.*H)./((H.*H)+(P.*P))).*X.
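A compact sketch of steps 2021-2023 (magnitudes, median filtering, soft masks). The column/row axis assignment mirrors the text's convention; the edge padding of the median window and the small epsilon guarding the division are our additions.

```python
import numpy as np

def median_filter_1d(v, order=17):
    """Centred median filter of the given order; edges are padded by
    repeating the boundary value (edge handling is an assumption)."""
    half = (order - 1) // 2
    padded = np.pad(v, half, mode="edge")
    return np.array([np.median(padded[i:i + order]) for i in range(len(v))])

def hpss_separate(X, order=17):
    """Steps 2021-2023: first matrix XX = |X|, second matrix H from
    column-wise filtering, third matrix P from row-wise filtering, then
    the soft mask of formula (1). Returns the second frequency-domain
    signal (percussive + vocal) and the complementary harmonic signal."""
    XX = np.abs(X)                                           # first matrix
    H = np.apply_along_axis(median_filter_1d, 0, XX, order)  # second matrix
    P = np.apply_along_axis(median_filter_1d, 1, XX, order)  # third matrix
    mask = (P * P) / (H * H + P * P + 1e-12)                 # formula (1) mask
    return mask * X, (1.0 - mask) * X
```

Because the two masks sum to one, the separated signals add back to the original spectrogram exactly.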
Step 203: Use the inverse fast Fourier transform (IFFT) to convert the second frequency-domain signal from the frequency domain to the time domain, then apply an FFT to convert it from the time domain back to the frequency domain, obtaining a re-transformed second frequency-domain signal.
Optionally, the FFT used to obtain the re-transformed second frequency-domain signal has a sampling rate of 44.1 kHz, a frame length of not more than 4096 samples, and a frame shift that may be one quarter of the frame length. For example, for a second frequency-domain signal sampled at 44.1 kHz, an FFT with a frame length of 4096 samples (92.8 ms) and a frame shift of 1024 samples converts the second frequency-domain signal from the time domain to the frequency domain, yielding two-dimensional time-frequency information.
Note that in the embodiment of the present invention, step 203 is optional. In other embodiments, step 204 may be performed directly after step 202, using the NNMF algorithm to extract the vocal component from the second frequency-domain signal.
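The optional re-analysis of step 203 (back to the time domain, then a finer-grained FFT) might be sketched as follows. The Hann analysis/synthesis windows and the overlap-add normalisation are our assumptions; only the frame sizes and shifts come from the text.

```python
import numpy as np

def stft(x, frame_len, hop):
    """Windowed framing followed by a real FFT (as in step 201)."""
    w = np.hanning(frame_len)
    n = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop : i * hop + frame_len] * w for i in range(n)])
    return np.fft.rfft(frames, axis=1).T

def istft(X, hop):
    """Inverse FFT of each frame, then weighted overlap-add."""
    frames = np.fft.irfft(X.T, axis=1)
    frame_len = frames.shape[1]
    w = np.hanning(frame_len)
    n = hop * (len(frames) - 1) + frame_len
    out, norm = np.zeros(n), np.zeros(n)
    for i, f in enumerate(frames):
        out[i * hop : i * hop + frame_len] += f * w
        norm[i * hop : i * hop + frame_len] += w * w
    return out / np.maximum(norm, 1e-12)

def retransform(X2, hop_in=4096, frame_out=4096, hop_out=1024):
    """Step 203: IFFT the second frequency-domain signal back to the time
    domain, then re-analyse with a 4096-sample frame and 1024-sample hop."""
    return stft(istft(X2, hop_in), frame_out, hop_out)
```

The window-sum normalisation makes the round trip exact away from the signal edges, so the re-analysis only changes the time-frequency resolution, not the content.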
Step 204: Use the NNMF algorithm to extract the vocal component from the re-transformed second frequency-domain signal.
Optionally, step 204 includes:
Step 2041: Take the magnitude of each frequency bin in the re-transformed second frequency-domain signal.
The re-transformed second frequency-domain signal is two-dimensional time-frequency information. Suppose the re-transformed second frequency-domain signal is PP(F, N); taking the magnitude of each of its bins yields a fourth matrix Z(F, N), where N is the number of frames in the time dimension and F is the number of frequency bands in the frequency dimension (F may be half the frame length).
Step 2042: Traverse each frame of the re-transformed second frequency-domain signal and compute the similarity between that frame and every other frame of the re-transformed second frequency-domain signal.
The similarities between each frame and all the other frames can be represented by a fifth matrix. The fifth matrix is a symmetric N×N matrix; its diagonal elements are set to 0 (each frame's similarity to itself), and apart from the diagonal, each row or column of the fifth matrix stores, in frame order, the similarities between one frame and all the other frames of the re-transformed second frequency-domain signal. Suppose the fifth matrix is D and the element in the k-th column and l-th row of D is D(k, l), k, l = 1, ..., N. Then
D(k, l) = ||Z(:, k) - Z(:, l)||^2
Here Z(:, k) denotes the k-th column of the fourth matrix (the frequency-domain information of the k-th frame) and Z(:, l) denotes the l-th column of the fourth matrix. D(k, l) represents the similarity between the k-th and l-th columns of the fourth matrix: the higher the similarity, the smaller the value of D(k, l).
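The fifth matrix of step 2042 can be computed in one vectorised pass, using the standard expansion of the squared Euclidean distance (a sketch under the definitions above):

```python
import numpy as np

def frame_distance_matrix(Z):
    """Fifth matrix D: D(k, l) = ||Z(:, k) - Z(:, l)||^2 for every pair of
    frame spectra (columns of the magnitude matrix Z). Smaller values mean
    more similar frames; the diagonal (each frame against itself) is 0."""
    sq = np.sum(Z * Z, axis=0)                       # ||z_k||^2 per frame
    D = sq[:, None] + sq[None, :] - 2.0 * (Z.T @ Z)  # expand the squared norm
    return np.maximum(D, 0.0)                        # clip rounding negatives
```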
Step 2043: From the computed similarities between each frame and all the other frames of the re-transformed second frequency-domain signal, obtain a frequency-domain spectral estimate of the re-transformed second frequency-domain signal.
Optionally, step 2043 includes the following. First, according to the similarities, obtain a predetermined number of similar frames for each frame, where each selected similar frame has a higher similarity to the given frame than any of the frames not selected. Specifically, the computed similarities between the given frame and the other frames can be sorted from highest to lowest, and the top predetermined number of them selected; the similar frames are those corresponding to the selected similarities. Then, from the predetermined number of similar frames, compute the spectral estimate of each frame: the spectral estimate of each bin of a frame is the median of the corresponding bins across the predetermined number of similar frames determined for that frame. Finally, the spectral estimates of all the frames together form the frequency-domain spectral estimate of the re-transformed second frequency-domain signal.
For example, suppose the current frame of the traversal is frame i and the predetermined number is 20. First, sort the i-th column of the fifth matrix (the similarities between each frame and all the other frames of the re-transformed second frequency-domain signal) in ascending order (smaller values mean higher similarity) and take the first 20 entries; the row indices of these 20 entries give, in the fourth matrix (the magnitudes of the re-transformed second frequency-domain signal), the column indices of the 20 frames most similar to frame i. Next, extract these 20 frames from the fourth matrix to form a sixth matrix, in which each row of bins corresponds to the same row of bins of frame i. Then take the median of each row of bins of the sixth matrix to obtain the spectral estimate of frame i.
Suppose the frequency-domain spectral estimate of the re-transformed second frequency-domain signal is a seventh matrix Y(F, N).
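Step 2043's per-frame estimate (the per-bin median over the most similar frames) might look as follows; excluding each frame from its own neighbour set is our reading of the example above.

```python
import numpy as np

def spectral_estimate(Z, D, k=20):
    """Seventh matrix Y: for every frame j, pick the k frames with the
    smallest distances D(:, j) (self excluded) and take the per-bin median
    of their spectra in the fourth matrix Z."""
    F, N = Z.shape
    Y = np.empty_like(Z)
    for j in range(N):
        d = D[:, j].astype(float)
        d[j] = np.inf                  # a frame is not its own neighbour
        nearest = np.argsort(d)[:k]    # the k most similar frames
        Y[:, j] = np.median(Z[:, nearest], axis=1)
    return Y
```

For a strictly repeating accompaniment the neighbour frames are near-identical, so Y reproduces the accompaniment spectrum while the median suppresses the non-repeating vocal energy.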
Step 2044: By the following formula (2), compute for each pair of corresponding bins the exponentially normalized cross-correlation value between the re-transformed second frequency-domain signal and its frequency-domain spectral estimate, and, by the following formula (3), compute from this value the weight of the percussive music component signal:
Q(i, j) = ( exp( -( log(PP(i, j)) - log(Y(i, j)) )^2 / (2*namda*namda) ) )^2    (2)
W(i, j) = 0 if Q(i, j) < 0.85; W(i, j) = 1 if Q(i, j) >= 0.85    (3)
Here PP(i, j) denotes the i-th bin of the j-th frame of the re-transformed second frequency-domain signal (the fourth matrix); Y(i, j) denotes the i-th bin of the j-th frame of its frequency-domain spectral estimate (the seventh matrix); Q(i, j) denotes the exponentially normalized cross-correlation value between these two bins; and W(i, j) denotes the weight of the percussive music component at the i-th bin of the j-th frame of the re-transformed second frequency-domain signal. namda is a weighting factor and may be set to 3.
Step 2045: By the following formula (4), extract the vocal component from the re-transformed second frequency-domain signal according to the weight of the percussive music component signal:
P1=(1-W).*PP (4)
Here P1 denotes the vocal component, W denotes the weight of the percussive music component, and PP denotes the re-transformed second frequency-domain signal.
Optionally, the vocal component is converted from the frequency domain to the time domain by the inverse fast Fourier transform, and a vocal time-domain signal is output. The vocal component extracted from the re-transformed second frequency-domain signal by the NNMF algorithm retains a slight drum residue from the percussive music component, which can serve as a beat-keeping guide for the vocals.
Step 2046: By the following formula (5), separate the percussive music component signal from the re-transformed second frequency-domain signal:
P2=W.*PP (5)
Here P2 denotes the percussive music component signal.
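Steps 2044-2046 can be sketched directly from formulas (2)-(5); namda = 3 and the 0.85 threshold follow the text, while the small epsilon guarding log(0) is our addition.

```python
import numpy as np

def separate_vocals(PP, Y, namda=3.0, threshold=0.85):
    """Compare each bin of the magnitude spectrogram PP with the
    repeating-spectrum estimate Y, build the binary percussive weight W,
    and split PP into the vocal part P1 and the percussive part P2."""
    eps = 1e-12                                          # guard log(0)
    Q = np.exp(-(np.log(PP + eps) - np.log(Y + eps)) ** 2
               / (2.0 * namda * namda)) ** 2             # formula (2)
    W = (Q >= threshold).astype(float)                   # formula (3)
    P1 = (1.0 - W) * PP                                  # formula (4): vocals
    P2 = W * PP                                          # formula (5): percussion
    return P1, P2, W
```

Bins whose magnitude closely tracks the repeating estimate (Q near 1) are treated as percussive accompaniment; the remaining bins are kept as the vocal component.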
Step 205: Separate the harmonic music component from the first frequency-domain signal, convert the separated harmonic music component from the frequency domain to the time domain, convert the percussive music component from the frequency domain to the time domain, and synthesize the converted harmonic music component with the converted percussive music component to obtain the accompaniment.
Optionally, the process of separating the harmonic music component from the first frequency-domain signal may refer to step 2023 and is not repeated here.
Optionally, the separated harmonic music component and the percussive music component are each converted from the frequency domain to the time domain by the inverse fast Fourier transform, and the two are synthesized into an accompaniment time-domain signal.
Note that the applicable scenarios of the method provided by this embodiment include KTV (karaoke) scenarios. Referring to Fig. 3, the method provided by this embodiment is used in advance to split a song file into a vocal signal and an accompaniment signal, and three playback modes are provided for the song: a lead-sing/teaching mode, an original mode, and an accompaniment mode. In the lead-sing/teaching mode, the user can adjust the accompaniment volume and the vocal volume separately, so that the loudspeaker plays only the vocals or plays the vocals with a faint accompaniment; in this mode the user can sing along while the original singer softly guides in the background, experiencing the singer's unaccompanied performance. In the accompaniment mode, the vocal signal is muted and the loudspeaker plays only the accompaniment signal; in the original mode, the loudspeaker plays both the vocal signal and the accompaniment signal.
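The three playback modes reduce to gain control over the two separated signals; a minimal sketch (function name, mode names, and gain parameters are illustrative, not from the patent):

```python
import numpy as np

def play_mix(vocal, accomp, mode, vocal_gain=1.0, accomp_gain=1.0):
    """Mix the separated vocal and accompaniment signals for one of the
    three KTV playback modes described in the text."""
    if mode == "accompaniment":   # vocal signal muted
        return accomp
    if mode == "original":        # both signals played back
        return vocal + accomp
    if mode == "guide":           # lead-sing/teaching: independent volumes
        return vocal_gain * vocal + accomp_gain * accomp
    raise ValueError("unknown mode: " + mode)
```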
In the embodiment of the present invention, the accompaniment of a song is divided into a harmonic music component and a percussive music component. The HPSS algorithm is first used to separate a second frequency-domain signal from the song, where the second frequency-domain signal contains the percussive music component and the vocal component; the NNMF algorithm is then used to extract the vocal component from the percussive music component, so that the separated vocal component is relatively clean. Moreover, because the NNMF algorithm extracts the vocal component from the monaural signal of the song by considering the frequency-distribution characteristics shared between similar frames, it fully exploits the strong periodicity of the accompaniment and the rich variability of the vocals, avoiding the damage to the vocal component that results from extracting vocals on the basis of the frequency distribution of individual bins alone. The method is widely applicable, the extracted vocal component is of good quality, and it can meet the requirements of high-quality vocal extraction.
Embodiment 3
Referring to Fig. 4, an embodiment of the present invention provides an apparatus for processing an audio signal, the apparatus comprising:
A conversion unit 401, configured to convert the monaural signal of a song from the time domain to the frequency domain to obtain a first frequency-domain signal.
The first frequency-domain signal contains a harmonic music component, a percussive music component, and a vocal component.
The monaural signal of a song may be the left/right channel signal of a stereo song or the signal of a monaural song.
A separation unit 402, configured to use the HPSS algorithm to separate a second frequency-domain signal from the first frequency-domain signal converted by the conversion unit 401, where the second frequency-domain signal contains the percussive music component and the vocal component.
An extraction unit 403, configured to use the NNMF algorithm to extract the vocal component from the second frequency-domain signal separated by the separation unit 402.
In the embodiment of the present invention, the accompaniment of a song is divided into a harmonic music component and a percussive music component. The HPSS algorithm is first used to separate a second frequency-domain signal from the song, where the second frequency-domain signal contains the percussive music component and the vocal component; the NNMF algorithm is then used to extract the vocal component from the percussive music component, so that the separated vocal component is relatively clean and a large accompaniment residue is avoided. Moreover, because the NNMF algorithm extracts the vocal component from the monaural signal of the song by considering the frequency-distribution characteristics shared between similar frames, it fully exploits the strong periodicity of the accompaniment and the rich variability of the vocals, avoiding the damage to the vocal component that results from extracting vocals on the basis of the frequency distribution of individual bins alone. The method is widely applicable, the extracted vocal component is of good quality, and it can meet the requirements of high-quality vocal extraction.
Embodiment 4
Referring to Fig. 5, an embodiment of the present invention provides an apparatus for processing an audio signal, the apparatus comprising:
A conversion unit 501, configured to convert the monaural signal of a song from the time domain to the frequency domain to obtain a first frequency-domain signal.
Optionally, the conversion unit 501 is configured to use an FFT to convert the monaural signal of the song from the time domain to the frequency domain to obtain the first frequency-domain signal; the sampling rate of the FFT is 44.1 kHz, the frame length is not less than 8192 samples, and the frame shift may be one half of the frame length.
The first frequency-domain signal contains a harmonic music component, a percussive music component, and a vocal component.
The monaural signal of a song may be the left/right channel signal of a stereo song or the signal of a monaural song.
A separation unit 502, configured to use the HPSS algorithm to separate a second frequency-domain signal from the first frequency-domain signal obtained by the conversion unit 501, where the second frequency-domain signal contains the percussive music component and the vocal component.
Optionally, the separation unit 502 is configured to: take the magnitude of each frequency bin of the first frequency-domain signal obtained by the conversion unit 501 to obtain a first matrix; apply median filtering to each column of the first matrix to obtain a second matrix, and apply median filtering to each row of the first matrix to obtain a third matrix; and, according to the second matrix and the third matrix, separate the second frequency-domain signal from the first frequency-domain signal by the following formula (1), where the second frequency-domain signal contains the percussive music component and the vocal component:
((P.*P)./((H.*H)+(P.*P))).*X (1)
Here H denotes the second matrix, P denotes the third matrix, and X denotes the first matrix; ./ denotes element-wise division and .* denotes element-wise multiplication.
An extraction unit 503, configured to use the NNMF algorithm to extract the vocal component from the second frequency-domain signal separated by the separation unit 502.
Optionally, the separation unit 502 is further configured to use the inverse fast Fourier transform to convert the second frequency-domain signal from the frequency domain to the time domain and then apply an FFT to convert it from the time domain back to the frequency domain, obtaining a re-transformed second frequency-domain signal; the FFT used to obtain the re-transformed second frequency-domain signal has a sampling rate of 44.1 kHz, a frame length of not more than 4096 samples, and a frame shift of one quarter of its frame length.
Optionally, the extraction unit 503 is configured to use the NNMF algorithm to extract the vocal component from the re-transformed second frequency-domain signal obtained by the separation unit 502.
Optionally, the extraction unit 503 includes:
A first obtaining subunit 5031, configured to take the magnitude of each frequency bin of the re-transformed second frequency-domain signal obtained by the separation unit 502.
A first computing subunit 5032, configured to traverse each frame of the re-transformed second frequency-domain signal and compute the similarity between that frame and every other frame of the re-transformed second frequency-domain signal.
A second obtaining subunit 5033, configured to obtain the frequency-domain spectral estimate of the second frequency-domain signal according to the similarities computed by the first computing subunit 5032.
Optionally, the second obtaining subunit 5033 is configured to: obtain, from the similarities computed by the first computing subunit 5032, a predetermined number of similar frames for each frame; compute the spectral estimate of each frame from its predetermined number of similar frames; and form the frequency-domain spectral estimate of the re-transformed second frequency-domain signal from the spectral estimates of all the frames.
A second computing subunit 5034, configured to compute, by the following formula (2), the exponentially normalized cross-correlation value between corresponding bins of the re-transformed second frequency-domain signal and the frequency-domain spectral estimate obtained by the second obtaining subunit 5033, and to compute, by the following formula (3), the weight of the percussive music component signal from this value:
Q(i, j) = ( exp( -( log(PP(i, j)) - log(Y(i, j)) )^2 / (2*namda*namda) ) )^2    (2)
W(i, j) = 0 if Q(i, j) < 0.85; W(i, j) = 1 if Q(i, j) >= 0.85    (3)
Here PP(i, j) denotes the i-th bin of the j-th frame of the re-transformed second frequency-domain signal; Y(i, j) denotes the i-th bin of the j-th frame of its frequency-domain spectral estimate; Q(i, j) denotes the exponentially normalized cross-correlation value between these two bins; W(i, j) denotes the weight of the percussive music component at the i-th bin of the j-th frame of the re-transformed second frequency-domain signal; and namda is a weighting factor, namda = 3.
An extraction subunit 5035, configured to extract, by the following formula (4), the vocal component from the re-transformed second frequency-domain signal according to the weight of the percussive music component computed by the second computing subunit 5034:
P1=(1-W).*PP (4)
P1 denotes the vocal component, W denotes the weight of the percussive music component, and PP denotes the re-transformed second frequency-domain signal.
Optionally, the extraction subunit 5035 is further configured to separate, by the following formula (5), the percussive music component from the re-transformed second frequency-domain signal according to the weight of the percussive music component:
P2=W.*PP (5)
P2 denotes the percussive music component signal.
Optionally, the apparatus further includes:
A synthesis unit 504, configured to: separate the harmonic music component from the first frequency-domain signal obtained by the conversion unit 501; convert the separated harmonic music component from the frequency domain to the time domain; convert the percussive music component from the frequency domain to the time domain; and synthesize the converted harmonic music component with the converted percussive music component to obtain the accompaniment.
In the embodiment of the present invention, the accompaniment of a song is divided into a harmonic music component and a percussive music component. The HPSS algorithm is first used to separate a second frequency-domain signal from the song, where the second frequency-domain signal contains the percussive music component and the vocal component; the NNMF algorithm is then used to extract the vocal component from the percussive music component, so that the separated vocal component is relatively clean and a large accompaniment residue is avoided. Moreover, because the NNMF algorithm extracts the vocal component from the monaural signal of the song by considering the frequency-distribution characteristics shared between similar frames, it fully exploits the strong periodicity of the accompaniment and the rich variability of the vocals, avoiding the damage to the vocal component that results from extracting vocals on the basis of the frequency distribution of individual bins alone. The method is widely applicable, the extracted vocal component is of good quality, and it can meet the requirements of high-quality vocal extraction.
Embodiment 5
An embodiment of the present invention provides a device for processing an audio signal, as shown in Fig. 6. The device may be a computer (including hand-held computer systems such as smartphones and tablets) or a server. It generally includes at least one processor 10 (for example, a CPU), a user interface 11, at least one network interface 12 or other communication interface, a memory 13, and at least one communication bus 14. Those skilled in the art will understand that the structure shown in Fig. 6 does not limit the computer, which may include more or fewer parts than illustrated, combine certain parts, or arrange the parts differently.
Each component of the device is described below with reference to Fig. 6:
The communication bus 14 implements the communication connections between the processor 10, the memory 13, and the communication interface.
The at least one network interface 12 (wired or wireless) implements the communication connection between this device and at least one other computer or server, using, for example, the Internet, a wide area network, a local area network, or a metropolitan area network.
The memory 13 may be used to store software programs and application modules; by running the software programs and application modules stored in the memory 13, the processor 10 executes the various functional applications and data processing of the device. The memory 13 may mainly comprise a program storage area and a data storage area, where the program storage area may store the operating system and the applications required by at least one function (such as running the HPSS algorithm), and the data storage area may store data created through use of the device (such as cached frequency-domain signals). In addition, the memory 13 may include high-speed RAM (Random Access Memory) and may also include non-volatile memory, for example at least one magnetic disk memory, flash memory device, or other non-volatile solid-state memory part.
The user interface 11 includes, but is not limited to, output devices and input devices. Input devices generally include a keyboard and a pointing device (for example, a mouse, trackball, touch pad, or touch-sensitive display screen). Output devices generally include devices that present computer information, such as a display, printer, or projector. The display may be used to show information entered by the user or files provided to the user; the keyboard and pointing device may be used to receive entered numeric or character information and to generate signal inputs related to user settings and function control of the device, such as obtaining operating instructions issued by the user in response to operation prompts.
The processor 10 is the control center of the device. It connects all parts of the whole device through various interfaces and lines and, by running or executing the software programs and/or application modules stored in the memory 13 and calling the data stored in the memory 13, executes the various functions of the device and processes data, thereby monitoring the device as a whole.
Specifically, by running or executing the software programs and/or application modules stored in the memory 13 and calling the data stored in the memory 13, the processor 10 can: convert the monaural signal of a song from the time domain to the frequency domain to obtain a first frequency-domain signal, where the first frequency-domain signal contains a harmonic music component, a percussive music component, and a vocal component; use the HPSS algorithm to separate a second frequency-domain signal from the first frequency-domain signal, where the second frequency-domain signal contains the percussive music component and the vocal component; and use the NNMF algorithm to extract the vocal component from the second frequency-domain signal.
Optionally, the processor 10 can: take the magnitude of each frequency bin of the first frequency-domain signal to obtain a first matrix; apply median filtering to each column of the first matrix to obtain a second matrix, and apply median filtering to each row of the first matrix to obtain a third matrix; and, according to the second matrix and the third matrix, separate the second frequency-domain signal from the first frequency-domain signal by the following formula (1):
((P.*P)./((H.*H)+(P.*P))).*X (1)
Here H denotes the second matrix, P denotes the third matrix, and X denotes the first matrix; ./ denotes element-wise division and .* denotes element-wise multiplication (the matrices are multiplied element by element).
Optionally, the processor 10 can use an FFT to convert the monaural signal of the song from the time domain to the frequency domain to obtain the first frequency-domain signal; the sampling rate of the FFT is 44.1 kHz, the frame length is not less than 8192 samples, and the frame shift is one half of the frame length.
Optionally, the processor 10 can use the inverse fast Fourier transform to convert the second frequency-domain signal from the frequency domain to the time domain and then apply an FFT to convert it from the time domain back to the frequency domain, obtaining a re-transformed second frequency-domain signal; the FFT used to obtain the re-transformed second frequency-domain signal has a sampling rate of 44.1 kHz, a frame length of not more than 4096 samples, and a frame shift of one quarter of its frame length. The processor 10 can then use the NNMF algorithm to extract the vocal component from the re-transformed second frequency-domain signal.
Optionally, the processor 10 can: take the magnitude of each frequency bin of the re-transformed second frequency-domain signal; traverse each frame of the re-transformed second frequency-domain signal and compute the similarity between that frame and every other frame; obtain the frequency-domain spectral estimate of the re-transformed second frequency-domain signal according to the similarities; compute, by the following formula (2), the exponentially normalized cross-correlation value between corresponding bins of the re-transformed second frequency-domain signal and its frequency-domain spectral estimate, and, by the following formula (3), compute from this value the weight of the percussive music component signal; and extract, by the following formula (4), the vocal component from the re-transformed second frequency-domain signal according to the weight of the percussive music component signal:
Q(i,j) = ( exp( -(log(PP(i,j)) - log(Y(i,j)))^2 / (2 * namda * namda) ) )^2    (2)
W(i,j) = 0 if Q(i,j) < 0.85, and W(i,j) = 1 if Q(i,j) >= 0.85    (3)
P1 = (1 - W) .* PP    (4)
Here, PP(i, j) denotes the i-th frequency bin of the j-th frame of the re-transformed second frequency-domain signal; Y(i, j) denotes the i-th frequency bin of the j-th frame of the frequency-spectrum estimate of that signal; Q(i, j) denotes the difference in exponential normalized cross-correlation between the i-th frequency bin of the j-th frame of the re-transformed second frequency-domain signal and the i-th frequency bin of the j-th frame of its frequency-spectrum estimate; W(i, j) denotes the weight of the percussive music component at the i-th frequency bin of the j-th frame; and namda is a weighting factor, namda = 3. P1 denotes the vocal component, W denotes the weight of the percussive music component, and PP denotes the re-transformed second frequency-domain signal.
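A minimal sketch of formulas (2) through (4), assuming PP and Y are magnitude spectrograms of equal shape. The small epsilon guarding log(0) is an addition not present in the original formulas; the toy magnitudes in the example are invented for illustration.

```python
import numpy as np

NAMDA = 3.0    # weighting factor namda from formula (2)
THRESH = 0.85  # threshold from formula (3)

def extract_vocal(PP, Y, namda=NAMDA, thresh=THRESH):
    """Apply formulas (2)-(4): Gaussian-shaped similarity Q between the
    magnitude spectrogram PP and its frequency-spectrum estimate Y, binary
    percussive weight W, and vocal component P1 = (1 - W) .* PP."""
    eps = 1e-12  # guard against log(0); not part of the original formulas
    d = np.log(PP + eps) - np.log(Y + eps)
    Q = np.exp(-(d ** 2) / (2 * namda * namda)) ** 2  # formula (2)
    W = (Q >= thresh).astype(float)                   # formula (3)
    P1 = (1 - W) * PP                                 # formula (4)
    return P1, W, Q

# Toy magnitudes: bins where PP tracks Y (repeating accompaniment) get W = 1;
# the bin where PP departs strongly from Y (vocal energy) gets W = 0.
PP = np.array([[1.0, 50.0],
               [2.0, 2.0]])
Y = np.array([[1.0, 1.0],
              [2.0, 2.0]])
P1, W, Q = extract_vocal(PP, Y)
```

Bins matching the repeating estimate are attributed to the accompaniment and zeroed in P1, while the outlier bin survives as the vocal component.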
Optionally, processor 10 may obtain, from the similarities, a predetermined number of similar frames for each frame; compute the frequency-spectrum estimate of each frame from that predetermined number of similar frames; and assemble the computed per-frame estimates into the frequency-spectrum estimate of the re-transformed second frequency-domain signal.
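One possible realization of this per-frame estimate, under the assumption that frame similarity is cosine similarity and the per-frame estimate is an element-wise median over the k most similar frames; both choices are illustrative, since the text only specifies "a predetermined number of similar frames".

```python
import numpy as np

def spectrum_estimate(PP, k=2):
    """For each frame (column) of the magnitude spectrogram PP, find its k
    most similar other frames by cosine similarity and take the element-wise
    median over them as that frame's frequency-spectrum estimate."""
    n_bins, n_frames = PP.shape
    norms = np.linalg.norm(PP, axis=0) + 1e-12
    unit = PP / norms
    sim = unit.T @ unit              # frame-to-frame cosine similarity
    np.fill_diagonal(sim, -np.inf)   # exclude each frame from its own search
    Y = np.empty_like(PP)
    for j in range(n_frames):
        nearest = np.argsort(sim[:, j])[-k:]   # k most similar frames
        Y[:, j] = np.median(PP[:, nearest], axis=1)
    return Y

# Toy spectrogram: frame 2 carries a vocal-like outlier in bin 0; its
# estimate is built from the repeating frames around it.
PP = np.array([[1.0, 1.0, 9.0, 1.0],
               [2.0, 2.0, 2.0, 2.0]])
Y = spectrum_estimate(PP, k=2)
```

Because the estimate of each frame is drawn only from other, similar frames, repeating accompaniment survives the median while one-off vocal energy does not.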
Optionally, processor 10 may separate the percussive music component from the re-transformed second frequency-domain signal according to the weight of the percussive music component, by formula (5) below:
P2 = W .* PP    (5)
Here, P2 denotes the percussive music component signal.
Optionally, processor 10 may separate the harmonic music component from the first frequency-domain signal; convert the separated harmonic music component from the frequency domain to the time domain; convert the percussive music component from the frequency domain to the time domain; and synthesize the converted harmonic music component with the converted percussive music component to obtain the accompaniment.
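A sketch of this final synthesis step, under the simplifying assumption that both components share one analysis resolution (in the document the percussive component actually comes from the finer re-transform, so it would need its own inverse parameters); all signal and parameter names below are illustrative.

```python
import numpy as np
from scipy.signal import stft, istft

FS = 44100

def synthesize_accompaniment(H_harm, P_perc, frame_len, hop, fs=FS):
    """Convert the separated harmonic and percussive components back to the
    time domain and sum them to obtain the accompaniment."""
    _, h = istft(H_harm, fs=fs, nperseg=frame_len, noverlap=frame_len - hop)
    _, p = istft(P_perc, fs=fs, nperseg=frame_len, noverlap=frame_len - hop)
    n = min(len(h), len(p))  # align lengths before mixing
    return h[:n] + p[:n]

# Example: a tone stands in for the harmonic part, noise for the percussive.
t = np.arange(FS) / FS
tone = np.sin(2 * np.pi * 220 * t)
noise = 0.1 * np.random.default_rng(1).standard_normal(FS)
_, _, Zh = stft(tone, fs=FS, nperseg=1024, noverlap=512)
_, _, Zp = stft(noise, fs=FS, nperseg=1024, noverlap=512)
acc = synthesize_accompaniment(Zh, Zp, frame_len=1024, hop=512)
```

With a Hann window and half-frame overlap the STFT/ISTFT round trip reconstructs each component essentially exactly, so the sum equals the mixed accompaniment.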
In the embodiments of the present invention, the accompaniment of a song is divided into a harmonic music component and a percussive music component. The HPSS algorithm is first used to separate the second frequency-domain signal from the song, the second frequency-domain signal comprising the percussive music component and the vocal component; the NNMF algorithm is then used to separate the vocal component from the percussive music component, so that the extracted vocal component is comparatively clean and large accompaniment residue is avoided. Moreover, because the NNMF algorithm extracts the vocal component from the mono signal of the song by considering the frequency distribution across similar frames, it fully exploits the strong periodicity of the accompaniment and the high variability of the vocals, avoiding the damage done to the vocal component when vocals are extracted from the frequency distribution of individual bins alone. The method is widely applicable, the extracted vocal component is of better quality, and it can meet the requirements of high-quality vocal extraction.
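For concreteness, the HPSS step summarized above admits a short median-filtering sketch. The row/column filtering of the magnitude matrix and the mask ((P.*P)./((H.*H)+(P.*P))).*X follow the formula recited in claim 2; the filter kernel length, the epsilon in the denominator, and the toy spectrogram are illustrative additions.

```python
import numpy as np
from scipy.ndimage import median_filter

def hpss_second_signal(X, kernel=9):
    """HPSS as in claim 2: median filter each row of the magnitude
    spectrogram (time direction) to enhance harmonics -> H, each column
    (frequency direction) to enhance percussion -> P, then apply the soft
    mask ((P.*P)./((H.*H)+(P.*P))).*X to keep the percussive-plus-vocal
    part as the second frequency-domain signal."""
    mag = np.abs(X)                           # first matrix
    H = median_filter(mag, size=(1, kernel))  # second matrix (row-wise)
    P = median_filter(mag, size=(kernel, 1))  # third matrix (column-wise)
    eps = 1e-12                               # avoid division by zero
    mask = (P * P) / (H * H + P * P + eps)
    return mask * X

# Toy spectrogram: a horizontal line is a sustained harmonic, a vertical
# line is a broadband percussive hit at one instant.
X = np.zeros((32, 32))
X[10, :] = 1.0  # harmonic: flat across time
X[:, 5] = 1.0   # percussive: flat across frequency
S2 = hpss_second_signal(X, kernel=9)
```

The mask suppresses the time-smooth harmonic line and passes the frequency-smooth percussive line, which is exactly the split the method relies on before vocal extraction.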
It should be noted that when the audio signal processing apparatus provided by the above embodiments extracts vocals, the division into the functional modules described above is merely illustrative; in practice, the above functions may be assigned to different functional modules as required, that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above. In addition, the audio signal processing apparatus of the above embodiments and the embodiments of the audio signal processing method belong to the same conception; for the specific implementation, refer to the method embodiments, which is not repeated here.
The serial numbers of the above embodiments of the present invention are for description only and do not indicate the relative merits of the embodiments.
A person of ordinary skill in the art will appreciate that all or part of the steps of the above embodiments may be implemented by hardware, or by a program instructing the relevant hardware; the program may be stored in a computer-readable storage medium, such as a read-only memory, a magnetic disk, or an optical disc.
The foregoing is only preferred embodiments of the present invention and is not intended to limit the present invention. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present invention shall be included within the protection scope of the present invention.

Claims (17)

1. A method for processing an audio signal, characterized in that the method comprises:
converting a mono signal of a song from the time domain to the frequency domain to obtain a first frequency-domain signal, the first frequency-domain signal comprising a harmonic music component, a percussive music component, and a vocal component;
separating a second frequency-domain signal from the first frequency-domain signal using a harmonic/percussive source separation (HPSS) algorithm, the second frequency-domain signal comprising the percussive music component and the vocal component;
extracting the vocal component from the second frequency-domain signal using a nearest-neighbor median filtering (NNMF) algorithm over the most similar frames.
2. The method according to claim 1, characterized in that separating the second frequency-domain signal from the first frequency-domain signal using the HPSS algorithm comprises:
taking the magnitude of each frequency bin in the first frequency-domain signal to obtain a first matrix;
median filtering each row of the first matrix to obtain a second matrix;
median filtering each column of the first matrix to obtain a third matrix;
separating the second frequency-domain signal from the first frequency-domain signal according to the second matrix and the third matrix by the following formula:
((P.*P)./((H.*H)+(P.*P))).*X
where H denotes the second matrix, P denotes the third matrix, X denotes the first matrix, ./ denotes element-wise division, and .* denotes element-wise multiplication.
3. The method according to claim 1 or 2, characterized in that converting the mono signal of the song from the time domain to the frequency domain to obtain the first frequency-domain signal comprises:
converting the mono signal of the song from the time domain to the frequency domain using a fast Fourier transform (FFT) to obtain the first frequency-domain signal, wherein the sampling rate of the FFT is 44.1 kHz, the frame length is not less than 8192 points, and the frame shift is half the frame length.
4. The method according to any one of claims 1 to 3, characterized in that, before extracting the vocal component from the second frequency-domain signal using the NNMF algorithm over the most similar frames, the method further comprises:
converting the second frequency-domain signal from the frequency domain to the time domain using an inverse fast Fourier transform (IFFT), and then applying an FFT to convert it from the time domain to the frequency domain to obtain a re-transformed second frequency-domain signal, wherein the FFT used to obtain the re-transformed second frequency-domain signal has a sampling rate of 44.1 kHz, a frame length of not more than 4096 points, and a frame shift of one quarter of that frame length;
and extracting the vocal component from the second frequency-domain signal using the NNMF algorithm over the most similar frames comprises:
extracting the vocal component from the re-transformed second frequency-domain signal using the NNMF algorithm.
5. The method according to claim 4, characterized in that extracting the vocal component from the re-transformed second frequency-domain signal using the NNMF algorithm comprises:
taking the magnitude of each frequency bin in the re-transformed second frequency-domain signal;
traversing each frame of the re-transformed second frequency-domain signal and computing the similarity between that frame and every other frame of the re-transformed second frequency-domain signal;
obtaining, from the similarities, a frequency-spectrum estimate of the re-transformed second frequency-domain signal;
computing, by the following formula, the difference in exponential normalized cross-correlation between corresponding frequency bins of the re-transformed second frequency-domain signal and its frequency-spectrum estimate:
Q(i,j) = ( exp( -(log(PP(i,j)) - log(Y(i,j)))^2 / (2 * namda * namda) ) )^2
and computing the weight of the percussive music component from the difference by the following formula:
W(i,j) = 0 if Q(i,j) < 0.85, and W(i,j) = 1 if Q(i,j) >= 0.85
where PP(i, j) denotes the i-th frequency bin of the j-th frame of the re-transformed second frequency-domain signal; Y(i, j) denotes the i-th frequency bin of the j-th frame of the frequency-spectrum estimate of the re-transformed second frequency-domain signal; Q(i, j) denotes the difference in exponential normalized cross-correlation between the i-th frequency bin of the j-th frame of the re-transformed second frequency-domain signal and the i-th frequency bin of the j-th frame of its frequency-spectrum estimate; W(i, j) denotes the weight of the percussive music component at the i-th frequency bin of the j-th frame of the re-transformed second frequency-domain signal; and namda is a weighting factor, namda = 3;
and extracting the vocal component from the re-transformed second frequency-domain signal according to the weight of the percussive music component by the following formula:
P1 = (1 - W) .* PP
where P1 denotes the vocal component, W denotes the weight of the percussive music component, and PP denotes the re-transformed second frequency-domain signal.
6. The method according to claim 5, characterized in that obtaining, from the similarities, the frequency-spectrum estimate of the re-transformed second frequency-domain signal comprises:
obtaining, from the similarities, a predetermined number of similar frames for each frame;
computing the frequency-spectrum estimate of each frame from the predetermined number of similar frames;
and assembling the computed per-frame frequency-spectrum estimates into the frequency-spectrum estimate of the re-transformed second frequency-domain signal.
7. The method according to claim 5, characterized in that, after extracting the vocal component from the re-transformed second frequency-domain signal according to the weight of the percussive music component, the method further comprises:
separating the percussive music component from the re-transformed second frequency-domain signal by the following formula:
P2=W.*PP
where P2 denotes the percussive music component.
8. The method according to claim 7, characterized in that, after separating the percussive music component from the re-transformed second frequency-domain signal, the method further comprises:
separating the harmonic music component from the first frequency-domain signal;
and converting the separated harmonic music component from the frequency domain to the time domain, converting the percussive music component from the frequency domain to the time domain, and synthesizing the converted harmonic music component with the converted percussive music component to obtain the accompaniment.
9. An apparatus for processing an audio signal, characterized in that the apparatus comprises:
a converting unit, configured to convert a mono signal of a song from the time domain to the frequency domain to obtain a first frequency-domain signal, the first frequency-domain signal comprising a harmonic music component, a percussive music component, and a vocal component;
a separating unit, configured to separate a second frequency-domain signal from the first frequency-domain signal obtained by the converting unit using a harmonic/percussive source separation (HPSS) algorithm, the second frequency-domain signal comprising the percussive music component and the vocal component;
an extracting unit, configured to extract the vocal component from the second frequency-domain signal separated by the separating unit using a nearest-neighbor median filtering (NNMF) algorithm over the most similar frames.
10. The apparatus according to claim 9, characterized in that the separating unit is specifically configured to:
take the magnitude of each frequency bin in the first frequency-domain signal obtained by the converting unit to obtain a first matrix;
median filter each row of the first matrix to obtain a second matrix, and median filter each column of the first matrix to obtain a third matrix;
and separate the second frequency-domain signal from the first frequency-domain signal according to the second matrix and the third matrix by the following formula:
((P.*P)./((H.*H)+(P.*P))).*X
where H denotes the second matrix, P denotes the third matrix, X denotes the first matrix, ./ denotes element-wise division, and .* denotes element-wise multiplication.
11. The apparatus according to claim 9 or 10, characterized in that the converting unit is specifically configured to:
convert the mono signal of the song from the time domain to the frequency domain using a fast Fourier transform (FFT) to obtain the first frequency-domain signal, wherein the sampling rate of the FFT is 44.1 kHz, the frame length is not less than 8192 points, and the frame shift is half the frame length.
12. The apparatus according to any one of claims 9 to 11, characterized in that the separating unit is further configured to:
convert the second frequency-domain signal from the frequency domain to the time domain using an inverse fast Fourier transform (IFFT), and then apply an FFT to convert it from the time domain to the frequency domain to obtain a re-transformed second frequency-domain signal, wherein the FFT used to obtain the re-transformed second frequency-domain signal has a sampling rate of 44.1 kHz, a frame length of not more than 4096 points, and a frame shift of one quarter of that frame length;
and the extracting unit is specifically configured to:
extract the vocal component, using the NNMF algorithm, from the re-transformed second frequency-domain signal obtained by the separating unit.
13. The apparatus according to claim 12, characterized in that the extracting unit comprises:
a first obtaining subunit, configured to take the magnitude of each frequency bin in the re-transformed second frequency-domain signal obtained by the separating unit;
a first computing subunit, configured to traverse each frame of the re-transformed second frequency-domain signal and compute the similarity between that frame and every other frame of the re-transformed second frequency-domain signal;
a second obtaining subunit, configured to obtain, from the similarities computed by the first computing subunit, a frequency-spectrum estimate of the re-transformed second frequency-domain signal;
a second computing subunit, configured to compute, by the following formula, the difference in exponential normalized cross-correlation between corresponding frequency bins of the re-transformed second frequency-domain signal and the frequency-spectrum estimate obtained by the second obtaining subunit:
Q(i,j) = ( exp( -(log(PP(i,j)) - log(Y(i,j)))^2 / (2 * namda * namda) ) )^2
and to compute the weight of the percussive music component from the difference by the following formula:
W(i,j) = 0 if Q(i,j) < 0.85, and W(i,j) = 1 if Q(i,j) >= 0.85
where PP(i, j) denotes the i-th frequency bin of the j-th frame of the re-transformed second frequency-domain signal; Y(i, j) denotes the i-th frequency bin of the j-th frame of the frequency-spectrum estimate of the re-transformed second frequency-domain signal; Q(i, j) denotes the difference in exponential normalized cross-correlation between the i-th frequency bin of the j-th frame of the re-transformed second frequency-domain signal and the i-th frequency bin of the j-th frame of its frequency-spectrum estimate; W(i, j) denotes the weight of the percussive music component at the i-th frequency bin of the j-th frame of the re-transformed second frequency-domain signal; and namda is a weighting factor, namda = 3;
an extracting subunit, configured to extract, by the following formula, the vocal component from the re-transformed second frequency-domain signal according to the weight of the percussive music component computed by the second computing subunit:
P1=(1-W).*PP
where P1 denotes the vocal component, W denotes the weight of the percussive music component, and PP denotes the re-transformed second frequency-domain signal.
14. The apparatus according to claim 13, characterized in that the second obtaining subunit is specifically configured to:
obtain, from the similarities, a predetermined number of similar frames for each frame;
compute the frequency-spectrum estimate of each frame from the predetermined number of similar frames;
and assemble the computed per-frame frequency-spectrum estimates into the frequency-spectrum estimate of the re-transformed second frequency-domain signal.
15. The apparatus according to claim 13, characterized in that the extracting unit is further configured to:
separate, by the following formula, the percussive music component from the re-transformed second frequency-domain signal obtained by the separating unit:
P2=W.*PP
where P2 denotes the percussive music component.
16. The apparatus according to claim 15, characterized in that the apparatus further comprises:
a synthesizing unit, configured to separate the harmonic music component from the first frequency-domain signal obtained by the converting unit; convert the separated harmonic music component from the frequency domain to the time domain; convert the percussive music component from the frequency domain to the time domain; and synthesize the converted harmonic music component with the converted percussive music component to obtain the accompaniment.
17. A device for processing an audio signal, characterized in that the device comprises a processor and a memory, the processor being configured to execute the following instructions:
converting a mono signal of a song from the time domain to the frequency domain to obtain a first frequency-domain signal, the first frequency-domain signal comprising a harmonic music component, a percussive music component, and a vocal component;
separating a second frequency-domain signal from the first frequency-domain signal using a harmonic/percussive source separation (HPSS) algorithm, the second frequency-domain signal comprising the percussive music component and the vocal component;
extracting the vocal component from the second frequency-domain signal using a nearest-neighbor median filtering (NNMF) algorithm over the most similar frames.
CN201310587304.8A 2013-11-20 2013-11-20 Method, device and equipment for processing audio signals Withdrawn CN103680517A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310587304.8A CN103680517A (en) 2013-11-20 2013-11-20 Method, device and equipment for processing audio signals

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310587304.8A CN103680517A (en) 2013-11-20 2013-11-20 Method, device and equipment for processing audio signals

Publications (1)

Publication Number Publication Date
CN103680517A true CN103680517A (en) 2014-03-26

Family

ID=50317870

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310587304.8A Withdrawn CN103680517A (en) 2013-11-20 2013-11-20 Method, device and equipment for processing audio signals

Country Status (1)

Country Link
CN (1) CN103680517A (en)

Cited By (28)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104053120B (en) * 2014-06-13 2016-03-02 福建星网视易信息系统有限公司 A kind of processing method of stereo audio and device
CN104053120A (en) * 2014-06-13 2014-09-17 福建星网视易信息系统有限公司 Method and device for processing stereo audio frequency
CN104616663A (en) * 2014-11-25 2015-05-13 重庆邮电大学 Music separation method of MFCC (Mel Frequency Cepstrum Coefficient)-multi-repetition model in combination with HPSS (Harmonic/Percussive Sound Separation)
CN105590633A (en) * 2015-11-16 2016-05-18 福建省百利亨信息科技有限公司 Method and device for generation of labeled melody for song scoring
CN109247030B (en) * 2016-03-18 2023-03-10 弗劳恩霍夫应用研究促进协会 Apparatus and method for harmonic-percussion-residual sound separation using structure tensor on spectrogram
US10770051B2 (en) 2016-03-18 2020-09-08 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Apparatus and method for harmonic-percussive-residual sound separation using a structure tensor on spectrograms
CN109247030A (en) * 2016-03-18 2019-01-18 弗劳恩霍夫应用研究促进协会 Harmonic wave-percussion music-remnant voice separation device and method are carried out using the structure tensor on spectrogram
EP3220386A1 (en) * 2016-03-18 2017-09-20 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Apparatus and method for harmonic-percussive-residual sound separation using a structure tensor on spectrograms
CN106024005A (en) * 2016-07-01 2016-10-12 腾讯科技(深圳)有限公司 Processing method and apparatus for audio data
CN106024005B (en) * 2016-07-01 2018-09-25 腾讯科技(深圳)有限公司 A kind of processing method and processing device of audio data
WO2018001039A1 (en) * 2016-07-01 2018-01-04 腾讯科技(深圳)有限公司 Audio data processing method and apparatus
US10770050B2 (en) 2016-07-01 2020-09-08 Tencent Technology (Shenzhen) Company Limited Audio data processing method and apparatus
CN107146630A (en) * 2017-04-27 2017-09-08 同济大学 A kind of binary channels language separation method based on STFT
CN107146630B (en) * 2017-04-27 2020-02-14 同济大学 STFT-based dual-channel speech sound separation method
CN107705778A (en) * 2017-08-23 2018-02-16 腾讯音乐娱乐(深圳)有限公司 Audio-frequency processing method, device, storage medium and terminal
CN107705778B (en) * 2017-08-23 2020-09-15 腾讯音乐娱乐(深圳)有限公司 Audio processing method, device, storage medium and terminal
CN108335703B (en) * 2018-03-28 2020-10-09 腾讯音乐娱乐科技(深圳)有限公司 Method and apparatus for determining accent position of audio data
CN108335703A (en) * 2018-03-28 2018-07-27 腾讯音乐娱乐科技(深圳)有限公司 The method and apparatus for determining the stress position of audio data
CN108962229A (en) * 2018-07-26 2018-12-07 汕头大学 A kind of target speaker's voice extraction method based on single channel, unsupervised formula
CN108962229B (en) * 2018-07-26 2020-11-13 汕头大学 Single-channel and unsupervised target speaker voice extraction method
CN110097895A (en) * 2019-05-14 2019-08-06 腾讯音乐娱乐科技(深圳)有限公司 A kind of absolute music detection method, device and storage medium
CN110232931A (en) * 2019-06-18 2019-09-13 广州酷狗计算机科技有限公司 The processing method of audio signal, calculates equipment and storage medium at device
CN111145726A (en) * 2019-10-31 2020-05-12 南京励智心理大数据产业研究院有限公司 Deep learning-based sound scene classification method, system, device and storage medium
CN111145726B (en) * 2019-10-31 2022-09-23 南京励智心理大数据产业研究院有限公司 Deep learning-based sound scene classification method, system, device and storage medium
CN111724757A (en) * 2020-06-29 2020-09-29 腾讯音乐娱乐科技(深圳)有限公司 Audio data processing method and related product
CN112053669A (en) * 2020-08-27 2020-12-08 海信视像科技股份有限公司 Method, device, equipment and medium for eliminating human voice
CN112053669B (en) * 2020-08-27 2023-10-27 海信视像科技股份有限公司 Method, device, equipment and medium for eliminating human voice
CN112597331A (en) * 2020-12-25 2021-04-02 腾讯音乐娱乐科技(深圳)有限公司 Method, device, equipment and storage medium for displaying range matching information

Similar Documents

Publication Publication Date Title
CN103680517A (en) Method, device and equipment for processing audio signals
Cano et al. Musical source separation: An introduction
EP3522151B1 (en) Method and device for processing dual-source audio data
CN102402977B (en) Accompaniment, the method for voice and device thereof is extracted from stereo music
CN104134444B (en) A kind of song based on MMSE removes method and apparatus of accompanying
CN104538011A (en) Tone adjusting method and device and terminal device
CN104143324B (en) A kind of musical tone recognition method
CN103426436A (en) Source separation by independent component analysis in conjuction with optimization of acoustic echo cancellation
Fitzgerald Upmixing from mono-a source separation approach
CN110120212B (en) Piano auxiliary composition system and method based on user demonstration audio frequency style
CN107274876A (en) A kind of audition paints spectrometer
CN103943113A (en) Method and device for removing accompaniment from song
US20160027421A1 (en) Audio signal analysis
CN106375780A (en) Method and apparatus for generating multimedia file
CN113921022A (en) Audio signal separation method, device, storage medium and electronic equipment
CN107146630B (en) STFT-based dual-channel speech sound separation method
Vinitha George et al. A novel U-Net with dense block for drum signal separation from polyphonic music signal mixture
Oh et al. Spectrogram-channels u-net: a source separation model viewing each channel as the spectrogram of each source
Dobashi et al. A music performance assistance system based on vocal, harmonic, and percussive source separation and content visualization for music audio signals
Chen et al. Multi-scale temporal-frequency attention for music source separation
CN105280178A (en) audio signal processing device and audio signal processing method thereof
Kim Vocal Separation in Music Using SVM and Selective Frequency Subtraction
CN109243472A (en) A kind of audio-frequency processing method and audio processing system
CN104424971A (en) Audio file playing method and audio file playing device
Lagrange et al. Semi-automatic mono to stereo up-mixing using sound source formation

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C04 Withdrawal of patent application after publication (patent law 2001)
WW01 Invention patent application withdrawn after publication

Application publication date: 20140326