CN108520756A - Method and device for speaker speech separation - Google Patents

Method and device for speaker speech separation

Info

Publication number
CN108520756A
CN108520756A (application CN201810231676.XA)
Authority
CN
China
Prior art keywords
audio signal
audio
obtains
speaker
result
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810231676.XA
Other languages
Chinese (zh)
Other versions
CN108520756B (en)
Inventor
孙学京
刘恩
张晨
张兴涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Tuoling Inc
Original Assignee
Beijing Tuoling Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Tuoling Inc filed Critical Beijing Tuoling Inc
Priority to CN201810231676.XA priority Critical patent/CN108520756B/en
Publication of CN108520756A publication Critical patent/CN108520756A/en
Application granted granted Critical
Publication of CN108520756B publication Critical patent/CN108520756B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272 Voice signal separating
    • G10L21/028 Voice signal separating using properties of sound source
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L2021/02161 Number of inputs available containing the signal or the noise to be suppressed
    • G10L2021/02166 Microphone arrays; Beamforming

Abstract

The invention discloses a method and device for speaker speech separation. The method includes: obtaining an audio signal in a preset format; preprocessing the audio signal to obtain a processed first audio signal; performing audio separation on the first audio signal to obtain second audio signals of speakers in different directions; performing enhancement on the second audio signals to obtain enhanced third audio signals of the speakers in different directions; and outputting the third audio signals. The technical solution of the invention achieves fast and accurate separation of the audio signals of multiple speakers located in different directions.

Description

Method and device for speaker speech separation
Technical field
The present invention relates to the technical field of speech recognition, and in particular to a method and device for speaker speech separation.
Background technology
With the development of science and technology, every field demands ever higher audio quality. The ways of acquiring audio documents of all kinds are increasingly rich, and the volume of data is growing explosively, making the management of audio documents more and more difficult. In recent years, people have begun to study audio retrieval techniques to manage multimedia speech documents such as call speech, broadcast speech, and conference speech. Among these, conference speech is the most difficult to retrieve, because a conference speech document contains multiple channels and multiple speakers.
Existing audio separation methods are broadly divided into single-channel (single-microphone) techniques and multi-channel (multi-microphone) techniques. Single-microphone techniques mainly include model-based audio separation methods and distance-metric-based separation methods; multi-microphone techniques mainly include beamforming separation methods and blind source separation methods.
The model-based audio separation method includes two steps, training and recognition. In the training process, features are extracted from the input audio, a model is trained on them, and the trained model is stored. In the recognition process, features are extracted from the input audio, speaker separation and speaker clustering are performed, and the results are then matched against the stored models to decide which speaker is which, finally yielding the separated audio signals. The distance-metric-based separation method computes, at every point, the distance between the two signal segments of a certain window length to its left and right, compares that distance with a preset threshold to find the change points of the audio signal, and thereby obtains the separated audio signals. The beamforming separation method performs real-time sound source localization on the input audio and then applies enhancement according to each speaker's direction, obtaining the audio signal of each speaker. The blind source separation method applies blind source separation processing to the input audio to obtain the audio signal of each speaker.
However, the model-based separation method requires each speaker in the conversation to speak continuously for a fairly long time, and its algorithmic complexity is too high; the distance-metric-based separation method suffers from problems such as an excess of redundant candidate change points. Beamforming and blind source separation methods are designed mainly for linear or planar microphone arrays, and their performance in complex environments has certain shortcomings.
Therefore, separating the audio signals of multiple speakers in different directions quickly and accurately in complex environments is a technical problem that urgently needs to be solved.
Summary of the invention
The purpose of the present invention is to provide a method and device for speaker speech separation that quickly and accurately separates the audio signals of multiple speakers in different directions.
To achieve the above object, the present invention provides a method of speaker speech separation, including:
obtaining an audio signal in a preset format;
preprocessing the audio signal to obtain a processed first audio signal;
performing audio separation on the first audio signal to obtain second audio signals of speakers in different directions;
performing enhancement on the second audio signals to obtain enhanced third audio signals of the speakers in different directions;
outputting the third audio signals.
Further, in the method described above, preprocessing the audio signal to obtain the processed first audio signal includes:
obtaining the placement parameters of the microphone array and the environmental parameters;
transforming the audio signal according to the placement parameters of the microphone array to obtain transformed audio signals lying in the same plane;
applying a time-frequency transform to the transformed audio signals to obtain the corresponding frequency-domain signals;
applying audio enhancement to the frequency-domain signals according to the environmental parameters to obtain enhanced frequency-domain signals;
applying an inverse time-frequency transform to the enhanced frequency-domain signals to obtain time-domain signals, which serve as the first audio signal.
Further, in the method described above, performing audio separation on the first audio signal to obtain the second audio signals of speakers in different directions includes:
obtaining, from the first audio signal, the corresponding sound source localization result and speaker recognition result;
performing audio separation on the first audio signal according to the sound source localization result and the speaker recognition result to obtain the second audio signals.
Further, in the method described above, obtaining the sound source localization result and speaker recognition result corresponding to the first audio signal includes:
performing speech detection on the first audio signal to obtain a detection result;
performing sound source localization on the first audio signal according to the detection result to obtain the sound source localization result;
performing speaker recognition on the first audio signal according to a preset recognition model to obtain the speaker recognition result.
Further, in the method described above, performing audio separation on the first audio signal according to the sound source localization result and the speaker recognition result to obtain the second audio signals includes:
performing audio separation on the first audio signal using a beamforming method, according to the sound source localization result and the speaker recognition result, to obtain the second audio signals.
Further, in the method described above, performing audio separation on the first audio signal according to the sound source localization result and the speaker recognition result to obtain the second audio signals includes:
choosing an audio separation method corresponding to the sound source localization result;
performing audio separation on the first audio signal with the chosen audio separation method, according to the speaker recognition result, to obtain the second audio signals.
Further, in the method described above, enhancing the second audio signals to obtain the enhanced third audio signals includes:
smoothing the second audio signals and correcting the audio transition points based on the speaker recognition result, to obtain the third audio signals.
The present invention also provides a device for speaker speech separation, including:
an acquisition module for obtaining an audio signal in a preset format;
a preprocessing module for preprocessing the audio signal to obtain a processed first audio signal;
an audio separation module for performing audio separation on the first audio signal to obtain second audio signals of speakers in different directions;
an enhancement processing module for enhancing the second audio signals to obtain enhanced third audio signals;
an output module for outputting the third audio signals.
The method and device for speaker speech separation of the present invention preprocess the audio signal in a preset format to obtain a processed first audio signal, perform audio separation on the first audio signal to obtain second audio signals of speakers in different directions, enhance the second audio signals to obtain enhanced third audio signals of the speakers in different directions, and output the third audio signals, thereby achieving fast and accurate separation of the audio signals of multiple speakers in different directions.
Description of the drawings
Fig. 1 is a flowchart of an embodiment of the speaker speech separation method of the present invention;
Fig. 2 is a schematic diagram of the microphone-array placement used to acquire the four-channel audio signals of the present invention;
Fig. 3 is a structural schematic diagram of an embodiment of the speaker speech separation device of the present invention.
Detailed description of the embodiments
To make the objects, technical solutions, and advantages of the present invention clearer, the technical solutions of the embodiments are described clearly and completely below in conjunction with specific embodiments of the invention and the corresponding drawings. Obviously, the described embodiments are only a part of the embodiments, not all of them. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.
Terms such as "first" and "second" (if present) in the specification, claims, and drawings are used to distinguish similar parts, not to describe a specific order or precedence. It should be understood that data so labelled are interchangeable where appropriate, so that the embodiments described herein can be implemented in orders other than those illustrated.
The following examples are used to illustrate the present invention but are not intended to limit its scope.
Embodiment 1
Fig. 1 is a flowchart of an embodiment of the speaker speech separation method of the present invention. As shown in Fig. 1, the speaker speech separation method of this embodiment may specifically include the following steps:
100. Obtain an audio signal in a preset format.
The audio signal in a preset format in this embodiment may be an Ambisonic A-format audio signal. An Ambisonic A-format audio signal consists of four audio channels: Left-Front-Up (LFU), Right-Front-Down (RFD), Left-Back-Down (LBD), and Right-Back-Up (RBU). Fig. 2 is a schematic diagram of the microphone-array placement used to acquire the four channels.
101. Preprocess the acquired audio signal to obtain a processed first audio signal.
In one specific implementation, when the audio signal in the preset format is obtained, the placement parameters of the microphone array and the environmental parameters can also be obtained. The acquired audio signal is then transformed according to the placement parameters so that the transformed audio signals lie in the same plane; a time-frequency transform is applied to the transformed signals to obtain the corresponding frequency-domain signals; audio enhancement is applied to the frequency-domain signals according to the environmental parameters to obtain enhanced frequency-domain signals; and an inverse time-frequency transform is applied to the enhanced frequency-domain signals to obtain time-domain signals, which serve as the first audio signal.
For example, after the placement of the microphone array is obtained, the audio signals can be rotated according to formula (1) based on that placement, so that the resulting signals lie in the same plane.
Here A is the transform matrix, θh is the heading (yaw) angle, θp is the pitch angle, θb is the roll angle, and f(θh, θp, θb) is a function of θh, θp, and θb.
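The patent's rotation formula (1) appears only as an image in the source, so its exact form is not available here. As a hedged illustration of the idea, the sketch below builds a 3-D rotation matrix from heading, pitch, and roll angles and applies it to first-order Ambisonic B-format channels (W, X, Y, Z); the function names, the Z-Y-X rotation order, and the use of B-format rather than A-format are this sketch's own assumptions, not the patent's transform.

```python
import numpy as np

def rotation_matrix(yaw, pitch, roll):
    """3-D rotation matrix built from heading (yaw), pitch, and roll,
    applied in Z-Y-X order. A hypothetical stand-in for the patent's
    transform matrix A in formula (1)."""
    cy, sy = np.cos(yaw), np.sin(yaw)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cr, sr = np.cos(roll), np.sin(roll)
    Rz = np.array([[cy, -sy, 0], [sy, cy, 0], [0, 0, 1]])
    Ry = np.array([[cp, 0, sp], [0, 1, 0], [-sp, 0, cp]])
    Rx = np.array([[1, 0, 0], [0, cr, -sr], [0, sr, cr]])
    return Rz @ Ry @ Rx

def rotate_bformat(wxyz, yaw, pitch, roll):
    """Rotate first-order Ambisonic B-format signals (rows W, X, Y, Z):
    W is rotation-invariant; X, Y, Z rotate as a 3-D vector."""
    out = wxyz.copy()
    out[1:4] = rotation_matrix(yaw, pitch, roll) @ wxyz[1:4]
    return out
```

A zero rotation leaves the signals unchanged, and any rotation preserves the energy in the X/Y/Z channels, which is a quick sanity check on the matrix.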
After the transformed signals are obtained, methods such as the discrete Fourier transform (DFT) or the fast Fourier transform (FFT) can be used to apply the time-frequency transform channel by channel. Taking the DFT as an example, the time-frequency transform can be applied to the transformed signals according to formula (2):
Here n is the time-domain index, k is the frequency-domain index, L is the audio processing frame length, Lf is the length of the time-frequency transform, j is the imaginary unit, M is the number of channels, x(n) is an audio time-domain sample, and X(k) is an audio frequency coefficient.
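Formula (2) itself is an image in the source, but a framed per-channel DFT is standard. A minimal numpy sketch of the channel-by-channel time-frequency transform follows; the frame length, hop, and Hann window are illustrative choices, not values from the patent.

```python
import numpy as np

def stft_per_channel(x, frame_len=512, hop=256):
    """Frame a multichannel time-domain signal (channels x samples) and
    apply a windowed DFT per channel per frame, in the spirit of
    formula (2)."""
    win = np.hanning(frame_len)
    n_frames = 1 + (x.shape[1] - frame_len) // hop
    spec = np.empty((x.shape[0], n_frames, frame_len // 2 + 1), dtype=complex)
    for m in range(x.shape[0]):          # channel index (M channels)
        for t in range(n_frames):        # frame index
            frame = x[m, t * hop:t * hop + frame_len] * win
            spec[m, t] = np.fft.rfft(frame)   # X(k), k = frequency index
    return spec
```

A 1000 Hz tone sampled at 8 kHz lands exactly on bin 1000/8000 x 512 = 64, which makes the transform easy to verify.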
After the frequency-domain signals are obtained, the noise energy spectrum can be estimated from the four audio channels, and the reverberation energy spectrum can be estimated from the reverberation time (RT60) parameter and the direct-to-reverberant energy ratio (DRR) parameter. Based on the estimated noise and reverberation energy spectra, audio enhancement, such as denoising and dereverberation, is then applied channel by channel to the frequency-domain signals, yielding the enhanced frequency-domain signals.
In this embodiment, the received multi-channel audio signal can thus be preprocessed according to the placement parameters of the microphone array and the environmental parameters, reducing the influence of the environment on the subsequent audio separation.
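The RT60/DRR-based noise and reverberation estimators are not spelled out in this text. As a hedged stand-in for the denoising step, here is a basic spectral-subtraction gain applied to one channel's frequency-domain frames; the flooring constant is an arbitrary choice, and the patent's actual estimators may differ.

```python
import numpy as np

def spectral_subtract(spec, noise_psd, floor=0.05):
    """Simple power-domain noise reduction: attenuate each bin by the
    ratio of the estimated noise power to the observed power, keeping
    the original phase and flooring the gain to avoid musical noise."""
    mag2 = np.abs(spec) ** 2
    gain = np.maximum(1.0 - noise_psd / np.maximum(mag2, 1e-12), floor)
    return spec * gain
```

With a zero noise estimate the signal passes through untouched; with an overwhelming noise estimate the gain bottoms out at the floor, so the output never flips sign or vanishes entirely.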
102. Perform audio separation on the first audio signal to obtain second audio signals of speakers in different directions.
In this embodiment, after the first audio signal is obtained, the corresponding sound source localization result and speaker recognition result can be derived from it, and audio separation can then be performed on the first audio signal according to these results to obtain the second audio signals of the speakers in different directions.
In one specific implementation, speech detection can be applied to the first audio signal to obtain a detection result; sound source localization is then performed on the first audio signal according to the detection result to obtain the localization result; and speaker recognition is performed on the first audio signal according to a preset recognition model to obtain the speaker recognition result.
For example, methods such as the multiple signal classification (MUSIC) algorithm or generalized cross-correlation (GCC) can be used for sound source localization. Taking GCC as an example, it can be implemented as follows:
a) Compute the cross-correlation of each pair of audio channels according to formula (3), where K1 is the starting frequency bin and K2 is the ending frequency bin.
b) Smooth the result according to formula (4), based on the speech detection result:
Gsm(i, j) = Gsm(i, j) * fsm + (1 - fsm) * G(i, j)   (4)
where fsm is the smoothing factor, whose value depends on the speech detection result vad.
c) Process the smoothed cross-correlation function further to obtain the sound source localization result.
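Steps a) through c) can be sketched concretely. Formula (3) is an image in the source, so the widely used GCC-PHAT weighting is substituted here as an assumption; the smoothing factors 0.9 (speech) and 1.0 (silence) in the formula (4) stand-in are likewise illustrative.

```python
import numpy as np

def gcc_phat(xi, xj, k1=1, k2=None):
    """Generalized cross-correlation between two channels with phase
    transform, restricted to frequency bins [k1, k2) as in formula (3).
    The index of the peak of the returned sequence gives the TDOA."""
    n = len(xi)
    Xi, Xj = np.fft.rfft(xi), np.fft.rfft(xj)
    cross = Xi * np.conj(Xj)
    if k2 is None:
        k2 = len(cross)
    mask = np.zeros_like(cross)
    mask[k1:k2] = cross[k1:k2] / np.maximum(np.abs(cross[k1:k2]), 1e-12)
    return np.fft.irfft(mask, n)

def smooth_gcc(prev, cur, vad, f_speech=0.9, f_silence=1.0):
    """Recursive smoothing of the correlation in the spirit of formula
    (4); the smoothing factor depends on the speech-detection result."""
    f = f_speech if vad else f_silence
    return prev * f + (1.0 - f) * cur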
In this embodiment, speaker recognition can be performed with a model-based approach, such as a Gaussian mixture model (GMM), a hidden Markov model (HMM), or a deep neural network (DNN), to obtain the speaker recognition result.
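To make the model-based recognition concrete, here is a minimal diagonal-covariance GMM log-likelihood scorer and a best-model picker. Training (e.g. EM), feature extraction (e.g. MFCCs), and the model parameters themselves are all outside this sketch and assumed given; the function names are this sketch's own.

```python
import numpy as np

def diag_gmm_loglik(x, weights, means, variances):
    """Total log-likelihood of feature frames x (frames x dims) under a
    diagonal-covariance Gaussian mixture, via log-sum-exp over
    components for numerical stability."""
    x = np.atleast_2d(x)
    comp = []
    for w, mu, var in zip(weights, means, variances):
        log_norm = -0.5 * np.sum(np.log(2 * np.pi * var))
        log_exp = -0.5 * np.sum((x - mu) ** 2 / var, axis=1)
        comp.append(np.log(w) + log_norm + log_exp)
    comp = np.stack(comp)                      # components x frames
    m = comp.max(axis=0)
    return float(np.sum(m + np.log(np.sum(np.exp(comp - m), axis=0))))

def identify_speaker(x, models):
    """Pick the speaker whose GMM gives the highest log-likelihood."""
    scores = {name: diag_gmm_loglik(x, *p) for name, p in models.items()}
    return max(scores, key=scores.get)
```

In a real system each enrolled speaker would contribute one trained model to the `models` dictionary, and each separated segment would be scored against all of them.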
After the sound source localization result and the speaker recognition result are obtained, a beamforming approach can be used to perform audio separation on the first audio signal, obtaining the second audio signals of the speakers in different directions.
Alternatively, an audio separation method corresponding to the sound source localization result can be chosen, and audio separation can be performed on the first audio signal with that method according to the speaker recognition result, obtaining the second audio signals of the speakers in different directions.
For example, audio separation can be performed using formula (5) to obtain the second audio signals of the speakers in different directions.
Here Vdoa is the weighting factor in the sound source direction, τ is the time delay, S is the number of sound sources, and Vspe is the weighting factor for a single source.
When S > 1, a beamforming method can be used to obtain the audio signal in each source direction. When S ≤ 1, Vdoa = Vspe; for example, setting it to (1, 0, 0, 0) means the first audio channel is used as the separated audio signal.
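Formula (5)'s direction-dependent weighting is shown only as an image in the source. For the S > 1 case, a basic frequency-domain delay-and-sum beamformer is one concrete instance of obtaining the audio signal in a source direction; the per-channel delays are assumed to come from the localization step, and the specific weighting in the patent may differ.

```python
import numpy as np

def delay_and_sum(spec, delays, sr, n_fft):
    """Frequency-domain delay-and-sum beamforming toward one source.
    `spec` holds one frame's spectra (channels x bins); `delays` are
    per-channel arrival delays in seconds. Each channel is phase-aligned
    and the channels are averaged, reinforcing the target direction."""
    k = np.arange(spec.shape[1])
    freqs = k * sr / n_fft
    steer = np.exp(2j * np.pi * freqs[None, :] * np.asarray(delays)[:, None])
    return np.mean(spec * steer, axis=0)
```

If every channel really is a delayed copy of the same source, aligning and averaging recovers the source spectrum exactly, which is the identity the test below checks.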
103. Enhance the second audio signals of the speakers in different directions to obtain enhanced third audio signals of the speakers in different directions.
For example, based on the speaker recognition result, the second audio signals of the speakers in different directions can be smoothed, and the audio transition points can be corrected, to obtain the third audio signals of the speakers in different directions, ensuring the continuity of the audio.
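One simple way to realize the smoothing and transition-point correction described above is a linear crossfade at each detected switch point; the fade length and the linear shape are this sketch's assumptions, not details taken from the patent.

```python
import numpy as np

def crossfade(prev_tail, next_head):
    """Smooth an audio switch point by linearly cross-fading the tail
    of the previous segment into the head of the next one, avoiding an
    audible click at the transition."""
    n = min(len(prev_tail), len(next_head))
    fade = np.linspace(0.0, 1.0, n)
    return prev_tail[:n] * (1.0 - fade) + next_head[:n] * fade
```

The output starts exactly on the old segment and ends exactly on the new one, changing monotonically in between.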
104. Output the third audio signals.
The execution subject of the speaker speech separation method of this embodiment may be a speaker speech separation device, which may be implemented in software; for example, the device may be an application. The present invention places no particular limitation on this.
The speaker speech separation method of this embodiment obtains an audio signal in a preset format, preprocesses the audio signal to obtain a processed first audio signal, performs audio separation on the first audio signal to obtain second audio signals of speakers in different directions, enhances the second audio signals to obtain enhanced third audio signals of the speakers in different directions, and outputs the third audio signals, thereby achieving fast and accurate separation of the audio signals of multiple speakers in different directions.
Embodiment 2
Fig. 3 is a structural schematic diagram of an embodiment of the speaker speech separation device of the present invention. As shown in Fig. 3, the speaker speech separation device of this embodiment may include an acquisition module 10, a preprocessing module 11, an audio separation module 12, an enhancement processing module 13, and an output module 14.
The acquisition module 10 is used to obtain an audio signal in a preset format.
The audio signal in a preset format in this embodiment may be an Ambisonic A-format audio signal, consisting of four audio channels: Left-Front-Up (LFU), Right-Front-Down (RFD), Left-Back-Down (LBD), and Right-Back-Up (RBU). Fig. 2 is a schematic diagram of the microphone-array placement used to acquire the four channels.
The preprocessing module 11 is used to preprocess the received audio signal to obtain a processed first audio signal. Specifically, the preprocessing module 11 can obtain the placement parameters of the microphone array and the environmental parameters; transform the multi-channel audio signal according to the placement parameters to obtain transformed audio signals lying in the same plane; apply a time-frequency transform to the transformed signals to obtain the corresponding frequency-domain signals; apply audio enhancement to the frequency-domain signals according to the environmental parameters to obtain enhanced frequency-domain signals; and apply an inverse time-frequency transform to the enhanced signals to obtain time-domain signals, which serve as the first audio signal.
The audio separation module 12 is used to perform audio separation on the first audio signal to obtain second audio signals of speakers in different directions. Specifically, the audio separation module 12 can obtain the sound source localization result and speaker recognition result corresponding to the first audio signal, for example by performing speech detection on the first audio signal to obtain a detection result; performing sound source localization on the first audio signal according to the detection result to obtain the localization result; and performing speaker recognition on the first audio signal according to a preset recognition model to obtain the speaker recognition result.
The audio separation module 12 can then perform audio separation on the first audio signal according to the sound source localization result and the speaker recognition result to obtain the second audio signals of the speakers in different directions, for example using a beamforming technique, or by choosing an audio separation method corresponding to the localization result and applying it to the first audio signal according to the speaker recognition result.
The enhancement processing module 13 is used to enhance the second audio signals of the speakers in different directions to obtain enhanced third audio signals of the speakers in different directions. Specifically, the enhancement processing module 13 can smooth the second audio signals and correct the audio transition points based on the speaker recognition result to obtain the third audio signals of the speakers in different directions.
The output module 14 is used to output the third audio signals of the speakers in different directions.
The mechanism by which the device of this embodiment separates audio signals with the above modules is the same as that of the embodiment shown in Fig. 1; for details, refer to the description of that embodiment, which is not repeated here.
The speaker speech separation device of this embodiment obtains an audio signal in a preset format, preprocesses the audio signal to obtain a processed first audio signal, performs audio separation on the first audio signal to obtain second audio signals of speakers in different directions, enhances the second audio signals to obtain enhanced third audio signals of the speakers in different directions, and outputs the third audio signals, thereby achieving fast and accurate separation of the audio signals of multiple speakers in different directions.
Although the present invention has been described in detail above through general explanations and specific embodiments, modifications and improvements based on the present invention will be apparent to those skilled in the art. Therefore, such modifications and improvements made without departing from the spirit of the present invention fall within the scope of protection claimed by the present invention.

Claims (8)

1. A method of speaker speech separation, characterized by including:
obtaining an audio signal in a preset format;
preprocessing the audio signal to obtain a processed first audio signal;
performing audio separation on the first audio signal to obtain second audio signals of speakers in different directions;
performing enhancement on the second audio signals to obtain enhanced third audio signals of the speakers in different directions;
outputting the third audio signals.
2. The method according to claim 1, characterized in that preprocessing the audio signal to obtain the processed first audio signal includes:
obtaining the placement parameters of the microphone array and the environmental parameters;
transforming the audio signal according to the placement parameters of the microphone array to obtain transformed audio signals lying in the same plane;
applying a time-frequency transform to the transformed audio signals to obtain the corresponding frequency-domain signals;
applying audio enhancement to the frequency-domain signals according to the environmental parameters to obtain enhanced frequency-domain signals;
applying an inverse time-frequency transform to the enhanced frequency-domain signals to obtain time-domain signals, which serve as the first audio signal.
3. The method according to claim 1 or 2, characterized in that performing audio separation on the first audio signal to obtain the second audio signals of speakers in different directions includes:
obtaining, from the first audio signal, the corresponding sound source localization result and speaker recognition result;
performing audio separation on the first audio signal according to the sound source localization result and the speaker recognition result to obtain the second audio signals.
4. The method according to claim 3, characterized in that obtaining the sound source localization result and speaker recognition result corresponding to the first audio signal includes:
performing speech detection on the first audio signal to obtain a detection result;
performing sound source localization on the first audio signal according to the detection result to obtain the sound source localization result;
performing speaker recognition on the first audio signal according to a preset recognition model to obtain the speaker recognition result.
5. The method according to claim 3, characterized in that performing audio separation on the first audio signal according to the sound source localization result and the speaker recognition result to obtain the second audio signals includes:
performing audio separation on the first audio signal using a beamforming method, according to the sound source localization result and the speaker recognition result, to obtain the second audio signals.
6. The method according to claim 3, characterized in that performing audio separation on the first audio signal according to the sound source localization result and the speaker recognition result to obtain the second audio signals includes:
choosing an audio separation method corresponding to the sound source localization result;
performing audio separation on the first audio signal with the chosen audio separation method, according to the speaker recognition result, to obtain the second audio signals.
7. The method according to claim 3, wherein performing enhancement processing on the second audio signal to obtain the enhanced third audio signal comprises:
performing smoothing processing and audio transition point correction processing on the second audio signal based on the speaker recognition result, to obtain the third audio signal.
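The smoothing and transition-point correction of claim 7 can be approximated by a crossfade at each detected speaker-change point, so the separated stream does not click at the seam. The overlap length and linear fade shape are illustrative assumptions:

```python
import numpy as np


def smooth_transition(prev_tail, next_head):
    """Crossfade two separated segments at a speaker change point: the outgoing
    segment fades out while the incoming one fades in over the overlap."""
    n = min(len(prev_tail), len(next_head))
    fade = np.linspace(0.0, 1.0, n)
    return prev_tail[:n] * (1.0 - fade) + next_head[:n] * fade
```

In a full system, the speaker recognition result decides where the change points are; this helper only repairs the waveform around each one.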
8. A speaker speech separation apparatus, comprising:
an acquisition module, configured to acquire an audio signal in a preset format;
a preprocessing module, configured to preprocess the audio signal to obtain a processed first audio signal;
an audio separation module, configured to perform audio separation processing on the first audio signal to obtain second audio signals of speakers in different directions;
an enhancement processing module, configured to perform enhancement processing on the second audio signal to obtain an enhanced third audio signal;
an output module, configured to output the third audio signal.
CN201810231676.XA 2018-03-20 2018-03-20 Method and device for separating speaker voice Active CN108520756B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810231676.XA CN108520756B (en) 2018-03-20 2018-03-20 Method and device for separating speaker voice

Publications (2)

Publication Number Publication Date
CN108520756A true CN108520756A (en) 2018-09-11
CN108520756B CN108520756B (en) 2020-09-01

Family

ID=63433795

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810231676.XA Active CN108520756B (en) 2018-03-20 2018-03-20 Method and device for separating speaker voice

Country Status (1)

Country Link
CN (1) CN108520756B (en)

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1818909A1 (en) * 2004-12-03 2007-08-15 HONDA MOTOR CO., Ltd. Voice recognition system
CN101720558A (en) * 2007-04-19 2010-06-02 埃波斯开发有限公司 Voice and position localization
CN102831898A (en) * 2012-08-31 2012-12-19 厦门大学 Microphone array voice enhancement device with sound source direction tracking function and method thereof
CN103456312A (en) * 2013-08-29 2013-12-18 太原理工大学 Single channel voice blind separation method based on computational auditory scene analysis
CN103811020A (en) * 2014-03-05 2014-05-21 东北大学 Smart voice processing method
CN104049235A (en) * 2014-06-23 2014-09-17 河北工业大学 Microphone array in sound source orienting device
CN104936091A (en) * 2015-05-14 2015-09-23 科大讯飞股份有限公司 Intelligent interaction method and system based on circle microphone array
CN105120421A (en) * 2015-08-21 2015-12-02 北京时代拓灵科技有限公司 Method and apparatus of generating virtual surround sound
CN105355203A (en) * 2015-11-03 2016-02-24 重庆码头联智科技有限公司 Method for speech judgment using a gravity-sensor smart wearable device
CN105872940A (en) * 2016-06-08 2016-08-17 北京时代拓灵科技有限公司 Virtual reality sound field generating method and system
CN106098075A (en) * 2016-08-08 2016-11-09 腾讯科技(深圳)有限公司 Audio collection method and apparatus based on microphone array
CN106816156A (en) * 2017-02-04 2017-06-09 北京时代拓灵科技有限公司 Audio quality enhancement method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ZHANG Xiongwei, LI Yinan, SHI Wenhua, HU Yonggang, CHEN Xushan: "Nonnegative combination model and its application in sound source separation", Journal of Data Acquisition and Processing *
CHEN Jie: "Design and implementation of an automatic background music separation system", Modern Electronics Technique *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110021302A (en) * 2019-03-06 2019-07-16 厦门快商通信息咨询有限公司 Intelligent office conference system and meeting minutes method
CN110459239A (en) * 2019-03-19 2019-11-15 深圳壹秘科技有限公司 Role analysis method, apparatus and computer readable storage medium based on voice data
CN111899758A (en) * 2020-09-07 2020-11-06 腾讯科技(深圳)有限公司 Voice processing method, device, equipment and storage medium
CN111899758B (en) * 2020-09-07 2024-01-30 腾讯科技(深圳)有限公司 Voice processing method, device, equipment and storage medium
CN112382306A (en) * 2020-12-02 2021-02-19 苏州思必驰信息科技有限公司 Method and device for separating speaker audio
CN112382306B (en) * 2020-12-02 2022-05-10 思必驰科技股份有限公司 Method and device for separating speaker audio
CN112634935A (en) * 2021-03-10 2021-04-09 北京世纪好未来教育科技有限公司 Voice separation method and device, electronic equipment and readable storage medium
CN112634935B (en) * 2021-03-10 2021-06-11 北京世纪好未来教育科技有限公司 Voice separation method and device, electronic equipment and readable storage medium

Also Published As

Publication number Publication date
CN108520756B (en) 2020-09-01

Similar Documents

Publication Publication Date Title
CN108520756A Method and device for speaker speech separation
Chen et al. Continuous speech separation: Dataset and analysis
Yoshioka et al. Multi-microphone neural speech separation for far-field multi-talker speech recognition
CN110120227B (en) Voice separation method of deep stack residual error network
CN110970053B (en) Multichannel speaker-independent voice separation method based on deep clustering
Kingsbury et al. Recognizing reverberant speech with RASTA-PLP
CN106782565A Voiceprint feature recognition method and system
CN102565759B (en) Binaural sound source localization method based on sub-band signal to noise ratio estimation
CN107346664A Binaural speech separation method based on critical bands
Huang et al. Audio replay spoof attack detection using segment-based hybrid feature and densenet-LSTM network
Cai et al. Multi-Channel Training for End-to-End Speaker Recognition Under Reverberant and Noisy Environment.
CN110858476A (en) Sound collection method and device based on microphone array
Sainath et al. Reducing the Computational Complexity of Multimicrophone Acoustic Models with Integrated Feature Extraction.
Venkatesan et al. Binaural classification-based speech segregation and robust speaker recognition system
CN107895582A Speaker-adaptive speech emotion recognition method for the multi-source information field
Taherian et al. Multi-channel conversational speaker separation via neural diarization
Kamble et al. Teager energy subband filtered features for near and far-field automatic speech recognition
Huang et al. Audio-replay Attacks Spoofing Detection for Automatic Speaker Verification System
Martín-Doñas et al. Multi-channel block-online source extraction based on utterance adaptation
CN113345421B (en) Multi-channel far-field target voice recognition method based on angle spectrum characteristics
Gaffar et al. A multi-frame blocking for signal segmentation in voice command recognition
CN114189781A (en) Noise reduction method and system for double-microphone neural network noise reduction earphone
Yang et al. A target speaker separation neural network with joint-training
Melhem et al. Improving Deep Attractor Network by BGRU and GMM for Speech Separation
Yoshioka et al. Picknet: Real-time channel selection for ad hoc microphone arrays

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant