CN110491411A - Method for separating speakers by combining microphone sound-source angle and voice-feature similarity - Google Patents
- Publication number
- CN110491411A (application CN201910908195.2A)
- Authority
- CN
- China
- Prior art keywords
- speaker
- thres
- threshold
- variable rate
- angle variable
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0272—Voice signal separating
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0272—Voice signal separating
- G10L21/028—Voice signal separating using properties of sound source
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0272—Voice signal separating
- G10L21/0308—Voice signal separating characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
Abstract
The invention discloses a method for separating speakers by combining microphone sound-source angle and voice-feature similarity. The steps of the method include: calculating in real time the angle variable rate of the microphone sound-source signal relative to the microphone; calculating in real time the probability change value of the speaker according to the feature similarity of the voice signal input to the microphone; and combining the angle variable rate and the probability change value to judge in real time whether the speaker has changed. By combining the microphone sound-source angle with the voice signal for speaker separation, the invention not only improves the flexibility and accuracy of speaker separation but also reduces its restrictive conditions.
Description
Technical field
The present invention relates to the field of computer technology, in particular to speech-separation technology, and more specifically to a method for separating speakers by combining the microphone sound-source angle with the feature similarity of the voice signal.
Background technique
Current speaker-separation techniques generally use one of the following two methods:
1. Separating speakers by the difference in their angles in front of the microphone. The drawback of this method is that when several speakers sit at similar angles to the microphone it is difficult to tell them apart; moreover, the method requires that within one recording the sound-source angle obtained by the microphone stay constant (neither the source nor the microphone moves) to guarantee precision, so its flexibility is poor.
2. Separating speakers from the voice signal. The advantage of this method is that it does not depend on the hardware (the microphone); the drawback is that it is strongly affected by signal quality (both noise and reverberation degrade it), so its accuracy is poor, and a large number of speakers or overlapping speech leads to poor performance.
Summary of the invention
The technical problem to be solved by the present invention is to provide a method for separating speakers by combining microphone sound-source angle and voice-feature similarity that has few restrictive conditions and high flexibility and accuracy.
To solve the above technical problem, the steps of the method of the present invention for separating speakers by combining microphone sound-source angle and voice-feature similarity include:
calculating in real time the angle variable rate of the microphone sound-source signal relative to the microphone;
calculating in real time the probability change value of the speaker according to the feature similarity of the voice signal input to the microphone;
combining the angle variable rate and the probability change value to judge in real time whether the speaker has changed.
The threshold thres of the angle variable rate is calculated as thres = v / r, where v is the speaker's movement speed and r is the distance between the speaker and the microphone.
When v is the maximum speed of slow human walking, the threshold of the angle variable rate is thres_1; when v is the maximum speed of brisk human walking, the threshold of the angle variable rate is thres_2. The two thresholds of the probability change value are threshold_1 and threshold_2. The judgment method is:
when the angle variable rate is less than thres_1 and the probability change value is less than threshold_2, the speaker is judged to be the same;
when the angle variable rate is less than thres_1 but the probability change value is threshold_2 or more, the speaker is judged to be different;
when the angle variable rate is thres_1 or more but less than thres_2 and the probability change value is less than threshold_1, the speaker is judged to be the same;
when the angle variable rate is thres_1 or more but less than thres_2 and the probability change value is threshold_1 or more, the speaker is judged to be different;
when the angle variable rate is thres_2 or more, the speaker is judged to be different.
The value range of r is preferably 0.2–0.5 m, the value range of thres_1 is preferably 0.17–0.43 °/ms, and the value range of thres_2 is preferably 0.23–0.57 °/ms.
threshold_1 is preferably 0.3 and threshold_2 is preferably 0.5.
The features of the voice signal include voiceprint features.
By tracking the microphone sound-source angle and the voice-feature similarity synchronously and correcting each in real time with the other, and by combining the angle-change behaviour with the voice-recognition result to separate speakers, the present invention not only improves the flexibility and accuracy of speaker separation but also reduces its restrictive conditions.
Description of the drawings
Fig. 1 shows the angle of the sounding position relative to the microphone as it changes over time.
Specific embodiments
To give a more concrete understanding of the technical content, features and effects of the present invention, the technical solution of the invention is described in further detail below in conjunction with specific embodiments.
1. Locating and separating speakers using the microphone sound-source signal
The principle of separating speakers with the microphone sound-source signal is: the microphone contains sound-pickup units facing several different directions, and at any instant these units receive the same source signal with different phase differences. From these phase differences the specific bearing of the sound source relative to the microphone can be computed, and speakers can be separated by bearing (same bearing, same speaker; different bearing, different speaker). The algorithm's confidence is computed as 1 minus the relative change of the bearing with respect to the same person's previous general bearing. The algorithm supports streaming computation, so the speaker at each instant can be determined in real time.
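The patent does not spell out how the phase differences are turned into a bearing. A common textbook approach, shown here as an illustrative sketch (the two-microphone far-field model, function names, and the plain cross-correlation delay estimator are assumptions, not the patent's method), estimates the inter-microphone time delay and converts it to an angle via sin(θ) = c·τ/d:

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, in air at roughly 20 °C

def doa_from_tdoa(tau_s: float, mic_spacing_m: float) -> float:
    """Far-field direction of arrival (degrees from broadside) for a
    two-microphone pair, given the inter-microphone delay tau_s."""
    # sin(theta) = c * tau / d; clamp for numerical safety
    s = max(-1.0, min(1.0, SPEED_OF_SOUND * tau_s / mic_spacing_m))
    return math.degrees(math.asin(s))

def tdoa_by_cross_correlation(x, y, fs):
    """Estimate how many seconds y lags x (positive lag => y is delayed)
    by finding the lag that maximises the cross-correlation.
    Plain O(N^2) loop for clarity; real systems use FFT-based GCC-PHAT."""
    n = len(x)
    best_lag, best_val = 0, float("-inf")
    for lag in range(-(n - 1), n):
        v = 0.0
        for i in range(n):
            j = i + lag
            if 0 <= j < n:
                v += x[i] * y[j]
        if v > best_val:
            best_val, best_lag = v, lag
    return best_lag / fs
```

A source exactly broadside to the pair produces zero delay and therefore a 0° bearing; a one-sample delay at 8 kHz with 5 cm spacing maps to roughly 59°.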
The specific method is as follows:
Sound is picked up by the microphone, giving the angle of the sounding position relative to the microphone over a period of time: θ = (θ1, θ2, …, θT).
If the speaker moves while speaking, assume the speaker's movement speed does not exceed v (unit: m/s) and the distance between the speaker and the microphone is r (unit: m). The threshold thres of the angle variable rate dθ then follows from the formula:
thres = v / r
where the speaker's movement speed v and the speaker–microphone distance r are adjustable parameters.
Under normal circumstances a speaker's movement speed does not exceed a slow human walking pace, i.e. v ≤ 1.5 m/s. Taking v = 1.5 m/s therefore gives a smaller threshold for the angle variable rate:
thres_1 = 1.5 / r
If the angle variable rate between consecutive instants does not exceed thres_1, the two instants are judged to be the same person.
In addition, the speed of brisk human walking is about v = 2 m/s, which gives a larger threshold for the angle variable rate:
thres_2 = 2 / r
If the angle variable rate between consecutive instants exceeds thres_2, the speaker is judged to have changed.
If the angle variable rate lies between the two thresholds thres_1 and thres_2, the result of the voice-recognition separation algorithm is consulted to judge whether the speakers at the two instants are the same.
Under normal circumstances r ranges over [0.2, 0.5] m, so rough value ranges for thres_1 and thres_2 (not necessarily suitable for all situations), converted to degrees per millisecond, are:
thres_1 ∈ [0.17, 0.43] °/ms
thres_2 ∈ [0.23, 0.57] °/ms
For example, with a speaker–microphone distance of r = 0.3 m, an angle change of no more than 0.3 °/ms is judged as the same person. As shown in Fig. 1, between 10 and 180 ms the rate of angle change stays below 0.3 °/ms, so the speaker is judged to be the same in that period; between 180 and 210 ms the speaking angle rises abruptly and the rate of change exceeds 0.3 °/ms, so the speaker is considered to have changed.
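The threshold arithmetic above can be sketched as follows (a minimal illustration; the helper names and the strict/inclusive handling at the exact threshold boundaries are my assumptions):

```python
import math

def angle_rate_thresholds(r_m, v_slow=1.5, v_brisk=2.0):
    """Angle-variable-rate thresholds in degrees per millisecond.
    thres = v / r is the largest angular rate (rad/s) a speaker at
    distance r moving at speed v can produce; convert to deg/ms."""
    to_deg_per_ms = math.degrees(1.0) / 1000.0  # rad/s -> deg/ms
    thres_1 = v_slow / r_m * to_deg_per_ms
    thres_2 = v_brisk / r_m * to_deg_per_ms
    return thres_1, thres_2

def classify_by_angle(d_theta, thres_1, thres_2):
    """'same' / 'different' / 'ambiguous' per the angle rule above."""
    if d_theta <= thres_1:
        return "same"
    if d_theta > thres_2:
        return "different"
    return "ambiguous"  # defer to the voice-similarity result
```

With r = 0.5 m this reproduces the lower ends of the stated ranges, thres_1 ≈ 0.17 °/ms and thres_2 ≈ 0.23 °/ms.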
2. Separating speakers using the voice signal
The principle of separating speakers with the voice signal is: extract features (for example, voiceprint-related features) from the signal at each instant and apply an incremental streaming clustering algorithm (each new sample is either assigned, according to its similarity to the existing clusters, to a previously seen cluster, or output as a new cluster); the clusters that emerge correspond to individual people, which realizes speaker separation. The confidence of the voice-recognition separation algorithm is negatively correlated with the average distance between the current instant and all clusters (the smaller the distance, the higher the confidence). The algorithm supports streaming computation, so the speaker at each instant can be determined in real time.
The specific method is as follows:
The voice-recognition separation algorithm outputs, through a softmax layer, the probability that the current speech belongs to each person; the probabilities sum to 1, and the person with the highest probability is taken as the separated speaker.
The probability output by the algorithm is allowed to fluctuate within a certain range between two consecutive instants. Two thresholds (threshold) are set at 0.3 and 0.5: if the probability value p_{t+1} at the later instant differs from the probability value p_t at the earlier instant by less than 0.3, the speakers at the two instants are still considered the same person; if the change exceeds 0.5, the speaker is judged to have changed; if the probability change lies between 0.3 and 0.5, the microphone sound-source localization result must be consulted to judge whether the speakers at the two instants are the same.
Table 1: probability output of the voice-recognition separation algorithm
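A minimal sketch of the probability rule, together with a toy incremental streaming clusterer in the spirit described above (the 1-D "embedding", similarity threshold, and all names are illustrative assumptions; a real system would cluster voiceprint vectors):

```python
def classify_by_probability(p_prev, p_curr, low=0.3, high=0.5):
    """'same' / 'different' / 'ambiguous' per the probability rule."""
    d = abs(p_curr - p_prev)
    if d < low:
        return "same"
    if d > high:
        return "different"
    return "ambiguous"  # defer to the sound-source angle result

def stream_cluster(samples, sim_threshold=0.8):
    """Toy incremental streaming clustering over 1-D 'embeddings':
    each sample joins the nearest existing cluster centroid if it is
    close enough, otherwise it starts a new cluster.
    Returns one cluster label (speaker id) per sample."""
    centroids, counts, labels = [], [], []
    for x in samples:
        best, best_d = None, float("inf")
        for k, c in enumerate(centroids):
            d = abs(x - c)
            if d < best_d:
                best, best_d = k, d
        if best is not None and best_d <= sim_threshold:
            counts[best] += 1
            centroids[best] += (x - centroids[best]) / counts[best]
            labels.append(best)
        else:
            centroids.append(float(x))
            counts.append(1)
            labels.append(len(centroids) - 1)
    return labels
```

Because each sample only touches the current centroids, the clusterer runs in a streaming fashion, matching the real-time claim in the text.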
3. Fusing microphone sound-source localization with the voice-recognition result
Combining the result of the microphone sound-source localization algorithm with that of the voice-recognition separation algorithm, so that the two results correct each other, yields a more accurate speaker separation. The specific judgment method is as follows (see Table 2):
If the microphone sound bearing moves within the thres_2 threshold, i.e. the angle variable rate between consecutive instants stays below thres_2, and the probability change at those instants output by the voice-recognition separation algorithm is less than 0.3, the instants are judged to belong to the same speaker;
If the probability change output by the voice-recognition separation algorithm is within 0.5 and the angle variable rate between consecutive instants is within thres_1, the instants are judged to belong to the same speaker;
If the probability change exceeds 0.5, or the angle variable rate between consecutive instants exceeds thres_2, the speaker is judged to have changed;
If the voice-recognition result and the sound-source localization result both fluctuate in the larger range, i.e. the probability change lies between 0.3 and 0.5 and the angle variable rate lies between thres_1 and thres_2, the speakers at the two instants are judged to be different.
Table 2: speaker judgment criteria after fusing the two results
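The fused judgment rules can be sketched as a single function (an illustrative reading of the rules; the inclusive/exclusive conventions at the exact threshold boundaries are my assumption, since the text and the claims differ slightly on them):

```python
def fuse_decision(d_theta, d_p, thres_1, thres_2,
                  threshold_1=0.3, threshold_2=0.5):
    """Combined speaker-change decision.
    d_theta: angle variable rate between consecutive instants
             (same units as thres_1 / thres_2, e.g. deg/ms);
    d_p:     probability change between the same instants."""
    if d_theta >= thres_2 or d_p >= threshold_2:
        return "different"   # either cue is decisive on its own
    if d_theta < thres_1:
        return "same"        # angle stable: any d_p below threshold_2 is tolerated
    if d_p < threshold_1:
        return "same"        # angle in the grey zone: needs a stable probability
    return "different"       # both cues sit in their ambiguous bands
```

For instance, with thres_1 = 0.17 and thres_2 = 0.23 °/ms, an angle rate of 0.20 °/ms is accepted as the same speaker only while the probability change stays below 0.3.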
Claims (6)
1. A method for separating speakers by combining microphone sound-source angle and voice-feature similarity, characterized in that the steps include:
calculating in real time the angle variable rate of the microphone sound-source signal relative to the microphone;
calculating in real time the probability change value of the speaker according to the feature similarity of the voice signal input to the microphone;
combining the angle variable rate and the probability change value to judge in real time whether the speaker has changed.
2. The method according to claim 1, characterized in that the threshold thres of the angle variable rate is calculated as thres = v / r, where v is the speaker's movement speed and r is the distance between the speaker and the microphone.
3. The method according to claim 2, characterized in that when v is the maximum speed of slow human walking, the threshold of the angle variable rate is thres_1; when v is the maximum speed of brisk human walking, the threshold of the angle variable rate is thres_2; the two thresholds of the probability change value are threshold_1 and threshold_2; and the judgment method is:
when the angle variable rate is less than thres_1 and the probability change value is less than threshold_2, the speaker is judged to be the same;
when the angle variable rate is less than thres_1 but the probability change value is threshold_2 or more, the speaker is judged to be different;
when the angle variable rate is thres_1 or more but less than thres_2 and the probability change value is less than threshold_1, the speaker is judged to be the same;
when the angle variable rate is thres_1 or more but less than thres_2 and the probability change value is threshold_1 or more, the speaker is judged to be different;
when the angle variable rate is thres_2 or more, the speaker is judged to be different.
4. The method according to claim 1, characterized in that the value range of r is 0.2–0.5 m, the value range of thres_1 is 0.17–0.43 °/ms, and the value range of thres_2 is 0.23–0.57 °/ms.
5. The method according to claim 1, characterized in that threshold_1 is 0.3 and threshold_2 is 0.5.
6. The method according to claim 1, characterized in that the features of the voice signal include voiceprint features.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910908195.2A CN110491411B (en) | 2019-09-25 | 2019-09-25 | Method for separating speaker by combining microphone sound source angle and voice characteristic similarity |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910908195.2A CN110491411B (en) | 2019-09-25 | 2019-09-25 | Method for separating speaker by combining microphone sound source angle and voice characteristic similarity |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110491411A true CN110491411A (en) | 2019-11-22 |
CN110491411B CN110491411B (en) | 2022-05-17 |
Family
ID=68544207
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910908195.2A Active CN110491411B (en) | 2019-09-25 | 2019-09-25 | Method for separating speaker by combining microphone sound source angle and voice characteristic similarity |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110491411B (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20100198598A1 (en) * | 2009-02-05 | 2010-08-05 | Nuance Communications, Inc. | Speaker Recognition in a Speech Recognition System |
US20120045066A1 (en) * | 2010-08-17 | 2012-02-23 | Honda Motor Co., Ltd. | Sound source separation apparatus and sound source separation method |
US20160111112A1 (en) * | 2014-10-17 | 2016-04-21 | Fujitsu Limited | Speaker change detection device and speaker change detection method |
CN107297745A (en) * | 2017-06-28 | 2017-10-27 | 上海木爷机器人技术有限公司 | voice interactive method, voice interaction device and robot |
CN108305615A (en) * | 2017-10-23 | 2018-07-20 | 腾讯科技(深圳)有限公司 | A kind of object identifying method and its equipment, storage medium, terminal |
US20180286411A1 (en) * | 2017-03-29 | 2018-10-04 | Honda Motor Co., Ltd. | Voice processing device, voice processing method, and program |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111524527A (en) * | 2020-04-30 | 2020-08-11 | 合肥讯飞数码科技有限公司 | Speaker separation method, device, electronic equipment and storage medium |
CN111524527B (en) * | 2020-04-30 | 2023-08-22 | 合肥讯飞数码科技有限公司 | Speaker separation method, speaker separation device, electronic device and storage medium |
CN112382306A (en) * | 2020-12-02 | 2021-02-19 | 苏州思必驰信息科技有限公司 | Method and device for separating speaker audio |
CN112382306B (en) * | 2020-12-02 | 2022-05-10 | 思必驰科技股份有限公司 | Method and device for separating speaker audio |
CN113362831A (en) * | 2021-07-12 | 2021-09-07 | 科大讯飞股份有限公司 | Speaker separation method and related equipment thereof |
Also Published As
Publication number | Publication date |
---|---|
CN110491411B (en) | 2022-05-17 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US9818431B2 (en) | Multi-speaker speech separation | |
CN110970053B (en) | Multichannel speaker-independent voice separation method based on deep clustering | |
CN110491411A (en) | In conjunction with the method for microphone sound source angle and phonetic feature similarity separation speaker | |
US20180358003A1 (en) | Methods and apparatus for improving speech communication and speech interface quality using neural networks | |
US8996367B2 (en) | Sound processing apparatus, sound processing method and program | |
Nakadai et al. | Improvement of recognition of simultaneous speech signals using av integration and scattering theory for humanoid robots | |
CN107621625B (en) | Sound source positioning method based on double micro microphones | |
CN110610718B (en) | Method and device for extracting expected sound source voice signal | |
CN106887233B (en) | Audio data processing method and system | |
US11222652B2 (en) | Learning-based distance estimation | |
US20100111290A1 (en) | Call Voice Processing Apparatus, Call Voice Processing Method and Program | |
TW202147862A (en) | Robust speaker localization in presence of strong noise interference systems and methods | |
KR20210137146A (en) | Speech augmentation using clustering of queues | |
Ochi et al. | Multi-Talker Speech Recognition Based on Blind Source Separation with ad hoc Microphone Array Using Smartphones and Cloud Storage. | |
JP3925734B2 (en) | Target sound detection method, signal input delay time detection method, and sound signal processing apparatus | |
CN103901400A (en) | Binaural sound source positioning method based on delay compensation and binaural coincidence | |
Araki et al. | Meeting recognition with asynchronous distributed microphone array | |
KR102580828B1 (en) | Multi-channel voice activity detection | |
JP2005227512A (en) | Sound signal processing method and its apparatus, voice recognition device, and program | |
US11528571B1 (en) | Microphone occlusion detection | |
WO2010127489A1 (en) | Detection signal delay method, detection device and encoder | |
Zhu et al. | Long-term speech information based threshold for voice activity detection in massive microphone network | |
WO2021021814A3 (en) | Acoustic zoning with distributed microphones | |
CN113327589B (en) | Voice activity detection method based on attitude sensor | |
Lee et al. | End-to-End Multi-Channel Speech Enhancement Using Inter-Channel Time-Restricted Attention on Raw Waveform. |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||