CN113409796A - Voice identity verification method based on long-term formant measurement - Google Patents
Voice identity verification method based on long-term formant measurement
- Publication number
- CN113409796A (application CN202110510987.1A)
- Authority
- CN
- China
- Prior art keywords
- voice
- long
- frequency
- distance
- resonance peak
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/06—Decision making techniques; Pattern matching strategies
- G10L17/08—Use of distortion metrics or a particular distance between probe pattern and reference templates
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
Abstract
The invention provides a voice identity verification method based on long-term formant measurement. Given a known voice file from a single speaker, the distance between the long-term formant data of any two speech segments in the known voice file is calculated to obtain an upper limit distance D_max and a lower limit distance D_min. When a material-testing (questioned) voice is acquired, the long-term formant distance D between the material-testing voice and the known voice file is calculated. If D is smaller than the lower limit distance, the material-testing voice and the known voice file are judged to have the same identity; if D is larger than the upper limit distance, they are judged not to have the same identity; if D lies between the upper and lower limits, a hypothesis test method is used to verify identity. By acquiring the long-term formants of the voice files and verifying voice identity from the long-term formant distance combined with a hypothesis test method, the invention improves verification precision.
Description
Technical Field
The invention belongs to the technical field of voice detection, and particularly relates to a voice identity verification method based on long-term formant measurement.
Background
Formants are important features in voiceprint identification: they not only provide a reference for distinguishing consonants and vowels, but also carry the personality characteristics of the speaker. Formant frequency is affected by vocal tract length (a longer vocal tract produces lower vowel formants) and by the relative proportions of the various parts of the vocal tract.
There are many ways to measure formant frequency. The most classical is to measure the center frequency values of the formants of different vowels. However, the correlation between the formant frequencies of different vowels, and between different formants, is weak, and this reduces identification accuracy. Another approach is dynamic analysis: individuals leave traces of their specific articulatory movement patterns when they speak, and these traces reflect the personality characteristics of the speaker. But formant dynamics are affected by both segmental and prosodic context, and this method requires further study of the differences between speaking contexts.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a voice identity verification method based on long-term formant measurement that improves verification precision.
The technical scheme adopted by the invention for solving the technical problems is as follows: a voice identity verification method based on long-term formant measurement comprises the following steps:
given a known voice file from the same speaker, calculating the distance between the long-term formant data of any two speech segments in the known voice file to obtain an upper limit distance D_max and a lower limit distance D_min;
when a material-testing voice is collected, calculating the long-term formant distance D between the material-testing voice and the known voice file, and making the following judgment:
when D < D_min, the material-testing voice is judged to have the same identity as the known voice file, i.e., the same speaker;
when D > D_max, the material-testing voice is judged not to have the same identity as the known voice file, i.e., a different speaker;
when D_min ≤ D ≤ D_max, a hypothesis test method is used to verify identity.
According to the above method, the upper limit distance D_max and the lower limit distance D_min are calculated as follows:
let the measurement data of the 4 long-term formants of 2 speech segments in the known voice file be the matrices X1 and Y1, where

$$X_1=\begin{bmatrix}x_{F11}&x_{F12}&\cdots&x_{F1m}\\x_{F21}&x_{F22}&\cdots&x_{F2m}\\x_{F31}&x_{F32}&\cdots&x_{F3m}\\x_{F41}&x_{F42}&\cdots&x_{F4m}\end{bmatrix},\qquad
Y_1=\begin{bmatrix}y_{F11}&y_{F12}&\cdots&y_{F1n}\\y_{F21}&y_{F22}&\cdots&y_{F2n}\\y_{F31}&y_{F32}&\cdots&y_{F3n}\\y_{F41}&y_{F42}&\cdots&y_{F4n}\end{bmatrix}$$

In the formula, x_F11 … x_F1m are the first to m-th formant data at the first frequency of the first speech segment, x_F21 … x_F2m the data at the second frequency, x_F31 … x_F3m the data at the third frequency, and x_F41 … x_F4m the data at the fourth frequency of the first speech segment; y_F11 … y_F1n are the first to n-th formant data at the first frequency of the second speech segment, and y_F21 … y_F2n, y_F31 … y_F3n and y_F41 … y_F4n the data at the second, third and fourth frequencies of the second speech segment; the first to fourth frequencies are sequentially increasing or sequentially decreasing frequencies;
the column data of each long-term formant measurement matrix form the formant vectors x_i = [x_F1i, x_F2i, x_F3i, x_F4i] and y_i = [y_F1i, y_F2i, y_F3i, y_F4i]. The center positions of the m vectors of the first speech segment and of the n vectors of the second speech segment are calculated separately: let x_c = [x_F1c, x_F2c, x_F3c, x_F4c] be the center of the X1 matrix and y_c = [y_F1c, y_F2c, y_F3c, y_F4c] the center of the Y1 matrix. Following the clustering principle, the sum of the distances from x_c to the x_i is minimized, so x_c and y_c are obtained by solving the minimum problems

$$x_c=\arg\min_{x}\sum_{i=1}^{m}\lVert x-x_i\rVert,\qquad y_c=\arg\min_{y}\sum_{i=1}^{n}\lVert y-y_i\rVert$$
On the basis of x_c and y_c, the long-term formant distance D* of the two speech segments is calculated as the Euclidean distance between the centers:

$$D^{*}=\lVert x_c-y_c\rVert=\sqrt{\sum_{k=1}^{4}\left(x_{Fkc}-y_{Fkc}\right)^{2}}$$
The distance between every two speech segments of different parts of the known voice file is calculated respectively by the above method, and the maximum and minimum values are taken as the upper limit distance D_max and the lower limit distance D_min.
According to the method, the long-term formant distance D of the material-testing voice is calculated by the same method as the long-term formant distance D* of two speech segments in the known voice file.
According to the method, the hypothesis testing method is a t-test, which comprises the following specific steps:
let the measurement data of the 4 long-term formants of the material-testing voice be the matrix Z1, where

$$Z_1=\begin{bmatrix}z_{F11}&z_{F12}&\cdots&z_{F1j}\\z_{F21}&z_{F22}&\cdots&z_{F2j}\\z_{F31}&z_{F32}&\cdots&z_{F3j}\\z_{F41}&z_{F42}&\cdots&z_{F4j}\end{bmatrix}$$

In the formula, z_F11 … z_F1j are the first to j-th formant data at the first frequency of the material-testing voice, and z_F21 … z_F2j, z_F31 … z_F3j and z_F41 … z_F4j the data at the second, third and fourth frequencies of the material-testing voice;
let x_F21, x_F22, …, x_F2m follow the normal distribution N(u, σ²) and z_F21, z_F22, …, z_F2j follow N(v, σ²). According to statistical theory, under the hypothesis u = v the formant data at the second frequency yield the statistic

$$T=\frac{\bar{x}_{F2}-\bar{z}_{F2}}{S_w\sqrt{\frac{1}{m}+\frac{1}{j}}}\sim t(m+j-2),\qquad S_w^{2}=\frac{(m-1)S_x^{2}+(j-1)S_z^{2}}{m+j-2}$$

where x̄_F2 (xF2mean) and S_x are the mean and standard deviation of x_F21, x_F22, …, x_F2m, and z̄_F2 (zF2mean) and S_z are the mean and standard deviation of z_F21, z_F22, …, z_F2j;
given a significance level α, when

$$|T|<t_{\alpha/2}(m+j-2)$$

the material-testing voice is judged to have the same identity as the known voice file; otherwise, the material-testing voice is judged not to have the same identity as the known voice file.
An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the method for voice identity verification based on long-term formant measurements when executing the computer program.
A non-transitory computer readable storage medium having stored thereon a computer program which, when being executed by a processor, carries out the steps of the method for voice identity verification based on long-term formant measurements.
The invention has the beneficial effects that: by acquiring the long-term formants of the voice files and verifying voice identity from the long-term formant distance combined with a hypothesis test method, verification precision can be improved.
Drawings
FIG. 1 shows the frequencies of the formants LTF2 and LTF3 at vowels in different contexts of speech.
FIG. 2 is a formant spectrum.
FIG. 3 is a plot of formant F1-F3 frequency versus time.
FIG. 4 is a frequency distribution plot of formants F1-F3.
FIG. 5 is a graph of the long-term formant LTF2 and LTF3 distribution for different speakers.
FIG. 6 is a graph of the long term formant LTF2 and LTF3 distribution for the same speaker.
FIG. 7 is a t-test confidence interval distribution map.
FIG. 8 is a flowchart of a method according to an embodiment of the present invention.
Detailed Description
The invention is further illustrated by the following specific examples and figures.
FIG. 1 depicts the frequency variation of LTF2 and LTF3 for multiple test subjects in both natural-speech and reading contexts; it can be seen that the variation of the mean frequencies of LTF2 and LTF3 across the two contexts is very small. LTF4 is more strongly affected by telephone transmission bandwidth, so the invention selects LTF2 and LTF3 as the basis for voiceprint verification.
As shown in FIG. 2, the positions of the vowel formants F1-F4 in the voice file to be identified are determined by combining linear predictive analysis with manual correction; along the spectral envelope from low to high frequency they are F1 to F4 in sequence. Because formant F4 is unstable, formants F1-F3 are used as the identification basis. The curves of formant F1-F3 frequency versus time are shown in FIG. 3, and from the frequency and occurrence probability of each formant the long-term formant F1-F3 frequency distribution curves of FIG. 4 can be drawn.
From these long-term formant frequency distribution characteristics, different speakers have different LTF2 and LTF3 distributions. FIG. 5 depicts the vowel LTF2 and LTF3 distributions of 2 test subjects: the two solid lines are the LTF2 distributions of the two subjects, and the two dashed lines are their LTF3 distributions. The LTF2 and LTF3 of the 2 subjects differ not only in mean frequency but also markedly in the interval covered by the distribution curve and in the curve shape.
The vowel LTF2 and LTF3 distributions measured in different contexts for the same speaker are shown in FIG. 6, where the two solid lines are the LTF2 distributions and the two dashed lines the LTF3 distributions measured in different contexts. For the same speaker, LTF2 and LTF3 in different contexts show not only small variation in mean frequency but also very similar distribution intervals and shapes. Therefore a probabilistic hypothesis test can be applied to measured LTF2 and LTF3 data to determine whether a questioned speech sample comes from the target speaker.
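The long-term formant distributions discussed above (FIGS. 4-6) are, in essence, normalized histograms of a formant's frame-by-frame frequency measurements pooled over a whole recording. A minimal sketch of how such a distribution could be computed, assuming the frame-wise formant track is already available from a tracker such as the linear predictive analysis mentioned above (the function name and the `bin_hz` bin width are illustrative, not from the patent):

```python
from collections import Counter

def ltf_distribution(formant_track_hz, bin_hz=50):
    """Long-term formant distribution: pool a formant's frame-by-frame
    frequency measurements (Hz) over the whole recording into bins of
    width bin_hz, then normalize so the probabilities sum to 1."""
    counts = Counter(int(f // bin_hz) * bin_hz for f in formant_track_hz)
    total = sum(counts.values())
    return {lo: n / total for lo, n in sorted(counts.items())}

# Illustrative LTF2 track concentrated around 1500 Hz
track = [1490, 1502, 1511, 1549, 1560, 1610]
dist = ltf_distribution(track, bin_hz=100)
```

Plotting `dist` for each formant would give distribution curves of the kind shown in FIGS. 4-6.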
Based on the above principle and research, the present invention provides a voice identity verification method based on long-term formant measurement, as shown in fig. 8, the method includes:
S1, given a known voice file from the same speaker, calculate the distance between the long-term formant data of any two speech segments in the known voice file to obtain an upper limit distance D_max and a lower limit distance D_min.
Let the measurement data of the 4 long-term formants of 2 speech segments in the known voice file be the matrices X1 and Y1, where

$$X_1=\begin{bmatrix}x_{F11}&x_{F12}&\cdots&x_{F1m}\\x_{F21}&x_{F22}&\cdots&x_{F2m}\\x_{F31}&x_{F32}&\cdots&x_{F3m}\\x_{F41}&x_{F42}&\cdots&x_{F4m}\end{bmatrix},\qquad
Y_1=\begin{bmatrix}y_{F11}&y_{F12}&\cdots&y_{F1n}\\y_{F21}&y_{F22}&\cdots&y_{F2n}\\y_{F31}&y_{F32}&\cdots&y_{F3n}\\y_{F41}&y_{F42}&\cdots&y_{F4n}\end{bmatrix}$$

In the formula, x_F11 … x_F1m are the first to m-th formant data at the first frequency of the first speech segment, x_F21 … x_F2m the data at the second frequency, x_F31 … x_F3m the data at the third frequency, and x_F41 … x_F4m the data at the fourth frequency of the first speech segment; y_F11 … y_F1n are the first to n-th formant data at the first frequency of the second speech segment, and y_F21 … y_F2n, y_F31 … y_F3n and y_F41 … y_F4n the data at the second, third and fourth frequencies of the second speech segment; the first to fourth frequencies are sequentially increasing or sequentially decreasing frequencies.
The column data of each long-term formant measurement matrix form the formant vectors x_i = [x_F1i, x_F2i, x_F3i, x_F4i] and y_i = [y_F1i, y_F2i, y_F3i, y_F4i]. The center positions of the m vectors of the first speech segment and of the n vectors of the second speech segment are calculated separately: let x_c = [x_F1c, x_F2c, x_F3c, x_F4c] be the center of the X1 matrix and y_c = [y_F1c, y_F2c, y_F3c, y_F4c] the center of the Y1 matrix. Following the clustering principle, the sum of the distances from x_c to the x_i is minimized, so x_c and y_c are obtained by solving the minimum problems

$$x_c=\arg\min_{x}\sum_{i=1}^{m}\lVert x-x_i\rVert,\qquad y_c=\arg\min_{y}\sum_{i=1}^{n}\lVert y-y_i\rVert$$
On the basis of x_c and y_c, the long-term formant distance D* of the two speech segments is calculated as the Euclidean distance between the centers:

$$D^{*}=\lVert x_c-y_c\rVert=\sqrt{\sum_{k=1}^{4}\left(x_{Fkc}-y_{Fkc}\right)^{2}}$$
The distance between every two speech segments of different parts of the known voice file is calculated respectively by the above method, and the maximum and minimum values are taken as the upper limit distance D_max and the lower limit distance D_min.
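The center-and-distance computation of step S1 can be sketched as follows. The center defined by the minimum-sum-of-distances ("clustering principle") criterion is the geometric median, which has no closed form; a standard Weiszfeld fixed-point iteration is used here as one way to solve the minimum problem. This is a sketch, not the patent's implementation, and the function names are illustrative:

```python
import math

def center(vectors, iters=200, eps=1e-9):
    """Geometric median: the point whose summed Euclidean distance to
    all formant vectors is minimal (Weiszfeld fixed-point iteration)."""
    dim = len(vectors[0])
    # start from the arithmetic mean
    c = [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]
    for _ in range(iters):
        num, den = [0.0] * dim, 0.0
        for v in vectors:
            d = math.dist(v, c)
            if d < eps:            # center coincides with a data point
                return list(v)
            w = 1.0 / d
            den += w
            for k in range(dim):
                num[k] += w * v[k]
        c = [s / den for s in num]
    return c

def formant_distance(x_vecs, y_vecs):
    """Long-term formant distance D*: Euclidean distance between the
    centers of two sets of 4-dimensional formant vectors."""
    return math.dist(center(x_vecs), center(y_vecs))

def distance_bounds(segments):
    """D_max and D_min over all pairs of known same-speaker segments."""
    ds = [formant_distance(a, b)
          for i, a in enumerate(segments) for b in segments[i + 1:]]
    return max(ds), min(ds)

# Illustrative 4-dimensional formant vectors for two speech segments
xs = [(0, 0, 0, 0), (2, 0, 0, 0), (0, 2, 0, 0), (2, 2, 0, 0)]
ys = [(3, 0, 0, 0), (5, 0, 0, 0), (3, 2, 0, 0), (5, 2, 0, 0)]
```

By symmetry the centers of `xs` and `ys` are (1, 1, 0, 0) and (4, 1, 0, 0), so `formant_distance(xs, ys)` converges to 3. Using the arithmetic mean instead of the geometric median would be a simpler (but different) choice of center.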
S2, when a material-testing voice is collected, calculate the long-term formant distance D between the material-testing voice and the known voice file; the method for calculating D is the same as that for the long-term formant distance D* between two speech segments of the known voice file.
The following judgment is then made: when D < D_min, the material-testing voice is judged to have the same identity as the known voice file, i.e., the same speaker; when D > D_max, the material-testing voice is judged not to have the same identity as the known voice file, i.e., a different speaker; when D_min ≤ D ≤ D_max, a hypothesis test is used to verify identity.
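The three-way decision of step S2 can be sketched directly; the function and label names below are illustrative, not from the patent:

```python
def decide(d, d_min, d_max):
    """Three-way identity decision on the long-term formant distance D."""
    if d < d_min:
        return "same speaker"         # identity accepted outright
    if d > d_max:
        return "different speaker"    # identity rejected outright
    return "hypothesis test needed"   # D_min <= D <= D_max: fall back to the t-test
```

Only the middle branch requires the t-test described next.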
The hypothesis testing method is a t testing method, and comprises the following specific steps:
Let the measurement data of the 4 long-term formants of the material-testing voice be the matrix Z1, where

$$Z_1=\begin{bmatrix}z_{F11}&z_{F12}&\cdots&z_{F1j}\\z_{F21}&z_{F22}&\cdots&z_{F2j}\\z_{F31}&z_{F32}&\cdots&z_{F3j}\\z_{F41}&z_{F42}&\cdots&z_{F4j}\end{bmatrix}$$

In the formula, z_F11 … z_F1j are the first to j-th formant data at the first frequency of the material-testing voice, and z_F21 … z_F2j, z_F31 … z_F3j and z_F41 … z_F4j the data at the second, third and fourth frequencies of the material-testing voice.
Let x_F21, x_F22, …, x_F2m follow the normal distribution N(u, σ²) and z_F21, z_F22, …, z_F2j follow N(v, σ²). According to statistical theory, the sample means of the formant data at the second frequency are distributed as

$$\bar{x}_{F2}\sim N\!\left(u,\frac{\sigma^{2}}{m}\right),\qquad \bar{z}_{F2}\sim N\!\left(v,\frac{\sigma^{2}}{j}\right)$$

where x̄_F2 (xF2mean) and S_x are the mean and standard deviation of x_F21, x_F22, …, x_F2m, and z̄_F2 (zF2mean) and S_z are the mean and standard deviation of z_F21, z_F22, …, z_F2j.
Two hypotheses are made: H0: u = v and H1: u ≠ v. If H0 holds, then

$$T=\frac{\bar{x}_{F2}-\bar{z}_{F2}}{S_w\sqrt{\frac{1}{m}+\frac{1}{j}}}\sim t(m+j-2),\qquad S_w^{2}=\frac{(m-1)S_x^{2}+(j-1)S_z^{2}}{m+j-2}$$
When performing the hypothesis test of H0 against H1, a significance level α is given; when

$$|T|<t_{\alpha/2}(m+j-2)$$
the material-testing voice is judged to have the same identity as the known voice file, i.e., H0 is accepted; otherwise, the material-testing voice is judged not to have the same identity as the known voice file, i.e., H0 is rejected.
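A minimal sketch of this pooled two-sample t-test using only the standard library; the critical value must be supplied from a t-table, e.g. about 2.306 for a two-sided test at α = 0.05 with m = j = 5 (df = 8). The function names and sample data are illustrative, not from the patent:

```python
import math
from statistics import mean, stdev

def pooled_t(x, z):
    """Two-sample t statistic with pooled standard deviation S_w;
    degrees of freedom are m + j - 2."""
    m, j = len(x), len(z)
    sw2 = ((m - 1) * stdev(x) ** 2 + (j - 1) * stdev(z) ** 2) / (m + j - 2)
    t = (mean(x) - mean(z)) / math.sqrt(sw2 * (1.0 / m + 1.0 / j))
    return t, m + j - 2

def same_identity(x, z, t_crit):
    """Accept H0 (same speaker) when |T| < t_crit = t_{alpha/2}(m + j - 2)."""
    t, _ = pooled_t(x, z)
    return abs(t) < t_crit

# Illustrative LTF2 measurements (Hz) for a known and a questioned sample
x = [1200, 1210, 1190, 1205, 1195]
z = [1202, 1208, 1193, 1207, 1196]
```

With these samples the means differ by about 1 Hz against a pooled spread of several Hz, so H0 is accepted; replacing `z` with measurements around 1400 Hz would drive |T| far past the critical value and reject H0.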
As shown in FIG. 7, for two test materials to be considered to come from the same speaker at the 95% confidence level, the long-term formants measured from the two files must satisfy the inequality

$$\left|\bar{x}_{F2}-\bar{z}_{F2}\right|<c,\qquad c=t_{0.05}(m+j-2)\,S_w\sqrt{\frac{1}{m}+\frac{1}{j}}$$

where t_{0.05}(m+j-2) is the t-distribution critical value corresponding to m + j - 2 degrees of freedom at reliability level α = 0.05. As can be seen from FIG. 7, the larger 1 - α is, the greater the confidence that H0 holds. Since the t distribution is symmetric about the vertical axis, let 2β = 1 - α, i.e., β = (1 - α)/2.
When performing the identity hypothesis test on the two samples, in order to determine a reasonable value range for β, upper and lower limits of β can be determined from comparisons among the known samples. When β exceeds the upper limit, the tested materials are considered to have identity; when β falls below the lower limit, identity of the tested materials is rejected; when β lies between the two limits, a comprehensive judgment needs to be made in conjunction with the distance D.
The invention also provides an electronic device, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor realizes the steps of the voice identity verification method based on the long-term formant measurement when executing the computer program.
The present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when being executed by a processor, carries out the steps of the method for voice identity verification based on long-term formant measurements.
The above embodiments are only used for illustrating the design idea and features of the present invention, and the purpose of the present invention is to enable those skilled in the art to understand the content of the present invention and implement the present invention accordingly, and the protection scope of the present invention is not limited to the above embodiments. Therefore, all equivalent changes and modifications made in accordance with the principles and concepts disclosed herein are intended to be included within the scope of the present invention.
Claims (6)
1. A voice identity verification method based on long-term formant measurement, characterized by comprising the following steps:
given a known voice file from the same speaker, calculating the distance between the long-term formant data of any two speech segments in the known voice file to obtain an upper limit distance D_max and a lower limit distance D_min;
when a material-testing voice is collected, calculating the long-term formant distance D between the material-testing voice and the known voice file, and making the following judgment:
when D < D_min, judging that the material-testing voice has the same identity as the known voice file, i.e., the same speaker;
when D > D_max, judging that the material-testing voice does not have the same identity as the known voice file, i.e., a different speaker;
when D_min ≤ D ≤ D_max, verifying identity using a hypothesis test method.
2. The method of claim 1, wherein the upper limit distance D_max and the lower limit distance D_min are calculated as follows:
let the measurement data of the 4 long-term formants of 2 speech segments in the known voice file be the matrices X1 and Y1, where

$$X_1=\begin{bmatrix}x_{F11}&x_{F12}&\cdots&x_{F1m}\\x_{F21}&x_{F22}&\cdots&x_{F2m}\\x_{F31}&x_{F32}&\cdots&x_{F3m}\\x_{F41}&x_{F42}&\cdots&x_{F4m}\end{bmatrix},\qquad
Y_1=\begin{bmatrix}y_{F11}&y_{F12}&\cdots&y_{F1n}\\y_{F21}&y_{F22}&\cdots&y_{F2n}\\y_{F31}&y_{F32}&\cdots&y_{F3n}\\y_{F41}&y_{F42}&\cdots&y_{F4n}\end{bmatrix}$$

In the formula, x_F11 … x_F1m are the first to m-th formant data at the first frequency of the first speech segment, x_F21 … x_F2m the data at the second frequency, x_F31 … x_F3m the data at the third frequency, and x_F41 … x_F4m the data at the fourth frequency of the first speech segment; y_F11 … y_F1n are the first to n-th formant data at the first frequency of the second speech segment, and y_F21 … y_F2n, y_F31 … y_F3n and y_F41 … y_F4n the data at the second, third and fourth frequencies of the second speech segment; the first to fourth frequencies are sequentially increasing or sequentially decreasing frequencies;
the column data of each long-term formant measurement matrix form the formant vectors x_i = [x_F1i, x_F2i, x_F3i, x_F4i] and y_i = [y_F1i, y_F2i, y_F3i, y_F4i]. The center positions of the m vectors of the first speech segment and of the n vectors of the second speech segment are calculated separately: let x_c = [x_F1c, x_F2c, x_F3c, x_F4c] be the center of the X1 matrix and y_c = [y_F1c, y_F2c, y_F3c, y_F4c] the center of the Y1 matrix. Following the clustering principle, the sum of the distances from x_c to the x_i is minimized, so x_c and y_c are obtained by solving the minimum problems

$$x_c=\arg\min_{x}\sum_{i=1}^{m}\lVert x-x_i\rVert,\qquad y_c=\arg\min_{y}\sum_{i=1}^{n}\lVert y-y_i\rVert$$
On the basis of x_c and y_c, the long-term formant distance D* of the two speech segments is calculated as the Euclidean distance between the centers:

$$D^{*}=\lVert x_c-y_c\rVert=\sqrt{\sum_{k=1}^{4}\left(x_{Fkc}-y_{Fkc}\right)^{2}}$$
3. The method of claim 2, wherein the long-term formant distance D of the material-testing voice is calculated by the same method as the long-term formant distance D* of two speech segments in the known voice file.
4. The method of claim 3, wherein the hypothesis testing method is a t-test with the following specific steps:
let the measurement data of the 4 long-term formants of the material-testing voice be the matrix Z1, where

$$Z_1=\begin{bmatrix}z_{F11}&z_{F12}&\cdots&z_{F1j}\\z_{F21}&z_{F22}&\cdots&z_{F2j}\\z_{F31}&z_{F32}&\cdots&z_{F3j}\\z_{F41}&z_{F42}&\cdots&z_{F4j}\end{bmatrix}$$

In the formula, z_F11 … z_F1j are the first to j-th formant data at the first frequency of the material-testing voice, and z_F21 … z_F2j, z_F31 … z_F3j and z_F41 … z_F4j the data at the second, third and fourth frequencies of the material-testing voice;
let x_F21, x_F22, …, x_F2m follow the normal distribution N(u, σ²) and z_F21, z_F22, …, z_F2j follow N(v, σ²). According to statistical theory, under the hypothesis u = v the formant data at the second frequency yield the statistic

$$T=\frac{\bar{x}_{F2}-\bar{z}_{F2}}{S_w\sqrt{\frac{1}{m}+\frac{1}{j}}}\sim t(m+j-2),\qquad S_w^{2}=\frac{(m-1)S_x^{2}+(j-1)S_z^{2}}{m+j-2}$$

where x̄_F2 (xF2mean) and S_x are the mean and standard deviation of x_F21, x_F22, …, x_F2m, and z̄_F2 (zF2mean) and S_z are the mean and standard deviation of z_F21, z_F22, …, z_F2j;
given a significance level α, when

$$|T|<t_{\alpha/2}(m+j-2)$$

the material-testing voice is judged to have the same identity as the known voice file; otherwise, the material-testing voice is judged not to have the same identity as the known voice file.
5. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein: the processor, when executing the computer program, performs the steps of the method for verifying speech identity based on long-term formant measurements according to any one of claims 1 to 4.
6. A non-transitory computer-readable storage medium having stored thereon a computer program, characterized in that: the computer program when being executed by a processor realizes the steps of a method for verifying speech identity based on long-term formant measurements according to any one of claims 1 to 4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110510987.1A CN113409796B (en) | 2021-05-11 | 2021-05-11 | Voice identity verification method based on long-term formant measurement |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110510987.1A CN113409796B (en) | 2021-05-11 | 2021-05-11 | Voice identity verification method based on long-term formant measurement |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113409796A true CN113409796A (en) | 2021-09-17 |
CN113409796B CN113409796B (en) | 2022-09-27 |
Family
ID=77678249
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110510987.1A Active CN113409796B (en) | 2021-05-11 | 2021-05-11 | Voice identity verification method based on long-term formant measurement |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113409796B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2016209888A1 (en) * | 2015-06-22 | 2016-12-29 | Rita Singh | Processing speech signals in voice-based profiling |
CN111105815A (en) * | 2020-01-20 | 2020-05-05 | 深圳震有科技股份有限公司 | Auxiliary detection method and device based on voice activity detection and storage medium |
CN111108552A (en) * | 2019-12-24 | 2020-05-05 | 广州国音智能科技有限公司 | Voiceprint identity identification method and related device |
CN111108551A (en) * | 2019-12-24 | 2020-05-05 | 广州国音智能科技有限公司 | Voiceprint identification method and related device |
Non-Patent Citations (5)
Title
- ERICA GOLD et al., "Examining correlations between phonetic parameters: Implications for forensic speaker comparison", The Journal of the Acoustical Society of America
- ERICA GOLD et al., "Examining long-term formant distributions as a discriminant in forensic speaker comparisons under a likelihood ratio framework", The Journal of the Acoustical Society of America
- MICHAEL JESSEN et al., "Long-term formant distribution as a forensic-phonetic feature", The Journal of the Acoustical Society of America
- CAO Honglin, "Application of long-term formant distribution features in voiceprint identification", Chinese Journal of Forensic Sciences
- JIA Liwen, "Changes in the long-term formant distribution of speech at increased volume and their influence on voiceprint identification", Journal of Shanxi Datong University (Natural Science Edition)
Also Published As
Publication number | Publication date |
---|---|
CN113409796B (en) | 2022-09-27 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Yu et al. | Spoofing detection in automatic speaker verification systems using DNN classifiers and dynamic acoustic features | |
US9536547B2 (en) | Speaker change detection device and speaker change detection method | |
Becker et al. | Forensic speaker verification using formant features and Gaussian mixture models. | |
US20210125603A1 (en) | Acoustic model training method, speech recognition method, apparatus, device and medium | |
CN101136199B (en) | Voice data processing method and equipment | |
Mandasari et al. | Quality measure functions for calibration of speaker recognition systems in various duration conditions | |
US8271283B2 (en) | Method and apparatus for recognizing speech by measuring confidence levels of respective frames | |
US20090313016A1 (en) | System and Method for Detecting Repeated Patterns in Dialog Systems | |
Jin et al. | Cute: A concatenative method for voice conversion using exemplar-based unit selection | |
Ferragne et al. | Vowel systems and accent similarity in the British Isles: Exploiting multidimensional acoustic distances in phonetics | |
CN113409796B (en) | Voice identity verification method based on long-term formant measurement | |
US20230178099A1 (en) | Using optimal articulatory event-types for computer analysis of speech | |
WO2002029785A1 (en) | Method, apparatus, and system for speaker verification based on orthogonal gaussian mixture model (gmm) | |
Vair et al. | Loquendo-Politecnico di torino's 2006 NIST speaker recognition evaluation system. | |
Laskowski et al. | Modeling instantaneous intonation for speaker identification using the fundamental frequency variation spectrum | |
Kinoshita et al. | Sub-band cepstral distance as an alternative to formants: Quantitative evidence from a forensic comparison experiment | |
Arcienega et al. | Pitch-dependent GMMs for text-independent speaker recognition systems. | |
CN113705671A (en) | Speaker identification method and system based on text related information perception | |
Dusan | On the relevance of some spectral and temporal patterns for vowel classification | |
Al-Manie et al. | Automatic speech segmentation using the Arabic phonetic database | |
Nath et al. | Composite feature selection method based on spoken word and speaker recognition | |
Andreev et al. | Attacking the problem of continuous speech segmentation into basic units | |
Xie et al. | Kurtosis normalization after short-time gaussianization for robust speaker verification | |
Tomar | Discriminant feature space transformations for automatic speech recognition | |
Nath et al. | Feature Selection Method for Speaker Recognition using Neural Network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||