KR101671305B1 - Apparatus for extracting feature parameter of input signal and apparatus for recognizing speaker using the same - Google Patents

Apparatus for extracting feature parameter of input signal and apparatus for recognizing speaker using the same Download PDF

Info

Publication number
KR101671305B1
Authority
KR
South Korea
Prior art keywords
signal
input signal
feature parameter
pitch
unit
Prior art date
Application number
KR1020150183897A
Other languages
Korean (ko)
Inventor
정상배
강지훈
김영일
Original Assignee
경상대학교 산학협력단
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 경상대학교 산학협력단
Priority to KR1020150183897A
Application granted
Publication of KR101671305B1
Priority to PCT/KR2016/014673 (WO2017111386A1)

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 - Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 - Complex mathematical operations
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 - Speaker identification or verification techniques
    • G10L17/02 - Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Algebra (AREA)
  • Mathematical Analysis (AREA)
  • Pure & Applied Mathematics (AREA)
  • Computational Mathematics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Optimization (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Measurement Of Mechanical Vibrations Or Ultrasonic Waves (AREA)

Abstract

The present invention relates to a feature parameter extracting device and a speaker recognizing device capable of raising the speaker recognition rate by extracting an excitation signal from the periodic section of an input signal and then using a feature parameter extracted from that signal. According to an embodiment of the present invention, the feature parameter extracting device includes: a periodic signal detecting part detecting a periodic section of an input signal; an excitation signal extracting part extracting an excitation signal from the periodic section of the input signal; and a feature parameter calculating part calculating a feature parameter characterizing the input signal based on a frequency response spectrum of the excitation signal.

Description

TECHNICAL FIELD [0001] The present invention relates to an apparatus for extracting feature parameters of an input signal and an apparatus for recognizing a speaker using the same.

Speech processing technology, in which a computer processes and understands human speech, is a promising technology that can be used in many fields. In particular, speaker recognition, which identifies a speaker from input voice, may be used for identity verification in security systems or for user identification in intelligent robots.

In general, speech recognition, including speaker recognition, extracts a feature vector from a speech input signal and compares it with previously stored data to recognize the information. However, current recognition rates limit commercialization in many fields; the recognition rate of speaker recognition in particular is not high, and continued research and development is needed.

An object of the present invention is to provide a feature parameter extracting apparatus capable of improving the speaker recognition rate and a speaker recognizing apparatus using the feature parameter extracting apparatus.

An apparatus for extracting feature parameters according to an exemplary embodiment of the present invention includes: a periodic signal detector for detecting a periodic interval of an input signal; an excitation signal extractor for extracting an excitation signal in the periodic interval of the input signal; and a feature parameter calculation unit for calculating a feature parameter that characterizes the input signal based on the frequency response spectrum of the excitation signal.

The periodic signal detector may detect a periodic interval based on a result of an auto-correlation function for the input signal.

When the input signal is x(n), the periodic signal detector may determine that the input signal is a periodic signal with period T when R(T)/R(0) is greater than or equal to a predetermined threshold, where R(T) = Σ_n x(n)·x(n+T).

The feature parameter calculating unit may include: a representative pitch detector for determining a representative pitch value among the pitch values of the input signal; and a proper pitch detector for determining an appropriate pitch value, based on the representative pitch value, among the pitch values in the periodic interval of the input signal. The feature parameter may then be calculated based on the appropriate pitch value.

The representative pitch detection unit may determine the median value as the representative pitch value by arranging the pitch values of the input signals in order of magnitude.

The appropriate pitch detector may determine, as the appropriate pitch value, the value closest to the representative pitch value among the pitch values selected based on the result of the autocorrelation function for the input signal of the periodic interval.

The excitation signal extraction unit may include a pre-emphasis unit that pre-processes the input signal of the periodic section and outputs a pre-processed signal in which high-frequency components lost in the process of generating the input signal are compensated.

The excitation signal extraction unit may further include: an autocorrelation function estimator for outputting the result of an autocorrelation function for the pre-processed signal; a prediction coefficient calculator for receiving the output of the autocorrelation function estimator and outputting prediction coefficients based on a Levinson-Durbin algorithm; and an inverse filtering unit for performing inverse filtering based on the pre-processed signal and the prediction coefficients to output the excitation signal.

The feature parameter calculator may include a frequency domain transformer that performs a discrete Fourier transform on the excitation signal based on the appropriate pitch value to convert the excitation signal into a discrete Fourier spectrum in the frequency domain.

The feature parameter calculator may calculate the logarithm of the magnitude of the discrete Fourier spectrum as the feature parameter.

The feature parameter calculating unit may include: a mel-frequency response acquiring unit that obtains a mel-frequency response by applying a mel-frequency filter to the frequency response spectrum of the input signal over its periodic and aperiodic intervals; and a cepstral coefficient acquiring unit that obtains cepstrum coefficients by performing an inverse discrete cosine transform of the mel-frequency response.

The feature parameter calculating unit may calculate, as a feature parameter, a value obtained by multiplying the DFS-based output and the output of the cepstrum coefficient acquiring unit by predetermined weights, respectively.

A speaker recognition apparatus according to an embodiment of the present invention includes: a voice collection unit for collecting the voice of a speaker; a voice processing unit for processing the collected voice and determining whether it matches the voice of a previously registered user; and a storage unit for storing information on the voice of the user.

The voice processing unit may include: a periodic signal detector for detecting a periodic interval of the input signal; an excitation signal extractor for extracting an excitation signal in the periodic interval of the input signal; and a feature parameter calculation unit for calculating a feature parameter that characterizes the input signal based on the frequency response spectrum of the excitation signal.

According to an embodiment of the present invention, a feature parameter extracting apparatus that increases the recognition rate in speech signal processing can be obtained, as can a speaker recognition apparatus with an improved speaker recognition rate.

FIG. 1 is an exemplary block diagram of a speaker recognition apparatus according to an embodiment of the present invention.
FIG. 2 is an exemplary block diagram of a feature parameter extraction unit according to an embodiment of the present invention.
FIG. 3 is an exemplary flowchart of a method of detecting a representative pitch in the feature parameter extraction unit of FIG. 2.
FIGS. 4A and 4B are graphs for explaining a method of detecting a representative pitch in an input signal according to an embodiment of the present invention.
FIG. 5 is an analysis graph of an autocorrelation function for the periodic signal detected by the periodic signal detection unit of the feature parameter extraction unit of FIG. 2.
FIG. 6 is an exemplary flowchart of a method for detecting an appropriate pitch in the feature parameter extraction unit of FIG. 2.
FIG. 7 is a graph for explaining a method of detecting a proper pitch in an input signal according to an embodiment of the present invention.
FIG. 8 is an exemplary flowchart of a method of extracting an excitation signal in the excitation signal extraction unit of FIG. 2.
FIG. 9 is a table showing improved speaker recognition rates according to an embodiment of the present invention.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings attached hereto.

FIG. 1 is an exemplary block diagram of a speaker recognition apparatus 10 according to an embodiment of the present invention.

Referring to FIG. 1, the speaker recognition apparatus 10 according to an embodiment of the present invention includes a voice collection unit 110, a voice processing unit 120, and a storage unit 130.

The voice collecting unit 110 collects the voice of the speaker. According to an embodiment of the present invention, the voice collection unit 110 may include a microphone that converts a voice uttered by the speaker into an electrical signal. However, the voice collection unit 110 is not limited to a microphone that collects the voice directly from the speaker; it includes any device that acquires a signal related to the speaker's voice in various ways (for example, through data communication over a network).

The voice processing unit 120 processes the collected voice and determines whether it matches the voice of a registered user. According to an embodiment of the present invention, the voice processing unit 120 includes a processor that processes an electrical signal related to voice (hereinafter referred to as a voice signal) according to a predetermined algorithm; for example, it may include a CPU, but is not limited thereto. The voice processing unit 120 can execute a program stored in the storage unit 130 to process the voice signal, and the data obtained in the process can be stored in the storage unit 130.

The storage unit 130 stores information on the user's voice. According to an embodiment of the present invention, the storage unit 130 is any device capable of storing data or programs; it may include not only large-capacity storage devices such as an HDD or an SSD but also memory such as a cache.

According to an embodiment of the present invention, the voice processing unit 120 may include a feature parameter extraction unit 121 that extracts a feature parameter characterizing the voice of the speaker, in order to determine whether the speaker's voice matches the voice of the previously registered user.

FIG. 2 is an exemplary block diagram of a feature parameter extraction unit according to an embodiment of the present invention.

Referring to FIG. 2, the feature parameter extraction unit 121 may include a periodic signal detection unit 1211, an excitation signal extraction unit 1212, and a feature parameter calculation unit 1213.

The feature parameter calculating section 1213 may include a representative pitch detecting section 12131 for determining a representative pitch value required for excitation signal extraction and feature parameter calculation.

FIG. 3 is an exemplary flowchart of a method of detecting a representative pitch in the representative pitch detecting unit 12131 included in the feature parameter extraction unit 121 of FIG. 2.

Referring to FIG. 3, the representative pitch detector 12131 detects a pitch by analyzing an autocorrelation function. The autocorrelation function indicates the correlation between the values a signal takes at two arbitrary time points; for example, it can show the correlation between the value at time t and the value at time t + τ, delayed from t by a predetermined interval. The pitch can be detected at the position where the autocorrelation function of the speech signal is maximized. Specifically, it can be detected by the following equation.

r_x(τ) = Σ_n x(n)·x(n+τ), with r_x(nT_0) = r_x(0) at τ = nT_0

where n is an integer and T_0 is the fundamental period of the periodic signal.

Referring back to FIG. 3, the representative pitch detector 12131 may sort the pitch values of the input signal in order of magnitude and determine the median value as the representative pitch value.
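For illustration, the following Python sketch (not part of the patent disclosure; the function names, frame handling, and the 50-400 Hz pitch search range are assumptions) detects a per-frame pitch at the autocorrelation peak and takes the median of the sorted pitch values as the representative pitch:

```python
import numpy as np

def detect_pitch_autocorr(frame, fs, f0_min=50.0, f0_max=400.0):
    """Detect the pitch period of one frame at the peak of its
    autocorrelation function r_x(tau) = sum_n x(n) x(n + tau)."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min = int(fs / f0_max)                    # shortest admissible period
    lag_max = min(int(fs / f0_min), len(r) - 1)   # longest admissible period
    return lag_min + int(np.argmax(r[lag_min:lag_max + 1]))  # period in samples

def representative_pitch(frames, fs):
    """Median of the per-frame pitch values: sort the pitch values in
    order of magnitude and take the middle one (unit 12131's rule)."""
    pitches = sorted(detect_pitch_autocorr(f, fs) for f in frames)
    return pitches[len(pitches) // 2]
```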

FIGS. 4A and 4B are graphs for explaining a method of detecting a representative pitch in an input signal according to an embodiment of the present invention.

FIG. 4A shows the distribution of pitch values detected from an exemplary input signal. As shown in FIG. 4A, the detected pitch value may suddenly double (pitch doubling) or suddenly halve (pitch halving). If such pitch values are used to perform recognition, the speech recognition rate and the speaker recognition rate can be lowered. Accordingly, in order to correct such pitch extraction errors, the representative pitch detector 12131 can sort the pitch values in order of magnitude and determine the median value as the representative pitch value, as shown in FIG. 4B.

The representative pitch detection unit 12131 can determine a representative pitch for an arbitrary speaker and correct, based on the representative pitch value, the pitch values used in subsequent speech recognition and speaker recognition. That is, pitch extraction errors in the speech frame under analysis can be corrected using the representative pitch value. The determined representative pitch value is also used to determine the appropriate pitch value in the analysis of the periodic signal, as described in detail with reference to FIGS. 6 and 7.

Referring back to FIG. 2, the periodic signal detector 1211 distinguishes between the periodic and aperiodic intervals of the input signal.

A voice signal can be given as the input signal. Speech is divided into voiced sounds, which involve vibration of the vocal cords, and unvoiced sounds, which do not. Periodic excitation signals can be detected in voiced intervals. The apparatus for extracting feature parameters according to an embodiment of the present invention detects feature parameters from excitation signals and uses them as auxiliary parameters for improving the recognition rate. Therefore, it is necessary to identify the voiced part of the input signal, that is, its periodic interval.

According to an embodiment of the present invention, the periodic signal detector 1211 detects a periodic interval based on the result of an auto-correlation function applied to the input signal. The periodic signal detection unit 1211 can determine that the input signal is periodic when the resulting value of the autocorrelation function is equal to or greater than a predetermined threshold value.

FIG. 5 is an analysis graph of an autocorrelation function for the periodic signal detected by the periodic signal detection unit of the feature parameter extraction unit of FIG. 2. The periodic signal detector 1211 may analyze the outline of the autocorrelation function graph for the input signal to determine the periodicity of the signal. Referring to FIG. 5, the input signal can be determined to be a periodic signal in an interval in which the value of R(T)/R(0) over T is equal to or greater than the threshold value R_Th. That is, when the input signal is x(n) and R(T)/R(0) for the input signal is equal to or greater than a predetermined threshold value, the periodic signal detector 1211 can determine that the input signal is a periodic signal with period T, where R(T) = Σ_n x(n)·x(n+T).
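A minimal sketch of this decision rule in Python (the 0.5 threshold and the function name are assumptions; the patent only requires R(T)/R(0) to meet a predetermined threshold R_Th):

```python
import numpy as np

def is_periodic(frame, period, r_th=0.5):
    """Return True if the frame is periodic with candidate period T:
    the normalized autocorrelation R(T)/R(0) must reach the threshold."""
    n = len(frame) - period
    r0 = float(np.dot(frame, frame))                         # R(0)
    rT = float(np.dot(frame[:n], frame[period:period + n]))  # R(T)
    return r0 > 0.0 and rT / r0 >= r_th
```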

Referring again to FIG. 2, the feature parameter calculator 1213 includes a proper pitch detector 12132 that determines an appropriate pitch value among the pitch values in the periodic interval of the input signal. FIG. 6 is an exemplary flowchart of a method of detecting an appropriate pitch in the proper pitch detector 12132.

Referring to FIG. 6, the proper pitch detector 12132 may estimate an autocorrelation function for the input signal and analyze the outline of the autocorrelation graph to determine the appropriate pitch value. FIG. 7 is a graph for explaining a method of detecting a proper pitch in an input signal according to an embodiment of the present invention. The proper pitch detector 12132 can select candidate pitch values based on the autocorrelation function. As shown in FIG. 7, the proper pitch detector 12132 can then determine the appropriate pitch value based on the representative pitch value among the candidate pitch values: the candidate closest to the representative pitch value is chosen. By using the value closest to the representative pitch value as the appropriate pitch value, the possibility of the pitch doubling and pitch halving phenomena described above can be reduced.
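A possible sketch of the candidate selection (the number of candidates kept and the peak-picking rule are assumptions; the patent only says the candidate closest to the representative pitch is chosen):

```python
import numpy as np

def proper_pitch(frame, representative, n_candidates=5):
    """Pick, among autocorrelation-peak candidate periods, the one
    closest to the representative pitch value."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    # Local maxima of the autocorrelation are the candidate pitch periods.
    peaks = [t for t in range(1, len(r) - 1) if r[t - 1] < r[t] >= r[t + 1]]
    candidates = sorted(peaks, key=lambda t: r[t], reverse=True)[:n_candidates]
    if not candidates:
        return representative          # fall back to the representative pitch
    return min(candidates, key=lambda t: abs(t - representative))
```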

FIG. 8 is an exemplary flowchart of a method of extracting an excitation signal in the excitation signal extraction unit 1212 of FIG. 2.

Referring to FIG. 8, the excitation signal extraction unit 1212 may include a pre-emphasis unit. The pre-emphasis unit may pre-process the input signal s(n) of the periodic section to output a pre-processed signal s_pre(n) in which the high-frequency components lost in the process of generating the input signal are compensated. High-frequency components are lost as the sound radiates from the lips into free space when the speaker utters voice. The pre-emphasis unit applies a pre-emphasis filter to the input signal to compensate for the lost high-frequency components. For example, the pre-emphasis filter may be implemented to have a transfer function as shown in the following equation.

H(z) = 1 - α·z^(-1), (0.9 ≤ α ≤ 1)

Typically, α = 0.97.
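In code, this filter is a one-line difference equation; a minimal sketch (keeping the first sample unchanged is an assumption, since the patent does not define the boundary condition):

```python
import numpy as np

def pre_emphasis(s, alpha=0.97):
    """Apply H(z) = 1 - alpha*z^-1: s_pre(n) = s(n) - alpha * s(n - 1)."""
    s = np.asarray(s, dtype=float)
    return np.concatenate(([s[0]], s[1:] - alpha * s[:-1]))
```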

Referring to FIG. 8, the excitation signal extraction unit 1212 may further include an autocorrelation function estimator that estimates the autocorrelation function of the pre-processed signal and outputs the result, a prediction coefficient calculator that receives the output of the autocorrelation function estimator and outputs prediction coefficients a_k based on the Levinson-Durbin algorithm, and an inverse filtering unit that performs inverse filtering based on the pre-processed signal and the prediction coefficients to output the excitation signal e(n).

The inverse filtering unit may be implemented to have a transfer function expressed by the following equation.

A(z) = 1 - Σ_{k=1}^{P} a_k·z^(-k)

Here, P may be the appropriate pitch value detected by the proper pitch detector 12132. If the detected appropriate pitch value is not an integer, the integer value of P can be obtained through a process such as rounding.
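A sketch of this chain in Python, assuming (per the text) that the prediction order equals the rounded appropriate pitch value P; the Levinson-Durbin recursion is written out since it is short, and all function names are mine:

```python
import numpy as np
from scipy.signal import lfilter

def levinson_durbin(r, order):
    """Levinson-Durbin recursion: from autocorrelation values r[0..order],
    compute the prediction-error filter A(z) = 1 - sum_k a_k z^-k,
    returned here as its coefficient vector [1, -a_1, ..., -a_P]."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err                        # reflection coefficient
        a[1:i + 1] = a[1:i + 1] + k * a[i - 1::-1]
        err *= 1.0 - k * k                    # updated prediction error
    return a, err

def extract_excitation(s_pre, pitch_period):
    """Inverse-filter the pre-emphasized signal with the LPC error filter
    to obtain the excitation signal e(n)."""
    order = int(round(pitch_period))          # P: rounded appropriate pitch
    r = np.correlate(s_pre, s_pre, mode="full")[len(s_pre) - 1:]
    a, _ = levinson_durbin(r, order)
    return lfilter(a, [1.0], s_pre)           # e(n) = A(z) applied to s_pre(n)
```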

According to an embodiment of the present invention, the feature parameter calculator 1213 performs a discrete Fourier transform on the excitation signal e(n) based on the appropriate pitch value P to obtain its discrete Fourier spectrum (DFS).

The feature parameter calculating unit 1213 can extract the discrete Fourier spectrum (DFS) coefficients of the excitation signal and use them as feature parameters that characterize the input signal. The excitation signal e(n) extracted by the excitation signal extraction unit 1212 is analyzed over one pitch period, n = 0, 1, ..., p-1, where p may be the pitch value detected by the proper pitch detector; if the detected pitch value is not an integer, its rounded value may be used as p.

The excitation signal e(n) is subjected to a discrete Fourier transform to obtain the discrete Fourier spectrum of the excitation signal. The excitation signal e(n) can be transformed as follows:

A_k = Σ_{n=0}^{p-1} e(n)·cos(2πkn/p)

B_k = Σ_{n=0}^{p-1} e(n)·sin(2πkn/p)

Here, the feature parameters that characterize the speech signal can be calculated using the DFS coefficients A_k and B_k.

In one embodiment, the feature parameter calculator 1213 can obtain the DFS magnitude from the DFS coefficients as follows:

E_k = √(A_k² + B_k²)

According to one embodiment, the feature parameter calculating unit 1213 can calculate a feature vector in which each harmonic of the frequency corresponding to the pitch value is paired with the logarithm of the magnitude of the corresponding DFS coefficient:

v = {(k·f_0, log E_k)}, k = 1, ..., K

where f_0 is the frequency corresponding to the pitch value, K is the number of harmonics, and E_k is the magnitude of the k-th DFS coefficient extracted from the excitation signal, normalized by the energy of the input signal.
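A sketch of this pitch-synchronous DFS feature computation (the normalization detail and the epsilon guard are assumptions; the patent states only that E_k is normalized by the energy of the input signal):

```python
import numpy as np

def dfs_log_magnitudes(e, p, num_harmonics, frame_energy):
    """Compute A_k, B_k over one pitch period n = 0..p-1 of the excitation
    e(n), then log of the magnitude E_k = sqrt(A_k^2 + B_k^2) normalized
    by the frame energy.  Each log E_k is paired with the harmonic
    frequency k*f0 to form the feature vector."""
    n = np.arange(p)
    feats = []
    for k in range(1, num_harmonics + 1):
        a_k = np.sum(e[:p] * np.cos(2.0 * np.pi * k * n / p))
        b_k = np.sum(e[:p] * np.sin(2.0 * np.pi * k * n / p))
        e_k = np.sqrt(a_k ** 2 + b_k ** 2) / np.sqrt(frame_energy)
        feats.append(np.log(e_k + 1e-12))     # log magnitude at harmonic k
    return np.array(feats)
```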

Referring again to FIG. 2, the feature parameter calculation unit 1213 according to an embodiment of the present invention may also extract Mel-Frequency Cepstral Coefficients (MFCC) from the input signal.

According to one embodiment, the feature parameter calculation unit 1213 may include a mel-frequency response acquiring unit that obtains a mel-frequency response by applying a mel-frequency filter to the frequency response spectrum of the input signal over its periodic and aperiodic intervals, and a cepstral coefficient acquiring unit that obtains cepstrum coefficients by inverse discrete cosine transform of the mel-frequency response. The feature parameter calculation unit 1213 can thus extract the MFCC of the input signal and use it as a feature parameter.

The feature parameter calculation unit 1213 according to an embodiment of the present invention may extract not only the MFCC but also the DFS coefficients of the excitation signal as feature parameters. Using the DFS coefficients as auxiliary feature parameters makes it possible to improve both the speech recognition rate and the speaker recognition rate.
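For the MFCC path, a standard library call suffices as a sketch (assuming librosa is available; the file path, sample rate, and coefficient count are placeholders, not values from the patent):

```python
import librosa

# Placeholder input; the patent does not specify file format or parameters.
y, sr = librosa.load("speech.wav", sr=16000)
# librosa applies a mel filterbank to the power spectrum and then a DCT of
# the log-mel energies (the step the text describes as an inverse DCT of
# the mel-frequency response), yielding cepstral coefficients.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # shape: (13, n_frames)
```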

Referring again to FIG. 1, in the speaker recognition apparatus 10 according to an embodiment of the present invention, the voice processing unit 120 processes the collected voice to determine whether or not it matches the voice of a previously registered user.

According to one embodiment, the voice processing unit 120 can calculate a score for the speaker based on the feature parameters extracted by the feature parameter extraction unit 121 and recognize the speaker accordingly. Specifically, the speaker can be recognized by scoring the degree of similarity between the Gaussian mixture model generated for each specific speaker and the feature parameters extracted from the input speech signal. The Gaussian mixture models may be stored in the storage unit 130.

In one embodiment, the score may be calculated as:

Score(i) = Σ_t log P(x_t | S_i)

Here, x_t is the feature vector in the analyzed speech frame, and S_i is the set of Gaussian mixture model parameters for a particular speaker. P(x_t | S_i) represents the probability that the feature vector x_t occurs in that speaker's speech.

According to an embodiment of the present invention, the score may be calculated by applying the various parameters extracted by the feature parameter extraction unit 121. In particular, an excitation signal is extracted from the periodic signal to generate an auxiliary parameter, thereby improving the accuracy of the score. When a score is calculated using a plurality of parameters, it can be computed by giving a weight to each parameter. The weights can be determined by an exhaustive search.
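A sketch of such a weighted combination (per-stream GMMs via scikit-learn are my substitution for the patent's unspecified implementation; averaging the frame log-likelihoods is also an assumption):

```python
from sklearn.mixture import GaussianMixture

def speaker_score(gmm_per, gmm_aper, gmm_dfs, x_per, x_aper, x_dfs,
                  alpha, beta, gamma):
    """Weighted sum of per-stream GMM scores: MFCC of the periodic signal,
    MFCC of the aperiodic signal, and DFS features of the excitation,
    mirroring the weights alpha, beta, gamma of FIG. 9.  Each gmm_* is a
    GaussianMixture fitted to one feature stream of a registered speaker;
    .score() returns the mean log-likelihood per frame."""
    return (alpha * gmm_per.score(x_per)
            + beta * gmm_aper.score(x_aper)
            + gamma * gmm_dfs.score(x_dfs))
```

The weights themselves could then be chosen by the exhaustive search the text mentions, for example a grid over (α, β, γ) evaluated on held-out recordings.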

FIG. 9 is a table showing improved speaker recognition rates according to an embodiment of the present invention. Referring to FIG. 9, speaker recognition rates are shown for speaker recognition performed with scores calculated according to an embodiment of the present invention and according to a comparative example. In the embodiment, both the MFCC parameters and the DFS coefficient parameters of the excitation signal are used; in the comparative example, only the MFCC parameters are used. The score weights α, β, and γ represent the weights assigned to the scores calculated from the MFCC extracted from the periodic signal, the MFCC extracted from the aperiodic signal, and the DFS coefficient parameters extracted from the excitation signal, respectively.

As shown in the table of FIG. 9, according to an embodiment of the present invention, the speaker recognition rate is further improved when additional parameters extracted from the excitation signal are used.

While the present invention has been described with reference to the exemplary embodiments, it is to be understood that the invention is not limited to the disclosed exemplary embodiments. Those skilled in the art will appreciate that various modifications may be made to the embodiments described above. The scope of the present invention is defined only by the interpretation of the appended claims.

10: Speaker recognition device
110: Voice collection unit
120: Voice processing unit
121: Feature parameter extracting unit
130: Storage unit
1211: Periodic signal detection unit
1212: Excitation signal extraction unit
1213: Feature parameter calculating section

Claims (13)

A periodic signal detector for receiving an input signal and detecting a periodic interval of the signal;
An excitation signal extractor for extracting an excitation signal in a periodic interval of the input signal;
A feature parameter calculating unit for calculating a feature parameter that characterizes the input signal based on a frequency response spectrum of the excitation signal;
A representative pitch detector for determining a representative pitch value among pitch values of the input signal; and
An appropriate pitch detector for determining an appropriate pitch value, based on the representative pitch value, among the pitch values in the periodic interval of the input signal,
Wherein the feature parameter extraction unit calculates the feature parameter based on the appropriate pitch value.
The apparatus according to claim 1,
Wherein the periodic signal detection unit comprises:
detects a periodic interval based on the result of an auto-correlation function applied to the input signal.
The apparatus according to claim 2,
Wherein the periodic signal detection unit comprises:
determines that the input signal is a periodic signal with period T when R(T)/R(0) for the input signal x(n) is equal to or greater than a predetermined threshold value,
wherein R(T) = Σ_n x(n)·x(n+T).
(Deleted)
The apparatus according to claim 1,
Wherein the representative pitch detector arranges the pitch values of the input signal in order of magnitude and determines the median value as the representative pitch value.
The apparatus according to claim 1,
Wherein the proper pitch detector determines, as the appropriate pitch value, the value closest to the representative pitch value among the pitch values selected based on the result of the autocorrelation function for the input signal of the periodic interval.
The apparatus according to claim 1,
Wherein the excitation signal extracting unit comprises:
a pre-emphasis unit for pre-processing the input signal of the periodic section and outputting a pre-processed signal in which high-frequency components lost in the process of generating the input signal are compensated.
The apparatus according to claim 7,
Wherein the excitation signal extracting unit comprises:
an autocorrelation function estimator for outputting the result of the autocorrelation function for the pre-processed signal;
a prediction coefficient calculator for receiving the output of the autocorrelation function estimator and outputting prediction coefficients based on a Levinson-Durbin algorithm; and
an inverse filtering unit for performing inverse filtering based on the pre-processed signal and the prediction coefficients to output the excitation signal.
The apparatus according to claim 1,
Wherein the feature parameter calculating unit comprises:
a frequency domain transformer for transforming the excitation signal into a discrete Fourier spectrum in the frequency domain by performing a discrete Fourier transform based on the appropriate pitch value.
The apparatus according to claim 9,
Wherein the feature parameter calculating unit comprises:
calculates the logarithm of the magnitude of the discrete Fourier spectrum as the feature parameter.
The apparatus according to claim 9,
Wherein the feature parameter calculating unit comprises:
calculates, as a feature parameter, a value obtained by applying predetermined weights to the energies of the orders of the discrete Fourier spectrum and summing them.
The apparatus according to claim 1,
Wherein the feature parameter extracting unit comprises:
a mel-frequency response acquiring unit for acquiring a mel-frequency response by applying a mel-frequency filter to a frequency response spectrum of the input signal with respect to a periodic interval and an aperiodic interval of the input signal; and
a cepstrum coefficient acquiring unit for obtaining a cepstrum coefficient by inverse discrete cosine transform of the mel-frequency response.
A voice collecting unit for collecting the voice of the speaker;
A voice processing unit for processing the collected voice and determining whether it matches the voice of a previously registered user; and
A storage unit for storing information on the voice of the user,
Wherein the voice processing unit comprises:
A periodic signal detector for detecting a periodic interval of an input signal;
An excitation signal extractor for extracting an excitation signal in a periodic interval of the input signal;
A feature parameter calculating unit for calculating a feature parameter that characterizes the input signal based on a frequency response spectrum of the excitation signal;
A representative pitch detector for determining a representative pitch value among the pitch values of the input signal; and
An appropriate pitch detector for determining an appropriate pitch value, based on the representative pitch value, among the pitch values in the periodic interval of the input signal,
Wherein the feature parameter extraction unit calculates the feature parameter based on the appropriate pitch value.
KR1020150183897A 2015-12-22 2015-12-22 Apparatus for extracting feature parameter of input signal and apparatus for recognizing speaker using the same KR101671305B1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
KR1020150183897A KR101671305B1 (en) 2015-12-22 2015-12-22 Apparatus for extracting feature parameter of input signal and apparatus for recognizing speaker using the same
PCT/KR2016/014673 WO2017111386A1 (en) 2015-12-22 2016-12-14 Apparatus for extracting feature parameters of input signal, and speaker recognition apparatus using same

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
KR1020150183897A KR101671305B1 (en) 2015-12-22 2015-12-22 Apparatus for extracting feature parameter of input signal and apparatus for recognizing speaker using the same

Publications (1)

Publication Number Publication Date
KR101671305B1 true KR101671305B1 (en) 2016-11-02

Family

ID=57518247

Family Applications (1)

Application Number Title Priority Date Filing Date
KR1020150183897A KR101671305B1 (en) 2015-12-22 2015-12-22 Apparatus for extracting feature parameter of input signal and apparatus for recognizing speaker using the same

Country Status (2)

Country Link
KR (1) KR101671305B1 (en)
WO (1) WO2017111386A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108053837A (en) * 2017-12-28 2018-05-18 深圳市保千里电子有限公司 A kind of method and system of turn signal voice signal identification
KR20220065343A (en) 2020-11-13 2022-05-20 서울시립대학교 산학협력단 Apparatus for simultaneously performing spoofing attack detection and speaker recognition based on deep neural network and method therefor

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113012716B (en) * 2021-02-26 2023-08-04 武汉星巡智能科技有限公司 Infant crying type identification method, device and equipment

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20050048214A * 2003-11-19 2005-05-24 학교법인연세대학교 Method and system for pitch synchronous feature generation of speaker recognition system

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7684988B2 (en) * 2004-10-15 2010-03-23 Microsoft Corporation Testing and tuning of automatic speech recognition systems using synthetic inputs generated from its acoustic models
KR100933946B1 (en) * 2007-10-29 2009-12-28 연세대학교 산학협력단 Feature vector extraction method using adaptive selection of frame shift and speaker recognition system thereof
KR20100036893A (en) * 2008-09-30 2010-04-08 삼성전자주식회사 Speaker cognition device using voice signal analysis and method thereof

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20050048214A * 2003-11-19 2005-05-24 학교법인연세대학교 Method and system for pitch synchronous feature generation of speaker recognition system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
강지훈, 정상배, "Performance Improvement of Speaker Recognition Using Voiced/Unvoiced Classification and Dual Feature Parameter Combination," Journal of the Korea Institute of Information and Communication Engineering, Vol. 18, No. 6, pp. 1294-1301, June 2014. *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108053837A (en) * 2017-12-28 2018-05-18 深圳市保千里电子有限公司 A kind of method and system of turn signal voice signal identification
KR20220065343A (en) 2020-11-13 2022-05-20 서울시립대학교 산학협력단 Apparatus for simultaneously performing spoofing attack detection and speaker recognition based on deep neural network and method therefor

Also Published As

Publication number Publication date
WO2017111386A1 (en) 2017-06-29

Similar Documents

Publication Publication Date Title
CN106935248B (en) Voice similarity detection method and device
Tiwari MFCC and its applications in speaker recognition
Sahidullah et al. A comparison of features for synthetic speech detection
US9536547B2 (en) Speaker change detection device and speaker change detection method
CN108281146B (en) Short voice speaker identification method and device
Rakesh et al. Gender Recognition using speech processing techniques in LABVIEW
US20130035933A1 (en) Audio signal processing apparatus and audio signal processing method
CN108305639B (en) Speech emotion recognition method, computer-readable storage medium and terminal
Vyas A Gaussian mixture model based speech recognition system using Matlab
CN108682432B (en) Speech emotion recognition device
Chaudhary et al. Gender identification based on voice signal characteristics
WO2018095167A1 (en) Voiceprint identification method and voiceprint identification system
KR101671305B1 (en) Apparatus for extracting feature parameter of input signal and apparatus for recognizing speaker using the same
AboElenein et al. Improved text-independent speaker identification system for real time applications
Srinivas et al. Relative phase shift features for replay spoof detection system
EP0474496B1 (en) Speech recognition apparatus
Singh et al. Novel feature extraction algorithm using DWT and temporal statistical techniques for word dependent speaker’s recognition
Maazouzi et al. MFCC and similarity measurements for speaker identification systems
Sadjadi et al. Robust front-end processing for speaker identification over extremely degraded communication channels
Kaminski et al. Automatic speaker recognition using a unique personal feature vector and Gaussian Mixture Models
Kumar et al. Effective preprocessing of speech and acoustic features extraction for spoken language identification
Natarajan et al. Segmentation of continuous Tamil speech into syllable like units
US20090063149A1 (en) Speech retrieval apparatus
JP4537821B2 (en) Audio signal analysis method, audio signal recognition method using the method, audio signal section detection method, apparatus, program and recording medium thereof
Sharma et al. Speech recognition of Punjabi numerals using synergic HMM and DTW approach

Legal Events

Date Code Title Description
GRNT Written decision to grant
FPAY Annual fee payment

Payment date: 20190930

Year of fee payment: 4