KR100751923B1

KR100751923B1 - Method and apparatus for compensating energy features for robust speech recognition in noise environment

Info

Publication number: KR100751923B1
Application number: KR1020050108236A
Authority: KR
Inventors: 고한석; 이윤재
Original assignee: 고려대학교 산학협력단
Priority date: 2005-11-11
Filing date: 2005-11-11
Publication date: 2007-08-24
Also published as: KR20070050699A

Abstract

An energy feature compensation method and apparatus for speech recognition robust to noisy environments are disclosed. The method comprises: (a) transforming the energy features of speech training data collected in a clean, noise-free environment so that they resemble the energy of a noisy environment; (b) checking whether the minimum energy of the speech data to be recognized is greater than a predetermined target minimum value and, if the minimum energy is smaller than the target minimum value, performing Half-ERN, which applies the ERN transformation only to energies at or below the midpoint between the minimum and maximum energy; and (c) if the minimum energy of the speech data to be recognized is not smaller than the target minimum value, compensating the energy features by a predetermined method.

According to the present invention, energy feature compensation that is more robust across a wider variety of environments becomes possible, which improves the speech recognition rate in noisy environments.

Description

Method and apparatus for compensating energy features for robust speech recognition in noise environment

FIG. 1 is a block diagram of an energy feature compensation apparatus for noise-robust speech recognition according to the present invention.

FIG. 2 is a flowchart of an energy feature compensation method for noise-robust speech recognition according to the present invention.

FIG. 3 shows how speech energy varies over time in a clean, noise-free environment and in a 10 dB car-noise environment.

FIG. 4 is a schematic plot of Equation 1.

FIG. 5 compares the results of the conventional ERN method and the method according to the present invention.

The present invention relates to speech recognition, and more particularly to an energy feature compensation method and apparatus for speech recognition that is robust to noisy environments.

In speech recognition, the conventional energy normalization method subtracts the log energy of the frame with the largest value from the log energy of every frame of the speech signal, so that the maximum energy value becomes 0. This method is supported by the Hidden Markov Model Toolkit (HTK), but it is not robust to noisy environments. More recent methods include a normalization that treats the energy feature like a cepstral feature and applies mean and variance normalization, and the ERN (log-energy dynamic range normalization) method.
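The conventional maximum-based normalization described above can be sketched in a few lines. This is a minimal illustration of the subtraction as the text states it, not HTK's own code; the function name and the sample values are hypothetical.

```python
import numpy as np

def normalize_log_energy_by_max(log_energy):
    """Subtract the utterance's maximum log energy from every frame."""
    log_energy = np.asarray(log_energy, dtype=float)
    # After the subtraction the loudest frame sits at exactly 0.
    return log_energy - log_energy.max()

# Hypothetical per-frame log energies of one utterance.
frames = [8.0, 12.5, 15.0, 9.5]
normalized = normalize_log_energy_by_max(frames)
```

Because every frame is shifted by the same amount, the gap between speech and background frames is untouched, which is why this normalization alone does not remove the mismatch that noise introduces in the low-energy region.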

The ERN method analyzes how the energy of clean speech is changed by noise and then transforms the energy features of the clean environment so that they resemble those of the noisy environment. Various dynamic ranges (D.R) are applied to the training data to build acoustic models, the D.R is then applied to the recognition data as well, and the D.R that gives the best recognition result is found experimentally. However, the assumption this method relies on, that the maximum speech energy is not affected by noise, holds only at high signal-to-noise ratios (SNR). In low-SNR environments the method cannot reduce the large energy mismatch in speech segments, so a higher recognition rate cannot be expected. In addition, the experimentally found D.R is the one with the highest average recognition rate, not the one that performs best at each SNR. This is because, besides the mismatch in the high-energy region, a fixed D.R that raises the low-region energies still leaves a mismatch in the low-energy region that depends on the actual SNR. In practice the D.R that performs best at high SNR differs from the one that performs best at low SNR, so the method is not robust across varied environments.

The technical problem addressed by the present invention is to provide an energy feature compensation method and apparatus for noise-robust speech recognition that enables energy compensation robust to varied environments and varied signal-to-noise ratios. To this end, the high energies due to speech in the clean training data are left untransformed and only the low-energy region is ERN-transformed, so that the relatively high-energy region of the recognition data, which matters most for speech recognition, can be converted consistently to clean-environment speech energy by energy subtraction.

To solve the above technical problem, the energy feature compensation method for noise-robust speech recognition according to the present invention comprises: (a) transforming the energy features of speech training data collected in a clean, noise-free environment so that they resemble the energy of a noisy environment; (b) checking whether the minimum energy of the speech data to be recognized is greater than a predetermined target minimum value and, if the minimum energy is smaller than the target minimum value, performing Half-ERN, which applies the ERN transformation only to energies at or below the midpoint between the minimum and maximum energy; and (c) if the minimum energy of the speech data to be recognized is not smaller than the target minimum value, compensating the energy features by a predetermined method.

Step (c) preferably comprises: (c1) if the minimum energy of the speech data to be recognized is not smaller than the target minimum value, checking whether the minimum energy is greater than a predetermined threshold; and (c2) subtracting energy if the minimum energy of the recognition data is greater than the predetermined threshold energy, and otherwise transforming the energy of the recognition data at values at or below the threshold energy using the inverse ERN function. The transformation in step (a) is preferably performed using the Half-ERN method, which applies the ERN transformation only to energies at or below the midpoint between the minimum and maximum energy. The energy subtraction preferably assumes that the first 10 frames of the input contain no speech, estimates the noise energy in the non-log domain from the first 10 frames of the recognition data, subtracts it, and then takes the logarithm again.

To solve the above technical problem, the energy feature compensation apparatus for noise-robust speech recognition according to the present invention comprises: an energy conversion unit that transforms the energy features of speech training data collected in a clean, noise-free environment so that they resemble the energy of a noisy environment; a target minimum value comparison unit that checks whether the minimum energy of the speech data to be recognized is greater than a predetermined target minimum value; a Half-ERN conversion unit that, if the target minimum value comparison unit finds the minimum energy smaller than the target minimum value, performs Half-ERN, which applies the ERN transformation only to energies at or below the midpoint between the minimum and maximum energy; a threshold comparison unit that, if the target minimum value comparison unit finds the minimum energy not smaller than the target minimum value, checks whether the minimum energy is greater than a predetermined threshold; an energy subtraction unit that subtracts energy if the threshold comparison unit finds the minimum energy of the recognition data greater than the predetermined threshold energy; and an inverse ERN unit that, if the threshold comparison unit finds the minimum energy of the recognition data not greater than the predetermined threshold energy, transforms the energy of the recognition data at values at or below the threshold energy using the inverse ERN function.

The transformation in the energy conversion unit is preferably performed using the Half-ERN method, which applies the ERN transformation only to energies at or below the midpoint between the minimum and maximum energy. The energy subtraction preferably assumes that the first 10 frames of the input contain no speech, estimates the noise energy in the non-log domain from the first 10 frames of the recognition data, subtracts it, and then takes the logarithm again.

A computer-readable recording medium on which a program for executing the above invention on a computer is recorded is also provided.

Hereinafter, the energy feature compensation method and apparatus for noise-robust speech recognition according to the present invention are described in detail with reference to the accompanying drawings.

FIG. 1 is a block diagram of the energy feature compensation apparatus for noise-robust speech recognition according to the present invention. The apparatus comprises an energy conversion unit (100), a target minimum value comparison unit (120), a Half-ERN conversion unit (140), a threshold comparison unit (160), an energy subtraction unit (180), and an inverse ERN unit (190).

The energy conversion unit (100) transforms the energy features of speech training data collected in a clean, noise-free environment so that they resemble the energy of a noisy environment. This transformation is preferably performed using the Half-ERN method, which applies the ERN transformation only to energies at or below the midpoint between the minimum and maximum energy.

The target minimum value comparison unit (120) checks whether the minimum energy of the speech data to be recognized is greater than a predetermined target minimum value. If the target minimum value comparison unit (120) finds the minimum energy smaller than the target minimum value, the Half-ERN conversion unit (140) performs Half-ERN, which applies the ERN transformation only to energies at or below the midpoint between the minimum and maximum energy.

If the target minimum value comparison unit (120) finds that the minimum energy of the speech data to be recognized is not smaller than the target minimum value, the threshold comparison unit (160) checks whether the minimum energy is greater than a predetermined threshold.

The energy subtraction unit (180) subtracts energy if the threshold comparison unit (160) finds the minimum energy of the recognition data greater than the predetermined threshold energy. The energy subtraction preferably assumes that the first 10 frames of the input contain no speech, estimates the noise energy in the non-log domain from the first 10 frames of the recognition data, subtracts it, and then takes the logarithm again. The inverse ERN unit (190) transforms the energy of the recognition data at values at or below the threshold energy using the inverse ERN function if the threshold comparison unit (160) finds the minimum energy of the recognition data not greater than the predetermined threshold energy.

FIG. 2 is a flowchart of the energy feature compensation method for noise-robust speech recognition according to the present invention. The method is described below with reference to FIG. 2.

FIG. 3 shows how speech energy varies over time in a clean, noise-free environment and in a 10 dB car-noise environment. Referring to FIG. 3, the energy of the clean environment changes most in the low-energy region. Based on this observation, the conventional ERN method first converts the energy features of the clean training data into an estimate of noisy-environment energy using the ERN transformation equation (Equation 1).

Here, Min and Max denote the minimum and maximum log energy of the input speech data, log_energy_i denotes the log energy of the speech data before the transformation, and new_log_energy_i denotes the log energy of the speech data after the transformation.

FIG. 4 depicts Equation 1 schematically. As shown in FIG. 3, if the maximum energy of the clean environment is assumed to be unaffected by noise, the transformation keeps the maximum energy fixed and changes the low-energy region the most.
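Equation 1 itself is not reproduced in this text, so the following is only a sketch under an assumption: a linear map that keeps Max fixed and lifts Min to the target minimum, which is the behaviour the description attributes to ERN. The function name `ern_transform` and the sample values are hypothetical.

```python
import numpy as np

def ern_transform(log_energy, target_min):
    """Linear range normalization: Max stays fixed, Min is raised to target_min."""
    e = np.asarray(log_energy, dtype=float)
    e_min, e_max = e.min(), e.max()
    # Nothing to do when the minimum is already at or above the target,
    # or when the utterance has no dynamic range at all.
    if e_min >= target_min or e_max == e_min:
        return e.copy()
    # Assumed form: the lift is largest at Min and shrinks to zero at Max.
    scale = (target_min - e_min) / (e_max - e_min)
    return e + scale * (e_max - e)

clean = [2.0, 6.0, 10.0, 18.0]          # hypothetical clean log energies
noisy_like = ern_transform(clean, target_min=10.0)
```

With these values the minimum moves from 2 to 10 while the maximum stays at 18, so low-energy frames are lifted far more than high-energy ones.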

In the present invention, however, unlike the conventional ERN, only energies at or below the midpoint between the minimum and maximum energy are transformed by the ERN equation; this is called Half-ERN.
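One way to read Half-ERN is sketched below. The text only says that energies at or below the midpoint are transformed, so the details here are assumptions: the linear map uses the midpoint as its fixed upper anchor so that the result stays continuous, and the name `half_ern_transform` is hypothetical. A literal reading that applies the full-range equation unchanged to the low frames is also possible.

```python
import numpy as np

def half_ern_transform(log_energy, target_min):
    """Lift only frames at or below the Min/Max midpoint; leave the rest untouched."""
    e = np.asarray(log_energy, dtype=float)
    e_min, e_max = e.min(), e.max()
    midpoint = 0.5 * (e_min + e_max)
    if e_min >= target_min or midpoint == e_min:
        return e.copy()
    # Assumption: map [Min, midpoint] linearly onto [target_min, midpoint].
    scale = (target_min - e_min) / (midpoint - e_min)
    low = e <= midpoint
    out = e.copy()
    out[low] = e[low] + scale * (midpoint - e[low])
    return out

clean = [2.0, 6.0, 10.0, 18.0]          # hypothetical clean log energies
half = half_ern_transform(clean, target_min=6.0)
```

Frames above the midpoint (here 18.0) keep their clean values, which matches the stated intent of leaving the high speech energies of the training data unchanged.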

The maximum speech energy is less affected by noise because of the properties of the logarithm, but in low-SNR environments even the maximum speech energy is changed by noise, and the conventional ERN algorithm cannot solve this problem. In the present invention, therefore, the relatively high energies of the training data are not transformed and are left in their clean-environment state. Because the noise energy is estimated from the recognition data and subtracted in the high-energy region, the relatively high energies of the noisy environment can be converted reliably into clean-environment energies regardless of the SNR of the recognition environment.

The target minimum value is determined by a fixed dynamic range (D.R) as in Equation 2, and the transformation is a linear one that raises the minimum energy up to the target minimum value.

For the recognition data, the energy features are compensated as follows. First, when speech is input (step 200), the minimum energy is compared with the target minimum value (step 210); if the minimum energy is lower than the target minimum value, the same Half-ERN as for the training data is applied (step 220). In the conventional ERN, nothing is done when the minimum log energy of the recognition data is higher than the target minimum value, so the mismatch that remains in the low-energy region is not resolved. The present invention performs the following steps to reduce that mismatch further.

If the minimum energy of the recognition data is higher than the target minimum value (step 210), a different algorithm is applied on each side of the threshold determined by Equation 3.

Here, alpha is a value between 0 and 1, and Noisy_Min and Noisy_Max denote the minimum and maximum log energy of the test data in the noisy environment. In Half-ERN alpha is 0.5, but for the noisy environment it is set to 0.75, because noise changes the low energies considerably and the threshold therefore needs to give more weight to the minimum energy value.
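Equation 3 is not reproduced in this text. A weighted average of Noisy_Min and Noisy_Max is consistent with the two facts given: alpha = 0.5 yields the Half-ERN midpoint, and alpha = 0.75 puts more weight on the minimum. The sketch below assumes that form; the function name and the sample energies are hypothetical.

```python
def compensation_threshold(noisy_min, noisy_max, alpha=0.75):
    """Assumed form of the threshold: alpha weights the minimum log energy."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * noisy_min + (1.0 - alpha) * noisy_max

# Hypothetical noisy-utterance extremes.
threshold_noisy = compensation_threshold(8.0, 16.0)             # alpha = 0.75
threshold_half = compensation_threshold(8.0, 16.0, alpha=0.5)   # midpoint
```

Under this assumption the noisy-environment threshold sits a quarter of the way up from the minimum, lower than the midpoint used by Half-ERN.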

As mentioned above, the minimum energy is first compared with the threshold (step 230), and energy subtraction is applied to values at or above the threshold (step 240). Before the logarithm is applied, the clean-environment energy and the noise energy are additive, so the average energy of the first 10 frames, which are assumed to contain no speech, is taken as the noise energy estimate and subtracted. Through this energy subtraction, the relatively high energies are converted consistently to clean-environment energies regardless of the noise environment and SNR. To prevent the result of the subtraction from becoming too small or negative, the log energy is floored at 5, a value obtained experimentally as the minimum energy of silence segments in a clean environment.
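The energy subtraction step can be sketched as follows. The 10-frame noise estimate and the log-energy floor of 5 come from the text; the natural logarithm, the function name, and the choice to leave frames at or below the threshold untouched in this function are assumptions made for illustration.

```python
import numpy as np

LOG_ENERGY_FLOOR = 5.0   # floor stated in the text
NUM_NOISE_FRAMES = 10    # leading frames assumed to contain no speech

def subtract_noise_energy(log_energy, threshold):
    """Subtract the estimated noise energy from frames above the threshold."""
    e = np.asarray(log_energy, dtype=float)
    linear = np.exp(e)                                # assumption: natural log
    noise = linear[:NUM_NOISE_FRAMES].mean()          # additive in the linear domain
    # Floor in the linear domain so the logarithm never drops below 5.
    cleaned = np.maximum(linear - noise, np.exp(LOG_ENERGY_FLOOR))
    out = e.copy()
    high = e > threshold
    out[high] = np.log(cleaned[high])
    return out

# Ten hypothetical noise-only frames followed by one loud and one quiet frame.
noisy = [6.0] * 10 + [12.0, 6.5]
compensated = subtract_noise_energy(noisy, threshold=9.0)
```

For a loud frame the correction is small in the log domain, while a frame whose energy is close to the noise estimate would fall to the floor instead of producing a negative argument to the logarithm.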

If the comparison in step 230 shows that the minimum energy is not greater than the threshold, that is, for values at or below the threshold, the minimum energy is still higher than the target minimum value. The inverse of the ERN transformation function, given in Equation 4, is therefore used to convert the low-region energies to lower values so that the minimum energy comes down to the target minimum value (step 250).
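The branching in steps 210 to 250 can be summarized as a per-frame plan. This is one reading of the flow described above rather than the patent's implementation: the function name, the string labels, and treating a frame exactly at the threshold as low-energy are assumptions.

```python
import numpy as np

def plan_compensation(log_energy, target_min, threshold):
    """Return one label per frame naming the compensation branch it would take."""
    e = np.asarray(log_energy, dtype=float)
    # Step 210: utterance-level check of the minimum log energy.
    if e.min() < target_min:
        # Step 220: same Half-ERN as used for the training data.
        return ["half_ern"] * e.size
    # Steps 230-250: high frames get energy subtraction, low frames inverse ERN.
    return ["energy_subtraction" if v > threshold else "inverse_ern" for v in e]

plan_quiet = plan_compensation([3.0, 9.0, 15.0], target_min=5.0, threshold=10.0)
plan_noisy = plan_compensation([8.0, 9.0, 15.0], target_min=5.0, threshold=10.0)
```

The first call models an utterance whose floor is below the target minimum, so only Half-ERN applies; the second models a noisier utterance whose floor has risen above the target, so its frames are split around the threshold.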

Here, estimated_e_i denotes the newly estimated energy, and noisy_e_i denotes the current log energy in the noisy environment before the transformation.
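Equation 4 is not reproduced in this text. The sketch below assumes a linear map that keeps the threshold fixed and pulls the noisy minimum down to the target minimum, which matches the stated goal of the inverse ERN step. The function name is hypothetical, and `noisy_e` and `estimated_e` mirror the symbols noisy_e_i and estimated_e_i.

```python
import numpy as np

def inverse_ern_below_threshold(noisy_e, target_min, threshold):
    """Lower frames at or below the threshold so the minimum reaches target_min."""
    e = np.asarray(noisy_e, dtype=float)
    noisy_min = e.min()
    # Only meaningful when the noisy floor sits above the target and below the threshold.
    if noisy_min <= target_min or threshold <= noisy_min:
        return e.copy()
    # Assumption: map [noisy_min, threshold] linearly onto [target_min, threshold].
    scale = (noisy_min - target_min) / (threshold - noisy_min)
    low = e <= threshold
    out = e.copy()
    out[low] = e[low] - scale * (threshold - e[low])
    return out

noisy_e = [8.0, 9.0, 10.0, 16.0]        # hypothetical noisy log energies
estimated_e = inverse_ern_below_threshold(noisy_e, target_min=6.0, threshold=10.0)
```

The pull-down is largest at the minimum and vanishes at the threshold, so the low-energy region is stretched downward without disturbing frames that are handled by energy subtraction.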

FIG. 5 compares the results of the conventional ERN method and the method according to the present invention. In FIG. 5, the thick line is the energy feature of the clean environment and the dotted line is the compensated energy feature.

In the upper graph of FIG. 5, which shows the conventional method, a mismatch remains in the low-energy, non-speech region, and a mismatch is also visible in the high-energy speech region.

In the lower graph of FIG. 5, which shows the method according to the present invention, the relatively high energies in the speech region become clean-environment energies for both the training data and the recognition data and therefore match closely. In the relatively low-energy region, the energy corresponding to the dotted line in the upper graph is pulled downward, which further reduces the mismatch between the energy of the clean training data and the energy of the noisy recognition data.

Table 1. Speech recognition rate (%) at each SNR

| Method | clean | 20 dB | 15 dB | 10 dB | 5 dB | 0 dB | -5 dB | Average |
| Baseline | 98.96 | 97.41 | 90.04 | 67.01 | 34.09 | 14.46 | 9.39 | 60.60 |
| ERN | 98.90 | 96.72 | 94.27 | 85.54 | 59.23 | 27.02 | 10.71 | 72.56 |
| Developed method | 98.96 | 97.67 | 95.44 | 88.1 | 68.51 | 34.48 | 11.01 | 76.84 |
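The text does not say which conditions the Average column of Table 1 covers. A quick check suggests it is the mean over the 20 dB to 0 dB conditions, excluding clean and -5 dB; that interpretation is an inference from the arithmetic, not a statement from the source. The variable names are hypothetical.

```python
# Recognition rates from Table 1 for the 20, 15, 10, 5 and 0 dB conditions.
rates_20_to_0_db = {
    "Baseline": [97.41, 90.04, 67.01, 34.09, 14.46],
    "ERN": [96.72, 94.27, 85.54, 59.23, 27.02],
    "Developed method": [97.67, 95.44, 88.1, 68.51, 34.48],
}
reported_average = {"Baseline": 60.60, "ERN": 72.56, "Developed method": 76.84}

# Mean over the five conditions for each method.
computed_average = {
    name: sum(values) / len(values) for name, values in rates_20_to_0_db.items()
}

for name, value in computed_average.items():
    print(f"{name}: computed {value:.2f}, reported {reported_average[name]:.2f}")
```

All three computed means agree with the reported averages to two decimal places, which also confirms that the de-duplicated table values are internally consistent.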

As the experimental results in Table 1 show, in actual speech recognition the method according to the present invention is more robust across SNRs than the conventional ERN method and also achieves a higher overall average recognition rate.

The present invention can be implemented as code readable by a computer (including any device with an information processing function) on a computer-readable recording medium. The computer-readable recording medium includes every kind of recording device that stores data readable by a computer system. Examples include ROM, RAM, CD-ROM, magnetic tape, floppy disks, and optical data storage devices.

Although the present invention has been described with reference to the embodiments shown in the drawings, these are merely illustrative, and those of ordinary skill in the art will understand that various modifications and other equivalent embodiments are possible. The true technical scope of protection of the present invention should therefore be determined by the technical spirit of the appended claims.

With the energy feature compensation method and apparatus for noise-robust speech recognition according to the present invention, energy feature compensation that is more robust across a wider variety of environments becomes possible, which improves the speech recognition rate in noisy environments.

According to the present invention, the high energies corresponding to the speech portions of the training data are left unchanged. When recognition data is input, its relatively high-energy region, which matters most for speech recognition, is converted consistently to clean-environment speech energy by energy subtraction. As a result, energy compensation that is robust to varied environments and varied signal-to-noise ratios is possible. In the low-energy region, the mismatch between training and recognition energy features that remains even after the ERN transformation is further reduced by the inverse ERN transformation, yielding noise-robust speech recognition.

Claims

An energy feature compensation method for speech recognition robust to a noisy environment, comprising:

(a) transforming the energy features of speech training data collected in a clean, noise-free environment so that they resemble the energy of a noisy environment;

(b) checking whether the minimum energy of the speech data to be recognized is greater than a predetermined target minimum value and, if the minimum energy is smaller than the target minimum value, performing Half-ERN, in which only energies at or below the midpoint between the minimum and maximum energy are transformed by the ERN transformation equation that converts the energy of training data from a noise-free environment into the energy of a noisy environment;

(c1) if the minimum energy of the speech data to be recognized is not smaller than the target minimum value, checking whether the minimum energy is greater than a predetermined threshold; and

(c2) subtracting energy if the minimum energy of the recognition data is greater than the predetermined threshold energy, and otherwise transforming the energy of the recognition data at values at or below the threshold energy using the inverse ERN function.

delete

The method of claim 1, wherein the transformation in step (a) of the energy features of the speech training data to resemble the energy of the noisy environment is performed using the Half-ERN method, in which only energies at or below the midpoint between the minimum and maximum energy are transformed by the ERN transformation equation.

The method of claim 3, wherein the threshold is determined by Equation 3:

[Equation 3]

(where alpha is a value between 0 and 1, Noisy_Min is the minimum log energy of the test data in the noisy environment, and Noisy_Max is the maximum log energy of the test data in the noisy environment).

The method of claim 3, wherein the energy subtraction assumes that the first 10 frames of the input contain no speech, estimates the noise energy in the non-log domain from the first 10 frames of the recognition data, subtracts the estimated noise energy, and then takes the logarithm again.

An energy feature compensation apparatus for speech recognition robust to a noisy environment, comprising:

an energy conversion unit that transforms the energy features of speech training data collected in a clean, noise-free environment so that they resemble the energy of a noisy environment;

a target minimum value comparison unit that checks whether the minimum energy of the speech data to be recognized is greater than a predetermined target minimum value;

a Half-ERN conversion unit that, if the target minimum value comparison unit finds the minimum energy of the speech data smaller than the target minimum value, performs Half-ERN, in which only energies at or below the midpoint between the minimum and maximum energy are transformed by the ERN transformation equation that converts the energy of training data from a noise-free environment into the energy of a noisy environment;

a threshold comparison unit that, if the target minimum value comparison unit finds the minimum energy of the speech data to be recognized not smaller than the target minimum value, checks whether the minimum energy is greater than a predetermined threshold;

an energy subtraction unit that subtracts energy if the threshold comparison unit finds the minimum energy of the recognition data greater than the predetermined threshold energy; and

an inverse ERN unit that, if the threshold comparison unit finds the minimum energy of the recognition data not greater than the predetermined threshold energy, transforms the energy of the recognition data at values at or below the threshold energy using the inverse ERN function.

The apparatus of claim 6, wherein the transformation in the energy conversion unit to resemble the energy of the noisy environment is performed using the Half-ERN method, in which only energies at or below the midpoint between the minimum and maximum energy are transformed by the ERN transformation equation.

The apparatus of claim 7, wherein the threshold is determined by Equation 3:

[Equation 3]

The apparatus of claim 7, wherein the energy subtraction assumes that the first 10 frames of the input contain no speech, estimates the noise energy in the non-log domain from the first 10 frames of the recognition data, subtracts the estimated noise energy, and then takes the logarithm again.

A computer-readable recording medium having recorded thereon a program for executing the invention according to any one of claims 1, 3, 4 or 5 on a computer.