KR20060095237A

KR20060095237A - Noise exclusion method of voice recognition system on vehicle

Info

Publication number: KR20060095237A
Application number: KR1020050016724A
Authority: KR
Inventors: 정윤식
Original assignee: 현대자동차주식회사
Priority date: 2005-02-28
Filing date: 2005-02-28
Publication date: 2006-08-31

Abstract

The invention exploits the fact that ambient noise added arithmetically to a speech signal is uncorrelated with that speech signal. Noise is removed before the input speech signal is pattern-matched against the reference signal, so that the noise added to the speech signal is removed effectively.

The method comprises: removing the noise signal in the time domain from a speech signal that is input from a microphone mixed with a noise signal; converting the speech signal from which noise has been removed in the time domain into a frequency function by applying an FFT (or DFT); removing the noise signal contained in the frequency function by using a spectral subtraction technique; and performing speech recognition by comparing the frequency function from which noise has been removed with the feature patterns of reference data stored in a memory.

Speech recognition, noise signal removal, feature pattern, spectral subtraction technique

Description

NOISE EXCLUSION METHOD OF VOICE RECOGNITION SYSTEM ON VEHICLE

FIG. 1 is a schematic block diagram of a speech recognition system according to the present invention.

FIG. 2 is a schematic flowchart of noise removal in the speech recognition system according to the present invention.

FIG. 3 is a flowchart of one embodiment of noise removal in the speech recognition system according to the present invention.

Explanation of reference numerals for the main parts of the drawings:

10: switch
20: microphone
30: amplifier
40: A/D converter
50: control unit
60: memory unit
70: driving unit

The present invention relates to a speech recognition system, and more particularly to a noise removal method for a speech recognition system that effectively removes the noise added to a speech signal by exploiting the fact that ambient noise added arithmetically to speech is uncorrelated with the speech signal.

In the automotive environment, speech recognition is becoming increasingly important as a man-machine interface for controlling the convenience devices and electronic devices installed in a vehicle.

Leading vehicle manufacturers are testing and applying technologies for controlling convenience devices and electronic devices by voice command. However, because in-vehicle speech recognition must operate in a noisy environment, recognition rates are relatively low, and in practice the technology is applied to vehicles only partially.

Therefore, raising the recognition rate in the vehicle noise environment is the key to extending the use of speech recognition in vehicles.

Speech recognition is performed through the following process.

The speech waveform, an analog signal input from the microphone that serves as the voice input means, is first sampled at a suitable frequency for digital signal processing and divided into frames, the unit of analysis.
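The framing step described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the text does not state the frame length or whether frames overlap, so the function name `split_into_frames`, the `frame_len` parameter, and the choice of non-overlapping frames are all assumptions.

```python
def split_into_frames(samples, frame_len):
    """Split a sampled signal into consecutive, non-overlapping frames.

    Assumption: frames do not overlap, and trailing samples that do not
    fill a whole frame are dropped.
    """
    n_frames = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]


# Toy example: 10 samples at 4 samples per frame give 2 full frames.
frames = split_into_frames(list(range(10)), frame_len=4)
print(frames)  # [[0, 1, 2, 3], [4, 5, 6, 7]]
```

Practical systems often use overlapping frames; that variation is left out here to keep the sketch short.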

Each frame is then frequency-analyzed to obtain the desired feature pattern, which is compared with the reference patterns stored in advance as reference data, and the speech recognition result is output.

There are various ways to raise the recognition rate in this procedure, but the most important is to remove the noise added to the speech signal before the pattern comparison.

The reference speech used for comparison was recorded in a noise-free environment when the database was built, and its feature factors were extracted from that recording. Unless noise is completely removed from the input speech, the input will therefore inevitably differ from the reference speech at pattern comparison.

Two noise removal approaches are conventionally applied in speech recognition: arithmetic removal of the noise signal, and removal of the noise signal using an array microphone.

The former measures the level of the noise signal that enters the microphone together with the speech and subtracts that amount. If more than the actual noise is subtracted, the original speech signal is damaged; if less is subtracted, the noise is not removed effectively. In either case, normal and stable speech recognition is not achieved.

The latter is a beamforming method that uses an array of two to four microphones, analyzes the signal input to each microphone, and subtracts the microphone portion into which noise is input. Because it removes noise reasonably effectively for speech recognition, it has recently been adopted actively.

However, although noise removal with an array microphone offers a certain degree of reliability, the expensive array microphone raises the manufacturing cost of the system, and because the signals from several microphones must be analyzed, the computational load increases.

The present invention was devised to solve the above problems. Its object is to remove the noise added to a speech signal effectively by exploiting the fact that ambient noise added arithmetically to a speech signal is uncorrelated with that speech signal, removing the noise before the input speech signal is pattern-matched against the reference signal.

That is, for a speech signal that is input mixed with noise, the noise signal is first removed in the time domain; the signal is then converted into a frequency function through an FFT (Fast Fourier Transform); the residual noise signal is removed a second time in the frequency-function domain; and speech recognition is performed by pattern analysis against the reference signal.

To achieve the above object, the present invention provides a noise removal method in a speech recognition system, comprising: removing the noise signal in the time domain from a speech signal that is input from a microphone mixed with a noise signal; converting the speech signal from which noise has been removed in the time domain into a frequency function by applying an FFT (or DFT); removing the noise signal contained in the frequency function by using a spectral subtraction technique; and performing speech recognition by comparing the frequency function from which noise has been removed with the feature patterns of reference data stored in a memory.

Hereinafter, a preferred embodiment of the present invention is described in detail with reference to the accompanying drawings.

As shown in FIG. 1, the speech recognition system according to the present invention comprises a switch (10), a microphone (20), an amplifier (30), a memory unit (40), a control unit (50), and a driving unit (60).

The switch (10) sets the speech recognition mode when the driver or user wishes to input a voice command. When the driver (user) selects its contact, the corresponding information is input to the control unit (50), which allows the control unit (50) to enter speech recognition mode and to determine the point at which voice input begins.

The microphone (20) consists of one or two microphones. It converts the command speech signal input by the user (driver), together with the ambient noise mixed into it, into an electrical signal and outputs that signal.

The amplifier (30) is connected between the microphone (20) and the control unit (50). It amplifies the signal applied from the microphone (20) to a preset level and applies it to the control unit (50).

The memory unit (40) is connected to the control unit (50) and stores the reference data for speech recognition, namely feature patterns extracted from speech recorded in a noise-free environment. When speech recognition mode is executed, the stored reference data is accessed by the control unit (50).

When the control unit (50) detects entry into speech recognition mode from the signal of the switch (10), it extracts the point at which speech input begins (start point) and, from the signal input to the microphone (20), the interval that actually contains the driver's (user's) speech (endpoint). It then removes the noise signal a first time in the time domain, applies an FFT to convert the signal into a frequency function, removes the residual noise signal in the frequency function, and performs speech recognition by comparing the pattern of the noise-free speech signal with the reference data stored in the memory unit (40).

The driving unit (60) is driven according to the signal that the control unit (50) outputs as a result of speech recognition, and drives the operation of the corresponding convenience device or electronic device.

The operation by which the speech recognition system configured as above performs speech recognition with effective noise removal is outlined below with reference to FIG. 2.

When the control unit (50) of the vehicle-mounted speech recognition system detects that the driver (user) has switched the contact of the switch (10) on, the control unit (50) enters voice-command recognition mode. It takes the signal that is input from the microphone (20) and amplified by the amplifier (30) to a preset level, in which the driver's (user's) command speech is mixed with ambient noise, and removes the noise component through signal processing in the time domain (S101).

Next, an FFT is applied to convert the voltage signal into a frequency function (S102), and the residual noise signal mixed into the speech signal is removed through signal processing on the converted frequency function (S103).

Once the noise mixed into the speech signal has been removed through noise removal in the time domain and in the frequency function as described above, that is, once only the clean speech signal has been extracted, the pattern of the extracted speech signal is compared with the reference data stored in the memory unit (40) to recognize the voice command, and the recognized command is then executed through the driving unit (60) (S104).

In another embodiment of S104, once the noise mixed into the speech signal has been removed in the frequency function, that is, once only the clean speech signal has been extracted, an IFFT (Inverse Fast Fourier Transform) is applied to convert it back into frame-unit time-domain signals. These are then compared with the reference data stored in the memory unit (40) to recognize the voice command, and the recognized command is executed through the driving unit (60) (S104).

Speech recognition with noise removal according to the above operation is described in more detail below with reference to FIG. 3.

In the vehicle-mounted speech recognition system, it is determined whether the control unit (50) detects that the driver (user) has switched the contact of the switch (10) on. When this is detected, the control unit (50) enters voice-command recognition mode (S201).

Next, to perform speech recognition, the control unit (50) detects the endpoint of the input signal (S202) in order to extract the interval that actually contains speech from the signal that is input from the microphone (20) and amplified by the amplifier (30) to a preset level.

Here, the point at which speech input begins for recognition, that is, the moment speech recognition mode is entered, is called the start point, and the point at which speech input ends is called the endpoint.

Endpoint detection extracts only the speech portion of the signal input to the microphone (20) for real-time processing. This minimizes the processing load of handling the speech signal and improves the accuracy and efficiency of speech recognition.

To detect the actual speech signal on the time axis, endpoint detection classifies each frame as speech or non-speech, using the frame energy and the frame zero-crossing rate (ZCR) as parameters.
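The two per-frame parameters named above can be computed as in the following sketch. The function names are hypothetical, and the energy here is the plain sum of squared samples, which is an assumption; the document's own frame energy is defined by its Equation 1. No speech/non-speech thresholds are shown because the text does not give any.

```python
def frame_energy(frame):
    """Sum of squared sample values in one frame (assumed definition)."""
    return sum(x * x for x in frame)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)


# A frame that alternates in sign crosses zero between every pair of samples.
alternating = [1, -1, 1, -1]
print(frame_energy(alternating))        # 4
print(zero_crossing_rate(alternating))  # 1.0
```

A detector would compare these two values against thresholds tuned on recorded data to decide whether a frame contains speech.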

In a vehicle-mounted speech recognition system, the speech signal is input only after the contact of the switch (10) that selects speech recognition mode has been switched on, so endpoint detection is relatively easy.

The speech signal obtained through endpoint detection is divided into frames and stored in a buffer (S203).

The frame-unit energy of the input speech signal is determined by Equation 1.

[Equation 1]

Here, the mean term is the average of the ambient noise signal obtained over approximately 0.25 seconds before the real-time speech signal begins to be detected, Xi is the value of the i-th sample, N is the frame length, and N' is the number of samples in the 0.25-second interval, calculated as FS (sampling frequency) × 0.25 seconds.
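Equation 1 itself is not reproduced in this text. One form that would be consistent with the definitions above is shown here purely as an assumption, not as the patent's actual equation: the noise average is taken over the first N' samples and subtracted from each sample before the squares are summed. The symbols E and \bar{x}_n are introduced only for this illustration.

```latex
\bar{x}_n = \frac{1}{N'} \sum_{i=1}^{N'} x_i ,
\qquad
E = \sum_{i=1}^{N} \left( x_i - \bar{x}_n \right)^2
```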

That is, when the speech recognition system recognizes a speech signal, only the noise signal is input through the microphone (20) for a certain time, approximately 0.25 seconds, between the selection of the contact of the switch (10) that selects entry into voice mode and the arrival of the speech signal.

Therefore, the average of the noise signal over this noise-only interval of approximately 0.25 seconds is calculated, and the noise signal is removed in the time domain by subtracting this average from the entire speech signal.
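The time-domain step described above can be sketched as follows, assuming the recording starts at the moment the switch is pressed so that the first 0.25 seconds contain noise only. The function names and the toy sampling rate are hypothetical.

```python
def estimate_noise_mean(samples, sample_rate, noise_seconds=0.25):
    """Average of the samples in the leading noise-only interval."""
    n_noise = int(sample_rate * noise_seconds)  # N' = FS x 0.25 s
    if n_noise == 0 or n_noise > len(samples):
        raise ValueError("signal does not cover the noise-only interval")
    return sum(samples[:n_noise]) / n_noise


def subtract_noise_mean(samples, noise_mean):
    """Subtract the noise average from every sample of the signal."""
    return [x - noise_mean for x in samples]


# Toy example: at 8 Hz sampling, the first 2 samples (0.25 s) are noise only.
signal = [0.5, 0.5, 1.5, -0.5, 2.5, 0.5, 0.5, 0.5]
mean = estimate_noise_mean(signal, sample_rate=8)
cleaned = subtract_noise_mean(signal, mean)
print(mean)     # 0.5
print(cleaned)  # [0.0, 0.0, 1.0, -1.0, 2.0, 0.0, 0.0, 0.0]
```

Subtracting a single average removes only the constant offset contributed by the noise, which is why the method follows it with a second removal stage in the frequency function.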

After this first removal of noise in the time domain from the input signal, in which speech and noise are mixed, a Hamming window is applied to each frame (S204).

For example, the signal y(m) input from the microphone (20) is a mixture of the speech signal x(m) and the noise signal n(m), so the time-domain input signal can be expressed as Equation 2 below.

y(m) = x(m) + n(m)

Applying a Hamming window w(m) to this time-domain input signal gives Equation 3 below.

y_w(m) = w(m) y(m)
       = w(m) [x(m) + n(m)]
       = x_w(m) + n_w(m)
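The windowing step can be illustrated with the sketch below. The text does not give the window coefficients; the formula used here is the standard symmetric Hamming window, and the function names are hypothetical. The example checks that windowing the mixed signal equals the sum of the separately windowed speech and noise, which is what Equation 3 states.

```python
import math


def hamming_window(n):
    """Standard Hamming window: w(m) = 0.54 - 0.46 * cos(2*pi*m / (n - 1))."""
    if n == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * m / (n - 1)) for m in range(n)]


def apply_window(frame):
    """Return y_w(m) = w(m) * y(m) for one frame."""
    window = hamming_window(len(frame))
    return [w * y for w, y in zip(window, frame)]


# Windowing is linear, so y_w(m) = x_w(m) + n_w(m).
speech = [1.0, 2.0, 3.0, 2.0, 1.0]
noise = [0.1, -0.1, 0.1, -0.1, 0.1]
mixed = [x + n for x, n in zip(speech, noise)]
windowed_mix = apply_window(mixed)
windowed_sum = [a + b for a, b in zip(apply_window(speech), apply_window(noise))]
max_diff = max(abs(a - b) for a, b in zip(windowed_mix, windowed_sum))
```

Here `max_diff` is zero up to floating-point rounding.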

Next, an FFT or DFT (Discrete Fourier Transform) is applied to the frame-unit time-domain signal of Equation 3 to convert it into a frequency function, as in Equation 4 below (S205).

Y(f) = X(f) + N(f)
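Equation 4 follows from the linearity of the Fourier transform, which the sketch below checks numerically. It uses a naive O(N²) DFT written with the standard library so that it is self-contained; an FFT computes the same values faster. The function name `dft` and the toy signals are hypothetical.

```python
import cmath


def dft(frame):
    """Naive discrete Fourier transform of one frame."""
    n = len(frame)
    return [
        sum(frame[m] * cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n))
        for k in range(n)
    ]


speech = [1.0, 0.0, -1.0, 0.0]
noise = [0.5, 0.5, 0.5, 0.5]
mixed = [x + n for x, n in zip(speech, noise)]

# Y(f) computed from the mixture matches X(f) + N(f) computed separately.
Y = dft(mixed)
X_plus_N = [a + b for a, b in zip(dft(speech), dft(noise))]
max_error = max(abs(y - s) for y, s in zip(Y, X_plus_N))
print(max_error < 1e-9)  # True
```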

Then the average magnitude of the noise signal over a certain interval is calculated (S206), and a spectral subtraction technique is applied to the calculated noise average and the frequency function, as in Equation 5 below, to remove the noise signal remaining in the frequency function (S207).

[Equation 5]

Here, α is a coefficient whose value is determined by actual measurement, and the average term is the mean noise spectrum obtained over an interval in which only noise is present, expressed by Equation 6 below.

[Equation 6]

Here, the term being averaged is the spectrum of the i-th noise frame, and k is the number of frames in the noise-only interval.
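The averaging and subtraction described above can be sketched as follows. Several points are assumptions, since the equations are not reproduced in this text: the subtraction is done on spectral magnitudes (a power-spectrum variant is also common), negative results are clamped to zero, and the function names and the default `alpha=1.0` are hypothetical.

```python
def average_noise_magnitude(noise_spectra):
    """Mean magnitude per frequency bin over the k noise-only frames."""
    k = len(noise_spectra)
    n_bins = len(noise_spectra[0])
    return [sum(abs(spec[f]) for spec in noise_spectra) / k for f in range(n_bins)]


def spectral_subtract(noisy_spectrum, noise_avg, alpha=1.0):
    """Subtract alpha times the average noise magnitude from |Y(f)|.

    Assumption: results below zero are clamped to zero, because a
    magnitude cannot be negative.
    """
    return [max(abs(y) - alpha * n, 0.0) for y, n in zip(noisy_spectrum, noise_avg)]


# Two noise-only frames with two frequency bins each.
noise_spectra = [[1 + 0j, 2 + 0j], [3 + 0j, 0 + 0j]]
noise_avg = average_noise_magnitude(noise_spectra)
clean_mag = spectral_subtract([5 + 0j, 0.5j], noise_avg)
print(noise_avg)  # [2.0, 1.0]
print(clean_mag)  # [3.0, 0.0]
```

Raising `alpha` removes more noise at the cost of more speech distortion, which matches the over-subtraction and under-subtraction trade-off described for the conventional arithmetic method.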

Once the residual noise signal in the frequency-function domain has been removed in S207, the resulting spectral magnitude is combined with the phase of the noisy signal to extract the frequency function of the clean speech signal alone (S208).

Then an IFFT (Inverse FFT) or IDFT (Inverse DFT) is applied to convert the result into frame-unit time-domain functions (S209), and the frame signal of each block is compensated (S210).

Once the input speech signal has been compensated in S210, its feature pattern is compared with the reference data stored in the memory unit (40) (S211) to recognize the speech signal (command), and the corresponding operation is then executed through the driving unit (60) (S212).
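The text does not specify how the feature pattern is compared with the reference data. As one possible illustration only, the sketch below picks the stored pattern with the smallest Euclidean distance to the input feature vector. The function name, the command labels, and the three-element feature vectors are all hypothetical.

```python
import math


def nearest_reference(feature, references):
    """Return the label of the stored pattern closest to `feature`."""
    def distance(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    return min(references, key=lambda label: distance(feature, references[label]))


# Hypothetical reference patterns keyed by command label.
references = {
    "radio_on": [1.0, 0.0, 0.0],
    "window_down": [0.0, 1.0, 1.0],
}
recognized = nearest_reference([0.9, 0.1, 0.0], references)
print(recognized)  # radio_on
```

A real recognizer would compare sequences of per-frame features rather than a single vector, so this shows only the idea of matching against stored templates.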

The conversion into a time-domain function by applying the IFFT or IDFT is performed by applying Equation 7 below.

[Equation 7]

Here, the phase term is the phase of the noisy signal spectrum Y(k).
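The phase recombination and inverse transform can be sketched as below. Since Equation 7 is not reproduced in this text, the code follows the verbal description only: a magnitude is paired with the phase of the noisy spectrum Y(k), and a naive inverse DFT returns the frame to the time domain. The function names are hypothetical. As a sanity check, the example feeds in the unmodified magnitude of Y(k), in which case the original frame should come back.

```python
import cmath


def dft(frame):
    """Naive discrete Fourier transform of one frame."""
    n = len(frame)
    return [
        sum(frame[m] * cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n))
        for k in range(n)
    ]


def restore_spectrum(magnitude, noisy_spectrum):
    """Attach the phase of the noisy spectrum Y(k) to the given magnitude."""
    return [
        mag * cmath.exp(1j * cmath.phase(y))
        for mag, y in zip(magnitude, noisy_spectrum)
    ]


def idft(spectrum):
    """Naive inverse DFT; keeps the real part of each output sample."""
    n = len(spectrum)
    return [
        (sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * m / n) for k in range(n)) / n).real
        for m in range(n)
    ]


frame = [1.0, 2.0, 0.0, -1.0]
Y = dft(frame)
magnitude = [abs(y) for y in Y]  # stand-in for the noise-subtracted magnitude
restored = idft(restore_spectrum(magnitude, Y))
round_trip_error = max(abs(a - b) for a, b in zip(frame, restored))
```

In the actual method the magnitude passed to `restore_spectrum` would be the one left after spectral subtraction, so the restored frame would differ from the noisy input.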

As described above, the present invention efficiently removes the noise signal contained in the command speech signal input to a vehicle speech recognition system, thereby providing stable and reliable speech recognition.

Furthermore, because speech signals are recognized reliably without an expensive array microphone system, the manufacturing cost of the speech recognition system is reduced while stability and reliability in use are provided.

Claims

1. A noise removal method in a speech recognition system, comprising:

removing the noise signal in the time domain from a speech signal that is input from a microphone mixed with a noise signal;

converting the speech signal from which the noise signal has been removed in the time domain into a frequency function by applying an FFT (or DFT);

removing the noise signal contained in the frequency function by using a spectral subtraction technique; and

performing speech recognition by comparing the frequency function from which the noise signal has been removed with a feature pattern of reference data stored in a memory.

2. The method of claim 1,

wherein the noise removal in the time domain for the speech signal input mixed with the noise signal comprises:

detecting, according to information from the switch that sets entry into voice mode, the start point and the endpoint of actual speech input; calculating the average of the noise signal over a predetermined time before the actual speech signal is input; and removing the noise signal in the time domain by subtracting the average of the noise signal.

3. A noise removal method in a speech recognition system, comprising:

detecting the endpoint of a speech signal, mixed with a noise signal, that is input through a microphone when speech recognition mode is selected, and calculating the average of the noise signal over a predetermined time;

removing the noise signal in the time domain by subtracting the average of the noise signal from the entire input speech signal;

storing, in frame units, the input speech signal from which the noise signal has been removed in the time domain;

applying a Hamming window to the speech signal of each frame and converting it into a frequency function by applying an FFT;

removing the noise signal contained in the frequency function by applying a spectral subtraction technique in which the average of the noise signal over a predetermined interval is calculated in the frequency function and then subtracted from the entire frequency function;

combining the spectral magnitude of the frequency function from which the noise signal has been removed with the phase of the noisy signal to extract the final speech signal;

converting the extracted final speech signal into frame-unit signals that are functions of time by applying an IFFT; and

performing speech recognition by comparing the frame-unit time-function signal with the feature patterns set as reference data.

4. The method of claim 3,

wherein the noise signal contained in the frequency function is removed by adaptively applying the coefficient by which the average of the noise signal is multiplied, according to the amount of noise to be removed.