KR20130125014A

KR20130125014A - Robust speech recognition method based on independent vector analysis using harmonic frequency dependency and system using the method

Info

Publication number: KR20130125014A
Application number: KR1020120048380A
Authority: KR
Inventors: 박형민; 전소람; 김민욱; 오명우
Original assignee: 서강대학교산학협력단
Priority date: 2012-05-08
Filing date: 2012-05-08
Publication date: 2013-11-18
Also published as: KR101361034B1

Abstract

The robust speech recognition system according to the present invention enhances the sound source with an MPDR beamformer as a pre-processing step, applies the HIVA learning algorithm to the mixture of the enhanced source signals and noise signals, and extracts a feature vector for the source signal. When performing the HIVA learning algorithm, the system applies a non-holonomic constraint and the minimal distortion principle to minimize signal distortion and improve convergence of the unmixing matrix. In addition, the system uses the enhanced source and the noise source to identify features lost during learning (missing features) and compensates for them. With these features, the system provides speech recognition that is robust to noise, based on an independent vector analysis algorithm using harmonic frequency dependency. [Reference numerals] (200) Signal input unit; (210) Signal conversion unit; (220) Pre-processing unit; (230) Sound source signal extraction unit; (246) Mask generation unit; (248) Missing-feature compensation output unit; (250) DCT unit; (260) Speech recognition unit; (AA, BB) Log unit

Description

Robust speech recognition method based on independent vector analysis using harmonic frequency dependency and system using the method

The present invention relates to a speech recognition system and method, and more particularly to a robust speech recognition method based on independent vector analysis using harmonic frequency dependency and a speech recognition system using the method.

Speech recognition technology converts acoustic signals obtained through a microphone or telephone into words, word sets, or sentences. The recognized results can be used as final results in applications such as commands, control, data entry, and document preparation. Applications of speech recognition systems based on this technology have been increasing recently, and a wide range of research and development is under way.

FIG. 1 is a block diagram schematically illustrating a conventional speech recognition system. Referring to FIG. 1, the conventional speech recognition system (10) includes: a signal input unit (100) that outputs a signal received from the outside; a signal conversion unit (110) that converts the input signal provided by the signal input unit into a frequency-domain signal; a Mel-log filter bank (120) that obtains a Mel-frequency spectrum for the input signals provided by the signal conversion unit; a log unit (122) that takes the logarithm of the Mel-frequency spectrum; an MFCC detection unit (130) that extracts speech features by applying a DCT (Discrete Cosine Transform) to the log spectrum; and a speech recognition unit (140) that recognizes speech by comparing the extracted feature information with pre-stored patterns and outputs the result.

The recognition performance of such a system is affected by external factors such as ambient noise and the type or position of the microphone. In particular, noise such as ambient noise sharply degrades recognition performance, so developing noise-robust speech recognition technology has emerged as an important task.

Separating individual source signals from a mixture of several sounds is called BSS (Blind Source Separation or Blind Signal Separation). Here, "blind" means that no information is available about the original signals, nor about the mixing. The final step of separating the signals is referred to as demixing or unmixing. Learning algorithms proposed for separating source signals include the independent component analysis (ICA) algorithm, the independent vector analysis (IVA) algorithm, and the independent vector analysis algorithm with harmonic frequency dependency (HIVA).

The independent vector analysis algorithm with harmonic frequency dependency performs very well at separating audio signals such as speech or music. However, as with the ICA algorithm, the drastic filtering applied to mixtures of temporally correlated audio signals distorts the signals of the sources of interest estimated during HIVA-based separation. This distortion of the separated source signal of interest degrades the performance of the speech recognition system.

(1) Korean Registered Patent Publication No. 10-4085240
(2) Korean Patent Laid-Open Publication No. 10-2010-117055
(3) Korean Patent Laid-Open Publication No. 10-2010-83572

An object of the present invention, devised to solve the above problems, is to provide a noise-robust speech recognition system and method that improve separation performance for the source signal and apply the HIVA learning algorithm under optimization conditions.

Another object of the present invention is to provide a speech recognition system and method capable of noise-robust speech recognition, in which the enhanced source signal and the observed noise source are used to detect signal attenuation caused by noise and to compensate for it when estimating the source signal.

A speech recognition system according to a first aspect of the present invention for achieving the above technical objects comprises: a signal input unit that receives a plurality of input signals through an external input device; a signal conversion unit that converts the received input signals into the frequency domain; a sound source signal extraction unit that extracts a feature vector by running a learning algorithm based on independent vector analysis using harmonic frequency dependency on the input signals provided by the signal conversion unit, and that estimates and outputs a source signal using the extracted feature vector; and a missing-feature compensation unit that uses the input signal to compensate the estimated source signal for features lost during estimation (missing features) and outputs the result.

The speech recognition system according to the first aspect preferably further comprises a beamformer that uses information on the direction of the sound source to enhance the source signal among the input signals provided by the signal conversion unit, according to the following equation, and provides the result to the sound source signal extraction unit.

Here, d_i(ω) and R(ω) denote the steering vector toward the i-th source and the input spectral covariance matrix, respectively, and λ is a small positive constant set to keep R(ω) from becoming singular.
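The equation itself does not survive in this text. The symbols defined above match the widely used diagonally loaded MPDR (minimum power distortionless response) weight vector, sketched below as an assumption rather than the patent's exact expression; the weight symbol w_i(ω) and the identity matrix I are introduced here for illustration.

```latex
\mathbf{w}_i(\omega) =
\frac{\left[\mathbf{R}(\omega) + \lambda \mathbf{I}\right]^{-1} \mathbf{d}_i(\omega)}
     {\mathbf{d}_i^{H}(\omega)\left[\mathbf{R}(\omega) + \lambda \mathbf{I}\right]^{-1} \mathbf{d}_i(\omega)}
```

In this form the beamformer output is w_i^H(ω) x(ω,τ), and the denominator enforces unit gain in the direction of d_i(ω).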

In the speech recognition system according to the first aspect, the sound source signal extraction unit preferably extracts the feature vector (W(ω)) by running the learning algorithm based on independent vector analysis using harmonic frequency dependency, and the extracted feature vector is preferably modified by applying the off-diag function according to the following equation.

Here, off-diag(·) denotes the matrix whose diagonal elements are set to zero, u(ω,τ) denotes the time-frequency segments of the estimated source signal vector, the multivariate score function is evaluated on the estimated source signal, and Ω is the number of frequency bins.

In the speech recognition system according to the first aspect, the sound source signal extraction unit preferably extracts the feature vector (W(ω)) by running the learning algorithm based on independent vector analysis using harmonic frequency dependency, and modifies the extracted feature vector according to the following equation so as to minimize a cost function that keeps the distortion between the estimated source signal and the input signal to a minimum.

Here, x(ω,τ) denotes the time-frequency segments of the mixed signal, and u(ω,τ) denotes the time-frequency segments of the estimated source signal vector.

In the speech recognition system according to the first aspect, the missing-feature compensation unit comprises: a first MFCC detection unit that detects the Mel-frequency cepstrum of the input signal provided by the signal conversion unit; a second MFCC detection unit that detects the Mel-frequency cepstrum of the estimated source signal provided by the sound source signal extraction unit; and a missing-feature calculation unit that compensates for features lost in the estimated source signal using the Mel-frequency cepstrum of the input signal and that of the estimated source signal.

The missing-feature calculation unit preferably comprises: a mask generation unit that generates a reliability mask from the Mel-frequency cepstrum of the input signal and that of the estimated source signal; and a missing-feature compensation output unit that detects the missing features of the estimated source signal using the reliability mask and the Mel-frequency cepstrum of the estimated source signal, compensates for them using a preset cluster-based speech feature model, and outputs the result.

A speech recognition method according to a second aspect of the present invention comprises: (a) converting input signals received from the outside into the frequency domain; (b) extracting a feature vector by running a learning algorithm based on independent vector analysis using harmonic frequency dependency on the converted input signals, and estimating and outputting a source signal using the extracted feature vector; and (c) using the input signal to compensate the estimated source signal for features lost during estimation (missing features) and outputting the result.

The speech recognition method according to the second aspect preferably further comprises, before extracting the feature vector, enhancing the source signal among the input signals using information on the direction of the sound source.

In the speech recognition method according to the second aspect, step (b) preferably extracts the feature vector (W(ω)) by running the learning algorithm based on independent vector analysis using harmonic frequency dependency, and modifies the extracted feature vector by applying the off-diag function.

In the speech recognition method according to the second aspect, step (b) preferably extracts the feature vector (W(ω)) by running the learning algorithm based on independent vector analysis using harmonic frequency dependency, and modifies the extracted feature vector so as to minimize a cost function that keeps the distortion between the estimated source signal and the input signal to a minimum.

In the speech recognition method according to the second aspect, step (c) comprises: (c1) detecting the Mel-frequency cepstrum of the converted input signal; (c2) detecting the Mel-frequency cepstrum of the estimated source signal; and (c3) compensating for features lost in the estimated source signal using the Mel-frequency cepstrum of the input signal and that of the estimated source signal.

Step (c3) preferably generates a reliability mask from the Mel-frequency cepstrum of the input signal and that of the estimated source signal, detects the missing features of the estimated source signal using the reliability mask and the Mel-frequency cepstrum of the estimated source signal, compensates for them using a preset cluster-based speech feature model, and outputs the result.

The speech recognition method and system according to the present invention show particularly good recognition performance in noisy environments. Features of the source signal can be lost when the HIVA learning algorithm is performed; the method and system according to the present invention compensate for these missing features and can therefore extract the source signal more accurately. In addition, applying the non-holonomic constraint when performing the HIVA learning algorithm improves the convergence speed of learning.

FIG. 1 is a block diagram schematically illustrating a conventional speech recognition system.
FIG. 2 is a block diagram showing the overall speech recognition system according to a preferred embodiment of the present invention.

Hereinafter, a robust speech recognition system and method based on an independent vector analysis algorithm using harmonic frequency dependency, according to a preferred embodiment of the present invention, are described in detail with reference to the accompanying drawings.

The robust speech recognition system according to the present invention enhances the sound source with an MPDR beamformer as a pre-processing step and then applies the HIVA learning algorithm to the mixture of the enhanced source signals and noise signals to extract a feature vector for the source signal. To minimize signal distortion and improve convergence of the unmixing matrix, the system applies a non-holonomic constraint and the minimal distortion principle (hereinafter "MDP") when performing the HIVA learning algorithm. The system also uses the enhanced source and the noise source to identify features lost during learning (missing features) and compensates for them. With these features, the system is robust to noise and similar disturbances, based on an independent vector analysis algorithm using harmonic frequency dependency.

FIG. 2 is a block diagram showing the overall speech recognition system according to a preferred embodiment of the present invention. The structure and operation of the system are described in detail below with reference to FIG. 2.

The speech recognition system (20) according to the present invention comprises a signal input unit (200), a signal conversion unit (210), a pre-processing unit (220), a sound source signal extraction unit (230), a missing-feature compensation unit (240), a DCT unit (250), and a speech recognition unit (260).

The signal input unit (200) receives, through one or more signal input devices such as microphones, signals (x₁(t), x₂(t)) in which a source signal (s(t)) and a noise signal (n(t)) are mixed, and provides the input signals (x₁(t), x₂(t)) to the signal conversion unit.

The signal conversion unit (210) applies a short-time Fourier transform (STFT) to the time-domain input signals (x₁(t), x₂(t)) provided by the signal input unit, converting them into frequency-domain signals, and outputs the result.
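The conversion step can be sketched as follows. This is a minimal illustration, not the patent's implementation: the periodic Hann window, the frame length, the hop size, and the function name `stft` are all assumptions, and a naive DFT is used so that only the Python standard library is needed.

```python
import cmath
import math


def stft(signal, frame_len, hop):
    """Return a list of frames; each frame holds frame_len // 2 + 1 complex bins.

    Assumptions (not from the source): periodic Hann window, no padding,
    naive O(N^2) DFT per frame.
    """
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        segment = [signal[start + n] * window[n] for n in range(frame_len)]
        spectrum = [
            sum(segment[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                for n in range(frame_len))
            for k in range(frame_len // 2 + 1)
        ]
        frames.append(spectrum)
    return frames
```

Each microphone channel would be transformed separately, giving x₁(ω,τ) and x₂(ω,τ) with ω indexing the bin and τ the frame.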

The pre-processing unit (220) uses preset information about the sound source to apply MPDR beamforming according to Equation 1 to the input signals (x₁(ω,τ), x₂(ω,τ)) provided by the signal conversion unit, thereby enhancing the sound source.
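For a two-microphone input, MPDR beamforming at a single frequency bin can be sketched as below. The code assumes the standard diagonally loaded MPDR weights, a sample covariance averaged over frames, and a far-field two-microphone steering vector; the function names, the `loading` default, and the steering model are hypothetical and not taken from the source.

```python
import cmath


def two_mic_steering(omega, delay):
    """Hypothetical steering vector for two microphones: [1, exp(-j*omega*delay)]."""
    return [1 + 0j, cmath.exp(-1j * omega * delay)]


def mpdr_weights(snapshots, steering, loading=1e-3):
    """Diagonally loaded MPDR weights for one frequency bin (assumed standard form).

    snapshots: list of [x1, x2] complex STFT values over frames.
    steering:  [d1, d2] complex steering vector toward the target source.
    """
    n = len(snapshots)
    # Sample covariance R = mean(x x^H), with `loading` added to the diagonal.
    r11 = sum(abs(x[0]) ** 2 for x in snapshots) / n + loading
    r22 = sum(abs(x[1]) ** 2 for x in snapshots) / n + loading
    r12 = sum(x[0] * x[1].conjugate() for x in snapshots) / n
    r21 = r12.conjugate()
    det = r11 * r22 - r12 * r21
    # v = (R + loading * I)^{-1} d, using the closed-form 2x2 inverse.
    v1 = (r22 * steering[0] - r12 * steering[1]) / det
    v2 = (-r21 * steering[0] + r11 * steering[1]) / det
    # Normalize so that w^H d = 1 (distortionless response).
    denom = steering[0].conjugate() * v1 + steering[1].conjugate() * v2
    return [v1 / denom, v2 / denom]


def beamform(weights, x):
    """Beamformer output y = w^H x for one time-frequency snapshot."""
    return weights[0].conjugate() * x[0] + weights[1].conjugate() * x[1]
```

Applying `beamform` to the steering vector itself returns 1 up to rounding, which is the distortionless constraint; signals from other directions are attenuated according to the estimated covariance.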

The sound source signal extraction unit (230) runs the HIVA learning algorithm with the minimal distortion principle and the non-holonomic constraint applied, extracts a feature vector for the input signal provided by the pre-processing unit, and estimates and outputs the source signal using the extracted feature vector.

The process by which the sound source signal extraction unit extracts the feature vector is described step by step below. First, to apply the HIVA learning algorithm, the feature vector (W(ω)) is defined as an unmixing matrix, as in Equations 2 and 3.

Here, x(ω,τ) and s(ω,τ) are the time-frequency segments of the mixed signals and the source signal vector, respectively, and A(ω) is the mixing matrix at frequency bin ω.

The source signals can be estimated with Equation 3, where u(ω,τ) denotes the time-frequency segments of the estimated source signal vector.
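Equations 2 and 3 do not survive in this text. The definitions of x(ω,τ), s(ω,τ), A(ω), W(ω), and u(ω,τ) are consistent with the standard frequency-domain mixing and unmixing model, given here as an assumption rather than the patent's exact notation:

```latex
\begin{align}
\mathbf{x}(\omega,\tau) &= \mathbf{A}(\omega)\,\mathbf{s}(\omega,\tau) \\
\mathbf{u}(\omega,\tau) &= \mathbf{W}(\omega)\,\mathbf{x}(\omega,\tau)
\end{align}
```

Under this model, separation succeeds when W(ω)A(ω) is close to a scaled permutation of the identity at every frequency bin.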

The on-line natural gradient algorithm for minimizing the cost function is obtained through independent vector analysis using harmonic frequency dependency (HIVA) learning, defined by Equation 4.

Here, the multivariate score function is evaluated on the estimated source signal, and Ω is the number of frequency bins. The multivariate score function can be obtained from Equations 5 and 6.

S_ω denotes the set of cliques that contain the ω-th frequency bin.

C_h denotes the set of frequency bins belonging to the h-th harmonic clique and is given by Equation 7 for 1 ≤ h ≤ H−1, where H is the total number of cliques. There are 50 cliques in total, so 1 ≤ h ≤ H. The cliques with 1 ≤ h ≤ H−1 follow Equation 7, while C_H, the 50th and last clique, contains every frequency bin ω.

Here, f(ω) is the frequency of the ω-th frequency bin, and M, set to 8, is the number of harmonic frequencies in a harmonic clique.

F_h are the fundamental frequencies of the harmonic cliques and are defined by Equation 8.

Here, with F₁ = 55 Hz, the number of harmonic cliques is 49. This frequency range can cover the full pitch range of human speech.

δ determines the bandwidth of each harmonic frequency and is set so that two consecutive cliques overlap by 50%.
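The clique structure described above can be sketched as follows. Equations 7 and 8 are not available in this text, so the membership rule (a bin joins clique h when it lies within a relative half-width `delta` of one of the first `num_harmonics` multiples of F_h) and the default value of `delta` are assumptions chosen for illustration. The fundamental frequencies are passed in as a parameter rather than computed, because their spacing is defined by the missing Equation 8.

```python
def harmonic_cliques(bin_freqs, fundamentals, num_harmonics=8, delta=0.03):
    """Build harmonic cliques plus one final clique that contains every bin.

    bin_freqs:     frequency in Hz of each bin, f(omega).
    fundamentals:  fundamental frequency F_h of each harmonic clique.
    num_harmonics: M, harmonics per clique (8 in the text).
    delta:         ASSUMED relative half-bandwidth around each harmonic.
    """
    cliques = []
    for f0 in fundamentals:
        members = {
            k for k, f in enumerate(bin_freqs)
            if any(abs(f - m * f0) <= delta * m * f0
                   for m in range(1, num_harmonics + 1))
        }
        cliques.append(members)
    # The last clique (C_H in the text) contains all frequency bins.
    cliques.append(set(range(len(bin_freqs))))
    return cliques


def cliques_containing(cliques, k):
    """Indices of the cliques that contain bin k (the set S_omega in the text)."""
    return [h for h, members in enumerate(cliques) if k in members]
```

With 49 fundamentals starting at 55 Hz this yields 50 cliques, matching the count in the text. One hypothetical choice for trying it out is semitone spacing, `[55.0 * 2 ** (h / 12) for h in range(49)]`, which spans 55 Hz to 880 Hz; the source does not confirm that spacing.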

When the non-holonomic constraint is applied to the HIVA learning algorithm, Equation 4 is modified as in Equation 9.

Here, off-diag(·) denotes the matrix whose diagonal elements are set to zero.
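As a small illustration of this operator, the sketch below zeroes the diagonal of a square matrix stored as a list of rows. The function name `off_diag` and the list-of-lists representation are choices made here, not part of the source.

```python
def off_diag(matrix):
    """Return a copy of a square matrix with its diagonal elements set to zero."""
    return [
        [0 if i == j else value for j, value in enumerate(row)]
        for i, row in enumerate(matrix)
    ]
```

For example, `off_diag([[1, 2], [3, 4]])` gives `[[0, 2], [3, 0]]`. In the learning rule this removes the terms that would otherwise rescale each output channel, leaving only the cross-channel terms to drive the update.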

Likewise, when the MDP is applied to the HIVA learning algorithm, Equation 4 is modified as in Equation 10.

Therefore, when both the non-holonomic constraint and the MDP are applied to the HIVA learning algorithm, Equation 4 is modified as in Equation 11.

Here, β is a small positive constant that determines the relative weight of the MDP.

Accordingly, the sound source signal extraction unit learns and extracts the feature vector with the HIVA learning algorithm incorporating the non-holonomic constraint and the MDP, as expressed in Equation 11, estimates the source signal from this feature vector according to Equation 3, and outputs the estimated source signal (u₁(ω,τ)).
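Equations 4 and 9 to 11 are not reproduced in this text. The forms below are a hedged sketch based on the natural-gradient and non-holonomic update rules common in the ICA/IVA literature, combined with a squared-error distortion penalty between x(ω,τ) and u(ω,τ). The score-function symbol φ^(ω), its argument, and in particular the exact form of the β-weighted term are assumptions and may differ from the patent's equations.

```latex
\begin{align}
% Natural-gradient update (assumed form corresponding to Equation 4)
\Delta \mathbf{W}(\omega) &\propto
  \left[\mathbf{I} - \boldsymbol{\varphi}^{(\omega)}\!\big(\mathbf{u}(\tau)\big)\,
  \mathbf{u}^{H}(\omega,\tau)\right]\mathbf{W}(\omega) \\
% Non-holonomic constraint (assumed form corresponding to Equation 9)
\Delta \mathbf{W}(\omega) &\propto
  -\,\text{off-diag}\!\left(\boldsymbol{\varphi}^{(\omega)}\!\big(\mathbf{u}(\tau)\big)\,
  \mathbf{u}^{H}(\omega,\tau)\right)\mathbf{W}(\omega) \\
% Non-holonomic constraint plus MDP penalty (assumed form corresponding to Equation 11)
\Delta \mathbf{W}(\omega) &\propto
  -\,\text{off-diag}\!\left(\boldsymbol{\varphi}^{(\omega)}\!\big(\mathbf{u}(\tau)\big)\,
  \mathbf{u}^{H}(\omega,\tau)\right)\mathbf{W}(\omega)
  + \beta\left[\mathbf{x}(\omega,\tau) - \mathbf{u}(\omega,\tau)\right]\mathbf{x}^{H}(\omega,\tau)
\end{align}
```

The last term is the negative gradient of (1/2)‖x(ω,τ) − W(ω)x(ω,τ)‖² with respect to W(ω), which pulls each output toward its corresponding input and so limits distortion.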

The input signal (x₁(ω,τ)) output by the signal conversion unit (210) and the estimated source signal (u₁(ω,τ)) output by the sound source signal extraction unit (230) are fed to the missing-feature compensation unit (240). Using these two signals, the missing-feature compensation unit (240) compensates for the time-frequency segments that were lost as features in the course of estimating the source signal.

The missing-feature compensation unit (240) comprises: a first MFCC detection unit (242) that detects the Mel-frequency cepstrum of the input signal (x₁(ω,τ)) provided by the signal conversion unit; a second MFCC detection unit (244) that detects the Mel-frequency cepstrum of the estimated source signal (u₁(ω,τ)) provided by the sound source signal extraction unit; a mask generation unit (246) that generates a reliability mask from the Mel-frequency cepstrum of the input signal and that of the estimated source signal; and a missing-feature compensation output unit (248) that detects the missing features of the estimated source signal using the reliability mask and the Mel-frequency cepstrum of the estimated source signal, compensates for them using a pre-built cluster-based spectral cluster model of speech signals, and outputs the result. With this configuration, the missing-feature compensation unit (240) detects the Mel-frequency cepstrum of the estimated source signal (u₁(ω,τ)), compensates for the missing features in that cepstrum, and outputs the compensated Mel-frequency cepstrum (L_recon(ω_mel, τ')) of the estimated source signal.

The first and second MFCC detection units (242, 244) detect and output the Mel-frequency cepstrum of their input signals; their operation is described in detail here. The Mel-Frequency Cepstrum (MFC) is a representation of the power spectrum of a short-term signal, and Mel-Frequency Cepstral Coefficients (MFCCs) are the coefficients that collectively make up an MFC. The first and second MFCC detection units obtain the power spectrum of the input signals using a Mel-scale filter bank and take the logarithm of each Mel-scale power spectrum to obtain the MFCC values.
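The filter-bank and logarithm steps can be sketched as below. The Hz-to-Mel formula is the commonly used 2595·log10(1 + f/700) mapping; the triangular filter shape, the filter count, the energy floor, and all function names are assumptions for illustration and are not specified in the source.

```python
import math


def hz_to_mel(freq_hz):
    """Common Hz-to-Mel mapping (assumed; the source does not give a formula)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(num_filters, num_bins, sample_rate):
    """Triangular filters spaced evenly on the Mel scale from 0 Hz to Nyquist."""
    nyquist = sample_rate / 2.0
    top_mel = hz_to_mel(nyquist)
    edges_hz = [mel_to_hz(top_mel * i / (num_filters + 1))
                for i in range(num_filters + 2)]
    bin_hz = nyquist / (num_bins - 1)
    bank = []
    for j in range(1, num_filters + 1):
        low, center, high = edges_hz[j - 1], edges_hz[j], edges_hz[j + 1]
        row = []
        for k in range(num_bins):
            f = k * bin_hz
            if low < f <= center:
                row.append((f - low) / (center - low))    # rising slope
            elif center < f < high:
                row.append((high - f) / (high - center))  # falling slope
            else:
                row.append(0.0)
        bank.append(row)
    return bank


def log_mel_spectrum(power_frame, bank, floor=1e-10):
    """Log of the Mel-band energies of one power-spectrum frame."""
    return [
        math.log(max(sum(w * p for w, p in zip(row, power_frame)), floor))
        for row in bank
    ]
```

Here `power_frame` would hold |x₁(ω,τ)|² or |u₁(ω,τ)|² for one frame, and the output corresponds to one column of L_org or L_enh.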

Accordingly, the first MFCC detection unit (242) detects and provides the Mel-frequency cepstrum (L_org(ω_mel, τ')) of the input signal (x₁(ω,τ)), and the second MFCC detection unit (244) detects and provides the Mel-frequency cepstrum (L_enh(ω_mel, τ')) of the estimated source signal (u₁(ω,τ)).

The mask generator 246 generates a reliability mask using the Mel-frequency cepstrum of the input signal and the Mel-frequency cepstrum of the estimated sound source signal. The value M(ω_mel, τ') of the reliability mask in Mel-frequency band ω_mel and frame τ' is given by Equation 12.

Mel-frequency cepstrum components that correspond to a mask value of zero are regarded as unreliable features, and all other components are regarded as reliable features. The reliability mask is therefore used to identify the unreliable Mel-frequency cepstrum components as missing features. The missing features are then compensated for using the reliable features together with the pre-built spectral cluster model of speech signals.
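As a rough illustration of how a mask and a cluster model can be combined, the following sketch reconstructs one frame. It is a simplified stand-in, not the compensation procedure of the invention: it assumes diagonal-covariance Gaussian clusters, picks a single best cluster from the reliable components, and fills the unreliable components with that cluster's mean. The mask is taken as an input rather than computed, and all names (`reconstruct_frame`, `means`, `variances`, `priors`) are hypothetical.

```python
import numpy as np

def reconstruct_frame(frame, mask, means, variances, priors):
    """Fill unreliable components of one log-Mel frame from a cluster model.

    frame:     (n_mels,) observed log-Mel values
    mask:      (n_mels,) reliability mask, 0 = unreliable, nonzero = reliable
    means:     (K, n_mels) cluster means (assumed diagonal Gaussian clusters)
    variances: (K, n_mels) cluster variances
    priors:    (K,) cluster prior probabilities
    """
    reliable = mask.astype(bool)
    if reliable.all():
        return frame.copy()
    # Score each cluster using only the reliable components.
    diff = frame[reliable] - means[:, reliable]
    var = variances[:, reliable]
    loglik = np.log(priors) - 0.5 * np.sum(diff ** 2 / var + np.log(2.0 * np.pi * var), axis=1)
    best = int(np.argmax(loglik))
    # Keep reliable values; replace unreliable ones with the best cluster's mean.
    out = frame.copy()
    out[~reliable] = means[best, ~reliable]
    return out
```

A soft combination over all clusters, weighted by their posterior probabilities, would be an equally plausible variant; the hard assignment is used here only to keep the sketch short.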

The DCT converter 250 applies a discrete cosine transform (DCT) to the Mel-frequency cepstrum L_recon(ω_mel, τ') of the estimated sound source signal, in which the missing features have been compensated, and outputs the result.
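The DCT step can be sketched as follows. The text does not state which DCT type or normalization is used, so an unnormalized DCT-II is assumed here; the function name `dct_ii` and the parameter `n_ceps` (number of coefficients kept) are illustrative.

```python
import numpy as np

def dct_ii(log_mel, n_ceps):
    """Unnormalized DCT-II along the Mel-band axis.

    log_mel: (n_mels, n_frames) compensated log-Mel values
    returns: (n_ceps, n_frames) cepstral coefficients
    """
    n_mels = log_mel.shape[0]
    q = np.arange(n_ceps)[:, None]   # cepstral index
    m = np.arange(n_mels)[None, :]   # Mel-band index
    basis = np.cos(np.pi * q * (m + 0.5) / n_mels)
    return basis @ log_mel
```

With this convention, a frame whose log-Mel values are all equal has energy only in the zeroth coefficient, which is why the higher coefficients describe the shape of the spectral envelope rather than its overall level.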

The speech recognition unit 260 recognizes the estimated sound source signal using the DCT-transformed Mel-frequency cepstrum C(q,τ'). Many algorithms by which a speech recognition unit recognizes a sound source signal have already been proposed or are in use, and such algorithms are not a principal element of the present invention, so a detailed description of them is omitted.

Hereinafter, the speech recognition method according to the present invention is described.

The speech recognition method according to the present invention comprises: converting input signals received from the outside into the frequency domain; reinforcing the sound source signal among the input signals, using information on the direction of the sound source, according to the following equation; extracting a feature vector by performing a learning algorithm based on independent vector analysis using harmonic frequency dependency on the converted input signals, and estimating and outputting the sound source signal using the extracted feature vector; and using the input signal to compensate for features of the estimated sound source signal that were lost in the estimation process (missing features), and outputting the result.
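The source-reinforcing step can be illustrated with a minimum power distortionless response (MPDR) weight computation for a single frequency bin. This is a sketch under assumptions: it uses the textbook MPDR form with diagonal loading, w = (R + λI)⁻¹d / (dᴴ(R + λI)⁻¹d), which is consistent with the claims' description of λ as a small positive constant that keeps R(ω) from becoming singular, but the exact equation of the invention is not reproduced here. The names `mpdr_weights` and `loading` are hypothetical.

```python
import numpy as np

def mpdr_weights(R, d, loading=1e-3):
    """MPDR beamformer weights for one frequency bin (assumed textbook form).

    R:       (n_mics, n_mics) input spectral covariance matrix
    d:       (n_mics,) steering vector toward the target source
    loading: small positive constant added to the diagonal of R
    """
    n_mics = R.shape[0]
    # Solve (R + loading * I) v = d instead of forming an explicit inverse.
    v = np.linalg.solve(R + loading * np.eye(n_mics), d)
    # Normalize so that the response toward d is exactly one (distortionless).
    return v / (d.conj() @ v)
```

The enhanced output for an input vector `x` in that bin would then be `w.conj() @ x`; the normalization guarantees that a signal arriving exactly along `d` passes through unchanged.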

In the step of estimating the sound source signal, the feature vector W(ω) is extracted by performing the learning algorithm based on independent vector analysis using the harmonic frequency dependency, and the extracted feature vector is preferably modified by applying an off-diag function. This modification improves the convergence speed of the learning.
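The off-diag function itself is simple to state in code. The sketch below also shows one commonly used way such a function enters a natural-gradient update under a non-holonomic constraint, ΔW = −η · off-diag(E[φ(u)uᴴ]) · W; that update form, the function name `nonholonomic_step`, and the `learning_rate` parameter are assumptions for illustration and are not taken from the equation of the invention.

```python
import numpy as np

def off_diag(matrix):
    """Return a copy of a square matrix with its diagonal elements set to zero."""
    return matrix - np.diag(np.diag(matrix))

def nonholonomic_step(W, u, phi_u, learning_rate=0.1):
    """One hypothetical update of W for a single frequency bin.

    W:     (n_src, n_src) current matrix for this bin
    u:     (n_src, n_frames) estimated source segments
    phi_u: (n_src, n_frames) score function values for those segments
    """
    n_frames = u.shape[1]
    # Sample average of phi(u) u^H over frames.
    corr = phi_u @ u.conj().T / n_frames
    # Only the off-diagonal part drives the update, so output scales are left alone.
    return W - learning_rate * off_diag(corr) @ W
```

Because the diagonal terms are discarded, the update does not spend effort rescaling each output, which is one intuition for the improved convergence mentioned above.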

In addition, the extracted feature vector is preferably modified so as to minimize a cost function, in order to keep the distortion between the estimated sound source signal and the input signal to a minimum.
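For intuition about the minimal distortion principle, the sketch below shows a well-known closed-form rescaling, W ← diag(W⁻¹)·W, which makes each estimated source appear at its own input channel with unit gain. It is a different formulation from the cost-function-based modification described above and is shown only as an assumption-laden illustration; the function name is hypothetical.

```python
import numpy as np

def minimal_distortion_rescale(W):
    """Rescale the rows of W so that the diagonal of inv(W) becomes all ones.

    W: (n_src, n_src) invertible matrix for one frequency bin.
    """
    W_inv = np.linalg.inv(W)
    # Multiply row i of W by the i-th diagonal element of inv(W).
    return np.diag(np.diag(W_inv)) @ W
```

After the rescaling, the implied mixing matrix (the inverse of the returned matrix) has a unit diagonal, so the scale ambiguity of each output is resolved relative to the observed input.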

The loss feature compensation output step described above preferably comprises: detecting the Mel-frequency cepstrum of the converted input signal; detecting the Mel-frequency cepstrum of the estimated sound source signal; and generating a reliability mask using the Mel-frequency cepstrum of the input signal and the Mel-frequency cepstrum of the estimated sound source signal, detecting the missing features of the estimated sound source signal using the reliability mask and the Mel-frequency cepstrum of the estimated sound source signal, and compensating for the missing features using a preset cluster-based speech feature model and outputting the result.

Although the present invention has been described above with reference to its preferred embodiments, these are merely examples and do not limit the invention. Those of ordinary skill in the art to which the invention pertains will understand that various modifications and applications not illustrated above are possible without departing from the essential characteristics of the invention. Differences relating to such modifications and applications should be construed as falling within the scope of the invention as defined in the appended claims.

The present invention can be widely used in speech recognition systems.

10, 20: speech recognition system
100, 200: signal input unit
110: signal converter
120: Mel filter bank
122: log unit
130: MFCC detector
140: speech recognition unit
210: signal converter
220: pre-processing unit
230: sound source signal extractor
240: loss feature compensator
250: DCT converter
260: speech recognition unit
242: first MFCC detector
244: second MFCC detector
246: mask generator
248: loss feature compensation output unit

Claims

A speech recognition system comprising:
a signal input unit configured to receive a plurality of input signals through an external input device;
a signal converter configured to convert the received input signals into the frequency domain;
a sound source signal extractor configured to extract a feature vector by performing a learning algorithm based on independent vector analysis using harmonic frequency dependency on the input signals provided from the signal converter, and to estimate and output a sound source signal using the extracted feature vector; and
a loss feature compensator configured to use the input signal to compensate for features of the estimated sound source signal that were lost in the estimation process (missing features), and to output the result.

The speech recognition system of claim 1, further comprising a beamformer that reinforces the sound source signal among the input signals provided from the signal converter, using information on the direction of the sound source, according to the following equation, and provides the result to the sound source signal extractor.

Here, d_i(ω) and R(ω) represent the steering vector toward the i-th sound source and the input spectral covariance matrix, respectively, and λ is a small positive constant set to prevent R(ω) from becoming singular.

The speech recognition system of claim 1, wherein the sound source signal extractor extracts a feature vector W(ω) by performing the learning algorithm based on independent vector analysis using the harmonic frequency dependency, and the extracted feature vector is modified by applying an off-diag function according to the following equation.

Here, the 'off-diag()' function returns its matrix argument with the diagonal elements set to zero, the remaining symbols of the equation denote the time-frequency segments of the estimated sound source signal vector and the multivariate score function, and Ω represents the number of frequency bins.

The speech recognition system of claim 1, wherein the sound source signal extractor extracts a feature vector W(ω) by performing the learning algorithm based on independent vector analysis using the harmonic frequency dependency, and modifies the extracted feature vector so as to minimize a cost function, in order to keep the distortion between the estimated sound source signal and the input signal to a minimum.

The speech recognition system of claim 4, wherein the sound source signal extractor modifies the feature vector according to the following equation.

Here, the symbols of the equation denote the time-frequency segments of the mixed signal and the time-frequency segments of the estimated sound source signal vector.

The speech recognition system of claim 1, wherein the sound source signal extractor extracts a feature vector W(ω) by performing the learning algorithm based on independent vector analysis using the harmonic frequency dependency, modifies the extracted feature vector by applying an off-diag function, and determines the feature vector from the modified value so as to minimize a cost function for the estimated sound source signal and the input signal, in order to keep the distortion of the estimated sound source signal to a minimum.

The speech recognition system of claim 1, wherein the loss feature compensator comprises:
a first MFCC detector that detects the Mel-frequency cepstrum of the input signal provided from the signal converter;
a second MFCC detector that detects the Mel-frequency cepstrum of the estimated sound source signal provided from the sound source signal extractor; and
a loss feature calculator that compensates for the features lost in the estimated sound source signal, using the Mel-frequency cepstrum of the input signal and the Mel-frequency cepstrum of the estimated sound source signal.

The speech recognition system of claim 7, wherein the loss feature calculator comprises:
a mask generator that generates a reliability mask using the Mel-frequency cepstrum of the input signal and the Mel-frequency cepstrum of the estimated sound source signal; and
a loss feature compensation output unit that detects the missing features of the estimated sound source signal using the reliability mask and the Mel-frequency cepstrum of the estimated sound source signal, compensates for the missing features using a preset cluster-based speech feature model, and outputs the result.

A speech recognition method comprising:
(a) converting input signals received from the outside into the frequency domain;
(b) extracting a feature vector by performing a learning algorithm based on independent vector analysis using harmonic frequency dependency on the converted input signals, and estimating and outputting a sound source signal using the extracted feature vector; and
(c) using the input signal to compensate for features of the estimated sound source signal that were lost in the estimation process (missing features), and outputting the result.

The speech recognition method of claim 9, further comprising, before extracting the feature vector, reinforcing the sound source signal among the input signals, using information on the direction of the sound source, according to the following equation.

The speech recognition method of claim 9, wherein in step (b) a feature vector W(ω) is extracted by performing the learning algorithm based on independent vector analysis using the harmonic frequency dependency, and the extracted feature vector is modified by applying an off-diag function according to the following equation.

Here, the symbols of the equation denote the time-frequency segments of the estimated sound source signal vector and the multivariate score function, and Ω represents the number of frequency bins.

The speech recognition method of claim 9, wherein in step (b) a feature vector W(ω) is extracted by performing the learning algorithm based on independent vector analysis using the harmonic frequency dependency, and the extracted feature vector is modified so as to minimize a cost function, in order to keep the distortion between the estimated sound source signal and the input signal to a minimum.

The speech recognition method of claim 12, wherein step (b) modifies the feature vector according to the following equation.

Here, the symbols of the equation denote the time-frequency segments of the mixed signal and the time-frequency segments of the estimated sound source signal vector.

The speech recognition method of claim 9, wherein step (b) extracts a feature vector W(ω) by performing the learning algorithm based on independent vector analysis using the harmonic frequency dependency, modifies the extracted feature vector by applying an off-diag function, and determines the feature vector from the modified value so as to minimize a cost function for the estimated sound source signal and the input signal, in order to keep the distortion of the estimated sound source signal to a minimum.

The speech recognition method of claim 9, wherein step (c) comprises:
(c1) detecting the Mel-frequency cepstrum of the converted input signal;
(c2) detecting the Mel-frequency cepstrum of the estimated sound source signal; and
(c3) compensating for the features lost in the estimated sound source signal, using the Mel-frequency cepstrum of the input signal and the Mel-frequency cepstrum of the estimated sound source signal.

The speech recognition method of claim 15, wherein step (c3) comprises:
generating a reliability mask using the Mel-frequency cepstrum of the input signal and the Mel-frequency cepstrum of the estimated sound source signal, detecting the missing features of the estimated sound source signal using the reliability mask and the Mel-frequency cepstrum of the estimated sound source signal, and compensating for the missing features using a preset cluster-based speech feature model and outputting the result.