KR101240588B1 - Method and device for voice recognition using integrated audio-visual

Info

Publication number: KR101240588B1
Application number: KR1020120146790A
Authority: KR
Inventors: 김영운; 강선경; 정성태
Original assignee: 주식회사 좋은정보기술
Priority date: 2012-12-14
Filing date: 2012-12-14
Publication date: 2013-03-11

Abstract

PURPOSE: An audio-visual fusion speech recognition method and apparatus are provided that apply a weight to an audio word estimation probability and an image word estimation probability, securing a higher speech recognition rate by selectively fusing audio and video according to the signal-to-noise ratio (SNR). CONSTITUTION: An audio-visual fusion speech recognition apparatus extracts audio feature information from a voice signal (S205) and calculates an audio word estimation probability from the audio feature information (S210). The apparatus extracts image feature information from an image signal (S230) and calculates an image word estimation probability from the image feature information (S235). The apparatus calculates the SNR from the audio feature information (S215) and calculates an integrated estimation probability (S240). [Reference numerals] (AA) Voice command; (S200) Inputting a voice signal; (S205) Extracting audio feature information; (S210) Calculating an audio word estimation probability; (S215) Extracting a signal-to-noise ratio (SNR); (S220) Setting a weight; (S225) Inputting an image signal; (S230) Extracting image feature information; (S235) Calculating an image word estimation probability; (S240) Calculating an integrated estimation probability; (S245) Voice recognition

Description

Method and device for voice recognition using integrated audio-visual

The present invention relates to an audio-visual fusion speech recognition method and apparatus.

Speech is the most natural means of human communication, and speech recognition technology is used to implement speech-based communication between humans and machines.

In addition, as smart home appliances have become smaller with recent advances in information and communication technology, speech recognition is widely used as a natural user interface (NUI).

Speech recognition mainly uses an audio method. However, an audio-based speech recognition device has difficulty distinguishing noise from the voice signal in a noisy environment, which lowers the recognition rate.

An image method can be used as an alternative: a lip region is detected from the user's face, and a feature vector is extracted from the lip region to recognize a word.

The audio-visual fusion method combines the advantages of the audio and image methods. Common audio-visual fusion methods fall into two groups: early integration, which extracts audio and video feature vectors, combines them, and then performs recognition, and late integration, which recognizes audio and video separately and then calculates probability values.

The audio-visual fusion method also needs to take the signal-to-noise ratio (SNR) into account, because the recognition rate of audio-based speech recognition drops in a noisy environment, whereas audio-based recognition alone is sufficient in an environment with little noise.

The prior published patent application 2002-0057046 provides a method of integrated analysis that applies weights after speech recognition and image recognition, but that method does not consider the SNR.

An object of the present invention is to secure a higher speech recognition rate by selectively fusing audio and video in consideration of the SNR.

To achieve the above object, the invention can provide an audio-visual fusion speech recognition method comprising: receiving a voice signal, extracting audio feature information, and calculating an audio word estimation probability from the audio feature information; receiving an image signal, extracting image feature information, and calculating an image word estimation probability from the image feature information; calculating a signal-to-noise ratio (SNR) from the audio feature information and setting a weight using the SNR; and calculating an integrated estimation probability by applying the weight to the audio word estimation probability and the image word estimation probability.

Preferably, the weight is determined by Equation 1 below.

Equation 1

(where a is the weight, Max_SNR is the preset maximum value of the signal-to-noise ratio (SNR), SNR is the signal-to-noise ratio, and Min_SNR is the preset minimum value of the SNR)

Preferably, the integrated estimation probability is determined by Equation 2 below.

Equation 2

(where P is the integrated estimation probability, Large(aP_visual, (1-a)P_audio) is the larger of aP_visual and (1-a)P_audio, a is the weight, P_visual is the image word estimation probability, and P_audio is the audio word estimation probability)

Preferably, the integrated estimation probability is determined by Equation 3 below.

Equation 3

(where P is the integrated estimation probability, P_i is the i-th integrated estimation probability, a is the weight, P_{i, visual} is the i-th image word estimation probability, P_{i, audio} is the i-th audio word estimation probability, and Max(P_i) is the largest of the P_i values)

Preferably, the invention can provide an audio-visual fusion speech recognition apparatus comprising: an audio input module that receives a voice signal; an image input module that receives an image signal; an audio analysis module that extracts audio feature information from the voice signal and calculates an audio word estimation probability from the audio feature information; an image analysis module that extracts image feature information from the image signal and calculates an image word estimation probability from the image feature information; a signal-to-noise ratio (SNR) analysis module that calculates the SNR from the audio feature information and sets a weight using the SNR; and an integrated recognition module that applies the weight to the audio word estimation probability and the image word estimation probability and calculates an integrated estimation probability.

Preferably, in the audio-visual fusion speech recognition apparatus, the weight is determined by Equation 1 below.

Equation 1

Preferably, in the audio-visual fusion speech recognition apparatus, the integrated estimation probability is determined by Equation 2 below.

Equation 2

Preferably, in the audio-visual fusion speech recognition apparatus, the integrated estimation probability is determined by Equation 3 below.

Equation 3

The present invention secures a higher speech recognition rate by selectively fusing audio and video using the signal-to-noise ratio (SNR).

FIG. 1 is a diagram illustrating an audio-visual fusion speech recognition apparatus according to an embodiment of the present invention.
FIG. 2 is a flowchart illustrating an audio-visual fusion speech recognition method according to an embodiment of the present invention.

The present invention can be modified in various ways and can have various embodiments; specific embodiments are illustrated in the drawings and described in detail below. This is not intended to limit the invention to the specific embodiments, and the invention should be understood to include all modifications, equivalents, and alternatives falling within its spirit and technical scope.

In describing the present invention, detailed descriptions of related known technology are omitted where they could obscure the gist of the invention.

The terminology used in this application serves only to describe specific embodiments and is not intended to limit the invention. In this application, terms such as "comprise" or "have" specify the presence of the features, numbers, steps, operations, elements, components, or combinations thereof described in the specification, and do not preclude the presence or addition of one or more other features, numbers, steps, operations, elements, components, or combinations thereof.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings. To aid overall understanding, the same reference numerals are used for the same elements regardless of the figure number.

FIG. 1 is a diagram illustrating an audio-visual fusion speech recognition apparatus according to an embodiment of the present invention.

Referring to FIG. 1, the audio-visual fusion speech recognition apparatus 100 according to an embodiment of the present invention includes an audio input module 110, an audio analysis module 120, a signal-to-noise ratio (SNR) analysis module 130, an image input module 140, an image analysis module 150, and an integrated recognition module 160.

The audio input module 110 is connected to the audio analysis module 120; it receives a voice signal and provides it to the audio analysis module 120.

The audio analysis module 120 is connected to the audio input module 110, the SNR analysis module 130, and the integrated recognition module 160.

The audio analysis module 120 extracts audio feature information from the voice signal provided by the audio input module 110 and provides it to the SNR analysis module 130.

The audio analysis module 120 also calculates an audio word estimation probability from the audio feature information and provides it to the integrated recognition module 160.

The SNR analysis module 130 is connected to the audio analysis module 120 and the integrated recognition module 160. It calculates the SNR from the audio feature information, sets a weight using the SNR, and provides the weight to the integrated recognition module 160.

The image input module 140 is connected to the image analysis module 150; it receives an image signal and provides it to the image analysis module 150.

The image analysis module 150 extracts image feature information from the image signal provided by the image input module 140, calculates an image word estimation probability from the image feature information, and provides it to the integrated recognition module 160.
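The data flow between the modules of FIG. 1 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the class name `AudioVisualRecognizer`, its field names, and the callable signatures are all hypothetical, and each callable is only a stand-in for the module named in its comment.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Sequence

WordProbs = Dict[str, float]  # word -> estimation probability


@dataclass
class AudioVisualRecognizer:
    """Hypothetical wiring of the modules in FIG. 1 (names are illustrative)."""

    extract_audio_features: Callable[[Sequence[float]], Any]  # audio analysis module 120
    score_audio_words: Callable[[Any], WordProbs]             # audio analysis module 120
    weight_from_audio: Callable[[Any], float]                 # SNR analysis module 130
    extract_image_features: Callable[[Any], Any]              # image analysis module 150
    score_image_words: Callable[[Any], WordProbs]             # image analysis module 150
    fuse: Callable[[float, WordProbs, WordProbs], str]        # integrated recognition module 160

    def recognize(self, voice_signal: Sequence[float], image_signal: Any) -> str:
        # Audio path: features feed both the word scorer and the SNR weight.
        audio_features = self.extract_audio_features(voice_signal)
        audio_probs = self.score_audio_words(audio_features)
        weight = self.weight_from_audio(audio_features)
        # Image path: features feed the image word scorer.
        image_features = self.extract_image_features(image_signal)
        image_probs = self.score_image_words(image_features)
        # Fusion: the integrated recognition module combines both with the weight.
        return self.fuse(weight, audio_probs, image_probs)
```

Passing the stages in as callables keeps the sketch independent of any particular feature extractor or fusion rule.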

FIG. 2 is a flowchart illustrating an audio-visual fusion speech recognition method according to an embodiment of the present invention.

Referring to FIG. 2, in step S200 the audio input module 110 receives a voice signal from the user's voice command.

In step S205, the audio analysis module 120 extracts audio feature information from the voice signal provided by the audio input module 110.

As an example of audio feature extraction according to an embodiment of the present invention, a feature vector can be extracted from the voice signal by the linear prediction coefficient (LPC) method.
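A minimal sketch of LPC feature extraction for a single frame is shown below. The text does not say how the coefficients are computed; this example assumes the common autocorrelation method with a Hamming window and the Levinson-Durbin recursion. The function names `autocorrelation` and `lpc_coefficients` are illustrative, and the frame is assumed to contain at least two samples and more samples than the prediction order.

```python
import math
from typing import List, Sequence


def autocorrelation(frame: Sequence[float], max_lag: int) -> List[float]:
    """Autocorrelation r[0..max_lag] of a Hamming-windowed frame."""
    n = len(frame)
    windowed = [
        x * (0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)))
        for i, x in enumerate(frame)
    ]
    return [
        sum(windowed[i] * windowed[i + lag] for i in range(n - lag))
        for lag in range(max_lag + 1)
    ]


def lpc_coefficients(frame: Sequence[float], order: int) -> List[float]:
    """Predictor coefficients a[1..order] via the Levinson-Durbin recursion.

    Convention (an assumption of this sketch): the prediction error is
    e[n] = x[n] + sum_j a[j] * x[n - j].
    """
    r = autocorrelation(frame, order)
    a = [1.0] + [0.0] * order
    error = r[0]
    if error == 0.0:  # silent frame: nothing to predict
        return a[1:]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / error  # reflection coefficient for this order
        previous = a[:]
        for j in range(1, i):
            a[j] = previous[j] + k * previous[i - j]
        a[i] = k
        error *= 1.0 - k * k  # remaining prediction error energy
    return a[1:]
```

The returned list can serve as the feature vector for one frame; a full front end would repeat this over overlapping frames of the voice signal.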

In step S210, the audio analysis module 120 calculates an audio word estimation probability from the audio feature information.

In one embodiment of the present invention, the audio word estimation probability can be calculated using the audio feature information, a hidden Markov model (HMM), and the Viterbi algorithm.

An HMM is a statistical model of speech units such as phonemes and words that is used for speech recognition.

The Viterbi algorithm calculates a word estimation probability from a model such as an HMM. Besides the Viterbi algorithm, the forward-backward algorithm or the Baum-Welch re-estimation algorithm can be used to calculate the word estimation probability.
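As an illustration of scoring words with HMMs and the Viterbi algorithm, the sketch below assumes one small discrete-observation HMM per word and returns the log-probability of the best state path for each word. The patent does not specify the model topology or observation type, so the tuple layout `(start, trans, emit)` and the function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

NEG_INF = float("-inf")
Model = Tuple[List[float], List[List[float]], List[List[float]]]  # (start, trans, emit)


def _log(p: float) -> float:
    """Logarithm that maps probability 0 to -inf instead of raising."""
    return math.log(p) if p > 0.0 else NEG_INF


def viterbi_log_score(obs: Sequence[int], start, trans, emit) -> float:
    """Log-probability of the single best state path for the symbols in `obs`."""
    states = range(len(start))
    delta = [_log(start[s]) + _log(emit[s][obs[0]]) for s in states]
    for symbol in obs[1:]:
        delta = [
            max(delta[p] + _log(trans[p][s]) for p in states) + _log(emit[s][symbol])
            for s in states
        ]
    return max(delta)


def best_word(obs: Sequence[int], word_models: Dict[str, Model]):
    """Score `obs` against every word model and return (best word, all scores)."""
    scores = {word: viterbi_log_score(obs, *model) for word, model in word_models.items()}
    return max(scores, key=scores.get), scores
```

Working in the log domain avoids numerical underflow on long observation sequences; the scores would still need to be mapped to probabilities before fusion.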

In step S215, the SNR analysis module 130 calculates the signal-to-noise ratio (SNR) from the audio feature information.

In step S220, the SNR analysis module 130 sets a weight using the SNR.

In step S225, the image input module 140 receives an image signal while the user issues the voice command.

In step S230, the image analysis module 150 extracts image feature information from the image signal provided by the image input module 140.

The method of extracting image feature information according to an embodiment of the present invention can be an image-based method that uses the pixel values of the lip image from the image signal directly as a feature vector.

Algorithms used in the image-based method include principal component analysis, linear discriminant analysis, the discrete cosine transform, the discrete Fourier transform, and the discrete wavelet transform.
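Of the transforms listed, the discrete cosine transform is the simplest to illustrate. The sketch below computes an orthonormal 2-D DCT-II of a grayscale lip image and keeps the low-frequency block as the feature vector. The image representation (a list of pixel rows), the function names, and the choice of keeping a square `keep` x `keep` block are assumptions made for this example.

```python
import math
from typing import List, Sequence


def dct2(image: Sequence[Sequence[float]]) -> List[List[float]]:
    """Orthonormal 2-D DCT-II of a grayscale image given as rows of pixel values."""
    rows, cols = len(image), len(image[0])
    out = [[0.0] * cols for _ in range(rows)]
    for u in range(rows):
        scale_u = math.sqrt((1.0 if u == 0 else 2.0) / rows)
        for v in range(cols):
            scale_v = math.sqrt((1.0 if v == 0 else 2.0) / cols)
            total = 0.0
            for y in range(rows):
                cos_y = math.cos(math.pi * (2 * y + 1) * u / (2 * rows))
                for x in range(cols):
                    cos_x = math.cos(math.pi * (2 * x + 1) * v / (2 * cols))
                    total += image[y][x] * cos_y * cos_x
            out[u][v] = scale_u * scale_v * total
    return out


def dct_features(image: Sequence[Sequence[float]], keep: int) -> List[float]:
    """Feature vector made of the keep x keep lowest-frequency DCT coefficients."""
    coeffs = dct2(image)
    return [coeffs[u][v] for u in range(keep) for v in range(keep)]
```

The direct quadruple loop is only practical for small lip crops; a production system would use a fast transform.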

The method of extracting image feature information according to another embodiment of the present invention can be a geometric-feature-based method that uses geometric properties of the lips, such as width, height, and inner area, as the feature vector.
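A minimal sketch of geometric features, assuming the lip region has already been segmented into a binary mask (1 inside the lips, 0 elsewhere). The segmentation step, the mask format, and the function name `lip_geometry` are assumptions of this example; the text only names width, height, and inner area as possible features.

```python
from typing import List, Sequence


def lip_geometry(mask: Sequence[Sequence[int]]) -> List[float]:
    """Return [width, height, area] in pixels for a binary lip mask."""
    points = [
        (x, y)
        for y, row in enumerate(mask)
        for x, value in enumerate(row)
        if value
    ]
    if not points:  # no lip pixels detected
        return [0.0, 0.0, 0.0]
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    width = max(xs) - min(xs) + 1   # horizontal extent of the lip region
    height = max(ys) - min(ys) + 1  # vertical extent of the lip region
    area = len(points)              # number of lip pixels
    return [float(width), float(height), float(area)]
```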

The method of extracting image feature information according to yet another embodiment of the present invention can be a model-based method that generates a model of the lip contour and shape and uses it as the feature vector.

The model-based method can use, for example, a method based on lip boundary information or a lip-shape modeling method.

In step S235, the image analysis module 150 calculates an image word estimation probability from the image feature information.

In one embodiment of the present invention, the image word estimation probability can be calculated using the image feature information, a hidden Markov model (HMM), and the Viterbi algorithm.

In step S240, the integrated recognition module 160 calculates an integrated estimation probability by applying the weight to the audio word estimation probability and the image word estimation probability.

An embodiment of the present invention is described below.

In one embodiment of the present invention, the weight is determined by Equation 1 below.

Equation 1

Here, a is the weight, Max_SNR is the preset maximum value of the signal-to-noise ratio (SNR), SNR is the signal-to-noise ratio, and Min_SNR is the preset minimum value of the SNR.

The SNR can be obtained using the stnr (speech-to-noise ratio) algorithm provided by the National Institute of Standards and Technology (NIST).
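For orientation only, the sketch below computes a basic energy-ratio SNR in decibels from a speech segment and a noise-only segment. It is not the NIST stnr algorithm, whose details are not given in this text, and it works on raw samples rather than on the audio feature information; the function names and the assumption that a separate noise-only segment is available are illustrative.

```python
import math
from typing import Sequence


def mean_power(samples: Sequence[float]) -> float:
    """Average squared amplitude of a non-empty sample sequence."""
    return sum(x * x for x in samples) / len(samples)


def snr_db(speech: Sequence[float], noise: Sequence[float]) -> float:
    """Energy-ratio SNR in dB (a simple stand-in, not the NIST stnr algorithm)."""
    noise_power = mean_power(noise)
    speech_power = mean_power(speech)
    if noise_power == 0.0:
        return float("inf")   # no measurable noise
    if speech_power == 0.0:
        return float("-inf")  # no measurable speech
    return 10.0 * math.log10(speech_power / noise_power)
```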

Max_SNR can be preset as follows: measurement intervals are defined according to the magnitude of the noise signal, the SNR is obtained for each noise magnitude, and the maximum of those values is taken.

Min_SNR can be preset to the minimum of the SNR values obtained for each noise magnitude.

The weight a takes a value between 0 and 1. The larger the SNR, the closer a is to 0; the smaller the SNR, the closer a is to 1.

The noise signal is smallest when the SNR equals Max_SNR, in which case the weight a is 0.

The noise signal is largest when the SNR equals Min_SNR, in which case the weight a is 1.
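The image of Equation 1 is not reproduced in this text. One form that satisfies every property stated above (a lies between 0 and 1, a = 0 at SNR = Max_SNR, a = 1 at SNR = Min_SNR, and a decreases as the SNR grows) is a linear interpolation between the two preset bounds. The expression below is a reconstruction under that assumption, not a quotation of the original equation:

```latex
a = \frac{\mathrm{Max}_{SNR} - SNR}{\mathrm{Max}_{SNR} - \mathrm{Min}_{SNR}},
\qquad \mathrm{Min}_{SNR} \le SNR \le \mathrm{Max}_{SNR}
```

Substituting SNR = Max_SNR gives a = 0, and substituting SNR = Min_SNR gives a = 1, matching the two boundary cases described in the text.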

In one embodiment of the present invention, the integrated estimation probability is determined by Equation 2 below.

Equation 2

Here, P is the integrated estimation probability, Large(aP_visual, (1-a)P_audio) is the larger of aP_visual and (1-a)P_audio, a is the weight, P_visual is the image word estimation probability, and P_audio is the audio word estimation probability.
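The image of Equation 2 is not reproduced in this text, but the symbol definitions describe it almost completely. Written out from those definitions, and assuming Large denotes the ordinary maximum of its two arguments, it reads:

```latex
P = \mathrm{Large}\bigl(a\,P_{visual},\ (1-a)\,P_{audio}\bigr)
  = \max\bigl(a\,P_{visual},\ (1-a)\,P_{audio}\bigr)
```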

The audio word estimation probability is the probability assigned to the word estimated from the audio feature information.

The image word estimation probability is the probability assigned to the word estimated from the image feature information.

Because the estimated word for the audio word estimation probability is derived from the audio feature information and the estimated word for the image word estimation probability is derived from the image feature information, the two estimated words may differ.

When they differ, Equation 2 determines which of the estimated words is selected as the recognized word.

When a word is recognized by calculating the integrated estimation probability according to Equation 2, recognition relies mainly on the image method when the SNR is low and mainly on the audio method when the SNR is high.

For example, the audio-visual fusion speech recognition apparatus 100 selects the image word estimation probability as the integrated estimation probability when the weight a is 1, selects the audio word estimation probability when a is 0, and selects the larger of aP_visual and (1-a)P_audio when a is between 0 and 1.

The audio-visual fusion speech recognition apparatus 100 selects the estimated word corresponding to the integrated estimation probability as the recognized word.
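The selection rule described above can be sketched in a few lines. Two points are assumptions of this example rather than statements from the patent: the weight is computed by linear interpolation between Min_SNR and Max_SNR (a form consistent with the boundary behavior the text describes, clamped to the range 0 to 1), and ties between the two weighted terms go to the audio word. The function names are illustrative.

```python
from typing import Tuple


def snr_weight(snr: float, min_snr: float, max_snr: float) -> float:
    """Weight a in [0, 1]; assumed linear between the preset SNR bounds."""
    if max_snr <= min_snr:
        raise ValueError("max_snr must be greater than min_snr")
    a = (max_snr - snr) / (max_snr - min_snr)
    return min(1.0, max(0.0, a))  # clamp SNR values outside the preset range


def fuse_large(
    a: float,
    p_visual: float,
    word_visual: str,
    p_audio: float,
    word_audio: str,
) -> Tuple[float, str]:
    """Return the larger of a*P_visual and (1-a)*P_audio with its word."""
    visual_term = a * p_visual
    audio_term = (1.0 - a) * p_audio
    if visual_term > audio_term:
        return visual_term, word_visual
    return audio_term, word_audio  # ties go to audio (an assumption)
```

For example, with bounds of 0 dB and 30 dB, an SNR of 15 dB gives a = 0.5; an image probability of 0.6 and an audio probability of 0.8 then yield weighted terms 0.3 and 0.4, so the audio word is selected.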

In another embodiment of the present invention, the integrated estimation probability is determined by Equation 3 below.

Equation 3

Here, P is the integrated estimation probability, P_i is the i-th integrated estimation probability, a is the weight, P_{i, visual} is the i-th image word estimation probability, P_{i, audio} is the i-th audio word estimation probability, and Max(P_i) is the largest of the P_i values.
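The image of Equation 3 is not reproduced in this text. The symbol definitions fix the final step, P = Max(P_i), but not how each P_i is formed from the two per-word probabilities. A weighted sum is one plausible reading, by analogy with the weighting in Equation 2; the first line below is therefore an assumption, not a quotation of the original:

```latex
P_i = a\,P_{i,visual} + (1-a)\,P_{i,audio},
\qquad
P = \mathrm{Max}(P_i)
```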

The audio analysis module 120 calculates a plurality of audio word estimation probabilities.

The image analysis module 150 calculates a plurality of image word estimation probabilities.

The integrated recognition module 160 assigns an index i to every estimated word and calculates the i-th integrated estimation probability P_i for the i-th word.

The integrated recognition module 160 selects the largest of the P_i values as the integrated estimation probability.

Through this process, the integrated recognition module 160 can select, from the words recognized by the audio analysis module 120 and the image analysis module 150, the word with the highest integrated estimation probability.
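The per-word procedure above can be sketched as follows. The exact form of P_i is not recoverable from this text, so the example assumes a weighted sum, a * P_{i, visual} + (1 - a) * P_{i, audio}, and treats a word missing from one modality as having probability 0 there. Both choices, and the function name, are assumptions of this illustration.

```python
from typing import Dict, Tuple


def fuse_weighted_sum(
    a: float,
    visual: Dict[str, float],
    audio: Dict[str, float],
) -> Tuple[str, float]:
    """Combine per-word probabilities and return (best word, its P_i)."""
    words = sorted(set(visual) | set(audio))  # index every estimated word
    if not words:
        raise ValueError("no candidate words")
    combined = {
        # Assumed form of P_i: weighted sum of the two modalities.
        w: a * visual.get(w, 0.0) + (1.0 - a) * audio.get(w, 0.0)
        for w in words
    }
    best = max(words, key=lambda w: combined[w])  # P = Max(P_i)
    return best, combined[best]
```

With a small weight (high SNR) the audio probabilities dominate the result, and with a large weight (low SNR) the image probabilities dominate, which matches the behavior the description attributes to the fusion.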

100: audio-visual speech recognition apparatus
110: audio input module
120: audio analysis module
130: signal-to-noise ratio (SNR) analysis module
140: image input module
150: image analysis module
160: integrated recognition module

Claims

An audio-visual fusion speech recognition method comprising:
receiving a voice signal, extracting audio feature information, and calculating an audio word estimation probability from the audio feature information;
receiving an image signal, extracting image feature information, and calculating an image word estimation probability from the image feature information;
calculating a signal-to-noise ratio (SNR) from the audio feature information and setting a weight using the SNR; and
calculating an integrated estimation probability by applying the weight to the audio word estimation probability and the image word estimation probability.

The method of claim 1,
wherein the weight is determined by Equation 1 below.
Equation 1

(where a is the weight, Max_SNR is the preset maximum value of the signal-to-noise ratio (SNR), SNR is the signal-to-noise ratio, and Min_SNR is the preset minimum value of the SNR)

The method of claim 2,
wherein the integrated estimation probability is determined by Equation 2 below.
Equation 2

(where P is the integrated estimation probability, Large(aP_visual, (1-a)P_audio) is the larger of aP_visual and (1-a)P_audio, a is the weight, P_visual is the image word estimation probability, and P_audio is the audio word estimation probability)

The method of claim 2,
wherein the integrated estimation probability is determined by Equation 3 below.
Equation 3

(where P is the integrated estimation probability, P_i is the i-th integrated estimation probability, a is the weight, P_{i, visual} is the i-th image word estimation probability, P_{i, audio} is the i-th audio word estimation probability, and Max(P_i) is the largest of the P_i values)

An audio-visual fusion speech recognition apparatus comprising:
an audio input module configured to receive a voice signal;
an image input module configured to receive an image signal;
an audio analysis module that extracts audio feature information from the voice signal and calculates an audio word estimation probability from the audio feature information;
an image analysis module that extracts image feature information from the image signal and calculates an image word estimation probability from the image feature information;
a signal-to-noise ratio (SNR) analysis module that calculates the SNR from the audio feature information and sets a weight using the SNR; and
an integrated recognition module that applies the weight to the audio word estimation probability and the image word estimation probability and calculates an integrated estimation probability.

The apparatus of claim 5,
wherein the weight is determined by Equation 1 below.
Equation 1

(where a is the weight, Max_SNR is the preset maximum value of the signal-to-noise ratio (SNR), SNR is the signal-to-noise ratio, and Min_SNR is the preset minimum value of the SNR)

The apparatus of claim 6,
wherein the integrated estimation probability is determined by Equation 2 below.
Equation 2

The apparatus of claim 6,
wherein the integrated estimation probability is determined by Equation 3 below.
Equation 3