KR100576803B1

KR100576803B1 - Speech Recognition Method and Device by Integrating Audio, Visual and Contextual Features Based on Neural Networks

Info

Publication number: KR100576803B1
Application number: KR1020030090418A
Authority: KR
Inventors: 한문성; 박준석; 김명원; 류정우; 박준
Original assignee: 한국전자통신연구원
Priority date: 2003-12-11
Filing date: 2003-12-11
Publication date: 2006-05-10
Also published as: KR20050058161A

Abstract

The present invention relates to a neural-network-based speech recognition apparatus and method for robust speech recognition in noisy environments. Audio and visual information are fused efficiently on the basis of a neural network, and integrated recognition over audio, visual, and contextual information is performed using context information (the command usage patterns on a mobile terminal) together with a post-processing method, so that the speech recognition rate can be further improved.

The integrated speech recognition method of the present invention comprises: a feature extraction step of extracting feature vectors from input audio and video signals; a dual-mode neural network recognition step of recognizing the user's speech by fusing the audio and visual information on the basis of a neural network; a context information recognition step of recognizing the user's command pattern on a mobile terminal; and a post-processing step of outputting a final recognition result by integrating the dual-mode neural network recognition result with the context information recognition result.

Speech recognition, dual-mode recognition, neural network recognizer, BMNN, backpropagation learning algorithm, context information recognition

Description

Speech Recognition Method and Device by Integrating Audio, Visual and Contextual Features Based on Neural Networks

FIG. 1 shows the configuration of a dual-mode speech recognition apparatus according to the present invention and its processing flow.

FIG. 2 shows the structure of a neural-network-based integrated speech recognizer according to the present invention.

FIG. 3 shows a dual-mode speech recognizer that fuses audio and visual information according to the present invention.

FIG. 4 shows a context information recognizer according to the present invention.

<Description of reference numerals for the main parts of the drawings>

101: feature extractor 102: dual-mode neural network recognizer

103: context information recognizer 104: post-processing unit

The present invention relates to a dual-mode speech recognition apparatus and method based on a neural network. More specifically, it relates to a neural-network-based integrated speech recognition apparatus and method for audio, visual, and contextual information that uses a neural network to fuse the audio and visual information of a dual-mode speech recognizer efficiently and integrates information on the user's command usage patterns on a mobile terminal, thereby providing an improved speech recognition rate in noisy environments and allowing the user's intention to be identified clearly.

In recent years, as society has become increasingly multimedia-oriented, research on multimodal recognition using facial expression and orientation, lip shape, gaze tracking, hand gestures, and speech has been actively pursued in order to make the human-machine interface simpler and clearer.

In particular, with recent advances in mobile terminal technology, this research has been actively pursued in the form of dual-mode speech recognition, a speech recognition method that is robust to noisy environments. Dual-mode speech recognition improves the speech recognition rate by simultaneously taking into account visual information that can complement audio information, which is sensitive to noise. For example, when people talk in a noisy environment such as a factory, they grasp meaning not only from each other's voices but also from visual information such as mouth shape or gestures.

The performance of dual-mode speech recognition depends on how efficiently the heterogeneous audio and visual information is fused. Existing fusion methods fall broadly into feature fusion and decision fusion. Feature fusion fuses the information before recognition, whereas decision fusion performs the final recognition by fusing the results of separate recognitions.

For fusion to be effective, feature fusion must solve the problem of synchronizing the audio and visual information, whereas decision fusion must satisfy the assumption that the audio and visual information are independent of each other and uncorrelated. Consequently, decision fusion is easier to apply than feature fusion but performs worse, while feature fusion performs well but is difficult to apply because synchronization of the input information must be taken into account.

HMMs (hidden Markov models) and neural networks are the representative means of such fusion. Fusion using an HMM is difficult to apply for two reasons: it is hard to determine the number of states and the number of Gaussian mixtures, which are training parameters to which the result is sensitive, and the commonly used continuous density hidden Markov model (CDHMM) imposes the constraint that the input features satisfy a statistical independence condition.

A neural-network-based fusion method, by contrast, can fuse heterogeneous information efficiently. However, previously proposed neural network methods are inefficient in that the learning method and the model are complex, or the computational load grows because the model is large.

In particular, putting speech recognition to practical use on mobile terminals requires not only a speech recognition method that is robust in noisy environments but also a method of generating the recognizer efficiently.

Accordingly, the present invention has been made to address the problems and needs described above. An object of the present invention is to provide a neural-network-based integrated speech recognition apparatus and method for audio, visual, and contextual information that, for robust speech recognition in noisy environments, fuses audio and visual information efficiently using a neural network and applies a post-processing technique based on context information, thereby further improving the speech recognition rate.

To achieve this object, the neural-network-based integrated speech recognition apparatus for audio, visual, and contextual information comprises: a feature extractor that extracts features of the input audio and video signals; a dual-mode neural network recognizer that recognizes the user's speech by fusing the feature information of the audio and video signals on the basis of a neural network; a context information recognizer that recognizes the user's command pattern; and a post-processing unit that integrates the output value of the dual-mode neural network recognizer with the output value of the context information recognizer and outputs a final recognition result.

Likewise, the neural-network-based integrated speech recognition method for audio, visual, and contextual information comprises: a feature extraction step of extracting feature vectors from the input audio and video signals; a dual-mode neural network recognition step of recognizing the user's speech by fusing the extracted audio and visual features on the basis of a neural network; a context information recognition step of recognizing the user's command pattern on a mobile terminal; and a post-processing step of integrating the output value of the dual-mode neural network recognition with the output value of the context information recognition and outputting a final recognition result.

Hereinafter, an embodiment of the present invention is described in detail with reference to the accompanying drawings.

FIG. 1 shows the configuration of the neural-network-based dual-mode speech recognition apparatus of the present invention and its processing flow.

Referring to FIG. 1, the dual-mode speech recognition apparatus of the present invention comprises a feature extractor (101), a bimodal neural network (BMNN) recognizer (102), a context information recognizer (103), and a post-processing unit (104).

The feature extractor (101) stores images captured by a camera in an image buffer together with time information (the system tick), and segments the audio signal input simultaneously through a microphone into isolated words by an endpoint detection process. It then computes the tick corresponding to the endpoint detection time and retrieves from the image buffer the images that were input at the same time. Feature vectors are then extracted, using a feature extraction method, from the video and audio signals synchronized in this way.
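The tick-based synchronization described above can be illustrated with a minimal sketch. It is not the disclosed implementation: the function name `frames_for_word`, the `(tick, payload)` frame representation, and the sample tick values are all assumptions made for illustration, and the buffer is assumed to be sorted by tick.

```python
from bisect import bisect_left, bisect_right
from typing import List, Tuple

# (system tick, frame payload); the string payload is purely illustrative.
Frame = Tuple[int, str]


def frames_for_word(buffer: List[Frame], start_tick: int, end_tick: int) -> List[Frame]:
    """Return the buffered frames whose tick lies in [start_tick, end_tick].

    Assumes the buffer is sorted by tick, as it would be if frames are
    appended in capture order.
    """
    ticks = [tick for tick, _ in buffer]
    lo = bisect_left(ticks, start_tick)   # first frame at or after the word start
    hi = bisect_right(ticks, end_tick)    # one past the last frame at or before the word end
    return buffer[lo:hi]


# Hypothetical data: one frame every 33 ticks, and an isolated word whose
# endpoints were detected at ticks 60 and 140.
image_buffer = [(t, f"frame@{t}") for t in range(0, 200, 33)]
word_frames = frames_for_word(image_buffer, 60, 140)
print([tick for tick, _ in word_frames])  # [66, 99, 132]
```

Selecting frames by the word's endpoint ticks is what lets the later stages treat the audio segment and its video frames as one synchronized isolated-word sample.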

The extracted feature vectors of the video and audio signals are used as the input of the dual-mode neural network recognizer (102).

FIG. 2 shows the structure of the neural-network-based integrated speech recognizer for audio, visual, and contextual information according to the present invention. FIG. 3 shows the structure of the dual-mode neural network recognizer (102), which fuses audio and visual information, and FIG. 4 shows the structure of the context information recognizer (103).

As shown in FIG. 3, the dual-mode neural network recognizer (102) consists of four layers (an input layer, a hidden layer, a fusion layer, and a combining layer) and has a feedforward neural network structure. The learning algorithm of the present invention is the backpropagation algorithm, and both training and recognition are performed on isolated words.

In the input layer, a window concept is applied in order to abstract the input features while taking into account the implicit temporal position of the input information, and an overlap-zone structure, which performs well in isolated-word recognition, is used.

The fusion layer fuses the audio and visual features using neural network weights that have been suitably adjusted by the neural network learning process, in order to compensate for the loss of audio information caused by noise.

As for the connection structure of the recognizer and the number of frames and nodes in each layer: the nodes of all frames contained in a window are fully connected to the nodes of the corresponding frame in the layer above, and, because the fusion layer has no window, it is fully connected to the output layer.
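The window-to-frame correspondence can be pictured with the following sketch. The source does not give window sizes or the exact form of the overlap-zone structure, so this is one plausible reading only: the helper `window_frames` and its `window` and `stride` parameters are hypothetical, and overlap is modeled simply as a stride smaller than the window.

```python
from typing import List


def window_frames(n_frames: int, window: int, stride: int) -> List[List[int]]:
    """For each upper-layer frame, list the lower-layer frame indices it covers.

    Every node in the listed lower-layer frames would be fully connected to
    the nodes of the corresponding upper-layer frame.
    """
    if window < 1 or stride < 1 or window > n_frames:
        raise ValueError("need 1 <= window <= n_frames and stride >= 1")
    starts = range(0, n_frames - window + 1, stride)
    return [list(range(start, start + window)) for start in starts]


# Hypothetical sizes: 8 input-layer frames, a window of 4 frames, a stride of 2,
# so that neighbouring windows share 2 frames.
groups = window_frames(8, 4, 2)
for upper_index, lower_indices in enumerate(groups):
    print(upper_index, lower_indices)
```

With these assumed sizes the eight lower-layer frames map onto three upper-layer frames, which shows how windowing shrinks each successive layer and therefore the model.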

As described above, the dual-mode neural network recognizer (102) has a feedforward neural network structure and uses the backpropagation learning algorithm. A basic neural network provides a nonlinear mapping function between the input (I) and the output (O), as in Equation 1 below.

(1)

Here, W denotes the weights of the neural network and A denotes the neural network structure.

Neural network learning is the process of suitably adjusting the weights of the network so that it generates the target output value when an input is presented.

The aim of backpropagation learning is to determine the weights by gradient descent, iteratively minimizing the error (E) of Equation 2 with respect to the weights of the neural network.

(2)

Here, the symbols in Equation 2 denote, respectively, the kn-th target output value and the actual output value computed by the neural network from the input pattern.

In addition, the weight change at any given point during learning is given by Equation 3 below.

(3)
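The formulas for Equations 1 to 3 are not reproduced in this text. For orientation only, the standard backpropagation forms that agree with the surrounding definitions are sketched below. The symbols f, t_{kn}, o_{kn}, w, and \eta are assumptions introduced here, as is the reading of "kn-th" as output node k for training pattern n; the notation of the original formulas may differ.

```latex
% Eq. (1), assumed form: the network as a nonlinear mapping from input I to
% output O, parameterized by the weights W and the structure A.
O = f(I;\, W, A)

% Eq. (2), assumed form: sum-of-squares error between target outputs t_{kn}
% and actual outputs o_{kn}, over patterns n and output nodes k.
E = \frac{1}{2} \sum_{n} \sum_{k} \left( t_{kn} - o_{kn} \right)^{2}

% Eq. (3), assumed form: gradient-descent weight change with learning rate \eta.
\Delta w = -\eta \, \frac{\partial E}{\partial w}
```

Under these forms, each weight moves against the gradient of E, so repeated updates reduce the gap between the target and actual outputs.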

Accordingly, given the number of learning iterations and the learning rate, the general expressions for the weight update of each layer in the dual-mode neural network recognizer (102) are given by Equations 4 to 6 below.

- Weight update between the output layer (k) and the fusion layer (c)

(4)

- Weight update between the fusion layer (c) and the hidden layer (h)

(5)

- Weight update between the hidden layer (h) and the input layer (i)

(6)

Here, the symbols in Equations 4 to 6 denote, in turn: the output value of a given node of a given input-layer frame; the error for a given node of a given hidden-layer frame; and the set of fusion-layer frames corresponding to a given hidden-layer frame.

Meanwhile, the context information recognizer (103) also has a multilayer perceptron structure, one capable of learning sequential information, and uses the backpropagation learning algorithm.

In a conventional neural network, the input nodes of the input layer carry no sequential information, that is, no information about the order in which commands are input, so a command pattern that contains sequential information cannot be recognized. The present invention therefore represents the input values in binary form, so that the sequential information of the input command pattern is available. FIG. 4 shows the structure of the context information recognizer (103) for the case where the isolated words to be recognized are the four words "zero", "one", "two", and "three".

Here, the number of input nodes is set equal to the number of words to be recognized multiplied by the number of preceding words, and the number of output nodes is set equal to the number of words to be recognized. The number of preceding words is the number of words considered when predicting the current word, that is, how many of the most recently recognized words, in order, are used. If the input is "one zero two: 0100 1000 0010", the output node for the word most frequently used after "one zero two" produces the highest value. In FIG. 4, "three" has the highest value, so in this case the word to follow "one zero two" is predicted to be "three".
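The binary input representation in this example can be reproduced with a short sketch. The vocabulary order ("zero", "one", "two", "three") is inferred from the bit patterns given above; the function name `encode_history` and the oldest-first ordering of the blocks are assumptions made for illustration.

```python
from typing import List, Sequence

# Vocabulary order inferred from the example: "one" -> 0100, "zero" -> 1000, "two" -> 0010.
VOCABULARY = ["zero", "one", "two", "three"]


def encode_history(words: Sequence[str], vocabulary: Sequence[str] = VOCABULARY) -> List[int]:
    """Concatenate one binary block per preceding word, oldest word first.

    Each block has one bit per vocabulary word, with a single 1 at the
    position of the word it encodes.
    """
    bits: List[int] = []
    for word in words:
        block = [0] * len(vocabulary)
        block[vocabulary.index(word)] = 1
        bits.extend(block)
    return bits


inputs = encode_history(["one", "zero", "two"])
print("".join(str(bit) for bit in inputs))  # 010010000010
print(len(inputs))                          # 12 = 4 words x 3 preceding words
```

Because each preceding word occupies its own block of input nodes, the position of a block tells the network where that word occurred in the sequence, which is how the order of commands reaches the input layer.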

If the number of preceding words is set too large, each pattern occurs infrequently and the predicted output values tend to be low; if it is set too small, the predicted values tend to be biased toward a very small number of patterns. The number of preceding words therefore needs to be set appropriately.

Meanwhile, the post-processing unit (104) combines the result of the dual-mode neural network recognizer (102) with the result of the context information recognizer (103) by the sequential combination method shown in Table 1 below.

[Table 1]

BMNN(i): output value of the i-th output node of the dual-mode neural network recognizer
Con(i): output value of the i-th output node of the context information recognizer
threshold: threshold value

IF (threshold < BMNN(i)): recognize the i-th isolated word, where i = max_i(BMNN(i))
ELSE IF (threshold < Con(i)): recognize the i-th isolated word, where i = max_i(Con(i))
ELSE IF ((BMNN(i) below threshold) and (Con(i) below threshold)): recognize the i-th isolated word, where i = max_i(BMNN(i) · Con(i))

In the sequential combination method, the recognition result of the context information recognizer (103) is taken into account when the recognition result of the dual-mode neural network recognizer (102) is smaller than the threshold set by the user.

If the results of both recognizers are smaller than the threshold, the recognition results of the two recognizers are multiplied and the largest product is selected, so that the final recognition result is the one for which the difference between the dual-mode neural network recognizer's result and the context information recognizer's result is small, that is, the one that both recognizers recognized similarly. The threshold set by the user is the minimum value at which the user is willing to trust a recognizer's result.
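The sequential combination rule can be sketched as follows. This is an illustrative reading rather than the disclosed implementation: the function name `combine` and the sample scores are hypothetical, and the sketch assumes that each threshold test is applied to the recognizer's highest output value, with a value equal to the threshold falling through to the next rule.

```python
from typing import Sequence


def combine(bmnn: Sequence[float], con: Sequence[float], threshold: float) -> int:
    """Return the index of the recognized isolated word.

    bmnn: output values of the dual-mode neural network recognizer.
    con: output values of the context information recognizer.
    """
    if not bmnn or len(bmnn) != len(con):
        raise ValueError("both recognizers must produce the same non-zero number of outputs")
    indices = range(len(bmnn))
    # Rule 1: trust the dual-mode recognizer when its best output exceeds the threshold.
    if max(bmnn) > threshold:
        return max(indices, key=lambda i: bmnn[i])
    # Rule 2: otherwise trust the context recognizer when its best output exceeds it.
    if max(con) > threshold:
        return max(indices, key=lambda i: con[i])
    # Rule 3: neither is trusted on its own, so pick the word with the largest product.
    return max(indices, key=lambda i: bmnn[i] * con[i])


# Hypothetical output values for the words "zero", "one", "two", "three".
print(combine([0.1, 0.9, 0.0, 0.0], [0.7, 0.1, 0.1, 0.1], 0.7))      # 1
print(combine([0.4, 0.5, 0.05, 0.05], [0.05, 0.05, 0.1, 0.8], 0.7))  # 3
print(combine([0.5, 0.4, 0.05, 0.05], [0.1, 0.6, 0.2, 0.1], 0.7))    # 1
```

In the third call neither recognizer clears the threshold, and the product favours "one", the word on which both recognizers give moderately high values, over "zero", which only the dual-mode recognizer prefers.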

The neural-network-based integrated speech recognition method for audio, visual, and contextual information of the present invention described above may be stored on a computer-readable recording medium. Such recording media include every kind of recording medium on which programs and data are stored so as to be readable by a computer system; examples include ROM (read-only memory), RAM (random access memory), CD-ROM, DVD-ROM, magnetic tape, floppy disks, and optical data storage devices. Such recording media may also be distributed across computer systems connected by a network, so that computer-readable code is stored and executed in a distributed manner.

As described above, the neural-network-based integrated speech recognition apparatus and method for audio, visual, and contextual information according to the present invention can improve the speech recognition rate by fusing audio and visual information efficiently and applying a post-processing technique that uses the user's command patterns.

In addition, the dual-mode neural network recognizer applies windows to abstract the input information, which reduces the size of the model and so allows the model to be generated efficiently; it also gains efficiency by applying the overlap-zone structure, which performs well in isolated-word recognition, together with the feature fusion method. The post-processing method using context information can greatly improve the speech recognition rate by drawing on the user's behavior patterns in cases that are difficult to recognize from audio and visual information alone.

What has been described above is merely one embodiment for carrying out the neural-network-based integrated speech recognition apparatus and method for audio, visual, and contextual information according to the present invention. The present invention is not limited to this embodiment; its technical spirit extends to the full range of modifications that anyone of ordinary skill in the art to which the invention pertains could make without departing from the gist of the invention as claimed in the following claims.

Claims

An integrated speech recognition apparatus for audio, visual, and contextual information based on a neural network, comprising:

a feature extractor that extracts features of input audio and video signals;

a dual-mode neural network recognizer that has a feedforward multilayer perceptron structure including an input layer, a hidden layer, a fusion layer, and a combining layer; that adjusts the neural network weights between the layers through a backpropagation learning algorithm so that a target output value is generated for an input pattern; that applies a window concept in the input layer and the hidden layer to perform an abstraction function on the input features; and that, in the fusion layer, fuses the features of the audio and video signals with respect to implicit temporal position through the window concept, using the adjusted neural network weights, and recognizes the user's speech as isolated words, thereby compensating for the loss of audio information caused by noise;

a context information recognizer that has a multilayer perceptron structure capable of learning sequential information using a backpropagation learning algorithm, and that represents input information in binary form so that the sequential information can be recognized; and

a post-processing unit that considers the recognition result value of the dual-mode neural network recognizer first, considers the recognition result value of the context information recognizer next only when the recognition result value of the dual-mode neural network recognizer is smaller than a threshold, and, when both recognition result values are smaller than the threshold, multiplies the recognition result values of the two recognizers and selects, as the final recognition result, the one for which the difference between the two recognizers' recognition result values is small.

The apparatus of claim 1, wherein the feature extractor segments the input audio signal into isolated words through an endpoint detection process, synchronizes the audio signal and the video signal in units of isolated words, and extracts feature vectors for each of the input audio and video signals.

(Deleted)

The apparatus of claim 1, wherein the dual-mode neural network recognizer applies a window concept in the input layer and the hidden layer to perform an abstraction function on the input features.

(Deleted)

An integrated speech recognition method for audio, visual, and contextual information based on a neural network, comprising:

a feature extraction step of extracting feature vectors from input audio and video signals;

a dual-mode neural network recognition step of using a feedforward multilayer perceptron structure including an input layer, a hidden layer, a fusion layer, and a combining layer, in which the fusion layer fuses the features of the audio and video signals with respect to implicit temporal position through a window concept on the basis of the neural network, to recognize speech as isolated words and thereby compensate for the loss of audio information caused by noise;

a context information recognition step of recognizing the user's command pattern on a mobile terminal by recognizing sequentially used user command patterns through a multilayer perceptron structure capable of learning sequential information using a backpropagation learning algorithm, with input information represented in binary form so that the sequential information can be recognized; and

a post-processing step of considering the recognition result value of the dual-mode neural network recognition step first, considering the recognition result value of the context information recognition step next only when the former is smaller than a threshold, and, when both recognition result values are smaller than the threshold, multiplying the two recognition result values and selecting and outputting, as the final recognition result, the one for which the difference between the two recognition result values is small.

The method of claim 9, wherein the feature extraction step synchronizes the audio signal and the video signal in units of isolated words segmented by an endpoint detection process, and extracts feature vectors for each of the audio and video signals.

(Deleted)

The method of claim 9, wherein the dual-mode neural network recognition step performs an abstraction function on the input features by applying a window concept to each layer.

(Deleted)