KR102116054B1

KR102116054B1 - Voice recognition system based on deep neural network

Info

Publication number: KR102116054B1
Application number: KR1020160103586A
Authority: KR
Inventors: 정훈; 박전규; 이성주; 이윤근
Original assignee: 한국전자통신연구원
Priority date: 2016-08-16
Filing date: 2016-08-16
Publication date: 2020-05-28
Also published as: KR20180019347A

Abstract

The present invention relates to a deep neural network based speech recognition system in which the nonlinear activation function, a parameter of the deep neural network based acoustic model, is expressed in a trainable form so that the acoustic model can be trained effectively and performance can be improved. The system includes an input unit that receives various information; a storage unit that stores a speech processing algorithm; and a deep neural network based processing unit that performs speech recognition by applying the speech processing algorithm stored in the storage unit to a speech signal input through the input unit, wherein the processing unit uses an activation function expressed as a power series as the nonlinear activation function that is an acoustic model parameter.

Description

Voice recognition system based on deep neural network

The present invention relates to a deep neural network based speech recognition system. More specifically, it relates to a deep neural network based speech recognition system in which the nonlinear activation function, a parameter of the deep neural network based acoustic model, is expressed in a trainable form so that the acoustic model can be trained effectively and performance can be improved.

Classifying input patterns into specific groups is a problem encountered frequently in engineering. As a way of solving it, research on applying the efficient pattern recognition methods of humans to actual computers has recently been active.

Among the various lines of this research is the artificial neural network, an engineering model of the cell structure of the human brain, in which efficient pattern recognition takes place. To classify input patterns into specific groups, an artificial neural network uses an algorithm that imitates the human ability to learn.

In addition, an artificial neural network can generalize: based on what it has learned, it can produce relatively correct outputs for input patterns that were not used in training. Because of these two representative capabilities, learning and generalization, artificial neural networks are applied to problems that are hard to solve with conventional sequential programming. Their range of use is wide, and they are actively applied to pattern classification, continuous mapping, nonlinear system identification, nonlinear control and robot control, speech recognition, and other fields.

Current speech recognition reduces to the problem of finding the word W that yields the maximum likelihood for the feature parameter X, which can be expressed as Equation 1 below.

[Equation 1]

As can be seen above, Equation 1 contains three probability models: P(X|M) is the acoustic model, P(M|W) is the pronunciation model, and P(W) is the language model.

The language model P(W) contains probability information about how words are connected, and the pronunciation model P(M|W) represents which phonetic symbols a word is composed of.

The acoustic model P(X|M) models the probability of observing the actual feature vector X given the phonetic symbols.
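Equation 1 itself is not reproduced in this text. Based only on the three models named above, one plausible form of the decoding rule is sketched below. The hat notation, the arg max form, and the use of a single best pronunciation M instead of a sum over pronunciations are assumptions, not the patent's confirmed notation.

```latex
\hat{W} = \underset{W}{\arg\max}\; P(W \mid X)
        \approx \underset{W}{\arg\max}\; P(X \mid M)\, P(M \mid W)\, P(W)
```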

A speech recognition system generally uses a deep neural network (DNN) to compute the acoustic model. A deep neural network is characterized by having multiple hidden layers between the input layer and the output layer.

Each hidden layer of the deep neural network can be expressed as Equation 2 below.

[Equation 2]

y = W x_t + b,  z = σ(y)

That is, an affine transformation with W and b is applied to the input signal x_t received through the input layer to obtain y, and the nonlinear activation function σ is applied to y to obtain the result z. Here, W is the weight matrix and b is the bias term.
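The hidden-layer computation described above can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation: the function names, the choice of sigmoid as σ, and the numeric values of W, b, and x_t are all hypothetical.

```python
import math

def affine(W, b, x):
    """Return y = W x + b for a weight matrix W (list of rows) and bias vector b."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]

def sigmoid(y):
    """One commonly used fixed activation function."""
    return 1.0 / (1.0 + math.exp(-y))

def hidden_layer(W, b, x, activation=sigmoid):
    """Apply the affine transformation, then the activation element-wise."""
    y = affine(W, b, x)
    z = [activation(y_i) for y_i in y]
    return y, z

# Hypothetical 2-node layer fed with a 3-dimensional input frame x_t.
W = [[0.5, -1.0, 0.0],
     [1.0, 1.0, 1.0]]
b = [0.0, -3.0]
x_t = [2.0, 1.0, 4.0]
y, z = hidden_layer(W, b, x_t)
```

Here y holds the affine outputs of the two nodes and z the values passed on to the next layer.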

The nonlinear activation functions widely used in hidden layers are listed in Table 1 below.

[Table 1]

In the output layer, the output value of each node of the hidden layer is normalized to a probability value by the softmax operation, as in Equation 3 below.

[Equation 3]

softmax(y_j^L) = exp(y_j^L) / Σ_{k=1..N} exp(y_k^L)

That is, the output layer computes exp(y_j^L) for all N nodes of the L-th hidden layer and then normalizes each node's output by the sum Σ_{k=1..N} exp(y_k^L). Consequently, the deep neural network based acoustic model θ can be defined as in Equation 4 below.
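The softmax normalization can be illustrated with a short Python sketch. The function name and the sample inputs are hypothetical, and subtracting the maximum before exponentiating is a standard numerical safeguard that is not mentioned in the patent text.

```python
import math

def softmax(y):
    """Normalize outputs y_1..y_N into probabilities exp(y_j) / sum_k exp(y_k)."""
    m = max(y)  # subtracting the max leaves the result unchanged but avoids overflow
    exps = [math.exp(v - m) for v in y]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical outputs of three nodes.
probs = softmax([1.0, 2.0, 3.0])
```

The resulting values are positive, sum to one, and preserve the ordering of the inputs.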

[Equation 4]

θ = {W, b, σ}

That is, the deep neural network based acoustic model θ consists of the parameters W, b, and σ, where W is the weight matrix, b is the bias term, and σ is the nonlinear activation function.

In general, a deep neural network based acoustic model θ is trained by setting its parameters to arbitrary initial values and then applying the error back-propagation algorithm together with the stochastic gradient descent (SGD) algorithm.

In some cases, after the parameters are set to arbitrary initial values, a prior estimation step called pre-training is carried out before back-propagation and SGD are run.

The model parameter W can be trained with the SGD algorithm defined by Equation 5, and the model parameter b with the SGD algorithm defined by Equation 6.

[Equation 5]

[Equation 6]

In Equations 5 and 6, J is the cost function. Cross entropy is widely used for J and can be expressed as Equation 7 below.

[Equation 7]

Here, p(x) and q(x) are probability distributions. The cost function J measures the amount of information between the two distributions p(x) and q(x), that is, the amount of information needed to change from p(x) to q(x).
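Equations 5 to 7 are not reproduced in this text. The sketch below uses the standard textbook forms as an assumption: the SGD update param ← param − η · ∂J/∂param with a learning rate η, and the cross entropy J = −Σ p(x) log q(x). The toy task (training only the bias of a softmax output on one sample), the learning rate, and all names are hypothetical.

```python
import math

def softmax(y):
    m = max(y)
    exps = [math.exp(v - m) for v in y]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(p, q):
    """J = -sum_x p(x) * log q(x); terms with p(x) = 0 contribute nothing."""
    return -sum(p_x * math.log(q_x) for p_x, q_x in zip(p, q) if p_x > 0.0)

def sgd_step(params, grads, learning_rate):
    """One SGD update: param <- param - learning_rate * dJ/dparam."""
    return [w - learning_rate * g for w, g in zip(params, grads)]

# Toy case: train only the bias b of a softmax output layer on one sample.
# For softmax followed by cross entropy, dJ/db_j = q_j - p_j.
p = [0.0, 1.0, 0.0]   # one-hot target distribution
b = [0.0, 0.0, 0.0]   # hypothetical initial bias
losses = []
for _ in range(50):
    q = softmax(b)
    losses.append(cross_entropy(p, q))
    grad_b = [q_j - p_j for q_j, p_j in zip(q, p)]
    b = sgd_step(b, grad_b, learning_rate=0.5)
```

The recorded losses start at log 3 (a uniform prediction over three classes) and shrink with every update, which is the behavior the training procedure relies on.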

The problem is that, of the acoustic model θ defined by Equation 4, the parameters W and b can be trained with the error back-propagation algorithm, whereas the nonlinear activation function σ is in most cases a fixed function and therefore cannot be trained.

Moreover, even where the nonlinear activation function σ can be trained, its basic form is preserved, so it cannot be trained effectively.

The present invention was devised to solve the above problems of the prior art. Its object is to provide a deep neural network based speech recognition system in which the nonlinear activation function, a parameter of the deep neural network based acoustic model, is expressed in a trainable form so that the acoustic model can be trained effectively and performance can be improved.

To achieve this object, a deep neural network based speech recognition system according to one aspect of the present invention includes an input unit that receives various information; a storage unit that stores a speech processing algorithm; and a deep neural network based processing unit that performs speech recognition by applying the speech processing algorithm stored in the storage unit to a speech signal input through the input unit, wherein the processing unit uses an activation function expressed as a power series as the nonlinear activation function that is an acoustic model parameter.

The activation function expressed as a power series is defined by an expression in which N is the order of the power series and each term has two parameters: the coefficient and the bias of the n-th power-series term of the i-th node of the l-th layer.

The coefficients and the biases of the power series are trained through error back-propagation.

The initial values of the coefficients and of the biases of the power series are set using a Taylor series.

The processing unit uses, as the activation function sigmoid(x), a power series whose coefficients and biases have their initial values set using the Taylor series of sigmoid(x).

The processing unit uses, as the activation function tanh(x), a power series whose coefficients and biases have their initial values set using the Taylor series of tanh(x).

The processing unit uses, as the activation function ReLU(x), a power series whose coefficients and biases have their initial values set using the Taylor series of ReLU(x).

The system according to an embodiment of the present invention thus uses, as the nonlinear activation function that is a parameter of the deep neural network based acoustic model, a nonlinear activation function expressed in a trainable form.

Accordingly, when speech recognition is performed with the system according to an embodiment of the present invention, the deep neural network model can be trained faster and speech recognition performance can be improved.

FIG. 1 is a diagram illustrating the modeling of a deep neural network used in a deep neural network based speech recognition system according to an embodiment of the present invention.
FIG. 2 is a graph showing per-epoch training results for the activation function sigmoid(x), approximated by a Taylor series, used in the acoustic model of a deep neural network based speech recognition system according to an embodiment of the present invention.
FIG. 3 illustrates an example configuration of a deep neural network based speech recognition system according to an embodiment of the present invention.

The specific structural and functional descriptions of the embodiments disclosed herein are given only to illustrate embodiments of the present invention. The embodiments can be implemented in various forms and should not be construed as limited to those described herein.

The present invention can be modified in various ways and can take various forms, so specific embodiments are illustrated in the drawings and described in detail in the text. This is not intended to limit the present invention to a particular disclosed form; the invention should be understood to include all modifications, equivalents, and substitutes that fall within its spirit and technical scope.

Terms such as "first" and "second" may be used to describe various components, but the components are not limited by these terms. The terms are used only to distinguish one component from another. For example, without departing from the scope of the present invention, a first component may be called a second component, and similarly a second component may be called a first component.

When a component is said to be "connected" or "coupled" to another component, it may be directly connected or coupled to that component, but intervening components may also be present. In contrast, when a component is said to be "directly connected" or "directly coupled" to another component, no intervening component is present. Other expressions describing relationships between components, such as "between" and "directly between" or "adjacent to" and "directly adjacent to", should be interpreted in the same way.

The terms used in this application serve only to describe specific embodiments and are not intended to limit the present invention. A singular expression includes the plural unless the context clearly indicates otherwise. In this application, terms such as "include" or "have" specify the presence of the stated features, numbers, steps, operations, components, parts, or combinations thereof, and do not exclude in advance the presence or addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Unless otherwise defined, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which the present invention belongs. Terms defined in commonly used dictionaries should be interpreted as having meanings consistent with their meanings in the context of the related art, and are not to be interpreted in an idealized or excessively formal sense unless expressly so defined in this application.

Where an embodiment can be implemented differently, the functions or operations specified in a particular block may occur in an order different from that specified in the flowchart. For example, two consecutive blocks may in fact be performed substantially simultaneously, or the blocks may be performed in reverse order, depending on the functions or operations involved.

Before turning to the deep neural network based speech recognition system according to an embodiment of the present invention, the activation function, a parameter of the acoustic model used for speech recognition in that system, is described first.

The present invention aims to provide a deep neural network based speech recognition system in which the nonlinear activation function, a parameter of the deep neural network based acoustic model, is expressed in a more general parametric form. This makes the activation function itself trainable, so that the acoustic model can be trained effectively and performance can be improved.

To achieve this, two questions must be considered: in what form the activation function is expressed, and how its initial values are set.

The present invention therefore proposes expressing the activation function as a power series, as in Equation 8.

[Equation 8]

Here, N is the order of the power series, and the two per-term parameters of Equation 8 are the coefficient and the bias of the n-th power-series term of the i-th node of the l-th layer.
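Equation 8 itself is not reproduced in this text. One form consistent with the definitions above (an N-th order power series with a coefficient and a bias per term, for node i of layer l) is sketched below. The symbol names a and c, the placement of the indices, and the lower summation limit n = 0 are assumptions made for illustration.

```latex
\sigma\!\left(y_i^{l}\right) = \sum_{n=0}^{N} a_{i,n}^{l}\,\left(y_i^{l} - c_{i,n}^{l}\right)^{n}
```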

When the power series based activation function defined by Equation 8 is used, a single node of the deep neural network can be represented as in FIG. 1.

FIG. 1 is a diagram illustrating the modeling of a deep neural network used in a deep neural network based speech recognition system according to an embodiment of the present invention.

When the power series based activation function is used, the deep neural network based acoustic model θ can be expressed as Equation 9. That is, the activation function is represented by the trainable parameters A and C.

[Equation 9]

θ = {W, b, A, C}

Here, A is the set of power-series coefficients and C is the set of power-series biases, and both A and C can be trained with the error back-propagation algorithm.
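To make the trainability of the coefficients and biases concrete, the sketch below evaluates a power-series activation for one node and computes its partial derivatives with respect to each coefficient and each bias, which are the quantities back-propagation needs. It assumes the form σ(y) = Σ_n a[n] · (y − c[n])^n for n = 0..N; that form, the names a and c, and the sample values are assumptions, not the patent's confirmed notation. A finite-difference check is included to confirm the analytic derivatives.

```python
def power_series(y, a, c):
    """sigma(y) = sum_n a[n] * (y - c[n])**n for n = 0..N (assumed form)."""
    return sum(a_n * (y - c_n) ** n for n, (a_n, c_n) in enumerate(zip(a, c)))

def power_series_grads(y, a, c):
    """Analytic partial derivatives of sigma(y) with respect to each a[n] and c[n]."""
    d_a = [(y - c_n) ** n for n, c_n in enumerate(c)]
    d_c = [0.0 if n == 0 else -n * a_n * (y - c_n) ** (n - 1)
           for n, (a_n, c_n) in enumerate(zip(a, c))]
    return d_a, d_c

def numeric_grad(f, values, index, eps=1e-6):
    """Central finite difference of f with respect to values[index]."""
    up = list(values)
    up[index] += eps
    down = list(values)
    down[index] -= eps
    return (f(up) - f(down)) / (2 * eps)

# Hypothetical third-order node parameters and input.
a = [0.5, 0.25, 0.0, -1.0 / 48.0]
c = [0.0, 0.1, 0.0, -0.2]
y = 0.7
d_a, d_c = power_series_grads(y, a, c)
num_d_a = [numeric_grad(lambda v: power_series(y, v, c), a, n) for n in range(len(a))]
num_d_c = [numeric_grad(lambda v: power_series(y, a, v), c, n) for n in range(len(c))]
```

During training, each of these derivatives would be multiplied by the gradient of the cost with respect to the node output (the chain rule) and then used in the same kind of gradient update as W and b.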

The initial values of the parameters A and C can be set using the Taylor series of nonlinear functions. The Taylor series is defined as in Equation 10 below.

[Equation 10]
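Equation 10 is not reproduced in this text. The standard definition of the Taylor series of a function f about an expansion point a is given below for reference; the symbol a here is the conventional expansion point and is assumed to be unrelated to the coefficient set A.

```latex
f(x) = \sum_{n=0}^{\infty} \frac{f^{(n)}(a)}{n!}\,(x - a)^{n}
```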

Expressing the widely used nonlinear activation functions as Taylor series gives Equation 11 below.

[Equation 11]
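Equation 11 is not reproduced in this text. As an illustration of Taylor-series initialization, the sketch below uses the well-known expansions about 0, sigmoid(x) = 1/2 + x/4 − x³/48 + x⁵/480 − … and tanh(x) = x − x³/3 + 2x⁵/15 − …, truncated at order 5. Setting every bias to the expansion point 0, the truncation order, and all names are assumptions made for this sketch.

```python
import math

# Taylor coefficients about 0; list index n is the power of x (truncated at order 5).
SIGMOID_TAYLOR = [0.5, 1 / 4, 0.0, -1 / 48, 0.0, 1 / 480]
TANH_TAYLOR = [0.0, 1.0, 0.0, -1 / 3, 0.0, 2 / 15]

def init_from_taylor(coeffs):
    """Initial (a, c): coefficients from the Taylor series, all biases at the expansion point 0."""
    return list(coeffs), [0.0] * len(coeffs)

def power_series(y, a, c):
    """sigma(y) = sum_n a[n] * (y - c[n])**n (assumed form of the trainable activation)."""
    return sum(a_n * (y - c_n) ** n for n, (a_n, c_n) in enumerate(zip(a, c)))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def max_abs_error(reference, a, c, xs):
    """Largest gap between the reference function and the initialized power series."""
    return max(abs(reference(x) - power_series(x, a, c)) for x in xs)

a_sig, c_sig = init_from_taylor(SIGMOID_TAYLOR)
a_tanh, c_tanh = init_from_taylor(TANH_TAYLOR)

xs = [i / 10 for i in range(-10, 11)]   # inputs in [-1, 1]
err_sigmoid = max_abs_error(sigmoid, a_sig, c_sig, xs)
err_tanh = max_abs_error(math.tanh, a_tanh, c_tanh, xs)
```

With this initialization the power series starts out close to the familiar activation near the expansion point, and training is then free to move the coefficients and biases away from that shape.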

FIG. 2 is a graph showing per-epoch training results for the activation function sigmoid(x), approximated by a Taylor series, used in the acoustic model of a deep neural network based speech recognition system according to an embodiment of the present invention.

The activation function used in the speech recognition system proposed by the present invention has been described above. The following describes a speech recognition system that uses an acoustic model expressed with this trainable activation function.

FIG. 3 illustrates an example configuration of a deep neural network based speech recognition system according to an embodiment of the present invention.

As shown in FIG. 3, the deep neural network based speech recognition system (300, hereinafter the "system") according to an embodiment of the present invention may include an input unit (310) that receives various information, a storage unit (330) that stores various programs and information, a processing unit (350) that processes the information input through the input unit (310) using the programs, and an output unit (370) that outputs the results processed by the processing unit (350). The processing unit (350) may consist of at least one processor.

For example, the input unit (310) may receive speech input, the storage unit (330) may store the signal processing algorithm executed by the processing unit (350), and the output unit (370) may display the speech processing result produced by the processing unit (350).

In particular, the processing unit (350) extracts feature parameters for speech recognition from the input speech signal and performs speech recognition using the extracted parameters.

For speech recognition, the processing unit (350) uses a specific signal processing algorithm expressed on the basis of a deep neural network. As the nonlinear activation function, which is one of the acoustic model parameters, it uses one expressed in a trainable form, as described with reference to FIGS. 1 and 2.

That is, the processing unit (350) uses a power series based function as the nonlinear activation function that is one of the acoustic model parameters.

The processing unit (350) may be configured to include an input layer, a hidden layer, and an output layer, and has a feed-forward neural network structure. Each layer consists of multiple nodes that process the values input to them. The output value of a node is the output of that node's activation function, and the input to the activation function is the weighted sum of the outputs of all nodes connected to that node.
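The feed-forward structure described above can be sketched as follows. This is a minimal illustration under stated assumptions: the power-series form σ(y) = Σ_n a[n] · (y − c[n])^n, the softmax output layer, the tiny network shape, and every numeric value are hypothetical and chosen only to keep the example short.

```python
import math

def power_series(y, a, c):
    """Assumed trainable activation: sum_n a[n] * (y - c[n])**n."""
    return sum(a_n * (y - c_n) ** n for n, (a_n, c_n) in enumerate(zip(a, c)))

def softmax(y):
    m = max(y)
    exps = [math.exp(v - m) for v in y]
    s = sum(exps)
    return [e / s for e in exps]

def weighted_sums(W, b, z):
    """Input to each node's activation: weighted sum of the connected nodes plus bias."""
    return [sum(w * v for w, v in zip(row, z)) + b_i for row, b_i in zip(W, b)]

def forward(x, hidden_layers, output_layer):
    """Feed-forward pass: each hidden node applies its own power-series activation
    to its weighted sum; the output layer normalizes with softmax."""
    z = x
    for W, b, A, C in hidden_layers:
        y = weighted_sums(W, b, z)
        z = [power_series(y_i, a_i, c_i) for y_i, a_i, c_i in zip(y, A, C)]
    W_out, b_out = output_layer
    return softmax(weighted_sums(W_out, b_out, z))

# Hypothetical tiny network: 2 inputs -> 2 hidden nodes -> 3 output classes.
sig_a = [0.5, 0.25, 0.0, -1 / 48]   # sigmoid Taylor coefficients up to order 3
zero_c = [0.0, 0.0, 0.0, 0.0]
hidden = [([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0], [sig_a, sig_a], [zero_c, zero_c])]
output = ([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0.0, 0.0, 0.0])
posteriors = forward([0.2, 0.2], hidden, output)
```

Each hidden node carries its own coefficient and bias lists, so the activation shape can differ from node to node once training has updated them.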

Although the deep neural network based speech recognition system according to the present invention has been described with reference to embodiments, the scope of the present invention is not limited to any specific embodiment, and various alternatives, modifications, and changes may be made within a range obvious to a person of ordinary skill in the relevant art.

Accordingly, the embodiments and accompanying drawings described herein are intended to explain, not to limit, the technical idea of the present invention, and the scope of that technical idea is not limited by them. The scope of protection of the present invention should be interpreted according to the claims, and all technical ideas within an equivalent scope should be construed as falling within the scope of the rights of the present invention.

300: speech recognition system
310: input unit
330: storage unit
350: processing unit
370: output unit

Claims

1. A deep neural network based speech recognition system comprising:
an input unit that receives various information;
a storage unit that stores a speech processing algorithm; and
a deep neural network based processing unit that performs speech recognition by applying the speech processing algorithm stored in the storage unit to a speech signal input through the input unit,
wherein the processing unit uses an activation function expressed as a power series as the nonlinear activation function that is an acoustic model parameter, and
wherein the activation function expressed as a power series is defined by an expression in which N is the order of the power series and each term has two parameters: the coefficient and the bias of the n-th power-series term of the i-th node of the l-th layer.

2. (Deleted)

3. The deep neural network based speech recognition system of claim 1, wherein the coefficients and the biases of the power series are trained through error back-propagation.

4. The deep neural network based speech recognition system of claim 1, wherein the initial values of the coefficients and of the biases of the power series are set using a Taylor series.

5. The deep neural network based speech recognition system of claim 4, wherein the processing unit uses, as the activation function sigmoid(x), a power series whose coefficients and biases have their initial values set using the Taylor series of sigmoid(x).

6. The deep neural network based speech recognition system of claim 4, wherein the processing unit uses, as the activation function tanh(x), a power series whose coefficients and biases have their initial values set using the Taylor series of tanh(x).

7. The deep neural network based speech recognition system of claim 4, wherein the processing unit uses, as the activation function ReLU(x), a power series whose coefficients and biases have their initial values set using the Taylor series of ReLU(x).