KR102218046B1

KR102218046B1 - Deep neural network-based state determination apparatus and method for speech recognition acoustic models

Info

Publication number: KR102218046B1
Application number: KR1020170160967A
Authority: KR
Inventors: 강병옥; 박전규; 이윤근
Original assignee: 한국전자통신연구원
Priority date: 2017-11-28
Filing date: 2017-11-28
Publication date: 2021-02-22
Also published as: KR20190062008A

Abstract

The present invention relates to a deep neural network-based state determination method for an acoustic model for speech recognition. The method includes: a state determination step of determining the state of speech data using the connection coefficients and bias to the output node used in a neural network model for training the speech data; and a learning step of learning the speech data using the state determination input value determined in the state determination step together with the connection coefficients and bias to the output node used in the neural network model.

Description

Deep neural network-based state determination apparatus and method for speech recognition acoustic models

The present invention relates to a deep neural network-based state determination apparatus and method for an acoustic model for speech recognition, and more particularly to such an apparatus and method in which the states are determined within the deep neural network learning stage, so that state determination and deep neural network learning can be optimized together.

Most speech recognition systems have now entered the commercialization stage and are provided as services in a variety of fields.

Such a speech recognition system includes a state determination step, which decides how speech input data is subdivided into context-dependent phoneme units (the word vocabulary is represented as phoneme sequences), and a deep neural network learning step, which trains a deep neural network on pairs of state-determined data and state labels.

However, in a conventional speech recognition system the state determination step determines the targets for the speech data at the Gaussian mixture model-hidden Markov model (GMM-HMM) stage, whereas the deep neural network learning step uses a DNN model.

Because the model used for state determination and the parameters used in the deep neural network learning step differ, the state determination step and the deep neural network learning step cannot be optimized together.

The present invention was conceived to solve this problem. Its object is to provide a deep neural network-based state determination apparatus and method for an acoustic model for speech recognition that can optimize the acoustic model by applying the model used in the deep neural network learning step to the state determination step.

The objects of the present invention are not limited to the one mentioned above, and other objects not mentioned here will be clearly understood by those skilled in the art from the description below.

To achieve the above object, a deep neural network-based state determination apparatus for an acoustic model for speech recognition according to an embodiment of the present invention includes: a state determination module that determines the state of speech data using the connection coefficients and bias to the output node used in a neural network model for training the speech data; and a learning module that learns the speech data using the state determination input value determined by the state determination module together with the connection coefficients and bias to the output node used in the neural network model.

The state determination module includes: a separation unit that separates phonemes using a triphone-based Gaussian Mixture Model-Hidden Markov Model (GMM-HMM) trained on the entire set of training speech data; an alignment unit that aligns the phoneme-separated speech data; a learning unit that learns the speech data from the aligned data/label pairs using the connection coefficients and bias to the output node used in a shallow neural network model; a stabilization unit that stabilizes the trained shallow neural network model; a state determination unit that performs state determination using the connection coefficients and bias to the output node used for speech data training; and a training unit that uses the determined states to train on the speech data.

The state determination module may also generate a decision tree and determine the final states for the triphones, which are subdivided on the basis of context information, through a tree structure that branches according to the maximum log-likelihood criterion.

A deep neural network-based state determination method for an acoustic model for speech recognition according to an embodiment of the present invention includes: a state determination step of determining the state of speech data using the connection coefficients and bias to the output node used in a neural network model for training the speech data; and a learning step of learning the speech data using the state determination input value determined in the state determination step together with the connection coefficients and bias to the output node used in the neural network model.

The state determination step includes: a separation step of separating phonemes using a triphone-based Gaussian Mixture Model-Hidden Markov Model (GMM-HMM) trained on the entire set of training speech data; an alignment step of aligning the phoneme-separated speech data; a learning step of learning the speech data from the aligned data/label pairs using the connection coefficients and bias to the output node used in a shallow neural network model; a stabilization step of stabilizing the trained shallow neural network model; a state determination step of performing state determination using the connection coefficients and bias to the output node used for speech data training; and a training step of using the determined states to train on the speech data.

The state determination step may generate a decision tree and determine the final states for the triphones, which are subdivided on the basis of context information, through a tree structure that branches according to the maximum log-likelihood criterion.

Preferably, the state determination step of performing the state determination arrives at the final state determination by cycling through the alignment step of aligning the phoneme-separated speech data, the learning step of learning the speech data from the aligned data/label pairs using the connection coefficients and bias to the output node used in the shallow neural network model, and the stabilization step of stabilizing the trained shallow neural network model.

According to an embodiment of the present invention, the conventional structure in which state determination and deep neural network learning are separate is improved: using a deep neural network-based state determination method for acoustic model training improves speech recognition performance.

FIG. 1 is a functional block diagram of a deep neural network-based state determination apparatus for an acoustic model for speech recognition according to an embodiment of the present invention.
FIG. 2 is a functional block diagram of the state determination module employed in an embodiment of the present invention.
FIG. 3 is a flowchart of a deep neural network-based state determination method for an acoustic model for speech recognition according to an embodiment of the present invention.
FIG. 4 is a flowchart of the state determination step employed in an embodiment of the present invention.
FIG. 5 is a flowchart of the state determination process within the state determination step employed in an embodiment of the present invention.

The advantages and features of the present invention, and the methods of achieving them, will become apparent from the embodiments described in detail below together with the accompanying drawings. The present invention is not, however, limited to the embodiments disclosed below and may be implemented in various different forms. These embodiments are provided only so that the disclosure of the present invention is complete and so that those of ordinary skill in the art to which the invention pertains are fully informed of its scope; the invention is defined only by the scope of the claims. The terms used in this specification are for describing the embodiments and are not intended to limit the invention. In this specification, the singular form includes the plural unless the context specifically states otherwise. As used in the specification, "comprises" and/or "comprising" do not exclude the presence or addition of one or more components, steps, operations, and/or elements other than those recited.

Hereinafter, preferred embodiments of the present invention are described in detail with reference to the accompanying drawings. FIG. 1 is a functional block diagram of a deep neural network-based state determination apparatus for an acoustic model for speech recognition according to an embodiment of the present invention.

As shown in FIG. 1, the deep neural network-based state determination apparatus according to an embodiment of the present invention preferably includes a state determination module (100) and a learning module (200).

The state determination module (100) determines the state of speech data using the connection coefficients and bias to the output node used in a neural network model for training the speech data.

As shown in FIG. 2, the state determination module (100) employed in an embodiment of the present invention includes a separation unit (110), an alignment unit (120), a learning unit (130), a stabilization unit (140), a state determination unit (150), and a training unit (160).

The separation unit (110) separates phonemes using a triphone-based Gaussian Mixture Model-Hidden Markov Model (GMM-HMM) trained on the entire set of training speech data.

The alignment unit (120) aligns the phoneme-separated speech data.

The learning unit (130) learns the speech data from the aligned data/label pairs through the connection coefficients and bias to the output node used in a shallow neural network model.

The stabilization unit (140) stabilizes the trained shallow neural network model.

The state determination unit (150) performs state determination using the connection coefficients and bias to the output node used for speech data training.

The training unit (160) uses the determined states to train on the speech data.

Accordingly, because the state determination module employed in an embodiment of the present invention uses the parameters of DNN learning for state determination as well, the state determination process and neural network learning are matched, which improves speech recognition performance.

The state determination module (100) also generates a decision tree and determines the final states for the triphones, which are subdivided on the basis of context information, through a tree structure that branches according to the maximum log-likelihood criterion.

The learning module (200) learns the speech data using the state determination input value determined by the state determination module (100) together with the connection coefficients and bias to the output node used in the neural network model.

According to the learning module (200) employed in an embodiment of the present invention, the parameters used for state determination are the same as those used for DNN learning. This improves on the conventional structure in which state determination and deep neural network learning are separate, and using the deep neural network-based state determination method for acoustic model training improves speech recognition performance.

A deep neural network-based state determination method for an acoustic model for speech recognition according to an embodiment of the present invention is described below with reference to FIG. 3.

First, the state of the speech data is determined using the connection coefficients and bias to the output node used in the neural network model for training the speech data (S100).

Next, the speech data is learned using the state determination input value determined in the state determination step (S100) together with the connection coefficients and bias to the output node used in the neural network model (S200).

According to an embodiment of the present invention, the parameters used for state determination are the same as those used for DNN learning. This improves on the conventional structure in which state determination and deep neural network learning are separate, and using the deep neural network-based state determination method for acoustic model training improves speech recognition performance.

The state determination step (S100) employed in an embodiment of the present invention is described in detail below with reference to FIG. 4.

First, the separation unit (110) separates phonemes using a triphone-based Gaussian Mixture Model-Hidden Markov Model (GMM-HMM) trained on the entire set of training speech data (S110). For example, when speech data for the syllable "hak" (학) of "hakgyo" (학교, "school") is input, the triphone-based GMM-HMM separates the speech data into phoneme units such as "h-a-g". This is needed because a separate model cannot be provided for every context through the state determination step, so models with similar contexts must be tied together.

Next, the alignment unit (120) aligns the phoneme-separated speech data (S120). That is, to construct the training data, the input is aligned so that it is known which portion of the data corresponds to the phoneme 'h', which portion to 'a', and which portion to 'g'.
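The outcome of the alignment step (S120) can be pictured as a per-frame phoneme label sequence collapsed into segments. The following is a minimal sketch under that assumption; the function name `frames_to_segments` and the frame counts are hypothetical and are not taken from the patent.

```python
from itertools import groupby


def frames_to_segments(frame_labels):
    """Collapse per-frame phoneme labels into (label, start, end) segments.

    `start` is inclusive and `end` is exclusive, both in frame indices.
    """
    segments = []
    start = 0
    for label, run in groupby(frame_labels):
        length = sum(1 for _ in run)  # number of consecutive frames with this label
        segments.append((label, start, start + length))
        start += length
    return segments


# Hypothetical frame-level alignment for the syllable "hak" (frame counts are made up).
alignment = ["h"] * 3 + ["a"] * 5 + ["g"] * 2
segments = frames_to_segments(alignment)
```

Each segment then supplies the data/label pairs (frames paired with their phoneme or state label) that the learning step (S130) consumes.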

Thereafter, the learning unit (130) learns the speech data from the aligned data/label pairs using the connection coefficients and bias to the output node used in the shallow neural network model (S130).

Next, the stabilization unit (140) stabilizes the trained shallow neural network model (S140). For example, the shallow neural network model may be stabilized by having the stabilization unit (140) assign the separated phonemes according to their context through a decision tree.

Thereafter, the state determination unit (150) performs state determination using the connection coefficients and bias to the output node used for speech data training (S150).

Next, the training unit (160) uses the determined states to train on the speech data (S160). The training unit (160) trains on the input speech data through the deep neural network.

In the state determination step (S150), a decision tree may be generated and the final states may be determined for the triphones, which are subdivided on the basis of context information, through a tree structure that branches according to the maximum log-likelihood criterion.

The state determination step (S150) employed in an embodiment of the present invention is now described in more detail.

Using the connection coefficients and bias to the output node used in the neural network model for training speech data, the state determination unit (150) sets up the softmax layer model, which computes the posterior probability of each state corresponding to an output node for the input data O according to [Equation 1] (S151).

[Equation 1]

Here, the quantity given by [Equation 1] is the posterior probability of each state corresponding to an output node for the input data O, Z is the input to the output node, C is the total number of states, the weight and bias terms are the connection coefficients and bias to the output node, and the remaining term is the value fed into the output node.
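[Equation 1] itself is not reproduced in this text. As a rough sketch, assuming it is the standard softmax P(s | O) = exp(z_s) / sum over c = 1..C of exp(z_c), with output-node input z_s = w_s . v + b_s, the posterior of each state could be computed as below. The names `state_posteriors`, `v`, `weights`, and `biases` and all numeric values are illustrative assumptions.

```python
import math


def state_posteriors(v, weights, biases):
    """Softmax posteriors over C states.

    v       : value fed into the output nodes (e.g. the last hidden layer output)
    weights : one connection-coefficient vector w_s per state
    biases  : one bias b_s per state
    """
    # Output-node inputs z_s = w_s . v + b_s
    z = [sum(w_i * v_i for w_i, v_i in zip(w, v)) + b
         for w, b in zip(weights, biases)]
    z_max = max(z)  # subtracting the maximum keeps exp() numerically stable
    exps = [math.exp(z_s - z_max) for z_s in z]
    total = sum(exps)
    return [e / total for e in exps]


v = [0.5, -1.0]
weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # C = 3 states
biases = [0.0, 0.5, -0.5]
posteriors = state_posteriors(v, weights, biases)
```

Subtracting the largest z_s before exponentiating does not change the result, because the common factor cancels in the normalization.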

The state determination unit (150) sets up the log-linear model, which computes an output for an input according to [Equation 2] (S152).

[Equation 2]

Here, Z is the normalization value and x is the input data. The coefficients of the model are the coefficient of the constant term, the coefficient of the first-order term, and the coefficient of the second-order term; R is the set of real numbers, and D is the dimension of the observed data.
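[Equation 2] is not reproduced in this text. A commonly used log-linear form that matches the definitions above (a normalization value, a constant coefficient, a first-order coefficient, and a second-order coefficient) is sketched below; the symbol names are assumptions chosen for illustration, with s indexing the state.

```latex
p(s \mid x) = \frac{1}{Z(x)}
  \exp\!\left(\lambda_{s0} + \lambda_{s1}^{\top} x + x^{\top} \lambda_{s2}\, x\right),
\qquad
Z(x) = \sum_{s'} \exp\!\left(\lambda_{s'0} + \lambda_{s'1}^{\top} x + x^{\top} \lambda_{s'2}\, x\right),
\qquad
\lambda_{s0} \in \mathbb{R},\quad
\lambda_{s1} \in \mathbb{R}^{D},\quad
\lambda_{s2} \in \mathbb{R}^{D \times D}
```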

The state determination unit (150) sets up the single Gaussian model, which determines the state S for an input according to [Equation 3] (S153).

[Equation 3]

Here, the parameters are the mean and the covariance matrix, N denotes the Gaussian distribution, and D is the dimension of the observed data.
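[Equation 3] is not reproduced in this text. For reference, the standard D-dimensional Gaussian density that matches the definitions above is shown below, assuming a per-state mean and covariance matrix; the symbol names are illustrative.

```latex
p(x \mid s) = \mathcal{N}(x;\, \mu_s, \Sigma_s)
= \frac{1}{(2\pi)^{D/2}\, |\Sigma_s|^{1/2}}
  \exp\!\left(-\tfrac{1}{2}\,(x - \mu_s)^{\top} \Sigma_s^{-1} (x - \mu_s)\right)
```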

The state determination unit (150) converts the single Gaussian model into the log-linear model using [Equation 4], so that each parameter of [Equation 3] can be converted into the corresponding coefficient of [Equation 2] (S154).

[Equation 4]

Here, the symbols denote the coefficient of the second-order term, the coefficient of the first-order term, the mean, and the coefficient of the constant term, and P(S) is the prior probability of state S.
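[Equation 4] is not reproduced in this text. The sketch below uses the standard Gaussian-to-log-linear identities (second-order term = -1/2 times the inverse covariance, first-order term = inverse covariance times the mean, constant term absorbing the normalizer and log P(S)) and checks them numerically. It assumes a diagonal covariance to stay dependency-free; all function names and numeric values are illustrative assumptions.

```python
import math


def gaussian_to_loglinear(mean, var, prior):
    """Map N(mean, diag(var)) with state prior P(S) to log-linear coefficients."""
    lam2 = [-0.5 / v for v in var]             # second-order term: -1/2 * inverse covariance
    lam1 = [m / v for m, v in zip(mean, var)]  # first-order term: inverse covariance * mean
    lam0 = (-0.5 * sum(m * m / v + math.log(2.0 * math.pi * v)
                       for m, v in zip(mean, var))
            + math.log(prior))                 # constant term includes log P(S)
    return lam0, lam1, lam2


def loglinear_score(x, lam0, lam1, lam2):
    """Unnormalized log score: lam0 + lam1 . x + sum_i lam2_i * x_i^2."""
    return lam0 + sum(l1 * xi + l2 * xi * xi for l1, l2, xi in zip(lam1, lam2, x))


def log_gaussian(x, mean, var):
    """Log density of a diagonal-covariance Gaussian."""
    return -0.5 * sum(math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
                      for xi, m, v in zip(x, mean, var))


mean, var, prior = [1.0, -2.0], [0.5, 2.0], 0.25
x = [0.3, 0.7]
lam0, lam1, lam2 = gaussian_to_loglinear(mean, var, prior)
score = loglinear_score(x, lam0, lam1, lam2)
reference = log_gaussian(x, mean, var) + math.log(prior)
```

The log-linear score equals log N(x; mean, var) + log P(S) up to floating-point rounding, which is the sense in which the two models are interchangeable.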

Conversely, the state determination unit (150) converts the log-linear model into the single Gaussian model using [Equation 5] (S155).

[Equation 5]

Here, the symbols denote the coefficient of the second-order term, the mean, and the coefficient of the first-order term.
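[Equation 5] is not reproduced in this text. Assuming it is the inverse of the standard Gaussian-to-log-linear mapping (covariance = -1/2 times the inverse of the second-order coefficient, mean = covariance times the first-order coefficient), a diagonal-covariance sketch looks like this. The function name and values are illustrative assumptions.

```python
def loglinear_to_gaussian(lam1, lam2_diag):
    """Recover (mean, variance) of a diagonal Gaussian from log-linear coefficients.

    lam1      : first-order coefficients
    lam2_diag : diagonal of the second-order coefficient matrix (must be negative)
    """
    if any(l2 >= 0.0 for l2 in lam2_diag):
        raise ValueError("second-order coefficients must be negative for a valid Gaussian")
    var = [-1.0 / (2.0 * l2) for l2 in lam2_diag]  # covariance = -1/2 * inverse of lam2
    mean = [v * l1 for v, l1 in zip(var, lam1)]    # mean = covariance * lam1
    return mean, var


mean, var = loglinear_to_gaussian(lam1=[0.5, 4.0], lam2_diag=[-0.25, -1.0])
```

The sign check matters: a non-negative second-order coefficient does not correspond to any Gaussian with finite positive variance.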

The state determination unit (150) converts the softmax layer model into the log-linear model using [Equation 6] (S156).

[Equation 6]

Here, the symbols denote the coefficient of the constant term, the coefficient of the first-order term, the connection coefficients and bias to the output node, the input value x, the value fed into the output node, and the coefficient of the second-order term.
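[Equation 6] is not reproduced in this text. One way to read the conversion, offered here as an assumption rather than as the patent's exact formula, is that the softmax input is already linear in x, so the bias plays the role of the constant coefficient and the connection coefficients play the role of the first-order coefficient. Any second-order term that is identical for every state cancels in the normalization and therefore leaves the posterior unchanged:

```latex
z_s = w_s^{\top} x + b_s
\;\Longrightarrow\;
\lambda_{s0} = b_s,\qquad
\lambda_{s1} = w_s,\qquad
\lambda_{s2} = \Lambda \quad \text{(the same matrix for every state } s\text{)}
```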

The state determination unit (150) derives the log-linear model from the softmax layer model using [Equation 7] (S157).

[Equation 7]

Here, the symbols denote the covariance matrix, the mean, the connection coefficients and bias to the output node, the input value x, and the value fed into the output node.
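[Equation 7] is not reproduced in this text. The sketch below illustrates one standard way a softmax layer can be re-expressed with per-state Gaussians: it assumes a covariance shared by all states (the identity matrix here, purely for simplicity), takes each state's mean to be its weight vector, and folds the remaining terms into an unnormalized log prior. It then checks that both forms give the same posteriors. All names and values are illustrative assumptions.

```python
import math


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_to_gaussians(weights, biases):
    """Assuming a shared identity covariance: mean_s = w_s and
    log-prior_s (up to a constant shared by all states) = b_s + 0.5 * ||w_s||^2."""
    means = [list(w) for w in weights]
    log_priors = [b + 0.5 * sum(wi * wi for wi in w)
                  for w, b in zip(weights, biases)]
    return means, log_priors


def gaussian_posteriors(x, means, log_priors):
    """Posterior over states from identity-covariance Gaussians.

    Terms shared by all states (such as the (2*pi)^(D/2) normalizer) are omitted
    because they cancel when the scores are normalized.
    """
    scores = [lp - 0.5 * sum((xi - mi) ** 2 for xi, mi in zip(x, m))
              for m, lp in zip(means, log_priors)]
    return softmax(scores)


weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
biases = [0.0, 0.5, -0.5]
x = [0.5, -1.0]

direct = softmax([sum(wi * xi for wi, xi in zip(w, x)) + b
                  for w, b in zip(weights, biases)])
means, log_priors = softmax_to_gaussians(weights, biases)
via_gaussians = gaussian_posteriors(x, means, log_priors)
```

Expanding the squared distance shows why this works: the per-state score reduces to w_s . x + b_s minus 0.5 * ||x||^2, and the last term is the same for every state.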

In this way, the state determination unit (150) converts the softmax layer into a single Gaussian model that takes the final hidden layer output value V^L as its input (S158).

With the softmax layer expressed as a single Gaussian model, a decision tree is generated from the input data, namely the final hidden layer output value V^L, according to the existing maximum log-likelihood criterion (S159).
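The maximum log-likelihood criterion for growing such a tree can be sketched as follows: each candidate context question splits the pooled samples into two groups, each group is modeled by one Gaussian, and the question with the largest log-likelihood gain is chosen. The sketch uses scalar samples in place of V^L vectors for brevity, and the triphone names (written left-center+right), questions, and values are all hypothetical.

```python
import math


def gaussian_loglik(samples, var_floor=1e-6):
    """Log-likelihood of samples under their own maximum-likelihood 1-D Gaussian.

    The variance floor guards against log(0) when all samples coincide.
    """
    n = len(samples)
    mean = sum(samples) / n
    var = max(sum((x - mean) ** 2 for x in samples) / n, var_floor)
    return -0.5 * n * (math.log(2.0 * math.pi * var) + 1.0)


def split_gain(data, question):
    """Log-likelihood gain from splitting the pooled samples by a yes/no question."""
    yes = [x for tri, xs in data.items() if question(tri) for x in xs]
    no = [x for tri, xs in data.items() if not question(tri) for x in xs]
    if not yes or not no:
        return float("-inf")  # a question that separates nothing is never chosen
    return gaussian_loglik(yes) + gaussian_loglik(no) - gaussian_loglik(yes + no)


def left_context(triphone):
    return triphone.split("-")[0]


# Hypothetical statistics: samples observed for three triphones of center phone 'k'.
data = {"a-k+i": [1.0, 1.2], "o-k+i": [0.9, 1.1], "s-k+i": [5.0, 5.2]}
questions = {
    "L_vowel": lambda tri: left_context(tri) in {"a", "o"},
    "L_is_a": lambda tri: left_context(tri) == "a",
}
gains = {name: split_gain(data, q) for name, q in questions.items()}
best = max(gains, key=gains.get)
```

Here the question about a vowel in the left context wins, because it groups the two triphones whose samples are close together; repeating the procedure on each branch until the gain falls below a threshold yields the tied states.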

Preferably, the state determination step (S150) of performing the state determination arrives at the final state determination by cycling through the alignment step (S151) of aligning the phoneme-separated speech data, the learning step (S152) of learning the speech data from the aligned data/label pairs using the connection coefficients and bias to the output node used in the shallow neural network model, and the stabilization step (S153) of stabilizing the trained shallow neural network model.

The configuration of the present invention has been described in detail above with reference to the accompanying drawings, but this is only an example, and those of ordinary skill in the art to which the present invention pertains can of course make various modifications and changes within the scope of its technical idea. The scope of protection of the present invention should therefore not be limited to the embodiments described above but should be determined by the claims that follow.

100: state determination module 110: separation unit
120: alignment unit 130: learning unit
140: stabilization unit 150: state determination unit
160: training unit 200: learning module

Claims

A deep neural network-based state determination method for an acoustic model for speech recognition, the method comprising:
a state determination step in which a state determination module determines the state of speech data using the connection coefficients and bias to the output node used in a neural network model for training the speech data; and
a learning step in which a learning module learns the speech data using the state determination input value determined in the state determination step together with the connection coefficients and bias to the output node used in the neural network model.

The method of claim 1,
wherein the state determination step comprises:
a separation step of separating phonemes using a triphone-based Gaussian Mixture Model-Hidden Markov Model (GMM-HMM) trained on the entire set of training speech data;
an alignment step of aligning the phoneme-separated speech data;
a learning step of learning the speech data from the aligned data/label pairs using the connection coefficients and bias to the output node used in a shallow neural network model;
a stabilization step of stabilizing the trained shallow neural network model;
a state determination step of performing state determination using the connection coefficients and bias to the output node used for speech data training; and
a training step of using the determined states to train on the speech data.

The method of claim 2,
wherein the state determination step
generates a decision tree and determines the final states for the triphones, which are subdivided on the basis of context information, through a tree structure that branches according to the maximum log-likelihood criterion.

The method of claim 2,
wherein the state determination step comprises:
a softmax layer modeling step of modeling the posterior probability of each state corresponding to an output node for the input data (O), using the connection coefficients and bias to the output node used in the neural network model for training speech data;
a log-linear modeling step of modeling a log-linear model that computes an output for an input;
a single Gaussian modeling step of modeling the determination of a state S for an input;
converting the single Gaussian model into the log-linear model;
converting the log-linear model into the single Gaussian model;
converting the softmax layer model into a single Gaussian model;
deriving a single Gaussian model from the softmax layer model;
converting the softmax layer into a single Gaussian model that takes the final hidden layer output value V^L as its input; and
generating, with the softmax layer expressed as the single Gaussian model, a decision tree from the input data, namely the final hidden layer output value V^L, according to the existing maximum log-likelihood criterion.

The method of claim 2,
wherein the state determination step of performing the state determination
arrives at the final state determination by cycling through the alignment step of aligning the phoneme-separated speech data, the learning step of learning the speech data from the aligned data/label pairs using the connection coefficients and bias to the output node used in the shallow neural network model, and the stabilization step of stabilizing the trained shallow neural network model.

A deep neural network-based state determination apparatus for an acoustic model for speech recognition, the apparatus comprising:
a state determination module that determines the state of speech data using the connection coefficients and bias to the output node used in a neural network model for training the speech data; and
a learning module that learns the speech data using the state determination input value determined by the state determination module together with the connection coefficients and bias to the output node used in the neural network model.

The apparatus of claim 6,
wherein the state determination module comprises:
a separation unit that separates phonemes using a triphone-based Gaussian Mixture Model-Hidden Markov Model (GMM-HMM) trained on the entire set of training speech data;
an alignment unit that aligns the phoneme-separated speech data;
a learning unit that learns the speech data from the aligned data/label pairs using the connection coefficients and bias to the output node used in a shallow neural network model;
a stabilization unit that stabilizes the trained shallow neural network model;
a state determination unit that performs state determination using the connection coefficients and bias to the output node used for speech data training; and
a training unit that uses the determined states to train on the speech data.

The apparatus of claim 7,
wherein the state determination module
generates a decision tree and determines the final states for the triphones, which are subdivided on the basis of context information, through a tree structure that branches according to the maximum log-likelihood criterion.