KR102094935B1

KR102094935B1 - System and method for recognizing speech

Info

Publication number: KR102094935B1
Application number: KR1020170047408A
Authority: KR
Inventors: 김동현; 이영직; 김상훈; 김승희; 이민규; 최무열
Original assignee: 한국전자통신연구원
Priority date: 2016-09-09
Filing date: 2017-04-12
Publication date: 2020-03-30
Also published as: KR20180028893A

Abstract

The speech recognition method with automatic phoneme generation according to the present invention includes: performing unsupervised learning on feature vectors of speech data; generating a phoneme set by clustering acoustic characteristics selected on the basis of the unsupervised learning result; assigning a phoneme sequence to the speech data on the basis of the generated phoneme set; and generating an acoustic model on the basis of the speech data to which the phoneme sequence has been assigned and the phoneme sequence.

Description

Speech recognition system and method {SYSTEM AND METHOD FOR RECOGNIZING SPEECH}

The present invention relates to a speech recognition system and method, and more particularly to a speech recognition system and method capable of generating phonemes automatically.

One phoneme clustering method in the prior art is the method proposed by B. Mak, which uses the Bhattacharyya distance (B. Mak, E. Barnard, “Phone clustering using the Bhattacharyya distance”, in Proc. ICSLP, 1996). It uses the Bhattacharyya distance as a means of distinguishing the acoustic characteristics of the individual phonemes and grouping phonemes that are acoustically close. However, this method has the problem that it relies on a phoneme set determined in advance from expert knowledge.

That is, in the prior art, acoustic modeling is performed using a phoneme set determined in advance from the knowledge of phonetics experts, and a pronunciation dictionary is generated by applying linguistic pronunciation rules.

However, this approach does not match natural continuous speech; in particular, it has difficulty handling the full range of variation such as contracted and omitted sounds and variable pronunciations. In addition, because the acoustic model is generated from the pronunciation dictionary in a setting where transcripts are provided, the acoustic model is affected by an inaccurate pronunciation dictionary, and a model can be built only when transcribed data is available.

Various other top-down and bottom-up methods for phoneme clustering have also been proposed, but these methods likewise rely on a predetermined phoneme set.

Meanwhile, R. Singh proposed a method for automatically determining the phoneme set and the pronunciation dictionary (R. Singh, B. Raj and R. Stern, “Automatic generation of phone sets and lexical transcriptions”, in Proc. ICASSP, 2000). This method uses the maximum a posteriori (MAP) method to build an optimal pronunciation dictionary and optimizes the acoustic models of the phoneme set with respect to likelihood. That is, it applies the MAP method iteratively with the aim of producing progressively better phonemes and pronunciation dictionaries.

In addition, to generate a word pronunciation dictionary, the prior art builds a graph of pronunciation candidates and selects the optimal candidate from it. Starting from phonemes and a pronunciation dictionary that were initially entered by hand, this method can progressively automate the selection of the optimal candidate.

However, this method requires a large number of phonemes to be created manually at the outset. Moreover, because it is an overly general and broad approach that places no constraints on how a complete pronunciation dictionary is generated, it is difficult to optimize and requires more suitable data to be collected.

Furthermore, because the likelihood-based objective function does not fit the problem well, the iterative procedure for solving the optimization problem does not yield an optimal solution.

In addition, because the conventional speech recognition process passes through a pronunciation rule conversion module, it introduces artificial pronunciation distortions that differ from the actual acoustic realization of speech.

Unlike existing methods, in which the phoneme (phone), the basic acoustic unit for speech recognition, is determined and distinguished through expert analysis, embodiments of the present invention provide a speech recognition system and method capable of generating phonemes automatically from speech data.

However, the technical problems to be solved by the present embodiments are not limited to the one described above, and other technical problems may exist.

As a technical means for solving the above technical problem, a speech recognition method with automatic phoneme generation according to a first aspect of the present invention includes: performing unsupervised learning on feature vectors of speech data; generating a phoneme set by clustering acoustic characteristics selected on the basis of the unsupervised learning result; assigning a phoneme sequence to the speech data on the basis of the generated phoneme set; and generating an acoustic model on the basis of the speech data to which the phoneme sequence has been assigned and the phoneme sequence.

A speech recognition system with automatic phoneme generation according to a second aspect of the present invention includes a memory storing a program for speech recognition and a processor that executes the program stored in the memory. By executing the program, the processor extracts feature vectors from untranscribed speech data and performs unsupervised learning, generates a phoneme set by clustering acoustic characteristics selected on the basis of the unsupervised learning result, assigns a phoneme sequence to the speech data on the basis of the generated phoneme set, and generates an acoustic model on the basis of the speech data to which the phoneme sequence has been assigned and the phoneme sequence.

According to any one of the above means for solving the problem, the pronunciation dictionary is generated from spoken data, free of the constraints of conventional grapheme-to-phoneme (G2P) conversion, which generates a pronunciation dictionary from rules and lexicons. Observed pronunciation variants can therefore be reflected in the dictionary, which reduces the gap between pronunciation conversion rules and actual spoken speech and helps overcome the limitations of spontaneous speech recognition.

In addition, because phonemes are determined from the speech data and the acoustic model is generated using information about those phonemes, acoustic modeling becomes possible with speech data alone, which avoids the difficulty of collecting transcribed speech data in large quantities.

In addition, because the pronunciation dictionary is generated by segmenting transcripts according to how they were uttered, the speech recognition units are reorganized around sememes corresponding to high-frequency space-delimited units (eojeol) or phrases that are uttered as a chunk, unlike conventional approaches that use pseudo-morphemes or words as recognition units. This can provide speech recognition that behaves more like human perception.

In addition, transcribed speech data can be used to extend the system to pronunciation dictionaries adapted to a regional dialect or to the speaking style of an individual user.

In addition, pronunciation conversion rules for foreign languages have been inaccessible to anyone who is not a native-speaker expert, which has been an obstacle to multilingual expansion; according to an embodiment of the present invention, multilingual expansion is easy.

In addition, when new vocabulary is to be added to the speech recognition system, it can be added easily by entering the word and speaking it.

FIG. 1 is a block diagram of a speech recognition system according to an embodiment of the present invention.
FIG. 2 is a block diagram illustrating the functions of a speech recognition system according to an embodiment of the present invention.
FIG. 3 is a flowchart of a speech recognition method according to an embodiment of the present invention.
FIG. 4 is a diagram illustrating a method of extracting feature vectors from speech data.
FIG. 5 is a diagram illustrating a process of learning feature vectors using a stacked autoencoder.
FIG. 6 is a diagram illustrating a process of learning feature vectors using a convolutional autoencoder.
FIG. 7 is a diagram illustrating a process of learning feature vectors using an autoencoder cross-learning method.
FIG. 8 is a flowchart of a method of generating a pronunciation dictionary and a language network.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings so that a person of ordinary skill in the art to which the present invention pertains can readily practice them. The present invention may, however, be implemented in many different forms and is not limited to the embodiments described herein. Parts unrelated to the description are omitted from the drawings in order to describe the present invention clearly.

Throughout the specification, when a part is said to "include" a component, this means that it may further include other components rather than excluding them, unless specifically stated otherwise.

The present invention relates to a speech recognition system 1 and method capable of generating phonemes automatically.

According to an embodiment of the present invention, phonemes can be generated automatically, unlike existing methods in which the phoneme (phone), the basic acoustic unit for speech recognition, is determined and distinguished through expert analysis.

In particular, an embodiment of the present invention defines a phoneme set based on patterns in the speech data through unsupervised deep learning. It departs from the rule-based approach, in which phoneme sequences are assigned to text using G2P built on existing expert knowledge, and instead assigns phoneme sequences by analyzing the speech signal itself.

Accordingly, an embodiment of the present invention addresses the relative scarcity of transcribed speech data by making it possible to generate an acoustic model even when no transcripts are available. It can also respond effectively to pronunciation variation of words at the level of the space-delimited unit (eojeol), which makes it possible to recognize conversational speech, where distortion is more severe.

A detailed description follows with reference to the accompanying drawings.

FIG. 1 is a block diagram of a speech recognition system 1 according to an embodiment of the present invention. FIG. 2 is a block diagram illustrating the functions of the speech recognition system 1 according to an embodiment of the present invention.

The speech recognition system 1 according to an embodiment of the present invention includes a memory 10 and a processor 20.

The memory 10 stores a program for speech recognition. Here, the memory 10 collectively refers to non-volatile storage devices, which retain stored information even when power is not supplied, and volatile storage devices.

For example, the memory 10 may include NAND flash memory such as a compact flash (CF) card, a secure digital (SD) card, a memory stick, a solid-state drive (SSD), or a micro SD card; magnetic computer storage such as a hard disk drive (HDD); and an optical disc drive for media such as CD-ROM and DVD-ROM.

The processor 20 executes the program stored in the memory 10. By executing the program, it extracts feature vectors from untranscribed speech data, performs unsupervised learning, and generates a phoneme set by clustering acoustic characteristics selected on the basis of the unsupervised learning result. It then assigns a phoneme sequence to the speech data on the basis of the generated phoneme set, and generates an acoustic model on the basis of the speech data to which the phoneme sequence has been assigned and the phoneme sequence.

The speech recognition system 1 may further include a microphone for receiving speech from a user or the like, or an interface equipped with such a microphone.

Meanwhile, the speech recognition system 1 according to an embodiment of the present invention, operated by the components of FIG. 1, can be represented by the functional blocks shown in FIG. 2.

The speech recognition system 1 according to an embodiment of the present invention may include an unsupervised feature vector learning unit 100, a clustering unit 200, an acoustic model generation unit 300, a pronunciation dictionary generation unit 400, and a language network generation unit 500.

As components for producing a speech recognition result on the basis of the generated acoustic model and language network, the system may also include a feature vector extraction unit 600 and a speech recognition decoder 700.

For reference, the components shown in FIGS. 1 and 2 according to an embodiment of the present invention may be implemented in software or in hardware such as a field programmable gate array (FPGA) or an application specific integrated circuit (ASIC), and may perform predetermined roles.

However, "components" are not limited to software or hardware; each component may be configured to reside in an addressable storage medium or may be configured to run on one or more processors.

Thus, as an example, a component includes components such as software components, object-oriented software components, class components, and task components, as well as processes, functions, attributes, procedures, subroutines, segments of program code, drivers, firmware, microcode, circuits, data, databases, data structures, tables, arrays, and variables.

The components and the functions provided within them may be combined into a smaller number of components or further divided into additional components.

Hereinafter, the speech recognition method performed in the speech recognition system according to an embodiment of the present invention is described in detail with reference to FIGS. 3 to 8.

FIG. 3 is a flowchart of a speech recognition method according to an embodiment of the present invention.

1. Unsupervised feature vector learning step

In the speech recognition method according to an embodiment of the present invention, the unsupervised feature vector learning unit 100 first performs unsupervised learning on feature vectors of the speech data.

Specifically, the unsupervised feature vector learning unit 100 extracts feature vectors from untranscribed speech data, that is, speech data collected without transcripts, so that the acoustic characteristics of the spoken data are captured as patterns in their own right rather than through predefined symbols (S110).

FIG. 4 is a diagram illustrating a method of extracting feature vectors from speech data.

As shown in FIG. 4, the unsupervised feature vector learning unit 100 converts the speech data into a spectrogram (P1) and passes the spectrogram through a mel-scale filterbank in units of a preset time frame (for example, 10 ms) to generate a first feature vector (S111). It then generates a second feature vector by splicing the first feature vector with a window of a preset number of frames on its left and right; the second feature vector generated in this way can be used as the feature vector for unsupervised learning (S112).
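As a non-limiting illustration of steps S111 and S112, the following sketch computes log mel-filterbank features with a 10 ms frame shift and then splices each frame with its neighbours. The 25 ms analysis window, 512-point FFT, 40 filters, Hamming window, logarithm, and context of 5 frames are assumptions chosen for the example and are not values given in the specification; the function names are likewise hypothetical.

```python
import numpy as np

def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular mel-scale filters, shape (n_filters, n_fft // 2 + 1)."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):       # rising edge of the triangle
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):      # falling edge of the triangle
            fbank[m - 1, k] = (right - k) / max(right - center, 1)
    return fbank

def first_feature_vectors(signal, sample_rate=16000, frame_ms=25, shift_ms=10,
                          n_fft=512, n_filters=40):
    """S111 (sketch): power spectrogram -> log mel-filterbank energies per frame."""
    frame_len = int(sample_rate * frame_ms / 1000)
    shift = int(sample_rate * shift_ms / 1000)
    n_frames = 1 + (len(signal) - frame_len) // shift
    window = np.hamming(frame_len)
    frames = np.stack([signal[i * shift: i * shift + frame_len] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2   # spectrogram
    return np.log(power @ mel_filterbank(n_filters, n_fft, sample_rate).T + 1e-10)

def splice(features, context=5):
    """S112 (sketch): concatenate each frame with `context` frames on each side."""
    padded = np.pad(features, ((context, context), (0, 0)), mode="edge")
    n = features.shape[0]
    return np.hstack([padded[i: i + n] for i in range(2 * context + 1)])
```

With these assumed settings, each spliced vector has 11 x 40 = 440 dimensions, and the edges of the utterance are padded by repeating the first and last frames.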

Alternatively, the first feature vector obtained from the mel-scale filterbank may be further transformed by a discrete cosine transform (DCT) into mel-frequency cepstral coefficients (MFCCs), and the resulting feature vector may be used as the first feature vector.
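The DCT step can be sketched as follows, assuming log filterbank energies as input, an orthonormal DCT-II, and 13 retained coefficients; none of these choices is stated in the specification, and `dct_ii` is a hypothetical helper name.

```python
import numpy as np

def dct_ii(x, n_coeffs=13):
    """Orthonormal DCT-II along the last axis, keeping the first n_coeffs terms."""
    n = x.shape[-1]
    k = np.arange(n_coeffs)[:, None]          # coefficient index
    t = np.arange(n)[None, :]                 # filterbank channel index
    basis = np.cos(np.pi * (t + 0.5) * k / n) * np.sqrt(2.0 / n)
    basis[0] *= np.sqrt(0.5)                  # scale the DC term for orthonormality
    return x @ basis.T

# Example: 40 log filterbank energies per frame -> 13 cepstral coefficients per frame
log_fbank = np.log(np.random.default_rng(0).uniform(1.0, 2.0, size=(98, 40)))
mfcc = dct_ii(log_fbank, n_coeffs=13)
```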

As another method, the unsupervised feature vector learning unit 100 first converts the speech data into a spectrogram (P1). Then, in order to use the acoustic pattern of the spectrogram itself, it groups the spectrogram into two-dimensional units of x frames to generate feature matrices (S115), and the generated feature matrices can be used as the feature vectors for unsupervised learning (S116).

The filterbank-based method of generating feature vectors (S111, S112) produces the kind of feature vector used in general deep learning, so that the information of every frame can be used as input. The method of generating and using feature matrices (S115, S116) produces two-dimensional data suitable for convolutional deep learning: features are extracted with a frame shift smaller than x frames, and the information in the extracted feature matrices is used as input.
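The grouping of steps S115 and S116 can be sketched as slicing the spectrogram into overlapping two-dimensional patches. The values x = 11 frames and a shift of 5 frames are assumptions for illustration only; the specification requires only that the shift be smaller than x.

```python
import numpy as np

def feature_matrices(spectrogram, x=11, shift=5):
    """Group a (n_frames, n_bins) spectrogram into patches of x frames.

    Consecutive patches start `shift` frames apart; shift < x makes them overlap.
    Returns an array of shape (n_patches, x, n_bins).
    """
    if shift >= x:
        raise ValueError("shift must be smaller than x")
    n_frames = spectrogram.shape[0]
    starts = range(0, n_frames - x + 1, shift)
    return np.stack([spectrogram[s: s + x] for s in starts])

# Example: 50 frames of a 257-bin spectrogram -> overlapping 11 x 257 patches
spec = np.random.default_rng(0).random((50, 257))
patches = feature_matrices(spec)
```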

Once the feature vectors have been extracted in this way, the unsupervised feature vector learning unit 100 performs unsupervised learning on the extracted feature vectors (S120). Here, the unsupervised feature vector learning unit 100 may learn the feature vectors using the autoencoders shown in FIGS. 5 to 7.

FIG. 5 is a diagram illustrating a process of learning feature vectors using a stacked autoencoder. FIG. 6 is a diagram illustrating a process of learning feature vectors using a convolutional autoencoder. FIG. 7 is a diagram illustrating a process of learning feature vectors using an autoencoder cross-learning method.

As shown in FIG. 5(a), the unsupervised feature vector learning unit 100 may perform unsupervised learning by placing the extracted feature vector at both the input node (I1) and the output node (O1) of a stacked autoencoder. Such a stacked autoencoder can learn the feature vectors by adding intermediate layers one at a time.

That is, the unsupervised feature vector learning unit 100 places the feature vector generated with the mel-scale filterbank symmetrically at the input (I1) and the output (O1) of the stacked autoencoder, and learns the weight matrices (W1, W2, W3, W4) that fully connect them with the intermediate (hidden) nodes; in this way the feature vectors are learned without supervision. Here, W' denotes the transpose of W.
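The symmetric structure described here, in which the decoder reuses the transposed encoder weights W', can be sketched as the forward pass of a tied-weight stacked autoencoder. Training (for example, minimizing reconstruction error layer by layer) is omitted; the layer sizes, sigmoid activation, linear output layer, and class name are assumptions made for the example and are not taken from the specification.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class TiedStackedAutoencoder:
    """Forward pass only: encoder uses W1..Wn, decoder uses Wn'..W1' (transposes)."""

    def __init__(self, layer_sizes, seed=0):
        rng = np.random.default_rng(seed)
        self.weights = [rng.normal(0.0, 0.1, size=(n_in, n_out))
                        for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

    def encode(self, x, depth=None):
        # depth=k uses only the first k layers, mirroring the idea of growing
        # the network one intermediate layer at a time
        h = x
        for w in self.weights[:depth]:
            h = sigmoid(h @ w)
        return h

    def decode(self, h, depth=None):
        used = self.weights[:depth]
        for i, w in enumerate(reversed(used)):
            z = h @ w.T                      # W' is the transpose of W
            h = z if i == len(used) - 1 else sigmoid(z)   # linear output layer
        return h

    def reconstruction_error(self, x, depth=None):
        recon = self.decode(self.encode(x, depth), depth)
        return float(np.mean((recon - x) ** 2))

# Example: 440-dimensional spliced features, two hidden layers
ae = TiedStackedAutoencoder([440, 256, 64])
x = np.random.default_rng(1).normal(size=(3, 440))
code = ae.encode(x)            # output of the deepest hidden layer
recon = ae.decode(code)        # same shape as the input
```

In this sketch the same feature vector serves as both input and target, which is what placing it at I1 and O1 amounts to.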

In addition, as shown in FIG. 6(a), the unsupervised feature vector learning unit 100 may perform unsupervised learning by placing the extracted feature vector at the input node (I2) and the output node (O2) of a convolutional autoencoder.

The convolutional autoencoder is a technique used in image pattern analysis. It consists of convolutional nodes, which share a weight matrix along the weight direction, and intermediate layers whose weight matrices cover the connections between all nodes (fully connected).

The unsupervised feature vector learning unit 100 places the feature matrix, generated by grouping the spectrogram into two-dimensional units, symmetrically at the input (I2) and the output (O2) of the convolutional autoencoder, and learns the weight matrices (W1, W2, W3, W4) connected to the intermediate nodes. Here, W' denotes the transpose of W.

In addition, as shown in FIG. 7(a), the unsupervised feature vector learning unit 100 may use a cross autoencoder, which combines the stacked autoencoder and the convolutional autoencoder described above, and perform unsupervised learning by placing the extracted feature vectors at the input node (I3) and the output node (O3) of the cross autoencoder. With such a cross autoencoder, the combined intermediate nodes can condense the characteristics of both kinds of input data described above.

Referring again to FIG. 2, the unsupervised feature vector learning unit 100 generates, on the basis of the unsupervised learning result, an artificial neural network that contains the acoustic patterns corresponding to the feature vectors (S130).

That is, the network containing the weight matrices learned by the unsupervised feature vector learning unit 100 can be formed into an artificial neural network as shown in FIG. 5(b), FIG. 6(b), and FIG. 7(b). For a feature vector supplied to the input node (I1, I2, I3), such an artificial neural network outputs at its final node (P1, P2, P3) a value in a form that distinguishes the corresponding acoustic pattern.

2. Automatic phoneme clustering step

After the unsupervised learning step has been performed by the unsupervised feature vector learning unit 100, the clustering unit 200 generates a phoneme set by clustering acoustic characteristics selected on the basis of the unsupervised learning result (S210).

Specifically, when the clustering unit 200 receives the artificial neural network produced by the unsupervised learning from the unsupervised feature vector learning unit 100, it can generate the phoneme set by listing the output values of the artificial neural network for every input.

Here, the clustering unit 200 represents the output value for each input as a single vector and lists these vectors. Using top-down or bottom-up vector clustering, it extracts from the listed vectors those whose distance from one another is at or below a certain threshold, and averages the extracted vectors to produce a group vector. The phoneme set can then be generated from the listed vectors and the generated group vectors. In this way, the final phoneme set is generated by determining a vector value that represents each phoneme and the boundary that separates it from neighbouring phonemes.
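A minimal bottom-up sketch of this clustering follows, assuming Euclidean distance between group means and a greedy closest-pair merge; the specification names neither the distance measure nor the merge order, so both are illustrative choices, as is the function name.

```python
import numpy as np

def cluster_by_distance(vectors, threshold):
    """Merge the closest pair of groups until no pair is within `threshold`.

    Returns (groups, group_vectors): member indices per group and the mean
    vector of each group, which plays the role of the group vector.
    """
    vectors = np.asarray(vectors, dtype=float)
    groups = [[i] for i in range(len(vectors))]
    centroids = [vectors[i].copy() for i in range(len(vectors))]
    while len(groups) > 1:
        best = None
        for a in range(len(groups)):
            for b in range(a + 1, len(groups)):
                d = np.linalg.norm(centroids[a] - centroids[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        if d > threshold:
            break                                   # remaining groups are all far apart
        groups[a] = groups[a] + groups[b]
        centroids[a] = vectors[groups[a]].mean(axis=0)   # average the members
        del groups[b]
        del centroids[b]
    return groups, np.array(centroids)

# Example: four network outputs that fall into two acoustic groups
outputs = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
groups, group_vectors = cluster_by_distance(outputs, threshold=1.0)
```

Each resulting group vector stands for one candidate phoneme. The pairwise search is quadratic per merge, so a real corpus would need a more scalable clustering routine.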

As another method, the clustering unit 200 may treat each output node of the artificial neural network as a phoneme candidate, apply a decision (activation) function such as softmax at the output nodes, and generate the phoneme set by listing, as the output for each input, the index of the selected output node.

That is, the clustering unit 200 lists the index of the output node as the output value for each input, and performs clustering around those indices whose output frequency is at or above a preset count. In the index sequence produced along the time axis, an index with a low frequency of occurrence is merged with an adjacent index that has a relatively high frequency of occurrence, and the phoneme set is generated in this manner.
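The index-based variant can be sketched as follows. Taking the argmax of the softmax output as the selected index, and replacing each rare index by the nearest frequent index in time (preferring the more frequent one on a tie), are assumptions about details the specification leaves open; the function names are hypothetical.

```python
import numpy as np
from collections import Counter

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)     # subtract the row max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def frame_indices(network_outputs):
    """Index of the most probable output node for each input frame."""
    return softmax(network_outputs).argmax(axis=1).tolist()

def merge_rare_indices(indices, min_count):
    """Replace indices seen fewer than min_count times by a nearby frequent index."""
    counts = Counter(indices)
    frequent = {i for i, c in counts.items() if c >= min_count}
    merged = list(indices)
    for t, idx in enumerate(indices):
        if idx in frequent:
            continue
        for dist in range(1, len(indices)):
            candidates = [indices[p] for p in (t - dist, t + dist)
                          if 0 <= p < len(indices) and indices[p] in frequent]
            if candidates:
                merged[t] = max(candidates, key=lambda i: counts[i])
                break
    return merged, sorted(frequent)

# Example: indices 7 and 9 occur once each and are absorbed by their neighbours
seq = [3, 3, 3, 7, 3, 3, 5, 5, 5, 5, 9, 5]
merged, phoneme_set = merge_rare_indices(seq, min_count=3)
```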

After the phoneme set has been generated by either of the two methods described above, the clustering unit 200 assigns a phoneme sequence to the speech data on the basis of the generated phoneme set (S220). Here, the clustering unit 200 may list a candidate phoneme sequence using the artificial neural network, extract a final phoneme sequence delimited by the clustering boundaries on the basis of the generated phoneme set and the candidate phoneme sequence, and assign the final phoneme sequence to the speech data.
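One way to picture step S220 is to collapse a per-frame candidate sequence into segments wherever the phoneme label changes. Treating every label change as a boundary is an assumption made for this sketch; the specification does not spell out how the clustering boundaries are applied.

```python
from itertools import groupby

def to_phoneme_sequence(frame_labels):
    """Collapse per-frame labels into (label, start_frame, end_frame) segments.

    end_frame is exclusive; a new segment begins at every label change.
    """
    segments = []
    start = 0
    for label, run in groupby(frame_labels):
        length = len(list(run))
        segments.append((label, start, start + length))
        start += length
    return segments

# Example: six frames labelled with phoneme indices 3 and 5
segments = to_phoneme_sequence([3, 3, 3, 5, 5, 3])
final_sequence = [label for label, _, _ in segments]
```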

3. Determined-phoneme-based acoustic model generation step

After the automatic phoneme clustering step has been performed by the clustering unit 200, the acoustic model generation unit 300 generates an acoustic model based on the speech data to which the phoneme string has been assigned and on the phoneme string. The acoustic model generation step may use a GMM-HMM (Gaussian mixture model-hidden Markov model), a DNN-HMM (deep neural network-hidden Markov model) alone, a CNN (convolutional neural network), an RNN (recurrent neural network), or a combination of these models.

Specifically, the acoustic model generation unit 300 generates a context-independent phoneme string model using the speech data to which the phoneme string has been reassigned and the phoneme string (S310). This corresponds to conventional monophone training: the acoustic model generation unit 300 models the distribution or pattern of the phoneme set determined by the clustering unit 200.

Next, the acoustic model generation unit 300 generates a context-dependent tree based on the context-independent phoneme string model and the context-based combinations of the phoneme string (S320). That is, the acoustic model generation unit 300 generates the context-dependent tree by considering the context formed with adjacent phonemes, which means that clustering is performed in the direction that reduces the entropy of the training data over all context-dependent phonemes that can be generated.
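As a rough illustration of clustering "in the direction that reduces entropy", the sketch below scores one candidate split of a context-dependent tree node. It assumes, purely for the example, that the acoustic observations have been quantized to discrete codes and that a split is a yes/no question about the left-context phoneme; the names `entropy`, `split_gain`, and the sample data are hypothetical and not taken from the source.

```python
import math
from collections import Counter

def entropy(codes):
    """Shannon entropy (bits) of a list of discrete acoustic codes."""
    total = len(codes)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in Counter(codes).values())

def split_gain(samples, question):
    """Entropy reduction from splitting `samples` by a context question.

    samples:  non-empty list of (left_context_phoneme, acoustic_code) pairs
    question: set of left-context phonemes that answer "yes"
    """
    yes = [code for left, code in samples if left in question]
    no = [code for left, code in samples if left not in question]
    n = len(samples)
    weighted = (len(yes) / n) * entropy(yes) + (len(no) / n) * entropy(no)
    return entropy([code for _, code in samples]) - weighted

# Toy data: the left context fully determines the acoustic code.
samples = [("p1", "a"), ("p1", "a"), ("p2", "b"), ("p2", "b")]
gain_informative = split_gain(samples, {"p1"})
gain_uninformative = split_gain(samples, {"p3"})
```

A tree builder under these assumptions would pick, at each node, the question with the largest gain and stop when no question reduces entropy appreciably.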

Next, the acoustic model generation unit 300 defines context-dependent phone states for the context-dependent phonemes based on the generated context-dependent tree (S330), and assigns the defined context-dependent states to the speech data using the phoneme string (S340).

The acoustic model generation unit 300 then trains the context-dependent state models by a deep learning method, based on the information of the assigned context-dependent states and the speech data serving as training data (S350). Here, the acoustic model generation unit 300 may use the trained context-dependent state models to reassign phoneme strings to the speech data used as training data, and may retrain the context-dependent states based on the speech data and the context-dependent state information derived from the reassigned phoneme strings. By repeating this training cyclically, an embodiment of the present invention can generate progressively more accurate context-dependent state models.

4. Pronunciation dictionary generation step

FIG. 8 is a flowchart of a method for generating a pronunciation dictionary and a language network.

In general, a speech recognition system processes a speech signal using an acoustic model and a language network to which a language model is applied. A conventional language network is generated from a language model built on a training corpus and a pronunciation dictionary produced by applying G2P (grapheme-to-phoneme) conversion to the training corpus.

In contrast, the speech recognition system 1 according to an embodiment of the present invention can generate a word-phrase-unit (eojeol, space-unit) pronunciation dictionary based on speech data for which a transcript exists, as shown in the corresponding figure, and where coverage is lacking, it can expand the pronunciation dictionary by substituting a concatenation of syllable units.

A language network can then be generated by linking the generated pronunciation dictionary with a language model built from a training corpus. The pronunciation dictionary is generated in word-phrase units because a word-phrase is a unit that is uttered in one continuous stretch, so it can capture the various pronunciation variations that arise in continuous natural-language speech.

Specifically, as shown in FIG. 8, the pronunciation dictionary generation unit 400 divides the transcribed speech data into word-phrase-unit sections by measuring the pitch and energy sections of the signal (S410).

Here, a word-phrase unit (eojeol) means a bundle of one or more words that forms a unit of written spacing or of pausing in speech. For example, a Korean word-phrase may consist of a noun combined with a particle, or of a noun alone. In languages written with an alphabet, a word is usually the word-phrase unit, but contractions or short phrases uttered continuously can also be treated as word-phrase units. In languages written without spaces, such as Japanese and Chinese, the spoken pause groups that form semantic bundles can be used as word-phrase units.

Meanwhile, when the written spacing of the transcript matches the pauses in speech, the transcript can easily be assigned to the speech signal in word-phrase units using the pitch and energy sections. However, when written spacing differs from spoken pauses, or for languages without written spacing, a separate speech recognizer may be used to force-align the transcript to the time axis of the speech signal.
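The segmentation in step S410 can be sketched in simplified form as follows. The example uses short-time energy only and omits pitch measurement, which the source also mentions; the function name `split_by_energy` and the parameters `frame_len`, `energy_threshold`, and `min_pause_frames` are assumptions for illustration.

```python
import numpy as np

def split_by_energy(signal, frame_len, energy_threshold, min_pause_frames):
    """Return (start_frame, end_frame) pairs, end exclusive, for stretches of
    speech separated by at least `min_pause_frames` low-energy frames."""
    signal = np.asarray(signal, dtype=float)
    n_frames = len(signal) // frame_len
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    voiced = np.mean(frames ** 2, axis=1) > energy_threshold

    segments, start, silence_run = [], None, 0
    for i, is_voiced in enumerate(voiced):
        if is_voiced:
            if start is None:
                start = i          # a new segment begins
            silence_run = 0
        elif start is not None:
            silence_run += 1
            if silence_run >= min_pause_frames:
                # The pause is long enough: close the segment before the pause.
                segments.append((start, i - silence_run + 1))
                start, silence_run = None, 0
    if start is not None:
        segments.append((start, n_frames - silence_run))
    return segments

# Toy signal: 3 loud frames, 3 silent frames, 2 loud frames.
toy = np.concatenate([np.ones(300), np.zeros(300), np.ones(200)])
sections = split_by_energy(toy, frame_len=100, energy_threshold=0.1,
                           min_pause_frames=2)
```

Each returned section would then be matched, in order, to one space-delimited unit of the transcript when written spacing and spoken pauses agree.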

Next, the pronunciation dictionary generation unit 400 assigns phoneme strings to the transcribed speech data divided into word-phrase units (S420), and aligns the phoneme strings corresponding to each word-phrase of the transcribed speech data to which phoneme strings have been assigned (S430). Because the uttered phoneme string for a given word-phrase differs slightly from utterance to utterance and may differ from speaker to speaker, measuring the phoneme strings yields multiple pronunciations covering many cases.

Next, the pronunciation dictionary generation unit 400 refines the aligned phoneme strings based on the time axis and the number of frames (S440). Here, the pronunciation dictionary generation unit 400 may refine a frame-level phoneme string by merging phonemes that repeat along the time axis and removing phonemes that occur in too few frames.
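The refinement in step S440 can be sketched as a run-length operation. The function name `refine_phoneme_frames` and the `min_frames` cutoff are assumptions for the example; the source does not specify how few frames count as "too few".

```python
from itertools import groupby

def refine_phoneme_frames(frame_phonemes, min_frames):
    """Collapse a frame-level phoneme sequence into a phoneme string."""
    # Merge phonemes repeated along the time axis into (phoneme, run_length).
    runs = [(p, len(list(g))) for p, g in groupby(frame_phonemes)]
    # Drop phonemes that occupy too few consecutive frames.
    kept = [p for p, n in runs if n >= min_frames]
    # Dropping a short run can leave identical neighbors, so merge once more.
    return [p for p, _ in groupby(kept)]

# Frame-level labels for one word-phrase (hypothetical phoneme symbols).
frames = ["a", "a", "a", "b", "a", "a", "k", "k", "k", "k"]
refined = refine_phoneme_frames(frames, min_frames=2)
```

Here the single-frame "b" is treated as noise and removed, so the two "a" runs around it merge into one phoneme.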

Once the phoneme strings have been aligned and refined in this way, the pronunciation dictionary generation unit 400 generates a word-phrase-unit phoneme string pronunciation dictionary based on the refined phoneme strings (S450).

Here, to handle words that do not match a word-phrase unit, the pronunciation dictionary generation unit 400 may generate a pronunciation dictionary in units of partial word-phrases or syllables obtained by dividing the word-phrases of the transcribed speech data. Such a dictionary can be generated by collecting the word-phrase-unit pronunciation strings that contain the characters of the target partial word-phrase or syllable, comparing and extracting the phoneme strings of the shared characters, and then reassigning them as the phoneme of average occurrence frequency or as multiple phonemes.

5. Word-phrase-unit language network generation step

Once the pronunciation dictionary has been generated by the pronunciation dictionary generation unit 400, the language network generation unit 500 links the word-phrase-based language model generated from the training corpus with the pronunciation dictionary (S510) to generate a language network (S520).

Here, the word-phrase-based language model can be generated by changing the unit from semantic morphemes, as used in conventional morphological analysis, to word-phrase-based bundles. This estimates the probability of an utterance using word-phrase-based statistics, which means that the word-phrase unit is treated as the basic recognition unit. Just as the sememe is the bundle that a person learns and perceives, this approach attempts speech recognition centered on the word-phrase, which is the bundle of utterance.

Meanwhile, for words of the language model that are not included in the word-phrase-unit pronunciation dictionary, the language network generation unit 500 may expand the word-phrase-unit pronunciation dictionary by linking those words with the pronunciation dictionary in units of partial word-phrases or syllables.
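The dictionary expansion above can be illustrated with a lookup that falls back to syllable units. This is a sketch under several assumptions: each character of the word is treated as one syllable (as in Hangul), only the first pronunciation variant of each syllable is concatenated, and the dictionaries, phoneme symbols such as "p3", and the function name `lookup_pronunciations` are hypothetical.

```python
def lookup_pronunciations(word, phrase_dict, syllable_dict):
    """Return the pronunciation variants of `word`.

    phrase_dict:   word-phrase -> list of phoneme strings (multiple pronunciations)
    syllable_dict: syllable    -> list of phoneme strings
    """
    if word in phrase_dict:
        return phrase_dict[word]
    # Fallback: concatenate syllable-unit pronunciations.
    phonemes = []
    for syllable in word:
        if syllable not in syllable_dict:
            return []  # not coverable even at the syllable level
        phonemes.extend(syllable_dict[syllable][0])
    return [phonemes]

def expand_dictionary(lm_words, phrase_dict, syllable_dict):
    """Add entries for language-model words missing from the dictionary."""
    expanded = dict(phrase_dict)
    for word in lm_words:
        variants = lookup_pronunciations(word, phrase_dict, syllable_dict)
        if variants:
            expanded[word] = variants
    return expanded

phrase_dict = {"학교에": [["p3", "p7", "p1"], ["p3", "p7", "p2"]]}
syllable_dict = {"학": [["p3"]], "교": [["p7"]]}
expanded = expand_dictionary(["학교에", "학교", "집"], phrase_dict, syllable_dict)
```

In this toy run "학교" is added through its syllables, while "집" stays out of the dictionary because no syllable entry covers it.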

In this way, the language network generation unit 500 can generate a language network by linking the language model with the generated or expanded pronunciation dictionary. Here, in an embodiment of the present invention, because a word-phrase unit consists of one or more words, a conventional language model composed of single words can also be combined.

Once the language network has been generated in this way, the speech recognition decoder 700 can generate a speech recognition result using the acoustic model and the language network.

Specifically, when the feature vector extraction unit 600 extracts a feature vector from speech data input by a user, the speech recognition decoder 700, to which the acoustic model and the language network are applied, receives the feature vector as input and extracts the word sequence of the speech data based on the input to generate the speech recognition result.

Here, the feature vector extraction unit 600 may extract the feature vector by the method described with reference to FIG. 4, and may of course also extract it by a method different from that of FIG. 4.

Meanwhile, in the above description, steps S301 to S521 may be further divided into additional steps or combined into fewer steps, depending on the implementation of the present invention. In addition, some steps may be omitted as needed, and the order of the steps may be changed. Furthermore, even where omitted here, the content already described with reference to FIGS. 1 and 2 may also be applied to the management platform registration method of FIGS. 3 to 5.

According to any one of the embodiments of the present invention described above, the pronunciation dictionary is generated from uttered data, free of the constraints of conventional G2P, which generates pronunciation dictionaries on a rule basis and a dictionary basis. Observed pronunciation variants can therefore be reflected in the dictionary, which reduces the gap between pronunciation conversion rules and actual speech and helps overcome the limitations of spontaneous speech recognition.

In addition, because phonemes are determined from the speech data and the acoustic model is generated using information about those phonemes, acoustic modeling is possible with speech data alone, which avoids the difficulty of collecting transcribed speech data in large quantities.

In addition, pronunciation conversion rules for a foreign language have been inaccessible to anyone other than a native-speaker expert, which has been an obstacle to multilingual expansion, whereas an embodiment of the present invention makes multilingual expansion easy.

Meanwhile, the speech recognition method in the speech recognition system 1 according to an embodiment of the present invention may also be implemented in the form of a computer program stored in a medium executed by a computer, or a recording medium containing instructions executable by a computer. A computer-readable medium may be any available medium that can be accessed by a computer, and includes volatile and non-volatile media and removable and non-removable media. A computer-readable medium may also include both computer storage media and communication media. Computer storage media include volatile and non-volatile, removable and non-removable media implemented by any method or technology for storing information such as computer-readable instructions, data structures, program modules, or other data. Communication media typically include computer-readable instructions, data structures, program modules, other data in a modulated data signal such as a carrier wave, or other transport mechanisms, and include any information delivery media.

Although the method and system of the present invention have been described in connection with specific embodiments, some or all of their components or operations may be implemented using a computer system having a general-purpose hardware architecture.

The foregoing description of the present invention is for illustration, and a person of ordinary skill in the art to which the present invention pertains will understand that it can easily be modified into other specific forms without changing the technical idea or essential features of the present invention. The embodiments described above should therefore be understood as illustrative in all respects and not restrictive. For example, each component described as a single unit may be implemented in a distributed manner, and components described as distributed may likewise be implemented in a combined form.

The scope of the present invention is indicated by the claims that follow rather than by the above detailed description, and all changes or modifications derived from the meaning and scope of the claims and their equivalents should be construed as falling within the scope of the present invention.

1: speech recognition system
10: memory
20: processor
100: unsupervised feature vector learning unit
200: clustering unit
300: acoustic model generation unit
400: pronunciation dictionary generation unit
500: language network generation unit
600: feature vector extraction unit
700: speech recognition decoder

Claims

A speech recognition method capable of automatic phoneme generation, comprising:
performing unsupervised learning on a feature vector of speech data;
generating a phoneme set by clustering acoustic characteristics selected based on the unsupervised learning result;
assigning a phoneme string to the speech data based on the generated phoneme set;
generating an acoustic model based on the speech data to which the phoneme string is assigned and the phoneme string; and
generating a speech recognition result through a speech recognition decoder to which the acoustic model and a language network are applied.

The method of claim 1,
wherein the speech data is untranscribed speech data.

The method of claim 2,
wherein performing unsupervised learning on the feature vector of the speech data comprises:
extracting the feature vector from the speech data;
performing unsupervised learning on the extracted feature vector; and
generating an artificial neural network including an acoustic pattern corresponding to the feature vector based on the unsupervised learning result.

The method of claim 3,
wherein extracting the feature vector from the speech data comprises:
converting the speech data into a spectrogram;
generating a first feature vector by converting the speech data converted into the spectrogram into a mel-scale filterbank in units of a predetermined time frame; and
generating a second feature vector by splicing windows corresponding to a predetermined number of frames to the left and right of the first feature vector,
wherein the generated second feature vector is extracted as the feature vector.

The method of claim 4,
wherein performing unsupervised learning on the extracted feature vector comprises
performing unsupervised learning on the feature vector by placing the extracted feature vector at an input node and an output node of a stacked autoencoder.

The method of claim 3,
wherein extracting the feature vector from the speech data comprises:
converting the speech data into a spectrogram; and
generating a feature matrix by grouping the speech data converted into the spectrogram into two-dimensional units based on x frames,
wherein the generated feature matrix is extracted as the feature vector.

The method of claim 6,
wherein performing unsupervised learning on the extracted feature vector comprises
performing unsupervised learning on the feature vector by placing the extracted feature vector at an input node and an output node of a convolutional autoencoder.

The method of claim 3,
wherein generating the phoneme set by clustering the acoustic characteristics selected based on the unsupervised learning result comprises
generating the phoneme set by listing the output values of the artificial neural network for each input datum.

The method of claim 8,
wherein generating the phoneme set comprises:
representing the output value for each input datum as a vector and listing the vectors;
extracting, from among the listed vectors and based on vector clustering, vectors whose inter-vector distance is less than or equal to a specific boundary value;
generating a group vector by averaging the extracted vectors; and
generating the phoneme set based on the listed vectors and the generated group vector.

The method of claim 8,
wherein generating the phoneme set comprises:
listing an index of a node as the output value for each input datum; and
generating the phoneme set by performing the clustering around those listed indices whose output frequency is greater than or equal to a preset count.

The method of claim 8,
wherein assigning the phoneme string to the speech data comprises:
listing candidate phoneme strings based on the artificial neural network; and
extracting a final phoneme string based on the generated phoneme set and the candidate phoneme strings and assigning the final phoneme string to the speech data.

The method of claim 11,
wherein generating the acoustic model comprises:
generating a context-independent phoneme string model using the speech data to which the phoneme string is reassigned and the phoneme string;
generating a context-dependent tree based on the context-independent phoneme string model and context-based combinations of the phoneme string;
defining a context-dependent state for a context-dependent phoneme based on the context-dependent tree;
assigning the defined context-dependent state to the speech data using the phoneme string; and
training the context-dependent state based on information of the assigned context-dependent state and the speech data.

The method of claim 12,
wherein training the context-dependent state comprises:
reassigning the trained context-dependent models to the speech data; and
training the reassigned context-dependent state based on information of the reassigned context-dependent state and the speech data.

The method of claim 1,
further comprising generating a word-phrase-unit pronunciation dictionary based on transcribed speech data,
wherein generating the pronunciation dictionary comprises:
dividing the transcribed speech data into word-phrase-unit sections;
assigning the phoneme string to the transcribed speech data divided into word-phrase units;
aligning the phoneme strings corresponding to the word-phrases of the transcribed speech data to which the phoneme string is assigned;
refining the aligned phoneme strings based on the time axis and the number of frames; and
generating the word-phrase-unit pronunciation dictionary based on the refined phoneme strings.

The method of claim 14,
wherein generating the word-phrase-unit pronunciation dictionary comprises
generating a pronunciation dictionary in units of partial word-phrases or syllables obtained by dividing the word-phrases of the transcribed speech data.

The method of claim 15, further comprising:
linking a word-phrase-based language model generated from a training corpus with the generated pronunciation dictionary; and
generating the language network based on the linking result.

The method of claim 16,
wherein linking with the pronunciation dictionary comprises
expanding the word-phrase-unit pronunciation dictionary by linking words of the language model that are not included in the word-phrase-unit pronunciation dictionary with the pronunciation dictionary in units of partial word-phrases or syllables.

The method of claim 16, further comprising:
extracting a feature vector from speech data input by a user;
inputting the feature vector into the speech recognition decoder to which the generated acoustic model and the language network are applied; and
generating the speech recognition result by extracting a word sequence of the input speech data based on the input result.

A speech recognition system capable of automatic phoneme generation, comprising:
a memory that stores a program for speech recognition; and
a processor that executes the program stored in the memory,
wherein, as the processor executes the program, the processor extracts a feature vector from untranscribed speech data and performs unsupervised learning on it, generates a phoneme set by clustering acoustic characteristics selected based on the unsupervised learning result, assigns a phoneme string to the speech data based on the generated phoneme set, generates an acoustic model based on the speech data to which the phoneme string is assigned and the phoneme string, and generates a speech recognition result through a speech recognition decoder to which the acoustic model and a language network are applied.