KR102413693B1

KR102413693B1 - Speech recognition apparatus and method, and model generating apparatus and method for the speech recognition apparatus

Info

Publication number: KR102413693B1
Application number: KR1020150104554A
Authority: KR
Inventors: 홍석진
Original assignee: 삼성전자주식회사
Priority date: 2015-07-23
Filing date: 2015-07-23
Publication date: 2022-06-27
Also published as: US20170025117A1; KR20170011636A; US9911409B2

Abstract

A speech recognition apparatus and method, and a model generating apparatus and method therefor, are disclosed. A speech recognition apparatus according to one aspect may include a speech recognition unit that performs speech recognition using an acoustic model, a pronunciation dictionary composed only of primitive words, and a language model composed only of primitive words, and a spacing correction unit that corrects the word spacing of the speech recognition result using a spacing model.

Description

Speech recognition apparatus and method, and model generating apparatus and method therefor

The present invention relates to speech recognition technology, and more particularly to a speech recognition apparatus and method, and a model generating apparatus and method therefor.

In general, speech recognition uses a pronunciation dictionary to determine which word each phoneme sequence represents. Because the recognition result is expressed as a sequence of words in the pronunciation dictionary, a word that is not in the dictionary is out of vocabulary (OOV) and cannot be recognized.

Increasing the number of words in the pronunciation dictionary without limit to solve this problem slows down speech recognition and increases the resources it consumes. For this reason, the number of words is commonly limited to one million or fewer.

However, if the number of words in the pronunciation dictionary is limited, the words excluded from the dictionary cannot be recognized.

An object of the present invention is to provide a speech recognition apparatus and method, and a model generating apparatus and method therefor.

A speech recognition apparatus according to one aspect may include a speech recognition unit that performs speech recognition using an acoustic model, a pronunciation dictionary composed only of primitive words, and a language model composed only of primitive words, and a spacing correction unit that corrects the word spacing of the speech recognition result using a spacing model.

A primitive word may be at least one of a word whose frequency of use is at or above a certain level, a morpheme-level word, and a syllable-level word.

The speech recognition unit may include a feature extraction unit that extracts a feature vector from an input speech signal, and a decoder that searches for the most probable primitive word sequence from the extracted feature vector based on the acoustic model, the pronunciation dictionary composed only of primitive words, and the language model composed only of primitive words.

The language model composed only of primitive words may be generated through learning using a corpus reconstructed as combinations of primitive words.

The spacing model may be generated through learning using a corpus.

A speech recognition method according to another aspect may include performing speech recognition using an acoustic model, a pronunciation dictionary composed only of primitive words, and a language model composed only of primitive words, and correcting the word spacing of the speech recognition result using a spacing model.

Performing speech recognition may include extracting a feature vector from an input speech signal, and searching for the most probable primitive word sequence from the extracted feature vector based on the acoustic model, the pronunciation dictionary composed only of primitive words, and the language model composed only of primitive words.

The language model composed only of primitive words may be generated through learning using a corpus reconstructed as combinations of the primitive words.

A model generating apparatus for a speech recognition apparatus according to yet another aspect may include a pronunciation dictionary generation unit that generates a pronunciation dictionary composed only of primitive words, a language model generation unit that generates a language model through learning using a corpus reconstructed as combinations of primitive words, and a spacing model generation unit that generates a spacing model through learning using a collected corpus.

The pronunciation dictionary generation unit may include a word collection unit that collects words, a word decomposition unit that decomposes collected words that are not primitive words into combinations of primitive words, and a dictionary generation unit that generates the pronunciation dictionary composed only of primitive words based on the result of the word decomposition.

The language model generation unit may include a corpus collection unit that collects a corpus, a corpus reconstruction unit that reconstructs the corpus by decomposing the words in the collected corpus on the basis of primitive words, and a language model learning unit that trains a language model on the reconstructed corpus to generate the language model composed only of primitive words.

The corpus reconstruction unit may reconstruct the collected corpus into a corpus in which primitive words are separated by spaces.

The spacing model generation unit may include a corpus collection unit that collects a corpus, and a spacing model learning unit that trains the spacing model using each character of each sentence of the collected corpus as input data and whether a space follows that character as target data.

The spacing model learning unit may use one of a recurrent neural network (RNN), long short-term memory (LSTM), a decision tree, a genetic algorithm (GA), genetic programming (GP), Gaussian process regression, linear discriminant analysis, K-nearest neighbor (K-NN), a perceptron, a radial basis function network, and a support vector machine (SVM).

Because the pronunciation dictionary is composed only of a small number of primitive words set by the user, and every out-of-vocabulary (OOV) word can be represented as a combination of primitive words when the language model is trained, almost all words can be recognized with only a small, limited pronunciation dictionary.

FIG. 1 is a block diagram illustrating an embodiment of a speech recognition system.
FIG. 2 is a block diagram illustrating an embodiment of the model generating apparatus 100 of FIG. 1.
FIG. 3 is a block diagram illustrating an embodiment of the speech recognition apparatus 200 of FIG. 1.
FIG. 4 is a diagram illustrating an example of spacing correction.
FIG. 5 is a flowchart illustrating an embodiment of a model generation method.
FIG. 6 is a flowchart illustrating an embodiment of the pronunciation dictionary generation process 520 of FIG. 5.
FIG. 7 is a flowchart illustrating an embodiment of the language model generation process 530 of FIG. 5.
FIG. 8 is a flowchart illustrating an embodiment of the spacing model generation process 540 of FIG. 5.
FIG. 9 is a flowchart illustrating an embodiment of a speech recognition method.
FIG. 10 is a flowchart illustrating an embodiment of the speech recognition process 910 of FIG. 9.

Hereinafter, an embodiment of the present invention is described in detail with reference to the accompanying drawings. In describing the present invention, a detailed description of a related well-known function or configuration is omitted where it could unnecessarily obscure the gist of the invention. The terms used below are defined in consideration of their functions in the present invention and may vary according to the intention or custom of users and operators. Their definitions should therefore be based on the content of this specification as a whole.

In this specification, a word means a sequence of characters delimited by spaces. For example, in the sentence "I'm a student", the words are "I'm", "a", and "student".

A primitive word, as used in this specification, means a unit of speech recognition predetermined or set by a user according to various criteria. As described below, primitive words are used to decompose a word that is not itself a primitive word into a combination of primitive words.

For example, primitive words may be set as words whose frequency of use is at or above a certain level. Assume that "foot" and "ball" are used at or above that level, whereas "football" is used below it. In this case, "foot" and "ball" are primitive words, but "football" is not.

As another example, primitive words may be set as morpheme-level words. The morphemes of "mistreatment" are "mis", "treat", and "ment", so "mis", "treat", and "ment" are primitive words, but "mistreatment" is not.

As yet another example, primitive words may be set as syllable-level words. The syllables of "watches" are "watch" and "es", so "watch" and "es" are primitive words, but "watches" is not.

FIG. 1 is a block diagram illustrating an embodiment of a speech recognition system.

Referring to FIG. 1, a speech recognition system 10 according to an embodiment may include a model generating apparatus 100 and a speech recognition apparatus 200.

The model generating apparatus 100 may generate, for the speech recognition apparatus 200, an acoustic model, a pronunciation dictionary composed only of primitive words, a language model composed only of primitive words, and a spacing model.

According to an embodiment, the model generating apparatus 100 may generate the acoustic model through learning based on the utterances of many speakers, so that the acoustic model is as robust as possible.

The acoustic model is used to recognize the user's speech. In the field of speech recognition, acoustic models are generally based on the hidden Markov model (HMM). The unit of an acoustic model for speech recognition may be a phoneme, a diphone, a triphone, a quinphone, a syllable, a word, or the like.

According to an embodiment, the model generating apparatus 100 may decompose words collected from a dictionary or a corpus into combinations of primitive words and, based on the result, generate a pronunciation dictionary composed only of primitive words.

The pronunciation dictionary models the pronunciation of primitive words, taking the primitive word as the unit of speech recognition. That is, the pronunciation dictionary is used to determine which primitive word each phoneme sequence matches.

According to an embodiment, the model generating apparatus 100 may generate a language model composed only of primitive words through learning based on a corpus reconstructed on the basis of primitive words.

The language model defines the rules that govern how one primitive word follows another, and can be regarded as a kind of grammar. Language models are mainly used in continuous speech recognition. By using the language model during search, the speech recognition apparatus can reduce its search space, and because the language model raises the probability of grammatical sentences, it improves the recognition rate.

According to an embodiment, the model generating apparatus 100 may generate a spacing model through learning based on a corpus.

The spacing model is used to correct the word spacing in the speech recognition result of the speech recognition apparatus 200. As described above, the pronunciation dictionary and the language model are generated from primitive words set by the user rather than from ordinary words, so they do not follow ordinary spacing rules. Consequently, when speech recognition is performed using the pronunciation dictionary and the language model composed only of primitive words that the model generating apparatus 100 generates, the recognition result has incorrect spacing. The spacing model takes this into account and is used to correct the spacing in the speech recognition result.

As described above, a primitive word is a unit of speech recognition predetermined or set by a user according to various criteria, and may be set as a word whose frequency of use is at or above a certain level, a morpheme-level word, a syllable-level word, or the like.

The model generating apparatus 100 is described in detail below with reference to FIG. 2.

The speech recognition apparatus 200 may perform speech recognition using the acoustic model, the pronunciation dictionary composed only of primitive words, the language model composed only of primitive words, and the spacing model generated by the model generating apparatus 100.

In detail, the speech recognition apparatus 200 may perform speech recognition by referring to the acoustic model, the pronunciation dictionary composed only of primitive words, and the language model composed only of primitive words, and may then correct the spacing of the speech recognition result using the spacing model.

For example, assume that primitive words are set as morpheme-level words. The morphemes of "mistreatment" are "mis", "treat", and "ment", so "mis", "treat", and "ment" are primitive words, but "mistreatment" is not. The pronunciation dictionary and the language model generated by the model generating apparatus 100 therefore consist of the morpheme-level words "mis", "treat", and "ment", and do not follow ordinary spacing rules. When speech recognition is performed using this pronunciation dictionary and language model, the recognition result is also produced with incorrect spacing, as in "mis treat ment". The speech recognition apparatus 200 may use the spacing model to correct the spacing of the recognition result ("mis treat ment") and produce the final, spacing-corrected result ("mistreatment").

The speech recognition apparatus 200 is described in detail below with reference to FIG. 3.

Hereinafter, it is assumed that primitive words are set as morpheme-level words.

FIG. 2 is a block diagram illustrating an embodiment of the model generating apparatus 100 of FIG. 1.

Referring to FIG. 2, the model generating apparatus 100 may include an acoustic model generation unit 110, a pronunciation dictionary generation unit 120, a language model generation unit 130, and a spacing model generation unit 140.

The acoustic model generation unit 110 may generate an acoustic model through learning based on the utterances of a plurality of speakers. The acoustic model generation unit 110 may use a hidden Markov model (HMM) as the acoustic model.

The pronunciation dictionary generation unit 120 may generate a pronunciation dictionary composed only of primitive words. To this end, the pronunciation dictionary generation unit 120 may include a word collection unit 121, a word decomposition unit 122, and a dictionary generation unit 123.

The word collection unit 121 may collect words from a corpus or a dictionary. As described above, a word here means a sequence of characters delimited by spaces. For example, in the sentence "The boys planted these trees", the words are "The", "boys", "planted", "these", and "trees".

The word decomposition unit 122 may decompose the collected words that are not primitive words into combinations of primitive words. For example, the morphemes of "boys" are "boy" and "s", the morphemes of "planted" are "plant" and "ed", and the morphemes of "trees" are "tree" and "s", so the word decomposition unit 122 may decompose "boys" into "boy" and "s", "planted" into "plant" and "ed", and "trees" into "tree" and "s".
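The decomposition step described above can be sketched in a few lines. The source does not say how a word is split into primitive words, so the dynamic-programming search below, the function name `decompose`, and the `PRIMITIVES` set are all assumptions made for illustration.

```python
def decompose(word, primitives):
    """Split `word` into primitive words, or return None if no split exists.

    Assumed strategy (not from the source): among all segmentations whose
    pieces are all primitive words, pick one with the fewest pieces.
    """
    n = len(word)
    # best[i] is a fewest-pieces segmentation of word[:i], or None.
    best = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if best[start] is not None and piece in primitives:
                candidate = best[start] + [piece]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[n]


# Hypothetical primitive word set matching the example sentence.
PRIMITIVES = {"the", "boy", "s", "plant", "ed", "these", "tree"}

for w in ["boys", "planted", "trees", "these"]:
    print(w, "->", decompose(w, PRIMITIVES))
```

A word that is already a primitive word comes back as a single piece, and a word that cannot be covered by primitive words yields `None`, which a real system would have to handle separately.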

The dictionary generation unit 123 may generate a pronunciation dictionary composed only of primitive words based on the result of the word decomposition. In the example above, the dictionary generation unit 123 may add "The", "boy", "s", "plant", "ed", "these", and "tree" to the pronunciation dictionary to generate a pronunciation dictionary composed only of primitive words.

The language model generation unit 130 may generate a language model composed only of primitive words. To this end, the language model generation unit 130 may include a corpus collection unit 131, a corpus reconstruction unit 132, and a language model learning unit 133.

The corpus collection unit 131 may collect a corpus for training the language model.

The corpus reconstruction unit 132 may reconstruct the corpus by decomposing the words in the corpus on the basis of primitive words. According to an embodiment, the corpus reconstruction unit 132 may decompose the words in the corpus that are not primitive words into combinations of primitive words, and reconstruct the corpus so that the primitive words are separated by spaces.

For example, if the corpus contains the sentence "The boys planted these trees", the corpus reconstruction unit 132 may decompose "boys", "planted", and "trees", which are not primitive words, into "boy" and "s", "plant" and "ed", and "tree" and "s", respectively, and reconstruct the corpus with the sentence "The boy s plant ed these tree s".
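As a minimal sketch of the reconstruction above, the snippet below rewrites a sentence so that primitive words are separated by spaces. The `DECOMPOSITION` table and the function name are hypothetical; the table stands in for whatever the word decomposition step produces.

```python
# Hypothetical decomposition table: non-primitive word -> primitive words.
DECOMPOSITION = {
    "boys": ["boy", "s"],
    "planted": ["plant", "ed"],
    "trees": ["tree", "s"],
}


def reconstruct_sentence(sentence, decomposition):
    """Replace every non-primitive word with its space-separated primitives."""
    pieces = []
    for word in sentence.split():
        # Words missing from the table are treated as primitive words already.
        pieces.extend(decomposition.get(word, [word]))
    return " ".join(pieces)


print(reconstruct_sentence("The boys planted these trees", DECOMPOSITION))
# prints: The boy s plant ed these tree s
```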

The language model learning unit 133 may train a language model on the reconstructed corpus to generate a language model composed only of primitive words.

The language model learning unit 133 may use one of a recurrent neural network (RNN), long short-term memory (LSTM), a decision tree, a genetic algorithm (GA), genetic programming (GP), Gaussian process regression, linear discriminant analysis, K-nearest neighbor (K-NN), a perceptron, a radial basis function network, and a support vector machine (SVM).
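To make the idea of a language model over primitive words concrete, here is a small count-based bigram model trained on a reconstructed corpus. This is a stand-in chosen for brevity and is not one of the learning methods listed above; the function names and the `<s>` and `</s>` boundary tokens are assumptions.

```python
from collections import Counter, defaultdict


def train_bigram_lm(sentences):
    """Count primitive-word bigrams in a corpus of space-separated sentences."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, cur in zip(tokens, tokens[1:]):
            counts[prev][cur] += 1
    return counts


def bigram_prob(counts, prev, cur):
    """Maximum-likelihood estimate of P(cur | prev); 0.0 if prev is unseen."""
    total = sum(counts[prev].values())
    return counts[prev][cur] / total if total else 0.0


lm = train_bigram_lm(["The boy s plant ed these tree s"])
print(bigram_prob(lm, "boy", "s"))    # "boy" is always followed by "s"
print(bigram_prob(lm, "s", "plant"))  # "s" is followed by "plant" or "</s>"
```

Because the model is trained on the reconstructed corpus, its vocabulary is exactly the set of primitive words, which is the property the text relies on.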

The spacing model generation unit 140 may generate a spacing model through learning based on a corpus. To this end, the spacing model generation unit 140 may include a corpus collection unit 141 and a spacing model learning unit 142.

The corpus collection unit 141 may collect a corpus for training the spacing model.

The spacing model learning unit 142 may train the spacing model using each character of each sentence of the collected corpus as input data and whether a space follows that character as target data.
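The training data described above can be built directly from a correctly spaced sentence. The sketch below shows one way to do it; the function name and the 0/1 label encoding are assumptions, and the model that would consume these pairs is left out.

```python
def spacing_examples(sentence):
    """Turn a correctly spaced sentence into (characters, labels).

    labels[i] is 1 if a space follows characters[i] in the sentence, else 0.
    The space characters themselves are not part of the input sequence.
    """
    chars, labels = [], []
    for ch in sentence:
        if ch == " ":
            if labels:
                labels[-1] = 1  # mark the preceding character
        else:
            chars.append(ch)
            labels.append(0)
    return chars, labels


chars, labels = spacing_examples("I am")
print(chars)   # ['I', 'a', 'm']
print(labels)  # [1, 0, 0]
```

A sequence model would then be trained to predict `labels` from `chars`, one decision per character.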

When training the spacing model, the spacing model learning unit 142 may also learn punctuation marks in addition to spacing.

The spacing model learning unit 142 may use one of a recurrent neural network (RNN), long short-term memory (LSTM), a decision tree, a genetic algorithm (GA), genetic programming (GP), Gaussian process regression, linear discriminant analysis, K-nearest neighbor (K-NN), a perceptron, a radial basis function network, and a support vector machine (SVM).

Although the corpus collection unit 131 and the corpus collection unit 141 have been described as separate components, they may be integrated into a single component.

FIG. 3 is a block diagram illustrating an embodiment of the speech recognition apparatus 200 of FIG. 1.

Referring to FIG. 3, the speech recognition apparatus 200 may include a speech recognition unit 210, a spacing correction unit 220, an acoustic model storage unit 230, a pronunciation dictionary storage unit 240, a language model storage unit 250, and a spacing model storage unit 260.

The speech recognition unit 210 may perform speech recognition using the acoustic model, the pronunciation dictionary composed only of primitive words, and the language model composed only of primitive words generated by the model generating apparatus 100. To this end, the speech recognition unit 210 may include a feature extraction unit 211 and a decoder 212.

The feature extraction unit 211 may divide the input speech signal into unit frames and extract a feature vector corresponding to each frame.

According to an embodiment, the feature extraction unit 211 may detect the speech section of the input speech signal through voice activity detection (VAD) and, for the detected section, extract speech features in order to obtain information suitable for speech recognition. The feature extraction unit 211 may extract the feature vectors by calculating the frequency characteristics of the speech signal for each unit frame. To this end, the feature extraction unit 211 may include an analog-to-digital (A/D) converter that converts an analog speech signal into a digital one, and may process the digitized speech signal by dividing it into frames of about 10 ms.
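The framing step can be illustrated with a short sketch that cuts a digitized signal into consecutive frames of about 10 ms. The function name, the 16 kHz sample rate used in the example, and the choice to drop a trailing partial frame are assumptions; the source only states the frame length. Many practical front ends use overlapping windows instead, which this sketch does not attempt.

```python
def split_into_frames(samples, sample_rate, frame_ms=10):
    """Split `samples` into consecutive, non-overlapping frames of `frame_ms`.

    A trailing partial frame is dropped (an assumed policy).
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    if frame_len <= 0:
        raise ValueError("frame length must be at least one sample")
    last_start = len(samples) - frame_len
    return [samples[i:i + frame_len] for i in range(0, last_start + 1, frame_len)]


# One second of placeholder samples at an assumed 16 kHz sample rate.
signal = [0.0] * 16000
frames = split_into_frames(signal, sample_rate=16000)
print(len(frames), len(frames[0]))  # 100 frames of 160 samples each
```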

The feature extraction unit 211 may extract the feature vectors using the Mel-frequency cepstral coefficient (MFCC) feature extraction method. The MFCC method may use a feature vector that combines the mel-cepstral coefficients, the log energy, and their first and second derivatives.
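The combination of static coefficients with their first and second derivatives can be sketched as follows. The source does not give a formula for the derivatives, so the simple central difference with edge replication used here, and the function names, are assumptions; the static features themselves (cepstral coefficients plus log energy) are taken as given.

```python
def delta(frames):
    """Approximate the time derivative of each coefficient.

    Uses (next - previous) / 2 with the first and last frames replicated
    at the edges (an assumed formula, chosen for simplicity).
    """
    n = len(frames)
    out = []
    for t in range(n):
        prev = frames[max(t - 1, 0)]
        nxt = frames[min(t + 1, n - 1)]
        out.append([(b - a) / 2.0 for a, b in zip(prev, nxt)])
    return out


def add_deltas(frames):
    """Append first and second derivatives to each static feature frame."""
    d1 = delta(frames)
    d2 = delta(d1)
    return [static + first + second
            for static, first, second in zip(frames, d1, d2)]


# Three frames with a single static coefficient each, for readability.
print(add_deltas([[0.0], [2.0], [4.0]]))
# prints: [[0.0, 1.0, 0.5], [2.0, 2.0, 0.0], [4.0, 1.0, -0.5]]
```

With 13 static values per frame, the same function would produce 39-dimensional vectors.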

In addition, in extracting features of the speech signal from a unit frame region, the feature extraction unit 211 may also use methods such as linear predictive coding (LPC), LPC-derived cepstrum, perceptual linear prediction (PLP), auditory model feature extraction, and filter bank analysis.

The decoder 212 may perform a Viterbi search that selects the most probable basic word sequence from the feature vectors extracted by the feature extraction unit 211, using the acoustic model, the pronunciation dictionary composed only of basic words, and the language model composed only of basic words. Here, for large-vocabulary recognition, the recognition target vocabulary is organized as a tree, and the decoder 212 may search this tree.
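The idea behind a Viterbi search can be illustrated with a generic sketch over a small hidden Markov model. This is a simplification under stated assumptions: it searches a flat set of states rather than the tree-structured vocabulary of the decoder 212, and the function name and the toy probabilities are invented for the example.

```python
import numpy as np

def viterbi(log_init, log_trans, log_emit):
    """Return the most probable state path and its log probability.

    log_init:  (n_states,) initial log probabilities
    log_trans: (n_states, n_states), log_trans[i, j] is the score of i -> j
    log_emit:  (n_frames, n_states) per-frame emission log probabilities
    """
    n_frames, n_states = log_emit.shape
    score = log_init + log_emit[0]
    back = np.zeros((n_frames, n_states), dtype=int)
    for t in range(1, n_frames):
        cand = score[:, None] + log_trans          # cand[i, j]: arrive in j from i
        back[t] = np.argmax(cand, axis=0)          # best predecessor of each state
        score = cand[back[t], np.arange(n_states)] + log_emit[t]
    path = [int(np.argmax(score))]
    for t in range(n_frames - 1, 0, -1):           # follow back-pointers
        path.append(int(back[t, path[-1]]))
    return path[::-1], float(np.max(score))

# Toy two-state example with made-up probabilities.
log_init = np.log([0.5, 0.5])
log_trans = np.log([[0.8, 0.2], [0.2, 0.8]])
log_emit = np.log([[0.9, 0.1], [0.9, 0.1], [0.05, 0.95]])
best_path, best_score = viterbi(log_init, log_trans, log_emit)
```

In a recognizer, the emission scores would come from the acoustic model and the transition scores from the pronunciation dictionary and language model.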

The spacing correction unit 220 may correct the spacing in the speech recognition result of the speech recognition unit 210 using the spacing model generated by the model generating apparatus 100.

In this case, the spacing correction unit 220 may also perform a compromise spacing correction by combining the result corrected through the spacing model with the spacing of the sentence produced as the speech recognition result of the speech recognition unit 210.

The acoustic model storage unit 230 may store the acoustic model generated by the model generating apparatus 100, the pronunciation dictionary storage unit 240 may store the pronunciation dictionary composed only of basic words generated by the model generating apparatus 100, the language model storage unit 250 may store the language model composed only of basic words generated by the model generating apparatus 100, and the spacing model storage unit 260 may store the spacing model generated by the model generating apparatus 100.

Meanwhile, the acoustic model storage unit 230, the pronunciation dictionary storage unit 240, the language model storage unit 250, and the spacing model storage unit 260 may each include at least one type of storage medium among a flash memory type, a hard disk type, a multimedia card micro type, a card type memory (for example, SD or XD memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), a magnetic memory, a magnetic disk, and an optical disk.

FIG. 4 is a diagram illustrating an example of spacing correction.

As described above, the pronunciation dictionary composed only of basic words and the language model composed only of basic words are built in a form that does not follow ordinary spacing rules. Consequently, when speech recognition is performed with them, the speech recognition result is also produced with incorrect spacing. The spacing correction unit 220 may therefore correct the spacing of the speech recognition result using the spacing model and generate a final result with corrected spacing.

FIG. 4 shows an example in which the spacing correction unit 220 performs spacing correction, based on the spacing model, on the speech recognition result ("The boy s plant ed these tree s") 410 of the speech recognition unit 210 and generates the final result ("The boys planted these trees") 420.
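The correction in FIG. 4 can be sketched as applying one "space follows" decision per character. The function `apply_spacing` is hypothetical, and the hand-written list `space_after` merely stands in for the output of a trained spacing model, which this sketch does not implement.

```python
def apply_spacing(recognized, space_after):
    """Re-space a recognition result from per-character decisions.

    recognized:  recognizer output whose spacing may be wrong
    space_after: one boolean per non-space character; True means a space
                 should follow that character (stand-in for model output)
    """
    chars = [c for c in recognized if c != " "]
    if len(chars) != len(space_after):
        raise ValueError("need exactly one decision per non-space character")
    pieces = []
    for ch, flag in zip(chars, space_after):
        pieces.append(ch)
        if flag:
            pieces.append(" ")
    return "".join(pieces).rstrip()

recognized = "The boy s plant ed these tree s"
n_chars = len(recognized.replace(" ", ""))
# Spaces follow "The", "boys", "planted", and "these" (0-based character indices).
space_after = [i in (2, 6, 13, 18) for i in range(n_chars)]
corrected = apply_spacing(recognized, space_after)
```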

FIG. 5 is a flowchart illustrating an embodiment of a model generating method.

Referring to FIG. 5, the model generating method 500 according to an embodiment first generates an acoustic model (510). For example, the model generating apparatus 100 may generate the acoustic model through learning based on the speech of a plurality of speakers.

Thereafter, a pronunciation dictionary composed only of basic words is generated (520), a language model composed only of basic words is generated (530), and a spacing model is generated (540) that is used to correct the spacing in speech recognition results obtained with the pronunciation dictionary and the language model composed only of basic words.

FIG. 6 is a flowchart illustrating an embodiment of the pronunciation dictionary generation process 520 of FIG. 5.

Referring to FIG. 6, the pronunciation dictionary generation process 520 first collects words from a corpus or a dictionary (610).

Thereafter, among the collected words, words that are not basic words are decomposed into combinations of basic words (620). For example, if the collected words are "The", "these", "boys", "planted", and "trees", the model generating apparatus 100 may decompose "boys" into "boy" and "s", "planted" into "plant" and "ed", and "trees" into "tree" and "s".

Thereafter, a pronunciation dictionary composed only of basic words is generated based on the word decomposition result (630). For example, the model generating apparatus 100 may generate the pronunciation dictionary composed only of basic words by adding "The", "boy", "s", "plant", "ed", "these", and "tree" to the pronunciation dictionary.
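Steps 610 to 630 can be sketched as follows. This is only an illustration: the set `BASIC_WORDS` is assumed to be given, the longest-prefix-first splitting strategy and the function names are choices made for the example, and the pronunciations that a real pronunciation dictionary would attach to each entry are omitted.

```python
# Assumed inventory of basic words; how it is chosen is outside this sketch.
BASIC_WORDS = {"The", "these", "boy", "plant", "tree", "s", "ed"}

def decompose(word, basic_words):
    """Split a word into basic words, trying the longest prefix first.

    Returns a list of basic words, or None if no decomposition exists.
    """
    if word in basic_words:
        return [word]
    for cut in range(len(word) - 1, 0, -1):
        head = word[:cut]
        if head in basic_words:
            tail = decompose(word[cut:], basic_words)
            if tail is not None:
                return [head] + tail
    return None

def build_vocabulary(collected_words, basic_words):
    """Collect the distinct basic words needed to cover the collected words."""
    vocabulary = []
    for word in collected_words:
        for piece in decompose(word, basic_words) or []:
            if piece not in vocabulary:
                vocabulary.append(piece)
    return vocabulary

collected = ["The", "these", "boys", "planted", "trees"]
vocabulary = build_vocabulary(collected, BASIC_WORDS)
```

With these inputs, `vocabulary` holds the same seven entries as the example in the text, although not necessarily in the same order.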

FIG. 7 is a flowchart illustrating an embodiment of the language model generation process 530 of FIG. 5.

Referring to FIG. 7, the language model generation process 530 first collects a corpus for language model learning (710).

Thereafter, the corpus is reconstructed by decomposing the words in the corpus on the basis of the basic words (720). For example, the model generating apparatus 100 may decompose each word in the corpus that is not a basic word into a combination of basic words, and may reconstruct the corpus in a form in which the basic words are separated by spaces. For example, if the corpus contains the sentence "The boys planted these trees", the model generating apparatus 100 may decompose "boys", "planted", and "trees", which are not basic words, into "boy" and "s", "plant" and "ed", and "tree" and "s", respectively, and may reconstruct the corpus by forming the sentence "The boy s plant ed these tree s".
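The corpus reconstruction of step 720 can be sketched as a word-by-word rewrite. The lookup table `DECOMPOSITIONS` and the function name are hypothetical; the table simply records the decompositions from the example above instead of computing them.

```python
# Hypothetical table of words that are not basic words and their decompositions.
DECOMPOSITIONS = {
    "boys": ["boy", "s"],
    "planted": ["plant", "ed"],
    "trees": ["tree", "s"],
}

def reconstruct_sentence(sentence, decompositions):
    """Rewrite a sentence so that basic words are separated by spaces."""
    pieces = []
    for word in sentence.split():
        # Words missing from the table are treated as basic words already.
        pieces.extend(decompositions.get(word, [word]))
    return " ".join(pieces)

def reconstruct_corpus(sentences, decompositions):
    """Apply the rewrite to every sentence of a corpus."""
    return [reconstruct_sentence(s, decompositions) for s in sentences]

corpus = ["The boys planted these trees"]
reconstructed = reconstruct_corpus(corpus, DECOMPOSITIONS)
```

A language model trained on `reconstructed` would then see only basic words as tokens.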

Thereafter, a language model composed only of basic words is generated by training a language model on the reconstructed corpus (730). For example, the model generating apparatus 100 may generate the language model composed only of basic words using one of a recurrent neural network (RNN), long short-term memory (LSTM), a decision tree, a genetic algorithm (GA), genetic programming (GP), Gaussian process regression, linear discriminant analysis, K-nearest neighbor (K-NN), a perceptron, a radial basis function network, and a support vector machine (SVM).

FIG. 8 is a flowchart illustrating an embodiment of the spacing model generation process 540 of FIG. 5.

Referring to FIG. 8, the spacing model generation process 540 first collects a corpus for spacing model learning (810).

Thereafter, the spacing model is trained using the collected corpus (820). For example, the model generating apparatus 100 may train the spacing model using each syllable (character) of each sentence in the collected corpus as input data and whether a space follows that character as the target data. When the spacing model is trained, punctuation marks may also be learned in addition to spacing.
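The input and target data described above can be sketched as follows. The function name `spacing_examples` and the 0/1 label encoding are assumptions for illustration; the training of the model itself is not shown.

```python
def spacing_examples(sentence):
    """Turn a correctly spaced sentence into (characters, labels).

    Each non-space character is one input; its label is 1 if a space
    follows it in the sentence and 0 otherwise.
    """
    chars, labels = [], []
    for i, ch in enumerate(sentence):
        if ch == " ":
            continue
        followed_by_space = i + 1 < len(sentence) and sentence[i + 1] == " "
        chars.append(ch)
        labels.append(1 if followed_by_space else 0)
    return chars, labels

chars, labels = spacing_examples("The boys planted")
```

Because the labels are computed from characters with the spaces removed, the same representation can be applied to a recognition result regardless of how that result happens to be spaced.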

Meanwhile, the model generating apparatus 100 may generate the spacing model using one of a recurrent neural network (RNN), long short-term memory (LSTM), a decision tree, a genetic algorithm (GA), genetic programming (GP), Gaussian process regression, linear discriminant analysis, K-nearest neighbor (K-NN), a perceptron, a radial basis function network, and a support vector machine (SVM).

FIG. 9 is a flowchart illustrating an embodiment of a speech recognition method 900.

Referring to FIG. 9, the speech recognition method 900 first performs speech recognition using an acoustic model, a pronunciation dictionary composed only of basic words, and a language model composed only of basic words (910).

Thereafter, the spacing of the speech recognition result is corrected using the spacing model (920).

FIG. 10 is a flowchart illustrating an embodiment of the speech recognition process 910 of FIG. 9.

Referring to FIG. 10, the speech recognition process 910 first extracts a feature vector from the speech signal (1010). For example, the speech recognition apparatus 200 may divide the input speech signal into unit frames and extract a feature vector corresponding to each frame region.

Thereafter, the most probable word sequence is searched for from the feature vector using the acoustic model, the pronunciation dictionary composed only of basic words, and the language model composed only of basic words (1020).

An aspect of the present invention may be implemented as computer-readable code on a computer-readable recording medium. The code and code segments implementing the above program can be readily inferred by a computer programmer skilled in the art. The computer-readable recording medium may include any type of recording device in which data readable by a computer system is stored. Examples of the computer-readable recording medium include ROM, RAM, CD-ROM, magnetic tape, floppy disks, and optical disks. In addition, the computer-readable recording medium may be distributed over network-connected computer systems so that the computer-readable code is written and executed in a distributed manner.

The present invention has been described above with a focus on its preferred embodiments. Those of ordinary skill in the art to which the present invention pertains will understand that the present invention may be implemented in modified forms without departing from its essential characteristics. Accordingly, the scope of the present invention is not limited to the above-described embodiments and should be construed to include various embodiments within a scope equivalent to what is described in the claims.

10: speech recognition system
100: model generating apparatus
110: acoustic model generation unit
120: pronunciation dictionary generation unit
121: word collection unit
122: word decomposition unit
123: dictionary generation unit
130: language model generation unit
131, 141: corpus collection unit
132: corpus reconstruction unit
133: language model learning unit
142: spacing model learning unit
200: speech recognition apparatus
210: speech recognition unit
211: feature extraction unit
212: decoder
220: spacing correction unit
230: acoustic model storage unit
240: pronunciation dictionary storage unit
250: language model storage unit
260: spacing model storage unit

Claims

1. A speech recognition apparatus comprising:
a speech recognition unit configured to perform speech recognition using an acoustic model, a pronunciation dictionary composed only of basic words, including basic words obtained by decomposing compound words, and a language model composed only of the basic words; and
a spacing correction unit configured to correct the spacing of the speech recognition result using a spacing model, wherein the spacing correction unit reconstructs a compound word by adaptively removing, from the speech recognition result, the spacing inserted between the basic words of the decomposed compound word.

2. The speech recognition apparatus of claim 1,
wherein the basic words are at least one of words having a frequency of use at or above a certain level, morpheme-level words, and syllable-level words.

3. The speech recognition apparatus of claim 1,
wherein the speech recognition unit comprises:
a feature extraction unit configured to extract a feature vector from an input speech signal; and
a decoder configured to search for the most probable basic word sequence from the extracted feature vector based on the acoustic model, the pronunciation dictionary composed only of the basic words, and the language model composed only of the basic words.

4. The speech recognition apparatus of claim 1,
wherein the language model composed only of the basic words is generated through learning using a corpus reconstructed as combinations of the basic words.

5. The speech recognition apparatus of claim 1,
wherein the spacing model is generated through learning using a corpus.

6. A speech recognition method comprising:
performing speech recognition using an acoustic model, a pronunciation dictionary composed only of basic words, including basic words obtained by decomposing compound words, and a language model composed only of the basic words; and
correcting the spacing of the speech recognition result using a spacing model, including reconstructing a compound word by adaptively removing, from the speech recognition result, the spacing inserted between the basic words of the decomposed compound word.

7. The method of claim 6,
wherein the basic words are at least one of words having a frequency of use at or above a certain level, morpheme-level words, and syllable-level words.

8. The method of claim 6,
wherein performing the speech recognition comprises:
extracting a feature vector from an input speech signal; and
searching for the most probable basic word sequence from the extracted feature vector based on the acoustic model, the pronunciation dictionary composed only of the basic words, and the language model composed only of the basic words.

9. The method of claim 6,
wherein the language model composed only of the basic words is generated through learning using a corpus reconstructed as combinations of the basic words.

10. The method of claim 6,
wherein the spacing model is generated through learning using a corpus.

11. A model generating apparatus for a speech recognition apparatus, comprising:
a pronunciation dictionary generation unit configured to generate a pronunciation dictionary composed only of basic words;
a language model generation unit configured to generate a language model through learning using a corpus reconstructed as combinations of basic words; and
a spacing model generation unit configured to generate a spacing model through learning using a collected corpus,
wherein the spacing model generation unit comprises:
a corpus collection unit configured to collect the corpus; and
a spacing model learning unit configured to train the spacing model using each syllable (character) of each sentence of the collected corpus as input data and whether a space follows that syllable (character) as target data.

12. The model generating apparatus of claim 11,
wherein the pronunciation dictionary generation unit comprises:
a word collection unit configured to collect words;
a word decomposition unit configured to decompose, among the collected words, words that are not basic words into combinations of basic words; and
a dictionary generation unit configured to generate the pronunciation dictionary composed only of the basic words based on the word decomposition result.

13. The model generating apparatus of claim 11,
wherein the language model generation unit comprises:
a corpus collection unit configured to collect a corpus;
a corpus reconstruction unit configured to reconstruct the corpus by decomposing the words in the collected corpus on the basis of the basic words; and
a language model learning unit configured to train a language model on the reconstructed corpus to generate a language model composed only of basic words.

14. The model generating apparatus of claim 13,
wherein the corpus reconstruction unit reconstructs the collected corpus into a corpus in which basic words are separated by spaces.

15. (Deleted)

16. The model generating apparatus of claim 11,
wherein the spacing model learning unit uses one of a recurrent neural network (RNN), long short-term memory (LSTM), a decision tree, a genetic algorithm (GA), genetic programming (GP), Gaussian process regression, linear discriminant analysis, K-nearest neighbor (K-NN), a perceptron, a radial basis function network, and a support vector machine (SVM).