KR102429891B1 - Voice recognition device and method of operating the same

Info

Publication number: KR102429891B1
Application number: KR1020200146976A
Authority: KR
Inventors: 탁민우
Original assignee: 엔에이치엔 주식회사
Priority date: 2020-11-05
Filing date: 2020-11-05
Publication date: 2022-08-05
Also published as: KR20220060867A

Abstract

A voice recognition device includes: a storage medium that stores weights for an arrangement of predetermined phonemes; a voice preprocessor that generates a voice signal from a voice sensed from a user; a first voice recognition unit configured to determine candidate phonemes from the voice signal and to generate likelihood values corresponding to the candidate phonemes; a second voice recognition unit configured to adjust at least some of the likelihood values corresponding to the candidate phonemes according to the weights for the arrangement of the predetermined phonemes; and a voice estimator configured to estimate, from the candidate phonemes, an arrangement of phonemes corresponding to the user's voice based on the adjusted likelihood values.

Description

Voice recognition device and method of operation thereof

The present invention relates to an electronic device and, more particularly, to a voice recognition device capable of recognizing a user's voice and a method of operating the same.

Voice recognition technology has recently moved beyond laboratory demonstrations and is being applied and commercialized in everyday life. Current voice recognition systems perform relatively well in a fixed environment but show degraded recognition performance across the variety of real environments. This is because the real environments in which voice recognition is performed involve factors that degrade recognition performance, such as ambient noise, utterance distance, microphone characteristics, and differences between speakers.

These factors contaminate the sensed voice signal and alter it, which in turn can change the corresponding feature vector. This means that the statistical properties of the feature vector change. For example, white noise can reduce the dynamic range (or variance) of a feature vector such as the cepstrum, which represents the envelope information of the spectrum.

The above is provided only to aid understanding of the background of the technical ideas of the present invention, and it therefore should not be understood as prior art known to those skilled in the art to which the present invention pertains.

At present, voice recognition for a short utterance, such as a single word, relies on an acoustic model based on feature vectors, and no additional correction of the word recognized by the acoustic model is considered. For example, in a device that is activated by a command, the command consists of a short utterance such as “하이” ("hi"), so additional correction using a language model trained on arrangements of multiple words is not considered, and the recognition rate for the command is accordingly low.

An object of the present invention is to provide a voice recognition device capable of recognizing a voice with improved reliability. For example, the voice recognition device may recognize phonemes from a user's voice and provide weights for adjusting the likelihood values of the recognized phonemes. Accordingly, the voice recognition device can perform voice recognition with relatively high reliability even for a relatively short utterance such as a command.

A voice recognition device according to an embodiment of the present invention includes: a storage medium that stores weights for an arrangement of predetermined phonemes; a voice preprocessor that generates a voice signal from a voice sensed from a user; a first voice recognition unit configured to determine candidate phonemes from the voice signal and to generate likelihood values corresponding to the candidate phonemes; a second voice recognition unit configured to adjust at least some of the likelihood values corresponding to the candidate phonemes according to the weights for the arrangement of the predetermined phonemes; and a voice estimator configured to estimate, from the candidate phonemes, an arrangement of phonemes corresponding to the user's voice based on the adjusted likelihood values.

The second voice recognition unit may be configured to select, from among the candidate phonemes, those that match at least part of the arrangement of the predetermined phonemes, and to adjust the likelihood values corresponding to the selected candidate phonemes according to the weights corresponding to the matched arrangement of the predetermined phonemes.

The voice signal may include feature vector values obtained by converting the sensed voice into the frequency domain.

The first voice recognition unit may be configured to generate the likelihood values corresponding to the candidate phonemes based on a hidden Markov model.

The voice estimator may be configured to generate a trigger signal when the estimated arrangement of phonemes matches a predetermined command.

The voice recognition device may further include a function block configured to activate a predetermined operation in response to the trigger signal.

A voice recognition device according to another embodiment of the present invention includes: an acoustic sensor configured to sense a voice from a user; a processor configured to determine whether the sensed voice corresponds to a command; and a storage medium that stores weights for an arrangement of predetermined phonemes. The processor is configured to: generate a voice signal from the sensed voice; determine candidate phonemes from the voice signal and generate likelihood values corresponding to the candidate phonemes; adjust at least some of the likelihood values corresponding to the candidate phonemes according to the weights for the arrangement of the predetermined phonemes; and estimate, from the candidate phonemes, an arrangement of phonemes corresponding to the user's voice based on the adjusted likelihood values.

The processor may be configured to select, from among the candidate phonemes, those that match at least part of the arrangement of the predetermined phonemes, and to adjust the likelihood values corresponding to the selected candidate phonemes according to the weights corresponding to the matched arrangement of the predetermined phonemes.

The processor may be configured to generate the likelihood values corresponding to the candidate phonemes based on a hidden Markov model.

The processor may be configured to generate a trigger signal when the estimated arrangement of phonemes matches the command.

The voice recognition device may further include a speaker that operates under the control of the processor, and the processor may be configured to control the speaker to output a predetermined voice in response to the trigger signal.

Another aspect of the present invention relates to a method of recognizing a user's voice. The method includes: storing weights for an arrangement of predetermined phonemes in a storage medium; sensing the user's voice; generating a voice signal from the sensed voice; determining candidate phonemes from the voice signal and generating likelihood values corresponding to the candidate phonemes; adjusting at least some of the likelihood values corresponding to the candidate phonemes according to the weights for the arrangement of the predetermined phonemes; and estimating, from the candidate phonemes, an arrangement of phonemes corresponding to the voice based on the adjusted likelihood values.

The method may further include generating a trigger signal when the estimated arrangement of phonemes matches a predetermined command.

The method may further include activating a predetermined operation in response to the trigger signal.

The adjusting may include: selecting, from among the candidate phonemes, those that match at least part of the arrangement of the predetermined phonemes; and adjusting the likelihood values corresponding to the selected candidate phonemes according to the weights corresponding to the matched arrangement of the predetermined phonemes.

The generating of the likelihood values may include generating the likelihood values corresponding to the candidate phonemes based on a hidden Markov model.

Yet another aspect of the present invention relates to a client server that communicates with a computer device. The client server includes: a communicator; a database that stores weights for an arrangement of predetermined phonemes, and program codes; and a processor configured to communicate with the computer device through the communicator to provide the weights for the arrangement of the predetermined phonemes and the program codes. The program codes include, for execution by the computer device: first instructions for generating a voice signal from a voice sensed from a user; second instructions for determining candidate phonemes from the voice signal and generating likelihood values corresponding to the candidate phonemes; third instructions for adjusting at least some of the likelihood values corresponding to the candidate phonemes according to the weights for the arrangement of the predetermined phonemes; and fourth instructions for estimating, from the candidate phonemes, an arrangement of phonemes corresponding to the user's voice based on the adjusted likelihood values.

The third instructions may include fifth instructions that, when executed by the computer device, select, from among the candidate phonemes, those that match at least part of the arrangement of the predetermined phonemes, and adjust the likelihood values of the selected candidate phonemes according to the weights corresponding to the matched arrangement of the predetermined phonemes.

According to the present invention, a voice recognition device capable of recognizing a voice with improved reliability is provided.

FIG. 1 is a block diagram showing a voice recognition device according to an embodiment of the present invention.
FIG. 2 is a diagram conceptually showing candidate phonemes generated from a user's voice.
FIG. 3 is a diagram conceptually showing candidate phonemes and the likelihood values corresponding to them.
FIG. 4 is a diagram conceptually showing the weights included in the weight data set of FIG. 1.
FIG. 5 is a diagram conceptually showing likelihood values adjusted by applying the weights.
FIG. 6 is a flowchart showing a method of recognizing a user's voice according to an embodiment of the present invention.
FIG. 7 is a block diagram showing a computer device suitable for implementing the voice recognition device of FIG. 1.
FIG. 8 is a block diagram showing an embodiment of a client server configured to provide the voice recognition module of FIG. 7.

Hereinafter, preferred embodiments of the present invention are described in detail with reference to the accompanying drawings. Note that the following description covers only the parts needed to understand the operation of the present invention, and that description of other parts is omitted so as not to obscure the gist of the present invention. The present invention is not limited to the embodiments described herein and may be embodied in other forms. The embodiments described herein are provided only to explain the technical idea of the present invention in enough detail that a person of ordinary skill in the art to which the present invention pertains can readily practice it.

Throughout the specification, when a part is said to be "connected" to another part, this includes not only the case where it is "directly connected" but also the case where it is "indirectly connected" with another element in between. The terminology used herein is for describing particular embodiments and is not intended to limit the present invention. Throughout the specification, when a part is said to "include" a component, this means that it may further include other components, rather than excluding them, unless specifically stated otherwise. "At least one of X, Y, and Z" and "at least one selected from the group consisting of X, Y, and Z" may be interpreted as X alone, Y alone, Z alone, or any combination of two or more of X, Y, and Z (e.g., XYZ, XYY, YZ, ZZ). Here, "and/or" includes any and all combinations of one or more of the listed items.

FIG. 1 is a block diagram showing a voice recognition device according to an embodiment of the present invention. FIG. 2 is a diagram conceptually showing candidate phonemes generated from a user's voice. FIG. 3 is a diagram conceptually showing candidate phonemes and the likelihood values corresponding to them. FIG. 4 is a diagram conceptually showing the weights included in the weight data set of FIG. 1. FIG. 5 is a diagram conceptually showing likelihood values adjusted by applying the weights.

Referring to FIG. 1, the voice recognition device 100 may include a microphone 110, a first interface (I/F) 120, a voice recognition processor 130, a second interface 140, a storage medium 150, a third interface 160, and a function block 170.

An acoustic sensor such as the microphone 110 is connected to the voice recognition processor 130 through the first interface 120. The microphone 110 may sense the user's voice and provide digital and/or analog data corresponding to the sensed voice to the voice recognition processor 130 through the first interface 120.

The first interface 120 may provide an interface between the microphone 110 and the voice recognition processor 130 so that the voice recognition processor 130 can communicate with the microphone 110.

The voice recognition processor 130 may be connected to the microphone 110 through the first interface 120, to the storage medium 150 through the second interface 140, and to the function block 170 through the third interface 160. The voice recognition processor 130 may receive data corresponding to the user's voice from the microphone 110 and may be configured to determine whether the voice data corresponds to a predetermined command (for example, a word or text). The voice recognition processor 130 may include a voice preprocessor 131, a first voice recognition unit 132, a second voice recognition unit 133, and a voice estimator 134.

The voice preprocessor 131 is configured to generate, from the voice data supplied by the microphone 110, a voice signal in a format suitable for processing by the first voice recognition unit 132, for example feature vector values.

The first voice recognition unit 132 is configured to determine candidate phonemes from the voice signal generated by the voice preprocessor 131 and to generate likelihood values corresponding to the candidate phonemes. Referring to FIG. 2, the voice preprocessor 131 provides the voice signal based on voice data SND from the microphone 110, and the voice signal may include feature vector values obtained by converting the voice data SND into frequency-domain signals (or values) for each of a plurality of frames FR1 to FR3. In embodiments, the voice preprocessor 131 may divide the voice data SND into data of the frames FR1, FR2, and FR3 and convert each divided piece of data into frequency-domain signals to generate the feature vector values. Based on an acoustic model, the first voice recognition unit 132 may generate candidate phonemes CPHN1, CPHN2, and CPHN3 corresponding to the frames FR1, FR2, and FR3 from the feature vector values. In FIG. 2, first candidate phonemes CPHN1 are generated from the feature vector values of the first frame FR1, second candidate phonemes CPHN2 from those of the second frame FR2, and third candidate phonemes CPHN3 from those of the third frame FR3.
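The framing and frequency-domain conversion attributed to the voice preprocessor 131 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the non-overlapping frames, and the use of a plain DFT magnitude spectrum as the "feature vector" are all assumptions made for the example (practical front ends typically use overlapping windows and cepstral features).

```python
import cmath
import math


def split_frames(samples, frame_len):
    """Split the voice data into consecutive, non-overlapping frames (assumed layout)."""
    return [
        samples[i:i + frame_len]
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def magnitude_spectrum(frame):
    """Convert one frame to the frequency domain with a direct DFT and keep magnitudes."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]


def extract_features(samples, frame_len):
    """Return one feature vector (magnitude spectrum) per frame."""
    return [magnitude_spectrum(frame) for frame in split_frames(samples, frame_len)]


# Hypothetical voice data SND: a sine with two cycles per 8-sample frame, 3 frames long.
snd = [math.sin(2 * math.pi * 2 * i / 8) for i in range(24)]
features = extract_features(snd, frame_len=8)
```

Here `features` holds three vectors, one each for frames corresponding to FR1 to FR3, and each vector peaks at the bin of the test tone. An acoustic model would consume such vectors to propose candidate phonemes per frame.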

Referring next to FIG. 3, the first voice recognition unit 132 generates, based on the acoustic model, likelihood values LH1 and LH2 corresponding to the first to third candidate phonemes CPHN1 to CPHN3 from the feature vector values. For neighboring candidate phonemes (or neighboring frames), a likelihood value may be generated for a candidate phoneme that follows another candidate phoneme. For example, the first voice recognition unit 132 may determine the first candidate phonemes CPHN1 of the first frame FR1 to be “ㅎ” and “ㅍ”, and the second candidate phonemes CPHN2 of the second frame FR2 to be “ㅏ”, “ㅔ”, and “ㅐ”. The first voice recognition unit 132 may determine the likelihood value (or probability) that “ㅏ” follows “ㅎ” to be 0.4, that “ㅔ” follows “ㅎ” to be 0.4, and that “ㅐ” follows “ㅎ” to be 0.2. It may determine the likelihood values that “ㅏ”, “ㅔ”, and “ㅐ” follow “ㅍ” to be 0.4, 0.3, and 0.3, respectively. The first voice recognition unit 132 may determine the third candidate phonemes CPHN3 of the third frame FR3 to be “ㅏ”, “ㅣ”, and “ㅐ”. It may determine the likelihood values that “ㅏ”, “ㅣ”, and “ㅐ” follow “ㅏ” to be 0.1, 0.6, and 0.3, respectively. The likelihood values that “ㅏ”, “ㅣ”, and “ㅐ” follow “ㅔ” may be determined to be 0.2, 0.6, and 0.2, respectively, and the likelihood values that “ㅏ”, “ㅣ”, and “ㅐ” follow “ㅐ” may be determined to be 0.2, 0.7, and 0.1, respectively.

In embodiments, the first voice recognition unit 132 may include an algorithm based on a hidden Markov model (HMM) to generate the candidate phonemes CPHN1 to CPHN3 and the likelihood values LH1 and LH2. In embodiments, the acoustic model may include a Gaussian mixture model (GMM) or a deep neural network (DNN).

Referring again to FIG. 1, the second voice recognition unit 133 is configured to adjust at least some of the likelihood values LH1 and LH2 (see FIG. 3) corresponding to the candidate phonemes CPHN1, CPHN2, and CPHN3 (see FIG. 3) according to a predetermined weight data set WTS. Referring to FIG. 4, the weight data set WTS may include weights WT1 and WT2 for the arrangement of the phonemes CMDPHN1, CMDPHN2, and CMDPHN3 (hereinafter, target phonemes) of a target command. In FIG. 4, the target phonemes CMDPHN1 to CMDPHN3 are “ㅎ”, “ㅏ”, and “ㅣ”, and the weight data set WTS includes a first weight WT1 for “ㅏ” following “ㅎ” and a second weight WT2 for “ㅣ” following “ㅏ”. The first and second weights WT1 and WT2 may be greater than 1. Given the first to third target phonemes CMDPHN1 to CMDPHN3 “ㅎ”, “ㅏ”, and “ㅣ”, the target command may be “하이” ("hi").

Based only on the candidate phonemes CPHN1 to CPHN3 generated by the first voice recognition unit 132 and their likelihood values LH1 and LH2, the command “하이” may go undetected. Referring again to FIG. 3, the likelihood value that “ㅏ” follows “ㅎ” is 0.4, which may not exceed, by at least a threshold, the likelihood value 0.4 that “ㅔ” follows “ㅎ” or the likelihood value 0.2 that “ㅐ” follows “ㅎ”. In this case, even if the user pronounced “하이”, it may not be detected; for example, “헤이” ("hey") may be detected instead, or voice recognition may fail. Thus, voice recognition of a short target command in particular, such as a single word, may fail to detect the target command even when the user pronounced it if the recognition relies only on an acoustic model based on feature vector values, because of the various factors that change recognition performance, such as ambient noise, utterance distance, microphone characteristics, and differences between speakers.
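The ambiguity described above can be made concrete with a small margin check on the FIG. 3 example values. This is only a sketch: the source states that a likelihood should exceed its rivals "by at least a threshold" but gives no threshold value, so `MARGIN_THRESHOLD` and the helper names are hypothetical.

```python
# Likelihoods LH1 from the FIG. 3 example: LH1[prev][nxt] is the likelihood
# that phoneme `nxt` follows phoneme `prev`.
LH1 = {
    "ㅎ": {"ㅏ": 0.4, "ㅔ": 0.4, "ㅐ": 0.2},
    "ㅍ": {"ㅏ": 0.4, "ㅔ": 0.3, "ㅐ": 0.3},
}

MARGIN_THRESHOLD = 0.1  # hypothetical value; not specified in the source


def margin_over_rivals(table, prev, nxt):
    """Return how far table[prev][nxt] exceeds the best competing successor of prev."""
    rivals = [value for phoneme, value in table[prev].items() if phoneme != nxt]
    return table[prev][nxt] - max(rivals)


def is_confident(table, prev, nxt, threshold=MARGIN_THRESHOLD):
    """True when the transition beats every rival by at least the threshold."""
    return margin_over_rivals(table, prev, nxt) >= threshold


margin = margin_over_rivals(LH1, "ㅎ", "ㅏ")
confident = is_confident(LH1, "ㅎ", "ㅏ")
```

With the example values, “ㅏ” after “ㅎ” ties with “ㅔ” after “ㅎ”, so the margin is zero and the transition is not confidently selected, which is the failure case the paragraph describes.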

Referring again to FIG. 1 together with FIGS. 3 and 4, the second voice recognition unit 133 may select, from among the candidate phonemes CPHN1, CPHN2, and CPHN3, those that match the arrangement of the target phonemes CMDPHN1 to CMDPHN3 in the weight data set WTS, and may adjust the likelihood values of the selected candidate phonemes according to the weights WT1 and WT2 corresponding to the matched arrangement of the target phonemes CMDPHN1 to CMDPHN3. The voice estimator 134 is configured to estimate, from the candidate phonemes CPHN1, CPHN2, and CPHN3, the arrangement of phonemes corresponding to the user's voice based on the adjusted likelihood values. In embodiments, the voice estimator 134 may perform this estimation based on a Viterbi decoding algorithm.

As described above, the weight data set WTS includes a first weight WT1 for “ㅏ” arranged after “ㅎ” and a second weight WT2 for “ㅣ” arranged after “ㅏ”. The second speech recognition unit 133 may select, from among the candidate phonemes CPHN1 to CPHN3, “ㅏ” arranged after “ㅎ” and “ㅣ” arranged after “ㅏ”, adjust the likelihood value 0.4 corresponding to “ㅏ” arranged after “ㅎ” by applying the first weight WT1, and adjust the likelihood value 0.6 corresponding to “ㅣ” arranged after “ㅏ” by applying the second weight WT2. For example, the second speech recognition unit 133 may compute an adjusted likelihood value by multiplying a likelihood value by the corresponding weight. Accordingly, as shown in FIG. 5, “ㅏ” arranged after “ㅎ” has an adjusted likelihood value MLH1 of 0.8, and “ㅣ” arranged after “ㅏ” has an adjusted likelihood value MLH2 of 0.9. The adjusted likelihood value MLH1 may be greater, by at least a threshold, than the other likelihood values associated with “ㅎ” and/or “ㅏ”, for example the likelihood value 0.4 that “ㅔ” follows “ㅎ”, the likelihood value 0.2 that “ㅐ” follows “ㅎ”, and the likelihood value 0.4 that “ㅏ” follows “ㅍ”. Likewise, the adjusted likelihood value MLH2 may be greater, by at least a threshold, than the likelihood values associated with “ㅏ” and/or “ㅣ”, for example the likelihood value 0.1 that “ㅏ” follows “ㅏ”, the likelihood value 0.3 that “ㅐ” follows “ㅏ”, the likelihood value 0.6 that “ㅣ” follows “ㅔ”, and the likelihood value 0.7 that “ㅣ” follows “ㅐ”. With the adjusted likelihood values MLH1 and MLH2, the speech estimator 134 may, for example based on the Viterbi decoding algorithm, estimate and/or determine the arrangement of phonemes corresponding to the user's voice to be “ㅎ”, “ㅏ”, and “ㅣ”, as indicated by the bold line in FIG. 5.
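The numerical example above can be traced with a short Python sketch. It is illustrative only and is not the implementation of the described embodiments. The weight values 2.0 and 1.5 are assumptions inferred from 0.4 becoming 0.8 and 0.6 becoming 0.9 under multiplication; the names `CANDIDATES`, `LIKELIHOOD`, `WEIGHTS`, `adjust`, and `best_path` are hypothetical; transitions not listed in the text are treated as having likelihood 0; and the search is a simplified Viterbi-style dynamic program over transition likelihoods only.

```python
# Candidate phonemes per position (CPHN1 to CPHN3); only those named in the text.
CANDIDATES = [["ㅎ", "ㅍ"], ["ㅏ", "ㅔ", "ㅐ"], ["ㅏ", "ㅐ", "ㅣ"]]

# Likelihood that the second phoneme follows the first, as listed in the text.
LIKELIHOOD = {
    ("ㅎ", "ㅏ"): 0.4, ("ㅎ", "ㅔ"): 0.4, ("ㅎ", "ㅐ"): 0.2, ("ㅍ", "ㅏ"): 0.4,
    ("ㅏ", "ㅏ"): 0.1, ("ㅏ", "ㅐ"): 0.3, ("ㅏ", "ㅣ"): 0.6,
    ("ㅔ", "ㅣ"): 0.6, ("ㅐ", "ㅣ"): 0.7,
}

# Assumed weight values (WT1, WT2), inferred from 0.4 -> 0.8 and 0.6 -> 0.9.
WEIGHTS = {("ㅎ", "ㅏ"): 2.0, ("ㅏ", "ㅣ"): 1.5}


def adjust(likelihood, weights):
    """Multiply each likelihood by its weight; unmatched pairs keep weight 1."""
    return {pair: value * weights.get(pair, 1.0) for pair, value in likelihood.items()}


def best_path(candidates, scores):
    """Return (score, path) of the highest-product path through the lattice."""
    # best[p] holds the best score and path of any partial path ending at p.
    best = {p: (1.0, [p]) for p in candidates[0]}
    for layer in candidates[1:]:
        step = {}
        for cur in layer:
            step[cur] = max(
                ((score * scores.get((prev, cur), 0.0), path + [cur])
                 for prev, (score, path) in best.items()),
                key=lambda item: item[0],
            )
        best = step
    return max(best.values(), key=lambda item: item[0])


adjusted = adjust(LIKELIHOOD, WEIGHTS)
score, path = best_path(CANDIDATES, adjusted)
print(path, round(score, 2))
```

With the unadjusted values, the paths “ㅎ”-“ㅏ”-“ㅣ”, “ㅍ”-“ㅏ”-“ㅣ”, and “ㅎ”-“ㅔ”-“ㅣ” all have the same product, 0.4 × 0.6 = 0.24, so the target arrangement is not distinguishable. After adjustment, the target path has the product 0.8 × 0.9 = 0.72 and stands out.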

In this way, the voice recognition processor 130 can detect a target command with a relatively high probability even when the user utters a short target command such as “하이” (“hi”). Accordingly, a voice recognition device 100 capable of recognizing voice with improved reliability may be provided.

The speech estimator 134 may generate a trigger signal when the estimated arrangement of phonemes matches the target command.

The second interface 140 provides an interface between the voice recognition processor 130 and the storage medium 150. The storage medium 150 is configured to store the weight data set WTS. The target command may be predetermined to trigger an arbitrary operation of the voice recognition device 100, and the weight data set WTS corresponding to the target command may also be predetermined and stored and/or updated in the storage medium 150. The weight data set WTS may be determined in consideration of the various environments in which the voice recognition processor 130 may be used.

In embodiments, the storage medium 150 may include various types of nonvolatile storage media that retain stored data even when power is cut off, for example a flash memory, a hard disk, and the like.

The third interface 160 provides an interface between the voice recognition processor 130 and the function block 170. The function block 170 may include a processor for performing at least one of the various functions of the voice recognition processor 130. The function block 170 is configured to perform or activate an arbitrary operation in response to the trigger signal generated by the voice recognition processor 130. For example, the function block 170 may output a voice signal such as “예” (“yes”) or activate an arbitrary operation in response to the trigger signal. In embodiments, the function block 170 may be configured through hardware, software, and/or a combination thereof.

In embodiments, the voice recognition device 100 may be implemented as any of various computer devices capable of sensing sound and processing the sensed sound as described above, such as an artificial intelligence (AI) speaker, a mobile phone, a smartphone, an ultra mobile PC (UMPC), a workstation, a netbook, a personal digital assistant (PDA), a portable computer, a web tablet, a portable multimedia player (PMP), or a portable game console.

FIG. 6 is a flowchart illustrating a method of recognizing a user's voice according to an embodiment of the present invention.

Referring to FIG. 6, in step S110, a user's voice is sensed. In step S120, a voice signal is generated according to the sensed voice. The voice signal may have a data format suitable for performing step S130. In embodiments, the voice signal may include feature vector values obtained by converting the voice data sensed in step S110 into frequency-domain signals.
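As one possible reading of step S120, the following Python sketch splits sampled voice data into overlapping frames and takes the magnitude spectrum of each frame as its feature vector. The text does not specify the feature type, the frame length, or the hop size, so all of these, together with the names `frame_signal`, `magnitude_spectrum`, and `feature_vectors`, are assumptions made for illustration. A naive discrete Fourier transform is used to keep the sketch free of external dependencies.

```python
import cmath
import math


def frame_signal(samples, frame_len, hop):
    """Split samples into overlapping frames; a trailing partial frame is dropped."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]


def magnitude_spectrum(frame):
    """Naive DFT magnitudes for frequency bins 0 to N/2 of one frame."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]


def feature_vectors(samples, frame_len=16, hop=8):
    """One frequency-domain feature vector per frame."""
    return [magnitude_spectrum(f) for f in frame_signal(samples, frame_len, hop)]


# A pure tone with 2 cycles per 16-sample frame concentrates its energy in bin 2.
tone = [math.sin(2 * math.pi * 2 * t / 16) for t in range(32)]
features = feature_vectors(tone)
peak_bin = max(range(len(features[0])), key=lambda k: features[0][k])
```

A practical front end would more likely use a fast Fourier transform and perceptually motivated features; the sketch only shows the time-to-frequency conversion that the feature vector values are said to come from.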

In step S130, candidate phonemes are determined from the voice signal, and likelihood values corresponding to the candidate phonemes and/or to the arrangement between them are generated.

In step S140, at least some of the likelihood values corresponding to the candidate phonemes are adjusted according to weights for an arrangement of target phonemes stored in a database, for example the storage medium 150 of FIG. 1. In embodiments, those candidate phonemes that match at least part of the arrangement of target phonemes are selected, and the likelihood values corresponding to the selected candidate phonemes may be adjusted according to the weights corresponding to the matched arrangement of target phonemes.

In step S150, the arrangement of phonemes corresponding to the user's voice is estimated from the candidate phonemes based on the likelihood values. Because the likelihood values have been adjusted according to the weights, estimation of the arrangement of target phonemes can succeed with a relatively high probability even when the user utters a short target command such as “하이” (“hi”). Accordingly, a voice recognition method capable of recognizing voice with improved reliability may be provided.

In step S160, step S170 is performed when the estimated arrangement of phonemes matches the target command. In step S170, a trigger signal is generated. In response to the trigger signal, the voice processing device may perform or activate an arbitrary operation.
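Steps S160 and S170 amount to a comparison followed by a conditional signal. A minimal Python sketch, assuming the target command “하이” is represented as the phoneme sequence “ㅎ”, “ㅏ”, “ㅣ” and that the trigger signal is modeled as a callback; the names `TARGET_COMMAND` and `maybe_trigger` are hypothetical:

```python
# Assumed phoneme arrangement of the target command “하이”.
TARGET_COMMAND = ["ㅎ", "ㅏ", "ㅣ"]


def maybe_trigger(estimated, target, on_trigger):
    """S160: compare the arrangements. S170: emit the trigger only on a match."""
    if list(estimated) == list(target):
        on_trigger()
        return True
    return False


events = []
matched = maybe_trigger(["ㅎ", "ㅏ", "ㅣ"], TARGET_COMMAND, lambda: events.append("trigger"))
missed = maybe_trigger(["ㅎ", "ㅔ", "ㅣ"], TARGET_COMMAND, lambda: events.append("trigger"))
```

Only the first call appends to `events`; the second estimated arrangement differs in one phoneme, so no trigger is generated.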

FIG. 7 is a block diagram showing a computer device suitable for implementing the voice recognition device of FIG. 1.

Referring to FIG. 7, the computer device 1000 includes a communicator 1100, an acoustic sensor 1200, a user interface 1300, a nonvolatile storage medium 1400, a speaker 1500, a processor 1600, and a system memory 1700.

The communicator 1100 is configured to exchange wired and/or wireless signals with an external server through a network on a mobile communication network. Furthermore, the communicator 1100 may be configured to communicate with a user terminal through short-range communication, for which short-range communication technologies such as Bluetooth, Radio Frequency Identification (RFID), infrared communication (IrDA, Infrared Data Association), Ultra Wideband (UWB), and ZigBee may be used.

The acoustic sensor 1200 is for inputting the user's voice, and is configured to sense the user's voice and provide voice data to the processor 1600. The acoustic sensor 1200 may include the microphone 110 of FIG. 1.

The user interface 1300 may receive a user input for controlling operations of the computer device 1000 or the processor 1600. The user interface 1300 may include a keypad, a dome switch, a jog wheel, a jog switch, a finger mouse, and the like.

The nonvolatile storage medium 1400 may be at least one of a flash memory, a hard disk, a multimedia card, and the like. The nonvolatile storage medium 1400 is configured to write and read data under the control of the processor 1600. The nonvolatile storage medium 1400 may be provided as the storage medium 150 of FIG. 1. In embodiments, the processor 1600 may receive the weight data set WTS from an external server through the communicator 1100 and store the received weight data set WTS in the storage medium 150.
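The text does not define how the weight data set WTS is encoded when it is stored or updated. As a sketch only, the following Python code persists a weight data set to a file and reads it back, assuming a hypothetical JSON layout with one record per ordered phoneme pair. The function names, the file name `wts.json`, and the weight values 2.0 and 1.5 are all illustrative assumptions.

```python
import json
import os
import tempfile


def save_weight_set(path, command, weights):
    """Write the command and its per-pair weights using a hypothetical JSON layout."""
    records = [{"prev": a, "next": b, "weight": w} for (a, b), w in weights.items()]
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"command": command, "weights": records}, f, ensure_ascii=False)


def load_weight_set(path):
    """Read the layout written by save_weight_set back into a dict keyed by pair."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    return data["command"], {(r["prev"], r["next"]): r["weight"] for r in data["weights"]}


# Assumed example values for the target command “하이”.
wts = {("ㅎ", "ㅏ"): 2.0, ("ㅏ", "ㅣ"): 1.5}

with tempfile.TemporaryDirectory() as directory:
    wts_path = os.path.join(directory, "wts.json")
    save_weight_set(wts_path, "하이", wts)
    loaded_command, loaded_weights = load_weight_set(wts_path)
```

Keeping the weights in a separate file rather than in the program code is consistent with the statement that the weight data set may be received from an external server and updated independently.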

In embodiments, a system bus configured to connect the communicator 1100, the acoustic sensor 1200, the user interface 1300, the nonvolatile storage medium 1400, the processor 1600, and the system memory 1700 to one another may be further provided.

The processor 1600 may include either a general-purpose processor or a dedicated processor, and controls the operations of the communicator 1100, the acoustic sensor 1200, the user interface 1300, the nonvolatile storage medium 1400, the speaker 1500, and the system memory 1700.

The processor 1600 is configured to load program codes, which include instructions that provide various functions when executed, from the nonvolatile storage medium 1400 into the system memory 1700 and to execute the loaded program codes. The processor 1600 may load into the system memory 1700 a voice recognition module 1710 that, when executed by the processor 1600, performs the functions of the voice recognition processor 130 of FIG. 1, and may execute the loaded voice recognition module 1710. For example, the voice recognition module 1710 may include instructions and/or program codes that, when executed by the processor 1600, perform the functions of the voice pre-processing unit 131, the functions of the first speech recognition unit 132, the functions of the second speech recognition unit 133, and the functions of the speech estimator 134. The processor 1600 may load into the system memory 1700 an operating system 1720 that, when executed by the processor 1600, provides an environment suitable for running the voice recognition module 1710, and may execute the loaded operating system 1720. The operating system 1720 may provide an interface between the voice recognition module 1710 and components of the computer device 1000 such as the communicator 1100, the acoustic sensor 1200, the user interface 1300, the nonvolatile storage medium 1400, and the system memory 1700. For example, the operating system 1720, when executed by the processor 1600, may perform the functions of the first to third interfaces 120, 140, and 160 described with reference to FIG. 1. The operating system 1720 may control the overall operation of the computer device 1000.

Depending on the embodiment, the computer device 1000 may include hardware, software, and/or a combination thereof for performing at least one of various functions, and this may be provided as the function block 170 of FIG. 1. In embodiments, the processor 1600 may provide the function block 170 of FIG. 1 by loading instructions and/or program codes for performing at least one of those functions from the nonvolatile storage medium 1400 into the system memory 1700 and executing the loaded instructions and/or program codes. Such a function block 170 may activate an arbitrary operation in response to the trigger signal. In embodiments, the function block 170 may control the speaker 1500 to output a voice signal such as “예” (“yes”) in response to the trigger signal.

In embodiments, the system memory 1700 may include at least one of random access memory (RAM), read only memory (ROM), and other types of computer-readable storage media. Although the system memory 1700 is shown as a component separate from the processor 1600, this is only an example, and at least a portion of the system memory 1700 may be integrated into the processor 1600. In embodiments, the system memory 1700 may be provided as a buffer memory.

FIG. 8 is a block diagram showing an embodiment of a client server configured to provide the voice recognition module of FIG. 7.

According to an embodiment of the present invention, the program codes and/or instructions executed by the computer device 1000 of FIG. 7 may be provided from a client server 2000. Referring to FIG. 8, the client server 2000 may include a communicator 2100, a processor 2200, and a database 2300. The communicator 2100 may communicate with the computer device 1000 through a network. The database 2300 may store program codes and/or instructions executable by the computer device 1000 and/or the processor 1600 of FIG. 7, for example an application including the voice recognition module 1710 of FIG. 7. In response to a request from the computer device 1000, the processor 2200 may provide the application stored in the database 2300 to the computer device 1000 through the communicator 2100. The application may be installed and executed on the computer device 1000.

Although specific embodiments and application examples have been described herein, they are provided only to aid a more general understanding of the present invention. The present invention is not limited to the above embodiments, and those of ordinary skill in the art to which the present invention pertains may make various modifications and variations from this description.

Therefore, the spirit of the present invention should not be limited to the described embodiments, and not only the claims set forth below but also everything equal or equivalent to those claims falls within the scope of the spirit of the present invention.

110: microphone
130: speech processing processor
150: storage medium
170: function block

Claims

A speech recognition device comprising:
a storage medium storing weights for an arrangement of phonemes of a predetermined voice command for triggering a predetermined operation of the speech recognition device;
a voice pre-processing unit configured to generate a voice signal according to a voice sensed from a user;
a first speech recognition unit configured to generate likelihood values corresponding to candidate phonemes while determining the candidate phonemes from the voice signal;
a second speech recognition unit configured to selectively adjust the likelihood values corresponding to the candidate phonemes according to the weights for the arrangement of the phonemes of the voice command; and
a speech estimator configured to estimate an arrangement of phonemes corresponding to the user's voice from the candidate phonemes based on the adjusted likelihood values.

The speech recognition device of claim 1,
wherein the second speech recognition unit is configured to select, from among the candidate phonemes, those that match the arrangement of the phonemes of the voice command, and to adjust the likelihood values corresponding to the selected candidate phonemes according to the weights for the arrangement of the phonemes of the voice command.

The speech recognition device of claim 1,
wherein the voice signal includes feature vector values obtained by converting the sensed voice into a frequency domain.

The speech recognition device of claim 1,
wherein the first speech recognition unit is configured to generate the likelihood values corresponding to the candidate phonemes based on a hidden Markov model.

The speech recognition device of claim 1,
wherein the speech estimator is configured to generate a trigger signal when the estimated arrangement of phonemes matches the arrangement of the phonemes of the voice command.

The speech recognition device of claim 5,
further comprising a function block configured to activate the predetermined operation in response to the trigger signal.

A speech recognition device comprising:
an acoustic sensor configured to sense a voice from a user;
a processor configured to determine whether the sensed voice corresponds to a command; and
a storage medium storing weights for an arrangement of phonemes of a predetermined voice command for triggering a predetermined operation of the speech recognition device,
wherein the processor is configured to:
generate a voice signal according to the sensed voice;
generate likelihood values corresponding to candidate phonemes while determining the candidate phonemes from the voice signal;
selectively adjust the likelihood values corresponding to the candidate phonemes according to the weights for the arrangement of the phonemes of the voice command; and
estimate an arrangement of phonemes corresponding to the user's voice from the candidate phonemes based on the adjusted likelihood values.

The speech recognition device of claim 7,
wherein the processor is configured to select, from among the candidate phonemes, those that match the arrangement of the phonemes of the voice command, and to adjust the likelihood values corresponding to the selected candidate phonemes according to the weights for the arrangement of the phonemes of the voice command.

The speech recognition device of claim 7,
wherein the voice signal includes feature vector values obtained by converting the sensed voice into a frequency domain.

The speech recognition device of claim 7,
wherein the processor is configured to generate the likelihood values corresponding to the candidate phonemes based on a hidden Markov model.

The speech recognition device of claim 7,
wherein the processor is configured to generate a trigger signal when the estimated arrangement of phonemes matches the arrangement of the phonemes of the voice command.

The speech recognition device of claim 11,
further comprising a speaker operating under the control of the processor,
wherein the processor is configured to control the speaker to output a predetermined voice as the predetermined operation in response to the trigger signal.

A method of recognizing a user's voice, the method comprising:
storing, in a storage medium, weights for an arrangement of phonemes of a predetermined voice command for triggering a predetermined operation;
sensing the user's voice;
generating a voice signal according to the sensed voice;
generating likelihood values corresponding to candidate phonemes while determining the candidate phonemes from the voice signal;
selectively adjusting the likelihood values corresponding to the candidate phonemes according to the weights for the arrangement of the phonemes of the voice command; and
estimating an arrangement of phonemes corresponding to the voice from the candidate phonemes based on the adjusted likelihood values.

The method of claim 13,
further comprising generating a trigger signal when the estimated arrangement of phonemes matches the arrangement of the phonemes of the predetermined voice command.

The method of claim 14,
further comprising activating the predetermined operation in response to the trigger signal.

The method of claim 13,
wherein the adjusting comprises:
selecting, from among the candidate phonemes, those that match the arrangement of the phonemes of the voice command; and
adjusting the likelihood values corresponding to the selected candidate phonemes according to the weights for the arrangement of the phonemes of the voice command.

The method of claim 13,
wherein the voice signal includes feature vector values obtained by converting the sensed voice into a frequency domain.

The method of claim 13,
wherein generating the likelihood values includes generating the likelihood values corresponding to the candidate phonemes based on a hidden Markov model.

A client server comprising:
a database storing program codes and weights for an arrangement of phonemes of a predetermined voice command for triggering a predetermined operation; and
a processor configured to communicate with a computer device via a communicator to provide the program codes and the weights for the arrangement of the phonemes of the voice command,
wherein the program codes, when executed by the computer device, comprise:
first instructions for generating a voice signal according to a voice sensed from a user;
second instructions for generating likelihood values corresponding to candidate phonemes while determining the candidate phonemes from the voice signal;
third instructions for selectively adjusting the likelihood values corresponding to the candidate phonemes according to the weights for the arrangement of the phonemes of the voice command; and
fourth instructions for estimating an arrangement of phonemes corresponding to the user's voice from the candidate phonemes based on the adjusted likelihood values.

The client server of claim 19,
wherein the third instructions include fifth instructions that, when executed by the computer device, select, from among the candidate phonemes, those that match the arrangement of the phonemes of the voice command, and adjust the likelihood values of the selected candidate phonemes according to the weights for the arrangement of the phonemes of the voice command.