KR20210052563A

KR20210052563A - Method and apparatus for providing context-based voice recognition service

Info

Publication number: KR20210052563A
Application number: KR1020217011945A
Authority: KR
Inventors: 황명진; 강민호; 지창진
Original assignee: 주식회사 엘솔루
Priority date: 2018-11-02
Filing date: 2018-11-02
Publication date: 2021-05-10
Also published as: CN113016029A; WO2020091123A1

Abstract

The present invention relates to a method and apparatus for recognizing speech. More specifically, the speech recognition apparatus according to the present invention may obtain voice information from a user and convert the obtained voice information into voice data.
The apparatus then generates a first speech recognition result by recognizing the converted voice data with a first speech recognition model, generates a second speech recognition result by recognizing the converted voice data with a second speech recognition model, and may select a specific speech recognition result from among the first and second speech recognition results through a specific determination procedure.

Description

Method and apparatus for providing context-based voice recognition service

The present invention relates to a method and apparatus for recognizing a user's speech. More specifically, it relates to a method and apparatus for improving speech recognition accuracy based on context when recognizing speech acquired from a user.

Automatic speech recognition (hereinafter, speech recognition) is a technology that uses a computer to convert speech into text. Recognition rates have improved rapidly in recent years.

However, although the overall recognition rate has improved, performance still varies with the composition of the data used to train the language model or acoustic model and with the structure of the model.

An object of the present invention is to provide a method for selecting a highly accurate speech recognition result from among a plurality of speech recognition results when speech is recognized using a plurality of speech recognition models.

Another object is to provide a method for selecting a speech recognition model for speech recognition using context information.

The technical problems to be solved by the present invention are not limited to those mentioned above, and other technical problems not mentioned will be clearly understood from the following description by those of ordinary skill in the art to which the present invention belongs.

A method of recognizing speech according to the present invention comprises: obtaining voice information from a user; converting the obtained voice information into voice data; generating a first speech recognition result by recognizing the converted voice data with a first speech recognition model; generating a second speech recognition result by recognizing the converted voice data with a second speech recognition model; and selecting a specific speech recognition result from among the first speech recognition result and the second speech recognition result through a specific determination procedure.

In addition, in the present invention, the specific determination procedure includes: extracting context information from the first speech recognition result and the second speech recognition result; comparing the context information with a preset first characteristic of the first speech recognition model and a preset second characteristic of the second speech recognition model, respectively; and selecting one of the first speech recognition result and the second speech recognition result based on the comparison result.

In addition, in the present invention, the context information may include at least one of a part of the voice information, information obtainable from the first speech recognition result and the second speech recognition result, or information related to the user who uttered the speech.

In addition, in the present invention, the first speech recognition model and the second speech recognition model are each one of a plurality of speech recognition models for recognizing the voice information obtained from the user.

In addition, the present invention further comprises generating a plurality of speech recognition results by recognizing the converted voice data with the plurality of speech recognition models, wherein the specific speech recognition result is selected from among the first speech recognition result, the second speech recognition result, and the plurality of speech recognition results.

In addition, in the present invention, the specific determination procedure is a procedure for determining a speech recognition result based on the context included in the context information.

In addition, the present invention provides a method comprising: obtaining voice information from a user; converting the obtained voice information into voice data; generating a first speech recognition result by recognizing the voice data with a first speech recognition model; selecting, based on the first speech recognition result, a second speech recognition model for recognizing the voice data from among a plurality of speech recognition models; and generating a second speech recognition result by recognizing the voice data with the second speech recognition model.

In addition, the present invention further comprises: extracting context information from the first speech recognition result; and comparing the context information with preset characteristics of the plurality of speech recognition models, wherein the second speech recognition model is selected based on the comparison result.

In addition, in the present invention, the first speech recognition model is a speech recognition model for extracting the context information.
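The two-pass flow described above (recognize once, extract context, then choose a second model) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the specification: models are represented as plain callables, a model's "characteristic" is represented as a set of keywords, and the comparison is a simple keyword overlap count.

```python
from typing import Callable, Dict, Set, Tuple

# Hypothetical model interface: voice data in, recognized text out.
Recognizer = Callable[[bytes], str]


def extract_context(text: str, vocabulary: Set[str]) -> Set[str]:
    """Keep only the recognized words that occur in some model's characteristic."""
    return {word for word in text.lower().split() if word in vocabulary}


def two_pass_recognize(
    voice_data: bytes,
    first_model: Recognizer,
    candidates: Dict[str, Recognizer],
    characteristics: Dict[str, Set[str]],
) -> Tuple[str, str]:
    """Run the first model, pick the best-matching second model, and run it."""
    first_result = first_model(voice_data)
    vocabulary = set().union(*characteristics.values())
    context = extract_context(first_result, vocabulary)
    # Assumed comparison rule: the model whose keywords overlap the context most.
    chosen = max(candidates, key=lambda name: len(context & characteristics[name]))
    return chosen, candidates[chosen](voice_data)


# Stub models standing in for real recognizers (illustrative only).
stub_first = lambda data: "tell me the address of someone"
stub_candidates = {
    "general": lambda data: "general-domain result",
    "address": lambda data: "address-domain result",
}
stub_characteristics = {
    "general": {"weather", "news"},
    "address": {"address", "street"},
}
chosen_name, second_result = two_pass_recognize(
    b"", stub_first, stub_candidates, stub_characteristics
)
```

In this sketch the first model only needs to be good enough to surface context words; the second model produces the result that is actually returned.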

In addition, the present invention provides a method comprising: obtaining voice information from a user; converting the obtained voice information into voice data; and generating a speech recognition result by recognizing the voice data with a specific speech recognition model selected from among a plurality of speech recognition models.

In addition, the present invention further comprises: setting context information for speech recognition; and selecting, from among the plurality of speech recognition models, the specific speech recognition model whose characteristic best fits the context information.

According to an embodiment of the present invention, when a plurality of results are generated for a voice input using a plurality of speech recognition models, the accuracy of speech recognition can be increased by selecting the recognition result of the more accurate speech recognition model.

In addition, by selecting a speech recognition model according to context information, each of the plurality of speech recognition models can be used for the purpose it suits.

In addition, an appropriate speech recognition model can be selected even in a service for a large number of users or in an environment where the user's physical and situational surroundings change frequently.

In addition, because an appropriate speech recognition model can be selected, misrecognition caused by similar vocabulary, which can occur when a large language model is applied, can be reduced, as can misrecognition caused by unregistered vocabulary, which can occur when a small language model is applied.

BRIEF DESCRIPTION OF THE DRAWINGS
The accompanying drawings, which are included as part of the detailed description to aid understanding of the present invention, provide embodiments of the present invention and, together with the detailed description, explain its technical features.
FIG. 1 is a block diagram of a speech recognition apparatus according to an embodiment of the present invention.
FIGS. 2 and 3 are diagrams illustrating an example of a speech recognition apparatus according to an embodiment of the present invention.
FIGS. 4 and 5 are diagrams illustrating another example of a speech recognition apparatus according to an embodiment of the present invention.
FIG. 6 is a diagram illustrating still another example of a speech recognition apparatus according to an embodiment of the present invention.
FIG. 7 is a flowchart illustrating an example of a speech recognition method according to an embodiment of the present invention.
FIG. 8 is a flowchart illustrating another example of a speech recognition method according to an embodiment of the present invention.
FIG. 9 is a flowchart illustrating still another example of a speech recognition method according to an embodiment of the present invention.

Hereinafter, preferred embodiments of the present invention will be described in detail with reference to the accompanying drawings. The detailed description set forth below, together with the accompanying drawings, is intended to describe exemplary embodiments of the present invention and is not intended to represent the only embodiments in which the present invention may be practiced. The following detailed description includes specific details to provide a thorough understanding of the present invention. However, those skilled in the art will appreciate that the present invention may be practiced without these specific details.

In some cases, to avoid obscuring the concept of the present invention, well-known structures and devices may be omitted or shown in block diagram form centered on the core functions of each structure and device.

FIG. 1 is a block diagram of a speech recognition apparatus according to an embodiment of the present invention.

Referring to FIG. 1, a speech recognition apparatus 100 for recognizing a user's speech may include an input unit 110, a storage unit 120, a control unit 130, and/or an output unit 140.

The components shown in FIG. 1 are not essential, so an electronic device having more or fewer components may be implemented.

Hereinafter, the above components are described in order.

The input unit 110 may receive an audio signal, a video signal, or voice information (or a voice signal) and data from a user.

The input unit 110 may include a camera and a microphone to receive an audio signal or a video signal. The camera processes image frames, such as still images or moving images, obtained by an image sensor in a video call mode or a photographing mode.

The image frames processed by the camera may be stored in the storage unit 120.

The microphone receives an external sound signal in a call mode, a recording mode, a speech recognition mode, or the like, and processes it into electrical voice data. Various noise removal algorithms may be implemented in the microphone to remove noise generated while the external sound signal is received.

When a user's uttered speech is input through the microphone, the input unit 110 may convert it into an electrical signal and transmit the signal to the control unit 130.

The control unit 130 may obtain the user's voice data by applying a speech recognition algorithm or a speech recognition engine to the signal received from the input unit 110.

At this time, the signal input to the control unit 130 may be converted into a form more useful for speech recognition. The control unit 130 converts the input signal from analog form to digital form and detects the start and end points of the speech, thereby detecting the actual speech section (data) included in the voice data. This is called end point detection (EPD).
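The specification does not state how the start and end points are detected. As one common illustration, a short-time energy threshold can be sketched in Python as follows; the frame length, the threshold value, and the function name are assumptions chosen for this example only.

```python
from typing import Optional, Sequence, Tuple


def detect_endpoints(
    samples: Sequence[float],
    frame_len: int = 160,
    threshold: float = 0.01,
) -> Optional[Tuple[int, int]]:
    """Return (start, end) sample indices of the speech section, or None.

    A frame counts as speech when its mean squared amplitude exceeds the
    threshold. The section runs from the first such frame to the last.
    """
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(sum(x * x for x in frame) / frame_len)
    active = [i for i, energy in enumerate(energies) if energy > threshold]
    if not active:
        return None
    return active[0] * frame_len, (active[-1] + 1) * frame_len


# Synthetic digitized signal: silence, a louder segment, then silence again.
signal = [0.0] * 320 + [0.5] * 480 + [0.0] * 320
speech_section = detect_endpoints(signal)
```

Practical detectors usually add smoothing and hangover frames so that short pauses inside an utterance are not cut off; the sketch omits these for brevity.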

Then, within the detected section, the control unit 130 may extract a feature vector of the signal by applying a feature vector extraction technique such as cepstrum, linear predictive coefficients (LPC), mel-frequency cepstral coefficients (MFCC), or filter bank energy.

The memory 120, that is, the storage unit 120, may store a program for the operation of the control unit 130 and may temporarily store input/output data. It may store sample files received from a user for a symbol-based malicious code detection model, and may store the analysis results of malicious code.

The memory 120 may store various data related to the recognized speech and, in particular, may store information related to the end point of the voice data processed by the control unit 130, as well as feature vectors.

The memory 120 may include at least one storage medium among a flash memory, a hard disk, a memory card, a read-only memory (ROM), a random access memory (RAM), an electrically erasable programmable read-only memory (EEPROM), a programmable read-only memory (PROM), a magnetic memory, a magnetic disk, and an optical disk.

The control unit 130 may then obtain a recognition result by comparing the extracted feature vector with trained reference patterns. To this end, a speech recognition model that models and compares the signal characteristics of speech, and a language model that models the linguistic order of words, syllables, and other units corresponding to the recognition vocabulary, may be used.

Speech recognition models can in turn be divided into a direct comparison method, which sets the recognition target as a feature vector model and compares it with the feature vector of the voice data, and a statistical method, which statistically processes the feature vector of the recognition target.

The direct comparison method sets units to be recognized, such as words and phonemes, as feature vector models and compares how similar the input speech is to them; a representative example is vector quantization. In vector quantization, the feature vectors of the input voice data are mapped to a codebook, which serves as the reference model, and encoded as representative values, and these code values are then compared with each other.
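As a minimal sketch of the vector quantization step, the following Python code maps each feature vector to the index of its nearest codeword and then compares two code sequences. The codebook values, the squared Euclidean distance, and the mismatch ratio used for the comparison are illustrative assumptions, not details from the specification.

```python
from typing import List, Sequence

Vector = Sequence[float]


def quantize(vector: Vector, codebook: Sequence[Vector]) -> int:
    """Return the index of the codeword nearest to the vector."""
    def squared_distance(codeword: Vector) -> float:
        return sum((v - c) ** 2 for v, c in zip(vector, codeword))
    return min(range(len(codebook)), key=lambda i: squared_distance(codebook[i]))


def encode(vectors: Sequence[Vector], codebook: Sequence[Vector]) -> List[int]:
    """Encode a sequence of feature vectors as codeword indices."""
    return [quantize(v, codebook) for v in vectors]


def mismatch_ratio(codes_a: Sequence[int], codes_b: Sequence[int]) -> float:
    """Fraction of positions at which two equal-length code sequences differ."""
    if len(codes_a) != len(codes_b):
        raise ValueError("code sequences must have the same length")
    return sum(a != b for a, b in zip(codes_a, codes_b)) / len(codes_a)


# Hypothetical two-dimensional codebook and input features.
codebook = [(0.0, 0.0), (1.0, 1.0), (4.0, 4.0)]
input_codes = encode([(0.1, 0.2), (0.9, 1.2), (3.8, 4.1)], codebook)
reference_codes = [0, 1, 1]
difference = mismatch_ratio(input_codes, reference_codes)
```

A lower mismatch ratio against a reference code sequence indicates that the input is closer to that reference unit.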

The statistical model method constructs the unit for a recognition target as a state sequence and uses the relationships between state sequences. A state sequence may consist of a plurality of nodes. Methods that use the relationships between state sequences include dynamic time warping (DTW), the hidden Markov model (HMM), and neural-network-based methods.

Dynamic time warping compensates for differences along the time axis when comparing with a reference model, taking into account the dynamic nature of speech, in which the length of the signal varies over time even when the same person utters the same pronunciation. The hidden Markov model is a recognition technique that assumes speech to be a Markov process with state transition probabilities and observation probabilities of nodes (output symbols) in each state, estimates the state transition probabilities and node observation probabilities from training data, and then calculates the probability that the input speech is generated by the estimated model.
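The time-axis compensation performed by dynamic time warping can be illustrated with the textbook dynamic programming recurrence. The sketch below works on one-dimensional sequences with an absolute-difference local cost, both of which are simplifying assumptions; real systems compare sequences of feature vectors.

```python
from typing import Sequence


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Accumulated cost of the best monotonic alignment between a and b."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] is the best cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances while b stays on the same frame
                cost[i][j - 1],      # b advances while a stays on the same frame
                cost[i - 1][j - 1],  # both advance together
            )
    return cost[n][m]


# The same pattern spoken more slowly aligns at zero cost.
stretched = dtw_distance([1.0, 2.0, 3.0], [1.0, 2.0, 2.0, 3.0])
# A different pattern of the same length does not.
shifted = dtw_distance([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
```

The first comparison shows why DTW suits speech: repeating a frame changes the length of the signal but not its warped distance to the reference.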

Meanwhile, a language model, which models the linguistic order of units such as words or syllables, can reduce acoustic ambiguity and recognition errors by applying the order relations between the units that make up a language to the units obtained from speech recognition. Language models include statistical language models and models based on finite state automata (FSA); statistical language models use chain probabilities of words, such as unigram, bigram, and trigram probabilities.
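To make the notion of a word chain probability concrete, the following sketch estimates bigram probabilities by maximum likelihood from a tiny corpus and multiplies them along a sentence. The corpus, the sentence boundary markers, and the absence of smoothing are assumptions made to keep the example short.

```python
from collections import Counter
from typing import Iterable, Tuple


def train_bigram(sentences: Iterable[str]) -> Tuple[Counter, Counter]:
    """Count history words and bigrams, with sentence boundary markers."""
    unigrams: Counter = Counter()
    bigrams: Counter = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens[:-1])  # every token that serves as a history
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    return unigrams, bigrams


def sentence_probability(sentence: str, unigrams: Counter, bigrams: Counter) -> float:
    """Product of P(word | previous word) over the sentence, without smoothing."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    probability = 1.0
    for previous, word in zip(tokens[:-1], tokens[1:]):
        if unigrams[previous] == 0:
            return 0.0  # unseen history: the unsmoothed model assigns zero
        probability *= bigrams[(previous, word)] / unigrams[previous]
    return probability


# Toy training corpus (illustrative only).
unigram_counts, bigram_counts = train_bigram(
    ["tell me the address", "tell me the time"]
)
p_seen = sentence_probability("tell me the address", unigram_counts, bigram_counts)
p_unseen = sentence_probability("tell me the weather", unigram_counts, bigram_counts)
```

Here "the" is followed by "address" in one of its two occurrences, so the seen sentence receives probability 0.5, while the sentence containing an unseen word receives 0. Practical models apply smoothing to avoid such zeros.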

The control unit 130 may use any of the methods described above to recognize speech. For example, a speech recognition model to which a hidden Markov model is applied may be used, or an N-best search method that integrates a speech recognition model and a language model may be used. The N-best search method can improve recognition performance by selecting up to N recognition result candidates using the speech recognition model and the language model and then re-evaluating the ranking of these candidates.
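The re-ranking step of an N-best search can be sketched as follows. The scoring scheme (an acoustic log score plus a weighted language model log score), the candidate list, and all score values are hypothetical and serve only to show how the ranking of candidates can change after re-evaluation.

```python
from typing import Callable, List, Tuple

Candidate = Tuple[str, float]  # (hypothesis text, acoustic log score)


def rescore_n_best(
    candidates: List[Candidate],
    lm_score: Callable[[str], float],
    n: int = 3,
    lm_weight: float = 1.0,
) -> List[Candidate]:
    """Keep the top-n candidates by acoustic score, then re-rank them."""
    top_n = sorted(candidates, key=lambda c: c[1], reverse=True)[:n]
    rescored = [
        (text, acoustic + lm_weight * lm_score(text)) for text, acoustic in top_n
    ]
    return sorted(rescored, key=lambda c: c[1], reverse=True)


# Hypothetical scores: the acoustically best candidate is not the most fluent.
hypotheses = [
    ("wreck a nice beach", -9.5),
    ("recognize speech", -10.0),
    ("recognise peach", -12.0),
    ("wreck an ice bee", -20.0),
]
lm_table = {
    "recognize speech": -2.0,
    "wreck a nice beach": -6.0,
    "recognise peach": -8.0,
}
ranked = rescore_n_best(hypotheses, lambda text: lm_table.get(text, -20.0))
```

After re-evaluation, "recognize speech" moves ahead of the candidate that had the best acoustic score alone.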

The control unit 130 may calculate a confidence score (which may be abbreviated as "confidence") to ensure the reliability of the recognition result.

The confidence score is a measure of how reliable a speech recognition result is. For a recognized phoneme or word, it can be defined as a value relative to the probability that the utterance came from other phonemes or words. Accordingly, the confidence score may be expressed as a value between 0 and 1 or as a value between 0 and 100. If the confidence score is greater than a preset threshold, the recognition result may be accepted; if it is smaller, the recognition result may be rejected.
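One simple way to obtain such a relative value is to normalize the likelihood of the recognized unit by the total likelihood of all competing units. The sketch below uses that formulation, which is an assumption for illustration and only one of many possible confidence measures; the default threshold of 0.5 is likewise arbitrary.

```python
from typing import Sequence


def confidence_score(best_likelihood: float, competing: Sequence[float]) -> float:
    """Likelihood of the recognized unit relative to all hypotheses, in [0, 1]."""
    total = best_likelihood + sum(competing)
    return best_likelihood / total if total > 0 else 0.0


def accept(score: float, threshold: float = 0.5) -> bool:
    """Accept the recognition result only if the score exceeds the threshold."""
    return score > threshold


# Hypothetical likelihoods for a recognized word and its competitors.
strong = confidence_score(0.6, [0.2, 0.2])
weak = confidence_score(0.1, [0.5, 0.4])
```

Multiplying the score by 100 gives the 0 to 100 representation mentioned above.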

In addition, the confidence score may be obtained according to various conventional confidence score algorithms.

The control unit 130 may be implemented in a computer-readable recording medium using software, hardware, or a combination thereof. In a hardware implementation, it may be implemented using at least one electrical unit such as application specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field programmable gate arrays (FPGAs), processors, microcontrollers, and microprocessors.

In a software implementation, it may be implemented together with separate software modules, each performing at least one function or operation, and the software code may be implemented by a software application written in an appropriate programming language.

The control unit 130 implements the functions, processes, and/or methods proposed in FIGS. 2 to 6, described below. Hereinafter, for convenience of description, the control unit 130 is treated as identical to the speech recognition apparatus 100.

The output unit 140 generates output related to sight, hearing, and the like, and outputs information processed by the apparatus 100.

For example, the output unit 140 may output the recognition result of the voice signal processed by the control unit 130 so that the user can perceive it visually or audibly.

The speech recognition models described below may recognize voice information input from a user by the same methods as the speech recognition model described with reference to FIG. 1.

FIGS. 2 and 3 are diagrams illustrating an example of a speech recognition apparatus according to an embodiment of the present invention.

Referring to FIGS. 2 and 3, the speech recognition apparatus may recognize voice data acquired from a user with a plurality of speech recognition models and, based on context information, select one of the results recognized by the plurality of speech recognition models to provide a speech recognition service.

Specifically, the speech recognition apparatus may generate voice data by converting voice information input from a user into an electrical signal and converting that electrical signal, which is an analog signal, into a digital signal.

The speech recognition apparatus may then recognize the voice data using each of the first speech recognition model 2010 and the second speech recognition model 2020.

Using a basic speech recognition model and a user speech recognition model, respectively, the speech recognition apparatus may obtain two speech recognition results (speech recognition result 1 (2030) and speech recognition result 2 (2040)) from the voice data into which the voice signal input by the user has been converted.

The speech recognition apparatus may apply the first speech recognition result and the second speech recognition result to a first specific determination procedure (for example, a first context-based appropriate speech recognition model determination technique) to select and output the more suitable speech recognition result 2050 of the two.

That is, the speech recognition apparatus may select, through the first specific determination procedure, whichever of the first and second speech recognition results is more suitable for the purpose of speech recognition, and may output the selected speech recognition result.

For example, when the context information extracted from the first and second speech recognition results relates to an address search, the apparatus may select whichever of the first and second speech recognition models is more suitable for address search and provide the speech recognition result of the selected model as the speech recognition service.

Hereinafter, the specific determination procedure is described with reference to FIG. 3.

FIG. 3 is a flowchart illustrating an example of the first specific determination procedure, which determines the appropriate speech recognition model based on the context of the first speech recognition result and the second speech recognition result.

As shown in FIG. 3, in the first specific determination procedure, when a first speech recognition result 3010 and a second speech recognition result 3020 have been generated by the first speech recognition model and the second speech recognition model, respectively, the speech recognition model more suitable for the purpose of speech recognition may be selected (3034) based on the context (3032) extracted from the first speech recognition result 3010 and the second speech recognition result 3020.

Thereafter, the speech recognition apparatus may select (3036) and output (3040) the speech recognition result generated by the selected speech recognition model.

For example, in FIG. 3, from the first speech recognition result, 'Tell me the address of Lee Gi-tong', and the second speech recognition result, 'Tell me the address of Lee Gil-dong', the speech recognition apparatus determines 'Tell me the address' to be the context information.

Specifically, the speech recognition apparatus may extract the context information 'Tell me the address' (3032) from 'Tell me the address of Lee Gi-tong' and 'Tell me the address of Lee Gil-dong'.

Thereafter, the speech recognition apparatus may compare the extracted context information with the characteristic of the first speech recognition model (the first characteristic) and the characteristic of the second speech recognition model (the second characteristic), and select the first speech recognition model as the model more suitable for the purpose of speech recognition (3034).

Thereafter, the speech recognition apparatus may select the first speech recognition result of the selected first speech recognition model (3036) and output the selected first speech recognition result, 'Tell me the address of Lee Gi-tong'.

In this case, the context information is not limited to a part of the sentence recognized from the voice data; any information that can be inferred from the recognition result or the like may also be used as context information.

For example, at least one item of user-related information, such as the user's location, the weather at the user's location, the user's habits, the context of the user's previous utterances, the user's career, the user's job title, the user's financial status, the current time, and the user's language, may be used as context information.
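The first specific determination procedure described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation disclosed in the patent: the `ModelProfile` class, the keyword sets standing in for each model's "characteristic", the rule that context is the run of trailing words shared by both results, and the keyword-overlap score are all hypothetical choices made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelProfile:
    """Hypothetical stand-in for one model's preset characteristic."""
    name: str
    keywords: frozenset  # words typical of the domain the model is tuned for


def extract_context(first: str, second: str) -> list:
    """Return the run of trailing words shared by both recognition results."""
    a, b = first.split(), second.split()
    shared = []
    while a and b and a[-1] == b[-1]:
        shared.insert(0, a.pop())
        b.pop()
    return shared


def score(context: list, profile: ModelProfile) -> int:
    """Count the context words that appear in the model's keyword set."""
    return sum(1 for word in context if word in profile.keywords)


def select_result(results: dict, profiles: list) -> str:
    """Select the result of the model whose characteristic best fits the context.

    `results` maps a model name to its recognized text; `profiles` holds
    exactly two ModelProfile objects (first model, second model).
    """
    first, second = (results[p.name] for p in profiles)
    context = extract_context(first, second)
    best = max(profiles, key=lambda p: score(context, p))  # ties keep the first
    return results[best.name]


# Hypothetical profiles: the first model is assumed to be tuned for address queries.
profiles = [
    ModelProfile("first_model", frozenset({"주소", "전화번호"})),
    ModelProfile("second_model", frozenset({"날씨", "기온"})),
]
results = {
    "first_model": "이기통 주소 좀 알려줘",   # 'Tell me Lee Gi-tong's address'
    "second_model": "이길동 주소 좀 알려줘",  # 'Tell me Lee Gil-dong's address'
}
print(extract_context(results["first_model"], results["second_model"]))
print(select_result(results, profiles))
```

With these assumed profiles, the shared context is 'tell me the address' and the first model is chosen, which mirrors the FIG. 3 example. User-related context such as location or time could be added to `context` before scoring, but how the patent weighs such information is not specified.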

FIGS. 4 and 5 are diagrams illustrating still another example of a speech recognition apparatus according to an embodiment of the present invention.

Referring to FIGS. 4 and 5, the speech recognition apparatus may recognize voice data obtained from a user with a plurality of speech recognition models, and provide a speech recognition service by selecting, based on context information, one of the results recognized by the plurality of speech recognition models.

Specifically, the speech recognition apparatus may generate a first speech recognition result 4020 by recognizing the voice information input by the user with a first speech recognition model 4010.

In this case, the first speech recognition model is a model used to extract context from the voice information obtained from the user and, depending on its use for the purpose of speech recognition, may be configured to use only a small amount of resources.

The speech recognition apparatus may use a second specific determination procedure (for example, a second context-based technique for determining the appropriate speech recognition model) to select (4030), from among a plurality of preset speech recognition models, the specific speech recognition model best suited to recognizing the voice information input by the user.

That is, the speech recognition apparatus may select a specific speech recognition model from among the plurality of speech recognition models based on the first speech recognition result, according to the purpose and use of speech recognition.

For example, when the context information extracted from the first speech recognition result relates to address search, the speech recognition model best suited to address search may be selected from among a plurality of candidate speech recognition models as the specific speech recognition model.

In this case, the second specific determination procedure includes extracting context information from the first speech recognition result and selecting the specific speech recognition model using the extracted context information.

Thereafter, the voice data into which the voice information input by the user was converted may be re-recognized using the selected specific speech recognition model to finally generate a speech recognition result 4040.

Hereinafter, the second specific determination procedure will be described with reference to FIG. 5.

FIG. 5 is a flowchart illustrating an example of the second specific determination procedure for determining an appropriate speech recognition model based on context from the first speech recognition result and the second speech recognition result.

Specifically, the speech recognition apparatus may generate (or receive as input) a first speech recognition result 5010, in which the user's voice information has been recognized by the first speech recognition model described with reference to FIG. 4. Based on the generated first speech recognition result, the apparatus may then select (5020), through the second specific determination procedure, the specific speech recognition model best suited to the purpose of speech recognition from among a plurality of (for example, N) speech recognition models, as described with reference to FIG. 4.

For example, 'tell me the address' may be extracted as context information from 'Tell me Lee Gil-dong's address', the first speech recognition result recognized by the first speech recognition model.

In this case, as described above, the first speech recognition model is a model used to extract context from the voice information obtained from the user and, depending on its use for the purpose of speech recognition, may be configured to use only a small amount of resources.

In addition to a part of the sentence recognized by the speech recognition model, any information that can be inferred from the recognition result or the like may be used as context information.

Thereafter, the speech recognition apparatus may use the extracted context information, 'tell me the address', to select from among the plurality of speech recognition models the specific speech recognition model best suited to the purpose of speech recognition, as described with reference to FIG. 4.

In this way, the speech recognition apparatus can extract context information using a speech recognition model dedicated to acquiring context information, and select the specific speech recognition model best suited to the purpose of speech recognition.
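The two-pass flow of FIGS. 4 and 5 can be sketched as follows. This is an illustrative sketch only: the `Recognizer` type, the cue-phrase table used to derive a context label, the `"general"` fallback entry, and the stub models are assumptions introduced here, since the patent does not specify how context is matched to a model.

```python
from typing import Callable, Dict

# A recognizer takes voice data and returns recognized text (assumed interface).
Recognizer = Callable[[bytes], str]


def context_from_first_pass(first_result: str, cues: Dict[str, str]) -> str:
    """Map the first-pass text to a context label using hypothetical cue phrases."""
    for cue, label in cues.items():
        if cue in first_result:
            return label
    return "general"


def recognize_two_pass(
    voice_data: bytes,
    first_model: Recognizer,
    candidates: Dict[str, Recognizer],
    cues: Dict[str, str],
) -> str:
    """Run a lightweight first pass, pick a model by context, then re-recognize.

    `candidates` must contain a "general" entry, used when no cue matches.
    """
    first_result = first_model(voice_data)                   # first recognition result
    context = context_from_first_pass(first_result, cues)    # extract context
    specific_model = candidates.get(context, candidates["general"])  # select model
    return specific_model(voice_data)                         # re-recognize the same data


# Stub models standing in for real recognizers (hypothetical outputs).
def light_model(voice_data: bytes) -> str:
    return "이길동 주소 좀 알려줘"  # 'Tell me Lee Gil-dong's address'


def address_model(voice_data: bytes) -> str:
    return "[address model] 이길동 주소 좀 알려줘"


def general_model(voice_data: bytes) -> str:
    return "[general model] 이길동 주소 좀 알려줘"


final = recognize_two_pass(
    b"\x00\x00",
    light_model,
    {"address": address_model, "general": general_model},
    {"주소": "address"},  # cue 'address' maps to the address-search context
)
print(final)
```

Because the first pass only has to surface enough text to identify the context, it can be a small model, while the more expensive domain-specific model runs once on the same voice data.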

FIG. 6 is a diagram illustrating still another example of a speech recognition apparatus according to an embodiment of the present invention.

Referring to FIG. 6, the speech recognition apparatus may set context information for speech recognition and thereby select a specific speech recognition model from among a plurality of speech recognition models in advance, and may provide a speech recognition service using the speech recognition result recognized by the selected model.

Specifically, the speech recognition apparatus may select (6010), according to preset context information, the specific speech recognition model judged to be best suited to speech recognition from among the plurality of speech recognition models.

For example, when the purpose and use of the speech recognition service is address search, the speech recognition apparatus may select, as the specific speech recognition model, a speech recognition model preset for address search from among the plurality of speech recognition models.

Thereafter, a speech recognition result 6020 may be generated by recognizing the voice data obtained from the user with the selected specific speech recognition model.

In this case, the voice data may refer to data obtained by converting the voice information obtained from the user into an electrical signal and then converting that electrical signal, which is an analog signal, into a digital signal.

FIG. 7 is a flowchart illustrating an example of a speech recognition method according to an embodiment of the present invention.

Referring to FIG. 7, as described with reference to FIGS. 2 and 3, the speech recognition apparatus may generate speech recognition results through a plurality of speech recognition devices and provide a speech recognition service by selecting the most suitable of the generated speech recognition results.

Specifically, the speech recognition apparatus may obtain voice information from a user and convert the obtained voice information into voice data (S7010).

For example, the speech recognition apparatus may convert the voice information obtained from the user into an electrical signal, and convert that electrical signal, which is an analog signal, into voice data, which is a digital signal.
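The analog-to-digital step can be illustrated with a short Python sketch. The patent does not state a sample rate, bit depth, or encoding, so the 16 kHz rate, the 16-bit little-endian PCM format, and the `to_voice_data` function name are assumptions chosen only for the example; the "analog" signal is modeled as a function of time returning an amplitude in [-1.0, 1.0].

```python
import math
import struct
from typing import Callable


def to_voice_data(
    analog: Callable[[float], float],
    num_samples: int,
    sample_rate: int = 16000,  # assumed rate; not specified in the source
) -> bytes:
    """Sample a continuous signal and quantize it to 16-bit little-endian PCM."""
    samples = []
    for i in range(num_samples):
        value = analog(i / sample_rate)        # sample the signal at time t = i / rate
        value = max(-1.0, min(1.0, value))     # clip to the representable range
        samples.append(int(round(value * 32767)))  # quantize to a signed 16-bit integer
    return struct.pack(f"<{num_samples}h", *samples)


# A 440 Hz tone stands in for the electrical signal from a microphone.
def tone(t: float) -> float:
    return 0.5 * math.sin(2 * math.pi * 440 * t)


voice_data = to_voice_data(tone, num_samples=8000)  # 0.5 s at the assumed 16 kHz
print(len(voice_data))
```

Each sample occupies two bytes, so the resulting byte string is twice as long as the number of samples. A recognizer would receive this byte string as its voice data.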

Thereafter, the speech recognition apparatus may recognize the voice data with a first speech recognition model and a second speech recognition model, respectively, to generate a first speech recognition result and a second speech recognition result (S7020, S7030).

Thereafter, through the first specific determination procedure described with reference to FIGS. 2 and 3, the speech recognition apparatus may select, from the first speech recognition result and the second speech recognition result, the result better suited to the purpose of speech recognition, and provide a speech recognition service (S7040).

For example, the speech recognition apparatus may extract context information from the first speech recognition result and the second speech recognition result, and compare the extracted context information with a preset first characteristic of the first speech recognition model and a preset second characteristic of the second speech recognition model, respectively.

Thereafter, based on the comparison result, the speech recognition apparatus may select, from the first speech recognition model and the second speech recognition model, the speech recognition model suited to the purpose and/or use of speech recognition.

If, for example, the second speech recognition model is selected, the speech recognition apparatus may then select, as the speech recognition result, the second speech recognition result generated by the selected second speech recognition model, and provide a speech recognition service based on the selected second speech recognition result.

FIG. 8 is a flowchart illustrating another example of a speech recognition method according to an embodiment of the present invention.

Referring to FIG. 8, context information may be extracted from the voice data, and a speech recognition service may be provided based on the extracted context information.

First, step S8010 is the same as step S7010 of FIG. 7, so its description is omitted.

Thereafter, the speech recognition apparatus generates a first speech recognition result by recognizing the voice data with the first speech recognition model (S8020).

In this case, as described with reference to FIG. 4, the first speech recognition model is a model used to extract context from the voice information obtained from the user and, depending on its use for the purpose of speech recognition, may be configured to use only a small amount of resources.

The speech recognition apparatus may extract context information from the first speech recognition result (S8030).

The context information may refer not only to a part of the sentence recognized by the speech recognition model but also to any information that can be inferred from the recognition result or the like.

Thereafter, using the second specific determination procedure described with reference to FIGS. 4 and 5, the speech recognition apparatus may select, from among a plurality of preset speech recognition models, the specific speech recognition model best suited to recognizing the voice information input by the user (S8040).

For example, when the context information extracted from the first speech recognition result relates to address search, the speech recognition model best suited to address search may be selected from among a plurality of candidate speech recognition models as the specific speech recognition model.

Thereafter, the voice data into which the voice information input by the user was converted may be re-recognized using the selected specific speech recognition model to finally generate a speech recognition result (S8040).

Thereafter, the speech recognition apparatus may provide a speech recognition service based on the speech recognition result obtained by recognizing the voice data with the specific speech recognition model.

FIG. 9 is a flowchart illustrating still another example of a speech recognition method according to an embodiment of the present invention.

Referring to FIG. 9, before receiving voice information from a user, the speech recognition apparatus may select a specific speech recognition model from among a plurality of speech recognition models based on context information, and may recognize the voice information input by the user with the selected model.

Specifically, the speech recognition apparatus may set context information for speech recognition in advance.

Thereafter, based on the context information, the speech recognition apparatus selects a specific speech recognition model from among the plurality of speech recognition models according to the purpose or use of speech recognition (S9020).

For example, in the case of address search, the speech recognition apparatus may select, as the specific speech recognition model, a speech recognition model preset for address search from among the plurality of speech recognition models.

Thereafter, when voice information is obtained from the user, the speech recognition apparatus may convert the obtained voice information into voice data (S9010).

Thereafter, the speech recognition apparatus may generate a speech recognition result by recognizing the voice data with the selected specific speech recognition model (S9050).
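The preset-context flow of FIG. 9 differs from the earlier flows in that the model is fixed before any voice arrives. A minimal sketch, assuming a hypothetical `PresetContextRecognizer` class, a dictionary of models keyed by context label, and stub recognizers in place of real ones:

```python
from typing import Callable, Dict

# A recognizer takes voice data and returns recognized text (assumed interface).
Recognizer = Callable[[bytes], str]


class PresetContextRecognizer:
    """Selects one model from preset context at construction time."""

    def __init__(self, models: Dict[str, Recognizer], preset_context: str) -> None:
        # Model selection happens here, before any voice information is received.
        if preset_context not in models:
            raise ValueError(f"no model registered for context {preset_context!r}")
        self.model_name = preset_context
        self._model = models[preset_context]

    def recognize(self, voice_data: bytes) -> str:
        # Every utterance is recognized with the model chosen up front.
        return self._model(voice_data)


# Stub models with hypothetical outputs.
models = {
    "address_search": lambda voice_data: "address-search model output",
    "dictation": lambda voice_data: "dictation model output",
}

service = PresetContextRecognizer(models, preset_context="address_search")
print(service.recognize(b"\x00\x00"))
```

Choosing the model in the constructor keeps per-utterance latency to a single recognition pass, at the cost of not adapting when a user's request falls outside the preset context.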

Embodiments of the present invention may be implemented by various means, for example, hardware, firmware, software, or a combination thereof. When implemented in hardware, an embodiment of the present invention may be implemented by one or more application specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field programmable gate arrays (FPGAs), processors, controllers, microcontrollers, microprocessors, and the like.

When implemented in firmware or software, an embodiment of the present invention may be implemented in the form of a module, procedure, function, or the like that performs the functions or operations described above. Software code may be stored in a memory and executed by a processor. The memory may be located inside or outside the processor and may exchange data with the processor by various known means.

It will be apparent to those skilled in the art that the present invention may be embodied in other specific forms without departing from its essential features. Accordingly, the above detailed description should not be construed as restrictive in any respect and should be considered illustrative. The scope of the present invention should be determined by reasonable interpretation of the appended claims, and all changes within the equivalent scope of the present invention are included in the scope of the present invention.

The present invention can be applied to various fields of speech recognition technology, and can provide a method for selecting the optimal speech recognition model based on context.

Owing to this feature, a service that uses multiple speech recognition models with different strengths in different domains can derive the best speech recognition result when an unspecified voice input is received.

This feature can be applied not only to speech recognition but also to other artificial intelligence services.

Claims

A method of recognizing speech, the method comprising:
obtaining voice information from a user;
converting the obtained voice information into voice data;
generating a first speech recognition result by recognizing the converted voice data with a first speech recognition model;
generating a second speech recognition result by recognizing the converted voice data with a second speech recognition model; and
selecting a specific speech recognition result from among the first speech recognition result and the second speech recognition result through a specific determination procedure.

The method of claim 1, wherein the specific determination procedure comprises:
extracting context information from the first speech recognition result and the second speech recognition result;
comparing the context information with a preset first characteristic of the first speech recognition model and a preset second characteristic of the second speech recognition model, respectively; and
selecting one of the first speech recognition result and the second speech recognition result based on the comparison result.

The method of claim 2,
wherein the context information includes at least one of a part of the voice information, information obtainable from the first speech recognition result and the second speech recognition result, or information related to the user who uttered the voice.

The method of claim 1,
wherein the first speech recognition model and the second speech recognition model are each one of a plurality of speech recognition models for recognizing the voice information obtained from the user.

The method of claim 1,
further comprising generating a plurality of speech recognition results by recognizing the converted voice data with the plurality of speech recognition models,
wherein the specific speech recognition result is selected from among the first speech recognition result, the second speech recognition result, and the plurality of speech recognition results.

The method of claim 1,
wherein the specific determination procedure is a procedure for determining a speech recognition result based on a context included in context information.

A method of recognizing speech, the method comprising:
obtaining voice information from a user;
converting the obtained voice information into voice data;
generating a first speech recognition result by recognizing the voice data with a first speech recognition model;
selecting, based on the first speech recognition result, a second speech recognition model for recognizing the voice data from among a plurality of speech recognition models; and
generating a second speech recognition result by recognizing the voice data with the second speech recognition model.

The method of claim 7, further comprising:
extracting context information from the first speech recognition result; and
comparing the context information with preset characteristics of the plurality of speech recognition models,
wherein the second speech recognition model is selected based on the comparison result.

The method of claim 8,
wherein the first speech recognition model is a speech recognition model for extracting the context information.

A method of recognizing speech, the method comprising:
obtaining voice information from a user;
converting the obtained voice information into voice data; and
generating a speech recognition result by recognizing the voice data with a specific speech recognition model selected from among a plurality of speech recognition models.

The method of claim 10, further comprising:
setting context information for speech recognition; and
selecting, from among the plurality of speech recognition models, the specific speech recognition model whose characteristics are most suitable for the context information.