KR20210019930A

KR20210019930A - Server supporting speech recognition of device and operating method thereof

Info

Publication number: KR20210019930A
Application number: KR1020200018574A
Authority: KR
Inventors: 김찬우; 김은향; 이경민; 다난자야 엔. 고다; 김광윤
Original assignee: 삼성전자주식회사
Priority date: 2019-08-13
Filing date: 2020-02-14
Publication date: 2021-02-23

Abstract

A server according to one embodiment of the present disclosure comprises: a memory storing one or more instructions; a processor executing the one or more instructions stored in the memory; and a communication unit receiving a first character string from a device. The processor may identify a plurality of estimated character strings from the first character string, obtain a second character string based on the plurality of estimated character strings, and control the communication unit to transmit the second character string to the device. The first character string is output through speech recognition processing from a voice signal input to the device.

Description

Server supporting speech recognition of device and operating method thereof

The present disclosure relates to a server that supports speech recognition of a device and to a method of operating the server. More specifically, it relates to a method of enhancing a speech recognition result through post-processing on the server side.

As electronic devices that perform a variety of functions in combination have been developed, devices equipped with a speech recognition function have been released to improve operability. Speech recognition has the advantage that a device can be controlled easily by recognizing the user's voice, without operating a separate button or touching a touch module.

With such a speech recognition function, a portable terminal such as a smartphone, or a home appliance such as a TV or a refrigerator, can place a call or compose a text message without any button press, and various functions such as route finding, Internet search, and alarm setting can be set up easily.

Recently, as artificial intelligence (AI) technology has advanced, it has also been applied to speech recognition, enabling fast and accurate recognition of a wide range of utterances. An AI system is a computer system that implements human-level intelligence; unlike an existing rule-based smart system, the machine learns and makes judgments by itself and becomes smarter. The more an AI system is used, the higher its recognition rate becomes and the more accurately it understands a user's preferences, so existing rule-based smart systems are gradually being replaced by deep-learning-based AI systems.

On-device speech recognition, in which automatic speech recognition (ASR) is performed on the device side, has the advantages of low latency and of remaining available even when no network connection exists. Server-based speech recognition, on the other hand, has the advantage that recognition is performed based on a large number of named entities stored in the server.

To exploit the advantages of both on-device and server-based speech recognition, a method by which a device selectively uses the two approaches is needed.

According to an embodiment of the present disclosure, a server includes: a memory storing one or more instructions; a processor executing the one or more instructions stored in the memory; and a communication unit receiving a first character string from a device. The processor identifies a plurality of estimated character strings from the first character string, obtains a second character string based on the plurality of estimated character strings, and controls the communication unit to transmit the second character string to the device. The first character string is output through speech recognition processing from a voice signal input to the device.

According to an embodiment of the present disclosure, a method of operating a server includes: receiving a first character string from a device; identifying a plurality of estimated character strings from the first character string; obtaining a second character string based on the plurality of estimated character strings; and transmitting the second character string to the device. The first character string is output through speech recognition processing from a voice signal input to the device.

According to an embodiment of the present disclosure, a device includes: a memory storing one or more instructions; a processor executing the one or more instructions stored in the memory; and a communication unit communicating with a server. The processor performs speech recognition on a voice signal to obtain a first character string, determines whether to replace the first character string with another character string, controls the communication unit to transmit the first character string to the server based on the determination, and controls the communication unit to receive from the server a second character string obtained by the server replacing at least one character in the first character string with another character.

According to an embodiment of the present disclosure, a method of operating a device includes: obtaining a first character string by performing speech recognition on a voice signal; determining whether to replace the first character string with another character string; transmitting the first character string to a server based on the determination; and receiving a second character string from the server. The second character string is obtained by the server replacing at least one character in the first character string with another character.

FIG. 1 is a diagram comparing on-device speech recognition with server-based speech recognition.
FIG. 2A illustrates a speech recognition system according to an embodiment.
FIG. 2B illustrates a speech recognition system according to an embodiment.
FIG. 2C illustrates a speech recognition system according to an embodiment.
FIG. 3 is a block diagram of a device according to an embodiment.
FIG. 4A is a detailed block diagram of a device according to an embodiment.
FIG. 4B is a detailed block diagram of a device according to an embodiment.
FIG. 5A is a diagram for describing a method by which a device determines to perform on-device speech recognition, according to an embodiment.
FIG. 5B is a diagram for describing a method by which a device determines to perform server-based speech recognition, according to an embodiment.
FIG. 6 is a diagram for describing a frame-synchronized character string according to an embodiment.
FIG. 7 is a block diagram of a server according to an embodiment.
FIG. 8A is a diagram for describing a method by which a server supports speech recognition of a device, according to an embodiment.
FIG. 8B is a diagram for describing a method by which a server determines a replacement character string by obtaining a likelihood for each character corresponding to each voice signal frame, according to an embodiment.
FIG. 9 is a detailed block diagram of a server according to an embodiment.
FIG. 10A illustrates the structure of a recurrent neural network (RNN) used to calculate a posterior probability according to an embodiment.
FIG. 10B illustrates an example of a confusion matrix used to calculate a likelihood according to an embodiment.
FIG. 11A is a diagram for describing a process in which a server calculates, for each character in a first character string received from a device, a likelihood matrix over the replacement characters by which that character may be replaced, according to an embodiment.
FIG. 11B is a diagram for describing a process in which a server calculates, for each character in a first character string received from a device, a likelihood matrix over the replacement characters by which that character may be replaced, according to an embodiment.
FIG. 12 is a block diagram of a device that selectively uses two speech recognition modules according to an embodiment.
FIG. 13 is a flowchart of a method by which a device performs speech recognition according to an embodiment.
FIG. 14 is a flowchart of a detailed method by which a device performs speech recognition according to an embodiment.
FIG. 15 is a flowchart of a method of operating a server according to an embodiment.
FIG. 16 is a flowchart of a detailed method of operating a server according to an embodiment.
FIG. 17 is a diagram for describing weighted finite state transducer (WFST) decoding performed in a server according to an embodiment.
FIG. 18 illustrates an example of a screen on which a speech recognition result is displayed on a device according to an embodiment.
FIG. 19 is a detailed block diagram of a device according to an embodiment.

The terms used in the embodiments of the present disclosure are, as far as possible, general terms in wide current use, selected in consideration of their functions in the present disclosure; they may nevertheless vary with the intent of those skilled in the art, precedent, the emergence of new technology, and the like. In certain cases a term has been chosen arbitrarily by the applicant, in which case its meaning is described in detail in the corresponding part of the description. Terms used in the present disclosure should therefore be defined by their meaning and by the overall content of the disclosure, not simply by their names.

Throughout the specification, when a part is said to "include" a component, this means that it may further include other components rather than excluding them, unless specifically stated otherwise. Terms such as "... unit" and "... module" denote a unit that processes at least one function or operation, which may be implemented in hardware, in software, or in a combination of hardware and software.

In the present disclosure, a "character" is a symbol used to write human language in a visible form. Characters may include, for example, Hangul, letters of the alphabet, Chinese characters, numerals, phonetic symbols, punctuation marks, and other symbols.

In the present disclosure, a "character string" is a sequence of characters.

In the present disclosure, a "grapheme" is the smallest unit that represents a sound and consists of at least one character. In an alphabetic writing system, for example, a single character may be a grapheme. Accordingly, a character in the present disclosure may refer to a grapheme, and a character string may refer to a sequence of graphemes. A character string may also be referred to as text.

A "morpheme" is the smallest unit that carries meaning and consists of at least one grapheme. A "word" is the smallest basic unit of language that can be used independently or that expresses a grammatical function, and consists of at least one morpheme. A "phoneme" is a unit of sound that distinguishes one word from another in human language.

A speech recognition model according to an embodiment of the present disclosure may convert a voice signal into a character string and output it. The character string output by the model may be a "frame-synchronized character string". A "frame" may refer to a unit into which a voice signal is divided at regular time intervals for processing, or to the divided voice signal itself. A "frame-synchronized character string" is a character string, output when the speech recognition model converts a voice signal, that contains characters individually corresponding to each frame of the voice signal.

For example, the speech recognition model may receive a voice signal in which a user pronounces "baseball" and output the frame-synchronized character string [b, b, a, , a, a, s, s, e, , b, b, a, a, l].
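The text does not say how a frame-synchronized character string is reduced to an ordinary word. As a minimal sketch, assuming a CTC-style rule (merge consecutive repeats, then drop blank frames), the reduction could look like the following. The rule, the `BLANK` marker, the function name, and the frame sequence are all assumptions made for illustration and are not taken from the disclosure.

```python
from itertools import groupby

BLANK = ""  # assumed marker for a frame that carries no character


def collapse_frames(frames):
    """Merge runs of identical per-frame characters, then drop blank frames."""
    merged = [ch for ch, _ in groupby(frames)]
    return "".join(ch for ch in merged if ch != BLANK)


# Hypothetical per-frame output for an utterance of "hello".
# The blank between the two "l" runs keeps the double letter from merging.
frames = ["h", "h", "e", "", "l", "l", "", "l", "o", "o"]
word = collapse_frames(frames)
```

Under this assumed rule a blank frame is what separates a genuine double letter from a single letter held across several frames.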

In the present disclosure, when a speech recognition model outputs a predetermined character string from a voice signal, the "reliability of the predetermined character string" indicates how accurately the model that output the string is performing speech recognition. For example, the reliability may be calculated by a predetermined equation based on the likelihood obtained from the character string, a partial likelihood output in the course of estimating the character string, a posterior probability value, or the like. The higher the reliability, the more accurately the character string can be judged to have been estimated by the speech recognition model.
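The disclosure only states that the reliability follows a predetermined equation. As a rough sketch of one possibility, the reliability could be taken as the geometric mean of per-character posterior probabilities and compared with a threshold to decide whether server-side correction is worthwhile. The formula, the 0.8 threshold, and both function names are assumptions for illustration.

```python
import math


def string_reliability(char_posteriors):
    """Geometric mean of per-character posteriors (an assumed formula).

    Each probability must be greater than zero.
    """
    if not char_posteriors:
        raise ValueError("need at least one character posterior")
    log_sum = sum(math.log(p) for p in char_posteriors)
    return math.exp(log_sum / len(char_posteriors))


def needs_server_correction(char_posteriors, threshold=0.8):
    """Assumed policy: defer to the server when reliability is below threshold."""
    return string_reliability(char_posteriors) < threshold
```

A geometric mean is used here so that a single poorly recognized character lowers the score noticeably, regardless of string length.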

In the present disclosure, "evaluation information for a predetermined character string" may mean information about the predetermined character string that a server according to an embodiment uses to recommend and output another character string with a higher reliability than the predetermined character string. For example, the evaluation information may include the likelihoods of a plurality of estimated character strings obtained from the predetermined character string. The server may, for example, select and output the character string that maximizes the likelihood from among the plurality of estimated character strings.

In the present disclosure, "likelihood" means a degree of possibility. The "likelihood of event B with respect to event A" may mean the conditional probability P(B|A), which represents the probability that event B occurs given that event A has occurred.

In the present disclosure, when a speech recognition model outputs a predetermined character string from a voice signal, the "likelihood obtained from the predetermined character string" means the likelihood of a plurality of estimated character strings estimated from the predetermined character string. The plurality of estimated character strings may be character strings obtained by replacing at least one character in the predetermined character string with another character.

More specifically, if the character string output when speech recognition is performed correctly is called the ground truth character string, the "likelihood obtained from a predetermined character string" may mean, for each of the plurality of estimated character strings, the probability that the predetermined character string is estimated as the speech recognition result on the assumption that the estimated character string is the ground truth. According to an embodiment of the present disclosure, the "likelihood obtained from a predetermined character string" may include, for each character in the predetermined character string, a likelihood matrix over the replacement characters by which that character may be replaced.

According to an embodiment of the present disclosure, the "likelihood obtained from a predetermined character string" may be used to identify replacement characters whose pronunciation is similar to each character in the predetermined character string and, based on the identified replacement characters, to determine estimated character strings in which at least one character of the predetermined character string has been corrected to another character. Furthermore, the most suitable of the determined estimated character strings may be selected based on pre-stored information such as a language model and dictionary information, and recommended in place of the predetermined character string.
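The two steps described above (enumerate candidates from similar-sounding replacement characters, then pick the best one using stored language information) can be sketched as follows. The `SIMILAR` table, its values, the tiny `DICTIONARY`, and the scoring by a plain product of per-character likelihoods are hypothetical stand-ins; the disclosure does not specify them, and a real system would use a language model rather than a word set.

```python
from itertools import product

# Hypothetical table: recognized character -> {replacement: likelihood}.
SIMILAR = {
    "k": {"k": 0.6, "c": 0.4},
    "a": {"a": 0.9, "e": 0.1},
    "t": {"t": 0.8, "d": 0.2},
}

# Hypothetical dictionary standing in for the pre-stored language information.
DICTIONARY = {"cat", "cad", "bat"}


def candidate_strings(first_string):
    """Yield (candidate, likelihood) for every combination of replacements."""
    options = [SIMILAR.get(ch, {ch: 1.0}).items() for ch in first_string]
    for combo in product(*options):
        text = "".join(ch for ch, _ in combo)
        score = 1.0
        for _, p in combo:
            score *= p
        yield text, score


def correct(first_string):
    """Return the most likely candidate found in the dictionary, else the input."""
    in_dict = [(t, s) for t, s in candidate_strings(first_string) if t in DICTIONARY]
    if not in_dict:
        return first_string
    return max(in_dict, key=lambda pair: pair[1])[0]
```

For instance, `correct("kat")` considers eight candidates and returns `"cat"`, the dictionary word with the highest product of likelihoods. Exhaustive enumeration grows exponentially with string length, so it is only practical for a short illustration like this one.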

When a speech recognition model performs speech recognition, an earlier recognition result may affect later recognition results. If a character is misrecognized as another character with a similar pronunciation, the linguistic information is distorted, and the probability that the following characters are also misrecognized may increase. In other words, when a character is misrecognized as another character, the words determined by combining the correct character with the characters that follow may differ from the words determined by combining the misrecognized character with the characters that follow.

Accordingly, a device or server according to an embodiment of the present disclosure may use the likelihood obtained from a predetermined character string in order to obtain a replacement character string by decoding with both the pronunciation information and the language information of the predetermined character string taken into account.

In the present disclosure, a "likelihood matrix obtained for a predetermined character" may mean a matrix containing likelihood values for the replacement characters by which the predetermined character may be replaced. The "likelihood value for a replacement character" may mean the probability that the predetermined character is estimated as the speech recognition result on the assumption that the replacement character is the ground truth character. For example, for the character "a" in a character string obtained as a speech recognition result, a likelihood matrix [0.4 0.01 0.01 0.01 0.2 ... 0.01] may be obtained, containing a probability value for the ground truth character being "a", a probability value for "b", a probability value for "c", ..., and a probability value for "z". When a likelihood matrix is obtained for each character in a character string, high likelihood values may be assigned to replacement characters whose pronunciation is similar to that character.

In the present disclosure, the "likelihood obtained from a predetermined character string" may be obtained from the likelihood values for the replacement characters corresponding to each character in the predetermined character string. These likelihood values may be calculated with the characters accumulated before each character taken into account. The present disclosure is not limited to this, however, and the likelihood values may instead be calculated from each character alone, without considering the characters accumulated before it.

According to an embodiment of the present disclosure, when the characters accumulated before each character in a predetermined character string are taken into account, the "likelihood obtained from the predetermined character string" may be calculated from the "posterior probabilities" of each character in the predetermined character string and the "character sequence probability" of the predetermined character string.

The "posterior probability" of event A is the conditional probability with which event A is expected when events, observations, or background knowledge related to event A are taken into account.

In the present disclosure, when a speech recognition model outputs a character string from a voice signal, the "posterior probabilities of a predetermined character in the character string" may include, given the characters that precede the predetermined character in the string, the probability that the model predicted the predetermined character correctly and the probability that the model wrongly predicted another character as the predetermined character.

In the present disclosure, when a speech recognition model outputs a character string from a voice signal, the "character sequence probability of the character string" may mean the probability that characters are arranged in the order given by that character string.
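One way to read the relationship between the likelihood, the posterior probabilities, and the character sequence probability is through Bayes' rule. The notation below is an assumption introduced for illustration and does not appear in the source: \hat{y} is the character string output by the recognizer and y is a candidate ground truth character string.

```latex
\[
P(\hat{y} \mid y)
  \;=\; \frac{P(y \mid \hat{y})\, P(\hat{y})}{P(y)}
  \;\propto\; \frac{P(y \mid \hat{y})}{P(y)}
\]
```

Under this reading, P(y | \hat{y}) plays the role of the posterior, P(y) is the character sequence probability of the candidate, and P(\hat{y}) is the same for every candidate, so ranking candidates by likelihood only needs the ratio on the right. The disclosure may use a different formula.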

According to an embodiment of the present disclosure, when only each character itself in a predetermined character string is considered, the "likelihood obtained from the predetermined character string" may be calculated from a "confusion matrix" containing the probabilities that each character was predicted incorrectly. A confusion matrix, also called an error matrix, contains, for a speech recognition model that converts a voice signal into a predetermined character string, the probability that the model predicted a character in the string correctly and the probability that it wrongly predicted another character as that character. For example, characters whose pronunciation is similar to a predetermined character may be given a high probability of having been wrongly predicted as the predetermined character by the model.
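As a minimal sketch of this context-free case, a confusion matrix can be estimated from counts of how often each true character was recognized as each character, and the likelihood values for one recognized character can then be read off as a column. The counts, the nested-dictionary layout, and the function names are hypothetical; the disclosure does not describe how the matrix is built.

```python
def confusion_probabilities(counts):
    """Row-normalise counts[truth][recognised] into P(recognised | truth).

    Assumes every row contains at least one non-zero count.
    """
    probs = {}
    for truth, row in counts.items():
        total = sum(row.values())
        probs[truth] = {rec: n / total for rec, n in row.items()}
    return probs


def likelihood_vector(probs, recognised):
    """For one recognised character, return P(recognised | truth) per truth."""
    return {truth: row.get(recognised, 0.0) for truth, row in probs.items()}


# Hypothetical counts: true character -> {recognised character: occurrences}.
counts = {
    "a": {"a": 80, "e": 20},
    "e": {"a": 30, "e": 70},
    "b": {"b": 100},
}
probs = confusion_probabilities(counts)
vec = likelihood_vector(probs, "a")
```

Here `vec` gives 0.8 for a true "a", 0.3 for a true "e", and 0.0 for a true "b", so the similar-sounding "e" receives a noticeably higher value than the dissimilar "b".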

In the present disclosure, an "acoustic model" may mean a model containing information with which it can be determined, grapheme by grapheme, which character or phonetic symbol a voice signal matches. For example, a device according to an embodiment of the present disclosure may calculate, based on the acoustic model, the probability that a voice signal matches each character.

In the present disclosure, "dictionary information" may include mapping information between a plurality of words and the characters contained in each of those words. A "language model" may be an artificial intelligence model that has learned the relationships between words so that, given a particular word sequence, it can estimate the probabilities of the words that come next.
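To make the idea of estimating the probability of the next word concrete, the sketch below uses a count-based bigram model. This is a deliberately simple stand-in: the disclosure describes the language model as an artificial intelligence model, and the corpus and function names here are hypothetical.

```python
from collections import Counter, defaultdict


def train_bigram(sentences):
    """Count adjacent word pairs so that P(next | previous) can be estimated."""
    pair_counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for prev, nxt in zip(words, words[1:]):
            pair_counts[prev][nxt] += 1
    return pair_counts


def next_word_probability(pair_counts, prev, nxt):
    """Maximum-likelihood estimate of P(nxt | prev); 0.0 if prev is unseen."""
    total = sum(pair_counts[prev].values()) if prev in pair_counts else 0
    return pair_counts[prev][nxt] / total if total else 0.0


# Hypothetical training sentences.
corpus = ["play baseball game", "play baseball now", "play music"]
model = train_bigram(corpus)
p = next_word_probability(model, "play", "baseball")
```

With this corpus, "baseball" follows "play" in two of three cases, so `p` is 2/3, which is the kind of score a decoder could use to prefer one candidate word over another.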

In the present disclosure, "artificial neural network" is a generic term for computing systems modeled on the neural networks of human or animal brains. An artificial neural network is one of the methodologies of machine learning and takes the form of a network in which many neurons, the counterparts of nerve cells, are connected. It may be implemented in hardware but is mainly implemented in computer software. In an artificial neural network, neurons, which are the basic computing units, are connected by weighted links whose weights can be adjusted to adapt to a given environment.

"Artificial neural network" is a generic term covering various models such as the self-organizing map (SOM), the recurrent neural network (RNN), and the convolutional neural network (CNN), and there are dozens of types.

In the present disclosure, a "domain" refers to a set of words related to a certain attribute; such a set is called the domain of that attribute.

In the present disclosure, the "operation of correcting the first character string" may refer to an operation of recommending and outputting a second character string that has higher reliability than the first character string, by replacing at least one character included in the first character string with another character. Accordingly, in the present disclosure, the expressions 'correcting a character string', 'correcting a character', 'replacing a given character with another character', 'recommending another character in place of a given character', 'replacing a given character string with another character string', and 'recommending another character string in place of a given character string' may be used interchangeably.

A device or server included in the speech recognition system according to an embodiment of the present disclosure may provide a "voice assistant service" to a user. The voice assistant service may be a service that provides a conversation with the user. The voice assistant service may provide a response message to the user as if a person were conversing directly with the user, taking into account the user's situation, the device's situation, and the like. In addition, like a personal assistant, the voice assistant service may appropriately generate information that the user needs and provide it to the user. The voice assistant service may provide information or functions that the user needs in connection with various services such as, for example, a broadcasting service, a content sharing service, a content providing service, a power management service, a game providing service, a chat service, a document writing service, a search service, a call service, a photo capturing service, a transportation recommendation service, and a video playback service.

Hereinafter, embodiments of the present disclosure are described in detail with reference to the accompanying drawings so that those of ordinary skill in the art to which the present disclosure pertains can readily practice them. However, the present disclosure may be implemented in many different forms and is not limited to the embodiments described herein.

Hereinafter, embodiments of the present disclosure are described in detail with reference to the drawings.

FIG. 1 is a diagram for comparing on-device speech recognition with server-based speech recognition.

On-device speech recognition means that speech recognition of an utterance of the user 10 is performed on the device 100 side, and server-based speech recognition means that speech recognition of an utterance of the user 10 received by the device 100 is performed on the server 200 side.

With advances in end-to-end speech recognition and compression technology, on-device speech recognition has steadily improved, and its performance gap with server-based speech recognition is narrowing. In particular, for speech recognition of open-domain utterances, which are not limited to a specific field, and for general dictation, there is now almost no performance difference between the device and the server. General dictation means dictation of utterances that do not belong to a domain centered on named entities. Named entities may include specific place names, personal names, device names, brand names, and the like. A domain is a set of words related to a certain attribute, referred to as the domain of that attribute.

On-device speech recognition has a latency of less than about 50 ms, which is shorter than the latency of several hundred ms of server-based speech recognition, and it can also be used where no network connection is available, such as in the suburbs, inside an airplane, or in a radio shadow area. In addition, on-device speech recognition is more favorable with respect to security and privacy issues and can reduce the cost of managing servers.

Server-based speech recognition, on the other hand, has the advantage of being implemented on a server, which can store more named entities, such as place names, personal names, and brand names, than a device can.

Accordingly, server-based speech recognition can assign higher weights to words related to new buzzwords, newly released song titles, and the like, and, when a word is not recognized, it allows a hot-fix that repairs the recognition defect by adding the word to the dictionary. In addition, it allows rescoring of speech recognition results by using a language model, dictionary information, and the like that are optimized for a third-party application running on the server.
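The effect of a hot-fix followed by rescoring can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of the disclosure: the per-word log-probability tables, the unknown-word penalty, the interpolation weight, and the n-best list with its device scores are all hypothetical.

```python
UNKNOWN_LOGPROB = -10.0  # hypothetical penalty for words missing from the table


def lm_logprob(text, word_logprob):
    """Sum per-word log-probabilities under a toy unigram table."""
    return sum(word_logprob.get(word, UNKNOWN_LOGPROB) for word in text.split())


def rescore(hypotheses, word_logprob, lm_weight=0.5):
    """Return the hypothesis text with the best combined score.

    hypotheses: list of (text, device_logprob) pairs.
    """
    def combined(hypothesis):
        text, device_logprob = hypothesis
        return device_logprob + lm_weight * lm_logprob(text, word_logprob)

    return max(hypotheses, key=combined)[0]


# Hypothetical server-side table before and after adding a new song title.
base_table = {"play": -1.0, "dinner": -3.0, "mite": -6.0}
hot_fixed_table = {**base_table, "dynamite": -2.0}

n_best = [("play dinner mite", -2.0), ("play dynamite", -2.3)]
print(rescore(n_best, base_table))
print(rescore(n_best, hot_fixed_table))
```

Before the new word is added, the unknown-word penalty makes the hypothesis containing it lose; once the word is in the table, the same n-best list is rescored in its favor.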

Therefore, a hybrid speech recognition scheme is called for, in which general-purpose speech recognition such as dictation, general commands, and caption generation is performed on the device, whereas speech recognition that must use a language model, dictionary information, and the like corresponding to a specific domain is performed on the server.

In this case, if the device and the server split the overall speech recognition process between them, a dependency may arise between the device and the server.

As one example, a method may be used in which the computation that applies the acoustic model to an utterance is performed on the device, and the decoding computation that applies the language model and dictionary information to the intermediate values extracted from the acoustic model is performed on the server. With this method, a dependency arises between the device and the server, so the method cannot be used between a device and a server that are not compatible with each other.

As another example, in an end-to-end speech recognition scheme that includes an encoding computation and a decoding computation, a method may be used in which only the encoding computation is performed on the device and the decoding computation on the encoded data is performed on the server. Because the decoding computation requires prior information about the encoding scheme, a dependency arises between the device that performs the encoding and the server that performs the decoding. Accordingly, this method likewise cannot be used between devices that are not compatible with each other.

To solve the problems described above, a speech recognition system according to an embodiment of the present disclosure, illustrated in FIG. 2A, is proposed.

The device 100 according to an embodiment of the present disclosure may perform on-device speech recognition that converts a speech signal into a first character string. The device 100 may determine, based on the reliability of the first character string, whether the on-device speech recognition has failed. When the on-device speech recognition has failed, the device 100 may transmit the first character string, which is the speech recognition result, to the server 200.

According to an embodiment of the present disclosure, the device 100 transmits information about the speech signal to the server 200 in the form of a character string, which has the advantage that the server 200 can process the character string regardless of what kind of on-device speech recognition the device 100 uses.

According to an embodiment of the present disclosure, the first character string transmitted from the device 100 to the server 200 may be a frame-synchronized character string.

A "frame" may refer to a unit into which a speech signal is divided at regular time intervals for processing, or to the divided speech signal itself. A "frame-synchronized character string" refers to a character string that, when a speech signal is converted into a character string and output by a speech recognition model, includes characters individually corresponding to each of the frames of the speech signal.

The device 100 according to an embodiment may generate a frame-synchronized character string as the speech recognition result by using an algorithm such as RNN-T or CTC.
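To make the notion of a frame-synchronized character string concrete, the sketch below applies the standard CTC collapse rule (merge consecutive repeats, then drop blanks) to a one-character-per-frame sequence. The blank symbol `_`, the function name, and the 12-frame example are assumptions for illustration; the exact representation used in the disclosure is described with reference to FIG. 6.

```python
BLANK = "_"  # hypothetical blank symbol for frames that emit no new character


def collapse_frame_synchronized(frame_chars, blank=BLANK):
    """Collapse one-character-per-frame output into a plain string (CTC rule)."""
    output = []
    previous = None
    for char in frame_chars:
        # Keep a character only when it differs from the previous frame
        # and is not the blank symbol.
        if char != previous and char != blank:
            output.append(char)
        previous = char
    return "".join(output)


# Twelve frames, one symbol per frame; the blank between the two "l" runs
# is what allows the doubled letter to survive the collapse.
frames = list("hh_e_ll_l_oo")
print(collapse_frame_synchronized(frames))
```

Because the per-frame sequence keeps one symbol for every frame, a receiver can still tell which time span each character came from, which the collapsed string alone does not reveal.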

However, the present disclosure is not limited thereto; even when the speech recognition result of the device 100 is not frame-synchronized, the device 100 according to an embodiment may generate a frame-synchronized character string by performing forced alignment. Frame-synchronized character strings, and a specific method of generating a frame-synchronized character string by forced alignment, are described in detail later with reference to FIG. 6.

When the reliability (confidence score) of the result of on-device speech recognition is sufficiently high, the device 100 according to an embodiment of the present disclosure may use the speech recognition result as it is.

On the other hand, when the device 100 according to an embodiment of the present disclosure determines that the reliability of the result of on-device speech recognition is not sufficiently high, it may transmit the character string that is the speech recognition result to the server 200.

Accordingly, when the device 100 according to an embodiment of the present disclosure determines that the reliability of the result of on-device speech recognition is not sufficiently high, it does not transmit the speech signal to the server 200 and have the server 200 restart the speech recognition process from the beginning, which has the advantage of reducing processing time.
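The device-side decision described above can be sketched as a simple confidence gate. This is a minimal sketch under stated assumptions: the threshold value, the `Recognition` type, and the `send_to_server` callback are hypothetical, and the disclosure does not specify how the reliability is computed or compared.

```python
from dataclasses import dataclass
from typing import Callable

CONFIDENCE_THRESHOLD = 0.8  # hypothetical value; not given in the disclosure


@dataclass
class Recognition:
    text: str          # first character string from on-device recognition
    confidence: float  # reliability of that string, assumed to lie in [0, 1]


def finalize(result: Recognition,
             send_to_server: Callable[[str], str],
             threshold: float = CONFIDENCE_THRESHOLD) -> str:
    """Use the on-device result as is, or ask the server to correct the text."""
    if result.confidence >= threshold:
        return result.text
    # Only the character string is sent, not the speech signal, so the
    # server does not have to repeat recognition from the beginning.
    return send_to_server(result.text)
```

A caller would pass whatever transport function actually reaches the server; in a test, a stub that rewrites a known misrecognition is enough to exercise both branches.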

When the device 100 according to an embodiment of the present disclosure determines that the reliability of the result of on-device speech recognition is not sufficiently high, it may transmit the character string that is the speech recognition result to the server 200 in units of sentences, words, phrases, or frames.

When the device 100 according to an embodiment obtains, by performing speech recognition, a character string that constitutes a sentence or a phrase, it may transmit all of the characters included in the sentence or phrase to the server 200, or it may transmit only some of those characters to the server 200. Based on the reliability of the character string, the device 100 may transmit the subset of characters whose reliability is low to the server 200.

The device 100 according to an embodiment may receive the corrected character string from the server 200 and combine it with the character string that was not transmitted to the server 200 because it was determined not to need correction. The device 100 according to an embodiment may output the combined character string or may provide a voice assistant service based on a result of interpreting the combined character string.
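The partial-transmission and recombination steps can be sketched as follows. The word-level granularity, the per-word confidence values, the threshold, and both function names are assumptions made for this illustration; the disclosure also allows sentence, phrase, and frame units.

```python
from typing import Dict, List, Tuple

Word = Tuple[str, float]  # (recognized word, hypothetical per-word confidence)


def select_low_confidence(words: List[Word], threshold: float = 0.8) -> Dict[int, str]:
    """Pick the words to send to the server, keyed by their position."""
    return {i: word for i, (word, conf) in enumerate(words) if conf < threshold}


def merge_corrections(words: List[Word], corrected: Dict[int, str]) -> str:
    """Combine server-corrected words with the words that were never sent."""
    return " ".join(corrected.get(i, word) for i, (word, _) in enumerate(words))


recognized = [("play", 0.95), ("beetles", 0.40), ("songs", 0.90)]
to_send = select_low_confidence(recognized)

# Stand-in for the server's reply; a real reply would arrive over the network.
server_reply = {1: "Beatles"}
print(merge_corrections(recognized, server_reply))
```

Keeping the position of each transmitted word is one simple way to let the device put the corrected pieces back in the right place.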

The server 200 according to an embodiment of the present disclosure may receive the character string that is the speech recognition result from the device 100 in units of sentences, words, phrases, or frames.

The server 200 according to an embodiment of the present disclosure may correct errors in the received first character string by using the language model and dictionary information stored in the server 200. The server 200 may obtain a second character string from the first character string by using a language model in the server 200 that contains a larger amount of information than the language model stored in the device 100. The server 200 may obtain the second character string by replacing at least one character included in the first character string with another character. The second character string may be a character string in which an error contained in the first character string has been corrected.

The server 200 according to an embodiment of the present disclosure may correct the first character string received from the device 100 by replacing at least one character included in it with another character, and may transmit the corrected first character string to the device 100.

The "operation of correcting the first character string" may refer to an operation of recommending and outputting a second character string that has higher reliability than the first character string. Accordingly, in the present disclosure, the expressions 'correcting a character string', 'correcting a character', 'replacing a given character with another character', 'recommending another character in place of a given character', 'replacing a given character string with another character string', and 'recommending another character string in place of a given character string' may be used interchangeably.

When the server 200 according to an embodiment obtains from the device 100 a character string that constitutes a sentence or a phrase, it may correct all of the characters included in the sentence or phrase, or it may correct only some of those characters. Based on the reliability of the character string, the server 200 may correct the subset of characters whose reliability is low.

The server 200 according to an embodiment may combine the corrected character string with the character string that did not undergo correction because it was determined not to need it. The server 200 according to an embodiment may transmit the combined character string to the device 100.

The server 200 according to an embodiment of the present disclosure can decode the received character string by using dictionary information and a language model that differ from domain to domain. According to an embodiment of the present disclosure, because the dictionary information is stored in the server 200, there is the advantage that newly coined words and new named entities can easily be hot-fixed.

The server 200 according to an embodiment may receive a character string from the device 100 and select a domain related to the received character string. As one example, the server 200 may receive, together with the character string, domain information related to the character string from the device 100, and may determine, based on the received information, the domain in which the character string is to be decoded. As another example, the server 200 may determine the domain related to the received character string based on the character string itself. The server 200 according to an embodiment may decode the received character string by using the dictionary information and language model corresponding to the determined domain.

Accordingly, the server 200 according to an embodiment of the present disclosure may output a speech recognition result with improved accuracy by re-decoding the character string received from the device 100. For example, the server 200 may receive the first character string from the device 100 and perform decoding using the language model and dictionary information in the server 200, thereby outputting a second character string in which at least one character included in the first character string has been corrected.
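The domain selection and dictionary-based correction described above might look like the following sketch. It swaps in fuzzy string matching from Python's standard `difflib` module in place of the decoding with a language model that the disclosure describes, and the domain names, vocabularies, and similarity cutoff are all hypothetical.

```python
import difflib

# Hypothetical per-domain dictionaries held on the server.
DOMAIN_DICTIONARIES = {
    "music": ["beatles", "yesterday", "playlist"],
    "navigation": ["gangnam", "station", "route"],
}


def choose_domain(text, hint=None):
    """Use the domain sent by the device if any, else infer it from the text."""
    if hint in DOMAIN_DICTIONARIES:
        return hint
    words = text.lower().split()
    # Pick the domain whose dictionary contains the most words of the text.
    return max(DOMAIN_DICTIONARIES,
               key=lambda d: sum(w in DOMAIN_DICTIONARIES[d] for w in words))


def correct_with_domain(text, domain, cutoff=0.8):
    """Replace each word by its closest dictionary entry, when one is close enough."""
    vocabulary = DOMAIN_DICTIONARIES[domain]
    corrected = []
    for word in text.lower().split():
        match = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
        corrected.append(match[0] if match else word)
    return " ".join(corrected)


first_string = "play beetles yesterday"
domain = choose_domain(first_string)
second_string = correct_with_domain(first_string, domain)
print(domain, "->", second_string)
```

Here "beetles" is close enough to the dictionary entry "beatles" to be replaced, whereas "play" stays unchanged because no entry in the music dictionary reaches the cutoff.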

The server 200 may transmit the second character string to the device 100. The device 100 may improve speech recognition performance by receiving from the server 200, and using, the second character string, which has higher reliability than the first character string.

When the server 200 according to an embodiment obtains from the device 100 a character string that includes the characters constituting a sentence, it may correct errors in the entire sentence or in only some of the characters constituting the sentence. Based on the reliability of the character string, the server 200 may correct errors in the subset of characters whose reliability is low. The server 200 according to an embodiment may obtain the second character string by combining the corrected character string with the character string that was left uncorrected because it was determined not to need correction.

As illustrated in FIG. 2A, the server 200 according to an embodiment of the present disclosure may transmit the second character string itself to the device 100 as the speech recognition result. However, embodiments are not limited to the example illustrated in FIG. 2A.

As illustrated in FIGS. 2B and 2C, the server 200 according to an embodiment of the present disclosure may identify the user's utterance intent from the second character string and thereby transmit, to the device 100, information related to a voice assistant service based on the second character string.

The server 200 according to an embodiment of the present disclosure may provide various kinds of voice assistant services to the device 100 by using the second character string obtained from the first character string. The voice assistant service may be a service that provides a conversation with the user. The voice assistant service may provide a response message to the user as if a person were conversing directly with the user, taking into account the user's situation, the device's situation, and the like. In addition, like a personal assistant, the voice assistant service may appropriately generate information that the user needs and provide it to the user.

In this case, to provide the voice assistant service based on the character string, the server 200 may use a natural language understanding (NLU) model, a dialog manager (DM) model, a natural language generating (NLG) model, and the like in the server 200 to provide the device 100 with information for carrying on a conversation with the user.

As one example, the server 200 may control the device 100 or another device (for example, a smart home appliance or a wearable device) based on a result of interpreting the second character string.

As illustrated in FIG. 2B, the server 200 according to an embodiment may generate, based on a result of interpreting the character string, a control command for controlling the device 100 or a control command that causes the device 100 to control another device, and may provide the generated control command to the device 100.

In addition, as illustrated in FIG. 2C, the server 200 according to an embodiment of the present disclosure may provide a voice assistant service associated with various services. For example, the voice assistant service may be linked with various services, such as a broadcasting service, a content sharing service, a content providing service, a power management service, a game providing service, a chat service, a document writing service, a search service, a call service, a photo capturing service, a transportation recommendation service, or a video playback service, so that the voice assistant service provides information or functions that the user needs.

The server 200 according to an embodiment of the present disclosure may transmit information related to the voice assistant service to the device 100 based on the second character string. The information related to the voice assistant service may include a response message provided to the user as if a person were conversing directly with the user, taking into account the user's situation, the device's situation, and the like, or may include information that the user needs.

In addition, the server 200 may identify the user's utterance intent based on the second character string and may request the service providing server 201 to provide a service that the user needs. The service providing server 201 may provide at least one of a broadcasting service, a content sharing service, a content providing service, a power management service, a game providing service, a chat service, a document writing service, a search service, a call service, a photo capturing service, a transportation recommendation service, or a video playback service.

Although FIG. 2C illustrates the server 200 that provides the voice assistant service as being connected to a single service providing server 201, the present disclosure is not limited to what is illustrated in FIG. 2C. For example, according to an embodiment of the present disclosure, the server 200 may be connected to a plurality of service providing servers and may determine the service that the user needs according to the user's utterance intent. The server 200 may select the service providing server corresponding to the determined service and transmit a service provision request to the selected service providing server.
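Selecting a service providing server from the utterance intent can be sketched as a lookup followed by a request. This is only an illustrative stand-in: the keyword matching replaces the NLU model that the disclosure mentions, and the service names, keyword sets, and server identifiers are hypothetical.

```python
# Hypothetical mapping from service name to trigger keywords.
INTENT_KEYWORDS = {
    "video_playback": {"watch", "video"},
    "search": {"find", "search"},
    "transport_recommendation": {"route", "bus"},
}

# Hypothetical identifiers of the service providing servers.
SERVICE_SERVERS = {
    "video_playback": "video-server",
    "search": "search-server",
    "transport_recommendation": "transport-server",
}


def determine_service(second_string):
    """Return the first service whose keywords appear in the text, else None."""
    words = set(second_string.lower().split())
    for service, keywords in INTENT_KEYWORDS.items():
        if words & keywords:
            return service
    return None


def build_service_request(second_string):
    """Build the request for the selected service providing server, if any."""
    service = determine_service(second_string)
    if service is None:
        return None
    return {
        "server": SERVICE_SERVERS[service],
        "service": service,
        "utterance": second_string,
    }


print(build_service_request("find nearby cafes"))
```

Returning `None` when no service matches leaves it to the caller to fall back, for example to simply returning the second character string to the device.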

The service providing server 201 according to an embodiment may provide information related to the requested service based on a service request received from the voice assistant service providing server 200. For example, the service providing server 201 may provide broadcasts, content, applications, transportation recommendation information, search results, and the like as information related to the requested service. The service providing server 201 may provide the information related to the requested service to the voice assistant service providing server 200 or to the device 100.

Hereinafter, the configuration and operating method of each of the following are described in detail according to an embodiment of the present disclosure: the device 100, which selectively transmits the character string that is the speech recognition result to the server 200 to request correction of the character string, and the server 200, which corrects the received character string.

FIG. 3 is a block diagram of a device according to an embodiment.

The device 100 according to an embodiment of the present disclosure may be a fixed terminal or a mobile terminal implemented as a computer device. The device 100 may be, for example, at least one of a smartphone, a mobile phone, a navigation device, a computer, a notebook computer, a digital broadcasting terminal, an artificial intelligence speaker, a speaker, a personal digital assistant (PDA), a portable multimedia player (PMP), or a tablet PC, but is not limited thereto. The device 100 may communicate with other devices and/or servers over a network by using a wireless or wired communication scheme.

Referring to FIG. 3, the device 100 may include a receiver 110, a processor 120, a communication unit 130, a memory 140, and an output unit 150. Not all of the components shown in FIG. 3 are essential components of the device 100. The device 100 may be implemented with more components than those shown in FIG. 3, or with fewer components than those shown in FIG. 3. For example, as shown in FIG. 19, the device 100 according to some embodiments may further include a user input unit 2100, a sensing unit 2400, and an A/V input unit 2600.

The receiver 110 according to an embodiment of the present disclosure may receive a speech signal from a user. For example, the receiver 110 may receive the speech signal by converting external sound into electrical acoustic data using a microphone. Although FIG. 3 shows the receiver 110 as being included inside the device 100, the receiver 110 according to another embodiment may be included in a separate device and connected to the device 100 by wire or wirelessly.

The memory 140 according to an embodiment of the present disclosure may store instructions for performing speech recognition, as well as various models, neural networks, dictionary information, and the like used for speech recognition.

The processor 120 according to an embodiment of the present disclosure may perform speech recognition by executing one or more instructions stored in the memory 140.

The processor 120 according to an embodiment of the present disclosure may obtain a first character string as a result of speech recognition on the speech signal.

For example, the first character string may be a frame-synchronized character string that includes characters corresponding to each of the speech signal frames obtained by dividing the speech signal at predetermined time intervals. Alternatively, the first character string may be a character string obtained in a label-synchronized manner so as to include each character pronounced in the speech signal exactly once.

Next, the processor 120 according to an embodiment of the present disclosure may determine whether to replace the first character string with another character string and, based on the determination, transmit the first character string to the server 200 through the communication unit 130. The processor 120 according to an embodiment may transmit the first character string to the server 200 in units of sentences, words, phrases, or frames. When the processor 120 obtains a character string constituting a sentence or phrase by performing speech recognition, it may transmit all of the characters included in the sentence or phrase to the server 200, or only some of them. Based on the reliability of the character string, the processor 120 may transmit only those characters having low reliability to the server 200.

Determining whether to replace the first character string with another character string may mean judging that speech recognition has failed and deciding to use another character string in place of the first character string. Alternatively, it may mean determining whether to replace the first character string with another character string obtained through additional speech recognition performed at the server.

As one example, the processor 120 may determine the reliability of the first character string and determine, based on the reliability, whether to replace the first character string with another character string.

The reliability of the first character string may be calculated based on at least one of the likelihoods of a plurality of estimated character strings obtained from the first character string and the posterior probabilities that at least one character in the first character string will be replaced with another character.

For example, the processor 120 may calculate the reliability based on the likelihood output as a result of Viterbi decoding. Alternatively, the processor 120 may calculate the reliability based on the posterior probabilities output from the softmax layer of an end-to-end speech recognition model.
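As an illustration only, one way to turn per-character softmax posteriors into a single reliability score is a length-normalized geometric mean. The disclosure does not specify the formula; the function name `string_confidence`, the `floor` parameter, and the choice of a geometric mean are assumptions made for this sketch.

```python
import math


def string_confidence(posteriors, floor=1e-12):
    """Return the geometric mean of per-character posterior probabilities.

    posteriors: probabilities (0..1) that the softmax layer assigned to each
    emitted character. This input layout is an assumption of the sketch.
    floor: hypothetical lower bound that avoids log(0).
    """
    if not posteriors:
        return 0.0
    # Average in the log domain so long strings are not penalized by length.
    log_sum = sum(math.log(max(p, floor)) for p in posteriors)
    return math.exp(log_sum / len(posteriors))
```

A single very uncertain character pulls the score down noticeably, which matches the idea that a low-reliability portion of the string may warrant correction.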

Alternatively, the processor 120 according to an embodiment may determine a plurality of estimated character strings that are estimated during the speech recognition process for the speech signal, and calculate the reliability of the first character string based on the correlation among the plurality of estimated character strings. The higher the correlation among the plurality of estimated character strings, including the first character string, the higher the reliability of the first character string may be.
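The correlation measure is not defined in the text. A minimal sketch, assuming correlation is approximated by the average character-level similarity between the first character string and the other estimated character strings, could use the standard-library `difflib.SequenceMatcher`. The function name `agreement_confidence` is hypothetical.

```python
from difflib import SequenceMatcher


def agreement_confidence(first, hypotheses):
    """Average similarity (0..1) between `first` and the other hypotheses.

    Similarity via SequenceMatcher.ratio() is an assumption of this sketch;
    the disclosure only states that higher correlation implies higher
    reliability.
    """
    others = [h for h in hypotheses if h != first]
    if not others:
        # Every hypothesis agrees with the first string.
        return 1.0
    ratios = [SequenceMatcher(None, first, h).ratio() for h in others]
    return sum(ratios) / len(ratios)
```

When the competing hypotheses differ only slightly from the first character string the score stays close to 1, and it falls as the hypotheses diverge.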

As another example, the processor 120 may determine whether to replace the first character string with another character string based on a result of comparing the first character string with keywords stored in advance in the device 100. For example, when the first character string does not include any of the pre-stored keywords, the processor 120 may determine to replace the first character string with another character string.

As yet another example, the processor 120 may determine whether to replace the first character string with another character string based on the domain to which the first character string relates or on whether the first character string contains a named entity. For example, when the first character string is determined to relate to a named-entity-oriented domain, or is determined not to relate to an open domain, the processor 120 may determine to replace the first character string with another character string.

When it is determined that the first character string should be replaced with another character string, the processor 120 according to an embodiment of the present disclosure may, based on this determination, control the communication unit 130 to transmit the first character string to the server 200.

Meanwhile, the communication unit 130 according to an embodiment of the present disclosure may communicate with an external device, apparatus, or server through wired or wireless communication. The communication unit 130 may include a short-range communication module, a wired communication module, a mobile communication module, a broadcast receiving module, and the like.

When the speech recognition result for the speech signal is not a frame-synchronized character string, the processor 120 according to an embodiment may generate a frame-synchronized character string by performing forced alignment on the first character string, and transmit the generated character string to the server 200.

The processor 120 according to an embodiment may identify the speech signal section in which each character included in the first character string is pronounced, and identify the plurality of speech frames included in the identified section. The processor 120 may obtain the frame-synchronized character string by repeating the corresponding character consecutively, once for each identified speech frame.

For example, when the pronunciation time of a given character included in the first character string is n frames (where n is a natural number), the processor 120 may obtain the frame-synchronized character string by arranging n copies of that character consecutively.
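The repetition step described above can be sketched as follows. The sketch assumes the forced-alignment stage has already produced, for each character, the number of frames over which it is pronounced; the function name `to_frame_synchronized` and the `(character, frame_count)` input format are hypothetical.

```python
def to_frame_synchronized(aligned_chars):
    """Expand (character, frame_count) pairs into a frame-synchronized string.

    aligned_chars: iterable of (ch, n) pairs, where n is the number of speech
    frames over which ch is pronounced (n must be a natural number).
    """
    parts = []
    for ch, n in aligned_chars:
        if n < 1:
            raise ValueError("frame count must be a natural number")
        # A character pronounced over n frames is written n times in a row.
        parts.append(ch * n)
    return "".join(parts)


# Usage: 'c' spans 2 frames, 'a' spans 3 frames, 't' spans 1 frame.
frame_string = to_frame_synchronized([("c", 2), ("a", 3), ("t", 1)])
```

Here `frame_string` contains one character per speech frame, so its length equals the total number of frames in the aligned section.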

The communication unit 130 may receive a second character string from the server 200. The second character string is a character string obtained by the server 200 replacing at least one character in the first character string with another character. The communication unit 130 may also receive, from the server 200, a response message that the server 200 generates based on its interpretation of the second character string.

When it is determined that the first character string does not need to be corrected, the processor 120 according to an embodiment of the present disclosure may determine not to replace the first character string with another character string. In that case, the processor 120 may output the first character string through the output unit 150.

On the other hand, when it is determined that the first character string needs to be corrected, the processor 120 may determine to replace the first character string with another character string. In that case, the output unit 150 may output the second character string received from the server 200 instead of the first character string.

According to an embodiment of the present disclosure, the first character string obtained by the device 100 may be a character string obtained based on first dictionary information and a first language model. The second character string obtained by the server 200 may be a character string obtained based on second dictionary information and a second language model stored in the server 200.

The second dictionary information and the second language model stored in the server 200 may contain a larger amount of information than the first dictionary information and the first language model. Accordingly, the second character string received from the server 200 may have higher reliability than the first character string. The device 100 may improve speech recognition performance by receiving the more reliable second character string from the server 200 and using it.

The output unit 150 according to an embodiment of the present disclosure may output the first character string or the second character string as it is, or may output a word sequence obtained from the first character string or the second character string. For example, when the first character string is a frame-synchronized character string, the output unit 150 may output a word sequence obtained from the first character string.

The output unit 150 according to an embodiment of the present disclosure may output a result of speech recognition performed based on the first character string or the second character string. The output unit 150 may notify the user of the speech recognition result or transmit it to an external device (for example, a smart phone, a home appliance, a wearable device, or a server). For example, the output unit 150 may include a speaker capable of outputting an audio signal or a display capable of outputting a video signal.

Alternatively, the output unit 150 may perform an operation corresponding to the speech recognition result. For example, the device 100 may interpret the first character string or the second character string and determine a function of the device 100 corresponding to the interpretation result. The device 100 may output a screen for performing that function through the output unit 150. Alternatively, the device 100 may transmit a keyword corresponding to the interpretation result to an external server, receive information related to the transmitted keyword from the server, and display the information on a screen through the output unit 150. Alternatively, the device 100 may generate a response message to the speech signal based on the interpretation result and output the response message through the output unit 150.

The device 100 according to an embodiment may identify the user's utterance intention through natural language processing of the first character string or the second character string, and thereby output information related to a voice assistant service through the output unit 150. To provide the voice assistant service based on the first character string or the second character string, the device 100 may use a Natural Language Understanding (NLU) model, a Dialog Manager (DM) model, a Natural Language Generating (NLG) model, and the like in the device 100.

Alternatively, the output unit 150 may receive, from the server 200, information related to a voice assistant service based on the second character string, and output the received information. For example, this information may include a control command for controlling the device 100 or another device, generated based on a result of interpreting the user's utterance intention through natural language processing of the second character string. As another example, this information may include a service or information that the user needs, provided based on a result of interpreting the user's utterance intention through natural language processing of the second character string.

Meanwhile, when the processor 120 according to an embodiment has transmitted only some of the characters included in a sentence or phrase to the server 200, the processor 120 may combine the corrected character string received from the server 200 with the character string that was judged not to need correction and was therefore not transmitted to the server 200. The processor 120 may output the combined character string, output a result of speech recognition performed based on the combined character string, or provide a voice assistant service based on a result of interpreting the combined character string.
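A minimal sketch of this combining step is given below. It assumes the recognized sentence has been split into ordered segments, each flagged according to whether it was sent to the server, and that the server returns the corrected segments in the same order. The function name `merge_segments`, the segment layout, and the example strings are all hypothetical.

```python
def merge_segments(segments, corrected):
    """Recombine locally kept segments with server-corrected segments.

    segments: list of (text, was_sent) pairs in utterance order.
    corrected: corrected texts returned by the server, one per sent segment,
    in the same order (an assumption of this sketch).
    """
    corrections = iter(corrected)
    merged = []
    for text, was_sent in segments:
        if was_sent:
            # Fall back to the local text if the server returned nothing.
            merged.append(next(corrections, text))
        else:
            merged.append(text)
    return " ".join(merged)


# Usage: only the low-reliability middle segment was sent for correction.
combined = merge_segments(
    [("call", False), ("jon smyth", True), ("now", False)],
    ["John Smith"],
)
```

Sending only the low-reliability segments keeps the transmitted payload small while still letting the corrected portion replace the corresponding local portion.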

Hereinafter, a method of operating the device 100 will be described in detail with reference to FIGS. 4A and 4B.

FIG. 4A is a detailed block diagram of a device according to an embodiment.

As shown in FIG. 4A, the ASR module 121 of the processor 120 may receive the speech signal obtained by the receiver 110 and perform speech recognition on the speech signal.

The ASR module 121 of FIG. 4A may perform speech recognition on the speech signal in an end-to-end manner. The end-to-end method is a speech recognition method that uses a deep neural network trained to map a speech signal directly to a character string or word sequence. Compared with other speech recognition methods that use multiple models, such as an acoustic model and a language model, the end-to-end method can simplify the speech recognition process by using a single trained deep neural network. Examples of end-to-end speech recognition models include the RNN-T model and the CTC model.

The ASR module 121 may extract a feature vector from the speech signal. The ASR module 121 may output the first character string from the feature vector using the deep neural network (DNN) 144 stored in the memory 140.

The determination unit 125 of the processor 120 according to an embodiment of the present disclosure may determine whether to replace the first character string with another character string based on the reliability of the first character string output from the ASR module 121. The determination unit 125 may receive reliability information regarding the first character string from the ASR module 121.

The determination unit 125 according to an embodiment may receive, as the reliability information regarding the first character string, posterior probability values output from the softmax layer in the ASR module 121. The determination unit 125 may calculate the reliability based on the posterior probability values associated with the first character string.

For example, when the reliability is greater than or equal to a threshold, the determination unit 125 may determine that the first character string does not need to be corrected and output the first character string through the output unit 150. On the other hand, when the reliability is less than the threshold, the determination unit 125 may determine that the first character string needs to be corrected and transmit the first character string to the server 200 through the communication unit 130.
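The threshold comparison can be illustrated with a short sketch. The function name `route_first_string`, the returned labels, and the default threshold of 0.8 are hypothetical; the disclosure does not state a threshold value.

```python
def route_first_string(reliability, threshold=0.8):
    """Decide what to do with the first character string.

    Returns "output" when the reliability is greater than or equal to the
    threshold, and "send_to_server" when it is below the threshold.
    The threshold value 0.8 is an illustrative assumption.
    """
    if reliability >= threshold:
        # No correction needed: use the on-device result as it is.
        return "output"
    # Correction needed: request a second character string from the server.
    return "send_to_server"
```

Note that a reliability exactly equal to the threshold is treated as not needing correction, consistent with the "greater than or equal to" condition above.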

For convenience of description, FIG. 4A illustrates the case in which the first character string is output through the output unit 150, but the present disclosure is not limited thereto. The device 100 according to an embodiment may identify the user's utterance intention through natural language processing of the first character string, and thereby output information related to a voice assistant service through the output unit 150.

To provide the voice assistant service based on the first character string, the device 100 may use a Natural Language Understanding (NLU) model, a Dialog Manager (DM) model, a Natural Language Generating (NLG) model, and the like in the device 100.

For example, the processor 120 of the device 100 may generate a response message to the first character string, taking into account the user's situation, the device's situation, and the like, as if a person were conversing directly with the user, and may output the response message through the output unit 150. As another example, the processor 120 may generate information that the user needs based on the first character string and provide it to the user through the output unit 150. As yet another example, the processor 120 may identify the user's utterance intention based on the first character string and request a service providing server to provide the service that the user needs. The output unit 150 may output the information received from the service providing server.

In addition, the output unit 150 of the device 100 according to an embodiment may receive information related to a voice assistant service from the server 200 and output the received information. The information related to the voice assistant service may be information generated by the server 200 based on the first character string or on the second character string obtained by correcting the first character string. For example, the information related to the voice assistant service may include a response message to the user's speech signal, a service that the user needs, or information that the user needs.

Meanwhile, FIG. 4B is a detailed block diagram of a device according to another embodiment.

As shown in FIG. 4B, the ASR module 121 of the processor 120 may receive the speech signal obtained by the receiver 110 and perform speech recognition on the speech signal. The phoneme sequence acquisition unit 122 may obtain a phoneme sequence from the speech signal using the acoustic model 141 stored in the memory 140. The acoustic model 141 may divide the waveform of the speech signal and estimate a phoneme sequence containing phonemes using a hidden Markov model, a Gaussian mixture model, Bayesian inference, a multilayer neural network, or the like.

The character string acquisition unit 123 of the processor 120 may estimate words from the phoneme sequence based on the dictionary information 142 and the language model 143 stored in the memory 140, and output a character string containing the estimated words.

The determination unit 125 of the processor 120 according to an embodiment of the present disclosure may calculate the reliability of the first character string output from the ASR module 121, and determine whether to replace the first character string with another character string based on the calculated reliability. The determination unit 125 may receive reliability information regarding the first character string from the ASR module 121.

The determination unit 125 according to an embodiment may calculate the reliability based on a partial likelihood value of the first character string that is output from the Viterbi decoder in the ASR module 121 as the reliability information regarding the first character string.

When the reliability is greater than or equal to a threshold, the determination unit 125 according to an embodiment may determine that the first character string does not need to be corrected and output the first character string through the output unit 150. On the other hand, when the reliability is less than the threshold, the determination unit 125 may determine that the first character string needs to be corrected and transmit the first character string to the server 200 through the communication unit 130.

For convenience of description, FIG. 4B illustrates the case in which the first character string is output through the output unit 150, but the present disclosure is not limited thereto. The device 100 according to an embodiment may identify the user's utterance intention through natural language processing of the first character string, and thereby output information related to a voice assistant service through the output unit 150.

As described above, the device 100 according to an embodiment of the present disclosure may determine whether to replace the first character string with another character string based on the reliability of the speech recognition result for the speech signal. However, the present disclosure is not limited to this embodiment. According to another embodiment, the device 100 may determine whether to replace the first character string with another character string based on a result of comparing the character string with keywords stored in advance in the device 100. Alternatively, the device 100 according to another embodiment may make this determination based on the domain to which the first character string relates. Alternatively, the device 100 according to another embodiment may interpret the meaning of the first character string through natural language understanding processing and make this determination based on the interpretation result.

FIG. 5A is a diagram for describing a method by which a device determines to perform on-device speech recognition, according to an embodiment.

As one example, the determination unit 125 of the processor 120 of the device 100 according to an embodiment of the present disclosure may determine whether to replace the first character string with another character string based on a result of comparing the first character string with keywords stored in advance in the device 100.

When the first character string includes at least one of the pre-stored keywords, the determination unit 125 according to an embodiment may determine not to replace the first character string with another character string. Accordingly, the device 100 may use the result of the speech recognition performed by the ASR module 121 of the device 100 as it is, without intervention by the server 200.

For example, when the first character string output from the ASR module 121 is “Read me my text”, the determination unit 125 may determine that the first character string includes the pre-stored keyword “text”, and determine not to replace the first character string with another character string.
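The keyword comparison described above can be sketched as follows. This is only an illustrative sketch: the function name `should_send_to_server`, the whitespace tokenization, and the contents of `STORED_KEYWORDS` are assumptions made for the example and are not specified in the disclosure.

```python
# Hypothetical keyword list; the disclosure only states that keywords are stored in advance.
STORED_KEYWORDS = {"text", "call", "alarm"}


def should_send_to_server(first_string, stored_keywords):
    """Return True when the first string contains none of the stored keywords,
    i.e. when the device would ask the server to replace it."""
    # Assumed tokenization: lowercase, split on whitespace, strip simple punctuation.
    words = {w.strip(".,?!").lower() for w in first_string.split()}
    return not any(k.lower() in words for k in stored_keywords)


keep_local = not should_send_to_server("Read me my text", STORED_KEYWORDS)
ask_server = should_send_to_server("the cat and deers baseball team", STORED_KEYWORDS)
```

With these assumed keywords, “Read me my text” is kept as the on-device result, while the second string is a candidate for server-side correction.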

As another example, the determination unit 125 of the processor 120 of the device 100 according to an embodiment of the present disclosure may determine whether to replace the first character string with another character string based on the domain to which the first character string is related, or on whether an entity name is included in the first character string.

The determination unit 125 according to an embodiment may determine not to replace the first character string with another character string when it determines that the first character string is related to an open domain rather than to an entity-name-oriented domain. Accordingly, the device 100 may use the speech recognition result produced by the ASR module 121 of the device 100 as it is, without the intervention of the server 200.

For example, when the first character string output from the ASR module 121 is “Take a picture”, the determination unit 125 may determine that the first character string is related to an open domain, and determine not to replace the first character string with another character string.

The determination unit 125 according to an embodiment may determine to replace the first character string with another character string when it determines that an entity name is included in the first character string.

The determination unit 125 according to an embodiment of the present disclosure may determine whether at least one of the entity names stored in the memory 140 is included in the first character string. Alternatively, the determination unit 125 according to an embodiment may determine whether an entity name is included in the first character string even without prior information on entity names. For example, the determination unit 125 may identify an entity name included in the first character string by performing part-of-speech (POS) tagging on the words identified from the first character string.

For example, when the first character string output from the ASR module 121 is “Take a picture”, the determination unit 125 may determine that the first character string does not include an entity name, and determine not to replace the first character string with another character string.

As yet another example, the determination unit 125 of the processor 120 of the device 100 according to an embodiment of the present disclosure may interpret the meaning of the first character string through natural language understanding processing, and determine whether to replace the first character string with another character string based on the result of the interpretation.

The determination unit 125 according to an embodiment may determine not to replace the first character string with another character string when, as a result of interpreting the first character string, it determines that the speech signal is a general command related to the operation of the device 100. Accordingly, the device 100 may use the speech recognition result produced by the ASR module 121 of the device 100 as it is, without the intervention of the server 200.

For example, when the first character string output from the ASR module 121 is “Do I have any new voice mail?”, the determination unit 125 may determine that the first character string is a general command related to checking text messages, and determine not to replace the first character string with another character string.

한편, 도 5b는 일 실시 예에 따라 디바이스가 서버 기반 음성 인식을 수행할 것을 판단하는 방법을 설명하기 위한 도면이다.Meanwhile, FIG. 5B is a diagram illustrating a method of determining that a device performs server-based voice recognition, according to an exemplary embodiment.

As shown in FIG. 5B, the determination unit 125 of the processor 120 of the device 100 according to an embodiment may determine that the first character string should be replaced with another character string, and transmit the first character string to the server 200 based on this determination.

FIG. 5B illustrates, as an example, a case where the ASR module 121 of the device 100 receives a speech signal in which the user pronounces “The Cardinals baseball team” and obtains the first character string [the cat and deers baseball team].

As an example, the determination unit 125 of the processor 120 of the device 100 according to an embodiment of the present disclosure may determine to replace the first character string with another character string because the first character string does not include any of the pre-stored keywords.

As another example, the determination unit 125 of the processor 120 of the device 100 according to an embodiment of the present disclosure may determine to replace the first character string with another character string when it determines that the first character string is related to a sports domain or includes an entity name.

The determination unit 125 according to an embodiment of the present disclosure may determine whether at least one of the entity names stored in the memory 140 is included in the first character string. Alternatively, the determination unit 125 according to an embodiment may determine whether an entity name is included in the first character string even without prior information on entity names. For example, the determination unit 125 may identify an entity name included in the first character string by performing part-of-speech (POS) tagging on the words identified from the first character string. However, embodiments are not limited thereto, and various named entity recognition (NER) methods may be used.

As yet another example, the determination unit 125 of the processor 120 of the device 100 according to an embodiment of the present disclosure may determine, as a result of interpreting the first character string, that the speech signal is not a general command, and determine to replace the first character string with another character string.

As shown in FIG. 5B, the determination unit 125 of the device 100 according to an embodiment may determine that the first character string should be replaced with another character string, and transmit the first character string to the server 200 based on this determination. The server 200 may receive the first character string from the device 100 and perform decoding using a language model and dictionary information (for example, dictionary information of a sports domain) in the server 200. As a result of the decoding, the server 200 may obtain a second character string in which at least one character included in the first character string has been modified. The device 100 may increase the accuracy of speech recognition by receiving the second character string from the server 200 and using it.

Meanwhile, when the device 100 according to an embodiment of the present disclosure obtains a character string constituting a sentence or phrase by performing speech recognition, it may transmit all of the characters included in the sentence or phrase to the server 200, or transmit only some of them. The determination unit 125 of the processor 120 of the device 100 may determine, based on the reliability of the character string, to transmit the characters with low reliability to the server 200.

The device 100 according to an embodiment may receive a corrected character string from the server 200 and combine it with the character string that was not transmitted to the server 200 because it was determined not to need correction. The device 100 according to an embodiment may output the combined character string, output a result of speech recognition performed based on the combined character string, or provide a voice assistant service based on a result of interpreting the combined character string.
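The partial transmission and recombination described above can be sketched as follows. The segment boundaries, the confidence values, the 0.5 threshold, and the function names `select_low_confidence` and `combine` are all assumptions made for illustration; the disclosure does not specify how segments are delimited or what threshold is used.

```python
def select_low_confidence(segments, threshold):
    """Pick the segments to send to the server.

    segments: list of (text, confidence) pairs in utterance order.
    Returns {segment index: text} for segments below the threshold.
    """
    return {i: text for i, (text, conf) in enumerate(segments) if conf < threshold}


def combine(segments, corrected):
    """Merge server-corrected segments with the segments that stayed on the device."""
    return " ".join(corrected.get(i, text) for i, (text, _) in enumerate(segments))


# Hypothetical on-device result with per-segment confidence.
segments = [("the", 0.95), ("cat and deers", 0.40), ("baseball team", 0.92)]
to_send = select_low_confidence(segments, threshold=0.5)

# Stand-in for the server reply; a real reply would come over the network.
corrected = {1: "cardinals"}
combined = combine(segments, corrected)
```

Here only the middle segment is sent, and the combined result is "the cardinals baseball team".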

In addition, the device 100 according to an embodiment may provide the server 200 with domain information related to the first character string while requesting the server 200 to correct the first character string. The domain information is information for identifying a domain and may include, for example, a name of the domain and an identifier of the domain, but is not limited thereto. The device 100 may identify a domain related to the first character string based on a domain reliability of the first character string output from the ASR model of the device 100. The domain reliability may be a numerical value indicating how closely at least a part of the first character string is related to a specific domain. For example, the device 100 may calculate a confidence score indicating how closely the first character string output from the ASR model is related to a domain registered in advance in the device 100. The device 100 may then identify the domain related to the first character string based on the calculated domain reliability. The device 100 may identify the domain related to the first character string based on rules, or may obtain the domain reliability for the first character string by using an artificial intelligence model trained for domain identification.
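A rule-based domain confidence score of the kind mentioned above might look like the following sketch. The scoring rule (fraction of words found in a per-domain vocabulary), the vocabularies in `DOMAIN_VOCAB`, and the 0.2 threshold are assumptions chosen for the example; the disclosure does not define how the score is computed.

```python
# Hypothetical per-domain vocabularies registered on the device.
DOMAIN_VOCAB = {
    "sports": {"baseball", "team", "league", "score"},
    "camera": {"picture", "photo", "selfie"},
}


def domain_confidence(first_string, domain_vocab):
    """Score each domain by the fraction of words that appear in its vocabulary."""
    words = first_string.lower().split()
    if not words:
        return {domain: 0.0 for domain in domain_vocab}
    return {
        domain: sum(w in vocab for w in words) / len(words)
        for domain, vocab in domain_vocab.items()
    }


def identify_domain(first_string, domain_vocab, threshold=0.2):
    """Return (domain, score) for the best domain, or (None, score) below the threshold."""
    scores = domain_confidence(first_string, domain_vocab)
    best = max(scores, key=scores.get)
    return (best, scores[best]) if scores[best] >= threshold else (None, scores[best])


domain, score = identify_domain("the cat and deers baseball team", DOMAIN_VOCAB)
```

The identified domain name could then accompany the correction request so that the server selects matching dictionary information; a trained classifier could replace the rule without changing the interface.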

도 6은 일 실시 예에 따른 프레임 동기화된 문자열을 설명하기 위한 도면이다.6 is a diagram for describing a frame-synchronized character string according to an exemplary embodiment.

As shown in FIG. 6, the ASR module 121 of the device 100 according to an embodiment of the present disclosure may output a frame-synchronized character string 603 including characters corresponding to each of the speech signal frames F into which the speech signal 601 is divided at predetermined time intervals.

For example, the ASR module 121 may receive a speech signal in which the user pronounces "baseball" and output the frame-synchronized character string [b, b, a, , a, a, s, s, e, , b, b, a, a, l].

그러나 본 개시는 이에 제한되지 않으며, 일 실시 예에 따른 ASR 모듈(121)은, 음성 인식 결과로서 프레임 동기화되지 않은 문자열(즉, 라벨 동기화 문자열)을 출력할 수 있다. 이 경우에도, 디바이스(100)는, 음성 신호로부터 획득된 문자열에 대한 강제 정렬을 수행함으로써, 프레임 동기화된 문자열을 생성할 수 있다.However, the present disclosure is not limited thereto, and the ASR module 121 according to an embodiment may output a text string that is not frame synchronized (ie, a label synchronization text string) as a result of speech recognition. Even in this case, the device 100 may generate a frame-synchronized character string by forcibly aligning the character string obtained from the voice signal.

The processor 120 of the device 100 according to an embodiment may identify the speech signal section in which each character included in the first character string is pronounced, and identify the plurality of speech frames included in the identified speech signal section. The processor 120 may obtain a frame-synchronized character string by repeating each character consecutively according to the identified speech frames.

For example, the ASR module 121 may output a first character string [b, a, s, e, b, a, l, l] that is not frame-synchronized. In this case, the processor 120 may repeat each character consecutively based on the time during which each character included in the first character string is pronounced. As a result, the processor 120 may obtain the frame-synchronized character string [b, b, a, , a, a, s, s, e, , b, b, a, a, l].
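The expansion of a label-synchronized string into a frame-synchronized one can be sketched as follows, assuming a forced aligner has already produced a start and end time for each character. The 10 ms frame length, the time spans, and the function name `to_frame_synchronized` are hypothetical, and the sketch does not attempt to reproduce the exact sequence shown in FIG. 6.

```python
def to_frame_synchronized(aligned_chars, frame_ms=10):
    """Repeat each character once per frame that its aligned time span covers.

    aligned_chars: list of (char, start_ms, end_ms) from a forced alignment.
    """
    frames = []
    for ch, start_ms, end_ms in aligned_chars:
        # Each character occupies at least one frame.
        n_frames = max(1, (end_ms - start_ms) // frame_ms)
        frames.extend([ch] * n_frames)
    return frames


# Hypothetical alignment for the first three characters of "baseball".
aligned = [("b", 0, 20), ("a", 20, 50), ("s", 50, 70)]
frame_string = to_frame_synchronized(aligned)
```

With these assumed spans, "b" covers two frames, "a" three, and "s" two.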

본 개시의 일 실시 예에 따른 디바이스(100)는, 프레임 동기화된 문자열(603)을 서버(200)로 출력할 수 있다. 서버(200)는, 디바이스(100)로부터 수신된 프레임 동기화된 문자열(603)에 대한 디코딩을 수행하고, 디코딩 수행 결과에 기초하여 획득된 제2 문자열을 디바이스(100)에게 전송할 수 있다.The device 100 according to an embodiment of the present disclosure may output the frame-synchronized character string 603 to the server 200. The server 200 may decode the frame-synchronized character string 603 received from the device 100 and transmit the obtained second character string to the device 100 based on a result of performing the decoding.

도 7은 일 실시 예에 따른 서버의 블록도를 도시한다.7 is a block diagram of a server according to an embodiment.

본 개시의 일 실시 예에 따른 서버(200)는, 디바이스(100)와 유선 또는 무선으로 연결될 수 있다.The server 200 according to an embodiment of the present disclosure may be connected to the device 100 by wire or wirelessly.

도 7을 참조하면, 서버(200)는, 통신부(210), 프로세서(220), 및 메모리(230)를 포함할 수 있다. 도 7에 도시된 구성 요소 모두가 서버(200)의 필수 구성 요소인 것은 아니다. 도 7에 도시된 구성 요소보다 많은 구성 요소에 의해 서버(200)가 구현될 수도 있고, 도 7에 도시된 구성 요소보다 적은 구성 요소에 의해 서버(200)가 구현될 수도 있다.Referring to FIG. 7, the server 200 may include a communication unit 210, a processor 220, and a memory 230. Not all of the components shown in FIG. 7 are essential components of the server 200. The server 200 may be implemented by more components than the components shown in FIG. 7, or the server 200 may be implemented by fewer components than the components shown in FIG. 7.

본 개시의 일 실시 예에 따른 서버(200)의 메모리(230)는, 음성 인식을 수행하기 위한 인스트럭션들, 음성 인식에 이용되는 각종 모델, 신경망, 사전 정보 등을 저장할 수 있다. The memory 230 of the server 200 according to an embodiment of the present disclosure may store instructions for performing speech recognition, various models used for speech recognition, neural networks, dictionary information, and the like.

본 개시의 일 실시 예에 따른 프로세서(220)는, 메모리(230)에 저장된 하나 이상의 인스터럭션들을 실행함으로써, 음성 인식을 수행할 수 있다. The processor 220 according to an embodiment of the present disclosure may perform speech recognition by executing one or more instructions stored in the memory 230.

한편, 본 개시의 일 실시 예에 따른 통신부(210)는 유선 통신 또는 무선 통신을 통해 외부 디바이스, 또는 장치와 통신할 수 있다. 통신부(210)는, 근거리 통신 모듈, 유선 통신 모듈, 이동 통신 모듈, 방송 수신 모듈 등을 포함할 수 있다.Meanwhile, the communication unit 210 according to an embodiment of the present disclosure may communicate with an external device or an apparatus through wired communication or wireless communication. The communication unit 210 may include a short-range communication module, a wired communication module, a mobile communication module, a broadcast receiving module, and the like.

먼저, 본 개시의 일 실시 예에 따른 서버(200)의 통신부(210)는, 디바이스(100)로부터 제1 문자열을 수신할 수 있다. 제1 문자열은, 디바이스(100)에 입력된 음성 신호로부터 디바이스(100)에 의해 음성 인식 처리를 거쳐 출력될 수 있다.First, the communication unit 210 of the server 200 according to an embodiment of the present disclosure may receive a first character string from the device 100. The first character string may be output through a voice recognition process by the device 100 from a voice signal input to the device 100.

As an example, the first character string received by the server 200 may be a frame-synchronized character string including characters corresponding to each of the speech signal frames into which the speech signal is divided at predetermined time intervals. As another example, the first character string received by the server 200 may be a character string that is not frame-synchronized.

본 개시의 일 실시 예에 따른 프로세서(220)는, 디바이스(100)로부터 수신되는 제1 문자열이 프레임 동기화되지 않은 문자열인 경우, 제1 문자열로부터 프레임 동기화된 문자열을 획득할 수 있다. 프로세서(220)는, 제1 문자열에 포함되는 적어도 하나의 문자를 소정 시간 간격의 프레임 단위로 연속하여 배치함으로써, 프레임 동기화된 문자열을 획득할 수 있다. The processor 220 according to an embodiment of the present disclosure may obtain a frame-synchronized character string from the first character string when the first character string received from the device 100 is a character string that is not frame synchronized. The processor 220 may obtain a frame-synchronized character string by continuously arranging at least one character included in the first character string in units of frames at a predetermined time interval.

본 개시의 일 실시 예에 따른 서버(200)의 프로세서(220)는, 제1 문자열 내의 적어도 하나의 문자를 다른 문자로 대체함으로써 제1 문자열로부터 제2 문자열을 획득할 수 있다.The processor 220 of the server 200 according to an embodiment of the present disclosure may obtain a second character string from the first character string by replacing at least one character in the first character string with another character.

본 개시의 일 실시예에 따라 프로세서(220)는, 제1 문자열 내의 각 문자와 발음이 유사한 대체 문자들을 식별하고, 식별된 대체 문자들에 기초하여 제1 문자열 내의 적어도 하나의 문자가 다른 문자로 교정된 추정 문자열들을 결정할 수 있다. 프로세서(220)는, 결정된 추정 문자열들 중에서, 언어 모델 및 사전 정보 등 미리 저장된 정보에 기초하여 가장 적합한 추정 문자열을 선택하여 제2 문자열로서 획득할 수 있다.According to an embodiment of the present disclosure, the processor 220 identifies replacement characters having a similar pronunciation to each character in the first string, and at least one character in the first string is converted into a different character based on the identified replacement characters. Corrected estimated character strings can be determined. The processor 220 may select the most suitable estimated character string from among the determined estimated character strings based on pre-stored information such as language model and dictionary information, and obtain it as the second character string.

이하에서는, 일 실시예에 따른 프로세서(220)가 제2 문자열을 획득하는 방법을 보다 구체적으로 설명한다.Hereinafter, a method of obtaining a second character string by the processor 220 according to an exemplary embodiment will be described in more detail.

먼저, 프로세서(220)는, 제1 문자열로부터 복수의 추정 문자열들을 식별할 수 있다. 프로세서(220)는, 제1 문자열 내의 각 문자에 대하여 각 문자가 대체될 대체 문자들에 관한 가능도 행렬들을 산출할 수 있다. 프로세서(220)는, 가능도 행렬들 내의 가능도 값들에 기초하여, 제1 문자열 내의 적어도 하나의 문자가 다른 문자로 대체된 복수의 추정 문자열들을 식별할 수 있다.First, the processor 220 may identify a plurality of estimated character strings from the first character string. The processor 220 may calculate likelihood matrices for replacement characters to be substituted for each character in the first character string. The processor 220 may identify a plurality of estimated character strings in which at least one character in the first character string is replaced with another character based on likelihood values in the likelihood matrices.
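As a rough illustration of identifying estimated character strings, the sketch below generates candidates by substituting one character at a time with a similar-sounding alternative. The `SIMILAR` table and the function name are hypothetical; in the disclosure, likelihood matrices assign a value to every possible replacement, whereas this simplified sketch uses a fixed list of alternatives.

```python
# Hypothetical table of similar-sounding replacements for each character.
SIMILAR = {"k": ["c"], "t": ["d"]}


def single_substitution_candidates(first_string, similar):
    """Return every string obtained by replacing exactly one character
    of first_string with one of its listed alternatives."""
    candidates = []
    for i, ch in enumerate(first_string):
        for alt in similar.get(ch, ()):
            candidates.append(first_string[:i] + alt + first_string[i + 1:])
    return candidates


estimated = single_substitution_candidates("kat", SIMILAR)
```

For the input "kat" this yields "cat" and "kad"; a fuller version would also allow several substitutions in one candidate and keep the original string as a candidate.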

일 실시 예에 따른 프로세서(220)는, 제1 문자열로부터 복수의 추정 문자열들의 가능도를 계산할 수 있다. 프로세서(220)는, 제1 문자열 내의 각 문자에 대하여 각 문자가 대체될 대체 문자들에 관한 가능도 행렬들 내의 가능도 값들에 기초하여, 복수의 추정 문자열들의 가능도를 계산할 수 있다. The processor 220 according to an embodiment may calculate the likelihood of a plurality of estimated character strings from the first character string. The processor 220 may calculate the likelihood of a plurality of estimated character strings based on likelihood values in likelihood matrices for replacement characters to which each character is to be replaced with respect to each character in the first character string.

제1 문자열로부터 획득되는 가능도는, 복수의 추정 문자열들 각각이 참값 문자열이라고 가정하였을 때, 음성 인식 결과로서 제1 문자열이 추정될 가능성을 의미할 수 있다. 본 개시의 일 실시예에 따르면, 프로세서(220)는, 제1 문자열 내의 각 문자와 발음이 유사한 대체 문자들을 식별하고, 식별된 대체 문자들에 기초하여 제1 문자열 내의 적어도 하나의 문자가 다른 문자로 교정된 추정 문자열들을 결정하기 위하여, 제1 문자열로부터 획득되는 가능도를 이용할 수 있다.The likelihood obtained from the first character string may mean a probability that the first character string is estimated as a result of speech recognition, assuming that each of the plurality of estimated character strings is a true value character string. According to an embodiment of the present disclosure, the processor 220 identifies replacement characters having a similar pronunciation to each character in the first string, and at least one character in the first string is different based on the identified replacement characters. In order to determine the estimated character strings corrected to, the likelihood obtained from the first character string can be used.

프로세서(220)는, 가능도, 사전 정보, 및 언어 모델에 기초하여, 복수의 추정 문자열들 중 하나인 제2 문자열을 획득할 수 있다. 프로세서(220)는, 계산된 가능도에 기초하여, 제1 문자열을 제2 문자열로 대체할 지를 결정할 수 있다. 프로세서(220)는, 결정에 기초하여, 제1 문자열 내의 적어도 하나의 문자를 다른 문자로 대체함으로써 제1 문자열로부터 제2 문자열을 획득할 수 있다.The processor 220 may obtain a second character string, which is one of a plurality of estimated character strings, based on the likelihood, dictionary information, and language model. The processor 220 may determine whether to replace the first character string with the second character string based on the calculated likelihood. Based on the determination, the processor 220 may obtain a second character string from the first character string by replacing at least one character in the first character string with another character.

일 실시 예에 따른 프로세서(220)는, 후술하는 과정을 거쳐 제1 문자열로부터 가능도를 획득할 수 있다.The processor 220 according to an embodiment may obtain a likelihood from the first character string through a process described later.

일 예로서, 프로세서(220)는, 제1 문자열 내의 각 문자의 이전 문자들에 기초하여, 각 문자의 사후 확률들을 계산할 수 있다. 제1 문자열 내의 소정 문자의 사후 확률들은, 이전 문자들을 고려하였을 때, 소정 문자가 복수의 다른 문자들로 대체될 확률들을 포함할 수 있다. 즉, 소정 문자의 사후 확률들은, 문자열 내에서 소정 문자의 이전 문자들을 고려하였을 때, 디바이스(100)의 프로세서(120)의 ASR 모듈이 소정 문자를 정확하게 예측했을 확률 및 다른 문자를 소정 문자로 잘못 예측했을 확률을 포함할 수 있다.As an example, the processor 220 may calculate posterior probabilities of each character based on previous characters of each character in the first character string. The posterior probabilities of the predetermined character in the first character string may include probabilities in which the predetermined character is replaced by a plurality of other characters when previous characters are considered. That is, the posterior probabilities of a predetermined character are the probability that the ASR module of the processor 120 of the device 100 correctly predicts the predetermined character when considering the previous characters of the predetermined character in the character string, and the other character incorrectly as a predetermined character It can contain the probability of predicting.

다음으로, 프로세서(220)는, 제1 문자열의 문자 배열 확률을 계산할 수 있다. 문자열의 문자 배열 확률이란, 해당 문자열에 따라 문자들이 배열될 확률을 의미할 수 있다. 문자 배열 확률은, 문자열의 각 문자의 이 전에 누적된 문자들에 기초하여 산출될 수 있다. 프로세서(220)는, 각 문자의 사후 확률들 및 문자 배열 확률에 기초하여, 제1 문자열로부터 획득되는 복수의 추정 문자열들의 가능도를 계산할 수 있다.Next, the processor 220 may calculate a character arrangement probability of the first character string. The character arrangement probability of a character string may mean a probability that characters are arranged according to a corresponding character string. The character arrangement probability may be calculated based on characters accumulated before each character of the character string. The processor 220 may calculate the likelihood of a plurality of estimated character strings obtained from the first character string based on the posterior probabilities of each character and the character arrangement probability.

In calculating the posterior probabilities, the processor 220 according to an embodiment may use a recurrent neural network (RNN) including a plurality of long short-term memory (LSTM) layers and a softmax layer. The recurrent neural network used to calculate the posterior probabilities is described in more detail later with reference to FIG. 10A.

As another example, the processor 220 may calculate the posterior probabilities of each character in the first character string based on a predetermined confusion matrix. The processor 220 may calculate the likelihoods of the plurality of estimated character strings obtained from the first character string based on the posterior probabilities of each character. The confusion matrix used to calculate the posterior probabilities is described in more detail later with reference to FIG. 10B.
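One way to picture a confusion-matrix-based likelihood is the sketch below, which scores an estimated string by how probable it is that the recognizer would have output the first string if the estimated string were the true one. The per-position independence, the equal-length requirement, the probability floor, and the values in `CONFUSION` are simplifying assumptions for the example and are not taken from the disclosure.

```python
import math

# Hypothetical confusion table: CONFUSION[true_char][observed_char] = probability.
CONFUSION = {
    "c": {"c": 0.8, "k": 0.2},
    "k": {"k": 0.9, "c": 0.1},
    "a": {"a": 1.0},
    "t": {"t": 1.0},
}


def log_likelihood(first_string, candidate, confusion, floor=1e-6):
    """Return log P(first_string | candidate), assuming positions are independent."""
    if len(first_string) != len(candidate):
        raise ValueError("strings must have the same length in this sketch")
    return sum(
        math.log(confusion.get(true_ch, {}).get(obs_ch, floor))
        for obs_ch, true_ch in zip(first_string, candidate)
    )


ll_cat = log_likelihood("kat", "cat", CONFUSION)  # true "c" observed as "k"
ll_kat = log_likelihood("kat", "kat", CONFUSION)  # recognizer was right
```

The likelihood alone still favors the unchanged string "kat" here; it is the dictionary information and language model applied in the later decoding step that can shift the choice toward "cat".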

As yet another example, the processor 220 may calculate the posterior probabilities of each character in the first character string based on predetermined probability values. The processor 220 may set the probability that a first character included in the first character string is in fact the first character to P. P may be a predetermined value between 0 and 1 inclusive. The processor 220 may set the probability that the first character included in the first character string is in fact a character other than the first character to (1-P)/(N-1), where N denotes the number of characters and may be a natural number. That is, the processor 220 may set the probability that the ASR module of the processor 120 of the device 100 correctly predicted the first character in the first character string to P, and the probability that it incorrectly predicted another character as the first character to (1-P)/(N-1).

For example, the processor 220 may set the probability that the first character included in the first character string is in fact the first character to 0.9, and the probability that the first character is in fact another given character to 0.1/(N-1).
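The fixed-probability scheme above can be written out as a short sketch. The 26-letter alphabet and the function name `uniform_posteriors` are assumptions for illustration; P = 0.9 follows the example in the text.

```python
def uniform_posteriors(observed, alphabet, p=0.9):
    """Assign probability p to the observed character and spread the
    remaining (1 - p) evenly over the other N - 1 characters."""
    n = len(alphabet)
    if observed not in alphabet or n < 2:
        raise ValueError("observed must be in an alphabet of at least two characters")
    other = (1.0 - p) / (n - 1)
    return {ch: (p if ch == observed else other) for ch in alphabet}


ALPHABET = "abcdefghijklmnopqrstuvwxyz"  # hypothetical character set, N = 26
posteriors = uniform_posteriors("b", ALPHABET, p=0.9)
```

With N = 26, each of the 25 other characters receives 0.1/25 = 0.004, so the probabilities sum to 1.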

일 실시 예에 따른 프로세서(220)는, 제1 문자열로부터 획득되는 복수의 추정 문자열들의 가능도를 계산하는 가능도 계산부를 포함할 수 있다. 또한, 프로세서(220)는, 사전 정보, 및 언어 모델을 이용하여, 가능도로부터 제2 문자열을 획득하는 디코더를 포함할 수 있다. 프로세서(220)는, 사전 정보 및 언어 모델을 이용하여, 제1 문자열로부터 획득되는 가능도에 대한 재 디코딩을 수행함으로써 제2 문자열을 획득할 수 있다.The processor 220 according to an embodiment may include a likelihood calculator that calculates the likelihood of a plurality of estimated strings obtained from the first string. Further, the processor 220 may include a decoder that obtains a second character string from the likelihood by using dictionary information and a language model. The processor 220 may obtain the second character string by performing re-decoding of the likelihood obtained from the first character string using the dictionary information and the language model.

일 예로서, 프로세서(220)의 디코더는, 서버(200) 내에 저장되는 사전 정보 및 언어 모델을 기반으로 제2 문자열을 획득할 수 있다. 디코더는, 제1 문자열로부터 획득되는 복수의 추정 문자열들의 가능도가 입력됨에 따라, 제2 문자열을 출력할 수 있다. 예를 들어, 프로세서(220)의 디코더는 WFST(weighted Finite State Transducer) 디코더를 포함할 수 있다.As an example, the decoder of the processor 220 may obtain the second character string based on dictionary information and a language model stored in the server 200. The decoder may output the second character string as the likelihood of the plurality of estimated character strings obtained from the first character string is input. For example, the decoder of the processor 220 may include a weighted finite state transducer (WFST) decoder.

When the processor 220 performs WFST decoding, the server 200 according to an embodiment may construct a search space as a WFST and perform decoding based on a relationship between characters (T), dictionary information (L) including mapping information between words and characters, and a language model (G) that estimates the probabilities of the words that follow a given word sequence.

As another example, the decoder of the processor 220 may recalculate the likelihoods of the plurality of estimated character strings obtained from the first character string, based on the dictionary information and the language model. The decoder may determine, from among the plurality of estimated character strings, a second character string that maximizes the recalculated likelihood. For example, the decoder of the processor 220 may include a Viterbi decoder. The Viterbi decoder may find the most likely character string for the first character string as the second character string, taking the dictionary information and the language model into account.
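The re-scoring step described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the `LEXICON` set stands in for the dictionary information, the toy unigram table stands in for the language model, and the names `rescore`, `lm_logprob`, and `lm_weight` are hypothetical.

```python
import math

# Hypothetical stand-ins for the server's dictionary information and language model.
LEXICON = {"the", "cat", "and", "deers", "cardinals", "baseball", "team"}
UNIGRAM_LOGP = {"the": -1.0, "cat": -4.0, "and": -1.5, "deers": -9.0,
                "cardinals": -5.0, "baseball": -4.0, "team": -3.0}

def lm_logprob(words):
    """Toy unigram language model: sum of per-word log-probabilities."""
    return sum(UNIGRAM_LOGP.get(w, -10.0) for w in words)

def rescore(candidates, lm_weight=1.0):
    """Return (text, score) for the candidate with the highest combined score.

    candidates: list of (text, log_likelihood) pairs, where log_likelihood is
    the score assigned before the dictionary and language model are applied.
    """
    best_text, best_score = None, -math.inf
    for text, log_likelihood in candidates:
        words = text.lower().split()
        if any(w not in LEXICON for w in words):
            continue  # skip candidates containing out-of-dictionary words
        score = log_likelihood + lm_weight * lm_logprob(words)
        if score > best_score:
            best_text, best_score = text, score
    return best_text, best_score
```

With these toy scores, a candidate that has a slightly lower input likelihood can still win once the language model is applied, which is the effect the recalculation is meant to achieve.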

The communication unit 210 according to an embodiment of the present disclosure may transmit the second character string to the device 100. Alternatively, the communication unit 210 may transmit to the device 100 a response message to the voice signal, generated by the processor 220. The processor 220 may interpret the second character string using a natural language understanding (NLU) model and generate the response message to the voice signal based on the result of the interpretation.

The processor 220 may determine the type of the response message by applying a dialog manager (DM) model to the result of interpreting the second character string. The processor 220 may generate a response message of the determined type using a natural language generation (NLG) model and transmit it to the device 100.

Alternatively, the communication unit 210 may transmit, to the device 100, information related to a voice assistant service generated based on the second character string. To provide the voice assistant service based on the second character string, the processor 220 may use the NLU model, the DM model, the NLG model, and the like in the server 200 to provide the device 100 with information for conducting a conversation with the user. The processor 220 may also generate a control command for controlling the device 100 or another device based on the result of interpreting the second character string, and provide the generated control command to the device 100.

Hereinafter, with reference to FIG. 8A, a method by which each component of the server 200 according to an embodiment supports speech recognition of the device 100 is described. FIG. 8A illustrates an example in which the user of the device 100 utters "The Cardinals baseball team".

First, the device 100 may perform speech recognition on the user's voice signal and estimate the first character string [The cat and deers baseball team].

The device 100 may determine whether to replace the first character string with another character string based on the reliability of the first character string, the domain to which the first character string is related, the result of interpreting the meaning of the first character string, or whether the first character string includes a named entity. The specific method by which the device 100 determines whether to perform server-based speech recognition in order to replace the first character string has been described above with reference to FIGS. 4A to 5B, and a duplicate description is therefore omitted.

In FIG. 8A, the device 100 may determine that the first character string needs to be replaced with another character string, and may transmit the first character string [The cat and deers baseball team] to the server 200.

When transmitting the first character string to the server 200, the device 100 according to an embodiment may transmit information related to the voice signal together with the first character string. The device 100 according to an embodiment may transmit, together with the first character string, information related to the length of the voice signal frames represented by each character in the first character string. For example, the device 100 may transmit to the server 200 a first character string synchronized to the voice signal frames. A character string synchronized to the voice signal frames may mean a character string that includes a character for each of the voice signal frames into which the voice signal is divided at predetermined time intervals.

However, the present disclosure is not limited to an embodiment in which the device 100 transmits a frame-synchronized character string to the server 200. The device 100 according to an embodiment may transmit a first character string that is not frame-synchronized to the server 200. A first character string that is not frame-synchronized may mean a character string obtained in a label-synchronized manner, so that it contains exactly one character for each character pronounced in the voice signal.
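The difference between the two representations can be illustrated with a short Python sketch that collapses a frame-synchronized string into a label-synchronized one. The collapsing rule (merge consecutive repeats, then drop a blank symbol) and the `BLANK` symbol itself are assumptions borrowed from common practice; the text above does not specify how repeated characters are separated.

```python
from itertools import groupby

BLANK = "_"  # hypothetical "no character" symbol separating repeated letters

def frames_to_labels(frame_chars):
    """Collapse one-character-per-frame output into one character per label."""
    # Merge runs of identical consecutive frame characters.
    collapsed = [ch for ch, _ in groupby(frame_chars)]
    # Drop the blank symbol, which only marks boundaries between labels.
    return "".join(ch for ch in collapsed if ch != BLANK)

print(frames_to_labels("ccaa_tt"))
print(frames_to_labels("hhe_ll_llo"))
```

The blank symbol is what allows a doubled letter, as in "hello", to survive the collapsing step.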

The device 100 according to an embodiment may transmit a first character string that is not frame-synchronized to the server 200, together with information related to the voice signal. The server 200 may generate a frame-synchronized character string by forcibly aligning the first character string based on the information related to the voice signal. For example, the information related to the voice signal may include information on the voice signal section from which the speech recognition model of the device 100 obtained the first character string.
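As a rough Python illustration of producing a frame-synchronized string on the server side, the sketch below expands each character by the number of frames it covers. It assumes the per-character frame counts are already known (for example, sent by the device); a real forced alignment would derive them from the voice signal. The function name `expand_to_frames` is hypothetical.

```python
def expand_to_frames(text, frame_counts):
    """Repeat each character of `text` once per frame it spans.

    frame_counts[i] is the assumed number of voice signal frames
    covered by text[i].
    """
    if len(text) != len(frame_counts):
        raise ValueError("need exactly one frame count per character")
    return "".join(ch * n for ch, n in zip(text, frame_counts))

print(expand_to_frames("cat", [2, 3, 1]))
```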

The server 200 according to an embodiment of the present disclosure may receive the first character string from the device 100 through the communication unit 210. The server 200 according to an embodiment may receive a frame-synchronized first character string. However, as described above, the server 200 may also receive a first character string that is not frame-synchronized. In this case, the server 200 may receive from the device 100, together with the first character string that the device 100 obtained from the voice signal, information related to the voice signal. The server 200 may generate a frame-synchronized first character string by forcibly aligning the first character string based on the information related to the voice signal.

The processor 220 may identify a plurality of estimated character strings from the first character string and obtain a second character string based on the plurality of estimated character strings.

According to an embodiment, the processor 220 may identify replacement characters whose pronunciation is similar to that of each character in the first character string, and may determine, based on the identified replacement characters, estimated character strings in which at least one character in a given character string is corrected to another character. The processor 220 may select the most suitable of the determined estimated character strings based on pre-stored information such as the language model and the dictionary information, and obtain it as the second character string.

First, the processor 220 may calculate the likelihoods of the plurality of estimated character strings obtained from the first character string.

The first character string estimated from the voice signal by the device 100 is obtained by applying the language model and dictionary information stored in the device 100 to the probability distribution of the voice signal frames corresponding to each possible character. The server 200 may remove, from the first character string estimated by the device 100, the bias related to the language model and dictionary information of the device 100, and may perform re-decoding using the language model and dictionary information stored in the server 200.

To remove from the first character string the bias based on the language model and dictionary information of the device 100, the server 200 may calculate the likelihoods of the plurality of estimated character strings obtained from the first character string.

The processor 220 may obtain the second character string by decoding the likelihoods obtained from the first character string using the dictionary information and language model stored in the memory 230. When the processor 220 decodes using the dictionary information and language model stored in the memory 230 of the server 200, it can draw on dictionary information and a language model that include a vast number of named entities, so the accuracy of speech recognition may increase.

For example, the named entity "Cardinals" may not be stored in the language model in the memory of the device 100. Accordingly, the device 100 may erroneously estimate the first character string [The cat and deers baseball team] from the voice signal "The Cardinals baseball team".

On the other hand, as shown in FIG. 8A, the named entity "Cardinals" of the sports domain may be stored in the memory 230 of the server 200. Accordingly, the processor 220 of the server 200 may determine that 'cat and deers' estimated by the device 100 is more likely to actually be 'Cardinals', the name of a baseball team.

The processor 220 may identify replacement characters whose pronunciation is similar to that of each character in the first character string, and may obtain, based on the identified replacement characters, a second character string in which at least one character in the first character string is corrected to another character. Accordingly, the processor 220 may obtain the second character string [The Cardinals baseball team], in which 'cat and deers' in the first character string is replaced with 'Cardinals'. A specific method of obtaining the second character string [The Cardinals baseball team] from the first character string [The cat and deers baseball team], for example by using WFST decoding, is described later with reference to FIG. 17.

The server 200 may transmit the second character string to the device 100. The device 100 may replace the first character string estimated by the device 100 with the second character string received from the server 200 and output it. As shown in FIG. 8A, for example, the reliability of the first character string [The cat and deers baseball team] may be 0.1 and the reliability of the second character string [The Cardinals baseball team] may be 0.5. The device 100 according to an embodiment may improve speech recognition performance by receiving from the server 200, and using, a second character string whose reliability is higher than that of the first character string.

As described above, the server 200 according to an embodiment may receive a frame-synchronized character string from the device 100, or may generate a frame-synchronized character string from the character string received from the device 100. The server 200 may determine a replacement character string by obtaining a likelihood for each character corresponding to each voice signal frame. The server 200 may receive an entire character string including a plurality of characters at once, or may sequentially receive at least some of the characters included in the character string.

Hereinafter, with reference to FIG. 8B, a method by which the server 200 according to an embodiment determines a replacement character string by obtaining a likelihood for each character corresponding to each voice signal frame is described in more detail.

The server 200 according to an embodiment may receive a frame-synchronized first character string from the device 100, or may generate a frame-synchronized first character string from the character string received from the device 100.

For example, the communication unit 210 of the server 200 may receive from the device 100, together with the character string that the device 100 obtained from the voice signal, information related to the voice signal. The server 200 may generate a frame-synchronized first character string by forcibly aligning the character string based on the information related to the voice signal.

For each character in the frame-synchronized first character string, the character string evaluation unit 221 of the server 200 may calculate likelihood matrices 813 for the replacement characters with which that character may be replaced.

According to an embodiment, the likelihood matrix calculated by the character string evaluation unit 221 for a given character may mean a matrix containing likelihood values for the replacement characters with which the given character may be replaced. The likelihood value of a replacement character may mean the probability that the given character is estimated as the speech recognition result, assuming that the replacement character is the ground truth character.

For example, for a character "a" included in a character string obtained as a speech recognition result, a likelihood matrix [0.4 0.01 0.01 0.01 0.2 … 0.01] may be obtained that contains the probability value for the ground truth character being "a", the probability value for "b", the probability value for "c", …, and the probability value for "z". When obtaining the likelihood matrix of likelihood values for the replacement characters corresponding to each character in a character string, high likelihood values may be assigned to replacement characters whose pronunciation is similar to that of the character.
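A likelihood matrix of this kind could, for instance, be derived from confusion counts, as in the Python sketch below. The three-letter alphabet, the counts in `CONFUSION`, and the function name `likelihood_row` are all hypothetical and chosen only for illustration; each value is an estimate of the probability that the recognized character is output when a given character is the ground truth.

```python
ALPHABET = "abe"

# CONFUSION[truth][recognized] = hypothetical number of times the recognizer
# output `recognized` when the ground truth character was `truth`.
CONFUSION = {
    "a": {"a": 8, "b": 0, "e": 2},
    "b": {"a": 1, "b": 9, "e": 0},
    "e": {"a": 4, "b": 0, "e": 6},
}

def likelihood_row(recognized):
    """Return [P(recognized | truth) for each truth in ALPHABET]."""
    row = []
    for truth in ALPHABET:
        total = sum(CONFUSION[truth].values())
        row.append(CONFUSION[truth][recognized] / total)
    return row

print(likelihood_row("a"))
```

In this toy table the similar-sounding "e" receives a higher likelihood than "b" for a recognized "a". Because each entry conditions on a different ground truth character, the entries need not sum to 1.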

Based on the likelihood matrices 813, the decoder 223 of the server 200 may select one estimated character string from among a plurality of estimated character strings in which at least one character in the frame-synchronized first character string is replaced, and may obtain the selected estimated character string as the second character string.

For example, the decoder 223 may recalculate the likelihood matrices 813 based on the dictionary information and the language model. The decoder 223 may determine, from among the plurality of estimated character strings, a second character string that maximizes the recalculated likelihood. For example, the decoder 223 may include a Viterbi decoder. The Viterbi decoder may find the most likely character string for the first character string as the second character string, taking the dictionary information and the language model into account.
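The following Python sketch shows a generic Viterbi search over per-position character likelihoods, using a character bigram table as a simple stand-in for the dictionary information and language model. It is an illustration under those assumptions, not the decoder of the disclosure; the names `viterbi`, `emissions`, `log_trans`, `log_init`, and `FLOOR` are hypothetical.

```python
import math

FLOOR = math.log(0.01)  # assumed log-score for unseen transitions or starts

def viterbi(emissions, log_trans, log_init):
    """Find the highest-scoring character sequence.

    emissions: list (one entry per position) of {char: log-likelihood}.
    log_trans: {(prev_char, char): log-probability} (toy bigram model).
    log_init:  {char: log-probability of starting with char}.
    Returns (best_string, best_log_score).
    """
    # best[c] = (score of the best path ending in c, that path)
    best = {c: (log_init.get(c, FLOOR) + e, [c])
            for c, e in emissions[0].items()}
    for position in emissions[1:]:
        new_best = {}
        for c, e in position.items():
            # Choose the best predecessor for character c.
            score, path = max(
                ((s + log_trans.get((p, c), FLOOR), pth)
                 for p, (s, pth) in best.items()),
                key=lambda item: item[0],
            )
            new_best[c] = (score + e, path + [c])
        best = new_best
    score, path = max(best.values(), key=lambda item: item[0])
    return "".join(path), score

emissions = [
    {"c": math.log(0.9), "k": math.log(0.1)},
    {"o": math.log(0.6), "a": math.log(0.4)},
    {"t": math.log(1.0)},
]
log_trans = {("c", "a"): math.log(0.5), ("c", "o"): math.log(0.1),
             ("a", "t"): math.log(0.5), ("o", "t"): math.log(0.1)}
log_init = {"c": math.log(0.5), "k": math.log(0.5)}
print(viterbi(emissions, log_trans, log_init)[0])
```

In this toy input the second position favors "o" on likelihood alone, but the bigram scores make the path through "a" the overall best.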

Based on the likelihoods of the plurality of estimated character strings, the dictionary information, and the language model, the decoder 223 of the server 200 may obtain, as the second character string, the character string 817 having the highest reliability among the plurality of estimated character strings 815. The server 200 may transmit the second character string to the device 100. The device 100 may improve speech recognition performance by receiving from the server 200, and using, a second character string whose reliability is higher than that of the first character string.

Hereinafter, various embodiments of a method by which the server 200 calculates the likelihood are described in detail with reference to FIGS. 9 to 11B.

FIG. 9 is a detailed block diagram of a server according to an embodiment.

As illustrated in FIG. 9, the communication unit 210 of the server 200 may receive the first character string from the device 100.

The character string evaluation unit 221 of the processor 220 may output evaluation information on the first character string that enables the decoder 223 to recommend and output a second character string having higher reliability than the first character string. For example, the evaluation information on the first character string may include a likelihood calculated from the first character string.

The character string evaluation unit 221 may calculate the likelihoods of a plurality of estimated character strings obtained from the first character string. The character string evaluation unit 221 may obtain the plurality of estimated character strings by replacing each character in the first character string with a different character. The likelihood of an estimated character string may mean the probability that the speech recognition module estimates the first character string, assuming that the estimated character string is the ground truth character string.
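Under the simplifying assumptions that candidates differ from the first character string only by same-length substitutions and that characters are confused independently of one another, the likelihood of a candidate could be sketched in Python as a product of per-character terms. The probabilities in `P_RECOGNIZED_GIVEN_TRUTH`, the `FLOOR` value, and the function name are hypothetical.

```python
import math

# Hypothetical P(recognized | truth), keyed by (recognized, truth).
P_RECOGNIZED_GIVEN_TRUTH = {
    ("c", "c"): 0.9, ("c", "k"): 0.5,
    ("a", "a"): 0.8, ("a", "e"): 0.4,
    ("t", "t"): 0.9, ("t", "d"): 0.3,
}
FLOOR = 1e-4  # assumed probability for character pairs missing from the table

def string_log_likelihood(first_string, candidate):
    """log P(first_string | candidate is the ground truth)."""
    if len(first_string) != len(candidate):
        raise ValueError("this sketch only handles same-length substitutions")
    return sum(
        math.log(P_RECOGNIZED_GIVEN_TRUTH.get((r, t), FLOOR))
        for r, t in zip(first_string, candidate)
    )

print(string_log_likelihood("cat", "cat") > string_log_likelihood("cat", "ket"))
```

Working in log space avoids numerical underflow when the character string is long.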

The likelihood obtained from the first character string and output by the character string evaluation unit 221 may be used to identify replacement characters whose pronunciation is similar to that of each character in the first character string, and to determine, based on the identified replacement characters, estimated character strings in which at least one character in a given character string is corrected to another character.

For each character in the first character string, the character string evaluation unit 221 may calculate likelihood matrices for the replacement characters with which that character may be replaced, and may identify the plurality of estimated character strings based on the likelihood values in the likelihood matrices. The character string evaluation unit 221 may output the likelihood matrices obtained from each character as the likelihoods of the plurality of estimated character strings.

The character string evaluation unit 221 may calculate the likelihood from the first character string by using the likelihood calculation data 231 stored in the memory 230. For example, the likelihood calculation data 231 may include a neural network trained for likelihood calculation, or an error matrix.

As an example, the character string evaluation unit 221 may calculate posterior probabilities of each character in the first character string based on the characters that precede it. The character string evaluation unit 221 may calculate a character sequence probability from the first character string. The character string evaluation unit 221 may calculate the likelihoods of the plurality of estimated character strings obtained from the first character string based on the posterior probabilities of each character and the character sequence probability.
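One common way to turn posterior probabilities into likelihoods is Bayes' rule, under which the likelihood is proportional to the posterior divided by the prior. The Python sketch below applies that rule per character and sums the results in log space. Whether the embodiment combines the posterior probabilities and the character sequence probability in exactly this way is an assumption, and the function name is hypothetical.

```python
import math

def scaled_log_likelihood(posteriors, priors):
    """Sum of log(posterior / prior) over the characters of a candidate.

    posteriors[i]: assumed posterior probability of the i-th candidate character.
    priors[i]:     assumed prior (character sequence) probability of that character.
    """
    if len(posteriors) != len(priors):
        raise ValueError("posteriors and priors must have the same length")
    if any(p <= 0.0 for p in posteriors) or any(q <= 0.0 for q in priors):
        raise ValueError("probabilities must be positive")
    return sum(math.log(p) - math.log(q) for p, q in zip(posteriors, priors))

print(scaled_log_likelihood([0.5, 0.25], [0.25, 0.25]))
```

Dividing by the prior is what removes the contribution of the model that produced the posterior, so that a different language model can be applied afterwards.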

As another example, the character string evaluation unit 221 may calculate posterior probabilities of each character in the first character string based on a predetermined error matrix. The character string evaluation unit 221 may calculate the likelihoods of the plurality of estimated character strings obtained from the first character string based on the posterior probabilities of each character in the first character string.

After the character string evaluation unit 221 calculates the likelihood, the decoder 223 may obtain the second character string based on the calculated likelihood, using the dictionary information and the language model. The decoder 223 may obtain the second character string that maximizes the likelihood from among the plurality of estimated character strings obtained by replacing at least one character in the first character string with another character.

The decoder 223 may use the dictionary information 232 and the language model 233 to obtain a second character string in which at least one character in the first character string is replaced with another character. For example, the decoder 223 may include a WFST decoder that takes the likelihood as input, or a Viterbi decoder that uses conventional token passing.

According to an embodiment of the present disclosure, the dictionary information stored in the server 200 may be dictionary information that stores relationships between words and character strings, rather than general dictionary information that stores relationships between phoneme sequences and words. The language model may be an artificial intelligence model that has learned relationships between words so that it can estimate the probabilities of the words that follow a given word sequence. For example, the language model may be a neural network such as an RNN, or a statistical n-gram model.

The communication unit 210 may transmit the second character string to the device 100. However, the present disclosure is not limited to an embodiment in which the second character string is transmitted to the device 100. The server 200 according to an embodiment of the present disclosure may identify the user's utterance intent through natural language processing of the second character string, and may transmit information related to a voice assistant service based on the second character string to the device 100 through the communication unit 210.

The information related to the second character string that is transmitted from the server 200 to the device 100 according to various embodiments of the present disclosure has been described above with reference to FIGS. 2B and 2C, and a detailed description is therefore omitted.

Meanwhile, the decoder 223 of the server 200 according to an embodiment of the present disclosure may decode the first character string using dictionary information and a language model that differ for each domain. Accordingly, the server 200 according to an embodiment of the present disclosure may output a speech recognition result with improved accuracy by re-decoding the first character string received from the device 100.

The processor 220 of the server 200 according to an embodiment may receive the first character string from the device 100 and determine a domain related to the first character string. The decoder 223 of the server 200 may decode the first character string using the dictionary information and language model corresponding to the determined domain.

As an example, the processor 220 of the server 200 may receive from the device 100, together with the first character string, domain information related to the first character string, and may determine, based on the received information, the domain in which to decode the first character string. For example, the processor 220 may determine, as the domain in which to perform decoding, a domain that is the same as or similar to the domain identified from the information received from the device 100.

As another example, the processor 220 of the server 200 may determine a domain related to the received first character string based on the first character string itself. Although not shown in FIG. 9, the server 200 may store in the memory 230 a domain identification model, which is an artificial intelligence model trained for domain identification. Using the domain identification model, the processor 220 may output a domain reliability with the first character string as the input value. The processor 220 may determine the domain related to the first character string based on the domain reliability. According to an embodiment, the character string evaluation unit 221 or the decoder 223 of the server 200 may determine the domain related to the received first character string based on the first character string received from the device 100.
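The selection of a domain from per-domain reliability values can be sketched in Python as follows. A keyword-overlap score stands in for the trained domain identification model, which the text above does not specify; the keyword sets, the threshold, the fallback domain name, and the function names are all hypothetical.

```python
# Hypothetical keyword sets standing in for a trained domain identification model.
DOMAIN_KEYWORDS = {
    "sports": {"baseball", "team", "score"},
    "music": {"song", "album", "play"},
}

def domain_confidences(text):
    """Return {domain: reliability in [0, 1]} for the given character string."""
    words = text.lower().split()
    if not words:
        return {domain: 0.0 for domain in DOMAIN_KEYWORDS}
    return {
        domain: sum(w in keywords for w in words) / len(words)
        for domain, keywords in DOMAIN_KEYWORDS.items()
    }

def pick_domain(text, threshold=0.2, fallback="general"):
    """Pick the most reliable domain, or the fallback if none is reliable enough."""
    confidences = domain_confidences(text)
    domain = max(confidences, key=confidences.get)
    return domain if confidences[domain] >= threshold else fallback

print(pick_domain("The cat and deers baseball team"))
```

Note that the domain can be identified from the surrounding words even when the named entity itself was misrecognized.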

For example, the decoder 223 of the server 200 may determine the domain related to the first character string based on the first character string received from the device 100. The decoder 223 according to an embodiment may decode the received first character string using a dictionary and a language model specialized for the determined domain.

The decoder 223 according to an embodiment may be a two-pass decoder. The two-pass decoder may perform primary decoding on the evaluation information for the first character string received from the character string evaluation unit 221, and then perform secondary decoding using the result of the primary decoding.

In this case, the decoder 223 according to an embodiment of the present disclosure may use a first-pass decoder to perform decoding with a general dictionary and language model. The decoder 223 may then use a second-pass decoder to perform decoding with a dictionary and language model specialized for the domain determined for the received first character string.

As yet another example, the communication unit 210 of the server 200 according to an embodiment may receive, from the device 100, information for determining the domain related to the first character string together with the first character string. For example, the information for determining the domain received from the device 100 may include context information. The context information may include at least one of information about an application the user is currently using on the device 100 or the server 200, conversation history information, situation information around the device 100, or trend information. The processor 220 of the server 200 may determine, based on the context information, the domain in which the first character string is to be decoded. A specific method of determining the domain based on the context information is described below.

For example, the processor 220 may determine the domain based on the application the user is currently using. When the user is using a map application on the device 100 or the server 200, the processor 220 may, in determining the domain for a character string obtained from the user's utterance, select a map-related domain as the domain in which to perform decoding. For example, the processor 220 may determine the decoding domain by assigning a higher weight to the map domain, or may select the map domain directly as the decoding domain.

Alternatively, for example, the processor 220 may determine the domain based on conversation history information. When the user's conversation history is determined to be related to "music", the processor 220 may, in determining the domain for a character string obtained from the user's utterance, select a music-related domain as the domain in which to perform decoding. For example, the processor 220 may determine the decoding domain by assigning a higher weight to the music domain, or may select the music domain directly as the decoding domain.

Alternatively, for example, the processor 220 may determine the domain based on situation information around the device 100 sensed by a sensor mounted on the device 100. The processor 220 may determine the domain based on the location of the device 100 identified using GPS information of the device 100. When the user wants to search for a restaurant, the processor 220 may select a domain related to the location of the device 100 as the domain in which to perform decoding. When the device 100 is located near a movie theater, the processor 220 may select a movie-related domain as the domain in which to perform decoding.

Alternatively, for example, the processor 220 may determine the domain based on trend information. The processor 220 may select, as the domain in which to perform decoding, a domain related to major news or to real-time search terms on a portal site.
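The context-based weighting described above can be illustrated with a minimal Python sketch. The cue names, the `CONTEXT_TO_DOMAIN` table, the `boost` factor, and the function `choose_domain` are all hypothetical and are not taken from the disclosure; the sketch only shows one way a higher weight could be given to a domain suggested by context before the decoding domain is chosen.

```python
from typing import Dict, List

# Hypothetical mapping from a context cue to the domain it favors.
CONTEXT_TO_DOMAIN: Dict[str, str] = {
    "app:map": "map",            # user is currently using a map application
    "history:music": "music",    # conversation history relates to music
    "location:cinema": "movie",  # device is located near a movie theater
}


def choose_domain(base_scores: Dict[str, float],
                  context_cues: List[str],
                  boost: float = 2.0) -> str:
    """Return the domain with the highest score after context weighting.

    base_scores holds an initial score per domain (for example, a domain
    reliability). Each recognized context cue multiplies the score of the
    domain it favors by `boost` (an assumed weighting scheme).
    """
    scores = dict(base_scores)
    for cue in context_cues:
        domain = CONTEXT_TO_DOMAIN.get(cue)
        if domain is not None and domain in scores:
            scores[domain] *= boost
    return max(scores, key=scores.get)
```

With base scores `{"map": 0.3, "music": 0.4, "movie": 0.3}`, the call `choose_domain(scores, ["app:map"])` selects the map domain, while the same call without any cue selects the music domain.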

Hereinafter, a case in which the character string evaluation unit 221 of the server 200 according to an embodiment obtains a likelihood for each character in the first character string based on the characters accumulated before that character will be described in detail.

The communication unit 210 of the server 200 according to an embodiment may receive a frame-synchronized first character string y_o[0:L+1] from the device 100. Since the frame-synchronized character string has been described above with reference to FIG. 6, redundant descriptions are omitted.

In the following description, y_o[L] may be a frame-synchronized character estimated from a speech signal by the on-device speech recognition module. A frame-synchronized character may mean a character estimated from one speech frame included in the speech signal. y_o[L] is included in V, the set of all characters.

y_o[0:L+1] denotes the sequence of y_o[L'] for 0 ≤ L' ≤ L. L and L' are indices into the character string.

The communication unit 210 may receive an entire character string including a plurality of characters at once, or may sequentially receive at least some of the characters included in the character string.

y_p[L] denotes a frame-synchronized character estimated by the server in order to post-process the character string obtained by the device. y_p[L] is included in V, the set of characters. W_i denotes a word string. W_i is a word included in D, the set of words.

The character string evaluation unit 221 of the server 200 according to an embodiment may calculate the character arrangement probability P(y_o[0:L+1]), that is, the probability that characters are arranged as in the first character string y_o[0:L+1]. The character arrangement probability P(y_o[0:L+1]) may be calculated from a character-level language model.

When the first character string y_o[0:L+1] has been estimated by the device 100, the character string evaluation unit 221 according to an embodiment may calculate the posterior probabilities P(y_p[L]|y_o[0:L+1]) that the L-th character is actually y_p[L]. The character string evaluation unit 221 may calculate the posterior probabilities P(y_p[L]|y_o[0:L+1]) of the character y_o[L] based on the first character string y_o[0:L+1]. That is, the posterior probability calculation unit 225 may calculate, based on the first character string y_o[0:L+1], the probability that the character y_o[L] was estimated correctly by the device 100 and the probabilities that the character y_o[L] was estimated incorrectly.

The character string evaluation unit 221 according to an embodiment may use a neural network to calculate, from the first character string, the posterior probabilities of each character of the first character string.

The character string evaluation unit 221 according to an embodiment may calculate the posterior probabilities of each character in the first character string using a recurrent neural network including the LSTM layer 1010 and the softmax layer 1030 shown in FIG. 10A.

The LSTM layer 1010 shown in FIG. 10A may include a plurality of stacked LSTM layers. In FIG. 10A, the first character string is input to the LSTM layer 1010, the data output from the LSTM layer 1010 is input to the softmax layer 1030, and the softmax layer 1030 may output the posterior probabilities of each character of the first character string.
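As a small illustration of the final stage of this network, the following Python sketch shows how a softmax layer turns a vector of scores into a probability distribution over candidate characters. The LSTM layers themselves are not modeled here; the input scores and the function name `softmax` are assumptions made only for the example.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Convert raw scores into probabilities that sum to 1.

    The maximum score is subtracted first for numerical stability;
    this does not change the result.
    """
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for three candidate characters at one position.
posteriors = softmax([2.0, 1.0, 0.1])
```

Each entry of `posteriors` can be read as the posterior probability of one candidate character at that position, and the largest score receives the largest probability.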

According to an embodiment, the neural network that calculates the posterior probabilities of each character in a character string may be trained on ground-truth character strings and on erroneous character strings output from a speech recognition module. Specifically, the neural network may be trained so that, when it receives as input an erroneous character string output from the speech recognition module, its output approaches the ground-truth character string.

The artificial intelligence model used by the character string evaluation unit 221 according to an embodiment to obtain the posterior probabilities may be trained on the speech recognition results of a plurality of speech recognition modules, in order to prevent overfitting to the speech recognition results of one particular speech recognition module.

The character string evaluation unit 221 of the processor 220 may calculate the likelihood P(y_o[0:L+1]|y_p[L]) based on the posterior probabilities P(y_p[L]|y_o[0:L+1]) and the character arrangement probability P(y_o[0:L+1]).

The likelihood P(y_o[0:L+1]|y_p[L]) may be calculated by Equation 1 below based on the posterior probability and the character arrangement probability.

[Equation 1]

In [Equation 1], P(y_p[L]) denotes the prior probability of y_p[L]. The prior probability of a given character y_p[L] may be a value statistically calculated in advance based on the frequency with which the character is used.
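The quantities named around Equation 1 (the posterior probability, the character arrangement probability, and the prior probability) are related by Bayes' rule. The following form is a reconstruction from those definitions, not a reproduction of the original equation image:

```latex
P\big(y_o[0:L+1] \,\big|\, y_p[L]\big)
  = \frac{P\big(y_p[L] \,\big|\, y_o[0:L+1]\big)\; P\big(y_o[0:L+1]\big)}{P\big(y_p[L]\big)}
```

Here the numerator multiplies the posterior probability by the character arrangement probability, and the denominator is the prior probability of y_p[L].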

The decoder 223 of the server 200 according to an embodiment may estimate a second character string W_i from the likelihood P(y_o[0:L+1]|y_p[L]) using the dictionary information 232 and the language model 233. The second character string may be a character string in which at least one character of the first character string is replaced with another character. The communication unit 210 may transmit the second character string W_i to the device 100. Although the server 200 receives the frame-synchronized character string y_o[0:L+1] from the device 100, it may transmit to the device 100 the second character string W_i, which has the form of a word string.

Meanwhile, the character string evaluation unit 221 of the server 200 according to another embodiment may calculate the likelihood by considering only each character itself, without considering the characters accumulated before it. The character string evaluation unit 221 according to this embodiment may calculate the likelihood by considering only the character y_o[L] instead of the character string y_o[0:L+1]. When only the character y_o[L] is considered instead of the character string y_o[0:L+1], the structure of the server 200 becomes much simpler, and the calculation is simplified because only a character-level error matrix is stored and used in place of a neural network.

The communication unit 210 of the server 200 may receive the frame-synchronized first character string y_o[0:L+1] from the device 100. Since the frame-synchronized character string has been described above with reference to FIG. 6, redundant descriptions are omitted. The communication unit 210 may receive an entire character string including a plurality of characters at once, or may sequentially receive at least some of the characters included in the character string.

When the first character y_o[L] in the first character string has been estimated by the device 100, the character string evaluation unit 221 of the server 200 according to another embodiment may obtain the posterior probabilities P(y_p[L]|y_o[L]) that the L-th character is actually y_p[L]. The character string evaluation unit 221 may obtain the posterior probabilities P(y_p[L]|y_o[L]) of the character y_o[L] based on the first character y_o[L]. That is, the character string evaluation unit 221 may obtain, based on the first character y_o[L], the probability that the character y_o[L] was estimated correctly by the device 100 and the probabilities that the character y_o[L] was estimated incorrectly.

The character string evaluation unit 221 according to an embodiment may use an error matrix to obtain the posterior probabilities of each character from the first character string.

FIG. 10B shows an example of an error matrix 1001 used to calculate the posterior probabilities according to an embodiment.

The error matrix 1001 includes the probability that the speech recognition module of the device 100 correctly predicted a given character included in a character string, and the probabilities that it incorrectly predicted another character as the given character.

For example, since the characters "a" and "e" are pronounced similarly, the probability that the speech recognition module incorrectly estimates an actual character "a" as "e" may be relatively high. In contrast, since the characters "a" and "b" are pronounced very differently, the probability that the speech recognition module incorrectly estimates an actual character "a" as "b" may be relatively low.

Accordingly, as illustrated in FIG. 10B, the probability that the speech recognition module of the device 100 incorrectly estimates an actual character "a" as the character "e" may be 0.23, and the probability that it incorrectly estimates an actual character "a" as the character "b" may be 0.01.

When the character estimated by the device 100 is the first character y_o[L], the character string evaluation unit 221 according to an embodiment may obtain the posterior probabilities P(y_p[L]|y_o[L]) that the actual character is y_p[L] by looking them up in the error matrix 1001 of FIG. 10B.
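A minimal Python sketch of such a lookup follows. The three-character alphabet, every number in the table, and the row and column orientation (`POSTERIOR[observed][actual]`) are assumptions made for illustration; they are not the contents of the error matrix 1001.

```python
from typing import Dict

# Hypothetical error matrix:
# POSTERIOR[observed][actual] = P(y_p[L] = actual | y_o[L] = observed).
# Each row is a distribution over the actual character and sums to 1.
POSTERIOR: Dict[str, Dict[str, float]] = {
    "a": {"a": 0.75, "b": 0.02, "e": 0.23},
    "b": {"a": 0.01, "b": 0.98, "e": 0.01},
    "e": {"a": 0.23, "b": 0.01, "e": 0.76},
}


def posteriors_for(observed: str) -> Dict[str, float]:
    """Look up the posterior probabilities of every actual character
    given the character estimated by the device."""
    return POSTERIOR[observed]
```

Because the posterior depends only on the single observed character, the lookup replaces the neural network of the earlier embodiment with a table access.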

The character string evaluation unit 221 may calculate the likelihood P(y_o[L]|y_p[L]) based on the obtained posterior probabilities P(y_p[L]|y_o[L]).

The likelihood P(y_o[L]|y_p[L]) may be calculated by [Equation 2] below based on the posterior probabilities.

[Equation 2]

In [Equation 2], P(y_p[L]) denotes the prior probability of y_p[L]. The prior probability of a given character y_p[L] may be a value statistically calculated in advance based on the frequency with which the character is used.
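By analogy with the sequence-based case, the single-character likelihood can be written with Bayes' rule as below. This is a reconstruction rather than the original equation image; in particular, the marginal P(y_o[L]) is an assumption, since the surrounding text names only the posterior and the prior. That marginal does not depend on y_p[L], so it acts as a constant scale factor when candidates for y_p[L] are compared.

```latex
P\big(y_o[L] \,\big|\, y_p[L]\big)
  = \frac{P\big(y_p[L] \,\big|\, y_o[L]\big)\; P\big(y_o[L]\big)}{P\big(y_p[L]\big)}
```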

The decoder 223 of the server 200 according to another embodiment may estimate the second character string W_i from the likelihood P(y_o[L]|y_p[L]) using the dictionary information 232 and the language model 233. The second character string may be a character string in which at least one character of the first character string is replaced with another character. The communication unit 210 may transmit the second character string W_i to the device 100. Although the server 200 receives the frame-synchronized character string y_o[0:L+1] from the device 100, it may output the second character string W_i, which has the form of a word string.

As described above, the character string evaluation unit 221 of the server 200 according to an embodiment may receive a frame-synchronized character string from the device 100 and obtain a likelihood for each character corresponding to each speech signal frame. For example, for the character y_o[L] at index L corresponding to a speech signal frame, the character string evaluation unit 221 may calculate the likelihood P(y_o[0:L+1]|y_p[L]) or P(y_o[L]|y_p[L]).

Hereinafter, with reference to FIGS. 11A and 11B, a method by which the character string evaluation unit 221 according to an embodiment obtains, from the character string received from the device 100, a likelihood for each character corresponding to each speech signal frame will be described in detail.

As shown in FIG. 11A, the character string evaluation unit 221 according to an embodiment may receive a frame-synchronized character string 1101. For each character, the character string evaluation unit 221 may calculate a likelihood matrix over the replacement characters with which that character may be replaced.

As shown in FIG. 11B, the likelihood matrix over replacement characters calculated by the character string evaluation unit 221 according to an embodiment may be expressed as a matrix containing the likelihood values that a given character is each of a set of arbitrary characters. As shown in the table 1105 of FIG. 11B, each of the arbitrary characters may be mapped to a respective index of the likelihood matrix.

For example, the value at index a₁ in the likelihood matrix 1103 may represent the likelihood that a given character is replaced with the character "a" corresponding to index a₁. The value at index a₂ in the likelihood matrix 1103 may represent the likelihood that the given character is replaced with the character "b" corresponding to index a₂. The value at index a₃ in the likelihood matrix 1103 may represent the likelihood that the given character is replaced with the character "c" corresponding to index a₃.

For each character in the character string, the character string evaluation unit 221 according to an embodiment may calculate likelihood matrices 1107 over the replacement characters with which that character may be replaced. The character string evaluation unit 221 may output the calculated likelihood matrices 1107 to the decoder 223 as the likelihoods of a plurality of estimated character strings in which at least one character of the first character string has been replaced.
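The per-character likelihood matrices can be sketched in Python for the single-character (error matrix) variant. The tables `POSTERIOR`, `PRIOR`, and `OBSERVED`, their values, and the function names are hypothetical; the conversion from posterior to likelihood uses Bayes' rule with an assumed marginal for the observed character.

```python
from typing import Dict, List

# Hypothetical tables over a three-character alphabet.
# POSTERIOR[observed][actual] = P(y_p = actual | y_o = observed)
POSTERIOR: Dict[str, Dict[str, float]] = {
    "a": {"a": 0.75, "b": 0.02, "e": 0.23},
    "b": {"a": 0.01, "b": 0.98, "e": 0.01},
    "e": {"a": 0.23, "b": 0.01, "e": 0.76},
}
PRIOR: Dict[str, float] = {"a": 0.4, "b": 0.2, "e": 0.4}     # P(y_p), assumed
OBSERVED: Dict[str, float] = {"a": 0.4, "b": 0.2, "e": 0.4}  # P(y_o), assumed


def likelihood_row(observed: str) -> Dict[str, float]:
    """Likelihood P(y_o = observed | y_p = actual) for every actual character."""
    return {
        actual: POSTERIOR[observed][actual] * OBSERVED[observed] / PRIOR[actual]
        for actual in PRIOR
    }


def likelihood_matrices(first_string: str) -> List[Dict[str, float]]:
    """One likelihood row per frame-synchronized character of the first string."""
    return [likelihood_row(ch) for ch in first_string]
```

Calling `likelihood_matrices("ab")` returns two rows, one per character, each keyed by the candidate replacement character. A decoder can then combine these rows with dictionary and language model information.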

Based on the likelihoods received from the character string evaluation unit 221, the decoder 223 according to an embodiment may use the dictionary information and the language model to obtain, as the second character string, the character string with the highest reliability among the plurality of estimated character strings.
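The selection step can be illustrated with a simplified Python sketch. It enumerates every candidate string allowed by the per-character likelihood rows, keeps only those found in a dictionary, and scores each by the sum of log-likelihoods plus a log language model probability. The dictionary, the unigram language model, the example rows, and the brute-force search are all assumptions made for illustration; the sketch also ignores the collapsing of repeated frame-synchronized characters.

```python
import itertools
import math
from typing import Dict, List, Tuple

DICTIONARY = {"bad", "bed"}                         # hypothetical dictionary information
WORD_LM: Dict[str, float] = {"bad": 0.2, "bed": 0.8}  # hypothetical unigram language model

# Hypothetical likelihood rows, one per character position.
EXAMPLE_ROWS: List[Dict[str, float]] = [
    {"b": 0.9, "d": 0.1},
    {"a": 0.6, "e": 0.4},
    {"d": 0.9, "b": 0.1},
]


def decode(rows: List[Dict[str, float]]) -> Tuple[str, float]:
    """Return the dictionary word with the highest combined log score."""
    best_word, best_score = "", -math.inf
    for chars in itertools.product(*[list(row) for row in rows]):
        word = "".join(chars)
        if word not in DICTIONARY:
            continue  # the dictionary filters out non-words
        score = math.log(WORD_LM[word])
        score += sum(math.log(row[ch]) for row, ch in zip(rows, chars))
        if score > best_score:
            best_word, best_score = word, score
    return best_word, best_score
```

In `EXAMPLE_ROWS` the character likelihoods alone favor "bad", but the language model weight makes `decode(EXAMPLE_ROWS)` select "bed", which shows how a character of the first string can end up replaced in the second string.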

As described above, the speech recognition system according to various embodiments of the present disclosure may selectively perform either on-device speech recognition or server-based speech recognition, depending on the case. However, the present disclosure is not limited thereto. The device 300 according to an embodiment of the present disclosure may include a plurality of speech recognition modules and, in a manner similar to that described above, may selectively perform speech recognition in either a first speech recognition module or a second speech recognition module.

FIG. 12 is a block diagram of a device that selectively uses two speech recognition modules according to an embodiment.

Referring to FIG. 12, the device 300 may include a receiving unit 310, a processor 320, a memory 340, and an output unit 350. Not all of the components shown in FIG. 12 are essential components of the device 300. The device 300 may be implemented with more components than those shown in FIG. 12, or with fewer components than those shown in FIG. 12. For example, as shown in FIG. 19, the device 300 according to some embodiments may further include a user input unit 2100, a sensing unit 2400, and an A/V input unit 2600.

The receiving unit 310 according to an embodiment of the present disclosure may receive a speech signal from a user. For example, the receiving unit 310 may receive the speech signal by converting external sound into electrical acoustic data using a microphone. Although FIG. 12 shows the receiving unit 310 as being included inside the device 300, the receiving unit 310 according to another embodiment may be included in a separate device and connected to the device 300 by wire or wirelessly.

The memory 340 according to an embodiment of the present disclosure may store instructions for performing speech recognition, as well as various models, neural networks, dictionary information, and the like used for speech recognition.

The first data 341 stored in the memory 340 may include at least one of a model, a neural network, or dictionary information used by the first ASR module 321 to perform speech recognition. The second data 342 stored in the memory 340 may include at least one of a model, a neural network, or dictionary information used by the second ASR module 322 to perform speech recognition.

The processor 320 according to an embodiment of the present disclosure may perform speech recognition by executing one or more instructions stored in the memory 340. The processor 320 according to an embodiment of the present disclosure may include a first ASR module 321 and a second ASR module 322.

The first ASR module 321 of the processor 320 according to an embodiment of the present disclosure may receive the speech signal obtained by the receiving unit 310 and perform speech recognition on the speech signal based on the first data 341 (for example, an acoustic model, a neural network, a language model, or dictionary information). The first ASR module 321 may obtain a first character string from the speech signal. The first character string may be a frame-synchronized character string.

Since the first ASR module 321 of FIG. 12 may correspond to the ASR module 121 of FIG. 4A or the ASR module 121 of FIG. 4B, redundant descriptions are omitted.

The determination unit 323 of the processor 320 according to an embodiment of the present disclosure may determine whether to replace the first character string output from the first ASR module 321 with another character string.

As an example, the determination unit 323 of the processor 320 may determine the reliability of the first character string and, based on the reliability, determine whether to replace the first character string with another character string.

For example, when the reliability of the first character string is higher than or equal to a threshold, the determination unit 323 of the processor 320 according to an embodiment may determine that the first character string does not need correction and output the first character string through the output unit 350. In contrast, when the reliability is lower than the threshold, the determination unit 323 of the processor 320 may determine that the first character string needs correction and transmit the first character string to the second ASR module 322.
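The threshold rule above can be sketched in a few lines of Python. The function name `route_first_string` and the representation of the second ASR module as a callable are assumptions for illustration only.

```python
from typing import Callable


def route_first_string(first_string: str,
                       reliability: float,
                       threshold: float,
                       second_asr: Callable[[str], str]) -> str:
    """Return the string to output.

    If the reliability is at or above the threshold, the first string is
    output unchanged. Otherwise it is passed to the second ASR module
    (modeled here as a plain callable) and that result is output.
    """
    if reliability >= threshold:
        return first_string
    return second_asr(first_string)
```

For instance, with a threshold of 0.8, a first string with reliability 0.9 is output as is, while one with reliability 0.5 is handed to the second module.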

As another example, the determination unit 323 of the processor 320 may determine whether to replace the first character string with another character string based on a result of comparing the first character string with keywords stored in advance in the device 100. As yet another example, the determination unit 323 of the processor 320 may determine whether to replace the first character string with another character string based on the domain to which the first character string relates or on whether a named entity is included in the first character string.

As for the specific method by which the determination unit 323 of the processor 320 according to an embodiment determines whether to replace the first character string with another character string, the method by which the processor 120 of the device 100 according to an embodiment makes the same determination, described with reference to FIGS. 3 to 5B, may be used. Redundant descriptions are omitted.

When it is determined that correction of the first character string is not required, the determination unit 323 of the processor 320 according to an embodiment of the present disclosure may determine not to replace the first character string with another character string. When the first character string is not to be replaced with another character string, the determination unit 323 may output the first character string through the output unit 350.

When it is determined that the first character string should be replaced with another character string, the determination unit 323 of the processor 320 according to an embodiment of the present disclosure may, based on this determination, transmit the first character string to the second ASR module 322.

The determination unit 323 of the processor 320 according to an embodiment may transmit the first character string to the second ASR module 322 in units of sentences, words, phrases, or frames. When the first ASR module 321 of the processor 320 according to an embodiment performs speech recognition and obtains a character string constituting a sentence or phrase, the determination unit 323 may transmit all of the characters included in the sentence or phrase to the second ASR module 322, or may transmit only some of those characters to the second ASR module 322. Based on the reliability of the character string, the determination unit 323 may transmit some characters with low reliability to the second ASR module 322.

The second ASR module 322 of the processor 320 according to an embodiment of the present disclosure may receive the first character string and process it. The second ASR module 322 may re-decode the first character string based on the language model and dictionary information included in the second data 342, thereby obtaining a second character string in which at least one character in the first character string is replaced.

The second ASR module 322 may calculate, from the first character string, the likelihoods of a plurality of estimated character strings. Based on the calculated likelihoods, the second ASR module 322 may determine whether to replace the first character string with a second character string. Based on this determination, the second ASR module 322 may obtain the second character string from the first character string by replacing at least one character in the first character string with another character. The second ASR module 322 may obtain the second character string, which is one of the plurality of estimated character strings, based on the likelihoods, the dictionary information, and the language model.
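As a rough illustration of selecting the second character string from a plurality of estimated character strings, the following Python sketch combines each candidate's likelihood with a toy language-model score. The scoring rule (log-likelihood plus a weighted unigram log-probability), the unknown-word penalty, and all names are assumptions made for this example; the disclosure does not state how the likelihood, dictionary information, and language model are combined.

```python
import math
from typing import Dict


def unigram_log_prob(text: str, vocab: Dict[str, float], unknown: float = -10.0) -> float:
    """Toy language model: sum of per-word log-probabilities.

    Words missing from `vocab` (the stand-in dictionary) receive the
    `unknown` penalty.
    """
    return sum(vocab.get(word, unknown) for word in text.split())


def pick_second_string(
    candidates: Dict[str, float],
    vocab: Dict[str, float],
    lm_weight: float = 1.0,
) -> str:
    """Return the candidate with the highest combined score.

    `candidates` maps each estimated string to a likelihood in (0, 1].
    """
    def score(text: str) -> float:
        return math.log(candidates[text]) + lm_weight * unigram_log_prob(text, vocab)

    return max(candidates, key=score)
```

In this sketch, a candidate containing an out-of-dictionary word can lose to a candidate with a slightly lower likelihood, which is the kind of correction a larger dictionary is meant to enable.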

Since the second ASR module 322 of FIG. 12 may correspond to the processor 220 of FIGS. 7 and 9, redundant descriptions are omitted.

The second ASR module 322 may output the second character string through the output unit 350.

The output unit 350 according to an embodiment of the present disclosure may output a speech recognition result corresponding to the first character string or the second character string. The output unit 350 may notify the user of the result of the speech recognition or transmit it to an external device (for example, a smartphone, a home appliance, a wearable device, or a server). For example, the output unit 350 may include a speaker capable of outputting an audio signal or a display capable of outputting a video signal.

Alternatively, the device 300 according to an embodiment may perform an operation corresponding to a result of interpreting the first character string or the second character string. For example, the device 300 may determine a function of the device 300 corresponding to the result of the speech recognition and output a screen for performing that function through the output unit 350. Alternatively, the device 300 may transmit a keyword corresponding to the result of the speech recognition to an external server, receive information related to the transmitted keyword from the server, and display it on the screen through the output unit 350.

Alternatively, the device 300 according to an embodiment may identify the user's utterance intention through natural language processing of the first character string or the second character string, and thereby output information related to a voice assistant service through the output unit 350. To provide the voice assistant service, the device 300 may use a Natural Language Understanding (NLU) model, a Dialog Manager (DM) model, a Natural Language Generating (NLG) model, and the like in the device 300.

As an example, the device 300 may generate and output a response message based on the first character string or the second character string, taking into account the user's situation, the device's situation, and the like, as if a person were conversing directly with the user. As another example, the device 300 may generate and output information that the user needs based on the first character string or the second character string. As another example, the device 300 may identify the user's utterance intention based on the first character string or the second character string and request a service providing server to provide a service that the user needs. The device 300 may transmit the information received from the service providing server through the output unit 350.

The second ASR module 322 according to an embodiment of the present disclosure may use the second data 342, which includes a larger amount of language model and dictionary information than the first data 341 used by the first ASR module 321. In addition, the second data 342 may include more entity names, such as place names, personal names, and brand names, than the first data 341. Accordingly, because speech recognition by the second ASR module 322 can draw on dictionary information and a language model that include a vast number of entity names, it can achieve high accuracy.

Therefore, to minimize latency, the device 300 of FIG. 12 may perform general-purpose speech recognition, such as dictation, general commands, and caption generation, in the first ASR module 321. On the other hand, when it is determined that the reliability of the first character string output from the first ASR module 321 is insufficient, the device 300 may perform additional processing on the first character string in the second ASR module 322. The second ASR module 322 may increase the accuracy of speech recognition by using the second data 342, which includes a larger amount of information than the first data 341.

Meanwhile, the processor 320 of the device 300 according to an embodiment may obtain a corrected character string from the second ASR module 322 and combine it with the character string that was not transmitted to the second ASR module 322 because it was determined not to need correction. The device 300 according to an embodiment may output the combined character string, output a result of speech recognition based on the combined character string, or provide a voice assistant service based on a result of interpreting the combined character string.
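The partial correction and recombination described above can be sketched in Python as follows. The per-word reliability values, the word-level granularity, and the `correct` callable standing in for the second ASR module 322 are illustrative assumptions; the disclosure also allows sentence, phrase, or frame units.

```python
from typing import Callable, List, Tuple


def correct_low_reliability_words(
    words: List[Tuple[str, float]],
    threshold: float,
    correct: Callable[[str], str],
) -> str:
    """Correct only low-reliability words, then recombine in original order.

    `words` holds (word, reliability) pairs from the first recognition pass.
    `correct` is a hypothetical stand-in for the second ASR module.
    """
    combined = []
    for word, reliability in words:
        if reliability >= threshold:
            # Judged not to need correction: kept as recognized.
            combined.append(word)
        else:
            # Sent for correction; the result takes the word's place.
            combined.append(correct(word))
    return " ".join(combined)
```

Keeping the original positions while substituting only the corrected parts is one simple way to form the combined character string.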

In addition, when requesting the second ASR module 322 to correct the first character string, the determination unit 323 of the processor 320 according to an embodiment may also provide domain information related to the first character string. The domain information is information for identifying a domain and may include, for example, the name of the domain or an identifier of the domain, but is not limited thereto.

The determination unit 323 of the device 300 may identify a domain related to the first character string based on the domain reliability of the first character string output from the first ASR module 321. The domain reliability may be a numerical value indicating the degree to which at least a part of the first character string is related to a specific domain. For example, the determination unit 323 may calculate a confidence score indicating how relevant the first character string output from the first ASR module 321 is to a domain registered in advance in the first data 341. The device 300 may then identify the domain related to the first character string based on the calculated domain reliability. The device 300 may identify the domain related to the first character string on a rule basis, or may obtain the domain reliability related to the first character string by using an artificial intelligence model trained for domain identification.
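A rule-based domain reliability could, for instance, be computed as the fraction of words in the first character string that appear in a keyword set registered for each domain. The Python sketch below uses that rule purely as an assumption; the keyword sets, the scoring rule, and the function names are hypothetical and are not taken from the disclosure.

```python
from typing import Dict, Optional, Set


def domain_reliabilities(text: str, domain_keywords: Dict[str, Set[str]]) -> Dict[str, float]:
    """Score each registered domain by the share of words it recognizes."""
    words = text.lower().split()
    if not words:
        return {domain: 0.0 for domain in domain_keywords}
    return {
        domain: sum(word in keywords for word in words) / len(words)
        for domain, keywords in domain_keywords.items()
    }


def identify_domain(
    text: str,
    domain_keywords: Dict[str, Set[str]],
    threshold: float,
) -> Optional[str]:
    """Return the best-scoring domain, or None if no score reaches the threshold."""
    scores = domain_reliabilities(text, domain_keywords)
    if not scores:
        return None
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None
```

A trained domain identification model could replace `domain_reliabilities` while leaving the thresholding step unchanged.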

Meanwhile, the second ASR module 322 according to an embodiment of the present disclosure may decode the first character string by using dictionary information and language models that differ for each domain and are included in the second data 342. Accordingly, the second ASR module 322 according to an embodiment of the present disclosure may output a speech recognition result with improved accuracy through re-decoding of the first character string.

The second ASR module 322 according to an embodiment may receive the first character string from the determination unit 323 and determine a domain related to the first character string. The second ASR module 322 may decode the first character string by using the dictionary information and language model corresponding to the determined domain.

As an example, the second ASR module 322 may receive, from the determination unit 323, the first character string together with domain information related to the first character string, and may determine, based on the received information, the domain in which to decode the first character string. For example, the second ASR module 322 may determine, as the domain in which to perform decoding, a domain that is the same as or similar to the domain identified from the information received from the determination unit 323.

As another example, the second ASR module 322 may determine a domain related to the first character string based on the first character string received from the determination unit 323. The device 300 may store, in the memory 340, a domain identification model, which is an artificial intelligence model trained for domain identification. Using the domain identification model, the second ASR module 322 may output a domain reliability with the first character string as the input value. The second ASR module 322 may determine the domain related to the first character string based on the domain reliability.

As yet another example, the second ASR module 322 according to an embodiment may receive, from the determination unit 323, the first character string together with information for determining the domain related to the first character string. The information for determining the domain received from the determination unit 323 may include context information. For example, the context information may include at least one of information about the application the user is currently using on the device 300, conversation history information, information about the situation around the device 300, or trend information. The second ASR module 322 may determine the domain in which to decode the first character string based on the context information. Since the operating method of the processor 220 of FIG. 9 may be applied as the specific method of determining a domain based on context information, redundant descriptions are omitted.

Hereinafter, a method of operating the device 100 according to an embodiment is described. Each step of the method of operating the device 100 described below may be performed by the components shown in FIGS. 3, 4A, and 4B. Redundant descriptions are omitted.

FIG. 13 is a flowchart of a method by which a device performs speech recognition according to an embodiment.

In step S1310, the device 100 according to an embodiment of the present disclosure may obtain a first character string by performing speech recognition on a speech signal.

The device 100 according to an embodiment of the present disclosure may estimate the first character string by performing speech recognition using any of various speech recognition methods.

As an example, the device 100 may obtain a character string from a speech signal by using an acoustic model, dictionary information, and a language model. First, the device 100 may use the acoustic model to obtain a phoneme sequence included in the speech signal. For example, the device 100 may estimate a phoneme sequence containing phonemes by using a hidden Markov model, a Gaussian mixture model, Bayesian inference, a multilayer neural network, or the like. Based on the dictionary information and the language model, the device 100 may estimate words from the phoneme sequence and obtain a first character string including the estimated words.

As another example, the device 100 may extract a feature vector from the speech signal and obtain the first character string from the feature vector by using a deep neural network (DNN).

For example, the first character string may be a frame-synchronized character string that includes characters corresponding to each of the speech signal frames into which the speech signal is divided at predetermined time intervals. Alternatively, for example, the first character string may be a character string obtained in a label-synchronized manner so that it includes each character pronounced in the speech signal once.

When the first character string is not frame-synchronized, the device 100 according to an embodiment may obtain a frame-synchronized character string by performing forced alignment. As for the frame-synchronized character string and the specific method of generating one by forced alignment, the description given above with reference to FIG. 6 may be applied. Redundant descriptions are omitted.

In step S1330, the device 100 according to an embodiment of the present disclosure may determine whether to replace the first character string with another character string.

As an example, the device 100 according to an embodiment may determine the reliability of the first character string and, based on the reliability, determine whether to replace the first character string with another character string. For example, if the reliability is higher than or equal to a threshold value, the device 100 may determine that the first character string does not need to be replaced with another character string. On the other hand, if the reliability is lower than the threshold value, the device 100 may determine to replace the first character string with another character string.

For example, the device 100 may calculate the reliability based on the likelihood output as a result of Viterbi decoding. Alternatively, the processor 120 may calculate the reliability based on the posterior probabilities output from the softmax layer of an end-to-end speech recognition model.
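One way to turn softmax posteriors into a single reliability value is to take the highest posterior at each output step and average those values geometrically. The Python sketch below does this under that assumption; the aggregation rule and the function names are illustrative choices, since the disclosure only states that the reliability is based on the posterior probabilities.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one output step."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def string_reliability(step_logits: Sequence[Sequence[float]]) -> float:
    """Geometric mean of the top posterior probability at each step.

    `step_logits` holds one logit vector per output step (assumed input
    format). Returns 0.0 for an empty sequence.
    """
    if not step_logits:
        return 0.0
    log_sum = sum(math.log(max(softmax(logits))) for logits in step_logits)
    return math.exp(log_sum / len(step_logits))
```

The geometric mean keeps the value in the range of a probability and penalizes a single very uncertain step more than an arithmetic mean would.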

Alternatively, the device 100 according to an embodiment may determine a plurality of estimated character strings that are estimated in the course of speech recognition on the speech signal, and may calculate the reliability of the first character string based on the correlation among the plurality of estimated character strings. The higher the correlation among the plurality of estimated character strings including the first character string, the higher the reliability of the first character string may be.
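The idea that closely agreeing hypotheses imply a reliable result can be sketched as follows. Here the "correlation" is approximated by the average similarity ratio from Python's standard `difflib.SequenceMatcher` between the first character string and the other estimated strings; this metric and the function name are assumptions for illustration, as the disclosure does not define how the correlation is measured.

```python
from difflib import SequenceMatcher
from typing import Sequence


def agreement_reliability(first_string: str, alternatives: Sequence[str]) -> float:
    """Average similarity between the first string and alternative hypotheses.

    Returns a value in [0, 1]. With no alternatives, 1.0 is returned
    (an arbitrary convention chosen for this sketch).
    """
    if not alternatives:
        return 1.0
    ratios = [
        SequenceMatcher(None, first_string, other).ratio()
        for other in alternatives
    ]
    return sum(ratios) / len(ratios)
```

The returned value could then be compared against a threshold in the same way as a reliability derived from likelihoods or posteriors.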

As another example, the device 100 may determine whether to replace the first character string with another character string based on a result of comparing the first character string with pre-stored keywords. For example, when the first character string does not include any of the pre-stored keywords, the device 100 may determine to replace the first character string with another character string.
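The keyword comparison above can be sketched in a few lines of Python. Whole-word, case-insensitive matching and the function name `needs_replacement` are assumptions made for this example; the disclosure does not specify how the comparison is performed.

```python
from typing import Iterable


def needs_replacement(first_string: str, stored_keywords: Iterable[str]) -> bool:
    """Return True when none of the stored keywords occurs in the string.

    Matching is done on whitespace-separated words, ignoring case.
    """
    words = set(first_string.lower().split())
    return words.isdisjoint(keyword.lower() for keyword in stored_keywords)
```

With whole-word matching, a stored keyword such as "play" does not match inside an unrelated word such as "display".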

As yet another example, the device 100 may determine whether to replace the first character string with another character string based on the domain to which the first character string is related or on whether an entity name is included in the first character string. For example, when it is determined that the first character string is related to an entity-name-oriented domain, the device 100 may determine to replace the first character string with another character string.

When it is determined to replace the first character string with another character string, the device 100 according to an embodiment of the present disclosure may transmit the first character string to the server 200 in step S1340. The device 100 according to an embodiment may transmit the frame-synchronized first character string to the server 200. The device 100 may transmit an entire character string including a plurality of characters at once, or may sequentially transmit at least some of the characters included in the character string. In addition, the device 100 according to an embodiment may transmit the first character string in units of words or in units of sentences.

Meanwhile, when it is determined not to replace the first character string with another character string, the device 100 according to an embodiment of the present disclosure may output the first character string in step S1370. The device 100 according to an embodiment may output the first character string as it is or may output a word sequence obtained from the first character string.

In step S1350, the device 100 according to an embodiment of the present disclosure may receive a second character string from the server 200. The second character string may be a character string obtained by the server 200 replacing at least one character in the first character string with another character.

In step S1360, the device 100 according to an embodiment of the present disclosure may output the second character string. The device 100 according to an embodiment may output the second character string as it is or may output a word sequence obtained from the second character string.

The present disclosure is not limited to the embodiment shown in FIG. 13 in which the device 100 outputs the first character string or the second character string as it is. The device 100 according to an embodiment may identify the user's utterance intention through natural language processing of the first character string or the second character string, and thereby output information related to a voice assistant service.

To provide a voice assistant service based on the first character string or the second character string, the device 100 may use an NLU model, a DM model, an NLG model, and the like in the device 100.

For example, the device 100 may generate and output a response message based on the first character string or the second character string, taking into account the user's situation, the device's situation, and the like, as if a person were conversing directly with the user. Alternatively, for example, the device 100 may generate and output information that the user needs based on the first character string or the second character string. Alternatively, for example, the device 100 may identify the user's utterance intention based on the first character string or the second character string and request a service providing server to provide a service that the user needs. The device 100 may output the information received from the service providing server.

Meanwhile, instead of receiving the second character string from the server 200, the device 100 according to an embodiment may receive information related to a voice assistant service generated based on the second character string and output the received information. The information related to the voice assistant service may be information generated by the server 200 based on the second character string, which is the corrected first character string. For example, the information related to the voice assistant service may include a response message to the user's speech signal, a service that the user needs, or information that the user needs.

As shown in FIG. 13, the device 100 according to an embodiment may determine whether to replace the first character string output from the on-device speech recognition module with another character string and, based on the result of that determination, selectively use server-based post-processing.

The device 100 according to an embodiment may calculate the reliability of the first character string output from the speech recognition module for each word (or each sentence) uttered by the user, and may determine whether to replace the first character string based on the reliability.

FIG. 14 is a flowchart of a specific method by which a device performs speech recognition according to an embodiment, elaborating on the steps of FIG. 13.

In step S1411, the device 100 according to an embodiment may receive a speech signal, and in step S1413, it may determine whether a word boundary has been detected. The device 100 according to an embodiment may continue to receive a speech signal including speech frames until a word boundary is detected.

For example, the device 100 may detect a word boundary based on a pause detected in the speech signal or on prosodic information including intonation and stress.
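Pause-based boundary detection can be sketched with a simple energy rule: a boundary is reported once a run of low-energy frames following speech reaches a minimum length. The per-frame energy input, the threshold values, and the function name are assumptions for illustration only; the disclosure does not describe how pauses are detected, and prosodic cues are not modeled here.

```python
from typing import List, Sequence


def find_word_boundaries(
    frame_energies: Sequence[float],
    silence_threshold: float,
    min_pause_frames: int,
) -> List[int]:
    """Return frame indices at which a pause after speech becomes long enough.

    A frame is treated as silent when its energy is below
    `silence_threshold`. Leading silence produces no boundary.
    """
    boundaries = []
    silent_run = 0
    seen_speech = False
    for index, energy in enumerate(frame_energies):
        if energy < silence_threshold:
            silent_run += 1
            # Report each pause once, at the frame where it reaches the minimum length.
            if seen_speech and silent_run == min_pause_frames:
                boundaries.append(index)
                seen_speech = False
        else:
            silent_run = 0
            seen_speech = True
    return boundaries
```

In a streaming setting, the same counters could be updated frame by frame so that recognition of the buffered frames starts as soon as a boundary is reported.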

When a word boundary is detected, the device 100 according to an embodiment may obtain the first character string from the speech signal in step S1415.

In step S1431, the device 100 according to an embodiment of the present disclosure may calculate the reliability of the first character string. The reliability of the first character string may be calculated based on at least one of the likelihoods of a plurality of estimated character strings obtained from the first character string and the posterior probabilities that at least one character in the first character string is to be replaced with another character.

For example, the device 100 may calculate the reliability based on the likelihood output as a result of Viterbi decoding. Alternatively, the device 100 may calculate the reliability based on the posterior probabilities output from the softmax layer of an end-to-end speech recognition model.

In step S1433, the device 100 according to an embodiment of the present disclosure may determine whether the reliability is less than a threshold value.

When the reliability of the first character string is less than the threshold value, the device 100 according to an embodiment of the present disclosure may transmit the first character string to the server 200 in step S1340. In response to the transmitted first character string, the device 100 may receive a second character string in which the server 200 has replaced at least one character of the first character string with another character. The device 100 may output the received second character string.

On the other hand, when the reliability of the first character string is greater than or equal to the threshold value, the device 100 according to an embodiment of the present disclosure may output the first character string in step S1370.
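The branch taken in steps S1431 and S1433 can be sketched as follows. The source only states that the reliability may be derived from likelihoods or softmax posteriors. The geometric mean used here, the function names, and the default threshold of 0.8 are assumptions chosen for illustration, not the disclosed formula.

```python
import math
from typing import List


def string_reliability(char_posteriors: List[float]) -> float:
    """Geometric mean of per-character posterior probabilities (assumed reliability measure)."""
    if not char_posteriors:
        return 0.0
    # Work in the log domain and clamp to avoid log(0).
    log_sum = sum(math.log(max(p, 1e-12)) for p in char_posteriors)
    return math.exp(log_sum / len(char_posteriors))


def needs_server_correction(char_posteriors: List[float], threshold: float = 0.8) -> bool:
    """True when the first character string should be sent to the server (reliability < threshold)."""
    return string_reliability(char_posteriors) < threshold
```

With this sketch, a string whose characters were all recognized with high posteriors is output directly, while a single poorly recognized character can pull the reliability below the threshold and trigger server-side correction.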

Meanwhile, instead of outputting the first or second character string as is, the device 100 according to an embodiment may identify the user's utterance intent through natural language processing of the first or second character string, and thereby output information related to a voice assistant service.

Meanwhile, instead of receiving the second character string from the server 200 in response to the transmitted first character string, the device 100 according to an embodiment may receive information related to the voice assistant service that is generated based on the second character string. The device 100 may output the information received from the server 200. The information related to the voice assistant service may be information generated by the server 200 based on the second character string, which is the corrected form of the first character string.

For example, the information related to the voice assistant service may include a response message to the user's speech signal, a service required by the user, or information needed by the user.

As shown in FIG. 14, the device 100 according to an embodiment of the present disclosure may determine, based on the reliability of the first character string, whether to replace the first character string with another character string. When the reliability of the first character string is less than the threshold value, the device 100 may transmit the first character string to the server 200. The device 100 may obtain from the server 200 a second character string, which is obtained by replacing at least one character of the first character string with another character based on the dictionary information and the language model in the server 200. Accordingly, the device 100 according to an embodiment may increase speech recognition accuracy by receiving and using a second character string that has a higher reliability than the first character string.

FIG. 14 illustrates an embodiment in which the reliability of the speech recognition result is calculated in units of words uttered by the user and it is then determined whether to replace the first character string. However, the present disclosure is not limited to the example shown in FIG. 14. The device 100 according to an embodiment may calculate the reliability of the speech recognition result in units of sentences uttered by the user and determine whether to replace the first character string. Various conventional methods may be used to detect the end of a sentence uttered by the user, and a detailed description thereof is omitted in the present disclosure.

FIG. 15 is a flowchart of a method of operating the server 200 according to an embodiment. Each step of the method described below may be performed by the components shown in FIGS. 7 and 9. Redundant descriptions are omitted.

In step S1510, the server 200 according to an embodiment of the present disclosure may receive the first character string from the device 100. The first character string may be output by the device 100 from a speech signal through speech recognition processing.

When the first character string received from the device 100 is not frame-synchronized, the processor 220 according to an embodiment of the present disclosure may obtain a frame-synchronized character string from the first character string. The processor 220 may obtain the frame-synchronized character string by arranging at least one character included in the first character string consecutively in units of frames of a predetermined time interval.
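One simple way to picture this frame synchronization is to repeat each character over a fixed number of consecutive frames. The sketch below assumes a constant number of frames per character, which the source does not state. The function name `frame_synchronize` and the parameter `frames_per_char` are hypothetical.

```python
from typing import List


def frame_synchronize(text: str, frames_per_char: int = 3) -> List[str]:
    """Expand a character string into a frame-level sequence.

    Assumption for illustration only: every character occupies the same
    number of consecutive frames.
    """
    if frames_per_char < 1:
        raise ValueError("frames_per_char must be at least 1")
    return [ch for ch in text for _ in range(frames_per_char)]
```

For instance, `frame_synchronize("cat", 2)` yields the frame-level sequence `['c', 'c', 'a', 'a', 't', 't']`. A real system would more likely derive per-character durations from timing information than use a constant.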

In step S1520, the server 200 according to an embodiment of the present disclosure may calculate, from the first character string, likelihoods for a plurality of estimated character strings. The server 200 according to an embodiment may obtain the plurality of estimated character strings by replacing each character in the first character string with another character. The likelihood of the plurality of estimated character strings may mean the probability that the speech recognition module would estimate the first character string, assuming that each of the estimated character strings obtained from the first character string is the ground-truth character string.

According to an embodiment of the present disclosure, the server 200 may obtain the likelihoods from the first character string in order to identify replacement characters whose pronunciation is similar to that of each character in the first character string and, based on the identified replacement characters, to determine estimated character strings in which at least one character of the first character string is corrected to another character.

The server 200 according to an embodiment may calculate, for each character in the first character string, a likelihood matrix over the replacement characters with which that character may be replaced, and may identify the plurality of estimated character strings based on the likelihood values in the likelihood matrices. The server 200 may obtain the likelihood matrices calculated for each character as the likelihoods for the plurality of estimated character strings.

As an example, the server 200 may calculate the likelihoods from the first character string based on the characters accumulated before each character in the first character string. The server 200 according to an embodiment may calculate the posterior probabilities of each character based on the characters accumulated before it. The server 200 may also calculate a character arrangement probability based on the characters accumulated before each character. Based on the posterior probabilities of each character and the character arrangement probability, the server 200 may calculate the likelihoods of the plurality of estimated character strings obtained from the first character string.

As another example, the server 200 may calculate the likelihoods from the first character string by considering each character alone, without considering the characters accumulated before it. The server 200 according to an embodiment may calculate the posterior probabilities of each character in the first character string based on a predetermined error matrix (confusion matrix). Based on these posterior probabilities, the server 200 may calculate the likelihoods of the plurality of estimated character strings obtained from the first character string.
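The context-independent variant can be sketched in Python as a table lookup. The sketch assumes the error matrix is stored as `confusion[true_char][observed_char]`, interpreted as the probability that the recognizer outputs `observed_char` when `true_char` was spoken. This layout, the interpretation, and the function name are assumptions for illustration, since the source does not define the matrix format.

```python
from typing import Dict, List

# Assumed layout: confusion[true_char][observed_char] = P(observed | true)
Confusion = Dict[str, Dict[str, float]]


def likelihood_columns(first_string: str, confusion: Confusion) -> List[Dict[str, float]]:
    """For each observed character, list candidate true characters with their likelihoods."""
    columns: List[Dict[str, float]] = []
    for observed in first_string:
        column = {
            true_char: row[observed]
            for true_char, row in confusion.items()
            if row.get(observed, 0.0) > 0.0  # keep only plausible replacements
        }
        columns.append(column)
    return columns
```

Each returned column plays the role of one per-character likelihood matrix. Characters with similar pronunciation would be expected to have non-zero off-diagonal entries in the error matrix and therefore appear as replacement candidates.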

In step S1530, the server 200 according to an embodiment of the present disclosure may obtain a second character string from the first character string by replacing at least one character in the first character string with another character, based on the likelihoods calculated in step S1520.

Based on the calculated likelihoods, the server 200 according to an embodiment may identify a plurality of estimated character strings in which at least one character of the first character string is replaced with another character. The server 200 may obtain the second character string from among the plurality of estimated character strings based on the likelihoods of the identified estimated character strings, the language model, and the dictionary information.

The server 200 according to an embodiment may determine, based on the calculated likelihoods, whether to replace the first character string with a second character string. Based on that determination, the server 200 may obtain the second character string from the first character string by replacing at least one character in the first character string with another character. The server 200 may select, from among the plurality of estimated character strings, the estimated character string that maximizes the likelihood, based on the likelihoods, the dictionary information, and the language model. According to the selected estimated character string, the server 200 may obtain the second character string in which at least one character of the first character string is replaced with another character.

As an example, the server 200 may obtain the second character string using a WFST decoder, based on the dictionary information and the language model stored in the server 200. When performing WFST decoding, the server 200 according to an embodiment may construct a search space as a WFST and decode based on a relationship between characters (T), dictionary information (L) including mapping information between words and characters, and a language model (G) that estimates the probabilities of the words that follow a given word sequence.

As another example, the server 200 may include a Viterbi decoder that recalculates the likelihoods of the plurality of estimated character strings obtained from the first character string, based on the dictionary information and the language model. The Viterbi decoder may determine, from among the plurality of estimated character strings, the second character string that maximizes the recalculated likelihood. In other words, the Viterbi decoder may find the most probable character string for the first character string as the second character string, taking the dictionary information and the language model into account.
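The Viterbi search described above can be sketched over per-character candidate columns. This is a simplified illustration, not the disclosed decoder: a character bigram table stands in for the language model, the dictionary information is omitted, and the function name, input layout, and `floor` probability for unseen bigrams are all assumptions. Every column is assumed to be non-empty with strictly positive likelihoods.

```python
import math
from typing import Dict, List, Tuple


def viterbi_correct(columns: List[Dict[str, float]],
                    bigram: Dict[Tuple[str, str], float],
                    floor: float = 1e-6) -> str:
    """Pick the most probable character sequence.

    columns[t][c] is the likelihood of the observed character at position t
    given true character c; bigram[(prev, cur)] is P(cur | prev).
    """
    if not columns:
        return ""
    # best[c] = (log score of the best path ending in c, that path)
    best = {c: (math.log(p), c) for c, p in columns[0].items()}
    for column in columns[1:]:
        new_best = {}
        for cur, p in column.items():
            # Best predecessor for `cur` under the bigram transition scores.
            score, path = max(
                (prev_score + math.log(bigram.get((prev, cur), floor)), prev_path)
                for prev, (prev_score, prev_path) in best.items()
            )
            new_best[cur] = (score + math.log(p), path + cur)
        best = new_best
    return max(best.values())[1]
```

The point of the sketch is that the language model can overturn the locally most likely character: a candidate with a lower per-character likelihood still wins if it forms a far more probable sequence with its neighbors.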

In step S1540, the server 200 according to an embodiment of the present disclosure may transmit the second character string to the device 100.

In addition, the server 200 according to an embodiment may interpret the second character string using a natural language understanding model and generate a response message to the user's speech signal based on the result of the interpretation. The server 200 may generate the response message and additionally transmit it to the device 100.

The present disclosure is not limited to the embodiment shown in FIG. 15, in which the server 200 transmits the second character string as is to the device 100. The server 200 according to an embodiment may identify the user's utterance intent through natural language processing of the second character string, and thereby transmit information related to a voice assistant service.

To provide the voice assistant service based on the second character string, the server 200 may use an NLU model, a DM model, an NLG model, and the like in the server 200.

As an example, the server 200 may generate a control command for controlling the device 100 or another device based on the result of interpreting the second character string, and transmit the generated control command to the device 100. As another example, the server 200 may generate and transmit a response message based on the second character string, taking into account the user's situation, the device's situation, and the like, as if a person were conversing directly with the user. As another example, the server 200 may generate and transmit information needed by the user based on the second character string. As another example, the server 200 may identify the user's utterance intent based on the second character string and request a service providing server to provide the service required by the user. The server 200 may transmit the information received from the service providing server.

FIG. 16 illustrates in detail, for a method of operating a server according to an embodiment, a method of obtaining likelihoods from a character string by taking into account the characters accumulated before each character.

In step S1510, the server 200 according to an embodiment of the present disclosure may obtain the first character string from the device 100.

In step S1621, the server 200 according to an embodiment may obtain the posterior probabilities of each character based on the characters preceding it in the first character string.

For example, the server 200 may calculate the posterior probabilities of each character in the first character string using a neural network trained in advance to calculate the posterior probabilities of character strings.

In step S1623, the server 200 according to an embodiment may calculate a character arrangement probability from the first character string.

In step S1625, the server 200 according to an embodiment may calculate the likelihoods of the plurality of estimated character strings obtained from the first character string, based on the posterior probabilities calculated in step S1621 and the character arrangement probability calculated in step S1623. The server 200 according to an embodiment may calculate, for each character in the first character string, a likelihood matrix over the replacement characters with which that character may be replaced, and may obtain the likelihoods of the plurality of estimated character strings, which comprise the calculated likelihood matrices.
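One hedged way to see how a posterior and a character arrangement probability can combine into a likelihood is Bayes' rule. The symbols below are assumptions introduced for illustration and do not appear in the source: $\hat{y}_t$ is the $t$-th character of the first character string, $y_t$ is a candidate true character at that position, and $\hat{y}_{1:t-1}$ denotes the previously accumulated characters.

$$
P(\hat{y}_t \mid y_t, \hat{y}_{1:t-1}) = \frac{P(y_t \mid \hat{y}_{1:t}) \, P(\hat{y}_t \mid \hat{y}_{1:t-1})}{P(y_t \mid \hat{y}_{1:t-1})}
$$

Under this reading, $P(y_t \mid \hat{y}_{1:t})$ would correspond to the posterior probability of step S1621 and $P(\hat{y}_t \mid \hat{y}_{1:t-1})$ to the character arrangement probability of step S1623. The source does not give the exact formula, so this mapping is an assumption. The denominator is a prior over the candidate character that would have to be estimated separately.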

The server 200 according to an embodiment may determine whether a likelihood matrix has been calculated for every character included in the first character string. The server 200 may repeat steps S1621, S1623, and S1625 until likelihood matrices have been calculated for all characters included in the first character string.

The specific process of calculating the likelihoods from the first character string has been described above with reference to FIG. 9, so a redundant description is omitted.

In step S1627, the server 200 according to an embodiment may obtain the second character string from the likelihoods calculated in step S1625, using the dictionary information and the language model. The second character string may be a character string in which at least one character of the first character string is replaced with another character.

For example, the server 200 may obtain the second character string from among the plurality of estimated character strings based on the dictionary information, the language model, and the calculated likelihoods, using either a WFST decoder that takes the likelihoods as input or a Viterbi decoder that uses conventional token passing.

In step S1540, the server 200 according to an embodiment may transmit the second character string to the device 100. Instead of transmitting the second character string as is to the device 100, the server 200 according to an embodiment may identify the user's utterance intent through natural language processing of the second character string and thereby transmit information related to the voice assistant service. Redundant descriptions are omitted.

FIG. 17 is a diagram for describing weighted finite-state transducer (WFST) decoding according to an embodiment.

The server 200 according to an embodiment of the present disclosure may calculate likelihoods from the first character string received from the device 100 and perform WFST decoding with the calculated likelihoods as input. The server 200 may perform WFST decoding by modeling each of the following as a weighted finite-state transducer (WFST): the likelihoods (T) of the plurality of estimated character strings obtained from the first character string, the dictionary information (L) including mapping information between words and characters, and the language model (G) that estimates the probabilities of the words that follow a given word sequence.

The following describes an example in which a language model storing information on the relationships among the words 'the, cat, and, deer, is, cardinals, baseball, team' is modeled as a WFST. FIG. 17 shows the finite number of character strings that can be formed by combining these words based on the language model.

Each circle in FIG. 17 represents a state, and each arrow is labeled with a word stored in the language model. The WFST decoder may calculate a reliability for each of the character strings formed by combining words along the plurality of paths. The reliability of each character string may be calculated based on the likelihood of that character string, the dictionary information, and the language model. The WFST decoder may select and output the character string with the highest reliability.

For example, as shown in FIG. 8A, the server 200 according to an embodiment of the present disclosure may receive the first character string [The cat and deer is baseball team] from the device 100.

The server 200 may calculate the likelihoods of a plurality of estimated character strings from the first character string. When the calculated likelihoods are input to the WFST decoder of the server 200, the WFST decoder may output the second character string. The WFST decoder may determine and output, as the second character string, the estimated character string with the highest reliability among the plurality of estimated character strings.

As shown in FIG. 8A, the entity name "Cardinals" of the sports domain may be stored in the memory 230 of the server 200. Accordingly, the processor 220 of the server 200 may determine that the 'cat and deer is' estimated by the device 100 is more likely to actually be 'Cardinals', the name of a baseball team.

Accordingly, referring to FIG. 17, the WFST decoder according to an embodiment may determine and output, as the second character string, [The Cardinals baseball team], which has the highest reliability among the plurality of estimated character strings [The cat and deers baseball team] and [The Cardinals baseball team].
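The selection between the two estimated character strings can be illustrated with a plain log-domain rescoring. This is not a WFST implementation: it simply adds a string likelihood to a unigram word score. The function name, the unigram language model, the weighting, and the out-of-vocabulary probability are all assumptions made for illustration.

```python
import math
from typing import Dict


def pick_second_string(candidates: Dict[str, float],
                       unigram_lm: Dict[str, float],
                       lm_weight: float = 1.0,
                       oov_prob: float = 1e-6) -> str:
    """Return the candidate with the highest combined score.

    candidates maps each estimated character string to its likelihood;
    unigram_lm maps lowercase words to probabilities (hypothetical values).
    """
    def score(text: str) -> float:
        words = text.lower().split()
        # Words missing from the vocabulary receive a small fixed probability.
        lm_logp = sum(math.log(unigram_lm.get(w, oov_prob)) for w in words)
        return math.log(candidates[text]) + lm_weight * lm_logp

    return max(candidates, key=score)
```

In this sketch a word that is absent from the server's vocabulary, such as "deers", is heavily penalized. A candidate containing a stored entity name can therefore win even when its raw likelihood is lower.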

FIG. 18 illustrates an example of a screen on which a speech recognition result is displayed on a device according to an embodiment.

The device 100 according to an embodiment may perform speech recognition on a speech signal received from the user and output a word sequence 1811, "Cat and deers baseball team", obtained from the estimated character string. While on-device speech recognition is being performed, the device 100 may display on the screen an image 1812 indicating that on-device speech recognition is in progress.

When the reliability of the result of on-device speech recognition is sufficiently high, the device 100 according to an embodiment of the present disclosure may use that result as is.

On the other hand, when the device 100 according to an embodiment of the present disclosure determines that the reliability of the result of on-device speech recognition is not sufficiently high, it may transmit the character string that is the speech recognition result to the server 200.

The server 200 according to an embodiment may receive the character string from the device 100 and perform decoding using the language model and the dictionary information in the server 200, thereby obtaining the character string "Cardinals baseball team", in which at least one character of the received character string is corrected. The server 200 may transmit "Cardinals baseball team" to the device 100.

The device 100 according to an embodiment may output the character string 1821, "Cardinals baseball team", received from the server 200. While server-based speech recognition is being performed, the device 100 may display on the screen an image 1832 indicating that server-based speech recognition is in progress.

FIG. 19 is a detailed block diagram of a device according to an embodiment.

The device 100 shown in FIG. 19 may include the same components as the device 100 described with reference to FIG. 3. For example, among the components shown in FIG. 19, the control unit 2300 is the same as the processor 120 shown in FIG. 3, and the output unit 2200 is the same as the output unit 150 shown in FIG. 3. In addition, although not shown in FIG. 19, the memory 2700 of FIG. 19, like the memory 140 of FIG. 3, may store instructions for performing speech recognition, various models used for speech recognition, neural networks, dictionary information, and the like. Redundant descriptions are therefore omitted.

The device 100 shown in FIG. 19 may perform all of the operations and functions of the device 100 described with reference to FIGS. 3 to 18. Accordingly, the following describes the components of the device 100 that have not been described so far.

Referring to FIG. 19, the device 100 may include a user input unit 2100, an output unit 2200, a control unit 2300, a sensing unit 2400, a communication unit 2500, an A/V input unit 2600, and a memory 2700.

The user input unit 2100 refers to a means by which the user inputs data for controlling the device 100. For example, the user input unit 2100 may include, but is not limited to, a key pad, a dome switch, a touch pad (contact capacitive type, pressure resistive film type, infrared sensing type, surface ultrasonic conduction type, integral tension measurement type, piezo effect type, etc.), a jog wheel, and a jog switch. The user input unit 2100 may receive the user input required to generate conversation information to be provided to the user.

The output unit 2200 may output an audio signal, a video signal, or a vibration signal, and may include a display unit 2210, an audio output unit 2220, and a vibration motor 2230.

The vibration motor 2230 may output a vibration signal. For example, the vibration motor 2230 may output a vibration signal corresponding to the output of audio data or video data (e.g., a call signal reception sound or a message reception sound).

The sensing unit 2400 may detect a state of the device 100 or a state of the surroundings of the device 100, and may transmit the detected information to the controller 2300.

The sensing unit 2400 may include at least one of a geomagnetic sensor 2410, an acceleration sensor 2420, a temperature/humidity sensor 2430, an infrared sensor 2440, a gyroscope sensor 2450, a position sensor (e.g., GPS) 2460, an atmospheric pressure sensor 2470, a proximity sensor 2480, and an RGB sensor (illuminance sensor) 2490, but is not limited thereto. Since the function of each sensor can be intuitively inferred by a person skilled in the art from its name, a detailed description is omitted.

The communication unit 2500 may include components for communicating with other devices. For example, the communication unit 2500 may include a short-range communication unit 2510, a mobile communication unit 2520, and a broadcast receiver 2530.

The short-range wireless communication unit 2510 may include a Bluetooth communication unit, a Bluetooth Low Energy (BLE) communication unit, a near field communication (NFC) unit, a WLAN (Wi-Fi) communication unit, a Zigbee communication unit, an Infrared Data Association (IrDA) communication unit, a Wi-Fi Direct (WFD) communication unit, an ultra-wideband (UWB) communication unit, an Ant+ communication unit, and the like, but is not limited thereto.

The mobile communication unit 2520 transmits and receives wireless signals to and from at least one of a base station, an external terminal, and a server over a mobile communication network. Here, the wireless signals may include a voice call signal, a video call signal, or various types of data associated with the transmission and reception of text/multimedia messages.

The broadcast receiver 2530 receives a broadcast signal and/or broadcast-related information from the outside through a broadcast channel. The broadcast channel may include a satellite channel and a terrestrial channel. Depending on the implementation, the device 100 may not include the broadcast receiver 2530.

In addition, the communication unit 2500 may exchange, with a second interactive electronic device 3000, other devices, and servers, information necessary to generate conversation information to be provided to a first user.

The A/V (Audio/Video) input unit 2600 is for inputting an audio signal or a video signal, and may include a camera 2610, a microphone 2620, and the like. The camera 2610 may obtain image frames, such as still images or video, through an image sensor in a video call mode or a photographing mode. An image captured through the image sensor may be processed by the controller 2300 or by a separate image processing unit (not shown).

The image frames processed by the camera 2610 may be stored in the memory 2700 or transmitted to the outside through the communication unit 2500. Two or more cameras 2610 may be provided depending on the configuration of the terminal.

The microphone 2620 receives an external sound signal and processes it into electrical voice data. For example, the microphone 2620 may receive a sound signal from an external device or from a person speaking. The microphone 2620 may use various noise removal algorithms to remove noise generated while receiving the external sound signal.

The memory 2700 may store a program for the processing and control operations of the controller 2300, and may also store data input to the device 100 or output from the device 100.

The memory 2700 may include at least one type of storage medium among a flash memory type, a hard disk type, a multimedia card micro type, a card-type memory (e.g., SD or XD memory), RAM, SRAM, ROM, EEPROM, PROM, a magnetic memory, a magnetic disk, and an optical disk.

The programs stored in the memory 2700 may be classified into a plurality of modules according to their functions, for example, a UI module 2710, a touch screen module 2720, a notification module 2730, and the like.

The UI module 2710 may provide, for each application, a specialized UI, GUI, or the like that is linked with the device 100. The touch screen module 2720 may detect a user's touch gesture on a touch screen and transmit information about the touch gesture to the controller 2300. The touch screen module 2720 according to some embodiments may recognize and analyze a touch code. The touch screen module 2720 may also be configured as separate hardware including a controller.

The notification module 2730 may generate a signal for notifying the occurrence of an event in the device 100. Examples of events occurring in the device 100 include call signal reception, message reception, key signal input, and schedule notification. The notification module 2730 may output a notification signal in the form of a video signal through the display unit 2210, in the form of an audio signal through the sound output unit 2220, or in the form of a vibration signal through the vibration motor 2230.

The disclosed embodiments may be implemented as a S/W program including instructions stored in computer-readable storage media.

A computer is a device capable of calling stored instructions from a storage medium and operating according to the disclosed embodiments in accordance with the called instructions, and may include the image transmission device and the image reception device according to the disclosed embodiments.

The computer-readable storage medium may be provided in the form of a non-transitory storage medium. Here, 'non-transitory' means only that the storage medium does not include a signal and is tangible; it does not distinguish between data being stored semi-permanently and data being stored temporarily in the storage medium.

In addition, the electronic device or method according to the disclosed embodiments may be provided as part of a computer program product. A computer program product may be traded between a seller and a buyer as a commodity.

The computer program product may include a S/W program and a computer-readable storage medium in which the S/W program is stored. For example, the computer program product may include a product in the form of a S/W program (e.g., a downloadable app) that is distributed electronically by the manufacturer of the electronic device or through an electronic market (e.g., Google Play Store, App Store). For electronic distribution, at least a part of the S/W program may be stored in a storage medium or may be generated temporarily. In this case, the storage medium may be a storage medium of a server of the manufacturer, a server of the electronic market, or a relay server that temporarily stores the S/W program.

In a system consisting of a server and a terminal (e.g., an image transmission device or an image reception device), the computer program product may include a storage medium of the server or a storage medium of the terminal. Alternatively, when there is a third device (e.g., a smartphone) communicatively connected to the server or the terminal, the computer program product may include a storage medium of the third device. Alternatively, the computer program product may include the S/W program itself, transmitted from the server to the terminal or the third device, or transmitted from the third device to the terminal.

In this case, one of the server, the terminal, and the third device may execute the computer program product to perform the method according to the disclosed embodiments. Alternatively, two or more of the server, the terminal, and the third device may execute the computer program product to perform the method according to the disclosed embodiments in a distributed manner.

For example, a server (e.g., a cloud server or an artificial intelligence server) may execute a computer program product stored in the server to control a terminal communicatively connected to the server to perform the method according to the disclosed embodiments.

As another example, the third device may execute the computer program product to control a terminal communicatively connected to the third device to perform the method according to the disclosed embodiments. As a specific example, the third device may remotely control the image transmission device or the image reception device to transmit or receive a packed image.

When the third device executes the computer program product, the third device may download the computer program product from the server and execute the downloaded computer program product. Alternatively, the third device may execute a computer program product provided in a preloaded state to perform the method according to the disclosed embodiments.

Claims

A server comprising:
a memory storing one or more instructions;
a processor configured to execute the one or more instructions stored in the memory; and
a communication unit configured to receive a first character string from a device,
wherein the processor is configured to:
identify a plurality of estimated character strings from the first character string, and obtain a second character string based on the plurality of estimated character strings; and
control the communication unit to transmit the second character string to the device,
wherein the first character string is output through speech recognition processing from a speech signal input to the device.

The server of claim 1,
wherein the processor is configured to:
identify replacement characters corresponding to each character in the first character string, and identify the plurality of estimated character strings based on the identified replacement characters; and
obtain one estimated character string from among the plurality of estimated character strings as the second character string,
wherein the replacement characters are characters having a pronunciation similar to that of each of the characters.

The server of claim 1,
wherein the processor is configured to:
calculate, for each character in the first character string, likelihood matrices for replacement characters to be substituted for the character, and identify the plurality of estimated character strings based on likelihood values in the likelihood matrices.

The server of claim 1,
wherein the first character string includes characters respectively corresponding to speech signal frames into which the speech signal is divided at predetermined time intervals.

The server of claim 3,
wherein the processor is configured to:
calculate a likelihood of each of the plurality of estimated character strings based on the likelihood values in the likelihood matrices;
select one of the plurality of estimated character strings based on the likelihood, dictionary information, and a language model; and
obtain the second character string, in which at least one character in the first character string is replaced with another character according to the selected estimated character string.

The server of claim 3,
wherein the likelihood matrices obtained for each character
are calculated based on the characters accumulated before that character.

The server of claim 3,
wherein the likelihood matrices obtained for each character
are calculated based on posterior probabilities calculated from the characters accumulated before that character, and on character arrangement probabilities calculated from the characters accumulated before that character.

The server of claim 7,
wherein the posterior probabilities are calculated using an artificial intelligence recurrent neural network (RNN) including a plurality of long short-term memory (LSTM) layers and a softmax layer.

The server of claim 3,
wherein the likelihood matrices obtained for each character
are calculated based on a predetermined confusion matrix.

The server of claim 1,
wherein the processor is configured to
provide a service related to the speech signal input to the device based on the obtained second character string.

A device comprising:
a memory storing one or more instructions;
a processor configured to execute the one or more instructions stored in the memory; and
a communication unit configured to communicate with a server,
wherein the processor is configured to:
perform speech recognition on a speech signal to obtain a first character string;
determine whether to replace the first character string with another character string;
control the communication unit to transmit the first character string to the server based on the determination; and
control the communication unit to receive, from the server, a second character string obtained by the server by replacing at least one character in the first character string with another character.

An operating method of a server, the method comprising:
receiving a first character string from a device;
identifying a plurality of estimated character strings from the first character string;
obtaining a second character string based on the plurality of estimated character strings; and
transmitting the second character string to the device,
wherein the first character string is output through speech recognition processing from a speech signal input to the device.

The method of claim 12,
wherein the identifying of the plurality of estimated character strings from the first character string comprises:
identifying replacement characters corresponding to each character in the first character string; and
identifying the plurality of estimated character strings based on the identified replacement characters,
wherein the obtaining of the second character string based on the plurality of estimated character strings comprises
obtaining one estimated character string from among the plurality of estimated character strings as the second character string, and
wherein the replacement characters are characters having a pronunciation similar to that of each of the characters.

The method of claim 12,
wherein the identifying of the plurality of estimated character strings comprises:
calculating, for each character in the first character string, likelihood matrices for replacement characters to be substituted for the character; and
identifying the plurality of estimated character strings based on likelihood values in the likelihood matrices.

The method of claim 12,
wherein the first character string includes characters respectively corresponding to speech signal frames into which the speech signal is divided at predetermined time intervals.

The method of claim 14,
wherein the obtaining of the second character string comprises:
calculating a likelihood of each of the plurality of estimated character strings based on the likelihood values in the likelihood matrices;
selecting one of the plurality of estimated character strings based on the likelihood, dictionary information, and a language model; and
obtaining the second character string, in which at least one character in the first character string is replaced with another character according to the selected estimated character string.

The method of claim 14,
wherein the likelihood matrices obtained for each character
are calculated based on the characters accumulated before that character.

The method of claim 14,
wherein the likelihood matrices obtained for each character
are calculated based on posterior probabilities calculated from the characters accumulated before that character, and on character arrangement probabilities calculated from the characters accumulated before that character.

The method of claim 12,
further comprising providing a service related to the speech signal input to the device based on the obtained second character string.

An operating method of a device, the method comprising:
obtaining a first character string by performing speech recognition on a speech signal;
determining whether to replace the first character string with another character string;
transmitting the first character string to a server based on the determination; and
receiving, from the server, a second character string obtained by the server by replacing at least one character in the first character string with another character.
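
For illustration only, the server-side steps recited in the method claims (identifying estimated character strings from replacement characters, scoring them with likelihood values, and selecting one using dictionary information and a language model) may be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the confusion-matrix values, the dictionary, the out-of-vocabulary penalty, and every function name are hypothetical; the candidates are enumerated by brute force; and the RNN-based posterior probabilities are not modeled.

```python
import itertools
import math

# Hypothetical confusion matrix: for each recognized character, the assumed
# likelihood of each similar-sounding replacement character (including itself).
CONFUSION = {
    "c": {"c": 0.7, "k": 0.3},
    "k": {"k": 0.7, "c": 0.3},
    "e": {"e": 0.7, "a": 0.3},
    "a": {"a": 0.7, "e": 0.3},
}

DICTIONARY = {"cat", "call"}   # hypothetical dictionary information
OOV_PENALTY = -5.0             # assumed log-penalty per out-of-dictionary word


def replacement_likelihoods(ch):
    """Return the likelihoods of replacement characters for one character."""
    # Characters missing from the confusion matrix are kept unchanged.
    return CONFUSION.get(ch, {ch: 1.0})


def estimate_strings(first):
    """Yield (estimated string, log-likelihood) for every combination of
    replacement characters. Brute force: the count grows exponentially."""
    per_char = [replacement_likelihoods(ch).items() for ch in first]
    for combo in itertools.product(*per_char):
        text = "".join(ch for ch, _ in combo)
        loglik = sum(math.log(p) for _, p in combo)
        yield text, loglik


def select_second_string(first, dictionary=DICTIONARY,
                         lm_logprob=lambda text: 0.0):
    """Select one estimated string using the likelihood, a dictionary check,
    and a language-model score (a constant stub by default)."""
    def score(item):
        text, loglik = item
        oov = sum(1 for word in text.split() if word not in dictionary)
        return loglik + OOV_PENALTY * oov + lm_logprob(text)

    best_text, _ = max(estimate_strings(first), key=score)
    return best_text


if __name__ == "__main__":
    # "ket" is an assumed misrecognition received from the device.
    print(select_second_string("ket"))
```

With these assumed values, the string "ket" yields four estimated strings ("ket", "kat", "cet", "cat"), and the dictionary penalty makes "cat" the selected second character string even though "ket" has the highest raw likelihood. A practical system would likely prune the candidate set, for example with a beam search, rather than enumerate every combination.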