KR20210001937A

KR20210001937A - The device for recognizing the user's speech input and the method for operating the same

Info

Publication number: KR20210001937A
Application number: KR1020200069846A
Authority: KR
Inventors: 이경민; 한영호; 김상윤; 정동욱; 아환 쿠두물라; 한창우
Original assignee: 삼성전자주식회사
Priority date: 2019-06-28
Filing date: 2020-06-09
Publication date: 2021-01-06

Abstract

Disclosed are a device recognizing voice input of a user including a named entity, and an operating method thereof. According to one embodiment of the present invention, the device creates a weighted finite state transducer model by using a vocabulary list including a plurality of named entities; obtains a first character string from voice input received from a user by using a first decoding model; obtains a second character string including a word sequence corresponding to at least one named entity and an unrecognized word sequence which is not identified as a named entity by using a second decoding model using the weighted finite state transducer model; and replaces the unrecognized word sequence of the second character string with a word sequence included in the first character string, thereby outputting a text corresponding to the voice input.

Description

A device that recognizes the user's voice input and its operation method {THE DEVICE FOR RECOGNIZING THE USER'S SPEECH INPUT AND THE METHOD FOR OPERATING THE SAME}

The present disclosure relates to a device that uses an artificial intelligence model to recognize a voice input received from a user, and to a method of operating the device.

A speech recognition function lets a user control a device easily by recognizing the user's voice input, without operating a separate button or touching a touch module. Recently, as artificial intelligence (AI) technology has advanced, it has also been applied to speech recognition, enabling fast and accurate recognition of a wide variety of utterances.

In one method of recognizing a user's voice input using artificial intelligence technology, a voice signal, which is an analog signal, is received through a microphone, and the speech portion is converted into computer-readable text using an automatic speech recognition (ASR) model. The ASR model may be an artificial intelligence model. An artificial intelligence model may be processed by a dedicated AI processor whose hardware architecture is specialized for processing such models. An artificial intelligence model may be created through training. Here, being created through training means that a basic artificial intelligence model is trained on a large amount of training data by a learning algorithm, thereby producing a predefined operating rule or an artificial intelligence model configured to perform a desired characteristic (or purpose). An artificial intelligence model may consist of a plurality of neural network layers. Each of the neural network layers has a plurality of weight values and performs its neural network operation by computing on the output of the previous layer and the plurality of weight values.

An artificial intelligence model for speech recognition is created through training. For named entities in particular, the model must be trained on a very large number of pattern sentences, such as sentences combined with each named entity, which requires a large amount of computation and a long training time. Recently, on-device speech recognition, in which speech recognition such as ASR is performed on the device itself, has come into use; however, training on pattern sentences that include named entities takes a long time in an on-device environment because the amount of computation is too large. In addition, training on pattern sentences for named entities in an on-device environment requires learning every pattern sentence that combines the named entities, however many there are, with related commands and the like, so speech recognition using an artificial intelligence model generated on the device suffers from reduced accuracy.

An object of the present disclosure is to provide a speech recognition method and device that can improve the recognition accuracy of voice inputs containing named entities without training on pattern sentences.

To solve the technical problem described above, an embodiment of the present disclosure provides a method by which a device recognizes a voice input, the method including: generating a weighted finite state transducer (WFST) model by using a vocabulary list including a plurality of named entities to train the probability that a subword extracted from each of the named entities is predicted as a word or word sequence representing a named entity; receiving a voice input from a user; obtaining, using a first decoding model, a feature vector including probability values that the received voice input is predicted as specific subwords, and obtaining a first character string including a plurality of predicted character strings using the probability values of the feature vector; inputting the obtained feature vector to a second decoding model that uses the weighted finite state transducer model, and obtaining from the feature vector, using the second decoding model, a second character string including a word sequence corresponding to at least one of the named entities and an unrecognized word sequence that is not identified as the at least one named entity; and outputting text corresponding to the voice input by replacing the unrecognized word sequence in the second character string with a word sequence included in the first character string.
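The final replacement step can be illustrated with a minimal Python sketch. The text above does not say how the unrecognized word sequence is aligned with the first character string, so this sketch assumes that both decoders emit time-stamped tokens and that a token with `text=None` marks an unrecognized span. The `Token` class, the `merge_strings` function, and the sample timings are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Token:
    text: Optional[str]  # None marks an unrecognized span (assumed convention)
    start: float         # start time in seconds
    end: float           # end time in seconds


def merge_strings(first: List[Token], second: List[Token]) -> str:
    """Fill unrecognized spans of the second string with first-string words."""
    words: List[str] = []
    for seg in second:
        if seg.text is not None:
            # Named entity recognized by the WFST-based decoder: keep it.
            words.append(seg.text)
            continue
        # Unrecognized span: copy first-string words whose midpoint lies in it.
        for tok in first:
            mid = (tok.start + tok.end) / 2
            if seg.start <= mid < seg.end and tok.text is not None:
                words.append(tok.text)
    return " ".join(words)


# First decoder: general words, with the named entity misrecognized.
first = [Token("play", 0.0, 0.4), Token("cody", 0.4, 0.8),
         Token("jeans", 0.8, 1.2), Token("now", 1.2, 1.5)]
# Second decoder: named entity recognized, everything else unrecognized.
second = [Token(None, 0.0, 0.4), Token("Cody Jinks", 0.4, 1.2),
          Token(None, 1.2, 1.5)]
result = merge_strings(first, second)  # "play Cody Jinks now"
```

Time-based alignment is only one possible choice; any alignment that maps each unrecognized span to a span of the first character string would serve the same purpose.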

For example, generating the weighted finite state transducer model may include: obtaining a vocabulary list including a plurality of named entities; segmenting the words or character strings that make up the named entities into subwords, which are phoneme or syllable units; and obtaining a confidence score, including a posterior probability that a subword is predicted as a specific named entity among the plurality of named entities, by learning weights through state transitions that use the frequency of the subwords and the order in which the subwords are arranged.
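As a rough illustration of learning weights from subword frequency and order, the sketch below estimates transition costs between consecutive subwords by relative frequency and stores them as negative log probabilities, a common convention for WFST arc weights. This is an assumed simplification, not the training procedure of the disclosure; the function `transition_weights`, the boundary symbols `<s>` and `</s>`, and the sample segmentation are hypothetical.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def transition_weights(entities: List[List[str]]) -> Dict[Tuple[str, str], float]:
    """Estimate -log P(next | prev) for consecutive subwords from counts."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for subwords in entities:
        # Add boundary symbols so entity starts and ends are also scored.
        path = ["<s>"] + subwords + ["</s>"]
        for prev, nxt in zip(path, path[1:]):
            counts[prev][nxt] += 1
    weights: Dict[Tuple[str, str], float] = {}
    for prev, nexts in counts.items():
        total = sum(nexts.values())
        for nxt, count in nexts.items():
            weights[(prev, nxt)] = -math.log(count / total)
    return weights


# Hypothetical syllable-like segmentation of three named entities.
entities = [["co", "dy", "jinks"], ["co", "dy", "simpson"], ["car", "di", "b"]]
weights = transition_weights(entities)
```

In this toy data "co" is always followed by "dy", so that transition costs 0, while "dy" is followed by "jinks" half the time and costs log 2.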

For example, generating the weighted finite state transducer model may include performing filtering that removes, from the plurality of named entities in the obtained vocabulary list, any named entity that duplicates a word already stored in the memory of the device.

For example, the weighted finite state transducer model may include a lexicon finite state transducer (L FST), which contains mapping information in the form of probability values that a subword is predicted as a specific word, and a grammar finite state transducer (G FST), which contains weight information for predicting, when the specific word or a word sequence is input, the word sequence that may follow the input word or word sequence. The weighted finite state transducer model may be generated by composing the lexicon finite state transducer with the grammar finite state transducer.
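The effect of composing a lexicon transducer with a grammar transducer can be sketched in Python without a WFST library. The code below is not a general composition algorithm: it assumes dictionary-backed toy transducers and searches directly for the lowest-cost path that the composed model would accept, adding costs along a path and taking the minimum over paths. The tables `LEXICON` and `GRAMMAR`, their cost values, and the function `best_path` are hypothetical.

```python
import math
from typing import Dict, List, Tuple

# Toy lexicon transducer L: subword sequence -> (word, cost).
# Costs stand in for negative log probabilities.
LEXICON: Dict[Tuple[str, ...], Tuple[str, float]] = {
    ("co", "dy"): ("Cody", 0.1),
    ("jinks",): ("Jinks", 0.2),
    ("car", "di"): ("Cardi", 0.1),
    ("b",): ("B", 0.3),
}
# Toy grammar transducer G: (previous word, word) -> cost.
GRAMMAR: Dict[Tuple[str, str], float] = {
    ("<s>", "Cody"): 0.5, ("Cody", "Jinks"): 0.1,
    ("<s>", "Cardi"): 0.7, ("Cardi", "B"): 0.1,
}


def best_path(subwords: List[str]) -> Tuple[List[str], float]:
    """Lowest-cost word sequence for the subwords under L combined with G."""
    n = len(subwords)
    # best[i][w] = (cost, words) for covering subwords[:i] with last word w.
    best: List[Dict[str, Tuple[float, List[str]]]] = [dict() for _ in range(n + 1)]
    best[0]["<s>"] = (0.0, [])
    for i in range(n):
        for prev, (cost, words) in best[i].items():
            for key, (word, lex_cost) in LEXICON.items():
                j = i + len(key)
                # The lexicon entry must match the input and G must allow the word.
                if tuple(subwords[i:j]) != key or (prev, word) not in GRAMMAR:
                    continue
                total = cost + lex_cost + GRAMMAR[(prev, word)]
                if word not in best[j] or total < best[j][word][0]:
                    best[j][word] = (total, words + [word])
    if not best[n]:
        return [], math.inf
    cost, words = min(best[n].values(), key=lambda item: item[0])
    return words, cost


words, cost = best_path(["co", "dy", "jinks"])
```

A subword sequence that the lexicon or grammar cannot cover returns an empty word list with infinite cost, which loosely corresponds to an unrecognized word sequence.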

For example, the first decoding model may be an end-to-end automatic speech recognition (ASR) model.

For example, generating the weighted finite state transducer model may include: classifying the plurality of named entities according to a plurality of different domains; and generating a plurality of weighted finite state transducer models, one for each domain, using the classified named entities.

For example, the method may further include: identifying words corresponding to named entities from an application executed by the device or a web page provided through the device; and determining the domain into which the currently running application or the web page can be classified, by comparing the identified words with the named entities included in the vocabulary list of each of the weighted finite state transducer models generated for the respective domains.
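One simple way to realize this comparison is to count how many identified words appear in each domain's vocabulary list and pick the domain with the most matches. The sketch below assumes exact, case-insensitive matching; the function `select_domain`, the domain names, and the vocabulary contents are hypothetical.

```python
from typing import Dict, Iterable, Optional, Set


def select_domain(identified: Iterable[str],
                  domain_vocab: Dict[str, Set[str]]) -> Optional[str]:
    """Return the domain whose vocabulary list shares the most named entities."""
    found = {word.lower() for word in identified}
    best_domain: Optional[str] = None
    best_hits = 0
    for domain, vocab in domain_vocab.items():
        hits = len(found & {entry.lower() for entry in vocab})
        if hits > best_hits:
            best_domain, best_hits = domain, hits
    return best_domain  # None when no domain matches at all


domain_vocab = {
    "music": {"Chris Brown", "Cardi-B", "Cody Jinks"},
    "places": {"Seoul Station", "Gangnam"},
}
# Words identified on the current screen (assumed input).
identified = ["Cody Jinks", "Chris Brown", "Gangnam"]
domain = select_domain(identified, domain_vocab)  # "music"
```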

For example, the method may further include: receiving, from a server, update information for the vocabulary list, including at least one of an addition of a new named entity, a deletion of a named entity, and a change to a named entity; and updating the vocabulary list using the update information. In this case, generating the weighted finite state transducer model may include generating the model through training that uses the updated vocabulary list.
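Applying such update information to a vocabulary list might look like the following sketch. The format of the update information is not specified in the text above, so the dictionary with `add`, `delete`, and `change` keys is an assumption, as is the function name `apply_update`.

```python
from typing import Dict, Set


def apply_update(vocab: Set[str], update: Dict[str, list]) -> Set[str]:
    """Return a new vocabulary list with additions, deletions, and changes applied."""
    updated = set(vocab)
    updated |= set(update.get("add", []))
    updated -= set(update.get("delete", []))
    for old, new in update.get("change", []):
        # A change replaces an existing named entity with its new form.
        if old in updated:
            updated.remove(old)
            updated.add(new)
    return updated


vocab = {"Chris Brown", "Cody Jinks", "Road Trip"}
update = {
    "add": ["Cardi-B"],
    "delete": ["Road Trip"],
    "change": [("Cody Jinks", "Cody Johnson")],
}
new_vocab = apply_update(vocab, update)
```

The weighted finite state transducer model would then be regenerated from `new_vocab`; the original set is left untouched so the previous model remains valid until the new one is ready.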

For example, the method may further include: recognizing that the device has entered a new region by obtaining location information of the device; transmitting information about the device's entry into the new region to a server of an application service provider; and receiving, from the server of the application service provider, a point of interest (POI) vocabulary list including named entities for at least one of place names, attractions, tourist sites, and famous restaurants in the new region. In this case, generating the weighted finite state transducer model may include generating the model through training that uses the named entities included in the received POI vocabulary list.

For example, generating the weighted finite state transducer model may include generating the model through training that uses a personalized vocabulary list including a plurality of named entities that reflect the user's characteristics, obtained from at least one of applications frequently executed on the device, log data of a messenger application, and search history in a content streaming application.

To solve the technical problem described above, an embodiment of the present disclosure provides a device including: a voice input unit that receives a voice input from a user; a memory that stores a program including one or more instructions; and a processor that executes the one or more instructions of the program stored in the memory. The processor generates a weighted finite state transducer (WFST) model by using a vocabulary list including a plurality of named entities to train the probability that a subword extracted from each of the named entities is predicted as a word or word sequence representing a named entity; receives the voice input from the voice input unit; obtains, using a first decoding model, a feature vector including probability values that the received voice input is predicted as specific subwords; obtains a first character string including a plurality of predicted character strings using the probability values of the feature vector; inputs the obtained feature vector to a second decoding model that uses the weighted finite state transducer; obtains from the feature vector, using the second decoding model, a second character string including a named-entity word sequence corresponding to at least one of the named entities and an unrecognized word sequence that is not identified as the at least one named entity; and obtains text corresponding to the voice input by replacing the unrecognized word sequence in the second character string with a word sequence included in the first character string.

For example, the processor may generate the weighted finite state transducer model by obtaining a vocabulary list including the plurality of named entities, segmenting the words or character strings that make up the named entities into subwords, which are phoneme or syllable units, and obtaining a confidence score, including a posterior probability that a subword is predicted as a specific named entity among the plurality of named entities, by learning weights through state transitions that use the frequency of the subwords and the order in which the subwords are arranged.

For example, the processor may perform filtering that removes, from the plurality of named entities in the obtained vocabulary list, any named entity that duplicates a word already stored in the memory of the device.

For example, the first decoding model may be an end-to-end automatic speech recognition (ASR) model.

For example, the processor may classify the plurality of named entities according to a plurality of different domains and generate a plurality of weighted finite state transducer models, one for each domain, using the classified named entities.

For example, the processor may identify words corresponding to named entities from an application being executed or a web page being accessed, and may determine the domain into which the currently running application or the web page can be classified by comparing the identified words with the named entities included in the vocabulary list of each of the weighted finite state transducer models generated for the respective domains.

For example, the device may further include a communication interface that exchanges data with a server. The processor may use the communication interface to receive, from the server, update information for the vocabulary list, including at least one of an addition of a new named entity, a deletion of a named entity, and a change to a named entity; update the vocabulary list using the update information; and generate the weighted finite state transducer model through training that uses the updated vocabulary list.

For example, the device may further include a location sensor that obtains location information and a communication interface that exchanges data with a voice assistant server or an external server. The processor may recognize, using the location sensor, that the device has entered a new region; in response to recognizing the entry into the new region, receive through the communication interface, from a server of an application service provider, a point of interest (POI) vocabulary list including named entities for at least one of place names, attractions, tourist sites, and famous restaurants in the new region; and generate the weighted finite state transducer model through training that uses the named entities included in the received POI vocabulary list.

For example, the processor may generate the weighted finite state transducer model through training that uses a personalized vocabulary list including a plurality of named entities that reflect the user's characteristics, obtained from at least one of applications frequently executed on the device, log data of a messenger application, and search history in a content streaming application.

To solve the technical problem described above, another embodiment of the present disclosure provides a computer-readable recording medium on which a program to be executed on a computer is recorded.

FIG. 1 is a diagram illustrating an operation in which a device recognizes a user's voice input according to an embodiment of the present disclosure.
FIG. 2 is a diagram illustrating an operation in which a device and a server recognize a user's voice input according to an embodiment of the present disclosure.
FIG. 3 is a block diagram illustrating components of a device according to an embodiment of the present disclosure.
FIG. 4 is a block diagram illustrating components of a server according to an embodiment of the present disclosure.
FIG. 5 is a flowchart illustrating a method by which a device recognizes a voice input according to an embodiment of the present disclosure.
FIG. 6 is a flowchart illustrating an embodiment in which the device of the present disclosure generates a weighted finite state transducer model for named entities.
FIG. 7 is a diagram illustrating an embodiment in which the device of the present disclosure automatically selects a domain using a weighted finite state transducer model.
FIG. 8 is a flowchart illustrating an embodiment in which the device of the present disclosure automatically selects a domain using a weighted finite state transducer model.
FIG. 9 is a flowchart illustrating an embodiment in which the device of the present disclosure generates a weighted finite state transducer model for named entities using information received from a server.
FIG. 10 is a conceptual diagram illustrating an embodiment in which, when the device of the present disclosure enters a new region, a weighted finite state transducer model is generated using a point of interest vocabulary list for the new region.
FIG. 11 is a conceptual diagram illustrating an embodiment in which, when the device of the present disclosure enters a new region, a weighted finite state transducer model is generated using a point of interest vocabulary list for the new region.
FIG. 12 is a diagram illustrating an embodiment in which the device of the present disclosure generates a personalized weighted finite state transducer model using named entities that reflect the user's personal characteristics.

The terms used in the embodiments of this specification were selected, as far as possible, from general terms in wide current use, taking into account the functions of the present disclosure; however, they may vary with the intent of those working in the field, precedent, the emergence of new technologies, and so on. In certain cases the applicant has chosen a term arbitrarily, in which case its meaning is described in detail in the description of the relevant embodiment. Therefore, the terms used in this specification should be defined based on their meaning and the overall content of the present disclosure, not simply on their names.

A singular expression may include the plural unless the context clearly indicates otherwise. Terms used herein, including technical and scientific terms, may have the same meaning as commonly understood by a person of ordinary skill in the technical field described in this specification.

Throughout the present disclosure, when a part is said to "include" a component, this means that it may further include other components rather than excluding them, unless specifically stated otherwise. In addition, terms such as "... unit" and "... module" used in this specification refer to a unit that processes at least one function or operation, which may be implemented in hardware, in software, or in a combination of hardware and software.

The expression "configured to" used in this specification may be used interchangeably with, for example, "suitable for", "having the capacity to", "designed to", "adapted to", "made to", or "capable of", depending on the situation. The term "configured to" does not necessarily mean only "specifically designed to" in hardware. Instead, in some situations, the expression "a system configured to" may mean that the system is "capable of" an operation together with other devices or components. For example, the phrase "a processor configured to perform A, B, and C" may mean a dedicated processor (e.g., an embedded processor) for performing those operations, or a generic-purpose processor (e.g., a CPU or an application processor) that can perform those operations by executing one or more software programs stored in memory.

In the present disclosure, a 'character' is a symbol used to write human language in a visible form. For example, characters may include Hangul, alphabetic letters, Chinese characters, numerals, phonetic symbols, punctuation marks, and other symbols.

In the present disclosure, a 'character string' is a sequence of characters.

In the present disclosure, a 'grapheme' is the smallest unit representing a sound and consists of at least one character. For example, in an alphabetic writing system, a single character may be a grapheme, and a character string may mean a sequence of graphemes.

In the present disclosure, 'text' may include at least one grapheme. For example, text may include morphemes or words.

In the present disclosure, a 'word' is a basic unit of language that consists of at least one character string and is used independently or expresses a grammatical function.

In the present disclosure, a 'word sequence' is a sequence of one or more words.

In the present disclosure, a 'subword' is a basic unit that makes up a word, for example a phoneme or a syllable. The hidden Markov model (HMM) is the technique mainly used to model subwords; it collects the speech signals corresponding to each subword unit, extracts feature vectors from them, and then computes a probability distribution.

In the present disclosure, a 'label' is any subword representing a phoneme or syllable. Labels may be output by an end-to-end automatic speech recognition (ASR) model.

FIG. 1 is a diagram illustrating an operation in which a device 1000a recognizes a user's voice input according to an embodiment of the present disclosure.

Referring to FIG. 1, the device 1000a may include a WFST model generation module 1330, a WFST model 1340, a speech recognition module 1350, a deep neural network 1360, a communication interface 1400, and an output unit 1500. FIG. 1 shows only the components needed to describe the operation of the device 1000a; the components of the device 1000a are not limited to those shown in FIG. 1.

The device 1000a may generate a weighted finite state transducer (WFST) model 1340 for named entities using instructions or program code associated with the WFST model generation module 1330. A 'named entity' is a word or word sequence with a unique meaning, such as a person's name, a company name, a place name, a region name, a time, or a date. The WFST model generation module 1330 may obtain a vocabulary list including a plurality of named entities. In one embodiment, the WFST model generation module 1330 may obtain the vocabulary list by receiving user input, by receiving the list from an external server, or by crawling applications, web pages, and the like executed by the device 1000a. In the embodiment shown in FIG. 1, the named entity vocabulary list may include a plurality of named entities relating to singer names, such as Chris Brown, August Burns Red, Cardi-B, Cody Jinks, and Road Trip.

The WFST model generation module 1330 may include a text preprocessing module 1332, a filtering module 1334, and a probability model generation module 1336.

The text preprocessing module 1332 is configured to receive the plurality of named entities included in the vocabulary list and to output subwords by preprocessing the received named entities. In an embodiment, the text preprocessing module 1332 may segment the plurality of named entities into subword units. For example, the text preprocessing module 1332 may tokenize the words or word sequences included in the named entities into subword units by using a byte pair encoding (BPE) algorithm. In an embodiment, the text preprocessing module 1332 may also perform preprocessing that removes punctuation marks, special characters, special symbols, and stop words.

The filtering module 1334 is configured to perform filtering that removes, from the subwords output by the text preprocessing module 1332, vocabulary already stored in the dictionary DB 1320 (see FIG. 3) of the device 1000a, that is, in-vocabulary words. In an embodiment, the filtering module 1334 may remove, from the plurality of named entities in the vocabulary list, any word or word sequence identical to vocabulary already stored in the dictionary DB 1320. The filtering module 1334 may output only the subwords extracted from out-of-vocabulary (OOV) words, that is, words of named entities that are not included in the dictionary DB 1320.
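The preprocessing and filtering steps described above can be sketched as follows. This is only an illustrative sketch under stated assumptions: the `IN_VOCABULARY` set is a hypothetical stand-in for the dictionary DB 1320, the function names are invented for illustration, filtering is applied at the word level, and the fixed-size `segment` function is a toy replacement for BPE rather than an implementation of it.

```python
import re

# Hypothetical stand-in for the dictionary DB 1320 (words already known).
IN_VOCABULARY = {"road", "trip", "brown", "red"}


def preprocess(entity: str) -> list[str]:
    """Lowercase, replace punctuation/special characters, split into words."""
    cleaned = re.sub(r"[^\w\s]", " ", entity.lower())
    return cleaned.split()


def segment(word: str, size: int = 3) -> list[str]:
    """Toy subword segmentation into fixed-size chunks (NOT real BPE)."""
    return [word[i:i + size] for i in range(0, len(word), size)]


def oov_subwords(entities: list[str]) -> dict[str, list[str]]:
    """Keep only named entities with OOV words, mapped to their subwords."""
    result = {}
    for entity in entities:
        # Drop every word that the dictionary already contains.
        oov_words = [w for w in preprocess(entity) if w not in IN_VOCABULARY]
        if oov_words:
            result[entity] = [s for w in oov_words for s in segment(w)]
    return result
```

With this toy dictionary, `oov_subwords(["Cardi-B", "Road Trip"])` keeps only "Cardi-B", because both words of "Road Trip" are already in the vocabulary.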

The probability model generation module 1336 is configured to learn, through training, the probability that a subword output through the text preprocessing module 1332 and the filtering module 1334 is predicted as a word or word sequence representing at least one of the named entities in the vocabulary list, and to generate the WFST model 1340 as a result of the training. In an embodiment, the probability model generation module 1336 may generate a WFST model 1340 including a finite state transducer that encodes a mapping whose inputs are the subwords extracted from each of the named entities and whose outputs are the named entities.

The WFST model 1340 is a language model that, when a subword is input, outputs a word or word sequence based on the probability that the input subword is predicted as a word or word sequence representing a named entity. The WFST model 1340 may include a finite state transducer. The finite state transducer performs state transitions on input subwords according to preset rules and, according to the result of a state transition, may determine the subwords that can follow an input subword. A finite state transducer is a finite automaton whose state transitions are labeled with input and output subwords. It can be represented as a graph composed of nodes and arcs connecting the nodes, in which nodes represent states and arcs represent state transitions. Each arc is assigned an input subword and an output subword. The WFST model 1340 is obtained by additionally assigning a weight to each arc, and these weights make it possible to express probability. A hypothesis is generated by following arcs from the root node, and the probability of the hypothesis can be calculated by multiplying the weights assigned to those arcs. Based on this probability, the WFST model 1340 may output a word or word sequence from given subwords.
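The graph structure described above, with nodes as states, arcs carrying an input subword, an output label, and a weight, and a hypothesis probability obtained by multiplying arc weights, can be illustrated with a minimal sketch. The arcs, weights, and subword splits below are invented for illustration and are not taken from the disclosure; the sketch also assumes a deterministic transducer with a set of final states, which the text does not specify.

```python
from dataclasses import dataclass
from math import prod


@dataclass(frozen=True)
class Arc:
    src: int        # source state (node)
    dst: int        # destination state (node)
    in_sub: str     # input subword label
    out_word: str   # output label ("" means nothing is emitted)
    weight: float   # probability assigned to this transition


# Hypothetical toy graph; state 0 is the root node.
ARCS = [
    Arc(0, 1, "car", "", 0.5),
    Arc(1, 2, "di", "", 0.5),
    Arc(2, 3, "b", "Cardi-B", 1.0),
    Arc(0, 4, "cody", "", 0.5),
    Arc(4, 5, "jinks", "Cody Jinks", 1.0),
]
FINAL_STATES = {3, 5}


def transduce(subwords, arcs=ARCS, start=0):
    """Follow arcs from the root; return (output, probability) or None."""
    state, outputs, weights = start, [], []
    for sub in subwords:
        arc = next((a for a in arcs if a.src == state and a.in_sub == sub), None)
        if arc is None:
            return None  # no matching transition: the input is rejected
        if arc.out_word:
            outputs.append(arc.out_word)
        weights.append(arc.weight)
        state = arc.dst
    if state not in FINAL_STATES:
        return None  # the path stopped before completing a named entity
    return " ".join(outputs), prod(weights)
```

For the subword sequence `["car", "di", "b"]` the path passes through three arcs with weights 0.5, 0.5, and 1.0, so the hypothesis "Cardi-B" receives probability 0.25.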

The device 1000a may obtain text corresponding to a voice input received from a user and output the text by using instructions or program code related to the speech recognition module 1350. The speech recognition module 1350 may include a first decoding model 1352, a second decoding model 1354, and a recognition result integration module 1356.

The first decoding model 1352 is configured to receive a voice signal received from the user, to obtain a feature vector, which is a vector of probability values that a phoneme is predicted as a specific label according to the length of each phoneme of the voice signal, and to output a first character string based on the probability values of the feature vector. Here, a 'label' is a subword representing a phoneme or syllable, and is a token defined by the pre-trained deep neural network 1360. In an embodiment, the first decoding model 1352 may perform speech recognition on the voice signal by an end-to-end automatic speech recognition (ASR) method. The end-to-end ASR method is a speech recognition method that uses a deep neural network 1360 trained to map a voice signal directly to a character string or word sequence. Unlike other speech recognition methods that use multiple models, such as an acoustic model and a language model, the end-to-end ASR method can simplify the speech recognition process by using a single trained deep neural network 1360. Examples of end-to-end ASR models include a recurrent neural network transducer (RNN-T) model and an attention-based model.

In an embodiment, the first decoding model 1352 may use an end-to-end ASR model based on an attention-based model. The attention-based model may include, for example, a Transformer or Listen-Attend-Spell (LAS) model.

The first decoding model 1352 may select, from the feature vectors, one label corresponding to the phoneme of each frame. In an embodiment, the first decoding model 1352 may use an end-to-end ASR model to obtain a softmax output including the probability values that the phoneme of each frame matches a specific label. Using the softmax probability values, the first decoding model 1352 may select one label for the phoneme of each frame, concatenate the selected labels, and obtain label candidates by representing each phoneme with its corresponding label. The first decoding model 1352 may obtain the first character string by using the posterior probability of the label candidates. In the embodiment shown in FIG. 1, the user utters 'Please search Cardi-B', and the first decoding model 1352 obtains a voice signal from the utterance and obtains from it a first character string such as 'Please search card'. The first decoding model 1352 cannot accurately predict the character string corresponding to 'Cardi-B', which is an out-of-vocabulary (OOV) word not stored in the dictionary DB 1320 (see FIG. 3).
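The per-frame label selection described above can be sketched as a simple greedy pass over softmax outputs. This is a simplified illustration, not the decoding procedure of the disclosure: the function name is hypothetical, the probabilities are made up, and word-sized labels are used instead of phoneme- or syllable-level subwords so that the result is easy to read.

```python
def greedy_decode(frame_probs, labels):
    """Pick the most probable label per frame and concatenate the picks.

    frame_probs: one probability list (softmax output) per frame.
    labels: label inventory; labels[i] corresponds to probs[i].
    Returns the label sequence joined by spaces and the product of the
    selected probabilities as a simple score for the candidate.
    """
    picked, score = [], 1.0
    for probs in frame_probs:
        best = max(range(len(probs)), key=lambda i: probs[i])
        picked.append(labels[best])
        score *= probs[best]
    return " ".join(picked), score
```

Calling `greedy_decode` with three frames whose highest probabilities fall on "Please", "search", and "card" yields the string "Please search card", mirroring the first character string in the FIG. 1 example.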

In an embodiment, the first decoding model 1352 may convert a label-unit feature vector, which includes the probability values that the phoneme of each frame matches a specific label, into a subword-unit feature vector. The first decoding model 1352 may output the resulting subword-unit feature vector to the second decoding model 1354.

The second decoding model 1354 is configured to receive the subword-unit feature vector from the first decoding model 1352 and to obtain, from the subword-unit feature vector, a second character string including a word corresponding to at least one of the named entities in the vocabulary list and an unrecognized word sequence that is not identified as a named entity. The second decoding model 1354 may output the second character string predicted from the subword-unit feature vector by using the WFST model 1340. The second decoding model 1354 may calculate a confidence score based on the likelihood of a word for each subword, dictionary information, and a language model, and may select and output a character string with a high confidence score.

In an embodiment, the second decoding model 1354 may include a lexicon FST 1354L and a grammar FST 1354G. The second decoding model 1354 may output a word or word sequence from the subword-unit feature vector by composing the lexicon FST 1354L and the grammar FST 1354G. However, the present disclosure is not limited thereto, and the second decoding model 1354 may instead be configured as a single integrated module that outputs, from the subword-unit feature vector, a word corresponding to a named entity and an unrecognized word sequence.

The lexicon FST 1354L is configured to receive a subword-unit feature vector and to output a word or word sequence predicted based on the subword-unit feature vector. The lexicon FST 1354L may include mapping information, which is the probability that a subword is predicted as a specific word. In an embodiment, the lexicon FST 1354L may convert a subword sequence s into P(s|W), a probability with respect to a word sequence W.

The grammar FST 1354G is configured to receive, from the lexicon FST 1354L, the word or word sequence for the subword sequence, and to output a second character string including a word sequence corresponding to a named entity and an unrecognized word sequence. The grammar FST 1354G may be a model that has learned weights for predicting, when a specific word or word sequence is input, the word sequence that may follow it. The grammar FST 1354G may predict the word or word sequence that may follow a specific word or word sequence by using, for example, a recurrent neural network (RNN) or a statistical n-gram model. In an embodiment, the grammar FST 1354G includes information on the language model probability P(W) as a weight for the word sequence W, and may output a word or word sequence to which the probability P(W) is attached.
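One way to read the combination of the lexicon probability P(s|W) and the grammar probability P(W) is as a product used to rank candidate words. The sketch below illustrates that reading only. It uses plain dictionaries instead of composed transducers, every probability value and the second candidate "Cardigan" are invented for illustration, and falling back to `<unk>` for unknown subword sequences is an assumption about how an unrecognized word might be marked.

```python
# Hypothetical lexicon scores P(s | W), keyed by subword sequence s.
LEXICON = {
    ("car", "di", "b"): {"Cardi-B": 0.5, "Cardigan": 0.25},
}

# Hypothetical grammar (language model) scores P(W).
GRAMMAR = {"Cardi-B": 0.5, "Cardigan": 0.125}

UNK = "<unk>"


def best_word(subwords):
    """Return the candidate W maximizing P(s|W) * P(W), or <unk>."""
    candidates = LEXICON.get(tuple(subwords), {})
    scored = {w: p * GRAMMAR.get(w, 0.0) for w, p in candidates.items()}
    if not scored:
        return UNK, 0.0  # no named entity matches this subword sequence
    word = max(scored, key=scored.get)
    return word, scored[word]
```

Here "Cardi-B" scores 0.5 × 0.5 = 0.25 and outranks the other candidate, while a subword sequence with no lexicon entry is reported as `<unk>`.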

In the embodiment shown in FIG. 1, the second decoding model 1354 may use the WFST model 1340 to obtain, from the subword-unit feature vector output from the first decoding model 1352, a second character string including Cardi-B, which corresponds to a named entity, and the unrecognized word <unk>.

The recognition result integration module 1356 is configured to integrate the first character string obtained from the first decoding model 1352 with the second character string obtained from the second decoding model 1354. The recognition result integration module 1356 may output text corresponding to the voice input by replacing the unrecognized word sequence in the second character string with a word sequence included in the first character string. In an embodiment, the recognition result integration module 1356 may replace the unrecognized word sequence included in the second character string with the word sequence at the corresponding position in the first character string. In the embodiment shown in FIG. 1, the recognition result integration module 1356 may output "Please search Cardi-B", the text corresponding to the voice input, by replacing the unrecognized words <unk>, <unk> in the second character string with 'Please' and 'search' from the first character string, respectively.
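The replacement of unrecognized words by the words at the corresponding positions of the first character string can be sketched as follows. The function name `merge_results` is hypothetical, and the sketch assumes that both strings are whitespace-tokenized and aligned word by word, which holds for the FIG. 1 example but is an assumption about the general case.

```python
UNK = "<unk>"


def merge_results(first: str, second: str) -> str:
    """Replace each <unk> in `second` with the same-position word of `first`."""
    first_words = first.split()
    merged = []
    for i, word in enumerate(second.split()):
        if word == UNK and i < len(first_words):
            merged.append(first_words[i])  # take the word from the first string
        else:
            merged.append(word)  # keep the named entity (or an unmatched <unk>)
    return " ".join(merged)
```

For the strings in FIG. 1, `merge_results("Please search card", "<unk> <unk> Cardi-B")` produces "Please search Cardi-B".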

The text output from the recognition result integration module 1356 may be provided to the communication interface 1400 or the output unit 1500.

When generating a language model, conventional speech recognition technology prepared text as complete sentences, for example, "<named entity> + show me" or "Let's meet at <named entity> today", and therefore had to be trained with the pattern sentences for each named entity taken into account. In conventional speech recognition, the amount of training data and the amount of computation are large, so processing is very slow, taking several hours even in a server environment, and it is difficult to generate a language model in an on-device environment. In addition, because all pattern sentences for each named entity have to be learned, the recognition rate of the language model drops when no pattern exists.

The device 1000a according to an embodiment of the present disclosure generates the WFST model 1340 through training that uses a vocabulary list of named entities that are out-of-vocabulary (OOV) words not included in the dictionary DB 1320 (see FIG. 3). Of the user's voice input, general utterances other than named entities are decoded by the first decoding model 1352, which uses the end-to-end ASR method, and words corresponding to named entities are decoded by the second decoding model 1354. The unrecognized word sequence that cannot be converted by the second decoding model 1354 is replaced with the first character string obtained through the first decoding model 1352, so that text corresponding to the voice input can be output. Because the device 1000a of the present disclosure generates a WFST model 1340 that includes only the plurality of named entities and uses the WFST model 1340 to output the word or word sequence corresponding to a named entity, it can significantly reduce the processing time required to decode a voice input containing a named entity into text. In addition, because the outputs of both the first decoding model 1352 and the second decoding model 1354 are words, the device 1000a according to an embodiment of the present disclosure has the advantage that a separate alignment step for positioning the named entity within a pattern sentence can be omitted.

Further, because the device 1000a according to an embodiment of the present disclosure can perform speech recognition without pattern sentences for each of the plurality of named entities, the amount of computation is reduced, and speech recognition can therefore be performed even on the device 1000a, which has relatively low computing power compared to a server. The device 1000a according to an embodiment of the present disclosure recognizes a voice input containing a named entity by using the WFST model 1340 and converts it into words, which can improve the accuracy of speech recognition.

FIG. 2 is a diagram illustrating an operation in which a device 1000b and a server 2000 recognize a user's voice input according to an embodiment of the present disclosure.

Referring to FIG. 2, the device 1000b may include a speech recognition module 1350, a deep neural network 1360, a communication interface 1400, and an output unit 1500. The server 2000 may include a communication interface 2100 and a WFST decoder 2310. FIG. 2 illustrates only the components essential for describing the operation of the device 1000b and the server 2000. The components of the device 1000b and the server 2000 are not limited to those illustrated in FIG. 2.

The device 1000b may obtain text corresponding to a voice input received from a user and output the text by using instructions or program code related to the speech recognition module 1350. The speech recognition module 1350 may include a decoding model 1353 and a recognition result integration module 1356.

The decoding model 1353 is configured to receive a voice signal received from the user, to obtain a feature vector, which is a vector of probability values that a phoneme is predicted as a specific subword according to the length of each phoneme of the voice signal, and to output a first character string based on the probability values of the feature vector. The decoding model 1353 shown in FIG. 2 is the same as the first decoding model 1352 shown in FIG. 1, except that it outputs the subword-unit feature vector to the communication interface 1400 and outputs the first character string to the recognition result integration module 1356, so a redundant description is omitted.

In an embodiment, the decoding model 1353 may generate a lattice by connecting the labels having the highest probability values among the probability values of the feature vector, that is, the highest likelihood that a phoneme in a given time segment of the subword-unit feature vector is predicted as a specific label, and may provide the generated lattice to the communication interface 1400.

The communication interface 1400 transmits the subword-unit feature vector received from the decoding model 1353 to the server 2000. In an embodiment, the communication interface 1400 may transmit the subword-unit feature vector to the server 2000 through a wired or wireless network.

In an embodiment, the communication interface 1400 may transmit the lattice received from the decoding model 1353 to the server 2000. Because the data size of the lattice is smaller than the data size of the feature vectors of the individual subwords, transmitting the lattice to the server 2000 takes less time than transmitting the subword-unit feature vectors and can also prevent unexpected data loss.
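The size difference between the full feature vectors and the lattice can be made concrete with a small sketch. It assumes, for illustration only, that the lattice is a simple chain holding one (label index, probability) pair per frame, whereas the full feature vector holds one probability per label per frame; the function names are hypothetical and the disclosure does not specify the lattice encoding.

```python
def to_lattice(frame_probs):
    """Keep, per frame, only the index and probability of the best label."""
    return [max(enumerate(probs), key=lambda pair: pair[1]) for probs in frame_probs]


def payload_sizes(frame_probs):
    """Count the numbers to transmit for the full vectors vs. the lattice."""
    full = sum(len(probs) for probs in frame_probs)  # frames x labels
    lattice = 2 * len(frame_probs)                   # (index, probability) per frame
    return full, lattice
```

With T frames and V labels the full payload grows as T × V while this chain-style lattice grows as 2 × T, so the saving increases with the size of the label inventory.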

The server 2000 may receive the subword-unit feature vector from the device 1000b through the communication interface 2100. The server 2000 may obtain, from the subword-unit feature vector, a second character string including a named entity and an unrecognized word by using instructions or program code related to the WFST decoder 2310.

The WFST decoder 2310 is configured to use a WFST model 2320 to obtain a second character string including a word corresponding to at least one of the named entities in the vocabulary list and an unrecognized word sequence that is not identified as a named entity. The WFST model 2320 is a model that, when a subword is input, outputs a word or word sequence based on the probability that the subword is predicted as a word or word sequence representing a specific named entity. The WFST model 2320 is the same as the WFST model 1340 shown in FIG. 1, except that it is generated by and stored in the server 2000, so a redundant description is omitted. Like the second decoding model 1354 shown in FIG. 1, the WFST decoder 2310 may include a lexicon FST 1354L and a grammar FST 1354G. However, the present disclosure is not limited thereto, and the WFST decoder 2310 may instead be configured as a single integrated module that, when a subword-unit feature vector is input, outputs a word or word sequence with a high likelihood based on the probability values of the feature vector.

The WFST decoder 2310 performs the same operations and functions as the second decoding model 1354 shown in FIG. 1, so a redundant description is omitted.

The WFST decoder 2310 may output the second character string, which includes a named entity and an unrecognized word, to the communication interface 2100.

The server 2000 may transmit the second character string, which includes a named entity and an unrecognized word, to the device 1000 by using the communication interface 2100.

The device 1000b may receive the second character string from the server 2000 by using the communication interface 1400. The communication interface 1400 may provide the second character string received from the server 2000 to the recognition result integration module 1356.

The recognition result integration module 1356 is configured to integrate the first character string obtained from the decoding model 1353 with the second character string obtained from the WFST decoder 2310 of the server 2000. The recognition result integration module 1356 may output text corresponding to the voice input by replacing the unrecognized word sequence in the second character string with a word sequence included in the first character string. In the embodiment shown in FIG. 2, the recognition result integration module 1356 may output "Please search Cardi-B", the text corresponding to the voice input, by replacing the unrecognized words <unk>, <unk> in the second character string with 'Please' and 'search' from the first character string, respectively.

Unlike the device 1000a shown in FIG. 1, in the embodiment shown in FIG. 2 the device 1000b decodes only the first character string, that is, a general character string other than a named entity such as "Please search card", through the decoding model 1353 that uses an end-to-end ASR model, while the server 2000 decodes the second character string, which includes a named entity and an unrecognized word, through the WFST decoder 2310 that uses the WFST model 2320 trained on a plurality of named entities. In the embodiment shown in FIG. 2, the server 2000 has relatively higher data computation and processing capability for training than the device 1000b, so it can quickly generate the WFST model 2320 from the plurality of named entities and can update the WFST model 2320 by learning new named entities in real time. Accordingly, in the embodiment shown in FIG. 2, the device 1000b can accurately convert a voice input containing the latest named entities into text through the server 2000, and because it does not need to generate a WFST model through training on a vocabulary list including a plurality of named entities, the processing speed can also be improved. In addition, in an embodiment, the device 1000b transmits a lattice to the server 2000 rather than transmitting the subword-unit feature vector from the decoding model 1353, which can reduce the amount of data transmitted and improve the speed of data communication with the server 2000.

FIG. 3 is a block diagram illustrating components of a device 1000 according to an embodiment of the present disclosure.

The device 1000 may be an electronic device that receives a user's voice input and converts the voice input into text by processing it. The device 1000 may be configured as, for example, at least one of a smartphone, a tablet personal computer (PC), a mobile phone, a video phone, an e-book reader, a desktop PC, a laptop PC, a netbook computer, a workstation, a server, a personal digital assistant (PDA), a portable multimedia player (PMP), an MP3 player, a mobile medical device, a camera, or a wearable device. However, the device 1000 is not limited to the examples described above.

디바이스(1000)는 음성 입력부(1100), 프로세서(1200), 메모리(1300), 통신 인터페이스(1400), 및 출력부(1500)를 포함할 수 있다. The device 1000 may include a voice input unit 1100, a processor 1200, a memory 1300, a communication interface 1400, and an output unit 1500.

The voice input unit 1100 may receive a voice input from a user. In an embodiment, the voice input unit 1100 may include a microphone. The voice input unit 1100 may receive a voice input (for example, a user's utterance) from the user through the microphone and obtain a voice signal from the received voice input. In an embodiment, the processor 1200 of the device 1000 may convert sound received through the microphone into an acoustic signal and remove noise (for example, non-speech components) from the acoustic signal to obtain the voice signal.

Although not shown in the drawing, the device 1000 may include a voice preprocessing module having a function of detecting a designated voice input (for example, a wake-up input such as 'Hi Bixby' or 'Okay Google') or a function of preprocessing a voice signal obtained from some voice inputs.

The processor 1200 may execute one or more instructions of a program stored in the memory 1300. The processor 1200 may be composed of hardware components that perform arithmetic, logic, input/output operations, and signal processing. The processor 1200 may be configured as, for example, at least one of a central processing unit (CPU), a microprocessor, a graphics processing unit (GPU), application specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), and field programmable gate arrays (FPGAs), but is not limited thereto.

The memory 1300 may store a program including instructions for processing a user's voice input received through the voice input unit 1100 and converting it into text. The memory 1300 may store instructions and program code readable by the processor 1200. In the following embodiments, the operations of the processor 1200 may be implemented by executing instructions or code of a program stored in the memory.

The memory 1300 may store data corresponding to each of an entity name vocabulary list DB 1310, a dictionary database (dictionary DB) 1320, a WFST model generation module 1330, a WFST model 1340, a speech recognition module 1350, and a deep neural network 1360.

The memory 1300 may include at least one type of storage medium among, for example, a flash memory type, a hard disk type, a multimedia card micro type, a card-type memory (for example, SD or XD memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, a magnetic disk, and an optical disk.

The processor 1200 may generate a WFST model 1340 for entity names (named entities) by using instructions or program code related to the WFST model generation module 1330 stored in the memory 1300. An 'entity name' means a word or word sequence having a unique meaning, such as a person's name, a company name, a place name, a region name, a time, or a date.

The processor 1200 may obtain a vocabulary list including a plurality of entity names and store the obtained vocabulary list in the entity name vocabulary list DB 1310. In an embodiment, using instructions or program code of the WFST model generation module 1330, the processor 1200 may obtain the vocabulary list by receiving a user input, by receiving the list from an external server, or by crawling applications, web pages, and the like executed by the device 1000. For example, the processor 1200 may obtain words or word sequences corresponding to a plurality of entity names included in a running application from the application programming interface (API) of that application. For example, in the case of a music streaming application, the processor 1200 may obtain entity names such as song titles, artist names, and composer names for music that is included in the user's playlist or is currently playing, and store the obtained entity names in the entity name vocabulary list DB 1310.

The text preprocessing module 1332 is configured to receive the plurality of entity names included in the vocabulary list and to output subwords by preprocessing the received entity names. In an embodiment, the processor 1200 may segment the plurality of entity names stored in the entity name vocabulary list DB 1310 into subword units using instructions or program code related to the text preprocessing module 1332. A 'subword' is a basic unit constituting a word, for example a phoneme or a syllable. The processor 1200 may segment the entity names into subword units using, for example, a hidden Markov model (HMM), but is not limited thereto. As another example, the processor 1200 may tokenize the words or word sequences included in the entity names into subword units using a byte pair encoding (BPE) algorithm.
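
The BPE-based tokenization mentioned above can be illustrated with a minimal sketch. It assumes a merge table has already been learned; the table `MERGES` and the function name `bpe_segment` are hypothetical and are not part of the disclosure.

```python
def bpe_segment(word, merges):
    """Split a word into subwords by replaying BPE merges in priority order."""
    symbols = list(word)  # start from single characters
    for left, right in merges:  # earlier merges have higher priority
        merged = []
        i = 0
        while i < len(symbols):
            # Join an adjacent (left, right) pair into one symbol.
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Hypothetical merge table, assumed to have been learned from a vocabulary list.
MERGES = [("c", "a"), ("ca", "r"), ("d", "i")]
print(bpe_segment("cardib", MERGES))  # ['car', 'di', 'b']
```

In this sketch, characters that never take part in a merge remain as single-character subwords, so any input can still be segmented.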

일 실시예에서, 프로세서(1200)는 텍스트 전처리 모듈(1332)을 이용하여 복수의 개체명에 포함된 구두점 또는 특수 문자, 특수 기호 등을 제거하고, 불용어를 제거하는 전처리를 수행할 수도 있다. In an embodiment, the processor 1200 may use the text preprocessing module 1332 to remove punctuation marks, special characters, and special symbols included in a plurality of entity names, and perform preprocessing of removing stop words.

The filtering module 1334 is configured to perform filtering that removes, from the subwords output by the text preprocessing module 1332, in-vocabulary words already stored in the dictionary DB 1320. In an embodiment, the processor 1200 may use the filtering module 1334 to remove, from the plurality of entity names included in the entity name vocabulary list DB 1310, words or word sequences identical to vocabulary already stored in the dictionary DB 1320. Through this filtering, the processor 1200 may obtain only the subwords extracted from out-of-vocabulary (OOV) words, that is, words corresponding to entity names not included in the dictionary DB 1320.
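
As a rough illustration of this filtering step, the sketch below keeps only entity names that do not appear in a dictionary. The function name `filter_oov`, the case-insensitive comparison, and the sample data are assumptions made for the example.

```python
def filter_oov(entity_names, dictionary):
    """Return the entity names that are not already in the dictionary (OOV only)."""
    vocab = {word.lower() for word in dictionary}  # assumed case-insensitive match
    return [name for name in entity_names if name.lower() not in vocab]


# Hypothetical contents of the entity name list and the dictionary.
names = ["Cardi-B", "card", "Seoul Tower"]
dictionary = {"card", "please", "search"}
print(filter_oov(names, dictionary))  # ['Cardi-B', 'Seoul Tower']
```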

The probability model generation module 1336 is configured to learn the probability with which the subwords output through the text preprocessing module 1332 and the filtering module 1334 can be predicted as a word or word sequence representing at least one of the entity names included in the entity name vocabulary list, and to generate the WFST model 1340 as a result of the training. In an embodiment, using instructions or program code of the probability model generation module 1336, the processor 1200 may generate a WFST model 1340 including a finite state transducer that encodes a mapping whose input is the subwords extracted from each of the entity names stored in the entity name vocabulary list DB 1310 and whose output is the entity name.
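
The subword-to-entity-name mapping can be pictured with a toy weighted transducer. The sketch below is a deliberate simplification: it builds a prefix tree over subwords and attaches the output entity name and a weight (negative log probability) to the final state. The names `build_entity_fst` and `transduce`, and the probability values, are hypothetical; a production system would more likely rely on a dedicated FST toolkit.

```python
import math


def build_entity_fst(entities):
    """Build a toy transducer from {entity name: (subword tuple, probability)}."""
    arcs = {}    # (state, input subword) -> next state
    finals = {}  # final state -> (output entity name, weight)
    next_state = 1  # state 0 is the start state
    for name, (subwords, prob) in entities.items():
        state = 0
        for sw in subwords:
            key = (state, sw)
            if key not in arcs:
                arcs[key] = next_state
                next_state += 1
            state = arcs[key]
        finals[state] = (name, -math.log(prob))  # weight as negative log probability
    return arcs, finals


def transduce(arcs, finals, subwords):
    """Follow the subword sequence; return (entity name, weight) or None."""
    state = 0
    for sw in subwords:
        state = arcs.get((state, sw))
        if state is None:
            return None  # no path for this subword sequence
    return finals.get(state)


# Hypothetical entity list with made-up probabilities.
ENTITIES = {"Cardi-B": (("car", "di", "b"), 0.5)}
arcs, finals = build_entity_fst(ENTITIES)
print(transduce(arcs, finals, ["car", "di", "b"]))
print(transduce(arcs, finals, ["car"]))  # None: not a complete entity name
```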

In an embodiment, the processor 1200 may classify the plurality of entity names stored in the entity name vocabulary list DB 1310 into a plurality of different domains and use the classified entity names to generate a plurality of WFST models 1340, one for each domain. Here, a 'domain' means a field or category related to the voice input received from the user, and may be set in advance according to, for example, the meaning or attributes of the voice input. Domains may also be classified according to the services related to the voice input. The domains may include, for example, one or more of a movie domain, a music domain, a book domain, a game domain, an aviation domain, and a food domain. An embodiment in which the processor 1200 generates a plurality of WFST models 1340 for each domain and automatically determines the domain using the generated WFST models 1340 will be described in detail with reference to FIGS. 7 and 8.

In an embodiment, the processor 1200 may use the communication interface 1400 to receive, from a server of an application service provider, a point of interest (POI) vocabulary list including entity names for at least one of place names, landmarks, tourist attractions, and famous restaurants of a new region, and may generate the WFST model 1340 through training that uses the entity names included in the received POI vocabulary list. For example, the application may be a navigation application. A specific embodiment in which the processor 1200 receives a POI vocabulary list containing entity names for a new region and generates the WFST model 1340 using the POI vocabulary list will be described in detail with reference to FIGS. 10 and 11.

일 실시예에서, 프로세서(1200)는 자주 실행하는 애플리케이션, 메신저 애플리케이션의 로그 데이터(log data), 컨텐트 스트리밍 애플리케이션에서의 검색어 기록 중 적어도 하나로부터 사용자의 특성을 반영하는 복수의 개체명을 개체명 어휘리스트 DB(1310)에 저장하고, 저장된 개인화된 어휘 리스트를 이용하는 학습을 통해 WFST 모델(1340)을 생성할 수 있다. 프로세서(1200)가 개인화된 어휘 리스트를 이용하여 WFST 모델(1340)을 생성하는 구체적인 실시예는 도 12에서 상세하게 설명하기로 한다. In one embodiment, the processor 1200 stores a plurality of entity names reflecting the user's characteristics from at least one of a frequently executed application, log data of a messenger application, and a search term record in a content streaming application. The WFST model 1340 may be generated through learning that is stored in the list DB 1310 and uses the stored personalized vocabulary list. A specific embodiment in which the processor 1200 generates the WFST model 1340 using a personalized vocabulary list will be described in detail with reference to FIG. 12.

WFST 모델(1340)은 입력된 서브 워드가 개체명을 나타내는 단어 또는 단어 열로 예측될 수 있는 확률에 기초하여, 서브 워드가 입력되면 단어 또는 단어 열을 출력하는 언어 모델이다. WFST 모델(1340)은 발생 확률을 기초로, 특정 서브 워드로부터 단어 또는 단어 열로 출력할 수 있다. The WFST model 1340 is a language model that outputs a word or word sequence when a sub word is input based on a probability that the input sub word can be predicted as a word or word sequence representing an entity name. The WFST model 1340 may output a word or word string from a specific subword based on the probability of occurrence.

음성 인식 모듈(1350)은 음성 입력부(1100)를 통해 획득한 사용자의 음성 입력을 처리함으로써, 음성 입력에 대응되는 텍스트를 출력하도록 구성되는 모듈이다. 음성 인식 모듈(1350)은 제1 디코딩 모델(1352), 제2 디코딩 모델(1354), 및 인식 결과 통합 모듈(1356)을 포함할 수 있다. The voice recognition module 1350 is a module configured to output a text corresponding to the voice input by processing the user's voice input acquired through the voice input unit 1100. The speech recognition module 1350 may include a first decoding model 1352, a second decoding model 1354, and a recognition result integration module 1356.

The first decoding model 1352 is configured to receive a voice input, obtain a voice signal from the received voice input, obtain a feature vector, which is a vector of probability values with which a phoneme can be predicted as a specific label according to the length of each phoneme of the voice signal, and output a first character string based on the probability values of the feature vector. Here, a 'label' is a token defined by the pre-trained deep neural network 1360. In an embodiment, a 'label' may be any subword representing a phoneme or a syllable. A label may be the same concept as a subword, but is not limited thereto.

In an embodiment, the first decoding model 1352 may perform speech recognition on the voice signal using an end-to-end automatic speech recognition (ASR) method. The end-to-end ASR method is a speech recognition method that uses a deep neural network 1360 trained to map a voice signal directly to a character string or word sequence. Unlike other speech recognition methods that use multiple models such as an acoustic model and a language model, the end-to-end ASR method can simplify the speech recognition process by using a single trained deep neural network 1360. Examples of end-to-end ASR models include a recurrent neural network transducer (RNN-T) model and an attention-based model.

일 실시예에서, 제1 디코딩 모델(1352)은 Attention-based 모델에 기초하는 종단 간 ASR 모델을 이용할 수 있다. Attention-based 모델은 예를 들어, Transformer 또는 Listen-Attend-Spell(LAS)를 포함할 수 있다. In one embodiment, the first decoding model 1352 may use an end-to-end ASR model based on an attention-based model. The attention-based model may include, for example, Transformer or Listen-Attend-Spell (LAS).

In an embodiment, the speech recognition module 1350 may further include a voice input preprocessing module. Using instructions or program code of the voice input preprocessing module, the processor 1200 may perform analog-to-digital (A/D) conversion on the voice input and divide the voice signal, output as a digital signal, into frames using partially overlapping windows of a predetermined length and a predetermined shift amount. Using the voice preprocessing module, the processor 1200 may perform predetermined signal processing on each of the frames obtained from the module and extract a feature vector by extracting the speech features of each frame. Mel-frequency cepstral coefficients (MFCC), first derivatives, second derivatives, and the like may be used as the speech features.
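
The framing step described above, in which windows of a fixed length advance by a fixed shift so that neighboring frames overlap, might look like the following sketch. The function name `frame_signal` and the sample values are illustrative assumptions, and windowing functions and feature extraction are left out.

```python
def frame_signal(samples, frame_len, shift):
    """Cut a sample sequence into overlapping frames of frame_len, stepping by shift."""
    frames = []
    # Stop once a full frame no longer fits; trailing samples are dropped.
    for start in range(0, len(samples) - frame_len + 1, shift):
        frames.append(samples[start:start + frame_len])
    return frames


# Ten dummy samples, frames of 4 samples with a shift of 2 (50% overlap).
frames = frame_signal(list(range(10)), frame_len=4, shift=2)
print(len(frames))  # 4
print(frames[1])    # [2, 3, 4, 5]
```

Choosing a shift smaller than the frame length is what produces the partial overlap between consecutive frames.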

Using instructions or program code related to the first decoding model 1352, the processor 1200 may extract, from the signal-processed voice signal, a feature vector representing the posterior probability that the frame at each time corresponds to a specific label. In an embodiment, the feature vector may be a softmax column containing posterior probability values with which a voice signal of a predetermined length can be predicted as a specific label. The probability values included in each softmax column sum to 1.

Using the first decoding model 1352, the processor 1200 may obtain label candidates by selecting one label in each frame from the feature vectors obtained in time order, concatenating the selected labels, and expressing each phoneme by its corresponding label. The processor 1200 may obtain the first character string using the posterior probabilities of the label candidates. For example, when the voice input received from the user is the utterance 'Please search Cardi-B', the processor 1200 may use the first decoding model 1352 to obtain a first character string such as 'Please search card' from the voice signal extracted from the utterance. In this case, the processor 1200 cannot accurately predict the character string corresponding to 'Cardi-B', an entity name that is not stored in the dictionary DB 1320.
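
A minimal sketch of selecting one label per frame from softmax posteriors is given below. It assumes the network emits one score vector per frame and uses whole words as labels purely for readability; the names `softmax` and `greedy_labels` and the scores are made up for the example. Real decoders typically keep several candidates rather than only the single best label.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scores]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_labels(frame_scores, labels):
    """Pick the most probable label in each frame, in time order."""
    picked = []
    for scores in frame_scores:
        probs = softmax(scores)
        best = max(range(len(probs)), key=probs.__getitem__)
        picked.append((labels[best], probs[best]))
    return picked


# Hypothetical label set and per-frame scores.
LABELS = ["please", "search", "card"]
FRAME_SCORES = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0], [0.0, 0.0, 1.0]]
first_string = " ".join(label for label, _ in greedy_labels(FRAME_SCORES, LABELS))
print(first_string)  # please search card
```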

일 실시예에서, 제1 디코딩 모델(1352)은 라벨 단위의 특징 벡터를 서브 워드 단위의 특징 벡터로 변환하고, 변환 결과 획득된 서브 워드 단위의 특징 벡터를 제2 디코딩 모델(1354)로 출력할 수 있다. In one embodiment, the first decoding model 1352 converts a feature vector in units of labels into a feature vector in units of sub words, and outputs the feature vector in units of sub words obtained as a result of the conversion as the second decoding model 1354. I can.

제2 디코딩 모델(1354)은 서브 워드 단위 특징 벡터로부터 개체명 어휘 리스트에 포함되는 복수의 개체명 중 적어도 하나의 개체명에 해당되는 단어 및 개체명으로 식별되지 않는 미인식 단어열을 포함하는 제2 문자열을 획득하도록 구성된다. 제2 디코딩 모델(1354)은 WFST 모델(1340)을 이용함으로써 서브 워드 단위 특징 벡터로부터 예측되는 제2 문자열을 출력할 수 있다. 프로세서(1200)는 제2 디코딩 모델(1354)과 관련된 명령어 또는 프로그램 코드를 이용하여, 서브 워드 별 단어에 대한 가능도(likelihood), 사전 정보, 및 언어 모델에 기초하여 신뢰도 점수(confidence score)를 계산하고, 신뢰도 점수가 높은 문자열을 선택하고, 선택된 문자열을 포함하는 제2 문자열을 출력할 수 있다. The second decoding model 1354 includes a word corresponding to at least one entity name among a plurality of entity names included in the entity name vocabulary list from the subword unit feature vector, and an unrecognized word sequence that is not identified as entity name. It is configured to obtain 2 strings. The second decoding model 1354 may output a second character string predicted from a sub-word unit feature vector by using the WFST model 1340. The processor 1200 uses a command or program code related to the second decoding model 1354 to calculate a confidence score based on the likelihood of a word for each subword, dictionary information, and a language model. Calculate, select a character string having a high reliability score, and output a second character string including the selected character string.

제2 디코딩 모델(1354)은 렉시콘 FST(1354L), 및 그래머 FST(1354G)를 포함할 수 있다. 제2 디코딩 모델(1354)은 토큰 FST(1354T), 렉시콘 FST(1354L) 및 그래머 FST(1354G)를 합성(composition)함으로써 서브 워드 열(라벨)로부터 단어 또는 단어 열을 출력할 수 있다. The second decoding model 1354 may include a Lexicon FST 1354L and a grammar FST 1354G. The second decoding model 1354 may output a word or a word sequence from a sub word sequence (label) by composing a token FST 1354T, a Lexicon FST 1354L, and a grammar FST 1354G.

The lexicon FST 1354L is configured to receive the subword-unit feature vector from the first decoding model 1352 and to output a word or word sequence predicted on the basis of that feature vector. Using instructions or program code related to the lexicon FST 1354L, the processor 1200 may hold mapping information, that is, probability values with which a subword can be predicted as a specific word. In an embodiment, the lexicon FST 1354L may convert a subword sequence s into P(s|W), the probability associated with a word sequence W.

The grammar FST 1354G is configured to receive the word or word sequence of the subword sequence from the lexicon FST 1354L and to output a second character string including a word sequence corresponding to an entity name and an unrecognized word sequence. The grammar FST 1354G may be a model that has learned weights for predicting, when a specific word or word sequence is input, the word sequence that may follow it. The grammar FST 1354G may predict the word or word sequence that may follow a specific word or word sequence using, for example, a recurrent neural network (RNN) or a statistical n-gram model. In an embodiment, using instructions or program code related to the grammar FST 1354G, the processor 1200 may hold information on the language model probability P(W) as a weight for the word sequence W, and may output a word or word sequence to which the probability P(W) is attached.

Using instructions or program code of the second decoding model 1354, the processor 1200 may calculate the word posterior probability P(W|X) by combining P(s|X), the probability that the input X of the label-unit feature vector is predicted as the subword s, with P(W|s), the probability that the subword s is predicted as the word or word sequence W, and may output the word or word sequence corresponding to an entity name by searching for the hypothesis that maximizes the word posterior probability P(W|X). The processor 1200 may obtain a word corresponding to an entity name learned through the WFST model 1340, for example 'Cardi-B'. However, when the second decoding model 1354 is used, the processor 1200 cannot identify words or word sequences other than the entity names learned by the WFST model 1340, and may output them as an unrecognized word sequence (<unk> <unk>). The processor 1200 may thus obtain a second character string including an entity name and an unrecognized word sequence.
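
The text does not spell out how the two probabilities are combined. One common reading, given here only as an assumption, is that the word sequence W depends on the input X solely through the subword sequence s, in which case the posterior is obtained by summing over subword sequences and the decoder searches for the best hypothesis:

```latex
% Assumption: W is conditionally independent of X given s.
P(W \mid X) = \sum_{s} P(W \mid s)\, P(s \mid X),
\qquad
\hat{W} = \arg\max_{W} P(W \mid X)
```

In practice the sum is often approximated by its largest term, so that the search runs over single best paths through the composed transducer.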

The recognition result integration module 1356 is configured to integrate the first character string obtained from the first decoding model 1352 with the second character string obtained from the second decoding model 1354. Using instructions or program code related to the recognition result integration module 1356, the processor 1200 may output text corresponding to the voice input by replacing the unrecognized word sequence in the second character string with a word sequence included in the first character string. In an embodiment, the processor 1200 may replace the unrecognized word sequence included in the second character string with the word sequence in the first character string that occupies the corresponding position. For example, the processor 1200 may obtain the text "Please search Cardi-B" corresponding to the voice input by replacing the unrecognized words <unk> and <unk> in the second character string with 'Please' and 'search' from the first character string, respectively.
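
The position-based replacement in this example can be sketched as follows. The sketch assumes that both character strings contain the same number of whitespace-separated words so that positions line up one to one; the function name `merge_results` is hypothetical, and how misaligned strings would be handled is not addressed.

```python
UNK = "<unk>"


def merge_results(first, second):
    """Replace each <unk> in the second string with the word at the same position in the first."""
    first_words = first.split()
    merged = []
    for i, word in enumerate(second.split()):
        if word == UNK and i < len(first_words):
            merged.append(first_words[i])  # take the general word from the first string
        else:
            merged.append(word)  # keep the entity name from the second string
    return " ".join(merged)


print(merge_results("Please search card", "<unk> <unk> Cardi-B"))  # Please search Cardi-B
```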

프로세서(1200)는 음성 인식 모듈(1350)을 통해 획득한 텍스트를 통신 인터페이스(1400) 또는 출력부(1500)에 제공할 수 있다. The processor 1200 may provide the text acquired through the speech recognition module 1350 to the communication interface 1400 or the output unit 1500.

The communication interface 1400 may perform data communication with the server 2000 (see FIG. 2), a server of an application service provider, or another device. The communication interface 1400 may transmit and receive data to and from the server 2000, the server of the application service provider, or the other device using at least one data communication method among, for example, wired LAN, wireless LAN, Wi-Fi, Bluetooth, Zigbee, Wi-Fi Direct (WFD), infrared communication (IrDA, Infrared Data Association), Bluetooth Low Energy (BLE), near field communication (NFC), wireless broadband internet (WiBro), Worldwide Interoperability for Microwave Access (WiMAX), Shared Wireless Access Protocol (SWAP), Wireless Gigabit Alliance (WiGig), and RF communication.

출력부(1500)는 음성 입력에 대응되는 텍스트를 출력할 수 있다. 출력부(1500)는 음성 인식이 수행된 결과, 즉 텍스트를 사용자에게 알리거나, 또는 외부 디바이스(예를 들어, 스마트 폰, 가전 제품, 웨어러블 디바이스, 서버 등)에게 전송할 수 있다. 출력부(1500)는 디스플레이부(1510) 및 스피커(1520)를 포함할 수 있다. The output unit 1500 may output text corresponding to a voice input. The output unit 1500 may notify a user of a result of speech recognition, that is, a text, or may transmit it to an external device (eg, a smart phone, a home appliance, a wearable device, a server, etc.). The output unit 1500 may include a display unit 1510 and a speaker 1520.

디스플레이부(1510)는 음성 입력으로부터 변환된 텍스트를 디스플레이할 수 있다. 디스플레이부(1510)는 예를 들어, 액정 디스플레이(liquid crystal display), 박막 트랜지스터 액정 디스플레이(thin film transistor-liquid crystal display), 유기 발광 다이오드(organic light-emitting diode), 플렉시블 디스플레이(flexible display), 3차원 디스플레이(3D display), 전기영동 디스플레이(electrophoretic display) 중에서 적어도 하나로 구성될 수 있다. The display 1510 may display text converted from a voice input. The display unit 1510 is, for example, a liquid crystal display, a thin film transistor-liquid crystal display, an organic light-emitting diode, a flexible display, It may be composed of at least one of a 3D display and an electrophoretic display.

스피커(1520)는 텍스트에 대응되는 오디오 신호를 출력할 수 있다. The speaker 1520 may output an audio signal corresponding to text.

FIG. 4 is a block diagram illustrating components of the server 2000 according to an embodiment of the present disclosure.

서버(2000)는 디바이스(1000)로부터 서브 워드 단위의 특징 벡터를 수신하고, 수신된 특징 벡터를 문자열로 변환하며, 변환된 문자열을 디바이스(1000)에게 전송할 수 있다. The server 2000 may receive a feature vector in units of subwords from the device 1000, convert the received feature vector into a character string, and transmit the converted character string to the device 1000.

도 4를 참조하면, 서버(2000)는 통신 인터페이스(2100), 프로세서(2200), 및 메모리(2300)를 포함할 수 있다. Referring to FIG. 4, the server 2000 may include a communication interface 2100, a processor 2200, and a memory 2300.

The communication interface 2100 may transmit and receive data between the server 2000 and the device 1000. The communication interface 2100 may transmit and receive data to and from the device 1000 using at least one data communication method among, for example, wired LAN, wireless LAN, Wi-Fi, Wi-Fi Direct (WFD), wireless broadband internet (WiBro), Worldwide Interoperability for Microwave Access (WiMAX), Shared Wireless Access Protocol (SWAP), Wireless Gigabit Alliance (WiGig), and RF communication.

The communication interface 2100 may receive a subword-level feature vector from the device 1000. In an embodiment, the communication interface 2100 may also receive a lattice from the device 1000. The lattice is generated by the device 1000 by connecting the labels that have the highest probability values among the probability values of the feature vector, and may be transmitted from the device 1000 to the server 2000. In an embodiment, the communication interface 2100 may, under the control of the processor 2200, transmit the second character string obtained by the WFST decoder 2310 to the device 1000. The second character string may include a word corresponding to an entity name and an unrecognized word sequence that is not identified as an entity name.

The processor 2200 may execute one or more instructions of a program stored in the memory 2300. The processor 2200 may be composed of hardware components that perform arithmetic, logic, and input/output operations and signal processing. The processor 2200 may include, for example, at least one of a central processing unit (CPU), a microprocessor, a graphics processing unit (GPU), application specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), and field programmable gate arrays (FPGAs), but is not limited thereto.

The memory 2300 may store a program including instructions for each of the WFST decoder 2310, the WFST model 2320, and the entity name vocabulary list DB 2330. The memory 2300 may store instructions and program code readable by the processor 2200. In the following embodiments, the functions of the processor 2200 may be implemented by executing the instructions or code of the program stored in the memory 2300.

The memory 2300 may include at least one type of storage medium among, for example, a flash memory type, a hard disk type, a multimedia card micro type, a card type memory (for example, SD or XD memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, a magnetic disk, and an optical disk. However, the memory 2300 is not limited thereto.

The WFST decoder 2310 is configured to use the WFST model 2320 to obtain a second character string that includes a word corresponding to at least one of the plurality of entity names in the entity name vocabulary list and an unrecognized word sequence that is not identified as an entity name. In an embodiment, the WFST decoder 2310 may be configured as a module that, when a subword-level feature vector is input, outputs a word or word sequence having a high likelihood based on the probability values of the feature vector.

Using the instructions or program code of the WFST decoder 2310, the processor 2200 may decode the subword-level feature vector received from the device 1000 to obtain a character string that includes a word corresponding to an entity name and an unrecognized word. The processor 2200 may control the communication interface 2100 to transmit the obtained character string to the device 1000.

The WFST model 2320 is a model that, when a subword is input, outputs a word or word sequence based on the probability that the subword is predicted to be a word or word sequence representing a specific entity name. The WFST model 2320 is generated or updated by the processor 2200, and may be generated by training on the probability that a specific subword is predicted to be a word or word sequence representing at least one of the plurality of entity names in the entity name vocabulary list DB 2330. In an embodiment, the processor 2200 may generate the WFST model 2320 including a finite state transducer that encodes a mapping whose input is each of the subwords extracted from each of the plurality of entity names stored in the entity name vocabulary list DB 2330 and whose output is the entity name.
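The subword-to-entity-name mapping described above can be pictured with a small sketch. This is only an illustration under stated assumptions, not the disclosed implementation: the class name `SubwordTransducer`, its methods, and the use of a single weight per entity name are all hypothetical, and the weights here are supplied by hand rather than trained.

```python
from typing import Dict, List, Optional, Tuple


class SubwordTransducer:
    """Toy finite state transducer: subword sequence in, entity name out.

    Hypothetical illustration only; names and structure are assumptions.
    """

    def __init__(self) -> None:
        # transitions[state][subword] -> next state; state 0 is the start state
        self.transitions: List[Dict[str, int]] = [{}]
        # final[state] -> (entity name, weight), e.g. a negative log probability
        self.final: Dict[int, Tuple[str, float]] = {}

    def add_entity(self, subwords: List[str], name: str, weight: float = 0.0) -> None:
        """Add one path that consumes `subwords` and emits `name`."""
        state = 0
        for sw in subwords:
            nxt = self.transitions[state].get(sw)
            if nxt is None:
                nxt = len(self.transitions)
                self.transitions.append({})
                self.transitions[state][sw] = nxt
            state = nxt
        self.final[state] = (name, weight)

    def transduce(self, subwords: List[str]) -> Optional[Tuple[str, float]]:
        """Return (entity name, weight), or None if no complete path matches."""
        state = 0
        for sw in subwords:
            if sw not in self.transitions[state]:
                return None
            state = self.transitions[state][sw]
        return self.final.get(state)


fst = SubwordTransducer()
fst.add_entity(["bru", "no", "mars"], "Bruno Mars", weight=0.5)
result = fst.transduce(["bru", "no", "mars"])
```

A subword sequence that does not reach a final state yields `None`, which loosely corresponds to input that is not identified as an entity name.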

The entity name vocabulary list DB 2330 is a database that stores vocabulary corresponding to a plurality of entity names. In an embodiment, the processor 2200 may obtain vocabulary corresponding to new entity names from a new application, a web page, a video streaming service, a game, or the like, and update the entity name vocabulary list DB 2330 using the obtained vocabulary.

The processor 2200 may transmit update information of the entity name vocabulary list DB 2330 to the device 1000 through the communication interface 2100. Here, the 'update information of the entity name vocabulary list DB 2330' may include, for example, at least one of: at least one new entity name newly added to the entity name vocabulary list DB 2330, domain information into which each new entity name may be classified, application information related to the new entity name, deletion information for a previously stored entity name, and change information for a previously stored entity name. In an embodiment, the processor 2200 may transmit the update information of the entity name vocabulary list DB 2330 to the device 1000 at a preset interval, for example every 6 hours, every day, every week, or every month. However, the disclosure is not limited thereto. In another embodiment, the processor 2200 may transmit the update information of the entity name vocabulary list DB 2330 to the device 1000 each time entity name vocabulary is added to, deleted from, or changed in the entity name vocabulary list DB 2330.

The device 1000 may receive the update information of the entity name vocabulary list DB 2330 from the server 2000 and use the received update information to update the entity name vocabulary list DB 1310 stored in the device 1000. A specific embodiment in which the device 1000 updates the entity name vocabulary list DB 1310 using the update information received from the server 2000 is described in detail with reference to FIG. 9.
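As a rough sketch of how update information containing additions, deletions, and changes might be applied to a locally stored vocabulary list, consider the following. The `VocabularyUpdate` structure, its field names, and the representation of the vocabulary as a mapping from entity name to domain are assumptions made for illustration; the disclosure does not specify a data format.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class VocabularyUpdate:
    """Hypothetical container for update information sent by the server."""
    added: Dict[str, str] = field(default_factory=dict)    # new entity name -> domain
    deleted: List[str] = field(default_factory=list)       # entity names to remove
    changed: Dict[str, str] = field(default_factory=dict)  # old entity name -> new entity name


def apply_update(vocabulary: Dict[str, str], update: VocabularyUpdate) -> Dict[str, str]:
    """Return a new vocabulary (entity name -> domain) with the update applied."""
    result = dict(vocabulary)
    for name in update.deleted:
        result.pop(name, None)  # ignore names that are not stored locally
    for old, new in update.changed.items():
        if old in result:
            result[new] = result.pop(old)  # keep the domain, rename the entry
    result.update(update.added)
    return result


local_vocabulary = {"ZICO": "music", "Old Title": "movie", "Heize": "music"}
update = VocabularyUpdate(
    added={"Bruno Mars": "music"},
    deleted=["Heize"],
    changed={"Old Title": "New Title"},
)
updated_vocabulary = apply_update(local_vocabulary, update)
```

Returning a new dictionary instead of mutating the stored one keeps the previous vocabulary available until the update has been applied in full.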

FIG. 5 is a flowchart illustrating an embodiment in which the device 1000 of the present disclosure recognizes a voice input.

In step S510, the device 1000 generates a weighted finite state transducer (WFST) model using a vocabulary list including a plurality of entity names (named entities). The device 1000 may generate the weighted finite state transducer model in the background, while in standby mode or while running an application, using the words or word sequences corresponding to the plurality of entity names stored in the entity name vocabulary list DB 1310 (see FIG. 3). However, the disclosure is not limited thereto. In an embodiment, the device 1000 may detect a preset voice input, for example a wake-up input such as 'Hi Bixby' or 'Okay Google', and generate the weighted finite state transducer model for entity names in response to the wake-up input. In another embodiment, the device 1000 may generate the weighted finite state transducer model for entity names when it receives a user input of pressing a button for launching a voice assistant service (for example, Bixby or Google Assistant) or touching a graphical user interface (GUI) displayed on the display unit 1510 (see FIG. 3).

In an embodiment, the device 1000 may obtain entity names by receiving a user input that enters an entity name, by receiving entity names from an external server, or by crawling applications, web pages, and the like executed by the device 1000. The device 1000 may store the obtained entity names in the entity name vocabulary list DB 1310.

The device 1000 may segment the words or character strings that make up the plurality of entity names in the entity name vocabulary list DB 1310 into subwords, which are phoneme or syllable units. In an embodiment, the device 1000 may tokenize the words or word sequences in the plurality of entity names into subword units using a Byte Pair Encoding (BPE) algorithm.

In an embodiment, the device 1000 may generate a statistical model composed of a probability graph of posterior probabilities that a subword is predicted to be at least one specific entity name among the plurality of entity names. The device 1000 may generate the weighted finite state transducer model (WFST model) for entity names by learning weights through state transitions that use the frequency of the subwords and the order in which the subwords are arranged.

The WFST model may include a lexicon finite state transducer (lexicon FST) containing mapping information, namely the probability that a subword is predicted to be a specific word, and a grammar finite state transducer (grammar FST) containing weight information for predicting the word sequence that follows when a specific word or word sequence is input. The WFST model may be formed as the composition of the lexicon FST and the grammar FST.

In step S520, the device 1000 receives a voice input from the user. In an embodiment, the device 1000 may receive a voice input (for example, the user's utterance) through a microphone and obtain a voice signal from the received voice input. In an embodiment, the processor 1200 of the device 1000 (see FIG. 3) may convert the sound received through the microphone into an acoustic signal and remove noise (for example, non-speech components) from the acoustic signal to obtain the voice signal.

The user's voice input received in step S520 may be a voice command related to an operation or function that the user wants to execute through the device 1000, but is not limited thereto. In an embodiment, the user's voice input may include a wake-up input and a voice command related to an operation or function.

When the user's voice input includes a wake-up input, step S520 may be performed before step S510. For example, when the user utters 'Hi Bixby, tell me today's weather', the device 1000 may receive the wake-up input 'Hi Bixby' and the voice command 'tell me today's weather' at the same time. In this case, the device 1000 may generate the weighted finite state transducer model for entity names in response to the received voice input.

In step S530, the device 1000 uses a first decoding model to obtain a feature vector for the voice input, and obtains a first character string using the probability values of the feature vector. In an embodiment, the device 1000 may use the first decoding model to extract a feature vector, which is a vector of probability values that the pronunciation is predicted to be a specific label according to the length of each phoneme of the voice signal. The feature vector may be a softmax sequence containing posterior probabilities that a voice signal of a predetermined length is predicted to be a specific label.

In an embodiment, the device 1000 may use the first decoding model to select one label in each frame from the feature vectors obtained in time order, concatenate the selected labels, and represent each phoneme by its corresponding label, thereby obtaining label candidates. The device 1000 may obtain the first character string using the posterior probabilities of the label candidates.
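The per-frame label selection described above can be sketched as a greedy pass over the posterior matrix. This is a minimal illustration, not the disclosed decoder: the function name `greedy_labels`, the blank symbol `_`, and the rule of collapsing repeated labels (as in CTC-style decoding) are assumptions that the text does not state.

```python
import math
from typing import List, Sequence, Tuple


def greedy_labels(
    posteriors: Sequence[Sequence[float]],
    labels: Sequence[str],
    blank: str = "_",
) -> Tuple[str, float]:
    """Pick the most probable label in each frame and join the picks.

    Returns the collapsed label string and the log posterior of the
    selected per-frame path. Collapsing repeats and dropping the blank
    symbol is an assumed, CTC-style convention.
    """
    picked: List[str] = []
    log_prob = 0.0
    for frame in posteriors:
        best = max(range(len(frame)), key=frame.__getitem__)
        picked.append(labels[best])
        log_prob += math.log(frame[best])

    collapsed: List[str] = []
    previous = None
    for label in picked:
        if label != previous and label != blank:
            collapsed.append(label)
        previous = label
    return "".join(collapsed), log_prob


# Four frames over the label set (blank, "h", "i"); each row is a softmax output.
frames = [
    [0.1, 0.8, 0.1],
    [0.2, 0.7, 0.1],
    [0.9, 0.05, 0.05],
    [0.1, 0.1, 0.8],
]
text, score = greedy_labels(frames, ["_", "h", "i"])
```

A beam search that keeps several label candidates and ranks them by posterior probability would follow the same structure while tracking more than one path.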

In an embodiment, the first decoding model may be a model configured to perform speech recognition on the voice signal in an end-to-end automatic speech recognition (end-to-end ASR) manner.

In step S540, the device 1000 inputs the feature vector to a second decoding model that uses the WFST model generated in step S510, and obtains from the feature vector a second character string that includes a word sequence corresponding to a specific entity name and an unrecognized word sequence that is not identified as an entity name. In an embodiment, the device 1000 may use the second decoding model to calculate a confidence score based on the per-subword likelihood of a word, dictionary information, and a language model, select a character string having a high confidence score, and output a second character string including the selected character string. Using the second decoding model, the device 1000 may combine P(s|X), the probability that the label-level feature vector input X is predicted to be the subword s, with P(W|s), the probability that the subword s is predicted to be the word or word sequence W, to compute the word posterior probability P(W|X), and may obtain the word or word sequence corresponding to an entity name by searching for the hypothesis that maximizes P(W|X). The device 1000 cannot identify words or word sequences other than the entity names learned by the WFST model, and may output them in the form of an unrecognized word sequence.
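One way to read the combination of P(s|X) and P(W|s) is as a sum over subword hypotheses, P(W|X) ≈ Σ_s P(W|s)·P(s|X), followed by an argmax over W. The sketch below takes that reading as an assumption; the text only says the two probabilities are combined. The function name `best_entity`, the `<unk>` marker, and the confidence threshold are likewise hypothetical.

```python
from typing import Dict, Tuple

UNRECOGNIZED = "<unk>"  # hypothetical marker for an unrecognized word sequence


def best_entity(
    p_s_given_x: Dict[str, float],
    p_w_given_s: Dict[str, Dict[str, float]],
    threshold: float = 0.5,
) -> Tuple[str, float]:
    """Return the word W maximizing sum_s P(W|s) * P(s|X).

    If no word reaches `threshold`, the span is reported as unrecognized.
    """
    scores: Dict[str, float] = {}
    for subword_seq, p_s in p_s_given_x.items():
        for word, p_w in p_w_given_s.get(subword_seq, {}).items():
            scores[word] = scores.get(word, 0.0) + p_w * p_s
    if not scores:
        return UNRECOGNIZED, 0.0
    word = max(scores, key=lambda w: scores[w])
    if scores[word] < threshold:
        return UNRECOGNIZED, scores[word]
    return word, scores[word]


# Subword hypotheses for one span of audio, with their posteriors P(s|X).
p_s = {"bru no mars": 0.8, "bru no": 0.2}
# Lexicon-style probabilities P(W|s) for each subword hypothesis.
p_w = {
    "bru no mars": {"Bruno Mars": 1.0},
    "bru no": {"Bruno Mars": 0.5, "Bruno": 0.5},
}
word, posterior = best_entity(p_s, p_w)
```

Spans whose best score falls below the threshold are returned as `<unk>`, which mirrors the unrecognized word sequence that the second decoding model emits for anything outside the learned entity names.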

In step S550, the device 1000 outputs text corresponding to the voice input by replacing the unrecognized word sequence in the second character string with a word sequence included in the first character string. In an embodiment, the device 1000 may replace the unrecognized word sequence in the second character string with the word sequence in the first character string that corresponds to the position of the unrecognized word sequence.
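The replacement in step S550 can be sketched as follows, assuming that both character strings carry frame positions so that an unrecognized span in the second string can be matched to the words of the first string that occupy the same position. The tuple layouts, the use of `None` for an unrecognized span, and the midpoint rule are assumptions for illustration only.

```python
from typing import List, Optional, Tuple

Word = Tuple[str, int, int]               # (text, start frame, end frame) from the first string
Segment = Tuple[Optional[str], int, int]  # text is None for an unrecognized span


def merge_strings(first: List[Word], second: List[Segment]) -> str:
    """Fill unrecognized spans of the second string with words of the first."""
    output: List[str] = []
    for text, start, end in second:
        if text is not None:
            output.append(text)  # entity name recognized by the WFST-based decoder
            continue
        for word, w_start, w_end in first:
            midpoint = (w_start + w_end) / 2
            if start <= midpoint < end:
                output.append(word)  # word from the first string at this position
    return " ".join(output)


# First string: general-purpose result that misrecognizes the artist name.
first_string = [("play", 0, 10), ("bruno", 10, 20), ("mas", 20, 30), ("songs", 30, 40)]
# Second string: entity name recognized, everything else unrecognized.
second_string = [(None, 0, 10), ("Bruno Mars", 10, 30), (None, 30, 40)]
merged = merge_strings(first_string, second_string)
```

The result keeps the entity name from the second string and takes the remaining words from the first string.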

FIG. 6 is a flowchart illustrating an embodiment in which the device 1000 of the present disclosure generates a weighted finite state transducer model for entity names. Steps S610 to S640 shown in FIG. 6 elaborate step S510 of FIG. 5. Step S520 of FIG. 5 may be performed after step S640 shown in FIG. 6.

In step S610, the device 1000 obtains a vocabulary list including a plurality of entity names. In an embodiment, the device 1000 may obtain the vocabulary list by receiving a user input that enters an entity name, by receiving entity names from an external server, or by crawling applications, web pages, and the like executed by the device 1000. For example, the device 1000 may obtain the words or word sequences corresponding to the plurality of entity names in a running application from that application's application programming interface (API). In the case of a music streaming application, for example, the device 1000 may obtain words or word sequences corresponding to entity names such as the song title, artist name, and composer name of music that is in the user's playlist or currently playing. In an embodiment, the device 1000 may store the words or word sequences for the obtained entity names in the vocabulary list DB 1310 (see FIG. 3) of the memory 1300 (see FIG. 3).

After step S610 is performed, the device 1000 may also perform preprocessing that removes punctuation marks, special characters, special symbols, and the like from the plurality of entity names and removes stop words.

In step S620, the device 1000 performs filtering that removes, from the plurality of entity names in the vocabulary list, entity names that duplicate words already stored in the device 1000. In an embodiment, the device 1000 may remove, from the plurality of entity names in the entity name vocabulary list DB 1310 of the memory 1300, entity names identical to words or word sequences already stored in the dictionary DB 1320 (see FIG. 3). As a result of the filtering, the entity name vocabulary list DB 1310 may store only out-of-vocabulary (OOV) words, that is, words corresponding to entity names that are not included in the dictionary DB 1320.

Step S620 (filtering) is not essential. In an embodiment of the present disclosure, step S620 may be omitted, in which case step S630 may be performed after step S610.
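The preprocessing after step S610 and the optional filtering of step S620 can be sketched together. The function names, the case-insensitive comparison, and the use of ASCII punctuation only are assumptions; the disclosure does not specify how duplicates or special characters are detected.

```python
import string
from typing import Iterable, List, Set


def preprocess(name: str, stop_words: Set[str]) -> str:
    """Strip ASCII punctuation and drop stop words from one entity name."""
    cleaned = name.translate(str.maketrans("", "", string.punctuation))
    kept = [token for token in cleaned.split() if token.lower() not in stop_words]
    return " ".join(kept)


def filter_oov(entity_names: Iterable[str], dictionary: Set[str]) -> List[str]:
    """Keep only entity names that are not already in the dictionary (OOV)."""
    known = {entry.lower() for entry in dictionary}
    return [name for name in entity_names if name.lower() not in known]


stop_words = {"the", "a", "an"}
raw_names = ["The Weeknd!", "ZICO", "Love", "Heize"]
cleaned_names = [preprocess(name, stop_words) for name in raw_names]
oov_names = filter_oov(cleaned_names, dictionary={"love", "music"})
```

Because step S620 may be omitted, `filter_oov` can be skipped and the cleaned names passed directly to subword segmentation.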

In step S630, the device 1000 segments the plurality of entity names into subword units. A 'subword' is a basic unit that makes up a word, for example a phoneme or a syllable. The device 1000 may segment the plurality of entity names into subword units using, for example, a hidden Markov model (HMM). However, the disclosure is not limited thereto. As another example, the device 1000 may tokenize the words or word sequences corresponding to the plurality of entity names into subword units using a Byte Pair Encoding (BPE) algorithm.
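For the BPE option, a compact character-level sketch is shown below. It learns merge rules from the entity names themselves and then applies them to segment a word. The function names and the tie-breaking rule are assumptions, and production tokenizers typically add end-of-word markers and frequency-weighted vocabularies that are left out here.

```python
from collections import Counter
from typing import List, Tuple

Pair = Tuple[str, str]


def merge_pair(symbols: List[str], pair: Pair) -> List[str]:
    """Replace every adjacent occurrence of `pair` with the merged symbol."""
    merged: List[str] = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def learn_bpe(words: List[str], num_merges: int) -> List[Pair]:
    """Learn up to `num_merges` merge rules from the most frequent symbol pairs."""
    corpus = [list(word) for word in words]
    merges: List[Pair] = []
    for _ in range(num_merges):
        pairs: Counter = Counter()
        for symbols in corpus:
            pairs.update(zip(symbols, symbols[1:]))
        if not pairs:
            break
        # Most frequent pair; ties are broken by the pair itself for determinism.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        corpus = [merge_pair(symbols, best) for symbols in corpus]
    return merges


def segment(word: str, merges: List[Pair]) -> List[str]:
    """Segment `word` into subwords by applying the merges in learned order."""
    symbols = list(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols


entity_names = ["mars", "mark", "marson"]
merge_rules = learn_bpe(entity_names, num_merges=3)
subwords = segment("marks", merge_rules)
```

Concatenating the returned subwords always reproduces the input word, so the segmentation is lossless.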

In step S640, the device 1000 obtains, through training that uses the frequency and arrangement order of the subwords, a confidence score that a subword is predicted to be a specific entity name. In an embodiment, the device 1000 may generate a statistical model composed of a probability graph in which each subword is predicted to be at least one specific entity name among the plurality of entity names, using prior probabilities obtained from the frequency and arrangement order of the subwords output in step S630. Here, 'prior probability' means a probability value statistically calculated in advance based on the frequency with which, and the order in which, a given subword is used in a specific word or word sequence.

The device 1000 may generate the weighted finite state transducer model (WFST model) for entity names by learning weights through state transitions that use the prior probabilities of the subwords.
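One simple way to turn subword frequency and order into transition weights is a bigram estimate, where the weight of moving from one subword to the next is the negative log of its relative frequency. This is offered only as a sketch of the idea; the disclosure does not state that a bigram model or negative log weights are used, and the function name and start symbol `<s>` are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def transition_weights(sequences: List[List[str]]) -> Dict[Tuple[str, str], float]:
    """Estimate weights -log P(next | previous) from subword sequences.

    Each sequence is the subword segmentation of one entity name. A start
    symbol "<s>" is prepended so that the first subword also gets a weight.
    """
    pair_counts: Counter = Counter()
    context_counts: Counter = Counter()
    for sequence in sequences:
        padded = ["<s>"] + list(sequence)
        for previous, following in zip(padded, padded[1:]):
            pair_counts[(previous, following)] += 1
            context_counts[previous] += 1
    return {
        pair: -math.log(count / context_counts[pair[0]])
        for pair, count in pair_counts.items()
    }


# Subword segmentations of two entity names that share a prefix.
segmented_names = [["bru", "no"], ["bru", "ce"]]
weights = transition_weights(segmented_names)
```

Lower weights mark more likely transitions, so a shortest-path search over the resulting graph favors subword orders that occur often in the vocabulary list.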

The weighted finite state transducer model generated in step S640 may be a language model that takes a subword as input and outputs a word or word sequence representing an entity name. When a subword is input, the weighted finite state transducer model may output the word or word sequence corresponding to an entity name, based on the probability that the input subword is predicted to be that word or word sequence.

FIG. 7 is a diagram illustrating an embodiment in which the device 1000 of the present disclosure automatically selects a domain using weighted finite state transducer models 1340-1, 1340-2, and 1340-3.

Referring to FIG. 7, the device 1000 may identify words or word sequences corresponding to a plurality of entity names from a running application. In the embodiment shown in FIG. 7, the device 1000 is running a music streaming application, and the display unit 1510 may display the album art of the music currently playing and the user's playlist. The playlist may include the song title, artist name, composer name, album art, and the like for music that is currently playing or scheduled to play. The device 1000 may obtain the words or word sequences corresponding to the plurality of entity names in the application from the application programming interface (API) of the running application. In an embodiment, the device 1000 may identify, from the music streaming application, words or word sequences for entity names including at least one of a song title, an artist name, and a composer name.

The device 1000 may store the identified at least one word or word sequence in the entity name vocabulary list NE. In the embodiment shown in FIG. 7, the entity name vocabulary list NE may store entity names for artist names including ZICO, Heize, Mark Ronson, Bruno Mars, and Michael Jackson.

In an embodiment, the device 1000 may include a domain selection module 1370. The domain selection module 1370 is configured to automatically select the domain related to the plurality of entity names identified by the device 1000 by comparing them with the entity name vocabulary list included in each of a plurality of previously generated weighted finite state transducer models. In an embodiment, the domain selection module 1370 may be included in the memory 1300 (see FIG. 3) of the device 1000.

Using instructions or program code of the domain selection module 1370, the device 1000 may compare the entity names included in each of the previously generated weighted finite state transducer models 1340-1, 1340-2, and 1340-3 with the entity names in the entity name vocabulary list NE and, based on the comparison result, automatically select from among a plurality of domains the domain related to the currently running application. Here, 'domain' means a field or category related to the voice input. For example, domains may be classified according to the meaning of the voice input, the attributes of the voice input, and the service related to the voice input. The domains may include one or more of, for example, a movie domain, a music domain, a book domain, a game domain, an aviation domain, and a food domain.

The weighted finite state transducer models 1340-1, 1340-2, and 1340-3 are models trained using entity names classified into different domains. For example, the first weighted finite state transducer model 1340-1 is trained using a plurality of entity names related to music, the second weighted finite state transducer model 1340-2 is trained using a plurality of entity names related to movies, and the third weighted finite state transducer model 1340-3 is trained using a plurality of entity names related to games. Although FIG. 7 shows three weighted finite state transducer models 1340-1, 1340-2, and 1340-3, the number of models is not limited thereto.

Using the domain selection module 1370, the device 1000 may compare the named entities included in the named entity vocabulary list NE with the named entities included in the first weighted finite state transducer model 1340-1 and count the number of overlapping named entities. Likewise, the device 1000 may use the domain selection module 1370 to compare the named entities in the named entity vocabulary list NE with the named entities included in each of the second weighted finite state transducer model 1340-2 and the third weighted finite state transducer model 1340-3, and count the number of overlapping named entities separately for each model.

The device 1000 may select a domain based on the weighted finite state transducer model for which the counted number of overlapping named entities is the largest. In the embodiment shown in FIG. 7, the named entity vocabulary list NE includes a plurality of named entities related to the music domain (for example, artist names in a playlist of a music streaming application), so the model with the largest number of overlapping named entities may be the first weighted finite state transducer model 1340-1, which was trained on named entities belonging to the music domain. The device 1000 may determine 'music', the domain on which the first weighted finite state transducer model 1340-1 was trained, as the domain related to the currently running application.
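The overlap count described above can be illustrated with a minimal Python sketch. It assumes that each domain-specific model exposes its named entity vocabulary as a plain set of strings; the function name `select_domain`, the sample vocabularies, and the case-insensitive matching are illustrative assumptions and are not taken from the disclosure.

```python
from typing import Dict, Iterable, Set, Tuple


def select_domain(
    ne_list: Iterable[str],
    domain_vocabularies: Dict[str, Set[str]],
) -> Tuple[str, Dict[str, int]]:
    """Return the domain whose vocabulary overlaps most with ne_list, plus all counts."""
    # Assumption: matching ignores case and surrounding whitespace.
    observed = {name.strip().lower() for name in ne_list}
    counts = {
        domain: len(observed & {entry.strip().lower() for entry in vocabulary})
        for domain, vocabulary in domain_vocabularies.items()
    }
    # max() keeps the first domain seen when several domains tie.
    best = max(counts, key=counts.get)
    return best, counts


# Hypothetical vocabularies standing in for models 1340-1, 1340-2, and 1340-3.
domain_vocabularies = {
    "music": {"Bruno Mars", "Adele", "Coldplay"},
    "movie": {"Inception", "Parasite"},
    "game": {"Tetris", "Minecraft"},
}
ne_list = ["Bruno Mars", "Adele", "Parasite"]

domain, counts = select_domain(ne_list, domain_vocabularies)
print(domain, counts)
```

In this sketch a tie is resolved in favor of the domain listed first; the disclosure does not state how ties are handled, so that choice is only a placeholder.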

When a voice input of the user related to the running application is received, the device 1000 may interpret the voice input based on the determined domain and perform an operation or function according to the interpretation result. For example, when the voice input "Play a Bruno Mars song~" is received, the device 1000 may interpret the voice input based on the determined domain 'music' and, according to the interpretation result, play a Bruno Mars song from among the plurality of songs included in the playlist.

FIG. 8 is a flowchart illustrating an embodiment in which the device 1000 of the present disclosure automatically selects a domain using weighted finite state transducer models.

In step S810, the device 1000 identifies words corresponding to named entities from a running application or web page. In one embodiment, the device 1000 may obtain, through an API of the application being executed by the device 1000, words or word strings corresponding to a plurality of named entities included in the application. In another embodiment, the device 1000 may crawl the web page currently being accessed to obtain words or word strings corresponding to a plurality of named entities included in the web page.

The device 1000 may store the obtained words or word strings corresponding to the plurality of named entities in the named entity vocabulary list DB 1310.

In step S820, the device 1000 compares the identified words with the named entities included in the vocabulary list of each of a plurality of weighted finite state transducer models previously generated for respective domains. The plurality of weighted finite state transducer models are models trained using named entities classified into different domains. Each of the models may include a vocabulary list containing a plurality of named entities classified by domain. In one embodiment, the device 1000 may compare the words identified in step S810 with the named entities in the vocabulary list included in each model, identify the named entities that overlap, and count the number of identified overlapping named entities.

In step S830, the device 1000 determines, based on the comparison result, the domain into which the running application or web page can be classified. In one embodiment, the device 1000 may determine the domain based on the weighted finite state transducer model, among the plurality of models, for which the number of overlapping named entities counted in step S820 is the largest. For example, when that model is one that includes a named entity vocabulary list for the 'music' domain, the device 1000 may determine 'music' as the domain related to the running application or web page.

In conventional speech recognition technology, automatically selecting the domain related to a received voice input required learning pattern sentences associated with named entities and selecting the domain based on the learned pattern sentences. A pattern sentence associated with a named entity is one in which a specific voice command is highly likely to be combined with a specific named entity, for example "<named entity> + show me" or "<named entity> + play it for me". Learning such individual pattern sentences requires a large amount of computation and a long training time. In addition, when no pattern specific to a domain exists, the domain related to the voice input cannot be selected accurately.

In the embodiments shown in FIGS. 7 and 8, the device 1000 identifies a plurality of named entities from the application being executed on the device 1000 or the web page being accessed through the device 1000, and selects a domain by comparing the identified named entities with the named entities included in the previously generated weighted finite state transducer models 1340-1, 1340-2, and 1340-3. This allows faster processing than the conventional technology. In addition, because the embodiments of the present disclosure compare the identified named entities with the named entities included in the weighted finite state transducer models 1340-1, 1340-2, and 1340-3, pattern sentences need not be considered, which can improve the accuracy of domain selection.

FIG. 9 is a flowchart illustrating an embodiment in which the device 1000 of the present disclosure generates a weighted finite state transducer model for named entities using information received from the server 2000.

In step S910, the device 1000 receives update information for the named entity vocabulary list from the server 2000. The 'update information for the named entity vocabulary list' may include, for example, at least one of: at least one new named entity newly added to the named entity vocabulary list DB 2330 (see FIG. 4) stored in the server 2000, domain information indicating the domain into which each new named entity can be classified, application information related to the new named entity, deletion information for previously stored named entities, and change information for previously stored named entities. In one embodiment, the device 1000 may receive the update information of the named entity vocabulary list DB 2330 from the server 2000 at a preset interval. The preset interval may be, for example, 6 hours, 1 day, 1 week, or 1 month, but is not limited thereto.

In another embodiment, the device 1000 may receive the update information of the named entity vocabulary list DB 2330 from the server 2000 each time an update is performed in which named entity vocabulary entries are added to, deleted from, or changed in the named entity vocabulary list DB 2330 of the server 2000.

In step S920, the device 1000 updates its named entity vocabulary list using the received update information. Using the update information received from the server 2000, the device 1000 may add new named entities to the named entity vocabulary list DB 1310 previously stored in the memory 1300 (see FIG. 3), delete previously stored named entities, or change previously stored named entities.
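The add, delete, and change operations of step S920 can be sketched as follows. The `VocabularyUpdate` container and its field names are hypothetical; the disclosure lists the kinds of information the update may carry but does not specify a data format, and the order in which the three operations are applied here is likewise an assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class VocabularyUpdate:
    """Hypothetical container for the update information received in step S910."""
    added: List[str] = field(default_factory=list)
    deleted: List[str] = field(default_factory=list)
    changed: Dict[str, str] = field(default_factory=dict)  # old name -> new name


def apply_update(vocabulary: Set[str], update: VocabularyUpdate) -> Set[str]:
    """Return a new vocabulary with deletions, changes, and additions applied."""
    result = set(vocabulary)  # copy so the stored list is not modified in place
    result.difference_update(update.deleted)
    for old_name, new_name in update.changed.items():
        if old_name in result:  # ignore changes to entries the device does not hold
            result.remove(old_name)
            result.add(new_name)
    result.update(update.added)
    return result


stored = {"Bruno Mars", "Old Band Name", "Retired Artist"}
update = VocabularyUpdate(
    added=["New Artist"],
    deleted=["Retired Artist"],
    changed={"Old Band Name": "New Band Name"},
)
updated = apply_update(stored, update)
print(sorted(updated))
```

Returning a new set rather than mutating the stored one is a design choice of this sketch: it lets the caller regenerate the model from `updated` and swap it in only once the regeneration succeeds.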

In step S930, the device 1000 generates a weighted finite state transducer model through training that uses the updated vocabulary list. In one embodiment, the device 1000 may divide at least one word or word string included in the updated named entity vocabulary list DB 1310 into subwords and, using prior probability information of the subwords (for example, the frequency and order of the subwords), learn weights for the posterior probability that each subword is predicted as a specific word or word string, thereby generating the weighted finite state transducer model. The specific method by which the device 1000 generates a weighted finite state transducer model from words or word strings for a plurality of named entities is the same as the method described with reference to FIGS. 1 and 3, so a redundant description is omitted.
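As a rough illustration of how subword frequency and order can be turned into transducer weights, the following toy sketch builds a trie-shaped transducer whose arcs carry negative log relative frequencies. It is not the training procedure of the disclosure, which is described with reference to FIGS. 1 and 3. The fixed three-character segmentation, the `ToyWFST` class, and the `count` parameter are all assumptions made for this example.

```python
import math
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


def split_subwords(text: str, size: int = 3) -> List[str]:
    """Toy segmentation into fixed-size character chunks (stand-in for a real subword model)."""
    compact = text.lower().replace(" ", "_")
    return [compact[i:i + size] for i in range(0, len(compact), size)]


class ToyWFST:
    """Trie-shaped transducer: subword sequence in, named entity out, with arc weights."""

    def __init__(self) -> None:
        self.arcs: Dict[int, Dict[str, int]] = defaultdict(dict)    # state -> subword -> next state
        self.counts: Dict[Tuple[int, str], int] = defaultdict(int)  # how often each arc was used
        self.visits: Dict[int, int] = defaultdict(int)              # how often each state was left
        self.outputs: Dict[int, str] = {}                           # final state -> named entity
        self.next_state = 1                                         # state 0 is the start state

    def add(self, entity: str, count: int = 1) -> None:
        """Add a named entity; count models how frequently it occurs in the vocabulary."""
        state = 0
        for sub in split_subwords(entity):
            if sub not in self.arcs[state]:
                self.arcs[state][sub] = self.next_state
                self.next_state += 1
            self.counts[(state, sub)] += count
            self.visits[state] += count
            state = self.arcs[state][sub]
        self.outputs[state] = entity

    def transduce(self, subwords: List[str]) -> Optional[Tuple[str, float]]:
        """Return (entity, weight) for an accepted sequence, else None.

        The weight is the sum of -log(arc count / state visits) along the path.
        """
        state, weight = 0, 0.0
        for sub in subwords:
            if sub not in self.arcs.get(state, {}):
                return None
            weight += -math.log(self.counts[(state, sub)] / self.visits[state])
            state = self.arcs[state][sub]
        entity = self.outputs.get(state)
        return (entity, weight) if entity is not None else None


wfst = ToyWFST()
wfst.add("Times Square", count=2)  # assumed to be twice as frequent as the others
wfst.add("Times Tower")
wfst.add("Bryant Park")

print(wfst.transduce(split_subwords("Times Square")))
print(wfst.transduce(split_subwords("Central Park")))
```

A lower weight corresponds to a more probable path, so the more frequent entity receives the smaller total weight. A production system would more likely use a learned subword segmentation and a dedicated WFST toolkit; this sketch only shows the relationship between counts and arc weights.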

In one embodiment, the device 1000 may update a previously generated weighted finite state transducer model using the update information of the named entity vocabulary list DB 1310.

In the embodiment shown in FIG. 9, the device 1000 receives named entity update information from the server 2000 and uses it to generate or update the weighted finite state transducer model for named entities, thereby keeping the named entities up to date. Accordingly, the device 1000 according to an embodiment of the present disclosure can improve recognition accuracy for a user's voice input that includes an utterance of a recent named entity.

FIG. 10 is a conceptual diagram illustrating an embodiment in which, when the device 1000 of the present disclosure enters a new region, the device 1000 generates a weighted finite state transducer model using a point of interest vocabulary list for the new region.

Referring to FIG. 10, the user may run a navigation application on the device 1000 while using the vehicle 100. In one embodiment, the device 1000 may include a location sensor, for example a GPS sensor. The device 1000 may use the GPS sensor to obtain current location information of the device 1000 as it moves with the vehicle 100.

When the vehicle 100 enters a new region, the device 1000 transmits, to the server 3000 of the application service provider, entry information obtained using the GPS sensor and indicating that the device 1000 has entered the new region (step S1010). The 'new region entry information' may include, for example, at least one of location information of the new region, the time of entry, and location information about places expected to be entered during a preset time interval. In the embodiment shown in FIG. 10, the 'application service provider' may be a navigation application service provider.

A 'new region' may mean a region that is not stored in a location-based application installed on and executed by the device 1000 but is stored in the server 3000 of the application service provider. Here, a 'location-based application' means an application that provides information or executes a specific operation or function based on the location information of the device 1000, such as a navigation application or a map application. Because of the limited installation capacity of the device 1000, a navigation application executed by the device 1000 cannot include information about all places, such as place names, landmarks, and tourist attractions, for every region. The navigation application installed on the device 1000 may store place information for only a minimal set of regions, while place information for a new region may be stored only in the server 3000.

When the server 3000 of the application service provider receives the new region entry information from the device 1000, it provides the device 1000 with a point of interest vocabulary list POI for the new region (step S1020). The point of interest vocabulary list POI may include, for example, named entities for at least one of place names, landmarks, tourist attractions, and famous restaurants in the new region. In the embodiment shown in FIG. 10, the point of interest vocabulary list POI may include, as examples of named entities for place names, landmarks, or tourist attractions of the new region, Times Square, Gershwin Theatre, The Town Hall, Madame Tussauds, and Bryant Park.

The device 1000 may generate a weighted finite state transducer model through training that uses the named entities included in the point of interest vocabulary list POI received from the server 3000 of the application service provider. In one embodiment, the device 1000 may divide the words or word strings corresponding to the named entities included in the point of interest vocabulary list POI into subword units and, using prior probability information of the subwords (for example, the frequency and order of the subwords), learn weights for the posterior probability that each subword is predicted as a specific word or word string, thereby generating the weighted finite state transducer model. The specific method by which the device 1000 generates a weighted finite state transducer model using a plurality of named entities is the same as the method described with reference to FIGS. 1 and 3, so a redundant description is omitted.

FIG. 11 is a conceptual diagram illustrating an embodiment in which, when the device 1000 of the present disclosure enters a new region, the device 1000 generates a weighted finite state transducer model using a point of interest vocabulary list for the new region.

In step S1110, the device 1000 recognizes that it is entering a new region. In one embodiment, the device 1000 may include a location sensor such as a GPS sensor. The device 1000 may obtain its location information using the location sensor and thereby recognize that it has entered a new region. A 'new region' may mean a region that is not stored in a location-based application installed on and executed by the device 1000 but is stored in the server of the application service provider.

In step S1120, the device 1000 transmits the new region entry information to the server of the application service provider. In one embodiment, the device 1000 may transmit the new region entry information to the server of the application service provider using the communication interface 1400 (see FIG. 3). The 'new region entry information' may include, for example, at least one of location information of the new region, the time of entry, and location information about places expected to be entered during a preset time interval.

In step S1130, the device 1000 receives, from the server of the application service provider, a point of interest vocabulary list including named entities for the new region. The point of interest vocabulary list POI may include, for example, named entities for at least one of place names, landmarks, tourist attractions, and famous restaurants in the new region.

In step S1140, the device 1000 generates a weighted finite state transducer model through training that uses the named entities included in the received point of interest vocabulary list. In one embodiment, the device 1000 may divide the words or word strings corresponding to the named entities included in the point of interest vocabulary list into subword units and, using prior probability information of the subwords (for example, the frequency and order of the subwords), learn weights for the posterior probability that each subword is predicted as a specific word or word string, thereby generating the weighted finite state transducer model.
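Steps S1110 to S1140 can be summarized in a short sketch. The region identifier, the in-memory `SERVER_POI` table, and the `fetch_poi_list` stub are hypothetical stand-ins for the location lookup and the service provider's server; the mapping from GPS coordinates to a region identifier is assumed to happen elsewhere.

```python
from typing import Callable, Dict, List, Set

# Hypothetical server-side data standing in for the service provider's server.
SERVER_POI: Dict[str, List[str]] = {
    "midtown-manhattan": [
        "Times Square", "Gershwin Theatre", "The Town Hall",
        "Madame Tussauds", "Bryant Park",
    ],
}


def fetch_poi_list(region_id: str) -> List[str]:
    """Stub for S1120 and S1130: send entry information, receive the POI list."""
    return SERVER_POI.get(region_id, [])


def handle_region(
    region_id: str,
    stored_regions: Set[str],
    poi_vocabulary: Set[str],
    fetch: Callable[[str], List[str]] = fetch_poi_list,
) -> bool:
    """Extend the POI vocabulary when the device enters a region not stored locally.

    Returns True when a new region was entered, in which case the caller would
    regenerate the weighted finite state transducer model (S1140).
    """
    if region_id in stored_regions:      # S1110: the region is already known
        return False
    poi_vocabulary.update(fetch(region_id))
    stored_regions.add(region_id)
    return True


stored_regions = {"home-region"}
poi_vocabulary: Set[str] = set()
entered = handle_region("midtown-manhattan", stored_regions, poi_vocabulary)
print(entered, sorted(poi_vocabulary))
```

Passing the fetch function as a parameter keeps the network call replaceable, which makes the flow easy to exercise without a real server.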

FIG. 12 is a diagram illustrating an embodiment in which the device 1000 of the present disclosure generates a personalized weighted finite state transducer model using named entities that reflect the user's personal characteristics.

Referring to FIG. 12, the device 1000 may run a messenger application, and a chat window of the messenger application may be displayed on the display unit 1510. In one embodiment, the device 1000 may analyze log data of the messenger application to obtain a personalized named entity vocabulary list NE that includes a plurality of named entities reflecting the user's personal characteristics. 'Personal characteristics' may include not only personal information such as age, gender, school, and workplace, but also personal relationships such as the user's friends or coworkers and fields of interest such as games, sports, music, and movies. In the embodiment shown in FIG. 12, the device 1000 may identify soccer-related named entities such as Manchester United, Tottenham, Harry Kane, Pogba, Mourinho, and White Hart Lane from the log data of the messenger application, and obtain a personalized named entity vocabulary list NE that includes the identified soccer-related named entities.
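One simple way to derive such a list from log data is to count how often known candidate names appear in the messages. The sketch below assumes that a candidate list of named entities is already available and that a name mentioned at least `min_count` times is worth keeping; the disclosure does not specify how the named entities are identified in the log data, so both assumptions are illustrative.

```python
from collections import Counter
from typing import Iterable, List


def personalized_entities(
    messages: Iterable[str],
    candidates: Iterable[str],
    min_count: int = 2,
) -> List[str]:
    """Return candidate names mentioned in at least min_count messages, most frequent first."""
    candidate_list = list(candidates)
    counts: Counter = Counter()
    for message in messages:
        text = message.lower()
        for name in candidate_list:
            if name.lower() in text:  # simple substring match, ignoring case
                counts[name] += 1
    return [name for name, n in counts.most_common() if n >= min_count]


# Hypothetical chat log and candidate list.
messages = [
    "Did you watch the Tottenham match?",
    "Harry Kane scored again",
    "Tottenham plays Manchester United next week",
    "Harry Kane is the best",
]
candidates = ["Tottenham", "Harry Kane", "Manchester United", "Pogba"]

personal_ne = personalized_entities(messages, candidates)
print(personal_ne)
```

The resulting list could then serve as the vocabulary from which a personalized model is generated. Substring matching is deliberately naive and would need replacing with a proper named entity recognizer for real log data.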

Using instructions or program code associated with the WFST model generation module 1330, the device 1000 may generate a personalized weighted finite state transducer model 1340p. In one embodiment, the device 1000 may divide the plurality of named entities included in the personalized named entity vocabulary list NE into subword units and, using prior probability information of the subwords (for example, the frequency and order of the subwords), learn weights for the posterior probability that each subword is predicted as a specific word or word string, thereby generating the personalized weighted finite state transducer model 1340p.

Although FIG. 12 shows the device 1000 obtaining the personalized named entity vocabulary list NE from log data of a messenger application, the present disclosure is not limited thereto. In one embodiment, the device 1000 may obtain a personalized named entity vocabulary list NE, including a plurality of named entities reflecting the user's personal characteristics, from at least one of frequently executed applications, a messenger application, and search history in a content streaming application.

A program executed by the device 1000 described in the present disclosure may be implemented as a hardware component, a software component, and/or a combination of hardware and software components. The program may be executed by any system capable of executing computer-readable instructions.

Software may include a computer program, code, instructions, or a combination of one or more of these, and may configure a processing device to operate as desired or may instruct the processing device independently or collectively.

The software may be implemented as a computer program including instructions stored in a computer-readable storage medium. Examples of computer-readable recording media include magnetic storage media (for example, read-only memory (ROM), random-access memory (RAM), floppy disks, and hard disks) and optically readable media (for example, CD-ROM and DVD (Digital Versatile Disc)). The computer-readable recording medium may be distributed over network-connected computer systems so that computer-readable code is stored and executed in a distributed manner. The medium is readable by a computer, may be stored in memory, and may be executed by a processor.

The computer-readable storage medium may be provided in the form of a non-transitory storage medium. Here, 'non-transitory' means only that the storage medium does not include a signal and is tangible; it does not distinguish between data being stored in the storage medium semi-permanently and data being stored temporarily.

In addition, a program according to the embodiments disclosed in this specification may be provided as part of a computer program product. The computer program product may be traded between a seller and a buyer as a commodity.

The computer program product may include a software program and a computer-readable storage medium in which the software program is stored. For example, the computer program product may include a product in the form of a software program (for example, a downloadable application) that is distributed electronically by the manufacturer of the device or through an electronic market (for example, Google Play Store or App Store). For electronic distribution, at least part of the software program may be stored in a storage medium or generated temporarily. In this case, the storage medium may be a storage medium of the manufacturer's server, of the electronic market's server, or of a relay server that temporarily stores the software program.

In a system composed of a server and a device, the computer program product may include a storage medium of the server or a storage medium of the device. Alternatively, when there is a third device (for example, a smartphone) communicatively connected to the server or the device, the computer program product may include a storage medium of the third device. Alternatively, the computer program product may include the software program itself, transmitted from the server to the device or the third device, or transmitted from the third device to the device.

In this case, one of the server, the device, and the third device may execute the computer program product to perform the method according to the disclosed embodiments. Alternatively, two or more of the server, the device, and the third device may execute the computer program product to perform the method according to the disclosed embodiments in a distributed manner.

For example, the server may execute a computer program product stored in the server and control a device communicatively connected to the server to perform the method according to the disclosed embodiments.

As another example, the third device may execute the computer program product and control a device communicatively connected to the third device to perform the method according to the disclosed embodiments.

When the third device executes the computer program product, the third device may download the computer program product from the server and execute the downloaded computer program product. Alternatively, the third device may execute a computer program product provided in a pre-loaded state to perform the method according to the disclosed embodiments.

Although the embodiments have been described above with reference to limited embodiments and drawings, those of ordinary skill in the art can make various modifications and variations from the above description. For example, appropriate results can be achieved even if the described techniques are performed in an order different from the described method, and/or components such as the described computer system or modules are coupled or combined in a form different from the described method, or are replaced or substituted by other components or equivalents.

Claims

A method by which a device recognizes a voice input, the method comprising:
Generating a weighted finite state transducer (WFST) model by training, using a vocabulary list including a plurality of named entities, on the probability that a subword extracted from each of the plurality of named entities is predicted as a word or word sequence representing the named entity;
Receiving a voice input from a user;
Obtaining, using a first decoding model, a feature vector including probability values that the received voice input is predicted as specific subwords, and obtaining a first character string including a plurality of predicted character strings using the probability values of the feature vector;
Obtaining, by inputting the obtained feature vector into a second decoding model that uses the weighted finite state transducer model, a second character string including a word sequence corresponding to at least one named entity among the plurality of named entities and an unrecognized word sequence that is not identified as the at least one named entity; And
Outputting text corresponding to the voice input by replacing the unrecognized word sequence in the second character string with a word sequence included in the first character string.
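By way of illustration only, and without limiting the claim, the final replacing step can be sketched as follows. The claim does not specify how the two character strings are aligned; this sketch assumes that both decoders emit time-stamped segments and that the second decoder marks unrecognized spans with a hypothetical `<unk>` token. The names `Segment`, `UNK`, and `merge_strings` are illustrative and do not appear in the disclosure.

```python
from dataclasses import dataclass
from typing import List

UNK = "<unk>"  # hypothetical marker for an unrecognized word sequence


@dataclass
class Segment:
    text: str     # recognized words, or UNK
    start: float  # start time in seconds (assumed to be available)
    end: float    # end time in seconds (assumed to be available)


def merge_strings(first: List[Segment], second: List[Segment]) -> str:
    """Replace every UNK segment of `second` with the words of `first`
    whose midpoints fall inside that segment's time span."""
    out: List[str] = []
    for seg in second:
        if seg.text != UNK:
            out.append(seg.text)  # keep the named entity from the WFST decoder
            continue
        for word in first:
            mid = (word.start + word.end) / 2
            if seg.start <= mid < seg.end:
                out.append(word.text)  # fill the gap from the first decoder
    return " ".join(out)


# First character string: general-purpose decoder output.
first = [Segment("play", 0.0, 0.4), Segment("black", 0.4, 0.7),
         Segment("pink", 0.7, 1.0), Segment("songs", 1.0, 1.5)]
# Second character string: named entity found, the rest unrecognized.
second = [Segment(UNK, 0.0, 0.4), Segment("BLACKPINK", 0.4, 1.0),
          Segment(UNK, 1.0, 1.5)]

merged = merge_strings(first, second)
print(merged)  # play BLACKPINK songs
```

Time-based alignment is only one possible choice; an implementation could equally align the two strings by token position or by edit distance.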

The method of claim 1,
Wherein generating the weighted finite state transducer model comprises:
Obtaining the vocabulary list including the plurality of named entities;
Segmenting the words or character strings constituting the plurality of named entities into subwords, which are phoneme or syllable units; And
Obtaining a confidence score including a posterior probability that the subwords are predicted as a specific named entity among the plurality of named entities, by learning weights through state transitions using the frequency of the subwords and the arrangement order of the subwords.
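As a non-limiting sketch of the segmentation and weight-learning steps, the following example splits each named entity into syllable-like units and derives transition weights from how often each unit follows a given prefix. Several points are assumptions made for illustration: one character is treated as one subword (which matches syllable blocks in Hangul), relative frequency stands in for the trained weights, and the resulting score is a simple path probability rather than the posterior probability recited in the claim. The function names are hypothetical.

```python
import math
from collections import defaultdict

END = "</s>"  # hypothetical end-of-entity symbol


def segment(name):
    # Stand-in segmentation: one unit per character, whitespace dropped.
    return [ch for ch in name if not ch.isspace()]


def train_transitions(vocab):
    """Count which subword follows each prefix (frequency and order),
    then turn the counts into negative-log-probability weights."""
    counts = defaultdict(lambda: defaultdict(int))
    for name in vocab:
        prefix = ()
        for unit in segment(name) + [END]:
            counts[prefix][unit] += 1
            prefix += (unit,)
    weights = {}
    for prefix, followers in counts.items():
        total = sum(followers.values())
        for unit, c in followers.items():
            weights[(prefix, unit)] = -math.log(c / total)
    return weights


def score(weights, name):
    """Probability of following the state transitions that spell `name`;
    0.0 if the name is not in the trained vocabulary."""
    prefix, cost = (), 0.0
    for unit in segment(name) + [END]:
        key = (prefix, unit)
        if key not in weights:
            return 0.0
        cost += weights[key]
        prefix += (unit,)
    return math.exp(-cost)


vocab = ["서울역", "서울숲", "남산"]
weights = train_transitions(vocab)
in_vocab = score(weights, "서울역")      # approximately 1/3
out_of_vocab = score(weights, "서울대")  # 0.0
```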

The method of claim 2,
Wherein generating the weighted finite state transducer model further comprises:
Performing filtering to remove, from the plurality of named entities included in the obtained vocabulary list, any named entity that overlaps with a word previously stored in a memory of the device.

The method of claim 2,
Wherein the weighted finite state transducer model includes a lexicon finite state transducer (L FST) including mapping information that is a probability value that the subword is predicted as a specific word, and a grammar finite state transducer (G FST) including weight information for predicting, when the specific word or a word sequence is input, a word sequence that may be arranged after the input word or word sequence,
Wherein the weighted finite state transducer model is generated through composition of the lexicon finite state transducer and the grammar finite state transducer.
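For illustration only, the composition of a lexicon transducer with a grammar transducer can be sketched in plain Python as below. This is not the implementation of the disclosure and does not use any FST library; the dictionary-based transducer layout, the function names, and the toy weights are all assumptions. Weights are treated as costs that add along a path (tropical semiring), and the sketch assumes that only L has epsilon outputs and that G has no epsilon inputs, so no composition filter is needed.

```python
from collections import deque

EPS = "<eps>"  # epsilon label: no output symbol on this arc


def compose(L, G):
    """Compose L (subwords -> words) with G (words -> words).
    Each FST is {"start": s, "finals": {state: cost},
                 "arcs": {state: [(in, out, cost, next_state), ...]}}."""
    start = (L["start"], G["start"])
    arcs, finals = {}, {}
    queue, seen = deque([start]), {start}
    while queue:
        state = queue.popleft()
        l, g = state
        arcs[state] = []
        if l in L["finals"] and g in G["finals"]:
            finals[state] = L["finals"][l] + G["finals"][g]
        for i, o, w, nl in L["arcs"].get(l, []):
            if o == EPS:
                # L emits nothing, so G stays where it is.
                matches = [(EPS, 0.0, g)]
            else:
                # L emits a word, so G must accept that word.
                matches = [(go, gw, ng)
                           for gi, go, gw, ng in G["arcs"].get(g, [])
                           if gi == o]
            for go, gw, ng in matches:
                nxt = (nl, ng)
                arcs[state].append((i, go, w + gw, nxt))
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
    return {"start": start, "finals": finals, "arcs": arcs}


def transduce(fst, inputs):
    """Follow the first matching arc per input symbol (assumes a
    deterministic FST); return (outputs, total cost) or None."""
    state, out, cost = fst["start"], [], 0.0
    for sym in inputs:
        for i, o, w, nxt in fst["arcs"].get(state, []):
            if i == sym:
                if o != EPS:
                    out.append(o)
                cost += w
                state = nxt
                break
        else:
            return None  # no arc accepts this symbol
    if state not in fst["finals"]:
        return None
    return out, cost + fst["finals"][state]


# Toy lexicon: "seo"+"ul" -> "seoul", "to"+"wer" -> "tower".
L = {"start": 0, "finals": {0: 0.0}, "arcs": {
    0: [("seo", "seoul", 0.0, 1), ("to", "tower", 0.0, 2)],
    1: [("ul", EPS, 0.0, 0)],
    2: [("wer", EPS, 0.0, 0)]}}
# Toy grammar: only the word sequence "seoul tower" is accepted.
G = {"start": 0, "finals": {2: 0.0}, "arcs": {
    0: [("seoul", "seoul", 0.5, 1)],
    1: [("tower", "tower", 0.25, 2)]}}

LG = compose(L, G)
result = transduce(LG, ["seo", "ul", "to", "wer"])
rejected = transduce(LG, ["to", "wer"])
```

With these toy inputs, `result` carries the word sequence `["seoul", "tower"]` with a total cost of 0.75, while `rejected` is `None` because the grammar does not allow "tower" at the start.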

The method of claim 1,
Wherein the first decoding model is an end-to-end automatic speech recognition (ASR) model.

The method of claim 1,
Wherein generating the weighted finite state transducer model comprises:
Classifying the plurality of named entities according to a plurality of different domains; And
Generating a plurality of weighted finite state transducer models, one for each of the plurality of domains, using the classified plurality of named entities.

The method of claim 6, further comprising:
Identifying words corresponding to named entities from an application executed on the device or a web page provided through the device; And
Determining a domain into which the currently executing application or the web page is classified, by comparing the identified words with the plurality of named entities included in the vocabulary list of each of the plurality of weighted finite state transducer models generated for the plurality of domains.
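As a minimal, non-limiting sketch of the comparing step, the domain can be chosen by counting how many identified words appear in each domain's vocabulary list. The claim does not state a scoring rule; counting case-insensitive exact matches is an assumption, and the function name and sample data are hypothetical.

```python
def determine_domain(identified_words, domain_vocab):
    """Return the domain whose vocabulary list shares the most entries
    with the identified words, or None if nothing matches."""
    words = {w.casefold() for w in identified_words}
    scores = {
        domain: len(words & {name.casefold() for name in names})
        for domain, names in domain_vocab.items()
    }
    if not scores:
        return None
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


# Hypothetical per-domain vocabulary lists.
domain_vocab = {
    "music": ["BLACKPINK", "BTS", "IU"],
    "navigation": ["Seoul Station", "Namsan Tower"],
}
# Words identified on the currently displayed screen.
domain = determine_domain(["bts", "IU", "Lyrics"], domain_vocab)
```

Here `domain` evaluates to `"music"` because two of the identified words match that list and none match the navigation list.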

The method of claim 1, further comprising:
Receiving, from a server, update information for the vocabulary list, the update information including at least one of addition of a new named entity, deletion of a named entity, and change of a named entity; And
Updating the vocabulary list using the update information,
Wherein generating the weighted finite state transducer model comprises generating the weighted finite state transducer model through training using the updated vocabulary list.
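The updating step can be illustrated, without limitation, by the following sketch. The format of the update information is not specified in the claim; the dictionary keys `add`, `delete`, and `change` and the function name `apply_update` are assumptions introduced for this example.

```python
def apply_update(vocab, update):
    """Apply additions, deletions, and changes to a vocabulary list and
    return the updated list in sorted order."""
    names = set(vocab)
    names |= set(update.get("add", []))      # addition of new named entities
    names -= set(update.get("delete", []))   # deletion of named entities
    for old, new in update.get("change", {}).items():
        if old in names:                     # change of a named entity
            names.discard(old)
            names.add(new)
    return sorted(names)


vocab = ["Galaxy S10", "Gear VR", "SmartThings"]
update = {
    "add": ["Galaxy S20"],
    "delete": ["Gear VR"],
    "change": {"SmartThings": "SmartThings Hub"},
}
updated = apply_update(vocab, update)
```

After the update, the transducer model would be regenerated from `updated`, for example by passing it to whatever training routine built the original model.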

The method of claim 1, further comprising:
Recognizing that the device has entered a new area by obtaining location information of the device;
Transmitting information about the entry of the device into the new area to a server of an application service provider; And
Receiving, from the server of the application service provider, a point of interest (POI) vocabulary list including named entities for at least one of a place name, a tourist attraction, a tourist destination, and a famous restaurant of the new area,
Wherein generating the weighted finite state transducer model comprises generating the weighted finite state transducer model through training using the named entities included in the received POI vocabulary list.

The method of claim 1,
Wherein generating the weighted finite state transducer model comprises generating the weighted finite state transducer model through training using a personalized vocabulary list including a plurality of named entities that reflect user characteristics, obtained from at least one of applications frequently executed on the device, log data of a messenger application, and search term records of a content streaming application.

A device for recognizing a voice input, the device comprising:
A voice input unit configured to receive a voice input from a user;
A memory storing a program including one or more instructions; And
A processor configured to execute the one or more instructions of the program stored in the memory,
Wherein the processor is configured to:
Generate a weighted finite state transducer (WFST) model by training, using a vocabulary list including a plurality of named entities, on the probability that a subword extracted from each of the plurality of named entities is predicted as a word or word sequence representing the named entity;
Receive the voice input from the voice input unit;
Obtain, using a first decoding model, a feature vector including probability values that the received voice input is predicted as specific subwords, and obtain a first character string including a plurality of predicted character strings using the probability values of the feature vector;
Obtain, by inputting the obtained feature vector into a second decoding model that uses the weighted finite state transducer model, a second character string including a word sequence corresponding to at least one named entity among the plurality of named entities and an unrecognized word sequence that is not identified as the at least one named entity; And
Obtain text corresponding to the voice input by replacing the unrecognized word sequence in the second character string with a word sequence included in the first character string.

The device of claim 11,
Wherein the processor is configured to generate the weighted finite state transducer model by obtaining the vocabulary list including the plurality of named entities, segmenting the words or character strings constituting the plurality of named entities into subwords in phoneme or syllable units, and obtaining a confidence score including a posterior probability that the subwords are predicted as a specific named entity among the plurality of named entities, by learning weights through state transitions using the frequency of the subwords and the arrangement order of the subwords.

The device of claim 12,
Wherein the processor is configured to perform filtering to remove, from the plurality of named entities included in the obtained vocabulary list, any named entity that overlaps with a word previously stored in the memory of the device.

The device of claim 12,
Wherein the weighted finite state transducer model includes a lexicon finite state transducer (L FST) including mapping information that is a probability value that the subword is predicted as a specific word, and a grammar finite state transducer (G FST) including weight information for predicting, when the specific word or a word sequence is input, a word sequence that may be arranged after the input word or word sequence,
Wherein the weighted finite state transducer model is generated through composition of the lexicon finite state transducer and the grammar finite state transducer.

The device of claim 11,
Wherein the first decoding model is an end-to-end automatic speech recognition (ASR) model.

The device of claim 11,
Wherein the processor is configured to classify the plurality of named entities according to a plurality of different domains, and to generate a plurality of weighted finite state transducer models, one for each of the plurality of domains, using the classified plurality of named entities.

The device of claim 16,
Wherein the processor is configured to identify words corresponding to named entities from an application being executed or a web page being accessed, and to determine a domain into which the currently executing application or the web page is classified, by comparing the identified words with the plurality of named entities included in the vocabulary list of each of the plurality of weighted finite state transducer models generated for the plurality of domains.

The device of claim 11, further comprising:
A communication interface configured to transmit data to and receive data from a server,
Wherein the processor is configured to:
Receive, from the server using the communication interface, update information for the vocabulary list, the update information including at least one of addition of a new named entity, deletion of a named entity, and change of a named entity;
Update the vocabulary list using the update information; And
Generate the weighted finite state transducer model through training using the updated vocabulary list.

The device of claim 11, further comprising:
A location sensor configured to obtain location information of the device; And
A communication interface configured to transmit data to and receive data from a voice assistant server or an external server,
Wherein the processor is configured to:
Recognize, using the location sensor, that the device has entered a new area;
In response to recognizing the entry into the new area, receive, from the server of the application service provider through the communication interface, a point of interest (POI) vocabulary list including named entities for at least one of a place name, a tourist attraction, a tourist destination, and a famous restaurant of the new area; And
Generate the weighted finite state transducer model through training using the named entities included in the received POI vocabulary list.

The device of claim 11,
Wherein the processor is configured to generate the weighted finite state transducer model through training using a personalized vocabulary list including a plurality of named entities that reflect user characteristics, obtained from at least one of applications frequently executed on the device, log data of a messenger application, and search term records of a content streaming application.

A computer-readable recording medium storing a program for executing the method of claim 1 on a computer.