KR20240057182A - Method and apparatus for speech recognition

Info

Publication number: KR20240057182A
Application number: KR1020220137612A
Authority: KR
Inventors: 한택진
Original assignee: 주식회사 케이티
Priority date: 2022-10-24
Filing date: 2022-10-24
Publication date: 2024-05-02

Abstract

According to an embodiment of the present disclosure, a speech recognition method performed by a computing device may include: inputting speech data divided into a plurality of frames into an encoder to obtain a feature vector; identifying, among the plurality of frames, frames corresponding to a preset length section as base frames for a target frame for which a candidate string is to be determined; and inputting the feature vector into a first decoder to determine the candidate string for the target frame based on the base frames.

Description

Speech recognition method and apparatus {METHOD AND APPARATUS FOR SPEECH RECOGNITION}

The present disclosure relates to a speech recognition method and apparatus.

Speech recognition is a technology that takes human speech as input and outputs text. It has been applied in many areas, such as smartphones, set-top boxes, and AI speakers, and gives users the convenience of controlling devices with simple voice commands instead of their hands.

Traditional speech recognition pre-processes the input speech and then performs feature extraction. The feature extraction stage generally extracts Mel-Frequency Cepstral Coefficients (MFCCs) for each speech frame, and the resulting feature vectors are used to convert features into phonemes, phonemes into words, and words into sentences to output the final text.

However, this traditional approach is limited in that, whenever the target domain of speech recognition changes, the acoustic model, lexicon, and language model must be retrained and tuned for optimal performance. Rebuilding a speech recognition model in this way requires experts with advanced background knowledge and also takes a long time.

Accordingly, end-to-end (E2E) models are attracting attention because they can form a speech recognition system without a separate acoustic model, lexicon, or language model, given only speech data at one end and its transcribed text at the other. End-to-end models can be broadly divided, by approach, into CTC (Connectionist Temporal Classification), RNN-T (RNN Transducer), and attention-based methods.

Among these, the existing CTC method outputs a recognition result for every frame, but because of limitations in its decoding scheme it tends to transcribe speech as it sounds and is known to make frequent spelling errors. The existing RNN-T method, meanwhile, uses a fixed context vector, so a bottleneck occurs as the input sequence grows longer.
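For context, the frame-by-frame behavior of CTC described above can be illustrated with the standard greedy CTC collapse rule: merge consecutive repeated labels, then drop the blank symbol. This is a minimal sketch of that general rule, not code from this disclosure; the function name `ctc_collapse` and the blank symbol `_` are assumptions chosen for illustration.

```python
from itertools import groupby


def ctc_collapse(frame_labels: list[str], blank: str = "_") -> str:
    """Collapse per-frame labels into text using the standard CTC rule.

    Consecutive duplicates are merged first, then blanks are removed,
    so a blank between two identical labels keeps them distinct.
    """
    merged = (label for label, _ in groupby(frame_labels))
    return "".join(label for label in merged if label != blank)


# One label per frame; "_" is the (assumed) blank symbol.
print(ctc_collapse(list("hh_e_ll_lo")))
```

Because each frame contributes a label directly, an acoustically plausible but misspelled label sequence passes straight through to the output, which is the tendency the paragraph above describes.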

There is therefore a need for a technology that, based on an end-to-end model, can produce a speech recognition model that is robust even to noisy sections and unclear speech.

The problem to be solved is to provide a method and apparatus for generating a speech recognition model that is robust even to noisy sections and unclear speech, based on a hybrid model that combines a Transformer with a CTC method modified to calculate speech recognition probabilities in a shifting manner (hereinafter, “Shift CTC”). The embodiments may also be used to achieve other objectives not specifically mentioned.

A speech recognition method performed by a computing device according to an embodiment includes: inputting speech data divided into a plurality of frames into an encoder to obtain a feature vector; identifying, among the plurality of frames, frames corresponding to a preset length section as base frames for a target frame for which a candidate string is to be determined; and inputting the feature vector into a first decoder to determine the candidate string for the target frame based on the base frames.

The base frames may include, among the plurality of frames, the frames from the target frame back to the frame that precedes it by the preset length section, and the preset length section may include fewer frames than the total number of the plurality of frames.
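The selection of base frames can be sketched as a simple index computation. This is a minimal sketch under stated assumptions, not the disclosed implementation: the function name `base_frame_indices` is hypothetical, frames are indexed from 0, and the preset length section (`window`) is assumed to count the target frame itself. The text does not say how the window is handled near the start of the utterance, so clipping at frame 0 is also an assumption.

```python
def base_frame_indices(target: int, window: int, num_frames: int) -> list[int]:
    """Return indices of the base frames for the frame at index `target`.

    Assumes `window` includes the target frame and must be smaller than
    the total number of frames, as the text requires.
    """
    if not 0 <= target < num_frames:
        raise ValueError("target must be a valid frame index")
    if not 0 < window < num_frames:
        raise ValueError("window must be positive and smaller than num_frames")
    start = max(0, target - window + 1)  # clip at the first frame (assumption)
    return list(range(start, target + 1))


print(base_frame_indices(target=5, window=3, num_frames=10))  # frames 3..5
print(base_frame_indices(target=1, window=3, num_frames=10))  # clipped to 0..1
```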

Determining the candidate string may include determining, based on the probability values of each candidate string for each of the frames included in the base frames, the value that maximizes the sum of the probability values as the candidate string for the target frame.

The method may further include inputting the feature vector into a second decoder, and determining a final string for the target frame based on the candidate string determined by the first decoder and an output value of the second decoder.

The first decoder may be a CTC (Connectionist Temporal Classification)-based decoder, and the second decoder may be a transformer-based decoder that forms an attention structure with the encoder.

The encoder may be a conformer-based encoder shared by the first decoder and the second decoder.

The method may further include training the encoder, the first decoder, and the second decoder so as to minimize the sum of the respective losses of the first decoder and the second decoder.

A speech recognition method performed by a computing device according to an embodiment includes: inputting speech data divided into a plurality of frames into an encoder to obtain a feature vector; determining, by a first decoder, a candidate string for a target frame based on the feature vector and on frames corresponding to a preset length section among the plurality of frames; and determining, by a second decoder, a final string for the target frame based on the feature vector and the candidate string determined by the first decoder.

The frames corresponding to the preset length section may include, among the plurality of frames, the frames from the target frame back to the frame that precedes it by the preset length section, and the preset length section may include fewer frames than the total number of the plurality of frames.

Determining the candidate string may include determining, based on the probability values of each candidate string for each of the frames corresponding to the preset length section, the value that maximizes the sum of the probability values as the candidate string for the target frame.

The method may further include training the encoder, the first decoder, and the second decoder so as to minimize the sum of the respective losses of the first decoder and the second decoder.

A speech recognition apparatus according to an embodiment includes: an encoder that receives speech data divided into a plurality of frames and outputs a feature vector; a first decoder that determines a candidate string for a target frame based on the feature vector and on frames corresponding to a preset length section among the plurality of frames; and a second decoder that determines a final string for the target frame based on the feature vector and the candidate string.

The frames corresponding to the preset length section may include, among the plurality of frames, the frames from the target frame back to the frame that precedes it by the preset length section, and the preset length section may include fewer frames than the total number of the plurality of frames.

The candidate string for the target frame may be determined, based on the probability values of each candidate string for each of the frames corresponding to the preset length section, as the value that maximizes the sum of the probability values.

The first decoder may be a CTC (Connectionist Temporal Classification)-based decoder, and the second decoder may be a transformer-based decoder that forms an attention structure with the encoder.

The encoder may be a conformer-based encoder shared by the first decoder and the second decoder.

The encoder, the first decoder, and the second decoder may be trained so as to minimize the sum of the respective losses of the first decoder and the second decoder.

According to an embodiment of the present invention, a speech recognition model can be generated from speech data and its transcribed text alone, without building and training a separate language model and lexicon.

According to an embodiment of the present invention, because a dynamic chunk training technique is used, the latency of the output text can be changed after the model has been built, so a speech recognition model that supports streaming services can be generated.

According to an embodiment of the present invention, a speech recognition model with an improved recognition rate can be generated by calculating speech recognition probabilities in a shifting manner.

FIG. 1 is a block diagram showing a speech recognition apparatus according to some embodiments of the present invention.
FIG. 2 is a schematic diagram for explaining a speech recognition method according to some embodiments of the present invention in comparison with the prior art.
FIG. 3 is a diagram showing the structure of a speech recognition apparatus according to some embodiments of the present invention.
FIG. 4 is a flowchart of a speech recognition method according to some embodiments of the present invention.
FIG. 5 is a block diagram showing a computing device that provides a speech recognition method according to some embodiments of the present invention.

Embodiments of the present disclosure are described in detail below with reference to the accompanying drawings so that those of ordinary skill in the art to which the present disclosure pertains can readily practice them. The present disclosure may, however, be implemented in many different forms and is not limited to the embodiments described here. To describe the present disclosure clearly, parts unrelated to the description are omitted from the drawings, and similar parts are given similar reference numerals throughout the specification.

In the present disclosure, when a part is said to "include" a component, this means that it may further include other components rather than excluding them, unless specifically stated otherwise. The devices that make up the network may be implemented in hardware, software, or a combination of hardware and software.

FIG. 1 is a block diagram showing a speech recognition apparatus according to some embodiments of the present invention.

Referring to FIG. 1, the speech recognition apparatus 100 according to the present disclosure may include an encoder 110, a Shift-CTC decoder 120, a transformer decoder 130, and a learning unit 140. These components are not all essential to implementing the speech recognition apparatus 100, however, so the apparatus may be implemented with more or fewer components than those listed.

The encoder 110 of the speech recognition apparatus 100 according to the present disclosure may receive speech data and extract local features from it. The encoder 110 may be a conformer-type encoder shared by the Shift-CTC decoder 120 and the transformer decoder 130, both described below.

Specifically, the encoder 110 may pass the local features extracted from the speech data to the Shift-CTC decoder 120 so that speech recognition probabilities are calculated in a shifting manner. The Shift-CTC decoder 120 can thereby determine a candidate string for the current frame using the Shift CTC method while taking the relationships between words into account. The encoder 110 may also form an attention structure with the transformer decoder 130.

The Shift-CTC decoder 120 of the speech recognition apparatus 100 according to the present disclosure may determine a candidate string for each frame from the embedding vector generated by the encoder 110. Out of the plurality of frames from the first frame to the current frame, the Shift-CTC decoder 120 may take into account only the frames within a specific length section when determining the candidate string for the current frame. This contrasts with conventional CTC decoding, which, to select the candidate string for the current frame, reflects the probability values of the correct paths over all frames preceding the current frame.

Conventional CTC decoding infers, as the correct value for given speech data, the value that maximizes the sum of probabilities over all possible cases the speech data can take. In other words, conventional CTC decoding examines the per-string probability values of the current frame and of all previous frames, and infers the correct value for the current frame on that basis.
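The "sum of probabilities over all possible cases" in conventional CTC is commonly written as below. This is the standard CTC formulation given for reference, not a formula taken from this disclosure; the symbols $\mathbf{x}$ (input frames), $\mathbf{y}$ (output string), $\pi$ (a frame-level alignment path), $\mathcal{B}$ (the collapse mapping that merges repeats and removes blanks), and $T$ (number of frames) are notation introduced here.

```latex
P(\mathbf{y} \mid \mathbf{x})
  = \sum_{\pi \in \mathcal{B}^{-1}(\mathbf{y})} \prod_{t=1}^{T} p(\pi_t \mid \mathbf{x}),
\qquad
\hat{\mathbf{y}} = \arg\max_{\mathbf{y}} P(\mathbf{y} \mid \mathbf{x})
```

Because the product runs over every frame from $t = 1$ to $T$, a poorly estimated early frame affects the score of every path that passes through it.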

As a result, conventional CTC decoding has the drawback that inference performance for the current frame degrades when the values inferred for preceding frames are inaccurate. The speech recognition method of the present disclosure therefore proposes a modified CTC method (i.e., “Shift CTC”) that reflects only the correct paths within a section of fixed length when inferring the correct value for the current frame.

Specifically, the Shift-CTC decoder 120 shifts the specific length section of frames that is reflected in determining the current frame's candidate string forward as decoding progresses. Accordingly, even when noise in the initial frames (i.e., frames preceding the specific length section) or an unclear word lowers the probability values of the correct paths, the Shift-CTC decoder 120 can calculate re-calibrated probability values.
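The shifting window can be illustrated with a deliberately simplified sketch. It is not the disclosed decoder: it assumes each frame already has a probability for each candidate label, sums those probabilities over a window that ends at the current frame, and picks the label with the largest sum. Real CTC scoring sums over alignment paths; the function name `shift_candidates`, the dictionary data layout, and the treatment of `window` as including the current frame are all assumptions made for illustration.

```python
def shift_candidates(frame_probs: list[dict[str, float]], window: int) -> list[str]:
    """Pick, for each frame, the label whose summed probability over the
    last `window` frames (including the current one) is largest."""
    if window <= 0:
        raise ValueError("window must be positive")
    chosen = []
    for t in range(len(frame_probs)):
        start = max(0, t - window + 1)  # window slides forward with t
        totals: dict[str, float] = {}
        for probs in frame_probs[start : t + 1]:
            for label, p in probs.items():
                totals[label] = totals.get(label, 0.0) + p
        chosen.append(max(totals, key=totals.get))
    return chosen


# Toy data: the first two frames favor "a", the later frames favor "b".
probs = [
    {"a": 0.9, "b": 0.1},
    {"a": 0.8, "b": 0.2},
    {"a": 0.1, "b": 0.9},
    {"a": 0.2, "b": 0.8},
]
print(shift_candidates(probs, window=2))
print(shift_candidates(probs, window=3))
```

With the shorter window, the choice at the third frame is no longer dominated by the first frame, which mirrors the idea that frames older than the window stop influencing the current decision.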

The transformer decoder 130 of the speech recognition apparatus 100 according to the present disclosure may generate the final string from the embedding vector generated by the encoder 110. Here, attention lets the transformer decoder 130 focus on the part of the encoder 110 output that is most relevant to the word to be output next.

Rather than a fixed context vector, the transformer decoder 130 may use a context vector that is generated variably based on the output index (i.e., attention) to update the decoder's hidden state.

Meanwhile, to generate the final string from the embedding vector, the transformer decoder 130 may reflect the candidate strings determined frame by frame by the Shift-CTC decoder 120 described above. For example, the result of the transformer decoder 130 may be rescored based on the result of the Shift-CTC decoder 120.
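One common way to combine two decoders' results is a weighted sum of their log-probability scores per hypothesis. The sketch below shows that general idea only; the text does not specify the rescoring formula, so the linear interpolation, the `ctc_weight` parameter, the function name `rescore`, and the example scores are all assumptions.

```python
def rescore(hyps: list[tuple[str, float, float]], ctc_weight: float = 0.5) -> str:
    """Return the hypothesis text with the best combined score.

    Each hypothesis is (text, ctc_log_prob, attention_log_prob).
    The combined score is an assumed linear interpolation of the two.
    """
    if not 0.0 <= ctc_weight <= 1.0:
        raise ValueError("ctc_weight must be in [0, 1]")

    def combined(h: tuple[str, float, float]) -> float:
        _, ctc_score, att_score = h
        return ctc_weight * ctc_score + (1.0 - ctc_weight) * att_score

    return max(hyps, key=combined)[0]


# Made-up scores for illustration only.
hyps = [
    ("recognize speech", -2.0, -1.0),
    ("wreck a nice beach", -1.5, -4.0),
]
print(rescore(hyps, ctc_weight=0.5))
```

Setting `ctc_weight` to 1.0 ignores the attention decoder entirely, and 0.0 ignores the CTC-style scores, which makes the contribution of each decoder easy to inspect.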

The learning unit 140 of the speech recognition apparatus 100 according to the present disclosure may train the encoder 110, the Shift-CTC decoder 120, and the transformer decoder 130 so as to minimize the loss values of the Shift-CTC decoder 120 and the transformer decoder 130 described above. Regarding the loss value, reference may be made to Equation 1 below:
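Equation 1 itself does not appear in this text. Since the disclosure elsewhere states that training minimizes the sum of the losses of the two decoders, a plausible reading is the unweighted sum below. This is a sketch under that assumption, not the equation as filed, which may include weighting factors; the symbols $\mathcal{L}_{\text{total}}$, $\mathcal{L}_{\text{Shift-CTC}}$, and $\mathcal{L}_{\text{Transformer}}$ are names introduced here.

```latex
\mathcal{L}_{\text{total}}
  = \mathcal{L}_{\text{Shift-CTC}} + \mathcal{L}_{\text{Transformer}}
```

Because the encoder 110 is shared, gradients from both terms flow into the same encoder parameters.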

Meanwhile, the learning unit 140 may group the plurality of frames that make up the speech data into chunks according to a preset number of frames, and may train the encoder 110, the Shift-CTC decoder 120, and the transformer decoder 130 while variably adjusting the chunk unit (i.e., the preset number of frames). In this case, the chunk unit may be set to any value from 1 up to the maximum utterance length.
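Variable chunk sizes during training can be sketched as follows. The text only says that the chunk unit is varied between 1 and the maximum utterance; how a chunk limits what each frame can see is not specified. The chunk-based visibility mask below, in which a frame sees its own chunk and all earlier chunks, is therefore an assumption, as are the names `sample_chunk_size` and `chunk_mask`.

```python
import random


def sample_chunk_size(num_frames: int, rng: random.Random) -> int:
    """Draw a chunk size between 1 and the utterance length (inclusive)."""
    return rng.randint(1, num_frames)


def chunk_mask(num_frames: int, chunk_size: int) -> list[list[bool]]:
    """mask[i][j] is True when frame i may use frame j.

    Assumed rule: frame i sees every frame whose chunk index is not
    later than its own chunk index.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        [(j // chunk_size) <= (i // chunk_size) for j in range(num_frames)]
        for i in range(num_frames)
    ]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
size = sample_chunk_size(8, rng)
for row in chunk_mask(4, 2):
    print(["x" if visible else "." for visible in row])
```

Under this assumed rule, a chunk size equal to the utterance length gives every frame full context, while a chunk size of 1 restricts each frame to itself and earlier frames, so a smaller chunk corresponds to lower output latency.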

Having been trained in this way, the speech recognition apparatus 100 can easily adjust its output latency by changing the chunk size, which makes it possible to implement a speech recognition apparatus 100 that supports streaming services.

FIG. 2 is a schematic diagram for explaining a speech recognition method according to some embodiments of the present invention in comparison with the prior art. FIG. 2 takes time (i.e., frames) as the x-axis and shows an example of Shift-CTC decoding of three words. In particular, FIG. 2(a) schematically shows the conventional CTC decoding method, and FIG. 2(b) schematically shows the Shift-CTC decoding method according to the present disclosure.

Out of the plurality of frames from the first frame to the current frame, the Shift-CTC decoder 120 may take into account only the frames within a specific length section when determining the candidate string for the current frame. This contrasts with conventional CTC decoding, which, to select the candidate string for the current frame, reflects the probability values of the correct paths over all frames preceding the current frame.

FIG. 3 is a diagram showing the structure of a speech recognition apparatus according to some embodiments of the present invention. Specifically, FIG. 3 shows in detail the structure of each of the encoder 110, the Shift-CTC decoder 120, and the transformer decoder 130 that make up the speech recognition apparatus 100.

FIG. 4 is a flowchart of a speech recognition method according to some embodiments of the present invention.

Referring to FIG. 4, the speech recognition apparatus 100 according to the present disclosure may first input speech data divided into a plurality of frames into the encoder to obtain a feature vector (S110).
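Dividing speech data into frames before it reaches the encoder can be sketched as below. The text does not give a frame length or overlap, so `frame_len` and `hop` are hypothetical parameters, and dropping a trailing partial frame is an assumption made to keep every frame the same size.

```python
def split_into_frames(samples: list[float], frame_len: int, hop: int) -> list[list[float]]:
    """Split a sample sequence into fixed-length, possibly overlapping frames.

    A new frame starts every `hop` samples; a trailing partial frame
    is dropped (assumption).
    """
    if frame_len <= 0 or hop <= 0:
        raise ValueError("frame_len and hop must be positive")
    last_start = len(samples) - frame_len
    return [samples[s : s + frame_len] for s in range(0, last_start + 1, hop)]


# Ten dummy samples, frames of 4 samples starting every 2 samples.
frames = split_into_frames([float(i) for i in range(10)], frame_len=4, hop=2)
print(len(frames), frames[-1])
```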

Next, the speech recognition apparatus 100 may identify, among the plurality of frames, frames corresponding to a preset length section as base frames for the target frame for which a candidate string is to be determined (S120).

Next, the speech recognition apparatus 100 may input the feature vector into the first decoder and determine the candidate string for the target frame based on the base frames (S130).

FIG. 5 is a block diagram showing a computing device that provides a speech recognition method according to some embodiments of the present invention.

Here, the computing device 10 that provides the speech recognition method may be the speech recognition apparatus 100 described above, or any terminal (not shown) communicatively connected to the speech recognition apparatus 100 in order to provide the speech recognition method. It is not limited to these, however.

Referring to FIG. 5, the computing device 10 according to the present disclosure may include one or more processors 11, a memory 12 into which a program executed by the processor 11 is loaded, a storage 13 that stores the program and various data, and a communication interface 14. These components are not all essential to implementing the computing device 10, however, so the computing device 10 may have more or fewer components than those listed above. For example, the computing device 10 may further include an output unit and/or an input unit (not shown), or the storage 13 may be omitted.

The program may include instructions that, when loaded into the memory 12, cause the processor 11 to perform methods/operations according to various embodiments of the present disclosure. That is, the processor 11 can perform the methods/operations according to various embodiments of the present disclosure by executing the instructions. A program here refers to a series of computer-readable instructions that are grouped by function and executed by a processor.

The processor 11 controls the overall operation of each component of the computing device 10. The processor 11 may include at least one of a CPU (Central Processing Unit), an MPU (Micro Processor Unit), an MCU (Micro Controller Unit), a GPU (Graphic Processing Unit), or any type of processor well known in the technical field of the present disclosure. The processor 11 may also perform operations for at least one application or program for executing methods/operations according to various embodiments of the present disclosure.

The memory 12 stores various data, commands, and/or information. The memory 12 may load one or more programs from the storage 13 in order to execute methods/operations according to various embodiments of the present disclosure. The memory 12 may be implemented as volatile memory such as RAM, but the technical scope of the present disclosure is not limited to this.

The storage 13 may store programs non-transitorily. The storage 13 may include non-volatile memory such as ROM (Read Only Memory), EPROM (Erasable Programmable ROM), EEPROM (Electrically Erasable Programmable ROM), or flash memory, a hard disk, a removable disk, or any type of computer-readable recording medium well known in the technical field to which the present disclosure pertains. The communication interface 14 may be a wired/wireless communication module.

The embodiments of the present disclosure described above are not implemented only through an apparatus and a method; they may also be implemented through a program that realizes functions corresponding to the configurations of the embodiments, or through a recording medium on which that program is recorded.

Although embodiments of the present disclosure have been described in detail above, the scope of rights of the present disclosure is not limited to them. Various modifications and improvements made by those skilled in the art using the basic concept of the present disclosure, as defined in the following claims, also fall within the scope of rights of the present disclosure.

Claims

A speech recognition method performed by a computing device, the method comprising:
inputting speech data divided into a plurality of frames into an encoder to obtain a feature vector;
identifying, among the plurality of frames, frames corresponding to a preset length section as base frames for a target frame for which a candidate string is to be determined; and
inputting the feature vector into a first decoder to determine the candidate string for the target frame based on the base frames.

The method of claim 1, wherein:
the basic frames include, among the plurality of frames, the frames from the target frame back to the frame that precedes it by the preset length section; and
the preset length section contains fewer frames than the plurality of frames.
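
As an illustration of the frame window in the preceding claim, the following minimal Python sketch selects the frames that serve as basic frames for a given target frame. The function name `basic_frame_indices`, its parameters, and the choice to count the target frame itself as part of the window are assumptions made for this example and are not taken from the disclosure.

```python
def basic_frame_indices(target: int, window_len: int, num_frames: int) -> range:
    """Return the indices of the basic frames for `target`.

    Assumption: the window ends at the target frame (inclusive) and
    reaches back `window_len` frames, clipped at the first frame.
    """
    if not 0 <= target < num_frames:
        raise ValueError("target frame index is out of range")
    # The preset length section must be shorter than the whole utterance.
    if not 0 < window_len < num_frames:
        raise ValueError("window_len must be positive and less than num_frames")
    start = max(0, target - window_len + 1)
    return range(start, target + 1)
```

Near the start of the utterance the window is simply shorter, because fewer preceding frames exist.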

The method of claim 1, wherein determining the candidate string comprises:
determining, based on the probability values for each candidate string of the frames included in the basic frames, the value that maximizes the sum of the probability values as the candidate string for the target frame.
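
One simple reading of this selection rule can be sketched as follows: for every candidate, add up its probability values over the basic frames, then keep the candidate with the largest total. The function name `best_candidate` and the representation of each frame as a dictionary mapping candidates to probabilities are assumptions for illustration only; an actual CTC decoder would typically operate on log-probabilities and handle blank and repeated symbols.

```python
from typing import Dict, List


def best_candidate(window_probs: List[Dict[str, float]]) -> str:
    """Pick the candidate whose summed probability over the window is largest.

    `window_probs` holds one dict per basic frame (hypothetical layout),
    mapping each candidate to its probability value in that frame.
    """
    if not window_probs:
        raise ValueError("the window must contain at least one frame")
    totals: Dict[str, float] = {}
    for frame in window_probs:
        for candidate, prob in frame.items():
            totals[candidate] = totals.get(candidate, 0.0) + prob
    # The candidate with the maximum summed probability wins.
    return max(totals, key=totals.get)
```

Because only the frames in the window are summed, the cost per target frame depends on the window length rather than on the length of the whole utterance.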

The method of claim 1, further comprising:
inputting the feature vector into a second decoder; and
determining a final string for the target frame based on the candidate string determined by the first decoder and the output value of the second decoder.
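
The claim does not state how the two decoder outputs are combined. A common approach in two-pass recognizers, shown here purely as an assumption, is to rescore the first decoder's candidates with the second decoder and keep the candidate with the best weighted score. The names `rescore` and `ctc_weight`, and the use of log-scores stored in dictionaries, are hypothetical.

```python
from typing import Dict


def rescore(first_scores: Dict[str, float],
            second_scores: Dict[str, float],
            ctc_weight: float = 0.5) -> str:
    """Return the candidate with the highest weighted combined score.

    first_scores:  log-score of each candidate from the first decoder.
    second_scores: log-score of the same candidates from the second decoder.
    ctc_weight:    assumed interpolation weight between 0 and 1.
    """
    if not 0.0 <= ctc_weight <= 1.0:
        raise ValueError("ctc_weight must lie in [0, 1]")

    def combined(text: str) -> float:
        return ctc_weight * first_scores[text] + (1.0 - ctc_weight) * second_scores[text]

    return max(first_scores, key=combined)
```

With `ctc_weight=1.0` the second decoder is ignored and the result is the first decoder's own best candidate; lower values give the second decoder more influence.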

The method of claim 4, wherein:
the first decoder is a decoder based on CTC (Connectionist Temporal Classification); and
the second decoder is a transformer-based decoder that forms an attention structure with the encoder.

The method of claim 4, wherein the encoder is a conformer-based encoder shared by the first decoder and the second decoder.

The method of claim 4, further comprising:
training the encoder, the first decoder, and the second decoder so as to minimize the sum of the respective losses of the first decoder and the second decoder.
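
The training objective in the preceding claim can be written compactly as follows. The symbols are assumptions introduced here: $\mathcal{L}_{1}$ and $\mathcal{L}_{2}$ denote the losses of the first and second decoders, and $\theta_{\text{enc}}$, $\theta_{1}$, $\theta_{2}$ denote the parameters of the encoder and the two decoders. The claim specifies a plain sum, so no weighting factor is shown.

```latex
\mathcal{L}_{\text{total}} = \mathcal{L}_{1} + \mathcal{L}_{2},
\qquad
(\theta_{\text{enc}}^{*}, \theta_{1}^{*}, \theta_{2}^{*})
  = \arg\min_{\theta_{\text{enc}},\, \theta_{1},\, \theta_{2}} \mathcal{L}_{\text{total}}
```

Because both losses depend on the shared encoder, minimizing their sum updates the encoder with gradients from both decoders at once.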

A voice recognition method performed by a computing device, comprising:
inputting voice data divided into a plurality of frames into an encoder and receiving a feature vector as output;
determining, by a first decoder, a candidate string for a target frame based on the feature vector and frames corresponding to a preset length section among the plurality of frames; and
determining, by a second decoder, a final string for the target frame based on the feature vector and the candidate string determined by the first decoder.

The method of claim 8, wherein:
the frames corresponding to the preset length section include, among the plurality of frames, the frames from the target frame back to the frame that precedes it by the preset length section; and
the preset length section contains fewer frames than the plurality of frames.

The method of claim 8, wherein determining the candidate string comprises:
determining, based on the probability values for each candidate string of the frames corresponding to the preset length section, the value that maximizes the sum of the probability values as the candidate string for the target frame.

The method of claim 8, wherein:
the first decoder is a decoder based on CTC (Connectionist Temporal Classification); and
the second decoder is a transformer-based decoder that forms an attention structure with the encoder.

The method of claim 8, wherein the encoder is a conformer-based encoder shared by the first decoder and the second decoder.

The method of claim 8, further comprising:
training the encoder, the first decoder, and the second decoder so as to minimize the sum of the respective losses of the first decoder and the second decoder.

A voice recognition device comprising:
an encoder that receives voice data divided into a plurality of frames and outputs a feature vector;
a first decoder that determines a candidate string for a target frame based on the feature vector and frames corresponding to a preset length section among the plurality of frames; and
a second decoder that determines a final string for the target frame based on the feature vector and the candidate string.

The device of claim 14, wherein:
the frames corresponding to the preset length section include, among the plurality of frames, the frames from the target frame back to the frame that precedes it by the preset length section; and
the preset length section contains fewer frames than the plurality of frames.

The device of claim 14, wherein the candidate string for the target frame is determined, based on the probability values for each candidate string of the frames corresponding to the preset length section, as the value that maximizes the sum of the probability values.

The device of claim 14, wherein:
the first decoder is a decoder based on CTC (Connectionist Temporal Classification); and
the second decoder is a transformer-based decoder that forms an attention structure with the encoder.

The device of claim 14, wherein the encoder is a conformer-based encoder shared by the first decoder and the second decoder.

The device of claim 14, wherein the encoder, the first decoder, and the second decoder are trained so as to minimize the sum of the respective losses of the first decoder and the second decoder.