KR102386627B1 - Beam search method for speech recognition and apparatus thereof

Info

Publication number: KR102386627B1
Application number: KR1020200107214A
Authority: KR
Inventors: 전재진; 김의성
Original assignee: 주식회사 카카오엔터프라이즈
Priority date: 2020-08-25
Filing date: 2020-08-25
Publication date: 2022-04-14
Also published as: KR20220026325A

Abstract

The present invention relates to a beam search method and apparatus for speech recognition. According to an embodiment, the beam search method includes: obtaining a plurality of candidate word sequences corresponding to at least one previous frame of a speech signal, and probabilities of the candidate word sequences; repeatedly expanding the plurality of candidate word sequences based on a current frame of the speech signal; and selecting, from among the repeatedly expanded candidate word sequences, a predetermined number of candidate word sequences having high probabilities.

Description

BEAM SEARCH METHOD FOR SPEECH RECOGNITION AND APPARATUS THEREOF

The following description relates to a beam search method and apparatus for speech recognition.

Speech recognition, also referred to as speech-to-text (STT), is a technology that converts a speech signal produced by an utterance into text data for processing. As speech recognition has made speech available as a new input method for devices, it is being applied in various technical fields, such as device control and information retrieval by voice. Recently, research has been actively conducted on speech recognition algorithms that use machine learning to improve recognition performance, and on techniques that complement applications of speech recognition, such as separating each speaker's speech from a speech signal containing the speech of multiple speakers and identifying a speaker from a speech signal.

An embodiment may provide a method for efficiently performing a beam search operation. More specifically, an embodiment may provide a method for reducing two operations that are performed during beam search for each of the hypotheses, whose number equals the number of beams: an operation of checking whether the hypothesis is a prefix of another hypothesis and, if it is, an operation of recomputing the probability of the hypothesis based on a conditional probability.

A beam search method according to one aspect includes: obtaining a plurality of candidate word sequences corresponding to at least one previous frame of a speech signal, and probabilities of the candidate word sequences; repeatedly expanding the plurality of candidate word sequences based on a current frame of the speech signal; and selecting, from among the repeatedly expanded candidate word sequences, a predetermined number of candidate word sequences having high probabilities.

The repeatedly expanding includes: generating a plurality of extended candidate word sequences, each having one of the plurality of candidate word sequences as a prefix; obtaining probabilities of the extended candidate word sequences based on the current frame of the speech signal; merging the probability of at least one extended candidate word sequence that duplicates a candidate word sequence into the probability of the corresponding candidate word sequence; and adding, to the candidate word sequences, the remaining extended candidate word sequences that do not duplicate any candidate word sequence.

The selecting of the predetermined number of candidate word sequences may include: obtaining final probabilities of the repeatedly expanded candidate word sequences by reflecting a probability of a blank label in the probabilities of the repeatedly expanded candidate word sequences; and selecting, from among the repeatedly expanded candidate word sequences, a predetermined number of candidate word sequences having high final probabilities as the candidate word sequences corresponding to the current frame.
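The selection step can be illustrated with a minimal Python sketch. The text does not state how the blank-label probability is "reflected"; multiplication by a per-sequence blank probability is assumed here purely for illustration, and the names `select_candidates`, `expanded`, and `blank_probs` are hypothetical.

```python
import heapq


def select_candidates(expanded, blank_probs, beam_width):
    """Return the beam_width sequences with the highest final probability.

    expanded:    dict mapping a word sequence (tuple) to its probability
    blank_probs: dict mapping the same sequences to a blank-label probability
    """
    # Assumption: the blank-label probability is reflected by multiplication.
    final = {seq: p * blank_probs[seq] for seq, p in expanded.items()}
    # Keep the beam_width entries with the largest final probability.
    best = heapq.nlargest(beam_width, final.items(), key=lambda item: item[1])
    return dict(best)


expanded = {("a",): 0.5, ("a", "b"): 0.4, ("b",): 0.1}
blank_probs = {("a",): 0.2, ("a", "b"): 0.9, ("b",): 0.5}
selected = select_candidates(expanded, blank_probs, beam_width=2)
```

In this toy input the sequence `("a",)` has the highest pre-blank probability but is ranked below `("a", "b")` once the blank-label probability is taken into account.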

The merging may include determining, as the probability of the corresponding candidate word sequence, a value obtained by adding the probability of the at least one extended candidate word sequence that duplicates the candidate word sequence to the probability of the corresponding candidate word sequence.

Alternatively, the merging may include determining, as the probability of the corresponding candidate word sequence, the largest among the probability of the at least one extended candidate word sequence that duplicates the candidate word sequence and the probability of the corresponding candidate word sequence.
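The two merging variants described above (sum and maximum) can be sketched as follows. The function name `merge_probability` and the `mode` parameter are hypothetical and serve only to illustrate the two options.

```python
def merge_probability(candidate_prob, duplicate_probs, mode="sum"):
    """Merge the probabilities of duplicate extended sequences into a candidate.

    candidate_prob:  probability already held by the candidate word sequence
    duplicate_probs: probabilities of extended sequences that duplicate it
    mode:            "sum" adds them, "max" keeps the largest value
    """
    if mode == "sum":
        return candidate_prob + sum(duplicate_probs)
    if mode == "max":
        return max([candidate_prob, *duplicate_probs])
    raise ValueError("mode must be 'sum' or 'max'")


merged_sum = merge_probability(0.2, [0.1, 0.05], mode="sum")
merged_max = merge_probability(0.2, [0.1, 0.05], mode="max")
```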

The extended candidate word sequences may be generated by concatenating, to each of the candidate word sequences, each of a plurality of candidate words corresponding to the current frame of the speech signal.

The probability of a first extended candidate word sequence, generated by concatenating a first candidate word among the candidate words to a first candidate word sequence among the candidate word sequences, may be obtained based on the probability of the first candidate word sequence and the probability that the frame of the speech signal corresponds to the first candidate word.
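As a minimal sketch of this extension, the following Python code concatenates every candidate word to every candidate word sequence. The text only says the extended probability is obtained "based on" the two probabilities; taking their product is an assumption made here, and `extend` and `word_probs` are hypothetical names.

```python
def extend(candidates, word_probs):
    """Build extended candidate word sequences and their probabilities.

    candidates: dict mapping a word sequence (tuple) to its probability
    word_probs: dict mapping each candidate word to the probability that the
                current frame corresponds to that word
    """
    extended = {}
    for seq, p_seq in candidates.items():
        for word, p_word in word_probs.items():
            # Assumption: the extended probability is the product of the two.
            extended[seq + (word,)] = p_seq * p_word
    return extended


extended = extend({("a",): 0.5}, {"b": 0.4, "c": 0.6})
```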

The beam search method may further include: obtaining a first feature vector corresponding to the current frame of the speech signal by applying the speech signal, frame by frame, to a transcription network for searching for a word corresponding to a frame of a speech signal; obtaining a second feature vector corresponding to a next word of a candidate word sequence by applying the candidate word sequence corresponding to the previous frame of the speech signal to a prediction network for searching for a next word of a word sequence; and obtaining, based on the first feature vector and the second feature vector, probabilities that the current frame of the speech signal corresponds to each of a plurality of candidate words.

The generating of the extended candidate word sequences may include generating the extended candidate word sequences by concatenating each of the candidate words to each of the candidate word sequences.

The obtaining of the probabilities of the extended candidate word sequences may include obtaining, for each extended candidate word sequence, the probability of that extended candidate word sequence based on the probability of the candidate word sequence corresponding to it and the probability that the frame of the speech signal corresponds to the candidate word corresponding to it.

The obtaining of the probabilities of corresponding to the plurality of candidate words may include: obtaining a third feature vector by applying the first feature vector and the second feature vector to a joint network for searching for a word corresponding to a speech signal and a word sequence; and obtaining the probabilities that the current frame of the speech signal corresponds to each of the plurality of candidate words by inputting the third feature vector to a classifier.
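The data flow from the first and second feature vectors to word probabilities can be sketched in plain Python. The internal form of the joint network and the classifier is not specified in the text; an element-wise sum followed by tanh, and a linear layer followed by softmax, are assumptions chosen for illustration, and all names and numeric values below are hypothetical.

```python
import math


def joint_network(first_vec, second_vec):
    """Combine the two feature vectors into a third feature vector."""
    # Assumption: element-wise sum followed by tanh.
    return [math.tanh(a + b) for a, b in zip(first_vec, second_vec)]


def classifier(third_vec, weights, vocabulary):
    """Map the third feature vector to a probability for each candidate word."""
    # Assumption: one linear layer (one weight row per word) and a softmax.
    logits = [sum(w * x for w, x in zip(row, third_vec)) for row in weights]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return {word: e / total for word, e in zip(vocabulary, exps)}


first_vec = [0.2, -0.1]   # stands in for the transcription network output
second_vec = [0.5, 0.3]   # stands in for the prediction network output
weights = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
vocabulary = ["a", "b", "<blank>"]
probs = classifier(joint_network(first_vec, second_vec), weights, vocabulary)
```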

The transcription network may include a neural network trained, based on training data including speech data and text labels corresponding to the speech data, to receive the speech data and output the text labels.

The prediction network may include a neural network trained, based on training data of a text corpus including word sequences of a specific language, to receive a word sequence of the specific language and predict the next word.

The repeatedly expanding may include repeating the expansion a predetermined number of times.

The predetermined number of times may be determined based on the length of a frame of the speech signal and the length of a word corresponding to a frame of the speech signal.

The selecting of the predetermined number of candidate word sequences may include, when the next frame of the speech signal is a frame indicating the end of the speech signal, outputting at least some of the repeatedly expanded candidate word sequences based on the probabilities of the repeatedly expanded candidate word sequences.

The word sequence may include a single word and a sequence of a plurality of words.

The word may include a single character of a specific language and a combination of a plurality of characters of the specific language.

A beam search apparatus according to one aspect includes: at least one processor configured to obtain a plurality of candidate word sequences corresponding to at least one previous frame of a speech signal and probabilities of the candidate word sequences, to repeatedly expand the plurality of candidate word sequences based on a current frame of the speech signal, and to select, from among the repeatedly expanded candidate word sequences, a predetermined number of candidate word sequences having high probabilities; and a memory configured to store the candidate word sequences and the probabilities of the candidate word sequences.

In the repeatedly expanding, the processor generates a plurality of extended candidate word sequences, each having one of the plurality of candidate word sequences as a prefix, obtains probabilities of the extended candidate word sequences based on the current frame of the speech signal, merges the probability of at least one extended candidate word sequence that duplicates a candidate word sequence into the probability of the corresponding candidate word sequence, and adds, to the candidate word sequences, the remaining extended candidate word sequences that do not duplicate any candidate word sequence.

In selecting the predetermined number of candidate word sequences, the processor may obtain final probabilities of the repeatedly expanded candidate word sequences by reflecting a probability of a blank label in the probabilities of the repeatedly expanded candidate word sequences, and may select, from among the repeatedly expanded candidate word sequences, a predetermined number of candidate word sequences having high final probabilities as the candidate word sequences corresponding to the current frame.

In the merging, the processor may determine, as the probability of the corresponding candidate word sequence, a value obtained by adding the probability of the at least one extended candidate word sequence that duplicates the candidate word sequence to the probability of the corresponding candidate word sequence.

Alternatively, in the merging, the processor may determine, as the probability of the corresponding candidate word sequence, the largest among the probability of the at least one extended candidate word sequence that duplicates the candidate word sequence and the probability of the corresponding candidate word sequence.

The processor may obtain a first feature vector corresponding to the current frame of the speech signal by applying the speech signal, frame by frame, to a transcription network for searching for a word corresponding to a frame of a speech signal, may obtain a second feature vector corresponding to a next word of a candidate word sequence by applying the candidate word sequence corresponding to the previous frame of the speech signal to a prediction network for searching for a next word of a word sequence, and may obtain, based on the first feature vector and the second feature vector, probabilities that the current frame of the speech signal corresponds to each of a plurality of candidate words.

In generating the extended candidate word sequences, the processor may generate the extended candidate word sequences by concatenating each of the candidate words to each of the candidate word sequences.

In obtaining the probabilities of the extended candidate word sequences, the processor may obtain, for each extended candidate word sequence, the probability of that extended candidate word sequence based on the probability of the candidate word sequence corresponding to it and the probability that the frame of the speech signal corresponds to the candidate word corresponding to it.

In obtaining the probabilities of corresponding to the plurality of candidate words, the processor may obtain a third feature vector by applying the first feature vector and the second feature vector to a joint network for searching for a word corresponding to a speech signal and a word sequence, and may obtain the probabilities that the current frame of the speech signal corresponds to each of the plurality of candidate words by inputting the third feature vector to a classifier.

In the repeatedly expanding, the processor may repeat the expansion a predetermined number of times.

By performing, concurrently with the hypothesis expansion operation, the operation of checking whether a hypothesis is a prefix and the operation of recomputing the probability of a hypothesis that is a prefix, the beam search process may be performed efficiently.

FIG. 1 is a flowchart of a beam search method according to an embodiment.
FIG. 2 is a flowchart of detailed operations of the step of repeatedly expanding a plurality of candidate word sequences according to an embodiment.
FIGS. 3A and 3B are diagrams illustrating a beam search process in which the number of beams is 3.
FIG. 4 is a diagram illustrating a process of generating a plurality of extended candidate word sequences, each having one of a plurality of candidate word sequences as a prefix, according to an embodiment.
FIG. 5 is a diagram illustrating a process of merging the probability of an extended candidate word sequence that duplicates a candidate word sequence into the probability of the corresponding candidate word sequence, according to an embodiment.
FIG. 6 is a diagram illustrating a process of adding, to the candidate word sequences, the remaining extended candidate word sequences that do not duplicate any candidate word sequence, according to an embodiment.
FIG. 7 is a diagram illustrating a process of expanding the candidate word sequences again, based on the candidate word sequences expanded by adding the remaining extended candidate word sequences, according to an embodiment.
FIG. 8 is an exemplary diagram of a configuration of an apparatus according to an embodiment.

Hereinafter, embodiments are described in detail with reference to the accompanying drawings. However, various changes may be made to the embodiments, and the scope of rights of the patent application is not restricted or limited by these embodiments. It should be understood that all changes, equivalents, and substitutes for the embodiments are included in the scope of rights.

The terms used in the embodiments are used for descriptive purposes only and should not be construed as limiting. A singular expression includes the plural unless the context clearly indicates otherwise. In this specification, terms such as "include" or "have" are intended to specify the presence of the features, numbers, steps, operations, components, parts, or combinations thereof described in the specification, and should be understood not to preclude the presence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Unless otherwise defined, all terms used herein, including technical or scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which the embodiments belong. Terms such as those defined in commonly used dictionaries should be interpreted as having a meaning consistent with their meaning in the context of the related art, and are not to be interpreted in an idealized or excessively formal sense unless explicitly so defined in the present application.

In addition, in the description with reference to the accompanying drawings, the same components are given the same reference numerals regardless of the figure number, and redundant descriptions thereof are omitted. In describing the embodiments, a detailed description of related known technology is omitted where it is determined that it would unnecessarily obscure the gist of the embodiments.

In addition, terms such as first, second, A, B, (a), and (b) may be used in describing the components of the embodiments. These terms serve only to distinguish one component from another and do not limit the nature, sequence, or order of the components. When a component is described as being "connected", "coupled", or "joined" to another component, it should be understood that the component may be directly connected or joined to the other component, but that yet another component may also be "connected", "coupled", or "joined" between the two components.

A component that has a function in common with a component included in one embodiment is described using the same name in other embodiments. Unless stated otherwise, a description given for one embodiment may also apply to other embodiments, and detailed descriptions are omitted to the extent that they overlap.

FIG. 1 is a flowchart of a beam search method according to an embodiment.

Beam search is a tree search technique using breadth-first search that repeatedly selects, at the current level of a tree, a predetermined number (hereinafter referred to as the number of beams) of nodes having high priority, and then selects that number of nodes from among the child nodes of the selected nodes. For example, FIGS. 3A and 3B illustrate a beam search process in which the number of beams is 3. Referring to FIG. 3A, when the level of the start node 301 is 0, the top three nodes by priority are selected from among the four level-1 nodes that are children of the start node 301, and three nodes are then selected by priority from among the level-2 nodes that are children of the selected nodes. Repeating, up to level 4, the process of selecting the number-of-beams nodes with high priority at each level of the tree yields the result of FIG. 3B. When beam search is used for speech recognition, each node of the tree may correspond to natural-language text, the priority may correspond to a ranking in descending order of a probability determined based on the probability that the speech signal is recognized as particular natural-language text and/or the statistical probability of a natural-language sequence, and the result of the beam search may correspond to the natural-language sequence recognized from the speech signal.
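The generic level-by-level procedure described above can be sketched in Python as follows. This is a minimal illustration, not the embodiment itself; the function name `beam_search`, its parameters, and the toy tree and scoring function are hypothetical.

```python
import heapq


def beam_search(root, children, score, beam_width, depth):
    """Keep the beam_width best nodes at each level, down to the given depth.

    children: function returning the child nodes of a node
    score:    function returning the priority of a node (higher is better)
    """
    level = [root]
    for _ in range(depth):
        # Collect the children of every node kept at the current level.
        next_nodes = [child for node in level for child in children(node)]
        if not next_nodes:
            break
        # Keep only the beam_width highest-priority children.
        level = heapq.nlargest(beam_width, next_nodes, key=score)
    return level


# Toy tree: each node is a string and has two children, node+"a" and node+"b".
# Toy priority: the number of "a" characters in the node.
result = beam_search(
    root="",
    children=lambda node: [node + "a", node + "b"],
    score=lambda node: node.count("a"),
    beam_width=3,
    depth=2,
)
```

With a beam width of 3, the four level-2 nodes are reduced to three, and the lowest-priority node `"bb"` is dropped.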

Referring to FIG. 1, a beam search method according to an embodiment may include obtaining a plurality of candidate word sequences corresponding to at least one previous frame of a speech signal and probabilities of the candidate word sequences (110), repeatedly expanding the plurality of candidate word sequences based on a current frame of the speech signal (120), and selecting, from among the repeatedly expanded candidate word sequences, a predetermined number of candidate word sequences having high probabilities (130).

A word sequence according to an embodiment may include a single word, and may include a sequence of words in which a plurality of words are arranged. A word according to an embodiment may mean a character string composed of at least one character, rather than a word in the dictionary sense. In other words, a word according to an embodiment may include a single character of a specific language and a combination of a plurality of characters of the specific language.

The beam search method according to an embodiment may be performed frame by frame on the speech signal, in order starting from the first frame of the speech signal. A frame of a speech signal is a segment obtained by dividing the speech signal in units of time. For example, when a short-time Fourier transform (STFT) is performed on a speech signal in the form of sound to convert it into two-dimensional information over time and frequency units, the time unit of the STFT may correspond to a frame.
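The division of a signal into time-unit frames can be sketched as follows. The sketch only shows the segmentation along the time axis and does not compute the Fourier transform; the frame length and hop length are hypothetical parameters that the text does not specify.

```python
def split_into_frames(samples, frame_length, hop_length):
    """Split a sampled signal into (possibly overlapping) fixed-length frames."""
    last_start = len(samples) - frame_length
    return [
        samples[start:start + frame_length]
        for start in range(0, last_start + 1, hop_length)
    ]


# Ten samples, frames of four samples, a new frame every two samples.
frames = split_into_frames(list(range(10)), frame_length=4, hop_length=2)
```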

The beam search method according to an embodiment may be performed by a server or an apparatus including at least one processor and a memory. More specifically, each step of the beam search method according to an embodiment may be performed by at least one processor of the server or by at least one processor of the apparatus.

In step 110 according to an embodiment, the candidate word sequences corresponding to the previous frame of the speech signal and their probabilities may refer to the candidate word sequences and probabilities obtained for the previous frame as a result of performing beam search on the previous frame. That is, they may refer to the number-of-beams candidate word sequences corresponding to the previous frame, and their probabilities, obtained as a result of performing beam search on the previous frame. The candidate word sequences corresponding to the current frame, obtained by performing the beam search including steps 110 to 130 on the current frame of the speech signal, may in turn be used in step 110, as the plurality of candidate word sequences corresponding to the previous frame and their probabilities, when beam search is performed on the frame following the current frame.
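The way the result for one frame becomes the input for the next frame can be sketched as a simple loop. Here `toy_step` is a hypothetical stand-in for the expansion of step 120 (it appends exactly one word per frame and multiplies probabilities, which is a simplification), and each frame is represented directly by a dict of word probabilities.

```python
import heapq


def toy_step(candidates, frame):
    """Hypothetical stand-in for step 120: append each word to each sequence."""
    expanded = {}
    for seq, p_seq in candidates.items():
        for word, p_word in frame.items():
            expanded[seq + (word,)] = p_seq * p_word
    return expanded


def decode(frames, step, initial, beam_width):
    """Run the per-frame search, feeding each frame's result to the next."""
    candidates = dict(initial)  # step 110 for the first frame
    for frame in frames:
        expanded = step(candidates, frame)  # step 120
        best = heapq.nlargest(beam_width, expanded.items(), key=lambda kv: kv[1])
        candidates = dict(best)  # step 130; reused by step 110 of the next frame
    return candidates


frames = [{"a": 0.7, "b": 0.3}, {"a": 0.4, "b": 0.6}]
result = decode(frames, toy_step, initial={(): 1.0}, beam_width=2)
```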

Each candidate word sequence according to an embodiment may be matched with its corresponding probability. For example, a first candidate word sequence among the candidate word sequences may be matched with a first probability corresponding to the first candidate word sequence. According to an embodiment, a candidate word sequence and its probability may be stored, matched with each other, in a memory accessible to the processor that performs step 110.

When the current frame is the first frame of the speech signal, no previous frame exists, and therefore no candidate word sequence corresponding to a previous frame may exist. That is, when the current frame is the first frame, step 110 may be omitted. Alternatively, when the current frame is the first frame, step 110 may include obtaining predetermined candidate word sequence(s) and probability(ies). For example, when the current frame is the first frame, step 110 may obtain a candidate word sequence consisting of a start token, and the probability of this start-token candidate word sequence may be predetermined to be 1.
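The first-frame case can be sketched as follows. The token string `"<s>"` and the function name `previous_candidates` are hypothetical; only the start token with probability 1 comes from the description above.

```python
START_TOKEN = "<s>"  # hypothetical spelling of the start token


def previous_candidates(frame_index, previous_result=None):
    """Return the input of step 110 for the frame at frame_index."""
    if frame_index == 0:
        # First frame: a single start-token sequence with probability 1.
        return {(START_TOKEN,): 1.0}
    # Later frames: reuse the candidates selected for the previous frame.
    return previous_result


first = previous_candidates(0)
later = previous_candidates(3, {(START_TOKEN, "a"): 0.5})
```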

Expanding the candidate word sequences in step 120 according to an embodiment may mean increasing the number of candidate word sequences by adding new candidate word sequence(s) related to the existing candidate word sequences. For example, when the candidate word sequences obtained in step 110 include a first candidate word sequence, the candidate word sequences may be expanded by adding a new second candidate word sequence related to the first candidate word sequence. According to an embodiment, a new candidate word sequence related to an existing candidate word sequence may be, for example, a candidate word sequence that has the existing candidate word sequence as a prefix.

The detailed operation of step 120 of iteratively expanding the plurality of candidate word sequences according to an embodiment is described with reference to FIG. 2.

Referring to FIG. 2, iteratively expanding the plurality of candidate word sequences may include: generating a plurality of extended candidate word sequences, each having one of the candidate word sequences as a prefix (step 210); obtaining probabilities of the extended candidate word sequences based on the current frame of the speech signal (step 220); merging the probability of at least one extended candidate word sequence that duplicates a candidate word sequence into the probability of that candidate word sequence (step 230); and adding the remaining extended candidate word sequences, which do not duplicate any candidate word sequence, to the candidate word sequences (step 240).

The candidate word sequences may be iteratively expanded by repeating steps 210 to 240 according to an embodiment a predetermined number of times. The number of repetitions of steps 210 to 240 may be determined based on the length of a frame of the speech signal and the length of a word corresponding to a frame of the speech signal. For example, when the frame of the speech signal is long relative to the word corresponding to the frame, the number of repetitions may be smaller, whereas when the word corresponding to the frame is long relative to the frame, the number of repetitions may be larger. In other words, step 120 according to an embodiment may include repeating steps 210 to 240 a predetermined number of times, and step 130 according to an embodiment may then be performed on the iteratively expanded candidate word sequences.

According to an embodiment, an extended candidate word sequence having a candidate word sequence as a prefix may be a word sequence that contains one or more additional words after one of the candidate word sequences. For example, extended candidate word sequences having the candidate word sequence "ab" as a prefix may include "aba", "abb", and "abaa".

In step 210 according to an embodiment, the extended candidate word sequences may be generated by concatenating each of a plurality of candidate words corresponding to the current frame of the speech signal to each of the candidate word sequences. In other words, step 210 according to an embodiment may include generating the extended candidate word sequences by concatenating each of the plurality of candidate words corresponding to the current frame to each of the candidate word sequences.

A first extended candidate word sequence, which is any one of the extended candidate word sequences according to an embodiment, may be a word sequence generated by concatenating a first candidate word, which is any one of the candidate words, to a first candidate word sequence, which is any one of the candidate word sequences. In this case, in step 220 according to an embodiment, the probability of the first extended candidate word sequence may be obtained based on the probability of the first candidate word sequence and the probability that the frame of the speech signal corresponds to the first candidate word.

The plurality of candidate words corresponding to the current frame of the speech signal according to an embodiment, and the probabilities that the current frame corresponds to those candidate words, may be obtained from a speech recognition model. For example, the probabilities that the current frame of the speech signal corresponds to the candidate words may be obtained from a transducer-based speech recognition model.

A transducer-based speech recognition model may be a model that outputs a word sequence corresponding to a speech signal based on the speech signal and on the word sequence already found for the speech signal. More specifically, a transducer-based speech recognition model may process the speech signal frame by frame and estimate the word sequence corresponding to the current frame based on a feature vector corresponding to the current frame of the speech signal and a feature vector corresponding to the word following the word estimated for the previous frame. According to an embodiment, the word sequence corresponding to the current frame of the speech signal may be determined based on classification probabilities for the candidate words corresponding to the current frame.

The transducer-based speech recognition model according to an embodiment may include a transcription network, a prediction network, and a joint network, and may output classification probabilities of the word sequence corresponding to the speech signal based on the feature vectors output from these networks. The speech recognition model according to an embodiment may include a classifier, composed of a softmax function or the like, for outputting the probability of a word sequence.

The transcription network according to an embodiment may output a feature vector for finding the word corresponding to a frame of the speech signal, and may be configured as a neural network. The prediction network according to an embodiment may output a feature vector for finding the next word of a word sequence, and may be configured as a neural network. The joint network according to an embodiment may be a network that combines the feature vector output from the transcription network and the feature vector output from the prediction network, and may be configured as a neural network. Hereinafter, the feature vector output from the transcription network may be referred to as a first feature vector, the feature vector output from the prediction network as a second feature vector, and the feature vector output from the joint network as a third feature vector.
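As an illustration of the classifier stage only, the following sketch turns per-class scores into classification probabilities with a softmax. It assumes the scores have already been derived from the third feature vector; the function names `softmax` and `classify` and the example labels are hypothetical, and the networks themselves are not modeled.

```python
import math
from typing import Dict, List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(scores: List[float], labels: List[str]) -> Dict[str, float]:
    """Map each class label (candidate word) to its classification probability."""
    return dict(zip(labels, softmax(scores)))
```

The resulting mapping plays the role of the per-frame candidate-word probabilities consumed by the beam search.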

The transducer-based speech recognition model according to an embodiment may be trained to receive a speech signal and the word sequence already found for the speech signal, and to output a text label corresponding to the speech signal. The transcription network according to an embodiment may include a neural network trained, on training data containing speech data and text labels corresponding to the speech data, to receive speech data and output the corresponding text label. The prediction network according to an embodiment may include a neural network trained, on training data from a text corpus containing word sequences of a specific language, to receive a word sequence of that language and predict the next word. The neural networks according to an embodiment may have any of various structures, such as an RNN or an LSTM.

The plurality of candidate words corresponding to the current frame may be determined based on the text labels contained in the training data used to train the speech recognition model. That is, the classes of the classifier, and hence the candidate words corresponding to each frame of the speech signal, may be determined by the kinds of text labels contained in the training data.

The beam search method according to an embodiment may further include: obtaining a feature vector corresponding to each frame of the speech signal by applying the speech signal, frame by frame, to an encoder for finding the word sequence corresponding to the speech signal; and obtaining the probabilities that each frame of the speech signal corresponds to the plurality of candidate words by inputting the feature vector to a classifier. Step 210 of obtaining the extended candidate word sequences according to an embodiment may include obtaining the extended candidate word sequences by concatenating each of the candidate words corresponding to the current frame to each of the candidate word sequences. In addition, step 220 of obtaining the probabilities of the extended candidate word sequences may include obtaining, for each extended candidate word sequence, its probability based on the probability of the candidate word sequence from which it was formed and the probability that the frame of the speech signal corresponds to the candidate word that was appended.

The extended candidate word sequences according to an embodiment may include every combination of a candidate word sequence and a candidate word corresponding to the current frame. The probability of an extended candidate word sequence according to an embodiment may be determined based on the probability of the candidate word sequence that forms it and the probability of the candidate word corresponding to the current frame. The probability of a candidate word corresponding to the current frame may be the probability, given by the classification probability output from the classifier of the speech recognition model, that the current frame corresponds to that candidate word. For example, when a first extended candidate word sequence has a first candidate word sequence as a prefix, that is, when it is generated by concatenating a first candidate word to the first candidate word sequence, the probability of the first extended candidate word sequence may be determined as the product of the probability of the first candidate word sequence and the probability of the first candidate word.
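Steps 210 and 220 as described above can be sketched as follows: every candidate word sequence is combined with every candidate word, and the probabilities are multiplied. The function name `extend_with_probs` and the tuple-of-tokens representation are assumptions for illustration; the sketch also assumes one fixed word distribution per frame, whereas a transducer model may condition it on the prefix.

```python
from typing import Dict, Tuple

Seq = Tuple[str, ...]
Beam = Dict[Seq, float]


def extend_with_probs(beam: Beam, word_probs: Dict[str, float]) -> Beam:
    """Build all extended candidates (step 210) with their probabilities (step 220)."""
    extended: Beam = {}
    for seq, p_seq in beam.items():
        for word, p_word in word_probs.items():
            # Extended sequence = candidate sequence as prefix + one candidate word.
            extended[seq + (word,)] = p_seq * p_word
    return extended


# Example: the candidate "ba" extended with the candidate words "a" and "c".
example = extend_with_probs({("b", "a"): 0.5}, {"a": 0.5, "c": 0.25})
```

In practice log-probabilities are often summed instead of multiplying raw probabilities, to avoid underflow on long utterances; the specification itself describes the product form.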

According to an embodiment, as with the candidate word sequences, each extended candidate word sequence may be matched to its probability, and the extended candidate word sequence and its probability may be stored, matched to each other, in a memory accessible to the processor that performs steps 210 and 220.

A specific example of step 210 of generating the extended candidate word sequences and step 220 of obtaining their probabilities according to an embodiment is described below with reference to FIG. 4.

In the merging step 230 according to an embodiment, the probability of at least one extended candidate word sequence that duplicates a candidate word sequence is merged into the probability of that candidate word sequence, so that a single probability value is maintained for each distinct word sequence. According to an embodiment, when the probability of a particular extended candidate word sequence has been merged into the probability of the candidate word sequence that is the same word sequence, the merged extended candidate word sequence and its probability are not stored in the memory, and if they have already been stored, they may be deleted from the memory.

The merging step 230 according to an embodiment may include determining, as the probability of a candidate word sequence, the sum of the probability of that candidate word sequence and the probability of at least one extended candidate word sequence that duplicates it. For example, when a first extended candidate word sequence, which has a first candidate word sequence as a prefix, is identical to a second candidate word sequence, the sum of the probability of the first extended candidate word sequence and the probability of the second candidate word sequence may be determined as the probability of the second candidate word sequence. That is, the probability of the first extended candidate word sequence is merged into the probability of the second candidate word sequence by adding the two, so that a single probability value is maintained for the same word sequence. According to an embodiment, the merged first extended candidate word sequence and its probability are not stored in the memory, and if they have already been stored, they may be deleted from the memory.

Alternatively, the merging step 230 according to an embodiment may include determining, as the probability of a candidate word sequence, the largest among the probability of that candidate word sequence and the probability of at least one extended candidate word sequence that duplicates it. For example, when a first extended candidate word sequence, which has a first candidate word sequence as a prefix, is identical to a second candidate word sequence, the larger of the probability of the first extended candidate word sequence and the probability of the second candidate word sequence may be determined as the probability of the second candidate word sequence. That is, the probability of the first extended candidate word sequence is merged into the probability of the second candidate word sequence by taking the larger of the two, so that a single probability value is maintained for the same word sequence. According to an embodiment, the merged first extended candidate word sequence and its probability are not stored in the memory, and if they have already been stored, they may be deleted from the memory.
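The two merge variants of step 230 (sum and maximum) can be sketched with a single helper that takes the combining rule as a parameter. The name `merge_duplicates` and its signature are assumptions for illustration, not part of the specification.

```python
import operator
from typing import Callable, Dict, Tuple

Seq = Tuple[str, ...]
Beam = Dict[Seq, float]


def merge_duplicates(
    beam: Beam,
    extended: Beam,
    combine: Callable[[float, float], float],
) -> Beam:
    """Merge each extended candidate that duplicates a beam entry into that entry.

    Only the existing beam entry is kept; the duplicate extended candidate
    is not stored separately, so each word sequence keeps one probability.
    """
    merged = dict(beam)
    for seq, p in extended.items():
        if seq in merged:
            merged[seq] = combine(merged[seq], p)
    return merged


beam = {("b", "a"): 0.25, ("a", "b"): 0.5}
extended = {("b", "a"): 0.125, ("b", "c"): 0.0625}

by_sum = merge_duplicates(beam, extended, operator.add)  # sum variant
by_max = merge_duplicates(beam, extended, max)           # maximum variant
```

Summing accumulates the probability of every path that yields the same word sequence, while taking the maximum keeps only the best single path.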

A specific example of the merging step 230 according to an embodiment is described below with reference to FIG. 5.

Step 240 of adding the remaining extended candidate word sequences to the candidate word sequences according to an embodiment may be a step of increasing the number of candidate word sequences by adding at least one extended candidate word sequence whose word sequence does not duplicate any candidate word sequence. The plurality of candidate word sequences corresponding to the current frame according to an embodiment may be at least some of the candidate word sequences obtained after the remaining extended candidate word sequences have been added.
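Step 240 can be sketched as a filter followed by a union. The name `add_remaining` is hypothetical; the function also returns the newly added sequences separately, on the assumption that a later pass may want to expand only those.

```python
from typing import Dict, Tuple

Seq = Tuple[str, ...]
Beam = Dict[Seq, float]


def add_remaining(beam: Beam, extended: Beam) -> Tuple[Beam, Beam]:
    """Add the extended candidates that duplicate no existing candidate (step 240).

    Returns the enlarged candidate set and, separately, the newly added entries.
    """
    remaining = {seq: p for seq, p in extended.items() if seq not in beam}
    enlarged = {**beam, **remaining}
    return enlarged, remaining


enlarged, newly_added = add_remaining(
    {("b", "a"): 0.25},
    {("b", "a"): 0.125, ("b", "a", "c"): 0.0625},
)
```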

A specific example of step 240 of adding the remaining extended candidate word sequences to the candidate word sequences according to an embodiment is described below with reference to FIG. 6.

Referring again to FIG. 1, step 130 according to an embodiment may be a step of selecting a beam-size number of candidate word sequences, in descending order of probability, from among the iteratively expanded candidate word sequences. According to an embodiment, the selected candidate word sequences may be the candidate word sequences obtained in step 110 of the beam search process for the next frame.

Step 130 of selecting a predetermined number of candidate word sequences according to an embodiment may include: obtaining final probabilities of the iteratively expanded candidate word sequences by reflecting the probability of a blank label in their probabilities; and selecting, as the candidate word sequences corresponding to the current frame, a predetermined number of candidate word sequences having the highest final probabilities from among the iteratively expanded candidate word sequences.

The blank label according to an embodiment means that the information of the corresponding frame has been fully consumed, and may indicate the end of the speech recognition result for a particular frame. When a blank label is output for a particular frame, the recognition process for that frame ends and may proceed to the recognition process for the next frame. In other words, a candidate word sequence corresponding to a particular frame ends with a blank label, and the probability of that candidate word sequence may be a probability determined by finally reflecting the probability of the blank label.
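A minimal sketch of step 130 follows. The specification says only that the blank-label probability is "reflected"; multiplying by it is an assumption made here, as is supplying one blank probability per candidate word sequence. The name `select_beam` is hypothetical.

```python
import heapq
from typing import Dict, Tuple

Seq = Tuple[str, ...]
Beam = Dict[Seq, float]


def select_beam(expanded: Beam, p_blank: Dict[Seq, float], beam_size: int) -> Beam:
    """Apply the blank-label probability, then keep the beam_size best candidates."""
    # Assumed rule: final probability = sequence probability * blank probability.
    final = {seq: p * p_blank[seq] for seq, p in expanded.items()}
    top = heapq.nlargest(beam_size, final.items(), key=lambda item: item[1])
    return dict(top)


selected = select_beam(
    {("a",): 0.5, ("b",): 0.25, ("a", "b"): 0.125},
    {("a",): 0.125, ("b",): 1.0, ("a", "b"): 1.0},
    beam_size=2,
)
```

In this example the sequence "a" has the highest probability before the blank label is applied but is dropped afterwards, which shows why the final probabilities, not the intermediate ones, drive the selection.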

Step 130 of selecting a predetermined number of candidate word sequences according to an embodiment may include, when the current frame is the last frame of the speech signal, outputting at least some of the iteratively expanded candidate word sequences based on their probabilities. According to an embodiment, at least one word sequence may be output, in descending order of probability, from among the candidate word sequences corresponding to the speech signal. The output word sequence(s) according to an embodiment may be the text data corresponding to the speech signal, obtained as the speech recognition result for the input speech signal.

FIGS. 4 to 7 are diagrams illustrating, with specific examples, the process of iteratively expanding a plurality of candidate word sequences according to an embodiment.

More specifically, FIG. 4 illustrates the process of generating a plurality of extended candidate word sequences, each having one of the candidate word sequences as a prefix; FIG. 5 illustrates the process of merging the probability of an extended candidate word sequence that duplicates a candidate word sequence into the probability of that candidate word sequence; and FIG. 6 illustrates the process of adding the remaining extended candidate word sequences, which do not duplicate any candidate word sequence, to the candidate word sequences. FIG. 7 illustrates the process of expanding the candidate word sequences again, based on the candidate word sequences to which the remaining extended candidate word sequences have been added.

Referring to FIG. 4, the word sequences 410 are an example of a plurality of candidate word sequences corresponding to the previous frame of the speech signal, and the word sequences 420 are an example of extended candidate word sequences, each having one of the candidate word sequences as a prefix. Among the candidate word sequences 410, the candidate word sequence "ba" may have the extended candidate word sequences "baa" and "bac". In this case, the extended candidate word sequence "baa" may be the word sequence generated by concatenating the candidate word "a" to the candidate word sequence "ba". The probability of the extended candidate word sequence "baa" may be obtained by multiplying the probability of the candidate word sequence "ba" by the probability, given by the classification probability, that the current frame corresponds to the candidate word "a".

Referring to FIG. 5, among the extended candidate word sequences, the extended candidate word sequence "ba" 521 duplicates the candidate word sequence 511, and the extended candidate word sequence "ab" 522 duplicates the candidate word sequence 512. In step 230 according to an embodiment, the probability of the extended candidate word sequence 521 may be merged into the probability of the candidate word sequence 511, and the probability of the extended candidate word sequence 522 may be merged into the probability of the candidate word sequence 512. As described above, the merge may be performed by adding the probability of the duplicate extended candidate word sequence to the probability of the candidate word sequence, or by selecting one of the two probabilities. For example, the sum of the probability of the candidate word sequence "ba" 511 and the probability of the extended candidate word sequence 521 may be determined as the probability of the candidate word sequence "ba" 511. Alternatively, the larger of the probability of the candidate word sequence "ba" 511 and the probability of the extended candidate word sequence 521 may be determined as the probability of the candidate word sequence "ba" 511.

Referring to FIG. 6, the word sequences 610 are the remaining extended candidate word sequences of FIG. 5, that is, those that do not duplicate any candidate word sequence. The candidate word sequences may be expanded by adding the remaining extended candidate word sequences 610 to them.

The expanded candidate word sequences shown in FIG. 6 may be expanded again according to steps 210 to 240 of FIG. 2.

Referring to FIG. 7, a plurality of extended candidate word sequences having the expanded candidate word sequences of FIG. 6 as prefixes may be generated. According to an embodiment, the candidate word sequences added while expanding the candidate word sequences may themselves be expanded according to steps 210 to 240 of FIG. 2 to generate further expanded candidate word sequences. In FIG. 7, the candidate word sequences 710 have already been expanded and are therefore not expanded again, whereas the newly added candidate word sequences 720 may be expanded.

According to an embodiment, the candidate word sequences may be iteratively expanded a predetermined number of times. Among the plurality of candidate word sequences for which expansion has been completed, a beam-size number of candidate word sequences with the highest probabilities may be the candidate word sequences corresponding to the current frame, and may be used to perform the beam search on the next frame.
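The per-frame procedure walked through in FIGS. 4 to 7 can be put together as the following sketch. It is an illustration under stated assumptions, not the claimed implementation: it uses the sum variant of the merge, one fixed candidate-word distribution per frame, and no blank-label handling; the name `beam_search_frame` and its parameters are hypothetical.

```python
import heapq
from typing import Dict, Tuple

Seq = Tuple[str, ...]
Beam = Dict[Seq, float]


def beam_search_frame(
    prev_beam: Beam,
    word_probs: Dict[str, float],
    num_iters: int,
    beam_size: int,
) -> Beam:
    """Expand the previous beam num_iters times, then keep the best beam_size entries."""
    candidates = dict(prev_beam)
    frontier = dict(prev_beam)  # sequences that have not been expanded yet
    for _ in range(num_iters):
        added: Beam = {}
        for seq, p_seq in frontier.items():
            for word, p_word in word_probs.items():
                ext = seq + (word,)              # step 210: prefix + candidate word
                p_ext = p_seq * p_word           # step 220: product of probabilities
                if ext in candidates:
                    candidates[ext] += p_ext     # step 230: merge (sum variant)
                else:
                    added[ext] = p_ext           # step 240: new, non-duplicate entry
        candidates.update(added)
        frontier = added  # only newly added sequences are expanded next time
    best = heapq.nlargest(beam_size, candidates.items(), key=lambda item: item[1])
    return dict(best)


result = beam_search_frame(
    {("b",): 0.5, ("b", "a"): 0.125},
    {"a": 0.5, "c": 0.25},
    num_iters=1,
    beam_size=3,
)
```

In the example, extending "b" with "a" reproduces the existing candidate "ba", so its probability 0.25 is merged into the stored 0.125, while "bc", "baa", and "bac" are added as new candidates.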

FIG. 8 is a diagram illustrating an example configuration of an apparatus according to an embodiment.

Referring to FIG. 8, an apparatus 901 includes a processor 902 and a memory 903.

The apparatus 901 according to an embodiment may be an apparatus that performs the beam search method described above. The processor 902 may perform at least one of the methods described above with reference to FIGS. 1 and 2. The memory 903 may store information related to the beam search method (for example, candidate word sequences and their probabilities) or a program implementing the beam search method. The memory 903 may be a volatile memory or a non-volatile memory.

The processor 902 may execute a program and control the apparatus 901. The code of the program executed by the processor 902 may be stored in the memory 903. The apparatus 901 may be connected to an external device (for example, a personal computer or a network) through an input/output device (not shown) and may exchange data with it.

In addition, the apparatus 901 according to an embodiment may further include the encoder and the classifier for speech recognition described above, and may correspond to a speech recognition apparatus that receives a speech signal and outputs text data as the speech recognition result. The processor 902 may also perform a speech recognition process that includes, according to the speech recognition system described above, extracting feature vectors, outputting classification probabilities based on the feature vectors, and performing the beam search.

The method according to an embodiment may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the embodiment, or may be known and available to those skilled in the art of computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tapes; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler, but also high-level language code that a computer can execute using an interpreter or the like. The hardware devices described above may be configured to operate as one or more software modules in order to perform the operations of the embodiment, and vice versa.

Software may include a computer program, code, instructions, or a combination of one or more of these, and may configure a processing device to operate as desired or may instruct the processing device independently or collectively. Software and/or data may be embodied, permanently or temporarily, in any type of machine, component, physical device, virtual equipment, computer storage medium or device, or transmitted signal wave, so as to be interpreted by the processing device or to provide instructions or data to the processing device. Software may be distributed over networked computer systems and stored or executed in a distributed manner. Software and data may be stored on one or more computer-readable recording media.

Although the embodiments have been described above with reference to a limited set of drawings, a person of ordinary skill in the art may apply various technical modifications and variations based on the above. For example, appropriate results may be achieved even if the described techniques are performed in an order different from the described method, and/or the described components of a system, structure, apparatus, circuit, and the like are coupled or combined in a form different from the described method, or are replaced or substituted by other components or equivalents.

Therefore, other implementations, other embodiments, and equivalents of the claims also fall within the scope of the following claims.

Claims

1. A beam search method comprising:
obtaining a plurality of candidate word sequences corresponding to at least one previous frame of a speech signal and probabilities of the candidate word sequences;
repeatedly expanding the plurality of candidate word sequences based on a current frame of the speech signal; and
selecting a predetermined number of candidate word sequences having high probabilities from among the repeatedly expanded candidate word sequences,
wherein the repeatedly expanding comprises:
generating a plurality of extended candidate word sequences using each of the plurality of candidate word sequences as a prefix;
obtaining probabilities of the extended candidate word sequences based on the current frame of the speech signal;
merging the probability of at least one extended candidate word sequence that overlaps one of the candidate word sequences, among the extended candidate word sequences, with the probability of the overlapping candidate word sequence;
updating the candidate word sequences by adding the remaining extended candidate word sequences that do not overlap the candidate word sequences among the extended candidate word sequences; and
repeating, based on the updated candidate word sequences, the generating of the extended candidate word sequences, the obtaining of the probabilities of the extended candidate word sequences, the merging, and the updating of the candidate word sequences.

2. The method of claim 1, wherein the selecting of the predetermined number of candidate word sequences comprises:
obtaining final probabilities of the repeatedly expanded candidate word sequences by reflecting a probability of a blank label in the probabilities of the repeatedly expanded candidate word sequences; and
selecting, from among the repeatedly expanded candidate word sequences, a predetermined number of candidate word sequences having high final probabilities as candidate word sequences corresponding to the current frame.

3. The method of claim 1, wherein the merging comprises:
determining, as the probability of the corresponding candidate word sequence, a value obtained by adding the probability of the at least one extended candidate word sequence overlapping the candidate word sequence to the probability of the corresponding candidate word sequence.

4. The method of claim 1, wherein the merging comprises:
determining, as the probability of the corresponding candidate word sequence, the highest probability among the probability of the at least one extended candidate word sequence overlapping the candidate word sequence and the probability of the corresponding candidate word sequence.

5. The method of claim 1, wherein the extended candidate word sequences are generated by connecting each of a plurality of candidate words corresponding to the current frame of the speech signal to each of the candidate word sequences, and
wherein the probability of a first extended candidate word sequence, generated by connecting a first candidate word among the candidate words to a first candidate word sequence among the candidate word sequences, is obtained based on the probability of the first candidate word sequence and the probability that the frame of the speech signal corresponds to the first candidate word.

6. The method of claim 1, further comprising:
obtaining a first feature vector corresponding to the current frame of the speech signal by applying the speech signal, frame by frame, to a transcription network for searching for a word corresponding to a frame of the speech signal;
obtaining a second feature vector corresponding to a next word of the candidate word sequence by applying a candidate word sequence corresponding to a previous frame of the speech signal to a prediction network for searching for a next word of a word sequence; and
obtaining, based on the first feature vector and the second feature vector, probabilities that the current frame of the speech signal corresponds to a plurality of candidate words,
wherein the generating of the extended candidate word sequences comprises generating the extended candidate word sequences by connecting each of the candidate words to each of the candidate word sequences, and
wherein the obtaining of the probabilities of the extended candidate word sequences comprises obtaining, for each of the extended candidate word sequences, the probability of the extended candidate word sequence based on the probability of the candidate word sequence corresponding to the extended candidate word sequence and the probability that the frame of the speech signal corresponds to the candidate word corresponding to the extended candidate word sequence.

7. The method of claim 6, wherein the obtaining of the probabilities corresponding to the plurality of candidate words comprises:
obtaining a third feature vector by applying the first feature vector and the second feature vector to a connection network for searching for a word corresponding to a speech signal and a word sequence; and
obtaining the probabilities that the current frame of the speech signal corresponds to the plurality of candidate words by inputting the third feature vector into a classifier.

8. The method of claim 6, wherein the transcription network comprises a neural network trained, based on training data including speech data and text labels corresponding to the speech data, to receive the speech data and output the text labels.

9. The method of claim 6, wherein the prediction network comprises a neural network trained, based on training data of a text corpus including word sequences of a specific language, to receive a word sequence of the specific language as input and predict the next word.

10. The method of claim 1, wherein the repeatedly expanding comprises repeatedly expanding a predetermined number of times, and
wherein the predetermined number of times is determined based on the length of a frame of the speech signal and the length of a word corresponding to the frame of the speech signal.

11. The method of claim 1, wherein the selecting of the predetermined number of candidate word sequences comprises:
when the current frame is the last frame of the speech signal, outputting at least some of the repeatedly expanded candidate word sequences based on the probabilities of the repeatedly expanded candidate word sequences.

12. The method of claim 1, wherein the word sequence comprises a single word and a sequence of a plurality of words, and
wherein the word comprises a combination of one character corresponding to a specific language and a plurality of characters corresponding to the specific language.

13. A computer program stored on a medium, in combination with hardware, to execute the method of any one of claims 1 to 12.

14. A beam search device comprising:
at least one processor configured to obtain a plurality of candidate word sequences corresponding to at least one previous frame of a speech signal and probabilities of the candidate word sequences, repeatedly expand the plurality of candidate word sequences based on a current frame of the speech signal, and select a predetermined number of candidate word sequences having high probabilities from among the repeatedly expanded candidate word sequences; and
a memory configured to store the probabilities of the candidate word sequences,
wherein, in the repeatedly expanding, the processor is configured to:
generate a plurality of extended candidate word sequences using each of the plurality of candidate word sequences as a prefix;
obtain probabilities of the extended candidate word sequences based on the current frame of the speech signal;
merge the probability of at least one extended candidate word sequence that overlaps one of the candidate word sequences, among the extended candidate word sequences, with the probability of the overlapping candidate word sequence;
update the candidate word sequences by adding the remaining extended candidate word sequences that do not overlap the candidate word sequences among the extended candidate word sequences; and
repeat, based on the updated candidate word sequences, the generating of the extended candidate word sequences, the obtaining of the probabilities of the extended candidate word sequences, the merging, and the updating of the candidate word sequences.

15. The device of claim 14, wherein, in selecting the predetermined number of candidate word sequences, the processor is configured to:
obtain final probabilities of the repeatedly expanded candidate word sequences by reflecting a probability of a blank label in the probabilities of the repeatedly expanded candidate word sequences; and
select, from among the repeatedly expanded candidate word sequences, a predetermined number of candidate word sequences having high final probabilities as candidate word sequences corresponding to the current frame.

16. The device of claim 14, wherein, in the merging, the processor is configured to determine, as the probability of the corresponding candidate word sequence, a value obtained by adding the probability of the at least one extended candidate word sequence overlapping the candidate word sequence to the probability of the corresponding candidate word sequence.

17. The device of claim 14, wherein, in the merging, the processor is configured to determine, as the probability of the corresponding candidate word sequence, the highest probability among the probability of the at least one extended candidate word sequence overlapping the candidate word sequence and the probability of the corresponding candidate word sequence.

18. The device of claim 14, wherein the processor is configured to:
obtain a first feature vector corresponding to the current frame of the speech signal by applying the speech signal, frame by frame, to a transcription network for searching for a word corresponding to a frame of the speech signal;
obtain a second feature vector corresponding to a next word of the candidate word sequence by applying a candidate word sequence corresponding to a previous frame of the speech signal to a prediction network for searching for a next word of a word sequence;
obtain, based on the first feature vector and the second feature vector, probabilities that the current frame of the speech signal corresponds to a plurality of candidate words;
in generating the extended candidate word sequences, generate the extended candidate word sequences by connecting each of the candidate words to each of the candidate word sequences; and
in obtaining the probabilities of the extended candidate word sequences, obtain, for each of the extended candidate word sequences, the probability of the extended candidate word sequence based on the probability of the candidate word sequence corresponding to the extended candidate word sequence and the probability that the frame of the speech signal corresponds to the candidate word corresponding to the extended candidate word sequence.

19. The device of claim 18, wherein, in obtaining the probabilities corresponding to the plurality of candidate words, the processor is configured to:
obtain a third feature vector by applying the first feature vector and the second feature vector to a connection network for searching for a word corresponding to a speech signal and a word sequence; and
obtain the probabilities that the current frame of the speech signal corresponds to the plurality of candidate words by inputting the third feature vector into a classifier.

20. The device of claim 14, wherein, in the repeatedly expanding, the processor is configured to repeatedly expand a predetermined number of times, and
wherein the predetermined number of times is determined based on the length of a frame of the speech signal and the length of a word corresponding to the frame of the speech signal.