KR20220154422A

KR20220154422A - Method and apparatus for speech recognition

Info

Publication number: KR20220154422A
Application number: KR1020210061950A
Authority: KR
Inventors: 김준태; 김의성; 김도현; 박진우; 이윤한
Original assignee: 주식회사 카카오; 주식회사 카카오엔터프라이즈
Priority date: 2021-05-13
Filing date: 2021-05-13
Publication date: 2022-11-22

Abstract

Disclosed are a speech recognition method and a speech recognition apparatus that include a decoding process based on a beam search algorithm. According to an embodiment, the speech recognition method comprises the steps of: obtaining candidate word sequences included in a beam corresponding to a target frame of a speech signal and the probabilities of the candidate word sequences; expanding the candidate word sequences based on the target frame and the probabilities of the candidate word sequences; determining whether the expanded candidate word sequences are included in a predetermined number of first upper word sequences, taken in descending order of probability from a preliminary candidate group that includes the candidate word sequences and the expanded candidate word sequences; updating the preliminary candidate group, based on the determination result, by additionally expanding at least one expanded candidate word sequence included in the first upper word sequences; and determining a predetermined number of second upper word sequences, taken in descending order of probability from the preliminary candidate group, as the beam corresponding to the frame next to the target frame. A beam search method is thereby provided that reduces the computation time and the amount of computation required for expanded search.

Description

Voice recognition method and apparatus {METHOD AND APPARATUS FOR SPEECH RECOGNITION}

The following embodiments relate to a speech recognition method and apparatus, and more specifically to a speech recognition method and a speech recognition apparatus that include a decoding process based on a beam search algorithm.

Speech recognition technology converts a speech signal produced by an utterance into text data for processing, and is also referred to as speech-to-text (STT). Recent advances in deep learning have led to end-to-end frameworks that map a speech signal directly to a character sequence. The proposed end-to-end frameworks include encoder-decoder networks, the recurrent neural aligner, and the recurrent neural network transducer (RNN-T).

In particular, speech recognition based on the RNN-T shows considerable accuracy and is being actively researched. The RNN-T uses best-first search or breadth-first search as its decoding method, and techniques for improving the speed of beam-search-based decoding are needed.

An embodiment may provide a beam search method that reduces the computation time and the amount of computation required for expanded search.

An embodiment may provide an efficient beam-search-based decoding method for speech recognition that exploits the fact that the word sequence resulting from speech recognition is shorter than the encoded data of the speech signal, so that most frames of the speech signal correspond to the blank character during decoding.

However, the technical problems are not limited to those described above, and other technical problems may exist.

A speech recognition method performed by a processor according to one aspect includes: obtaining candidate word sequences included in a beam corresponding to a target frame of a speech signal, and probabilities that the candidate word sequences correspond to the speech signal; expanding the candidate word sequences based on the target frame and the probabilities of the candidate word sequences; determining whether at least one of the expanded candidate word sequences is included in a predetermined number of first upper word sequences, taken in descending order of probability from a preliminary candidate group that includes the candidate word sequences and the expanded candidate word sequences of the candidate word sequences; updating the preliminary candidate group, based on a result of the determination, by additionally expanding the at least one expanded candidate word sequence included in the first upper word sequences; and determining the predetermined number of second upper word sequences, taken in descending order of probability from the preliminary candidate group, as the beam corresponding to the frame next to the target frame.

The updating of the preliminary candidate group may include: when it is determined that the at least one expanded candidate word sequence is included in the first upper word sequences, additionally expanding the at least one expanded candidate word sequence based on the target frame and the probability of the at least one expanded candidate word sequence included in the first upper word sequences; and updating the preliminary candidate group by adding the additionally expanded candidate word sequence to the preliminary candidate group.

The determining of the beam corresponding to the next frame may include: selecting the predetermined number of second upper word sequences in descending order of probability from the updated preliminary candidate group; and determining the second upper word sequences as the beam corresponding to the next frame.

The updating of the preliminary candidate group may include not updating the preliminary candidate group when it is determined that the at least one expanded candidate word sequence is not included in the first upper word sequences.

The first upper word sequences and the second upper word sequences may be identical to each other.
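
The determination described in the preceding paragraphs can be sketched as follows. This is only an illustration under stated assumptions: the names `top_n` and `expanded_in_top`, the dictionary representation of the preliminary candidate group, and the toy probabilities are all hypothetical and do not come from the specification.

```python
from typing import Dict, List, Set, Tuple

WordSeq = Tuple[str, ...]


def top_n(pool: Dict[WordSeq, float], n: int) -> List[WordSeq]:
    """Return the n sequences with the highest probability in the pool."""
    return sorted(pool, key=lambda seq: pool[seq], reverse=True)[:n]


def expanded_in_top(
    pool: Dict[WordSeq, float], expanded: Set[WordSeq], n: int
) -> List[WordSeq]:
    """Return the expanded sequences that rank among the top n of the pool."""
    return [seq for seq in top_n(pool, n) if seq in expanded]


# Hypothetical preliminary candidate group: two candidates from the beam
# plus one expanded sequence for each of them (made-up probabilities).
pool = {("a",): 0.5, ("b",): 0.25, ("a", "c"): 0.375, ("b", "d"): 0.0625}
expanded = {("a", "c"), ("b", "d")}

# With n = 2 an expanded sequence reaches the top, so it would be expanded again.
to_expand_again = expanded_in_top(pool, expanded, n=2)
# With n = 1 no expanded sequence reaches the top, so the group stays as it is.
nothing_to_expand = expanded_in_top(pool, expanded, n=1)
```

When the returned list is empty, the preliminary candidate group is left unchanged, which is the case in which the first and second upper word sequences coincide.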

The expanding of the candidate word sequences may include, for each of the candidate word sequences: obtaining an expanded candidate word sequence of the candidate word sequence by appending a candidate word to the candidate word sequence; obtaining, based on the candidate word sequence, a probability that the target frame corresponds to the candidate word; and obtaining a probability of the expanded candidate word sequence based on the probability of the candidate word sequence and the probability that the target frame corresponds to the candidate word.
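
The expansion of a single candidate word sequence can be sketched as follows. This is a minimal illustration, not the claimed implementation: the function name `expand_candidate`, the `word_probs` mapping, and the use of a plain product to combine the two probabilities are assumptions made for the example. The text above only states that the probability of the expanded sequence is based on both values.

```python
from typing import Dict, List, Tuple

WordSeq = Tuple[str, ...]


def expand_candidate(
    seq: WordSeq,
    seq_prob: float,
    word_probs: Dict[str, float],
) -> List[Tuple[WordSeq, float]]:
    """Append each candidate word to `seq` and score the expanded sequence.

    `word_probs` is a hypothetical stand-in for the model output: the
    probability that the target frame corresponds to each candidate word,
    given `seq`.
    """
    expanded = []
    for word, word_prob in word_probs.items():
        # Assumption: the two probabilities are combined by multiplication.
        expanded.append((seq + (word,), seq_prob * word_prob))
    return expanded


# Toy usage with made-up probabilities.
result = expand_candidate(("<s>", "he"), 0.5, {"llo": 0.5, "lp": 0.25})
```

Every expanded sequence keeps the original candidate as its prefix, since a word is only ever appended to the end of the tuple.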

The speech recognition method may further include encoding the speech signal to obtain a plurality of frames including features extracted from the speech signal.

The expanding of the candidate word sequences may include obtaining the expanded candidate word sequences and the probabilities of the expanded candidate word sequences based on an end-to-end speech recognition model.

The end-to-end speech recognition model may include a neural network trained, on training data including speech data and text labels corresponding to the speech data, to receive the speech data as input and output the text labels.

When the target frame is the last frame of the speech signal, the determining of the beam corresponding to the next frame may further include outputting at least some of the second upper word sequences as a speech recognition result corresponding to the speech signal.

The expanded candidate word sequences of the candidate word sequences may include word sequences that have the candidate word sequences as prefixes.

A word sequence covers both a single word and a sequence of a plurality of words, and a word may be a single character of a specific language or a combination of a plurality of characters of that language.

A beam search method performed by a processor according to one aspect includes: adding candidate word sequences included in a beam corresponding to a target frame of encoded data to a preliminary candidate group; performing an expansion operation on the candidate word sequences; determining whether to repeat the expansion operation based on whether the word sequences selected by the expansion operation include an expanded word sequence of the candidate word sequences obtained by the expansion operation; performing the expansion operation on the selected word sequences when it is determined to repeat the expansion operation; and determining the selected word sequences as the beam corresponding to the next frame of the encoded data when it is determined not to repeat the expansion operation.

The expansion operation includes: adding expanded word sequences of the word sequences subject to the expansion operation to the preliminary candidate group; and selecting a predetermined number of word sequences from the preliminary candidate group according to a predetermined criterion.

The determining of whether to repeat the expansion operation may include: determining to repeat the expansion operation when the selected word sequences include the expanded word sequence; and determining not to repeat the expansion operation when the selected word sequences do not include the expanded word sequence.

The determining of whether to repeat the expansion operation may include determining whether to repeat the expansion operation based on whether a predetermined repetition termination condition is satisfied.
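
The repeated expansion operation for one frame can be sketched as follows. This is a hedged illustration, not the patented implementation. The names `beam_for_next_frame`, `score_words`, and `max_rounds` are hypothetical; multiplying probabilities, keeping the larger score for a duplicate sequence, re-expanding only the newly added sequences that were selected, and using a round limit as the termination condition are all assumptions. Blank-symbol handling of an actual RNN-T decoder is omitted.

```python
from typing import Callable, Dict, List, Tuple

WordSeq = Tuple[str, ...]
Scorer = Callable[[object, WordSeq], Dict[str, float]]


def beam_for_next_frame(
    beam: Dict[WordSeq, float],
    frame: object,
    score_words: Scorer,
    beam_size: int,
    max_rounds: int = 10,
) -> List[Tuple[WordSeq, float]]:
    """Repeat the expansion operation for one frame and return the next beam."""
    pool = dict(beam)     # preliminary candidate group
    targets = dict(beam)  # sequences to expand in the current round
    selected: List[Tuple[WordSeq, float]] = []
    for _ in range(max_rounds):  # assumed repetition termination condition
        new: Dict[WordSeq, float] = {}
        for seq, prob in targets.items():
            for word, word_prob in score_words(frame, seq).items():
                new[seq + (word,)] = prob * word_prob
        for seq, prob in new.items():
            # Assumption: keep the larger score if the sequence already exists.
            pool[seq] = max(pool.get(seq, 0.0), prob)
        ranked = sorted(pool.items(), key=lambda kv: kv[1], reverse=True)
        selected = ranked[:beam_size]
        # Repeat only if a newly expanded sequence made it into the selection.
        targets = {seq: prob for seq, prob in selected if seq in new}
        if not targets:
            break
    return selected


def toy_scorer(frame: object, seq: WordSeq) -> Dict[str, float]:
    """Made-up word probabilities that depend only on the sequence length."""
    if len(seq) == 1:
        return {"a": 0.6, "b": 0.3}
    if len(seq) == 2:
        return {"c": 0.9}
    return {"d": 0.01}


start = {("<s>",): 1.0}
narrow = beam_for_next_frame(start, None, toy_scorer, beam_size=2)
wide = beam_for_next_frame(start, None, toy_scorer, beam_size=3)
```

With a beam size of 2 the loop stops after the second round because the doubly expanded sequence does not reach the selection; with a beam size of 3 it is selected and triggers one more round.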

A speech recognition apparatus according to one aspect includes at least one processor configured to: obtain candidate word sequences included in a beam corresponding to a target frame of a speech signal, and probabilities that the candidate word sequences correspond to the speech signal; expand the candidate word sequences based on the target frame and the probabilities of the candidate word sequences; determine whether at least one of the expanded candidate word sequences is included in a predetermined number of first upper word sequences, taken in descending order of probability from a preliminary candidate group that includes the candidate word sequences and the expanded candidate word sequences of the candidate word sequences; update the preliminary candidate group, based on a result of the determination, by additionally expanding the at least one expanded candidate word sequence included in the first upper word sequences; and determine the predetermined number of second upper word sequences, taken in descending order of probability from the preliminary candidate group, as the beam corresponding to the frame next to the target frame.

In updating the preliminary candidate group, when it is determined that the at least one expanded candidate word sequence is included in the first upper word sequences, the processor may additionally expand the at least one expanded candidate word sequence based on the target frame and the probability of the at least one expanded candidate word sequence included in the first upper word sequences, and may update the preliminary candidate group by adding the additionally expanded candidate word sequence to the preliminary candidate group.

In determining the beam corresponding to the next frame, the processor may select the predetermined number of second upper word sequences in descending order of probability from the updated preliminary candidate group, and may determine the second upper word sequences as the beam corresponding to the next frame.

In updating the preliminary candidate group, the processor may leave the preliminary candidate group unchanged when it is determined that the at least one expanded candidate word sequence is not included in the first upper word sequences.

In expanding the candidate word sequences, the processor may, for each of the candidate word sequences, obtain an expanded candidate word sequence of the candidate word sequence by appending a candidate word to the candidate word sequence, obtain, based on the candidate word sequence, a probability that the target frame corresponds to the candidate word, and obtain a probability of the expanded candidate word sequence based on the probability of the candidate word sequence and the probability that the target frame corresponds to the candidate word.

A beam search apparatus according to one aspect includes at least one processor configured to: add candidate word sequences included in a beam corresponding to a target frame of encoded data to a preliminary candidate group; perform an expansion operation on the candidate word sequences; determine whether to repeat the expansion operation based on whether the upper word sequences selected by the expansion operation include an expanded word sequence of the candidate word sequences obtained by the expansion operation; perform the expansion operation on the selected word sequences when it is determined to repeat the expansion operation; and determine the selected word sequences as the beam corresponding to the next frame of the encoded data when it is determined not to repeat the expansion operation.

The expansion operation includes an operation of adding expanded word sequences of the word sequences subject to the expansion operation to the preliminary candidate group, and an operation of selecting a predetermined number of word sequences from the preliminary candidate group according to a predetermined criterion.

An embodiment may provide a beam search method with improved computation speed by processing the expanded search in parallel.

By adopting a process that determines whether to continue the expanded search during beam search, an embodiment may prevent unnecessary expanded searches while maintaining speech recognition accuracy, and may reduce the computational overhead that such unnecessary searches cause.

FIG. 1 is an operational flowchart of a speech recognition method according to an embodiment.
FIGS. 2A and 2B are diagrams for explaining the detailed operation of expanding candidate word sequences according to an embodiment.
FIG. 3 is a detailed operational flowchart of a speech recognition method according to an embodiment.
FIG. 4 is a diagram illustrating the structure of an RNN-T, which is an example of an end-to-end speech recognition model according to an embodiment.
FIG. 5 is an operational flowchart of a decoding process for speech recognition performed based on the RNN-T according to an embodiment.
FIG. 6 is an operational flowchart of a beam search method according to an embodiment.
FIG. 7 is an exemplary diagram of the configuration of an apparatus according to an embodiment.

Specific structural or functional descriptions of the embodiments are disclosed for illustrative purposes only, and the embodiments may be modified and implemented in various forms. Accordingly, actual implementations are not limited to the specific embodiments disclosed, and the scope of this specification includes changes, equivalents, or substitutes that fall within the technical idea described through the embodiments.

Although terms such as first and second may be used to describe various components, these terms should be interpreted only as distinguishing one component from another. For example, a first component may be named a second component, and similarly, a second component may be named a first component.

When a component is described as being "connected" to another component, it may be directly connected or coupled to that other component, but it should be understood that intervening components may also be present.

Singular expressions include plural expressions unless the context clearly indicates otherwise. In this specification, terms such as "include" or "have" are intended to specify the presence of the described features, numbers, steps, operations, components, parts, or combinations thereof, and should be understood as not precluding the presence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Unless defined otherwise, all terms used herein, including technical or scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the relevant art. Terms such as those defined in commonly used dictionaries should be interpreted as having a meaning consistent with their meaning in the context of the related art and, unless explicitly defined in this specification, are not to be interpreted in an idealized or overly formal sense.

Hereinafter, embodiments are described in detail with reference to the accompanying drawings. In the description with reference to the accompanying drawings, the same components are given the same reference numerals regardless of the figure number, and redundant descriptions of them are omitted.

FIG. 1 is an operational flowchart of a speech recognition method according to an embodiment.

The speech recognition method according to one aspect may proceed by decoding encoded information of length T, obtained by encoding an input speech signal, based on a beam search algorithm, and thereby outputting a character string corresponding to the speech signal. The encoded information of length T may consist of T frames, which may be decoded in time order from the first frame to the last, T-th, frame. The decoding of the t-th frame may correspond to step t of the decoding of the entire encoded information. During decoding, a candidate group for the speech recognition result may be determined for each frame, and this candidate group may be referred to as a beam. The beam corresponding to each frame may include a predetermined number (e.g., N) of candidate word sequences. The beam corresponding to an arbitrary frame t+1 of the encoded information may be determined, while decoding the preceding frame t, based on the beam corresponding to frame t and the speech recognition result for frame t. For the last frame T of the encoded information, a candidate word sequence included in the beam generated by decoding frame T may be output as the final recognition result for the input speech signal.
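
The frame-by-frame flow described above can be sketched as a simple loop. This is an illustrative sketch under stated assumptions: `decode`, `toy_step`, the `BLANK` key, and the representation of a frame as a word-to-probability dictionary are hypothetical stand-ins chosen so that the example runs on its own. An actual system would obtain these probabilities from the end-to-end model, and the per-frame step would be the decoding process of the embodiments rather than this toy.

```python
from typing import Callable, Dict, Iterable, Tuple

WordSeq = Tuple[str, ...]
Beam = Dict[WordSeq, float]
Frame = Dict[str, float]  # hypothetical: word -> probability for this frame

BLANK = ""  # hypothetical key meaning "this frame emits no word"


def decode(
    frames: Iterable[Frame],
    step: Callable[[Beam, Frame], Beam],
    initial_beam: Beam,
) -> Beam:
    """Decode frames 1..T in time order, carrying the beam forward."""
    beam = initial_beam
    for frame in frames:
        # The beam for frame t+1 is produced while decoding frame t.
        beam = step(beam, frame)
    return beam  # beam produced by the last frame T


def toy_step(beam: Beam, frame: Frame, beam_size: int = 2) -> Beam:
    """Toy per-frame step: extend or keep each sequence, then keep the best N."""
    pool: Beam = {}
    for seq, prob in beam.items():
        for word, word_prob in frame.items():
            new_seq = seq if word == BLANK else seq + (word,)
            pool[new_seq] = pool.get(new_seq, 0.0) + prob * word_prob
    ranked = sorted(pool.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:beam_size])


frames = [
    {BLANK: 0.125, "hi": 0.5, "ha": 0.375},
    {BLANK: 0.75, "yo": 0.25},
]
final_beam = decode(frames, toy_step, {("<s>",): 1.0})
best_sequence = max(final_beam, key=final_beam.get)
```

The outer loop is the part that mirrors the text; the step function is passed in as a parameter so that any per-frame procedure with the same shape can be substituted.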

The speech recognition method according to one aspect may be performed by a server or device that includes an end-to-end speech recognition model. More specifically, the steps of the speech recognition method according to an embodiment may be performed by at least one processor included in the server or device. The specific configuration of a device or server that performs the speech recognition method according to one aspect is described in detail below with reference to FIG. 7.

Referring to FIG. 1, the speech recognition method may include: obtaining candidate word sequences included in a beam corresponding to a target frame of a speech signal, and probabilities of the candidate word sequences (110); expanding the candidate word sequences based on the target frame of the speech signal and the probabilities of the candidate word sequences (120); determining whether an expanded candidate word sequence is included in a predetermined number of first upper word sequences in a preliminary candidate group (130); updating the preliminary candidate group, based on a result of the determination, by additionally expanding the expanded candidate word sequence included in the first upper word sequences (140); and determining a predetermined number of second upper word sequences in the preliminary candidate group as the beam corresponding to the next frame (150). According to an embodiment, steps 110 to 150 may correspond to the decoding process of the speech recognition method.

The target frame of the speech signal according to one aspect may be one of a plurality of frames constituting the encoded information obtained by encoding the speech signal. In other words, the speech recognition method may further include encoding the speech signal to obtain a plurality of frames including features extracted from the speech signal. According to an embodiment, the encoding of the speech signal may include an encoding operation performed by the encoder of the end-to-end speech recognition model.

According to an embodiment, the target frame is the frame currently being processed by the processor, that is, the frame on which the decoding process is being performed, and may be referred to below as the current frame. The next frame may be the frame immediately following the current frame in time order. For example, if the current frame is the t-th frame, the next frame may be the (t+1)-th frame.

A word sequence according to one aspect may include one or more words, and may include a sequence in which a plurality of words are arranged. A word according to an embodiment may mean a string of at least one character rather than a word in the dictionary sense. In other words, a word according to an embodiment may be a single character of a specific language or a combination of a plurality of characters of that language.

According to an embodiment, the beam corresponding to the current frame may be obtained by decoding the previous frame of the speech signal. The previous frame may be the frame (t-1) immediately preceding the current frame (t) in time order. More specifically, the candidate word sequences corresponding to the current frame and their probabilities, obtained in step 110, may be the candidate word sequences, and the probabilities that those sequences correspond to the speech signal, obtained by decoding the previous frame. In other words, they may be the candidate word sequences and probabilities obtained by performing steps 110 to 150 with the previous frame as the current frame.

According to an embodiment, the beam corresponding to the next frame, obtained by performing steps 110 to 150 for the current frame of the speech signal, may be used in step 110 when that step is performed for the next frame. According to an embodiment, when the current frame is the last frame of the speech signal, the beam corresponding to the next frame, obtained by performing steps 110 to 150 for that frame, may be output as the speech recognition result.

According to an embodiment, when the current frame is the first frame of the speech signal, there is no previous frame, so step 110 may be omitted. Alternatively, when the current frame is the first frame, step 110 may include obtaining predetermined candidate word sequence(s) and probability(ies). For example, when the current frame is the first frame, step 110 may obtain a candidate word sequence consisting of a start token, and the probability of this start-token candidate word sequence may be predetermined to be 1.

According to an embodiment, the probability corresponding to a given candidate word sequence may be obtained based on an end-to-end speech recognition model. According to one aspect, the end-to-end speech recognition model may include a neural network trained, on training data that includes speech data and the text labels corresponding to the speech data, to receive speech data as input and output text labels. In other words, the end-to-end speech recognition model may include a deep-learning speech recognition model based on a neural network, for example an RNN transducer (RNN-T) based on a recurrent neural network (RNN). The process of obtaining the probability corresponding to a candidate word sequence based on the end-to-end speech recognition model is described in detail below with reference to FIGS. 4 and 5.

According to one aspect, each candidate word sequence may be matched with its corresponding probability. For example, candidate word sequence A may be matched with a first probability. According to an embodiment, the probability matched with a candidate word sequence may be stored in a memory accessible to the processor that performs step 110.

According to an embodiment, the probabilities of the candidate word sequences included in the beam corresponding to the current frame are the probabilities obtained in the decoding process performed for the previous frame, and may be matched with the candidate word sequences and stored in memory. According to an embodiment, step 110 may include obtaining the probabilities of the candidate word sequences by loading from memory the probabilities stored in association with the candidate word sequences.

Step 120, according to one aspect, may be a step of obtaining extended candidate word sequences of the candidate word sequences, and the probabilities of the extended candidate word sequences, based on the current frame of the speech signal and the probabilities of the candidate word sequences. The extended candidate word sequence(s) of a given candidate word sequence may include word sequence(s) in which one or more candidate words are appended to that candidate word sequence. The candidate words may be determined based on the training data of the end-to-end speech recognition model. For example, when the end-to-end speech recognition model has been trained, on training data that includes speech signals and the alphabet labels corresponding to the speech signals, to output the alphabet label corresponding to a speech signal, a candidate word may be an alphabet label. According to an embodiment, the extended candidate word sequences may include word sequences that have a candidate word sequence as a prefix. For example, when the candidate word sequences include 'a' and 'bc', the word sequences that have a candidate word sequence as a prefix may include 'aa', 'ab', 'ac', 'bca', 'bcb' and 'bcc'.
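The prefix example above can be reproduced with a minimal sketch. The function name `expand` and the three-letter label set are illustrative assumptions, not part of the disclosure; probabilities are left out here.

```python
def expand(candidates, vocabulary):
    """Append each candidate word to each candidate word sequence (one expansion)."""
    return [seq + word for seq in candidates for word in vocabulary]

candidates = ["a", "bc"]
vocabulary = ["a", "b", "c"]  # assumed label set, taken from the example in the text

extended = expand(candidates, vocabulary)
print(extended)  # ['aa', 'ab', 'ac', 'bca', 'bcb', 'bcc']
```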

According to an embodiment, an extended candidate word sequence obtained by extending one candidate word sequence may coincide with another candidate word sequence. For example, when the candidate word sequences include 'a' and 'ab', extending the candidate word sequence 'a' may produce the extended candidate word sequence 'ab', which duplicates the candidate word sequence 'ab'. When an extended candidate word sequence duplicates another candidate word sequence in this way, that word sequence may be treated as a candidate word sequence rather than as an extended candidate word sequence. For example, when 'ab', obtained by extending the candidate word sequence 'a', is already among the candidate word sequences, 'ab' may be treated as a candidate word sequence and not as an extended candidate word sequence.
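One way to apply this rule is to filter the extensions against the existing candidates. The sketch below is an illustration only; the helper name `new_extensions_only` and the label set "abc" are assumptions.

```python
def new_extensions_only(candidates, extended):
    """Keep extensions that do not coincide with an existing candidate word sequence."""
    existing = set(candidates)
    return [seq for seq in extended if seq not in existing]

candidates = ["a", "ab"]
extended = [seq + word for seq in candidates for word in "abc"]

# 'ab' produced from 'a' is already a candidate, so it stays a candidate
# and is not counted as an extended candidate word sequence.
new_extensions = new_extensions_only(candidates, extended)
print(new_extensions)  # ['aa', 'ac', 'aba', 'abb', 'abc']
```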

According to an embodiment, step 120 of extending the candidate word sequences may be performed for each of the candidate word sequences. For example, step 120 of extending a given candidate word sequence may include: appending a candidate word to the candidate word sequence to obtain an extended candidate word sequence; obtaining, based on the candidate word sequence, the probability that the current frame corresponds to the candidate word; and obtaining the probability of the extended candidate word sequence based on the probability of the candidate word sequence and the probability that the current frame corresponds to the candidate word.
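The three sub-steps can be sketched as follows, assuming the two probabilities are combined by adding log probabilities (the accumulation described later for FIG. 5 works in the log domain). The function name and the numeric values are made up for illustration; in practice the per-word probabilities would come from the speech recognition model.

```python
import math

def extend_with_probs(seq, seq_log_prob, word_log_probs):
    """Return {extended sequence: log probability} for one candidate word sequence."""
    return {seq + word: seq_log_prob + lp for word, lp in word_log_probs.items()}

# Assumed probabilities that the current frame corresponds to each candidate word,
# given the candidate word sequence 'a'.
word_log_probs = {"a": math.log(0.5), "b": math.log(0.3), "c": math.log(0.2)}

result = extend_with_probs("a", math.log(0.4), word_log_probs)
for seq, lp in result.items():
    print(seq, round(math.exp(lp), 3))
```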

According to an embodiment, step 120 of extending the candidate word sequences may be performed in parallel for the individual candidate word sequences. For example, referring to FIG. 2A, when the beam corresponding to the current frame includes the candidate word sequences 201 'a', 'ab' and 'c', the operations of appending each of the candidate words 'a', 'b' and 'c' to each candidate word sequence to obtain the extended candidate word sequences 202 may be performed in parallel by a plurality of processors. In other words, the processes ranging from obtaining the extended candidate word sequence 'aa', by appending the candidate word 'a' to the candidate word sequence 'a', to obtaining the extended candidate word sequence 'cc', by appending the candidate word 'c' to the candidate word sequence 'c', may be performed by a plurality of processors. For example, each extended candidate word sequence may be obtained by a different processor at the same time.

According to an embodiment, step 120 of extending the candidate word sequences may include obtaining the extended candidate word sequences and their probabilities based on the end-to-end speech recognition model. The process by which the end-to-end speech recognition model according to one aspect yields the extended candidate word sequences and their probabilities is described in detail below with reference to FIGS. 4 and 5.

Step 130, according to one aspect, may be a step of determining whether an extended candidate word sequence is included in a predetermined number of first upper word sequences, taken in descending order of probability from a preliminary candidate group. Here, the preliminary candidate group may include the candidate word sequences and their extended candidate word sequences. For example, referring to FIG. 2A, the preliminary candidate group may correspond to the set [a, ab, c, aa, ab, ac, aba, abb, abc, ca, cb, cc], which includes the candidate word sequences 201 and the extended candidate word sequences 202.

According to one aspect, the first upper word sequences may include a predetermined number of the highest-probability word sequences in the preliminary candidate group. According to an embodiment, step 130 may further include selecting the predetermined number of first upper word sequences from the preliminary candidate group in descending order of probability.

According to an embodiment, the processor performing step 130 may determine that the first upper word sequences include an extended candidate word sequence when they include at least one of the candidate word sequences extended in step 120. Conversely, when the first upper word sequences include none of the candidate word sequences extended in step 120, the processor may determine that they do not include an extended candidate word sequence.

For example, referring to FIG. 2A, suppose that the three highest-probability word sequences in the preliminary candidate group [a, ab, c, aa, ab, ac, aba, abb, abc, ca, cb, cc], which includes the candidate word sequences 201 and the extended candidate word sequences 202, are 'a' (211), 'ab' (212) and 'cb' (214), and that these are the first upper word sequences. Because the first upper word sequences include the extended candidate word sequence 'cb', it may be determined in step 130 that the first upper word sequences include an extended candidate word sequence.

As another example, referring to FIG. 2A, suppose that the three highest-probability word sequences in the preliminary candidate group [a, ab, c, aa, ab, ac, aba, abb, abc, ca, cb, cc] are 'a' (211), 'ab' (212) and 'c' (213), and that these are the first upper word sequences. Because the first upper word sequences include no extended candidate word sequence, it may be determined in step 130 that the first upper word sequences do not include an extended candidate word sequence.
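The two FIG. 2A cases can be checked with a short sketch of the step 130 test. The probabilities below are invented so that the top three come out as in the text; the function names are assumptions, and the duplicate 'ab' is treated as a candidate, as described earlier.

```python
import heapq

def top_n(scores, n):
    """Return the n highest-probability sequences in the preliminary candidate group."""
    return heapq.nlargest(n, scores, key=scores.get)

def contains_extended(scores, extended, n):
    """Step 130: does any extended sequence rank among the first upper word sequences?"""
    return any(seq in extended for seq in top_n(scores, n))

extended = {"aa", "ac", "aba", "abb", "abc", "ca", "cb", "cc"}
low = {seq: 0.01 for seq in extended}  # made-up probabilities
case_1 = {**low, "a": 0.30, "ab": 0.25, "c": 0.05, "cb": 0.20}
case_2 = {**low, "a": 0.30, "ab": 0.25, "c": 0.20, "cb": 0.05}

print(contains_extended(case_1, extended, 3))  # True: 'cb' is in the top 3
print(contains_extended(case_2, extended, 3))  # False: the top 3 are 'a', 'ab', 'c'
```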

Step 140, according to one aspect, may be a step of updating the preliminary candidate group by further extending at least one extended candidate word sequence included in the first upper word sequences, based on the determination result of step 130.

According to an embodiment, when it is determined in step 130 that the first upper word sequences do not include an extended candidate word sequence, step 140 of updating the preliminary candidate group may include leaving the preliminary candidate group unchanged. In this case, the preliminary candidate group may include the candidate word sequences and their extended candidate word sequences.

According to an embodiment, when it is determined in step 130 that the first upper word sequences include an extended candidate word sequence, step 140 of updating the preliminary candidate group may include: further extending at least one extended candidate word sequence included in the first upper word sequences, based on the current frame and the probability of that extended candidate word sequence; and updating the preliminary candidate group by adding the further-extended candidate word sequences to it.

According to an embodiment, an extended candidate word sequence included in the first upper word sequences may be a word sequence obtained by extending a candidate word sequence once in step 120. Further extending an extended candidate word sequence included in the first upper word sequences may include extending that once-extended candidate word sequence one more time. In other words, the further-extended candidate word sequence(s) may be word sequence(s) obtained by extending at least some of the candidate word sequences in the beam corresponding to the current frame twice.

For example, referring to FIG. 2B, the preliminary candidate group before the update may correspond to the set [a, ab, c, aa, ab, ac, aba, abb, abc, ca, cb, cc], which includes the candidate word sequences 201 and the once-extended candidate word sequences 202. When the first upper word sequences, consisting of the three highest-probability word sequences in the preliminary candidate group, are 'a', 'ab' and 'cb', they include the once-extended candidate word sequence 'cb', so it may be determined in step 130 that the first upper word sequences include an extended candidate word sequence. The extended candidate word sequence 'cb' may then be further extended in step 140, and the candidate word sequences 203 obtained by further extending 'cb' may be added to the preliminary candidate group. The preliminary candidate group updated by step 140 may correspond to [a, ab, c, aa, ab, ac, aba, abb, abc, ca, cb, cc, cba, cbb, cbc], which includes the candidate word sequences 201, the extended candidate word sequences 202 and the further-extended candidate word sequences 203.

On the other hand, when the first upper word sequences in FIG. 2B, consisting of the three highest-probability word sequences in the preliminary candidate group, are 'a', 'ab' and 'c', they include none of the once-extended candidate word sequences 202, so it may be determined in step 130 that the first upper word sequences do not include an extended candidate word sequence. In this case, the preliminary candidate group is not updated in step 140 and may remain the set [a, ab, c, aa, ab, ac, aba, abb, abc, ca, cb, cc], which includes the candidate word sequences 201 and the extended candidate word sequences 202. The further-extended candidate word sequences 203 may also not be generated.
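Both FIG. 2B outcomes of step 140 can be illustrated with a small sketch. All probabilities, including the per-word probabilities in `word_probs`, are invented, and treating them as fixed values rather than model outputs is a simplifying assumption; `update_preliminary` is a hypothetical helper name.

```python
import heapq

def update_preliminary(pool, extended, word_probs, n):
    """Step 140: extend once more the extended sequences found in the top n, and add them."""
    top = heapq.nlargest(n, pool, key=pool.get)
    updated = dict(pool)
    for seq in top:
        if seq in extended:
            for word, p in word_probs.items():
                # Keep an existing entry if the new sequence is already in the group.
                updated.setdefault(seq + word, pool[seq] * p)
    return updated

extended = {"aa", "ac", "aba", "abb", "abc", "ca", "cb", "cc"}
pool = {seq: 0.01 for seq in extended}       # made-up probabilities
pool.update({"a": 0.30, "ab": 0.25, "c": 0.05, "cb": 0.20})
word_probs = {"a": 0.5, "b": 0.3, "c": 0.2}  # assumed per-word probabilities

updated = update_preliminary(pool, extended, word_probs, n=3)
print(sorted(set(updated) - set(pool)))  # ['cba', 'cbb', 'cbc']
```

If the top three were 'a', 'ab' and 'c' instead, the function would return the group unchanged, matching the second case in the text.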

Referring again to FIG. 1, step 150, according to one aspect, may be a step of determining a predetermined number of second upper word sequences, taken in descending order of probability from the preliminary candidate group, as the beam corresponding to the frame following the current frame. The number of word sequences in the first upper word sequences may equal the number of word sequences in the second upper word sequences.

According to an embodiment, when it is determined in step 130 that the first upper word sequences include an extended candidate word sequence, the predetermined number of second upper word sequences selected in descending order of probability from the preliminary candidate group updated in step 140 may be determined as the beam corresponding to the next frame. The second upper word sequences selected from the updated preliminary candidate group may be the same as the first upper word sequences or may differ from them. For example, with N first upper word sequences and N second upper word sequences, when the twice-extended candidate word sequences added to the preliminary candidate group in step 140 do not rank among the N highest-probability word sequences, the second upper word sequences may be determined to be identical to the first upper word sequences.

According to an embodiment, when it is determined in step 130 that the first upper word sequences do not include an extended candidate word sequence, the preliminary candidate group is not updated, so the second upper word sequences may be identical to the first upper word sequences.

According to an embodiment, the terms first upper word sequences and second upper word sequences designate some of the word sequences in the preliminary candidate group; they do not denote distinct instances or objects. For example, both may mean the N highest-probability word sequences among the word sequences in the preliminary candidate group at a given point in time.

According to an embodiment, whether the preliminary candidate group is updated depends on the determination result of step 130, and the word sequences determined as the beam corresponding to the next frame may be the N highest-probability word sequences in the preliminary candidate group.

For example, referring to FIG. 3, the speech recognition method according to an embodiment may, after step 120 shown in FIG. 1, determine (310) whether an extended candidate word sequence is included in the N highest-probability word sequences in the preliminary candidate group. Depending on the determination result, one of two operations may be performed: determining, without updating the preliminary candidate group, the top N word sequences in the preliminary candidate group that includes the candidate word sequences and the once-extended candidate word sequences as the beam corresponding to the next frame; or updating the preliminary candidate group and determining the top N word sequences in the updated preliminary candidate group as the beam corresponding to the next frame.

More specifically, when it is determined that the top N word sequences in the preliminary candidate group include an extended candidate word sequence, the speech recognition method according to one aspect may include: further extending the extended candidate word sequence included in the top N word sequences (320); adding the further-extended candidate word sequences to the preliminary candidate group (330); reselecting the top N word sequences from the preliminary candidate group (340); and determining the reselected top N word sequences as the beam corresponding to the next frame (350).

On the other hand, when it is determined that the first upper word sequences do not include an extended candidate word sequence, the speech recognition method according to one aspect may include determining the top N word sequences of step 310 as the beam corresponding to the next frame (350).
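The FIG. 3 flow for a single frame can be put together as a minimal sketch. It assumes fixed, made-up per-word probabilities in place of the speech recognition model and ignores the blank symbol introduced later for RNN-T, so it shows only the control flow of steps 120 and 310 to 350. The names `expand` and `next_beam` are hypothetical.

```python
import heapq

def expand(pool, sources, word_probs):
    """Add one-word extensions of each source sequence to the pool; return the new ones."""
    added = set()
    for seq in sources:
        for word, p in word_probs.items():
            new_seq = seq + word
            if new_seq not in pool:  # a duplicate stays a candidate
                pool[new_seq] = pool[seq] * p
                added.add(new_seq)
    return added

def next_beam(beam, word_probs, n):
    pool = dict(beam)                                # preliminary candidate group
    extended = expand(pool, list(beam), word_probs)  # step 120
    top = heapq.nlargest(n, pool, key=pool.get)      # step 310: top N
    in_top = [seq for seq in top if seq in extended]
    if in_top:
        expand(pool, in_top, word_probs)             # steps 320 and 330
        top = heapq.nlargest(n, pool, key=pool.get)  # step 340
    return {seq: pool[seq] for seq in top}           # step 350

beam = {"a": 0.5, "ab": 0.3, "c": 0.2}
no_update = next_beam(beam, {"a": 0.1, "b": 0.6, "c": 0.05}, n=3)
with_update = next_beam(beam, {"a": 0.1, "b": 0.9, "c": 0.05}, n=3)
print(list(no_update), list(with_update))
```

In the first call no extension reaches the top three, so the second expansion is skipped; in the second call 'abb' reaches the top three and is expanded once more before the final selection.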

Referring again to FIG. 1, when the current frame is the last frame of the speech signal, step 150 of determining the beam corresponding to the next frame may further include outputting at least some of the second upper word sequences as the speech recognition result corresponding to the speech signal.

FIG. 4 is a diagram illustrating the structure of an RNN-T, which is an example of an end-to-end speech recognition model according to an embodiment.

RNN-T is an end-to-end speech recognition framework that converts input acoustic frames into a character sequence. RNN-T may transcribe $x = \{x_t\}_{t=1}^{T}$ into $y = \{y_u\}_{u=1}^{U}$, where $x_t$ is an acoustic frame, $t$ is the frame index, $y_u$ is an output label, and $u$ is the output label index.

The length (or number) $T$ of the acoustic frames and the length (or number) $U$ of the output labels may differ; for example, $T$ may be larger than $U$. RNN-T may handle the length difference between $x$ and $y$ by introducing a blank symbol Ø. The blank symbol lets the RNN-T decide, at each of the $T$ steps of the decoding process, whether to emit an output label. For example, for $y = [y_1, y_2, y_3]$ with $|y| = U$, the decoding paths of the RNN-T may include $y^* = [y_1, Ø, Ø, \ldots, y_2, Ø, y_3]$ with $|y^*| = T$.
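The relationship between a decoding path of length T and the label sequence of length U can be shown with a small sketch: dropping the blank symbols from the path recovers the labels. The path below is a made-up instance with T = 6 and U = 3, and `collapse` is a hypothetical helper name.

```python
BLANK = "Ø"

def collapse(path):
    """Remove blank symbols from a T-step decoding path to recover the output labels."""
    return [label for label in path if label != BLANK]

path = ["y1", BLANK, BLANK, "y2", BLANK, "y3"]  # one symbol per frame, T = 6
labels = collapse(path)                           # U = 3

print(len(path), len(labels), labels)
```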

Referring to FIG. 4, the RNN-T may include an encoder 410, a prediction network 420 and a joint network 430. The encoder 410 may map the acoustic frames $x$ to hidden representations, as shown in Equation 1 below.

The prediction network 420 may estimate a hidden state $p_u$ based on the previous non-blank output label $y_{u-1}$, as shown in Equation 2 below.

The joint network 430 may combine the hidden representations produced by the encoder 410 and the prediction network 420, as shown in Equation 3 below.

Here, $W$ denotes a weight matrix and $b$ denotes a bias vector. The posterior probability of each $k$-th output label may be obtained by applying a softmax layer 440, as shown in Equations 4 and 5 below.

Here, $y = [y_1, \ldots, y_u]$.

FIG. 5 is an operational flowchart of a decoding process for speech recognition performed based on an RNN-T according to an embodiment. More specifically, FIG. 5 illustrates a decoding process that obtains the beam corresponding to frame t+1 (or step t+1) based on the beam corresponding to frame t (or step t).

The decoding process according to an embodiment may consist of a plurality of expansion blocks. An expansion block may be a set of operations for the expansion search of a beam search. The number of expansions possible in a single expansion block may be limited to one, and the expansion search may be performed in parallel by batching the hypotheses. In other words, all hypotheses may be expanded at the same time within an expansion block. Two expansions may be performed by connecting two expansion blocks in cascade. For example, the operations of a single expansion block may include the operation of extending the candidate word sequences described above (e.g., step 120 of FIG. 1), and the operations of two cascaded expansion blocks may include the operation of further extending the extended candidate word sequences described above (e.g., step 140 of FIG. 1 or step 320 of FIG. 3).

According to an embodiment, the decoding process may include two or more expansion blocks. For example, it may include three expansion blocks to improve the word error rate (WER).

The batched computation applied to an expansion block according to one aspect can accelerate the expansion search, but the expansion search is then also performed for hypotheses that do not need to be expanded, which can increase the computational overhead. Accordingly, to avoid unnecessary expansion search in the decoding process, an expansion block according to one aspect may include a process of adaptively determining whether to continue the expansion search.

When two cascaded expansion blocks are used in the decoding process according to one aspect, there may be three scenarios depending on the posterior probabilities of the hypotheses: no expansion, one expansion, or two expansions.

According to an embodiment, in step 510 the input to an expansion block is a list of hypotheses, and beam_t may be assigned as the list for the first expansion block. beam_t includes the hypotheses in the decoding process for frame t (or the decoding process at step t). For example, beam_t may correspond to the beam corresponding to a given frame t described above. The hypotheses in beam_t may correspond to the candidate word sequences included in the beam corresponding to frame t, and step 510 may correspond to step 110 of FIG. 1 described above.

According to an embodiment, in the first expansion block (expansion block 1) 520, LoadLMStates is an operation that loads, for every hypothesis in the list, the hidden state of the prediction network (e.g., the prediction network 420 of FIG. 4), which relates to p_u in Equation 2. The batched hidden states of the prediction network produced by LoadLMStates may be denoted lmh_batch (521).

According to an embodiment, BatchJointFn, which relates to Equations 3 to 5 above, is an operation that performs a batched computation using e_t output from the encoder (e.g., the encoder 410 of FIG. 4) and lmh_batch to obtain the posterior probabilities of all possible expansions of the hypotheses in the list. The result of BatchJointFn is batched into prob_batch (522). For example, prob_batch may correspond to the probability, described above, that the specific frame t corresponds to a candidate word.

According to an embodiment, every log Pr(k|y, t) in prob_batch is added, by the AccumulateProb operation, to the accumulated log probability of the previous hypothesis in the list, and the result is assigned to acc_prob_batch (523). For example, the accumulated log probability of the previous hypothesis corresponds to the probability of the candidate word sequence described above, and acc_prob_batch may include the probabilities of the expanded candidate word sequences described above. acc_prob_batch may further include the probabilities of the candidate word sequences described above.
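The accumulation of step 523 can be illustrated with a short plain-Python sketch. It is not the implementation of the embodiment: the function name `accumulate_prob`, the nested-list layout of `prob_batch` (one row per hypothesis, one column per output label), the choice of label 0 as the blank label, and the sample values are all assumptions made for illustration.

```python
import math

def accumulate_prob(hyp_log_probs, prob_batch):
    """Add each hypothesis's accumulated log probability to every
    log Pr(k | y, t) in its row of prob_batch.

    hyp_log_probs: one accumulated log probability per hypothesis.
    prob_batch:    one row per hypothesis; row i holds the log probability
                   of every output label k for hypothesis i (assumed layout).
    """
    if len(hyp_log_probs) != len(prob_batch):
        raise ValueError("prob_batch needs exactly one row per hypothesis")
    return [
        [prev + log_p for log_p in row]
        for prev, row in zip(hyp_log_probs, prob_batch)
    ]

# Two hypotheses and three labels (label 0 is assumed to be blank).
hyp_log_probs = [math.log(0.6), math.log(0.4)]
prob_batch = [
    [math.log(0.5), math.log(0.3), math.log(0.2)],
    [math.log(0.1), math.log(0.7), math.log(0.2)],
]
acc_prob_batch = accumulate_prob(hyp_log_probs, prob_batch)
```

Working in the log domain turns the product of the hypothesis probability and the label probability into a sum, which is why the operation is described as an addition.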

According to an embodiment, FindTopIndexes is an operation that extracts the indices of the top (N + β) posterior probabilities in acc_prob_batch. The indices of the top (N + β) posterior probabilities found by FindTopIndexes are assigned to top_indexes (524). For example, step 524 may correspond to the operation of selecting the first upper word sequences described above. Here, β is a margin factor, which is described in detail below.
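A minimal sketch of the selection in step 524 follows. The function name `find_top_indexes`, the representation of an index as a (hypothesis index, label index) pair, and the sample scores are assumptions; the embodiment does not specify how the indices are encoded.

```python
import heapq

def find_top_indexes(acc_prob_batch, n, beta):
    """Return the (hypothesis index, label index) pairs of the top
    (n + beta) accumulated log probabilities, best first."""
    flat = (
        (score, hyp_idx, label_idx)
        for hyp_idx, row in enumerate(acc_prob_batch)
        for label_idx, score in enumerate(row)
    )
    best = heapq.nlargest(n + beta, flat, key=lambda item: item[0])
    return [(hyp_idx, label_idx) for _, hyp_idx, label_idx in best]

acc_prob_batch = [
    [-1.2, -2.4, -2.8],  # expansions of hypothesis 0
    [-3.2, -1.3, -2.5],  # expansions of hypothesis 1
]
# Beam width N = 2 with a margin factor beta = 1 keeps three candidates.
top_indexes = find_top_indexes(acc_prob_batch, n=2, beta=1)
```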

According to an embodiment, UpdateHyps is an operation that updates the hypotheses in the list by referring to top_indexes, which contain the index of a specific hypothesis whose output label is k. For example, when k denotes a non-blank label, the hypothesis may be updated by appending the k-th output label using the UpdateHyps operation; in this case the posterior probability of the updated hypothesis is an incomplete probability. As another example, when k denotes the blank label, the hypothesis may be updated by attaching, using the UpdateHyps operation, information indicating that expansion of the hypothesis has ended at frame t; in this case the posterior probability of the updated hypothesis is a complete probability. The posterior probability of an updated hypothesis is the value of acc_prob_batch indexed by top_indexes. The updated hypotheses are assigned to newhyps (525). For example, step 525 may correspond to the step, described above, of obtaining the expanded candidate word sequences of the candidate word sequences and the probabilities of the expanded candidate word sequences.
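The two branches of UpdateHyps (blank and non-blank) can be sketched as below. The `Hyp` record, the `done` flag used as the "expansion has ended" information, the blank index 0, and the function name `update_hyps` are hypothetical names chosen for this illustration.

```python
from dataclasses import dataclass
from typing import Tuple

BLANK = 0  # assumed index of the blank label

@dataclass(frozen=True)
class Hyp:
    labels: Tuple[int, ...]  # output labels emitted so far
    log_prob: float          # accumulated log probability
    done: bool = False       # True once expansion at frame t has ended

def update_hyps(hyps, acc_prob_batch, top_indexes):
    """Build newhyps from the (hypothesis index, label k) pairs in top_indexes."""
    new_hyps = []
    for hyp_idx, k in top_indexes:
        parent = hyps[hyp_idx]
        score = acc_prob_batch[hyp_idx][k]
        if k == BLANK:
            # Blank label: mark the hypothesis as finished for frame t.
            # Its probability is a complete probability.
            new_hyps.append(Hyp(parent.labels, score, done=True))
        else:
            # Non-blank label: append k. The probability is still incomplete.
            new_hyps.append(Hyp(parent.labels + (k,), score, done=False))
    return new_hyps

hyps = [Hyp((1,), -0.5), Hyp((1, 2), -0.9)]
acc_prob_batch = [[-1.2, -2.4, -2.8], [-3.2, -1.3, -2.5]]
top_indexes = [(0, 0), (1, 1), (0, 1)]
new_hyps = update_hyps(hyps, acc_prob_batch, top_indexes)
```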

According to an embodiment, duplicate hypotheses can arise during the expansion procedure. For example, when beam_t contains the two hypotheses [a] and [a, b], hypothesis [a] may be expanded to [a, b] by the expansion process of UpdateHyps, in which case the two hypotheses become duplicates.

According to an embodiment, a prefix search may be performed on beam_t before the decoding process shown in FIG. 5 is carried out. The prefix search may correspond to a process of finding hypotheses that have another hypothesis included in beam_t as a prefix and correcting the posterior probabilities of the hypotheses found. For example, when beam_t contains [a] and [a, b], [a, b] is a hypothesis having [a] as a prefix, so the posterior probability of [a, b] may be corrected by the prefix search. The corrected posterior probability of [a, b] may be computed by calculating the posterior probability of [a] being expanded to [a, b] at frame t and adding it in advance to the posterior probability of hypothesis [a, b] in beam_t.
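The prefix-search correction can be sketched as follows. This is an illustration under stated assumptions rather than the procedure of the embodiment: probabilities are kept as log probabilities, so the "addition" becomes a log-sum-exp; `label_log_prob(prefix, k)` is a hypothetical callable standing in for the model's log Pr(k | prefix, t); and the contribution of a shorter hypothesis is computed from its uncorrected probability.

```python
import math

def log_add(a, b):
    """Return log(exp(a) + exp(b)) in a numerically stable way."""
    hi, lo = max(a, b), min(a, b)
    if lo == float("-inf"):
        return hi
    return hi + math.log1p(math.exp(lo - hi))

def prefix_search(beam, label_log_prob):
    """beam maps label tuples to accumulated log probabilities.

    For every pair (shorter, longer) in the beam where shorter is a proper
    prefix of longer, add the probability of shorter being expanded into
    longer at frame t to the probability of longer.
    """
    corrected = dict(beam)
    for longer in beam:
        for shorter, shorter_lp in beam.items():
            if len(shorter) < len(longer) and longer[:len(shorter)] == shorter:
                path = shorter_lp
                for i in range(len(shorter), len(longer)):
                    path += label_log_prob(longer[:i], longer[i])
                corrected[longer] = log_add(corrected[longer], path)
    return corrected

# Toy model: every label has probability 0.4 regardless of the prefix.
beam_t = {("a",): math.log(0.5), ("a", "b"): math.log(0.2)}
corrected = prefix_search(beam_t, lambda prefix, k: math.log(0.4))
```

In this toy case the probability of [a, b] becomes 0.2 + 0.5 × 0.4 = 0.4, while the probability of [a] is unchanged.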

Through the prefix search according to an embodiment, the posterior probability of [a, b] obtained by expanding [a] can be aggregated into the posterior probability of the [a, b] already included in beam_t. The UpdateHyps operation therefore ignores any expansion that leads to a hypothesis already present in beam_t. In other words, to prevent duplicate hypotheses in beam_t+1, the expansion of a hypothesis is ignored when the expanded hypothesis already exists in beam_t. In an embodiment, because expanded hypotheses that duplicate hypotheses in beam_t are ignored, the number of hypotheses assigned to newhyps may be smaller than the number of elements in top_indexes. For this reason top_indexes is made to contain N + β elements, so that the number of hypotheses in newhyps can be kept above N.
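The effect of ignoring duplicate expansions, and the role of the margin β, can be seen in a small sketch. The tuple-based candidate format `(parent_labels, label, log_prob)`, the use of `None` for the blank label, and the function name are assumptions made for this example only.

```python
def update_hyps_without_duplicates(beam_t, candidates):
    """beam_t:     set of label tuples already present in the beam.
    candidates: list of (parent_labels, label, log_prob); label None = blank.
    Returns a list of (labels, log_prob, done) tuples.
    """
    new_hyps = []
    for parent, label, log_prob in candidates:
        if label is None:
            # Blank: the hypothesis itself is kept and marked as finished.
            new_hyps.append((parent, log_prob, True))
            continue
        expanded = parent + (label,)
        if expanded in beam_t:
            # Already in beam_t: its probability is assumed to have been
            # merged by the prefix search, so the expansion is ignored.
            continue
        new_hyps.append((expanded, log_prob, False))
    return new_hyps

beam_t = {("a",), ("a", "b")}
# N = 2 and beta = 1, so three candidates were selected.
candidates = [
    (("a",), None, -0.7),     # [a] followed by blank
    (("a",), "b", -1.1),      # [a] -> [a, b]: duplicate, ignored
    (("a", "b"), "c", -1.6),  # [a, b] -> [a, b, c]
]
new_hyps = update_hyps_without_duplicates(beam_t, candidates)
```

Here one of the three candidates is dropped, and the margin of one extra candidate is what leaves two hypotheses in `new_hyps`.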

According to an embodiment, newhyps may be split into list_b and list_e1 by the BlankHyps operation and the NonBlankHyps operation, respectively (526). For example, among the hypotheses in newhyps, those to which the information indicating that expansion has ended is attached are placed in list_b by the BlankHyps operation, and those to which this information is not attached are placed in list_e1 by the NonBlankHyps operation. The posterior probabilities in list_b are complete probabilities, whereas the posterior probabilities in list_e1 may be incomplete probabilities. list_e1 may be processed further to obtain complete probabilities. For example, list_b may correspond to the candidate word sequences included in the first upper word sequences described above, and list_e1 may correspond to the expanded candidate word sequences included in the first upper word sequences described above. As another example, the step of checking whether list_e1 exists (527) may correspond to the step of determining whether an expanded candidate word sequence is included in the first upper word sequences described above (e.g., step 130 of FIG. 1 or step 310 of FIG. 3).

According to an embodiment, when list_e1 does not exist, the search procedure may be terminated by selecting the top N hypotheses from list_b (530). This case is referred to as Scenario 1. For example, Scenario 1 may correspond to the step, described above, of determining the first upper word sequences as the beam corresponding to the current frame without updating the preliminary candidate group when no expanded candidate word sequence is included in the first upper word sequences.
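The split of step 526, the existence check of step 527, and the Scenario 1 exit of step 530 can be sketched as below. The dictionary layout of a hypothesis, the `done` key, and the helper names `split_new_hyps` and `top_n` are assumptions for illustration.

```python
def split_new_hyps(new_hyps):
    """Split newhyps into list_b (expansion ended) and list_e1 (still open)."""
    list_b = [h for h in new_hyps if h["done"]]       # BlankHyps
    list_e1 = [h for h in new_hyps if not h["done"]]  # NonBlankHyps
    return list_b, list_e1

def top_n(hyps, n):
    """Return the n hypotheses with the highest log probability."""
    return sorted(hyps, key=lambda h: h["log_prob"], reverse=True)[:n]

new_hyps = [
    {"labels": ("a",), "log_prob": -0.4, "done": True},
    {"labels": ("b",), "log_prob": -1.5, "done": True},
    {"labels": ("a", "b"), "log_prob": -0.9, "done": True},
]
list_b, list_e1 = split_new_hyps(new_hyps)
if not list_e1:
    # Scenario 1: nothing is left to expand, so the beam is taken from list_b.
    next_beam = top_n(list_b, n=2)
else:
    # Otherwise the search would continue with the second expansion block.
    next_beam = None
```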

On the other hand, when list_e1 exists, the search procedure may continue on list_e1 to determine whether list_e1 is expanded further.

According to an embodiment, BatchDecoderFn is an operation that performs the batched computation of f^pre according to Equation 2 above, using the original lmh_batch of list_e1 and label_prev, which is the batch of the last labels of the hypotheses in list_e1. The BatchDecoderFn operation updates the lmh_batch of list_e1 (528). Because BatchDecoderFn does not need to be performed when the search procedure follows Scenario 1, it may be performed after the existence check for list_e1 (527) rather than before the BatchJointFn operation (522). The updated lmh_batch is assigned to the lmh_batch of list_e1 by SaveLMStates (529).

According to an embodiment, in the second expansion block (expansion block 2) 550, operations similar to those of the first expansion block 520 are performed on list_e1. The branching point between Scenarios 2 and 3 is whether list_e2 exists (557). For example, list_e2 may correspond to the further expanded candidate word sequences included in the second upper word sequences described above. When list_e2 does not exist, the search procedure is terminated by selecting the top N hypotheses from list_b and list_e1b (560). This case is referred to as Scenario 2. Scenario 2 may correspond to the step, described above, of determining, when no further expanded candidate word sequence is included in the N second upper word sequences taken in descending order of probability from the updated candidate word sequences, the top N word sequences in descending order of probability among the second upper word sequences, which include the candidate word sequences and the expanded candidate word sequences, as the beam corresponding to the current frame.

According to an embodiment, when list_e2 exists, the second expansion block 550 performs operations 558 and 559 and thereby outputs list_e2, which has incomplete probabilities. An additional blank probability may be computed and aggregated into the posterior probabilities of list_e2 by AccumulateBlankProb, which comprises the BatchJointFn operation and the AccumulateProb operation (570). The completed version of list_e2 is denoted list_e2b. The search procedure is then terminated by selecting the top N hypotheses from list_b, list_e1b, and list_e2b (580). This case is referred to as Scenario 3. For example, Scenario 3 may correspond to the step, described above, of determining, when a further expanded candidate word sequence is included in the N second upper word sequences taken in descending order of probability from the updated candidate word sequences, the top N word sequences in descending order of probability among the second upper word sequences, which include the candidate word sequences, the expanded candidate word sequences, and the further expanded candidate word sequences, as the beam corresponding to the current frame.
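The control flow of the three scenarios can be summarized in a short sketch. The expansion block and AccumulateBlankProb are passed in as hypothetical callables, because their real versions depend on the speech recognition model; `toy_block` and `toy_blank` are stand-ins invented for this example, and hypotheses are represented as (labels, log probability) tuples.

```python
def top_n(hyps, n):
    """Return the n hypotheses with the highest log probability."""
    return sorted(hyps, key=lambda h: h[1], reverse=True)[:n]

def search_frame(beam_t, expansion_block, accumulate_blank_prob, n):
    """Return (beam for the next step, scenario number)."""
    list_b, list_e1 = expansion_block(beam_t)         # block 1 (520)
    if not list_e1:                                   # check 527
        return top_n(list_b, n), 1                    # Scenario 1 (530)
    list_e1b, list_e2 = expansion_block(list_e1)      # block 2 (550)
    if not list_e2:                                   # check 557
        return top_n(list_b + list_e1b, n), 2         # Scenario 2 (560)
    list_e2b = accumulate_blank_prob(list_e2)         # 570
    return top_n(list_b + list_e1b + list_e2b, n), 3  # Scenario 3 (580)

def toy_block(hyps):
    """Stand-in block: every hypothesis gets a blank-terminated copy, and
    hypotheses shorter than two labels also get one non-blank expansion."""
    finished, unfinished = [], []
    for labels, log_prob in hyps:
        finished.append((labels, log_prob - 0.1))
        if len(labels) < 2:
            unfinished.append((labels + ("x",), log_prob - 0.5))
    return finished, unfinished

def toy_blank(hyps):
    """Stand-in for AccumulateBlankProb: add a fixed blank log probability."""
    return [(labels, log_prob - 0.1) for labels, log_prob in hyps]

next_beam, scenario = search_frame([((), 0.0)], toy_block, toy_blank, n=2)
```

With the empty hypothesis as input, the toy block produces an open hypothesis in both blocks, so the call above ends in Scenario 3.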

FIG. 6 is an operation flowchart of a beam search method according to an embodiment.

Although a speech recognition method including a decoding process based on the beam search algorithm according to one aspect has been described above with reference to FIGS. 1 to 5, the beam search algorithm according to one aspect can be applied to various embodiments that use tree search.

Referring to FIG. 6, a beam search method according to an embodiment may include: adding the candidate word sequences (beam_t) included in the beam corresponding to the current frame t of encoded data to a preliminary candidate group (610); performing an expansion operation on the candidate word sequences (620 to 640); determining whether to repeat the expansion operation based on whether the word sequences selected by the expansion operation (e.g., the N upper word sequences) include an expanded word sequence of the candidate word sequences obtained by the expansion operation (650); performing, upon determining to repeat the expansion operation, the expansion operation on the selected word sequences (670, 630, and 640); and determining, upon determining not to repeat the expansion operation, the upper word sequences as the beam (beam_t+1) corresponding to the next frame t+1 of the encoded data (660).

The expansion operation according to one aspect may include adding the expanded word sequences of the word sequences subject to the expansion operation to the preliminary candidate group (630), and selecting a certain number of word sequences (e.g., the N upper word sequences) from the preliminary candidate group according to a predetermined criterion (640). Here, the predetermined criterion is a criterion for prioritizing the word sequences included in the preliminary candidate group, and may be, for example, descending order of the probabilities corresponding to the word sequences included in the preliminary candidate group.

The step of determining whether to repeat the expansion operation (650) according to one aspect may include determining to repeat the expansion operation when the selected word sequences include an expanded word sequence, and determining not to repeat the expansion operation when the selected word sequences do not include an expanded word sequence.

According to an embodiment, the step of determining whether to repeat the expansion operation (650) may include determining whether to repeat the expansion operation based on whether a predetermined repetition end condition is satisfied (680). Here, the repetition end condition is a condition, determined in advance for the beam search, for ending the repetition of the expansion operation. It may include, for example, the case in which the number of repetitions of the expansion operation exceeds a threshold (e.g., two), or the case in which the probability corresponding to an upper word sequence exceeds a predetermined threshold.
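The loop of FIG. 6 can be sketched generically as follows. This is one possible reading, not the claimed method itself: `expand(seq, prob)` is a hypothetical callable returning expanded sequences with their probabilities, only the newly added sequences that survive selection are expanded again, and the repetition end condition is simplified to a cap on the number of expansion rounds (`max_expansions`).

```python
def beam_search_step(beam_t, expand, n, max_expansions=2):
    """beam_t: dict mapping word sequences (tuples) to probabilities.
    Returns the beam for the next frame as a dict of n sequences at most.
    """
    pool = dict(beam_t)   # 610: preliminary candidate group
    targets = list(pool)  # 620: sequences subject to the expansion operation
    rounds = 0
    while True:
        added = set()
        for seq in targets:  # 630: add expanded sequences to the pool
            for new_seq, prob in expand(seq, pool[seq]):
                if new_seq not in pool:
                    pool[new_seq] = prob
                    added.add(new_seq)
        # 640: select the n most probable sequences in the pool
        selected = sorted(pool, key=pool.__getitem__, reverse=True)[:n]
        rounds += 1
        # 650: repeat only if a newly expanded sequence was selected
        targets = [seq for seq in selected if seq in added]
        if not targets or rounds >= max_expansions:  # 680
            return {seq: pool[seq] for seq in selected}  # 660

LABEL_PROBS = {"a": 0.6, "b": 0.3}  # toy label probabilities

def toy_expand(seq, prob):
    """Stand-in expansion: append each label and multiply the probabilities."""
    return [(seq + (label,), prob * p) for label, p in LABEL_PROBS.items()]

next_beam = beam_search_step({(): 1.0}, toy_expand, n=3)
```

With n=1 the same call would stop after the first round, because the only selected sequence is the original one and no expanded sequence is among the selected sequences.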

FIG. 7 is an exemplary diagram of the configuration of an apparatus according to an embodiment.

Referring to FIG. 7, an apparatus 700 includes a processor 701, a memory 703, and an input/output device 705. The apparatus 700 may include, for example, a user device (e.g., a smartphone, a personal computer, or a tablet PC) or a server.

The apparatus 700 according to an embodiment may include an apparatus that performs the speech recognition method and/or the beam search method described above. The processor 701 may perform at least one of the methods described above with reference to FIGS. 1 to 5. For example, the processor 701 may perform speech recognition including the beam-search-based decoding process described above. The processor 701 may encode an input speech signal to obtain encoded data corresponding to the speech signal, and may decode the encoded data to obtain text data corresponding to the speech signal. According to an embodiment, the processor 701 may perform the decoding process based on an end-to-end speech recognition model.

The memory 703 may store information related to the speech recognition method and/or the beam search method, and may store data relating to the end-to-end speech recognition model. The memory 703 may be a volatile memory or a non-volatile memory.

The apparatus 700 according to one aspect may be connected to an external device (e.g., a personal computer or a network) through the input/output device 705 and may exchange data with it. For example, the apparatus 700 may receive a speech signal through the input/output device 705 and may output text data corresponding to the speech signal as the result of recognizing the speech signal.

According to an embodiment, the memory 703 may store a program in which the speech recognition method and/or the beam search method described above is implemented. The processor 701 may execute the program stored in the memory 703 and may control the apparatus 700. The code of the program executed by the processor 701 may be stored in the memory 703.

The embodiments described above may be implemented as hardware components, software components, and/or a combination of hardware and software components. For example, the apparatuses, methods, and components described in the embodiments may be implemented using a general-purpose or special-purpose computer, such as a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable gate array (FPGA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions. A processing device may run an operating system (OS) and software applications that run on the operating system. A processing device may also access, store, manipulate, process, and generate data in response to the execution of software. For ease of understanding, a single processing device is sometimes described, but a person of ordinary skill in the art will recognize that a processing device may include a plurality of processing elements and/or a plurality of types of processing elements. For example, a processing device may include a plurality of processors, or one processor and one controller. Other processing configurations, such as parallel processors, are also possible.

Software may include a computer program, code, instructions, or a combination of one or more of these, and may configure a processing device to operate as desired or may instruct the processing device independently or collectively. Software and/or data may be embodied, permanently or temporarily, in any type of machine, component, physical device, virtual equipment, computer storage medium or device, or transmitted signal wave, in order to be interpreted by a processing device or to provide instructions or data to a processing device. Software may be distributed over networked computer systems and stored or executed in a distributed manner. Software and data may be stored on a computer-readable recording medium.

The method according to an embodiment may be implemented in the form of program instructions that can be executed by various computer means and may be recorded on a computer-readable medium. The computer-readable medium may store program instructions, data files, data structures, and the like, alone or in combination, and the program instructions recorded on the medium may be specially designed and configured for the embodiment or may be known and available to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include machine code, such as that produced by a compiler, as well as high-level language code that can be executed by a computer using an interpreter or the like.

The hardware devices described above may be configured to operate as one or more software modules in order to perform the operations of the embodiments, and vice versa.

Although the embodiments have been described with reference to a limited set of drawings, a person of ordinary skill in the art can apply various technical modifications and variations based on the above. For example, appropriate results may be achieved even if the described techniques are performed in an order different from that described, and/or the components of the described systems, structures, devices, circuits, and the like are coupled or combined in a form different from that described, or are replaced or substituted by other components or equivalents.

Therefore, other implementations, other embodiments, and equivalents of the claims also fall within the scope of the following claims.

Claims

A speech recognition method performed by a processor, the method comprising:
obtaining candidate word sequences included in a beam corresponding to a target frame of a speech signal, and probabilities of the candidate word sequences corresponding to the speech signal;
expanding the candidate word sequences based on the target frame and the probabilities of the candidate word sequences;
determining whether at least one of the expanded candidate word sequences is included in a predetermined number of first upper word sequences taken in descending order of probability from a preliminary candidate group, the preliminary candidate group including the candidate word sequences and the expanded candidate word sequences of the candidate word sequences;
updating the preliminary candidate group by further expanding, based on a result of the determination, the at least one expanded candidate word sequence included in the first upper word sequences; and
determining a predetermined number of second upper word sequences, taken in descending order of probability from the preliminary candidate group, as a beam corresponding to a frame next to the target frame.

The speech recognition method of claim 1,
wherein updating the preliminary candidate group comprises,
when it is determined that the at least one expanded candidate word sequence is included in the first upper word sequences:
further expanding the at least one expanded candidate word sequence based on the target frame and a probability of the at least one expanded candidate word sequence included in the first upper word sequences; and
updating the preliminary candidate group by adding the further expanded candidate word sequence to the preliminary candidate group.

The speech recognition method of claim 2,
wherein determining the beam corresponding to the next frame comprises:
selecting a predetermined number of second upper word sequences in descending order of probability from the updated preliminary candidate group; and
determining the second upper word sequences as the beam corresponding to the next frame.

The speech recognition method of claim 1,
wherein updating the preliminary candidate group comprises
not updating the preliminary candidate group when it is determined that the at least one expanded candidate word sequence is not included in the first upper word sequences, and
wherein the first upper word sequences and the second upper word sequences are identical to each other.

The speech recognition method of claim 1,
wherein expanding the candidate word sequences comprises,
for each of the candidate word sequences:
obtaining an expanded candidate word sequence of the candidate word sequence by adding a candidate word to the candidate word sequence;
obtaining, based on the candidate word sequence, a probability that the target frame corresponds to the candidate word; and
obtaining a probability of the expanded candidate word sequence based on the probability of the candidate word sequence and the probability that the target frame corresponds to the candidate word.

The speech recognition method of claim 1,
further comprising encoding the speech signal to obtain a plurality of frames including features extracted from the speech signal.

The speech recognition method of claim 1,
wherein expanding the candidate word sequences comprises
obtaining the expanded candidate word sequences and the probabilities of the expanded candidate word sequences based on an end-to-end speech recognition model, and
wherein the end-to-end speech recognition model includes a neural network trained, based on training data including speech data and text labels corresponding to the speech data, to receive the speech data and output the text labels.

The speech recognition method of claim 1,
wherein, when the target frame is the last frame of the speech signal,
determining the beam corresponding to the next frame further comprises
outputting at least some of the second upper word sequences as a speech recognition result corresponding to the speech signal.

The speech recognition method of claim 1,
wherein the expanded candidate word sequences of the candidate word sequences
include word sequences having the candidate word sequences as prefixes.

The speech recognition method of claim 1,
wherein a word sequence
includes a word and a sequence of a plurality of words, and
wherein the word
includes a single character corresponding to a specific language and a combination of a plurality of characters corresponding to the specific language.

A beam search method performed by a processor, the method comprising:
adding candidate word sequences included in a beam corresponding to a target frame of encoded data to a preliminary candidate group;
performing an expansion operation on the candidate word sequences;
determining whether to repeat the expansion operation based on whether word sequences selected by the expansion operation include an expanded word sequence of the candidate word sequences obtained by the expansion operation;
performing the expansion operation on the selected word sequences when it is determined to repeat the expansion operation; and
determining the selected word sequences as a beam corresponding to a next frame of the encoded data when it is determined not to repeat the expansion operation,
wherein the expansion operation comprises:
adding, to the preliminary candidate group, expanded word sequences of the word sequences subject to the expansion operation; and
selecting, from the preliminary candidate group, a certain number of word sequences according to a predetermined criterion.

The beam search method of claim 11,
wherein determining whether to repeat the expansion operation includes:
determining to repeat the expansion operation when the selected word sequences include the extended word sequence; and
determining not to repeat the expansion operation when the selected word sequences do not include the extended word sequence.

The beam search method of claim 11,
wherein determining whether to repeat the expansion operation includes determining whether to repeat the expansion operation based on whether a predetermined repetition end condition is satisfied.
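
The beam search method recited in the three claims above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: word sequences are tuples of words, the "predetermined criterion" is taken to be highest probability, the repetition end condition is taken to be a cap on the number of rounds (`max_rounds`), only newly added sequences that survive selection are expanded again, and a duplicate sequence keeps its higher probability. The names `expansion_operation`, `beam_step`, `toy_extend`, and `TOY_WORD_PROBS` are hypothetical.

```python
from typing import Callable, Dict, List, Set, Tuple

Seq = Tuple[str, ...]                    # assumed: a word sequence is a tuple of words
Scored = Dict[Seq, float]                # word sequence -> probability
Extend = Callable[[Seq, float], Scored]  # hypothetical scorer supplied by the caller


def expansion_operation(pool: Scored, targets: List[Seq], extend: Extend,
                        beam_size: int) -> Tuple[List[Seq], Set[Seq]]:
    """Add extensions of `targets` to the preliminary candidate group `pool`,
    then select the `beam_size` most probable sequences from the pool."""
    added: Set[Seq] = set()
    for seq in targets:
        for new_seq, prob in extend(seq, pool[seq]).items():
            if new_seq not in pool:
                added.add(new_seq)
            # Assumption: on a duplicate, keep the higher probability.
            pool[new_seq] = max(pool.get(new_seq, 0.0), prob)
    selected = sorted(pool, key=pool.get, reverse=True)[:beam_size]
    return selected, added


def beam_step(beam: Scored, extend: Extend, beam_size: int,
              max_rounds: int = 10) -> Scored:
    """Process one target frame and return the beam for the next frame."""
    pool = dict(beam)        # preliminary candidate group starts as the beam
    targets = list(beam)
    selected = list(beam)
    for _ in range(max_rounds):          # assumed repetition end condition
        selected, added = expansion_operation(pool, targets, extend, beam_size)
        # Repeat only if a newly extended sequence was selected.
        targets = [s for s in selected if s in added]
        if not targets:
            break
    return {s: pool[s] for s in selected}


# Toy scorer: fixed per-word probabilities, independent of the frame.
TOY_WORD_PROBS = {"a": 0.6, "b": 0.3}


def toy_extend(seq: Seq, prob: float) -> Scored:
    return {seq + (w,): prob * p for w, p in TOY_WORD_PROBS.items()}
```

With the toy scorer, `beam_step({(): 1.0}, toy_extend, beam_size=2)` performs two rounds: the first selects the empty sequence and `("a",)`, and the second expands `("a",)` but none of its extensions is selected, so the loop stops and those two sequences become the next beam.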

A computer program stored in a medium to execute the method of any one of claims 1 to 13 in combination with hardware.

A speech recognition device comprising at least one processor configured to:
obtain candidate word sequences included in a beam corresponding to a target frame of a speech signal and probabilities of the candidate word sequences corresponding to the speech signal;
extend the candidate word sequences based on the target frame and the probabilities of the candidate word sequences;
determine whether at least one of the extended candidate word sequences is included in a predetermined number of first upper word sequences taken in descending order of probability from a preliminary candidate group, the preliminary candidate group including the candidate word sequences and the extended candidate word sequences of the candidate word sequences;
update the preliminary candidate group, based on the determination result, by additionally extending the at least one extended candidate word sequence included in the first upper word sequences; and
determine the predetermined number of second upper word sequences, taken in descending order of probability from the preliminary candidate group, as a beam corresponding to a frame next to the target frame.

The speech recognition device of claim 15,
wherein, in updating the preliminary candidate group, when it is determined that the at least one extended candidate word sequence is included in the first upper word sequences, the processor:
additionally extends the at least one extended candidate word sequence based on the target frame and the probability of the at least one extended candidate word sequence included in the first upper word sequences; and
updates the preliminary candidate group by adding the additionally extended candidate word sequence to the preliminary candidate group.

The speech recognition device of claim 16,
wherein, in determining the beam corresponding to the next frame, the processor:
selects a certain number of second upper word sequences in descending order of probability from the updated preliminary candidate group; and
determines the second upper word sequences as the beam corresponding to the next frame.

The speech recognition device of claim 15,
wherein, in updating the preliminary candidate group, the processor does not update the preliminary candidate group when it is determined that the at least one extended candidate word sequence is not included in the first upper word sequences,
and the first upper word sequences and the second upper word sequences are identical.

The speech recognition device of claim 15,
wherein, in extending the candidate word sequences, the processor, for each of the candidate word sequences:
adds a candidate word to the corresponding candidate word sequence to obtain an extended candidate word sequence of the corresponding candidate word sequence;
obtains a probability that the target frame corresponds to the candidate word based on the corresponding candidate word sequence; and
obtains a probability of the extended candidate word sequence based on the probability of the corresponding candidate word sequence and the probability that the target frame corresponds to the candidate word.
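
The extension and scoring steps recited above can be sketched in Python. The claim states only that the probability of the extended sequence is "based on" the two probabilities; this sketch assumes a plain product, which is one common choice (in practice log-probabilities are often summed instead). The function name `extend_candidate` and the `word_probs` argument are hypothetical, with `word_probs[w]` standing for the probability that the target frame corresponds to word `w` given the candidate sequence.

```python
from typing import Dict, Tuple

Seq = Tuple[str, ...]  # assumed representation: a word sequence is a tuple of words


def extend_candidate(seq: Seq, seq_prob: float,
                     word_probs: Dict[str, float]) -> Dict[Seq, float]:
    """Append each candidate word to `seq` and score the extended sequence.

    Assumption: P(extended) = P(seq) * P(target frame corresponds to word | seq).
    """
    return {seq + (word,): seq_prob * p for word, p in word_probs.items()}
```

For example, extending `("hi",)` with probability 0.5 by the words `"there"` (0.5) and `"you"` (0.25) gives `("hi", "there")` with probability 0.25 and `("hi", "you")` with probability 0.125.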

A beam search device comprising at least one processor configured to:
add candidate word sequences included in a beam corresponding to a target frame of encoded data to a preliminary candidate group;
perform an expansion operation on the candidate word sequences;
determine whether to repeat the expansion operation based on whether the word sequences selected by the expansion operation include an extended word sequence of the candidate word sequences obtained by the expansion operation;
perform the expansion operation on the selected word sequences when it is determined to repeat the expansion operation; and
determine the selected word sequences as the beam corresponding to a next frame of the encoded data when it is determined not to repeat the expansion operation,
wherein the expansion operation includes adding, to the preliminary candidate group, extended word sequences of the word sequences subject to the expansion operation, and selecting a certain number of word sequences from the preliminary candidate group according to a predetermined criterion.