KR102409873B1

KR102409873B1 - Method and system for training speech recognition models using augmented consistency regularization

Info

Publication number: KR102409873B1
Application number: KR1020200111929A
Authority: KR
Inventors: 김희수; 방지환; 유영준; 하정우
Original assignee: 네이버 주식회사; 라인 가부시키가이샤
Priority date: 2020-09-02
Filing date: 2020-09-02
Publication date: 2022-06-16
Also published as: KR20220030120A; JP7044856B2; JP2022042460A

Abstract

The present disclosure relates to a method for training a speech recognition model. The method comprises: receiving a plurality of unlabeled speech samples; extracting, using the speech recognition model, a first set of speech samples for human labeling from the plurality of speech samples; receiving a first set of labels corresponding to the first set of speech samples; extracting, using the speech recognition model, a second set of speech samples for machine labeling from the plurality of speech samples; determining, using the speech recognition model, a second set of labels corresponding to the second set of speech samples; augmenting the second set of speech samples; and updating the speech recognition model by performing semi-supervised learning based on the first set of speech samples, the first set of labels, the augmented second set of speech samples, and the second set of labels.

Description

Speech recognition model training method and system using augmented consistency regularization

The present disclosure relates to a method and system for training a speech recognition model, and more particularly, to an efficient method and system for incrementally training a speech recognition model using augmented consistency regularization.

With the rapid development of artificial intelligence technology and Internet of Things (IoT) technology, terminals such as AI speakers and smartphones equipped with an intelligent personal assistant (or virtual assistant) that provides a specific service in response to a user's voice request are widely used. Such an intelligent personal assistant uses AI-based speech recognition technology to recognize the user's voice command and provides a service corresponding to that command. For example, an AI speaker can not only place a phone call in response to a user's voice command, but also launch a specific application, provide weather information, or provide information obtained through an Internet search.

To improve the quality of such a speech recognition service, the speech recognition model must be updated continuously using a large amount of training data. In the prior art, training a speech recognition model requires a human annotator to determine the correct label for each of a large number of speech samples, which is costly.

To solve the above problem, the present disclosure provides a method for training a speech recognition model, a computer program stored in a recording medium, and an apparatus (system).

The present disclosure may be implemented in various ways, including as a method, an apparatus (system), or a computer program stored in a readable storage medium.

According to an embodiment of the present disclosure, a method for training a speech recognition model, performed by at least one processor, includes: receiving a plurality of unlabeled speech samples; extracting, using the speech recognition model, a first set of speech samples for human labeling from the plurality of speech samples; receiving a first set of labels corresponding to the first set of speech samples; extracting, using the speech recognition model, a second set of speech samples for machine labeling from the plurality of speech samples; determining, using the speech recognition model, a second set of labels corresponding to the second set of speech samples; augmenting the second set of speech samples; and updating the speech recognition model by performing semi-supervised learning based on the first set of speech samples, the first set of labels, the augmented second set of speech samples, and the second set of labels.

A computer program stored in a computer-readable recording medium is provided for executing, on a computer, the method for training a speech recognition model according to an embodiment of the present disclosure.

A system for training a speech recognition model according to an embodiment of the present disclosure includes a communication module, a memory, and at least one processor connected to the memory and configured to execute at least one computer-readable program stored in the memory. The at least one program includes instructions for: receiving a plurality of unlabeled speech samples; extracting, using the speech recognition model, a first set of speech samples for human labeling from the plurality of speech samples; receiving a first set of labels corresponding to the first set of speech samples; extracting, using the speech recognition model, a second set of speech samples for machine labeling from the plurality of speech samples; determining, using the speech recognition model, a second set of labels corresponding to the second set of speech samples; augmenting the second set of speech samples; and updating the speech recognition model by performing semi-supervised learning based on the first set of speech samples, the first set of labels, the augmented second set of speech samples, and the second set of labels.

In various embodiments of the present disclosure, the number of speech samples that a human must transcribe into text sequences in order to train the speech recognition model can be reduced, lowering cost while causing almost no degradation in the performance of the speech recognition model. Specifically, the labeling cost can be reduced by about 2/3 while the character-level error rate (CER) increases (that is, performance degrades) by only about 0.26 percentage points, and the labeling cost can be reduced by about 6/7 while the CER increases by only about 1.08 percentage points.
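For readers unfamiliar with the metric, the following is a minimal sketch of how a character-level error rate is commonly computed: the character-level edit distance between each reference transcript and the model output, summed and divided by the total number of reference characters. The function names `edit_distance` and `cer` are illustrative and do not come from the disclosure, which does not specify its exact CER computation.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(references, hypotheses) -> float:
    """Character-level error rate in percent over a set of utterances."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    total_chars = sum(len(r) for r in references)
    return 100.0 * total_edits / total_chars
```

A change expressed in percentage points is the plain difference between two such CER values, not a relative change.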

In various embodiments of the present disclosure, an uncertainty score that takes into account the joint probability of the text sequence for a speech sample can be calculated, and informative samples that are useful for training the speech recognition model can be extracted based on the uncertainty score.
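One way such a score could be formed is sketched below. This passage does not give the exact formula, so the sketch assumes a common choice: the joint probability of the decoded text sequence is the product of the per-token probabilities, and the uncertainty is the length-normalized negative log of that product. The names `sequence_log_prob` and `uncertainty_score`, and the length normalization itself, are assumptions for illustration only.

```python
import math


def sequence_log_prob(token_probs):
    """Log of the joint probability of a decoded sequence,
    assuming it factorizes into per-token probabilities."""
    return sum(math.log(p) for p in token_probs)


def uncertainty_score(token_probs):
    """Length-normalized negative log-likelihood (assumed form).
    A higher value means the model is less confident in its transcript."""
    if not token_probs:
        raise ValueError("token_probs must not be empty")
    return -sequence_log_prob(token_probs) / len(token_probs)
```

Dividing by the sequence length keeps long utterances from being scored as uncertain merely because they contain more tokens.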

In various embodiments of the present disclosure, a speech sample can be augmented without damaging the linguistic information it contains, and such augmentation can improve the efficiency of training the speech recognition model. In addition, the augmented speech samples can be used to improve the robustness of the speech recognition model.

Embodiments of the present disclosure will be described with reference to the accompanying drawings described below, in which like reference numerals denote like elements, but are not limited thereto.
FIG. 1 is a diagram illustrating an example in which a user receives a service from a user terminal through a voice command.
FIG. 2 is a schematic diagram illustrating a configuration in which an information processing system is communicatively connected to a plurality of user terminals in order to provide a speech recognition service and train a speech recognition model according to an embodiment of the present disclosure.
FIG. 3 is a block diagram illustrating the internal configuration of a user terminal and an information processing system according to an embodiment of the present disclosure.
FIG. 4 is a diagram illustrating an example of building an HLS database (DB) and an MLS DB through labeling of speech samples according to an embodiment of the present disclosure.
FIG. 5 is a flowchart illustrating a method for generating an initial speech recognition model according to an embodiment of the present disclosure.
FIG. 6 is a flowchart illustrating a method for incrementally training a speech recognition model according to an embodiment of the present disclosure.
FIG. 7 is a diagram illustrating an example of speech samples for generating, updating, and testing a speech recognition model according to an embodiment of the present disclosure.
FIG. 8 is a graph illustrating the difference in speech recognition model performance depending on the method used to extract speech samples for human labeling.
FIG. 9 is a graph illustrating the difference in speech recognition model performance depending on the speech sample augmentation method of the present disclosure.
FIG. 10 is a graph illustrating the relationship between the training round and speech recognition model performance when the speech recognition model is updated multiple times according to an embodiment of the present disclosure.

Hereinafter, specific details for carrying out the present disclosure will be described with reference to the accompanying drawings. In the following description, however, detailed descriptions of well-known functions or configurations are omitted where they might unnecessarily obscure the gist of the present disclosure.

In the accompanying drawings, identical or corresponding components are assigned the same reference numerals. In the description of the embodiments below, redundant descriptions of identical or corresponding components may be omitted. However, the omission of a description of a component does not mean that the component is excluded from any embodiment.

Advantages and features of the disclosed embodiments, and methods of achieving them, will become apparent with reference to the embodiments described below in conjunction with the accompanying drawings. However, the present disclosure is not limited to the embodiments disclosed below and may be implemented in various different forms; these embodiments are provided only so that the present disclosure is complete and fully conveys the scope of the invention to those skilled in the art.

The terms used in this specification are briefly described first, and the disclosed embodiments are then described in detail. The terms used in this specification are, as far as possible, general terms that are currently in wide use, selected in view of their function in the present disclosure, but they may vary depending on the intention of those working in the relevant field, precedent, the emergence of new technology, and the like. In certain cases, a term has been arbitrarily selected by the applicant, and in that case its meaning is set out in detail in the corresponding part of the description. Therefore, the terms used in the present disclosure should be defined based on their meaning and on the overall content of the present disclosure, rather than on their names alone.

In this specification, a singular expression includes the plural unless the context clearly specifies the singular. Likewise, a plural expression includes the singular unless the context clearly specifies the plural. Throughout the specification, when a part is said to include a certain component, this means that it may further include other components, rather than excluding them, unless specifically stated otherwise.

In addition, the term 'module' or 'unit' used in the specification refers to a software or hardware component, and a 'module' or 'unit' performs certain roles. However, a 'module' or 'unit' is not limited to software or hardware. A 'module' or 'unit' may be configured to reside in an addressable storage medium or may be configured to run on one or more processors. Thus, as an example, a 'module' or 'unit' may include at least one of components such as software components, object-oriented software components, class components, and task components, as well as processes, functions, attributes, procedures, subroutines, segments of program code, drivers, firmware, microcode, circuitry, data, databases, data structures, tables, arrays, or variables. The functionality provided within the components and 'modules' or 'units' may be combined into a smaller number of components and 'modules' or 'units', or further separated into additional components and 'modules' or 'units'.

According to an embodiment of the present disclosure, a 'module' or 'unit' may be implemented with a processor and a memory. 'Processor' should be construed broadly to include a general-purpose processor, a central processing unit (CPU), a microprocessor, a digital signal processor (DSP), a controller, a microcontroller, a state machine, and the like. In some contexts, 'processor' may refer to an application-specific integrated circuit (ASIC), a programmable logic device (PLD), a field-programmable gate array (FPGA), or the like. 'Processor' may also refer to a combination of processing devices, such as a combination of a DSP and a microprocessor, a combination of a plurality of microprocessors, a combination of one or more microprocessors with a DSP core, or any other such configuration. 'Memory' should likewise be construed broadly to include any electronic component capable of storing electronic information. 'Memory' may refer to various types of processor-readable media, such as random access memory (RAM), read-only memory (ROM), non-volatile random access memory (NVRAM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable PROM (EEPROM), flash memory, magnetic or optical data storage, registers, and the like. A memory is said to be in electronic communication with a processor if the processor can read information from and/or write information to the memory. A memory integrated in a processor is in electronic communication with the processor.

In the present disclosure, a 'speech recognition model' may refer to a model that, given speech data as input, outputs text data corresponding to the linguistic information contained in the input speech. That is, the speech recognition model may implement speech-to-text (STT) technology. In an embodiment of the present disclosure, the speech recognition model may be an artificial neural network model that is generated or updated by performing supervised learning, unsupervised learning, or semi-supervised learning using training data. For example, the speech recognition model may be an end-to-end automatic speech recognition (E2E-ASR) model based on Listen, Attend and Spell (LAS).

In the present disclosure, a 'speech sample' may refer to a user's speech data collected to train, update, and test a speech recognition model. A speech sample may be collected data that has been preprocessed into a fixed format. For example, a speech sample may include a spectrogram extracted from the collected speech data using a Hamming window with a window length of 200 ms and a stride length of 100 ms.
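The preprocessing step described above can be sketched as follows, as an illustration rather than the disclosed implementation. The sketch frames the signal with the stated 200 ms window and 100 ms stride, applies a Hamming window to each frame, and takes the magnitude of a discrete Fourier transform. The naive DFT is used only to keep the example dependency-free; a practical pipeline would use an FFT library. The function names and the choice of a magnitude (rather than power or log) spectrogram are assumptions.

```python
import cmath
import math


def hamming(n: int) -> list:
    """Hamming window coefficients of length n."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def spectrogram(signal, sample_rate, window_ms=200, stride_ms=100):
    """Magnitude spectrogram: one list of frequency-bin magnitudes per frame."""
    win_len = sample_rate * window_ms // 1000  # samples per window
    hop = sample_rate * stride_ms // 1000      # samples between frame starts
    window = hamming(win_len)
    frames = []
    for start in range(0, len(signal) - win_len + 1, hop):
        frame = [s * w for s, w in zip(signal[start:start + win_len], window)]
        mags = []
        for k in range(win_len // 2 + 1):      # non-negative frequencies only
            acc = sum(x * cmath.exp(-2j * math.pi * k * n / win_len)
                      for n, x in enumerate(frame))
            mags.append(abs(acc))
        frames.append(mags)
    return frames
```

With these settings consecutive frames overlap by half a window, and bin `k` corresponds to a frequency of `k * sample_rate / win_len` Hz.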

In the present disclosure, a 'label' may refer to a text sequence corresponding to a speech sample. For example, a label may be a text transcription of the linguistic information or linguistic meaning contained in a speech sample. Labels may include a pseudo label, which is output when a speech sample is input to the speech recognition model, and a ground-truth label, which is transcribed for the speech sample by a human annotator.

FIG. 1 is a diagram illustrating an example in which a user 110 receives a service from a user terminal 120 through a voice command. In an embodiment, the user terminal 120 may receive a voice command from the user 110 through an input device such as a microphone. In this case, the user terminal 120 may recognize the received voice command using a speech recognition model and provide information and/or a service corresponding to the recognized voice command to the user 110. As illustrated, when the user 110 utters the voice command "Tell me today's weather", the user terminal 120 may automatically recognize the voice command and output today's weather forecast through a speaker or the like.

The user terminal 120 may be any device configured to recognize a voice command uttered by the user 110 and provide a service or information corresponding to the voice command. For example, the user terminal 120 may provide services such as a voice search service, an artificial intelligence (AI) assistant service, a map navigation service, and a set-top box control service. Although the user terminal 120 is illustrated as an AI speaker in FIG. 1, it is not limited thereto and may be any device capable of recognizing a voice command and providing a corresponding service.

To recognize the voice command of the user 110, the user terminal 120 may use a speech recognition model generated through machine learning or the like. Such a speech recognition model may be updated through iterative or incremental training to increase recognition accuracy. The performance of the speech recognition model can be maximized by using as many human labeled samples (HLS) as possible, that is, samples for which a person has listened to the speech sample and produced the ground-truth label directly. In practice, however, training a speech recognition model using HLS alone is difficult because of limits on labeling cost. In particular, labeling a speech sample, which requires a person to listen to and transcribe it, costs far more than labeling an image, so a machine learning approach is needed that maximizes speech recognition performance while minimizing the human labeling cost.

In an embodiment, semi-supervised learning (SSL) is combined with active learning (AL) to minimize the number of HLS, and a consistency regularization (CR) technique may be used to further improve training efficiency with unlabeled speech samples. Specifically, a plurality of HLS may be prepared by extracting, from a pool of unlabeled speech samples, the n speech samples with the highest uncertainty scores (that is, those for which the speech recognition model has the lowest confidence) and performing human labeling on them. Here, n is a natural number and may be determined according to the budget for human labeling. In addition, a plurality of machine labeled samples (MLS) may be prepared by extracting, from the speech samples remaining in the unlabeled pool, those whose uncertainty score is below a predetermined threshold (that is, those for which the confidence of the speech recognition model exceeds a threshold), performing machine labeling on them, and augmenting the speech samples. The speech recognition model may then be trained or updated using the HLS and MLS together. Here, the MLS may play a supporting role to the HLS in training or updating the speech recognition model.
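The selection rule described above can be sketched as follows. The sketch assumes that each sample in the unlabeled pool already has an uncertainty score computed by the current model. The function name `split_for_labeling`, the dictionary representation of the pool, and the sample identifiers are hypothetical.

```python
def split_for_labeling(scores, n_human, threshold):
    """Split an unlabeled pool into human-labeling and machine-labeling sets.

    scores:    mapping from sample id to uncertainty score
    n_human:   number of samples the human labeling budget allows
    threshold: samples scoring below this value are machine labeled
    """
    # The n most uncertain samples are sent to human annotators.
    ranked = sorted(scores, key=scores.get, reverse=True)
    human = ranked[:n_human]
    # Of the remainder, only confidently recognized samples are pseudo-labeled.
    remaining = ranked[n_human:]
    machine = [sid for sid in remaining if scores[sid] < threshold]
    return human, machine


# Hypothetical pool of five samples with made-up scores.
pool = {"a": 0.9, "b": 0.1, "c": 0.5, "d": 0.05, "e": 0.8}
human_ids, machine_ids = split_for_labeling(pool, n_human=2, threshold=0.2)
```

Samples that fall in neither set, such as `"c"` in this pool, simply stay unlabeled for the current round.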

FIG. 2 is a schematic diagram illustrating a configuration in which an information processing system 230 is communicatively connected to a plurality of user terminals 210_1, 210_2, and 210_3 in order to provide a speech recognition service and train a speech recognition model according to an embodiment of the present disclosure. The information processing system 230 may include one or more systems capable of providing a speech recognition-based service over a network 220 and/or one or more systems capable of training a speech recognition model. In an embodiment, the information processing system 230 may include one or more server devices and/or databases, or one or more distributed computing devices and/or distributed databases based on a cloud computing service, capable of storing, providing, and executing computer-executable programs (for example, downloadable applications) and data related to the speech recognition-based service or to speech recognition model training. The speech recognition-based service provided by the information processing system 230 may be provided to users through a voice search application, an AI assistant application, or the like installed on each of the plurality of user terminals 210_1, 210_2, and 210_3. For example, the information processing system 230 may provide information corresponding to, or perform processing corresponding to, a voice command input by a user through the voice search application, the AI assistant application, or the like. Additionally, the information processing system 230 may collect speech samples from the plurality of user terminals 210_1, 210_2, and 210_3 in order to train or update the speech recognition model.

The plurality of user terminals 210_1, 210_2, and 210_3 may communicate with the information processing system 230 through the network 220. The network 220 may be configured to enable communication between the plurality of user terminals 210_1, 210_2, and 210_3 and the information processing system 230. Depending on the installation environment, the network 220 may consist of, for example, a wired network such as Ethernet, a wired home network (power line communication), a telephone line communication device, or RS-serial communication; a wireless network such as a mobile communication network, a wireless LAN (WLAN), Wi-Fi, Bluetooth, or ZigBee; or a combination thereof. The communication method is not limited and may include not only communication using a communication network that the network 220 may include (for example, a mobile communication network, the wired Internet, the wireless Internet, a broadcasting network, or a satellite network) but also short-range wireless communication between the user terminals 210_1, 210_2, and 210_3.

Although a mobile phone terminal 210_1, a tablet terminal 210_2, and a PC terminal 210_3 are illustrated in FIG. 2 as examples of user terminals, the user terminals 210_1, 210_2, and 210_3 are not limited thereto and may be any computing device capable of wired and/or wireless communication on which a voice-based service application, a search application, a web browser application, or the like can be installed and executed. For example, a user terminal may include an AI speaker, a smartphone, a mobile phone, a navigation device, a computer, a laptop, a digital broadcasting terminal, a personal digital assistant (PDA), a portable multimedia player (PMP), a tablet PC, a game console, a wearable device, an Internet of Things (IoT) device, a virtual reality (VR) device, an augmented reality (AR) device, a set-top box, and the like. In addition, although FIG. 2 shows three user terminals 210_1, 210_2, and 210_3 communicating with the information processing system 230 through the network 220, the configuration is not limited thereto, and a different number of user terminals may be configured to communicate with the information processing system 230 through the network 220.

FIG. 3 is a block diagram illustrating the internal configuration of the user terminal 210 and the information processing system 230 according to an embodiment of the present disclosure. The user terminal 210 may refer to any computing device capable of executing a voice-based service application or the like and capable of wired/wireless communication, and may include, for example, the mobile phone terminal 210_1, the tablet terminal 210_2, and the PC terminal 210_3 of FIG. 2. As shown, the user terminal 210 may include a memory 312, a processor 314, a communication module 316, and an input/output interface 318. Similarly, the information processing system 230 may include a memory 332, a processor 334, a communication module 336, and an input/output interface 338. As shown in FIG. 3, the user terminal 210 and the information processing system 230 may be configured to communicate information and/or data through the network 220 using their respective communication modules 316 and 336. In addition, the input/output device 320 may be configured to input information and/or data to the user terminal 210 through the input/output interface 318, or to output information and/or data generated by the user terminal 210.

The memories 312 and 332 may include any non-transitory computer-readable recording medium. According to one embodiment, the memories 312 and 332 may include random access memory (RAM), read only memory (ROM), and a permanent mass storage device such as a disk drive, a solid state drive (SSD), or flash memory. As another example, a permanent mass storage device such as a ROM, an SSD, flash memory, or a disk drive may be included in the user terminal 210 or the information processing system 230 as a separate permanent storage device distinct from the memory. In addition, the memories 312 and 332 may store an operating system and at least one program code (for example, code for a voice-based service application installed and run on the user terminal 210).

These software components may be loaded from a computer-readable recording medium separate from the memories 312 and 332. Such a separate computer-readable recording medium may include a recording medium that can be directly connected to the user terminal 210 and the information processing system 230, for example, a floppy drive, a disk, a tape, a DVD/CD-ROM drive, or a memory card. As another example, the software components may be loaded into the memories 312 and 332 through a communication module rather than from a computer-readable recording medium. For example, at least one program may be loaded into the memories 312 and 332 based on a computer program installed from files that developers, or a file distribution system that distributes application installation files, provide through the network 220.

The processors 314 and 334 may be configured to process instructions of a computer program by performing basic arithmetic, logic, and input/output operations. The instructions may be provided to the processors 314 and 334 by the memories 312 and 332 or the communication modules 316 and 336. For example, the processors 314 and 334 may be configured to execute received instructions according to program code stored in a recording device such as the memories 312 and 332.

The communication modules 316 and 336 may provide a configuration or function that allows the user terminal 210 and the information processing system 230 to communicate with each other through the network 220, and may provide a configuration or function that allows the user terminal 210 and/or the information processing system 230 to communicate with another user terminal or another system (for example, a separate cloud system). For example, a request or data (for example, data corresponding to a user's voice command) that the processor 314 of the user terminal 210 generates according to program code stored in a recording device such as the memory 312 may be transmitted to the information processing system 230 through the network 220 under the control of the communication module 316. Conversely, a control signal or command provided under the control of the processor 334 of the information processing system 230 may pass through the communication module 336 and the network 220 and be received by the user terminal 210 through its communication module 316. For example, the user terminal 210 may receive information associated with a voice command from the information processing system 230 through the communication module 316.

The input/output interface 318 may be a means for interfacing with the input/output device 320. As an example, the input device may include devices such as a camera including an audio sensor and/or an image sensor, a keyboard, a microphone, and a mouse, and the output device may include devices such as a display, a speaker, and a haptic feedback device. As another example, the input/output interface 318 may be a means for interfacing with a device, such as a touch screen, in which the configurations or functions for input and output are integrated into one. For example, when the processor 314 of the user terminal 210 processes instructions of a computer program loaded into the memory 312, a service screen or the like composed using information and/or data provided by the information processing system 230 or another user terminal may be shown on the display through the input/output interface 318. Although FIG. 3 shows the input/output device 320 as not being included in the user terminal 210, the configuration is not limited thereto, and the input/output device 320 may be configured as a single device together with the user terminal 210. In addition, the input/output interface 338 of the information processing system 230 may be a means for interfacing with a device (not shown) for input or output that is connected to, or may be included in, the information processing system 230. Although FIG. 3 shows the input/output interfaces 318 and 338 as elements separate from the processors 314 and 334, the configuration is not limited thereto, and the input/output interfaces 318 and 338 may be configured to be included in the processors 314 and 334.

The user terminal 210 and the information processing system 230 may include more components than those shown in FIG. 3. However, most conventional components need not be explicitly illustrated. According to one embodiment, the user terminal 210 may be implemented to include at least some of the input/output devices 320 described above. In addition, the user terminal 210 may further include other components such as a transceiver, a GPS (Global Positioning System) module, a camera, various sensors, and a database. For example, when the user terminal 210 is a smartphone, it may include the components that a smartphone generally includes; for example, the user terminal 210 may be implemented to further include various components such as an acceleration sensor, a gyro sensor, a camera module, various physical buttons, buttons using a touch panel, input/output ports, and a vibrator for vibration.

According to one embodiment, the processor 314 of the user terminal 210 may be configured to run an application or the like that provides a voice-based service. In this case, code associated with the application and/or program may be loaded into the memory 312 of the user terminal 210. While the application and/or program is running, the processor 314 of the user terminal 210 may receive information and/or data provided by the input/output device 320 through the input/output interface 318, or may receive information and/or data from the information processing system 230 through the communication module 316, and may process the received information and/or data and store it in the memory 312. Such information and/or data may also be provided to the information processing system 230 through the communication module 316.

While a program for a voice-based service application or the like is running, the processor 314 may receive text, images, video, speech, and the like that are input or selected through an input device connected to the input/output interface 318, such as a touch screen, a keyboard, a camera including an audio sensor and/or an image sensor, or a microphone. The processor 314 may store the received text, images, video, and/or speech in the memory 312, or provide them to the information processing system 230 through the communication module 316 and the network 220. In one embodiment, the processor 314 may provide data related to a voice command, entered by the user in the voice-based service application through the input device, to the information processing system 230 through the network 220 and the communication module 316. The processor 334 of the information processing system 230 may be configured to manage, process, and/or store information and/or data received from a plurality of user terminals and/or a plurality of external systems. In one embodiment, the information processing system 230 may provide the user terminal 210 with information corresponding to the voice-command-related data received from the user terminal 210. Additionally, the information processing system 230 may collect unlabeled speech samples from the user terminal 210.

FIG. 4 is a diagram illustrating an example of building an HLS database (DB) 460 and an MLS DB 470 by labeling speech samples 410 according to an embodiment of the present disclosure. The processor of the information processing system may collect unlabeled speech samples 410 from user terminals. The collected speech samples 410 may be stored in an unlabeled speech sample DB 420. Because human labeling of every collected speech sample is costly, the processor may use the speech recognition model 440 to extract, from the speech samples 410, the speech samples on which human labeling is to be performed.

The processor may use uncertainty-based active learning (AL) to select, from among the speech samples 410, informative samples that are useful for training the speech recognition model 440. Specifically, the processor may extract speech samples 422 for human labeling based on the uncertainty score of each speech sample. In one embodiment, the processor may use the previously generated speech recognition model 440 to calculate the uncertainty scores of the speech samples in the unlabeled speech sample DB 420, and extract the n speech samples 422 with the highest uncertainty scores. Here, n is a natural number and may be determined according to the human labeling cost budget.
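The top-n selection described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `select_for_human_labeling`, the sample IDs, and the score values are all hypothetical, and the uncertainty scores are assumed to have been computed already by the speech recognition model.

```python
from typing import Dict, List


def select_for_human_labeling(uncertainty: Dict[str, float], n: int) -> List[str]:
    """Return the IDs of the n samples with the highest uncertainty scores.

    `uncertainty` maps a (hypothetical) sample ID to its uncertainty score;
    `n` stands for the human labeling budget.
    """
    if n < 0:
        raise ValueError("n must be non-negative")
    # Sort IDs from most to least uncertain and keep the first n.
    ranked = sorted(uncertainty, key=lambda sid: uncertainty[sid], reverse=True)
    return ranked[:n]


# Illustrative scores only; real scores would come from the model's decoder.
scores = {"utt_a": 0.12, "utt_b": 0.87, "utt_c": 0.45, "utt_d": 0.91}
picked = select_for_human_labeling(scores, n=2)
```

With these illustrative scores, the two least confident utterances are the ones sent to the annotator, which is the intent of budget-limited active learning.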

In one embodiment, the uncertainty score of a speech sample may represent the length-normalized joint probability of the text sequence output by the speech recognition model 440. For example, the uncertainty score and the confidence score of a speech sample may be calculated using Equations 1 to 3.

The quantities used in Equations 1 to 3 are: the speech sample in the speech sample DB 420; the text sequence output by the speech recognition model 440 (i.e., the most probable decoded text); the joint probability of the output text sequence; the length of the output text sequence; the length-normalized log joint probability; the uncertainty score of the speech sample; and the confidence score of the speech sample. To prevent the joint probability of long text from being underestimated, the processor may normalize the joint probability by the length of the output text. In one embodiment, the uncertainty score of a speech sample may be calculated in the decoder portion of the speech recognition model 440 while the speech recognition model 440 decodes the output text sequence of that speech sample.
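The effect of length normalization can be illustrated with a small sketch. The length-normalized log joint probability below follows the definition in the text (log joint probability divided by sequence length). How that value is mapped to the confidence and uncertainty scores is an assumption made here for illustration (confidence as the exponential of the normalized log probability, uncertainty as one minus confidence); the function names and the token probabilities are hypothetical.

```python
import math
from typing import Sequence


def length_normalized_log_prob(token_probs: Sequence[float]) -> float:
    """Log joint probability of the decoded sequence divided by its length.

    `token_probs` holds the probability the decoder assigned to each emitted
    token; every value must be in (0, 1].
    """
    if not token_probs:
        raise ValueError("token_probs must not be empty")
    log_joint = sum(math.log(p) for p in token_probs)
    return log_joint / len(token_probs)


def confidence_score(token_probs: Sequence[float]) -> float:
    # ASSUMPTION: confidence = exp(normalized log prob), i.e. the geometric
    # mean of the per-token probabilities.
    return math.exp(length_normalized_log_prob(token_probs))


def uncertainty_score(token_probs: Sequence[float]) -> float:
    # ASSUMPTION: uncertainty is the complement of the confidence.
    return 1.0 - confidence_score(token_probs)


# Two decodings with the same per-token quality but different lengths.
short_utt = [0.9] * 5
long_utt = [0.9] * 50
```

Without normalization the raw joint probability of `long_utt` is far smaller than that of `short_utt`, even though each token is predicted equally well; after normalization the two utterances receive the same score, which is the underestimation the text refers to.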

The n speech samples 422 with the highest uncertainty scores (lowest confidence scores) may be provided to a human annotator 430 for human labeling. The human annotator 430 may listen to the n speech samples 422 received and generate correct-answer labels 432. A correct-answer label 432 may be a text sequence that transcribes the speech contained in the speech sample. The processor may store the n high-uncertainty speech samples 422 and the n corresponding correct-answer labels 432 in the HLS DB 460 as HLS (Human Labeled Samples). Here, one HLS may consist of a pair of a speech sample and a correct-answer label.

Additionally, the processor may extract speech samples 424 for machine labeling from the unlabeled speech sample DB 420. If MLS are prepared from speech samples with high uncertainty (i.e., samples for which the speech recognition model 440 has low confidence), the MLS may feed incorrect information to the speech recognition model 440 and thereby degrade its performance. Accordingly, the processor may extract, from the speech samples remaining in the speech sample DB 420, at least one speech sample whose uncertainty score is at or below a predetermined threshold (whose confidence score is at or above a threshold) as a low-uncertainty speech sample 424 for machine labeling.
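The threshold-based selection for machine labeling can be sketched in the same style. The names, the threshold value, and the scores are hypothetical; the only behavior taken from the text is that already-selected (human-labeled) samples are excluded and that the remaining samples are kept when their uncertainty is at or below the threshold.

```python
from typing import Dict, List, Set


def select_for_machine_labeling(
    uncertainty: Dict[str, float],
    already_selected: Set[str],
    threshold: float,
) -> List[str]:
    """Return IDs of remaining samples whose uncertainty is <= threshold."""
    return [
        sid
        for sid, score in uncertainty.items()
        if sid not in already_selected and score <= threshold
    ]


# Illustrative values only.
scores = {"utt_a": 0.12, "utt_b": 0.87, "utt_c": 0.45, "utt_d": 0.91}
sent_to_annotator = {"utt_b", "utt_d"}
machine_labeled = select_for_machine_labeling(scores, sent_to_annotator, threshold=0.2)
```

Samples that are neither uncertain enough for the annotator nor confident enough for pseudo labeling (here `utt_c`) are simply left in the unlabeled pool.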

The low-uncertainty speech samples 424 may be provided to the speech recognition model 440 for machine labeling. The speech recognition model 440 may predict a pseudo label 442 for each of the speech samples 424 it receives. A pseudo label may be the text sequence that is output when a speech sample is input to the speech recognition model 440.

Because pseudo labels may be both less informative and noisier than HLS, processing MLS in the same way as HLS may not help train/update the speech recognition model 440, or may even feed it incorrect information and degrade its performance. To prevent this, the low-uncertainty speech samples 424 may be provided to a data augmentation unit 450. The data augmentation unit 450 may augment the speech samples 424 it receives to generate augmented speech samples 452. Augmenting a speech sample may mean adding distortion, noise, or the like to the speech sample. Unlike in image sample augmentation, the linguistic information contained in a speech sample is highly vulnerable to distortion and noise and can easily be damaged by them. Therefore, the speech sample augmentation process must be carefully designed so that the linguistic meaning of the speech sample does not change even when distortion, noise, or the like is added.

According to one embodiment, the data augmentation unit 450 may perform pitch shifting on the speech samples 424. Alternatively, the data augmentation unit 450 may perform time scaling on the speech samples 424. Alternatively, the data augmentation unit 450 may add additive white Gaussian noise (AWGN) to the speech samples 424. The processor may store the augmented speech samples 452 and the corresponding pseudo labels 442 in the MLS DB 470 as MLS (Machine Labeled Samples). Here, one MLS may consist of a pair of an augmented speech sample and a pseudo label.
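Two of these augmentations can be sketched with the standard library alone. The sketch makes several assumptions that are not in the source: the waveform is a list of float samples, the noise level is specified as a signal-to-noise ratio in dB, and the speed change is done by naive linear-interpolation resampling. Note that naive resampling changes duration and pitch together; true pitch shifting or time scaling, which change one while preserving the other, would need a dedicated algorithm (for example a phase vocoder) and is not shown here. All function names are hypothetical.

```python
import math
import random
from typing import List


def add_awgn(signal: List[float], snr_db: float, seed: int = 0) -> List[float]:
    """Add white Gaussian noise so that the result has roughly `snr_db` dB SNR."""
    if not signal:
        raise ValueError("signal must not be empty")
    rng = random.Random(seed)
    signal_power = sum(s * s for s in signal) / len(signal)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [s + rng.gauss(0.0, sigma) for s in signal]


def resample_linear(signal: List[float], rate: float) -> List[float]:
    """Naive speed change by linear interpolation (rate > 1 shortens the signal).

    This stand-in alters duration AND pitch; it is not pitch-preserving
    time scaling.
    """
    if not signal or rate <= 0:
        raise ValueError("signal must be non-empty and rate must be positive")
    n_out = max(1, int(round(len(signal) / rate)))
    last = len(signal) - 1
    out = []
    for i in range(n_out):
        pos = i * rate
        lo = min(int(pos), last)
        hi = min(lo + 1, last)
        frac = pos - int(pos)
        out.append(signal[lo] * (1.0 - frac) + signal[hi] * frac)
    return out


# A 0.1 s, 440 Hz tone at an assumed 16 kHz sampling rate.
tone = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(1600)]
noisy = add_awgn(tone, snr_db=20.0)
faster = resample_linear(tone, rate=1.25)
```

Keeping the perturbation mild (a moderately high SNR, a rate close to 1) is one way to respect the concern in the text that the linguistic content of the sample should survive augmentation.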

The processor may update the speech recognition model 440 using the HLS in the HLS DB 460 and the MLS in the MLS DB 470. According to one embodiment, the processor may update the speech recognition model 440 by performing semi-supervised learning based on the speech sample/correct-answer label pairs stored in the HLS DB 460 and the augmented speech sample/pseudo label pairs stored in the MLS DB 470. Updating the speech recognition model 440 with both HLS and MLS may improve the robustness of the speech recognition model 440.

According to one embodiment, the processor may update the speech recognition model 440 so as to minimize the difference between the output data that the speech recognition model 440 predicts for a speech sample 422 and the correct-answer label 432 of that speech sample 422. For example, this difference between the predicted output data and the correct-answer label 432 may be calculated by a standard cross-entropy loss function.

Here, the terms of the loss function are: the supervised loss; B, the size of the mini-batch; the length of the n-th HLS sample; the correct-answer label 432; the output data predicted by the speech recognition model 440 (i.e., the text sequence that the speech recognition model 440 predicts for the speech sample 422); and H, the cross-entropy.
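One form of the supervised loss that is consistent with the terms listed above is a per-token cross-entropy, normalized by the sample length and averaged over the mini-batch. The symbols below are assumptions chosen for illustration and are not the patent's own notation: $\mathcal{L}_{s}$ for the supervised loss, $L_{n}$ for the length of the n-th HLS sample, $y_{n,l}$ for the l-th token of its correct-answer label, and $p_{n,l}$ for the model's predicted distribution at that position.

```latex
\mathcal{L}_{s} \;=\; \frac{1}{B} \sum_{n=1}^{B} \frac{1}{L_{n}} \sum_{l=1}^{L_{n}}
  H\!\left(y_{n,l},\; p_{n,l}\right)
```

Dividing by $L_{n}$ keeps long transcripts from dominating the batch average, in the same spirit as the length normalization used for the uncertainty score.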

In addition, the processor may update the speech recognition model 440 so as to minimize the difference between the output data that the speech recognition model 440 predicts for an augmented speech sample 452 and the pseudo label 442 of the corresponding speech sample 424. For example, this difference between the predicted output data and the pseudo label 442 may be calculated by a standard cross-entropy loss function.

Here, the terms of the loss function are: the unsupervised loss; B, the size of the mini-batch; the length of the n-th MLS sample; A, the augmentation function; the output data that the speech recognition model 440 predicts for the augmented speech sample 452 (i.e., the text sequence that the speech recognition model 440 predicts for the augmented speech sample 452); the pseudo label 442; and H, the cross-entropy.
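A matching form for the unsupervised loss, again written with assumed symbols rather than the patent's notation, replaces the correct-answer label with the pseudo label and evaluates the model on the augmented input. Here $\mathcal{L}_{u}$ is the unsupervised loss, $x_{n}$ the n-th unaugmented speech sample, $A(x_{n})$ its augmented version, $\hat{y}_{n,l}$ the l-th token of the pseudo label predicted from $x_{n}$, and $p_{n,l}(\cdot)$ the model's predicted distribution at position l.

```latex
\mathcal{L}_{u} \;=\; \frac{1}{B} \sum_{n=1}^{B} \frac{1}{L_{n}} \sum_{l=1}^{L_{n}}
  H\!\left(\hat{y}_{n,l},\; p_{n,l}\!\left(A(x_{n})\right)\right)
```

Because the pseudo label comes from the unaugmented sample while the prediction comes from the augmented one, minimizing this term encourages the model to give consistent outputs under augmentation.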

The total loss ($\mathcal{L}$) used to update the speech recognition model 440 may be defined by combining the supervised loss ($\mathcal{L}_{s}$) and the unsupervised loss ($\mathcal{L}_{u}$), as in Equation 6 below.

$$\mathcal{L} = \mathcal{L}_{s} + \lambda\,\mathcal{L}_{u} \qquad \text{(Equation 6)}$$

Here, $\lambda$ may denote the coefficient of the unsupervised loss. For example, $\lambda$ may be a constant between 0 and 1. In the process of updating the speech recognition model 440 through semi-supervised learning, $\lambda$ may be used to give more weight to the supervised loss, which uses HLS, the reliable samples. The processor may perform semi-supervised learning so that the total loss ($\mathcal{L}$) is minimized.
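The combination of the two losses can be checked numerically with a small sketch. It assumes one-hot labels, so that the cross-entropy at each position reduces to the negative log probability of the labeled token, and it assumes the per-sequence length normalization and mini-batch averaging described in the text. The data layout, function names, and numbers are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

# One sample: (label token IDs, predicted distribution at each position).
Sample = Tuple[Sequence[int], Sequence[Sequence[float]]]


def sequence_loss(label: Sequence[int], probs: Sequence[Sequence[float]]) -> float:
    """Length-normalized cross-entropy of one sequence against one-hot labels."""
    if not label or len(label) != len(probs):
        raise ValueError("label and probs must be non-empty and equally long")
    return -sum(math.log(step[tok]) for tok, step in zip(label, probs)) / len(label)


def batch_loss(batch: List[Sample]) -> float:
    """Average of the per-sequence losses over the mini-batch."""
    return sum(sequence_loss(label, probs) for label, probs in batch) / len(batch)


def total_loss(hls_batch: List[Sample], mls_batch: List[Sample], lam: float) -> float:
    """Supervised loss plus `lam` times the unsupervised loss."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam is expected to lie between 0 and 1")
    return batch_loss(hls_batch) + lam * batch_loss(mls_batch)


# HLS: predictions for a human-labeled sample.
hls = [([0, 1], [[0.5, 0.5], [0.5, 0.5]])]
# MLS: predictions for an AUGMENTED sample, scored against its pseudo label.
mls = [([1], [[0.75, 0.25]])]
loss = total_loss(hls, mls, lam=0.5)
```

Here the supervised term is ln 2 and the unsupervised term is 2 ln 2, so with a coefficient of 0.5 the noisier pseudo-labeled batch contributes only as much as the human-labeled one despite having twice the raw loss.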

In one embodiment, each time a certain amount of speech samples 410 is added to the unlabeled speech sample DB 420, the processor may repeat the process of storing new HLS and MLS in the HLS DB 460 and the MLS DB 470 according to the flow described above, and updating the speech recognition model 440 using the HLS in the HLS DB 460 and the MLS in the MLS DB 470.

FIG. 5 is a flowchart illustrating a method 500 for generating an initial speech recognition model according to an embodiment of the present disclosure. In one embodiment, the method 500 for generating an initial speech recognition model may be performed by a processor (for example, at least one processor of an information processing system). As shown, the method 500 may begin with the processor receiving a plurality of unlabeled speech samples (S510). The processor may then receive, from a human annotator, a correct-answer label for each of the plurality of unlabeled speech samples (S520).

The processor may then generate the initial speech recognition model based on the pairs of speech samples received in step S510 and correct-answer labels received in step S520 (S530). That is, the processor may generate the initial speech recognition model by performing supervised learning of an artificial neural network model using HLS. Here, one HLS may consist of a pair of a speech sample and a correct-answer label.

FIG. 6 is a flowchart illustrating a method 600 for incrementally training a speech recognition model according to an embodiment of the present disclosure. In one embodiment, the method 600 for training a speech recognition model may be performed by a processor (for example, at least one processor of an information processing system). As shown, the method 600 may begin with the processor receiving a plurality of unlabeled speech samples (S610). The plurality of speech samples may be speech samples collected from user terminals while a speech recognition service is being provided.

In response to receiving the plurality of speech samples, the processor may use the speech recognition model to extract, from the plurality of speech samples, a first set of speech samples for human labeling (S620). In one embodiment, the processor may use the speech recognition model to calculate an uncertainty score for each of the plurality of speech samples, and extract a predetermined number of speech samples with the highest uncertainty scores as the first set of speech samples. Here, the uncertainty score may represent the length-normalized joint probability of the text sequence output by the speech recognition model.

The processor may then receive a first set of labels corresponding to the first set of speech samples (S630). Here, the first set of labels may be correct-answer labels generated by a person. The processor may store the first set of speech samples and the first set of labels as HLS.

In addition, the processor may use the speech recognition model to extract, from the plurality of speech samples, a second set of speech samples for machine labeling (S640). In one embodiment, the processor may extract, as the second set of speech samples, at least one speech sample whose uncertainty score is equal to or less than a predetermined threshold. The number of speech samples in the first set for human labeling may be smaller than the number of speech samples in the second set for machine labeling.
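A threshold-based selection for S640 might look like the sketch below. It is stated in terms of a confidence score rather than an uncertainty score, and assumes the confidence is the geometric mean of per-token probabilities (one possible reading of a length-normalized joint probability); a low uncertainty then corresponds to a high confidence. The function names and the default threshold of 0.9 are illustrative assumptions.

```python
import math
from typing import Dict, List


def confidence_score(token_probs: List[float]) -> float:
    """Geometric mean of token probabilities (assumed length-normalized joint probability)."""
    if not token_probs:
        raise ValueError("token_probs must not be empty")
    return math.exp(sum(math.log(p) for p in token_probs) / len(token_probs))


def select_for_machine_labeling(samples: Dict[str, List[float]], tau: float = 0.9) -> List[str]:
    """Return the ids of samples whose confidence exceeds the threshold tau."""
    return [sid for sid, probs in samples.items() if confidence_score(probs) > tau]


# Hypothetical per-token probabilities for three unlabeled samples.
unlabeled = {"a": [0.95, 0.97], "b": [0.5, 0.99], "c": [0.99, 0.99, 0.8]}
second_set = select_for_machine_labeling(unlabeled)
```

Unlike the fixed-size first set, the size of this set depends on how many samples clear the threshold.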

Thereafter, the processor may use the speech recognition model to determine a second set of labels corresponding to the second set of speech samples (S650). Here, the second set of labels may be pseudo labels predicted by the speech recognition model.

In addition, the processor may augment the second set of speech samples (S660). In one embodiment, the processor may perform pitch shifting on the second set of speech samples. In another embodiment, the processor may perform time scaling on the second set of speech samples. In yet another embodiment, the processor may add additive white Gaussian noise to the second set of speech samples. The processor may store the augmented second set of speech samples and the second set of labels as MLS.
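Two of the augmentations in S660 can be sketched with the standard library alone. The sketch assumes a waveform is a list of float amplitudes and that the target SNR is given in decibels (the text does not state the unit). The speed change below is naive linear-interpolation resampling, which also raises the pitch; a pitch-preserving time scaling or a true pitch shift would normally rely on a phase vocoder or a dedicated audio library, which is outside this sketch. All function names are hypothetical.

```python
import math
import random
from typing import List


def add_awgn(wave: List[float], snr_db: float, rng: random.Random) -> List[float]:
    """Add white Gaussian noise so that the signal-to-noise ratio is about snr_db."""
    signal_power = sum(x * x for x in wave) / len(wave)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in wave]


def change_speed(wave: List[float], rate: float) -> List[float]:
    """Naive resampling: rate > 1 shortens the clip (and shifts its pitch upward)."""
    n_out = int(round(len(wave) / rate))
    out = []
    for i in range(n_out):
        pos = i * (len(wave) - 1) / max(n_out - 1, 1)
        lo = int(math.floor(pos))
        hi = min(lo + 1, len(wave) - 1)
        frac = pos - lo
        out.append((1.0 - frac) * wave[lo] + frac * wave[hi])
    return out


# A synthetic 3000-sample sine wave stands in for a speech sample.
clean = [math.sin(2.0 * math.pi * 5.0 * t / 3000.0) for t in range(3000)]
noisy = add_awgn(clean, snr_db=5.0, rng=random.Random(0))
faster = change_speed(clean, rate=1.5)
```

Only the input audio is changed; the pseudo label determined in S650 is kept as the target for the augmented sample.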

Thereafter, the processor may update the speech recognition model by performing semi-supervised learning based on the first set of speech samples, the first set of labels, the augmented second set of speech samples, and the second set of labels (S670). In one embodiment, the processor may update the speech recognition model so as to minimize the difference between the first set of labels and a first set of output data that the model predicts for the first set of speech samples. Additionally, the processor may update the speech recognition model so as to minimize the difference between the second set of labels and a second set of output data that the model predicts for the augmented second set of speech samples. Here, both differences may be calculated using a standard cross-entropy loss function. As shown, the processor may progressively train and update the speech recognition model by repeatedly performing S610 to S670.
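The objective in S670 can be sketched as a supervised cross-entropy term on HLS plus a weighted cross-entropy term on the augmented MLS. The sketch assumes the model output for a sample is a list of per-step probability distributions already aligned with the label token ids, which glosses over how a real sequence model is decoded and aligned. The names `semi_supervised_loss` and `unsup_weight` are hypothetical; the default weight of 1 mirrors the coefficient value reported in the evaluation.

```python
import math
from typing import List, Sequence, Tuple

# One example: (per-step probability distributions, target token ids).
Example = Tuple[Sequence[Sequence[float]], Sequence[int]]


def cross_entropy(step_dists: Sequence[Sequence[float]], target_ids: Sequence[int]) -> float:
    """Mean negative log-probability assigned to the target token at each step."""
    if len(step_dists) != len(target_ids) or not target_ids:
        raise ValueError("predictions and targets must be non-empty and aligned")
    return -sum(math.log(d[t]) for d, t in zip(step_dists, target_ids)) / len(target_ids)


def semi_supervised_loss(hls: List[Example], mls: List[Example], unsup_weight: float = 1.0) -> float:
    """Supervised loss on HLS plus weighted loss on augmented MLS with pseudo labels."""
    sup = sum(cross_entropy(p, y) for p, y in hls) / len(hls)
    unsup = sum(cross_entropy(p, y) for p, y in mls) / len(mls)
    return sup + unsup_weight * unsup


# Toy batches: one human-labeled example and one pseudo-labeled, augmented example.
hls_batch = [([[0.5, 0.5]], [0])]
mls_batch = [([[0.25, 0.75]], [1])]
total = semi_supervised_loss(hls_batch, mls_batch)
```

Because the second term compares predictions on augmented audio against labels predicted from the unaugmented audio, minimizing it pushes the model toward consistent outputs under augmentation.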

FIG. 7 is a diagram illustrating examples of speech samples 710, 720, and 730 for generating, updating, and testing a speech recognition model according to an embodiment of the present disclosure. The processor of the information processing system may receive the speech samples 710, 720, and 730 from user terminals. The received speech samples may be classified into initial speech samples 710, subsequent speech samples 720, and test speech samples 730. In one embodiment, the processor may extract a spectrogram from the received speech samples using a Hamming window with a window length of 200 ms and a stride length of 100 ms.
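The spectrogram extraction described above can be sketched as framing, Hamming windowing, and a per-frame magnitude spectrum. To stay self-contained the sketch uses a direct discrete Fourier transform, which is slow; in practice an FFT routine from a numerical library would be used. The function names and the choice to return magnitudes (rather than power or log values) are assumptions.

```python
import cmath
import math
from typing import List


def hamming(n: int) -> List[float]:
    """Symmetric Hamming window of length n (n must be at least 2)."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]


def spectrogram(wave: List[float], sample_rate: int,
                window_s: float = 0.2, stride_s: float = 0.1) -> List[List[float]]:
    """Magnitude spectrogram: one list of frequency-bin magnitudes per frame."""
    win_len = int(round(window_s * sample_rate))
    hop = int(round(stride_s * sample_rate))
    window = hamming(win_len)
    frames = []
    for start in range(0, len(wave) - win_len + 1, hop):
        frame = [wave[start + i] * window[i] for i in range(win_len)]
        mags = []
        for k in range(win_len // 2 + 1):  # non-negative frequencies only
            acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / win_len)
                      for n in range(win_len))
            mags.append(abs(acc))
        frames.append(mags)
    return frames


# One second of a 10 Hz sine at a deliberately tiny sample rate keeps the DFT cheap.
rate = 100
tone = [math.sin(2.0 * math.pi * 10.0 * t / rate) for t in range(rate)]
spec = spectrogram(tone, rate)
```

With a 200 ms window and a 100 ms stride, consecutive frames overlap by half, so a one-second clip yields nine frames.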

The processor may generate an initial speech recognition model using the initial speech samples 710. In one embodiment, the processor may do so by performing, with the initial speech samples 710, the initial speech recognition model generation method described above with reference to FIG. 5. Thereafter, the processor may update the speech recognition model using the subsequent speech samples 720. In one embodiment, the processor may do so by performing, with the subsequent speech samples 720, the speech recognition model training method described above with reference to FIG. 6. For example, the processor may divide the subsequent speech samples 720 into multiple intervals (e.g., 30 intervals) and update the speech recognition model multiple times (e.g., 30 times), using the speech samples of one interval for each update.

After the generation and updating of the speech recognition model are complete, the processor may test the performance of the speech recognition model using the test speech samples 730. In one embodiment, the processor may input each of the test speech samples 730 into the updated speech recognition model and compare the output data with ground-truth labels created by human annotators. The performance of the speech recognition model may be evaluated by the character-level error rate (CER). Here, the CER may be determined based on the character differences between the output data and the ground-truth labels.
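A common way to compute a character-level error rate is sketched below. The text only says that CER is based on character differences between the output and the ground-truth label; using the Levenshtein edit distance, and normalizing by the total number of reference characters, are assumptions made for illustration.

```python
from typing import List


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(refs: List[str], hyps: List[str]) -> float:
    """Character error rate in percent over a whole test set."""
    if len(refs) != len(hyps):
        raise ValueError("refs and hyps must have the same length")
    total_edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total_chars = sum(len(r) for r in refs)
    return 100.0 * total_edits / total_chars


# Toy example: one substituted character out of four.
score = cer(["abcd"], ["abxd"])
```

Aggregating edits and reference lengths over the whole test set, rather than averaging per-utterance rates, keeps short utterances from dominating the result.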

In one embodiment, the number of initial speech samples 710 may be smaller than the number of subsequent speech samples 720. For example, the initial speech samples 710 may comprise 110 hours of speech, the subsequent speech samples 720 may comprise 386 hours of speech, and the test speech samples 730 may comprise 56 hours of speech. In addition, the initial speech samples 710 may have been collected before the subsequent speech samples 720, and the subsequent speech samples 720 may have been collected before the test speech samples 730. With this configuration, the performance of the speech recognition model training method according to embodiments of the present disclosure can be evaluated under conditions similar to actual use. The performance evaluation carried out in this environment is described below with reference to FIGS. 8 to 10. In the performance evaluation, machine labeling was performed on the speech samples whose confidence score (the value given by Equation 3) exceeded a threshold (τ = 0.9). In addition, to emphasize the influence of MLS on speech recognition model training, the coefficient of the unsupervised loss was set to 1.

FIG. 8 is a graph showing how the performance of the speech recognition model differs depending on the method used to extract speech samples for human labeling. As described above, speech samples for human labeling may be extracted from unlabeled speech samples in order to train or update the speech recognition model. In the graph, 'NP' denotes the case in which the uncertainty score of a speech sample is calculated using Equations 1 and 2 described above. 'RND' denotes the case in which the speech samples for human labeling are extracted at random. 'Loss' and 'CER' denote cases in which the uncertainty score is calculated by methods other than Equations 1 and 2.

To evaluate how useful the speech samples extracted under each criterion are for training the speech recognition model, the plurality of speech samples may be sorted according to each criterion and divided into five sets. For example, a total of 386.5 hours of speech samples may be sorted according to each criterion and divided into five sets of 77.3 hours each. Here, 'set_1/5' is the set containing the samples with the highest uncertainty (i.e., the samples most useful for training the speech recognition model), and 'set_5/5' is the set containing the samples with the lowest uncertainty (i.e., the samples least useful for training). Thereafter, HLS may be prepared from each set, and a speech recognition model may be generated by supervised learning on the prepared HLS. The performance of each generated model may be expressed as CER (%), where a lower CER (%) indicates better performance.
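The sort-and-split step used for this comparison can be sketched as follows. The sketch splits by sample count, whereas the example above splits by duration (77.3 hours per set); treating the two as interchangeable is a simplifying assumption, and the function name is hypothetical.

```python
from typing import Dict, List


def split_into_ranked_sets(scores: Dict[str, float], n_sets: int = 5) -> List[List[str]]:
    """Sort sample ids by descending uncertainty and cut them into n_sets contiguous sets."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    size, extra = divmod(len(ranked), n_sets)
    sets, start = [], 0
    for i in range(n_sets):
        end = start + size + (1 if i < extra else 0)  # spread any remainder over the first sets
        sets.append(ranked[start:end])
        start = end
    return sets


# Hypothetical uncertainty scores for ten samples.
scores = {f"s{i}": i / 10.0 for i in range(10)}
ranked_sets = split_into_ranked_sets(scores)
```

Here `ranked_sets[0]` plays the role of 'set_1/5' (highest uncertainty) and `ranked_sets[4]` the role of 'set_5/5' (lowest uncertainty).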

As shown, 'NP', 'Loss', and 'CER' each attain their lowest CER (%) on 'set_1/5', and among them 'NP' yields the lowest CER (%) on 'set_1/5'. For 'NP', the CER (%) also increases almost monotonically as sets with lower uncertainty scores (i.e., higher confidence scores) are used. For 'Loss' and 'CER', by contrast, the CER (%) varies across the sets in an unexpected pattern. This is because, when the uncertainty score is calculated using the 'Loss' or 'CER' method, it is determined by measuring the difference between the ground-truth label and the label predicted by the speech recognition model, without considering the joint probability of the text sequence predicted by the model. Therefore, for extracting speech samples for human labeling, calculating the uncertainty score from the NP value of a speech sample (Equations 1 and 2 above) is more accurate and effective than calculating it by the other criteria.

FIG. 9 is a graph showing how the performance of the speech recognition model differs depending on the speech sample augmentation method of the present disclosure. In the graph, 'NoCR' denotes the case without data augmentation, 'CR-P' the case in which pitch shifting is applied to the speech samples, 'CR-A' the case in which additive white Gaussian noise is added to the speech samples, and 'CR-S' the case in which time scaling is applied to the speech samples. For example, 'CR-P' shifts the pitch of a speech sample by 2.5 steps (one step being one octave divided into 8), 'CR-A' adds additive white Gaussian noise with a signal-to-noise ratio (SNR) of 5 or less, and 'CR-S' time-scales a speech sample to 1.5 times its playback speed.

In FIG. 9, the x-axis represents the amount of HLS (i.e., hours of speech samples), and the 'xy' in '(LUxy)' on the x-axis represents the ratio of HLS (x) to MLS (y). For example, 38.6h (LU19) denotes the case in which the speech recognition model was updated by semi-supervised learning on 38.6 hours of HLS and nine times that amount of MLS. The graph of FIG. 9 may be analyzed together with Table 1 below. Table 1 shows the performance (CER (%)) of the speech recognition model updated under the condition corresponding to each row and column. A lower CER (%) indicates better performance.

Table 1

| Set-up | Supervised Learning | NoCR | CR-S | CR-A | CR-P |
|---|---|---|---|---|---|
| A+B (386h) | 10.89 | - | - | - | - |
| LU12 (137h) | 11.23 | 11.93 | 11.94 | 11.62 | 11.15 |
| LU14 (77.4h) | 11.97 | 12.24 | 12.39 | 12.08 | 11.80 |
| LU16 (57h) | 13.63 | 13.21 | 12.60 | 12.63 | 11.97 |
| LU19 (38.6h) | 13.83 | 14.17 | 13.54 | 13.02 | 12.57 |

As Table 1 shows, the speech recognition model generated by supervised learning on 386 hours of HLS alone performs best, with a CER of 10.89 %. Table 1 and FIG. 9 also show that performance gradually degrades as the amount of HLS decreases and the amount of MLS increases. Additionally, except for LU16, the CER (%) of 'NoCR' is higher than that of 'Supervised learning', which indicates that MLS containing non-augmented speech samples can actually harm the training of the speech recognition model. In particular, because the performance evaluation extracted the speech samples for machine labeling using a relatively low confidence score threshold (τ = 0.9) and trained the model by semi-supervised learning with a high unsupervised loss coefficient (set to 1), the negative effect of inaccurate pseudo labels in the MLS on the speech recognition model is clearly visible.

In Table 1 and FIG. 9, except for 'CR-S' at LU12 and LU14, every row shows a lower CER (%) with augmented speech samples ('CR-S', 'CR-A', 'CR-P') than with 'NoCR'. Moreover, 'CR-P' has the lowest CER (%) among 'CR-S', 'CR-A', and 'CR-P', which shows that the speech recognition model performs best when pitch shifting is used as the data augmentation.

Meanwhile, the effect of augmenting the speech samples in the MLS is more pronounced when the amount of HLS used for training is small (e.g., LU16 or LU19). For example, at LU19 the CER (%) with augmented speech samples ('CR-P', 12.57 %) is 1.26 %p and 1.60 %p lower than that of 'Supervised learning' (13.83 %) and 'NoCR' (14.17 %), respectively. In contrast, when the speech recognition model is trained or updated with a sufficient amount of HLS (e.g., LU12), the effect of augmentation appears small, because the training effect of the HLS on the model is already large.

FIG. 10 is a graph showing the relationship between the training round and the performance of the speech recognition model when the model is updated multiple times according to an embodiment of the present disclosure. The graph of FIG. 10 shows the CER (%) of the speech recognition model after each update, for up to 30 rounds. For both LU12 and LU16, the CER (%) of 'NoCR' is higher than that of 'CR-S', 'CR-A', and 'CR-P'. That is, with 'NoCR' the performance of the speech recognition model degrades because of inaccurate pseudo labels (i.e., inaccurate MLS). According to embodiments of the present disclosure, the unsupervised loss restrains the speech recognition model from assigning high confidence scores to speech samples it does not know well, which mitigates this problem and allows MLS to be used in machine learning while maintaining good model performance.

In conclusion, when the speech recognition model is trained or updated according to embodiments of the present disclosure, the labeling cost can be reduced by about 2/3 at the expense of a CER increase (i.e., performance degradation) of only 0.26 %p (LU12 with 'CR-P'), and by about 6/7 at the expense of a CER increase of only 1.08 %p (LU16 with 'CR-P'), both relative to supervised learning on 386 hours of HLS. It is therefore possible to significantly reduce the human labeling cost of updating the speech recognition model while keeping performance degradation (e.g., degradation due to inaccurate MLS) small.
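The figures in this conclusion can be re-derived from Table 1 with a few lines of arithmetic. The sketch assumes that labeling cost is proportional to the amount of HLS, so that an HLS:MLS ratio of x:y saves y/(x+y) of the cost, and that each digit in a set-up name such as 'LU12' is a single-digit ratio component. The helper names are hypothetical; the CER values are copied from Table 1.

```python
from fractions import Fraction

# CER (%) values taken from Table 1.
CER_SUPERVISED_386H = 10.89
CER_CR_P = {"LU12": 11.15, "LU14": 11.80, "LU16": 11.97, "LU19": 12.57}


def labeling_cost_saved(setup: str) -> Fraction:
    """Fraction of human labeling avoided for a set-up named 'LUxy' (HLS:MLS = x:y)."""
    hls_parts, mls_parts = int(setup[2]), int(setup[3])
    return Fraction(mls_parts, hls_parts + mls_parts)


def cer_increase(setup: str) -> float:
    """CER increase in percentage points of 'CR-P' over full supervised learning."""
    return round(CER_CR_P[setup] - CER_SUPERVISED_386H, 2)


summary = {s: (labeling_cost_saved(s), cer_increase(s)) for s in ("LU12", "LU16")}
```

Under these assumptions, 'LU12' corresponds to a 2/3 saving with a 0.26 %p increase and 'LU16' to a 6/7 saving with a 1.08 %p increase, matching the figures stated above.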

The speech recognition model training method described above may be provided as a computer program stored in a computer-readable recording medium for execution on a computer. The medium may store the computer-executable program persistently, or may store it temporarily for execution or download. The medium may be any of various recording or storage means in the form of a single piece of hardware or several pieces of hardware combined; it is not limited to a medium directly connected to a particular computer system and may be distributed over a network. Examples of the medium include magnetic media such as hard disks, floppy disks, and magnetic tape; optical recording media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and ROM, RAM, flash memory, and the like configured to store program instructions. Other examples include recording or storage media managed by app stores that distribute applications, or by sites and servers that supply or distribute various other software.

The methods, operations, or techniques of the present disclosure may be implemented by various means. For example, these techniques may be implemented in hardware, firmware, software, or a combination thereof. Those of ordinary skill in the art will appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with this disclosure may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends on the particular application and the design requirements imposed on the overall system. Those skilled in the art may implement the described functionality in various ways for each particular application, but such implementations should not be interpreted as departing from the scope of the present disclosure.

In a hardware implementation, the processing units used to perform the techniques may be implemented within one or more ASICs, DSPs, digital signal processing devices (DSPDs), programmable logic devices (PLDs), field programmable gate arrays (FPGAs), processors, controllers, microcontrollers, microprocessors, electronic devices, other electronic units designed to perform the functions described in this disclosure, computers, or a combination thereof.

Accordingly, the various illustrative logical blocks, modules, and circuits described in connection with this disclosure may be implemented or performed with a general-purpose processor, a DSP, an ASIC, an FPGA or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination of these designed to perform the functions described herein. A general-purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, for example, a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration.

In a firmware and/or software implementation, the techniques may be implemented as instructions stored on a computer-readable medium such as random access memory (RAM), read-only memory (ROM), non-volatile random access memory (NVRAM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable PROM (EEPROM), flash memory, a compact disc (CD), or a magnetic or optical data storage device. The instructions may be executable by one or more processors and may cause the processor(s) to perform certain aspects of the functionality described in this disclosure.

Although the embodiments above have been described as utilizing aspects of the presently disclosed subject matter in one or more standalone computer systems, the present disclosure is not limited thereto and may be implemented in connection with any computing environment, such as a network or distributed computing environment. Furthermore, aspects of the subject matter of this disclosure may be implemented in a plurality of processing chips or devices, and storage may similarly be spread across a plurality of devices. Such devices may include PCs, network servers, and portable devices.

Although the present disclosure has been described herein in connection with some embodiments, various modifications and changes may be made without departing from the scope of the present disclosure as understood by those of ordinary skill in the art to which the disclosure pertains. Such modifications and changes should be regarded as falling within the scope of the claims appended hereto.

110: user
120, 210: user terminal
220: network
230: information processing system

Claims

A speech recognition model training method performed by at least one processor, the method comprising:
receiving a plurality of unlabeled speech samples;
extracting a first set of speech samples for human labeling from the plurality of speech samples by using a speech recognition model;
receiving a first set of labels corresponding to the first set of speech samples;
extracting a second set of speech samples for machine labeling from the plurality of speech samples by using the speech recognition model;
determining a second set of labels corresponding to the second set of speech samples by using the speech recognition model;
augmenting the second set of speech samples; and
updating the speech recognition model by performing semi-supervised learning based on the first set of speech samples, the first set of labels, the augmented second set of speech samples, and the second set of labels,
wherein augmenting the second set of speech samples comprises:
performing pitch shifting on the second set of speech samples.

(Deleted)

A speech recognition model training method performed by at least one processor, the method comprising:
receiving a plurality of unlabeled speech samples;
extracting a first set of speech samples for human labeling from the plurality of speech samples by using a speech recognition model;
receiving a first set of labels corresponding to the first set of speech samples;
extracting a second set of speech samples for machine labeling from the plurality of speech samples by using the speech recognition model;
determining a second set of labels corresponding to the second set of speech samples by using the speech recognition model;
augmenting the second set of speech samples; and
updating the speech recognition model by performing semi-supervised learning based on the first set of speech samples, the first set of labels, the augmented second set of speech samples, and the second set of labels,
wherein augmenting the second set of speech samples comprises:
performing time scaling on the second set of speech samples.

A speech recognition model training method performed by at least one processor, the method comprising:
receiving a plurality of unlabeled speech samples;
extracting a first set of speech samples for human labeling from the plurality of speech samples by using a speech recognition model;
receiving a first set of labels corresponding to the first set of speech samples;
extracting a second set of speech samples for machine labeling from the plurality of speech samples by using the speech recognition model;
determining a second set of labels corresponding to the second set of speech samples by using the speech recognition model;
augmenting the second set of speech samples; and
updating the speech recognition model by performing semi-supervised learning based on the first set of speech samples, the first set of labels, the augmented second set of speech samples, and the second set of labels,
wherein augmenting the second set of speech samples comprises:
adding additive white Gaussian noise to the second set of speech samples.
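
A minimal sketch of the noise augmentation follows. The claim only states that additive white Gaussian noise is added; setting the noise level through a target signal-to-noise ratio, and the names `add_awgn` and `snr_db`, are assumptions made for illustration.

```python
import numpy as np


def add_awgn(waveform, snr_db, rng=None):
    """Add white Gaussian noise to a 1-D waveform (hypothetical helper).

    Assumption: the noise variance is chosen so that the ratio of signal
    power to noise power matches snr_db (in decibels).
    """
    rng = np.random.default_rng() if rng is None else rng
    signal_power = np.mean(waveform ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=waveform.shape)
    return waveform + noise
```

The augmented sample keeps the label that was determined for the clean sample, so the model is encouraged to give the same output for both.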

The method of claim 1,
wherein extracting the first set of speech samples for human labeling from the plurality of speech samples by using the speech recognition model comprises:
calculating an uncertainty score for each of the plurality of speech samples by using the speech recognition model; and
extracting, as the first set of speech samples, a predetermined number of speech samples having the highest uncertainty scores among the plurality of speech samples.

The method of claim 5,
wherein extracting the second set of speech samples for machine labeling from the plurality of speech samples by using the speech recognition model comprises:
extracting, as the second set of speech samples, at least one speech sample having an uncertainty score equal to or less than a predetermined threshold among the plurality of speech samples.
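
The two selection rules (highest uncertainty scores for human labeling, scores at or below a threshold for machine labeling) can be sketched as below. The function name `split_for_labeling` and its parameters are hypothetical. Excluding samples already chosen for human labeling from the machine-labeled set is an assumption; the claims do not state how an overlap would be handled.

```python
from typing import List, Sequence, Tuple


def split_for_labeling(
    scores: Sequence[float], num_human: int, threshold: float
) -> Tuple[List[int], List[int]]:
    """Return (human_indices, machine_indices) from per-sample uncertainty scores."""
    # Human labeling: the num_human samples with the highest uncertainty.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    human = order[:num_human]
    human_set = set(human)
    # Machine labeling: samples whose uncertainty is at or below the threshold.
    # Assumption: a sample already sent to humans is not machine-labeled as well.
    machine = [
        i for i, s in enumerate(scores) if s <= threshold and i not in human_set
    ]
    return human, machine
```

With scores `[0.9, 0.1, 0.5, 0.05, 0.7]`, `num_human=2`, and `threshold=0.2`, samples 0 and 4 go to human labeling and samples 1 and 3 go to machine labeling; sample 2 is used for neither in that round.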

The method of claim 5,
wherein the uncertainty score represents a length-normalized joint probability of a text sequence output by the speech recognition model.
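
One way such a score could be computed is sketched below. The claim says only that the score represents a length-normalized joint probability; working in the log domain, and negating the result so that a larger value means more uncertainty, are assumptions made here. The input `token_probs` is a hypothetical list of the probabilities the model assigned to each token of its output sequence.

```python
import math
from typing import Sequence


def length_normalized_log_prob(token_probs: Sequence[float]) -> float:
    """Log of the joint probability of the sequence, divided by its length."""
    if not token_probs:
        raise ValueError("token_probs must not be empty")
    # The joint probability is the product of per-token probabilities,
    # so its logarithm is the sum of per-token log-probabilities.
    return sum(math.log(p) for p in token_probs) / len(token_probs)


def uncertainty_score(token_probs: Sequence[float]) -> float:
    """Assumed convention: a lower normalized probability means higher uncertainty."""
    return -length_normalized_log_prob(token_probs)
```

Dividing by the sequence length keeps long transcriptions from being scored as uncertain merely because they contain more tokens.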

The method of claim 1,
wherein updating the speech recognition model comprises:
updating the speech recognition model so as to minimize a difference between a first set of output data, which the speech recognition model predicts for the first set of speech samples, and the first set of labels.

The method of claim 8,
wherein updating the speech recognition model further comprises:
updating the speech recognition model so as to minimize a difference between a second set of output data, which the speech recognition model predicts for the augmented second set of speech samples, and the second set of labels.

The method of claim 9,
wherein the difference between the first set of output data and the first set of labels and the difference between the second set of output data and the second set of labels are each calculated using a standard cross-entropy loss function.
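
The two cross-entropy terms can be illustrated with a minimal sketch. It treats each output as a single classification over `C` classes, whereas a real speech recognition model outputs token sequences. Adding the two terms with equal weight is an assumption; the claims state only that both differences are minimized. All function and parameter names are hypothetical.

```python
import numpy as np


def cross_entropy(logits: np.ndarray, targets: np.ndarray) -> float:
    """Mean cross-entropy for logits of shape (N, C) and integer targets of shape (N,)."""
    # Numerically stable log-softmax.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(targets)), targets].mean())


def semi_supervised_loss(logits_first, labels_first, logits_second_aug, labels_second):
    """Loss on human-labeled samples plus loss on augmented, machine-labeled samples."""
    supervised = cross_entropy(logits_first, labels_first)
    # The model sees the augmented audio but is trained toward the label
    # determined for the unaugmented sample.
    consistency = cross_entropy(logits_second_aug, labels_second)
    return supervised + consistency
```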

The method of claim 1,
wherein the speech recognition model is an artificial neural network model generated by performing supervised learning.

The method of claim 1,
wherein the number of speech samples in the first set for human labeling is less than the number of speech samples in the second set for machine labeling.

The method of claim 1,
wherein the first set of labels are human-generated correct answer labels.

The method of claim 1,
wherein the second set of labels are pseudo labels predicted by the speech recognition model.

A computer program stored in a computer-readable recording medium to execute, on a computer, the speech recognition model training method according to any one of claims 1 to 14.

A speech recognition model training system comprising:
a communication module;
a memory; and
at least one processor coupled to the memory and configured to execute at least one computer-readable program included in the memory,
wherein the at least one program includes instructions for:
receiving a plurality of unlabeled speech samples;
extracting a first set of speech samples for human labeling from the plurality of speech samples by using a speech recognition model;
receiving a first set of labels corresponding to the first set of speech samples;
extracting a second set of speech samples for machine labeling from the plurality of speech samples by using the speech recognition model;
determining a second set of labels corresponding to the second set of speech samples by using the speech recognition model;
augmenting the second set of speech samples; and
updating the speech recognition model by performing semi-supervised learning based on the first set of speech samples, the first set of labels, the augmented second set of speech samples, and the second set of labels,
wherein augmenting the second set of speech samples comprises performing pitch shifting on the second set of speech samples.
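
The sequence of instructions recited above can be sketched as one training round. This is a simplified illustration under stated assumptions, not the claimed implementation: every callable passed in (`score`, `transcribe`, `ask_human`, `augment`, `update`) is a hypothetical stand-in for the model, the human labeling interface, and the augmentation step, and excluding human-selected samples from the machine-labeled set is an assumption.

```python
from typing import Callable, List, Sequence, Tuple

Sample = List[float]  # a waveform, represented here as a list of floats


def training_round(
    samples: Sequence[Sample],
    score: Callable[[Sample], float],
    transcribe: Callable[[Sample], str],
    ask_human: Callable[[List[Sample]], List[str]],
    augment: Callable[[Sample], Sample],
    update: Callable[[List[Tuple[Sample, str]]], None],
    num_human: int,
    threshold: float,
) -> Tuple[int, int]:
    """Run one round; return the sizes of the first and second sets."""
    scores = [score(s) for s in samples]
    order = sorted(range(len(samples)), key=lambda i: scores[i], reverse=True)
    human_idx = order[:num_human]
    human_set = set(human_idx)
    machine_idx = [
        i for i in range(len(samples))
        if scores[i] <= threshold and i not in human_set
    ]
    first_set = [samples[i] for i in human_idx]
    first_labels = ask_human(first_set)
    second_set = [samples[i] for i in machine_idx]
    # Pseudo labels are determined on the unaugmented audio.
    second_labels = [transcribe(s) for s in second_set]
    augmented = [augment(s) for s in second_set]
    batch = list(zip(first_set, first_labels)) + list(zip(augmented, second_labels))
    update(batch)
    return len(first_set), len(second_set)
```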