KR102539165B1

KR102539165B1 - Residual coding method of linear prediction coding coefficient based on collaborative quantization, and computing device for performing the method

Info

Publication number: KR102539165B1
Application number: KR1020200152071A
Authority: KR
Inventors: 김민제; 이미숙; 백승권; 성종모; 이태진; 최진수; 젠 카이
Original assignee: 한국전자통신연구원; 더 트러스티즈 오브 인디애나 유니버시티
Priority date: 2019-11-13
Filing date: 2020-11-13
Publication date: 2023-06-12
Also published as: KR20210058731A

Abstract

A method for coding the residual signal of linear prediction coding (LPC) coefficients based on collaborative quantization, and a computing device for performing the method, are disclosed. The residual signal coding method may include: performing LPC analysis and quantization on input speech to generate coded LPC coefficients and an LPC residual signal; determining a predicted LPC residual signal by applying the LPC residual signal to cross-module residual learning; performing LPC synthesis using the coded LPC coefficients and the predicted LPC residual signal; and determining output speech, which is the synthesized output, according to the result of the LPC synthesis.

Description

Residual coding method of linear prediction coding coefficient based on collaborative quantization, and computing device for performing the method

The present invention relates to a method for coding the residual signal of LPC coefficients based on collaborative quantization, and to a computing device performing the method.

Speech coding refers to quantizing a speech signal into a low-rate bitstream for efficient transmission and storage in a communication system. The design of a speech codec has to balance the trade-offs among low bit rate, high perceptual quality, low complexity, and low delay.

Most speech codecs can be classified as vocoders or waveform coders. A vocoder uses a small number of parameters, such as the vocal tract and the pitch frequency, to model the human speech production process. A waveform coder, in contrast, compresses and reconstructs the waveform so that the decoded speech is "perceptually" similar to the input speech.

Conventional vocoders are computationally efficient and can encode speech at very low bit rates, whereas waveform coders support a much wider range of bit rates with scalable performance and are robust to noise.

In both conventional vocoders and waveform coders, linear predictive coding (LPC), an all-pole linear filter, can efficiently model the power spectrum with only a few coefficients. In a vocoder, the LPC residual is modeled as a synthetic excitation signal using a pitch pulse train or a white noise component. In a waveform coder, on the other hand, the residual signal can be compressed directly to the desired bit rate before it is synthesized into the decoded signal.

LPC is also useful in modern neural speech codecs. Autoregressive models can greatly improve the quality of synthesized speech, but they add model complexity to the decoding process.

The present invention relates to a method of coding a speech signal using LPC coefficients and cascaded autoencoders. In particular, it provides a method and apparatus concerning a structure and a training method for simultaneously optimizing the quantization of the LPC coefficients and the quantization of the LPC residual signal.

The present invention proposes a structure and a training method that can jointly optimize the LPC coefficients and the autoencoders connected in cascade.

A method of coding the residual signal of LPC coefficients, performed by a computing device according to an embodiment of the present invention, may include: performing, by the computing device, LPC analysis and quantization on input speech to generate coded LPC coefficients and an LPC residual signal; determining a predicted LPC residual signal by applying the LPC residual signal to cross-module residual learning; performing LPC synthesis using the coded LPC coefficients and the predicted LPC residual signal; and determining output speech, which is the synthesized output, according to the result of the LPC synthesis.

The cross-module residual learning may include: applying a high-pass filter to the input speech; applying a pre-emphasis filter to the high-pass-filtered result; determining LPC coefficients from the pre-emphasized result; quantizing the LPC coefficients to generate coded LPC coefficients and a softmax soft assignment matrix; and determining the LPC residual signal based on the pre-emphasized result and the quantized LPC coefficients.

The determining of the LPC coefficients may include: performing cross-frame windowing by applying a window to the entire frame of the pre-emphasized input speech; performing subframe windowing by applying a window to a plurality of subframes corresponding to the middle region of the entire frame in the result of the cross-frame windowing; and performing synthesis windowing by overlapping the results of the subframe windowing.

The LPC coefficients may be quantized by applying a trainable softmax to the LPC coefficients in the LSP domain.

The LPC residual signal may be encoded by 1D-CNN autoencoders.

The 1D-CNN autoencoders may be trained sequentially, with the residual signal output by the previous autoencoder used as the input of the next autoencoder.
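The cascading of residual coders described above can be sketched as follows. This is only an illustration under stated assumptions: the trained 1D-CNN autoencoders are replaced by hypothetical stand-in modules (`make_stub_module`, plain uniform rounding), and `cascade_code` is an illustrative name rather than anything defined in the specification. The point is only the data flow, in which each module codes what the previous modules left unexplained.

```python
import numpy as np

def make_stub_module(step):
    """Hypothetical stand-in for one autoencoder: uniform rounding with the given step."""
    return lambda x: np.round(x / step) * step

def cascade_code(x, modules):
    """Run modules in sequence; each one receives the residual left by its predecessors."""
    residual = np.asarray(x, dtype=np.float64)
    total = np.zeros_like(residual)
    for module in modules:
        x_hat = module(residual)      # this module's reconstruction of its own input
        total += x_hat                # decoded signal is the sum of all reconstructions
        residual = residual - x_hat   # what remains becomes the next module's input
    return total, residual

x = np.array([0.37, -1.24, 0.81])
decoded, leftover = cascade_code(x, [make_stub_module(0.5), make_stub_module(0.1)])
```

Adding a second, finer module reduces the remaining error, which mirrors how additional autoencoder modules can be used to scale the bit rate.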

Differential coding may be applied to the outputs of the 1D-CNN autoencoders, based on the per-frame code length of each autoencoder.

A computing device that performs the method of coding the residual signal of LPC coefficients according to an embodiment of the present invention includes a processor. The processor may perform LPC analysis and quantization on input speech to generate coded LPC coefficients and an LPC residual signal, determine a predicted LPC residual signal by applying the LPC residual signal to cross-module residual learning, perform LPC synthesis using the coded LPC coefficients and the predicted LPC residual signal, and determine output speech, which is the synthesized output, according to the result of the LPC synthesis.

The processor may perform cross-module residual learning in which it applies a high-pass filter to the input speech, applies a pre-emphasis filter to the high-pass-filtered result, determines LPC coefficients from the pre-emphasized result, quantizes the LPC coefficients to generate coded LPC coefficients and a softmax soft assignment matrix, and determines the LPC residual signal based on the pre-emphasized result and the quantized LPC coefficients.

To determine the LPC coefficients, the processor may perform cross-frame windowing by applying a window to the entire frame of the pre-emphasized input speech, perform subframe windowing by applying a window to a plurality of subframes corresponding to the middle region of the entire frame in the result of the cross-frame windowing, and perform synthesis windowing by overlapping the results of the subframe windowing.

According to an embodiment of the present invention, coding a speech signal using LPC coefficients and cascaded autoencoders makes it possible to provide a structure and a training method that simultaneously optimize the quantization of the LPC coefficients and the quantization of the LPC residual signal.

According to an embodiment of the present invention, it is possible to provide a structure and a training method that can jointly optimize the LPC coefficients and the autoencoders connected in cascade.

FIG. 1 is a diagram illustrating a method of coding the residual signal of LPC coefficients based on collaborative quantization according to an embodiment of the present invention.
FIG. 2 is a diagram illustrating a method of analyzing trainable LPC coefficients according to an embodiment of the present invention.
FIG. 3 is a diagram illustrating a softmax quantization process according to an embodiment of the present invention.
FIG. 4 is a diagram for explaining the LPC windowing process according to an embodiment of the present invention.
FIG. 5 is a diagram for explaining the processing of cross-module residual learning (CMRL) according to an embodiment of the present invention.
FIG. 6 is a diagram for explaining the residual signal coding process of cross-module residual learning (CMRL) according to an embodiment of the present invention.
FIG. 7 is a diagram for explaining the centralized distribution produced by differential coding in the coding of the residual signal according to an embodiment of the present invention.

Hereinafter, embodiments are described in detail with reference to the accompanying drawings. The scope of the patent application is not, however, limited or restricted by these embodiments. Like reference numerals in the drawings denote like elements.

Various changes may be made to the embodiments described below. The embodiments described below are not intended to limit the forms of implementation and should be understood to include all modifications, equivalents, and substitutes thereof.

Terms such as "first" and "second" may be used to describe various components, but these terms should be understood only as distinguishing one component from another. For example, a first component may be called a second component, and similarly a second component may be called a first component.

The terms used in the embodiments are used only to describe particular embodiments and are not intended to limit them. A singular expression includes the plural unless the context clearly indicates otherwise. In this specification, terms such as "include" or "have" specify the presence of the stated features, numbers, steps, operations, components, parts, or combinations thereof, and should be understood as not precluding the presence or addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Unless otherwise defined, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which the embodiments belong. Terms such as those defined in commonly used dictionaries should be interpreted as having a meaning consistent with their meaning in the context of the related art and, unless expressly defined in this application, are not to be interpreted in an idealized or overly formal sense.

In the description with reference to the accompanying drawings, the same components are given the same reference numerals regardless of the figure, and redundant descriptions of them are omitted. In describing the embodiments, a detailed description of related known technology is omitted where it is judged that it could unnecessarily obscure the gist of the embodiments.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings.

FIG. 1 is a diagram illustrating a method of coding the residual signal of LPC coefficients based on collaborative quantization according to an embodiment of the present invention.

According to an embodiment of the present invention, collaborative quantization, in which the LPC quantization is trainable, is proposed so that neural networks and LPC can be better integrated for scalable waveform coding with low model complexity. By combining different autoencoding modules for coding the LPC residual signal, collaborative quantization can learn the optimal bit allocation between the LPC coefficients and the code layers of the other neural networks. With the learning scheme proposed in the present invention, collaborative quantization improves on earlier approaches and, while keeping complexity low, can scale up to 24 kbps to match the performance of state-of-the-art codecs.

LPC is useful for modern neural speech codecs because it can offload computational overhead from the neural network. Cross-module residual learning (CMRL), a neural waveform coder, uses LPC as a pre-processor and models the LPC residual signal so as to match state-of-the-art speech quality.

A neural speech codec needs scalability and efficiency so that it can support a wide range of bit rates for applications on various devices. To this end, according to an embodiment of the present invention, collaborative quantization is applied, which jointly learns the codebook of the LPC coefficients and the residual signals.

Collaborative quantization according to an embodiment of the present invention presents a domain-specific digital signal processing method. At 9 kbps, collaborative quantization achieves much higher quality than the earlier quantization method while having much lower model complexity. Collaborative quantization can also scale up to 24 kbps, where it outperforms AMR-WB and Opus. As a neural waveform codec, collaborative quantization has far fewer parameters than existing models.

Referring to FIG. 1, the method of coding the residual signal of LPC coefficients may be performed by a computing device. The computing device may perform the residual signal coding method through the following process.

In step (1), the computing device may receive input speech and perform LPC analysis and quantization. Through step (1), the computing device may output the LPC residual signal and the LPC coefficients.

In step (2), the computing device may learn the LPC residual signal. For example, the computing device may learn the LPC residual signal based on cross-module residual learning (CMRL). As a result of this learning, predicted LPC residual signals may be output. The operation of cross-module residual learning is described in detail with reference to FIGS. 5 and 6 below.

In step (3), the computing device may perform LPC dequantization and LPC synthesis using the LPC coefficients and the LPC residual signal.

In step (4), the computing device may determine the output speech, which is the synthesized output, by applying a de-emphasis filter to the result of the LPC synthesis.
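The LPC synthesis of step (3) can be illustrated with the standard all-pole recursion, in which each output sample is the residual sample plus the linear prediction from previous output samples. The sketch below is an assumption-laden illustration, not the implementation of the embodiment: the function name `lpc_synthesis` and the sign convention (predictor coefficients `a[i-1]` multiplying `s[t-i]`) are chosen here for illustration, and dequantization of the coefficients is assumed to have happened already.

```python
import numpy as np

def lpc_synthesis(residual, a):
    """All-pole synthesis: s[t] = e[t] + sum_i a[i-1] * s[t-i] (illustrative convention)."""
    e = np.asarray(residual, dtype=np.float64)
    s = np.zeros_like(e)
    p = len(a)
    for t in range(len(e)):
        acc = e[t]
        # Only past samples that exist inside the frame contribute (zero initial state).
        for i in range(1, min(p, t) + 1):
            acc += a[i - 1] * s[t - i]
        s[t] = acc
    return s

# A unit impulse through a first-order predictor with coefficient 0.5
impulse_response = lpc_synthesis([1.0, 0.0, 0.0, 0.0], [0.5])
```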

FIG. 2 is a diagram illustrating a method of analyzing trainable LPC coefficients according to an embodiment of the present invention.

FIG. 2 describes in detail the LPC analysis and quantization of step (1) of FIG. 1. According to an embodiment of the present invention, integrating LPC analysis into the cross-module residual learning pipeline allows the quantization of the LPC coefficients to be included in the training algorithm of the neural network. The overall LPC procedure is based on AMR-WB.

The process of FIG. 2 is performed by the computing device described with reference to FIG. 1. In step (1) of FIG. 2, the computing device may apply a high-pass filter to the input speech. In step (2), the computing device may additionally apply a pre-emphasis filter to the high-pass-filtered input speech.

For example, the high-pass filter may be a high-pass filter with a cut-off frequency of 50 Hz. The pre-emphasis filter may be set to

$H_{emp}(z) = 1 - 0.68z^{-1}$

and is used to remove artifacts at high frequencies.
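A first-order pre-emphasis filter and its inverse (the de-emphasis filter applied after LPC synthesis) can be sketched as below. The coefficient `mu=0.68` is an assumption taken from the AMR-WB convention on which the LPC procedure is said to be based, and the function names are illustrative only.

```python
import numpy as np

def pre_emphasis(x, mu=0.68):
    """y[t] = x[t] - mu * x[t-1], i.e. H(z) = 1 - mu z^-1 (mu is an assumed value)."""
    x = np.asarray(x, dtype=np.float64)
    y = x.copy()
    y[1:] -= mu * x[:-1]
    return y

def de_emphasis(y, mu=0.68):
    """Inverse filter 1 / (1 - mu z^-1): x[t] = y[t] + mu * x[t-1]."""
    y = np.asarray(y, dtype=np.float64)
    x = np.empty_like(y)
    prev = 0.0
    for t, v in enumerate(y):
        prev = v + mu * prev
        x[t] = prev
    return x

signal = np.array([1.0, 2.0, 3.0, 4.0])
restored = de_emphasis(pre_emphasis(signal))  # should reproduce `signal`
```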

In step (3) of FIG. 2, the computing device may determine the LPC coefficients. The input speech to which the pre-emphasis filter was applied in step (2) may be divided into a plurality of frames. For example, the input speech may be divided into frames of 1024 sample points.

Before the LPC coefficients are determined, a window may be applied to each of the frames into which the input speech is divided. The windowing process is described in detail with reference to FIG. 4.

In step (4) of FIG. 2, the computing device may quantize the LPC coefficients. For example, the computing device may apply trainable softmax quantization to the LPC coefficients in the LSP domain so that each LPC coefficient is represented by its nearest centroid. Softmax quantization is described in detail with reference to FIG. 3.

For each windowed frame $x$, the LPC coefficients represented in the LSP domain are expressed as $h_{LPC} = F_{LPC}(x)$. The LPC-specific centroids $b_{LPC}$ need to be learned and may be used to construct the soft assignment matrix.

For example, in the present invention the LPC order may be set to 16 and the number of centroids may be set to 256 (that is, 8 bits). The soft assignment matrix and the hard assignment matrix then both have size 16 × 256. Each row of the soft assignment matrix is a probability vector, and each row of the hard assignment matrix is a one-hot vector.

Meanwhile, in step (4) of FIG. 2, the computing device may determine the coded LPC coefficients and the soft assignment matrix $A_{soft}$.

In step (5) of FIG. 2, the computing device may determine the LPC residual signal using the quantized LPC coefficients and the input speech to which the pre-emphasis filter was applied in step (2).

<Residual coding>

The LPC residual signal calculated in step (1) of FIG. 1 may be compressed by 1D-CNN autoencoders. Here, the 1D-CNN autoencoders are the autoencoders described with reference to FIGS. 5 and 6.

Differential coding may be applied to the output of the autoencoders, $h = [h_0, h_1, \cdots, h_{m-1}]$, where $m$ is the per-frame code length of each autoencoder. The input scalar to softmax quantization is $\Delta h_i = h_i - h_{i-1}$.
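Differential coding of a code vector can be sketched as follows. This is a minimal illustration with assumptions made explicit: the first element is passed through unchanged (equivalent to assuming a zero predecessor), which the text does not specify, and the function names are hypothetical.

```python
import numpy as np

def diff_encode(h):
    """Return the differences h[i] - h[i-1]; the first element is kept as-is (assumption)."""
    h = np.asarray(h, dtype=np.float64)
    d = np.empty_like(h)
    d[0] = h[0]
    d[1:] = h[1:] - h[:-1]
    return d

def diff_decode(d):
    """Invert diff_encode with a running sum."""
    return np.cumsum(np.asarray(d, dtype=np.float64))

code = np.array([0.5, 0.7, 0.6, 0.9])
deltas = diff_encode(code)        # differences are typically concentrated near zero
recovered = diff_decode(deltas)
```

When neighbouring code values are correlated, the differences cluster more tightly around zero than the raw values do, which is what makes them easier to quantize.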

Softmax quantization therefore starts from a more centralized distribution of real-valued codes, as shown in FIG. 7. As shown in FIG. 1, the quantization of the LPC coefficients and the residual coding of cross-module residual learning are optimized together. The LPC analysis thus does not merely minimize the energy of the residual signal as far as possible; it also looks for a pivot that makes residual compression easier for the cross-module residual learning modules that follow.

FIG. 3 is a diagram illustrating a softmax quantization process according to an embodiment of the present invention.

To compress a speech signal, a core component of the autoencoder is a trainable quantizer. The trainable quantizer learns a discrete representation of the autoencoder's code layer. A quantization scheme suited to neural networks, such as soft-to-hard quantization, is called softmax quantization in end-to-end speech coding.

For an input frame $x \in \mathbb{R}^{S}$ of $S$ samples, the output of the encoder is determined as $h = F_{Enc}(x)$. Each element of the encoder output is a 16-bit floating-point value. Given $J$ centroids represented as a vector $b \in \mathbb{R}^{J}$, softmax quantization may map each sample in $h$ to one of the $J$ centroids. Each quantized sample can then be represented by $\log_2 J$ bits, for example 5 bits when $J$ is 32.

The softmax quantization process uses a hard assignment matrix $A_{hard} \in \mathbb{R}^{I \times J}$, where $I$ is the dimension of the code and $J$ is the dimension of the vector of centroids. The hard assignment matrix is determined by Equation 1 based on the Euclidean distance matrix $D \in \mathbb{R}^{I \times J}$.

Equation 1:
$A_{hard}(i, j) = \begin{cases} 1 & \text{if } D(i, j) = \min_{j'} D(i, j') \\ 0 & \text{otherwise} \end{cases}$

Softmax quantization may then assign the nearest centroid to each element of $h$, that is, $\bar{h} = A_{hard} b$. This process is not differentiable, and it blocks the backpropagation error flow during training.
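The hard assignment can be illustrated with a few lines of NumPy. The function name `hard_assign` and the example values are hypothetical; for scalar code elements the Euclidean distance reduces to an absolute difference, which is what the sketch uses.

```python
import numpy as np

def hard_assign(h, b):
    """Build the one-hot assignment matrix (I x J) that picks the nearest centroid per element."""
    h = np.asarray(h, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    D = np.abs(h[:, None] - b[None, :])      # element-wise distance matrix, shape (I, J)
    idx = D.argmin(axis=1)                   # index of the nearest centroid in each row
    A_hard = np.zeros_like(D)
    A_hard[np.arange(len(h)), idx] = 1.0     # one-hot rows
    return A_hard, idx

h = np.array([0.1, 0.9, -0.4])
b = np.array([-0.5, 0.0, 0.5, 1.0])          # J = 4 centroids, so log2(4) = 2 bits per sample
A_hard, idx = hard_assign(h, b)
h_bar = A_hard @ b                           # quantized code
```

Only `idx` needs to be transmitted; the decoder looks the values up in the shared centroid vector. The `argmin` step is what makes this mapping non-differentiable.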

Instead, soft assignment is used during training, as follows.

(i) The computing device may calculate the Euclidean distance matrix $D$ between the elements of $h$ and $b$.

(ii) The computing device may calculate the soft assignment matrix from the dissimilarity matrix using the softmax function, $A_{soft} = \mathrm{softmax}(-\alpha D)$. Here, the softmax function is applied to each row of the soft assignment matrix to turn it into a probability vector, so that $A_{soft}(i, j)$ holds the highest probability when $h_i$ is most similar to $b_j$. During training, $A_{soft} b$ approximates the hard assignment, and the approximated result is provided to the decoder as the input code.

The additional variable $\alpha$ controls the softness of the softmax function, such that $\lim_{\alpha \to \infty} A_{soft} = A_{hard}$. $\alpha$ may be set to 300 so that the gap between the soft assignment matrix $A_{soft}$ and the hard assignment matrix $A_{hard}$ is minimized.

(iii) At test time, $A_{hard}$ replaces $A_{soft}$ by turning the largest probability in each row into one and the remaining entries into zero. $A_{hard} b$ then produces the quantized code $\bar{h}$.
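Steps (i) to (iii) can be sketched in NumPy as below. This is an illustration only: it shows the forward computation, not the gradient-based training of the centroids, and the function name `soft_assign` and the example values are hypothetical. The softness value 300 follows the text.

```python
import numpy as np

def soft_assign(h, b, alpha=300.0):
    """Row-wise softmax over negative scaled distances: A_soft = softmax(-alpha * D)."""
    h = np.asarray(h, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    D = np.abs(h[:, None] - b[None, :])            # (i) distance matrix, shape (I, J)
    logits = -alpha * D
    logits -= logits.max(axis=1, keepdims=True)    # subtract row max for numerical stability
    P = np.exp(logits)
    return P / P.sum(axis=1, keepdims=True)        # (ii) each row is a probability vector

h = np.array([0.1, 0.9, -0.4])
b = np.array([-0.5, 0.0, 0.5, 1.0])

A_soft = soft_assign(h, b)                         # training-time, differentiable surrogate
soft_code = A_soft @ b

# (iii) test time: one-hot rows at the largest probability
A_hard = np.zeros_like(A_soft)
A_hard[np.arange(len(h)), A_soft.argmax(axis=1)] = 1.0
hard_code = A_hard @ b
```

With a small `alpha` the rows of `A_soft` spread probability over several centroids; with a large `alpha` they approach the one-hot rows of `A_hard`, so the code seen by the decoder in training stays close to the code used at test time.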

FIG. 4 is a diagram for explaining the LPC windowing process according to an embodiment of the present invention.

In FIG. 2, the input speech to which pre-emphasis filtering has been applied may be divided into a plurality of frames. For example, the input speech may be divided into frames of 1024 sample points. Before the LPC coefficients are calculated from the input speech, LPC windowing is performed by applying a window to each frame of the input speech.

As shown in FIG. 4, the symmetric window may have emphasized weights in the middle 50% region. The first 25% of the symmetric window is the left half of a Hann window with 512 sample points, and the last 25% is the right half of a Hann window with 512 sample points.
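The cross-frame analysis window described above can be sketched as follows. The sketch assumes that the emphasized middle 50% has unit weight (a flat top), which the text does not state explicitly, and the function name `lpc_analysis_window` is illustrative.

```python
import numpy as np

def lpc_analysis_window(frame_len=1024):
    """Flat-top window: Hann ramp-up over the first 25%, ones in the middle, Hann ramp-down."""
    quarter = frame_len // 4
    hann = np.hanning(2 * quarter)     # 512-point Hann window for a 1024-point frame
    w = np.ones(frame_len)             # assumption: unit weight in the middle 50%
    w[:quarter] = hann[:quarter]       # left half of the Hann window
    w[-quarter:] = hann[quarter:]      # right half of the Hann window
    return w

window = lpc_analysis_window()
```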

LPC is then performed on the windowed frame $s$ in the time domain. The prediction of the $t$-th sample is determined by Equation 2 below.

Equation 2:
$\hat{s}(t) = \sum_{i} a_i \, s(t - i)$

Here, $\hat{s}(t)$ denotes the prediction of the $t$-th sample, and $a_i$ denotes the $i$-th LPC coefficient. The frames overlap by 50%. The LPC order may be set to 16. For example, the LPC coefficients are determined with the Levinson-Durbin algorithm and may be represented as line spectral pairs (LSPs), which are robust to quantization.
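A compact sketch of the Levinson-Durbin recursion and of the prediction residual is given below. It is a textbook formulation rather than the implementation of the embodiment: the function names are hypothetical, the autocorrelation is assumed to be computed from the windowed frame beforehand, and the conversion of the coefficients to LSPs is omitted.

```python
import numpy as np

def levinson_durbin(r, order):
    """Solve the normal equations from autocorrelation r[0..order].

    Returns predictor coefficients a_1..a_p (s_hat[t] = sum_i a_i s[t-i])
    and the final prediction error energy.
    """
    r = np.asarray(r, dtype=np.float64)
    a = np.zeros(order + 1)
    a[0] = 1.0                      # error filter A(z) = 1 + a[1] z^-1 + ...
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err              # reflection coefficient
        a_prev = a.copy()
        a[1:i] = a_prev[1:i] + k * a_prev[i - 1:0:-1]
        a[i] = k
        err *= (1.0 - k * k)
    return -a[1:], err              # flip sign to get predictor coefficients

def lpc_residual(s, a):
    """Residual e[t] = s[t] - sum_i a[i-1] * s[t-i], with zero history before the frame."""
    s = np.asarray(s, dtype=np.float64)
    pred = np.zeros_like(s)
    for i, ai in enumerate(a, start=1):
        pred[i:] += ai * s[:-i]
    return s - pred

# Autocorrelation of a first-order process with correlation 0.5
coeffs, err = levinson_durbin([1.0, 0.5, 0.25], order=2)
```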

According to an embodiment of the present invention, subframe windowing is applied to calculate the LPC residual signal. For example, FIG. 4(a) shows cross-frame windowing, FIG. 4(b) shows subframe windowing, and FIG. 4(c) shows synthesis windowing. Given a speech frame divided from the input speech and the quantized LPC coefficients, the computing device calculates the residual signal of each subframe separately.

Here, the middle 50% of a frame of 1024 sample points (FIG. 4(a)), for example [256:768] for the first analysis frame [0:1024] and [768:1280] for the second analysis frame [512:1536], may be divided into seven sub-frames (FIG. 4(b)). Each of the seven sub-frames has a size of 128 sample points, and adjacent sub-frames overlap by 50%. As shown in FIG. 4(b), the five sub-frames in the middle may be windowed with a Hann function, while the first and last sub-frames may be windowed asymmetrically.
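The sub-frame arithmetic can be checked with a short sketch. The function name `subframe_bounds` is hypothetical; it only computes index ranges and leaves out the Hann and asymmetric windows, whose exact shapes are not given in this passage.

```python
def subframe_bounds(start=256, stop=768, size=128, overlap=0.5):
    """Half-open [begin, end) index ranges of overlapping sub-frames."""
    hop = int(size * (1 - overlap))      # 64 samples for 50% overlap
    return [(b, b + size) for b in range(start, stop - size + 1, hop)]

first_frame = subframe_bounds(256, 768)    # middle 50% of frame [0:1024]
second_frame = subframe_bounds(768, 1280)  # middle 50% of frame [512:1536]
```

A 512-sample region with a 64-sample hop yields exactly seven 128-sample sub-frames, the first starting at the left edge of the region and the last ending at its right edge.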

The LPC residual signal may be calculated for the seven sub-frames corresponding to the 512 sample points of the middle region, which is 50% of the full frame of 1024 sample points. Because the analysis frames overlap by 50% and only the middle 50% of each frame is used, the residual segments of consecutive frames do not overlap.

FIG. 5 is a diagram for explaining the processing of cross-module residual learning (CMRL) according to an embodiment of the present invention.

<End-to-end speech coding autoencoders>

A 1D-CNN structure operating on time-domain samples provides the desired autoencoder for end-to-end speech coding. As described in Table 1, the encoder part of the autoencoder consists of four ResNet stages, a downsampling convolutional layer in the middle that halves the feature map, and a channel compression layer that forms a 256-dimensional real-valued code. Table 1 may correspond to the structure of the autoencoder included in the cross-module residual learning of FIG. 5. The decoder part of the autoencoder mirrors the structure of the encoder part, except that its upsampling layer restores the original frame size (512 sample points) from the reduced code length (256 sample points).

In the structure of the 1D-CNN autoencoder, the input and output tensors are expressed as (width, channel), whereas the kernel shape is expressed as (width, input channel, output channel).

In the cross-module residual learning pipeline, the LPC coding module serves as a pre-processor with a fixed bit rate of 2.4 kbps. It can model the spectral envelope effectively, but it may not help with the quantization of the residual signal. For example, if LPC does not model a frame effectively, collaborative quantization can shift more weight to the subsequent autoencoders so that they use more bits.

According to an embodiment of the present invention, by separating out the LPC process, a trainable quantization module may be created that restores the LPC residual signal together with the other autoencoder modules in cross-module residual learning.

Referring to FIG. 5, an LPC residual signal is generated by performing LPC analysis and quantization on the speech signal. The LPC residual signal is then applied to cross-module residual learning, which basically combines an autoencoder structure with softmax quantization. In cross-module residual learning, the dimension of the LPC residual signal is reduced in the encoder part of the autoencoder, and softmax quantization is then applied. The result of softmax quantization is passed to the decoder part of the autoencoder, which expands the dimension to restore the original LPC residual signal.
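To make the quantization step between encoder and decoder more concrete, here is a minimal sketch of one common way to realize softmax quantization on scalar codes. It is an assumption for illustration, not the formulation of FIG. 3: the distance measure, the scaling factor `alpha`, and all function names are hypothetical.

```python
import math

def soft_assignment(codes, kernels, alpha=10.0):
    """Row i holds softmax weights of code i over the kernels (closer is larger)."""
    rows = []
    for c in codes:
        logits = [-alpha * abs(c - k) for k in kernels]
        m = max(logits)                          # subtract max for stability
        exps = [math.exp(l - m) for l in logits]
        z = sum(exps)
        rows.append([e / z for e in exps])
    return rows

def soft_quantize(codes, kernels, alpha=10.0):
    """Differentiable stand-in: weighted average of kernels per code."""
    a_soft = soft_assignment(codes, kernels, alpha)
    return [sum(w * k for w, k in zip(row, kernels)) for row in a_soft]

def hard_quantize(codes, kernels):
    """Test-time behaviour: snap each code to its nearest kernel."""
    return [min(kernels, key=lambda k: abs(c - k)) for c in codes]

kernels = [0.0, 1.0]
codes = [0.1, 0.9]
a_soft = soft_assignment(codes, kernels)
soft_sharp = soft_quantize(codes, kernels, alpha=1000.0)
hard = hard_quantize(codes, kernels)
```

As `alpha` grows, the soft output approaches the hard assignment, which is the usual motivation for training with the soft version and quantizing with the hard one.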

That is, cross-module residual learning can code the residual signal, obtained by LPC-filtering the speech signal, with autoencoders in the CMRL structure. The bits allocated to LPC quantization and the bits allocated to LPC residual coding may be independent of each other. By making LPC quantization trainable as well, the bits allocated to LPC quantization and to quantization of the LPC residual signal can be adjusted according to the characteristics of the speech signal, which can improve the performance of the speech codec.

FIG. 6 is a diagram illustrating the residual signal coding process of cross-module residual learning (CMRL) according to an embodiment of the present invention.

The cross-module residual learning of FIG. 6 is an example of the residual signal learning in step (2) of FIG. 1.

Referring to FIG. 6, cross-module residual learning serializes a list of autoencoders so that residual learning is possible between the autoencoder building-block modules. Rather than relying on a single autoencoder, cross-module residual learning connects the building-block modules in series.

Referring to FIG. 6, the (i-1)-th autoencoder (601), the i-th autoencoder (602), and the (i+1)-th autoencoder (603) may be connected in series. The (i-1)-th autoencoder (601) generates an output signal $\hat{x}^{(i-1)}$ from an input signal $x^{(i-1)}$. The (i-1)-th autoencoder (601) may be trained so that its input and output signals are similar to each other, and the difference between them may be set, as a residual signal, as the input of the i-th autoencoder (602). That is, the input $x^{(i)}$ of the i-th autoencoder (602) may be determined as $x^{(i-1)} - \hat{x}^{(i-1)}$, the difference between the input and output signals of the (i-1)-th autoencoder (601).

Referring to FIG. 6, the i-th autoencoder (602) receives $x^{(i)}$ and may be trained to predict $\hat{x}^{(i)}$. Except for the first autoencoder, the input $x^{(i)}$ of the i-th autoencoder is a residual signal, that is, the difference between the input speech $x$ and the sum of the signals reconstructed by the preceding autoencoders. $x^{(i)}$ may be determined by Equation 3 below.

$$x^{(i)} = x - \sum_{j=1}^{i-1} \hat{x}^{(j)} \qquad \text{(Equation 3)}$$
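The cascade described for FIG. 6 can be illustrated with a toy example. In the sketch below, each "autoencoder" is replaced by a simple rounding function with a progressively finer grid; this stand-in, and the names `make_toy_codec` and `cmrl_forward`, are assumptions for illustration only. Summing the module outputs to form the reconstruction follows from the residual definition rather than from an explicit statement in this passage.

```python
def make_toy_codec(step):
    """Stand-in for one trained autoencoder: rounds each sample to a grid."""
    def codec(x):
        return [step * round(v / step) for v in x]
    return codec

def cmrl_forward(x, codecs):
    """Feed each module the residual left over by all preceding modules."""
    residual = list(x)                   # the first module sees the input itself
    outputs = []
    for codec in codecs:
        x_hat = codec(residual)          # module output for its own input
        outputs.append(x_hat)
        residual = [r - h for r, h in zip(residual, x_hat)]
    reconstruction = [sum(vals) for vals in zip(*outputs)]
    return reconstruction, residual

x = [0.37, -0.82, 0.05]
codecs = [make_toy_codec(0.5), make_toy_codec(0.1), make_toy_codec(0.02)]
reconstruction, final_residual = cmrl_forward(x, codecs)
```

Each added module shrinks what is left to code, so the final residual here is bounded by half of the finest grid step.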

Cross-module residual learning distributes the effort of optimizing a single neural network across several modules. By lowering the model complexity in terms of learnable parameters, it can make neural audio coding algorithms better suited to user terminals with limited energy supply and storage space.

In the cross-module residual learning pipeline, the autoencoders may be trained sequentially, each using the residual signal of the previous module as its input. Once all autoencoders have been trained, a fine-tuning step is performed to improve the overall reconstruction quality.

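The loss function used in training each autoencoder consists of reconstruction errors and regularizers. The loss function is determined by Equation 4.

$$\mathcal{L} = \lambda_1 \mathcal{T} + \lambda_2 \mathcal{F} + \lambda_3 \mathcal{Q} + \lambda_4 \mathcal{E} \qquad \text{(Equation 4)}$$

Here, $\mathcal{T}$ and $\mathcal{F}$ are the time-domain and frequency-domain reconstruction errors, $\mathcal{Q}$ and $\mathcal{E}$ are regularizers, and $\lambda_1, \dots, \lambda_4$ are their weights.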

When the input of collaborative quantization is given in the time domain, the loss must be minimized in both the time domain and the frequency domain. The time-domain error $\mathcal{T}$ is measured as the mean squared error (MSE). The frequency-domain error $\mathcal{F}$ measures the loss in the mel-scale frequency domain and thereby compensates for what the non-perceptual $\mathcal{T}$ fails to capture. Four mel-filter banks are specified with sizes of 128, 32, 16, and 8, which enables coarse-to-fine differentiation.

In Equation 4, $\mathcal{Q}$ and $\mathcal{E}$ are regularizers for softmax quantization. The soft allocation matrix $A_{\text{soft}}$ has already been described with reference to FIG. 3. $\mathcal{Q}$ is a term that ensures the soft allocation matrix is closer to the hard allocation matrix.

$\mathcal{E}$ calculates the entropy of the softmax-quantized bit string in order to control the bit rate. First, the frequency of each kernel is calculated by summing each column of the soft allocation matrix $A_{\text{soft}}$ according to Equation 5, where $i$ indexes the rows and $j$ indexes the columns (kernels).

$$A_{\text{soft}}(\cdot, j) = \sum_{i} A_{\text{soft}}(i, j) \qquad \text{(Equation 5)}$$

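The probability distribution over the kernels, $p_j$, indicates how often codes are allocated to each kernel. With $A_{\text{soft}}(\cdot, j)$ denoting the frequency of the j-th kernel, it is determined as in Equation 6.

$$p_j = \frac{A_{\text{soft}}(\cdot, j)}{\sum_{j'} A_{\text{soft}}(\cdot, j')} \qquad \text{(Equation 6)}$$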

The entropy is then defined as in Equation 7, where $p_j$ is the probability of the j-th kernel.

$$\mathcal{E} = -\sum_{j} p_j \log p_j \qquad \text{(Equation 7)}$$
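The three steps leading to the entropy (column sums, normalization, entropy) can be sketched as follows. The function name `kernel_entropy` is hypothetical, the matrix is assumed to have one row per code and one column per kernel, and the base-2 logarithm is an assumption chosen so that the result reads as bits per code.

```python
import math

def kernel_entropy(a_soft):
    """Entropy (in bits) of how often codes are assigned to each kernel."""
    freq = [sum(col) for col in zip(*a_soft)]   # column sums: kernel frequencies
    total = sum(freq)
    probs = [f / total for f in freq]           # normalize to a distribution
    return -sum(p * math.log2(p) for p in probs if p > 0.0)

uniform = [[0.25, 0.25, 0.25, 0.25] for _ in range(3)]
peaked = [[1.0, 0.0], [1.0, 0.0]]
entropy_uniform = kernel_entropy(uniform)       # 2.0 bits for 4 equally used kernels
entropy_peaked = kernel_entropy(peaked)         # 0 bits when one kernel is always used
```

A lower entropy means the kernel indices are cheaper to encode, which is why penalizing this term during training pushes the bit rate down.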

By adjusting the weights of the regularizers, the model is fine-tuned to the desired bit rate range. In addition, applying Huffman coding to grouped sample pairs (two adjacent samples per pair) provides a higher compression ratio.
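The benefit of coding sample pairs can be seen on a small, deliberately skewed example. The sketch below builds Huffman code lengths with the standard library and compares the total bit count for single symbols against adjacent pairs; the helper names and the toy index sequence are hypothetical and do not come from the described codec.

```python
import heapq
from collections import Counter
from itertools import count

def huffman_code_lengths(symbols):
    """Map each distinct symbol to the length of its Huffman codeword."""
    freq = Counter(symbols)
    if len(freq) == 1:                       # a lone symbol still needs one bit
        return {next(iter(freq)): 1}
    tie = count()                            # unique tie-breaker for the heap
    heap = [(n, next(tie), {s: 0}) for s, n in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        n1, _, d1 = heapq.heappop(heap)
        n2, _, d2 = heapq.heappop(heap)
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (n1 + n2, next(tie), merged))
    return heap[0][2]

def total_bits(symbols):
    lengths = huffman_code_lengths(symbols)
    return sum(lengths[s] for s in symbols)

# Toy kernel indices: index 0 is used seven times out of eight.
indices = [0, 0, 0, 0, 0, 0, 0, 1] * 8
pairs = list(zip(indices[0::2], indices[1::2]))   # two adjacent samples per pair
bits_single = total_bits(indices)
bits_paired = total_bits(pairs)
```

With only two distinct symbols, single-symbol Huffman coding cannot go below one bit per sample, whereas coding pairs spends one bit per two samples in this example.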

FIG. 7 is a diagram for explaining the centralized distribution of differential coding through coding of the residual signal according to an embodiment of the present invention.

The present invention proposes a simpler and more scalable waveform neural codec. In collaborative quantization, LPC coefficient quantization becomes a trainable component so that it can be optimally combined with residual quantization.

Meanwhile, the method according to the present invention may be written as a computer-executable program and implemented on various recording media such as magnetic storage media, optical reading media, and digital storage media.

Implementations of the various techniques described herein may be realized in digital electronic circuitry, or in computer hardware, firmware, software, or combinations thereof. Implementations may be realized as a computer program product, that is, a computer program tangibly embodied in an information carrier, for example a machine-readable storage device (computer-readable medium) or a propagated signal, for processing by, or controlling the operation of, a data processing apparatus such as a programmable processor, a computer, or multiple computers. A computer program, such as the computer program(s) described above, may be written in any form of programming language, including compiled or interpreted languages, and may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program may be deployed to be processed on one computer or on multiple computers at one site, or distributed across multiple sites and interconnected by a communication network.

Processors suitable for processing a computer program include, by way of example, both general-purpose and special-purpose microprocessors, and any one or more processors of any kind of digital computer. Generally, a processor receives instructions and data from a read-only memory, a random access memory, or both. Elements of a computer may include at least one processor that executes instructions and one or more memory devices that store instructions and data. Generally, a computer may also include, or be coupled to receive data from, transmit data to, or both, one or more mass storage devices for storing data, such as magnetic disks, magneto-optical disks, or optical disks. Information carriers suitable for embodying computer program instructions and data include, by way of example, semiconductor memory devices; magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROM (Compact Disk Read Only Memory) and DVD (Digital Video Disk); magneto-optical media such as floptical disks; and ROM (Read Only Memory), RAM (Random Access Memory), flash memory, EPROM (Erasable Programmable ROM), EEPROM (Electrically Erasable Programmable ROM), and the like. The processor and memory may be supplemented by, or incorporated in, special-purpose logic circuitry.

In addition, a computer-readable medium may be any available medium that can be accessed by a computer, and may include both computer storage media and transmission media.

Although this specification contains many specific implementation details, these should not be construed as limiting the scope of any invention or of what may be claimed, but rather as descriptions of features that may be specific to particular embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments may also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment may also be implemented in multiple embodiments separately or in any suitable sub-combination. Furthermore, although features may be described as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination may in some cases be excised from the combination, and the claimed combination may be directed to a sub-combination or a variation of a sub-combination.

Similarly, although operations are depicted in the drawings in a particular order, this should not be understood as requiring that the operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain cases, multitasking and parallel processing may be advantageous. Moreover, the separation of various device components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and devices can generally be integrated together in a single software product or packaged into multiple software products.

Meanwhile, the embodiments of the present invention disclosed in this specification and the drawings are merely specific examples presented to aid understanding and are not intended to limit the scope of the present invention. It will be apparent to those of ordinary skill in the art to which the present invention pertains that other modifications based on the technical idea of the present invention can be implemented in addition to the embodiments disclosed herein.

Claims

A method for coding a residual signal of LPC coefficients, performed by a computing device, the method comprising:
performing, by the computing device, LPC (Linear Prediction Coding) analysis and quantization on input speech to generate coded LPC coefficients and an LPC residual signal;
determining a predicted LPC residual signal by applying the LPC residual signal to cross-module residual learning;
performing LPC synthesis using the coded LPC coefficients and the predicted LPC residual signal; and
determining output speech, which is a synthesized output, according to a result of performing the LPC synthesis,
wherein the cross-module residual learning includes:
applying a high-pass filter to the input speech;
applying a pre-emphasis filter to a result of applying the high-pass filter;
determining LPC coefficients from a result of applying the pre-emphasis filter;
quantizing the LPC coefficients to generate the coded LPC coefficients and a soft allocation matrix of softmax; and
determining the LPC residual signal based on the result of applying the pre-emphasis filter and a result of quantizing the LPC coefficients,
and wherein determining the LPC coefficients includes:
performing cross-frame windowing by applying a window to an entire frame of the input speech to which the pre-emphasis filter has been applied;
performing sub-frame windowing by applying a window to a plurality of sub-frames corresponding to a middle region of the entire frame of the input speech in a result of performing the cross-frame windowing; and
performing composite windowing by overlapping results of performing the sub-frame windowing.

(Deleted)

The residual signal coding method of claim 1,
wherein the LPC coefficients are quantized by applying a trainable softmax to the LPC coefficients in the LSP domain.

The residual signal coding method of claim 1,
wherein the LPC residual signal is encoded by 1D-CNN autoencoders.

The residual signal coding method of claim 5,
wherein the 1D-CNN autoencoders are trained sequentially, with the residual signal that is the output of a previous autoencoder being used as the input of the next autoencoder.

The residual signal coding method of claim 5,
wherein differential coding is applied to the output of each of the 1D-CNN autoencoders,
the differential coding being applied based on the length of the code for each frame of the autoencoder.

A computing device for performing a method of coding a residual signal of LPC coefficients,
the computing device comprising a processor,
wherein the processor:
performs LPC (Linear Prediction Coding) analysis and quantization on input speech to generate coded LPC coefficients and an LPC residual signal;
determines a predicted LPC residual signal by applying the LPC residual signal to cross-module residual learning;
performs LPC synthesis using the coded LPC coefficients and the predicted LPC residual signal; and
determines output speech, which is a synthesized output, according to a result of performing the LPC synthesis,
wherein the processor performs the cross-module residual learning by:
applying a high-pass filter to the input speech;
applying a pre-emphasis filter to a result of applying the high-pass filter;
determining LPC coefficients from a result of applying the pre-emphasis filter;
quantizing the LPC coefficients to generate the coded LPC coefficients and a soft allocation matrix of softmax; and
determining the LPC residual signal based on the result of applying the pre-emphasis filter and a result of quantizing the LPC coefficients,
and wherein the processor, to determine the LPC coefficients:
performs cross-frame windowing by applying a window to an entire frame of the input speech to which the pre-emphasis filter has been applied;
performs sub-frame windowing by applying a window to a plurality of sub-frames corresponding to a middle region of the entire frame of the input speech in a result of performing the cross-frame windowing; and
performs composite windowing by overlapping results of performing the sub-frame windowing.

(Deleted)

The computing device of claim 8,
wherein the LPC coefficients are quantized by applying a trainable softmax to the LPC coefficients in the LSP domain.

The computing device of claim 8,
wherein the LPC residual signal is encoded by 1D-CNN autoencoders.

The computing device of claim 12,
wherein the 1D-CNN autoencoders are trained sequentially, with the residual signal that is the output of a previous autoencoder being used as the input of the next autoencoder.

The computing device of claim 12,
wherein differential coding is applied to the output of each of the 1D-CNN autoencoders,
the differential coding being applied based on the length of the code for each frame of the autoencoder.