KR20220022337A

KR20220022337A - A bit-stream-based modified vocoder using autoencoder, computer-readable storage medium and computer program

Info

Publication number: KR20220022337A
Application number: KR1020200103400A
Authority: KR
Inventors: 최승호; 이한나; 윤덕규; 김홍국; 김선교
Original assignee: 국방과학연구소 (Agency for Defense Development)
Priority date: 2020-08-18
Filing date: 2020-08-18
Publication date: 2022-02-25

Abstract

A modified speech encoder according to an embodiment may comprise: an encoding device comprising a speech compression unit that generates a quantized bit stream per frame from an original sound, a feature extraction unit of an autoencoder encoder unit that extracts a bottleneck feature of the autoencoder encoder from the bit stream extracted by the speech compression unit, and a quantization unit that quantizes the bottleneck feature and converts it into a bit stream; and a decoding device comprising an autoencoder decoder unit that restores the bit stream extracted by the encoding device, and a speech decoding unit that synthesizes speech from the restored bit stream. The invention can therefore be used in communication systems that require security, such as military systems, without loss compared with the existing speech encoder.

Description

A bit-stream-based modified speech encoder using an autoencoder, a computer-readable recording medium, and a computer program

The present invention relates to a technique for modifying a speech encoder for security in digital voice communication.

In general, low-bit-rate speech encoders for digital voice communication include linear predictive coding (LPC)-10e, mixed-excitation linear prediction (MELP), and code-excited linear prediction (CELP). Communication systems that require security, such as military systems, need a technique for modifying such speech encoders.

To solve the above problem, an object of the embodiment is to provide a bit-stream-based modified speech encoder that can modify a speech encoder by using an autoencoder based on deep learning.

The encoding device of the modified speech encoder of the embodiment may include a speech compression unit that generates a quantized bit stream per frame from an original sound, a feature extraction unit of an autoencoder encoder unit that extracts a bottleneck feature of the autoencoder encoder from the bit stream extracted by the speech compression unit, and a quantization unit that quantizes the bottleneck feature and converts it into a bit stream.

The quantization unit may linearly quantize the bit stream.

The bottleneck feature may be the value of a node at which the bottleneck occurs among the nodes of the autoencoder.

The speech compression unit may generate the quantized bit stream per frame by passing the original sound through a music sample.

The decoding device of the modified speech encoder of the embodiment may include an autoencoder decoder unit that restores the bit stream extracted by the encoding device of the modified speech encoder, and a speech decoding unit that synthesizes speech from the restored bit stream.

In the vector output by the autoencoder decoder unit, the speech decoding unit may replace values larger than a threshold with 1 and values smaller than the threshold with 0, and may collect, frame by frame, the speech synthesized from the replaced values.

According to an embodiment of the present invention, synthesizing speech with a modified speech encoder that uses an autoencoder allows use in communication systems that require security, such as military systems, without loss compared with the existing speech encoder.

FIG. 1 is a block diagram of an encoder device that obtains a bit stream for transmission in a modified speech encoder according to an embodiment of the present invention.
FIG. 2 is a block diagram of a decoder device that synthesizes speech when a bit stream is received in the modified speech encoder according to an embodiment of the present invention.
FIG. 3 is a flowchart illustrating an encoding process that obtains a bit stream for transmission in the modified speech encoder according to an embodiment of the present invention.
FIG. 4 is a flowchart illustrating a decoding process in which speech is synthesized when a bit stream is received in the modified speech encoder according to an embodiment of the present invention.

The advantages and features of the present invention, and the methods of achieving them, will become clear from the embodiments described in detail below together with the accompanying drawings. The present invention is not, however, limited to the embodiments disclosed below and may be implemented in various forms. These embodiments are provided only so that the disclosure of the present invention is complete and so that a person of ordinary skill in the art to which the invention pertains is fully informed of its scope, which is defined only by the claims.

In describing the embodiments of the present invention, detailed descriptions of well-known functions or configurations are omitted unless they are actually needed. The terms used below are defined in consideration of their functions in the embodiments and may vary according to the intention or practice of users and operators. Their definitions should therefore be based on the content of this specification as a whole.

In a speech coding system that synthesizes speech from the parameters of a speech production model, an embodiment of the present invention trains an autoencoder whose input and output are both the bit stream output by the encoder of a speech encoder, separates the encoder and decoder of the trained autoencoder, quantizes the bottleneck feature of the autoencoder, and transmits it as a bit stream.

That is, an embodiment of the present invention trains an autoencoder on the bit stream output by the encoder of the speech encoder and transmits the bit stream using the encoder and decoder of the trained autoencoder.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings.

FIG. 1 is a block diagram of an encoder device that obtains a bit stream for transmission in a modified speech encoder according to an embodiment of the present invention, and FIG. 2 is a block diagram of a decoder device that synthesizes speech when a bit stream is received in the modified speech encoder according to an embodiment of the present invention.

Referring to FIG. 1, the encoder device of the modified speech encoder according to the embodiment may include a speech compression unit (100) that generates a quantized bit stream per frame from an original sound, a feature extraction unit (102) of an autoencoder encoder unit that extracts a bottleneck feature of the autoencoder encoder from the bit stream extracted by the speech compression unit (100), and a quantization unit (104) that quantizes the bottleneck feature and converts it into a bit stream.

Here, x(n), the input to the speech compression unit (100), denotes the original sound input to the speech coding system, and the output y(n) denotes the bit stream transmitted by the modified speech encoder.

The speech compression unit (100) may obtain a quantized bit stream per frame by passing speech samples through the encoder of an existing speech encoder.

The feature extraction unit (102) of the autoencoder encoder unit may extract the bottleneck feature by feeding the bit stream obtained through the speech compression unit (100) into the encoder of the autoencoder. The autoencoder may be trained using the quantized bit stream output by the encoder of the existing speech encoder as both its input and its target output. The autoencoder model may be built from, for example, Dense layers and may use an activation function such as softsign or ReLU.

The autoencoder may be, but is not limited to, a neural network model based on deep learning. It may consist of an input layer, a hidden layer, and an output layer, and the hidden layer may comprise two or more layers. The input, hidden, and output layers may be made up of a plurality of nodes. The part to the left of the hidden layer may be the encoder and the part to the right may be the decoder, but the structure is not limited thereto.
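The encoder/decoder split described above can be illustrated with a minimal pure-Python sketch. It assumes a single dense layer on each side, softsign activations, and random untrained weights standing in for trained parameters; the helper names (`softsign`, `dense`, `init_layer`) and the 27-to-13 sizes (borrowed from the MELP example later in the text) are illustrative choices, not the patent's actual model.

```python
import random

def softsign(v):
    """Softsign activation: maps a real value into (-1, 1)."""
    return v / (1.0 + abs(v))

def dense(inputs, weights, biases):
    """Fully connected layer followed by a softsign activation."""
    return [softsign(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def init_layer(n_in, n_out, rng):
    """Random (untrained) parameters; a real system would load trained ones."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

rng = random.Random(0)
FRAME_BITS, BOTTLENECK = 27, 13  # assumed sizes, taken from the MELP example
enc_w, enc_b = init_layer(FRAME_BITS, BOTTLENECK, rng)
dec_w, dec_b = init_layer(BOTTLENECK, FRAME_BITS, rng)

# Stand-in for one frame of vocoder bits (a real frame comes from the speech encoder).
frame = [rng.randint(0, 1) for _ in range(FRAME_BITS)]

bottleneck = dense(frame, enc_w, enc_b)     # encoder half: 27 bits -> 13 features
estimate = dense(bottleneck, dec_w, dec_b)  # decoder half: 13 features -> 27 estimates
print(len(bottleneck), len(estimate))       # 13 27
```

Only the encoder half would run on the transmitting side and only the decoder half on the receiving side; training (minimising the difference between `frame` and `estimate`) is outside the scope of this sketch.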

The quantization unit (104) applies linear scalar quantization to the bit stream extracted from the autoencoder. Quantization is performed in this embodiment because real digital communication is taken into account. Conventional quantization techniques, such as multi-stage quantization, may be used, but quantization is not limited thereto.
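Linear scalar quantization of a single feature value can be sketched as follows. The value range of [-1, 1] (which would match a softsign output), the 4-bit default, and the MSB-first bit order are assumptions made for illustration; the source does not specify them.

```python
def quantize_linear(value, n_bits=4, lo=-1.0, hi=1.0):
    """Uniformly quantize `value` in [lo, hi] to n_bits, returned MSB first."""
    levels = (1 << n_bits) - 1                 # highest quantizer index
    clipped = min(max(value, lo), hi)          # guard against out-of-range input
    index = round((clipped - lo) / (hi - lo) * levels)
    return [(index >> shift) & 1 for shift in range(n_bits - 1, -1, -1)]

print(quantize_linear(-1.0))  # [0, 0, 0, 0]
print(quantize_linear(1.0))   # [1, 1, 1, 1]
```

Each feature is quantized independently of the others, which is what makes the scheme scalar rather than vector quantization.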

Referring to FIG. 2, the decoder device of the modified speech encoder according to the embodiment may include an autoencoder decoder unit (200) that restores the bit stream extracted by the encoding device of the modified speech encoder, and a speech decoding unit (202) that synthesizes speech from the restored bit stream. Here, the input y(n) and the output signal denote, respectively, the bit stream received by the modified speech encoder and the synthesized speech that it outputs.

The decoder unit (200) dequantizes the bit stream received per frame and feeds it into the decoder of the autoencoder to obtain a vector of estimates of the speech encoder's bit stream.

In the vector output by the autoencoder decoder inside the decoder unit, the speech decoding unit (202) replaces values larger than a specific threshold with 1 and values smaller than the threshold with 0, then feeds the result into the decoder of the existing speech encoder to obtain synthesized speech frame by frame.
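The thresholding step can be sketched in a few lines. The threshold of 0.5 and the function name `binarize` are assumptions; the source speaks only of "a specific threshold" and does not say how a value exactly equal to the threshold is treated (this sketch maps it to 0).

```python
def binarize(estimates, threshold=0.5):
    """Map each estimated bit value to 1 if above the threshold, else 0."""
    return [1 if v > threshold else 0 for v in estimates]

print(binarize([0.9, 0.1, 0.51, 0.49]))  # [1, 0, 1, 0]
```

The resulting hard bits have the same layout as the original vocoder frame, so they can be passed unchanged to the existing speech decoder.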

The operation of the modified speech encoder according to the embodiment is described below with reference to FIGS. 3 and 4.

FIG. 3 is a flowchart illustrating an encoding process that obtains a bit stream for transmission in the modified speech encoder according to an embodiment of the present invention, and FIG. 4 is a flowchart illustrating a decoding process in which speech is synthesized when a bit stream is received in the modified speech encoder according to an embodiment of the present invention.

As shown in FIG. 3, the speech compression unit (100) may extract the bit stream of the existing speech encoder (S100).

Next, the feature extraction unit (102) of the autoencoder encoder unit may extract the bottleneck feature of the autoencoder encoder, taking as input the bit stream output by the speech compression unit (100) (S102).

The quantization unit (104) may quantize the bottleneck feature extracted by the feature extraction unit (102) of the autoencoder encoder unit (S104). For example, suppose the speech compression unit (100) uses a 1.2 kbps MELP encoder as the existing speech encoder, the feature extraction unit (102) uses a 13-dimensional bottleneck feature, and the quantization unit (104) performs 4-bit scalar quantization. The existing speech encoder then produces 27 quantized bits per frame, these are fed into the autoencoder to obtain a 13-dimensional vector, and the encoder of the modified speech encoder can finally transmit a 54-bit stream per frame.
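The bit budget in this example can be checked with a small packing sketch. The function name `pack_frame` and the MSB-first bit order are assumptions for illustration. With 13 quantizer indices of 4 bits each, the sketch produces 13 × 4 = 52 bits per frame; it does not attempt to account for the 54-bit figure stated above.

```python
def pack_frame(indices, bits_per_value=4):
    """Concatenate quantizer indices into one bit list, MSB first per index."""
    bits = []
    for index in indices:
        if not 0 <= index < (1 << bits_per_value):
            raise ValueError("index does not fit in the given number of bits")
        bits.extend((index >> s) & 1 for s in range(bits_per_value - 1, -1, -1))
    return bits

# Thirteen placeholder indices, one per bottleneck dimension.
indices = [i % 16 for i in range(13)]
frame_bits = pack_frame(indices)
print(len(frame_bits))  # 52
```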

As shown in FIG. 4, the decoder unit (200) dequantizes the bit stream received per frame and feeds it into the decoder of the autoencoder to obtain a vector of estimates of the speech encoder's bit stream (S200). In this embodiment, dequantization is the inverse of the quantizer used in the quantization unit (104). For example, if the quantization unit (104) uses 4-bit linear scalar quantization, the decoder unit (200) dequantizes the bit stream four bits at a time.
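Dequantizing a received frame in groups of four bits might look like the sketch below. It assumes the quantizer was uniform over [-1, 1] with MSB-first bit order; both are illustrative assumptions, as is the function name `dequantize_linear`.

```python
def dequantize_linear(bits, n_bits=4, lo=-1.0, hi=1.0):
    """Rebuild one value per group of n_bits bits (MSB first)."""
    if len(bits) % n_bits != 0:
        raise ValueError("bit stream length must be a multiple of n_bits")
    levels = (1 << n_bits) - 1
    values = []
    for start in range(0, len(bits), n_bits):
        index = 0
        for bit in bits[start:start + n_bits]:
            index = (index << 1) | bit          # accumulate the quantizer index
        values.append(lo + index / levels * (hi - lo))
    return values

print(dequantize_linear([0, 0, 0, 0, 1, 1, 1, 1]))  # [-1.0, 1.0]
```

The reconstructed values differ from the original bottleneck features by at most half a quantization step, which is the quantization error the evaluation later refers to.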

In the vector output by the autoencoder decoder inside the decoder unit, the speech decoding unit (202) replaces values larger than a specific threshold with 1 and values smaller than the threshold with 0, then feeds the result into the decoder of the existing speech encoder to obtain synthesized speech frame by frame (S202). For example, if the speech compression unit (100) and the speech decoding unit (202) use the 1.2 kbps MELP codec as the existing speech encoder and the decoder unit (200) receives a 54-bit stream per frame, speech can finally be synthesized from a 27-bit stream per frame.

[Equation 1]

In [Equation 1], LSD (log spectral distance) is a value in dB that represents the difference between spectra; a lower LSD means less distortion between the original sound and the synthesized sound.
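For orientation, one commonly used form of the log spectral distance is sketched below. It takes the root-mean-square of the per-bin log power ratio (in dB) within each frame and averages over frames. This is an assumed, generic definition and not necessarily identical to the patent's Equation 1; the function name, the `eps` guard, and the input layout (a list of power spectra, one per frame) are all illustrative.

```python
import math

def log_spectral_distance(ref_power, syn_power, eps=1e-12):
    """Mean over frames of the RMS log power ratio (dB) across frequency bins."""
    frame_distances = []
    for ref_frame, syn_frame in zip(ref_power, syn_power):
        squared_db = [(10.0 * math.log10((r + eps) / (s + eps))) ** 2
                      for r, s in zip(ref_frame, syn_frame)]
        frame_distances.append(math.sqrt(sum(squared_db) / len(squared_db)))
    return sum(frame_distances) / len(frame_distances)

reference = [[1.0, 2.0, 4.0], [3.0, 3.0, 3.0]]
print(log_spectral_distance(reference, reference))  # 0.0
```

Identical spectra give a distance of 0 dB, and a uniform tenfold power difference gives about 10 dB, consistent with the statement that lower values indicate less distortion.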

In the embodiment of the present invention, results were simulated by comparing LSD values. [Table 1] below shows the results of comparing speech synthesized by the MELP vocoder, an existing speech encoder, and speech synthesized by the modified speech encoder of the embodiment, each against the original speech samples. The quality of the synthesized speech is expressed by the LSD value and by the score of STOI, a standard intrusive speech intelligibility measure.

[Table 1]
Vocoder / Measure   MELP 1.2 kbps Encoder & Decoder   MELP 2.4 kbps Encoder & Decoder   Modified MELP Encoder & Decoder
LSD                 12.39                             9.17                              12.49
STOI                0.82                              0.88                              0.80

Comparing the quality of speech synthesized by the modified speech encoder of the embodiment with that of the existing speech encoder shows a slight loss of intelligibility caused by quantization error, but otherwise similar performance.

According to the embodiment of the present invention described above, an autoencoder is used to implement a modified speech encoder whose output differs little in quality from the speech synthesized by the existing speech encoder.

Various embodiments of this document may be implemented as software (e.g., a program) including instructions stored in a machine-readable (e.g., computer-readable) storage medium (e.g., internal or external memory). The machine is a device that can call the stored instructions from the storage medium and operate according to the called instructions, and may include an electronic device according to the disclosed embodiments. When an instruction is executed by a control unit, the control unit may perform the function corresponding to the instruction directly or by using other components under its control. The instructions may include code generated or executed by a compiler or an interpreter. The machine-readable storage medium may be provided in the form of a non-transitory storage medium. Here, "non-transitory" means only that the storage medium does not include a signal and is tangible; it does not distinguish whether data is stored in the storage medium semi-permanently or temporarily.

According to an embodiment, the method according to the various embodiments disclosed in this document may be provided as part of a computer program product.

According to an embodiment, a computer-readable recording medium storing a computer program may include instructions for causing a processor to perform a method comprising: generating a quantized bit stream per frame from an original sound; extracting a bottleneck feature of an autoencoder encoder from the bit stream; quantizing the bottleneck feature and converting it into a bit stream; restoring the converted bit stream; and synthesizing speech from the restored bit stream.

According to an embodiment, a computer program stored in a computer-readable recording medium may include instructions for causing a processor to perform a method comprising: generating a quantized bit stream per frame from an original sound; extracting a bottleneck feature of an autoencoder encoder from the bit stream; quantizing the bottleneck feature and converting it into a bit stream; restoring the converted bit stream; and synthesizing speech from the restored bit stream.

Although the above description refers to the drawings and embodiments, those skilled in the art will understand that the embodiments can be modified and changed in various ways without departing from the technical spirit of the embodiments set out in the claims below.

100: speech compression unit
102: feature extraction unit of the autoencoder encoder unit
104: quantization unit
200: autoencoder decoder unit
202: speech decoding unit

Claims

1. An encoding device of a modified speech encoder, comprising:
a speech compression unit that generates a quantized bit stream per frame from an original sound;
a feature extraction unit of an autoencoder encoder unit that extracts a bottleneck feature of the autoencoder encoder from the bit stream extracted by the speech compression unit; and
a quantization unit that quantizes the bottleneck feature and converts it into a bit stream.

2. The encoding device of claim 1,
wherein the quantization unit linearly quantizes the bit stream.

3. The encoding device of claim 1,
wherein the bottleneck feature is the value of a node at which the bottleneck occurs among the nodes of the encoder of the autoencoder unit.

4. The encoding device of claim 1,
wherein the speech compression unit generates the quantized bit stream per frame by passing the original sound through a music sample.

5. A decoding device of a modified speech encoder, comprising:
an autoencoder decoder unit that restores a bit stream extracted by an encoding device of the modified speech encoder; and
a speech decoding unit that synthesizes speech from the restored bit stream.

6. The decoding device of claim 5,
wherein the speech decoding unit replaces values larger than a threshold with 1 and values smaller than the threshold with 0 in the vector output by the autoencoder decoder unit, and collects, frame by frame, the speech synthesized from the replaced values.

7. A modified speech encoder, comprising:
an encoding device including a speech compression unit that generates a quantized bit stream per frame from an original sound, a feature extraction unit of an autoencoder encoder unit that extracts a bottleneck feature of the autoencoder encoder from the bit stream extracted by the speech compression unit, and a quantization unit that quantizes the bottleneck feature and converts it into a bit stream;
an autoencoder decoder unit that restores the bit stream extracted by the encoding device; and
a speech decoding unit that synthesizes speech from the restored bit stream.

8. An encoding and decoding method of a modified speech encoder, comprising:
generating a quantized bit stream per frame from an original sound;
extracting a bottleneck feature of an autoencoder encoder from the bit stream;
quantizing the bottleneck feature and converting it into a bit stream;
restoring the converted bit stream; and
synthesizing speech from the restored bit stream.

9. A computer-readable recording medium storing a computer program,
wherein the computer program comprises instructions for causing a processor to perform a method comprising:
generating a quantized bit stream per frame from an original sound;
extracting a bottleneck feature of an autoencoder encoder from the bit stream;
quantizing the bottleneck feature and converting it into a bit stream;
restoring the converted bit stream; and
synthesizing speech from the restored bit stream.

10. A computer program stored in a computer-readable recording medium,
wherein the computer program comprises instructions for causing a processor to perform a method comprising:
generating a quantized bit stream per frame from an original sound;
extracting a bottleneck feature of an autoencoder encoder from the bit stream;
quantizing the bottleneck feature and converting it into a bit stream;
restoring the converted bit stream; and
synthesizing speech from the restored bit stream.