KR20230017286A

KR20230017286A - Conditional output generation through data density gradient estimation

Info

Publication number: KR20230017286A
Application number: KR1020227045943A
Authority: KR
Inventors: 난신 첸; 병하 천; 윌리엄 챈; 론 제이. 웨이스; 모하매드 노로우지; 유 장; 용후이 우
Original assignee: 구글 엘엘씨
Priority date: 2020-09-02
Filing date: 2021-09-02
Publication date: 2023-02-03
Also published as: EP4150615A1; US20230325658A1; WO2022051548A1; JP2023540834A; CN115803805A

Abstract

Methods, systems, and apparatus, including computer programs encoded on computer storage media, are provided for using a neural network to generate an output conditioned on a network input. In one aspect, a method comprises: obtaining a network input; initializing a current network output; and generating a final network output by updating the current network output at each of a plurality of iterations, wherein each iteration corresponds to a respective noise level, and wherein the updating comprises, at each iteration: processing a model input for the iteration comprising (i) the current network output and (ii) the network input using a noise estimation neural network that is configured to process the model input to generate a noise output, wherein the noise output comprises a respective noise estimate for each value in the current network output; and updating the current network output using the noise estimate and the noise level for the iteration.

Description

Conditional output generation through data density gradient estimation

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims priority to U.S. Application No. 63/073,867, filed on September 2, 2020, the disclosure of which is incorporated herein by reference.

The present invention relates to using machine learning models to generate outputs conditioned on network inputs.

A machine learning model receives an input and generates an output, for example a predicted output, based on the received input. Some machine learning models are parametric models and generate the output based on the received input and on the values of the model's parameters.

Some machine learning models are deep models that employ multiple layers of models to generate an output for a received input. For example, a deep neural network is a deep machine learning model that includes an output layer and one or more hidden layers that each apply a non-linear transformation to a received input to generate an output.

This specification describes a system, implemented as computer programs on one or more computers in one or more locations, that generates network outputs conditioned on network inputs.

According to a first aspect of the present invention, there is provided a method for generating a final network output comprising a plurality of outputs conditioned on a network input, the method comprising: obtaining the network input; initializing a current network output; and generating the final network output by updating the current network output at each of a plurality of iterations, wherein each iteration corresponds to a respective noise level, and wherein the updating comprises, at each iteration: processing a model input for the iteration comprising (i) the current network output and (ii) the network input using a noise estimation neural network that is configured to process the model input to generate a noise output, wherein the noise output comprises a respective noise estimate for each value in the current network output; and updating the current network output using the noise estimate and the noise level for the iteration.

In some implementations, the network input is a spectrogram of an audio segment and the final network output is a waveform for the audio segment.

In some implementations, the audio segment is a speech segment.

In some implementations, the spectrogram is generated by a text-to-speech model from a text segment or from linguistic features of the text segment.

In some implementations, the spectrogram is a mel spectrogram or a log mel spectrogram.

In some implementations, updating the current network output using the noise estimate and the noise level for the iteration comprises: generating an update for the iteration from at least the noise estimate and the noise level corresponding to the iteration; and subtracting the update from the current network output to generate an initial updated network output.

In some implementations, updating the current network output further comprises modifying the initial updated network output based on the noise level for the iteration to generate a modified initial updated network output.

In some implementations, for the final iteration, the modified initial updated network output is the updated network output after the final iteration, and, for each iteration before the final iteration, the updated network output after that iteration is generated by adding noise to the modified initial updated network output.
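The three sub-steps just listed (subtract an update, rescale by the noise level, add noise except at the final iteration) can be sketched as follows. This is a minimal illustration, not the claimed method: the function name `update_step` is hypothetical, and the specific coefficients are an assumption borrowed from the usual denoising-diffusion parameterization, where `alpha` is one minus the per-iteration noise level and `alpha_bar` is an aggregate noise level.

```python
import numpy as np

def update_step(y, eps_hat, alpha, alpha_bar, sigma, is_final, rng):
    """One refinement step (hypothetical helper; coefficients are assumed)."""
    # Build the update from the noise estimate and the noise level.
    update = (1.0 - alpha) / np.sqrt(1.0 - alpha_bar) * eps_hat
    # Subtract the update to obtain the initial updated output.
    y_initial = y - update
    # Modify the initial updated output based on the noise level.
    y_modified = y_initial / np.sqrt(alpha)
    # The final iteration returns the modified output unchanged.
    if is_final:
        return y_modified
    # Every earlier iteration adds fresh noise.
    return y_modified + sigma * rng.standard_normal(y.shape)
```

With a zero noise estimate, the final-iteration branch reduces to a pure rescaling of `y` by `1 / sqrt(alpha)`, which makes the role of each sub-step easy to check in isolation.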

In some implementations, initializing the current network output comprises sampling each of a plurality of initial values for the current network output from a corresponding noise distribution.

In some implementations, the model input for each iteration includes iteration-specific data that differs from iteration to iteration.

In some implementations, the model input for each iteration includes the noise level corresponding to the iteration.

In some implementations, the model input for each iteration includes an aggregate noise level for the iteration, generated from the noise levels corresponding to that iteration and to any iterations after it in the plurality of iterations.
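One plausible way to form such an aggregate noise level is a cumulative product, sketched below. The text does not fix the aggregation rule here, so this is an assumption: `betas[n - 1]` is taken to be the noise level of iteration n, iterations run from N down to 1, and the aggregate for iteration n multiplies `1 - beta` over iteration n and every iteration performed after it (n - 1, ..., 1). The name `total_noise_levels` is hypothetical.

```python
import numpy as np

def total_noise_levels(betas):
    """Aggregate noise level per iteration (assumed cumulative-product rule)."""
    alphas = 1.0 - np.asarray(betas, dtype=float)
    # Entry n-1 is the product of alphas for iterations 1..n, i.e. iteration n
    # and all iterations executed after it when counting down from N to 1.
    return np.cumprod(alphas)

# Example schedule with six iterations (values chosen only for illustration).
betas = np.linspace(1e-4, 0.05, 6)
alpha_bar = total_noise_levels(betas)
```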

In some implementations, the noise estimation neural network comprises: a noise generation neural network that comprises a plurality of noise generation neural network layers and is configured to process the network input to map the network input to the noise output; and a network output processing neural network that comprises a plurality of network output processing neural network layers configured to process the current network output to generate an alternative representation of the current network output, wherein at least one of the noise generation neural network layers receives an input derived from (i) an output of another one of the noise generation neural network layers, (ii) an output of a corresponding network output processing neural network layer, and (iii) iteration-specific data for the iteration.

In some implementations, the final network output has a higher dimensionality than the network input, and the alternative representation has the same dimensionality as the network input.
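To illustrate the dimensionality relationship only, the sketch below reduces a long waveform-like output to the (shorter) length of a frame-level network input. Block averaging is a stand-in chosen for simplicity; it is an assumption and not the learned network output processing layers described in the text. The names `to_input_resolution` and `hop_length` are hypothetical.

```python
import numpy as np

def to_input_resolution(current_output, hop_length):
    """Average each block of hop_length samples (stand-in for learned layers)."""
    # Assumes the output length is an exact multiple of hop_length.
    frames = current_output.reshape(-1, hop_length)
    return frames.mean(axis=1)

waveform = np.arange(12, dtype=float)               # 12 output samples
alt = to_input_resolution(waveform, hop_length=3)   # 4 frame-level values
```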

In some implementations, the noise estimation neural network includes a respective Feature-wise Linear Modulation (FiLM) module corresponding to each of the at least one noise generation neural network layers, wherein the FiLM module corresponding to a given noise generation neural network layer is configured to process (i) the output of the other one of the noise generation neural network layers, (ii) the output of the corresponding network output processing neural network layer, and (iii) the iteration-specific data for the iteration to generate the input to that noise generation neural network layer.

In some implementations, the FiLM module corresponding to the given noise generation neural network layer is configured to: generate a scale vector and a bias vector from (ii) the output of the corresponding network output processing neural network layer and (iii) the iteration-specific data for the iteration; and generate the input to the given noise generation neural network layer by applying an affine transformation, according to the scale vector and the bias vector, to (i) the output of the other one of the noise generation neural network layers.
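The FiLM computation described above amounts to a feature-wise affine transformation, which can be sketched as below. How the conditioning features and the iteration-specific data are combined, and how the scale and bias are projected, are assumptions made for illustration (an addition followed by two linear maps); the parameter names `w_scale` and `w_bias` are hypothetical.

```python
import numpy as np

def film(features, cond, noise_embedding, w_scale, w_bias):
    """Feature-wise affine modulation (projections are assumed, not from the text).

    features:        (T, C) output of another noise generation layer
    cond:            (T, D) output of the network output processing layer
    noise_embedding: (D,)   iteration-specific data, e.g. an embedded noise level
    w_scale, w_bias: (D, C) hypothetical projection matrices
    """
    # Combine the conditioning features with the iteration-specific data.
    h = cond + noise_embedding
    scale = h @ w_scale   # scale vector for each position
    bias = h @ w_bias     # bias vector for each position
    # Affine transformation applied feature-wise.
    return scale * features + bias
```

Keeping the scale and bias as functions of both the current output and the noise level lets a single set of layer weights behave differently at each iteration.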

In some implementations, the at least one noise generation neural network layers include an activation function layer that applies a non-linear activation function to the input to the activation function layer.

In some implementations, the other one of the noise generation neural network layers that corresponds to the activation function layer is a residual connection layer or a convolutional layer.

In some implementations, a method of training the noise estimation neural network is provided, the method comprising repeatedly performing operations comprising: obtaining a training network input and a corresponding training network output; selecting iteration-specific data from a set that includes the iteration-specific data for all of the plurality of iterations; sampling a noisy output comprising a respective noise value for each value in the training network output; generating a modified training network output from the noisy output and the corresponding training network output; processing a model input comprising (i) the modified training network output, (ii) the training network input, and (iii) the iteration-specific data using the noise estimation neural network to generate a training noise output; and determining an update to the network parameters of the noise estimation neural network from a gradient of an objective function that measures an error between the sampled noisy output and the training noise output.

In some implementations, the objective function measures a distance between the sampled noisy output and the training noise output.

In some implementations, the distance is an L1 distance.
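The training operations above can be sketched as a loss computation for one example. This is a sketch under stated assumptions: the way the modified training output mixes the clean output with the sampled noise (`c * y0 + sqrt(1 - c**2) * eps`) is assumed rather than quoted, the iteration-specific data is taken to be a single scalar level `c`, the L1 distance is normalized as a mean, and `noise_net` is a hypothetical callable standing in for the noise estimation neural network. Computing the gradient and applying the parameter update would be left to an automatic differentiation framework.

```python
import numpy as np

def training_loss(noise_net, x, y0, sqrt_alpha_bars, rng):
    """L1 objective for one training example (mixing rule is an assumption)."""
    # Select iteration-specific data from the set covering all iterations.
    c = rng.choice(sqrt_alpha_bars)
    # Sample a noisy output with one noise value per value of y0.
    eps = rng.standard_normal(y0.shape)
    # Generate the modified training network output.
    y_noisy = c * y0 + np.sqrt(1.0 - c**2) * eps
    # Predict the noise from the modified output, the input, and the level.
    eps_hat = noise_net(y_noisy, x, c)
    # Mean absolute error between sampled and predicted noise.
    return np.mean(np.abs(eps - eps_hat))
```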

Particular embodiments of the subject matter described in this specification can be implemented so as to realize one or more of the following advantages.

The described techniques generate network outputs conditioned on network inputs in a non-autoregressive manner. Autoregressive models have generally been shown to generate high-quality network outputs, but they require a large number of iterations, which results in high latency and high resource consumption (e.g., memory and processing power). This is because an autoregressive model generates the outputs within a network output one by one, with each given output conditioned on all of the outputs that precede it within the network output.

The described techniques, on the other hand, start from an initial network output, e.g., a noisy output comprising values sampled from a noise distribution, and iteratively refine the network output through a gradient-based sampler conditioned on the network input. That is, an iterative denoising process can be used. As a result, the approach is non-autoregressive and requires only a constant number of generation steps during inference. For example, for audio synthesis conditioned on a spectrogram, the described techniques can generate high-fidelity audio samples in very few iterations (e.g., six or fewer), with greatly reduced latency and very few computational resources, and the samples are comparable to or even better than those generated by existing autoregressive models. The described techniques can also generate samples of higher quality (e.g., higher fidelity) than those generated by existing non-autoregressive models.

The details of one or more embodiments of the subject matter of this specification are set forth in the accompanying drawings and the description below. Other features, aspects, and advantages of the invention will become apparent from the description, the drawings, and the claims.

FIG. 1 is a block diagram of an example conditional output generation system.
FIG. 2 is a flow diagram of an example process for generating outputs conditioned on network inputs.
FIG. 3 is a block diagram of an example noise estimation neural network.
FIG. 4 is a block diagram of an example network output processing neural network block.
FIG. 5 is a block diagram of an example Feature-wise Linear Modulation (FiLM) module.
FIG. 6 is a block diagram of an example noise generation neural network block.
FIG. 7 is a flow diagram of an example process for training a noise estimation neural network.
Like reference numbers and designations in the various drawings indicate like elements.

FIG. 1 shows an example conditional output generation system 100. The conditional output generation system 100 is an example of a system implemented as computer programs on one or more computers in one or more locations, in which the systems, components, and techniques described below are implemented.

The conditional output generation system 100 generates a final network output 104 conditioned on a network input 102.

The conditional output generation system 100 is broadly applicable and is not limited to one specific implementation. For illustrative purposes, however, a small number of example implementations are described below.

For example, the system can be configured to generate a waveform of audio conditioned on a spectrogram of the audio, e.g., a mel spectrogram or a spectrogram in which the frequencies are on a different scale. As a particular example, the spectrogram can be a spectrogram of a speech segment and the waveform can be a waveform for the speech segment. For example, the spectrogram can be the output of a text-to-speech machine learning model that converts text, or linguistic features of the text, into a spectrogram of an utterance of the text being spoken.

As another example, the system can be configured to perform an image processing task on the network input to generate the network output. For example, the network input can be a class of object (e.g., represented as a one-hot vector) that specifies the class of image object to be generated, and the network output can be a generated image of an object of that class (e.g., represented as a set of intensity values or RGB values for each pixel of the image).

As another particular example, the task can be conditional image generation, the network input can be a sequence of text, and the network output can be an image that reflects the text. For example, the sequence of text can include a sentence or a sequence of adjectives describing the scene in the image.

As another particular example, the task can be image embedding generation, the network input can be an image, and the network output can be a numeric embedding of the input image that characterizes the image.

As yet another particular example, the task can be object detection, the network input can be an image, and the network output can identify locations in the input image at which particular types of objects are depicted, e.g., by specifying bounding boxes in the input image that contain depictions of the objects.

As yet another particular example, the task can be image segmentation, the network input can be an image, and the network output can be a segmentation output that assigns each of a plurality of pixels of the input image to a category from a set of categories, e.g., by assigning to each pixel a respective score for each of the categories that represents the likelihood that the pixel belongs to the category.

More generally, the task can be any task that outputs continuous data conditioned on a network input.

To generate the final network output 104 conditioned on the network input 102, the conditional output generation system 100 obtains the network input 102 and initializes a current network output 114. For example, the system 100 can initialize the current network output 114 (i.e., generate the first instance of the current network output 114) by sampling each value in the current network output from a corresponding noise distribution (e.g., a Gaussian distribution such as N(0, I), where I is the identity matrix). That is, the initial current network output 114 contains the same number of values as the final network output 104, but each value is sampled from the corresponding noise distribution.

Next, the system 100 generates the final network output 104 by updating the current network output 114 at each of multiple iterations. In other words, the final network output 104 is the current network output 114 after the last of the multiple iterations.

In some cases, the number of iterations is fixed.

In other cases, the system 100 or another system can adjust the number of iterations based on a latency requirement for generating the final network output. That is, the system 100 can select the number of iterations so that the final network output 104 is generated while satisfying the latency requirement.

In still other cases, the system 100 or another system can adjust the number of iterations based on a computational resource consumption requirement for generating the final network output 104. That is, the system 100 or another system can select the number of iterations so that the final network output 104 is generated while satisfying the requirement. For example, the requirement can be a maximum number of floating-point operations (FLOPs) to be performed as part of generating the final network output.
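As a simple illustration of selecting the iteration count under a FLOPs requirement, the sketch below picks the largest number of iterations that fits the budget. The function name, its parameters, and the assumption that every iteration costs the same number of FLOPs are all hypothetical.

```python
def choose_num_iterations(flops_budget, flops_per_iteration, max_iterations):
    """Largest iteration count that fits a FLOPs budget (illustrative only)."""
    if flops_per_iteration <= 0:
        raise ValueError("flops_per_iteration must be positive")
    # Assumes each iteration has the same cost.
    affordable = int(flops_budget // flops_per_iteration)
    if affordable < 1:
        raise ValueError("budget does not cover a single iteration")
    return min(affordable, max_iterations)
```

The same shape of calculation applies to a latency requirement, with a measured per-iteration time in place of the per-iteration FLOPs.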

At each iteration, the system uses a noise estimation neural network 300 to process a model input for the iteration that includes (i) the current network output 114, (ii) the network input 102, and optionally (iii) iteration-specific data for the iteration. The iteration-specific data is generally derived from noise levels 106, where each noise level corresponds to a particular iteration. The system can update the current network output using the noise levels 106 as the scale of the update at each iteration. That is, each of the noise levels 106 can correspond to a particular iteration, and the noise level for an iteration can guide the scale of the update to the current network output 114 at that iteration.

The noise estimation neural network 300 is a neural network that has parameters ("network parameters") and that is configured to process the model input in accordance with the current values of the network parameters to generate a noise output 110 that includes a respective noise estimate for each value in the current network output 114. The noise estimation neural network is discussed in more detail below with reference to FIG. 3.

In general, the noise estimate for a given value in the current network output is an estimate of the noise that was added to the corresponding actual value in the actual network output for the network input in order to produce the given value. That is, the noise estimate defines how the actual value (if known) would need to be modified to produce the given value in the current network output, given the noise level corresponding to the current iteration. In other words, the given value could be generated by applying the noise estimate to the actual value in accordance with the noise level for the current iteration.

These noise estimates can be interpreted as estimates of the gradient of the data density, so the generation process can be viewed as a process of iteratively generating a network output through data density estimation.
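One standard way to see why a noise estimate acts as a gradient estimate is the short derivation below. It rests on an assumption that this passage does not state explicitly: that a noisy output y_n is formed from the actual output y_0 by scaling it and adding Gaussian noise ε, with an aggregate noise level written here as \bar\alpha_n (a symbol introduced only for this illustration).

```latex
% Assumption: y_n = \sqrt{\bar\alpha_n}\, y_0 + \sqrt{1-\bar\alpha_n}\,\epsilon,
% with \epsilon \sim \mathcal{N}(0, I).
\begin{align}
q(y_n \mid y_0) &= \mathcal{N}\!\left(y_n;\ \sqrt{\bar\alpha_n}\, y_0,\ (1-\bar\alpha_n)\, I\right), \\
\nabla_{y_n} \log q(y_n \mid y_0)
  &= -\frac{y_n - \sqrt{\bar\alpha_n}\, y_0}{1-\bar\alpha_n}
   = -\frac{\epsilon}{\sqrt{1-\bar\alpha_n}}.
\end{align}
```

Under that assumption, the added noise is a rescaled negative gradient of the log-density of the noisy output, so predicting the noise is equivalent, up to a known factor, to estimating that gradient.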

Next, the system 100 uses an update engine 112 to update the current network output 114 in the direction of the noise estimates.

In particular, the update engine 112 updates the current network output 114 using the noise estimates and the corresponding noise level for the iteration. That is, the update engine 112 updates each value in the current network output 114 using the corresponding noise estimate in the noise output 110 and the corresponding noise level for the iteration, as discussed in more detail with reference to FIG. 2.

After the final iteration, the conditional output generation system 100 outputs the updated network output 114 as the final network output 104. For example, in implementations where the final network output 104 represents an audio waveform, the system can play the audio through a speaker or transmit the audio for playback. In other implementations where the final network output 104 represents an image, the system can show the image on a user display or transmit the image for display. In some implementations, the system 100 can store the final network output 104 in a data store or transmit the final network output 104 to be stored remotely.

Before the system 100 uses the noise estimation neural network 300 to generate final network outputs, the system 100 or another system trains the noise estimation neural network 300 on training data. In implementations that include a range-affected neural network, the system can also train the range-affected neural network on the training data. Training is described below with reference to FIG. 7.

FIG. 2 is a flow diagram of an example process 200 for generating outputs conditioned on network inputs. For convenience, the process 200 will be described as being performed by a system of one or more computers located in one or more locations. For example, a conditional output generation system appropriately programmed in accordance with this specification, e.g., the conditional output generation system 100 of FIG. 1, can perform the process 200.

The system obtains a network input on which to condition the final network output (202). For example, when the network output is an audio waveform, the network input can be a spectrogram, a mel spectrogram, or linguistic features of the body of text reflected by the audio waveform.

The system initializes a current network output (204). For a final network output that contains multiple values, the system can sample from a noise distribution each value of an initial current network output that has the same number of values as the final network output. For example, the system can initialize the current network output using a noise distribution (e.g., a Gaussian noise distribution) denoted y_N ~ N(0, I), where I is the identity matrix and the N in y_N is the intended number of iterations. The system can then update the initial current network output over N iterations in descending order, from iteration N to iteration 1.

The system then updates the current network output at each of multiple iterations. In general, the current network output at each iteration can be interpreted as the final network output with additional noise. That is, the current network output is a noisy version of the final network output. For example, for an initial current network output y_N, where N is the number of iterations, the system can update the current network output at each iteration from N to 1 by removing an estimate of the noise corresponding to the iteration. That is, the system can refine the current network output at each iteration by determining an estimate of the noise and updating the current network output in accordance with the estimate. The system can proceed through the iterations in descending order until it outputs the final network output y_0.
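The overall descending loop from y_N to y_0 can be sketched as below. This is an illustration under assumptions: the update coefficients follow the common denoising-diffusion parameterization rather than a formula quoted from this passage, the noise added between iterations uses `sqrt(beta_n)` as its scale, and `noise_net` is a hypothetical callable standing in for the noise estimation neural network.

```python
import numpy as np

def generate(noise_net, x, num_values, betas, seed=0):
    """Iteratively refine noise into an output (coefficients are assumed)."""
    rng = np.random.default_rng(seed)
    alphas = 1.0 - np.asarray(betas, dtype=float)
    alpha_bars = np.cumprod(alphas)
    # Initialize the current network output: y_N ~ N(0, I).
    y = rng.standard_normal(num_values)
    # Iterate in descending order: N, N-1, ..., 1.
    for n in range(len(betas), 0, -1):
        a, a_bar = alphas[n - 1], alpha_bars[n - 1]
        # Noise estimate for every value of the current output.
        eps_hat = noise_net(y, x, np.sqrt(a_bar))
        # Remove the estimated noise and rescale.
        y = (y - (1.0 - a) / np.sqrt(1.0 - a_bar) * eps_hat) / np.sqrt(a)
        # Add fresh noise on every iteration except the last.
        if n > 1:
            y = y + np.sqrt(betas[n - 1]) * rng.standard_normal(num_values)
    return y  # y_0
```

As a sanity check on the loop structure, an idealized estimator that knows the true output and returns the exact noise makes the last iteration land on that output, since the final step adds no noise.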

At each of the multiple iterations, the system processes a model input for the iteration using the noise estimation neural network to generate a noise output for the iteration (206). The model input includes (1) the current network output, (2) the network input, and optionally (3) iteration-specific data for the iteration. The iteration-specific data is generally derived from the noise levels for the iterations, where each noise level corresponds to a particular iteration. The noise output can include a respective noise estimate for each value of the current network output. For example, the noise estimate for a particular value of the current network output can represent an estimate of the noise that was added to the corresponding actual value of the actual network output for the network input in order to produce that particular value. That is, the noise estimate for a particular value indicates how the actual value, if known, would need to be modified to produce that particular value given the corresponding noise level.

At each of the multiple iterations, the system updates the current network output as of the current iteration using the noise output for the current iteration and the noise level corresponding to the current iteration (208). The system can update each value of the current network output using the corresponding noise estimate in the noise output and the noise level for the current iteration. The system can generate an update for the iteration from the noise estimates and the noise level for the iteration, and subtract the update from the current network output to generate an initial updated network output. The system can then modify the initial updated network output based on the noise level for the iteration to generate a modified initial updated network output as follows:

$$y_{n-1} = \frac{1}{\sqrt{\alpha_n}} \left( y_n - \frac{1 - \alpha_n}{\sqrt{1 - \bar{\alpha}_n}} \, \epsilon_\theta\left(y_n, x, \sqrt{\bar{\alpha}_n}\right) \right) \qquad (1)$$

where n indexes the iterations, $y_n$ denotes the current network output at iteration n, $y_{n-1}$ denotes the modified initial updated network output, x denotes the network input, $\alpha_n$ denotes the noise level for iteration n, $\bar{\alpha}_n$ denotes the total noise level for iteration n (e.g., generated from the noise levels at the current iteration and at any iterations after the current iteration), and $\epsilon_\theta(y_n, x, \sqrt{\bar{\alpha}_n})$ denotes the noise output generated by the noise estimation neural network with parameters θ. The noise level $\alpha_n$ and the total noise level $\bar{\alpha}_n$ can be determined from a noise schedule $\{\beta_n\}_{n=1}^{N}$ (e.g., a linear noise schedule spanning a linear range from a minimum value to a maximum value, a Fibonacci-based schedule, or a custom schedule generated from data-driven or heuristic methods). The noise level is $\alpha_n = 1 - \beta_n$, and the total noise level $\sqrt{\bar{\alpha}}$ can be sampled from a uniform distribution as

$$\sqrt{\bar{\alpha}} \sim \mathcal{U}(l_{n-1}, l_n) \qquad (2)$$

where n indexes the iterations, $l_0 = 1$, and $l_n = \sqrt{\prod_{i=1}^{n} (1 - \beta_i)}$. Sampling $\sqrt{\bar{\alpha}}$ as in Equation (2) enables the system to generate updates based on different noise scales. The noise level $\alpha_n$ and the total noise level $\bar{\alpha}_n$ for each iteration n can be predetermined and obtained by the system as part of the model input.
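The following is a minimal sketch of how the noise levels and the refinement of Equation (1) might be computed. It assumes Equation (1) has the form commonly used in denoising diffusion samplers, y_{n-1} = (y_n - (1 - α_n) / sqrt(1 - ᾱ_n) · ε) / sqrt(α_n), and that the total noise level is the cumulative product of the per-iteration noise levels. The function names, the schedule endpoints, and the use of plain Python lists are illustrative assumptions; the noise estimate is passed in directly rather than produced by a neural network.

```python
import math
import random


def linear_beta_schedule(num_iterations, beta_min=1e-4, beta_max=0.05):
    """Return beta_1..beta_N spaced linearly (endpoints are illustrative assumptions)."""
    if num_iterations == 1:
        return [beta_min]
    step = (beta_max - beta_min) / (num_iterations - 1)
    return [beta_min + i * step for i in range(num_iterations)]


def noise_levels(betas):
    """Compute alpha_n = 1 - beta_n and the cumulative product alpha_bar_n."""
    alphas = [1.0 - b for b in betas]
    alpha_bars, running = [], 1.0
    for a in alphas:
        running *= a
        alpha_bars.append(running)
    return alphas, alpha_bars


def sample_sqrt_alpha_bar(alpha_bars, n, rng):
    """Equation (2): draw sqrt(alpha_bar) uniformly between l_n and l_{n-1}, with l_0 = 1."""
    l_prev = 1.0 if n == 1 else math.sqrt(alpha_bars[n - 2])
    l_curr = math.sqrt(alpha_bars[n - 1])
    return rng.uniform(l_curr, l_prev)


def refine_step(y_n, noise_estimate, alpha_n, alpha_bar_n):
    """Assumed form of Equation (1): subtract the scaled noise estimate, then rescale."""
    scale = (1.0 - alpha_n) / math.sqrt(1.0 - alpha_bar_n)
    return [(y - scale * eps) / math.sqrt(alpha_n)
            for y, eps in zip(y_n, noise_estimate)]


alphas, alpha_bars = noise_levels(linear_beta_schedule(6))
level = sample_sqrt_alpha_bar(alpha_bars, 3, random.Random(0))
```

Keeping the schedule computation separate from the refinement step reflects the statement that the noise levels can be predetermined and supplied as part of the model input.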

For the last iteration, the modified initial updated network output is the updated network output after the last iteration. For each iteration before the last iteration, the system generates the updated network output after that iteration by adding noise to the modified initial updated network output. That is, if the iteration is not the final iteration (i.e., n > 1), the system further updates the modified initial updated network output as follows:

$$y_{n-1} = y_{n-1} + \sigma_n z \qquad (3)$$

where n indexes the iterations, $\sigma_n$ can be determined from the noise schedule $\{\beta_n\}_{n=1}^{N}$ or by other methods (e.g., as a function of the noise schedule, or through hyper-parameter tuning using empirical experiments), and $z \sim \mathcal{N}(0, I)$. The term $\sigma_n$ is included to enable modeling of multi-modal distributions.

The system determines whether a termination criterion has been met (210). For example, the termination criterion can be that a particular number of iterations has been performed (e.g., a number determined to meet a minimum performance metric, a maximum latency requirement, or a maximum computational resource requirement such as a maximum number of FLOPS). If the particular number of iterations has not yet been performed, the system returns to step (206) and performs another update of the current network output.

If the system determines that the termination criterion has been met, the system outputs the final network output (212), which is the updated network output after the final iteration.
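Steps (204) through (212) can be sketched as a single loop. This is an illustrative sketch rather than the claimed implementation: `generate` and `noise_estimator` are hypothetical names, the noise estimator is any callable standing in for the noise estimation neural network, the refinement uses the denoising-diffusion form y_{n-1} = (y_n - (1 - α_n) / sqrt(1 - ᾱ_n) · ε) / sqrt(α_n), and σ_n is assumed to be sqrt(β_n), which is only one of the options the description allows.

```python
import math
import random


def generate(network_input, num_values, betas, noise_estimator, seed=0):
    """Iteratively refine a Gaussian sample into a network output (sketch)."""
    rng = random.Random(seed)
    alphas = [1.0 - b for b in betas]
    alpha_bars, running = [], 1.0
    for a in alphas:
        running *= a
        alpha_bars.append(running)

    # Step 204: initialize y_N by sampling every value from N(0, 1).
    y = [rng.gauss(0.0, 1.0) for _ in range(num_values)]

    # Iterate in descending order, n = N, ..., 1; the fixed iteration count
    # plays the role of the termination criterion of step 210.
    for n in range(len(betas), 0, -1):
        alpha_n, alpha_bar_n = alphas[n - 1], alpha_bars[n - 1]
        # Step 206: estimate the noise from (current output, input, noise level).
        eps = noise_estimator(y, network_input, math.sqrt(alpha_bar_n))
        # Step 208: remove the scaled noise estimate and rescale.
        scale = (1.0 - alpha_n) / math.sqrt(1.0 - alpha_bar_n)
        y = [(v - scale * e) / math.sqrt(alpha_n) for v, e in zip(y, eps)]
        if n > 1:
            # Not the final iteration: add fresh noise scaled by sigma_n.
            sigma_n = math.sqrt(betas[n - 1])  # assumption: sigma_n^2 = beta_n
            y = [v + sigma_n * rng.gauss(0.0, 1.0) for v in y]
    # Step 212: the output after the final iteration is the final network output.
    return y


def zero_estimator(current_output, network_input, noise_level):
    """Hypothetical stand-in for the neural network: predicts no noise."""
    return [0.0] * len(current_output)


sample = generate("mel-spectrogram placeholder", 8, [0.01, 0.02, 0.03], zero_estimator)
```

The number of calls to the estimator equals the number of entries in the schedule, regardless of how many values the output contains.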

Process 200 can be used to generate network outputs conditioned on network inputs in a non-autoregressive manner. In general, autoregressive models have been shown to generate high-quality network outputs but require a large number of iterations, resulting in high latency and high resource (e.g., memory and processing power) consumption. This is because autoregressive models generate each given output within the network output one at a time, with each conditioned on all of the outputs that precede it in the network output. Process 200, on the other hand, starts from an initial network output, e.g., a noisy output that includes values sampled from a noise distribution, and iteratively refines the network output via a gradient-based sampler conditioned on the network input. As a result, the approach is non-autoregressive and requires only a constant number of generation steps during inference. For example, for audio synthesis conditioned on a spectrogram, the described techniques can generate high-fidelity audio samples in very few iterations (e.g., six or fewer) that are comparable to, or even exceed, those generated by state-of-the-art autoregressive models, with greatly reduced latency and far fewer computational resources.

FIG. 3 shows an example architecture of a noise estimation network 300.

The example noise estimation network 300 includes various types of neural network layers and neural network blocks, including convolutional neural network layers, noise generation neural network blocks (e.g., each neural network block includes multiple neural network layers), Feature-wise Linear Modulation (FiLM) module neural network blocks, and network output processing neural network blocks.

The noise estimation network 300 processes a model input that includes (1) the current network output 114, (2) the network input 102, and (3) iteration-specific data that includes a total noise level 306 corresponding to the current iteration, to generate the noise output 110. The current network output 114 has a higher dimensionality than the network input 102, and the noise output 110 has the same dimensionality as the current network output 114. For example, if the current network output represents an audio waveform at 24 kHz, the network input can include an 80 Hz mel-spectrogram signal corresponding to the audio waveform (e.g., predicted by another system during inference).

The noise estimation network 300 includes multiple network output processing blocks that process the current network output 114 to generate respective alternative representations of the current network output 114.

The noise estimation network 300 also includes a network output processing block 400 that processes the current network output 114 to generate an alternative representation of the current network output, where the alternative representation has a smaller dimensionality than the current network output.

The noise estimation network 300 further includes additional network output processing blocks (e.g., network output processing blocks 318, 316, 314, and 312), each of which processes the alternative representation generated by the preceding network output processing block to generate another alternative representation that has a smaller dimensionality than the preceding one (e.g., block 318 processes the alternative representation from block 400 to generate an alternative representation with a smaller dimensionality than the output of block 400, block 316 processes the alternative representation from block 318 to generate an alternative representation with a smaller dimensionality than the output of block 318, and so on). The alternative representation of the current network output generated by the final network output processing block (e.g., block 312) has the same dimensionality as the network input 102.

For example, for a current network output that includes an audio waveform at 24 kHz and a network input that includes a mel-spectrogram at 80 Hz, the network output processing blocks can "downsample" (i.e., reduce the dimensionality) by factors of 2, 2, 3, 5, and 5 (e.g., by network output processing blocks 400, 318, 316, 314, and 312, respectively) until the alternative representation generated by the final block 312 is at 80 Hz (i.e., reduced by a total factor of 300 to match the mel-spectrogram). The architecture of an example network output processing block is discussed in more detail with reference to FIG. 4.

The noise estimation network 300 includes multiple FiLM module neural network blocks that process the iteration-specific data corresponding to the current iteration (e.g., the total noise level 306) and the alternative representations from the network output processing neural network blocks to generate inputs to the noise generation neural network blocks. Each FiLM module processes the total noise level 306 and the alternative representation from a respective network output processing block to generate an input to a respective noise generation block (e.g., FiLM module 500 processes the alternative representation from network output processing block 400 to generate an input to noise generation block 600, FiLM module 328 processes the alternative representation from network output processing block 318 to generate an input to noise generation block 338, and so on). In particular, each FiLM module generates a scale vector and a bias vector as inputs to the respective noise generation block (e.g., as inputs to an affine transformation neural network layer within the respective noise generation block), as discussed in more detail with reference to FIG. 5.

The noise estimation network 300 includes multiple noise generation neural network blocks that process the network input 102 and the outputs from the FiLM modules to generate the noise output 110. The noise estimation network 300 can include a convolutional layer 302 that processes the network input 102 to generate an input to the first noise generation block 332, and a convolutional layer 304 that processes the output from the final noise generation block 600 to generate the noise output 110. Each noise generation block generates an output that has a higher dimensionality than the network input 102. In particular, each noise generation block after the first generates an output that has a higher dimensionality than the output of the preceding noise generation block. The final noise generation block generates an output that has the same dimensionality as the current network output 114.

The noise estimation network 300 includes a noise generation block 332 that processes the output from convolutional layer 302 (i.e., the convolutional layer that processes the network input 102) and the output from FiLM module 322 to generate an input to noise generation block 334. The noise estimation network 300 further includes noise generation blocks 336, 338, and 600. Each of noise generation blocks 334, 336, 338, and 600 processes the output of the respective preceding noise generation block (e.g., block 334 processes the output from block 332, block 336 processes the output from block 334, and so on) and the output from a respective FiLM module (e.g., noise generation block 334 processes the output from FiLM module 324, noise generation block 336 processes the output from FiLM module 326, and so on) to generate an input to the next neural network block. Noise generation block 600 generates an input to convolutional layer 304, which processes that input to generate the noise output 110. The architecture of an example noise generation block (e.g., noise generation block 600) is discussed in more detail with reference to FIG. 6.

Each noise generation block before the final noise generation block can generate an output that has the same dimensionality as a corresponding alternative representation of the current network output. For example, noise generation block 332 generates an output that has the same dimensionality as the alternative representation generated by network output processing block 314, noise generation block 334 generates an output that has the same dimensionality as the output of network output processing block 316, and so on.

For example, if the current network output includes an audio waveform at 24 kHz and the network input includes a mel-spectrogram at 80 Hz, the noise generation blocks can "upsample" (i.e., increase the dimensionality) by factors of 5, 5, 3, 2, and 2 (e.g., by noise generation blocks 332, 334, 336, 338, and 600, respectively) until the output of the final noise generation block (e.g., noise generation block 600) is at 24 kHz (i.e., increased by a total factor of 300 to match the current network output 114).
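The arithmetic behind the 24 kHz and 80 Hz example can be checked with a short sketch. The helper name `chain_rates` is hypothetical; the factors and block numerals are those quoted in the examples above.

```python
def chain_rates(start_hz, factors, downsample):
    """Return the rate after each block when dividing (or multiplying) by each factor."""
    rates, rate = [], start_hz
    for factor in factors:
        rate = rate // factor if downsample else rate * factor
        rates.append(rate)
    return rates


# Network output processing blocks 400, 318, 316, 314, 312 (downsampling path).
down = chain_rates(24000, [2, 2, 3, 5, 5], downsample=True)
# Noise generation blocks 332, 334, 336, 338, 600 (upsampling path).
up = chain_rates(80, [5, 5, 3, 2, 2], downsample=False)

print(down)  # [12000, 6000, 2000, 400, 80]
print(up)    # [400, 2000, 6000, 12000, 24000]
```

Reading the two lists against each other shows the pairing described for the intermediate blocks: block 332 produces 400 Hz, matching block 314, and block 334 produces 2000 Hz, matching block 316.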

FIG. 4 shows an example architecture of the network output processing block 400.

The network output processing block 400 processes the current network output 114 to generate an alternative representation 402 of the current network output 114. The alternative representation has a smaller dimensionality than the current network output. The network output processing block 400 includes one or more neural network layers. The one or more neural network layers can include various types of neural network layers, including downsampling layers (e.g., to "downsample" or reduce the dimensionality of an input), activation layers with non-linear activation functions (e.g., a fully-connected layer with a leaky ReLU activation function), convolutional layers, and residual connection layers.

For example, a downsampling layer can be a convolutional layer with the stride needed to reduce ("downsample") the dimensionality of its input. In a particular example, a stride of X can be used to reduce the dimensionality of the input by a factor of X (e.g., a stride of 2 can be used to reduce the dimensionality of the input by a factor of 2, and a stride of 5 can be used to reduce it by a factor of 5).
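A strided convolution of this kind can be illustrated on a one-dimensional list. The sketch below assumes no padding and uses fixed averaging weights in place of learned filters; `strided_conv1d` is a hypothetical helper name.

```python
def strided_conv1d(signal, kernel, stride):
    """Valid 1-D cross-correlation evaluated every `stride` positions (no padding)."""
    width = len(kernel)
    return [
        sum(weight * signal[start + offset] for offset, weight in enumerate(kernel))
        for start in range(0, len(signal) - width + 1, stride)
    ]


signal = [1.0, 3.0, 5.0, 7.0, 2.0, 4.0, 6.0, 8.0]
# A size-2 filter with stride 2 halves the length: 8 values become 4.
halved = strided_conv1d(signal, [0.5, 0.5], stride=2)
print(halved)  # [2.0, 6.0, 3.0, 7.0]
```

With a filter of size 5 and a stride of 5, a length-10 input would likewise shrink to 2 values, which is the factor-of-5 case mentioned in the text.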

The left branch of the residual connection layer 420 includes a convolutional layer 402 and a downsampling layer 404. Convolutional layer 402 processes the current network output 114 to generate an input to downsampling layer 404. Downsampling layer 404 processes the output of convolutional layer 402 to generate an input to the residual connection layer 420. The output of downsampling layer 404 has a reduced dimensionality compared to the current network output 114. For example, convolutional layer 402 can include filters of size 1x1 with a stride of 1 (i.e., to preserve dimensionality), and downsampling layer 404 can include filters of size 2x1 with a stride of 2 to downsample the dimensionality of its input by a factor of 2.

The right branch of the residual connection layer 420 includes a downsampling layer 406 followed by three blocks, each consisting of an activation layer followed by a convolutional layer (e.g., activation layer 408, convolutional layer 410, activation layer 412, convolutional layer 414, activation layer 416, and convolutional layer 418). Downsampling layer 406 processes the current network output 114 to generate an input to the subsequent three blocks of activation and convolutional layers. The output of downsampling layer 406 has a smaller dimensionality than the current network output 114. The subsequent three blocks process the output of downsampling layer 406 to generate an input to the residual connection layer 420. For example, downsampling layer 406 can include filters of size 2x1 with a stride of 2 to reduce the dimensionality of its input by a factor of 2 (e.g., to match downsampling layer 404). The activation layers (e.g., 408, 412, and 416) can be fully-connected layers with leaky ReLU activation functions. The convolutional layers (e.g., 410, 414, and 418) can include filters of size 3x1 with a stride of 1 (i.e., to preserve dimensionality).

The residual connection layer 420 combines the output from the left branch and the output from the right branch to generate the alternative representation 402. For example, the residual connection layer 420 can add (e.g., by element-wise addition) the output from the left branch and the output from the right branch to generate the alternative representation 402.
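The two-branch structure of the network output processing block can be sketched on a single-channel list. Everything numeric here is an assumption for illustration: the filter weights are fixed rather than learned, the activation layers are reduced to an element-wise leaky ReLU with an assumed slope of 0.2, and the size-3 convolutions use zero padding to preserve length. The helper names are hypothetical.

```python
def leaky_relu(values, slope=0.2):
    """Element-wise leaky ReLU (the slope is an illustrative assumption)."""
    return [v if v > 0 else slope * v for v in values]


def conv1d_same(values, kernel):
    """Stride-1 convolution with zero padding so the length is preserved (odd kernel)."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(values) + [0.0] * pad
    return [sum(w * padded[i + j] for j, w in enumerate(kernel))
            for i in range(len(values))]


def downsample(values, factor):
    """Strided convolution with a size-`factor` averaging filter and stride `factor`."""
    kernel = [1.0 / factor] * factor
    return [sum(w * values[s + j] for j, w in enumerate(kernel))
            for s in range(0, len(values) - factor + 1, factor)]


def output_processing_block(current_output, factor=2):
    # Left branch: size-1 convolution (402), then downsampling (404).
    left = downsample(conv1d_same(current_output, [1.0]), factor)
    # Right branch: downsampling (406), then three (activation, size-3 conv) pairs.
    right = downsample(current_output, factor)
    for _ in range(3):
        right = conv1d_same(leaky_relu(right), [0.25, 0.5, 0.25])
    # Residual connection (420): element-wise sum of the two branches.
    return [l + r for l, r in zip(left, right)]


representation = output_processing_block([float(i) for i in range(12)])
```

Because both branches downsample by the same factor, their outputs have equal length and can be summed element-wise.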

FIG. 5 shows an example Feature-wise Linear Modulation (FiLM) module 500.

The FiLM module 500 processes the alternative representation 402 of the current network output and the total noise level 306 corresponding to the current iteration to generate a scale vector 512 and a bias vector 516. The scale vector 512 and the bias vector 516 can be processed as inputs to a particular layer (e.g., an affine transformation layer) in a respective noise generation block (e.g., noise generation block 600 of the noise estimation network 300 of FIG. 3). The FiLM module 500 includes a positional encoding function and one or more neural network layers. The one or more neural network layers can include several types of neural network layers, including residual connection layers, convolutional layers, and activation layers with non-linear activation functions (e.g., a fully-connected layer with a leaky ReLU activation function).

The left branch of the residual connection layer 508 includes a positional encoding function 502. The positional encoding function 502 processes the total noise level 306 to generate a positional encoding of the noise level. For example, the total noise level 306 can be multiplied by the positional encoding function 502, which is a combination of a sine function for even dimension indices and a cosine function for odd dimension indices, as in the pre-processing for Transformer models.
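One way to read this positional encoding is sketched below. The sketch assumes the Transformer frequency convention, in which the pair of dimensions (2k, 2k + 1) shares the frequency 1 / 10000^(2k / dim), and treats the scalar noise level as the "position". The function name and the base 10000 are assumptions taken from that convention, not from the description.

```python
import math


def encode_noise_level(noise_level, dim):
    """Sinusoidal encoding of a scalar noise level (Transformer-style frequencies assumed)."""
    encoding = []
    for index in range(dim):
        # Dimensions 2k and 2k + 1 share one frequency.
        frequency = 1.0 / (10000.0 ** ((index - index % 2) / dim))
        angle = noise_level * frequency
        # Sine on even dimension indices, cosine on odd dimension indices.
        encoding.append(math.sin(angle) if index % 2 == 0 else math.cos(angle))
    return encoding


print(encode_noise_level(0.0, 4))  # [0.0, 1.0, 0.0, 1.0]
```

Encoding the noise level as a vector rather than a scalar lets it be added element-wise to the features from the right branch of the FiLM module.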

The right branch of the residual connection layer 508 includes a convolutional layer 504 and an activation layer 506. Convolutional layer 504 processes the alternative representation 402 to generate an input to activation layer 506. Activation layer 506 processes the output of convolutional layer 504 to generate an input to the residual connection layer 508. For example, convolutional layer 504 can include filters of size 3x1 with a stride of 1 (i.e., to preserve dimensionality), and activation layer 506 can be a fully-connected layer with a leaky ReLU activation function.

The residual connection layer 508 combines the output from the left branch (e.g., the output from the positional encoding function 502) and the output from the right branch (e.g., the output from activation layer 506) to generate an input to both convolutional layer 510 and convolutional layer 514. For example, the residual connection layer 508 can add (e.g., by element-wise addition) the output from the left branch and the output from the right branch to generate the input to the two convolutional layers (e.g., 510 and 514).

Convolutional layer 510 processes the output from the residual connection layer 508 to generate the scale vector 512. For example, convolutional layer 510 can include filters of size 3x1 with a stride of 1 (to preserve dimensionality). Convolutional layer 514 processes the output from the residual connection layer 508 to generate the bias vector 516. For example, convolutional layer 514 can include filters of size 3x1 with a stride of 1 (to preserve dimensionality).

FIG. 6 shows an example noise generation network 600. The noise generation network 600 is an example of a system implemented as computer programs on one or more computers in one or more locations, in which the systems, components, and techniques described below are implemented.

The noise generation block 600 is an example neural network architecture of a noise generation block used in a noise estimation neural network, e.g., the noise estimation neural network 300 of FIG. 3.

The noise generation block 600 processes an input 602 and the output from the FiLM module 500 to generate an output 310. The input 602 can be the network input as processed by one or more preceding neural network layers (e.g., noise generation blocks 338, 336, 334, and 332 and convolutional layer 302 of FIG. 3). The output 310 can be an input to a subsequent convolutional layer (e.g., convolutional layer 304 of FIG. 3) that processes the output 310 to generate the noise output 110. The noise generation block 600 includes one or more neural network layers. The one or more neural network layers can include several types of neural network layers, including activation layers with non-linear activation functions (e.g., a fully-connected layer with a leaky ReLU activation function), upsampling layers (e.g., layers that "upsample" or increase the dimensionality of an input), convolutional layers, affine transformation layers, and residual connection layers.

For example, an upsampling layer can be a neural network layer that "upsamples" (i.e., increases) the dimensionality of its input. That is, an upsampling layer generates an output that has a higher dimensionality than the input to the layer. In a particular example, an upsampling layer can generate an output with X copies of each value in the input, increasing the dimensionality of the output by a factor of X relative to the input (e.g., for an input (2, 7, -4), by generating an output with two copies of each value, (2, 2, 7, 7, -4, -4), or an output with five copies of each value, (2, 2, 2, 2, 2, 7, 7, 7, 7, 7, -4, -4, -4, -4, -4), and so on). In general, an upsampling layer can fill each extra spot in the output with the nearest value in the input.
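The repeat-each-value behaviour described here corresponds to nearest-neighbour upsampling, which can be written in one line. `upsample_repeat` is a hypothetical helper name.

```python
def upsample_repeat(values, factor):
    """Nearest-neighbour upsampling: emit `factor` copies of each input value in order."""
    return [value for value in values for _ in range(factor)]


print(upsample_repeat([2, 7, -4], 2))  # [2, 2, 7, 7, -4, -4]
print(len(upsample_repeat([2, 7, -4], 5)))  # 15
```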

The left branch of the residual connection layer 618 includes an upsampling layer 602 and a convolutional layer 604. The upsampling layer 602 processes the input 602 to generate the input to the convolutional layer 604. The input to the convolutional layer has a higher dimensionality than the input 602. The convolutional layer 604 processes the output from the upsampling layer 602 to generate an input to the residual connection layer 618. For example, the upsampling layer can increase the dimensionality of the input by a factor of 2 by generating an output with two copies of each value in the input 602. The convolutional layer 604 can include filters of dimension 3x1 with stride 1 (e.g., to preserve dimensionality).
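
As a rough single-channel sketch of this left branch, the following Python applies a factor-2 upsampling followed by a 3-tap, stride-1 convolution. Zero padding of one value on each side is an assumption made here so that the convolution preserves length; the specification states only the filter size and stride. The names `conv1d_same` and `left_branch` are hypothetical.

```python
def conv1d_same(signal, kernel):
    """Stride-1 1-D convolution (cross-correlation form) with zero padding,
    so that the output has the same length as the input."""
    if len(kernel) % 2 == 0:
        raise ValueError("kernel length must be odd")
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    return [
        sum(k * padded[i + j] for j, k in enumerate(kernel))
        for i in range(len(signal))
    ]


def left_branch(x, kernel, factor=2):
    """Upsample by repeating each value, then apply the 3-tap convolution."""
    upsampled = [v for v in x for _ in range(factor)]
    return conv1d_same(upsampled, kernel)


# A length-2 input becomes a length-4 output; the convolution keeps that length.
print(left_branch([1.0, 2.0], [1.0, 1.0, 1.0]))  # [2.0, 4.0, 5.0, 4.0]
```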

The right branch of the residual connection layer 618 can include, in order, an activation layer 606 (e.g., a fully-connected layer with a leaky ReLU activation function), an upsampling layer 608, a convolutional layer 610 (e.g., with a 3x1 filter size and stride 1), an affine transformation layer 612, an activation layer 614 (e.g., a fully-connected layer with a leaky ReLU activation function), and a convolutional layer 616 (e.g., with a 3x1 filter size and stride 1).

The activation layer 606 processes the input 602 to generate the input to the upsampling layer 608. The upsampling layer 608 increases the dimensionality of the output from the activation layer 606 to generate the input to the convolutional layer 610, which therefore has a higher dimensionality than the input 602 (e.g., increased by a factor of 2 to match the upsampling layer 602). The convolutional layer 610 processes the output from the upsampling layer 608 to generate the input to the affine transformation layer 612 (e.g., using filters of dimension 3x1 and stride 1 to preserve dimensionality). The activation layer 614 and the convolutional layer 616 then further process the output from the affine transformation layer 612 to generate an input to the residual connection layer 618 (e.g., using a leaky ReLU function for the activation layer 614 and filters of dimension 3x1 and stride 1 for the convolutional layer 616).

For example, an affine transformation layer can process the output from a preceding neural network layer (e.g., the convolutional layer 610 of the noise generation block 600) and the output from a FiLM module to generate an output. For example, the FiLM module can generate a scale vector and a bias vector. The affine transformation layer can scale the output of the preceding neural network layer by the scale vector from the FiLM module (e.g., using a Hadamard product, i.e., element-wise multiplication) and add the bias vector to the result.

The affine transformation layer 612 can process the output from the convolutional layer 610 and the output from the FiLM module 500 to generate the input to the activation layer 614 (e.g., by scaling the output from the convolutional layer 610 by the scale vector from the FiLM module 500 and adding the bias vector from the FiLM module 500 to the result).
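
The scale-and-shift operation of an affine transformation layer can be sketched as follows. This is a minimal illustration on flat lists; the function name `film_affine` is hypothetical, and the scale and bias vectors are taken as given inputs rather than computed by a FiLM module.

```python
def film_affine(features, scale, bias):
    """Element-wise (Hadamard) scaling of `features` by `scale`, plus `bias`."""
    if not (len(features) == len(scale) == len(bias)):
        raise ValueError("features, scale, and bias must have the same length")
    return [s * f + b for f, s, b in zip(features, scale, bias)]


# Example: three feature values with per-element scale and bias.
print(film_affine([1.0, 2.0, 3.0], [2.0, 0.5, -1.0], [0.0, 1.0, 1.0]))
# [2.0, 2.0, -2.0]
```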

The residual connection layer 618 combines the output from the left branch (e.g., the output from the convolutional layer 604) and the output from the right branch (e.g., the output from the convolutional layer 616) to generate an output. For example, the residual connection layer 618 can sum the output from the left branch and the output from the right branch to generate the output.

The left branch of the residual connection layer 632 includes the output from the residual connection layer 618. The left branch can be interpreted as an identity function of the output from the residual connection layer 618.

The right branch of the residual connection layer 632 includes two sequential blocks, each consisting of, in order, an affine transformation layer, an activation layer, and a convolutional layer, that process the output of the residual connection layer 618 to generate an input to the residual connection layer 632. In particular, the first block includes an affine transformation layer 620, an activation layer 622, and a convolutional layer 624. The second block includes an affine transformation layer 626, an activation layer 628, and a convolutional layer 630.

For example, for each block, the respective affine transformation layer can process the output from the FiLM module 500 and the output from the respective preceding neural network layer to generate a respective output (e.g., the affine transformation layer 620 can process the output from the residual connection layer 618, and the affine transformation layer 626 can process the output from the convolutional layer 624). Each affine transformation layer can generate its output by scaling the output of the preceding neural network layer by the scale vector from the FiLM module 500 and adding the bias vector from the FiLM module 500 to the result. Each activation layer (e.g., 622 and 628) can be a respective fully-connected layer with a leaky ReLU activation function. Each convolutional layer can include filters of dimension 3x1 and stride 1 (e.g., to preserve dimensionality).

The residual connection layer 632 can combine the output from the left branch (e.g., the identity of the output from the residual connection layer 618) and the output from the right branch (e.g., the output from the convolutional layer 630) to generate the output 310. For example, the residual connection layer 632 can sum the output from the left branch and the output from the right branch to generate the output 310. The output 310 can be the input to a convolutional layer (e.g., the convolutional layer 304 of FIG. 3) that generates the noise output 110.
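
The data flow through the noise generation block 600 can be summarized with the following single-channel sketch. It is not the patented implementation: the activation layers are reduced to an element-wise leaky ReLU (the slope of 0.2 is an assumption), the convolutions use zero padding, the same scale and bias vectors are reused for all three affine transformation layers, and all function names are hypothetical. Reference numerals in the comments map each line back to the layers described above.

```python
def leaky_relu(xs, slope=0.2):
    return [x if x > 0 else slope * x for x in xs]

def upsample(xs, factor):
    return [x for x in xs for _ in range(factor)]

def conv3(xs, kernel):
    """3-tap, stride-1 convolution with zero padding (length preserving)."""
    padded = [0.0] + list(xs) + [0.0]
    return [sum(k * padded[i + j] for j, k in enumerate(kernel))
            for i in range(len(xs))]

def film(xs, scale, bias):
    return [s * x + b for x, s, b in zip(xs, scale, bias)]

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def noise_generation_block(x, scale, bias, kernels, factor=2):
    """`kernels` holds five 3-tap filters; `scale` and `bias` must have
    length factor * len(x), matching the upsampled signal."""
    if not (len(scale) == len(bias) == factor * len(x)):
        raise ValueError("scale/bias length must equal factor * len(x)")
    # First residual connection (618).
    left = conv3(upsample(x, factor), kernels[0])        # 602 -> 604
    right = leaky_relu(x)                                # 606
    right = conv3(upsample(right, factor), kernels[1])   # 608 -> 610
    right = film(right, scale, bias)                     # 612
    right = conv3(leaky_relu(right), kernels[2])         # 614 -> 616
    h = add(left, right)                                 # 618
    # Second residual connection (632): identity branch plus two blocks.
    r = h
    for kernel in kernels[3:5]:                          # 620-624, 626-630
        r = conv3(leaky_relu(film(r, scale, bias)), kernel)
    return add(h, r)                                     # 632 -> output 310

identity = [0.0, 1.0, 0.0]
out = noise_generation_block([1.0, 2.0, 3.0], [1.0] * 6, [0.0] * 6, [identity] * 5)
print(len(out))  # 6
```

The output is twice as long as the input, which reflects the factor-2 upsampling on both branches of the first residual connection.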

The noise generation block 600 can include multiple channels. Each noise generation block of FIG. 3 (e.g., 600, 338, 336, 334, and 332) can include a respective number of channels. For example, the noise generation blocks 600, 338, 336, 334, and 332 can include 128, 128, 256, 512, and 512 channels, respectively.

FIG. 7 is a flow diagram of an example process 700 for training a noise estimation neural network. For convenience, the process 700 will be described as being performed by a system of one or more computers located in one or more locations.

The system can perform the process 700 at each of multiple training iterations to repeatedly update the values of the parameters of the noise estimation neural network.

The system obtains a batch of training network input - training network output pairs (702). For example, the system can randomly sample training pairs from a data store. For example, each training network output can be an audio waveform, and each training network input can be a ground-truth mel-spectrogram computed from the corresponding audio waveform.

For each training pair in the batch, the system selects iteration-specific data from a set that includes the iteration-specific data for all of the iterations (704). For example, the system can sample a particular iteration from a discrete uniform distribution over the integers from 1 to the final iteration, and then select the iteration-specific data that corresponds to the sampled iteration. The iteration-specific data can include the noise level, the total noise level (e.g., as determined in Equation (2)), or the iteration number itself. Thus, the system can condition the noise estimation neural network either on a discrete index or on a continuous scalar that represents the noise level. Conditioning on a continuous scalar that represents the noise level can be advantageous because, once the noise estimation neural network has been trained, a different number of refinement steps (i.e., iterations) can be used at inference when generating the final network output.
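
As a minimal sketch of step 704, the following Python draws one iteration index per training pair from a discrete uniform distribution and looks up the matching iteration-specific value. The table `noise_levels` is a hypothetical stand-in for the per-iteration data (its values are illustrative and are not computed from Equation (2)), and the function name is likewise an assumption.

```python
import random


def sample_iteration_data(noise_levels, batch_size, rng=random):
    """Return one (iteration, iteration-specific value) pair per training pair.

    `noise_levels[n - 1]` holds the value associated with iteration n.
    """
    picks = []
    for _ in range(batch_size):
        # randint is inclusive at both ends: uniform over {1, ..., N}.
        n = rng.randint(1, len(noise_levels))
        picks.append((n, noise_levels[n - 1]))
    return picks


# Hypothetical table for N = 3 iterations and a batch of 4 training pairs.
rng = random.Random(0)  # seeded only to make the sketch reproducible
for iteration, level in sample_iteration_data([0.9, 0.7, 0.4], 4, rng):
    print(iteration, level)
```

Passing the looked-up value instead of the index `n` to the network corresponds to conditioning on a continuous scalar rather than on a discrete index.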

For each training pair in the batch, the system samples a noisy output that includes a respective noise value for each value in the training network output (706). For example, the system can sample the noisy output from a noise distribution. In a particular example, the noise distribution can be a Gaussian noise distribution (e.g., N(0, I), where I is the n×n identity matrix and n is the number of values in the training network output).

For each training pair in the batch, the system generates a modified training network output from the noisy output and the corresponding training network output (708). The system can combine the noisy output and the corresponding training network output to generate the modified training network output. For example, the system can generate the modified training network output as follows:

$$y' = \sqrt{\bar{\alpha}}\, y_0 + \sqrt{1 - \bar{\alpha}}\, \epsilon \qquad (4)$$

where $y'$ denotes the modified training network output, $y_0$ denotes the corresponding training network output, $\epsilon$ denotes the noisy output, and $\sqrt{\bar{\alpha}}$ denotes the iteration-specific data (e.g., the total noise level).
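
The sampling and combination steps can be sketched together as below. The sketch assumes the combination takes the variance-preserving form y' = level * y0 + sqrt(1 - level^2) * ε, where `level` plays the role of the total noise level; this form, the function name, and the restriction of `level` to [0, 1] are assumptions of the sketch rather than statements of the specification.

```python
import math
import random


def make_modified_output(y0, level, rng=random):
    """Sample Gaussian noise and mix it with the clean training output `y0`.

    Returns the modified output and the sampled noise, since the noise is
    needed later as the regression target.
    """
    if not 0.0 <= level <= 1.0:
        raise ValueError("level must lie in [0, 1]")
    # One standard-normal noise value per value of the training output.
    eps = [rng.gauss(0.0, 1.0) for _ in y0]
    noise_scale = math.sqrt(1.0 - level ** 2)
    y_mod = [level * y + noise_scale * e for y, e in zip(y0, eps)]
    return y_mod, eps


rng = random.Random(0)  # seeded only to make the sketch reproducible
y_mod, eps = make_modified_output([0.1, -0.3, 0.25], 0.8, rng)
print(len(y_mod), len(eps))  # 3 3
```

Under this assumption, a level of 1 leaves the training output unchanged and a level of 0 replaces it entirely with noise.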

For each training pair in the batch, the system generates a training noise output by processing a model input that includes (1) the modified training network output, (2) the training network input, and (3) the iteration-specific data using the noise estimation neural network in accordance with the current values of the network parameters (710). The noise estimation neural network can process the model input to generate the training noise output as described in the process of FIG. 2. For example, the iteration-specific data can include the total noise level $\sqrt{\bar{\alpha}}$.

The system determines an update to the network parameters of the noise estimation network from gradients of an objective function for the batch of training pairs (712). The system can determine, for each training pair, the gradient of the objective function with respect to the neural network parameters of the noise estimation network, and can then update the current values of the neural network parameters using the gradients (e.g., a linear combination of the gradients, such as their average) with any of a variety of appropriate optimization methods, e.g., stochastic gradient descent with momentum or Adam.
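
One possible form of the update in step 712 is sketched below: the per-pair gradients are averaged and applied with stochastic gradient descent with momentum. The parameters are represented as a flat list, and the learning rate and momentum values are illustrative assumptions; an optimizer such as Adam could be substituted.

```python
def average_gradients(per_pair_grads):
    """Average a list of per-training-pair gradient vectors element-wise."""
    count = len(per_pair_grads)
    width = len(per_pair_grads[0])
    return [sum(g[i] for g in per_pair_grads) / count for i in range(width)]


def sgd_momentum_step(params, grads, velocity, lr=0.01, momentum=0.9):
    """One SGD-with-momentum update; returns the new parameters and velocity."""
    new_velocity = [momentum * v + g for v, g in zip(velocity, grads)]
    new_params = [p - lr * v for p, v in zip(params, new_velocity)]
    return new_params, new_velocity


# Two training pairs, two parameters.
grads = average_gradients([[1.0, 2.0], [3.0, 4.0]])
params, velocity = sgd_momentum_step([1.0, 1.0], grads, [0.0, 0.0], lr=0.1)
print(grads)  # [2.0, 3.0]
```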

The objective function can measure, for each training pair, the error between the noisy output and the training noise output generated by the noise estimation network. For example, for a particular training pair, the objective function can include a loss term that measures the L1 distance between the noisy output and the training noise output:

$$L(\theta) = \left\lVert \epsilon - \epsilon_\theta\!\left(y', x, \sqrt{\bar{\alpha}}\right) \right\rVert_1 \qquad (5)$$

where $L(\theta)$ denotes the loss function, $\epsilon$ denotes the noisy output, $\epsilon_\theta(y', x, \sqrt{\bar{\alpha}})$ denotes the training noise output generated by the noise estimation neural network with parameters $\theta$, $y'$ denotes the modified training network output, $x$ denotes the training network input, and $\sqrt{\bar{\alpha}}$ denotes the iteration-specific data (e.g., the total noise level).
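
The L1 loss term for a single training pair can be sketched as follows. The function takes the sampled noise and the network's noise estimate as plain lists; the name `l1_loss` is hypothetical, and the sketch sums the absolute differences without averaging, which is one of several reasonable normalizations.

```python
def l1_loss(noise, predicted_noise):
    """Sum of absolute differences between sampled and estimated noise."""
    if len(noise) != len(predicted_noise):
        raise ValueError("inputs must have the same length")
    return sum(abs(e - p) for e, p in zip(noise, predicted_noise))


print(l1_loss([1.0, -2.0, 0.5], [0.5, -1.0, 0.5]))  # 1.5
```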

The system can repeatedly perform steps 702 through 712 for multiple batches (e.g., multiple batches of training network input - training network output pairs).

This specification uses the term "configured" in connection with systems and computer program components. For a system of one or more computers to be configured to perform particular operations or actions means that the system has installed on it software, firmware, hardware, or a combination of them that in operation cause the system to perform the operations or actions. For one or more computer programs to be configured to perform particular operations or actions means that the one or more programs include instructions that, when executed by data processing apparatus, cause the apparatus to perform the operations or actions.

Embodiments of the subject matter and the functional operations described in this specification can be implemented in digital electronic circuitry, in tangibly embodied computer software or firmware, in computer hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them. Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible non-transitory storage medium for execution by, or to control the operation of, data processing apparatus. The computer storage medium can be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them. Alternatively or in addition, the program instructions can be encoded on an artificially generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode information for transmission to suitable receiver apparatus for execution by a data processing apparatus.

The term "data processing apparatus" refers to data processing hardware and encompasses all kinds of apparatus, devices, and machines for processing data, including, by way of example, a programmable processor, a computer, or multiple processors or computers. The apparatus can also be, or further include, special-purpose logic circuitry, e.g., an FPGA (field-programmable gate array) or an ASIC (application-specific integrated circuit). The apparatus can optionally include, in addition to hardware, code that creates an execution environment for computer programs, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them.

A computer program, which may also be referred to or described as a program, software, a software application, an app, a module, a software module, a script, or code, can be written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages; and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A program may, but need not, correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data, e.g., one or more scripts stored in a markup language document, in a single file dedicated to the program in question, or in multiple coordinated files, e.g., files that store one or more modules, sub-programs, or portions of code. A computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a communication network.

In this specification, the term "engine" is used broadly to refer to a software-based system, subsystem, or process that is programmed to perform one or more specific functions. Generally, an engine will be implemented as one or more software modules or components, installed on one or more computers in one or more locations. In some cases, one or more computers will be dedicated to a particular engine; in other cases, multiple engines can be installed and running on the same computer or computers.

The processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform functions by operating on input data and generating output. The processes and logic flows can also be performed by special-purpose logic circuitry, e.g., an FPGA or an ASIC, or by a combination of special-purpose logic circuitry and one or more programmed computers.

Computers suitable for the execution of a computer program can be based on general-purpose or special-purpose microprocessors or both, or on any other kind of central processing unit. Generally, a central processing unit will receive instructions and data from a read-only memory or a random access memory or both. The essential elements of a computer are a central processing unit for performing or executing instructions and one or more memory devices for storing instructions and data. The central processing unit and the memory can be supplemented by, or incorporated in, special-purpose logic circuitry. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic disks, magneto-optical disks, or optical disks. However, a computer need not have such devices. Moreover, a computer can be embedded in another device, e.g., a mobile telephone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device, e.g., a universal serial bus (USB) flash drive, to name just a few.

Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media, and memory devices, including, by way of example, semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks.

To provide for interaction with a user, embodiments of the subject matter described in this specification can be implemented on a computer having a display device, e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor, for displaying information to the user, and a keyboard and a pointing device, e.g., a mouse or a trackball, by which the user can provide input to the computer. Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input. In addition, a computer can interact with a user by sending documents to and receiving documents from a device that is used by the user; for example, by sending web pages to a web browser on a user's device in response to requests received from the web browser. Also, a computer can interact with a user by sending text messages or other forms of message to a personal device, e.g., a smartphone that is running a messaging application, and receiving responsive messages from the user in return.

Data processing apparatus for implementing machine learning models can also include, for example, special-purpose hardware accelerator units for processing the common and compute-intensive parts of machine learning training or inference workloads.

Machine learning models can be implemented and deployed using a machine learning framework, e.g., the TensorFlow framework, the Microsoft Cognitive Toolkit framework, the Apache Singa framework, or the Apache MXNet framework.

Embodiments of the subject matter described in this specification can be implemented in a computing system that includes a back-end component, e.g., a data server, or that includes a middleware component, e.g., an application server, or that includes a front-end component, e.g., a client computer having a graphical user interface, a web browser, or an app through which a user can interact with an implementation of the subject matter described in this specification, or any combination of one or more such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include a local area network (LAN) and a wide area network (WAN), e.g., the Internet.

The computing system can include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. In some embodiments, a server transmits data, e.g., an HTML page, to a user device, e.g., for purposes of displaying data to and receiving user input from a user interacting with the device, which acts as a client. Data generated at the user device, e.g., a result of the user interaction, can be received at the server from the device.

While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any invention or on the scope of what may be claimed, but rather as descriptions of features that may be specific to particular embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially be claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or a variation of a subcombination.

Similarly, while operations are depicted in the drawings and recited in the claims in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.

Particular embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. For example, the actions recited in the claims can be performed in a different order and still achieve desirable results. As one example, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous.

Claims

A method of generating a final network output comprising a plurality of outputs conditioned on a network input, the method comprising:
obtaining the network input;
initializing a current network output; and
generating the final network output by updating the current network output at each of a plurality of iterations, wherein each iteration corresponds to a respective noise level, and wherein the updating comprises, at each iteration:
processing a model input for the iteration comprising (i) the current network output and (ii) the network input using a noise estimation neural network that is configured to process the model input to generate a noise output, wherein the noise output comprises a respective noise estimate for each value in the current network output; and
updating the current network output using the noise estimate and the noise level for the iteration.
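
As an illustration only, the loop recited in the claim above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the applicant's implementation: the names `generate`, `estimate_noise` and `update` are hypothetical, `estimate_noise` stands in for the noise estimation neural network, and initializing from standard normal samples is an assumption.

```python
import random


def generate(network_input, noise_levels, estimate_noise, update, length, seed=0):
    """Iteratively refine a noise-initialized output, conditioned on network_input.

    estimate_noise(current, network_input) -> one noise estimate per output value
    update(current, noise_estimate, noise_level) -> updated output
    Both callables are hypothetical stand-ins supplied by the caller.
    """
    rng = random.Random(seed)
    # Initialize the current network output (assumed here: standard normal samples).
    current = [rng.gauss(0.0, 1.0) for _ in range(length)]
    for level in noise_levels:  # one iteration per noise level
        noise_estimate = estimate_noise(current, network_input)
        if len(noise_estimate) != len(current):
            raise ValueError("need one noise estimate per output value")
        current = update(current, noise_estimate, level)
    return current  # final network output


# Toy stand-ins, for illustration only: the "network" treats the gap between
# the current output and the conditioning input as the noise estimate.
def toy_estimate_noise(current, network_input):
    return [c - x for c, x in zip(current, network_input)]


def toy_update(current, noise_estimate, level):
    return [c - level * e for c, e in zip(current, noise_estimate)]
```

With the toy stand-ins, calling `generate(target, [0.5] * 30, toy_estimate_noise, toy_update, length=len(target))` moves the noise-initialized output toward `target`; a trained network would replace `toy_estimate_noise`.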

The method of claim 1, wherein the network input is a spectrogram of an audio segment and the final network output is a waveform for the audio segment.

The method of claim 2, wherein the audio segment is a speech segment.

The method of claim 3, wherein the spectrogram is generated by a text-to-speech model from a text segment or from linguistic features of the text segment.

The method of any one of claims 2 to 4, wherein the spectrogram is a mel spectrogram or a log mel spectrogram.

The method of claim 1, wherein:
the network input is an object class specifying a class of object for an image to be generated, and the network output is a generated image of an object of that class; or
the network input is a sequence of text, and the network output is an image reflecting the text; or
the network input is an image, and the network output is a numeric embedding that characterizes the input image; or
the network input is an image, and the network output identifies locations in the input image at which objects of particular types are depicted; or
the network input is an image, and the network output is a segmentation output that assigns each of a plurality of pixels of the input image to a category from a set of categories.

The method of any preceding claim, wherein updating the current network output using the noise estimate and the noise level for the iteration comprises:
generating an update for the iteration from at least the noise estimate and the noise level corresponding to the iteration; and
subtracting the update from the current network output to generate an initial updated network output.

The method of claim 7, wherein updating the current network output further comprises:
modifying the initial updated network output based on the noise level for the iteration to generate a modified initial updated network output.

The method of claim 8, wherein, for the last iteration, the modified initial updated network output is the updated network output after the last iteration, and, for each iteration before the last iteration, the updated network output after the iteration is generated by adding noise to the modified initial updated network output.
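
The three update claims above (subtract an update, modify the result based on the noise level, add noise except at the last iteration) can be illustrated with one possible instantiation. The sketch below assumes a DDPM-style parameterization in which each iteration has a per-iteration level `alpha`, an aggregate level `alpha_bar` and a noise scale `sigma`; these symbols, the specific scaling factors and the name `update_step` are assumptions for illustration, and the claims do not fix them.

```python
import math
import random


def update_step(current, noise_estimate, alpha, alpha_bar, sigma, is_last, rng):
    """One hypothetical update of the current output (DDPM-style, assumed)."""
    # Update built from the noise estimate and the noise level,
    # subtracted from the current output to give the initial updated output.
    scale = (1.0 - alpha) / math.sqrt(1.0 - alpha_bar)
    initial = [c - scale * e for c, e in zip(current, noise_estimate)]
    # Modify the initial updated output based on the noise level.
    modified = [v / math.sqrt(alpha) for v in initial]
    if is_last:
        # Last iteration: no noise is added; this is the final output.
        return modified
    # Earlier iterations: add freshly sampled Gaussian noise.
    return [v + sigma * rng.gauss(0.0, 1.0) for v in modified]
```

Skipping the added noise at the last iteration means the final output is the deterministic refinement of the preceding state rather than a freshly perturbed sample.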

The method of any preceding claim, wherein initializing the current network output comprises sampling each of a plurality of initial values for the current network output from a corresponding noise distribution.

The method of any preceding claim, wherein the model input for each iteration includes iteration specific data that is different for each iteration.

The method of claim 11, wherein the model input for each iteration includes the noise level corresponding to the iteration.

The method of claim 11, wherein the model input for each iteration includes an aggregate noise level for the iteration that is generated from the noise levels corresponding to the iteration and to any iterations after the iteration in the plurality of iterations.
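
One way to read the aggregate noise level in the claim above is as a cumulative product over per-iteration noise levels. The sketch below assumes a schedule `betas` indexed n = 1..N, assumes that generation runs from n = N down to n = 1 (so indices below n are the iterations performed after iteration n), and assumes the square-root-of-product form; all three are assumptions for illustration, and `aggregate_noise_levels` is a hypothetical name.

```python
import math


def aggregate_noise_levels(betas):
    """Return sqrt(prod_{s <= n} (1 - betas[s])) for each index n (assumed form).

    levels[n - 1] combines the noise level of iteration n with the noise
    levels of every iteration that runs after it (indices s < n).
    """
    levels = []
    running_product = 1.0
    for beta in betas:
        running_product *= 1.0 - beta
        levels.append(math.sqrt(running_product))
    return levels
```

Under these assumptions the aggregate level decreases monotonically with n, so it identifies the iteration as a single continuous value that can be supplied to the network as iteration specific data.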

The method of any one of claims 11 to 13, wherein the noise estimation neural network comprises:
a noise generating neural network comprising a plurality of noise generating neural network layers and configured to process the network input to map the network input to the noise output; and
a network output processing neural network comprising a plurality of network output processing neural network layers and configured to process the current network output to generate an alternative representation of the current network output,
wherein at least one of the noise generating neural network layers is configured to receive an input derived from (i) an output of another one of the noise generating neural network layers, (ii) an output of a corresponding network output processing neural network layer, and (iii) the iteration specific data for the iteration.

The method of claim 14, wherein the final network output has a higher dimensionality than the network input, and wherein the alternative representation has the same dimensionality as the network input.

The method of claim 14 or 15, wherein the noise estimation neural network comprises a respective feature-wise linear modulation (FiLM) module corresponding to each of the at least one noise generating neural network layers, and wherein the FiLM module corresponding to a given noise generating neural network layer is configured to generate the input to the noise generating neural network layer by processing (i) the output of the other one of the noise generating neural network layers, (ii) the output of the corresponding network output processing neural network layer, and (iii) the iteration specific data for the iteration.

The method of claim 16, wherein the FiLM module corresponding to the given noise generating neural network layer is configured to:
generate a scale vector and a bias vector from (ii) the output of the corresponding network output processing neural network layer and (iii) the iteration specific data for the iteration; and
generate the input to the noise generating neural network layer by applying an affine transformation, based on the scale vector and the bias vector, to (i) the output of the other one of the noise generating neural network layers.
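
The affine transformation recited for the FiLM module can be illustrated with a small element-wise sketch. How the scale and bias vectors are computed is not specified in the claims; here they are produced by hypothetical per-element weights `w_scale` and `w_bias` applied to the conditioning features after the noise level is added to them. The function name `film` and this way of combining inputs are assumptions.

```python
def film(layer_output, conditioning, noise_level, w_scale, w_bias):
    """Feature-wise linear modulation: scale * layer_output + bias.

    layer_output: output of another noise generating layer, input (i)
    conditioning: output of the network output processing layer, input (ii)
    noise_level:  iteration specific data, input (iii)
    w_scale, w_bias: hypothetical weights; in practice these would be learned.
    """
    # Combine (ii) and (iii); adding the noise level is an assumed choice.
    features = [c + noise_level for c in conditioning]
    scale = [w * f for w, f in zip(w_scale, features)]
    bias = [w * f for w, f in zip(w_bias, features)]
    # The affine transformation is applied feature-wise to (i).
    return [s * h + b for s, h, b in zip(scale, layer_output, bias)]
```

Because the scale and bias depend only on inputs (ii) and (iii), the same layer output is modulated differently at each iteration and for each intermediate state of the output being refined.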

The method of any one of claims 14 to 17, wherein the at least one of the noise generating neural network layers comprises an activation function layer that applies a non-linear activation function to an input to the activation function layer.

The method of claim 18, wherein the other one of the noise generating neural network layers corresponding to the activation function layer is a residual connection layer or a convolutional layer.

A method of training the noise estimation neural network of any one of claims 11 to 19, the method comprising repeatedly performing steps comprising:
obtaining a training network input and a corresponding training network output;
selecting iteration specific data from a set comprising the iteration specific data for all of the plurality of iterations;
sampling a noisy output comprising a respective noise value for each value in the training network output;
generating a modified training network output from the noisy output and the corresponding training network output;
processing a model input comprising (i) the modified training network output, (ii) the training network input, and (iii) the iteration specific data using the noise estimation neural network to generate a training noise output; and
determining an update to network parameters of the noise estimation neural network from a gradient of an objective function that measures an error between the sampled noisy output and the training noise output.

The method of claim 20, wherein the objective function measures a distance between the sampled noisy output and the training noise output.

The method of claim 21, wherein the distance is an L1 distance.
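
The training steps in the three claims above can be sketched as a single loss computation. The sketch assumes that the iteration specific data is an aggregate level `level` in (0, 1) and that the modified training output is `level * y + sqrt(1 - level**2) * noise`; both are assumptions in the style of diffusion models, not formulas taken from the claims. `training_loss` and `noise_model` are hypothetical names, and the gradient step itself is left to an automatic differentiation framework.

```python
import math
import random


def training_loss(noise_model, train_input, train_output, levels, rng):
    """Mean L1 distance between sampled noise and the model's noise estimate."""
    # Select iteration specific data from the set for all iterations.
    level = rng.choice(levels)
    # Sample one noise value per value of the training output.
    noise = [rng.gauss(0.0, 1.0) for _ in train_output]
    # Modified training output: assumed mix of clean output and noise.
    mix = math.sqrt(1.0 - level ** 2)
    noisy = [level * y + mix * e for y, e in zip(train_output, noise)]
    # The model sees the modified output, the training input and the level.
    predicted = noise_model(noisy, train_input, level)
    # L1 objective; parameters would be updated from its gradient.
    return sum(abs(e - p) for e, p in zip(noise, predicted)) / len(noise)
```

A model that recovers the sampled noise exactly drives this loss to zero, which is the sense in which the objective measures the error between the sampled noisy output and the training noise output.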

A system comprising one or more computers and one or more storage devices storing instructions that, when executed by the one or more computers, cause the one or more computers to perform the method of any one of the preceding claims.

A computer-readable storage medium encoded with instructions that, when executed by one or more computers, cause the one or more computers to perform the method of any one of the preceding claims.