WO2024089995A1 - Musical sound synthesis method, musical sound synthesis system, and program


Info

Publication number: WO2024089995A1
Authority: WIPO (PCT)
Prior art keywords: data, unit, control vector, control, time series
Application number: PCT/JP2023/030522
Other languages: French (fr), Japanese (ja)
Inventor: 慶二郎 才野, ジョセフ ティ カーネル
Original Assignee: ヤマハ株式会社
Priority claimed from Japanese patent application JP2022170758A (published as JP2024062724A)
Application filed by ヤマハ株式会社
Publication of WO2024089995A1

Description

  • This disclosure relates to technology for synthesizing sound.
  • Patent Document 1 discloses a configuration for generating a time series of acoustic features of a voice waveform by processing a time series of multidimensional score features related to voice using a convolutional neural network.
  • one aspect of the present disclosure aims to generate musical tones with a variety of partial timbres in response to instructions from the user.
  • a musical sound synthesis method is a musical sound synthesis method realized by a computer system, which acquires a time series of control data representing the conditions of a target musical sound, and generates a time series of acoustic data representing the acoustic features of the target musical sound by processing the time series of control data using a trained generative model that includes multiple basic layers and one or more intermediate layers and has learned the relationship between the conditions of the musical sounds and the acoustic features of the musical sounds.
  • the method generates a control vector representing the characteristics of the temporal change in timbre in response to an instruction from a user, and generates a first parameter set from the control vector.
  • a first intermediate layer of the one or more intermediate layers applies the first parameter set to the data input to the first intermediate layer, and outputs the data after application to the next layer.
  • a musical sound synthesis system includes a control data acquisition unit that acquires a time series of control data representing the conditions of a target musical sound, a control vector generation unit that generates a control vector representing the characteristics of temporal changes in tone in response to an instruction from a user, a control vector processing unit that generates a first parameter set from the control vector, and a musical sound synthesis unit that includes a plurality of base layers and one or more intermediate layers, and processes the time series of the control data using a trained generative model that has learned the relationship between the conditions of a musical sound and the acoustic features of the musical sound, thereby generating a time series of acoustic data representing the acoustic features of the target musical sound, and a first intermediate layer of the one or more intermediate layers applies the first parameter set to data input to the first intermediate layer, and outputs the data after application to the next layer.
  • a program causes a computer system to function as a control data acquisition unit that acquires a time series of control data representing the conditions of a target musical tone, a control vector generation unit that generates a control vector representing the characteristics of temporal changes in tone in response to an instruction from a user, a control vector processing unit that generates a first parameter set from the control vector, and a musical tone synthesis unit that includes a plurality of base layers and one or more intermediate layers and processes the time series of control data using a trained generative model that has learned the relationship between the conditions of a musical tone and the acoustic features of the musical tone, thereby generating a time series of acoustic data representing the acoustic features of the target musical tone, and a first intermediate layer of the one or more intermediate layers applies the first parameter set to data input to the first intermediate layer and outputs the data after application to the next layer.
  • FIG. 1 is a block diagram illustrating the configuration of a musical sound synthesis system according to the first embodiment.
  • FIG. 2 is a block diagram illustrating the functional configuration of the musical sound synthesis system.
  • FIG. 3 is a block diagram illustrating a specific configuration of the first generative model.
  • FIG. 4 is an explanatory diagram of the conversion process.
  • FIG. 5 is a schematic diagram of a setting screen.
  • FIG. 6 is a block diagram illustrating a specific configuration of the second generative model.
  • FIG. 7 is a flowchart of the musical sound synthesis process.
  • FIG. 8 is an explanatory diagram of machine learning.
  • FIG. 9 is a flowchart of the training process.
  • FIG. 10 is a block diagram of the control vector generation unit in the second embodiment.
  • FIG. 11 is a schematic diagram of a setting screen in the second embodiment.
  • FIG. 12 is a flowchart of the musical sound synthesis process in the second embodiment.
  • FIG. 13 is an explanatory diagram of the conversion process executed by each intermediate layer L.
  • FIG. 14 is a block diagram of the first generative model in the fourth embodiment.
  • FIG. 15 is a block diagram of a unit processing unit in the fourth embodiment.
  • FIG. 16 is an explanatory diagram of a processing period in a modified example.
  • FIG. 1 is a block diagram illustrating the configuration of a musical sound synthesis system 100 according to a first embodiment.
  • the musical sound synthesis system 100 is a computer system that synthesizes a desired musical sound (hereinafter referred to as a "target musical sound").
  • the target musical sound is a musical sound to be synthesized by the musical sound synthesis system 100.
  • a singing sound produced by singing a specific piece of music (hereinafter referred to as the "target piece of music") is exemplified as the target musical sound.
  • the musical sound synthesis system 100 comprises a control device 11, a storage device 12, a display device 13, an operation device 14, and a sound emission device 15.
  • the musical sound synthesis system 100 is realized by an information device such as a smartphone, a tablet terminal, or a personal computer. Note that the musical sound synthesis system 100 can be realized as a single device, or as multiple devices configured separately from each other.
  • the control device 11 is a single or multiple processors that control each element of the musical sound synthesis system 100.
  • the control device 11 is composed of one or more types of processors, such as a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), an SPU (Sound Processing Unit), a DSP (Digital Signal Processor), an FPGA (Field Programmable Gate Array), or an ASIC (Application Specific Integrated Circuit).
  • the storage device 12 is a single or multiple memories that store the programs executed by the control device 11 and various data used by the control device 11.
  • a well-known recording medium such as a semiconductor recording medium or a magnetic recording medium, or a combination of multiple types of recording media, is used as the storage device 12.
  • a portable recording medium that is detachable from the musical sound synthesis system 100, or a recording medium that the control device 11 can access via a communication network (e.g., cloud storage) may also be used as the storage device 12.
  • the storage device 12 of the first embodiment stores music data M and a reference signal Sr.
  • the music data M represents the musical score of the target music piece. More specifically, the music data M specifies the pitch, pronunciation period, and pronunciation character for each of the multiple notes of the target music piece.
  • the pitch is one of multiple discretely set scale notes.
  • the pronunciation period is specified, for example, by the start point and duration of the note.
  • the pronunciation character is a symbol representing the lyrics of the music piece.
  • a music file that complies with the MIDI (Musical Instrument Digital Interface) standard is used as the music data M.
  • the music data M is provided to the musical sound synthesis system 100, for example, from a distribution device via a communication network.
  • the reference signal Sr is an audio signal that represents the waveform of a specific musical tone (hereinafter referred to as "reference musical tone").
  • the reference musical tone is, for example, a singing tone that should be produced by singing a reference musical piece.
  • the reference signal Sr is provided to the musical tone synthesis system 100 from a distribution device via a communication network.
  • the reference signal Sr may be provided from a playback device that drives a recording medium such as an optical disk, or may be generated by collecting the reference musical tone using a sound collection device.
  • the reference signal Sr may also be an audio signal synthesized using a known synthesis technique such as singing synthesis or musical tone synthesis.
  • the reference musical piece corresponding to the reference signal Sr and the target musical piece may be the same piece of music or different pieces of music.
  • the singer of the target musical tone and the singer of the reference musical tone may be the same or different.
  • the target musical tone in the first embodiment is a singing tone of the target musical piece, to which is imparted a feature of temporal change in acoustic characteristics (hereinafter referred to as a "partial timbre") within a specific period (hereinafter referred to as a "specific section") of the reference musical tone.
  • a musical tone to which a partial timbre desired by the user is imparted is generated as the target musical tone.
  • a partial timbre is, for example, a desired feature that exists within a specific section, such as repeated fluctuation (vibrato) of an acoustic characteristic such as volume or pitch, or a gradual change of an acoustic characteristic over time.
  • the reference musical tone is a musical tone that serves as the material for the partial timbre to be imparted to the target musical tone.
  • the control device 11 uses the musical piece data M and the reference signal Sr to generate an audio signal W that represents the target musical tone.
  • the audio signal W is a time-domain signal that represents the waveform of the target musical tone.
  • the display device 13 displays images under the control of the control device 11.
  • the display device 13 is, for example, a display panel such as a liquid crystal display panel or an organic EL (Electroluminescence) panel.
  • the operation device 14 is an input device that accepts instructions from a user.
  • the operation device 14 is, for example, an operator operated by the user, or a touch panel that detects contact by the user.
  • a display device 13 or an operation device 14 that is separate from the musical sound synthesis system 100 may be connected to the musical sound synthesis system 100 by wire or wirelessly.
  • the sound emitting device 15 reproduces sound under the control of the control device 11. Specifically, the sound emitting device 15 reproduces the target musical sound represented by the audio signal W.
  • a speaker or a headphone is used as the sound emitting device 15.
  • a D/A converter that converts the audio signal W from digital to analog, and an amplifier that amplifies the audio signal W are omitted from the illustration.
  • the sound emitting device 15, which is separate from the musical sound synthesis system 100, may be connected to the musical sound synthesis system 100 by wire or wirelessly.
  • FIG. 2 is a block diagram illustrating the functional configuration of the musical sound synthesis system 100.
  • the control device 11 executes a program stored in the storage device 12 to realize multiple functions (control data acquisition unit 21, musical sound synthesis unit 22, waveform synthesis unit 23, control vector generation unit 24, control vector processing unit 25, and training processing unit 26) for generating an audio signal W of a target musical sound.
  • the time length a of a time series and the data size (number of dimensions) b of each piece of data in that time series are represented by the notation [a, b].
  • the time length a is expressed as the number of periods of a specified length on the time axis (hereinafter referred to as "unit periods").
  • for example, [800, 134] in FIG. 2 denotes a time series in which 134-dimensional data are arranged over 800 unit periods.
  • a unit period is, for example, a period (frame) with a time length of about 5 milliseconds, so 800 unit periods are equivalent to 4 seconds. Note that the above values are only examples and may be changed as desired.
  • each unit period is identified by its position on the time axis.
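  • As a concrete reading of the [a, b] notation, the sketch below builds an array with the example sizes quoted above. It only illustrates the notation; the sizes are the example values from the text, not values fixed by this disclosure.

```python
import numpy as np

# Example values from the text: a time series of 800 unit periods,
# each holding 134-dimensional control data -- written [800, 134].
frames, dims = 800, 134
control_sequence = np.zeros((frames, dims))

frame_ms = 5                                   # one unit period is about 5 ms
print(control_sequence.shape)                  # (800, 134)
print(frames * frame_ms / 1000, "seconds")     # 4.0 seconds = one processing period B
```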
  • the control data acquisition unit 21 acquires control data Dx that represents the conditions of the target musical tone. Specifically, the control data acquisition unit 21 acquires the control data Dx for each unit period. In the first embodiment, the control data acquisition unit 21 generates the control data Dx for each unit period from the music data M. In other words, the "generation" of the control data Dx is an example of the "acquisition" of the control data Dx.
  • the control data Dx represents the features of the score of the target piece of music (hereinafter referred to as "score features").
  • the score features represented by the control data Dx include, for example, the pitch (fundamental frequency) in the unit period, information indicating voiced/unvoiced in the unit period, and phoneme information in the unit period.
  • Pitch is a numerical value in one unit period of the pitch time series (pitch trajectory) corresponding to each note specified by the music data M. While the pitch of each note in the target music is discrete, the pitch trajectory used in the control data Dx is a continuous change in pitch on the time axis.
  • the control data acquisition unit 21 estimates the pitch trajectory in the control data Dx, for example, by processing the music data M with an estimation model that has learned the relationship between the pitch of each note and the pitch trajectory. However, the method of generating the pitch trajectory is not limited to the above examples.
  • the control data Dx may also include discrete pitches of each note.
  • Phoneme information is information about phonemes that correspond to the pronunciation characters of the target song.
  • the phoneme information includes, for example, information specifying one of a plurality of phonemes (for example, as a one-hot expression), the position of the unit period relative to the phoneme period, the time length from the beginning or end of the phoneme period, and the duration of the phoneme.
  • the time series of control data Dx within processing period B constitutes control data string X.
  • Processing period B is a period of a predetermined length composed of multiple (specifically, 800) consecutive unit periods on the time axis.
  • the control data acquisition unit 21 of the first embodiment generates a time series of control data Dx (i.e., control data string X) representing the conditions of the target musical tone for each processing period B on the time axis.
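  • For illustration, one frame of control data Dx described above might be assembled as in the following sketch. The field layout (continuous pitch, a voiced/unvoiced flag, a one-hot phoneme code plus position and duration features) follows the description, but the phoneme inventory size and the resulting dimensionality are assumptions made only for this example.

```python
import numpy as np

NUM_PHONEMES = 40  # assumed phoneme inventory size (illustrative only)

def make_control_frame(pitch_hz, voiced, phoneme_index, pos_in_phoneme, phoneme_duration):
    """Assemble one unit period of control data Dx from score features (illustrative layout)."""
    one_hot = np.zeros(NUM_PHONEMES)
    one_hot[phoneme_index] = 1.0           # which phoneme is sounding, as a one-hot expression
    return np.concatenate([
        [pitch_hz],                        # continuous pitch trajectory value for this frame
        [1.0 if voiced else 0.0],          # voiced/unvoiced flag
        one_hot,
        [pos_in_phoneme],                  # relative position of the unit period within the phoneme
        [phoneme_duration],                # duration of the phoneme (in unit periods)
    ])

# Control data string X: one frame per unit period of a processing period B.
X = np.stack([make_control_frame(261.6, True, 5, i / 800, 120) for i in range(800)])
print(X.shape)
```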
  • the musical sound synthesis unit 22 generates an acoustic data sequence Z by processing the control data sequence X. Specifically, the musical sound synthesis unit 22 generates an acoustic data sequence Z for each processing period B.
  • the acoustic data sequence Z is time-series data representing the acoustic characteristics of the target musical sound in the processing period B.
  • the acoustic data sequence Z is composed of multiple (specifically, 800) acoustic data Dz corresponding to successive unit periods within the processing period B. In other words, the acoustic data sequence Z is a time series of acoustic data Dz within the processing period B.
  • the musical sound synthesis unit 22 generates an acoustic data sequence Z for one processing period B from the control data sequence X corresponding to that processing period B.
  • the acoustic data Dz represents the acoustic features of the target musical tone.
  • the acoustic features are, for example, the amplitude spectrum envelope of the target musical tone.
  • the acoustic data Dz includes the amplitude spectrum envelope of the harmonic components of the target musical tone and the amplitude spectrum envelope of the non-harmonic components of the target musical tone.
  • the amplitude spectrum envelope is an outline of the amplitude spectrum.
  • the amplitude spectrum envelope of the harmonic components and the non-harmonic components is expressed, for example, by Mel-cepstrum or MFCC (Mel-Frequency Cepstrum Coefficients).
  • the musical tone synthesis unit 22 of the first embodiment generates a time series of acoustic data Dz (i.e., acoustic data sequence Z) representing the acoustic features of the target musical tone for each processing period B.
  • the acoustic data Dz may include the amplitude spectrum envelope and pitch trajectory of the target musical tone.
  • the acoustic data Dz may also include the spectrum (amplitude spectrum or power spectrum) of the target musical tone.
  • the spectrum of the target musical tone may be expressed, for example, as a Mel spectrum.
  • the amplitude spectrum envelope may also be the outline of the power spectrum (power spectrum envelope).
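  • The disclosure expresses the amplitude spectrum envelope as a mel-cepstrum or MFCC; purely to illustrate what an "outline of the amplitude spectrum" means, the stand-in below computes a rough envelope by low-quefrency cepstral liftering on a linear frequency axis.

```python
import numpy as np

def spectral_envelope(frame, n_fft=1024, n_ceps=30):
    """Rough amplitude spectrum envelope via low-quefrency cepstral liftering (stand-in only)."""
    spectrum = np.abs(np.fft.rfft(frame, n_fft)) + 1e-9
    cepstrum = np.fft.irfft(np.log(spectrum), n_fft)
    cepstrum[n_ceps:-n_ceps] = 0.0                    # keep only the low-quefrency coefficients
    return np.exp(np.fft.rfft(cepstrum, n_fft).real)  # smooth outline of the amplitude spectrum

frame = np.random.randn(1024)                         # stand-in for one unit period of audio
print(spectral_envelope(frame).shape)                 # (513,) envelope bins
```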
  • the waveform synthesis unit 23 generates the audio signal W of the target musical tone from the acoustic data sequence Z. Specifically, the waveform synthesis unit 23 generates a waveform signal from the acoustic data Dz of each unit period by calculations including, for example, an inverse discrete Fourier transform, and generates the audio signal W by linking the waveform signals of successive unit periods on the time axis.
  • a deep neural network (a so-called neural vocoder) that has learned the relationship between the acoustic data sequence Z and the waveform signal may be used as the waveform synthesis unit 23.
  • the audio signal W generated by the waveform synthesis unit 23 is supplied to the sound emission device 15, whereby the target musical tone is reproduced from the sound emission device 15.
  • the pitch generated by the control data acquisition unit 21 may be applied to the generation of the audio signal W by the waveform synthesis unit 23.
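  • As a minimal illustration of the first option described above (a per-frame inverse discrete Fourier transform followed by linking on the time axis), the sketch below performs windowed overlap-add synthesis. The hop size and implied sample rate are assumptions, and an actual waveform synthesis unit 23 may instead be a neural vocoder as noted.

```python
import numpy as np

def synthesize_waveform(frame_spectra, hop):
    """Minimal overlap-add synthesis: one inverse DFT per unit period, linked on the time axis."""
    n_fft = (frame_spectra.shape[1] - 1) * 2
    window = np.hanning(n_fft)
    out = np.zeros(hop * (len(frame_spectra) - 1) + n_fft)
    for i, spectrum in enumerate(frame_spectra):
        frame = np.fft.irfft(spectrum, n_fft)          # waveform of one unit period
        out[i * hop:i * hop + n_fft] += window * frame
    return out

spectra = np.random.randn(800, 513) + 1j * np.random.randn(800, 513)  # stand-in acoustic data
audio = synthesize_waveform(spectra, hop=240)          # 5 ms hop at an assumed 48 kHz sample rate
print(audio.shape)
```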
  • the musical sound synthesis unit 22 generates an acoustic data sequence Z by processing a control data sequence X using a first generative model 30.
  • the first generative model 30 is a trained statistical model that has learned the relationship between the conditions on the score of the target musical sound (control data sequence X) and the acoustic features of the target musical sound (acoustic data sequence Z) through machine learning.
  • the first generative model 30 outputs the acoustic data sequence Z in response to the input of the control data sequence X.
  • the first generative model 30 is, for example, configured by a deep neural network.
  • the first generative model 30 is realized by a combination of a program that causes the control device 11 to execute an operation (architecture) for generating an acoustic data sequence Z from a control data sequence X, and a plurality of variables (weights and biases) that are applied to the operation.
  • the program and the plurality of variables that realize the first generative model 30 are stored in the storage device 12.
  • the plurality of variables of the first generative model 30 are set in advance by machine learning.
  • the first generative model 30 of the first embodiment includes a first encoder 31 and a decoder 32.
  • the first encoder 31 is a trained statistical model that has learned the relationship between the control data sequence X and the intermediate data Y through machine learning. That is, the first encoder 31 outputs intermediate data Y in response to the input of the control data sequence X.
  • the musical sound synthesis unit 22 generates intermediate data Y by processing the control data sequence X with the first encoder 31.
  • the intermediate data Y represents the characteristics of the control data sequence X. Specifically, the generated acoustic data sequence Z changes in response to the characteristics of the control data sequence X represented by the intermediate data Y. That is, the first encoder 31 encodes the control data sequence X into the intermediate data Y.
  • the decoder 32 is a trained statistical model that has learned the relationship between the intermediate data Y and the acoustic data string Z through machine learning. That is, the decoder 32 outputs the acoustic data string Z in response to the input of the intermediate data Y.
  • the musical sound synthesis unit 22 generates the acoustic data string Z by processing the intermediate data Y with the decoder 32. That is, the decoder 32 decodes the intermediate data Y into the acoustic data string Z.
  • the acoustic data string Z can be generated by encoding by the first encoder 31 and decoding by the decoder 32.
  • FIG. 3 is a block diagram illustrating a specific configuration (architecture) of the first generative model 30.
  • the first encoder 31 includes a preprocessing unit 311, N1 convolutional layers 312, and N1 coding intermediate layers Le.
  • the N1 convolutional layers 312 and the N1 coding intermediate layers Le are arranged alternately after the preprocessing unit 311. That is, N1 sets each consisting of a convolutional layer 312 and a coding intermediate layer Le are stacked after the preprocessing unit 311.
  • the preprocessing unit 311 is composed of a multi-layer perceptron for processing the control data sequence X.
  • the preprocessing unit 311 is composed of multiple calculation units corresponding to different control data Dx of the control data sequence X. Each calculation unit is composed of a stack of multiple stages of fully connected layers. Each control data Dx is processed sequentially by each fully connected layer. For example, a neural network process is executed for each control data Dx of the control data sequence X with the same configuration and the same variables applied.
  • the array of the control data Dx after processing by each calculation unit (processed control data sequence X) is input to the first convolution layer 312.
  • a control data sequence X that more clearly expresses the characteristics of the target song (song data M) is generated.
  • the preprocessing unit 311 may be omitted.
  • the data processed by the pre-processing unit 311 is input to the first convolutional layer 312 among the N1 convolutional layers 312.
  • the data processed by the previous coding intermediate layer Le is input to each of the second and subsequent convolutional layers 312 among the N1 convolutional layers 312.
  • Each convolutional layer 312 performs arithmetic processing on the data input to the convolutional layer 312.
  • the arithmetic processing by the convolutional layer 312 includes a convolutional operation.
  • the arithmetic processing by the convolutional layer 312 may also include a pooling operation.
  • the convolution operation is a process of convolving a filter with data input to the convolution layer 312.
  • the multiple convolution layers 312 include a convolution layer 312 that performs time compression and a convolution layer 312 that does not perform time compression.
  • in a convolution layer 312 that performs time compression, the movement amount (stride) of the filter in the time direction is set to 2 or more.
  • the processing by the first encoder 31 therefore includes downsampling of the control data string X.
  • note that data compression may instead be achieved by a pooling operation while keeping the stride of the convolution operation at 1.
  • the pooling operation is an operation that selects a representative value within each range set in the data after the convolution operation.
  • the representative value is, for example, a statistical value such as the maximum value, the average value, or the root mean square value.
  • the compression of the control data sequence X is achieved by one or both of a convolution operation and a pooling operation.
  • time compression (downsampling) of the control data sequence X may be performed only for a portion of the series of convolution operations of the N1 convolution layers 312.
  • the compression rate of each convolution layer 312 is arbitrary.
  • Each of the N1 coding intermediate layers Le performs a conversion process on the data input to that coding intermediate layer Le from the preceding convolution layer 312. The specific content of the conversion process by each coding intermediate layer Le will be described later.
  • Data after processing by the final coding intermediate layer Le among the N1 coding intermediate layers Le is input to the decoder 32 as the intermediate data Y. Note that a coding intermediate layer Le need not be installed after every one of the N1 convolution layers 312. In other words, the number N1x of coding intermediate layers Le is a natural number less than or equal to N1.
  • the data after the conversion process by a coding intermediate layer Le is input to the next convolution layer 312; if there is no coding intermediate layer Le after a certain convolution layer 312, the data after the convolution processing by that convolution layer 312 (i.e., data that has not undergone the conversion process) is input to the next convolution layer 312.
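  • A compact PyTorch sketch in the spirit of the first encoder 31 follows: a per-frame multi-layer perceptron (pre-processing unit 311), strided 1-D convolutions for time compression, and coding intermediate layers Le that apply externally supplied parameter sets. Layer counts, channel sizes, strides and activations are illustrative assumptions, not values taken from the disclosure.

```python
import torch
import torch.nn as nn

class IntermediateLayer(nn.Module):
    """Coding intermediate layer Le: applies an externally generated parameter set (p1, p2)."""
    def forward(self, u, p1, p2):
        # u: (batch, channels, time); p1, p2: (batch, channels, 1)
        return u * p1 + p2

class Encoder(nn.Module):
    def __init__(self, in_dim=134, hidden=256, n_blocks=3):
        super().__init__()
        self.pre = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden))               # pre-processing unit 311
        self.convs = nn.ModuleList([nn.Conv1d(hidden, hidden, kernel_size=4,
                                              stride=2, padding=1)        # time compression
                                    for _ in range(n_blocks)])
        self.intermediates = nn.ModuleList([IntermediateLayer() for _ in range(n_blocks)])

    def forward(self, x, param_sets):
        # x: (batch, time, in_dim); param_sets: one (p1, p2) pair per coding intermediate layer Le
        h = self.pre(x).transpose(1, 2)                                   # -> (batch, hidden, time)
        for conv, layer, (p1, p2) in zip(self.convs, self.intermediates, param_sets):
            h = layer(torch.relu(conv(h)), p1, p2)
        return h                                                          # intermediate data Y

encoder = Encoder()
params = [(torch.ones(1, 256, 1), torch.zeros(1, 256, 1)) for _ in range(3)]
y = encoder(torch.randn(1, 800, 134), params)
print(y.shape)    # (1, 256, 100): the time axis is compressed from 800 to 100 unit periods
```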
  • the decoder 32 includes N2 convolutional layers 321, N2 decoding intermediate layers Ld, and a post-processing unit 322. Specifically, the N2 convolutional layers 321 and the N2 decoding intermediate layers Ld are arranged alternately, and the post-processing unit 322 is stacked after the final decoding intermediate layer Ld. In other words, N2 sets consisting of the convolutional layers 321 and the decoding intermediate layers Ld are stacked before the post-processing unit 322.
  • Intermediate data Y is input to the first convolutional layer 321 of the N2 convolutional layers 321.
  • Data processed by the previous decoding intermediate layer Ld is input to each of the second and subsequent convolutional layers 321 of the N2 convolutional layers 321.
  • Each convolutional layer 321 performs arithmetic processing on the data input to that convolutional layer 321.
  • the arithmetic processing by the convolutional layer 321 includes a transpose convolution operation (or a deconvolution operation).
  • the transposed convolution performed by the convolution layer 321 is the inverse of the convolution performed by each convolution layer 312 of the encoder.
  • in a convolution layer 321 that performs time expansion, the movement amount (stride) of the filter in the time direction is set to 2 or more.
  • the time length of the data is maintained by a transposed convolution operation with a stride of 1, and is expanded by a transposed convolution operation with a stride of 2 or more. That is, the decoder 32 performs data expansion on the time axis.
  • the processing by the decoder 32 includes upsampling of the intermediate data Y.
  • the first encoder 31 compresses the control data sequence X and the decoder 32 expands the intermediate data Y. Therefore, intermediate data Y that appropriately reflects the characteristics of the control data sequence X is generated, and an acoustic data sequence Z that appropriately reflects the characteristics of the intermediate data Y is generated.
  • Each of the N2 decoding intermediate layers Ld performs a conversion process on the data input to the decoding intermediate layer Ld from the previous convolution layer 321. The specific content of the conversion process by each decoding intermediate layer Ld will be described later.
  • Data after processing by the final decoding intermediate layer Ld among the N2 decoding intermediate layers Ld is input to the post-processing unit 322 as the acoustic data string Z. Note that the decoding intermediate layer Ld does not need to be installed after all of the N2 convolution layers 321. In other words, the number N2x of the decoding intermediate layers Ld is a natural number less than or equal to N2.
  • the data after the conversion process by the decoding intermediate layer Ld is input to the next convolution layer 321, and if there is no decoding intermediate layer Ld after a certain convolution layer 321, the data after the convolution process by the convolution layer 321 (i.e., data that has not been converted) is input to the next convolution layer 321.
  • the post-processing unit 322 is composed of a multi-layer perceptron for processing the acoustic data sequence Z.
  • the post-processing unit 322 is composed of multiple calculation units corresponding to different acoustic data Dz of the acoustic data sequence Z. Each calculation unit is composed of a stack of multiple fully connected layers, and each acoustic data Dz is processed sequentially by each fully connected layer. For example, a neural network process with the same configuration and the same variables applied is executed for each acoustic data Dz of the acoustic data sequence Z.
  • the array of acoustic data Dz after processing by each calculation unit is input to the waveform synthesis unit 23 as the final acoustic data sequence Z.
  • the post-processing unit 322 may be omitted.
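  • A matching decoder sketch in the spirit of the decoder 32: transposed 1-D convolutions expand the time axis, each followed by a decoding intermediate layer Ld, and a per-frame multi-layer perceptron (post-processing unit 322) emits the acoustic data string Z. The output dimensionality and all sizes are again illustrative assumptions.

```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    def __init__(self, hidden=256, out_dim=120, n_blocks=3):
        super().__init__()
        self.deconvs = nn.ModuleList([nn.ConvTranspose1d(hidden, hidden, kernel_size=4,
                                                         stride=2, padding=1)  # time expansion
                                      for _ in range(n_blocks)])
        self.post = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                  nn.Linear(hidden, out_dim))              # post-processing unit 322

    def forward(self, y, param_sets):
        # y: (batch, hidden, compressed time); param_sets: one (p1, p2) per decoding layer Ld
        h = y
        for deconv, (p1, p2) in zip(self.deconvs, param_sets):
            h = torch.relu(deconv(h)) * p1 + p2        # conversion process with parameter set Pn
        return self.post(h.transpose(1, 2))            # -> (batch, time, out_dim): acoustic data Z

decoder = Decoder()
params = [(torch.ones(1, 256, 1), torch.zeros(1, 256, 1)) for _ in range(3)]
z = decoder(torch.randn(1, 256, 100), params)
print(z.shape)    # (1, 800, 120): the time axis is expanded back to 800 unit periods
```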
  • as described above, the first encoder 31 includes N1x coding intermediate layers Le, and the decoder 32 includes N2x decoding intermediate layers Ld.
  • the number Nx of intermediate layers L is a natural number equal to or greater than 1.
  • the number N1x of coding intermediate layers Le and the number N2x of decoding intermediate layers Ld may be equal or different.
  • the preprocessing unit 311, the N1 convolutional layers 312, the N2 convolutional layers 321, and the post-processing unit 322 are basic layers required for generating the acoustic data sequence Z.
  • the Nx intermediate layers L (the N1x coding intermediate layers Le and the N2x decoding intermediate layers Ld) are layers for controlling the partial timbre of the target musical tone.
  • the first generative model 30 thus includes N basic convolutional layers and Nx (N ≥ Nx ≥ 1) intermediate layers L.
  • Each of the N intermediate layers L performs a conversion process on the data input to that intermediate layer L.
  • a different parameter set Pn is applied to the conversion process by each of the multiple intermediate layers L.
  • Each of the N parameter sets P1 to PN includes, for example, a first parameter p1 and a second parameter p2.
  • FIG. 4 is an explanatory diagram of the conversion process.
  • the unit data string U in FIG. 4 is data input to the intermediate layer L.
  • the unit data string U is composed of a time series of multiple unit data Du corresponding to different unit periods.
  • Each unit data Du is expressed as an H-dimensional (H is a natural number equal to or greater than 2) vector.
  • the first parameter p1 is expressed as a square matrix with H rows and H columns.
  • the second parameter p2 is expressed as an H-dimensional vector. Note that the first parameter p1 may be expressed as a diagonal matrix with H rows and H columns or an H-dimensional vector.
  • the conversion process includes a first operation and a second operation.
  • the first operation and the second operation are executed sequentially for each of the multiple unit data Du that make up the unit data string U.
  • the first operation is a process of multiplying the unit data Du by the first parameter p1.
  • the second operation is a process of adding the second parameter p2 to the result (p1·Du) of the first operation.
  • the conversion process by the intermediate layer L is a process (i.e., affine transformation) that includes the multiplication of the first parameter p1 and the addition of the second parameter p2.
  • the second operation that applies the second parameter p2 may be omitted. In that case, the generation of the second parameter p2 is also omitted. In other words, it is sufficient that the conversion process includes at least the first operation.
  • each of the N intermediate layers L in FIG. 3 performs a conversion process by applying a parameter set Pn to each unit data Du of the unit data string U input to the intermediate layer L, and outputs the unit data string U after the conversion process.
  • a conversion process including multiplication of a first parameter p1 and addition of a second parameter p2 is performed on each unit data Du of the unit data string U input to each intermediate layer L. Therefore, it is possible to generate an acoustic data string Z of a target musical tone to which a partial timbre represented by a control vector V is appropriately assigned.
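  • Spelled out as code, the conversion process applied by an intermediate layer L to every unit datum Du of the unit data string U is an affine transform. The sketch below uses the diagonal (per-dimension) form of the first parameter p1 mentioned above, and the dimensionality H is an arbitrary example.

```python
import numpy as np

def conversion_process(U, p1, p2=None):
    """Apply a parameter set Pn to every unit data Du of the unit data string U.

    U  : (time, H) unit data string input to the intermediate layer L
    p1 : (H, H) first parameter (square or diagonal matrix; a vector is the diagonal case)
    p2 : (H,) second parameter; per the text the second operation may be omitted
    """
    out = U @ p1.T                  # first operation: multiply each Du by p1
    if p2 is not None:
        out = out + p2              # second operation: add p2
    return out                      # converted unit data string passed to the next layer

H = 8
U = np.random.randn(800, H)
p1 = np.diag(np.random.rand(H))     # diagonal form of the first parameter
p2 = np.random.randn(H)
print(conversion_process(U, p1, p2).shape)   # (800, 8)
```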
  • the number of intermediate layers L is explained as N, but the basic operation is similar even when the number of intermediate layers L is Nx, which is less than N.
  • Each intermediate layer L may be either an encoding intermediate layer Le or a decoding intermediate layer Ld.
  • the n1th intermediate layer L performs a conversion process by applying a parameter set Pn1 to each unit data Du of the unit data string U input to the intermediate layer L, and outputs the unit data string U after the application to the next layer.
  • the n2th intermediate layer L performs a conversion process by applying a parameter set Pn2 to each unit data Du of the unit data string U input to the intermediate layer L, and outputs the unit data string U after the application to the next layer.
  • the n1th intermediate layer L is an example of a "first intermediate layer," and the parameter set Pn1 is an example of a "first parameter set."
  • the n2th intermediate layer L is an example of a "second intermediate layer," and the parameter set Pn2 is an example of a "second parameter set."
  • a different parameter set Pn is applied to each of the N intermediate layers L, so that an acoustic data sequence Z of a target tone having a variety of partial timbres can be generated.
  • the control vector generation unit 24 and the control vector processing unit 25 illustrated in FIG. 2 generate N parameter sets P1 to PN by processing the reference signal Sr.
  • the control vector generation unit 24 generates a control vector V by processing the reference signal Sr of a specific section.
  • the control vector V is a K-dimensional vector that represents the partial timbre of the reference tone.
  • the control vector V is a vector that represents the characteristics of the temporal change in acoustic characteristics (i.e., the partial timbre) in the reference signal Sr of a specific section.
  • the control vector generation unit 24 of the first embodiment includes a section setting unit 241, a feature extraction unit 242, and a second encoder 243.
  • the section setting unit 241 sets a specific section in the reference musical tone. Specifically, the section setting unit 241 sets the specific section in response to a first instruction Q1 from the user via the operation device 14.
  • the time length of the specific section is a fixed length equivalent to one processing period B.
  • FIG. 5 is a schematic diagram of the setting screen Ga.
  • the setting screen Ga is a screen for the user to specify a specific section.
  • the section setting unit 241 displays the setting screen Ga on the display device 13.
  • the setting screen Ga includes a waveform image Ga1 and a section image Ga2.
  • the waveform image Ga1 is an image that represents the waveform of the reference signal Sr.
  • the section image Ga2 is an image that represents a specific section.
  • the user can move the section image Ga2 to a desired position along the time axis by operating the operation device 14 (first instruction Q1) while checking the waveform image Ga1 of the reference musical tone. For example, the user moves the section image Ga2 so that it includes a section of the reference musical tone whose acoustic characteristics change under desired conditions.
  • the section setting unit 241 determines the section of the reference signal Sr that corresponds to the section image Ga2 after the user's movement as the specific section.
  • the first instruction Q1 is an instruction to change the position of the specific section on the time axis.
  • the section setting unit 241 changes the position of the specific section on the time axis in response to the first instruction Q1.
  • the feature extraction unit 242 in FIG. 2 processes the reference signal Sr for a specific section to generate one reference data string R.
  • the reference data string R is time-series data representing the acoustic features of the reference musical tone in a specific section.
  • the reference data string R is composed of multiple (e.g., 800) reference data Dr corresponding to different unit periods within the specific section.
  • in other words, the reference data string R is a time series of the reference data Dr within the specific section.
  • the reference data Dr represents the acoustic features of the reference musical tone.
  • the acoustic features are, for example, the amplitude spectrum envelope of the reference musical tone.
  • the reference data Dr includes the amplitude spectrum envelope of the harmonic components of the reference musical tone and the amplitude spectrum envelope of the non-harmonic components of the reference musical tone.
  • the amplitude spectrum envelopes of the harmonic components and non-harmonic components are expressed, for example, by mel-cepstrum or MFCC.
  • the data size of the reference data Dr is equal to the data size of the acoustic data Dz. Therefore, the data size of one reference data string R is equal to the data size of one acoustic data string Z.
  • the reference data Dr may be data in a different format from the acoustic data Dz.
  • the acoustic features represented by the reference data Dr and the acoustic features represented by the acoustic data Dz may be different types of features.
  • the feature extraction unit 242 of the first embodiment generates a time series of reference data Dr (reference data string R) that represents the acoustic features of a reference musical tone.
  • the feature extraction unit 242 generates the reference data string R by performing a calculation including a discrete Fourier transform on a reference signal Sr of a specific section.
  • the second encoder 243 is a trained statistical model that has learned the relationship between the reference data sequence R and the control vector V through machine learning. That is, the second encoder 243 outputs the control vector V in response to the input of the reference data sequence R.
  • the control vector generation unit 24 generates the control vector V by processing the reference data sequence R with the second encoder 243. That is, the second encoder 243 encodes the reference data sequence R into the control vector V.
  • control vector V is a vector that represents the characteristics of the temporal change in acoustic characteristics in the reference signal Sr of a specific section (i.e., partial timbre). Since the partial timbre changes depending on the position of the reference signal Sr, the control vector V depends on the position of the specific section on the time axis. In other words, the control vector V depends on the first instruction Q1 from the user that specifies the specific section. As can be understood from the above explanation, the control vector generation unit 24 of the first embodiment generates the control vector V in response to the first instruction Q1 from the user.
  • the control vector processing unit 25 generates the N parameter sets P1 to PN from the control vector V. Because the control vector V represents a partial timbre, each parameter set Pn depends on the partial timbre. In addition, because the control vector V depends on the first instruction Q1, each parameter set Pn also depends on the first instruction Q1 from the user.
  • FIG. 6 is a block diagram illustrating a specific configuration of the second encoder 243 and the control vector processing unit 25.
  • the second encoder 243 includes a plurality of convolution layers 411 and an output processing unit 412.
  • the output processing unit 412 is stacked after the final convolution layer 411 among the plurality of convolution layers 411.
  • a reference data sequence R is input to the first convolutional layer 411 of the multiple convolutional layers 411.
  • Data processed by the previous convolutional layer 411 is input to each of the second and subsequent convolutional layers 411 of the multiple convolutional layers 411.
  • Each convolutional layer 411 performs arithmetic processing on the data input to that convolutional layer 411.
  • the arithmetic processing by the convolutional layer 411 includes a convolution operation and an optional pooling operation, similar to the arithmetic processing by the convolutional layer 312.
  • the final convolutional layer 411 outputs feature data Dv representing the features of the reference data sequence R.
  • the output processing unit 412 generates a control vector V in response to the feature data Dv.
  • the output processing unit 412 in the first embodiment includes a post-processing unit 413 and a sampling unit 414.
  • the post-processing unit 413 determines K probability distributions F1 to FK according to the feature data Dv.
  • Each of the K probability distributions F1 to FK is, for example, a normal distribution.
  • the post-processing unit 413 is a trained statistical model that has learned the relationship between the feature data Dv and each probability distribution Fk by machine learning.
  • the control vector generation unit 24 determines the K probability distributions F1 to FK by processing the feature data Dv with the post-processing unit 413.
  • the sampling unit 414 generates a control vector V according to the K probability distributions F1 to FK. Specifically, the sampling unit 414 samples an element Ek from each of the K probability distributions F1 to FK.
  • the sampling of the element Ek is, for example, random sampling. That is, each element Ek is a latent variable sampled from the probability distribution Fk.
  • the control vector V is composed of the K elements E1 to EK sampled from different probability distributions Fk. That is, the control vector V includes K elements E1 to EK.
  • the control vector V is a K-dimensional vector that represents the characteristics of the temporal change in acoustic characteristics (i.e., partial timbre) in the reference signal Sr of a specific section.
  • the configuration and processing by which the output processing unit 412 generates the control vector V from the feature data Dv are not limited to the above examples.
  • the output processing unit 412 may generate the control vector V without generating K probability distributions F1 to FK. Therefore, the post-processing unit 413 and the sampling unit 414 may be omitted.
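  • A minimal sketch of the path just described: a trained mapping from the feature data Dv to K normal distributions F1 to FK (post-processing unit 413), followed by sampling one element Ek from each distribution (sampling unit 414) to form the K-dimensional control vector V. The mean/log-variance parameterization and the value of K are assumptions made only for this example.

```python
import torch
import torch.nn as nn

class OutputProcessing(nn.Module):
    """Post-processing unit 413 + sampling unit 414, sketched in a VAE-like style."""
    def __init__(self, feat_dim=256, k=16):
        super().__init__()
        self.to_mean = nn.Linear(feat_dim, k)     # mean of each probability distribution Fk
        self.to_logvar = nn.Linear(feat_dim, k)   # log-variance of each Fk

    def forward(self, dv):
        mean, logvar = self.to_mean(dv), self.to_logvar(dv)
        eps = torch.randn_like(mean)
        return mean + eps * torch.exp(0.5 * logvar)   # element Ek sampled from each Fk

dv = torch.randn(1, 256)              # feature data Dv from the final convolution layer 411
v = OutputProcessing()(dv)            # control vector V with K = 16 elements (assumed K)
print(v.shape)
```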
  • the control vector processing unit 25 includes N conversion models 28-1 to 28-N corresponding to the different intermediate layers L.
  • Each conversion model 28-n generates a parameter set Pn from a control vector V.
  • each conversion model 28-n is a trained statistical model that has learned the relationship between the control vector V and the parameter set Pn through machine learning.
  • Each conversion model 28-n is composed of a multi-layer perceptron for generating the parameter set Pn.
  • N parameter sets P1 to PN corresponding to the partial timbre of the reference musical tone are generated by the control vector processing unit 25.
  • the N parameter sets P1 to PN are generated from a common control vector V.
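  • Each conversion model 28-n can be pictured as a small multi-layer perceptron that maps the shared control vector V to one parameter set Pn, here a per-channel scale p1 and offset p2 for the corresponding intermediate layer L. The construction below is an illustrative assumption, not the disclosed architecture.

```python
import torch
import torch.nn as nn

class ConversionModel(nn.Module):
    """Conversion model 28-n: control vector V -> parameter set Pn = (p1, p2)."""
    def __init__(self, k=16, channels=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(k, 128), nn.ReLU(), nn.Linear(128, 2 * channels))
        self.channels = channels

    def forward(self, v):
        p = self.net(v)
        p1, p2 = p[:, :self.channels], p[:, self.channels:]
        return p1.unsqueeze(-1), p2.unsqueeze(-1)   # shaped for (batch, channels, time) data

v = torch.randn(1, 16)                                       # control vector V (assumed K = 16)
conversion_models = [ConversionModel() for _ in range(6)]    # conversion models 28-1 .. 28-N
param_sets = [m(v) for m in conversion_models]               # N parameter sets P1 .. PN
print(param_sets[0][0].shape)                                # (1, 256, 1)
```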
  • the second encoder 243 and the control vector processing unit 25 constitute the second generative model 40.
  • the second generative model 40 is a trained statistical model that has learned the relationship between the reference data string R and the N parameter sets P1 to PN through machine learning.
  • the second generative model 40 is constituted, for example, by a deep neural network.
  • the second generative model 40 is realized by a combination of a program that causes the control device 11 to execute an operation for generating the control vector V from the reference data string R, and a plurality of variables (weights and biases) that are applied to the operation.
  • the program and the plurality of variables that realize the second generative model 40 are stored in the storage device 12.
  • the plurality of variables of the second generative model 40 are set in advance by machine learning.
  • FIG. 7 is a flowchart of the process (hereinafter referred to as "musical sound synthesis process Sa") in which the control device 11 generates an audio signal W of a target musical sound.
  • the musical sound synthesis process Sa is started in response to an instruction from the user via the operation device 14.
  • the musical sound synthesis process Sa is repeated for each processing period B.
  • the musical sound synthesis process Sa is an example of a "musical sound synthesis method.” Note that, before the musical sound synthesis process Sa starts, the section setting unit 241 sets a specific section in response to a first instruction Q1 from the user. Data representing the specific section is stored in the storage device 12.
  • when the musical tone synthesis process Sa is started, the control device 11 (control vector generation unit 24) generates a control vector V representing a partial timbre in response to the first instruction Q1 from the user (Sa1).
  • the specific steps for generating the control vector V (Sa11 to Sa13) are as follows:
  • the control device 11 acquires data representing a specific section from the storage device 12 (Sa11). Specifically, the section setting unit 241 sets the specific section in response to a first instruction Q1 from the user on the operation device 14.
  • the control device 11 (feature extraction unit 242) processes the reference signal Sr of the specific section to generate one reference data string R (Sa12). Then, the control device 11 processes the reference data string R using the second encoder 243 to generate a control vector V (Sa13).
  • the control device 11 (control vector processing unit 25) generates N parameter sets P1 to PN from the control vector V (Sa2). Specifically, the control device 11 processes the control vector V using each transformation model 28-n to generate the parameter set Pn.
  • the control device 11 processes the music data M to generate a control data sequence X (Sa3).
  • the control device 11 (musical sound synthesis unit 22) processes the control data sequence X using the first generative model 30 to generate the acoustic data sequence Z (Sa4). Specifically, the control device 11 processes the control data sequence X using the first encoder 31 to generate the intermediate data Y, and processes the intermediate data Y using the decoder 32 to generate the acoustic data sequence Z.
  • the parameter set Pn is applied to the conversion process by each intermediate layer L of the first generative model 30.
  • the control device 11 (waveform synthesis unit 23) generates the audio signal W of the target musical tone from the acoustic data sequence Z (Sa5).
  • the control device 11 supplies the audio signal W to the sound emitting device 15 (Sa6).
  • the sound emitting device 15 reproduces the target musical tone represented by the audio signal W.
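  • To make the order of steps Sa1 to Sa5 explicit, the outline below chains the operations together. Every component is a stub standing in for the corresponding unit described above, so all shapes and interfaces are purely illustrative.

```python
import numpy as np

# Stubs standing in for the units described above (illustrative shapes only).
extract_reference_features = lambda sr, section: np.random.randn(800, 120)  # feature extraction unit 242
second_encoder             = lambda r: np.random.randn(16)                  # second encoder 243
conversion_models          = [lambda v: (np.ones(8), np.zeros(8))] * 6      # conversion models 28-n
make_control_sequence      = lambda m: np.random.randn(800, 134)            # control data acquisition unit 21
first_generative_model     = lambda x, ps: np.random.randn(800, 120)        # musical sound synthesis unit 22
synthesize_waveform        = lambda z: np.random.randn(192_000)             # waveform synthesis unit 23

def musical_sound_synthesis(music_data, reference_signal, section):
    """Outline of the musical sound synthesis process Sa (steps Sa1 to Sa5)."""
    r = extract_reference_features(reference_signal, section)   # Sa11-Sa12: reference data string R
    v = second_encoder(r)                                       # Sa13: control vector V
    param_sets = [f(v) for f in conversion_models]              # Sa2 : parameter sets P1..PN
    x = make_control_sequence(music_data)                       # Sa3 : control data string X
    z = first_generative_model(x, param_sets)                   # Sa4 : acoustic data string Z
    return synthesize_waveform(z)                               # Sa5 : audio signal W

print(musical_sound_synthesis(None, None, None).shape)
```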
  • a control vector V representing the partial timbre of a reference musical tone is generated in response to an instruction from the user (first instruction Q1), a parameter set Pn is generated from the control vector V, and the parameter set Pn is applied to each unit data Du of the unit data string U input to each intermediate layer L. Therefore, it is possible to generate an acoustic data string Z of a target musical tone having a variety of partial timbres in response to an instruction from the user.
  • a specific section of the reference musical tone is set in response to a first instruction Q1 from the user, and a control vector V is generated that represents the partial timbre in the specific section. Therefore, it is possible to generate a target musical tone having the partial timbre of the specific section of the reference musical tone desired by the user.
  • the position of the specific section on the time axis is changed in response to the first instruction Q1. Therefore, it is possible to generate a target musical tone having the partial timbre of the position of the reference musical tone desired by the user.
  • the training processing unit 26 in FIG. 2 establishes the first generative model 30 and the second generative model 40 through machine learning using multiple pieces of training data T.
  • the training processing unit 26 in the first embodiment establishes the first generative model 30 and the second generative model 40 collectively. After establishment, each of the first generative model 30 and the second generative model 40 may be trained individually.
  • FIG. 8 is an explanatory diagram regarding machine learning that establishes the first generative model 30 and the second generative model 40.
  • Each of the multiple training data T is composed of a combination of a training control data string Xt, a training reference data string Rt, and a training audio data string Zt.
  • the control data string Xt is time-series data representing the conditions of the target musical tone. Specifically, the control data string Xt represents a time series of musical score features in a specific section (hereinafter referred to as the "training section") of the training piece of music.
  • the format of the control data string Xt is the same as that of the control data string X.
  • the reference data string Rt is time-series data representing the acoustic characteristics of musical tones prepared in advance for a training piece of music.
  • the partial timbre represented by the reference data string Rt is a characteristic of the temporal change in acoustic characteristics of the musical tones of the training piece of music in the training section.
  • the format of the reference data string Rt is the same as that of the reference data string R.
  • the acoustic data sequence Zt is time-series data representing the acoustic characteristics of the musical tones to be generated by the first generative model 30 and the second generative model 40 from the control data sequence Xt and the reference data sequence Rt.
  • the acoustic data sequence Zt corresponds to the ground truth for the control data sequence Xt and the reference data sequence Rt.
  • the format of the acoustic data sequence Zt is the same as that of the acoustic data sequence Z.
  • FIG. 9 is a flowchart of the process (hereinafter referred to as "training process Sb") in which the control device 11 establishes the first generation model 30 and the second generation model 40.
  • the control device 11 executes the training process Sb to realize the training processing unit 26 in FIG. 8.
  • the control device 11 prepares a first provisional model 51 and a second provisional model 52 (Sb1).
  • the first provisional model 51 is an initial or provisional model that is updated to the first generative model 30 by machine learning.
  • the initial first provisional model 51 has a similar configuration to the first generative model 30, but multiple variables are set to, for example, random numbers.
  • the second provisional model 52 is an initial or provisional model that is updated to the second generative model 40 by machine learning.
  • the initial second provisional model 52 has a similar configuration to the second generative model 40, but multiple variables are set to, for example, random numbers.
  • the structure of each of the first provisional model 51 and the second provisional model 52 is arbitrarily designed by a designer.
  • the control device 11 selects one of the multiple training data T (hereinafter referred to as "selected training data T") (Sb2). As illustrated in FIG. 8, the control device 11 generates N parameter sets P1 to PN by processing the reference data string Rt of the selected training data T using the second provisional model 52 (Sb3). Specifically, the second provisional model 52 generates a control vector V, and the control vector processing unit 25 generates N parameter sets P1 to PN. The control device 11 also generates an acoustic data string Z by processing a control data string Xt of the selected training data T using the first provisional model 51 (Sb4). The N parameter sets P1 to PN generated by the second provisional model 52 are applied to the processing of the control data string Xt.
  • the control device 11 calculates a loss function that represents the error between the acoustic data sequence Z generated by the first provisional model 51 and the acoustic data sequence Zt of the selected training data T (Sb5).
  • the control device 11 updates the multiple variables of the first provisional model 51 and the multiple variables of the second provisional model 52 so that the loss function is reduced (ideally minimized) (Sb6).
  • for example, the backpropagation method is used to update each variable according to the loss function.
  • the control device 11 then determines whether a predetermined termination condition is met (Sb7).
  • the termination condition is, for example, that the loss function falls below a predetermined threshold, or that the amount of change in the loss function falls below a predetermined threshold. If the termination condition is not met (Sb7: NO), the control device 11 selects one of the as-yet-unselected training data T as the new selected training data T (Sb2). That is, the process of updating the multiple variables of the first provisional model 51 and the multiple variables of the second provisional model 52 (Sb2 to Sb6) is repeated until the termination condition is met (Sb7: YES). Note that when the above process has been performed for all training data T, each training data T is returned to an unselected state and the same process is repeated. That is, each training data T is used repeatedly.
  • when the termination condition is met, the control device 11 ends the training process Sb.
  • the first provisional model 51 at the time when the termination condition is met is determined to be the trained first generative model 30.
  • the second provisional model 52 at the time when the termination condition is met is determined to be the trained second generative model 40.
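  • A schematic PyTorch training step in the spirit of steps Sb3 to Sb6: tiny stand-ins for the two provisional models are used, the generated acoustic data string Z is compared with the ground-truth Zt, and both models are updated by backpropagation. The mean-squared-error loss and all model shapes are assumptions made only for this sketch.

```python
import torch
import torch.nn as nn

# Tiny stand-ins for the first and second provisional models (illustrative only).
first_model  = nn.Linear(134, 120)     # control data string Xt -> acoustic data string Z
second_model = nn.Linear(120, 2)       # reference data string Rt -> a (scale, offset) parameter set
optimizer = torch.optim.Adam(list(first_model.parameters()) + list(second_model.parameters()), lr=1e-3)

def training_step(xt, rt, zt):
    """One pass of steps Sb3 to Sb6 for a single piece of training data T."""
    p = second_model(rt.mean(dim=1))                          # Sb3: parameters from reference data Rt
    z = first_model(xt) * p[:, :1, None] + p[:, 1:, None]     # Sb4: generate Z under those parameters
    loss = nn.functional.mse_loss(z, zt)                      # Sb5: error against the ground truth Zt
    optimizer.zero_grad()
    loss.backward()                                           # Sb6: update both models by backpropagation
    optimizer.step()
    return loss.item()

xt = torch.randn(1, 800, 134)   # training control data string Xt
rt = torch.randn(1, 800, 120)   # training reference data string Rt
zt = torch.randn(1, 800, 120)   # ground-truth acoustic data string Zt
print(training_step(xt, rt, zt))
```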
  • the first generative model 30 learns the underlying relationship between the control data sequence Xt and the acoustic data sequence Zt under the N parameter sets P1 to PN corresponding to the reference data sequence R. Therefore, the trained first generative model 30 outputs an acoustic data sequence Z that is statistically valid for the unknown control data sequence X under that relationship.
  • the second generative model 40 also learns the underlying relationship between the reference data sequence Rt and the N parameter sets P1 to PN. Specifically, the second generative model 40 learns the relationship between the reference data sequence Rt and the N parameter sets P1 to PN necessary to generate an appropriate acoustic data sequence Z from the control data sequence Xt.
  • the second encoder 243 learns the underlying relationship between the reference data sequence Rt and the control vector V
  • the control vector processing unit 25 learns the underlying relationship between the control vector V and the N parameter sets P1 to PN. Therefore, by using the first generation model 30 and the second generation model 40, an acoustic data sequence Z of a target musical tone is generated to which a desired partial timbre of a reference musical tone is imparted.
B: Second embodiment
  • A second embodiment will be described.
  • elements having the same functions as those in the first embodiment will be denoted by the same reference numerals as those used in the description of the first embodiment, and detailed descriptions thereof will be omitted as appropriate.
  • FIG. 10 is a block diagram of the control vector generation unit 24 in the second embodiment.
  • the control vector generation unit 24 in the second embodiment includes a control vector adjustment unit 244 in addition to the same elements as in the first embodiment (the section setting unit 241, the feature extraction unit 242, and the second encoder 243).
  • the second encoder 243 generates a control vector V in the same manner as in the first embodiment.
  • the initial control vector V generated by the second encoder 243 is denoted as "control vector V0" for convenience.
  • the section setting unit 241 sets a specific section of the reference musical tone in response to the first instruction Q1 from the user. Therefore, the control vector V0 in the second embodiment is generated in response to the first instruction Q1 from the user.
  • the initial control vector V0 does not have to be a vector generated by the second encoder 243.
  • a vector in which each element Ek is set to a predetermined value (e.g., zero), or a vector in which each element Ek is set to a random number may be used as the initial control vector V0.
  • the final control vector V when the previous musical sound synthesis process Sa was executed may be adopted as the current initial control vector V0.
  • in those cases, the elements for generating the control vector V0 (the section setting unit 241, the feature extraction unit 242, and the second encoder 243) may be omitted from the second embodiment.
  • the control vector adjustment unit 244 generates a control vector V by adjusting an initial control vector V0. Specifically, the control vector adjustment unit 244 changes one or more elements Ek of the K elements E1 to EK of the control vector V0 in response to a second instruction Q2 from the user to the operation device 14. A K-dimensional vector consisting of the K elements E1 to EK after the changes is supplied to the control vector processing unit 25 as the control vector V. As can be understood from the above explanation, the control vector generation unit 24 of the second embodiment generates a control vector V in response to a first instruction Q1 and a second instruction Q2 from the user.
  • FIG. 11 is a schematic diagram of the setting screen Gb.
  • the setting screen Gb is a screen for the user to instruct changes to each element Ek.
  • the control vector adjustment unit 244 displays the setting screen Gb on the display device 13.
  • the setting screen Gb includes K operators Gb-1 to Gb-K corresponding to different elements Ek of the control vector V.
  • the K operators Gb-1 to Gb-K are arranged in the horizontal direction.
  • the operators Gb-k corresponding to each element Ek are images that accept operations by the user.
  • each operator Gb-k is, for example, a slider that moves up and down in response to an operation by the user.
  • the second instruction Q2 by the user is, for example, an operation to move each of the K operators Gb-1 to Gb-K.
  • the second instruction Q2 is an instruction from the user to individually specify the numerical value of each element Ek.
  • the numerical value of the element Ek is displayed near each operator Gb-k.
  • the position of each operator Gb-k in the vertical direction corresponds to the numerical value of the element Ek. That is, moving the operator Gb-k upwards means an increase in the element Ek, and moving the operator Gb-k downwards means a decrease in the element Ek.
  • the control vector adjustment unit 244 sets the initial position of each operator Gb-k according to the numerical value of each element Ek of the control vector V0.
  • the control vector adjustment unit 244 then changes the numerical value of the element Ek according to the user's operation to move each operator Gb-k (i.e., the second instruction Q2). That is, the control vector adjustment unit 244 sets the element Ek corresponding to each operator Gb-k according to the user's operation on one or more operators Gb-k among the K operators Gb-1 to Gb-K.
  • the control vector V represents a partial tone. Therefore, the change in each element Ek by the control vector adjustment unit 244 is a process of changing the partial tone in response to the second instruction Q2 from the user. In other words, the temporal change in the acoustic characteristics imparted to the target tone (i.e., the partial tone) changes in response to the second instruction Q2 from the user.
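As a concrete illustration of the adjustment performed by the control vector adjustment unit 244, the following sketch overwrites only the elements Ek the user actually moved while leaving the rest of V0 untouched; the data types and the example values are assumptions for illustration only.

```python
def adjust_control_vector(v0, slider_values):
    """Return the control vector V obtained by changing elements of the initial vector V0.

    v0            -- list of K floats (initial control vector V0)
    slider_values -- mapping {k: value} for the operators Gb-k the user moved (second instruction Q2)
    """
    v = list(v0)                     # elements the user did not touch keep their initial value
    for k, value in slider_values.items():
        v[k] = value                 # set element Ek to the value indicated by operator Gb-k
    return v

# Example: the user moves only operators Gb-1 and Gb-3 (indices 0 and 2 here).
v = adjust_control_vector([0.0] * 8, {0: 0.7, 2: -0.3})
```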
  • the control vector processing unit 25 generates N parameter sets P1 to PN from the control vector V after adjustment by the control vector adjustment unit 244.
  • FIG. 12 is a flowchart of the musical tone synthesis process Sa in the second embodiment.
  • the generation of the control vector V (Sa1) includes the same procedures (Sa11-Sa13) as in the first embodiment, as well as the adjustment of the control vector V0 (Sa14).
  • the control device 11 (control vector adjustment unit 244) generates the control vector V by changing one or more elements Ek of the K elements E1-EK of the initial control vector V0 in response to a second instruction Q2 from the user.
  • the operations other than the adjustment of the control vector V0 (Sa14) are the same as in the first embodiment.
  • the second instruction Q2 is given by the user at any timing in parallel with the musical tone synthesis process Sa.
  • the same effect as in the first embodiment is achieved. Furthermore, in the second embodiment, one or more elements Ek of the K elements E1 to EK of the control vector V0 are changed in response to the second instruction Q2 from the user. Therefore, it is possible to generate a variety of target musical tones having partial timbres in response to the second instruction Q2 from the user. In particular, in the second embodiment, the user can easily adjust the partial timbre by operating each operator Gb-k.
  • the control vector generation unit 24 of the third embodiment generates a control vector V for each unit period on the time axis. That is, the control vector generation unit 24 generates a time series of control vectors V in response to instructions from a user (first instruction Q1, second instruction Q2).
  • the control vector V may be generated in response to one of the first instruction Q1 and the second instruction Q2.
  • the control vector generating unit 24 of the third embodiment generates a control vector V in response to a first instruction Q1 and a second instruction Q2 from the user, as in the second embodiment.
  • the control vector V is generated for each unit period, so the control vector V changes for each unit period within one processing period B. Therefore, the partial timbre assigned to the target musical tone changes at a point midway through the processing period B.
  • the user can specify a specific section by the first instruction Q1 for any time (unit period) of the target music piece.
  • for a unit period for which a specific section is specified, a control vector V is generated in the same manner as in each of the above-mentioned embodiments. For a unit period for which no specific section is specified (hereinafter referred to as a "target period"), a control vector V is generated by interpolating two control vectors V generated for unit periods before and after that target period.
  • specifically, the control vector generation unit 24 generates a control vector V for the target period by interpolating a control vector V corresponding to a specific section specified immediately before the target period (e.g., one or more unit periods in the past) and a control vector V corresponding to a specific section specified immediately after the target period (e.g., one or more unit periods in the future). Any method of interpolating the control vector V may be used.
  • the control vector generation unit 24 also generates a time series of control vectors V by detecting the second instruction Q2 given by the user for each unit period in parallel with the musical sound synthesis process Sa. Note that the control vector generation unit 24 may generate a time series of control vectors V by detecting the second instruction Q2 at a cycle longer than the unit period, and generate a control vector V for each unit period by processing to smooth the time series of control vectors V on the time axis (i.e., a low-pass filter).
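The interpolation and smoothing mentioned above could look like the following sketch, which assumes linear interpolation between the two specified control vectors and a simple moving average as the low-pass step; both choices are assumptions rather than requirements of the embodiment.

```python
import numpy as np

def interpolate_control_vector(v_before, v_after, alpha):
    """Control vector V for a target period lying between two specified sections.
    alpha in [0, 1] is the relative position of the target period between them."""
    v_before, v_after = np.asarray(v_before), np.asarray(v_after)
    return (1.0 - alpha) * v_before + alpha * v_after

def smooth_control_vector_series(vectors, width=5):
    """Moving-average smoothing of a control vector time series along the time axis."""
    vectors = np.asarray(vectors)                  # shape: (number of unit periods, K)
    kernel = np.ones(width) / width
    smoothed = [np.convolve(vectors[:, k], kernel, mode="same") for k in range(vectors.shape[1])]
    return np.stack(smoothed, axis=1)
```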
  • the control vector processing unit 25 of the third embodiment generates N parameter sets P1 to PN from the control vector V for each unit period.
  • the control vector processing unit 25 generates N parameter sets P1 to PN for each unit period on the time axis. In other words, the control vector processing unit 25 generates a time series of each parameter set Pn.
  • the control vector V changes in response to the first instruction Q1 or the second instruction Q2. Therefore, the N parameter sets P1 to PN in the unit period immediately before the first instruction Q1 or the second instruction Q2 are different from the N parameter sets P1 to PN in the unit period immediately after. In other words, the parameter set Pn changes within one processing period B. In a state in which the first instruction Q1 or the second instruction Q2 is not given, the same parameter set Pn is generated over multiple unit periods.
  • the number of unit data Du constituting one unit data string U changes for each stage of processing in the first generation model 30.
  • for the conversion process in one intermediate layer L, as many parameter sets Pn are used as there are unit data Du supplied to that intermediate layer L. Specifically, the conversion model 28-n generates a time series containing the same number of parameter sets Pn as the unit data Du processed by the n-th intermediate layer L.
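Because the number of unit data Du differs from stage to stage, the time series of parameter sets Pn has to be produced at the matching length. One simple way to sketch this is nearest-neighbour resampling of a per-unit-period parameter series; the resampling strategy is an assumption made for illustration.

```python
import numpy as np

def resample_parameter_series(param_series, target_length):
    """Resample a time series of parameter sets Pn to the number of unit data Du
    processed by the n-th intermediate layer L."""
    param_series = np.asarray(param_series)        # shape: (source length, parameter dimension)
    idx = np.round(np.linspace(0, len(param_series) - 1, target_length)).astype(int)
    return param_series[idx]
```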
  • FIG. 13 is an explanatory diagram of the conversion process executed by each intermediate layer L.
  • in the first embodiment, a conversion process is executed in which a common parameter set Pn is applied to each of the multiple unit data Du that make up the unit data string U (FIG. 4). In the third embodiment, by contrast, a conversion process is executed in which an individual parameter set Pn is applied to each of the multiple unit data Du that make up the unit data string U.
  • for example, parameter set Pn(t1) is applied to the conversion process of unit data Du(t1), and parameter set Pn(t2) is applied to the conversion process of unit data Du(t2). Parameter set Pn(t1) and parameter set Pn(t2) are generated separately: parameter set Pn(t1) is generated from control vector V(t1) corresponding to time t1, and parameter set Pn(t2) is generated from control vector V(t2) corresponding to time t2. Therefore, the numerical values of the first parameter p1 and the second parameter p2 may differ between parameter set Pn(t1) and parameter set Pn(t2).
  • the parameter set Pn applied to the conversion process changes at a point in the middle of the unit data string U.
  • a time series of a control vector V is generated in response to instructions from a user (first instruction Q1, second instruction Q2), and a time series of each parameter set Pn is generated from the time series of the control vector V. Therefore, it is possible to generate a variety of target sounds whose timbre changes at points in the middle of the control data string X.
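The conversion process of FIG. 13 can be sketched as a per-unit-period affine transform, assuming each parameter set Pn consists of a first parameter p1 (scale) and a second parameter p2 (offset); tensor shapes are illustrative.

```python
import torch

def conversion_process(unit_data, p1, p2):
    """Apply a time-varying parameter set to a unit data string.

    unit_data -- tensor of shape (T, C): one unit datum Du per unit period t
    p1, p2    -- tensors of shape (T, C): first and second parameters for each unit period
    """
    return unit_data * p1 + p2       # Du(t) is scaled by p1(t) and shifted by p2(t)

# Because p1 and p2 are derived from a time series of control vectors, the parameter set
# applied to Du(t1) can differ from the one applied to Du(t2) within a single string U.
```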
  • FIG. 14 is a block diagram illustrating the configuration of the first generative model 30 (musical tone synthesis unit 22) in the fourth embodiment.
  • the first generative model 30 in the fourth embodiment is an autoregressive (AR) type generative model including a conversion processing unit 61, a convolution layer 62, N unit processing units 63-1 to 63-N, and a synthesis processing unit 64.
  • the first generative model 30 has an arbitrary number (Nx) of intermediate layers, but if all intermediate layers are omitted, it becomes equivalent to the generative model (NPSS) disclosed in the 2017 paper "A Neural Parametric Singing Synthesizer Modeling Timbre and Expression from Natural Songs" by Merlijn Blaauw and Jordi Bonada, published in Applied Sciences.
  • the configuration other than the first generative model 30 is the same as that of the first embodiment.
  • each parameter set Pn generated by the control vector processing unit 25 (conversion model 28-n) is supplied to the corresponding unit processing unit 63-n.
  • similar to the pre-processing unit 311, the conversion processing unit 61 generates latent data d from the control data Dx acquired by the control data acquisition unit 21 for each unit period.
  • the latent data d represents the characteristics of the control data Dx.
  • the conversion processing unit 61 is configured with a multi-layer perceptron for converting the control data Dx into latent data d.
  • the latent data d may be supplied in common to the N unit processing units 63-1 to 63-N, or different data may be supplied individually.
  • the control data Dx acquired by the control data acquisition unit 21 may be supplied to each unit processing unit 63-n as latent data d. In other words, the conversion processing unit 61 may be omitted.
  • FIG. 15 is a block diagram of each unit processing unit 63-n.
  • Each unit processing unit 63-n is a generative model that generates output data O and processed data Cn by processing input data I, latent data d, and parameter set Pn.
  • the input data I includes first data Ia and second data Ib.
  • the unit processing unit 63-n includes a dilated convolution layer 65, an intermediate layer L, and a processing layer 67.
  • the dilated convolution layer 65 generates unit data Du1 by performing dilated convolution on the input data I (first data Ia and second data Ib).
  • the intermediate layer L generates unit data Du2 by performing a conversion process on the unit data Du1.
  • the contents of the conversion process are the same as those in the first embodiment.
  • a parameter set Pn is applied to the conversion process. Note that the intermediate layer L need not be installed in all of the N unit processing units 63-1 to 63-N; it is installed in Nx (one or more) of the N unit processing units 63-n. Here, the explanation assumes that the intermediate layer L is installed in all N unit processing units.
  • the processing layer 67 generates output data O and processing data Cn from the unit data Du2 and the latent data d.
  • the processing layer 67 includes a convolution layer 671, an adder 672, an activation layer 673, an activation layer 674, a multiplier 675, a convolution layer 676, a convolution layer 677, and an adder 678.
  • the convolution layer 671 performs a 1x1 convolution operation on the latent data d.
  • the adder 672 generates unit data Du3 by adding unit data Du2 and the output data of the convolution layer 671.
  • the unit data Du3 is divided into a first part and a second part.
  • the activation layer 673 processes the first part of the unit data Du3 using an activation function (e.g., a tanh function).
  • the activation layer 674 processes the second part of the unit data Du3 using an activation function (e.g., a sigmoid function).
  • the multiplier 675 generates unit data Du4 by calculating an element product between the output data of the activation layer 673 and the output data of the activation layer 674.
  • the unit data Du4 is data obtained by applying a gated activation function (673-675) to the output of the dilated convolution layer 65.
  • each of the unit data Du1 to Du3 includes a first part and a second part, but if a general activation function (an ungated function such as sigmoid, tanh, or ReLU) is used, the unit data Du1 to Du3 only need to include the first part.
  • the convolution layer 676 generates processed data Cn by performing a 1x1 convolution operation on the unit data Du4.
  • the convolution layer 677 performs a 1x1 convolution operation on the unit data Du4.
  • the adder 678 generates output data O by adding the first data Ia and the output data of the convolution layer 677.
  • the output data O is stored in the storage device 12.
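The structure described above can be condensed into a short module. The sketch below assumes PyTorch, models the intermediate layer L as the affine conversion process, and uses placeholder channel counts; it is an illustration of the data flow of FIG. 15, not a definitive implementation.

```python
import torch
import torch.nn as nn

class UnitProcessingUnit(nn.Module):
    """Sketch of unit processing unit 63-n: dilated convolution 65, intermediate layer L,
    gated activation (673-675) and 1x1 convolutions yielding output data O and processed data Cn."""

    def __init__(self, channels=64, dilation=1):
        super().__init__()
        self.dilated = nn.Conv1d(channels, 2 * channels, kernel_size=2, dilation=dilation)
        self.cond = nn.Conv1d(channels, 2 * channels, kernel_size=1)  # convolution layer 671 on latent data d
        self.to_c = nn.Conv1d(channels, channels, kernel_size=1)      # convolution layer 676 -> processed data Cn
        self.to_o = nn.Conv1d(channels, channels, kernel_size=1)      # convolution layer 677 -> residual branch

    def forward(self, x, d, p1, p2):
        # x: input data I, shape (batch, channels, T); d: latent data, broadcastable to the output length
        du1 = self.dilated(x)                          # dilated convolution over input data I
        du2 = du1 * p1 + p2                            # intermediate layer L: apply parameter set Pn
        du3 = du2 + self.cond(d)                       # adder 672: add projected latent data d
        a, b = du3.chunk(2, dim=1)                     # first and second parts of Du3
        du4 = torch.tanh(a) * torch.sigmoid(b)         # gated activation function
        cn = self.to_c(du4)                            # processed data Cn
        o = x[:, :, -du4.size(-1):] + self.to_o(du4)   # adder 678: residual sum -> output data O
        return o, cn
```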
  • the synthesis processing unit 64 in FIG. 14 generates acoustic data Dz by processing N pieces of processed data C1 to CN generated by different unit processing units 63-n. For example, the synthesis processing unit 64 generates acoustic data Dz based on data obtained by weighting the N pieces of processed data C1 to CN. The generation of acoustic data Dz by the synthesis processing unit 64 is repeated for each unit period. In other words, the synthesis processing unit 64 generates a time series of acoustic data Dz. The acoustic data Dz generated by the synthesis processing unit 64 is supplied to the waveform synthesis unit 23 as in the first embodiment, and is also stored in the storage device 12 and used in the convolution layer 62.
  • the convolution layer 62 generates unit data Du0 for each unit period by performing a convolution operation (causal convolution) on the acoustic data Dz generated in the immediately preceding multiple unit periods.
  • the unit data Du0 is supplied to the first-stage unit processing unit 63-1 as input data I.
  • the first data Ia supplied to the unit processing unit 63-1 in each unit period is the unit data Du0 generated in the current unit period.
  • the second data Ib supplied to the unit processing unit 63-1 in each unit period is the unit data Du0 generated in the immediately preceding (one previous) unit period. As described above, the first data Ia and second data Ib corresponding to different unit periods are supplied to the unit processing unit 63-1.
  • each unit processing unit 63-n from the second stage onwards is supplied with multiple output data O generated by the unit processing unit 63-n-1 in the previous stage for different unit periods as first data Ia and second data Ib.
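Putting the pieces of FIG. 14 together, one autoregressive step could be sketched as below, reusing the UnitProcessingUnit sketch above; the plain summation of the processed data C1 to CN stands in for the weighted combination described later and is an assumption.

```python
def generate_unit_period(conv62, units, synth, dz_history, d, param_sets):
    """One autoregressive step of the first generative model 30 (sketch).

    conv62      -- causal convolution over the acoustic data Dz of the preceding unit periods
    units       -- list of N unit processing units 63-1..63-N (e.g. UnitProcessingUnit instances)
    synth       -- synthesis processing unit 64: maps the combined Cn to acoustic data Dz
    dz_history  -- tensor of recently generated acoustic data Dz
    d           -- latent data derived from control data Dx by the conversion processing unit 61
    param_sets  -- list of (p1, p2) pairs, one parameter set Pn per unit processing unit
    """
    x = conv62(dz_history)                     # unit data Du0 supplied as input data I
    skips = []
    for unit, (p1, p2) in zip(units, param_sets):
        x, cn = unit(x, d, p1, p2)             # output data O of stage n feeds stage n+1
        skips.append(cn[..., -1:])             # keep the current unit period of processed data Cn
    dz = synth(sum(skips))                     # combine processed data C1..CN into acoustic data Dz
    return dz                                  # Dz is appended to dz_history for the next step
```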
  • the first generative model 30 of the fourth embodiment includes N intermediate layers L corresponding to different unit processing units 63-n. Furthermore, the convolution layer 62, as well as the dilated convolution layer 65 and the processing layer 67 of each unit processing unit 63-n, are basic layers required for generating a time series of acoustic data Dz. That is, the first generative model 30 of the fourth embodiment includes multiple basic layers and one or more intermediate layers L, similar to the first embodiment. Therefore, the fourth embodiment also achieves the same effects as the first embodiment.
  • in the first to fourth embodiments described above, the target musical tone is a singing tone.
  • the musical tone synthesis system 100 of the fifth embodiment synthesizes, as the target musical tone, an instrument tone to be generated by playing the target piece of music.
  • the control data Dx in the first to fourth embodiments includes the pitch (fundamental frequency) of the target musical tone, information indicating voiced/unvoiced, and phoneme information.
  • the control data Dx in the fifth embodiment is a musical score feature for an instrument sound, which includes the intensity (volume) and performance style of the target musical tone instead of the voiced/unvoiced information and phoneme information.
  • the performance style is, for example, information indicating the method of playing an instrument.
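For illustration only, the control data Dx of the fifth embodiment for one unit period might be assembled as follows; the concrete fields, the one-hot encoding of the performance style, and the style vocabulary are assumptions, not details given in the source.

```python
import numpy as np

PLAYING_STYLES = ["normal", "staccato", "legato", "pizzicato"]   # illustrative vocabulary

def make_instrument_control_data(pitch_hz, intensity, style):
    """Score-feature control data Dx for one unit period of an instrument sound (sketch)."""
    style_onehot = np.zeros(len(PLAYING_STYLES))
    style_onehot[PLAYING_STYLES.index(style)] = 1.0
    return np.concatenate(([pitch_hz, intensity], style_onehot))

dx = make_instrument_control_data(pitch_hz=440.0, intensity=0.8, style="legato")
```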
  • in the fifth embodiment, the target musical tone is an instrument sound, and an instrument sound is also used as the reference musical tone.
  • the partial timbre is a characteristic of the temporal change in the acoustic characteristics of the instrument sound.
  • the first and second generative models 30 and 40 of the fifth embodiment are established by training using training data T for musical instrument sounds (control data sequence Xt, reference data sequence Rt, and acoustic data sequence Zt) in the machine learning of FIG. 8.
  • the first generative model 30 is a trained statistical model that has learned the relationship between the conditions on the musical score of the target musical instrument sound (control data sequence X) and the acoustic features of the target musical instrument sound (acoustic data sequence Z).
  • the musical sound synthesis unit 22 then processes the control data sequence X for the musical instrument sound using the first generative model 30 to generate the acoustic data sequence Z for the musical instrument sound.
  • the first encoder 31 includes a pre-processing unit 311, but the pre-processing unit 311 may be omitted.
  • the control data sequence X may be directly supplied from the control data acquisition unit 21 to the first-stage convolutional layer 321 of the first encoder 31.
  • the decoder 32 includes a post-processing unit 322, but the post-processing unit 322 may be omitted.
  • the acoustic data sequence Z output by the final-stage intermediate layer L may be directly supplied to the waveform synthesis unit 23.
  • the control device 11 (section setting unit 241) may select one of the multiple reference signals Sr in response to the first instruction Q1 from the user.
  • the control device 11 generates a reference data string R from the reference signal Sr of the specific section selected in response to the first instruction Q1.
  • the configuration for changing each element Ek of the control vector V in response to the second instruction Q2 from the user is not limited to the above example.
  • multiple preset data for the control vector V0 may be stored in the storage device 12.
  • Each preset data is data that specifies each of the K elements E1 to EK of the control vector V0.
  • the user can select one of the multiple preset data by operating the operation device 14.
  • the control vector adjustment unit 244 uses the preset data selected by the user as the control vector V0 and applies it to the adjustment.
  • the instruction to select and call one of the multiple preset data corresponds to the second instruction Q2.
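A preset call of this kind reduces to a table lookup, as in the following sketch; the preset names and element values are purely illustrative.

```python
# Preset data stored in the storage device 12 (values are placeholders).
PRESETS = {
    "preset A": [0.8, -0.2, 0.1, 0.0],
    "preset B": [0.1, 0.9, -0.3, 0.2],
}

def recall_preset(name):
    """Second instruction Q2 given as a preset selection: return a copy of the stored V0."""
    return list(PRESETS[name])
```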
  • in the second embodiment, the position of each operator Gb-k corresponds to the numerical value of each element Ek, but the position of each operator Gb-k may instead correspond to the amount of change in each element Ek.
  • the control vector adjustment unit 244 sets the numerical value of the element Ek to a numerical value that is changed from the initial value in the control vector V0 by an amount corresponding to the position of the operator Gb-k.
  • the musical sound synthesis system 100 is provided with the training processing unit 26 for convenience, but the training processing unit 26 may be mounted on a machine learning system separate from the musical sound synthesis system 100.
  • the first generation model 30 and the second generation model 40 established by the machine learning system are provided to the musical sound synthesis system 100 and used in the musical sound synthesis process Sa.
  • an audio signal W is generated using a processing period B as a time unit.
  • multiple processing periods B that are consecutive on the time axis may partially overlap each other on the time axis, as illustrated in FIG. 16. Note that the temporal relationship between the processing periods B is not limited to the example illustrated in FIG. 16.
  • an audio signal W is generated sequentially for each processing period B on the time axis.
  • the audio signals W within the valid period b of each processing period B are added (e.g., weighted averaged) to each other between successive processing periods B on the time axis to generate a final audio signal.
  • the valid period b is a period included in the processing period B.
  • the valid period b is a period obtained by excluding from the processing period B a period of a predetermined length including the start point of the processing period B and a period of a predetermined length including the end point of the processing period B.
  • with the above configuration, the discontinuity of the waveform of the audio signal W at the end (start point or end point) of the processing period B is reduced, and as a result, an audio signal with a continuous waveform that sounds natural can be generated.
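The combination of overlapping processing periods B can be sketched as a cross-faded overlap-add restricted to the valid periods b; the hop size and the linear fade shape are assumptions made for illustration.

```python
import numpy as np

def overlap_add(segments, hop, fade):
    """Combine the per-processing-period audio signals W into one continuous signal.

    segments -- list of 1-D arrays, one audio signal W per processing period B (equal lengths)
    hop      -- spacing between the start points of consecutive processing periods, in samples
    fade     -- number of samples excluded (cross-faded) at each end of a processing period
    """
    seg_len = len(segments[0])
    total = hop * (len(segments) - 1) + seg_len
    out = np.zeros(total)
    weight = np.zeros(total)
    window = np.ones(seg_len)
    window[:fade] = np.linspace(0.0, 1.0, fade)    # attenuate around the start point of B
    window[-fade:] = np.linspace(1.0, 0.0, fade)   # attenuate around the end point of B
    for i, seg in enumerate(segments):
        start = i * hop
        out[start:start + seg_len] += np.asarray(seg) * window
        weight[start:start + seg_len] += window
    return out / np.maximum(weight, 1e-9)          # weighted average where periods overlap
```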
  • a virtual operator Gb-k is displayed on the display device 13, but the operator Gb-k that receives an instruction to change each element Ek may be a real operator that the user can actually touch.
  • the conversion process executed by each intermediate layer L is not limited to the process exemplified in each of the above-mentioned embodiments.
  • one of the multiplication of the first parameter p1 and the addition of the second parameter p2 may be omitted. In a form in which the addition is omitted, the parameter set Pn is composed of only the first parameter p1; in a form in which the multiplication is omitted, the parameter set Pn is composed of only the second parameter p2.
  • the parameter set Pn is expressed as a variable including one or more parameters.
  • the first generative model 30 including the first encoder 31 and the decoder 32 is exemplified, but the configuration of the first generative model 30 is not limited to the above examples.
  • the first generative model 30 is comprehensively expressed as a model that learns the relationship between the conditions of the target musical tone (control data sequence X) and the acoustic features of the target musical tone (acoustic data sequence Z). Therefore, a model of any structure including one or more intermediate layers L capable of performing conversion processing is used as the first generative model 30.
  • the configuration of the second generation model 40 is not limited to the examples in the above-mentioned embodiments.
  • a configuration in which the sampling unit 414 samples each element Ek of the control vector V from the corresponding probability distribution Fk has been exemplified, but the control vector V may instead be generated by the multiple convolution layers 411.
  • the output processing unit 412 in the second encoder 243 may be omitted.
  • the section setting unit 241 may set the reference signal Sr for a specific section regardless of an instruction from the user.
  • the section setting unit 241 sets a section in which the acoustic characteristics of the reference signal Sr satisfy a specific condition as the specific section.
  • the section setting unit 241 sets a section in which the acoustic characteristics, such as timbre, fluctuate significantly as the specific section.
  • the entire reference signal Sr may be used as the specific section. In a configuration in which the entire reference signal Sr is used as the specific section, the section setting unit 241 may be omitted.
  • a configuration in which the first generation model 30 includes N1 encoding intermediate layers Le and N2 decoding intermediate layers Ld has been exemplified, but the encoding intermediate layers Le or the decoding intermediate layers Ld may be omitted.
  • a form in which the first encoder 31 of the first generation model 30 does not include an encoding intermediate layer Le, or a form in which the decoder 32 does not include a decoding intermediate layer Ld are also envisioned.
  • each intermediate layer L performs a conversion process. Therefore, a form in which the first encoder 31 does not perform a conversion process, or a form in which the decoder 32 does not perform a conversion process are also envisioned.
  • in a form in which the encoding intermediate layers Le are omitted, the first generation model 30 includes N2x decoding intermediate layers Ld. As described above, the number N2x of the decoding intermediate layers Ld is a natural number equal to or less than N2. Also, in a form in which the decoding intermediate layers Ld are omitted, the first generation model 30 includes N1x encoding intermediate layers Le. As described above, the number N1x of the encoding intermediate layers Le is a natural number equal to or less than N1.
  • the number Nx of intermediate layers L in the first generation model 30 in the first to fourth embodiments is a natural number between 1 and N. That is, the first generation model 30 is comprehensively expressed as a model including multiple base layers and one or more intermediate layers L.
  • the intermediate layers L are included in one or both of the first encoder 31 and the decoder 32. That is, the conversion process is performed in at least one location in the first generation model 30.
  • the musical sound synthesis system 100 may be realized by a server device that communicates with an information device such as a smartphone or tablet terminal. For example, the musical sound synthesis system 100 generates an audio signal W from the music data M and the reference signal Sr received from the information device, and transmits the audio signal W to the information device. Note that in a form in which the audio data sequence Z generated by the musical sound synthesis unit 22 is transmitted from the musical sound synthesis system 100 to the information device, the waveform synthesis unit 23 may be omitted from the musical sound synthesis system 100. The information device generates an audio signal from the audio data sequence Z received from the musical sound synthesis system 100.
  • a control data sequence X may be transmitted from the information device to the musical sound synthesis system 100 instead of the music data M.
  • the control data acquisition unit 21 receives the control data sequence X transmitted from the information device. "Receiving" the control data Dx (control data sequence X) is an example of "acquiring" the control data Dx.
  • the functions of the musical tone synthesis system 100 exemplified above are realized by the cooperation of one or more processors constituting the control device 11 and the program stored in the storage device 12.
  • the program according to the present disclosure can be provided in a form stored in a computer-readable recording medium and installed in a computer.
  • the recording medium is, for example, a non-transitory recording medium, and a good example is an optical recording medium (optical disk) such as a CD-ROM, but also includes any known type of recording medium such as a semiconductor recording medium or a magnetic recording medium.
  • a non-transitory recording medium includes any recording medium except a transient, propagating signal, and does not exclude volatile recording media.
  • in a configuration in which the program is distributed from a distribution device via a communication network, the storage device that stores the program in the distribution device corresponds to the non-transitory recording medium described above.
  • a musical sound synthesis method according to one aspect (Aspect 1) is a musical sound synthesis method realized by a computer system, which acquires a time series of control data representing the conditions of a target musical sound, and generates a time series of acoustic data representing the acoustic features of the target musical sound by processing the time series of control data using a trained generative model that includes multiple basic layers and one or more intermediate layers and has learned the relationship between the conditions of the musical sounds and the acoustic features of the musical sounds.
  • the method generates a control vector representing the characteristics of the temporal change in timbre in response to an instruction from a user, and generates a first parameter set from the control vector.
  • a first intermediate layer of the one or more intermediate layers applies the first parameter set to the data input to the first intermediate layer, and outputs the data after application to the next layer.
  • a control vector representing the characteristics of the temporal change in timbre (partial timbre) is generated in response to instructions from the user, a first parameter set is generated from the control vector, and the first parameter set is applied to the data input to the first intermediate layer. Therefore, it is possible to generate a time series of acoustic data for a target musical tone having a variety of partial timbres in response to instructions from the user.
  • a “target musical sound” is a musical sound that is the target to be synthesized.
  • a "musical sound" means a sound related to music. For example, a singer's singing voice or a musical instrument's performance sound is a sound related to music, and each is an example of a "musical sound."
  • Control data is data in any format that represents the conditions of a target musical tone.
  • data that represents the features (score features) of music data that represents the musical score of a musical tone is an example of "control data”.
  • the type of score features represented by the control data is arbitrary. For example, score features similar to those in Patent Document 1 are used.
  • in a specific example (Aspect 2) of Aspect 1, in generating the control vector, a time series of the control vector is generated in response to instructions from the user, and in generating the first parameter set, a time series of the first parameter set is generated from the time series of the control vector.
  • a time series of the control vector is generated in response to instructions from the user, and a time series of the first parameter set is generated from the time series of the control vector. Therefore, it is possible to generate a variety of target sounds whose timbre changes at intermediate points in the time series of the control data.
  • a second parameter set is further generated from the control vector, and a second intermediate layer of the one or more intermediate layers executes processing in which the second parameter set is applied to data input to the second intermediate layer, and outputs the data after application to the next layer.
  • the second parameter set in addition to applying the first parameter set to data input to the first intermediate layer, the second parameter set is applied to data input to the second intermediate layer. Therefore, a time series of acoustic data of a target musical tone having a variety of partial timbres can be generated.
  • in a specific example, the one or more intermediate layers are multiple intermediate layers, the generative model includes a first encoder including multiple encoding intermediate layers among the one or more intermediate layers and a decoder including multiple decoding intermediate layers among the one or more intermediate layers, and in generating the time series of acoustic data, the time series of control data is processed by the first encoder to generate intermediate data representing characteristics of the time series of control data, and the time series of acoustic data is generated by processing the intermediate data by the decoder.
  • the time series of acoustic data can be generated by encoding by the first encoder and decoding by the decoder.
  • the "first encoder” is a statistical model that generates intermediate data that represents the characteristics of a time series of control data.
  • the decoder is a statistical model that generates a time series of acoustic data from the intermediate data.
  • each of the "first intermediate layer" and the "second intermediate layer" may be either an encoding intermediate layer or a decoding intermediate layer.
  • the first encoder compresses data on the time axis, and the decoder expands data on the time axis.
  • intermediate data is generated that appropriately reflects the characteristics of the time series of control data, and a time series of acoustic data is generated that appropriately reflects the characteristics of the intermediate data.
  • a specific section of the reference musical sound is set in response to a first instruction from the user, and a time series of reference data representing the acoustic features of the reference musical sound in the specific section is processed by a second encoder to generate the control vector representing the characteristics of the temporal change in timbre in the specific section of the reference musical sound.
  • a specific section of the reference musical sound is set in response to a first instruction from the user, and a control vector representing the characteristics of the temporal change in timbre in the specific section (partial timbre) is generated. Therefore, it is possible to generate a target musical sound having the partial timbre of the specific section of the reference musical sound in response to the first instruction.
  • the position of the specific section on the time axis is further changed in response to the first instruction.
  • the position of the specific section on the time axis in the reference musical sound is changed in response to a first instruction from the user. Therefore, it is possible to generate a target musical sound having a partial timbre at a position of the reference musical sound desired by the user.
  • the control vector includes a plurality of elements, and in generating the control vector, one or more of the plurality of elements are changed in response to a second instruction from the user.
  • one or more of the plurality of elements of the control vector are changed in response to a second instruction from the user. Therefore, it is possible to generate a variety of target musical tones having partial timbres in response to the second instruction from the user.
  • the second instruction is an operation on a plurality of operators corresponding to the plurality of elements, and in changing the one or more elements, the one or more elements are set in response to an operation on one or more of the plurality of operators that correspond to the one or more elements.
  • the user can easily adjust partial tones by operating each operator.
  • the "operator” may take any form.
  • for example, a reciprocating operator (a slider) or a rotary operator (a knob) may be used as the operator.
  • the "operator” may be a real operator that the user can touch, or a virtual operator that is displayed by a display device.
  • the first intermediate layer performs a conversion process by applying the first parameter set to the data input to the first intermediate layer.
  • the first parameter set includes a first parameter and a second parameter
  • the conversion process includes multiplication of the first parameter and addition of the second parameter.
  • a conversion process including multiplication of the first parameter and addition of the second parameter is performed on the data input to the first intermediate layer. Therefore, it is possible to generate acoustic data of a target musical tone to which the partial timbre represented by the control vector is appropriately imparted.
  • a musical sound synthesis system includes a control data acquisition unit that acquires a time series of control data representing the conditions of a target musical sound, a control vector generation unit that generates a control vector representing the characteristics of a temporal change in tone in response to an instruction from a user, a control vector processing unit that generates a first parameter set from the control vector, and a musical sound synthesis unit that generates a time series of acoustic data representing the acoustic features of the target musical sound by processing the time series of the control data using a trained generative model that includes multiple base layers and one or more intermediate layers and has learned the relationship between the conditions of the musical sounds and the acoustic features of the musical sounds, and a first intermediate layer of the one or more intermediate layers applies the first parameter set to data input to the first intermediate layer and outputs the data after application to the next layer.
  • a program causes a computer system to function as a control data acquisition unit that acquires a time series of control data representing the conditions of a target musical tone, a control vector generation unit that generates a control vector representing the characteristics of temporal changes in tone in response to an instruction from a user, a control vector processing unit that generates a first parameter set from the control vector, and a musical tone synthesis unit that includes a plurality of base layers and one or more intermediate layers and processes the time series of control data using a trained generative model that has learned the relationship between the conditions of a musical tone and the acoustic features of the musical tone, thereby generating a time series of acoustic data representing the acoustic features of the target musical tone, and a first intermediate layer of the one or more intermediate layers applies the first parameter set to data input to the first intermediate layer and outputs the data after application to the next layer.

Abstract

A musical sound synthesis system 100 comprises: a control data acquisition unit 21 that acquires a time series X of control data indicating the condition of a target musical sound; a control vector generation unit 24 that generates a control vector V representing the feature of temporal change of timbre in response to an instruction from a user; a control vector processing unit 25 that generates a first parameter set Pn from the control vector V; and a musical sound synthesis unit 22 that generates a time series Z of acoustic data representing the acoustic feature quantity of the target musical sound by processing the time series X of the control data by a trained first generative model 30 including a plurality of basic layers and one or more intermediate layers and having learned the relation between the condition of the musical sound and the acoustic feature quantity of the musical sound. A first intermediate layer out of the one or more intermediate layers executes processing in which the first parameter set Pn is applied to data to be inputted to the first intermediate layer, and outputs the data after the application to the next layer.

Description

Musical sound synthesis method, musical sound synthesis system, and program
 This disclosure relates to technology for synthesizing sound.
 Technologies have been proposed for generating desired musical tones using generative models such as neural networks. For example, Patent Document 1 discloses a configuration for generating a time series of acoustic features of a voice waveform by processing a time series of multidimensional score features related to voice using a convolutional neural network.
Patent Document 1: Japanese Patent No. 6552146
 In recent voice synthesis using generative models, it is required not only to synthesize uniform musical tones from a time series of musical score features, but also to impart temporal changes in timbre in a partial section of a specific musical tone (hereinafter referred to as "partial timbre") to the musical tone in response to instructions from the user. In consideration of the above circumstances, one aspect of the present disclosure aims to generate musical tones with a variety of partial timbres in response to instructions from the user.
 In order to solve the above problems, a musical sound synthesis method according to one aspect of the present disclosure is a musical sound synthesis method realized by a computer system, which acquires a time series of control data representing the conditions of a target musical sound, and generates a time series of acoustic data representing the acoustic features of the target musical sound by processing the time series of control data using a trained generative model that includes multiple basic layers and one or more intermediate layers and has learned the relationship between the conditions of the musical sounds and the acoustic features of the musical sounds. The method generates a control vector representing the characteristics of the temporal change in timbre in response to an instruction from a user, and generates a first parameter set from the control vector. A first intermediate layer of the one or more intermediate layers applies the first parameter set to the data input to the first intermediate layer, and outputs the data after application to the next layer.
 A musical sound synthesis system according to one aspect of the present disclosure includes a control data acquisition unit that acquires a time series of control data representing the conditions of a target musical sound, a control vector generation unit that generates a control vector representing the characteristics of temporal changes in tone in response to an instruction from a user, a control vector processing unit that generates a first parameter set from the control vector, and a musical sound synthesis unit that includes a plurality of base layers and one or more intermediate layers, and processes the time series of the control data using a trained generative model that has learned the relationship between the conditions of a musical sound and the acoustic features of the musical sound, thereby generating a time series of acoustic data representing the acoustic features of the target musical sound, and a first intermediate layer of the one or more intermediate layers applies the first parameter set to data input to the first intermediate layer, and outputs the data after application to the next layer.
 A program according to one aspect of the present disclosure causes a computer system to function as a control data acquisition unit that acquires a time series of control data representing the conditions of a target musical tone, a control vector generation unit that generates a control vector representing the characteristics of temporal changes in tone in response to an instruction from a user, a control vector processing unit that generates a first parameter set from the control vector, and a musical tone synthesis unit that includes a plurality of base layers and one or more intermediate layers and processes the time series of control data using a trained generative model that has learned the relationship between the conditions of a musical tone and the acoustic features of the musical tone, thereby generating a time series of acoustic data representing the acoustic features of the target musical tone, and a first intermediate layer of the one or more intermediate layers applies the first parameter set to data input to the first intermediate layer and outputs the data after application to the next layer.
Brief description of the drawings:
FIG. 1 is a block diagram illustrating the configuration of a musical sound synthesis system according to the first embodiment.
FIG. 2 is a block diagram illustrating the functional configuration of the musical sound synthesis system.
FIG. 3 is a block diagram illustrating a specific configuration of the first generative model.
FIG. 4 is an explanatory diagram of the conversion process.
FIG. 5 is a schematic diagram of the setting screen.
FIG. 6 is a block diagram illustrating a specific configuration of the second generative model.
FIG. 7 is a flowchart of the musical sound synthesis process.
FIG. 8 is an explanatory diagram of machine learning.
FIG. 9 is a flowchart of the training process.
FIG. 10 is a block diagram of the control vector generation unit in the second embodiment.
FIG. 11 is a schematic diagram of the setting screen in the second embodiment.
FIG. 12 is a flowchart of the musical sound synthesis process in the second embodiment.
FIG. 13 is an explanatory diagram of the conversion process executed by each intermediate layer L.
FIG. 14 is a block diagram of the first generative model in the fourth embodiment.
FIG. 15 is a block diagram of the unit processing unit in the fourth embodiment.
FIG. 16 is an explanatory diagram of the processing period in a modified example.
A: First embodiment
 Fig. 1 is a block diagram illustrating the configuration of a musical sound synthesis system 100 according to a first embodiment. The musical sound synthesis system 100 is a computer system that synthesizes a desired musical sound (hereinafter referred to as a "target musical sound"). The target musical sound is a musical sound to be synthesized by the musical sound synthesis system 100. In the first embodiment, a singing sound to be produced by singing a specific piece of music (hereinafter referred to as a "target piece of music") is exemplified as the target musical sound.
 The musical sound synthesis system 100 comprises a control device 11, a storage device 12, a display device 13, an operation device 14, and a sound emission device 15. The musical sound synthesis system 100 is realized by an information device such as a smartphone, a tablet terminal, or a personal computer. Note that the musical sound synthesis system 100 can be realized as a single device, or as multiple devices configured separately from each other.
 The control device 11 is a single or multiple processors that control each element of the musical sound synthesis system 100. Specifically, the control device 11 is composed of one or more types of processors, such as a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), an SPU (Sound Processing Unit), a DSP (Digital Signal Processor), an FPGA (Field Programmable Gate Array), or an ASIC (Application Specific Integrated Circuit).
 The storage device 12 is a single or multiple memories that store the programs executed by the control device 11 and various data used by the control device 11. For example, a well-known recording medium such as a semiconductor recording medium or a magnetic recording medium, or a combination of multiple types of recording media, is used as the storage device 12. Note that, for example, a portable recording medium that is detachable from the musical sound synthesis system 100, or a recording medium that the control device 11 can access via a communication network (e.g., cloud storage) may also be used as the storage device 12. The storage device 12 of the first embodiment stores music data M and a reference signal Sr.
 The music data M represents the musical score of the target music piece. More specifically, the music data M specifies the pitch, pronunciation period, and pronunciation character for each of the multiple notes of the target music piece. The pitch is one of multiple discretely set scale notes. The pronunciation period is specified, for example, by the start point and duration of the note. The pronunciation character is a symbol representing the lyrics of the music piece. For example, a music file that complies with the MIDI (Musical Instrument Digital Interface) standard is used as the music data M. The music data M is provided to the musical sound synthesis system 100, for example, from a distribution device via a communication network.
 参照信号Srは、特定の楽音(以下「参照楽音」という)の波形を表す音響信号である。参照楽音は、例えば参照用の楽曲の歌唱により発音されるべき歌唱音である。参照信号Srは、配信装置から通信網を介して楽音合成システム100に提供される。なお、参照信号Srは、例えば光ディスク等の記録媒体を駆動する再生装置から提供されてもよいし、収音装置を利用した参照楽音の収音により生成されてもよい。また、参照信号Srは、歌唱合成または楽音合成等の公知の合成技術により合成された音響信号でもよい。なお、参照信号Srに対応する参照用の楽曲と目標楽曲とは、共通の楽曲でも別個の楽曲でもよい。また、目標楽音の歌唱者と参照楽音の歌唱者とは同じでも異なってもよい。 The reference signal Sr is an audio signal that represents the waveform of a specific musical tone (hereinafter referred to as "reference musical tone"). The reference musical tone is, for example, a singing tone that should be produced by singing a reference musical piece. The reference signal Sr is provided to the musical tone synthesis system 100 from a distribution device via a communication network. The reference signal Sr may be provided from a playback device that drives a recording medium such as an optical disk, or may be generated by collecting the reference musical tone using a sound collection device. The reference signal Sr may also be an audio signal synthesized using a known synthesis technique such as singing synthesis or musical tone synthesis. The reference musical piece and the target musical piece corresponding to the reference signal Sr may be the same piece or different pieces of music. The singer of the target musical tone and the singer of the reference musical tone may be the same or different.
 第1実施形態の目標楽音は、目標楽曲の歌唱音であり、かつ、参照楽音のうち特定の期間(以下「特定区間」という)内における音響特性の時間的な変化の特徴(以下「部分音色」という)が付与された楽音である。具体的には、利用者の所望の部分音色が付与された楽音が目標楽音として生成される。例えば、音量または音高等の音響特性の反復的な変動(ビブラート)、または音響特性の経時的な変化等、特定区間に存在する所望の特徴が、部分音色として想定される。以上の説明から理解される通り、参照楽音は、目標楽曲に付与されるべき部分音色の素材となる楽音である。制御装置11は、楽曲データMと参照信号Srとを利用して、目標楽音を表す音響信号Wを生成する。音響信号Wは、目標楽音の波形を表す時間領域の信号である。 The target musical tone in the first embodiment is a singing tone of a target musical piece, and is a musical tone that is given a feature of temporal changes in acoustic characteristics (hereinafter referred to as "partial tone") within a specific period (hereinafter referred to as "specific section") among the reference musical tone. Specifically, a musical tone that is given a partial tone desired by the user is generated as the target musical tone. For example, a desired feature that exists in a specific section, such as repeated fluctuations (vibrato) in acoustic characteristics such as volume or pitch, or changes in acoustic characteristics over time, is assumed as a partial tone. As can be understood from the above explanation, the reference musical tone is a musical tone that is the material for the partial tone to be given to the target musical piece. The control device 11 uses the musical piece data M and the reference signal Sr to generate an audio signal W that represents the target musical tone. The audio signal W is a time-domain signal that represents the waveform of the target musical tone.
 表示装置13は、制御装置11による制御のもとで画像を表示する。表示装置13は、例えば、液晶表示パネルまたは有機EL(Electroluminescence)パネル等の表示パネルである。操作装置14は、利用者からの指示を受付ける入力機器である。操作装置14は、例えば、利用者が操作する操作子、または、利用者による接触を検知するタッチパネルである。なお、楽音合成システム100とは別体の表示装置13または操作装置14が、楽音合成システム100に有線または無線により接続されてもよい。 The display device 13 displays images under the control of the control device 11. The display device 13 is, for example, a display panel such as a liquid crystal display panel or an organic EL (Electroluminescence) panel. The operation device 14 is an input device that accepts instructions from a user. The operation device 14 is, for example, an operator operated by the user, or a touch panel that detects contact by the user. Note that the display device 13 or operation device 14, which are separate from the musical sound synthesis system 100, may be connected to the musical sound synthesis system 100 by wire or wirelessly.
The sound emitting device 15 reproduces sound under the control of the control device 11. Specifically, the sound emitting device 15 reproduces the target musical tone represented by the audio signal W. For example, a speaker or headphones are used as the sound emitting device 15. For convenience, a D/A converter that converts the audio signal W from digital to analog and an amplifier that amplifies the audio signal W are omitted from the illustration. A sound emitting device 15 separate from the musical tone synthesis system 100 may be connected to the musical tone synthesis system 100 by wire or wirelessly.
FIG. 2 is a block diagram illustrating the functional configuration of the musical tone synthesis system 100. By executing a program stored in the storage device 12, the control device 11 realizes multiple functions for generating the audio signal W of the target musical tone (a control data acquisition unit 21, a musical tone synthesis unit 22, a waveform synthesis unit 23, a control vector generation unit 24, a control vector processing unit 25, and a training processing unit 26).
In the following drawings, the data size (number of dimensions) b of one piece of data and the time length a of a time series made up of multiple pieces of that data are denoted by the symbols [a, b]. The time length a is expressed as a number of periods of a predetermined length on the time axis (hereinafter referred to as "unit periods"). For example, [800, 134] in FIG. 2 means a time series in which 134-dimensional data are arranged over 800 unit periods. A unit period is, for example, a period (frame) with a time length of about 5 milliseconds, so 800 unit periods correspond to 4 seconds. Note that these values are only examples and may be changed as desired. Each unit period is identified by its time.
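As a concrete illustration of the [a, b] notation only (the array contents and the exact frame length are placeholders, not values fixed by the embodiment), a time series of 134-dimensional data over 800 unit periods of about 5 milliseconds each could be represented as follows:

```python
import numpy as np

# Hypothetical illustration of the [a, b] notation: 134-dimensional data
# arranged over 800 unit periods, i.e. a time series of shape [800, 134].
UNIT_PERIOD_SEC = 0.005                     # one unit period (frame), about 5 ms
control_sequence = np.zeros((800, 134))     # shape [a, b] = [800, 134]

print(control_sequence.shape)               # (800, 134)
print(800 * UNIT_PERIOD_SEC)                # 4.0 seconds per processing period
```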
The control data acquisition unit 21 acquires control data Dx representing the conditions of the target musical tone. Specifically, the control data acquisition unit 21 acquires control data Dx for each unit period. In the first embodiment, the control data acquisition unit 21 generates the control data Dx of each unit period from the music data M. In other words, "generation" of the control data Dx is one example of "acquisition" of the control data Dx.
The control data Dx represents feature quantities of the musical score of the target musical piece (hereinafter referred to as "score features"). The score features represented by the control data Dx include, for example, the pitch (fundamental frequency) in the unit period, information indicating voiced/unvoiced in the unit period, and phoneme information in the unit period.
The pitch is the value, for one unit period, of the pitch time series (pitch trajectory) corresponding to the notes specified by the music data M. Whereas the pitch of each note in the target musical piece is discrete, the pitch trajectory used for the control data Dx is a continuous change of pitch along the time axis. The control data acquisition unit 21 estimates the pitch trajectory for the control data Dx, for example, by processing the music data M with an estimation model that has learned the relationship between the pitch of each note and the pitch trajectory. However, the method of generating the pitch trajectory is not limited to this example. The control data Dx may also include the discrete pitch of each note.
The phoneme information is information about the phonemes corresponding to the lyric characters of the target musical piece. Specifically, the phoneme information includes, for example, information specifying one of a plurality of phonemes, for example in a one-hot representation, the position of the unit period within the phoneme period, the time length from the beginning or end of the phoneme period, and the duration of the phoneme.
The time series of control data Dx within a processing period B constitutes a control data string X. The processing period B is a period of predetermined length made up of multiple (specifically, 800) consecutive unit periods on the time axis. As understood from the above explanation, the control data acquisition unit 21 of the first embodiment generates a time series of control data Dx representing the conditions of the target musical tone (that is, a control data string X) for each processing period B on the time axis.
The musical tone synthesis unit 22 generates an acoustic data string Z by processing the control data string X. Specifically, the musical tone synthesis unit 22 generates an acoustic data string Z for each processing period B. The acoustic data string Z is time-series data representing the acoustic features of the target musical tone in the processing period B. The acoustic data string Z is composed of multiple (specifically, 800) pieces of acoustic data Dz corresponding to the consecutive unit periods within the processing period B. In other words, the acoustic data string Z is the time series of acoustic data Dz within the processing period B. The musical tone synthesis unit 22 generates the acoustic data string Z of a processing period B from the control data string X corresponding to that processing period B.
The acoustic data Dz represents acoustic feature quantities of the target musical tone. The acoustic feature quantities are, for example, the amplitude spectral envelope of the target musical tone. Specifically, the acoustic data Dz includes the amplitude spectral envelope of the harmonic components of the target musical tone and the amplitude spectral envelope of the non-harmonic components of the target musical tone. The amplitude spectral envelope is the outline of the amplitude spectrum. The amplitude spectral envelopes of the harmonic and non-harmonic components are expressed, for example, as mel-cepstra or MFCC (Mel-Frequency Cepstrum Coefficients). As understood from the above explanation, the musical tone synthesis unit 22 of the first embodiment generates a time series of acoustic data Dz representing the acoustic features of the target musical tone (that is, an acoustic data string Z) for each processing period B. The acoustic data Dz may include the amplitude spectral envelope and the pitch trajectory of the target musical tone. The acoustic data Dz may also include the spectrum (amplitude spectrum or power spectrum) of the target musical tone, which may be expressed, for example, as a mel spectrum. The amplitude spectral envelope may also be the outline of the power spectrum (the power spectral envelope).
The waveform synthesis unit 23 generates the audio signal W of the target musical tone from the acoustic data string Z. Specifically, the waveform synthesis unit 23 generates a waveform signal from the acoustic data Dz of each unit period by calculations including, for example, a discrete inverse Fourier transform, and generates the audio signal W by concatenating the waveform signals of consecutive unit periods on the time axis. A deep neural network that has learned the relationship between the acoustic data string Z and the waveform signal (a so-called neural vocoder) may also be used as the waveform synthesis unit 23. The audio signal W generated by the waveform synthesis unit 23 is supplied to the sound emitting device 15, whereby the target musical tone is reproduced from the sound emitting device 15. The pitch generated by the control data acquisition unit 21 may be applied to the generation of the audio signal W by the waveform synthesis unit 23.
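The following is a minimal sketch of this kind of per-frame inverse Fourier synthesis, assuming each acoustic data Dz has already been converted to a linear amplitude spectrum, and using a zero-phase assumption, a fixed hop size, and a Hann window purely for simplicity; these choices are illustrative, and an actual implementation (or a neural vocoder, as mentioned above) would treat phase and spectral detail quite differently:

```python
import numpy as np

def synthesize_waveform(amplitude_spectra, hop=240, n_fft=1024):
    """Overlap-add per-frame inverse DFT synthesis (simplified sketch).

    amplitude_spectra: array of shape [num_frames, n_fft // 2 + 1]
    hop: samples per unit period (240 samples = 5 ms at 48 kHz, an assumption)
    """
    num_frames = amplitude_spectra.shape[0]
    window = np.hanning(n_fft)
    out = np.zeros(hop * (num_frames - 1) + n_fft)
    for i, mag in enumerate(amplitude_spectra):
        frame = np.fft.irfft(mag, n=n_fft)          # zero-phase assumption
        out[i * hop:i * hop + n_fft] += window * frame
    return out

# e.g. 800 frames of a 513-bin amplitude spectrum -> time-domain signal W
w = synthesize_waveform(np.abs(np.random.randn(800, 513)))
```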
As illustrated in FIG. 2, the musical tone synthesis unit 22 generates the acoustic data string Z by processing the control data string X with a first generative model 30. The first generative model 30 is a trained statistical model that has learned, through machine learning, the relationship between the score-level conditions of the target musical tone (control data string X) and the acoustic features of the target musical tone (acoustic data string Z). That is, the first generative model 30 outputs an acoustic data string Z in response to the input of a control data string X. The first generative model 30 is configured, for example, as a deep neural network.
The first generative model 30 is realized by a combination of a program that causes the control device 11 to execute the operations (architecture) for generating the acoustic data string Z from the control data string X, and multiple variables (weights and biases) applied to those operations. The program and the variables that realize the first generative model 30 are stored in the storage device 12. The variables of the first generative model 30 are set in advance by machine learning. The first generative model 30 of the first embodiment includes a first encoder 31 and a decoder 32.
The first encoder 31 is a trained statistical model that has learned, through machine learning, the relationship between the control data string X and intermediate data Y. That is, the first encoder 31 outputs intermediate data Y in response to the input of a control data string X. The musical tone synthesis unit 22 generates the intermediate data Y by processing the control data string X with the first encoder 31. The intermediate data Y represents the features of the control data string X. Specifically, the generated acoustic data string Z changes according to the features of the control data string X represented by the intermediate data Y. In other words, the first encoder 31 encodes the control data string X into the intermediate data Y.
The decoder 32 is a trained statistical model that has learned, through machine learning, the relationship between the intermediate data Y and the acoustic data string Z. That is, the decoder 32 outputs an acoustic data string Z in response to the input of intermediate data Y. The musical tone synthesis unit 22 generates the acoustic data string Z by processing the intermediate data Y with the decoder 32. In other words, the decoder 32 decodes the intermediate data Y into the acoustic data string Z. As explained above, in the first embodiment the acoustic data string Z can be generated by encoding with the first encoder 31 and decoding with the decoder 32.
FIG. 3 is a block diagram illustrating a specific configuration (architecture) of the first generative model 30. The first encoder 31 includes a preprocessing unit 311, N1 convolutional layers 312, and N1 encoding intermediate layers Le. Specifically, the N1 convolutional layers 312 and the N1 encoding intermediate layers Le are arranged alternately after the preprocessing unit 311. That is, N1 pairs, each consisting of a convolutional layer 312 and an encoding intermediate layer Le, are stacked after the preprocessing unit 311.
The preprocessing unit 311 is composed of a multilayer perceptron for processing the control data string X. The preprocessing unit 311 is composed of multiple calculation units corresponding to the different control data Dx of the control data string X. Each calculation unit is composed of a stack of multiple fully connected layers, and each control data Dx is processed sequentially by those fully connected layers. For example, a neural network with the same configuration and the same variables is applied to every control data Dx of the control data string X. The array of control data Dx after processing by the calculation units (the processed control data string X) is input to the first convolutional layer 312. By processing the control data Dx with the preprocessing unit 311, a control data string X that expresses the characteristics of the target musical piece (music data M) more clearly is generated. However, the preprocessing unit 311 may be omitted.
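As a rough sketch of this kind of per-frame multilayer perceptron, the following applies the same stack of fully connected layers, with the same weights, to every control data Dx in the sequence; the layer count, layer sizes, and activation are assumptions made only for illustration:

```python
import torch
import torch.nn as nn

class FramewiseMLP(nn.Module):
    """Applies one fully connected stack to every frame of a sequence.

    The weights are shared over the time axis, so each control data Dx is
    processed by the same layers with the same variables (sizes assumed).
    """
    def __init__(self, dim_in=134, dim_hidden=256, num_layers=3):
        super().__init__()
        layers, d = [], dim_in
        for _ in range(num_layers):
            layers += [nn.Linear(d, dim_hidden), nn.ReLU()]
            d = dim_hidden
        self.net = nn.Sequential(*layers)

    def forward(self, x):            # x: [time, dim_in], e.g. [800, 134]
        return self.net(x)           # same weights applied to every frame

pre = FramewiseMLP()
out = pre(torch.randn(800, 134))     # -> [800, 256]
```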
Of the N1 convolutional layers 312, the first convolutional layer 312 receives the data processed by the preprocessing unit 311, and each of the second and subsequent convolutional layers 312 receives the data processed by the preceding encoding intermediate layer Le. Each convolutional layer 312 performs arithmetic processing on the data input to that convolutional layer 312. The arithmetic processing by the convolutional layer 312 includes a convolution operation, and may also include a pooling operation.
The convolution operation is a process of convolving a filter with the data input to the convolutional layer 312. The convolutional layers 312 include convolutional layers 312 that perform time compression and convolutional layers 312 that do not. In the convolution operation of a convolutional layer 312 that performs time compression, the amount by which the filter moves in the time direction (the stride) is set to 2 or more. As a result, in each convolutional layer 312 that does not perform time compression, the time length of the data is maintained by a convolution operation with a stride of 1, and in each convolutional layer 312 that performs time compression, the time length of the data is shortened by a convolution operation with a stride of 2 or more. That is, the first encoder 31 compresses the data along the time axis; in other words, the processing by the first encoder 31 includes downsampling of the control data string X. Instead of setting the stride of the convolution operation to 2 or more, the data may be compressed (downsampled) by keeping the stride at 1 and following the convolution operation with a pooling operation. The pooling operation selects a representative value within each range set over the data after the convolution operation. The representative value is a statistic such as the maximum, the mean, or the mean square.
In other words, the compression of the control data string X is achieved by one or both of the convolution operation and the pooling operation. Note that the time compression (downsampling) of the control data string X may be performed in only some of the series of convolution operations of the N1 convolutional layers 312. The compression ratio of each convolutional layer 312 is arbitrary.
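A minimal sketch of the two downsampling options described above, using 1-D convolutions; the channel counts, kernel sizes, and compression ratio of 2 are illustrative assumptions rather than values from the embodiment:

```python
import torch
import torch.nn as nn

# Time compression in the encoder: a stride-1 convolution keeps the time
# length, while a stride-2 convolution halves it.
keep_length = nn.Conv1d(in_channels=64, out_channels=64,
                        kernel_size=3, stride=1, padding=1)
halve_length = nn.Conv1d(in_channels=64, out_channels=64,
                         kernel_size=4, stride=2, padding=1)

x = torch.randn(1, 64, 800)              # [batch, channels, time]
print(keep_length(x).shape)              # torch.Size([1, 64, 800])
print(halve_length(x).shape)             # torch.Size([1, 64, 400])

# The same downsampling can instead be obtained with a stride-1 convolution
# followed by a pooling operation that picks a representative value:
pooled = nn.MaxPool1d(kernel_size=2)(keep_length(x))   # -> [1, 64, 400]
```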
Each of the N1 encoding intermediate layers Le performs a conversion process on the data input to that encoding intermediate layer Le from the preceding convolutional layer 312. The specific content of the conversion process by each encoding intermediate layer Le is described later. The data processed by the final encoding intermediate layer Le among the N1 encoding intermediate layers Le is input to the decoder 32 as the intermediate data Y. Note that an encoding intermediate layer Le does not need to be placed after every one of the N1 convolutional layers 312; that is, the number N1x of encoding intermediate layers Le is any natural number less than or equal to N1. If an encoding intermediate layer Le follows a convolutional layer 312, the next convolutional layer 312 receives the data after the conversion process by that encoding intermediate layer Le; if no encoding intermediate layer Le follows a convolutional layer 312, the next convolutional layer 312 receives the data after the convolution processing by that convolutional layer 312 (that is, data that has not undergone the conversion process).
The decoder 32 includes N2 convolutional layers 321, N2 decoding intermediate layers Ld, and a post-processing unit 322. Specifically, the N2 convolutional layers 321 and the N2 decoding intermediate layers Ld are arranged alternately, and the post-processing unit 322 is stacked after the final decoding intermediate layer Ld. That is, N2 pairs, each consisting of a convolutional layer 321 and a decoding intermediate layer Ld, are stacked before the post-processing unit 322.
Of the N2 convolutional layers 321, the first convolutional layer 321 receives the intermediate data Y, and each of the second and subsequent convolutional layers 321 receives the data processed by the preceding decoding intermediate layer Ld. Each convolutional layer 321 performs arithmetic processing on the data input to that convolutional layer 321. The arithmetic processing by the convolutional layer 321 includes a transposed convolution operation (or deconvolution operation).
The transposed convolution performed by the convolutional layers 321 is the inverse of the convolution performed by the convolutional layers 312 of the encoder. In the transposed convolution operation of a convolutional layer 321 that performs time expansion, the amount by which the filter moves in the time direction (the stride) is set to 2 or more. As a result, in each convolutional layer 321 that does not perform time expansion, the time length of the data is maintained by a transposed convolution operation with a stride of 1, and in each convolutional layer 321 that performs time expansion, the time length of the data is expanded by a transposed convolution operation with a stride of 2 or more. That is, the decoder 32 expands the data along the time axis; in other words, the processing by the decoder 32 includes upsampling of the intermediate data Y.
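Correspondingly, time expansion on the decoder side can be sketched with a transposed 1-D convolution whose stride of 2 doubles the time length; as before, the sizes below are illustrative assumptions:

```python
import torch
import torch.nn as nn

# Time expansion in the decoder: a stride-2 transposed convolution doubles
# the time length, undoing a stride-2 compression on the encoder side.
expand = nn.ConvTranspose1d(in_channels=64, out_channels=64,
                            kernel_size=4, stride=2, padding=1)

y = torch.randn(1, 64, 100)              # compressed intermediate data
print(expand(y).shape)                   # torch.Size([1, 64, 200])
```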
As explained above, in the first embodiment, the first encoder 31 compresses the control data string X and the decoder 32 expands the intermediate data Y. Therefore, intermediate data Y that appropriately reflects the characteristics of the control data string X is generated, and an acoustic data string Z that appropriately reflects the characteristics of the intermediate data Y is generated.
Each of the N2 decoding intermediate layers Ld performs a conversion process on the data input to that decoding intermediate layer Ld from the preceding convolutional layer 321. The specific content of the conversion process by each decoding intermediate layer Ld is described later. The data processed by the final decoding intermediate layer Ld among the N2 decoding intermediate layers Ld is input to the post-processing unit 322 as the acoustic data string Z. Note that a decoding intermediate layer Ld does not need to be placed after every one of the N2 convolutional layers 321; that is, the number N2x of decoding intermediate layers Ld is a natural number less than or equal to N2. If a decoding intermediate layer Ld follows a convolutional layer 321, the next convolutional layer 321 receives the data after the conversion process by that decoding intermediate layer Ld; if no decoding intermediate layer Ld follows a convolutional layer 321, the next convolutional layer 321 receives the data after the convolution processing by that convolutional layer 321 (that is, data that has not undergone the conversion process).
The post-processing unit 322 is composed of a multilayer perceptron for processing the acoustic data string Z. The post-processing unit 322 is composed of multiple calculation units corresponding to the different acoustic data Dz of the acoustic data string Z. Each calculation unit is composed of a stack of multiple fully connected layers, and each acoustic data Dz is processed sequentially by those fully connected layers. For example, a neural network with the same configuration and the same variables is applied to every acoustic data Dz of the acoustic data string Z. The array of acoustic data Dz after processing by the calculation units is input to the waveform synthesis unit 23 as the final acoustic data string Z. By processing the acoustic data Dz with the post-processing unit 322, an acoustic data string Z that expresses the characteristics of the target musical tone more clearly is generated. However, the post-processing unit 322 may be omitted.
As explained above, the first encoder 31 includes N1x encoding intermediate layers Le, and the decoder 32 includes N2x decoding intermediate layers Ld. When the encoding intermediate layers Le and the decoding intermediate layers Ld are collectively referred to as "intermediate layers L", the first generative model 30 can be expressed as a statistical model including Nx (Nx = N1x + N2x) intermediate layers L. That is, the first encoder 31 includes N1x of the Nx intermediate layers L as encoding intermediate layers Le, and the decoder 32 includes N2x of the Nx intermediate layers L as decoding intermediate layers Ld. The number Nx of intermediate layers L is a natural number of 1 or more. The number N1x of encoding intermediate layers Le and the number N2x of decoding intermediate layers Ld may be equal or different.
In the first generative model 30, the preprocessing unit 311, the N1 convolutional layers 312, the N2 convolutional layers 321, and the post-processing unit 322 are the basic layers required for generating the acoustic data string Z. Hereinafter, the set of the N1 convolutional layers 312 and the N2 convolutional layers 321 may be referred to as the N (N = N1 + N2) basic convolutional layers. On the other hand, the Nx intermediate layers (the N1x encoding intermediate layers Le and the N2x decoding intermediate layers Ld) are layers for controlling the partial timbre of the target musical tone. In other words, the first generative model 30 includes N basic convolutional layers and Nx (N ≥ Nx ≥ 1) intermediate layers L.
Each of the N intermediate layers L performs a conversion process on the data input to that intermediate layer L. A parameter set Pn (n = 1 to N) is applied to the conversion process by the n-th intermediate layer L among the N intermediate layers L. In other words, a different parameter set Pn is applied to the conversion process of each of the intermediate layers L. Each of the N parameter sets P1 to PN includes, for example, a first parameter p1 and a second parameter p2.
FIG. 4 is an explanatory diagram of the conversion process. The unit data string U in FIG. 4 is the data input to an intermediate layer L. The unit data string U is composed of a time series of multiple unit data Du corresponding to different unit periods. Each unit data Du is expressed as an H-dimensional vector (H is a natural number of 2 or more). The first parameter p1 is expressed as an H-by-H square matrix, and the second parameter p2 is expressed as an H-dimensional vector. Note that the first parameter p1 may instead be expressed as an H-by-H diagonal matrix or an H-dimensional vector.
The conversion process includes a first operation and a second operation, which are executed in sequence for each of the unit data Du constituting the unit data string U. The first operation multiplies the unit data Du by the first parameter p1. The second operation adds the second parameter p2 to the result of the first operation (p1·Du). As understood from the above explanation, the conversion process by the intermediate layer L is a process that includes multiplication by the first parameter p1 and addition of the second parameter p2 (that is, an affine transformation). Note that the second operation applying the second parameter p2 may be omitted, in which case the generation of the second parameter p2 is also omitted. In other words, the conversion process only needs to include at least the first operation.
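A minimal sketch of this conversion process, applying one parameter set Pn = (p1, p2) to every unit data Du of a unit data string U; the dimensionality H, the sequence length, and the random values are placeholders for illustration:

```python
import numpy as np

def apply_parameter_set(unit_sequence, p1, p2=None):
    """Conversion process of one intermediate layer L (sketch).

    unit_sequence: [time, H] array of unit data Du
    p1: [H, H] matrix (first parameter), p2: [H] vector (second parameter)
    Each Du is multiplied by p1 (first operation); p2 is then added
    (second operation), which may be omitted as noted above.
    """
    out = unit_sequence @ p1.T          # first operation: p1 . Du per frame
    if p2 is not None:
        out = out + p2                  # second operation: add p2
    return out

H = 8                                   # dimensionality of Du (assumption)
U = np.random.randn(100, H)             # unit data string input to layer L
Pn = (np.random.randn(H, H), np.random.randn(H))   # parameter set Pn
U_out = apply_parameter_set(U, *Pn)     # output passed to the next layer
```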
As understood from the above explanation, each of the N intermediate layers L in FIG. 3 performs the conversion process of applying the parameter set Pn to each unit data Du of the unit data string U input to that intermediate layer L, and outputs the unit data string U after the conversion process. In the first embodiment, a conversion process including multiplication by the first parameter p1 and addition of the second parameter p2 is performed on each unit data Du of the unit data string U input to each intermediate layer L. Therefore, an acoustic data string Z of the target musical tone to which the partial timbre represented by the control vector V has been appropriately imparted can be generated. Although the explanation here assumes N intermediate layers L, the basic operation is the same when the number of intermediate layers L is Nx, which is less than N.
Now, for convenience, attention is given to two of the N intermediate layers L of the first generative model 30, namely the n1-th and n2-th intermediate layers L (n1 = 1 to N, n2 = 1 to N, n1 ≠ n2). Each of these intermediate layers L may be either an encoding intermediate layer Le or a decoding intermediate layer Ld. The n1-th intermediate layer L among the N intermediate layers L performs the conversion process of applying the parameter set Pn1 to each unit data Du of the unit data string U input to that intermediate layer L, and outputs the unit data string U after the application to the next layer. The n2-th intermediate layer L performs the conversion process of applying the parameter set Pn2 to each unit data Du of the unit data string U input to that intermediate layer L, and outputs the unit data string U after the application to the next layer. The n1-th intermediate layer L is an example of a "first intermediate layer", and the parameter set Pn1 is an example of a "first parameter set". The n2-th intermediate layer L is an example of a "second intermediate layer", and the parameter set Pn2 is an example of a "second parameter set".
As explained above, in the first embodiment a different parameter set Pn is applied to each of the N intermediate layers L, so an acoustic data string Z of a target musical tone having diverse partial timbres can be generated. The control vector generation unit 24 and the control vector processing unit 25 illustrated in FIG. 2 generate the N parameter sets P1 to PN by processing the reference signal Sr.
The control vector generation unit 24 generates a control vector V by processing the reference signal Sr of the specific section. The control vector V is a K-dimensional vector representing the partial timbre of the reference musical tone. In other words, the control vector V is a vector representing the characteristics of the temporal change in acoustic characteristics (that is, the partial timbre) in the reference signal Sr of the specific section. The control vector generation unit 24 of the first embodiment includes a section setting unit 241, a feature extraction unit 242, and a second encoder 243.
The section setting unit 241 sets the specific section within the reference musical tone. Specifically, the section setting unit 241 sets the specific section in response to a first instruction Q1 from the user on the operation device 14. The time length of the specific section is a fixed length equal to one processing period B.
FIG. 5 is a schematic diagram of a setting screen Ga. The setting screen Ga is a screen on which the user designates the specific section. The section setting unit 241 displays the setting screen Ga on the display device 13. The setting screen Ga includes a waveform image Ga1 and a section image Ga2. The waveform image Ga1 represents the waveform of the reference signal Sr, and the section image Ga2 represents the specific section.
By operating the operation device 14 while checking the waveform image Ga1 of the reference musical tone (first instruction Q1), the user can move the section image Ga2 along the time axis to a desired position. For example, the user moves the section image Ga2 so that it contains a section of the reference musical tone in which the acoustic characteristics change in the desired way.
The section setting unit 241 determines, as the specific section, the section of the reference signal Sr corresponding to the section image Ga2 after it has been moved by the user. As understood from the above explanation, the first instruction Q1 is an instruction to change the position of the specific section on the time axis. That is, the section setting unit 241 changes the position of the specific section on the time axis in response to the first instruction Q1.
The feature extraction unit 242 in FIG. 2 generates one reference data string R by processing the reference signal Sr of the specific section. The reference data string R is time-series data representing the acoustic features of the reference musical tone in the specific section. As illustrated in FIG. 6, the reference data string R is composed of multiple (for example, 800) pieces of reference data Dr corresponding to the different unit periods within the specific section. In other words, the reference data string R is a time series of reference data Dr.
The reference data Dr represents acoustic feature quantities of the reference musical tone. The acoustic feature quantities are, for example, the amplitude spectral envelope of the reference musical tone. Specifically, the reference data Dr includes the amplitude spectral envelope of the harmonic components of the reference musical tone and the amplitude spectral envelope of the non-harmonic components of the reference musical tone. The amplitude spectral envelopes of the harmonic and non-harmonic components are expressed, for example, as mel-cepstra or MFCC. The data size of the reference data Dr is equal to the data size of the acoustic data Dz; therefore, the data size of one reference data string R is equal to the data size of one acoustic data string Z. Note that the reference data Dr may be data in a different format from the acoustic data Dz. For example, the acoustic feature quantities represented by the reference data Dr and those represented by the acoustic data Dz may be different kinds of feature quantities.
As understood from the above explanation, the feature extraction unit 242 of the first embodiment generates a time series of reference data Dr representing the acoustic features of the reference musical tone (the reference data string R). For example, the feature extraction unit 242 generates the reference data string R by performing calculations including a discrete Fourier transform on the reference signal Sr of the specific section.
The second encoder 243 is a trained statistical model that has learned, through machine learning, the relationship between the reference data string R and the control vector V. That is, the second encoder 243 outputs a control vector V in response to the input of a reference data string R. The control vector generation unit 24 generates the control vector V by processing the reference data string R with the second encoder 243. In other words, the second encoder 243 encodes the reference data string R into the control vector V.
As described above, the control vector V is a vector representing the characteristics of the temporal change in acoustic characteristics (that is, the partial timbre) in the reference signal Sr of the specific section. Because the partial timbre varies with position in the reference signal Sr, the control vector V depends on the position of the specific section on the time axis; that is, the control vector V depends on the first instruction Q1 by which the user designates the specific section. As understood from the above explanation, the control vector generation unit 24 of the first embodiment generates the control vector V in response to the first instruction Q1 from the user.
The control vector processing unit 25 generates the N parameter sets P1 to PN from the control vector V. Because the control vector V represents the partial timbre, each parameter set Pn depends on the partial timbre. Further, because the control vector V depends on the first instruction Q1, each parameter set Pn also depends on the first instruction Q1 from the user.
FIG. 6 is a block diagram illustrating a specific configuration of the second encoder 243 and the control vector processing unit 25. The second encoder 243 includes multiple convolutional layers 411 and an output processing unit 412. The output processing unit 412 is stacked after the final convolutional layer 411 among the multiple convolutional layers 411.
Of the multiple convolutional layers 411, the first convolutional layer 411 receives the reference data string R, and each of the second and subsequent convolutional layers 411 receives the data processed by the preceding convolutional layer 411. Each convolutional layer 411 performs arithmetic processing on the data input to that convolutional layer 411. Like the arithmetic processing by the convolutional layers 312, the arithmetic processing by the convolutional layers 411 includes a convolution operation and, optionally, a pooling operation. The final convolutional layer 411 outputs feature data Dv representing the features of the reference data string R.
The output processing unit 412 generates the control vector V from the feature data Dv. The output processing unit 412 of the first embodiment includes a post-processing unit 413 and a sampling unit 414.
The post-processing unit 413 determines K probability distributions F1 to FK from the feature data Dv. Each of the K probability distributions F1 to FK is, for example, a normal distribution. The post-processing unit 413 outputs the mean and variance of each probability distribution Fk (k = 1 to K). Specifically, the post-processing unit 413 is a trained statistical model that has learned, through machine learning, the relationship between the feature data Dv and each probability distribution Fk. The control vector generation unit 24 determines the K probability distributions F1 to FK by processing the feature data Dv with the post-processing unit 413.
The sampling unit 414 generates the control vector V from the K probability distributions F1 to FK. Specifically, the sampling unit 414 samples an element Ek from each of the K probability distributions F1 to FK. The sampling of each element Ek is, for example, random sampling; that is, each element Ek is a latent variable sampled from the probability distribution Fk. The control vector V is composed of the K elements E1 to EK sampled from the different probability distributions Fk. In other words, the control vector V includes K elements E1 to EK. As understood from the above explanation, the control vector V is a K-dimensional vector representing the characteristics of the temporal change in acoustic characteristics (that is, the partial timbre) in the reference signal Sr of the specific section.
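A minimal sketch of this sampling step, assuming the post-processing unit has produced a mean and a variance for each of the K normal distributions; the dimensionality K and the values below are placeholders:

```python
import numpy as np

def sample_control_vector(means, variances, rng=None):
    """Samples each element Ek from its normal distribution Fk (sketch).

    means, variances: length-K arrays output by the post-processing unit 413.
    Returns the K-dimensional control vector V = (E1, ..., EK).
    """
    rng = np.random.default_rng() if rng is None else rng
    return rng.normal(loc=means, scale=np.sqrt(variances))

K = 16                                                # dimensionality of V (assumption)
V = sample_control_vector(np.zeros(K), np.ones(K))    # one random draw of V
```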
Note that the configuration and processing by which the output processing unit 412 generates the control vector V from the feature data Dv are not limited to the above example. For example, the output processing unit 412 may generate the control vector V without generating the K probability distributions F1 to FK, in which case the post-processing unit 413 and the sampling unit 414 may be omitted.
As illustrated in FIG. 6, the control vector processing unit 25 includes N transformation models 28-1 to 28-N corresponding to the different intermediate layers L. Each transformation model 28-n generates a parameter set Pn from the control vector V. Specifically, each transformation model 28-n is a trained statistical model that has learned, through machine learning, the relationship between the control vector V and the parameter set Pn. Each transformation model 28-n is composed of a multilayer perceptron for generating the parameter set Pn. As understood from the above explanation, the control vector processing unit 25 generates N parameter sets P1 to PN that reflect the partial timbre of the reference musical tone. The N parameter sets P1 to PN are generated from a common control vector V.
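The following sketch shows one possible shape of such a transformation model 28-n: a small multilayer perceptron that maps the control vector V to a parameter set Pn consisting of p1 (H × H) and p2 (H). The dimensions K and H, the hidden size, the number of models, and the output layout are assumptions made only for illustration:

```python
import torch
import torch.nn as nn

class TransformModel(nn.Module):
    """One transformation model 28-n: an MLP from control vector V to Pn (sketch)."""
    def __init__(self, K=16, H=8, hidden=128):
        super().__init__()
        self.H = H
        self.net = nn.Sequential(
            nn.Linear(K, hidden), nn.ReLU(),
            nn.Linear(hidden, H * H + H),      # flattened p1 followed by p2
        )

    def forward(self, v):                      # v: control vector of shape [K]
        out = self.net(v)
        p1 = out[: self.H * self.H].reshape(self.H, self.H)
        p2 = out[self.H * self.H:]
        return p1, p2

# N transformation models, one per intermediate layer, all fed the same V
models = [TransformModel() for _ in range(4)]  # N = 4 assumed for illustration
v = torch.randn(16)
parameter_sets = [m(v) for m in models]        # P1 ... PN
```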
The second encoder 243 and the control vector processing unit 25 together constitute a second generative model 40. The second generative model 40 is a trained statistical model that has learned, through machine learning, the relationship between the reference data string R and the N parameter sets P1 to PN. The second generative model 40 is configured, for example, as a deep neural network.
The second generative model 40 is realized by a combination of a program that causes the control device 11 to execute the operations for generating the control vector V from the reference data string R, and multiple variables (weights and biases) applied to those operations. The program and the variables that realize the second generative model 40 are stored in the storage device 12. The variables of the second generative model 40 are set in advance by machine learning.
FIG. 7 is a flowchart of the process by which the control device 11 generates the audio signal W of the target musical tone (hereinafter referred to as the "musical tone synthesis process Sa"). The musical tone synthesis process Sa is started, for example, in response to an instruction from the user on the operation device 14, and is repeated for each processing period B. The musical tone synthesis process Sa is an example of a "musical sound synthesis method". Before the musical tone synthesis process Sa starts, the section setting unit 241 has already set the specific section in response to the first instruction Q1 from the user, and data representing the specific section is stored in the storage device 12.
When the musical tone synthesis process Sa is started, the control device 11 (control vector generation unit 24) generates the control vector V representing the partial timbre in response to the first instruction Q1 from the user (Sa1). The specific steps for generating the control vector V (Sa11 to Sa13) are as follows.
First, the control device 11 (section setting unit 241) acquires the data representing the specific section from the storage device 12 (Sa11). Specifically, the section setting unit 241 sets the specific section in response to the first instruction Q1 from the user on the operation device 14. The control device 11 (feature extraction unit 242) then processes the reference signal Sr of the specific section to generate one reference data string R (Sa12). Finally, the control device 11 processes the reference data string R with the second encoder 243 to generate the control vector V (Sa13).
The control device 11 (control vector processing unit 25) generates the N parameter sets P1 to PN from the control vector V (Sa2). Specifically, the control device 11 processes the control vector V with each transformation model 28-n to generate the corresponding parameter set Pn.
The control device 11 (control data acquisition unit 21) generates the control data string X by processing the music data M (Sa3). The control device 11 (musical tone synthesis unit 22) generates the acoustic data string Z by processing the control data string X with the first generative model 30 (Sa4). Specifically, the control device 11 generates the intermediate data Y by processing the control data string X with the first encoder 31, and generates the acoustic data string Z by processing the intermediate data Y with the decoder 32. The parameter set Pn is applied to the conversion process of each intermediate layer L of the first generative model 30.
The control device 11 (waveform synthesis unit 23) generates the audio signal W of the target musical tone from the acoustic data string Z (Sa5). The control device 11 supplies the audio signal W to the sound emitting device 15 (Sa6), and the sound emitting device 15 reproduces the target musical tone represented by the audio signal W.
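The overall flow Sa1 to Sa6 can be summarized in pseudocode form as follows; the component names and their interfaces are hypothetical stand-ins for the functional units of FIG. 2, introduced only for this sketch:

```python
def synthesize_one_period(music_data, reference_signal, section, components):
    """One pass of the musical tone synthesis process Sa for a processing period B.

    `components` bundles hypothetical stand-ins for the functional units of
    FIG. 2; the attribute names and call signatures are assumptions.
    """
    r = components.feature_extractor(reference_signal, section)      # Sa12: reference data string R
    v = components.second_encoder(r)                                 # Sa13: control vector V
    parameter_sets = [m(v) for m in components.transform_models]     # Sa2:  P1 ... PN
    x = components.control_data_acquirer(music_data)                 # Sa3:  control data string X
    z = components.first_generative_model(x, parameter_sets)         # Sa4:  acoustic data string Z
    w = components.waveform_synthesizer(z)                           # Sa5:  audio signal W
    return w                                                         # Sa6:  supplied to the sound emitting device
```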
As explained above, in the first embodiment, the control vector V representing the partial timbre of the reference musical tone is generated in response to an instruction from the user (first instruction Q1), the parameter sets Pn are generated from the control vector V, and each parameter set Pn is applied to every unit data Du of the unit data string U input to the corresponding intermediate layer L. Therefore, an acoustic data string Z of a target musical tone having diverse partial timbres corresponding to the user's instruction can be generated.
In the first embodiment, the specific section of the reference musical tone is set in response to the first instruction Q1 from the user, and the control vector V representing the partial timbre in that specific section is generated. Therefore, a target musical tone having the partial timbre of the specific section of the reference musical tone desired by the user can be generated. In particular, in the first embodiment the position of the specific section on the time axis is changed in response to the first instruction Q1, so a target musical tone having the partial timbre at the position of the reference musical tone desired by the user can be generated.
The training processing unit 26 in FIG. 2 establishes the first generative model 30 and the second generative model 40 through machine learning using multiple pieces of training data T. The training processing unit 26 of the first embodiment establishes the first generative model 30 and the second generative model 40 jointly. After they have been established, the first generative model 30 and the second generative model 40 may each be trained individually. FIG. 8 is an explanatory diagram of the machine learning that establishes the first generative model 30 and the second generative model 40.
Each of the multiple pieces of training data T consists of a combination of a training control data string Xt, a training reference data string Rt, and a training acoustic data string Zt. The control data string Xt is time-series data representing the conditions of a target musical tone. Specifically, the control data string Xt represents a time series of score features in a specific section (hereinafter referred to as the "training section") of a training musical piece. The format of the control data string Xt is the same as that of the control data string X.
The reference data string Rt is time-series data representing the acoustic features of musical tones prepared in advance for the training musical piece. The partial timbre represented by the reference data string Rt is the characteristic of the temporal change in acoustic characteristics of the training piece's musical tones in the training section. The format of the reference data string Rt is the same as that of the reference data string R.
The acoustic data string Zt is time-series data representing the acoustic features of the musical tone that the first generative model 30 and the second generative model 40 should generate from the control data string Xt and the reference data string Rt. In other words, the acoustic data string Zt corresponds to the ground truth for the control data string Xt and the reference data string Rt. The format of the acoustic data string Zt is the same as that of the acoustic data string Z.
 FIG. 9 is a flowchart of the process by which the control device 11 establishes the first generative model 30 and the second generative model 40 (hereinafter referred to as the "training process Sb"). The training processing unit 26 in FIG. 8 is realized by the control device 11 executing the training process Sb.
 When the training process Sb is started, the control device 11 prepares a first provisional model 51 and a second provisional model 52 (Sb1). The first provisional model 51 is an initial or provisional model that is updated into the first generative model 30 by machine learning. The initial first provisional model 51 has the same configuration as the first generative model 30, but its multiple variables are set to, for example, random numbers. The second provisional model 52 is an initial or provisional model that is updated into the second generative model 40 by machine learning. The initial second provisional model 52 has the same configuration as the second generative model 40, but its multiple variables are set to, for example, random numbers. The structure of each of the first provisional model 51 and the second provisional model 52 is designed arbitrarily by the designer.
 The control device 11 selects one of the multiple pieces of training data T (hereinafter referred to as the "selected training data T") (Sb2). As illustrated in FIG. 8, the control device 11 generates N parameter sets P1 to PN by processing the reference data sequence Rt of the selected training data T with the second provisional model 52 (Sb3). Specifically, the second provisional model 52 generates a control vector V, and the control vector processing unit 25 generates the N parameter sets P1 to PN. The control device 11 also generates an acoustic data sequence Z by processing the control data sequence Xt of the selected training data T with the first provisional model 51 (Sb4). The N parameter sets P1 to PN generated through the second provisional model 52 are applied in the processing of the control data sequence Xt.
 The control device 11 calculates a loss function representing the error between the acoustic data sequence Z generated by the first provisional model 51 and the acoustic data sequence Zt of the selected training data T (Sb5). The control device 11 updates the multiple variables of the first provisional model 51 and the multiple variables of the second provisional model 52 so that the loss function is reduced (ideally minimized) (Sb6). For example, error backpropagation is used to update each variable in accordance with the loss function.
 The control device 11 determines whether a predetermined termination condition is met (Sb7). The termination condition is, for example, that the loss function falls below a predetermined threshold, or that the amount of change in the loss function falls below a predetermined threshold. If the termination condition is not met (Sb7: NO), the control device 11 selects a piece of unselected training data T as the new selected training data T (Sb2). That is, the process of updating the multiple variables of the first provisional model 51 and the multiple variables of the second provisional model 52 (Sb2 to Sb6) is repeated until the termination condition is met (Sb7: YES). When the above process has been executed for all of the training data T, each piece of training data T is returned to an unselected state and the same process is repeated. That is, each piece of training data T is used repeatedly.
 If the termination condition is met (Sb7: YES), the control device 11 ends the training process Sb. The first provisional model 51 at the time the termination condition is met is finalized as the trained first generative model 30. Likewise, the second provisional model 52 at the time the termination condition is met is finalized as the trained second generative model 40.
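 Purely as a non-limiting illustration of steps Sb2 to Sb7, the following Python (PyTorch) sketch shows a joint update loop in which hypothetical stand-in modules replace the first provisional model 51 and the second provisional model 52; the tensor sizes, the mean squared error loss, and the Adam optimizer are assumptions made for this sketch and are not specified by the present disclosure.

```python
# Minimal sketch of the joint training loop (Sb2-Sb7); the stand-in modules and
# all dimensions are hypothetical, not the actual provisional models 51/52.
import torch
import torch.nn.functional as F

first_model = torch.nn.Linear(8, 4)    # stand-in for the first provisional model 51
second_model = torch.nn.Linear(6, 4)   # stand-in for the second provisional model 52
optimizer = torch.optim.Adam(list(first_model.parameters()) +
                             list(second_model.parameters()), lr=1e-3)

def train(training_data, threshold=1e-3):
    for Xt, Rt, Zt in training_data:          # Sb2: select a piece of training data T
        params = second_model(Rt)             # Sb3: parameters derived from Rt
        Z = first_model(Xt) * params          # Sb4: Xt processed under those parameters
                                              #      (crude stand-in for the intermediate layers)
        loss = F.mse_loss(Z, Zt)              # Sb5: loss between Z and the ground truth Zt
        optimizer.zero_grad()
        loss.backward()                       # Sb6: update the variables of both models
        optimizer.step()
        if loss.item() < threshold:           # Sb7: termination condition
            break
```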
 As can be understood from the above explanation, the first generative model 30 learns the latent relationship between the control data sequence Xt and the acoustic data sequence Zt under the N parameter sets P1 to PN derived from the reference data sequence Rt. Therefore, the trained first generative model 30 outputs an acoustic data sequence Z that is statistically valid under that relationship for an unknown control data sequence X. The second generative model 40 learns the latent relationship between the reference data sequence Rt and the N parameter sets P1 to PN. Specifically, the second generative model 40 learns the relationship between the reference data sequence Rt and the N parameter sets P1 to PN required to generate an appropriate acoustic data sequence Z from the control data sequence Xt. More specifically, the second encoder 243 learns the latent relationship between the reference data sequence Rt and the control vector V, and the control vector processing unit 25 learns the latent relationship between the control vector V and the N parameter sets P1 to PN. Therefore, by using the first generative model 30 and the second generative model 40, an acoustic data sequence Z of a target musical tone to which a desired partial timbre of the reference musical tone is imparted is generated.
B: Second embodiment
 A second embodiment will be described. In each of the embodiments exemplified below, elements whose functions are the same as in the first embodiment are denoted by the same reference numerals as those used in the description of the first embodiment, and detailed descriptions thereof are omitted as appropriate.
 FIG. 10 is a block diagram of the control vector generation unit 24 in the second embodiment. As illustrated in FIG. 10, the control vector generation unit 24 of the second embodiment includes a control vector adjustment unit 244 in addition to the same elements as in the first embodiment (the section setting unit 241, the feature extraction unit 242, and the second encoder 243). The second encoder 243 generates a control vector V in the same manner as in the first embodiment. In the following description, the initial control vector V generated by the second encoder 243 is denoted "control vector V0" for convenience.
 In the second embodiment, as in the first embodiment, the section setting unit 241 sets a specific section of the reference musical tone in response to the first instruction Q1 from the user. Therefore, the control vector V0 in the second embodiment is generated in response to the first instruction Q1 from the user.
 Note that the initial control vector V0 does not have to be a vector generated by the second encoder 243. For example, a vector in which each element Ek is set to a predetermined value (e.g., zero), or a vector in which each element Ek is set to a random number, may be used as the initial control vector V0. Alternatively, the final control vector V from the previous execution of the musical tone synthesis process Sa may be adopted as the current initial control vector V0. As can be understood from the above explanation, the elements for generating the control vector V0 (the section setting unit 241, the feature extraction unit 242, and the second encoder 243) may be omitted in the second embodiment.
 The control vector adjustment unit 244 generates the control vector V by adjusting the initial control vector V0. Specifically, the control vector adjustment unit 244 changes one or more elements Ek of the K elements E1 to EK of the control vector V0 in response to a second instruction Q2 from the user via the operation device 14. A K-dimensional vector consisting of the K elements E1 to EK after the change is supplied to the control vector processing unit 25 as the control vector V. As can be understood from the above explanation, the control vector generation unit 24 of the second embodiment generates the control vector V in response to the first instruction Q1 and the second instruction Q2 from the user.
 FIG. 11 is a schematic diagram of the setting screen Gb. The setting screen Gb is a screen with which the user instructs changes to each element Ek. The control vector adjustment unit 244 displays the setting screen Gb on the display device 13. The setting screen Gb includes K operators Gb-1 to Gb-K corresponding to the different elements Ek of the control vector V. The K operators Gb-1 to Gb-K are arranged in the horizontal direction. The operator Gb-k corresponding to each element Ek is an image that accepts operations by the user. Specifically, each operator Gb-k is, for example, a slider that moves up and down in response to an operation by the user. The second instruction Q2 from the user is, for example, an operation that moves any of the K operators Gb-1 to Gb-K. In other words, the second instruction Q2 is an instruction from the user that individually specifies the numerical value of each element Ek. The numerical value of the element Ek is displayed near each operator Gb-k.
 The vertical position of each operator Gb-k corresponds to the numerical value of the element Ek. That is, moving the operator Gb-k upward means an increase in the element Ek, and moving the operator Gb-k downward means a decrease in the element Ek. The control vector adjustment unit 244 sets the initial position of each operator Gb-k in accordance with the numerical value of each element Ek of the control vector V0. The control vector adjustment unit 244 then changes the numerical value of the element Ek in accordance with the user's operation of moving each operator Gb-k (that is, the second instruction Q2). That is, the control vector adjustment unit 244 sets the element Ek corresponding to each operator Gb-k in accordance with the user's operation of one or more of the K operators Gb-1 to Gb-K.
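 The mapping between the operators Gb-k and the elements Ek could, for example, be realized as in the following hypothetical sketch; the value range of each element and the normalization of the slider positions are assumptions introduced only for illustration.

```python
# Hypothetical sketch: map K on-screen sliders to the K elements of the control vector V.
import numpy as np

E_MIN, E_MAX = -1.0, 1.0          # assumed value range of each element Ek

def sliders_from_vector(v0):
    """Initial slider positions (0..1) derived from the initial control vector V0."""
    return (v0 - E_MIN) / (E_MAX - E_MIN)

def vector_from_sliders(positions):
    """Control vector V rebuilt from the current slider positions (second instruction Q2)."""
    positions = np.clip(np.asarray(positions), 0.0, 1.0)
    return E_MIN + positions * (E_MAX - E_MIN)

v0 = np.zeros(8)                        # example: K = 8, all elements initially zero
sliders = sliders_from_vector(v0)       # sliders start at the midpoint
sliders[2] = 0.9                        # the user raises operator Gb-3
v = vector_from_sliders(sliders)        # element E3 increases accordingly
```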
 As mentioned above, the control vector V represents the partial timbre. Therefore, the change of each element Ek by the control vector adjustment unit 244 is a process of changing the partial timbre in response to the second instruction Q2 from the user. That is, the temporal change in the acoustic characteristics imparted to the target musical tone (i.e., the partial timbre) changes in response to the second instruction Q2 from the user. The control vector processing unit 25 generates the N parameter sets P1 to PN from the control vector V after adjustment by the control vector adjustment unit 244.
 FIG. 12 is a flowchart of the musical tone synthesis process Sa in the second embodiment. The generation of the control vector V (Sa1) in the second embodiment includes, in addition to the same steps (Sa11 to Sa13) as in the first embodiment, the adjustment of the control vector V0 (Sa14). In the adjustment of the control vector V0 (Sa14), the control device 11 (control vector adjustment unit 244) generates the control vector V by changing one or more elements Ek of the K elements E1 to EK of the initial control vector V0 in response to the second instruction Q2 from the user. The operations other than the adjustment of the control vector V0 (Sa14) are the same as in the first embodiment. The second instruction Q2 is given by the user at any timing in parallel with the musical tone synthesis process Sa.
 The second embodiment achieves the same effects as the first embodiment. Furthermore, in the second embodiment, one or more elements Ek of the K elements E1 to EK of the control vector V0 are changed in response to the second instruction Q2 from the user. Therefore, a variety of target musical tones having partial timbres corresponding to the second instruction Q2 from the user can be generated. In particular, in the second embodiment, the user can easily adjust the partial timbre by operating each operator Gb-k.
C: Third embodiment
 The control vector generation unit 24 of the third embodiment generates a control vector V for each unit period on the time axis. That is, the control vector generation unit 24 generates a time series of control vectors V in response to instructions from the user (the first instruction Q1 and the second instruction Q2). In the following description, a configuration in which the control vector V is generated in response to both the first instruction Q1 and the second instruction Q2 is exemplified, but the control vector V may be generated in response to only one of the first instruction Q1 and the second instruction Q2.
 The control vector generation unit 24 of the third embodiment generates the control vector V in response to the first instruction Q1 and the second instruction Q2 from the user, as in the second embodiment. As described above, in the third embodiment the control vector V is generated for each unit period, so the control vector V changes from one unit period to the next within a single processing period B. Therefore, the partial timbre imparted to the target musical tone changes partway through the processing period B.
 For example, before the execution of the musical tone synthesis process Sa, the user can designate, by the first instruction Q1, a specific section for any time (unit period) of the target piece of music. For a unit period for which a specific section is designated, the control vector V is generated in the same manner as in the embodiments described above. For a unit period for which no specific section is designated (hereinafter referred to as a "target period"), the control vector V is generated by interpolation between the two control vectors V generated for unit periods before and after the target period. For example, the control vector generation unit 24 generates the control vector V of the target period by interpolating between the control vector V corresponding to the specific section designated immediately before the target period (e.g., one or more unit periods in the past) and the control vector V corresponding to the specific section designated immediately after the target period (e.g., one or more unit periods in the future). Any method of interpolating the control vectors V may be used; for example, interpolation between the two vectors (as opposed to extrapolation) is used.
 The control vector generation unit 24 also generates the time series of control vectors V by detecting, for each unit period, the second instruction Q2 given by the user in parallel with the musical tone synthesis process Sa. Alternatively, the control vector generation unit 24 may generate the time series of control vectors V by detecting the second instruction Q2 at a cycle longer than the unit period, and may generate the control vector V for each unit period by a process of smoothing the time series of control vectors V along the time axis (i.e., a low-pass filter).
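 As a non-authoritative sketch of the two mechanisms described above, the interpolation for target periods and the low-pass smoothing of a coarsely sampled control-vector time series might look as follows; linear interpolation and a moving-average filter are assumed choices, since the disclosure leaves both methods open.

```python
# Sketch: build a per-unit-period time series of control vectors V.
# Linear interpolation and a moving-average low-pass filter are assumed choices.
import numpy as np

def interpolate_vectors(anchors, num_periods):
    """anchors: {unit-period index: control vector V}; returns one vector per unit period."""
    idx = sorted(anchors)
    out = np.empty((num_periods, len(anchors[idx[0]])))
    for t in range(num_periods):
        prev = max([i for i in idx if i <= t], default=idx[0])
        nxt = min([i for i in idx if i >= t], default=idx[-1])
        if prev == nxt:
            out[t] = anchors[prev]
        else:                                  # target period: interpolate between neighbors
            w = (t - prev) / (nxt - prev)
            out[t] = (1.0 - w) * anchors[prev] + w * anchors[nxt]
    return out

def smooth_vectors(series, width=5):
    """Moving average along the time axis (a simple low-pass filter)."""
    kernel = np.ones(width) / width
    return np.stack([np.convolve(series[:, k], kernel, mode="same")
                     for k in range(series.shape[1])], axis=1)
```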
 The control vector processing unit 25 of the third embodiment generates N parameter sets P1 to PN from the control vector V of each unit period. The control vector processing unit 25 generates N parameter sets P1 to PN for each unit period on the time axis. That is, the control vector processing unit 25 generates a time series of each parameter set Pn.
 As described above, the control vector V changes in response to the first instruction Q1 or the second instruction Q2. Therefore, the N parameter sets P1 to PN in the unit period immediately before the first instruction Q1 or the second instruction Q2 differ from the N parameter sets P1 to PN in the unit period immediately after it. That is, the parameter set Pn changes within a single processing period B. While no first instruction Q1 or second instruction Q2 is given, the same parameter set Pn is generated over multiple unit periods.
 As illustrated in FIG. 3, the number of pieces of unit data Du constituting one unit data sequence U changes at each stage of processing in the first generative model 30. The conversion process performed by one intermediate layer L uses a number of parameter sets Pn corresponding to the number of pieces of unit data Du supplied to that intermediate layer L. That is, the conversion model 28-n generates a time series of parameter sets Pn equal in number to the pieces of unit data Du processed by the n-th intermediate layer L.
 FIG. 13 is an explanatory diagram of the conversion process executed by each intermediate layer L. In the first embodiment, a conversion process applying a common parameter set Pn is executed on each of the multiple pieces of unit data Du constituting the unit data sequence U (FIG. 4). In the third embodiment, a conversion process applying an individual parameter set Pn is executed on each of the multiple pieces of unit data Du constituting the unit data sequence U.
 FIG. 13 shows, of the unit data sequence U, the unit data Du(t1) corresponding to time t1 and the unit data Du(t2) corresponding to time t2 (t2 ≠ t1). The parameter set Pn(t1) is applied to the conversion process of the unit data Du(t1), and the parameter set Pn(t2) is applied to the conversion process of the unit data Du(t2). The parameter set Pn(t1) and the parameter set Pn(t2) are generated separately. Specifically, the parameter set Pn(t1) is generated from the control vector V(t1) corresponding to time t1, and the parameter set Pn(t2) is generated from the control vector V(t2) corresponding to time t2. Therefore, the numerical values of the first parameter p1 and the second parameter p2 may differ between the parameter set Pn(t1) and the parameter set Pn(t2). As exemplified above, the parameter set Pn applied to the conversion process changes at points partway through the unit data sequence U.
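 The per-time-step application of the parameter sets can be illustrated by the following minimal sketch; the array shapes, the channel count, and the point at which the parameters change are assumptions made only for this example.

```python
# Sketch: time-varying conversion process inside an intermediate layer L.
# Each time step t of the unit data sequence U gets its own parameter set Pn(t).
import numpy as np

def transform(unit_data, p1, p2):
    """unit_data: (T, C) unit data sequence; p1, p2: (T, C) per-time-step parameters."""
    return unit_data * p1 + p2          # multiply by the first parameter, add the second

T, C = 100, 16                          # assumed number of time steps and channels
U = np.random.randn(T, C)
p1 = np.ones((T, C))
p2 = np.zeros((T, C))
p1[50:] = 1.5                           # the parameter set changes midway through U,
p2[50:] = 0.2                           # e.g. in response to instruction Q1 or Q2
U_out = transform(U, p1, p2)
```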
 The third embodiment achieves the same effects as the second embodiment. Furthermore, in the third embodiment, a time series of control vectors V is generated in response to instructions from the user (the first instruction Q1 and the second instruction Q2), and a time series of each parameter set Pn is generated from the time series of control vectors V. Therefore, a variety of target musical tones whose timbre changes at points partway through the control data sequence X can be generated.
D: Fourth embodiment
 FIG. 14 is a block diagram illustrating the configuration of the first generative model 30 (musical tone synthesis unit 22) in the fourth embodiment. The first generative model 30 of the fourth embodiment is an autoregressive (AR) generative model including a conversion processing unit 61, a convolution layer 62, N unit processing units 63-1 to 63-N, and a synthesis processing unit 64. As described later, the first generative model 30 has an arbitrary number (Nx) of intermediate layers; if all of the intermediate layers are omitted, it becomes equivalent to the generative model (NPSS) disclosed in the 2017 Applied Sciences paper "A Neural Parametric Singing Synthesizer Modeling Timbre and Expression from Natural Songs" by Merlijn Blaauw and Jordi Bonada. The configuration other than the first generative model 30 is the same as in the first embodiment. Each parameter set Pn generated by the control vector processing unit 25 (conversion model 28-n) is supplied to the unit processing unit 63-n.
 Like the pre-processing unit 311, the conversion processing unit 61 generates latent data d from the control data Dx acquired by the control data acquisition unit 21 for each unit period. The latent data d represents the features of the control data Dx. For example, the conversion processing unit 61 is configured as a multi-layer perceptron that converts the control data Dx into the latent data d. The latent data d may be supplied in common to the N unit processing units 63-1 to 63-N, or different data may be supplied to each individually. Note that the control data Dx acquired by the control data acquisition unit 21 may itself be supplied to each unit processing unit 63-n as the latent data d. That is, the conversion processing unit 61 may be omitted.
 FIG. 15 is a block diagram of each unit processing unit 63-n. Each unit processing unit 63-n is a generative model that generates output data O and processed data Cn by processing input data I, the latent data d, and the parameter set Pn. The input data I includes first data Ia and second data Ib. The unit processing unit 63-n includes a dilated convolution layer 65, an intermediate layer L, and a processing layer 67.
 The dilated convolution layer 65 generates unit data Du1 by performing dilated convolution on the input data I (the first data Ia and the second data Ib).
 The intermediate layer L generates unit data Du2 by performing a conversion process on the unit data Du1. The content of the conversion process is the same as in the first embodiment. The parameter set Pn is applied in the conversion process. Note that the intermediate layer L does not have to be installed in all of the N unit processing units 63-1 to 63-N. That is, the intermediate layer L is installed in Nx (one or more) of the N unit processing units 63-1 to 63-N. Here, the description assumes that the intermediate layer L is installed in all of the N unit processing units.
 The processing layer 67 generates the output data O and the processed data Cn from the unit data Du2 and the latent data d. Specifically, the processing layer 67 includes a convolution layer 671, an adder 672, an activation layer 673, an activation layer 674, a multiplier 675, a convolution layer 676, a convolution layer 677, and an adder 678.
 The convolution layer 671 performs a 1×1 convolution operation on the latent data d. The adder 672 generates unit data Du3 by adding the unit data Du2 and the output data of the convolution layer 671. The unit data Du3 is divided into a first portion and a second portion. The activation layer 673 processes the first portion of the unit data Du3 with an activation function (e.g., a tanh function). The activation layer 674 processes the second portion of the unit data Du3 with an activation function (e.g., a sigmoid function). The multiplier 675 generates unit data Du4 by computing the element-wise product of the output data of the activation layer 673 and the output data of the activation layer 674. The unit data Du4 is thus the data obtained by applying a gated activation function (673 to 675) to the output of the dilated convolution layer 65. Here, each of the unit data Du1 to Du3 includes a first portion and a second portion; however, when a general (ungated) activation function such as a sigmoid, tanh, or ReLU is used, the unit data Du1 to Du3 need only include the first portion.
 The convolution layer 676 generates the processed data Cn by performing a 1×1 convolution operation on the unit data Du4. The convolution layer 677 performs a 1×1 convolution operation on the unit data Du4. The adder 678 generates the output data O by adding the first data Ia and the output data of the convolution layer 677. The output data O is stored in the storage device 12.
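 For reference, a minimal sketch of the data flow of FIG. 15 is shown below, with every convolution reduced to a per-time-step linear map for brevity; the shapes of the weight matrices and this simplification are assumptions, not the actual layer implementation.

```python
# Sketch of one unit processing unit 63-n (forward pass only, shapes assumed).
# Convolutions are reduced to 1x1 (per-time-step) linear maps for brevity.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def unit_processing(Ia, Ib, d, p1, p2, W_dil, W_cond, W_skip, W_res):
    # dilated convolution 65: combine the current input Ia and the dilated past input Ib
    Du1 = Ia @ W_dil[0] + Ib @ W_dil[1]
    # intermediate layer L: conversion process applying the parameter set Pn
    Du2 = Du1 * p1 + p2
    # processing layer 67: conditioning on the latent data d, then gated activation
    Du3 = Du2 + d @ W_cond
    half = Du3.shape[-1] // 2
    Du4 = np.tanh(Du3[..., :half]) * sigmoid(Du3[..., half:])
    Cn = Du4 @ W_skip                  # processed data Cn (skip output)
    O = Ia + Du4 @ W_res               # output data O (residual connection)
    return O, Cn
```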
 The synthesis processing unit 64 in FIG. 14 generates the acoustic data Dz by processing the N pieces of processed data C1 to CN generated by the different unit processing units 63-n. For example, the synthesis processing unit 64 generates the acoustic data Dz based on a weighted sum of the N pieces of processed data C1 to CN. The generation of the acoustic data Dz by the synthesis processing unit 64 is repeated for each unit period. That is, the synthesis processing unit 64 generates a time series of acoustic data Dz. The acoustic data Dz generated by the synthesis processing unit 64 is supplied to the waveform synthesis unit 23 as in the first embodiment, and is also stored in the storage device 12 and used by the convolution layer 62.
 The convolution layer 62 generates unit data Du0 for each unit period by a causal convolution operation on the acoustic data Dz generated in the immediately preceding multiple unit periods. The unit data Du0 is supplied to the first-stage unit processing unit 63-1 as the input data I. The first data Ia supplied to the unit processing unit 63-1 in each unit period is the unit data Du0 generated in the current unit period, and the second data Ib supplied to the unit processing unit 63-1 in each unit period is the unit data Du0 generated in the immediately preceding unit period. As described above, the first data Ia and the second data Ib corresponding to different unit periods are supplied to the unit processing unit 63-1.
 In each unit period on the time axis, each unit processing unit 63-n of the second and subsequent stages is supplied, as the first data Ia and the second data Ib, with multiple pieces of output data O that the preceding unit processing unit 63-(n-1) generated for different unit periods. For example, the unit processing unit 63-2 is supplied, as the first data Ia, with the output data O generated by the unit processing unit 63-1 in the current unit period, and, as the second data Ib, with the output data O generated by the unit processing unit 63-1 two unit periods earlier (dilation = 2). The unit processing unit 63-3 is supplied, as the first data Ia, with the output data O generated by the unit processing unit 63-2 in the current unit period, and, as the second data Ib, with the output data O generated by the unit processing unit 63-2 four unit periods earlier (dilation = 4).
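 The dilation pattern described above (1, 2, 4, ...) and the aggregation of the skip outputs can be illustrated by the following crude sketch; the number of stages, the channel count, the stand-in nonlinearity, and the plain averaging in place of a learned weighted sum are all assumptions made for illustration.

```python
# Sketch: autoregressive stack with dilations 1, 2, 4, ... (stages and shapes assumed).
# Each stage n reads its current input and its input from 2**(n-1) unit periods ago,
# and the skip outputs C1..CN are aggregated into one acoustic data frame Dz.
import numpy as np

N = 4                                                 # assumed number of unit processing units
inputs = [[np.zeros(8)] * (2 ** n) for n in range(N)]  # past inputs needed by each stage

def step(du0):
    """One unit period: du0 is the unit data Du0 from the causal convolution layer 62."""
    skips = []
    x = du0                                   # first data Ia for stage 1
    for n in range(N):
        dilation = 2 ** n                     # dilation = 1, 2, 4, 8, ...
        past = inputs[n][-dilation]           # second data Ib: input from `dilation` periods ago
        inputs[n].append(x)                   # remember the current input for later periods
        out = np.tanh(x + past)               # crude stand-in for unit processing unit 63-(n+1)
        skips.append(out)                     # processed data Cn (skip output)
        x = out                               # becomes first data Ia of the next stage
    return np.mean(skips, axis=0)             # synthesis processing 64 (simple average assumed)

dz = step(np.random.randn(8))                 # one acoustic data frame Dz per unit period
```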
 As described above, the first generative model 30 of the fourth embodiment includes N intermediate layers L corresponding to the different unit processing units 63-n. The convolution layer 62, together with the dilated convolution layer 65 and the processing layer 67 of each unit processing unit 63-n, constitutes the basic layers required for generating the time series of acoustic data Dz. That is, the first generative model 30 of the fourth embodiment includes multiple basic layers and one or more intermediate layers L, as in the first embodiment. Therefore, the fourth embodiment also achieves the same effects as the first embodiment.
E: Fifth embodiment
 In the first to fourth embodiments, the case in which the target musical tone is a singing voice has been exemplified. The musical sound synthesis system 100 of the fifth embodiment synthesizes, as the target musical tone, an instrument sound to be produced by performance of the target piece of music.
 The control data Dx in the first to fourth embodiments includes the pitch (fundamental frequency) of the target musical tone, information indicating voiced/unvoiced, and phoneme information. The control data Dx in the fifth embodiment is a musical score feature for instrument sounds that includes, instead of the voiced/unvoiced information and the phoneme information, the intensity (volume) and the performance style of the target musical tone. The performance style is, for example, information representing the playing technique of the instrument. When the target musical tone is an instrument sound, an instrument sound is also used as the reference musical tone. That is, the partial timbre is a characteristic of the temporal change in the acoustic characteristics of the instrument sound.
 The first generative model 30 and the second generative model 40 of the fifth embodiment are established, in the machine learning of FIG. 8, by training with training data T for instrument sounds (control data sequence Xt, reference data sequence Rt, and acoustic data sequence Zt). The first generative model 30 is a trained statistical model that has learned the relationship between the conditions on the musical score of the target instrument sound (control data sequence X) and the acoustic features of the target instrument sound (acoustic data sequence Z). The musical tone synthesis unit 22 then generates the acoustic data sequence Z of the instrument sound by processing the control data sequence X for the instrument sound with the first generative model 30.
 The fifth embodiment, in which each of the first to fourth embodiments is applied to the generation of instrument sounds, also achieves the same effects as the first to fourth embodiments. As can be understood from the examples of each of the above embodiments, a "musical tone" in the present disclosure means a musical sound such as a singing voice or an instrument sound.
F: Modifications
 Specific modifications that may be added to each of the embodiments exemplified above are described below. Two or more aspects arbitrarily selected from the following examples may be combined as appropriate to the extent that they do not contradict one another.
 (1) In the first to third embodiments, configurations in which the first encoder 31 includes the pre-processing unit 311 have been exemplified, but the pre-processing unit 311 may be omitted. For example, the control data sequence X may be supplied directly from the control data acquisition unit 21 to the first-stage convolution layer 321 of the first encoder 31. Likewise, in each of the embodiments described above, configurations in which the decoder 32 includes the post-processing unit 322 have been exemplified, but the post-processing unit 322 may be omitted. For example, the acoustic data sequence Z output by the final-stage intermediate layer L may be supplied directly to the waveform synthesis unit 23.
 (2) In each of the embodiments described above, the position on the time axis of the specific section in the reference signal Sr is changed in response to the first instruction Q1 from the user, but the configuration for reflecting the first instruction Q1 in the control vector V is not limited to this example. In a configuration in which multiple reference signals Sr representing different reference musical tones are stored in the storage device 12, the control device 11 (section setting unit 241) may select one of the multiple reference signals Sr in response to the first instruction Q1 from the user. The control device 11 generates the reference data sequence R from the specific section of the reference signal Sr selected in response to the first instruction Q1.
 (3) In the second embodiment, the configuration for changing each element Ek of the control vector V in response to the second instruction Q2 from the user is not limited to the example given above.
 For example, multiple pieces of preset data for the control vector V0 may be stored in the storage device 12. Each piece of preset data specifies each of the K elements E1 to EK of the control vector V0. The user can select one of the multiple pieces of preset data by operating the operation device 14. The control vector adjustment unit 244 uses the preset data selected by the user as the initial control vector V0 to be adjusted. In this configuration, the instruction to select and recall one of the multiple pieces of preset data corresponds to the second instruction Q2.
 In the embodiments described above, the position of each operator Gb-k corresponds to the numerical value of the element Ek, but the position of each operator Gb-k may instead correspond to the amount of change in the element Ek. In that case, the control vector adjustment unit 244 sets the numerical value of the element Ek to a value obtained by changing the initial value in the control vector V0 by the amount of change corresponding to the position of the operator Gb-k.
 (4) In each of the embodiments described above, a configuration in which the musical sound synthesis system 100 includes the training processing unit 26 has been exemplified for convenience, but the training processing unit 26 may be mounted on a machine learning system separate from the musical sound synthesis system 100. The first generative model 30 and the second generative model 40 established by the machine learning system are provided to the musical sound synthesis system 100 and used in the musical tone synthesis process Sa.
 (5) In each of the embodiments described above, the audio signal W is generated with the processing period B as the temporal unit. In each embodiment, multiple processing periods B that are consecutive on the time axis may partially overlap one another on the time axis, as illustrated in FIG. 16. The temporal relationship between the processing periods B is not limited to the example in FIG. 16.
 As in the embodiments described above, the audio signal W is generated sequentially for each processing period B on the time axis. The audio signals W within the valid period b of each processing period B are added together (e.g., by a weighted average) between processing periods B that are consecutive on the time axis, whereby the final audio signal is generated. The valid period b is a period contained within the processing period B. Specifically, the valid period b is the period obtained by excluding from the processing period B a period of predetermined length that includes the start point of the processing period B and a period of predetermined length that includes the end point of the processing period B. According to the configuration of FIG. 16, the discontinuity of the waveform of the audio signal W at the edges (the start and end) of each processing period B is reduced, and as a result an audio signal with a continuous waveform that sounds natural to the ear can be generated.
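 A minimal sketch of this overlap-add procedure is given below; the triangular (Bartlett) weighting inside each processing period is an assumed choice, since the disclosure only requires some mutual addition such as a weighted average.

```python
# Sketch: overlap-add of the audio signals W generated per processing period B.
# Triangular (Bartlett) cross-fade weights are an assumed choice of weighting.
import numpy as np

def overlap_add(segments, hop):
    """segments: list of equally long signals W; hop: start-to-start distance in samples."""
    length = len(segments[0])
    window = np.bartlett(length)                      # weights fading in/out at the edges
    total = np.zeros(hop * (len(segments) - 1) + length)
    norm = np.zeros_like(total)
    for i, seg in enumerate(segments):
        start = i * hop
        total[start:start + length] += window * seg   # weighted addition over the overlap
        norm[start:start + length] += window
    return total / np.maximum(norm, 1e-9)             # weighted average where periods overlap
```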
 (6) In each of the embodiments described above, virtual operators Gb-k displayed on the display device 13 have been exemplified, but the operators Gb-k that accept instructions to change each element Ek may be physical operators that the user can actually touch.
 (7) The conversion process executed by each intermediate layer L is not limited to the processes exemplified in the embodiments described above. For example, one of the multiplication by the first parameter p1 and the addition of the second parameter p2 may be omitted. In a configuration in which the conversion process does not include the addition of the second parameter p2, the parameter set Pn consists of the first parameter p1 only. In a configuration in which the conversion process does not include the multiplication by the first parameter p1, the parameter set Pn consists of the second parameter p2 only. That is, the parameter set Pn is expressed as a variable including one or more parameters.
 (8) In each of the embodiments described above, the first generative model 30 including the first encoder 31 and the decoder 32 has been exemplified, but the configuration of the first generative model 30 is not limited to this example. The first generative model 30 is comprehensively expressed as a model that has learned the relationship between the conditions of the target musical tone (control data sequence X) and the acoustic features of the target musical tone (acoustic data sequence Z). Therefore, a model of any structure including one or more intermediate layers L capable of executing the conversion process may be used as the first generative model 30.
 Similarly, the configuration of the second generative model 40 is not limited to the examples in the embodiments described above. For example, in the embodiments described above, a configuration in which the sampling unit 414 samples each element Ek of the control vector V from the corresponding probability distribution Fk has been exemplified, but the control vector V may instead be generated by the multiple convolution layers 411. That is, the output processing unit 412 in the second encoder 243 may be omitted.
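 Assuming, only for the purpose of this sketch, that each probability distribution Fk is a normal distribution whose mean and variance are output by the second encoder 243, the sampling of the control vector V could look as follows; the distribution family and the encoder outputs named here are assumptions.

```python
# Sketch: sample each element Ek of the control vector V from a probability distribution Fk,
# assuming Fk is a normal distribution with mean mu[k] and standard deviation exp(log_sigma[k]).
import numpy as np

def sample_control_vector(mu, log_sigma, rng=np.random.default_rng()):
    sigma = np.exp(log_sigma)
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps              # one sample Ek per distribution Fk

mu = np.zeros(8)                         # example: K = 8 (hypothetical encoder outputs)
log_sigma = np.full(8, -1.0)
V = sample_control_vector(mu, log_sigma)
```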
 (9) In the second embodiment, a configuration in which the control vector V is generated in response to the first instruction Q1 and the second instruction Q2 has been exemplified, but in a configuration including the control vector adjustment unit 244, the configuration for accepting the first instruction Q1 may be omitted. Specifically, the section setting unit 241 may set the specific section of the reference signal Sr independently of instructions from the user. For example, the section setting unit 241 sets, as the specific section, a section in which the acoustic characteristics of the reference signal Sr satisfy a specific condition, such as a section in which acoustic characteristics like timbre fluctuate significantly. Alternatively, the entire reference signal Sr may be used as the specific section. In a configuration in which the entire reference signal Sr is used as the specific section, the section setting unit 241 may be omitted.
 (10) In the first to third embodiments, configurations in which the first generative model 30 includes N1 encoding intermediate layers Le and N2 decoding intermediate layers Ld have been exemplified, but the encoding intermediate layers Le or the decoding intermediate layers Ld may be omitted. For example, a configuration in which the first encoder 31 of the first generative model 30 includes no encoding intermediate layer Le, or a configuration in which the decoder 32 includes no decoding intermediate layer Ld, is also conceivable. As described above, each intermediate layer L executes the conversion process. Therefore, a configuration in which the first encoder 31 does not execute the conversion process, or a configuration in which the decoder 32 does not execute the conversion process, is also conceivable.
 In a configuration in which the encoding intermediate layers Le are omitted, the first generative model 30 includes N2x decoding intermediate layers Ld. As described above, the number N2x of decoding intermediate layers Ld is a natural number not exceeding N2. In a configuration in which the decoding intermediate layers Ld are omitted, the first generative model 30 includes N1x encoding intermediate layers Le. As described above, the number N1x of encoding intermediate layers Le is a natural number not exceeding N1.
 As can be understood from the above examples, the number Nx of intermediate layers L in the first generative model 30 of the first to fourth embodiments is a natural number of 1 or more and N or less. That is, the first generative model 30 is comprehensively expressed as a model including multiple basic layers and one or more intermediate layers L. The intermediate layers L are included in one or both of the first encoder 31 and the decoder 32. That is, the conversion process is executed in at least one place in the first generative model 30.
 (11) The musical sound synthesis system 100 may be realized by a server device that communicates with an information device such as a smartphone or tablet terminal. For example, the musical sound synthesis system 100 generates the audio signal W from the music data M and the reference signal Sr received from the information device, and transmits the audio signal W to the information device. In a configuration in which the acoustic data sequence Z generated by the musical tone synthesis unit 22 is transmitted from the musical sound synthesis system 100 to the information device, the waveform synthesis unit 23 may be omitted from the musical sound synthesis system 100. The information device generates the audio signal from the acoustic data sequence Z received from the musical sound synthesis system 100.
 The control data sequence X may also be transmitted from the information device to the musical sound synthesis system 100 instead of the music data M. The control data acquisition unit 21 receives the control data sequence X transmitted from the information device. "Receiving" the control data Dx (control data sequence X) is an example of "acquiring" the control data Dx.
 (12) As described above, the functions of the musical sound synthesis system 100 exemplified above are realized by the cooperation of the single or multiple processors constituting the control device 11 and the program stored in the storage device 12. The program according to the present disclosure may be provided in a form stored in a computer-readable recording medium and installed in a computer. The recording medium is, for example, a non-transitory recording medium; an optical recording medium (optical disc) such as a CD-ROM is a good example, but any known type of recording medium such as a semiconductor recording medium or a magnetic recording medium is also included. A non-transitory recording medium includes any recording medium other than a transitory, propagating signal, and volatile recording media are not excluded. In a configuration in which a distribution device distributes the program via a communication network, the storage device that stores the program in the distribution device corresponds to the non-transitory recording medium described above.
G: Supplementary notes
 From the embodiments exemplified above, the following configurations, for example, can be understood.
 A musical sound synthesis method according to one aspect (aspect 1) of the present disclosure is a musical sound synthesis method realized by a computer system, in which a time series of control data representing the conditions of a target musical tone is acquired, and a time series of acoustic data representing the acoustic features of the target musical tone is generated by processing the time series of control data with a trained generative model that includes multiple basic layers and one or more intermediate layers and has learned the relationship between the conditions of musical tones and the acoustic features of those musical tones. In the method, a control vector representing the characteristics of the temporal change in timbre is generated in response to an instruction from a user, and a first parameter set is generated from the control vector. A first intermediate layer of the one or more intermediate layers executes a process that applies the first parameter set to the data input to the first intermediate layer, and outputs the resulting data to the next layer.
 In the above aspect, a control vector representing the characteristics of the temporal change in timbre (partial timbre) is generated in response to an instruction from the user, a first parameter set is generated from the control vector, and the first parameter set is applied to the data input to the first intermediate layer. Therefore, a time series of acoustic data of target musical tones having a variety of partial timbres corresponding to the instruction from the user can be generated.
 A "target musical tone" is the musical tone to be synthesized. A "musical tone" is a sound related to music. For example, a musical sound such as the singing voice of a singer or the performance sound of an instrument is an example of a "musical tone".
 「制御データ」は、目標楽音の条件を表す任意の形式のデータである。例えば、楽音の楽譜(すなわち音符列)を表す楽曲データの特徴量(楽譜特徴量)を表すデータが「制御データ」の一例である。制御データが表す楽譜特徴量の種類は任意である。例えば、特許文献1と同様の楽譜特徴量が利用される。 "Control data" is data in any format that represents the conditions of a target musical tone. For example, data that represents the features (score features) of music data that represents the musical score of a musical tone (i.e., a sequence of notes) is an example of "control data". The type of score features represented by the control data is arbitrary. For example, score features similar to those in Patent Document 1 are used.
In a specific example (aspect 2) of aspect 1, in generating the control vector, a time series of the control vector is generated in response to the instruction from the user, and in generating the first parameter set, a time series of the first parameter set is generated from the time series of the control vector. In the above aspect, a time series of the control vector is generated in response to the instruction from the user, and a time series of the first parameter set is generated from the time series of the control vector. Therefore, it is possible to generate a variety of target musical sounds whose timbre changes at points partway through the time series of the control data.
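The following minimal sketch illustrates how a time series of control vectors could yield a time series of first parameter sets, here by cross-fading between two partial-timbre vectors over the frames. The interpolation scheme, the dimensions, and the single linear map standing in for the transformation model are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
D_CTRL, D_HIDDEN, N_FRAMES = 8, 64, 200

# Two control vectors representing different partial timbres (assumed values).
v_start = rng.standard_normal(D_CTRL)
v_end = rng.standard_normal(D_CTRL)

# Control-vector time series: cross-fade from v_start to v_end over the frames,
# so the timbre changes partway through the time series of control data.
t = np.linspace(0.0, 1.0, N_FRAMES)[:, None]
control_vectors = (1.0 - t) * v_start + t * v_end            # (N_FRAMES, D_CTRL)

# Hypothetical transformation model: one linear map applied frame by frame.
W = rng.standard_normal((2 * D_HIDDEN, D_CTRL)) * 0.01
param_sets = control_vectors @ W.T                             # (N_FRAMES, 2 * D_HIDDEN)
scales, shifts = param_sets[:, :D_HIDDEN], param_sets[:, D_HIDDEN:]
print(scales.shape, shifts.shape)  # (200, 64) (200, 64)
```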
In a specific example (aspect 3) of aspect 1 or aspect 2, a second parameter set is further generated from the control vector, and a second intermediate layer of the one or more intermediate layers executes processing in which the second parameter set is applied to the data input to the second intermediate layer, and outputs the data after the application to the next layer. In the above aspect, in addition to the first parameter set being applied to the data input to the first intermediate layer, the second parameter set is applied to the data input to the second intermediate layer. Therefore, a time series of acoustic data of a target musical sound having a variety of partial timbres can be generated.
In a specific example (aspect 4) of any one of aspects 1 to 3, the one or more intermediate layers are a plurality of intermediate layers, the generative model includes a first encoder including a plurality of coding intermediate layers among the one or more intermediate layers and a decoder including a plurality of decoding intermediate layers among the one or more intermediate layers, and in generating the time series of the acoustic data, the time series of the control data is processed by the first encoder to generate intermediate data representing the characteristics of the time series of the control data, and the intermediate data is processed by the decoder to generate the time series of the acoustic data. According to the above aspect, the time series of the acoustic data can be generated by encoding with the first encoder and decoding with the decoder.
The "first encoder" is a statistical model that generates intermediate data representing the characteristics of the time series of control data. The decoder, on the other hand, is a statistical model that generates the time series of acoustic data from the intermediate data. Each of the "first intermediate layer" and the "second intermediate layer" may be either a coding intermediate layer or a decoding intermediate layer.
In a specific example (aspect 5) of aspect 4, the first encoder compresses data on the time axis, and the decoder expands data on the time axis. In the above aspect, intermediate data in which the characteristics of the time series of the control data are appropriately reflected is generated, and a time series of acoustic data in which the characteristics of the intermediate data are appropriately reflected is generated.
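A minimal PyTorch sketch of this encoder/decoder arrangement is shown below, with a strided 1-D convolution compressing the time axis in the first encoder and a transposed convolution expanding it in the decoder. The channel counts, strides, and layer counts are assumptions for illustration and are not the configuration of the embodiments.

```python
import torch
import torch.nn as nn

class FirstEncoder(nn.Module):
    """Compresses the control-data time series on the time axis into intermediate data."""
    def __init__(self, in_ch=2, mid_ch=64):
        super().__init__()
        self.conv1 = nn.Conv1d(in_ch, mid_ch, kernel_size=4, stride=2, padding=1)
        self.conv2 = nn.Conv1d(mid_ch, mid_ch, kernel_size=4, stride=2, padding=1)

    def forward(self, x):                      # x: (batch, in_ch, frames)
        h = torch.relu(self.conv1(x))          # frames / 2
        return torch.relu(self.conv2(h))       # frames / 4 -> intermediate data

class Decoder(nn.Module):
    """Expands the intermediate data back on the time axis into acoustic data."""
    def __init__(self, mid_ch=64, out_ch=80):
        super().__init__()
        self.deconv1 = nn.ConvTranspose1d(mid_ch, mid_ch, kernel_size=4, stride=2, padding=1)
        self.deconv2 = nn.ConvTranspose1d(mid_ch, out_ch, kernel_size=4, stride=2, padding=1)

    def forward(self, z):                      # z: (batch, mid_ch, frames / 4)
        h = torch.relu(self.deconv1(z))        # frames / 2
        return self.deconv2(h)                 # frames -> acoustic feature frames

control = torch.randn(1, 2, 200)               # time series of control data
intermediate = FirstEncoder()(control)          # (1, 64, 50)
acoustic = Decoder()(intermediate)              # (1, 80, 200)
print(intermediate.shape, acoustic.shape)
```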
In a specific example (aspect 6) of any one of aspects 1 to 5, in generating the control vector, a specific section of a reference musical sound is set in response to a first instruction from the user, and a time series of reference data representing the acoustic features of the reference musical sound in the specific section is processed by a second encoder to generate the control vector representing the characteristics of the temporal change in timbre in the specific section of the reference musical sound. In the above aspect, the specific section of the reference musical sound is set in response to the first instruction from the user, and a control vector representing the characteristics of the temporal change in timbre (partial timbre) in the specific section is generated. Therefore, it is possible to generate a target musical sound having the partial timbre of the specific section of the reference musical sound designated by the first instruction.
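The sketch below illustrates the idea of cutting out a user-designated section of the reference sound's feature time series and summarizing it into one control vector. The second_encoder function here is a toy stand-in (a linear projection followed by temporal averaging), not the trained second encoder of the embodiments; all dimensions are assumed.

```python
import numpy as np

rng = np.random.default_rng(2)
FRAME_RATE = 100            # frames per second (assumed)
D_FEAT, D_CTRL = 80, 8      # reference feature / control vector dimensions (assumed)

reference_features = rng.standard_normal((10 * FRAME_RATE, D_FEAT))  # 10 s of reference sound

# First instruction from the user: the specific section, in seconds.
section_start, section_end = 2.0, 3.5

def second_encoder(section: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Toy stand-in for the second encoder: project each frame, then average over time."""
    return (section @ W.T).mean(axis=0)

W_enc = rng.standard_normal((D_CTRL, D_FEAT)) * 0.01
begin, end = int(section_start * FRAME_RATE), int(section_end * FRAME_RATE)
control_vector = second_encoder(reference_features[begin:end], W_enc)
print(control_vector.shape)  # (8,) -- partial timbre of the chosen section
```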
In a specific example (aspect 7) of aspect 6, the position of the specific section on the time axis is further changed in response to the first instruction. In the above aspect, the position on the time axis of the specific section in the reference musical sound is changed in response to the first instruction from the user. Therefore, it is possible to generate a target musical sound having the partial timbre at the position of the reference musical sound desired by the user.
In a specific example (aspect 8) of any one of aspects 1 to 7, the control vector includes a plurality of elements, and in generating the control vector, one or more of the plurality of elements are changed in response to a second instruction from the user. In the above aspect, one or more of the plurality of elements of the control vector are changed in response to the second instruction from the user. Therefore, it is possible to generate a variety of target musical sounds having partial timbres corresponding to the second instruction from the user.
In a specific example (aspect 9) of aspect 8, the second instruction is an operation on a plurality of operators respectively corresponding to the plurality of elements, and in changing the one or more elements, the one or more elements are set in response to an operation on one or more operators, among the plurality of operators, that correspond to the one or more elements. In the above aspect, the user can easily adjust the partial timbre by operating each operator.
The "operator" may take any form. For example, a reciprocating operator (slider) that can move linearly within a specific range, or a rotary operator (knob) that can be rotated, is an example of an "operator." The "operator" may be a physical operator that the user can touch, or a virtual operator displayed by a display device.
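As one illustration of how operator values could set individual control vector elements, the following sketch maps slider positions to elements of the control vector. The number of elements, the value ranges, and the one-to-one mapping are assumptions for illustration.

```python
import numpy as np

D_CTRL = 8

def apply_operators(control_vector: np.ndarray, slider_values: dict[int, float]) -> np.ndarray:
    """Set one or more control-vector elements from the operators the user actually moved.

    slider_values maps an element index to the operator's value in [0, 1],
    rescaled here to an assumed element range of [-1, 1].
    """
    updated = control_vector.copy()
    for index, value in slider_values.items():
        updated[index] = 2.0 * value - 1.0
    return updated

base_vector = np.zeros(D_CTRL)                                    # e.g. extracted from a reference sound
edited_vector = apply_operators(base_vector, {0: 0.75, 3: 0.2})   # second instruction from the user
print(edited_vector)
```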
In a specific example (aspect 10) of any one of aspects 1 to 9, the first intermediate layer executes a conversion process in which the first parameter set is applied to the data input to the first intermediate layer.
In a specific example (aspect 11) of aspect 10, the first parameter set includes a first parameter and a second parameter, and the conversion process includes multiplication by the first parameter and addition of the second parameter. In the above aspect, a conversion process including multiplication by the first parameter and addition of the second parameter is executed on the data input to the first intermediate layer. Therefore, it is possible to generate acoustic data of a target musical sound to which the partial timbre represented by the control vector is appropriately imparted.
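Written out as code, the conversion process of this aspect is a frame-wise affine transform: the first parameter multiplies the input data and the second parameter is added. The tensor layout below is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
D_HIDDEN, N_FRAMES = 64, 200

x = rng.standard_normal((D_HIDDEN, N_FRAMES))    # data input to the first intermediate layer
first_param = rng.standard_normal(D_HIDDEN)       # first parameter (multiplicative)
second_param = rng.standard_normal(D_HIDDEN)      # second parameter (additive)

# Conversion process: multiplication by the first parameter and addition of the second.
y = first_param[:, None] * x + second_param[:, None]
print(y.shape)  # (64, 200) -- output passed to the next layer
```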
A musical sound synthesis system according to one aspect (aspect 12) of the present disclosure includes: a control data acquisition unit that acquires a time series of control data representing the conditions of a target musical sound; a control vector generation unit that generates, in response to an instruction from a user, a control vector representing the characteristics of the temporal change in timbre; a control vector processing unit that generates a first parameter set from the control vector; and a musical sound synthesis unit that generates a time series of acoustic data representing the acoustic features of the target musical sound by processing the time series of the control data with a trained generative model that includes a plurality of base layers and one or more intermediate layers and that has learned the relationship between the conditions of a musical sound and the acoustic features of that musical sound. A first intermediate layer of the one or more intermediate layers executes processing in which the first parameter set is applied to the data input to the first intermediate layer, and outputs the data after the application to the next layer.
A program according to one aspect of the present disclosure causes a computer system to function as: a control data acquisition unit that acquires a time series of control data representing the conditions of a target musical sound; a control vector generation unit that generates, in response to an instruction from a user, a control vector representing the characteristics of the temporal change in timbre; a control vector processing unit that generates a first parameter set from the control vector; and a musical sound synthesis unit that generates a time series of acoustic data representing the acoustic features of the target musical sound by processing the time series of the control data with a trained generative model that includes a plurality of base layers and one or more intermediate layers and that has learned the relationship between the conditions of a musical sound and the acoustic features of that musical sound. A first intermediate layer of the one or more intermediate layers executes processing in which the first parameter set is applied to the data input to the first intermediate layer, and outputs the data after the application to the next layer.
100...musical sound synthesis system, 11...control device, 12...storage device, 13...display device, 14...operation device, 15...sound emission device, 21...control data acquisition unit, 22...musical sound synthesis unit, 23...waveform synthesis unit, 24...control vector generation unit, 241...section setting unit, 242...feature extraction unit, 243...second encoder, 25...control vector processing unit, 28-n...transformation model, 26...training processing unit, 30...first generative model, 31...first encoder, 311...preprocessing unit, 312...convolutional layer, Le...coding intermediate layer, 32...decoder, 321...convolutional layer, Ld...decoding intermediate layer, 322...post-processing unit, 40...second generative model, 411...convolutional layer, 412...output processing unit, 413...post-processing unit, 414...sampling unit, 51...first provisional model, 52...second provisional model, 61...conversion processing unit, 62...convolutional layer, 63-n...unit processing unit, 64...synthesis processing unit, 65...dilated convolutional layer, 67...processing layer.

Claims (13)

  1.  A musical sound synthesis method realized by a computer system, the method comprising:
     acquiring a time series of control data representing conditions of a target musical sound; and
     generating a time series of acoustic data representing acoustic features of the target musical sound by processing the time series of the control data with a trained generative model that includes a plurality of base layers and one or more intermediate layers and that has learned a relationship between conditions of a musical sound and acoustic features of the musical sound,
     wherein the method further comprises:
     generating, in response to an instruction from a user, a control vector representing characteristics of a temporal change in timbre; and
     generating a first parameter set from the control vector, and
     wherein a first intermediate layer of the one or more intermediate layers executes processing in which the first parameter set is applied to data input to the first intermediate layer, and outputs the data after the application to a next layer.
  2.  The musical sound synthesis method according to claim 1, wherein
     in generating the control vector, a time series of the control vector is generated in response to the instruction from the user, and
     in generating the first parameter set, a time series of the first parameter set is generated from the time series of the control vector.
  3.  The musical sound synthesis method according to claim 1, further comprising generating a second parameter set from the control vector,
     wherein a second intermediate layer of the one or more intermediate layers executes processing in which the second parameter set is applied to data input to the second intermediate layer, and outputs the data after the application to a next layer.
  4.  The musical sound synthesis method according to claim 1, wherein
     the one or more intermediate layers are a plurality of intermediate layers,
     the generative model includes a first encoder including a plurality of coding intermediate layers among the plurality of intermediate layers, and a decoder including a plurality of decoding intermediate layers among the plurality of intermediate layers, and
     generating the time series of the acoustic data includes:
     processing the time series of the control data with the first encoder to generate intermediate data representing characteristics of the time series of the control data; and
     processing the intermediate data with the decoder to generate the time series of the acoustic data.
  5.  The musical sound synthesis method according to claim 4, wherein
     the first encoder compresses data on a time axis, and
     the decoder expands data on the time axis.
  6.  The musical sound synthesis method according to claim 1, wherein generating the control vector includes:
     setting a specific section of a reference musical sound in response to a first instruction from the user; and
     processing, with a second encoder, a time series of reference data representing acoustic features of the reference musical sound in the specific section to generate the control vector representing characteristics of a temporal change in timbre of the reference musical sound in the specific section.
  7.  The musical sound synthesis method according to claim 6, further comprising changing a position of the specific section on a time axis in response to the first instruction.
  8.  The musical sound synthesis method according to claim 1, wherein
     the control vector includes a plurality of elements, and
     generating the control vector includes changing one or more of the plurality of elements in response to a second instruction from the user.
  9.  The musical sound synthesis method according to claim 8, wherein
     the second instruction is an operation on a plurality of operators respectively corresponding to the plurality of elements, and
     in changing the one or more elements, the one or more elements are set in response to an operation on one or more operators, among the plurality of operators, that correspond to the one or more elements.
  10.  The musical sound synthesis method according to claim 1, wherein the first intermediate layer executes a conversion process in which the first parameter set is applied to data input to the first intermediate layer.
  11.  The musical sound synthesis method according to claim 10, wherein
     the first parameter set includes a first parameter and a second parameter, and
     the conversion process includes multiplication by the first parameter and addition of the second parameter.
  12.  A musical sound synthesis system comprising:
     a control data acquisition unit that acquires a time series of control data representing conditions of a target musical sound;
     a control vector generation unit that generates, in response to an instruction from a user, a control vector representing characteristics of a temporal change in timbre;
     a control vector processing unit that generates a first parameter set from the control vector; and
     a musical sound synthesis unit that generates a time series of acoustic data representing acoustic features of the target musical sound by processing the time series of the control data with a trained generative model that includes a plurality of base layers and one or more intermediate layers and that has learned a relationship between conditions of a musical sound and acoustic features of the musical sound,
     wherein a first intermediate layer of the one or more intermediate layers executes processing in which the first parameter set is applied to data input to the first intermediate layer, and outputs the data after the application to a next layer.
  13.  A program that causes a computer system to function as:
     a control data acquisition unit that acquires a time series of control data representing conditions of a target musical sound;
     a control vector generation unit that generates, in response to an instruction from a user, a control vector representing characteristics of a temporal change in timbre;
     a control vector processing unit that generates a first parameter set from the control vector; and
     a musical sound synthesis unit that generates a time series of acoustic data representing acoustic features of the target musical sound by processing the time series of the control data with a trained generative model that includes a plurality of base layers and one or more intermediate layers and that has learned a relationship between conditions of a musical sound and acoustic features of the musical sound,
     wherein a first intermediate layer of the one or more intermediate layers executes processing in which the first parameter set is applied to data input to the first intermediate layer, and outputs the data after the application to a next layer.
PCT/JP2023/030522 2022-10-25 2023-08-24 Musical sound synthesis method, musical sound synthesis system, and program WO2024089995A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2022170758A JP2024062724A (en) 2022-10-25 Musical sound synthesis method, music sound synthesis system and program
JP2022-170758 2022-10-25

Publications (1)

Publication Number Publication Date
WO2024089995A1 true WO2024089995A1 (en) 2024-05-02

Family

ID=90830448

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2023/030522 WO2024089995A1 (en) 2022-10-25 2023-08-24 Musical sound synthesis method, musical sound synthesis system, and program

Country Status (1)

Country Link
WO (1) WO2024089995A1 (en)

Similar Documents

Publication Publication Date Title
CN109952609B (en) Sound synthesizing method
KR20150016225A (en) Automatic conversion of speech into song, rap or other audible expression having target meter or rhythm
US7750229B2 (en) Sound synthesis by combining a slowly varying underlying spectrum, pitch and loudness with quicker varying spectral, pitch and loudness fluctuations
Lindemann Music synthesis with reconstructive phrase modeling
JP6733644B2 (en) Speech synthesis method, speech synthesis system and program
JP6821970B2 (en) Speech synthesizer and speech synthesizer
WO2020171033A1 (en) Sound signal synthesis method, generative model training method, sound signal synthesis system, and program
CN105957515A (en) Voice Synthesis Method, Voice Synthesis Device, Medium for Storing Voice Synthesis Program
WO2019107378A1 (en) Voice synthesis method, voice synthesis device, and program
WO2020095950A1 (en) Information processing method and information processing system
JP2016509384A (en) Acousto-visual acquisition and sharing framework with coordinated, user-selectable audio and video effects filters
CN111696498A (en) Keyboard musical instrument and computer-implemented method of keyboard musical instrument
JP2018077283A (en) Speech synthesis method
CN105719640B (en) Speech synthesizing device and speech synthesizing method
WO2024089995A1 (en) Musical sound synthesis method, musical sound synthesis system, and program
JP6737320B2 (en) Sound processing method, sound processing system and program
JP2003345400A (en) Method, device, and program for pitch conversion
JP2024062724A (en) Musical sound synthesis method, music sound synthesis system and program
EP2634769B1 (en) Sound synthesizing apparatus and sound synthesizing method
JP6834370B2 (en) Speech synthesis method
WO2022074754A1 (en) Information processing method, information processing system, and program
JP2018077280A (en) Speech synthesis method
WO2017164216A1 (en) Acoustic processing method and acoustic processing device
WO2022074753A1 (en) Information processing method, information processing system, and program
JP6822075B2 (en) Speech synthesis method