WO2024089995A1 - Musical sound synthesis method, musical sound synthesis system, and program


Info

Publication number: WO2024089995A1
Authority: WIPO (PCT)
Prior art keywords: data, unit, control vector, control, time series
Application number: PCT/JP2023/030522
Other languages: French (fr), Japanese (ja)
Inventor: 慶二郎 才野, ジョセフ ティ カーネル
Original Assignee: ヤマハ株式会社
Priority claimed from Japanese patent application JP2022170758A (published as JP2024062724A)
Application filed by ヤマハ株式会社
Publication of WO2024089995A1

Description

  • This disclosure relates to technology for synthesizing sound.
  • Patent Document 1 discloses a configuration for generating a time series of acoustic features of a voice waveform by processing a time series of multidimensional score features related to voice using a convolutional neural network.
  • one aspect of the present disclosure aims to generate musical tones with a variety of partial timbres in response to instructions from the user.
  • a musical sound synthesis method is a musical sound synthesis method realized by a computer system, which acquires a time series of control data representing the conditions of a target musical sound, and generates a time series of acoustic data representing the acoustic features of the target musical sound by processing the time series of control data using a trained generative model that includes multiple basic layers and one or more intermediate layers and has learned the relationship between the conditions of the musical sounds and the acoustic features of the musical sounds.
  • the method generates a control vector representing the characteristics of the temporal change in timbre in response to an instruction from a user, and generates a first parameter set from the control vector.
  • a first intermediate layer of the one or more intermediate layers applies the first parameter set to the data input to the first intermediate layer, and outputs the data after application to the next layer.
  • a musical sound synthesis system includes a control data acquisition unit that acquires a time series of control data representing the conditions of a target musical sound, a control vector generation unit that generates a control vector representing the characteristics of temporal changes in tone in response to an instruction from a user, a control vector processing unit that generates a first parameter set from the control vector, and a musical sound synthesis unit that includes a plurality of base layers and one or more intermediate layers, and processes the time series of the control data using a trained generative model that has learned the relationship between the conditions of a musical sound and the acoustic features of the musical sound, thereby generating a time series of acoustic data representing the acoustic features of the target musical sound, and a first intermediate layer of the one or more intermediate layers applies the first parameter set to data input to the first intermediate layer, and outputs the data after application to the next layer.
  • a program causes a computer system to function as a control data acquisition unit that acquires a time series of control data representing the conditions of a target musical tone, a control vector generation unit that generates a control vector representing the characteristics of temporal changes in tone in response to an instruction from a user, a control vector processing unit that generates a first parameter set from the control vector, and a musical tone synthesis unit that includes a plurality of base layers and one or more intermediate layers and processes the time series of control data using a trained generative model that has learned the relationship between the conditions of a musical tone and the acoustic features of the musical tone, thereby generating a time series of acoustic data representing the acoustic features of the target musical tone, and a first intermediate layer of the one or more intermediate layers applies the first parameter set to data input to the first intermediate layer and outputs the data after application to the next layer.
  • FIG. 1 is a block diagram illustrating the configuration of a musical sound synthesis system according to the first embodiment.
  • FIG. 2 is a block diagram illustrating the functional configuration of the musical sound synthesis system.
  • FIG. 3 is a block diagram illustrating a specific configuration of the first generative model.
  • FIG. 4 is an explanatory diagram of the conversion process.
  • FIG. 5 is a schematic diagram of a setting screen.
  • FIG. 6 is a block diagram illustrating a specific configuration of the second generative model.
  • FIG. 7 is a flowchart of the musical sound synthesis process.
  • FIG. 8 is an explanatory diagram of machine learning.
  • FIG. 9 is a flowchart of the training process.
  • FIG. 10 is a block diagram of the control vector generation unit in the second embodiment.
  • FIG. 11 is a schematic diagram of a setting screen in the second embodiment.
  • FIG. 12 is a flowchart of the musical sound synthesis process in the second embodiment.
  • FIG. 13 is an explanatory diagram of the conversion process executed by each intermediate layer L.
  • FIG. 14 is a block diagram of the first generative model in the fourth embodiment.
  • FIG. 15 is a block diagram of a unit processing unit in the fourth embodiment.
  • FIG. 16 is an explanatory diagram of a processing period in a modified example.
  • FIG. 1 is a block diagram illustrating the configuration of a musical sound synthesis system 100 according to a first embodiment.
  • the musical sound synthesis system 100 is a computer system that synthesizes a desired musical sound (hereinafter referred to as a "target musical sound").
  • the target musical sound is a musical sound to be synthesized by the musical sound synthesis system 100.
  • a singing sound produced by singing a specific piece of music (hereinafter referred to as the "target piece of music") is exemplified as the target musical sound.
  • the musical sound synthesis system 100 comprises a control device 11, a storage device 12, a display device 13, an operation device 14, and a sound emission device 15.
  • the musical sound synthesis system 100 is realized by an information device such as a smartphone, a tablet terminal, or a personal computer. Note that the musical sound synthesis system 100 can be realized as a single device, or as multiple devices configured separately from each other.
  • the control device 11 is a single or multiple processors that control each element of the musical sound synthesis system 100.
  • the control device 11 is composed of one or more types of processors, such as a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), an SPU (Sound Processing Unit), a DSP (Digital Signal Processor), an FPGA (Field Programmable Gate Array), or an ASIC (Application Specific Integrated Circuit).
  • the storage device 12 is a single or multiple memories that store the programs executed by the control device 11 and various data used by the control device 11.
  • a well-known recording medium such as a semiconductor recording medium or a magnetic recording medium, or a combination of multiple types of recording media, is used as the storage device 12.
  • a portable recording medium that is detachable from the musical sound synthesis system 100, or a recording medium that the control device 11 can access via a communication network (e.g., cloud storage) may also be used as the storage device 12.
  • the storage device 12 of the first embodiment stores music data M and a reference signal Sr.
  • the music data M represents the musical score of the target music piece. More specifically, the music data M specifies the pitch, pronunciation period, and pronunciation character for each of the multiple notes of the target music piece.
  • the pitch is one of multiple discretely set scale notes.
  • the pronunciation period is specified, for example, by the start point and duration of the note.
  • the pronunciation character is a symbol representing the lyrics of the music piece.
  • a music file that complies with the MIDI (Musical Instrument Digital Interface) standard is used as the music data M.
  • the music data M is provided to the musical sound synthesis system 100, for example, from a distribution device via a communication network.
  • the reference signal Sr is an audio signal that represents the waveform of a specific musical tone (hereinafter referred to as "reference musical tone").
  • the reference musical tone is, for example, a singing tone that should be produced by singing a reference musical piece.
  • the reference signal Sr is provided to the musical tone synthesis system 100 from a distribution device via a communication network.
  • the reference signal Sr may be provided from a playback device that drives a recording medium such as an optical disk, or may be generated by collecting the reference musical tone using a sound collection device.
  • the reference signal Sr may also be an audio signal synthesized using a known synthesis technique such as singing synthesis or musical tone synthesis.
  • the reference musical piece corresponding to the reference signal Sr and the target musical piece may be the same piece of music or different pieces of music.
  • the singer of the target musical tone and the singer of the reference musical tone may be the same or different.
  • the target musical tone in the first embodiment is a singing tone of the target musical piece, to which is imparted a feature of temporal change in acoustic characteristics (hereinafter referred to as a "partial timbre") within a specific period (hereinafter referred to as a "specific section") of the reference musical tone.
  • a musical tone to which a partial timbre desired by the user is imparted is generated as the target musical tone.
  • a partial timbre is, for example, a desired feature that exists within a specific section, such as repeated fluctuation (vibrato) of an acoustic characteristic such as volume or pitch, or a gradual change of an acoustic characteristic over time.
  • the reference musical tone is a musical tone that serves as the material for the partial timbre to be imparted to the target musical tone.
  • the control device 11 uses the musical piece data M and the reference signal Sr to generate an audio signal W that represents the target musical tone.
  • the audio signal W is a time-domain signal that represents the waveform of the target musical tone.
  • the display device 13 displays images under the control of the control device 11.
  • the display device 13 is, for example, a display panel such as a liquid crystal display panel or an organic EL (Electroluminescence) panel.
  • the operation device 14 is an input device that accepts instructions from a user.
  • the operation device 14 is, for example, an operator operated by the user, or a touch panel that detects contact by the user.
  • a display device 13 or an operation device 14 that is separate from the musical sound synthesis system 100 may be connected to the musical sound synthesis system 100 by wire or wirelessly.
  • the sound emitting device 15 reproduces sound under the control of the control device 11. Specifically, the sound emitting device 15 reproduces the target musical sound represented by the audio signal W.
  • a speaker or a headphone is used as the sound emitting device 15.
  • a D/A converter that converts the audio signal W from digital to analog, and an amplifier that amplifies the audio signal W are omitted from the illustration.
  • the sound emitting device 15, which is separate from the musical sound synthesis system 100, may be connected to the musical sound synthesis system 100 by wire or wirelessly.
  • FIG. 2 is a block diagram illustrating the functional configuration of the musical sound synthesis system 100.
  • the control device 11 executes a program stored in the storage device 12 to realize multiple functions (control data acquisition unit 21, musical sound synthesis unit 22, waveform synthesis unit 23, control vector generation unit 24, control vector processing unit 25, and training processing unit 26) for generating an audio signal W of a target musical sound.
  • the time length a of a time series and the data size (number of dimensions) b of each piece of data in that time series are represented by the notation [a, b].
  • the time length a is expressed as the number of periods of a specified length on the time axis (hereinafter referred to as "unit periods").
  • for example, [800, 134] in FIG. 2 denotes a time series in which 134-dimensional data are arranged over 800 unit periods.
  • a unit period is, for example, a period (frame) with a time length of about 5 milliseconds, so 800 unit periods are equivalent to 4 seconds. Note that the above values are only examples and may be changed as desired.
  • each unit period is identified by its position on the time axis.
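  • As a concrete reading of the [a, b] notation, the sketch below builds an array with the example sizes quoted above. It only illustrates the notation; the sizes are the example values from the text, not values fixed by this disclosure.

```python
import numpy as np

# Example values from the text: a time series of 800 unit periods,
# each holding 134-dimensional control data -- written [800, 134].
frames, dims = 800, 134
control_sequence = np.zeros((frames, dims))

frame_ms = 5                                   # one unit period is about 5 ms
print(control_sequence.shape)                  # (800, 134)
print(frames * frame_ms / 1000, "seconds")     # 4.0 seconds = one processing period B
```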
  • the control data acquisition unit 21 acquires control data Dx that represents the conditions of the target musical tone. Specifically, the control data acquisition unit 21 acquires the control data Dx for each unit period. In the first embodiment, the control data acquisition unit 21 generates the control data Dx for each unit period from the music data M. In other words, the "generation" of the control data Dx is an example of the "acquisition" of the control data Dx.
  • the control data Dx represents the features of the score of the target piece of music (hereinafter referred to as "score features").
  • the score features represented by the control data Dx include, for example, the pitch (fundamental frequency) in the unit period, information indicating voiced/unvoiced in the unit period, and phoneme information in the unit period.
  • Pitch is a numerical value in one unit period of the pitch time series (pitch trajectory) corresponding to each note specified by the music data M. While the pitch of each note in the target music is discrete, the pitch trajectory used in the control data Dx is a continuous change in pitch on the time axis.
  • the control data acquisition unit 21 estimates the pitch trajectory in the control data Dx, for example, by processing the music data M with an estimation model that has learned the relationship between the pitch of each note and the pitch trajectory. However, the method of generating the pitch trajectory is not limited to the above examples.
  • the control data Dx may also include discrete pitches of each note.
  • Phoneme information is information about phonemes that correspond to the pronunciation characters of the target song.
  • the phoneme information includes, for example, information specifying one of a plurality of phonemes (for example, as a one-hot expression), the position of the unit period relative to the phoneme period, the time length from the beginning or end of the phoneme period, and the duration of the phoneme.
  • the time series of control data Dx within processing period B constitutes control data string X.
  • Processing period B is a period of a predetermined length composed of multiple (specifically, 800) consecutive unit periods on the time axis.
  • the control data acquisition unit 21 of the first embodiment generates a time series of control data Dx (i.e., control data string X) representing the conditions of the target musical tone for each processing period B on the time axis.
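  • For illustration, one frame of control data Dx described above might be assembled as in the following sketch. The field layout (continuous pitch, a voiced/unvoiced flag, a one-hot phoneme code plus position and duration features) follows the description, but the phoneme inventory size and the resulting dimensionality are assumptions made only for this example.

```python
import numpy as np

NUM_PHONEMES = 40  # assumed phoneme inventory size (illustrative only)

def make_control_frame(pitch_hz, voiced, phoneme_index, pos_in_phoneme, phoneme_duration):
    """Assemble one unit period of control data Dx from score features (illustrative layout)."""
    one_hot = np.zeros(NUM_PHONEMES)
    one_hot[phoneme_index] = 1.0           # which phoneme is sounding, as a one-hot expression
    return np.concatenate([
        [pitch_hz],                        # continuous pitch trajectory value for this frame
        [1.0 if voiced else 0.0],          # voiced/unvoiced flag
        one_hot,
        [pos_in_phoneme],                  # relative position of the unit period within the phoneme
        [phoneme_duration],                # duration of the phoneme (in unit periods)
    ])

# Control data string X: one frame per unit period of a processing period B.
X = np.stack([make_control_frame(261.6, True, 5, i / 800, 120) for i in range(800)])
print(X.shape)
```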
  • the musical sound synthesis unit 22 generates an acoustic data sequence Z by processing the control data sequence X. Specifically, the musical sound synthesis unit 22 generates an acoustic data sequence Z for each processing period B.
  • the acoustic data sequence Z is time-series data representing the acoustic characteristics of the target musical sound in the processing period B.
  • the acoustic data sequence Z is composed of multiple (specifically, 800) acoustic data Dz corresponding to successive unit periods within the processing period B. In other words, the acoustic data sequence Z is a time series of acoustic data Dz within the processing period B.
  • the musical sound synthesis unit 22 generates an acoustic data sequence Z for one processing period B from the control data sequence X corresponding to that processing period B.
  • the acoustic data Dz represents the acoustic features of the target musical tone.
  • the acoustic features are, for example, the amplitude spectrum envelope of the target musical tone.
  • the acoustic data Dz includes the amplitude spectrum envelope of the harmonic components of the target musical tone and the amplitude spectrum envelope of the non-harmonic components of the target musical tone.
  • the amplitude spectrum envelope is an outline of the amplitude spectrum.
  • the amplitude spectrum envelope of the harmonic components and the non-harmonic components is expressed, for example, by Mel-cepstrum or MFCC (Mel-Frequency Cepstrum Coefficients).
  • the musical tone synthesis unit 22 of the first embodiment generates a time series of acoustic data Dz (i.e., acoustic data sequence Z) representing the acoustic features of the target musical tone for each processing period B.
  • the acoustic data Dz may include the amplitude spectrum envelope and pitch trajectory of the target musical tone.
  • the acoustic data Dz may also include the spectrum (amplitude spectrum or power spectrum) of the target musical tone.
  • the spectrum of the target musical tone may be expressed, for example, as a Mel spectrum.
  • the amplitude spectrum envelope may also be the outline of the power spectrum (power spectrum envelope).
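  • The disclosure expresses the amplitude spectrum envelope as a mel-cepstrum or MFCC; purely to illustrate what an "outline of the amplitude spectrum" means, the stand-in below computes a rough envelope by low-quefrency cepstral liftering on a linear frequency axis.

```python
import numpy as np

def spectral_envelope(frame, n_fft=1024, n_ceps=30):
    """Rough amplitude spectrum envelope via low-quefrency cepstral liftering (stand-in only)."""
    spectrum = np.abs(np.fft.rfft(frame, n_fft)) + 1e-9
    cepstrum = np.fft.irfft(np.log(spectrum), n_fft)
    cepstrum[n_ceps:-n_ceps] = 0.0                    # keep only the low-quefrency coefficients
    return np.exp(np.fft.rfft(cepstrum, n_fft).real)  # smooth outline of the amplitude spectrum

frame = np.random.randn(1024)                         # stand-in for one unit period of audio
print(spectral_envelope(frame).shape)                 # (513,) envelope bins
```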
  • the waveform synthesis unit 23 generates the audio signal W of the target musical tone from the acoustic data sequence Z. Specifically, the waveform synthesis unit 23 generates a waveform signal from the acoustic data Dz of each unit period by calculations including, for example, an inverse discrete Fourier transform, and generates the audio signal W by linking the waveform signals of successive unit periods on the time axis.
  • a deep neural network (a so-called neural vocoder) that has learned the relationship between the acoustic data sequence Z and the waveform signal may be used as the waveform synthesis unit 23.
  • the audio signal W generated by the waveform synthesis unit 23 is supplied to the sound emission device 15, whereby the target musical tone is reproduced from the sound emission device 15.
  • the pitch generated by the control data acquisition unit 21 may be applied to the generation of the audio signal W by the waveform synthesis unit 23.
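  • As a minimal illustration of the first option described above (a per-frame inverse discrete Fourier transform followed by linking on the time axis), the sketch below performs windowed overlap-add synthesis. The hop size and implied sample rate are assumptions, and an actual waveform synthesis unit 23 may instead be a neural vocoder as noted.

```python
import numpy as np

def synthesize_waveform(frame_spectra, hop):
    """Minimal overlap-add synthesis: one inverse DFT per unit period, linked on the time axis."""
    n_fft = (frame_spectra.shape[1] - 1) * 2
    window = np.hanning(n_fft)
    out = np.zeros(hop * (len(frame_spectra) - 1) + n_fft)
    for i, spectrum in enumerate(frame_spectra):
        frame = np.fft.irfft(spectrum, n_fft)          # waveform of one unit period
        out[i * hop:i * hop + n_fft] += window * frame
    return out

spectra = np.random.randn(800, 513) + 1j * np.random.randn(800, 513)  # stand-in acoustic data
audio = synthesize_waveform(spectra, hop=240)          # 5 ms hop at an assumed 48 kHz sample rate
print(audio.shape)
```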
  • the musical sound synthesis unit 22 generates an acoustic data sequence Z by processing a control data sequence X using a first generative model 30.
  • the first generative model 30 is a trained statistical model that has learned the relationship between the conditions on the score of the target musical sound (control data sequence X) and the acoustic features of the target musical sound (acoustic data sequence Z) through machine learning.
  • the first generative model 30 outputs the acoustic data sequence Z in response to the input of the control data sequence X.
  • the first generative model 30 is, for example, configured by a deep neural network.
  • the first generative model 30 is realized by a combination of a program that causes the control device 11 to execute an operation (architecture) for generating an acoustic data sequence Z from a control data sequence X, and a plurality of variables (weights and biases) that are applied to the operation.
  • the program and the plurality of variables that realize the first generative model 30 are stored in the storage device 12.
  • the plurality of variables of the first generative model 30 are set in advance by machine learning.
  • the first generative model 30 of the first embodiment includes a first encoder 31 and a decoder 32.
  • the first encoder 31 is a trained statistical model that has learned the relationship between the control data sequence X and the intermediate data Y through machine learning. That is, the first encoder 31 outputs intermediate data Y in response to the input of the control data sequence X.
  • the musical sound synthesis unit 22 generates intermediate data Y by processing the control data sequence X with the first encoder 31.
  • the intermediate data Y represents the characteristics of the control data sequence X. Specifically, the generated acoustic data sequence Z changes in response to the characteristics of the control data sequence X represented by the intermediate data Y. That is, the first encoder 31 encodes the control data sequence X into the intermediate data Y.
  • the decoder 32 is a trained statistical model that has learned the relationship between the intermediate data Y and the acoustic data string Z through machine learning. That is, the decoder 32 outputs the acoustic data string Z in response to the input of the intermediate data Y.
  • the musical sound synthesis unit 22 generates the acoustic data string Z by processing the intermediate data Y with the decoder 32. That is, the decoder 32 decodes the intermediate data Y into the acoustic data string Z.
  • the acoustic data string Z can be generated by encoding by the first encoder 31 and decoding by the decoder 32.
  • FIG. 3 is a block diagram illustrating a specific configuration (architecture) of the first generative model 30.
  • the first encoder 31 includes a preprocessing unit 311, N1 convolutional layers 312, and N1 coding intermediate layers Le.
  • the N1 convolutional layers 312 and the N1 coding intermediate layers Le are arranged alternately after the preprocessing unit 311. That is, N1 sets each consisting of a convolutional layer 312 and a coding intermediate layer Le are stacked after the preprocessing unit 311.
  • the preprocessing unit 311 is composed of a multi-layer perceptron for processing the control data sequence X.
  • the preprocessing unit 311 is composed of multiple calculation units corresponding to different control data Dx of the control data sequence X. Each calculation unit is composed of a stack of multiple stages of fully connected layers. Each control data Dx is processed sequentially by each fully connected layer. For example, a neural network process is executed for each control data Dx of the control data sequence X with the same configuration and the same variables applied.
  • the array of the control data Dx after processing by each calculation unit (processed control data sequence X) is input to the first convolution layer 312.
  • a control data sequence X that more clearly expresses the characteristics of the target song (song data M) is generated.
  • the preprocessing unit 311 may be omitted.
  • the data processed by the pre-processing unit 311 is input to the first convolutional layer 312 among the N1 convolutional layers 312.
  • the data processed by the previous coding intermediate layer Le is input to each of the second and subsequent convolutional layers 312 among the N1 convolutional layers 312.
  • Each convolutional layer 312 performs arithmetic processing on the data input to the convolutional layer 312.
  • the arithmetic processing by the convolutional layer 312 includes a convolutional operation.
  • the arithmetic processing by the convolutional layer 312 may also include a pooling operation.
  • the convolution operation is a process of convolving a filter with data input to the convolution layer 312.
  • the multiple convolution layers 312 include a convolution layer 312 that performs time compression and a convolution layer 312 that does not perform time compression.
  • in a convolution layer 312 that performs time compression, the movement amount (stride) of the filter in the time direction is set to 2 or more.
  • the processing by the first encoder 31 therefore includes downsampling of the control data string X.
  • note that data compression may instead be achieved by a pooling operation while keeping the stride of the convolution operation at 1.
  • the pooling operation is an operation that selects a representative value within each range set in the data after the convolution operation.
  • the representative value is, for example, a statistical value such as the maximum value, the average value, or the root mean square value.
  • the compression of the control data sequence X is achieved by one or both of a convolution operation and a pooling operation.
  • time compression (downsampling) of the control data sequence X may be performed only for a portion of the series of convolution operations of the N1 convolution layers 312.
  • the compression rate of each convolution layer 312 is arbitrary.
  • Each of the N1 coding intermediate layers Le performs a conversion process on the data input to that coding intermediate layer Le from the preceding convolution layer 312. The specific content of the conversion process by each coding intermediate layer Le will be described later.
  • Data after processing by the final coding intermediate layer Le among the N1 coding intermediate layers Le is input to the decoder 32 as the intermediate data Y. Note that a coding intermediate layer Le need not be installed after every one of the N1 convolution layers 312. In other words, the number N1x of coding intermediate layers Le is a natural number less than or equal to N1.
  • the data after the conversion process by a coding intermediate layer Le is input to the next convolution layer 312; if there is no coding intermediate layer Le after a certain convolution layer 312, the data after the convolution processing by that convolution layer 312 (i.e., data that has not undergone the conversion process) is input to the next convolution layer 312.
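  • A compact PyTorch sketch in the spirit of the first encoder 31 follows: a per-frame multi-layer perceptron (pre-processing unit 311), strided 1-D convolutions for time compression, and coding intermediate layers Le that apply externally supplied parameter sets. Layer counts, channel sizes, strides and activations are illustrative assumptions, not values taken from the disclosure.

```python
import torch
import torch.nn as nn

class IntermediateLayer(nn.Module):
    """Coding intermediate layer Le: applies an externally generated parameter set (p1, p2)."""
    def forward(self, u, p1, p2):
        # u: (batch, channels, time); p1, p2: (batch, channels, 1)
        return u * p1 + p2

class Encoder(nn.Module):
    def __init__(self, in_dim=134, hidden=256, n_blocks=3):
        super().__init__()
        self.pre = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden))               # pre-processing unit 311
        self.convs = nn.ModuleList([nn.Conv1d(hidden, hidden, kernel_size=4,
                                              stride=2, padding=1)        # time compression
                                    for _ in range(n_blocks)])
        self.intermediates = nn.ModuleList([IntermediateLayer() for _ in range(n_blocks)])

    def forward(self, x, param_sets):
        # x: (batch, time, in_dim); param_sets: one (p1, p2) pair per coding intermediate layer Le
        h = self.pre(x).transpose(1, 2)                                   # -> (batch, hidden, time)
        for conv, layer, (p1, p2) in zip(self.convs, self.intermediates, param_sets):
            h = layer(torch.relu(conv(h)), p1, p2)
        return h                                                          # intermediate data Y

encoder = Encoder()
params = [(torch.ones(1, 256, 1), torch.zeros(1, 256, 1)) for _ in range(3)]
y = encoder(torch.randn(1, 800, 134), params)
print(y.shape)    # (1, 256, 100): the time axis is compressed from 800 to 100 unit periods
```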
  • the decoder 32 includes N2 convolutional layers 321, N2 decoding intermediate layers Ld, and a post-processing unit 322. Specifically, the N2 convolutional layers 321 and the N2 decoding intermediate layers Ld are arranged alternately, and the post-processing unit 322 is stacked after the final decoding intermediate layer Ld. In other words, N2 sets consisting of the convolutional layers 321 and the decoding intermediate layers Ld are stacked before the post-processing unit 322.
  • Intermediate data Y is input to the first convolutional layer 321 of the N2 convolutional layers 321.
  • Data processed by the previous decoding intermediate layer Ld is input to each of the second and subsequent convolutional layers 321 of the N2 convolutional layers 321.
  • Each convolutional layer 321 performs arithmetic processing on the data input to that convolutional layer 321.
  • the arithmetic processing by the convolutional layer 321 includes a transpose convolution operation (or a deconvolution operation).
  • the transposed convolution performed by the convolution layer 321 is the inverse of the convolution performed by each convolution layer 312 of the encoder.
  • in a convolution layer 321 that performs time expansion, the movement amount (stride) of the filter in the time direction is set to 2 or more.
  • the time length of the data is maintained by a transposed convolution operation with a stride of 1, and is expanded by a transposed convolution operation with a stride of 2 or more. That is, the decoder 32 performs data expansion on the time axis.
  • the processing by the decoder 32 includes upsampling of the intermediate data Y.
  • the first encoder 31 compresses the control data sequence X and the decoder 32 expands the intermediate data Y. Therefore, intermediate data Y that appropriately reflects the characteristics of the control data sequence X is generated, and an acoustic data sequence Z that appropriately reflects the characteristics of the intermediate data Y is generated.
  • Each of the N2 decoding intermediate layers Ld performs a conversion process on the data input to the decoding intermediate layer Ld from the previous convolution layer 321. The specific content of the conversion process by each decoding intermediate layer Ld will be described later.
  • Data after processing by the final decoding intermediate layer Ld among the N2 decoding intermediate layers Ld is input to the post-processing unit 322 as the acoustic data string Z. Note that the decoding intermediate layer Ld does not need to be installed after all of the N2 convolution layers 321. In other words, the number N2x of the decoding intermediate layers Ld is a natural number less than or equal to N2.
  • the data after the conversion process by the decoding intermediate layer Ld is input to the next convolution layer 321, and if there is no decoding intermediate layer Ld after a certain convolution layer 321, the data after the convolution process by the convolution layer 321 (i.e., data that has not been converted) is input to the next convolution layer 321.
  • the post-processing unit 322 is composed of a multi-layer perceptron for processing the acoustic data sequence Z.
  • the post-processing unit 322 is composed of multiple calculation units corresponding to different acoustic data Dz of the acoustic data sequence Z. Each calculation unit is composed of a stack of multiple fully connected layers, and each acoustic data Dz is processed sequentially by each fully connected layer. For example, a neural network process with the same configuration and the same variables applied is executed for each acoustic data Dz of the acoustic data sequence Z.
  • the array of acoustic data Dz after processing by each calculation unit is input to the waveform synthesis unit 23 as the final acoustic data sequence Z.
  • the post-processing unit 322 may be omitted.
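  • A matching decoder sketch in the spirit of the decoder 32: transposed 1-D convolutions expand the time axis, each followed by a decoding intermediate layer Ld, and a per-frame multi-layer perceptron (post-processing unit 322) emits the acoustic data string Z. The output dimensionality and all sizes are again illustrative assumptions.

```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    def __init__(self, hidden=256, out_dim=120, n_blocks=3):
        super().__init__()
        self.deconvs = nn.ModuleList([nn.ConvTranspose1d(hidden, hidden, kernel_size=4,
                                                         stride=2, padding=1)  # time expansion
                                      for _ in range(n_blocks)])
        self.post = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                  nn.Linear(hidden, out_dim))              # post-processing unit 322

    def forward(self, y, param_sets):
        # y: (batch, hidden, compressed time); param_sets: one (p1, p2) per decoding layer Ld
        h = y
        for deconv, (p1, p2) in zip(self.deconvs, param_sets):
            h = torch.relu(deconv(h)) * p1 + p2        # conversion process with parameter set Pn
        return self.post(h.transpose(1, 2))            # -> (batch, time, out_dim): acoustic data Z

decoder = Decoder()
params = [(torch.ones(1, 256, 1), torch.zeros(1, 256, 1)) for _ in range(3)]
z = decoder(torch.randn(1, 256, 100), params)
print(z.shape)    # (1, 800, 120): the time axis is expanded back to 800 unit periods
```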
  • as described above, the first encoder 31 includes N1x coding intermediate layers Le, and the decoder 32 includes N2x decoding intermediate layers Ld.
  • the number Nx of intermediate layers L is a natural number equal to or greater than 1.
  • the number N1x of coding intermediate layers Le and the number N2x of decoding intermediate layers Ld may be equal or different.
  • the preprocessing unit 311, the N1 convolutional layers 312, the N2 convolutional layers 321, and the post-processing unit 322 are basic layers required for generating the acoustic data sequence Z.
  • the Nx intermediate layers L (the N1x coding intermediate layers Le and the N2x decoding intermediate layers Ld) are layers for controlling the partial timbre of the target musical tone.
  • the first generative model 30 thus includes N basic convolutional layers and Nx (N ≥ Nx ≥ 1) intermediate layers L.
  • Each of the N intermediate layers L performs a conversion process on the data input to that intermediate layer L.
  • a different parameter set Pn is applied to the conversion process by each of the multiple intermediate layers L.
  • Each of the N parameter sets P1 to PN includes, for example, a first parameter p1 and a second parameter p2.
  • FIG. 4 is an explanatory diagram of the conversion process.
  • the unit data string U in FIG. 4 is data input to the intermediate layer L.
  • the unit data string U is composed of a time series of multiple unit data Du corresponding to different unit periods.
  • Each unit data Du is expressed as an H-dimensional (H is a natural number equal to or greater than 2) vector.
  • the first parameter p1 is expressed as a square matrix with H rows and H columns.
  • the second parameter p2 is expressed as an H-dimensional vector. Note that the first parameter p1 may be expressed as a diagonal matrix with H rows and H columns or an H-dimensional vector.
  • the conversion process includes a first operation and a second operation.
  • the first operation and the second operation are executed sequentially for each of the multiple unit data Du that make up the unit data string U.
  • the first operation is a process of multiplying the unit data Du by the first parameter p1.
  • the second operation is a process of adding the second parameter p2 to the result (p1·Du) of the first operation.
  • the conversion process by the intermediate layer L is a process (i.e., affine transformation) that includes the multiplication of the first parameter p1 and the addition of the second parameter p2.
  • the second operation that applies the second parameter p2 may be omitted. In that case, the generation of the second parameter p2 is also omitted. In other words, it is sufficient that the conversion process includes at least the first operation.
  • each of the N intermediate layers L in FIG. 3 performs a conversion process by applying a parameter set Pn to each unit data Du of the unit data string U input to the intermediate layer L, and outputs the unit data string U after the conversion process.
  • a conversion process including multiplication of a first parameter p1 and addition of a second parameter p2 is performed on each unit data Du of the unit data string U input to each intermediate layer L. Therefore, it is possible to generate an acoustic data string Z of a target musical tone to which a partial timbre represented by a control vector V is appropriately assigned.
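  • Spelled out as code, the conversion process applied by an intermediate layer L to every unit datum Du of the unit data string U is an affine transform. The sketch below uses the diagonal (per-dimension) form of the first parameter p1 mentioned above, and the dimensionality H is an arbitrary example.

```python
import numpy as np

def conversion_process(U, p1, p2=None):
    """Apply a parameter set Pn to every unit data Du of the unit data string U.

    U  : (time, H) unit data string input to the intermediate layer L
    p1 : (H, H) first parameter (square or diagonal matrix; a vector is the diagonal case)
    p2 : (H,) second parameter; per the text the second operation may be omitted
    """
    out = U @ p1.T                  # first operation: multiply each Du by p1
    if p2 is not None:
        out = out + p2              # second operation: add p2
    return out                      # converted unit data string passed to the next layer

H = 8
U = np.random.randn(800, H)
p1 = np.diag(np.random.rand(H))     # diagonal form of the first parameter
p2 = np.random.randn(H)
print(conversion_process(U, p1, p2).shape)   # (800, 8)
```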
  • the number of intermediate layers L is explained as N, but the basic operation is similar even when the number of intermediate layers L is Nx, which is less than N.
  • Each intermediate layer L may be either an encoding intermediate layer Le or a decoding intermediate layer Ld.
  • the n1th intermediate layer L performs a conversion process by applying a parameter set Pn1 to each unit data Du of the unit data string U input to the intermediate layer L, and outputs the unit data string U after the application to the next layer.
  • the n2th intermediate layer L performs a conversion process by applying a parameter set Pn2 to each unit data Du of the unit data string U input to the intermediate layer L, and outputs the unit data string U after the application to the next layer.
  • the n1th intermediate layer L is an example of a "first intermediate layer," and the parameter set Pn1 is an example of a "first parameter set."
  • the n2th intermediate layer L is an example of a "second intermediate layer," and the parameter set Pn2 is an example of a "second parameter set."
  • a different parameter set Pn is applied to each of the N intermediate layers L, so that an acoustic data sequence Z of a target tone having a variety of partial timbres can be generated.
  • the control vector generation unit 24 and the control vector processing unit 25 illustrated in FIG. 2 generate N parameter sets P1 to PN by processing the reference signal Sr.
  • the control vector generation unit 24 generates a control vector V by processing the reference signal Sr of a specific section.
  • the control vector V is a K-dimensional vector that represents the partial timbre of the reference tone.
  • the control vector V is a vector that represents the characteristics of the temporal change in acoustic characteristics (i.e., the partial timbre) in the reference signal Sr of a specific section.
  • the control vector generation unit 24 of the first embodiment includes a section setting unit 241, a feature extraction unit 242, and a second encoder 243.
  • the section setting unit 241 sets a specific section in the reference musical tone. Specifically, the section setting unit 241 sets the specific section in response to a first instruction Q1 from the user via the operation device 14.
  • the time length of the specific section is a fixed length equivalent to one processing period B.
  • FIG. 5 is a schematic diagram of the setting screen Ga.
  • the setting screen Ga is a screen for the user to specify a specific section.
  • the section setting unit 241 displays the setting screen Ga on the display device 13.
  • the setting screen Ga includes a waveform image Ga1 and a section image Ga2.
  • the waveform image Ga1 is an image that represents the waveform of the reference signal Sr.
  • the section image Ga2 is an image that represents a specific section.
  • the user can move the section image Ga2 to a desired position along the time axis by operating the operation device 14 (first instruction Q1) while checking the waveform image Ga1 of the reference musical tone. For example, the user moves the section image Ga2 so that it includes a section of the reference musical tone whose acoustic characteristics change under desired conditions.
  • the section setting unit 241 determines the section of the reference signal Sr that corresponds to the section image Ga2 after the user's movement as the specific section.
  • the first instruction Q1 is an instruction to change the position of the specific section on the time axis.
  • the section setting unit 241 changes the position of the specific section on the time axis in response to the first instruction Q1.
  • the feature extraction unit 242 in FIG. 2 processes the reference signal Sr for a specific section to generate one reference data string R.
  • the reference data string R is time-series data representing the acoustic features of the reference musical tone in a specific section.
  • the reference data string R is composed of multiple (e.g., 800) reference data Dr corresponding to different unit periods within the specific section.
  • in other words, the reference data string R is a time series of the reference data Dr within the specific section.
  • the reference data Dr represents the acoustic features of the reference musical tone.
  • the acoustic features are, for example, the amplitude spectrum envelope of the reference musical tone.
  • the reference data Dr includes the amplitude spectrum envelope of the harmonic components of the reference musical tone and the amplitude spectrum envelope of the non-harmonic components of the reference musical tone.
  • the amplitude spectrum envelopes of the harmonic components and non-harmonic components are expressed, for example, by mel-cepstrum or MFCC.
  • the data size of the reference data Dr is equal to the data size of the acoustic data Dz. Therefore, the data size of one reference data string R is equal to the data size of one acoustic data string Z.
  • the reference data Dr may be data in a different format from the acoustic data Dz.
  • the acoustic features represented by the reference data Dr and the acoustic features represented by the acoustic data Dz may be different types of features.
  • the feature extraction unit 242 of the first embodiment generates a time series of reference data Dr (reference data string R) that represents the acoustic features of a reference musical tone.
  • the feature extraction unit 242 generates the reference data string R by performing a calculation including a discrete Fourier transform on a reference signal Sr of a specific section.
  • the second encoder 243 is a trained statistical model that has learned the relationship between the reference data sequence R and the control vector V through machine learning. That is, the second encoder 243 outputs the control vector V in response to the input of the reference data sequence R.
  • the control vector generation unit 24 generates the control vector V by processing the reference data sequence R with the second encoder 243. That is, the second encoder 243 encodes the reference data sequence R into the control vector V.
  • control vector V is a vector that represents the characteristics of the temporal change in acoustic characteristics in the reference signal Sr of a specific section (i.e., partial timbre). Since the partial timbre changes depending on the position of the reference signal Sr, the control vector V depends on the position of the specific section on the time axis. In other words, the control vector V depends on the first instruction Q1 from the user that specifies the specific section. As can be understood from the above explanation, the control vector generation unit 24 of the first embodiment generates the control vector V in response to the first instruction Q1 from the user.
  • the control vector processing unit 25 generates the N parameter sets P1 to PN from the control vector V. Because the control vector V represents a partial timbre, each parameter set Pn depends on the partial timbre. In addition, because the control vector V depends on the first instruction Q1, each parameter set Pn also depends on the first instruction Q1 from the user.
  • FIG. 6 is a block diagram illustrating a specific configuration of the second encoder 243 and the control vector processing unit 25.
  • the second encoder 243 includes a plurality of convolution layers 411 and an output processing unit 412.
  • the output processing unit 412 is stacked after the final convolution layer 411 among the plurality of convolution layers 411.
  • a reference data sequence R is input to the first convolutional layer 411 of the multiple convolutional layers 411.
  • Data processed by the previous convolutional layer 411 is input to each of the second and subsequent convolutional layers 411 of the multiple convolutional layers 411.
  • Each convolutional layer 411 performs arithmetic processing on the data input to that convolutional layer 411.
  • the arithmetic processing by the convolutional layer 411 includes a convolution operation and an optional pooling operation, similar to the arithmetic processing by the convolutional layer 312.
  • the final convolutional layer 411 outputs feature data Dv representing the features of the reference data sequence R.
  • the output processing unit 412 generates a control vector V in response to the feature data Dv.
  • the output processing unit 412 in the first embodiment includes a post-processing unit 413 and a sampling unit 414.
  • the post-processing unit 413 determines K probability distributions F1 to FK according to the feature data Dv.
  • Each of the K probability distributions F1 to FK is, for example, a normal distribution.
  • the post-processing unit 413 is a trained statistical model that has learned the relationship between the feature data Dv and each probability distribution Fk by machine learning.
  • the control vector generation unit 24 determines the K probability distributions F1 to FK by processing the feature data Dv with the post-processing unit 413.
  • the sampling unit 414 generates a control vector V according to the K probability distributions F1 to FK. Specifically, the sampling unit 414 samples an element Ek from each of the K probability distributions F1 to FK.
  • the sampling of the element Ek is, for example, random sampling. That is, each element Ek is a latent variable sampled from the probability distribution Fk.
  • the control vector V is composed of the K elements E1 to EK sampled from different probability distributions Fk. That is, the control vector V includes K elements E1 to EK.
  • the control vector V is a K-dimensional vector that represents the characteristics of the temporal change in acoustic characteristics (i.e., partial timbre) in the reference signal Sr of a specific section.
  • the configuration and processing by which the output processing unit 412 generates the control vector V from the feature data Dv are not limited to the above examples.
  • the output processing unit 412 may generate the control vector V without generating K probability distributions F1 to FK. Therefore, the post-processing unit 413 and the sampling unit 414 may be omitted.
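  • A minimal sketch of the path just described: a trained mapping from the feature data Dv to K normal distributions F1 to FK (post-processing unit 413), followed by sampling one element Ek from each distribution (sampling unit 414) to form the K-dimensional control vector V. The mean/log-variance parameterization and the value of K are assumptions made only for this example.

```python
import torch
import torch.nn as nn

class OutputProcessing(nn.Module):
    """Post-processing unit 413 + sampling unit 414, sketched in a VAE-like style."""
    def __init__(self, feat_dim=256, k=16):
        super().__init__()
        self.to_mean = nn.Linear(feat_dim, k)     # mean of each probability distribution Fk
        self.to_logvar = nn.Linear(feat_dim, k)   # log-variance of each Fk

    def forward(self, dv):
        mean, logvar = self.to_mean(dv), self.to_logvar(dv)
        eps = torch.randn_like(mean)
        return mean + eps * torch.exp(0.5 * logvar)   # element Ek sampled from each Fk

dv = torch.randn(1, 256)              # feature data Dv from the final convolution layer 411
v = OutputProcessing()(dv)            # control vector V with K = 16 elements (assumed K)
print(v.shape)
```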
  • the control vector processing unit 25 includes N conversion models 28-1 to 28-N corresponding to the different intermediate layers L.
  • Each conversion model 28-n generates a parameter set Pn from a control vector V.
  • each conversion model 28-n is a trained statistical model that has learned the relationship between the control vector V and the parameter set Pn through machine learning.
  • Each conversion model 28-n is composed of a multi-layer perceptron for generating the parameter set Pn.
  • N parameter sets P1 to PN corresponding to the partial timbre of the reference musical tone are generated by the control vector processing unit 25.
  • the N parameter sets P1 to PN are generated from a common control vector V.
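  • Each conversion model 28-n can be pictured as a small multi-layer perceptron that maps the shared control vector V to one parameter set Pn, here a per-channel scale p1 and offset p2 for the corresponding intermediate layer L. The construction below is an illustrative assumption, not the disclosed architecture.

```python
import torch
import torch.nn as nn

class ConversionModel(nn.Module):
    """Conversion model 28-n: control vector V -> parameter set Pn = (p1, p2)."""
    def __init__(self, k=16, channels=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(k, 128), nn.ReLU(), nn.Linear(128, 2 * channels))
        self.channels = channels

    def forward(self, v):
        p = self.net(v)
        p1, p2 = p[:, :self.channels], p[:, self.channels:]
        return p1.unsqueeze(-1), p2.unsqueeze(-1)   # shaped for (batch, channels, time) data

v = torch.randn(1, 16)                                       # control vector V (assumed K = 16)
conversion_models = [ConversionModel() for _ in range(6)]    # conversion models 28-1 .. 28-N
param_sets = [m(v) for m in conversion_models]               # N parameter sets P1 .. PN
print(param_sets[0][0].shape)                                # (1, 256, 1)
```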
  • the second encoder 243 and the control vector processing unit 25 constitute the second generative model 40.
  • the second generative model 40 is a trained statistical model that has learned the relationship between the reference data string R and the N parameter sets P1 to PN through machine learning.
  • the second generative model 40 is constituted, for example, by a deep neural network.
  • the second generative model 40 is realized by a combination of a program that causes the control device 11 to execute an operation for generating the control vector V from the reference data string R, and a plurality of variables (weights and biases) that are applied to the operation.
  • the program and the plurality of variables that realize the second generative model 40 are stored in the storage device 12.
  • the plurality of variables of the second generative model 40 are set in advance by machine learning.
  • FIG. 7 is a flowchart of the process (hereinafter referred to as "musical sound synthesis process Sa") in which the control device 11 generates an audio signal W of a target musical sound.
  • the musical sound synthesis process Sa is started in response to an instruction from the user via the operation device 14.
  • the musical sound synthesis process Sa is repeated for each processing period B.
  • the musical sound synthesis process Sa is an example of a "musical sound synthesis method.” Note that, before the musical sound synthesis process Sa starts, the section setting unit 241 sets a specific section in response to a first instruction Q1 from the user. Data representing the specific section is stored in the storage device 12.
  • when the musical tone synthesis process Sa is started, the control device 11 (control vector generation unit 24) generates a control vector V representing a partial timbre in response to the first instruction Q1 from the user (Sa1).
  • the specific steps for generating the control vector V (Sa11 to Sa13) are as follows:
  • the control device 11 acquires data representing a specific section from the storage device 12 (Sa11). Specifically, the section setting unit 241 sets the specific section in response to a first instruction Q1 from the user on the operation device 14.
  • the control device 11 (feature extraction unit 242) processes the reference signal Sr of the specific section to generate one reference data string R (Sa12). Then, the control device 11 processes the reference data string R using the second encoder 243 to generate a control vector V (Sa13).
  • the control device 11 (control vector processing unit 25) generates N parameter sets P1 to PN from the control vector V (Sa2). Specifically, the control device 11 processes the control vector V using each transformation model 28-n to generate the parameter set Pn.
  • the control device 11 processes the music data M to generate a control data sequence X (Sa3).
  • the control device 11 (musical sound synthesis unit 22) processes the control data sequence X using the first generative model 30 to generate the acoustic data sequence Z (Sa4). Specifically, the control device 11 processes the control data sequence X using the first encoder 31 to generate the intermediate data Y, and processes the intermediate data Y using the decoder 32 to generate the acoustic data sequence Z.
  • the parameter set Pn is applied to the conversion process by each intermediate layer L of the first generative model 30.
  • the control device 11 (waveform synthesis unit 23) generates the audio signal W of the target musical tone from the acoustic data sequence Z (Sa5).
  • the control device 11 supplies the audio signal W to the sound emitting device 15 (Sa6).
  • the sound emitting device 15 reproduces the target musical tone represented by the audio signal W.
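  • To make the order of steps Sa1 to Sa5 explicit, the outline below chains the operations together. Every component is a stub standing in for the corresponding unit described above, so all shapes and interfaces are purely illustrative.

```python
import numpy as np

# Stubs standing in for the units described above (illustrative shapes only).
extract_reference_features = lambda sr, section: np.random.randn(800, 120)  # feature extraction unit 242
second_encoder             = lambda r: np.random.randn(16)                  # second encoder 243
conversion_models          = [lambda v: (np.ones(8), np.zeros(8))] * 6      # conversion models 28-n
make_control_sequence      = lambda m: np.random.randn(800, 134)            # control data acquisition unit 21
first_generative_model     = lambda x, ps: np.random.randn(800, 120)        # musical sound synthesis unit 22
synthesize_waveform        = lambda z: np.random.randn(192_000)             # waveform synthesis unit 23

def musical_sound_synthesis(music_data, reference_signal, section):
    """Outline of the musical sound synthesis process Sa (steps Sa1 to Sa5)."""
    r = extract_reference_features(reference_signal, section)   # Sa11-Sa12: reference data string R
    v = second_encoder(r)                                       # Sa13: control vector V
    param_sets = [f(v) for f in conversion_models]              # Sa2 : parameter sets P1..PN
    x = make_control_sequence(music_data)                       # Sa3 : control data string X
    z = first_generative_model(x, param_sets)                   # Sa4 : acoustic data string Z
    return synthesize_waveform(z)                               # Sa5 : audio signal W

print(musical_sound_synthesis(None, None, None).shape)
```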
  • a control vector V representing the partial timbre of a reference musical tone is generated in response to an instruction from the user (first instruction Q1), a parameter set Pn is generated from the control vector V, and the parameter set Pn is applied to each unit data Du of the unit data string U input to each intermediate layer L. Therefore, it is possible to generate an acoustic data string Z of a target musical tone having a variety of partial timbres in response to an instruction from the user.
  • a specific section of the reference musical tone is set in response to a first instruction Q1 from the user, and a control vector V is generated that represents the partial timbre in the specific section. Therefore, it is possible to generate a target musical tone having the partial timbre of the specific section of the reference musical tone desired by the user.
  • the position of the specific section on the time axis is changed in response to the first instruction Q1. Therefore, it is possible to generate a target musical tone having the partial timbre of the position of the reference musical tone desired by the user.
  • the training processing unit 26 in FIG. 2 establishes the first generative model 30 and the second generative model 40 through machine learning using multiple pieces of training data T.
  • the training processing unit 26 in the first embodiment establishes the first generative model 30 and the second generative model 40 collectively. After establishment, each of the first generative model 30 and the second generative model 40 may be trained individually.
  • FIG. 8 is an explanatory diagram regarding machine learning that establishes the first generative model 30 and the second generative model 40.
  • Each of the multiple training data T is composed of a combination of a training control data string Xt, a training reference data string Rt, and a training audio data string Zt.
  • the control data string Xt is time-series data representing the conditions of the target musical tone. Specifically, the control data string Xt represents a time series of musical score features in a specific section (hereinafter referred to as the "training section") of the training piece of music.
  • the format of the control data string Xt is the same as that of the control data string X.
  • the reference data string Rt is time-series data representing the acoustic characteristics of musical tones prepared in advance for a training piece of music.
  • the partial timbre represented by the reference data string Rt is a characteristic of the temporal change in acoustic characteristics of the musical tones of the training piece of music in the training section.
  • the format of the reference data string Rt is the same as that of the reference data string R.
  • the acoustic data sequence Zt is time-series data representing the acoustic characteristics of the musical tones to be generated by the first generative model 30 and the second generative model 40 from the control data sequence Xt and the reference data sequence Rt.
  • the acoustic data sequence Zt corresponds to the ground truth for the control data sequence Xt and the reference data sequence Rt.
  • the format of the acoustic data sequence Zt is the same as that of the acoustic data sequence Z.
  • FIG. 9 is a flowchart of the process (hereinafter referred to as "training process Sb") in which the control device 11 establishes the first generation model 30 and the second generation model 40.
  • the control device 11 executes the training process Sb to realize the training processing unit 26 in FIG. 8.
  • the control device 11 prepares a first provisional model 51 and a second provisional model 52 (Sb1).
  • the first provisional model 51 is an initial or provisional model that is updated to the first generative model 30 by machine learning.
  • the initial first provisional model 51 has a similar configuration to the first generative model 30, but multiple variables are set to, for example, random numbers.
  • the second provisional model 52 is an initial or provisional model that is updated to the second generative model 40 by machine learning.
  • the initial second provisional model 52 has a similar configuration to the second generative model 40, but multiple variables are set to, for example, random numbers.
  • the structure of each of the first provisional model 51 and the second provisional model 52 is arbitrarily designed by a designer.
  • the control device 11 selects one of the multiple training data T (hereinafter referred to as "selected training data T") (Sb2). As illustrated in FIG. 8, the control device 11 generates N parameter sets P1 to PN by processing the reference data string Rt of the selected training data T using the second provisional model 52 (Sb3). Specifically, the second provisional model 52 generates a control vector V, and the control vector processing unit 25 generates N parameter sets P1 to PN. The control device 11 also generates an acoustic data string Z by processing a control data string Xt of the selected training data T using the first provisional model 51 (Sb4). The N parameter sets P1 to PN generated by the second provisional model 52 are applied to the processing of the control data string Xt.
  • the control device 11 calculates a loss function that represents the error between the acoustic data sequence Z generated by the first provisional model 51 and the acoustic data sequence Zt of the selected training data T (Sb5).
  • the control device 11 updates the multiple variables of the first provisional model 51 and the multiple variables of the second provisional model 52 so that the loss function is reduced (ideally minimized) (Sb6).
  • for example, the backpropagation method is used to update each variable according to the loss function.
  • the control device 11 then determines whether a predetermined termination condition is met (Sb7).
  • the termination condition is, for example, that the loss function falls below a predetermined threshold, or that the amount of change in the loss function falls below a predetermined threshold. If the termination condition is not met (Sb7: NO), the control device 11 selects one of the as-yet-unselected training data T as the new selected training data T (Sb2). That is, the process of updating the multiple variables of the first provisional model 51 and the multiple variables of the second provisional model 52 (Sb2 to Sb6) is repeated until the termination condition is met (Sb7: YES). Note that when the above process has been performed for all training data T, each training data T is returned to an unselected state and the same process is repeated. That is, each training data T is used repeatedly.
  • when the termination condition is met, the control device 11 ends the training process Sb.
  • the first provisional model 51 at the time when the termination condition is met is determined to be the trained first generative model 30.
  • the second provisional model 52 at the time when the termination condition is met is determined to be the trained second generative model 40.
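  • A schematic PyTorch training step in the spirit of steps Sb3 to Sb6: tiny stand-ins for the two provisional models are used, the generated acoustic data string Z is compared with the ground-truth Zt, and both models are updated by backpropagation. The mean-squared-error loss and all model shapes are assumptions made only for this sketch.

```python
import torch
import torch.nn as nn

# Tiny stand-ins for the first and second provisional models (illustrative only).
first_model  = nn.Linear(134, 120)     # control data string Xt -> acoustic data string Z
second_model = nn.Linear(120, 2)       # reference data string Rt -> a (scale, offset) parameter set
optimizer = torch.optim.Adam(list(first_model.parameters()) + list(second_model.parameters()), lr=1e-3)

def training_step(xt, rt, zt):
    """One pass of steps Sb3 to Sb6 for a single piece of training data T."""
    p = second_model(rt.mean(dim=1))                          # Sb3: parameters from reference data Rt
    z = first_model(xt) * p[:, :1, None] + p[:, 1:, None]     # Sb4: generate Z under those parameters
    loss = nn.functional.mse_loss(z, zt)                      # Sb5: error against the ground truth Zt
    optimizer.zero_grad()
    loss.backward()                                           # Sb6: update both models by backpropagation
    optimizer.step()
    return loss.item()

xt = torch.randn(1, 800, 134)   # training control data string Xt
rt = torch.randn(1, 800, 120)   # training reference data string Rt
zt = torch.randn(1, 800, 120)   # ground-truth acoustic data string Zt
print(training_step(xt, rt, zt))
```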
  • the first generative model 30 learns the underlying relationship between the control data sequence Xt and the acoustic data sequence Zt under the N parameter sets P1 to PN corresponding to the reference data sequence R. Therefore, the trained first generative model 30 outputs an acoustic data sequence Z that is statistically valid for the unknown control data sequence X under that relationship.
  • the second generative model 40 also learns the underlying relationship between the reference data sequence Rt and the N parameter sets P1 to PN. Specifically, the second generative model 40 learns the relationship between the reference data sequence Rt and the N parameter sets P1 to PN necessary to generate an appropriate acoustic data sequence Z from the control data sequence Xt.
  • the second encoder 243 learns the underlying relationship between the reference data sequence Rt and the control vector V
  • the control vector processing unit 25 learns the underlying relationship between the control vector V and the N parameter sets P1 to PN. Therefore, by using the first generation model 30 and the second generation model 40, an acoustic data sequence Z of a target musical tone is generated to which a desired partial timbre of a reference musical tone is imparted.
B: Second embodiment
  • A second embodiment will be described.
  • elements having the same functions as those in the first embodiment will be denoted by the same reference numerals as those used in the description of the first embodiment, and detailed descriptions thereof will be omitted as appropriate.
  • FIG. 10 is a block diagram of the control vector generation unit 24 in the second embodiment.
  • the control vector generation unit 24 in the second embodiment includes a control vector adjustment unit 244 in addition to the same elements as in the first embodiment (the section setting unit 241, the feature extraction unit 242, and the second encoder 243).
  • the second encoder 243 generates a control vector V in the same manner as in the first embodiment.
  • the initial control vector V generated by the second encoder 243 is denoted as "control vector V0" for convenience.
  • the section setting unit 241 sets a specific section of the reference musical tone in response to the first instruction Q1 from the user. Therefore, the control vector V0 in the second embodiment is generated in response to the first instruction Q1 from the user.
  • the initial control vector V0 does not have to be a vector generated by the second encoder 243.
  • a vector in which each element Ek is set to a predetermined value (e.g., zero), or a vector in which each element Ek is set to a random number may be used as the initial control vector V0.
  • the final control vector V when the previous musical sound synthesis process Sa was executed may be adopted as the current initial control vector V0.
  • in those cases, the elements for generating the control vector V0 (the section setting unit 241, the feature extraction unit 242, and the second encoder 243) may be omitted from the second embodiment.
  • the control vector adjustment unit 244 generates a control vector V by adjusting an initial control vector V0. Specifically, the control vector adjustment unit 244 changes one or more elements Ek of the K elements E1 to EK of the control vector V0 in response to a second instruction Q2 from the user to the operation device 14. A K-dimensional vector consisting of the K elements E1 to EK after the changes is supplied to the control vector processing unit 25 as the control vector V. As can be understood from the above explanation, the control vector generation unit 24 of the second embodiment generates a control vector V in response to a first instruction Q1 and a second instruction Q2 from the user.
  • FIG. 11 is a schematic diagram of the setting screen Gb.
  • the setting screen Gb is a screen for the user to instruct changes to each element Ek.
  • the control vector adjustment unit 244 displays the setting screen Gb on the display device 13.
  • the setting screen Gb includes K operators Gb-1 to Gb-K corresponding to different elements Ek of the control vector V.
  • the K operators Gb-1 to Gb-K are arranged in the horizontal direction.
  • the operators Gb-k corresponding to each element Ek are images that accept operations by the user.
  • each operator Gb-k is, for example, a slider that moves up and down in response to an operation by the user.
  • the second instruction Q2 by the user is, for example, an operation to move each of the K operators Gb-1 to Gb-K.
  • the second instruction Q2 is an instruction from the user to individually specify the numerical value of each element Ek.
  • the numerical value of the element Ek is displayed near each operator Gb-k.
  • the position of each operator Gb-k in the vertical direction corresponds to the numerical value of the element Ek. That is, moving the operator Gb-k upwards means an increase in the element Ek, and moving the operator Gb-k downwards means a decrease in the element Ek.
  • the control vector adjustment unit 244 sets the initial position of each operator Gb-k according to the numerical value of each element Ek of the control vector V0.
  • the control vector adjustment unit 244 then changes the numerical value of the element Ek according to the user's operation to move each operator Gb-k (i.e., the second instruction Q2). That is, the control vector adjustment unit 244 sets the element Ek corresponding to each operator Gb-k according to the user's operation on one or more operators Gb-k among the K operators Gb-1 to Gb-K.
  • the control vector V represents a partial tone. Therefore, the change in each element Ek by the control vector adjustment unit 244 is a process of changing the partial tone in response to the second instruction Q2 from the user. In other words, the temporal change in the acoustic characteristics imparted to the target tone (i.e., the partial tone) changes in response to the second instruction Q2 from the user.
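As a concrete illustration of the adjustment performed by the control vector adjustment unit 244, the following sketch overwrites only the elements Ek the user actually moved while leaving the rest of V0 untouched; the data types and the example values are assumptions for illustration only.

```python
def adjust_control_vector(v0, slider_values):
    """Return the control vector V obtained by changing elements of the initial vector V0.

    v0            -- list of K floats (initial control vector V0)
    slider_values -- mapping {k: value} for the operators Gb-k the user moved (second instruction Q2)
    """
    v = list(v0)                     # elements the user did not touch keep their initial value
    for k, value in slider_values.items():
        v[k] = value                 # set element Ek to the value indicated by operator Gb-k
    return v

# Example: the user moves only operators Gb-1 and Gb-3 (indices 0 and 2 here).
v = adjust_control_vector([0.0] * 8, {0: 0.7, 2: -0.3})
```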
  • the control vector processing unit 25 generates N parameter sets P1 to PN from the control vector V after adjustment by the control vector adjustment unit 244.
  • FIG. 12 is a flowchart of the musical tone synthesis process Sa in the second embodiment.
  • the generation of the control vector V (Sa1) includes the same procedures (Sa11-Sa13) as in the first embodiment, as well as the adjustment of the control vector V0 (Sa14).
  • the control device 11 (control vector adjustment unit 244) generates the control vector V by changing one or more elements Ek of the K elements E1-EK of the initial control vector V0 in response to a second instruction Q2 from the user.
  • the operations other than the adjustment of the control vector V0 (Sa14) are the same as in the first embodiment.
  • the second instruction Q2 is given by the user at any timing in parallel with the musical tone synthesis process Sa.
  • the same effect as in the first embodiment is achieved. Furthermore, in the second embodiment, one or more elements Ek of the K elements E1 to EK of the control vector V0 are changed in response to the second instruction Q2 from the user. Therefore, it is possible to generate a variety of target musical tones having partial timbres in response to the second instruction Q2 from the user. In particular, in the second embodiment, the user can easily adjust the partial timbre by operating each operator Gb-k.
  • the control vector generation unit 24 of the third embodiment generates a control vector V for each unit period on the time axis. That is, the control vector generation unit 24 generates a time series of control vectors V in response to instructions from a user (first instruction Q1, second instruction Q2).
  • the control vector V may be generated in response to one of the first instruction Q1 and the second instruction Q2.
  • the control vector generating unit 24 of the third embodiment generates a control vector V in response to a first instruction Q1 and a second instruction Q2 from the user, as in the second embodiment.
  • the control vector V is generated for each unit period, so the control vector V changes for each unit period within one processing period B. Therefore, the partial timbre assigned to the target musical tone changes at a point midway through the processing period B.
  • the user can specify a specific section by the first instruction Q1 for any time (unit period) of the target music piece.
  • for a unit period for which a specific section is specified, a control vector V is generated in the same manner as in each of the above-mentioned embodiments. For a unit period for which no specific section is specified (hereinafter referred to as a "target period"), a control vector V is generated by interpolating two control vectors V generated for unit periods before and after that target period.
  • specifically, the control vector generation unit 24 generates a control vector V for the target period by interpolating a control vector V corresponding to a specific section specified immediately before the target period (e.g., one or more unit periods in the past) and a control vector V corresponding to a specific section specified immediately after the target period (e.g., one or more unit periods in the future). Any method of interpolating the control vector V may be used.
  • the control vector generation unit 24 also generates a time series of control vectors V by detecting the second instruction Q2 given by the user for each unit period in parallel with the musical sound synthesis process Sa. Note that the control vector generation unit 24 may generate a time series of control vectors V by detecting the second instruction Q2 at a cycle longer than the unit period, and generate a control vector V for each unit period by processing to smooth the time series of control vectors V on the time axis (i.e., a low-pass filter).
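The interpolation and smoothing mentioned above could look like the following sketch, which assumes linear interpolation between the two specified control vectors and a simple moving average as the low-pass step; both choices are assumptions rather than requirements of the embodiment.

```python
import numpy as np

def interpolate_control_vector(v_before, v_after, alpha):
    """Control vector V for a target period lying between two specified sections.
    alpha in [0, 1] is the relative position of the target period between them."""
    v_before, v_after = np.asarray(v_before), np.asarray(v_after)
    return (1.0 - alpha) * v_before + alpha * v_after

def smooth_control_vector_series(vectors, width=5):
    """Moving-average smoothing of a control vector time series along the time axis."""
    vectors = np.asarray(vectors)                  # shape: (number of unit periods, K)
    kernel = np.ones(width) / width
    smoothed = [np.convolve(vectors[:, k], kernel, mode="same") for k in range(vectors.shape[1])]
    return np.stack(smoothed, axis=1)
```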
  • the control vector processing unit 25 of the third embodiment generates N parameter sets P1 to PN from the control vector V for each unit period.
  • the control vector processing unit 25 generates N parameter sets P1 to PN for each unit period on the time axis. In other words, the control vector processing unit 25 generates a time series of each parameter set Pn.
  • the control vector V changes in response to the first instruction Q1 or the second instruction Q2. Therefore, the N parameter sets P1 to PN in the unit period immediately before the first instruction Q1 or the second instruction Q2 are different from the N parameter sets P1 to PN in the unit period immediately after. In other words, the parameter set Pn changes within one processing period B. In a state in which the first instruction Q1 or the second instruction Q2 is not given, the same parameter set Pn is generated over multiple unit periods.
  • the number of unit data Du constituting one unit data string U changes for each stage of processing in the first generation model 30.
  • for the conversion process in one intermediate layer L, as many parameter sets Pn are used as there are unit data Du supplied to that intermediate layer L. Specifically, the conversion model 28-n generates a time series containing the same number of parameter sets Pn as the unit data Du processed by the n-th intermediate layer L.
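Because the number of unit data Du differs from stage to stage, the time series of parameter sets Pn has to be produced at the matching length. One simple way to sketch this is nearest-neighbour resampling of a per-unit-period parameter series; the resampling strategy is an assumption made for illustration.

```python
import numpy as np

def resample_parameter_series(param_series, target_length):
    """Resample a time series of parameter sets Pn to the number of unit data Du
    processed by the n-th intermediate layer L."""
    param_series = np.asarray(param_series)        # shape: (source length, parameter dimension)
    idx = np.round(np.linspace(0, len(param_series) - 1, target_length)).astype(int)
    return param_series[idx]
```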
  • FIG. 13 is an explanatory diagram of the conversion process executed by each intermediate layer L.
  • in the first embodiment, a conversion process is executed in which a common parameter set Pn is applied to each of the multiple unit data Du that make up the unit data string U (FIG. 4). In the third embodiment, by contrast, a conversion process is executed in which an individual parameter set Pn is applied to each of the multiple unit data Du that make up the unit data string U.
  • for example, parameter set Pn(t1) is applied to the conversion process of unit data Du(t1), and parameter set Pn(t2) is applied to the conversion process of unit data Du(t2). Parameter set Pn(t1) and parameter set Pn(t2) are generated separately: parameter set Pn(t1) is generated from control vector V(t1) corresponding to time t1, and parameter set Pn(t2) is generated from control vector V(t2) corresponding to time t2. Therefore, the numerical values of the first parameter p1 and the second parameter p2 may differ between parameter set Pn(t1) and parameter set Pn(t2).
  • the parameter set Pn applied to the conversion process changes at a point in the middle of the unit data string U.
  • a time series of a control vector V is generated in response to instructions from a user (first instruction Q1, second instruction Q2), and a time series of each parameter set Pn is generated from the time series of the control vector V. Therefore, it is possible to generate a variety of target sounds whose timbre changes at points in the middle of the control data string X.
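The conversion process of FIG. 13 can be sketched as a per-unit-period affine transform, assuming each parameter set Pn consists of a first parameter p1 (scale) and a second parameter p2 (offset); tensor shapes are illustrative.

```python
import torch

def conversion_process(unit_data, p1, p2):
    """Apply a time-varying parameter set to a unit data string.

    unit_data -- tensor of shape (T, C): one unit datum Du per unit period t
    p1, p2    -- tensors of shape (T, C): first and second parameters for each unit period
    """
    return unit_data * p1 + p2       # Du(t) is scaled by p1(t) and shifted by p2(t)

# Because p1 and p2 are derived from a time series of control vectors, the parameter set
# applied to Du(t1) can differ from the one applied to Du(t2) within a single string U.
```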
  • FIG. 14 is a block diagram illustrating the configuration of the first generative model 30 (musical tone synthesis unit 22) in the fourth embodiment.
  • the first generative model 30 in the fourth embodiment is an autoregressive (AR) type generative model including a conversion processing unit 61, a convolution layer 62, N unit processing units 63-1 to 63-N, and a synthesis processing unit 64.
  • the first generative model 30 has an arbitrary number (Nx) of intermediate layers, but if all intermediate layers are omitted, it becomes equivalent to the generative model (NPSS) disclosed in the 2017 paper "A Neural Parametric Singing Synthesizer Modeling Timbre and Expression from Natural Songs" by Merlijn Blaauw and Jordi Bonada, published in Applied Sciences.
  • the configuration other than the first generative model 30 is the same as that of the first embodiment.
  • each parameter set Pn generated by the control vector processing unit 25 (conversion model 28-n) is supplied to the corresponding unit processing unit 63-n.
  • similar to the pre-processing unit 311, the conversion processing unit 61 generates latent data d from the control data Dx acquired by the control data acquisition unit 21 for each unit period.
  • the latent data d represents the characteristics of the control data Dx.
  • the conversion processing unit 61 is configured with a multi-layer perceptron for converting the control data Dx into latent data d.
  • the latent data d may be supplied in common to the N unit processing units 63-1 to 63-N, or different data may be supplied individually.
  • the control data Dx acquired by the control data acquisition unit 21 may be supplied to each unit processing unit 63-n as latent data d. In other words, the conversion processing unit 61 may be omitted.
  • FIG. 15 is a block diagram of each unit processing unit 63-n.
  • Each unit processing unit 63-n is a generative model that generates output data O and processed data Cn by processing input data I, latent data d, and parameter set Pn.
  • the input data I includes first data Ia and second data Ib.
  • the unit processing unit 63-n includes a dilated convolution layer 65, an intermediate layer L, and a processing layer 67.
  • the dilated convolution layer 65 generates unit data Du1 by performing dilated convolution on the input data I (first data Ia and second data Ib).
  • the intermediate layer L generates unit data Du2 by performing a conversion process on the unit data Du1.
  • the contents of the conversion process are the same as those in the first embodiment.
  • a parameter set Pn is applied to the conversion process. Note that the intermediate layer L need not be installed in all of the N unit processing units 63-1 to 63-N; it is installed in Nx (one or more) of the N unit processing units 63-n. Here, the explanation assumes that the intermediate layer L is installed in all N unit processing units.
  • the processing layer 67 generates output data O and processing data Cn from the unit data Du2 and the latent data d.
  • the processing layer 67 includes a convolution layer 671, an adder 672, an activation layer 673, an activation layer 674, a multiplier 675, a convolution layer 676, a convolution layer 677, and an adder 678.
  • the convolution layer 671 performs a 1x1 convolution operation on the latent data d.
  • the adder 672 generates unit data Du3 by adding unit data Du2 and the output data of the convolution layer 671.
  • the unit data Du3 is divided into a first part and a second part.
  • the activation layer 673 processes the first part of the unit data Du3 using an activation function (e.g., a tanh function).
  • the activation layer 674 processes the second part of the unit data Du3 using an activation function (e.g., a sigmoid function).
  • the multiplier 675 generates unit data Du4 by calculating an element product between the output data of the activation layer 673 and the output data of the activation layer 674.
  • the unit data Du4 is data obtained by applying a gated activation function (673-675) to the output of the dilated convolution layer 65.
  • each of the unit data Du1 to Du3 includes a first part and a second part, but if a general activation function (an ungated function such as sigmoid, tanh, or ReLU) is used, the unit data Du1 to Du3 only need to include the first part.
  • the convolution layer 676 generates processed data Cn by performing a 1x1 convolution operation on the unit data Du4.
  • the convolution layer 677 performs a 1x1 convolution operation on the unit data Du4.
  • the adder 678 generates output data O by adding the first data Ia and the output data of the convolution layer 677.
  • the output data O is stored in the storage device 12.
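The structure described above can be condensed into a short module. The sketch below assumes PyTorch, models the intermediate layer L as the affine conversion process, and uses placeholder channel counts; it is an illustration of the data flow of FIG. 15, not a definitive implementation.

```python
import torch
import torch.nn as nn

class UnitProcessingUnit(nn.Module):
    """Sketch of unit processing unit 63-n: dilated convolution 65, intermediate layer L,
    gated activation (673-675) and 1x1 convolutions yielding output data O and processed data Cn."""

    def __init__(self, channels=64, dilation=1):
        super().__init__()
        self.dilated = nn.Conv1d(channels, 2 * channels, kernel_size=2, dilation=dilation)
        self.cond = nn.Conv1d(channels, 2 * channels, kernel_size=1)  # convolution layer 671 on latent data d
        self.to_c = nn.Conv1d(channels, channels, kernel_size=1)      # convolution layer 676 -> processed data Cn
        self.to_o = nn.Conv1d(channels, channels, kernel_size=1)      # convolution layer 677 -> residual branch

    def forward(self, x, d, p1, p2):
        # x: input data I, shape (batch, channels, T); d: latent data, broadcastable to the output length
        du1 = self.dilated(x)                          # dilated convolution over input data I
        du2 = du1 * p1 + p2                            # intermediate layer L: apply parameter set Pn
        du3 = du2 + self.cond(d)                       # adder 672: add projected latent data d
        a, b = du3.chunk(2, dim=1)                     # first and second parts of Du3
        du4 = torch.tanh(a) * torch.sigmoid(b)         # gated activation function
        cn = self.to_c(du4)                            # processed data Cn
        o = x[:, :, -du4.size(-1):] + self.to_o(du4)   # adder 678: residual sum -> output data O
        return o, cn
```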
  • the synthesis processing unit 64 in FIG. 14 generates acoustic data Dz by processing N pieces of processed data C1 to CN generated by different unit processing units 63-n. For example, the synthesis processing unit 64 generates acoustic data Dz based on data obtained by weighting the N pieces of processed data C1 to CN. The generation of acoustic data Dz by the synthesis processing unit 64 is repeated for each unit period. In other words, the synthesis processing unit 64 generates a time series of acoustic data Dz. The acoustic data Dz generated by the synthesis processing unit 64 is supplied to the waveform synthesis unit 23 as in the first embodiment, and is also stored in the storage device 12 and used in the convolution layer 62.
  • the convolution layer 62 generates unit data Du0 for each unit period by performing a convolution operation (causal convolution) on the acoustic data Dz generated in the immediately preceding multiple unit periods.
  • the unit data Du0 is supplied to the first-stage unit processing unit 63-1 as input data I.
  • the first data Ia supplied to the unit processing unit 63-1 in each unit period is the unit data Du0 generated in the current unit period.
  • the second data Ib supplied to the unit processing unit 63-1 in each unit period is the unit data Du0 generated in the immediately preceding (one previous) unit period. As described above, the first data Ia and second data Ib corresponding to different unit periods are supplied to the unit processing unit 63-1.
  • each unit processing unit 63-n from the second stage onwards is supplied with multiple output data O generated by the unit processing unit 63-n-1 in the previous stage for different unit periods as first data Ia and second data Ib.
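Putting the pieces of FIG. 14 together, one autoregressive step could be sketched as below, reusing the UnitProcessingUnit sketch above; the plain summation of the processed data C1 to CN stands in for the weighted combination described later and is an assumption.

```python
def generate_unit_period(conv62, units, synth, dz_history, d, param_sets):
    """One autoregressive step of the first generative model 30 (sketch).

    conv62      -- causal convolution over the acoustic data Dz of the preceding unit periods
    units       -- list of N unit processing units 63-1..63-N (e.g. UnitProcessingUnit instances)
    synth       -- synthesis processing unit 64: maps the combined Cn to acoustic data Dz
    dz_history  -- tensor of recently generated acoustic data Dz
    d           -- latent data derived from control data Dx by the conversion processing unit 61
    param_sets  -- list of (p1, p2) pairs, one parameter set Pn per unit processing unit
    """
    x = conv62(dz_history)                     # unit data Du0 supplied as input data I
    skips = []
    for unit, (p1, p2) in zip(units, param_sets):
        x, cn = unit(x, d, p1, p2)             # output data O of stage n feeds stage n+1
        skips.append(cn[..., -1:])             # keep the current unit period of processed data Cn
    dz = synth(sum(skips))                     # combine processed data C1..CN into acoustic data Dz
    return dz                                  # Dz is appended to dz_history for the next step
```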
  • the first generative model 30 of the fourth embodiment includes N intermediate layers L corresponding to different unit processing units 63-n. Furthermore, the convolution layer 62, as well as the dilated convolution layer 65 and the processing layer 67 of each unit processing unit 63-n, are basic layers required for generating a time series of acoustic data Dz. That is, the first generative model 30 of the fourth embodiment includes multiple basic layers and one or more intermediate layers L, similar to the first embodiment. Therefore, the fourth embodiment also achieves the same effects as the first embodiment.
  • in the first to fourth embodiments described above, the target musical tone is a singing tone.
  • the musical tone synthesis system 100 of the fifth embodiment synthesizes, as the target musical tone, an instrument tone to be generated by playing the target piece of music.
  • the control data Dx in the first to fourth embodiments includes the pitch (fundamental frequency) of the target musical tone, information indicating voiced/unvoiced, and phoneme information.
  • the control data Dx in the fifth embodiment is a musical score feature for an instrument sound, which includes the intensity (volume) and performance style of the target musical tone instead of the voiced/unvoiced information and phoneme information.
  • the performance style is, for example, information indicating the method of playing an instrument.
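For illustration only, the control data Dx of the fifth embodiment for one unit period might be assembled as follows; the concrete fields, the one-hot encoding of the performance style, and the style vocabulary are assumptions, not details given in the source.

```python
import numpy as np

PLAYING_STYLES = ["normal", "staccato", "legato", "pizzicato"]   # illustrative vocabulary

def make_instrument_control_data(pitch_hz, intensity, style):
    """Score-feature control data Dx for one unit period of an instrument sound (sketch)."""
    style_onehot = np.zeros(len(PLAYING_STYLES))
    style_onehot[PLAYING_STYLES.index(style)] = 1.0
    return np.concatenate(([pitch_hz, intensity], style_onehot))

dx = make_instrument_control_data(pitch_hz=440.0, intensity=0.8, style="legato")
```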
  • in the fifth embodiment, the target musical tone is an instrument sound, and an instrument sound is also used as the reference musical tone.
  • the partial timbre is a characteristic of the temporal change in the acoustic characteristics of the instrument sound.
  • the first and second generative models 30 and 40 of the fifth embodiment are established by training using training data T for musical instrument sounds (control data sequence Xt, reference data sequence Rt, and acoustic data sequence Zt) in the machine learning of FIG. 8.
  • the first generative model 30 is a trained statistical model that has learned the relationship between the conditions on the musical score of the target musical instrument sound (control data sequence X) and the acoustic features of the target musical instrument sound (acoustic data sequence Z).
  • the musical sound synthesis unit 22 then processes the control data sequence X for the musical instrument sound using the first generative model 30 to generate the acoustic data sequence Z for the musical instrument sound.
  • the first encoder 31 includes a pre-processing unit 311, but the pre-processing unit 311 may be omitted.
  • the control data sequence X may be directly supplied from the control data acquisition unit 21 to the first-stage convolutional layer 321 of the first encoder 31.
  • the decoder 32 includes a post-processing unit 322, but the post-processing unit 322 may be omitted.
  • the acoustic data sequence Z output by the final-stage intermediate layer L may be directly supplied to the waveform synthesis unit 23.
  • the control device 11 (section setting unit 241) may select one of the multiple reference signals Sr in response to the first instruction Q1 from the user.
  • the control device 11 generates a reference data string R from the reference signal Sr of the specific section selected in response to the first instruction Q1.
  • the configuration for changing each element Ek of the control vector V in response to the second instruction Q2 from the user is not limited to the above example.
  • multiple preset data for the control vector V0 may be stored in the storage device 12.
  • Each preset data is data that specifies each of the K elements E1 to EK of the control vector V0.
  • the user can select one of the multiple preset data by operating the operation device 14.
  • the control vector adjustment unit 244 uses the preset data selected by the user as the control vector V0 and applies it to the adjustment.
  • the instruction to select and call one of the multiple preset data corresponds to the second instruction Q2.
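A preset call of this kind reduces to a table lookup, as in the following sketch; the preset names and element values are purely illustrative.

```python
# Preset data stored in the storage device 12 (values are placeholders).
PRESETS = {
    "preset A": [0.8, -0.2, 0.1, 0.0],
    "preset B": [0.1, 0.9, -0.3, 0.2],
}

def recall_preset(name):
    """Second instruction Q2 given as a preset selection: return a copy of the stored V0."""
    return list(PRESETS[name])
```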
  • in the second embodiment, the position of each operator Gb-k corresponds to the numerical value of each element Ek, but the position of each operator Gb-k may instead correspond to the amount of change in each element Ek.
  • the control vector adjustment unit 244 sets the numerical value of the element Ek to a numerical value that is changed from the initial value in the control vector V0 by an amount corresponding to the position of the operator Gb-k.
  • the musical sound synthesis system 100 is provided with the training processing unit 26 for convenience, but the training processing unit 26 may be mounted on a machine learning system separate from the musical sound synthesis system 100.
  • the first generation model 30 and the second generation model 40 established by the machine learning system are provided to the musical sound synthesis system 100 and used in the musical sound synthesis process Sa.
  • an audio signal W is generated using a processing period B as a time unit.
  • multiple processing periods B that are consecutive on the time axis may partially overlap each other on the time axis, as illustrated in FIG. 16. Note that the temporal relationship between the processing periods B is not limited to the example illustrated in FIG. 16.
  • an audio signal W is generated sequentially for each processing period B on the time axis.
  • the audio signals W within the valid period b of each processing period B are added (e.g., weighted averaged) to each other between successive processing periods B on the time axis to generate a final audio signal.
  • the valid period b is a period included in the processing period B.
  • the valid period b is a period obtained by excluding from the processing period B a period of a predetermined length including the start point of the processing period B and a period of a predetermined length including the end point of the processing period B.
  • with the above configuration, the discontinuity of the waveform of the audio signal W at the end (start point or end point) of the processing period B is reduced, and as a result, an audio signal with a continuous waveform that sounds natural can be generated.
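The combination of overlapping processing periods B can be sketched as a cross-faded overlap-add restricted to the valid periods b; the hop size and the linear fade shape are assumptions made for illustration.

```python
import numpy as np

def overlap_add(segments, hop, fade):
    """Combine the per-processing-period audio signals W into one continuous signal.

    segments -- list of 1-D arrays, one audio signal W per processing period B (equal lengths)
    hop      -- spacing between the start points of consecutive processing periods, in samples
    fade     -- number of samples excluded (cross-faded) at each end of a processing period
    """
    seg_len = len(segments[0])
    total = hop * (len(segments) - 1) + seg_len
    out = np.zeros(total)
    weight = np.zeros(total)
    window = np.ones(seg_len)
    window[:fade] = np.linspace(0.0, 1.0, fade)    # attenuate around the start point of B
    window[-fade:] = np.linspace(1.0, 0.0, fade)   # attenuate around the end point of B
    for i, seg in enumerate(segments):
        start = i * hop
        out[start:start + seg_len] += np.asarray(seg) * window
        weight[start:start + seg_len] += window
    return out / np.maximum(weight, 1e-9)          # weighted average where periods overlap
```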
  • a virtual operator Gb-k is displayed on the display device 13, but the operator Gb-k that receives an instruction to change each element Ek may be a real operator that the user can actually touch.
  • the conversion process executed by each intermediate layer L is not limited to the process exemplified in each of the above-mentioned embodiments.
  • one of the multiplication of the first parameter p1 and the addition of the second parameter p2 may be omitted. In a form in which the addition is omitted, the parameter set Pn is composed of only the first parameter p1; in a form in which the multiplication is omitted, the parameter set Pn is composed of only the second parameter p2.
  • the parameter set Pn is expressed as a variable including one or more parameters.
  • the first generative model 30 including the first encoder 31 and the decoder 32 is exemplified, but the configuration of the first generative model 30 is not limited to the above examples.
  • the first generative model 30 is comprehensively expressed as a model that learns the relationship between the conditions of the target musical tone (control data sequence X) and the acoustic features of the target musical tone (acoustic data sequence Z). Therefore, a model of any structure including one or more intermediate layers L capable of performing conversion processing is used as the first generative model 30.
  • the configuration of the second generation model 40 is not limited to the examples in the above-mentioned embodiments.
  • a configuration in which the sampling unit 414 samples each element Ek of the control vector V from the corresponding probability distribution Fk has been exemplified, but the control vector V may instead be generated by the multiple convolution layers 411.
  • the output processing unit 412 in the second encoder 243 may be omitted.
  • the section setting unit 241 may set the reference signal Sr for a specific section regardless of an instruction from the user.
  • the section setting unit 241 sets a section in which the acoustic characteristics of the reference signal Sr satisfy a specific condition as the specific section.
  • the section setting unit 241 sets a section in which the acoustic characteristics, such as timbre, fluctuate significantly as the specific section.
  • the entire reference signal Sr may be used as the specific section. In a configuration in which the entire reference signal Sr is used as the specific section, the section setting unit 241 may be omitted.
  • a configuration in which the first generation model 30 includes N1 encoding intermediate layers Le and N2 decoding intermediate layers Ld has been exemplified, but the encoding intermediate layers Le or the decoding intermediate layers Ld may be omitted.
  • a form in which the first encoder 31 of the first generation model 30 does not include an encoding intermediate layer Le, or a form in which the decoder 32 does not include a decoding intermediate layer Ld are also envisioned.
  • each intermediate layer L performs a conversion process. Therefore, a form in which the first encoder 31 does not perform a conversion process, or a form in which the decoder 32 does not perform a conversion process are also envisioned.
  • in a form in which the encoding intermediate layers Le are omitted, the first generation model 30 includes N2x decoding intermediate layers Ld. As described above, the number N2x of the decoding intermediate layers Ld is a natural number equal to or less than N2. Also, in a form in which the decoding intermediate layers Ld are omitted, the first generation model 30 includes N1x encoding intermediate layers Le. As described above, the number N1x of the encoding intermediate layers Le is a natural number equal to or less than N1.
  • the number Nx of intermediate layers L in the first generation model 30 in the first to fourth embodiments is a natural number between 1 and N. That is, the first generation model 30 is comprehensively expressed as a model including multiple base layers and one or more intermediate layers L.
  • the intermediate layers L are included in one or both of the first encoder 31 and the decoder 32. That is, the conversion process is performed in at least one location in the first generation model 30.
  • the musical sound synthesis system 100 may be realized by a server device that communicates with an information device such as a smartphone or tablet terminal. For example, the musical sound synthesis system 100 generates an audio signal W from the music data M and the reference signal Sr received from the information device, and transmits the audio signal W to the information device. Note that in a form in which the audio data sequence Z generated by the musical sound synthesis unit 22 is transmitted from the musical sound synthesis system 100 to the information device, the waveform synthesis unit 23 may be omitted from the musical sound synthesis system 100. The information device generates an audio signal from the audio data sequence Z received from the musical sound synthesis system 100.
  • a control data sequence X may be transmitted from the information device to the musical sound synthesis system 100 instead of the music data M.
  • the control data acquisition unit 21 receives the control data sequence X transmitted from the information device. "Receiving" the control data Dx (control data sequence X) is an example of "acquiring" the control data Dx.
  • the functions of the musical tone synthesis system 100 exemplified above are realized by the cooperation of one or more processors constituting the control device 11 and the program stored in the storage device 12.
  • the program according to the present disclosure can be provided in a form stored in a computer-readable recording medium and installed in a computer.
  • the recording medium is, for example, a non-transitory recording medium, and a good example is an optical recording medium (optical disk) such as a CD-ROM, but also includes any known type of recording medium such as a semiconductor recording medium or a magnetic recording medium.
  • a non-transitory recording medium includes any recording medium except a transient, propagating signal, and does not exclude volatile recording media.
  • in a configuration in which the program is distributed from a distribution device via a communication network, the storage device that stores the program in the distribution device corresponds to the non-transitory recording medium described above.
  • a musical sound synthesis method according to one aspect (Aspect 1) is a musical sound synthesis method realized by a computer system, which acquires a time series of control data representing the conditions of a target musical sound, and generates a time series of acoustic data representing the acoustic features of the target musical sound by processing the time series of control data using a trained generative model that includes multiple basic layers and one or more intermediate layers and has learned the relationship between the conditions of the musical sounds and the acoustic features of the musical sounds.
  • the method generates a control vector representing the characteristics of the temporal change in timbre in response to an instruction from a user, and generates a first parameter set from the control vector.
  • a first intermediate layer of the one or more intermediate layers applies the first parameter set to the data input to the first intermediate layer, and outputs the data after application to the next layer.
  • a control vector representing the characteristics of the temporal change in timbre (partial timbre) is generated in response to instructions from the user, a first parameter set is generated from the control vector, and the first parameter set is applied to the data input to the first intermediate layer. Therefore, it is possible to generate a time series of acoustic data for a target musical tone having a variety of partial timbres in response to instructions from the user.
  • a “target musical sound” is a musical sound that is the target to be synthesized.
  • a "musical sound" means a sound related to music. For example, a singer's singing voice or a musical instrument's performance sound is a sound related to music, and each is an example of a "musical sound."
  • Control data is data in any format that represents the conditions of a target musical tone.
  • data that represents the features (score features) of music data that represents the musical score of a musical tone is an example of "control data”.
  • the type of score features represented by the control data is arbitrary. For example, score features similar to those in Patent Document 1 are used.
  • in a specific example (Aspect 2) of Aspect 1, in generating the control vector, a time series of the control vector is generated in response to instructions from the user, and in generating the first parameter set, a time series of the first parameter set is generated from the time series of the control vector.
  • a time series of the control vector is generated in response to instructions from the user, and a time series of the first parameter set is generated from the time series of the control vector. Therefore, it is possible to generate a variety of target sounds whose timbre changes at intermediate points in the time series of the control data.
  • a second parameter set is further generated from the control vector, and a second intermediate layer of the one or more intermediate layers executes processing in which the second parameter set is applied to data input to the second intermediate layer, and outputs the data after application to the next layer.
  • the second parameter set in addition to applying the first parameter set to data input to the first intermediate layer, the second parameter set is applied to data input to the second intermediate layer. Therefore, a time series of acoustic data of a target musical tone having a variety of partial timbres can be generated.
  • in a specific example, the one or more intermediate layers are multiple intermediate layers, the generative model includes a first encoder including multiple encoding intermediate layers among the one or more intermediate layers and a decoder including multiple decoding intermediate layers among the one or more intermediate layers, and in generating the time series of acoustic data, the time series of control data is processed by the first encoder to generate intermediate data representing characteristics of the time series of control data, and the time series of acoustic data is generated by processing the intermediate data by the decoder.
  • the time series of acoustic data can be generated by encoding by the first encoder and decoding by the decoder.
  • the "first encoder” is a statistical model that generates intermediate data that represents the characteristics of a time series of control data.
  • the decoder is a statistical model that generates a time series of acoustic data from the intermediate data.
  • each of the "first intermediate layer" and the "second intermediate layer" may be either an encoding intermediate layer or a decoding intermediate layer.
  • the first encoder compresses data on the time axis, and the decoder expands data on the time axis.
  • intermediate data is generated that appropriately reflects the characteristics of the time series of control data, and a time series of acoustic data is generated that appropriately reflects the characteristics of the intermediate data.
  • a specific section of the reference musical sound is set in response to a first instruction from the user, and a time series of reference data representing the acoustic features of the reference musical sound in the specific section is processed by a second encoder to generate the control vector representing the characteristics of the temporal change in timbre in the specific section of the reference musical sound.
  • a specific section of the reference musical sound is set in response to a first instruction from the user, and a control vector representing the characteristics of the temporal change in timbre in the specific section (partial timbre) is generated. Therefore, it is possible to generate a target musical sound having the partial timbre of the specific section of the reference musical sound in response to the first instruction.
  • the position of the specific section on the time axis is further changed in response to the first instruction.
  • the position of the specific section on the time axis in the reference musical sound is changed in response to a first instruction from the user. Therefore, it is possible to generate a target musical sound having a partial timbre at a position of the reference musical sound desired by the user.
  • the control vector includes a plurality of elements, and in generating the control vector, one or more of the plurality of elements are changed in response to a second instruction from the user.
  • one or more of the plurality of elements of the control vector are changed in response to a second instruction from the user. Therefore, it is possible to generate a variety of target musical tones having partial timbres in response to the second instruction from the user.
  • the second instruction is an operation on a plurality of operators corresponding to the plurality of elements, and in changing the one or more elements, the one or more elements are set in response to an operation on one or more of the plurality of operators that correspond to the one or more elements.
  • the user can easily adjust partial tones by operating each operator.
  • the "operator” may take any form.
  • for example, a reciprocating operator (a slider) or a rotary operator (a knob) may be used as the operator.
  • the "operator” may be a real operator that the user can touch, or a virtual operator that is displayed by a display device.
  • the first intermediate layer performs a conversion process by applying the first parameter set to the data input to the first intermediate layer.
  • the first parameter set includes a first parameter and a second parameter
  • the conversion process includes multiplication of the first parameter and addition of the second parameter.
  • a conversion process including multiplication of the first parameter and addition of the second parameter is performed on the data input to the first intermediate layer. Therefore, it is possible to generate acoustic data of a target musical tone to which the partial timbre represented by the control vector is appropriately imparted.
  • a musical sound synthesis system includes a control data acquisition unit that acquires a time series of control data representing the conditions of a target musical sound, a control vector generation unit that generates a control vector representing the characteristics of a temporal change in tone in response to an instruction from a user, a control vector processing unit that generates a first parameter set from the control vector, and a musical sound synthesis unit that generates a time series of acoustic data representing the acoustic features of the target musical sound by processing the time series of the control data using a trained generative model that includes multiple base layers and one or more intermediate layers and has learned the relationship between the conditions of the musical sounds and the acoustic features of the musical sounds, and a first intermediate layer of the one or more intermediate layers applies the first parameter set to data input to the first intermediate layer and outputs the data after application to the next layer.
  • a program causes a computer system to function as a control data acquisition unit that acquires a time series of control data representing the conditions of a target musical tone, a control vector generation unit that generates a control vector representing the characteristics of temporal changes in tone in response to an instruction from a user, a control vector processing unit that generates a first parameter set from the control vector, and a musical tone synthesis unit that includes a plurality of base layers and one or more intermediate layers and processes the time series of control data using a trained generative model that has learned the relationship between the conditions of a musical tone and the acoustic features of the musical tone, thereby generating a time series of acoustic data representing the acoustic features of the target musical tone, and a first intermediate layer of the one or more intermediate layers applies the first parameter set to data input to the first intermediate layer and outputs the data after application to the next layer.

Abstract

A musical sound synthesis system 100 comprises: a control data acquisition unit 21 that acquires a time series X of control data indicating the condition of a target musical sound; a control vector generation unit 24 that generates a control vector V representing the feature of temporal change of timbre in response to an instruction from a user; a control vector processing unit 25 that generates a first parameter set Pn from the control vector V; and a musical sound synthesis unit 22 that generates a time series Z of acoustic data representing the acoustic feature quantity of the target musical sound by processing the time series X of the control data by a trained first generative model 30 including a plurality of basic layers and one or more intermediate layers and having learned the relation between the condition of the musical sound and the acoustic feature quantity of the musical sound. A first intermediate layer out of the one or more intermediate layers executes processing in which the first parameter set Pn is applied to data to be inputted to the first intermediate layer, and outputs the data after the application to the next layer.

Description

Musical sound synthesis method, musical sound synthesis system, and program
 This disclosure relates to technology for synthesizing sound.
 Technologies have been proposed for generating desired musical tones using generative models such as neural networks. For example, Patent Document 1 discloses a configuration for generating a time series of acoustic features of a voice waveform by processing a time series of multidimensional score features related to voice using a convolutional neural network.
Patent Document 1: Japanese Patent No. 6552146
 In recent voice synthesis using generative models, it is required not only to synthesize uniform musical tones from a time series of musical score features, but also to impart temporal changes in timbre in a partial section of a specific musical tone (hereinafter referred to as "partial timbre") to the musical tone in response to instructions from the user. In consideration of the above circumstances, one aspect of the present disclosure aims to generate musical tones with a variety of partial timbres in response to instructions from the user.
 In order to solve the above problems, a musical sound synthesis method according to one aspect of the present disclosure is a musical sound synthesis method realized by a computer system, which acquires a time series of control data representing the conditions of a target musical sound, and generates a time series of acoustic data representing the acoustic features of the target musical sound by processing the time series of control data using a trained generative model that includes multiple basic layers and one or more intermediate layers and has learned the relationship between the conditions of the musical sounds and the acoustic features of the musical sounds. The method generates a control vector representing the characteristics of the temporal change in timbre in response to an instruction from a user, and generates a first parameter set from the control vector. A first intermediate layer of the one or more intermediate layers applies the first parameter set to the data input to the first intermediate layer, and outputs the data after application to the next layer.
 A musical sound synthesis system according to one aspect of the present disclosure includes a control data acquisition unit that acquires a time series of control data representing the conditions of a target musical sound, a control vector generation unit that generates a control vector representing the characteristics of temporal changes in tone in response to an instruction from a user, a control vector processing unit that generates a first parameter set from the control vector, and a musical sound synthesis unit that includes a plurality of base layers and one or more intermediate layers, and processes the time series of the control data using a trained generative model that has learned the relationship between the conditions of a musical sound and the acoustic features of the musical sound, thereby generating a time series of acoustic data representing the acoustic features of the target musical sound, and a first intermediate layer of the one or more intermediate layers applies the first parameter set to data input to the first intermediate layer, and outputs the data after application to the next layer.
 A program according to one aspect of the present disclosure causes a computer system to function as a control data acquisition unit that acquires a time series of control data representing the conditions of a target musical tone, a control vector generation unit that generates a control vector representing the characteristics of temporal changes in tone in response to an instruction from a user, a control vector processing unit that generates a first parameter set from the control vector, and a musical tone synthesis unit that includes a plurality of base layers and one or more intermediate layers and processes the time series of control data using a trained generative model that has learned the relationship between the conditions of a musical tone and the acoustic features of the musical tone, thereby generating a time series of acoustic data representing the acoustic features of the target musical tone, and a first intermediate layer of the one or more intermediate layers applies the first parameter set to data input to the first intermediate layer and outputs the data after application to the next layer.
Brief description of the drawings:
FIG. 1 is a block diagram illustrating the configuration of a musical sound synthesis system according to the first embodiment.
FIG. 2 is a block diagram illustrating the functional configuration of the musical sound synthesis system.
FIG. 3 is a block diagram illustrating a specific configuration of the first generative model.
FIG. 4 is an explanatory diagram of the conversion process.
FIG. 5 is a schematic diagram of the setting screen.
FIG. 6 is a block diagram illustrating a specific configuration of the second generative model.
FIG. 7 is a flowchart of the musical sound synthesis process.
FIG. 8 is an explanatory diagram of machine learning.
FIG. 9 is a flowchart of the training process.
FIG. 10 is a block diagram of the control vector generation unit in the second embodiment.
FIG. 11 is a schematic diagram of the setting screen in the second embodiment.
FIG. 12 is a flowchart of the musical sound synthesis process in the second embodiment.
FIG. 13 is an explanatory diagram of the conversion process executed by each intermediate layer L.
FIG. 14 is a block diagram of the first generative model in the fourth embodiment.
FIG. 15 is a block diagram of the unit processing unit in the fourth embodiment.
FIG. 16 is an explanatory diagram of the processing period in a modified example.
A: First embodiment
 Fig. 1 is a block diagram illustrating the configuration of a musical sound synthesis system 100 according to a first embodiment. The musical sound synthesis system 100 is a computer system that synthesizes a desired musical sound (hereinafter referred to as a "target musical sound"). The target musical sound is a musical sound to be synthesized by the musical sound synthesis system 100. In the first embodiment, a singing sound to be produced by singing a specific piece of music (hereinafter referred to as a "target piece of music") is exemplified as the target musical sound.
 The musical sound synthesis system 100 comprises a control device 11, a storage device 12, a display device 13, an operation device 14, and a sound emission device 15. The musical sound synthesis system 100 is realized by an information device such as a smartphone, a tablet terminal, or a personal computer. Note that the musical sound synthesis system 100 can be realized as a single device, or as multiple devices configured separately from each other.
 The control device 11 is a single or multiple processors that control each element of the musical sound synthesis system 100. Specifically, the control device 11 is composed of one or more types of processors, such as a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), an SPU (Sound Processing Unit), a DSP (Digital Signal Processor), an FPGA (Field Programmable Gate Array), or an ASIC (Application Specific Integrated Circuit).
 The storage device 12 is a single or multiple memories that store the programs executed by the control device 11 and various data used by the control device 11. For example, a well-known recording medium such as a semiconductor recording medium or a magnetic recording medium, or a combination of multiple types of recording media, is used as the storage device 12. Note that, for example, a portable recording medium that is detachable from the musical sound synthesis system 100, or a recording medium that the control device 11 can access via a communication network (e.g., cloud storage) may also be used as the storage device 12. The storage device 12 of the first embodiment stores music data M and a reference signal Sr.
 The music data M represents the musical score of the target music piece. More specifically, the music data M specifies the pitch, pronunciation period, and pronunciation character for each of the multiple notes of the target music piece. The pitch is one of multiple discretely set scale notes. The pronunciation period is specified, for example, by the start point and duration of the note. The pronunciation character is a symbol representing the lyrics of the music piece. For example, a music file that complies with the MIDI (Musical Instrument Digital Interface) standard is used as the music data M. The music data M is provided to the musical sound synthesis system 100, for example, from a distribution device via a communication network.
 参照信号Srは、特定の楽音(以下「参照楽音」という)の波形を表す音響信号である。参照楽音は、例えば参照用の楽曲の歌唱により発音されるべき歌唱音である。参照信号Srは、配信装置から通信網を介して楽音合成システム100に提供される。なお、参照信号Srは、例えば光ディスク等の記録媒体を駆動する再生装置から提供されてもよいし、収音装置を利用した参照楽音の収音により生成されてもよい。また、参照信号Srは、歌唱合成または楽音合成等の公知の合成技術により合成された音響信号でもよい。なお、参照信号Srに対応する参照用の楽曲と目標楽曲とは、共通の楽曲でも別個の楽曲でもよい。また、目標楽音の歌唱者と参照楽音の歌唱者とは同じでも異なってもよい。 The reference signal Sr is an audio signal that represents the waveform of a specific musical tone (hereinafter referred to as "reference musical tone"). The reference musical tone is, for example, a singing tone that should be produced by singing a reference musical piece. The reference signal Sr is provided to the musical tone synthesis system 100 from a distribution device via a communication network. The reference signal Sr may be provided from a playback device that drives a recording medium such as an optical disk, or may be generated by collecting the reference musical tone using a sound collection device. The reference signal Sr may also be an audio signal synthesized using a known synthesis technique such as singing synthesis or musical tone synthesis. The reference musical piece and the target musical piece corresponding to the reference signal Sr may be the same piece or different pieces of music. The singer of the target musical tone and the singer of the reference musical tone may be the same or different.
 第1実施形態の目標楽音は、目標楽曲の歌唱音であり、かつ、参照楽音のうち特定の期間(以下「特定区間」という)内における音響特性の時間的な変化の特徴(以下「部分音色」という)が付与された楽音である。具体的には、利用者の所望の部分音色が付与された楽音が目標楽音として生成される。例えば、音量または音高等の音響特性の反復的な変動(ビブラート)、または音響特性の経時的な変化等、特定区間に存在する所望の特徴が、部分音色として想定される。以上の説明から理解される通り、参照楽音は、目標楽曲に付与されるべき部分音色の素材となる楽音である。制御装置11は、楽曲データMと参照信号Srとを利用して、目標楽音を表す音響信号Wを生成する。音響信号Wは、目標楽音の波形を表す時間領域の信号である。 The target musical tone in the first embodiment is a singing tone of a target musical piece, and is a musical tone that is given a feature of temporal changes in acoustic characteristics (hereinafter referred to as "partial tone") within a specific period (hereinafter referred to as "specific section") among the reference musical tone. Specifically, a musical tone that is given a partial tone desired by the user is generated as the target musical tone. For example, a desired feature that exists in a specific section, such as repeated fluctuations (vibrato) in acoustic characteristics such as volume or pitch, or changes in acoustic characteristics over time, is assumed as a partial tone. As can be understood from the above explanation, the reference musical tone is a musical tone that is the material for the partial tone to be given to the target musical piece. The control device 11 uses the musical piece data M and the reference signal Sr to generate an audio signal W that represents the target musical tone. The audio signal W is a time-domain signal that represents the waveform of the target musical tone.
 表示装置13は、制御装置11による制御のもとで画像を表示する。表示装置13は、例えば、液晶表示パネルまたは有機EL(Electroluminescence)パネル等の表示パネルである。操作装置14は、利用者からの指示を受付ける入力機器である。操作装置14は、例えば、利用者が操作する操作子、または、利用者による接触を検知するタッチパネルである。なお、楽音合成システム100とは別体の表示装置13または操作装置14が、楽音合成システム100に有線または無線により接続されてもよい。 The display device 13 displays images under the control of the control device 11. The display device 13 is, for example, a display panel such as a liquid crystal display panel or an organic EL (Electroluminescence) panel. The operation device 14 is an input device that accepts instructions from a user. The operation device 14 is, for example, an operator operated by the user, or a touch panel that detects contact by the user. Note that the display device 13 or operation device 14, which are separate from the musical sound synthesis system 100, may be connected to the musical sound synthesis system 100 by wire or wirelessly.
The sound emitting device 15 reproduces sound under the control of the control device 11. Specifically, the sound emitting device 15 reproduces the target musical tone represented by the audio signal W. For example, a speaker or headphones are used as the sound emitting device 15. For convenience, a D/A converter that converts the audio signal W from digital to analog and an amplifier that amplifies the audio signal W are omitted from the illustration. A sound emitting device 15 separate from the musical tone synthesis system 100 may be connected to the musical tone synthesis system 100 by wire or wirelessly.
FIG. 2 is a block diagram illustrating the functional configuration of the musical tone synthesis system 100. By executing a program stored in the storage device 12, the control device 11 realizes multiple functions for generating the audio signal W of the target musical tone (a control data acquisition unit 21, a musical tone synthesis unit 22, a waveform synthesis unit 23, a control vector generation unit 24, a control vector processing unit 25, and a training processing unit 26).
In the following drawings, the data size (number of dimensions) b of one piece of data and the time length a of a time series made up of multiple pieces of that data are denoted by the symbols [a, b]. The time length a is expressed as a number of periods of a predetermined length on the time axis (hereinafter referred to as "unit periods"). For example, [800, 134] in FIG. 2 means a time series in which 134-dimensional data are arranged over 800 unit periods. A unit period is, for example, a period (frame) with a time length of about 5 milliseconds, so 800 unit periods correspond to 4 seconds. Note that these values are only examples and may be changed as desired. Each unit period is identified by its time.
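As a concrete illustration of the [a, b] notation only (the array contents and the exact frame length are placeholders, not values fixed by the embodiment), a time series of 134-dimensional data over 800 unit periods of about 5 milliseconds each could be represented as follows:

```python
import numpy as np

# Hypothetical illustration of the [a, b] notation: 134-dimensional data
# arranged over 800 unit periods, i.e. a time series of shape [800, 134].
UNIT_PERIOD_SEC = 0.005                     # one unit period (frame), about 5 ms
control_sequence = np.zeros((800, 134))     # shape [a, b] = [800, 134]

print(control_sequence.shape)               # (800, 134)
print(800 * UNIT_PERIOD_SEC)                # 4.0 seconds per processing period
```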
The control data acquisition unit 21 acquires control data Dx representing the conditions of the target musical tone. Specifically, the control data acquisition unit 21 acquires control data Dx for each unit period. In the first embodiment, the control data acquisition unit 21 generates the control data Dx of each unit period from the music data M. In other words, "generation" of the control data Dx is one example of "acquisition" of the control data Dx.
The control data Dx represents feature quantities of the musical score of the target musical piece (hereinafter referred to as "score features"). The score features represented by the control data Dx include, for example, the pitch (fundamental frequency) in the unit period, information indicating voiced/unvoiced in the unit period, and phoneme information in the unit period.
The pitch is the value, for one unit period, of the pitch time series (pitch trajectory) corresponding to the notes specified by the music data M. Whereas the pitch of each note in the target musical piece is discrete, the pitch trajectory used for the control data Dx is a continuous change of pitch along the time axis. The control data acquisition unit 21 estimates the pitch trajectory for the control data Dx, for example, by processing the music data M with an estimation model that has learned the relationship between the pitch of each note and the pitch trajectory. However, the method of generating the pitch trajectory is not limited to this example. The control data Dx may also include the discrete pitch of each note.
The phoneme information is information about the phonemes corresponding to the lyric characters of the target musical piece. Specifically, the phoneme information includes, for example, information specifying one of a plurality of phonemes, for example in a one-hot representation, the position of the unit period within the phoneme period, the time length from the beginning or end of the phoneme period, and the duration of the phoneme.
The time series of control data Dx within a processing period B constitutes a control data string X. The processing period B is a period of predetermined length made up of multiple (specifically, 800) consecutive unit periods on the time axis. As understood from the above explanation, the control data acquisition unit 21 of the first embodiment generates a time series of control data Dx representing the conditions of the target musical tone (that is, a control data string X) for each processing period B on the time axis.
The musical tone synthesis unit 22 generates an acoustic data string Z by processing the control data string X. Specifically, the musical tone synthesis unit 22 generates an acoustic data string Z for each processing period B. The acoustic data string Z is time-series data representing the acoustic features of the target musical tone in the processing period B. The acoustic data string Z is composed of multiple (specifically, 800) pieces of acoustic data Dz corresponding to the consecutive unit periods within the processing period B. In other words, the acoustic data string Z is the time series of acoustic data Dz within the processing period B. The musical tone synthesis unit 22 generates the acoustic data string Z of a processing period B from the control data string X corresponding to that processing period B.
The acoustic data Dz represents acoustic feature quantities of the target musical tone. The acoustic feature quantities are, for example, the amplitude spectral envelope of the target musical tone. Specifically, the acoustic data Dz includes the amplitude spectral envelope of the harmonic components of the target musical tone and the amplitude spectral envelope of the non-harmonic components of the target musical tone. The amplitude spectral envelope is the outline of the amplitude spectrum. The amplitude spectral envelopes of the harmonic and non-harmonic components are expressed, for example, as mel-cepstra or MFCC (Mel-Frequency Cepstrum Coefficients). As understood from the above explanation, the musical tone synthesis unit 22 of the first embodiment generates a time series of acoustic data Dz representing the acoustic features of the target musical tone (that is, an acoustic data string Z) for each processing period B. The acoustic data Dz may include the amplitude spectral envelope and the pitch trajectory of the target musical tone. The acoustic data Dz may also include the spectrum (amplitude spectrum or power spectrum) of the target musical tone, which may be expressed, for example, as a mel spectrum. The amplitude spectral envelope may also be the outline of the power spectrum (the power spectral envelope).
The waveform synthesis unit 23 generates the audio signal W of the target musical tone from the acoustic data string Z. Specifically, the waveform synthesis unit 23 generates a waveform signal from the acoustic data Dz of each unit period by calculations including, for example, a discrete inverse Fourier transform, and generates the audio signal W by concatenating the waveform signals of consecutive unit periods on the time axis. A deep neural network that has learned the relationship between the acoustic data string Z and the waveform signal (a so-called neural vocoder) may also be used as the waveform synthesis unit 23. The audio signal W generated by the waveform synthesis unit 23 is supplied to the sound emitting device 15, whereby the target musical tone is reproduced from the sound emitting device 15. The pitch generated by the control data acquisition unit 21 may be applied to the generation of the audio signal W by the waveform synthesis unit 23.
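The following is a minimal sketch of this kind of per-frame inverse Fourier synthesis, assuming each acoustic data Dz has already been converted to a linear amplitude spectrum, and using a zero-phase assumption, a fixed hop size, and a Hann window purely for simplicity; these choices are illustrative, and an actual implementation (or a neural vocoder, as mentioned above) would treat phase and spectral detail quite differently:

```python
import numpy as np

def synthesize_waveform(amplitude_spectra, hop=240, n_fft=1024):
    """Overlap-add per-frame inverse DFT synthesis (simplified sketch).

    amplitude_spectra: array of shape [num_frames, n_fft // 2 + 1]
    hop: samples per unit period (240 samples = 5 ms at 48 kHz, an assumption)
    """
    num_frames = amplitude_spectra.shape[0]
    window = np.hanning(n_fft)
    out = np.zeros(hop * (num_frames - 1) + n_fft)
    for i, mag in enumerate(amplitude_spectra):
        frame = np.fft.irfft(mag, n=n_fft)          # zero-phase assumption
        out[i * hop:i * hop + n_fft] += window * frame
    return out

# e.g. 800 frames of a 513-bin amplitude spectrum -> time-domain signal W
w = synthesize_waveform(np.abs(np.random.randn(800, 513)))
```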
As illustrated in FIG. 2, the musical tone synthesis unit 22 generates the acoustic data string Z by processing the control data string X with a first generative model 30. The first generative model 30 is a trained statistical model that has learned, through machine learning, the relationship between the score-level conditions of the target musical tone (control data string X) and the acoustic features of the target musical tone (acoustic data string Z). That is, the first generative model 30 outputs an acoustic data string Z in response to the input of a control data string X. The first generative model 30 is configured, for example, as a deep neural network.
The first generative model 30 is realized by a combination of a program that causes the control device 11 to execute the operations (architecture) for generating the acoustic data string Z from the control data string X, and multiple variables (weights and biases) applied to those operations. The program and the variables that realize the first generative model 30 are stored in the storage device 12. The variables of the first generative model 30 are set in advance by machine learning. The first generative model 30 of the first embodiment includes a first encoder 31 and a decoder 32.
The first encoder 31 is a trained statistical model that has learned, through machine learning, the relationship between the control data string X and intermediate data Y. That is, the first encoder 31 outputs intermediate data Y in response to the input of a control data string X. The musical tone synthesis unit 22 generates the intermediate data Y by processing the control data string X with the first encoder 31. The intermediate data Y represents the features of the control data string X. Specifically, the generated acoustic data string Z changes according to the features of the control data string X represented by the intermediate data Y. In other words, the first encoder 31 encodes the control data string X into the intermediate data Y.
The decoder 32 is a trained statistical model that has learned, through machine learning, the relationship between the intermediate data Y and the acoustic data string Z. That is, the decoder 32 outputs an acoustic data string Z in response to the input of intermediate data Y. The musical tone synthesis unit 22 generates the acoustic data string Z by processing the intermediate data Y with the decoder 32. In other words, the decoder 32 decodes the intermediate data Y into the acoustic data string Z. As explained above, in the first embodiment the acoustic data string Z can be generated by encoding with the first encoder 31 and decoding with the decoder 32.
FIG. 3 is a block diagram illustrating a specific configuration (architecture) of the first generative model 30. The first encoder 31 includes a preprocessing unit 311, N1 convolutional layers 312, and N1 encoding intermediate layers Le. Specifically, the N1 convolutional layers 312 and the N1 encoding intermediate layers Le are arranged alternately after the preprocessing unit 311. That is, N1 pairs, each consisting of a convolutional layer 312 and an encoding intermediate layer Le, are stacked after the preprocessing unit 311.
The preprocessing unit 311 is composed of a multilayer perceptron for processing the control data string X. The preprocessing unit 311 is composed of multiple calculation units corresponding to the different control data Dx of the control data string X. Each calculation unit is composed of a stack of multiple fully connected layers, and each control data Dx is processed sequentially by those fully connected layers. For example, a neural network with the same configuration and the same variables is applied to every control data Dx of the control data string X. The array of control data Dx after processing by the calculation units (the processed control data string X) is input to the first convolutional layer 312. By processing the control data Dx with the preprocessing unit 311, a control data string X that expresses the characteristics of the target musical piece (music data M) more clearly is generated. However, the preprocessing unit 311 may be omitted.
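As a rough sketch of this kind of per-frame multilayer perceptron, the following applies the same stack of fully connected layers, with the same weights, to every control data Dx in the sequence; the layer count, layer sizes, and activation are assumptions made only for illustration:

```python
import torch
import torch.nn as nn

class FramewiseMLP(nn.Module):
    """Applies one fully connected stack to every frame of a sequence.

    The weights are shared over the time axis, so each control data Dx is
    processed by the same layers with the same variables (sizes assumed).
    """
    def __init__(self, dim_in=134, dim_hidden=256, num_layers=3):
        super().__init__()
        layers, d = [], dim_in
        for _ in range(num_layers):
            layers += [nn.Linear(d, dim_hidden), nn.ReLU()]
            d = dim_hidden
        self.net = nn.Sequential(*layers)

    def forward(self, x):            # x: [time, dim_in], e.g. [800, 134]
        return self.net(x)           # same weights applied to every frame

pre = FramewiseMLP()
out = pre(torch.randn(800, 134))     # -> [800, 256]
```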
Of the N1 convolutional layers 312, the first convolutional layer 312 receives the data processed by the preprocessing unit 311, and each of the second and subsequent convolutional layers 312 receives the data processed by the preceding encoding intermediate layer Le. Each convolutional layer 312 performs arithmetic processing on the data input to that convolutional layer 312. The arithmetic processing by the convolutional layer 312 includes a convolution operation, and may also include a pooling operation.
The convolution operation is a process of convolving a filter with the data input to the convolutional layer 312. The convolutional layers 312 include convolutional layers 312 that perform time compression and convolutional layers 312 that do not. In the convolution operation of a convolutional layer 312 that performs time compression, the amount by which the filter moves in the time direction (the stride) is set to 2 or more. As a result, in each convolutional layer 312 that does not perform time compression, the time length of the data is maintained by a convolution operation with a stride of 1, and in each convolutional layer 312 that performs time compression, the time length of the data is shortened by a convolution operation with a stride of 2 or more. That is, the first encoder 31 compresses the data along the time axis; in other words, the processing by the first encoder 31 includes downsampling of the control data string X. Instead of setting the stride of the convolution operation to 2 or more, the data may be compressed (downsampled) by keeping the stride at 1 and following the convolution operation with a pooling operation. The pooling operation selects a representative value within each range set over the data after the convolution operation. The representative value is a statistic such as the maximum, the mean, or the mean square.
In other words, the compression of the control data string X is achieved by one or both of the convolution operation and the pooling operation. Note that the time compression (downsampling) of the control data string X may be performed in only some of the series of convolution operations of the N1 convolutional layers 312. The compression ratio of each convolutional layer 312 is arbitrary.
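A minimal sketch of the two downsampling options described above, using 1-D convolutions; the channel counts, kernel sizes, and compression ratio of 2 are illustrative assumptions rather than values from the embodiment:

```python
import torch
import torch.nn as nn

# Time compression in the encoder: a stride-1 convolution keeps the time
# length, while a stride-2 convolution halves it.
keep_length = nn.Conv1d(in_channels=64, out_channels=64,
                        kernel_size=3, stride=1, padding=1)
halve_length = nn.Conv1d(in_channels=64, out_channels=64,
                         kernel_size=4, stride=2, padding=1)

x = torch.randn(1, 64, 800)              # [batch, channels, time]
print(keep_length(x).shape)              # torch.Size([1, 64, 800])
print(halve_length(x).shape)             # torch.Size([1, 64, 400])

# The same downsampling can instead be obtained with a stride-1 convolution
# followed by a pooling operation that picks a representative value:
pooled = nn.MaxPool1d(kernel_size=2)(keep_length(x))   # -> [1, 64, 400]
```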
Each of the N1 encoding intermediate layers Le performs a conversion process on the data input to that encoding intermediate layer Le from the preceding convolutional layer 312. The specific content of the conversion process by each encoding intermediate layer Le is described later. The data processed by the final encoding intermediate layer Le among the N1 encoding intermediate layers Le is input to the decoder 32 as the intermediate data Y. Note that an encoding intermediate layer Le does not need to be placed after every one of the N1 convolutional layers 312; that is, the number N1x of encoding intermediate layers Le is any natural number less than or equal to N1. If an encoding intermediate layer Le follows a convolutional layer 312, the next convolutional layer 312 receives the data after the conversion process by that encoding intermediate layer Le; if no encoding intermediate layer Le follows a convolutional layer 312, the next convolutional layer 312 receives the data after the convolution processing by that convolutional layer 312 (that is, data that has not undergone the conversion process).
The decoder 32 includes N2 convolutional layers 321, N2 decoding intermediate layers Ld, and a post-processing unit 322. Specifically, the N2 convolutional layers 321 and the N2 decoding intermediate layers Ld are arranged alternately, and the post-processing unit 322 is stacked after the final decoding intermediate layer Ld. That is, N2 pairs, each consisting of a convolutional layer 321 and a decoding intermediate layer Ld, are stacked before the post-processing unit 322.
Of the N2 convolutional layers 321, the first convolutional layer 321 receives the intermediate data Y, and each of the second and subsequent convolutional layers 321 receives the data processed by the preceding decoding intermediate layer Ld. Each convolutional layer 321 performs arithmetic processing on the data input to that convolutional layer 321. The arithmetic processing by the convolutional layer 321 includes a transposed convolution operation (or deconvolution operation).
The transposed convolution performed by the convolutional layers 321 is the inverse of the convolution performed by the convolutional layers 312 of the encoder. In the transposed convolution operation of a convolutional layer 321 that performs time expansion, the amount by which the filter moves in the time direction (the stride) is set to 2 or more. As a result, in each convolutional layer 321 that does not perform time expansion, the time length of the data is maintained by a transposed convolution operation with a stride of 1, and in each convolutional layer 321 that performs time expansion, the time length of the data is expanded by a transposed convolution operation with a stride of 2 or more. That is, the decoder 32 expands the data along the time axis; in other words, the processing by the decoder 32 includes upsampling of the intermediate data Y.
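Correspondingly, time expansion on the decoder side can be sketched with a transposed 1-D convolution whose stride of 2 doubles the time length; as before, the sizes below are illustrative assumptions:

```python
import torch
import torch.nn as nn

# Time expansion in the decoder: a stride-2 transposed convolution doubles
# the time length, undoing a stride-2 compression on the encoder side.
expand = nn.ConvTranspose1d(in_channels=64, out_channels=64,
                            kernel_size=4, stride=2, padding=1)

y = torch.randn(1, 64, 100)              # compressed intermediate data
print(expand(y).shape)                   # torch.Size([1, 64, 200])
```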
As explained above, in the first embodiment, the first encoder 31 compresses the control data string X and the decoder 32 expands the intermediate data Y. Therefore, intermediate data Y that appropriately reflects the characteristics of the control data string X is generated, and an acoustic data string Z that appropriately reflects the characteristics of the intermediate data Y is generated.
Each of the N2 decoding intermediate layers Ld performs a conversion process on the data input to that decoding intermediate layer Ld from the preceding convolutional layer 321. The specific content of the conversion process by each decoding intermediate layer Ld is described later. The data processed by the final decoding intermediate layer Ld among the N2 decoding intermediate layers Ld is input to the post-processing unit 322 as the acoustic data string Z. Note that a decoding intermediate layer Ld does not need to be placed after every one of the N2 convolutional layers 321; that is, the number N2x of decoding intermediate layers Ld is a natural number less than or equal to N2. If a decoding intermediate layer Ld follows a convolutional layer 321, the next convolutional layer 321 receives the data after the conversion process by that decoding intermediate layer Ld; if no decoding intermediate layer Ld follows a convolutional layer 321, the next convolutional layer 321 receives the data after the convolution processing by that convolutional layer 321 (that is, data that has not undergone the conversion process).
The post-processing unit 322 is composed of a multilayer perceptron for processing the acoustic data string Z. The post-processing unit 322 is composed of multiple calculation units corresponding to the different acoustic data Dz of the acoustic data string Z. Each calculation unit is composed of a stack of multiple fully connected layers, and each acoustic data Dz is processed sequentially by those fully connected layers. For example, a neural network with the same configuration and the same variables is applied to every acoustic data Dz of the acoustic data string Z. The array of acoustic data Dz after processing by the calculation units is input to the waveform synthesis unit 23 as the final acoustic data string Z. By processing the acoustic data Dz with the post-processing unit 322, an acoustic data string Z that expresses the characteristics of the target musical tone more clearly is generated. However, the post-processing unit 322 may be omitted.
As explained above, the first encoder 31 includes N1x encoding intermediate layers Le, and the decoder 32 includes N2x decoding intermediate layers Ld. When the encoding intermediate layers Le and the decoding intermediate layers Ld are collectively referred to as "intermediate layers L", the first generative model 30 can be expressed as a statistical model including Nx (Nx = N1x + N2x) intermediate layers L. That is, the first encoder 31 includes N1x of the Nx intermediate layers L as encoding intermediate layers Le, and the decoder 32 includes N2x of the Nx intermediate layers L as decoding intermediate layers Ld. The number Nx of intermediate layers L is a natural number of 1 or more. The number N1x of encoding intermediate layers Le and the number N2x of decoding intermediate layers Ld may be equal or different.
In the first generative model 30, the preprocessing unit 311, the N1 convolutional layers 312, the N2 convolutional layers 321, and the post-processing unit 322 are the basic layers required for generating the acoustic data string Z. Hereinafter, the set of the N1 convolutional layers 312 and the N2 convolutional layers 321 may be referred to as the N (N = N1 + N2) basic convolutional layers. On the other hand, the Nx intermediate layers (the N1x encoding intermediate layers Le and the N2x decoding intermediate layers Ld) are layers for controlling the partial timbre of the target musical tone. In other words, the first generative model 30 includes N basic convolutional layers and Nx (N ≥ Nx ≥ 1) intermediate layers L.
Each of the N intermediate layers L performs a conversion process on the data input to that intermediate layer L. A parameter set Pn (n = 1 to N) is applied to the conversion process by the n-th intermediate layer L among the N intermediate layers L. In other words, a different parameter set Pn is applied to the conversion process of each of the intermediate layers L. Each of the N parameter sets P1 to PN includes, for example, a first parameter p1 and a second parameter p2.
FIG. 4 is an explanatory diagram of the conversion process. The unit data string U in FIG. 4 is the data input to an intermediate layer L. The unit data string U is composed of a time series of multiple unit data Du corresponding to different unit periods. Each unit data Du is expressed as an H-dimensional vector (H is a natural number of 2 or more). The first parameter p1 is expressed as an H-by-H square matrix, and the second parameter p2 is expressed as an H-dimensional vector. Note that the first parameter p1 may instead be expressed as an H-by-H diagonal matrix or an H-dimensional vector.
The conversion process includes a first operation and a second operation, which are executed in sequence for each of the unit data Du constituting the unit data string U. The first operation multiplies the unit data Du by the first parameter p1. The second operation adds the second parameter p2 to the result of the first operation (p1·Du). As understood from the above explanation, the conversion process by the intermediate layer L is a process that includes multiplication by the first parameter p1 and addition of the second parameter p2 (that is, an affine transformation). Note that the second operation applying the second parameter p2 may be omitted, in which case the generation of the second parameter p2 is also omitted. In other words, the conversion process only needs to include at least the first operation.
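A minimal sketch of this conversion process, applying one parameter set Pn = (p1, p2) to every unit data Du of a unit data string U; the dimensionality H, the sequence length, and the random values are placeholders for illustration:

```python
import numpy as np

def apply_parameter_set(unit_sequence, p1, p2=None):
    """Conversion process of one intermediate layer L (sketch).

    unit_sequence: [time, H] array of unit data Du
    p1: [H, H] matrix (first parameter), p2: [H] vector (second parameter)
    Each Du is multiplied by p1 (first operation); p2 is then added
    (second operation), which may be omitted as noted above.
    """
    out = unit_sequence @ p1.T          # first operation: p1 . Du per frame
    if p2 is not None:
        out = out + p2                  # second operation: add p2
    return out

H = 8                                   # dimensionality of Du (assumption)
U = np.random.randn(100, H)             # unit data string input to layer L
Pn = (np.random.randn(H, H), np.random.randn(H))   # parameter set Pn
U_out = apply_parameter_set(U, *Pn)     # output passed to the next layer
```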
As understood from the above explanation, each of the N intermediate layers L in FIG. 3 performs the conversion process of applying the parameter set Pn to each unit data Du of the unit data string U input to that intermediate layer L, and outputs the unit data string U after the conversion process. In the first embodiment, a conversion process including multiplication by the first parameter p1 and addition of the second parameter p2 is performed on each unit data Du of the unit data string U input to each intermediate layer L. Therefore, an acoustic data string Z of the target musical tone to which the partial timbre represented by the control vector V has been appropriately imparted can be generated. Although the explanation here assumes N intermediate layers L, the basic operation is the same when the number of intermediate layers L is Nx, which is less than N.
Now, for convenience, attention is given to two of the N intermediate layers L of the first generative model 30, namely the n1-th and n2-th intermediate layers L (n1 = 1 to N, n2 = 1 to N, n1 ≠ n2). Each of these intermediate layers L may be either an encoding intermediate layer Le or a decoding intermediate layer Ld. The n1-th intermediate layer L among the N intermediate layers L performs the conversion process of applying the parameter set Pn1 to each unit data Du of the unit data string U input to that intermediate layer L, and outputs the unit data string U after the application to the next layer. The n2-th intermediate layer L performs the conversion process of applying the parameter set Pn2 to each unit data Du of the unit data string U input to that intermediate layer L, and outputs the unit data string U after the application to the next layer. The n1-th intermediate layer L is an example of a "first intermediate layer", and the parameter set Pn1 is an example of a "first parameter set". The n2-th intermediate layer L is an example of a "second intermediate layer", and the parameter set Pn2 is an example of a "second parameter set".
As explained above, in the first embodiment a different parameter set Pn is applied to each of the N intermediate layers L, so an acoustic data string Z of a target musical tone having diverse partial timbres can be generated. The control vector generation unit 24 and the control vector processing unit 25 illustrated in FIG. 2 generate the N parameter sets P1 to PN by processing the reference signal Sr.
The control vector generation unit 24 generates a control vector V by processing the reference signal Sr of the specific section. The control vector V is a K-dimensional vector representing the partial timbre of the reference musical tone. In other words, the control vector V is a vector representing the characteristics of the temporal change in acoustic characteristics (that is, the partial timbre) in the reference signal Sr of the specific section. The control vector generation unit 24 of the first embodiment includes a section setting unit 241, a feature extraction unit 242, and a second encoder 243.
The section setting unit 241 sets the specific section within the reference musical tone. Specifically, the section setting unit 241 sets the specific section in response to a first instruction Q1 from the user on the operation device 14. The time length of the specific section is a fixed length equal to one processing period B.
FIG. 5 is a schematic diagram of a setting screen Ga. The setting screen Ga is a screen on which the user designates the specific section. The section setting unit 241 displays the setting screen Ga on the display device 13. The setting screen Ga includes a waveform image Ga1 and a section image Ga2. The waveform image Ga1 represents the waveform of the reference signal Sr, and the section image Ga2 represents the specific section.
By operating the operation device 14 while checking the waveform image Ga1 of the reference musical tone (first instruction Q1), the user can move the section image Ga2 along the time axis to a desired position. For example, the user moves the section image Ga2 so that it contains a section of the reference musical tone in which the acoustic characteristics change in the desired way.
The section setting unit 241 determines, as the specific section, the section of the reference signal Sr corresponding to the section image Ga2 after it has been moved by the user. As understood from the above explanation, the first instruction Q1 is an instruction to change the position of the specific section on the time axis. That is, the section setting unit 241 changes the position of the specific section on the time axis in response to the first instruction Q1.
The feature extraction unit 242 in FIG. 2 generates one reference data string R by processing the reference signal Sr of the specific section. The reference data string R is time-series data representing the acoustic features of the reference musical tone in the specific section. As illustrated in FIG. 6, the reference data string R is composed of multiple (for example, 800) pieces of reference data Dr corresponding to the different unit periods within the specific section. In other words, the reference data string R is a time series of reference data Dr.
The reference data Dr represents acoustic feature quantities of the reference musical tone. The acoustic feature quantities are, for example, the amplitude spectral envelope of the reference musical tone. Specifically, the reference data Dr includes the amplitude spectral envelope of the harmonic components of the reference musical tone and the amplitude spectral envelope of the non-harmonic components of the reference musical tone. The amplitude spectral envelopes of the harmonic and non-harmonic components are expressed, for example, as mel-cepstra or MFCC. The data size of the reference data Dr is equal to the data size of the acoustic data Dz; therefore, the data size of one reference data string R is equal to the data size of one acoustic data string Z. Note that the reference data Dr may be data in a different format from the acoustic data Dz. For example, the acoustic feature quantities represented by the reference data Dr and those represented by the acoustic data Dz may be different kinds of feature quantities.
As understood from the above explanation, the feature extraction unit 242 of the first embodiment generates a time series of reference data Dr representing the acoustic features of the reference musical tone (the reference data string R). For example, the feature extraction unit 242 generates the reference data string R by performing calculations including a discrete Fourier transform on the reference signal Sr of the specific section.
The second encoder 243 is a trained statistical model that has learned, through machine learning, the relationship between the reference data string R and the control vector V. That is, the second encoder 243 outputs a control vector V in response to the input of a reference data string R. The control vector generation unit 24 generates the control vector V by processing the reference data string R with the second encoder 243. In other words, the second encoder 243 encodes the reference data string R into the control vector V.
As described above, the control vector V is a vector representing the characteristics of the temporal change in acoustic characteristics (that is, the partial timbre) in the reference signal Sr of the specific section. Because the partial timbre varies with position in the reference signal Sr, the control vector V depends on the position of the specific section on the time axis; that is, the control vector V depends on the first instruction Q1 by which the user designates the specific section. As understood from the above explanation, the control vector generation unit 24 of the first embodiment generates the control vector V in response to the first instruction Q1 from the user.
The control vector processing unit 25 generates the N parameter sets P1 to PN from the control vector V. Because the control vector V represents the partial timbre, each parameter set Pn depends on the partial timbre. Further, because the control vector V depends on the first instruction Q1, each parameter set Pn also depends on the first instruction Q1 from the user.
FIG. 6 is a block diagram illustrating a specific configuration of the second encoder 243 and the control vector processing unit 25. The second encoder 243 includes multiple convolutional layers 411 and an output processing unit 412. The output processing unit 412 is stacked after the final convolutional layer 411 among the multiple convolutional layers 411.
Of the multiple convolutional layers 411, the first convolutional layer 411 receives the reference data string R, and each of the second and subsequent convolutional layers 411 receives the data processed by the preceding convolutional layer 411. Each convolutional layer 411 performs arithmetic processing on the data input to that convolutional layer 411. Like the arithmetic processing by the convolutional layers 312, the arithmetic processing by the convolutional layers 411 includes a convolution operation and, optionally, a pooling operation. The final convolutional layer 411 outputs feature data Dv representing the features of the reference data string R.
The output processing unit 412 generates the control vector V from the feature data Dv. The output processing unit 412 of the first embodiment includes a post-processing unit 413 and a sampling unit 414.
The post-processing unit 413 determines K probability distributions F1 to FK from the feature data Dv. Each of the K probability distributions F1 to FK is, for example, a normal distribution. The post-processing unit 413 outputs the mean and variance of each probability distribution Fk (k = 1 to K). Specifically, the post-processing unit 413 is a trained statistical model that has learned, through machine learning, the relationship between the feature data Dv and each probability distribution Fk. The control vector generation unit 24 determines the K probability distributions F1 to FK by processing the feature data Dv with the post-processing unit 413.
The sampling unit 414 generates the control vector V from the K probability distributions F1 to FK. Specifically, the sampling unit 414 samples an element Ek from each of the K probability distributions F1 to FK. The sampling of each element Ek is, for example, random sampling; that is, each element Ek is a latent variable sampled from the probability distribution Fk. The control vector V is composed of the K elements E1 to EK sampled from the different probability distributions Fk. In other words, the control vector V includes K elements E1 to EK. As understood from the above explanation, the control vector V is a K-dimensional vector representing the characteristics of the temporal change in acoustic characteristics (that is, the partial timbre) in the reference signal Sr of the specific section.
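A minimal sketch of this sampling step, assuming the post-processing unit has produced a mean and a variance for each of the K normal distributions; the dimensionality K and the values below are placeholders:

```python
import numpy as np

def sample_control_vector(means, variances, rng=None):
    """Samples each element Ek from its normal distribution Fk (sketch).

    means, variances: length-K arrays output by the post-processing unit 413.
    Returns the K-dimensional control vector V = (E1, ..., EK).
    """
    rng = np.random.default_rng() if rng is None else rng
    return rng.normal(loc=means, scale=np.sqrt(variances))

K = 16                                                # dimensionality of V (assumption)
V = sample_control_vector(np.zeros(K), np.ones(K))    # one random draw of V
```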
Note that the configuration and processing by which the output processing unit 412 generates the control vector V from the feature data Dv are not limited to the above example. For example, the output processing unit 412 may generate the control vector V without generating the K probability distributions F1 to FK, in which case the post-processing unit 413 and the sampling unit 414 may be omitted.
As illustrated in FIG. 6, the control vector processing unit 25 includes N transformation models 28-1 to 28-N corresponding to the different intermediate layers L. Each transformation model 28-n generates a parameter set Pn from the control vector V. Specifically, each transformation model 28-n is a trained statistical model that has learned, through machine learning, the relationship between the control vector V and the parameter set Pn. Each transformation model 28-n is composed of a multilayer perceptron for generating the parameter set Pn. As understood from the above explanation, the control vector processing unit 25 generates N parameter sets P1 to PN that reflect the partial timbre of the reference musical tone. The N parameter sets P1 to PN are generated from a common control vector V.
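The following sketch shows one possible shape of such a transformation model 28-n: a small multilayer perceptron that maps the control vector V to a parameter set Pn consisting of p1 (H × H) and p2 (H). The dimensions K and H, the hidden size, the number of models, and the output layout are assumptions made only for illustration:

```python
import torch
import torch.nn as nn

class TransformModel(nn.Module):
    """One transformation model 28-n: an MLP from control vector V to Pn (sketch)."""
    def __init__(self, K=16, H=8, hidden=128):
        super().__init__()
        self.H = H
        self.net = nn.Sequential(
            nn.Linear(K, hidden), nn.ReLU(),
            nn.Linear(hidden, H * H + H),      # flattened p1 followed by p2
        )

    def forward(self, v):                      # v: control vector of shape [K]
        out = self.net(v)
        p1 = out[: self.H * self.H].reshape(self.H, self.H)
        p2 = out[self.H * self.H:]
        return p1, p2

# N transformation models, one per intermediate layer, all fed the same V
models = [TransformModel() for _ in range(4)]  # N = 4 assumed for illustration
v = torch.randn(16)
parameter_sets = [m(v) for m in models]        # P1 ... PN
```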
The second encoder 243 and the control vector processing unit 25 together constitute a second generative model 40. The second generative model 40 is a trained statistical model that has learned, through machine learning, the relationship between the reference data string R and the N parameter sets P1 to PN. The second generative model 40 is configured, for example, as a deep neural network.
The second generative model 40 is realized by a combination of a program that causes the control device 11 to execute the operations for generating the control vector V from the reference data string R, and multiple variables (weights and biases) applied to those operations. The program and the variables that realize the second generative model 40 are stored in the storage device 12. The variables of the second generative model 40 are set in advance by machine learning.
FIG. 7 is a flowchart of the process by which the control device 11 generates the audio signal W of the target musical tone (hereinafter referred to as the "musical tone synthesis process Sa"). The musical tone synthesis process Sa is started, for example, in response to an instruction from the user on the operation device 14, and is repeated for each processing period B. The musical tone synthesis process Sa is an example of a "musical sound synthesis method". Before the musical tone synthesis process Sa starts, the section setting unit 241 has already set the specific section in response to the first instruction Q1 from the user, and data representing the specific section is stored in the storage device 12.
When the musical tone synthesis process Sa is started, the control device 11 (control vector generation unit 24) generates the control vector V representing the partial timbre in response to the first instruction Q1 from the user (Sa1). The specific steps for generating the control vector V (Sa11 to Sa13) are as follows.
First, the control device 11 (section setting unit 241) acquires the data representing the specific section from the storage device 12 (Sa11). Specifically, the section setting unit 241 sets the specific section in response to the first instruction Q1 from the user on the operation device 14. The control device 11 (feature extraction unit 242) then processes the reference signal Sr of the specific section to generate one reference data string R (Sa12). Finally, the control device 11 processes the reference data string R with the second encoder 243 to generate the control vector V (Sa13).
The control device 11 (control vector processing unit 25) generates the N parameter sets P1 to PN from the control vector V (Sa2). Specifically, the control device 11 processes the control vector V with each transformation model 28-n to generate the corresponding parameter set Pn.
The control device 11 (control data acquisition unit 21) generates the control data string X by processing the music data M (Sa3). The control device 11 (musical tone synthesis unit 22) generates the acoustic data string Z by processing the control data string X with the first generative model 30 (Sa4). Specifically, the control device 11 generates the intermediate data Y by processing the control data string X with the first encoder 31, and generates the acoustic data string Z by processing the intermediate data Y with the decoder 32. The parameter set Pn is applied to the conversion process of each intermediate layer L of the first generative model 30.
The control device 11 (waveform synthesis unit 23) generates the audio signal W of the target musical tone from the acoustic data string Z (Sa5). The control device 11 supplies the audio signal W to the sound emitting device 15 (Sa6), and the sound emitting device 15 reproduces the target musical tone represented by the audio signal W.
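The overall flow Sa1 to Sa6 can be summarized in pseudocode form as follows; the component names and their interfaces are hypothetical stand-ins for the functional units of FIG. 2, introduced only for this sketch:

```python
def synthesize_one_period(music_data, reference_signal, section, components):
    """One pass of the musical tone synthesis process Sa for a processing period B.

    `components` bundles hypothetical stand-ins for the functional units of
    FIG. 2; the attribute names and call signatures are assumptions.
    """
    r = components.feature_extractor(reference_signal, section)      # Sa12: reference data string R
    v = components.second_encoder(r)                                 # Sa13: control vector V
    parameter_sets = [m(v) for m in components.transform_models]     # Sa2:  P1 ... PN
    x = components.control_data_acquirer(music_data)                 # Sa3:  control data string X
    z = components.first_generative_model(x, parameter_sets)         # Sa4:  acoustic data string Z
    w = components.waveform_synthesizer(z)                           # Sa5:  audio signal W
    return w                                                         # Sa6:  supplied to the sound emitting device
```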
As explained above, in the first embodiment, the control vector V representing the partial timbre of the reference musical tone is generated in response to an instruction from the user (first instruction Q1), the parameter sets Pn are generated from the control vector V, and each parameter set Pn is applied to every unit data Du of the unit data string U input to the corresponding intermediate layer L. Therefore, an acoustic data string Z of a target musical tone having diverse partial timbres corresponding to the user's instruction can be generated.
In the first embodiment, the specific section of the reference musical tone is set in response to the first instruction Q1 from the user, and the control vector V representing the partial timbre in that specific section is generated. Therefore, a target musical tone having the partial timbre of the specific section of the reference musical tone desired by the user can be generated. In particular, in the first embodiment the position of the specific section on the time axis is changed in response to the first instruction Q1, so a target musical tone having the partial timbre at the position of the reference musical tone desired by the user can be generated.
The training processing unit 26 in FIG. 2 establishes the first generative model 30 and the second generative model 40 through machine learning using multiple pieces of training data T. The training processing unit 26 of the first embodiment establishes the first generative model 30 and the second generative model 40 jointly. After they have been established, the first generative model 30 and the second generative model 40 may each be trained individually. FIG. 8 is an explanatory diagram of the machine learning that establishes the first generative model 30 and the second generative model 40.
Each of the multiple pieces of training data T consists of a combination of a training control data string Xt, a training reference data string Rt, and a training acoustic data string Zt. The control data string Xt is time-series data representing the conditions of a target musical tone. Specifically, the control data string Xt represents a time series of score features in a specific section (hereinafter referred to as the "training section") of a training musical piece. The format of the control data string Xt is the same as that of the control data string X.
The reference data string Rt is time-series data representing the acoustic features of musical tones prepared in advance for the training musical piece. The partial timbre represented by the reference data string Rt is the characteristic of the temporal change in acoustic characteristics of the training piece's musical tones in the training section. The format of the reference data string Rt is the same as that of the reference data string R.
The acoustic data string Zt is time-series data representing the acoustic features of the musical tone that the first generative model 30 and the second generative model 40 should generate from the control data string Xt and the reference data string Rt. In other words, the acoustic data string Zt corresponds to the ground truth for the control data string Xt and the reference data string Rt. The format of the acoustic data string Zt is the same as that of the acoustic data string Z.
 FIG. 9 is a flowchart of the process by which the control device 11 establishes the first generative model 30 and the second generative model 40 (hereinafter referred to as the "training process Sb"). The training processing unit 26 in FIG. 8 is realized by the control device 11 executing the training process Sb.
 When the training process Sb is started, the control device 11 prepares a first provisional model 51 and a second provisional model 52 (Sb1). The first provisional model 51 is an initial or provisional model that is updated into the first generative model 30 by machine learning. The initial first provisional model 51 has the same configuration as the first generative model 30, but its multiple variables are set to, for example, random numbers. The second provisional model 52 is an initial or provisional model that is updated into the second generative model 40 by machine learning. The initial second provisional model 52 has the same configuration as the second generative model 40, but its multiple variables are set to, for example, random numbers. The structure of each of the first provisional model 51 and the second provisional model 52 is designed arbitrarily by the designer.
 The control device 11 selects one of the multiple pieces of training data T (hereinafter referred to as the "selected training data T") (Sb2). As illustrated in FIG. 8, the control device 11 generates N parameter sets P1 to PN by processing the reference data sequence Rt of the selected training data T with the second provisional model 52 (Sb3). Specifically, the second provisional model 52 generates a control vector V, and the control vector processing unit 25 generates the N parameter sets P1 to PN. The control device 11 also generates an acoustic data sequence Z by processing the control data sequence Xt of the selected training data T with the first provisional model 51 (Sb4). The N parameter sets P1 to PN generated through the second provisional model 52 are applied in the processing of the control data sequence Xt.
 The control device 11 calculates a loss function representing the error between the acoustic data sequence Z generated by the first provisional model 51 and the acoustic data sequence Zt of the selected training data T (Sb5). The control device 11 updates the multiple variables of the first provisional model 51 and the multiple variables of the second provisional model 52 so that the loss function is reduced (ideally minimized) (Sb6). For example, error backpropagation is used to update each variable in accordance with the loss function.
 The control device 11 determines whether a predetermined termination condition is met (Sb7). The termination condition is, for example, that the loss function falls below a predetermined threshold, or that the amount of change in the loss function falls below a predetermined threshold. If the termination condition is not met (Sb7: NO), the control device 11 selects a piece of unselected training data T as the new selected training data T (Sb2). That is, the process of updating the multiple variables of the first provisional model 51 and the multiple variables of the second provisional model 52 (Sb2 to Sb6) is repeated until the termination condition is met (Sb7: YES). When the above process has been executed for all of the training data T, each piece of training data T is returned to an unselected state and the same process is repeated. That is, each piece of training data T is used repeatedly.
 If the termination condition is met (Sb7: YES), the control device 11 ends the training process Sb. The first provisional model 51 at the time the termination condition is met is finalized as the trained first generative model 30. Likewise, the second provisional model 52 at the time the termination condition is met is finalized as the trained second generative model 40.
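 Purely as a non-limiting illustration of steps Sb2 to Sb7, the following Python (PyTorch) sketch shows a joint update loop in which hypothetical stand-in modules replace the first provisional model 51 and the second provisional model 52; the tensor sizes, the mean squared error loss, and the Adam optimizer are assumptions made for this sketch and are not specified by the present disclosure.

```python
# Minimal sketch of the joint training loop (Sb2-Sb7); the stand-in modules and
# all dimensions are hypothetical, not the actual provisional models 51/52.
import torch
import torch.nn.functional as F

first_model = torch.nn.Linear(8, 4)    # stand-in for the first provisional model 51
second_model = torch.nn.Linear(6, 4)   # stand-in for the second provisional model 52
optimizer = torch.optim.Adam(list(first_model.parameters()) +
                             list(second_model.parameters()), lr=1e-3)

def train(training_data, threshold=1e-3):
    for Xt, Rt, Zt in training_data:          # Sb2: select a piece of training data T
        params = second_model(Rt)             # Sb3: parameters derived from Rt
        Z = first_model(Xt) * params          # Sb4: Xt processed under those parameters
                                              #      (crude stand-in for the intermediate layers)
        loss = F.mse_loss(Z, Zt)              # Sb5: loss between Z and the ground truth Zt
        optimizer.zero_grad()
        loss.backward()                       # Sb6: update the variables of both models
        optimizer.step()
        if loss.item() < threshold:           # Sb7: termination condition
            break
```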
 As can be understood from the above explanation, the first generative model 30 learns the latent relationship between the control data sequence Xt and the acoustic data sequence Zt under the N parameter sets P1 to PN derived from the reference data sequence Rt. Therefore, the trained first generative model 30 outputs an acoustic data sequence Z that is statistically valid under that relationship for an unknown control data sequence X. The second generative model 40 learns the latent relationship between the reference data sequence Rt and the N parameter sets P1 to PN. Specifically, the second generative model 40 learns the relationship between the reference data sequence Rt and the N parameter sets P1 to PN required to generate an appropriate acoustic data sequence Z from the control data sequence Xt. More specifically, the second encoder 243 learns the latent relationship between the reference data sequence Rt and the control vector V, and the control vector processing unit 25 learns the latent relationship between the control vector V and the N parameter sets P1 to PN. Therefore, by using the first generative model 30 and the second generative model 40, an acoustic data sequence Z of a target musical tone to which a desired partial timbre of the reference musical tone is imparted is generated.
B: Second embodiment
 A second embodiment will be described. In each of the embodiments exemplified below, elements whose functions are the same as in the first embodiment are denoted by the same reference numerals as those used in the description of the first embodiment, and detailed descriptions thereof are omitted as appropriate.
 FIG. 10 is a block diagram of the control vector generation unit 24 in the second embodiment. As illustrated in FIG. 10, the control vector generation unit 24 of the second embodiment includes a control vector adjustment unit 244 in addition to the same elements as in the first embodiment (the section setting unit 241, the feature extraction unit 242, and the second encoder 243). The second encoder 243 generates a control vector V in the same manner as in the first embodiment. In the following description, the initial control vector V generated by the second encoder 243 is denoted "control vector V0" for convenience.
 In the second embodiment, as in the first embodiment, the section setting unit 241 sets a specific section of the reference musical tone in response to the first instruction Q1 from the user. Therefore, the control vector V0 in the second embodiment is generated in response to the first instruction Q1 from the user.
 Note that the initial control vector V0 does not have to be a vector generated by the second encoder 243. For example, a vector in which each element Ek is set to a predetermined value (e.g., zero), or a vector in which each element Ek is set to a random number, may be used as the initial control vector V0. Alternatively, the final control vector V from the previous execution of the musical tone synthesis process Sa may be adopted as the current initial control vector V0. As can be understood from the above explanation, the elements for generating the control vector V0 (the section setting unit 241, the feature extraction unit 242, and the second encoder 243) may be omitted in the second embodiment.
 The control vector adjustment unit 244 generates the control vector V by adjusting the initial control vector V0. Specifically, the control vector adjustment unit 244 changes one or more elements Ek of the K elements E1 to EK of the control vector V0 in response to a second instruction Q2 from the user via the operation device 14. A K-dimensional vector consisting of the K elements E1 to EK after the change is supplied to the control vector processing unit 25 as the control vector V. As can be understood from the above explanation, the control vector generation unit 24 of the second embodiment generates the control vector V in response to the first instruction Q1 and the second instruction Q2 from the user.
 FIG. 11 is a schematic diagram of the setting screen Gb. The setting screen Gb is a screen with which the user instructs changes to each element Ek. The control vector adjustment unit 244 displays the setting screen Gb on the display device 13. The setting screen Gb includes K operators Gb-1 to Gb-K corresponding to the different elements Ek of the control vector V. The K operators Gb-1 to Gb-K are arranged in the horizontal direction. The operator Gb-k corresponding to each element Ek is an image that accepts operations by the user. Specifically, each operator Gb-k is, for example, a slider that moves up and down in response to an operation by the user. The second instruction Q2 from the user is, for example, an operation that moves any of the K operators Gb-1 to Gb-K. In other words, the second instruction Q2 is an instruction from the user that individually specifies the numerical value of each element Ek. The numerical value of the element Ek is displayed near each operator Gb-k.
 The vertical position of each operator Gb-k corresponds to the numerical value of the element Ek. That is, moving the operator Gb-k upward means an increase in the element Ek, and moving the operator Gb-k downward means a decrease in the element Ek. The control vector adjustment unit 244 sets the initial position of each operator Gb-k in accordance with the numerical value of each element Ek of the control vector V0. The control vector adjustment unit 244 then changes the numerical value of the element Ek in accordance with the user's operation of moving each operator Gb-k (that is, the second instruction Q2). That is, the control vector adjustment unit 244 sets the element Ek corresponding to each operator Gb-k in accordance with the user's operation of one or more of the K operators Gb-1 to Gb-K.
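 The mapping between the operators Gb-k and the elements Ek could, for example, be realized as in the following hypothetical sketch; the value range of each element and the normalization of the slider positions are assumptions introduced only for illustration.

```python
# Hypothetical sketch: map K on-screen sliders to the K elements of the control vector V.
import numpy as np

E_MIN, E_MAX = -1.0, 1.0          # assumed value range of each element Ek

def sliders_from_vector(v0):
    """Initial slider positions (0..1) derived from the initial control vector V0."""
    return (v0 - E_MIN) / (E_MAX - E_MIN)

def vector_from_sliders(positions):
    """Control vector V rebuilt from the current slider positions (second instruction Q2)."""
    positions = np.clip(np.asarray(positions), 0.0, 1.0)
    return E_MIN + positions * (E_MAX - E_MIN)

v0 = np.zeros(8)                        # example: K = 8, all elements initially zero
sliders = sliders_from_vector(v0)       # sliders start at the midpoint
sliders[2] = 0.9                        # the user raises operator Gb-3
v = vector_from_sliders(sliders)        # element E3 increases accordingly
```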
 As mentioned above, the control vector V represents the partial timbre. Therefore, the change of each element Ek by the control vector adjustment unit 244 is a process of changing the partial timbre in response to the second instruction Q2 from the user. That is, the temporal change in the acoustic characteristics imparted to the target musical tone (i.e., the partial timbre) changes in response to the second instruction Q2 from the user. The control vector processing unit 25 generates the N parameter sets P1 to PN from the control vector V after adjustment by the control vector adjustment unit 244.
 FIG. 12 is a flowchart of the musical tone synthesis process Sa in the second embodiment. The generation of the control vector V (Sa1) in the second embodiment includes, in addition to the same steps (Sa11 to Sa13) as in the first embodiment, the adjustment of the control vector V0 (Sa14). In the adjustment of the control vector V0 (Sa14), the control device 11 (control vector adjustment unit 244) generates the control vector V by changing one or more elements Ek of the K elements E1 to EK of the initial control vector V0 in response to the second instruction Q2 from the user. The operations other than the adjustment of the control vector V0 (Sa14) are the same as in the first embodiment. The second instruction Q2 is given by the user at any timing in parallel with the musical tone synthesis process Sa.
 The second embodiment achieves the same effects as the first embodiment. Furthermore, in the second embodiment, one or more elements Ek of the K elements E1 to EK of the control vector V0 are changed in response to the second instruction Q2 from the user. Therefore, a variety of target musical tones having partial timbres corresponding to the second instruction Q2 from the user can be generated. In particular, in the second embodiment, the user can easily adjust the partial timbre by operating each operator Gb-k.
C: Third embodiment
 The control vector generation unit 24 of the third embodiment generates a control vector V for each unit period on the time axis. That is, the control vector generation unit 24 generates a time series of control vectors V in response to instructions from the user (the first instruction Q1 and the second instruction Q2). In the following description, a configuration in which the control vector V is generated in response to both the first instruction Q1 and the second instruction Q2 is exemplified, but the control vector V may be generated in response to only one of the first instruction Q1 and the second instruction Q2.
 The control vector generation unit 24 of the third embodiment generates the control vector V in response to the first instruction Q1 and the second instruction Q2 from the user, as in the second embodiment. As described above, in the third embodiment the control vector V is generated for each unit period, so the control vector V changes from one unit period to the next within a single processing period B. Therefore, the partial timbre imparted to the target musical tone changes partway through the processing period B.
 For example, before the execution of the musical tone synthesis process Sa, the user can designate, by the first instruction Q1, a specific section for any time (unit period) of the target piece of music. For a unit period for which a specific section is designated, the control vector V is generated in the same manner as in the embodiments described above. For a unit period for which no specific section is designated (hereinafter referred to as a "target period"), the control vector V is generated by interpolation between the two control vectors V generated for unit periods before and after the target period. For example, the control vector generation unit 24 generates the control vector V of the target period by interpolating between the control vector V corresponding to the specific section designated immediately before the target period (e.g., one or more unit periods in the past) and the control vector V corresponding to the specific section designated immediately after the target period (e.g., one or more unit periods in the future). Any method of interpolating the control vectors V may be used; for example, interpolation between the two vectors (as opposed to extrapolation) is used.
 The control vector generation unit 24 also generates the time series of control vectors V by detecting, for each unit period, the second instruction Q2 given by the user in parallel with the musical tone synthesis process Sa. Alternatively, the control vector generation unit 24 may generate the time series of control vectors V by detecting the second instruction Q2 at a cycle longer than the unit period, and may generate the control vector V for each unit period by a process of smoothing the time series of control vectors V along the time axis (i.e., a low-pass filter).
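 As a non-authoritative sketch of the two mechanisms described above, the interpolation for target periods and the low-pass smoothing of a coarsely sampled control-vector time series might look as follows; linear interpolation and a moving-average filter are assumed choices, since the disclosure leaves both methods open.

```python
# Sketch: build a per-unit-period time series of control vectors V.
# Linear interpolation and a moving-average low-pass filter are assumed choices.
import numpy as np

def interpolate_vectors(anchors, num_periods):
    """anchors: {unit-period index: control vector V}; returns one vector per unit period."""
    idx = sorted(anchors)
    out = np.empty((num_periods, len(anchors[idx[0]])))
    for t in range(num_periods):
        prev = max([i for i in idx if i <= t], default=idx[0])
        nxt = min([i for i in idx if i >= t], default=idx[-1])
        if prev == nxt:
            out[t] = anchors[prev]
        else:                                  # target period: interpolate between neighbors
            w = (t - prev) / (nxt - prev)
            out[t] = (1.0 - w) * anchors[prev] + w * anchors[nxt]
    return out

def smooth_vectors(series, width=5):
    """Moving average along the time axis (a simple low-pass filter)."""
    kernel = np.ones(width) / width
    return np.stack([np.convolve(series[:, k], kernel, mode="same")
                     for k in range(series.shape[1])], axis=1)
```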
 The control vector processing unit 25 of the third embodiment generates N parameter sets P1 to PN from the control vector V of each unit period. The control vector processing unit 25 generates N parameter sets P1 to PN for each unit period on the time axis. That is, the control vector processing unit 25 generates a time series of each parameter set Pn.
 As described above, the control vector V changes in response to the first instruction Q1 or the second instruction Q2. Therefore, the N parameter sets P1 to PN in the unit period immediately before the first instruction Q1 or the second instruction Q2 differ from the N parameter sets P1 to PN in the unit period immediately after it. That is, the parameter set Pn changes within a single processing period B. While no first instruction Q1 or second instruction Q2 is given, the same parameter set Pn is generated over multiple unit periods.
 As illustrated in FIG. 3, the number of pieces of unit data Du constituting one unit data sequence U changes at each stage of processing in the first generative model 30. The conversion process performed by one intermediate layer L uses a number of parameter sets Pn corresponding to the number of pieces of unit data Du supplied to that intermediate layer L. That is, the conversion model 28-n generates a time series of parameter sets Pn equal in number to the pieces of unit data Du processed by the n-th intermediate layer L.
 FIG. 13 is an explanatory diagram of the conversion process executed by each intermediate layer L. In the first embodiment, a conversion process applying a common parameter set Pn is executed on each of the multiple pieces of unit data Du constituting the unit data sequence U (FIG. 4). In the third embodiment, a conversion process applying an individual parameter set Pn is executed on each of the multiple pieces of unit data Du constituting the unit data sequence U.
 FIG. 13 shows, of the unit data sequence U, the unit data Du(t1) corresponding to time t1 and the unit data Du(t2) corresponding to time t2 (t2 ≠ t1). The parameter set Pn(t1) is applied to the conversion process of the unit data Du(t1), and the parameter set Pn(t2) is applied to the conversion process of the unit data Du(t2). The parameter set Pn(t1) and the parameter set Pn(t2) are generated separately. Specifically, the parameter set Pn(t1) is generated from the control vector V(t1) corresponding to time t1, and the parameter set Pn(t2) is generated from the control vector V(t2) corresponding to time t2. Therefore, the numerical values of the first parameter p1 and the second parameter p2 may differ between the parameter set Pn(t1) and the parameter set Pn(t2). As exemplified above, the parameter set Pn applied to the conversion process changes at points partway through the unit data sequence U.
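 The per-time-step application of the parameter sets can be illustrated by the following minimal sketch; the array shapes, the channel count, and the point at which the parameters change are assumptions made only for this example.

```python
# Sketch: time-varying conversion process inside an intermediate layer L.
# Each time step t of the unit data sequence U gets its own parameter set Pn(t).
import numpy as np

def transform(unit_data, p1, p2):
    """unit_data: (T, C) unit data sequence; p1, p2: (T, C) per-time-step parameters."""
    return unit_data * p1 + p2          # multiply by the first parameter, add the second

T, C = 100, 16                          # assumed number of time steps and channels
U = np.random.randn(T, C)
p1 = np.ones((T, C))
p2 = np.zeros((T, C))
p1[50:] = 1.5                           # the parameter set changes midway through U,
p2[50:] = 0.2                           # e.g. in response to instruction Q1 or Q2
U_out = transform(U, p1, p2)
```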
 The third embodiment achieves the same effects as the second embodiment. Furthermore, in the third embodiment, a time series of control vectors V is generated in response to instructions from the user (the first instruction Q1 and the second instruction Q2), and a time series of each parameter set Pn is generated from the time series of control vectors V. Therefore, a variety of target musical tones whose timbre changes at points partway through the control data sequence X can be generated.
D: Fourth embodiment
 FIG. 14 is a block diagram illustrating the configuration of the first generative model 30 (musical tone synthesis unit 22) in the fourth embodiment. The first generative model 30 of the fourth embodiment is an autoregressive (AR) generative model including a conversion processing unit 61, a convolution layer 62, N unit processing units 63-1 to 63-N, and a synthesis processing unit 64. As described later, the first generative model 30 has an arbitrary number (Nx) of intermediate layers; if all of the intermediate layers are omitted, it becomes equivalent to the generative model (NPSS) disclosed in the 2017 Applied Sciences paper "A Neural Parametric Singing Synthesizer Modeling Timbre and Expression from Natural Songs" by Merlijn Blaauw and Jordi Bonada. The configuration other than the first generative model 30 is the same as in the first embodiment. Each parameter set Pn generated by the control vector processing unit 25 (conversion model 28-n) is supplied to the unit processing unit 63-n.
 Like the pre-processing unit 311, the conversion processing unit 61 generates latent data d from the control data Dx acquired by the control data acquisition unit 21 for each unit period. The latent data d represents the features of the control data Dx. For example, the conversion processing unit 61 is configured as a multi-layer perceptron that converts the control data Dx into the latent data d. The latent data d may be supplied in common to the N unit processing units 63-1 to 63-N, or different data may be supplied to each individually. Note that the control data Dx acquired by the control data acquisition unit 21 may itself be supplied to each unit processing unit 63-n as the latent data d. That is, the conversion processing unit 61 may be omitted.
 FIG. 15 is a block diagram of each unit processing unit 63-n. Each unit processing unit 63-n is a generative model that generates output data O and processed data Cn by processing input data I, the latent data d, and the parameter set Pn. The input data I includes first data Ia and second data Ib. The unit processing unit 63-n includes a dilated convolution layer 65, an intermediate layer L, and a processing layer 67.
 The dilated convolution layer 65 generates unit data Du1 by performing dilated convolution on the input data I (the first data Ia and the second data Ib).
 The intermediate layer L generates unit data Du2 by performing a conversion process on the unit data Du1. The content of the conversion process is the same as in the first embodiment. The parameter set Pn is applied in the conversion process. Note that the intermediate layer L does not have to be installed in all of the N unit processing units 63-1 to 63-N. That is, the intermediate layer L is installed in Nx (one or more) of the N unit processing units 63-1 to 63-N. Here, the description assumes that the intermediate layer L is installed in all of the N unit processing units.
 The processing layer 67 generates the output data O and the processed data Cn from the unit data Du2 and the latent data d. Specifically, the processing layer 67 includes a convolution layer 671, an adder 672, an activation layer 673, an activation layer 674, a multiplier 675, a convolution layer 676, a convolution layer 677, and an adder 678.
 The convolution layer 671 performs a 1×1 convolution operation on the latent data d. The adder 672 generates unit data Du3 by adding the unit data Du2 and the output data of the convolution layer 671. The unit data Du3 is divided into a first portion and a second portion. The activation layer 673 processes the first portion of the unit data Du3 with an activation function (e.g., a tanh function). The activation layer 674 processes the second portion of the unit data Du3 with an activation function (e.g., a sigmoid function). The multiplier 675 generates unit data Du4 by computing the element-wise product of the output data of the activation layer 673 and the output data of the activation layer 674. The unit data Du4 is thus the data obtained by applying a gated activation function (673 to 675) to the output of the dilated convolution layer 65. Here, each of the unit data Du1 to Du3 includes a first portion and a second portion; however, when a general (ungated) activation function such as a sigmoid, tanh, or ReLU is used, the unit data Du1 to Du3 need only include the first portion.
 The convolution layer 676 generates the processed data Cn by performing a 1×1 convolution operation on the unit data Du4. The convolution layer 677 performs a 1×1 convolution operation on the unit data Du4. The adder 678 generates the output data O by adding the first data Ia and the output data of the convolution layer 677. The output data O is stored in the storage device 12.
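 For reference, a minimal sketch of the data flow of FIG. 15 is shown below, with every convolution reduced to a per-time-step linear map for brevity; the shapes of the weight matrices and this simplification are assumptions, not the actual layer implementation.

```python
# Sketch of one unit processing unit 63-n (forward pass only, shapes assumed).
# Convolutions are reduced to 1x1 (per-time-step) linear maps for brevity.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def unit_processing(Ia, Ib, d, p1, p2, W_dil, W_cond, W_skip, W_res):
    # dilated convolution 65: combine the current input Ia and the dilated past input Ib
    Du1 = Ia @ W_dil[0] + Ib @ W_dil[1]
    # intermediate layer L: conversion process applying the parameter set Pn
    Du2 = Du1 * p1 + p2
    # processing layer 67: conditioning on the latent data d, then gated activation
    Du3 = Du2 + d @ W_cond
    half = Du3.shape[-1] // 2
    Du4 = np.tanh(Du3[..., :half]) * sigmoid(Du3[..., half:])
    Cn = Du4 @ W_skip                  # processed data Cn (skip output)
    O = Ia + Du4 @ W_res               # output data O (residual connection)
    return O, Cn
```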
 The synthesis processing unit 64 in FIG. 14 generates the acoustic data Dz by processing the N pieces of processed data C1 to CN generated by the different unit processing units 63-n. For example, the synthesis processing unit 64 generates the acoustic data Dz based on a weighted sum of the N pieces of processed data C1 to CN. The generation of the acoustic data Dz by the synthesis processing unit 64 is repeated for each unit period. That is, the synthesis processing unit 64 generates a time series of acoustic data Dz. The acoustic data Dz generated by the synthesis processing unit 64 is supplied to the waveform synthesis unit 23 as in the first embodiment, and is also stored in the storage device 12 and used by the convolution layer 62.
 The convolution layer 62 generates unit data Du0 for each unit period by a causal convolution operation on the acoustic data Dz generated in the immediately preceding multiple unit periods. The unit data Du0 is supplied to the first-stage unit processing unit 63-1 as the input data I. The first data Ia supplied to the unit processing unit 63-1 in each unit period is the unit data Du0 generated in the current unit period, and the second data Ib supplied to the unit processing unit 63-1 in each unit period is the unit data Du0 generated in the immediately preceding unit period. As described above, the first data Ia and the second data Ib corresponding to different unit periods are supplied to the unit processing unit 63-1.
 In each unit period on the time axis, each unit processing unit 63-n of the second and subsequent stages is supplied, as the first data Ia and the second data Ib, with multiple pieces of output data O that the preceding unit processing unit 63-(n-1) generated for different unit periods. For example, the unit processing unit 63-2 is supplied, as the first data Ia, with the output data O generated by the unit processing unit 63-1 in the current unit period, and, as the second data Ib, with the output data O generated by the unit processing unit 63-1 two unit periods earlier (dilation = 2). The unit processing unit 63-3 is supplied, as the first data Ia, with the output data O generated by the unit processing unit 63-2 in the current unit period, and, as the second data Ib, with the output data O generated by the unit processing unit 63-2 four unit periods earlier (dilation = 4).
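 The dilation pattern described above (1, 2, 4, ...) and the aggregation of the skip outputs can be illustrated by the following crude sketch; the number of stages, the channel count, the stand-in nonlinearity, and the plain averaging in place of a learned weighted sum are all assumptions made for illustration.

```python
# Sketch: autoregressive stack with dilations 1, 2, 4, ... (stages and shapes assumed).
# Each stage n reads its current input and its input from 2**(n-1) unit periods ago,
# and the skip outputs C1..CN are aggregated into one acoustic data frame Dz.
import numpy as np

N = 4                                                 # assumed number of unit processing units
inputs = [[np.zeros(8)] * (2 ** n) for n in range(N)]  # past inputs needed by each stage

def step(du0):
    """One unit period: du0 is the unit data Du0 from the causal convolution layer 62."""
    skips = []
    x = du0                                   # first data Ia for stage 1
    for n in range(N):
        dilation = 2 ** n                     # dilation = 1, 2, 4, 8, ...
        past = inputs[n][-dilation]           # second data Ib: input from `dilation` periods ago
        inputs[n].append(x)                   # remember the current input for later periods
        out = np.tanh(x + past)               # crude stand-in for unit processing unit 63-(n+1)
        skips.append(out)                     # processed data Cn (skip output)
        x = out                               # becomes first data Ia of the next stage
    return np.mean(skips, axis=0)             # synthesis processing 64 (simple average assumed)

dz = step(np.random.randn(8))                 # one acoustic data frame Dz per unit period
```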
 As described above, the first generative model 30 of the fourth embodiment includes N intermediate layers L corresponding to the different unit processing units 63-n. The convolution layer 62, together with the dilated convolution layer 65 and the processing layer 67 of each unit processing unit 63-n, constitutes the basic layers required for generating the time series of acoustic data Dz. That is, the first generative model 30 of the fourth embodiment includes multiple basic layers and one or more intermediate layers L, as in the first embodiment. Therefore, the fourth embodiment also achieves the same effects as the first embodiment.
E: Fifth embodiment
 In the first to fourth embodiments, the case in which the target musical tone is a singing voice has been exemplified. The musical sound synthesis system 100 of the fifth embodiment synthesizes, as the target musical tone, an instrument sound to be produced by performance of the target piece of music.
 The control data Dx in the first to fourth embodiments includes the pitch (fundamental frequency) of the target musical tone, information indicating voiced/unvoiced, and phoneme information. The control data Dx in the fifth embodiment is a musical score feature for instrument sounds that includes, instead of the voiced/unvoiced information and the phoneme information, the intensity (volume) and the performance style of the target musical tone. The performance style is, for example, information representing the playing technique of the instrument. When the target musical tone is an instrument sound, an instrument sound is also used as the reference musical tone. That is, the partial timbre is a characteristic of the temporal change in the acoustic characteristics of the instrument sound.
 The first generative model 30 and the second generative model 40 of the fifth embodiment are established, in the machine learning of FIG. 8, by training with training data T for instrument sounds (control data sequence Xt, reference data sequence Rt, and acoustic data sequence Zt). The first generative model 30 is a trained statistical model that has learned the relationship between the conditions on the musical score of the target instrument sound (control data sequence X) and the acoustic features of the target instrument sound (acoustic data sequence Z). The musical tone synthesis unit 22 then generates the acoustic data sequence Z of the instrument sound by processing the control data sequence X for the instrument sound with the first generative model 30.
 The fifth embodiment, in which each of the first to fourth embodiments is applied to the generation of instrument sounds, also achieves the same effects as the first to fourth embodiments. As can be understood from the examples of each of the above embodiments, a "musical tone" in the present disclosure means a musical sound such as a singing voice or an instrument sound.
F: Modifications
 Specific modifications that may be added to each of the embodiments exemplified above are described below. Two or more aspects arbitrarily selected from the following examples may be combined as appropriate to the extent that they do not contradict one another.
 (1) In the first to third embodiments, configurations in which the first encoder 31 includes the pre-processing unit 311 have been exemplified, but the pre-processing unit 311 may be omitted. For example, the control data sequence X may be supplied directly from the control data acquisition unit 21 to the first-stage convolution layer 321 of the first encoder 31. Likewise, in each of the embodiments described above, configurations in which the decoder 32 includes the post-processing unit 322 have been exemplified, but the post-processing unit 322 may be omitted. For example, the acoustic data sequence Z output by the final-stage intermediate layer L may be supplied directly to the waveform synthesis unit 23.
 (2) In each of the embodiments described above, the position on the time axis of the specific section in the reference signal Sr is changed in response to the first instruction Q1 from the user, but the configuration for reflecting the first instruction Q1 in the control vector V is not limited to this example. In a configuration in which multiple reference signals Sr representing different reference musical tones are stored in the storage device 12, the control device 11 (section setting unit 241) may select one of the multiple reference signals Sr in response to the first instruction Q1 from the user. The control device 11 generates the reference data sequence R from the specific section of the reference signal Sr selected in response to the first instruction Q1.
 (3) In the second embodiment, the configuration for changing each element Ek of the control vector V in response to the second instruction Q2 from the user is not limited to the example given above.
 For example, multiple pieces of preset data for the control vector V0 may be stored in the storage device 12. Each piece of preset data specifies each of the K elements E1 to EK of the control vector V0. The user can select one of the multiple pieces of preset data by operating the operation device 14. The control vector adjustment unit 244 uses the preset data selected by the user as the initial control vector V0 to be adjusted. In this configuration, the instruction to select and recall one of the multiple pieces of preset data corresponds to the second instruction Q2.
 In the embodiments described above, the position of each operator Gb-k corresponds to the numerical value of the element Ek, but the position of each operator Gb-k may instead correspond to the amount of change in the element Ek. In that case, the control vector adjustment unit 244 sets the numerical value of the element Ek to a value obtained by changing the initial value in the control vector V0 by the amount of change corresponding to the position of the operator Gb-k.
 (4) In each of the embodiments described above, a configuration in which the musical sound synthesis system 100 includes the training processing unit 26 has been exemplified for convenience, but the training processing unit 26 may be mounted on a machine learning system separate from the musical sound synthesis system 100. The first generative model 30 and the second generative model 40 established by the machine learning system are provided to the musical sound synthesis system 100 and used in the musical tone synthesis process Sa.
 (5) In each of the embodiments described above, the audio signal W is generated with the processing period B as the temporal unit. In each embodiment, multiple processing periods B that are consecutive on the time axis may partially overlap one another on the time axis, as illustrated in FIG. 16. The temporal relationship between the processing periods B is not limited to the example in FIG. 16.
 As in the embodiments described above, the audio signal W is generated sequentially for each processing period B on the time axis. The audio signals W within the valid period b of each processing period B are added together (e.g., by a weighted average) between processing periods B that are consecutive on the time axis, whereby the final audio signal is generated. The valid period b is a period contained within the processing period B. Specifically, the valid period b is the period obtained by excluding from the processing period B a period of predetermined length that includes the start point of the processing period B and a period of predetermined length that includes the end point of the processing period B. According to the configuration of FIG. 16, the discontinuity of the waveform of the audio signal W at the edges (the start and end) of each processing period B is reduced, and as a result an audio signal with a continuous waveform that sounds natural to the ear can be generated.
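 A minimal sketch of this overlap-add procedure is given below; the triangular (Bartlett) weighting inside each processing period is an assumed choice, since the disclosure only requires some mutual addition such as a weighted average.

```python
# Sketch: overlap-add of the audio signals W generated per processing period B.
# Triangular (Bartlett) cross-fade weights are an assumed choice of weighting.
import numpy as np

def overlap_add(segments, hop):
    """segments: list of equally long signals W; hop: start-to-start distance in samples."""
    length = len(segments[0])
    window = np.bartlett(length)                      # weights fading in/out at the edges
    total = np.zeros(hop * (len(segments) - 1) + length)
    norm = np.zeros_like(total)
    for i, seg in enumerate(segments):
        start = i * hop
        total[start:start + length] += window * seg   # weighted addition over the overlap
        norm[start:start + length] += window
    return total / np.maximum(norm, 1e-9)             # weighted average where periods overlap
```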
 (6) In each of the embodiments described above, virtual operators Gb-k displayed on the display device 13 have been exemplified, but the operators Gb-k that accept instructions to change each element Ek may be physical operators that the user can actually touch.
 (7) The conversion process executed by each intermediate layer L is not limited to the processes exemplified in the embodiments described above. For example, one of the multiplication by the first parameter p1 and the addition of the second parameter p2 may be omitted. In a configuration in which the conversion process does not include the addition of the second parameter p2, the parameter set Pn consists of the first parameter p1 only. In a configuration in which the conversion process does not include the multiplication by the first parameter p1, the parameter set Pn consists of the second parameter p2 only. That is, the parameter set Pn is expressed as a variable including one or more parameters.
 (8) In each of the embodiments described above, the first generative model 30 including the first encoder 31 and the decoder 32 has been exemplified, but the configuration of the first generative model 30 is not limited to this example. The first generative model 30 is comprehensively expressed as a model that has learned the relationship between the conditions of the target musical tone (control data sequence X) and the acoustic features of the target musical tone (acoustic data sequence Z). Therefore, a model of any structure including one or more intermediate layers L capable of executing the conversion process may be used as the first generative model 30.
 Similarly, the configuration of the second generative model 40 is not limited to the examples in the embodiments described above. For example, in the embodiments described above, a configuration in which the sampling unit 414 samples each element Ek of the control vector V from the corresponding probability distribution Fk has been exemplified, but the control vector V may instead be generated by the multiple convolution layers 411. That is, the output processing unit 412 in the second encoder 243 may be omitted.
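 Assuming, only for the purpose of this sketch, that each probability distribution Fk is a normal distribution whose mean and variance are output by the second encoder 243, the sampling of the control vector V could look as follows; the distribution family and the encoder outputs named here are assumptions.

```python
# Sketch: sample each element Ek of the control vector V from a probability distribution Fk,
# assuming Fk is a normal distribution with mean mu[k] and standard deviation exp(log_sigma[k]).
import numpy as np

def sample_control_vector(mu, log_sigma, rng=np.random.default_rng()):
    sigma = np.exp(log_sigma)
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps              # one sample Ek per distribution Fk

mu = np.zeros(8)                         # example: K = 8 (hypothetical encoder outputs)
log_sigma = np.full(8, -1.0)
V = sample_control_vector(mu, log_sigma)
```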
 (9) In the second embodiment, a configuration in which the control vector V is generated in response to the first instruction Q1 and the second instruction Q2 has been exemplified, but in a configuration including the control vector adjustment unit 244, the configuration for accepting the first instruction Q1 may be omitted. Specifically, the section setting unit 241 may set the specific section of the reference signal Sr independently of instructions from the user. For example, the section setting unit 241 sets, as the specific section, a section in which the acoustic characteristics of the reference signal Sr satisfy a specific condition, such as a section in which acoustic characteristics like timbre fluctuate significantly. Alternatively, the entire reference signal Sr may be used as the specific section. In a configuration in which the entire reference signal Sr is used as the specific section, the section setting unit 241 may be omitted.
 (10) In the first to third embodiments, configurations in which the first generative model 30 includes N1 encoding intermediate layers Le and N2 decoding intermediate layers Ld have been exemplified, but the encoding intermediate layers Le or the decoding intermediate layers Ld may be omitted. For example, a configuration in which the first encoder 31 of the first generative model 30 includes no encoding intermediate layer Le, or a configuration in which the decoder 32 includes no decoding intermediate layer Ld, is also conceivable. As described above, each intermediate layer L executes the conversion process. Therefore, a configuration in which the first encoder 31 does not execute the conversion process, or a configuration in which the decoder 32 does not execute the conversion process, is also conceivable.
 In a configuration in which the encoding intermediate layers Le are omitted, the first generative model 30 includes N2x decoding intermediate layers Ld. As described above, the number N2x of decoding intermediate layers Ld is a natural number not exceeding N2. In a configuration in which the decoding intermediate layers Ld are omitted, the first generative model 30 includes N1x encoding intermediate layers Le. As described above, the number N1x of encoding intermediate layers Le is a natural number not exceeding N1.
 As can be understood from the above examples, the number Nx of intermediate layers L in the first generative model 30 of the first to fourth embodiments is a natural number of 1 or more and N or less. That is, the first generative model 30 is comprehensively expressed as a model including multiple basic layers and one or more intermediate layers L. The intermediate layers L are included in one or both of the first encoder 31 and the decoder 32. That is, the conversion process is executed in at least one place in the first generative model 30.
 (11) The musical sound synthesis system 100 may be realized by a server device that communicates with an information device such as a smartphone or tablet terminal. For example, the musical sound synthesis system 100 generates the audio signal W from the music data M and the reference signal Sr received from the information device, and transmits the audio signal W to the information device. In a configuration in which the acoustic data sequence Z generated by the musical tone synthesis unit 22 is transmitted from the musical sound synthesis system 100 to the information device, the waveform synthesis unit 23 may be omitted from the musical sound synthesis system 100. The information device generates the audio signal from the acoustic data sequence Z received from the musical sound synthesis system 100.
 The control data sequence X may also be transmitted from the information device to the musical sound synthesis system 100 instead of the music data M. The control data acquisition unit 21 receives the control data sequence X transmitted from the information device. "Receiving" the control data Dx (control data sequence X) is an example of "acquiring" the control data Dx.
 (12) As described above, the functions of the musical sound synthesis system 100 exemplified above are realized by the cooperation of the single or multiple processors constituting the control device 11 and the program stored in the storage device 12. The program according to the present disclosure may be provided in a form stored in a computer-readable recording medium and installed in a computer. The recording medium is, for example, a non-transitory recording medium; an optical recording medium (optical disc) such as a CD-ROM is a good example, but any known type of recording medium such as a semiconductor recording medium or a magnetic recording medium is also included. A non-transitory recording medium includes any recording medium other than a transitory, propagating signal, and volatile recording media are not excluded. In a configuration in which a distribution device distributes the program via a communication network, the storage device that stores the program in the distribution device corresponds to the non-transitory recording medium described above.
G: Supplementary notes
 From the embodiments exemplified above, the following configurations, for example, can be understood.
 A musical sound synthesis method according to one aspect (aspect 1) of the present disclosure is a musical sound synthesis method realized by a computer system, in which a time series of control data representing the conditions of a target musical tone is acquired, and a time series of acoustic data representing the acoustic features of the target musical tone is generated by processing the time series of control data with a trained generative model that includes multiple basic layers and one or more intermediate layers and has learned the relationship between the conditions of musical tones and the acoustic features of those musical tones. In the method, a control vector representing the characteristics of the temporal change in timbre is generated in response to an instruction from a user, and a first parameter set is generated from the control vector. A first intermediate layer of the one or more intermediate layers executes a process that applies the first parameter set to the data input to the first intermediate layer, and outputs the resulting data to the next layer.
 In the above aspect, a control vector representing the characteristics of the temporal change in timbre (partial timbre) is generated in response to an instruction from the user, a first parameter set is generated from the control vector, and the first parameter set is applied to the data input to the first intermediate layer. Therefore, a time series of acoustic data of target musical tones having a variety of partial timbres corresponding to the instruction from the user can be generated.
 A "target musical tone" is the musical tone to be synthesized. A "musical tone" is a sound related to music. For example, a musical sound such as the singing voice of a singer or the performance sound of an instrument is an example of a "musical tone".
 「制御データ」は、目標楽音の条件を表す任意の形式のデータである。例えば、楽音の楽譜(すなわち音符列)を表す楽曲データの特徴量(楽譜特徴量)を表すデータが「制御データ」の一例である。制御データが表す楽譜特徴量の種類は任意である。例えば、特許文献1と同様の楽譜特徴量が利用される。 "Control data" is data in any format that represents the conditions of a target musical tone. For example, data that represents the features (score features) of music data that represents the musical score of a musical tone (i.e., a sequence of notes) is an example of "control data". The type of score features represented by the control data is arbitrary. For example, score features similar to those in Patent Document 1 are used.
In a specific example (aspect 2) of aspect 1, in generating the control vector, a time series of the control vector is generated in response to the instruction from the user, and in generating the first parameter set, a time series of the first parameter set is generated from the time series of the control vector. In the above aspect, a time series of the control vector is generated in response to the instruction from the user, and a time series of the first parameter set is generated from the time series of the control vector. Therefore, it is possible to generate a variety of target musical sounds whose timbre changes at points partway through the time series of the control data.
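The following minimal sketch illustrates how a time series of control vectors could yield a time series of first parameter sets, here by cross-fading between two partial-timbre vectors over the frames. The interpolation scheme, the dimensions, and the single linear map standing in for the transformation model are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
D_CTRL, D_HIDDEN, N_FRAMES = 8, 64, 200

# Two control vectors representing different partial timbres (assumed values).
v_start = rng.standard_normal(D_CTRL)
v_end = rng.standard_normal(D_CTRL)

# Control-vector time series: cross-fade from v_start to v_end over the frames,
# so the timbre changes partway through the time series of control data.
t = np.linspace(0.0, 1.0, N_FRAMES)[:, None]
control_vectors = (1.0 - t) * v_start + t * v_end            # (N_FRAMES, D_CTRL)

# Hypothetical transformation model: one linear map applied frame by frame.
W = rng.standard_normal((2 * D_HIDDEN, D_CTRL)) * 0.01
param_sets = control_vectors @ W.T                             # (N_FRAMES, 2 * D_HIDDEN)
scales, shifts = param_sets[:, :D_HIDDEN], param_sets[:, D_HIDDEN:]
print(scales.shape, shifts.shape)  # (200, 64) (200, 64)
```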
In a specific example (aspect 3) of aspect 1 or aspect 2, a second parameter set is further generated from the control vector, and a second intermediate layer of the one or more intermediate layers executes processing in which the second parameter set is applied to the data input to the second intermediate layer, and outputs the data after the application to the next layer. In the above aspect, in addition to the first parameter set being applied to the data input to the first intermediate layer, the second parameter set is applied to the data input to the second intermediate layer. Therefore, a time series of acoustic data of a target musical sound having a variety of partial timbres can be generated.
In a specific example (aspect 4) of any one of aspects 1 to 3, the one or more intermediate layers are a plurality of intermediate layers, the generative model includes a first encoder including a plurality of coding intermediate layers among the one or more intermediate layers and a decoder including a plurality of decoding intermediate layers among the one or more intermediate layers, and in generating the time series of the acoustic data, the time series of the control data is processed by the first encoder to generate intermediate data representing the characteristics of the time series of the control data, and the intermediate data is processed by the decoder to generate the time series of the acoustic data. According to the above aspect, the time series of the acoustic data can be generated by encoding with the first encoder and decoding with the decoder.
The "first encoder" is a statistical model that generates intermediate data representing the characteristics of the time series of control data. The decoder, on the other hand, is a statistical model that generates the time series of acoustic data from the intermediate data. Each of the "first intermediate layer" and the "second intermediate layer" may be either a coding intermediate layer or a decoding intermediate layer.
In a specific example (aspect 5) of aspect 4, the first encoder compresses data on the time axis, and the decoder expands data on the time axis. In the above aspect, intermediate data in which the characteristics of the time series of the control data are appropriately reflected is generated, and a time series of acoustic data in which the characteristics of the intermediate data are appropriately reflected is generated.
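A minimal PyTorch sketch of this encoder/decoder arrangement is shown below, with a strided 1-D convolution compressing the time axis in the first encoder and a transposed convolution expanding it in the decoder. The channel counts, strides, and layer counts are assumptions for illustration and are not the configuration of the embodiments.

```python
import torch
import torch.nn as nn

class FirstEncoder(nn.Module):
    """Compresses the control-data time series on the time axis into intermediate data."""
    def __init__(self, in_ch=2, mid_ch=64):
        super().__init__()
        self.conv1 = nn.Conv1d(in_ch, mid_ch, kernel_size=4, stride=2, padding=1)
        self.conv2 = nn.Conv1d(mid_ch, mid_ch, kernel_size=4, stride=2, padding=1)

    def forward(self, x):                      # x: (batch, in_ch, frames)
        h = torch.relu(self.conv1(x))          # frames / 2
        return torch.relu(self.conv2(h))       # frames / 4 -> intermediate data

class Decoder(nn.Module):
    """Expands the intermediate data back on the time axis into acoustic data."""
    def __init__(self, mid_ch=64, out_ch=80):
        super().__init__()
        self.deconv1 = nn.ConvTranspose1d(mid_ch, mid_ch, kernel_size=4, stride=2, padding=1)
        self.deconv2 = nn.ConvTranspose1d(mid_ch, out_ch, kernel_size=4, stride=2, padding=1)

    def forward(self, z):                      # z: (batch, mid_ch, frames / 4)
        h = torch.relu(self.deconv1(z))        # frames / 2
        return self.deconv2(h)                 # frames -> acoustic feature frames

control = torch.randn(1, 2, 200)               # time series of control data
intermediate = FirstEncoder()(control)          # (1, 64, 50)
acoustic = Decoder()(intermediate)              # (1, 80, 200)
print(intermediate.shape, acoustic.shape)
```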
In a specific example (aspect 6) of any one of aspects 1 to 5, in generating the control vector, a specific section of a reference musical sound is set in response to a first instruction from the user, and a time series of reference data representing the acoustic features of the reference musical sound in the specific section is processed by a second encoder to generate the control vector representing the characteristics of the temporal change in timbre in the specific section of the reference musical sound. In the above aspect, the specific section of the reference musical sound is set in response to the first instruction from the user, and a control vector representing the characteristics of the temporal change in timbre (partial timbre) in the specific section is generated. Therefore, it is possible to generate a target musical sound having the partial timbre of the specific section of the reference musical sound designated by the first instruction.
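The sketch below illustrates the idea of cutting out a user-designated section of the reference sound's feature time series and summarizing it into one control vector. The second_encoder function here is a toy stand-in (a linear projection followed by temporal averaging), not the trained second encoder of the embodiments; all dimensions are assumed.

```python
import numpy as np

rng = np.random.default_rng(2)
FRAME_RATE = 100            # frames per second (assumed)
D_FEAT, D_CTRL = 80, 8      # reference feature / control vector dimensions (assumed)

reference_features = rng.standard_normal((10 * FRAME_RATE, D_FEAT))  # 10 s of reference sound

# First instruction from the user: the specific section, in seconds.
section_start, section_end = 2.0, 3.5

def second_encoder(section: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Toy stand-in for the second encoder: project each frame, then average over time."""
    return (section @ W.T).mean(axis=0)

W_enc = rng.standard_normal((D_CTRL, D_FEAT)) * 0.01
begin, end = int(section_start * FRAME_RATE), int(section_end * FRAME_RATE)
control_vector = second_encoder(reference_features[begin:end], W_enc)
print(control_vector.shape)  # (8,) -- partial timbre of the chosen section
```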
In a specific example (aspect 7) of aspect 6, the position of the specific section on the time axis is further changed in response to the first instruction. In the above aspect, the position on the time axis of the specific section in the reference musical sound is changed in response to the first instruction from the user. Therefore, it is possible to generate a target musical sound having the partial timbre at the position of the reference musical sound desired by the user.
In a specific example (aspect 8) of any one of aspects 1 to 7, the control vector includes a plurality of elements, and in generating the control vector, one or more of the plurality of elements are changed in response to a second instruction from the user. In the above aspect, one or more of the plurality of elements of the control vector are changed in response to the second instruction from the user. Therefore, it is possible to generate a variety of target musical sounds having partial timbres corresponding to the second instruction from the user.
In a specific example (aspect 9) of aspect 8, the second instruction is an operation on a plurality of operators respectively corresponding to the plurality of elements, and in changing the one or more elements, the one or more elements are set in response to an operation on one or more operators, among the plurality of operators, that correspond to the one or more elements. In the above aspect, the user can easily adjust the partial timbre by operating each operator.
The "operator" may take any form. For example, a reciprocating operator (slider) that can move linearly within a specific range, or a rotary operator (knob) that can be rotated, is an example of an "operator." The "operator" may be a physical operator that the user can touch, or a virtual operator displayed by a display device.
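As one illustration of how operator values could set individual control vector elements, the following sketch maps slider positions to elements of the control vector. The number of elements, the value ranges, and the one-to-one mapping are assumptions for illustration.

```python
import numpy as np

D_CTRL = 8

def apply_operators(control_vector: np.ndarray, slider_values: dict[int, float]) -> np.ndarray:
    """Set one or more control-vector elements from the operators the user actually moved.

    slider_values maps an element index to the operator's value in [0, 1],
    rescaled here to an assumed element range of [-1, 1].
    """
    updated = control_vector.copy()
    for index, value in slider_values.items():
        updated[index] = 2.0 * value - 1.0
    return updated

base_vector = np.zeros(D_CTRL)                                    # e.g. extracted from a reference sound
edited_vector = apply_operators(base_vector, {0: 0.75, 3: 0.2})   # second instruction from the user
print(edited_vector)
```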
In a specific example (aspect 10) of any one of aspects 1 to 9, the first intermediate layer executes a conversion process in which the first parameter set is applied to the data input to the first intermediate layer.
In a specific example (aspect 11) of aspect 10, the first parameter set includes a first parameter and a second parameter, and the conversion process includes multiplication by the first parameter and addition of the second parameter. In the above aspect, a conversion process including multiplication by the first parameter and addition of the second parameter is executed on the data input to the first intermediate layer. Therefore, it is possible to generate acoustic data of a target musical sound to which the partial timbre represented by the control vector is appropriately imparted.
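Written out as code, the conversion process of this aspect is a frame-wise affine transform: the first parameter multiplies the input data and the second parameter is added. The tensor layout below is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
D_HIDDEN, N_FRAMES = 64, 200

x = rng.standard_normal((D_HIDDEN, N_FRAMES))    # data input to the first intermediate layer
first_param = rng.standard_normal(D_HIDDEN)       # first parameter (multiplicative)
second_param = rng.standard_normal(D_HIDDEN)      # second parameter (additive)

# Conversion process: multiplication by the first parameter and addition of the second.
y = first_param[:, None] * x + second_param[:, None]
print(y.shape)  # (64, 200) -- output passed to the next layer
```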
A musical sound synthesis system according to one aspect (aspect 12) of the present disclosure includes: a control data acquisition unit that acquires a time series of control data representing the conditions of a target musical sound; a control vector generation unit that generates, in response to an instruction from a user, a control vector representing the characteristics of the temporal change in timbre; a control vector processing unit that generates a first parameter set from the control vector; and a musical sound synthesis unit that generates a time series of acoustic data representing the acoustic features of the target musical sound by processing the time series of the control data with a trained generative model that includes a plurality of base layers and one or more intermediate layers and that has learned the relationship between the conditions of a musical sound and the acoustic features of that musical sound. A first intermediate layer of the one or more intermediate layers executes processing in which the first parameter set is applied to the data input to the first intermediate layer, and outputs the data after the application to the next layer.
A program according to one aspect of the present disclosure causes a computer system to function as: a control data acquisition unit that acquires a time series of control data representing the conditions of a target musical sound; a control vector generation unit that generates, in response to an instruction from a user, a control vector representing the characteristics of the temporal change in timbre; a control vector processing unit that generates a first parameter set from the control vector; and a musical sound synthesis unit that generates a time series of acoustic data representing the acoustic features of the target musical sound by processing the time series of the control data with a trained generative model that includes a plurality of base layers and one or more intermediate layers and that has learned the relationship between the conditions of a musical sound and the acoustic features of that musical sound. A first intermediate layer of the one or more intermediate layers executes processing in which the first parameter set is applied to the data input to the first intermediate layer, and outputs the data after the application to the next layer.
100...musical sound synthesis system, 11...control device, 12...storage device, 13...display device, 14...operation device, 15...sound emission device, 21...control data acquisition unit, 22...musical sound synthesis unit, 23...waveform synthesis unit, 24...control vector generation unit, 241...section setting unit, 242...feature extraction unit, 243...second encoder, 25...control vector processing unit, 28-n...transformation model, 26...training processing unit, 30...first generative model, 31...first encoder, 311...preprocessing unit, 312...convolutional layer, Le...coding intermediate layer, 32...decoder, 321...convolutional layer, Ld...decoding intermediate layer, 322...post-processing unit, 40...second generative model, 411...convolutional layer, 412...output processing unit, 413...post-processing unit, 414...sampling unit, 51...first provisional model, 52...second provisional model, 61...conversion processing unit, 62...convolutional layer, 63-n...unit processing unit, 64...synthesis processing unit, 65...dilated convolutional layer, 67...processing layer.

Claims (13)

  1.  A musical sound synthesis method realized by a computer system, the method comprising:
     acquiring a time series of control data representing conditions of a target musical sound; and
     generating a time series of acoustic data representing acoustic features of the target musical sound by processing the time series of the control data with a trained generative model that includes a plurality of base layers and one or more intermediate layers and that has learned a relationship between conditions of a musical sound and acoustic features of the musical sound,
     wherein the method further comprises:
     generating, in response to an instruction from a user, a control vector representing characteristics of a temporal change in timbre; and
     generating a first parameter set from the control vector, and
     wherein a first intermediate layer of the one or more intermediate layers executes processing in which the first parameter set is applied to data input to the first intermediate layer, and outputs the data after the application to a next layer.
  2.  The musical sound synthesis method according to claim 1, wherein
     in generating the control vector, a time series of the control vector is generated in response to the instruction from the user, and
     in generating the first parameter set, a time series of the first parameter set is generated from the time series of the control vector.
  3.  The musical sound synthesis method according to claim 1, further comprising generating a second parameter set from the control vector,
     wherein a second intermediate layer of the one or more intermediate layers executes processing in which the second parameter set is applied to data input to the second intermediate layer, and outputs the data after the application to a next layer.
  4.  The musical sound synthesis method according to claim 1, wherein
     the one or more intermediate layers are a plurality of intermediate layers,
     the generative model includes a first encoder including a plurality of coding intermediate layers among the plurality of intermediate layers, and a decoder including a plurality of decoding intermediate layers among the plurality of intermediate layers, and
     generating the time series of the acoustic data includes:
     processing the time series of the control data with the first encoder to generate intermediate data representing characteristics of the time series of the control data; and
     processing the intermediate data with the decoder to generate the time series of the acoustic data.
  5.  The musical sound synthesis method according to claim 4, wherein
     the first encoder compresses data on a time axis, and
     the decoder expands data on the time axis.
  6.  The musical sound synthesis method according to claim 1, wherein generating the control vector includes:
     setting a specific section of a reference musical sound in response to a first instruction from the user; and
     processing, with a second encoder, a time series of reference data representing acoustic features of the reference musical sound in the specific section to generate the control vector representing characteristics of a temporal change in timbre of the reference musical sound in the specific section.
  7.  The musical sound synthesis method according to claim 6, further comprising changing a position of the specific section on a time axis in response to the first instruction.
  8.  The musical sound synthesis method according to claim 1, wherein
     the control vector includes a plurality of elements, and
     generating the control vector includes changing one or more of the plurality of elements in response to a second instruction from the user.
  9.  The musical sound synthesis method according to claim 8, wherein
     the second instruction is an operation on a plurality of operators respectively corresponding to the plurality of elements, and
     in changing the one or more elements, the one or more elements are set in response to an operation on one or more operators, among the plurality of operators, that correspond to the one or more elements.
  10.  The musical sound synthesis method according to claim 1, wherein the first intermediate layer executes a conversion process in which the first parameter set is applied to data input to the first intermediate layer.
  11.  The musical sound synthesis method according to claim 10, wherein
     the first parameter set includes a first parameter and a second parameter, and
     the conversion process includes multiplication by the first parameter and addition of the second parameter.
  12.  A musical sound synthesis system comprising:
     a control data acquisition unit that acquires a time series of control data representing conditions of a target musical sound;
     a control vector generation unit that generates, in response to an instruction from a user, a control vector representing characteristics of a temporal change in timbre;
     a control vector processing unit that generates a first parameter set from the control vector; and
     a musical sound synthesis unit that generates a time series of acoustic data representing acoustic features of the target musical sound by processing the time series of the control data with a trained generative model that includes a plurality of base layers and one or more intermediate layers and that has learned a relationship between conditions of a musical sound and acoustic features of the musical sound,
     wherein a first intermediate layer of the one or more intermediate layers executes processing in which the first parameter set is applied to data input to the first intermediate layer, and outputs the data after the application to a next layer.
  13.  A program that causes a computer system to function as:
     a control data acquisition unit that acquires a time series of control data representing conditions of a target musical sound;
     a control vector generation unit that generates, in response to an instruction from a user, a control vector representing characteristics of a temporal change in timbre;
     a control vector processing unit that generates a first parameter set from the control vector; and
     a musical sound synthesis unit that generates a time series of acoustic data representing acoustic features of the target musical sound by processing the time series of the control data with a trained generative model that includes a plurality of base layers and one or more intermediate layers and that has learned a relationship between conditions of a musical sound and acoustic features of the musical sound,
     wherein a first intermediate layer of the one or more intermediate layers executes processing in which the first parameter set is applied to data input to the first intermediate layer, and outputs the data after the application to a next layer.
PCT/JP2023/030522 2022-10-25 2023-08-24 Musical sound synthesis method, musical sound synthesis system, and program WO2024089995A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2022170758A JP2024062724A (en) 2022-10-25 Musical sound synthesis method, music sound synthesis system and program
JP2022-170758 2022-10-25

Publications (1)

Publication Number Publication Date
WO2024089995A1 true WO2024089995A1 (en) 2024-05-02

Family

ID=90830448

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2023/030522 WO2024089995A1 (en) 2022-10-25 2023-08-24 Musical sound synthesis method, musical sound synthesis system, and program

Country Status (1)

Country Link
WO (1) WO2024089995A1 (en)

Similar Documents

Publication Publication Date Title
CN109952609B (en) Sound synthesizing method
KR20150016225A (en) Automatic conversion of speech into song, rap or other audible expression having target meter or rhythm
US7750229B2 (en) Sound synthesis by combining a slowly varying underlying spectrum, pitch and loudness with quicker varying spectral, pitch and loudness fluctuations
Lindemann Music synthesis with reconstructive phrase modeling
JP6733644B2 (en) Speech synthesis method, speech synthesis system and program
JP6821970B2 (en) Speech synthesizer and speech synthesizer
WO2020171033A1 (en) Sound signal synthesis method, generative model training method, sound signal synthesis system, and program
CN105957515A (en) Voice Synthesis Method, Voice Synthesis Device, Medium for Storing Voice Synthesis Program
WO2019107378A1 (en) Voice synthesis method, voice synthesis device, and program
WO2020095950A1 (en) Information processing method and information processing system
JP2016509384A (en) Acousto-visual acquisition and sharing framework with coordinated, user-selectable audio and video effects filters
CN111696498A (en) Keyboard musical instrument and computer-implemented method of keyboard musical instrument
JP2018077283A (en) Speech synthesis method
CN105719640B (en) Speech synthesizing device and speech synthesizing method
WO2024089995A1 (en) Musical sound synthesis method, musical sound synthesis system, and program
JP6737320B2 (en) Sound processing method, sound processing system and program
JP2003345400A (en) Method, device, and program for pitch conversion
JP2024062724A (en) Musical sound synthesis method, music sound synthesis system and program
EP2634769B1 (en) Sound synthesizing apparatus and sound synthesizing method
JP6834370B2 (en) Speech synthesis method
WO2022074754A1 (en) Information processing method, information processing system, and program
JP2018077280A (en) Speech synthesis method
WO2017164216A1 (en) Acoustic processing method and acoustic processing device
WO2022074753A1 (en) Information processing method, information processing system, and program
JP6822075B2 (en) Speech synthesis method