WO2023171522A1 - Sound generation method, sound generation system, and program - Google Patents

Sound generation method, sound generation system, and program Download PDF

Info

Publication number
WO2023171522A1
WO2023171522A1 PCT/JP2023/007783
Authority
WO
WIPO (PCT)
Prior art keywords
string
data string
control data
note
acoustic
Prior art date
Application number
PCT/JP2023/007783
Other languages
French (fr)
Japanese (ja)
Inventor
方成 西村
Original Assignee
Yamaha Corporation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yamaha Corporation
Publication of WO2023171522A1 publication Critical patent/WO2023171522A1/en

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H 1/00 Details of electrophonic musical instruments

Definitions

  • the present disclosure relates to a technique for generating an acoustic data string representing musical instrument sounds.
  • Non-Patent Document 1 discloses a technique for generating a synthesized sound corresponding to a string of musical notes using a trained generative model.
  • one aspect of the present disclosure aims to generate an acoustic data string of musical instrument sounds having various acoustic characteristics.
  • A sound generation method according to one aspect of the present disclosure acquires a first control data string representing the characteristics of a note string and a second control data string representing the characteristics of a text corresponding to the note string, and processes the first control data string and the second control data string with a trained first generative model, thereby generating an acoustic data string representing the instrument sound of the note string having acoustic characteristics corresponding to the characteristics of the text represented by the second control data string.
  • A sound generation system according to one aspect of the present disclosure includes: a control data string acquisition unit that acquires a first control data string representing the characteristics of a note string and a second control data string representing the characteristics of a text corresponding to the note string; and an acoustic data string generation unit that processes the first control data string and the second control data string with a trained first generative model, thereby generating an acoustic data string representing the instrument sound of the note string having acoustic characteristics corresponding to the characteristics of the text represented by the second control data string.
  • A program according to one aspect of the present disclosure causes a computer system to function as: a control data string acquisition unit that acquires a first control data string representing the characteristics of a note string and a second control data string representing the characteristics of a text corresponding to the note string; and an acoustic data string generation unit that processes the first control data string and the second control data string with a trained first generative model, thereby generating an acoustic data string representing the instrument sound of the note string having acoustic characteristics corresponding to the characteristics of the text represented by the second control data string.
  • FIG. 1 is a block diagram illustrating the configuration of an information system in a first embodiment.
  • FIG. 2 is a block diagram illustrating the functional configuration of a sound generation system.
  • FIG. 3 is an explanatory diagram of the operation of a control data string acquisition unit.
  • FIG. 4 is a block diagram illustrating the configuration of a second generation unit.
  • FIG. 5 is a flowchart illustrating the detailed procedure of synthesis processing.
  • FIG. 6 is a block diagram illustrating the functional configuration of a machine learning system.
  • FIG. 7 is a flowchart illustrating the detailed steps of learning processing.
  • FIG. 8 is an explanatory diagram of the operation of the control data string acquisition unit in a second embodiment.
  • FIG. 9 is a schematic diagram of phoneme data.
  • FIG. 10 is a schematic diagram of a second control data string in a third embodiment.
  • FIG. 11 is an explanatory diagram of a generative model in a modified example.
  • FIG. 12 is a block diagram illustrating the functional configuration of a sound generation system in a modified example.
  • FIG. 13 is a block diagram illustrating the functional configuration of a sound generation system in a modified example.
  • FIG. 14 is an explanatory diagram of the operation of a control data string acquisition unit in a modified example.
  • FIG. 1 is a block diagram illustrating the configuration of an information system 100 according to a first embodiment.
  • the information system 100 includes a sound generation system 10 and a machine learning system 20.
  • the sound generation system 10 and the machine learning system 20 communicate with each other via a communication network 200 such as the Internet, for example.
  • the sound generation system 10 is a computer system that generates performance sounds (hereinafter referred to as "target sounds") of a specific musical piece.
  • the target sound in the first embodiment is an instrument sound having a musical instrument tone.
  • the sound generation system 10 includes a control device 11, a storage device 12, a communication device 13, an operating device 14, and a sound emitting device 15.
  • The sound generation system 10 is realized by, for example, an information terminal such as a smartphone, a tablet terminal, or a personal computer. Note that the sound generation system 10 may be realized not only as a single device but also as a plurality of devices configured separately from each other.
  • the control device 11 is composed of one or more processors that control each element of the sound generation system 10.
  • The control device 11 is configured by one or more types of processors, such as a CPU (Central Processing Unit), GPU (Graphics Processing Unit), SPU (Sound Processing Unit), DSP (Digital Signal Processor), FPGA (Field Programmable Gate Array), or ASIC (Application Specific Integrated Circuit).
  • the control device 11 generates an acoustic signal A representing the waveform of the target sound.
  • the storage device 12 is one or more memories that store programs executed by the control device 11 and various data used by the control device 11.
  • the storage device 12 is composed of a known recording medium such as a magnetic recording medium or a semiconductor recording medium.
  • The storage device 12 may be configured by a combination of multiple types of recording media. Note that a portable recording medium that can be attached to and detached from the sound generation system 10, or a recording medium that the control device 11 can access via the communication network 200 (for example, cloud storage), may also be used as the storage device 12.
  • the storage device 12 stores music data D representing music.
  • the music data D includes musical score data G and a word string T.
  • Music score data G specifies the time series of notes that make up the music piece. Specifically, the musical score data G specifies a pitch and a sounding period for each of a plurality of notes of a song. The sound production period is specified by, for example, the starting point and duration of the note.
  • the word string T specifies text corresponding to a song. Specifically, the word string T specifies one or more characters for each of a plurality of musical notes in a song.
  • a word string T is composed of a plurality of characters corresponding to different musical notes.
  • a music file compliant with the MIDI (Musical Instrument Digital Interface) standard is used as the music data D.
  • the music data D may specify information such as performance symbols that represent musical expressions.
  • the communication device 13 communicates with the machine learning system 20 via the communication network 200. Note that a communication device 13 separate from the sound generation system 10 may be connected to the sound generation system 10 by wire or wirelessly.
  • the operating device 14 is an input device that accepts operations by the user. For example, an operator operated by a user or a touch panel that detects a touch by a user is used as the operating device 14.
  • the sound emitting device 15 reproduces the target sound represented by the acoustic signal A.
  • the sound emitting device 15 is, for example, a speaker or headphones. Note that a D/A converter that converts the audio signal A from digital to analog and an amplifier that amplifies the audio signal A are not shown for convenience. Further, a sound emitting device 15 that is separate from the sound generation system 10 may be connected to the sound generation system 10 by wire or wirelessly.
  • FIG. 2 is a block diagram illustrating the functional configuration of the sound generation system 10.
  • By executing the program stored in the storage device 12, the control device 11 realizes a plurality of functions (control data string acquisition unit 30, acoustic data string generation unit 33, and signal generation unit 34) for generating the acoustic signal A.
  • FIG. 3 is an explanatory diagram of the operation of the control data string acquisition section 30.
  • the control data string acquisition unit 30 obtains a first control data string X and a second control data string Y. Specifically, the control data string acquisition unit 30 obtains the first control data string X and the second control data string Y in each of a plurality of unit periods U on the time axis.
  • Each unit period U is a period (the hop size of a frame window) that is sufficiently short compared to the duration of each note of the song. For example, the hop size is 2-20 ms and the window size is 20-60 ms, the window size being 2-20 times the hop size (that is, the window is longer than the hop).
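  • As a minimal illustration of the frame layout described above, the following sketch computes per-unit-period frame boundaries; the sampling rate and the specific hop and window sizes are assumed values chosen within the ranges mentioned, not values mandated by the disclosure.

```python
SAMPLE_RATE = 48_000       # assumed sampling rate
HOP_SIZE_SEC = 0.005       # 5 ms unit period U (within the 2-20 ms range above)
WINDOW_SIZE_SEC = 0.040    # 40 ms frame window (within the 20-60 ms range above)

hop = int(HOP_SIZE_SEC * SAMPLE_RATE)        # samples per unit period U
window = int(WINDOW_SIZE_SEC * SAMPLE_RATE)  # samples per frame window

def frame_bounds(num_samples: int):
    """Yield (start, end) sample indices of each frame window, one per unit period U."""
    for start in range(0, num_samples - window + 1, hop):
        yield start, start + window

# Example: the number of unit periods U in a 3-second excerpt.
print(sum(1 for _ in frame_bounds(3 * SAMPLE_RATE)))
```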
  • the control data string acquisition unit 30 of the first embodiment includes a first generation unit 31 and a second generation unit 32.
  • the first generation unit 31 generates the first control data X from the note data string N for each unit period U.
  • the musical note data string N used for generation is a portion of the musical score data G that corresponds to each unit period U.
  • The note data string N corresponding to any one unit period U is the portion of the note data of the music data D that includes the note data of the note containing that unit period U (hereinafter referred to as the "target note"). That is, the note data string N specifies a note string of the music data D that includes the target note and at least one of the preceding note and the following note.
  • the individual first control data X is data in any format that represents the characteristics of the note string specified by the note data string N.
  • The first control data X in any one unit period U is information indicating the characteristics of the target note, that is, of the note among the plurality of notes of the music piece whose note data covers that unit period U.
  • The characteristics indicated by the first control data string X include the characteristics of the note containing the unit period (for example, pitch and, optionally, duration).
  • The first control data string X also includes information regarding notes other than the target note.
  • Specifically, the first control data string X includes characteristics (for example, pitch) indicated by the note data of at least one of the notes immediately before and after the note containing the unit period.
  • The first control data string X may also include the pitch difference between the target note and the note immediately before or after it. Furthermore, if the position before or after the target note is not a note but a rest, the characteristics of the rest may be included instead of those of a note.
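  • The following is a minimal sketch of how the first control data X of one unit period might be assembled from the target note and its neighbors; the field layout, the rest-handling sentinel, and the class names are illustrative assumptions rather than the patent's exact encoding.

```python
from dataclasses import dataclass
from typing import Optional
import numpy as np

@dataclass
class Note:
    pitch: int        # MIDI note number
    duration: float   # duration in seconds

REST_PITCH = 0  # assumed sentinel used when the neighboring position is a rest

def first_control_data(prev: Optional[Note], target: Note, nxt: Optional[Note]) -> np.ndarray:
    """Assemble one first-control-data vector X for a unit period inside the target note."""
    prev_pitch = prev.pitch if prev else REST_PITCH
    next_pitch = nxt.pitch if nxt else REST_PITCH
    return np.array([
        target.pitch,               # pitch of the target note
        target.duration,            # optional duration of the target note
        target.pitch - prev_pitch,  # pitch difference to the preceding note (or rest)
        next_pitch - target.pitch,  # pitch difference to the following note (or rest)
    ], dtype=np.float32)

x = first_control_data(Note(60, 0.5), Note(64, 0.25), None)  # last note of a phrase
```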
  • the first generation unit 31 generates the first control data string X by performing predetermined arithmetic processing on the note data string N.
  • the first generation unit 31 may generate the first control data string X using a generation model configured with a deep neural network (DNN) or the like.
  • the generation model is a statistical estimation model in which the relationship between the musical note data string N and the first control data string X is learned by machine learning.
  • the first control data string X is data that specifies the musical conditions of the target sound that the sound generation system 10 should generate.
  • The second generation unit 32 generates, from the word string T, the second control data Y required for the current unit period U, either in synchronization with the unit period U or in advance of the progression of the unit periods.
  • The second control data Y of each unit period U is data in any format that represents the characteristics of the phrase in the word string T that corresponds to that unit period U.
  • the second control data Y includes a word vector V of the word included in the word string T.
  • the phrase vector V is a vector representing the position of each phrase in the semantic space. The closer the meanings of multiple words are, the closer the positions of the word vectors V of those words are in the semantic space.
  • the phrase represented by the phrase vector V is composed of one or more words. That is, the phrase vector V is data representing the characteristics of one word or one phrase (time series of a plurality of words) in the word string T.
  • FIG. 4 is a block diagram illustrating the configuration of the second generation unit 32.
  • the second generation section 32 includes a language analysis section 321 and an information generation section 322.
  • The language analysis unit 321 divides the word string T into a plurality of words by natural language processing such as morphological analysis.
  • the language analysis unit 321 sequentially generates phrase data Q.
  • the phrase data Q is data that identifies a phrase made up of one or more words in the word string T, or data that represents a character string of the phrase.
  • the information generation unit 322 generates a phrase vector sequence V for the phrase represented by the phrase data Q. As illustrated in FIG. 3, in each unit period U within the period corresponding to one word in the song, the word vector V of the word is repeatedly used as the second control data Y. Note that a zero vector is generated as the second control data Y in each unit period U within a period in which a musical note or word string T is not set.
  • the generation model Ma is used to generate the phrase vector sequence V by the information generation unit 322.
  • the generative model Ma is a trained model in which the latent relationship between the word data Q as an input and the word vector sequence V in the semantic space as an output is learned by machine learning.
  • the generative model Ma outputs a word vector sequence V in response to input word data Q.
  • the information generation unit 322 generates a word vector V of each word by processing the word data Q using the trained generative model Ma, and outputs it as a word vector sequence V in the corresponding unit period.
  • the second generation unit 32 generates a phrase vector sequence V representing the words included in the word sequence T as the second control data sequence Y using the generation model Ma. According to the above configuration, the second control data string Y can be easily generated using the generation model Ma.
  • the generative model Ma is an example of a "second generative model.”
  • The generative model Ma of the first embodiment is, for example, a statistical estimation model such as a deep neural network. To generate the phrase vector V of a phrase (that is, a sentence composed of multiple words), for example, the model described in Quoc Le, Tomas Mikolov, "Distributed Representations of Sentences and Documents," CoRR, abs/1405.4053, p.1-9, 2014 (Doc2Vec) is used.
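  • As a minimal sketch of how a phrase vector V could be obtained with a Doc2Vec-style model, the example below uses the gensim library as one possible implementation; the tiny corpus, vector size, and tokenization are placeholders, and in practice the model would be trained on a large text corpus.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Placeholder corpus for illustration only.
corpus = [
    TaggedDocument(words=["gentle", "morning", "breeze"], tags=["0"]),
    TaggedDocument(words=["stormy", "night", "sea"], tags=["1"]),
]
model_ma = Doc2Vec(corpus, vector_size=32, min_count=1, epochs=50)

# Phrase data Q (a tokenized phrase of the word string T) -> phrase vector V.
phrase_tokens = ["gentle", "breeze"]
phrase_vector_v = model_ma.infer_vector(phrase_tokens)  # reused as second control data Y
```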
  • control data C is generated for each unit period U through the above processing by the control data string acquisition unit 30.
  • The control data C for each unit period U includes the first control data X generated by the first generation unit 31 for that unit period U and the second control data Y generated by the second generation unit 32 for that unit period U.
  • the control data C is, for example, data obtained by concatenating the first control data X and the second control data Y.
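  • A minimal sketch of that concatenation for one unit period U follows; the dimensions are illustrative.

```python
import numpy as np

def control_data(x: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Control data C for one unit period U: concatenation of first and second control data."""
    return np.concatenate([x, y]).astype(np.float32)

x = np.zeros(4, dtype=np.float32)   # first control data X (note-string features)
y = np.zeros(32, dtype=np.float32)  # second control data Y (e.g., a phrase vector V)
c = control_data(x, y)              # supplied to the generative model Mb for this unit period
```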
  • The performance of a musical instrument is basically defined by the string of notes on the musical score.
  • However, it was confirmed that, even when the note strings played by performers on musical instruments are the same, if the text added to the note strings differs, the musical expression of the instrument sounds produced by the performance also differs.
  • That is, while the musical expression of singing voices naturally depends on the text (that is, the lyrics), the musical expression of instrument sounds, which is generally assumed to be unaffected by text, in fact also tends to depend on the text.
  • In view of the above tendency, in the first embodiment, both the first control data string X representing the characteristics of the note string and the second control data string Y representing the characteristics of the word string T corresponding to the note string are used to generate the acoustic signal A of the target sound.
  • the acoustic data string generation unit 33 in FIG. 2 generates an acoustic data string Z using the control data string C (first control data string X and second control data string Y).
  • the acoustic data string Z is data in any format representing the target sound.
  • The acoustic data string Z represents the instrument sound of the note string represented by the first control data string X, with acoustic characteristics corresponding to the word string T represented by the second control data string Y. That is, an instrument sound such as would be produced if a performer played the note string on an instrument with the word string T in mind is generated as the target sound.
  • the acoustic data string Z is data representing the envelope of the frequency spectrum of the target sound.
  • From the control data C of each unit period U, the acoustic data Z corresponding to that unit period U is generated.
  • Each piece of acoustic data Z corresponds to a waveform sample sequence for one frame window longer than a unit period.
  • the acquisition of the control data C by the control data string acquisition section 30 and the generation of the acoustic data Z by the acoustic data string generation section 33 are executed every unit period U.
  • the generation model Mb is used to generate the acoustic data string Z by the acoustic data string generation unit 33.
  • the generative model Mb estimates acoustic data Z for each unit period according to the control data C for that unit period.
  • the generative model Mb is a learned model in which the latent relationship between the control data string C as an input and the acoustic data string Z as an output is learned by machine learning. That is, the generative model Mb outputs the acoustic data string Z that is statistically valid for the control data string C from the viewpoint of the relationship.
  • the acoustic data string generation unit 33 generates acoustic data Z for each unit period U by processing the control data C using the generation model Mb.
  • The generative model Mb is realized by a combination of a program that causes the control device 11 to execute the calculation that generates the acoustic data string Z from the control data string C, and a plurality of variables (weights and biases) applied to that calculation.
  • a program and a plurality of variables that realize the generative model Mb are stored in the storage device 12.
  • a plurality of variables of the generative model Mb are set in advance by machine learning.
  • the generative model Mb is an example of a "first generative model.”
  • the generative model Mb is composed of, for example, a deep neural network.
  • a deep neural network such as a recurrent neural network (RNN) or a convolutional neural network (CNN) is used as the generative model Mb.
  • the generative model Mb may be configured by a combination of multiple types of deep neural networks. Additionally, additional elements such as long short-term memory (LSTM) or attention may be included in the generative model Mb.
  • the signal generation unit 34 generates the acoustic signal A of the target sound from the time series of the acoustic data string Z.
  • The signal generation unit 34 converts each acoustic data Z into a time-domain waveform signal by a calculation including, for example, an inverse discrete Fourier transform, and generates the acoustic signal A by concatenating (overlap-adding) the waveform signals of successive unit periods U.
  • Alternatively, the signal generation unit 34 may generate the acoustic signal A from the acoustic data string Z using, for example, a deep neural network (a so-called neural vocoder) that has learned the relationship between the acoustic data string Z and each sample of the acoustic signal A.
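  • A minimal sketch of the overlap-add reconstruction described above follows, assuming each acoustic data Z has already been converted into a windowed time-domain frame (for example by an inverse discrete Fourier transform, which is outside this sketch).

```python
import numpy as np

def overlap_add(frames: np.ndarray, hop: int) -> np.ndarray:
    """Concatenate per-unit-period waveform frames into one acoustic signal A.

    frames has shape (num_unit_periods, window_size); each frame spans a time
    longer than the unit period, so successive frames overlap by window - hop samples.
    """
    num_frames, window = frames.shape
    out = np.zeros((num_frames - 1) * hop + window)
    for i, frame in enumerate(frames):
        start = i * hop
        out[start:start + window] += frame  # overlapping regions are summed
    return out

signal_a = overlap_add(np.random.randn(100, 1920), hop=240)  # illustrative sizes
```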
  • the target sound is reproduced from the sound emitting device 15 by supplying the acoustic signal A generated by the signal generating unit 34 to the sound emitting device 15.
  • FIG. 5 is a flowchart illustrating the detailed procedure of the process (hereinafter referred to as "synthesis process") Sa in which the control device 11 generates the acoustic signal A.
  • The synthesis process Sa is executed in each of the plurality of unit periods U.
  • When the synthesis process Sa is started, the control device 11 (control data string acquisition unit 30) acquires the music data D from the storage device 12 (Sa1). The control device 11 (first generation unit 31) generates the first control data X for the unit period U from the note data string N corresponding to that unit period U of the musical score data G of the music data D (Sa2). Further, the control device 11 (second generation unit 32) generates the second control data Y for the unit period U from the word string T of the music data D (Sa3). Note that the order of the generation of the first control data X (Sa2) and the generation of the second control data Y (Sa3) may be reversed.
  • The control device 11 (acoustic data string generation unit 33) generates the acoustic data Z for the unit period U by processing the control data C, which includes the first control data X and the second control data Y, using the generative model Mb (Sa4).
  • the control device 11 (signal generation unit 34) generates the acoustic signal A of the unit period U from the acoustic data Z (Sa5). From the acoustic data string Z of each unit period, a signal that spans a time longer than the unit period is generated, and by overlapping and adding these signals, an acoustic signal A that spans a plurality of unit periods is generated.
  • the time difference (hop size) between the previous and subsequent frame windows corresponds to a unit period.
  • the control device 11 reproduces the target sound by supplying the acoustic signal A to the sound emitting device 15 (Sa6).
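  • The following is a minimal sketch of the per-unit-period synthesis process Sa with the individual steps abstracted behind placeholder callables; the function names and dimensions are assumptions for illustration, not the disclosure's API.

```python
import numpy as np

def synthesis_process_sa(num_unit_periods, gen_x, gen_y, model_mb, synthesize, play):
    """Run the synthesis process Sa once per unit period U (steps Sa2-Sa6)."""
    for u in range(num_unit_periods):
        x = gen_x(u)        # Sa2: first control data X from the note data string N
        y = gen_y(u)        # Sa3: second control data Y from the word string T
        z = model_mb(x, y)  # Sa4: acoustic data Z via the generative model Mb
        a = synthesize(z)   # Sa5: waveform segment of the acoustic signal A
        play(a)             # Sa6: supply the segment to the sound emitting device

# Trivial stand-ins so the sketch runs end to end.
synthesis_process_sa(
    num_unit_periods=10,
    gen_x=lambda u: np.zeros(4, dtype=np.float32),
    gen_y=lambda u: np.zeros(32, dtype=np.float32),
    model_mb=lambda x, y: np.zeros(80, dtype=np.float32),
    synthesize=lambda z: np.zeros(240, dtype=np.float32),
    play=lambda a: None,
)
```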
  • As described above, in the first embodiment, the second control data string Y representing the characteristics of the word string T corresponding to the note string is used, in addition to the first control data string X, to generate the acoustic data string Z. Therefore, compared to a configuration in which the acoustic data string Z is generated only from the first control data string X, it is possible to generate an acoustic data string Z of a target sound having a variety of acoustic characteristics that depend on the word string T corresponding to the note string. For example, even if the note data string N is the same, acoustic data strings Z of target sounds having different acoustic characteristics can be generated by changing the word string T.
  • the second control data string Y includes a word vector string V representing words in the word string T. That is, the phrase vector sequence V reflecting the meaning of the word sequence T is used as the second control data sequence Y. Therefore, it is possible to generate the acoustic data string Z of the target sound in which the meanings of the words in the word string T are reflected in the acoustic characteristics.
  • the machine learning system 20 in FIG. 1 is a computer system that establishes a generative model Mb used by the sound generation system 10 by machine learning.
  • the machine learning system 20 includes a control device 21, a storage device 22, and a communication device 23.
  • the control device 21 is composed of one or more processors that control each element of the machine learning system 20.
  • the control device 21 is configured by one or more types of processors such as a CPU, GPU, SPU, DSP, FPGA, or ASIC.
  • the storage device 22 is one or more memories that store programs executed by the control device 21 and various data used by the control device 21.
  • the storage device 22 is composed of a known recording medium such as a magnetic recording medium or a semiconductor recording medium.
  • The storage device 22 may be configured by a combination of multiple types of recording media. Note that a portable recording medium that can be attached to and detached from the machine learning system 20, or a recording medium that the control device 21 can access via the communication network 200 (for example, cloud storage), may also be used as the storage device 22.
  • the communication device 23 communicates with the sound generation system 10 via the communication network 200. Note that a communication device 23 separate from the machine learning system 20 may be connected to the machine learning system 20 by wire or wirelessly.
  • FIG. 6 is an explanatory diagram of the function of the machine learning system 20 to establish the generative model Mb.
  • the storage device 22 stores a plurality of basic data B corresponding to different songs.
  • Each of the plurality of basic data B includes music data D and reference signal R.
  • the music data D is data representing a note string of a specific music piece (hereinafter referred to as "reference music piece") that is played with the waveform represented by the reference signal R.
  • the music data D includes the musical score data G and the word string T, as described above.
  • Musical score data G specifies the time series of notes that constitute the reference music piece.
  • The word string T of the music data D specifies the text corresponding to the reference song.
  • the reference signal R is a signal representing the waveform of the musical instrument sound produced by the musical instrument when the performer plays the reference song while referring to the word string T.
  • the reference signal R is generated by recording the musical instrument sounds produced by the musical instrument under the above circumstances. After recording the reference signal R, the position of the reference signal R on the time axis is adjusted. Therefore, the instrument sound represented by the reference signal R is an instrument sound that has acoustic characteristics according to the word string T.
  • the control device 21 implements a plurality of functions (training data acquisition unit 41, learning processing unit 42) for generating the generative model Mb by executing a program stored in the storage device 22.
  • the training data acquisition unit 41 generates a plurality of training data L from a plurality of basic data B.
  • One piece of training data L is generated for each reference piece of music. Therefore, a plurality of training data L are generated from each of a plurality of basic data B corresponding to different reference songs.
  • the learning processing unit 42 establishes the generative model Mb by machine learning using a plurality of training data L.
  • Each of the plurality of training data L is composed of a combination of a training control data sequence Ct and a training audio data sequence Zt.
  • the control data string Ct is composed of a combination of a first control data string for training Xt and a second control data string for training Yt.
  • the first control data string Xt is an example of a "first training control data string”
  • the second control data string Yt is an example of a "second training control data string.”
  • the acoustic data string Zt is an example of a "training acoustic data string.”
  • For each unit period U, the training data acquisition unit 41 generates the first control data Xt for that unit period U from the note data string Nt.
  • the note data string Nt used to generate the first control data Xt for each unit period U is a part of the note data string of the musical score data G that includes the note data of the target note that includes the unit period U. That is, the note data string Nt includes note data of the target note in the reference music and note data of at least one of the previous note and the subsequent note.
  • the first control data string Xt is data representing the characteristics of the reference note string represented by the note data string Nt.
  • the training data acquisition unit 41 generates the first control data sequence Xt for each unit period U from the musical note data sequence Nt by the same process as the first generation unit 31.
  • the second control data Yt for one unit period U indicates a phrase vector V estimated for a phrase corresponding to the unit period U in the word string T.
  • the training data acquisition unit 41 generates a second control data sequence Yt for each unit period U, which indicates a phrase vector sequence V estimated from the word sequence T, by the same process as the second generation unit 32.
  • the acoustic data Zt of one unit period U represents the waveform of one frame of the reference signal R corresponding to the unit period U.
  • the training data acquisition unit 41 generates an acoustic data sequence Zt from the reference signal R.
  • The acoustic data string Zt represents the waveform of the instrument sound produced by the instrument when the reference note string corresponding to the first control data string Xt is played with reference to the phrases expressed by the second control data string Yt. That is, the acoustic data string Zt is the ground truth of the acoustic data string that the generative model Mb should output in response to the input of the control data string Ct.
  • FIG. 7 is a flowchart of a process (hereinafter referred to as "learning process") Sb in which the control device 21 establishes a generative model Mb by machine learning.
  • the learning process Sb is started in response to an instruction from the operator of the machine learning system 20.
  • the learning processing unit 42 in FIG. 6 is realized by the control device 21 executing the learning process Sb.
  • The control device 21 selects any one of the plurality of training data L (hereinafter referred to as the "selected training data L") (Sb1). As illustrated in FIG. 6, the control device 21 generates an acoustic data string Z by processing the control data string Ct of the selected training data L using an initial or provisional generative model Mb (hereinafter referred to as the "provisional model Mb0") (Sb2).
  • the control device 21 calculates a loss function representing the error between the acoustic data string Z generated by the provisional model Mb0 and the acoustic data string Zt of the selected training data L (Sb3).
  • the control device 21 updates the plurality of variables of the provisional model Mb0 so that the loss function is reduced (ideally minimized) (Sb4). For example, error backpropagation is used to update each variable according to the loss function.
  • the control device 21 determines whether a predetermined termination condition is satisfied (Sb5).
  • the termination condition is that the loss function is less than a predetermined threshold, or that the amount of change in the loss function is less than a predetermined threshold. If the termination condition is not satisfied (Sb5: NO), the control device 21 selects the unselected training data L as the new selected training data L (Sb1). That is, the process of updating a plurality of variables of the provisional model Mb0 (Sb1 to Sb4) is repeated until the end condition is satisfied (Sb5: YES). If the termination condition is satisfied (Sb5: YES), the control device 21 terminates the learning process Sb.
  • the provisional model Mb0 at the time when the termination condition is satisfied is determined as the trained generative model Mb.
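  • A minimal sketch of the learning process Sb as a gradient-descent loop follows, using PyTorch as one possible implementation; the network architecture, loss function, thresholds, and dummy training data are placeholders, not the patent's specification.

```python
import torch
from torch import nn

# Provisional model Mb0: a placeholder network mapping control data C to acoustic data Z.
model_mb0 = nn.Sequential(nn.Linear(36, 128), nn.ReLU(), nn.Linear(128, 80))
optimizer = torch.optim.Adam(model_mb0.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
LOSS_THRESHOLD, MAX_STEPS = 1e-3, 1000  # assumed termination criteria

def training_data():
    """Yield (control data string Ct, acoustic data string Zt) batches; dummy tensors here."""
    while True:
        yield torch.randn(64, 36), torch.randn(64, 80)

for step, (ct, zt) in enumerate(training_data()):  # Sb1: select training data L
    z = model_mb0(ct)                               # Sb2: generate Z with the provisional model
    loss = loss_fn(z, zt)                           # Sb3: loss between generated Z and ground truth Zt
    optimizer.zero_grad()
    loss.backward()                                 # Sb4: update variables by backpropagation
    optimizer.step()
    if loss.item() < LOSS_THRESHOLD or step >= MAX_STEPS:  # Sb5: termination condition
        break
```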
  • the generative model Mb learns the latent relationship between the input control data string Ct and the output acoustic data string Zt. Therefore, the trained generative model Mb outputs a statistically valid acoustic data sequence Z for the unknown control data sequence C from the viewpoint of the relationship.
  • the control device 21 transmits the generation model Mb established through the above processing to the sound generation system 10 from the communication device 23. Specifically, a plurality of variables defining the generative model Mb are sent to the sound generation system 10.
  • the control device 11 of the sound generation system 10 receives the generated model Mb transmitted from the machine learning system 20 through the communication device 13, and stores the generated model Mb in the storage device 12.
  • As in the first embodiment, the control device 11 in the sound generation system 10 of the second embodiment includes a control data string acquisition unit 30 that acquires the control data string C, an acoustic data string generation unit 33 that generates the acoustic data string Z from the control data string C, and a signal generation unit 34 that generates the acoustic signal A from the acoustic data string Z.
  • FIG. 8 is an explanatory diagram of the operation of the control data string acquisition unit 30 in the second embodiment.
  • the first generation unit 31 of the control data string acquisition unit 30 generates the first control data X for each unit period U from the note data string N, similarly to the first embodiment.
  • the function of the second generation unit 32 is different from that in the first embodiment.
  • the second generation unit 32 of the first embodiment generates a phrase vector sequence V representing each phrase of the word sequence T as a second control data sequence Y.
  • The second generation unit 32 of the second embodiment generates phoneme data P representing each phoneme of the word string T and outputs it as the second control data Y for each unit period within the period of that phoneme.
  • Specifically, the second generation unit 32 generates, by analyzing the word string T, phoneme data P indicating the type and period of each phoneme, and outputs the phoneme data P as the second control data Y for each unit period U.
  • the control data string C includes a first control data string X and a second control data string Y.
  • FIG. 9 is a schematic diagram of the phoneme data P.
  • the phoneme data P specifies any of a plurality of types (K types) of phonemes.
  • the phoneme data P is composed of K elements E (E1 to EK) (K is a natural number of 2 or more) corresponding to different types of phonemes.
  • The phoneme data P specifying any one type of phoneme is a one-hot vector in which, among the K elements E1 to EK, the one element E corresponding to that phoneme is set to "1" and the remaining (K-1) elements E are set to "0". Note that a one-cold vector in which the "1" and "0" of each element E are swapped may also be adopted as the phoneme data P.
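  • A minimal sketch of the one-hot phoneme data P described above follows; the phoneme inventory is an illustrative placeholder.

```python
import numpy as np

PHONEMES = ["a", "i", "u", "e", "o", "k", "s", "t", "n", "m"]  # illustrative inventory of K phonemes
K = len(PHONEMES)

def phoneme_data(phoneme: str, one_cold: bool = False) -> np.ndarray:
    """One-hot (or one-cold) vector P with K elements E1..EK for a single phoneme."""
    p = np.zeros(K, dtype=np.float32)
    p[PHONEMES.index(phoneme)] = 1.0
    return 1.0 - p if one_cold else p

p = phoneme_data("a")  # e.g. [1, 0, 0, ...], reused as second control data Y for each unit period
```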
  • the second generation unit 32 estimates the type and duration of each phoneme of the characters at each point in time included in the word string T by phoneme analysis processing, and generates phoneme data P specifying the phoneme. Any known technique may be employed for the phoneme analysis process. As illustrated in FIG. 8, in each unit period U within a period corresponding to one phoneme in a song, phoneme data P indicating the phoneme is repeatedly used as second control data Y. The boundary of the period of each phoneme of the character at each point in time in the word string T is estimated by a statistical model such as HMM (Hidden Markov Model) or SVM (Support Vector Machine).
  • the second control data string Y (phrase vector string V) of the first embodiment reflects the meaning of the word string T, but does not reflect information regarding the pronunciation of the word string T (ie, phoneme).
  • the second control data string Y (phoneme data P) of the second embodiment reflects information regarding the pronunciation of the word string T (ie, phoneme), but does not reflect the meaning of the word string T.
  • the word vector string V and the phoneme data P are comprehensively expressed as data representing the characteristics of the word string T.
  • In the second embodiment, the phrase vector string V of the first embodiment is replaced with the phoneme data P.
  • The processing by which the acoustic data string generation unit 33 generates the acoustic data string Z from the control data string C and the processing by which the signal generation unit 34 generates the acoustic signal A from the acoustic data string Z are the same as in the first embodiment.
  • the synthesis process Sa and the learning process Sb are also the same as in the first embodiment.
  • In the second embodiment as well, the second control data string Y representing the characteristics of the word string T corresponding to the note string is used to generate the acoustic data string Z. Therefore, similarly to the first embodiment, it is possible to generate an acoustic data string Z of a target sound having a variety of acoustic characteristics that depend on the word string T corresponding to the note string.
  • the second control data string Y includes phoneme data P representing phonemes in the word string T. That is, the phoneme data P reflecting the pronunciation of the word string T is used as the second control data string Y.
  • Therefore, it is possible to generate an acoustic data string Z of the target sound in which non-linguistic characteristics (for example, characteristics in the time domain or frequency domain) regarding the pronunciation of the phonemes in the word string T are reflected in the acoustic characteristics.
  • For example, a target sound is generated that gives the impression of the word string T being interpreted as onomatopoeia.
  • FIG. 10 is a schematic diagram of the second control data string Y in the third embodiment.
  • the second control data string Y includes first data Y1 and second data Y2.
  • the first data Y1 corresponds to the second control data string Y in the first embodiment
  • the second data Y2 corresponds to the second control data string Y in the second embodiment.
  • the first data Y1 is a word vector string V representing each word included in the word string T.
  • the word vector V of the word is used as the first data Y1.
  • the second data Y2 is phoneme data P representing each phoneme of the word string T.
  • The phoneme data P of the phoneme is used as the second data Y2.
  • the second control data string Y includes the first data Y1 (phrase vector string V) and the second data Y2 (phoneme data P). Therefore, it is possible to generate the acoustic data string Z of the target sound in which both the meaning of each word in the word string T and the pronunciation of the phoneme in the word string T are reflected.
  • the phoneme data P in the second embodiment is not limited to a vector composed of K elements E1 to EK.
  • a code string (identifier) uniquely assigned to each phoneme may be used as the phoneme data P.
  • In the above embodiments, the acoustic data string Z represents the frequency characteristics of the target sound, but the information expressed by the acoustic data string Z is not limited to the above examples. For example, a form in which the acoustic data string Z represents each sample of the target sound is also assumed. In that form, the time series of the acoustic data Z itself constitutes the acoustic signal A, and the signal generation unit 34 may therefore be omitted.
  • In the above embodiments, the control data string acquisition unit 30 generates the first control data string X and the second control data string Y, but the operation of the control data string acquisition unit 30 is not limited to the above examples.
  • For example, the control data string acquisition unit 30 may receive, via the communication device 13, a first control data string X and a second control data string Y generated by an external device. Further, in a configuration in which the first control data string X and the second control data string Y are stored in the storage device 12, the control data string acquisition unit 30 reads the first control data string X and the second control data string Y from the storage device 12.
  • As understood from the above examples, "acquisition" by the control data string acquisition unit 30 encompasses any operation that obtains the first control data string X and the second control data string Y, such as generation, reception, and reading.
  • Similarly, "acquisition" of the first control data string Xt and the second control data string Yt by the training data acquisition unit 41 encompasses any operation that obtains them (for example, generation, reception, and reading).
  • In the above embodiments, the control data string C, which is a combination of the first control data string X and the second control data string Y, is supplied to the generative model Mb.
  • the input format of the first control data string X and the second control data string Y is not limited to the above example.
  • For example, the generative model Mb may be composed of a first part Mb1 and a second part Mb2.
  • The first part Mb1 is a part composed of the input layer and part of the intermediate layers of the generative model Mb.
  • The second part Mb2 is a part composed of the remaining intermediate layers and the output layer of the generative model Mb.
  • The first control data string X may be supplied to the first part Mb1 (the input layer), and the second control data string Y may be supplied to the second part Mb2 together with the data output from the first part Mb1.
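  • A minimal sketch of such a split model in PyTorch follows, with the first control data string X entering the first part Mb1 and the second control data string Y joining an intermediate layer of the second part Mb2; layer types and sizes are illustrative assumptions.

```python
import torch
from torch import nn

class SplitMb(nn.Module):
    """Generative model Mb with X entering the input layer and Y entering mid-network."""

    def __init__(self, x_dim=4, y_dim=32, hidden=128, z_dim=80):
        super().__init__()
        self.mb1 = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU())  # first part Mb1
        self.mb2 = nn.Sequential(                                      # second part Mb2
            nn.Linear(hidden + y_dim, hidden), nn.ReLU(), nn.Linear(hidden, z_dim))

    def forward(self, x, y):
        h = self.mb1(x)                             # intermediate representation from X only
        return self.mb2(torch.cat([h, y], dim=-1))  # Y is injected at the intermediate layer

z = SplitMb()(torch.zeros(1, 4), torch.zeros(1, 32))  # acoustic data Z for one unit period
```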
  • That is, the concatenation of the first control data string X and the second control data string Y is not essential in the present disclosure.
  • a plurality of generative models Mb corresponding to different musical instruments may be selectively used.
  • the generative model Mb corresponding to one type of musical instrument is a learned model trained using the reference signal R of the musical instrument sound produced by the musical instrument. Therefore, the generation model Mb corresponding to each musical instrument outputs an acoustic data string Z representing the musical instrument sound of the musical instrument.
  • the user selects one of the plurality of musical instruments by operating the operating device 14.
  • the musical instrument data ⁇ in FIG. 12 is data specifying the musical instrument selected by the user.
  • The acoustic data string generation unit 33 selects, from among the plurality of generative models Mb, the generative model Mb corresponding to the instrument specified by the musical instrument data ⁇, and generates the acoustic data string Z by processing the control data string C with that generative model Mb. According to the above configuration, it is possible to generate a target sound having a timbre corresponding to any one of a plurality of types of musical instruments.
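  • A minimal sketch of that selection follows; the instrument identifiers and the lookup structure are assumptions, and the lambdas merely stand in for trained generative models Mb.

```python
models_mb = {
    "piano": lambda c: f"Z from the piano model for {len(c)}-dimensional C",
    "violin": lambda c: f"Z from the violin model for {len(c)}-dimensional C",
}

def generate_acoustic_data(instrument_id: str, control_data_c):
    """Select the generative model Mb matching the instrument data, then process C with it."""
    model_mb = models_mb[instrument_id]
    return model_mb(control_data_c)

print(generate_acoustic_data("violin", [0.0] * 36))
```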
  • a control data string C including musical instrument data ⁇ in addition to the first control data string X and the second control data string Y may be input to one generation model Mb.
  • the generative model Mb in FIG. 13 is established by machine learning using a plurality of reference signals R corresponding to different musical instruments.
  • the training data L includes musical instrument data ⁇ specifying the musical instrument corresponding to the reference signal R, in addition to the first control data sequence Xt and the second control data sequence Yt for training. Therefore, the target sound represented by the acoustic data string Z is an instrument sound having the timbre of the instrument specified by the instrument data ⁇ .
  • the note data string N is generated from the music data D stored in advance in the storage device 12, but note data strings N sequentially supplied from the performance device may also be used.
  • the performance device is an input device such as a MIDI keyboard that accepts musical performances by the user, and sequentially outputs a string of musical note data N according to the musical performance by the user.
  • the sound generation system 10 generates a sound data string Z using a musical note data string N supplied from a performance device.
  • For example, the above-described synthesis process Sa may be executed in real time in parallel with the user's performance on the performance device.
  • the second control data string Y and the audio data string Z may be generated in parallel with the user's operation on the performance device.
  • In the first embodiment, one phrase vector V is generated from one piece of phrase data Q, but a plurality of phrase vectors V may be generated from one phrase data string Q.
  • FIG. 14 is an explanatory diagram of the operation of the control data string acquisition section 30 in this modification.
  • The phrase data string Q in FIG. 14 is data representing the word strings obtained by dividing the word string T into phrases. That is, each piece of phrase data Q identifies a string of one or more words corresponding to one phrase.
  • a phrase is a section of a song divided according to musical or semantic unity. For example, each phrase is specified in the music data D. However, each phrase of the song may be defined by analyzing the song data D.
  • the language analysis unit 321 generates the phrase data string Q by dividing the word string T into phrases.
  • The information generation unit 322 processes each phrase data string Q to generate a word vector V for each word included in the corresponding word string. As illustrated in FIG. 14, when one phrase data string Q contains multiple words, the context of the word string is analyzed, and the word vectors V corresponding to the respective words are generated from that single phrase data string Q. Similar to the first embodiment, the information generation unit 322 uses a generative model Ma capable of interpreting context to generate the word vector string V.
  • the generative model Ma is a trained model that has learned the relationship between the context of the word string indicated by the word data string Q and the word vector string V indicating the meaning of each word in that context.
  • the generation model Ma generates a phrase vector sequence V indicating the meaning of each word included in the phrase string indicated by the phrase data Q in the phrase string.
  • a natural language processing model such as BERT (Bidirectional Encoder Representations from Transformers) is used as the generative model Ma.
  • In FIG. 14, a phrase data string Q1 that specifies word string #1 consisting of word #1a and word #1b is illustrated.
  • In response to the input of the single phrase data string Q1, the generative model Ma generates a word vector V corresponding to word #1a in word string #1 and a word vector V corresponding to word #1b in word string #1.
  • The phrase data string Q2 in FIG. 14 specifies word string #2 consisting of word #2a, word #2b, and word #2c.
  • In response to the input of the single phrase data string Q2, the generative model Ma generates a word vector V corresponding to word #2a, a word vector V corresponding to word #2b, and a word vector V corresponding to word #2c in word string #2. Similar to the first embodiment, in each unit period U corresponding to one word, the word vector V of that word is repeatedly used as the second control data Y. In this modification, by interpreting the context of a word string, word vectors V that more accurately indicate the meaning of each word in that word string are generated.
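  • A minimal sketch of obtaining context-dependent word vectors V for one phrase with a BERT-style model follows, using the Hugging Face transformers library as one possible implementation; the checkpoint name, the example phrase, and the mean pooling over subword tokens are illustrative choices.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model_ma = AutoModel.from_pretrained("bert-base-uncased")

phrase = "cold winter wind"  # one phrase (phrase data Q) of the word string T
inputs = tokenizer(phrase, return_tensors="pt")
with torch.no_grad():
    hidden = model_ma(**inputs).last_hidden_state[0]  # one contextual vector per subword token

# Pool subword tokens back to words so that each word obtains one word vector V.
word_to_vectors = {}
for idx, word_id in enumerate(inputs.word_ids(0)):
    if word_id is not None:  # skip special tokens such as [CLS] and [SEP]
        word_to_vectors.setdefault(word_id, []).append(hidden[idx])
word_vector_string_v = [torch.stack(v).mean(dim=0) for _, v in sorted(word_to_vectors.items())]
```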
  • the generative model M (Ma, Mb, Mc) is not limited to a deep neural network.
  • any format and type of statistical model such as HMM (Hidden Markov Model) or SVM (Support Vector Machine) may be used as the generative model M (Ma, Mb, Mc).
  • In each of the above embodiments, the machine learning system 20 establishes the generative model Mb, but the functions for establishing the generative model Mb (the training data acquisition unit 41 and the learning processing unit 42) may instead be installed in the sound generation system 10. Further, the sound generation system 10 may be equipped with the function of establishing the generative model Ma or Mc.
  • the sound generation system 10 may be realized by a server device that communicates with an information device such as a smartphone or a tablet terminal.
  • the sound generation system 10 receives music data D from an information device, and generates an audio signal A through a synthesis process Sa to which the music data D is applied.
  • the sound generation system 10 transmits the sound signal A generated by the synthesis process Sa to the information device.
  • In a configuration in which the signal generation unit 34 is installed in the information device, the time series of the acoustic data Z is transmitted to the information device. That is, the signal generation unit 34 is omitted from the sound generation system 10.
  • As described above, the functions of the sound generation system 10 are realized by the cooperation of the one or more processors constituting the control device 11 and the program stored in the storage device 12.
  • Similarly, the functions of the machine learning system 20 are realized by the cooperation of the one or more processors constituting the control device 21 and the program stored in the storage device 22.
  • the programs exemplified above may be provided in a form stored in a computer-readable recording medium and installed on a computer.
  • The recording medium is, for example, a non-transitory recording medium; an optical recording medium (optical disc) such as a CD-ROM is a good example, but recording media of any known form, such as semiconductor recording media and magnetic recording media, are also included.
  • the non-transitory recording medium includes any recording medium excluding transitory, propagating signals, and does not exclude volatile recording media.
  • the recording medium that stores the program in the distribution device corresponds to the above-mentioned non-transitory recording medium.
  • A sound generation method according to one aspect of the present disclosure acquires a first control data string representing characteristics of a note string and a second control data string representing characteristics of a text corresponding to the note string, and generates, by processing the first control data string and the second control data string with a trained first generative model, an acoustic data string representing the instrument sound of the note string having acoustic characteristics corresponding to the characteristics of the text represented by the second control data string.
  • the second control data string representing the characteristics of the text corresponding to the note string is used to generate the acoustic data string. Therefore, compared to a configuration in which an acoustic data string is generated only from the first control data string, it is possible to generate an acoustic data string of musical instrument sounds having various acoustic characteristics depending on the text corresponding to the note string.
  • the "first control data string” is data (first control data) in any format that represents the characteristics of a note string, and is generated from, for example, a note data string representing a note string. Further, the first control data string may be generated from a musical note data string generated in real time in response to an operation on an input device such as an electronic musical instrument.
  • the "first control data string” can also be referred to as data specifying the conditions of the musical instrument sound to be synthesized.
  • The "first control data string" specifies various conditions regarding each note constituting the note string, such as the pitch or duration of each note, or the relationship between the pitch of one note and the pitches of other notes located around that note.
  • “Instrument sound” is a musical sound generated from a musical instrument during performance.
  • the "first generation model” is a learned model that has learned the relationship between the first control data string, the second control data string, and the acoustic data string by machine learning.
  • a plurality of training data are used for machine learning of the first generative model.
  • Each training data includes a set of a first training control data string and a second training control data string, and a training acoustic data string.
  • the first training control data string is data representing the characteristics of the reference note string
  • the second training control data string is data representing the characteristics of the text corresponding to the reference note string.
  • the training audio data string represents musical instrument sounds produced by a performance based on the note string corresponding to the first training control data string and the text corresponding to the second training control data string.
  • For example, various statistical estimation models such as a deep neural network (DNN), a hidden Markov model (HMM), or a support vector machine (SVM) are used as the "first generative model".
  • the form of input of the first control data string and the second control data string to the first generative model is arbitrary.
  • input data including a first control data string and a second control data string is input to the first generative model.
  • However, for example, a form is also assumed in which the first control data string is input to the input layer and the second control data string is input to an intermediate layer. That is, concatenating the first control data string and the second control data string is not essential.
  • the "acoustic data string” is data (acoustic data) in any format that represents musical instrument sounds.
  • data representing acoustic characteristics such as an intensity spectrum, a mel spectrum, and MFCC (Mel-Frequency Cepstrum Coefficients) is an example of an “acoustic data string.”
  • a sample sequence representing the waveform of the musical instrument sound may be generated as an “acoustic data sequence.”
  • Text corresponding to a note string means that text is associated with a note string. That is, the "correspondence" between a note string and a text means, for example, that each note in the note string is associated with each word in the text in terms of time.
  • In one example, the first generative model is a model trained using training data that includes a first training control data string representing characteristics of a reference note string, a second training control data string representing characteristics of a text corresponding to the reference note string, and a training acoustic data string representing the instrument sound of the reference note string.
  • the second control data string includes a word vector string representing words included in the text.
  • the second control data string includes a word vector string representing words in the text. Therefore, it is possible to generate an acoustic data string of musical instrument sounds in which the meanings of words in the text are reflected in the acoustic characteristics.
  • each element of the "phrase vector sequence" is a vector (phrase vector) defined in a linguistic space (semantic space) according to the meaning of a phrase.
  • a "phrase" is a single word or a sequence of multiple words.
  • to generate the word vector sequence, for example, the statistical estimation model described in Tomas Mikolov et al., "Efficient Estimation of Word Representations in Vector Space," arXiv:1301.3781 [cs.CL], 2013 (Word2Vec) or in Quoc Le, Tomas Mikolov, "Distributed Representations of Sentences and Documents," CoRR, abs/1405.4053, p.1-9, 2014 (Doc2Vec) is used.
  • in acquiring the second control data string, the phrase vector sequence is generated using a trained second generation model.
  • the second control data string can be easily generated using the second generation model.
  • the second control data string includes phoneme data representing phonemes constituting the text. It is therefore possible to generate an acoustic data string of musical instrument sounds in which non-linguistic characteristics (for example, characteristics in the time domain or frequency domain) regarding the pronunciation of the phonemes in the text are reflected in the acoustic characteristics.
  • the acquisition of the first control data and the second control data and the generation of the acoustic data are executed in each of a plurality of unit periods on the time axis.
  • a sound generation system according to one aspect of the present disclosure comprises: a control data string acquisition unit that acquires a first control data string representing characteristics of a note string and a second control data string representing characteristics of a text corresponding to the note string; and an acoustic data string generation unit that processes the first control data string and the second control data string using a trained first generation model, thereby generating an acoustic data string representing the musical instrument sound of the note string having acoustic characteristics according to the characteristics of the text represented by the second control data string.
  • a program according to one aspect (aspect 8) of the present disclosure causes a computer system to function as: a control data string acquisition unit that acquires a first control data string representing characteristics of a note string and a second control data string representing characteristics of a text corresponding to the note string; and an acoustic data string generation unit that processes the first control data string and the second control data string using a trained first generation model, thereby generating an acoustic data string representing the musical instrument sound of the note string having acoustic characteristics according to the characteristics of the text represented by the second control data string.

Abstract

This sound generation system comprises: a control data string acquisition unit 30 that acquires a first control data string X representing features of a string of notes and a second control data string Y representing features of text corresponding to the string of notes; and a sound data string generation unit 33 that processes the first control data string X and the second control data string Y using a trained generation model Mb, thereby generating a sound data string Z representing musical instrument sounds of a string of notes having sound characteristics corresponding to the features of the text represented by the second control data string Y.

Description

Sound generation method, sound generation system, and program
The present disclosure relates to a technique for generating an acoustic data string representing musical instrument sounds.

Techniques for synthesizing desired sounds have been proposed in the past. For example, Non-Patent Document 1 discloses a technique for generating a synthesized sound corresponding to a string of musical notes using a trained generative model.

However, conventional synthesis techniques only generate singing sounds that follow the musical score, and it is difficult to generate synthesized sounds with diverse acoustic characteristics. In consideration of the above circumstances, one aspect of the present disclosure aims to generate an acoustic data string of musical instrument sounds having various acoustic characteristics.
In order to solve the above problems, a sound generation method according to one aspect of the present disclosure acquires a first control data string representing characteristics of a note string and a second control data string representing characteristics of a text corresponding to the note string, and processes the first control data string and the second control data string with a trained first generative model, thereby generating an acoustic data string representing the musical instrument sound of the note string having acoustic characteristics according to the characteristics of the text represented by the second control data string.

A sound generation system according to one aspect of the present disclosure comprises: a control data string acquisition unit that acquires a first control data string representing characteristics of a note string and a second control data string representing characteristics of a text corresponding to the note string; and an acoustic data string generation unit that processes the first control data string and the second control data string with a trained first generative model, thereby generating an acoustic data string representing the musical instrument sound of the note string having acoustic characteristics according to the characteristics of the text represented by the second control data string.

A program according to one aspect of the present disclosure causes a computer system to function as: a control data string acquisition unit that acquires a first control data string representing characteristics of a note string and a second control data string representing characteristics of a text corresponding to the note string; and an acoustic data string generation unit that processes the first control data string and the second control data string with a trained first generative model, thereby generating an acoustic data string representing the musical instrument sound of the note string having acoustic characteristics according to the characteristics of the text represented by the second control data string.
FIG. 1 is a block diagram illustrating the configuration of an information system in the first embodiment.
FIG. 2 is a block diagram illustrating the functional configuration of the sound generation system.
FIG. 3 is an explanatory diagram of the operation of the control data string acquisition unit.
FIG. 4 is a block diagram illustrating the configuration of the second generation unit.
FIG. 5 is a flowchart illustrating the detailed procedure of the synthesis process.
FIG. 6 is a block diagram illustrating the functional configuration of the machine learning system.
FIG. 7 is a flowchart illustrating the detailed procedure of the learning process.
FIG. 8 is an explanatory diagram of the operation of the control data string acquisition unit in the second embodiment.
FIG. 9 is a schematic diagram of phoneme data.
FIG. 10 is a schematic diagram of the second control data string Y in the third embodiment.
FIG. 11 is an explanatory diagram of a generative model in a modified example.
FIG. 12 is a block diagram illustrating the functional configuration of a sound generation system in a modified example.
FIG. 13 is a block diagram illustrating the functional configuration of a sound generation system in a modified example.
FIG. 14 is an explanatory diagram of the operation of the control data string acquisition unit in a modified example.
A: First Embodiment
FIG. 1 is a block diagram illustrating the configuration of an information system 100 according to the first embodiment. The information system 100 includes a sound generation system 10 and a machine learning system 20. The sound generation system 10 and the machine learning system 20 communicate with each other via a communication network 200 such as the Internet.
[Sound generation system 10]
The sound generation system 10 is a computer system that generates a performance sound (hereinafter referred to as the "target sound") of a specific musical piece. The target sound in the first embodiment is an instrument sound having the timbre of a musical instrument.
The sound generation system 10 includes a control device 11, a storage device 12, a communication device 13, an operating device 14, and a sound emitting device 15. The sound generation system 10 is realized by an information terminal such as a smartphone, a tablet terminal, or a personal computer. Note that the sound generation system 10 may be realized not only as a single device but also as a plurality of devices configured separately from each other.

The control device 11 is composed of one or more processors that control each element of the sound generation system 10. For example, the control device 11 is configured by one or more types of processors such as a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), an SPU (Sound Processing Unit), a DSP (Digital Signal Processor), an FPGA (Field Programmable Gate Array), or an ASIC (Application Specific Integrated Circuit). The control device 11 generates an acoustic signal A representing the waveform of the target sound.

The storage device 12 is one or more memories that store a program executed by the control device 11 and various data used by the control device 11. The storage device 12 is composed of a known recording medium such as a magnetic recording medium or a semiconductor recording medium, or of a combination of multiple types of recording media. Note that a portable recording medium that can be attached to and detached from the sound generation system 10, or a recording medium that the control device 11 can access via the communication network 200 (for example, cloud storage), may be used as the storage device 12.

The storage device 12 stores music data D representing a musical piece. The music data D includes musical score data G and a word string T. The musical score data G specifies the time series of notes that make up the piece. Specifically, the musical score data G specifies a pitch and a sounding period for each of the plurality of notes of the piece, the sounding period being specified, for example, by the start point and duration of the note. The word string T specifies the text corresponding to the piece. Specifically, the word string T specifies one or more characters for each of the plurality of notes of the piece; the word string T is composed of a plurality of characters corresponding to different notes. For example, a music file compliant with the MIDI (Musical Instrument Digital Interface) standard is used as the music data D. Note that the music data D may also specify information such as performance symbols representing musical expression.
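As a concrete illustration, the following is a minimal sketch of one way the music data D described above could be represented in memory. The class and field names (Note, Word, MusicData) are hypothetical and are not part of the disclosure; the actual music data D is, for example, a MIDI-compliant file.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Note:
    pitch: int        # pitch specified by the musical score data G (e.g. a MIDI note number)
    start: float      # start point of the sounding period, in seconds
    duration: float   # duration of the sounding period, in seconds

@dataclass
class Word:
    text: str         # one or more characters of the word string T assigned to a note
    note_index: int   # index of the corresponding note in the note sequence

@dataclass
class MusicData:
    score: List[Note]   # musical score data G (time series of notes)
    words: List[Word]   # word string T (text corresponding to the notes)
```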
The communication device 13 communicates with the machine learning system 20 via the communication network 200. Note that a communication device 13 separate from the sound generation system 10 may be connected to the sound generation system 10 by wire or wirelessly.

The operating device 14 is an input device that accepts operations by the user. For example, an operator operated by the user or a touch panel that detects contact by the user is used as the operating device 14.

The sound emitting device 15 reproduces the target sound represented by the acoustic signal A. The sound emitting device 15 is, for example, a speaker or headphones. Note that a D/A converter that converts the acoustic signal A from digital to analog and an amplifier that amplifies the acoustic signal A are omitted from the figures for convenience. A sound emitting device 15 separate from the sound generation system 10 may also be connected to the sound generation system 10 by wire or wirelessly.
FIG. 2 is a block diagram illustrating the functional configuration of the sound generation system 10. By executing the program stored in the storage device 12, the control device 11 realizes a plurality of functions for generating the acoustic signal A (a control data string acquisition unit 30, an acoustic data string generation unit 33, and a signal generation unit 34).
FIG. 3 is an explanatory diagram of the operation of the control data string acquisition unit 30. The control data string acquisition unit 30 acquires the first control data string X and the second control data string Y. Specifically, the control data string acquisition unit 30 acquires the first control data X and the second control data Y in each of a plurality of unit periods U on the time axis. Each unit period U is a period (the hop size of a frame window) that is sufficiently short compared with the duration of each note of the piece. For example, the window size is 2 to 20 times the hop size (the window is longer), the hop size is 2 to 20 milliseconds, and the window size is 20 to 60 milliseconds. The control data string acquisition unit 30 of the first embodiment includes a first generation unit 31 and a second generation unit 32.
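For reference, a small sketch of how the unit periods U and the longer frame windows relate on the time axis; the 5 ms hop and 40 ms window below are merely assumed values within the ranges given above.

```python
HOP_SIZE = 0.005      # length of one unit period U (frame-window hop size), assumed 5 ms
WINDOW_SIZE = 0.040   # length of one frame window, assumed 40 ms (longer than the hop)

def frame_times(total_duration: float):
    """Yield (unit_period_index, window_start, window_end) for each unit period U."""
    n_units = int(total_duration / HOP_SIZE)
    for u in range(n_units):
        start = u * HOP_SIZE
        yield u, start, start + WINDOW_SIZE
```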
The first generation unit 31 generates first control data X from a note data string N for each unit period U. The note data string N used for the generation is the portion of the musical score data G that corresponds to the unit period U. Specifically, the note data string N corresponding to any one unit period U is a part of the note data string of the music data D that includes the note data of the note containing that unit period U (hereinafter referred to as the "target note"). That is, the note data string N specifies a note string that includes the target note and at least one of the note before it and the note after it.

Each piece of first control data X is data in an arbitrary format representing the characteristics of the note string specified by the note data string N. The first control data X in any one unit period U is information indicating the characteristics of the target note, that is, the note whose note data covers that unit period U among the plurality of notes of the piece. For example, the characteristics indicated by the first control data X include characteristics of the note containing the unit period (for example, its pitch and, optionally, its duration). The first control data X also includes information regarding notes other than the target note. For example, the first control data X includes characteristics (for example, pitch) indicated by the note data of at least one of the notes before and after the target note. The first control data X may also include the pitch difference between the target note and the note immediately before or after it. If there is no preceding or following note to include because a rest is located there, the characteristics of that rest may be included instead of a note.
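A minimal sketch of how the first control data X of one unit period might be assembled from the target note and its neighbors, reusing the hypothetical Note class from the earlier sketch; the feature layout and the zero used for a missing neighbor are illustrative assumptions, not the format defined by the disclosure.

```python
import numpy as np

def make_first_control_data(notes, target_index):
    """Build a feature vector X for a unit period contained in notes[target_index] (the target note)."""
    target = notes[target_index]
    prev_pitch = notes[target_index - 1].pitch if target_index > 0 else 0
    next_pitch = notes[target_index + 1].pitch if target_index + 1 < len(notes) else 0
    return np.array([
        target.pitch,               # pitch of the target note
        target.duration,            # duration of the target note (optional feature)
        prev_pitch,                 # pitch of the preceding note (0 if absent)
        next_pitch,                 # pitch of the following note (0 if absent)
        target.pitch - prev_pitch,  # pitch difference from the preceding note
    ], dtype=np.float32)
```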
The first generation unit 31 generates the first control data X by predetermined arithmetic processing on the note data string N. Note that the first generation unit 31 may generate the first control data X using a generative model configured, for example, as a deep neural network (DNN); such a generative model is a statistical estimation model that has learned the relationship between the note data string N and the first control data X by machine learning. The first control data X is data specifying the musical conditions of the target sound that the sound generation system 10 should generate.

The second generation unit 32 generates, from the word string T, the second control data Y required for the current unit period U, either in synchronization with the unit period U or in advance of its progression. The second control data Y of each unit period U is data in an arbitrary format representing the characteristics of the phrase in the word string T that contains that unit period U. Specifically, the second control data Y includes the phrase vector V of that phrase of the word string T. The phrase vector V is a vector representing the position of each phrase in a semantic space; the closer the meanings of phrases are, the closer the positions of their phrase vectors V are in the semantic space. The phrase represented by a phrase vector V is composed of one or more words. That is, the phrase vector V is data representing the characteristics of one word or one phrase (a time series of multiple words) in the word string T.
FIG. 4 is a block diagram illustrating the configuration of the second generation unit 32. The second generation unit 32 includes a language analysis unit 321 and an information generation unit 322. The language analysis unit 321 divides the word string T into a plurality of words by natural language processing such as morphological analysis, and sequentially generates phrase data Q. The phrase data Q is data identifying a phrase composed of one or more words of the word string T, or data representing the character string of that phrase. The information generation unit 322 generates a phrase vector V for the phrase represented by the phrase data Q. As illustrated in FIG. 3, in each unit period U within the period corresponding to one phrase of the piece, the phrase vector V of that phrase is used repeatedly as the second control data Y. Note that a zero vector is generated as the second control data Y in each unit period U within a period of the piece for which no note or word string T is set.

As illustrated in FIG. 4, a generative model Ma is used for the generation of the phrase vectors V by the information generation unit 322. The generative model Ma is a trained model that has learned, by machine learning, the latent relationship between phrase data Q as input and phrase vectors V in the semantic space as output; it outputs a phrase vector V in response to the input of phrase data Q. The information generation unit 322 generates the phrase vector V of each phrase by processing the phrase data Q with the trained generative model Ma, and outputs it as the second control data Y in the corresponding unit periods. As understood from the above description, the second generation unit 32 uses the generative model Ma to generate, as the second control data string Y, the phrase vector sequence V representing the phrases included in the word string T. With this configuration, the second control data string Y can be generated easily using the generative model Ma. The generative model Ma is an example of a "second generative model."

The generative model Ma of the first embodiment is, for example, a statistical estimation model such as a deep neural network. For generating the phrase vector V of a phrase composed of a single word, for example, the technique described in Tomas Mikolov et al., "Efficient Estimation of Word Representations in Vector Space," arXiv:1301.3781 [cs.CL], 2013 (Word2Vec) is used. For generating the phrase vector V of a phrase composed of multiple words (that is, a sentence), for example, the technique described in Quoc Le, Tomas Mikolov, "Distributed Representations of Sentences and Documents," CoRR, abs/1405.4053, p.1-9, 2014 (Doc2Vec) is used.
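A minimal sketch of the second generation unit's behavior, with a generic pre-trained embedding lookup standing in for a concrete Word2Vec/Doc2Vec model; the embed callable and the phrase-to-unit-period mapping are illustrative assumptions.

```python
import numpy as np

def second_control_data(phrases, n_units, embed, dim=128):
    """Return one phrase vector V per unit period U as the second control data Y.

    phrases: list of (phrase_text, first_unit, last_unit) tuples covering parts of the piece
    embed:   callable mapping a phrase string to a semantic-space vector (e.g. Word2Vec/Doc2Vec)
    """
    Y = np.zeros((n_units, dim), dtype=np.float32)   # zero vector where no note or text is set
    for text, first_u, last_u in phrases:
        v = embed(text)                               # phrase vector V of this phrase
        Y[first_u:last_u + 1] = v                     # reused in every unit period of the phrase
    return Y
```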
As illustrated in FIG. 2, control data C is generated for each unit period U through the above processing by the control data string acquisition unit 30. The control data C of each unit period U includes the first control data X generated by the first generation unit 31 for that unit period U and the second control data Y generated by the second generation unit 32 for that unit period U. The control data C is, for example, data obtained by concatenating the first control data X and the second control data Y.
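Continuing the sketch, the control data C of one unit period can then be formed by simple concatenation (one possible layout, not a mandated format):

```python
import numpy as np

def control_data(X_u, Y_u):
    """Concatenate the first control data X and second control data Y of one unit period into C."""
    return np.concatenate([X_u, Y_u]).astype(np.float32)
```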
The performance of a musical instrument is basically defined by the note string of the musical score. However, as a result of investigation by the inventor of the present application, a tendency was confirmed that even when performers play the same note string on an instrument, the musical expression of the instrument sounds produced by the performance differs when the text attached to the note string differs. In other words, while it is natural that the musical expression of a singing voice depends on the text (that is, the lyrics), the musical expression of instrument sounds, which is generally assumed to be unaffected by text, in fact also tends to depend on the text. Against the background of these findings, in the first embodiment, the acoustic signal A of the target sound is generated according to the first control data string X representing the characteristics of the note string and the second control data string Y representing the characteristics of the word string T corresponding to the note string.

The acoustic data string generation unit 33 in FIG. 2 generates an acoustic data string Z using the control data string C (the first control data string X and the second control data string Y). The acoustic data string Z is data in an arbitrary format representing the target sound. Specifically, the acoustic data string Z represents a target sound that corresponds to the note string represented by the first control data string X and that has acoustic characteristics according to the characteristics of the word string T represented by the second control data string Y. That is, the instrument sound that would be produced if a performer played the note string on an instrument with the word string T in mind is generated as the target sound.
Specifically, the acoustic data Z is data representing the envelope of the frequency spectrum of the target sound. According to the control data C of each unit period U, the acoustic data Z corresponding to that unit period U is generated, and each piece of acoustic data Z corresponds to a waveform sample sequence of one frame window, which is longer than the unit period. As described above, the acquisition of the control data C by the control data string acquisition unit 30 and the generation of the acoustic data Z by the acoustic data string generation unit 33 are executed for each unit period U.

A generative model Mb is used for the generation of the acoustic data string Z by the acoustic data string generation unit 33. For each unit period, the generative model Mb estimates the acoustic data Z of that unit period according to the control data C of that unit period. The generative model Mb is a trained model that has learned, by machine learning, the latent relationship between the control data string C as input and the acoustic data string Z as output; that is, from the viewpoint of that relationship, the generative model Mb outputs an acoustic data string Z that is statistically valid for the control data string C. The acoustic data string generation unit 33 generates the acoustic data Z for each unit period U by processing the control data C with the generative model Mb.

The generative model Mb is realized by a combination of a program that causes the control device 11 to execute the computation for generating the acoustic data string Z from the control data string C, and a plurality of variables (weights and biases) applied to that computation. The program and the plurality of variables that realize the generative model Mb are stored in the storage device 12, and the plurality of variables of the generative model Mb are set in advance by machine learning. The generative model Mb is an example of a "first generative model."
The generative model Mb is composed of, for example, a deep neural network. For example, any type of deep neural network such as a recurrent neural network (RNN) or a convolutional neural network (CNN) can be used as the generative model Mb. The generative model Mb may also be configured as a combination of multiple types of deep neural networks. Additional elements such as long short-term memory (LSTM) or attention may also be incorporated in the generative model Mb.
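As a concrete but non-limiting illustration, the following is a minimal recurrent network that could play the role of the generative model Mb, mapping one control data vector C per unit period to one spectral-envelope frame Z; all layer sizes and the choice of a GRU are assumptions.

```python
import torch
import torch.nn as nn

class GenerativeModelMb(nn.Module):
    """Sketch of a recurrent generative model Mb: control data string C -> acoustic data string Z."""

    def __init__(self, control_dim=160, hidden_dim=256, spectrum_dim=80):
        super().__init__()
        self.rnn = nn.GRU(control_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, spectrum_dim)

    def forward(self, control_seq):
        # control_seq: (batch, n_unit_periods, control_dim), one control data C per unit period U
        hidden, _ = self.rnn(control_seq)
        return self.out(hidden)   # (batch, n_unit_periods, spectrum_dim): acoustic data string Z
```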
The signal generation unit 34 generates the acoustic signal A of the target sound from the time series of the acoustic data Z. For example, the signal generation unit 34 converts the acoustic data Z into a time-domain waveform signal by computation including a discrete inverse Fourier transform, and generates the acoustic signal A by connecting the waveform signals of successive unit periods U. Alternatively, the signal generation unit 34 may generate the acoustic signal A from the acoustic data string Z using, for example, a deep neural network that has learned the relationship between the acoustic data string Z and the samples of the acoustic signal A (a so-called neural vocoder). The acoustic signal A generated by the signal generation unit 34 is supplied to the sound emitting device 15, whereby the target sound is reproduced from the sound emitting device 15.
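A minimal sketch of the inverse-transform and overlap-add step performed by the signal generation unit 34, assuming each piece of acoustic data Z can be inverted to a time-domain frame with an inverse FFT; the Hann window and the padding are simplifications for illustration.

```python
import numpy as np

def synthesize(frames, hop, window):
    """Overlap-add per-unit-period frames into one waveform (the acoustic signal A).

    frames: sequence of spectra, one per unit period U (arrays of FFT bins)
    hop:    unit period length in samples; window: frame window length in samples
    """
    out = np.zeros(hop * len(frames) + window, dtype=np.float64)
    win = np.hanning(window)
    for u, spectrum in enumerate(frames):
        wave = np.fft.irfft(spectrum, n=window)      # time-domain waveform of one frame window
        out[u * hop:u * hop + window] += wave * win  # overlap-add at the hop interval
    return out
```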
FIG. 5 is a flowchart illustrating the detailed procedure of the process by which the control device 11 generates the acoustic signal A (hereinafter referred to as the "synthesis process" Sa). The synthesis process Sa is executed in each of the plurality of unit periods U.

When the synthesis process Sa is started, the control device 11 (control data string acquisition unit 30) acquires the music data D from the storage device 12 (Sa1). The control device 11 (first generation unit 31) generates the first control data X of the unit period U from the note data string N corresponding to that unit period U in the musical score data G of the music data D (Sa2). The control device 11 (second generation unit 32) also generates the second control data Y of each unit period U from the word string T of the music data D (Sa3). Note that the order of the generation of the first control data X (Sa2) and the generation of the second control data Y (Sa3) may be reversed.

The control device 11 (acoustic data string generation unit 33) generates the acoustic data Z of the unit period U by processing the control data C including the first control data X and the second control data Y with the generative model Mb (Sa4). The control device 11 (signal generation unit 34) generates the acoustic signal A of the unit period U from the acoustic data Z (Sa5). From the acoustic data Z of each unit period, a signal spanning a time longer than the unit period is generated, and by overlap-adding these signals, the acoustic signal A spanning a plurality of unit periods is generated; the time difference (hop size) between successive frame windows corresponds to one unit period. The control device 11 reproduces the target sound by supplying the acoustic signal A to the sound emitting device 15 (Sa6).
As described above, in the first embodiment, in addition to the first control data string X representing the characteristics of the note string, the second control data string Y representing the characteristics of the word string T corresponding to the note string is used for the generation of the acoustic data string Z. Therefore, compared with a configuration in which the acoustic data string Z is generated only from the first control data string X, it is possible to generate an acoustic data string Z of the target sound having diverse acoustic characteristics according to the word string T corresponding to the note string. For example, even when the note data string N is the same, an acoustic data string Z of a target sound with different acoustic characteristics can be generated by changing the word string T. In particular, in the first embodiment, the second control data string Y includes the phrase vectors V representing phrases in the word string T. That is, the phrase vectors V reflecting the meaning of the word string T are used as the second control data string Y. Therefore, it is possible to generate an acoustic data string Z of the target sound in which the meanings of the phrases in the word string T are reflected in the acoustic characteristics.
[Machine learning system 20]
The machine learning system 20 in FIG. 1 is a computer system that establishes, by machine learning, the generative model Mb used by the sound generation system 10. The machine learning system 20 includes a control device 21, a storage device 22, and a communication device 23.
The control device 21 is composed of one or more processors that control each element of the machine learning system 20. For example, the control device 21 is configured by one or more types of processors such as a CPU, GPU, SPU, DSP, FPGA, or ASIC.

The storage device 22 is one or more memories that store a program executed by the control device 21 and various data used by the control device 21. The storage device 22 is composed of a known recording medium such as a magnetic recording medium or a semiconductor recording medium, or of a combination of multiple types of recording media. Note that a portable recording medium that can be attached to and detached from the machine learning system 20, or a recording medium that the control device 21 can access via the communication network 200 (for example, cloud storage), may be used as the storage device 22.

The communication device 23 communicates with the sound generation system 10 via the communication network 200. Note that a communication device 23 separate from the machine learning system 20 may be connected to the machine learning system 20 by wire or wirelessly.
FIG. 6 is an explanatory diagram of the function by which the machine learning system 20 establishes the generative model Mb. The storage device 22 stores a plurality of pieces of basic data B corresponding to different musical pieces. Each piece of basic data B includes music data D and a reference signal R. The music data D is data representing the note string of a specific musical piece (hereinafter referred to as the "reference piece") performed with the waveform represented by the reference signal R. As described above, the music data D includes musical score data G and a word string T; the musical score data G specifies the time series of notes constituting the reference piece, and the word string T specifies the text corresponding to the reference piece.

The reference signal R is a signal representing the waveform of the instrument sound produced by an instrument when a performer plays the reference piece while referring to the word string T. For example, a performer skilled in playing the instrument performs the reference piece while adding musical expression corresponding to the word string T, and the reference signal R is generated by recording the instrument sound produced by the instrument in this situation. After the reference signal R is recorded, its position on the time axis is adjusted. The instrument sound represented by the reference signal R is therefore an instrument sound having acoustic characteristics corresponding to the word string T.

By executing the program stored in the storage device 22, the control device 21 realizes a plurality of functions for generating the generative model Mb (a training data acquisition unit 41 and a learning processing unit 42).
The training data acquisition unit 41 generates a plurality of training data L from the plurality of pieces of basic data B. One piece of training data L is generated for each reference piece; accordingly, a plurality of training data L are generated from the plurality of pieces of basic data B corresponding to different reference pieces. The learning processing unit 42 establishes the generative model Mb by machine learning using the plurality of training data L.
Each piece of training data L is composed of a combination of a training control data string Ct and a training acoustic data string Zt. The control data string Ct is composed of a combination of a first training control data string Xt and a second training control data string Yt. The first control data string Xt is an example of a "first training control data string," and the second control data string Yt is an example of a "second training control data string." The acoustic data string Zt is an example of a "training acoustic data string."

For each unit period U, the training data acquisition unit 41 generates the first control data Xt of that unit period U from a note data string Nt. The note data string Nt used to generate the first control data Xt of each unit period U is the part of the note data string of the musical score data G that includes the note data of the target note containing that unit period U. That is, the note data string Nt includes the note data of the target note of the reference piece and the note data of at least one of the note before it and the note after it. Like the first control data X described above, the first control data Xt is data representing the characteristics of the reference note string represented by the note data string Nt. The training data acquisition unit 41 generates the first control data Xt of each unit period U from the note data string Nt by the same processing as the first generation unit 31.

The second control data Yt of one unit period U indicates the phrase vector V estimated for the phrase of the word string T corresponding to that unit period U. The training data acquisition unit 41 generates the second control data Yt of each unit period U, indicating the phrase vectors V estimated from the word string T, by the same processing as the second generation unit 32.
The acoustic data Zt of one unit period U represents the waveform of one frame of the reference signal R corresponding to that unit period U. The training data acquisition unit 41 generates the acoustic data string Zt from the reference signal R. As understood from the above description, the acoustic data string Zt represents the waveform of the instrument sound produced by the instrument when the reference note string corresponding to the first control data string Xt is performed under the phrases represented by the second control data string Yt. That is, the acoustic data string Zt is the ground truth of the acoustic data string that the generative model Mb should output in response to the input of the control data string Ct.
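A minimal sketch of how the training data acquisition unit 41 might cut the reference signal R into one training acoustic data frame Zt per unit period U; extracting a magnitude spectrum with a windowed FFT is an illustrative assumption about the frame format.

```python
import numpy as np

def make_training_frames(reference_signal, n_units, hop, window):
    """Cut the reference signal R into one training acoustic data frame Zt per unit period U."""
    frames = []
    win = np.hanning(window)
    for u in range(n_units):
        start = u * hop
        segment = reference_signal[start:start + window]
        segment = np.pad(segment, (0, window - len(segment)))   # pad the final frame if it runs short
        frames.append(np.abs(np.fft.rfft(segment * win)))       # magnitude spectrum used as Zt
    return np.stack(frames)
```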
FIG. 7 is a flowchart of the process by which the control device 21 establishes the generative model Mb by machine learning (hereinafter referred to as the "learning process" Sb). For example, the learning process Sb is started in response to an instruction from the operator of the machine learning system 20. The learning processing unit 42 in FIG. 6 is realized by the control device 21 executing the learning process Sb.

When the learning process Sb is started, the control device 21 selects one of the plurality of training data L (hereinafter referred to as the "selected training data L") (Sb1). As illustrated in FIG. 6, the control device 21 generates an acoustic data string Z by processing the control data string Ct of the selected training data L with the initial or provisional generative model Mb (hereinafter referred to as the "provisional model Mb0") (Sb2).

The control device 21 calculates a loss function representing the error between the acoustic data string Z generated by the provisional model Mb0 and the acoustic data string Zt of the selected training data L (Sb3). The control device 21 updates the plurality of variables of the provisional model Mb0 so that the loss function is reduced (ideally minimized) (Sb4). For example, error backpropagation is used to update each variable according to the loss function.

The control device 21 determines whether a predetermined end condition is satisfied (Sb5). The end condition is that the loss function falls below a predetermined threshold, or that the amount of change in the loss function falls below a predetermined threshold. If the end condition is not satisfied (Sb5: NO), the control device 21 selects an unselected piece of training data L as the new selected training data L (Sb1). That is, the process of updating the plurality of variables of the provisional model Mb0 (Sb1 to Sb4) is repeated until the end condition is satisfied (Sb5: YES). When the end condition is satisfied (Sb5: YES), the control device 21 ends the learning process Sb. The provisional model Mb0 at the point when the end condition is satisfied is fixed as the trained generative model Mb.
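A hedged sketch of the learning process Sb as it could be written with a gradient-based framework; the mean-squared-error loss, the Adam optimizer, and the threshold value are illustrative choices, not the specific loss function of the disclosure.

```python
import torch
import torch.nn as nn

def learning_process(model, training_data, lr=1e-3, loss_threshold=1e-3, max_steps=100_000):
    """Sketch of Sb1-Sb5: repeatedly update the provisional model Mb0 until the end condition holds.

    training_data: iterable of (Ct, Zt) pairs, where Ct is a control data string tensor and
                   Zt is the corresponding ground-truth training acoustic data string tensor.
    """
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.MSELoss()
    for step, (Ct, Zt) in enumerate(training_data):   # Sb1: pick the selected training data L
        if step >= max_steps:
            break
        Z = model(Ct)                                  # Sb2: output of the provisional model Mb0
        loss = criterion(Z, Zt)                        # Sb3: loss between Z and the ground truth Zt
        optimizer.zero_grad()
        loss.backward()                                # Sb4: update variables by backpropagation
        optimizer.step()
        if loss.item() < loss_threshold:               # Sb5: end condition on the loss value
            break
    return model                                       # trained generative model Mb
```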
As understood from the above description, the generative model Mb learns the latent relationship between the input control data strings Ct and the output acoustic data strings Zt. The trained generative model Mb therefore outputs, for an unknown control data string C, an acoustic data string Z that is statistically valid from the viewpoint of that relationship.

The control device 21 transmits the generative model Mb established by the above processing from the communication device 23 to the sound generation system 10. Specifically, the plurality of variables defining the generative model Mb are transmitted to the sound generation system 10. The control device 11 of the sound generation system 10 receives the generative model Mb transmitted from the machine learning system 20 via the communication device 13 and stores it in the storage device 12.
B: Second Embodiment
The second embodiment will be described below. For elements in each of the aspects illustrated below whose functions are the same as in the first embodiment, the reference signs used in the description of the first embodiment are reused and detailed description of each is omitted as appropriate.
As in the first embodiment, the control device 11 of the sound generation system 10 of the second embodiment includes a control data string acquisition unit 30 that acquires the control data string C, an acoustic data string generation unit 33 that generates the acoustic data string Z from the control data string C, and a signal generation unit 34 that generates the acoustic signal A from the acoustic data string Z.

FIG. 8 is an explanatory diagram of the operation of the control data string acquisition unit 30 in the second embodiment. As in the first embodiment, the first generation unit 31 of the control data string acquisition unit 30 generates the first control data X for each unit period U from the note data string N. In the second embodiment, the function of the second generation unit 32 differs from that in the first embodiment. The second generation unit 32 of the first embodiment generates the phrase vectors V representing the phrases of the word string T as the second control data Y. The second generation unit 32 of the second embodiment, on the other hand, generates phoneme data P representing each phoneme of the word string and outputs it as the second control data Y in each unit period corresponding to the period of that phoneme. That is, the second generation unit 32 analyzes the word string T to generate phoneme data P indicating the phoneme type and period of each phoneme, and outputs that phoneme data P as the second control data Y for each unit period U. As in the first embodiment, the control data C includes the first control data X and the second control data Y.
FIG. 9 is a schematic diagram of the phoneme data P. The phoneme data P specifies one of a plurality of (K) types of phonemes. Specifically, the phoneme data P is composed of K elements E (E1 to EK), where K is a natural number of 2 or more, corresponding to the different types of phonemes. The phoneme data P specifying any one type of phoneme is a one-hot vector in which, of the K elements E1 to EK, the one element E corresponding to that phoneme is set to "1" and the remaining (K-1) elements E are set to "0". Note that a one-cold vector, in which the "1" and "0" of each element E are exchanged, may also be adopted as the phoneme data P.
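A small sketch of the one-hot phoneme data P described above; the phoneme inventory used here is an arbitrary placeholder, not the set of K phoneme types assumed by the disclosure.

```python
import numpy as np

PHONEMES = ["a", "i", "u", "e", "o", "k", "s", "t", "n", "sil"]   # placeholder inventory (K types)

def phoneme_data(phoneme: str) -> np.ndarray:
    """Return the one-hot vector P: the element for the given phoneme is 1 and the rest are 0."""
    p = np.zeros(len(PHONEMES), dtype=np.float32)
    p[PHONEMES.index(phoneme)] = 1.0
    return p
```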
The second generation unit 32 estimates the type and period of each phoneme of the characters at each point in the word string T by phoneme analysis processing, and generates phoneme data P specifying that phoneme. Any known technique may be adopted for the phoneme analysis processing. As illustrated in FIG. 8, in each unit period U within the period corresponding to one phoneme of the piece, the phoneme data P indicating that phoneme is used repeatedly as the second control data Y. The boundaries of the period of each phoneme of the characters in the word string T are estimated, for example, by a statistical model such as an HMM (Hidden Markov Model) or an SVM (Support Vector Machine). A rule-based method is also conceivable in which the boundaries of each phoneme are identified using a reference table in which the relationship between each character constituting the word string T and the boundaries of each phoneme is registered. Alternatively, the boundaries of each phoneme designated, for example, manually by the creator of the music data D may be specified by the music data D.
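Continuing that sketch, the estimated phoneme periods can be expanded so that the phoneme data P of a phoneme is reused in every unit period U within that phoneme's period; the segment representation and the zero-vector default are illustrative assumptions.

```python
import numpy as np

def second_control_data_phonemes(segments, n_units, hop, n_phonemes=10):
    """segments: list of (phoneme, start_time, end_time); returns one phoneme vector per unit period U."""
    Y = [np.zeros(n_phonemes, dtype=np.float32) for _ in range(n_units)]   # assumed default: zero vector
    for phoneme, start, end in segments:
        first_u, last_u = int(start / hop), int(end / hop)
        for u in range(first_u, min(last_u + 1, n_units)):
            Y[u] = phoneme_data(phoneme)   # one-hot P reused in every unit period within the phoneme
    return Y
```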
The second control data Y (phrase vectors V) of the first embodiment reflects the meaning of the word string T but does not reflect information regarding the pronunciation of the word string T (that is, phonemes). Conversely, the second control data Y (phoneme data P) of the second embodiment reflects information regarding the pronunciation of the word string T (that is, phonemes) but does not reflect the meaning of the word string T. The phrase vectors V and the phoneme data P are comprehensively expressed as data representing the characteristics of the word string T.

Except for the point that the phrase vectors V are replaced with the phoneme data P, the second embodiment is the same as the first embodiment. For example, the processing by which the acoustic data string generation unit 33 generates the acoustic data string Z from the control data string C and the processing by which the signal generation unit 34 generates the acoustic signal A from the acoustic data string Z are the same as in the first embodiment. The synthesis process Sa and the learning process Sb are also the same as in the first embodiment.

In the second embodiment, in addition to the first control data string X representing the characteristics of the note string, the second control data string Y representing the characteristics of the word string T corresponding to the note string is used for the generation of the acoustic data string Z. Therefore, as in the first embodiment, it is possible to generate an acoustic data string Z of the target sound having diverse acoustic characteristics according to the word string T corresponding to the note string. In particular, in the second embodiment, the second control data string Y includes the phoneme data P representing the phonemes in the word string T. That is, the phoneme data P reflecting the pronunciation of the word string T is used as the second control data string Y. Therefore, it is possible to generate an acoustic data string Z of the target sound in which non-linguistic characteristics (for example, characteristics in the time domain or frequency domain) regarding the pronunciation of the phonemes in the word string T are reflected. For example, a target sound giving the impression that the word string T is perceived as onomatopoeia can be generated.
C: Third Embodiment
 FIG. 10 is a schematic diagram of the second control data string Y in the third embodiment. The second control data string Y includes first data Y1 and second data Y2. The first data Y1 corresponds to the second control data string Y in the first embodiment, and the second data Y2 corresponds to the second control data string Y in the second embodiment.
 Specifically, the first data Y1 is the phrase vector string V representing each phrase included in the word string T. For example, in each unit period U within the period of the song corresponding to one phrase, the phrase vector V of that phrase is used as the first data Y1. The second data Y2 is the phoneme data P representing each phoneme of the word string T. For example, in each unit period U within the period of the song corresponding to one phoneme, the phoneme data P of that phoneme is used as the second data Y2.
 The third embodiment achieves the same effects as the first embodiment. Furthermore, in the third embodiment the second control data string Y includes both the first data Y1 (phrase vector string V) and the second data Y2 (phoneme data P). Therefore, an acoustic data string Z of the target sound can be generated in which both the meaning of each phrase in the word string T and the pronunciation of the phonemes in the word string T are reflected, for example by combining the two kinds of data per unit period as sketched below.
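 A minimal sketch of that per-unit-period combination, assuming a 256-dimensional phrase vector and a 10-phoneme one-hot encoding purely for illustration:

```python
import numpy as np

# Assumed sizes: 256-dimensional phrase vectors and a 10-phoneme inventory.
PHRASE_DIM, N_PHONEMES = 256, 10

def combined_control_data(phrase_vectors, phoneme_onehots):
    """Third-embodiment second control data: per unit period, the first data Y1
    (phrase vector of the current phrase) concatenated with the second data Y2
    (one-hot phoneme vector of the current phoneme)."""
    assert phrase_vectors.shape[0] == phoneme_onehots.shape[0]  # same number of unit periods
    return np.concatenate([phrase_vectors, phoneme_onehots], axis=1)

n_units = 100
Y1 = np.random.randn(n_units, PHRASE_DIM)                            # stand-in phrase vector string V
Y2 = np.eye(N_PHONEMES)[np.random.randint(0, N_PHONEMES, n_units)]   # stand-in phoneme data P
Y = combined_control_data(Y1, Y2)
print(Y.shape)  # (100, 266)
```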
D: Modifications
 Specific modifications that may be added to each of the forms exemplified above are illustrated below. Two or more aspects arbitrarily selected from the following examples may be combined as appropriate to the extent that they do not contradict one another.
(1) The phoneme data P in the second embodiment is not limited to a vector composed of K elements E1 to EK. For example, a code (identifier) uniquely assigned to each phoneme may be used as the phoneme data P.
(2) In each of the foregoing forms, the acoustic data string Z represents the frequency characteristics of the target sound, but the information represented by the acoustic data string Z is not limited to this example. For example, a form in which the acoustic data string Z represents the individual samples of the target sound is also conceivable. In that form, the time series of the acoustic data string Z constitutes the acoustic signal A, and the signal generation unit 34 is therefore omitted.
(3) In each of the foregoing forms, the control data string acquisition unit 30 generates the first control data string X and the second control data string Y, but the operation of the control data string acquisition unit 30 is not limited to this example. For example, the control data string acquisition unit 30 may receive a first control data string X and a second control data string Y generated by an external device from that device via the communication device 13. In a form in which the first control data string X and the second control data string Y are stored in the storage device 12, the control data string acquisition unit 30 reads them from the storage device 12. As understood from these examples, "acquisition" by the control data string acquisition unit 30 encompasses any operation that obtains the first control data string X and the second control data string Y, such as generation, reception, and reading. Likewise, "acquisition" of the first control data string Xt and the second control data string Yt by the training data acquisition unit 41 encompasses any operation that obtains them (for example, generation, reception, and reading).
(4) In each of the foregoing forms, the control data string C obtained by concatenating the first control data string X and the second control data string Y is supplied to the generative model Mb, but the form in which the first control data string X and the second control data string Y are input to the generative model Mb is not limited to this example.
 For example, as illustrated in FIG. 11, assume a form in which the generative model Mb is composed of a first part Mb1 and a second part Mb2. The first part Mb1 consists of the input layer and some of the intermediate layers of the generative model Mb, and the second part Mb2 consists of the remaining intermediate layers and the output layer. In this form, the first control data string X may be supplied to the first part Mb1 (input layer), and the second control data string Y may be supplied to the second part Mb2 together with the data output from the first part Mb1, as sketched below. As understood from this example, concatenation of the first control data string X and the second control data string Y is not essential in the present disclosure.
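 A minimal sketch of this split form, written with PyTorch under assumed layer sizes; the module structure and all dimensions are illustrative and do not represent the actual architecture of the generative model Mb.

```python
import torch
import torch.nn as nn

class SplitGenerativeModel(nn.Module):
    """Sketch of the FIG. 11 form: the first part Mb1 (input layer plus some
    intermediate layers) receives only X; the second part Mb2 (remaining
    intermediate layers plus output layer) receives Mb1's output together
    with Y. Dimensions are illustrative assumptions."""
    def __init__(self, x_dim=64, y_dim=266, hidden=512, z_dim=80):
        super().__init__()
        self.part1 = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU(),
                                   nn.Linear(hidden, hidden), nn.ReLU())
        self.part2 = nn.Sequential(nn.Linear(hidden + y_dim, hidden), nn.ReLU(),
                                   nn.Linear(hidden, z_dim))

    def forward(self, x, y):
        h = self.part1(x)                              # first control data only
        return self.part2(torch.cat([h, y], dim=-1))   # second control data joins mid-network

model = SplitGenerativeModel()
x = torch.randn(100, 64)   # first control data string X (per unit period)
y = torch.randn(100, 266)  # second control data string Y
z = model(x, y)            # acoustic data string Z (e.g., spectral frames)
print(z.shape)             # torch.Size([100, 80])
```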
(5) As illustrated in FIG. 12, a plurality of generative models Mb corresponding to different musical instruments may be used selectively. The generative model Mb corresponding to one type of instrument is a trained model trained using reference signals R of instrument sounds produced by that instrument. Accordingly, the generative model Mb corresponding to each instrument outputs an acoustic data string Z representing the instrument sound of that instrument.
 The user selects one of the plural types of instruments by operating the operation device 14. The instrument data α in FIG. 12 is data designating the instrument selected by the user. The acoustic data string generation unit 33 selects, from among the plurality of generative models Mb, the generative model Mb corresponding to the instrument designated by the instrument data α, and generates the acoustic data string Z by processing the control data string C with that generative model Mb. With this configuration, a target sound having the timbre of any one of the plural types of instruments can be generated.
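 A minimal sketch of this selection, using small dummy models in place of the trained per-instrument models Mb; the instrument names and dimensions are assumptions for illustration.

```python
import torch
import torch.nn as nn

# Dummy linear models stand in for trained per-instrument generative models Mb.
C_DIM, Z_DIM = 330, 80
models_mb = {name: nn.Linear(C_DIM, Z_DIM) for name in ("piano", "violin", "flute")}

def generate_acoustic_data(control_data_c, instrument_alpha):
    model_b = models_mb[instrument_alpha]      # model Mb designated by instrument data alpha
    with torch.no_grad():
        return model_b(control_data_c)         # acoustic data string Z with that timbre

z = generate_acoustic_data(torch.randn(100, C_DIM), "violin")
print(z.shape)  # torch.Size([100, 80])
```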
 As illustrated in FIG. 13, a control data string C including the instrument data α in addition to the first control data string X and the second control data string Y may instead be input to a single generative model Mb. The generative model Mb in FIG. 13 is established by machine learning using a plurality of reference signals R corresponding to different instruments, and the training data L includes, in addition to the first training control data string Xt and the second training control data string Yt, instrument data α designating the instrument corresponding to the reference signal R. The target sound represented by the acoustic data string Z is therefore an instrument sound having the timbre of the instrument designated by the instrument data α.
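 A minimal sketch of this conditioning, assuming the instrument data α is encoded as a one-hot vector and appended to each unit period of the control data string C; the instrument list and dimensions are illustrative assumptions.

```python
import torch

INSTRUMENTS = ["piano", "violin", "flute"]  # assumed instrument inventory

def build_control_data(x, y, instrument_alpha):
    """Form C = [X ; Y ; alpha] per unit period for a single generative model Mb."""
    alpha = torch.zeros(x.shape[0], len(INSTRUMENTS))
    alpha[:, INSTRUMENTS.index(instrument_alpha)] = 1.0
    return torch.cat([x, y, alpha], dim=-1)

c = build_control_data(torch.randn(100, 64), torch.randn(100, 266), "flute")
print(c.shape)  # torch.Size([100, 333])
```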
(6) In each of the foregoing forms, the note data string N is generated from the music data D stored in advance in the storage device 12, but a note data string N supplied sequentially from a performance device may also be used. The performance device is an input device, such as a MIDI keyboard, that accepts a performance by the user and sequentially outputs note data N corresponding to that performance. The sound generation system 10 generates the acoustic data string Z using the note data string N supplied from the performance device. The synthesis processing Sa described above may be executed in real time, in parallel with the user's performance on the performance device. Specifically, the second control data string Y and the acoustic data string Z may be generated in parallel with the user's operation of the performance device.
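 A minimal sketch of reading such a note data stream from a MIDI keyboard with the mido package; the port selection, the callback, and the link to the synthesis processing Sa are assumptions for illustration, and the example requires a connected MIDI input device.

```python
import time
import mido  # assumes the mido package and a connected MIDI input device

def on_note(note_number, velocity, onset_time):
    print(f"note {note_number}, velocity {velocity}, at {onset_time:.3f} s")
    # Here the system would update the control data strings and run the synthesis processing Sa.

with mido.open_input() as port:               # default MIDI input port
    start = time.monotonic()
    for msg in port:                          # blocks, yielding messages as they are played
        if msg.type == "note_on" and msg.velocity > 0:
            on_note(msg.note, msg.velocity, time.monotonic() - start)
```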
(7) In each of the foregoing forms, one phrase vector string V is generated from one piece of phrase data Q, but a plurality of phrase vector strings V may be generated from one phrase data string Q. FIG. 14 illustrates the operation of the control data string acquisition unit 30 in this modification. The phrase data string Q in FIG. 14 is data representing the word sequences obtained by dividing the word string T for each phrase of the song; that is, each piece of phrase data Q identifies the word sequence corresponding to one phrase. A phrase is a section into which the song is divided according to musical or semantic cohesion. For example, each phrase is specified in the music data D, although the phrases of the song may instead be delimited by analyzing the music data D. In this way, the language analysis unit 321 generates the phrase data string Q by dividing the word string T into phrases.
 The information generation unit 322 processes each piece of phrase data Q to generate a phrase vector string V for each word included in the corresponding word sequence. As illustrated in FIG. 14, when one piece of phrase data Q contains a plurality of words, the context of that word sequence is analyzed, and a plurality of phrase vector strings V, one for each word, are generated from the single piece of phrase data Q. As in the first embodiment, the information generation unit 322 uses a generative model Ma capable of interpreting context to generate the phrase vector strings V. The generative model Ma is a trained model that has learned the relationship between the context of the word sequence indicated by the phrase data Q and the phrase vector strings V indicating the meaning of each word in that context. Specifically, the generative model Ma generates, for each word included in the word sequence indicated by the phrase data Q, a phrase vector string V indicating the meaning of that word within the sequence. For example, a natural language processing model such as BERT (Bidirectional Encoder Representations from Transformers) is used as the generative model Ma. As understood from the above description, in this modification one or more phrase vector strings V corresponding to the phrase data Q are generated for each phrase.
 For example, FIG. 14 illustrates a piece of phrase data Q1 specifying word sequence #1 composed of word #1a and word #1b. In response to the input of this single piece of phrase data Q1, the generative model Ma generates a phrase vector string V corresponding to word #1a and a phrase vector string V corresponding to word #1b. The phrase data Q2 in FIG. 14 specifies word sequence #2 composed of word #2a, word #2b, and word #2c. In response to the input of this single piece of phrase data Q2, the generative model Ma generates phrase vector strings V corresponding to word #2a, word #2b, and word #2c. As in the first embodiment, in each unit period U corresponding to one word, the phrase vector V of that word is used repeatedly as the second control data Y. In this modification, interpreting the context of the word sequence yields phrase vectors V that more accurately indicate the meaning of each word contained in that sequence; a sketch of such contextual vector extraction follows.
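 A minimal sketch of context-dependent phrase vectors, using a generic BERT checkpoint from the transformers package as a stand-in for the generative model Ma; the checkpoint name and the per-word mean pooling are assumptions for illustration.

```python
import torch
from transformers import AutoTokenizer, AutoModel  # assumes the transformers package

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

def phrase_vectors(words):
    """Return one context-dependent vector per word of one phrase's word sequence."""
    enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**enc).last_hidden_state[0]      # (tokens, 768)
    word_ids = enc.word_ids(0)
    vectors = []
    for i in range(len(words)):                           # mean-pool sub-tokens per word
        idx = [t for t, w in enumerate(word_ids) if w == i]
        vectors.append(hidden[idx].mean(dim=0))
    return torch.stack(vectors)

V = phrase_vectors(["spring", "rain"])   # word sequence of one phrase
print(V.shape)  # torch.Size([2, 768])
```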
(8) Deep neural networks are exemplified in the foregoing forms, but the generative models M (Ma, Mb, Mc) are not limited to deep neural networks. For example, a statistical model of any form and type, such as an HMM (Hidden Markov Model) or an SVM (Support Vector Machine), may be used as the generative model M (Ma, Mb, Mc).
(9) In the foregoing forms, the machine learning system 20 establishes the generative model Mb, but the functions for establishing the generative model Mb (the training data acquisition unit 41 and the learning processing unit 42) may instead be provided in the sound generation system 10. A function for establishing the generative model Ma or the generative model Mc may likewise be provided in the sound generation system 10.
(10) The sound generation system 10 may be realized by a server device that communicates with an information device such as a smartphone or a tablet terminal. For example, the sound generation system 10 receives music data D from the information device, generates the acoustic signal A by the synthesis processing Sa applied to that music data D, and transmits the generated acoustic signal A to the information device. In a form in which the signal generation unit 34 is provided in the information device, the time series of the acoustic data string Z is transmitted to the information device; that is, the signal generation unit 34 is omitted from the sound generation system 10.
(11) As described above, the functions of the sound generation system 10 (the control data string acquisition unit 30, the acoustic data string generation unit 33, and the signal generation unit 34) are realized by cooperation between the single or plural processors constituting the control device 11 and the program stored in the storage device 12. Similarly, the functions of the machine learning system 20 (the training data acquisition unit 41 and the learning processing unit 42) are realized by cooperation between the single or plural processors constituting the control device 21 and the program stored in the storage device 22.
 The programs exemplified above may be provided in a form stored in a computer-readable recording medium and installed in a computer. The recording medium is, for example, a non-transitory recording medium; an optical recording medium (optical disc) such as a CD-ROM is a good example, but any known form of recording medium, such as a semiconductor recording medium or a magnetic recording medium, is also included. A non-transitory recording medium includes any recording medium other than a transitory, propagating signal, and volatile recording media are not excluded. In a configuration in which a distribution device distributes the program via the communication network 200, the recording medium storing the program in that distribution device corresponds to the above-mentioned non-transitory recording medium.
E: Supplementary Notes
 For example, the following configurations can be derived from the forms exemplified above.
 A sound generation method according to one aspect (Aspect 1) of the present disclosure obtains a first control data string representing characteristics of a note string and a second control data string representing characteristics of a text corresponding to the note string, and generates an acoustic data string representing a musical instrument sound of the note string having acoustic characteristics according to the characteristics of the text represented by the second control data string, by processing the first control data string and the second control data string with a trained first generative model.
 In this aspect, in addition to the first control data string representing the characteristics of the note string, the second control data string representing the characteristics of the text corresponding to that note string is used to generate the acoustic data string. Therefore, compared with a configuration that generates the acoustic data string from the first control data string alone, an acoustic data string of musical instrument sounds having diverse acoustic characteristics according to the text corresponding to the note string can be generated.
 The "first control data string" is data of any format (first control data) representing the characteristics of a note string, and is generated, for example, from a note data string representing the note string. The first control data string may also be generated from a note data string generated in real time in response to operations on an input device such as an electronic musical instrument. The "first control data string" may also be described as data specifying the conditions of the musical instrument sound to be synthesized. For example, the "first control data string" specifies various conditions regarding each note constituting the note string, such as the pitch or duration of each note, or the relationship between the pitch of one note and the pitches of other notes located around that note. A "musical instrument sound" is a musical sound produced from an instrument by performance.
 The "first generative model" is a trained model that has learned, by machine learning, the relationship between the first and second control data strings and the acoustic data string. A plurality of pieces of training data are used in the machine learning of the first generative model. Each piece of training data includes a pair of a first training control data string and a second training control data string, together with a training acoustic data string. The first training control data string is data representing the characteristics of a reference note string, and the second training control data string is data representing the characteristics of a text corresponding to the reference note string. The training acoustic data string represents the musical instrument sound produced by a performance based on the note string corresponding to the first training control data string and the text corresponding to the second training control data string. Various statistical estimation models, such as a deep neural network (DNN), a hidden Markov model (HMM), or a support vector machine (SVM), may be used as the "first generative model".
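 As one way to picture the machine learning described above, the following minimal sketch runs a single training step on a stand-in model; the network, the L1 loss, and all dimensions are assumptions for illustration and not the training configuration of the present disclosure.

```python
import torch
import torch.nn as nn

# Stand-in first generative model: training control data strings (Xt, Yt) mapped
# to an acoustic data string and compared with the training acoustic data string Zt.
model_mb = nn.Sequential(nn.Linear(64 + 266, 512), nn.ReLU(), nn.Linear(512, 80))
optimizer = torch.optim.Adam(model_mb.parameters(), lr=1e-4)

def training_step(xt, yt, zt):
    c = torch.cat([xt, yt], dim=-1)     # training control data string
    z_pred = model_mb(c)                # estimated acoustic data string
    loss = nn.functional.l1_loss(z_pred, zt)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

loss = training_step(torch.randn(100, 64), torch.randn(100, 266), torch.randn(100, 80))
print(loss)
```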
 The form in which the first control data string and the second control data string are input to the first generative model is arbitrary. For example, input data including the first control data string and the second control data string may be input to the first generative model. In a configuration in which the first generative model includes an input layer, a plurality of intermediate layers, and an output layer, a form in which the first control data string is input to the input layer and the second control data string is input to an intermediate layer is also conceivable. That is, combining the first control data string and the second control data string is not essential.
 The "acoustic data string" is data of any format (acoustic data) representing the musical instrument sound. For example, data representing acoustic characteristics (frequency characteristics) such as an intensity spectrum, a mel spectrum, or MFCCs (Mel-Frequency Cepstrum Coefficients) is one example of the "acoustic data string". A sample sequence representing the waveform of the musical instrument sound may also be generated as the "acoustic data string".
 "Text corresponding to a note string" means that a text is associated with the note string. That is, the "correspondence" between a note string and a text means, for example, that each note of the note string is temporally associated with each phrase of the text.
 In a specific example of Aspect 1 (Aspect 2), the first generative model is a model trained using training data including a first training control data string representing characteristics of a reference note string, a second training control data string representing characteristics of a text corresponding to the reference note string, and a training acoustic data string representing a musical instrument sound of the reference note string. According to this aspect, an acoustic data string that is statistically valid in view of the relationship between the first and second training control data strings of the reference note string and the training acoustic data string representing the musical instrument sound of that reference note string can be generated.
 In a specific example of Aspect 1 or Aspect 2 (Aspect 3), the second control data string includes a phrase vector string representing phrases included in the text. According to this aspect, because the second control data string includes a phrase vector string representing the phrases in the text, an acoustic data string of musical instrument sounds in which the meanings of the phrases in the text are reflected in the acoustic characteristics can be generated.
 The "phrase vector string" is a sequence of vectors (phrase vectors) defined in a linguistic space (semantic space) according to the meanings of phrases. A "phrase" is a single word or an arrangement of plural words. For generating the phrase vector string, a statistical estimation model such as those described in Tomas Mikolov et al., "Efficient Estimation of Word Representations in Vector Space," arXiv:1301.3781 [cs.CL], 2013 (Word2Vec), or Quoc Le and Tomas Mikolov, "Distributed Representations of Sentences and Documents," CoRR, abs/1405.4053, pp. 1-9, 2014 (Doc2Vec), may be used.
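 A minimal sketch of obtaining phrase vectors with Word2Vec via the gensim package, as one possible realization of the cited statistical estimation model; the toy corpus and the vector size are assumptions for illustration, and a real system would train on a large corpus.

```python
from gensim.models import Word2Vec  # assumes the gensim package

corpus = [["spring", "rain", "falls"], ["cherry", "blossoms", "in", "spring"]]
w2v = Word2Vec(sentences=corpus, vector_size=256, min_count=1, epochs=50)

vector = w2v.wv["spring"]     # 256-dimensional phrase vector for one word
print(vector.shape)           # (256,)
```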
 In a specific example of Aspect 3 (Aspect 4), in obtaining the second control data string, the phrase vector string is generated by a trained second generative model. According to this aspect, the second control data string can be generated simply by using the second generative model.
 In a specific example of any one of Aspects 1 to 4 (Aspect 5), the second control data string includes phoneme data representing the phonemes constituting the text. According to this aspect, because the second control data string includes phoneme data representing the phonemes constituting the text, an acoustic data string of musical instrument sounds in which non-linguistic characteristics of the pronunciation of the phonemes in the text (for example, characteristics in the time domain or the frequency domain) are reflected in the acoustic characteristics can be generated.
 In a specific example of any one of Aspects 1 to 5 (Aspect 6), the acquisition of the first control data and the second control data and the generation of the acoustic data are executed in each of a plurality of unit periods on a time axis.
 A sound generation system according to one aspect (Aspect 7) of the present disclosure includes: a control data string acquisition unit that obtains a first control data string representing characteristics of a note string and a second control data string representing characteristics of a text corresponding to the note string; and an acoustic data string generation unit that generates an acoustic data string representing a musical instrument sound of the note string having acoustic characteristics according to the characteristics of the text represented by the second control data string, by processing the first control data string and the second control data string with a trained first generative model.
 A program according to one aspect (Aspect 8) of the present disclosure causes a computer system to function as: a control data string acquisition unit that obtains a first control data string representing characteristics of a note string and a second control data string representing characteristics of a text corresponding to the note string; and an acoustic data string generation unit that generates an acoustic data string representing a musical instrument sound of the note string having acoustic characteristics according to the characteristics of the text represented by the second control data string, by processing the first control data string and the second control data string with a trained first generative model.
100: information system; 10: sound generation system; 11: control device; 12: storage device; 13: communication device; 14: operation device; 15: sound emitting device; 20: machine learning system; 21: control device; 22: storage device; 23: communication device; 30: control data string acquisition unit; 31: first generation unit; 32: second generation unit; 321: language analysis unit; 322, 326: information generation unit; 326: phoneme analysis unit; 33: acoustic data string generation unit; 34: signal generation unit; 41: training data acquisition unit; 42: learning processing unit.

Claims (8)

  1.  A sound generation method realized by a computer system, the method comprising:
     obtaining a first control data string representing characteristics of a note string and a second control data string representing characteristics of a text corresponding to the note string; and
     generating, by processing the first control data string and the second control data string with a trained first generative model, an acoustic data string representing a musical instrument sound of the note string having acoustic characteristics according to the characteristics of the text represented by the second control data string.
  2.  The sound generation method according to claim 1, wherein the first generative model is a model trained using training data including:
     a first training control data string representing characteristics of a reference note string and a second training control data string representing characteristics of a text corresponding to the reference note string; and
     a training acoustic data string representing a musical instrument sound of the reference note string.
  3.  The sound generation method according to claim 1 or claim 2, wherein the second control data string includes a phrase vector string representing phrases included in the text.
  4.  The sound generation method according to claim 3, wherein, in obtaining the second control data string, the phrase vector string is generated by a trained second generative model.
  5.  The sound generation method according to any one of claims 1 to 4, wherein the second control data string includes phoneme data representing phonemes constituting the text.
  6.  The sound generation method according to any one of claims 1 to 5, wherein, in each of a plurality of unit periods on a time axis,
     acquisition of individual first control data and second control data in the obtaining of the first control data string and the second control data string, and
     generation of individual acoustic data in the generating of the acoustic data string, are executed.
  7.  A sound generation system comprising:
     a control data string acquisition unit that obtains a first control data string representing characteristics of a note string and a second control data string representing characteristics of a text corresponding to the note string; and
     an acoustic data string generation unit that generates, by processing the first control data string and the second control data string with a trained first generative model, an acoustic data string representing a musical instrument sound of the note string having acoustic characteristics according to the characteristics of the text represented by the second control data string.
  8.  A program that causes a computer system to function as:
     a control data string acquisition unit that obtains a first control data string representing characteristics of a note string and a second control data string representing characteristics of a text corresponding to the note string; and
     an acoustic data string generation unit that generates, by processing the first control data string and the second control data string with a trained first generative model, an acoustic data string representing a musical instrument sound of the note string having acoustic characteristics according to the characteristics of the text represented by the second control data string.
PCT/JP2023/007783 2022-03-09 2023-03-02 Sound generation method, sound generation system, and program WO2023171522A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2022036293A JP2023131494A (en) 2022-03-09 2022-03-09 Sound generation method, sound generation system and program
JP2022-036293 2022-03-09

Publications (1)

Publication Number Publication Date
WO2023171522A1 true WO2023171522A1 (en) 2023-09-14

Family

ID=87935315

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2023/007783 WO2023171522A1 (en) 2022-03-09 2023-03-02 Sound generation method, sound generation system, and program

Country Status (2)

Country Link
JP (1) JP2023131494A (en)
WO (1) WO2023171522A1 (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH06175658A (en) * 1992-12-02 1994-06-24 Matsushita Electric Ind Co Ltd Electronic musical instrument
JPH06332443A (en) * 1993-05-26 1994-12-02 Matsushita Electric Ind Co Ltd Score recognizing device
JPH0895566A (en) * 1994-09-27 1996-04-12 Yamaha Corp Automatic accompaniment device
JP2000227794A (en) * 1999-02-08 2000-08-15 Yamaha Corp Musical sound outputting device and recording medium therefor

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"Iwanami Course Multimedia Informatics 4, Information processing of letters and sounds, first edition", 1 January 2000, IWANAMI SHOTEN, PUBLISHERS, ISBN: 4-00-010964-2, article NAGAO, MAKOTO: "Automatic performance and musical interpretation", pages: 195 - 206, XP009548571 *
ZHANG YIXIAO, WANG ZIYU, WANG DINGSU, XIA GUS: "BUTTER: A Representation Learning Framework for Bi-directional Music-Sentence Retrieval and Generation", PROCEEDINGS OF THE 1ST WORKSHOP ON NLP FOR MUSIC AND AUDIO (NLP4MUSA), 16 October 2020 (2020-10-16), pages 54 - 58, XP093089574, Retrieved from the Internet <URL:https://aclanthology.org/2020.nlp4musa-1.11.pdf> [retrieved on 20231008] *

Also Published As

Publication number Publication date
JP2023131494A (en) 2023-09-22

Similar Documents

Publication Publication Date Title
JP6547878B1 (en) Electronic musical instrument, control method of electronic musical instrument, and program
JP6610715B1 (en) Electronic musical instrument, electronic musical instrument control method, and program
JP6610714B1 (en) Electronic musical instrument, electronic musical instrument control method, and program
KR101274961B1 (en) music contents production system using client device.
JP7088159B2 (en) Electronic musical instruments, methods and programs
JP2011048335A (en) Singing voice synthesis system, singing voice synthesis method and singing voice synthesis device
JP2016161919A (en) Voice synthesis device
JP2022071098A (en) Electronic musical instrument, method, and program
WO2020095950A1 (en) Information processing method and information processing system
CN113160780A (en) Electronic musical instrument, method and storage medium
JP6737320B2 (en) Sound processing method, sound processing system and program
WO2023171522A1 (en) Sound generation method, sound generation system, and program
JP6835182B2 (en) Electronic musical instruments, control methods for electronic musical instruments, and programs
JP6819732B2 (en) Electronic musical instruments, control methods for electronic musical instruments, and programs
JP6801766B2 (en) Electronic musical instruments, control methods for electronic musical instruments, and programs
WO2023171497A1 (en) Acoustic generation method, acoustic generation system, and program
WO2020171035A1 (en) Sound signal synthesis method, generative model training method, sound signal synthesis system, and program
WO2022080395A1 (en) Audio synthesizing method and program
US20230290325A1 (en) Sound processing method, sound processing system, electronic musical instrument, and recording medium
JP2022065554A (en) Method for synthesizing voice and program
JP2022065566A (en) Method for synthesizing voice and program
JP2020184092A (en) Information processing method
JP2013238664A (en) Speech fragment segmentation device
JP2004294795A (en) Tone synthesis control data, recording medium recording the same, data generating device, program, and tone synthesizer

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23766698

Country of ref document: EP

Kind code of ref document: A1