WO2023171522A1 - Sound generation method, sound generation system, and program - Google Patents

Sound generation method, sound generation system, and program Download PDF

Info

Publication number
WO2023171522A1
WO2023171522A1 PCT/JP2023/007783
Authority
WO
WIPO (PCT)
Prior art keywords
string
data string
control data
note
acoustic
Prior art date
Application number
PCT/JP2023/007783
Other languages
French (fr)
Japanese (ja)
Inventor
方成 西村
Original Assignee
Yamaha Corporation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yamaha Corporation
Publication of WO2023171522A1 publication Critical patent/WO2023171522A1/en

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H 1/00 Details of electrophonic musical instruments

Definitions

  • the present disclosure relates to a technique for generating an acoustic data string representing musical instrument sounds.
  • Non-Patent Document 1 discloses a technique for generating a synthesized sound corresponding to a string of musical notes using a trained generative model.
  • one aspect of the present disclosure aims to generate an acoustic data string of musical instrument sounds having various acoustic characteristics.
  • A sound generation method according to one aspect of the present disclosure acquires a first control data string representing the characteristics of a note string and a second control data string representing the characteristics of a text corresponding to the note string, and processes the first control data string and the second control data string with a trained first generative model, thereby generating an acoustic data string representing the instrument sound of the note string having acoustic characteristics corresponding to the characteristics of the text represented by the second control data string.
  • A sound generation system according to one aspect of the present disclosure includes: a control data string acquisition unit that acquires a first control data string representing the characteristics of a note string and a second control data string representing the characteristics of a text corresponding to the note string; and an acoustic data string generation unit that processes the first control data string and the second control data string with a trained first generative model, thereby generating an acoustic data string representing the instrument sound of the note string having acoustic characteristics corresponding to the characteristics of the text represented by the second control data string.
  • A program according to one aspect of the present disclosure causes a computer system to function as: a control data string acquisition unit that acquires a first control data string representing the characteristics of a note string and a second control data string representing the characteristics of a text corresponding to the note string; and an acoustic data string generation unit that processes the first control data string and the second control data string with a trained first generative model, thereby generating an acoustic data string representing the instrument sound of the note string having acoustic characteristics corresponding to the characteristics of the text represented by the second control data string.
  • FIG. 1 is a block diagram illustrating the configuration of an information system in a first embodiment.
  • FIG. 2 is a block diagram illustrating the functional configuration of a sound generation system.
  • FIG. 3 is an explanatory diagram of the operation of a control data string acquisition unit.
  • FIG. 4 is a block diagram illustrating the configuration of a second generation unit.
  • FIG. 5 is a flowchart illustrating the detailed procedure of synthesis processing.
  • FIG. 6 is a block diagram illustrating the functional configuration of a machine learning system.
  • FIG. 7 is a flowchart illustrating the detailed steps of learning processing.
  • FIG. 8 is an explanatory diagram of the operation of the control data string acquisition unit in a second embodiment.
  • FIG. 9 is a schematic diagram of phoneme data.
  • FIG. 10 is a schematic diagram of a second control data string in a third embodiment.
  • FIG. 11 is an explanatory diagram of a generative model in a modified example.
  • FIG. 12 is a block diagram illustrating the functional configuration of a sound generation system in a modified example.
  • FIG. 13 is a block diagram illustrating the functional configuration of a sound generation system in a modified example.
  • FIG. 14 is an explanatory diagram of the operation of a control data string acquisition unit in a modified example.
  • FIG. 1 is a block diagram illustrating the configuration of an information system 100 according to a first embodiment.
  • the information system 100 includes a sound generation system 10 and a machine learning system 20.
  • the sound generation system 10 and the machine learning system 20 communicate with each other via a communication network 200 such as the Internet, for example.
  • the sound generation system 10 is a computer system that generates performance sounds (hereinafter referred to as "target sounds") of a specific musical piece.
  • the target sound in the first embodiment is an instrument sound having a musical instrument tone.
  • the sound generation system 10 includes a control device 11, a storage device 12, a communication device 13, an operating device 14, and a sound emitting device 15.
  • The sound generation system 10 is realized by, for example, an information terminal such as a smartphone, a tablet terminal, or a personal computer. Note that the sound generation system 10 may be realized not only as a single device but also as a plurality of devices configured separately from each other.
  • the control device 11 is composed of one or more processors that control each element of the sound generation system 10.
  • The control device 11 is configured by one or more types of processors, such as a CPU (Central Processing Unit), GPU (Graphics Processing Unit), SPU (Sound Processing Unit), DSP (Digital Signal Processor), FPGA (Field Programmable Gate Array), or ASIC (Application Specific Integrated Circuit).
  • the control device 11 generates an acoustic signal A representing the waveform of the target sound.
  • the storage device 12 is one or more memories that store programs executed by the control device 11 and various data used by the control device 11.
  • the storage device 12 is composed of a known recording medium such as a magnetic recording medium or a semiconductor recording medium.
  • The storage device 12 may be configured by a combination of multiple types of recording media. Note that a portable recording medium that can be attached to and detached from the sound generation system 10, or a recording medium that the control device 11 can access via the communication network 200 (for example, cloud storage), may also be used as the storage device 12.
  • the storage device 12 stores music data D representing music.
  • the music data D includes musical score data G and a word string T.
  • Music score data G specifies the time series of notes that make up the music piece. Specifically, the musical score data G specifies a pitch and a sounding period for each of a plurality of notes of a song. The sound production period is specified by, for example, the starting point and duration of the note.
  • the word string T specifies text corresponding to a song. Specifically, the word string T specifies one or more characters for each of a plurality of musical notes in a song.
  • a word string T is composed of a plurality of characters corresponding to different musical notes.
  • a music file compliant with the MIDI (Musical Instrument Digital Interface) standard is used as the music data D.
  • the music data D may specify information such as performance symbols that represent musical expressions.
  • the communication device 13 communicates with the machine learning system 20 via the communication network 200. Note that a communication device 13 separate from the sound generation system 10 may be connected to the sound generation system 10 by wire or wirelessly.
  • the operating device 14 is an input device that accepts operations by the user. For example, an operator operated by a user or a touch panel that detects a touch by a user is used as the operating device 14.
  • the sound emitting device 15 reproduces the target sound represented by the acoustic signal A.
  • the sound emitting device 15 is, for example, a speaker or headphones. Note that a D/A converter that converts the audio signal A from digital to analog and an amplifier that amplifies the audio signal A are not shown for convenience. Further, a sound emitting device 15 that is separate from the sound generation system 10 may be connected to the sound generation system 10 by wire or wirelessly.
  • FIG. 2 is a block diagram illustrating the functional configuration of the sound generation system 10.
  • By executing the program stored in the storage device 12, the control device 11 realizes a plurality of functions (control data string acquisition unit 30, acoustic data string generation unit 33, and signal generation unit 34) for generating the acoustic signal A.
  • FIG. 3 is an explanatory diagram of the operation of the control data string acquisition section 30.
  • the control data string acquisition unit 30 obtains a first control data string X and a second control data string Y. Specifically, the control data string acquisition unit 30 obtains the first control data string X and the second control data string Y in each of a plurality of unit periods U on the time axis.
  • Each unit period U is a period (the hop size of a frame window) that is sufficiently short compared to the duration of each note of the song. For example, the hop size is 2-20 ms and the window size is 20-60 ms, the window size being 2-20 times the hop size (that is, the window is longer than the hop).
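  • As a minimal illustration of the frame layout described above, the following sketch computes per-unit-period frame boundaries; the sampling rate and the specific hop and window sizes are assumed values chosen within the ranges mentioned, not values mandated by the disclosure.

```python
SAMPLE_RATE = 48_000       # assumed sampling rate
HOP_SIZE_SEC = 0.005       # 5 ms unit period U (within the 2-20 ms range above)
WINDOW_SIZE_SEC = 0.040    # 40 ms frame window (within the 20-60 ms range above)

hop = int(HOP_SIZE_SEC * SAMPLE_RATE)        # samples per unit period U
window = int(WINDOW_SIZE_SEC * SAMPLE_RATE)  # samples per frame window

def frame_bounds(num_samples: int):
    """Yield (start, end) sample indices of each frame window, one per unit period U."""
    for start in range(0, num_samples - window + 1, hop):
        yield start, start + window

# Example: the number of unit periods U in a 3-second excerpt.
print(sum(1 for _ in frame_bounds(3 * SAMPLE_RATE)))
```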
  • the control data string acquisition unit 30 of the first embodiment includes a first generation unit 31 and a second generation unit 32.
  • the first generation unit 31 generates the first control data X from the note data string N for each unit period U.
  • the musical note data string N used for generation is a portion of the musical score data G that corresponds to each unit period U.
  • The note data string N corresponding to any one unit period U is the portion of the note data of the music data D that includes the note data of the note containing that unit period U (hereinafter referred to as the "target note"). That is, the note data string N specifies a note string of the music data D that includes the target note and at least one of the preceding note and the following note.
  • the individual first control data X is data in any format that represents the characteristics of the note string specified by the note data string N.
  • The first control data X in any one unit period U is information indicating the characteristics of the target note, that is, of the note among the plurality of notes of the music piece whose note data covers that unit period U.
  • The characteristics indicated by the first control data string X include the characteristics of the note containing the unit period (for example, pitch and, optionally, duration).
  • The first control data string X also includes information regarding notes other than the target note.
  • Specifically, the first control data string X includes characteristics (for example, pitch) indicated by the note data of at least one of the notes immediately before and after the note containing the unit period.
  • The first control data string X may also include the pitch difference between the target note and the note immediately before or after it. Furthermore, if the position before or after the target note is not a note but a rest, the characteristics of the rest may be included instead of those of a note.
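  • The following is a minimal sketch of how the first control data X of one unit period might be assembled from the target note and its neighbors; the field layout, the rest-handling sentinel, and the class names are illustrative assumptions rather than the patent's exact encoding.

```python
from dataclasses import dataclass
from typing import Optional
import numpy as np

@dataclass
class Note:
    pitch: int        # MIDI note number
    duration: float   # duration in seconds

REST_PITCH = 0  # assumed sentinel used when the neighboring position is a rest

def first_control_data(prev: Optional[Note], target: Note, nxt: Optional[Note]) -> np.ndarray:
    """Assemble one first-control-data vector X for a unit period inside the target note."""
    prev_pitch = prev.pitch if prev else REST_PITCH
    next_pitch = nxt.pitch if nxt else REST_PITCH
    return np.array([
        target.pitch,               # pitch of the target note
        target.duration,            # optional duration of the target note
        target.pitch - prev_pitch,  # pitch difference to the preceding note (or rest)
        next_pitch - target.pitch,  # pitch difference to the following note (or rest)
    ], dtype=np.float32)

x = first_control_data(Note(60, 0.5), Note(64, 0.25), None)  # last note of a phrase
```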
  • the first generation unit 31 generates the first control data string X by performing predetermined arithmetic processing on the note data string N.
  • the first generation unit 31 may generate the first control data string X using a generation model configured with a deep neural network (DNN) or the like.
  • the generation model is a statistical estimation model in which the relationship between the musical note data string N and the first control data string X is learned by machine learning.
  • the first control data string X is data that specifies the musical conditions of the target sound that the sound generation system 10 should generate.
  • The second generation unit 32 generates, from the word string T, the second control data Y required for the current unit period U, either in synchronization with the unit period U or in advance of the progression of the unit periods.
  • The second control data Y of each unit period U is data in any format that represents the characteristics of the phrase in the word string T that corresponds to that unit period U.
  • the second control data Y includes a word vector V of the word included in the word string T.
  • the phrase vector V is a vector representing the position of each phrase in the semantic space. The closer the meanings of multiple words are, the closer the positions of the word vectors V of those words are in the semantic space.
  • the phrase represented by the phrase vector V is composed of one or more words. That is, the phrase vector V is data representing the characteristics of one word or one phrase (time series of a plurality of words) in the word string T.
  • FIG. 4 is a block diagram illustrating the configuration of the second generation unit 32.
  • the second generation section 32 includes a language analysis section 321 and an information generation section 322.
  • The language analysis unit 321 divides the word string T into a plurality of words by natural language processing such as morphological analysis.
  • the language analysis unit 321 sequentially generates phrase data Q.
  • the phrase data Q is data that identifies a phrase made up of one or more words in the word string T, or data that represents a character string of the phrase.
  • the information generation unit 322 generates a phrase vector sequence V for the phrase represented by the phrase data Q. As illustrated in FIG. 3, in each unit period U within the period corresponding to one word in the song, the word vector V of the word is repeatedly used as the second control data Y. Note that a zero vector is generated as the second control data Y in each unit period U within a period in which a musical note or word string T is not set.
  • the generation model Ma is used to generate the phrase vector sequence V by the information generation unit 322.
  • the generative model Ma is a trained model in which the latent relationship between the word data Q as an input and the word vector sequence V in the semantic space as an output is learned by machine learning.
  • the generative model Ma outputs a word vector sequence V in response to input word data Q.
  • the information generation unit 322 generates a word vector V of each word by processing the word data Q using the trained generative model Ma, and outputs it as a word vector sequence V in the corresponding unit period.
  • the second generation unit 32 generates a phrase vector sequence V representing the words included in the word sequence T as the second control data sequence Y using the generation model Ma. According to the above configuration, the second control data string Y can be easily generated using the generation model Ma.
  • the generative model Ma is an example of a "second generative model.”
  • The generative model Ma of the first embodiment is, for example, a statistical estimation model such as a deep neural network. To generate the phrase vector V of a phrase (that is, a sentence composed of multiple words), for example, the model described in Quoc Le, Tomas Mikolov, "Distributed Representations of Sentences and Documents," CoRR, abs/1405.4053, p.1-9, 2014 (Doc2Vec) is used.
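  • As a minimal sketch of how a phrase vector V could be obtained with a Doc2Vec-style model, the example below uses the gensim library as one possible implementation; the tiny corpus, vector size, and tokenization are placeholders, and in practice the model would be trained on a large text corpus.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Placeholder corpus for illustration only.
corpus = [
    TaggedDocument(words=["gentle", "morning", "breeze"], tags=["0"]),
    TaggedDocument(words=["stormy", "night", "sea"], tags=["1"]),
]
model_ma = Doc2Vec(corpus, vector_size=32, min_count=1, epochs=50)

# Phrase data Q (a tokenized phrase of the word string T) -> phrase vector V.
phrase_tokens = ["gentle", "breeze"]
phrase_vector_v = model_ma.infer_vector(phrase_tokens)  # reused as second control data Y
```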
  • control data C is generated for each unit period U through the above processing by the control data string acquisition unit 30.
  • The control data C for each unit period U includes the first control data X generated by the first generation unit 31 for that unit period U and the second control data Y generated by the second generation unit 32 for that unit period U.
  • the control data C is, for example, data obtained by concatenating the first control data X and the second control data Y.
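  • A minimal sketch of that concatenation for one unit period U follows; the dimensions are illustrative.

```python
import numpy as np

def control_data(x: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Control data C for one unit period U: concatenation of first and second control data."""
    return np.concatenate([x, y]).astype(np.float32)

x = np.zeros(4, dtype=np.float32)   # first control data X (note-string features)
y = np.zeros(32, dtype=np.float32)  # second control data Y (e.g., a phrase vector V)
c = control_data(x, y)              # supplied to the generative model Mb for this unit period
```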
  • The performance of a musical instrument is basically defined by the string of notes on the musical score.
  • However, it was confirmed that, even when the note strings played by performers on musical instruments are the same, if the text added to the note strings differs, the musical expression of the instrument sounds produced by the performance also differs.
  • That is, while the musical expression of singing voices naturally depends on the text (that is, the lyrics), the musical expression of instrument sounds, which is generally assumed to be unaffected by text, in fact also tends to depend on the text.
  • In view of the above tendency, in the first embodiment, both the first control data string X representing the characteristics of the note string and the second control data string Y representing the characteristics of the word string T corresponding to the note string are used to generate the acoustic signal A of the target sound.
  • the acoustic data string generation unit 33 in FIG. 2 generates an acoustic data string Z using the control data string C (first control data string X and second control data string Y).
  • the acoustic data string Z is data in any format representing the target sound.
  • The acoustic data string Z represents the instrument sound of the note string represented by the first control data string X, with acoustic characteristics corresponding to the word string T represented by the second control data string Y. That is, an instrument sound such as would be produced if a performer played the note string on an instrument with the word string T in mind is generated as the target sound.
  • the acoustic data string Z is data representing the envelope of the frequency spectrum of the target sound.
  • From the control data C of each unit period U, the acoustic data Z corresponding to that unit period U is generated.
  • Each piece of acoustic data Z corresponds to a waveform sample sequence for one frame window longer than a unit period.
  • the acquisition of the control data C by the control data string acquisition section 30 and the generation of the acoustic data Z by the acoustic data string generation section 33 are executed every unit period U.
  • the generation model Mb is used to generate the acoustic data string Z by the acoustic data string generation unit 33.
  • the generative model Mb estimates acoustic data Z for each unit period according to the control data C for that unit period.
  • the generative model Mb is a learned model in which the latent relationship between the control data string C as an input and the acoustic data string Z as an output is learned by machine learning. That is, the generative model Mb outputs the acoustic data string Z that is statistically valid for the control data string C from the viewpoint of the relationship.
  • the acoustic data string generation unit 33 generates acoustic data Z for each unit period U by processing the control data C using the generation model Mb.
  • The generative model Mb is realized by a combination of a program that causes the control device 11 to execute the calculation that generates the acoustic data string Z from the control data string C, and a plurality of variables (weights and biases) applied to that calculation.
  • a program and a plurality of variables that realize the generative model Mb are stored in the storage device 12.
  • a plurality of variables of the generative model Mb are set in advance by machine learning.
  • the generative model Mb is an example of a "first generative model.”
  • the generative model Mb is composed of, for example, a deep neural network.
  • a deep neural network such as a recurrent neural network (RNN) or a convolutional neural network (CNN) is used as the generative model Mb.
  • the generative model Mb may be configured by a combination of multiple types of deep neural networks. Additionally, additional elements such as long short-term memory (LSTM) or attention may be included in the generative model Mb.
  • the signal generation unit 34 generates the acoustic signal A of the target sound from the time series of the acoustic data string Z.
  • The signal generation unit 34 converts each acoustic data Z into a time-domain waveform signal by a calculation including, for example, an inverse discrete Fourier transform, and generates the acoustic signal A by concatenating (overlap-adding) the waveform signals of successive unit periods U.
  • Alternatively, the signal generation unit 34 may generate the acoustic signal A from the acoustic data string Z using, for example, a deep neural network (a so-called neural vocoder) that has learned the relationship between the acoustic data string Z and each sample of the acoustic signal A.
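  • A minimal sketch of the overlap-add reconstruction described above follows, assuming each acoustic data Z has already been converted into a windowed time-domain frame (for example by an inverse discrete Fourier transform, which is outside this sketch).

```python
import numpy as np

def overlap_add(frames: np.ndarray, hop: int) -> np.ndarray:
    """Concatenate per-unit-period waveform frames into one acoustic signal A.

    frames has shape (num_unit_periods, window_size); each frame spans a time
    longer than the unit period, so successive frames overlap by window - hop samples.
    """
    num_frames, window = frames.shape
    out = np.zeros((num_frames - 1) * hop + window)
    for i, frame in enumerate(frames):
        start = i * hop
        out[start:start + window] += frame  # overlapping regions are summed
    return out

signal_a = overlap_add(np.random.randn(100, 1920), hop=240)  # illustrative sizes
```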
  • the target sound is reproduced from the sound emitting device 15 by supplying the acoustic signal A generated by the signal generating unit 34 to the sound emitting device 15.
  • FIG. 5 is a flowchart illustrating the detailed procedure of the process (hereinafter referred to as "synthesis process") Sa in which the control device 11 generates the acoustic signal A.
  • The synthesis process Sa is executed in each of the plurality of unit periods U.
  • When the synthesis process Sa is started, the control device 11 (control data string acquisition unit 30) acquires the music data D from the storage device 12 (Sa1). The control device 11 (first generation unit 31) generates the first control data X for the unit period U from the note data string N corresponding to that unit period U of the musical score data G of the music data D (Sa2). Further, the control device 11 (second generation unit 32) generates the second control data Y for the unit period U from the word string T of the music data D (Sa3). Note that the order of the generation of the first control data X (Sa2) and the generation of the second control data Y (Sa3) may be reversed.
  • The control device 11 (acoustic data string generation unit 33) generates the acoustic data Z for the unit period U by processing the control data C, which includes the first control data X and the second control data Y, using the generative model Mb (Sa4).
  • the control device 11 (signal generation unit 34) generates the acoustic signal A of the unit period U from the acoustic data Z (Sa5). From the acoustic data string Z of each unit period, a signal that spans a time longer than the unit period is generated, and by overlapping and adding these signals, an acoustic signal A that spans a plurality of unit periods is generated.
  • the time difference (hop size) between the previous and subsequent frame windows corresponds to a unit period.
  • the control device 11 reproduces the target sound by supplying the acoustic signal A to the sound emitting device 15 (Sa6).
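  • The following is a minimal sketch of the per-unit-period synthesis process Sa with the individual steps abstracted behind placeholder callables; the function names and dimensions are assumptions for illustration, not the disclosure's API.

```python
import numpy as np

def synthesis_process_sa(num_unit_periods, gen_x, gen_y, model_mb, synthesize, play):
    """Run the synthesis process Sa once per unit period U (steps Sa2-Sa6)."""
    for u in range(num_unit_periods):
        x = gen_x(u)        # Sa2: first control data X from the note data string N
        y = gen_y(u)        # Sa3: second control data Y from the word string T
        z = model_mb(x, y)  # Sa4: acoustic data Z via the generative model Mb
        a = synthesize(z)   # Sa5: waveform segment of the acoustic signal A
        play(a)             # Sa6: supply the segment to the sound emitting device

# Trivial stand-ins so the sketch runs end to end.
synthesis_process_sa(
    num_unit_periods=10,
    gen_x=lambda u: np.zeros(4, dtype=np.float32),
    gen_y=lambda u: np.zeros(32, dtype=np.float32),
    model_mb=lambda x, y: np.zeros(80, dtype=np.float32),
    synthesize=lambda z: np.zeros(240, dtype=np.float32),
    play=lambda a: None,
)
```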
  • As described above, in the first embodiment, the second control data string Y representing the characteristics of the word string T corresponding to the note string is used, in addition to the first control data string X, to generate the acoustic data string Z. Therefore, compared to a configuration in which the acoustic data string Z is generated only from the first control data string X, it is possible to generate an acoustic data string Z of a target sound having a variety of acoustic characteristics that depend on the word string T corresponding to the note string. For example, even if the note data string N is the same, acoustic data strings Z of target sounds having different acoustic characteristics can be generated by changing the word string T.
  • the second control data string Y includes a word vector string V representing words in the word string T. That is, the phrase vector sequence V reflecting the meaning of the word sequence T is used as the second control data sequence Y. Therefore, it is possible to generate the acoustic data string Z of the target sound in which the meanings of the words in the word string T are reflected in the acoustic characteristics.
  • the machine learning system 20 in FIG. 1 is a computer system that establishes a generative model Mb used by the sound generation system 10 by machine learning.
  • the machine learning system 20 includes a control device 21, a storage device 22, and a communication device 23.
  • the control device 21 is composed of one or more processors that control each element of the machine learning system 20.
  • the control device 21 is configured by one or more types of processors such as a CPU, GPU, SPU, DSP, FPGA, or ASIC.
  • the storage device 22 is one or more memories that store programs executed by the control device 21 and various data used by the control device 21.
  • the storage device 22 is composed of a known recording medium such as a magnetic recording medium or a semiconductor recording medium.
  • The storage device 22 may be configured by a combination of multiple types of recording media. Note that a portable recording medium that can be attached to and detached from the machine learning system 20, or a recording medium that the control device 21 can access via the communication network 200 (for example, cloud storage), may also be used as the storage device 22.
  • the communication device 23 communicates with the sound generation system 10 via the communication network 200. Note that a communication device 23 separate from the machine learning system 20 may be connected to the machine learning system 20 by wire or wirelessly.
  • FIG. 6 is an explanatory diagram of the function of the machine learning system 20 to establish the generative model Mb.
  • the storage device 22 stores a plurality of basic data B corresponding to different songs.
  • Each of the plurality of basic data B includes music data D and reference signal R.
  • the music data D is data representing a note string of a specific music piece (hereinafter referred to as "reference music piece") that is played with the waveform represented by the reference signal R.
  • the music data D includes the musical score data G and the word string T, as described above.
  • Musical score data G specifies the time series of notes that constitute the reference music piece.
  • The word string T of the music data D specifies the text corresponding to the reference song.
  • the reference signal R is a signal representing the waveform of the musical instrument sound produced by the musical instrument when the performer plays the reference song while referring to the word string T.
  • the reference signal R is generated by recording the musical instrument sounds produced by the musical instrument under the above circumstances. After recording the reference signal R, the position of the reference signal R on the time axis is adjusted. Therefore, the instrument sound represented by the reference signal R is an instrument sound that has acoustic characteristics according to the word string T.
  • the control device 21 implements a plurality of functions (training data acquisition unit 41, learning processing unit 42) for generating the generative model Mb by executing a program stored in the storage device 22.
  • the training data acquisition unit 41 generates a plurality of training data L from a plurality of basic data B.
  • One piece of training data L is generated for each reference piece of music. Therefore, a plurality of training data L are generated from each of a plurality of basic data B corresponding to different reference songs.
  • the learning processing unit 42 establishes the generative model Mb by machine learning using a plurality of training data L.
  • Each of the plurality of training data L is composed of a combination of a training control data sequence Ct and a training audio data sequence Zt.
  • the control data string Ct is composed of a combination of a first control data string for training Xt and a second control data string for training Yt.
  • the first control data string Xt is an example of a "first training control data string”
  • the second control data string Yt is an example of a "second training control data string.”
  • the acoustic data string Zt is an example of a "training acoustic data string.”
  • For each unit period U, the training data acquisition unit 41 generates the first control data Xt for that unit period U from the note data string Nt.
  • the note data string Nt used to generate the first control data Xt for each unit period U is a part of the note data string of the musical score data G that includes the note data of the target note that includes the unit period U. That is, the note data string Nt includes note data of the target note in the reference music and note data of at least one of the previous note and the subsequent note.
  • the first control data string Xt is data representing the characteristics of the reference note string represented by the note data string Nt.
  • the training data acquisition unit 41 generates the first control data sequence Xt for each unit period U from the musical note data sequence Nt by the same process as the first generation unit 31.
  • the second control data Yt for one unit period U indicates a phrase vector V estimated for a phrase corresponding to the unit period U in the word string T.
  • the training data acquisition unit 41 generates a second control data sequence Yt for each unit period U, which indicates a phrase vector sequence V estimated from the word sequence T, by the same process as the second generation unit 32.
  • the acoustic data Zt of one unit period U represents the waveform of one frame of the reference signal R corresponding to the unit period U.
  • the training data acquisition unit 41 generates an acoustic data sequence Zt from the reference signal R.
  • The acoustic data string Zt represents the waveform of the instrument sound produced by the instrument when the reference note string corresponding to the first control data string Xt is played with reference to the phrases expressed by the second control data string Yt. That is, the acoustic data string Zt is the ground truth of the acoustic data string that the generative model Mb should output in response to the input of the control data string Ct.
  • FIG. 7 is a flowchart of a process (hereinafter referred to as "learning process") Sb in which the control device 21 establishes a generative model Mb by machine learning.
  • the learning process Sb is started in response to an instruction from the operator of the machine learning system 20.
  • the learning processing unit 42 in FIG. 6 is realized by the control device 21 executing the learning process Sb.
  • The control device 21 selects any one of the plurality of training data L (hereinafter referred to as the "selected training data L") (Sb1). As illustrated in FIG. 6, the control device 21 generates an acoustic data string Z by processing the control data string Ct of the selected training data L using an initial or provisional generative model Mb (hereinafter referred to as the "provisional model Mb0") (Sb2).
  • the control device 21 calculates a loss function representing the error between the acoustic data string Z generated by the provisional model Mb0 and the acoustic data string Zt of the selected training data L (Sb3).
  • the control device 21 updates the plurality of variables of the provisional model Mb0 so that the loss function is reduced (ideally minimized) (Sb4). For example, error backpropagation is used to update each variable according to the loss function.
  • the control device 21 determines whether a predetermined termination condition is satisfied (Sb5).
  • the termination condition is that the loss function is less than a predetermined threshold, or that the amount of change in the loss function is less than a predetermined threshold. If the termination condition is not satisfied (Sb5: NO), the control device 21 selects the unselected training data L as the new selected training data L (Sb1). That is, the process of updating a plurality of variables of the provisional model Mb0 (Sb1 to Sb4) is repeated until the end condition is satisfied (Sb5: YES). If the termination condition is satisfied (Sb5: YES), the control device 21 terminates the learning process Sb.
  • the provisional model Mb0 at the time when the termination condition is satisfied is determined as the trained generative model Mb.
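  • A minimal sketch of the learning process Sb as a gradient-descent loop follows, using PyTorch as one possible implementation; the network architecture, loss function, thresholds, and dummy training data are placeholders, not the patent's specification.

```python
import torch
from torch import nn

# Provisional model Mb0: a placeholder network mapping control data C to acoustic data Z.
model_mb0 = nn.Sequential(nn.Linear(36, 128), nn.ReLU(), nn.Linear(128, 80))
optimizer = torch.optim.Adam(model_mb0.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
LOSS_THRESHOLD, MAX_STEPS = 1e-3, 1000  # assumed termination criteria

def training_data():
    """Yield (control data string Ct, acoustic data string Zt) batches; dummy tensors here."""
    while True:
        yield torch.randn(64, 36), torch.randn(64, 80)

for step, (ct, zt) in enumerate(training_data()):  # Sb1: select training data L
    z = model_mb0(ct)                               # Sb2: generate Z with the provisional model
    loss = loss_fn(z, zt)                           # Sb3: loss between generated Z and ground truth Zt
    optimizer.zero_grad()
    loss.backward()                                 # Sb4: update variables by backpropagation
    optimizer.step()
    if loss.item() < LOSS_THRESHOLD or step >= MAX_STEPS:  # Sb5: termination condition
        break
```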
  • the generative model Mb learns the latent relationship between the input control data string Ct and the output acoustic data string Zt. Therefore, the trained generative model Mb outputs a statistically valid acoustic data sequence Z for the unknown control data sequence C from the viewpoint of the relationship.
  • the control device 21 transmits the generation model Mb established through the above processing to the sound generation system 10 from the communication device 23. Specifically, a plurality of variables defining the generative model Mb are sent to the sound generation system 10.
  • the control device 11 of the sound generation system 10 receives the generated model Mb transmitted from the machine learning system 20 through the communication device 13, and stores the generated model Mb in the storage device 12.
  • As in the first embodiment, the control device 11 in the sound generation system 10 of the second embodiment includes a control data string acquisition unit 30 that acquires the control data string C, an acoustic data string generation unit 33 that generates the acoustic data string Z from the control data string C, and a signal generation unit 34 that generates the acoustic signal A from the acoustic data string Z.
  • FIG. 8 is an explanatory diagram of the operation of the control data string acquisition unit 30 in the second embodiment.
  • the first generation unit 31 of the control data string acquisition unit 30 generates the first control data X for each unit period U from the note data string N, similarly to the first embodiment.
  • the function of the second generation unit 32 is different from that in the first embodiment.
  • the second generation unit 32 of the first embodiment generates a phrase vector sequence V representing each phrase of the word sequence T as a second control data sequence Y.
  • The second generation unit 32 of the second embodiment generates phoneme data P representing each phoneme of the word string T and outputs it as the second control data Y for each unit period within the period of that phoneme.
  • Specifically, the second generation unit 32 generates, by analyzing the word string T, phoneme data P indicating the type and period of each phoneme, and outputs the phoneme data P as the second control data Y for each unit period U.
  • the control data string C includes a first control data string X and a second control data string Y.
  • FIG. 9 is a schematic diagram of the phoneme data P.
  • the phoneme data P specifies any of a plurality of types (K types) of phonemes.
  • the phoneme data P is composed of K elements E (E1 to EK) (K is a natural number of 2 or more) corresponding to different types of phonemes.
  • The phoneme data P specifying any one type of phoneme is a one-hot vector in which, among the K elements E1 to EK, the one element E corresponding to that phoneme is set to "1" and the remaining (K-1) elements E are set to "0". Note that a one-cold vector in which the "1" and "0" of each element E are swapped may also be adopted as the phoneme data P.
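  • A minimal sketch of the one-hot phoneme data P described above follows; the phoneme inventory is an illustrative placeholder.

```python
import numpy as np

PHONEMES = ["a", "i", "u", "e", "o", "k", "s", "t", "n", "m"]  # illustrative inventory of K phonemes
K = len(PHONEMES)

def phoneme_data(phoneme: str, one_cold: bool = False) -> np.ndarray:
    """One-hot (or one-cold) vector P with K elements E1..EK for a single phoneme."""
    p = np.zeros(K, dtype=np.float32)
    p[PHONEMES.index(phoneme)] = 1.0
    return 1.0 - p if one_cold else p

p = phoneme_data("a")  # e.g. [1, 0, 0, ...], reused as second control data Y for each unit period
```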
  • the second generation unit 32 estimates the type and duration of each phoneme of the characters at each point in time included in the word string T by phoneme analysis processing, and generates phoneme data P specifying the phoneme. Any known technique may be employed for the phoneme analysis process. As illustrated in FIG. 8, in each unit period U within a period corresponding to one phoneme in a song, phoneme data P indicating the phoneme is repeatedly used as second control data Y. The boundary of the period of each phoneme of the character at each point in time in the word string T is estimated by a statistical model such as HMM (Hidden Markov Model) or SVM (Support Vector Machine).
  • the second control data string Y (phrase vector string V) of the first embodiment reflects the meaning of the word string T, but does not reflect information regarding the pronunciation of the word string T (ie, phoneme).
  • the second control data string Y (phoneme data P) of the second embodiment reflects information regarding the pronunciation of the word string T (ie, phoneme), but does not reflect the meaning of the word string T.
  • the word vector string V and the phoneme data P are comprehensively expressed as data representing the characteristics of the word string T.
  • In the second embodiment, the phrase vector string V of the first embodiment is replaced with the phoneme data P.
  • The processing by which the acoustic data string generation unit 33 generates the acoustic data string Z from the control data string C and the processing by which the signal generation unit 34 generates the acoustic signal A from the acoustic data string Z are the same as in the first embodiment.
  • the synthesis process Sa and the learning process Sb are also the same as in the first embodiment.
  • In the second embodiment as well, the second control data string Y representing the characteristics of the word string T corresponding to the note string is used to generate the acoustic data string Z. Therefore, similarly to the first embodiment, it is possible to generate an acoustic data string Z of a target sound having a variety of acoustic characteristics that depend on the word string T corresponding to the note string.
  • the second control data string Y includes phoneme data P representing phonemes in the word string T. That is, the phoneme data P reflecting the pronunciation of the word string T is used as the second control data string Y.
  • Therefore, it is possible to generate an acoustic data string Z of the target sound in which non-linguistic characteristics (for example, characteristics in the time domain or frequency domain) regarding the pronunciation of the phonemes in the word string T are reflected in the acoustic characteristics.
  • For example, a target sound is generated that gives the impression of the word string T being interpreted as onomatopoeia.
  • FIG. 10 is a schematic diagram of the second control data string Y in the third embodiment.
  • the second control data string Y includes first data Y1 and second data Y2.
  • the first data Y1 corresponds to the second control data string Y in the first embodiment
  • the second data Y2 corresponds to the second control data string Y in the second embodiment.
  • the first data Y1 is a word vector string V representing each word included in the word string T.
  • the word vector V of the word is used as the first data Y1.
  • the second data Y2 is phoneme data P representing each phoneme of the word string T.
  • The phoneme data P of the phoneme is used as the second data Y2.
  • the second control data string Y includes the first data Y1 (phrase vector string V) and the second data Y2 (phoneme data P). Therefore, it is possible to generate the acoustic data string Z of the target sound in which both the meaning of each word in the word string T and the pronunciation of the phoneme in the word string T are reflected.
  • the phoneme data P in the second embodiment is not limited to a vector composed of K elements E1 to EK.
  • a code string (identifier) uniquely assigned to each phoneme may be used as the phoneme data P.
  • In the above embodiments, the acoustic data string Z represents the frequency characteristics of the target sound, but the information expressed by the acoustic data string Z is not limited to the above examples. For example, a form in which the acoustic data string Z represents each sample of the target sound is also assumed. In that form, the time series of the acoustic data Z itself constitutes the acoustic signal A, and the signal generation unit 34 may therefore be omitted.
  • In the above embodiments, the control data string acquisition unit 30 generates the first control data string X and the second control data string Y, but the operation of the control data string acquisition unit 30 is not limited to the above examples.
  • For example, the control data string acquisition unit 30 may receive, via the communication device 13, a first control data string X and a second control data string Y generated by an external device. Further, in a configuration in which the first control data string X and the second control data string Y are stored in the storage device 12, the control data string acquisition unit 30 reads the first control data string X and the second control data string Y from the storage device 12.
  • As understood from the above examples, "acquisition" by the control data string acquisition unit 30 encompasses any operation that obtains the first control data string X and the second control data string Y, such as generation, reception, and reading.
  • Similarly, "acquisition" of the first control data string Xt and the second control data string Yt by the training data acquisition unit 41 encompasses any operation that obtains them (for example, generation, reception, and reading).
  • In the above embodiments, the control data string C, which is a combination of the first control data string X and the second control data string Y, is supplied to the generative model Mb.
  • the input format of the first control data string X and the second control data string Y is not limited to the above example.
  • For example, the generative model Mb may be composed of a first part Mb1 and a second part Mb2.
  • The first part Mb1 is a part composed of the input layer and part of the intermediate layers of the generative model Mb.
  • The second part Mb2 is a part composed of the remaining intermediate layers and the output layer of the generative model Mb.
  • The first control data string X may be supplied to the first part Mb1 (the input layer), and the second control data string Y may be supplied to the second part Mb2 together with the data output from the first part Mb1.
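  • A minimal sketch of such a split model in PyTorch follows, with the first control data string X entering the first part Mb1 and the second control data string Y joining an intermediate layer of the second part Mb2; layer types and sizes are illustrative assumptions.

```python
import torch
from torch import nn

class SplitMb(nn.Module):
    """Generative model Mb with X entering the input layer and Y entering mid-network."""

    def __init__(self, x_dim=4, y_dim=32, hidden=128, z_dim=80):
        super().__init__()
        self.mb1 = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU())  # first part Mb1
        self.mb2 = nn.Sequential(                                      # second part Mb2
            nn.Linear(hidden + y_dim, hidden), nn.ReLU(), nn.Linear(hidden, z_dim))

    def forward(self, x, y):
        h = self.mb1(x)                             # intermediate representation from X only
        return self.mb2(torch.cat([h, y], dim=-1))  # Y is injected at the intermediate layer

z = SplitMb()(torch.zeros(1, 4), torch.zeros(1, 32))  # acoustic data Z for one unit period
```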
  • That is, the concatenation of the first control data string X and the second control data string Y is not essential in the present disclosure.
  • a plurality of generative models Mb corresponding to different musical instruments may be selectively used.
  • the generative model Mb corresponding to one type of musical instrument is a learned model trained using the reference signal R of the musical instrument sound produced by the musical instrument. Therefore, the generation model Mb corresponding to each musical instrument outputs an acoustic data string Z representing the musical instrument sound of the musical instrument.
  • the user selects one of the plurality of musical instruments by operating the operating device 14.
  • the musical instrument data ⁇ in FIG. 12 is data specifying the musical instrument selected by the user.
  • The acoustic data string generation unit 33 selects, from among the plurality of generative models Mb, the generative model Mb corresponding to the instrument specified by the musical instrument data ⁇, and generates the acoustic data string Z by processing the control data string C with that generative model Mb. According to the above configuration, it is possible to generate a target sound having a timbre corresponding to any one of a plurality of types of musical instruments.
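  • A minimal sketch of that selection follows; the instrument identifiers and the lookup structure are assumptions, and the lambdas merely stand in for trained generative models Mb.

```python
models_mb = {
    "piano": lambda c: f"Z from the piano model for {len(c)}-dimensional C",
    "violin": lambda c: f"Z from the violin model for {len(c)}-dimensional C",
}

def generate_acoustic_data(instrument_id: str, control_data_c):
    """Select the generative model Mb matching the instrument data, then process C with it."""
    model_mb = models_mb[instrument_id]
    return model_mb(control_data_c)

print(generate_acoustic_data("violin", [0.0] * 36))
```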
  • a control data string C including musical instrument data ⁇ in addition to the first control data string X and the second control data string Y may be input to one generation model Mb.
  • the generative model Mb in FIG. 13 is established by machine learning using a plurality of reference signals R corresponding to different musical instruments.
  • the training data L includes musical instrument data ⁇ specifying the musical instrument corresponding to the reference signal R, in addition to the first control data sequence Xt and the second control data sequence Yt for training. Therefore, the target sound represented by the acoustic data string Z is an instrument sound having the timbre of the instrument specified by the instrument data ⁇ .
  • the note data string N is generated from the music data D stored in advance in the storage device 12, but note data strings N sequentially supplied from the performance device may also be used.
  • the performance device is an input device such as a MIDI keyboard that accepts musical performances by the user, and sequentially outputs a string of musical note data N according to the musical performance by the user.
  • the sound generation system 10 generates a sound data string Z using a musical note data string N supplied from a performance device.
  • For example, the above-described synthesis process Sa may be executed in real time in parallel with the user's performance on the performance device.
  • the second control data string Y and the audio data string Z may be generated in parallel with the user's operation on the performance device.
  • In the first embodiment, one phrase vector V is generated from one piece of phrase data Q, but a plurality of phrase vectors V may be generated from one phrase data string Q.
  • FIG. 14 is an explanatory diagram of the operation of the control data string acquisition section 30 in this modification.
  • The phrase data string Q in FIG. 14 is data representing the word strings obtained by dividing the word string T into phrases. That is, each piece of phrase data Q identifies a string of one or more words corresponding to one phrase.
  • a phrase is a section of a song divided according to musical or semantic unity. For example, each phrase is specified in the music data D. However, each phrase of the song may be defined by analyzing the song data D.
  • the language analysis unit 321 generates the phrase data string Q by dividing the word string T into phrases.
  • The information generation unit 322 processes each phrase data string Q to generate a word vector V for each word included in the corresponding word string. As illustrated in FIG. 14, when one phrase data string Q contains multiple words, the context of the word string is analyzed, and the word vectors V corresponding to the respective words are generated from that single phrase data string Q. Similar to the first embodiment, the information generation unit 322 uses a generative model Ma capable of interpreting context to generate the word vector string V.
  • the generative model Ma is a trained model that has learned the relationship between the context of the word string indicated by the word data string Q and the word vector string V indicating the meaning of each word in that context.
  • the generation model Ma generates a phrase vector sequence V indicating the meaning of each word included in the phrase string indicated by the phrase data Q in the phrase string.
  • a natural language processing model such as BERT (Bidirectional Encoder Representations from Transformers) is used as the generative model Ma.
  • In FIG. 14, a phrase data string Q1 that specifies word string #1 consisting of word #1a and word #1b is illustrated.
  • In response to the input of the single phrase data string Q1, the generative model Ma generates a word vector V corresponding to word #1a in word string #1 and a word vector V corresponding to word #1b in word string #1.
  • The phrase data string Q2 in FIG. 14 specifies word string #2 consisting of word #2a, word #2b, and word #2c.
  • In response to the input of the single phrase data string Q2, the generative model Ma generates a word vector V corresponding to word #2a, a word vector V corresponding to word #2b, and a word vector V corresponding to word #2c in word string #2. Similar to the first embodiment, in each unit period U corresponding to one word, the word vector V of that word is repeatedly used as the second control data Y. In this modification, by interpreting the context of a word string, word vectors V that more accurately indicate the meaning of each word in that word string are generated.
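  • A minimal sketch of obtaining context-dependent word vectors V for one phrase with a BERT-style model follows, using the Hugging Face transformers library as one possible implementation; the checkpoint name, the example phrase, and the mean pooling over subword tokens are illustrative choices.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model_ma = AutoModel.from_pretrained("bert-base-uncased")

phrase = "cold winter wind"  # one phrase (phrase data Q) of the word string T
inputs = tokenizer(phrase, return_tensors="pt")
with torch.no_grad():
    hidden = model_ma(**inputs).last_hidden_state[0]  # one contextual vector per subword token

# Pool subword tokens back to words so that each word obtains one word vector V.
word_to_vectors = {}
for idx, word_id in enumerate(inputs.word_ids(0)):
    if word_id is not None:  # skip special tokens such as [CLS] and [SEP]
        word_to_vectors.setdefault(word_id, []).append(hidden[idx])
word_vector_string_v = [torch.stack(v).mean(dim=0) for _, v in sorted(word_to_vectors.items())]
```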
  • the generative model M (Ma, Mb, Mc) is not limited to a deep neural network.
  • any format and type of statistical model such as HMM (Hidden Markov Model) or SVM (Support Vector Machine) may be used as the generative model M (Ma, Mb, Mc).
  • In each of the above embodiments, the machine learning system 20 establishes the generative model Mb, but the functions for establishing the generative model Mb (the training data acquisition unit 41 and the learning processing unit 42) may instead be installed in the sound generation system 10. Further, the sound generation system 10 may be equipped with the function of establishing the generative model Ma or Mc.
  • the sound generation system 10 may be realized by a server device that communicates with an information device such as a smartphone or a tablet terminal.
  • the sound generation system 10 receives music data D from an information device, and generates an audio signal A through a synthesis process Sa to which the music data D is applied.
  • the sound generation system 10 transmits the sound signal A generated by the synthesis process Sa to the information device.
  • In a configuration in which the signal generation unit 34 is installed in the information device, the time series of the acoustic data Z is transmitted to the information device. That is, the signal generation unit 34 is omitted from the sound generation system 10.
  • As described above, the functions of the sound generation system 10 are realized by the cooperation of the one or more processors constituting the control device 11 and the program stored in the storage device 12.
  • Similarly, the functions of the machine learning system 20 are realized by the cooperation of the one or more processors constituting the control device 21 and the program stored in the storage device 22.
  • the programs exemplified above may be provided in a form stored in a computer-readable recording medium and installed on a computer.
  • The recording medium is, for example, a non-transitory recording medium; an optical recording medium (optical disc) such as a CD-ROM is a good example, but recording media of any known form, such as semiconductor recording media and magnetic recording media, are also included.
  • the non-transitory recording medium includes any recording medium excluding transitory, propagating signals, and does not exclude volatile recording media.
  • the recording medium that stores the program in the distribution device corresponds to the above-mentioned non-transitory recording medium.
  • A sound generation method according to one aspect of the present disclosure acquires a first control data string representing characteristics of a note string and a second control data string representing characteristics of a text corresponding to the note string, and generates, by processing the first control data string and the second control data string with a trained first generative model, an acoustic data string representing the instrument sound of the note string having acoustic characteristics corresponding to the characteristics of the text represented by the second control data string.
  • the second control data string representing the characteristics of the text corresponding to the note string is used to generate the acoustic data string. Therefore, compared to a configuration in which an acoustic data string is generated only from the first control data string, it is possible to generate an acoustic data string of musical instrument sounds having various acoustic characteristics depending on the text corresponding to the note string.
  • the "first control data string” is data (first control data) in any format that represents the characteristics of a note string, and is generated from, for example, a note data string representing a note string. Further, the first control data string may be generated from a musical note data string generated in real time in response to an operation on an input device such as an electronic musical instrument.
  • the "first control data string” can also be referred to as data specifying the conditions of the musical instrument sound to be synthesized.
  • The "first control data string" specifies various conditions regarding each note constituting the note string, such as the pitch or duration of each note, or the relationship between the pitch of one note and the pitches of other notes located around that note.
  • “Instrument sound” is a musical sound generated from a musical instrument during performance.
  • the "first generation model” is a learned model that has learned the relationship between the first control data string, the second control data string, and the acoustic data string by machine learning.
  • a plurality of training data are used for machine learning of the first generative model.
  • Each training data includes a set of a first training control data string and a second training control data string, and a training acoustic data string.
  • the first training control data string is data representing the characteristics of the reference note string
  • the second training control data string is data representing the characteristics of the text corresponding to the reference note string.
  • the training audio data string represents musical instrument sounds produced by a performance based on the note string corresponding to the first training control data string and the text corresponding to the second training control data string.
  • For example, various statistical estimation models such as a deep neural network (DNN), a hidden Markov model (HMM), or a support vector machine (SVM) are used as the "first generative model".
  • the form of input of the first control data string and the second control data string to the first generative model is arbitrary.
  • input data including a first control data string and a second control data string is input to the first generative model.
  • However, for example, a form is also assumed in which the first control data string is input to the input layer and the second control data string is input to an intermediate layer. That is, concatenating the first control data string and the second control data string is not essential.
  • the "acoustic data string” is data (acoustic data) in any format that represents musical instrument sounds.
  • data representing acoustic characteristics such as an intensity spectrum, a mel spectrum, and MFCC (Mel-Frequency Cepstrum Coefficients) is an example of an “acoustic data string.”
  • a sample sequence representing the waveform of the musical instrument sound may be generated as an “acoustic data sequence.”
  • Text corresponding to a note string means that text is associated with a note string. That is, the "correspondence" between a note string and a text means, for example, that each note in the note string is associated with each word in the text in terms of time.
  • In one example, the first generative model is a model trained using training data that includes a first training control data string representing characteristics of a reference note string, a second training control data string representing characteristics of a text corresponding to the reference note string, and a training acoustic data string representing the instrument sound of the reference note string.
  • the second control data string includes a word vector string representing words included in the text.
  • the second control data string includes a word vector string representing words in the text. Therefore, it is possible to generate an acoustic data string of musical instrument sounds in which the meanings of words in the text are reflected in the acoustic characteristics.
  • each element of the "phrase vector sequence" is a vector (phrase vector) defined in a linguistic space (semantic space) according to the meaning of a phrase.
  • a "phrase" is a single word or a sequence of multiple words.
  • to generate the word vector sequence, for example, the statistical estimation model described in Tomas Mikolov et al., "Efficient Estimation of Word Representations in Vector Space," arXiv:1301.3781 [cs.CL], 2013 (Word2Vec) or in Quoc Le, Tomas Mikolov, "Distributed Representations of Sentences and Documents," CoRR, abs/1405.4053, p.1-9, 2014 (Doc2Vec) is used.
  • in acquiring the second control data string, the phrase vector sequence is generated using a trained second generation model.
  • the second control data string can be easily generated using the second generation model.
  • the second control data string includes phoneme data representing phonemes constituting the text. It is therefore possible to generate an acoustic data string of musical instrument sounds in which non-linguistic characteristics (for example, characteristics in the time domain or frequency domain) regarding the pronunciation of the phonemes in the text are reflected in the acoustic characteristics.
  • the acquisition of the first control data and the second control data and the generation of the acoustic data are executed in each of a plurality of unit periods on the time axis.
  • a sound generation system according to one aspect of the present disclosure comprises: a control data string acquisition unit that acquires a first control data string representing characteristics of a note string and a second control data string representing characteristics of a text corresponding to the note string; and an acoustic data string generation unit that processes the first control data string and the second control data string using a trained first generation model, thereby generating an acoustic data string representing the musical instrument sound of the note string having acoustic characteristics according to the characteristics of the text represented by the second control data string.
  • a program according to one aspect (aspect 8) of the present disclosure causes a computer system to function as: a control data string acquisition unit that acquires a first control data string representing characteristics of a note string and a second control data string representing characteristics of a text corresponding to the note string; and an acoustic data string generation unit that processes the first control data string and the second control data string using a trained first generation model, thereby generating an acoustic data string representing the musical instrument sound of the note string having acoustic characteristics according to the characteristics of the text represented by the second control data string.

Abstract

This sound generation system comprises: a control data string acquisition unit 30 that acquires a first control data string X representing features of a string of notes and a second control data string Y representing features of text corresponding to the string of notes; and a sound data string generation unit 33 that processes the first control data string X and the second control data string Y using a trained generation model Mb, thereby generating a sound data string Z representing musical instrument sounds of a string of notes having sound characteristics corresponding to the features of the text represented by the second control data string Y.

Description

Sound generation method, sound generation system, and program
The present disclosure relates to a technique for generating an acoustic data string representing musical instrument sounds.

Techniques for synthesizing desired sounds have been proposed in the past. For example, Non-Patent Document 1 discloses a technique for generating a synthesized sound corresponding to a string of musical notes using a trained generative model.

However, conventional synthesis techniques only generate singing sounds that follow the musical score, and it is difficult to generate synthesized sounds with diverse acoustic characteristics. In consideration of the above circumstances, one aspect of the present disclosure aims to generate an acoustic data string of musical instrument sounds having various acoustic characteristics.
In order to solve the above problems, a sound generation method according to one aspect of the present disclosure acquires a first control data string representing characteristics of a note string and a second control data string representing characteristics of a text corresponding to the note string, and processes the first control data string and the second control data string with a trained first generative model, thereby generating an acoustic data string representing the musical instrument sound of the note string having acoustic characteristics according to the characteristics of the text represented by the second control data string.

A sound generation system according to one aspect of the present disclosure comprises: a control data string acquisition unit that acquires a first control data string representing characteristics of a note string and a second control data string representing characteristics of a text corresponding to the note string; and an acoustic data string generation unit that processes the first control data string and the second control data string with a trained first generative model, thereby generating an acoustic data string representing the musical instrument sound of the note string having acoustic characteristics according to the characteristics of the text represented by the second control data string.

A program according to one aspect of the present disclosure causes a computer system to function as: a control data string acquisition unit that acquires a first control data string representing characteristics of a note string and a second control data string representing characteristics of a text corresponding to the note string; and an acoustic data string generation unit that processes the first control data string and the second control data string with a trained first generative model, thereby generating an acoustic data string representing the musical instrument sound of the note string having acoustic characteristics according to the characteristics of the text represented by the second control data string.
FIG. 1 is a block diagram illustrating the configuration of an information system in the first embodiment.
FIG. 2 is a block diagram illustrating the functional configuration of the sound generation system.
FIG. 3 is an explanatory diagram of the operation of the control data string acquisition unit.
FIG. 4 is a block diagram illustrating the configuration of the second generation unit.
FIG. 5 is a flowchart illustrating the detailed procedure of the synthesis process.
FIG. 6 is a block diagram illustrating the functional configuration of the machine learning system.
FIG. 7 is a flowchart illustrating the detailed procedure of the learning process.
FIG. 8 is an explanatory diagram of the operation of the control data string acquisition unit in the second embodiment.
FIG. 9 is a schematic diagram of phoneme data.
FIG. 10 is a schematic diagram of the second control data string Y in the third embodiment.
FIG. 11 is an explanatory diagram of a generative model in a modified example.
FIG. 12 is a block diagram illustrating the functional configuration of a sound generation system in a modified example.
FIG. 13 is a block diagram illustrating the functional configuration of a sound generation system in a modified example.
FIG. 14 is an explanatory diagram of the operation of the control data string acquisition unit in a modified example.
A: First Embodiment
FIG. 1 is a block diagram illustrating the configuration of an information system 100 according to the first embodiment. The information system 100 includes a sound generation system 10 and a machine learning system 20. The sound generation system 10 and the machine learning system 20 communicate with each other via a communication network 200 such as the Internet.
[Sound generation system 10]
The sound generation system 10 is a computer system that generates a performance sound (hereinafter referred to as the "target sound") of a specific musical piece. The target sound in the first embodiment is an instrument sound having the timbre of a musical instrument.
The sound generation system 10 includes a control device 11, a storage device 12, a communication device 13, an operating device 14, and a sound emitting device 15. The sound generation system 10 is realized by an information terminal such as a smartphone, a tablet terminal, or a personal computer. Note that the sound generation system 10 may be realized not only as a single device but also as a plurality of devices configured separately from each other.

The control device 11 is composed of one or more processors that control each element of the sound generation system 10. For example, the control device 11 is configured by one or more types of processors such as a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), an SPU (Sound Processing Unit), a DSP (Digital Signal Processor), an FPGA (Field Programmable Gate Array), or an ASIC (Application Specific Integrated Circuit). The control device 11 generates an acoustic signal A representing the waveform of the target sound.

The storage device 12 is one or more memories that store a program executed by the control device 11 and various data used by the control device 11. The storage device 12 is composed of a known recording medium such as a magnetic recording medium or a semiconductor recording medium, or of a combination of multiple types of recording media. Note that a portable recording medium that can be attached to and detached from the sound generation system 10, or a recording medium that the control device 11 can access via the communication network 200 (for example, cloud storage), may be used as the storage device 12.

The storage device 12 stores music data D representing a musical piece. The music data D includes musical score data G and a word string T. The musical score data G specifies the time series of notes that make up the piece. Specifically, the musical score data G specifies a pitch and a sounding period for each of the plurality of notes of the piece, the sounding period being specified, for example, by the start point and duration of the note. The word string T specifies the text corresponding to the piece. Specifically, the word string T specifies one or more characters for each of the plurality of notes of the piece; the word string T is composed of a plurality of characters corresponding to different notes. For example, a music file compliant with the MIDI (Musical Instrument Digital Interface) standard is used as the music data D. Note that the music data D may also specify information such as performance symbols representing musical expression.
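As a concrete illustration, the following is a minimal sketch of one way the music data D described above could be represented in memory. The class and field names (Note, Word, MusicData) are hypothetical and are not part of the disclosure; the actual music data D is, for example, a MIDI-compliant file.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Note:
    pitch: int        # pitch specified by the musical score data G (e.g. a MIDI note number)
    start: float      # start point of the sounding period, in seconds
    duration: float   # duration of the sounding period, in seconds

@dataclass
class Word:
    text: str         # one or more characters of the word string T assigned to a note
    note_index: int   # index of the corresponding note in the note sequence

@dataclass
class MusicData:
    score: List[Note]   # musical score data G (time series of notes)
    words: List[Word]   # word string T (text corresponding to the notes)
```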
The communication device 13 communicates with the machine learning system 20 via the communication network 200. Note that a communication device 13 separate from the sound generation system 10 may be connected to the sound generation system 10 by wire or wirelessly.

The operating device 14 is an input device that accepts operations by the user. For example, an operator operated by the user or a touch panel that detects contact by the user is used as the operating device 14.

The sound emitting device 15 reproduces the target sound represented by the acoustic signal A. The sound emitting device 15 is, for example, a speaker or headphones. Note that a D/A converter that converts the acoustic signal A from digital to analog and an amplifier that amplifies the acoustic signal A are omitted from the figures for convenience. A sound emitting device 15 separate from the sound generation system 10 may also be connected to the sound generation system 10 by wire or wirelessly.
FIG. 2 is a block diagram illustrating the functional configuration of the sound generation system 10. By executing the program stored in the storage device 12, the control device 11 realizes a plurality of functions for generating the acoustic signal A (a control data string acquisition unit 30, an acoustic data string generation unit 33, and a signal generation unit 34).
FIG. 3 is an explanatory diagram of the operation of the control data string acquisition unit 30. The control data string acquisition unit 30 acquires the first control data string X and the second control data string Y. Specifically, the control data string acquisition unit 30 acquires the first control data X and the second control data Y in each of a plurality of unit periods U on the time axis. Each unit period U is a period (the hop size of a frame window) that is sufficiently short compared with the duration of each note of the piece. For example, the window size is 2 to 20 times the hop size (the window is longer), the hop size is 2 to 20 milliseconds, and the window size is 20 to 60 milliseconds. The control data string acquisition unit 30 of the first embodiment includes a first generation unit 31 and a second generation unit 32.
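For reference, a small sketch of how the unit periods U and the longer frame windows relate on the time axis; the 5 ms hop and 40 ms window below are merely assumed values within the ranges given above.

```python
HOP_SIZE = 0.005      # length of one unit period U (frame-window hop size), assumed 5 ms
WINDOW_SIZE = 0.040   # length of one frame window, assumed 40 ms (longer than the hop)

def frame_times(total_duration: float):
    """Yield (unit_period_index, window_start, window_end) for each unit period U."""
    n_units = int(total_duration / HOP_SIZE)
    for u in range(n_units):
        start = u * HOP_SIZE
        yield u, start, start + WINDOW_SIZE
```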
The first generation unit 31 generates first control data X from a note data string N for each unit period U. The note data string N used for the generation is the portion of the musical score data G that corresponds to the unit period U. Specifically, the note data string N corresponding to any one unit period U is a part of the note data string of the music data D that includes the note data of the note containing that unit period U (hereinafter referred to as the "target note"). That is, the note data string N specifies a note string that includes the target note and at least one of the note before it and the note after it.

Each piece of first control data X is data in an arbitrary format representing the characteristics of the note string specified by the note data string N. The first control data X in any one unit period U is information indicating the characteristics of the target note, that is, the note whose note data covers that unit period U among the plurality of notes of the piece. For example, the characteristics indicated by the first control data X include characteristics of the note containing the unit period (for example, its pitch and, optionally, its duration). The first control data X also includes information regarding notes other than the target note. For example, the first control data X includes characteristics (for example, pitch) indicated by the note data of at least one of the notes before and after the target note. The first control data X may also include the pitch difference between the target note and the note immediately before or after it. If there is no preceding or following note to include because a rest is located there, the characteristics of that rest may be included instead of a note.
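A minimal sketch of how the first control data X of one unit period might be assembled from the target note and its neighbors, reusing the hypothetical Note class from the earlier sketch; the feature layout and the zero used for a missing neighbor are illustrative assumptions, not the format defined by the disclosure.

```python
import numpy as np

def make_first_control_data(notes, target_index):
    """Build a feature vector X for a unit period contained in notes[target_index] (the target note)."""
    target = notes[target_index]
    prev_pitch = notes[target_index - 1].pitch if target_index > 0 else 0
    next_pitch = notes[target_index + 1].pitch if target_index + 1 < len(notes) else 0
    return np.array([
        target.pitch,               # pitch of the target note
        target.duration,            # duration of the target note (optional feature)
        prev_pitch,                 # pitch of the preceding note (0 if absent)
        next_pitch,                 # pitch of the following note (0 if absent)
        target.pitch - prev_pitch,  # pitch difference from the preceding note
    ], dtype=np.float32)
```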
The first generation unit 31 generates the first control data X by predetermined arithmetic processing on the note data string N. Note that the first generation unit 31 may generate the first control data X using a generative model configured, for example, as a deep neural network (DNN); such a generative model is a statistical estimation model that has learned the relationship between the note data string N and the first control data X by machine learning. The first control data X is data specifying the musical conditions of the target sound that the sound generation system 10 should generate.

The second generation unit 32 generates, from the word string T, the second control data Y required for the current unit period U, either in synchronization with the unit period U or in advance of its progression. The second control data Y of each unit period U is data in an arbitrary format representing the characteristics of the phrase in the word string T that contains that unit period U. Specifically, the second control data Y includes the phrase vector V of that phrase of the word string T. The phrase vector V is a vector representing the position of each phrase in a semantic space; the closer the meanings of phrases are, the closer the positions of their phrase vectors V are in the semantic space. The phrase represented by a phrase vector V is composed of one or more words. That is, the phrase vector V is data representing the characteristics of one word or one phrase (a time series of multiple words) in the word string T.
FIG. 4 is a block diagram illustrating the configuration of the second generation unit 32. The second generation unit 32 includes a language analysis unit 321 and an information generation unit 322. The language analysis unit 321 divides the word string T into a plurality of words by natural language processing such as morphological analysis, and sequentially generates phrase data Q. The phrase data Q is data identifying a phrase composed of one or more words of the word string T, or data representing the character string of that phrase. The information generation unit 322 generates a phrase vector V for the phrase represented by the phrase data Q. As illustrated in FIG. 3, in each unit period U within the period corresponding to one phrase of the piece, the phrase vector V of that phrase is used repeatedly as the second control data Y. Note that a zero vector is generated as the second control data Y in each unit period U within a period of the piece for which no note or word string T is set.

As illustrated in FIG. 4, a generative model Ma is used for the generation of the phrase vectors V by the information generation unit 322. The generative model Ma is a trained model that has learned, by machine learning, the latent relationship between phrase data Q as input and phrase vectors V in the semantic space as output; it outputs a phrase vector V in response to the input of phrase data Q. The information generation unit 322 generates the phrase vector V of each phrase by processing the phrase data Q with the trained generative model Ma, and outputs it as the second control data Y in the corresponding unit periods. As understood from the above description, the second generation unit 32 uses the generative model Ma to generate, as the second control data string Y, the phrase vector sequence V representing the phrases included in the word string T. With this configuration, the second control data string Y can be generated easily using the generative model Ma. The generative model Ma is an example of a "second generative model."

The generative model Ma of the first embodiment is, for example, a statistical estimation model such as a deep neural network. For generating the phrase vector V of a phrase composed of a single word, for example, the technique described in Tomas Mikolov et al., "Efficient Estimation of Word Representations in Vector Space," arXiv:1301.3781 [cs.CL], 2013 (Word2Vec) is used. For generating the phrase vector V of a phrase composed of multiple words (that is, a sentence), for example, the technique described in Quoc Le, Tomas Mikolov, "Distributed Representations of Sentences and Documents," CoRR, abs/1405.4053, p.1-9, 2014 (Doc2Vec) is used.
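A minimal sketch of the second generation unit's behavior, with a generic pre-trained embedding lookup standing in for a concrete Word2Vec/Doc2Vec model; the embed callable and the phrase-to-unit-period mapping are illustrative assumptions.

```python
import numpy as np

def second_control_data(phrases, n_units, embed, dim=128):
    """Return one phrase vector V per unit period U as the second control data Y.

    phrases: list of (phrase_text, first_unit, last_unit) tuples covering parts of the piece
    embed:   callable mapping a phrase string to a semantic-space vector (e.g. Word2Vec/Doc2Vec)
    """
    Y = np.zeros((n_units, dim), dtype=np.float32)   # zero vector where no note or text is set
    for text, first_u, last_u in phrases:
        v = embed(text)                               # phrase vector V of this phrase
        Y[first_u:last_u + 1] = v                     # reused in every unit period of the phrase
    return Y
```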
As illustrated in FIG. 2, control data C is generated for each unit period U through the above processing by the control data string acquisition unit 30. The control data C of each unit period U includes the first control data X generated by the first generation unit 31 for that unit period U and the second control data Y generated by the second generation unit 32 for that unit period U. The control data C is, for example, data obtained by concatenating the first control data X and the second control data Y.
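Continuing the sketch, the control data C of one unit period can then be formed by simple concatenation (one possible layout, not a mandated format):

```python
import numpy as np

def control_data(X_u, Y_u):
    """Concatenate the first control data X and second control data Y of one unit period into C."""
    return np.concatenate([X_u, Y_u]).astype(np.float32)
```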
The performance of a musical instrument is basically defined by the note string of the musical score. However, as a result of investigation by the inventor of the present application, a tendency was confirmed that even when performers play the same note string on an instrument, the musical expression of the instrument sounds produced by the performance differs when the text attached to the note string differs. In other words, while it is natural that the musical expression of a singing voice depends on the text (that is, the lyrics), the musical expression of instrument sounds, which is generally assumed to be unaffected by text, in fact also tends to depend on the text. Against the background of these findings, in the first embodiment, the acoustic signal A of the target sound is generated according to the first control data string X representing the characteristics of the note string and the second control data string Y representing the characteristics of the word string T corresponding to the note string.

The acoustic data string generation unit 33 in FIG. 2 generates an acoustic data string Z using the control data string C (the first control data string X and the second control data string Y). The acoustic data string Z is data in an arbitrary format representing the target sound. Specifically, the acoustic data string Z represents a target sound that corresponds to the note string represented by the first control data string X and that has acoustic characteristics according to the characteristics of the word string T represented by the second control data string Y. That is, the instrument sound that would be produced if a performer played the note string on an instrument with the word string T in mind is generated as the target sound.
Specifically, the acoustic data Z is data representing the envelope of the frequency spectrum of the target sound. According to the control data C of each unit period U, the acoustic data Z corresponding to that unit period U is generated, and each piece of acoustic data Z corresponds to a waveform sample sequence of one frame window, which is longer than the unit period. As described above, the acquisition of the control data C by the control data string acquisition unit 30 and the generation of the acoustic data Z by the acoustic data string generation unit 33 are executed for each unit period U.

A generative model Mb is used for the generation of the acoustic data string Z by the acoustic data string generation unit 33. For each unit period, the generative model Mb estimates the acoustic data Z of that unit period according to the control data C of that unit period. The generative model Mb is a trained model that has learned, by machine learning, the latent relationship between the control data string C as input and the acoustic data string Z as output; that is, from the viewpoint of that relationship, the generative model Mb outputs an acoustic data string Z that is statistically valid for the control data string C. The acoustic data string generation unit 33 generates the acoustic data Z for each unit period U by processing the control data C with the generative model Mb.

The generative model Mb is realized by a combination of a program that causes the control device 11 to execute the computation for generating the acoustic data string Z from the control data string C, and a plurality of variables (weights and biases) applied to that computation. The program and the plurality of variables that realize the generative model Mb are stored in the storage device 12, and the plurality of variables of the generative model Mb are set in advance by machine learning. The generative model Mb is an example of a "first generative model."
The generative model Mb is composed of, for example, a deep neural network. For example, any type of deep neural network such as a recurrent neural network (RNN) or a convolutional neural network (CNN) can be used as the generative model Mb. The generative model Mb may also be configured as a combination of multiple types of deep neural networks. Additional elements such as long short-term memory (LSTM) or attention may also be incorporated in the generative model Mb.
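As a concrete but non-limiting illustration, the following is a minimal recurrent network that could play the role of the generative model Mb, mapping one control data vector C per unit period to one spectral-envelope frame Z; all layer sizes and the choice of a GRU are assumptions.

```python
import torch
import torch.nn as nn

class GenerativeModelMb(nn.Module):
    """Sketch of a recurrent generative model Mb: control data string C -> acoustic data string Z."""

    def __init__(self, control_dim=160, hidden_dim=256, spectrum_dim=80):
        super().__init__()
        self.rnn = nn.GRU(control_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, spectrum_dim)

    def forward(self, control_seq):
        # control_seq: (batch, n_unit_periods, control_dim), one control data C per unit period U
        hidden, _ = self.rnn(control_seq)
        return self.out(hidden)   # (batch, n_unit_periods, spectrum_dim): acoustic data string Z
```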
The signal generation unit 34 generates the acoustic signal A of the target sound from the time series of the acoustic data Z. For example, the signal generation unit 34 converts the acoustic data Z into a time-domain waveform signal by computation including a discrete inverse Fourier transform, and generates the acoustic signal A by connecting the waveform signals of successive unit periods U. Alternatively, the signal generation unit 34 may generate the acoustic signal A from the acoustic data string Z using, for example, a deep neural network that has learned the relationship between the acoustic data string Z and the samples of the acoustic signal A (a so-called neural vocoder). The acoustic signal A generated by the signal generation unit 34 is supplied to the sound emitting device 15, whereby the target sound is reproduced from the sound emitting device 15.
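A minimal sketch of the inverse-transform and overlap-add step performed by the signal generation unit 34, assuming each piece of acoustic data Z can be inverted to a time-domain frame with an inverse FFT; the Hann window and the padding are simplifications for illustration.

```python
import numpy as np

def synthesize(frames, hop, window):
    """Overlap-add per-unit-period frames into one waveform (the acoustic signal A).

    frames: sequence of spectra, one per unit period U (arrays of FFT bins)
    hop:    unit period length in samples; window: frame window length in samples
    """
    out = np.zeros(hop * len(frames) + window, dtype=np.float64)
    win = np.hanning(window)
    for u, spectrum in enumerate(frames):
        wave = np.fft.irfft(spectrum, n=window)      # time-domain waveform of one frame window
        out[u * hop:u * hop + window] += wave * win  # overlap-add at the hop interval
    return out
```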
FIG. 5 is a flowchart illustrating the detailed procedure of the process by which the control device 11 generates the acoustic signal A (hereinafter referred to as the "synthesis process" Sa). The synthesis process Sa is executed in each of the plurality of unit periods U.

When the synthesis process Sa is started, the control device 11 (control data string acquisition unit 30) acquires the music data D from the storage device 12 (Sa1). The control device 11 (first generation unit 31) generates the first control data X of the unit period U from the note data string N corresponding to that unit period U in the musical score data G of the music data D (Sa2). The control device 11 (second generation unit 32) also generates the second control data Y of each unit period U from the word string T of the music data D (Sa3). Note that the order of the generation of the first control data X (Sa2) and the generation of the second control data Y (Sa3) may be reversed.

The control device 11 (acoustic data string generation unit 33) generates the acoustic data Z of the unit period U by processing the control data C including the first control data X and the second control data Y with the generative model Mb (Sa4). The control device 11 (signal generation unit 34) generates the acoustic signal A of the unit period U from the acoustic data Z (Sa5). From the acoustic data Z of each unit period, a signal spanning a time longer than the unit period is generated, and by overlap-adding these signals, the acoustic signal A spanning a plurality of unit periods is generated; the time difference (hop size) between successive frame windows corresponds to one unit period. The control device 11 reproduces the target sound by supplying the acoustic signal A to the sound emitting device 15 (Sa6).
As described above, in the first embodiment, in addition to the first control data string X representing the characteristics of the note string, the second control data string Y representing the characteristics of the word string T corresponding to the note string is used for the generation of the acoustic data string Z. Therefore, compared with a configuration in which the acoustic data string Z is generated only from the first control data string X, it is possible to generate an acoustic data string Z of the target sound having diverse acoustic characteristics according to the word string T corresponding to the note string. For example, even when the note data string N is the same, an acoustic data string Z of a target sound with different acoustic characteristics can be generated by changing the word string T. In particular, in the first embodiment, the second control data string Y includes the phrase vectors V representing phrases in the word string T. That is, the phrase vectors V reflecting the meaning of the word string T are used as the second control data string Y. Therefore, it is possible to generate an acoustic data string Z of the target sound in which the meanings of the phrases in the word string T are reflected in the acoustic characteristics.
[Machine learning system 20]
The machine learning system 20 in FIG. 1 is a computer system that establishes, by machine learning, the generative model Mb used by the sound generation system 10. The machine learning system 20 includes a control device 21, a storage device 22, and a communication device 23.
The control device 21 is composed of one or more processors that control each element of the machine learning system 20. For example, the control device 21 is configured by one or more types of processors such as a CPU, GPU, SPU, DSP, FPGA, or ASIC.

The storage device 22 is one or more memories that store a program executed by the control device 21 and various data used by the control device 21. The storage device 22 is composed of a known recording medium such as a magnetic recording medium or a semiconductor recording medium, or of a combination of multiple types of recording media. Note that a portable recording medium that can be attached to and detached from the machine learning system 20, or a recording medium that the control device 21 can access via the communication network 200 (for example, cloud storage), may be used as the storage device 22.

The communication device 23 communicates with the sound generation system 10 via the communication network 200. Note that a communication device 23 separate from the machine learning system 20 may be connected to the machine learning system 20 by wire or wirelessly.
FIG. 6 is an explanatory diagram of the function by which the machine learning system 20 establishes the generative model Mb. The storage device 22 stores a plurality of pieces of basic data B corresponding to different musical pieces. Each piece of basic data B includes music data D and a reference signal R. The music data D is data representing the note string of a specific musical piece (hereinafter referred to as the "reference piece") performed with the waveform represented by the reference signal R. As described above, the music data D includes musical score data G and a word string T; the musical score data G specifies the time series of notes constituting the reference piece, and the word string T specifies the text corresponding to the reference piece.

The reference signal R is a signal representing the waveform of the instrument sound produced by an instrument when a performer plays the reference piece while referring to the word string T. For example, a performer skilled in playing the instrument performs the reference piece while adding musical expression corresponding to the word string T, and the reference signal R is generated by recording the instrument sound produced by the instrument in this situation. After the reference signal R is recorded, its position on the time axis is adjusted. The instrument sound represented by the reference signal R is therefore an instrument sound having acoustic characteristics corresponding to the word string T.

By executing the program stored in the storage device 22, the control device 21 realizes a plurality of functions for generating the generative model Mb (a training data acquisition unit 41 and a learning processing unit 42).
The training data acquisition unit 41 generates a plurality of training data L from the plurality of pieces of basic data B. One piece of training data L is generated for each reference piece; accordingly, a plurality of training data L are generated from the plurality of pieces of basic data B corresponding to different reference pieces. The learning processing unit 42 establishes the generative model Mb by machine learning using the plurality of training data L.
Each piece of training data L is composed of a combination of a training control data string Ct and a training acoustic data string Zt. The control data string Ct is composed of a combination of a first training control data string Xt and a second training control data string Yt. The first control data string Xt is an example of a "first training control data string," and the second control data string Yt is an example of a "second training control data string." The acoustic data string Zt is an example of a "training acoustic data string."

For each unit period U, the training data acquisition unit 41 generates the first control data Xt of that unit period U from a note data string Nt. The note data string Nt used to generate the first control data Xt of each unit period U is the part of the note data string of the musical score data G that includes the note data of the target note containing that unit period U. That is, the note data string Nt includes the note data of the target note of the reference piece and the note data of at least one of the note before it and the note after it. Like the first control data X described above, the first control data Xt is data representing the characteristics of the reference note string represented by the note data string Nt. The training data acquisition unit 41 generates the first control data Xt of each unit period U from the note data string Nt by the same processing as the first generation unit 31.

The second control data Yt of one unit period U indicates the phrase vector V estimated for the phrase of the word string T corresponding to that unit period U. The training data acquisition unit 41 generates the second control data Yt of each unit period U, indicating the phrase vectors V estimated from the word string T, by the same processing as the second generation unit 32.
The acoustic data Zt of one unit period U represents the waveform of one frame of the reference signal R corresponding to that unit period U. The training data acquisition unit 41 generates the acoustic data string Zt from the reference signal R. As understood from the above description, the acoustic data string Zt represents the waveform of the instrument sound produced by the instrument when the reference note string corresponding to the first control data string Xt is performed under the phrases represented by the second control data string Yt. That is, the acoustic data string Zt is the ground truth of the acoustic data string that the generative model Mb should output in response to the input of the control data string Ct.
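A minimal sketch of how the training data acquisition unit 41 might cut the reference signal R into one training acoustic data frame Zt per unit period U; extracting a magnitude spectrum with a windowed FFT is an illustrative assumption about the frame format.

```python
import numpy as np

def make_training_frames(reference_signal, n_units, hop, window):
    """Cut the reference signal R into one training acoustic data frame Zt per unit period U."""
    frames = []
    win = np.hanning(window)
    for u in range(n_units):
        start = u * hop
        segment = reference_signal[start:start + window]
        segment = np.pad(segment, (0, window - len(segment)))   # pad the final frame if it runs short
        frames.append(np.abs(np.fft.rfft(segment * win)))       # magnitude spectrum used as Zt
    return np.stack(frames)
```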
FIG. 7 is a flowchart of the process by which the control device 21 establishes the generative model Mb by machine learning (hereinafter referred to as the "learning process" Sb). For example, the learning process Sb is started in response to an instruction from the operator of the machine learning system 20. The learning processing unit 42 in FIG. 6 is realized by the control device 21 executing the learning process Sb.

When the learning process Sb is started, the control device 21 selects one of the plurality of training data L (hereinafter referred to as the "selected training data L") (Sb1). As illustrated in FIG. 6, the control device 21 generates an acoustic data string Z by processing the control data string Ct of the selected training data L with the initial or provisional generative model Mb (hereinafter referred to as the "provisional model Mb0") (Sb2).

The control device 21 calculates a loss function representing the error between the acoustic data string Z generated by the provisional model Mb0 and the acoustic data string Zt of the selected training data L (Sb3). The control device 21 updates the plurality of variables of the provisional model Mb0 so that the loss function is reduced (ideally minimized) (Sb4). For example, error backpropagation is used to update each variable according to the loss function.

The control device 21 determines whether a predetermined end condition is satisfied (Sb5). The end condition is that the loss function falls below a predetermined threshold, or that the amount of change in the loss function falls below a predetermined threshold. If the end condition is not satisfied (Sb5: NO), the control device 21 selects an unselected piece of training data L as the new selected training data L (Sb1). That is, the process of updating the plurality of variables of the provisional model Mb0 (Sb1 to Sb4) is repeated until the end condition is satisfied (Sb5: YES). When the end condition is satisfied (Sb5: YES), the control device 21 ends the learning process Sb. The provisional model Mb0 at the point when the end condition is satisfied is fixed as the trained generative model Mb.
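A hedged sketch of the learning process Sb as it could be written with a gradient-based framework; the mean-squared-error loss, the Adam optimizer, and the threshold value are illustrative choices, not the specific loss function of the disclosure.

```python
import torch
import torch.nn as nn

def learning_process(model, training_data, lr=1e-3, loss_threshold=1e-3, max_steps=100_000):
    """Sketch of Sb1-Sb5: repeatedly update the provisional model Mb0 until the end condition holds.

    training_data: iterable of (Ct, Zt) pairs, where Ct is a control data string tensor and
                   Zt is the corresponding ground-truth training acoustic data string tensor.
    """
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.MSELoss()
    for step, (Ct, Zt) in enumerate(training_data):   # Sb1: pick the selected training data L
        if step >= max_steps:
            break
        Z = model(Ct)                                  # Sb2: output of the provisional model Mb0
        loss = criterion(Z, Zt)                        # Sb3: loss between Z and the ground truth Zt
        optimizer.zero_grad()
        loss.backward()                                # Sb4: update variables by backpropagation
        optimizer.step()
        if loss.item() < loss_threshold:               # Sb5: end condition on the loss value
            break
    return model                                       # trained generative model Mb
```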
As understood from the above description, the generative model Mb learns the latent relationship between the input control data strings Ct and the output acoustic data strings Zt. The trained generative model Mb therefore outputs, for an unknown control data string C, an acoustic data string Z that is statistically valid from the viewpoint of that relationship.

The control device 21 transmits the generative model Mb established by the above processing from the communication device 23 to the sound generation system 10. Specifically, the plurality of variables defining the generative model Mb are transmitted to the sound generation system 10. The control device 11 of the sound generation system 10 receives the generative model Mb transmitted from the machine learning system 20 via the communication device 13 and stores it in the storage device 12.
B: Second Embodiment
The second embodiment will be described below. For elements in each of the aspects illustrated below whose functions are the same as in the first embodiment, the reference signs used in the description of the first embodiment are reused and detailed description of each is omitted as appropriate.
As in the first embodiment, the control device 11 of the sound generation system 10 of the second embodiment includes a control data string acquisition unit 30 that acquires the control data string C, an acoustic data string generation unit 33 that generates the acoustic data string Z from the control data string C, and a signal generation unit 34 that generates the acoustic signal A from the acoustic data string Z.

FIG. 8 is an explanatory diagram of the operation of the control data string acquisition unit 30 in the second embodiment. As in the first embodiment, the first generation unit 31 of the control data string acquisition unit 30 generates the first control data X for each unit period U from the note data string N. In the second embodiment, the function of the second generation unit 32 differs from that in the first embodiment. The second generation unit 32 of the first embodiment generates the phrase vectors V representing the phrases of the word string T as the second control data Y. The second generation unit 32 of the second embodiment, on the other hand, generates phoneme data P representing each phoneme of the word string and outputs it as the second control data Y in each unit period corresponding to the period of that phoneme. That is, the second generation unit 32 analyzes the word string T to generate phoneme data P indicating the phoneme type and period of each phoneme, and outputs that phoneme data P as the second control data Y for each unit period U. As in the first embodiment, the control data C includes the first control data X and the second control data Y.
FIG. 9 is a schematic diagram of the phoneme data P. The phoneme data P specifies one of a plurality of (K) types of phonemes. Specifically, the phoneme data P is composed of K elements E (E1 to EK), where K is a natural number of 2 or more, corresponding to the different types of phonemes. The phoneme data P specifying any one type of phoneme is a one-hot vector in which, of the K elements E1 to EK, the one element E corresponding to that phoneme is set to "1" and the remaining (K-1) elements E are set to "0". Note that a one-cold vector, in which the "1" and "0" of each element E are exchanged, may also be adopted as the phoneme data P.
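A small sketch of the one-hot phoneme data P described above; the phoneme inventory used here is an arbitrary placeholder, not the set of K phoneme types assumed by the disclosure.

```python
import numpy as np

PHONEMES = ["a", "i", "u", "e", "o", "k", "s", "t", "n", "sil"]   # placeholder inventory (K types)

def phoneme_data(phoneme: str) -> np.ndarray:
    """Return the one-hot vector P: the element for the given phoneme is 1 and the rest are 0."""
    p = np.zeros(len(PHONEMES), dtype=np.float32)
    p[PHONEMES.index(phoneme)] = 1.0
    return p
```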
The second generation unit 32 estimates the type and period of each phoneme of the characters at each point in the word string T by phoneme analysis processing, and generates phoneme data P specifying that phoneme. Any known technique may be adopted for the phoneme analysis processing. As illustrated in FIG. 8, in each unit period U within the period corresponding to one phoneme of the piece, the phoneme data P indicating that phoneme is used repeatedly as the second control data Y. The boundaries of the period of each phoneme of the characters in the word string T are estimated, for example, by a statistical model such as an HMM (Hidden Markov Model) or an SVM (Support Vector Machine). A rule-based method is also conceivable in which the boundaries of each phoneme are identified using a reference table in which the relationship between each character constituting the word string T and the boundaries of each phoneme is registered. Alternatively, the boundaries of each phoneme designated, for example, manually by the creator of the music data D may be specified by the music data D.
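Continuing that sketch, the estimated phoneme periods can be expanded so that the phoneme data P of a phoneme is reused in every unit period U within that phoneme's period; the segment representation and the zero-vector default are illustrative assumptions.

```python
import numpy as np

def second_control_data_phonemes(segments, n_units, hop, n_phonemes=10):
    """segments: list of (phoneme, start_time, end_time); returns one phoneme vector per unit period U."""
    Y = [np.zeros(n_phonemes, dtype=np.float32) for _ in range(n_units)]   # assumed default: zero vector
    for phoneme, start, end in segments:
        first_u, last_u = int(start / hop), int(end / hop)
        for u in range(first_u, min(last_u + 1, n_units)):
            Y[u] = phoneme_data(phoneme)   # one-hot P reused in every unit period within the phoneme
    return Y
```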
The second control data Y (phrase vectors V) of the first embodiment reflects the meaning of the word string T but does not reflect information regarding the pronunciation of the word string T (that is, phonemes). Conversely, the second control data Y (phoneme data P) of the second embodiment reflects information regarding the pronunciation of the word string T (that is, phonemes) but does not reflect the meaning of the word string T. The phrase vectors V and the phoneme data P are comprehensively expressed as data representing the characteristics of the word string T.

Except for the point that the phrase vectors V are replaced with the phoneme data P, the second embodiment is the same as the first embodiment. For example, the processing by which the acoustic data string generation unit 33 generates the acoustic data string Z from the control data string C and the processing by which the signal generation unit 34 generates the acoustic signal A from the acoustic data string Z are the same as in the first embodiment. The synthesis process Sa and the learning process Sb are also the same as in the first embodiment.

In the second embodiment, in addition to the first control data string X representing the characteristics of the note string, the second control data string Y representing the characteristics of the word string T corresponding to the note string is used for the generation of the acoustic data string Z. Therefore, as in the first embodiment, it is possible to generate an acoustic data string Z of the target sound having diverse acoustic characteristics according to the word string T corresponding to the note string. In particular, in the second embodiment, the second control data string Y includes the phoneme data P representing the phonemes in the word string T. That is, the phoneme data P reflecting the pronunciation of the word string T is used as the second control data string Y. Therefore, it is possible to generate an acoustic data string Z of the target sound in which non-linguistic characteristics (for example, characteristics in the time domain or frequency domain) regarding the pronunciation of the phonemes in the word string T are reflected. For example, a target sound giving the impression that the word string T is perceived as onomatopoeia can be generated.
C: Third Embodiment
 FIG. 10 is a schematic diagram of the second control data string Y in the third embodiment. The second control data string Y includes first data Y1 and second data Y2. The first data Y1 corresponds to the second control data string Y in the first embodiment, and the second data Y2 corresponds to the second control data string Y in the second embodiment.
 Specifically, the first data Y1 is the phrase vector string V representing each phrase included in the word string T. For example, in each unit period U within the period of the song corresponding to one phrase, the phrase vector V of that phrase is used as the first data Y1. The second data Y2 is the phoneme data P representing each phoneme of the word string T. For example, in each unit period U within the period of the song corresponding to one phoneme, the phoneme data P of that phoneme is used as the second data Y2.
 The third embodiment achieves the same effects as the first embodiment. Furthermore, in the third embodiment the second control data string Y includes both the first data Y1 (phrase vector string V) and the second data Y2 (phoneme data P). Therefore, an acoustic data string Z of the target sound can be generated in which both the meaning of each phrase in the word string T and the pronunciation of the phonemes in the word string T are reflected, for example by combining the two kinds of data per unit period as sketched below.
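 A minimal sketch of that per-unit-period combination, assuming a 256-dimensional phrase vector and a 10-phoneme one-hot encoding purely for illustration:

```python
import numpy as np

# Assumed sizes: 256-dimensional phrase vectors and a 10-phoneme inventory.
PHRASE_DIM, N_PHONEMES = 256, 10

def combined_control_data(phrase_vectors, phoneme_onehots):
    """Third-embodiment second control data: per unit period, the first data Y1
    (phrase vector of the current phrase) concatenated with the second data Y2
    (one-hot phoneme vector of the current phoneme)."""
    assert phrase_vectors.shape[0] == phoneme_onehots.shape[0]  # same number of unit periods
    return np.concatenate([phrase_vectors, phoneme_onehots], axis=1)

n_units = 100
Y1 = np.random.randn(n_units, PHRASE_DIM)                            # stand-in phrase vector string V
Y2 = np.eye(N_PHONEMES)[np.random.randint(0, N_PHONEMES, n_units)]   # stand-in phoneme data P
Y = combined_control_data(Y1, Y2)
print(Y.shape)  # (100, 266)
```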
D: Modifications
 Specific modifications that may be added to each of the forms exemplified above are illustrated below. Two or more aspects arbitrarily selected from the following examples may be combined as appropriate to the extent that they do not contradict one another.
(1) The phoneme data P in the second embodiment is not limited to a vector composed of K elements E1 to EK. For example, a code (identifier) uniquely assigned to each phoneme may be used as the phoneme data P.
(2) In each of the foregoing forms, the acoustic data string Z represents the frequency characteristics of the target sound, but the information represented by the acoustic data string Z is not limited to this example. For example, a form in which the acoustic data string Z represents the individual samples of the target sound is also conceivable. In that form, the time series of the acoustic data string Z constitutes the acoustic signal A, and the signal generation unit 34 is therefore omitted.
(3) In each of the foregoing forms, the control data string acquisition unit 30 generates the first control data string X and the second control data string Y, but the operation of the control data string acquisition unit 30 is not limited to this example. For example, the control data string acquisition unit 30 may receive a first control data string X and a second control data string Y generated by an external device from that device via the communication device 13. In a form in which the first control data string X and the second control data string Y are stored in the storage device 12, the control data string acquisition unit 30 reads them from the storage device 12. As understood from these examples, "acquisition" by the control data string acquisition unit 30 encompasses any operation that obtains the first control data string X and the second control data string Y, such as generation, reception, and reading. Likewise, "acquisition" of the first control data string Xt and the second control data string Yt by the training data acquisition unit 41 encompasses any operation that obtains them (for example, generation, reception, and reading).
(4) In each of the foregoing forms, the control data string C obtained by concatenating the first control data string X and the second control data string Y is supplied to the generative model Mb, but the form in which the first control data string X and the second control data string Y are input to the generative model Mb is not limited to this example.
 For example, as illustrated in FIG. 11, assume a form in which the generative model Mb is composed of a first part Mb1 and a second part Mb2. The first part Mb1 consists of the input layer and some of the intermediate layers of the generative model Mb, and the second part Mb2 consists of the remaining intermediate layers and the output layer. In this form, the first control data string X may be supplied to the first part Mb1 (input layer), and the second control data string Y may be supplied to the second part Mb2 together with the data output from the first part Mb1, as sketched below. As understood from this example, concatenation of the first control data string X and the second control data string Y is not essential in the present disclosure.
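 A minimal sketch of this split form, written with PyTorch under assumed layer sizes; the module structure and all dimensions are illustrative and do not represent the actual architecture of the generative model Mb.

```python
import torch
import torch.nn as nn

class SplitGenerativeModel(nn.Module):
    """Sketch of the FIG. 11 form: the first part Mb1 (input layer plus some
    intermediate layers) receives only X; the second part Mb2 (remaining
    intermediate layers plus output layer) receives Mb1's output together
    with Y. Dimensions are illustrative assumptions."""
    def __init__(self, x_dim=64, y_dim=266, hidden=512, z_dim=80):
        super().__init__()
        self.part1 = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU(),
                                   nn.Linear(hidden, hidden), nn.ReLU())
        self.part2 = nn.Sequential(nn.Linear(hidden + y_dim, hidden), nn.ReLU(),
                                   nn.Linear(hidden, z_dim))

    def forward(self, x, y):
        h = self.part1(x)                              # first control data only
        return self.part2(torch.cat([h, y], dim=-1))   # second control data joins mid-network

model = SplitGenerativeModel()
x = torch.randn(100, 64)   # first control data string X (per unit period)
y = torch.randn(100, 266)  # second control data string Y
z = model(x, y)            # acoustic data string Z (e.g., spectral frames)
print(z.shape)             # torch.Size([100, 80])
```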
(5) As illustrated in FIG. 12, a plurality of generative models Mb corresponding to different musical instruments may be used selectively. The generative model Mb corresponding to one type of instrument is a trained model trained using reference signals R of instrument sounds produced by that instrument. Accordingly, the generative model Mb corresponding to each instrument outputs an acoustic data string Z representing the instrument sound of that instrument.
 The user selects one of the plural types of instruments by operating the operation device 14. The instrument data α in FIG. 12 is data designating the instrument selected by the user. The acoustic data string generation unit 33 selects, from among the plurality of generative models Mb, the generative model Mb corresponding to the instrument designated by the instrument data α, and generates the acoustic data string Z by processing the control data string C with that generative model Mb. With this configuration, a target sound having the timbre of any one of the plural types of instruments can be generated.
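 A minimal sketch of this selection, using small dummy models in place of the trained per-instrument models Mb; the instrument names and dimensions are assumptions for illustration.

```python
import torch
import torch.nn as nn

# Dummy linear models stand in for trained per-instrument generative models Mb.
C_DIM, Z_DIM = 330, 80
models_mb = {name: nn.Linear(C_DIM, Z_DIM) for name in ("piano", "violin", "flute")}

def generate_acoustic_data(control_data_c, instrument_alpha):
    model_b = models_mb[instrument_alpha]      # model Mb designated by instrument data alpha
    with torch.no_grad():
        return model_b(control_data_c)         # acoustic data string Z with that timbre

z = generate_acoustic_data(torch.randn(100, C_DIM), "violin")
print(z.shape)  # torch.Size([100, 80])
```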
 As illustrated in FIG. 13, a control data string C including the instrument data α in addition to the first control data string X and the second control data string Y may instead be input to a single generative model Mb. The generative model Mb in FIG. 13 is established by machine learning using a plurality of reference signals R corresponding to different instruments, and the training data L includes, in addition to the first training control data string Xt and the second training control data string Yt, instrument data α designating the instrument corresponding to the reference signal R. The target sound represented by the acoustic data string Z is therefore an instrument sound having the timbre of the instrument designated by the instrument data α.
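 A minimal sketch of this conditioning, assuming the instrument data α is encoded as a one-hot vector and appended to each unit period of the control data string C; the instrument list and dimensions are illustrative assumptions.

```python
import torch

INSTRUMENTS = ["piano", "violin", "flute"]  # assumed instrument inventory

def build_control_data(x, y, instrument_alpha):
    """Form C = [X ; Y ; alpha] per unit period for a single generative model Mb."""
    alpha = torch.zeros(x.shape[0], len(INSTRUMENTS))
    alpha[:, INSTRUMENTS.index(instrument_alpha)] = 1.0
    return torch.cat([x, y, alpha], dim=-1)

c = build_control_data(torch.randn(100, 64), torch.randn(100, 266), "flute")
print(c.shape)  # torch.Size([100, 333])
```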
(6) In each of the foregoing forms, the note data string N is generated from the music data D stored in advance in the storage device 12, but a note data string N supplied sequentially from a performance device may also be used. The performance device is an input device, such as a MIDI keyboard, that accepts a performance by the user and sequentially outputs note data N corresponding to that performance. The sound generation system 10 generates the acoustic data string Z using the note data string N supplied from the performance device. The synthesis processing Sa described above may be executed in real time, in parallel with the user's performance on the performance device. Specifically, the second control data string Y and the acoustic data string Z may be generated in parallel with the user's operation of the performance device.
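 A minimal sketch of reading such a note data stream from a MIDI keyboard with the mido package; the port selection, the callback, and the link to the synthesis processing Sa are assumptions for illustration, and the example requires a connected MIDI input device.

```python
import time
import mido  # assumes the mido package and a connected MIDI input device

def on_note(note_number, velocity, onset_time):
    print(f"note {note_number}, velocity {velocity}, at {onset_time:.3f} s")
    # Here the system would update the control data strings and run the synthesis processing Sa.

with mido.open_input() as port:               # default MIDI input port
    start = time.monotonic()
    for msg in port:                          # blocks, yielding messages as they are played
        if msg.type == "note_on" and msg.velocity > 0:
            on_note(msg.note, msg.velocity, time.monotonic() - start)
```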
(7) In each of the foregoing forms, one phrase vector string V is generated from one piece of phrase data Q, but a plurality of phrase vector strings V may be generated from one phrase data string Q. FIG. 14 illustrates the operation of the control data string acquisition unit 30 in this modification. The phrase data string Q in FIG. 14 is data representing the word sequences obtained by dividing the word string T for each phrase of the song; that is, each piece of phrase data Q identifies the word sequence corresponding to one phrase. A phrase is a section into which the song is divided according to musical or semantic cohesion. For example, each phrase is specified in the music data D, although the phrases of the song may instead be delimited by analyzing the music data D. In this way, the language analysis unit 321 generates the phrase data string Q by dividing the word string T into phrases.
 The information generation unit 322 processes each piece of phrase data Q to generate a phrase vector string V for each word included in the corresponding word sequence. As illustrated in FIG. 14, when one piece of phrase data Q contains a plurality of words, the context of that word sequence is analyzed, and a plurality of phrase vector strings V, one for each word, are generated from the single piece of phrase data Q. As in the first embodiment, the information generation unit 322 uses a generative model Ma capable of interpreting context to generate the phrase vector strings V. The generative model Ma is a trained model that has learned the relationship between the context of the word sequence indicated by the phrase data Q and the phrase vector strings V indicating the meaning of each word in that context. Specifically, the generative model Ma generates, for each word included in the word sequence indicated by the phrase data Q, a phrase vector string V indicating the meaning of that word within the sequence. For example, a natural language processing model such as BERT (Bidirectional Encoder Representations from Transformers) is used as the generative model Ma. As understood from the above description, in this modification one or more phrase vector strings V corresponding to the phrase data Q are generated for each phrase.
 For example, FIG. 14 illustrates a piece of phrase data Q1 specifying word sequence #1 composed of word #1a and word #1b. In response to the input of this single piece of phrase data Q1, the generative model Ma generates a phrase vector string V corresponding to word #1a and a phrase vector string V corresponding to word #1b. The phrase data Q2 in FIG. 14 specifies word sequence #2 composed of word #2a, word #2b, and word #2c. In response to the input of this single piece of phrase data Q2, the generative model Ma generates phrase vector strings V corresponding to word #2a, word #2b, and word #2c. As in the first embodiment, in each unit period U corresponding to one word, the phrase vector V of that word is used repeatedly as the second control data Y. In this modification, interpreting the context of the word sequence yields phrase vectors V that more accurately indicate the meaning of each word contained in that sequence; a sketch of such contextual vector extraction follows.
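 A minimal sketch of context-dependent phrase vectors, using a generic BERT checkpoint from the transformers package as a stand-in for the generative model Ma; the checkpoint name and the per-word mean pooling are assumptions for illustration.

```python
import torch
from transformers import AutoTokenizer, AutoModel  # assumes the transformers package

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

def phrase_vectors(words):
    """Return one context-dependent vector per word of one phrase's word sequence."""
    enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**enc).last_hidden_state[0]      # (tokens, 768)
    word_ids = enc.word_ids(0)
    vectors = []
    for i in range(len(words)):                           # mean-pool sub-tokens per word
        idx = [t for t, w in enumerate(word_ids) if w == i]
        vectors.append(hidden[idx].mean(dim=0))
    return torch.stack(vectors)

V = phrase_vectors(["spring", "rain"])   # word sequence of one phrase
print(V.shape)  # torch.Size([2, 768])
```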
(8) Deep neural networks are exemplified in the foregoing forms, but the generative models M (Ma, Mb, Mc) are not limited to deep neural networks. For example, a statistical model of any form and type, such as an HMM (Hidden Markov Model) or an SVM (Support Vector Machine), may be used as the generative model M (Ma, Mb, Mc).
(9) In the foregoing forms, the machine learning system 20 establishes the generative model Mb, but the functions for establishing the generative model Mb (the training data acquisition unit 41 and the learning processing unit 42) may instead be provided in the sound generation system 10. A function for establishing the generative model Ma or the generative model Mc may likewise be provided in the sound generation system 10.
(10) The sound generation system 10 may be realized by a server device that communicates with an information device such as a smartphone or a tablet terminal. For example, the sound generation system 10 receives music data D from the information device, generates the acoustic signal A by the synthesis processing Sa applied to that music data D, and transmits the generated acoustic signal A to the information device. In a form in which the signal generation unit 34 is provided in the information device, the time series of the acoustic data string Z is transmitted to the information device; that is, the signal generation unit 34 is omitted from the sound generation system 10.
(11) As described above, the functions of the sound generation system 10 (the control data string acquisition unit 30, the acoustic data string generation unit 33, and the signal generation unit 34) are realized by cooperation between the single or plural processors constituting the control device 11 and the program stored in the storage device 12. Similarly, the functions of the machine learning system 20 (the training data acquisition unit 41 and the learning processing unit 42) are realized by cooperation between the single or plural processors constituting the control device 21 and the program stored in the storage device 22.
 The programs exemplified above may be provided in a form stored in a computer-readable recording medium and installed in a computer. The recording medium is, for example, a non-transitory recording medium; an optical recording medium (optical disc) such as a CD-ROM is a good example, but any known form of recording medium, such as a semiconductor recording medium or a magnetic recording medium, is also included. A non-transitory recording medium includes any recording medium other than a transitory, propagating signal, and volatile recording media are not excluded. In a configuration in which a distribution device distributes the program via the communication network 200, the recording medium storing the program in that distribution device corresponds to the above-mentioned non-transitory recording medium.
E: Supplementary Notes
 For example, the following configurations can be derived from the forms exemplified above.
 A sound generation method according to one aspect (Aspect 1) of the present disclosure obtains a first control data string representing characteristics of a note string and a second control data string representing characteristics of a text corresponding to the note string, and generates an acoustic data string representing a musical instrument sound of the note string having acoustic characteristics according to the characteristics of the text represented by the second control data string, by processing the first control data string and the second control data string with a trained first generative model.
 In this aspect, in addition to the first control data string representing the characteristics of the note string, the second control data string representing the characteristics of the text corresponding to that note string is used to generate the acoustic data string. Therefore, compared with a configuration that generates the acoustic data string from the first control data string alone, an acoustic data string of musical instrument sounds having diverse acoustic characteristics according to the text corresponding to the note string can be generated.
 The "first control data string" is data of any format (first control data) representing the characteristics of a note string, and is generated, for example, from a note data string representing the note string. The first control data string may also be generated from a note data string generated in real time in response to operations on an input device such as an electronic musical instrument. The "first control data string" may also be described as data specifying the conditions of the musical instrument sound to be synthesized. For example, the "first control data string" specifies various conditions regarding each note constituting the note string, such as the pitch or duration of each note, or the relationship between the pitch of one note and the pitches of other notes located around that note. A "musical instrument sound" is a musical sound produced from an instrument by performance.
 The "first generative model" is a trained model that has learned, by machine learning, the relationship between the first and second control data strings and the acoustic data string. A plurality of pieces of training data are used in the machine learning of the first generative model. Each piece of training data includes a pair of a first training control data string and a second training control data string, together with a training acoustic data string. The first training control data string is data representing the characteristics of a reference note string, and the second training control data string is data representing the characteristics of a text corresponding to the reference note string. The training acoustic data string represents the musical instrument sound produced by a performance based on the note string corresponding to the first training control data string and the text corresponding to the second training control data string. Various statistical estimation models, such as a deep neural network (DNN), a hidden Markov model (HMM), or a support vector machine (SVM), may be used as the "first generative model".
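 As one way to picture the machine learning described above, the following minimal sketch runs a single training step on a stand-in model; the network, the L1 loss, and all dimensions are assumptions for illustration and not the training configuration of the present disclosure.

```python
import torch
import torch.nn as nn

# Stand-in first generative model: training control data strings (Xt, Yt) mapped
# to an acoustic data string and compared with the training acoustic data string Zt.
model_mb = nn.Sequential(nn.Linear(64 + 266, 512), nn.ReLU(), nn.Linear(512, 80))
optimizer = torch.optim.Adam(model_mb.parameters(), lr=1e-4)

def training_step(xt, yt, zt):
    c = torch.cat([xt, yt], dim=-1)     # training control data string
    z_pred = model_mb(c)                # estimated acoustic data string
    loss = nn.functional.l1_loss(z_pred, zt)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

loss = training_step(torch.randn(100, 64), torch.randn(100, 266), torch.randn(100, 80))
print(loss)
```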
 The form in which the first control data string and the second control data string are input to the first generative model is arbitrary. For example, input data including the first control data string and the second control data string may be input to the first generative model. In a configuration in which the first generative model includes an input layer, a plurality of intermediate layers, and an output layer, a form in which the first control data string is input to the input layer and the second control data string is input to an intermediate layer is also conceivable. That is, combining the first control data string and the second control data string is not essential.
 The "acoustic data string" is data of any format (acoustic data) representing the musical instrument sound. For example, data representing acoustic characteristics (frequency characteristics) such as an intensity spectrum, a mel spectrum, or MFCCs (Mel-Frequency Cepstrum Coefficients) is one example of the "acoustic data string". A sample sequence representing the waveform of the musical instrument sound may also be generated as the "acoustic data string".
 "Text corresponding to a note string" means that a text is associated with the note string. That is, the "correspondence" between a note string and a text means, for example, that each note of the note string is temporally associated with each phrase of the text.
 In a specific example of Aspect 1 (Aspect 2), the first generative model is a model trained using training data including a first training control data string representing characteristics of a reference note string, a second training control data string representing characteristics of a text corresponding to the reference note string, and a training acoustic data string representing a musical instrument sound of the reference note string. According to this aspect, an acoustic data string that is statistically valid in view of the relationship between the first and second training control data strings of the reference note string and the training acoustic data string representing the musical instrument sound of that reference note string can be generated.
 In a specific example of Aspect 1 or Aspect 2 (Aspect 3), the second control data string includes a phrase vector string representing phrases included in the text. According to this aspect, because the second control data string includes a phrase vector string representing the phrases in the text, an acoustic data string of musical instrument sounds in which the meanings of the phrases in the text are reflected in the acoustic characteristics can be generated.
 The "phrase vector string" is a sequence of vectors (phrase vectors) defined in a linguistic space (semantic space) according to the meanings of phrases. A "phrase" is a single word or an arrangement of plural words. For generating the phrase vector string, a statistical estimation model such as those described in Tomas Mikolov et al., "Efficient Estimation of Word Representations in Vector Space," arXiv:1301.3781 [cs.CL], 2013 (Word2Vec), or Quoc Le and Tomas Mikolov, "Distributed Representations of Sentences and Documents," CoRR, abs/1405.4053, pp. 1-9, 2014 (Doc2Vec), may be used.
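 A minimal sketch of obtaining phrase vectors with Word2Vec via the gensim package, as one possible realization of the cited statistical estimation model; the toy corpus and the vector size are assumptions for illustration, and a real system would train on a large corpus.

```python
from gensim.models import Word2Vec  # assumes the gensim package

corpus = [["spring", "rain", "falls"], ["cherry", "blossoms", "in", "spring"]]
w2v = Word2Vec(sentences=corpus, vector_size=256, min_count=1, epochs=50)

vector = w2v.wv["spring"]     # 256-dimensional phrase vector for one word
print(vector.shape)           # (256,)
```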
 In a specific example of Aspect 3 (Aspect 4), in obtaining the second control data string, the phrase vector string is generated by a trained second generative model. According to this aspect, the second control data string can be generated simply by using the second generative model.
 In a specific example of any one of Aspects 1 to 4 (Aspect 5), the second control data string includes phoneme data representing the phonemes constituting the text. According to this aspect, because the second control data string includes phoneme data representing the phonemes constituting the text, an acoustic data string of musical instrument sounds in which non-linguistic characteristics of the pronunciation of the phonemes in the text (for example, characteristics in the time domain or the frequency domain) are reflected in the acoustic characteristics can be generated.
 In a specific example of any one of Aspects 1 to 5 (Aspect 6), the acquisition of the first control data and the second control data and the generation of the acoustic data are executed in each of a plurality of unit periods on a time axis.
 A sound generation system according to one aspect (Aspect 7) of the present disclosure includes: a control data string acquisition unit that obtains a first control data string representing characteristics of a note string and a second control data string representing characteristics of a text corresponding to the note string; and an acoustic data string generation unit that generates an acoustic data string representing a musical instrument sound of the note string having acoustic characteristics according to the characteristics of the text represented by the second control data string, by processing the first control data string and the second control data string with a trained first generative model.
 A program according to one aspect (Aspect 8) of the present disclosure causes a computer system to function as: a control data string acquisition unit that obtains a first control data string representing characteristics of a note string and a second control data string representing characteristics of a text corresponding to the note string; and an acoustic data string generation unit that generates an acoustic data string representing a musical instrument sound of the note string having acoustic characteristics according to the characteristics of the text represented by the second control data string, by processing the first control data string and the second control data string with a trained first generative model.
100: information system; 10: sound generation system; 11: control device; 12: storage device; 13: communication device; 14: operation device; 15: sound emitting device; 20: machine learning system; 21: control device; 22: storage device; 23: communication device; 30: control data string acquisition unit; 31: first generation unit; 32: second generation unit; 321: language analysis unit; 322, 326: information generation unit; 326: phoneme analysis unit; 33: acoustic data string generation unit; 34: signal generation unit; 41: training data acquisition unit; 42: learning processing unit.

Claims (8)

  1.  A sound generation method realized by a computer system, the method comprising:
     obtaining a first control data string representing characteristics of a note string and a second control data string representing characteristics of a text corresponding to the note string; and
     generating, by processing the first control data string and the second control data string with a trained first generative model, an acoustic data string representing a musical instrument sound of the note string having acoustic characteristics according to the characteristics of the text represented by the second control data string.
  2.  The sound generation method according to claim 1, wherein the first generative model is a model trained using training data including:
     a first training control data string representing characteristics of a reference note string and a second training control data string representing characteristics of a text corresponding to the reference note string; and
     a training acoustic data string representing a musical instrument sound of the reference note string.
  3.  The sound generation method according to claim 1 or claim 2, wherein the second control data string includes a phrase vector string representing phrases included in the text.
  4.  The sound generation method according to claim 3, wherein, in obtaining the second control data string, the phrase vector string is generated by a trained second generative model.
  5.  The sound generation method according to any one of claims 1 to 4, wherein the second control data string includes phoneme data representing phonemes constituting the text.
  6.  The sound generation method according to any one of claims 1 to 5, wherein, in each of a plurality of unit periods on a time axis,
     acquisition of individual first control data and second control data in the obtaining of the first control data string and the second control data string, and
     generation of individual acoustic data in the generating of the acoustic data string, are executed.
  7.  A sound generation system comprising:
     a control data string acquisition unit that obtains a first control data string representing characteristics of a note string and a second control data string representing characteristics of a text corresponding to the note string; and
     an acoustic data string generation unit that generates, by processing the first control data string and the second control data string with a trained first generative model, an acoustic data string representing a musical instrument sound of the note string having acoustic characteristics according to the characteristics of the text represented by the second control data string.
  8.  A program that causes a computer system to function as:
     a control data string acquisition unit that obtains a first control data string representing characteristics of a note string and a second control data string representing characteristics of a text corresponding to the note string; and
     an acoustic data string generation unit that generates, by processing the first control data string and the second control data string with a trained first generative model, an acoustic data string representing a musical instrument sound of the note string having acoustic characteristics according to the characteristics of the text represented by the second control data string.
PCT/JP2023/007783 2022-03-09 2023-03-02 Sound generation method, sound generation system, and program WO2023171522A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2022036293A JP2023131494A (en) 2022-03-09 2022-03-09 Sound generation method, sound generation system and program
JP2022-036293 2022-03-09

Publications (1)

Publication Number Publication Date
WO2023171522A1 true WO2023171522A1 (en) 2023-09-14

Family

ID=87935315

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2023/007783 WO2023171522A1 (en) 2022-03-09 2023-03-02 Sound generation method, sound generation system, and program

Country Status (2)

Country Link
JP (1) JP2023131494A (en)
WO (1) WO2023171522A1 (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH06175658A (en) * 1992-12-02 1994-06-24 Matsushita Electric Ind Co Ltd Electronic musical instrument
JPH06332443A (en) * 1993-05-26 1994-12-02 Matsushita Electric Ind Co Ltd Score recognizing device
JPH0895566A (en) * 1994-09-27 1996-04-12 Yamaha Corp Automatic accompaniment device
JP2000227794A (en) * 1999-02-08 2000-08-15 Yamaha Corp Musical sound outputting device and recording medium therefor

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"Iwanami Course Multimedia Informatics 4, Information processing of letters and sounds, first edition", 1 January 2000, IWANAMI SHOTEN, PUBLISHERS, ISBN: 4-00-010964-2, article NAGAO, MAKOTO: "Automatic performance and musical interpretation", pages: 195 - 206, XP009548571 *
ZHANG YIXIAO, WANG ZIYU, WANG DINGSU, XIA GUS: "BUTTER: A Representation Learning Framework for Bi-directional Music-Sentence Retrieval and Generation", PROCEEDINGS OF THE 1ST WORKSHOP ON NLP FOR MUSIC AND AUDIO (NLP4MUSA), 16 October 2020 (2020-10-16), pages 54 - 58, XP093089574, Retrieved from the Internet <URL:https://aclanthology.org/2020.nlp4musa-1.11.pdf> [retrieved on 20231008] *

Also Published As

Publication number Publication date
JP2023131494A (en) 2023-09-22

Similar Documents

Publication Publication Date Title
JP6547878B1 (en) Electronic musical instrument, control method of electronic musical instrument, and program
JP6610715B1 (en) Electronic musical instrument, electronic musical instrument control method, and program
JP6610714B1 (en) Electronic musical instrument, electronic musical instrument control method, and program
KR101274961B1 (en) music contents production system using client device.
JP7088159B2 (en) Electronic musical instruments, methods and programs
JP2011048335A (en) Singing voice synthesis system, singing voice synthesis method and singing voice synthesis device
JP2016161919A (en) Voice synthesis device
JP2022071098A (en) Electronic musical instrument, method, and program
WO2020095950A1 (en) Information processing method and information processing system
CN113160780A (en) Electronic musical instrument, method and storage medium
JP6737320B2 (en) Sound processing method, sound processing system and program
WO2023171522A1 (en) Sound generation method, sound generation system, and program
JP6835182B2 (en) Electronic musical instruments, control methods for electronic musical instruments, and programs
JP6819732B2 (en) Electronic musical instruments, control methods for electronic musical instruments, and programs
JP6801766B2 (en) Electronic musical instruments, control methods for electronic musical instruments, and programs
WO2023171497A1 (en) Acoustic generation method, acoustic generation system, and program
WO2020171035A1 (en) Sound signal synthesis method, generative model training method, sound signal synthesis system, and program
WO2022080395A1 (en) Audio synthesizing method and program
US20230290325A1 (en) Sound processing method, sound processing system, electronic musical instrument, and recording medium
JP2022065554A (en) Method for synthesizing voice and program
JP2022065566A (en) Method for synthesizing voice and program
JP2020184092A (en) Information processing method
JP2013238664A (en) Speech fragment segmentation device
JP2004294795A (en) Tone synthesis control data, recording medium recording the same, data generating device, program, and tone synthesizer

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23766698

Country of ref document: EP

Kind code of ref document: A1