WO2021251364A1 - Acoustic processing method, acoustic processing system, and program - Google Patents

Acoustic processing method, acoustic processing system, and program

Info

Publication number
WO2021251364A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
time
music
symbol
acoustic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/JP2021/021691
Other languages
English (en)
French (fr)
Japanese (ja)
Inventor
慶二郎 才野
竜之介 大道
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yamaha Corp
Original Assignee
Yamaha Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yamaha Corp
Priority to CN202180040942.0A (CN115699161A, zh)
Priority to EP21823051.4A (EP4163912A4, en)
Priority to JP2022530567A (JP7517419B2, ja)
Publication of WO2021251364A1 (ja)
Priority to US18/076,739 (US20230098145A1, en)
Anticipated expiration
Legal status: Ceased

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H7/00 Instruments in which the tones are synthesised from a data store, e.g. computer organs
    • G10H7/002 Instruments in which the tones are synthesised from a data store, e.g. computer organs using a common processing for different operations or calculations, and a set of microinstructions, e.g. programs, to control the sequence thereof
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10G REPRESENTATION OF MUSIC; RECORDING MUSIC IN NOTATION FORM; ACCESSORIES FOR MUSIC OR MUSICAL INSTRUMENTS NOT OTHERWISE PROVIDED FOR, e.g. SUPPORTS
    • G10G1/00 Means for the representation of music
    • G10G1/04 Transposing; Transcribing
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H1/00 Details of electrophonic musical instruments
    • G10H1/0008 Associated control or indicating means
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H1/00 Details of electrophonic musical instruments
    • G10H1/0033 Recording/reproducing or transmission of music for electrophonic musical instruments
    • G10H1/0041 Recording/reproducing or transmission of music for electrophonic musical instruments in coded form
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H7/00 Instruments in which the tones are synthesised from a data store, e.g. computer organs
    • G10H7/08 Instruments in which the tones are synthesised from a data store, e.g. computer organs by calculating functions or polynomial approximations to evaluate amplitudes at successive sample points of a tone waveform
    • G10H7/10 Instruments in which the tones are synthesised from a data store, e.g. computer organs by calculating functions or polynomial approximations to evaluate amplitudes at successive sample points of a tone waveform using coefficients or parameters stored in a memory, e.g. Fourier coefficients
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2210/00 Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
    • G10H2210/031 Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal
    • G10H2210/086 Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal for transcription of raw audio or music data to a displayed or printed staff representation or to displayable MIDI-like note-oriented data, e.g. in pianoroll format
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2250/00 Aspects of algorithms or signal processing methods without intrinsic musical character, yet specifically adapted for or used in electrophonic musical processing
    • G10H2250/311 Neural networks for electrophonic musical instruments or musical processing, e.g. for musical recognition or control, automatic composition or improvisation

Definitions

  • This disclosure relates to acoustic processing.
  • Non-Patent Document 1 or Non-Patent Document 2 discloses a technique for generating samples of an acoustic signal by a synthesis process performed at each time step using a deep neural network (DNN).
  • Each sample of the acoustic signal can be generated taking into account the features of the music after (later than) the current time step.
  • An object of one aspect of the present disclosure is to generate a synthetic sound that reflects both the features of the later part of the music and real-time instructions from the user.
  • In the acoustic processing method according to one aspect of the present disclosure, at each of a plurality of time steps on the time axis, coded data corresponding to the features of the music at that time step and the features of the music after that time step is acquired, control data according to a real-time instruction from the user is acquired, and acoustic feature data representing the acoustic features of the synthetic sound is generated according to first input data including the acquired control data and the acquired coded data.
  • The acoustic processing system according to one aspect of the present disclosure includes: a coded data acquisition unit that acquires, at each of a plurality of time steps on the time axis, coded data corresponding to the features of the music at that time step and the features of the music after that time step; a control data acquisition unit that acquires, at each of the plurality of time steps, control data according to a real-time instruction from the user; and an acoustic feature data generation unit that generates, at each of the plurality of time steps, acoustic feature data representing the acoustic features of the synthetic sound according to first input data including the acquired control data and the acquired coded data.
  • The program according to one aspect of the present disclosure causes a computer to function as: a coded data acquisition unit that acquires, at each of a plurality of time steps on the time axis, coded data corresponding to the features of the music at that time step and the features of the music after that time step; a control data acquisition unit that acquires, at each of the plurality of time steps, control data according to a real-time instruction from the user; and an acoustic feature data generation unit that generates, at each of the plurality of time steps, acoustic feature data representing the acoustic features of the synthetic sound according to first input data including the acquired control data and the acquired coded data.
  • FIG. 1 is a block diagram illustrating the configuration of the acoustic processing system 100 according to the first embodiment of the present disclosure.
  • The acoustic processing system 100 is a computer system that generates an acoustic signal W representing the waveform of a synthetic sound.
  • The synthetic sound is, for example, an instrument sound produced by a virtual performer playing an instrument, or a singing sound produced by a virtual singer singing a song.
  • The acoustic signal W is composed of a time series of a plurality of samples.
  • The acoustic processing system 100 includes a control device 11, a storage device 12, a sound emitting device 13, and an operation device 14.
  • The acoustic processing system 100 is realized by an information device such as a smartphone, a tablet terminal, or a personal computer.
  • The acoustic processing system 100 may be realized not only as a single device but also as a plurality of devices configured separately from each other (for example, a client-server system).
  • The storage device 12 is one or more memories that store a program executed by the control device 11 and various data used by the control device 11.
  • The storage device 12 is composed of a known recording medium such as a magnetic recording medium or a semiconductor recording medium, or a combination of a plurality of types of recording media. A portable recording medium that can be attached to and detached from the acoustic processing system 100, or a recording medium that can be written to or read from via a communication network (for example, cloud storage), may also be used as the storage device 12.
  • The storage device 12 stores music data D representing the content of the music.
  • FIG. 2 illustrates music data D used for synthesizing instrument sounds, and FIG. 3 illustrates music data D used for synthesizing singing sounds.
  • The music data D represents a time series of a plurality of symbols constituting the music. Each symbol is a note or a phoneme.
  • The music data D used for synthesizing instrument sounds specifies a continuation length d1 and a pitch d2 for each of the plurality of symbols (specifically, notes) constituting the music.
  • The music data D used for synthesizing a singing sound specifies a continuation length d1, a pitch d2, and a phoneme code d3 for each of the plurality of symbols (specifically, phonemes) constituting the music.
  • The continuation length d1 is the number of beats over which the symbol continues.
  • The continuation length d1 is specified by a tick value that does not depend on the tempo of the music.
  • The pitch d2 is specified by, for example, a note number.
  • The phoneme code d3 is a code for identifying a phoneme.
  • The phoneme /sil/ in FIG. 3 means silence.
  • The music data D can also be described as data representing the score of the music.
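  • As a minimal sketch of the music data D described above, the following Python fragment models each symbol as a record holding the continuation length d1 (in ticks), the pitch d2 (a note number), and, for singing synthesis, the phoneme code d3. The class name `Symbol`, the field names, the use of `None` as the pitch of a silent symbol, and all example values are hypothetical and are not taken from the disclosure.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Symbol:
    """One symbol (note or phoneme) of the music data D (hypothetical layout)."""
    d1_ticks: int                      # continuation length d1, in tempo-independent ticks
    d2_note: Optional[int]             # pitch d2 as a note number (None is assumed for silence)
    d3_phoneme: Optional[str] = None   # phoneme code d3; None for instrument notes


# Music data D for an instrument part: a time series of notes.
instrument_d: List[Symbol] = [
    Symbol(d1_ticks=480, d2_note=60),
    Symbol(d1_ticks=240, d2_note=62),
    Symbol(d1_ticks=240, d2_note=64),
]

# Music data D for a singing part: a time series of phonemes, /sil/ meaning silence.
singing_d: List[Symbol] = [
    Symbol(d1_ticks=240, d2_note=None, d3_phoneme="sil"),
    Symbol(d1_ticks=120, d2_note=60, d3_phoneme="s"),
    Symbol(d1_ticks=360, d2_note=60, d3_phoneme="a"),
]


def total_ticks(music_data: List[Symbol]) -> int:
    """Total length of the piece in ticks (sum of the continuation lengths d1)."""
    return sum(symbol.d1_ticks for symbol in music_data)
```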
  • The control device 11 in FIG. 1 is one or more processors that control each element of the acoustic processing system 100. Specifically, the control device 11 is composed of one or more types of processors such as a CPU (Central Processing Unit), SPU (Sound Processing Unit), DSP (Digital Signal Processor), FPGA (Field Programmable Gate Array), or ASIC (Application Specific Integrated Circuit). The control device 11 generates an acoustic signal W from the music data D stored in the storage device 12.
  • The sound emitting device 13 reproduces the synthetic sound represented by the acoustic signal W generated by the control device 11.
  • The sound emitting device 13 is, for example, a speaker or headphones.
  • For convenience, the D/A converter that converts the acoustic signal W from digital to analog and the amplifier that amplifies the acoustic signal W are not shown. Further, although FIG. 1 illustrates a configuration in which the sound emitting device 13 is mounted on the acoustic processing system 100, a sound emitting device 13 separate from the acoustic processing system 100 may be connected to the acoustic processing system 100 by wire or wirelessly.
  • The operation device 14 is an input device that receives instructions from the user.
  • The operation device 14 is, for example, a plurality of controls operated by the user or a touch panel that detects contact by the user.
  • An input device such as a MIDI (Musical Instrument Digital Interface) controller that includes controls such as an operation knob or an operation pedal may be used as the operation device 14.
  • By operating the operation device 14, the user can specify conditions of the synthetic sound to the acoustic processing system 100.
  • Specifically, the user can specify an indicated value Z1 and the tempo Z2 of the music.
  • The indicated value Z1 of the first embodiment is a numerical value representing the strength (dynamics) of the synthetic sound.
  • The indicated value Z1 and the tempo Z2 are specified in real time in parallel with the generation of the acoustic signal W.
  • The indicated value Z1 and the tempo Z2 change continuously on the time axis according to the instruction from the user.
  • The method by which the user specifies the tempo Z2 is arbitrary.
  • For example, the tempo Z2 may be determined from the period at which the user repeatedly operates a control of the operation device 14.
  • The tempo Z2 may also be determined from an instrument performance or a singing voice by the user.
  • FIG. 4 is a block diagram illustrating a functional configuration of the acoustic processing system 100.
  • By executing a program stored in the storage device 12, the control device 11 realizes a plurality of functions (coding model 21, coded data acquisition unit 22, control data acquisition unit 31, generative model 40, and waveform synthesis unit 50) for generating an acoustic signal W from the music data D.
  • The coding model 21 is a statistical estimation model that generates a time series of symbol data B from the music data D. As illustrated as step Sa12 in FIGS. 2 and 3, the coding model 21 generates symbol data B for each of the plurality of symbols constituting the music. That is, one symbol data B is generated from one symbol (one note or one phoneme) of the music data D. Specifically, the coding model 21 generates the symbol data B of each symbol from that symbol and the symbols before and after it. A time series of symbol data B over the entire music is thus generated from the music data D. The coding model 21 is a trained model that has learned the relationship between the music data D and the time series of symbol data B.
  • The symbol data B corresponding to one symbol (note or phoneme) of the music data D changes according not only to the musical features of the symbol itself (continuation length d1, pitch d2, and phoneme code d3) but also to the musical features of each symbol before (in the past of) the symbol and each symbol after (in the future of) the symbol.
  • The time series of the symbol data B generated by the coding model 21 is stored in the storage device 12.
  • The coding model 21 is composed of, for example, a deep neural network (DNN).
  • For example, a deep neural network of any form, such as a convolutional neural network (CNN) or a recurrent neural network (RNN), is used as the coding model 21.
  • An example of a recurrent neural network is a bidirectional recurrent neural network (Bi-directional RNN).
  • Additional elements such as long short-term memory (LSTM) or self-attention may be incorporated into the coding model 21.
  • The coding model 21 exemplified above is realized by a combination of a program that causes the control device 11 to execute the operation of generating a plurality of symbol data B from the music data D, and a plurality of variables (specifically, weights and biases) applied to that operation.
  • The plurality of variables defining the coding model 21 are set in advance by machine learning using a plurality of training data and are stored in the storage device 12.
  • The coded data acquisition unit 22 sequentially acquires coded data E at each of a plurality of time steps τ on the time axis.
  • Each of the plurality of time steps τ is a time point set discretely at equal intervals (for example, at intervals of 5 milliseconds) on the time axis.
  • The coded data acquisition unit 22 includes a period setting unit 221 and a conversion processing unit 222.
  • The period setting unit 221 sets, from the music data D and the tempo Z2, the period σ (hereinafter referred to as the "unit period") during which each symbol in the music is sounded. Specifically, the period setting unit 221 sets the start time and end time of the unit period σ for each of the plurality of symbols in the music. For example, each unit period σ is set according to the continuation length d1 specified by the music data D for each symbol and the tempo Z2 specified by the user with the operation device 14. As illustrated in FIG. 2 or 3, each unit period σ contains one or more time steps τ on the time axis.
  • A known analysis technique may be adopted as appropriate for setting each unit period σ.
  • For example, a function (G2P: Grapheme-to-Phoneme) that uses a model such as an HMM (Hidden Markov Model), or a function that estimates the continuation length using a trained statistical estimation model, is used as the period setting unit 221.
  • The period setting unit 221 generates information (hereinafter referred to as “mapping information”) representing the correspondence between each unit period σ and the coded data E for each time step τ.
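  • The following is a minimal sketch of how unit periods and mapping information of this kind could be computed, assuming a constant tempo, a resolution of 480 ticks per beat, and time steps at 5-millisecond intervals. The function names, the tick resolution, and the representation of the mapping information as a list of symbol indices are assumptions made for illustration; in the disclosure the tempo Z2 changes in real time, which this sketch does not model. Exact fractions are used so that time steps falling on a period boundary are assigned without rounding error.

```python
from fractions import Fraction
from typing import List, Tuple

TICKS_PER_BEAT = 480          # assumed resolution; not specified in the text
STEP = Fraction(5, 1000)      # interval between time steps: 5 milliseconds


def unit_periods(d1_ticks: List[int], tempo_bpm: int) -> List[Tuple[Fraction, Fraction]]:
    """Return the (start, end) time in seconds of the unit period of each symbol."""
    seconds_per_tick = Fraction(60) / (Fraction(tempo_bpm) * TICKS_PER_BEAT)
    periods = []
    start = Fraction(0)
    for ticks in d1_ticks:
        end = start + ticks * seconds_per_tick
        periods.append((start, end))
        start = end
    return periods


def mapping_information(periods: List[Tuple[Fraction, Fraction]]) -> List[int]:
    """For each time step, return the index of the symbol whose unit period contains it."""
    mapping: List[int] = []
    song_end = periods[-1][1] if periods else Fraction(0)
    symbol_index = 0
    step = 0
    while step * STEP < song_end:
        t = step * STEP
        # Move on to the next symbol once the time step passes the end of the period.
        while t >= periods[symbol_index][1]:
            symbol_index += 1
        mapping.append(symbol_index)
        step += 1
    return mapping
```

  • For example, at 120 beats per minute a symbol of 480 ticks lasts 0.5 seconds and therefore spans 100 time steps.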
  • The conversion processing unit 222 acquires the coded data E at each of the plurality of time steps τ on the time axis, as illustrated as step Sb14 in FIGS. 2 and 3. That is, the conversion processing unit 222 sequentially selects each of the plurality of time steps τ as the current step τc in chronological order and generates the coded data E for the current step τc. Specifically, using the result of the period setting unit 221 setting each unit period σ (that is, the mapping information), the conversion processing unit 222 converts the symbol data B for each symbol stored in the storage device 12 into coded data E for each time step τ on the time axis.
  • That is, the conversion processing unit 222 generates the coded data E for each time step τ on the time axis by using the symbol data B generated by the coding model 21 and the mapping information generated by the period setting unit 221.
  • One symbol data B corresponding to one symbol is expanded into coded data E over a plurality of time steps τ.
  • However, one symbol data B may be converted into one coded data E, for example when the unit period σ contains only a single time step τ.
  • For example, a deep neural network is used for the conversion from the symbol data B for each symbol to the coded data E for each time step τ.
  • For example, the conversion processing unit 222 generates the coded data E by using a deep neural network of any form, such as a convolutional neural network or a recurrent neural network.
  • As described above, the coded data acquisition unit 22 acquires the coded data E at each of the plurality of time steps τ.
  • The symbol data B corresponding to one symbol in the music changes according to the features of the symbol itself as well as the features of the symbols before and after it. Therefore, the coded data E of the current step τc changes according to the features (d1-d3) of the symbol corresponding to the current step τc among the plurality of symbols (notes or phonemes) specified by the music data D, and the features (d1-d3) of the symbols before and after that symbol.
  • The control data acquisition unit 31 in FIG. 4 acquires control data C at each of the plurality of time steps τ.
  • The control data C is data according to an instruction given in real time by the user operating the operation device 14. Specifically, the control data acquisition unit 31 sequentially generates, for each time step τ, control data C representing the indicated value Z1 specified by the user.
  • The tempo Z2 may also be used as the control data C.
  • The generative model 40 generates acoustic feature data F at each of the plurality of time steps τ.
  • The acoustic feature data F is data representing the acoustic features of the synthetic sound.
  • Specifically, the acoustic feature data F represents a frequency characteristic of the synthetic sound, such as a mel spectrum or an amplitude spectrum. That is, a time series of acoustic feature data F corresponding to different time steps τ is generated.
  • The generative model 40 is a statistical estimation model that generates the acoustic feature data F of the current step τc from the input data Y of the current step τc. That is, the generative model 40 is a trained model that has learned the relationship between the input data Y and the acoustic feature data F.
  • The generative model 40 is an example of the "first generative model".
  • The input data Y of the current step τc includes the coded data E acquired by the coded data acquisition unit 22 at the current step τc and the control data C acquired by the control data acquisition unit 31 at the current step τc. The input data Y of the current step τc further includes the acoustic feature data F generated by the generative model 40 at each of a plurality of time steps τ in the past of the current step τc. That is, the acoustic feature data F generated by the generative model 40 is fed back to the input of the generative model 40.
  • As understood from the above description, the generative model 40 generates the acoustic feature data F of the current step τc from the coded data E and the control data C of the current step τc and the acoustic feature data F of past time steps τ (step Sb16 in FIGS. 2 and 3).
  • The coding model 21 functions as an encoder that generates symbol data B from the music data D, and the generative model 40 functions as a decoder that generates acoustic feature data F from the coded data E and the control data C.
  • The input data Y is an example of the "first input data".
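  • The feedback structure described above can be sketched as follows. The sketch assembles the input data Y of each step from the coded data E and control data C of that step and the acoustic feature data F of a few past steps, and feeds each newly generated F back into later inputs. The function `toy_generative_model` is only a deterministic stand-in for the trained generative model 40 (it is not a neural network), and the function names, the `history` parameter, and the zero initialization of the past features are assumptions made for illustration.

```python
from collections import deque
from typing import Callable, Deque, List, Sequence

Vector = List[float]


def toy_generative_model(y: Sequence[float]) -> Vector:
    """Stand-in for the trained generative model 40 (NOT a real network).

    Returns a 2-dimensional 'acoustic feature' derived from the input so that
    the effect of the feedback path can be observed.
    """
    return [sum(y) / len(y), max(y)]


def generate(coded: Sequence[Vector], control: Sequence[Vector],
             model: Callable[[Sequence[float]], Vector],
             feature_dim: int = 2, history: int = 3) -> List[Vector]:
    """Generate acoustic feature data F one time step at a time.

    The input data Y of each step concatenates the coded data E and the control
    data C of that step with the F of the `history` most recent past steps
    (all zeros before the first step, by assumption).
    """
    past: Deque[Vector] = deque(
        [[0.0] * feature_dim for _ in range(history)], maxlen=history)
    outputs: List[Vector] = []
    for e, c in zip(coded, control):
        y = list(e) + list(c) + [value for f in past for value in f]
        f_now = model(y)
        past.append(f_now)   # fed back into the input data Y of later steps
        outputs.append(f_now)
    return outputs
```

  • Because of the feedback, two steps with identical coded data E and control data C can still yield different acoustic feature data F.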
  • The generative model 40 is composed of, for example, a deep neural network.
  • For example, a deep neural network of any form, such as a causal convolutional neural network or a recurrent neural network, is used as the generative model 40.
  • An example of a recurrent neural network is a unidirectional recurrent neural network.
  • Additional elements such as long short-term memory or self-attention may be incorporated into the generative model 40.
  • The generative model 40 exemplified above is realized by a combination of a program that causes the control device 11 to execute the operation of generating acoustic feature data F from the input data Y, and a plurality of variables (specifically, weights and biases) applied to that operation.
  • The plurality of variables defining the generative model 40 are set in advance by machine learning using a plurality of training data and are stored in the storage device 12.
  • As described above, the acoustic feature data F is generated by supplying the input data Y to the machine-learned generative model 40. Therefore, statistically plausible acoustic feature data F can be generated based on the tendencies latent in the plurality of training data used for machine learning.
  • The waveform synthesis unit 50 of FIG. 4 generates the acoustic signal W of the synthetic sound from the time series of the acoustic feature data F.
  • The waveform synthesis unit 50 converts the frequency characteristics represented by the acoustic feature data F into a time-domain waveform by, for example, an operation including an inverse discrete Fourier transform, and connects the waveforms of successive time steps τ to generate the acoustic signal W.
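  • As a rough sketch of this kind of waveform synthesis, the following code converts a one-sided amplitude spectrum into a time-domain frame by an inverse discrete Fourier transform and joins the frames of successive time steps end to end. The zero-phase assumption, the plain concatenation without overlap, and the function names are simplifications introduced here for illustration; the disclosure does not specify how phase is handled or how adjacent waveforms are joined.

```python
import math
from typing import List, Sequence


def inverse_dft_zero_phase(amplitudes: Sequence[float]) -> List[float]:
    """Inverse real DFT of a one-sided amplitude spectrum, assuming zero phase.

    `amplitudes` holds bins 0..N/2 of an even-length-N transform, so the
    returned frame has N = 2 * (len(amplitudes) - 1) samples.
    """
    if len(amplitudes) < 2:
        raise ValueError("at least two frequency bins are required")
    half = len(amplitudes) - 1
    n_samples = 2 * half
    frame = []
    for n in range(n_samples):
        # DC bin and Nyquist bin appear once; the other bins appear twice.
        value = amplitudes[0] + amplitudes[half] * math.cos(math.pi * n)
        for k in range(1, half):
            value += 2.0 * amplitudes[k] * math.cos(2.0 * math.pi * k * n / n_samples)
        frame.append(value / n_samples)
    return frame


def synthesize(spectra: Sequence[Sequence[float]]) -> List[float]:
    """Join the frames obtained for successive time steps into one signal."""
    signal: List[float] = []
    for amplitudes in spectra:
        signal.extend(inverse_dft_zero_phase(amplitudes))
    return signal
```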
  • a deep neural network (so-called neural vocoder) that has learned the relationship between the acoustic feature data F and the time series of the sample of the acoustic signal W may be used as the waveform synthesis unit 50.
  • FIG. 5 is a flowchart illustrating a specific procedure of the process Sa (hereinafter referred to as the “preparation process”) in which the control device 11 generates a time series of symbol data B from the music data D.
  • The preparation process Sa is executed every time the music data D is updated. For example, every time the music data D is updated in response to an editing instruction from the user, the control device 11 executes the preparation process Sa for the updated music data D.
  • When the preparation process Sa is started, the control device 11 acquires the music data D from the storage device 12 (Sa11). As illustrated in FIGS. 2 and 3, the control device 11 supplies the music data D, which represents a time series of a plurality of symbols (a note sequence or a phoneme sequence), to the coding model 21 and thereby generates a plurality of symbol data B corresponding to different symbols in the music (Sa12). Specifically, a time series of symbol data B over the entire music is generated. The control device 11 stores the time series of the symbol data B generated by the coding model 21 in the storage device 12 (Sa13).
  • FIG. 6 is a flowchart illustrating a specific procedure of the process Sb (hereinafter referred to as the “synthesis process”) in which the control device 11 generates the acoustic signal W.
  • The synthesis process Sb is executed at each of the plurality of time steps τ on the time axis. That is, each of the plurality of time steps τ is sequentially selected as the current step τc in chronological order, and the following synthesis process Sb is executed for the current step τc.
  • By operating the operation device 14, the user can specify the indicated value Z1 at any time in parallel with the repetition of the synthesis process Sb.
  • When the synthesis process Sb is started, the control device 11 acquires the tempo Z2 specified by the user (Sb11). The control device 11 then calculates the position in the music corresponding to the current step τc (hereinafter referred to as the “reading position”) (Sb12). The reading position is determined according to the tempo Z2 acquired in step Sb11. For example, the faster the tempo Z2, the further the reading position advances in the music with each execution of the synthesis process Sb. The control device 11 determines whether or not the reading position has reached the end position of the music (Sb13).
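  • One way to picture steps Sb11 to Sb13 is the sketch below, which advances a reading position expressed in ticks by an amount proportional to the current tempo at every time step and looks up the symbol at that position. The resolution of 480 ticks per beat, the 5-millisecond step, the expression of the reading position in ticks, and the function names are assumptions made for illustration.

```python
from typing import List, Optional

TICKS_PER_BEAT = 480    # assumed resolution; not specified in the text
STEP_SECONDS = 0.005    # interval between time steps: 5 milliseconds


def advance_reading_position(position_ticks: float, tempo_bpm: float) -> float:
    """Advance the reading position by one time step at the given tempo (Sb12)."""
    beats_per_second = tempo_bpm / 60.0
    return position_ticks + beats_per_second * TICKS_PER_BEAT * STEP_SECONDS


def symbol_index_at(position_ticks: float, d1_ticks: List[int]) -> Optional[int]:
    """Return the index of the symbol at the reading position.

    Returns None when the reading position has reached the end of the music,
    which corresponds to an affirmative result in step Sb13.
    """
    end = 0
    for index, ticks in enumerate(d1_ticks):
        end += ticks
        if position_ticks < end:
            return index
    return None
```

  • Because the tempo is read again at every time step, a change of tempo by the user takes effect from the next step onward: at 120 beats per minute the position advances by about 4.8 ticks per step, and at 240 beats per minute by about 9.6 ticks.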
  • When the reading position has reached the end position (Sb13: YES), the control device 11 ends the synthesis process Sb.
  • When the reading position has not reached the end position (Sb13: NO), the control device 11 (coded data acquisition unit 22) generates the coded data E corresponding to the current step τc from the one symbol data B that corresponds to the reading position among the plurality of symbol data B stored in the storage device 12 (Sb14).
  • The control device 11 (control data acquisition unit 31) acquires the control data C representing the indicated value Z1 at the current step τc (Sb15).
  • The control device 11 supplies the input data Y of the current step τc to the generative model 40 to generate the acoustic feature data F of the current step τc (Sb16).
  • The input data Y of the current step τc includes the coded data E and the control data C acquired for the current step τc, and the acoustic feature data F generated by the generative model 40 for each of a plurality of past time steps τ.
  • The control device 11 stores the acoustic feature data F generated for the current step τc in the storage device 12 (Sb17).
  • The acoustic feature data F stored in the storage device 12 is used in the input data Y in subsequent executions of the synthesis process Sb.
  • The control device 11 (waveform synthesis unit 50) generates a time series of samples of the acoustic signal W from the acoustic feature data F of the current step τc (Sb18). The control device 11 then supplies the acoustic signal W of the current step τc to the sound emitting device 13 following the acoustic signal W of the immediately preceding time step τ (Sb19).
  • By repeating the synthesis process Sb for each time step τ, the synthetic sound over the entire music is reproduced from the sound emitting device 13.
  • As described above, the acoustic feature data F is generated using the coded data E, which reflects the features of the music after the current step τc, and the control data C, which reflects the instruction from the user at the current step τc. Therefore, it is possible to generate acoustic feature data F of a synthetic sound that reflects both the features of the music after (in the future of) the current step τc and the real-time instruction from the user.
  • The input data Y used for generating the acoustic feature data F includes the acoustic feature data F of past time steps τ in addition to the control data C and the coded data E of the current step τc. Therefore, it is possible to generate acoustic feature data F of a synthetic sound in which the temporal transition of the acoustic features is audibly natural.
  • An acoustic signal W that follows the instruction from the user is thus generated. That is, there is an advantage that the acoustic characteristics of the acoustic signal W can be precisely controlled with high time resolution according to the instruction from the user.
  • A configuration is also conceivable in which the acoustic characteristics of the acoustic signal W generated by the acoustic processing system 100 are controlled directly according to the instruction from the user.
  • In the first embodiment, by contrast, the acoustic characteristics of the synthetic sound are controlled by supplying the control data C according to the instruction from the user to the generative model 40. Therefore, there is an advantage that the acoustic characteristics of the synthetic sound can be controlled in response to the instructions from the user in accordance with the tendency latent in the plurality of training data used for machine learning (the tendency of the acoustic characteristics with respect to instructions from the user).
  • FIG. 7 is an explanatory diagram of the process Sc (hereinafter referred to as the “learning process”) for establishing the coding model 21 and the generative model 40.
  • The learning process Sc is supervised machine learning using a plurality of training data T prepared in advance.
  • Each of the plurality of training data T includes music data D, a time series of control data C, and a time series of acoustic feature data F.
  • The acoustic feature data F of each training data T is ground-truth data representing the acoustic features (for example, frequency characteristics) of the synthetic sound that should be generated from the music data D and the control data C of that training data T.
  • By executing the program stored in the storage device 12, the control device 11 functions as a preparation processing unit 61 and a learning processing unit 62 in addition to the elements illustrated in FIG. 4.
  • The preparation processing unit 61 generates the training data T from reference data T0 stored in the storage device 12.
  • A plurality of training data T are generated from different reference data T0.
  • Each reference data T0 is data including music data D and an acoustic signal W.
  • The acoustic signal W of each reference data T0 is a signal representing the waveform of a sound of the music (hereinafter referred to as the “reference sound”) corresponding to the music data D of that reference data T0.
  • For example, the acoustic signal W is generated by recording the reference sound produced by playing or singing the music represented by the music data D.
  • A plurality of reference data T0 are prepared from a plurality of pieces of music. The plurality of training data T therefore include two or more training data T corresponding to different pieces of music.
  • The preparation processing unit 61 analyzes the acoustic signal W of each reference data T0 to generate the time series of control data C and the time series of acoustic feature data F of the training data T. For example, the preparation processing unit 61 calculates a time series of the indicated value Z1 representing the signal strength of the acoustic signal W (the strength of the reference sound) and generates control data C representing the indicated value Z1 for each time step τ.
  • The tempo Z2 may also be calculated from the acoustic signal W to generate control data C representing the tempo Z2.
  • the preparation processing unit 61 calculates the time series of the frequency characteristics (for example, the mel spectrum or the amplitude spectrum) of the acoustic signal W, and generates the acoustic feature data F representing the frequency characteristics for each time step ⁇ .
  • Any known frequency analysis technique, such as the discrete Fourier transform, may be adopted for calculating the frequency characteristics of the acoustic signal W.
  • the preparation processing unit 61 generates training data T by associating the time series of the control data C and the time series of the acoustic feature data F generated in the above procedure with the music data D.
  • the plurality of training data T generated by the preparation processing unit 61 are stored in the storage device 12.
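  • The preparation described above (signal strength and frequency characteristics per time step) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the preparation processing unit 61: the function names, the frame length, the hop size, and the use of root mean square as the strength measure are all hypothetical choices made for the example.

```python
import cmath
import math

def frame_signal(samples, frame_len, hop):
    """Split the acoustic signal into frames, one per time step (assumed framing)."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]

def rms(frame):
    """Signal strength of one frame; RMS is an assumed choice of strength measure."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))

def amplitude_spectrum(frame):
    """Amplitude spectrum of one frame via a direct discrete Fourier transform."""
    n = len(frame)
    return [abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n)))
            for k in range(n // 2 + 1)]

def prepare_training_data(music_data, samples, frame_len=8, hop=8):
    """Associate per-time-step control data C and feature data F with music data D."""
    frames = frame_signal(samples, frame_len, hop)
    control = [rms(f) for f in frames]                   # control data C (indicated value Z1)
    features = [amplitude_spectrum(f) for f in frames]   # acoustic feature data F
    return {"D": music_data, "C": control, "F": features}
```

  • A practical system would more likely use a windowed fast Fourier transform and a mel filter bank; the direct transform is used here only to keep the sketch self-contained.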
  • the learning processing unit 62 establishes a coding model 21 and a generation model 40 by learning processing Sc using a plurality of training data T.
  • FIG. 8 is a flowchart illustrating a specific procedure of the learning process Sc. For example, the learning process Sc is started with an instruction to the operating device 14.
  • The learning processing unit 62 selects a predetermined number of training data T (hereinafter referred to as “selection training data T”) from the plurality of training data T stored in the storage device 12 (Sc11).
  • The predetermined number of selection training data T constitute one batch.
  • the learning processing unit 62 supplies the music data D of the selection training data T to the provisional coding model 21 (Sc12).
  • the coding model 21 generates symbol data B for each symbol from the music data D supplied by the learning processing unit 62.
  • the coded data acquisition unit 22 generates coded data E for each time step ⁇ from the symbol data B for each symbol.
  • the tempo Z2 applied to the acquisition of the coded data E by the coded data acquisition unit 22 is set to a predetermined reference value. Further, the learning processing unit 62 sequentially supplies each control data C of the selection training data T to the provisional generative model 40 (Sc13).
  • the input data Y including the coded data E, the control data C, and the past acoustic feature data F is supplied to the generation model 40 at each time step ⁇ .
  • the generation model 40 generates acoustic feature data F corresponding to the input data Y for each time step ⁇ .
  • the past acoustic feature data F to which a noise component is added after the generation by the generation model 40 may be included in the input data Y. Overfitting is suppressed by using the acoustic feature data F to which a noise component is added.
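  • The addition of a noise component to the past acoustic feature data F can be sketched as follows. The source states only that a noise component is added; the Gaussian distribution, the function name, and the `noise_std` parameter are assumptions made for this illustration.

```python
import random

def add_noise_to_past_features(past_features, noise_std, rng):
    """Return a copy of the past feature frames with noise added to every value.

    Gaussian noise is an assumed choice; the source does not specify the
    distribution of the noise component.
    """
    return [[value + rng.gauss(0.0, noise_std) for value in frame]
            for frame in past_features]
```

  • For example, `add_noise_to_past_features(frames, 0.01, random.Random(0))` would perturb each value slightly before the frames are placed in the input data Y during training.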
  • The learning processing unit 62 calculates a loss function representing the difference between the time series of the acoustic feature data F generated by the provisional generative model 40 and the time series of the acoustic feature data F included in the selection training data T (that is, the correct answer values) (Sc14).
  • the learning processing unit 62 iteratively updates the plurality of variables of the coding model 21 and the plurality of variables of the generative model 40 so that the loss function is reduced (Sc15). For example, the backpropagation method is used to update the variable according to the loss function.
  • the update of the plurality of variables is executed for each time step ⁇ for the generation model 40 and for each symbol for the coding model 21. Specifically, the update of a plurality of variables is realized by the following steps 1 to 3.
  • Step 1: The learning processing unit 62 updates the plurality of variables of the generation model 40 by error backpropagation of the loss function corresponding to the coded data E for each time step ⁇. Step 1 gives the loss function for the generative model 40.
  • Step 2: The learning processing unit 62 converts the loss function corresponding to the coded data E for each time step into the loss function corresponding to the symbol data B for each symbol. The mapping information is used for this conversion.
  • Step 3: The learning processing unit 62 updates the plurality of variables of the coding model 21 by error backpropagation of the loss function corresponding to the symbol data B for each symbol.
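  • One way to picture the conversion from per-time-step quantities to per-symbol quantities is shown below. The sketch assumes that each symbol data B is simply copied to every time step of its unit period, in which case the chain rule gives the per-symbol gradient as the sum of the per-time-step gradients. The function name and the list-based representation of the mapping information are hypothetical.

```python
def per_symbol_gradients(step_grads, mapping):
    """Aggregate per-time-step gradients into per-symbol gradients.

    step_grads: one gradient vector per time step (with respect to the coded data).
    mapping:    for each time step, the index of the symbol whose unit period
                contains it (an assumed form of the mapping information).
    """
    num_symbols = max(mapping) + 1
    dim = len(step_grads[0])
    symbol_grads = [[0.0] * dim for _ in range(num_symbols)]
    for grad, symbol_index in zip(step_grads, mapping):
        for i, g in enumerate(grad):
            # Summation follows from the chain rule when the symbol data is
            # copied unchanged to each time step of its unit period.
            symbol_grads[symbol_index][i] += g
    return symbol_grads
```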
  • the learning processing unit 62 determines whether or not the end condition regarding the learning processing Sc is satisfied (Sc16).
  • the termination condition is, for example, that the loss function is below a predetermined threshold value, or that the amount of change in the loss function is below a predetermined threshold value.
  • The loss function calculated using the training data T may be used to determine whether the end condition is satisfied. Alternatively, a loss function calculated from test data prepared separately from the training data T may be used for that determination.
  • When the end condition is not satisfied (Sc16: NO), the learning processing unit 62 selects a predetermined number of unselected training data T from the plurality of training data T stored in the storage device 12 as new selection training data T (Sc11). That is, the selection of a predetermined number of training data T (Sc11), the calculation of the loss function (Sc12-Sc14), and the update of the plurality of variables (Sc15) are repeated until the end condition is satisfied (Sc16: YES).
  • When the end condition is satisfied (Sc16: YES), the learning processing unit 62 ends the learning process Sc. With the end of the learning process Sc, the coding model 21 and the generative model 40 are established.
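  • The loop of steps Sc11 to Sc16 can be sketched as follows. This is only an outline under stated assumptions: `train_step` is a hypothetical callable standing in for the loss calculation and variable update (Sc12 to Sc15), the iteration cap `max_iters` is a safety measure not found in the source, and batches simply wrap around the data instead of tracking which training data remain unselected.

```python
def end_condition(losses, loss_threshold, change_threshold):
    """True when the latest loss, or its change from the previous one, is below a threshold."""
    if not losses:
        return False
    if losses[-1] < loss_threshold:
        return True
    return len(losses) >= 2 and abs(losses[-1] - losses[-2]) < change_threshold

def learning_process(training_data, batch_size, train_step,
                     loss_threshold=1e-3, change_threshold=1e-6, max_iters=1000):
    """Repeat batch selection and update until the end condition holds."""
    losses = []
    start = 0
    for _ in range(max_iters):
        # Sc11: select one batch (wrap-around selection is a simplification).
        batch = [training_data[(start + i) % len(training_data)]
                 for i in range(batch_size)]
        start = (start + batch_size) % len(training_data)
        # Sc12-Sc15: hypothetical callable that updates the models and returns the loss.
        losses.append(train_step(batch))
        # Sc16: check the end condition.
        if end_condition(losses, loss_threshold, change_threshold):
            break
    return losses
```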
  • According to the coding model 21 established by the above learning process Sc, it is possible to generate symbol data B suitable for generating statistically valid acoustic feature data F for unknown music data D. Further, according to the generative model 40, it is possible to generate acoustic feature data F that is statistically valid for the coded data E.
  • The trained generation model 40 may be retrained using a time series of control data C that differs from the time series of the control data C in the training data T used for the learning process Sc exemplified above. In the retraining of the generative model 40, it is not necessary to update the plurality of variables defining the coding model 21.
  • the sound processing system 100 of the second embodiment includes a control device 11, a storage device 12, a sound emitting device 13, and an operating device 14, similar to the first embodiment illustrated in FIG.
  • the point that the music data D is stored in the storage device 12 is the same as that of the first embodiment.
  • FIG. 9 is an explanatory diagram of the operation of the sound processing system 100 in the second embodiment.
  • In the second embodiment, a case is illustrated in which a singing sound is synthesized using the music data D for singing sound synthesis described in the first embodiment.
  • the continuation length d1, the pitch d2, and the phoneme code d3 are designated for each phoneme in the music. It is also possible to apply the second embodiment to the synthesis of musical instrument sounds.
  • FIG. 10 is a block diagram illustrating a functional configuration of the acoustic processing system 100 according to the second embodiment.
  • By executing a program stored in the storage device 12, the control device 11 of the second embodiment realizes a plurality of functions (the coding model 21, the coded data acquisition unit 22, the generative model 32, the generative model 40, and the waveform synthesis unit 50) for generating an acoustic signal W from the music data D.
  • the coding model 21 is a statistical estimation model that generates a time series of symbol data B from music data D, as in the first embodiment. Specifically, the coding model 21 is a trained model that has learned the relationship between the music data D and the symbol data B. As illustrated as step Sa22 in FIG. 9, the coding model 21 generates symbol data B for each of the plurality of phonemes constituting the lyrics of the music. That is, a plurality of symbol data B corresponding to different symbols in the music are generated by the coding model 21.
  • the coding model 21 is composed of an arbitrary form of deep neural network, as in the first embodiment.
  • The symbol data B corresponding to any one phoneme is influenced not only by the characteristics of the phoneme itself (continuation length d1, pitch d2, and phoneme code d3) but also by the characteristics of each phoneme before that phoneme (past) and each phoneme after that phoneme (future) in the music.
  • a time series of symbol data B over the entire music is generated from the music data D.
  • the time series of the symbol data B generated by the coding model 21 is stored in the storage device 12.
  • the coded data acquisition unit 22 sequentially acquires the coded data E at each of the plurality of time steps ⁇ on the time axis, as in the first embodiment.
  • the coded data acquisition unit 22 of the second embodiment includes a period setting unit 221, a conversion processing unit 222, a pitch estimation unit 223, and a generation model 224. Similar to the first embodiment, the period setting unit 221 of FIG. 10 sets the unit period ⁇ in which each phoneme in the music is sounded from the music data D and the tempo Z2.
  • the conversion processing unit 222 acquires intermediate data Q at each of the plurality of time steps ⁇ on the time axis.
  • the intermediate data Q corresponds to the coded data E in the first embodiment.
  • The conversion processing unit 222 sequentially selects each of the plurality of time steps ⁇ as the current step ⁇ c in chronological order, and generates intermediate data Q for the current step ⁇ c. That is, the conversion processing unit 222 uses the result of the period setting unit 221 setting each unit period ⁇ (that is, the mapping information) to convert the symbol data B for each symbol stored in the storage device 12 into intermediate data Q for each time step ⁇ on the time axis.
  • The coded data acquisition unit 22 uses the symbol data B generated by the coding model 21 and the mapping information generated by the period setting unit 221 to generate intermediate data Q for each time step ⁇ on the time axis.
  • One symbol data B corresponding to one symbol is expanded into intermediate data Q over one or more time steps ⁇ .
  • The symbol data B corresponding to the phoneme /w/ is converted into the intermediate data Q of one time step ⁇ within the unit period ⁇ set for the phoneme /w/ by the period setting unit 221.
  • The symbol data B corresponding to the phoneme /ah/ is converted into five intermediate data Q corresponding to the five time steps ⁇ within the unit period ⁇ set for the phoneme /ah/ by the period setting unit 221.
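  • The expansion of per-symbol data into per-time-step data can be sketched as follows. The function name and the representation of the unit periods as a list of time-step counts are assumptions made for illustration; the sketch also returns the symbol index of each time step as one possible form of mapping information.

```python
def expand_symbols(symbol_data, steps_per_symbol):
    """Expand per-symbol data B into per-time-step intermediate data Q.

    steps_per_symbol: number of time steps in the unit period of each symbol
                      (an assumed representation of the period setting result).
    Returns the intermediate data and, for each time step, its symbol index.
    """
    intermediate = []
    mapping = []
    for index, (b, n_steps) in enumerate(zip(symbol_data, steps_per_symbol)):
        intermediate.extend([b] * n_steps)   # one copy of B per time step
        mapping.extend([index] * n_steps)
    return intermediate, mapping
```

  • With the phonemes /w/ (one time step) and /ah/ (five time steps) from the text, `expand_symbols(["B_w", "B_ah"], [1, 5])` yields six intermediate data items.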
  • the conversion processing unit 222 generates position data G for each of the plurality of time steps ⁇ .
  • The position data G of any one time step ⁇ represents, as a ratio relative to the unit period ⁇, the position within the unit period ⁇ to which the intermediate data Q of that time step ⁇ corresponds.
  • The position data G is set to "0" when the position corresponding to the intermediate data Q is at the beginning of the unit period ⁇, and is set to "1" when the position is at the end of the unit period ⁇. For example, focusing on any two of the five time steps ⁇ included in the unit period ⁇ of the phoneme /ah/ in the figure, the position data G of the later time step ⁇ designates a later time point within the unit period ⁇ than the position data G of the earlier time step ⁇. For the last time step ⁇ in one unit period ⁇, for example, position data G representing the end point of the unit period ⁇ is generated.
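  • The position data G can be sketched as a ratio that runs from 0 at the beginning of a unit period to 1 at its end. The formula below is an assumption consistent with that description, not a formula given in the source; in particular, assigning 1.0 to a unit period that contains a single time step is a choice made for this example, based on the statement that the last time step represents the end point.

```python
def position_data(steps_per_symbol):
    """Compute position data G for every time step.

    Within a unit period of n time steps, step i (0-based) is given i / (n - 1),
    so the first step is 0.0 and the last step is 1.0 (assumed formula).
    """
    positions = []
    for n_steps in steps_per_symbol:
        if n_steps == 1:
            positions.append(1.0)  # single step treated as the end point (assumption)
        else:
            positions.extend(i / (n_steps - 1) for i in range(n_steps))
    return positions
```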
  • the pitch estimation unit 223 in FIG. 10 generates pitch data P for each of the plurality of time steps ⁇ .
  • the pitch data P corresponding to any one time step ⁇ is data representing the pitch of the synthetic sound in the time step ⁇ .
  • The pitch d2 specified by the music data D represents the pitch for each symbol (for example, a phoneme), whereas the pitch data P represents the temporal change of the pitch within a period of predetermined length including, for example, one time step ⁇.
  • the pitch data P may be, for example, data representing the pitch in one time step ⁇ .
  • the pitch estimation unit 223 may be omitted.
  • The pitch estimation unit 223 generates pitch data P for each of the plurality of time steps ⁇ from the pitch d2 of each symbol of the music data D stored in the storage device 12 and the unit period ⁇ set by the period setting unit 221 for each phoneme.
  • Any known analysis technique may be adopted for the generation of the pitch data P (that is, the estimation of the temporal change of the pitch).
  • For example, a function that estimates the temporal transition of pitch (a so-called pitch curve) using a statistical estimation model such as a deep neural network or a hidden Markov model is used as the pitch estimation unit 223.
  • the generative model 224 of FIG. 10 generates coded data E at each of the plurality of time steps ⁇ , as illustrated as step Sb21 in FIG.
  • the generative model 224 is a statistical inference model that generates the coded data E from the input data X.
  • the generative model 224 is a trained model that has learned the relationship between the input data X and the coded data E.
  • the generative model 224 is an example of the "second generative model".
  • The input data X of the current step ⁇ c includes the intermediate data Q, the position data G, and the pitch data P corresponding to each time step ⁇ in a period of predetermined length (hereinafter referred to as “reference period”) Ra on the time axis.
  • The reference period Ra is a period that includes the current step ⁇ c.
  • The reference period Ra includes the current step ⁇ c, a plurality of time steps ⁇ located before the current step ⁇ c, and a plurality of time steps ⁇ located after the current step ⁇ c.
  • The intermediate data Q associated with each time step ⁇ in the reference period Ra, and the position data G and the pitch data P generated for each time step ⁇ in the reference period Ra, are included in the input data X of the current step ⁇ c.
  • the input data X is an example of "second input data".
  • one or both of the position data G and the pitch data P may be omitted from the input data X.
  • the position data G generated by the conversion processing unit 222 may be included in the input data Y as in the second embodiment.
  • The intermediate data Q of the current step ⁇ c is influenced by the characteristics of the music in the current step ⁇ c and the characteristics of the music before and after the current step ⁇ c. Therefore, the coded data E generated from the input data X including the intermediate data Q is influenced by the characteristics of the music (continuation length d1, pitch d2, and phoneme code d3) in the current step ⁇ c and by the characteristics of the music (continuation length d1, pitch d2, and phoneme code d3) before and after the current step ⁇ c.
  • the reference period Ra includes the time step ⁇ behind (future) the present step ⁇ c. Therefore, it is possible to effectively influence the characteristics of the music behind the current step ⁇ c on the coded data E as compared with the configuration in which the reference period Ra includes only the current step ⁇ c.
  • The generative model 224 is composed of, for example, a deep neural network.
  • For example, a deep neural network such as a non-causal convolutional neural network is used as the generative model 224.
  • A recurrent neural network may also be used as the generative model 224.
  • Additional elements such as long short-term memory (LSTM) or self-attention may be incorporated into the generative model 224.
  • The generation model 224 exemplified above is realized by a combination of a program that causes the control device 11 to execute an operation for generating the coded data E from the input data X, and a plurality of variables (specifically, weights and biases) applied to the operation.
  • the plurality of variables defining the generative model 224 are preset and stored in the storage device 12 by machine learning using the plurality of training data.
  • the coded data E is generated by supplying the input data X to the machine-learned generation model 224. Therefore, it is possible to generate statistically valid coded data E under the latent relationship of the plurality of training data used for machine learning.
  • the generation model 32 of FIG. 10 generates control data C at each of the plurality of time steps ⁇ .
  • the control data C is data according to an instruction (specifically, a synthetic sound instruction value Z1) given in real time by the user in an operation on the operation device 14, as in the first embodiment. That is, the generation model 32 functions as an element (control data acquisition unit) for acquiring control data C in each of the plurality of time steps ⁇ .
  • the generation model 32 in the second embodiment may be replaced with the control data acquisition unit 31 in the first embodiment.
  • the generation model 32 generates control data C from a time series of the indicated value Z1 in a plurality of time steps ⁇ in a predetermined period (hereinafter referred to as “reference period”) Rb on the time axis.
  • The reference period Rb is a period that includes the current step ⁇ c.
  • The reference period Rb includes the current step ⁇ c and a plurality of time steps ⁇ located before the current step ⁇ c. That is, the reference period Ra that affects the input data X described above includes time steps ⁇ after the current step ⁇ c, whereas the reference period Rb that affects the control data C does not include any time step ⁇ after the current step ⁇ c.
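  • The difference between the two reference periods can be sketched with a single windowing helper. The function name, the window sizes, and the use of a padding value at the edges of the music are assumptions for illustration; the source does not describe how the edges are handled.

```python
def reference_window(sequence, current, past, future, pad=None):
    """Collect the items of `sequence` in a window around the current step.

    past/future: number of time steps before/after the current step to include.
    Positions outside the sequence are filled with `pad` (an assumed edge rule).
    """
    return [sequence[i] if 0 <= i < len(sequence) else pad
            for i in range(current - past, current + future + 1)]
```

  • A window with `future > 0` corresponds to the reference period Ra, which looks ahead in the music data that is known in advance. A window with `future=0` corresponds to the reference period Rb, since the user's future instructions are not yet available in real time.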
  • the generative model 32 is composed of, for example, a deep neural network.
  • a deep neural network such as a causal convolutional neural network or a recurrent neural network is used as the generative model 32.
  • An example of a recurrent neural network is a unidirectional recurrent neural network.
  • Additional elements such as long short-term memory (LSTM) or self-attention may be incorporated into the generative model 32.
  • The generation model 32 exemplified above is realized by a combination of a program that causes the control device 11 to execute an operation for generating the control data C from the time series of the indicated value Z1 in the reference period Rb, and a plurality of variables (specifically, weights and biases) applied to the operation.
  • the plurality of variables defining the generative model 32 are preset and stored in the storage device 12 by machine learning using the plurality of training data.
  • the control data C is generated from the time series of the instruction value Z1 according to the instruction from the user. Therefore, it is possible to generate the control data C that appropriately changes according to the temporal change of the indicated value Z1 according to the instruction from the user.
  • The generative model 32 may be omitted. That is, the indicated value Z1 may be supplied as it is to the generation model 40 as the control data C. Further, the generative model 32 may be replaced with a low-pass filter. That is, a numerical value generated by smoothing the indicated value Z1 on the time axis may be supplied to the generation model 40 as the control data C.
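  • The low-pass filter variant can be sketched as follows. The source does not specify the filter; a one-pole (exponential) smoother is assumed here because it is causal and needs only the previous output, and the function name and the coefficient `alpha` are hypothetical.

```python
def smooth_indicated_values(z1_series, alpha=0.2):
    """Smooth the indicated value Z1 over time with a one-pole low-pass filter.

    Each output depends only on the current and past inputs, so the filter
    can run in real time. `alpha` in (0, 1] sets how quickly the output follows.
    """
    smoothed = []
    state = None
    for z in z1_series:
        # The first value initializes the filter state directly.
        state = z if state is None else alpha * z + (1 - alpha) * state
        smoothed.append(state)
    return smoothed
```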
  • the generation model 40 generates acoustic feature data F at each of the plurality of time steps ⁇ , as in the first embodiment. That is, a time series of acoustic feature data F corresponding to different time steps ⁇ is generated.
  • the generative model 40 is a statistical inference model that generates acoustic feature data F from input data Y. Specifically, the generative model 40 is a trained model in which the relationship between the input data Y and the acoustic feature data F is learned.
  • The input data Y in the current step ⁇ c includes the coded data E acquired by the coded data acquisition unit 22 in the current step ⁇ c and the control data C generated by the generation model 32 in the current step ⁇ c. Further, as illustrated in FIG. 9, the input data Y of the current step ⁇ c includes the acoustic feature data F generated by the generation model 40 in each of a plurality of time steps ⁇ located in the past of the current step ⁇ c, and the coded data E and the control data C in each of those time steps ⁇.
  • the generative model 40 obtains the acoustic feature data F of the current step ⁇ c from the coded data E and the control data C of the current step ⁇ c and the acoustic feature data F of the past time step ⁇ .
  • The generative model 224 functions as an encoder that generates the coded data E, the generative model 32 functions as an encoder that generates the control data C, and the generation model 40 functions as a decoder that generates the acoustic feature data F from the coded data E and the control data C.
  • the input data Y is an example of "first input data".
  • The generation model 40 is composed of, for example, a deep neural network, as in the first embodiment.
  • For example, a deep neural network such as a causal convolutional neural network or a recurrent neural network is used as the generative model 40.
  • An example of a recurrent neural network is a unidirectional recurrent neural network.
  • Additional elements such as long short-term memory (LSTM) or self-attention may be incorporated into the generative model 40.
  • The generation model 40 exemplified above is realized by a combination of a program that causes the control device 11 to execute an operation for generating the acoustic feature data F from the input data Y, and a plurality of variables (specifically, weights and biases) applied to the operation.
  • the plurality of variables defining the generative model 40 are preset and stored in the storage device 12 by machine learning using the plurality of training data.
  • In a configuration in which the generative model 40 is a recursive model (autoregressive model), the generative model 32 may be omitted. Further, in a configuration that includes the generative model 32, the recursiveness of the generative model 40 may be omitted.
  • the waveform synthesis unit 50 generates the acoustic signal W of the synthesized sound from the time series of the acoustic feature data F as in the first embodiment. By supplying the acoustic signal W generated by the waveform synthesizing unit 50 to the sound emitting device 13, the synthesized sound is reproduced from the sound emitting device 13.
  • FIG. 11 is a flowchart illustrating a specific procedure of the preparation process Sa in the second embodiment. Similar to the first embodiment, the preparation process Sa is executed every time the music data D is updated. For example, every time the music data D is updated in response to an editing instruction from the user, the control device 11 executes the preparation process Sa for the updated music data D.
  • the control device 11 acquires the music data D from the storage device 12 (Sa21).
  • the control device 11 supplies the music data D to the coding model 21 to generate a plurality of symbol data B corresponding to different phonemes in the music (Sa22). Specifically, a time series of symbol data B over the entire music is generated.
  • the control device 11 stores the time series of the symbol data B generated by the coding model 21 in the storage device 12 (Sa23).
  • The control device 11 sets the unit period ⁇ of each phoneme in the music from the music data D and the tempo Z2 (Sa24). As illustrated in FIG. 9, the control device 11 (conversion processing unit 222) generates, from the symbol data B stored in the storage device 12 for each of the plurality of phonemes, one or more intermediate data Q for the one or more time steps ⁇ within the unit period ⁇ corresponding to that phoneme (Sa25). Further, the control device 11 (conversion processing unit 222) generates position data G for each of the plurality of time steps ⁇ (Sa26). The control device 11 (pitch estimation unit 223) generates pitch data P for each of the plurality of time steps ⁇ (Sa27). As can be understood from the above description, before the execution of the synthesis process Sb, a set of intermediate data Q, position data G, and pitch data P is generated for each time step ⁇ throughout the music.
  • the pitch data P may be generated (Sa27) before the intermediate data Q is generated (Sa25) and the position data G is generated (Sa26) for each time step ⁇ .
  • FIG. 12 is a flowchart illustrating a specific procedure of the synthesis process Sb in the second embodiment.
  • After the preparation process Sa is executed, the synthesis process Sb is executed for each of the plurality of time steps ⁇. That is, each of the plurality of time steps ⁇ is sequentially selected as the current step ⁇ c in chronological order, and the following synthesis process Sb is executed for the current step ⁇ c.
  • When the synthesis process Sb is started, the control device 11 (coded data acquisition unit 22) generates the coded data E of the current step ⁇ c by supplying the input data X of the current step ⁇ c to the generation model 224, as illustrated in FIG. 9 (Sb21).
  • the input data X of the current step ⁇ c includes the intermediate data Q, the position data G, and the pitch data P in each of the plurality of time steps ⁇ in the reference period Ra.
  • The control device 11 generates the control data C of the current step ⁇ c (Sb22). Specifically, the control device 11 generates the control data C of the current step ⁇ c by supplying the time series of the indicated value Z1 in the reference period Rb to the generation model 32.
  • the control device 11 supplies the input data Y of the current step ⁇ c to the generation model 40 to generate the acoustic feature data F of the current step ⁇ c (Sb23).
  • The input data Y of the current step ⁇ c includes the coded data E and the control data C acquired for the current step ⁇ c, as well as the acoustic feature data F, the coded data E, and the control data C generated for each of a plurality of past time steps ⁇.
  • the control device 11 stores the acoustic feature data F generated for the current step ⁇ c in the storage device 12 together with the coded data E and the control data C of the current step ⁇ c (Sb24).
  • the acoustic feature data F, the coding data E, and the control data C stored in the storage device 12 are used for the input data Y in the synthesis processing Sb from the next time onward.
  • The control device 11 (waveform synthesis unit 50) generates a time series of samples of the acoustic signal W from the acoustic feature data F of the current step ⁇ c (Sb25). Then, the control device 11 supplies the acoustic signal W generated for the current step ⁇ c to the sound emitting device 13 (Sb26). By repeating the synthesis process Sb exemplified above at each time step ⁇, the synthetic sound over the entire music piece is reproduced from the sound emitting device 13 as in the first embodiment.
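  • The per-time-step flow of steps Sb21 to Sb26 can be outlined as follows. This is a structural sketch only: `model_224`, `model_32`, `model_40`, and `vocoder` are hypothetical callables standing in for the trained models and the waveform synthesis unit 50, and the history length and the size of the reference period Rb are arbitrary values chosen for the example.

```python
def synthesize(num_steps, input_x, z1_series, model_224, model_32, model_40,
               vocoder, history_len=2, rb_past=2):
    """Run the synthesis process once per time step and collect output samples.

    input_x:   per-time-step input data X prepared in advance.
    z1_series: indicated values; only values up to the current step are read.
    """
    history = []   # (E, C, F) of past steps, reused as part of input data Y
    output = []
    for step in range(num_steps):
        e = model_224(input_x[step])                                 # Sb21
        c = model_32(z1_series[max(0, step - rb_past):step + 1])     # Sb22
        f = model_40(e, c, history[-history_len:])                   # Sb23
        history.append((e, c, f))                                    # Sb24
        output.extend(vocoder(f))                                    # Sb25, Sb26
    return output
```

  • Because each step reads only past outputs and indicated values up to the current step, the loop can in principle run while the user is still giving instructions.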
  • The acoustic feature data F is generated using the coded data E according to the characteristics of the phonemes after the current step ⁇ c in the music, and the control data C according to the instruction from the user at the current step ⁇ c. Therefore, it is possible to generate the acoustic feature data F of the synthesized sound according to both the features of the music after the current step ⁇ c (future) and the real-time instruction from the user.
  • the input data Y used for generating the acoustic feature data F includes the acoustic feature data F of the past time step ⁇ in addition to the control data C and the coding data E of the current step ⁇ c. Therefore, as in the first embodiment, it is possible to generate the acoustic feature data F of the synthetic sound in which the temporal transition of the acoustic feature is audibly natural.
  • The coded data E of the current step ⁇ c is generated from the input data X including two or more intermediate data Q corresponding to a plurality of time steps ⁇ that include the current step ⁇ c and later time steps ⁇. Therefore, it is possible to generate a time series of acoustic feature data F in which the temporal transition of acoustic features is audibly natural, as compared with a configuration in which the coded data E is generated from a single intermediate data Q.
  • The coded data E is generated from the input data X including the position data G, which represents the position corresponding to the intermediate data Q within the unit period ⁇, and the pitch data P, which represents the pitch for each time step ⁇. Therefore, it is possible to generate a time series of coded data E that appropriately represents the temporal transition of phonemes and pitches.
  • FIG. 13 is an explanatory diagram of the learning process Sc in the second embodiment.
  • the learning process Sc of the second embodiment is supervised machine learning that establishes a coding model 21, a generative model 224, a generative model 32, and a generative model 40 by using a plurality of training data T.
  • Each of the plurality of training data T includes a music data D, a time series of the indicated value Z1, and a time series of the acoustic feature data F.
  • the acoustic feature data F of each training data T is correct answer data representing the acoustic features (for example, frequency characteristics) of the synthetic sound to be generated from the music data D of the training data T and the indicated value Z1.
  • the control device 11 functions as a preparation processing unit 61 and a learning processing unit 62 in addition to the elements illustrated in FIG. 10 by executing the program stored in the storage device 12.
  • the preparation processing unit 61 generates the training data T from the reference data T0 stored in the storage device 12 as in the first embodiment.
  • Each reference data T0 is data including music data D and acoustic signal W.
  • the acoustic signal W of each reference data T0 is a signal representing the waveform of the reference sound (for example, a singing sound) corresponding to the music data D of the reference data T0.
  • the preparation processing unit 61 analyzes the acoustic signal W of each reference data T0 to generate a time series of the indicated value Z1 in the training data T and a time series of the acoustic feature data F. For example, the preparation processing unit 61 calculates the indicated value Z1 indicating the strength of the reference sound by analyzing the acoustic signal W. Further, the preparation processing unit 61 calculates the time series of the frequency characteristic of the acoustic signal W and generates the acoustic feature data F representing the frequency characteristic for each time step ⁇ , as in the first embodiment. The preparation processing unit 61 generates training data T by associating the time series of the indicated value Z1 and the time series of the acoustic feature data F generated in the above procedure with the music data D by the mapping information.
  • The learning processing unit 62 establishes the coding model 21, the generative model 224, the generative model 32, and the generative model 40 by the learning process Sc using a plurality of training data T.
  • FIG. 14 is a flowchart illustrating a specific procedure of the learning process Sc in the second embodiment. For example, the learning process Sc is started with an instruction to the operating device 14.
  • The learning processing unit 62 selects a predetermined number of training data T from among the plurality of training data T stored in the storage device 12 as the selected training data T (Sc21).
  • The learning processing unit 62 supplies the music data D of the selected training data T to the provisional coding model 21 (Sc22).
  • The coding model 21, the period setting unit 221, the conversion processing unit 222, and the pitch estimation unit 223 operate on the music data D to generate the input data X for each time step τ.
  • The provisional generative model 224 generates the coded data E corresponding to each input data X for each time step τ.
  • The tempo Z2 applied to the setting of the unit period σ by the period setting unit 221 is set to a predetermined reference value.
  • The learning processing unit 62 supplies each indicated value Z1 of the selected training data T to the provisional generative model 32 (Sc23).
  • The generative model 32 generates the control data C according to the time series of the indicated value Z1 for each time step τ.
  • The input data Y including the coded data E, the control data C, and the past acoustic feature data F is supplied to the generative model 40 at each time step τ.
  • The generative model 40 generates the acoustic feature data F corresponding to the input data Y for each time step τ.
  • The learning processing unit 62 calculates a loss function representing the difference between the time series of the acoustic feature data F generated by the provisional generative model 40 and the time series of the acoustic feature data F included in the selected training data T (that is, the ground-truth values) (Sc24).
  • The learning processing unit 62 iteratively updates a plurality of variables of each of the coding model 21, the generative model 224, the generative model 32, and the generative model 40 so that the loss function is reduced (Sc25). For example, the backpropagation method is used to update the variables according to the loss function.
  • The learning processing unit 62 determines whether or not the end condition of the learning process Sc is satisfied, as in the first embodiment (Sc26). When the end condition is not satisfied (Sc26: NO), the learning processing unit 62 selects a predetermined number of unselected training data T from the plurality of training data T stored in the storage device 12 as the new selected training data T (Sc21). That is, the selection of a predetermined number of training data T (Sc21), the calculation of the loss function (Sc22-Sc24), and the update of the plurality of variables (Sc25) are repeated until the end condition is satisfied (Sc26: YES). When the end condition is satisfied (Sc26: YES), the learning processing unit 62 ends the learning process Sc. When the learning process Sc ends, the coding model 21, the generative model 224, the generative model 32, and the generative model 40 are established.
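  • The control flow of steps Sc21 to Sc26 can be sketched as a mini-batch gradient-descent loop. This is a minimal sketch, not the disclosed implementation: a two-parameter linear function stands in for the chain of the coding model 21 and the generative models 224, 32, and 40, the gradients are written by hand instead of being obtained by backpropagation, batches are drawn at random rather than from the unselected training data, and the end condition (a loss threshold or an iteration limit) is an assumption. The function name learning_process and all of its parameters are hypothetical.

```python
import random

def learning_process(training_data, batch_size=4, lr=0.05,
                     max_iters=5000, tol=1e-6, seed=0):
    """Toy version of the learning process Sc for a stand-in model y = a*x + b."""
    rng = random.Random(seed)
    params = {"a": 0.0, "b": 0.0}  # stand-in for the variables of all models
    loss = float("inf")
    for _ in range(max_iters):
        # Sc21: select a predetermined number of training data.
        batch = rng.sample(training_data, batch_size)
        # Sc22-Sc23: run the provisional model on the selected inputs.
        errors = [params["a"] * x + params["b"] - y for x, y in batch]
        # Sc24: loss between generated values and ground-truth values.
        loss = sum(e * e for e in errors) / batch_size
        # Sc25: update the variables so that the loss is reduced.
        grad_a = sum(2 * e * x for e, (x, _) in zip(errors, batch)) / batch_size
        grad_b = sum(2 * e for e in errors) / batch_size
        params["a"] -= lr * grad_a
        params["b"] -= lr * grad_b
        # Sc26: end condition (assumed here to be a loss threshold).
        if loss < tol:
            break
    return params, loss

# Usage: noise-free data generated from y = 2x + 1.
training_data = [(i / 10, 2 * (i / 10) + 1) for i in range(11)]
params, final_loss = learning_process(training_data)
```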
  • According to the coding model 21 established by the learning process Sc exemplified above, it is possible to generate symbol data B suitable for generating statistically valid acoustic feature data F for unknown music data D. Further, according to the generative model 224, it is possible to generate coded data E suitable for generating acoustic feature data F that is statistically valid for the music data D. Similarly, according to the generative model 32, it is possible to generate control data C suitable for generating statistically valid acoustic feature data F for the music data D.
  • The configuration for generating the acoustic signal W of a singing sound has been illustrated, but the second embodiment is similarly applicable to the generation of the acoustic signal W of a musical instrument sound.
  • In that case, the music data D specifies a duration d1 and a pitch d2 for each of the plurality of notes constituting the music, as described above in the first embodiment. That is, the phoneme code d3 is omitted from the music data D.
  • Acoustic feature data F may be generated by selectively using a plurality of generation models 40 constructed by using different training data T.
  • the training data T used for each learning process Sc of the plurality of generative models 40 is generated from the acoustic signal W of the reference sound produced by different singers or musical instruments.
  • the control device 11 generates the acoustic feature data F by using one generation model 40 corresponding to the singer or the musical instrument selected by the user among the plurality of generation models 40.
  • the indicated value Z1 indicating the strength of the synthetic sound is illustrated, but the indicated value Z1 is not limited to the strength.
  • Degree) the tempo of the synthetic sound, the identification code of the singer or the playing instrument of the synthetic sound, and various numerical values related to the condition of the synthetic sound are exemplified as the indicated value Z1.
  • The preparation processing unit 61 can calculate the time series of the various indicated values Z1 exemplified above by analyzing the acoustic signal W of the reference sound included in the reference data T0. For example, the indicated value Z1 representing the depth or period of the vibrato of the reference sound is calculated from the temporal change of the frequency characteristic of the acoustic signal W. The indicated value Z1 representing the temporal change in intensity of the attack portion of the reference sound is calculated from the time derivative of the signal intensity of the acoustic signal W or the time derivative of the fundamental frequency. The indicated value Z1 representing the timbre of the synthetic sound is calculated from the intensity ratio between frequency bands of the acoustic signal W.
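  • Three of the indicated values described above can be approximated with very simple estimators. The sketch below is illustrative only and rests on assumptions not stated in the original: vibrato depth is taken as half the peak-to-peak excursion of a fundamental-frequency series, the attack is characterized by the largest finite-difference derivative of an intensity envelope, and timbre is summarized by a two-band energy ratio. The function names and the split_bin parameter are hypothetical.

```python
def vibrato_depth(f0_series):
    """Half the peak-to-peak excursion of the fundamental frequency (Hz)."""
    return (max(f0_series) - min(f0_series)) / 2

def attack_slope(intensity, dt):
    """Largest finite-difference time derivative of the intensity envelope."""
    return max((b - a) / dt for a, b in zip(intensity, intensity[1:]))

def band_intensity_ratio(spectrum, split_bin):
    """Ratio of high-band to low-band energy for one magnitude spectrum."""
    low = sum(m * m for m in spectrum[:split_bin])
    high = sum(m * m for m in spectrum[split_bin:])
    return high / low if low > 0 else float("inf")

# Usage with small hand-made inputs.
depth = vibrato_depth([440.0, 446.0, 440.0, 434.0, 440.0])
slope = attack_slope([0.0, 0.1, 0.5, 0.6], dt=0.01)
ratio = band_intensity_ratio([2.0, 2.0, 1.0, 1.0], split_bin=2)
```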
  • the indicated value Z1 representing the tempo of the synthetic sound is calculated by a known beat detection technique or tempo detection technique.
  • The indicated value Z1 representing the tempo of the synthetic sound may also be calculated by analyzing a periodic instruction (for example, a tap operation) by the creator. Further, the indicated value Z1 representing the identification code of the singer or the instrument of the synthetic sound is set according to, for example, a manual operation by the creator. Further, the indicated value Z1 in the training data T may be set from the performance information included in the music data D. For example, the indicated value Z1 is calculated from various performance information (velocity, modulation wheel, vibrato parameter, foot pedal, etc.) conforming to the MIDI standard.
  • The reference period Ra applied to the input data X has been illustrated as including a plurality of time steps τ located before the current step τc and a plurality of time steps τ located after the current step τc. However, a configuration in which the reference period Ra includes one time step τ located immediately before the current step τc, or a configuration in which it includes one time step τ located immediately after the current step τc, is also assumed. A configuration in which the reference period Ra includes only the current step τc may also be adopted. That is, the coded data E of the current step τc may be generated by supplying the set of the intermediate data Q, the position data G, and the pitch data P at the current step τc to the generative model 224 as the input data X.
  • The configuration in which the reference period Rb includes a plurality of time steps τ has been exemplified, but a configuration in which the reference period Rb includes only the current step τc is also assumed. That is, the generative model 32 generates the control data C only from the indicated value Z1 at the current step τc.
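  • The variants of the reference period Ra described above differ only in how many time steps before and after the current step are gathered into the input data X. The following sketch makes that explicit. It assumes that "in front of" means earlier and "behind" means later (consistent with the text's "rear (future)"), and that steps outside the music are handled by clamping to the first or last step; the clamping rule, the function name build_input_x, and its parameters are assumptions, not part of the original.

```python
def build_input_x(q, g, p, current, n_before=2, n_after=2):
    """Collect (Q, G, P) triples over the reference period around `current`.

    q, g, p: per-time-step intermediate data, position data, pitch data.
    n_before / n_after: number of steps included before / after the current step.
    """
    last = len(q) - 1
    x = []
    for step in range(current - n_before, current + n_after + 1):
        s = min(max(step, 0), last)  # assumption: clamp at the edges of the music
        x.append((q[s], g[s], p[s]))
    return x

# Usage: five time steps of toy data.
q = ["q0", "q1", "q2", "q3", "q4"]
g = [0.0, 0.5, 0.0, 0.5, 0.0]
p = [60, 60, 62, 62, 64]
window = build_input_x(q, g, p, current=0, n_before=1, n_after=2)
current_only = build_input_x(q, g, p, current=3, n_before=0, n_after=0)
```

  • Setting n_before and n_after to zero corresponds to the configuration in which the reference period contains only the current step.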
  • The configuration in which the reference period Ra includes a plurality of time steps τ located before and after the current step τc has been illustrated.
  • In that configuration, the generative model 224 reflects the features of the music before and after the current step τc in the coded data E generated from the input data X including the intermediate data Q of the current step τc.
  • The intermediate data Q itself of each time step τ may therefore be data that reflects only the features of the music corresponding to that time step τ. That is, the intermediate data Q of the current step τc need not reflect the features of the music before or after the current step τc.
  • For example, the intermediate data Q of the current step τc reflects the features of the one symbol corresponding to the current step τc, but does not reflect the features of the symbols before or after the current step τc.
  • The intermediate data Q is generated from the symbol data B of each symbol.
  • The symbol data B is data representing the features of one symbol (for example, duration d1, pitch d2, and phoneme code d3).
  • The intermediate data Q can be generated directly from a single symbol data B.
  • The conversion processing unit 222 generates the intermediate data Q of each time step τ from the symbol data B of each symbol by using the above-mentioned mapping information. That is, in this modification, the coding model 21 is not used to generate the intermediate data Q.
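  • The conversion from per-symbol data to per-time-step data can be sketched as a simple expansion. The sketch assumes that the mapping information reduces to the number of time steps occupied by each symbol's unit period, and that the position data G is the fractional position of the step within that period; both are simplifying assumptions, and the name expand_symbols is hypothetical.

```python
def expand_symbols(symbol_data, steps_per_symbol):
    """Expand per-symbol data B into per-time-step intermediate data Q.

    Returns (q, g): q[i] is the symbol data active at time step i, and
    g[i] is the fractional position of step i within its unit period.
    """
    q, g = [], []
    for b, n in zip(symbol_data, steps_per_symbol):
        for i in range(n):
            q.append(b)
            g.append(i / n)
    return q, g

# Usage: two symbols lasting 2 and 4 time steps respectively.
q_steps, g_steps = expand_symbols(["B1", "B2"], [2, 4])
```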
  • The control device 11 directly generates the symbol data B corresponding to the different phonemes in the music from the phoneme information (for example, the phoneme code d3) in the music data D. That is, the coding model 21 is not used to generate the symbol data B.
  • the coding model 21 may be used to generate the symbol data B in this modification.
  • The configuration in which the input data Y supplied to the generative model 40 includes the acoustic feature data F of a plurality of past time steps τ has been exemplified, but a configuration in which the input data Y of the current step τc includes the acoustic feature data F of only the one immediately preceding time step τ is also assumed.
  • the configuration in which the past acoustic feature data F is regressed to the input of the generation model 40 is not essential. That is, the input data Y that does not include the past acoustic feature data F may be supplied to the generation model 40.
  • However, in a configuration in which the past acoustic feature data F is not fed back, the acoustic features of the synthetic sound may change discontinuously. Therefore, from the viewpoint of generating an audibly natural synthetic sound whose acoustic features transition continuously, a configuration in which the past acoustic feature data F is fed back to the input of the generative model 40 is preferable.
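  • The feedback of past acoustic feature data F into the input data Y is an autoregressive loop, which can be sketched as follows. The toy_model function is a hypothetical stand-in for the generative model 40 (it simply smooths toward a target derived from E and C), scalar values stand in for the feature vectors, and the initial history value is an assumption. The sketch requires n_past to be at least 1.

```python
def generate_features(coded, control, model, n_past=2, initial=0.0):
    """Generate acoustic feature data F step by step with feedback.

    At each time step the input data Y consists of the coded data E, the
    control data C, and the features generated in the last n_past steps.
    """
    history = [initial] * n_past  # assumption: silent history before the start
    features = []
    for e, c in zip(coded, control):
        past = tuple(history[len(history) - n_past:])
        f = model((e, c, past))
        features.append(f)
        history.append(f)
    return features

def toy_model(y):
    """Hypothetical stand-in for the generative model 40."""
    e, c, past = y
    target = e * c
    # Blending with the previous output keeps the trajectory continuous.
    return 0.5 * past[-1] + 0.5 * target

# Usage: constant coded data and control data for three time steps.
f_series = generate_features([1.0, 1.0, 1.0], [1.0, 1.0, 1.0], toy_model)
```

  • The output approaches the target gradually rather than jumping to it, which illustrates why feeding back past features tends to produce continuous transitions.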
  • the configuration in which the acoustic processing system 100 includes the coding model 21 is exemplified, but the coding model 21 may be omitted.
  • For example, the time series of the symbol data B generated from the music data D by the coding model 21 of an external device other than the acoustic processing system 100 may be stored in the storage device 12 of the acoustic processing system 100.
  • The configuration in which the coded data acquisition unit 22 generates the coded data E has been exemplified, but the coded data acquisition unit 22 may instead be an element that receives, from an external device, the coded data E generated by that external device. That is, the acquisition of the coded data E includes both the generation of the coded data E and the reception of the coded data E.
  • the preparatory process Sa is executed for the entire music, but the preparatory process Sa may be executed for each of the plurality of sections in which the music is divided.
  • For example, the preparatory process Sa may be executed for each of a plurality of structural sections (for example, intro, verse, pre-chorus, chorus, etc.) into which the music is divided according to musical meaning.
  • The acoustic processing system 100 may be realized by a server device that communicates with a terminal device such as a mobile phone or a smartphone.
  • For example, the acoustic processing system 100 generates the acoustic signal W from the instructions (the indicated value Z1 and the tempo Z2) received from the terminal device and the music data D stored in the storage device 12, and transmits the acoustic signal W to the terminal device.
  • In a configuration in which the waveform synthesis unit 50 is mounted on the terminal device, the time series of the acoustic feature data F generated by the generative model 40 is transmitted from the acoustic processing system 100 to the terminal device. That is, the waveform synthesis unit 50 is omitted from the acoustic processing system 100.
  • the functions of the acoustic processing system 100 exemplified above are realized by the cooperation of the single or a plurality of processors constituting the control device 11 and the program stored in the storage device 12.
  • the program according to the present disclosure may be provided and installed in a computer in a form stored in a computer-readable recording medium.
  • The recording medium is, for example, a non-transitory recording medium. An optical recording medium (optical disc) such as a CD-ROM is a good example, but any known form of recording medium, such as a semiconductor recording medium or a magnetic recording medium, is also included.
  • The non-transitory recording medium includes any recording medium other than a transitory, propagating signal, and volatile recording media are not excluded. Further, in a configuration in which a distribution device distributes the program via a communication network, the storage device that stores the program in the distribution device corresponds to the above-mentioned non-transitory recording medium.
  • In the acoustic processing method according to one aspect of the present disclosure, in each of a plurality of time steps on the time axis, coded data according to the features of the music at the time step and the features of the music behind (after) the time step is acquired, control data according to a real-time instruction from the user is acquired, and acoustic feature data representing the acoustic features of a synthetic sound is generated according to first input data including the acquired control data and the acquired coded data.
  • the acoustic feature data is generated according to the feature of the song behind the current time step of the song and the control data according to the instruction from the user in the current time step. Therefore, it is possible to generate acoustic feature data of the synthetic sound according to the feature in the rear (future) in the music and the real-time instruction from the user.
  • music is represented by a time series of multiple symbols.
  • Each symbol constituting the music is, for example, a musical note or a phoneme.
  • For each symbol one or more of a plurality of musical elements such as pitch, time of pronunciation, and volume are specified. That is, it is not essential to specify the pitch for each symbol.
  • the acquisition of the coded data includes, for example, the conversion of the coded data using the mapping information.
  • the first input data in each of the plurality of time steps includes one or more said acoustic feature data generated in one or more past time steps.
  • The first input data used for generating the acoustic feature data includes, in addition to the control data and the coded data of the current time step, the acoustic feature data generated in one or more past time steps. Therefore, it is possible to generate acoustic feature data of a synthetic sound in which the temporal transition of acoustic features is audibly natural.
  • In the generation of the acoustic feature data, the acoustic feature data is generated by supplying the first input data to a machine-learned first generative model.
  • the machine-learned first generative model is used to generate the acoustic feature data. Therefore, it is possible to generate statistically valid acoustic feature data under the potential trends in the plurality of training data used in the machine learning of the first generative model.
  • an acoustic signal representing the waveform of the synthetic sound is further generated from the time series of the acoustic feature data.
  • the synthetic sound can be reproduced by supplying the acoustic signal to the sound emitting device.
  • A plurality of symbol data corresponding to different symbols in the music are generated from music data representing the time series of the symbols constituting the music. Each of the plurality of symbol data is data according to the features of the symbol to which that symbol data corresponds and the features of the symbols behind that symbol in the music, and in the acquisition of the coded data, the coded data corresponding to each time step is acquired from the plurality of symbol data.
  • A plurality of symbol data corresponding to different symbols in the music are generated from music data representing the time series of the symbols constituting the music. Each of the plurality of symbol data is data according to the features of the symbol to which that symbol data corresponds and the features of the symbols behind that symbol in the music. Intermediate data corresponding to each of the plurality of time steps is generated from the plurality of symbol data, and in the acquisition of the coded data, the coded data is generated from second input data including two or more intermediate data respectively corresponding to two or more time steps, among the plurality of time steps, that include the current time step and a time step behind it.
  • The coded data of the current time step is generated from the second input data including two or more intermediate data respectively corresponding to two or more time steps that include the current time step and a time step behind it. Therefore, compared with a configuration in which the coded data is generated from a single intermediate data, it is possible to generate a time series of acoustic feature data in which the temporal transition of acoustic features is audibly natural.
  • In the acquisition of the coded data, the coded data is generated by supplying the second input data to a machine-learned second generative model.
  • the coded data is generated by supplying the second input data to the machine-learned second generation model. Therefore, it is possible to generate statistically valid coded data under the potential trends in multiple training data used for machine learning.
  • In a specific example of Aspect 6 or Aspect 7, in the generation of the intermediate data, each symbol data is used to generate the intermediate data for one or more time steps within the unit period in which the symbol corresponding to that symbol data is sounded. The second input data includes position data indicating the position within the unit period to which each of the two or more intermediate data corresponds, and pitch data representing the pitch at each of the two or more time steps.
  • The coded data is generated from the second input data including the position data, which represents the position of each intermediate data within the unit period in which the symbol is sounded, and the pitch data, which represents the pitch at each time step. Therefore, it is possible to generate a time series of coded data that appropriately represents the temporal transition of symbols and pitches.
  • Intermediate data corresponding to each of the plurality of time steps is generated, the intermediate data corresponding to each time step being data according to the features of the symbol that corresponds to that time step in the time series of the symbols constituting the music. In the acquisition of the coded data, the coded data is generated from second input data including two or more intermediate data respectively corresponding to two or more time steps, among the plurality of time steps, that include the current time step and a time step behind it.
  • In the acquisition of the control data, the control data is generated from the time series of indicated values according to instructions from the user.
  • Because the control data is generated from the time series of indicated values according to instructions from the user, control data that changes appropriately with the temporal change of the indicated values can be generated.
  • The acoustic processing system according to one aspect of the present disclosure includes: a coded data acquisition unit that acquires, in each of a plurality of time steps on the time axis, coded data according to the features of the music at the time step and the features of the music behind the time step; a control data acquisition unit that acquires, in each of the plurality of time steps, control data according to a real-time instruction from a user; and an acoustic feature data generation unit that generates, in each of the plurality of time steps, acoustic feature data representing the acoustic features of a synthetic sound according to first input data including the acquired control data and the acquired coded data.
  • The program according to one aspect (aspect 12) of the present disclosure causes a computer to function as: a coded data acquisition unit that acquires, in each of a plurality of time steps on the time axis, coded data according to the features of the music at the time step and the features of the music behind the time step; a control data acquisition unit that acquires, in each of the plurality of time steps, control data according to a real-time instruction from a user; and an acoustic feature data generation unit that generates, in each of the plurality of time steps, acoustic feature data representing the acoustic features of a synthetic sound according to first input data including the acquired control data and the acquired coded data.
  • 100 ... Acoustic processing system, 11 ... Control device, 12 ... Storage device, 13 ... Sound emitting device, 14 ... Operating device, 21 ... Coding model, 22 ... Coded data acquisition unit, 221 ... Period setting unit, 222 ... Conversion processing unit, 223 ... Pitch estimation unit, 224 ... Generative model, 31 ... Control data acquisition unit, 32 ... Generative model, 40 ... Generative model, 50 ... Waveform synthesis unit, 61 ... Preparation processing unit, 62 ... Learning processing unit.

PCT/JP2021/021691 2020-06-09 2021-06-08 音響処理方法、音響処理システムおよびプログラム Ceased WO2021251364A1 (ja)

Priority Applications (4)

Application Number Priority Date Filing Date Title
CN202180040942.0A CN115699161A (zh) 2020-06-09 2021-06-08 音响处理方法、音响处理系统及程序
EP21823051.4A EP4163912A4 (en) 2020-06-09 2021-06-08 ACOUSTIC PROCESSING METHOD, ACOUSTIC PROCESSING SYSTEM AND PROGRAM
JP2022530567A JP7517419B2 (ja) 2020-06-09 2021-06-08 音響処理方法、音響処理システムおよびプログラム
US18/076,739 US20230098145A1 (en) 2020-06-09 2022-12-07 Audio processing method, audio processing system, and recording medium

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US202063036459P 2020-06-09 2020-06-09
US63/036,459 2020-06-09
JP2020-130738 2020-07-31
JP2020130738 2020-07-31

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/076,739 Continuation US20230098145A1 (en) 2020-06-09 2022-12-07 Audio processing method, audio processing system, and recording medium

Publications (1)

Publication Number Publication Date
WO2021251364A1 true WO2021251364A1 (ja) 2021-12-16

Family

ID=78845687

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2021/021691 Ceased WO2021251364A1 (ja) 2020-06-09 2021-06-08 音響処理方法、音響処理システムおよびプログラム

Country Status (5)

Country Link
US (1) US20230098145A1
EP (1) EP4163912A4
JP (1) JP7517419B2
CN (1) CN115699161A
WO (1) WO2021251364A1

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2025033121A1 (ja) * 2023-08-07 2025-02-13 ヤマハ株式会社 信号生成方法、表示制御方法およびプログラム

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP7568981B2 (ja) * 2021-05-17 2024-10-17 日本電信電話株式会社 学習装置、学習方法及びプログラム

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH05158478A (ja) * 1991-12-04 1993-06-25 Kawai Musical Instr Mfg Co Ltd 電子楽器
JP2019028106A (ja) * 2017-07-25 2019-02-21 ヤマハ株式会社 情報処理方法およびプログラム
JP2019139294A (ja) * 2018-02-06 2019-08-22 ヤマハ株式会社 情報処理方法および情報処理装置
JP2019139295A (ja) * 2018-02-06 2019-08-22 ヤマハ株式会社 情報処理方法および情報処理装置
WO2020031544A1 (ja) * 2018-08-10 2020-02-13 ヤマハ株式会社 楽譜データの情報処理装置
JP2020076844A (ja) * 2018-11-06 2020-05-21 ヤマハ株式会社 音響処理方法および音響処理装置

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8158875B2 (en) * 2010-02-24 2012-04-17 Stanger Ramirez Rodrigo Ergonometric electronic musical device for digitally managing real-time musical interpretation
JP6201460B2 (ja) * 2013-07-02 2017-09-27 ヤマハ株式会社 ミキシング管理装置
JP6171711B2 (ja) * 2013-08-09 2017-08-02 ヤマハ株式会社 音声解析装置および音声解析方法
EP3208795B1 (en) * 2014-10-17 2020-03-04 Yamaha Corporation Content control device and content control program
JP6004358B1 (ja) * 2015-11-25 2016-10-05 株式会社テクノスピーチ 音声合成装置および音声合成方法
US10068557B1 (en) * 2017-08-23 2018-09-04 Google Llc Generating music with deep neural networks
JP6583756B1 (ja) * 2018-09-06 2019-10-02 株式会社テクノスピーチ 音声合成装置、および音声合成方法
CN110164412A (zh) * 2019-04-26 2019-08-23 吉林大学珠海学院 一种基于lstm的音乐自动合成方法及系统

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
BLAAUW, MERLIJN; BONADA, JORDI: "A NEURAL PARAMETRIC SINGING SYNTHESIZER", arXiv:1704.03809v3, 2017
See also references of EP4163912A4
VAN DEN OORD, AARON ET AL.: "WAVENET: A GENERATIVE MODEL FOR RAW AUDIO", arXiv:1609.03499v2, 2016

Also Published As

Publication number Publication date
CN115699161A (zh) 2023-02-03
EP4163912A1 (en) 2023-04-12
US20230098145A1 (en) 2023-03-30
EP4163912A4 (en) 2024-07-31
JP7517419B2 (ja) 2024-07-17
JPWO2021251364A1 2021-12-16

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21823051

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2022530567

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 2021823051

Country of ref document: EP

Effective date: 20230109

WWW Wipo information: withdrawn in national office

Ref document number: 2021823051

Country of ref document: EP