WO2022074754A1 - Information processing method, information processing system, and program - Google Patents

Information processing method, information processing system, and program

Info

Publication number
WO2022074754A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
time
user
editing
instruction
Prior art date
Application number
PCT/JP2020/037966
Other languages
English (en)
Japanese (ja)
Inventor
竜之介 大道
慶二郎 才野
正宏 清水
Original Assignee
ヤマハ株式会社
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by ヤマハ株式会社 filed Critical ヤマハ株式会社
Priority to JP2022555020A priority Critical patent/JPWO2022074754A1/ja
Priority to PCT/JP2020/037966 priority patent/WO2022074754A1/fr
Priority to CN202080105738.8A priority patent/CN116324965A/zh
Publication of WO2022074754A1 publication Critical patent/WO2022074754A1/fr

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10G - REPRESENTATION OF MUSIC; RECORDING MUSIC IN NOTATION FORM; ACCESSORIES FOR MUSIC OR MUSICAL INSTRUMENTS NOT OTHERWISE PROVIDED FOR, e.g. SUPPORTS
    • G10G 1/00 - Means for the representation of music
    • G10G 1/04 - Transposing; Transcribing
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 - Speech synthesis; Text to speech systems
    • G10L 13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G10L 13/033 - Voice editing, e.g. manipulating the voice of the synthesiser

Definitions

  • This disclosure relates to the processing of time series data.
  • Patent Document 1 discloses a technique for synthesizing a singing voice that pronounces a note sequence instructed by a user on an editing screen.
  • The editing screen is a piano-roll screen on which a time axis and a pitch axis are set.
  • The user specifies a phoneme (phonetic character), a pitch, and a pronunciation period for each note constituting a musical piece.
  • The information processing method uses first time-series data representing a time series of the feature amount of a sound in which a symbol string is pronounced in a first pronunciation style.
  • The information processing system includes an editing processing unit that edits the first time-series data, representing the time series of the feature amount of the sound that pronounces the symbol string in the first pronunciation style, in accordance with a first instruction from the user, and edits second time-series data, representing the time series of the feature amount of the sound that pronounces the symbol string in a second pronunciation style different from the first pronunciation style, in accordance with a second instruction from the user.
  • The information processing system also includes an information management unit that, each time the first time-series data is edited, saves first history data corresponding to the edited first time-series data as data of a new version, and, each time the second time-series data is edited, saves second history data corresponding to the edited second time-series data as data of a new version.
  • From among the saved history data of the different versions, the second time-series data corresponding to the second history data is acquired in accordance with an instruction from the user.
  • A program causes a computer system to function as the editing processing unit and the information management unit described above.
  • FIG. 1 is a block diagram illustrating the configuration of the information processing system 100 according to the first embodiment of the present disclosure.
  • the information processing system 100 is an acoustic processing system that generates an acoustic signal Z.
  • the acoustic signal Z is a signal in the time domain representing the waveform of the synthetic sound.
  • the synthetic sound is, for example, a musical instrument sound produced by a virtual performer playing a musical instrument, or a singing sound produced by, for example, a virtual singer singing a song.
  • the information processing system 100 is realized by a computer system including a control device 11, a storage device 12, a sound emitting device 13, a display device 14, and an operating device 15.
  • the information processing system 100 is realized by, for example, an information device such as a smartphone, a tablet terminal, or a personal computer.
  • the information processing system 100 is realized not only by a single device but also by a plurality of devices (for example, a client-server system) configured as separate bodies from each other.
  • The control device 11 is one or more processors that control each element of the information processing system 100. Specifically, the control device 11 is configured by one or more types of processors such as a CPU (Central Processing Unit), an SPU (Sound Processing Unit), a DSP (Digital Signal Processor), an FPGA (Field Programmable Gate Array), or an ASIC (Application Specific Integrated Circuit).
  • The control device 11 executes various processes for generating the acoustic signal Z.
  • the storage device 12 is a single or a plurality of memories for storing a program executed by the control device 11 and various data used by the control device 11.
  • the storage device 12 is composed of a known recording medium such as a magnetic recording medium or a semiconductor recording medium.
  • the storage device 12 may be composed of a combination of a plurality of types of recording media.
  • a portable recording medium attached to and detached from the information processing system 100, or a recording medium capable of writing and reading via a communication network (for example, cloud storage) may be used as the storage device 12.
  • the sound emitting device 13 reproduces the synthetic sound represented by the acoustic signal Z generated by the control device 11.
  • the sound emitting device 13 is, for example, a speaker or headphones.
  • A D/A converter that converts the acoustic signal Z from digital to analog and an amplifier that amplifies the acoustic signal Z are not shown for convenience. Although FIG. 1 illustrates a configuration in which the sound emitting device 13 is mounted on the information processing system 100, a sound emitting device 13 separate from the information processing system 100 may instead be connected to the information processing system 100 by wire or wirelessly.
  • the display device 14 displays an image under the control of the control device 11.
  • the display device 14 is composed of a display panel such as a liquid crystal panel or an organic EL (ElectroLuminescence) panel.
  • the operation device 15 is an input device that receives instructions from the user.
  • the operation device 15 is, for example, a plurality of controls operated by the user or a touch panel for detecting contact by the user.
  • the user can instruct the condition of the synthesized sound by operating the operation device 15.
  • the display device 14 displays an image (hereinafter referred to as “editing screen”) G referred to by the user for instructing the condition of the synthetic sound.
  • FIG. 2 is a schematic diagram of the edit screen G.
  • the editing screen G includes a plurality of editing areas E (En, Ef, and Ew).
  • a common time axis (horizontal axis) is set in the plurality of editing areas E.
  • the section of the synthetic sound displayed on the edit screen G is changed according to the instruction from the user to the operation device 15.
  • In the editing area En, a time series (hereinafter referred to as the "note sequence") N of the plurality of notes constituting the score of the synthesized sound is displayed.
  • a coordinate plane defined by a time axis and a pitch axis (vertical axis) is set in the editing area En.
  • An image representing each note constituting the note sequence N is arranged in the editing area En.
  • a pitch (for example, a note number) and a pronunciation period are specified for each note in the note sequence N.
  • the phoneme is specified for each note.
  • performance symbols such as crescendo, forte, and decrescendo are also displayed.
  • the user can give an edit instruction Qn to the edit area En by operating the operation device 15.
  • the edit instruction Qn is an instruction to edit the note string N.
  • The edit instruction Qn is, for example, an instruction to add or delete a note in the note sequence N, an instruction to change a condition (pitch, pronunciation period, or phoneme) of a note, or an instruction to change a performance symbol.
  • a time series (hereinafter referred to as "feature column") F of the feature amount of the synthetic sound is displayed.
  • the feature amount is an acoustic feature amount of the synthetic sound.
  • the feature column F (that is, the temporal transition of the fundamental frequency) is displayed in the editing area Ef with the fundamental frequency (pitch) of the synthesized sound as the feature amount.
  • the user can give an edit instruction Qf to the edit area Ef by operating the operation device 15.
  • the edit instruction Qf is an instruction to edit the feature column F.
  • the editing instruction Qf is, for example, an instruction for changing the time change of the feature amount in the desired section of the feature column F displayed in the editing area Ef.
  • In the editing area Ew, the waveform W of the synthesized sound on the time axis is displayed.
  • the user can give an edit instruction Qw to the edit area Ew by operating the operation device 15.
  • the edit instruction Qw is an instruction to edit the waveform W.
  • the editing instruction Qw is an instruction to change the waveform in the user's desired section of the waveform W displayed in the editing area Ew.
  • the editing screen G includes, in addition to the plurality of editing areas E exemplified above, a plurality of operating areas (Gn, Gf and Gw) corresponding to different editing areas E, and an operating image B1 (playback).
  • the operation image B1 is a software button that can be operated by the user using the operation device 15.
  • the operation image B1 is an operation element for the user to instruct the reproduction of the synthesized sound.
  • When the operation image B1 is operated, the synthetic sound of the waveform W displayed in the editing area Ew is reproduced from the sound emitting device 13.
  • the operation area Gn is an area related to the note string N. Specifically, the note string version number Vn, the operation image Gn1 and the operation image Gn2 are displayed in the operation area Gn.
  • the note string version number Vn is a number representing the version of the note string N displayed in the editing area En.
  • the note string version number Vn is incremented by 1 each time the note string N is edited according to the edit instruction Qn. Further, the user can change the note string version number Vn in the operation area Gn to an arbitrary numerical value by operating the operation device 15.
  • the note string N of the version corresponding to the note string version number Vn changed by the user is displayed in the editing area En.
  • the operation image Gn1 and the operation image Gn2 are software buttons that can be operated by the user using the operation device 15.
  • The operation image Gn1 is an operator for the user to instruct returning the note string N to the state before the immediately preceding edit (Undo). That is, when the user operates the operation image Gn1, the note string version number Vn is changed to the immediately preceding value, and the note string N of the version corresponding to the changed note string version number Vn is displayed in the editing area En. The operation image Gn1 can therefore also be described as an operator for moving the note string version number Vn back to the immediately preceding value (that is, canceling the immediately preceding edit of the note string N).
  • the operation image Gn2 is an operator for instructing the user to perform the editing canceled by the operation on the operation image Gn1 again (Redo).
  • the operation area Gf is an area related to the feature column F. Specifically, the feature column version number Vf, the operation image Gf1, and the operation image Gf2 are displayed in the operation area Gf.
  • the feature column version number Vf is a number representing the version of the feature column F displayed in the editing area Ef.
  • the feature column version number Vf is incremented by 1 each time the feature column F is edited according to the edit instruction Qf. Further, the user can change the feature column version number Vf in the operation area Gf to an arbitrary numerical value by operating the operation device 15.
  • the feature column F of the version corresponding to the feature column version number Vf changed by the user is displayed in the editing area Ef.
  • the operation image Gf1 and the operation image Gf2 are software buttons that can be operated by the user using the operation device 15.
  • The operation image Gf1 is an operator for the user to instruct returning the feature column F to the state before the immediately preceding edit (Undo). That is, when the user operates the operation image Gf1, the feature column version number Vf is changed to the immediately preceding value, and the feature column F of the version corresponding to the changed feature column version number Vf is displayed in the editing area Ef. The operation image Gf1 can therefore also be described as an operator for moving the feature column version number Vf back to the immediately preceding value (that is, canceling the immediately preceding edit of the feature column F).
  • the operation image Gf2 is an operator for instructing the user to perform the editing canceled by the operation on the operation image Gf1 again (Redo).
  • the operation area Gw is an area related to the waveform W. Specifically, the waveform version number Vw, the operation image Gw1 and the operation image Gw2 are displayed in the operation area Gw.
  • the waveform version number Vw is a number representing the version of the waveform W displayed in the editing area Ew.
  • the waveform version number Vw is incremented by 1 each time the waveform W is edited according to the edit instruction Qw. Further, the user can change the waveform version number Vw in the operation area Gw to an arbitrary numerical value by operating the operation device 15.
  • the version of the waveform W corresponding to the waveform version number Vw changed by the user is displayed in the editing area Ew.
  • the operation image Gw1 and the operation image Gw2 are software buttons that can be operated by the user using the operation device 15.
  • The operation image Gw1 is an operator for the user to instruct returning the waveform W to the state before the immediately preceding edit (Undo). That is, when the user operates the operation image Gw1, the waveform version number Vw is changed to the immediately preceding value, and the waveform W of the version corresponding to the changed waveform version number Vw is displayed in the editing area Ew. The operation image Gw1 can therefore also be described as an operator for moving the waveform version number Vw back to the immediately preceding value (that is, canceling the immediately preceding edit of the waveform W).
  • the operation image Gw2 is an operator for instructing the user to perform the editing canceled by the operation on the operation image Gw1 again (Redo).
  • a plurality of version numbers V (Vn, Vf, Vw) are used.
  • An increase (increment) in each version number means progress of the editing work, and a decrease (decrement) means a rollback of the editing work.
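  • As an illustrative sketch only (Python is assumed; the class and attribute names are hypothetical and not part of the disclosure), the behaviour of one version number V and of the corresponding Undo/Redo operators could be modelled as follows:

    class VersionCounter:
        """Tracks one version number V (Vn, Vf or Vw).

        Incrementing corresponds to progress of the editing work (a new edit),
        decrementing corresponds to a rollback (Undo), and re-incrementing up
        to the highest saved version corresponds to Redo.
        """

        def __init__(self):
            self.current = 0   # version displayed in the editing area
            self.latest = 0    # highest version saved in the history area

        def new_edit(self):
            """Called each time the corresponding data is edited (edit instruction Q)."""
            self.current += 1
            # Simplifying assumption: the new edit becomes the latest version;
            # the document does not specify how versions branch after an Undo.
            self.latest = max(self.latest, self.current)
            return self.current

        def undo(self):
            """Operation image *1: step back to the immediately preceding version."""
            if self.current > 0:
                self.current -= 1
            return self.current

        def redo(self):
            """Operation image *2: re-apply the edit cancelled by undo()."""
            if self.current < self.latest:
                self.current += 1
            return self.current

        def set(self, value):
            """The user may also set the version number to an arbitrary saved value."""
            if 0 <= value <= self.latest:
                self.current = value
            return self.current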
  • FIG. 3 is a block diagram illustrating a functional configuration of the information processing system 100.
  • The control device 11 executes a program stored in the storage device 12 to realize a plurality of functions (a display control unit 20, an editing processing unit 30, and an information management unit 40) for editing the synthetic sound conditions and generating the acoustic signal Z.
  • the display control unit 20 causes the display device 14 to display an image under the control of the control device 11.
  • the display control unit 20 causes the display device 14 to display the editing screen G illustrated in FIG.
  • the display control unit 20 updates the edit screen G in response to an instruction (Qn, Qf or Qw) from the user.
  • the editing processing unit 30 in FIG. 3 edits the synthetic sound conditions (note sequence N, feature sequence F, and waveform W) according to an instruction (Qn, Qf, or Qw) from the user.
  • the editing processing unit 30 includes a first editing unit 31, a first generation unit 32, a second editing unit 33, a second generation unit 34, and a third editing unit 35.
  • The first editing unit 31 edits the note string data Dn.
  • the note string data Dn is time-series data representing the note sequence N of the synthesized sound.
  • the first editing unit 31 edits the note string data Dn according to the editing instruction Qn from the user for the editing area En.
  • the display control unit 20 displays the musical note string N represented by the musical note string data Dn edited by the first editing unit 31 in the editing area En.
  • the first generation unit 32 generates the feature sequence data Df from the note sequence data Dn edited by the first editing unit 31.
  • the feature sequence data Df is time-series data representing the feature sequence F of the synthesized sound.
  • To generate the feature amount at each time point on the time axis, among the plurality of feature amounts constituting the feature sequence F, the data of at least one note before and one note after the note at that time point is used.
  • That is, the feature sequence data Df is generated according to the context of the note sequence N represented by the note sequence data Dn.
  • the first generation unit 32 generates the feature column data Df using the first generation model M1.
  • the first generative model M1 is a statistical inference model that inputs the note sequence data Dn and outputs the feature sequence data Df.
  • the first generative model M1 is a trained model that has learned the relationship between the note sequence N and the feature sequence F.
  • the first generative model M1 is composed of, for example, a deep neural network (DNN).
  • For example, an arbitrary form of deep neural network (DNN), such as a convolutional neural network (CNN) or a recurrent neural network (RNN), is used as the first generative model M1.
  • Additional elements such as long short-term memory (LSTM) or self-attention may be incorporated into the first generative model M1.
  • The first generation model M1 is realized by a combination of a program that causes the control device 11 to execute an operation for generating the feature sequence data Df from the note sequence data Dn, and a plurality of variables (specifically, weights and biases) applied to the operation.
  • the plurality of variables defining the first generation model M1 are preset and stored in the storage device 12 by machine learning using the plurality of first training data.
  • Each of the plurality of first training data includes the note sequence data Dn and the feature sequence data Df (correct answer value).
  • In the machine learning, the plurality of variables of the first generative model M1 are iteratively updated so as to reduce the error between the feature sequence data Df that the provisional first generative model M1 outputs for the note sequence data Dn of each piece of first training data and the feature sequence data Df (correct answer) of that first training data.
  • Therefore, the first generative model M1 outputs statistically valid feature sequence data Df for unknown note sequence data Dn, under the tendency latent between the note sequences N and the feature sequences F in the plurality of first training data.
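  • Purely as a hedged illustration (the disclosure does not specify the network architecture, and all names and dimensions below are assumptions), the first generative model M1 could be sketched in Python as a recurrent network that maps a frame-wise encoding of the note sequence N to one feature amount, such as the fundamental frequency, per frame:

    import torch
    import torch.nn as nn

    class FeatureSequenceModel(nn.Module):
        """Illustrative stand-in for the first generative model M1.

        Input:  note sequence data Dn encoded frame by frame
                (for example pitch, note-on flag and phoneme id per frame).
        Output: feature sequence data Df, here one fundamental-frequency
                value per frame.
        """

        def __init__(self, note_feat_dim=8, hidden_dim=64):
            super().__init__()
            # A bidirectional recurrent layer lets every frame see the notes
            # before and after it, i.e. the context of the note sequence N.
            self.rnn = nn.GRU(note_feat_dim, hidden_dim,
                              batch_first=True, bidirectional=True)
            self.out = nn.Linear(2 * hidden_dim, 1)  # one feature amount per frame

        def forward(self, dn_frames):
            # dn_frames: (batch, frames, note_feat_dim)
            h, _ = self.rnn(dn_frames)
            return self.out(h)                       # (batch, frames, 1) = Df

    # Example: 2 seconds of frames at a 5 ms frame period (400 frames).
    model_m1 = FeatureSequenceModel()
    dn = torch.randn(1, 400, 8)    # hypothetical frame-wise note encoding
    df = model_m1(dn)              # predicted feature sequence (e.g. F0 contour)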
  • the second editing unit 33 edits the feature column data Df generated by the first generation unit 32. Specifically, the second editing unit 33 edits the feature column data Df according to the editing instruction Qf from the user for the editing area Ef.
  • the display control unit 20 displays the feature column F represented by the feature column data Df generated by the first generation unit 32 or the feature column F represented by the feature column data Df edited by the second editing unit 33 in the editing area Ef. do.
  • the second generation unit 34 generates waveform data Dw from the note sequence data Dn and the feature sequence data Df.
  • the waveform data Dw is time-series data representing the waveform W of the synthesized sound. That is, the waveform data Dw is composed of a time series of a plurality of samples representing the acoustic signal Z.
  • the acoustic signal Z is generated by D / A conversion and amplification for the waveform data Dw.
  • The feature sequence data Df immediately after being generated by the first generation unit 32 (that is, the feature sequence data Df not yet edited by the second editing unit 33) may be used for generating the waveform data Dw.
  • the second generation unit 34 generates waveform data Dw using the second generation model M2.
  • the second generative model M2 is a statistical inference model that outputs waveform data Dw by inputting a set of note sequence data Dn and feature sequence data Df (hereinafter referred to as “input data Din”).
  • the second generative model M2 is a trained model in which the relationship between the set of the note sequence N and the feature sequence F and the waveform W is learned.
  • the second generative model M2 is composed of, for example, a deep neural network.
  • an arbitrary form of deep neural network such as a convolutional neural network or a recurrent neural network is used as the second generative model M2.
  • Additional elements such as long short-term memory (LSTM) or self-attention may be incorporated into the second generative model M2.
  • The second generation model M2 is realized by a combination of a program that causes the control device 11 to execute an operation of generating the waveform data Dw from the input data Din including the note string data Dn and the feature sequence data Df, and a plurality of variables (specifically, weights and biases) applied to the operation.
  • the plurality of variables defining the second generation model M2 are preset and stored in the storage device 12 by machine learning using the plurality of second training data. Each of the plurality of second training data includes input data Din and waveform data Dw (correct answer value).
  • In the machine learning, the plurality of variables of the second generative model M2 are updated iteratively so that the error between the waveform data Dw that the provisional second generative model M2 outputs for the input data Din of each piece of second training data and the waveform data Dw (correct answer) of that second training data is reduced. Therefore, the second generative model M2 outputs statistically valid waveform data Dw for unknown input data Din, under the tendency latent between the sets of the note sequence N and the feature sequence F and the waveforms W in the plurality of second training data.
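  • The iterative update of the variables described above can be illustrated with the following hedged Python sketch, which uses a mean-squared error and gradient descent as one concrete possibility (the disclosure only states that the variables are updated so that the error is reduced); the WaveformModel class and all dimensions are hypothetical stand-ins for the second generative model M2:

    import torch
    import torch.nn as nn

    class WaveformModel(nn.Module):
        """Hypothetical counterpart of the second generative model M2: it maps
        the input data Din (frame-wise note features concatenated with the
        feature sequence Df) to waveform samples."""

        def __init__(self, din_dim=9, hidden_dim=64, samples_per_frame=250):
            super().__init__()
            self.rnn = nn.GRU(din_dim, hidden_dim, batch_first=True)
            self.out = nn.Linear(hidden_dim, samples_per_frame)

        def forward(self, din):                    # (batch, frames, din_dim)
            h, _ = self.rnn(din)
            frames = self.out(h)                   # (batch, frames, samples_per_frame)
            return frames.flatten(start_dim=1)     # waveform samples Dw

    model_m2 = WaveformModel()
    optimizer = torch.optim.Adam(model_m2.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()

    # Each piece of second training data pairs input data Din with the
    # correct-answer waveform Dw; random tensors stand in for real data here.
    training_data = [(torch.randn(1, 40, 9), torch.randn(1, 40 * 250))
                     for _ in range(4)]

    for epoch in range(3):                         # iterative update of the variables
        for din, dw_correct in training_data:
            dw_pred = model_m2(din)
            loss = loss_fn(dw_pred, dw_correct)    # error between output and correct answer
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()                       # weights and biases are updated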
  • the third editing unit 35 edits the waveform data Dw generated by the second generation unit 34. Specifically, the third editing unit 35 edits the waveform data Dw according to the editing instruction Qw from the user for the editing area Ew.
  • the display control unit 20 displays the waveform W represented by the waveform data Dw generated by the second generation unit 34 or the waveform W represented by the waveform data Dw edited by the third editing unit 35 in the editing area Ew. Further, when the operation image B1 (reproduction) is operated by the user, the acoustic signal Z corresponding to the waveform data Dw generated by the second generation unit 34 or the waveform data Dw edited by the third editing unit 35 is emitted. By being supplied to the device 13, the synthesized sound is reproduced.
  • the information management unit 40 manages versions of each of the note sequence data Dn, the feature sequence data Df, and the waveform data Dw. Specifically, the information management unit 40 manages the note sequence version number Vn, the feature sequence version number Vf, and the waveform version number Vw.
  • the information management unit 40 stores different versions of data (hereinafter referred to as “history data”) for each of the note sequence data Dn, the feature sequence data Df, and the waveform data Dw in the storage device 12.
  • a history area and a work area are set in the storage device 12.
  • the history area is a storage area in which the history of editing related to the synthetic sound condition is stored.
  • the work area is a storage area in which the note sequence data Dn, the feature sequence data Df, and the waveform data Dw are temporarily stored in the process of editing using the edit screen G.
  • The information management unit 40 saves the edited note sequence data Dn in the history area as first history data Hn[Vn, Vf, Vw] for each edit of the note sequence N in response to an edit instruction Qn. That is, the new version of the note string data Dn is stored in the storage device 12 as the first history data Hn[Vn, Vf, Vw].
  • the information management unit 40 saves the second history data Hf [Vn, Vf, Vw] corresponding to the edited feature column data Df according to the edit instruction Qf in the history area as new version data.
  • the second history data Hf [Vn, Vf, Vw] of the first embodiment is data showing how the feature column data Df was edited according to the edit instruction Qf (that is, the time series of the edit instruction Qf).
  • the second history data Hf [Vn, Vf, Vw] is also referred to as data representing the difference between the feature column data Df before and after editing.
  • the information management unit 40 saves the third history data Hw [Vn, Vf, Vw] corresponding to the edited waveform data Dw according to the edit instruction Qw in the history area as new version data.
  • the third history data Hw [Vn, Vf, Vw] of the first embodiment is data showing how the waveform data Dw was edited according to the editing instruction Qw (that is, the time series of the editing instruction Qw).
  • the third history data Hw [Vn, Vf, Vw] is also referred to as data representing the difference between the waveform data Dw before and after editing.
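  • A minimal sketch of how the history area could be organised follows; it is an illustration only and assumes (consistently with the description above, but without claiming to be the disclosed format) that the first history data stores the note string data itself while the second and third history data store only the time series of edit instructions, keyed by the version numbers:

    # History area sketch: keys are the version-number triples (Vn, Vf, Vw).
    history_area = {
        "Hn": {},   # first history data:  full note string data Dn per version
        "Hf": {},   # second history data: list of edit instructions Qf (a difference)
        "Hw": {},   # third history data:  list of edit instructions Qw (a difference)
    }

    def save_note_string(vn, vf, vw, dn):
        """Save the edited note string data Dn itself as a new version."""
        history_area["Hn"][(vn, vf, vw)] = dn

    def save_feature_edit(vn, vf, vw, qf_instructions):
        """Save only the edit instructions Qf, i.e. the difference between the
        feature sequence data Df before and after editing."""
        history_area["Hf"][(vn, vf, vw)] = list(qf_instructions)

    def save_waveform_edit(vn, vf, vw, qw_instructions):
        """Save only the edit instructions Qw, i.e. the difference between the
        waveform data Dw before and after editing."""
        history_area["Hw"][(vn, vf, vw)] = list(qw_instructions)

    # Example: version (1, 0, 0) right after the user edits the note string once.
    save_note_string(1, 0, 0, {"notes": [("C4", 0.0, 0.5), ("E4", 0.5, 0.5)]})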
  • FIG. 4 to 6 are flowcharts illustrating a specific procedure of the editing process Sa (Sa1, Sa2 and Sa3) for editing the condition of the synthetic sound according to the editing instruction Q (Qn, Qf or Qw) from the user.
  • FIG. 4 is a flowchart of the first editing process Sa1 relating to the editing of the note string N.
  • the first editing process Sa1 is started with the editing instruction Qn for the note string N as a trigger.
  • the first editing unit 31 edits the current note string data Dn according to the editing instruction Qn (Sa101).
  • the information management unit 40 increases the note string version number Vn by "1" (Sa102).
  • the note string data Dn is newly generated (Sa101), and the note string version number Vn is initialized to "0" (Sa102).
  • the information management unit 40 initializes the feature column version number Vf to "0" (Sa103) and initializes the waveform version number Vw to "0" (Sa104).
  • the first generation unit 32 generates the feature sequence data Df by supplying the note sequence data Dn edited by the first editing unit 31 to the first generation model M1 (Sa106).
  • the feature sequence data Df generated by the first generation unit 32 is stored in the work area of the storage device 12.
  • the second generation unit 34 supplies the input data Din including the note sequence data Dn edited by the first editing unit 31 and the feature sequence data Df generated by the first generation unit 32 to the second generation model M2. This generates waveform data Dw (Sa107).
  • the waveform data Dw generated by the second generation unit 34 is stored in the work area of the storage device 12.
  • The note string data Dn requires only one datum per note.
  • The feature sequence data Df is composed of one sample every several milliseconds to several tens of milliseconds in order to represent the change in pitch within each note, and the waveform data Dw, which represents the waveform of each note, is composed of one sample per sampling period (for example, 1/(50 kHz) = 20 μs).
  • Therefore, the amount of data of the feature sequence data Df generated from one piece of note sequence data Dn is several hundred to several thousand times the amount of data of the note sequence data Dn, and the amount of data of the waveform data Dw generated from one piece of feature sequence data Df is several hundred to several thousand times the amount of data of the feature sequence data Df.
  • For the lower-layer data, namely the feature column data Df and the waveform data Dw, which have the large amounts of data described above, only the difference from the upper-layer data is stored as history data. According to this configuration, there is an advantage that the amount of data stored in the storage device 12 for the lower-layer data can be significantly reduced compared with a configuration in which the data itself is stored.
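  • The following back-of-the-envelope arithmetic makes the size difference concrete; the 1-second note length and 5 ms frame period are illustrative assumptions, while the 20 μs sampling period corresponds to the 50 kHz example given above:

    # Illustrative data-volume estimate (all values are example assumptions).
    note_duration_s = 1.0           # a typical note length
    frame_period_s  = 0.005         # one feature sample every 5 ms
    sample_period_s = 1 / 50_000    # one waveform sample every 20 us (50 kHz)

    frames_per_note   = note_duration_s / frame_period_s   # 200 feature samples per note
    samples_per_frame = frame_period_s / sample_period_s   # 250 waveform samples per frame

    print(frames_per_note, samples_per_frame)  # 200.0 250.0, i.e. each layer is
                                               # hundreds of times larger than the one above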
  • the display control unit 20 updates the edit screen G (Sa108-Sa110). Specifically, the display control unit 20 displays the note string N represented by the note string data Dn edited by the first editing unit 31 in the editing area En (Sa108). Further, the display control unit 20 displays the feature column F represented by the current feature column data Df stored in the work area in the edit area Ef (Sa109). Similarly, the display control unit 20 displays the waveform W represented by the current waveform data Dw stored in the work area in the edit area Ew (Sa110).
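  • As a hedged summary only (the EditorState container, the helper names, and the dummy models below are hypothetical), the first editing process Sa1 can be paraphrased in Python as follows:

    from dataclasses import dataclass, field

    @dataclass
    class EditorState:
        """Hypothetical container for the work area and the version numbers."""
        dn: list = field(default_factory=list)   # note string data Dn
        df: list = field(default_factory=list)   # feature sequence data Df
        dw: list = field(default_factory=list)   # waveform data Dw
        vn: int = 0
        vf: int = 0
        vw: int = 0
        history_hn: dict = field(default_factory=dict)

    def first_editing_process_sa1(state, qn, generate_m1, generate_m2):
        """Sketch of Sa1, triggered by an edit instruction Qn for the note string N."""
        state.dn = state.dn + [qn]                 # Sa101: edit Dn (here: simply append a note)
        state.vn += 1                              # Sa102: raise the note string version number Vn
        state.vf = 0                               # Sa103: initialize the feature version number Vf
        state.vw = 0                               # Sa104: initialize the waveform version number Vw
        # Save the edited Dn in the history area as first history data Hn[Vn, Vf, Vw].
        state.history_hn[(state.vn, state.vf, state.vw)] = list(state.dn)
        state.df = generate_m1(state.dn)           # Sa106: regenerate the feature sequence Df
        state.dw = generate_m2(state.dn, state.df) # Sa107: regenerate the waveform Dw
        return state                               # Sa108-Sa110: the editing screen G is then redrawn

    # Usage with trivial stand-ins for the generative models M1 and M2:
    state = EditorState()
    state = first_editing_process_sa1(
        state, qn=("C4", 0.0, 0.5),
        generate_m1=lambda dn: [float(len(dn))] * 4,
        generate_m2=lambda dn, df: [0.0] * (len(df) * 8))
    print(state.vn, state.vf, state.vw)            # 1 0 0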
  • FIG. 5 is a flowchart of the second editing process Sa2 relating to the editing of the feature column F.
  • the second editing process Sa2 is started with the editing instruction Qf for the feature column F as a trigger.
  • the second editing unit 33 edits the current feature column data Df according to the editing instruction Qf (Sa201).
  • the second generation unit 34 generates waveform data Dw by supplying input data Din including the current note sequence data Dn and the feature sequence data Df edited by the second editing unit 33 to the second generation model M2. (Sa206).
  • the waveform data Dw generated by the second generation unit 34 is stored in the work area of the storage device 12.
  • the display control unit 20 updates the edit screen G (Sa207 and Sa208). Specifically, the display control unit 20 displays the feature column F represented by the feature column data Df edited by the second editing unit 33 in the editing area Ef (Sa207). Further, the display control unit 20 displays the waveform W represented by the current waveform data Dw stored in the work area in the edit area Ew (Sa208). In the second editing process Sa2, the note string N in the editing area En is not updated.
  • FIG. 6 is a flowchart of the third editing process Sa3 relating to the editing of the waveform W.
  • the third editing process Sa3 is started with the editing instruction Qw for the waveform W as a trigger.
  • the third editing unit 35 edits the current waveform data Dw according to the editing instruction Qw (Sa301).
  • the information management unit 40 increases the waveform version number Vw by "1" (Sa302). Further, the information management unit 40 maintains the note sequence version number Vn at the current value Cn (Sa303), and also maintains the feature sequence version number Vf at the current value Cf (Sa304). Then, the information management unit 40 saves the third history data Hw [Vn, Vf, Vw] representing the editing instruction Qw this time in the history area as new version data (Sa305).
  • step Sa303 and step Sa304 may be omitted.
  • the display control unit 20 displays the waveform W represented by the waveform data Dw edited by the third editing unit 35 in the editing area Ew (Sa306).
  • the note string N in the editing area En and the feature string F in the editing area Ef are not updated.
  • FIG. 7 is an explanatory diagram of the data structure in the history area of the storage device 12.
  • A plurality of third history data Hw[Vn, Vf, Vw] corresponding to different versions of the waveform W under a common feature sequence F are stored in the history area.
  • the hierarchical relationship is established in which the note sequence N is located above the feature sequence F and the feature sequence F is located above the waveform W.
  • When the feature column F is edited, the feature column version number Vf is increased, the waveform version number Vw corresponding to the lower layer is initialized to "0", and the note string version number Vn corresponding to the upper layer is maintained.
  • FIG. 8 to 10 are flowcharts illustrating a specific procedure of the management process Sb (Sb1, Sb2 and Sb3) that manages the version according to the instruction from the user.
  • FIG. 8 is a flowchart of the first management process Sb1 regarding the version of the note string N.
  • the first management process Sb1 is started with the instruction to change the note string version number Vn.
  • The numerical value of the note string version number Vn after the change in response to the instruction from the user (that is, the numerical value specified by the user) is hereinafter referred to as the "set value Xn".
  • the information management unit 40 changes the note string version number Vn from the current value Cn to the set value Xn (Sb101).
  • the information management unit 40 sets the feature column version number Vf to the latest value Yf corresponding to the set value Xn of the note string N (Sb102).
  • the latest value Yf is the number of the latest version among the plurality of versions of the feature string F generated for each edit instruction Qf under the note string N of the version corresponding to the set value Xn.
  • the information management unit 40 sets the waveform version number Vw to the latest value Yw corresponding to the set value Xn of the note string N (Sb103).
  • the latest value Yw is the number of the latest version among a plurality of versions of the waveform W generated for each edit instruction Qw under the note string N of the version corresponding to the set value Xn.
  • The second history data is data representing the time series of the edit instructions Qf up to the Yf-th among the one or more edit instructions Qf sequentially given by the user under the note string N whose note sequence version number Vn is the set value Xn.
  • Similarly, the third history data is data representing the time series of the edit instructions Qw up to the Yw-th.
  • The second editing unit 33 sequentially edits the feature column data Df according to the editing instructions Qf represented by the second history data (Sb106).
  • As a result, the feature sequence data Df edited according to the edit instructions Qf up to the Yf-th is generated under the note sequence N corresponding to the set value Xn.
  • The editing by the second editing unit 33 typically covers only a small part of the feature sequence data Df spanning a plurality of notes. For example, only a very small part of the whole piece of music, such as the attack portion of a specific note or the first two notes of the third phrase, is edited.
  • The waveform data Dw is generated by supplying the input data Din to the second generation model M2 (Sb107).
  • The third editing unit 35 sequentially edits the waveform data Dw according to the editing instructions Qw represented by the third history data (Sb108).
  • That is, the waveform data Dw edited according to the edit instructions Qw up to the Yw-th is generated under the note string N corresponding to the set value Xn and the feature string F corresponding to the latest value Yf.
  • the waveform data Dw is not edited in step Sb108, and the waveform data Dw is determined as final data.
  • The display control unit 20 displays the feature column F represented by the feature column data Df edited by the second editing unit 33 in the editing area Ef, and updates the display of the feature column version number Vf in the operation area Gf to the latest value Yf (Sb110). That is, the feature column F corresponding to the set value Xn and the latest value Yf is displayed in the editing area Ef. Similarly, the display control unit 20 displays the waveform W represented by the waveform data Dw edited by the third editing unit 35 in the editing area Ew, and updates the display of the waveform version number Vw in the operation area Gw to the latest value Yw (Sb111).
  • That is, the waveform W corresponding to the set value Xn, the latest value Yf, and the latest value Yw is displayed in the editing area Ew.
  • the user can give an editing instruction (Qn, Qf or Qw) for each of the note sequence N, the feature sequence F and the waveform W.
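  • A hedged Python sketch of the first management process Sb1 follows; the key layout of the history dictionaries matches the illustrative sketch given earlier, and the replay of stored edit instructions via apply_edits is an assumption based on the description above:

    def first_management_process_sb1(xn, history_hn, history_hf, history_hw,
                                     generate_m1, generate_m2, apply_edits):
        """Sketch of Sb1: the user changes the note string version number Vn to Xn."""
        vn = xn                                                 # Sb101

        # Sb102 / Sb103: latest feature and waveform versions saved under the
        # note string of version Xn.
        yf = max((k[1] for k in history_hf if k[0] == xn), default=0)
        yw = max((k[2] for k in history_hw if k[0] == xn and k[1] == yf), default=0)

        dn = history_hn[(xn, 0, 0)]                             # restore Dn of version Xn
        df = generate_m1(dn)                                    # regenerate Df with model M1
        for vf in range(1, yf + 1):                             # Sb106: replay edit instructions Qf
            df = apply_edits(df, history_hf[(xn, vf, 0)])

        dw = generate_m2(dn, df)                                # Sb107: regenerate Dw with model M2
        for vw in range(1, yw + 1):                             # Sb108: replay edit instructions Qw
            dw = apply_edits(dw, history_hw[(xn, yf, vw)])

        # Sb110 and Sb111 would then update the editing areas and the displayed versions.
        return vn, yf, yw, dn, df, dw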
  • FIG. 9 is a flowchart of the second management process Sb2 regarding the version of the feature column F.
  • the second management process Sb2 is started with the instruction to change the feature column version number Vf.
  • The numerical value of the feature column version number Vf after the change in response to the instruction from the user (that is, the numerical value specified by the user) is hereinafter referred to as the "set value Xf".
  • the information management unit 40 changes the feature column version number Vf from the current value Cf to the set value Xf (Sb201). Further, the information management unit 40 maintains the note string version number Vn at the current value Cn (Sb202), and changes the waveform version number Vw from the current value Cw to the latest value Yw (Sb203).
  • the latest value Yw of the waveform version number Vw is the number of the latest version among the plurality of versions of the waveform W generated for each edit instruction Qw under the feature column F of the version corresponding to the set value Xf.
  • The third history data is data representing the time series of the edit instructions Qw up to the Yw-th.
  • The second editing unit 33 sequentially edits the feature column data Df according to the editing instructions Qf represented by the second history data (Sb206). That is, the feature sequence data Df edited according to the edit instructions Qf up to the Xf-th is generated under the note sequence N corresponding to the current value Cn.
  • The waveform data Dw is generated by supplying the input data Din to the second generation model M2 (Sb207).
  • The third editing unit 35 sequentially edits the waveform data Dw according to the editing instructions Qw represented by the third history data (Sb208). That is, the waveform data Dw edited according to the edit instructions Qw up to the Yw-th is generated under the note string N corresponding to the current value Cn and the feature string F corresponding to the set value Xf.
  • The display control unit 20 updates the edit screen G (Sb209 and Sb210). Specifically, the display control unit 20 displays the feature column F represented by the feature column data Df edited by the second editing unit 33 in the editing area Ef, and updates the display of the feature column version number Vf in the operation area Gf to the set value Xf (Sb209). That is, the feature column F corresponding to the current value Cn and the set value Xf is displayed in the editing area Ef. Further, the display control unit 20 displays the waveform W represented by the waveform data Dw edited by the third editing unit 35 in the editing area Ew, and updates the display of the waveform version number Vw in the operation area Gw to the latest value Yw (Sb210).
  • the waveform W corresponding to the current value Cn, the set value Xf, and the latest value Yw is displayed in the editing area Ew.
  • the user can give an editing instruction (Qn, Qf or Qw) for each of the note sequence N, the feature sequence F and the waveform W.
  • FIG. 10 is a flowchart of the third management process Sb3 regarding the version of the waveform W.
  • the third management process Sb3 is started with the instruction to change the waveform version number Vw.
  • The numerical value of the waveform version number Vw after the change in response to the instruction from the user (that is, the numerical value specified by the user) is hereinafter referred to as the "set value Xw".
  • the information management unit 40 changes the waveform version number Vw from the current value Cw to the set value Xw (Sb301). Further, the information management unit 40 maintains the note sequence version number Vn at the current value Cn (Sb302) and the feature sequence version number Vf at the current value Cf (Sb303).
  • The second history data is data representing the time series of the edit instructions Qf up to the Cf-th among the one or more edit instructions Qf sequentially given by the user under the note string N whose note sequence version number Vn is the current value Cn.
  • The second editing unit 33 sequentially edits the feature column data Df according to the editing instructions Qf represented by the second history data (Sb306). That is, the feature sequence data Df edited according to the edit instructions Qf up to the Cf-th is generated under the note sequence N corresponding to the current value Cn.
  • The waveform data Dw is generated by supplying the input data Din to the second generation model M2 (Sb307).
  • The third editing unit 35 sequentially edits the waveform data Dw according to the editing instructions Qw represented by the third history data (Sb308). That is, the waveform data Dw edited according to the edit instructions Qw up to the Xw-th is generated under the note string N corresponding to the current value Cn and the feature string F corresponding to the current value Cf.
  • The display control unit 20 updates the edit screen G (Sb309). Specifically, the display control unit 20 displays the waveform W represented by the waveform data Dw edited by the third editing unit 35 in the editing area Ew, and updates the display of the waveform version number Vw in the operation area Gw to the set value Xw. That is, the waveform W corresponding to the current value Cn, the current value Cf, and the set value Xw is displayed in the editing area Ew.
  • the note sequence data Dn and the feature sequence data Df are edited according to the instructions (editing instruction Qn and editing instruction Qf) from the user. Therefore, it is possible to generate waveform data Dw that precisely reflects the instruction from the user, as compared with the configuration in which only the note string data Dn is edited in response to the instruction from the user.
  • When the note string data Dn is edited, the note string version number Vn is increased and the numerical value of the feature string version number Vf is initialized; when the feature string data Df is edited, the numerical value of the feature column version number Vf is increased while the numerical value of the note string version number Vn is maintained.
  • Then, the waveform data Dw is generated using at least one of the first history data Hn[Vn, Vf, Vw] corresponding to the set value Xn designated by the user among the plurality of numerical values of the note string version number Vn, and the second history data Hf[Vn, Vf, Vw] corresponding to the set value Xf designated by the user among the plurality of numerical values of the feature column version number Vf.
  • Therefore, the user can instruct the editing of the note sequence data Dn and the feature sequence data Df while generating the waveform data Dw by trial and error for different combinations of the note sequence version number Vn and the feature sequence version number Vf.
  • FIG. 11 is a schematic diagram of the editing screen G in the second embodiment.
  • In the second embodiment, an operation image B2 is added to the editing screen G, which otherwise includes the same elements as in the first embodiment.
  • the operation image B2 is an image (specifically, a pull-down menu) for the user to select the pronunciation style of the synthetic sound.
  • the user can select a desired pronunciation style from a plurality of pronunciation styles by operating the operation device 15.
  • A pronunciation style means a feature relating to the manner of pronunciation.
  • For instrument sounds, the pronunciation style is a characteristic of how the musical instrument is played.
  • For singing sounds, the pronunciation style is a feature of how the music is sung (the singing phrasing).
  • A pronunciation method suitable for each music genre, such as pop, rock, or rap, is exemplified as a pronunciation style.
  • A musical expression of playing or singing, such as bright, quiet, or violent, is also exemplified as a pronunciation style.
  • FIG. 12 is a block diagram illustrating a functional configuration of the control device 11 in the second embodiment.
  • the pronunciation style s selected by the user in the operation on the operation image B2 is instructed to the first generation unit 32 and the second generation unit 34 of the second embodiment.
  • the first generation unit 32 generates the feature sequence data Df from the note sequence data Dn and the pronunciation style s.
  • the feature sequence data Df is time series data representing a time series of feature quantities (for example, fundamental frequency) related to a synthetic sound obtained by reproducing the note sequence N represented by the note sequence data Dn in the pronunciation style s.
  • the first generation unit 32 generates the feature column data Df using the first generation model M1.
  • the first generative model M1 is a statistical inference model that outputs feature sequence data Df by inputting note sequence data Dn and pronunciation style s. Similar to the first embodiment, the first generative model M1 is composed of a deep neural network having an arbitrary structure such as a convolutional neural network or a recurrent neural network.
  • In the second embodiment, the first generation model M1 is realized by a combination of a program that causes the control device 11 to execute an operation for generating the feature sequence data Df from the note string data Dn and the pronunciation style s, and a plurality of variables applied to the operation.
  • a plurality of variables defining the first generation model M1 are set in advance by machine learning using a plurality of first training data and stored in the storage device 12.
  • Each of the plurality of first training data includes the note sequence data Dn, the set of pronunciation styles s, and the feature sequence data Df (correct answer value).
  • In the machine learning, the plurality of variables of the first generation model M1 are updated iteratively so as to reduce the error between the feature sequence data Df output by the provisional first generation model M1 for the note sequence data Dn and the pronunciation style s of each piece of first training data and the feature sequence data Df (correct answer) of that first training data.
  • Therefore, the first generative model M1 outputs statistically valid feature sequence data Df for an unknown combination of note sequence data Dn and pronunciation style s, under the tendency latent in the plurality of first training data.
  • the second generation unit 34 generates waveform data Dw from the note sequence data Dn, the feature sequence data Df, and the pronunciation style s.
  • The waveform data Dw is time-series data representing the waveform W of the synthetic sound obtained by pronouncing the note sequence N represented by the note sequence data Dn in the pronunciation style s.
  • the second generation unit 34 generates the waveform data Dw using the second generation model M2.
  • the second generative model M2 is a statistical inference model that outputs waveform data Dw by inputting note sequence data Dn, feature sequence data Df, and pronunciation style s. Similar to the first embodiment, the second generative model M2 is composed of a deep neural network having an arbitrary structure such as a convolutional neural network or a recurrent neural network. Specifically, the second generation model M2 is applied to a program that causes the control device 11 to execute an operation of generating waveform data Dw from the note string data Dn, the feature sequence data Df, and the pronunciation style s, and the operation. It is realized by combining with multiple variables.
  • a plurality of variables defining the second generation model M2 are set in advance by machine learning using a plurality of second training data and stored in the storage device 12.
  • Each of the plurality of second training data includes a set of the note sequence data Dn, the feature sequence data Df, and the pronunciation style s, and the waveform data Dw (correct answer value).
  • In the machine learning, the plurality of variables of the second generation model M2 are iteratively updated so as to reduce the error between the waveform data Dw output by the tentative second generative model M2 for the note sequence data Dn, the feature sequence data Df, and the pronunciation style s of each piece of second training data and the waveform data Dw (correct answer) of that second training data. Therefore, the second generative model M2 outputs statistically valid waveform data Dw for an unknown combination of the note sequence data Dn, the feature sequence data Df, and the pronunciation style s, under the tendency latent in the plurality of second training data.
  • In step Sa201 of the second editing process Sa2, the second editing unit 33 edits the feature sequence data Df, representing the feature sequence F of the synthetic sound in which the note sequence N is pronounced in the pronunciation style s selected by the user, according to the edit instruction Qf. Further, in step Sa205 of the second editing process Sa2, the information management unit 40 saves the second history data Hf[Vn, Vf, Vw] corresponding to the edited feature column data Df in the history area of the storage device 12 for each version of the feature column data Df.
  • the feature sequence data Df corresponding to the pronunciation style s and the waveform data Dw corresponding to the pronunciation style s are generated under the specific note sequence N.
  • The note sequence N is not affected by the pronunciation style s. Therefore, as illustrated in FIG. 13, for the first history data Hn[Vn, Vf, Vw] (note string data Dn) corresponding to one note sequence N, a plurality of second history data Hf[Vn, Vf, Vw] corresponding to feature sequences F that differ for each pronunciation style s and a plurality of third history data Hw[Vn, Vf, Vw] corresponding to different waveforms W are stored in the history area of the storage device 12.
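  • Purely for illustration (the key layout is an assumption, not the disclosed storage format), the per-style histories of the second embodiment could be organised as follows, with the note string history shared across pronunciation styles:

    # One shared note string history, plus feature/waveform histories per style s.
    history = {
        "Hn": {},   # key: (Vn, Vf, Vw)          -> note string data Dn
        "Hf": {},   # key: (style, Vn, Vf, Vw)   -> edit instructions Qf
        "Hw": {},   # key: (style, Vn, Vf, Vw)   -> edit instructions Qw
    }

    history["Hn"][(1, 0, 0)] = {"notes": ["C4", "E4", "G4"]}

    # Edits made while style s1 is selected (first time-series data, first instruction):
    history["Hf"][("s1", 1, 1, 0)] = ["raise F0 at the attack of note 2"]
    history["Hw"][("s1", 1, 1, 1)] = ["shorten the release of note 3"]

    # Edits made while style s2 is selected (second time-series data, second instruction):
    history["Hf"][("s2", 1, 1, 0)] = ["add vibrato to note 1"]

    # Both style histories hang off the same note string version (1, 0, 0), so the
    # user can switch styles and recall either editing history by trial and error.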
  • The feature sequence data Df representing the feature sequence F of the synthetic sound that pronounces the note sequence N in the pronunciation style s is generated by the first processing unit (Sa106), and the waveform data Dw representing the waveform W of the synthetic sound is generated by the second processing unit (Sa107).
  • the second editing unit 33 edits the feature sequence data Df according to the pronunciation style s according to the editing instruction Qf from the user.
  • The information management unit 40 saves the second history data Hf[Vn, Vf, Vw] corresponding to the edited feature column data Df in the history area for each edit of the feature column data Df (that is, for each version of the feature column data Df).
  • the third editing unit 35 edits the waveform data Dw according to the pronunciation style s according to the editing instruction Qw from the user.
  • The information management unit 40 saves the third history data Hw[Vn, Vf, Vw] corresponding to the edited waveform data Dw in the history area for each edit of the waveform data Dw (that is, for each version of the waveform data Dw).
  • In the state where the pronunciation style s is selected, the first management process Sb1 is started in response to an instruction to change the note string version number Vn.
  • Similarly, in the state where the pronunciation style s is selected, the second management process Sb2 is started in response to an instruction to change the feature column version number Vf.
  • In the first management process Sb1, the "feature column F corresponding to the pronunciation style s" is the feature column F corresponding to the note sequence version number Vn (set value Xn), the pronunciation style s, and the feature sequence version number Vf (latest value Yf).
  • The "waveform W corresponding to the pronunciation style s" is the waveform W corresponding to the note string version number Vn (set value Xn), the pronunciation style s, the feature string version number Vf (latest value Yf), and the waveform version number Vw (latest value Yw).
  • the feature sequence data Df of the feature sequence F corresponding to the pronunciation style s and the waveform data Dw of the waveform W corresponding to the pronunciation style s are generated.
  • In the second management process Sb2, the "feature sequence F corresponding to the pronunciation style s" is the feature sequence F corresponding to the note sequence version number Vn (current value Cn), the pronunciation style s, and the feature sequence version number Vf (set value Xf).
  • The "waveform W corresponding to the pronunciation style s" is the waveform W corresponding to the note string version number Vn (current value Cn), the pronunciation style s, the feature string version number Vf (set value Xf), and the waveform version number Vw (latest value Yw).
  • the third management process Sb3 is started with the instruction to change the waveform version number Vw.
  • the feature sequence data Df of the feature sequence F corresponding to the pronunciation style s and the waveform data Dw of the waveform W corresponding to the pronunciation style s are generated.
  • In the third management process Sb3, the "feature sequence F corresponding to the pronunciation style s" is the feature sequence F corresponding to the note sequence version number Vn (current value Cn), the pronunciation style s, and the feature sequence version number Vf (current value Cf).
  • The "waveform W corresponding to the pronunciation style s" is, specifically, the waveform W corresponding to the note string version number Vn (current value Cn), the pronunciation style s, the feature string version number Vf (current value Cf), and the waveform version number Vw (set value Xw).
  • the pronunciation style s1 and the pronunciation style s2 are different pronunciation styles s.
  • the pronunciation style s1 is an example of the "first pronunciation style”
  • the pronunciation style s2 is an example of the "second pronunciation style”.
  • the second editing unit 33 edits the feature sequence data Df corresponding to the pronunciation style s1 according to the editing instruction Qf from the user. Then, each time the feature column data Df is edited, the information management unit 40 saves the second history data Hf [Vn, Vf, Vw] corresponding to the edited feature column data Df in the history area. Similarly, in the third editing process Sa3, the third editing unit 35 edits the waveform data Dw according to the pronunciation style s1 according to the editing instruction Qw from the user.
  • the information management unit 40 saves the third history data Hw [Vn, Vf, Vw] corresponding to the edited waveform data Dw in the history area.
  • the feature sequence data Df or waveform data Dw generated when the pronunciation style s1 is selected is an example of "first time series data”.
  • the editing instruction Qf or the editing instruction Qw given by the user with the pronunciation style s1 selected is an example of the "first instruction”.
  • the feature sequence data Df of the feature sequence F corresponding to the pronunciation style s1 and the waveform data Dw of the waveform W corresponding to the pronunciation style s1 are generated. That is, the feature sequence data Df and the waveform data Dw corresponding to the history data H that corresponds to the instruction (Xn, Xf, or Xw) from the user, among the plurality of history data H (Hn, Hf, and Hw) corresponding to the pronunciation style s1, are generated.
  • in the second editing process Sa2, the second editing unit 33 edits the feature sequence data Df corresponding to the pronunciation style s2 according to the editing instruction Qf from the user. Then, each time the feature sequence data Df is edited, the information management unit 40 saves the second history data Hf[Vn, Vf, Vw] corresponding to the edited feature sequence data Df in the history area. Similarly, in the third editing process Sa3, the third editing unit 35 edits the waveform data Dw corresponding to the pronunciation style s2 according to the editing instruction Qw from the user.
  • the information management unit 40 saves the third history data Hw [Vn, Vf, Vw] corresponding to the edited waveform data Dw in the history area.
  • the feature sequence data Df or waveform data Dw generated when the pronunciation style s2 is selected is an example of "second time series data”.
  • the editing instruction Qf or the editing instruction Qw given by the user with the pronunciation style s2 selected is an example of the "second instruction”.
  • the feature sequence data Df of the feature sequence F corresponding to the pronunciation style s2 and the waveform data Dw of the waveform W corresponding to the pronunciation style s2 are generated. That is, the feature sequence data Df and the waveform data Dw corresponding to the history data H that corresponds to the instruction (Xn, Xf, or Xw) from the user, among the plurality of history data H (Hn, Hf, and Hw) corresponding to the pronunciation style s2, are generated.
  • the editing processing unit 30 in the second embodiment acquires the feature sequence data Df and the waveform data Dw corresponding to the pronunciation style s1, or the feature sequence data Df and the waveform data Dw corresponding to the pronunciation style s2, according to a common version of the note sequence data Dn.
  • the editing history of the feature sequence data Df and the waveform data Dw corresponding to the pronunciation style s1 is stored in the storage device 12, and the editing history of the feature sequence data Df and the waveform data Dw corresponding to the pronunciation style s2 is likewise stored in the storage device 12. Therefore, editing of the feature sequence data Df or the waveform data Dw corresponding to the pronunciation style s1, and editing of the feature sequence data Df or the waveform data Dw corresponding to the pronunciation style s2, can each be performed by trial and error according to instructions from the user (a minimal sketch of such a per-style history follows below).
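  • a minimal sketch, assuming a simple per-style list of versions, of how the histories for the pronunciation styles s1 and s2 can be kept independent while sharing one note sequence version; StyleHistories and its methods are illustrative names, not the actual implementation.

      # Hypothetical sketch: one edit history per pronunciation style, both
      # anchored to a common version of the note sequence data Dn.  Editing
      # under style s1 never disturbs the history kept for style s2.
      from collections import defaultdict

      class StyleHistories:
          def __init__(self):
              self.histories = defaultdict(list)    # style -> [edited data ...]

          def record_edit(self, style, edited_data):
              self.histories[style].append(edited_data)   # new version per edit
              return len(self.histories[style])           # version number

          def version(self, style, number):
              return self.histories[style][number - 1]

      note_version_vn = 3                                  # shared by both styles
      h = StyleHistories()
      h.record_edit("s1", {"Vn": note_version_vn, "Df": "pitch curve A"})
      h.record_edit("s2", {"Vn": note_version_vn, "Df": "pitch curve B"})
      h.record_edit("s1", {"Vn": note_version_vn, "Df": "pitch curve A'"})
      assert h.version("s2", 1)["Df"] == "pitch curve B"   # s2 history untouched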
  • the display control unit 20 causes the display device 14 to display the comparison screen U of FIG.
  • the comparison screen U includes a first region U1, an operation image U1a (call), an operation image U1b (reproduction), a second region U2, an operation image U2a (call), and an operation image U2b (reproduction).
  • the hierarchical relationship among the first history data Hn[Vn, Vf, Vw], the second history data Hf[Vn, Vf, Vw], and the third history data Hw[Vn, Vf, Vw] is displayed.
  • the user can select desired historical data H for each of the first region U1 and the second region U2.
  • the user selects the desired history data H for each of the first region U1 and the second region U2 by designating the pronunciation style s and each version number (Vn, Vf, Vw).
  • the control device 11 uses the history data H acquired from the history area to generate the feature sequence data Df of the feature sequence F and the waveform data Dw of the waveform W corresponding to the pronunciation style s and the version numbers (Vn, Vf, Vw), and the display screen G including these is displayed on the display device 14.
  • the control device 11 supplies the sound emitting device 13 with the acoustic signal Z corresponding to the waveform data Dw generated in the above procedure for the first region U1. By doing so, the synthetic sound is reproduced.
  • the control device 11 acquires the history data H selected for the second region U2 from the storage device 12, and the screen G corresponding to that history data H is displayed on the display device 14.
  • the control device 11 generates, by the same procedure as described above for the first region U1, the feature sequence data Df and the waveform data Dw corresponding to the pronunciation style s and each version number (Vn, Vf, Vw) specified by the user for the second region U2, and the display screen G including these is displayed on the display device 14.
  • the control device 11 supplies the sound emitting device 13 with the acoustic signal Z corresponding to the waveform data Dw generated in the above procedure for the second region U2. By doing so, the synthetic sound is reproduced.
  • the user can adjust the note sequence N, the feature sequence F, the waveform W, and the pronunciation style s while mutually comparing the combination of version and pronunciation style s selected for the first region U1 with the combination of version and pronunciation style s selected for the second region U2 (a minimal sketch of such a two-region comparison follows below).
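  • a minimal sketch, assuming a toy dictionary keyed by (s, Vn, Vf, Vw), of resolving the two selections of the comparison screen U to their saved data; the names below are assumptions for illustration.

      # Hypothetical sketch of the comparison screen U: two independent
      # selections, each resolved to its saved history data and then
      # displayed / reproduced side by side.
      def build_region(history_area, selection):
          """Resolve one (style, Vn, Vf, Vw) selection to its saved data."""
          return {"selection": selection, "data": history_area.get(selection)}

      # toy history area: {(s, Vn, Vf, Vw): data}
      history_area = {
          ("s1", 1, 2, 1): "Df/Dw for s1, versions (1,2,1)",
          ("s2", 1, 1, 1): "Df/Dw for s2, versions (1,1,1)",
      }
      region_u1 = build_region(history_area, ("s1", 1, 2, 1))   # first region U1
      region_u2 = build_region(history_area, ("s2", 1, 1, 1))   # second region U2
      for region in (region_u1, region_u2):
          print(region["selection"], "->", region["data"])      # display / play back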
  • FIG. 15 is an explanatory diagram of the synthetic sound in the third embodiment.
  • the synthetic sound of the third embodiment is composed of a plurality of tracks T (T1, T2, ...) parallel to each other on the time axis.
  • each performance part corresponds to the track T.
  • when a singing sound composed of a plurality of singing parts is used as the synthetic sound, each singing part corresponds to a track T.
  • Each of the plurality of tracks T includes a plurality of sections (hereinafter referred to as "unit intervals") R that do not overlap each other on the time axis.
  • Each of the plurality of unit intervals R is an interval (region) including the note string N on the time axis. That is, a unit interval R is set for each note sequence N, with a set of a plurality of notes that are close to each other on the time axis as a note sequence N.
  • the time length of each unit interval R is a variable length according to the total number of notes in the note sequence N, the continuation length of each note, and the like.
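  • a hypothetical data-structure sketch (class and field names are assumptions, not the claimed implementation) of tracks T holding variable-length, non-overlapping unit intervals R, each containing one note sequence N.

      # Hypothetical sketch of the third embodiment's layout: parallel tracks T,
      # each with non-overlapping, variable-length unit intervals R.
      from dataclasses import dataclass, field
      from typing import List

      @dataclass
      class Note:
          pitch: int          # e.g. MIDI note number
          onset: float        # position on the time axis, in seconds
          duration: float     # continuation length

      @dataclass
      class UnitInterval:     # R: one note sequence N
          start: float
          end: float          # variable length, depends on the notes inside
          notes: List[Note] = field(default_factory=list)

      @dataclass
      class Track:            # T: one performance part / singing part
          name: str
          intervals: List[UnitInterval] = field(default_factory=list)

      track_t1 = Track("lead vocal", [
          UnitInterval(0.0, 4.5, [Note(60, 0.0, 1.5), Note(62, 1.5, 3.0)]),
          UnitInterval(8.0, 10.0, [Note(64, 8.0, 2.0)]),
      ])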
  • FIG. 16 is a schematic diagram of the editing screen G in the third embodiment.
  • information (note sequence N, feature sequence F, or waveform W) on one unit interval R selected by the user from among the plurality of unit intervals R of one track T, itself selected by the user from among the plurality of tracks T of the synthetic sound, is displayed on the editing screen G.
  • the operation area Gt and the operation area Gr are added to the same elements as those of the first embodiment.
  • the operation area Gt is an area related to the track T of the synthetic sound. Specifically, the track version number Vt, the operation image Gt1 and the operation image Gt2 are displayed in the operation area Gt.
  • the track version number Vt is a number representing the version of the track T displayed on the edit screen G.
  • the track version number Vt is incremented by 1 each time the information (note sequence N, feature sequence F, or waveform W) about the track T displayed on the editing screen G is edited. Further, the user can change the track version number Vt in the operation area Gt to an arbitrary numerical value by operating the operation device 15.
  • the operation image Gt1 and the operation image Gt2 are software buttons that can be operated by the user using the operation device 15.
  • the operation image Gt1 is an operator with which the user instructs that the information (note sequence N, feature sequence F, or waveform W) relating to the track T be returned to the state before the execution of the immediately preceding edit (Undo).
  • the operation image Gt2 is an operator with which the user instructs that the edit canceled by operating the operation image Gt1 be executed again (Redo).
  • the operation area Gr is an area related to the unit interval R of the synthetic sound. Specifically, the section version number Vr, the operation image Gr1 and the operation image Gr2 are displayed in the operation area Gr.
  • the section version number Vr is a number representing the version of the unit section R displayed on the edit screen G.
  • the section version number Vr is incremented by 1 each time the information (note sequence N, feature sequence F, or waveform W) regarding the unit interval R displayed on the editing screen G is edited. Further, the user can change the section version number Vr in the operation area Gr to an arbitrary numerical value by operating the operation device 15.
  • the operation image Gr1 and the operation image Gr2 are software buttons that can be operated by the user using the operation device 15.
  • the operation image Gr1 is an operator with which the user instructs that the information (note sequence N, feature sequence F, or waveform W) regarding the unit interval R be returned to the state before the execution of the immediately preceding edit (Undo).
  • the operation image Gr2 is an operator with which the user instructs that the edit canceled by operating the operation image Gr1 be executed again (Redo).
  • the editing process Sa (Sa1-Sa3) or the management process Sb (Sb1-Sb3) is executed for each of the plurality of unit intervals R in one track T displayed on the editing screen G.
  • the information management unit 40 increases the track version number Vt and the section version number Vr by one. Further, when the user operates the operation image (Gn1, Gf1, Gw1, Gn2, Gf2 or Gw2), the information management unit 40 similarly increases the track version number Vt and the section version number Vr by one.
  • the user can instruct editing of each of the note sequence data Dn, the feature sequence data Df, and the waveform data Dw for each of the plurality of unit intervals R on the time axis, while generating the waveform data Dw by trial and error (a minimal sketch of version numbering under Undo/Redo follows below).
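  • a minimal sketch, under the assumption of one snapshot per version, of Undo/Redo on a unit interval R in which every edit and every Undo/Redo operation increments both the section version number Vr and the track version number Vt; the class and method names are illustrative only.

      # Hypothetical sketch: edits, Undo and Redo on one unit interval R all
      # bump the section version number Vr and the track version number Vt.
      class IntervalEditor:
          def __init__(self):
              self.snapshots = [""]      # version history of one unit interval R
              self.position = 0          # currently displayed version
              self.vr = 0                # section version number Vr
              self.vt = 0                # track version number Vt

          def _bump(self):
              self.vr += 1
              self.vt += 1

          def edit(self, new_state):
              self.snapshots = self.snapshots[: self.position + 1] + [new_state]
              self.position += 1
              self._bump()

          def undo(self):                # operation image Gr1 / Gt1
              if self.position > 0:
                  self.position -= 1
                  self._bump()
              return self.snapshots[self.position]

          def redo(self):                # operation image Gr2 / Gt2
              if self.position + 1 < len(self.snapshots):
                  self.position += 1
                  self._bump()
              return self.snapshots[self.position]

      editor = IntervalEditor()
      editor.edit("note added")
      editor.edit("pitch curve redrawn")
      assert editor.undo() == "note added"
      assert editor.redo() == "pitch curve redrawn"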
  • the note sequence data Dn of each version is stored in the history area as the first history data Hn[Vn, Vf, Vw], but the matters represented by the first history data Hn[Vn, Vf, Vw] and the format of the first history data Hn[Vn, Vf, Vw] are not limited to the above examples.
  • the first history data Hn [Vn, Vf, Vw] indicating how the note string data Dn is edited may be saved.
  • the first history data Hn [Vn, Vf, Vw] is comprehensively expressed as data corresponding to the edited note sequence N.
  • the second history data Hf[Vn, Vf, Vw] indicating how the feature sequence data Df is edited (that is, the time series of the editing instruction Qf) is stored in the history area.
  • the matters represented by the second history data Hf[Vn, Vf, Vw] and the format of the second history data Hf[Vn, Vf, Vw] are not limited to the above examples.
  • the feature sequence data Df after editing according to the editing instruction Qf may be saved in the history area as the second history data Hf[Vn, Vf, Vw].
  • the second history data Hf[Vn, Vf, Vw] is comprehensively represented as data corresponding to the edited feature sequence data Df.
  • the third history data Hw [Vn, Vf, Vw] indicating how the waveform data Dw is edited (that is, the time series of the edit instruction Qw) is saved in the history area.
  • the matters represented by the third history data Hw [Vn, Vf, Vw] and the format of the third history data Hw [Vn, Vf, Vw] are not limited to the above examples.
  • the waveform data Dw after editing according to the editing instruction Qw may be saved in the history area as the third history data Hw [Vn, Vf, Vw].
  • the third history data Hw [Vn, Vf, Vw] is comprehensively expressed as data corresponding to the edited waveform data Dw.
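  • a minimal sketch of the two history-data formats described above: saving the edited data itself (one snapshot per version) or saving the way the data was edited (a log of edit instructions replayed to rebuild a version); the function names are assumptions for illustration.

      # Hypothetical sketch: history data H stored either as a snapshot of the
      # edited data or as the time series of edit instructions that produced it.
      def save_snapshot(history, version, edited_data):
          history[version] = edited_data                 # e.g. edited note sequence Dn

      def save_instruction_log(history, version, instructions):
          history[version] = list(instructions)          # e.g. edit instructions Qw

      def rebuild_from_log(initial_state, log, apply_instruction):
          """Rebuild one version by replaying its edit instructions in order."""
          state = initial_state
          for instruction in log:
              state = apply_instruction(state, instruction)
          return state

      # toy example with an "append a sample" instruction
      log = [1, 2, 3]
      rebuilt = rebuild_from_log([], log, lambda state, x: state + [x])
      assert rebuilt == [1, 2, 3]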
  • the feature sequence F having the fundamental frequency of the synthesized sound as the feature quantity is illustrated, but the feature quantity represented by the feature sequence data Df is not limited to the fundamental frequency.
  • for example, time-series data representing a time series (feature sequence F) of the frequency spectrum of the synthesized sound in the frequency domain (for example, the intensity spectrum), or of the sound pressure level on the time axis, may be used as the feature sequence data Df.
  • the feature sequence data Df is comprehensively represented as time-series data representing a time series (feature sequence F) of a feature amount of the sound corresponding to the note sequence data Dn.
  • in each of the above-described forms, the second generation unit 34 generates the waveform data Dw from the note sequence data Dn and the feature sequence data Df, but the second generation unit 34 may generate the waveform data Dw from the note sequence data Dn alone, or from the feature sequence data Df alone. That is, the second generation unit 34 is comprehensively specified as an element that generates the waveform data Dw from at least one of the note sequence data Dn and the feature sequence data Df.
  • the first generation model M1 that outputs the feature sequence data Df for an input including the pronunciation style s is exemplified above, but the configuration with which the first generation unit 32 generates the feature sequence data Df corresponding to the pronunciation style s is not limited to this example.
  • the feature sequence data Df may be generated by selectively using a plurality of first generation models M1 corresponding to different pronunciation styles s.
  • the first generation model M1 corresponding to each pronunciation style s is constructed by machine learning using a plurality of first training data prepared for the pronunciation style s.
  • the first generation unit 32 generates the feature sequence data Df by inputting the note sequence data Dn into the first generation model M1 corresponding to the pronunciation style s selected by the user from among the plurality of first generation models M1.
  • the second generation model M2 that outputs the waveform data Dw for an input including the pronunciation style s is exemplified above, but the configuration with which the second generation unit 34 generates the waveform data Dw corresponding to the pronunciation style s is not limited to this example.
  • the waveform data Dw may be generated by selectively using a plurality of second generation models M2 corresponding to different pronunciation styles s.
  • the second generative model M2 corresponding to each pronunciation style s is constructed by machine learning using a plurality of second training data prepared for the pronunciation style s.
  • the second generation unit 34 inputs the note sequence data Dn and the feature sequence data Df (input data Din) to the second generation model M2 corresponding to the pronunciation style s selected by the user from among the plurality of second generation models M2, and the waveform data Dw is thereby generated (see the sketch below).
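  • a hypothetical sketch of the per-style model selection described above, illustrated here for the first generation model M1 (the second generation model M2 would be selected analogously); FirstGenerationModel and generate are assumed names, and the toy model only stands in for a trained network.

      # Hypothetical sketch: one model per pronunciation style, selected at
      # generation time according to the style chosen by the user.
      class FirstGenerationModel:                 # stands in for a trained M1
          def __init__(self, style):
              self.style = style

          def generate(self, note_sequence_dn):
              # The real system performs statistical/neural inference here;
              # this toy version only labels the output for illustration.
              return f"Df({self.style}, {note_sequence_dn})"

      # one model per pronunciation style, constructed from style-specific training data
      models_m1 = {s: FirstGenerationModel(s) for s in ("s1", "s2")}

      def generate_feature_sequence(selected_style, note_sequence_dn):
          return models_m1[selected_style].generate(note_sequence_dn)

      print(generate_feature_sequence("s2", "Dn v3"))     # -> "Df(s2, Dn v3)"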
  • the waveform W of the acoustic signal Z is displayed in the editing area Ew of the editing screen G, but the time series of the frequency spectrum of the acoustic signal Z (that is, a spectrogram) may be displayed on the editing screen G together with the waveform W.
  • the editing screen G illustrated in FIG. 17 includes an editing area Ew1 and an editing area Ew2.
  • the waveform W is displayed in the same manner as the editing area Ew in each of the above-described forms.
  • the time series of the frequency spectrum of the acoustic signal Z is displayed.
  • the user can give the editing instruction Qw for the frequency spectrum in the editing area Ew2 by operating the operation device 15 (a sketch of computing such a spectrum time series follows below).
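  • a minimal sketch, assuming the spectrogram is obtained by a short-time Fourier transform of the acoustic signal Z, of computing the frequency-spectrum time series that could be shown in the editing area Ew2 alongside the waveform W; the function name and parameters are assumptions.

      # Hypothetical sketch: magnitude spectrogram of the acoustic signal Z.
      import numpy as np

      def spectrogram(z, frame=1024, hop=256):
          """Magnitude spectra of signal z, one column per analysis frame."""
          window = np.hanning(frame)
          frames = [z[i:i + frame] * window
                    for i in range(0, len(z) - frame + 1, hop)]
          return np.abs(np.fft.rfft(np.asarray(frames), axis=1)).T

      z = np.sin(2 * np.pi * 440 * np.arange(48_000) / 48_000)   # 1 s test tone at 48 kHz
      spec = spectrogram(z)
      print(spec.shape)        # (frequency bins, time frames)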
  • Note string data Dn is time-series data representing a note sequence N having a plurality of notes on the time axis as elements.
  • the feature sequence data Df is time-series data representing the feature sequence F having a plurality of feature quantities on the time axis as elements.
  • the waveform data Dw is time-series data representing a waveform W having a plurality of samples on the time axis as elements.
  • the note sequence data Dn, the feature sequence data Df, and the waveform data Dw are comprehensively represented as time series data representing a time series of a plurality of elements.
  • the deep neural network is exemplified as the first generation model M1 and the second generation model M2, but the configurations of the first generation model M1 and the second generation model M2 are arbitrary.
  • a statistical inference model of another structure such as HMM (Hidden Markov Model) may be used as the first generation model M1 or the second generation model M2.
  • in each of the above-mentioned forms, synthesis of the synthetic sound corresponding to the note sequence N is illustrated, but each of the above-mentioned forms can be used in any scene in which time-series data representing a time series of a plurality of elements is processed.
  • the upper layer corresponds to the note sequence N
  • the middle layer corresponds to the feature sequence F
  • the lower layer corresponds to the waveform W.
  • in each such scene, the layers correspond to the combinations illustrated below.
  • the note strings constituting the melody correspond to the upper layer
  • the time series of the chords in the melody corresponds to the middle layer
  • the accompaniment sound that harmonizes with the melody corresponds to the lower layer.
  • in a voice synthesis scene in which a voice corresponding to a character string is synthesized, the character string corresponds to the upper layer, the pronunciation style of the voice corresponds to the middle layer, and the waveform of the voice corresponds to the lower layer.
  • the waveform of the signal corresponds to the upper layer
  • the time series of the feature amount of the signal corresponds to the middle layer
  • the time series of the parameters related to the processing applied to the signal corresponds to the lower layer.
  • the lower layer data is expressed as "lower data”.
  • the lower-level data is data representing the content actually used by the user (for example, the waveform W in each of the above-mentioned forms).
  • each note constituting the note string N in each of the above-mentioned forms and each character constituting the character string in speech synthesis are comprehensively expressed as symbols indicating sounds. Further, the note string N and the character string are comprehensively represented as a symbol string in which a plurality of symbols are arranged in time series.
  • the functions of the acoustic processing system exemplified above are realized by cooperation between the single processor or plurality of processors constituting the control device 11 and the program stored in the storage device 12.
  • the program according to the present disclosure may be provided and installed in a computer in a form stored in a computer-readable recording medium.
  • the recording medium is, for example, a non-transitory recording medium; an optical recording medium (optical disc) such as a CD-ROM is a good example, but recording media of any other known form, such as a semiconductor recording medium or a magnetic recording medium, are also included.
  • the non-transient recording medium includes any recording medium other than the transient propagation signal (transitory, propagating signal), and the volatile recording medium is not excluded. Further, in the configuration in which the distribution device distributes the program via the communication network, the storage device 12 that stores the program in the distribution device corresponds to the above-mentioned non-transient recording medium.
  • the first time-series data representing the time series of the feature amount of the sound in which the symbol string is pronounced in the first pronunciation style is edited according to the first instruction from the user; for each edit of the first time-series data, the first history data corresponding to the edited first time-series data is saved as new version data; the second time-series data representing the time series of the feature amount of the sound in which the symbol string is pronounced in the second pronunciation style, different from the first pronunciation style, is edited according to the second instruction from the user; for each edit of the second time-series data, the second history data corresponding to the edited second time-series data is saved as new version data; and the first time-series data corresponding to the first history data that corresponds to the instruction from the user, among the saved first history data of different versions, or the second time-series data corresponding to the second history data that corresponds to the instruction from the user, among the saved second history data of different versions, is acquired.
  • the history of editing the first time-series data corresponding to the first pronunciation style is saved, and the history of editing the second time-series data corresponding to the second pronunciation style is saved. Therefore, the editing of the first time-series data corresponding to the first pronunciation style and the editing of the second time-series data corresponding to the second pronunciation style can each be executed by trial and error according to instructions from the user.
  • the "symbol string" is, for example, a musical note string or a character string.
  • the symbol string is a note sequence including a plurality of notes arranged in a time series.
  • the note sequence data representing the note sequence is edited according to the instruction from the user, and the first time-series data and the second time-series data are generated from a common version of the note sequence data.
  • in one aspect, either the first history data corresponding to the state after the edit immediately preceding the latest edit, among the plurality of first history data, or the second history data corresponding to the state after the edit immediately preceding the latest edit, among the plurality of second history data, is acquired. In other words, the first history data or the second history data from before the execution of the immediately preceding edit (that is, the state in which that edit is canceled) is acquired.
  • in another aspect, either the first history data of the version designated by the user, among the plurality of first history data, or the second history data of the version designated by the user, among the plurality of second history data, is acquired. According to the above configuration, it is possible to acquire the first history data or the second history data corresponding to any version according to the instruction from the user.
  • the information processing system according to one aspect of the present disclosure includes an editing processing unit that edits the first time-series data representing the time series of the feature amount of the sound in which the symbol string is pronounced in the first pronunciation style according to the first instruction from the user, and edits the second time-series data representing the time series of the feature amount of the sound in which the symbol string is pronounced in the second pronunciation style, different from the first pronunciation style, according to the second instruction from the user; and an information management unit that, each time the editing processing unit edits the first time-series data, saves the first history data corresponding to the edited first time-series data as new version data, and, each time the second time-series data is edited, saves the second history data corresponding to the edited second time-series data as new version data. From among the saved first history data and second history data of different versions, the first time-series data or the second time-series data corresponding to the history data that corresponds to an instruction from the user is acquired.
  • the program according to one aspect of the present disclosure causes the computer system to function as the above information processing system.
  • 100 ... Information processing system, 11 ... Control device, 12 ... Storage device, 13 ... Sound emitting device, 14 ... Display device, 15 ... Operation device, 20 ... Display control unit, 30 ... Editing processing unit, 31 ... First editing unit, 32 ... First generation unit, 33 ... Second editing unit, 34 ... Second generation unit, 35 ... Third editing unit, M1 ... First generation model, M2 ... Second generation model.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Electrophonic Musical Instruments (AREA)

Abstract

The invention consists in: editing, according to a first instruction from a user, first time-series data representing a time series of a feature amount of a sound produced by pronouncing a symbol string in a first pronunciation style; saving, as new version data, first history data corresponding to the edited first time-series data, for each edit of the first time-series data; editing, according to a second instruction from the user, second time-series data representing a time series of a feature amount of a sound produced by pronouncing the symbol string in a second pronunciation style different from the first pronunciation style; saving, as new version data, second history data corresponding to the edited second time-series data, for each edit of the second time-series data; and acquiring first time-series data corresponding to a first item of history data, among a plurality of saved first items of history data of different versions, that corresponds to an instruction from the user, or second time-series data corresponding to a second item of history data, among a plurality of saved second items of history data of different versions, that corresponds to an instruction from the user.
PCT/JP2020/037966 2020-10-07 2020-10-07 Procédé de traitement d'informations, système de traitement d'informations et programme WO2022074754A1 (fr)

Priority Applications (3)

Application Number Priority Date Filing Date Title
JP2022555020A JPWO2022074754A1 (fr) 2020-10-07 2020-10-07
PCT/JP2020/037966 WO2022074754A1 (fr) 2020-10-07 2020-10-07 Procédé de traitement d'informations, système de traitement d'informations et programme
CN202080105738.8A CN116324965A (zh) 2020-10-07 2020-10-07 信息处理方法、信息处理系统及程序

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2020/037966 WO2022074754A1 (fr) 2020-10-07 2020-10-07 Procédé de traitement d'informations, système de traitement d'informations et programme

Publications (1)

Publication Number Publication Date
WO2022074754A1 true WO2022074754A1 (fr) 2022-04-14

Family

ID=81125769

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2020/037966 WO2022074754A1 (fr) 2020-10-07 2020-10-07 Procédé de traitement d'informations, système de traitement d'informations et programme

Country Status (3)

Country Link
JP (1) JPWO2022074754A1 (fr)
CN (1) CN116324965A (fr)
WO (1) WO2022074754A1 (fr)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019239971A1 (fr) * 2018-06-15 2019-12-19 ヤマハ株式会社 Procédé de traitement d'informations, dispositif et programme de traitement d'informations

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004252719A (ja) * 2003-02-20 2004-09-09 Mitsubishi Electric Corp データ管理装置及びデータ管理方法及びデータ管理プログラム
JP2007034782A (ja) * 2005-07-28 2007-02-08 Keakomu:Kk 文書編集装置
US10997189B2 (en) * 2015-03-23 2021-05-04 Dropbox, Inc. Processing conversation attachments in shared folder backed integrated workspaces

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019239971A1 (fr) * 2018-06-15 2019-12-19 ヤマハ株式会社 Procédé de traitement d'informations, dispositif et programme de traitement d'informations

Also Published As

Publication number Publication date
JPWO2022074754A1 (fr) 2022-04-14
CN116324965A (zh) 2023-06-23

Similar Documents

Publication Publication Date Title
JP6547878B1 (ja) 電子楽器、電子楽器の制御方法、及びプログラム
JP6610714B1 (ja) 電子楽器、電子楽器の制御方法、及びプログラム
JP6610715B1 (ja) 電子楽器、電子楽器の制御方法、及びプログラム
JP3102335B2 (ja) フォルマント変換装置およびカラオケ装置
US5939654A (en) Harmony generating apparatus and method of use for karaoke
US20040177745A1 (en) Score data display/editing apparatus and program
JP6728754B2 (ja) 発音装置、発音方法および発音プログラム
JP6784022B2 (ja) 音声合成方法、音声合成制御方法、音声合成装置、音声合成制御装置およびプログラム
JP2022116335A (ja) 電子楽器、方法及びプログラム
JP7180587B2 (ja) 電子楽器、方法及びプログラム
CN111696498A (zh) 键盘乐器以及键盘乐器的计算机执行的方法
JP4274272B2 (ja) アルペジオ演奏装置
JP6835182B2 (ja) 電子楽器、電子楽器の制御方法、及びプログラム
JP5136128B2 (ja) 音声合成装置
JP2023100776A (ja) 電子楽器、電子楽器の制御方法、及びプログラム
WO2022074754A1 (fr) Procédé de traitement d'informations, système de traitement d'informations et programme
WO2022074753A1 (fr) Procédé de traitement d'informations, système de traitement d'informations et programme
JP6801766B2 (ja) 電子楽器、電子楽器の制御方法、及びプログラム
JP6819732B2 (ja) 電子楽器、電子楽器の制御方法、及びプログラム
CN115349147A (zh) 音信号生成方法、推定模型训练方法、音信号生成系统及程序
JP5106437B2 (ja) カラオケ装置及びその制御方法並びにその制御プログラム
JP4240099B2 (ja) 電子楽器および電子楽器制御用プログラム
WO2004025306A1 (fr) Expression generee par ordinateur dans une production de musique
JP5953743B2 (ja) 音声合成装置及びプログラム
WO2024089995A1 (fr) Procédé de synthèse de son musical, système de synthèse de son musical et programme

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20956702

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2022555020

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20956702

Country of ref document: EP

Kind code of ref document: A1