CN117121089A - Sound processing method, sound processing system, program, and method for creating generation model


Publication number: CN117121089A
Application number: CN202280024965.7A
Authority: CN (China)
Other languages: Chinese (zh)
Inventors: 西村方成, 安藤龙也
Applicant and current assignee: Yamaha Corp
Legal status: Pending

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H: ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H1/00: Details of electrophonic musical instruments


Abstract

The sound processing system includes: an instruction receiving unit that receives an instruction from a user; a feature data generation unit that generates feature data including a feature value representing a musical feature of a musical composition from musical composition data representing the musical composition, in accordance with the instruction from the user and a rule corresponding to a specific music theory; and an acoustic data generation unit that generates acoustic data representing a sound corresponding to the musical composition data by inputting control data corresponding to the musical composition data and the feature data into a trained generation model.

Description

Sound processing method, sound processing system, program, and method for creating generation model
Technical Field
The present invention relates to a technique for generating acoustic data representing a sound of a musical composition.
Background
Various techniques for processing musical composition data representing musical compositions have been proposed. For example, Patent Document 1 discloses a technique of generating a time series of energy values in accordance with a theory concerning the interpretation of music
(hereinafter referred to as "music theory") and adding a musical expression to acoustic data by using that time series of energy values.
Patent Document 1: Japanese Patent Application Laid-Open No. 2011-164162
Disclosure of Invention
However, in the technique of Patent Document 1, the musical expression added to the acoustic data depends on only a single music theory, so it is difficult to generate acoustic data with a variety of musical expressions. For example, depending on the genre of a musical piece, an expression suitable for that genre may not be added to the acoustic data. It is also conceivable that an expression conforming to the musical intention or preference of the user is not added to the acoustic data. In view of the above circumstances, an object of one embodiment of the present invention is to generate acoustic data of various expressions.
In order to solve the above problem, an acoustic processing method according to one aspect of the present invention receives an instruction from a user, generates feature data including a feature value representing a musical feature of a musical composition from musical composition data representing the musical composition in accordance with the instruction from the user and a rule corresponding to a specific music theory, and generates acoustic data representing a sound corresponding to the musical composition data by inputting control data corresponding to the musical composition data and the feature data into a trained generation model.
An acoustic processing system according to one embodiment of the present invention includes: an instruction receiving unit that receives an instruction from a user; a feature data generation unit that generates a time series of feature data including a feature value representing a musical feature of a musical composition from musical composition data representing the musical composition, in accordance with the instruction from the user and a rule corresponding to a specific music theory;
and an acoustic data generation unit that generates acoustic data representing a sound corresponding to the musical composition data by inputting control data corresponding to the musical composition data and the feature data into a trained generation model.
A program according to one embodiment of the present invention causes a computer system to function as: an instruction receiving unit that receives an instruction from a user; a feature data generation unit that generates a time series of feature data including a feature value representing a musical feature of a musical composition from musical composition data representing the musical composition, in accordance with the instruction from the user and a rule corresponding to a specific music theory; and an acoustic data generation unit that generates acoustic data representing a sound corresponding to the musical composition data by inputting control data corresponding to the musical composition data and the feature data into a trained generation model.
A method for creating a generation model according to one embodiment of the present invention acquires learning data including learning control data and learning acoustic data, and creates, by machine learning using the learning data, a generation model that outputs acoustic data in response to input of control data, the learning control data including: condition data representing conditions specified by musical composition data representing a musical composition; and feature data including a feature value representing a musical feature of the musical composition, the learning acoustic data representing a sound corresponding to the musical composition data.
Drawings
Fig. 1 is a block diagram illustrating the structure of an information system according to embodiment 1.
Fig. 2 is a block diagram illustrating a functional structure of the sound processing system.
Fig. 3 is a block diagram illustrating a configuration of the feature data generation section.
Fig. 4 is an explanatory diagram relating to the feature value and each element value.
Fig. 5 is a schematic diagram of an editing screen.
Fig. 6 is a flowchart illustrating a specific flow of the synthesis process.
Fig. 7 is a block diagram illustrating a functional structure of the machine learning system.
Fig. 8 is a flowchart illustrating a specific flow of the learning process.
Fig. 9 is a block diagram illustrating a functional configuration of the sound processing system according to embodiment 2.
Fig. 10 is an explanatory diagram relating to the operation of the sound processing system according to embodiment 3.
Fig. 11 is a block diagram illustrating a functional configuration of the sound processing system according to embodiment 4.
Fig. 12 is a block diagram illustrating the configuration of the feature data generation unit according to embodiment 5.
Fig. 13 is a block diagram illustrating the structure of an information system according to embodiment 6.
Fig. 14 is a block diagram illustrating a functional structure of the machine learning system of embodiment 6.
Fig. 15 is a block diagram illustrating a structure of a generation model of a modification.
Fig. 16 is a block diagram illustrating a structure of a generation model of a modification.
Detailed Description
A: embodiment 1
Fig. 1 is a block diagram illustrating a configuration of an information system 100 according to embodiment 1. The information system 100 has a sound processing system 10 and a machine learning system 20. The sound processing system 10 and the machine learning system 20 communicate with each other via a communication network 200 such as the internet.
[ Sound processing System 10]
The sound processing system 10 is a computer system having a control device 11, a storage device 12, a communication device 13, a playback device 14, an operation device 15, and a display device 16. The sound processing system 10 is implemented by an information terminal such as a smartphone, a tablet terminal, or a personal computer. The sound processing system 10 may also be realized by a plurality of devices configured separately from each other (for example, a client-server system), instead of by a single device.
The control device 11 is composed of a single processor or a plurality of processors that control the respective elements of the sound processing system 10. For example, the control device 11 is configured by one or more processors such as a CPU (Central Processing Unit), an SPU (Sound Processing Unit), a DSP (Digital Signal Processor), an FPGA (Field Programmable Gate Array), or an ASIC (Application Specific Integrated Circuit). The communication device 13 communicates with the machine learning system 20 via the communication network 200.
The storage device 12 is a single or a plurality of memories that store programs executed by the control device 11 and various data used by the control device 11. The storage device 12 is constituted by a known recording medium such as a magnetic recording medium or a semiconductor recording medium, or a combination of a plurality of recording media. Further, a removable recording medium that can be attached to or detached from the sound processing system 10, or a recording medium (for example, cloud storage) that can be written to or read from by the control device 11 via the communication network 200 may be used as the storage device 12.
The storage device 12 stores music data S representing a musical composition. The music data S contains note data for each of the plurality of notes constituting the musical composition. The note data of each note specifies the pitch, sounding period, chord, and harmony function of the note. The harmony function is a classification of chords (tonic/subdominant/dominant) that focuses on their musical function. The music data S also contains structural data relating to the musical structure of the composition. The structural data specifies a plurality of sections (such as phrases or passages) into which the musical composition is divided according to its musical form. For example, the structural data specifies the start point and the end point on the time axis for each of the A section and the B section of the musical composition. As understood from the above description, the music data S may also be regarded as data representing the score of a musical composition.
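Purely as an illustration (the patent does not prescribe any particular data format, so every name below is hypothetical), the note data and structural data described above could be held in a structure along the following lines:

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class Note:                  # one entry of note data (hypothetical field names)
        pitch: int               # pitch as a MIDI note number
        onset: float             # start of the sounding period, in seconds
        offset: float            # end of the sounding period, in seconds
        chord: str               # chord symbol, e.g. "C" or "G7"
        harmony_function: str    # "T" (tonic), "S" (subdominant) or "D" (dominant)

    @dataclass
    class Section:               # one musical section given by the structural data
        name: str                # e.g. "A" or "B"
        start: float             # start point on the time axis, in seconds
        end: float               # end point on the time axis, in seconds

    @dataclass
    class MusicData:             # music data S
        notes: List[Note]
        sections: List[Section]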
The control device 11 generates an acoustic signal a representing a sound corresponding to the music data S (hereinafter referred to as the "target sound"). The acoustic signal a is a time-domain signal representing the waveform of the target sound. The target sound is a performance sound produced by performing the musical composition represented by the music data S. Specifically, the target sound is a musical tone produced by performance of a musical instrument or a voice produced by singing. The playback device 14 plays the target sound represented by the acoustic signal a. The playback device 14 is, for example, a speaker or headphones. For convenience, a D/A converter that converts the acoustic signal a from a digital signal to an analog signal and an amplifier that amplifies the acoustic signal a are not shown. A playback device 14 separate from the sound processing system 10 may also be connected to the sound processing system 10 by wire or wirelessly.
The operation device 15 is an input device that receives an instruction from a user. The operation device 15 is, for example, an operation tool operated by a user or a touch panel for detecting contact by the user. Further, the operation device 15 (for example, a mouse or a keyboard) separate from the sound processing system 10 may be connected to the sound processing system 10 in a wired or wireless manner.
The display device 16 displays an image based on the control of the control device 11. For example, various display panels such as a liquid crystal display panel and an organic EL (Electroluminescence) panel are used as the display device 16. The display device 16, which is separate from the sound processing system 10, may be connected to the sound processing system 10 by wire or wirelessly.
Fig. 2 is a block diagram illustrating a functional configuration of the sound processing system 10. The control device 11 realizes a plurality of functions (instruction receiving unit 31, condition data generating unit 32, characteristic data generating unit 33, acoustic data generating unit 35, signal generating unit 36) for generating the acoustic signal a by executing a program stored in the storage device 12. The instruction receiving unit 31 receives an instruction from the user to the operation device 15.
The condition data generating unit 32 generates condition data X from the music data S. The condition data X specifies musical conditions designated by the music data S. The condition data X is generated for each unit period on the time axis. That is, the condition data generating unit 32 generates a time series of the condition data X. Each unit period is a period (time frame) sufficiently shorter than each note of the musical composition. The condition data X for each unit period includes, for example, information (for example, pitch and duration) on the note contained in that unit period. The condition data X for each unit period also includes, for example, information (for example, pitch and duration) on one or both of the note preceding (for example, immediately preceding) and the note following (for example, immediately following) the note of that unit period. Further, instead of the pitch of the preceding or following note itself, the condition data X may include the pitch difference between the note of the unit period and one or both of the preceding note and the following note.
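As a minimal sketch of this per-frame processing, assuming the hypothetical MusicData structure from the previous sketch, the condition data X for each unit period could be assembled as follows (the exact contents and encoding of X are not fixed by the text beyond pitch, duration, and the neighbouring notes):

    def condition_data(music, frame_times, rest=0.0):
        """One condition-data vector X per unit period (frame)."""
        notes = sorted(music.notes, key=lambda n: n.onset)
        series = []
        for t in frame_times:
            idx = next((i for i, n in enumerate(notes) if n.onset <= t < n.offset), None)
            if idx is None:                            # no note sounding in this unit period
                series.append([rest] * 6)
                continue
            cur = notes[idx]
            prev = notes[idx - 1] if idx > 0 else cur
            nxt = notes[idx + 1] if idx + 1 < len(notes) else cur
            series.append([
                cur.pitch, cur.offset - cur.onset,     # current note: pitch and duration
                prev.pitch - cur.pitch,                # pitch difference to the preceding note
                prev.offset - prev.onset,              # duration of the preceding note
                nxt.pitch - cur.pitch,                 # pitch difference to the following note
                nxt.offset - nxt.onset,                # duration of the following note
            ])
        return series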
The feature data generating section 33 generates feature data Y from the music data S. The feature data Y contains a numerical value (hereinafter referred to as "feature value") F representing a musical feature of the musical piece represented by the musical piece data S. The feature data Y is generated for each unit period. That is, the feature data generating unit 33 generates the time series of the feature data Y. Fig. 3 is a block diagram illustrating the structure of the feature data generation section 33. The feature data generation unit 33 includes a feature extraction unit 331 and an editing processing unit 332.
The feature extraction unit 331 generates the feature value F from the music data S. The feature value F is generated for each unit period. That is, the feature extraction unit 331 generates a time series of feature values F. Each feature value F is calculated from a plurality of (N) element values E1 to EN. That is, the feature extraction unit 331 generates the N element values E1 to EN by analyzing the music data S, and calculates the feature value F by a calculation applied to the N element values E1 to EN.
The element values En (n = 1 to N) are set on the basis of the music data S by rules corresponding to a specific music theory. A music theory is a method of interpreting a musical composition from a musical viewpoint. The N element values E1 to EN are evaluation values that evaluate the musical composition from different viewpoints defined by a specific music theory (hereinafter referred to as the "specific music theory"). That is, each element value En is set on the basis of the music data S by a rule defined by the specific music theory. The acoustic signal a generated by the control device 11 represents a target sound to which a musical expression conforming to an interpretation of the musical composition based on the specific music theory is added. In embodiment 1, a specific music theory based on the music theory described in a Japanese-language book on performance interpretation published by Ongaku no Tomo Sha in 1998 is assumed. The types of element values En and the total number N of element values En differ for each music theory. Therefore, a music theory may alternatively be regarded as a rule that defines at least one of the type of each element value En and the total number N of the element values En.
Fig. 4 is an explanatory diagram of the characteristic value F and the element values En of embodiment 1. The element value E1 in each unit period is a value corresponding to the pitch of the note designated by the music data S for that unit period. For example, the value of the pitch specified by the music data S is used as the element value E1. The time series of the element values E1 corresponds to the melody of the musical composition.
The element value E2 for each unit period is a numerical value corresponding to the category (tonic/subdominant/dominant) of the harmony function of the chord specified for that unit period by the music data S. Specifically, the element value E2 is set to a value corresponding to the degree of tension perceived by a listener from the chord. For example, the element value E2 for a unit period in which a tonic chord (T) is specified is set to a minimum value E21 (tension: low), the element value E2 for a unit period in which a subdominant chord (S) is specified is set to an intermediate value E22 (tension: medium), and the element value E2 for a unit period in which a dominant chord (D) is specified is set to a maximum value E23 (tension: high). As understood from the above description, the rules corresponding to the specific music theory include a rule for converting the category of the harmony function into a numerical value.
The element value E3 for each unit period is a numerical value corresponding to the musical section specified for that unit period by the music data S. Specifically, the element value E3 for a unit period within the A section, which includes the start point of the musical composition, is set to a minimum value E31, and the element value E3 for a unit period within the B section following the A section is set to a maximum value E32. As understood from the above description, the rules corresponding to the specific music theory include a rule for converting each musical section into a numerical value.
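The rules for E1 to E3 described above can be pictured as a small lookup per unit period. In the sketch below (again using the hypothetical MusicData structure), the concrete numbers standing in for E21 to E23 and E31/E32 are placeholders, since the text only fixes their ordering:

    TENSION = {"T": 0.0, "S": 0.5, "D": 1.0}     # E21 < E22 < E23 (placeholder values)
    SECTION_LEVEL = {"A": 0.0, "B": 1.0}         # E31 < E32 (placeholder values)

    def element_values(music, t):
        """Element values E1..E3 for the unit period containing time t (here N = 3)."""
        note = next(n for n in music.notes if n.onset <= t < n.offset)
        section = next(s for s in music.sections if s.start <= t < s.end)
        e1 = float(note.pitch)                   # E1: pitch of the note
        e2 = TENSION[note.harmony_function]      # E2: tension of the harmony function
        e3 = SECTION_LEVEL[section.name]         # E3: musical section
        return [e1, e2, e3]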
The feature extraction unit 331 calculates a feature value F0 as a weighted sum of the N element values E1 to EN, as expressed by the following expression (1). The symbol Wn denotes the weighting value for the element value En.
F0 = W1·E1 + W2·E2 + … + Wn·En + … + WN·EN   (1)
The feature extraction unit 331 of embodiment 1 calculates a moving average of the feature value F0 calculated by the expression (1) as the feature value F. That is, the characteristic value F in each unit period is an average value of the characteristic values F0 in a predetermined period including the unit period. Further, the numerical value calculated by the expression (1) may be determined as the characteristic value F. That is, the moving average of the feature value F0 may be omitted.
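Expression (1) and the subsequent moving average amount to the following computation; the window length is a free parameter that the text does not specify:

    import numpy as np

    def feature_series(element_series, weights, window=9):
        """F0 per unit period by expression (1), then F as its moving average."""
        E = np.asarray(element_series, dtype=float)   # shape: (num_frames, N)
        w = np.asarray(weights, dtype=float)          # weighting values W1..WN
        f0 = E @ w                                    # F0 = W1*E1 + ... + WN*EN per frame
        kernel = np.ones(window) / window
        return np.convolve(f0, kernel, mode="same")   # F: moving average of F0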
The instruction receiving unit 31 causes the display device 16 to display the editing screen G of fig. 5. The editing screen G includes a first area Ga and a second area Gb. The first area Ga displays the weighting value Wn corresponding to each element value En. Each weighting value Wn is initially set to a predetermined value. The time series of the feature values F calculated by the feature extraction unit 331 (hereinafter referred to as the "feature value sequence") Q is displayed in the second area Gb. Specifically, the feature value sequence Q is displayed on the display device 16 as, for example, a broken line or a curve.
The user can instruct a change in the numerical value of each of the N weighting values W1 to WN by operating the first area Ga with the operation device 15. The instruction receiving unit 31 receives an instruction concerning each weighting value Wn from the user. Specifically, the instruction receiving unit 31 receives an instruction to change the weighting value Wn selected by the user from among the N weighting values W1 to WN. For example, the instruction receiving unit 31 receives an instruction to increase or decrease each weighting value Wn, or an instruction specifying the numerical value of each weighting value Wn. The feature extraction unit 331 applies the changed weighting value Wn indicated by the instruction received by the instruction receiving unit 31 to expression (1). Each weighting value Wn is set to, for example, a positive number, a negative number, or 0 in response to the instruction from the user. When the weighting value Wn is set to 0, the influence of the element value En on the feature value F can be ignored. That is, by setting to 0 the weighting value Wn corresponding to an element value En judged to be unnecessary among the N element values E1 to EN, the user can eliminate the influence of that element value En on the feature value F.
The user can instruct a change to the feature value sequence Q by operating the operation device 15 while viewing the second area Gb. The instruction receiving unit 31 receives an instruction to change the feature value sequence Q from the user. For example, the instruction receiving unit 31 receives from the user an instruction that selects a portion of the feature value sequence Q to be edited and changes that portion. The editing processing unit 332 of fig. 3 edits the feature value sequence Q in accordance with the instruction received by the instruction receiving unit 31. That is, the editing processing unit 332 changes the feature values F of the portion of the feature value sequence Q selected by the user, in response to the instruction from the user. A time series of feature data Y representing the feature values F edited by the editing processing unit 332 is stored in the storage device 12. When the user does not instruct a change to the feature value sequence Q, feature data Y including the feature values F calculated by the feature extraction unit 331 is generated.
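The editing operation itself can be pictured as overwriting the user-selected portion of the feature value sequence Q; the interface details (dragging points, drawing a curve) are outside this sketch:

    import numpy as np

    def edit_feature_sequence(q, start_frame, end_frame, new_values):
        """Replace the selected portion of the feature value sequence Q with user-drawn values."""
        q = np.array(q, dtype=float, copy=True)
        assert len(new_values) == end_frame - start_frame
        q[start_frame:end_frame] = new_values
        return q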
As understood from the above description, the feature data generating unit 33 generates the time series of the feature data Y from the music data S in accordance with the instruction from the user and the rules corresponding to the specific music theory. As described above, the instruction from the user is an instruction relating to each weighting value Wn or to the feature value sequence Q.
As illustrated in fig. 2, the control data D is generated for each unit period by the above processing performed by the condition data generation unit 32 and the feature data generation unit 33. The control data D for each unit period includes the condition data X generated by the condition data generating unit 32 for the unit period and the feature data Y generated by the feature data generating unit 33 for the unit period. That is, the control data D is data corresponding to the music data S (condition data X) and the feature data Y.
The acoustic data generating unit 35 generates acoustic data Z representing the target sound from the control data D. The acoustic data Z is generated for each unit period. That is, the acoustic data Z in each unit period is generated based on the control data D in that unit period. The acoustic data Z of embodiment 1 is data representing the frequency characteristics of the target sound. For example, the frequency characteristic represented by the acoustic data Z includes a spectrum such as a mel spectrum or an amplitude spectrum and a fundamental frequency of a target sound.
The generation model M is used by the acoustic data generating unit 35 to generate the acoustic data Z. The generation model M is a trained model that has learned the relationship between the control data D and the acoustic data Z through machine learning. That is, the generation model M outputs statistically reasonable acoustic data Z in response to input of the control data D. The acoustic data generating unit 35 generates the acoustic data Z by inputting the control data D into the generation model M.
The generation model M is constituted by, for example, a deep neural network (DNN: Deep Neural Network). For example, a deep neural network of any form, such as a recurrent neural network (RNN: Recurrent Neural Network) or a convolutional neural network (CNN: Convolutional Neural Network), is used as the generation model M. The generation model M may also be formed from a combination of a plurality of deep neural networks. Additional elements such as long short-term memory (LSTM) units may be incorporated into the generation model M.
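The architecture is left open by the text, so the following PyTorch sketch is only one possible reading: a recurrent network that maps the per-frame control data D to a mel-spectrum frame and a fundamental frequency. All layer sizes and the split of Z into mel spectrum plus f0 are illustrative assumptions:

    import torch
    import torch.nn as nn

    class GenerationModel(nn.Module):
        """Illustrative generation model M: control data D -> acoustic data Z per unit period."""
        def __init__(self, control_dim=16, hidden_dim=256, n_mels=80):
            super().__init__()
            self.rnn = nn.GRU(control_dim, hidden_dim, num_layers=2, batch_first=True)
            self.to_mel = nn.Linear(hidden_dim, n_mels)   # mel-spectrum part of Z
            self.to_f0 = nn.Linear(hidden_dim, 1)         # fundamental-frequency part of Z

        def forward(self, d):                             # d: (batch, frames, control_dim)
            h, _ = self.rnn(d)
            return self.to_mel(h), self.to_f0(h)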
The generation model M is implemented by a combination of a program that causes the control device 11 to execute the operation of generating the acoustic data Z from the control data D, and a plurality of variables (specifically, weighting values and biases) applied to that operation. The program that realizes the generation model M and the plurality of variables of the generation model M are stored in the storage device 12. The values of the plurality of variables defining the generation model M are set in advance by machine learning.
The signal generating unit 36 generates the acoustic signal a of the target sound from the time series of the acoustic data Z. The signal generating unit 36 converts the acoustic data Z into a time-domain waveform signal by an operation including an inverse discrete Fourier transform, for example, and generates the acoustic signal a by connecting the waveform signals of adjacent unit periods. Alternatively, the signal generating unit 36 may generate the acoustic signal a from the acoustic data Z by using a deep neural network (a so-called neural vocoder) that has learned the relationship between the acoustic data Z and the samples of the acoustic signal a. The acoustic signal a generated by the signal generating unit 36 is supplied to the playback device 14, whereby the target sound is played from the playback device 14.
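The "inverse discrete Fourier transform and connection of adjacent unit periods" can be read as a per-frame inverse DFT followed by overlap-add. The sketch below assumes complex (phase-bearing) spectra are available; a practical system would estimate the phase (for example with Griffin-Lim) or, as noted above, use a neural vocoder instead:

    import numpy as np

    def frames_to_signal(spectra, hop_length, n_fft):
        """Overlap-add of per-frame inverse DFTs. spectra: complex array, shape (frames, n_fft//2 + 1)."""
        window = np.hanning(n_fft)
        out = np.zeros(hop_length * (len(spectra) - 1) + n_fft)
        for i, spec in enumerate(spectra):
            frame = np.fft.irfft(spec, n=n_fft) * window   # waveform of one unit period
            start = i * hop_length
            out[start:start + n_fft] += frame              # connect adjacent unit periods
        return out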
Fig. 6 is a flowchart illustrating a specific flow of a process (hereinafter, referred to as "synthesis process") Sa of generating the acoustic signal a by the control device 11. For example, the synthesis process Sa is started with an instruction from the user to the operation device 15.
When the synthesizing process Sa is started, the control device 11 (condition data generating unit 32) generates condition data X for each unit period from the music data S stored in the storage device 12 (Sa 1). The control device 11 (feature data generation unit 33) generates feature data Y for each unit period from the musical composition data S according to a rule corresponding to a specific musical theory (Sa 2). Further, the order of generation (Sa 1) of the condition data X and generation (Sa 2) of the feature data Y may be reversed. The control device 11 (instruction receiving unit 31) causes the display device 16 to display an edit screen G including the feature value sequence Q (Sa 3).
The control device 11 (instruction receiving unit 31) determines whether or not an instruction for the editing screen G is received from the user (Sa 4). Specifically, the control device 11 determines whether or not the user instructs to change the weighting value Wn or the characteristic value sequence Q. When receiving an instruction from the user (Sa 4: YES), the control device 11 generates feature data Y reflecting the instruction (Sa 2) and displays a modified feature value sequence Q (Sa 3). Specifically, when the change of the weight value Wn is instructed, the control device 11 calculates the feature value F by the operation of the equation (1) to which the changed weight value Wn is applied. When the change of the feature value sequence Q is instructed, the control device 11 generates a time series of feature data Y indicating each feature value F of the changed feature value sequence Q.
Until the user instructs finalization of the feature value sequence Q (Sa 5: NO), the generation of the feature data Y (Sa 2) and the display of the feature value sequence Q (Sa 3) are repeated for each instruction from the user on the editing screen G (Sa 4: YES). When the feature value sequence Q has been edited into the desired shape, the user instructs finalization of the feature value sequence Q by operating the operation device 15. Upon receiving the instruction to finalize the feature value sequence Q (Sa 5: YES), the control device 11 (acoustic data generating unit 35) generates acoustic data Z for each unit period by inputting the control data D including the condition data X and the feature data Y into the generation model M (Sa 6). The control device 11 (signal generating unit 36) generates the acoustic signal a of the target sound from the time series of the acoustic data Z (Sa 7), and supplies the acoustic signal a to the playback device 14 to play the target sound (Sa 8).
As described above, in embodiment 1, a time series of feature data Y including the musical feature value F is generated in accordance with the instruction from the user and the rules corresponding to the specific music theory, and acoustic data Z is generated by inputting control data D corresponding to the condition data X and the feature data Y into the generation model M. Therefore, compared with a configuration in which the musical expression added to the acoustic data Z depends on only a single music theory, acoustic data Z reflecting the various expressions instructed by the user can be generated.
In embodiment 1, the feature value F is calculated as a weighted sum of the N element values E1 to EN related to the specific music theory, using the weighting values Wn instructed by the user. That is, the user can adjust the degree to which each of the N element values E1 to EN affects the feature value F. Accordingly, acoustic data Z (and hence an acoustic signal a) that suits the musical intention or preference of the user can be generated.
[ Machine learning system 20]
The machine learning system 20 of fig. 1 is a computer system that creates a generation model M used by the sound processing system 10 through machine learning. The machine learning system 20 has a control device 21, a storage device 22, and a communication device 23.
The control device 21 is constituted by a single or a plurality of processors that control the elements of the machine learning system 20. For example, the control device 21 is configured by 1 or more processors such as CPU, SPU, DSP, FPGA and ASIC. The communication device 23 communicates with the sound processing system 10 via the communication network 200.
The storage device 22 is a single or a plurality of memories that store programs executed by the control device 21 and various data used by the control device 21. The storage device 22 is constituted by a known recording medium such as a magnetic recording medium or a semiconductor recording medium, or a combination of a plurality of recording media. Further, a removable recording medium that is removable with respect to the machine learning system 20, or a recording medium (for example, cloud storage) that can be written to or read from by the control device 21 via the communication network 200 may be used as the storage device 22.
Fig. 7 is a block diagram illustrating a functional structure of the machine learning system 20. The storage device 22 stores a plurality of pieces of basic data B used in machine learning. Each piece of basic data B is composed of a pair of learning music data St and a learning acoustic signal At. The music data St is data representing a musical composition, like the music data S described above. The acoustic signal At corresponding to each piece of music data St is a signal representing a performance sound of the musical composition represented by that music data St. For example, the acoustic signal At is generated by recording the performance sound of the musical composition played by a performer.
The control device 21 executes a program stored in the storage device 22 to realize a plurality of functions (the learning data acquisition unit 51 and the learning processing unit 52) for creating the generation model M. The learning data acquisition unit 51 generates a plurality of pieces of learning data T from the plurality of pieces of basic data B, respectively. The plurality of learning data T are each composed of a set of learning control data Dt and learning acoustic data Zt. The learning control data Dt is an example of "learning control data", and the learning acoustic data Zt is an example of "learning acoustic data".
The learning data acquisition unit 51 includes a condition data generation unit 511, a feature data generation unit 512, and an acoustic data generation unit 513. The condition data generating unit 511 generates condition data Xt for each unit period from the music data St of each basic data B by the same processing as the condition data generating unit 32 described above. Like the condition data X, the condition data Xt specifies the music condition specified by the music data St.
The feature data generating unit 512 generates feature data Yt for each unit period from the music data St of each basic data B. Specifically, the feature data generating unit 512 generates N element values E1 to EN corresponding to the specific music theory by analyzing the music data St, and generates feature data Yt indicating feature values F corresponding to the N element values E1 to EN, similarly to the feature extracting unit 331 described above. Specifically, as expressed by the above expression (1), the feature data generation unit 512 calculates the feature value F by applying a weighted sum of N element values E1 to EN of the weighted value Wn. However, each weight Wn is set to a predetermined reference value. By the above processing performed by the condition data generating unit 511 and the feature data generating unit 512, the control data Dt for learning is generated for each unit period. The control data Dt includes condition data Xt and feature data Yt.
The acoustic data generating unit 513 generates acoustic data Zt for each unit period from the acoustic signal At of each piece of basic data B. Specifically, the acoustic data generating unit 513 performs frequency analysis such as a discrete Fourier transform on the acoustic signal At to generate acoustic data Zt representing the frequency characteristics of the acoustic signal At. A configuration is also conceivable in which the basic data B includes a time series of the acoustic data Zt instead of the acoustic signal At. In the case where the basic data B includes the time series of the acoustic data Zt, the acoustic data generating unit 513 is omitted.
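As one concrete possibility for this frequency analysis (the text only requires frequency characteristics such as a mel or amplitude spectrum and a fundamental frequency), the acoustic data Zt per unit period could be computed with librosa roughly as follows; the sample rate, frame size, hop size, and pitch range are illustrative:

    import numpy as np
    import librosa

    def acoustic_data(y, sr=22050, n_fft=1024, hop_length=256, n_mels=80):
        """Acoustic data Zt per unit period: log-mel spectrum plus fundamental frequency."""
        mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                             hop_length=hop_length, n_mels=n_mels)
        log_mel = np.log(mel + 1e-6).T                       # shape: (frames, n_mels)
        f0, _, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"),
                                fmax=librosa.note_to_hz("C6"), sr=sr,
                                frame_length=n_fft, hop_length=hop_length)
        return log_mel, f0                                   # f0 is NaN for unvoiced frames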
Learning data T including the control data Dt generated for each unit period and the acoustic data Zt generated for that unit period is generated. As understood from the above description, the acoustic data Zt of each piece of learning data T corresponds to the correct value (label) that the generation model M should output in response to input of the control data Dt of that learning data T. The above processing is performed for each unit period of each of the plurality of pieces of basic data B, thereby generating a plurality of pieces of learning data T corresponding to different musical compositions or different unit periods. The plurality of pieces of learning data T generated by the learning data acquisition unit 51 are stored in the storage device 22. The learning processing unit 52 creates the generation model M by supervised machine learning using the plurality of pieces of learning data T.
Fig. 8 is a flowchart illustrating a specific flow of a process (hereinafter, referred to as "learning process") Sb of creating the generation model M by the control device 21 of the machine learning system 20 through machine learning. The learning process Sb also represents a method of creating the generated model M by machine learning (a trained model creation method).
When the learning process Sb is started, the control device 21 (learning data acquisition unit 51) generates a plurality of pieces of learning data T from the plurality of pieces of basic data B (Sb 1). The control device 21 (learning processing unit 52) selects any one piece of the plurality of pieces of learning data T (hereinafter referred to as the "selected learning data T") (Sb 2). As illustrated in fig. 7, the control device 21 (learning processing unit 52) inputs the control data Dt of the selected learning data T into an initial or temporary model (hereinafter referred to as the "temporary model") M0 (Sb 3), and acquires the acoustic data Z output from the temporary model M0 in response to that input (Sb 4).
The control device 21 (learning processing unit 52) calculates a loss function representing an error between the acoustic data Z generated by the temporary model M0 and the acoustic data Zt of the selected learning data T (Sb 5). The control device 21 (learning processing section 52) updates the plurality of variables of the temporary model M0 so that the loss function is reduced (desirably minimized) (Sb 6). For example, an error back propagation method is used to update a plurality of variables corresponding to the loss function.
The control device 21 (learning processing unit 52) determines whether or not a predetermined termination condition is satisfied (Sb 7). The end condition is, for example, that the loss function is below a predetermined threshold value or that the amount of change in the loss function is below a predetermined threshold value. When the end condition is not satisfied (Sb 7: NO), the control device 21 (learning processing unit 52) selects the unselected learning data T as the new selected learning data T (Sb 2). That is, until the end condition is satisfied (Sb 7: YES), the process of updating the plurality of variables of the temporary model M0 is repeatedly executed (Sb 2 to Sb 6). When the end condition is satisfied (YES in Sb 7), the control device 21 (learning processing unit 52) stops updating the plurality of variables (Sb 2 to Sb 6) defining the temporary model M0. The temporary model M0 at the point in time when the end condition is established is determined as the generation model M. Specifically, a plurality of variables of the generation model M are determined as values at the time points when the end condition is satisfied.
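Using the illustrative PyTorch model sketched earlier, steps Sb2 to Sb7 could be written as the loop below. The mean-squared-error loss, the Adam optimizer, and the termination threshold are assumptions; the text only requires an error between Z and Zt that is reduced by back-propagation until an end condition holds:

    import torch
    import torch.nn as nn

    def learning_process(model, learning_data, lr=1e-4, threshold=1e-3, max_steps=100_000):
        """Learning process Sb: repeat Sb2-Sb6 until the end condition Sb7 is satisfied."""
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        mse = nn.MSELoss()
        for step, (d_t, mel_t, f0_t) in enumerate(learning_data):   # Sb2: select learning data T
            mel, f0 = model(d_t)                                    # Sb3, Sb4: run the temporary model M0
            loss = mse(mel, mel_t) + mse(f0, f0_t)                  # Sb5: loss between Z and Zt
            optimizer.zero_grad()
            loss.backward()                                         # Sb6: error back-propagation
            optimizer.step()
            if loss.item() < threshold or step >= max_steps:        # Sb7: end condition
                break
        return model                                                # final temporary model M0 becomes M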
The control device 21 transmits the generated model M created by the above-described flow from the communication device 23 to the sound processing system 10 (Sb 8). Specifically, a plurality of variables defining the generation model M are transmitted to the sound processing system 10. The control device 11 of the sound processing system 10 receives the generated model M transmitted from the machine learning system 20 via the communication device 13, and stores the generated model M in the storage device 12.
As understood from the above description, the generation model M outputs statistically reasonable acoustic data Z for unknown control data D on the basis of the latent relationship between the control data Dt and the acoustic data Zt of the plurality of pieces of learning data T. That is, the generation model M is a statistical model that has learned the relationship between the control data D and the acoustic data Z.
B: embodiment 2
Embodiment 2 will be described. In the embodiments illustrated below, the same reference numerals as those in the description of embodiment 1 are used for the elements having the same functions as those in embodiment 1, and detailed descriptions thereof are appropriately omitted.
In embodiment 1, the feature value F is calculated according to rules corresponding to a single music theory. In embodiment 2, any one of a plurality of music theories is selectively applied to the calculation of the feature value F (that is, to the generation of the feature data Y). Fig. 9 is a block diagram illustrating a functional configuration of the sound processing system 10 according to embodiment 2. The instruction receiving unit 31 of embodiment 2 receives from the user an instruction to select any one of the plurality of music theories as the specific music theory. For example, the instruction receiving unit 31 receives from the user a selection of any one of a plurality of music theories including a 1st music theory and a 2nd music theory.
The 1st music theory is the music theory illustrated in embodiment 1. Specifically, the 1st music theory uses N element values E1 to EN including the element value E1 related to the pitch of each note of the musical composition, the element value E2 related to the category of the harmony function, and the element value E3 related to the musical sections of the composition. On the other hand, the types of element values En and the total number N of element values En of the 2nd music theory differ from those of the 1st music theory. Specifically, the 2nd music theory uses N element values E1 to EN including an element value E1 related to the instantaneous tempo of the musical composition, an element value E2 related to the volume of each note of the composition, and an element value E3 related to the duration of each note of the composition. The 2nd music theory is described, for example, in a 2010 research report of the Information Processing Society of Japan by Jin Taixian et al. concerning the analysis of melody and dynamics.
The feature data generating unit 33 calculates the feature value F for each unit period from the music data S according to the rules corresponding to the specific music theory selected by the user from among the plurality of music theories. Specifically, the feature extraction unit 331 of the feature data generation unit 33 calculates the N element values E1 to EN from the music data S according to the rules corresponding to the specific music theory, and calculates the feature value F by a calculation applied to the N element values E1 to EN. Specifically, as in embodiment 1, the feature extraction unit 331 calculates the feature value F0 as a weighted sum of the N element values E1 to EN, and calculates the feature value F as a moving average of the feature value F0. Feature data Y including the feature value F is generated for each unit period. The modification of the weighting values Wn or of the feature value sequence Q in response to an instruction from the user is the same as in embodiment 1.
The storage device 12 of embodiment 2 stores a plurality of (K) generation models M1 to MK corresponding to different music theories. The machine learning system 20 generates the generation models Mk (k = 1 to K) independently for each music theory, with the same configuration and operation as in embodiment 1. The K generation models M1 to MK generated by the machine learning system 20 are supplied to the sound processing system 10.
The acoustic data generating unit 35 of embodiment 2 generates the acoustic data Z using the generation model Mk that corresponds to the specific music theory selected by the user, from among the K generation models M1 to MK corresponding to different music theories. Specifically, the acoustic data generating unit 35 generates the acoustic data Z for each unit period by inputting the control data D into the generation model Mk corresponding to the specific music theory. For example, when the user selects the 1st music theory, the acoustic data generating unit 35 generates the acoustic data Z using the generation model M1 corresponding to the 1st music theory. On the other hand, when the user selects the 2nd music theory, the acoustic data generating unit 35 generates the acoustic data Z using the generation model M2 corresponding to the 2nd music theory. The operation of the other elements, such as the condition data generating unit 32 and the signal generating unit 36, is the same as in embodiment 1.
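Organizationally, this embodiment amounts to keeping each music theory paired with its rule set and its trained generation model Mk and dispatching on the user's selection. The registry below is purely hypothetical (the keys, file names, and model objects are invented for the sketch):

    import torch

    # Hypothetical registry pairing each music theory with its rule set and its trained model Mk.
    # model_m1 and model_m2 stand for already-loaded models (e.g. instances of the earlier sketch).
    def make_registry(model_m1, model_m2):
        return {
            "theory_1": {"rules": "rules_theory1.json", "model": model_m1},
            "theory_2": {"rules": "rules_theory2.json", "model": model_m2},
        }

    def generate_with_theory(registry, theory_name, control_data):
        """Generate acoustic data Z with the generation model Mk of the selected music theory."""
        model = registry[theory_name]["model"]
        model.eval()
        with torch.no_grad():
            return model(control_data)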
The same effects as those of embodiment 1 are achieved in embodiment 2. In embodiment 2, the acoustic data Z is generated using the generation model Mk corresponding to the specific music theory selected by the user from among the K generation models M1 to MK corresponding to different music theories. Accordingly, acoustic data Z that suits the musical intention or preference of the user can be generated. Any one of the plurality of music theories may also be selected according to factors other than an instruction from the user.
C: embodiment 3
The functional configuration of the sound processing system 10 according to embodiment 3 is the same as that of embodiment 2. That is, in embodiment 3, the feature data generating unit 33 generates the feature data Y according to rules corresponding to any one of a plurality of music theories, and the acoustic data generating unit 35 selectively uses the K generation models M1 to MK corresponding to different music theories.
Fig. 10 is an explanatory diagram relating to the operation of the sound processing system 10 according to embodiment 3. The user can designate a plurality of processing sections σ (σ1, σ2, …) within the musical composition by operating the operation device 15. The instruction receiving unit 31 receives an instruction concerning each processing section σ from the user. For example, the user can arbitrarily designate the start point and the end point of each processing section σ. Therefore, the position and length of each processing section σ on the time axis are variable. There may also be an arbitrary interval between two adjacent processing sections σ.
The user can select any one of the plurality of music theories for each of the plurality of processing sections σ by operating the operation device 15. The instruction receiving unit 31 receives from the user, for each of the plurality of processing sections σ, an instruction to select any one of the plurality of music theories as the specific music theory. For example, an instruction designating the 1st music theory is received for the processing section σ1, and an instruction designating the 2nd music theory is received for the processing section σ2, which is different from the processing section σ1. That is, in embodiment 3, the specific music theory is set independently for each processing section σ.
The feature data generating unit 33 of fig. 9 generates a time series of feature data Y for each unit period in each processing section σ according to the rules corresponding to the specific music theory instructed for that processing section σ. For example, for each unit period in the processing section σ1, the feature data generating unit 33 generates the feature data Y from the music data S according to the rules corresponding to the 1st music theory. For each unit period in the processing section σ2, the feature data generating unit 33 generates the feature data Y from the music data S according to the rules corresponding to the 2nd music theory.
The acoustic data generating unit 35 generates the acoustic data Z for each unit period within each processing section σ by using, from among the K generation models M1 to MK corresponding to different music theories, the generation model Mk corresponding to the specific music theory instructed for that processing section σ. For example, for each unit period in the processing section σ1, the acoustic data generating unit 35 generates acoustic data Z by inputting the control data D into the generation model M1 corresponding to the 1st music theory. For each unit period in the processing section σ2, the acoustic data generating unit 35 generates acoustic data Z by inputting the control data D into the generation model M2 corresponding to the 2nd music theory. Except that an independent music theory is applied to each processing section σ, embodiment 3 is the same as embodiment 2.
The same effects as those of embodiment 2 are achieved in embodiment 3. In embodiment 3, a music theory is designated independently for each processing section σ on the time axis. That is, the music theory applied to the generation of the feature data Y and the generation model Mk used to generate the acoustic data Z are set independently for each processing section σ. Therefore, acoustic data Z whose musical expression varies from one processing section σ to another can be generated.
D: embodiment 4
Fig. 11 is a block diagram illustrating a functional configuration of sound processing system 10 according to embodiment 4. The control device 11 according to embodiment 4 functions as the adjustment processing unit 34 in addition to the same elements (instruction receiving unit 31, condition data generating unit 32, feature data generating unit 33, acoustic data generating unit 35, and signal generating unit 36) as in embodiment 1.
The adjustment processing unit 34 generates control data D based on the condition data X and the feature data Y. The control data D is generated for each unit period. Specifically, the control data D for each unit period is generated based on the condition data X for that unit period and the feature data Y for that unit period. The adjustment processing unit 34 generates control data D by adjusting the condition data X based on the feature data Y. That is, the music condition indicated by the condition data X is changed according to the feature data Y. The adjustment of the condition data X using the feature data Y is performed by a predetermined algorithm defining the relationship among the condition data X, the feature data Y, and the control data D.
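The adjustment algorithm itself is left unspecified ("a predetermined algorithm defining the relationship among the condition data X, the feature data Y, and the control data D"). As a deliberately simple, hypothetical example, the feature value F of a unit period could scale one field of the condition data X to yield the control data D:

    import numpy as np

    def adjust_condition(x, f, field_index=1, gain=0.5):
        """Hypothetical adjustment rule for embodiment 4: derive control data D from X and feature value F."""
        d = np.array(x, dtype=float, copy=True)      # condition data X for one unit period
        d[field_index] *= 1.0 + gain * f             # e.g. emphasise an intensity-related field by F
        return d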
The acoustic data generating unit 35 generates the acoustic data Z by inputting the control data D generated by the adjustment processing unit 34 into the generation model M. That is, the generation model M of embodiment 4 is a trained model that has learned the relationship between the acoustic data Z and the control data D generated by adjusting the condition data X with the feature data Y.
The same effects as those of embodiment 1 are achieved in embodiment 4. In embodiment 4, control data D generated by adjusting condition data X corresponding to feature data Y is input to generation model M. Therefore, the generation model M having a structure in which the feature data Y is not input can be used for generating the acoustic data Z. For example, an existing generation model M in which the relationship between the condition data X and the sound data Z is learned can be used. In embodiment 1, control data D including condition data X and feature data Y is input to the generation model M. Therefore, there is an advantage that the processing load required for generating the control data D can be reduced as compared with embodiment 4 in which the condition data X is adjusted. The structure of embodiment 2 or 3 is also applicable to embodiment 4.
As understood from the illustrations of embodiment 1 and embodiment 4, the control data D is comprehensively represented as data corresponding to the music data S (condition data X) and the feature data Y. That is, both the control data D of embodiment 1 including the condition data X and the feature data Y and the control data D of embodiment 4 generated by adjustment of the condition data X corresponding to the feature data Y are included in the concept of "data corresponding to the music data S and the feature data Y".
E: embodiment 5
Fig. 12 is a block diagram illustrating the configuration of the feature data generation unit 33 according to embodiment 5. The feature data generation unit 33 of embodiment 5 uses a generation model 334 to generate the feature data Y. The generation model 334 is a trained model that has learned the relationship between the music data S and the feature data Y (feature value F) through machine learning. That is, the generation model 334 outputs statistically reasonable feature data Y in response to input of the music data S. The feature extraction unit 331 generates the feature data Y by inputting the music data S into the generation model 334. As in embodiment 1, the editing processing unit 332 edits the feature value sequence Q represented by the time series of the feature data Y in response to an instruction from the user.
The generation model 334 is composed of, for example, a deep neural network. For example, a neural network of any form, such as a recurrent or convolutional neural network, may be used as the generation model 334. The generation model 334 may be constructed from a combination of a plurality of deep neural networks. Additional elements such as long short-term memory (LSTM) units may be incorporated into the generation model 334.
The generation model 334 is implemented by a combination of a program that causes the control device 11 to execute the operation of generating the feature data Y from the music data S, and a plurality of variables (specifically, weighting values and biases) applied to that operation. The program that implements the generation model 334 and the plurality of variables of the generation model 334 are stored in the storage device 12. The values of the plurality of variables defining the generation model 334 are set in advance by machine learning.
The same effects as those of embodiment 1 are achieved also in embodiment 5. The structures of embodiment 2 to embodiment 4 can be applied to embodiment 5 as well.
F: embodiment 6
Fig. 13 is a block diagram illustrating the structure of information system 100 according to embodiment 6. The information system 100 according to embodiment 6 includes a sound processing system 10a, a sound processing system 10b, and a machine learning system 20. The sound processing system 10a is used by the user Ua, and the sound processing system 10b is used by the user Ub. The configuration of each of the sound processing system 10a and the sound processing system 10b is the same as that of the sound processing system 10 of embodiment 1.
The sound processing system 10a transmits posting data P relating to a musical composition desired by the user Ua to the machine learning system 20. The posting data P for one musical composition contains the music data S of that composition, a time series of feature data Y, and an acoustic signal a representing a target sound of the composition.
The feature data Y of the posting data P is data generated by the feature data generating unit 33 on the basis of the music data S of the posting data P and instructions from the user Ua. That is, the time series of the feature data Y represents the feature value sequence Q edited in accordance with the instructions from the user Ua. The user Ua instructs the editing of the feature value sequence Q so that it conforms to his or her own individual music theory. Therefore, the individual music theory of the user Ua is reflected in the time series of the feature data Y (the feature value sequence Q) included in the posting data P.
The acoustic signal a represents a target sound to which a musical expression corresponding to the time series of the feature data Y is added. Therefore, the individual music theory of the user Ua is also reflected in the acoustic signal a. For example, the acoustic signal a generated by the synthesis process Sa, including the editing of the feature value sequence Q in accordance with the instructions from the user Ua, is included in the posting data P. Alternatively, the user Ua may perform the musical composition according to his or her own music theory, and an acoustic signal a generated by recording that performance sound may be included in the posting data P.
As described above, the posting data P includes the music data S, which does not depend on any particular music theory, and the feature data Y and the acoustic signal a, which reflect the individual music theory of the user Ua. The control device 11 of the sound processing system 10a transmits the posting data P described above from the communication device 13 to the machine learning system 20. The control device 21 of the machine learning system 20 receives the posting data P transmitted from the sound processing system 10a via the communication device 23, and stores the posting data P in the storage device 22. The transmission of the posting data P by the sound processing system 10a is repeated in response to instructions from the user Ua. Accordingly, the storage device 22 of the machine learning system 20 stores a plurality of pieces of posting data P of the user Ua.
Fig. 14 is a block diagram illustrating the functional structure of the machine learning system 20 of embodiment 6. The control device 21 functions as a learning data acquisition unit 51 and a learning processing unit 52. The learning data acquisition unit 51 generates a plurality of learning data T from the plurality of posting data P of the user Ua. The learning data acquisition unit 51 includes a condition data generating unit 511 and an acoustic data generating unit 513. As in embodiment 1, the condition data generating unit 511 generates condition data Xt for each unit period from the music data S of each posting data P. Control data Dt for learning, which includes the condition data Xt generated by the condition data generating unit 511 and the feature data Y in the posting data P, is then generated.
As in embodiment 1, the acoustic data generating unit 513 generates acoustic data Zt for each unit period from the acoustic signal A of each posting data P. Learning data T including the control data Dt for learning and the acoustic data Zt generated by the acoustic data generating unit 513 is generated for each unit period. By performing the above processing for each of the plurality of posting data P of the user Ua, a plurality of learning data T corresponding to different musical compositions or different unit periods are generated for the user Ua. The individual music theory of the user Ua is reflected in the plurality of learning data T (specifically, in the feature data Y and the acoustic data Zt) of the user Ua.
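A minimal sketch of how one learning datum T per unit period might be assembled is shown below. It assumes that the condition data Xt, the feature data Y, and the acoustic data Zt have already been derived for each unit period; all names and shapes are illustrative only.

    import numpy as np

    def build_learning_data(xt_series, y_series, zt_series):
        """Yield (control data Dt, acoustic data Zt) pairs, i.e. learning data T, per unit period."""
        for xt, y, zt in zip(xt_series, y_series, zt_series):
            # control data Dt for learning = condition data Xt + feature data Y
            dt = np.concatenate([np.atleast_1d(xt), np.atleast_1d(y)])
            yield dt, zt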
The learning processing unit 52 creates the generation model M of the user Ua by the learning process Sb described above, using the plurality of learning data T of the user Ua. That is, a process (Sb1) of generating a plurality of learning data T from the plurality of posting data P of the user Ua and a process (Sb2 to Sb7) of creating the generation model M of the user Ua from the plurality of learning data T are executed. As a result, a generation model M capable of generating acoustic data Z in accordance with the individual music theory of the user Ua is obtained. The generation model M of the user Ua is transmitted from the communication device 23 to the sound processing system 10b of the user Ub (Sb8).
The control device 11 of the sound processing system 10b receives the generation model M transmitted from the machine learning system 20 via the communication device 13 and stores the generation model M in the storage device 12. The control device 11 generates the acoustic signal A by the synthesis process Sa using this generation model M. Specifically, the synthesis process Sa using the generation model M of the user Ua is performed on the music data S of a musical composition desired by the user Ub. Accordingly, an acoustic signal A reflecting the music theory of the user Ua is generated for the musical composition desired by the user Ub.
In the above description, the generation model M of a single user Ua is created and used, but a system in which a generation model M is created independently for each of a plurality of users is also conceivable. For example, a plurality of posting data P are transmitted to the machine learning system 20 from a plurality of sound processing systems 10 used by different users. Then, for each user, the process (Sb1) of generating a plurality of learning data T from the plurality of posting data P and the process (Sb2 to Sb7) of creating a generation model M from the plurality of learning data T are executed. Of the plurality of generation models M created for the different users, the generation model M of the user selected by the user Ub is transmitted to the sound processing system 10b of the user Ub. Alternatively, the plurality of generation models M of the different users may be transmitted from the machine learning system 20 to the sound processing system 10b, and the generation model M of the user selected by the user Ub may be selectively used in the synthesis process Sa. The structure of embodiment 4 can be applied to embodiment 6 as well.
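The per-user handling described above can be pictured as a simple registry mapping each contributing user to a trained generation model, from which the model selected by user Ub is retrieved for the synthesis process Sa. The dictionary-based registry below is purely illustrative and not part of the embodiment.

    # Hypothetical per-user model registry on the machine learning system side.
    models_by_user: dict = {}

    def register_model(user_id: str, model) -> None:
        """Store the generation model M created from one user's posting data P."""
        models_by_user[user_id] = model

    def model_for_synthesis(selected_user_id: str):
        """Return the generation model M of the user selected by user Ub."""
        return models_by_user[selected_user_id]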
G: modification examples
Specific modifications that can be added to the embodiments illustrated above are exemplified below. Two or more modes arbitrarily selected from the following examples may be combined as appropriate, as long as they do not contradict each other.
(1) In each of the above embodiments, the acoustic data generating unit 35 generates the acoustic data Z using a single generation model M, but the specific configuration of the generation model M is not limited to this example. For example, the following generation models M (Ma, Mb) may be used.
[Modification 1]
Fig. 15 is a block diagram of a generation model Ma according to modification 1. The control data D including the condition data X and the feature data Y is supplied to the generation model Ma. The generation model Ma of modification 1 has a 1st model Ma1 and a 2nd model Ma2. The 1st model Ma1 and the 2nd model Ma2 are each constituted by a deep neural network such as a recurrent neural network or a convolutional neural network.
The 1st model Ma1 is a trained model that has learned the relationship between the condition data X and intermediate data V by machine learning. The intermediate data V represents features of the condition data X. The acoustic data generating unit 35 generates the intermediate data V for each unit period by inputting the condition data X of the control data D into the 1st model Ma1. That is, the 1st model Ma1 functions as an encoder that generates the intermediate data V from the condition data X.
Control data R including the intermediate data V output by the 1st model Ma1 and the feature data Y of the control data D is supplied to the 2nd model Ma2. The 2nd model Ma2 is a trained model that has learned the relationship between the control data R and the acoustic data Z by machine learning. The acoustic data generating unit 35 generates the acoustic data Z for each unit period by inputting the control data R into the 2nd model Ma2. That is, the 2nd model Ma2 functions as a decoder that generates the acoustic data Z from the control data R.
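A minimal sketch of such an encoder-decoder arrangement is shown below, assuming the PyTorch library; the layer types, dimensions, and class names are arbitrary choices for illustration and are not specified by the embodiment.

    import torch
    import torch.nn as nn

    class Model1(nn.Module):
        """1st model Ma1: encoder from condition data X to intermediate data V."""
        def __init__(self, x_dim: int = 64, v_dim: int = 32):
            super().__init__()
            self.rnn = nn.GRU(x_dim, v_dim, batch_first=True)

        def forward(self, x):            # x: (batch, time, x_dim)
            v, _ = self.rnn(x)
            return v                     # intermediate data V per unit period

    class Model2(nn.Module):
        """2nd model Ma2: decoder from control data R = (V, Y) to acoustic data Z."""
        def __init__(self, v_dim: int = 32, y_dim: int = 1, z_dim: int = 80):
            super().__init__()
            self.out = nn.Sequential(nn.Linear(v_dim + y_dim, 128), nn.ReLU(), nn.Linear(128, z_dim))

        def forward(self, v, y):         # y: (batch, time, y_dim) feature data Y
            r = torch.cat([v, y], dim=-1)  # control data R
            return self.out(r)           # acoustic data Z per unit period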
[Modification 2]
Fig. 16 is a block diagram of a generation model Mb according to modification 2. The control data D including the condition data X and the feature data Y is supplied to the generation model Mb. The generation model Mb of modification 2 has a time model Mb1, a volume model Mb2, a pitch model Mb3, and an acoustic model Mb4. Each of the models (Mb1 to Mb4) of the generation model Mb is constituted by a deep neural network such as a recurrent neural network or a convolutional neural network.
The time model Mb1 is a trained model that has learned the relationship between the control data D and time data V1 by machine learning. The time data V1 specifies the time, on the time axis, of each sounding point of the target sound. For example, the time data V1 specifies a time difference between the start point of a note specified by the condition data X of the control data D and each sounding point of the target sound. The acoustic data generating unit 35 generates the time data V1 for each unit period by inputting the control data D into the time model Mb1.
Control data R2 including the time data V1 output by the time model Mb1 and the control data D is supplied to the volume model Mb2. The volume model Mb2 is a trained model that has learned the relationship between the control data R2 and volume data V2 by machine learning. The volume data V2 specifies the volume of the target sound. The acoustic data generating unit 35 generates the volume data V2 for each unit period by inputting the control data R2 into the volume model Mb2.
Control data R3 including the volume data V2 output by the volume model Mb2 and the control data D is supplied to the pitch model Mb3. The pitch model Mb3 is a trained model that has learned the relationship between the control data R3 and pitch data V3 by machine learning. The pitch data V3 specifies the pitch of the target sound. The acoustic data generating unit 35 generates the pitch data V3 for each unit period by inputting the control data R3 into the pitch model Mb3. The control data R3 may also include the time data V1.
Control data R4 including the time data V1, the volume data V2, the pitch data V3, and the control data D is supplied to the acoustic model Mb4. The acoustic model Mb4 is a trained model that has learned the relationship between the control data R4 and the acoustic data Z by machine learning. The acoustic data generating unit 35 generates the acoustic data Z for each unit period by inputting the control data R4 into the acoustic model Mb4.
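The cascade can be sketched as four submodules whose outputs are successively appended to the control data D, again assuming the PyTorch library; the layer sizes and module structure below are illustrative assumptions.

    import torch
    import torch.nn as nn

    def mlp(in_dim: int, out_dim: int) -> nn.Sequential:
        return nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, out_dim))

    class GenerationModelMb(nn.Module):
        def __init__(self, d_dim: int = 65, z_dim: int = 80):
            super().__init__()
            self.time_model = mlp(d_dim, 1)              # Mb1: control data D -> time data V1
            self.volume_model = mlp(d_dim + 1, 1)        # Mb2: control data R2 = (D, V1) -> volume data V2
            self.pitch_model = mlp(d_dim + 1, 1)         # Mb3: control data R3 = (D, V2) -> pitch data V3
            self.acoustic_model = mlp(d_dim + 3, z_dim)  # Mb4: control data R4 = (D, V1, V2, V3) -> acoustic data Z

        def forward(self, d):                            # d: (batch, time, d_dim) control data D
            v1 = self.time_model(d)
            v2 = self.volume_model(torch.cat([d, v1], dim=-1))
            v3 = self.pitch_model(torch.cat([d, v2], dim=-1))
            return self.acoustic_model(torch.cat([d, v1, v2, v3], dim=-1))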
As is understood from the above examples, the generation model M (Ma, Mb) is a general term for a statistical model that outputs the acoustic data Z corresponding to the control data D, and its specific configuration is arbitrary.
(2) The kind of the element value En is not limited to the above example. For example, in embodiment 1, the feature data generating unit 33 may generate the element values E4 and E5 shown below in addition to the element values E1 to E3.
For example, the musical tension tends to increase or decrease in the vicinity of the end point of each of the music pieces (sections) constituting a musical composition. In view of this tendency, the element value E4 is a value that increases or decreases over a period of prescribed length located at the end of each music piece of the musical composition. Similarly, the musical tension tends to increase or decrease in the vicinity of the end point of the musical composition itself. In view of this tendency, the element value E5 is a value that increases or decreases over a period of prescribed length located at the end of the musical composition.
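For example, such an element value could be realized as a ramp over the final unit periods of a span; the ramp shape and length below are arbitrary choices for illustration.

    import numpy as np

    def end_ramp(num_periods: int, ramp_len: int, rising: bool = True) -> np.ndarray:
        """Element value that increases (or decreases) over a prescribed length at the end of a span."""
        values = np.zeros(num_periods)
        ramp = np.linspace(0.0, 1.0, min(ramp_len, num_periods))
        values[num_periods - len(ramp):] = ramp if rising else ramp[::-1]
        return values

Under this sketch, the element value E4 would apply such a ramp at the end of each music piece, and the element value E5 at the end of the musical composition as a whole.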
In the above embodiments, a numerical value corresponding to the category of the harmony function (tonic/dominant/subdominant) of the chord specified by the music data S is exemplified as the element value E2, but the functional classification of chords represented by the element value E2 is not limited to this example. For example, further categories may be added to the three kinds of harmony functions illustrated in the foregoing embodiments. Further, the chords may be classified according to various attributes other than the harmony functions illustrated above, such as the degree (I to VII) of the chord, the index, added chords (e.g., seventh chords), borrowed chords, and altered chords, and the feature extraction unit 331 may set a value corresponding to the classification to which the chord specified by the music data S belongs as the element value E2.
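A rule that digitizes the harmony-function category of each chord could be as simple as a lookup table; the numeric assignments below are arbitrary and merely illustrative of one possible rule.

    # Hypothetical numeric coding of harmony functions for element value E2.
    HARMONY_FUNCTION_VALUE = {
        "tonic": 0.0,        # stable
        "subdominant": 0.5,  # intermediate tension
        "dominant": 1.0,     # high tension, resolving toward the tonic
    }

    def element_value_e2(chord_function: str) -> float:
        """Return the element value E2 for a chord's harmony-function category."""
        return HARMONY_FUNCTION_VALUE.get(chord_function, 0.0)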
(3) In the above embodiments, the moving average of the feature value F0 is calculated as the feature value F, but the method of calculating the feature value F from the feature value F0 (that is, the method of smoothing the time series of the feature value F0) is not limited to this example. For example, the feature value F may be calculated by applying low-pass filtering or interpolation other than the moving average to the feature value F0. Examples of the low-pass filtering include processing using a first-order lag system, convolution with a Gaussian distribution, and reduction of high-frequency components in the frequency domain. Examples of the interpolation include various kinds of processing such as Lagrange interpolation and spline interpolation. The above smoothing of the time series of the feature value F0 may also be omitted. That is, in embodiment 1, the value calculated by equation (1) described above may be determined as the feature value F.
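Some of these smoothing alternatives could look like the following, assuming NumPy and SciPy are available; the window size, kernel width, and lag coefficient are arbitrary example values, and Lagrange or spline interpolation could be substituted in a similar manner.

    import numpy as np
    from scipy.ndimage import gaussian_filter1d

    def smooth_moving_average(f0: np.ndarray, window: int = 5) -> np.ndarray:
        """Moving average of the feature value F0, as in embodiment 1."""
        kernel = np.ones(window) / window
        return np.convolve(f0, kernel, mode="same")

    def smooth_gaussian(f0: np.ndarray, sigma: float = 2.0) -> np.ndarray:
        """Low-pass filtering by convolution with a Gaussian distribution."""
        return gaussian_filter1d(f0, sigma)

    def smooth_first_order_lag(f0: np.ndarray, alpha: float = 0.2) -> np.ndarray:
        """Low-pass filtering using a first-order lag system (exponential smoothing)."""
        out = np.empty(len(f0), dtype=float)
        acc = float(f0[0])
        for i, v in enumerate(f0):
            acc = alpha * float(v) + (1.0 - alpha) * acc
            out[i] = acc
        return out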
(4) In the above embodiments, the feature data Y includes a single feature value F, but a configuration in which the feature data Y includes a plurality of feature values F is also conceivable. For example, the N element values E1 to EN of the above embodiments may be included in the feature data Y as distinct feature values F. That is, the processing (for example, the weighted-sum operation) that integrates the N element values E1 to EN into one feature value F may be omitted. In the configuration in which the feature data Y includes a plurality of feature values F, the generation model M is created by the learning process Sb described above using control data Dt for learning that includes such feature data Y. With this generation model M, acoustic data Z reflecting different musical perspectives in a multidimensional manner can be generated.
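The difference between integrating the element values into a single feature value and passing them through as a vector can be summarized as follows; the weights and element count are placeholders.

    import numpy as np

    def feature_scalar(element_values: np.ndarray, weights: np.ndarray) -> float:
        """Embodiment 1 style: one feature value F as a weighted sum of E1 to EN."""
        return float(np.dot(weights, element_values))

    def feature_vector(element_values: np.ndarray) -> np.ndarray:
        """Modification style: feature data Y carries the N element values directly as feature values F."""
        return element_values.copy()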
(5) In the above embodiments, the acoustic data Z represents the frequency characteristics of the target sound, but the information represented by the acoustic data Z is not limited to this example. For example, a configuration in which the acoustic data Z represents the samples of the target sound is also conceivable. In this configuration, the time series of the acoustic data Z constitutes the acoustic signal A, and the signal generating unit 36 is therefore omitted.
(6) In the above-described embodiments, the learning data acquisition unit 51 of the machine learning system 20 generates the learning data T from the basic data B. However, in a configuration in which the learning data T is generated by an external device, the element that receives the learning data T from the external device via the communication device 23, or the element that reads the received learning data T from the storage device 22, corresponds to the learning data acquisition unit 51. That is, the "acquisition" of the learning data T by the learning data acquisition unit 51 covers any operation of obtaining the learning data T, such as generation, reception, or reading of the learning data T.
(7) In embodiment 2, any one of the K generation models M1 to MK stored in the storage device 12 of the sound processing system 10 is selectively used in the synthesis process Sa, but the configuration for selectively using the K generation models M1 to MK is not limited to this example. For example, the generation model selected by the user from among the K generation models M1 to MK stored in the machine learning system 20 may be transmitted to the sound processing system 10 and used in the synthesis process Sa. That is, the sound processing system 10 does not need to store the K generation models M1 to MK.
(8) In the above embodiments, the feature data Y of each unit period is generated from the music data S, but the feature data generating unit 33 may generate the feature data Y of each unit period from the condition data X of each unit period.
(9) In embodiment 6, the posting data P includes the acoustic signal A, but a configuration in which the posting data P includes the time series of the acoustic data Z instead of the acoustic signal A is also conceivable. In the configuration in which the posting data P includes the time series of the acoustic data Z, the acoustic data generating unit 513 of the learning data acquisition unit 51 in fig. 14 is omitted.
(10) In the above embodiments, a deep neural network is exemplified as the generation model M, but the generation model M is not limited to a deep neural network. For example, any type of statistical model such as an HMM (Hidden Markov Model) or an SVM (Support Vector Machine) may be used as the generation model M. Similarly, the form and type of the generation model 334 of embodiment 5 are arbitrary.
(11) In the above embodiments, the machine learning system 20 creates the generation model M, but the functions for creating the generation model M (the learning data acquisition unit 51 and the learning processing unit 52) may instead be implemented in the sound processing system 10. The function of creating the generation model 334 of embodiment 5 may likewise be implemented in the sound processing system 10.
(12) The sound processing system 10 may also be implemented by a server device that communicates with an information device such as a smartphone or a tablet terminal. For example, the sound processing system 10 receives the music data S from the information device and generates the acoustic signal A by applying the synthesis process Sa to the music data S. The sound processing system 10 then transmits the acoustic signal A generated by the synthesis process Sa to the information device. In a configuration in which the signal generating unit 36 is implemented in the information device, the time series of the acoustic data Z generated by the synthesis process Sa is transmitted to the information device. That is, the signal generating unit 36 is omitted from the sound processing system 10.
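A server-side arrangement of the synthesis process Sa could be sketched with a lightweight HTTP endpoint, assuming the Flask library; the route, media type, and synthesize() placeholder are all assumptions and not part of the embodiment.

    import io
    from flask import Flask, request, send_file

    app = Flask(__name__)

    def synthesize(music_data: bytes) -> bytes:
        """Placeholder for the synthesis process Sa (generation model plus signal generation)."""
        raise NotImplementedError  # the actual synthesis pipeline would be invoked here

    @app.post("/synthesize")
    def synthesize_endpoint():
        music_data = request.get_data()           # music data S sent from the information device
        acoustic_signal = synthesize(music_data)  # acoustic signal A produced by the synthesis process Sa
        return send_file(io.BytesIO(acoustic_signal), mimetype="audio/wav")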
(13) As described above, the functions of the sound processing system 10 (the instruction receiving unit 31, the condition data generating unit 32, the feature data generating unit 33, the acoustic data generating unit 35, and the signal generating unit 36) are realized by cooperation between the single or plural processors constituting the control device 11 and the program stored in the storage device 12. Likewise, the functions of the machine learning system 20 (the learning data acquisition unit 51 and the learning processing unit 52) are realized by cooperation between the single or plural processors constituting the control device 21 and the program stored in the storage device 22.
The above program may be provided in a form stored on a computer-readable recording medium and installed on a computer. The recording medium is, for example, a non-transitory recording medium, preferably an optical recording medium (optical disc) such as a CD-ROM, but also includes semiconductor recording media, magnetic recording media, and any other known types of recording media. The non-transitory recording medium includes any recording medium other than a transitory propagating signal, and does not exclude volatile recording media. In a configuration in which a transmission device transmits the program via the communication network 200, a recording medium storing the program in the transmission device corresponds to the aforementioned non-transitory recording medium.
H: appendix
From the embodiments exemplified above, the following configurations (aspects) are derived, for example.
In a sound processing method according to one aspect (aspect 1) of the present invention, an instruction from a user is received; feature data including feature values representing musical features of a musical composition is generated from music data representing the musical composition in accordance with the instruction from the user and a rule corresponding to a specific music theory; and acoustic data representing a sound corresponding to the music data is generated by inputting control data corresponding to the music data and the feature data into a generation model for which machine learning has been completed. In this aspect, the feature data is generated in accordance with the instruction from the user and the rule corresponding to the specific music theory, and the acoustic data is generated by inputting the control data corresponding to the music data and the feature data into the generation model. Therefore, compared with a configuration in which the musical expression added to the acoustic data depends only on a single music theory, acoustic data with various expressions reflecting the instruction from the user can be generated.
In a specific example (aspect 2) of aspect 1, condition data representing a condition specified by the music data is generated, and in the generation of the acoustic data, the control data including the condition data and the feature data is input into the generation model. In this aspect, since the control data including the condition data and the feature data is input into the generation model, the processing load required for generating the control data can be reduced compared with a configuration in which condition data adjusted in accordance with the feature data is input into the generation model as the control data.
In a specific example (aspect 3) of aspect 1, condition data representing a condition specified by the music data is generated, and the control data is generated by adjusting the condition data in accordance with the feature data. In this aspect, since the control data generated by adjusting the condition data using the feature data is input into the generation model, a generation model that is not configured to receive the feature data itself can be used for generating the acoustic data.
In a specific example (aspect 4) of any one of aspects 1 to 3, in the receiving of the instruction, an instruction relating to weighted values is received from the user, and in the generating of the feature data, a plurality of element values representing different kinds of musical features related to the music theory are generated from the music data, and the feature value is calculated as a weighted sum of the plurality of element values using the weighted values received from the user. In this aspect, the feature value is calculated as a weighted sum, using the weighted values instructed by the user, of a plurality of element values related to the specific music theory. That is, the user can adjust the degree to which each of the plurality of element values affects the feature value. Therefore, acoustic data conforming to the user's musical intention or preference can be generated.
In a specific example (aspect 5) of aspect 4, the plurality of element values include an element value that digitizes the category of the harmony function related to a chord of the musical composition, and the rule corresponding to the music theory includes a rule for digitizing the category of the harmony function.
In a specific example (aspect 6) of aspect 4 or 5, the plurality of element values include element values corresponding to each of a plurality of music pieces constituting the musical composition, and the rule corresponding to the music theory includes a rule related to the element value corresponding to each music piece.
In a specific example (aspect 7) of any one of aspects 1 to 6, in the receiving of the instruction, an instruction to select any one of a plurality of music theories as the specific music theory is received from the user, and in the generating of the acoustic data, the acoustic data is generated using the generation model corresponding to the specific music theory among a plurality of generation models corresponding respectively to the plurality of music theories. In this aspect, the acoustic data is generated using the generation model corresponding to the music theory selected by the user among the plurality of generation models corresponding to different music theories. Therefore, acoustic data conforming to the user's musical intention or preference can be generated.
In a specific example (aspect 8) of aspect 7, in the receiving of the instruction, an instruction to select any one of the plurality of music theories as the specific music theory is received from the user for each of a plurality of processing sections on a time axis; the feature data is generated for each of the plurality of processing sections according to the rule corresponding to the specific music theory instructed for that processing section; and the acoustic data is generated for each of the plurality of processing sections by using the generation model corresponding to the specific music theory among the plurality of generation models. In this aspect, the music theory is indicated independently for each processing section on the time axis. That is, the music theory applied to the generation of the feature data, and the generation model used to generate the acoustic data, are set independently for each processing section. Therefore, acoustic data whose musical expression varies from one processing section to another can be generated.
A sound processing system according to one aspect (aspect 9) of the present invention includes: an instruction receiving unit that receives an instruction from a user; a feature data generation unit that generates feature data including feature values representing musical features of a musical composition from music data representing the musical composition, based on the instruction from the user and a rule corresponding to a specific music theory; and a sound data generation unit that generates sound data representing a sound corresponding to the music data by inputting control data corresponding to the music data and the feature data into a generation model for which machine learning has been completed.
A program according to one aspect (aspect 10) of the present invention causes a computer system to function as: an instruction receiving unit that receives an instruction from a user; a feature data generation unit that generates feature data including feature values representing musical features of a musical composition from music data representing the musical composition, based on the instruction from the user and a rule corresponding to a specific music theory; and a sound data generation unit that generates sound data representing a sound corresponding to the music data by inputting control data corresponding to the music data and the feature data into a generation model for which machine learning has been completed.
A method for creating a generation model according to one aspect (aspect 11) of the present invention acquires learning data including learning control data and learning acoustic data, and creates, by machine learning using the learning data, a generation model that outputs acoustic data in response to input of control data, wherein the learning control data includes condition data representing a condition specified by music data representing a musical composition, and feature data including feature values representing musical features of the musical composition, and the acoustic data represents a sound corresponding to the music data.
Description of the reference numerals
100 … information system, 10a, 10b … sound processing system, 11, 21 … control device, 12, 22 … storage device, 13, 23 … communication device, 14 … playback device, 15 … operation device, 16 … display device, 20 … machine learning system, 31 … instruction receiving part, 32 … condition data generating part, 33 … characteristic data generating part, 331 … characteristic extracting part, 332 … editing processing part, 334 … generating model, 34 … adjusting processing part, 35 … sound data generating part, 36 … signal generating part, 51 … learning data obtaining part, 511 … condition data generating part, 512 … characteristic data generating part, 513 … sound data generating part, 52 … learning processing part, M, Ma, Mb … generation model.

Claims (11)

1. A sound processing method realized by a computer system, wherein
an instruction from a user is received,
feature data including feature values representing musical features of a musical composition is generated from music data representing the musical composition according to the instruction from the user and a rule corresponding to a specific music theory, and
acoustic data representing a sound corresponding to the music data is generated by inputting control data corresponding to the music data and the feature data into a generation model for which machine learning is completed.
2. The sound processing method according to claim 1, wherein,
condition data representing a condition specified by the music data is further generated, and
in the generation of the acoustic data, the control data including the condition data and the feature data is input to the generation model.
3. The sound processing method according to claim 1, wherein,
condition data representing a condition specified by the music data is further generated, and
the control data is generated by adjusting the condition data in accordance with the feature data.
4. The sound processing method according to any one of claims 1 to 3, wherein,
in the receiving of the instruction, an instruction relating to a weighted value is received from the user,
in the generation of the feature data,
a plurality of element values representing different kinds of musical features related to the music theory are generated from the music data, and
the feature value is calculated as a weighted sum of the plurality of element values using the weighted value received from the user.
5. The sound processing method according to claim 4, wherein,
the plurality of element values include element values that digitize categories of harmony functions related to chords of the musical composition,
the rules corresponding to the music theory include a rule for digitizing the categories of the harmony functions.
6. The sound processing method according to claim 4 or 5, wherein,
the plurality of element values include element values respectively corresponding to a plurality of music pieces constituting the musical composition,
the rule corresponding to the music theory includes a rule related to an element value corresponding to each of the music pieces.
7. The sound processing method according to any one of claims 1 to 6, wherein,
in the receiving of the instruction, an instruction to select any one of a plurality of music theories as the specific music theory is received from the user,
in the generating of the acoustic data, the acoustic data is generated using the generation model corresponding to the specific music theory among a plurality of generation models corresponding respectively to the plurality of music theories.
8. The sound processing method according to claim 7, wherein,
in the receiving of the instruction, an instruction to select any one of the plurality of music theories as the specific music theory is received from the user for each of a plurality of processing sections on a time axis,
in the generation of the feature data, the feature data is generated for each of the plurality of processing sections according to a rule corresponding to the specific music theory instructed for the processing section,
in the generating of the acoustic data, the acoustic data is generated for each of the plurality of processing sections by using the generation model corresponding to the specific music theory among the plurality of generation models.
9. A sound processing system, comprising:
an instruction receiving unit that receives an instruction from a user;
a feature data generation unit that generates feature data including feature values representing musical features of a musical composition from musical composition data representing the musical composition, based on an instruction from the user and a rule corresponding to a specific musical theory; and
a sound data generation unit that generates sound data representing a sound corresponding to the music data by inputting control data corresponding to the music data and the feature data into a generation model for which machine learning is completed.
10. A program for causing a computer system to function as:
an instruction receiving unit that receives an instruction from a user;
a feature data generation unit that generates feature data including feature values representing musical features of a musical composition from musical composition data representing the musical composition, based on an instruction from the user and a rule corresponding to a specific musical theory; and
a sound data generation unit that generates sound data representing a sound corresponding to the music data by inputting control data corresponding to the music data and the feature data into a generation model for which machine learning is completed.
11. A method for creating a generation model, implemented by a computer system, wherein
learning data including learning control data and learning acoustic data is acquired, and
a generation model that outputs acoustic data in response to input of control data is created by machine learning using the learning data, wherein
the learning control data includes:
condition data representing a condition specified by music data representing a musical composition; and
feature data including feature values representing musical features of the musical composition, and
the acoustic data represents a sound corresponding to the music data.
CN202280024965.7A 2021-03-26 2022-03-10 Sound processing method, sound processing system, program, and method for creating generation model Pending CN117121089A (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
JP2021052666A JP2022150179A (en) 2021-03-26 2021-03-26 Acoustic processing method, acoustic processing system, program, and generation model establishing method
JP2021-052666 2021-03-26
PCT/JP2022/010666 WO2022202374A1 (en) 2021-03-26 2022-03-10 Acoustic processing method, acoustic processing system, program, and method for establishing generation model

Publications (1)

Publication Number Publication Date
CN117121089A true CN117121089A (en) 2023-11-24

Family

ID=83395686

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202280024965.7A Pending CN117121089A (en) 2021-03-26 2022-03-10 Sound processing method, sound processing system, program, and method for creating generation model

Country Status (3)

Country
JP (1) JP2022150179A (en)
CN (1) CN117121089A (en)
WO (1) WO2022202374A1 (en)

Also Published As

Publication number Publication date
JP2022150179A (en) 2022-10-07
WO2022202374A1 (en) 2022-09-29
