CN107045867B - Automatic composition method and device and terminal equipment


Info

Publication number
CN107045867B
Authority
CN
China
Prior art keywords
frame
music
value
note
difference
Prior art date
Legal status
Active
Application number
CN201710175115.8A
Other languages
Chinese (zh)
Other versions
CN107045867A (en)
Inventor
何江聪
潘青华
胡国平
胡郁
刘庆峰
Current Assignee
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date
Filing date
Publication date
Application filed by iFlytek Co Ltd
Priority to CN201710175115.8A
Publication of CN107045867A
Application granted
Publication of CN107045867B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H 1/00 Details of electrophonic musical instruments
    • G10H 1/0008 Associated control or indicating means
    • G10H 1/0025 Automatic or semi-automatic music composition, e.g. producing random music, applying rules from music theory or modifying a musical piece
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 characterised by the type of extracted parameters
    • G10L 25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 characterised by the analysis technique
    • G10L 25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 characterised by the analysis technique using neural networks
    • G10L 25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 specially adapted for particular use

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Signal Processing (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Theoretical Computer Science (AREA)
  • Auxiliary Devices For Music (AREA)

Abstract

The application provides an automatic composition method, an automatic composition device and terminal equipment, wherein the automatic composition method comprises the following steps: receiving a music file of a front piece of music to be predicted, wherein the music file of the front piece of music to be predicted comprises audio data or music description information of the front piece of music to be predicted; extracting frame-level audio features of music corresponding to the music file; according to the frame-level audio features and a pre-constructed music frequency band feature combination model, obtaining frame-level audio features carrying frequency band information; and obtaining the predicted music according to the frame-level audio features carrying the frequency band information and a pre-constructed music prediction model so as to realize automatic composition. The method and the device can realize automatic composition, further improve the efficiency and feasibility of automatic composition, and reduce the influence of subjective factors on automatic composition.

Description

Automatic composition method and device and terminal equipment
Technical Field
The present application relates to the field of audio signal processing technologies, and in particular, to an automatic composition method, an automatic composition device, and a terminal device.
Background
With the application of computer technology to music processing, computer music has come into being. As a new form of art, computer music has gradually penetrated various aspects of music creation, instrument performance, education and entertainment. Automatic music composition using artificial intelligence technology is a relatively new research direction within computer music and has received considerable attention from researchers in related fields in recent years.
Existing automatic composition methods based on artificial intelligence technology mainly fall into two categories: automatic composition based on heuristic search and automatic composition based on genetic algorithms. However, automatic composition based on heuristic search is only suitable for short pieces of music, since the search efficiency decreases exponentially as the length of the music grows, so the feasibility of the method is poor for longer pieces; automatic composition based on genetic algorithms inherits some typical defects of genetic algorithms, such as strong dependence on the initial population and difficulty in accurately selecting suitable genetic operators.
Disclosure of Invention
The present application aims to solve at least one of the technical problems in the related art to some extent.
To this end, a first object of the present application is to propose an automatic composition method. The method realizes automatic composition by constructing the music frequency band characteristic combination model and the music prediction model, is a brand-new automatic composition method, and solves the problems of low efficiency, poor feasibility, large subjective influence and the like in the prior art.
A second object of the present application is to provide an automatic composition device.
A third object of the present application is to provide a terminal device.
A fourth object of the present application is to propose a storage medium containing computer executable instructions.
In order to achieve the above object, an automatic composition method according to an embodiment of the first aspect of the present application includes: receiving a music file of a front piece of music to be predicted, wherein the music file of the front piece of music to be predicted comprises audio data or music description information of the front piece of music to be predicted; extracting frame-level audio features of music corresponding to the music file; according to the frame-level audio features and a pre-constructed music frequency band feature combination model, obtaining frame-level audio features carrying frequency band information; and obtaining the predicted music according to the frame-level audio features carrying the frequency band information and a pre-constructed music prediction model so as to realize automatic composition.
According to the automatic composition method, after a music file of a front section of music to be predicted is received, frame-level audio features of the music corresponding to the music file are extracted, then frame-level audio features carrying frequency band information are obtained according to the frame-level audio features and a pre-constructed music frequency band feature combination model, and finally predicted music is obtained according to the frame-level audio features carrying frequency band information and a pre-constructed music prediction model, so that automatic composition can be achieved, the efficiency and feasibility of automatic composition can be improved, and the influence of subjective factors on automatic composition is reduced.
In order to achieve the above object, an automatic composition apparatus according to an embodiment of the second aspect of the present application includes: the device comprises a receiving module, a prediction module and a prediction module, wherein the receiving module is used for receiving a music file of a front piece of music to be predicted, and the music file of the front piece of music to be predicted comprises audio data or music description information of the front piece of music to be predicted; the extracting module is used for extracting the frame-level audio features of the music corresponding to the music file received by the receiving module; the obtaining module is used for obtaining frame-level audio features carrying frequency band information according to the frame-level audio features and a pre-constructed music frequency band feature combination model; and obtaining the predicted music according to the frame-level audio features carrying the frequency band information and a pre-constructed music prediction model so as to realize automatic composition.
In the automatic composition device according to the embodiment of the application, after the receiving module receives a music file of a piece of music to be predicted, the extracting module extracts frame-level audio features of the music corresponding to the music file, the obtaining module obtains the frame-level audio features carrying band information according to the frame-level audio features and a pre-constructed music band feature combination model, and obtains the predicted music according to the frame-level audio features carrying band information and the pre-constructed music prediction model, so that automatic composition can be realized, the efficiency and feasibility of automatic composition can be improved, and the influence of subjective factors on automatic composition can be reduced.
In order to achieve the above object, a terminal device according to an embodiment of the third aspect of the present application includes: one or more processors; storage means for storing one or more programs; the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the methods as described above.
To achieve the above object, a fourth aspect of the present application provides a storage medium containing computer-executable instructions for performing the method as described above when executed by a computer processor.
Additional aspects and advantages of the present application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the present application.
Drawings
The foregoing and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a flow chart of one embodiment of an automatic composition method of the present application;
FIG. 2 is a flow chart of another embodiment of an automatic composition method of the present application;
FIG. 3 is a schematic diagram of one embodiment of a topology in an automatic composition method of the present application;
FIG. 4 is a flow chart of yet another embodiment of an automatic composition method of the present application;
FIG. 5 is a schematic representation of energy value coordinates in an automatic composition method of the present application;
FIG. 6 is a flow chart of yet another embodiment of an automatic composition method of the present application;
FIG. 7 is a flow chart of yet another embodiment of an automatic composition method of the present application;
FIG. 8 is a schematic diagram of another embodiment of a topology in an automatic composition method of the present application;
FIG. 9 is a schematic structural diagram of an embodiment of an automatic composition device according to the present application;
fig. 10 is a schematic structural view of another embodiment of the automatic composition device of the present application;
fig. 11 is a schematic structural diagram of an embodiment of a terminal device according to the present application.
Detailed Description
Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary only for the purpose of explaining the present application and are not to be construed as limiting the present application. On the contrary, the embodiments of the application include all changes, modifications and equivalents coming within the spirit and terms of the claims appended hereto.
Fig. 1 is a flowchart of an embodiment of an automatic composition method according to the present application, and as shown in fig. 1, the automatic composition method may include:
step 101, receiving a music file of a previous piece of music to be predicted, where the music file of the previous piece of music to be predicted includes audio data or music description information of the previous piece of music to be predicted.
The audio data or the music description information of the preceding piece of music to be predicted refers to the audio data or the music description information of a given section of music, and then the following piece of music can be predicted according to the audio data or the music description information of the given section of music.
The music description information may be generally converted into audio data, and the music description information may be a Musical Instrument Digital Interface (MIDI) file or the like.
And 102, extracting the frame-level audio features of the music corresponding to the music file.
And 103, obtaining frame-level audio features carrying frequency band information according to the frame-level audio features and a pre-constructed music frequency band feature combination model.
And step 104, obtaining the predicted music according to the frame-level audio features carrying the frequency band information and a pre-constructed music prediction model so as to realize automatic composition.
According to the automatic composition method, after a music file of a front section of music to be predicted is received, frame-level audio features of the music corresponding to the music file are extracted, then frame-level audio features carrying frequency band information are obtained according to the frame-level audio features and a pre-constructed music frequency band feature combination model, and finally predicted music is obtained according to the frame-level audio features carrying frequency band information and a pre-constructed music prediction model, so that automatic composition can be realized, the efficiency and feasibility of automatic composition can be improved, and the influence of subjective factors on automatic composition is reduced.
Fig. 2 is a flowchart of another embodiment of the automatic composition method of the present application, as shown in fig. 2, before step 103, the method may further include:
step 201, collecting music files and converting the music files into audio files with the same format.
Specifically, a large amount of training data can be obtained by crawling a large number of music files on the internet, where the music files may be audio data or music description information, such as MIDI files. The music files can then be converted into audio files with the same format, where the format only needs to satisfy the requirement of the Fast Fourier Transform (FFT), for example ".pcm" or ".wav"; the format of the audio file is not limited in this embodiment, and ".pcm" is taken as an example for explanation. It should be noted that if a music file is music description information, such as a MIDI file, the MIDI file must first be converted into an audio file and then into an audio file in ".pcm" format.
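Purely as an illustration of this collection-and-conversion step (not part of the patent text), the following Python sketch shows one way such a conversion could be scripted, assuming the external tools fluidsynth (to render MIDI to audio) and ffmpeg are available; the sample rate, SoundFont path and directory name are hypothetical.

    import subprocess
    from pathlib import Path

    SAMPLE_RATE = 16000          # hypothetical target sample rate
    SOUNDFONT = "default.sf2"    # hypothetical SoundFont used to render MIDI files

    def to_pcm(src: Path, dst: Path) -> None:
        """Convert an arbitrary collected music file to raw 16-bit mono PCM."""
        if src.suffix.lower() in (".mid", ".midi"):
            # Music description information (MIDI) is first rendered to audio ...
            wav = src.with_suffix(".wav")
            subprocess.run(["fluidsynth", "-ni", SOUNDFONT, str(src),
                            "-F", str(wav), "-r", str(SAMPLE_RATE)], check=True)
            src = wav
        # ... and every audio file is then converted to the same raw ".pcm" format.
        subprocess.run(["ffmpeg", "-y", "-i", str(src), "-f", "s16le",
                        "-acodec", "pcm_s16le", "-ac", "1",
                        "-ar", str(SAMPLE_RATE), str(dst)], check=True)

    for path in Path("corpus").iterdir():
        to_pcm(path, path.with_suffix(".pcm"))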
Step 202, extracting the frame-level audio features of the audio file.
Step 203, determining the topological structure of the music frequency band feature combination model.
Specifically, the topological structure is a hedged neural network structure. In this embodiment, a hedged Recurrent Neural Network (RNN) is taken as an example: the topological structure includes two independent RNNs and a connection unit, as shown in fig. 3, which is a schematic diagram of an embodiment of the topological structure in the automatic composition method of the present application. The two independent RNNs, named LF_RNN and HF_RNN, are used for low-band multi-frequency feature combination and high-band multi-frequency feature combination, respectively.
The input of LF_RNN is, for a given frame T_m, the energy value E(T_m, F_i) taken from the low-frequency end, i = 1, 2, …, k, with k = 1, 2, …, N/2 (N is the number of FFT points), together with the output L_{i-1} of LF_RNN at the previous frequency point; the output of LF_RNN is L_i, denoting the energy value of the i-th frequency point of frame T_m after the low-frequency information has been taken into account.
Similarly, the input of HF_RNN is the energy value E(T_m, F_j) of frame T_m taken from the high-frequency end, j = N/2, N/2-1, …, k, where k = 1, 2, …, N/2 (N is the number of FFT points), together with the output H_{j+1} of HF_RNN at the previous frequency point; the output of HF_RNN is H_j, denoting the energy value of the j-th frequency point of frame T_m after the high-frequency information has been taken into account.
The connection unit is the "concatenate" block in fig. 3: when i = j = k, the two outputs are connected to form N(T_m, F_k), i.e. the energy value of the k-th frequency point of frame T_m that takes the information of the other frequency points into account.
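The following PyTorch sketch illustrates this topology; it is a minimal reading of the text above, and the cell type, hidden size and variable names are assumptions rather than details given in the patent. LF_RNN scans the energy values of one frame from the lowest frequency point up to point k, HF_RNN scans from the highest frequency point down to point k, and the connection unit concatenates the two outputs to produce N(T_m, F_k).

    import torch
    import torch.nn as nn

    class BandFeatureCombiner(nn.Module):
        """Two frequency-direction RNNs (LF_RNN, HF_RNN) joined by a concatenate unit."""

        def __init__(self, hidden: int = 32):
            super().__init__()
            self.lf_rnn = nn.RNNCell(1, hidden)     # scans E(T_m, F_1) ... E(T_m, F_k), low to high
            self.hf_rnn = nn.RNNCell(1, hidden)     # scans E(T_m, F_{N/2}) ... E(T_m, F_k), high to low
            self.concat = nn.Linear(2 * hidden, 1)  # connection unit producing N(T_m, F_k)

        def forward(self, energies: torch.Tensor) -> torch.Tensor:
            # energies: (batch, N/2) energy values of one frame at every frequency point
            batch, n_half = energies.shape
            outputs = []
            for k in range(n_half):
                l = energies.new_zeros(batch, self.lf_rnn.hidden_size)
                h = energies.new_zeros(batch, self.hf_rnn.hidden_size)
                for i in range(k + 1):                    # low-frequency side, up to point k
                    l = self.lf_rnn(energies[:, i:i + 1], l)
                for j in range(n_half - 1, k - 1, -1):    # high-frequency side, down to point k
                    h = self.hf_rnn(energies[:, j:j + 1], h)
                outputs.append(self.concat(torch.cat([l, h], dim=1)))
            return torch.cat(outputs, dim=1)              # (batch, N/2): N(T_m, F_k) for every k

In practice the two scans could be realised as a single bidirectional pass over the frequency axis; the explicit per-k loops above simply mirror the description and are quadratic in the number of frequency points.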
And step 204, training the music frequency band characteristic combination model according to the determined topological structure and the frame-level audio characteristics.
Specifically, when the music band feature combination model is trained, the training algorithm may be a neural network model training algorithm such as the Back Propagation (BP) algorithm; the training algorithm is not limited in this embodiment.
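For example, a back-propagation training loop for the combination model sketched above might look as follows; the loss, optimiser and the reconstruction-style target are assumptions made only for illustration, since the patent does not fix them.

    import torch
    import torch.nn as nn

    def train_band_model(model: nn.Module, frames: torch.Tensor,
                         epochs: int = 10, lr: float = 1e-3) -> None:
        """frames: (num_frames, N/2) frame-level energy features extracted from the corpus."""
        opt = torch.optim.Adam(model.parameters(), lr=lr)
        loss_fn = nn.MSELoss()
        for _ in range(epochs):
            for batch in frames.split(64):       # mini-batches of frames
                opt.zero_grad()
                combined = model(batch)          # N(T_m, F_k) for each frame in the batch
                loss = loss_fn(combined, batch)  # assumed reconstruction-style target
                loss.backward()                  # back propagation
                opt.step()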
Fig. 4 is a flowchart of another embodiment of the automatic composition method of the present application, and as shown in fig. 4, in the embodiment shown in fig. 2 of the present application, step 202 may include:
step 401, performing fast fourier transform of a fixed number of points on the audio file according to a frame.
Specifically, an FFT with a fixed number of points may be applied, frame by frame, to the audio file in ".pcm" format.
And step 402, calculating the energy value of each frame of the audio file at each frequency point according to the result of the fast Fourier transform.
Fig. 5 is a schematic diagram of the energy value coordinates in the automatic composition method of the present application, showing the energy value of each frame at each frequency point, where the horizontal axis t denotes the time-series frame, the vertical axis f denotes the frequency point, the coordinate E(t, f) denotes the energy value, M denotes the total number of frames, and N denotes the number of FFT points.
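As an illustration of steps 401 and 402, the energy matrix E(t, f) of fig. 5 can be computed with a few lines of NumPy; the FFT point count, hop size, sample format and file name below are assumptions, not values fixed by the patent.

    import numpy as np

    N = 512          # fixed number of FFT points (assumed value)
    HOP = 512        # non-overlapping frames are assumed here

    # Read the raw ".pcm" file; 16-bit little-endian mono samples are assumed.
    samples = np.fromfile("music.pcm", dtype="<i2").astype(np.float64)

    M = len(samples) // HOP                          # total number of frames
    frames = samples[:M * HOP].reshape(M, HOP)

    spectrum = np.fft.rfft(frames, n=N, axis=1)      # fixed-point-count FFT per frame
    E = np.abs(spectrum[:, :N // 2]) ** 2            # E(t, f): squared magnitude used as the energy value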
And step 403, determining the attribution of the notes of each frame according to the energy value.
Specifically, at each frequency point, the first frame and the second frame of the audio file are determined to belong to a first note; it is then judged whether the absolute value of a first difference is smaller than a second difference, where the first difference is the difference between the energy value of the third frame of the audio file and the average of the energy values of the first and second frames, and the second difference is the difference between the maximum and minimum of the energy values of the first and second frames; if so, the third frame of the audio file is determined to belong to the first note, and the note attribution of the fourth frame through the last frame is judged backwards in sequence in the same way.
If the absolute value of the first difference is greater than or equal to the second difference, the third frame of the audio file is taken as the beginning of a second note and the fourth frame of the audio file is determined to belong to the second note; starting from the fifth frame of the audio file, it is judged whether the absolute value of a third difference is smaller than a fourth difference, where the third difference is the difference between the energy value of the fifth frame and the average of the energy values of the third and fourth frames, and the fourth difference is the difference between the maximum and minimum of the energy values of the third and fourth frames; the note attribution of the fifth frame is determined in the same way as that of the third frame, and the process is repeated until the note attribution of the last frame of the audio file has been determined.
That is, determining the note attribution of each frame may proceed as follows. At each frequency point: frames T_1 and T_2 are regarded as belonging to the first note, and attribution is judged starting from frame T_3. If |E(T_3, F_1) - E_mean(T_1, T_2)| < (E_max(T_1, T_2) - E_min(T_1, T_2)) is satisfied, frame T_3 belongs to the first note, and the attribution of each subsequent frame is judged backwards in turn in the same way, where E_mean(T_1, T_2), E_max(T_1, T_2) and E_min(T_1, T_2) denote the average, maximum and minimum of the energy values of frames T_1 to T_2, respectively. Otherwise, frame T_3 is taken as the start of the second note and frame T_4 is determined to belong to the second note; judgement then resumes from frame T_5, still using the formula |E(T_5, F_1) - E_mean(T_3, T_4)| < (E_max(T_3, T_4) - E_min(T_3, T_4)) to determine the note attribution of frame T_5, and so on until the note attribution of all frames has been determined.
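The attribution rule can be written out directly; the sketch below follows the formula above at a single frequency point (a hypothetical helper, using zero-based array indexing, so frame T_1 corresponds to energy[0]).

    import numpy as np

    def assign_notes(energy: np.ndarray) -> np.ndarray:
        """energy: energy values of frames T_1..T_M at one frequency point.
        Returns a note index for every frame (0 = first note, 1 = second note, ...)."""
        M = len(energy)
        notes = np.zeros(M, dtype=int)   # frames T_1 and T_2 belong to the first note
        start = 0                        # first frame of the note currently being grown
        m = 2                            # judgement starts from frame T_3
        while m < M:
            prev = energy[start:m]       # frames already attributed to the current note
            if abs(energy[m] - prev.mean()) < prev.max() - prev.min():
                notes[m] = notes[m - 1]          # frame stays in the current note
                m += 1
            else:
                notes[m] = notes[m - 1] + 1      # this frame starts a new note ...
                start = m
                if m + 1 < M:
                    notes[m + 1] = notes[m]      # ... and the next frame is attributed to it as well
                m += 2                           # judgement resumes two frames later
        return notes

In the patent the same judgement is carried out independently at every frequency point.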
And step 404, calculating the energy value of each note, and acquiring the frame-level audio features according to the energy value of each note.
Fig. 6 is a flowchart of another embodiment of the automatic composition method of the present application, and as shown in fig. 6, in the embodiment shown in fig. 4 of the present application, step 404 may include:
step 601, calculating the energy average value of all frames contained in each note as the energy value of each note.
Step 602, normalizing the energy value of each frame included in each note to the energy value of the corresponding note.
Step 603, the notes with energy values less than a predetermined threshold are filtered out to obtain frame-level audio features.
The predetermined threshold may be set according to system performance and/or implementation requirements, and the size of the predetermined threshold is not limited in this embodiment.
In this embodiment, the energy of a note is defined as the mean energy of all frames contained in the note, so the mean energy of all frames contained in each note can be calculated as the energy value E(i) of that note, and the energy value of each frame contained in the note is then normalized to the energy value of the note.
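A sketch of this note-level processing (steps 601 to 603) is given below; the threshold value is an assumed placeholder, since the patent leaves it to system performance and implementation requirements.

    import numpy as np

    def note_level_features(E: np.ndarray, notes: np.ndarray,
                            threshold: float = 1e-6) -> np.ndarray:
        """E: per-frame energy values at one frequency point; notes: note index of every frame."""
        normalized = np.empty_like(E)
        keep = np.ones(len(E), dtype=bool)
        for n in np.unique(notes):
            idx = notes == n
            note_energy = E[idx].mean()      # E(i): mean energy of all frames of note i
            normalized[idx] = note_energy    # normalize each frame to its note's energy value
            if note_energy < threshold:      # filter out notes whose energy is below the threshold
                keep[idx] = False
        return normalized[keep]              # remaining frame-level audio features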
It should be noted that, in the embodiment shown in fig. 2 of the present application, steps 201 to 204 may be executed sequentially with steps 101 to 102, or may be executed concurrently with steps 101 to 102, which is not limited in this application.
Fig. 7 is a flowchart of another embodiment of the automatic composition method of the present application, as shown in fig. 7, in the embodiment shown in fig. 1 of the present application, before step 104, the method may further include:
step 701, determining a topological structure of a music prediction model.
In this embodiment, the music prediction model adopts an RNN model, as shown in fig. 8; fig. 8 is a schematic diagram of another embodiment of the topological structure in the automatic composition method of the present application. The input of the RNN model shown in fig. 8 is the output N(T_m, F_k) of the music band feature combination model together with the model output h_m of the previous frame, and its output is the energy value N(T_{m+1}, F_k) of the next frame.
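A minimal PyTorch sketch of this prediction topology follows; the cell type, hidden size and the frame-by-frame generation loop are assumptions used only to illustrate how the model consumes N(T_m, F_k) together with the previous-frame output h_m and emits N(T_{m+1}, F_k).

    import torch
    import torch.nn as nn

    class MusicPredictor(nn.Module):
        """RNN mapping N(T_m, F_k) and the previous-frame output to N(T_{m+1}, F_k)."""

        def __init__(self, n_freq: int, hidden: int = 128):
            super().__init__()
            self.cell = nn.RNNCell(n_freq, hidden)
            self.out = nn.Linear(hidden, n_freq)

        def forward(self, banded_frames: torch.Tensor) -> torch.Tensor:
            # banded_frames: (num_frames, n_freq), output of the band feature combination model
            h = banded_frames.new_zeros(1, self.cell.hidden_size)
            preds = []
            for frame in banded_frames:
                h = self.cell(frame.unsqueeze(0), h)   # state h_m carries the previous-frame output
                preds.append(self.out(h))              # predicted N(T_{m+1}, F_k)
            return torch.cat(preds, dim=0)

        @torch.no_grad()
        def compose(self, banded_frames: torch.Tensor, n_new: int) -> torch.Tensor:
            """Feed the preceding music, then generate n_new further frames recursively."""
            h = banded_frames.new_zeros(1, self.cell.hidden_size)
            frame = None
            for f in banded_frames:
                h = self.cell(f.unsqueeze(0), h)
                frame = self.out(h)
            generated = []
            for _ in range(n_new):
                generated.append(frame)
                h = self.cell(frame, h)
                frame = self.out(h)
            return torch.cat(generated, dim=0)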
Step 702, training the music prediction model according to the output of the music band feature combination model and the determined topological structure.
It should be noted that step 701 and step 702 may be executed successively with step 101 to step 103, or may be executed in parallel with step 101 to step 103, which is not limited in this embodiment.
The automatic composition method can realize automatic composition, further improve the efficiency and feasibility of automatic composition, reduce the influence of subjective factors on automatic composition, is a brand new automatic composition method, and solves the problems of low efficiency, poor feasibility, large subjective influence and the like in the prior art.
Fig. 9 is a schematic structural diagram of an embodiment of an automatic composition device according to the present application, where the automatic composition device in the present embodiment may be used as a terminal device or a part of a terminal device to implement the automatic composition method provided by the present application. The terminal device may be a client device or a server device, and the form of the terminal device is not limited in the present application.
As shown in fig. 9, the automatic composition apparatus may include: a receiving module 91, an extracting module 92 and an obtaining module 93;
the receiving module 91 is configured to receive a music file of a to-be-predicted previous-segment music, where the music file of the to-be-predicted previous-segment music includes audio data or music description information of the to-be-predicted previous-segment music; the audio data or the music description information of the preceding piece of music to be predicted refers to the audio data or the music description information of a given section of music, and then the following piece of music can be predicted according to the audio data or the music description information of the given section of music. The music description information may be converted into audio data, and the music description information may be a MIDI file or the like.
An extracting module 92, configured to extract frame-level audio features of music corresponding to the music file received by the receiving module 91;
an obtaining module 93, configured to obtain a frame-level audio feature carrying band information according to the frame-level audio feature and a pre-constructed music band feature combination model; and obtaining the predicted music according to the frame-level audio features carrying the frequency band information and a pre-constructed music prediction model so as to realize automatic composition.
In the automatic composition device, after the receiving module 91 receives a music file of a front section of music to be predicted, the extracting module 92 extracts frame-level audio features of the music corresponding to the music file, the obtaining module 93 obtains the frame-level audio features carrying band information according to the frame-level audio features and a pre-constructed music band feature combination model, and obtains the predicted music according to the frame-level audio features carrying band information and the pre-constructed music prediction model, so that automatic composition can be realized, the efficiency and feasibility of automatic composition can be improved, and the influence of subjective factors on automatic composition can be reduced.
Fig. 10 is a schematic structural view of another embodiment of the automatic composing device of the present application, which is different from the automatic composing device shown in fig. 9 in that the automatic composing device shown in fig. 10 may further include: a collection module 94, a conversion module 95, a determination module 96, and a training module 97;
a collecting module 94, configured to collect music files before the obtaining module 93 obtains the frame-level audio features carrying the frequency band information;
a conversion module 95 for converting the music files collected by the collection module 94 into audio files of the same format;
specifically, the collection module 94 may obtain a large amount of training data by crawling a large amount of music files on the internet, where the music files may be audio data or music description information, such as: MIDI files, and the like. Then, the conversion module 95 may convert the music file into an audio file with the same format, where the format of the audio file only needs to satisfy the requirement of performing FFT, for example: ". PCM or WAV, etc., the format of the audio file is not limited in this embodiment, and the present embodiment takes the format of" PCM "as an example for explanation. It should be noted that: if the music file is music description information, such as a MIDI file, it is necessary to convert the MIDI file into an audio file first, and then into an audio file in the ". PCM" format.
The extracting module 92 is further configured to extract the frame-level audio features of the audio file converted by the converting module 95.
A determining module 96 for determining a topological structure of the music band feature combination model; specifically, the topology determined by the determining module 96 is a hedged neural network structure. In this embodiment, the hedged RNN is taken as an example: the topology includes two independent RNNs and a connection unit, as shown in fig. 3, and the two independent RNNs, named LF_RNN and HF_RNN, are used for low-band multi-frequency feature combination and high-band multi-frequency feature combination, respectively.
The input of LF_RNN is, for a given frame T_m, the energy value E(T_m, F_i) taken from the low-frequency end, i = 1, 2, …, k, with k = 1, 2, …, N/2 (N is the number of FFT points), together with the output L_{i-1} of LF_RNN at the previous frequency point; the output of LF_RNN is L_i, denoting the energy value of the i-th frequency point of frame T_m after the low-frequency information has been taken into account.
Similarly, the input of HF_RNN is the energy value E(T_m, F_j) of frame T_m taken from the high-frequency end, j = N/2, N/2-1, …, k, where k = 1, 2, …, N/2 (N is the number of FFT points), together with the output H_{j+1} of HF_RNN at the previous frequency point; the output of HF_RNN is H_j, denoting the energy value of the j-th frequency point of frame T_m after the high-frequency information has been taken into account.
The connection unit is the "concatenate" block in fig. 3: when i = j = k, the two outputs are connected to form N(T_m, F_k), i.e. the energy value of the k-th frequency point of frame T_m that takes the information of the other frequency points into account.
And a training module 97, configured to train the music band feature combination model according to the topology determined by the determining module 96 and the frame-level audio features extracted by the extracting module 92. Specifically, when the training module 97 trains the music band feature combination model, the training algorithm used may be a neural network model training algorithm, such as a BP algorithm, and the training algorithm used in this embodiment is not limited.
In this embodiment, the extracting module 92 may include: a transformation sub-module 921, a calculation sub-module 922, a determination sub-module 923 and an acquisition sub-module 924;
the transform submodule 921 is configured to perform fast fourier transform of a fixed number of points on the audio file by frame; specifically, the transform sub-module 921 may perform a fixed number of FFT on the audio file in ". PCM" format per frame.
The calculating submodule 922 is configured to calculate an energy value of each frame of the audio file at each frequency point according to a result of the fast Fourier transform of the transform submodule 921; fig. 5 is a schematic diagram showing the coordinate representation of the energy value of each frame at each frequency point, wherein the horizontal axis t represents a time-series frame, the vertical axis f represents a frequency point, the coordinate E(t, f) represents the energy value, M represents the total number of frames, and N represents the number of FFT points.
A determining submodule 923, configured to determine a note attribute of each frame according to the energy value calculated by the calculating submodule 922.
The calculating submodule 922 is further used for calculating an energy value of each note;
the obtaining sub-module 924 is configured to obtain the frame-level audio features according to the energy value of each note calculated by the calculating sub-module 922.
The calculating submodule 922 is specifically configured to calculate an energy average value of all frames included in each note as an energy value of each note; normalizing the energy value of each frame included by each note into the energy value of the corresponding note;
the obtaining sub-module 924 is specifically configured to filter out notes with energy values smaller than a predetermined threshold value to obtain frame-level audio features. The predetermined threshold may be set according to system performance and/or implementation requirements, and the size of the predetermined threshold is not limited in this embodiment.
In this embodiment, the energy of a note is defined as the energy mean of all frames contained in the note, so that the energy mean of all frames contained in each note can be calculated as the energy value e (i) of each note, and then the energy value of each frame contained in each note is normalized to the energy value of the note.
In this embodiment, the determining sub-module 923 may include: a note determining unit 9231 and a judging unit 9232;
a note determining unit 9231, configured to determine that the first frame and the second frame of the audio file belong to a first note at each frequency point;
a judging unit 9232 configured to judge whether the absolute value of the first difference is smaller than the second difference; the first difference value is the difference between the energy value of the third frame of the audio file and the average value of the energy values from the first frame to the second frame of the audio file, and the second difference value is the difference between the maximum value and the minimum value of the energy values from the first frame to the second frame of the audio file;
the note determining unit 9231 is further configured to determine that the third frame of the audio file belongs to the first note and sequentially determine backwards that the notes of the fourth frame are to be attributed to the last frame when the absolute value of the first difference is smaller than the second difference.
A note determining unit 9231, further configured to take a third frame of the audio file as a start of a second note and determine that a fourth frame of the audio file belongs to the second note when an absolute value of the first difference is greater than or equal to a second difference;
the determining unit 9232 is further configured to determine whether an absolute value of a third difference value is smaller than a fourth difference value from a fifth frame of the audio file, where the third difference value is a difference between an energy value of the fifth frame of the audio file and an average value of energy values of third to fourth frames of the audio file, and the fourth difference value is a difference between a maximum value and a minimum value of the energy values of the third to fourth frames of the audio file; and determining the attribution of the musical note of the fifth frame in the same way of judging the attribution of the musical note of the third frame, and repeating the steps until the attribution of the musical note of the last frame of the audio file is determined.
That is, the determining sub-module 923 may determine the note attribution of each frame as follows. At each frequency point: the note determining unit 9231 regards frames T_1 and T_2 as belonging to the first note, and the judging unit 9232 judges attribution starting from frame T_3. If |E(T_3, F_1) - E_mean(T_1, T_2)| < (E_max(T_1, T_2) - E_min(T_1, T_2)) is satisfied, frame T_3 belongs to the first note, and the attribution of each subsequent frame is judged backwards in turn in the same way, where E_mean(T_1, T_2), E_max(T_1, T_2) and E_min(T_1, T_2) denote the average, maximum and minimum of the energy values of frames T_1 to T_2, respectively. Otherwise, frame T_3 is taken as the start of the second note and frame T_4 is determined to belong to the second note; judgement then resumes from frame T_5, still using the formula |E(T_5, F_1) - E_mean(T_3, T_4)| < (E_max(T_3, T_4) - E_min(T_3, T_4)) to determine the note attribution of frame T_5, and so on until the note attribution of all frames has been determined.
Further, the automatic composition apparatus may further include: a determination module 96 and a training module 97;
a determining module 96, configured to determine a topological structure of the music prediction model before the obtaining module 93 obtains the predicted music; in this embodiment, the topological structure of the music prediction model determined by the determining module 96 is an RNN model, and as shown in fig. 8, the input of the RNN model is the output N (T) of the music frequency band feature combination modelm,Fk) And the output h of the previous frame modelmThe output is the energy value N (T) of the next framem+1,Fk)。
And a training module 97, configured to train the music prediction model according to the output of the music band feature combination model and the topological structure determined by the determining module 96.
The automatic composition device can realize automatic composition, further improve the efficiency and feasibility of automatic composition, reduce the influence of subjective factors on automatic composition, is a brand new automatic composition method, and solves the problems of low efficiency, poor feasibility, large subjective influence and the like in the prior art.
Fig. 11 is a schematic structural diagram of an embodiment of a terminal device according to the present application, where the terminal device in the present application may implement the automatic composition method provided in the present application, the terminal device may be a client device or a server device, and the present application does not limit the form of the terminal device. The terminal device may include: one or more processors; storage means for storing one or more programs; when the one or more programs are executed by the one or more processors, the one or more processors implement the automatic composition method provided by the present application.
Fig. 11 shows a block diagram of an exemplary terminal device 12 suitable for use in implementing embodiments of the present application. The terminal device 12 shown in fig. 11 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.
As shown in fig. 11, the terminal device 12 is represented in the form of a general-purpose computing device. The components of terminal device 12 may include, but are not limited to: one or more processors or processing units 16, a system memory 28, and a bus 18 that couples various system components including the system memory 28 and the processing unit 16.
Bus 18 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. These architectures include, but are not limited to, Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, enhanced ISA bus, Video Electronics Standards Association (VESA) local bus, and Peripheral Component Interconnect (PCI) bus, to name a few.
Terminal device 12 typically includes a variety of computer system readable media. Such media may be any available media that is accessible by terminal device 12 and includes both volatile and nonvolatile media, removable and non-removable media.
The system Memory 28 may include computer system readable media in the form of volatile Memory, such as Random Access Memory (RAM) 30 and/or cache Memory 32. Terminal device 12 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 34 may be used to read from and write to non-removable, nonvolatile magnetic media (not shown in FIG. 11, and commonly referred to as a "hard drive"). Although not shown in FIG. 11, a disk drive for reading from and writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk (e.g., a Compact disk Read Only memory (CD-ROM), a Digital versatile disk Read Only memory (DVD-ROM), or other optical media) may be provided. In these cases, each drive may be connected to bus 18 by one or more data media interfaces. Memory 28 may include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of embodiments of the application.
A program/utility 40 having a set (at least one) of program modules 42 may be stored, for example, in memory 28, such program modules 42 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each of which examples or some combination thereof may comprise an implementation of a network environment. Program modules 42 generally execute the automated composition method of the embodiments described herein.
Terminal device 12 may also communicate with one or more external devices 14 (e.g., keyboard, pointing device, display 24, etc.), with one or more devices that enable a user to interact with terminal device 12, and/or with any devices (e.g., network card, modem, etc.) that enable terminal device 12 to communicate with one or more other computing devices. Such communication may be through an input/output (I/O) interface 22. Furthermore, the terminal device 12 can also communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the Internet) via the network adapter 20. As shown in fig. 11, the network adapter 20 communicates with the other modules of the terminal device 12 via the bus 18. It should be understood that although not shown in fig. 11, other hardware and/or software modules may be used in conjunction with terminal device 12, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
The processing unit 16 executes various functional applications and data processing, such as implementing the automatic composition method provided herein, by executing programs stored in the system memory 28.
It should be noted that, in the description of the present application, the terms "first", "second", etc. are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. In addition, in the description of the present application, "a plurality" means two or more unless otherwise specified.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and the scope of the preferred embodiments of the present application includes other implementations in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present application.
It should be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or a combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having logic gate circuits for implementing logic functions on data signals, an application-specific integrated circuit (ASIC) having appropriate combinational logic gate circuits, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), and the like.
The present application also provides a storage medium containing computer-executable instructions for performing the automatic composition method provided herein when executed by a computer processor.
The storage media described above, which comprise computer-executable instructions, may take any combination of one or more computer-readable media. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a Read Only Memory (ROM), an Erasable Programmable Read Only Memory (EPROM) or flash Memory, an optical fiber, a portable compact disc Read Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present application may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
It will be understood by those skilled in the art that all or part of the steps carried by the method for implementing the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in a computer readable storage medium, and when the program is executed, the program includes one or a combination of the steps of the method embodiments.
In addition, functional modules in the embodiments of the present application may be integrated into one processing module, or each module may exist alone physically, or two or more modules are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc.
In the description herein, reference to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the application. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
Although embodiments of the present application have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present application, and that variations, modifications, substitutions and alterations may be made to the above embodiments by those of ordinary skill in the art within the scope of the present application.

Claims (16)

1. An automatic composition method, comprising:
receiving a music file of a front piece of music to be predicted, wherein the music file of the front piece of music to be predicted comprises audio data or music description information of the front piece of music to be predicted;
extracting frame-level audio features of music corresponding to the music file;
obtaining frame-level audio features carrying frequency band information according to the frame-level audio features and a pre-constructed music frequency band feature combination model, wherein the music frequency band feature combination model is obtained according to the frame-level audio features of the audio files and the topological structure training of the music frequency band feature combination model;
and obtaining the predicted music according to the frame-level audio features carrying the frequency band information and a pre-constructed music prediction model so as to realize automatic composition, wherein the music prediction model is obtained by combining the output of the model and the topological structure training of the music prediction model according to the music frequency band features.
2. The method of claim 1, wherein before obtaining the frame-level audio features carrying band information according to the frame-level audio features and the pre-constructed music band feature combination model, the method further comprises:
collecting music files and converting the music files into audio files with the same format;
extracting frame-level audio features of the audio file;
determining a topological structure of the music frequency band feature combination model;
and training the music frequency band characteristic combination model according to the determined topological structure and the frame-level audio characteristics.
3. The method of claim 2, wherein extracting the frame-level audio features of the audio file comprises:
carrying out fast Fourier transform of a fixed point number on the audio file according to frames;
calculating the energy value of each frame of the audio file at each frequency point according to the result of the fast Fourier transform;
determining the attribution of the notes of each frame according to the energy value;
and calculating the energy value of each note, and acquiring the frame-level audio features according to the energy value of each note.
4. A method as recited in claim 3, wherein said determining a note attribute for each frame as a function of the energy value comprises:
determining, at each frequency point, that a first frame and a second frame of the audio file belong to a first note;
judging whether the absolute value of the first difference is smaller than the second difference; the first difference value is the difference between the energy value of the third frame of the audio file and the average value of the energy values of the first frame to the second frame of the audio file, and the second difference value is the difference between the maximum value and the minimum value of the energy values of the first frame to the second frame of the audio file;
if yes, determining that the third frame of the audio file belongs to the first musical note, and sequentially judging backwards whether the musical notes of the fourth frame till the last frame belong to.
5. The method of claim 4, wherein after determining whether the absolute value of the first difference is less than the absolute value of the second difference, further comprising:
if the absolute value of the first difference is greater than or equal to the second difference, taking a third frame of the audio file as the beginning of a second note and determining that a fourth frame of the audio file belongs to the second note;
judging whether the absolute value of a third difference value is smaller than a fourth difference value from a fifth frame of the audio file, wherein the third difference value is the difference between the energy value of the fifth frame of the audio file and the average value of the energy values of the third frame to the fourth frame of the audio file, and the fourth difference value is the difference between the maximum value and the minimum value of the energy values of the third frame to the fourth frame of the audio file; and until the attribution of the note of the last frame of the audio file is determined.
6. The method of claim 3, wherein calculating the energy value of each note, and wherein obtaining the frame-level audio features according to the energy value of each note comprises:
calculating the energy mean value of all frames contained in each note as the energy value of each note;
normalizing the energy value of each frame included by each note to the energy value of the corresponding note;
and filtering out notes with energy values less than a predetermined threshold to obtain the frame-level audio features.
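A short sketch of claim 6, under the assumption that "filtering out" low-energy notes means zeroing their frames and that the threshold is a free parameter.

    import numpy as np

    def note_level_features(band_energy, labels, threshold=1e-3):
        """The energy value of a note is the mean energy of its frames; each
        frame's energy is then normalized to (replaced by) its note's energy,
        and notes below the threshold are dropped (zeroed here)."""
        features = np.zeros_like(band_energy)
        for note_id in np.unique(labels):
            idx = labels == note_id
            note_energy = band_energy[idx].mean()    # energy value of the note
            if note_energy >= threshold:             # keep only sufficiently strong notes
                features[idx] = note_energy          # frames take the note's energy value
        return features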
7. The method according to claim 1, wherein before obtaining the predicted music according to the frame-level audio features carrying the frequency band information and a pre-constructed music prediction model, the method further comprises:
determining a topological structure of a music prediction model;
and training the music prediction model according to the music frequency band features, the output of the model and the determined topological structure.
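The claims leave the topology of the music prediction model open; a common choice for sequence continuation, shown below as an assumption rather than as the patented design, is a recurrent network trained to predict the next frame of band-combined features.

    import torch
    import torch.nn as nn

    class MusicPredictor(nn.Module):
        """A possible topology for the music prediction model: an LSTM over
        band-combined frame features that predicts the following frame.
        Layer sizes and the next-frame objective are assumptions."""
        def __init__(self, feat_dim=88, hidden=256):
            super().__init__()
            self.rnn = nn.LSTM(feat_dim, hidden, num_layers=2, batch_first=True)
            self.out = nn.Linear(hidden, feat_dim)

        def forward(self, x):                 # x: (batch, frames, feat_dim)
            h, _ = self.rnn(x)
            return self.out(h)                # predicted next-frame features

    def train_step(model, optimizer, batch):
        """Teacher-forced step: frames 0..T-2 predict frames 1..T-1."""
        inputs, targets = batch[:, :-1, :], batch[:, 1:, :]
        loss = nn.functional.mse_loss(model(inputs), targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()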
8. An automatic composition device, comprising:
the device comprises a receiving module, an extracting module and an obtaining module, wherein the receiving module is used for receiving a music file of a preceding piece of music to be predicted, and the music file of the preceding piece of music comprises audio data or music description information of the preceding piece of music to be predicted;
the extracting module is used for extracting the frame-level audio features of the music corresponding to the music file received by the receiving module;
the obtaining module is used for obtaining frame-level audio features carrying band information according to the frame-level audio features and a pre-constructed music band feature combination model, wherein the music band feature combination model is obtained by training according to the frame-level audio features of the audio files and the topological structure of the music band feature combination model; and for obtaining the predicted music according to the frame-level audio features carrying the frequency band information and a pre-constructed music prediction model so as to realize automatic composition, wherein the music prediction model is obtained by training according to the music frequency band features, the output of the model and the topological structure of the music prediction model.
9. The apparatus of claim 8, further comprising: the device comprises a collection module, a conversion module, a determination module and a training module;
the collecting module is used for collecting music files before the obtaining module obtains the frame-level audio features carrying the frequency band information;
the conversion module is used for converting the music files collected by the collection module into audio files with the same format;
the extraction module is also used for extracting the frame-level audio features of the audio file converted by the conversion module;
the determining module is used for determining a topological structure of the music frequency band feature combination model;
and the training module is used for training the music frequency band feature combination model according to the topological structure determined by the determining module and the frame-level audio features extracted by the extracting module.
10. The apparatus of claim 9, wherein the extraction module comprises:
the transformation submodule is used for performing fast Fourier transformation of fixed points on the audio file according to frames;
the computing submodule is used for computing the energy value of each frame of the audio file at each frequency point according to the result of the fast Fourier transform of the transform submodule;
the determining submodule is used for determining the attribution of the musical notes of each frame according to the energy value calculated by the calculating submodule;
the calculating submodule is also used for calculating the energy value of each note;
and the acquisition submodule is used for acquiring the frame-level audio features according to the energy value of each note calculated by the calculation submodule.
11. The apparatus of claim 10, wherein the determination submodule comprises:
a note determining unit for determining that a first frame and a second frame of the audio file belong to a first note at each frequency point;
a judging unit configured to judge whether an absolute value of the first difference is smaller than a second difference; the first difference value is the difference between the energy value of the third frame of the audio file and the average value of the energy values of the first frame to the second frame of the audio file, and the second difference value is the difference between the maximum value and the minimum value of the energy values of the first frame to the second frame of the audio file;
and the note determining unit is further configured to determine that the third frame of the audio file belongs to the first note when the absolute value of the first difference is smaller than the second difference, and to sequentially determine, frame by frame, the note attribution of the fourth frame through the last frame.
12. The apparatus of claim 11,
the note determining unit is further configured to take a third frame of the audio file as a start of a second note and determine that a fourth frame of the audio file belongs to the second note when the absolute value of the first difference is greater than or equal to the second difference;
the judging unit is further configured to judge, from a fifth frame of the audio file, whether the absolute value of a third difference is smaller than a fourth difference, wherein the third difference is the difference between the energy value of the fifth frame of the audio file and the average value of the energy values of the third to fourth frames of the audio file, and the fourth difference is the difference between the maximum value and the minimum value of the energy values of the third to fourth frames of the audio file, continuing until the note attribution of the last frame of the audio file is determined.
13. The apparatus of claim 10,
the calculation submodule is specifically used for calculating the energy mean value of all frames contained in each note as the energy value of each note, and for normalizing the energy value of each frame included in each note to the energy value of the corresponding note;
the obtaining submodule is specifically configured to filter out notes with energy values smaller than a predetermined threshold value, so as to obtain frame-level audio features.
14. The apparatus of claim 8, further comprising: a determining module and a training module;
the determining module is configured to determine a topological structure of the music prediction model before the obtaining module obtains the predicted music;
and the training module is used for training the music prediction model according to the music frequency band features, the output of the model and the topological structure determined by the determining module.
15. A terminal device, comprising:
one or more processors;
storage means for storing one or more programs;
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method recited in any of claims 1-7.
16. A storage medium containing computer-executable instructions for performing the method of any one of claims 1-7 when executed by a computer processor.
CN201710175115.8A 2017-03-22 2017-03-22 Automatic composition method and device and terminal equipment Active CN107045867B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710175115.8A CN107045867B (en) 2017-03-22 2017-03-22 Automatic composition method and device and terminal equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710175115.8A CN107045867B (en) 2017-03-22 2017-03-22 Automatic composition method and device and terminal equipment

Publications (2)

Publication Number Publication Date
CN107045867A CN107045867A (en) 2017-08-15
CN107045867B (en) 2020-06-02

Family

ID=59544865

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710175115.8A Active CN107045867B (en) 2017-03-22 2017-03-22 Automatic composition method and device and terminal equipment

Country Status (1)

Country Link
CN (1) CN107045867B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108538301B (en) * 2018-02-13 2021-05-07 吟飞科技(江苏)有限公司 Intelligent digital musical instrument based on neural network audio technology
CN109192187A (en) * 2018-06-04 2019-01-11 平安科技(深圳)有限公司 Composing method, system, computer equipment and storage medium based on artificial intelligence
CN110660375B (en) * 2018-06-28 2024-06-04 北京搜狗科技发展有限公司 Method, device and equipment for generating music
CN109285560B (en) * 2018-09-28 2021-09-03 北京奇艺世纪科技有限公司 Music feature extraction method and device and electronic equipment
CN109727590B (en) * 2018-12-24 2020-09-22 成都嗨翻屋科技有限公司 Music generation method and device based on recurrent neural network
CN109872709B (en) * 2019-03-04 2020-10-02 湖南工程学院 New music generation method with low similarity based on note complex network

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104050972A (en) * 2013-03-14 2014-09-17 雅马哈株式会社 Sound signal analysis apparatus, sound signal analysis method and sound signal analysis program
CN104282300A (en) * 2013-07-05 2015-01-14 中国移动通信集团公司 Non-periodic component syllable model building and speech synthesizing method and device
CN105374347A (en) * 2015-09-22 2016-03-02 中国传媒大学 A mixed algorithm-based computer-aided composition method for popular tunes in regions south of the Yangtze River

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6979767B2 (en) * 2002-11-12 2005-12-27 Medialab Solutions Llc Systems and methods for creating, modifying, interacting with and playing musical compositions
US7925502B2 (en) * 2007-03-01 2011-04-12 Microsoft Corporation Pitch model for noise estimation
US20170046973A1 (en) * 2015-10-27 2017-02-16 Thea Kuddo Preverbal elemental music: multimodal intervention to stimulate auditory perception and receptive language acquisition

Also Published As

Publication number Publication date
CN107045867A (en) 2017-08-15

Similar Documents

Publication Publication Date Title
CN107045867B (en) Automatic composition method and device and terminal equipment
CN107221326B (en) Voice awakening method and device based on artificial intelligence and computer equipment
US20200357427A1 (en) Voice Activity Detection Using A Soft Decision Mechanism
US10261965B2 (en) Audio generation method, server, and storage medium
KR102128926B1 (en) Method and device for processing audio information
EP3255633B1 (en) Audio content recognition method and device
CN102956237B (en) The method and apparatus measuring content consistency
US20220398835A1 (en) Target detection system suitable for embedded device
JP2016520879A (en) Speech data recognition method, device and server for distinguishing local rounds
CN111276124B (en) Keyword recognition method, device, equipment and readable storage medium
CN108021635A (en) The definite method, apparatus and storage medium of a kind of audio similarity
CN110111811A (en) Audio signal detection method, device and storage medium
US20220215839A1 (en) Method for determining voice response speed, related device and computer program product
CN113271386B (en) Howling detection method and device, storage medium and electronic equipment
US20070011001A1 (en) Apparatus for predicting the spectral information of voice signals and a method therefor
CN110889010A (en) Audio matching method, device, medium and electronic equipment
CN112420079A (en) Voice endpoint detection method and device, storage medium and electronic equipment
CN116805012A (en) Quality assessment method and device for multi-mode knowledge graph, storage medium and equipment
CN113223487A (en) Information identification method and device, electronic equipment and storage medium
CN114399992B (en) Voice instruction response method, device and storage medium
CN113409792B (en) Voice recognition method and related equipment thereof
JP2021526669A (en) Voice feature extraction device, voice feature extraction method, and program
CN111899729B (en) Training method and device for voice model, server and storage medium
CN114220415A (en) Audio synthesis method and device, electronic equipment and storage medium
EP3872808A1 (en) Voice processing apparatus, voice processing method, and computer-readable recording medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant