CN112967734B - Music data identification method, device, equipment and storage medium based on multiple sound parts - Google Patents

Music data identification method, device, equipment and storage medium based on multiple sound parts

Info

Publication number
CN112967734B
CN112967734B
Authority
CN
China
Prior art keywords
music
sequence
music data
groups
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110322916.9A
Other languages
Chinese (zh)
Other versions
CN112967734A (en)
Inventor
刘奡智
韩宝强
肖京
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202110322916.9A
Publication of CN112967734A
Application granted
Publication of CN112967734B

Links

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/90 Pitch determination of speech signals
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/90 Pitch determination of speech signals
    • G10L2025/906 Pitch tracking

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Signal Processing (AREA)
  • Probability & Statistics with Applications (AREA)
  • Auxiliary Devices For Music (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the field of artificial intelligence, and discloses a music data identification method, device, equipment and storage medium based on multiple sound parts, which are used for improving the accuracy of identifying music data of a single sound part. The music data recognition method based on the multi-vocal part comprises the following steps: acquiring multi-sound part music data, and converting the multi-sound part music data into a music sequence by adopting a convolutional neural network, wherein the music sequence comprises a pitch sequence and a rhythm sequence; inputting the music sequence into a multi-sound part music recognition model for recursive convolution to generate a plurality of sample music score sequence groups; generating a plurality of matching probability groups according to the plurality of sample music score sequence groups, a pre-trained music language model and a conditional probability model, wherein the plurality of sample music score sequence groups are in one-to-one correspondence with the plurality of matching probability groups; and determining a plurality of target single-part music data in the plurality of sample score sequence groups based on the plurality of matching probability groups. In addition, the invention also relates to blockchain technology, and the multi-sound part music data can be stored in a blockchain.

Description

Music data identification method, device, equipment and storage medium based on multiple sound parts
Technical Field
The present invention relates to the field of speech processing technologies, and in particular, to a method, an apparatus, a device, and a storage medium for identifying music data based on multiple vocal parts.
Background
Multi-part music is performed by a plurality of musical instruments, each with its own sound characteristics. It is sometimes necessary to extract the accompaniment parts, or the score of a particular instrument, thereby decomposing a piece of music into individual sample score data, from which new score data can then be generated.
In the prior art, unconstrained multi-part learning algorithms are mainly used to identify multi-part music. However, such algorithms generate a large output space during identification, and single-part music data cannot be accurately identified within that space.
Disclosure of Invention
The invention provides a music data identification method, device, equipment and storage medium based on multiple sound parts, which are used for improving the accuracy of identifying music data of a single sound part.
The first aspect of the present invention provides a music data recognition method based on a multi-vocal part, comprising: acquiring multi-sound part music data, and converting the multi-sound part music data into a music sequence by adopting a neural network architecture, wherein the music sequence comprises a pitch sequence and a rhythm sequence; inputting the music sequence into a multi-sound part music recognition model for recursive convolution to generate a plurality of sample music score sequence groups; generating a plurality of matching probability groups according to the plurality of sample music score sequence groups, the pre-trained music language model and the conditional probability model, wherein the plurality of sample music score sequence groups are in one-to-one correspondence with the plurality of matching probability groups; a plurality of target monophonic music data is determined in the plurality of sample score sequence groups based on the plurality of matching probability groups.
Optionally, in a first implementation manner of the first aspect of the present invention, the acquiring multi-vocal music data and converting the multi-vocal music data into a music sequence using a neural network architecture, where the music sequence includes a pitch sequence and a tempo sequence, includes: acquiring multi-sound-part music data, inputting the multi-sound-part music data into a convolutional neural network, and generating an initial music feature vector; inputting the initial music feature vector into a long short-term memory (LSTM) artificial neural network, performing time-domain variation reduction processing, and generating a music feature vector with reduced time-domain variation; and inputting the music feature vector with reduced time-domain variation into a fully connected layer of a deep neural network for mapping to generate a music sequence, wherein the music sequence comprises a pitch sequence and a rhythm sequence.
Optionally, in a second implementation manner of the first aspect of the present invention, the inputting the music sequence into the multi-vocal music recognition model for recursive convolution, generating a plurality of sample score sequence sets includes: reading preset weight parameters, inputting the music sequence into a first hidden layer in a multi-vocal music recognition model, and convolving the music sequence with the preset weight parameters to generate a first accuracy and a first hidden music vector; calculating a first weight parameter based on a preset loss function and the first accuracy, inputting the first hidden music vector into a second hidden layer, and convolving the first hidden music vector with the first weight parameter to generate a second accuracy and a second hidden music vector; calculating a second weight parameter based on the loss function and the second accuracy, inputting the second hidden music vector into a third hidden layer, and convolving the second hidden music vector with the second weight parameter to generate a third accuracy and a third hidden music vector; and performing convolution in the remaining hidden layers in the same manner, based on the corresponding weight parameters and the corresponding hidden music vectors, to generate a plurality of sample music score sequence groups.
Optionally, in a third implementation manner of the first aspect of the present invention, the generating a plurality of matching probability groups according to the plurality of sample score sequence groups, the pre-trained music language model and the conditional probability model, where the plurality of sample score sequence groups corresponds to the plurality of matching probability groups one to one includes: sequentially inputting the plurality of sample music score sequence groups into a pre-trained music language model according to a time sequence for comparison to generate a plurality of conditional probability groups; and sequentially inputting the conditional probability groups into a conditional probability model according to the time sequence to generate a plurality of matching probability groups, wherein the plurality of sample music score sequence groups are in one-to-one correspondence with the plurality of matching probability groups.
Optionally, in a fourth implementation manner of the first aspect of the present invention, the determining, based on the plurality of matching probability groups, a plurality of target monophonic music data in the plurality of sample score sequence groups includes: aiming at a sample music score sequence group, searching the maximum matching probability in the corresponding matching probability group, and determining the target matching probability; in the corresponding sample music score sequence group, determining the sample music score sequence corresponding to the target matching probability as single target single-part music data; determining a plurality of other target monophonic music data for other sample score sequence groups and corresponding other matching probability groups; and integrating the single target single-part music data and the plurality of other target single-part music data to generate a plurality of target single-part music data.
Optionally, in a fifth implementation manner of the first aspect of the present invention, before the acquiring multi-vocal music data and converting the multi-vocal music data into a music sequence by using a neural network architecture, where the music sequence includes a pitch sequence and a tempo sequence, the multi-vocal music data identification method further includes: acquiring multi-sound part music data to be processed, and preprocessing the to-be-processed data to generate the multi-sound part music data.
Optionally, in a sixth implementation manner of the first aspect of the present invention, the acquiring multi-vocal music data to be processed and preprocessing it to generate the multi-vocal music data includes: performing pre-emphasis processing on the multi-sound part music data to be processed, generating pre-emphasized multi-sound part music data; and windowing the pre-emphasized multi-sound part music data, generating the multi-sound part music data.
A second aspect of the present invention provides a multi-vocal part-based music data recognition apparatus comprising: the acquisition module is used for acquiring the multi-sound part music data and converting the multi-sound part music data into a music sequence by adopting a neural network architecture, wherein the music sequence comprises a pitch sequence and a rhythm sequence; the convolution module is used for inputting the music sequence into a multi-sound-part music recognition model to carry out recursive convolution to generate a plurality of sample music score sequence groups; the probability generation module is used for generating a plurality of matching probability groups according to the plurality of sample music score sequence groups, the pre-trained music language model and the conditional probability model, wherein the plurality of sample music score sequence groups are in one-to-one correspondence with the plurality of matching probability groups; a music data generating module for determining a plurality of target monophonic music data in the plurality of sample score sequence groups based on the plurality of matching probability groups.
Optionally, in a first implementation manner of the second aspect of the present invention, the obtaining module may be further specifically configured to: acquiring multi-sound-part music data, inputting the multi-sound-part music data into a convolutional neural network, and generating an initial music feature vector; inputting the initial music feature vector into a long short-term memory (LSTM) artificial neural network, performing time-domain variation reduction processing, and generating a music feature vector with reduced time-domain variation; and inputting the music feature vector with reduced time-domain variation into a fully connected layer of a deep neural network for mapping to generate a music sequence, wherein the music sequence comprises a pitch sequence and a rhythm sequence.
Optionally, in a second implementation manner of the second aspect of the present invention, the convolution module may be further specifically configured to: reading preset weight parameters, inputting the music sequence into a first hidden layer in a multi-vocal music recognition model, and convolving the music sequence with the preset weight parameters to generate a first accuracy and a first hidden music vector; calculating a first weight parameter based on a preset loss function and the first accuracy, inputting the first hidden music vector into a second hidden layer, and convolving the first hidden music vector with the first weight parameter to generate a second accuracy and a second hidden music vector; calculating a second weight parameter based on the loss function and the second accuracy, inputting the second hidden music vector into a third hidden layer, and convolving the second hidden music vector with the second weight parameter to generate a third accuracy and a third hidden music vector; and performing convolution in the remaining hidden layers in the same manner, based on the corresponding weight parameters and the corresponding hidden music vectors, to generate a plurality of sample music score sequence groups.
Optionally, in a third implementation manner of the second aspect of the present invention, the probability generating module may be further specifically configured to: sequentially inputting the plurality of sample music score sequence groups into a pre-trained music language model according to a time sequence for comparison to generate a plurality of conditional probability groups; and sequentially inputting the conditional probability groups into a conditional probability model according to the time sequence to generate a plurality of matching probability groups, wherein the plurality of sample music score sequence groups are in one-to-one correspondence with the plurality of matching probability groups.
Optionally, in a fourth implementation manner of the second aspect of the present invention, the music data generating module may be further specifically configured to: aiming at a sample music score sequence group, searching the maximum matching probability in the corresponding matching probability group, and determining the target matching probability; in the corresponding sample music score sequence group, determining the sample music score sequence corresponding to the target matching probability as single target single-part music data; determining a plurality of other target monophonic music data for other sample score sequence groups and corresponding other matching probability groups; and integrating the single target single-part music data and the plurality of other target single-part music data to generate a plurality of target single-part music data.
Optionally, in a fifth implementation manner of the second aspect of the present invention, the multi-vocal-based music data recognition device further includes: the preprocessing module is used for acquiring the multi-sound part music data to be processed, preprocessing the multi-sound part music data and generating the multi-sound part music data.
Optionally, in a sixth implementation manner of the second aspect of the present invention, the preprocessing module may be further specifically configured to: performing pre-emphasis processing on the multi-sound part music data to be processed, generating pre-emphasized multi-sound part music data; and windowing the pre-emphasized multi-sound part music data, generating the multi-sound part music data.
A third aspect of the present invention provides a multi-vocal-based music data recognition device comprising: a memory and at least one processor, the memory having instructions stored therein; the at least one processor invokes the instructions in the memory to cause the multi-vocal-based music data recognition device to perform the multi-vocal-based music data recognition method described above.
A fourth aspect of the present invention provides a computer-readable storage medium having instructions stored therein that, when executed on a computer, cause the computer to perform the multi-vocal part-based music data recognition method described above.
According to the technical scheme provided by the invention, the multi-sound part music data are obtained, and the multi-sound part music data are converted into a music sequence by adopting a neural network architecture, wherein the music sequence comprises a pitch sequence and a rhythm sequence; inputting the music sequence into a multi-sound part music recognition model for recursive convolution to generate a plurality of sample music score sequence groups; generating a plurality of matching probability groups according to the plurality of sample music score sequence groups, the pre-trained music language model and the conditional probability model, wherein the plurality of sample music score sequence groups are in one-to-one correspondence with the plurality of matching probability groups; a plurality of target monophonic music data is determined in the plurality of sample score sequence groups based on the plurality of matching probability groups. In the embodiment of the invention, multi-sound music data are recursively convolved into a plurality of sample music score sequence groups, then a plurality of conditional probability groups are generated based on a pre-trained music language model and the plurality of sample music score sequence groups, and a plurality of target single-sound music data are determined based on the plurality of conditional probability groups and the conditional probability model; the problem that a larger output space is generated when the multi-sound part music data are identified is solved, so that the accuracy of identifying the single-sound part music data is improved.
Drawings
FIG. 1 is a diagram showing an embodiment of a multi-vocal-based music data recognition method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of another embodiment of a multi-vocal-based music data recognition method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of an embodiment of a multi-vocal-based music data recognition device according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of another embodiment of a multi-vocal-based music data recognition device according to an embodiment of the present invention;
fig. 5 is a schematic diagram of an embodiment of a multi-vocal-based music data recognition apparatus according to an embodiment of the present invention.
Detailed Description
The embodiment of the invention provides a music data identification method, device, equipment and storage medium based on multiple sound parts, which are used for improving the accuracy of identifying music data of a single sound part.
The terms "first," "second," "third," "fourth" and the like in the description and in the claims and in the above drawings, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments described herein may be implemented in other sequences than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed or inherent to such process, method, article, or apparatus.
For easy understanding, the following describes a specific flow of an embodiment of the present invention, referring to fig. 1, and an embodiment of a multi-vocal-based music data recognition method in an embodiment of the present invention includes:
101. acquiring multi-sound part music data, and converting the multi-sound part music data into a music sequence by adopting a neural network architecture, wherein the music sequence comprises a pitch sequence and a rhythm sequence;
the server acquires multi-part music data and converts the multi-part music data into a music sequence including a pitch sequence and a rhythm sequence by using a convolutional neural network. It should be emphasized that, to further ensure the privacy and security of the multi-sounded music data, the multi-sounded music data may also be stored in a node of a blockchain.
The multi-sound part music data is musical instrument digital interface (MIDI) file data, which is performed by a plurality of musical instruments, possibly accompanied by human voice, and which contains music in multiple vocal parts. In the present invention, the server can recognize the multi-sound part music data as a plurality of single-part music data. First, the server converts the multi-sound part music data into a music sequence by adopting a neural network architecture, where the neural network architecture comprises a convolutional neural network, a long short-term memory artificial neural network and a deep neural network; the multi-sound part music data is input into the neural network architecture to generate the music sequence, which consists of a pitch sequence and a rhythm sequence.
It is to be understood that the execution body of the present invention may be a multi-vocal part-based music data recognition device, and may also be a terminal or a server, which is not limited herein. The embodiment of the invention is described by taking a server as the execution body as an example.
102. Inputting a music sequence into a multi-sound part music recognition model for recursive convolution to generate a plurality of sample music score sequence groups;
the server inputs the music sequence into a multi-vocal part music recognition model and convolves the music sequence with recursively updated weights, thereby obtaining a plurality of sample music score sequence groups.
It should be noted that, for each vocal music data, the server generates a sample score sequence group according to the corresponding music sequence, so as to complete splitting of the music sequence. In addition, in the multi-sound part music recognition model, the accuracy of the last recursion weight is the basis for generating the current recursion weight.
Assume that the music sequence is (x_1, x_2, x_3, …, x_t). The server inputs the music sequence into a multi-vocal music recognition model for recursive convolution, wherein the multi-vocal music recognition model involves a recurrence of the general form:

h_t = f(W·x_t + W·h_(t-1))

where h is a hidden layer vector and W is a weight. After the server obtains the last hidden layer vector, a plurality of sample music score sequence groups are generated based on the last hidden layer vector, wherein one sample music score sequence group corresponds to the music data of one vocal part.
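As an illustrative sketch of this recurrence (the nonlinearity f and the separate input/hidden weights W_x and W_h are assumptions; the description only names a hidden layer vector h and a weight W):

    import numpy as np

    def recursive_convolution(xs, W_x, W_h, h0):
        """Illustrative recurrence h_t = tanh(W_x @ x_t + W_h @ h_(t-1)).
        xs holds the music sequence (x_1, ..., x_t), one vector per step;
        the returned last hidden layer vector is what the model would use
        to generate the sample score sequence groups."""
        h = h0
        for x in xs:
            h = np.tanh(W_x @ x + W_h @ h)  # f is assumed to be tanh
        return h

    rng = np.random.default_rng(0)
    xs = rng.normal(size=(4, 8))        # a 4-step sequence of 8-dim vectors
    h_last = recursive_convolution(xs, rng.normal(size=(16, 8)),
                                   rng.normal(size=(16, 16)), np.zeros(16))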
103. Generating a plurality of matching probability groups according to a plurality of sample music score sequence groups, a pre-trained music language model and a conditional probability model, wherein the plurality of sample music score sequence groups are in one-to-one correspondence with the plurality of matching probability groups;
the server inputs a plurality of sample music score sequence groups into a pre-trained music language model to generate a plurality of initial matching probability groups, and then combines the plurality of initial matching probability groups and the conditional probability model to generate a plurality of matching probability groups which are matched with the multi-vocal music data one by one.
The server inputs the plurality of sample score sequence groups into pre-trained music language models. There are multiple music language models, and together they form a model library; the server determines the corresponding music language model in the model library for each sample score sequence group, then inputs each group into its corresponding music language model to generate corresponding initial matching probabilities, and filters out the sample score sequences whose initial matching probability is smaller than the initial matching probability threshold, together with those probabilities, thereby reducing the output space for generating the single-part music data. Finally, a plurality of matching probability groups are generated from the filtered initial matching probabilities and the conditional probability model.
104. A plurality of target monophonic music data is determined in a plurality of sample score sequence sets based on a plurality of matching probability sets.
The server determines a plurality of target monophonic music data in a plurality of sample score sequence groups based on a plurality of matching probability groups.
The matching probability groups are in one-to-one correspondence with the sample score sequence groups; one sample score sequence group comprises a plurality of sample score sequences, and one matching probability group comprises a plurality of matching probabilities, so each sample score sequence corresponds to one matching probability. Finally, based on the matching probability of each sample score sequence, the server determines the corresponding target single-part music data in each sample score sequence group, thereby obtaining a plurality of target single-part music data.
In the embodiment of the invention, multi-sound music data are recursively convolved into a plurality of sample music score sequence groups, then a plurality of conditional probability groups are generated based on a pre-trained music language model and the plurality of sample music score sequence groups, and a plurality of target single-sound music data are determined based on the plurality of conditional probability groups and the conditional probability model; the problem that a larger output space is generated when the multi-sound part music data are identified is solved, so that the accuracy of identifying the single-sound part music data is improved.
Referring to fig. 2, another embodiment of a multi-vocal-based music data recognition method according to an embodiment of the present invention includes:
201. acquiring multi-sound part music data to be processed, and preprocessing the multi-sound part music data to generate multi-sound part music data;
the server first acquires multi-sound music data to be processed, and then preprocesses the multi-sound music data into multi-sound music data.
Specifically, the server performs pre-emphasis processing on the multi-sound part music data to be processed, generating pre-emphasized multi-sound part music data; the server then performs windowing on the pre-emphasized multi-sound part music data, generating the multi-sound part music data.
The multi-sound part music data to be processed is an analog signal that contains not only useful music information but also noise, so it needs to be preprocessed before being identified in order to suppress the noise. In this embodiment, pre-emphasis and windowing are mainly used. First, the server performs pre-emphasis processing on the multi-sound part music data to be processed, emphasizing the high-frequency part of the music data and thereby increasing the high-frequency resolution of the voice, which yields the pre-emphasized multi-sound part music data. Then, the server performs windowing on the pre-emphasized multi-sound part music data, which consists mainly of voiced and unvoiced sound; to allow smooth transitions between frames of different music signals, the server windows the pre-emphasized data, maintaining the continuity of the music data and generating the multi-sound part music data.
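A minimal sketch of these two preprocessing steps, assuming the common pre-emphasis coefficient of 0.97 and 25 ms Hamming-windowed frames with a 10 ms hop at 16 kHz (none of these constants are specified in the text):

    import numpy as np

    def preprocess(signal, frame_len=400, hop=160, alpha=0.97):
        """Pre-emphasize the raw signal, then split it into overlapping
        Hamming-windowed frames for smooth inter-frame transitions."""
        # Pre-emphasis boosts the high-frequency part: y[n] = x[n] - alpha * x[n-1]
        emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
        # Windowing: frame the signal and taper each frame with a Hamming window
        window = np.hamming(frame_len)
        n_frames = 1 + (len(emphasized) - frame_len) // hop
        frames = np.stack([emphasized[i * hop : i * hop + frame_len] * window
                           for i in range(n_frames)])
        return frames  # the preprocessed multi-part music data

    frames = preprocess(np.random.randn(16000))  # e.g. 1 s of audio at 16 kHz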
202. Acquiring multi-sound part music data, and converting the multi-sound part music data into a music sequence by adopting a neural network architecture, wherein the music sequence comprises a pitch sequence and a rhythm sequence;
the server acquires multi-part music data and converts the multi-part music data into a music sequence including a pitch sequence and a rhythm sequence by using a convolutional neural network. It should be emphasized that, to further ensure the privacy and security of the multi-sounded music data, the multi-sounded music data may also be stored in a node of a blockchain.
The multi-vocal music data is MIDI file data, which is performed by a plurality of musical instruments, possibly accompanied by human voice, and which contains music in multiple vocal parts. In the present invention, the server can recognize the multi-sound part music data as a plurality of single-part music data. First, the server converts the multi-sound part music data into a music sequence by adopting a neural network architecture, where the neural network architecture comprises a convolutional neural network, a long short-term memory artificial neural network and a deep neural network; the multi-sound part music data is input into the neural network architecture to generate the music sequence, which consists of a pitch sequence and a rhythm sequence.
Specifically, the server acquires the multi-sound part music data, inputs it into a hidden layer of a convolutional neural network, and generates an initial music feature vector; the server inputs the initial music feature vector into a long short-term memory (LSTM) artificial neural network, performs time-domain variation reduction processing, and generates a music feature vector with reduced time-domain variation; the server inputs this music feature vector into the fully connected layer for mapping, and a music sequence is generated, wherein the music sequence comprises a pitch sequence and a rhythm sequence.
In this embodiment, the server first inputs the acquired multi-vocal music data into the convolutional neural network for convolution, generating an initial music feature vector; the generated initial music feature vector is a feature vector with reduced frequency-domain variation. The server then inputs the initial music feature vector into the long short-term memory artificial neural network to reduce its time-domain variation, generating the music feature vector with reduced time-domain variation. Finally, the server inputs this music feature vector into a fully connected layer for mapping, where the fully connected layer belongs to the deep neural network; the purpose of the mapping is to project the feature space onto an output layer that is easier to classify. The server performs this computation in the fully connected layer and generates a music sequence comprising a pitch sequence and a rhythm sequence.
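A hedged PyTorch sketch of this CNN → LSTM → fully connected pipeline; the layer sizes, kernel width, and the two output heads for pitch and rhythm are illustrative assumptions rather than values from the patent:

    import torch
    import torch.nn as nn

    class MusicSequenceConverter(nn.Module):
        def __init__(self, n_feats=128, hidden=256, n_pitch=88, n_rhythm=32):
            super().__init__()
            # CNN: reduces frequency-domain variation of the input features
            self.conv = nn.Conv1d(n_feats, hidden, kernel_size=3, padding=1)
            # LSTM: reduces time-domain variation of the feature vectors
            self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
            # Fully connected layers: map the features to pitch/rhythm outputs
            self.pitch_head = nn.Linear(hidden, n_pitch)
            self.rhythm_head = nn.Linear(hidden, n_rhythm)

        def forward(self, x):                    # x: (batch, n_feats, time)
            z = torch.relu(self.conv(x))         # initial music feature vector
            z, _ = self.lstm(z.transpose(1, 2))  # time-domain-reduced features
            return self.pitch_head(z), self.rhythm_head(z)

    model = MusicSequenceConverter()
    pitch_logits, rhythm_logits = model(torch.randn(1, 128, 100))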
In other embodiments, the process of generating the music sequence may also be: acquiring the multi-sound part music data, inputting it into a convolutional neural network, and extracting the characteristic amplitude of the multi-sound part music data with a fast Fourier transform algorithm to generate a multi-sound part amplitude spectrum; converting the amplitude spectrum to the Mel frequency domain with a preset Mel-scale filter to generate a multi-sound part Mel spectrum; performing a nonlinear transformation on the Mel spectrum to generate a nonlinearly transformed Mel spectrum; and performing feature extraction on the nonlinearly transformed Mel spectrum with a discrete cosine transform algorithm to generate a music sequence comprising a pitch sequence and a rhythm sequence.
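This alternative is essentially an MFCC-style front end. The sketch below assumes a 512-point FFT, a precomputed Mel filter matrix mel_fbank, logarithmic compression as the nonlinear transformation, and 13 retained coefficients:

    import numpy as np
    from scipy.fftpack import dct

    def frames_to_sequence(frames, mel_fbank, n_coeffs=13):
        """frames: windowed frames; mel_fbank: assumed precomputed Mel-scale
        filter matrix of shape (n_mels, 512 // 2 + 1)."""
        # Fast Fourier transform -> amplitude spectrum of each frame
        amp = np.abs(np.fft.rfft(frames, n=512, axis=1))
        # Mel-scale filtering -> multi-part Mel spectrum
        mel = amp @ mel_fbank.T
        # Nonlinear (log) transformation of the Mel spectrum
        log_mel = np.log(mel + 1e-8)
        # Discrete cosine transform -> compact music-sequence features
        return dct(log_mel, type=2, axis=1, norm="ortho")[:, :n_coeffs]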
203. Inputting a music sequence into a multi-sound part music recognition model for recursive convolution to generate a plurality of sample music score sequence groups;
the server inputs the music sequence into a multi-vocal part music recognition model and convolves the music sequence with recursively updated weights, thereby obtaining a plurality of sample music score sequence groups.
It should be noted that, for each vocal music data, the server generates a sample score sequence group according to the corresponding music sequence, so as to complete splitting of the music sequence. In addition, in the multi-sound part music recognition model, the accuracy of the last recursion weight is the basis for generating the current recursion weight.
Assume that the music sequence is (x_1, x_2, x_3, …, x_n). The server inputs the music sequence into a multi-vocal music recognition model for recursive convolution, wherein the multi-vocal music recognition model involves a recurrence of the general form:

h_t = f(W·x_t + W·h_(t-1))

where h is a hidden layer vector and W is a weight. After the server obtains the last hidden layer vector, a plurality of sample music score sequence groups are generated based on the last hidden layer vector, wherein one sample music score sequence group corresponds to the music data of one vocal part.
Specifically, the server first reads a preset weight parameter, then inputs the obtained music sequence into the first hidden layer in the multi-vocal music recognition model, and convolves the music sequence with the preset weight parameter to generate a first accuracy and a first hidden music vector. In the second hidden layer, the server inputs the first accuracy into a preset loss function to generate a first weight parameter, and convolves the first hidden music vector with the first weight parameter to generate a second accuracy and a second hidden music vector. In the third hidden layer, the server inputs the second accuracy into the preset loss function to generate a second weight parameter, and convolves the second hidden music vector with the second weight parameter to generate a third accuracy and a third hidden music vector. If there are 5 hidden layers, the last-layer hidden music vector is generated by recursing five times according to the above steps; finally, the last-layer hidden music vector is convolved in the multi-vocal music recognition model to generate a plurality of sample music score sequence groups.
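An illustrative sketch of this layer-by-layer recursion; the patent does not give the loss function or how an accuracy yields the next weight parameter, so the scoring and the update rule below are placeholders:

    import numpy as np

    def hidden_layer(vec, W):
        """One hidden layer: project the incoming vector with the current
        weight and score the result with a stand-in accuracy measure."""
        out = np.tanh(W @ vec)
        accuracy = float(np.clip(np.abs(out).mean(), 0.0, 1.0))
        return accuracy, out

    def recurse_layers(music_vec, W0, n_layers=5):
        W, vec = W0, music_vec
        for _ in range(n_layers):
            acc, vec = hidden_layer(vec, W)
            # Assumed loss-driven update: the previous layer's accuracy
            # determines the weight parameter used by the next layer
            W = W * (1.0 + (1.0 - acc))
        return vec  # last-layer hidden music vector

    rng = np.random.default_rng(1)
    h_last = recurse_layers(rng.normal(size=16), rng.normal(size=(16, 16)))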
204. Generating a plurality of matching probability groups according to a plurality of sample music score sequence groups, a pre-trained music language model and a conditional probability model, wherein the plurality of sample music score sequence groups are in one-to-one correspondence with the plurality of matching probability groups;
the server inputs a plurality of sample music score sequence groups into a pre-trained music language model to generate a plurality of initial matching probability groups, and then combines the plurality of initial matching probability groups and the conditional probability model to generate a plurality of matching probability groups which are matched with the multi-vocal music data one by one.
The server inputs the plurality of sample score sequence groups into pre-trained music language models. There are multiple music language models, and together they form a model library; the server determines the corresponding music language model in the model library for each sample score sequence group, then inputs each group into its corresponding music language model to generate corresponding initial matching probabilities, and filters out the sample score sequences whose initial matching probability is smaller than the initial matching probability threshold, together with those probabilities, thereby reducing the output space for generating the single-part music data. Finally, a plurality of matching probability groups are generated from the filtered initial matching probabilities and the conditional probability model.
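The threshold filtering that shrinks the output space can be sketched as follows (the threshold value and the string placeholders are assumptions):

    def filter_by_threshold(score_seqs, init_probs, threshold=0.5):
        """Drop every sample score sequence whose initial matching
        probability falls below the threshold, keeping pairs aligned."""
        kept = [(s, p) for s, p in zip(score_seqs, init_probs) if p >= threshold]
        seqs, probs = zip(*kept) if kept else ((), ())
        return list(seqs), list(probs)

    seqs, probs = filter_by_threshold(["seq_a", "seq_b", "seq_c"],
                                      [0.42, 0.76, 0.91])
    # -> (["seq_b", "seq_c"], [0.76, 0.91]); "seq_a" is filtered out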
Specifically, the server sequentially inputs a plurality of sample music score sequence groups into a pre-trained music language model according to a time sequence for comparison, and a plurality of conditional probability groups are generated; the server sequentially inputs a plurality of conditional probability groups into the conditional probability model according to the time sequence to generate a plurality of matching probability groups, and the plurality of sample music score sequence groups are in one-to-one correspondence with the plurality of matching probability groups.
The server first inputs the plurality of sample score sequence groups, in chronological order, into the pre-trained music language model for comparison, generating a plurality of conditional probability groups, wherein one conditional probability group corresponds to one sample sequence group. Before a conditional probability group is generated, an initial conditional probability group is generated first; the server compares each initial conditional probability with a conditional probability threshold, and filters out every sample sequence whose initial conditional probability is smaller than the threshold, together with that initial conditional probability, thereby generating the plurality of conditional probability groups. The server then sequentially inputs the plurality of conditional probability groups into the conditional probability model according to the time sequence, wherein the conditional probability model involves a formula of the general form:

P(x) = ∏_i P(x_i | x_(i-1), …, x_1)

where P(x) is the matching probability, i is the current moment and i-1 is the last moment. After calculating the matching probability of the last moment, the server combines it to calculate the matching probability of the target (current) moment, thereby generating a plurality of matching probability groups.
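A short sketch of turning per-moment conditional probabilities into running matching probabilities with this chain-product form:

    def matching_probabilities(cond_probs):
        """cond_probs: conditional probabilities P(x_i | x_(i-1), ..., x_1),
        ordered by moment i. Returns the matching probability P(x) at each
        moment, each value computed from the previous moment's value."""
        match, running = [], 1.0
        for p in cond_probs:
            running *= p  # P(x) at moment i = P(x) at i-1, times P(x_i | ...)
            match.append(running)
        return match

    print(matching_probabilities([0.9, 0.8, 0.95]))  # [0.9, 0.72, 0.684]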
205. A plurality of target monophonic music data is determined in a plurality of sample score sequence sets based on a plurality of matching probability sets.
The server determines a plurality of target monophonic music data in a plurality of sample score sequence groups based on a plurality of matching probability groups.
The matching probability groups are in one-to-one correspondence with the sample score sequence groups; one sample score sequence group comprises a plurality of sample score sequences, and one matching probability group comprises a plurality of matching probabilities, so each sample score sequence corresponds to one matching probability. Finally, based on the matching probability of each sample score sequence, the server determines the corresponding target single-part music data in each sample score sequence group, thereby obtaining a plurality of target single-part music data.
Specifically, the server searches the largest matching probability in the corresponding matching probability group aiming at a sample music score sequence group, and determines the target matching probability; the server determines a sample music score sequence corresponding to the target matching probability as single target single-part music data in the corresponding sample music score sequence group; the server determines a plurality of other target single-sound-part music data aiming at other sample music score sequence groups and corresponding other matching probability groups; the server integrates the single target monophonic music data with a plurality of other target monophonic music data to generate a plurality of target monophonic music data.
Taking one sample score sequence group as an example, assume that its corresponding matching probability group is [0.64, 0.67, 0.70, 0.73, 0.75, 0.78, 0.81, 0.83, 0.89, 0.95]. In this matching probability group, the largest matching probability, 0.95, is found, giving the target matching probability 0.95; the server then takes the sample music score sequence corresponding to 0.95 and determines it as the target single-part music data. In this manner, in the plurality of sample score sequence groups, a plurality of target single-part music data are obtained by combining the corresponding matching probability groups.
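The selection step itself reduces to an argmax per group, as in this sketch (the candidate names stand in for the sample score sequences lost from the original figure):

    def select_target(score_seqs, match_probs):
        """Pick the sample score sequence with the largest matching
        probability; it becomes this group's target single-part music data."""
        best = max(range(len(match_probs)), key=match_probs.__getitem__)
        return score_seqs[best], match_probs[best]

    probs = [0.64, 0.67, 0.70, 0.73, 0.75, 0.78, 0.81, 0.83, 0.89, 0.95]
    seqs = [f"candidate_{i}" for i in range(len(probs))]
    target_seq, target_prob = select_target(seqs, probs)  # -> candidate_9, 0.95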
In the embodiment of the invention, multi-sound music data are recursively convolved into a plurality of sample music score sequence groups, then a plurality of conditional probability groups are generated based on a pre-trained music language model and the plurality of sample music score sequence groups, and a plurality of target single-sound music data are determined based on the plurality of conditional probability groups and the conditional probability model; the problem that a larger output space is generated when the multi-sound part music data are identified is solved, so that the accuracy of identifying the single-sound part music data is improved.
The method for identifying music data based on multiple vocal parts in the embodiment of the present invention is described above, and the apparatus for identifying music data based on multiple vocal parts in the embodiment of the present invention is described below, referring to fig. 3, an embodiment of the apparatus for identifying music data based on multiple vocal parts in the embodiment of the present invention includes:
The acquiring module 301 is configured to acquire multi-sound music data, and convert the multi-sound music data into a music sequence by adopting a neural network architecture, where the music sequence includes a pitch sequence and a rhythm sequence;
a convolution module 302, configured to input the music sequence into a multi-vocal music recognition model for recursive convolution, and generate a plurality of sample score sequence groups;
the probability generation module 303 is configured to generate a plurality of matching probability groups according to the plurality of sample score sequence groups, the pre-trained music language model and the conditional probability model, where the plurality of sample score sequence groups are in one-to-one correspondence with the plurality of matching probability groups;
the music data generating module 304 is configured to determine a plurality of target monophonic music data among the plurality of sample score sequence groups based on the plurality of matching probability groups.
In the embodiment of the invention, multi-sound music data are recursively convolved into a plurality of sample music score sequence groups, then a plurality of conditional probability groups are generated based on a pre-trained music language model and the plurality of sample music score sequence groups, and a plurality of target single-sound music data are determined based on the plurality of conditional probability groups and the conditional probability model; the problem that a larger output space is generated when the multi-sound part music data are identified is solved, so that the accuracy of identifying the single-sound part music data is improved.
Referring to fig. 4, another embodiment of a multi-vocal-based music data recognition apparatus according to an embodiment of the present invention includes:
the acquiring module 301 is configured to acquire multi-sound music data, and convert the multi-sound music data into a music sequence by adopting a neural network architecture, where the music sequence includes a pitch sequence and a rhythm sequence;
a convolution module 302, configured to input the music sequence into a multi-vocal music recognition model for recursive convolution, and generate a plurality of sample score sequence groups;
the probability generation module 303 is configured to generate a plurality of matching probability groups according to the plurality of sample score sequence groups, the pre-trained music language model and the conditional probability model, where the plurality of sample score sequence groups are in one-to-one correspondence with the plurality of matching probability groups;
the music data generating module 304 is configured to determine a plurality of target monophonic music data among the plurality of sample score sequence groups based on the plurality of matching probability groups.
Optionally, the obtaining module 301 may be further specifically configured to:
acquiring multi-sound-part music data, inputting the multi-sound-part music data into a convolutional neural network, and generating an initial music feature vector;
inputting the initial music feature vector into a long short-term memory (LSTM) artificial neural network, performing time-domain variation reduction processing, and generating a music feature vector with reduced time-domain variation;
And inputting the music feature vector with reduced time-domain variation into a fully connected layer of a deep neural network for mapping to generate a music sequence, wherein the music sequence comprises a pitch sequence and a rhythm sequence.
Optionally, the convolution module 302 may be further specifically configured to:
reading preset weight parameters, inputting the music sequence into a first hidden layer in a multi-vocal music recognition model, and convolving the music sequence with the preset weight parameters to generate a first accuracy and a first hidden music vector;
calculating a first weight parameter based on a preset loss function and the first accuracy, inputting the first hidden music vector into a second hidden layer, and convolving the first hidden music vector with the first weight parameter to generate a second accuracy and a second hidden music vector;
calculating a second weight parameter based on the loss function and the second accuracy, inputting the second hidden music vector into a third hidden layer, and convolving the second hidden music vector with the second weight parameter to generate a third accuracy and a third hidden music vector;
and performing convolution in the remaining hidden layers in the same manner, based on the corresponding weight parameters and the corresponding hidden music vectors, to generate a plurality of sample music score sequence groups.
Optionally, the probability generation module 303 may be further specifically configured to:
sequentially inputting the plurality of sample music score sequence groups into a pre-trained music language model according to a time sequence for comparison to generate a plurality of conditional probability groups;
and sequentially inputting the conditional probability groups into a conditional probability model according to the time sequence to generate a plurality of matching probability groups, wherein the plurality of sample music score sequence groups are in one-to-one correspondence with the plurality of matching probability groups.
Optionally, the music data generating module 304 may be further specifically configured to:
aiming at a sample music score sequence group, searching the maximum matching probability in the corresponding matching probability group, and determining the target matching probability;
in the corresponding sample music score sequence group, determining the sample music score sequence corresponding to the target matching probability as single target single-part music data;
determining a plurality of other target monophonic music data for other sample score sequence groups and corresponding other matching probability groups;
and integrating the single target single-part music data and the plurality of other target single-part music data to generate a plurality of target single-part music data.
Optionally, the multi-vocal-based music data recognition device further includes:
The preprocessing module 305 is configured to obtain multi-sound part music data to be processed, and preprocess the to-be-processed data to generate the multi-sound part music data.
Optionally, the preprocessing module 305 may be further specifically configured to:
performing pre-emphasis processing on the multi-sound part music data to be processed, generating pre-emphasized multi-sound part music data;
and windowing the pre-emphasized multi-sound part music data, generating the multi-sound part music data.
In the embodiment of the invention, multi-sound music data are recursively convolved into a plurality of sample music score sequence groups, then a plurality of conditional probability groups are generated based on a pre-trained music language model and the plurality of sample music score sequence groups, and a plurality of target single-sound music data are determined based on the plurality of conditional probability groups and the conditional probability model; the problem that a larger output space is generated when the multi-sound part music data are identified is solved, so that the accuracy of identifying the single-sound part music data is improved.
The multi-part based music data recognition apparatus in the embodiment of the present invention is described in detail from the point of view of the modularized functional entity in fig. 3 and 4 above, and the multi-part based music data recognition device in the embodiment of the present invention is described in detail from the point of view of the hardware processing below.
Fig. 5 is a schematic structural diagram of a multi-vocal-based music data recognition device 500 according to an embodiment of the present invention, where the multi-vocal-based music data recognition device 500 may have relatively large differences due to different configurations or performances, and may include one or more processors (central processing units, CPU) 510 (e.g., one or more processors) and a memory 520, and one or more storage media 530 (e.g., one or more mass storage devices) storing application programs 533 or data 532. Wherein memory 520 and storage medium 530 may be transitory or persistent storage. The program stored in the storage medium 530 may include one or more modules (not shown), each of which may include a series of instruction operations in the multi-vocal-based music data recognition device 500. Still further, the processor 510 may be configured to communicate with the storage medium 530 to execute a series of instruction operations in the storage medium 530 on the multi-sounded music data recognition device 500.
The multi-vocal-based music data recognition device 500 can also include one or more power supplies 540, one or more wired or wireless network interfaces 550, one or more input/output interfaces 560, and/or one or more operating systems 531, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, and the like. It will be appreciated by those skilled in the art that the multi-vocal-based music data recognition device structure shown in fig. 5 does not constitute a limitation of the multi-vocal-based music data recognition device, and may include more or fewer components than illustrated, or may combine certain components, or may use a different arrangement of components.
The present invention also provides a multi-vocal-part-based music data recognition device, including a memory and a processor; the memory stores computer-readable instructions that, when executed by the processor, cause the processor to perform the steps of the multi-vocal-part-based music data recognition method in the above embodiments.
The present invention also provides a computer readable storage medium, which may be a non-volatile computer readable storage medium, and may also be a volatile computer readable storage medium, in which instructions are stored which, when executed on a computer, cause the computer to perform the steps of the multi-vocal-based music data recognition method.
It will be clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the systems, apparatuses and units described above may refer to the corresponding processes in the foregoing method embodiments, and are not repeated here.
The blockchain is a novel application mode of computer technologies such as distributed data storage, peer-to-peer transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks linked by cryptographic means, each data block containing a batch of network transaction information used to verify the validity of the information (anti-counterfeiting) and to generate the next block. A blockchain may include a blockchain underlying platform, a platform product services layer, an application services layer, and the like.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, including instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
The above embodiments are only intended to illustrate the technical solution of the present invention, not to limit it. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents, and such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A multi-part music data recognition method, characterized in that the multi-part music data recognition method comprises:
acquiring multi-part music data, and converting the multi-part music data into a music sequence using a neural network architecture, wherein the music sequence comprises a pitch sequence and a rhythm sequence;
inputting the music sequence into a multi-part music recognition model for recursive convolution, to generate a plurality of sample score sequence groups;
generating a plurality of matching probability groups according to the plurality of sample score sequence groups, a pre-trained music language model and a conditional probability model, wherein the plurality of sample score sequence groups correspond one-to-one with the plurality of matching probability groups;
determining a plurality of target monophonic music data in the plurality of sample score sequence groups based on the plurality of matching probability groups;
wherein the acquiring multi-part music data and converting the multi-part music data into a music sequence using a neural network architecture, the music sequence comprising a pitch sequence and a rhythm sequence, comprises:
acquiring the multi-part music data, inputting the multi-part music data into a convolutional neural network, and generating an initial music feature vector;
inputting the initial music feature vector into a long short-term memory (LSTM) network for temporal-variation reduction processing, to generate a music feature vector with reduced temporal variation;
inputting the music feature vector with reduced temporal variation into a fully connected layer of a deep neural network for mapping, to generate the music sequence comprising the pitch sequence and the rhythm sequence;
wherein the inputting the music sequence into a multi-part music recognition model for recursive convolution to generate a plurality of sample score sequence groups comprises:
reading preset weight parameters, inputting the music sequence into a first hidden layer of the multi-part music recognition model, and convolving the music sequence with the preset weight parameters to generate a first accuracy and a first hidden music vector;
calculating a first weight parameter based on a preset loss function and the first accuracy, inputting the first hidden music vector into a second hidden layer, and convolving the first hidden music vector with the first weight parameter to generate a second accuracy and a second hidden music vector;
calculating a second weight parameter based on the loss function and the second accuracy, inputting the second hidden music vector into a third hidden layer, and convolving the second hidden music vector with the second weight parameter to generate a third accuracy and a third hidden music vector;
convolving the remaining hidden layers in the same manner, based on the corresponding weight parameters and the corresponding hidden music vectors, to generate the plurality of sample score sequence groups;
wherein the generating a plurality of matching probability groups according to the plurality of sample score sequence groups, the pre-trained music language model and the conditional probability model, the plurality of sample score sequence groups corresponding one-to-one with the plurality of matching probability groups, comprises:
sequentially inputting the plurality of sample score sequence groups into the pre-trained music language model in time order for comparison, to generate a plurality of conditional probability groups;
and sequentially inputting the plurality of conditional probability groups into the conditional probability model in time order, to generate the plurality of matching probability groups, wherein the plurality of sample score sequence groups correspond one-to-one with the plurality of matching probability groups.
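By way of illustration only (not part of the claims), the conversion step recited in claim 1 — convolutional features, LSTM temporal-variation reduction, fully connected mapping to pitch and rhythm sequences — might be sketched in PyTorch as follows. Every layer size, the 80-bin input feature format, and the 88/16 output vocabularies are assumptions made for demonstration; the patent fixes only the CNN → LSTM → fully connected pipeline.

```python
import torch
import torch.nn as nn

class MusicSequenceEncoder(nn.Module):
    """Sketch of claim 1's conversion step; all dimensions are
    illustrative assumptions, not values from the patent."""

    def __init__(self, n_feats=80, hidden=128, n_pitches=88, n_rhythms=16):
        super().__init__()
        self.cnn = nn.Conv1d(n_feats, hidden, kernel_size=3, padding=1)
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.pitch_head = nn.Linear(hidden, n_pitches)    # pitch sequence logits
        self.rhythm_head = nn.Linear(hidden, n_rhythms)   # rhythm sequence logits

    def forward(self, x):                  # x: (batch, n_feats, frames)
        feats = torch.relu(self.cnn(x))    # initial music feature vector
        feats, _ = self.lstm(feats.transpose(1, 2))  # reduce temporal variation
        return self.pitch_head(feats), self.rhythm_head(feats)

# Toy usage: one clip of 80-bin features over 200 frames.
enc = MusicSequenceEncoder()
pitch_logits, rhythm_logits = enc(torch.randn(1, 80, 200))
print(pitch_logits.shape, rhythm_logits.shape)  # (1, 200, 88) (1, 200, 16)
```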
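The recursive convolution across hidden layers could likewise be sketched loosely. The accuracy measure and the loss-based weight update below are placeholders invented for illustration; the patent does not publish its loss function or the exact recurrence.

```python
import numpy as np

def recursive_convolution(music_seq, init_weights, n_layers=4):
    """Loose sketch: each hidden layer convolves the previous hidden
    music vector with a weight parameter recomputed from the previous
    layer's accuracy, yielding one candidate per layer."""
    hidden, weights = music_seq, init_weights
    group = []
    for _ in range(n_layers):
        hidden = np.convolve(hidden, weights, mode="same")
        accuracy = 1.0 / (1.0 + np.mean(hidden ** 2))        # placeholder accuracy
        weights = weights * (1.0 - 0.1 * (1.0 - accuracy))   # placeholder update
        group.append(hidden.copy())
    return group   # one sample score sequence group

group = recursive_convolution(np.random.randn(32), np.array([0.25, 0.5, 0.25]))
print(len(group), group[0].shape)
```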
2. The multi-part music data recognition method of claim 1, wherein the determining a plurality of target monophonic music data in the plurality of sample score sequence groups based on the plurality of matching probability groups comprises:
for one sample score sequence group, searching for the maximum matching probability in the corresponding matching probability group and determining it as the target matching probability;
in the corresponding sample score sequence group, determining the sample score sequence corresponding to the target matching probability as a single piece of target monophonic music data;
determining a plurality of other target monophonic music data from the other sample score sequence groups and the corresponding other matching probability groups;
and integrating the single piece of target monophonic music data with the plurality of other target monophonic music data to generate the plurality of target monophonic music data.
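Claim 2's selection rule is a per-group argmax; a minimal sketch (illustrative only, with toy data) follows.

```python
import numpy as np

def pick_targets(groups, matching_probs):
    """In every sample score sequence group, keep the candidate with
    the maximum matching probability as the target monophonic data."""
    return [group[int(np.argmax(probs))]
            for group, probs in zip(groups, matching_probs)]

# Toy usage: two groups of two candidate pitch sequences each.
groups = [[[60, 62], [60, 64]], [[48, 50], [48, 55]]]
probs = [[0.2, 0.8], [0.7, 0.3]]
print(pick_targets(groups, probs))  # [[60, 64], [48, 50]]
```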
3. The multi-part music data recognition method according to claim 1 or claim 2, wherein before the acquiring multi-part music data and converting the multi-part music data into a music sequence comprising a pitch sequence and a rhythm sequence using a neural network architecture, the multi-part music data recognition method further comprises:
acquiring multi-part music data to be processed, and preprocessing the multi-part music data to generate the multi-part music data.
4. The multi-part music data recognition method of claim 3, wherein the acquiring multi-part music data to be processed and preprocessing the multi-part music data to generate the multi-part music data comprises:
performing pre-emphasis processing on the multi-part music data to generate pre-emphasized multi-part music data;
and windowing the pre-emphasized multi-part music data to generate the multi-part music data.
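The preprocessing of claim 4 corresponds to conventional audio front-end steps. The pre-emphasis coefficient 0.97 and the Hamming-window frame and hop sizes below are customary values assumed for illustration; the patent does not fix them.

```python
import numpy as np

def preprocess(signal, alpha=0.97, frame=1024, hop=512):
    """Pre-emphasis followed by windowed framing, as in claim 4."""
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    window = np.hamming(frame)
    frames = [emphasized[i:i + frame] * window
              for i in range(0, len(emphasized) - frame + 1, hop)]
    return np.stack(frames)

# Toy usage: one second of noise at 16 kHz -> (frames, 1024) array.
print(preprocess(np.random.randn(16000)).shape)
```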
5. A multi-part music data recognition apparatus, characterized in that the multi-part music data recognition apparatus comprises:
an acquisition module, configured to acquire multi-part music data and convert the multi-part music data into a music sequence using a neural network architecture, wherein the music sequence comprises a pitch sequence and a rhythm sequence;
a convolution module, configured to input the music sequence into a multi-part music recognition model for recursive convolution, to generate a plurality of sample score sequence groups;
a probability generation module, configured to generate a plurality of matching probability groups according to the plurality of sample score sequence groups, a pre-trained music language model and a conditional probability model, wherein the plurality of sample score sequence groups correspond one-to-one with the plurality of matching probability groups;
a music data generation module, configured to determine a plurality of target monophonic music data in the plurality of sample score sequence groups based on the plurality of matching probability groups;
wherein the acquisition module is specifically configured to: acquire the multi-part music data, input the multi-part music data into a convolutional neural network, and generate an initial music feature vector; input the initial music feature vector into a long short-term memory (LSTM) network for temporal-variation reduction processing, to generate a music feature vector with reduced temporal variation; and input the music feature vector with reduced temporal variation into a fully connected layer of a deep neural network for mapping, to generate the music sequence comprising the pitch sequence and the rhythm sequence;
wherein the convolution module is specifically configured to: read preset weight parameters, input the music sequence into a first hidden layer of the multi-part music recognition model, and convolve the music sequence with the preset weight parameters to generate a first accuracy and a first hidden music vector; calculate a first weight parameter based on a preset loss function and the first accuracy, input the first hidden music vector into a second hidden layer, and convolve the first hidden music vector with the first weight parameter to generate a second accuracy and a second hidden music vector; calculate a second weight parameter based on the loss function and the second accuracy, input the second hidden music vector into a third hidden layer, and convolve the second hidden music vector with the second weight parameter to generate a third accuracy and a third hidden music vector; and convolve the remaining hidden layers in the same manner, based on the corresponding weight parameters and the corresponding hidden music vectors, to generate the plurality of sample score sequence groups;
wherein the probability generation module is specifically configured to: sequentially input the plurality of sample score sequence groups into the pre-trained music language model in time order for comparison, to generate a plurality of conditional probability groups; and sequentially input the plurality of conditional probability groups into the conditional probability model in time order, to generate the plurality of matching probability groups, wherein the plurality of sample score sequence groups correspond one-to-one with the plurality of matching probability groups.
6. The multi-part music data recognition apparatus of claim 5, wherein the music data generation module is specifically configured to:
for one sample score sequence group, search for the maximum matching probability in the corresponding matching probability group and determine it as the target matching probability;
in the corresponding sample score sequence group, determine the sample score sequence corresponding to the target matching probability as a single piece of target monophonic music data;
determine a plurality of other target monophonic music data from the other sample score sequence groups and the corresponding other matching probability groups;
and integrate the single piece of target monophonic music data with the plurality of other target monophonic music data to generate the plurality of target monophonic music data.
7. The multi-part music data recognition apparatus of claim 5, further comprising: a preprocessing module, configured to acquire multi-part music data to be processed, and to preprocess the multi-part music data to generate the multi-part music data.
8. The multi-part music data recognition apparatus of claim 7, wherein the preprocessing module is specifically configured to: perform pre-emphasis processing on the multi-part music data to generate pre-emphasized multi-part music data; and window the pre-emphasized multi-part music data to generate the multi-part music data.
9. A multi-part music data recognition device, characterized in that the multi-part music data recognition device comprises: a memory and at least one processor, the memory having instructions stored therein;
wherein the at least one processor invokes the instructions in the memory to cause the multi-part music data recognition device to perform the multi-part music data recognition method of any one of claims 1-4.
10. A computer-readable storage medium having instructions stored thereon which, when executed by a processor, implement the multi-part music data recognition method of any one of claims 1-4.
CN202110322916.9A 2021-03-26 2021-03-26 Music data identification method, device, equipment and storage medium based on multiple sound parts Active CN112967734B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110322916.9A CN112967734B (en) 2021-03-26 2021-03-26 Music data identification method, device, equipment and storage medium based on multiple sound parts

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110322916.9A CN112967734B (en) 2021-03-26 2021-03-26 Music data identification method, device, equipment and storage medium based on multiple sound parts

Publications (2)

Publication Number Publication Date
CN112967734A (en) 2021-06-15
CN112967734B (en) 2024-02-27

Family

ID=76278522

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110322916.9A Active CN112967734B (en) 2021-03-26 2021-03-26 Music data identification method, device, equipment and storage medium based on multiple sound parts

Country Status (1)

Country Link
CN (1) CN112967734B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113450828A (en) * 2021-06-25 2021-09-28 平安科技(深圳)有限公司 Music genre identification method, device, equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105895082A (en) * 2016-05-30 2016-08-24 乐视控股(北京)有限公司 Acoustic model training method and device as well as speech recognition method and device
CN108962279A (en) * 2018-07-05 2018-12-07 平安科技(深圳)有限公司 New Method for Instrument Recognition and device, electronic equipment, the storage medium of audio data
CN111540374A (en) * 2020-04-17 2020-08-14 杭州网易云音乐科技有限公司 Method and device for extracting accompaniment and voice, and method and device for generating word-by-word lyrics
CN111833845A (en) * 2020-07-31 2020-10-27 平安科技(深圳)有限公司 Multi-language speech recognition model training method, device, equipment and storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8927846B2 (en) * 2013-03-15 2015-01-06 Exomens System and method for analysis and creation of music
CN109065008B (en) * 2018-05-28 2020-10-27 森兰信息科技(上海)有限公司 Music performance music score matching method, storage medium and intelligent musical instrument


Also Published As

Publication number Publication date
CN112967734A (en) 2021-06-15

Similar Documents

Publication Publication Date Title
Zeghidour et al. Wavesplit: End-to-end speech separation by speaker clustering
US20080082323A1 (en) Intelligent classification system of sound signals and method thereof
JPH05216490A (en) Apparatus and method for speech coding and apparatus and method for speech recognition
CN104887263B (en) A kind of identification algorithm and its system based on heart sound multi-dimension feature extraction
US6224636B1 (en) Speech recognition using nonparametric speech models
CN112786057B (en) Voiceprint recognition method and device, electronic equipment and storage medium
CN113436612B (en) Intention recognition method, device, equipment and storage medium based on voice data
Van Balen et al. Cognition-inspired descriptors for scalable cover song retrieval
JP5345783B2 (en) How to generate a footprint for an audio signal
CN112967734B (en) Music data identification method, device, equipment and storage medium based on multiple sound parts
CN113450828A (en) Music genre identification method, device, equipment and storage medium
CN113421589B (en) Singer identification method, singer identification device, singer identification equipment and storage medium
CN113870903A (en) Pathological voice recognition method, device, equipment and storage medium
CN112116922B (en) Noise blind source signal separation method, terminal equipment and storage medium
Dziubinski et al. Estimation of musical sound separation algorithm effectiveness employing neural networks
JP5924968B2 (en) Score position estimation apparatus and score position estimation method
Pratama et al. Human vocal type classification using MFCC and convolutional neural network
CN113066512B (en) Buddhism music identification method, device, equipment and storage medium
KR100766170B1 (en) Music summarization apparatus and method using multi-level vector quantization
JP4219539B2 (en) Acoustic classification device
CN112309404B (en) Machine voice authentication method, device, equipment and storage medium
CN112927667A (en) Chord identification method, apparatus, device and storage medium
CN114530142A (en) Information recommendation method, device and equipment based on random forest and storage medium
JP2017134321A (en) Signal processing method, signal processing device, and signal processing program
Jangid et al. Sound Classification Using Residual Convolutional Network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant