CN113593504A - Pitch recognition model establishing method, pitch recognition method and pitch recognition device - Google Patents


Info

Publication number: CN113593504A (application number CN202010369795.9A)
Authority: CN (China)
Prior art keywords: pitch, musical sound, data, probability, model
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 夏雨, 周建民, 闫召曦, 应笕
Current and original assignee: Xiaoyezi Beijing Technology Co ltd
Application filed by: Xiaoyezi Beijing Technology Co ltd
Priority: CN202010369795.9A, published as CN113593504A

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H 1/00 Details of electrophonic musical instruments
    • G10H 1/0008 Associated control or indicating means
    • G10H 7/00 Instruments in which the tones are synthesised from a data store, e.g. computer organs
    • G10H 2210/00 Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
    • G10H 2210/031 Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal
    • G10H 2210/041 Musical analysis based on MFCC [mel-frequency spectral coefficients]
    • G10H 2210/066 Musical analysis for pitch analysis as part of wider processing for musical purposes, e.g. transcription, musical performance evaluation; pitch recognition, e.g. in polyphonic sounds; estimation or use of missing fundamental
    • G10H 2250/00 Aspects of algorithms or signal processing methods without intrinsic musical character, yet specifically adapted for or used in electrophonic musical processing
    • G10H 2250/005 Algorithms for electrophonic musical instruments or musical processing, e.g. for automatic composition or resource allocation
    • G10H 2250/131 Mathematical functions for musical analysis, processing, synthesis or composition
    • G10H 2250/145 Convolution, e.g. of a music input signal with a desired impulse response to compute an output
    • G10H 2250/311 Neural networks for electrophonic musical instruments or musical processing, e.g. for musical recognition or control, automatic composition or improvisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • General Engineering & Computer Science (AREA)
  • Auxiliary Devices For Music (AREA)
  • Electrophonic Musical Instruments (AREA)

Abstract

The embodiments of the invention disclose a pitch recognition model establishing method, a pitch recognition method and a pitch recognition device. They relate to the technical field of music education and can improve the accuracy of pitch recognition. The establishing method comprises the following steps: acquiring musical sound training data, wherein the musical sound training data comprise musical sound original file segments and digital control signals corresponding to the note pitches in those segments; convolving the frequency-domain data of each musical sound original file segment through a convolutional neural network model to obtain intermediate data, wherein the convolutional neural network model carries a first parameter to be determined; inputting the intermediate data into a bidirectional long short-term memory network model and determining the pitch probabilities of the notes in the musical sound original file segment, wherein the bidirectional long short-term memory network model carries a second parameter to be determined; and determining the first and second parameters to be determined according to the correspondence between the pitch probabilities and the digital control signals.

Description

Pitch recognition model establishing method, pitch recognition method and pitch recognition device
Technical Field
The invention relates to the field of computer technology, and in particular to a pitch recognition model establishing method, a pitch recognition method and a pitch recognition device.
Background
Accurately converting a music file played by an instrument (e.g., in WAV format) into a digital signal that a computer can process (e.g., a MIDI, musical instrument digital interface, signal) is of great significance in the field of music education, for example in piano-teaching assistance, automatic score transcription, and music content retrieval.
Various techniques exist for converting music audio into MIDI, such as recognition based on traditional time-domain and frequency-domain algorithms (e.g., hidden Markov models). However, because musical waveforms are complex (timbre, chords, etc.), the converted MIDI output is often unsatisfactory and requires extensive manual proofreading.
With the rise of artificial intelligence, many machine-learning techniques have been applied to music pitch detection, such as K-nearest neighbors, Hidden Markov Models (HMM) and Recurrent Neural Networks (RNN). Although these methods improve pitch recognition to some extent over traditional algorithms, they still fall short of commercial accuracy requirements.
Disclosure of Invention
In view of this, embodiments of the present invention provide a pitch recognition model establishing method, a pitch recognition method, a pitch recognition device, an electronic device, and a storage medium, which can effectively improve the accuracy of pitch recognition.
In a first aspect, an embodiment of the present invention provides a method for establishing a pitch recognition model, including: acquiring musical sound training data, wherein the musical sound training data comprise musical sound original file segments and digital control signals corresponding to the note pitches in those segments; convolving the frequency-domain data of the musical sound original file segment through a convolutional neural network model to obtain intermediate data, wherein the convolutional neural network model carries a first parameter to be determined; inputting the intermediate data into a bidirectional long short-term memory network model and determining the pitch probabilities of the notes in the musical sound original file segment, wherein the bidirectional long short-term memory network model carries a second parameter to be determined; and determining the first and second parameters to be determined according to the correspondence between the pitch probabilities and the digital control signals, to obtain a pitch recognition model.
Optionally, the inputting the intermediate data into a bidirectional long and short term memory network model, and the determining the pitch probability of the musical note in the musical sound file segment includes: inputting the intermediate data into a bidirectional long-short term memory network model, and determining the triggering probability of each sounding element in a preset type of musical instrument; the determining the first to-be-determined parameter and the second to-be-determined parameter according to the corresponding relationship between the pitch probability and the digital control signal comprises: and determining the first to-be-determined parameter and the second to-be-determined parameter according to the corresponding relation between the triggering probability of each sounding element and the digital control signal.
Optionally, the convolving the frequency domain data of the musical sound original file segment by using a convolutional neural network model includes: and performing two-dimensional convolution operation on the frequency domain data of the musical sound original file segment and removing redundant information to obtain the intermediate data.
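The claim does not name the "removing redundant information" operation; one common choice after a two-dimensional convolution is max pooling, which keeps only the strongest activation in each local block. A minimal sketch under that assumption (2x2 pooling is an illustrative choice, not the patent's stated operation):

```python
import numpy as np

def max_pool2d(x, size=2):
    """Non-overlapping max pooling over a 2-D feature map.

    A plausible reading of 'removing redundant information' after the
    two-dimensional convolution; the patent does not specify the step."""
    h, w = x.shape[0] // size, x.shape[1] // size
    return x[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))

x = np.array([[1., 2., 5., 6.],
              [3., 4., 7., 8.]])
pooled = max_pool2d(x)  # each 2x2 block collapses to its maximum
```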
Optionally, after obtaining the intermediate data and before inputting the intermediate data into the bidirectional long-short term memory network model, the method further includes: performing preventive overfitting processing on the intermediate data to obtain first data; the inputting the intermediate data into the bidirectional long-short term memory network model comprises: and inputting the first data into the bidirectional long-short term memory network model.
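The "preventive overfitting processing" between the CNN and the BiLSTM is not named in the claim; dropout is a standard technique for that role. A minimal sketch, assuming inverted dropout (the rate and seed are illustrative):

```python
import numpy as np

def dropout(x, rate=0.5, rng=None, training=True):
    """Inverted dropout: randomly zero activations and rescale the rest.

    An assumed implementation of the 'preventive overfitting processing';
    the patent does not name the technique."""
    if not training or rate == 0.0:
        return x
    rng = rng or np.random.default_rng(0)
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)  # rescale so the expected value is unchanged
```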
Optionally, after obtaining the pitch recognition model, the method further includes: performing operation adjustment on the pitch recognition model according to the operation environment of the pitch recognition model, wherein the operation adjustment comprises at least one of the following items: model structure adjustment, model parameter adjustment and data type adjustment.
Optionally, before the frequency domain data of the musical sound file segment is convolved by the convolutional neural network model, the method further includes: adding environmental noise into the musical sound original file fragments to obtain noise-added file fragments; the convolving the frequency domain data of the musical sound original file segment by the convolutional neural network model comprises: and convolving the frequency domain data of the noise-added file segment through a convolutional neural network model.
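Mixing environmental noise into the clean segments at a controlled signal-to-noise ratio is a standard way to build the noise-added file segments described above. A hedged sketch (the SNR value and the looping of short noise clips are assumptions):

```python
import numpy as np

def add_noise(clean, noise, snr_db=20.0):
    """Mix environmental noise into a clean segment at a target SNR in dB.

    The noise clip is looped or trimmed to the segment length, then scaled
    so the power ratio matches the requested SNR."""
    noise = np.resize(noise, clean.shape)        # loop/trim noise to length
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise
```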
In a second aspect, an embodiment of the present invention further provides a pitch identification method, which is based on a pitch identification model established by any one of the pitch identification model establishment methods provided by the embodiments of the present invention, and includes: inputting the frequency domain data of the musical sound segment to be identified into the pitch identification model; and identifying the pitch of the note in the musical piece to be identified according to the pitch probability output by the pitch identification model.
Optionally, the musical piece to be identified includes a chord and/or at least two notes triggered simultaneously.
Optionally, before the frequency domain data of the musical sound segment to be identified is input into the pitch identification model, the method further includes: calibrating the frequency domain data of the musical sound fragment to be identified according to a preset standard pitch frequency to obtain calibration data; the inputting the frequency domain data of the musical sound segment to be identified into the pitch identification model comprises: inputting the calibration data into the pitch identification model.
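The patent does not spell out the calibration procedure; one simple form it could take is measuring the instrument's reference pitch, expressing its offset from the preset standard (conventionally A4 = 440 Hz) in cents, and rescaling the frequency axis accordingly. A sketch under those assumptions:

```python
import numpy as np

def cents_offset(a4_measured, a4_standard=440.0):
    """Offset in cents between the measured reference pitch and the
    preset standard pitch frequency (100 cents = one semitone)."""
    return 1200.0 * np.log2(a4_measured / a4_standard)

def calibrate_freqs(freqs, a4_measured, a4_standard=440.0):
    """Rescale a frequency axis so the measured reference lines up with
    the standard. One assumed form of the described calibration."""
    return np.asarray(freqs) * (a4_standard / a4_measured)
```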
Optionally, identifying the pitch of the notes in the musical piece to be identified according to the pitch probabilities output by the pitch identification model comprises: identifying the pitch of the notes in the musical piece to be identified according to a preset rule, wherein the preset rule comprises at least one of the following: the magnitude relation between each pitch probability and a preset probability threshold, the variation of each pitch probability over time, and the association among multiple pitches.
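Two of the preset rules (a probability threshold, and the behaviour of each probability over time) can be combined into a simple decoder that turns the model's frame-by-frame probability matrix into note events. A hedged sketch; the threshold and minimum duration are illustrative assumptions, not values from the patent:

```python
import numpy as np

def decode_pitches(probs, threshold=0.5, min_frames=2):
    """Convert a (frames x pitches) probability matrix into note events.

    A pitch is 'on' while its probability exceeds the threshold; runs
    shorter than min_frames are discarded as spurious."""
    active = probs >= threshold
    events = []
    for p in range(probs.shape[1]):
        start = None
        for t in range(probs.shape[0] + 1):
            on = t < probs.shape[0] and active[t, p]
            if on and start is None:
                start = t
            elif not on and start is not None:
                if t - start >= min_frames:
                    events.append((p, start, t))  # (pitch index, onset, offset)
                start = None
    return events
```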
In a third aspect, an embodiment of the present invention further provides an apparatus for establishing a pitch recognition model, including: an acquisition unit for acquiring musical sound training data, which comprise a musical sound original file segment and a digital control signal corresponding to the pitches of the notes in the segment; a convolution unit for convolving the frequency-domain data of the musical sound original file segment through a convolutional neural network model to obtain intermediate data, wherein the convolutional neural network model carries a first parameter to be determined; a probability determining unit for inputting the intermediate data into a bidirectional long short-term memory network model and determining the pitch probability of the notes in the musical sound original file segment, wherein the bidirectional long short-term memory network model carries a second parameter to be determined; and a parameter determining unit for determining the first and second parameters to be determined according to the correspondence between the pitch probability and the digital control signal, to obtain a pitch recognition model.
Optionally, the probability determining unit is specifically configured to input the intermediate data into a bidirectional long-term and short-term memory network model, and determine the trigger probability of each sound generating element in a preset type of musical instrument; the parameter determining unit is specifically configured to determine the first to-be-determined parameter and the second to-be-determined parameter according to a correspondence between the trigger probability of each sound generating element and the digital control signal.
Optionally, the convolution unit is specifically configured to perform two-dimensional convolution operation on the frequency domain data of the musical sound original file segment and remove redundant information to obtain the intermediate data.
Optionally, the establishing apparatus further includes a processing unit, configured to perform preventive overfitting processing on the intermediate data after obtaining the intermediate data and before inputting the intermediate data into the bidirectional long-short term memory network model, so as to obtain first data; the probability determination unit is specifically configured to input the first data into the bidirectional long-short term memory network model, and determine a pitch probability of a note in the musical sound original file segment.
Optionally, the creating apparatus further includes an adjusting unit, configured to perform operation adjustment on the pitch recognition model according to an operating environment of the pitch recognition model after obtaining the pitch recognition model, where the operation adjustment includes at least one of: model structure adjustment, model parameter adjustment and data type adjustment.
Optionally, the establishing means further includes: the noise adding unit is used for adding environmental noise into the musical sound original file fragment before the frequency domain data of the musical sound original file fragment is convoluted through a convolution neural network model to obtain a noise added file fragment; and the convolution unit is specifically used for convolving the frequency domain data of the noisy file segment through a convolution neural network model.
In a fourth aspect, an embodiment of the present invention further provides a pitch recognition device, which performs recognition based on a pitch recognition model established by the establishing method provided by the present invention, the pitch recognition device including: an input unit for inputting the frequency-domain data of the musical sound segment to be recognized into the pitch recognition model; and an identification unit for identifying the pitch of the notes in the musical piece to be identified according to the pitch probability output by the pitch recognition model.
Optionally, the musical piece to be identified includes a chord and/or at least two notes triggered simultaneously.
Optionally, the pitch recognition apparatus further includes: the calibration unit is used for calibrating the frequency domain data of the musical sound segment to be identified according to a preset standard pitch frequency before inputting the frequency domain data of the musical sound segment to be identified into the pitch identification model to obtain calibration data; the identification unit is specifically configured to input the calibration data into the pitch identification model.
Optionally, the identifying unit is specifically configured to identify a pitch of a note in the musical piece to be identified according to a preset rule, where the preset rule includes at least one of the following: the size relation between each pitch probability and a preset probability threshold, the change rule of each pitch probability along with time and the association relation among a plurality of pitches.
In a fifth aspect, an embodiment of the present invention further provides an electronic device, including: the device comprises a shell, a processor, a memory, a circuit board and a power circuit, wherein the circuit board is arranged in a space enclosed by the shell, and the processor and the memory are arranged on the circuit board; a power supply circuit for supplying power to each circuit or device of the electronic apparatus; the memory is used for storing executable program codes; the processor runs a program corresponding to the executable program code by reading the executable program code stored in the memory, for executing any one of the pitch recognition model establishing methods or pitch recognition methods provided by the embodiments of the present invention.
In a sixth aspect, embodiments of the present invention also provide a computer readable storage medium storing one or more programs, which are executable by one or more processors to implement any of the pitch recognition model establishment methods or pitch recognition methods provided by the embodiments of the present invention.
The pitch recognition model establishing method, pitch recognition method, devices, electronic device and storage medium provided by the embodiments of the invention can acquire musical sound training data, convolve the frequency-domain data of the musical sound original file segments in the training data through a convolutional neural network model to obtain intermediate data, input the intermediate data into a bidirectional long short-term memory network model to determine the pitch probabilities of the notes in the segments, and determine the first parameter to be determined in the convolutional neural network model and the second parameter to be determined in the bidirectional long short-term memory network model according to the correspondence between the pitch probabilities and the digital control signals in the training data, thereby obtaining the pitch recognition model. Training the convolutional neural network model and the bidirectional long short-term memory network model on labeled musical sound training data in this way yields a pitch recognition model that combines the feature-extraction strengths of the convolutional neural network with the sequence memory of the bidirectional long short-term memory network, so the accuracy of pitch recognition can be effectively improved.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in their description are briefly introduced below. The drawings described below are obviously only some embodiments of the present invention; those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a flow chart of a method for creating a pitch identification model according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a process of obtaining musical tone training data in the pitch recognition model establishing method according to the embodiment of the present invention;
FIG. 3 is a schematic diagram illustrating an acquisition process of noisy training data in the pitch recognition model building method according to the embodiment of the present invention;
FIG. 4 is a flowchart of a pitch identification method provided by an embodiment of the present invention;
FIG. 5 is a graphical illustration of the variation of individual pitch probabilities over time in an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of an apparatus for creating a pitch recognition model according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of a pitch recognition device according to an embodiment of the present invention;
fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
Embodiments of the present invention will be described in detail below with reference to the accompanying drawings.
It should be understood that the described embodiments are only some embodiments of the invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In a first aspect, embodiments of the present invention provide a method for establishing a pitch recognition model, which can effectively improve the accuracy of pitch recognition.
As shown in fig. 1, an embodiment of the present invention provides a method for establishing a pitch recognition model, which may include:
s11, acquiring musical sound training data, wherein the musical sound training data comprise musical sound original file segments and digital control signals corresponding to the pitches of musical notes in the musical sound original file segments;
the musical sound training data may be musical sound files for which pitches have been identified. In this step, the musical sound training data may include two parts, one of which is a musical sound file segment, and the other part is a digital control signal corresponding to the pitch of the musical note in the musical sound file segment, such as a MIDI signal. Alternatively, the format of the musical training data may be (wav, midi), for example.
The musical sound file may include audio files formed of sounds played by various musical instruments, and may also include audio files formed of sounds sung by a person. These sounds may include a fundamental tone and overtones: the fundamental is related to the vibration frequency of the sounding body and determines the pitch of the sound, while the overtones determine the timbre of the sound, such as a piano timbre or a violin timbre. The digital control signal corresponding to the pitch of a note may be a musical control signal, understandable and executable by a computer, indicating characteristics such as pitch and duration. Since the pitch of a sound is determined by the fundamental, the key to pitch recognition is accurate detection of the fundamental frequency.
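Once the fundamental frequency is detected, mapping it to a note is a standard computation: under twelve-tone equal temperament with the A4 = 440 Hz reference, the MIDI note number follows directly from the frequency ratio. A short sketch of that standard mapping (not a formula stated in the patent):

```python
import math

def freq_to_midi(f0_hz: float) -> int:
    """Map a fundamental frequency to the nearest MIDI note number,
    assuming the conventional A4 = 440 Hz equal-temperament reference."""
    return round(69 + 12 * math.log2(f0_hz / 440.0))

# A4 (440 Hz) maps to MIDI 69; middle C (about 261.63 Hz) maps to MIDI 60
```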
In an embodiment of the invention, the musical sound original file may be a relatively complete piece of music, such as a wav file. The musical sound original file segments may each be cut from such a file. Optionally, the length of each segment may range, for example, from several microseconds to several milliseconds. Depending on the melody and the duration of each note, a segment may contain only one note or several notes. When cutting segments, the cut can be made in the gap between two notes, which avoids splitting the same note across two different segments.
Optionally, in an embodiment of the present invention, the segments cut from the same musical sound original file may have equal or unequal lengths. For convenience of model training, segments shorter than the longest segment may be padded with silence up to that reference length.
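The padding step described above can be sketched as zero-padding every segment to the length of the longest one so they stack into a single training batch (a minimal sketch; the actual batching is not detailed in the patent):

```python
import numpy as np

def pad_segments(segments):
    """Zero-pad variable-length 1-D audio segments to the length of the
    longest segment, so they can be stacked into one training batch."""
    max_len = max(len(s) for s in segments)
    return np.stack([np.pad(np.asarray(s, dtype=np.float32),
                            (0, max_len - len(s)))
                     for s in segments])

batch = pad_segments([[0.1, 0.2], [0.3, 0.4, 0.5, 0.6]])
# the shorter segment is padded with trailing zeros to length 4
```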
For example, in one embodiment of the present invention, the music training data may be obtained as shown in fig. 2.
S12, convolving the frequency domain data of the musical sound original file segment through a convolution neural network model to obtain intermediate data, wherein the convolution neural network model carries a first parameter to be determined;
In this step, the musical sound original file segment may be sampled at a predetermined sampling rate; mel-frequency cepstral coefficients (MFCC) of the sampled data are then computed to obtain mel-spectrum data, and frequency-domain features are extracted using the MFCC, yielding the frequency-domain data corresponding to the musical sound original file segment.
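The mel-spectrum computation behind MFCC-style features can be sketched as framing the signal, taking magnitude FFTs, and projecting onto a triangular mel filterbank. The parameter values below (sample rate, FFT size, hop, band count) are illustrative assumptions, not values from the patent:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel_spectrogram(signal, sr=16000, n_fft=512, hop=256, n_mels=40):
    """Log-mel time-frequency features, the kind of frequency-domain data
    fed to the convolutional network. Parameters are illustrative."""
    # frame the signal and apply a Hann window
    n_frames = 1 + (len(signal) - n_fft) // hop
    window = np.hanning(n_fft)
    frames = np.stack([signal[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    mag = np.abs(np.fft.rfft(frames, axis=1))     # (n_frames, n_fft//2 + 1)

    # triangular mel filterbank between 0 Hz and the Nyquist frequency
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        lo, c, hi = bins[m - 1], bins[m], bins[m + 1]
        for k in range(lo, c):
            fb[m - 1, k] = (k - lo) / max(c - lo, 1)
        for k in range(c, hi):
            fb[m - 1, k] = (hi - k) / max(hi - c, 1)

    return np.log(mag @ fb.T + 1e-6)              # (n_frames, n_mels)
```

A full MFCC would additionally apply a discrete cosine transform to each log-mel frame; the log-mel map alone is already a common CNN input.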
The convolutional neural network may include an input layer, an output layer, and a plurality of hidden layers connecting them. The input layer may receive one or more musical sound original file segments, which the convolutional neural network can process in parallel, and the intermediate data are emitted from the output layer.
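The core operation the hidden layers apply to the time-frequency feature map is a 2-D convolution. A minimal single-channel sketch with valid padding (as in deep-learning frameworks, this is technically cross-correlation, i.e. the kernel is not flipped):

```python
import numpy as np

def conv2d_valid(x, kernel):
    """Single-channel 2-D convolution (valid padding) over a feature map."""
    kh, kw = kernel.shape
    out_h, out_w = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * kernel)
    return out

features = np.arange(16, dtype=float).reshape(4, 4)
edge = conv2d_valid(features, np.array([[1.0, -1.0]]))  # horizontal difference
```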
S13, inputting the intermediate data into a bidirectional long-short term memory network model, and determining the pitch probability of the musical notes in the musical sound original file segment, wherein the bidirectional long-short term memory network model carries a second parameter to be determined;
In this step, the intermediate data output in step S12 may be further input into a Bi-directional Long Short-Term Memory (BiLSTM) network model. After the BiLSTM processes the intermediate data, it can output the pitch probability of each note in the musical sound original file segment. Optionally, in one embodiment of the present invention, the BiLSTM model may output the probability of each note at each pitch; e.g., at time t1 in the segment, the probability that note A is do "1" is 0.06, the probability that it is re "2" is 0.11, the probability that it is mi "3" is 0.3, and so on. The probabilities of different pitches may differ at different times, and for each pitch the probability varies with time; for example, the probability of the pitch fa "4" may be 0.1 at time t1, 0.5 at time t2, 0.05 at time t3, and so on.
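The BiLSTM step above can be sketched as two directional LSTM passes whose hidden states are concatenated and mapped to per-pitch probabilities through a sigmoid output layer. All shapes, gate layout and the 88-pitch output size (a piano keyboard) are assumptions for illustration, not specifics from the patent:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_pass(xs, W, U, b, reverse=False):
    """One directional LSTM pass over a sequence of feature vectors."""
    h_dim = U.shape[1]
    h, c = np.zeros(h_dim), np.zeros(h_dim)
    order = range(len(xs) - 1, -1, -1) if reverse else range(len(xs))
    out = [None] * len(xs)
    for t in order:
        z = W @ xs[t] + U @ h + b                 # gates stacked as i, f, o, g
        i = sigmoid(z[:h_dim])
        f = sigmoid(z[h_dim:2 * h_dim])
        o = sigmoid(z[2 * h_dim:3 * h_dim])
        g = np.tanh(z[3 * h_dim:])
        c = f * c + i * g
        h = o * np.tanh(c)
        out[t] = h
    return np.stack(out)

def bilstm_pitch_probs(xs, params):
    """Minimal BiLSTM sketch: forward pass + backward pass, concatenate
    hidden states, then a per-frame sigmoid layer over the pitches."""
    Wf, Uf, bf, Wb, Ub, bb, Wo, bo = params
    fwd = lstm_pass(xs, Wf, Uf, bf)
    bwd = lstm_pass(xs, Wb, Ub, bb, reverse=True)
    h = np.concatenate([fwd, bwd], axis=1)        # (frames, 2 * h_dim)
    return sigmoid(h @ Wo + bo)                   # (frames, n_pitches)
```

Sigmoid (rather than softmax) outputs let several pitches be active in the same frame, which matters for the chord case discussed later.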
S14, determining the first to-be-determined parameter and the second to-be-determined parameter according to the corresponding relation between the pitch probability and the digital control signal, and obtaining a pitch identification model.
It should be noted that the pitch recognition model in this step may include the convolutional neural network and the BiLSTM network. The relationships among the input layer, hidden layers and output layer of the convolutional neural network are yet to be determined and can be represented by the first parameter to be determined. The BiLSTM likewise includes relationships to be determined, which can be represented by the second parameter to be determined. Once these first and second parameters are determined, the pitch recognition model is determined.
In this step, the first parameter to be determined and the second parameter to be determined may be determined by using the correspondence between the digital control signal corresponding to the musical sound original file segment and the pitch probability. That is, the pitch indicated by the pitch probability determined in step S13 should match the corresponding pitch in the digital control signal, from which the unknown parameters, i.e., the first and second parameters to be determined, can be derived.
Optionally, in an embodiment of the present invention, a set of initial values may first be assigned to the first and second parameters to be determined, the pitch probabilities may then be computed from those values, and the values may then be further adjusted according to the correspondence between the computed pitch probabilities and the labeled digital control signals, until the computed probabilities are consistent with what the digital control signals indicate.
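The assign-then-adjust procedure described above is, in essence, iterative optimization of the parameters against the labeled control signals. A toy sketch (a single linear layer standing in for the CNN/BiLSTM parameters; all data here is synthetic and illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the joint CNN/BiLSTM parameters: a linear layer
# mapping a 6-dim feature frame to 4 pitch logits.
W = rng.normal(scale=0.1, size=(6, 4))   # initial "parameters to be determined"
X = rng.normal(size=(32, 6))             # intermediate data frames
# Labels standing in for the digital control signal: which pitches sound.
true_W = rng.normal(size=(6, 4))
Y = (X @ true_W > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce(p, y):
    # Binary cross-entropy: penalizes disagreement between the computed
    # pitch probabilities and the labeled control signal.
    eps = 1e-9
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

loss_before = bce(sigmoid(X @ W), Y)
for _ in range(500):
    P = sigmoid(X @ W)
    grad = X.T @ (P - Y) / len(X)   # gradient of BCE w.r.t. W
    W -= 0.5 * grad                  # move probabilities toward the labels
loss_after = bce(sigmoid(X @ W), Y)
```

After the loop, the computed probabilities agree far better with the labels, which is the sense in which the parameter values are "adjusted until consistent with the indication of the digital control signal."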
The method for establishing the pitch identification model provided by the embodiment of the invention can obtain musical sound training data, convolve frequency domain data of musical sound original file segments in the musical sound training data through a convolutional neural network model to obtain intermediate data, input the intermediate data into a bidirectional long and short term memory network model so as to determine pitch probabilities of notes in the musical sound original file segments, and determine a first undetermined parameter in the convolutional neural network model and a second undetermined parameter in the bidirectional long and short term memory network model according to the corresponding relation between the pitch probabilities and digital control signals in the musical sound training data, so that the pitch identification model is obtained. Therefore, the marked musical sound training data is used for training the convolutional neural network model and the bidirectional long and short term memory network model, the pitch recognition model based on the convolutional neural network and the bidirectional long and short term memory network can be obtained, the advantages of the neural network and the memory advantages of the bidirectional long and short term memory network can be combined by the model, and therefore the accuracy of pitch recognition can be effectively improved.
Optionally, in an embodiment of the present invention, after the musical sound training data is acquired in step S11 and before the frequency domain data of the musical sound original file segment is convolved by the convolutional neural network model in step S12, the method for establishing the pitch recognition model may further include: adding environmental noise to the musical sound original file segments to obtain noise-added file segments. The noise-added file segments, together with the digital control signals, e.g. MIDI segments, form noise-added training data. Illustratively, the specific noise-adding process may be as shown in fig. 3. Based on this, convolving the frequency domain data of the musical sound original file segment in step S12 may specifically include: convolving the frequency domain data of the noise-added file segment through the convolutional neural network model. In this way, the trained model incorporates the influence of environmental noise, so that the training conditions are closer to the actual application environment, and pitch recognition accuracy in noisy environments can be effectively improved.
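Illustratively, mixing environmental noise into a clean segment at a chosen signal-to-noise ratio might be sketched as follows (the signals and the 20 dB SNR target are hypothetical; the patent itself does not prescribe a particular mixing formula):

```python
import numpy as np

def add_noise(clean, noise, snr_db):
    """Mix environmental noise into a clean segment at a target SNR (dB)."""
    noise = np.resize(noise, clean.shape)  # loop/trim noise to segment length
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    # Scale the noise so that p_clean / (scale^2 * p_noise) hits the target SNR.
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise

rng = np.random.default_rng(1)
clean = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s of A4 at 16 kHz
noise = rng.normal(size=8000)                                # shorter noise clip
noisy = add_noise(clean, noise, snr_db=20.0)                 # noise-added segment
```

The noisy segment keeps the original digital control signal as its label, forming one item of the noise-added training data.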
Alternatively, the convolutional neural network in step S12 may include various functions and/or structures, as long as the pitch probability can ultimately be derived accurately. For example, in an embodiment of the present invention, convolving the frequency domain data of the musical sound original file segment with the convolutional neural network model may include: performing a two-dimensional convolution operation on the frequency domain data of the musical sound original file segment and removing redundant information to obtain the intermediate data. For example, in an embodiment of the present invention, after the convolutions of two hidden layers, redundant information in the convolution result may be removed through max-pooling, thereby effectively improving subsequent computational efficiency.
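The two-dimensional convolution plus redundancy removal described above can be illustrated with a naive sketch (loop-based for clarity; a real implementation would use an optimized library, and the toy spectrogram and kernel here are arbitrary):

```python
import numpy as np

def conv2d_valid(x, k):
    """Naive 'valid' 2D convolution (cross-correlation) over a spectrogram."""
    H, W = x.shape
    kh, kw = k.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

def max_pool2(x):
    """2x2 max pooling: keeps only the strongest response in each window,
    discarding redundant neighbouring values ('removing redundant information')."""
    H, W = x.shape
    x = x[:H - H % 2, :W - W % 2]
    return x.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

spec = np.arange(48, dtype=float).reshape(6, 8)  # toy spectrogram: time x frequency
kernel = np.ones((3, 3)) / 9.0                    # simple averaging kernel
feat = max_pool2(conv2d_valid(spec, kernel))      # downsampled intermediate data
```

Each pooling step quarters the data volume, which is where the "subsequent calculation efficiency" gain comes from.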
Further, after the convolutional neural network model outputs the intermediate data, the intermediate data can be input into the bidirectional long-short term memory network model, which outputs the pitch probabilities of the notes in the musical sound original file segments.
Further, in order to guide musical instrument playing with the pitch probability, in an embodiment of the present invention, the pitch probability may be associated with instrument playing. Specifically, since the set of pitches a given kind of instrument can play is fixed, for example, the 88 keys of a piano correspond to 88 different pitches, the pitch probability can be directly converted into the trigger probability of each sounding element of that kind of instrument, for example, the trigger probability of each of the 88 piano keys.
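Illustratively, because each of the 88 piano keys corresponds to one fixed pitch, key index and fundamental frequency are interconvertible by the standard equal-temperament relation (A4 = key 49 = 440 Hz; this numbering convention is a common assumption, not taken from the patent):

```python
def key_frequency(n):
    """Fundamental frequency of piano key n (1..88) in equal temperament,
    with A4 as key 49 at 440 Hz. Because the mapping from key to pitch is
    one-to-one, a per-pitch probability is directly a per-key trigger
    probability."""
    return 440.0 * 2.0 ** ((n - 49) / 12)
```

Under this convention key 1 is A0 (27.5 Hz) and key 88 is C8 (about 4186 Hz), so a length-88 probability vector from the model covers the whole keyboard.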
For example, in an embodiment of the present invention, the step S13 of inputting the intermediate data into the bidirectional long and short term memory network model, and determining the pitch probability of the note in the musical sound file segment may specifically include: inputting the intermediate data into a bidirectional long-short term memory network model, and determining the triggering probability of each sounding element in a preset type of musical instrument; based on this, the determining the first to-be-determined parameter and the second to-be-determined parameter according to the corresponding relationship between the pitch probability and the digital control signal in step S14 may specifically include: and determining the first to-be-determined parameter and the second to-be-determined parameter according to the corresponding relation between the triggering probability of each sounding element and the digital control signal.
For example, in an embodiment of the present invention, if at time t5 the triggering probabilities of keys A, B, C and D are 0.3, 0.02, 0.05 and 0.03, respectively, while the pitch in the corresponding digital control signal is the pitch of key B, the values of the first and second parameters to be determined may be adjusted so that the probability corresponding to key B rises from 0.02 to, for example, 0.4, and the triggering probability corresponding to key A falls from 0.3 to, for example, 0.01, so that more reasonable values of the first and second parameters to be determined can be obtained.
Once the first and second parameters to be determined are obtained, the pitch recognition model is obtained. In order to give the trained pitch recognition model a more general and accurate prediction effect, in an embodiment of the present invention, after obtaining the intermediate data in step S12 and before inputting the intermediate data into the bidirectional long-short term memory network model in step S13, the method for establishing the pitch recognition model may further include: performing overfitting-prevention processing on the intermediate data, for example using the dropout algorithm, to obtain first data; based on this, inputting the intermediate data into the bidirectional long-short term memory network model in step S13 may specifically include: inputting the first data into the bidirectional long-short term memory network model. In this way the trained pitch recognition model can effectively avoid overfitting, and its prediction accuracy is effectively improved.
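A minimal sketch of the dropout step mentioned above (inverted dropout; the drop rate of 0.5 is an arbitrary illustrative value):

```python
import numpy as np

def dropout(x, p, rng, train=True):
    """Inverted dropout: randomly zero a fraction p of activations during
    training and rescale the survivors by 1/(1-p), so the expected
    activation is unchanged and nothing special is needed at inference."""
    if not train or p == 0.0:
        return x
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)

rng = np.random.default_rng(0)
intermediate = np.ones((4, 8))                        # CNN output
first_data = dropout(intermediate, p=0.5, rng=rng)    # fed to the BiLSTM
```

Because each training pass drops a different random subset, the BiLSTM cannot rely on any single activation, which is the overfitting-prevention effect described above.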
Further, after obtaining the pitch recognition model in step S14, the method for establishing the pitch recognition model according to the embodiment of the present invention may further include: performing operation adjustment on the pitch recognition model according to the operation environment of the pitch recognition model, wherein the operation adjustment comprises one or more of the following items: model structure adjustment, model parameter adjustment and data type adjustment. The structure adjustment may include adjustment to an algorithm supported by the electronic device, the model parameter adjustment may include adjustment to the number of layers of a neural network in the model, the number of neurons in each layer, and the like, and the adjustment to the data type may include, for example, conversion of floating point type data into integer type data, or conversion of integer type data into character type data, and the like.
Because the pitch recognition model can be operated in various electronic devices, and the physical resources and the operational capability of different electronic devices are different, the method for establishing the pitch recognition model provided by the embodiment of the invention can optimize and adjust the model according to different specific operating environments. For example, a model running in a computer may be adjusted to a finer model, so that pitch prediction can be performed quickly and accurately by using the powerful calculation capability of the computer, while a model running in a mobile terminal such as a mobile phone may be adjusted to a model with a simplified structure, so that a more ideal balance state between the real-time performance and the accuracy of pitch prediction can be achieved even when the calculation capability of the mobile terminal is limited.
In a second aspect, embodiments of the present invention further provide a pitch recognition method, which can effectively improve the accuracy of pitch recognition.
As shown in fig. 4, a pitch recognition method provided by an embodiment of the present invention, based on a pitch recognition model established by any one of the pitch recognition model establishment methods provided by the foregoing embodiments, may include:
S21, inputting the frequency domain data of the musical sound segment to be recognized into the pitch recognition model;
the musical sound segment to be identified can come from a sound file to be identified or a sound signal collected on site, and can be a sound played by various musical instruments or a sound generated by singing of people.
In this step, the sound file to be recognized may be segmented into a plurality of musical sound segments to be recognized. When segmenting, the cuts can be made in the gap between two notes, thereby avoiding splitting the same note across two different segments. Each musical sound segment to be recognized may include only one note or a plurality of notes.
Optionally, in an embodiment of the present invention, the musical sound segments to be recognized that are cut from the same sound file to be recognized may be of equal or different lengths. In an embodiment of the present invention, for musical sound segments of different lengths, the longest segment may be taken as a reference, and segments shorter than the reference may be padded with blanks to the reference length.
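Illustratively, padding shorter segments up to the longest one might look as follows (zero-padding is assumed as the "blank" fill; the patent does not fix a particular fill value):

```python
import numpy as np

def pad_segments(segments):
    """Zero-pad shorter segments up to the longest one, so a batch of
    unequal-length segments can be stacked into a single array."""
    max_len = max(len(s) for s in segments)
    return np.stack([np.pad(s, (0, max_len - len(s))) for s in segments])

batch = pad_segments([np.array([1.0, 2.0, 3.0]),
                      np.array([4.0, 5.0])])   # second segment padded to length 3
```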
And S22, identifying the pitch of the note in the musical piece to be identified according to the pitch probability output by the pitch identification model.
In this step, after the musical sound segment to be recognized is input into the pitch recognition model, the model can output the change situation of the prediction probability of each pitch corresponding to the musical sound segment to be recognized along with the time, so that the accurate pitch corresponding to the musical sound to be recognized is recognized according to the probability rule.
According to the pitch recognition method provided by the embodiment of the invention, the frequency domain data of the musical sound segment to be recognized is input into the pre-established pitch recognition model, and the pitch of the note in the musical sound segment to be recognized is recognized according to the pitch probability output by the pitch recognition model. In the establishment of the pitch recognition model, the convolutional neural network and the BiLSTM network are introduced, and after deep learning on the musical sound training data, the model predicts stably and accurately, so that the accuracy of pitch recognition can be effectively improved.
Optionally, in the pitch identification method provided by the embodiment of the present invention, the musical piece to be identified may include not only a single note, but also a chord and/or at least two notes triggered at the same time. That is, the pitch recognition method provided by the embodiment of the invention can not only accurately recognize the single tone, but also accurately recognize the chord tone and/or the polyphone, thereby effectively improving the recognition effect of the chord tone and the polyphone in the prior art.
Optionally, before the frequency domain data of the musical sound segment to be recognized is input into the pitch recognition model in step S21, the pitch recognition method provided by the embodiment of the invention may further include: calibrating the frequency domain data of the musical sound segment to be recognized according to a preset standard pitch frequency to obtain calibration data; based on this, inputting the frequency domain data of the musical sound segment to be recognized into the pitch recognition model in step S21 may specifically include: inputting the calibration data into the pitch recognition model. Thus, if the pitch of the musical sound segment to be recognized deviates slightly from the standard pitch because of the instrument or the singer, the segment can be calibrated in this way, further improving the accuracy of the pitch recognition model.
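Illustratively, calibration against a standard pitch frequency can be sketched as measuring the deviation of a reference tone in cents and shifting all measured frequencies back onto the standard grid (the slightly flat 436 Hz instrument is a hypothetical example):

```python
import numpy as np

def cents_offset(f_observed, f_standard=440.0):
    """Deviation of an observed reference tone from the standard pitch,
    in cents (1/100 of an equal-tempered semitone)."""
    return 1200.0 * np.log2(f_observed / f_standard)

def calibrate(freqs, offset_cents):
    """Shift all measured frequencies by the opposite offset, so the
    segment lines up with the standard pitch grid before recognition."""
    return freqs * 2.0 ** (-offset_cents / 1200.0)

# A slightly flat instrument: its A4 sounds at 436 Hz instead of 440 Hz.
off = cents_offset(436.0)                       # negative: flat
corrected = calibrate(np.array([436.0, 872.0]), off)
```

After calibration the A4 and its octave land back on 440 Hz and 880 Hz, so the model sees frequency data consistent with its training distribution.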
After the frequency domain data of the musical piece to be recognized is input into the pitch recognition model, the pitch of the note in the musical piece to be recognized can be recognized according to the pitch probability output by the pitch recognition model in step S22.
Specifically, the pitch probability output by the pitch recognition model is the probability that a note occurs. During the playing of music, the probabilities of different pitches occurring at the same time may differ, and the probability of the same pitch may also differ at different times. Illustratively, in an embodiment of the invention, the probability of occurrence of a certain pitch over time may be as shown in fig. 5. In fig. 5, the horizontal axis represents time and the vertical axis represents the probability that the pitch is played.
In step S22, the pitch recognition model may recognize the pitch of the note in the musical sound segment to be recognized according to preset rules, where the preset rules include one or more of the following: the magnitude relation between each pitch probability and a preset probability threshold, the variation of each pitch probability over time, and the association relation among a plurality of pitches.
Optionally, in an embodiment of the present invention, the pitch probability output by the pitch recognition model may be compared with a preset probability threshold; if the pitch probability is higher than the threshold, it is further detected whether the change of the pitch probability over time conforms to a preset variation rule, while if the pitch probability is lower than or equal to the threshold, no further detection is needed. That is, the preset probability threshold serves as a starting point for pitch recognition. For example, in an embodiment of the present invention, with a preset probability threshold of 0.1, if the pitch probability of pitch m is 0.08, pitch m is predicted not to occur; whereas if the pitch probability of pitch n is 0.3, pitch n may well occur, and further detection is required to determine whether it actually does.
Optionally, in an embodiment of the present invention, when the occurrence of a pitch is further detected, if the change of the pitch probability over time conforms to a preset variation rule, it may be determined that the pitch was played; otherwise, it may be determined that the pitch was not played. Optionally, the preset variation rule may include, for example, a preset pattern appearing in the probability curve, such as a peak-and-trough pattern or a sawtooth pattern. Optionally, the preset variation rule may further specify parameters of the probability variation, such as that the maximum probability at the peak should be greater than a preset upper limit, the minimum probability at the trough should be less than a preset lower limit, and/or that the duration or interval of the peaks and troughs should be longer than a first preset duration and/or shorter than a second preset duration.
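The threshold-plus-peak rule described in the preceding paragraphs can be sketched as follows (the thresholds 0.1, 0.6 and 0.4 are taken from the examples in this section; the simple local-maximum check is one possible reading of the "preset variation rule", not a definitive implementation):

```python
def pitch_sounded(curve, start_th=0.1, peak_th=0.6):
    """Decide whether a pitch was played from its probability-vs-time curve.
    Stage 1: the starting threshold rules out pitches not worth examining.
    Stage 2: accept only if the curve forms an interior peak above peak_th."""
    curve = list(curve)
    if max(curve) <= start_th:
        return False  # never clears the starting threshold, e.g. pitch m at 0.08
    for i in range(1, len(curve) - 1):
        if curve[i] >= peak_th and curve[i] >= curve[i - 1] and curve[i] >= curve[i + 1]:
            return True  # a genuine peak above the upper limit
    return False
```

For a pitch masked inside a chord, the same function can be called with a relaxed `peak_th=0.4`, mirroring the lowered detection criterion discussed below.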
Further, in addition to detecting the variation of a single pitch, the association between multiple pitches in polyphones and/or chords may also be used for pitch recognition. For example, when several notes are played simultaneously, a higher-frequency pitch is easily masked by a lower-frequency pitch and is thus hard to recognize. If the association between the two pitches is known in advance, the detection criterion for the higher-frequency pitch can be relaxed when judging it. For example, in an embodiment of the present invention, under the general rule a peak counts only when the pitch probability reaches 0.6, but because of the mutual influence between pitches, a probability of 0.4 may already be counted as a peak, thereby effectively improving the accuracy of pitch recognition.
Optionally, the association relationship between each pitch may be known in advance through a plurality of ways, for example, may be known through a music score, through a playing habit of a user, or based on a previously identified tune, and the like.
It should be noted that, although in the above-described embodiment, the pitch of a note is predicted, since the pitch and the sound emitting element have a corresponding relationship in a specific instrument, the pitch is predicted, which corresponds to the prediction that the sound emitting element corresponding to the pitch is triggered. Based on this, in one embodiment of the present invention, the pitch recognition model may also output the triggered probabilities of the respective sound emitting elements in the preset type musical instrument, for example, the pitch recognition model may output the variation of the triggered probabilities of 88 keys of a piano within a preset time period.
In a third aspect, an embodiment of the present invention further provides a device for establishing a pitch recognition model, which can effectively improve the accuracy of pitch recognition.
As shown in fig. 6, the apparatus for creating a pitch recognition model according to an embodiment of the present invention may include:
an obtaining unit 31, configured to obtain musical sound training data, where the musical sound training data includes musical sound original file segments and digital control signals corresponding to pitches of notes in the musical sound original file segments;
the convolution unit 32 is configured to convolve the frequency domain data of the musical sound original file segment by a convolution neural network model to obtain intermediate data, where the convolution neural network model carries a first parameter to be determined;
a probability determining unit 33, configured to input the intermediate data into a bidirectional long-short term memory network model, and determine a pitch probability of a note in the musical sound original file segment, where the bidirectional long-short term memory network model carries a second parameter to be determined;
and the parameter determining unit 34 is configured to determine the first to-be-determined parameter and the second to-be-determined parameter according to a corresponding relationship between the pitch probability and the digital control signal, so as to obtain a pitch identification model.
The pitch identification model establishing device provided by the embodiment of the invention can obtain musical sound training data, convolve frequency domain data of musical sound original file segments in the musical sound training data through a convolutional neural network model to obtain intermediate data, input the intermediate data into a bidirectional long and short term memory network model so as to determine pitch probabilities of notes in the musical sound original file segments, and determine a first undetermined parameter in the convolutional neural network model and a second undetermined parameter in the bidirectional long and short term memory network model according to the corresponding relation between the pitch probabilities and digital control signals in the musical sound training data, so that a pitch identification model is obtained. Therefore, the marked musical sound training data is used for training the convolutional neural network model and the bidirectional long and short term memory network model, the pitch recognition model based on the convolutional neural network and the bidirectional long and short term memory network can be obtained, the advantages of the neural network and the memory advantages of the bidirectional long and short term memory network can be combined by the model, and therefore the accuracy of pitch recognition can be effectively improved.
Optionally, the probability determining unit 33 may be specifically configured to input the intermediate data into the bidirectional long-term and short-term memory network model, and determine the trigger probability of each sound generating element in a preset type of musical instrument;
optionally, the parameter determining unit 34 may be specifically configured to determine the first to-be-determined parameter and the second to-be-determined parameter according to a corresponding relationship between the triggering probability of each sound generating element and the digital control signal.
Optionally, the convolution unit 32 may be specifically configured to perform two-dimensional convolution operation on the frequency domain data of the musical sound original file segment and remove redundant information to obtain the intermediate data.
Optionally, the establishing apparatus may further include a processing unit, configured to perform preventive overfitting processing on the intermediate data after obtaining the intermediate data and before inputting the intermediate data into the bidirectional long-short term memory network model, to obtain first data;
the probability determining unit 33 may be specifically configured to input the first data into the bidirectional long-short term memory network model, and determine a pitch probability of a note in the musical sound original file segment.
Optionally, the creating apparatus may further include an adjusting unit, configured to perform operation adjustment on the pitch recognition model according to an operating environment of the pitch recognition model after obtaining the pitch recognition model, where the operation adjustment includes at least one of: model structure adjustment, model parameter adjustment and data type adjustment.
Optionally, the establishing means may further include: the noise adding unit is used for adding environmental noise into the musical sound original file fragment before the frequency domain data of the musical sound original file fragment is convoluted through a convolution neural network model to obtain a noise added file fragment; the convolution unit 32 may be specifically configured to convolve the frequency domain data of the noisy file segment with a convolutional neural network model.
In a fourth aspect, embodiments of the present invention further provide a pitch recognition apparatus, which can effectively improve the accuracy of pitch recognition.
As shown in fig. 7, a pitch recognition apparatus provided by an embodiment of the present invention, based on a pitch recognition model established by any one of the pitch recognition model establishing methods provided by the foregoing embodiments, may include:
an input unit 41, configured to input frequency domain data of a musical sound segment to be recognized into the pitch recognition model;
and the identifying unit 42 is used for identifying the pitches of the notes in the musical piece to be identified according to the pitch probabilities output by the pitch identification model.
According to the pitch recognition apparatus provided by the embodiment of the invention, the frequency domain data of the musical sound segment to be recognized is input into the pre-established pitch recognition model, and the pitch of the note in the musical sound segment to be recognized is recognized according to the pitch probability output by the pitch recognition model. In the establishment of the pitch recognition model, the convolutional neural network and the BiLSTM network are introduced, and after deep learning on the musical sound training data, the model predicts stably and accurately, so that the accuracy of pitch recognition can be effectively improved.
Optionally, the musical piece to be identified includes a chord and/or at least two notes triggered simultaneously.
Optionally, the pitch recognition apparatus may further include: the calibration unit is used for calibrating the frequency domain data of the musical sound segment to be identified according to a preset standard pitch frequency before inputting the frequency domain data of the musical sound segment to be identified into the pitch identification model to obtain calibration data;
the recognition unit 42 may specifically be configured to input the calibration data into the pitch recognition model.
Optionally, the identifying unit 42 may be specifically configured to identify the pitch of the note in the musical piece to be identified according to a preset rule, where the preset rule includes at least one of the following: the size relation between each pitch probability and a preset probability threshold, the change rule of each pitch probability along with time and the association relation among a plurality of pitches.
In a fifth aspect, embodiments of the present invention further provide an electronic device, which can effectively improve the accuracy of pitch recognition.
As shown in fig. 8, an electronic device provided in an embodiment of the present invention may include: the device comprises a shell 51, a processor 52, a memory 53, a circuit board 54 and a power circuit 55, wherein the circuit board 54 is arranged inside a space enclosed by the shell 51, and the processor 52 and the memory 53 are arranged on the circuit board 54; a power supply circuit 55 for supplying power to each circuit or device of the electronic apparatus; the memory 53 is used to store executable program code; the processor 52 executes a program corresponding to the executable program code by reading the executable program code stored in the memory 53, for executing the pitch recognition model establishing method or the pitch recognition method provided by any of the foregoing embodiments.
For specific execution processes of the above steps by the processor 52 and further steps executed by the processor 52 by running the executable program code, reference may be made to the description of the foregoing embodiments, and details are not described herein again.
The above electronic devices exist in a variety of forms, including but not limited to:
(1) Mobile communication devices: such devices are characterized by mobile communication capabilities and are primarily aimed at providing voice and data communication. Such terminals include smart phones (e.g., the iPhone), multimedia phones, feature phones, low-end phones, and the like.
(2) Ultra-mobile personal computer devices: such devices belong to the category of personal computers, have computing and processing functions, and generally also have mobile internet access. Such terminals include PDA, MID and UMPC devices, e.g., the iPad.
(3) Portable entertainment devices: such devices can display and play multimedia content. This type of device includes audio and video players (e.g., the iPod), handheld game consoles, electronic books, smart toys and portable car navigation devices.
(4) A server: the device for providing the computing service comprises a processor, a hard disk, a memory, a system bus and the like, and the server is similar to a general computer architecture, but has higher requirements on processing capacity, stability, reliability, safety, expandability, manageability and the like because of the need of providing high-reliability service.
(5) And other electronic equipment with data interaction function.
Accordingly, an embodiment of the present invention further provides a computer-readable storage medium storing one or more programs, the one or more programs being executable by one or more processors to implement the method for establishing a pitch recognition model or the pitch recognition method provided by any of the foregoing embodiments, thereby also achieving the corresponding technical effects.
It is noted that, herein, relational terms such as first and second may be used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual such relationship or order between them. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or apparatus that includes the element.
All the embodiments in the present specification are described in a related manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments.
In particular, as for the apparatus embodiment, since it is substantially similar to the method embodiment, the description is relatively simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
For convenience of description, the above devices are described separately in terms of their functional division into units/modules. Of course, when implementing the present invention, the functionality of the units/modules may be realized in one or more pieces of software and/or hardware.
It will be understood by those skilled in the art that all or part of the processes of the above method embodiments may be implemented by a computer program instructing related hardware. The program may be stored in a computer-readable storage medium and, when executed, may carry out the processes of the method embodiments described above. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), or the like.
The above description covers only specific embodiments of the present invention, but the protection scope of the present invention is not limited thereto. Any changes or substitutions that can readily be conceived by those skilled in the art within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A method for establishing a pitch recognition model is characterized by comprising the following steps:
acquiring musical sound training data, wherein the musical sound training data comprise musical sound original file segments and digital control signals corresponding to note pitches in the musical sound original file segments;
convolving the frequency domain data of the musical sound original file segment through a convolutional neural network model to obtain intermediate data, wherein the convolutional neural network model carries a first to-be-determined parameter;
inputting the intermediate data into a bidirectional long-short term memory network model, and determining the pitch probability of notes in the musical sound original file segment, wherein the bidirectional long-short term memory network model carries a second parameter to be determined;
and determining the first to-be-determined parameter and the second to-be-determined parameter according to the corresponding relation between the pitch probability and the digital control signal, to obtain a pitch recognition model.
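The training-data step of claim 1 pairs frequency-domain audio frames with digital control signals; such signals are commonly MIDI note events, though the claim does not say so. As a hedged illustration only — the MIDI range (88 piano pitches, notes 21–108) and the frame hop are assumptions, not taken from the patent — the note events could be expanded into per-frame binary pitch targets like this:

```python
# Toy sketch (not from the patent): turn MIDI-style note events into
# frame-level training targets, one binary vector per analysis frame,
# with a 1 for every pitch sounding in that frame.

FRAME_HOP = 0.032  # seconds per frame, an assumed analysis hop
N_PITCHES = 88     # assumed piano range, MIDI notes 21..108

def midi_to_frame_targets(events, n_frames, hop=FRAME_HOP):
    """events: list of (onset_sec, offset_sec, midi_pitch) note events."""
    targets = [[0] * N_PITCHES for _ in range(n_frames)]
    for onset, offset, pitch in events:
        lo = max(0, int(onset / hop))
        hi = min(n_frames, int(offset / hop) + 1)
        for f in range(lo, hi):
            targets[f][pitch - 21] = 1  # map MIDI 21..108 -> index 0..87
    return targets

# Example: C4 (MIDI 60) from 0.0-0.1 s, E4 (MIDI 64) from 0.05-0.1 s
frames = midi_to_frame_targets([(0.0, 0.1, 60), (0.05, 0.1, 64)], n_frames=4)
```

A network's per-frame pitch probabilities can then be fitted against such targets to determine the parameters of claim 1.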
2. The method of claim 1, wherein the inputting the intermediate data into a bidirectional long-short term memory network model and determining a pitch probability of a note in the musical sound original file segment comprises:
inputting the intermediate data into a bidirectional long-short term memory network model, and determining the triggering probability of each sounding element in a preset type of musical instrument;
the determining the first to-be-determined parameter and the second to-be-determined parameter according to the corresponding relationship between the pitch probability and the digital control signal comprises:
and determining the first to-be-determined parameter and the second to-be-determined parameter according to the corresponding relation between the triggering probability of each sounding element and the digital control signal.
3. The method of claim 1, wherein the convolving the frequency domain data of the musical sound original file segment through a convolutional neural network model comprises:
and performing two-dimensional convolution operation on the frequency domain data of the musical sound original file segment and removing redundant information to obtain the intermediate data.
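Claim 3's "two-dimensional convolution operation and removal of redundant information" can be illustrated with a convolution followed by max pooling; pooling is one common way to discard redundant detail, though the claim does not fix the exact operation. A pure-Python toy sketch over a small time-frequency matrix:

```python
# Illustrative only: a 'valid' 2D convolution (cross-correlation) and a
# non-overlapping 2x2 max pool, standing in for claim 3's two steps.

def conv2d_valid(x, k):
    """Slide kernel k over matrix x; no padding, so output shrinks."""
    kh, kw = len(k), len(k[0])
    out_h, out_w = len(x) - kh + 1, len(x[0]) - kw + 1
    return [[sum(x[i + a][j + b] * k[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]

def max_pool2x2(x):
    """Keep only the maximum of each 2x2 block, halving each dimension."""
    return [[max(x[i][j], x[i][j + 1], x[i + 1][j], x[i + 1][j + 1])
             for j in range(0, len(x[0]) - 1, 2)]
            for i in range(0, len(x) - 1, 2)]

spec = [[1, 2, 0, 1, 3],   # a toy 5x5 time-frequency patch
        [0, 1, 2, 0, 1],
        [1, 0, 1, 2, 0],
        [2, 1, 0, 1, 1],
        [0, 2, 1, 0, 2]]
kernel = [[1, 0], [0, -1]]  # a toy 2x2 kernel
feat = max_pool2x2(conv2d_valid(spec, kernel))  # intermediate data, 2x2
```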
4. The method of claim 1, wherein after obtaining the intermediate data and before inputting the intermediate data into the bidirectional long-short term memory network model, the method further comprises:
performing overfitting-prevention processing on the intermediate data to obtain first data;
the inputting the intermediate data into the bidirectional long-short term memory network model comprises:
and inputting the first data into the bidirectional long-short term memory network model.
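Claim 4's overfitting-prevention step is commonly realized by dropout, although the claim does not name a specific technique; treating it as inverted dropout is an assumption made only for illustration:

```python
import random

# Hypothetical sketch of claim 4's overfitting-prevention step as inverted
# dropout: zero each activation with probability `rate` during training,
# and rescale the survivors so the expected activation is unchanged.

def dropout(activations, rate, rng):
    keep = 1.0 - rate
    return [0.0 if rng.random() < rate else a / keep for a in activations]

rng = random.Random(42)
x = [0.5, -1.2, 0.8, 0.3, -0.7, 1.1]        # toy intermediate data
first_data = dropout(x, rate=0.2, rng=rng)  # "first data" fed to the BiLSTM
```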
5. The method of claim 1, wherein after the obtaining of the pitch recognition model, the method further comprises:
performing operation adjustment on the pitch recognition model according to the operation environment of the pitch recognition model, wherein the operation adjustment comprises at least one of the following items: model structure adjustment, model parameter adjustment and data type adjustment.
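One concrete form of claim 5's "data type adjustment" is weight quantization for a constrained runtime. The following sketch is illustrative only — the patent does not prescribe int8, symmetric scaling, or any particular scheme:

```python
# Hypothetical "data type adjustment": symmetric int8 quantization of a
# weight vector, trading a little precision for a 4x smaller data type.

def quantize_int8(weights):
    scale = max(abs(w) for w in weights) / 127 or 1.0  # avoid zero scale
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

w = [0.5, -1.27, 0.01, 1.0]
q, s = quantize_int8(w)
w_hat = dequantize(q, s)  # approximate reconstruction of the weights
```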
6. The method of any one of claims 1 to 5, wherein before the convolving the frequency domain data of the musical sound original file segment through a convolutional neural network model, the method further comprises:
adding environmental noise to the musical sound original file segment to obtain a noise-added file segment;
the convolving the frequency domain data of the musical sound original file segment by the convolutional neural network model comprises:
and convolving the frequency domain data of the noise-added file segment through a convolutional neural network model.
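As an illustrative sketch of claim 6 (the mixing scheme and the target signal-to-noise ratio are assumptions; the claim only requires adding environmental noise), the noise can be scaled so the augmented segment has a chosen SNR:

```python
import math
import random

# Hypothetical noise augmentation: scale `noise` so the mix has roughly
# the requested SNR in dB relative to the clean musical signal.

def mix_at_snr(clean, noise, snr_db):
    p_clean = sum(s * s for s in clean) / len(clean)   # signal power
    p_noise = sum(n * n for n in noise) / len(noise)   # raw noise power
    gain = math.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(clean, noise)]

random.seed(0)
clean = [math.sin(2 * math.pi * 440 * t / 8000) for t in range(8000)]  # 1 s of A4 at 8 kHz
noise = [random.uniform(-1, 1) for _ in range(8000)]                   # toy "environmental" noise
noisy = mix_at_snr(clean, noise, snr_db=20)
```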
7. A pitch recognition method based on a pitch recognition model established by the establishing method of any one of claims 1 to 6, characterized by comprising:
inputting frequency domain data of a musical sound segment to be recognized into the pitch recognition model;
and recognizing the pitch of the notes in the musical sound segment to be recognized according to the pitch probability output by the pitch recognition model.
8. The method according to claim 7, wherein the musical sound segment to be recognized comprises a chord and/or at least two notes that are triggered simultaneously.
9. The method according to claim 7 or 8, wherein before the inputting the frequency domain data of the musical sound segment to be recognized into the pitch recognition model, the method further comprises:
calibrating the frequency domain data of the musical sound segment to be recognized according to a preset standard pitch frequency to obtain calibration data;
the inputting the frequency domain data of the musical sound segment to be recognized into the pitch recognition model comprises:
inputting the calibration data into the pitch recognition model.
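The calibration of claim 9 can be pictured as rescaling the frequency axis so the instrument's observed reference matches the preset standard; the A4 = 440 Hz standard and the uniform-ratio scheme below are assumptions for illustration, not claim limitations:

```python
# Hypothetical calibration: if the instrument's reference pitch deviates
# from the standard, rescale every measured frequency by the same ratio.

STANDARD_A4 = 440.0  # assumed preset standard pitch frequency, in Hz

def calibrate(freqs_hz, observed_a4_hz):
    """Map measured peak frequencies onto the standard tuning."""
    ratio = STANDARD_A4 / observed_a4_hz
    return [f * ratio for f in freqs_hz]

# Instrument tuned flat: its "A4" actually sounds at 434 Hz.
peaks = [217.0, 434.0, 868.0]  # measured harmonics of the flat A
calibrated = calibrate(peaks, observed_a4_hz=434.0)  # ~[220, 440, 880]
```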
10. The method according to claim 7 or 8, wherein the recognizing the pitch of the notes in the musical sound segment to be recognized according to the pitch probability output by the pitch recognition model comprises:
recognizing the pitch of the notes in the musical sound segment to be recognized according to a preset rule, wherein the preset rule comprises at least one of the following: the magnitude relation between each pitch probability and a preset probability threshold, the variation of each pitch probability over time, and the correlation among multiple pitches.
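Two of claim 10's preset rules (a probability threshold, and the probability's variation over time) can be combined into a simple decoder: report a pitch only when its probability stays above the threshold for a minimum number of consecutive frames. The threshold and duration values below are illustrative assumptions:

```python
# Toy decoder for one pitch track: threshold each frame's probability and
# keep only runs of at least `min_frames` consecutive above-threshold frames.

def decode(probs, threshold=0.5, min_frames=2):
    """probs: per-frame probabilities for one pitch. Returns (on, off) frame spans."""
    spans, start = [], None
    for i, p in enumerate(probs + [0.0]):  # sentinel closes any open span
        if p > threshold and start is None:
            start = i                       # note onset candidate
        elif p <= threshold and start is not None:
            if i - start >= min_frames:     # long enough to count as a note
                spans.append((start, i))
            start = None                    # too short: treat as noise
    return spans

notes = decode([0.1, 0.9, 0.8, 0.2, 0.7, 0.1])  # one spurious single-frame blip
```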
CN202010369795.9A 2020-04-30 2020-04-30 Pitch recognition model establishing method, pitch recognition method and pitch recognition device Pending CN113593504A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010369795.9A CN113593504A (en) 2020-04-30 2020-04-30 Pitch recognition model establishing method, pitch recognition method and pitch recognition device


Publications (1)

Publication Number Publication Date
CN113593504A 2021-11-02

Family

ID=78237758

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010369795.9A Pending CN113593504A (en) 2020-04-30 2020-04-30 Pitch recognition model establishing method, pitch recognition method and pitch recognition device

Country Status (1)

Country Link
CN (1) CN113593504A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120174731A1 (en) * 2011-01-12 2012-07-12 Auburn Audio Technologies, Inc. Virtual Tuning of a String Instrument
CN103999076A (en) * 2011-08-08 2014-08-20 英特里斯伊斯公司 System and method of processing a sound signal including transforming the sound signal into a frequency-chirp domain
US8965766B1 (en) * 2012-03-15 2015-02-24 Google Inc. Systems and methods for identifying music in a noisy environment
CN109801645A (en) * 2019-01-21 2019-05-24 深圳蜜蜂云科技有限公司 A kind of musical sound recognition methods
CN110599987A (en) * 2019-08-25 2019-12-20 南京理工大学 Piano note recognition algorithm based on convolutional neural network



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination