WO2022052940A1 - Method and system for recognizing played piano keys based on audio - Google Patents

Method and system for recognizing played piano keys based on audio

Info

Publication number
WO2022052940A1
Authority
WO
WIPO (PCT)
Prior art keywords
key
audio
user
hand
piano
Prior art date
Application number
PCT/CN2021/117129
Other languages
English (en)
French (fr)
Inventor
陶之雨
郑庆伟
Original Assignee
桂林智神信息技术股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from CN202010939320.9A (CN114170868A)
Priority claimed from CN202110982027.5A (CN113658612B)
Application filed by 桂林智神信息技术股份有限公司
Publication of WO2022052940A1

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208: Noise filtering
    • G10L21/0216: Noise filtering characterised by the method used for estimating noise
    • G10L21/0232: Processing in the frequency domain
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272: Voice signal separating
    • G10L21/028: Voice signal separating using properties of sound source
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination

Definitions

  • the present invention relates to the field of intelligent pianos, in particular to the fields of piano tuning, playing evaluation and piano teaching, and more particularly, to a method and system for recognizing played keys based on audio.
  • In the prior art, the quality of piano tuning and of piano performance is mostly evaluated on the basis of personal experience. For example, after a tuner finishes tuning a piano, the user plays each key and judges the tuning effect according to personal experience; since personal experience varies widely, judgments of the tuning effect are likewise uneven. Piano teaching and piano learning frequently require evaluating a player's performance, and in the prior art such evaluation is mostly done by ear. In this process, insufficient experience or a poor tuning effect directly affects the judgment of the played notes.
  • When a key is played, the vibration of the strings can be simply modeled as the forced vibration of a rigid material.
  • The waves of different frequencies excited in the strings have an obvious proportional relationship; for example, playing the A4 key of a standard piano excites waves at 440 Hz, 880 Hz, 1320 Hz, 1760 Hz, and so on.
  • The minimum vibration frequency is defined as the fundamental frequency of the key.
  • A vibration at an integer multiple of the fundamental frequency is called an overtone or harmonic of the fundamental; multiple harmonics form a harmonic sequence (overtone sequence), which constitutes an obvious harmonic structure.
  • Each key of a piano has its own fundamental frequency and harmonic structure.
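  • As a non-patent illustration of this ideal harmonic structure, the integer-multiple overtone series of any key can be generated from its fundamental frequency; the sketch below (with an assumed harmonic count) reproduces the A4 example above:

```python
def ideal_harmonics(f0: float, n_harmonics: int = 8) -> list:
    """Ideal (integer-multiple) harmonic series of a string with fundamental f0."""
    return [n * f0 for n in range(1, n_harmonics + 1)]

# A4 on a standard piano: f0 = 440 Hz
print(ideal_harmonics(440.0, 4))  # [440.0, 880.0, 1320.0, 1760.0]
```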
  • However, modeling the string as a simple rigid vibrator is an idealization: in fact the strings are metal wires of different thicknesses fixed at both ends on the soundboard of the resonant cavity. Because the two ends are fixed, the actually vibrating string length is shortened, so the harmonic frequencies and the fundamental frequency do not satisfy a strict integer-multiple relationship, and the harmonics of different keys can overlap. Moreover, piano tuning has no strictly qualitative and quantitative benchmark, so estimating played keys from notes during performance evaluation is inaccurate. This causes frequent octave errors in note recognition: for example, if the piece requires playing the A4 key but the A3 and A5 keys are played at the same time, the overlap of harmonics may lead to the judgment that only the A4 key was played.
  • Existing piano training methods still have deficiencies: on the one hand, practice (or evaluation) methods based only on the audio of the played music may yield inaccurate judgments because of environmental noise; on the other hand, even when the notes played are correct, ignoring important factors such as the rhythm and tempo of the music affects the accuracy of the judgment.
  • The object of the present invention is to overcome the above-mentioned defects of the prior art and to provide a method and system that can intelligently identify the played keys from the playing audio, calculate the beat, and evaluate the performance, as well as an intelligent piano training method and system based on the aforementioned key-recognition method.
  • According to a first aspect of the present invention, a method for recognizing played keys based on audio is provided, comprising acquiring the audio of playing the piano, performing framing processing on the audio to obtain a plurality of framed signals, and performing the following steps on each framed signal: T1, performing frequency-domain analysis on the current framed signal to obtain its frequency spectrum; T2, performing noise level estimation on the spectrum obtained in step T1 to obtain a plurality of musical spectral peaks, which form a musical spectral peak set; T3, searching for a first candidate key set of possibly played keys according to the musical spectral peak set obtained by the noise level estimation and the harmonic frequencies composed from the spectral parameter sets of all keys, wherein the first candidate key set is the set of keys having harmonics identical to one or more musical spectral peaks in the musical spectral peak set.
  • Preferably, the method further includes performing the following steps on each framed signal: T4, using the hand and key images captured by a camera in the time period corresponding to the framed signal to obtain the set of keys below the hand position, and taking its intersection with the first candidate key set to obtain a second candidate key set; T5, based on the harmonic sequence of each key in the second candidate key set, forming a third candidate key set from the keys in the second candidate key set whose harmonics coincide with one or more musical spectral peaks in the musical spectral peak set; T6, screening the third candidate key set and keeping the keys whose higher-octave or lower-octave keys are played, to form a fourth candidate key set.
  • Preferably, the method further includes: T7, taking the fourth candidate key set as the multi-fundamental-frequency estimation result of a single framed signal, and performing inter-frame smoothing of note events on it using HMM and Viterbi methods to obtain a comprehensive estimation result of the played keys corresponding to the framed signal. By considering the continuity of chords in time and performing multi-fundamental-frequency estimation at the level of note events, the estimation result is closer to the sense of hearing, and the accuracy of multi-fundamental-frequency estimation is improved.
  • the acquired audio is subjected to noise reduction and speech separation processing in sequence, and then framed.
  • Preferably, a multi-microphone noise reduction technology or a multi-microphone array beamforming technology is used to perform noise reduction on the audio to eliminate ambient noise, and a multi-microphone noise reduction technology or a multi-microphone array beamforming technology is used to perform speech separation on the audio to separate the human voice from the musical tones.
  • the multi-microphone noise reduction technology is used to eliminate ambient noise
  • the multi-microphone array beamforming technology is used for voice separation.
  • Preferably, the multi-microphone noise reduction technology is implemented by the following steps: pointing the multiple microphones in the directions that need to be picked up, including the piano resonant cavity and the direction of the ambient noise; and using spectral subtraction to subtract the spectrum with high ambient-noise content from the spectrum with high musical-tone content, so as to cancel the ambient noise.
  • Preferably, the microphones that collect the audio are arranged in the following manner: the center point of the rectangular plane containing the top view of the piano keyboard and the two farthest sound source points outside the rectangle are taken as the vertices of a triangle; the multiple microphones are arranged on an arc whose tangent at the centroid of the triangle is parallel to the line connecting the two farthest sound source points, the arc curving around either the center point of the rectangular plane or the two farthest sound source points.
  • an offline calibration or an online calibration method is used to acquire or update a key spectral parameter set, where the key spectral parameter set includes a spectral parameter corresponding to each key.
  • Each key spectral parameter is used to represent the harmonic structure of the corresponding key.
  • Preferably, the offline calibration method includes the following steps: L1, manually playing a key with the required strength and acquiring the audio of playing the key; L2, analyzing the audio of step L1 using silence suppression to obtain the start time period and end time period of playing the key; L3, performing frequency-domain analysis on the time-domain data between the start time period and the end time period obtained in step L2, so as to obtain the spectral harmonic structure of the key's sound on the spectrum.
  • According to a second aspect of the present invention, a method for calculating the playing beat based on audio is provided, including the following steps: A1, using the method described in the first aspect of the present invention to obtain the set of played keys of the current framed signal; A2, determining the start time and end time at which each key is played based on the set of played keys, and calculating the beat accordingly.
  • According to a third aspect of the present invention, a method for evaluating a performance based on audio is provided, comprising the following steps: P1, using the method described in the first aspect of the present invention to obtain the set of played keys of the current framed signal; P2, determining the start time and end time at which each key is played based on the set of played keys and calculating the beat accordingly, and, based on the set of played keys and the musical spectral peaks of the current frame, calculating the power sum of the harmonic sequence of each key in the set as the intensity value of that key; P3, evaluating the pitch, beat, rhythm, intensity, pedal, and playing emotion information of the performance according to preset evaluation indexes.
  • According to a fourth aspect of the present invention, a system for recognizing played piano keys based on audio is provided, comprising: a sound collection module for collecting the audio of playing the piano; and a computer storage and management module for storing the collected audio and performing the following operations: B1, performing frequency-domain analysis on a framed signal to obtain its frequency spectrum; B2, performing noise level estimation on the spectrum obtained in step B1 to divide out a plurality of musical spectral peaks in the spectrum, forming a musical spectral peak set; B3, searching for a first candidate key set of possibly played keys according to the musical spectral peak set and the harmonic frequency set composed from the spectral parameter sets of all keys, wherein the first candidate key set is the set of keys having harmonics identical to one or more musical spectral peaks in the musical spectral peak set.
  • Preferably, the computer storage and management module is further configured to perform the following operations: B4, using the hand and key images captured by the camera in the time period corresponding to the current framed signal to obtain the set of keys below the hand position, and simplifying the first candidate key set to obtain a second candidate key set, wherein the second candidate key set is the intersection of the key set below the hand position in the time period corresponding to the current frame and the first candidate key set; B5, based on the harmonic sequence of each key in the second candidate key set, forming a third candidate key set from the keys in the second candidate key set whose harmonics coincide with one or more musical spectral peaks in the musical spectral peak set; B6, screening the third candidate key set and keeping the keys whose higher-octave or lower-octave keys are played, to form a fourth candidate key set.
  • Preferably, the computer storage and management module is also configured to perform the following operation: B7, taking the fourth candidate key set as the multi-fundamental-frequency estimation result of a single framed signal and performing inter-frame smoothing of note events on it to obtain a comprehensive estimation result of the played keys corresponding to the framed signal.
  • Preferably, the system further includes: a noise suppression module for performing noise reduction on the audio using a multi-microphone noise reduction technology to eliminate ambient noise in the audio; and a sound source separation module for performing speech separation on the audio using a multi-microphone array beamforming technology to separate human speech from the audio.
  • the system further includes: a spectrum calibration module, configured to acquire or update the key spectrum parameter set by using an offline calibration or an online calibration method.
  • Preferably, the system further includes: a performance evaluation module configured to evaluate the pitch, beat, rhythm, intensity, pedal, and playing emotion information of the performance according to preset evaluation indexes.
  • the system further includes: a human-computer interaction module for realizing the interaction between the user and the system.
  • According to a fifth aspect of the present invention, an intelligent piano training method is provided, comprising: acquiring audio information and video information of a user playing the piano; extracting user audio data from the audio information and comparing it with the corresponding reference audio data stored in an audio database to obtain the matching degree between the user audio data and the corresponding reference audio data; intercepting from the video information the user hand image corresponding to the user audio data, identifying the user hand data in the user hand image through a hand model, and comparing it with the corresponding reference hand data stored in a hand database to obtain the matching degree between the user hand data and the corresponding reference hand data, wherein the hand model takes hand images as input data and the hand data in the hand images as output data and is obtained by training a neural network; and feeding the playing result back to the user based on the matching degree between the user audio data and the corresponding reference audio data and the matching degree between the user hand data and the corresponding reference hand data.
  • the user audio data includes extraction time, musical note, fundamental frequency and sound intensity
  • Preferably, the method described in the first aspect of the present invention is used to identify the played keys in the audio information, and user audio data such as musical notes, fundamental frequency, and sound intensity are obtained on this basis.
  • According to a sixth aspect of the present invention, an intelligent piano training system is provided, comprising: an audio and video acquisition unit for acquiring audio information and video information of a user playing the piano; a data extraction unit for extracting user audio data from the audio information and intercepting from the video information the user hand image corresponding to the user audio data; a data recognition unit for identifying the user hand data in the user hand image through a hand model, wherein the hand model takes hand images as input data and the hand data in the hand images as output data and is obtained by training a neural network; and a data matching unit for comparing the user audio data with the corresponding reference audio data in an audio database to obtain the matching degree between the user audio data and the corresponding reference audio data, and for comparing the user hand data with the corresponding reference hand data in a hand database.
  • Preferably, the data extraction unit is configured with the system for recognizing played keys based on audio as described in the fourth aspect of the present invention, which is used to identify the played keys in the audio information and obtain user audio data such as notes, fundamental frequency, and sound intensity on this basis.
  • Compared with the prior art, the present invention has the following advantages: it separates ambient noise and the human voice from the musical tones, effectively reducing the influence of noise and speech on the accuracy of multi-fundamental-frequency estimation and intensity estimation; calibration has low user perception and captures changes in the piano's sound characteristics in time; the calibrated parameters allow a comprehensive evaluation of the tuning of the user's piano; the method suits traditional pianos, requiring no hardware such as pressure sensors or distance sensors on each key, so a traditional piano can also become an intelligent piano; the user can play freely without being limited to the existing tunes in the music library; and, by adopting the method of the present invention, the played keys can be accurately identified and accurate data such as notes, fundamental frequency, and intensity obtained on this basis, which improves the accuracy of piano training and thus the effect of piano teaching.
  • FIG. 1 is a schematic diagram of a microphone arrangement of a multi-microphone array beamforming technology according to an embodiment of the present invention.
  • FIG. 2 is an intelligent piano training method according to an embodiment of the present invention.
  • FIG. 3 is a schematic diagram of standard audio data storage in an audio database according to an embodiment of the present invention.
  • FIG. 4 is a schematic diagram of standard hand data storage in a hand database according to an embodiment of the present invention.
  • FIG. 5 is a schematic diagram of storage of a comprehensive database according to an embodiment of the present invention.
  • FIG. 6 is a smart piano training method according to an embodiment of the present invention.
  • FIG. 7 is a smart piano training method according to an embodiment of the present invention.
  • FIG. 8 is a smart piano training system according to one embodiment of the present invention.
  • The core of the present invention is to recognize played keys based on audio, i.e., the multi-fundamental-frequency estimation task: estimating the precise fundamental frequency values, their number, and the starting time period, so as to accurately determine the specific keys played within a single frame. Achieving this goal involves great difficulties, embodied in the following aspects:
  • Time-frequency analysis of audio data is limited by time resolution and frequency resolution.
  • Specifically, frequency-domain analysis must load data covering a certain period of time.
  • the frequency domain analysis result is a frequency-discrete spectral line with a constant spectral line interval (except CQT and VQT).
  • Beats: The frequencies of the several strings belonging to one multi-string key cannot be exactly the same.
  • Each string excites its fundamental frequency and overtone sequence according to its own tension, so the frequencies differ slightly from one another.
  • When two waves of slightly different frequencies are superimposed, the phenomenon of "beating" occurs, meaning that the amplitude of the combined wave is modulated: the peak of the combined wave is the sum of the two amplitudes, the valley is the absolute value of the difference of the two amplitudes, and the beat frequency is the absolute value of the frequency difference between the two waves.
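  • The beat phenomenon can be checked numerically; the sketch below (illustrative frequencies and amplitudes) superimposes two slightly detuned sine waves, with the envelope peaking at the sum of the amplitudes and a beat frequency of |f1 - f2|:

```python
import numpy as np

fs = 8000                        # sample rate in Hz (illustrative)
t = np.arange(0, 2.0, 1 / fs)    # two seconds of signal
a1, a2 = 1.0, 0.6                # amplitudes of the two strings
f1, f2 = 440.0, 442.0            # slightly detuned frequencies

x = a1 * np.sin(2 * np.pi * f1 * t) + a2 * np.sin(2 * np.pi * f2 * t)

print(np.abs(x).max())           # ~ a1 + a2 = 1.6 at the envelope peaks
# The envelope rises and falls |f1 - f2| = 2 times per second (beat frequency).
```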
  • the harmonics of a key include the fundamental frequency and its higher harmonics.
  • the harmonic loss mainly includes the loss of the fundamental frequency and the loss of higher harmonics.
  • Loss of the fundamental frequency means that, when a certain key is played and the collected sound is analyzed in the frequency domain, the spectrum contains no fundamental component, or the fundamental component is much smaller than its harmonic components. The reason is that low-frequency signals resonate poorly, so their vibration amplitude is small and their energy very low. Loss of the fundamental therefore occurs mainly for notes in the bass region and affects the accuracy of bass note recognition.
  • Voice interference: the pronunciation of vowels in speech has a structure consistent with the overtone series of piano keys, which affects the accuracy of multi-fundamental-frequency estimation.
  • Noise interference: noise and the consonant sounds of speech affect the frequency-analysis results of some bands or of the whole frequency domain, manifesting as a rise of all spectral values in the affected band that submerges the original peaks and greatly degrades the accuracy of multi-fundamental-frequency estimation.
  • The trade-off between computational complexity and accuracy must be balanced to a certain extent.
  • The present invention provides a method for recognizing played keys based on audio that addresses the problems in the above aspects.
  • a method for recognizing played keys based on audio including steps 1-10, each step will be described in detail below.
  • step 1 the audio of playing the piano is captured.
  • the sound of the piano when the piano is played is collected by the sound collection module and stored in the computer storage and management module.
  • The sound collection module can be a microphone or another device capable of collecting sound, preferably a microphone, because a microphone only needs to be installed around the piano rather than inside the piano keys, whereas a sound sensor must be installed inside the keys to achieve a better acquisition effect.
  • step 2 the collected audio is subjected to noise reduction and speech separation and is divided into frames, and the obtained framed signal is used as a unit of subsequent processing.
  • The purpose of framing is that the audio signal can be considered stationary within the current frame period, which is helpful for subsequent processing.
  • the purpose of noise reduction is to remove ambient noise in the musical audio
  • the purpose of speech separation is to separate the human voice in the musical audio and reduce the impact of the human voice on the musical audio spectrum.
  • ambient noise is removed by a noise suppression module.
  • As is well known, in different environments such as noisy shopping malls, outdoors, or piano rooms, noise is mixed with the piano's music and collected into the audio by the microphone. The collected noise seriously affects the correctness and accuracy of subsequent key recognition and performance evaluation, and also affects the playback effect in the human-computer interaction module. Similarly, the hardware of the sound collection module generates a noise floor, which also needs to be filtered out.
  • The present invention uses multi-microphone noise reduction technology to eliminate most of the noise in the environment; this stage mainly eliminates ambient noise.
  • The principle of multi-microphone noise reduction can be understood simply as follows: point multiple microphones toward the pickup direction (the piano resonant cavity) and toward the noise direction (the distance), so that all microphones collect both music and noise, but in different mixing ratios.
  • the musical sound components collected by a microphone pointing at the piano are higher than those collected by a microphone pointing at a distance, and the noise components collected by a microphone pointing at a distance are higher than those collected by a microphone pointing at the piano.
  • Therefore, by spectral subtraction, the spectrum with high noise content can be subtracted from the spectrum with high musical content, so that noise is eliminated from the collected sound to a certain extent.
  • the human voice is separated by a sound source separation module.
  • Besides ambient noise, another important source of interference in the application scenario of the present invention is the human voice: the frequency-domain structural characteristics of speech (mainly vowels) are relatively similar to those of the musical tones emitted by the piano, both having obvious harmonic structures, so the human voice also affects the correctness and accuracy of subsequent key recognition and performance evaluation.
  • the separation of speech and tones in the present invention uses a microphone array beamforming technique. This technology is applied to the selection of the spatial domain of the sound source signal.
  • the data received by the sensor is weighted to form a certain beam shape.
  • The function of the beam is to let information in the direction of the signal of interest pass through and apply a gain to it, while suppressing signals in directions of no interest.
  • Realizing this solution imposes requirements on the installation positions of the microphones, described as follows: as shown in FIG. 1, a rectangle lies on the plane, and the sound sources represent several speakers, such as a teacher and a student sitting in front of the piano.
  • The direction of the arc M can be chosen to curve around point A or around points B and C.
  • The installation positions of the multiple microphones should avoid the straight line L as far as possible and should lie on the arc M, with the arc M as large as possible; this maximizes the pickup differences among the microphones and ensures the accuracy and robustness of the algorithm's spatial-domain estimation.
  • A relatively pure musical signal is obtained after the sound collection, noise suppression, and speech separation processes of the above embodiments, which basically meets the requirements of subsequent processing.
  • step 3 frequency domain analysis is performed on each frame of the audio signal after being divided into frames to obtain a frequency spectrum of a single frame of audio signal, and there are multiple spectral peaks in the frequency spectrum.
  • step 4 noise level estimation is performed on the spectrum obtained in step 3, and an extreme value in the neighborhood higher than the estimated noise level is searched in the spectrum obtained in step 3 to obtain a plurality of musical spectral peaks to form a musical spectral peak set.
  • an adaptive noise level estimation method is used to determine whether the statistic of the sub-band frequency domain signal in the single-frame audio signal satisfies the Rayleigh distribution. Usually, the number of spectral peaks excited by musical tones is much smaller than that of noise signals. If the statistic of the subband frequency domain signal satisfies the Rayleigh distribution, the Q quantile of the distribution is used to represent the noise level of the current single frame (subinterval).
  • For Q, 97.5% or 99% can be selected; that is, within a single-frame subband, approximately 97.5% or 99% of the spectral values are regarded as noise and the rest as musical tones.
  • Extrema are then searched in the spectrum: an extremum higher than the estimated noise level is a musical spectral peak.
  • The extrema are found through the following steps: first, all local extrema in the spectrum are located; second, the local extrema higher than the noise level estimate are screened out; third, among these, the local extrema that satisfy the vibration frequencies of the strings are further selected, and the finally selected local extrema are the musical spectral peaks.
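  • A simplified sketch of the subband noise-level estimation and peak picking described above: the noise level is taken as the Q-quantile of the subband magnitudes, and local extrema above it are kept as candidate musical spectral peaks (the Rayleigh goodness-of-fit test and the string-frequency screening are omitted here for brevity):

```python
import numpy as np

def musical_peaks(mag: np.ndarray, q: float = 0.975) -> list:
    """Return indices of local maxima above the Q-quantile noise level.

    `mag` is the magnitude spectrum of one framed signal (treated as a
    single subband for simplicity)."""
    noise_level = np.quantile(mag, q)      # Q-quantile as the noise level
    peaks = []
    for k in range(1, len(mag) - 1):
        is_extremum = mag[k] > mag[k - 1] and mag[k] > mag[k + 1]
        if is_extremum and mag[k] > noise_level:
            peaks.append(k)
    return peaks
```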
  • step 5 the first candidate key set S1 that may be played is comprehensively searched according to the harmonic frequency composed of the musical spectrum peak set estimated in the previous step 4 and the key spectrum calibration parameter set P.
  • The search criterion is that a key has harmonics consistent with the spectral peaks in the current single-frame signal; the first candidate key set finally obtained is therefore the set of keys whose harmonics coincide with one or more musical spectral peaks.
  • the key spectrum calibration parameter set P is a set of spectrum parameters of all keys after calibration, and the spectrum parameters of the keys are used to represent the harmonic structure of the keys.
  • a spectrum calibration module is used to calibrate the spectrum parameters of the keys. This module is divided into two application scenarios, one is offline calibration and the other is online calibration. The functions implemented are the same, but the application scenarios are different.
  • The reason why the key spectral parameters need to be calibrated is as follows. When a piano key is played, the hammer strikes the corresponding string or strings, and the vibration of the strings can be simply modeled as the forced vibration of a rigid material. The impact excites waves of different frequencies in the string, and these frequencies have an obvious proportional relationship; for example, playing the A4 key of a standard piano excites waves at 440 Hz, 880 Hz, 1320 Hz, 1760 Hz, and so on. The minimum vibration frequency is defined as the fundamental frequency of the key. A vibration at an integer multiple of the fundamental frequency is called an overtone or harmonic of the fundamental, and multiple harmonics form a harmonic sequence (overtone sequence), which constitutes an obvious harmonic structure.
  • Each key of a piano has its own fundamental frequency and harmonic structure.
  • the working premise of key identification is that different keys correspond to different harmonic structures.
  • However, the premise of generating integer-multiple harmonics is that the string is modeled as a simple rigid vibrator, when in fact the strings are metal wires of different thicknesses fixed at both ends on the resonance plate of the cavity. Because the two ends are fixed, the actually vibrating string length is shortened, so the harmonic frequencies and the fundamental frequency do not satisfy a strict integer-multiple relationship.
  • The harmonic frequencies can instead be modeled by the stiff-string relation f_n = n·f_0·√(1 + B·n²), where f_0 is the fundamental frequency of a certain key, n is the harmonic index, and B characterizes the non-integer-multiple relationship, also known as the detuning rate.
  • Different pianos pass through the hands of different tuners, and even a well-tuned piano drifts in different environments or over time: the fundamental frequency of each string of each key changes and the corresponding harmonic sequence shifts accordingly, so f_0 in the above formula is also a variable quantity.
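  • A sketch of generating a key's detuned harmonic frequencies from calibrated parameters, using the standard stiff-string model f_n = n·f_0·√(1 + B·n²) that matches the description of f_0 and B above (the exact formula printed in the patent is not reproduced in this text, so this form is a reconstruction):

```python
import math

def detuned_harmonics(f0: float, B: float, n_harmonics: int = 8) -> list:
    """Harmonic frequencies of a stiff string: f_n = n * f0 * sqrt(1 + B * n^2)."""
    return [n * f0 * math.sqrt(1.0 + B * n * n)
            for n in range(1, n_harmonics + 1)]

# With B > 0 the upper partials run progressively sharp of integer multiples:
print(detuned_harmonics(440.0, 4e-4, 4))
# ~ [440.1, 880.7, 1322.4, 1765.6] instead of [440, 880, 1320, 1760]
```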
  • the present invention will store a set of general parameters suitable for most pianos in the computer storage and management module.
  • the parameter set is not specific to a certain piano, but basically conforms to the general parameters of most pianos.
  • the implementation process of offline calibration and online calibration will be described in detail below.
  • Offline calibration: the user enters the offline calibration procedure through the human-computer interface.
  • The program has the user play all 88 keys of the piano individually and sequentially in a quiet environment, and the microphone of the sound collection module records all the collected sounds.
  • VAD (voice activity detection), also known as silence suppression, is mainly used to identify and eliminate long silent periods from the sound signal stream, saving channel resources without degrading service quality; it saves valuable bandwidth and helps reduce the end-to-end delay perceived by users.
  • the VAD is used to detect the start time period and the end time period of playing a certain key to produce a sound, and the corresponding intermediate time is the normal playing time of a certain key.
  • The human-computer interaction interface asks the user to play the A0 key with medium strength (medium or greater strength mainly improves the signal-to-noise ratio); sound acquisition runs synchronously, and the collected sound data is passed to the VAD algorithm, which detects the start and end time periods of the playing; frequency-domain analysis is then performed on the time-domain data between the two periods.
  • The frequency-domain analysis reveals the harmonic structure of the key's sound, with different peaks on the spectrum corresponding to different harmonics.
  • These parameters are used as the spectral parameters of the A0 key.
  • the human-computer interaction interface prompts the user to continue to play the B0 key at medium strength, and repeat the above process until all keys are calibrated, otherwise re-play the key that does not meet the requirements.
  • Once the piano's state changes (for example after re-tuning), the calibrated parameters no longer apply to the current piano, and the offline calibration needs to be re-run to obtain the latest piano parameters.
  • offline calibration requires the user to perform the calibration step by step according to the instructions of the human-machine interface, while online calibration does not require the user to perform this operation, and the user has no feeling for the calibration procedure.
  • the module will automatically monitor the collected audio, and use the key recognition module to determine that one or several keys are played during the time period.
  • The acquired data is then subjected to frequency-domain analysis (the same as in offline calibration). For keys that have never been calibrated, which keys are played is determined using the factory parameter set and the key identification module; for keys that have already been calibrated, the calibrated parameters are used.
  • In the fitting model, W_r denotes the contribution of the first N_r harmonics of the r-th note: each harmonic is expressed as the product of its magnitude and a frequency-dependent function, which can be taken as the main lobe of the Fourier transform of the analysis window evaluated at the frequency difference from the harmonic.
  • The advantage of this is that the effects of the discrete frequency resolution of the frequency-domain analysis can be attenuated.
  • The sum of W_r over all the played keys is fitted to the measured spectrum; after iterating the fit to a minimum, the parameters A_r and B_r of each played key are solved.
  • The solved parameters are used to construct all harmonic frequencies of the keys played in the current time period; the amplitudes at the corresponding frequencies in the spectrum are set to zero (or notch filtering is performed in the frequency domain), and then the residual frequency-domain energy is recalculated, or it is checked whether the statistics of the spectral coefficients basically satisfy the Rayleigh distribution. If the residual energy falls below a certain range, or the spectral coefficients basically satisfy the Rayleigh distribution, the currently computed parameters A_r and B_r are highly reliable and are saved. The parameters of the played keys are thus calibrated as the user plays, gradually completing the calibration of the whole keyboard.
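  • A sketch of the plausibility check described above: the fitted harmonic frequencies are notched out of the spectrum and the residual energy is inspected (the Rayleigh-statistics alternative is omitted; the notch half-width and bin handling are assumptions):

```python
import numpy as np

def residual_after_notching(mag: np.ndarray, bin_hz: float,
                            harmonic_freqs: list,
                            halfwidth_bins: int = 2) -> float:
    """Zero the bins around each fitted harmonic, return residual energy."""
    mag = mag.copy()
    for f in harmonic_freqs:
        k = int(round(f / bin_hz))                  # nearest spectral bin
        lo = max(k - halfwidth_bins, 0)
        hi = min(k + halfwidth_bins + 1, len(mag))
        mag[lo:hi] = 0.0                            # notch the harmonic
    return float(np.sum(mag ** 2))

# If the residual is small relative to the original energy, the fitted
# parameters Ar and Br are considered reliable and saved.
```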
  • In step 6, the first candidate key set S1 is simplified.
  • The hand and key positions captured by the camera are used to compute the key set Sh below the hand position in the time period corresponding to the single-frame audio signal, and the second candidate key set S2 is set to the intersection of Sh and S1.
  • Alternatively, the played key positions Sr corresponding to the comprehensive estimation result of the previous frame are used to estimate the key set Ss in the area where both hands are located during the current frame period, and the second candidate key set S2 is set to the intersection of Ss and S1.
  • In step 7, the range of the candidate key set is narrowed according to the second candidate key set S2 and the amplitudes of all identified musical spectral peaks. Specifically, when a certain key is played, the amplitudes of the spectral peaks of all the harmonics it excites should vary regularly, for example following an exponential decay function or some probability density function. Under this assumption, each peak is reassigned to the keys whose harmonics it is more likely to belong to, forming the third candidate key set S3; that is, the keys in the second candidate key set S2 whose harmonics coincide with one or more musical spectral peaks form the third candidate key set.
  • In step 8, for the third candidate key set S3 and the peaks of the musical audio spectrum, the higher- and lower-octave keys of each key in S3 are checked (where the octave keys are themselves also in S3), in terms of the proportion of their harmonics among the spectral peaks.
  • If this proportion is consistently high, it proves that the higher-octave or lower-octave key of the key in question is really played; otherwise, the key is not played and was misjudged, and should be removed from S3. A fourth candidate key set S4 is finally formed.
  • In step 9, the fourth candidate key set S4 is used as the multi-fundamental-frequency estimation result of a single framed signal, and the HMM and Viterbi algorithms are then used to perform inter-frame smoothing of note events; the smoothed result is used as the comprehensive estimation result Sr of the frame.
  • The purpose of inter-frame smoothing of note events is to combat recognition errors caused by sound decay. For example, suppose the user presses the A4 key at the first second and releases it at the fifth second; during these four seconds the sound decays.
  • The raw output might then be: the A4 key is pressed from second 1 to 3, not pressed from second 3 to 3.5, and pressed again from second 3.5 to 5 (press and release are called events). If each frame is 10 milliseconds, this is equivalent to frames 100-300 and 351-500 being "pressed" and frames 301-350 being "released", which is inconsistent with the actual situation.
  • Inter-frame smoothing of note events means that frames 301 to 350 are also corrected to "pressed".
  • a time domain filter can also be used to process the multi-base frequency estimation result of a single framed signal to obtain a comprehensive estimation result.
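  • A minimal sketch of the inter-frame smoothing for one key, modeled as a two-state (released/pressed) HMM decoded with Viterbi; the transition and emission probabilities are illustrative assumptions, not values given in the patent:

```python
import numpy as np

def smooth_key_states(observed, p_stay=0.98, p_correct=0.9):
    """Viterbi decoding of a 2-state HMM (0 = released, 1 = pressed).

    `observed` is the per-frame raw detection for one key; the sticky
    transition prior removes short spurious gaps such as the
    frames-301-to-350 example above."""
    trans = np.log([[p_stay, 1 - p_stay], [1 - p_stay, p_stay]])
    emit = np.log([[p_correct, 1 - p_correct], [1 - p_correct, p_correct]])
    n = len(observed)
    dp = np.full((n, 2), -np.inf)
    back = np.zeros((n, 2), dtype=int)
    dp[0] = np.log(0.5) + emit[:, observed[0]]      # uniform initial prior
    for t in range(1, n):
        for s in (0, 1):
            scores = dp[t - 1] + trans[:, s]
            back[t, s] = int(np.argmax(scores))
            dp[t, s] = scores[back[t, s]] + emit[s, observed[t]]
    states = [int(np.argmax(dp[-1]))]
    for t in range(n - 1, 0, -1):
        states.append(int(back[t, states[-1]]))
    return states[::-1]
```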
  • In step 10, using the comprehensive estimation result Sr and the spectral peaks of the single frame, the power sum of the harmonic sequence of each key in Sr is calculated as the intensity value of that key, for use in performance evaluation, where the power sum is the sum of the squared amplitudes; and, according to the comprehensive estimation result Sr of the current frame and the acquisition time period of the current frame, the start time and end time of each key's playing are determined, the duration of each note is calculated, and the beat is computed.
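  • A sketch of step 10 under stated assumptions (the frame length and data layout below are illustrative): the intensity of each key in Sr is the power sum over its harmonic peaks, and note start/end times follow from the frames in which the key appears in Sr:

```python
import numpy as np

def key_intensity(harmonic_bins, mag: np.ndarray) -> float:
    """Power sum (sum of squared amplitudes) over a key's harmonic peaks."""
    return float(sum(mag[k] ** 2 for k in harmonic_bins))

def note_times(sr_per_frame, frame_s: float, key):
    """(start, end) times of every stretch in which `key` appears in Sr."""
    times, start = [], None
    for i, frame_keys in enumerate(list(sr_per_frame) + [set()]):  # sentinel
        if key in frame_keys and start is None:
            start = i * frame_s
        elif key not in frame_keys and start is not None:
            times.append((start, i * frame_s))   # duration = end - start
            start = None
    return times
```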
  • the present invention evaluates the performance through a performance evaluation module.
  • the evaluation indicators include but are not limited to information such as pitch, beat, rhythm, intensity, pedal, and playing feeling. Each evaluation index is described in detail below.
  • Pitch: when the user plays from a specific score, pitch is judged by comparison with the known score; the result can be displayed on the score in real time, drawn as a curve, or compiled into a report after playing, with statistics such as the overall accuracy rate and the per-section accuracy rate, and with skilled intervals (where accuracy stays high) and unskilled intervals (where accuracy stays low) identified. If the user plays freely, pitch is not scored. Considering that the user may play many wrong pitches, the dynamic time warping (DTW) algorithm is used to dynamically align the played pitches with the pitches of the score during comparison, yielding accuracy statistics for each segment or for the whole piece.
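  • A compact sketch of the DTW alignment mentioned above between the sequence of played pitches and the score's pitch sequence; the unit substitution cost is an assumption:

```python
import numpy as np

def dtw_distance(played, score) -> float:
    """Dynamic-time-warping distance between two pitch sequences
    (e.g. MIDI note numbers); lower means closer tracking of the score."""
    n, m = len(played), len(score)
    dp = np.full((n + 1, m + 1), np.inf)
    dp[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0.0 if played[i - 1] == score[j - 1] else 1.0
            dp[i, j] = cost + min(dp[i - 1, j], dp[i, j - 1], dp[i - 1, j - 1])
    return float(dp[n, m])
```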
  • Beat refers to the total length of notes in each measure in the score. For example, if a song is 4/4 time, it means that the song takes quarter notes as one beat, and each measure has four quarter notes.
  • Judging the beat is mainly a matter of judging the onset time of each note; the differences between the onset times of adjacent notes express the current beat.
  • Rhythm: describes the distribution of playing intensity over a long period at the musical level.
  • Intensity: expressed as the strength of playing; the intensity markings in a score range from extremely weak to extremely strong. There are no absolute evaluation benchmarks for these strengths, only relative ones, set according to the specific application scenario in actual use.
  • Pedal: the correct use of the sustain pedal and the damper pedal.
  • Playing emotion: pitch, beat, rhythm, intensity, pedal, and other information are used to comprehensively estimate the emotion of a piece over a period of time, such as strong, gentle, intense, lively, or sad.
  • The present invention uses a computer storage and management module to store all the collected audio data, the variables of the computation process, the calibration parameters, and so on, and uses the human-computer interaction module to implement offline calibration, display the evaluation indexes of the user's playing, display the score, play back the sound, etc.
  • In summary, the advantages of the present invention are: noise and speech are separated from the musical tones, effectively reducing their influence on the accuracy of multi-fundamental-frequency estimation and intensity estimation; the spectra of multiple keys are calibrated with low user perception, capturing changes in the piano's sound characteristics in time; the calibrated parameters allow a comprehensive evaluation of the tuning of the user's piano; by taking detuning into account and updating the harmonic parameters of the keys, spectral peak searching becomes more accurate and direct, the harmonic-sequence power calculation more exact, and the intensity calculation more accurate; the harmonic-overlap problem is considered, reducing octave-key judgment errors; and the intensity evaluation index built on the relationship between intensity and harmonic-sequence power is universal, conceptually clear, simple in principle, computationally light, and highly accurate.
  • Through the audio information, the played keys can be identified, together with the corresponding user information such as pitch, beat, rhythm, intensity, pedal, and playing emotion; based on the harmonic structure and spectral parameters of the keys, the fundamental frequencies of the keys can be obtained, and using this information in piano teaching can effectively improve the effect of piano training.
  • piano teaching involves the judgment and evaluation of the practitioner's piano performance.
  • the judgment and evaluation of the practitioner's piano performance include at least two aspects: notes and hand movements.
  • The notes can involve factors such as the fundamental and overtone spectrum, intensity, speed, and rhythm.
  • Whether a played note is correct can be judged by converting the audio signal captured during playing into audio data and comparing it with standard audio data.
  • Here, "standard audio data" and "standard hand data" refer to the "reference audio data" and "reference hand data" used for comparison with the user audio data and user hand data in order to judge the user's playing results.
  • FIG. 2 shows a smart piano training method according to an embodiment of the present invention.
  • the method includes the following steps:
  • S310 Acquire audio information and video information of the user playing the piano.
  • audio and video information of the user playing the piano may be collected through audio and video collection devices (eg, a microphone and a camera, or a camera with a microphone).
  • The collected audio information can be preprocessed by removing silent segments and denoising, to avoid external interference and improve scoring accuracy.
  • the MIDI audio digital signal of the user playing the piano may be collected by connecting to a MIDI interface (Musical Instrument Digital Interface) on the electronic piano.
  • MIDI audio digital signals are binary data output by an electronic piano that represent the played notes and can be recognized and processed by a computer.
  • the user's hand movement of playing the piano can be captured by a camera or other device with image capture function.
  • the hand movements of both hands can be captured by the same camera, or the hand movements of the left and right hands can be captured separately from different angles by multiple cameras. In this case, the video information of the left and right hands can be spliced.
  • a pressure sensor installed under the keys may also collect the touch strength of the user playing the piano, so as to combine with the above audio and video information to jointly determine the score of the user playing the piano.
  • the audio database contains audio data of a large number of standard piano playing pieces (for example, pieces played by piano teachers or professionals, or pieces automatically generated by artificial intelligence based on musical scores).
  • Standard audio data can be extracted from the audio information of standard piano performance pieces according to a certain time interval, and stored in units of pieces to form an audio database.
  • the audio data in the audio database may at least include information such as track name, extraction time, musical note, fundamental frequency, and sound intensity.
  • standard audio data may be extracted from the audio information of a standard piano performance at time intervals of 10 ms or less and stored in an audio database.
  • The current Guinness World Record for the fastest pianist is 14 key presses in 1 second. Taking 20 key presses per second as an example, a single key press lasts 50 ms, so extracting audio data from the audio information of a piano performance at 10 ms intervals can cover all the notes produced by the performance.
  • FIG. 3 shows a schematic diagram of standard audio data storage in the audio database of an embodiment.
  • The audio database may include first-level and second-level tables, where the first-level table stores the basic information of the standard piano pieces, including the serial number, name, level, key, the serial number of the corresponding second-level table, and other information.
  • The second-level tables store the audio data of each track, including the extraction time, the notes extracted from the track's audio information at fixed time intervals, and their fundamental frequencies and sound intensities.
  • As shown in FIG. 3(A), several standard piano pieces are stored in the first-level table.
  • For example, piece 0001 is named "Song of Spring", level 1, in A major, and its audio data is stored in second-level table 0001;
  • track 0036 is named "Rondo", in C major, and its audio data is stored in second-level table 0036;
  • track 0180 is named "Canon", level "other", in the key of D, and its audio data is stored in second-level table 0180, and so on.
  • The second-level table 0001 stores all notes extracted at 10 ms intervals over the complete performance time of track 0001, together with the corresponding fundamental frequency, sound intensity, and other data. For example, at time "0.000" there is no note, the fundamental frequency is 0, and the sound intensity is 0; at "0.010" the note is G4, the fundamental frequency is 391 Hz, and the sound intensity is 10 dB; at "0.020" the note is still G4 at 391 Hz with a sound intensity of 15 dB; at "0.030" the note is still G4 at 391 Hz with a sound intensity of 20 dB; ...; at "0.250" the note is D4, the fundamental frequency is 293 Hz, and the sound intensity is 10; and so on.
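  • The two-level layout of FIG. 3 can be pictured as nested records; the field names below are illustrative, not the patent's schema:

```python
# First-level table: basic information of each standard piece.
first_level = {
    "0001": {"name": "Song of Spring", "level": 1, "key": "A major",
             "audio_table": "0001"},
    "0036": {"name": "Rondo", "key": "C major", "audio_table": "0036"},
}

# Second-level table 0001: audio data extracted every 10 ms.
second_level_0001 = [
    {"t": 0.000, "note": None, "f0_hz": 0,   "intensity_db": 0},
    {"t": 0.010, "note": "G4", "f0_hz": 391, "intensity_db": 10},
    {"t": 0.020, "note": "G4", "f0_hz": 391, "intensity_db": 15},
    {"t": 0.030, "note": "G4", "f0_hz": 391, "intensity_db": 20},
]
```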
  • User audio data can be extracted from the audio information at regular intervals, and the extracted user audio data can be compared with the corresponding standard audio data in the audio database to obtain the matching degree between the user audio data and the corresponding standard audio data.
  • the user audio data may be extracted from the audio information according to the first time interval.
  • the first time interval may be the same as the time interval for extracting standard audio data from the standard performance repertoire in the audio database, or may be an integer multiple of the above-mentioned time interval.
  • the extracted user audio data may at least include information such as extraction time, notes and their fundamental frequencies and sound intensity.
  • The user audio data can be matched to the standard audio data in the audio database through its extraction time information. Taking the audio database of FIG. 3 as an example, when the user plays "Song of Spring", the played keys can be identified from the collected audio information and the other user audio data obtained at a time interval of 30 ms.
  • If, for instance, the user audio data extracted at 0.030 s indicates that the played note is G4 with a fundamental frequency of 391 Hz and a sound intensity of 15, the standard audio data corresponding to this user audio data is the audio data at 0.030 s in second-level table 0001 of track 0001 in the first-level table of the audio database (including the note and its fundamental frequency and sound intensity).
  • Before playing, the user can select the piece to be played from the database; alternatively, after the user starts to play, the system can intelligently identify the piece being played and query its standard audio data in the audio database, and then compare the user audio data with the corresponding standard audio data to obtain their matching degree.
  • Preferably, the audio database may store standard audio data of the same piece in different genres. After the user starts to play, the system intelligently identifies the piece and genre being played, queries the standard audio data corresponding to that piece and genre in the audio database, and then compares the user audio data with the corresponding standard audio data to obtain their matching degree.
  • different weight values may be set for different information in the audio data, so as to calculate the degree of matching between the user audio data and the corresponding standard audio data.
  • For example, the weight of a note's fundamental frequency can be set greater than the weight of its sound intensity, so that the fundamental frequency information accounts for a larger share of the matching-degree calculation.
  • An error-redundancy interval can also be set for the standard audio data in the audio database; for example, an interval of ±10 Hz can be set for the fundamental frequency of a note. When the user audio data falls within this interval, the fundamental frequency information of the notes in the user audio data can be considered basically consistent with that of the corresponding standard audio data.
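  • A sketch of a weighted matching degree with error-redundancy intervals, following the ±10 Hz example above; the weight values and the intensity tolerance are assumptions:

```python
def match_degree(user: dict, ref: dict,
                 w_f0: float = 0.7, w_int: float = 0.3,
                 f0_tol_hz: float = 10.0, int_tol_db: float = 5.0) -> float:
    """Weighted match between one user audio record and its reference.

    A field scores 1 when it falls inside its error-redundancy interval,
    0 otherwise; the fundamental frequency carries the larger weight."""
    f0_ok = abs(user["f0_hz"] - ref["f0_hz"]) <= f0_tol_hz
    int_ok = abs(user["intensity_db"] - ref["intensity_db"]) <= int_tol_db
    return w_f0 * float(f0_ok) + w_int * float(int_ok)
```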
  • the audio database may be further subdivided into a monophonic database and a repertoire database, wherein the monophonic database stores standard audio data corresponding to a single note, and the repertoire database stores a large number of standard piano playing repertoires corresponding to standard audio data.
  • the judgment of hand movements is meaningful only when the notes are played correctly or substantially correctly. Therefore, according to an embodiment of the present invention, it is determined whether the user's hand image corresponding to the user's audio data needs to be intercepted from the video information based on the degree of matching between the user's audio data and the corresponding standard audio data.
  • an audio match threshold may be set.
  • the audio matching degree threshold can be set by the user, by default by the system, or intelligently set by the system after counting the playing levels of other piano players on the same piece in the networked state.
  • when the matching degree between the extracted user audio data and the corresponding standard audio data is greater than or equal to the audio matching degree threshold, the notes played by the user are correct or essentially correct, and the user hand image corresponding to that audio data can then be captured from the video information to judge the user's hand movement; when the matching degree is less than the threshold, the played note is wrong, and no hand-movement judgment is necessary.
  • video broadly refers to the technologies by which a series of static images is captured, recorded, processed, stored, transmitted, and reproduced in the form of electrical signals; a video is thus a series of images arranged in chronological order. When the images change at more than 24 frames per second, by the principle of persistence of vision the human eye cannot distinguish individual static images and perceives a smooth, continuous picture. The user's hand image can therefore be captured from the video information at regular intervals.
  • the user's hand image can be matched to the user audio data through its capture-time information.
  • an image of the user's hand may be captured from the video information at a second time interval.
  • the second time interval may be the same as the time interval at which user audio data is extracted from the audio information, or an integer multiple of it. When the time information coincides, the hand image captured at that moment is the hand image corresponding to the generation of that user audio data. For example, if user audio data with note G4, sound intensity 15, and pitch 1750 is extracted at 0.030 s, the user hand image captured from the video at 0.030 s is the hand image corresponding to that audio data.
  • the second time interval is not greater than 30 ms.
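The sketch below illustrates the timestamp pairing implied above, assuming the 30 ms figures from the text; `frame_index_for` is a hypothetical helper, not part of the described system.

```python
AUDIO_STEP_MS = 30   # first time interval, from the text
VIDEO_STEP_MS = 30   # second time interval; could equally be 60, 90, ... (integer multiple)

def frame_index_for(audio_index):
    """Video frame whose capture time coincides with this audio sample, else None."""
    t_ms = audio_index * AUDIO_STEP_MS
    return t_ms // VIDEO_STEP_MS if t_ms % VIDEO_STEP_MS == 0 else None

print(frame_index_for(1))   # 1 -> the hand image at 0.030 s pairs with the 0.030 s audio data
```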
  • to reduce the computation of the hand model, the captured user hand images may be screened so that only those containing the piano-key area are selected for identifying the user hand data.
  • in addition to basing this decision on the matching degree between the user audio data and the corresponding standard audio data, the user can also set personally whether to capture hand images and identify hand data.
  • S340 Identify the user hand data in the user hand image through the hand model and compare it with the corresponding standard hand data in the hand database to obtain the matching degree between the user hand data and the corresponding standard hand data.
  • the hand model takes the hand image as input data and the hand data in the hand image as output data, and is obtained by training the neural network.
  • the hand model can identify the user hand data in the user hand image, such as the coordinate positions or relative positions of hand joint points, including those of 21 key joint points in each of the left and right hands, or the coordinate or relative positions of more or fewer joint points.
  • the user hand data may further include the coordinate position or relative position of the wrist.
  • the hand model may be a trained recurrent neural network or a long short-term memory (LSTM) network.
  • when the user hand image contains the piano-key area, the key area can first be detected against the background of the field of view and a key candidate box drawn; hand-keypoint regression detection is then performed by the hand model within the candidate box to extract the user hand data, as sketched below.
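The following sketch shows one plausible shape of that two-stage flow. `detect_key_region` and `HandModel` are hypothetical stand-ins for the trained detectors; only the control flow (detect box, crop, regress, map back) reflects the text.

```python
import numpy as np

def detect_key_region(frame: np.ndarray):
    """Hypothetical keyboard detector: returns the key candidate box (x, y, w, h)."""
    h, w = frame.shape[:2]
    return 0, int(h * 0.6), w, int(h * 0.4)      # dummy: assume keys fill the lower part

class HandModel:
    """Hypothetical trained network mapping an image crop to 21 joints per hand."""
    def predict(self, crop: np.ndarray) -> np.ndarray:
        return np.zeros((2, 21, 2))              # (hands, joints, xy) placeholder

def extract_hand_data(frame: np.ndarray, model: HandModel) -> np.ndarray:
    x, y, w, h = detect_key_region(frame)        # 1. draw the key candidate box
    crop = frame[y:y + h, x:x + w]               # 2. restrict regression to the box
    joints = model.predict(crop)                 # 3. hand-keypoint regression
    joints[..., 0] += x                          # 4. map back to full-frame coordinates
    joints[..., 1] += y
    return joints

print(extract_hand_data(np.zeros((480, 640, 3)), HandModel()).shape)   # (2, 21, 2)
```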
  • the hand database contains a large amount of standard hand data.
  • standard hand images can be captured at fixed time intervals from the standard performance videos of piano pieces; the hand model then identifies the standard hand data in those images, which is stored piece by piece to form the hand database.
  • the standard hand images can be captured from the standard performance videos at the same time interval used to extract the standard audio data in the audio database, or at an integer multiple of that interval; the hand model then identifies the standard hand data in the standard hand images and stores it in the hand database.
  • standard hand data may include information such as the time (i.e., the capture time) and the coordinate positions or relative positions of the key joint points of the left and right hands.
  • the user's hand data may correspond to the standard hand data in the hand database through its time information.
  • FIG. 4 shows a schematic diagram of standard hand data storage in the hand database of an embodiment.
  • the hand database may include a first-level table and a second-level table: the first-level table (shown in Figure 4(A)) stores the basic information of standard piano pieces, including the serial number, name, grade, key, and the serial number of the hand-data secondary table; the secondary table (shown in Figure 4(B)) stores the hand data of each piece, including the capture time and data such as the relative positions of the 21 joint points of the left and right hands in the hand images captured from the piece's video.
  • the audio database may be associated with the hand database: standard audio data and standard hand data with consistent time information are stored in association, piece by piece, to form a comprehensive database.
  • when the time interval for extracting standard audio data from the standard performance differs from the time interval for capturing standard hand images from it, or when the audio-extraction interval is an integer multiple of the image-capture interval, only the standard hand data whose time information matches that of the standard audio data is stored.
  • FIG. 5 shows a schematic diagram of storage of the integrated database of one embodiment.
  • the comprehensive database may include a first-level table and a second-level table: the first-level table (shown in Figure 5(A)) stores the basic information of standard piano pieces, including the serial number, name, grade, key, and the serial number of the comprehensive-data secondary table; the secondary table (shown in Figure 5(B)) stores the audio data and hand data of each piece.
  • an error redundancy interval may also be set for the standard hand data in the hand database; when the user hand data falls within this interval, it can be considered essentially consistent with the standard hand data.
  • a hand data matching degree threshold may be set.
  • the hand data matching degree threshold can be set by the user, by default by the system, or intelligently set by the system after counting the performance levels of other piano players on the same piece in the networked state.
  • when the matching degree between the user hand data and the standard hand data is greater than or equal to the hand-data matching degree threshold, the hand movement played by the user is correct or essentially correct; when the matching degree between the extracted user hand data and the corresponding standard hand data is less than the threshold, the hand movement is wrong. One plausible computation is sketched below.
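One plausible way to turn joint positions into such a matching degree follows; the per-joint tolerance, the scoring rule, and the 0.8 threshold are assumptions, while the 21-joints-per-hand layout comes from the text.

```python
import numpy as np

JOINT_TOL = 0.05    # per-joint error redundancy interval, in normalized image units
THRESHOLD = 0.8     # hand data matching degree threshold

def hand_match_degree(user_joints: np.ndarray, std_joints: np.ndarray) -> float:
    """Inputs have shape (2, 21, 2): both hands, 21 joints, (x, y) positions."""
    dist = np.linalg.norm(user_joints - std_joints, axis=-1)    # per-joint error
    return float((dist <= JOINT_TOL).mean())                    # share of joints "basically consistent"

std = np.zeros((2, 21, 2))
user = std + 0.03            # each joint off by ~0.042, still inside the tolerance
m = hand_match_degree(user, std)
print(m, "correct" if m >= THRESHOLD else "wrong")   # 1.0 correct
```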
  • when a hand movement is judged wrong, the user hand image corresponding to the user hand data can be saved automatically, making it convenient for the user to review.
  • wrong hand movements can also be displayed to the user, for example by generating a virtual hand outline through animation rendering: when the fingering is wrong, the offending finger is shown; when the hand shape is wrong, the offending hand area (e.g., palm or fingertips) is shown.
  • the matching degree between the user audio data and the corresponding standard audio data, together with the matching degree between the user hand data and the corresponding standard hand data, allows the user's playing level to be assessed comprehensively from both notes and hand movements.
  • the degree of matching between the user audio data and the corresponding standard audio data and the degree of matching between the user's hand data and the corresponding standard hand data may be set as weights in determining the user's piano performance score.
  • scoring rules can thus be personalized for different users' playing habits. For example, if a user plays the notes accurately but often errs in hand movements, a larger weight can be set for the matching degree between the user hand data and the corresponding standard hand data, so that feedback emphasizes the user's hand movements during playing.
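A minimal sketch of such per-user weighting, assuming an illustrative 0.4/0.6 split for a user whose notes are reliable but whose hand movements often err:

```python
def performance_score(audio_match, hand_match, w_audio=0.4, w_hand=0.6):
    """Weighted score; a larger w_hand emphasizes feedback on hand movements."""
    assert abs(w_audio + w_hand - 1.0) < 1e-9
    return 100.0 * (w_audio * audio_match + w_hand * hand_match)

print(performance_score(0.95, 0.70))   # 80.0 -- hand errors dominate the feedback
```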
  • standard key-touch force data may also be stored in the database; the collected user key-touch force data can be compared with it to obtain the matching degree between the user's touch force and the standard touch force, which is combined with the audio and hand matching degrees to assess the user's playing level comprehensively.
  • the results of the user's piano playing may be fed back to the user in a delayed manner.
  • at the end of the performance, a comprehensive score for the piece can be displayed; the user's specific audio errors and hand-movement errors during playing can also be recorded in detail and compiled into a score report, so that the user can practice or correct mistakes in a targeted way; the current score or report can further be compared with the user's previous records or with other users' records to evaluate the user's current playing level comprehensively.
  • the user audio data and the user hand data may also be compared at the same time, and the playing result fed back to the user based on the matching degree between the user audio data and the corresponding standard audio data and the matching degree between the user hand data and the corresponding standard hand data.
  • the user audio data and the user's hand image can be extracted and analyzed in real time while acquiring the audio information and video information of the user playing the piano.
  • after the user finishes playing a piece, the overall playing result of that piece may be fed back to the user.
  • FIG. 6 shows a smart piano training method according to an embodiment of the present invention. As shown in Figure 6, the method includes the following steps:
  • S710 Acquire audio information and video information of the user playing the piano.
  • S720 Identify the played key from the audio information, obtain user audio data based on the key, and compare it with the corresponding standard audio data in the audio database to obtain the matching degree between the user audio data and the corresponding standard audio data.
  • Steps S710-S720 are similar to the above-mentioned steps S310-S320, and are not repeated here.
  • in step S730, the matching degree between the user audio data and the corresponding standard audio data is compared with a specified threshold N1: when the matching degree is greater than or equal to N1, step S740 is executed; when it is less than N1, step S760 is executed.
  • S740 Capture the user hand image corresponding to the user audio data from the video information.
  • S750 Identify the user hand data in the user hand image through the hand model and compare it with the corresponding standard hand data in the hand database to obtain the matching degree between the user hand data and the corresponding standard hand data.
  • in step S760 it is judged whether the user's piano playing has ended; if it has ended, step S770 is executed; if not, steps S710-S760 are repeated.
  • S770 Feed back the playing result to the user based on the matching degrees of all user audio data and of all user hand data produced during the performance.
  • when the matching degree between the user audio data and the corresponding standard audio data is less than a specified threshold, the user may be prompted with the key information corresponding to the standard audio data, for example by generating a virtual keyboard through animation rendering and indicating the correct keys; and/or when the matching degree between the user hand data and the corresponding standard hand data is less than a specified threshold, the user may be prompted with the hand action corresponding to the standard hand data, for example by generating a virtual hand outline through animation rendering and indicating the correct hand movements. A control-flow sketch of this practice loop follows.
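The sketch below walks the S710-S770 loop of Figure 6; the canned matching degrees, thresholds, and the `practice_session` helper are toy stand-ins so the flow can run, not the real units.

```python
STEPS = [            # canned (audio_match, hand_match) pairs for three time steps
    (0.92, 0.85),
    (0.40, None),    # wrong note: hand judgment is skipped (S730 -> S760)
    (0.88, 0.60),
]

def practice_session(n1=0.8, n2=0.8):
    audio_log, hand_log = [], []
    for audio_match, hand_match in STEPS:        # S710/S720 for each time step
        audio_log.append(audio_match)
        if audio_match >= n1:                    # S730
            hand_log.append(hand_match)          # S740/S750
            if hand_match < n2:
                pass                             # a Figure-7-style prompt would go here
        # otherwise hand judgment is skipped; S760 loops until the piece ends
    avg = lambda xs: sum(xs) / len(xs) if xs else 0.0
    # S770: feed back the averaged matching degrees over the whole performance
    print(f"S770 feedback: audio {avg(audio_log):.3f}, hand {avg(hand_log):.3f}")

practice_session()
```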
  • FIG. 7 shows a smart piano training method according to an embodiment of the present invention. As shown in Figure 7, the method includes the following steps:
  • S810 Acquire audio information and video information of the user playing the piano.
  • S820 Identify the played key from the audio information, obtain user audio data based on the key, and compare it with the corresponding standard audio data in the audio database to obtain the matching degree between the user audio data and the corresponding standard audio data.
  • in step S830, the matching degree between the user audio data and the corresponding standard audio data is compared with a specified threshold N1: when the matching degree is greater than or equal to N1, step S840 is executed; when it is less than N1, the user is prompted with the key information corresponding to the standard audio data and step S870 is executed.
  • S840 Capture the user hand image corresponding to the user audio data from the video information.
  • S850 Identify the user hand data in the user hand image through the hand model and compare it with the corresponding standard hand data in the hand database to obtain the matching degree between the user hand data and the corresponding standard hand data.
  • S860 Compare the matching degree between the user hand data and the corresponding standard hand data with a specified threshold N2; when it is less than N2, prompt the user with the hand action corresponding to the standard hand data.
  • in step S870 it is judged whether the user's piano playing has ended; if it has ended, step S880 is executed; if not, steps S810-S870 are repeated.
  • S880 Feed back the playing result to the user based on the matching degrees of all user audio data and of all user hand data produced during the performance.
  • by using the hand model to identify the hand data in user hand images accurately, and by judging the user's playing result on the combined basis of the audio data and the hand data produced while playing, the present invention enables users to obtain effective feedback on notes and hand movements during practice without the guidance of professional teachers, which helps users find and correct mistakes and improves practice efficiency.
  • by prompting the user in real time with the correct key information and/or hand movements, it can also help the user obtain correct demonstration and guidance in time, which aids self-study of the piano.
  • the present invention also provides an intelligent piano training system implementing the above method.
  • the system includes: an audio and video acquisition unit for acquiring the audio information and video information of the user's piano playing; a data extraction unit for extracting user audio data from the audio information and capturing from the video information the user hand image corresponding to the user audio data, the data extraction unit being configured with the system of the present invention for recognizing played keys based on audio; a data identification unit for identifying the user hand data in the user hand image through the hand model, where the hand model takes the hand image as input data and the hand data in the hand image as output data and is obtained by training a neural network; a data matching unit for comparing the user audio data with the corresponding standard audio data in the audio database to obtain their matching degree, and comparing the user hand data with the corresponding standard hand data in the hand database to obtain their matching degree; and a user interaction unit for feeding back the playing result to the user based on the matching degree between the user audio data and the corresponding standard audio data and the matching degree between the user hand data and the corresponding standard hand data.
  • the data extraction unit is configured with a system for recognizing played keys based on audio, which identifies the played keys from the audio information; user audio information such as note, fundamental frequency, and sound intensity is obtained on this basis.
  • the user interaction unit in the smart piano training system is further configured to: prompt the user for the key information corresponding to the corresponding standard audio data, and prompt the user for the hand motion corresponding to the corresponding standard hand data.
  • the intelligent piano training system further includes a control unit for coordinating the audio and video acquisition unit, the data extraction unit, the data identification unit, the data matching unit, and the user interaction unit; it determines, based on the matching degree between the user audio data and the corresponding standard audio data, whether to activate the data identification unit, judges, based on the matching degree between the user audio data and the corresponding standard audio data or between the user hand data and the corresponding standard hand data, whether the user has finished playing the piece, and determines whether to activate the audio and video acquisition unit or the user interaction unit.
  • FIG. 8 shows a smart piano practice system according to an embodiment of the present invention.
  • the intelligent piano practice system 900 includes an audio and video acquisition unit 901 , a data extraction unit 902 , a data identification unit 903 , a data matching unit 904 and a user interaction unit 905 .
  • the audio and video acquisition unit 901 includes a sound acquisition device 9011 and a video acquisition device 9012, and is used for acquiring audio information and video information generated when a user plays the piano.
  • the sound collection device 9011 may be, for example, one or more microphones installed near the piano.
  • the sound collection device 9011 can be connected to the data extraction unit 902 in a wired or wireless manner, and sends the acquired audio information to the data extraction unit 902 .
  • the video capture device 9012 may be a device with a photography or image capture function, such as a monocular camera, a binocular camera or a depth camera.
  • the video capture device 9012 can be fixed around the piano to capture hand video information at a fixed point.
  • the video capture device 9012 can also be connected to the data extraction unit 902 in a wired or wireless manner, and sends the acquired video information to the data extraction unit 902 .
  • the sound collecting device 9011 and the video collecting device 9012 can be integrated into one device to simultaneously acquire audio information and video information when the user plays the piano.
  • the data extraction unit 902 includes an audio data extraction unit 9021 and an image data capture unit 9022: the audio data extraction unit 9021 is configured with a system for recognizing played keys based on audio, identifies the played keys from the audio information, acquires user audio data on that basis, and sends it to the data matching unit 904; the image data capture unit 9022 captures from the video information the user hand image corresponding to the user audio data and sends it to the data identification unit 903.
  • the data identification unit 903 includes a hand model 9031 and is connected to the image data interception unit 9022 for identifying the user's hand data in the user's hand image through the hand model and sending it to the data matching unit 904 .
  • the hand model 9031 takes the hand image as input data and the hand data in the hand image as output data, and is obtained by training a neural network.
  • the data matching unit 904 includes an audio data matching unit 9041 and a hand data matching unit 9042.
  • the audio data matching unit 9041 includes an audio database and compares the user audio data from the data extraction unit 902 with the corresponding standard audio data in the audio database to obtain the matching degree between the user audio data and the corresponding standard audio data, which it sends to the user interaction unit 905 and the control unit 906.
  • the hand data matching unit 9042 includes a hand database and compares the user hand data from the data identification unit 903 with the corresponding standard hand data in the hand database to obtain the matching degree between the user hand data and the corresponding standard hand data, which it sends to the user interaction unit 905.
  • the audio database and the hand database can be stored as built-in files in the audio data matching unit 9041 and the hand data matching unit 9042, or connected to those units via API interfaces.
  • the user interaction unit 905 includes a processor 9051 and a display device 9052. The processor 9051 receives the matching degree between the user audio data and the corresponding standard audio data from the audio data matching unit 9041 and the matching degree between the user hand data and the corresponding standard hand data from the hand data matching unit 9042, and determines the score of the user's piano performance based on both matching degrees.
  • the display device 9052 can be an electronic device with a display function, such as a smartphone, an iPad, smart glasses, a liquid-crystal display, or an electronic-ink screen, for displaying the scoring result of the processor 9051.
  • the processor 9051 can determine the correct key information based on the matching degree between the user audio data and the corresponding standard audio data and display it on the display device 9052, for example by generating a virtual keyboard through animation rendering and indicating the correct piano keys.
  • the processor 9051 can also construct the correct hand motion based on the matching degree between the user hand data and the corresponding standard hand data and display it on the display device 9052: for example, it can generate a virtual hand contour and indicate the correct hand movements, or build a user-specific skeletal system from the user's hand information, generate a personalized virtual hand contour by skinning and animation rendering, and drive that contour with the standard hand data to indicate the correct hand movements.
  • the smart piano practice system may further include sensors installed under the keys to collect the user's key-touch force data while playing.
  • the present invention may be a system, method and/or computer program product.
  • the computer program product may include a computer-readable storage medium having computer-readable program instructions loaded thereon for causing a processor to implement various aspects of the present invention.
  • a computer-readable storage medium may be a tangible device that retains and stores instructions for use by the instruction execution device.
  • Computer-readable storage media may include, for example but not limited to, electrical storage devices, magnetic storage devices, optical storage devices, electromagnetic storage devices, semiconductor storage devices, or any suitable combination of the foregoing.
  • A non-exhaustive list of more specific examples of computer-readable storage media includes: portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random access memory (SRAM), portable compact disc read-only memory (CD-ROM), digital versatile discs (DVD), memory sticks, floppy disks, mechanically encoded devices such as punch cards or raised structures in grooves on which instructions are stored, and any suitable combination of the above.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Electrophonic Musical Instruments (AREA)

Abstract

A method for identifying played piano keys based on audio, comprising acquiring audio of piano playing and framing the audio to obtain a plurality of frame signals, and performing the following steps on each frame signal: T1, performing frequency-domain analysis on the current frame signal to obtain its spectrum; T2, performing noise-level estimation on the spectrum obtained in step T1 to obtain a plurality of tonal spectral peaks forming a tonal-peak set; T3, searching, based on the tonal-peak set obtained from the noise-level estimation and the harmonic frequencies formed by the spectral parameter sets of all keys, for a first candidate key set of possibly played keys, where the first candidate key set is the set of keys whose harmonic frequencies coincide with one or more tonal peaks in the tonal-peak set.

Description

Method and system for identifying played piano keys based on audio - Technical Field
The present invention relates to the field of intelligent pianos, specifically to piano tuning, performance evaluation, and piano teaching, and more specifically to a method and system for identifying played piano keys based on audio.
Background Art
At present, evaluation of piano tuning quality and of piano playing is mostly based on personal experience. For example, after a tuner finishes tuning a piano, the user plays each key and judges the tuning effect from personal experience; since experience varies, so do the judgments. Piano teaching and learning frequently require evaluating a player's performance. In the prior art such evaluation is mostly done by ear: the listener judges the quality of playing from the notes played and cannot accurately locate quality levels or concrete indicators. In this process, insufficient experience or poor tuning directly affects the judgment of the played notes.
Specifically, when a piano key is played, the hammer strikes one or more corresponding strings, whose vibration can be simply modeled as the forced vibration of a rigid material. The strike excites waves of different frequencies in the string with a clear proportional relationship; for example, playing the A4 key of a standard piano excites waves at 440 Hz, 880 Hz, 1320 Hz, 1760 Hz, and so on. The minimum of these vibration frequencies is defined as the fundamental frequency of the key. Vibrations at integer multiples of the fundamental are called its overtones or harmonics; together they form a harmonic (overtone) series, producing a distinct harmonic structure. Each piano key has its own fundamental frequency and harmonic structure. However, integer-multiple harmonics presuppose that the string is modeled as a simple rigid vibrator, whereas a real string consists of metal wires of varying thickness fixed at both ends to the soundboard of the piano body. Because the ends are fixed, the effective vibrating length is shortened, so the harmonic frequencies do not stand in a strict integer relation to the fundamental, and the harmonics of different keys therefore overlap. Moreover, since piano tuning lacks strict qualitative and quantitative baselines, note-based estimation of played keys during evaluation becomes inaccurate. This causes frequent octave errors in note recognition: for example, when A4 is to be played but A3 and A5 are struck at the same time, the harmonic overlap may lead to the judgment that only A4 was pressed.
In addition, in the prior art, note recognition lacks noise handling, so noise has a large impact and recognition accuracy is low. Nor does the prior art model each individual piano or calibrate the model in real time, so the piano's spectral parameters are insufficiently precise.
Furthermore, manual tuning evaluation and playing evaluation are time-consuming and laborious, cannot intelligently capture the tuning state of a piano, and vary greatly between operators. The prior art also lacks a method to accurately identify the beat of a played piece, especially for pieces not in the repertoire library, where an accurate beat cannot be obtained, which is unfavorable for intelligent piano teaching.
As is well known, in piano teaching, factors such as intonation, rhythm, fingering, and hand shape are critical basics that beginners must practice repeatedly, usually under the supervision and guidance of a professional teacher. However, because time with a professional teacher is limited, beginners often practice alone, so errors are not promptly fed back and corrected and practice is inefficient. The prior art contains many piano training methods for beginners practicing on their own. Some judge the accuracy of the practiced piece from its audio data, e.g., comparing the player's audio with the correct sound data of a master performance to judge intonation, rhythm, tempo, and dynamics and thus assess the result. Others evaluate playing accuracy from video images, e.g., building standard fingering and key-order model images from images of knuckles and keys cut from piano teaching videos, then comparing the practice video against the standard models for automatic error correction and intelligent teaching. Still others consider both audio and video data, e.g., comparing the audio data and time signals with standard note data to obtain correct note data, retrieving the corresponding performance image data for visual recognition and analysis to obtain the correct hand data in the images, and computing a performance score from the correct note data, correct hand data, and standard note data. However, existing piano training methods still have shortcomings: on the one hand, practice (or assessment) methods based solely on the audio of the played piece may yield inaccurate judgments due to ambient noise; on the other hand, even when all played notes are correct, ignoring important factors such as the rhythm and tempo of the piece compromises the accuracy of the judgment.
Therefore, a more accurate and reasonable intelligent piano training method and system is urgently needed.
Summary of the Invention
The object of the present invention is therefore to overcome the above drawbacks of the prior art and to provide a method and system capable of intelligently identifying played keys from the playing audio, methods for computing the playing beat and evaluating the playing, and an intelligent piano training method and system that identifies played keys based on the foregoing method.
According to a first aspect of the invention, a method for identifying played piano keys based on audio is provided, the method comprising acquiring audio of piano playing and framing the audio to obtain a plurality of frame signals, and performing the following steps on each frame signal: T1, performing frequency-domain analysis on the current frame signal to obtain its spectrum; T2, performing noise-level estimation on the spectrum from step T1 to obtain a plurality of tonal spectral peaks forming a tonal-peak set; T3, searching, based on the tonal-peak set from the noise-level estimation and the harmonic frequencies formed by the spectral parameter sets of all keys, for a first candidate key set of possibly played keys, the first candidate key set being the set of keys whose harmonic frequencies coincide with one or more tonal peaks in the tonal-peak set.
Preferably, the method further comprises performing the following steps on each frame signal: T4, using hand-and-key images captured by a camera during the time period of the frame signal, taking the intersection of the set of keys beneath the hand positions and the first candidate key set to obtain a second candidate key set; T5, based on the harmonic series of each key in the second candidate key set, forming a third candidate key set from the keys of the second candidate key set whose harmonics coincide with one or more tonal peaks in the tonal-peak set; T6, screening the third candidate key set, retaining the keys whose octave-up or octave-down keys were played, to form a fourth candidate key set.
In some embodiments of the invention, the method further comprises: T7, taking the fourth candidate key set as the multi-fundamental-frequency estimation result of the single frame signal and applying HMM and Viterbi inter-frame smoothing of note events to it, obtaining the comprehensive estimate of the played keys for that frame signal. By considering the temporal continuity of chords and performing multi-F0 estimation over note events, the estimate comes closer to what is actually heard and the accuracy of multi-F0 estimation improves.
In some embodiments of the invention, the acquired audio is denoised and voice-separated in turn before framing. According to one embodiment, multi-microphone noise reduction or microphone-array beamforming is used to denoise the audio and remove ambient noise, and multi-microphone noise reduction or microphone-array beamforming is used to separate human speech from the musical tones; preferably, multi-microphone noise reduction removes ambient noise and microphone-array beamforming performs speech separation. According to one embodiment, multi-microphone noise reduction is implemented as follows: multiple microphones are pointed respectively in the directions to be picked up, including the piano resonance cavity and the ambient-noise direction; spectral subtraction then subtracts the noise-dominant sound spectrum from the tone-dominant sound spectrum to remove ambient noise. According to one embodiment, when microphone-array beamforming is used, the microphones are arranged as follows: taking the center of the rectangular plane of the piano keyboard's top view and the two farthest sound sources outside the rectangle as the vertices of a triangle, the microphones are placed on an arc that is tangent, at the triangle's center, to the line parallel to the segment joining the two farthest sources, and that bows around either the rectangle's center point or the two farthest sources.
In some embodiments of the invention, the key spectral parameter set is acquired or updated by offline or online calibration; the set contains the spectral parameters corresponding to each key. Each key's spectral parameters describe that key's harmonic structure:
f_k = k * (A * F0) * sqrt(1 + B * k^2)
where A and B are the key's spectral parameters, f_k is the key's k-th harmonic, and F0 is the key's fundamental frequency. Preferably, the offline calibration method comprises: L1, manually playing a key at the required force and acquiring the audio of the keystroke; L2, analyzing the audio of step L1 with silence-suppression techniques to obtain the onset and release periods of the keystroke; L3, performing frequency-domain analysis on the time-domain data between the onset and release periods obtained in step L2 to obtain the spectral harmonic structure of the key's sound, where the different spectral peaks correspond to the key's different harmonics, and computing the key's spectral parameters by linear fitting combined with f_k = k * (A * F0) * sqrt(1 + B * k^2). The online calibration method comprises: capturing key-playing audio repeatedly during the piano's use and performing frequency-domain analysis to obtain the key's true spectral harmonics, computing the key's spectral parameters from the harmonic structure f_k = k * (A * F0) * sqrt(1 + B * k^2), comparing the parameters computed from the repeatedly captured audio, and updating the key's spectral parameters if they agree.
According to a second aspect of the invention, a method for computing the playing beat based on audio is provided, comprising: A1, obtaining the set of played keys for the current frame signal using the method of the first aspect; A2, determining each key's playing start and end times from the played-key set and computing the beat therefrom.
According to a third aspect of the invention, a method for evaluating playing based on audio is provided, comprising: P1, obtaining the set of played keys for the current frame signal using the method of the first aspect; P2, determining each key's playing start and end times from the played-key set and computing the beat therefrom, and, based on the played-key set and the current frame's tonal peaks, computing the power sum of each key's harmonic series in the played-key set as that key's intensity value; P3, evaluating, against preset indicators, the pitch, beat, rhythm, dynamics, pedal, and expressive information of the playing.
According to a fourth aspect of the invention, a system for identifying played piano keys based on audio is provided, comprising: a sound acquisition module for capturing audio of piano playing; and a computer storage and management module for storing the captured audio data and performing the following operations: B1, performing frequency-domain analysis on a frame signal to obtain the signal's spectrum; B2, performing noise-level estimation on the spectrum from step B1 to delineate a plurality of tonal spectral peaks in the spectrum, forming a tonal-peak set; B3, searching comprehensively, based on the tonal-peak set and the harmonic frequencies formed by the spectral parameter sets of all keys, for a first candidate key set of possibly played keys, the first candidate key set being the set of keys whose harmonics coincide with one or more tonal peaks in the tonal-peak set.
In some embodiments of the invention, the computer storage and management module is further configured to: B4, prune the first candidate key set using the set of keys beneath the hand positions obtained from the hand-and-key images captured by a camera during the current frame's time period, obtaining a second candidate key set, the second key set being the intersection of the under-hand key set for the current frame's period and the first candidate key set; B5, based on the harmonic series of each key in the second candidate key set, form a third candidate key set from the keys of the second key set whose harmonics coincide with one or more tonal peaks in the tonal-peak set; B6, screen the third candidate key set, retaining the keys whose octave-up or octave-down keys were played, to form a fourth candidate key set.
Preferably, the computer storage and management module is further configured to: B7, take the fourth candidate key set as the multi-fundamental-frequency estimation result of the single frame signal and apply HMM and Viterbi inter-frame smoothing of note events to it, obtaining the comprehensive estimate of the played keys for that frame signal.
In some embodiments of the invention, the system further comprises: a noise suppression module for denoising the audio with multi-microphone noise reduction to remove ambient noise; and a source separation module for separating speech in the audio with microphone-array beamforming to separate human speech from the audio.
In some embodiments of the invention, the system further comprises: a spectrum calibration module for acquiring or updating the key spectral parameter set by offline or online calibration.
In some embodiments of the invention, the system further comprises: a playing evaluation module for evaluating, against preset indicators, the pitch, beat, rhythm, dynamics, pedal, and expressive information of the playing.
In some embodiments of the invention, the system further comprises: a human-computer interaction module for interaction between the user and the system.
According to a fifth aspect of the invention, an intelligent piano training method is provided, comprising: acquiring audio and video information of the user playing the piano; extracting user audio data from the audio information and comparing it with the corresponding reference audio data stored in an audio database to obtain the matching degree between the user audio data and the corresponding reference audio data; capturing from the video information the user hand image corresponding to the user audio data, identifying the user hand data in the user hand image through a hand model, and comparing it with the corresponding correct hand data stored in a hand database to obtain the matching degree between the user hand data and the corresponding reference hand data, where the hand model takes a hand image as input data and the hand data in that image as output data and is obtained by training a neural network; and feeding back the playing result to the user based on the matching degree between the user audio data and the corresponding reference audio data and the matching degree between the user hand data and the corresponding reference hand data. The user audio data includes the extraction time, note, fundamental frequency, and sound intensity; the method of the first aspect of the invention identifies the played keys in the audio information, from which user audio information such as note, fundamental frequency, and sound intensity is obtained.
According to a sixth aspect of the invention, an intelligent piano training system is provided, comprising: an audio and video acquisition unit for acquiring the audio and video information of the user's piano playing; a data extraction unit for extracting user audio data from the audio information and capturing from the video information the user hand image corresponding to the user audio data; a data identification unit for identifying the user hand data in the user hand image through a hand model, where the hand model takes a hand image as input data and the hand data in that image as output data and is obtained by training a neural network; a data matching unit for comparing the user audio data with the corresponding reference audio data in an audio database to obtain their matching degree, and comparing the user hand data with the corresponding reference hand data in a hand database to obtain their matching degree; and a user interaction unit for feeding back the playing result to the user based on the two matching degrees. The data extraction unit is configured with the system of the fourth aspect of the invention for identifying played keys based on audio, which identifies the played keys in the audio information and obtains user audio data such as note, fundamental frequency, and sound intensity on that basis.
Compared with the prior art, the advantages of the invention are: ambient noise and human speech are separated from the musical tones, effectively reducing their impact on the accuracy of multi-F0 and intensity estimation; multiple key spectra are calibrated simultaneously with low user awareness, promptly capturing changes in the piano's sound characteristics; the calibrated parameters allow a comprehensive evaluation of the user's piano tuning; the method suits traditional pianos without installing hardware such as pressure or distance sensors on each key, so a traditional piano can also become an intelligent piano; the user may play freely rather than being limited to pieces in the repertoire library; and, by accurately identifying the played keys with the method of the invention and thereby obtaining accurate note, fundamental-frequency, and intensity data, the accuracy of piano training and hence the effect of piano teaching can be improved.
Brief Description of the Drawings
Embodiments of the invention are further described below with reference to the accompanying drawings, in which:
Figure 1 is a schematic diagram of the microphone arrangement for the microphone-array beamforming technique according to an embodiment of the invention;
Figure 2 is an intelligent piano training method according to an embodiment of the invention;
Figure 3 is a schematic diagram of standard audio data storage in the audio database according to an embodiment of the invention;
Figure 4 is a schematic diagram of standard hand data storage in the hand database according to an embodiment of the invention;
Figure 5 is a storage schematic of the comprehensive database according to an embodiment of the invention;
Figure 6 is an intelligent piano training method according to an embodiment of the invention;
Figure 7 is an intelligent piano training method according to an embodiment of the invention;
Figure 8 is an intelligent piano training system according to an embodiment of the invention.
Detailed Description of Embodiments
To make the objects, technical solutions, and advantages of the invention clearer, the invention is described in further detail below through specific embodiments with reference to the accompanying drawings. It should be understood that the specific embodiments described here are intended only to explain the invention, not to limit it.
First, the difficulties of this technology are introduced. The core of the invention's audio-based identification of played keys is the multi-fundamental-frequency (multi-F0) estimation task: estimating accurate fundamental-frequency values, their number, and their onset periods so as to accurately determine which keys are played within a single frame. Achieving this is hard in the following respects:
● Resolution limits: time-frequency analysis of audio is constrained by time resolution and frequency resolution. Frequency-domain analysis must load data over a certain time span; given the sampling rate and length, its result is a frequency-discrete set of spectral lines with constant spacing (except for CQT and VQT). A note that changes quickly in time, such as a thirty-second note, lasts extremely briefly; it may be captured by one analysis segment or split across two adjacent segments, which makes analyzing such short notes difficult.
● Harmonic overlap: for real scores, the smaller the integer ratio between fundamentals, the more harmonious the music sounds, but also the more severely keys share spectral lines with one another. For example, playing A4 on a standard piano theoretically produces the overtone series 440 Hz, 880 Hz, 1320 Hz, 1760 Hz, ...; if A5 is pressed at the same time, it produces 880 Hz, 1760 Hz, 3520 Hz, ..., so the even harmonics of A4 coincide with the overtone series of A5. Inconsistent tuning and the influence of temperature, humidity, and other environmental factors shift overtone frequencies, so exact harmonic overlap is actually rare, but the limited frequency resolution makes these small harmonic frequency differences hard to distinguish. Harmonic overlap is thus a key difficulty of multi-F0 estimation.
● Beating: the strings of a multi-string key can never have exactly identical frequencies. When played, each string is excited at its own fundamental and overtone series according to its tension, and two waves of slightly different frequency produce "beats": the combined amplitude is modulated, with peaks equal to the sum of the two amplitudes, troughs equal to the absolute value of their difference, and a period equal to the absolute frequency difference. When the difference is small, e.g., within 10 Hz, the sound wavers audibly and is unstable, and the corresponding spectral peak fluctuates up and down in the frequency-domain analysis. Different harmonic series of different keys also interact and beat; the phenomenon is widespread in pianos and affects key-identification accuracy.
● Harmonic loss: a key's harmonics comprise the fundamental and its higher harmonics, and loss occurs for both, for different reasons. A missing fundamental means that when a key is played, the analyzed spectrum contains no fundamental component, or one far smaller than its harmonics, because low-frequency signals resonate poorly and vibrate with little amplitude or energy; it mainly occurs for bass notes and affects the accuracy of their recognition, while most higher-frequency components are effectively reinforced by the instrument's resonance cavity, so mid and high notes show no missing fundamental. Most algorithms cannot recognize or perceive a missing fundamental. Loss of higher harmonics means the high-harmonic share of the spectrum is very small: when the hammer strikes a treble string at a certain speed or energy, the string vibrates forcedly and its higher-frequency vibration modes decay faster or are hard to excite, which affects the recognition accuracy of treble keys.
● Speech interference: vowel articulation in speech has the same overtone-series structure as piano keys, affecting the accuracy of multi-F0 estimation.
● Noise interference: noise and consonant articulation in speech affect the frequency analysis in certain bands or across the whole spectrum, raising all spectral values in the band and drowning out genuine peaks, greatly affecting multi-F0 estimation accuracy.
● The trade-off between computational complexity and accuracy must be balanced to some extent.
The present invention therefore addresses the above problems and provides a method for identifying played keys based on audio.
According to one embodiment of the invention, a method for identifying played keys based on audio is provided, comprising steps 1-10; each step is explained in detail below.
In step 1, the audio of the piano playing is captured. According to one embodiment, a sound acquisition module captures the musical audio emitted while the piano is played and stores it in the computer storage and management module. The module may be a microphone or another sound-capturing device, preferably a microphone, since microphones need only be placed around the piano, whereas sound sensors capture better when installed inside the keys.
In step 2, the captured audio is denoised, voice-separated, and framed; each resulting frame signal serves as a unit for subsequent processing. Framing allows the audio signal to be regarded as stationary within the current frame, which benefits later processing.
Noise reduction removes ambient noise from the musical audio; voice separation removes human speech from it, reducing the influence of speech on the tone spectrum.
According to one embodiment of the invention, a noise suppression module removes ambient noise. As is well known, in different environments — noisy shopping malls, outdoors, practice rooms — noise mixes with the piano tones and is captured into the audio by the microphones; the captured noise seriously affects the correctness and accuracy of subsequent steps such as key identification and playing evaluation, as well as playback in the human-computer interaction module, and the acquisition hardware likewise contributes a noise floor that must also be filtered out. To minimize the effect of environmental and hardware noise on the audio processing, the acoustic structure of the acquisition front end is optimized (e.g., isolating the playback speaker's resonance cavity from the microphones). The invention uses multi-microphone noise reduction to remove most ambient noise; this stage chiefly removes environmental noise. Its principle can be understood simply: several microphones point respectively toward the desired pickup direction (the piano resonance cavity) and toward the noise (the distance), so all of them capture both tones and noise but in different proportions — the piano-facing microphone captures a higher share of tones than the distance-facing one, and the distance-facing microphone captures a higher share of noise than the piano-facing one. Spectral subtraction can then subtract the noise-dominant sound from the tone-dominant sound, removing the noise from the captured sound to a certain extent.
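A sketch of that spectral subtraction for a two-microphone arrangement follows; the subtraction factor, the magnitude floor, and the toy signals are assumptions, not the exact procedure of the invention.

```python
import numpy as np

def spectral_subtract(tone_mic, noise_mic, alpha=1.0, floor=0.0):
    """Subtract the noise-dominant magnitude spectrum from the tone-dominant one."""
    S_tone = np.fft.rfft(tone_mic)
    S_noise = np.fft.rfft(noise_mic)
    mag = np.maximum(np.abs(S_tone) - alpha * np.abs(S_noise), floor)
    phase = np.angle(S_tone)                     # keep the tone microphone's phase
    return np.fft.irfft(mag * np.exp(1j * phase), n=len(tone_mic))

sr = 16000
t = np.arange(sr) / sr
rng = np.random.default_rng(0)
tone = np.sin(2 * np.pi * 440 * t) + 0.1 * rng.standard_normal(sr)   # piano-facing mic
noise = 0.1 * rng.standard_normal(sr)                                # distance-facing mic
clean = spectral_subtract(tone, noise)
```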
According to one embodiment of the invention, a source separation module separates human speech. As described above, besides ambient noise, another important interference in this application scenario is speech: the frequency-domain structure of speech (chiefly vowel phonemes) closely resembles piano tones, with a clear harmonic structure, so speech likewise affects the correctness and accuracy of subsequent key identification and playing evaluation. The invention separates speech from tones with microphone-array beamforming, which operates on the spatial-domain selection of source signals: the data received by the sensors are weighted to form a beam of a certain shape that passes, with gain, information from the direction of interest while suppressing signals from uninteresting directions. The noise-reduction algorithm of the previous embodiment can hardly separate speech from tones, but because the tone and speech sources are at different positions, their spatial diversity can be exploited. This scheme constrains the microphone installation positions as follows (see Fig. 1): a rectangle in the plane represents the top view of the piano, with the sound-producing resonance cavity inside it; several point sources around the rectangle represent speakers, typically a teacher and a student seated in front of the piano. Taking the rectangle's center A and two point sources B and C outside the rectangle as triangle vertices, draw the line L parallel to side BC through the triangle's center O, then draw the arc M tangent to L at O; M may bow toward A or toward B and C. The microphones should as far as possible not all lie on line L but on arc M, and the larger M's arc the better, maximizing pickup differences among the microphones and ensuring the accuracy and robustness of the algorithm's spatial estimation. If all the microphones sat on or near L, it would be hard to distinguish sounds arriving from below L (direction A) from those arriving from above L (directions B and C), since the two sides are symmetric about L.
After the sound acquisition, noise suppression, and speech separation flow of the above embodiments, a relatively clean musical signal is obtained, basically meeting subsequent processing requirements.
In step 3, frequency-domain analysis is performed on each framed audio signal to obtain the single-frame spectrum, which contains multiple spectral peaks.
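A minimal framing-plus-spectrum sketch consistent with steps 2 and 3 is shown below; the 50 ms frame and 10 ms hop are illustrative assumptions.

```python
import numpy as np

def frame_signal(x, sr, frame_ms=50, hop_ms=10):
    """Split a signal into overlapping frames, each treated as stationary."""
    frame = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    n = 1 + max(0, (len(x) - frame) // hop)
    return np.stack([x[i * hop: i * hop + frame] for i in range(n)])

def frame_spectra(frames, sr):
    win = np.hanning(frames.shape[1])                 # taper to reduce leakage
    spec = np.abs(np.fft.rfft(frames * win, axis=1))  # magnitude spectrum per frame
    freqs = np.fft.rfftfreq(frames.shape[1], 1 / sr)  # constant line spacing
    return freqs, spec

sr = 48000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 440 * t)                       # toy A4-like tone
freqs, spec = frame_spectra(frame_signal(x, sr), sr)
print(freqs[spec[0].argmax()])                        # 440.0
```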
In step 4, noise-level estimation is performed on the spectrum from step 3, and extrema above the estimated noise level are sought within neighborhoods of that spectrum to obtain the tonal spectral peaks forming the tonal-peak set. According to one embodiment, an adaptive noise-level estimation method checks whether the statistics of the sub-band frequency-domain signals within a single frame satisfy a Rayleigh distribution. Normally the number of peaks excited by musical tones is far smaller than the number of noise peaks. If the sub-band statistics satisfy a Rayleigh distribution, the distribution's Q-quantile represents the noise level of the current frame (sub-interval); according to one embodiment, the 97.5% or 99% quantile may be chosen, i.e., roughly 97.5% or 99% of the peaks within the frame's sub-band are taken as noise and the rest as tones. Extrema above the noise-level estimate are the tonal peaks. According to one embodiment, extrema are found as follows: first, find all local extrema of the spectrum; second, keep those above the noise-level estimate; third, further keep those that satisfy plausible string-vibration frequencies; the local extrema finally screened out are the tonal peaks.
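The sketch below illustrates the quantile-based noise level and peak screening. It uses the empirical 97.5% quantile of the spectral lines directly, as a stand-in for the fitted Rayleigh Q-quantile, and omits the string-vibration plausibility filter.

```python
import numpy as np

def tonal_peaks(spectrum, q=0.975):
    """Indices of local maxima above the Q-quantile noise level of one frame."""
    noise_level = np.quantile(spectrum, q)   # ~97.5% of lines treated as noise
    peaks = [i for i in range(1, len(spectrum) - 1)
             if spectrum[i] > spectrum[i - 1]
             and spectrum[i] > spectrum[i + 1]
             and spectrum[i] > noise_level]
    return peaks, noise_level

spec = np.abs(np.random.default_rng(1).standard_normal(512))
spec[[40, 80, 120]] += 10.0                  # three tone-like peaks over the noise floor
print(tonal_peaks(spec)[0])                  # [40, 80, 120]
```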
In step 5, a first candidate key set S1 of possibly played keys is searched comprehensively from the tonal-peak set estimated in step 4 and the harmonic frequencies formed by the key spectral calibration parameter set P. The search criterion is that a key has harmonics coinciding with peaks of the current single-frame signal; the resulting first candidate key set is thus the set of keys whose harmonics coincide with one or more tonal peaks.
The key spectral calibration parameter set P is the set of calibrated spectral parameters of all keys; a key's spectral parameters describe its harmonic structure. According to one embodiment, a spectrum calibration module performs the calibration of the key spectral parameters. The module has two application scenarios — offline calibration and online calibration — which realize the same function in different settings.
First, why the key spectral parameters need calibration: when a piano key is played, the hammer strikes one or more strings, whose vibration can be modeled simply as forced vibration of a rigid material. The strike excites waves of different frequencies in the string with a clear proportional relationship — playing A4 on a standard piano excites waves at 440 Hz, 880 Hz, 1320 Hz, 1760 Hz, and so on — and the minimum vibration frequency is defined as the key's fundamental. Vibrations at integer multiples of the fundamental are its overtones or harmonics, which together form the harmonic (overtone) series and hence a distinct harmonic structure. Every key has its own fundamental and harmonic structure, and key identification presupposes that different keys correspond to different harmonic structures. Integer-multiple harmonics, however, presuppose a string modeled as a simple rigid vibrator, whereas real strings are metal wires of varying thickness fixed at both ends to the soundboard; the fixation shortens the effective vibrating length, so the harmonic frequencies are not strict integer multiples of the fundamental. If a key's fundamental is f0, its k-th harmonic can be approximated by the nonlinear function f_k = k * f0 * sqrt(1 + B * k^2), where k is the harmonic index and B expresses the non-integer relation, also called the inharmonicity. Across different pianos, tuners, environments, and time, each string's fundamental drifts and its harmonic series shifts accordingly, so f0 is itself a varying quantity. If the key's theoretical fundamental is F0, then f_k = k * (A * F0) * sqrt(1 + B * k^2), where the parameter A describes the fundamental's offset. Each key's overtone series is thus described by two parameters; the A and B of all keys form a parameter set, and spectrum calibration essentially estimates this parameter set from the tones each piano emits.
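A sketch of generating a key's inharmonic series from its (A, B) parameters and collecting the first candidate key set follows; the peak-matching tolerance and the toy parameter values are assumptions.

```python
import math

def harmonic_series(F0, A, B, n_harmonics=15):
    """f_k = k * (A * F0) * sqrt(1 + B * k^2), k = 1..n."""
    return [k * (A * F0) * math.sqrt(1 + B * k * k) for k in range(1, n_harmonics + 1)]

def first_candidate_keys(peak_freqs, key_params, tol_hz=3.0):
    """key_params: {key_name: (F0, A, B)}; returns keys sharing >= 1 tonal peak."""
    candidates = set()
    for key, (F0, A, B) in key_params.items():
        series = harmonic_series(F0, A, B)
        if any(abs(p - f) <= tol_hz for p in peak_freqs for f in series):
            candidates.add(key)
    return candidates

params = {"A4": (440.0, 1.0, 1e-4), "A5": (880.0, 1.0, 1e-4)}   # toy values
print(first_candidate_keys([440.2, 880.9], params))              # {'A4', 'A5'}
```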
Purpose of calibration: to analyze as accurately as possible which keys are played within a chord, their respective playing intensities, the rhythm, and so on, the piano's sound characteristics — i.e., the above parameter set — must be captured as fully as possible.
Before calibration, the invention stores in the computer storage and management module a set of generic parameters suitable for most pianos; this set is not specific to any one piano but broadly fits most. The implementation of offline and online calibration is detailed below.
Offline calibration: the offline calibration program is entered through the human-computer interface. The program asks the user to play, individually and in order, all 88 piano keys in a quiet environment, while the acquisition module's microphone records all captured sound. One microphone channel suffices for the offline calibration data, or multiple channels can be weight-averaged. Suppose the captured channel is the sequence X; ideally, with a 48 kHz sampling rate, each key held for one second and one-second gaps between keys, X has length 48000*(88*2-1)=8400000, i.e., 175 s of data. In practice the playing force and intervals are unknown, so the sound of each actual keystroke must be located. One method for this is voice activity detection (VAD), also called silence suppression, which identifies and removes long silent periods from an audio stream, saving channel resources without degrading service quality; silence suppression conserves precious bandwidth and helps reduce the end-to-end delay perceived by the user. The invention uses VAD to detect the onset and release periods of a keystroke's sound; the span between them is the key's normal sounding time. For example, on entering offline calibration, the interface asks the user to play A0 at medium force (medium or slightly higher force mainly improves the signal-to-noise ratio) while sound is captured synchronously; the captured data is passed to the VAD algorithm to detect the onset and release periods, the time-domain data between them is analyzed in the frequency domain to reveal the key's harmonic structure — different spectral peaks correspond to different harmonics — and the peak frequencies combined with f_k = k * (A * F0) * sqrt(1 + B * k^2) are linearly fitted to solve for A and B. When A and B are within reasonable ranges, they are saved as A0's spectral parameters. The interface then asks the user to play B0 at medium force, and the flow repeats until all keys are calibrated, keys failing the check being replayed. When the piano has long been unused or has been retuned, the calibrated parameters no longer apply and offline calibration must be rerun to obtain up-to-date parameters.
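The linear fit itself can be sketched as follows. Squaring f_k = k*(A*F0)*sqrt(1 + B*k^2) gives f_k^2 = a*k^2 + b*k^4 with a = (A*F0)^2 and b = (A*F0)^2*B, which is ordinary linear least squares; this reformulation is one plausible way to realize the fit described above, not necessarily the invention's exact procedure.

```python
import numpy as np

def fit_key_params(ks, peak_freqs, F0):
    """Recover (A, B) from measured harmonic frequencies via linear least squares."""
    ks = np.asarray(ks, float)
    y = np.asarray(peak_freqs, float) ** 2
    X = np.stack([ks ** 2, ks ** 4], axis=1)
    (a, b), *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sqrt(a) / F0, b / a        # A, B

# toy check: synthesize peaks from known parameters and recover them
A_true, B_true, F0 = 1.002, 2e-4, 440.0
ks = np.arange(1, 11)
peaks = ks * (A_true * F0) * np.sqrt(1 + B_true * ks ** 2)
print(fit_key_params(ks, peaks, F0))     # ~ (1.002, 2e-4)
```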
Online calibration: its main difference from offline calibration is that offline calibration requires the user to follow the interface step by step, whereas online calibration requires no such operation and is imperceptible to the user. During normal use the module automatically monitors the captured audio and uses the key identification module to determine that one or several keys were played in a given period, then performs frequency-domain analysis on the captured data (the same analysis as in offline calibration). For never-calibrated keys, the factory parameter set and the key identification module determine which keys were played; for previously calibrated keys, the calibrated parameters are used. Suppose the spectrum of each note (index r, r ∈ [1, 88]) is composed of its first Nr harmonics, indexed n, n ∈ [1, 15] (e.g., only the first 15 harmonics are studied, Nr = 15), each harmonic having amplitude anr and frequency fnr; a note's harmonic composition can then be written Pr = {anr, fnr | n ∈ [1,15], r ∈ [1,88]} and the parameter set of the whole keyboard as P = {Pr | r ∈ [1,88]}. The discrepancy between a note's true harmonic distribution and the distribution given by f_k = k * (A * F0) * sqrt(1 + B * k^2) can be quantified as Wr, the product of the note's first Nr harmonic amplitudes and a frequency-dependent function, expressible as the product of the main lobe of the magnitude of the Fourier transform of the analysis window used in the frequency-domain analysis and the frequency difference; this mitigates the effect of the discontinuous frequency resolution of the analysis. Based on non-negative matrix factorization, the sum W of the Wr of all played keys (the corresponding expression appears only as an image in the original publication) is iterated to a minimum, after which the parameters Ar and Br of each played key are solved. The harmonic frequencies of the currently played keys are reconstructed from the solved parameters, the spectral amplitudes at the corresponding frequencies are zeroed (or notch-filtered in the frequency domain), and the energy of the residual spectrum is recomputed, or it is checked whether the residual spectral coefficients roughly satisfy a Rayleigh distribution. If the residual energy is below a certain bound, or the spectral coefficients roughly satisfy a Rayleigh distribution, the currently computed Ar and Br are highly credible and the parameters are saved. As the user plays, the parameters of the played keys are thus gradually calibrated, completing the calibration of the whole keyboard over time. To ensure the calibration parameters converge stably and correctly, each calibration's parameters are kept temporarily and the parameters of multiple calibrations of the same key are compared; if they agree, the calibration is deemed correct and the parameter set P is updated for use by the key identification module. The benefits of online calibration are: even a long-idle or retuned piano needs no tedious offline recalibration by the user — after a period of use, frequently played keys are quickly recalibrated; as the piano's strings change with continued use, online calibration silently and promptly captures the changes and corrects their effect on the audio processing system; several keys can be calibrated at once, which is efficient; and the piano's tuning state can be evaluated with the calibrated parameter set.
In step 6, the first candidate key set S1 is pruned. According to one embodiment, the hand and key positions captured by the camera are used to compute the set Sh of keys beneath the hands during the time period corresponding to the single-frame audio signal, and the second candidate key set S2 is set to the intersection of Sh and S1. According to another embodiment, the played-key positions Sr of the previous frame's comprehensive estimate are used to estimate the key set Ss in the region of both hands during the current frame's period, and S2 is set to the intersection of Ss and S1.
In step 7, the candidate key set is narrowed using the second candidate set S2 and the amplitudes of all recognized tonal peaks. Specifically, when a key is played, the peak amplitudes of all its excited harmonic series should vary regularly — e.g., the line through the harmonic peaks (the harmonic envelope) follows a polynomial function, an exponential-decay function, or some probability density function. Under this assumption, each peak is reassigned to the keys whose harmonics it more plausibly belongs to, forming the third candidate key set S3: the keys of S2 whose harmonics coincide with one or more tonal peaks in the tonal-peak set.
In step 8, for the third candidate key set S3 and the tonal spectral peaks, the share, among the spectral peaks, of the harmonics of each key's octave-up and octave-down keys (those octave keys themselves also being in S3) is checked; if the share is high and high for every harmonic component, the key's octave-up or octave-down key was indeed played. Otherwise the key was not played and was misjudged, and it should be removed from S3, finally forming the fourth candidate key set S4.
In step 9, S4 is taken as the multi-F0 estimation result of the single frame signal, and HMM plus the Viterbi algorithm perform inter-frame smoothing of note events, the smoothed result serving as the frame's comprehensive estimate Sr. The purpose of inter-frame smoothing of note events is to counter recognition errors caused by sound decay. For example, suppose the user presses the A4 key at second 1 and releases it after 5 seconds; over those seconds the sound decays more and more and, with the "beat" phenomenon added, the raw output might be "pressed" for seconds 1-3, "not pressed" for seconds 3-3.5, and "pressed" for seconds 3.5-5 (pressing and releasing are called events). If each frame is 10 ms, frames 100-300 and 351-500 are "pressed" and frames 301-350 "released", which contradicts reality; inter-frame smoothing of note events means relabeling frames 301-350 as pressed. According to one embodiment, a time-domain filter may instead process the single-frame multi-F0 estimates to obtain the comprehensive estimate.
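A toy two-state (released/pressed) Viterbi smoother for one key's per-frame detections is sketched below; the transition and emission probabilities are assumptions chosen so that a short dropout like the frames-301-350 example is bridged.

```python
import numpy as np

def viterbi_smooth(obs, p_stay=0.99, p_correct=0.8):
    """obs: per-frame 0/1 detections for one key; returns the smoothed 0/1 path."""
    n = len(obs)
    trans = np.log(np.array([[p_stay, 1 - p_stay], [1 - p_stay, p_stay]]))
    emit = np.log(np.array([[p_correct, 1 - p_correct],   # emit[state][observation]:
                            [1 - p_correct, p_correct]])) # detector right w.p. p_correct
    dp = np.full((n, 2), -np.inf)
    back = np.zeros((n, 2), dtype=int)
    dp[0] = np.log(0.5) + emit[:, obs[0]]
    for t in range(1, n):
        for s in (0, 1):
            scores = dp[t - 1] + trans[:, s]
            back[t, s] = scores.argmax()
            dp[t, s] = scores.max() + emit[s, obs[t]]
    path = [int(dp[-1].argmax())]
    for t in range(n - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# the key is held, but detection briefly drops out for 5 frames mid-way
obs = [1] * 30 + [0] * 5 + [1] * 15
print(viterbi_smooth(obs))   # the short 0-run is relabeled as pressed
```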
In step 10, using the frame's comprehensive estimate Sr and the spectral peaks, the power sum of each key's harmonic series in the set Sr — where the power sum is the sum of squared amplitudes — is computed as that key's intensity value, used for playing evaluation; and, from the current frame's comprehensive estimate Sr and its capture period, the start and end times of each keystroke are determined, each note's duration computed, and the beat derived.
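A sketch of this step — the power sum of a key's harmonic amplitudes as its intensity, and onset/offset times from per-frame activity — follows; the 10 ms frame length is taken from the example above, the rest is illustrative.

```python
import numpy as np

FRAME_S = 0.010   # assumed 10 ms analysis frames

def key_intensity(harmonic_amplitudes):
    """Power sum of the key's harmonic series (sum of squared amplitudes)."""
    a = np.asarray(harmonic_amplitudes, float)
    return float((a ** 2).sum())

def note_events(active_frames):
    """active_frames: 0/1 per frame for one key -> list of (onset_s, offset_s)."""
    events, start = [], None
    for i, on in enumerate(list(active_frames) + [0]):   # sentinel closes a final run
        if on and start is None:
            start = i
        elif not on and start is not None:
            events.append((start * FRAME_S, i * FRAME_S))
            start = None
    return events

onsets = [s for s, _ in note_events([0, 1, 1, 1, 0, 0, 1, 1, 0])]
print(onsets, np.diff(onsets))   # [0.01, 0.06] [0.05] -- inter-onset intervals carry the beat
```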
According to one embodiment of the invention, the playing is evaluated by the playing evaluation module. Depending on the evaluation level (preset as needed), the indicators include but are not limited to pitch, beat, rhythm, dynamics, pedal, and expressive information. Each evaluation indicator is detailed below.
Pitch: when the user plays from a particular known score, the playing is judged against the score; the result can be shown on the score in real time, drawn as curves, or compiled after playing into a report with overall accuracy, per-measure accuracy, and other statistics, identifying proficient passages (persistently high accuracy) and unproficient ones (persistently low accuracy). If the user improvises, pitch is not evaluated statistically. Since the user may play many incorrect pitches, dynamic time warping (DTW) is used, when comparing against a known score, to dynamically align the played intonation with the score's intonation, yielding segmental or overall accuracy statistics.
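A minimal DTW sketch for aligning the played note sequence with the score follows; the 0/1 substitution cost is an assumption, and a real system would align richer pitch features.

```python
def dtw_distance(played, score):
    """Classic DTW with unit mismatch cost over note-name sequences."""
    n, m = len(played), len(score)
    INF = float("inf")
    dp = [[INF] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0.0 if played[i - 1] == score[j - 1] else 1.0
            dp[i][j] = cost + min(dp[i - 1][j],      # extra played note
                                  dp[i][j - 1],      # missed score note
                                  dp[i - 1][j - 1])  # match / substitution
    return dp[n][m]

score = ["G4", "G4", "D4", "E4"]
played = ["G4", "A4", "D4", "E4"]        # one wrong note
print(dtw_distance(played, score))        # 1.0 -> roughly 3 of 4 notes aligned correctly
```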
Beat: the beat is the total note length of each measure in the score; for example, a piece in 4/4 counts a quarter note as one beat with four quarter notes per measure. Judging the beat mainly means judging the onset periods of notes; the difference between adjacent note onset periods expresses the current beat.
Rhythm: describes the long-term distribution of playing intensity at the level of the piece.
Dynamics: manifest as the playing intensity; scores contain many intensity levels from very soft to very loud. These intensities have no absolute metric, only relative evaluation, set according to the specific application scenario in practice.
Pedal: correct use of the sustain pedal and the damper pedal.
Expression: pitch, beat, rhythm, dynamics, pedal, and other information are combined to estimate the mood of the music over a period of time, e.g., intense, gentle, fierce, lively, or sad.
According to one embodiment, the invention uses the computer storage and management module to store all captured audio data, intermediate computation variables, calibration parameters, and so on, and the human-computer interaction module to provide offline calibration, display of the user's playing evaluation indicators, score display, audio playback, and the like.
The advantages of the invention are: noise and speech are separated from the tones, effectively reducing their effect on the accuracy of multi-F0 and intensity estimation; multiple key spectra are calibrated simultaneously with low user awareness, promptly capturing changes in the piano's sound characteristics; the calibrated parameters comprehensively evaluate the user's piano tuning; inharmonicity is taken into account, and updating the key harmonic parameters makes peak lookup more accurate and direct and harmonic-series power and intensity computation more accurate; harmonic overlap is considered, reducing judgment errors between keys an octave apart; the intensity evaluation indicator built from the relation between intensity and harmonic-series power is general, clear in concept, simple in principle, cheap to compute, and accurate in judgment; estimating the hands' position from the already-estimated key presses and generating a candidate playing region — the region searched first and intensively in the next time period — effectively narrows the range, reduces computation, speeds up processing, and lowers the error rate; combining video information does the same; the approach suits traditional pianos without installing hardware such as pressure or distance sensors on each key, so traditional pianos can also become intelligent pianos; and the user may play freely, unrestricted by the pieces in the repertoire library.
As the foregoing embodiments show, the played keys can be identified from the audio information, together with user audio information such as the corresponding pitch, beat, rhythm, dynamics, pedal, and expression; from the keys' harmonic structure and spectral parameters, each key's fundamental frequency can be obtained. Using this information in piano teaching can effectively improve the results of piano training.
Piano teaching usually involves judging and evaluating the learner's playing in at least two respects, notes and hand movements, where notes may include factors such as the spectrum of fundamentals and overtones, dynamics, tempo, and rhythm. With corresponding time information, the audio signal of the playing can be converted into audio data and compared with standard audio data to judge whether the played notes are correct. In the present invention, "standard audio data" and "standard hand data" denote the reference audio data and reference hand data against which the user audio data and user hand data are compared to judge the user's playing result.
According to one embodiment of the invention, an intelligent piano training method is provided; Figure 2 shows the intelligent piano training method of one embodiment. As shown in Figure 2, the method comprises the following steps:
S310: acquire the audio information and video information of the user playing the piano.
Note practice and hand-movement practice are the two main aspects of piano practice, so both the audio and video information of the user's playing must be acquired. In some implementations, audio and video acquisition devices (e.g., a microphone and a camera, or a camera with a microphone) capture the audio and video of the playing; in this case the captured audio can be preprocessed — removing silent segments, denoising, noise reduction — to avoid external interference and improve scoring accuracy. In other embodiments, for playing on an electronic piano, the MIDI (Musical Instrument Digital Interface) audio digital signal of the playing can be collected through the instrument's MIDI port; a MIDI signal is binary data output by the electronic piano that represents a played note and can be recognized and processed by a computer. For video information, a camera or other device with image-capturing capability films the user's hand movements while playing; one camera may film both hands, or several cameras may film the left and right hands separately from different angles, in which case the left- and right-hand videos can be stitched.
In one embodiment, pressure sensors installed under the keys may also capture the user's key-touch force while playing, to be combined with the above audio and video information in jointly determining the user's playing score.
S320: identify the played keys from the audio information using the audio-based key identification method of the foregoing embodiments of the invention, obtain user audio data on that basis, and compare it with the corresponding standard audio data in the audio database to obtain the matching degree between the user audio data and the corresponding standard audio data.
The audio database contains the audio data of a large number of standard piano pieces (e.g., pieces performed by piano teachers or professionals, or generated automatically from scores by artificial intelligence). Standard audio data can be extracted at fixed time intervals from the audio of standard performances and stored piece by piece to form the audio database. The audio data in the database includes at least the piece name, extraction time, note, fundamental frequency, and sound intensity.
In one embodiment, standard audio data may be extracted from the audio of standard performances at intervals of 10 ms or shorter and stored in the audio database. The current Guinness world record for the fastest pianist is 14 key presses per second; taking 20 presses per second as an example, one press spans 50 ms, so extracting audio data from the performance audio every 10 ms covers all notes produced by the performance.
Figure 3 shows the storage of standard audio data in the audio database of one embodiment. As shown, the database may include a first-level table and secondary tables: the first-level table stores the basic information of the standard pieces, including the serial number, name, grade, key, and the serial number of the audio-data secondary table; a secondary table stores each piece's audio data, including the extraction time and the notes with their fundamentals and intensities extracted from the piece's audio at fixed intervals. As in Figure 3(A), the first-level table stores several standard pieces: piece 0001 is "Song of Spring", grade 1, A major, with its audio data in secondary table 0001; piece 0036 is "Rondo", grade 4, C major, audio data in secondary table 0036; piece 0180 is "Canon", other, D major, audio data in secondary table 0180; and so on. As in Figure 3(B), secondary table 0001 stores all notes with their fundamentals and intensities extracted every 10 ms across the complete performance of piece 0001: at 0.000 s, no note, fundamental 0, intensity 0; at 0.010 s, note G4, fundamental 391 Hz, intensity 10 dB; at 0.020 s, still G4, 391 Hz, 15 dB; at 0.030 s, still G4, 391 Hz, 20 dB; ...; at 0.250 s, note D4, fundamental 293 Hz, intensity 10; and so on.
User audio data can be extracted from the audio information at fixed intervals and compared with the corresponding standard audio data in the audio database to obtain the matching degree between the user audio data and the corresponding standard audio data.
In one embodiment, user audio data is extracted at a first time interval, which may equal the interval at which the database's standard audio data was extracted from standard performances or be an integer multiple of that interval. The extracted user audio data includes at least the extraction time and the notes with their fundamentals and intensities, and corresponds to standard audio data in the database through its extraction-time information. Taking the audio database of Figure 3 as an example, when the user plays "Song of Spring", the played keys can be identified, and the other user audio data obtained, from the captured audio at 30 ms intervals; if at 0.030 s the identified played key corresponds to user audio data with note G4, fundamental 391, and intensity 15, the standard audio data corresponding to it is that of piece 0001 in the first-level table and the audio data (note with its fundamental and intensity) at 0.030 s in secondary table 0001.
Before playing, the user may personally select the piece to be played from the database; alternatively, after the user starts playing, the system may intelligently identify the piece being played, query its standard audio data in the audio database, and then compare the user audio data with the corresponding standard audio data to obtain their matching degree.
In one embodiment, the audio database may store standard audio data of the same piece in different styles; after the user starts playing, the system intelligently identifies the piece and style being played, queries the corresponding standard audio data in the database, and compares the user audio data against it to obtain the matching degree.
In one embodiment, different weights may be set for different information in the audio data when computing the matching degree between the user audio data and the corresponding standard audio data; for example, the weight of a note's fundamental may be set greater than that of its intensity so that fundamental-frequency information accounts for a larger share of the calculation. In one embodiment, an error redundancy interval may also be set for the standard audio data in the database, e.g., ±10 Hz for a note's fundamental; when the user audio data falls within the interval, the fundamental-frequency information of the note in the user audio data can be considered essentially consistent with that in the corresponding standard audio data.
In one embodiment, the audio database may be further subdivided into a single-note database storing standard audio data of individual notes and a repertoire database storing the standard audio data of a large number of standard piano pieces; thus, when practicing, both the audio data of a single played note and the audio data of a played piece can be judged.
S330: based on the matching degree between the user audio data and the corresponding standard audio data, capture from the video information the user hand image corresponding to the user audio data.
In piano playing, judging hand movements is meaningful only when the played notes are correct or essentially correct; therefore, according to one embodiment, the matching degree between the user audio data and the corresponding standard audio data decides whether the corresponding user hand image needs to be captured from the video information.
In one embodiment, an audio matching degree threshold may be set — by the user, by system default, or intelligently by the system after surveying, in the networked state, other players' performance of the same piece. When the matching degree of the extracted user audio data is at or above the threshold, the played notes are correct or essentially correct, and the corresponding user hand image can be captured from the video to judge the hand movement; when below the threshold, the played note is wrong and no hand-movement judgment is needed.
Video broadly denotes the technologies by which series of static images are captured, recorded, processed, stored, transmitted, and reproduced as electrical signals; a video is thus a time-ordered series of images. When the images change at more than 24 frames per second, by the principle of persistence of vision the eye cannot distinguish single static images, and the result looks smooth and continuous. User hand images can therefore be captured from the video at fixed intervals and correspond to user audio data through their capture-time information.
In one embodiment, user hand images are captured at a second time interval, equal to the interval at which user audio data is extracted from the audio, or an integer multiple of it. With coinciding time information, the image captured at an instant is the hand image corresponding to the user audio data produced then; for example, if user audio data with note G4, intensity 15, and pitch 1750 is extracted at 0.030 s, the hand image captured from the video at 0.030 s is the corresponding hand image. In one embodiment, the second time interval is at most 30 ms.
In one embodiment, to reduce the hand model's computation, the captured hand images are screened and only those containing the piano-key area are selected for identifying the user hand data.
In one embodiment, besides basing the decision on the audio matching degree, the user may personally set whether hand images are captured and hand data identified.
S340: identify the user hand data in the user hand image through the hand model, and compare it with the corresponding standard hand data in the hand database to obtain the matching degree between the user hand data and the corresponding standard hand data.
As stated above, the hand model takes a hand image as input data and the hand data in that image as output data, and is obtained by training a neural network. Through the hand model, user hand data in the hand image can be identified, such as the coordinate positions or relative positions of hand joint points, including those of 21 key joint points in each of the left and right hands, or of more or fewer joint points. In one embodiment, the user hand data may also include the coordinate or relative position of the wrist.
In one embodiment, the hand model may be a trained recurrent neural network or long short-term memory network. In one embodiment, when the user hand image contains the piano-key area, the key area can first be detected against the background of the field of view and a key candidate box drawn; hand-keypoint regression detection is then performed by the hand model within the candidate box to extract the user hand data.
The hand database contains a large amount of standard hand data. Standard hand images can be captured at fixed time intervals from the standard performance videos of piano pieces, the hand model then identifies the standard hand data in those images, and the data is stored piece by piece to form the hand database. In one embodiment, the standard hand images can be captured from the standard performance videos at the same interval used to extract standard audio data from standard performances in the audio database, or at an integer multiple of that interval, after which the hand model identifies the standard hand data and stores it in the hand database. Standard hand data may include information such as the time (i.e., the capture time) and the coordinate or relative positions of the key joint points of the left and right hands; user hand data corresponds to standard hand data in the database through its time information.
Figure 4 shows the storage of standard hand data in the hand database of one embodiment. As shown, the database may include a first-level table and a second-level table: the first-level table (Figure 4(A)) stores the basic information of the standard pieces, including the serial number, name, grade, key, and the serial number of the hand-data secondary table; the secondary table (Figure 4(B)) stores each piece's hand data, including the capture time and data such as the relative positions of the 21 joint points of the left and right hands in the hand images captured from the piece's video.
In one embodiment, the audio database may be associated with the hand database: piece by piece, standard audio data and standard hand data with consistent time information are stored in association, forming a comprehensive database. When the interval for extracting standard audio data from a standard performance differs from the interval for capturing standard hand images from it, or when the audio-extraction interval is an integer multiple of the image-capture interval, only the standard hand data whose time information matches that of the standard audio data is stored.
Figure 5 shows the storage of the comprehensive database of one embodiment. As shown, the comprehensive database may include a first-level table and a second-level table: the first-level table (Figure 5(A)) stores the basic information of the standard pieces, including the serial number, name, grade, key, and the serial number of the comprehensive-data secondary table; the secondary table (Figure 5(B)) stores each piece's audio data and hand data.
Comparing the extracted user hand data with the corresponding standard hand data in the hand database yields their matching degree. In one embodiment, an error redundancy interval may also be set for the standard hand data; when the user hand data falls within the interval, it can be considered essentially consistent with the standard hand data.
In one embodiment, a hand-data matching degree threshold may be set — by the user, by system default, or intelligently by the system after surveying, in the networked state, other players' performance of the same piece. When the matching degree between the user hand data and the standard hand data is at or above the threshold, the played hand movement is correct or essentially correct; when below it, the hand movement is wrong, in which case the corresponding user hand image can be saved automatically for convenient review. In one embodiment, wrong hand movements can also be displayed to the user, e.g., a virtual hand outline generated by animation rendering shows the offending finger when the fingering is wrong and the offending hand area (palm, fingertips, etc.) when the hand shape is wrong.
S350: feed back the playing result to the user based on the matching degree between the user audio data and the corresponding standard audio data and the matching degree between the user hand data and the corresponding standard hand data.
The higher the audio matching degree, the more accurate the notes of the user's playing; likewise, the higher the hand matching degree, the more accurate the hand movements. The two matching degrees thus assess the user's playing level comprehensively from both notes and hand movements.
In one embodiment, the weights of the two matching degrees in determining the performance score can be set, allowing scoring rules to be personalized for different users' playing habits; for example, for a user with accurate notes but frequent hand-movement errors, the hand matching degree can be weighted more heavily to emphasize feedback on the user's hand movements during playing.
In one embodiment, the database also stores standard key-touch force data; the captured user touch-force data can be compared with it to obtain the matching degree between the user's touch force and the standard touch force, which is combined with the audio and hand matching degrees to assess the user's playing level comprehensively.
This intelligent piano training method lets users learn promptly and accurately, even without a teacher's guidance, how their notes and hand movements fare while practicing, helping them correct errors in time and effectively improving practice efficiency.
In some embodiments, the user's playing results may be fed back with a delay. For example, a comprehensive score for the piece may be displayed when the playing ends; the user's specific audio and hand-movement errors during playing may be recorded in detail and compiled into a score report, so the user can practice or correct mistakes in a targeted way; and the current score or report may be compared with the user's previous playing records or other users' records to evaluate the user's current level comprehensively.
In other embodiments, the user audio data and user hand data may be compared at the same time and the playing result fed back based on both matching degrees; in this case, the user audio data and hand images can be extracted and analyzed in real time while the audio and video of the playing are acquired.
In some cases, after the user finishes playing a piece, the overall playing result for that piece may be fed back to the user.
Figure 6 shows an intelligent piano training method of one embodiment of the invention. As shown in Figure 6, the method comprises the following steps:
S710: acquire the audio information and video information of the user playing the piano.
S720: identify the played keys from the audio information, obtain user audio data on that basis, and compare it with the corresponding standard audio data in the audio database to obtain their matching degree.
Steps S710-S720 are similar to steps S310-S320 above and are not repeated here.
S730: compare the matching degree between the user audio data and the corresponding standard audio data with a specified threshold N1; when the matching degree is greater than or equal to N1, execute step S740; when it is less than N1, execute step S760.
S740: capture from the video information the user hand image corresponding to the user audio data.
S750: identify the user hand data in the user hand image through the hand model and compare it with the corresponding standard hand data in the hand database to obtain the matching degree between the user hand data and the corresponding standard hand data.
S760: judge whether the user's piano playing has ended; if so, execute step S770; if not, execute steps S710-S760.
S770: feed back the playing result to the user based on the matching degrees of all user audio data produced during the playing against the corresponding standard audio data, and of all user hand data produced during the playing against the corresponding standard hand data.
By feeding back the comprehensive effect of the playing after it ends, this method helps the user grasp, as a whole, the complete piece played or a passage of its melody.
In some embodiments, when the matching degree between the user audio data and the corresponding standard audio data is below the specified threshold, the user can be prompted with the key information corresponding to the standard audio data, e.g., a virtual keyboard generated by animation rendering indicates the correct keys; and/or when the matching degree between the user hand data and the corresponding standard hand data is below the specified threshold, the user can be prompted with the hand movement corresponding to the standard hand data, e.g., a virtual hand outline generated by animation rendering indicates the correct hand movement.
Figure 7 shows an intelligent piano training method of one embodiment of the invention. As shown in Figure 7, the method comprises the following steps:
S810: acquire the audio information and video information of the user playing the piano.
S820: identify the played keys from the audio information, obtain user audio data on that basis, and compare it with the corresponding standard audio data in the audio database to obtain their matching degree.
S830: compare the matching degree between the user audio data and the corresponding standard audio data with a specified threshold N1; when the matching degree is greater than or equal to N1, execute step S840; when it is less than N1, prompt the user with the key information corresponding to the standard audio data and execute step S870.
S840: capture from the video information the user hand image corresponding to the user audio data.
S850: identify the user hand data in the user hand image through the hand model and compare it with the corresponding standard hand data in the hand database to obtain the matching degree between the user hand data and the corresponding standard hand data.
S860: compare the matching degree between the user hand data and the corresponding standard hand data with a specified threshold N2; when it is less than N2, prompt the user with the hand movement corresponding to the standard hand data.
S870: judge whether the user's piano playing has ended; if so, execute step S880; if not, execute steps S810-S870.
S880: feed back the playing result to the user based on the matching degrees of all user audio data produced during the playing against the corresponding standard audio data, and of all user hand data produced during the playing against the corresponding standard hand data.
This method provides real-time guidance and demonstration for the note and/or hand-movement errors arising in the user's playing, so the user can promptly master the correct notes and/or movements and practice more efficiently.
In summary, by using the hand model to identify the hand data in user hand images accurately, and by judging the user's playing result as a whole on the combined basis of the audio data and hand data produced while playing, the invention enables users to obtain effective feedback on notes and hand movements during practice without the guidance of professional teachers, helping them find and correct mistakes and improving practice efficiency. In addition, prompting the user in real time with the correct key information and/or hand movements helps the user obtain correct demonstration and guidance in time, aiding self-study of the piano.
In another aspect, the invention also provides an intelligent piano training system implementing the above method, comprising: an audio and video acquisition unit for acquiring the audio and video information of the user's piano playing; a data extraction unit for extracting user audio data from the audio information and capturing from the video information the user hand image corresponding to the user audio data, the data extraction unit being configured with the invention's system for identifying played keys based on audio, which identifies the played keys from the audio information and on that basis obtains user audio information such as note, fundamental frequency, and sound intensity; a data identification unit for identifying the user hand data in the user hand image through the hand model, where the hand model takes a hand image as input data and the hand data in that image as output data and is obtained by training a neural network; a data matching unit for comparing the user audio data with the corresponding standard audio data in the audio database to obtain their matching degree, and comparing the user hand data with the corresponding standard hand data in the hand database to obtain their matching degree; and a user interaction unit for feeding back the playing result to the user based on the matching degree between the user audio data and the corresponding standard audio data and the matching degree between the user hand data and the corresponding standard hand data.
In one embodiment, the user interaction unit of the intelligent piano training system is further configured to prompt the user with the key information corresponding to the relevant standard audio data and with the hand movement corresponding to the relevant standard hand data.
In one embodiment, the intelligent piano training system further includes a control unit for coordinating the audio and video acquisition unit, data extraction unit, data identification unit, data matching unit, and user interaction unit; it determines, from the matching degree between the user audio data and the corresponding standard audio data, whether to activate the data identification unit, judges, from the audio matching degree or from the matching degree between the user hand data and the corresponding standard hand data, whether the user's playing of the piece has ended, and determines whether to activate the audio and video acquisition unit or the user interaction unit.
Figure 8 shows an intelligent piano practice system of one embodiment of the invention. As shown in Figure 8, the intelligent piano practice system 900 includes an audio and video acquisition unit 901, a data extraction unit 902, a data identification unit 903, a data matching unit 904, and a user interaction unit 905.
The audio and video acquisition unit 901 includes a sound acquisition device 9011 and a video acquisition device 9012 and acquires the audio and video information produced while the user plays the piano. The sound acquisition device 9011 may be, for example, one or more microphones installed near the piano; it can connect to the data extraction unit 902 by wire or wirelessly and send it the acquired audio. The video acquisition device 9012 may be a device with filming or image-capturing capability, such as a monocular camera, binocular camera, or depth camera. It may be fixed around the piano for fixed-point capture of hand video — for example, mounted only at the front upper side of the keyboard, or as several filming devices installed above, in front of, and/or to the left and right of the keyboard — or mounted on slide rails to track and capture hand video automatically, adjusting its shooting position and/or angle. Similarly, it can connect to the data extraction unit 902 by wire or wirelessly and send it the acquired video. In one embodiment, the sound acquisition device 9011 and the video acquisition device 9012 may be integrated into one device to acquire the audio and video of the playing simultaneously.
The data extraction unit 902 includes an audio data extraction unit 9021 and an image data capture unit 9022: the audio data extraction unit 9021 is configured with a system for identifying played keys based on audio, identifies the played keys from the audio information, obtains user audio data on that basis, and sends it to the data matching unit 904; the image data capture unit 9022 captures from the video information the user hand image corresponding to the user audio data and sends it to the data identification unit 903.
The data identification unit 903 contains the hand model 9031 and connects to the image data capture unit 9022; it identifies the user hand data in the user hand image through the hand model and sends it to the data matching unit 904. The hand model 9031 takes a hand image as input data and the hand data in that image as output data and is obtained by training a neural network.
The data matching unit 904 includes an audio data matching unit 9041 and a hand data matching unit 9042. The audio data matching unit 9041 contains the audio database and compares the user audio data from the data extraction unit 902 with the corresponding standard audio data to obtain their matching degree, sending it to the user interaction unit 905 and the control unit 906. The hand data matching unit 9042 contains the hand database and compares the user hand data from the data identification unit 903 with the corresponding standard hand data to obtain their matching degree, sending it to the user interaction unit 905. The audio and hand databases may be stored as built-in files within units 9041 and 9042, or connected to them via API interfaces.
The user interaction unit 905 includes a processor 9051 and a display device 9052. The processor 9051 receives the audio matching degree from unit 9041 and the hand matching degree from unit 9042 and determines the score of the user's piano performance based on both. The display device 9052 may be an electronic device with display capability, such as a smartphone, iPad, smart glasses, liquid-crystal display, or electronic-ink screen, for displaying the processor 9051's scoring result. In one embodiment, the processor 9051 can determine the correct key information from the audio matching degree and display it on the device 9052, e.g., generating a virtual keyboard by animation rendering and indicating the correct keys. In one embodiment, it can also construct the correct hand movement from the hand matching degree and display it on the device 9052, e.g., generating a virtual hand contour to indicate the correct movement, or building a user-specific skeletal system from the user's hand information, generating the user's personalized virtual hand contour by skinning, animation rendering, and similar means, and driving that contour with the standard hand data to indicate the correct hand movement.
In one embodiment, the intelligent piano practice system may further include sensors mounted under the keys to collect key-touch force data while the user plays.
It should be noted that although the steps are described above in a particular order, this does not mean they must be executed in that order; in fact, some of the steps may be executed concurrently or even reordered, provided the required functions are achieved.
The present invention may be a system, a method, and/or a computer program product. The computer program product may include a computer-readable storage medium bearing computer-readable program instructions for causing a processor to implement aspects of the invention.
A computer-readable storage medium may be a tangible device that holds and stores instructions for use by an instruction-executing device. It may include, for example but not limited to, electrical storage devices, magnetic storage devices, optical storage devices, electromagnetic storage devices, semiconductor storage devices, or any suitable combination of the foregoing. More specific (non-exhaustive) examples include: portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random access memory (SRAM), portable compact disc read-only memory (CD-ROM), digital versatile discs (DVD), memory sticks, floppy disks, mechanically encoded devices such as punch cards or raised structures in grooves on which instructions are stored, and any suitable combination of the above.
The embodiments of the invention have been described above; the description is exemplary, not exhaustive, and not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, their practical application, or improvements over technologies in the market, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (24)

  1. A method for identifying played piano keys based on audio, characterized by comprising:
    acquiring audio of piano playing and framing the audio to obtain a plurality of frame signals, and
    performing the following steps on each frame signal:
    T1, performing frequency-domain analysis on the current frame signal to obtain the spectrum of the frame signal;
    T2, performing noise-level estimation on the spectrum obtained in step T1 to obtain a plurality of tonal spectral peaks forming a tonal-peak set;
    T3, searching, based on the tonal-peak set obtained from the noise-level estimation and the harmonic frequencies formed by the spectral parameter sets of all keys, for a first candidate key set of possibly played keys, wherein the first candidate key set is the set of keys whose harmonic frequencies coincide with one or more tonal peaks in the tonal-peak set.
  2. The method of claim 1, characterized in that the method further comprises performing the following steps on each frame signal:
    T4, using the hand-and-key images captured by a camera during the time period of the frame signal, taking the intersection of the set of keys beneath the hand positions and the first candidate key set to obtain a second candidate key set;
    T5, based on the harmonic series of each key in the second candidate key set, forming a third candidate key set from the keys of the second candidate key set whose harmonics coincide with one or more tonal peaks in the tonal-peak set;
    T6, screening the third candidate key set, retaining the keys whose octave-up or octave-down keys were played, to form a fourth candidate key set.
  3. The method of claim 2, characterized in that the method further comprises performing the following step on each frame signal:
    T7, taking the fourth candidate key set as the multi-fundamental-frequency estimation result of the single frame signal and applying HMM and Viterbi inter-frame smoothing of note events to it, obtaining the comprehensive estimate of the played keys for that frame signal.
  4. The method of claim 1, characterized in that the method further comprises: performing noise reduction and speech separation on the acquired audio in turn, followed by framing.
  5. The method of claim 4, characterized in that
    multi-microphone noise reduction or microphone-array beamforming is used to denoise the audio and remove ambient noise from it.
  6. The method of claim 4, characterized in that
    multi-microphone noise reduction or microphone-array beamforming is used to perform speech separation on the audio, separating human speech from the musical tones.
  7. The method of claim 5 or 6, characterized in that the multi-microphone noise reduction is implemented by the following steps:
    pointing multiple microphones respectively in the directions to be picked up, including the piano resonance cavity and the ambient-noise direction;
    using spectral subtraction to subtract the captured noise-dominant sound spectrum from the captured tone-dominant sound spectrum to remove ambient noise.
  8. The method of claim 5 or 6, characterized in that, when microphone-array beamforming is used, the microphones capturing the audio are arranged as follows:
    taking the center of the rectangular plane of the piano keyboard's top view and the two farthest sound sources outside the rectangle as triangle vertices, the microphones are arranged on an arc that is tangent, at the triangle's center, to the line parallel to the segment joining the two farthest sources, and that bows around either the rectangle's center point or the two farthest sources.
  9. The method of claim 1, characterized in that the method further comprises:
    acquiring or updating the key spectral parameter set by offline or online calibration, the set containing the spectral parameters corresponding to each key.
  10. The method of claim 9, characterized in that each key's spectral parameters describe the key's harmonic structure:
    f_k = k * (A * F0) * sqrt(1 + B * k^2)
    where A and B are the key's spectral parameters, f_k is the key's k-th harmonic, and F0 is the key's fundamental frequency.
  11. The method of claim 9, characterized in that the offline calibration method comprises:
    L1, manually playing a key at the required force and acquiring the audio of the keystroke;
    L2, analyzing the audio of step L1 with silence-suppression techniques to obtain the onset and release periods of the keystroke;
    L3, performing frequency-domain analysis on the time-domain data between the onset and release periods obtained in step L2 to obtain the spectral harmonic structure of the key's sound, where the different spectral peaks correspond to the key's different harmonics, and computing the key's spectral parameters by linear fitting based on f_k = k * (A * F0) * sqrt(1 + B * k^2).
  12. The method of claim 9, characterized in that the online calibration method comprises:
    capturing key-playing audio repeatedly during the piano's use and performing frequency-domain analysis on it to obtain the key's true spectral harmonics, computing the key's spectral parameters from the harmonic structure f_k = k * (A * F0) * sqrt(1 + B * k^2), comparing the parameters computed from the repeatedly captured audio, and updating the key's spectral parameters if they agree.
  13. A method for computing the playing beat based on audio, characterized by comprising:
    A1, obtaining the set of played keys for each frame signal using the method of any of claims 1-12;
    A2, determining each key's playing start and end times from the played-key set and computing the beat therefrom.
  14. A method for evaluating playing based on audio, characterized by comprising:
    P1, obtaining the set of played keys for the current frame signal using the method of any of claims 1-12;
    P2, determining each key's playing start and end times from the played-key set and computing the beat therefrom; and, based on the comprehensive estimate of the played-key set and the tonal peaks of the current frame signal, computing the power sum of each key's harmonic series in the played-key set as that key's intensity value;
    P3, evaluating, against preset indicators, the pitch, beat, rhythm, dynamics, pedal, and expressive information of the playing.
  15. A system for identifying played piano keys based on audio, characterized by comprising:
    a sound acquisition module for capturing audio of piano playing;
    a computer storage and management module for storing the captured audio data and performing the following operations:
    B1, performing frequency-domain analysis on a frame signal to obtain the signal's spectrum;
    B2, performing noise-level estimation on the spectrum from step B1 to delineate a plurality of tonal spectral peaks in the spectrum, forming a tonal-peak set;
    B3, searching comprehensively, based on the tonal-peak set and the harmonic frequencies formed by the spectral parameter sets of all keys, for a first candidate key set of possibly played keys, wherein the first candidate key set is the set of keys whose harmonics coincide with one or more tonal peaks in the tonal-peak set.
  16. The system of claim 15, characterized in that the computer storage and management module is further configured to:
    B4, prune the first candidate key set using the set of keys beneath the hand positions obtained from the hand-and-key images captured by a camera during the current frame's time period, obtaining a second candidate key set, wherein the second key set is the intersection of the under-hand key set for the current frame's period and the first candidate key set;
    B5, based on the harmonic series of each key in the second candidate key set, form a third candidate key set from the keys of the second key set whose harmonics coincide with one or more tonal peaks in the tonal-peak set;
    B6, screen the third candidate key set, retaining the keys whose octave-up or octave-down keys were played, to form a fourth candidate key set.
  17. The system of claim 16, characterized in that the computer storage and management module is further configured to:
    B7, take the fourth candidate key set as the multi-fundamental-frequency estimation result of the single frame signal and apply HMM and Viterbi inter-frame smoothing of note events to it, obtaining the comprehensive estimate of the played keys for that frame signal.
  18. The system of claim 15, characterized in that the system further comprises:
    a noise suppression module for denoising the audio with multi-microphone noise reduction to remove ambient noise;
    a source separation module for performing speech separation on the audio with microphone-array beamforming to separate human speech from the audio.
  19. The system of claim 15, characterized in that the system further comprises:
    a spectrum calibration module for acquiring or updating the key spectral parameter set by offline or online calibration.
  20. The system of claim 15, characterized in that the system further comprises:
    a key identification module for taking the fourth candidate key set as the single-frame multi-fundamental-frequency estimation result and applying HMM and Viterbi inter-frame smoothing of note events to it, obtaining the comprehensive estimate of the played keys for the current frame.
  21. The system of claim 15, characterized in that the system further comprises:
    a playing evaluation module for evaluating, against preset indicators, the pitch, beat, rhythm, dynamics, pedal, and expressive information of the playing.
  22. The system of claim 15, characterized in that the system further comprises:
    a human-computer interaction module for interaction between the user and the system.
  23. A computer-readable storage medium, characterized in that it bears a computer program executable by a processor to implement the steps of the method of any of claims 1-12.
  24. An electronic device, characterized by comprising:
    one or more processors;
    a storage device for storing one or more programs which, when executed by the one or more processors, cause the electronic device to implement the steps of the method of any of claims 1-12.
PCT/CN2021/117129 2020-09-09 2021-09-08 Method and system for identifying played piano keys based on audio WO2022052940A1 (zh)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN202010939320.9 2020-09-09
CN202010939320.9A CN114170868A (zh) 2020-09-09 2020-09-09 Intelligent piano training method and system
CN202110982027.5 2021-08-25
CN202110982027.5A CN113658612B (zh) 2021-08-25 2021-08-25 Method and system for recognizing played piano keys based on audio

Publications (1)

Publication Number Publication Date
WO2022052940A1 true WO2022052940A1 (zh) 2022-03-17

Family

ID=80632082

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/117129 WO2022052940A1 (zh) 2020-09-09 2021-09-08 Method and system for recognizing played piano keys based on audio

Country Status (1)

Country Link
WO (1) WO2022052940A1 (zh)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102054470A * 2009-11-05 2011-05-11 Dalian Nationalities University High-precision piano tuner and tuning method thereof
CN104200818A * 2014-08-06 2014-12-10 Chongqing University of Posts and Telecommunications Pitch detection method
CN105513580A * 2014-09-26 2016-04-20 Shanghai Jianhua Technology Development Co., Ltd. Keyboard-instrument note recognition system based on an auxiliary camera
CN106157973A * 2016-07-22 2016-11-23 Nanjing University of Science and Technology Music detection and recognition method
CN108073867A * 2016-11-18 2018-05-25 Beijing Kuwo Technology Co., Ltd. Video feature extraction method and system for piano performance
WO2019146285A1 * 2018-01-23 2019-08-01 Yamaha Corporation Electronic keyboard instrument, sound signal generating device, and sound signal generating method

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116403597A * 2023-06-08 2023-07-07 Wuhan Huiqiang New Energy Materials Technology Co., Ltd. Automatic data capture and status update method for large-screen dashboards
CN116403597B (zh) * 2023-06-08 2023-09-05 Wuhan Huiqiang New Energy Materials Technology Co., Ltd. Automatic data capture and status update method for large-screen dashboards
CN117953914A * 2024-03-27 2024-04-30 Shenzhen Sihoo Intelligent Furniture Co., Ltd. Speech data enhancement and optimization method for intelligent office

Similar Documents

Publication Publication Date Title
JP6814146B2 (ja) System and method for capturing and interpreting audio
Klapuri Multiple fundamental frequency estimation based on harmonicity and spectral smoothness
Gillet et al. Transcription and separation of drum signals from polyphonic music
KR101521368B1 (ko) Method, device, and machine-readable storage medium for decomposing a multichannel audio signal
WO2022052940A1 (zh) Method and system for recognizing played piano keys based on audio
US8889976B2 (en) Musical score position estimating device, musical score position estimating method, and musical score position estimating robot
US9804818B2 (en) Musical analysis platform
KR20130112898A (ko) Decomposition of music signals using basis functions with time-variation information
CN107533848B (zh) Systems and methods for speech restoration
CN111888765B (zh) Multimedia file processing method, apparatus, device, and medium
Joder et al. A comparative study of tonal acoustic features for a symbolic level music-to-score alignment
US20220383842A1 (en) Estimation model construction method, performance analysis method, estimation model construction device, and performance analysis device
Essid et al. Musical instrument recognition based on class pairwise feature selection
CN113658612B (zh) Method and system for recognizing played piano keys based on audio
US9401684B2 (en) Methods, systems, and computer readable media for synthesizing sounds using estimated material parameters
WO2017057531A1 (ja) Sound processing device
Dittmar et al. Real-time guitar string detection for music education software
CN105895079A (zh) Method and apparatus for processing speech data
CN112420071A (zh) Polyphonic electronic-keyboard note recognition method based on the constant-Q transform
Dixon Learning to detect onsets of acoustic piano tones
US20230335090A1 (en) Information processing device, information processing method, and program
Mansour et al. Post-classification of nominally identical steel-string guitars using bridge admittances
US20230171541A1 (en) System and method for automatic detection of music listening reactions, and mobile device performing the method
Tiraboschi et al. Spectral analysis for modal parameters linear estimate
JP3584287B2 (ja) Acoustic evaluation method and system therefor

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 21865993; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 21865993; Country of ref document: EP; Kind code of ref document: A1)