WO2022052941A1

WO2022052941A1 - Intelligent identification method and system for giving assistance with piano teaching, and intelligent piano training method and system

Info

Publication number: WO2022052941A1
Application number: PCT/CN2021/117130
Authority: WO
Inventors: 韩冰冰; 陶之雨; 郑庆伟
Original assignee: 桂林智神信息技术股份有限公司
Priority date: 2020-09-09
Filing date: 2021-09-08
Publication date: 2022-03-17

Abstract

An intelligent identification method for giving assistance with piano teaching, the method comprising: acquiring, from above a piano keyboard, a 2D image, which contains the complete piano keyboard, of playing a piano; performing target detection on the 2D image by means of a piano keyboard detection network to detect a piano keyboard region, which is represented by the relative position coordinates of the 2D image, and obtaining, by means of conversion, piano keyboard position coordinates under the original coordinates of the 2D image; performing, by means of a hand detection network, target detection on the piano keyboard region, which is represented by the piano keyboard position coordinates, so as to detect a hand region, which is represented by the relative position coordinates of the piano keyboard region, and obtaining hand position coordinates under the original coordinate system of the 2D image by means of conversion; and identifying, by means of a hand shape error detection network, whether there is a hand shape error in the hand region, which is represented by the hand position coordinates, outputting a hand shape error type and a hand shape error position, which is represented by the relative position coordinates of the hand region, and obtaining, by means of conversion, hand shape error position coordinates under the original coordinates of the 2D image. In addition, further provided is an intelligent piano training method for identifying a playing error by using the intelligent identification method.

Description

Intelligent recognition method and system for assisting piano teaching, and intelligent piano training method and system

Technical field

The invention relates to the field of deep learning, in particular to an intelligent identification method and system for assisting piano teaching, and an intelligent piano training method.

Background art

At present, most piano teaching relies on face-to-face instruction by teachers. This approach is limited by factors such as manpower, time, cost, and the teacher's skill level, which greatly increases the difficulty of learning the piano. With the advent of the AI era, artificial intelligence technology has become a promising way to address these problems, and more and more intelligent piano teaching systems have appeared. Existing intelligent piano teaching systems have the following main shortcomings:

1. Most schemes use a method based on feature comparison. First, a mathematical model is used to build a standard database of correct hand shapes; then a prediction model is constructed to extract features from the picture under test, and these features are compared with the standard database to determine whether the playing hand shape is wrong. The difficulty of this method is that building the standard database is a complex and inefficient process. Because human hands differ greatly in size and joint proportions, a standard hand shape constructed from joint angles or joint lengths is highly subjective and not very accurate. In addition, because the angle between the hand and the camera changes, even very different hand shapes may yield a high similarity when compared, leading to incorrect conclusions. The feature comparison method therefore has poor robustness, high subjectivity, and a low recognition rate.

2. When building a prediction model, many methods use binocular or depth cameras to obtain 3D data and build a 3D model. Compared with a 2D visual model, a 3D model requires a large amount of computation, is complex to design, performs poorly, has high hardware requirements, and needs chips with large computing power. Depth cameras and high-compute chips greatly increase the cost.

3. Existing methods lack a systematic and comprehensive scheme for correcting piano hand shape and fingering. For hand shape errors, they can only judge right or wrong; they cannot tell what kind of error it is or point out where the error is, so their ability to guide students in correcting the error is insufficient. At present there is also no good method for precisely binding keys to fingers: 5 fingers and 88 keys give hundreds of combinations. Existing methods cannot accurately identify fingering errors, and without accurate identification of playing errors it is impossible to guide the playing accurately.

As is well known, pitch, rhythm, fingering, and hand shape are critical in piano playing. They are basic skills that beginners must practice repeatedly, usually under the supervision and guidance of a professional piano teacher. However, because the time available with a professional teacher is limited, beginners often practice alone, so various errors are not fed back and corrected in time and the practice is ineffective.

In the prior art, there are many piano training methods for self-practice by beginners. Some judge the accuracy of the piece played by the practitioner based on audio data; for example, the player's audio information is compared with the correct sound data of a master's performance to judge pitch, rhythm, speed, and strength, and thereby evaluate the performance. Some evaluate the accuracy of the performance based on video images of the piano playing; for example, images of the knuckles and piano keys are captured from a piano teaching video to build a standard fingering model diagram and a standard key sequence model diagram, and the practice video is then compared with the standard model diagrams to realize automatic error correction and intelligent teaching. Others combine audio and video data to judge the correctness of the performance; for example, the audio data and time signal are compared with standard note data to obtain the correct note data, the corresponding performance image data is then retrieved and analyzed by visual recognition to obtain the correct hand data, and a score for the piano performance is calculated based on the correct note data, the correct hand data, the standard note data, and so on.

However, the existing piano training methods still have shortcomings. First, practice (or evaluation) methods based only on the audio data of the playing may give inaccurate judgments because of environmental noise. In addition, since such methods focus only on the accuracy of the notes played, they cannot give feedback on or correct other important aspects such as the player's hand posture and fingering. Second, practice (or evaluation) methods that use only the video (or image) data of the playing recognize the captured images of the player's hands and keys in isolation, so they cannot be organically combined with the music being played. Even if the fingering and notes are correct, ignoring important factors such as the rhythm and speed of the music affects the accuracy of the judgment. Finally, existing practice (or evaluation) methods that consider both the player's audio and video (or image) data cannot correctly judge the player's fingering or give timely feedback or correction, so they cannot provide truly professional guidance.

In addition, to collect hand information, the prior art uses technologies such as wearable devices (e.g., data gloves), motion tracking (e.g., micro radar systems), or manual extraction of gesture data from images. In piano playing, however, wearable devices may affect the flexibility of the arms (or fingers); motion tracking is not accurate enough to detect subtle movements such as a finger moving on the keys or pressing a key; and manual extraction of gesture data involves a heavy workload, demands considerable expertise, and has unsatisfactory generalization ability and robustness.

Therefore, there is an urgent need for a more accurate and reasonable intelligent piano training method and system.

Summary of the invention

Therefore, the purpose of the present invention is to overcome the above-mentioned defects of the prior art, and to provide an intelligent identification method and system for assisting piano teaching which can accurately identify playing errors.

According to a first aspect of the present invention, an intelligent recognition method for assisting piano teaching is provided, for identifying hand shape errors and/or fingering errors from a 2D image of piano playing, the method comprising: acquiring, from above the piano keyboard, a 2D image of the piano being played that contains the complete piano keyboard; performing target detection on the 2D image through a piano keyboard detection network to detect a piano keyboard area represented by relative position coordinates of the 2D image, and converting these relative position coordinates into coordinates in the original coordinate system of the 2D image to obtain piano keyboard position coordinates in the original coordinates of the 2D image; performing target detection, through a hand detection network, on the piano keyboard area represented by the piano keyboard position coordinates to detect a hand area represented by relative position coordinates of the piano keyboard area, and converting these relative position coordinates into coordinates in the original coordinate system of the 2D image to obtain hand position coordinates in the original coordinates of the 2D image; and identifying, through a hand shape error detection network, whether there is a hand shape error in the hand area represented by the hand position coordinates, and, if there is, outputting the hand shape error type and the hand shape error position represented by relative position coordinates of the hand area, and converting these relative position coordinates into coordinates in the original coordinate system of the 2D image to obtain hand shape error position coordinates in the original coordinates of the 2D image.

In some embodiments of the present invention, the method further comprises: dividing out each piano key from the piano keyboard area represented by the piano keyboard position coordinates to obtain the different keys represented by relative position coordinates of the piano keyboard area, and converting these relative position coordinates into coordinates in the original coordinate system of the 2D image to obtain key coordinates in the original coordinates of the 2D image; detecting, through a fingertip feature point detection network, the fingertip feature points of the different fingers, represented by relative position coordinates of the hand area, in the hand area represented by the hand position coordinates, and converting these relative position coordinates into coordinates in the original coordinate system of the 2D image to obtain fingertip coordinates in the original coordinates of the 2D image; and performing a position judgment based on the fingertip coordinates and the key coordinates, binding each fingertip that falls on a key to that key to obtain the finger-key binding relationship, and comparing the finger-key binding relationship for playing a given note with the standard binding relationship in a score database to detect whether there is a fingering error. With the method of the present invention, specific errors in the playing process can be accurately identified directly from the 2D image, with a small amount of computation.
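The chained coordinate conversions described above (keyboard, then hand, then error position) can be illustrated with a minimal sketch. It assumes that "relative position coordinates" are pixel offsets from the top-left corner of the parent region; the source does not state the exact convention, and the names `Box` and `to_original` as well as all numeric values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned box in pixels: (x1, y1) top-left, (x2, y2) bottom-right."""
    x1: float
    y1: float
    x2: float
    y2: float


def to_original(inner: Box, parent_in_original: Box) -> Box:
    """Shift a box given relative to a parent region into original-image coordinates.

    Assumption: relative coordinates are pixel offsets from the parent's top-left corner.
    """
    dx, dy = parent_in_original.x1, parent_in_original.y1
    return Box(inner.x1 + dx, inner.y1 + dy, inner.x2 + dx, inner.y2 + dy)


# Keyboard region, already expressed in original-image coordinates (made-up values).
keyboard = Box(100, 400, 1700, 700)

# Hand detected inside the keyboard region, relative to that region.
hand_rel = Box(250, 20, 450, 220)
hand = to_original(hand_rel, keyboard)

# Hand shape error position detected inside the hand region, relative to it.
error_rel = Box(30, 40, 80, 90)
error = to_original(error_rel, hand)

print(hand)
print(error)
```

Because every stage reports positions relative to its own input region, each result only needs the offset of its parent region to be mapped back onto the full 2D image.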

Preferably, in the above method, after the piano keyboard area is detected, the piano keyboard position coordinates corresponding to the piano keyboard area are expanded by a first preset number of pixels to obtain an effective piano keyboard area that contains the complete hands, and subsequent processing is performed on this effective area. Preferably, the first preset number of pixels is 200. This pixel expansion effectively avoids inaccurate recognition caused by hands being incomplete in some 2D images because of different camera angles or different hand positions on the keyboard. At the same time, because playing errors are identified directly on the effective piano keyboard area, the entire image does not need to be processed, which greatly reduces the computational workload and hardware overhead.
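As a rough illustration of this expansion step, the sketch below grows a keyboard box by 200 pixels. Two points are assumptions rather than statements from the source: the margin is applied on all four sides, and the result is clamped to the image bounds. The function name `expand_box` and the image size are hypothetical.

```python
def expand_box(box, margin, image_width, image_height):
    """Grow an (x1, y1, x2, y2) pixel box by `margin` on every side.

    Assumption: the expanded box is clamped so it never leaves the image.
    """
    x1, y1, x2, y2 = box
    return (
        max(0, x1 - margin),
        max(0, y1 - margin),
        min(image_width, x2 + margin),
        min(image_height, y2 + margin),
    )


# Keyboard box in a hypothetical 1920x1080 frame, expanded by the preferred 200 pixels.
effective_keyboard = expand_box((100, 400, 1700, 700), 200, 1920, 1080)
print(effective_keyboard)  # (0, 200, 1900, 900)
```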

Preferably, in some embodiments of the present invention, after the hand area is detected, the piano keyboard position coordinates are compared with the hand position coordinates and the coordinates of hands that do not fall on the piano keyboard are filtered out, that is, the information of hands not on the keyboard is removed. The coordinate boundary of each hand that falls on the piano keyboard is then extended by a second preset number of pixels in four directions to obtain an effective hand area that contains the complete hand, together with the corresponding hand position coordinates. Preferably, the second preset number of pixels is 30. Filtering the hand position coordinates removes the data of hands that are not on the piano keyboard, which need no playing-error identification, and thus reduces the computational workload.
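The filtering and expansion described here might look like the following sketch. The source does not define exactly when a hand "falls on" the keyboard; this example assumes that any overlap between the hand box and the keyboard box counts. The helper names and coordinates are hypothetical.

```python
def overlaps(a, b):
    """True if two (x1, y1, x2, y2) boxes share any area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def effective_hand_boxes(hands, keyboard, margin=30):
    """Drop hands that are off the keyboard, then grow the rest by `margin` pixels.

    Assumption: a hand is "on the keyboard" if its box overlaps the keyboard box.
    """
    kept = [h for h in hands if overlaps(h, keyboard)]
    return [(x1 - margin, y1 - margin, x2 + margin, y2 + margin)
            for x1, y1, x2, y2 in kept]


keyboard = (100, 400, 1700, 700)
hands = [
    (350, 420, 550, 620),    # resting on the keyboard
    (1750, 100, 1900, 300),  # away from the keyboard, e.g. turning a page
]
print(effective_hand_boxes(hands, keyboard))  # [(320, 390, 580, 650)]
```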

In the above method, the piano keyboard detection network, the hand detection network, the hand shape error detection network, and the fingertip feature point detection network are all obtained through neural network training, which can intelligently and accurately perform target detection and error recognition. In some embodiments of the present invention, the neural network is trained in the following manner to obtain the piano keyboard detection network, hand detection network, hand shape error detection network, and fingertip feature point detection network:

S1. Collect images of multiple people playing different types of pianos in various scenarios to form an original data set, such that the images in the original data set cover the scenes and all error types for all existing piano types;

S2. Label the original data set, including labeling the piano keyboard position coordinates, the hand position coordinates, the hand shape error types and hand shape error position coordinates, and the coordinates of the fingertip feature points of the different fingers, with all labels in the same two-dimensional coordinate system;

S3. Process the images in the original data set according to the labeled piano keyboard position coordinates to obtain images containing the piano keyboard area represented by the labeled piano keyboard position coordinates, forming a first data set. Further, for the effective piano keyboard area obtained by expanding the piano keyboard area, crop the original images of the original data set according to the labeled piano keyboard position coordinates and hand position coordinates to obtain the effective piano keyboard area in each original image, forming a second data set, in which the hand position coordinates labeled in the original image are converted into coordinates in the coordinate system of the effective piano keyboard area. Further, for the effective hand area obtained by expanding the hand area, crop the original images of the original data set according to the labeled hand position coordinates and hand shape error position coordinates to obtain the effective hand area in each original image, forming a third data set, in which the hand shape error position coordinates labeled in the original image are converted into coordinates in the coordinate system of the effective hand area. Likewise, crop the original images of the original data set according to the labeled hand position coordinates and the coordinates of the fingertip feature points of the different fingers to obtain the effective hand area in each original image, forming a fourth data set, in which the fingertip feature point coordinates labeled in the original image are converted into coordinates in the coordinate system of the effective hand area;
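The cropping in step S3 pairs each cropped image with labels re-expressed in the crop's own coordinate system. A minimal sketch of that idea follows; it uses a nested list as a stand-in for an image, and the function name `crop_with_labels` and the toy data are hypothetical.

```python
def crop_with_labels(image, crop_box, points):
    """Crop an image and shift labeled points into the crop's coordinate system.

    image: list of rows, each row a list of pixels.
    crop_box: (x1, y1, x2, y2) in original-image pixels, end-exclusive.
    points: labeled (x, y) positions in original-image pixels.
    """
    x1, y1, x2, y2 = crop_box
    cropped = [row[x1:x2] for row in image[y1:y2]]
    shifted = [(x - x1, y - y1) for x, y in points]
    return cropped, shifted


# Toy 8x6 "image" whose pixel value records its own original (x, y) position.
image = [[(x, y) for x in range(8)] for y in range(6)]

# Crop a hypothetical effective hand area and convert two fingertip labels.
cropped, fingertips = crop_with_labels(image, (2, 1, 6, 5), [(3, 2), (5, 4)])
print(fingertips)  # [(1, 1), (3, 3)]
```

In the toy image each converted label indexes, inside the crop, the same pixel that the original label indexed in the full image, which is the property the converted annotations need for training.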

S4. Train a predetermined neural network to convergence on the first data set to obtain the piano keyboard detection network; train a predetermined neural network to convergence on the second data set to obtain the hand detection network; train a predetermined neural network to convergence on the third data set to obtain the hand shape error detection network; and train a predetermined neural network to convergence on the fourth data set to obtain the fingertip feature point detection network.

In some embodiments of the present invention, the first, second, and third data sets are each used to train a YOLOv4 network to convergence to obtain the piano keyboard detection network, the hand detection network, and the hand shape error detection network, respectively. The fourth data set is used to train a network composed of ResNet18 and a cascaded pyramid network to convergence to obtain the fingertip feature point detection network.

The piano keyboard detection network obtained by neural network training can intelligently identify the position of the piano keyboard and obtain the piano keyboard position coordinates. The hand detection network obtained by neural network training can intelligently and accurately identify the hand position, represented by relative position coordinates of the input piano keyboard area; converting these relative position coordinates into coordinates in the original image coordinate system directly yields the hand position coordinates in the original image. The hand shape error detection network obtained by neural network training can intelligently and accurately identify the specific hand shape error type and the hand shape error position coordinates. The fingertip feature point detection network obtained by neural network training can intelligently and accurately identify the fingertip positions of the different fingers, represented by relative position coordinates of the input hand area; converting these relative position coordinates into coordinates in the original image coordinate system directly yields the fingertip coordinates in the original image, which facilitates subsequent fingering recognition. Because the above detection networks are obtained by training neural networks on labeled data sets, hand shape errors in playing can be accurately identified without building a standard hand shape comparison database, with good robustness and high accuracy.

According to a second aspect of the present invention, there is provided a system for implementing the method of the first aspect of the present invention, comprising: an image acquisition module for acquiring a 2D image of the piano being played that contains the complete piano keyboard; a piano keyboard detection module for performing target detection on the 2D image to detect the piano keyboard area represented by relative position coordinates of the 2D image, and converting these relative position coordinates into coordinates in the original coordinate system of the 2D image to obtain the piano keyboard position coordinates in the original coordinates of the 2D image; a hand detection module for performing target detection on the piano keyboard area represented by the piano keyboard position coordinates to detect the hand area represented by relative position coordinates of the piano keyboard area, and converting these relative position coordinates into coordinates in the original coordinate system of the 2D image to obtain the hand position coordinates in the original coordinates of the 2D image; and a hand shape error detection module for identifying whether there is a hand shape error in the hand area represented by the hand position coordinates, outputting the hand shape error type and the hand shape error position represented by relative position coordinates of the hand area, and converting these relative position coordinates into coordinates in the original coordinate system of the 2D image to obtain the hand shape error position coordinates in the original coordinates of the 2D image.

Preferably, the system further includes: a key dividing module for dividing out each key from the piano keyboard area represented by the piano keyboard position coordinates to obtain the different keys represented by relative position coordinates of the piano keyboard area, and converting these relative position coordinates into coordinates in the original coordinate system of the 2D image to obtain the key coordinates in the original coordinates of the 2D image; a fingertip feature point detection network for detecting, in the hand area represented by the hand position coordinates, the fingertip feature points of the different fingers represented by relative position coordinates of the hand area, and converting these relative position coordinates into coordinates in the original coordinate system of the 2D image to obtain the fingertip coordinates in the original coordinates of the 2D image; and a fingering error detection module for performing a position judgment based on the fingertip coordinates and the key coordinates, binding each fingertip that falls on a key to that key to obtain the finger-key binding relationship, and comparing the finger-key binding relationship for playing a given note with the standard binding relationship in the score database to detect whether there is a fingering error. In some embodiments of the present invention, the system further includes a user interaction and display module for merging the playing errors that occur during playing with the image of the piano playing and displaying the result, and for providing an interactive interface for mode selection and feature selection.
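The position judgment and binding performed by the fingering error detection module can be sketched as follows. This is an illustrative simplification, not the patented implementation: keys are modeled as non-overlapping rectangles (real black keys sit over white keys), the standard binding is given as a plain mapping from key to expected finger, and all names and coordinates are hypothetical.

```python
def bind_fingers_to_keys(fingertips, keys):
    """Bind each fingertip to the key rectangle it falls on, if any.

    fingertips: {finger_name: (x, y)} in original-image coordinates.
    keys: {key_name: (x1, y1, x2, y2)} in original-image coordinates.
    """
    binding = {}
    for finger, (x, y) in fingertips.items():
        for key, (x1, y1, x2, y2) in keys.items():
            if x1 <= x < x2 and y1 <= y < y2:
                binding[finger] = key
                break
    return binding


def find_fingering_errors(binding, standard):
    """Compare the observed binding with the expected finger for each key.

    standard: {key_name: expected_finger} for the note(s) being played.
    Returns (key, expected_finger, fingers_actually_on_key) for each mismatch.
    """
    errors = []
    for key, expected in standard.items():
        if binding.get(expected) != key:
            actual = sorted(f for f, k in binding.items() if k == key)
            errors.append((key, expected, actual))
    return errors


keys = {"C4": (0, 0, 20, 100), "D4": (20, 0, 40, 100), "E4": (40, 0, 60, 100)}
fingertips = {"R1": (10, 80), "R2": (30, 85), "R3": (70, 90)}  # R3 is off the keys

binding = bind_fingers_to_keys(fingertips, keys)
print(binding)  # {'R1': 'C4', 'R2': 'D4'}
print(find_fingering_errors(binding, {"D4": "R3"}))  # [('D4', 'R3', ['R2'])]
```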

In some embodiments of the present invention, the image acquisition module may be any electronic device that can take pictures, such as a mobile phone or a camera.

According to a third aspect of the present invention, in order to overcome the deficiencies of piano training in the prior art, the present invention also provides an intelligent piano training method, the method comprising: acquiring audio information and video information of a user playing the piano; extracting user audio data from the audio information and comparing it with the corresponding reference audio data stored in an audio database to obtain the degree of matching between the user audio data and the corresponding reference audio data; intercepting, from the video information, a user hand image corresponding to the user audio data, identifying the user hand data in the user hand image through a hand model, and comparing it with the corresponding reference hand data stored in a hand database to obtain the degree of matching between the user hand data and the corresponding reference hand data, wherein the hand model is obtained by training a neural network with hand images as input data and the hand data in the hand images as output data; and/or intercepting, from the video information, a 2D image that corresponds to the user audio data and contains the complete piano keyboard, and using the method of the first aspect of the present invention to identify from the 2D image whether there is a playing error; and feeding back the playing result to the user based on the degree of matching between the user audio data and the corresponding reference audio data and the degree of matching between the user hand data and the corresponding reference hand data, and/or on the user's playing errors. The playing errors include hand shape errors and/or fingering errors.

Optionally, the above method further includes: feeding back the playing result to the user based on the degrees of matching between all the user audio data generated by the user playing the piano and the corresponding reference audio data, the degrees of matching between all the user hand data generated by the user playing the piano and the corresponding reference hand data, and/or all the playing errors of the user.

Optionally, the above method further includes: when the degree of matching between the user audio data and the corresponding reference audio data is less than a specified threshold, prompting the user for key information corresponding to the corresponding reference audio data.

Optionally, the above method further includes: when the degree of matching between the user hand data and the corresponding reference hand data is less than a specified threshold, prompting the user with the hand action corresponding to the corresponding reference hand data, and/or displaying to the user the type of playing error and the position of the playing error.
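The threshold-based prompting in the two optional features above can be sketched as below. The source does not give threshold values or the scale of the matching degree; this example assumes a matching degree in the range 0 to 1 and uses made-up thresholds of 0.8. The function name and prompt wording are hypothetical.

```python
def feedback_prompts(audio_match, hand_match, reference_key, reference_action,
                     audio_threshold=0.8, hand_threshold=0.8):
    """Return prompts for the aspects whose matching degree is below its threshold.

    Assumption: matching degrees lie in [0, 1]; the 0.8 thresholds are illustrative.
    """
    prompts = []
    if audio_match < audio_threshold:
        prompts.append(f"Expected key: {reference_key}")
    if hand_match < hand_threshold:
        prompts.append(f"Expected hand action: {reference_action}")
    return prompts


print(feedback_prompts(0.6, 0.9, "C4", "curved fingers"))  # ['Expected key: C4']
```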

Optionally, the user audio data includes extraction time, musical note, fundamental frequency and sound intensity.

Optionally, the user hand data includes the interception time and the relative positions of 21 key joint points of each of the left and right hands.

Optionally, extracting the user audio data from the audio information includes: extracting the user audio data from the audio information at a first time interval, wherein the user audio data corresponds to the reference audio data in the audio database through its extraction time.

Optionally, intercepting the user hand image corresponding to the user audio data from the video information includes: intercepting the user hand image from the video information at a second time interval, wherein the user hand image corresponds to the user audio data through its interception time.

Optionally, the second time interval is the same as the first time interval, or the second time interval is an integer multiple of the first time interval.
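One way to see why the second interval is the same as, or an integer multiple of, the first is that every intercepted image then shares a timestamp with some extracted audio sample. The sketch below illustrates this with integer millisecond timestamps; the function names, durations, and intervals are hypothetical.

```python
def sample_times(duration_ms, interval_ms):
    """Timestamps (in ms) at which data is sampled, starting from 0."""
    return list(range(0, duration_ms, interval_ms))


def align_images_to_audio(duration_ms, audio_interval_ms, image_interval_ms):
    """Return the image timestamps that coincide with an audio sample.

    Requires the image interval to be an integer multiple of the audio interval,
    so that every intercepted image has a matching audio sample.
    """
    if image_interval_ms % audio_interval_ms != 0:
        raise ValueError("image interval must be an integer multiple of audio interval")
    audio_times = set(sample_times(duration_ms, audio_interval_ms))
    return [t for t in sample_times(duration_ms, image_interval_ms) if t in audio_times]


# Audio extracted every 100 ms, images intercepted every 200 ms, over one second.
pairs = align_images_to_audio(1000, 100, 200)
print(pairs)  # [0, 200, 400, 600, 800]
```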

Optionally, the interception time of the user hand data is the same as the interception time of the user hand image, and the user hand data corresponds to the reference hand data in the database according to the interception time information.

Optionally, the hand model is obtained by training a recurrent neural network or a long short-term memory (LSTM) neural network.

Optionally, the above method further includes: selecting a user's hand image including piano keys from the user's hand image to identify the user's hand data.

Optionally, the above method further includes: acquiring the user's key-touching force data; comparing it with the corresponding reference key-touching force data to obtain the degree of matching between the user's key-touching force data and the corresponding reference key-touching force data; and determining the user's piano playing score based on the degree of matching between the user audio data and the corresponding reference audio data, the degree of matching between the user hand data and the corresponding reference hand data, and the degree of matching between the user's key-touching force data and the corresponding reference key-touching force data, and/or all the playing errors of the user.
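The source does not specify how the matching degrees and playing errors are combined into a score. Purely as an illustration, the sketch below uses a weighted sum of three matching degrees (each assumed to lie in 0 to 1) minus a fixed penalty per detected playing error; the weights, the penalty, and the function name are all hypothetical.

```python
def playing_score(audio_match, hand_match, force_match, error_count,
                  weights=(0.5, 0.3, 0.2), penalty_per_error=2.0):
    """Combine matching degrees and playing errors into a score from 0 to 100.

    Assumption: weights and penalty are illustrative, not taken from the source.
    """
    w_audio, w_hand, w_force = weights
    base = 100.0 * (w_audio * audio_match + w_hand * hand_match + w_force * force_match)
    return max(0.0, base - penalty_per_error * error_count)


print(round(playing_score(0.9, 0.8, 1.0, 3), 2))
```

A weighted sum keeps each factor's contribution easy to explain to the user, while the per-error penalty lets hand shape and fingering errors lower the score even when the notes themselves are correct.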

A fourth aspect of the present invention provides an intelligent piano training system, comprising: an audio and video acquisition unit for acquiring audio information and video information of a user playing the piano; a data extraction unit for extracting user audio data from the audio information, and for intercepting, from the video information, a user hand image corresponding to the user audio data and/or a 2D image containing the complete piano keyboard; a data recognition unit for identifying the user hand data in the user hand image through a hand model, and/or for identifying from the 2D image whether there is a playing error by means of the intelligent recognition system for assisting piano teaching according to the second aspect of the present invention, wherein the hand model is obtained by training a neural network with hand images as input data and the hand data in the hand images as output data; a data matching unit for comparing the user audio data with the corresponding reference audio data in an audio database to obtain the degree of matching between them, and for comparing the user hand data with the corresponding reference hand data in a hand database to obtain the degree of matching between them; and a user interaction unit for feeding back the playing result to the user based on the degree of matching between the user audio data and the corresponding reference audio data and the degree of matching between the user hand data and the corresponding reference hand data, and/or on all the playing errors of the user. The playing errors include hand shape errors and/or fingering errors.

Optionally, the user interaction unit is further configured to: prompt the user with the key information corresponding to the corresponding reference audio data; and/or prompt the user with the hand action corresponding to the corresponding reference hand data.

Optionally, the audio and video acquisition unit includes an audio capture device and a video capture device, wherein the video capture device includes one or more monocular cameras, binocular cameras, or depth cameras, and is either fixed around the piano to collect hand video information from a fixed point or installed on a slide rail to automatically track and collect hand video information.

Optionally, the above-mentioned system further includes: a sensor, which is installed under the piano keys and is used for collecting key-touching force data when the user plays the piano.

Compared with the prior art, the advantages of the present invention are as follows. The present invention accurately identifies the hand data in the user's hand image through the hand model, comprehensively considers the audio data and the hand data generated while the user plays the piano, and on this basis makes an overall judgment of the user's playing. The user can thus quickly obtain effective feedback on notes and fingering during practice without the guidance of a professional teacher, which helps the user discover and correct mistakes in time and improves practice efficiency. In addition, in some embodiments of the present invention, prompting the user in real time with the correct key information and/or hand actions gives the user timely demonstration and guidance and helps the user learn the piano independently. Because the present invention identifies playing errors from 2D images, the amount of computation is small and the hardware cost is low, which gives 2D visual models a clear advantage over 3D models in these respects.

Description of drawings

The embodiments of the present invention will be further described below with reference to the accompanying drawings, wherein:

FIG. 1 is a schematic diagram of the main work content of an intelligent identification method for assisting piano teaching according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of the framework of an intelligent recognition system for assisting piano teaching according to an embodiment of the present invention;

FIG. 3 is a schematic diagram illustrating an example of a 2D image collected according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of a piano keyboard area detected from the 2D image shown in FIG. 3 according to an embodiment of the present invention;

FIG. 5 is a schematic diagram of a hand area detected from the piano keyboard area shown in FIG. 4 according to an embodiment of the present invention;

FIG. 6 is a schematic diagram of detecting fingertip feature points from the hand area shown in FIG. 5 according to an embodiment of the present invention;

FIG. 7 is a schematic diagram of 6 common wrong hand movements and the corresponding correct hand movements in piano practice;

FIG. 8 is a schematic diagram of 21 key joint points in a single palm according to an embodiment of the present invention;

FIG. 9 shows an intelligent piano training method according to an embodiment of the present invention;

FIG. 10 is a schematic diagram of standard audio data storage in an audio database according to an embodiment of the present invention;

FIG. 11 is a schematic diagram of standard hand data storage in a hand database according to an embodiment of the present invention;

FIG. 12 is a schematic diagram of the storage of a comprehensive database according to an embodiment of the present invention;

FIG. 13 shows an intelligent piano training method according to an embodiment of the present invention;

FIG. 14 shows an intelligent piano training method according to an embodiment of the present invention;

FIG. 15 shows an intelligent piano training system according to an embodiment of the present invention.

Detailed description

In order to make the objectives, technical solutions and advantages of the present invention clearer, the present invention will be further described in detail below through specific embodiments. It should be understood that the specific embodiments described herein are only used to explain the present invention, but not to limit the present invention.

Working from 2D images, the present invention treats hand shape error recognition as a detection task and fingering error recognition as a feature point detection task combined with an image segmentation task, and constructs a 2D visual model to solve the problem of recognizing playing errors. No 3D modeling is required, and there is no need to establish a standard database for comparison, which greatly reduces the amount of computation and the hardware overhead.

According to an embodiment of the present invention, as shown in FIG. 1, an intelligent identification method for assisting piano teaching identifies whether there is a playing error by running detection on collected 2D images of piano playing. It mainly includes the following parts. First, piano keyboard detection is performed on the collected 2D image, and the piano keyboard area is cropped from it. Next, hand detection and key segmentation are performed on the cropped piano keyboard area: hand detection detects the hand area within the piano keyboard area, and key segmentation divides the piano keyboard area into individual keys. Then, hand shape error detection and fingertip detection are performed separately on the hand area: hand shape error detection detects hand shape errors and outputs the hand shape error type and the hand shape error position coordinates, and fingertip detection detects the fingertip feature points in the hand area to obtain the fingertip position coordinates of the different fingers. Finally, the positions of the fingertips and the keys are compared, each fingertip that falls on a key is bound to that key to obtain the finger-key binding relationship, and this relationship is compared with the binding relationship for the same note in the score database to determine whether there is a fingering error and to output any fingering error found.

FIG. 2 shows the main functional modules used when identifying playing errors with the method of the present invention. The image acquisition module collects, from above the piano keyboard, a 2D image of piano playing that includes the complete keyboard. The piano keyboard detection module performs piano keyboard detection on the 2D image to obtain the piano keyboard area. The hand detection module performs hand detection on the piano keyboard area to obtain the hand area. The hand shape error detection module performs hand shape error detection on the hand area to obtain the hand shape error type and the hand shape error position coordinates. The key division module divides the keys based on the piano keyboard area to obtain the coordinates of each key, and the fingertip feature point detection network detects the fingertip feature points in the hand area to obtain the fingertip coordinates of the different fingers. The fingering error detection module compares the fingertip coordinates with the key coordinates: if the area represented by the fingertip coordinates of a finger overlaps the area represented by the coordinates of a key, the finger is considered to fall on that key and is bound to it, giving the finger-key binding relationship. The finger-key binding relationship for the note being played is then compared with the standard binding relationship for the same note in the score database to detect fingering errors.

For better understanding of the present invention, the present invention will be described in detail below with reference to the accompanying drawings.

1. Image collection

The image acquisition module collects, from above the piano keyboard, a 2D image that includes the complete piano keyboard while the player is playing (in the collected 2D image, the player is located at the top of the image, and the piano keyboard area appears rectangular or approximately rectangular). The collected 2D image is then processed with image processing algorithms to obtain the processed 2D image. The image acquisition module may use any electronic device that can take pictures to collect the 2D images, such as a mobile phone, a camera, or a video camera; the angle of the device is adjusted before or during shooting so that the captured 2D image contains the complete piano keyboard and part or all of the hands. The image processing algorithms include, but are not limited to, black level compensation, lens correction, bad pixel correction, color interpolation, noise removal, gamma correction, color space conversion, white balance correction, color and contrast enhancement, and format conversion; they are applied so that the processed 2D image is suitable for subsequent operations. For example, the captured raw 2D image can be converted into a format compatible with piano keyboard detection, such as bmp, jpg, png, tif, gif, pcx, tga, exif, fpx, svg, psd, cdr, pcd, dxf, ufo, eps, ai, raw, WMF, webp, avif, or apng. The specific format can be set according to the actual application requirements. FIG. 3 shows an example of a collected 2D image that includes the complete piano keyboard and all of the player's hands during playing.

2. Piano keyboard detection

The piano keyboard detection module detects the position of the piano keyboard in the 2D image to obtain the piano keyboard area, which is represented by the piano keyboard position coordinates. The module uses the piano keyboard detection network to perform target detection on the 2D image. The output of the network is a set of position coordinates relative to the input 2D image; converting them into the original coordinate system of the 2D image gives the piano keyboard position coordinates, which indicate the piano keyboard area in the 2D image. The piano keyboard detection network is obtained by training a neural network with 2D images of piano playing as input and the relative position coordinates of the piano keyboard in the 2D image as output. When the piano keyboard is recognized, it falls entirely within the rectangular target detection frame of the piano keyboard detection network, so the frame contains the entire keyboard. The relative position of the piano keyboard in the input 2D image is represented either by the two-dimensional coordinates (in the form (x, y)) of two opposite corners of the rectangular target detection frame, or by the coordinates of its center point together with its width and height.

For example, the piano keyboard area can be expressed by the upper left corner (x1, y1) and the lower right corner (x2, y2) of the rectangular target detection frame, in which case the rectangle spanned by these two corners is the piano keyboard area. Alternatively, the piano keyboard area can be expressed by the center point (x0, y0) of the rectangular target detection frame together with its width w and height h. Both forms are acceptable. The embodiments of the present invention use the diagonal form {(x1, y1), (x2, y2)}, and for convenience of description, all coordinates mentioned below refer to coordinates that have already been converted into the original coordinate system of the 2D image.
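
As a non-limiting illustration, the conversion between the two box forms and from relative coordinates to the original coordinate system might be sketched as follows. The specification does not define the exact form of the relative coordinates; this sketch assumes they are normalized to the range [0, 1] by the width and height of the network's input region. The names `center_to_corners` and `to_original` are hypothetical.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2): upper left and lower right corners


def center_to_corners(x0: float, y0: float, w: float, h: float) -> Box:
    """Convert the center form (x0, y0, w, h) into the diagonal form."""
    return (x0 - w / 2, y0 - h / 2, x0 + w / 2, y0 + h / 2)


def to_original(rel_box: Box, region_origin: Tuple[float, float],
                region_size: Tuple[float, float]) -> Box:
    """Map a box from normalized region coordinates to original-image pixels.

    Assumption: the network output is normalized by the width and height of
    its input region; region_origin is the upper left corner of that region
    in the original 2D image.
    """
    ox, oy = region_origin
    rw, rh = region_size
    x1, y1, x2, y2 = rel_box
    return (ox + x1 * rw, oy + y1 * rh, ox + x2 * rw, oy + y2 * rh)


# The keyboard network sees the whole image, so the region origin is (0, 0).
keyboard_box = to_original((0.125, 0.5, 0.875, 0.75), (0, 0), (1920, 1080))
print(keyboard_box)  # (240.0, 540.0, 1680.0, 810.0)
```

The same `to_original` helper would apply to the later stages, where the input region is the effective area of the piano keyboard or the effective hand area rather than the whole image.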

The piano keyboard area serves two purposes: it is the input of the hand detection module, which detects the position of the hands, and it is the input of the key division module, which separates the individual keys and obtains the position of each key.

3. Hand detection

The hand detection module detects the hand area within the piano keyboard area. The detected hand area is first expressed in position coordinates relative to the piano keyboard area; these are converted into the original coordinate system of the 2D image to obtain the hand position coordinates, which indicate the hand area in the 2D image. The hand detection network is obtained by training a neural network with the piano keyboard area as input and the relative position coordinates of the hand within the piano keyboard area as output. In the image, the hand position coordinates are compared with the piano keyboard position coordinates to determine whether the two overlap. If the rectangular area represented by the hand position coordinates overlaps the rectangular area represented by the piano keyboard position coordinates, the hand is considered to be on the keyboard, and the subsequent hand shape error detection and fingering error detection are performed on it. If there is no overlap, the hand is considered not to be on the keyboard, and the subsequent hand shape error detection and fingering error detection are skipped.
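
As a non-limiting illustration, the overlap judgment described above might be sketched as an axis-aligned rectangle intersection test. The function names are hypothetical, and treating boxes that merely touch along an edge as non-overlapping is an assumption of this sketch.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in original-image pixels


def boxes_overlap(a: Box, b: Box) -> bool:
    """Return True when two axis-aligned rectangles share a non-empty area."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    return ax1 < bx2 and bx1 < ax2 and ay1 < by2 and by1 < ay2


def hands_on_keyboard(hand_boxes: List[Box], keyboard_box: Box) -> List[Box]:
    """Keep only the hands whose rectangle overlaps the keyboard rectangle."""
    return [hand for hand in hand_boxes if boxes_overlap(hand, keyboard_box)]


keyboard = (240, 540, 1680, 810)
hands = [(300, 400, 500, 600),   # reaches down onto the keyboard
         (100, 100, 200, 300)]   # entirely away from the keyboard
print(hands_on_keyboard(hands, keyboard))  # [(300, 400, 500, 600)]
```

Only the hands returned by `hands_on_keyboard` would be passed on to hand shape error detection and fingering error detection.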

Note that a hand within the piano keyboard area may be incomplete: for example, the fingers may rest on the keyboard while the palm lies outside it. To ensure that hands placed on the keyboard are detected completely, the upper edge of the piano keyboard area is extended toward the top of the image by a certain number of pixels (for example, 200 pixels) to form the expanded piano keyboard area, also called the effective area of the piano keyboard, so that every hand on the keyboard is fully contained in this area. According to an embodiment of the present invention, the piano keyboard area is expanded upward by 200 pixels, and the coordinates of the expanded effective area of the piano keyboard are expressed as {(x1, y1-200), (x2, y2)}.

The hand detection module uses the hand detection network to perform target detection on the effective area of the piano keyboard, obtains the hand area expressed in position coordinates relative to that area, and converts these coordinates into the original coordinate system of the 2D image to obtain the hand position coordinates. The simplest approach would be to feed the entire picture into the hand detection network, but this would increase the amount of calculation, so the present invention performs hand detection only on the effective area of the piano keyboard. During detection, each hand in the effective area of the piano keyboard represented by {(x1, y1-200), (x2, y2)} falls within a rectangular target detection frame of the hand detection network, and the upper left and lower right corner coordinates of that frame represent the hand area. The position coordinates of the hand area (i.e., the hand position coordinates) and the position coordinates of the effective area of the piano keyboard are in the same coordinate system and indicate the hand area and the effective area of the piano keyboard in the 2D image respectively, which makes position comparison convenient.

Assuming that the rectangular target detection frame for a detected hand has the upper left corner (x1', y1') and the lower right corner (x2', y2'), the position coordinates of the hand area can be expressed as {(x1', y1'), (x2', y2')}. Because the hand area is detected within the effective area of the piano keyboard, a hand area may be incomplete, or some hands may not fall on the piano keyboard. The hand position coordinates {(x1', y1'), (x2', y2')} are therefore compared with the piano keyboard position coordinates {(x1, y1), (x2, y2)}. If the hand area overlaps the keyboard area, the hand is considered to be on the keyboard, and the subsequent detection is performed on it; if there is no overlap, the hand is considered not to be on the keyboard, and the subsequent hand shape error detection and fingering error detection are skipped.

The hand area obtained by hand detection in the effective area of the piano keyboard is limited by the rectangular target detection frame of the hand detection network and may not contain the complete hand. For example, the fingertips of some fingers of a hand may be on the piano keyboard while the fingertips of the other fingers are outside it, and the fingertips outside the piano keyboard may not be included in the target detection frame. Therefore, after the hand position coordinates are obtained, the boundary of each hand area that falls on the piano keyboard is extended outward in all four directions by a certain number of pixels to obtain an expanded hand area that contains the complete hand, called the effective hand area. According to an embodiment of the present invention, the boundary of the hand area is expanded by 30 pixels, and the coordinates of the effective hand area can be expressed as {(x1'-30, y1'-30), (x2'+30, y2'+30)}, which ensures that the detected hand on the piano keyboard is complete. Note that when the hand area is expanded, each of its four boundaries must be checked against the image bounds: if an expanded boundary would exceed the boundary of the original 2D image, the boundary of the 2D image is used in its place and that side is not expanded further. FIG. 5 is a schematic diagram of the effective hand area identified from the effective area of the piano keyboard shown in FIG. 4.
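
As a non-limiting illustration, the two expansion steps (200 pixels upward for the keyboard area, 30 pixels on every side for the hand area, with the out-of-bounds check) might be sketched as follows. The function names are hypothetical. Clamping the keyboard expansion at the top of the image is an assumption of this sketch; the specification states the out-of-bounds check only for the hand area.

```python
from typing import Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in original-image pixels


def expand_keyboard(box: Box, up: int = 200) -> Box:
    """Extend the upper edge of the keyboard area toward the top of the image."""
    x1, y1, x2, y2 = box
    # Assumption: never move the upper edge above the first image row.
    return (x1, max(0, y1 - up), x2, y2)


def expand_hand(box: Box, image_size: Tuple[int, int], margin: int = 30) -> Box:
    """Extend the hand area on all four sides, replacing any side that
    would leave the image with the image boundary itself."""
    width, height = image_size
    x1, y1, x2, y2 = box
    return (max(0, x1 - margin), max(0, y1 - margin),
            min(width, x2 + margin), min(height, y2 + margin))


print(expand_keyboard((240, 540, 1680, 810)))          # (240, 340, 1680, 810)
print(expand_hand((10, 400, 500, 600), (1920, 1080)))  # (0, 370, 530, 630)
```

In the second call the left side would reach x = -20, so the image boundary x = 0 is used instead, while the other three sides receive the full 30-pixel margin.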

Cropping the effective hand area and discarding the hands that are not on the piano keyboard reduces the subsequent processing time and improves the accuracy of playing error recognition. The effective hand area serves two purposes: it is the input of the hand shape error detection module, which detects hand shape errors, and it is the input of the fingertip feature point detection network, which detects the fingertip feature points.

4. Hand shape error detection

The hand shape error detection module detects hand shape errors in the effective hand area represented by the hand position coordinates to obtain the hand shape error type and the hand shape error position coordinates. The module uses the hand shape error detection network to run detection on the effective hand area and obtains the hand shape error position expressed in coordinates relative to the input effective hand area; these relative coordinates are then converted into the original coordinate system of the 2D image to obtain the hand shape error position coordinates. The hand shape error detection network is obtained by training a neural network with the effective hand area as input and the hand shape error type and the relative position coordinates of the hand shape error within the effective hand area as output. In other words, the hand shape error detection module applies a deep-learning-based method to the input effective hand area and outputs the type and coordinates of each hand shape error, so as to guide the user to correct the wrong hand shape.

Hand shape errors are divided into folded fingers, fingertips not standing, fingertips pointing upwards, wrist pressing, and palmar joint collapse. Each type of error is one category. The present invention treats hand shape error recognition as a detection task: the detection network outputs the category of each hand shape error found in the hand area of the 2D image together with the position coordinates of that error relative to the input effective hand area, and it outputs as many detection results as there are errors.

5. Fingering error detection

1. Fingertip feature point detection

The fingertip feature point detection network detects, within the hand area represented by the hand position coordinates, the fingertip feature points of the different fingers, expressed in coordinates relative to the hand area. The relative coordinates of each fingertip feature point are converted into the original coordinate system of the 2D image to obtain the fingertip coordinates. The fingertip feature point detection network is obtained by training a neural network with the effective hand area as input and the relative position coordinates of the fingertips of the different fingers within the effective hand area as output. The network performs image segmentation on the effective hand area and segments out the fingertip of each finger to obtain the feature point of each fingertip. For example, FIG. 6 is a schematic diagram of identifying the fingertip feature points of each finger in the effective hand area shown in FIG. 5. The result is the relative position coordinates of each fingertip in the effective hand area, such as the thumb fingertip coordinates, the index fingertip coordinates, and the middle fingertip coordinates. The fingertip coordinates of each finger are represented by the diagonal corners of the rectangular detection frame corresponding to that fingertip, and the relative coordinates of each fingertip are converted into the original coordinate system of the 2D image to obtain the fingertip coordinates of each finger in that coordinate system.

2. Key division

The key division module divides the piano keyboard area represented by the piano keyboard position coordinates into individual keys, each expressed in coordinates relative to the piano keyboard area. The relative coordinates of each key are converted into the original coordinate system of the 2D image to obtain the key coordinates. The range bounded by the coordinates of a key is the effective area of that key, and morphological processing of the effective area gives the effective edge of each key. The purpose of dividing the keyboard area is to judge, from the fingertip feature points, whether the fingering is correct. In practice, the keys are numbered starting from the first key at one edge of the keyboard, giving the set of keys [K1, K2, K3, ..., K88]. The pixel area of each key in the image is expressed as a polygon. For example, the area of key K1 is expressed as the vertex set [(x_{K1,0}, y_{K1,0}), ..., (x_{K1,n}, y_{K1,n})], where (x_{K1,n}, y_{K1,n}) is a point in the 2D image coordinate system, x_{K1,n} is its abscissa, and y_{K1,n} is its ordinate; the pixels enclosed by the polygon formed by these points are the effective area of key K1. The key division module takes the detected piano keyboard area as input, converts it into a grayscale image, performs morphological operations to remove the influence of lighting, noise, and the like, extracts edges with an edge detection algorithm (such as the Sobel operator), and finally performs connected-domain analysis to obtain the keyboard edges. The piano keyboard has only two types of keys, black and white, and their boundaries are regular line segments. According to the statistical characteristics of the pixels, the edges are assigned to each key, and the intersection points of different edges are the vertices of the polygon that bounds the effective area of the key. Through this division, the key division module establishes a mathematical expression model of each key in the 2D image coordinate system.
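
As a non-limiting illustration, once each key is expressed as a polygon of vertices, deciding whether a pixel belongs to the effective area of a key can be done with a standard ray-casting point-in-polygon test. The specification does not name a particular test; ray casting is chosen here only as an example. The function name and the L-shaped outline used for key K1 are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def point_in_polygon(point: Point, polygon: List[Point]) -> bool:
    """Ray casting: count how many polygon edges a horizontal ray
    starting at the point and going right crosses. Odd means inside."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the ray's height can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Hypothetical outline of a white key with a notch cut out for a black key.
K1 = [(0, 0), (14, 0), (14, 60), (23, 60), (23, 100), (0, 100)]

print(point_in_polygon((10, 30), K1))  # True: narrow part of the key
print(point_in_polygon((18, 30), K1))  # False: inside the black-key notch
print(point_in_polygon((18, 80), K1))  # True: wide part of the key
```

A polygon representation matters here because white keys are not rectangles in the image: a bounding rectangle around K1 would wrongly include the notch occupied by the neighboring black key.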

3. Fingering recognition

The fingering error detection module compares the positions of the fingertip coordinates and the key coordinates in order to bind each fingertip that falls on a key to that key. If the area represented by the fingertip coordinates of a finger overlaps the area represented by the coordinates of a key, the finger is considered to fall on that key and is bound to it, giving the finger-key binding relationship. The finger-key binding relationship for the note being played is compared with the standard binding relationship for the same note in the score database; if they are inconsistent, a fingering error is reported and the user is prompted to correct the fingering.

When the area represented by the fingertip coordinates of a finger overlaps the areas represented by the coordinates of several keys, the same finger is bound to several keys for the note being played. Such a binding relationship is necessarily inconsistent with the standard binding relationship in the score database, so this finger is an obvious fingering error.

The purpose of fingering recognition is to determine which finger has pressed which key, that is, to bind fingers to keys. It relies on the outputs of the key division module and the fingertip feature point detection network. In addition, the sound sensor on a key senses the signal generated when the key is pressed, which determines which key was pressed and therefore which note is currently being played. The effective area of each key is extracted from the output of the key division module, and each detected fingertip feature point is checked in turn to see whether it falls within a key area; if so, the finger is bound to that key. The standard binding relationship between finger and key for the note is looked up in the score database and compared with the predicted binding relationship obtained by fingering recognition. If they are inconsistent, the fingering is judged to be wrong, and the user is prompted to correct it.
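
As a non-limiting illustration, the binding and comparison steps might be sketched as follows. For brevity the keys are simplified to axis-aligned rectangles rather than polygons, and the score database is reduced to a single expected finger for the pressed key; both simplifications, and all names and messages in the code, are assumptions of this sketch.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def overlap(a: Box, b: Box) -> bool:
    """True when two axis-aligned rectangles share a non-empty area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def bind_fingers(fingertips: Dict[str, Box], keys: Dict[str, Box]) -> Dict[str, List[str]]:
    """Bind each finger to every key its fingertip rectangle overlaps."""
    bindings = {}
    for finger, tip in fingertips.items():
        hit = [name for name, box in keys.items() if overlap(tip, box)]
        if hit:
            bindings[finger] = hit
    return bindings


def check_fingering(bindings: Dict[str, List[str]], pressed_key: str,
                    expected_finger: str) -> List[str]:
    """Compare the predicted binding for the pressed key with the standard one."""
    errors = []
    for finger, bound_keys in bindings.items():
        if pressed_key not in bound_keys:
            continue
        if len(bound_keys) > 1:
            # One finger on several keys can never match the standard binding.
            errors.append(f"{finger} touches several keys: {bound_keys}")
        elif finger != expected_finger:
            errors.append(f"{pressed_key} played with {finger}, expected {expected_finger}")
    return errors


keys = {"K40": (100, 0, 123, 100), "K41": (123, 0, 146, 100), "K42": (146, 0, 169, 100)}
tips = {"right_thumb": (105, 70, 118, 85), "right_index": (150, 70, 160, 85)}
bindings = bind_fingers(tips, keys)
print(bindings)                                          # {'right_thumb': ['K40'], 'right_index': ['K42']}
print(check_fingering(bindings, "K40", "right_thumb"))   # []
print(check_fingering(bindings, "K42", "right_middle"))  # ['K42 played with right_index, expected right_middle']
```

Here `pressed_key` stands for the key reported by the sensor, and `expected_finger` stands for the standard binding looked up in the score database for the current note.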

It can be seen from the above embodiments that the present invention uses deep learning to complete the detection and segmentation tasks and obtains the detection networks by training neural networks. The present invention provides a complete set of neural network training methods to obtain the piano keyboard detection network, the hand detection network, the hand shape error detection network, and the fingertip feature point detection network. The present invention analyzes the biomechanical principle behind each hand shape error and fingering error, summarizes the essential visual feature of each error, and uses this feature as the basis for neural network learning and prediction. The feature is then annotated with a rectangular frame (though the annotation is not limited to rectangular frames) to obtain a sample data set, which is divided into a training set, a validation set, and a test set according to a certain ratio (for example, 7:2:1). The training set and validation set are used to train the neural network, and the test set is used to test and evaluate the final network model.
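
As a non-limiting illustration, dividing a sample data set in the ratio 7:2:1 might be sketched as follows. Shuffling before the split and the fixed random seed are assumptions of this sketch, and the function name is hypothetical.

```python
import random
from typing import Iterable, List, Tuple


def split_dataset(samples: Iterable, ratios: Tuple[int, int, int] = (7, 2, 1),
                  seed: int = 0) -> Tuple[List, List, List]:
    """Shuffle the samples and split them into training, validation and test sets."""
    items = list(samples)
    random.Random(seed).shuffle(items)  # fixed seed keeps the split reproducible
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_val = len(items) * ratios[1] // total
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]      # the remainder goes to the test set
    return train, val, test


train, val, test = split_dataset(range(100))
print(len(train), len(val), len(test))  # 70 20 10
```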

According to an embodiment of the present invention, a method is provided for training neural networks to obtain the piano keyboard detection network, the hand detection network, the hand shape error detection network, and the fingertip feature point detection network, including the following parts:

a. Data set collection

Deploy the image acquisition module to collect images of piano playing across a variety of scenes, piano types and models, camera angles, and lighting conditions, with players of different ages, genders, school ages, and skin colors. The data set covers all scenes and all error types.

b. Dataset annotation

Annotation is divided into keyboard annotation, hand annotation, hand shape error annotation, and fingertip feature point annotation. In hand shape error annotation, the essential characteristics of each hand shape error are summarized, and the hand shape error category and position are annotated. Specifically, annotation includes labeling the piano keyboard position coordinates, the hand position coordinates, the hand shape error type and hand shape error position coordinates, and the fingertip feature point coordinates of the different fingers. All labels are in the original coordinate system of the image.

c. Dataset processing

The images in the original data set are processed according to the annotated piano keyboard position coordinates to obtain images containing the piano keyboard area represented by those coordinates, which form the first data set. Next, the piano keyboard area is expanded to obtain the effective area of the piano keyboard; according to the annotated keyboard position coordinates and hand position coordinates, each original image is cropped to its effective area of the piano keyboard to form the second data set, in which the hand position coordinates annotated in the original image are converted into the coordinate system of the effective area of the piano keyboard. Then, the hand area is expanded to obtain the effective hand area; according to the annotated hand position coordinates and hand shape error position coordinates, each original image is cropped to its effective hand area to form the third data set, in which the hand shape error position coordinates annotated in the original image are converted into the coordinate system of the effective hand area. Finally, according to the annotated hand position coordinates and the coordinates of the different fingertip feature points, each original image is cropped to its effective hand area to form the fourth data set, in which the fingertip feature point coordinates annotated in the original image are converted into the coordinate system of the effective hand area.
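
As a non-limiting illustration, converting an annotation from the original coordinate system of the image into the coordinate system of a cropped region amounts to subtracting the upper left corner of the crop. The function names are hypothetical, and the sketch assumes pixel coordinates with no rescaling of the crop.

```python
from typing import Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2)
Point = Tuple[int, int]


def box_to_crop(box: Box, crop: Box) -> Box:
    """Shift a box annotated in original-image pixels into crop coordinates."""
    cx, cy = crop[0], crop[1]
    x1, y1, x2, y2 = box
    return (x1 - cx, y1 - cy, x2 - cx, y2 - cy)


def point_to_crop(point: Point, crop: Box) -> Point:
    """Shift a feature point annotated in original-image pixels into crop coordinates."""
    return (point[0] - crop[0], point[1] - crop[1])


# Second data set: hand label relative to the effective area of the piano keyboard.
keyboard_effective = (240, 340, 1680, 810)
print(box_to_crop((300, 400, 500, 600), keyboard_effective))  # (60, 60, 260, 260)

# Fourth data set: fingertip label relative to the effective hand area.
hand_effective = (270, 370, 530, 630)
print(point_to_crop((410, 590), hand_effective))              # (140, 220)
```

This is the inverse of the conversion applied at inference time, where network outputs relative to a region are mapped back into the original coordinate system of the 2D image.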

d. Model training

(1) Piano keyboard detection network, hand detection network and hand shape error detection network

The present invention treats piano keyboard detection, hand detection, and hand shape error detection as multi-branch detection tasks, designs a multi-task branch network structure, and then trains the networks to obtain the detection networks.

Among them, the piano keyboard detection network has only one task branch, that is, the piano keyboard detection branch.

The hand detection network also has only one task branch, the hand detection branch. The network needs to complete three tasks: the first is to output the coordinate position of the hand, the second is to output whether the hand is a left or right hand, and the third is to output whether the hand is positive or reversed. The present invention divides hands into four categories: positive left hand, positive right hand, reverse left hand, and reverse right hand. A positive left hand means that the back of the left hand faces up, a reverse left hand means that the palm of the left hand faces up, and likewise for the right hand.

The hand shape error detection network has multiple task branches. Each branch is a detection sub-network for one error type and predicts only that type, so there are as many sub-branches as there are error categories. All error detection sub-branch networks share the backbone network. For example, the network is divided into a broken finger detection branch, a palm joint collapse detection branch, a wrist collapse detection branch, and so on.

The YOLOv4 network is trained to convergence on the first data set to obtain the piano keyboard detection network, on the second data set to obtain the hand detection network, and on the third data set to obtain the hand shape error detection network, with the same loss function used for each detection task branch. For a single-task branch network, the loss of the detection task branch is the total loss of the entire network; for a multi-task branch network, the weighted sum of the losses of the detection task branches is the total loss of the entire network. Neural network training and loss function design are common methods in the field and are not repeated here. During training, online data augmentation can be applied to the images, including but not limited to changes in color, contrast and brightness, noise, smoothing blur, flipping, deformation, distortion, and random occlusion and erasure, to improve network robustness.
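The total-loss rule for multi-task branch networks can be illustrated with a minimal sketch. The branch names, loss values and weights below are hypothetical; the source states only that the total loss is the weighted sum of the branch losses.

```python
from typing import Dict


def total_loss(branch_losses: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of per-branch detection losses.

    For a single-branch network with weight 1.0, this reduces to the branch loss itself.
    """
    return sum(weights[name] * loss for name, loss in branch_losses.items())


# Hypothetical per-branch losses for one training step of the hand shape error network.
losses = {"folded_finger": 0.5, "palm_joint_collapse": 0.25, "wrist_collapse": 1.0}
weights = {"folded_finger": 1.0, "palm_joint_collapse": 2.0, "wrist_collapse": 0.5}
loss = total_loss(losses, weights)
```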

It should be noted that the choice of neural network is not limited to YOLOv4; other neural networks can also be used.

(2) Fingertip feature point detection network

Fingertip feature point detection is treated as an image segmentation task. A neural network with ResNet18 as the backbone and a cascaded pyramid network as the detection head is trained to convergence on the fourth data set to obtain the fingertip feature point detection network. The cascaded pyramid network is a cascade of two networks that take multi-scale features as input. The first network, GlobalNet, performs preliminary detection of the fingertip feature points and uses the L2 loss function. The feature maps generated by GlobalNet are then passed through convolutional layers and fed to the second network, RefineNet, which refines the predicted feature points to produce more accurate results.

It can be seen from the above embodiments that the present invention has the following advantages: 1. It is fast and efficient: 2D images require little computation, and the algorithm is lightweight and simple while performing well. 2. It accurately outputs the error type and error position, and distinguishes hand shape errors from fingering errors, so that error correction is more targeted. 3. Each model is trained in a data-driven way, with no need to build a standard comparison database empirically, which gives high robustness. 4. Results are predicted in a top-down manner, from coarse-grained keyboard detection to fine-grained hand shape error detection and fingering error recognition, with multiple networks cascaded; at the same time, each sub-network adopts multi-task branches, which achieves higher performance.

As the background of the present invention makes clear, identifying playing errors plays an important role in piano teaching and training and can significantly improve the quality of piano teaching. The evaluation of a practitioner's piano performance usually covers at least two aspects, notes and hand movements, where notes may include factors such as the frequency spectrum of the fundamental and overtones, strength, speed and rhythm. Whether a played note is correct can be judged by converting the audio signal captured during playing into audio data and comparing it with the standard audio data whose time information corresponds. In the present invention, "standard audio data" and "standard hand data" refer to the "reference audio data" and "reference hand data" against which user audio data and user hand data are compared to judge the user's piano playing result.

Hand movements include two aspects: fingering and hand shape. Fingering determines whether the correct finger is used to play the corresponding note during repertoire practice, and covers the position (or position change) of a single finger and the relative position changes of multiple fingers. Common fingerings include, for example, straight fingering (one finger corresponds to one key), finger passing-under (one finger passes under one or more other fingers to play higher notes), finger crossing (one finger steps over one or more other fingers to play lower notes), finger extension, finger contraction, finger alternation, and so on. Hand shape determines whether problems such as folded fingers, fingertips not standing, palm joint collapse, wrist shaking, finger lifting and finger tension occur when any note is played. Figure 7 shows six common wrong hand movements in piano practice and the corresponding correct hand movements: Figure 7A shows folded fingers, Figure 7B shows fingertips not standing, Figure 7C shows palm joint collapse, Figure 7D shows wrist shaking, Figure 7E shows finger lifting, and Figure 7F shows finger tension, each with the corresponding correct hand movement. Changes in hand movements produce different sound effects, strongly influence the coherence, rhythm, speed and timbre of the notes, and are key to a good performance.

According to an embodiment of the present invention, a single palm includes at least 21 key joint points, and the hand data of the palm can be characterized by the coordinate positions or relative positions of these 21 key joint points. Thanks to the development of deep learning, a trained hand model (i.e., a neural network model) can be used to identify the coordinate positions or relative positions of the key joint points of the player's hands, that is, the player's hand data, which is then compared with the standard playing hand data to judge whether the playing hand movements are accurate.

FIG. 8 shows a schematic diagram of 21 key joint points in a single palm according to an embodiment of the present invention. As shown in FIG. 8, 21 key joint points can be selected from a single palm and numbered 0 to 20, where [0, 1, 2, 3, 4] represent the 5 key joint points of the thumb from the wrist to the fingertip; [5, 6, 7, 8] represent the 4 key joint points of the index finger from the wrist to the fingertip; [9, 10, 11, 12] represent the 4 key joint points of the middle finger from the wrist to the fingertip; [13, 14, 15, 16] represent the 4 key joint points of the ring finger from the wrist to the fingertip; and [17, 18, 19, 20] represent the 4 key joint points of the little finger from the wrist to the fingertip.

Hand data can be represented by the coordinate position of each key joint point, or by the relative position of each key joint point. In one embodiment, the "0" joint point of the thumb can be selected as the origin, and the positions of the other joint points can be expressed as coordinates relative to that "0" joint point, where the coordinate position of each key joint point can be represented by plane coordinates (x, y). In another embodiment, (x, y, v) can also be used to represent the coordinates of a joint point, where v indicates whether the joint point is occluded: v=1 means that the joint point is not occluded, and v=0 means that it is occluded by other parts. In one embodiment, "left" or "right" may also be marked to indicate whether the key joint point belongs to the left hand or the right hand. More generally, the relative position of each key joint point of the hand can be expressed relative to any chosen joint point.
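As an illustration of this representation, the sketch below groups the joint indices as described for FIG. 8 and converts absolute (x, y, v) coordinates into positions relative to joint "0". The names `FINGER_JOINTS` and `relative_positions` and the sample coordinates are assumptions made for this example only.

```python
from typing import Dict, List, Tuple

# Index groups as described for FIG. 8 (numbered 0 to 20, wrist to fingertip).
FINGER_JOINTS: Dict[str, List[int]] = {
    "thumb": [0, 1, 2, 3, 4],
    "index": [5, 6, 7, 8],
    "middle": [9, 10, 11, 12],
    "ring": [13, 14, 15, 16],
    "little": [17, 18, 19, 20],
}

# (x, y, v): plane coordinates plus occlusion flag (v=1 not occluded, v=0 occluded).
Joint = Tuple[float, float, int]


def relative_positions(joints: List[Joint], origin_index: int = 0) -> List[Joint]:
    """Express every joint relative to the chosen origin joint, keeping the v flag."""
    ox, oy, _ = joints[origin_index]
    return [(x - ox, y - oy, v) for x, y, v in joints]


# Hypothetical coordinates for the first three joints only, to keep the example short.
sample = [(10.0, 20.0, 1), (13.0, 24.0, 1), (10.0, 15.0, 0)]
relative = relative_positions(sample)
```

Relative coordinates make the hand data independent of where the hand sits in the image, which is convenient when comparing user hand data against standard hand data.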

The hand model is obtained by training a neural network model with hand images as input data and the hand data in those images as output data. In one embodiment, since the hand images in a piano performance form a time series, the neural network in the hand model can be a Recurrent Neural Network (RNN) or a Long Short-Term Memory (LSTM) neural network. An RNN builds on the ordinary multi-layer BP neural network by adding lateral connections between the units of the hidden layer, passing the values of the neural units at the previous time step to the current neural units through a weight matrix, so that the network has a memory function. RNNs are well suited to context-dependent NLP or time series machine learning problems. However, although an RNN has memory, gradient explosion or gradient vanishing prevents it from retaining content that lies too far away in the sequence. Therefore, according to an embodiment of the present invention, when the sampling interval is long, an LSTM is used to recognize the hand images. On the basis of the ordinary RNN, the LSTM adds a memory cell to each neural unit of the hidden layer so that the memory information in the time series is controllable; at each time step, several controllable gates (forget gate, input gate, candidate gate, output gate) control how much of the previous information and the current information is remembered or forgotten, giving the RNN a long-term memory function.

The training set of the hand model can include hand pictures from a variety of samples, for example, hand images showing different hand movements from people of different ages (e.g., the elderly, adults, children) and different genders (e.g., male, female). The hand movements in the hand images are not limited to playing the piano and can include various motions, such as making a fist, extending the whole palm, pushing, pulling, raising the thumb, and so on. The hand data of the hand images in the training set (such as the coordinate positions or relative positions of key joint points) can be manually annotated or obtained from an existing database. Using the trained hand model (i.e., the neural network model), the coordinate positions or relative positions of the key joint points of the player's hands in a hand image, that is, the user's hand data, can be identified.

According to the description of the foregoing embodiments of the present invention, whether there are playing errors such as hand shape errors and/or fingering errors can be identified through a 2D image including a complete piano keyboard.

Based on the above research, the present invention provides an intelligent piano training method. From the acquired audio information and video information of the user playing the piano, the method extracts user audio data from the audio information at a certain time interval and compares it with the corresponding standard audio data in the audio database to obtain the matching degree between the user audio data and the standard audio data. It then captures, from the video information at a certain time interval, the user hand image corresponding to the user audio data, identifies the user hand data in that image through the hand model, and compares it with the corresponding standard hand data in the hand database to obtain the matching degree between the user hand data and the standard hand data; and/or it captures from the video information a 2D image containing the complete piano keyboard corresponding to the user audio data and identifies whether there is a playing error in the 2D image using the recognition method for assisting piano teaching described in the foregoing embodiments. Finally, the playing result is fed back to the user according to the matching degrees between all of the user's audio data and the corresponding standard audio data, the matching degrees between all of the user's hand data and the corresponding standard hand data, and/or the user's playing errors.

FIG. 9 shows a smart piano training method according to an embodiment of the present invention. As shown in Figure 9, the method includes the following steps:

S310: Acquire audio information and video information of the user playing the piano.

As mentioned above, note practice and hand movement practice are the two main aspects of piano practice, so both audio information and video information need to be obtained when the user plays the piano. In some embodiments, the audio and video information of the user playing the piano may be collected through audio and video collection devices (e.g., a microphone and a camera, or a camera with a microphone). In this case, the collected audio information can be preprocessed by removing silent segments and denoising, to avoid external interference and improve the accuracy of scoring. In other embodiments, for audio played on an electronic piano, the MIDI digital audio signal of the user's playing may be collected by connecting to the MIDI (Musical Instrument Digital Interface) port on the electronic piano. MIDI digital audio signals are binary data output by an electronic piano that represent the notes played and can be recognized and processed by a computer. For video information, the user's hand movements while playing the piano can be captured by a camera or another device with an image capture function. The movements of both hands can be captured by the same camera, or the movements of the left and right hands can be captured separately from different angles by multiple cameras; in the latter case, the video information of the left and right hands can be spliced together.

In one embodiment, a pressure sensor installed under the keys may also collect the touch strength of the user playing the piano, so as to combine with the above audio and video information to jointly determine the score of the user playing the piano.

S320: Extract user audio data from the audio information and compare it with the corresponding standard audio data in the audio database to obtain the degree of matching between the user audio data and the corresponding standard audio data.

The audio database contains audio data of a large number of standard piano playing pieces (for example, pieces played by piano teachers or professionals, or pieces automatically generated by artificial intelligence based on musical scores). Standard audio data can be extracted from the audio information of standard piano performance pieces according to a certain time interval, and stored in units of pieces to form an audio database. The audio data in the audio database may at least include information such as track name, extraction time, musical note, fundamental frequency, and sound intensity.

In one embodiment, standard audio data may be extracted from the audio information of a standard piano performance at time intervals of 10 ms or less and stored in the audio database. The current Guinness World Records record for the fastest piano playing is 14 key presses in 1 second. Even taking 20 key presses in 1 s as an example, a single key press lasts 50 ms. Therefore, extracting audio data from the audio information of a piano performance at 10 ms intervals can cover all the notes produced in the performance.
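The arithmetic behind the 10 ms interval can be checked with a short sketch; the function name `samples_per_key_press` is hypothetical.

```python
def samples_per_key_press(presses_per_second: float, interval_ms: float) -> float:
    """Number of extraction intervals that fit into the duration of one key press."""
    press_duration_ms = 1000.0 / presses_per_second
    return press_duration_ms / interval_ms


# The conservative example from the text: 20 key presses per second, 10 ms extraction interval.
n = samples_per_key_press(20, 10)
```

Under these assumptions every key press spans several extraction points, so no note falls entirely between two samples.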

FIG. 10 shows a schematic diagram of standard audio data storage in the audio database of one embodiment. As shown in FIG. 10, the audio database may include a first-level table and second-level tables. The first-level table stores the basic information of the standard piano performance pieces, including the serial number, name, level, key, and the serial number of the second-level table holding the audio data. Each second-level table stores the audio data of one track, including the extraction time, the notes extracted from the audio information of the track at fixed time intervals, and their fundamental frequency and sound intensity. As shown in FIG. 10(A), several standard piano performance pieces are stored in the first-level table. For example, track 0001 is named "Song of Spring", is at the first level and in A major, and its audio data is stored in second-level table 0001; track 0036 is named "Rondo", is in C major, and its audio data is stored in second-level table 0036; track 0180 is named "Canon", has the level "other", is in the key of D, and its audio data is stored in second-level table 0180; and so on. As shown in FIG. 10(B), second-level table 0001 stores all the notes extracted at 10 ms intervals over the complete performance time of track 0001, together with the corresponding fundamental frequency and sound intensity. For example, at time "0.000" there is no note, the fundamental frequency is 0, and the sound intensity is 0; at time "0.010" the note is G4, the fundamental frequency is 391 Hz, and the sound intensity is 10 dB; at time "0.020" the note is still G4, the fundamental frequency is 391 Hz, and the sound intensity is 15 dB; at time "0.030" the note is still G4, the fundamental frequency is 391 Hz, and the sound intensity is 20 dB; ...; at time "0.250" the note is D4, the fundamental frequency is 293 Hz, and the sound intensity is 10 dB; and so on.

User audio data can be extracted from the audio information at regular intervals, and the extracted user audio data can be compared with the corresponding standard audio data in the audio database to obtain the matching degree between the user audio data and the corresponding standard audio data.

In one embodiment, the user audio data may be extracted from the audio information at a first time interval. The first time interval may be the same as the time interval used to extract standard audio data from the standard performance pieces in the audio database, or may be an integer multiple of that interval. The extracted user audio data may include at least the extraction time, the notes, and their fundamental frequencies and sound intensities. The user audio data can be matched to the standard audio data in the audio database through its extraction time. Taking the audio database in FIG. 10 as an example, when the user plays "Song of Spring", user audio data can be extracted from the collected audio information at 30 ms intervals. If the user audio data extracted at 0.030 s has the note G4, a fundamental frequency of 391 and a sound intensity of 15, then the corresponding standard audio data is the audio data at 0.030 s (including the note and its fundamental frequency and sound intensity) in second-level table 0001, which the first-level table associates with track 0001.
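The time-based correspondence can be sketched as a lookup keyed by extraction time. The rows below follow the values described for FIG. 10(B); the `AudioSample` type, the use of integer milliseconds as keys, and the function names are assumptions made for illustration.

```python
from typing import Dict, List, NamedTuple, Optional


class AudioSample(NamedTuple):
    note: Optional[str]       # None when no note sounds
    fundamental_hz: float
    intensity_db: float


# Second-level table 0001 ("Song of Spring"), keyed by extraction time in milliseconds.
SECONDARY_TABLE_0001: Dict[int, AudioSample] = {
    0: AudioSample(None, 0, 0),
    10: AudioSample("G4", 391, 10),
    20: AudioSample("G4", 391, 15),
    30: AudioSample("G4", 391, 20),
}


def user_extraction_times(duration_ms: int, standard_interval_ms: int, multiple: int) -> List[int]:
    """Extraction times when the first interval is an integer multiple of the standard one."""
    step = standard_interval_ms * multiple
    return list(range(0, duration_ms + 1, step))


def standard_sample_at(table: Dict[int, AudioSample], time_ms: int) -> Optional[AudioSample]:
    """Standard audio data whose extraction time equals that of the user audio data."""
    return table.get(time_ms)


times = user_extraction_times(90, 10, 3)   # 30 ms first interval over a 90 ms span
match = standard_sample_at(SECONDARY_TABLE_0001, 30)
```

Integer millisecond keys avoid floating-point equality problems that would arise when comparing times such as 0.030 s directly.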

Before playing the piano, the user can select the piece to be played from the database; alternatively, after the user starts to play, the system can intelligently identify the piece being played and query its standard audio data in the audio database. The user audio data is then compared with the corresponding standard audio data in the audio database to obtain the matching degree between the user audio data and the corresponding standard audio data.

In one embodiment, the audio database may store standard audio data of the same track in different genres. After the user starts to play the piano, the system intelligently identifies the piece and genre being played, queries the standard audio data corresponding to that piece and genre in the audio database, and then compares the user audio data with the corresponding standard audio data in the audio database to obtain the matching degree between the user audio data and the corresponding standard audio data.

In one embodiment, different weight values may be set for different information in the audio data when calculating the degree of matching between the user audio data and the corresponding standard audio data. For example, the weight of a note's fundamental frequency can be set greater than the weight of its sound intensity, so that the fundamental frequency accounts for a larger proportion of the matching degree. In one embodiment, an error tolerance interval can also be set for the standard audio data in the audio database; for example, an error tolerance interval of ±10 Hz is set for the fundamental frequency of a note. When the user audio data falls within this interval, the fundamental frequency of the note in the user audio data can be considered basically consistent with that of the note in the corresponding standard audio data.
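One possible way to combine field weights with a tolerance interval is sketched below. The source does not give a matching formula, so the weighted-fraction formula, the weight values, the intensity tolerance and all names here are assumptions; only the ±10 Hz frequency tolerance and the rule that the fundamental frequency outweighs the sound intensity come from the text.

```python
from typing import NamedTuple, Optional


class AudioSample(NamedTuple):
    note: Optional[str]
    fundamental_hz: float
    intensity_db: float


def audio_match_degree(user: AudioSample, standard: AudioSample,
                       w_note: float = 0.4, w_freq: float = 0.4, w_intensity: float = 0.2,
                       freq_tol_hz: float = 10.0, intensity_tol_db: float = 3.0) -> float:
    """Weighted fraction of fields that agree with the standard within tolerance (0.0 to 1.0)."""
    note_ok = user.note == standard.note
    freq_ok = abs(user.fundamental_hz - standard.fundamental_hz) <= freq_tol_hz
    intensity_ok = abs(user.intensity_db - standard.intensity_db) <= intensity_tol_db
    total = w_note + w_freq + w_intensity
    return (w_note * note_ok + w_freq * freq_ok + w_intensity * intensity_ok) / total


standard = AudioSample("G4", 391, 20)
user = AudioSample("G4", 395, 15)   # 4 Hz off (within ±10 Hz), 5 dB off (outside 3 dB)
degree = audio_match_degree(user, standard)
```

A graded score per field (for example, decreasing with the size of the deviation) would be an equally valid design; the binary per-field check is used here only to keep the sketch short.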

In one embodiment, the audio database may be further subdivided into a single-note database and a repertoire database, where the single-note database stores the standard audio data corresponding to single notes and the repertoire database stores the standard audio data corresponding to a large number of standard piano pieces. Thus, when the user practices the piano, it is possible to evaluate not only the audio data produced when the user plays a single note, but also the audio data produced when the user plays a given piece of music.

S330: Based on the degree of matching between the user audio data and the corresponding standard audio data, capture from the video information an image of the user's hand corresponding to the user audio data.

In piano playing, the judgment of hand movements is meaningful only when the notes are played correctly or substantially correctly. Therefore, according to an embodiment of the present invention, based on the degree of matching between the user audio data and the corresponding standard audio data, it is determined whether the user hand image corresponding to the user audio data needs to be intercepted from the video information.

In one embodiment, an audio matching degree threshold may be set. The threshold can be set by the user, set by default by the system, or set intelligently by the system, when networked, after collecting statistics on how other piano players perform the same piece. When the matching degree between the extracted user audio data and the corresponding standard audio data is greater than or equal to the audio matching degree threshold, the notes played by the user are correct or basically correct, and the user hand image corresponding to the user audio data can then be captured from the video information to judge the user's hand movement. When the matching degree is less than the audio matching degree threshold, the note played by the user is wrong, so there is no need to judge the hand movement.

Video generally refers to various technologies in which a series of static images are captured, recorded, processed, stored, transmitted and reproduced in the form of electrical signals. So a video is actually a series of images arranged in chronological order. When the continuous image changes exceed 24 frames per second, according to the principle of persistence of vision, the human eye cannot distinguish a single static image, and it appears to be a smooth and continuous visual effect. Therefore, the user's hand image can be intercepted from the video information at regular intervals. The user's hand image may correspond to the user's audio data in the video information through its interception time information.

In one embodiment, an image of the user's hand may be captured from the video information at a second time interval. The second time interval may be the same as the time interval for extracting user audio data from the audio information, or may be an integer multiple of that interval. If the time information is consistent, the user hand image captured at a given moment is the hand image corresponding to the user audio data produced at that moment. For example, if user audio data with a note of G4, a sound intensity of 15, and a pitch of 1750 is extracted at 0.030 s, the user hand image captured from the video information at 0.030 s is the hand image corresponding to that user audio data. In one embodiment, the second time interval is not greater than 30 ms.
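The gating by audio matching degree and the alignment by time can be sketched together as follows. The threshold value, the per-time matching degrees and the function name `capture_times` are hypothetical.

```python
from typing import Dict, List


def capture_times(match_by_time_ms: Dict[int, float], threshold: float,
                  second_interval_ms: int) -> List[int]:
    """Times at which a hand image should be captured from the video.

    A time qualifies when it lies on the second-interval grid and the audio
    matching degree at that time is at least the threshold.
    """
    return [t for t, degree in sorted(match_by_time_ms.items())
            if t % second_interval_ms == 0 and degree >= threshold]


# Hypothetical audio matching degrees for user audio data extracted every 30 ms.
matches = {0: 0.9, 30: 0.5, 60: 0.8, 90: 0.95}
same_interval = capture_times(matches, threshold=0.8, second_interval_ms=30)
double_interval = capture_times(matches, threshold=0.8, second_interval_ms=60)
```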

In one embodiment, in order to reduce the calculation amount of the hand model, the captured user hand images may be screened, and only the user hand images including the piano key area are selected for identifying the user hand data.

In one embodiment, in addition to relying on the degree of matching between the user audio data and the corresponding standard audio data, the user can also choose whether to capture the user's hand image and identify the user's hand data.

In one embodiment of the present invention, a 2D image containing the complete piano keyboard is captured from the video information for identifying possible playing errors.

S340: Identify the user hand data in the user hand image through the hand model and compare it with the corresponding standard hand data in the hand database to obtain the matching degree between the user hand data and the corresponding standard hand data, and/or identify from the 2D image whether there is a user playing error.

As mentioned above, the hand model is obtained by training a neural network with hand images as input data and the hand data in those images as output data. The hand model can identify the user hand data in the user hand image, such as the coordinate positions or relative positions of hand joint points, including the coordinate positions or relative positions of the 21 key joint points in each of the left and right hands, or of more or fewer than 21 joint points. In one embodiment, the user hand data may further include the coordinate position or relative position of the wrist.

In one embodiment, the hand model may use a trained recurrent neural network or long short-term memory neural network. In one embodiment, when the user's hand image includes the piano key area, the piano key area can first be detected against the background of the field of view and a key candidate box drawn; the hand model then performs hand key point regression detection within the key candidate box to extract the user hand data.

The hand database contains a large amount of standard hand data. Standard hand images can be captured at a certain time interval from the standard performance video of a piano piece; the standard hand data in those images is then identified by the hand model and stored in units of pieces to form the hand database. In one embodiment, the standard hand images can be captured from the video information of the standard piano piece at the same time interval used to extract standard audio data from the standard piece in the audio database, or at an integer multiple of that interval, after which the standard hand data in the standard hand images is recognized by the hand model and stored in the hand database. Standard hand data may include information such as the time (i.e., the capture time) and the coordinate positions or relative positions of the key joint points of the left and right hands. The user's hand data can be matched to the standard hand data in the hand database through its time information.

FIG. 11 shows a schematic diagram of standard hand data storage in the hand database of an embodiment. As shown in FIG. 11, the hand database may include a first-level table and second-level tables. The first-level table (as shown in FIG. 11(A)) stores the basic information of the standard piano performance pieces, including the serial number, name, level, key, and the serial number of the second-level table holding the hand data. Each second-level table (as shown in FIG. 11(B)) stores the hand data of one track, including the capture time and data such as the relative positions of the 21 joint points of the left and right hands in the hand images captured from the video information of the track.

In one embodiment, the audio database can be associated with the hand database; that is, standard audio data and standard hand data with consistent time information are stored in association with each other, in units of tracks, to form a comprehensive database. When the time interval for extracting standard audio data from the standard performance differs from the time interval for capturing standard hand images from it, or when the former is an integer multiple of the latter, only the standard hand data whose time information matches that of the standard audio data is stored.
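The association by time information can be sketched as a join on the capture time. The string placeholders standing in for audio and hand records, and the name `build_comprehensive_table`, are assumptions; real entries would hold the audio fields and joint coordinates described above.

```python
from typing import Dict, Optional, Tuple


def build_comprehensive_table(audio: Dict[int, str],
                              hand: Dict[int, str]) -> Dict[int, Tuple[str, Optional[str]]]:
    """Attach to each standard audio entry the hand entry with the same time (ms), if any.

    Hand entries whose time matches no audio entry are not stored.
    """
    return {t: (audio[t], hand.get(t)) for t in sorted(audio)}


# Hypothetical track: audio extracted every 30 ms, hand images captured every 10 ms.
audio_rows = {0: "a0", 30: "a30", 60: "a60"}
hand_rows = {t: f"h{t}" for t in range(0, 61, 10)}
table = build_comprehensive_table(audio_rows, hand_rows)
```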

FIG. 12 shows a schematic diagram of storage in the comprehensive database of one embodiment. As shown in FIG. 12, the comprehensive database may include a first-level table and second-level tables. The first-level table (as shown in FIG. 12(A)) stores the basic information of the standard piano performance pieces, including the serial number, name, level, key, and the serial number of the second-level table holding the comprehensive data; each second-level table (as shown in FIG. 12(B)) stores the audio data and hand data of one track.

By comparing the extracted user hand data with the corresponding standard hand data in the hand database, the degree of matching between the user's hand data and the corresponding standard hand data can be obtained. In one embodiment, an error tolerance interval may also be set for the standard hand data in the hand database. When the user's hand data falls within the error tolerance interval, the user's hand data can be considered basically the same as the standard hand data.

In one embodiment, a hand data matching degree threshold may be set. The threshold can be set by the user, set by default by the system, or set intelligently by the system, when networked, after collecting statistics on how other piano players perform the same piece. When the matching degree between the user's hand data and the standard hand data is greater than or equal to the hand data matching degree threshold, the user's hand movement is correct or basically correct; when the matching degree between the extracted user hand data and the corresponding standard hand data is less than the threshold, the user's hand movement is wrong. In that case, the user hand image corresponding to the user hand data can be saved automatically so that the user can review it. In one embodiment, wrong hand movements can also be displayed to the user. For example, a virtual hand outline is generated through animation rendering; when the fingering is wrong, the wrong finger is displayed to the user, and when the hand shape is wrong, the wrong hand area (e.g., palm, fingertips) is displayed to the user. Alternatively, the hand shape error categories and positions, the fingering errors and positions, and so on can be displayed to the user. Since the foregoing embodiments describe in detail how playing errors are identified from 2D images, the description is not repeated here.

S350: Based on the degree of matching between the user audio data and the corresponding standard audio data, the degree of matching between the user's hand data and the corresponding standard hand data, and/or the user's playing errors, feed the playing result back to the user.

The higher the matching degree between the user audio data and the corresponding standard audio data, the more accurate the notes played by the user; similarly, the higher the matching degree between the user's hand data and the corresponding standard hand data, the more accurate the user's hand movements. Therefore, these two matching degrees together allow the user's piano playing level to be assessed comprehensively from two aspects: notes and hand movements.

In one embodiment, the degree of matching between the user audio data and the corresponding standard audio data and the degree of matching between the user's hand data and the corresponding standard hand data may each be assigned a weight in determining the user's piano performance score, so that scoring rules can be personalized to the user's playing habits. For example, if the notes played by a user are accurate but the hand movements are often wrong, a larger weight can be set for the matching degree between the user's hand data and the corresponding standard hand data, so as to give the user more feedback on hand movements.
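A minimal sketch of such weighted scoring is given below. It assumes both matching degrees are normalized to the range [0, 1], that the two weights sum to 1, and that the score is reported out of 100; none of these conventions is stated in the source, and the function name `performance_score` is hypothetical.

```python
def performance_score(
    audio_match: float,
    hand_match: float,
    hand_weight: float = 0.5,
) -> float:
    """Combine two matching degrees in [0, 1] into a score out of 100."""
    if not 0.0 <= hand_weight <= 1.0:
        raise ValueError("hand_weight must lie in [0, 1]")
    audio_weight = 1.0 - hand_weight
    return 100.0 * (audio_weight * audio_match + hand_weight * hand_match)


# A player with accurate notes (0.9) but weaker hand movements (0.6):
balanced = performance_score(0.9, 0.6)                      # equal weights
hand_focused = performance_score(0.9, 0.6, hand_weight=0.7)  # emphasize hand movements
print(round(balanced, 2), round(hand_focused, 2))
```

Raising `hand_weight` lowers this player's score, so the feedback reflects the hand movement problems more strongly.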

In one embodiment, standard key touch force data is also stored in the database. The collected user key touch force data can be compared with the standard key touch force data to obtain the matching degree between the user's key touch force and the standard key touch force, and the user's piano playing level is then assessed comprehensively in combination with the matching degree between the user audio data and the corresponding standard audio data, the matching degree between the user's hand data and the corresponding standard hand data, and/or the user's playing errors.

Through the above intelligent piano training method, users practicing the piano on their own without a teacher's guidance can learn promptly and accurately whether their notes and hand movements are correct, which helps them correct mistakes in time and practice more effectively.

In some embodiments, the results of the user's piano playing may be fed back to the user in a delayed manner. For example, at the end of the user's piano playing, the comprehensive score for the piece can be displayed; the user's specific audio errors and hand movement errors during playing can also be recorded in detail and compiled into a score report, so that the user can practice in a targeted way or correct mistakes in piano playing; the current score or score report can also be compared with the user's previous performance records or other users' performance records to comprehensively evaluate the user's current playing level.

In other embodiments, the user audio data and the user hand data may be compared at the same time, and the playing result is fed back to the user based on the matching degree between the user audio data and the corresponding standard audio data, the matching degree between the user hand data and the corresponding standard hand data, and/or the user's playing errors. In this case, the user audio data and the user's hand image can be extracted and analyzed in real time while the audio information and video information of the user playing the piano are being acquired.

In some cases, the overall performance result for a piece may be fed back to the user after the user has finished playing it.

FIG. 13 shows a smart piano training method according to an embodiment of the present invention. As shown in Figure 13, the method includes the following steps:

S710: Acquire audio information and video information of the user playing the piano.

S720, extracting user audio data from the audio information, and comparing it with the corresponding standard audio data in the audio database to obtain a degree of matching between the user audio data and the corresponding standard audio data.

Steps S710-S720 are similar to the above-mentioned steps S310-S320, and are not repeated here.

S730, compare the matching degree between the user audio data and the corresponding standard audio data with the specified threshold N1. When the matching degree is greater than or equal to the specified threshold N1, execute step S740; when the matching degree is less than the specified threshold N1, execute step S760.

S740, intercept the user's hand image corresponding to the user audio data from the video information. In some embodiments, a 2D image including the complete piano keyboard may also be intercepted from the video information to identify whether there is a playing error.

S750, identify the user's hand data in the user's hand image through the hand model, and compare it with the corresponding standard hand data in the hand database to obtain the matching degree between the user's hand data and the corresponding standard hand data, and/or identify from the 2D image whether there is a playing error.

S760, judge whether the user's piano playing has ended. If it has ended, go to step S770; if it has not ended, repeat steps S710-S760.

S770, based on the degree of matching between all the user audio data generated while the user plays the piano and the corresponding standard audio data, the degree of matching between all the user hand data generated while the user plays the piano and the corresponding standard hand data, and/or the user's playing errors, feed back the playing result to the user.

By feeding back the overall result of the user's piano playing after the playing has ended, the above method helps the user grasp, as a whole, the complete piece or the passage that was played.
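The control flow of steps S710-S770 can be sketched as below. This is only an illustration of the gating logic: the `Segment` container and the `match_audio` and `match_hand` callables are hypothetical stand-ins for the audio comparison and the hand model, which the sketch does not implement.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, Tuple


@dataclass
class Segment:
    """One captured slice of the performance (hypothetical container)."""
    audio: object
    video: object


def run_deferred_feedback(
    segments: Iterable[Segment],
    match_audio: Callable[[object], float],
    match_hand: Callable[[object], float],
    n1: float,
) -> List[Tuple[float, Optional[float]]]:
    """Collect (audio_match, hand_match) per segment; hand_match is None
    when the audio match is below N1 and steps S740-S750 are skipped."""
    results = []
    for seg in segments:                        # S710, repeated until playing ends (S760)
        audio_match = match_audio(seg.audio)    # S720
        hand_match = None
        if audio_match >= n1:                   # S730
            hand_match = match_hand(seg.video)  # S740-S750
        results.append((audio_match, hand_match))
    return results                              # input to the S770 feedback


# Stub matchers that simply return the stored value, for demonstration only.
segs = [Segment(audio=0.9, video=0.7), Segment(audio=0.4, video=0.8)]
results = run_deferred_feedback(segs, match_audio=lambda a: a, match_hand=lambda v: v, n1=0.6)
print(results)  # [(0.9, 0.7), (0.4, None)]
```

The second segment falls below N1, so its hand image is never analyzed, which matches the branch from S730 directly to S760.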

In some embodiments, when the matching degree between the user audio data and the corresponding standard audio data is less than a specified threshold, the user may be prompted with the key information corresponding to the standard audio data, for example, a virtual keyboard is generated through animation rendering and prompts the correct piano keys; and/or when the degree of matching between the user's hand data and the corresponding standard hand data is less than a specified threshold, the user may be prompted with the hand movement corresponding to the standard hand data, for example, a virtual hand outline is generated through animation rendering and prompts the correct hand movements.

FIG. 14 shows a smart piano training method according to an embodiment of the present invention. As shown in Figure 14, the method includes the following steps:

S810: Acquire audio information and video information of the user playing the piano.

S820, extracting user audio data from the audio information, and comparing it with the corresponding standard audio data in the audio database to obtain a degree of matching between the user audio data and the corresponding standard audio data.

S830, compare the matching degree between the user audio data and the corresponding standard audio data with the specified threshold N1. When the matching degree is greater than or equal to the specified threshold N1, execute step S840; when the matching degree is less than the specified threshold N1, prompt the user with the key information corresponding to the standard audio data, and execute step S870.

S840, intercept the user's hand image corresponding to the user's audio data from the video information. In some embodiments, a 2D image including a complete piano keyboard may also be intercepted from the video information to identify whether there is a playing error.

S850, identify the user's hand data in the user's hand image through the hand model, and compare it with the corresponding standard hand data in the hand database to obtain the matching degree between the user's hand data and the corresponding standard hand data, and/or identify from the 2D image whether there is a playing error.

S860, compare the matching degree between the user's hand data and the corresponding standard hand data with the specified threshold N2. When the matching degree is less than the specified threshold N2, prompt the user with the hand movement corresponding to the standard hand data, and/or with the hand shape error and its position or the fingering error and its position corresponding to the playing error.

S870, judge whether the user's piano playing has ended. If so, go to step S880; if not, repeat steps S810-S870.

S880, based on the degree of matching between all the user audio data generated while the user plays the piano and the corresponding standard audio data, the degree of matching between all the user hand data generated while the user plays the piano and the corresponding standard hand data, and/or the user's playing errors, feed back the playing result to the user.

Through the above method, real-time guidance and demonstration can be provided for the wrong notes and/or hand movements in the user's piano playing, so that the user can grasp the correct notes and/or hand movements in time and improve practice efficiency.
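The real-time prompting decisions of steps S830 and S860 can be sketched as a single function. The prompt strings and the function name `prompts_for_segment` are hypothetical placeholders; an actual system would render a virtual keyboard or hand outline rather than return text.

```python
from typing import List, Optional


def prompts_for_segment(
    audio_match: float,
    hand_match: Optional[float],
    n1: float,
    n2: float,
) -> List[str]:
    """Return the prompts to show for one segment of playing."""
    if audio_match < n1:
        # S830: the note is wrong, so prompt the key and skip hand analysis.
        return ["show correct key"]
    if hand_match is not None and hand_match < n2:
        # S860: the note is right but the hand movement is not.
        return ["show correct hand movement"]
    return []
```

For example, with N1 = 0.6 and N2 = 0.7, an audio match of 0.9 with a hand match of 0.5 produces only the hand movement prompt, while an audio match of 0.4 produces only the key prompt.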

To sum up, the present invention uses the hand model to accurately recognize the hand data in the user's hand image and to identify playing errors, and judges the playing result as a whole by comprehensively considering the audio data and hand data generated when the user plays the piano. This enables users to obtain effective feedback on notes and hand movements during practice without the guidance of professional teachers, which helps users find and correct mistakes and improves practice efficiency. In addition, prompting the user with correct key information and/or hand movements in real time gives the user timely and correct demonstration and guidance, which helps the user learn to play the piano independently.

On the other hand, the present invention also provides an intelligent piano training system implementing the above method. The system includes: an audio and video acquisition unit for acquiring audio information and video information of the user's piano playing; a data extraction unit for extracting user audio data from the audio information, and for intercepting, from the video information, user hand images corresponding to the user audio data and/or 2D images containing the complete piano keyboard; a data recognition unit for recognizing the user hand data in the user hand image through a hand model, and/or for identifying whether there is a playing error from the 2D image through the intelligent recognition system for assisting piano teaching, wherein the hand model is obtained by training a neural network with hand images as input data and the hand data in the hand images as output data; a data matching unit for comparing the user audio data with the corresponding standard audio data in the audio database to obtain the matching degree between the user audio data and the corresponding standard audio data, and for comparing the user's hand data with the corresponding standard hand data in the hand database to obtain the matching degree between the user's hand data and the corresponding standard hand data; and a user interaction unit for feeding back the playing result to the user based on the degree of matching between the user audio data and the corresponding standard audio data, the degree of matching between the user's hand data and the corresponding standard hand data, and/or the user's playing errors.

In one embodiment, the user interaction unit in the smart piano training system is further configured to: prompt the user with the key information corresponding to the corresponding standard audio data, prompt the user with the hand movement corresponding to the corresponding standard hand data, and prompt the user with playing errors and their positions.

In one embodiment, the intelligent piano training system further includes a control unit for coordinating the audio and video acquisition unit, the data extraction unit, the data identification unit, the data matching unit, and the user interaction unit. The control unit determines whether to activate the data identification unit based on the degree of matching between the user audio data and the corresponding standard audio data; and, based on the degree of matching between the user audio data and the corresponding standard audio data or the degree of matching between the user's hand data and the corresponding standard hand data, it judges whether the user has finished playing the piece and determines whether to activate the audio and video acquisition unit or the user interaction unit.

FIG. 15 shows a smart piano practice system according to an embodiment of the present invention. As shown in FIG. 15, the intelligent piano practice system 900 includes an audio and video acquisition unit 901, a data extraction unit 902, a data identification unit 903, a data matching unit 904 and a user interaction unit 905.

The audio and video acquisition unit 901, including a sound collection device 9011 and a video capture device 9012, is used to obtain audio information and video information generated when the user plays the piano. The sound collection device 9011 may be, for example, one or more microphones installed near the piano. The sound collection device 9011 can be connected to the data extraction unit 902 in a wired or wireless manner, and sends the acquired audio information to the data extraction unit 902. The video capture device 9012 may be a device with a photography or image capture function, such as a monocular camera, a binocular camera or a depth camera. The video capture device 9012 can be fixed around the piano to capture hand video information from a fixed point; for example, a device with a photography function can be installed only above the front of the piano keyboard, or on the upper, front, left and/or right sides of the keyboard. It can also be installed on a slide rail to automatically track the hands and collect video information of them, automatically adjusting its shooting position and/or angle. Similarly, the video capture device 9012 can also be connected to the data extraction unit 902 in a wired or wireless manner, and sends the acquired video information to the data extraction unit 902. In one embodiment, the sound collection device 9011 and the video capture device 9012 can be integrated in one device to simultaneously acquire audio information and video information when the user plays the piano.

The data extraction unit 902 includes an audio data extraction unit 9021 and an image data extraction unit 9022, wherein the audio data extraction unit 9021 is used to extract user audio data from the audio information and send it to the data matching unit 904; the image data extraction unit 9022 is used to intercept, from the video information, the user's hand image corresponding to the user's audio data and/or the 2D image including the complete piano keyboard, and send them to the data identification unit 903.

The data identification unit 903, which contains the hand model 9031, is connected with the image data extraction unit 9022, and is used for identifying the user hand data in the user hand image through the hand model and/or identifying playing errors from the 2D image through the intelligent recognition system for assisting piano teaching, and for sending the results to the data matching unit 904. The hand model 9031 is obtained by training a neural network with hand images as input data and the hand data in the hand images as output data.

The data matching unit 904 includes an audio data matching unit 9041 and a hand data matching unit 9042. The audio data matching unit 9041 includes an audio database and is used for comparing the user audio data from the data extraction unit 902 with the corresponding standard audio data in the audio database to obtain the degree of matching between the user audio data and the corresponding standard audio data, and for sending it to the user interaction unit 905 and the control unit 906. The hand data matching unit 9042 includes a hand database and is used for comparing the user hand data from the data identification unit 903 with the corresponding standard hand data in the hand database to obtain the degree of matching between the user hand data and the corresponding standard hand data, and for sending this matching degree and the user's playing errors to the user interaction unit 905. The audio database and the hand database can be stored in the audio data matching unit 9041 and the hand data matching unit 9042 as built-in files, or can be connected to the audio data matching unit 9041 and the hand data matching unit 9042 through an API.

The user interaction unit 905 includes a processor 9051 and a display device 9052. The processor 9051 is configured to receive, from the audio data matching unit 9041, the matching degree between the user audio data and the corresponding standard audio data, to receive, from the hand data matching unit 9042, the matching degree between the user's hand data and the corresponding standard hand data and/or the user's playing errors, and to determine the score of the user's piano performance based on these matching degrees and/or the user's playing errors. The display device 9052 can be, for example, an electronic device with a display function, such as a smart phone, an iPad, smart glasses, a liquid crystal display screen, or an electronic ink screen, and is used for displaying the scoring result of the processor 9051. In one embodiment, based on the matching degree between the user audio data and the corresponding standard audio data, the processor 9051 can determine the correct key information and display it on the display device 9052, for example by generating a virtual keyboard through animation rendering and prompting the correct piano keys. In one embodiment, based on the degree of matching between the user's hand data and the corresponding standard hand data, the processor 9051 can also construct the correct hand movement and display it on the display device 9052. For example, it can generate a virtual hand contour and prompt the correct hand movements; it can also establish a specific skeletal system for each user according to the user's hand information, generate the user's personalized virtual hand contour by means of skinning, animation rendering, etc., and control the virtual hand contour according to the standard hand data to prompt the correct hand movements. Alternatively, a specific hand image annotated with the hand shape error type and position or the fingering error and position is displayed.

In one embodiment, the smart piano practice system may further include a sensor, and the sensor may be installed under the keys, and used to collect data on the touch strength of the user when playing the piano.

In one embodiment of the present invention, the present invention may be implemented in the form of a computer program. The computer program can be stored in various storage media (e.g., hard disk, optical disk, flash memory, etc.), and when the computer program is executed by the processor, can be used to implement the method of the present invention.

In another embodiment of the present invention, the present invention may be implemented in the form of an electronic device. The electronic device includes a processor and a memory, and the memory stores a computer program that, when executed by the processor, can be used to implement the method of the present invention.

It should be noted that the above embodiments describe position coordinates for regions that are rectangular or approximately rectangular. When, owing to a different shooting angle, the piano keyboard does not appear as a rectangle in the 2D image, the coordinates of the vertices of a polygon containing the complete piano keyboard are used to represent the piano keyboard area.
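As a rough illustration of how a region can be carried as a list of vertices, the sketch below treats a rectangle as a four-vertex polygon and shifts region-relative vertices into the original image coordinate system. It assumes that relative coordinates are pixel offsets from the top-left corner of the parent region; the source does not state the exact convention, and the function names are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def rectangle_as_polygon(x_min: float, y_min: float, x_max: float, y_max: float) -> List[Point]:
    """Express an axis-aligned rectangle as four vertices, starting at the
    top-left corner and proceeding to top-right, bottom-right, bottom-left."""
    return [(x_min, y_min), (x_max, y_min), (x_max, y_max), (x_min, y_max)]


def to_original_coords(relative: List[Point], parent_origin: Point) -> List[Point]:
    """Shift vertices given relative to a parent region into the
    coordinate system of the original 2D image (assumed pixel offsets)."""
    ox, oy = parent_origin
    return [(x + ox, y + oy) for x, y in relative]


# A hand box detected inside a keyboard region whose top-left corner
# sits at (300, 400) in the original image.
hand_relative = rectangle_as_polygon(10, 20, 110, 120)
hand_original = to_original_coords(hand_relative, (300, 400))
print(hand_original)  # [(310, 420), (410, 420), (410, 520), (310, 520)]
```

Because both helpers operate on vertex lists, the same conversion applies unchanged when the keyboard is represented by a non-rectangular polygon.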

It should be noted that although the steps are described above in a specific order, this does not mean that the steps must be executed in that order. In fact, some of these steps can be executed concurrently or even in a different order, as long as the required function is achieved.

The present invention may be a system, method and/or computer program product. The computer program product may include a computer-readable storage medium having computer-readable program instructions loaded thereon for causing a processor to implement various aspects of the present invention.

A computer-readable storage medium may be a tangible device that retains and stores instructions for use by an instruction execution device. Computer-readable storage media may include, for example, but are not limited to, electrical storage devices, magnetic storage devices, optical storage devices, electromagnetic storage devices, semiconductor storage devices, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of computer-readable storage media include: portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random access memory (SRAM), portable compact disc read-only memory (CD-ROM), digital versatile discs (DVD), memory sticks, floppy disks, mechanically encoded devices such as punch cards or raised structures in grooves having instructions stored thereon, and any suitable combination of the above.

Various embodiments of the present invention have been described above, and the foregoing descriptions are exemplary, not exhaustive, and not limiting of the disclosed embodiments. Numerous modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, the practical application or technical improvement in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims

A method for intelligent identification for assisting piano teaching, characterized in that the method comprises:

Acquire, from above the piano keyboard, a 2D image of piano playing that contains the complete piano keyboard;

Perform target detection on the 2D image through the piano keyboard detection network to detect the piano keyboard area represented by the relative position coordinates of the 2D image, and convert the relative position coordinates of the 2D image that represent the piano keyboard area into coordinates in the original coordinate system of the 2D image to obtain the piano keyboard position coordinates in the original coordinates of the 2D image;

Perform target detection, through the hand detection network, on the piano keyboard area represented by the piano keyboard position coordinates to detect the hand region represented by the relative position coordinates of the piano keyboard area, and convert the relative position coordinates of the piano keyboard area that represent the hand region into coordinates in the original coordinate system of the 2D image to obtain the hand position coordinates in the original coordinates of the 2D image;

Identify, through the hand shape error detection network, whether there is a hand shape error in the hand region represented by the hand position coordinates.
The method according to claim 1, wherein the method further comprises:

Obtain the identified hand shape error type and the hand shape error position represented by the relative position coordinates of the hand region, and convert the relative position coordinates of the hand region used to represent the hand shape error position to the original coordinate system of the 2D image to obtain the hand shape error position coordinates in the original coordinates of the 2D image.
The method according to claim 1, wherein the method further comprises:

Divide the piano keyboard area represented by the piano keyboard position coordinates into individual keys to obtain different keys represented by the relative position coordinates of the piano keyboard area, and convert the relative position coordinates of the piano keyboard area that represent each key into coordinates in the original coordinate system of the 2D image to obtain the key coordinates in the original coordinates of the 2D image;

Detect, through the fingertip feature point detection network, the fingertip feature points of different fingers, represented by the relative position coordinates of the hand region, from the hand region represented by the hand position coordinates, and convert the relative position coordinates of the hand region that represent the fingertip feature point of each finger into coordinates in the original coordinate system of the 2D image to obtain the fingertip coordinates in the original coordinates of the 2D image;

Judge positions based on the fingertip coordinates and the key coordinates, bind each fingertip that falls on a key to that key to obtain the key binding relationship of the fingers, and compare the key binding relationship of the fingers playing a given note with the standard binding relationship for the same note in the score database to detect whether there is a fingering error.
The method of claim 1, further comprising:

After the piano keyboard area is detected, the piano keyboard area is extended by a first preset number of pixels to obtain an effective area of the piano keyboard that includes the complete hand.
The method according to claim 4, wherein the first preset number of pixels is 200 pixels.
The method of claim 1, further comprising:

After the hand region is detected, based on a comparison between the piano keyboard position coordinates and the hand position coordinates, hands that do not fall on the piano keyboard are filtered out, and the coordinate boundaries of each hand that falls on the piano keyboard are extended by a second preset number of pixels in four directions, to obtain a hand region including the complete hand for each hand falling on the piano keyboard, together with the corresponding hand position coordinates.
The method according to claim 6, wherein the second preset number of pixels is 30 pixels.
The method according to claim 2, wherein the neural network is trained in the following manner to obtain the piano keyboard detection network, the hand detection network, the hand shape error detection network, and the fingertip feature point detection network:

S1. Collect images of multiple people playing different types of pianos in various scenarios to form an original data set, so that the images in the original data set cover the scenes and all error types corresponding to all piano types under the prior art;

S2. Label the original data set, including labeling the piano keyboard position coordinates, labeling the hand position coordinates, labeling the hand shape error type and hand shape error position coordinates, and labeling the coordinates of the fingertip feature points of different fingers, with all labels in the same two-dimensional coordinate system;

S3, process the images in the original data set according to the labeled piano keyboard position coordinates to obtain images containing the piano keyboard area represented by the labeled piano keyboard position coordinates, forming a first data set; further, for the effective area of the piano keyboard obtained by expanding the piano keyboard area, crop the original images of the original data set according to the labeled piano keyboard position coordinates and hand position coordinates to obtain the effective area of the piano keyboard in each original image, forming a second data set, wherein, in the second data set, the hand position coordinates labeled in the original image are converted into coordinates in the same coordinate system as the effective area of the piano keyboard; further, for the effective hand area obtained by expanding the hand region, crop the original images of the original data set according to the labeled hand position coordinates and hand shape error position coordinates to obtain the effective hand area in each original image, forming a third data set, wherein, in the third data set, the hand shape error position coordinates labeled in the original image are converted into coordinates in the same coordinate system as the effective hand area; and crop the original images of the original data set according to the labeled hand position coordinates and the coordinates of the fingertip feature points of different fingers to obtain the effective hand area in each original image, forming a fourth data set, wherein, in the fourth data set, the coordinates of the fingertip feature points labeled in the original image are converted into coordinates in the same coordinate system as the effective hand area;

S4, use the first data set to train a predetermined neural network to convergence to obtain the piano keyboard detection network; use the second data set to train a predetermined neural network to convergence to obtain the hand detection network; use the third data set to train a predetermined neural network to convergence to obtain the hand shape error detection network; and use the fourth data set to train a predetermined neural network to convergence to obtain the fingertip feature point detection network.
The method according to claim 8, wherein the first data set, the second data set, and the third data set are each used to train a YOLOv4 network to convergence to obtain the piano keyboard detection network, the hand detection network, and the hand shape error detection network, respectively.
The method according to claim 8, characterized in that a network composed of ResNet18 and a cascaded pyramid network is trained to convergence with the fourth data set to obtain the fingertip feature point detection network.
An intelligent identification system for assisting piano teaching, characterized in that the system comprises:

An image acquisition module for acquiring a 2D image of a piano playing piano including a complete piano keyboard;

The piano keyboard detection module is used to perform target detection on the 2D image to detect the piano keyboard area represented by the relative position coordinates of the 2D image, and convert the relative position coordinates of the 2D image used to represent the piano keyboard area to The coordinates in the original coordinate system of the 2D image are obtained to obtain the position coordinates of the piano keyboard under the original coordinates of the 2D image;

The hand detection module is used to perform target detection on the piano keyboard area represented by the piano keyboard position coordinates to detect the hand area represented by the relative position coordinates of the piano keyboard area, and use the said used to represent the hand area The relative position coordinates of the piano keyboard area are converted to the coordinates under the original coordinate system of the 2D image to obtain the hand position coordinates under the original coordinates of the 2D image;

A hand shape error detection module for identifying whether there is a hand shape error in the hand region represented by the hand position coordinates, outputting a hand shape error type and a hand shape error position represented by relative position coordinates of the hand region, and converting the relative position coordinates of the hand region that represent the hand shape error position into coordinates in the original coordinate system of the 2D image, to obtain hand shape error position coordinates in the original coordinates of the 2D image.
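The chained coordinate conversion recited above (hand shape error position relative to the hand region, hand region relative to the piano keyboard area, piano keyboard area relative to the 2D image) can be illustrated with a minimal sketch. It assumes that each "relative position coordinate" is a box normalized to the range [0, 1] of its parent region, which the claim does not specify; the `Box` type, the `to_parent` helper, and all numeric values are hypothetical.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned box as (x_min, y_min, x_max, y_max)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def to_parent(rel: Box, parent: Box) -> Box:
    """Map a box given in normalized [0, 1] coordinates of `parent`
    into the coordinate system in which `parent` itself is expressed."""
    width = parent.x_max - parent.x_min
    height = parent.y_max - parent.y_min
    return Box(
        parent.x_min + rel.x_min * width,
        parent.y_min + rel.y_min * height,
        parent.x_min + rel.x_max * width,
        parent.y_min + rel.y_max * height,
    )


# Hypothetical detections, each relative to the region it was detected in.
image = Box(0, 0, 1920, 1080)             # full 2D image, in pixels
keyboard_rel = Box(0.125, 0.5, 0.875, 0.75)  # relative to the 2D image
hand_rel = Box(0.25, 0.0, 0.5, 1.0)       # relative to the keyboard area
error_rel = Box(0.5, 0.5, 1.0, 1.0)       # relative to the hand region

# Convert step by step back to the original coordinates of the 2D image.
keyboard = to_parent(keyboard_rel, image)
hand = to_parent(hand_rel, keyboard)
error = to_parent(error_rel, hand)
print(keyboard, hand, error, sep="\n")
```

Because each stage is converted into original image coordinates before the next detector runs, every later stage can reuse the same helper with the previous result as its parent.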
The system of claim 11, wherein the system further comprises:

A piano key division module for segmenting each piano key from the piano keyboard area represented by the piano keyboard position coordinates to obtain the individual piano keys represented by relative position coordinates of the piano keyboard area, and converting the relative position coordinates of the piano keyboard area that represent each piano key into coordinates in the original coordinate system of the 2D image, to obtain key coordinates in the original coordinates of the 2D image;

A fingertip feature point detection network for detecting, from the hand region represented by the hand position coordinates, fingertip feature points of the different fingers represented by relative position coordinates of the hand region, and converting the relative position coordinates of the hand region that represent the fingertip feature points of the different fingers into coordinates in the original coordinate system of the 2D image, to obtain fingertip coordinates in the original coordinates of the 2D image;

A fingering error detection module for determining positions based on the fingertip coordinates and the key coordinates, binding each fingertip that falls on a key to that key to obtain a finger-key binding relationship, and comparing the finger-key binding relationship for playing a given note with the standard binding relationship in the score database to detect whether there is a fingering error.
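The binding and comparison performed by the fingering error detection module can be sketched as follows. This is only an illustration under stated assumptions: keys are modeled as non-overlapping rectangles in original image coordinates, fingers are labeled with hypothetical identifiers such as `R1` to `R5`, and the standard binding relationship is modeled as a mapping from key name to expected finger. None of these data layouts is given in the claims.

```python
from typing import Dict, List, Optional, Tuple

KeyBox = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)
Point = Tuple[float, float]                 # (x, y) in image coordinates


def bind_fingers_to_keys(
    fingertips: Dict[str, Point], keys: Dict[str, KeyBox]
) -> Dict[str, Optional[str]]:
    """Bind each fingertip to the key whose box contains it (None if no key)."""
    binding: Dict[str, Optional[str]] = {}
    for finger, (x, y) in fingertips.items():
        binding[finger] = next(
            (name for name, (x0, y0, x1, y1) in keys.items()
             if x0 <= x < x1 and y0 <= y < y1),
            None,
        )
    return binding


def fingering_errors(
    binding: Dict[str, Optional[str]], standard: Dict[str, str]
) -> Dict[str, Tuple[str, List[str]]]:
    """For each key in the standard binding, report (expected finger,
    fingers actually on the key) when the two do not agree."""
    errors: Dict[str, Tuple[str, List[str]]] = {}
    for key, expected in standard.items():
        actual = sorted(f for f, k in binding.items() if k == key)
        if actual != [expected]:
            errors[key] = (expected, actual)
    return errors


# Hypothetical key boxes and fingertip coordinates for one frame.
keys = {"C4": (0, 0, 10, 50), "D4": (10, 0, 20, 50), "E4": (20, 0, 30, 50)}
fingertips = {"R1": (5, 40), "R2": (15, 40), "R4": (25, 40), "R5": (100, 40)}
standard = {"C4": "R1", "E4": "R3"}  # assumed score-database entry for the note

binding = bind_fingers_to_keys(fingertips, keys)
errors = fingering_errors(binding, standard)
print(binding)
print(errors)
```

In this example the thumb on C4 matches the standard binding, while E4 is flagged because the fourth finger rests on it where the third finger is expected.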
The system according to any one of claims 11-12, wherein the system further comprises:

A user interaction and display module for displaying the playing errors combined with the images of the piano playing, and providing an interactive interface for mode selection and function selection.
A computer-readable storage medium, characterized in that a computer program is stored thereon, and the computer program can be executed by a processor to implement the steps of the method of any one of claims 1 to 10.
An electronic device, comprising:

one or more processors;

A storage device for storing one or more programs that, when executed by the one or more processors, cause the electronic device to implement the steps of the method as claimed in any one of claims 1 to 10.
An intelligent piano training method, comprising:

Obtaining audio information and video information of a user playing the piano;

Extracting user audio data from the audio information, and comparing it with the corresponding reference audio data stored in the audio database to obtain a degree of matching between the user audio data and the corresponding reference audio data;

Intercepting, from the video information, a user hand image corresponding to the user audio data, identifying user hand data in the user hand image by means of a hand model, and comparing it with the corresponding reference hand data stored in the hand database to obtain a degree of matching between the user hand data and the corresponding reference hand data, wherein the hand model is obtained by training a neural network with hand images as input data and the hand data in the hand images as output data; and/or intercepting, from the video information, a 2D image that corresponds to the user audio data and contains the complete piano keyboard, and identifying from the 2D image whether there is a playing error by using the method as claimed in any one of claims 3-9; and

Based on the degree of matching between the user audio data and the corresponding reference audio data, the degree of matching between the user hand data and the corresponding reference hand data, and/or the user's playing error, feeding back the playing result to the user.
The piano training method according to claim 16, further comprising:

Based on the degrees of matching between all user audio data of the user's piano playing and the corresponding reference audio data, the degrees of matching between all user hand data of the user's piano playing and the corresponding reference hand data, and/or all playing errors of the user, feeding back the playing result to the user.
The piano training method according to claim 16, further comprising:

When the degree of matching between the user audio data and the corresponding reference audio data is less than a specified threshold, prompting the user with the key information corresponding to the corresponding reference audio data.
The piano training method according to claim 16, further comprising:

When the degree of matching between the user hand data and the corresponding reference hand data is less than a specified threshold, displaying the erroneous hand motion to the user, and/or prompting the user with the hand motion corresponding to the corresponding reference hand data.
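The threshold-based prompting of the two preceding claims can be sketched as below. The claims do not define how a degree of matching is computed or scaled, so this sketch assumes a value in [0, 1]; the `Sample` record, its field names, and the default thresholds of 0.8 are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    """Matching results for one extraction time (hypothetical layout)."""
    time_ms: int
    audio_match: float          # assumed to lie in [0, 1]
    hand_match: float           # assumed to lie in [0, 1]
    reference_key: str          # key information of the reference audio data
    reference_hand_motion: str  # hand motion of the reference hand data


def prompts(sample: Sample,
            audio_threshold: float = 0.8,
            hand_threshold: float = 0.8) -> List[str]:
    """Return the prompts to show when a degree of matching is below its threshold."""
    messages: List[str] = []
    if sample.audio_match < audio_threshold:
        messages.append(
            f"{sample.time_ms} ms: expected key {sample.reference_key}")
    if sample.hand_match < hand_threshold:
        messages.append(
            f"{sample.time_ms} ms: expected hand motion "
            f"'{sample.reference_hand_motion}'")
    return messages


example = Sample(500, 0.6, 0.9, "C4", "curved fingers")
print(prompts(example))
```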
The piano training method according to claim 16, wherein the user audio data includes extraction time, musical note, fundamental frequency and sound intensity.
The piano training method according to claim 16, wherein the user's hand data includes the interception time and the relative positions of 21 key joint points of each of the left and right hands.
The piano training method according to claim 16, wherein the playing errors include hand shape errors and/or fingering errors.
The piano training method according to claim 16, wherein extracting the user audio data from the audio information comprises: extracting the user audio data from the audio information according to a first time interval; and wherein the user audio data corresponds to the reference audio data in the audio database according to its extraction time.
The piano training method according to claim 23, wherein intercepting the user hand image corresponding to the user audio data from the video information comprises: intercepting the user hand image from the video information according to a second time interval; and wherein the user hand image corresponds to the user audio data according to its interception time.
The piano training method according to claim 24, wherein the second time interval is the same as the first time interval, or the second time interval is an integer multiple of the first time interval.
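The time-based correspondence between user audio data and user hand images can be illustrated with a small sketch. It assumes that times are integer milliseconds measured from the start of the recording and that both streams start at time zero; the function names and the interval values are hypothetical.

```python
from typing import Dict, List


def sample_times(interval_ms: int, duration_ms: int) -> List[int]:
    """Times at which data is extracted or intercepted, from 0 up to the duration."""
    return list(range(0, duration_ms, interval_ms))


def align(audio_times: List[int], image_times: List[int]) -> Dict[int, int]:
    """Map each hand-image interception time to the index of the audio
    sample that has the same extraction time."""
    index_of = {t: i for i, t in enumerate(audio_times)}
    return {t: index_of[t] for t in image_times if t in index_of}


first_interval_ms = 50                      # hypothetical first time interval
second_interval_ms = 2 * first_interval_ms  # integer multiple of the first

audio_times = sample_times(first_interval_ms, 400)
image_times = sample_times(second_interval_ms, 400)
pairs = align(audio_times, image_times)
print(pairs)
```

When the second time interval is an integer multiple of the first, every intercepted hand image lands exactly on an audio extraction time, so the lookup never has to interpolate between samples.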
The piano training method according to claim 24, wherein the interception time of the user hand data is the same as the interception time of the user hand image, and the user hand data corresponds to the reference hand data in the hand database according to its interception time.
The piano training method according to claim 16, wherein the hand model is obtained by training a recurrent neural network or a long short-term memory neural network.
The piano training method according to claim 16, further comprising:

Selecting, from the user hand images, a user hand image that includes piano keys, for identifying the user hand data.
The piano training method according to claim 16, further comprising:

Obtaining key-touch force data of the user;

Comparing the user's key-touch force data with the corresponding reference key-touch force data stored in the database to obtain a degree of matching between the user's key-touch force data and the corresponding reference key-touch force data; and

Based on the degree of matching between the user audio data and the corresponding reference audio data, the degree of matching between the user hand data and the corresponding reference hand data, the degree of matching between the user's key-touch force data and the corresponding reference key-touch force data, and/or all playing errors of the user, feeding back the playing result to the user.
An intelligent piano training system, comprising:

An audio and video acquisition unit for obtaining audio information and video information of the user's piano performance;

A data extraction unit for extracting user audio data from the audio information, and intercepting, from the video information, a user hand image corresponding to the user audio data and/or a 2D image containing the complete piano keyboard;

A data recognition unit for recognizing user hand data in the user hand image by means of a hand model, and/or identifying from the 2D image whether there is a playing error by means of the intelligent identification system for assisting piano teaching as claimed in claim 12, wherein the hand model is obtained by training a neural network with hand images as input data and the hand data in the hand images as output data;

A data matching unit for comparing the user audio data with the corresponding reference audio data in the audio database to obtain the degree of matching between the user audio data and the corresponding reference audio data, and comparing the user hand data with the corresponding reference hand data in the hand database to obtain the degree of matching between the user hand data and the corresponding reference hand data; and

A user interaction unit for feeding back the playing result to the user based on the degree of matching between the user audio data and the corresponding reference audio data, the degree of matching between the user hand data and the corresponding reference hand data, and/or all playing errors of the user.
The piano training system of claim 30, wherein the user interaction unit is further configured to:

prompting the user the key information corresponding to the corresponding reference audio data; and/or

prompting the user for the hand motion corresponding to the corresponding reference hand data; and/or

prompting the user with the playing errors and the error positions.
The piano training system of claim 30, wherein the audio and video acquisition unit includes an audio capture device and a video capture device, wherein the video capture device includes one or more monocular cameras, binocular cameras, or depth cameras, and the video capture device is either fixed around the piano to capture hand video information from a fixed position, or mounted on a slide rail to automatically track and capture hand video information.
The piano training system according to claim 30, further comprising: a sensor, which is installed under the piano keys and is used to collect data on the touch force of the keys when the user plays the piano.
A storage medium for intelligent piano training, wherein a computer program is stored thereon, and when the computer program is executed by a processor, it can be used to implement the method of any one of claims 16-29.
An electronic device for intelligent piano training, comprising a processor and a memory, wherein a computer program is stored in the memory, and when the computer program is executed by the processor, it can be used to implement the method of any one of claims 16-29.