WO2022052941A1 - Intelligent identification method and system for giving assistance with piano teaching, and intelligent piano training method and system - Google Patents

Intelligent identification method and system for giving assistance with piano teaching, and intelligent piano training method and system

Info

Publication number
WO2022052941A1
WO2022052941A1 PCT/CN2021/117130 CN2021117130W WO2022052941A1 WO 2022052941 A1 WO2022052941 A1 WO 2022052941A1 CN 2021117130 W CN2021117130 W CN 2021117130W WO 2022052941 A1 WO2022052941 A1 WO 2022052941A1
Authority
WO
WIPO (PCT)
Prior art keywords
hand
user
piano
data
image
Prior art date
Application number
PCT/CN2021/117130
Other languages
French (fr)
Chinese (zh)
Inventor
韩冰冰
陶之雨
郑庆伟
Original Assignee
桂林智神信息技术股份有限公司
Priority claimed from CN202010939320.9A (CN114170868A)
Priority claimed from CN202110982026.0A (CN113723264A)
Application filed by 桂林智神信息技术股份有限公司
Publication of WO2022052941A1

Classifications

    • G: PHYSICS
    • G09: EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
    • G09B: EDUCATIONAL OR DEMONSTRATION APPLIANCES; APPLIANCES FOR TEACHING, OR COMMUNICATING WITH, THE BLIND, DEAF OR MUTE; MODELS; PLANETARIA; GLOBES; MAPS; DIAGRAMS
    • G09B15/00: Teaching music

Definitions

  • the invention relates to the field of deep learning, in particular to an intelligent identification method and system for assisting piano teaching, and an intelligent piano training method.
  • the method of feature comparison has poor robustness, high subjectivity and low recognition rate.
  • the existing method lacks a systematic and comprehensive piano hand shape and fingering correction scheme.
  • Existing hand shape error detection can only judge right or wrong; it cannot tell what kind of error occurred or where it is, so its ability to guide students in correcting the error is insufficient.
  • Existing methods cannot accurately identify fingering errors or other playing errors, and without accurate identification it is impossible to guide the playing accurately.
  • the existing piano training methods still have some shortcomings.
  • the practice (or evaluation) method based only on the audio data of playing music may lead to inaccurate judgment results due to the interference of noise in the environment.
  • A practice (or evaluation) method that uses only the video (or image) data of the performance as the basis for judgment recognizes the captured images of the player's hands and keys in isolation, so it cannot be organically combined with the music being played.
  • the purpose of the present invention is to overcome the above-mentioned defects of the prior art, and to provide an intelligent identification method and system for assisting piano teaching which can accurately identify playing errors.
  • an intelligent recognition method for assisting piano teaching, for identifying hand shape errors and/or fingering errors from a 2D image of playing the piano, the method comprising: acquiring, from above the piano keyboard, a 2D image of playing the piano that includes the complete piano keyboard; performing target detection on the 2D image through the piano keyboard detection network to detect the piano keyboard area represented by relative position coordinates of the 2D image, and converting the relative position coordinates of the 2D image used to represent the piano keyboard area into coordinates in the original coordinate system of the 2D image to obtain the piano keyboard position coordinates in the original coordinates of the 2D image; performing target detection, through the hand detection network, on the piano keyboard area represented by the piano keyboard position coordinates to detect the hand area represented by relative position coordinates of the piano keyboard area, and converting the relative position coordinates of the piano keyboard area used to represent the hand area into coordinates in the original coordinate system of the 2D image to obtain the hand position coordinates in the original coordinates of the 2D image; the hand shape error detection network is used to identify whether there is a
  • the method of the present invention further comprises: dividing each piano key from the piano keyboard region represented by the piano keyboard position coordinates to obtain the individual piano keys represented by relative position coordinates of the piano keyboard region, and converting the relative position coordinates of the piano keyboard region used to represent each key into coordinates in the original coordinate system of the 2D image to obtain the key coordinates in the original coordinates of the 2D image;
  • the fingertip feature points of different fingers, represented by relative position coordinates of the hand region, are detected in the hand region represented by the hand position coordinates, and these relative position coordinates are converted into coordinates in the original coordinate system of the 2D image to obtain the fingertip coordinates in the original coordinates of the 2D image; a position judgment is performed based on the fingertip coordinates and the key coordinates, each fingertip that falls on a key is bound to that key to obtain the finger-key binding relationship, and the finger-key binding relationship for playing a note is compared with the standard binding relationship for the same note in the score database to detect whether there is a fingering error.
  • The piano keyboard position coordinates corresponding to the piano keyboard area are expanded by a first preset number of pixels to obtain an effective piano keyboard area that includes the complete hands, and subsequent processing is based on this effective area.
  • The first preset number of pixels is 200.
  • Hand coordinates that do not fall on the piano keyboard are filtered out.
  • The coordinate boundary of each hand that falls on the piano keyboard is extended by a second preset number of pixels in all four directions, so as to obtain an effective hand area that includes the complete hand, together with the corresponding hand position coordinates.
  • The second preset number of pixels is 30.
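The two expansion steps above can be sketched as follows. This is a minimal illustration, not the patented implementation: the `(x1, y1, x2, y2)` box layout, the function name `expand_box`, the example boxes, and the clipping to the image bounds are all assumptions; only the margins of 200 and 30 pixels come from the text.

```python
from typing import Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2): upper-left and lower-right corners

KEYBOARD_MARGIN = 200  # first preset number of pixels (from the text)
HAND_MARGIN = 30       # second preset number of pixels (from the text)


def expand_box(box: Box, margin: int, image_width: int, image_height: int) -> Box:
    """Grow a box by `margin` pixels in all four directions.

    Clipping to the image bounds is an assumption; the text does not say
    what happens when the expanded box leaves the image.
    """
    x1, y1, x2, y2 = box
    return (
        max(0, x1 - margin),
        max(0, y1 - margin),
        min(image_width, x2 + margin),
        min(image_height, y2 + margin),
    )


# Hypothetical detections in a 1920x1080 image.
keyboard_box = (100, 400, 1800, 600)
hand_box = (500, 450, 700, 620)

effective_keyboard = expand_box(keyboard_box, KEYBOARD_MARGIN, 1920, 1080)
effective_hand = expand_box(hand_box, HAND_MARGIN, 1920, 1080)
```

Expanding before cropping keeps fingers or wrists that extend past the detected box inside the region passed to the next network.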
  • the piano keyboard detection network, the hand detection network, the hand shape error detection network, and the fingertip feature point detection network are all obtained through neural network training, which can intelligently and accurately perform target detection and error recognition.
  • the neural network is trained in the following manner to obtain the piano keyboard detection network, hand detection network, hand shape error detection network, and fingertip feature point detection network:
  • Label the original data set, including labeling the piano keyboard position coordinates, the hand position coordinates, the hand shape error type and hand shape error position coordinates, and the fingertip feature point coordinates of the different fingers; all labels are in the same two-dimensional coordinate system;
  • S3: process the images in the original data set according to the labeled piano keyboard position coordinates to obtain images containing the piano keyboard area represented by those coordinates, forming a first data set;
  • further, according to the labeled piano keyboard position coordinates and hand position coordinates, crop the original images of the original data set to obtain the effective area of the piano keyboard (obtained by expanding the piano keyboard area) in each original image, forming a second data set, in which the hand position coordinates labeled in the original image are converted into coordinates in the same coordinate system as the effective area of the piano keyboard;
  • further, according to the labeled hand position coordinates and hand shape error position coordinates, crop the original images of the original data set to obtain the effective hand area (obtained by expanding the hand area) in each original image, forming a third data set, in which the hand shape error position coordinates labeled in the original image are converted into coordinates in the same coordinate system as the effective hand area;
  • according to the labeled hand position coordinates and the fingertip feature point coordinates of the different fingers, crop the original images of the original data set to obtain the effective hand area in each original image, forming a fourth data set, in which the fingertip feature point coordinates labeled in the original image are converted into coordinates in the same coordinate system as the effective hand area;
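The label conversion used when building the cropped data sets can be illustrated with a small sketch. It assumes that labels and crops are axis-aligned pixel boxes `(x1, y1, x2, y2)` and that converting a label into the crop's coordinate system means subtracting the crop's upper-left corner; the function name `crop_with_labels` and the example values are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels


def crop_with_labels(crop_box: Box, label_boxes: List[Box]) -> Tuple[Tuple[int, int], List[Box]]:
    """Return the crop size and the labels shifted into the crop's coordinate system."""
    cx1, cy1, cx2, cy2 = crop_box
    size = (cx2 - cx1, cy2 - cy1)
    shifted = [
        (x1 - cx1, y1 - cy1, x2 - cx1, y2 - cy1)
        for (x1, y1, x2, y2) in label_boxes
    ]
    return size, shifted


# Hypothetical example: effective keyboard area and one labeled hand box,
# both given in the coordinate system of the original image.
effective_keyboard = (0, 200, 1920, 800)
hand_labels = [(500, 450, 700, 620)]

crop_size, hand_labels_in_crop = crop_with_labels(effective_keyboard, hand_labels)
```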
  • The first data set, the second data set, and the third data set are each used to train a YOLOv4 network to convergence to obtain the piano keyboard detection network, the hand detection network, and the hand shape error detection network, respectively.
  • the fourth dataset is used to train the network composed of ResNet18 and cascaded pyramid network to convergence to obtain the fingertip feature point detection network.
  • The piano keyboard detection network obtained by neural network training can intelligently identify the position of the piano keyboard and obtain the piano keyboard position coordinates. The hand detection network obtained by neural network training can intelligently and accurately identify the hand position, represented by relative position coordinates of the input piano keyboard area; converting these relative position coordinates into coordinates in the original image coordinate system directly yields the hand position coordinates in the original image.
  • The hand shape error detection network obtained by neural network training can intelligently and accurately identify the specific hand shape error type and the hand shape error position coordinates.
  • The fingertip feature point detection network obtained by neural network training can intelligently and accurately identify the fingertip positions of the different fingers, represented by relative position coordinates of the input hand area; converting these relative position coordinates into coordinates in the original image coordinate system directly yields the fingertip coordinates in the original image, which is convenient for subsequent fingering recognition.
  • the above detection network obtained by training the neural network with the labeled data set can accurately identify the hand shape errors in playing without establishing a standard hand shape error comparison database, with good robustness and high accuracy.
  • a system for implementing the method described in the first aspect of the present invention, comprising: an image acquisition module for acquiring a 2D image of playing the piano that includes the complete piano keyboard;
  • a piano keyboard detection module for performing target detection on the 2D image to detect the piano keyboard area represented by relative position coordinates of the 2D image, and converting the relative position coordinates of the 2D image used to represent the piano keyboard area into coordinates in the original coordinate system of the 2D image to obtain the piano keyboard position coordinates in the original coordinates of the 2D image;
  • a hand detection module for performing target detection on the piano keyboard area represented by the piano keyboard position coordinates to detect the hand area represented by relative position coordinates of the piano keyboard area, and converting the relative position coordinates of the piano keyboard area used to represent the hand area into coordinates in the original coordinate system of the 2D image to obtain the hand position coordinates in the original coordinates of the 2D image;
  • a hand shape error detection module for identifying whether there is a hand shape error in the hand region represented by the hand position coordinates, outputting the hand shape error type and the hand shape error position represented by relative position coordinates of the hand region, and converting the relative position coordinates of the hand region used to represent the hand shape error position into coordinates in the original coordinate system of the 2D image to obtain the hand shape error position coordinates in the original coordinates of the 2D image.
  • the system further includes: a key dividing module, configured to divide each key from the piano keyboard region represented by the piano keyboard position coordinates to obtain the individual piano keys represented by relative position coordinates of the piano keyboard region, and to convert the relative position coordinates of the piano keyboard region used to represent each key into coordinates in the original coordinate system of the 2D image to obtain the key coordinates in the original coordinates of the 2D image;
  • a fingertip feature point detection network, used to detect, in the hand region represented by the hand position coordinates, the fingertip feature points of different fingers represented by relative position coordinates of the hand region, and to convert these relative position coordinates into coordinates in the original coordinate system of the 2D image to obtain the fingertip coordinates in the original coordinates of the 2D image;
  • a fingering error detection module, used to perform a position judgment based on the fingertip coordinates and the key coordinates, bind each fingertip that falls on a key to that key to obtain the finger-key binding relationship, and compare the finger-key binding relationship for playing a note with the standard binding relationship for the same note in the score database to detect whether there is a fingering error.
  • the system further includes: a user interaction and display module for displaying the playing errors that occur during playing merged with the image of playing the piano, and for providing an interactive interface for mode selection and function selection.
  • the image acquisition module can be any electronic device that can take pictures, such as a mobile phone or a camera.
  • in order to overcome the deficiencies of piano training in the prior art, the present invention also provides an intelligent piano training method, the method comprising: acquiring audio information and video information of a user playing the piano; extracting user audio data from the audio information and comparing it with the corresponding reference audio data stored in the audio database to obtain the degree of matching between the user audio data and the corresponding reference audio data; intercepting from the video information the user's hand image corresponding to the user audio data, identifying the user hand data in the user's hand image through the hand model, and comparing it with the corresponding reference hand data stored in the hand database to obtain the degree of matching between the user hand data and the corresponding reference hand data, wherein the hand model takes a hand image as input data and the hand data in the hand image as output data and is obtained by training a neural network; and/or intercepting from the video information a 2D image that corresponds to the user audio data and contains the complete piano keyboard, and using the method described in the first aspect of the present invention to identify from the 2D image whether there is a playing error; and feeding back the playing result to the user based on the degree of matching of the user audio data with the corresponding reference audio data and the degree of matching of the user hand data with the corresponding reference hand data, and/or the user's playing errors.
  • the playing errors include hand shape errors, and/or fingering errors.
  • the above method further includes: feeding back the playing result to the user based on the degree of matching of all the user audio data generated by the user playing the piano with the corresponding reference audio data, the degree of matching of all the user hand data generated by the user playing the piano with the corresponding reference hand data, and/or all the playing errors of the user.
  • the above method further includes: when the degree of matching between the user audio data and the corresponding reference audio data is less than a specified threshold, prompting the user for key information corresponding to the corresponding reference audio data.
  • the above method further includes: when the degree of matching between the user hand data and the corresponding reference hand data is less than a specified threshold, prompting the user with the hand movements corresponding to the corresponding reference hand data, and/or displaying to the user the type of playing error and the position of the playing error.
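The two threshold rules above can be sketched as a small decision function. The matching degrees are assumed here to be numbers in the range 0 to 1, and the threshold value of 0.8, the function name `build_prompts`, and the prompt wording are hypothetical; the text only states that a prompt is issued when a matching degree is below a specified threshold.

```python
from typing import List


def build_prompts(
    audio_match: float,
    hand_match: float,
    reference_key: str,
    reference_hand_movement: str,
    audio_threshold: float = 0.8,  # hypothetical value
    hand_threshold: float = 0.8,   # hypothetical value
) -> List[str]:
    """Collect the prompts to show when a matching degree is below its threshold."""
    prompts = []
    if audio_match < audio_threshold:
        prompts.append(f"Expected key: {reference_key}")
    if hand_match < hand_threshold:
        prompts.append(f"Expected hand movement: {reference_hand_movement}")
    return prompts


# Hypothetical example: the note was wrong, the hand movement was fine.
prompts = build_prompts(0.4, 0.95, "C4", "curved fingers, level wrist")
```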
  • the user audio data includes extraction time, musical note, fundamental frequency and sound intensity.
  • the user hand data includes the interception time and the relative positions of 21 key joint points of each of the left and right hands.
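One possible in-memory layout for these two records is sketched below. The field names, the units (milliseconds, hertz, decibels), and the use of 2D points for joint positions are assumptions; the text only lists which quantities each record contains and that each hand has 21 key joint points.

```python
from dataclasses import dataclass
from typing import List, Tuple

NUM_JOINTS = 21  # key joint points per hand (from the text)
Point = Tuple[float, float]


@dataclass
class UserAudioData:
    time_ms: int           # extraction time
    note: str              # musical note, e.g. "A4"
    fundamental_hz: float  # fundamental frequency
    intensity_db: float    # sound intensity


@dataclass
class UserHandData:
    time_ms: int              # interception time of the hand image
    left_joints: List[Point]  # relative positions of the left-hand joints
    right_joints: List[Point] # relative positions of the right-hand joints

    def __post_init__(self) -> None:
        # Reject records that do not carry exactly 21 joints per hand.
        for name, joints in (("left", self.left_joints), ("right", self.right_joints)):
            if len(joints) != NUM_JOINTS:
                raise ValueError(f"{name} hand needs {NUM_JOINTS} joints, got {len(joints)}")


audio = UserAudioData(time_ms=100, note="A4", fundamental_hz=440.0, intensity_db=60.0)
hand = UserHandData(time_ms=100, left_joints=[(0.0, 0.0)] * 21, right_joints=[(1.0, 1.0)] * 21)
```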
  • extracting the user audio data from the audio information includes: extracting the user audio data from the audio information at a first time interval, wherein the user audio data corresponds to the reference audio data in the audio database according to its extraction time.
  • intercepting the user's hand image corresponding to the user audio data from the video information includes: intercepting the user's hand image from the video information at a second time interval, wherein the user's hand image corresponds to the user audio data through its interception time.
  • the second time interval is the same as the first time interval, or the second time interval is an integer multiple of the first time interval.
  • the interception time of the user hand data is the same as the interception time of the user hand image, and the user hand data corresponds to the reference hand data in the database according to the interception time information.
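The time-based correspondence can be sketched as follows. The sketch assumes that both streams are sampled from time zero, that times are integer milliseconds, and that a hand image is paired with the audio sample carrying the same timestamp; the interval values and function names are hypothetical.

```python
from typing import List


def sample_times(duration_ms: int, interval_ms: int) -> List[int]:
    """Timestamps at which samples are taken, starting at 0."""
    return list(range(0, duration_ms, interval_ms))


def paired_times(audio_times: List[int], image_times: List[int]) -> List[int]:
    """Timestamps at which both an audio sample and a hand image exist."""
    audio_set = set(audio_times)
    return [t for t in image_times if t in audio_set]


first_interval_ms = 50                      # hypothetical first time interval
second_interval_ms = 2 * first_interval_ms  # an integer multiple of the first

audio_times = sample_times(300, first_interval_ms)
image_times = sample_times(300, second_interval_ms)
pairs = paired_times(audio_times, image_times)
```

Because the second interval is an integer multiple of the first, every hand image has an audio sample with exactly the same timestamp, so no interpolation is needed.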
  • the hand model is obtained by training a recurrent neural network or a long short-term memory (LSTM) neural network.
  • the above method further includes: selecting a user's hand image including piano keys from the user's hand image to identify the user's hand data.
  • the above method further includes: acquiring the user's key-touching force data; comparing it with the corresponding reference key-touching force data to obtain the degree of matching between the user's key-touching force data and the corresponding reference key-touching force data; and determining the user's score for playing the piano based on the degree of matching between the user audio data and the corresponding reference audio data, the degree of matching between the user hand data and the corresponding reference hand data, and the degree of matching between the user's key-touching force data and the corresponding reference key-touching force data, and/or all playing errors of the user.
  • a fourth aspect of the present invention provides an intelligent piano training system, comprising: an audio and video acquisition unit for acquiring audio information and video information of a user's piano playing; a data extraction unit for extracting the user audio data from the audio information, and for intercepting from the video information a user's hand image corresponding to the user audio data and/or a 2D image including a complete piano keyboard; a data recognition unit for recognizing the user hand data in the user's hand image through a hand model, and/or for identifying from the 2D image whether there is a playing error by means of the intelligent recognition system for assisting piano teaching according to the second aspect of the present invention, wherein the hand model takes the hand image as input data and the hand data in the hand image as output data, and is obtained by training a neural network; a data matching unit is used to match the user audio data with the audio database.
  • a user interaction unit configured to feed back the playing result to the user based on the degree of matching between the user audio data and the corresponding reference audio data and the degree of matching between the user hand data and the corresponding reference hand data, and/or all the playing errors of the user.
  • the playing errors include hand shape errors, and/or fingering errors.
  • the user interaction unit is further configured to: prompt the user with the key information corresponding to the corresponding reference audio data; and/or prompt the user with the hand movements corresponding to the corresponding reference hand data.
  • the video and audio capture unit includes an audio capture device and a video capture device, wherein the video capture device includes one or more monocular cameras, binocular cameras, or depth cameras, and the video capture device is either fixed around the piano to collect hand video information from a fixed point or installed on a slide rail to automatically track the hands and collect hand video information.
  • the above-mentioned system further includes: a sensor, which is installed under the piano keys and is used for collecting key-touching force data when the user plays the piano.
  • the advantages of the present invention are: the present invention accurately identifies the hand data in the user's hand image through the hand model, and comprehensively considers the audio data and the hand data generated when the user is playing the piano. On this basis, it makes an overall judgment of the user's piano playing, so that the user can quickly obtain effective feedback on the notes and fingering during practice without the guidance of a professional teacher, which helps the user discover and correct mistakes in time and improves practice efficiency.
  • by prompting the user with correct key information and/or hand movements in real time it can help the user to obtain correct demonstration and guidance in time, and help the user to learn to play the piano by himself.
  • the 2D image is used to identify the playing errors, the calculation amount is small, and the hardware cost is low. Therefore, 2D visual models have irreplaceable advantages.
  • FIG. 1 is a schematic diagram of main work contents in an intelligent identification method for assisting piano teaching according to an embodiment of the present invention
  • FIG. 2 is a schematic diagram of a framework of an intelligent recognition system for assisting piano teaching according to an embodiment of the present invention
  • FIG. 3 is a schematic diagram illustrating an example of a 2D image collected according to an embodiment of the present invention.
  • FIG. 4 is a schematic diagram of a piano keyboard area detected from the 2D image shown in FIG. 3 according to an embodiment of the present invention
  • FIG. 5 is a schematic diagram of a hand area detected from the piano keyboard area shown in FIG. 4 according to an embodiment of the present invention
  • FIG. 6 is a schematic diagram of detecting fingertip feature points from the hand region shown in FIG. 5 according to an embodiment of the present invention
  • Fig. 7 is a schematic diagram of 6 common wrong hand movements and corresponding correct hand movements in piano practice
  • FIG. 8 is a schematic diagram of 21 key joint points in a single palm according to an embodiment of the present invention.
  • Fig. 9 is a smart piano practice method according to an embodiment of the present invention.
  • FIG. 10 is a schematic diagram of standard audio data storage in an audio database according to an embodiment of the present invention.
  • FIG. 11 is a schematic diagram of standard hand data storage in a hand database according to an embodiment of the present invention.
  • FIG. 12 is a schematic diagram of storage of a comprehensive database according to an embodiment of the present invention.
  • Fig. 13 is a smart piano training method according to an embodiment of the present invention.
  • Fig. 14 is a smart piano training method according to an embodiment of the present invention.
  • FIG. 15 is a smart piano training system according to one embodiment of the present invention.
  • the present invention regards hand shape errors as a detection task, and fingering errors as feature point detection tasks and image segmentation tasks, and constructs a 2D visual model to solve the problem of recognizing playing errors.
  • No 3D modeling is required, and there is no need to establish a standard database for comparison, which greatly reduces the amount of computation and the hardware overhead.
  • an intelligent identification method for assisting piano teaching identifies whether there is a playing error by detecting the collected 2D images of playing the piano, and mainly includes the following parts:
  • First, the piano keyboard is detected in the collected 2D image of playing the piano, and the piano keyboard area is cropped from it.
  • Then, hand detection and key segmentation are performed on the cropped piano keyboard area: hand detection detects the hand area within the piano keyboard area, and key segmentation divides the keys based on the piano keyboard area to obtain each key.
  • Next, hand shape error detection and fingertip detection are performed separately on the hand area: hand shape error detection detects hand shape errors, and fingertip detection detects the fingertip feature points of the hand area to obtain the fingertip position coordinates of the different fingers.
  • Finally, a position judgment is made based on the fingertip coordinates and the key coordinates, the fingertips that fall on keys are bound to those keys to obtain the finger-key binding relationship, and the finger-key binding relationship is compared with the binding relationship for playing the same note in the score database to determine whether there is a fingering error; any fingering errors found are output.
  • Fig. 2 shows the main functional modules in the process of identifying playing errors using the method of the present invention. The 2D image of playing the piano, including the complete keyboard, is collected from above the piano keyboard by the image acquisition module and is then processed by the piano keyboard detection module.
  • The key division module divides the keys based on the piano keyboard area to obtain the coordinates of each key, and the fingertip feature point detection network detects the fingertip feature points of the hand area to obtain the fingertip coordinates of the different fingers.
  • The fingering error detection module compares the fingertip coordinates with the key coordinates: if a finger falls on the key that overlaps its fingertip coordinates, the finger is bound to that key to obtain the finger-key binding relationship, and the finger-key binding relationship for playing a note is compared with the standard binding relationship for the same note in the score database to detect fingering errors.
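The binding and comparison step can be sketched as follows. This is a simplified illustration under several assumptions: keys are axis-aligned rectangles, a fingertip is a single point, a fingertip is bound to the first key whose rectangle contains it, and the score database is a plain mapping from note name to the expected finger. The key names, finger labels, and function names are hypothetical, and the sketch ignores black keys and whether a key is actually pressed.

```python
from typing import Dict, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)
Point = Tuple[float, float]


def key_under(point: Point, keys: Dict[str, Box]) -> Optional[str]:
    """Name of the first key whose rectangle contains the point, if any."""
    x, y = point
    for name, (x1, y1, x2, y2) in keys.items():
        if x1 <= x <= x2 and y1 <= y <= y2:
            return name
    return None


def bind_fingers(fingertips: Dict[str, Point], keys: Dict[str, Box]) -> Dict[str, str]:
    """Map each touched key to the finger whose fingertip falls on it."""
    bindings = {}
    for finger, tip in fingertips.items():
        key = key_under(tip, keys)
        if key is not None:
            bindings[key] = finger
    return bindings


def fingering_error(
    bindings: Dict[str, str], note: str, standard: Dict[str, str]
) -> Optional[Tuple[str, str, str]]:
    """Return (note, expected finger, actual finger) on a mismatch, else None."""
    expected, actual = standard.get(note), bindings.get(note)
    if expected is None or actual is None or expected == actual:
        return None
    return (note, expected, actual)


keys = {"C4": (0, 0, 10, 50), "D4": (10, 0, 20, 50), "E4": (20, 0, 30, 50)}
fingertips = {"R1": (5, 40), "R3": (15, 40), "R5": (100, 40)}  # R5 is off the keyboard
standard = {"C4": "R1", "D4": "R2"}  # hypothetical score database entries

bindings = bind_fingers(fingertips, keys)
error = fingering_error(bindings, "D4", standard)
```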
  • The 2D image including the complete piano keyboard while the player plays the piano is collected from above the piano keyboard by the image acquisition module (in the collected 2D image, the player is located at the top of the image, and the piano keyboard area appears rectangular or approximately rectangular), and the collected 2D image is processed with image processing algorithms to obtain the processed 2D image.
  • The image acquisition module can be any electronic device capable of taking pictures, such as a mobile phone or a camera. The angle of the device is adjusted before or during shooting so that the captured 2D image contains the complete piano keyboard and part or all of the hands.
  • the image processing algorithms here include, but are not limited to, black level compensation, lens correction, bad pixel correction, color interpolation, noise removal, gamma correction, color space conversion, white balance correction, color and contrast enhancement, format conversion, etc.
  • the captured 2D image is processed so that the processed 2D image is suitable for subsequent operations.
  • The captured raw 2D image can be converted into a 2D image in a format compatible with piano keyboard detection, such as bmp, jpg, png, tif, gif, pcx, tga, exif, fpx, svg, psd, cdr, pcd, dxf, ufo, eps, ai, raw, WMF, webp, avif, or apng.
  • the specific format can be set according to the actual application requirements.
  • An example shown in FIG. 3 is a collected 2D image including a complete piano keyboard and all hands of the player when playing the piano.
  • the position of the piano keyboard is detected from the 2D image by the piano keyboard detection module, so as to obtain the piano keyboard area represented by the coordinates of the piano keyboard position.
  • the piano keyboard detection module uses the piano keyboard detection network to perform target detection on the 2D image, and obtains the position coordinates of the piano keyboard.
  • The output of the piano keyboard detection network is expressed in relative position coordinates of the input 2D image; these are converted into coordinates in the original coordinate system of the 2D image to obtain the piano keyboard position coordinates in the original coordinate system, and the piano keyboard position coordinates indicate the piano keyboard area in the 2D image.
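The conversion from relative coordinates back to the original image can be sketched as an offset plus an optional scale. The text does not specify how the relative coordinates are defined, so this sketch assumes they are pixel coordinates measured from the upper-left corner of the network's input region, with a scale factor covering the case where that region was resized before detection. The function name `to_original` and all numbers are hypothetical.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def to_original(
    relative_box: Box,
    region_origin: Tuple[float, float] = (0.0, 0.0),
    scale: Tuple[float, float] = (1.0, 1.0),
) -> Box:
    """Map a box from a (possibly resized) input region to original image coordinates."""
    ox, oy = region_origin
    sx, sy = scale
    x1, y1, x2, y2 = relative_box
    return (ox + x1 * sx, oy + y1 * sy, ox + x2 * sx, oy + y2 * sy)


# Keyboard detected on a 960x540 copy of a 1920x1080 image (scale 2 in both axes).
keyboard = to_original((50, 200, 900, 300), scale=(2.0, 2.0))

# Hand detected inside the keyboard region: offset by the region's upper-left corner.
hand = to_original((400, 50, 600, 220), region_origin=(keyboard[0], keyboard[1]))
```

Chaining the same function for the hand region and the fingertip points keeps every result in one coordinate system, which is what makes the later fingertip-to-key comparison possible.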
  • the piano keyboard detection network takes the 2D image of playing the piano as the input, and the relative position coordinates of the piano keyboard in the 2D image as the output, which is obtained by training the neural network.
  • the piano keyboard in the 2D image falls completely inside the rectangular target detection frame of the piano keyboard detection network, so that the frame contains the entire piano keyboard. The frame is described by two-dimensional coordinates of the form (x, y), either of its opposite corners or of its center point.
  • converting these coordinates into the original coordinate system yields the position coordinates of the piano keyboard in the original 2D image. For example, the position can be expressed by the upper left corner (x1, y1) and the lower right corner (x2, y2) of the rectangular target detection frame, and the rectangular area they bound is the piano keyboard area.
  • alternatively, the piano keyboard area can be expressed by the center point (x0, y0) of the rectangular target detection frame together with its width w and height h. Both forms are acceptable; the present invention uses the first, diagonal form {(x1, y1), (x2, y2)} for description. For convenience, all coordinates mentioned below refer to coordinates that have already been converted into the original coordinate system of the 2D image.
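  • the two box representations and the conversion into the original coordinate system can be sketched as follows. This is a minimal illustration, not the patented implementation: the names Box, to_original, corners_to_center and center_to_corners are hypothetical, and it assumes the network outputs coordinates as fractions of the image width and height, which the text does not specify.

```python
from typing import NamedTuple, Tuple


class Box(NamedTuple):
    """Rectangle given by its upper left (x1, y1) and lower right (x2, y2) corners."""
    x1: float
    y1: float
    x2: float
    y2: float


def to_original(rel: Box, img_w: int, img_h: int) -> Box:
    """Scale a box expressed as fractions of the image size (assumed convention) to pixels."""
    return Box(rel.x1 * img_w, rel.y1 * img_h, rel.x2 * img_w, rel.y2 * img_h)


def corners_to_center(box: Box) -> Tuple[float, float, float, float]:
    """Convert the diagonal form to the center form (x0, y0, w, h)."""
    w = box.x2 - box.x1
    h = box.y2 - box.y1
    return (box.x1 + w / 2, box.y1 + h / 2, w, h)


def center_to_corners(x0: float, y0: float, w: float, h: float) -> Box:
    """Convert the center form (x0, y0, w, h) back to the diagonal form."""
    return Box(x0 - w / 2, y0 - h / 2, x0 + w / 2, y0 + h / 2)


# Illustrative values only: a keyboard box in a 1920x1080 frame.
keyboard = to_original(Box(0.125, 0.5, 0.875, 0.75), img_w=1920, img_h=1080)
center_form = corners_to_center(keyboard)
```

  • both forms carry the same information, so later steps can keep the diagonal form throughout and convert only where a detector happens to emit the center form.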
  • the piano keyboard area has two functions, one is as the input of the hand detection module to detect the position of the hand; the other is as the input of the key division module, which divides each individual key and obtains the position of each key.
  • the hand detection module detects the hand area from the piano keyboard area; the detected hand area is first expressed in coordinates relative to the piano keyboard area, and these relative coordinates are converted into the original coordinate system of the 2D image.
  • the conversion yields the hand position coordinates in the original coordinate system of the 2D image, which indicate the hand region in the 2D image.
  • the hand detection network is obtained by training a neural network with the piano keyboard area as input and the relative position coordinates of the hand in the piano keyboard area as output. In the image, the hand position coordinates and the piano keyboard position coordinates are compared to determine whether the two overlap.
  • if the rectangular area represented by the hand position coordinates overlaps the rectangular area represented by the piano keyboard position coordinates, the hand is considered to be on the keyboard, and the subsequent hand shape error detection and fingering error detection are performed on that hand; if there is no overlap, the hand is considered not to be on the keyboard, and these subsequent detection steps are skipped.
  • the hand in the piano keyboard area may be incomplete, for example, the fingers are placed on the keyboard, and the palm is outside the keyboard.
  • the present invention therefore extends the upper edge of the piano keyboard area upward in the image by a fixed number of pixels (for example, 200 pixels) to form the expanded piano keyboard area, also called the effective area of the piano keyboard, so that every hand on the keyboard is fully contained in this area.
  • with the piano keyboard area extended upward by 200 pixels, the coordinates of the effective area of the piano keyboard are expressed as {(x1, y1-200), (x2, y2)}.
  • the hand detection module performs target detection on the effective area of the piano keyboard using the hand detection network, obtaining the hand area expressed in coordinates relative to that area.
  • these relative coordinates are converted into the original coordinate system of the 2D image to obtain the hand position coordinates. The hand position coordinates are thus produced by detection with the hand detection network followed by coordinate conversion. The simplest approach would be to feed the entire picture into the hand detection network, but this increases the amount of calculation; therefore, the present invention performs hand detection only on the effective area of the piano keyboard.
  • a hand in the effective area of the piano keyboard represented by {(x1, y1-200), (x2, y2)} is completely enclosed by the rectangular target detection frame of the hand detection network, and the upper left and lower right corner coordinates of that frame are used to represent the hand area.
  • the position coordinates of the hand area (i.e., the hand position coordinates) and the position coordinates of the effective area of the piano keyboard are in the same coordinate system and indicate, respectively, the hand area and the effective area of the piano keyboard in the 2D image, which makes position comparison convenient.
  • the position coordinates of the hand area can be expressed as {(x1', y1'), (x2', y2')}.
  • since the hand area is detected from the effective area of the piano keyboard, some detected hands may be incomplete or may not rest on the piano keyboard at all. The hand position coordinates {(x1', y1'), (x2', y2')} are therefore compared with the piano keyboard position coordinates {(x1, y1), (x2, y2)}: if the hand area overlaps the keyboard area, the hand is considered to be on the keyboard, and the present invention performs the subsequent hand shape error detection and fingering error detection on that hand; if there is no overlap, the hand is considered not to be on the keyboard, and these subsequent detection steps are not needed.
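  • the overlap comparison between a hand box and the keyboard box can be sketched as a standard axis-aligned rectangle test. The function name boxes_overlap and all coordinate values are illustrative assumptions; the text only states that the two rectangles are compared.

```python
def boxes_overlap(a, b):
    """Return True if rectangles a and b, each ((x1, y1), (x2, y2)), share a region of positive area."""
    (ax1, ay1), (ax2, ay2) = a
    (bx1, by1), (bx2, by2) = b
    return ax1 < bx2 and bx1 < ax2 and ay1 < by2 and by1 < ay2


# Illustrative coordinates in the original image coordinate system.
keyboard = ((240, 540), (1680, 810))
hand_on_keys = ((600, 430), (820, 640))    # lower part reaches into the keyboard
hand_off_keys = ((100, 300), (220, 420))   # entirely above and to the left of the keyboard

# Only hands that overlap the keyboard go on to error detection.
hands_to_check = [h for h in (hand_on_keys, hand_off_keys) if boxes_overlap(h, keyboard)]
```

  • the test is cheap, so it can be run on every detected hand before any of the heavier networks are invoked.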
  • because the hand area detected in the effective area of the piano keyboard is limited by the rectangular target detection frame of the hand detection network, it may not contain the complete hand.
  • for example, the fingertips of some fingers of a hand may be on the piano keyboard while the fingertips of its other fingers are outside it, and the fingertips outside the piano keyboard may be missed by the target detection frame. Therefore, starting from the hand position coordinates, the boundary of each hand area that falls on the piano keyboard is extended in all four directions by a fixed number of pixels to obtain an expanded hand area that contains the complete hand, called the effective hand area.
  • for example, with the boundary of the hand area expanded by 30 pixels, the coordinates of the effective hand area can be expressed as {(x1'-30, y1'-30), (x2'+30, y2'+30)}, which ensures that the detected hand on the piano keyboard is complete.
  • when expanding the boundary of the hand area, an out-of-bounds check is required on all four boundaries. If an expanded boundary exceeds the range of the original 2D image, the boundary of the 2D image is used in its place and that side is not expanded further.
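  • the two expansions described above (200 pixels upward for the keyboard, 30 pixels on every side for the hand) together with the out-of-bounds check can be sketched with one helper. The name expand_box, the image size, and the choice to clamp to the last valid pixel index are assumptions made for illustration.

```python
def expand_box(box, left, top, right, bottom, img_w, img_h):
    """Grow a box ((x1, y1), (x2, y2)) by the given margins and clamp it to the image bounds."""
    (x1, y1), (x2, y2) = box
    return (
        (max(0, x1 - left), max(0, y1 - top)),
        (min(img_w - 1, x2 + right), min(img_h - 1, y2 + bottom)),
    )


IMG_W, IMG_H = 1920, 1080  # assumed image size

# Keyboard area -> effective keyboard area: only the upper edge moves.
keyboard = ((240, 540), (1680, 810))
keyboard_effective = expand_box(keyboard, 0, 200, 0, 0, IMG_W, IMG_H)

# Hand area -> effective hand area: the left side would leave the image and is clamped to 0.
hand = ((10, 500), (200, 700))
hand_effective = expand_box(hand, 30, 30, 30, 30, IMG_W, IMG_H)
```

  • clamping rather than rejecting the box keeps hands near the image border usable, at the cost of a slightly smaller margin on the clamped side.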
  • the image shown in FIG. 5 is a schematic diagram of identifying the effective area of the hand from the effective area of the piano keyboard shown in FIG. 4 .
  • the subsequent processing time can be reduced and the accuracy of playing error recognition can be improved.
  • the effective hand area serves two purposes: it is the input of the hand shape error detection module, which detects hand shape errors, and it is the input of the fingertip feature point detection network, which detects the fingertip feature points.
  • the hand type error detection module detects hand shape errors in the effective hand area represented by the hand position coordinates and obtains the hand shape error type and the hand shape error position coordinates.
  • the hand type error detection module applies the hand shape error detection network to the effective hand area and obtains the hand shape error position expressed in coordinates relative to the input effective hand area; these relative coordinates are converted into the original coordinate system of the 2D image to obtain the hand shape error position coordinates.
  • the hand shape error detection network is obtained by training a neural network with the effective hand area as input and the hand shape error type and the relative position coordinates of the hand shape error in the effective hand area as output.
  • in the hand shape error detection module, a deep learning method is used to examine the input effective hand area, and the output is the type of each hand shape error and the coordinates of its position, so as to guide the user to correct the wrong hand shape.
  • hand shape errors are divided into folded fingers, fingertips not standing, fingertips pointing upwards, wrist pressing, and palmar joint collapse.
  • Each type of error is a category.
  • the present invention regards hand shape errors as a detection task.
  • the output of the detection network is the category of each hand shape error found in the hand area of the 2D image and the position of that error in coordinates relative to the input effective hand area; one detection result is output per error.
  • the fingertip feature point detection network detects the fingertip feature points of the different fingers from the hand region represented by the hand position coordinates; each fingertip feature point is first expressed in coordinates relative to the hand region.
  • these relative coordinates are converted into the original coordinate system of the 2D image to obtain the fingertip coordinates in the original coordinate system of the 2D image.
  • the fingertip feature point detection network is obtained by training a neural network with the effective hand area as input and the relative position coordinates of the fingertips of the different fingers in the effective hand area as output.
  • the fingertip feature point detection network performs image segmentation on the effective area of the hand, and the fingertip of each finger is segmented to obtain the feature points of each fingertip.
  • the image shown in Figure 6 is from the hand shown in Figure 5.
  • the fingertip coordinates of each finger are represented by the diagonal corners of the rectangular detection frame corresponding to that fingertip, and the relative position coordinates of each fingertip are converted into the original coordinate system of the 2D image to obtain the fingertip coordinates in that coordinate system.
  • the piano key dividing module divides the piano keyboard region represented by the piano keyboard position coordinates into individual piano keys, each first expressed in coordinates relative to the piano keyboard region.
  • these relative coordinates are converted into the original coordinate system of the 2D image to obtain the key coordinates; the range bounded by the coordinates of each key is the effective area of that key, and morphological processing of the effective area yields the effective edge of each key.
  • the purpose of dividing the keyboard area is to judge whether the fingering is correct or not based on the feature points of the fingertips.
  • in practice, the keys are numbered starting from the first key at one edge of the keyboard, giving a set of keys such as [K1, K2, K3, ..., K88].
  • the pixel area of each key in the image is expressed as a polygon. For example, the area of the K1 key is expressed as the vertex set [(x_K1_0, y_K1_0), ..., (x_K1_n, y_K1_n)], where (x_K1_n, y_K1_n) is a point in the 2D image coordinate system with abscissa x_K1_n and ordinate y_K1_n; the pixels enclosed by the polygon formed by these points are the effective area of the K1 key.
  • the keyboard division module takes the detected piano keyboard area as input, converts it into a grayscale image, applies morphological operations to remove the influence of lighting, noise, and the like, extracts edges with an edge detection algorithm (such as the Sobel operator), and finally performs connected-domain analysis to obtain the keyboard edges.
  • the piano keyboard has only two types of keys, black and white, and their boundaries are regular line segments. Based on the statistical characteristics of the pixels, the edges are bound to individual keys, and the intersection points of different edges are the vertices of the polygon bounding each key's effective area. In this way the key division module establishes a mathematical model of each key in the 2D image coordinate system.
  • the fingertip coordinates are compared with the key coordinates to bind each fingertip that falls on a key to that key, giving the finger-key binding relationship. If the area indicated by a fingertip's coordinates overlaps the area indicated by a key's coordinates, the finger is considered to fall on that key and is bound to it. The finger-key binding relationship for the note being played is then compared with the standard binding relationship for the same note in the score database; if they are inconsistent, a fingering error is reported and the user is prompted to correct the fingering.
  • fingering recognition determines which finger has pressed which key, i.e., it binds fingers to keys. It relies on the output of the key segmentation module and of the fingertip feature point detection network. At the same time, a sound sensor on the key senses the signal generated when the key is pressed, which identifies the pressed key and therefore the note currently being played. The effective area of each key is taken from the output of the key division module, and each detected fingertip feature point is tested in turn to see whether it falls within a key area; if so, the finger is bound to that key. The standard binding relationship between finger and key at that note is looked up in the score database and compared with the predicted binding relationship obtained by fingering recognition. If they are inconsistent, the fingering is judged to be wrong and the user is prompted to correct it.
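  • the binding and comparison steps above can be sketched as a point-in-polygon test followed by a lookup. This is only an illustration under stated assumptions: each fingertip is reduced to a single point, the polygon test is the common ray-casting method (the text does not name one), and the finger labels, key polygons and function names are hypothetical.

```python
def point_in_polygon(pt, polygon):
    """Ray-casting test: True if pt lies inside the polygon given as a list of (x, y) vertices."""
    x, y = pt
    inside = False
    n = len(polygon)
    for i in range(n):
        xa, ya = polygon[i]
        xb, yb = polygon[(i + 1) % n]
        if (ya > y) != (yb > y):  # this edge crosses the horizontal line through pt
            x_cross = xa + (y - ya) * (xb - xa) / (yb - ya)
            if x < x_cross:
                inside = not inside
    return inside


def bind_fingers_to_keys(fingertips, keys):
    """Map each finger to the first key whose effective area contains its fingertip point."""
    bindings = {}
    for finger, pt in fingertips.items():
        for key_id, polygon in keys.items():
            if point_in_polygon(pt, polygon):
                bindings[finger] = key_id
                break
    return bindings


def fingering_is_correct(bindings, pressed_key, standard_finger):
    """Compare the finger(s) bound to the pressed key with the finger required by the score."""
    return standard_finger in [f for f, k in bindings.items() if k == pressed_key]


# Illustrative data: two adjacent keys and three fingertips of the right hand.
keys = {
    "K40": [(100, 0), (120, 0), (120, 100), (100, 100)],
    "K41": [(120, 0), (140, 0), (140, 100), (120, 100)],
}
fingertips = {"R1": (110, 80), "R2": (130, 75), "R3": (300, 80)}
bindings = bind_fingers_to_keys(fingertips, keys)
```

  • here pressed_key would come from the key sensor and standard_finger from the score database; a False result corresponds to the fingering error that is reported to the user.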
  • the present invention adopts the means based on deep learning to complete the detection task and the segmentation task, and obtains the detection network by training the neural network.
  • the present invention provides a complete set of neural network training methods to obtain a piano keyboard detection network, a hand detection network, a hand shape error detection network, and a fingertip feature point detection network.
  • the invention analyzes the biomechanical principle of each hand type error and fingering error, summarizes the essential visual feature of each error, and uses this feature as the basis for neural network learning and prediction.
  • each feature is marked with a rectangular frame (other shapes may also be used) to obtain a sample data set, and the data set is divided into a training set, a validation set, and a test set in a fixed ratio (for example, 7:2:1).
  • the training set and validation set are used to train the neural network
  • the test set is used to test and evaluate the effect of the final network model.
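  • a 7:2:1 division of the annotated samples can be sketched as below. The function name split_dataset, the fixed shuffle seed, and the rule that rounding leftovers go to the test set are assumptions for illustration; the text only gives the ratio.

```python
import random


def split_dataset(samples, ratios=(7, 2, 1), seed=0):
    """Shuffle samples reproducibly and split them into train, validation and test lists."""
    items = list(samples)
    random.Random(seed).shuffle(items)
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_val = len(items) * ratios[1] // total
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # whatever remains after integer division
    return train, val, test


# Illustrative sample identifiers standing in for annotated images.
train, val, test = split_dataset(range(100))
```

  • shuffling before splitting keeps the three sets comparable in scene and error-type coverage, which matters because the test set is used to judge the final model.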
  • the present invention provides a method for training a neural network to obtain the piano keyboard detection network, hand detection network, hand shape error detection network, and fingertip feature point detection network, including the following parts :
  • an image acquisition module collects images of people of different ages, genders, school ages, and skin colors playing the piano in various scenes, on pianos of different types and models, from different angles, and under different lighting conditions.
  • the data set covers all scenes and all error types.
  • annotation is divided into keyboard annotation, hand annotation, hand shape error annotation, and fingertip feature point annotation.
  • for hand shape error annotation, the essential characteristics of each hand shape error are summarized, and the hand shape error category and the hand shape error position are marked. Specifically, annotation covers the position coordinates of the piano keyboard, the position coordinates of the hand, the hand shape error type and its position coordinates, and the coordinates of the fingertip feature points of the different fingers; all annotations are in the original coordinate system of the image.
  • the images in the original data set are processed according to the annotated piano keyboard position coordinates to obtain images containing the piano keyboard area represented by those coordinates, which form a first data set.
  • the piano keyboard area is then expanded to obtain the effective area of the piano keyboard; each original image is cropped according to the annotated keyboard position coordinates and hand position coordinates to obtain its effective area of the piano keyboard, forming a second data set, in which the hand position coordinates annotated in the original image are converted into the coordinate system of the effective area of the piano keyboard.
  • the hand area is expanded to obtain the effective hand area; each original image is cropped according to the annotated hand position coordinates and hand shape error position coordinates to obtain its effective hand area, forming a third data set, in which the hand shape error position coordinates annotated in the original image are converted into the coordinate system of the effective hand area.
  • each original image is likewise cropped according to the annotated hand position coordinates and fingertip feature point coordinates to obtain its effective hand area, forming a fourth data set, in which the fingertip feature point coordinates annotated in the original image are converted into the coordinate system of the effective hand area.
  • the invention takes piano keyboard detection, hand detection and hand shape error detection as multi-branch detection tasks, designs multi-task branch network structure, and then trains them to obtain detection network.
  • the piano keyboard detection network has only one task branch, that is, the piano keyboard detection branch.
  • the hand detection network also has only one task branch, the hand detection branch.
  • this network must complete three tasks: output the coordinate position of the hand, output whether the hand is a left or right hand, and output whether the back or the palm of the hand faces up.
  • the present invention divides hands into four categories: positive left hand, positive right hand, anti-left hand, and anti-right hand. Positive left hand means the back of the left hand faces up, anti-left hand means the palm of the left hand faces up, and likewise for the right hand.
  • in the hand shape error detection network, each branch is a detection sub-network for one error type and predicts only that type, so there are as many sub-branches as there are error categories.
  • All error detection sub-branch networks share the backbone network. For example, it is divided into a broken finger detection branch, a palm joint collapse detection branch, a wrist collapse detection branch, and the like.
  • the first data set is used to train a yolov4 network to convergence to obtain the piano keyboard detection network; the second data set is used to train a yolov4 network to convergence to obtain the hand detection network; and the third data set is used to train a yolov4 network to convergence to obtain the hand shape error detection network.
  • the same loss function is used for every detection task branch. For a single-task branch network, the loss of the detection task branch is the total loss of the entire network; for a multi-task branch network, the weighted sum of the losses of the detection task branches is the total loss of the entire network.
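  • the total-loss rule for single-branch and multi-branch networks can be sketched as a weighted sum. The function name total_loss, the branch names, and the loss and weight values are illustrative assumptions; the text does not state how the weights are chosen.

```python
def total_loss(branch_losses, weights=None):
    """Weighted sum of per-branch losses; with no weights given, every branch counts equally."""
    if weights is None:
        weights = {name: 1.0 for name in branch_losses}
    return sum(weights[name] * loss for name, loss in branch_losses.items())


# Single-task branch network: the branch loss is the total loss.
keyboard_total = total_loss({"keyboard": 0.5})

# Multi-task branch network: one branch per hand shape error type (values are made up).
hand_shape_total = total_loss(
    {"broken_finger": 0.5, "palm_joint_collapse": 0.25, "wrist_collapse": 1.0},
    weights={"broken_finger": 2.0, "palm_joint_collapse": 1.0, "wrist_collapse": 0.5},
)
```

  • because every branch uses the same loss function, the weights are the only per-branch tuning knob in this formulation.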
  • the training of the neural network and the design of the loss function are common methods in the field, and will not be repeated here.
  • online data enhancement can be performed on images, including but not limited to color, contrast, brightness, noise, smooth blur, flip, deformation, distortion, random occlusion and erasure, etc., to improve network robustness.
  • the neural network is not limited to the yolov4 network; other neural networks can also be used.
  • the fourth data set is used to train, to convergence, a neural network with ResNet18 as the backbone and a cascaded pyramid network as the detection head, to obtain the fingertip feature point detection network.
  • the so-called cascaded pyramid network refers to the cascade of two networks that take multi-scale features as input.
  • the first network is called GlobalNet, which performs preliminary detection on fingertip feature points and uses the L2 loss function.
  • the feature map generated by GlobalNet is then extracted by the convolutional layer and input to the RefineNet network to fine-tune the predicted feature points to produce more accurate results.
  • the present invention has the following advantages: 1. It is fast and efficient: 2D images require little computation, and the algorithm is lightweight and simple while still performing well. 2. It accurately outputs the error type and the error position, and distinguishes hand shape errors from fingering errors, so that correction is more targeted. 3. Each model is trained in a data-driven way, so there is no need to build a standard comparison database from experience, and robustness is high. 4. Results are predicted top-down, from coarse-grained keyboard detection to fine-grained hand shape error detection and fingering error recognition, with multiple networks cascaded; in addition, each sub-network adopts multi-task branches, which achieves higher performance.
  • identifying playing errors plays an important role in piano teaching and training, and can significantly improve the quality of piano teaching.
  • the judgment and evaluation of a practitioner's piano performance covers at least two aspects: notes and hand movements.
  • notes may include the frequency spectrum, strength, speed, and rhythm of the fundamental and overtones, among other factors.
  • Hand movements include two aspects: fingering and hand shape.
  • Fingering concerns whether the correct finger is used to play each note during repertoire practice. Fingering includes the position (or change of position) of a single finger and the changes in relative position of multiple fingers. Common fingerings include, for example, straight-finger (one finger corresponds to one key), finger-penetrating (one finger passes under one or more other fingers to play higher notes), cross-fingering (one finger steps over one or more other fingers to play lower notes), brackets, retractions, ring fingers, and so on.
  • Hand shape is used to determine whether there are any problems such as broken fingers, not standing fingertips, collapse of palm joints, wrist shaking, finger lift, and finger tension when playing any note.
  • Figure 7 shows 6 common wrong hand movements in piano practice and the corresponding correct hand movements: Figure 7A shows folding fingers, Figure 7B shows fingertips not standing, Figure 7C shows palmar joint collapse, Figure 7D shows wrist shaking, Figure 7E shows finger lift, and Figure 7F shows finger tension, each with the corresponding correct hand movement. Changes in hand movements produce different sound effects, strongly affect the coherence, rhythm, speed, and timbre of the notes, and are key to a good performance.
  • a single palm includes at least 21 key joint points, and the hand data of the palm can be characterized according to the coordinate positions or relative positions of the 21 key joint points.
  • the trained hand model (i.e., the neural network model) is used to recognize hand data from hand images.
  • the recognized hand data is compared with the standard playing hand data to judge whether the playing hand movements are accurate.
  • FIG. 8 shows a schematic diagram of 21 key joint points in a single palm according to an embodiment of the present invention.
  • 21 key joint points can be selected from a single palm, respectively represented by serial numbers 0-20, where [0, 1, 2, 3, 4] represent the thumb from the wrist to the fingertip.
  • Hand data can be represented by the coordinate position of each key joint point, or by the relative position of each key joint point.
  • the "0" joint point in the thumb can be selected as the origin, and the relative position of every other joint point can be represented by its coordinates relative to the "0" joint point.
  • the coordinate position of each key joint point can be represented by plane coordinates (x, y).
  • "left" or "right” may also be marked to distinguish whether the key joint is located in the left hand or the right hand.
  • the relative position of each key joint point of the hand can be represented by the relative position of other joint points relative to a certain joint point.
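  • expressing the 21 key joint points relative to one chosen joint can be sketched as below. The function name to_relative and the synthetic joint coordinates are illustrative assumptions; joint "0" is used as the origin, as in the embodiment above.

```python
def to_relative(joints, origin_index=0):
    """Express each (x, y) joint position relative to the joint at origin_index."""
    ox, oy = joints[origin_index]
    return [(x - ox, y - oy) for x, y in joints]


# Synthetic plane coordinates for joints 0..20 of one hand (not real measurements).
joints = [(100 + 5 * i, 200 - 3 * i) for i in range(21)]
hand = {"side": "left", "relative": to_relative(joints)}

# The same hand shifted elsewhere in the image yields identical relative positions.
shifted = [(x + 40, y + 25) for x, y in joints]
```

  • relative positions make the hand data independent of where the hand sits in the frame, which simplifies comparison with stored standard hand data.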
  • the hand model takes the hand image as the input data and the hand data in the hand image as the output data, which is obtained by training the neural network model.
  • the neural network in the hand model can be a Recurrent Neural Network (RNN) or a Long Short-Term Memory (LSTM) neural network.
  • an RNN builds on the ordinary multi-layer BP neural network by adding lateral connections between the units of the hidden layer; the values of the neural units at the previous time step are passed to the current units through a weight matrix, which gives the neural network a memory function.
  • RNN has good applicability for dealing with contextual NLP or time series machine learning problems.
  • although an RNN has memory, exploding or vanishing gradients prevent it from retaining content that lies too far back in the sequence. Therefore, according to an embodiment of the present invention, when the sampling interval is long, LSTM is used to recognize the hand image.
  • LSTM adds a memory unit to each neural unit of the hidden layer so that the memory information in the time series is controllable. At each time step, information passes through several controllable gates (forget gate, input gate, candidate gate, output gate) that control how much of the previous information and the current information is remembered or forgotten, giving the RNN network a long-term memory function.
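  • for reference, the four gates named above correspond to the standard LSTM update, written here in one common textbook notation. The symbols are not taken from the patent text: x_t is the input at time step t, h_t the hidden state, c_t the memory cell, W and b the learned weights and biases, sigma the logistic function, and the circled dot element-wise multiplication.

```latex
\begin{aligned}
f_t &= \sigma\left(W_f\,[h_{t-1}, x_t] + b_f\right) && \text{(forget gate)} \\
i_t &= \sigma\left(W_i\,[h_{t-1}, x_t] + b_i\right) && \text{(input gate)} \\
\tilde{c}_t &= \tanh\left(W_c\,[h_{t-1}, x_t] + b_c\right) && \text{(candidate gate)} \\
o_t &= \sigma\left(W_o\,[h_{t-1}, x_t] + b_o\right) && \text{(output gate)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```

  • the additive form of the cell update c_t is what lets information survive across many time steps when f_t stays close to 1.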
  • the training set of the hand model can include hand images from a wide range of samples, for example, images of different hand movements from people of different ages (e.g., the elderly, adults, children) and different genders (e.g., male, female).
  • the hand motion in the hand image is not limited to playing the piano, and can include various motions, such as making a fist, extending the whole palm, pushing, pulling, raising the thumb, and so on.
  • the hand data of the hand images in the training set (such as the coordinate positions or relative positions of key joint points, etc.) can be manually annotated or obtained from an existing database.
  • with the trained hand model (i.e., the neural network model), the coordinate positions or relative positions of the key joint points of the player's hands in a hand image can be identified; these constitute the user's hand data.
  • the present invention provides an intelligent piano training method.
  • the method acquires audio information and video information of the user playing the piano. User audio data is extracted from the audio information at a fixed time interval and compared with the corresponding standard audio data in the audio database to obtain the matching degree between the user audio data and the standard audio data.
  • the user's hand image corresponding to the user audio data is captured from the video information at a fixed time interval; the user's hand data in the hand image is identified by the hand model and compared with the corresponding standard hand data in the hand database to obtain the matching degree between the user's hand data and the standard hand data. And/or, a 2D image containing the complete piano keyboard and corresponding to the user audio data is captured from the video information, and the recognition method for assisting piano teaching described in the previous embodiment is used to identify whether the 2D image contains a playing error.
  • the playing result is fed back to the user according to the matching degree between all of the user's audio data and the corresponding standard audio data, the matching degree between all of the user's hand data and the corresponding standard hand data, and/or the user's playing errors.
  • FIG. 9 shows a smart piano training method according to an embodiment of the present invention. As shown in Figure 9, the method includes the following steps:
  • S310 Acquire audio information and video information of the user playing the piano.
  • audio and video information of the user playing the piano may be collected through audio and video collection devices (eg, a microphone and a camera, or a camera with a microphone).
  • the collected audio information can be preprocessed by removing silent segments, denoising, and noise reduction to avoid external interference and improve the accuracy of scoring.
  • the MIDI audio digital signal of the user playing the piano may be collected by connecting to a MIDI interface (Musical Instrument Digital Interface) on the electronic piano.
  • MIDI audio digital signals are binary data output by an electronic piano; each represents a note that was played and can be recognized and processed by a computer.
  • the user's hand movement of playing the piano can be captured by a camera or other device with image capture function.
  • the hand movements of both hands can be captured by the same camera, or the hand movements of the left and right hands can be captured separately from different angles by multiple cameras. In this case, the video information of the left and right hands can be spliced.
  • a pressure sensor installed under the keys may also collect the touch strength of the user playing the piano, so as to combine with the above audio and video information to jointly determine the score of the user playing the piano.
  • the audio database contains audio data of a large number of standard piano playing pieces (for example, pieces played by piano teachers or professionals, or pieces automatically generated by artificial intelligence based on musical scores).
  • Standard audio data can be extracted from the audio information of standard piano performance pieces according to a certain time interval, and stored in units of pieces to form an audio database.
  • the audio data in the audio database may at least include information such as track name, extraction time, musical note, fundamental frequency, and sound intensity.
  • standard audio data may be extracted from the audio information of a standard piano performance at time intervals of 10 ms or less and stored in an audio database.
  • The current Guinness World Record for the fastest piano playing is 14 key presses in 1 second. Taking 20 key presses in 1 s as an example, each key press lasts 50 ms. Therefore, extracting audio data from the audio information of a piano performance at intervals of 10 ms can cover all the notes produced by the performance.
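  • The sampling argument above can be checked with a short calculation. The following is a minimal sketch only; the function name `samples_per_key_press` is hypothetical and is not part of the described method.

```python
def samples_per_key_press(presses_per_second: int, interval_ms: int) -> int:
    """Number of whole sampling intervals that fit inside one key press."""
    # Duration of one press in milliseconds, e.g. 20 presses/s gives 50 ms.
    press_duration_ms = 1000 // presses_per_second
    return press_duration_ms // interval_ms


# With 20 presses per second and 10 ms sampling, each press spans 5 samples,
# so no note falls between two consecutive extraction times.
print(samples_per_key_press(20, 10))
```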
  • FIG. 10 shows a schematic diagram of standard audio data storage in the audio database of one embodiment.
  • The audio database may include a first-level table and a second-level table, wherein the first-level table is used to store the basic information of standard piano performance pieces, including the serial number, name, level, key, the serial number of the audio data second-level table, and other information.
  • the secondary table is used to store the audio data of each track, including the extraction time, the notes extracted from the audio information of the track at fixed time intervals, and their fundamental frequency and sound intensity.
  • As shown in Figure 10(A), several standard piano performance pieces are stored in the first-level table.
  • Piece 0001 is named "Song of Spring", is level one and in A major, and its audio data is stored in second-level table 0001;
  • piece 0036 is named "Rondo" and is in C major, and its audio data is stored in second-level table 0036;
  • piece 0180 is named "Canon", its level is "other", it is in the key of D, and its audio data is stored in second-level table 0180; and so on.
  • The second-level table 0001 stores all the notes extracted at 10 ms intervals over the complete performance time of piece 0001, together with the corresponding fundamental frequency and sound intensity data. For example, at time "0.000" there is no note, the fundamental frequency is 0, and the sound intensity is 0; at time "0.010" the note is G4, the fundamental frequency is 391 Hz, and the sound intensity is 10 dB; at time "0.020" the note is still G4, the fundamental frequency is 391 Hz, and the sound intensity is 15 dB; at time "0.030" the note is still G4, the fundamental frequency is 391 Hz, and the sound intensity is 20 dB; ...; at time "0.250" the note is D4, the fundamental frequency is 293 Hz, and the sound intensity is 10 dB; and so on.
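  • The two-level storage described above can be sketched as a small in-memory structure. This is an illustration under stated assumptions, not the storage format of the invention: the class names `AudioSample` and `Piece`, the `lookup` helper and the dictionary layout are all hypothetical, and only the example values come from the description.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class AudioSample:
    """One row of a second-level table."""
    time_s: float             # extraction time within the piece
    note: Optional[str]       # None when no note is sounding
    fundamental_hz: float
    intensity_db: float


@dataclass
class Piece:
    """One row of the first-level table plus its second-level rows."""
    serial: str
    name: str
    level: str
    key: str
    samples: List[AudioSample] = field(default_factory=list)


database: Dict[str, Piece] = {
    "0001": Piece("0001", "Song of Spring", "1", "A major", [
        AudioSample(0.000, None, 0.0, 0.0),
        AudioSample(0.010, "G4", 391.0, 10.0),
        AudioSample(0.020, "G4", 391.0, 15.0),
        AudioSample(0.030, "G4", 391.0, 20.0),
    ]),
}


def lookup(serial: str, time_s: float, interval_s: float = 0.010) -> AudioSample:
    """Return the standard sample stored for a given extraction time."""
    index = round(time_s / interval_s)  # rows are stored at fixed intervals
    return database[serial].samples[index]
```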
  • User audio data can be extracted from the audio information at regular intervals, and the extracted user audio data can be compared with the corresponding standard audio data in the audio database to obtain the matching degree between the user audio data and the corresponding standard audio data.
  • the user audio data may be extracted from the audio information according to the first time interval.
  • the first time interval may be the same as the time interval for extracting standard audio data from the standard performance repertoire in the audio database, or may be an integer multiple of the above-mentioned time interval.
  • the extracted user audio data may at least include information such as extraction time, notes and their fundamental frequencies and sound intensity.
  • The user audio data can be matched to the standard audio data in the audio database through its extraction time information. Taking the audio database in Figure 10 as an example, when the user plays "Song of Spring", the user audio data can be extracted from the collected audio information at a time interval of 30 ms; for example, the note extracted at 0.030 s is G4, which corresponds to the standard audio data stored for time "0.030" in second-level table 0001.
  • Before playing, the user can select the piece to be played from the database; alternatively, after the user starts to play, the system can intelligently identify the piece being played and query the standard audio data of that piece in the audio database. The user audio data is then compared with the corresponding standard audio data in the audio database to obtain the matching degree between the user audio data and the corresponding standard audio data.
  • The audio database may store standard audio data of the same piece in different genres. After the user starts to play, the system intelligently identifies the piece and genre being played, queries the standard audio data corresponding to that piece and genre in the audio database, and then compares the user audio data with the corresponding standard audio data to obtain the matching degree between the user audio data and the corresponding standard audio data.
  • different weight values may be set for different information in the audio data, so as to calculate the degree of matching between the user audio data and the corresponding standard audio data.
  • The fundamental frequency weight of a note can be set to be greater than its sound intensity weight, so that the fundamental frequency information of the note accounts for a larger proportion in the calculation of the matching degree.
  • An error redundancy interval can also be set for the standard audio data in the audio database; for example, an error redundancy interval of ±10 Hz is set for the fundamental frequency information of a note. When the user audio data falls within this interval, the fundamental frequency information of the notes in the user audio data can be considered basically consistent with that of the notes in the corresponding standard audio data.
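  • The weighting and the error redundancy interval described above can be combined as in the following minimal sketch. The ±10 Hz tolerance is taken from the description; the function name `audio_match`, the weights 0.75 and 0.25, and the 5 dB intensity tolerance are assumptions chosen only for illustration.

```python
from typing import Mapping


def audio_match(user: Mapping[str, float], standard: Mapping[str, float],
                w_freq: float = 0.75, w_intensity: float = 0.25,
                freq_tol_hz: float = 10.0,
                intensity_tol_db: float = 5.0) -> float:
    """Weighted matching degree in [0, 1] for one extracted sample.

    A field counts as matching when the user value lies inside the error
    redundancy interval around the standard value.
    """
    freq_ok = abs(user["fundamental_hz"] - standard["fundamental_hz"]) <= freq_tol_hz
    intensity_ok = abs(user["intensity_db"] - standard["intensity_db"]) <= intensity_tol_db
    # The fundamental frequency carries the larger weight.
    return w_freq * freq_ok + w_intensity * intensity_ok


standard = {"fundamental_hz": 391.0, "intensity_db": 20.0}
user = {"fundamental_hz": 395.0, "intensity_db": 30.0}
print(audio_match(user, standard))  # pitch within tolerance, intensity not
```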
  • The audio database may be further subdivided into a monophonic database and a repertoire database, wherein the monophonic database stores standard audio data corresponding to single notes, and the repertoire database stores standard audio data corresponding to a large number of standard piano performance pieces.
  • the judgment of hand movements is meaningful only when the notes are played correctly or substantially correctly. Therefore, according to an embodiment of the present invention, based on the degree of matching between the user audio data and the corresponding standard audio data, it is determined whether the user hand image corresponding to the user audio data needs to be intercepted from the video information.
  • an audio match threshold may be set.
  • the audio matching degree threshold can be set by the user, by default by the system, or intelligently set by the system after counting the playing levels of other piano players on the same piece in the networked state.
  • When the matching degree between the extracted user audio data and the corresponding standard audio data is greater than or equal to the audio matching degree threshold, the notes played by the user are correct or basically correct, so the user hand image corresponding to the user audio data can be intercepted from the video information and used to judge the user's hand movement. When the matching degree is less than the audio matching degree threshold, the note played by the user is wrong, so it is unnecessary to judge the hand movement.
  • Video generally refers to various technologies in which a series of static images are captured, recorded, processed, stored, transmitted and reproduced in the form of electrical signals. So a video is actually a series of images arranged in chronological order.
  • When the images change at more than 24 frames per second, according to the principle of persistence of vision, the human eye cannot distinguish a single static image, and the result appears as a smooth and continuous visual effect. Therefore, the user's hand image can be intercepted from the video information at regular intervals.
  • the user's hand image may correspond to the user's audio data in the video information through its interception time information.
  • an image of the user's hand may be captured from the video information at a second time interval.
  • The second time interval may be the same as the time interval for extracting user audio data from the audio information, or may be an integer multiple of that time interval. If the time information is consistent, the user hand image captured at a given moment is the hand image corresponding to the user audio data generated at that moment. For example, if user audio data with a note of G4, a sound intensity of 15, and a pitch of 1750 is extracted at 0.030 s, the user hand image captured from the video information at 0.030 s is the hand image corresponding to the moment when that user audio data was generated.
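  • The relationship between the two extraction intervals can be sketched as follows. The helper `paired_times_ms` is hypothetical; it works in whole milliseconds and assumes that both extractions start at time 0 and that the image interval is an integer multiple of the audio interval.

```python
from typing import List


def paired_times_ms(duration_ms: int, audio_interval_ms: int,
                    multiple: int) -> List[int]:
    """Timestamps at which both an audio sample and a hand image exist."""
    if multiple < 1:
        raise ValueError("multiple must be a positive integer")
    image_interval_ms = audio_interval_ms * multiple
    # Every image timestamp is also an audio timestamp, because the image
    # interval is an integer multiple of the audio interval.
    return list(range(0, duration_ms + 1, image_interval_ms))


# Audio every 10 ms, images every 30 ms: pairs exist at 0, 30, 60 and 90 ms.
print(paired_times_ms(90, 10, 3))
```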
  • the second time interval is not greater than 30 ms.
  • the captured user hand images may be screened, and only the user hand images including the piano key area are selected for identifying the user hand data.
  • In addition to relying on the degree of matching between the user audio data and the corresponding standard audio data, the user can also set whether to capture the user's hand image and identify the user's hand data.
  • a 2D image containing the complete piano keyboard is captured from the video information for identifying possible playing errors.
  • the hand model takes the hand image as input data and the hand data in the hand image as output data, and is obtained by training the neural network.
  • The user's hand data in the user's hand image can be identified through the hand model, for example the coordinate positions or relative positions of hand joint points, including the coordinate positions or relative positions of 21 key joint points in each of the left and right hands, or of more or fewer than 21 joint points.
  • the user hand data may further include the coordinate position or relative position of the wrist.
  • The hand model may use a trained recurrent neural network or a long short-term memory (LSTM) neural network.
  • The piano key area can first be detected against the background of the field of view and a key candidate frame drawn; the hand model then performs hand keypoint regression detection within the key candidate frame region to extract the user hand data.
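  • When keypoints are detected inside a candidate frame, their positions usually need to be converted back to the coordinates of the full image. The sketch below assumes that keypoints are expressed as fractions in [0, 1] of the candidate frame and that the frame is given as (x, y, width, height) in image pixels; both conventions and the name `to_image_coords` are assumptions, not details stated in the description.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x, y, width, height) in pixels


def to_image_coords(rel_point: Tuple[float, float], box: Box) -> Tuple[float, float]:
    """Map a point relative to a candidate frame into image coordinates."""
    rx, ry = rel_point
    x, y, w, h = box
    # Scale by the frame size, then shift by the frame's top-left corner.
    return (x + rx * w, y + ry * h)


key_frame: Box = (100.0, 200.0, 400.0, 100.0)
print(to_image_coords((0.5, 0.5), key_frame))  # centre of the key frame
```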
  • the hand database contains a large amount of standard hand data.
  • The standard hand image can be intercepted from the standard performance video information of a piano piece, and the standard hand data in the standard hand image can then be identified by the hand model and stored in units of pieces to form the hand database.
  • The standard hand image can be intercepted from the video information of the standard piano piece at the same time interval as that used to extract the standard audio data from the standard piece in the audio database, or at an integer multiple of that time interval; the standard hand data in the standard hand image is then recognized by the hand model and stored in the hand database.
  • Standard hand data may include information such as time (ie interception time), coordinate positions or relative positions of key joint points of the left and right hands.
  • the user's hand data may correspond to the standard hand data in the hand database through its time information.
  • FIG. 11 shows a schematic diagram of standard hand data storage in the hand database of an embodiment.
  • The hand database may include a first-level table and a second-level table, wherein the first-level table (as shown in Figure 11(A)) is used to store the basic information of the standard piano performance pieces, including the serial number, name, level, key, the serial number of the hand data second-level table, and other information; the second-level table (as shown in Figure 11(B)) is used to store the hand data of each piece, including the interception time and data such as the relative positions of the 21 joint points of the left and right hands in the hand images captured from the video information of the piece.
  • the audio database can be associated with the hand database, that is, the standard audio data with consistent time information and the standard hand data are stored in association with each other in units of tracks to form a comprehensive database.
  • When the time interval for extracting the standard audio data from the standard performance is inconsistent with the time interval for intercepting the standard hand image from the standard performance, or when the former is an integer multiple of the latter, only the standard hand data whose time information matches that of the standard audio data is stored.
  • FIG. 12 shows a schematic diagram of the storage of the integrated database of one embodiment.
  • The comprehensive database may include a first-level table and a second-level table, wherein the first-level table (as shown in Fig. 12(A)) is used to store the basic information of standard piano performance pieces, including the serial number, name, level, key, the serial number of the comprehensive data second-level table, and other information; the second-level table (as shown in Figure 12(B)) is used to store the audio data and hand data of each piece.
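  • Associating the two databases by time information amounts to joining their rows on a shared timestamp. The following is a minimal sketch in which rows are keyed by time in milliseconds; the function name `build_comprehensive_table` and the dictionary layout are assumptions made for illustration.

```python
from typing import Any, Dict


def build_comprehensive_table(audio_rows: Dict[int, Any],
                              hand_rows: Dict[int, Any]) -> Dict[int, Dict[str, Any]]:
    """Join audio and hand records on timestamps present in both tables."""
    return {
        t: {"audio": audio_rows[t], "hand": hand_rows[t]}
        for t in sorted(audio_rows)
        if t in hand_rows  # hand data without matching audio is not stored
    }


audio = {0: "rest", 30: "G4"}
hands = {0: "pose-a", 10: "pose-b", 20: "pose-c", 30: "pose-d"}
print(build_comprehensive_table(audio, hands))
```

In this example the hand images are sampled three times as often as the audio, so only the hand records at 0 ms and 30 ms are kept.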
  • an error redundancy interval may also be set for the standard hand data in the hand database.
  • When the user's hand data falls within the error redundancy interval, the user's hand data can be considered basically the same as the standard hand data.
  • a hand data matching degree threshold may be set.
  • the hand data matching degree threshold can be set by the user, by default by the system, or intelligently set by the system after counting the performance levels of other piano players on the same piece in the networked state.
  • When the matching degree between the user's hand data and the corresponding standard hand data is greater than or equal to the hand data matching degree threshold, the hand movement played by the user is correct or basically correct; when the matching degree is less than the hand data matching degree threshold, the hand movement of the user is wrong.
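  • One possible way to turn joint positions into a matching degree is to count the joint points that fall within the error redundancy interval of their standard positions. This is a sketch under assumptions: the description does not fix a formula, and the name `hand_match`, the Euclidean distance measure and the tolerance of 0.05 (in relative coordinates) are illustrative choices.

```python
from math import hypot
from typing import List, Tuple

Point = Tuple[float, float]


def hand_match(user: List[Point], standard: List[Point], tol: float = 0.05) -> float:
    """Fraction of joint points lying within `tol` of the standard position."""
    if len(user) != len(standard) or not standard:
        raise ValueError("both hands need the same, non-zero number of joints")
    hits = sum(
        hypot(ux - sx, uy - sy) <= tol
        for (ux, uy), (sx, sy) in zip(user, standard)
    )
    return hits / len(standard)


def hand_is_correct(match: float, threshold: float = 0.9) -> bool:
    """Apply the hand data matching degree threshold."""
    return match >= threshold
```

With 21 joint points per hand, a single misplaced joint gives a matching degree of 20/21, which still passes a threshold of 0.9.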
  • the user's hand image corresponding to the user's hand data can be automatically saved, which is convenient for the user to view.
  • Wrong hand movements can also be displayed to the user. For example, a virtual hand outline is generated through animation rendering; when the fingering is wrong, the wrong finger is displayed to the user, and when the hand shape is wrong, the wrong hand region (e.g. palm, fingertips) is displayed to the user. Alternatively, the hand shape error categories and positions, the fingering errors and positions, and the like are displayed to the user. Since the foregoing embodiments describe in detail how playing errors are identified from 2D images, this is not repeated here.
  • Based on the degree of matching between the user audio data and the corresponding standard audio data and the degree of matching between the user's hand data and the corresponding standard hand data, the level of the user's piano playing can be assessed comprehensively from the two aspects of notes and hand movements.
  • Weights may be set for the degree of matching between the user audio data and the corresponding standard audio data and for the degree of matching between the user's hand data and the corresponding standard hand data when determining the user's piano performance score.
  • Scoring rules may be personalized according to the user's playing habits. For example, if the notes played by a user are accurate but the hand movements are often wrong, a larger weight can be set for the matching degree between the user's hand data and the corresponding standard hand data, so that the feedback places more emphasis on the user's hand movements.
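  • The weighted scoring described above can be illustrated with a simple linear combination. The formula, the 0 to 100 scale and the name `performance_score` are assumptions; the description only states that the two matching degrees are weighted.

```python
def performance_score(audio_match: float, hand_match: float,
                      hand_weight: float = 0.5) -> float:
    """Combine two matching degrees in [0, 1] into a score out of 100."""
    if not 0.0 <= hand_weight <= 1.0:
        raise ValueError("hand_weight must lie in [0, 1]")
    audio_weight = 1.0 - hand_weight
    return 100.0 * (audio_weight * audio_match + hand_weight * hand_match)


# A user with accurate notes but weak hand movements: raising the hand
# weight lowers the score and so draws more attention to hand movements.
print(performance_score(1.0, 0.5))
print(performance_score(1.0, 0.5, hand_weight=0.75))
```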
  • Standard key touch force data is also stored in the database. The collected user key touch force data can be compared with the standard key touch force data to obtain the matching degree between the user key touch force and the standard key touch force, and the user's piano playing level is then assessed comprehensively in combination with the matching degree between the user audio data and the corresponding standard audio data, the matching degree between the user's hand data and the corresponding standard hand data, and/or the user's playing errors.
  • the results of the user's piano playing may be fed back to the user in a delayed manner.
  • The comprehensive score of the piece can also be displayed. The user's specific audio errors and hand movement errors during playing can be recorded in detail to form a score report, so that the user can practice in a targeted way or correct mistakes in piano playing. The current score or score report can also be compared with the user's previous performance records or with other users' performance records to comprehensively evaluate the user's current playing level.
  • The user audio data and the user hand data may be compared at the same time, and the playing result is fed back to the user based on the matching degree between the user audio data and the corresponding standard audio data, the matching degree between the user hand data and the corresponding standard hand data, and/or the user's playing errors.
  • the user audio data and the user's hand image can be extracted and analyzed in real time while acquiring the audio information and video information of the user playing the piano.
  • the user may be fed back with the overall performance result of the piece.
  • FIG. 13 shows a smart piano training method according to an embodiment of the present invention. As shown in Figure 13, the method includes the following steps:
  • S710 Acquire audio information and video information of the user playing the piano.
  • Steps S710-S720 are similar to the above-mentioned steps S310-S320, and are not repeated here.
  • S730 Compare the matching degree between the user audio data and the corresponding standard audio data with the specified threshold N1. When the matching degree is greater than or equal to the specified threshold N1, execute step S740; when the matching degree is less than the specified threshold N1, execute step S760.
  • S740 intercept the user's hand image corresponding to the user audio data from the video information.
  • a 2D image including the complete piano keyboard may also be intercepted from the video information to identify whether there is a playing error.
  • S760 Judge whether the user's piano playing has ended; if it has ended, go to step S770; if it has not ended, repeat steps S710-S760.
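  • The control flow of the steps listed above can be sketched as a loop. This is only an outline under assumptions: audio extraction and matching (S710 to S720) are taken to happen upstream and arrive as (matching degree, video frame) pairs, the callable `hand_match_for` stands in for hand image interception and hand data matching, and the end of the input stands in for the end-of-playing check. None of these names come from the description.

```python
from typing import Any, Callable, Iterable, List, Optional, Tuple

Result = Tuple[float, Optional[float]]


def training_loop(segments: Iterable[Tuple[float, Any]],
                  hand_match_for: Callable[[Any], float],
                  n1: float) -> List[Result]:
    """Gate hand analysis on the audio matching degree for each segment."""
    results: List[Result] = []
    for audio_match, frame in segments:           # one pass per extraction time
        if audio_match >= n1:                     # S730: note correct enough
            hand_match = hand_match_for(frame)    # S740 and hand data matching
        else:
            hand_match = None                     # wrong note: skip hand check
        results.append((audio_match, hand_match))
    return results                                # basis for the final feedback


analysed = []


def fake_hand_match(frame: Any) -> float:
    analysed.append(frame)  # record which frames were actually analysed
    return 0.9


print(training_loop([(0.95, "frame-0"), (0.40, "frame-1")], fake_hand_match, 0.8))
```

Only the first frame is analysed in this example, because the second segment falls below the threshold N1.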
  • When the matching degree between the user audio data and the corresponding standard audio data is less than a specified threshold, the user may be prompted with the key information corresponding to the standard audio data; for example, a virtual keyboard is generated through animation rendering and the correct piano keys are prompted. And/or, when the matching degree between the user's hand data and the corresponding standard hand data is less than a specified threshold, the user may be prompted with the hand movement corresponding to the standard hand data; for example, a virtual hand outline is generated through animation rendering and the correct hand movements are prompted.
  • FIG. 14 shows a smart piano training method according to an embodiment of the present invention. As shown in Figure 14, the method includes the following steps:
  • S810 Acquire audio information and video information of the user playing the piano.
  • S830 Compare the matching degree between the user audio data and the corresponding standard audio data with the specified threshold N1. When the matching degree is greater than or equal to the specified threshold N1, execute step S840; when the matching degree is less than the specified threshold N1, prompt the user with the key information corresponding to the standard audio data and execute step S870.
  • S840 intercept the user's hand image corresponding to the user's audio data from the video information.
  • a 2D image including a complete piano keyboard may also be intercepted from the video information to identify whether there is a playing error.
  • S870 Judge whether the user's piano playing has ended; if so, go to step S880; if not, repeat steps S810-S870.
  • The present invention accurately recognizes the hand data in the user's hand image by using the hand model, identifies playing errors, and makes an overall judgment of the playing result on the basis of comprehensively considering the audio data and hand data generated by the user when playing the piano. This enables users to obtain effective feedback on notes and hand movements during practice without the guidance of professional teachers, which helps users find and correct mistakes and improves practice efficiency.
  • By prompting the user with correct key information and/or hand movements in real time, the invention can also help the user obtain correct demonstration and guidance in time, which helps the user learn to play the piano independently.
  • the present invention also provides an intelligent piano training system implementing the above method.
  • The system includes: an audio and video acquisition unit for acquiring audio information and video information of the user's piano playing; a data extraction unit for extracting user audio data from the audio information, and intercepting user hand images corresponding to the user audio data and/or 2D images containing a complete piano keyboard from the video information; a data recognition unit for recognizing user hand data in the user hand images through a hand model, and/or identifying whether there is a playing error from the 2D images through an intelligent recognition system for assisting piano teaching, wherein the hand model takes the hand image as input data and the hand data in the hand image as output data, and is obtained by training a neural network; a data matching unit for comparing the user audio data with the corresponding standard audio data in the audio database to obtain the matching degree between them, and comparing the user hand data with the corresponding standard hand data in the hand database to obtain the matching degree between them; and a user interaction unit for feeding back the playing result to the user based on the matching degree between the user audio data and the corresponding standard audio data, the matching degree between the user hand data and the corresponding standard hand data, and/or the user's playing errors.
  • The user interaction unit in the smart piano training system is further configured to: prompt the user with the key information corresponding to the corresponding standard audio data, prompt the user with the hand movement corresponding to the corresponding standard hand data, and prompt the user with playing errors and their positions.
  • The intelligent piano training system further includes a control unit for controlling the cooperation among the audio and video acquisition unit, the data extraction unit, the data identification unit, the data matching unit, and the user interaction unit. Based on the matching degree between the user audio data and the corresponding standard audio data, the control unit determines whether to activate the data identification unit; and, based on the matching degree between the user audio data and the corresponding standard audio data or the matching degree between the user's hand data and the corresponding standard hand data, it judges whether the user's playing of the piece has ended and determines whether to activate the audio and video acquisition unit or the user interaction unit.
  • FIG. 15 shows a smart piano practice system according to an embodiment of the present invention.
  • the intelligent piano practice system 900 includes an audio and video acquisition unit 901 , a data extraction unit 902 , a data identification unit 903 , a data matching unit 904 and a user interaction unit 905 .
  • the audio and video capture unit 901 including a sound capture device 9011 and a video capture device 9012, is used to obtain audio information and video information generated when the user plays the piano.
  • the sound collection device 9011 may be, for example, one or more microphones installed near the piano.
  • the sound collection device 9011 can be connected to the data extraction unit 902 in a wired or wireless manner, and sends the acquired audio information to the data extraction unit 902 .
  • the video capture device 9012 may be a device with a photography or image capture function, such as a monocular camera, a binocular camera or a depth camera.
  • the video capture device 9012 can be fixed around the piano to capture hand video information at a fixed point.
  • the video capture device 9012 can also be connected to the data extraction unit 902 in a wired or wireless manner, and sends the acquired video information to the data extraction unit 902 .
  • the sound collecting device 9011 and the video collecting device 9012 can be integrated in one device to simultaneously acquire audio information and video information when the user plays the piano.
  • The data extraction unit 902 includes an audio data extraction unit 9021 and an image data extraction unit 9022, wherein the audio data extraction unit 9021 is used to extract user audio data from the audio information and send it to the data matching unit 904; the image data extraction unit 9022 is used to intercept the user's hand image corresponding to the user's audio data and/or the 2D image including the complete piano keyboard from the video information, and send it to the data identification unit 903.
  • The data recognition unit 903, which contains the hand model 9031, is connected to the image data extraction unit 9022. It is used to identify the user hand data in the user hand image through the hand model, and/or to recognize playing errors from the 2D images through the intelligent recognition system for assisting piano teaching, and to send the results to the data matching unit 904.
  • the hand model 9031 takes the hand image as input data and the hand data in the hand image as output data, and is obtained by training a neural network.
  • the data matching unit 904 includes an audio data matching unit 9041 and a hand data matching unit 9042.
  • The audio data matching unit 9041 includes an audio database and is used to compare the user audio data from the data extraction unit 902 with the corresponding standard audio data in the audio database, obtain the matching degree between the user audio data and the corresponding standard audio data, and send it to the user interaction unit 905 and the control unit 906.
  • The hand data matching unit 9042 includes a hand database and is used to compare the user hand data from the data identification unit 903 with the corresponding standard hand data in the hand database to obtain the matching degree between the user hand data and the corresponding standard hand data. This matching degree and the user's playing errors are sent to the user interaction unit 905.
  • the audio database and the hand database can be stored in the audio data matching unit 9041 and the hand data matching unit 9042 as built-in files, and can also be connected to the audio data matching unit 9041 and the hand data matching unit 9042 through the API program interface.
  • The user interaction unit 905 includes a processor 9051 and a display device 9052. The processor 9051 is configured to receive, from the audio data matching unit 9041, the matching degree between the user audio data and the corresponding standard audio data, and to receive, from the hand data matching unit 9042, the matching degree between the user's hand data and the corresponding standard hand data and/or the user's playing errors. Based on these matching degrees and/or the user's playing errors, the processor determines the score of the user's piano performance.
  • The display device 9052 can be, for example, an electronic device with a display function, such as a smartphone, an iPad, smart glasses, a liquid crystal display screen, or an electronic ink screen, and is used to display the scoring result of the processor 9051.
  • Based on the matching degree between the user audio data and the corresponding standard audio data, the processor 9051 can determine the correct key information and display it on the display device 9052, for example by generating a virtual keyboard through animation rendering and prompting the correct piano keys.
  • Based on the matching degree between the user's hand data and the corresponding standard hand data, the processor 9051 can also construct a correct hand movement and display it on the display device 9052. For example, it can generate a virtual hand contour and prompt the correct hand movements. It can also establish a specific skeletal system for each user according to the user's hand information, generate the user's personalized virtual hand contour by means of skinning, animation rendering and the like, and control the virtual hand contour according to the standard hand data to prompt the correct hand movements. Alternatively, it can display a specific hand image annotated with the hand shape error type and position and the fingering error and position.
  • the smart piano practice system may further include a sensor, and the sensor may be installed under the keys, and used to collect data on the touch strength of the user when playing the piano.
  • the present invention may be implemented in the form of a computer program.
  • the computer program can be stored in various storage media (eg, hard disk, optical disk, flash memory, etc.), and when the computer program is executed by the processor, can be used to implement the method of the present invention.
  • the present invention may be implemented in the form of an electronic device.
  • the electronic device includes a processor and a memory, and the memory stores a computer program that, when executed by the processor, can be used to implement the method of the present invention.
  • the above embodiments describe the position coordinates in the shape of a rectangle or an approximate rectangle.
  • the coordinates of the vertices of the polygon containing the complete piano keyboard are used to represent the piano keyboard area.
  • the present invention may be a system, method and/or computer program product.
  • the computer program product may include a computer-readable storage medium having computer-readable program instructions loaded thereon for causing a processor to implement various aspects of the present invention.
  • a computer-readable storage medium may be a tangible device that retains and stores instructions for use by the instruction execution device.
  • Computer-readable storage media may include, for example, but are not limited to, electrical storage devices, magnetic storage devices, optical storage devices, electromagnetic storage devices, semiconductor storage devices, or any suitable combination of the foregoing.
  • A non-exhaustive list of computer-readable storage media includes: portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random access memory (SRAM), portable compact disc read-only memory (CD-ROM), digital versatile discs (DVD), memory sticks, floppy disks, mechanically encoded devices such as punch cards or raised structures in grooves with instructions stored thereon, and any suitable combination of the above.

Abstract

An intelligent identification method for giving assistance with piano teaching, the method comprising: acquiring, from above a piano keyboard, a 2D image, which contains the complete piano keyboard, of playing a piano; performing target detection on the 2D image by means of a piano keyboard detection network to detect a piano keyboard region, which is represented by the relative position coordinates of the 2D image, and obtaining, by means of conversion, piano keyboard position coordinates under the original coordinates of the 2D image; performing, by means of a hand detection network, target detection on the piano keyboard region, which is represented by the piano keyboard position coordinates, so as to detect a hand region, which is represented by the relative position coordinates of the piano keyboard region, and obtaining hand position coordinates under the original coordinate system of the 2D image by means of conversion; and identifying, by means of a hand shape error detection network, whether there is a hand shape error in the hand region, which is represented by the hand position coordinates, outputting a hand shape error type and a hand shape error position, which is represented by the relative position coordinates of the hand region, and obtaining, by means of conversion, hand shape error position coordinates under the original coordinates of the 2D image. In addition, further provided is an intelligent piano training method for identifying a playing error by using the intelligent identification method.

Description

Intelligent identification method and system for assisting piano teaching, and intelligent piano training method and system
Technical Field
The invention relates to the field of deep learning, and in particular to an intelligent identification method and system for assisting piano teaching, and to an intelligent piano training method.
Background Art
At present, most piano teaching takes the form of face-to-face instruction by a teacher. This approach is constrained by manpower, time, cost, the teacher's skill level and other factors, which greatly increases the difficulty of learning the piano. With the arrival of the AI era, artificial intelligence has become a breakthrough for solving the problems of piano learning, and more and more intelligent piano teaching systems have appeared. Existing intelligent piano teaching systems have the following main shortcomings:
1. Most schemes are based on feature comparison. First, a standard database of correct hand shapes is built by means of a mathematical model; then a prediction model is constructed to extract features from the image under test, and these features are compared with the standard database to judge whether the playing hand shape is wrong. The difficulty is that building the standard database is a complex and inefficient process. Because hand size, joint proportions and the like vary greatly between people, building a standard hand shape from joint angles or joint lengths is highly subjective and not very accurate. Moreover, because the angle between the hand and the camera changes, even very different hand shapes may yield a high similarity when compared, leading to wrong conclusions. The feature comparison approach therefore has poor robustness, high subjectivity and a low recognition rate.
2. When building the prediction model, many methods use a binocular or depth camera to obtain three-dimensional data and build a 3D model. Compared with a 2D vision model, a 3D model is computationally expensive, complex to design, performs poorly, and places high demands on hardware: it needs a chip with high computing power, and both depth cameras and such chips greatly increase cost.
3. Existing methods lack a systematic and comprehensive scheme for correcting piano hand shape and fingering. For hand shape errors they can only judge right or wrong; they cannot tell which kind of error it is or point out where the error is, so their ability to guide students in correcting errors is limited. There is currently no good method for accurately binding keys to fingers: 5 fingers and 88 keys give hundreds of combinations, and existing methods cannot accurately recognize fingering errors. Without accurate recognition of playing errors, precise and targeted playing guidance is impossible.
As is well known, pitch, rhythm, fingering, hand shape and similar factors are critical in piano playing. They are basic skills that beginners must practice repeatedly, so practice usually needs to take place under the supervision and guidance of a professional piano teacher. However, because the time available with a professional teacher is limited, beginners often practice alone, so that errors are not fed back and corrected in time and practice is less effective.
The prior art already contains many piano training methods for beginners practicing on their own. Some judge the accuracy of the practitioner's playing from the audio data of the piece played, for example by comparing the player's audio with correct sound data from a master's performance to judge pitch, rhythm, tempo and dynamics and thereby evaluate the performance. Some evaluate accuracy from video images of the performance, for example by capturing images of finger joints and piano keys from piano teaching videos to build a standard fingering model diagram and a standard key sequence model diagram, and then comparing the practice video with the standard model diagrams to achieve automatic error correction and intelligent teaching. Still others consider both the audio data and the video data of the performance, for example by comparing the audio data and time signal with standard note data to obtain correct note data, retrieving the corresponding performance image data accordingly and performing visual recognition analysis to obtain the correct hand data in the performance image, and computing a score for the performance from the correct note data, the correct hand data and the standard note data.
However, existing piano training methods still have shortcomings. On the one hand, practice (or assessment) methods that rely only on the audio data of the piece played may give inaccurate results because of interference from ambient noise. In addition, because such methods only consider whether the notes played are accurate, they cannot give feedback on or correct other important aspects such as the player's hand posture and fingering. On the other hand, practice (or assessment) methods that rely only on video (or image) data of the playing recognize the captured images of the player's hands and the keys in isolation, and therefore cannot be tied to the piece being played. Even if the fingering and notes are correct, the judgment is inaccurate because important factors such as the rhythm and tempo of the piece are ignored. Furthermore, existing practice (or assessment) methods that consider both the player's audio and video (or image) data still cannot correctly judge the player's fingering or give timely feedback or correction, and therefore cannot provide truly professional guidance.
In addition, for collecting hand information, the prior art uses techniques such as wearable devices (for example, data gloves), movement tracking technology (for example, miniature radar systems), or manual extraction of gesture data from images. In piano playing, however, wearable devices may impair the flexibility of the arms (or fingers); movement tracking technology is not very precise at detecting subtle movements such as a finger moving across the keys or pressing a key; and manual extraction of gesture data involves a heavy workload, demands considerable expertise, and has unsatisfactory generalization ability and robustness.
Therefore, a more accurate and reasonable intelligent piano training method and system is urgently needed.
Summary of the Invention
Therefore, the purpose of the present invention is to overcome the above defects of the prior art and to provide an intelligent identification method and system for assisting piano teaching that can accurately identify playing errors.
According to a first aspect of the present invention, an intelligent identification method for assisting piano teaching is provided, for identifying hand shape errors and/or fingering errors from a 2D image of piano playing. The method comprises: acquiring, from above the piano keyboard, a 2D image of piano playing that contains the complete piano keyboard; performing target detection on the 2D image by means of a piano keyboard detection network to detect a piano keyboard region represented by relative position coordinates of the 2D image, and converting those relative position coordinates into coordinates in the original coordinate system of the 2D image to obtain piano keyboard position coordinates in the original coordinates of the 2D image; performing target detection, by means of a hand detection network, on the piano keyboard region represented by the piano keyboard position coordinates to detect a hand region represented by relative position coordinates of the piano keyboard region, and converting those relative position coordinates into coordinates in the original coordinate system of the 2D image to obtain hand position coordinates in the original coordinates of the 2D image; and identifying, by means of a hand shape error detection network, whether there is a hand shape error in the hand region represented by the hand position coordinates and, if there is, outputting the hand shape error type and a hand shape error position represented by relative position coordinates of the hand region, and converting those relative position coordinates into coordinates in the original coordinate system of the 2D image to obtain hand shape error position coordinates in the original coordinates of the 2D image.
In some embodiments of the present invention, the method further comprises: dividing each key out of the piano keyboard region represented by the piano keyboard position coordinates to obtain the individual keys represented by relative position coordinates of the piano keyboard region, and converting the relative position coordinates of each key into coordinates in the original coordinate system of the 2D image to obtain key coordinates in the original coordinates of the 2D image; detecting, by means of a fingertip feature point detection network, fingertip feature points of the different fingers, represented by relative position coordinates of the hand region, in the hand region represented by the hand position coordinates, and converting the relative position coordinates of each fingertip feature point into coordinates in the original coordinate system of the 2D image to obtain fingertip coordinates in the original coordinates of the 2D image; and performing a position judgment based on the fingertip coordinates and the key coordinates, binding each fingertip that falls on a key to that key to obtain a finger-key binding relationship, and comparing the finger-key binding relationship for a played note with the standard binding relationship for the same note in a score database to detect whether there is a fingering error. With the method of the present invention, specific errors in the playing process can be accurately identified directly from 2D images with a small amount of computation.
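The repeated conversion from region-relative coordinates back to the original image can be illustrated with a minimal Python sketch. It assumes, which the text does not state, that each network reports its box as fractions in [0, 1] of the region it was given; `Box` and `to_original` are hypothetical names introduced only for this illustration.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned rectangle: top-left (x1, y1), bottom-right (x2, y2)."""
    x1: float
    y1: float
    x2: float
    y2: float


def to_original(rel: Box, region: Box) -> Box:
    """Map a box given as fractions of `region` (assumed convention)
    into the coordinate system in which `region` itself is expressed."""
    w = region.x2 - region.x1
    h = region.y2 - region.y1
    return Box(
        region.x1 + rel.x1 * w,
        region.y1 + rel.y1 * h,
        region.x1 + rel.x2 * w,
        region.y1 + rel.y2 * h,
    )


# Illustrative numbers only: a 1920x1080 frame.
image = Box(0, 0, 1920, 1080)
# Keyboard box reported relative to the whole image.
keyboard = to_original(Box(0.125, 0.5, 0.875, 0.75), image)
# Hand box reported relative to the keyboard region.
hand = to_original(Box(0.25, 0.0, 0.5, 1.0), keyboard)
```

Applying the same call once per stage (image to keyboard, keyboard to hand, hand to error position) keeps every result in the single coordinate system of the original 2D image.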
Preferably, in the above method, after the piano keyboard region is detected, the piano keyboard position coordinates corresponding to the piano keyboard region are expanded by a first preset number of pixels to obtain an effective piano keyboard region that contains the complete hands, and subsequent processing is based on the effective piano keyboard region. Preferably, the first preset number of pixels is 200. This pixel expansion effectively avoids inaccurate recognition caused by hands being cut off in some 2D images because of the shooting angle or the position of the hands on the keyboard. At the same time, because playing errors are recognized directly on the effective piano keyboard region rather than on the whole image, the computational workload and hardware overhead are greatly reduced.
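A minimal sketch of this pixel expansion, in Python. Clamping to the image borders is an assumption added here so that the expanded box stays inside the image; the text only specifies the 200-pixel margin. The function name `expand_box` is hypothetical.

```python
def expand_box(box, margin, img_w, img_h):
    """Grow (x1, y1, x2, y2) by `margin` pixels on every side.

    The result is clamped to the image bounds (an assumption of this
    sketch, not something the description specifies).
    """
    x1, y1, x2, y2 = box
    return (
        max(0, x1 - margin),
        max(0, y1 - margin),
        min(img_w, x2 + margin),
        min(img_h, y2 + margin),
    )


# Keyboard box in a 1920x1080 frame, expanded by the preferred 200 pixels.
effective_keyboard = expand_box((240, 540, 1680, 810), 200, 1920, 1080)
```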
Preferably, in some embodiments of the present invention, after the hand region is detected, the piano keyboard position coordinates are compared with the hand position coordinates and the coordinates of any hand not on the piano keyboard are filtered out, that is, the information of hands not on the piano keyboard is discarded. The coordinate boundary of each hand on the piano keyboard is then expanded in all four directions by a second preset number of pixels to obtain an effective hand region containing the complete hand, together with the corresponding hand position coordinates. Preferably, the second preset number of pixels is 30. Filtering by hand position coordinates removes the data of hands not on the piano keyboard, so that playing error recognition is performed only for hands on the piano keyboard, which shortens recognition time and reduces the computational workload.
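The filtering step can be sketched as follows. The description does not define how "on the piano keyboard" is decided, so this Python sketch assumes a simple rectangle-overlap test; `overlaps` and `hands_on_keyboard` are hypothetical names.

```python
def overlaps(a, b):
    """True if rectangles a and b, each (x1, y1, x2, y2), share any area."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    return ax1 < bx2 and bx1 < ax2 and ay1 < by2 and by1 < ay2


def hands_on_keyboard(hands, keyboard, margin=30):
    """Drop hands that do not overlap the keyboard (assumed criterion),
    then grow each remaining hand box by `margin` pixels in all four
    directions, as in the preferred 30-pixel embodiment."""
    kept = []
    for x1, y1, x2, y2 in hands:
        if overlaps((x1, y1, x2, y2), keyboard):
            kept.append((x1 - margin, y1 - margin, x2 + margin, y2 + margin))
    return kept


keyboard = (240, 540, 1680, 810)
hands = [
    (600, 500, 760, 700),    # overlaps the keyboard: kept and expanded
    (1700, 100, 1850, 300),  # resting away from the keyboard: dropped
]
effective_hands = hands_on_keyboard(hands, keyboard)
```

In practice the expanded boxes may also need clamping to the image size before cropping.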
In the above method, the piano keyboard detection network, the hand detection network, the hand shape error detection network and the fingertip feature point detection network are all obtained by training neural networks and can perform target detection and error recognition intelligently and accurately. In some embodiments of the present invention, the neural networks are trained as follows to obtain the piano keyboard detection network, the hand detection network, the hand shape error detection network and the fingertip feature point detection network:
S1. Collect images of multiple people playing different types of pianos in a variety of scenes to form an original data set, such that the images in the original data set cover the scenes corresponding to all existing piano types and all error types;
S2. Annotate the original data set, including annotating the piano keyboard position coordinates, the hand position coordinates, the hand shape error types and hand shape error position coordinates, and the fingertip feature point coordinates of the different fingers, with all annotations in the same two-dimensional coordinate system;
S3. Process the images in the original data set according to the annotated piano keyboard position coordinates to obtain images containing the piano keyboard region represented by the annotated piano keyboard position coordinates, forming a first data set. Further, expand the piano keyboard region to obtain the effective piano keyboard region and, based on the original data set, crop the original images according to the annotated keyboard position coordinates and hand position coordinates to obtain the effective piano keyboard region of each original image, forming a second data set, in which the hand position coordinates annotated in the original image are converted into coordinates in the same coordinate system as the effective piano keyboard region. Further, expand the hand region to obtain the effective hand region and, based on the original data set, crop the original images according to the annotated hand position coordinates and hand shape error position coordinates to obtain the effective hand region of each original image, forming a third data set, in which the hand shape error position coordinates annotated in the original image are converted into coordinates in the same coordinate system as the effective hand region. Based on the original data set, crop the original images according to the annotated hand position coordinates and fingertip feature point coordinates to obtain the effective hand region of each original image, forming a fourth data set, in which the fingertip feature point coordinates annotated in the original image are converted into coordinates in the same coordinate system as the effective hand region;
S4. Train a predetermined neural network to convergence with the first data set to obtain the piano keyboard detection network; train a predetermined neural network to convergence with the second data set to obtain the hand detection network; train a predetermined neural network to convergence with the third data set to obtain the hand shape error detection network; and train a predetermined neural network to convergence with the fourth data set to obtain the fingertip feature point detection network.
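The label conversion in step S3, where annotations made in the original image are re-expressed in the coordinate system of a cropped region, can be sketched in Python as below. It assumes pixel coordinates and that the crop's coordinate origin is its top-left corner; the function names and the dictionary layout are hypothetical.

```python
def box_to_crop(box, crop):
    """Shift an annotated box (x1, y1, x2, y2) from original-image
    coordinates into the coordinate system of `crop`."""
    ox, oy = crop[0], crop[1]
    x1, y1, x2, y2 = box
    return (x1 - ox, y1 - oy, x2 - ox, y2 - oy)


def point_to_crop(point, crop):
    """Shift an annotated point, e.g. a fingertip feature point."""
    return (point[0] - crop[0], point[1] - crop[1])


def build_sample(crop, boxes=(), points=()):
    """Assemble one training sample for a cropped region: the crop
    rectangle plus its labels expressed relative to that crop."""
    return {
        "crop": crop,
        "boxes": [box_to_crop(b, crop) for b in boxes],
        "points": [point_to_crop(p, crop) for p in points],
    }


# Effective keyboard region with one annotated hand and one fingertip.
sample = build_sample(
    (40, 340, 1880, 1010),
    boxes=[(600, 500, 760, 700)],
    points=[(650, 690)],
)
```

The same helper would serve the second, third and fourth data sets, since each pairs a crop with labels shifted into that crop's frame.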
In some embodiments of the present invention, a yolov4 network is trained to convergence with the first, second and third data sets respectively to obtain the piano keyboard detection network, the hand detection network and the hand shape error detection network, and a network composed of ResNet18 and a cascaded pyramid network is trained to convergence with the fourth data set to obtain the fingertip feature point detection network.
The piano keyboard detection network obtained by neural network training can intelligently identify the piano keyboard position and yield the piano keyboard position coordinates. The hand detection network obtained by neural network training can intelligently and accurately identify the hand position, represented by relative position coordinates of the input piano keyboard region; converting these into coordinates in the original image coordinate system directly gives the hand position coordinates in the original image. The hand shape error detection network obtained by neural network training can intelligently and accurately identify the specific hand shape error type and the hand shape error position coordinates. The fingertip feature point detection network obtained by neural network training can intelligently and accurately identify the fingertip positions of the different fingers, represented by relative position coordinates of the input hand region; converting these into coordinates in the original image coordinate system directly gives the fingertip coordinates in the original image, which facilitates subsequent fingering recognition. The detection networks obtained by training neural networks on the annotated data sets can accurately identify hand shape errors in playing without a standard hand shape comparison database, with good robustness and high accuracy.
According to a second aspect of the present invention, a system for implementing the method of the first aspect is provided, comprising: an image acquisition module for acquiring a 2D image of piano playing that contains the complete piano keyboard; a piano keyboard detection module for performing target detection on the 2D image to detect a piano keyboard region represented by relative position coordinates of the 2D image, and converting those relative position coordinates into coordinates in the original coordinate system of the 2D image to obtain piano keyboard position coordinates in the original coordinates of the 2D image; a hand detection module for performing target detection on the piano keyboard region represented by the piano keyboard position coordinates to detect a hand region represented by relative position coordinates of the piano keyboard region, and converting those relative position coordinates into coordinates in the original coordinate system of the 2D image to obtain hand position coordinates in the original coordinates of the 2D image; and a hand shape error detection module for identifying whether there is a hand shape error in the hand region represented by the hand position coordinates, outputting the hand shape error type and a hand shape error position represented by relative position coordinates of the hand region, and converting those relative position coordinates into coordinates in the original coordinate system of the 2D image to obtain hand shape error position coordinates in the original coordinates of the 2D image. Preferably, the system further comprises: a key division module for dividing each key out of the piano keyboard region represented by the piano keyboard position coordinates to obtain the individual keys represented by relative position coordinates of the piano keyboard region, and converting the relative position coordinates of each key into coordinates in the original coordinate system of the 2D image to obtain key coordinates in the original coordinates of the 2D image; a fingertip feature point detection network for detecting, in the hand region represented by the hand position coordinates, fingertip feature points of the different fingers represented by relative position coordinates of the hand region, and converting the relative position coordinates of each fingertip feature point into coordinates in the original coordinate system of the 2D image to obtain fingertip coordinates in the original coordinates of the 2D image; and a fingering error detection module for performing a position judgment based on the fingertip coordinates and the key coordinates, binding each fingertip that falls on a key to that key to obtain a finger-key binding relationship, and comparing the finger-key binding relationship for a played note with the standard binding relationship for the same note in a score database to detect whether there is a fingering error. In some embodiments of the present invention, the system further comprises a user interaction and display module for merging the playing errors that occur during playing with the image of the piano playing and displaying the result, and for providing an interactive interface for mode selection and function selection.
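What the fingering error detection module does can be sketched in Python under simplifying assumptions: every key is an axis-aligned rectangle in original-image coordinates, the played notes are already known, and the score database is a plain mapping from note to expected finger. All names and the data layout are hypothetical.

```python
def bind_fingers(fingertips, keys):
    """Bind each fingertip to the key rectangle it falls on.

    fingertips: {finger_name: (x, y)}
    keys:       {key_name: (x1, y1, x2, y2)}
    Returns {key_name: finger_name}. If two fingertips land on the same
    key, the later one wins (a simplification of this sketch).
    """
    bound = {}
    for finger, (px, py) in fingertips.items():
        for key, (x1, y1, x2, y2) in keys.items():
            if x1 <= px < x2 and y1 <= py < y2:
                bound[key] = finger
                break
    return bound


def fingering_errors(bound, played_notes, standard):
    """Compare observed bindings with the standard bindings.

    Returns a list of (note, expected_finger, actual_finger) tuples for
    every played note whose finger differs from the standard one.
    """
    errors = []
    for note in played_notes:
        expected = standard.get(note)
        actual = bound.get(note)
        if expected is not None and actual != expected:
            errors.append((note, expected, actual))
    return errors


keys = {
    "C4": (600, 540, 630, 810),
    "D4": (630, 540, 660, 810),
    "E4": (660, 540, 690, 810),
}
fingertips = {"thumb": (615, 700), "index": (645, 690), "middle": (700, 500)}
bound = bind_fingers(fingertips, keys)
errors = fingering_errors(bound, ["C4", "D4"], {"C4": "thumb", "D4": "middle"})
```

Real keyboards would need black keys handled separately, since their rectangles overlap the white keys beneath them.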
In some embodiments of the present invention, the image acquisition module is any electronic device capable of taking pictures, such as a mobile phone, a still camera or a video camera.
According to a third aspect of the present invention, to overcome the shortcomings of piano training in the prior art, the present invention further provides an intelligent piano training method, comprising: acquiring audio information and video information of a user playing the piano; extracting user audio data from the audio information and comparing it with corresponding reference audio data stored in an audio database to obtain the degree of matching between the user audio data and the corresponding reference audio data; capturing, from the video information, a user hand image corresponding to the user audio data, identifying the user hand data in the user hand image by means of a hand model, and comparing it with the corresponding reference hand data (the correct hand data) stored in a hand database to obtain the degree of matching between the user hand data and the corresponding reference hand data, wherein the hand model is obtained by training a neural network with hand images as input data and the hand data in the hand images as output data; and/or capturing, from the video information, a 2D image containing the complete piano keyboard that corresponds to the user audio data, and identifying from that 2D image, using the method of the first aspect of the present invention, whether there is a playing error; and feeding back the playing result to the user based on the degree of matching between the user audio data and the corresponding reference audio data and on the degree of matching between the user hand data and the corresponding reference hand data and/or the user's playing errors. The playing errors include hand shape errors and/or fingering errors.
Optionally, the above method further includes: feeding back a playing result to the user based on the degrees of matching between all of the user audio data generated while the user plays the piano and the corresponding reference audio data, and the degrees of matching between all of the user hand data generated while the user plays the piano and the corresponding reference hand data, and/or on all of the user's playing errors.
Optionally, the above method further includes: when the degree of matching between the user audio data and the corresponding reference audio data is less than a specified threshold, prompting the user with the key information that corresponds to the corresponding reference audio data.
Optionally, the above method further includes: when the degree of matching between the user hand data and the corresponding reference hand data is less than a specified threshold, prompting the user with the hand movement that corresponds to the corresponding reference hand data, and/or displaying to the user the type and position of the playing error.
Optionally, the user audio data includes an extraction time, a note, a fundamental frequency, and a sound intensity.
Optionally, the user hand data includes a capture time and the relative positions of 21 key joint points for each of the left and right hands.
Optionally, extracting the user audio data from the audio information includes: extracting the user audio data from the audio information at a first time interval, wherein the user audio data is matched to the reference audio data in the audio database according to its extraction time.
Optionally, capturing the user hand image corresponding to the user audio data from the video information includes: capturing the user hand image from the video information at a second time interval, wherein the user hand image is matched to the user audio data by its capture time.
Optionally, the second time interval is the same as the first time interval, or the second time interval is an integer multiple of the first time interval.
Optionally, the capture time of the user hand data is the same as the capture time of the user hand image, and the user hand data is matched to the reference hand data in the database according to its capture time.
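The timing relationship between the two intervals can be illustrated with a minimal Python sketch. It assumes timestamps in integer milliseconds starting at 0; the function names `sample_times` and `pair_by_time` and the interval values are hypothetical and do not come from the source.

```python
from typing import List


def sample_times(duration_ms: int, interval_ms: int) -> List[int]:
    """Timestamps (ms) at which data is taken, starting at 0 (assumed)."""
    if interval_ms <= 0:
        raise ValueError("interval_ms must be positive")
    return list(range(0, duration_ms + 1, interval_ms))


def pair_by_time(audio_times: List[int], image_times: List[int]) -> List[int]:
    """Timestamps at which both an audio record and a hand image exist."""
    audio_set = set(audio_times)
    return [t for t in image_times if t in audio_set]


# Hypothetical values: first interval 50 ms, second interval = 2 x first.
first_interval_ms = 50
second_interval_ms = 2 * first_interval_ms

audio_times = sample_times(300, first_interval_ms)   # audio extraction times
image_times = sample_times(300, second_interval_ms)  # hand image capture times
paired = pair_by_time(audio_times, image_times)
```

Because the second interval is an integer multiple of the first, every hand image timestamp in this sketch coincides with an audio extraction timestamp, so each image can be paired with audio data by time alone.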
Optionally, the hand model is obtained by training a recurrent neural network or a long short-term memory (LSTM) neural network.
Optionally, the above method further includes: selecting, from the user hand images, those user hand images that contain piano keys for use in recognizing the user hand data.
Optionally, the above method further includes: acquiring key-touch force data of the user; comparing the user's key-touch force data with the corresponding reference key-touch force data in a database to obtain a degree of matching between the user key-touch force data and the corresponding reference key-touch force data; and determining a score for the user's piano playing based on the degree of matching between the user audio data and the corresponding reference audio data, the degree of matching between the user hand data and the corresponding reference hand data, and the degree of matching between the user key-touch force data and the corresponding reference key-touch force data, and/or on all of the user's playing errors.
A fourth aspect of the present invention provides an intelligent piano training system, comprising: an audio and video acquisition unit for acquiring audio information and video information of a user playing the piano; a data extraction unit for extracting user audio data from the audio information, and for capturing from the video information a user hand image corresponding to the user audio data and/or a 2D image containing the complete piano keyboard; a data recognition unit for recognizing the user hand data in the user hand image by means of a hand model, and/or for identifying from the 2D image whether there is a playing error by means of the system for intelligently recognizing playing errors for assisting piano teaching according to the second aspect of the present invention, wherein the hand model is obtained by training a neural network with hand images as input data and the hand data in those hand images as output data; a data matching unit for comparing the user audio data with the corresponding reference audio data in an audio database to obtain a degree of matching between the user audio data and the corresponding reference audio data, and for comparing the user hand data with the corresponding reference hand data in a hand database to obtain a degree of matching between the user hand data and the corresponding reference hand data; and a user interaction unit for feeding back a playing result to the user based on the degree of matching between the user audio data and the corresponding reference audio data and the degree of matching between the user hand data and the corresponding reference hand data, and/or on all of the user's playing errors. The playing errors include hand shape errors and/or fingering errors. Optionally, the user interaction unit is further used for: prompting the user with the key information that corresponds to the corresponding reference audio data; and/or prompting the user with the hand movement that corresponds to the corresponding reference hand data.
Optionally, the audio and video acquisition unit includes an audio acquisition device and a video acquisition device, wherein the video acquisition device includes one or more monocular cameras, binocular cameras, or depth cameras, and the video acquisition device is either fixed at a position around the piano to collect hand video information from a fixed point, or mounted on a slide rail to track the hands automatically while collecting hand video information.
Optionally, the above system further includes: a sensor installed under the piano keys for collecting key-touch force data while the user plays the piano.
Compared with the prior art, the advantages of the present invention are as follows. The present invention accurately recognizes the hand data in the user hand image by means of the hand model and, by considering together the audio data and the hand data generated while the user plays the piano, makes an overall judgment of the user's playing. Even without the guidance of a professional teacher, the user can therefore quickly obtain effective feedback on notes and fingering during practice, which helps the user find and correct mistakes in time and improves practice efficiency. In addition, in some embodiments of the present invention, prompting the user in real time with the correct key information and/or hand movements helps the user obtain correct demonstration and guidance in time and helps the user learn to play the piano independently. In the present invention, playing errors are identified from 2D images, which requires little computation and low hardware cost. The 2D vision model therefore has irreplaceable advantages.
Description of the drawings
The embodiments of the present invention are further described below with reference to the accompanying drawings, in which:
Fig. 1 is a schematic diagram of the main work contents of an intelligent identification method for assisting piano teaching according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of the framework of an intelligent identification system for assisting piano teaching according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of an example of a collected 2D image according to an embodiment of the present invention;
Fig. 4 is a schematic diagram of the piano keyboard area detected from the 2D image shown in Fig. 3 according to an embodiment of the present invention;
Fig. 5 is a schematic diagram of the hand area detected from the piano keyboard area shown in Fig. 4 according to an embodiment of the present invention;
Fig. 6 is a schematic diagram of the fingertip feature points detected from the hand area shown in Fig. 5 according to an embodiment of the present invention;
Fig. 7 is a schematic diagram of 6 common incorrect hand movements in piano practice and the corresponding correct hand movements;
Fig. 8 is a schematic diagram of the 21 key joint points of a single hand according to an embodiment of the present invention;
Fig. 9 shows an intelligent piano practice method according to an embodiment of the present invention;
Fig. 10 is a schematic diagram of the storage of standard audio data in an audio database according to an embodiment of the present invention;
Fig. 11 is a schematic diagram of the storage of standard hand data in a hand database according to an embodiment of the present invention;
Fig. 12 is a schematic diagram of the storage of a comprehensive database according to an embodiment of the present invention;
Fig. 13 shows an intelligent piano training method according to an embodiment of the present invention;
Fig. 14 shows an intelligent piano training method according to an embodiment of the present invention;
Fig. 15 shows an intelligent piano training system according to an embodiment of the present invention.
Detailed description
In order to make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below through specific embodiments. It should be understood that the specific embodiments described herein are only used to explain the present invention and are not intended to limit it.
The present invention is built on 2D images. It treats hand shape errors as a detection task and fingering errors as a feature point detection task plus an image segmentation task, and constructs a 2D vision model to solve the problem of identifying playing errors. As a result, no 3D modeling is required and no standard database needs to be established for comparison, which greatly reduces the computational workload and the hardware overhead.
According to an embodiment of the present invention, as shown in Fig. 1, an intelligent identification method for assisting piano teaching detects collected 2D images of piano playing to identify whether there is a playing error, and mainly includes the following parts. First, piano keyboard detection is performed on the collected 2D image of piano playing, and the piano keyboard area is cropped from it. Then, hand detection and key segmentation are each performed on the cropped piano keyboard area: hand detection detects the hand area within the piano keyboard area, and key segmentation divides the keys on the basis of the piano keyboard area to obtain each individual key. Next, hand shape error detection and fingertip detection are each performed on the hand area: hand shape error detection detects hand shape errors and outputs the type and position coordinates of each hand shape error found, and fingertip detection performs fingertip feature point detection on the hand area to obtain the fingertip position coordinates of the different fingers. Finally, the fingertip position coordinates are compared with the key position coordinates, each fingertip that falls on a key is bound to that key to obtain a finger-key binding relationship, and the finger-key binding relationship is compared with the binding relationship for playing the same note in the score database to determine whether there is a fingering error; any fingering errors found are output.
Fig. 2 shows the main functional modules used when identifying playing errors with the method of the present invention. The image acquisition module collects, from above the piano keyboard, a 2D image of piano playing that contains the complete keyboard. The piano keyboard detection module performs piano keyboard detection on the 2D image to obtain the piano keyboard area. The hand detection module performs hand detection on the piano keyboard area to obtain the hand area. The hand shape error detection module performs hand shape error detection on the hand area to obtain the hand shape error type and the hand shape error position coordinates. The key division module segments the keys on the basis of the piano keyboard area to obtain the coordinates of each key. The fingertip feature point detection network performs fingertip feature point detection on the hand area to obtain the fingertip coordinates of the different fingers. The fingering error detection module compares the positions of the fingertip coordinates and the key coordinates in order to bind each fingertip that falls on a key to that key: if the area represented by a finger's fingertip coordinates overlaps the area represented by a key's coordinates, the finger is on that key, and the finger is bound to the key to obtain a finger-key binding relationship. The finger-key binding relationship for playing a given note is then compared with the standard binding relationship in the score database to detect fingering errors.
For a better understanding of the present invention, the present invention is described in detail below with reference to the accompanying drawings.
1. Image acquisition
The image acquisition module collects, from above the piano keyboard, a 2D image that contains the complete piano keyboard while the player is playing the piano (in the collected 2D image the person is at the top of the image, and the piano keyboard area appears as a rectangle or an approximate rectangle). The collected 2D image is then processed with image processing algorithms to obtain a processed 2D image. The image acquisition module may use any electronic device capable of taking pictures to collect the 2D image, such as a mobile phone, a still camera, or a video camera; the angle of the device is adjusted before or during shooting so that the captured 2D image contains the complete piano keyboard and part or all of the hands. The image processing algorithms include, but are not limited to, black level compensation, lens correction, bad pixel correction, color interpolation, denoising, gamma correction, color space conversion, white balance correction, color and contrast enhancement, and format conversion, and are used to process the captured 2D image so that it is suitable for subsequent operations. For example, the captured raw 2D image may be converted into a 2D image in a format compatible with piano keyboard detection, such as bmp, jpg, png, tif, gif, pcx, tga, exif, fpx, svg, psd, cdr, pcd, dxf, ufo, eps, ai, raw, WMF, webp, avif, or apng; the specific format can be set according to actual application requirements. Fig. 3 shows an example of a collected 2D image that includes the complete piano keyboard and the player's entire hands while playing the piano.
2. Piano keyboard detection
The piano keyboard detection module detects the position of the piano keyboard in the 2D image, which yields the piano keyboard area represented by the piano keyboard position coordinates. The piano keyboard detection module uses a piano keyboard detection network to perform target detection on the 2D image and obtain the piano keyboard position coordinates. The piano keyboard detection network outputs position coordinates relative to the input 2D image; these are converted into coordinates in the original coordinate system of the 2D image to obtain the piano keyboard position coordinates in that coordinate system, which indicate the piano keyboard area in the 2D image. The piano keyboard detection network is obtained by training a neural network with 2D images of piano playing as input and the relative position coordinates of the piano keyboard in the 2D image as output. When the piano keyboard is recognized in the 2D image, the piano keyboard falls entirely within the rectangular target detection box of the piano keyboard detection network, so that the box encloses the whole keyboard, and the relative position coordinates of the piano keyboard in the input 2D image are expressed by two-dimensional coordinates, in the form (x, y), of either the diagonal corners or the center point of the rectangular target detection box. Converting the relative position coordinates output by the piano keyboard detection network into coordinates in the original coordinate system of the 2D image gives the piano keyboard position coordinates in the original 2D image. For example, the keyboard can be represented by the upper-left corner (x1, y1) and the lower-right corner (x2, y2) of the rectangular target detection box, in which case the rectangular region of the 2D image bounded by (x1, y1) and (x2, y2) is the piano keyboard area. Alternatively, the piano keyboard area can be represented by the center point (x0, y0) of the rectangular target detection box together with its width w and height h. Either representation may be used; the embodiments of the present invention use the first, diagonal-corner form {(x1, y1), (x2, y2)}, and, for convenience of description, all coordinates in the present invention refer to coordinates that have already been converted into the original coordinate system of the 2D image.
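The two box representations and the conversion to original image coordinates can be sketched as follows. The source does not define what "relative position coordinates" means numerically; this sketch assumes they are normalized to the range [0, 1] over the input image, and the function names are hypothetical.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def to_original(rel_box: Box, image_w: int, image_h: int) -> Tuple[int, int, int, int]:
    """Scale a box given in normalized [0, 1] coordinates (assumed) to pixel coordinates."""
    x1, y1, x2, y2 = rel_box
    return (round(x1 * image_w), round(y1 * image_h),
            round(x2 * image_w), round(y2 * image_h))


def corners_to_center(box: Box) -> Tuple[float, float, float, float]:
    """Convert {(x1, y1), (x2, y2)} to center point (x0, y0), width w, height h."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1)


def center_to_corners(x0: float, y0: float, w: float, h: float) -> Box:
    """Convert center point, width, and height back to diagonal corners."""
    return (x0 - w / 2, y0 - h / 2, x0 + w / 2, y0 + h / 2)


# Hypothetical network output for a 1920 x 1080 image.
keyboard = to_original((0.1, 0.5, 0.9, 0.75), 1920, 1080)
center_form = corners_to_center(keyboard)
round_trip = center_to_corners(*center_form)
```

The two forms carry the same information, so converting from corners to center form and back returns the original box.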
The piano keyboard area serves two purposes: first, it is the input to the hand detection module, which detects the position of the hands; second, it is the input to the key division module, which separates out each individual key and determines the position of each key.
3. Hand detection
The hand detection module detects, within the piano keyboard area, the hand area expressed in position coordinates relative to the piano keyboard area, and converts those relative position coordinates into coordinates in the original coordinate system of the 2D image to obtain the hand position coordinates in that coordinate system; the hand position coordinates indicate the hand area in the 2D image. The hand detection network is obtained by training a neural network with the piano keyboard area as input and the relative position coordinates of the hand within the piano keyboard area as output. In the image, the hand position coordinates and the piano keyboard position coordinates are compared to determine whether the two overlap. If the rectangular area represented by the hand position coordinates overlaps the rectangular area represented by the piano keyboard position coordinates, the hand area overlaps the piano keyboard area, the hand is considered to be on the keyboard, and the subsequent hand shape error detection, fingering error detection, and so on are performed for that hand. Conversely, if there is no overlap, the hand is considered not to be on the keyboard, and the subsequent hand shape error detection, fingering error detection, and other detection steps are not needed.
It should be taken into account that a hand within the piano keyboard area may be incomplete, for example when the fingers are on the keyboard but the palm is outside it. To ensure that hands placed on the keyboard are detected in full, the present invention extends the upper edge of the piano keyboard area toward the top of the image by a certain number of pixels, forming an expanded piano keyboard area, also called the effective piano keyboard area; this ensures that every hand placed on the keyboard is complete within that area. According to an embodiment of the present invention, the piano keyboard area is extended upward by 200 pixels, and the coordinates of the resulting effective piano keyboard area are expressed as {(x1, y1-200), (x2, y2)}.
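A minimal Python sketch of this upward extension is shown below. The function name is hypothetical, and clamping the upper edge at the top of the image (y = 0) is an added assumption; the source only gives the formula {(x1, y1-200), (x2, y2)}.

```python
from typing import Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2), with y increasing downward


def expand_keyboard_upward(box: Box, pixels: int = 200) -> Box:
    """Move the upper edge of the keyboard box up by `pixels`, stopping at y = 0 (assumed)."""
    x1, y1, x2, y2 = box
    return (x1, max(0, y1 - pixels), x2, y2)


# Hypothetical keyboard boxes in a 1920 x 1080 image.
effective = expand_keyboard_upward((100, 500, 1800, 800))
effective_near_top = expand_keyboard_upward((100, 120, 1800, 400))
```

Only the upper edge moves, because the player is at the top of the image and the palms extend in that direction.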
The hand detection module uses the hand detection network to perform target detection on the effective piano keyboard area represented by the piano keyboard position coordinates, obtaining the hand area expressed in position coordinates relative to the piano keyboard area, and converts those relative position coordinates into coordinates in the original coordinate system of the 2D image to obtain the hand position coordinates in that coordinate system. The hand position coordinates are thus obtained by detection with the hand detection network followed by conversion. The simplest approach would be to feed the whole picture into the hand detection network, but this would increase the amount of computation, so the present invention performs hand detection on the effective piano keyboard area. When hands are recognized in the effective piano keyboard area, each hand in the effective piano keyboard area represented by {(x1, y1-200), (x2, y2)} falls entirely within a rectangular target detection box of the hand detection network, and the hand area is represented by the upper-left and lower-right corner coordinates of that box. The position coordinates of the hand area (that is, the hand position coordinates) and the position coordinates of the effective piano keyboard area are in the same coordinate system; they indicate the hand area and the effective piano keyboard area in the 2D image respectively, which makes position comparison convenient. Assuming that the rectangular target detection box of the hand detection network has upper-left corner (x1', y1') and lower-right corner (x2', y2') when a hand is detected, the position coordinates of the hand area can be expressed as {(x1', y1'), (x2', y2')}.
When hand areas are detected in the effective piano keyboard area, a hand area may be incomplete, or some hands may not be on the piano keyboard. The hand position coordinates {(x1', y1'), (x2', y2')} are therefore compared with the piano keyboard position coordinates {(x1, y1), (x2, y2)}. If the hand area overlaps the keyboard area, the hand is considered to be on the keyboard, and the present invention performs the subsequent hand shape error detection, fingering error detection, and so on for that hand. Conversely, if there is no overlap, the hand is considered not to be on the keyboard, and the subsequent hand shape error detection, fingering error detection, and other detection steps are not needed.
Because the hand area obtained by hand detection in the effective piano keyboard area is limited to the extent of the rectangular target detection box of the hand detection network, it may not contain the complete hand. For example, the fingertips of some fingers of a hand may be on the piano keyboard while the fingertips of other fingers of the same hand are off it, and the fingertips off the keyboard are not included in the target detection box at detection time. To handle this, the present invention first discards the coordinates of hands that are not on the piano keyboard, and then extends the boundary of the area of each hand that is on the piano keyboard (the hand area) outward in all four directions by a certain number of pixels, to obtain an expanded hand area that contains the complete hand, also called the effective hand area. According to an embodiment of the present invention, the boundary of the hand area is extended by 30 pixels, and the coordinates of the resulting effective hand area can be expressed as {(x1'-30, y1'-30), (x2'+30, y2'+30)}, which ensures that each detected hand on the piano keyboard is complete. Note that when the hand area is expanded, each of its four boundaries must be checked against the image bounds: if an expanded boundary would exceed the boundary of the original 2D image, the boundary of the 2D image is used in place of that boundary of the hand area, that is, a boundary is not extended beyond the image. Fig. 5 is a schematic diagram of the effective hand areas identified from the effective piano keyboard area shown in Fig. 4.
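The filtering and expansion steps can be sketched in Python as follows. This is an illustrative sketch under stated assumptions: boxes are axis-aligned in original image coordinates, the image boundary is taken as `0..image_w` and `0..image_h`, and all function names and example coordinates are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2)


def boxes_overlap(a: Box, b: Box) -> bool:
    """True if two axis-aligned boxes share a region of non-zero area."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    return ax1 < bx2 and bx1 < ax2 and ay1 < by2 and by1 < ay2


def expand_and_clamp(box: Box, margin: int, image_w: int, image_h: int) -> Box:
    """Grow the box by `margin` on all four sides without leaving the image."""
    x1, y1, x2, y2 = box
    return (max(0, x1 - margin), max(0, y1 - margin),
            min(image_w, x2 + margin), min(image_h, y2 + margin))


def effective_hand_regions(hands: List[Box], keyboard: Box,
                           image_w: int, image_h: int, margin: int = 30) -> List[Box]:
    """Drop hands that do not overlap the keyboard, then expand the remaining ones."""
    return [expand_and_clamp(hand, margin, image_w, image_h)
            for hand in hands if boxes_overlap(hand, keyboard)]


# Hypothetical detections in a 1920 x 1080 image.
keyboard = (100, 500, 1800, 800)
hands = [(300, 350, 520, 560),    # on the keyboard
         (1700, 420, 1910, 600),  # on the keyboard, near the right image edge
         (900, 100, 1000, 200)]   # above the keyboard, not touching it

regions = effective_hand_regions(hands, keyboard, 1920, 1080)
```

The third hand is discarded because it does not overlap the keyboard box, and the second hand's right edge stops at the image boundary instead of extending to x = 1940.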
Cropping the effective hand area and discarding hands that are not on the piano keyboard reduces subsequent processing time and improves the accuracy of playing error recognition. The effective hand area serves two purposes: first, it is the input to the hand shape error detection module, which detects hand shape errors; second, it is the input to the fingertip feature point detection network, which detects fingertip feature points.
4. Hand shape error detection
通过手型错误检测模块从手部位置坐标表示的手部有效区域中检测手型错误获得手型错误类型和手型错误位置坐标,所述手型错误检测模块采用手型错误检测网络对手部有效区域进行检测并获得以输入手部有效区域的相对位置坐标表示的手型错误位置,将用于表示手型错误位置的手部有效区 域的相对位置坐标转换为2D图像原始坐标系下的坐标获得手型错误位置坐标。其中,所述手型错误检测网络是以手部有效区域为输入,手型错误类型和手型错误在手部有效区域中的相对位置坐标为输出,对神经网络进行训练获得。也就是说,手型错误检测模块中采用基于深度学习的方法对输入的手部有效区域进行检测,其输出结果为手型错误的类别和手型错误位置坐标,指导用户矫正错误手型。The hand type error detection module detects the hand type error in the effective hand area represented by the hand position coordinates to obtain the hand type error type and the hand type error position coordinates. The hand type error detection module adopts the hand type error detection network to be effective for the hand. Area is detected and the hand shape error position represented by the relative position coordinates of the input hand effective area is obtained, and the relative position coordinates of the hand effective area used to represent the hand shape error position are converted into the coordinates in the original coordinate system of the 2D image to obtain Incorrect hand position coordinates. Wherein, the hand shape error detection network takes the effective hand area as the input, and the hand shape error type and the relative position coordinates of the hand shape error in the effective hand area as the output, and is obtained by training the neural network. That is to say, in the hand shape error detection module, a method based on deep learning is used to detect the input effective area of the hand, and the output result is the type of hand shape error and the coordinates of the wrong hand shape, so as to guide the user to correct the wrong hand shape.
其中,手型错误分为折指、指尖未立住、指尖朝上、压腕、掌关节塌陷,每种错误就是一个类别,本发明将手型错误看作一个检测任务,手型错误检测网络输出的是2D图像中的手部区域出现错误的手型错误类别和该错误在输入手部有效区域中的相对位置坐标,有多少个错误就输出多少个检测结果。Among them, hand shape errors are divided into folded fingers, fingertips not standing, fingertips pointing upwards, wrist pressing, and palmar joint collapse. Each type of error is a category. The present invention regards hand shape errors as a detection task. The output of the detection network is the wrong hand type error category in the hand area in the 2D image and the relative position coordinates of the error in the input hand effective area, and output as many detection results as there are errors.
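The conversion from coordinates relative to the effective hand region back to the original 2D image can be sketched as a simple translation. This assumes the effective hand region was cropped from the image without resizing, so that only its top-left corner is needed; the function name, the tuple layout of a detection, and the label string are hypothetical.

```python
def to_image_detections(detections, region_origin):
    """Shift detection boxes from region-relative to original-image coordinates.

    detections    -- list of (error_type, ((x1, y1), (x2, y2))) in the
                     coordinate system of the cropped effective hand region
    region_origin -- (x, y) of the region's top-left corner in the 2D image
    """
    ox, oy = region_origin
    converted = []
    for error_type, ((x1, y1), (x2, y2)) in detections:
        # One output per detected error, as in the description above.
        converted.append((error_type, ((x1 + ox, y1 + oy), (x2 + ox, y2 + oy))))
    return converted


if __name__ == "__main__":
    found = [("folded_finger", ((12, 8), (40, 36)))]
    print(to_image_detections(found, (300, 220)))
```

If the crop were resized before being fed to the network, a scale factor would have to be applied before the translation; the source does not state whether this is the case.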
5. Fingering error detection
1. Fingertip feature point detection
The fingertip feature point detection network detects, in the hand region represented by the hand position coordinates, the fingertip feature points of the different fingers, expressed in coordinates relative to the hand region. The relative coordinates of each fingertip feature point are then converted into coordinates in the original coordinate system of the 2D image to obtain the fingertip coordinates in the original 2D image. The fingertip feature point detection network is obtained by training a neural network with the effective hand region as input and the relative position coordinates of the fingertips of the different fingers within the effective hand region as output. The network performs image segmentation on the effective hand region, segmenting the fingertip of every finger to obtain the feature point of each fingertip. For example, FIG. 6 is a schematic diagram of identifying the fingertip feature point of each finger in the effective hand region shown in FIG. 5. The result is the relative position coordinates of each fingertip within the effective hand region, such as the thumb fingertip coordinates, the index fingertip coordinates, and the middle fingertip coordinates. The fingertip coordinates of each finger are expressed by the diagonal corners of the rectangular detection box of that fingertip, and the relative position coordinates of each fingertip are converted into coordinates in the original coordinate system of the 2D image to obtain the fingertip coordinates of each finger in the original 2D image.
2. Key division
The key division module separates each key from the piano keyboard region represented by the piano keyboard position coordinates, yielding the individual keys expressed in coordinates relative to the piano keyboard region, and converts the relative coordinates of each key into coordinates in the original coordinate system of the 2D image to obtain the key coordinates in the original 2D image. The range bounded by the coordinates of a key is the effective region of that key, and morphological processing is applied to the effective key region to obtain the effective edges of each key. The keyboard region is divided so that, in combination with the fingertip feature points, it can be judged whether the fingering is correct. In practice the keys are numbered starting from the first key at the edge, giving a set of key numbers such as [K1, K2, K3, ..., K88]. The pixel region of each key in the image is expressed as a polygon; for example, the region of key K1 is expressed as a set of vertices [(x_{K1,0}, y_{K1,0}), ..., (x_{K1,n}, y_{K1,n})], where (x_{K1,n}, y_{K1,n}) is a point in the 2D image coordinate system, x_{K1,n} is its abscissa, and y_{K1,n} is its ordinate. The pixels enclosed by the polygon formed by these points constitute the effective region of key K1.
The key division module takes the detected piano keyboard region as input, converts it to a grayscale image, applies morphological operations to remove the influence of illumination noise and similar disturbances, extracts edges with an edge detection algorithm (such as the Sobel operator), and finally performs connected-component analysis to obtain the keyboard edges. A piano keyboard has only two kinds of keys, black and white, and their boundaries are regular line segments. Based on the statistical characteristics of the pixels, the edges are bound to the individual keys, and the intersections of different edges are the vertices of the polygon of the effective key region. After segmentation by the key division module, a mathematical model of each key is established in the 2D image coordinate system.
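Given the polygon model of a key, deciding whether a point lies in the effective key region is a standard point-in-polygon test. The sketch below uses the ray-casting method, which is a common choice assumed here for illustration; the original text does not say which test is used, and the function name and sample vertices are hypothetical.

```python
def point_in_polygon(point, polygon):
    """Ray-casting test: is `point` inside the polygon given by its vertices?

    polygon -- list of (x, y) vertices in order, e.g. the vertex set of key K1
    """
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        xa, ya = polygon[i]
        xb, yb = polygon[(i + 1) % n]
        # Consider only edges that straddle the horizontal line through y.
        if (ya > y) != (yb > y):
            # x coordinate where this edge crosses that horizontal line
            x_cross = xa + (y - ya) * (xb - xa) / (yb - ya)
            if x < x_cross:
                inside = not inside
    return inside


if __name__ == "__main__":
    k1_region = [(0, 0), (10, 0), (10, 50), (0, 50)]  # illustrative white key
    print(point_in_polygon((5, 25), k1_region))
    print(point_in_polygon((15, 25), k1_region))
```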
3. Fingering recognition
The fingering error detection module compares the positions of the fingertip coordinates and the key coordinates in order to bind each fingertip that falls on a key to that key, yielding a finger-key binding relationship. If the region represented by the fingertip coordinates of a finger overlaps the region represented by the coordinates of a key, the finger is on that key, and the finger is bound to the key. The finger-key binding relationship for a given note is compared with the standard binding relationship for the same note in the score database; if they are inconsistent, a fingering error is determined and the user is prompted with the correct fingering so that the error can be corrected.
When the region represented by the fingertip coordinates of one finger overlaps the regions represented by the coordinates of several keys, the same finger is bound to several keys while a single note is played. This is necessarily inconsistent with the standard binding relationship in the score database, so that finger is making an obvious fingering error.
The purpose of fingering recognition is to determine which finger pressed which key, that is, to bind fingers to keys. It relies on the outputs of the key division module and the fingertip feature point detection network. In addition, the sound sensor on the keys senses the signal produced when a key is pressed, which determines which key was pressed and which note is currently being played. The effective region of that key is taken from the output of the key division module, and each detected fingertip feature point is checked in turn to see whether it falls within the key region; if it does, the finger is bound to the key. The standard finger-key binding relationship for that note is looked up in the score database and compared with the predicted binding relationship obtained by fingering recognition. If they are inconsistent, a fingering error is determined and the user is prompted with the correct fingering so that the error can be corrected.
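The binding and comparison steps can be sketched as follows. For brevity the sketch approximates each key region by an axis-aligned rectangle rather than a polygon, and represents the standard binding for a note as a single expected finger name; both are simplifying assumptions, and all function names, key labels, and coordinates are hypothetical.

```python
def boxes_overlap(a, b):
    """True if two boxes ((x1, y1), (x2, y2)) share a region of positive area."""
    (ax1, ay1), (ax2, ay2) = a
    (bx1, by1), (bx2, by2) = b
    return ax1 < bx2 and bx1 < ax2 and ay1 < by2 and by1 < ay2


def bind_fingers(fingertips, key_box):
    """Names of all fingers whose fingertip box overlaps the given key box."""
    return sorted(name for name, box in fingertips.items()
                  if boxes_overlap(box, key_box))


def check_fingering(fingertips, key_boxes, pressed_key, standard_finger):
    """Compare the detected binding with the standard one for the pressed key.

    Returns (is_correct, bound_fingers).
    """
    bound = bind_fingers(fingertips, key_boxes[pressed_key])
    return bound == [standard_finger], bound


if __name__ == "__main__":
    fingertips = {"thumb": ((60, 200), (72, 212)),
                  "index": ((100, 200), (110, 210))}
    key_boxes = {"K39": ((50, 150), (73, 260)),
                 "K40": ((95, 150), (118, 260))}
    # The sensor reports K40; the score expects the index finger on it.
    print(check_fingering(fingertips, key_boxes, "K40", "index"))
```

When the check fails, the returned list of bound fingers can be shown next to the expected finger as the prompt to the user.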
As can be seen from the above embodiments, the present invention uses deep-learning-based means to accomplish the detection and segmentation tasks, and obtains the detection networks by training neural networks. The present invention provides a complete neural network training procedure for obtaining the piano keyboard detection network, the hand detection network, the hand shape error detection network, and the fingertip feature point detection network. The invention analyzes the biomechanical principle behind each hand shape error and fingering error, summarizes the essential visual feature of each error, and uses that feature as the basis for neural network learning and prediction. The feature is then annotated with a rectangular box (though not limited to a rectangular box) to obtain a sample dataset, which is divided into a training set, a validation set, and a test set in a given ratio (for example, 7:2:1). The training set and validation set are used to train the neural network, and the test set is used to test and evaluate the final network model.
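A 7:2:1 split of the sample dataset can be sketched as below. Shuffling with a fixed seed before splitting is an assumption added for reproducibility and is not stated in the original text; the function name is illustrative.

```python
import random


def split_dataset(samples, ratios=(7, 2, 1), seed=0):
    """Shuffle the samples and split them into train, validation, and test sets."""
    items = list(samples)
    random.Random(seed).shuffle(items)  # seeded so the split is repeatable
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_val = len(items) * ratios[1] // total
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # the remainder goes to the test set
    return train, val, test


if __name__ == "__main__":
    train, val, test = split_dataset(range(100))
    print(len(train), len(val), len(test))
```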
According to an embodiment of the present invention, a method is provided for training neural networks to obtain the piano keyboard detection network, the hand detection network, the hand shape error detection network, and the fingertip feature point detection network. The method includes the following parts:
a. Dataset collection
An image acquisition module is deployed to collect piano playing footage of many people of different ages, genders, school ages, and skin tones, in a variety of scenes, on pianos of different types and models, from different angles, and under different lighting, so that the dataset covers all scenes and all error types.
b. Dataset annotation
Annotation is divided into keyboard annotation, hand annotation, hand shape error annotation, and fingertip feature point annotation. For hand shape error annotation, the essential feature of each hand shape error is summarized, and the hand shape error category and the hand shape error position are annotated. Specifically, annotation includes the piano keyboard position coordinates, the hand position coordinates, the hand shape error type and hand shape error position coordinates, and the feature point coordinates of the fingertips of the different fingers. All annotations are made in the original coordinate system of the image.
c. Dataset processing
The images in the original dataset are processed according to the annotated piano keyboard position coordinates to obtain images containing the piano keyboard region represented by those coordinates, which form the first dataset. Next, the piano keyboard region is expanded to obtain the effective piano keyboard region; based on the original dataset, each original image is cropped according to the annotated keyboard position coordinates and hand position coordinates, and the effective piano keyboard regions of the original images form the second dataset, in which the hand position coordinates annotated in the original image are converted into coordinates in the same coordinate system as the effective piano keyboard region. Next, the hand region is expanded to obtain the effective hand region; based on the original dataset, each original image is cropped according to the annotated hand position coordinates and hand shape error position coordinates, and the effective hand regions of the original images form the third dataset, in which the hand shape error position coordinates annotated in the original image are converted into coordinates in the same coordinate system as the effective hand region. Finally, based on the original dataset, each original image is cropped according to the annotated hand position coordinates and the fingertip feature point coordinates of the different fingers, and the effective hand regions of the original images form the fourth dataset, in which the fingertip feature point coordinates annotated in the original image are converted into coordinates in the same coordinate system as the effective hand region.
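Converting an annotation from the original image into the coordinate system of a cropped region is the inverse translation of the one used at inference time. The sketch assumes that the crop is taken without resizing and that the annotated box lies inside the crop; the function name is hypothetical.

```python
def to_crop_coords(box, crop_box):
    """Express an annotated box in the coordinate system of a cropped region.

    box      -- ((x1, y1), (x2, y2)) annotated in the original image
    crop_box -- ((x1, y1), (x2, y2)) of the cropped effective region
    """
    (cx1, cy1), _ = crop_box  # only the crop's top-left corner is needed
    (x1, y1), (x2, y2) = box
    return ((x1 - cx1, y1 - cy1), (x2 - cx1, y2 - cy1))


if __name__ == "__main__":
    hand_region = ((300, 220), (520, 460))   # effective hand region in the image
    error_box = ((312, 228), (340, 256))     # hand shape error annotated in the image
    print(to_crop_coords(error_box, hand_region))
```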
d. Model training
(1) Piano keyboard detection network, hand detection network, and hand shape error detection network
The present invention treats piano keyboard detection, hand detection, and hand shape error detection as multi-branch detection tasks, designs a multi-task branch network structure, and trains it to obtain the detection networks.
The piano keyboard detection network has only one task branch, namely the piano keyboard detection branch.
The hand detection network also has only one task branch, namely the hand detection branch. This network must accomplish three tasks: output the coordinate position of the hand, output the left/right attribute of the hand, and output the front/reverse attribute of the hand. The present invention divides hands into four categories: front left hand, front right hand, reverse left hand, and reverse right hand. A front left hand is a left hand with the back of the hand facing up, a reverse left hand is a left hand with the palm facing up, and the right hand is classified in the same way.
The hand shape error detection network has multiple task branches. Each branch is a detection sub-network for one error type, and the sub-network predicts only that one error category; that is, there are as many sub-branches as there are error categories, and all error detection sub-branches share the backbone network. For example, the branches include a folded finger detection branch, a palm joint collapse detection branch, a wrist collapse detection branch, and so on.
A YOLOv4 network is trained to convergence on the first dataset to obtain the piano keyboard detection network, a YOLOv4 network is trained to convergence on the second dataset to obtain the hand detection network, and a YOLOv4 network is trained to convergence on the third dataset to obtain the hand shape error detection network, where every branch of a detection task uses the same loss function. For a single-task branch network, the loss of the detection task branch is the total loss of the whole network; for a multi-task branch network, the total loss of the whole network is the weighted sum of the losses of the individual detection task branches. The training of neural networks and the design of loss functions are common practice in the art and are not described further here. During training, online data augmentation can be applied to the images, including but not limited to color, contrast, brightness, noise, smoothing blur, flipping, deformation, distortion, random occlusion, and erasing, to improve the robustness of the network.
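The relationship between branch losses and the total loss can be sketched with plain numbers. The branch names and weight values below are hypothetical; the original text states only that the total is a weighted sum and does not give the weights.

```python
def total_loss(branch_losses, weights=None):
    """Weighted sum of per-branch losses.

    branch_losses -- dict mapping branch name to its scalar loss
    weights       -- dict mapping branch name to its weight; defaults to 1.0
                     for every branch (an assumption made for this sketch)
    """
    if weights is None:
        weights = {name: 1.0 for name in branch_losses}
    return sum(weights[name] * loss for name, loss in branch_losses.items())


if __name__ == "__main__":
    # Single-task network: the branch loss is the total loss.
    print(total_loss({"keyboard": 0.5}))
    # Multi-task network: weighted sum over the error detection branches.
    losses = {"folded_finger": 0.5, "palm_joint_collapse": 0.25}
    print(total_loss(losses, {"folded_finger": 1.0, "palm_joint_collapse": 2.0}))
```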
It should be noted that the choice of neural network is not limited to YOLOv4; other neural networks may also be used.
(2) Fingertip feature point detection network
Fingertip feature point detection is treated as an image segmentation task. A neural network with ResNet18 as the backbone and a cascaded pyramid network as the detection head is trained to convergence on the fourth dataset to obtain the fingertip feature point detection network. A cascaded pyramid network consists of two cascaded networks that take multi-scale features as input. The first network, called GlobalNet, performs a preliminary detection of the fingertip feature points and uses an L2 loss function. The feature maps produced by GlobalNet pass through convolutional layers for further feature extraction and are then fed into the RefineNet network, which fine-tunes the predicted feature points to produce more accurate results.
As can be seen from the above embodiments, the present invention has the following advantages: 1. It is fast and efficient; 2D images require little computation, and the algorithm is lightweight and simple, with good results and high performance. 2. It outputs the error type and error position accurately and distinguishes hand shape errors from fingering errors, so that correction is more targeted. 3. Each specific model is trained in a data-driven manner, with no need to build a standard comparison database empirically, which gives high robustness. 4. Results are predicted in a top-down manner, from coarse-grained keyboard detection to fine-grained hand shape error detection and fingering error recognition. The networks are cascaded, so that an upstream network provides more prior knowledge to the downstream network, and each sub-network uses multi-task branches, which yields higher performance.
As can be seen from the background of the present invention, identifying playing errors plays an important role in piano teaching and training and can significantly improve the quality of piano teaching. In general, judging and evaluating a practitioner's piano playing covers at least two aspects: notes and hand movements. Notes may include factors such as the spectrum of the fundamental and overtones, dynamics, tempo, and rhythm. When the time information corresponds, the audio signal captured during playing can be converted into audio data and compared with the standard audio data to judge whether the played notes are correct. In the present invention, "standard audio data" and "standard hand data" refer to the "reference audio data" and "reference hand data" that are compared with the user audio data and user hand data to judge the result of the user's piano playing.
Hand movements cover two aspects: fingering and hand shape. Fingering determines whether the correct finger is used to play each note when practicing a piece; it includes the position (or change of position) of a single finger and the changes in relative position among several fingers. Common fingerings include, for example, sequential fingering (one finger per key), passing under (one finger passes under one or more other fingers to play a higher note), crossing over (one finger crosses over one or more other fingers to play a lower note), finger extension, finger contraction, finger alternation, and so on. Hand shape determines whether problems such as a folded finger, a fingertip not standing, palm joint collapse, wrist shaking, finger lifting, or finger tension occur when any note is played. FIG. 7 shows six wrong hand movements that are common in piano practice together with the corresponding correct hand movements: FIG. 7A shows a folded finger and the corresponding correct hand movement, FIG. 7B shows the wrong and correct hand movements for fingertip standing, FIG. 7C shows palm joint collapse and the corresponding correct hand movement, FIG. 7D shows wrist shaking and the corresponding correct hand movement, FIG. 7E shows finger lifting and the corresponding correct hand movement, and FIG. 7F shows finger tension and the corresponding correct hand movement. Variations in hand movement produce different sound effects and strongly affect the continuity, rhythm, tempo, and timbre of the notes; they are the key to playing well.
According to an embodiment of the present invention, a single hand includes at least 21 key joint points, and the hand data of that hand can be characterized by the coordinate positions or relative positions of the 21 key joint points. Thanks to advances in deep learning, a trained hand model (that is, a neural network model) can identify the coordinate positions or relative positions of the key joint points of the player's two hands, namely the player's hand data. This hand data is compared with the hand data of a standard performance to judge whether the hand movements are accurate.
FIG. 8 is a schematic diagram of the 21 key joint points of a single hand according to an embodiment of the present invention. As shown in FIG. 8, 21 key joint points can be selected from a single hand and numbered 0 to 20, where [0, 1, 2, 3, 4] are the 5 key joint points of the thumb from the wrist to the fingertip; [5, 6, 7, 8] are the 4 key joint points of the index finger from the wrist to the fingertip; [9, 10, 11, 12] are the 4 key joint points of the middle finger from the wrist to the fingertip; [13, 14, 15, 16] are the 4 key joint points of the ring finger from the wrist to the fingertip; and [17, 18, 19, 20] are the 4 key joint points of the little finger from the wrist to the fingertip.
Hand data can be represented by the coordinate position of each key joint point or by the relative positions of the key joint points. In one embodiment, joint point "0" of the thumb is selected as the origin, and the relative position of every other joint point is expressed by its coordinates relative to joint point "0", where the coordinate position of each key joint point is given by plane coordinates (x, y). In another embodiment, the coordinates of a joint point are expressed as (x, y, v), where v indicates whether the joint point is occluded: v=1 means the joint point is not occluded, and v=0 means it is occluded by another part. In one embodiment, a joint point may also be labeled "left" or "right" to indicate whether it belongs to the left hand or the right hand. More generally, the relative position of each key joint point of the hand can be expressed relative to any chosen joint point.
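The relative representation with joint point 0 as the origin can be sketched as below. Joints are assumed to be stored as a list indexed by joint number, each entry being (x, y) or (x, y, v); the function name and the sample values are hypothetical.

```python
def to_relative(joints, origin_index=0):
    """Express joint coordinates relative to a chosen origin joint.

    joints -- list of (x, y) or (x, y, v) tuples indexed by joint number;
              the visibility flag v, when present, is passed through unchanged
    """
    ox, oy = joints[origin_index][:2]
    return [(x - ox, y - oy) + tuple(rest) for x, y, *rest in joints]


if __name__ == "__main__":
    # First three thumb joints only, with the third one occluded (v = 0).
    thumb = [(100, 200, 1), (110, 190, 1), (120, 185, 0)]
    print(to_relative(thumb))
```

A representation relative to joint point 0 does not change when the whole hand shifts in the image, which is convenient when comparing user hand data with reference hand data.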
The hand model is obtained by training a neural network model with hand images as input data and the hand data in those images as output data. In one embodiment, because the hand images in a piano performance form a time sequence, the neural network in the hand model may be a recurrent neural network (RNN) or a long short-term memory network (LSTM). An RNN extends an ordinary multilayer BP (backpropagation) neural network with lateral connections between the units of the hidden layer and passes the values of the neural units at the previous time step to the current neural units through a weight matrix, which gives the network a memory. RNNs are well suited to context-dependent NLP tasks and to time-series machine learning problems. However, although an RNN has memory, gradient explosion or gradient vanishing prevents it from remembering content that lies too far back or too far ahead. Therefore, according to an embodiment of the present invention, an LSTM is used to recognize the hand images when the sampling interval is long. An LSTM extends the ordinary RNN by adding a memory cell to each neural unit of the hidden layer, which makes the memory along the time sequence controllable. Each time information is passed between hidden-layer units, it goes through several controllable gates (forget gate, input gate, candidate gate, output gate) that control how much of the previous and current information is remembered or forgotten, giving the RNN a long-term memory.
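For reference, the four gates named above correspond to the commonly used textbook formulation of an LSTM cell, given below. These equations are the standard form and are not taken from the original disclosure; here x_t is the input at time step t (for example, features of the hand image), h_t is the hidden state, c_t is the memory cell, σ is the logistic sigmoid, ⊙ is element-wise multiplication, and the W and b terms are learned parameters.

```latex
\begin{aligned}
f_t &= \sigma\left(W_f\,[h_{t-1}, x_t] + b_f\right) && \text{(forget gate)} \\
i_t &= \sigma\left(W_i\,[h_{t-1}, x_t] + b_i\right) && \text{(input gate)} \\
\tilde{c}_t &= \tanh\left(W_c\,[h_{t-1}, x_t] + b_c\right) && \text{(candidate)} \\
o_t &= \sigma\left(W_o\,[h_{t-1}, x_t] + b_o\right) && \text{(output gate)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```

The forget gate f_t scales how much of the previous cell state is kept, and the input gate i_t scales how much of the new candidate is written, which is the controllable remembering and forgetting described above.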
手部模型的训练集可以包括各种不同样本的手部图片,例如,不同年龄(如老人、成人、孩子)和不同性别(如男性、女性)的不同手部动作的手部图像。手部图像中的手部动作不限于弹钢琴,可以包含各种动作,例如握拳、全掌伸展、推、拉、竖拇指等等。训练集中手部图像的手部数据(例如关键关节点的坐标位置或相对位置等)可以由人工标注,也可以从已有数据库中获得。利用训练好的手部模型(即神经网络模型)可以识别手部图像中弹奏者双手关键关节点的坐标位置或相对位置,即用户手部数据。The training set of the hand model can include hand pictures of various samples, for example, hand images of different hand movements of different ages (eg, the elderly, adults, children) and different genders (eg, male, female). The hand motion in the hand image is not limited to playing the piano, and can include various motions, such as making a fist, extending the whole palm, pushing, pulling, raising the thumb, and so on. The hand data of the hand images in the training set (such as the coordinate positions or relative positions of key joint points, etc.) can be manually annotated or obtained from an existing database. Using the trained hand model (ie, the neural network model), the coordinate position or relative position of the key joint points of the player's hands in the hand image can be identified, that is, the user's hand data.
根据本发明前述实施例的描述,可以通过包含钢琴完整键盘的2D图像识别是否有手型错误、和/或指法错误等弹奏错误。According to the description of the foregoing embodiments of the present invention, whether there are playing errors such as hand shape errors and/or fingering errors can be identified through a 2D image including a complete piano keyboard.
基于上述研究,本发明提供了一种智能钢琴训练方法,该方法从获取到的用户弹奏钢琴的音频信息和视频信息中,按照一定的时间间隔从用户音频信息中提取用户音频数据,与音频数据库中对应的标准音频数据相比较,获得用户音频数据与标准音频数据的匹配度,并且按照一定的时间间隔从用户视频信息中截取与用户音频数据对应的用户手部图像,通过手部模型识别用户手部图像中的用户手部数据,与手部数据库中对应的标准手部数据相比较,获得用户手部数据与标准手部数据的匹配度,和/或,从所述视频信息中截取与所述用户音频数据相对应的包含完整钢琴键盘的2D 图像,并采用前述实施例所述用于辅助钢琴教学的识别方法从该2D图像中识别是否存在弹奏错误;依据该用户的所有用户音频数据与对应的标准音频数据的匹配度以及所有用户手部数据与对应的标准手部数据的匹配度、和/或用户弹奏错误向用户反馈弹奏结果。Based on the above research, the present invention provides an intelligent piano training method. The method extracts user audio data from the user audio information at a certain time interval from the acquired audio information and video information of the user playing the piano, and compares the audio data with the audio information. Compare the corresponding standard audio data in the database to obtain the matching degree between the user audio data and the standard audio data, and intercept the user's hand image corresponding to the user's audio data from the user's video information according to a certain time interval, and identify it through the hand model. The user's hand data in the user's hand image is compared with the corresponding standard hand data in the hand database to obtain the matching degree between the user's hand data and the standard hand data, and/or, intercepted from the video information A 2D image containing a complete piano keyboard corresponding to the user's audio data, and identifying whether there is a playing error from the 2D image using the recognition method for assisting piano teaching described in the previous embodiment; according to all users of the user The matching degree between the audio data and the corresponding standard audio data, the matching degree between all the user's hand data and the corresponding standard hand data, and/or the user's playing error feedback the playing result to the user.
FIG. 9 shows an intelligent piano training method according to an embodiment of the present invention. As shown in FIG. 9, the method includes the following steps:
S310: Acquire audio information and video information of the user playing the piano.
As mentioned above, note practice and hand-movement practice are the two main aspects of piano practice, so both the audio information and the video information of the user playing the piano need to be acquired. In some embodiments, the audio information and video information may be collected by audio and video capture devices (for example a microphone and a camera, or a camera with a built-in microphone). In this case, the collected audio information may be preprocessed, for example by removing silent segments or by denoising, to avoid external interference and improve the accuracy of scoring. In other embodiments, for a performance on an electronic piano, the MIDI digital audio signal of the performance may be collected through the MIDI (Musical Instrument Digital Interface) port of the electronic piano. A MIDI digital audio signal is binary data output by the electronic piano that represents a played note and can be recognized and processed by a computer. For the video information, the user's hand movements while playing may be recorded by a camera or another device with an image capture function. Both hands may be recorded by the same camera, or the left and right hands may be recorded separately from different angles by multiple cameras, in which case the video information of the left and right hands may be stitched together.
In one embodiment, the user's key-touch force may also be collected by pressure sensors installed under the keys and combined with the above audio and video information to jointly determine the score of the user's piano playing.
S320: Extract user audio data from the audio information and compare it with the corresponding standard audio data in the audio database to obtain the degree of matching between the user audio data and the corresponding standard audio data.
The audio database contains audio data for a large number of standard piano pieces (for example pieces played by piano teachers or professionals, or pieces generated automatically from the score by artificial intelligence). Standard audio data may be extracted from the audio information of each standard performance at a certain time interval and stored piece by piece to form the audio database. The audio data in the audio database may include at least the piece name, the extraction time, the note, the fundamental frequency and the sound intensity.
In one embodiment, standard audio data may be extracted from the audio information of a standard piano performance at intervals of 10 ms or less and stored in the audio database. The current Guinness World Record for the fastest pianist is 14 key presses in 1 s. Taking 20 key presses per second as an example, each key press lasts 50 ms. Extracting audio data from the audio information of a piano performance at 10 ms intervals can therefore cover every note produced in the performance.
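The arithmetic behind the 10 ms interval can be checked with a short sketch. The function name `samples_per_key_press` is hypothetical and only illustrates the reasoning in the paragraph above; it is not part of the described method.

```python
def samples_per_key_press(presses_per_second: float, interval_ms: float) -> float:
    """Return how many extraction points fall within a single key press."""
    press_duration_ms = 1000.0 / presses_per_second  # e.g. 20 presses/s -> 50 ms
    return press_duration_ms / interval_ms


# The example from the text: 20 presses per second, sampled every 10 ms.
print(samples_per_key_press(20, 10))
```

With 20 presses per second and a 10 ms interval, each key press spans several extraction points, which is why no note is skipped.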
FIG. 10 is a schematic diagram of how standard audio data is stored in the audio database of one embodiment. As shown in FIG. 10, the audio database may include a first-level table and second-level tables. The first-level table stores the basic information of the standard piano pieces, including the serial number, name, level, key and the serial number of the second-level audio data table. Each second-level table stores the audio data of one piece, including the extraction time and the notes extracted from the audio information of that piece at a fixed time interval, together with their fundamental frequency and sound intensity. As shown in FIG. 10(A), the first-level table stores a number of standard piano pieces. For example, piece 0001 is named "Song of Spring", level one, A major, and its audio data is stored in second-level table 0001; piece 0036 is named "Rondo", level four, C major, and its audio data is stored in second-level table 0036; piece 0180 is named "Canon", level "other", D major, and its audio data is stored in second-level table 0180; and so on.
As shown in FIG. 10(B), second-level table 0001 stores all the notes extracted every 10 ms over the complete playing time of piece 0001, together with the corresponding fundamental frequency and sound intensity. For example, at time "0.000" there is no note, the fundamental frequency is 0 and the sound intensity is 0; at time "0.010" the note is G4, the fundamental frequency is 391 Hz and the sound intensity is 10 dB; at time "0.020" the note is still G4, the fundamental frequency is 391 Hz and the sound intensity is 15 dB; at time "0.030" the note is still G4, the fundamental frequency is 391 Hz and the sound intensity is 20 dB; ...; at time "0.250" the note is D4, the fundamental frequency is 293 Hz and the sound intensity is 10 dB; and so on.
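The two-level layout described for FIG. 10 could be modelled roughly as follows. This is a minimal sketch: the class names `Track` and `AudioSample`, the use of integer milliseconds as keys, and the helper `standard_sample` are illustrative assumptions, while the stored values are the ones quoted for piece 0001.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class AudioSample:
    """One row of a second-level table."""
    note: Optional[str]   # None when no note is sounding
    f0_hz: float          # fundamental frequency
    intensity_db: float   # sound intensity


@dataclass
class Track:
    """One row of the first-level table plus its second-level table."""
    serial: str
    name: str
    level: str
    key: str
    # Second-level table, keyed by extraction time in milliseconds.
    samples: Dict[int, AudioSample] = field(default_factory=dict)


audio_db: Dict[str, Track] = {
    "0001": Track("0001", "Song of Spring", "1", "A major", {
        0: AudioSample(None, 0.0, 0.0),
        10: AudioSample("G4", 391.0, 10.0),
        20: AudioSample("G4", 391.0, 15.0),
        30: AudioSample("G4", 391.0, 20.0),
        250: AudioSample("D4", 293.0, 10.0),
    }),
}


def standard_sample(serial: str, time_ms: int) -> Optional[AudioSample]:
    """Look up the standard audio data of a piece at a given extraction time."""
    track = audio_db.get(serial)
    return track.samples.get(time_ms) if track else None


print(standard_sample("0001", 30))
```

Keying the second-level table by integer milliseconds avoids comparing floating-point timestamps when user data is later matched against it.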
User audio data may be extracted from the audio information at a fixed interval, and the extracted user audio data may be compared with the corresponding standard audio data in the audio database to obtain the degree of matching between the two.
In one embodiment, the user audio data may be extracted from the audio information at a first time interval. The first time interval may be the same as the interval at which the standard audio data in the audio database was extracted from the standard performances, or an integer multiple of that interval. The extracted user audio data may include at least the extraction time, the note, and its fundamental frequency and sound intensity. The user audio data is matched to the standard audio data in the audio database through its extraction time. Taking the audio database of FIG. 10 as an example, when the user plays "Song of Spring", user audio data may be extracted from the collected audio information every 30 ms. If the user audio data extracted at 0.030 s has the note G4, a fundamental frequency of 391 Hz and a sound intensity of 15 dB, the corresponding standard audio data is the audio data (note, fundamental frequency and sound intensity) stored at 0.030 s in second-level table 0001, which belongs to piece 0001 in the first-level table of the audio database.
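The time-based correspondence described above can be sketched as a lookup by extraction time. The dictionary layout and the name `align_by_time` are assumptions made for illustration; the only property taken from the text is that the user interval (30 ms here) is an integer multiple of the standard interval (10 ms).

```python
def align_by_time(user_samples, standard_samples):
    """Pair each user sample with the standard sample that has the same extraction time."""
    standard_by_time = {s["time_ms"]: s for s in standard_samples}
    return [
        (u, standard_by_time[u["time_ms"]])
        for u in user_samples
        if u["time_ms"] in standard_by_time
    ]


# Standard data every 10 ms, user data every 30 ms (an integer multiple).
standard = [{"time_ms": t, "note": "G4", "f0_hz": 391.0} for t in range(0, 100, 10)]
user = [{"time_ms": t, "note": "G4", "f0_hz": 391.0} for t in range(0, 100, 30)]

pairs = align_by_time(user, standard)
print([u["time_ms"] for u, _ in pairs])
```

Because the user interval is a multiple of the standard interval, every user sample finds a counterpart and the extra standard samples are simply left unused.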
Before playing, the user may select the piece to be played from the database. Alternatively, after the user starts playing, the piece being played may be identified automatically and its standard audio data looked up in the audio database. The user audio data is then compared with the corresponding standard audio data in the audio database to obtain the degree of matching between the two.
In one embodiment, the audio database may store standard audio data of the same piece in different styles. After the user starts playing, the piece and the style being played are identified automatically, the standard audio data for that piece and style is looked up in the audio database, and the user audio data is compared with it to obtain the degree of matching between the user audio data and the corresponding standard audio data.
In one embodiment, different weights may be assigned to the different items of information in the audio data when calculating the degree of matching between the user audio data and the corresponding standard audio data. For example, the weight of a note's fundamental frequency may be set greater than the weight of its sound intensity, so that the fundamental frequency contributes more to the degree of matching. In one embodiment, an error tolerance interval may also be set for the standard audio data in the audio database. For example, a tolerance interval of ±10 Hz may be set for the fundamental frequency of a note; when the user audio data falls within this interval, the fundamental frequency of the note in the user audio data is considered basically consistent with that of the note in the corresponding standard audio data.
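One way to combine the weights and the tolerance interval is sketched below. The ±10 Hz tolerance and the larger weight on the fundamental frequency come from the text; the specific weights (0.75 and 0.25), the 5 dB intensity tolerance, and the function name `audio_match` are assumptions chosen only to make the example concrete.

```python
def audio_match(user, standard,
                f0_weight=0.75, intensity_weight=0.25,
                f0_tolerance_hz=10.0, intensity_tolerance_db=5.0):
    """Weighted degree of matching in [0, 1] between one user sample and its standard sample."""
    # A component counts as matching when it falls inside the tolerance interval.
    f0_ok = abs(user["f0_hz"] - standard["f0_hz"]) <= f0_tolerance_hz
    intensity_ok = abs(user["intensity_db"] - standard["intensity_db"]) <= intensity_tolerance_db
    return f0_weight * f0_ok + intensity_weight * intensity_ok


standard = {"f0_hz": 391.0, "intensity_db": 20.0}
print(audio_match({"f0_hz": 395.0, "intensity_db": 15.0}, standard))  # within both tolerances
print(audio_match({"f0_hz": 410.0, "intensity_db": 20.0}, standard))  # wrong pitch
```

A graded score, for example one that decreases with the size of the frequency deviation, would fit the same description; the all-or-nothing check per component is just the simplest reading.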
In one embodiment, the audio database may be further divided into a single-note database and a piece database. The single-note database stores the standard audio data corresponding to individual notes, and the piece database stores the standard audio data corresponding to a large number of standard piano pieces. In this way, when the user practices the piano, both the audio data of single notes and the audio data of a complete piece played by the user can be evaluated.
S330: Based on the degree of matching between the user audio data and the corresponding standard audio data, capture from the video information the user hand image that corresponds to the user audio data.
In piano playing, judging hand movements is meaningful only when the notes played are correct or basically correct. Therefore, according to one embodiment of the present invention, the degree of matching between the user audio data and the corresponding standard audio data determines whether the user hand image corresponding to the user audio data needs to be captured from the video information.
In one embodiment, an audio matching threshold may be set. The audio matching threshold may be set by the user, defaulted by the system, or, when the system is online, set automatically after the system has collected statistics on how well other players perform the same piece. When the degree of matching between the extracted user audio data and the corresponding standard audio data is greater than or equal to the audio matching threshold, the note played by the user is correct or basically correct, and the user hand image corresponding to that user audio data can be captured from the video information to evaluate the user's hand movement. When the degree of matching is less than the audio matching threshold, the note played by the user is wrong, so no hand-movement evaluation is needed.
Video refers generally to the technologies by which a series of still images is captured, recorded, processed, stored, transmitted and reproduced as electrical signals. A video is therefore a sequence of images arranged in time order. When the images change at more than 24 frames per second, persistence of vision prevents the human eye from distinguishing individual still images, and the result appears smooth and continuous. User hand images can therefore be captured from the video information at a fixed interval. Each user hand image can be associated with the corresponding user audio data through its capture time.
In one embodiment, user hand images may be captured from the video information at a second time interval. The second time interval may be the same as the interval at which user audio data is extracted from the audio information, or an integer multiple of that interval. When the time information is the same, the user hand image captured at that moment is the hand image that corresponds to the generation of that user audio data. For example, if user audio data with the note G4, a sound intensity of 15 and a pitch of 1750 is extracted at 0.030 s, the user hand image captured from the video information at 0.030 s is the hand image corresponding to that user audio data. In one embodiment, the second time interval is not greater than 30 ms.
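In practice, matching a hand image to an audio extraction time means picking the video frame closest to that time. The sketch below assumes a constant frame rate and a frame index that starts at 0 when recording starts; the text does not specify either, and `frame_index_for` is a hypothetical helper.

```python
def frame_index_for(time_ms: int, fps: float) -> int:
    """Index of the video frame closest to a given extraction time (constant frame rate assumed)."""
    return round(time_ms * fps / 1000.0)


def min_fps_for_interval(interval_ms: float) -> float:
    """Lowest frame rate that still provides one distinct frame per capture interval."""
    return 1000.0 / interval_ms


print(frame_index_for(30, 100))   # audio data extracted at 0.030 s, 100 fps video
print(min_fps_for_interval(30))   # frame rate implied by a 30 ms second time interval
```

A second time interval of at most 30 ms therefore implies a camera that delivers a little over 33 frames per second or more.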
In one embodiment, to reduce the computational load of the hand model, the captured user hand images may be filtered so that only images containing the piano key area are used to identify user hand data.
In one embodiment, in addition to the decision based on the degree of matching between the user audio data and the corresponding standard audio data, the user may also set whether user hand images are captured and user hand data is identified.
In one embodiment of the present invention, a 2D image containing the complete piano keyboard is captured from the video information and used to identify possible playing errors.
S340: Identify the user hand data in the user hand image by means of the hand model and compare it with the corresponding standard hand data in the hand database to obtain the degree of matching between the user hand data and the corresponding standard hand data, and/or identify from the 2D image whether a playing error exists.
As mentioned above, the hand model is obtained by training a neural network with hand images as input data and the hand data in those images as output data. The hand model can identify the user hand data in a user hand image, for example the coordinate positions or relative positions of the hand joint points, including the positions of 21 key joint points in each of the left and right hands, or of more or fewer than 21 joint points. In one embodiment, the user hand data may further include the coordinate position or relative position of the wrist.
In one embodiment, the hand model may be a trained recurrent neural network or long short-term memory (LSTM) neural network. In one embodiment, when the user hand image contains the piano key area, the key area may first be detected against the background of the field of view and a candidate key box drawn; the hand model then performs regression detection of hand key points within the candidate key box to extract the user hand data.
The hand database contains a large amount of standard hand data. Standard hand images may be captured at a certain time interval from the video information of standard performances of piano pieces; the hand model then identifies the standard hand data in those images, and the data is stored piece by piece to form the hand database. In one embodiment, the standard hand images may be captured from the video information of the standard performances at the same interval at which the standard audio data in the audio database was extracted, or at an integer multiple of that interval; the standard hand data identified by the hand model is then stored in the hand database. The standard hand data may include the time (that is, the capture time) and the coordinate positions or relative positions of the key joint points of the left and right hands. User hand data is matched to the standard hand data in the hand database through its time information.
FIG. 11 is a schematic diagram of how standard hand data is stored in the hand database of one embodiment. As shown in FIG. 11, the hand database may include a first-level table and second-level tables. The first-level table (shown in FIG. 11(A)) stores the basic information of the standard piano pieces, including the serial number, name, level, key and the serial number of the second-level hand data table. Each second-level table (shown in FIG. 11(B)) stores the hand data of one piece, including the capture time and the relative positions of the 21 joint points of the left and right hands in the hand images captured from the video information of that piece at a fixed time interval.
In one embodiment, the audio database may be associated with the hand database. That is, piece by piece, standard audio data and standard hand data with the same time information are stored together to form a combined database. When the interval at which standard audio data is extracted from the standard performance differs from the interval at which standard hand images are captured from it, or when the former is an integer multiple of the latter, only the standard hand data whose time information matches that of the standard audio data is stored.
FIG. 12 is a schematic diagram of the storage of the combined database of one embodiment. As shown in FIG. 12, the combined database may include a first-level table and second-level tables. The first-level table (shown in FIG. 12(A)) stores the basic information of the standard piano pieces, including the serial number, name, level, key and the serial number of the second-level combined data table. Each second-level table (shown in FIG. 12(B)) stores the audio data and hand data of one piece.
By comparing the extracted user hand data with the corresponding standard hand data in the hand database, the degree of matching between the two can be obtained. In one embodiment, an error tolerance interval may also be set for the standard hand data in the hand database. When the user hand data falls within the tolerance interval, the user hand data is considered basically consistent with the standard hand data.
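A simple way to turn joint positions and a tolerance interval into a degree of matching is to count the joints that lie within the tolerance. The text does not fix a formula, so the fraction-of-joints measure, the Euclidean distance, the 0.05 tolerance on normalised coordinates and the name `hand_match` are all assumptions of this sketch.

```python
import math


def hand_match(user_joints, standard_joints, tolerance=0.05):
    """Fraction of joint points whose position lies within the tolerance of the standard position."""
    if len(user_joints) != len(standard_joints):
        raise ValueError("user and standard hand data must list the same joints")
    within = sum(
        1 for u, s in zip(user_joints, standard_joints)
        if math.dist(u, s) <= tolerance
    )
    return within / len(standard_joints)


# 21 joint points of one hand as (x, y) relative positions.
standard_hand = [(i * 0.01, 0.5) for i in range(21)]
user_hand = list(standard_hand)
user_hand[0] = (0.5, 0.5)  # one joint far from its standard position

print(hand_match(standard_hand, standard_hand))
print(hand_match(user_hand, standard_hand))
```

The same function can be applied to each hand separately, or to the concatenated joints of both hands, depending on how the hand data is stored.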
In one embodiment, a hand data matching threshold may be set. The hand data matching threshold may be set by the user, defaulted by the system, or, when the system is online, set automatically after the system has collected statistics on how well other players perform the same piece. When the degree of matching between the user hand data and the standard hand data is greater than or equal to the hand data matching threshold, the user's hand movement is correct or basically correct. When the degree of matching is less than the hand data matching threshold, the user's hand movement is wrong; in this case the user hand image corresponding to that user hand data may be saved automatically for the user to review. In one embodiment, the wrong hand movement may also be displayed to the user. For example, a virtual hand outline may be generated by animation rendering: when the fingering is wrong, the wrong finger is shown to the user, and when the hand shape is wrong, the wrong hand region (such as the palm or the fingertips) is shown. Alternatively, the category and position of the hand-shape error, or the fingering error and its position, may be displayed to the user. Since the foregoing embodiments describe in detail how playing errors are identified from a 2D image, the details are not repeated here.
S350: Feed the playing result back to the user based on the degree of matching between the user audio data and the corresponding standard audio data, the degree of matching between the user hand data and the corresponding standard hand data, and/or the user's playing errors.
The higher the degree of matching between the user audio data and the corresponding standard audio data, the more accurate the notes played by the user; likewise, the higher the degree of matching between the user hand data and the corresponding standard hand data, the more accurate the user's hand movements. Together, the two degrees of matching therefore assess the user's piano playing in terms of both notes and hand movements.
In one embodiment, weights may be assigned to the degree of matching between the user audio data and the corresponding standard audio data and to the degree of matching between the user hand data and the corresponding standard hand data when determining the user's performance score, so that the scoring rules can be tailored to the playing habits of different users. For example, if a user plays the notes accurately but often makes mistakes in hand movements, a larger weight may be assigned to the degree of matching of the hand data, so that the feedback focuses on the user's hand movements.
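The personalised weighting can be illustrated with a convex combination of the two degrees of matching. The linear form and the name `overall_score` are assumptions; the text only states that the two degrees of matching are weighted.

```python
def overall_score(audio_match: float, hand_match: float, hand_weight: float = 0.5) -> float:
    """Combine the audio and hand degrees of matching into one score in [0, 1]."""
    if not 0.0 <= hand_weight <= 1.0:
        raise ValueError("hand_weight must be between 0 and 1")
    return (1.0 - hand_weight) * audio_match + hand_weight * hand_match


# A user whose notes are accurate but whose hand movements are often wrong:
print(overall_score(1.0, 0.5, hand_weight=0.5))   # equal weights
print(overall_score(1.0, 0.5, hand_weight=0.75))  # hand movements emphasised
```

Raising the hand weight lowers this user's score, which makes the hand-movement problem more visible in the feedback.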
In one embodiment, the database also stores standard key-touch force data. The collected user key-touch force data may be compared with the standard key-touch force data to obtain the degree of matching between the two, which is then combined with the degree of matching between the user audio data and the corresponding standard audio data, the degree of matching between the user hand data and the corresponding standard hand data, and/or the user's playing errors to assess the user's piano playing as a whole.
With the above intelligent piano training method, users can learn promptly and accurately how their notes and hand movements fared during practice even without a teacher's guidance, which helps them correct mistakes in time and practice more efficiently.
In some embodiments, the result of the user's piano playing may be fed back to the user with a delay. For example, an overall score for the piece may be displayed when the user finishes playing. The specific audio errors and hand-movement errors made during playing may also be recorded in detail and compiled into a score report, so that the user can practice or correct those errors in a targeted way. The current score or score report may also be compared with the user's previous playing records or with other users' playing records to evaluate the user's current playing level.
In other embodiments, the user audio data and the user hand data may be compared at the same time, and the playing result may be fed back to the user based on the degree of matching between the user audio data and the corresponding standard audio data, the degree of matching between the user hand data and the corresponding standard hand data, and/or the user's playing errors. In this case, the user audio data and user hand images may be extracted and analyzed in real time while the audio information and video information of the user playing the piano are being acquired.
In some cases, the overall playing result for a piece may be fed back to the user after the user has finished playing that piece.
FIG. 13 shows an intelligent piano training method according to an embodiment of the present invention. As shown in FIG. 13, the method includes the following steps:
S710: Acquire audio information and video information of the user playing the piano.
S720: Extract user audio data from the audio information and compare it with the corresponding standard audio data in the audio database to obtain the degree of matching between the user audio data and the corresponding standard audio data.
Steps S710-S720 are similar to steps S310-S320 above and are not described again here.
S730: Compare the degree of matching between the user audio data and the corresponding standard audio data with a specified threshold N1. When the degree of matching is greater than or equal to N1, perform step S740; when the degree of matching is less than N1, perform step S760.
S740: Capture from the video information the user hand image corresponding to the user audio data. In some embodiments, a 2D image containing the complete piano keyboard may also be captured from the video information to identify whether a playing error exists.
S750: Identify the user hand data in the user hand image by means of the hand model and compare it with the corresponding standard hand data in the hand database to obtain the degree of matching between the user hand data and the corresponding standard hand data, and/or identify from the 2D image whether a playing error exists.
S760: Determine whether the user has finished playing. If so, perform step S770; if not, return to step S710 and repeat steps S710-S760.
S770,基于用户弹奏钢琴中产生的全部用户音频数据与对应的标准音频数据的匹配度,以及用户弹奏钢琴中产生的全部用户手部数据与对应的标准手部数据的匹配度,和/或用户弹奏错误,向用户反馈弹奏结果。S770, based on the degree of matching of all user audio data generated by the user playing the piano with the corresponding standard audio data, and the degree of matching between all the user hand data generated by the user playing the piano and the corresponding standard hand data, and/ Or the user plays incorrectly, and feedback the playing result to the user.
上述方法通过在用户钢琴弹奏结束后向用户反馈其钢琴弹奏的综合效果,有利于用户整体把握其弹奏的完整曲目或其中一段旋律。By feeding back the comprehensive effect of the user's piano playing to the user after the user's piano playing, the above method is helpful for the user to grasp the complete piece or one of the melody played by the user as a whole.
In some embodiments, when the degree of matching between the user audio data and the corresponding standard audio data is less than the specified threshold, the user may be prompted with the key information corresponding to the standard audio data, for example by generating a virtual keyboard through animation rendering and indicating the correct keys; and/or when the degree of matching between the user hand data and the corresponding standard hand data is less than the specified threshold, the user may be prompted with the hand movement corresponding to the standard hand data, for example by generating a virtual hand outline through animation rendering and indicating the correct hand movement.
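The control flow of steps S710-S770 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the names `SegmentResult`, `run_session`, `match_audio`, and `match_hand` are hypothetical, the performance is assumed to arrive as a sequence of (audio, frame) pairs, and the matching functions are assumed to return a value between 0 and 1.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, Tuple


@dataclass
class SegmentResult:
    """Matching degrees recorded for one played segment (hypothetical type)."""
    audio_match: float
    hand_match: Optional[float]  # None when the hand check was skipped


def run_session(segments: Iterable[Tuple[object, object]],
                match_audio: Callable[[object], float],
                match_hand: Callable[[object], float],
                n1: float) -> List[SegmentResult]:
    """Collect per-segment matching degrees until the performance ends."""
    results: List[SegmentResult] = []
    for audio, frame in segments:          # S710; the loop ends when playing ends (S760)
        audio_match = match_audio(audio)   # S720
        # S730: the hand image is examined only when the audio matches well enough
        hand_match = match_hand(frame) if audio_match >= n1 else None  # S740-S750
        results.append(SegmentResult(audio_match, hand_match))
    return results                         # input to the feedback of S770
```

The returned list holds everything step S770 needs; how the values are aggregated into feedback is left open here.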
FIG. 14 shows an intelligent piano training method according to an embodiment of the present invention. As shown in FIG. 14, the method includes the following steps:
S810: Acquire audio information and video information of the user playing the piano.
S820: Extract user audio data from the audio information and compare it with the corresponding standard audio data in the audio database to obtain the degree of matching between the user audio data and the corresponding standard audio data.
S830: Compare the degree of matching between the user audio data and the corresponding standard audio data with a specified threshold N1. When the degree of matching is greater than or equal to N1, execute step S840; when it is less than N1, prompt the user with the key information corresponding to the standard audio data and execute step S870.
S840: Capture, from the video information, the user hand image corresponding to the user audio data. In some embodiments, a 2D image containing the complete piano keyboard may also be captured from the video information to identify whether there is a playing error.
S850: Identify the user hand data in the user hand image through the hand model and compare it with the corresponding standard hand data in the hand database to obtain the degree of matching between the user hand data and the corresponding standard hand data, and/or identify from the 2D image whether there is a playing error.
S860: Compare the degree of matching between the user hand data and the corresponding standard hand data with a specified threshold N2. When the degree of matching is less than N2, prompt the user with the hand movement corresponding to the standard hand data, and/or with the hand shape error type and position or the fingering error and position corresponding to the playing error.
S870: Determine whether the user's piano playing has ended. If it has ended, execute step S880; if not, repeat steps S810-S870.
S880: Feed the playing result back to the user based on the degrees of matching between all user audio data generated during the performance and the corresponding standard audio data, the degrees of matching between all user hand data generated during the performance and the corresponding standard hand data, and/or the user's playing errors.
With the above method, real-time guidance and demonstration can be provided for wrong notes and/or wrong hand movements in the user's piano playing, so that the user can promptly learn the correct notes and/or hand movements, improving practice efficiency.
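The real-time prompting decisions of steps S830 and S860 can be sketched as below. This is only an illustrative sketch: the function `realtime_prompts`, its parameters, and the prompt strings are hypothetical, and the hand check is passed as a callable so that it is evaluated only when the audio check succeeds, as in the flow from S830 to S840.

```python
from typing import Callable, List


def realtime_prompts(audio_match: float,
                     hand_match_fn: Callable[[], float],
                     n1: float,
                     n2: float,
                     key_hint: str,
                     gesture_hint: str) -> List[str]:
    """Return the prompts to show for one played segment."""
    if audio_match < n1:
        # S830: wrong note, prompt the correct key and go straight to S870
        return [f"key: {key_hint}"]
    if hand_match_fn() < n2:
        # S860: note is right but the hand data deviates from the standard
        return [f"hand: {gesture_hint}"]
    return []  # both checks passed, nothing to prompt
```

For example, a segment whose audio matching degree falls below N1 yields only the key prompt, and the hand model is never invoked for it.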
In summary, the present invention uses the hand model to accurately identify the hand data and playing errors in the user's hand images, and makes an overall judgment of the user's performance by jointly considering the audio data and hand data generated while playing the piano. Even without the guidance of a professional teacher, the user can thus obtain effective feedback on notes and hand movements during practice, which helps the user find and correct mistakes and improves practice efficiency. In addition, prompting the user in real time with the correct key information and/or hand movements helps the user obtain correct demonstration and guidance in time, which supports self-study of the piano.
In another aspect, the present invention further provides an intelligent piano training system implementing the above method. The system includes: an audio and video acquisition unit for acquiring audio information and video information of the user's piano playing; a data extraction unit for extracting user audio data from the audio information and for capturing, from the video information, the user hand image corresponding to the user audio data and/or a 2D image containing the complete piano keyboard; a data identification unit for identifying the user hand data in the user hand image through a hand model, and/or for identifying whether there is a playing error from the 2D image through the intelligent identification system for assisting piano teaching, wherein the hand model is obtained by training a neural network with hand images as input data and the hand data in the hand images as output data; a data matching unit for comparing the user audio data with the corresponding standard audio data in the audio database to obtain the degree of matching between the user audio data and the corresponding standard audio data, and for comparing the user hand data with the corresponding standard hand data in the hand database to obtain the degree of matching between the user hand data and the corresponding standard hand data; and a user interaction unit for feeding the playing result back to the user based on the degree of matching between the user audio data and the corresponding standard audio data, the degree of matching between the user hand data and the corresponding standard hand data, and/or the user's playing errors.
In one embodiment, the user interaction unit in the intelligent piano training system is further configured to: prompt the user with the key information corresponding to the corresponding standard audio data, prompt the user with the hand movement corresponding to the corresponding standard hand data, and prompt the user with the playing error and the error position.
In one embodiment, the intelligent piano training system further includes a control unit for coordinating the audio and video acquisition unit, the data extraction unit, the data identification unit, the data matching unit, and the user interaction unit. The control unit determines whether to activate the data identification unit based on the degree of matching between the user audio data and the corresponding standard audio data; it also determines whether the user has finished playing the piano piece, based on the degree of matching between the user audio data and the corresponding standard audio data or on the degree of matching between the user hand data and the corresponding standard hand data, and accordingly determines whether to activate the audio and video acquisition unit or the user interaction unit.
FIG. 15 shows an intelligent piano practice system according to an embodiment of the present invention. As shown in FIG. 15, the intelligent piano practice system 900 includes an audio and video acquisition unit 901, a data extraction unit 902, a data identification unit 903, a data matching unit 904, and a user interaction unit 905.
The audio and video acquisition unit 901 includes a sound acquisition device 9011 and a video acquisition device 9012 and is used to acquire the audio information and video information generated when the user plays the piano. The sound acquisition device 9011 may be, for example, one or more microphones installed near the piano. It may be connected to the data extraction unit 902 in a wired or wireless manner and sends the acquired audio information to the data extraction unit 902. The video acquisition device 9012 may be a device with a photography or image acquisition function, for example a monocular camera, a binocular camera, or a depth camera. The video acquisition device 9012 may be fixed around the piano to capture hand video information from a fixed position; for example, it may be installed only at the front upper side of the piano keyboard, or multiple devices with a photography function may be installed above, in front of, to the left of, and/or to the right of the keyboard. It may also be mounted on a slide rail to automatically track and capture hand video information, automatically adjusting its shooting position and/or angle. Similarly, the video acquisition device 9012 may be connected to the data extraction unit 902 in a wired or wireless manner and sends the acquired video information to the data extraction unit 902. In one embodiment, the sound acquisition device 9011 and the video acquisition device 9012 may be integrated in one device to acquire the audio information and video information simultaneously while the user plays the piano.
The data extraction unit 902 includes an audio data extraction unit 9021 and an image data capture unit 9022. The audio data extraction unit 9021 is used to extract the user audio data from the audio information and send it to the data matching unit 904; the image data capture unit 9022 is used to capture, from the video information, the user hand image corresponding to the user audio data and/or the 2D image containing the complete piano keyboard, and send it to the data identification unit 903.
The data identification unit 903 contains a hand model 9031 and is connected to the image data capture unit 9022. It is used to identify the user hand data in the user hand image through the hand model and/or to identify playing errors from the 2D image through the intelligent identification system for assisting piano teaching, and to send the results to the data matching unit 904. The hand model 9031 is obtained by training a neural network with hand images as input data and the hand data in the hand images as output data.
The data matching unit 904 includes an audio data matching unit 9041 and a hand data matching unit 9042. The audio data matching unit 9041 contains the audio database and is used to compare the user audio data from the data extraction unit 902 with the corresponding standard audio data in the audio database, obtain the degree of matching between the user audio data and the corresponding standard audio data, and send it to the user interaction unit 905 and the control unit 906. The hand data matching unit 9042 contains the hand database and is used to compare the user hand data from the data identification unit 903 with the corresponding standard hand data in the hand database, obtain the degree of matching between the user hand data and the corresponding standard hand data, and send this degree of matching, together with the user's playing errors, to the user interaction unit 905. The audio database and the hand database may be stored as built-in files in the audio data matching unit 9041 and the hand data matching unit 9042, or may be connected to the audio data matching unit 9041 and the hand data matching unit 9042 through an API.
The user interaction unit 905 includes a processor 9051 and a display device 9052. The processor 9051 is used to receive the degree of matching between the user audio data and the corresponding standard audio data from the audio data matching unit 9041, and to receive the degree of matching between the user hand data and the corresponding standard hand data and/or the user's playing errors from the hand data matching unit 9042, and to determine a score for the user's piano performance based on these degrees of matching and/or the user's playing errors. The display device 9052 may be, for example, an electronic device with a display function such as a smartphone, an iPad, smart glasses, a liquid crystal display, or an electronic ink screen, and is used to display the scoring result of the processor 9051. In one embodiment, the processor 9051 may determine the correct key information based on the degree of matching between the user audio data and the corresponding standard audio data and display it on the display device 9052, for example by generating a virtual keyboard through animation rendering and indicating the correct keys. In one embodiment, the processor 9051 may also construct the correct hand movement based on the degree of matching between the user hand data and the corresponding standard hand data and display it on the display device 9052. For example, it may generate a virtual hand outline and indicate the correct hand movement; it may also build a user-specific skeletal system from the user's hand information, generate a personalized virtual hand outline for that user by means such as skinning and animation rendering, and drive the virtual hand outline according to the standard hand data to indicate the correct hand movement. Alternatively, a hand image annotated with the specific hand shape error type and position, or the fingering error and position, may be displayed.
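The text does not specify how the processor 9051 combines the matching degrees and playing errors into a score. Purely as an illustration, one possible rule is a weighted average with a per-error deduction. The function name `performance_score`, the weight `audio_weight`, and the penalty `error_penalty` below are assumptions introduced for this sketch and do not come from the source.

```python
from typing import Sequence


def performance_score(audio_matches: Sequence[float],
                      hand_matches: Sequence[float],
                      error_count: int = 0,
                      audio_weight: float = 0.6,
                      error_penalty: float = 2.0) -> float:
    """Map matching degrees in [0, 1] and an error count to a 0-100 score.

    The weighting and the penalty are illustrative assumptions only.
    """
    audio_avg = sum(audio_matches) / len(audio_matches) if audio_matches else 0.0
    hand_avg = sum(hand_matches) / len(hand_matches) if hand_matches else 0.0
    raw = 100.0 * (audio_weight * audio_avg + (1.0 - audio_weight) * hand_avg)
    raw -= error_penalty * error_count
    # Clamp so that many errors cannot push the score below zero
    return max(0.0, min(100.0, raw))
```

Any monotonic combination would serve equally well; the point is only that the score depends on both matching degrees and on the detected errors.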
In one embodiment, the intelligent piano practice system may further include sensors, which may be installed under the keys to collect data on the user's key touch strength when playing the piano.
In one embodiment of the present invention, the present invention may be implemented in the form of a computer program. The computer program may be stored in various storage media (for example, a hard disk, an optical disc, or flash memory) and, when executed by a processor, can be used to implement the method of the present invention.
In another embodiment of the present invention, the present invention may be implemented in the form of an electronic device. The electronic device includes a processor and a memory; the memory stores a computer program that, when executed by the processor, can be used to implement the method of the present invention.
It should be noted that the above embodiments describe position coordinates in terms of rectangular or approximately rectangular shapes. When, because of a different shooting angle, the piano keyboard does not appear as a rectangle in the 2D image, the piano keyboard region is represented by the vertex coordinates of a polygon containing the complete piano keyboard.
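A small sketch may help relate the polygon representation to the coordinate conversions used by the method. It assumes, without confirmation from the source, that a polygonal keyboard region is cropped via its axis-aligned bounding rectangle and that relative position coordinates are pixel offsets from the top-left corner of the crop. The names `bounding_rect` and `to_original` are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def bounding_rect(polygon: List[Point]) -> Tuple[float, float, float, float]:
    """Axis-aligned rectangle (x_min, y_min, x_max, y_max) enclosing the polygon."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return min(xs), min(ys), max(xs), max(ys)


def to_original(point: Point, crop_origin: Point) -> Point:
    """Convert a point given relative to a crop into original-image coordinates.

    Assumes the crop's top-left corner sits at crop_origin in the original image.
    """
    return point[0] + crop_origin[0], point[1] + crop_origin[1]


# Keyboard seen at an angle: four vertices that do not form a rectangle
keyboard = [(10, 20), (110, 15), (120, 60), (5, 65)]
x_min, y_min, x_max, y_max = bounding_rect(keyboard)
# A hand detected at (3, 4) inside the crop, mapped back to the full image
hand_in_image = to_original((3, 4), (x_min, y_min))
```

Under these assumptions the same offset addition applies at each level of the cascade (keyboard crop, hand crop), which keeps every result in the single coordinate system of the original 2D image.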
It should be noted that although the steps are described above in a specific order, this does not mean that they must be executed in that order. In fact, some of these steps may be executed concurrently or even in a different order, as long as the required functions are achieved.
The present invention may be a system, a method, and/or a computer program product. The computer program product may include a computer-readable storage medium carrying computer-readable program instructions for causing a processor to implement various aspects of the present invention.
The computer-readable storage medium may be a tangible device that holds and stores instructions for use by an instruction execution device. The computer-readable storage medium may include, for example but without limitation, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium include: a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random access memory (SRAM), portable compact disc read-only memory (CD-ROM), a digital versatile disc (DVD), a memory stick, a floppy disk, a mechanical encoding device such as a punch card or a raised structure in a groove having instructions stored thereon, and any suitable combination of the foregoing.
Various embodiments of the present invention have been described above. The foregoing description is exemplary, not exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, their practical application, or their technical improvement over technologies in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (35)

  1. A method of intelligent identification for assisting piano teaching, characterized in that the method comprises:
    acquiring, from above the piano keyboard, a 2D image of piano playing that contains the complete piano keyboard;
    performing target detection on the 2D image through a piano keyboard detection network to detect a piano keyboard region represented by relative position coordinates of the 2D image, and converting the relative position coordinates of the 2D image that represent the piano keyboard region into coordinates in the original coordinate system of the 2D image to obtain piano keyboard position coordinates in the original coordinates of the 2D image;
    performing target detection, through a hand detection network, on the piano keyboard region represented by the piano keyboard position coordinates to detect a hand region represented by relative position coordinates of the piano keyboard region, and converting the relative position coordinates of the piano keyboard region that represent the hand region into coordinates in the original coordinate system of the 2D image to obtain hand position coordinates in the original coordinates of the 2D image;
    identifying, through a hand shape error detection network, whether there is a hand shape error in the hand region represented by the hand position coordinates.
  2. The method according to claim 1, characterized in that the method further comprises:
    obtaining the identified hand shape error type and the hand shape error position represented by relative position coordinates of the hand region, and converting the relative position coordinates of the hand region that represent the hand shape error position into coordinates in the original coordinate system of the 2D image to obtain hand shape error position coordinates in the original coordinates of the 2D image.
  3. The method according to claim 1, characterized in that the method further comprises:
    dividing out each key from the piano keyboard region represented by the piano keyboard position coordinates to obtain the different keys represented by relative position coordinates of the piano keyboard region, and converting the relative position coordinates of the piano keyboard region that represent each key into coordinates in the original coordinate system of the 2D image to obtain key coordinates in the original coordinates of the 2D image;
    detecting, through a fingertip feature point detection network, fingertip feature points of the different fingers, represented by relative position coordinates of the hand region, from the hand region represented by the hand position coordinates, and converting the relative position coordinates of the hand region that represent the fingertip feature point of each finger into coordinates in the original coordinate system of the 2D image to obtain fingertip coordinates in the original coordinates of the 2D image;
    performing position determination based on the fingertip coordinates and the key coordinates, binding each fingertip that falls on a key to that key to obtain a finger-key binding relationship, and comparing the finger-key binding relationship for a played note with the standard binding relationship for the same note in the score database to detect whether there is a fingering error.
  4. The method according to claim 1, characterized by further comprising:
    after the piano keyboard region is detected, expanding the piano keyboard region by a first preset number of pixels to obtain a piano keyboard effective region containing the complete hands.
  5. The method according to claim 4, characterized in that the first preset number of pixels is 200 pixels.
  6. The method according to claim 1, characterized by further comprising:
    after the hand region is detected, filtering out hands that are not on the piano keyboard based on a comparison of the piano keyboard position coordinates and the hand position coordinates, and expanding the coordinate boundary of each hand on the piano keyboard by a second preset number of pixels in four directions to obtain a hand region containing the complete hand on the piano keyboard and the corresponding hand position coordinates.
  7. The method according to claim 6, characterized in that the second preset number of pixels is 30 pixels.
  8. The method according to claim 2, characterized in that the piano keyboard detection network, the hand detection network, the hand shape error detection network, and the fingertip feature point detection network are obtained by training neural networks in the following manner:
    S1. collecting images of multiple people playing different types of pianos in multiple scenes to form an original data set, such that the images in the original data set cover the scenes corresponding to all piano types in the prior art and all error types;
    S2. annotating the original data set, including annotating the piano keyboard position coordinates, the hand position coordinates, the hand shape error types and hand shape error position coordinates, and the fingertip feature point coordinates of the different fingers, all annotations being in the same two-dimensional coordinate system;
    S3. processing the images in the original data set according to the annotated piano keyboard position coordinates to obtain images containing the piano keyboard region represented by the annotated piano keyboard position coordinates, forming a first data set; further, expanding the piano keyboard region to obtain a piano keyboard effective region, and cropping the original images of the original data set according to the annotated keyboard position coordinates and hand position coordinates to obtain the piano keyboard effective region in each original image, forming a second data set, wherein in the second data set the hand position coordinates annotated in the original image are converted into coordinates in the same coordinate system as the piano keyboard effective region; further, expanding the hand region to obtain a hand effective region, and cropping the original images of the original data set according to the annotated hand position coordinates and hand shape error position coordinates to obtain the hand effective region in each original image, forming a third data set, wherein in the third data set the hand shape error position coordinates annotated in the original image are converted into coordinates in the same coordinate system as the hand effective region; and cropping the original images of the original data set according to the annotated hand position coordinates and the fingertip feature point coordinates of the different fingers to obtain the hand effective region in each original image, forming a fourth data set, wherein in the fourth data set the fingertip feature point coordinates of the different fingers annotated in the original image are converted into coordinates in the same coordinate system as the hand effective region;
    S4. training a predetermined neural network to convergence with the first data set to obtain the piano keyboard detection network; training a predetermined neural network to convergence with the second data set to obtain the hand detection network; training a predetermined neural network to convergence with the third data set to obtain the hand shape error detection network; and training a predetermined neural network to convergence with the fourth data set to obtain the fingertip feature point detection network.
  9. The method according to claim 8, wherein the first data set, the second data set, and the third data set are each used to train a YOLOv4 network to convergence, yielding the piano keyboard detection network, the hand detection network, and the hand shape error detection network, respectively.
  10. The method according to claim 8, wherein the fourth data set is used to train a network composed of ResNet18 and a cascaded pyramid network to convergence, yielding the fingertip feature point detection network.
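The cropping and coordinate conversion used in step S3 to build the second to fourth data sets can be sketched as follows. This is a minimal illustration and not the applicant's implementation: the box layout `(x_min, y_min, x_max, y_max)`, the fixed pixel margin, and the function names `expand_box` and `box_to_crop_coords` are all assumptions, since the claims do not state how far a region is expanded.

```python
from typing import Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max), in pixels


def expand_box(box: Box, margin: int, width: int, height: int) -> Box:
    """Grow a box by `margin` pixels on each side, clamped to the image size."""
    x0, y0, x1, y1 = box
    return (max(0, x0 - margin), max(0, y0 - margin),
            min(width, x1 + margin), min(height, y1 + margin))


def box_to_crop_coords(box: Box, crop: Box) -> Box:
    """Re-express a box annotated on the original image relative to a crop."""
    cx, cy = crop[0], crop[1]
    x0, y0, x1, y1 = box
    return (x0 - cx, y0 - cy, x1 - cx, y1 - cy)


# Hypothetical annotations on a 1920x1080 original image.
keyboard = (100, 400, 1800, 700)
hand = (600, 350, 900, 650)

# "Effective region" of the keyboard: the annotated region plus a margin.
effective = expand_box(keyboard, margin=80, width=1920, height=1080)

# Hand label converted into the coordinate system of the cropped region.
hand_in_crop = box_to_crop_coords(hand, effective)
```

With these example numbers the hand box moves from original-image coordinates into coordinates measured from the top-left corner of the cropped keyboard effective region, which is the form a detector trained on the crops would need.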
  11. An intelligent identification system for assisting piano teaching, wherein the system comprises:
    an image acquisition module, configured to acquire a 2D image of piano playing that contains the complete piano keyboard;
    a piano keyboard detection module, configured to perform target detection on the 2D image to detect the piano keyboard region, expressed in coordinates relative to the 2D image, and to convert those relative coordinates into coordinates in the original coordinate system of the 2D image, thereby obtaining the piano keyboard position coordinates in the original coordinates of the 2D image;
    a hand detection module, configured to perform target detection on the piano keyboard region represented by the piano keyboard position coordinates to detect the hand region, expressed in coordinates relative to the piano keyboard region, and to convert those relative coordinates into coordinates in the original coordinate system of the 2D image, thereby obtaining the hand position coordinates in the original coordinates of the 2D image;
    a hand shape error detection module, configured to identify, from the hand region represented by the hand position coordinates, whether a hand shape error exists, to output the hand shape error type and the hand shape error position expressed in coordinates relative to the hand region, and to convert those relative coordinates into coordinates in the original coordinate system of the 2D image, thereby obtaining the hand shape error position coordinates in the original coordinates of the 2D image.
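The chain of conversions in claim 11, from region-relative coordinates back to the original 2D image, might look like the sketch below. It assumes that "relative position coordinates" are pixel offsets from the top-left corner of the parent region; the claims do not say whether they are pixel offsets or normalized values, and the name `to_parent_coords` and the example boxes are hypothetical.

```python
from typing import Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max), in pixels


def to_parent_coords(box: Box, parent: Box) -> Box:
    """Shift a box expressed relative to `parent` into the parent's own frame."""
    px, py = parent[0], parent[1]
    x0, y0, x1, y1 = box
    return (x0 + px, y0 + py, x1 + px, y1 + py)


# Keyboard region, already in original-image coordinates.
keyboard = (20, 320, 1880, 780)
# Hand detected inside the keyboard region (relative to the keyboard).
hand_rel = (580, 30, 880, 330)
# Hand shape error located inside the hand region (relative to the hand).
error_rel = (40, 60, 120, 140)

hand = to_parent_coords(hand_rel, keyboard)   # hand in original-image coordinates
error = to_parent_coords(error_rel, hand)     # error in original-image coordinates
```

Because every result ends up in the same original coordinate system, detections from different stages can be overlaid on the same displayed frame.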
  12. The system according to claim 11, wherein the system further comprises:
    a key division module, configured to divide out each key from the piano keyboard region represented by the piano keyboard position coordinates, obtaining the individual keys expressed in coordinates relative to the piano keyboard region, and to convert the relative coordinates of each key into coordinates in the original coordinate system of the 2D image, thereby obtaining the key coordinates in the original coordinates of the 2D image;
    a fingertip feature point detection network, configured to detect, from the hand region represented by the hand position coordinates, the fingertip feature points of the different fingers expressed in coordinates relative to the hand region, and to convert the relative coordinates of each fingertip feature point into coordinates in the original coordinate system of the 2D image, thereby obtaining the fingertip coordinates in the original coordinates of the 2D image;
    a fingering error detection module, configured to make a position judgment based on the fingertip coordinates and the key coordinates, to bind each fingertip that falls on a key to that key to obtain a finger-key binding relationship, and to compare the finger-key binding relationship for a played note with the standard binding relationship for the same note in a score database to detect whether a fingering error exists.
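The binding and comparison performed by the fingering error detection module can be illustrated with a short sketch. It rests on several assumptions that are not in the claims: keys are modeled as axis-aligned, non-overlapping rectangles, fingertips as single points, and the score database as a plain dictionary from key name to expected finger. The names `bind_fingers_to_keys` and `fingering_errors` are hypothetical.

```python
from typing import Dict, Tuple

Box = Tuple[int, int, int, int]   # (x_min, y_min, x_max, y_max)
Point = Tuple[int, int]           # (x, y)


def bind_fingers_to_keys(fingertips: Dict[str, Point],
                         keys: Dict[str, Box]) -> Dict[str, str]:
    """Map each key to the finger whose tip lies inside that key's rectangle."""
    bindings: Dict[str, str] = {}
    for finger, (x, y) in fingertips.items():
        for key_name, (x0, y0, x1, y1) in keys.items():
            if x0 <= x < x1 and y0 <= y < y1:
                bindings[key_name] = finger
                break  # a fingertip is bound to at most one key
    return bindings


def fingering_errors(bindings: Dict[str, str],
                     standard: Dict[str, str]) -> Dict[str, Tuple[str, str]]:
    """Return {key: (observed finger, expected finger)} for every mismatch."""
    return {key: (bindings[key], expected)
            for key, expected in standard.items()
            if key in bindings and bindings[key] != expected}


keys = {"C4": (0, 0, 20, 100), "D4": (20, 0, 40, 100), "E4": (40, 0, 60, 100)}
fingertips = {"thumb": (10, 80), "middle": (30, 70), "pinky": (200, 50)}
standard = {"C4": "thumb", "D4": "index"}  # expected fingering for these notes

bindings = bind_fingers_to_keys(fingertips, keys)
errors = fingering_errors(bindings, standard)
```

The sketch only checks where a fingertip lies; deciding whether the key under it is actually being played would need additional evidence, such as the audio data used in the training method.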
  13. The system according to any one of claims 11-12, wherein the system further comprises:
    a user interaction and display module, configured to merge the playing errors that occur during playing with the image of the piano being played and display the result, and to provide an interactive interface for mode selection and function selection.
  14. A computer-readable storage medium, having stored thereon a computer program executable by a processor to implement the steps of the method according to any one of claims 1 to 10.
  15. An electronic device, comprising:
    one or more processors;
    a storage device, configured to store one or more programs which, when executed by the one or more processors, cause the electronic device to implement the steps of the method according to any one of claims 1 to 10.
  16. An intelligent piano training method, comprising:
    acquiring audio information and video information of a user playing the piano;
    extracting user audio data from the audio information and comparing it with corresponding reference audio data stored in an audio database to obtain a degree of matching between the user audio data and the corresponding reference audio data;
    capturing, from the video information, a user hand image corresponding to the user audio data, identifying user hand data in the user hand image by means of a hand model, and comparing it with corresponding reference hand data stored in a hand database to obtain a degree of matching between the user hand data and the corresponding reference hand data, wherein the hand model is obtained by training a neural network with hand images as input data and the hand data in those hand images as output data; and/or capturing, from the video information, a 2D image that corresponds to the user audio data and contains the complete piano keyboard, and identifying from that 2D image whether a playing error exists using the method according to any one of claims 3-9; and
    feeding back a playing result to the user based on the degree of matching between the user audio data and the corresponding reference audio data and the degree of matching between the user hand data and the corresponding reference hand data, and/or the user's playing errors.
  17. The piano training method according to claim 16, further comprising:
    feeding back a playing result to the user based on the degrees of matching between all the user audio data of the user's piano playing and the corresponding reference audio data and the degrees of matching between all the user hand data of the user's piano playing and the corresponding reference hand data, and/or all of the user's playing errors.
  18. The piano training method according to claim 16, further comprising:
    when the degree of matching between the user audio data and the corresponding reference audio data is below a specified threshold, prompting the user with the key information corresponding to the corresponding reference audio data.
  19. The piano training method according to claim 16, further comprising:
    when the degree of matching between the user hand data and the corresponding reference hand data is below a specified threshold, displaying the erroneous hand motion to the user, and/or prompting the user with the hand motion corresponding to the corresponding reference hand data.
  20. The piano training method according to claim 16, wherein the user audio data includes extraction time, note, fundamental frequency, and sound intensity.
  21. The piano training method according to claim 16, wherein the user hand data includes capture time and the relative positions of 21 key joint points for each of the left and right hands.
  22. The piano training method according to claim 16, wherein the playing errors include hand shape errors and/or fingering errors.
  23. The piano training method according to claim 16, wherein extracting user audio data from the audio information comprises: extracting the user audio data from the audio information at a first time interval; and wherein the user audio data is matched to the reference audio data in the audio database according to its extraction time.
  24. The piano training method according to claim 23, wherein capturing, from the video information, the user hand image corresponding to the user audio data comprises: capturing the user hand image from the video information at a second time interval; and wherein the user hand image is matched to the user audio data according to its capture time.
  25. The piano training method according to claim 24, wherein the second time interval is equal to the first time interval, or is an integer multiple of the first time interval.
  26. The piano training method according to claim 24, wherein the capture time of the user hand data is the same as the image capture time of the user hand image, and the user hand data is matched to the reference hand data in the database according to its capture time information.
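The time-based correspondence described in claims 23 to 26 can be sketched as follows. The sketch assumes timestamps in integer milliseconds counted from the start of the recording, and the interval values (50 ms and 100 ms) and function names are hypothetical; the claims only require that the second interval equal the first or be an integer multiple of it.

```python
from typing import List


def sample_times(duration_ms: int, interval_ms: int) -> List[int]:
    """Timestamps (ms) at which data is extracted, starting at 0."""
    return list(range(0, duration_ms, interval_ms))


def frames_with_audio(audio_times: List[int], frame_times: List[int]) -> List[int]:
    """Frame timestamps that have an audio sample at exactly the same time."""
    audio_set = set(audio_times)
    return [t for t in frame_times if t in audio_set]


first_interval = 50                      # audio extraction interval (assumed)
second_interval = 2 * first_interval     # integer multiple of the first interval

audio_times = sample_times(400, first_interval)
frame_times = sample_times(400, second_interval)
paired = frames_with_audio(audio_times, frame_times)
```

Because the second interval is an integer multiple of the first, every captured frame coincides with an audio sample, so each hand image can be paired with user audio data by timestamp alone. With a non-multiple interval, some frames would have no audio sample at the same instant.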
  27. The piano training method according to claim 16, wherein the hand model is obtained by training a recurrent neural network or a long short-term memory neural network.
  28. The piano training method according to claim 16, further comprising:
    selecting, from the user hand images, those that contain piano keys for use in identifying the user hand data.
  29. The piano training method according to claim 16, further comprising:
    acquiring key-touch force data of the user;
    comparing the user's key-touch force data with corresponding reference key-touch force data stored in a database to obtain a degree of matching between the user key-touch force data and the corresponding reference key-touch force data;
    feeding back a playing result to the user based on the degree of matching between the user audio data and the corresponding reference audio data, the degree of matching between the user hand data and the corresponding reference hand data, and the degree of matching between the user key-touch force data and the corresponding reference key-touch force data, and/or all of the user's playing errors.
  30. An intelligent piano training system, comprising:
    an audio and video acquisition unit, configured to acquire audio information and video information of the user's piano playing;
    a data extraction unit, configured to extract user audio data from the audio information, and to capture from the video information a user hand image corresponding to the user audio data and/or a 2D image containing the complete piano keyboard;
    a data recognition unit, configured to identify user hand data in the user hand image by means of a hand model, and/or to identify from the 2D image whether a playing error exists by means of the intelligent identification system for assisting piano teaching according to claim 12, wherein the hand model is obtained by training a neural network with hand images as input data and the hand data in those hand images as output data;
    a data matching unit, configured to compare the user audio data with corresponding reference audio data in an audio database to obtain a degree of matching between the user audio data and the corresponding reference audio data, and to compare the user hand data with corresponding reference hand data in a hand database to obtain a degree of matching between the user hand data and the corresponding reference hand data;
    a user interaction unit, configured to feed back a playing result to the user based on the degree of matching between the user audio data and the corresponding reference audio data and the degree of matching between the user hand data and the corresponding reference hand data, and/or all of the user's playing errors.
  31. The piano training system according to claim 30, wherein the user interaction unit is further configured to:
    prompt the user with the key information corresponding to the corresponding reference audio data; and/or
    prompt the user with the hand motion corresponding to the corresponding reference hand data; and/or
    prompt the user with playing errors and error positions.
  32. The piano training system according to claim 30, wherein the video and audio acquisition unit comprises an audio acquisition device and a video acquisition device, and wherein the video acquisition device comprises one or more monocular cameras, binocular cameras, or depth cameras, the video acquisition device being either fixed around the piano to capture hand video information from a fixed position, or mounted on a slide rail to automatically track and capture hand video information.
  33. The piano training system according to claim 30, further comprising: a sensor installed beneath the piano keys and configured to collect key-touch force data while the user plays the piano.
  34. A storage medium for intelligent piano training, having stored therein a computer program which, when executed by a processor, can be used to implement the method according to any one of claims 16-29.
  35. An electronic device for intelligent piano training, comprising a processor and a memory, the memory having stored therein a computer program which, when executed by the processor, can be used to implement the method according to any one of claims 16-29.
PCT/CN2021/117130 2020-09-09 2021-09-08 Intelligent identification method and system for giving assistance with piano teaching, and intelligent piano training method and system WO2022052941A1 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN202010939320.9A CN114170868A (en) 2020-09-09 2020-09-09 Intelligent piano training method and system
CN202010939320.9 2020-09-09
CN202110982026.0A CN113723264A (en) 2021-08-25 2021-08-25 Method and system for intelligently identifying playing errors for assisting piano teaching
CN202110982026.0 2021-08-25

Publications (1)

Publication Number Publication Date
WO2022052941A1 true WO2022052941A1 (en) 2022-03-17

Family

ID=80632087

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/117130 WO2022052941A1 (en) 2020-09-09 2021-09-08 Intelligent identification method and system for giving assistance with piano teaching, and intelligent piano training method and system

Country Status (1)

Country Link
WO (1) WO2022052941A1 (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1569185A2 (en) * 2004-02-25 2005-08-31 Yamaha Corporation Fingering guidance apparatus and program
CN107180224A (en) * 2017-04-10 2017-09-19 华南理工大学 Finger motion detection and localization method based on spatio-temporal filtering and joint space Kmeans
CN108052277A (en) * 2017-12-14 2018-05-18 深圳市艾德互联网络有限公司 A kind of AR positioning learning methods and device
CN109243248A (en) * 2018-09-29 2019-01-18 南京华捷艾米软件科技有限公司 A kind of virtual piano and its implementation based on 3D depth camera mould group
CN109448131A (en) * 2018-10-24 2019-03-08 西北工业大学 A kind of virtual piano based on Kinect plays the construction method of system
CN109493683A (en) * 2018-11-15 2019-03-19 深圳市象形字科技股份有限公司 A kind of auxiliary piano practice person's fingering detection method based on computer vision technique

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115862572A (en) * 2022-11-22 2023-03-28 广州珠江艾茉森数码乐器股份有限公司 Intelligent piano system and using method
CN115862572B (en) * 2022-11-22 2023-11-03 广州珠江艾茉森数码乐器股份有限公司 Intelligent piano system and use method
CN116434149A (en) * 2023-06-14 2023-07-14 四川交通职业技术学院 Sichuan embroidery training monitoring system and method based on image recognition
CN116434149B (en) * 2023-06-14 2023-08-15 四川交通职业技术学院 Sichuan embroidery training monitoring system and method based on image recognition
CN117207204A (en) * 2023-11-09 2023-12-12 之江实验室 Control method and control device of playing robot
CN117207204B (en) * 2023-11-09 2024-01-30 之江实验室 Control method and control device of playing robot

Similar Documents

Publication Publication Date Title
WO2022052941A1 (en) Intelligent identification method and system for giving assistance with piano teaching, and intelligent piano training method and system
WO2020082566A1 (en) Physiological sign recognition-based distance learning method, device, apparatus, and storage medium
CN112908355B (en) System and method for quantitatively evaluating teaching skills of teacher and teacher
CN106485984A (en) A kind of intelligent tutoring method and apparatus of piano
CN113723264A (en) Method and system for intelligently identifying playing errors for assisting piano teaching
CN109191939B (en) Three-dimensional projection interaction method based on intelligent equipment and intelligent equipment
CN110796018A (en) Hand motion recognition method based on depth image and color image
Johnson et al. Detecting hand posture in piano playing using depth data
CN111126280A (en) Gesture recognition fusion-based aphasia patient auxiliary rehabilitation training system and method
CN114445853A (en) Visual gesture recognition system recognition method
CN108304806A (en) A kind of gesture identification method integrating feature and convolutional neural networks based on log path
CN114170868A (en) Intelligent piano training method and system
US11580868B2 (en) AR-based supplementary teaching system for guzheng and method thereof
CN116386424A (en) Method, device and computer readable storage medium for music teaching
US20220375362A1 (en) Virtual tutorials for musical instruments with finger tracking in augmented reality
Soroni et al. Hand Gesture Based Virtual Blackboard Using Webcam
CN113158906B (en) Motion capture-based guqin experience learning system and implementation method
CN114428879A (en) Multimode English teaching system based on multi-scene interaction
CN209895305U (en) Gesture interaction system
Kerdvibulvech et al. Guitarist fingertip tracking by integrating a Bayesian classifier into particle filters
CN113674565B (en) Teaching system and method for piano teaching
CN112788390A (en) Control method, device, equipment and storage medium based on human-computer interaction
Liu et al. Multi-modal deep learning-based violin bowing action recognition
Wang et al. Virtual piano system based on monocular camera
WO2019183768A1 (en) Method and device for book reading book on the basis of user instruction

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21865994

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21865994

Country of ref document: EP

Kind code of ref document: A1