WO2022202266A1 - Image processing method, image processing system, and program - Google Patents

Image processing method, image processing system, and program

Info

Publication number
WO2022202266A1
WO2022202266A1 (PCT/JP2022/009830)
Authority
WO
WIPO (PCT)
Prior art keywords: image, performance, finger, data, estimation
Prior art date
Application number
PCT/JP2022/009830
Other languages
English (en)
French (fr)
Japanese (ja)
Inventor
陽 前澤
Original Assignee
ヤマハ株式会社
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by ヤマハ株式会社
Priority to CN202280022994.XA (CN117043818A)
Publication of WO2022202266A1

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 - Image analysis
    • G06T 7/10 - Segmentation; Edge detection
    • G06T 7/11 - Region-based segmentation
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10G - REPRESENTATION OF MUSIC; RECORDING MUSIC IN NOTATION FORM; ACCESSORIES FOR MUSIC OR MUSICAL INSTRUMENTS NOT OTHERWISE PROVIDED FOR, e.g. SUPPORTS
    • G10G 3/00 - Recording music in notation form, e.g. recording the mechanical operation of a musical instrument
    • G10G 3/04 - Recording music in notation form, e.g. recording the mechanical operation of a musical instrument, using electrical means

Definitions

  • the present disclosure relates to technology for analyzing performances by users.
  • Patent Literature 1 discloses a technique for detecting an object using a deep neural network.
  • One aspect of the present disclosure aims to improve the usability of a performance image.
  • An image processing method according to one aspect of the present disclosure estimates, in a performance image including an image of a musical instrument and images of a plurality of fingers of a user playing the musical instrument, a specific area containing the image of the musical instrument, and extracts the specific area from the performance image.
  • An image processing system according to one aspect includes a region estimation unit for estimating a specific region including an image of a musical instrument in a performance image including the image of the musical instrument and images of a plurality of fingers of a user playing the musical instrument, and a region extraction unit for extracting the specific region from the performance image.
  • A program according to one aspect causes a computer to function as a region estimation unit for estimating a specific region including an image of a musical instrument in a performance image including the image of the musical instrument and images of a plurality of fingers of a user playing the musical instrument, and as a region extraction unit for extracting the specific region from the performance image.
  • FIG. 1 is a block diagram illustrating the configuration of a performance analysis system according to a first embodiment.
  • FIG. 2 is a schematic diagram of a performance image.
  • FIG. 3 is a block diagram illustrating the functional configuration of the performance analysis system.
  • FIG. 4 is a schematic diagram of an analysis screen.
  • FIG. 5 is a flowchart of finger position estimation processing.
  • FIG. 6 is a flowchart of left/right determination processing.
  • FIG. 7 is an explanatory diagram of image extraction processing.
  • FIG. 8 is a flowchart of image extraction processing.
  • FIG. 9 is an illustration of machine learning to establish an estimation model.
  • FIG. 10 is a schematic diagram of a reference image.
  • FIG. 11 is a flowchart of matrix generation processing.
  • FIG. 12 is a flowchart of initial setting processing.
  • FIG. 13 is a schematic diagram of a setting screen.
  • FIG. 14 is a flowchart of performance analysis processing.
  • FIG. 15 is an explanatory diagram relating to a problem of fingering estimation.
  • FIG. 16 is a block diagram illustrating the configuration of a performance analysis system in a second embodiment.
  • FIG. 17 is a schematic diagram of control data in the second embodiment.
  • FIG. 18 is a flowchart of performance analysis processing in the second embodiment.
  • FIG. 19 is a flowchart of performance analysis processing in the third embodiment.
  • FIG. 20 is a flowchart of initial setting processing in the fourth embodiment.
  • FIG. 21 is a block diagram illustrating the configuration of a performance analysis system in a fifth embodiment.
  • FIG. 22 is a block diagram illustrating the functional configuration of an image processing system according to a sixth embodiment.
  • FIG. 23 is a flowchart of first image processing in the sixth embodiment.
  • FIG. 24 is a block diagram illustrating the functional configuration of an image processing system according to a seventh embodiment.
  • FIG. 25 is a flowchart of second image processing in the seventh embodiment.
  • FIG. 1 is a block diagram illustrating the configuration of a performance analysis system 100 according to the first embodiment.
  • a keyboard instrument 200 is connected to the performance analysis system 100 by wire or wirelessly.
  • A user (that is, a performer) plays the keyboard instrument 200.
  • the keyboard instrument 200 supplies performance data P representing a performance by the user to the performance analysis system 100 .
  • the performance data P is time-series data specifying the pitch n of each of a plurality of notes played in sequence by the user.
  • the performance data P is data in a format conforming to the MIDI (Musical Instrument Digital Interface) standard, for example.
  • The performance analysis system 100 is a computer system that analyzes the user's performance of the keyboard instrument 200. Specifically, the performance analysis system 100 analyzes the user's fingering. Fingering is the manner in which the user uses the fingers of the left and right hands in playing the keyboard instrument 200. That is, information as to which finger the user uses to operate each key 21 of the keyboard instrument 200 is analyzed as the user's fingering.
  • the performance analysis system 100 includes a control device 11, a storage device 12, an operation device 13, a display device 14, and a photographing device 15.
  • the performance analysis system 100 is realized by, for example, a portable information device such as a smart phone or a tablet terminal, or a portable or stationary information device such as a personal computer.
  • the performance analysis system 100 can be realized as a single device, or as a plurality of devices configured separately from each other. Also, the performance analysis system 100 may be installed in the keyboard instrument 200 .
  • the control device 11 is composed of one or more processors that control each element of the performance analysis system 100 .
  • The control device 11 is composed of one or more types of processors such as a CPU (Central Processing Unit), SPU (Sound Processing Unit), DSP (Digital Signal Processor), FPGA (Field Programmable Gate Array), or ASIC (Application Specific Integrated Circuit).
  • the storage device 12 is a single or multiple memories that store programs executed by the control device 11 and various data used by the control device 11 .
  • the storage device 12 is composed of a known recording medium such as a magnetic recording medium or a semiconductor recording medium, or a combination of a plurality of types of recording media.
  • A portable recording medium that can be attached to and detached from the performance analysis system 100, or a recording medium that the control device 11 can write to or read from via a communication network such as the Internet (for example, cloud storage), may also be used as the storage device 12.
  • the operation device 13 is an input device that receives instructions from the user.
  • the operation device 13 is, for example, an operator operated by a user or a touch panel that detects contact by the user.
  • An operating device 13 (for example, a mouse or a keyboard) separate from the performance analysis system 100 may be connected to the performance analysis system 100 by wire or wirelessly.
  • the display device 14 displays images under the control of the control device 11 .
  • various display panels such as a liquid crystal display panel or an organic EL (Electroluminescence) panel are used as the display device 14 .
  • the display device 14, which is separate from the performance analysis system 100, may be connected to the performance analysis system 100 by wire or wirelessly.
  • the photographing device 15 is an image input device that generates a time series of image data D1 by photographing a subject.
  • the time series of the image data D1 is moving image data representing moving images.
  • the photographing device 15 includes an optical system such as a photographing lens, an imaging device for receiving incident light from the optical system, and a processing circuit for generating image data D1 according to the amount of light received by the imaging device.
  • A photographing device 15 separate from the performance analysis system 100 may be connected to the performance analysis system 100 by wire or wirelessly.
  • the user adjusts the position or angle of the imaging device 15 with respect to the keyboard instrument 200 so that the imaging conditions recommended by the provider of the performance analysis system 100 are realized.
  • The photographing device 15 is installed above the keyboard instrument 200 and photographs the keyboard 22 of the keyboard instrument 200 and the user's left and right hands. Therefore, as illustrated in FIG. 2, the photographing device 15 generates a time series of image data D1 representing a performance image G1 that includes an image g1 of the keyboard 22 of the keyboard instrument 200 (hereinafter referred to as the "keyboard image") and an image g2 of the user's left and right hands (hereinafter referred to as the "finger image"). That is, moving image data representing a moving image of the user playing the keyboard instrument 200 is generated in parallel with the performance.
  • the photographing condition by the photographing device 15 is, for example, the photographing range or the photographing direction.
  • the photographing range is the range (angle of view) photographed by the photographing device 15 .
  • the shooting direction is the direction of the shooting device 15 with respect to the keyboard instrument 200 .
  • FIG. 3 is a block diagram illustrating the functional configuration of the performance analysis system 100.
  • the control device 11 functions as a performance analysis section 30 and a display control section 40 by executing programs stored in the storage device 12 .
  • the performance analysis unit 30 analyzes the performance data P and the image data D1 to generate fingering data Q representing the user's fingering.
  • the fingering data Q designates with which of the user's fingers each of the plurality of keys 21 of the keyboard instrument 200 is operated.
  • the fingering data Q consists of a pitch n corresponding to the key 21 operated by the user and the number k of the finger used by the user to operate the key 21 (hereinafter referred to as "finger number").
  • a pitch n is, for example, a note number in the MIDI standard.
  • the finger number k is a number assigned to each finger of the user's left hand and right hand.
  • the display control unit 40 causes the display device 14 to display various images.
  • the display control section 40 causes the display device 14 to display an image (hereinafter referred to as "analysis screen") 61 representing the result of analysis by the performance analysis section 30 .
  • FIG. 4 is a schematic diagram of the analysis screen 61.
  • the analysis screen 61 is an image in which a plurality of note images 611 are arranged on a coordinate plane on which a horizontal time axis and a vertical pitch axis are set.
  • a note image 611 is displayed for each note played by the user.
  • the position of the note image 611 in the direction of the pitch axis is set according to the pitch n of the note represented by the note image 611 .
  • the position and total length of the note image 611 in the direction of the time axis are set according to the sounding period of the note represented by the note image 611 .
  • For each note played by the user, a code (hereinafter referred to as a "fingering code") 612 corresponding to the finger number k specified for that note by the fingering data Q is arranged together with the note image 611. The letter "L" in the fingering code 612 means the left hand, and the letter "R" means the right hand. The number in the fingering code 612 means a specific finger: the number "1" means the thumb, "2" the index finger, "3" the middle finger, "4" the ring finger, and "5" the little finger. For example, the fingering code 612 "R2" refers to the index finger of the right hand, and the fingering code 612 "L4" refers to the ring finger of the left hand. The note image 611 and the fingering code 612 are displayed in different modes (for example, hue or gradation) for the right hand and the left hand.
  • the display control unit 40 uses the fingering data Q to display the analysis screen 61 of FIG. 4 on the display device 14 .
  • For a note for which the estimation result of the finger number k is invalid, a note image 611 is displayed in a manner different from the normal note image 611 (for example, with a dashed frame line), and a specific code such as "?" is displayed to indicate that the estimation result of the finger number k is invalid.
  • the performance analysis unit 30 includes a finger position data generation unit 31 and a fingering data generation unit 32.
  • the finger position data generator 31 generates finger position data F by analyzing the performance image G1.
  • the finger position data F is data representing the position of each finger of the user's left hand and the position of each finger of his right hand.
  • the fingering data generator 32 generates fingering data Q using the performance data P and the finger position data F.
  • Finger position data F and fingering data Q are generated for each unit period on the time axis. Each unit period is a period (frame) of a predetermined length.
  • the finger position data generation unit 31 includes an image extraction unit 311 , a matrix generation unit 312 , a finger position estimation unit 313 and a projective transformation unit 314 .
  • the finger position estimation unit 313 estimates the positions c[h, f] of the fingers of the user's left hand and right hand by analyzing the performance image G1 represented by the image data D1.
  • the position c[h, f] of each finger is the position of each fingertip in the xy coordinate system set in the performance image G1.
  • The position c[h,f] is expressed as a combination of coordinates (x[h,f], y[h,f]) in the xy coordinate system.
  • The positive direction of the x-axis corresponds to the right direction of the keyboard 22 (the direction from low tones to high tones), and the negative direction of the x-axis corresponds to the left direction of the keyboard 22 (the direction from high tones to low tones).
  • the numerical value "1" of the variable h means the left hand
  • the numerical value "2" of the variable h means the right hand.
  • the number '1' for the variable f means the thumb
  • the number '2' means the index finger
  • the number '3' means the middle finger
  • the number '4' means the ring finger
  • the number '5' means the little finger.
  • FIG. 5 is a flowchart illustrating a specific procedure of the process of estimating the position of each finger of the user by the finger position estimation unit 313 (hereinafter referred to as "finger position estimation process").
  • the finger position estimation processing includes image analysis processing Sa1, left/right determination processing Sa2, and interpolation processing Sa3.
  • In the image analysis processing Sa1, the position c[h,f] of each finger of one of the user's left and right hands (hereinafter referred to as the "first hand") and of the other hand (hereinafter referred to as the "second hand") is estimated by analyzing the performance image G1. Specifically, the finger position estimating unit 313 performs image recognition processing for estimating the user's skeleton or joints through image analysis, thereby estimating the positions c[h,1] to c[h,5] of the fingers of the first hand and the positions c[h,1] to c[h,5] of the fingers of the second hand.
  • a known image recognition process such as MediaPipe or OpenPose is used for the image analysis process Sa1. If the fingertip is not detected from the performance image G1, the coordinate x[h,f] of the fingertip on the x-axis is set to an invalid value such as "0".
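  • As a concrete illustration of the image analysis processing Sa1, the following is a minimal sketch that obtains fingertip coordinates with MediaPipe Hands, one of the image recognition libraries named above. The function name, the mapping of finger numbers to landmark indices, and the handling of undetected hands are illustrative assumptions, not the publication's actual implementation.

```python
import cv2
import mediapipe as mp

mp_hands = mp.solutions.hands

# Fingertip landmark indices defined by MediaPipe Hands:
# thumb = 4, index = 8, middle = 12, ring = 16, little finger = 20.
FINGERTIP_LANDMARKS = {1: 4, 2: 8, 3: 12, 4: 16, 5: 20}

def estimate_fingertips(frame_bgr):
    """Return {detected hand index: {finger number f: (x, y)}} in pixel coordinates."""
    with mp_hands.Hands(static_image_mode=True, max_num_hands=2) as hands:
        result = hands.process(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
    h, w, _ = frame_bgr.shape
    positions = {}
    if result.multi_hand_landmarks:
        for i, hand in enumerate(result.multi_hand_landmarks):
            positions[i] = {
                f: (hand.landmark[idx].x * w, hand.landmark[idx].y * h)
                for f, idx in FINGERTIP_LANDMARKS.items()
            }
    return positions
```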
  • During a performance, the user's right arm and left arm may cross, so it is not appropriate to determine whether a hand is the left or the right hand only from the coordinates x[h,f] of the positions c[h,f] estimated by the image analysis processing Sa1. The user's left hand and right hand could also be distinguished from the performance image G1 based on the coordinates of the user's shoulders and arms, but in that case the processing load of the image analysis processing Sa1 increases. Therefore, the finger position estimating unit 313 of the first embodiment executes the left/right determination processing Sa2 shown in FIG. 6. That is, for each of the first hand and the second hand, the finger position estimation unit 313 determines whether the variable h of the positions c[h,f] of the fingers is the numerical value "1" meaning the left hand or the numerical value "2" meaning the right hand.
  • The performance image G1 captured by the photographing device 15 is an image of the backs of both the left and right hands of the user. Therefore, in the left hand, the thumb position c[h,1] is located to the right of the little finger position c[h,5], whereas in the right hand, the thumb position c[h,1] is located to the left of the little finger position c[h,5]. Taking this tendency into account, in the left/right determination process Sa2, the finger position estimating unit 313 determines that, of the first hand and the second hand, a hand whose thumb position c[h,1] is located to the right (in the positive direction of the x-axis) of the little finger position c[h,5] is the left hand, and a hand whose thumb position c[h,1] is located to the left (in the negative direction of the x-axis) of the little finger position c[h,5] is the right hand.
  • FIG. 6 is a flowchart illustrating a specific procedure of left/right determination processing Sa2.
  • Specifically, the finger position estimation unit 313 calculates a determination index δ[h] for each of the first hand and the second hand (Sa21). The determination index δ[h] is calculated, for example, by Equation (1) below. The symbol μ[h] in Equation (1) is the average value (for example, a simple average) of the coordinates x[h,1] to x[h,5] of the five fingers of each of the first hand and the second hand. The finger position estimating unit 313 determines that the hand, of the first hand and the second hand, whose determination index δ[h] is negative is the left hand, and sets its variable h to the numerical value "1" (Sa22). The finger position estimating unit 313 determines that the hand whose determination index δ[h] is positive is the right hand, and sets its variable h to the numerical value "2" (Sa23). According to the left/right determination process Sa2 described above, the positions c[h,f] of the user's fingers can be distinguished between the right hand and the left hand by a simple process using the relationship between the position of the thumb and the position of the little finger.
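  • Equation (1) itself is not reproduced above, so the sketch below assumes one plausible form of the determination index consistent with the description, for example the signed difference between the little-finger and thumb x-coordinates (or, equivalently, the thumb coordinate measured relative to the mean μ[h]); a negative value is classified as the left hand and a positive value as the right hand.

```python
def determine_left_right(x):
    """x: dict mapping finger number f (1 = thumb .. 5 = little finger) to the x-coordinate
    x[h, f] of that fingertip for one detected hand.
    Returns 1 (left hand) or 2 (right hand), i.e. the value of the variable h.
    The index below is an assumed stand-in for Equation (1)."""
    delta = x[5] - x[1]           # negative when the thumb lies to the right of the little finger
    return 1 if delta < 0 else 2  # negative -> left hand (Sa22), positive -> right hand (Sa23)
```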
  • the position c[h, f] of each finger of the user is estimated for each unit period by the image analysis processing Sa1 and the left/right determination processing Sa2.
  • However, the position c[h,f] may not be properly estimated in some unit periods due to various circumstances such as noise in the performance image G1. Therefore, when the position c[h,f] is missing in a specific unit period (hereinafter referred to as the "missing period"), the finger position estimation unit 313 calculates the position c[h,f] in the missing period from the positions c[h,f] in the unit periods before and after the missing period (interpolation processing Sa3). For example, a position interpolated between the position c[h,f] in the unit period immediately before the missing period and the position c[h,f] in the unit period immediately after it is calculated as the position c[h,f] in the missing period.
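  • A minimal sketch of the interpolation processing Sa3, assuming the missing-period position is taken as the midpoint of the positions in the immediately preceding and following unit periods (the exact interpolation rule is an assumption).

```python
def interpolate_missing_position(pos_before, pos_after):
    """pos_before / pos_after: fingertip positions c[h, f] = (x, y) in the unit periods
    immediately before and after the missing period. Returns the interpolated position."""
    return ((pos_before[0] + pos_after[0]) / 2.0,
            (pos_before[1] + pos_after[1]) / 2.0)
```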
  • the performance image G1 includes the keyboard image g1 and the finger image g2.
  • The image extraction unit 311 in FIG. 3 extracts a specific area (hereinafter referred to as the "specific area") B from the performance image G1, as illustrated in FIG. 7.
  • the specific area B is an area of the performance image G1 that includes the keyboard image g1 and the finger image g2.
  • the finger image g2 corresponds to an image of at least part of the user's body.
  • FIG. 8 is a flow chart illustrating a specific procedure of the process of extracting the specific area B from the performance image G1 by the image extraction unit 311 (hereinafter referred to as "image extraction process").
  • the image extraction processing includes region estimation processing Sb1 and region extraction processing Sb2.
  • the area estimation process Sb1 is a process of estimating a specific area B for the performance image G1 represented by the image data D1.
  • the image extraction unit 311 generates an image processing mask M representing the specific area B from the image data D1 by the area estimation process Sb1.
  • the image processing mask M is a mask having the same size as the performance image G1, and is composed of a plurality of elements corresponding to different pixels of the performance image G1.
  • The image processing mask M is a binary mask in which each element in the area corresponding to the specific area B of the performance image G1 is set to the numerical value "1" and each element in the area other than the specific area B is set to the numerical value "0". By the control device 11 executing the region estimation processing Sb1, an element (region estimation section) for estimating the specific region B of the performance image G1 is realized.
  • The estimation model 51 is used for generating the image processing mask M by the image extraction unit 311. That is, the image extraction unit 311 generates the image processing mask M by inputting the image data D1 representing the performance image G1 into the estimation model 51. The estimation model 51 is a statistical model that has learned the relationship between the image data D1 and the image processing mask M through machine learning.
  • The estimation model 51 is composed of, for example, a deep neural network (DNN) such as a convolutional neural network (CNN) or a recurrent neural network (RNN). The estimation model 51 may also be configured by combining multiple types of deep neural networks, and additional elements such as long short-term memory (LSTM) units may be incorporated into the estimation model 51.
  • FIG. 9 is an explanatory diagram of machine learning that establishes the estimation model 51.
  • the estimated model 51 is established by machine learning by a machine learning system 900 separate from the performance analysis system 100 , and the estimated model 51 is provided to the performance analysis system 100 .
  • Machine learning system 900 is a server system capable of communicating with performance analysis system 100 via a communication network such as the Internet.
  • the estimation model 51 is transmitted from the machine learning system 900 to the performance analysis system 100 via the communication network.
  • a plurality of learning data T are used for machine learning of the estimation model 51.
  • Each of the plurality of learning data T is composed of a combination of learning image data Dt and learning image processing mask Mt.
  • the image data Dt represents a known image including a keyboard image g1 of the keyboard instrument and an image around the keyboard instrument.
  • the model of the keyboard instrument and shooting conditions (for example, shooting range and shooting direction) differ for each image data Dt. That is, image data Dt is prepared in advance by photographing each of a plurality of types of keyboard instruments under different photographing conditions. Note that the image data Dt may be prepared by a known image synthesizing technique.
  • the image processing mask Mt of each learning data T is a mask representing the specific region B in the known image represented by the image data Dt of the learning data T.
  • The machine learning system 900 computes an error function representing the error between the image processing mask M that an initial or provisional model (hereinafter referred to as the "provisional model") 51a outputs when the image data Dt of each piece of learning data T is input, and the image processing mask Mt of that learning data T. The machine learning system 900 then updates the multiple variables of the provisional model 51a so that the error function is reduced. The provisional model 51a at the time when the above processing has been repeated for each of the plurality of learning data T is determined as the estimation model 51. Therefore, the estimation model 51 outputs a statistically valid image processing mask M for unknown image data D1 under the latent relationship between the image data Dt and the image processing mask Mt in the plurality of learning data T. That is, the estimation model 51 is a trained model that has learned the relationship between the image data Dt and the image processing mask Mt.
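  • The update loop described above can be sketched as follows, assuming a PyTorch implementation; the network architecture, loss function, and optimizer are placeholders and not the actual configuration of the machine learning system 900.

```python
import torch
import torch.nn as nn

# Placeholder provisional model 51a: maps an RGB image (B, 3, H, W) to mask logits (B, 1, H, W).
provisional_model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 1, kernel_size=3, padding=1),
)
optimizer = torch.optim.Adam(provisional_model.parameters(), lr=1e-3)
error_function = nn.BCEWithLogitsLoss()  # error between the output mask and the reference mask Mt

def training_step(image_dt, mask_mt):
    """image_dt: (B, 3, H, W) float tensor; mask_mt: (B, 1, H, W) float tensor of 0/1 values."""
    optimizer.zero_grad()
    predicted = provisional_model(image_dt)    # mask output by the provisional model 51a
    loss = error_function(predicted, mask_mt)  # error function to be reduced
    loss.backward()
    optimizer.step()                           # update the variables of the provisional model 51a
    return loss.item()
```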
  • the image processing mask M representing the specific region B is generated by inputting the image data D1 of the performance image G1 into the machine-learned estimation model 51. Therefore, the specific area B can be specified with high precision for various unknown performance images G1.
  • the area extraction process Sb2 in FIG. 8 is a process for extracting the specific area B from the performance image G1 represented by the image data D1.
  • the region extraction processing Sb2 is image processing for relatively emphasizing the specific region B by selectively removing regions other than the specific region in the performance image G1.
  • The image extraction unit 311 of the first embodiment generates the image data D2 by applying the image processing mask M to the image data D1 (performance image G1). Specifically, the image extraction unit 311 multiplies the pixel value of each pixel in the performance image G1 by the element of the image processing mask M corresponding to that pixel. As illustrated in FIG. 7, the area extraction process Sb2 generates image data D2 representing an image (hereinafter referred to as the "performance image G2") obtained by removing areas other than the specific area B from the performance image G1. That is, the performance image G2 represented by the image data D2 is an image obtained by extracting the keyboard image g1 and the finger image g2 from the performance image G1.
  • An element (region extractor) for extracting the specific region B of the performance image G1 is implemented by the control device 11 executing the region extracting process Sb2.
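  • A minimal sketch of the area extraction process Sb2, assuming the performance image G1 and the mask M are NumPy arrays; the pixel-wise multiplication mirrors the description above.

```python
import numpy as np

def extract_specific_region(performance_image_g1, mask_m):
    """performance_image_g1: (H, W, 3) array; mask_m: (H, W) binary array (1 inside the
    specific region B, 0 elsewhere). Returns the performance image G2 in which areas
    other than the specific region B are removed (set to zero)."""
    return performance_image_g1 * mask_m[..., np.newaxis]
```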
  • the position c[h, f] of each finger estimated by the finger position estimation process is the coordinates in the xy coordinate system set in the performance image G1.
  • the conditions for photographing the keyboard instrument 200 by the photographing device 15 may differ depending on various circumstances such as the usage environment of the keyboard instrument 200 . For example, it is assumed that the imaging range is too wide (or too narrow) compared to the ideal imaging conditions illustrated in FIG. 2, or that the imaging direction is inclined with respect to the vertical direction.
  • Therefore, the numerical values of the coordinates x[h,f] and y[h,f] of each position c[h,f] depend on the conditions under which the photographing device 15 captures the performance image G1.
  • In view of this, the projective transformation unit 314 of the first embodiment converts the position c[h,f] of each finger on the performance image G1 into a position C[h,f] in an XY coordinate system that is substantially independent of the imaging conditions of the photographing device 15 (image registration).
  • the finger position data F generated by the finger position data generation unit 31 is data representing the position C[h,f] after conversion by the projective conversion unit 314 . That is, the finger position data F includes the positions C[1,1] to C[1,5] of the fingers of the user's left hand and the positions C[2,1] to C[ of the fingers of the user's right hand. 2,5].
  • the XY coordinate system is set to a predetermined image (hereinafter referred to as "reference image”) Gref, as illustrated in FIG.
  • the reference image Gref is an image of a keyboard of a standard keyboard instrument (hereinafter referred to as “reference instrument”) captured under standard imaging conditions.
  • the reference image Gref is not limited to an image of an actual keyboard.
  • an image synthesized by a known image synthesis technique may be used as the reference image Gref.
  • Image data Dref representing the reference image Gref (hereinafter referred to as “reference data”) and auxiliary data A relating to the reference image Gref are stored in the storage device 12 .
  • Auxiliary data A is data specifying a combination of an area (hereinafter referred to as a "unit area") Rn in which each key 21 of the reference musical instrument exists in the reference image Gref and the pitch n corresponding to the key 21. That is, the auxiliary data A can also be said to be data defining a unit region Rn corresponding to each pitch n in the reference image Gref.
  • The transformation from the position c[h,f] in the xy coordinate system to the position C[h,f] in the XY coordinate system is a projective transformation using the transformation matrix W, as expressed by Equation (2) below.
  • The symbol X in Equation (2) means the coordinate on the X-axis in the XY coordinate system, the symbol Y means the coordinate on the Y-axis, and the symbol s is an adjustment value for matching the scale between the xy coordinate system and the XY coordinate system.
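  • Equation (2) itself is not reproduced in this text. From the definitions of X, Y and s above, and the normalized coordinates X/s and Y/s used later in the description, it is presumably the usual projective transformation in homogeneous coordinates, sketched here as an assumption:

```latex
\begin{pmatrix} X \\ Y \\ s \end{pmatrix}
= W \begin{pmatrix} x \\ y \\ 1 \end{pmatrix},
\qquad
C[h,f] = \left( \tfrac{X}{s},\; \tfrac{Y}{s} \right)
```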
  • FIG. 11 is a flowchart illustrating a specific procedure of the process of generating the transformation matrix W by the matrix generator 312 (hereinafter referred to as "matrix generation process").
  • the matrix generation process of the first embodiment is executed with the performance image G2 (image data D2) processed by the image extraction process as the object of processing.
  • Compared with a configuration in which the matrix generation process is executed for the entire performance image G1 including areas other than the specific area B, a suitable transformation matrix W that approximates the keyboard image g1 to the reference image Gref with high precision can therefore be generated.
  • the matrix generation process includes an initialization process Sc1 and a matrix update process Sc2.
  • The initial setting process Sc1 is a process of setting an initial matrix W0, which is an initial value of the transformation matrix W. The details of the initial setting process Sc1 will be described later.
  • The matrix update process Sc2 is a process of generating the transformation matrix W by iteratively updating the initial matrix W0. That is, the initial matrix W0 is iteratively updated so that the keyboard image g1 of the performance image G2, after the projective transformation using the transformation matrix W, approaches the reference image Gref, thereby generating the transformation matrix W.
  • Specifically, the transformation matrix W is generated so that the coordinate X/s on the X-axis of a specific point in the reference image Gref approximates or matches the coordinate x on the x-axis of the corresponding point in the keyboard image g1, and the coordinate Y/s on the Y-axis of that point approximates or matches the coordinate y on the y-axis of the corresponding point in the keyboard image g1. That is, the transformation matrix W is generated so that, under the projective transformation applying the transformation matrix W, the coordinates of the key 21 corresponding to a specific pitch in the keyboard image g1 are transformed into the coordinates of the key 21 corresponding to that pitch in the reference image Gref.
  • An element (matrix generation unit 312) for generating the conversion matrix W is implemented by the control device 11 executing the matrix update processing Sc2 illustrated above.
  • In the matrix update process Sc2, for example, a process of updating the transformation matrix W so that an image feature amount such as SIFT (Scale-Invariant Feature Transform) becomes closer between the reference image Gref and the keyboard image g1 is conceivable.
  • However, in the keyboard image g1, a pattern in which a plurality of keys 21 are arranged in the same manner is repeated, so there is a possibility that the transformation matrix W cannot be properly estimated in a form that uses such an image feature amount.
  • In consideration of the above circumstances, the matrix generation unit 312 of the first embodiment iteratively updates the initial matrix W0 so as to increase (ideally, maximize) the enhanced correlation coefficient (ECC) between the reference image Gref and the keyboard image g1. For the reason described above, the enhanced correlation coefficient is suitable for generating the transformation matrix W used for transforming the keyboard image g1. Note that a transformation matrix W may instead be generated so that image feature amounts of the reference image Gref and the keyboard image g1 become close to each other.
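  • One way to realize the matrix update process Sc2 with an existing library is OpenCV's ECC alignment, which iteratively refines a warp matrix so that the enhanced correlation coefficient between a template image and an input image is maximized. The sketch below is an assumption about how it could be applied here (grayscale inputs, homography motion model); it is not the publication's actual implementation.

```python
import cv2
import numpy as np

def update_matrix_ecc(reference_gref_gray, keyboard_g2_gray, w0):
    """Iteratively update the 3x3 initial matrix W0 so that the enhanced correlation
    coefficient (ECC) between the reference image Gref and the extracted keyboard
    image increases. Returns the refined transformation matrix W and the final ECC."""
    criteria = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 200, 1e-6)
    ecc, w = cv2.findTransformECC(
        reference_gref_gray.astype(np.float32),  # template image
        keyboard_g2_gray.astype(np.float32),     # input image to be aligned
        w0.astype(np.float32),                   # initial matrix W0 (3x3 for a homography)
        cv2.MOTION_HOMOGRAPHY,
        criteria,
        None,                                    # no input mask
        5,                                       # Gaussian filter size for pre-smoothing
    )
    return w, ecc
```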
  • the projective transformation unit 314 in FIG. 3 executes projective transformation processing.
  • the projective transformation process is a projective transformation of the performance image G1 using the transformation matrix W generated by the matrix generation process.
  • the performance image G1 is transformed into an image (hereinafter referred to as "transformed image") shot under the same shooting conditions as the reference image Gref.
  • the area corresponding to the key 21 of the pitch n in the transformed image substantially matches the unit area Rn of the pitch n in the reference image Gref.
  • the x-y coordinate system of the transformed image substantially matches the x-y coordinate system of the reference image Gref.
  • In the projective transformation process, the projective transformation unit 314 converts the position c[h,f] of each finger into the position C[h,f]. By the control device 11 executing the projective transformation process, an element (projective transformation unit 314) that executes the projective transformation of the performance image G1 is realized.
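  • A minimal sketch of the projective transformation process for the fingertip positions, assuming the transformation matrix W is a 3x3 NumPy array; cv2.perspectiveTransform performs the homogeneous-coordinate mapping (and cv2.warpPerspective could warp the performance image itself in the same way).

```python
import cv2
import numpy as np

def project_finger_positions(positions_c, w):
    """positions_c: iterable of fingertip positions c[h, f] = (x, y) in the xy coordinate
    system of the performance image. Returns the corresponding positions C[h, f] in the
    XY coordinate system of the reference image Gref."""
    pts = np.asarray(list(positions_c), dtype=np.float32).reshape(-1, 1, 2)
    return cv2.perspectiveTransform(pts, w.astype(np.float32)).reshape(-1, 2)
```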
  • the display control unit 40 causes the display device 14 to display the transformed image generated by the projective transformation process.
  • the display control unit 40 causes the display device 14 to display the converted image and the reference image Gref in an overlapping state.
  • the area corresponding to the key 21 of each pitch n in the transformed image and the unit area Rn corresponding to the pitch n in the reference image Gref overlap each other.
  • As described above, the transformation matrix W is generated so that the keyboard image g1 of the performance image G1 approaches the reference image Gref, and the projective transformation process using the transformation matrix W is performed on the performance image G1. Therefore, the performance image G1 of the keyboard instrument 200 played by the user can be converted into a converted image corresponding to the photographing conditions of the reference musical instrument in the reference image Gref.
  • FIG. 12 is a flowchart illustrating a specific procedure of the initial setting process Sc1.
  • the projective transformation unit 314 causes the display device 14 to display the setting screen 62 illustrated in FIG. 13 (Sc11).
  • the setting screen 62 includes a performance image G1 photographed by the photographing device 15 and an instruction 622 for the user.
  • The instruction 622 is a message prompting the user to select an area (hereinafter referred to as the "target area") 621 corresponding to one or more specific pitches (hereinafter referred to as the "target pitch") n of the keyboard image g1 in the performance image G1.
  • the user selects the target area 621 corresponding to the target pitch n in the performance image G1 by operating the operation device 13 while viewing the setting screen 62 .
  • the projective transformation unit 314 receives selection of the target area 621 by the user (Sc12).
  • the projective transformation unit 314 identifies one or more unit regions Rn designated by the auxiliary data A for the target pitch n in the reference image Gref represented by the reference data Dref (Sc13). Then, the projective transformation unit 314 calculates a matrix for projectively transforming the target region 621 of the performance image G1 into one or more unit regions Rn specified from the reference image Gref as an initial matrix W0 (Sc14).
  • As described above, the initial matrix W0 is set so that the target area 621 selected in accordance with the instruction from the user in the performance image G1 approaches the unit area Rn corresponding to the target pitch n in the reference image Gref. Therefore, it is possible to generate an appropriate transformation matrix W that can approximate the keyboard image g1 to the reference image Gref with high accuracy.
  • the area designated by the user by operating the operating device 13 in the performance image G1 is used as the target area 621 for setting the initial matrix W0. Therefore, an appropriate initial matrix W0 can be generated while reducing the processing load, compared with, for example, a form in which the area corresponding to the target pitch n in the performance image G1 is estimated by arithmetic processing.
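  • A minimal sketch of the initial setting process Sc1 of the first embodiment, assuming the target area 621 and the unit region Rn are each given by four corner points in a consistent order; the helper name is hypothetical.

```python
import cv2
import numpy as np

def initial_matrix_w0(target_area_corners, unit_region_corners):
    """target_area_corners: four (x, y) corners of the target area 621 selected by the user
    in the performance image G1; unit_region_corners: four (X, Y) corners of the unit
    region Rn of the target pitch n in the reference image Gref.
    Returns the 3x3 initial matrix W0 that projectively maps the former onto the latter."""
    src = np.asarray(target_area_corners, dtype=np.float32)
    dst = np.asarray(unit_region_corners, dtype=np.float32)
    return cv2.getPerspectiveTransform(src, dst)
```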
  • the initial setting process Sc1 is executed for the performance image G1, but the initial setting process Sc1 may be executed for the performance image G2.
  • The fingering data generator 32 in FIG. 3 generates the fingering data Q using the performance data P generated by the keyboard instrument 200 and the finger position data F generated by the finger position data generator 31, as described above.
  • the fingering data Q is generated every unit period.
  • the fingering data generator 32 of the first embodiment includes a probability calculator 321 and a fingering estimator 322 .
  • The probability calculation unit 321 calculates, for each finger number k, a probability p that the pitch n specified by the performance data P is being played by the finger with that finger number k.
  • the probability p is an index (likelihood) of the probability that the finger with the finger number k has operated the key 21 with the pitch n.
  • the probability calculator 321 calculates the probability p according to whether or not the position C[k] of the finger with the finger number k exists within the unit area Rn of the pitch n.
  • The probability p is calculated for each unit period on the time axis. Specifically, when the performance data P designates the pitch n, the probability calculation unit 321 calculates the probability p(C[k] | θk = n) for each finger by the calculation of Equation (3) exemplified below. The condition "θk = n" means that the finger with the finger number k is playing the pitch n, so the probability p(C[k] | θk = n) is the probability that the position C[k] is observed for that finger under this condition. The symbol I(C[k] ∈ Rn) in Equation (3) is an indicator function that is set to the numerical value "1" when the position C[k] exists within the unit region Rn and to the numerical value "0" when the position C[k] exists outside the unit region Rn. The symbol |Rn| means the area of the unit region Rn. The symbol N(0, σ²E) means observation noise, which is represented by a normal distribution with mean 0 and variance σ²; the symbol E is a unit matrix of 2 rows and 2 columns, and the symbol * means convolution with the observation noise N(0, σ²E). The probability p(C[k] | θk = n) is thus the probability that the finger position data F designates the position C[k] for the finger. Therefore, the probability p(C[k] | θk = n) is maximized when the position C[k] of the finger with the finger number k is within the unit area Rn of the pitch being played, and decreases as the position C[k] moves away from the unit region Rn.
  • Further, the probability calculator 321 calculates, for each finger, the probability p(C[k] | θk = 0) that the position C[k] is observed when the finger with the finger number k is not playing any key 21, by the calculation of Equation (4) below. The denominator of Equation (4) means the total area of the N unit regions R1 to RN in the reference image Gref. As can be seen from Equation (4), when the user does not operate any key 21, the probability p(C[k] | θk = 0) is a common numerical value (the reciprocal of the total area of the unit regions) regardless of the finger.
  • The fingering estimation unit 322 estimates the user's fingering. Specifically, the fingering estimation unit 322 estimates the finger (finger number k) that played the pitch n specified by the performance data P from the probability p(C[k] | θk = n) of each finger. The fingering estimation unit 322 estimates the finger number k (generates the fingering data Q) every time the probability p(C[k] | θk = n) of each finger is calculated, that is, every unit period. Specifically, the fingering estimation unit 322 identifies the finger number k corresponding to the maximum value among the plurality of probabilities p(C[k] | θk = n) corresponding to the different fingers. Then, the fingering estimation unit 322 generates fingering data Q that specifies the pitch n specified by the performance data P and the finger number k identified from the probabilities p(C[k] | θk = n). Note that the fingering estimation unit 322 sets the finger number k to an invalid value, meaning that the estimation result is invalid, in a unit period in which the maximum value of the plurality of probabilities p(C[k] | θk = n) is below a threshold.
  • For a note for which the finger number k is set to the invalid value, the display control unit 40 displays the note image 611 in a manner different from the normal note image 611, as illustrated in FIG. 4, and displays the sign "?" for that note.
  • the configuration and operation of the fingering data generator 32 are as described above.
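  • Equations (3) and (4) are not reproduced above, so the sketch below approximates them from the symbol definitions: a uniform density over the unit region Rn softened by Gaussian observation noise for p(C[k] | θk = n), a constant value (the reciprocal of the total unit-region area) for p(C[k] | θk = 0), and an argmax with a validity threshold for the fingering estimation. The region shape, the noise handling, and the threshold choice are assumptions.

```python
import numpy as np

def prob_playing(position_ck, region_rn, sigma=10.0):
    """Approximation of p(C[k] | theta_k = n): uniform over the unit region Rn (given as an
    axis-aligned box (x0, y0, x1, y1)), with the convolution by the observation noise
    N(0, sigma^2 E) approximated by a Gaussian falloff outside the region."""
    x, y = position_ck
    x0, y0, x1, y1 = region_rn
    area = (x1 - x0) * (y1 - y0)
    dx = max(x0 - x, 0.0, x - x1)
    dy = max(y0 - y, 0.0, y - y1)
    return np.exp(-(dx * dx + dy * dy) / (2.0 * sigma * sigma)) / area

def estimate_finger_number(positions_c, region_rn, total_area, threshold=None):
    """positions_c: {finger number k: position C[k]}. Returns the finger number k with the
    maximum probability, or None (the invalid value) when that maximum does not exceed the
    'no key operated' probability 1 / total_area from Equation (4), used here as the threshold."""
    if threshold is None:
        threshold = 1.0 / total_area
    probs = {k: prob_playing(c, region_rn) for k, c in positions_c.items()}
    k_best = max(probs, key=probs.get)
    return k_best if probs[k_best] > threshold else None
```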
  • FIG. 14 is a flowchart illustrating a specific procedure of the processing executed by the performance analysis section 30 (hereinafter referred to as "performance analysis processing"). For example, the performance analysis process is started when the user gives an instruction via the operation device 13.
  • the control device 11 executes the image extraction process of FIG. 8 (S11). That is, the control device 11 generates the performance image G2 by extracting the specific region B including the keyboard image g1 and the finger image g2 from the performance image G1.
  • the image extraction process includes the area estimation process Sb1 and the area extraction process Sb2 as described above.
  • After executing the image extraction process, the control device 11 (matrix generation unit 312) executes the matrix generation process of FIG. 11 (S12). That is, the control device 11 generates the transformation matrix W by iteratively updating the initial matrix W0 so as to increase the enhanced correlation coefficient between the reference image Gref and the keyboard image g1.
  • the matrix generation process includes the initialization process Sc1 and the matrix update process Sc2, as described above.
  • the control device 11 repeats the processing (S13 to S18) illustrated below for each unit period.
  • the control device 11 (finger position estimating section 313) executes the finger position estimating process of FIG. 5 (S13). That is, the control device 11 estimates the positions c[h, f] of the fingers of the user's left hand and right hand by analyzing the performance image G1.
  • the finger position estimation processing includes image analysis processing Sa1, left/right determination processing Sa2, and interpolation processing Sa3.
  • Next, the control device 11 (projective transformation unit 314) executes the projective transformation process (S14). That is, the control device 11 generates a transformed image by projective transformation of the performance image G1 using the transformation matrix W. In the projective transformation process, the control device 11 also transforms the position c[h,f] of each of the user's fingers into the position C[h,f] in the XY coordinate system, and generates finger position data F representing the positions C[h,f] of the user's fingers.
  • Next, the control device 11 (probability calculation unit 321) executes the probability calculation process (S15). That is, the control device 11 calculates, for each finger number k, the probability p(C[k] | θk = n) that the pitch n specified by the performance data P is being played by the finger with the finger number k. Then, the control device 11 (fingering estimation unit 322) executes the fingering estimation process (S16). That is, the control device 11 estimates the finger number k of the finger that played the pitch n from the probability p(C[k] | θk = n) of each finger, and generates fingering data Q that designates the pitch n and the finger number k.
  • the control device 11 (display control unit 40) updates the analysis screen 61 according to the fingering data Q (S17). Further, the control device 11 determines whether or not a predetermined end condition is satisfied (S18). For example, when the user instructs to end the performance analysis processing by operating the operation device 13, the control device 11 determines that the end condition is met. If the termination condition is not satisfied (S18: NO), the control device 11 repeats the processes after the finger position estimation process (S13 to S18) for the immediately following unit period. On the other hand, if the termination condition is satisfied (S18: YES), the control device 11 terminates the performance analysis process.
  • As described above, in the first embodiment, the finger position data F generated by analyzing the performance image G1 and the performance data P representing the performance by the user are used to generate the fingering data Q. Therefore, the fingering can be estimated with high accuracy compared to a configuration in which the fingering is estimated only from the performance data P.
  • the position c[h, f] of each finger estimated by the finger position estimation process is calculated using the transformation matrix W for projective transformation that brings the keyboard image g1 closer to the reference image Gref. converted. That is, the position C[h,f] of each finger is estimated with reference to the reference image Gref. Therefore, the fingering can be estimated with high precision compared to a configuration in which the position c[h, f] of each finger is not converted to a position based on the reference image Gref.
  • a specific area B including the keyboard image g1 is extracted from the performance image G1. Therefore, as described above, it is possible to generate an appropriate transformation matrix W that can approximate the keyboard image g1 to the reference image Gref with high accuracy. Further, extracting the specific region B can improve the usability of the performance image G1. Particularly in the first embodiment, a specific area B including the keyboard image g1 and the finger image g2 is extracted from the performance image G1. Therefore, it is possible to generate a performance image G2 in which the appearance of the keyboard 22 of the keyboard instrument 200 and the appearance of the user's fingers can be efficiently visually recognized.
  • Second Embodiment A second embodiment will be described.
  • Elements having the same functions as those of the first embodiment are denoted by the same reference numerals used in the description of the first embodiment, and detailed descriptions thereof are omitted as appropriate.
  • In the example of FIG. 15, the middle finger and the index finger of the left hand overlap each other. That is, the position C[k] of the middle finger of the left hand and the position C[k] of the index finger of the left hand exist within one unit region Rn.
  • In an actual performance, a plurality of fingers may thus overlap each other. As described above, when a plurality of fingers overlap each other within one unit region Rn, the method of the first embodiment may not be able to estimate the fingering with high accuracy. The second embodiment addresses this problem. Specifically, in the second embodiment, the positional relationship of a plurality of fingers and the temporal variation (dispersion) of the position of each finger are taken into consideration in fingering estimation.
  • FIG. 16 is a block diagram illustrating the functional configuration of the performance analysis system 100 according to the second embodiment.
  • a performance analysis system 100 of the second embodiment has a configuration in which a control data generator 323 is added to the same elements as those of the first embodiment.
  • the control data generator 323 generates N pieces of control data Z[1] to Z[N] corresponding to different pitches n.
  • FIG. 17 is a schematic diagram of control data Z[n] corresponding to an arbitrary pitch n.
  • the control data Z[n] is vector data representing the characteristics of the relative position (hereinafter referred to as "relative position") C'[k] of each finger with respect to the unit area Rn of pitch n.
  • the relative position C'[k] is information obtained by converting the position C[k] represented by the finger position data F into a position relative to the unit region Rn.
  • Specifically, the control data Z[n] corresponding to one pitch n includes the pitch n and, for each of the plurality of fingers, the position average Za[n,k], the position variance Zb[n,k], the velocity average Zc[n,k], and the velocity variance Zd[n,k].
  • the average position Za[n,k] is the average of the relative positions C'[k] within a period of a predetermined length including the current unit period (hereinafter referred to as "observation period").
  • The observation period is, for example, a period made up of a plurality of consecutive unit periods on the time axis that ends with the current unit period.
  • the position variance Zb[n,k] is the variance of the relative position C'[k] within the observation period.
  • the velocity average Zc[n,k] is the average of the velocities (that is, rate of change) at which the relative position C'[k] changes within the observation period.
  • the velocity variance Zd[n,k] is the variance of the velocity at which the relative position C'[k] changes within the observation period.
  • As described above, the control data Z[n] includes, for each of the plurality of fingers, information (Za[n,k], Zb[n,k], Zc[n,k], Zd[n,k]) relating to the relative position C'[k]. Therefore, the control data Z[n] is data reflecting the positional relationship of the user's fingers. Also, the control data Z[n] includes information (Zb[n,k], Zd[n,k]) regarding the variation of the relative position C'[k] for each of the plurality of fingers. Therefore, the control data Z[n] is data that reflects temporal variations in the position of each finger.
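  • A minimal sketch of how the per-finger entries of the control data Z[n] could be computed from the relative positions C'[k] over the observation period; the array layout and the use of a first-order difference as the rate of change are assumptions.

```python
import numpy as np

def control_data_entries(relative_positions):
    """relative_positions: array of shape (T, 2) holding the relative position C'[k] of one
    finger in each of the T unit periods of the observation period.
    Returns (Za, Zb, Zc, Zd): position mean, position variance, velocity mean, velocity variance."""
    pos = np.asarray(relative_positions, dtype=float)
    vel = np.diff(pos, axis=0)  # change of C'[k] per unit period (rate of change)
    return pos.mean(axis=0), pos.var(axis=0), vel.mean(axis=0), vel.var(axis=0)
```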
  • a plurality of estimation models 52[k] (52[1] to 52[10]) prepared in advance for different fingers are used for the probability calculation processing by the probability calculation unit 321 of the second embodiment.
  • the estimation model 52[k] of each finger is a trained model that has learned the relationship between the control data Z[n] and the probability p[k] of the finger.
  • The probability p[k] is an index (likelihood) of the probability that the pitch n specified by the performance data P is being played by the finger with the finger number k.
  • the probability calculation unit 321 calculates the probability p[k] by inputting the N pieces of control data Z[1] to Z[N] to the estimation model 52[k] for each of a plurality of fingers. .
  • the estimation model 52[k] corresponding to any one finger number k is a logistic regression model represented by Equation (5) below.
  • The variable αk and the variables βk,n in Equation (5) are set by machine learning by the machine learning system 900. That is, each estimation model 52[k] is established by machine learning by the machine learning system 900, and each estimation model 52[k] is provided to the performance analysis system 100. For example, the variable αk and the variables βk,n of each estimation model 52[k] are transmitted from the machine learning system 900 to the performance analysis system 100.
  • For example, the estimation model 52[k] learns the relationship between the control data Z[n] and the probability p[k] so that the probability p[k] becomes small for a finger whose relative position C'[k] has a high rate of change.
  • the probability calculator 321 calculates a plurality of probabilities p[k] regarding different fingers for each unit period by inputting the control data Z[n] to each of the plurality of estimation models 52[k].
  • the fingering estimation unit 322 estimates the user's fingering through fingering estimation processing that applies a plurality of probabilities p[k]. Specifically, the fingering estimation unit 322 estimates the finger (finger number k) that played the pitch n specified by the performance data P from the probability p[k] of each finger. The fingering estimation unit 322 estimates the finger number k (generates the fingering data Q) every time the probability p[k] of each finger is calculated (that is, every unit period). Specifically, the fingering estimation unit 322 identifies the finger number k corresponding to the maximum value among a plurality of probabilities p[k] corresponding to different fingers. Then, the fingering estimation unit 322 generates fingering data Q that specifies the pitch n specified by the performance data P and the finger number k specified from the probability p[k].
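  • Equation (5) is not reproduced above, so the sketch below assumes a standard logistic regression form, p[k] = sigmoid(αk + Σn βk,n · Z[n]), with the learned variables αk and βk,n; the parameter layout and the helper names are assumptions.

```python
import numpy as np

def probability_pk(control_data_z, alpha_k, beta_k):
    """control_data_z: list of N feature vectors Z[1] .. Z[N]; beta_k: list of N weight
    vectors (the learned variables beta_{k,n}); alpha_k: learned bias term.
    Returns the probability p[k] for the finger with finger number k."""
    score = alpha_k + sum(float(np.dot(b, z)) for b, z in zip(beta_k, control_data_z))
    return 1.0 / (1.0 + np.exp(-score))

def estimate_fingering(control_data_z, model_params):
    """model_params: {finger number k: (alpha_k, beta_k)} for the ten fingers.
    Returns the finger number k with the maximum probability p[k]."""
    probs = {k: probability_pk(control_data_z, a, b) for k, (a, b) in model_params.items()}
    return max(probs, key=probs.get)
```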
  • FIG. 18 is a flowchart illustrating a specific procedure of performance analysis processing in the second embodiment.
  • generation of control data Z[n] (S19) is added to the same process as in the first embodiment.
  • Specifically, the control device 11 (control data generator 323) generates N pieces of control data Z[1] to Z[N] corresponding to the different pitches n (S19).
  • The control device 11 (probability calculation unit 321) calculates the probability p[k] for each of the plurality of fingers by inputting the control data Z[n] into each estimation model 52[k] (S15). Further, the control device 11 (fingering estimation unit 322) estimates the user's fingering by the fingering estimation process applying the plurality of probabilities p[k] (S16).
  • the operations of elements other than the fingering data generator 32 (S11-S14, S17-S18) are the same as in the first embodiment.
  • As described above, the control data Z[n] input to the estimation models 52[k] in the second embodiment include the average Za[n,k] and the variance Zb[n,k] of the relative position C'[k] of each finger, and the average Zc[n,k] and the variance Zd[n,k] of the rate of change of the relative position C'[k]. Therefore, even if a plurality of fingers overlap each other, for example when one finger passes over or under another, the user's fingering can be estimated with high accuracy.
  • the logistic regression model was exemplified as the estimation model 52[k], but the type of estimation model 52[k] is not limited to the above examples.
  • a statistical model such as a multilayer perceptron may be used as the estimation model 52[k].
  • a deep neural network such as a convolutional neural network or a recursive neural network may also be used as the estimation model 52[k].
  • a combination of multiple types of statistical models may be used as the estimation model 52[k].
  • the various estimation models 52[k] exemplified above are comprehensively expressed as learned models that have learned the relationship between the control data Z[n] and the probability p[k].
  • FIG. 19 is a flowchart illustrating a specific procedure of performance analysis processing in the third embodiment.
  • After executing the image extraction process and the matrix generation process, the control device 11 refers to the performance data P to determine whether or not the user is playing the keyboard instrument 200 (S21). Specifically, the control device 11 determines whether or not any of the plurality of keys 21 of the keyboard instrument 200 is being operated.
  • If the keyboard instrument 200 is being played (S21: YES), the control device 11 generates the finger position data F (S13-S14) and the fingering data Q (S15-S16) and updates the analysis screen 61 (S17), as in the first embodiment. On the other hand, if the keyboard instrument 200 is not being played (S21: NO), the control device 11 advances the process to step S18. That is, generation of the finger position data F (S13-S14), generation of the fingering data Q (S15-S16), and updating of the analysis screen 61 (S17) are not executed.
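  • A rough sketch of this gating step, assuming the performance data P is available as the set of currently depressed key numbers (a simplification of MIDI-style data; all function names below are placeholders):

```python
def estimate_finger_positions(frame):
    """Placeholder for the finger position estimation (S13-S14)."""
    return {}

def estimate_fingering_data(finger_positions, pressed_keys):
    """Placeholder for the fingering estimation (S15-S16)."""
    return {}

def update_analysis_screen(finger_positions, fingering):
    """Placeholder for updating the analysis screen 61 (S17)."""
    pass

def analyze_frame(pressed_keys: set, frame) -> None:
    """Skip the analysis steps while no key 21 is operated (S21: NO)."""
    if not pressed_keys:
        return  # proceed directly to step S18
    finger_positions = estimate_finger_positions(frame)
    fingering = estimate_fingering_data(finger_positions, pressed_keys)
    update_analysis_screen(finger_positions, fingering)
```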
  • The same effects as in the first embodiment are also achieved in the third embodiment. Further, in the third embodiment, generation of the finger position data F and the fingering data Q is stopped while the keyboard instrument 200 is not being played. Therefore, the processing load required for generating the fingering data Q can be reduced compared to a configuration in which the finger position data F continues to be generated regardless of whether the keyboard instrument 200 is being played. The third embodiment can also be applied to the second embodiment.
  • FIG. 20 is a flowchart illustrating a specific procedure of the initial setting process Sc1 executed by the control device 11 (matrix generator 312) of the fourth embodiment.
  • In the fourth embodiment, the user operates the key 21 corresponding to a desired pitch n (hereinafter referred to as the "specific pitch") among the plurality of keys 21 of the keyboard instrument 200 with a particular finger (hereinafter referred to as the "specific finger").
  • The specific finger is, for example, a finger (for example, the index finger of the right hand) indicated to the user by a display on the display device 14, by the instruction manual of the keyboard instrument 200, or the like.
  • performance data P specifying a specific pitch n is supplied from the keyboard instrument 200 to the performance analysis system 100 .
  • the control device 11 acquires the performance data P from the keyboard instrument 200, thereby recognizing the performance of the specific pitch n by the user (Sc15).
  • the control device 11 specifies a unit area Rn corresponding to a specific pitch n among the N unit areas R1 to RN of the reference image Gref (Sc16).
  • the finger position data generation unit 31 generates finger position data F through finger position estimation processing.
  • the finger position data F includes the position C[h, f] of the specific finger used by the user to play the specific pitch n.
  • the control device 11 acquires the finger position data F to specify the position C[h,f] of the specific finger (Sc17).
  • the control device 11 uses the unit area Rn corresponding to the specific pitch n and the position C[h,f] of the specific finger represented by the finger position data F to set the initial matrix W0 (Sc18). That is, the control device 11 sets the initial matrix W0 so that the position C[h,f] of the specific finger represented by the finger position data F approaches the unit area Rn of the specific pitch n in the reference image Gref. Specifically, a matrix for projectively transforming the position C[h,f] of the specific finger to the center of the unit area Rn is set as the initial matrix W0.
  • In other words, the initial matrix W0 is set so that the position c[h,f] of the specific finger in the performance image G1 approaches the portion (unit region Rn) corresponding to the specific pitch n in the reference image Gref. Since the user only needs to play the desired pitch n, the initial matrix W0 can be set by a simpler operation than in the first embodiment, in which the user needs to select the target area 621 by operating the operation device 13.
  • As described above, in the fourth embodiment, the control device 11 sets the initial matrix W0 so that the position C[h,f] of the specific finger during the performance of the specific pitch n approaches the unit region Rn of the specific pitch n.
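  • One simple realization of this initialization is sketched below, assuming the initial matrix is taken to be a pure translation in homogeneous coordinates; the embodiment only requires that the finger position be mapped onto the center of the unit region Rn, so the function name and the translation-only form are illustrative assumptions.

```python
import numpy as np

def initial_matrix(finger_xy: tuple, region_center_xy: tuple) -> np.ndarray:
    """Return a 3x3 homography W0 that maps the specific finger position onto
    the center of the unit region Rn of the specific pitch n (translation only)."""
    fx, fy = finger_xy
    cx, cy = region_center_xy
    return np.array([
        [1.0, 0.0, cx - fx],
        [0.0, 1.0, cy - fy],
        [0.0, 0.0, 1.0],
    ])

W0 = initial_matrix(finger_xy=(412.0, 238.0), region_center_xy=(500.0, 120.0))
p = W0 @ np.array([412.0, 238.0, 1.0])
print(p[:2] / p[2])  # -> [500. 120.]
```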
  • FIG. 21 is a block diagram illustrating the functional configuration of a performance analysis system 100 according to a fifth embodiment.
  • a performance analysis system 100 of the fifth embodiment comprises a sound pickup device 16 .
  • the sound collecting device 16 generates the sound signal V by collecting sound reproduced from the keyboard instrument 200 by the user's performance.
  • the acoustic signal V is a time-domain audio signal representing the waveform of the sound reproduced by the keyboard instrument 200 .
  • A sound collecting device 16 separate from the performance analysis system 100 may be connected to the performance analysis system 100 by wire or wirelessly. Note that the time series of samples forming the acoustic signal V may be interpreted as "performance data P".
  • the control device 11 of the performance analysis system 100 functions as a performance analysis section 30 by executing programs stored in the storage device 12 .
  • the performance analysis section 30 generates fingering data Q using the sound signal V supplied from the sound pickup device 16 and the image data D1 supplied from the photographing device 15 .
  • the fingering data Q designates the pitch n corresponding to the key 21 operated by the user and the finger number k of the finger used to operate the key 21 by the user.
  • Whereas the pitch n is directly designated by the performance data P in the first embodiment, the acoustic signal V in the fifth embodiment is not a signal that directly designates the pitch n. Therefore, the performance analysis unit 30 estimates the pitch n and the finger number k simultaneously using the acoustic signal V and the image data D1.
  • a latent variable w t,n,k is prepared for each combination of pitch n and finger number k.
  • the latent variable w t,n,k is a variable for one-hot expression that is set to either of the binary values '0' and '1'.
  • the value "1" of the latent variable w t,n,k means that the pitch n is played by the finger with the finger number k, and the value "0" of the latent variable w t,n,k This means that the fingers of the player are also not used for playing.
  • The posterior probability U t,n is the posterior probability that the pitch n is sounded at time t under the condition that the acoustic signal V is observed. Therefore, the probability (1 − U t,n) is the probability, under the condition that the acoustic signal V is observed, that the latent variable w t,n,0 takes the value "1" (that is, the probability that the pitch n is not played). The posterior probability U t,n is estimated by a known estimation model that has learned the relationship between the acoustic signal V and the posterior probability U t,n. The estimation model is a trained model for automatic transcription.
  • A deep neural network, for example a convolutional neural network or a recurrent neural network, is used as the estimation model for estimating the posterior probabilities U t,n.
  • the probability ⁇ t,n,k is the probability that the pitch n is played by the finger with the finger number k when the pitch n is being played.
  • The probability p(w) of the latent variable w is expressed by Equation (6). The first term on the right side of Equation (6) represents the probability that the pitch n is not sounded, and the second term represents the probability that, given that the pitch n is sounded, it is played with the finger of finger number k.
  • The probability p(C[k] | λ², Rn) in Equation (7) is the probability expressed by Equation (3) or Equation (4) above.
  • For the probability φ, a symmetric Dirichlet distribution (Dir) expressed by Equation (8) below is assumed as the prior distribution.
  • ⁇ in Equation (8) is a variable that defines the shape of the symmetric Dirichlet distribution.
  • By MAP (Maximum A Posteriori) estimation under the above model, the presence or absence of the sounding of each pitch n and the finger number k can be estimated at the same time.
  • Specifically, variational Bayesian estimation with a mean field approximation is used. That is, the distribution expressed by Equation (9), which best approximates the probability distribution of the posterior probability p(z | V, φ, C[k]), is identified.
  • the performance analysis unit 30 repeats the calculations of the following formulas (10) and (11).
  • the symbol c in Equation (10) is a coefficient for normalizing the probability distribution ⁇ t,n,k so that the sum of the probability distribution ⁇ t,n, k over a plurality of finger numbers k is "1". Also, the symbol ⁇ > means an expected value.
  • The performance analysis unit 30 repeats the calculations of Equations (10) and (11) for all possible combinations of pitch n and finger number k for one time t on the time axis.
  • The performance analysis unit 30 adopts the result of Equation (10) at the point when the calculations of Equations (10) and (11) have been repeated a predetermined number of times as the probability distribution γ t,n,k of the latent variable w t,n,k.
  • a probability distribution ⁇ t,n,k is calculated for each time t on the time axis.
  • The performance analysis unit 30 of the fifth embodiment uses an HMM (Hidden Markov Model) to which the probability distribution γ t,n,k is applied to generate a time series of combinations of pitch n and finger number k (that is, a time series of fingering data Q).
  • The HMM for fingering estimation consists of a latent state corresponding to silence (no key depression) of the pitch n and a plurality of latent states corresponding to sounding of the pitch n with different finger numbers k. Only three types of state transitions are permitted: (1) self-transitions, (2) silence → an arbitrary finger number k, and (3) an arbitrary finger number k → silence; the probability of every other transition is set to "0". These conditions are constraints for keeping the finger number k unchanged during the period in which one note is sounded. Also, the expected value of the probability distribution γ t,n,k calculated by Equations (10) and (11) is set as the observation probability for each latent state of the HMM.
  • the performance analysis unit 30 uses the HMM described above to estimate the state series by dynamic programming such as the Viterbi algorithm. The performance analysis unit 30 generates a time series of fingering data Q according to the result of estimating the state series.
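  • A compact sketch of this constrained Viterbi decoding for a single pitch n is shown below; state 0 stands for silence and states 1..K for finger numbers, the observation probabilities stand in for the expected values of γ t,n,k, and all numerical values are illustrative only.

```python
import numpy as np

def viterbi_fingering(obs: np.ndarray) -> list:
    """Decode the most likely state sequence for one pitch.

    obs: array of shape (T, K+1); obs[t, 0] is the observation probability of
    silence and obs[t, k] (k >= 1) that of sounding with finger number k.
    Allowed transitions: self-transition, silence -> any finger, any finger -> silence.
    """
    T, S = obs.shape
    trans = np.zeros((S, S))
    trans[0, :] = 1.0             # silence -> silence or any finger
    trans[1:, 0] = 1.0            # any finger -> silence
    np.fill_diagonal(trans, 1.0)  # self-transitions
    trans /= trans.sum(axis=1, keepdims=True)

    log_obs = np.log(obs + 1e-12)
    log_trans = np.log(trans + 1e-12)
    delta = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    delta[0] = log_obs[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_obs[t]

    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]  # 0 = silence, k >= 1 = finger number k

gamma = np.array([[0.90, 0.05, 0.05],
                  [0.20, 0.70, 0.10],
                  [0.10, 0.80, 0.10],
                  [0.85, 0.10, 0.05]])
print(viterbi_fingering(gamma))  # -> [0, 1, 1, 0]
```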
  • the fingering data Q is generated using the acoustic signal V and the image data D1. That is, fingering data Q can be generated even in situations where performance data P cannot be obtained.
  • Further, since the pitch n and the finger number k are estimated simultaneously using the acoustic signal V and the image data D1, the fingering can be estimated with high accuracy and with a reduced processing load compared to a configuration in which the pitch n and the finger number k are estimated individually.
  • The fifth embodiment can also be applied to the second to fourth embodiments.
  • the projective transformation unit 314 generates a transformed image from the performance image G1. That is, the projective transformation unit 314 changes the photographing conditions of the performance image G1.
  • the sixth embodiment is an image processing system 700 that uses the above functions of changing the shooting conditions of the performance image G1.
  • the performance analysis system 100 of the first to fifth embodiments can also be expressed as an image processing system 700 when focusing on the processing of the performance image G1 by the projective transformation unit 314.
  • FIG. 22 is a block diagram illustrating the functional configuration of an image processing system 700 according to the sixth embodiment.
  • the image processing system 700 includes a control device 11, a storage device 12, an operation device 13, a display device 14, and a photographing device 15, like the performance analysis system 100 of the first embodiment.
  • the imaging device 15 generates a time series of image data D1 representing the performance image G1 by imaging the keyboard instrument 200 under specific imaging conditions.
  • the storage device 12 stores a plurality of reference data Dref.
  • Each of the plurality of reference data Dref represents a reference image Gref photographing a reference musical instrument, which is a keyboard of a standard keyboard musical instrument.
  • the photographing conditions of the reference instrument differ for each reference image Gref (for each reference data Dref). Specifically, for example, one or more conditions out of the shooting range or shooting direction differ for each reference image Gref.
  • the storage device 12 also stores auxiliary data A for each reference data Dref.
  • the control device 11 implements the matrix generation unit 312, the projective transformation unit 314, and the display control unit 40 by executing the programs stored in the storage device 12.
  • the matrix generator 312 selectively uses one of the plurality of reference data Dref to generate the transformation matrix W.
  • The projective transformation unit 314 generates the image data D3 of a transformed image G3 from the image data D1 of the performance image G1 by projective transformation using the transformation matrix W.
  • the display control unit 40 causes the display device 14 to display the converted image G3 represented by the image data D3.
  • FIG. 23 is a flowchart illustrating a specific procedure of processing (hereinafter referred to as "first image processing") executed by the control device 11 of the sixth embodiment.
  • first image processing is started with an instruction from the user to the operation device 13 as a trigger.
  • the control device 11 determines whether or not selection of imaging conditions has been received from the user (S31).
  • If the selection of imaging conditions is accepted (S31: YES), the control device 11 acquires, from among the plural reference data Dref stored in the storage device 12, the reference data Dref corresponding to the imaging conditions selected by the user (hereinafter referred to as "selected reference data Dref") (S32).
  • the user's selection of imaging conditions corresponds to the operation of selecting one of a plurality of reference images Gref (reference data Dref) corresponding to different imaging conditions.
  • The control device 11 uses the selected reference data Dref to execute the same matrix generation process as in the first embodiment (S33). Specifically, the control device 11 sets the initial matrix W0 by the initial setting process Sc1 using the selected reference data Dref. Further, the control device 11 generates the transformation matrix W through the matrix update process Sc2 that iteratively updates the initial matrix W0 so that the keyboard image g1 of the performance image G1 approaches the reference image Gref of the selected reference data Dref. On the other hand, if the selection of imaging conditions is not accepted (S31: NO), the selection of the reference data Dref (S32) and the matrix generation process (S33) are not executed.
  • the control device 11 (projective transformation unit 314) generates a transformed image G3 by performing projective transformation processing using the transformation matrix W on the performance image G1 (S34).
  • The projective transformation processing is the same as in the first embodiment; image data D3 representing the transformed image G3 is generated.
  • a converted image G3 corresponding to the same photographing conditions as the reference image Gref of the selected reference data Dref is generated from the performance image G1. That is, the converted image G3 is an image obtained by converting the photographing conditions of the performance image G1 into photographing conditions equivalent to those of the reference image Gref.
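  • As a concrete illustration of this projective transformation step, the following is a sketch assuming OpenCV and a transformation matrix W already produced by the matrix generation process; the identity matrix and the dummy frame are stand-ins for illustration only.

```python
import cv2
import numpy as np

# W: 3x3 transformation matrix produced by the matrix generation process (S33);
# the identity matrix is used here as a stand-in.
W = np.eye(3, dtype=np.float64)

# Performance image G1 (a dummy frame here; in practice the frame supplied by
# the photographing device 15 would be used).
performance_image = np.zeros((480, 640, 3), dtype=np.uint8)
h, w = performance_image.shape[:2]

# Projective transformation process (S34): warp G1 toward the shooting
# conditions of the selected reference image Gref, yielding the converted image G3.
converted_image = cv2.warpPerspective(performance_image, W, (w, h))
```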
  • the converted image G3 corresponding to the shooting conditions selected by the user is generated.
  • the control device 11 causes the display device 14 to display the transformed image G3 generated by the projective transformation process (S35).
  • the control device 11 determines whether or not the termination condition is satisfied (S36). For example, when the user instructs to end the first image processing by operating the operation device 13, the control device 11 determines that the end condition is met. If the termination condition is not satisfied (S36: NO), the control device 11 shifts the process to step S31. That is, the conversion matrix W is generated (S32-S33) and the conversion image G3 is generated and displayed (S34-S35) on the condition that the selection of the photographing conditions is accepted (S31: YES). On the other hand, if the termination condition is satisfied (S36: YES), the control device 11 terminates the first image processing.
  • As described above, in the sixth embodiment, the transformation matrix W is generated so that the keyboard image g1 in the performance image G1 approaches the reference image Gref, and projective transformation processing using the transformation matrix W is executed on the performance image G1. Therefore, the performance image G1 of the keyboard instrument 200 played by the user can be converted into a converted image G3 corresponding to the photographing conditions of the reference musical instrument in the reference image Gref.
  • any one of a plurality of reference data Dref with different imaging conditions is selectively used for matrix generation processing. Therefore, a converted image G3 corresponding to various shooting conditions can be generated from the performance image G1 shot under specific shooting conditions.
  • Further, since the reference data Dref corresponding to the imaging conditions selected by the user among the plural reference data Dref is used for the matrix generation process, a converted image G3 corresponding to the imaging conditions desired by the user can be generated. By changing the photographing conditions of the performance image G1 in this way, it is possible to generate a converted image G3 that can be used for various purposes.
  • For example, a plurality of converted images G3 with uniform photographing conditions can be generated and used as teaching material for music lessons.
  • the image extractor 311 extracts the specific region B including the keyboard image g1 and the finger image g2 from the performance image G1.
  • the seventh embodiment is an image processing system 700 that utilizes the above functions of extracting the specific area B of the performance image G1.
  • the performance analysis system 100 of the first to fifth embodiments is also expressed as an image processing system 700 when focusing on the processing of the performance image G1 by the image extracting section 311.
  • FIG. 24 is a block diagram illustrating the functional configuration of an image processing system 700 according to the seventh embodiment.
  • the image processing system 700 includes a control device 11, a storage device 12, an operation device 13, a display device 14, and a photographing device 15, like the performance analysis system 100 of the first embodiment.
  • the imaging device 15 generates a time series of image data D1 representing the performance image G1 by imaging the keyboard instrument 200 under specific imaging conditions.
  • the performance image G1 includes a keyboard image g1 and a finger image g2, as in the above-described forms.
  • the control device 11 functions as an image extractor 311 and a display controller 40 by executing programs stored in the storage device 12 .
  • The image extraction unit 311 generates the image data D2 representing a performance image G2 obtained by extracting a partial region from the performance image G1. Specifically, as in the first embodiment, the image extraction unit 311 executes the area estimation process Sb1, which generates the image processing mask M, and the area extraction process Sb2, which applies the image processing mask M to the performance image G1.
  • the display control unit 40 causes the display device 14 to display the performance image G2 represented by the image data D2.
  • Whereas a single estimation model 51 was illustrated in the first embodiment, the estimation model 51 used in the area estimation process Sb1 of the seventh embodiment includes a first model 511 and a second model 512.
  • Each of the first model 511 and the second model 512 is composed of a deep neural network such as a convolutional neural network or a recurrent neural network.
  • the first model 511 is a statistical model for generating the first mask representing the first region of the performance image G1.
  • the first area is an area including the keyboard image g1 in the performance image G1.
  • the finger image g2 is not included in the first area.
  • the first mask is, for example, a binary mask in which each element in the first area is set to the numerical value "1" and each element in the area other than the first area is set to the numerical value "0".
  • the image extraction unit 311 generates the first mask by inputting the image data D1 representing the performance image G1 to the first model 511.
  • That is, the first model 511 is a trained model that has learned the relationship between the image data D1 and the first mask (first region) by machine learning.
  • the second model 512 is a statistical model for generating a second mask representing the second area of the performance image G1.
  • the second area is an area including the finger image g2 in the performance image G1.
  • the keyboard image g1 is not included in the second area.
  • the second mask is, for example, a binary mask in which each element in the second area is set to the numerical value "1" and each element in the area other than the second area is set to the numerical value "0".
  • the image extraction unit 311 generates a second mask by inputting the image data D1 representing the performance image G1 to the second model 512.
  • That is, the second model 512 is a trained model that has learned the relationship between the image data D1 and the second mask (second region) by machine learning.
  • FIG. 25 is a flowchart illustrating a specific procedure of processing (hereinafter referred to as "second image processing") executed by the control device 11 of the seventh embodiment.
  • the second image processing is started with an instruction from the user to the operation device 13 as a trigger.
  • the control device 11 executes the region estimation processing Sb1 (S41-S43).
  • the area estimation process Sb1 of the seventh embodiment includes a first estimation process (S41), a second estimation process (S42), and an area combining process (S43).
  • the first estimation process is a process of estimating the first area of the performance image G1. Specifically, the control device 11 inputs the image data D1 representing the performance image G1 to the first model 511 to generate the first mask representing the first region (S41).
  • The second estimation process is a process of estimating the second area of the performance image G1. Specifically, the control device 11 inputs the image data D1 representing the performance image G1 to the second model 512 to generate a second mask representing the second region (S42).
  • the area synthesizing process is a process of generating an image processing mask M representing the specific area B including the first area and the second area.
  • the specific area B represented by the image processing mask M corresponds to the sum of the first area and the second area. That is, the control device 11 generates the image processing mask M by synthesizing the first mask and the second mask (S43).
  • As in the first embodiment, the image processing mask M is a binary mask for extracting the specific region B containing the keyboard image g1 and the finger image g2 from the performance image G1.
  • the control device 11 uses the image processing mask M generated in the area estimation process Sb1 to execute the area extraction process Sb2 similar to that of the first embodiment (S44). That is, the control device 11 extracts the specific area B from the performance image G1 represented by the image data D1 using the image processing mask M, thereby generating the image data D2 representing the performance image G2.
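  • A minimal sketch of the region synthesis (S43) and region extraction (S44) steps follows, assuming the two masks are binary NumPy arrays already produced by the first and second models; the function names and dummy data are illustrative only.

```python
import numpy as np

def synthesize_mask(first_mask: np.ndarray, second_mask: np.ndarray) -> np.ndarray:
    """Region synthesis (S43): the specific region B is the union of the
    first region (keyboard image g1) and the second region (finger image g2)."""
    return np.logical_or(first_mask, second_mask).astype(np.uint8)

def extract_region(performance_image: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Region extraction (S44): keep only pixels inside the specific region B."""
    return performance_image * mask[..., np.newaxis]

# Example with dummy data: a 4x4 image and two overlapping binary masks
image = np.full((4, 4, 3), 255, dtype=np.uint8)
m1 = np.zeros((4, 4), dtype=np.uint8); m1[:, :2] = 1    # "keyboard" region
m2 = np.zeros((4, 4), dtype=np.uint8); m2[1:3, 1:3] = 1  # "finger" region
g2_image = extract_region(image, synthesize_mask(m1, m2))
```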
  • the control device 11 causes the display device 14 to display the performance image G2 generated by the region extraction processing Sb2 (S45).
  • the control device 11 determines whether or not the termination condition is satisfied (S46). For example, when the user instructs to end the second image processing by operating the operation device 13, the control device 11 determines that the end condition is satisfied. If the termination condition is not satisfied (S46: NO), the control device 11 shifts the process to step S41. That is, the area estimation process Sb1 (S41 to S43), the area extraction process Sb2 (S44), and the display of the performance image G2 (S45) are executed. On the other hand, if the termination condition is satisfied (S46: YES), the control device 11 terminates the second image processing.
  • a specific area B including the keyboard image g1 is extracted from the performance image G1. Therefore, it is possible to improve the convenience of the performance image G1.
  • a specific area B including the keyboard image g1 and the finger image g2 is extracted from the performance image G1. Therefore, it is possible to generate a performance image G2 in which the appearance of the keyboard 22 of the keyboard instrument 200 and the appearance of the user's fingers can be efficiently visually recognized.
  • In the seventh embodiment, the first region of the performance image G1 including the keyboard image g1 is estimated by the first model 511, and the second region of the performance image G1 including the finger image g2 is estimated by the second model 512. Therefore, the specific region B including the keyboard image g1 and the finger image g2 can be extracted with higher precision than in a configuration in which a single estimation model 51 collectively extracts both the keyboard image g1 and the finger image g2. Also, since the first model 511 and the second model 512 are each established by individual machine learning, the processing load related to the machine learning of the first model 511 and the second model 512 is reduced.
  • The operation of the image extraction unit 311 may be switchable between a first mode and a second mode. The first mode is an operation mode for extracting both the keyboard image g1 and the finger image g2 from the performance image G1. In the first mode, the image extraction unit 311 executes both the first estimation process and the second estimation process, so that an image processing mask M representing the specific region B is generated as in the seventh embodiment. That is, in the first mode, a specific region B including both the keyboard image g1 and the finger image g2 is extracted from the performance image G1.
  • the second mode is an operation mode for extracting the keyboard image g1 from the performance image G1. That is, in the second mode, the image extraction unit 311 executes the first estimation process but does not execute the second estimation process. That is, the first mask generated by the first estimation process is determined as the image processing mask M applied to the area extraction process Sb2. Therefore, in the second mode, the keyboard image g1 is extracted from the performance image G1.
  • A form is also assumed in which, in the second mode, the image extraction unit 311 executes the second estimation process without executing the first estimation process. In that form, the finger image g2 is extracted from the performance image G1. As understood from the above description, the second mode is expressed as an operation mode in which one of the first estimation process and the second estimation process is executed. (A sketch of the mode switch is shown below.)
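  • The following sketch shows one way the mode switch could be organized; the Mode enum and the callables `first_model` and `second_model` are hypothetical stand-ins for the trained models 511 and 512, not the embodiment's actual interface.

```python
from enum import Enum, auto
import numpy as np

class Mode(Enum):
    FIRST = auto()   # extract both the keyboard image g1 and the finger image g2
    SECOND = auto()  # extract only one of the two images

def build_mask(image: np.ndarray, mode: Mode, first_model, second_model) -> np.ndarray:
    """Run the first and/or second estimation process depending on the mode."""
    if mode is Mode.FIRST:
        m1 = first_model(image)   # first estimation process (keyboard region)
        m2 = second_model(image)  # second estimation process (finger region)
        return np.logical_or(m1, m2).astype(np.uint8)  # mask M for the specific region B
    # second mode: only one estimation process is executed
    return np.asarray(first_model(image), dtype=np.uint8)
```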
  • In each of the above embodiments, the matrix generation process is executed on the performance image G2 obtained by the image extraction process (FIG. 8), but the matrix generation process may be executed on the performance image G1. That is, the image extraction process (image extraction unit 311) for generating the performance image G2 from the performance image G1 may be omitted.
  • Although finger position estimation processing using the performance image G1 has been exemplified in each of the above embodiments, the finger position estimation processing may be executed using the performance image G2 obtained by the image extraction process. That is, the position C[h,f] of each finger of the user may be estimated by analyzing the performance image G2. Further, in each of the above embodiments the projective transformation process is performed on the performance image G1, but the projective transformation process may be performed on the performance image G2 obtained by the image extraction process. That is, a transformed image may be generated by projective transformation of the performance image G2.
  • In each of the above embodiments, the position c[h,f] of each finger of the user is transformed into the position C[h,f] in the XY coordinate system by the projective transformation process, but finger position data F representing the untransformed position c[h,f] may be generated instead. That is, the projective transformation process (projective transformation unit 314) for transforming the position c[h,f] into the position C[h,f] may be omitted.
  • In each of the above embodiments, the transformation matrix W generated immediately after the start of the performance analysis process is used continuously in the subsequent processing, but the transformation matrix W may be updated at appropriate points during the execution of the performance analysis process.
  • For example, when a change in the position of the photographing device 15 (hereinafter referred to as a "position change") occurs, the transformation matrix W is updated.
  • Specifically, the matrix generation unit 312 generates a transformation matrix Ω that represents the position change (displacement) of the photographing device 15.
  • The matrix generation unit 312 generates the transformation matrix Ω so that the coordinate x′/λ calculated by Equation (12) from the x-coordinate of a specific point after the position change approaches or matches the x-coordinate of the corresponding point in the performance image G before the position change, and so that the coordinate y′/λ calculated by Equation (12) from the y-coordinate of the specific point after the position change approaches or matches the y-coordinate of the corresponding point in the performance image G before the position change.
  • The matrix generation unit 312 generates the product WΩ of the transformation matrix W before the position change and the transformation matrix Ω representing the position change as the initial matrix W0, and updates the initial matrix W0 by the matrix update process Sc2 to generate the transformation matrix W.
  • the transformation matrix W after the position change is generated using the transformation matrix W calculated before the position change and the transformation matrix ⁇ representing the position change. Therefore, it is possible to generate a transformation matrix W that can specify the position C[h, f] of each finger with high accuracy while reducing the load of the matrix generation process.
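  • A small sketch of this update follows, assuming all matrices are 3x3 homographies in homogeneous coordinates; the helper that applies a homography corresponds to the role of Equation (12), and the concrete matrices are stand-in values only.

```python
import numpy as np

def apply_homography(H: np.ndarray, x: float, y: float) -> tuple:
    """Map a point with a 3x3 homography; returns (x'/lambda, y'/lambda)."""
    xp, yp, lam = H @ np.array([x, y, 1.0])
    return xp / lam, yp / lam

def updated_initial_matrix(W_before: np.ndarray, omega: np.ndarray) -> np.ndarray:
    """Initial matrix W0 after a position change: the product of the transformation
    matrix W computed before the change and the matrix Omega representing the change."""
    return W_before @ omega

# Example with stand-in matrices (Omega here is a small translation)
W_before = np.eye(3)
omega = np.array([[1.0, 0.0, 12.0], [0.0, 1.0, -7.0], [0.0, 0.0, 1.0]])
W0 = updated_initial_matrix(W_before, omega)
print(apply_homography(W0, 100.0, 50.0))  # -> (112.0, 43.0)
```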
  • The above modification may also be applied to the first to fifth embodiments.
  • In each of the above embodiments, the keyboard instrument 200 including the keyboard 22 is illustrated, but the present disclosure can be applied to any type of musical instrument. Each of the above aspects applies similarly to any musical instrument that the user operates manually, such as a stringed instrument, a wind instrument, or a percussion instrument. A typical example of such a musical instrument is one played by the user with the fingers of one hand or both hands.
  • The performance analysis system 100 may be realized by a server device that communicates with an information device such as a smartphone or a tablet terminal. For example, performance data P generated by a keyboard instrument 200 connected to the information device and image data D1 generated by a photographing device 15 mounted on or connected to the information device are transmitted from the information device to the performance analysis system 100.
  • the performance analysis system 100 generates fingering data Q by executing performance analysis processing on performance data P and image data D1 received from the information device, and transmits the fingering data Q to the information device.
  • the image processing system 700 exemplified in the sixth or seventh embodiment may also be realized by a server device that communicates with the information device.
  • As described above, the functions of the performance analysis system 100 according to the first to fifth embodiments or of the image processing system 700 according to the sixth and seventh embodiments are realized by cooperation between the single or plural processors constituting the control device 11 and the programs stored in the storage device 12.
  • a program according to the present disclosure may be provided in a form stored in a computer-readable recording medium and installed in a computer.
  • The recording medium is, for example, a non-transitory recording medium, and an optical recording medium (optical disc) such as a CD-ROM is a good example, but recording media of any other known form are also included. The non-transitory recording medium includes any recording medium other than a transitory propagating signal, and does not exclude volatile recording media.
  • Also, in a form in which a distribution device delivers the program, the storage device that stores the program in the distribution device corresponds to the above-described non-transitory recording medium.
  • An image processing method according to one aspect of the present disclosure estimates, in a performance image including an image of a musical instrument and images of a plurality of fingers of a user playing the musical instrument, a specific region including the image of the musical instrument, and extracts the specific region from the performance image.
  • the specific region including the image of the musical instrument is extracted from the performance image including the image of the musical instrument and the images of a plurality of fingers of the user. Therefore, it is possible to improve the convenience of performance images.
  • the specific area is an area including an image of the musical instrument and an image of at least a part of the user's body.
  • the specific region including the image of the musical instrument and the image of the user's body is extracted. Therefore, it is possible to generate an image in which the appearance of the musical instrument and the appearance of the user's body can be efficiently visually recognized.
  • In the estimation of the specific region, an image processing mask representing the specific region is generated by inputting image data representing the performance image into a machine-learned estimation model.
  • the specific region is extracted by applying the image processing mask to the performance image.
  • the image processing mask representing the specific region is generated by inputting the image data of the performance image into the machine-learned estimation model. Therefore, the specific region can be specified with high precision for various unknown performance images.
  • In one aspect, the estimation model includes a first model and a second model, and the estimation of the specific region includes a first estimation process of estimating a first region including the image of the musical instrument in the performance image by inputting image data representing the performance image into the first model, a second estimation process of estimating a second region including the images of the user's fingers in the performance image by inputting image data representing the performance image into the second model, and a region synthesizing process of generating the image processing mask representing the specific region including the first region and the second region.
  • In the above aspect, the first region of the performance image including the image of the musical instrument is estimated by the first model, and the second region of the performance image including the image of the user is estimated by the second model.
  • In one aspect, the operation can be switched between a first mode in which both the first estimation process and the second estimation process are performed and a second mode in which one of the first estimation process and the second estimation process is performed.
  • In the first mode, a specific region including the image of the musical instrument and the image of the user is extracted from the performance image. In the second mode, a specific region including one of the image of the musical instrument and the image of the user is extracted from the performance image. As described above, the extraction target from the performance image can be switched easily.
  • An image processing system according to one aspect of the present disclosure comprises a region estimation unit that estimates, in a performance image including an image of a musical instrument and images of a plurality of fingers of a user playing the musical instrument, a specific region including the image of the musical instrument, and a region extraction unit that extracts the specific region from the performance image.
  • A program according to one aspect (aspect 7) of the present disclosure causes a computer system to function as a region estimation unit that estimates, in a performance image including an image of a musical instrument and images of a plurality of fingers of a user playing the musical instrument, a specific region including the image of the musical instrument, and as a region extraction unit that extracts the specific region from the performance image.
  • Reference signs: 100 Performance analysis system, 11 Control device, 12 Storage device, 13 Operation device, 14 Display device, 15 Photographing device, 200 Keyboard instrument, 21 Key, 22 Keyboard, 30 Performance analysis unit, 31 Finger position data generation unit, 311 Image extraction unit, 312 Matrix generation unit, 313 Finger position estimation unit, 314 Projective transformation unit, 32 Fingering data generation unit, 321 Probability calculation unit, 322 Fingering estimation unit, 323 Control data generation unit, 40 Display control unit, 51 Estimation model, 51a Temporary model, 52[k] Estimation model, 700 Image processing system

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)
  • Auxiliary Devices For Music (AREA)
PCT/JP2022/009830 2021-03-25 2022-03-07 画像処理方法、画像処理システムおよびプログラム WO2022202266A1 (ja)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202280022994.XA CN117043818A (zh) 2021-03-25 2022-03-07 图像处理方法、图像处理系统及程序

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2021-051181 2021-03-25
JP2021051181A JP2022149159A (ja) 2021-03-25 2021-03-25 画像処理方法、画像処理システムおよびプログラム

Publications (1)

Publication Number Publication Date
WO2022202266A1 true WO2022202266A1 (ja) 2022-09-29

Family

ID=83397016

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2022/009830 WO2022202266A1 (ja) 2021-03-25 2022-03-07 画像処理方法、画像処理システムおよびプログラム

Country Status (3)

Country Link
JP (1) JP2022149159A (zh)
CN (1) CN117043818A (zh)
WO (1) WO2022202266A1 (zh)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2020046500A (ja) * 2018-09-18 2020-03-26 ソニー株式会社 Information processing device, information processing method, and information processing program

Also Published As

Publication number Publication date
CN117043818A (zh) 2023-11-10
JP2022149159A (ja) 2022-10-06

Similar Documents

Publication Publication Date Title
US11557269B2 (en) Information processing method
US11967302B2 (en) Information processing device for musical score data
EP3759707B1 (en) A method and system for musical synthesis using hand-drawn patterns/text on digital and non-digital surfaces
CN113421547B (zh) 一种语音处理方法及相关设备
WO2020059245A1 (ja) 情報処理装置、情報処理方法および情報処理プログラム
JP2021043258A (ja) 制御システム、及び制御方法
US20220414472A1 (en) Computer-Implemented Method, System, and Non-Transitory Computer-Readable Storage Medium for Inferring Audience's Evaluation of Performance Data
JP2022115956A (ja) 情報処理方法、情報処理装置およびプログラム
WO2022202266A1 (ja) 画像処理方法、画像処理システムおよびプログラム
WO2022202265A1 (ja) 画像処理方法、画像処理システムおよびプログラム
WO2022202264A1 (ja) 演奏解析方法、演奏解析システムおよびプログラム
US20230230493A1 (en) Information Processing Method, Information Processing System, and Recording Medium
WO2022202267A1 (ja) 情報処理方法、情報処理システムおよびプログラム
JP7152908B2 (ja) 仕草制御装置及び仕草制御プログラム
CN115437598A (zh) 虚拟乐器的互动处理方法、装置及电子设备
Moryossef et al. At your fingertips: Automatic piano fingering detection
WO2023032422A1 (ja) 処理方法、プログラムおよび処理装置
WO2023243293A1 (ja) 演奏モーション推定方法および演奏モーション推定装置
WO2023181570A1 (ja) 情報処理方法、情報処理システムおよびプログラム
WO2023053632A1 (ja) 情報処理装置、情報処理方法、及びプログラム
CN113657185A (zh) 一种钢琴练习智能辅助方法、装置及介质
CN116343820A (zh) 音频处理方法、装置、设备和存储介质
JP2024030802A (ja) モデル学習装置、モデル学習方法、及びモデル学習プログラム。
CN116324932A (zh) 信息处理方法及信息处理系统
CN115687668A (zh) 音乐文件的生成方法、生成装置、电子设备和存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22775060

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 202280022994.X

Country of ref document: CN

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 22775060

Country of ref document: EP

Kind code of ref document: A1