WO2022202266A1 - Image processing method, image processing system, and program - Google Patents

Image processing method, image processing system, and program

Info

Publication number
WO2022202266A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
performance
finger
data
estimation
Application number
PCT/JP2022/009830
Other languages
French (fr)
Japanese (ja)
Inventor
陽 前澤
Original Assignee
ヤマハ株式会社
Application filed by ヤマハ株式会社
Priority to CN202280022994.XA priority Critical patent/CN117043818A/en
Publication of WO2022202266A1 publication Critical patent/WO2022202266A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/11Region-based segmentation
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10GREPRESENTATION OF MUSIC; RECORDING MUSIC IN NOTATION FORM; ACCESSORIES FOR MUSIC OR MUSICAL INSTRUMENTS NOT OTHERWISE PROVIDED FOR, e.g. SUPPORTS
    • G10G3/00Recording music in notation form, e.g. recording the mechanical operation of a musical instrument
    • G10G3/04Recording music in notation form, e.g. recording the mechanical operation of a musical instrument using electrical means

Definitions

  • the present disclosure relates to technology for analyzing performances by users.
  • Patent Literature 1 discloses a technique for detecting an object using a deep neural network.
  • one aspect of the present disclosure aims to improve the usability of a performance image.
  • an image processing method according to one aspect of the present disclosure estimates, in a performance image that includes an image of a musical instrument and images of a plurality of fingers of a user playing the musical instrument, a specific area containing the image of the musical instrument, and extracts the specific area from the performance image.
  • an image processing system according to one aspect of the present disclosure includes a region estimation unit for estimating, in a performance image that includes an image of a musical instrument and images of a plurality of fingers of a user playing the musical instrument, a specific region containing the image of the musical instrument, and a region extraction unit for extracting the specific region from the performance image.
  • a program according to one aspect of the present disclosure causes a computer to function as a region estimation unit for estimating, in a performance image that includes an image of a musical instrument and images of a plurality of fingers of a user playing the musical instrument, a specific region containing the image of the musical instrument, and as a region extraction unit for extracting the specific region from the performance image.
  • FIG. 1 is a block diagram illustrating the configuration of a performance analysis system according to a first embodiment.
  • FIG. 2 is a schematic diagram of a performance image.
  • FIG. 3 is a block diagram illustrating the functional configuration of the performance analysis system.
  • FIG. 4 is a schematic diagram of an analysis screen.
  • FIG. 5 is a flowchart of finger position estimation processing.
  • FIG. 6 is a flowchart of left/right determination processing.
  • FIG. 7 is an explanatory diagram of image extraction processing.
  • FIG. 8 is a flowchart of image extraction processing.
  • FIG. 9 is an explanatory diagram of machine learning that establishes an estimation model.
  • FIG. 10 is a schematic diagram of a reference image.
  • FIG. 11 is a flowchart of matrix generation processing.
  • FIG. 12 is a flowchart of initial setting processing.
  • FIG. 13 is a schematic diagram of a setting screen.
  • FIG. 14 is a flowchart of performance analysis processing.
  • FIG. 15 is an explanatory diagram relating to the problem of fingering estimation.
  • FIG. 16 is a block diagram illustrating the configuration of a performance analysis system in a second embodiment.
  • FIG. 17 is a schematic diagram of control data in the second embodiment.
  • FIG. 18 is a flowchart of performance analysis processing in the second embodiment.
  • FIG. 19 is a flowchart of performance analysis processing in a third embodiment.
  • FIG. 20 is a flowchart of initial setting processing in a fourth embodiment.
  • FIG. 21 is a block diagram illustrating the configuration of a performance analysis system in a fifth embodiment.
  • FIG. 22 is a block diagram illustrating the functional configuration of an image processing system according to a sixth embodiment.
  • FIG. 23 is a flowchart of first image processing in the sixth embodiment.
  • FIG. 24 is a block diagram illustrating the functional configuration of an image processing system according to a seventh embodiment.
  • FIG. 25 is a flowchart of second image processing in the seventh embodiment.
  • FIG. 1 is a block diagram illustrating the configuration of a performance analysis system 100 according to the first embodiment.
  • a keyboard instrument 200 is connected to the performance analysis system 100 by wire or wirelessly.
  • a user (that is, a performer) plays the keyboard instrument 200.
  • the keyboard instrument 200 supplies performance data P representing a performance by the user to the performance analysis system 100 .
  • the performance data P is time-series data specifying the pitch n of each of a plurality of notes played in sequence by the user.
  • the performance data P is data in a format conforming to the MIDI (Musical Instrument Digital Interface) standard, for example.
  • the performance analysis system 100 is a computer system that analyzes the user's performance of the keyboard instrument 200. Specifically, the performance analysis system 100 analyzes the user's fingering, that is, the manner in which the user uses the fingers of the left and right hands in playing the keyboard instrument 200. In other words, information as to which finger the user uses to operate each key 21 of the keyboard instrument 200 is analyzed as the user's fingering.
  • the performance analysis system 100 includes a control device 11, a storage device 12, an operation device 13, a display device 14, and a photographing device 15.
  • the performance analysis system 100 is realized by, for example, a portable information device such as a smart phone or a tablet terminal, or a portable or stationary information device such as a personal computer.
  • the performance analysis system 100 can be realized as a single device, or as a plurality of devices configured separately from each other. Also, the performance analysis system 100 may be installed in the keyboard instrument 200 .
  • the control device 11 is composed of one or more processors that control each element of the performance analysis system 100 .
  • the control device 11 is composed of one or more types of processors such as a CPU (Central Processing Unit), SPU (Sound Processing Unit), DSP (Digital Signal Processor), FPGA (Field Programmable Gate Array), or ASIC (Application Specific Integrated Circuit).
  • the storage device 12 is a single or multiple memories that store programs executed by the control device 11 and various data used by the control device 11 .
  • the storage device 12 is composed of a known recording medium such as a magnetic recording medium or a semiconductor recording medium, or a combination of a plurality of types of recording media.
  • a portable recording medium that can be attached to and detached from the performance analysis system 100, or a recording medium that the control device 11 can write to or read from via a communication network such as the Internet (for example, cloud storage), may also be used as the storage device 12.
  • the operation device 13 is an input device that receives instructions from the user.
  • the operation device 13 is, for example, an operator operated by a user or a touch panel that detects contact by the user.
  • An operating device 13 (for example, a mouse or a keyboard) separate from the performance analysis system 100 may be connected to the performance analysis system 100 by wire or wirelessly.
  • the display device 14 displays images under the control of the control device 11 .
  • various display panels such as a liquid crystal display panel or an organic EL (Electroluminescence) panel are used as the display device 14 .
  • the display device 14, which is separate from the performance analysis system 100, may be connected to the performance analysis system 100 by wire or wirelessly.
  • the photographing device 15 is an image input device that generates a time series of image data D1 by photographing a subject.
  • the time series of the image data D1 is moving image data representing moving images.
  • the photographing device 15 includes an optical system such as a photographing lens, an imaging device for receiving incident light from the optical system, and a processing circuit for generating image data D1 according to the amount of light received by the imaging device.
  • the photographing device 15, which is separate from the performance analysis system 100, may be connected to the performance analysis system 100 by wire or wirelessly.
  • the user adjusts the position or angle of the imaging device 15 with respect to the keyboard instrument 200 so that the imaging conditions recommended by the provider of the performance analysis system 100 are realized.
  • the photographing device 15 is installed above the keyboard instrument 200 and photographs the keyboard 22 of the keyboard instrument 200 and the user's left and right hands. Therefore, as illustrated in FIG. 2, the photographing device 15 generates a time series of image data D1 representing a performance image G1 that includes an image g1 of the keyboard 22 of the keyboard instrument 200 (hereinafter referred to as the "keyboard image") and images g2 of the user's left and right hands (hereinafter referred to as the "finger images"). That is, moving image data representing a moving image of the user playing the keyboard instrument 200 is generated in parallel with the performance.
  • the photographing condition by the photographing device 15 is, for example, the photographing range or the photographing direction.
  • the photographing range is the range (angle of view) photographed by the photographing device 15 .
  • the shooting direction is the direction of the shooting device 15 with respect to the keyboard instrument 200 .
  • FIG. 3 is a block diagram illustrating the functional configuration of the performance analysis system 100.
  • the control device 11 functions as a performance analysis section 30 and a display control section 40 by executing programs stored in the storage device 12 .
  • the performance analysis unit 30 analyzes the performance data P and the image data D1 to generate fingering data Q representing the user's fingering.
  • the fingering data Q designates with which of the user's fingers each of the plurality of keys 21 of the keyboard instrument 200 is operated.
  • the fingering data Q consists of a pitch n corresponding to the key 21 operated by the user and the number k of the finger used by the user to operate the key 21 (hereinafter referred to as "finger number").
  • a pitch n is, for example, a note number in the MIDI standard.
  • the finger number k is a number assigned to each finger of the user's left hand and right hand.
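  • as a purely illustrative aside (not part of the embodiment), the pairing of a pitch with a finger number described above can be sketched as a small data structure; the field names, and the assumed assignment of finger numbers 1-5 to the left hand and 6-10 to the right hand, are assumptions made for this example only.

```python
from dataclasses import dataclass

@dataclass
class FingeringRecord:
    """One entry of fingering data Q for a single unit period (illustrative field names)."""
    pitch: int   # pitch n as a MIDI note number (e.g. 60 = middle C)
    finger: int  # finger number k; here assumed 1-5 = left-hand thumb..little finger, 6-10 = right hand

def fingering_symbol(k: int) -> str:
    """Render a finger number k as the 'L1'..'R5' style symbol shown on the analysis screen."""
    hand = "L" if k <= 5 else "R"
    return f"{hand}{(k - 1) % 5 + 1}"

q = FingeringRecord(pitch=60, finger=7)   # right-hand index finger playing middle C
print(fingering_symbol(q.finger))         # -> "R2"
```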
  • the display control unit 40 causes the display device 14 to display various images.
  • the display control section 40 causes the display device 14 to display an image (hereinafter referred to as "analysis screen") 61 representing the result of analysis by the performance analysis section 30 .
  • FIG. 4 is a schematic diagram of the analysis screen 61.
  • the analysis screen 61 is an image in which a plurality of note images 611 are arranged on a coordinate plane on which a horizontal time axis and a vertical pitch axis are set.
  • a note image 611 is displayed for each note played by the user.
  • the position of the note image 611 in the direction of the pitch axis is set according to the pitch n of the note represented by the note image 611 .
  • the position and total length of the note image 611 in the direction of the time axis are set according to the sounding period of the note represented by the note image 611 .
  • for each note image 611, a symbol (hereinafter referred to as a "fingering symbol") 612 corresponding to the finger number k specified for that note by the fingering data Q is arranged.
  • the letter "L" in a fingering symbol 612 means the left hand, and the letter "R" means the right hand.
  • the number in a fingering symbol 612 identifies the finger. Specifically, the number "1" means the thumb, the number "2" means the index finger, the number "3" means the middle finger, the number "4" means the ring finger, and the number "5" means the little finger.
  • for example, the fingering symbol 612 "R2" refers to the index finger of the right hand, and the fingering symbol 612 "L4" refers to the ring finger of the left hand.
  • the note image 611 and the fingering symbol 612 are displayed in different modes (for example, hue or gradation) for the right hand and the left hand.
  • the display control unit 40 uses the fingering data Q to display the analysis screen 61 of FIG. 4 on the display device 14 .
  • for a note whose finger number k is set to an invalid value, the note image 611 is displayed in a manner different from the normal note image 611 (for example, with a dashed frame line), and a specific symbol such as "?" is displayed to indicate that the estimation result of the finger number k is invalid.
  • the performance analysis unit 30 includes a finger position data generation unit 31 and a fingering data generation unit 32.
  • the finger position data generator 31 generates finger position data F by analyzing the performance image G1.
  • the finger position data F is data representing the position of each finger of the user's left hand and the position of each finger of his right hand.
  • the fingering data generator 32 generates fingering data Q using the performance data P and the finger position data F.
  • finger position data F and fingering data Q are generated for each unit period on the time axis. Each unit period is a period (frame) of a predetermined length.
  • the finger position data generation unit 31 includes an image extraction unit 311 , a matrix generation unit 312 , a finger position estimation unit 313 and a projective transformation unit 314 .
  • the finger position estimation unit 313 estimates the positions c[h, f] of the fingers of the user's left hand and right hand by analyzing the performance image G1 represented by the image data D1.
  • the position c[h, f] of each finger is the position of each fingertip in the xy coordinate system set in the performance image G1.
  • the position c[h,f] is a combination (x[h,f], y[h,f]) of a coordinate x[h,f] on the x-axis and a coordinate y[h,f] on the y-axis.
  • the positive direction of the x-axis corresponds to the right direction of the keyboard 22 (the direction from low tones to high tones), and the negative direction of the x-axis corresponds to the left direction of the keyboard 22 (the direction from high tones to low tones).
  • the numerical value "1" of the variable h means the left hand
  • the numerical value "2" of the variable h means the right hand.
  • the number '1' for the variable f means the thumb
  • the number '2' means the index finger
  • the number '3' means the middle finger
  • the number '4' means the ring finger
  • the number '5' means the little finger.
  • FIG. 5 is a flowchart illustrating a specific procedure of the process of estimating the position of each finger of the user by the finger position estimation unit 313 (hereinafter referred to as "finger position estimation process").
  • the finger position estimation processing includes image analysis processing Sa1, left/right determination processing Sa2, and interpolation processing Sa3.
  • in the image analysis processing Sa1, the positions c[h,f] of the fingers of one of the user's left hand and right hand (hereinafter referred to as the "first hand") and of the other of the user's left hand and right hand (hereinafter referred to as the "second hand") are estimated by analyzing the performance image G1.
  • specifically, the finger position estimating unit 313 estimates the positions c[h,1] to c[h,5] of the fingers of the first hand and the positions c[h,1] to c[h,5] of the fingers of the second hand by image recognition processing that estimates the skeleton or joints of the user.
  • a known image recognition process such as MediaPipe or OpenPose is used for the image analysis process Sa1. If the fingertip is not detected from the performance image G1, the coordinate x[h,f] of the fingertip on the x-axis is set to an invalid value such as "0".
  • since the user's right arm and left arm may cross during a performance, it is not appropriate to determine whether a hand is the left hand or the right hand only from the coordinates x[h,f] of the positions c[h,f] estimated by the image analysis processing Sa1.
  • although it is also possible to estimate the user's left hand or right hand from the performance image G1 based on the coordinates of the user's shoulders and arms, doing so increases the processing load of the image analysis processing Sa1.
  • therefore, the finger position estimating unit 313 of the first embodiment executes the left/right determination processing Sa2 illustrated in FIG. 6. That is, the finger position estimation unit 313 determines, for each of the first hand and the second hand, whether the variable h of the positions c[h,f] of its fingers should be set to the numerical value "1" meaning the left hand or the numerical value "2" meaning the right hand.
  • the performance image G1 captured by the imaging device 15 is an image of the backs of both the left and right hands of the user.
  • therefore, for the left hand, the thumb position c[h,1] is located to the right of the little finger position c[h,5], whereas for the right hand, the thumb position c[h,1] is located to the left of the little finger position c[h,5].
  • in view of this tendency, in the left/right determination processing Sa2, the finger position estimating unit 313 determines that, of the first hand and the second hand, the hand whose thumb position c[h,1] is located to the right (in the positive direction of the x-axis) of its little finger position c[h,5] is the left hand, and the hand whose thumb position c[h,1] is located to the left (in the negative direction of the x-axis) of its little finger position c[h,5] is the right hand.
  • FIG. 6 is a flowchart illustrating a specific procedure of left/right determination processing Sa2.
  • the finger position estimation unit 313 first calculates a determination index δ[h] for each of the first hand and the second hand (Sa21).
  • the determination index δ[h] is calculated, for example, by Equation (1) below, in which the symbol μ[h] is the average value (for example, the simple average) of the coordinates x[h,1] to x[h,5] of the five fingers of each of the first hand and the second hand:
    δ[h] = μ[h] − x[h,1] ... (1)
  • the finger position estimating unit 313 determines that the hand having a negative determination index δ[h] among the first hand and the second hand is the left hand, and sets the variable h to the numerical value "1" (Sa22).
  • the finger position estimating unit 313 determines that the hand having a positive determination index δ[h] among the first hand and the second hand is the right hand, and sets the variable h to the numerical value "2" (Sa23). According to the left/right determination processing Sa2 described above, the positions c[h,f] of the fingers of the user can be distinguished between the right hand and the left hand by simple processing that uses the relationship between the position of the thumb and the position of the little finger.
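  • the following is a minimal sketch of the left/right determination processing Sa2 described above; the concrete form of the determination index δ[h] (here taken as μ[h] − x[h,1]) and all function names are assumptions made for illustration only.

```python
import numpy as np

def assign_left_right(x_tips_hand_a, x_tips_hand_b):
    """Decide which detected hand is the left hand and which is the right hand.

    Each argument is the x-coordinates of one hand's fingertips, ordered
    thumb..little finger (x[h,1]..x[h,5]) in the performance-image coordinate
    system, where +x points toward the high-pitch end of the keyboard.
    Returns a dict mapping 'left'/'right' to the corresponding array.
    """
    hands = {}
    for tips in (np.asarray(x_tips_hand_a, float), np.asarray(x_tips_hand_b, float)):
        mu = tips.mean()                  # mean fingertip coordinate μ[h]
        delta = mu - tips[0]              # assumed determination index δ[h] = μ[h] - x[h,1]
        # Backs of the hands are photographed from above: the left hand's thumb lies to the
        # right of its other fingers (δ < 0), the right hand's thumb to the left (δ > 0).
        hands['left' if delta < 0 else 'right'] = tips
    return hands

# Example: one hand with the thumb leftmost (right hand), one with the thumb rightmost (left hand).
print(assign_left_right([300, 320, 340, 360, 380],
                        [260, 240, 220, 200, 180]))
```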
  • the position c[h, f] of each finger of the user is estimated for each unit period by the image analysis processing Sa1 and the left/right determination processing Sa2.
  • the position c[h,f] may not be properly estimated due to various circumstances such as noise in the performance image G1. Therefore, when the position c[h,f] is missing in a specific unit period (hereinafter referred to as a "missing period"), the finger position estimation unit 313 calculates the position c[h,f] in the missing period from the positions c[h,f] in the unit periods before and after the missing period (interpolation processing Sa3).
  • for example, an interpolated value (such as the average) of the position c[h,f] in the unit period immediately before the missing period and the position c[h,f] in the unit period immediately after the missing period is calculated as the position c[h,f] in the missing period.
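  • a minimal sketch of the interpolation processing Sa3 described above might look as follows; it assumes that a missing position is replaced by the average of the valid positions in the immediately preceding and following unit periods, and that undetected positions are marked with None.

```python
def interpolate_missing(positions):
    """Fill single-frame gaps in a per-unit-period list of fingertip positions.

    positions: list of (x, y) tuples, or None for unit periods in which the
    fingertip was not detected (the "missing periods").
    """
    filled = list(positions)
    for i, p in enumerate(positions):
        if p is None and 0 < i < len(positions) - 1:
            prev, nxt = positions[i - 1], positions[i + 1]
            if prev is not None and nxt is not None:
                filled[i] = ((prev[0] + nxt[0]) / 2, (prev[1] + nxt[1]) / 2)
    return filled

print(interpolate_missing([(10, 5), None, (14, 9)]))  # -> [(10, 5), (12.0, 7.0), (14, 9)]
```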
  • the performance image G1 includes the keyboard image g1 and the finger image g2.
  • the image extraction unit 311 in FIG. 3 extracts a specific area (hereinafter referred to as the "specific area") B from the performance image G1, as illustrated in FIG. 7.
  • the specific area B is an area of the performance image G1 that includes the keyboard image g1 and the finger image g2.
  • the finger image g2 corresponds to an image of at least part of the user's body.
  • FIG. 8 is a flow chart illustrating a specific procedure of the process of extracting the specific area B from the performance image G1 by the image extraction unit 311 (hereinafter referred to as "image extraction process").
  • the image extraction processing includes region estimation processing Sb1 and region extraction processing Sb2.
  • the area estimation process Sb1 is a process of estimating a specific area B for the performance image G1 represented by the image data D1.
  • the image extraction unit 311 generates an image processing mask M representing the specific area B from the image data D1 by the area estimation process Sb1.
  • the image processing mask M is a mask having the same size as the performance image G1, and is composed of a plurality of elements corresponding to different pixels of the performance image G1.
  • specifically, the image processing mask M is a binary mask in which each element in the area corresponding to the specific area B of the performance image G1 is set to the numerical value "1", and each element in the area other than the specific area B is set to the numerical value "0".
  • an element (region estimation unit) for estimating the specific region B of the performance image G1 is implemented by the control device 11 executing the region estimation processing Sb1.
  • the estimation model 51 is used for generating the image processing mask M by the image extraction unit 311 . That is, the image extraction unit 311 generates the image processing mask M by inputting the image data D1 representing the performance image G1 to the estimation model 51.
  • the estimation model 51 is a statistical model that has learned the relationship between the image data D1 and the image processing mask M through machine learning.
  • the estimation model 51 is composed of, for example, a deep neural network (DNN), such as a convolutional neural network (CNN) or a recurrent neural network (RNN).
  • the estimation model 51 may also be configured by combining multiple types of deep neural networks, and additional elements such as long short-term memory (LSTM) units may be incorporated into the estimation model 51.
  • FIG. 9 is an explanatory diagram of machine learning that establishes the estimation model 51.
  • the estimated model 51 is established by machine learning by a machine learning system 900 separate from the performance analysis system 100 , and the estimated model 51 is provided to the performance analysis system 100 .
  • Machine learning system 900 is a server system capable of communicating with performance analysis system 100 via a communication network such as the Internet.
  • the estimation model 51 is transmitted from the machine learning system 900 to the performance analysis system 100 via the communication network.
  • a plurality of learning data T are used for machine learning of the estimation model 51.
  • Each of the plurality of learning data T is composed of a combination of learning image data Dt and learning image processing mask Mt.
  • the image data Dt represents a known image including a keyboard image g1 of the keyboard instrument and an image around the keyboard instrument.
  • the model of the keyboard instrument and shooting conditions (for example, shooting range and shooting direction) differ for each image data Dt. That is, image data Dt is prepared in advance by photographing each of a plurality of types of keyboard instruments under different photographing conditions. Note that the image data Dt may be prepared by a known image synthesizing technique.
  • the image processing mask Mt of each learning data T is a mask representing the specific region B in the known image represented by the image data Dt of the learning data T.
  • the machine learning system 900 computes an error function representing the error between the image processing mask M that an initial or provisional model (hereinafter referred to as the "provisional model") 51a outputs when the image data Dt of each learning data T is input, and the image processing mask Mt of that learning data T.
  • Machine learning system 900 then updates multiple variables of interim model 51a such that the error function is reduced.
  • the provisional model 51a at the time when the above processing has been repeated for each of the plurality of learning data T is determined as the estimation model 51. Therefore, the estimation model 51 outputs a statistically valid image processing mask M for unknown image data D1 under the latent relationship between the image data Dt and the image processing masks Mt in the plurality of learning data T. That is, the estimation model 51 is a trained model that has learned the relationship between the image data Dt and the image processing mask Mt.
  • the image processing mask M representing the specific region B is generated by inputting the image data D1 of the performance image G1 into the machine-learned estimation model 51. Therefore, the specific area B can be specified with high precision for various unknown performance images G1.
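  • as a rough illustration of the machine learning described above, the following PyTorch-style sketch updates a provisional segmentation model so that an error function between its predicted mask and the training mask Mt decreases; the tiny network architecture, the binary cross-entropy loss, and the optimizer are assumptions of this example, not details taken from the embodiment.

```python
import torch
import torch.nn as nn

# Provisional model 51a: a deliberately tiny fully-convolutional network that maps an
# RGB performance image to a per-pixel probability of belonging to the specific area B.
provisional_model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 1, kernel_size=1), nn.Sigmoid(),
)
error_function = nn.BCELoss()                       # error between predicted mask and mask Mt
optimizer = torch.optim.Adam(provisional_model.parameters(), lr=1e-3)

def training_step(image_dt: torch.Tensor, mask_mt: torch.Tensor) -> float:
    """One update on a single learning datum T = (Dt, Mt).

    image_dt: float tensor of shape (1, 3, H, W); mask_mt: float tensor of shape (1, 1, H, W)
    with 1 inside the specific area B and 0 elsewhere.
    """
    optimizer.zero_grad()
    predicted_mask = provisional_model(image_dt)
    loss = error_function(predicted_mask, mask_mt)  # error function to be reduced
    loss.backward()                                 # gradients for the model variables
    optimizer.step()                                # update the variables so the error decreases
    return loss.item()

# Example with random stand-in data (real training would loop over many learning data T).
loss_value = training_step(torch.rand(1, 3, 64, 64), (torch.rand(1, 1, 64, 64) > 0.5).float())
```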
  • the area extraction process Sb2 in FIG. 8 is a process for extracting the specific area B from the performance image G1 represented by the image data D1.
  • the region extraction processing Sb2 is image processing for relatively emphasizing the specific region B by selectively removing regions other than the specific region in the performance image G1.
  • the image extraction unit 311 of the first embodiment generates the image data D2 by applying the image processing mask M to the image data D1 (performance image G1). Specifically, the image extraction unit 311 multiplies the pixel value of each pixel in the performance image G1 by the element of the image processing mask M corresponding to that pixel.
  • as illustrated in FIG. 7, the area extraction processing Sb2 generates image data D2 representing an image (hereinafter referred to as the "performance image G2") obtained by removing the areas other than the specific area B from the performance image G1. That is, the performance image G2 represented by the image data D2 is an image obtained by extracting the keyboard image g1 and the finger image g2 from the performance image G1.
  • An element (region extractor) for extracting the specific region B of the performance image G1 is implemented by the control device 11 executing the region extracting process Sb2.
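  • a minimal NumPy sketch of the region extraction processing Sb2 described above, assuming the performance image G1 is an H×W×3 array and the image processing mask M is an H×W array of 0s and 1s:

```python
import numpy as np

def extract_specific_area(performance_image: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Multiply each pixel of the performance image G1 by the corresponding mask element.

    performance_image: uint8 array of shape (H, W, 3) representing G1.
    mask: array of shape (H, W) whose elements are 1 inside the specific area B and 0 outside.
    Returns the performance image G2 in which everything outside B is removed (set to black).
    """
    return (performance_image * mask[..., np.newaxis]).astype(performance_image.dtype)

# Example: a 2x2 image in which only the top-left pixel lies inside the specific area.
g1 = np.full((2, 2, 3), 200, dtype=np.uint8)
m = np.array([[1, 0], [0, 0]], dtype=np.uint8)
print(extract_specific_area(g1, m)[..., 0])  # -> [[200   0] [  0   0]]
```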
  • the position c[h, f] of each finger estimated by the finger position estimation process is the coordinates in the xy coordinate system set in the performance image G1.
  • the conditions for photographing the keyboard instrument 200 by the photographing device 15 may differ depending on various circumstances such as the usage environment of the keyboard instrument 200 . For example, it is assumed that the imaging range is too wide (or too narrow) compared to the ideal imaging conditions illustrated in FIG. 2, or that the imaging direction is inclined with respect to the vertical direction.
  • the numerical values of the coordinates x[h,f] and y[h,f] of each position c[h,f] therefore depend on the conditions under which the photographing device 15 captures the performance image G1.
  • the projective transformation unit 314 of the first embodiment therefore converts the position c[h,f] of each finger in the performance image G1 into a position C[h,f] in an X-Y coordinate system that is substantially independent of the imaging conditions of the imaging device 15 (image registration).
  • the finger position data F generated by the finger position data generation unit 31 is data representing the position C[h,f] after conversion by the projective conversion unit 314 . That is, the finger position data F includes the positions C[1,1] to C[1,5] of the fingers of the user's left hand and the positions C[2,1] to C[ of the fingers of the user's right hand. 2,5].
  • the X-Y coordinate system is set with reference to a predetermined image (hereinafter referred to as the "reference image") Gref, as illustrated in FIG. 10.
  • the reference image Gref is an image of a keyboard of a standard keyboard instrument (hereinafter referred to as “reference instrument”) captured under standard imaging conditions.
  • the reference image Gref is not limited to an image of an actual keyboard.
  • an image synthesized by a known image synthesis technique may be used as the reference image Gref.
  • Image data Dref representing the reference image Gref (hereinafter referred to as “reference data”) and auxiliary data A relating to the reference image Gref are stored in the storage device 12 .
  • Auxiliary data A is data specifying a combination of an area (hereinafter referred to as a "unit area") Rn in which each key 21 of the reference musical instrument exists in the reference image Gref and the pitch n corresponding to the key 21. That is, the auxiliary data A can also be said to be data defining a unit region Rn corresponding to each pitch n in the reference image Gref.
  • for the transformation from the position c[h,f] in the x-y coordinate system to the position C[h,f] in the X-Y coordinate system, a projective transformation using the transformation matrix W is used, as expressed by Equation (2) below:
    (X, Y, s)^T = W (x, y, 1)^T ... (2)
  • the symbols X and Y in Equation (2) correspond to the coordinates on the X-axis and the Y-axis of the X-Y coordinate system, and the symbol s is an adjustment value for matching the scale between the x-y coordinate system and the X-Y coordinate system; the transformed position in the X-Y coordinate system is obtained as (X/s, Y/s).
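  • to make Equation (2) concrete, the sketch below applies a 3×3 transformation matrix W to a fingertip position (x, y) in homogeneous coordinates and divides by the scale value s; the example matrix is arbitrary and purely illustrative.

```python
import numpy as np

def project_position(W: np.ndarray, x: float, y: float) -> tuple[float, float]:
    """Map a fingertip position c[h,f]=(x, y) in the performance image to C[h,f] in the
    reference (X-Y) coordinate system by the projective transformation of Equation (2)."""
    X, Y, s = W @ np.array([x, y, 1.0])
    return X / s, Y / s          # divide by the scale adjustment value s

# Example with an arbitrary 3x3 homography (uniform scaling, translation, mild perspective).
W = np.array([[0.5, 0.0, 10.0],
              [0.0, 0.5, 20.0],
              [0.0, 1e-4, 1.0]])
print(project_position(W, 640.0, 360.0))   # -> approximately (318.5, 193.1)
```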
  • FIG. 11 is a flowchart illustrating a specific procedure of the process of generating the transformation matrix W by the matrix generator 312 (hereinafter referred to as "matrix generation process").
  • the matrix generation process of the first embodiment is executed with the performance image G2 (image data D2) processed by the image extraction process as the object of processing.
  • therefore, compared to a configuration in which the matrix generation processing is executed on the entire performance image G1 including the areas other than the specific area B, a suitable transformation matrix W that approximates the keyboard image g1 to the reference image Gref with high precision can be generated.
  • the matrix generation process includes an initialization process Sc1 and a matrix update process Sc2.
  • the initial setting processing Sc1 is a process of setting an initial matrix W0, which is the initial value of the transformation matrix W. The details of the initial setting processing Sc1 will be described later.
  • the matrix update process Sc2 is a process of generating a transformation matrix W by iteratively updating the initial matrix W0. That is, the projective transformation unit 314 iteratively updates the initial matrix W0 so that the keyboard image g1 of the performance image G2 approaches the reference image Gref by projective transformation using the transformation matrix W, thereby transforming the transformation matrix W into Generate.
  • specifically, the transformation matrix W is generated so that, for each point of the keyboard image g1, the coordinates X/s and Y/s obtained by applying Equation (2) to the coordinates x and y of that point approximate or match the coordinates of the corresponding point in the reference image Gref. That is, the transformation matrix W is generated so that the coordinates of the key 21 corresponding to a specific pitch in the keyboard image g1 are transformed, by the projective transformation applying the transformation matrix W, into the coordinates of the key 21 corresponding to that pitch in the reference image Gref.
  • An element (matrix generation unit 312) for generating the conversion matrix W is implemented by the control device 11 executing the matrix update processing Sc2 illustrated above.
  • as the matrix update processing Sc2, for example, a process of updating the transformation matrix W so that an image feature amount such as SIFT (Scale-Invariant Feature Transform) becomes closer between the reference image Gref and the keyboard image g1 can be assumed.
  • however, in the keyboard image g1, a pattern in which a plurality of keys 21 are arranged in the same manner is repeated, so the transformation matrix W may not be properly estimated in a form that uses such image feature amounts.
  • the matrix generator 312 of the first embodiment therefore iteratively updates the initial matrix W0 so as to increase (ideally, maximize) the enhanced correlation coefficient (ECC) between the reference image Gref and the keyboard image g1.
  • the enhanced correlation coefficient is suitable for generating the transformation matrix W used for transforming the keyboard image g1, in which a similar pattern of keys 21 is repeated.
  • note that the transformation matrix W may also be generated so that image feature amounts become close to each other between the reference image Gref and the keyboard image g1.
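  • the sketch below shows one possible realization of the matrix update processing Sc2 using OpenCV's ECC-based alignment; treating cv2.findTransformECC as a stand-in for the iterative update described above is an assumption of this example, not a statement about the embodiment's actual implementation.

```python
import cv2
import numpy as np

def update_transformation_matrix(reference_gray: np.ndarray,
                                 keyboard_gray: np.ndarray,
                                 initial_w: np.ndarray) -> np.ndarray:
    """Iteratively refine the initial matrix W0 so that the ECC between the keyboard
    image g1 and the reference image Gref increases.

    reference_gray, keyboard_gray: single-channel float32 images of equal size.
    initial_w: 3x3 float32 initial matrix W0. Returns the refined transformation matrix W.
    """
    criteria = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 200, 1e-6)
    # The trailing None / 5 arguments (input mask, Gaussian filter size) suit OpenCV 4.x.
    ecc, w = cv2.findTransformECC(reference_gray, keyboard_gray, initial_w.copy(),
                                  cv2.MOTION_HOMOGRAPHY, criteria, None, 5)
    return w

# Usage sketch (the images would come from Gref and the extracted performance image G2):
# W = update_transformation_matrix(gref_gray, g2_gray, W0.astype(np.float32))
```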
  • the projective transformation unit 314 in FIG. 3 executes projective transformation processing.
  • the projective transformation process is a projective transformation of the performance image G1 using the transformation matrix W generated by the matrix generation process.
  • by the projective transformation processing, the performance image G1 is transformed into an image (hereinafter referred to as the "transformed image") equivalent to an image shot under the same shooting conditions as the reference image Gref.
  • the area corresponding to the key 21 of the pitch n in the transformed image substantially matches the unit area Rn of the pitch n in the reference image Gref.
  • the x-y coordinate system of the transformed image substantially matches the x-y coordinate system of the reference image Gref.
  • the projective transformation unit 314 converts the position c[h, f] of each finger to the position C[h, f].
  • an element (projective transformation unit 314) that executes the projective transformation of the performance image G1 is implemented by the control device 11 executing the projective transformation processing.
  • the display control unit 40 causes the display device 14 to display the transformed image generated by the projective transformation process.
  • the display control unit 40 causes the display device 14 to display the converted image and the reference image Gref in an overlapping state.
  • the area corresponding to the key 21 of each pitch n in the transformed image and the unit area Rn corresponding to the pitch n in the reference image Gref overlap each other.
  • as described above, the transformation matrix W is generated so that the keyboard image g1 of the performance image G1 approaches the reference image Gref, and the projective transformation processing using the transformation matrix W is performed on the performance image G1. Therefore, the performance image G1 of the keyboard instrument 200 played by the user can be converted into a transformed image corresponding to the photographing conditions of the reference musical instrument in the reference image Gref.
  • FIG. 12 is a flowchart illustrating a specific procedure of the initial setting process Sc1.
  • the projective transformation unit 314 causes the display device 14 to display the setting screen 62 illustrated in FIG. 13 (Sc11).
  • the setting screen 62 includes a performance image G1 photographed by the photographing device 15 and an instruction 622 for the user.
  • the instruction 622 is a message prompting the user to select an area (hereinafter referred to as a "target area") 621 corresponding to one or more specific pitches (hereinafter referred to as "target pitches") n of the keyboard image g1 in the performance image G1.
  • the user selects the target area 621 corresponding to the target pitch n in the performance image G1 by operating the operation device 13 while viewing the setting screen 62 .
  • the projective transformation unit 314 receives selection of the target area 621 by the user (Sc12).
  • the projective transformation unit 314 identifies one or more unit regions Rn designated by the auxiliary data A for the target pitch n in the reference image Gref represented by the reference data Dref (Sc13). Then, the projective transformation unit 314 calculates a matrix for projectively transforming the target region 621 of the performance image G1 into one or more unit regions Rn specified from the reference image Gref as an initial matrix W0 (Sc14).
  • as described above, the initial matrix W0 is set so that the target area 621 corresponding to the instruction from the user in the performance image G1 approaches the unit area Rn corresponding to the target pitch n in the reference image Gref. Therefore, it is possible to generate an appropriate transformation matrix W that can approximate the keyboard image g1 to the reference image Gref with high accuracy.
  • the area designated by the user by operating the operating device 13 in the performance image G1 is used as the target area 621 for setting the initial matrix W0. Therefore, an appropriate initial matrix W0 can be generated while reducing the processing load, compared with, for example, a form in which the area corresponding to the target pitch n in the performance image G1 is estimated by arithmetic processing.
  • the initial setting process Sc1 is executed for the performance image G1, but the initial setting process Sc1 may be executed for the performance image G2.
  • the fingering data generator 32 in FIG. 3 generates the fingering data Q using the performance data P generated by the keyboard instrument 200 and the finger position data F generated by the finger position data generator 31, as described above. .
  • the fingering data Q is generated every unit period.
  • the fingering data generator 32 of the first embodiment includes a probability calculator 321 and a fingering estimator 322 .
  • the probability calculation unit 321 calculates, for each finger number k, the probability p that the pitch n specified by the performance data P is played by the finger of each finger number k.
  • the probability p is an index (likelihood) of the probability that the finger with the finger number k has operated the key 21 with the pitch n.
  • the probability calculator 321 calculates the probability p according to whether or not the position C[k] of the finger with the finger number k exists within the unit area Rn of the pitch n.
  • the probability p is calculated for each unit period on the time axis. Specifically, when the performance data P designates the pitch n, the probability calculation unit 321 calculates the probability p(C[k] | ψk = n) by the calculation of Equation (3) exemplified below:
    p(C[k] | ψk = n) = ( I(C[k] ∈ Rn) / |Rn| ) * N(0, σ²E) ... (3)
  • the condition "ψk = n" means that the finger with the finger number k is playing the pitch n, and the probability p(C[k] | ψk = n) is the probability that the finger position data F designates the position C[k] for that finger under this condition.
  • the symbol I(C[k] ∈ Rn) in Equation (3) is an indicator function that is set to the numerical value "1" when the position C[k] exists within the unit region Rn and to the numerical value "0" when the position C[k] exists outside the unit region Rn, and the symbol |Rn| means the area of the unit region Rn.
  • the symbol N(0, σ²E) means observation noise, which is represented by a normal distribution with mean 0 and variance σ²; the symbol E is a unit matrix of 2 rows and 2 columns, and the symbol * means convolution with the observation noise N(0, σ²E).
  • therefore, the probability p(C[k] | ψk = n) is maximized when the position C[k] of the finger with the finger number k is within the unit region Rn in the playing state, and decreases as the position C[k] moves away from the unit region Rn.
  • for the state in which the finger with the finger number k is not playing any key 21 (ψk = 0), the probability calculator 321 calculates the probability p(C[k] | ψk = 0) by Equation (4) below:
    p(C[k] | ψk = 0) = 1 / ( |R1| + |R2| + … + |RN| ) ... (4)
  • the denominator of Equation (4) means the total area of the N unit regions R1 to RN in the reference image Gref. As can be seen from Equation (4), when the user does not operate any key 21, the probability p(C[k] | ψk = 0) is a common numerical value for every finger.
  • the fingering estimation unit 322 estimates the user's fingering. Specifically, the fingering estimation unit 322 estimates the finger (finger number k) that played the pitch n specified by the performance data P from the probabilities p(C[k] | ψk = n) of the respective fingers. The fingering estimation unit 322 estimates the finger number k (generates the fingering data Q) every time the probabilities p(C[k] | ψk = n) of the respective fingers are calculated (that is, every unit period). Specifically, the fingering estimation unit 322 identifies the finger number k corresponding to the maximum value among the plurality of probabilities p(C[k] | ψk = n) corresponding to the different fingers, and generates fingering data Q that specifies the pitch n specified by the performance data P and that finger number k.
  • the fingering estimation unit 322 sets the finger number k to an invalid value, meaning that the estimation result is invalid, in a unit period in which the maximum value of the plurality of probabilities p(C[k] | ψk = n) is below a threshold.
  • for a note whose finger number k is set to the invalid value, the display control unit 40 displays the note image 611 in a manner different from the normal note image 611 and displays the symbol "?", as illustrated in FIG. 4.
  • the configuration and operation of the fingering data generator 32 are as described above.
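  • the sketch below ties together the probability calculation of Equation (3) and the subsequent fingering estimation; it assumes that each unit region Rn is an axis-aligned rectangle (so that the convolution with the Gaussian observation noise has a closed form using the normal CDF), and the values of σ and the validity threshold are illustrative only.

```python
import math

def phi(z: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def prob_position_given_pitch(c, rect, sigma=5.0):
    """Equation (3)-style density of observing position C[k]=(x, y) when the finger is
    playing the pitch whose unit region Rn is the axis-aligned rectangle (x0, y0, x1, y1):
    the uniform density over Rn (indicator / area) convolved with isotropic noise N(0, sigma^2 E).
    """
    (cx, cy), (x0, y0, x1, y1) = c, rect
    area = (x1 - x0) * (y1 - y0)
    px = phi((x1 - cx) / sigma) - phi((x0 - cx) / sigma)
    py = phi((y1 - cy) / sigma) - phi((y0 - cy) / sigma)
    return px * py / area

def estimate_finger(c_positions, rect_n, threshold=1e-6):
    """Pick the finger number k whose probability p(C[k] | psi_k = n) is largest.

    c_positions: dict mapping finger number k to its transformed position C[k].
    rect_n: unit region Rn of the played pitch n. Returns None when even the best
    probability falls below the threshold (the estimation result is treated as invalid).
    """
    probs = {k: prob_position_given_pitch(c, rect_n) for k, c in c_positions.items()}
    best_k = max(probs, key=probs.get)
    return best_k if probs[best_k] >= threshold else None

# Example: two candidate fingers; the one inside the key's unit region wins.
positions = {2: (105.0, 50.0), 3: (160.0, 52.0)}
print(estimate_finger(positions, rect_n=(100.0, 40.0, 120.0, 90.0)))  # -> 2
```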
  • FIG. 14 is a flowchart illustrating a specific procedure of the processing (hereinafter referred to as "performance analysis processing") executed by the performance analysis section 30. For example, the performance analysis processing is started when the user gives an instruction via the operation device 13.
  • the control device 11 executes the image extraction process of FIG. 8 (S11). That is, the control device 11 generates the performance image G2 by extracting the specific region B including the keyboard image g1 and the finger image g2 from the performance image G1.
  • the image extraction process includes the area estimation process Sb1 and the area extraction process Sb2 as described above.
  • after executing the image extraction processing, the control device 11 (matrix generation unit 312) executes the matrix generation processing of FIG. 11 (S12). That is, the control device 11 generates the transformation matrix W by iteratively updating the initial matrix W0 so as to increase the enhanced correlation coefficient between the reference image Gref and the keyboard image g1.
  • the matrix generation process includes the initialization process Sc1 and the matrix update process Sc2, as described above.
  • the control device 11 repeats the processing (S13 to S18) illustrated below for each unit period.
  • the control device 11 (finger position estimating section 313) executes the finger position estimating process of FIG. 5 (S13). That is, the control device 11 estimates the positions c[h, f] of the fingers of the user's left hand and right hand by analyzing the performance image G1.
  • the finger position estimation processing includes image analysis processing Sa1, left/right determination processing Sa2, and interpolation processing Sa3.
  • the control device 11 executes projective transformation processing (S14). That is, the control device 11 generates a transformed image by projective transformation of the performance image G1 using the transformation matrix W.
  • in the projective transformation processing, the control device 11 also transforms the position c[h,f] of each finger of the user into the position C[h,f] in the X-Y coordinate system, and generates finger position data F representing the position C[h,f] of each finger of the user.
  • the control device 11 (probability calculation unit 321) then executes the probability calculation processing (S15). That is, the control device 11 calculates, for each finger number k, the probability p(C[k] | ψk = n) that the pitch n specified by the performance data P is played by the finger with that finger number. Then, the control device 11 (fingering estimation unit 322) executes the fingering estimation processing (S16). That is, the control device 11 estimates the finger number k of the finger that played the pitch n from the probabilities p(C[k] | ψk = n) of the respective fingers, and generates fingering data Q that designates the pitch n and the finger number k.
  • the control device 11 (display control unit 40) updates the analysis screen 61 according to the fingering data Q (S17). Further, the control device 11 determines whether or not a predetermined end condition is satisfied (S18). For example, when the user instructs to end the performance analysis processing by operating the operation device 13, the control device 11 determines that the end condition is met. If the termination condition is not satisfied (S18: NO), the control device 11 repeats the processes after the finger position estimation process (S13 to S18) for the immediately following unit period. On the other hand, if the termination condition is satisfied (S18: YES), the control device 11 terminates the performance analysis process.
  • as described above, in the first embodiment, the finger position data F generated by analyzing the performance image G1 and the performance data P representing the performance by the user are used to generate the fingering data Q. Therefore, the fingering can be estimated with high accuracy compared to a configuration in which the fingering is estimated only from the performance data P.
  • further, the position c[h,f] of each finger estimated by the finger position estimation processing is converted using the transformation matrix W for the projective transformation that brings the keyboard image g1 closer to the reference image Gref. That is, the position C[h,f] of each finger is estimated with reference to the reference image Gref. Therefore, the fingering can be estimated with high precision compared to a configuration in which the position c[h,f] of each finger is not converted to a position based on the reference image Gref.
  • a specific area B including the keyboard image g1 is extracted from the performance image G1. Therefore, as described above, it is possible to generate an appropriate transformation matrix W that can approximate the keyboard image g1 to the reference image Gref with high accuracy. Further, extracting the specific region B can improve the usability of the performance image G1. Particularly in the first embodiment, a specific area B including the keyboard image g1 and the finger image g2 is extracted from the performance image G1. Therefore, it is possible to generate a performance image G2 in which the appearance of the keyboard 22 of the keyboard instrument 200 and the appearance of the user's fingers can be efficiently visually recognized.
  • Second Embodiment A second embodiment will be described.
  • in the embodiments described below, elements having the same functions as those of the first embodiment are denoted by the same reference numerals as those used in the description of the first embodiment, and detailed descriptions thereof are omitted as appropriate.
  • as illustrated in FIG. 15, the middle finger and the index finger of the left hand may overlap each other, so that the position C[k] of the middle finger of the left hand and the position C[k] of the index finger of the left hand exist within one unit region Rn.
  • during an actual performance, a plurality of fingers may thus overlap each other. When a plurality of fingers overlap each other within one unit region Rn, the method of the first embodiment may not be able to estimate the fingering with high accuracy. The second embodiment solves this problem. Specifically, in the second embodiment, the positional relationship of the plurality of fingers and the temporal variation (dispersion) of the position of each finger are taken into consideration in the fingering estimation.
  • FIG. 16 is a block diagram illustrating the functional configuration of the performance analysis system 100 according to the second embodiment.
  • a performance analysis system 100 of the second embodiment has a configuration in which a control data generator 323 is added to the same elements as those of the first embodiment.
  • the control data generator 323 generates N pieces of control data Z[1] to Z[N] corresponding to different pitches n.
  • FIG. 17 is a schematic diagram of control data Z[n] corresponding to an arbitrary pitch n.
  • the control data Z[n] is vector data representing the characteristics of the relative position (hereinafter referred to as "relative position") C'[k] of each finger with respect to the unit area Rn of pitch n.
  • the relative position C'[k] is information obtained by converting the position C[k] represented by the finger position data F into a position relative to the unit region Rn.
  • the control data Z[n] corresponding to one pitch n includes the pitch n and, for each of the plurality of fingers, the position average Za[n,k], the position variance Zb[n,k], the velocity average Zc[n,k], and the velocity variance Zd[n,k].
  • the average position Za[n,k] is the average of the relative positions C'[k] within a period of a predetermined length including the current unit period (hereinafter referred to as "observation period").
  • the observation period is, for example, a period consisting of a plurality of consecutive unit periods on the time axis that ends with the current unit period.
  • the position variance Zb[n,k] is the variance of the relative position C'[k] within the observation period.
  • the velocity average Zc[n,k] is the average of the velocities (that is, rate of change) at which the relative position C'[k] changes within the observation period.
  • the velocity variance Zd[n,k] is the variance of the velocity at which the relative position C'[k] changes within the observation period.
  • as described above, the control data Z[n] includes, for each of the plurality of fingers, information on the relative position C'[k] (Za[n,k], Zb[n,k], Zc[n,k], Zd[n,k]). Therefore, the control data Z[n] is data reflecting the positional relationship of the user's fingers. Also, the control data Z[n] includes information (Zb[n,k], Zd[n,k]) regarding the variation of the relative position C'[k] for each of the plurality of fingers, and is therefore also data that reflects temporal variations in the position of each finger.
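  • a minimal sketch of how the control data Z[n] for one pitch n could be assembled from the per-unit-period relative positions C'[k] within the observation period; the array shapes and the layout of the resulting vector are assumptions made for illustration.

```python
import numpy as np

def make_control_data(pitch_n: int, relative_positions: np.ndarray) -> np.ndarray:
    """Build control data Z[n] for one pitch n.

    relative_positions: array of shape (T, K, 2) holding the relative position C'[k]
    (x and y, relative to the unit region Rn) of each of K fingers over the T unit
    periods of the observation period, ordered oldest to newest.
    Returns a flat vector containing the pitch n and, per finger, the position mean
    Za[n,k], position variance Zb[n,k], velocity mean Zc[n,k] and velocity variance Zd[n,k].
    """
    velocities = np.diff(relative_positions, axis=0)   # change of C'[k] per unit period
    za = relative_positions.mean(axis=0)               # (K, 2) position mean
    zb = relative_positions.var(axis=0)                # (K, 2) position variance
    zc = velocities.mean(axis=0)                       # (K, 2) velocity mean
    zd = velocities.var(axis=0)                        # (K, 2) velocity variance
    return np.concatenate([[float(pitch_n)], za.ravel(), zb.ravel(), zc.ravel(), zd.ravel()])

# Example: an observation period of 4 unit periods, 10 fingers.
z60 = make_control_data(60, np.random.rand(4, 10, 2))
print(z60.shape)  # -> (81,)
```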
  • a plurality of estimation models 52[k] (52[1] to 52[10]) prepared in advance for different fingers are used for the probability calculation processing by the probability calculation unit 321 of the second embodiment.
  • the estimation model 52[k] of each finger is a trained model that has learned the relationship between the control data Z[n] and the probability p[k] of the finger.
  • the probability p[k] is an index (probability) of the accuracy of playing the pitch n specified by the performance data P with the finger having the finger number k.
  • the probability calculation unit 321 calculates the probability p[k] by inputting the N pieces of control data Z[1] to Z[N] to the estimation model 52[k] for each of a plurality of fingers. .
  • the estimation model 52[k] corresponding to any one finger number k is a logistic regression model represented by Equation (5) below.
  • the variables αk and βk,n in Equation (5) are set by machine learning by the machine learning system 900. That is, each estimation model 52[k] is established by machine learning by the machine learning system 900 and is provided to the performance analysis system 100; for example, the variable αk and the variables βk,n of each estimation model 52[k] are transmitted from the machine learning system 900 to the performance analysis system 100.
  • through the machine learning, the estimation model 52[k] learns the relationship between the control data Z[n] and the probability p[k] so that, for example, the probability p[k] is small for a finger whose relative position C'[k] has a high rate of change.
  • the probability calculator 321 calculates a plurality of probabilities p[k] regarding different fingers for each unit period by inputting the control data Z[n] to each of the plurality of estimation models 52[k].
  • the fingering estimation unit 322 estimates the user's fingering through fingering estimation processing that applies a plurality of probabilities p[k]. Specifically, the fingering estimation unit 322 estimates the finger (finger number k) that played the pitch n specified by the performance data P from the probability p[k] of each finger. The fingering estimation unit 322 estimates the finger number k (generates the fingering data Q) every time the probability p[k] of each finger is calculated (that is, every unit period). Specifically, the fingering estimation unit 322 identifies the finger number k corresponding to the maximum value among a plurality of probabilities p[k] corresponding to different fingers. Then, the fingering estimation unit 322 generates fingering data Q that specifies the pitch n specified by the performance data P and the finger number k specified from the probability p[k].
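  • since Equation (5) itself is not reproduced in this text, the following logistic-regression sketch only illustrates the general idea of estimating p[k] from the control data with a bias αk and weights βk,n; the exact functional form and feature layout used in the embodiment may differ.

```python
import numpy as np

def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))

def probability_for_finger(alpha_k: float, beta_k: np.ndarray, z: np.ndarray) -> float:
    """Estimate p[k] for one finger from the stacked control data Z[1]..Z[N].

    beta_k: weight vector (one weight per element of the stacked control data),
    standing in for the per-pitch variables beta_{k,n}.
    z: concatenation of the N control-data vectors Z[1]..Z[N] for the current unit period.
    """
    return float(sigmoid(alpha_k + beta_k @ z))

def estimate_fingering(alphas, betas, z):
    """Return the finger number k (1-based) with the largest probability p[k], plus all p[k]."""
    probs = [probability_for_finger(a, b, z) for a, b in zip(alphas, betas)]
    return int(np.argmax(probs)) + 1, probs

# Example with random stand-in parameters for 10 fingers and a 50-dimensional stacked Z.
rng = np.random.default_rng(0)
k, probs = estimate_fingering(rng.normal(size=10), rng.normal(size=(10, 50)), rng.normal(size=50))
print(k)
```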
  • FIG. 18 is a flowchart illustrating a specific procedure of performance analysis processing in the second embodiment.
  • in the performance analysis processing of the second embodiment, generation of the control data Z[n] (S19) is added to the same processes as in the first embodiment.
  • specifically, the control device 11 (control data generation unit 323) generates the N pieces of control data Z[1] to Z[N] corresponding to the different pitches n (S19).
  • the control device 11 (probability calculation unit 321) calculates the probability p[k] for each of the plurality of fingers by inputting the control data Z[n] to each of the estimation models 52[k] (S15). Further, the control device 11 (fingering estimation unit 322) estimates the user's fingering by the fingering estimation processing applying the plurality of probabilities p[k] (S16).
  • the operations of elements other than the fingering data generator 32 (S11-S14, S17-S18) are the same as in the first embodiment.
  • the control data Z[n] input to the estimation models 52[k] in the second embodiment includes the average Za[n,k] and the variance Zb[n,k] of the relative position C'[k] of each finger, as well as the average Zc[n,k] and the variance Zd[n,k] of the rate of change of the relative position C'[k]. Therefore, even if a plurality of fingers overlap each other due to, for example, a finger slipping, the user's fingering can be estimated with high accuracy.
  • the logistic regression model was exemplified as the estimation model 52[k], but the type of estimation model 52[k] is not limited to the above examples.
  • a statistical model such as a multilayer perceptron may be used as the estimation model 52[k].
  • a deep neural network such as a convolutional neural network or a recurrent neural network may also be used as the estimation model 52[k].
  • a combination of multiple types of statistical models may be used as the estimation model 52[k].
  • the various estimation models 52[k] exemplified above are comprehensively expressed as learned models that have learned the relationship between the control data Z[n] and the probability p[k].
  • FIG. 19 is a flowchart illustrating a specific procedure of performance analysis processing in the third embodiment.
  • in the third embodiment, after executing the image extraction processing and the matrix generation processing, the control device 11 refers to the performance data P to determine whether or not the user is playing the keyboard instrument 200 (S21). Specifically, the control device 11 determines whether or not any of the plurality of keys 21 of the keyboard instrument 200 is being operated.
  • if the keyboard instrument 200 is being played (S21: YES), the control device 11 executes, as in the first embodiment, the generation of the finger position data F (S13-S14), the generation of the fingering data Q (S15-S16), and the update of the analysis screen 61 (S17). On the other hand, if the keyboard instrument 200 is not being played (S21: NO), the control device 11 shifts the process to step S18. That is, the generation of the finger position data F (S13-S14), the generation of the fingering data Q (S15-S16), and the update of the analysis screen 61 (S17) are not executed.
  • the same effects as in the first embodiment are also achieved in the third embodiment. Further, in the third embodiment, generation of the finger position data F and the fingering data Q is stopped when the keyboard instrument 200 is not played. Therefore, the processing load necessary for generating the fingering data Q can be reduced compared to a configuration in which the generation of the finger position data F is continued regardless of whether the keyboard instrument 200 is played. The third embodiment can also be applied to the second embodiment.
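  • For illustration only (not from the disclosure): a minimal sketch of the gate of step S21, assuming the performance data P is available as a set of currently depressed keys maintained from MIDI note-on/note-off messages; the function and attribute names are hypothetical.

```python
def analyse_frame(frame, active_keys, pipeline):
    """frame: one performance image; active_keys: set of pitches currently held."""
    if not active_keys:                   # S21: NO -> skip the remaining steps
        return None
    finger_pos = pipeline.estimate_finger_positions(frame)   # S13-S14
    fingering = pipeline.estimate_fingering(finger_pos)      # S15-S16
    pipeline.update_analysis_screen(fingering)                # S17
    return fingering
```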
  • FIG. 20 is a flowchart illustrating a specific procedure of the initial setting process Sc1 executed by the control device 11 (matrix generator 312) of the fourth embodiment.
  • in the fourth embodiment, the user operates, with a specific finger (hereinafter referred to as the "specific finger"), the key 21 corresponding to a desired pitch (hereinafter referred to as the "specific pitch") n among the plurality of keys 21 of the keyboard instrument 200.
  • the specific finger is, for example, the finger (for example, the index finger of the right hand) notified to the user by the display on the display device 14 or the instruction manual of the keyboard instrument 200 or the like.
  • performance data P specifying a specific pitch n is supplied from the keyboard instrument 200 to the performance analysis system 100 .
  • the control device 11 acquires the performance data P from the keyboard instrument 200, thereby recognizing the performance of the specific pitch n by the user (Sc15).
  • the control device 11 specifies a unit area Rn corresponding to a specific pitch n among the N unit areas R1 to RN of the reference image Gref (Sc16).
  • the finger position data generation unit 31 generates finger position data F through finger position estimation processing.
  • the finger position data F includes the position C[h, f] of the specific finger used by the user to play the specific pitch n.
  • the control device 11 acquires the finger position data F to specify the position C[h,f] of the specific finger (Sc17).
  • the control device 11 uses the unit area Rn corresponding to the specific pitch n and the position C[h,f] of the specific finger represented by the finger position data F to set the initial matrix W0 (Sc18). That is, the control device 11 sets the initial matrix W0 so that the position C[h,f] of the specific finger represented by the finger position data F approaches the unit area Rn of the specific pitch n in the reference image Gref. Specifically, a matrix for projectively transforming the position C[h,f] of the specific finger to the center of the unit area Rn is set as the initial matrix W0.
  • as described above, in the fourth embodiment, the initial matrix W0 is set so that the position c[h,f] of the specific finger in the performance image G1 approaches the portion (unit region Rn) corresponding to the specific pitch n in the reference image Gref. Since the user only needs to play the desired pitch n, the initial matrix W0 can be set by a simpler operation than in the first embodiment, in which the user needs to select the target area 621 by operating the operation device 13.
  • in short, the control device 11 sets the initial matrix W0 so that the position C[h,f] of the specific finger during the performance of the specific pitch n approaches the unit region Rn of the specific pitch n.
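  • For illustration only (not from the disclosure): one simple matrix that sends the detected position of the specific finger to the centre of the unit region Rn is a pure translation expressed in homogeneous coordinates; the real initial matrix W0 may be any projective transformation satisfying this constraint, and the coordinates below are toy values.

```python
import numpy as np

def initial_matrix(finger_xy, region_centre_xy):
    """Translation-only homography mapping finger_xy onto region_centre_xy."""
    tx = region_centre_xy[0] - finger_xy[0]
    ty = region_centre_xy[1] - finger_xy[1]
    return np.array([[1.0, 0.0, tx],
                     [0.0, 1.0, ty],
                     [0.0, 0.0, 1.0]])

W0 = initial_matrix((412.0, 230.0), (380.0, 210.0))
```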
  • FIG. 21 is a block diagram illustrating the functional configuration of a performance analysis system 100 according to a fifth embodiment.
  • a performance analysis system 100 of the fifth embodiment comprises a sound pickup device 16 .
  • the sound collecting device 16 generates the sound signal V by collecting sound reproduced from the keyboard instrument 200 by the user's performance.
  • the acoustic signal V is a time-domain audio signal representing the waveform of the sound reproduced by the keyboard instrument 200 .
  • the sound collecting device 16 which is separate from the performance analysis system 100, may be connected to the performance analysis system 100 by wire or wirelessly. Note that the time series of samples forming the acoustic signal V may be interpreted as "performance data P".
  • the control device 11 of the performance analysis system 100 functions as a performance analysis section 30 by executing programs stored in the storage device 12 .
  • the performance analysis section 30 generates fingering data Q using the sound signal V supplied from the sound pickup device 16 and the image data D1 supplied from the photographing device 15 .
  • the fingering data Q designates the pitch n corresponding to the key 21 operated by the user and the finger number k of the finger used to operate the key 21 by the user.
  • whereas the pitch n is designated by the performance data P in the first embodiment, the acoustic signal V in the fifth embodiment is not a signal that directly designates the pitch n. Therefore, the performance analysis section 30 estimates the pitch n and the finger number k simultaneously using the acoustic signal V and the image data D1.
  • a latent variable w t,n,k is prepared for each combination of pitch n and finger number k.
  • the latent variable w t,n,k is a variable for one-hot expression that is set to either of the binary values '0' and '1'.
  • the value "1" of the latent variable w t,n,k means that the pitch n is played by the finger with the finger number k, and the value "0" of the latent variable w t,n,k This means that the fingers of the player are also not used for playing.
  • the posterior probability U t,n is the posterior probability that pitch n is pronounced at time t under the condition that acoustic signal V is observed. Therefore, the probability (1 ⁇ U t,n ) is the probability that the latent variable w t,n,0 is the numerical value “1” under the condition that the acoustic signal V is observed (any pitch n is played). probability of not The posterior probability U t,n is estimated by a known estimation model that has learned the relationship between the acoustic signal V and the posterior probability U t,n . The estimation model is a trained model for automatic transcription.
  • a deep neural network eg a convolutional neural network or a recurrent neural network, is used as an estimation model for estimating the posterior probabilities U t,n .
  • the probability ⁇ t,n,k is the probability that the pitch n is played by the finger with the finger number k when the pitch n is being played.
  • Equation (6) defines the probability p(w|·) of the latent variables w t,n,k. The first term on the right side of Equation (6) means the probability that the pitch n is not sounded, and the second term means the probability that, if the pitch n is sounded, it is played with the finger of finger number k.
  • Equation (7) defines the probability p(C[k]|·) of observing the position C[k] of each finger; the factor containing (σ², Rn) in Equation (7) is the probability expressed by Equation (3) or Equation (4) above.
  • as the prior distribution, a symmetric Dirichlet distribution (Dir) expressed by Equation (8) below is assumed. β in Equation (8) is a variable that defines the shape of the symmetric Dirichlet distribution.
  • by MAP (Maximum A Posteriori) estimation, the presence or absence of the sounding of each pitch n and the finger number k can be estimated at the same time. Specifically, variational Bayesian estimation using a mean-field approximation is applied: as expressed by Equation (9), the distribution that most closely approximates the probability distribution of the posterior probability of the latent variables given the acoustic signal V, the probabilities λ t,n,k, and the finger positions C[k] is identified.
  • the performance analysis unit 30 repeats the calculations of the following formulas (10) and (11).
  • the symbol c in Equation (10) is a coefficient for normalizing the probability distribution ⁇ t,n,k so that the sum of the probability distribution ⁇ t,n, k over a plurality of finger numbers k is "1". Also, the symbol ⁇ > means an expected value.
  • the performance analysis unit 30 repeats the calculations of Equations (10) and (11) for all possible combinations of the pitch n and the finger number k for one time t on the time axis.
  • the performance analysis unit 30 adopts the result of Equation (10), obtained at the point when the calculations of Equations (10) and (11) have been repeated a predetermined number of times, as the probability distribution λ t,n,k of the latent variable w t,n,k.
  • a probability distribution ⁇ t,n,k is calculated for each time t on the time axis.
  • the performance analysis unit 30 of the fifth embodiment utilizes an HMM (Hidden Markov Model) to which the probability distribution λ t,n,k is applied, to generate a time series of combinations of the pitch n and the finger number k (that is, a time series of the fingering data Q).
  • the HMM for fingering estimation is composed of a latent state corresponding to silence of the pitch n and a plurality of latent states corresponding to sounding (key depression) with different finger numbers k. Only three types of state transitions are allowed: (1) self-transition, (2) silence → an arbitrary finger number k, and (3) an arbitrary finger number k → silence; the probabilities of all other state transitions are set to "0". The above conditions are constraints for keeping the finger number k unchanged during the period in which one note is sounded. Also, the expected value of the probability distribution λ t,n,k calculated by Equations (10) and (11) is set as the observation probability for each latent state of the HMM.
  • the performance analysis unit 30 uses the HMM described above to estimate the state series by dynamic programming such as the Viterbi algorithm. The performance analysis unit 30 generates a time series of fingering data Q according to the result of estimating the state series.
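  • For illustration only (not from the disclosure): a minimal Viterbi decoder over the constrained HMM described above for a single pitch n, where state 0 is "silent" and states 1..K are "sounding with finger number k". The observation matrix stands in for the expected values of the probability distribution, and the numerical transition probabilities are assumptions; only the three allowed transition types carry non-zero probability.

```python
import numpy as np

def viterbi_fingering(obs, p_stay=0.9):
    """obs: array of shape (T, S) of observation probabilities for the S = K + 1 states."""
    T, S = obs.shape
    A = np.zeros((S, S))
    A[np.arange(S), np.arange(S)] = p_stay          # (1) self-transitions
    A[0, 1:] = (1.0 - p_stay) / (S - 1)             # (2) silence -> any finger number k
    A[1:, 0] = 1.0 - p_stay                         # (3) any finger number k -> silence
    logA = np.log(A + 1e-12)                        # forbidden transitions stay near -inf
    logB = np.log(obs + 1e-12)
    delta = np.full((T, S), -np.inf)
    psi = np.zeros((T, S), dtype=int)
    delta[0] = logB[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + logA       # score of every predecessor state
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + logB[t]
    path = np.zeros(T, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):                  # backtrace of the best state series
        path[t] = psi[t + 1, path[t + 1]]
    return path    # 0 = silent, k >= 1 = sounding with finger number k
```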
  • the fingering data Q is generated using the acoustic signal V and the image data D1. That is, fingering data Q can be generated even in situations where performance data P cannot be obtained.
  • since the pitch n and the finger number k are estimated simultaneously using the acoustic signal V and the image data D1, the fingering can be estimated with high accuracy while reducing the processing load, compared to a configuration in which the pitch n and the finger number k are estimated individually.
  • the fifth embodiment can also be applied to the second to fourth embodiments.
  • the projective transformation unit 314 generates a transformed image from the performance image G1. That is, the projective transformation unit 314 changes the photographing conditions of the performance image G1.
  • the sixth embodiment is an image processing system 700 that uses the above functions of changing the shooting conditions of the performance image G1.
  • the performance analysis system 100 of the first to fifth embodiments can also be expressed as an image processing system 700 when focusing on the processing of the performance image G1 by the projective transformation unit 314.
  • FIG. 22 is a block diagram illustrating the functional configuration of an image processing system 700 according to the sixth embodiment.
  • the image processing system 700 includes a control device 11, a storage device 12, an operation device 13, a display device 14, and a photographing device 15, like the performance analysis system 100 of the first embodiment.
  • the imaging device 15 generates a time series of image data D1 representing the performance image G1 by imaging the keyboard instrument 200 under specific imaging conditions.
  • the storage device 12 stores a plurality of reference data Dref.
  • Each of the plurality of reference data Dref represents a reference image Gref photographing a reference musical instrument, which is a keyboard of a standard keyboard musical instrument.
  • the photographing conditions of the reference instrument differ for each reference image Gref (for each reference data Dref). Specifically, for example, one or more conditions out of the shooting range or shooting direction differ for each reference image Gref.
  • the storage device 12 also stores auxiliary data A for each reference data Dref.
  • the control device 11 implements the matrix generation unit 312, the projective transformation unit 314, and the display control unit 40 by executing the programs stored in the storage device 12.
  • the matrix generator 312 selectively uses one of the plurality of reference data Dref to generate the transformation matrix W.
  • the projective transformation unit 314 generates image data D3 of a transformed image G3 from the image data D1 of the performance image G1 by projective transformation using the transformation matrix W.
  • the display control unit 40 causes the display device 14 to display the converted image G3 represented by the image data D3.
  • FIG. 23 is a flowchart illustrating a specific procedure of processing (hereinafter referred to as "first image processing") executed by the control device 11 of the sixth embodiment.
  • first image processing is started with an instruction from the user to the operation device 13 as a trigger.
  • the control device 11 determines whether or not selection of imaging conditions has been received from the user (S31).
  • when the selection of imaging conditions is received (S31: YES), the control device 11 acquires, from among the plurality of reference data Dref stored in the storage device 12, the reference data Dref corresponding to the imaging conditions selected by the user (hereinafter referred to as the "selected reference data Dref") (S32).
  • the user's selection of imaging conditions corresponds to the operation of selecting one of a plurality of reference images Gref (reference data Dref) corresponding to different imaging conditions.
  • the control device 11 uses the selected reference data Dref to execute the same matrix generation processing as in the first embodiment (S33). Specifically, the control device 11 sets the initial matrix W0 by an initial setting process Sc1 using the selection reference data Dref. Further, the control device 11 generates a transformation matrix W through matrix update processing Sc2 that iteratively updates the initial matrix W0 so that the keyboard image g1 of the performance image G1 approaches the reference image Gref of the selected reference data Dref. On the other hand, if the selection of the imaging condition is not accepted (S31: NO), the selection of the reference data Dref (S32) and the matrix generation process (S33) are not executed.
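  • For illustration only: the disclosure describes the matrix update process Sc2 only as an iterative update that brings the keyboard image of the performance image G1 closer to the reference image Gref of the selected reference data. One existing technique with the same flavour is intensity-based homography alignment (OpenCV's ECC algorithm), sketched below under the assumption of single-channel float32 images; it is not stated that this is the algorithm actually used.

```python
import cv2
import numpy as np

def refine_matrix(perf_gray, ref_gray, W0, iterations=100, eps=1e-6):
    """perf_gray, ref_gray: single-channel float32 images; W0: 3x3 initial matrix."""
    warp = W0.astype(np.float32)
    criteria = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, iterations, eps)
    # ECC iteratively refines the homography aligning the two images; depending on
    # the library's convention, the returned matrix may need to be inverted before
    # being used to warp the performance image into the reference viewpoint.
    _, warp = cv2.findTransformECC(ref_gray, perf_gray, warp,
                                   cv2.MOTION_HOMOGRAPHY, criteria)
    return warp
```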
  • the control device 11 (projective transformation unit 314) generates a transformed image G3 by performing projective transformation processing using the transformation matrix W on the performance image G1 (S34).
  • Projective transformation processing is the same as in the first embodiment.
  • image data D3 representing the transformed image G3 is generated.
  • a converted image G3 corresponding to the same photographing conditions as the reference image Gref of the selected reference data Dref is generated from the performance image G1. That is, the converted image G3 is an image obtained by converting the photographing conditions of the performance image G1 into photographing conditions equivalent to those of the reference image Gref.
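  • For illustration only (not from the disclosure): applying a 3x3 transformation matrix W to the performance image G1 with a standard projective warp yields the converted image G3; the output size is an assumption.

```python
import cv2

def convert_image(perf_bgr, W, out_size=(1280, 720)):
    """perf_bgr: performance image G1 (H x W x 3 array); W: 3x3 transformation matrix."""
    return cv2.warpPerspective(perf_bgr, W, out_size)   # converted image G3
```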
  • the converted image G3 corresponding to the shooting conditions selected by the user is generated.
  • the control device 11 causes the display device 14 to display the transformed image G3 generated by the projective transformation process (S35).
  • the control device 11 determines whether or not the termination condition is satisfied (S36). For example, when the user instructs to end the first image processing by operating the operation device 13, the control device 11 determines that the end condition is met. If the termination condition is not satisfied (S36: NO), the control device 11 shifts the process to step S31. That is, the conversion matrix W is generated (S32-S33) and the conversion image G3 is generated and displayed (S34-S35) on the condition that the selection of the photographing conditions is accepted (S31: YES). On the other hand, if the termination condition is satisfied (S36: YES), the control device 11 terminates the first image processing.
  • as described above, in the sixth embodiment, the transformation matrix W is generated so that the keyboard image g1 in the performance image G1 approaches the reference image Gref, and the projective transformation process using the transformation matrix W is executed on the performance image G1. Therefore, the performance image G1 of the keyboard instrument 200 played by the user can be converted into a converted image G3 corresponding to the photographing conditions of the reference musical instrument in the reference image Gref.
  • any one of a plurality of reference data Dref with different imaging conditions is selectively used for matrix generation processing. Therefore, a converted image G3 corresponding to various shooting conditions can be generated from the performance image G1 shot under specific shooting conditions.
  • further, since the reference data Dref corresponding to the imaging conditions selected by the user among the plurality of reference data Dref is used for the matrix generation process, a converted image G3 corresponding to the imaging conditions desired by the user can be generated. By changing the photographing conditions of the performance image G1 as described above, it is possible to generate a converted image G3 that can be used for various purposes.
  • for example, a plurality of converted images G3 with uniform photographing conditions can be generated and used as teaching material for music lessons.
  • the image extractor 311 extracts the specific region B including the keyboard image g1 and the finger image g2 from the performance image G1.
  • the seventh embodiment is an image processing system 700 that utilizes the above functions of extracting the specific area B of the performance image G1.
  • the performance analysis system 100 of the first to fifth embodiments is also expressed as an image processing system 700 when focusing on the processing of the performance image G1 by the image extracting section 311.
  • FIG. 24 is a block diagram illustrating the functional configuration of an image processing system 700 according to the seventh embodiment.
  • the image processing system 700 includes a control device 11, a storage device 12, an operation device 13, a display device 14, and a photographing device 15, like the performance analysis system 100 of the first embodiment.
  • the imaging device 15 generates a time series of image data D1 representing the performance image G1 by imaging the keyboard instrument 200 under specific imaging conditions.
  • the performance image G1 includes a keyboard image g1 and a finger image g2, as in the above-described forms.
  • the control device 11 functions as an image extractor 311 and a display controller 40 by executing programs stored in the storage device 12 .
  • the image extraction unit 311 generates image data D2 representing a performance image G2 obtained by extracting a partial region from the performance image G1. Specifically, as in the first embodiment, the image extraction unit 311 executes an area estimation process Sb1 for generating the image processing mask M and an area extraction process Sb2 for applying the image processing mask M to the performance image G1.
  • the display control unit 40 causes the display device 14 to display the performance image G2 represented by the image data D2.
  • the single estimation model 51 is illustrated in the first embodiment.
  • the estimation model 51 used in the area estimation process Sb1 in the seventh embodiment includes a first model 511 and a second model 512.
  • Each of the first model 511 and the second model 512 is composed of a deep neural network such as a convolutional neural network or a recurrent neural network.
  • the first model 511 is a statistical model for generating the first mask representing the first region of the performance image G1.
  • the first area is an area including the keyboard image g1 in the performance image G1.
  • the finger image g2 is not included in the first area.
  • the first mask is, for example, a binary mask in which each element in the first area is set to the numerical value "1" and each element in the area other than the first area is set to the numerical value "0".
  • the image extraction unit 311 generates the first mask by inputting the image data D1 representing the performance image G1 to the first model 511.
  • that is, the first model 511 is a trained model that has learned the relationship between the image data D1 and the first mask (first region) by machine learning.
  • the second model 512 is a statistical model for generating a second mask representing the second area of the performance image G1.
  • the second area is an area including the finger image g2 in the performance image G1.
  • the keyboard image g1 is not included in the second area.
  • the second mask is, for example, a binary mask in which each element in the second area is set to the numerical value "1" and each element in the area other than the second area is set to the numerical value "0".
  • the image extraction unit 311 generates a second mask by inputting the image data D1 representing the performance image G1 to the second model 512.
  • that is, the second model 512 is a trained model that has learned the relationship between the image data D1 and the second mask (second region) by machine learning.
  • FIG. 25 is a flowchart illustrating a specific procedure of processing (hereinafter referred to as "second image processing") executed by the control device 11 of the seventh embodiment.
  • the second image processing is started with an instruction from the user to the operation device 13 as a trigger.
  • the control device 11 executes the region estimation processing Sb1 (S41-S43).
  • the area estimation process Sb1 of the seventh embodiment includes a first estimation process (S41), a second estimation process (S42), and an area combining process (S43).
  • the first estimation process is a process of estimating the first area of the performance image G1. Specifically, the control device 11 inputs the image data D1 representing the performance image G1 to the first model 511 to generate the first mask representing the first region (S41).
  • the second estimation process is a process of estimating the second area of the performance image G1. Specifically, the control device 11 inputs the image data D1 representing the performance image G1 to the second model 512 to generate a second mask representing the second region (S42).
  • the area synthesizing process is a process of generating an image processing mask M representing the specific area B including the first area and the second area.
  • the specific area B represented by the image processing mask M corresponds to the sum of the first area and the second area. That is, the control device 11 generates the image processing mask M by synthesizing the first mask and the second mask (S43).
  • the image processing mask M is a binary mask for extracting the specific region B containing the keyboard image g1 and the finger image g2 from the performance image G1, as in the first embodiment.
  • the control device 11 uses the image processing mask M generated in the area estimation process Sb1 to execute the area extraction process Sb2 similar to that of the first embodiment (S44). That is, the control device 11 extracts the specific area B from the performance image G1 represented by the image data D1 using the image processing mask M, thereby generating the image data D2 representing the performance image G2.
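  • For illustration only (not from the disclosure): a minimal sketch of the region synthesis (S43) as the union of the two binary masks and of the region extraction (S44) as a pixel-wise application of the resulting mask; mask values are assumed to be 0 or 1.

```python
import numpy as np

def extract_specific_region(perf_rgb, mask_keyboard, mask_fingers):
    """perf_rgb: H x W x 3 image; mask_keyboard, mask_fingers: H x W arrays of 0/1."""
    mask_m = np.maximum(mask_keyboard, mask_fingers)   # image processing mask M (union)
    return perf_rgb * mask_m[..., None]                # pixels outside region B become 0
```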
  • the control device 11 causes the display device 14 to display the performance image G2 generated by the region extraction processing Sb2 (S45).
  • the control device 11 determines whether or not the termination condition is satisfied (S46). For example, when the user instructs to end the second image processing by operating the operation device 13, the control device 11 determines that the end condition is satisfied. If the termination condition is not satisfied (S46: NO), the control device 11 shifts the process to step S41. That is, the area estimation process Sb1 (S41 to S43), the area extraction process Sb2 (S44), and the display of the performance image G2 (S45) are executed. On the other hand, if the termination condition is satisfied (S46: YES), the control device 11 terminates the second image processing.
  • a specific area B including the keyboard image g1 is extracted from the performance image G1. Therefore, it is possible to improve the convenience of the performance image G1.
  • a specific area B including the keyboard image g1 and the finger image g2 is extracted from the performance image G1. Therefore, it is possible to generate a performance image G2 in which the appearance of the keyboard 22 of the keyboard instrument 200 and the appearance of the user's fingers can be efficiently visually recognized.
  • in the seventh embodiment, the first region of the performance image G1 including the keyboard image g1 is estimated by the first model 511, and the second region of the performance image G1 including the finger image g2 is estimated by the second model 512. Therefore, the specific region B including the keyboard image g1 and the finger image g2 can be extracted with high precision compared to a configuration in which a single estimation model 51 collectively extracts both the keyboard image g1 and the finger image g2. Also, since each of the first model 511 and the second model 512 is established by individual machine learning, the processing load related to the machine learning of the first model 511 and the second model 512 is reduced.
  • the first mode is an operation mode for extracting both the keyboard image g1 and the finger image g2 from the performance image G1. That is, in the first mode, the image extraction section 311 executes both the first estimation process and the second estimation process. Therefore, an image processing mask M representing the specific region B is generated as in the seventh embodiment. That is, in the first mode, a specific area B including both the keyboard image g1 and the finger image g2 is extracted from the performance image G1.
  • the second mode is an operation mode for extracting the keyboard image g1 from the performance image G1. That is, in the second mode, the image extraction unit 311 executes the first estimation process but does not execute the second estimation process. That is, the first mask generated by the first estimation process is determined as the image processing mask M applied to the area extraction process Sb2. Therefore, in the second mode, the keyboard image g1 is extracted from the performance image G1.
  • a form is also assumed in which, in the second mode, the image extraction unit 311 executes the second estimation process and does not execute the first estimation process. In that case, the finger image g2 is extracted from the performance image G1 in the second mode. As understood from the above examples, the second mode is comprehensively expressed as an operation mode in which one of the first estimation process and the second estimation process is executed.
  • in each of the above embodiments, the matrix generation process is executed for the performance image G2 after the image extraction process (FIG. 8), but the matrix generation process may be performed on the performance image G1 before the image extraction process. That is, the image extraction process (the image extraction section 311) for generating the performance image G2 from the performance image G1 may be omitted.
  • the finger position estimation processing using the performance image G1 has been exemplified in each of the above embodiments, the finger position estimation processing may be executed using the performance image G2 after processing by the image extraction processing. That is, the position C[h,f] of each finger of the user may be estimated by analyzing the performance image G2. Further, in each of the above embodiments, the projective transformation process is performed on the performance image G1, but the projective transformation process may be performed on the performance image G2 after the image extraction process. That is, a transformed image may be generated by projective transformation of the performance image G2.
  • in each of the above embodiments, the position c[h,f] of each finger of the user is transformed into the position C[h,f] in the X-Y coordinate system by the projective transformation process, but finger position data F representing the position c[h,f] may be generated instead. That is, the projective transformation process (projective transformation unit 314) for transforming the position c[h,f] into the position C[h,f] may be omitted.
  • in each of the above embodiments, the transformation matrix W generated immediately after the start of the performance analysis process is used continuously in the subsequent processes, but the transformation matrix W may be updated at an appropriate point during the execution of the performance analysis process. For example, when a change in the position of the photographing device 15 (hereinafter referred to as a "positional change") occurs, the transformation matrix W is updated.
  • specifically, the matrix generation unit 312 generates a transformation matrix Φ that represents the positional change (displacement) of the photographing device 15.
  • specifically, the matrix generation unit 312 generates the transformation matrix Φ such that the normalized x-coordinate calculated by Equation (12) from the x-coordinate of a specific point after the positional change approximates or matches the x-coordinate of the corresponding point in the performance image G before the positional change, and the normalized y-coordinate calculated by Equation (12) from the y-coordinate of that point approximates or matches the y-coordinate of the corresponding point in the performance image G before the positional change.
  • the matrix generation unit 312 generates the product WΦ of the transformation matrix W before the positional change and the transformation matrix Φ representing the positional change as the initial matrix W0, and updates the initial matrix W0 by the matrix update process Sc2 to generate the transformation matrix W.
  • the transformation matrix W after the position change is generated using the transformation matrix W calculated before the position change and the transformation matrix ⁇ representing the position change. Therefore, it is possible to generate a transformation matrix W that can specify the position C[h, f] of each finger with high accuracy while reducing the load of the matrix generation process.
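  • For illustration only (not from the disclosure): composing the matrix W used before the positional change with the matrix Φ representing the change gives the new initial matrix; whether the product is written W @ Phi or Phi @ W depends on the coordinate convention of Equation (12), which is not reproduced here.

```python
import numpy as np

def new_initial_matrix(W_before, Phi):
    W0 = W_before @ Phi            # product of the two 3x3 matrices (order is an assumption)
    return W0 / W0[2, 2]           # keep the usual homogeneous normalisation
```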
  • the above modification has been described assuming the first to fifth embodiments, but it may be applied in the same manner to the other embodiments.
  • in each of the above embodiments, the keyboard instrument 200 including the keyboard 22 is illustrated, but the present disclosure can be applied to any type of musical instrument. Each of the above aspects is similarly applied to any musical instrument that can be manually operated by the user, such as a stringed instrument, a wind instrument, or a percussion instrument. A typical example of such a musical instrument is a type of musical instrument played by the user with the fingers of one hand or both hands.
  • the performance analysis system 100 may be realized by a server device that communicates with an information device such as a smartphone or a tablet terminal. For example, performance data P generated by a keyboard instrument 200 connected to the information device and image data D1 generated by a photographing device 15 mounted on or connected to the information device are transmitted from the information device to the performance analysis system 100.
  • the performance analysis system 100 generates fingering data Q by executing performance analysis processing on performance data P and image data D1 received from the information device, and transmits the fingering data Q to the information device.
  • the image processing system 700 exemplified in the sixth or seventh embodiment may also be realized by a server device that communicates with the information device.
  • as described above, the functions of the performance analysis system 100 according to the first to fifth embodiments or of the image processing system 700 according to the sixth and seventh embodiments are realized by cooperation between the single or plural processors constituting the control device 11 and the programs stored in the storage device 12.
  • a program according to the present disclosure may be provided in a form stored in a computer-readable recording medium and installed in a computer.
  • the recording medium is, for example, a non-transitory recording medium, of which an optical recording medium (optical disc) such as a CD-ROM is a good example, and it also includes recording media of any other known form such as a semiconductor recording medium or a magnetic recording medium. The non-transitory recording medium includes any recording medium other than a transitory, propagating signal, and does not exclude volatile recording media. Further, when a distribution device distributes the program, the storage device 12 that stores the program in the distribution device corresponds to the above-described non-transitory recording medium.
  • an image processing method according to one aspect of the present disclosure estimates, in a performance image including an image of a musical instrument and images of a plurality of fingers of a user playing the musical instrument, a specific region including the image of the musical instrument, and extracts the specific region from the performance image.
  • the specific region including the image of the musical instrument is extracted from the performance image including the image of the musical instrument and the images of a plurality of fingers of the user. Therefore, it is possible to improve the convenience of performance images.
  • the specific area is an area including an image of the musical instrument and an image of at least a part of the user's body.
  • the specific region including the image of the musical instrument and the image of the user's body is extracted. Therefore, it is possible to generate an image in which the appearance of the musical instrument and the appearance of the user's body can be efficiently visually recognized.
  • in the estimation of the specific region, an image processing mask representing the specific region is generated by inputting image data representing the performance image into a machine-learned estimation model.
  • the specific region is extracted by applying the image processing mask to the performance image.
  • the image processing mask representing the specific region is generated by inputting the image data of the performance image into the machine-learned estimation model. Therefore, the specific region can be specified with high precision for various unknown performance images.
  • the estimation model includes a first model and a second model, and the estimation of the specific region includes: a first estimation process of estimating a first region including the image of the musical instrument in the performance image by inputting image data representing the performance image into the first model; a second estimation process of estimating a second region including the images of the fingers of the user in the performance image by inputting image data representing the performance image into the second model; and a region synthesizing process of generating the image processing mask representing the specific region including the first region and the second region.
  • the first region of the performance image including the image of the musical instrument is estimated by the first model
  • the second region of the performance image including the image of the user is estimated by the second model.
  • switching is possible between a first mode in which both the first estimation process and the second estimation process are performed and a second mode in which one of the first estimation process and the second estimation process is performed.
  • in the first mode, a specific region including the image of the musical instrument and the image of the user is extracted from the performance image.
  • in the second mode, a specific region including one of the image of the musical instrument and the image of the user is extracted from the performance image. As described above, it is possible to easily switch the extraction target from the performance image.
  • an image processing system according to one aspect of the present disclosure comprises a region estimation unit for estimating, in a performance image including an image of a musical instrument and images of a plurality of fingers of a user playing the musical instrument, a specific region including the image of the musical instrument, and a region extraction unit for extracting the specific region from the performance image.
  • a program according to one aspect (aspect 7) of the present disclosure causes a computer system to function as: a region estimation unit for estimating, in a performance image including an image of a musical instrument and images of a plurality of fingers of a user playing the musical instrument, a specific region including the image of the musical instrument; and a region extraction unit for extracting the specific region from the performance image.
  • Reference signs: 100: Performance analysis system, 11: Control device, 12: Storage device, 13: Operation device, 14: Display device, 15: Photographing device, 200: Keyboard instrument, 21: Key, 22: Keyboard, 30: Performance analysis unit, 31: Finger position data generation unit, 311: Image extraction unit, 312: Matrix generation unit, 313: Finger position estimation unit, 314: Projective transformation unit, 32: Fingering data generation unit, 321: Probability calculation unit, 322: Fingering estimation unit, 323: Control data generation unit, 40: Display control unit, 51: Estimation model, 51a: Temporary model, 52[k]: Estimation model, 700: Image processing system.


Abstract

This performance analysis system (100) is equipped with: a region estimation unit for estimating a specific region including an image of an instrument within performance images that include an image of the instrument and an image of a plurality of fingers of a user performing on the instrument; and a region extraction unit that extracts the specific region from within the performance images.

Description

Image processing method, image processing system, and program

The present disclosure relates to technology for analyzing performances by users.

For example, techniques for estimating an area in which a specific object exists in an image captured by an imaging device have conventionally been proposed. For example, Patent Literature 1 discloses a technique for detecting an object using a deep neural network.

Japanese Translation of PCT Application Publication No. 2020-528176; U.S. Patent Application Publication No. 2021/0248788

If a specific region, such as the keyboard region, can be extracted from a performance image obtained by photographing the performance of a musical instrument such as a keyboard instrument, the image can conveniently be used, for example, for analyzing the user's fingering. In view of the above circumstances, one aspect of the present disclosure aims to improve the convenience of performance images.

In order to solve the above problems, an image processing method according to one aspect of the present disclosure estimates, in a performance image including an image of a musical instrument and images of a plurality of fingers of a user playing the musical instrument, a specific region including the image of the musical instrument, and extracts the specific region from the performance image.

An image processing system according to one aspect of the present disclosure comprises a region estimation unit for estimating, in a performance image including an image of a musical instrument and images of a plurality of fingers of a user playing the musical instrument, a specific region including the image of the musical instrument, and a region extraction unit for extracting the specific region from the performance image.

A program according to one aspect of the present disclosure causes a computer system to function as a region estimation unit for estimating, in a performance image including an image of a musical instrument and images of a plurality of fingers of a user playing the musical instrument, a specific region including the image of the musical instrument, and as a region extraction unit for extracting the specific region from the performance image.
FIG. 1 is a block diagram illustrating the configuration of a performance analysis system according to a first embodiment.
FIG. 2 is a schematic diagram of a performance image.
FIG. 3 is a block diagram illustrating the functional configuration of the performance analysis system.
FIG. 4 is a schematic diagram of an analysis screen.
FIG. 5 is a flowchart of finger position estimation processing.
FIG. 6 is a flowchart of left/right determination processing.
FIG. 7 is an explanatory diagram of image extraction processing.
FIG. 8 is a flowchart of image extraction processing.
FIG. 9 is an explanatory diagram of machine learning for establishing an estimation model.
FIG. 10 is a schematic diagram of a reference image.
FIG. 11 is a flowchart of matrix generation processing.
FIG. 12 is a flowchart of initial setting processing.
FIG. 13 is a schematic diagram of a setting screen.
FIG. 14 is a flowchart of performance analysis processing.
FIG. 15 is an explanatory diagram relating to the problem of fingering estimation.
FIG. 16 is a block diagram illustrating the configuration of a performance analysis system in a second embodiment.
FIG. 17 is a schematic diagram of control data in the second embodiment.
FIG. 18 is a flowchart of performance analysis processing in the second embodiment.
FIG. 19 is a flowchart of performance analysis processing in a third embodiment.
FIG. 20 is a flowchart of initial setting processing in a fourth embodiment.
FIG. 21 is a block diagram illustrating the configuration of a performance analysis system in a fifth embodiment.
FIG. 22 is a block diagram illustrating the functional configuration of an image processing system in a sixth embodiment.
FIG. 23 is a flowchart of first image processing in the sixth embodiment.
FIG. 24 is a block diagram illustrating the functional configuration of an image processing system in a seventh embodiment.
FIG. 25 is a flowchart of second image processing in the seventh embodiment.
1: First Embodiment
FIG. 1 is a block diagram illustrating the configuration of a performance analysis system 100 according to the first embodiment. A keyboard instrument 200 is connected to the performance analysis system 100 by wire or wirelessly. The keyboard instrument 200 is an electronic instrument having a keyboard 22 on which a plurality of (N) keys 21 are arranged. Each of the plurality of keys 21 of the keyboard 22 corresponds to a different pitch n (n=1 to N). A user (that is, a performer) sequentially operates desired keys 21 of the keyboard instrument 200 with his or her left and right hands. The keyboard instrument 200 supplies performance data P representing the performance by the user to the performance analysis system 100. The performance data P is time-series data specifying the pitch n of each of a plurality of notes played in sequence by the user. For example, the performance data P is data in a format conforming to the MIDI (Musical Instrument Digital Interface) standard.
The performance analysis system 100 is a computer system that analyzes the performance of the keyboard instrument 200 by the user. Specifically, the performance analysis system 100 analyzes the user's fingering. Fingering is the manner in which the user uses the fingers of the left and right hands (i.e., finger use) in playing the keyboard instrument 200. That is, the information as to which finger the user uses to operate each key 21 of the keyboard instrument 200 is analyzed as the fingering of the user.
The performance analysis system 100 includes a control device 11, a storage device 12, an operation device 13, a display device 14, and a photographing device 15. The performance analysis system 100 is realized by, for example, a portable information device such as a smartphone or a tablet terminal, or a portable or stationary information device such as a personal computer. The performance analysis system 100 can be realized as a single device, or as a plurality of devices configured separately from each other. Also, the performance analysis system 100 may be installed in the keyboard instrument 200.
The control device 11 is composed of one or more processors that control each element of the performance analysis system 100. For example, the control device 11 is composed of one or more types of processors such as a CPU (Central Processing Unit), an SPU (Sound Processing Unit), a DSP (Digital Signal Processor), an FPGA (Field Programmable Gate Array), or an ASIC (Application Specific Integrated Circuit).
The storage device 12 is a single memory or a plurality of memories that store the programs executed by the control device 11 and various data used by the control device 11. The storage device 12 is composed of a known recording medium such as a magnetic recording medium or a semiconductor recording medium, or of a combination of a plurality of types of recording media. A portable recording medium that can be attached to and detached from the performance analysis system 100, or a recording medium (for example, cloud storage) that the control device 11 can write to or read from via a communication network such as the Internet, may also be used as the storage device 12.
The operation device 13 is an input device that receives instructions from the user. The operation device 13 is, for example, an operator operated by the user or a touch panel that detects contact by the user. An operation device 13 (for example, a mouse or a keyboard) separate from the performance analysis system 100 may be connected to the performance analysis system 100 by wire or wirelessly.
The display device 14 displays images under the control of the control device 11. For example, various display panels such as a liquid crystal display panel or an organic EL (Electroluminescence) panel are used as the display device 14. A display device 14 separate from the performance analysis system 100 may be connected to the performance analysis system 100 by wire or wirelessly.
The photographing device 15 is an image input device that generates a time series of image data D1 by photographing a subject. The time series of the image data D1 is moving-image data representing a moving image. For example, the photographing device 15 includes an optical system such as a photographing lens, an imaging element that receives incident light from the optical system, and a processing circuit that generates the image data D1 according to the amount of light received by the imaging element. A photographing device 15 separate from the performance analysis system 100 may be connected to the performance analysis system 100 by wire or wirelessly.
The user adjusts the position or angle of the photographing device 15 with respect to the keyboard instrument 200 so that the photographing conditions recommended by the provider of the performance analysis system 100 are realized. Specifically, the photographing device 15 is installed above the keyboard instrument 200 and photographs the keyboard 22 of the keyboard instrument 200 and the user's left and right hands. Therefore, as illustrated in FIG. 2, a time series of image data D1 representing a performance image G1 including an image g1 of the keyboard 22 of the keyboard instrument 200 (hereinafter referred to as the "keyboard image") and an image g2 of the user's left and right hands (hereinafter referred to as the "finger image") is generated by the photographing device 15. That is, moving-image data representing a moving image of the user playing the keyboard instrument 200 is generated in parallel with the performance. The photographing conditions of the photographing device 15 are, for example, the photographing range and the photographing direction. The photographing range is the range (angle of view) photographed by the photographing device 15. The photographing direction is the direction of the photographing device 15 with respect to the keyboard instrument 200.
FIG. 3 is a block diagram illustrating the functional configuration of the performance analysis system 100. The control device 11 functions as a performance analysis section 30 and a display control section 40 by executing the programs stored in the storage device 12. The performance analysis section 30 analyzes the performance data P and the image data D1 to generate fingering data Q representing the user's fingering. The fingering data Q designates with which of the user's fingers each of the plurality of keys 21 of the keyboard instrument 200 is operated. Specifically, the fingering data Q designates the pitch n corresponding to the key 21 operated by the user and the number of the finger used by the user to operate the key 21 (hereinafter referred to as the "finger number") k. The pitch n is, for example, a note number in the MIDI standard. The finger number k is a number assigned to each finger of the user's left hand and right hand.
The display control section 40 causes the display device 14 to display various images. For example, the display control section 40 causes the display device 14 to display an image 61 representing the result of the analysis by the performance analysis section 30 (hereinafter referred to as the "analysis screen"). FIG. 4 is a schematic diagram of the analysis screen 61. The analysis screen 61 is an image in which a plurality of note images 611 are arranged on a coordinate plane on which a horizontal time axis and a vertical pitch axis are set. A note image 611 is displayed for each note played by the user. The position of the note image 611 in the direction of the pitch axis is set according to the pitch n of the note represented by the note image 611. The position and total length of the note image 611 in the direction of the time axis are set according to the sounding period of the note represented by the note image 611.
In the note image 611 of each note, a code 612 corresponding to the finger number k that the fingering data Q specifies for that note (hereinafter referred to as the "fingering code") is arranged. The letter "L" of the fingering code 612 means the left hand, and the letter "R" of the fingering code 612 means the right hand. The numbers of the fingering code 612 mean the individual fingers. Specifically, the number "1" of the fingering code 612 means the thumb, the number "2" means the index finger, the number "3" means the middle finger, the number "4" means the ring finger, and the number "5" means the little finger. Therefore, for example, the fingering code 612 "R2" means the index finger of the right hand, and the fingering code 612 "L4" means the ring finger of the left hand. The note images 611 and the fingering codes 612 are displayed in different modes (for example, hue or gradation) for the right hand and the left hand. The display control section 40 uses the fingering data Q to display the analysis screen 61 of FIG. 4 on the display device 14.
Among the plurality of note images 611 in the analysis screen 61, a note for which the reliability of the estimation result of the finger number k is low is displayed with a note image 611 in a mode different from that of the normal note image 611 (for example, with a dashed frame line), and a specific code meaning that the estimation result of the finger number k is invalid, for example "??", is displayed.
As illustrated in FIG. 3, the performance analysis section 30 includes a finger position data generation unit 31 and a fingering data generation unit 32. The finger position data generation unit 31 generates finger position data F by analyzing the performance image G1. The finger position data F is data representing the position of each finger of the user's left hand and the position of each finger of the user's right hand. As described above, in the first embodiment, the positions of the user's fingers are distinguished between the left hand and the right hand, so that fingering that distinguishes between the user's left hand and right hand can be estimated. The fingering data generation unit 32 generates the fingering data Q using the performance data P and the finger position data F. The finger position data F and the fingering data Q are generated for each unit period on the time axis. Each unit period is a period (frame) of a predetermined length.
A: Finger position data generation unit 31
The finger position data generation unit 31 includes an image extraction unit 311, a matrix generation unit 312, a finger position estimation unit 313, and a projective transformation unit 314.
[Finger position estimation unit 313]
The finger position estimation unit 313 estimates the positions c[h,f] of the fingers of the user's left hand and right hand by analyzing the performance image G1 represented by the image data D1. The position c[h,f] of each finger is the position of the corresponding fingertip in the x-y coordinate system set in the performance image G1. The position c[h,f] is expressed as the combination (x[h,f], y[h,f]) of the coordinate x[h,f] on the x-axis and the coordinate y[h,f] on the y-axis of the x-y coordinate system of the performance image G1. The positive direction of the x-axis corresponds to the rightward direction of the keyboard 22 (from low pitches toward high pitches), and the negative direction of the x-axis corresponds to the leftward direction of the keyboard 22 (from high pitches toward low pitches). The symbol h is a variable indicating either the left hand or the right hand (h = 1, 2). Specifically, the value "1" of the variable h means the left hand, and the value "2" means the right hand. The variable f is the number of each finger of the left hand or the right hand (f = 1 to 5). The value "1" of the variable f means the thumb, "2" the index finger, "3" the middle finger, "4" the ring finger, and "5" the little finger. Thus, for example, the position c[1,2] illustrated in FIG. 2 is the position of the fingertip of the index finger (f = 2) of the left hand (h = 1), and the position c[2,4] is the position of the fingertip of the ring finger (f = 4) of the right hand (h = 2).
FIG. 5 is a flowchart illustrating a specific procedure of the process by which the finger position estimation unit 313 estimates the position of each finger of the user (hereinafter referred to as the "finger position estimation process"). The finger position estimation process includes an image analysis process Sa1, a left/right determination process Sa2, and an interpolation process Sa3.
The image analysis process Sa1 is a process of estimating, by analysis of the performance image G1, the position c[h,f] of each finger of one of the user's left and right hands (hereinafter referred to as the "first hand") and the position c[h,f] of each finger of the other of the user's left and right hands (hereinafter referred to as the "second hand"). Specifically, the finger position estimation unit 313 estimates the positions c[h,1] to c[h,5] of the fingers of the first hand and the positions c[h,1] to c[h,5] of the fingers of the second hand by an image recognition process that estimates the user's skeleton or joints from the image. A known image recognition process such as MediaPipe or OpenPose is used for the image analysis process Sa1. If a fingertip is not detected in the performance image G1, the coordinate x[h,f] of that fingertip on the x-axis is set to an invalid value such as "0".
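As one possibility for the image analysis process Sa1, a minimal sketch of obtaining fingertip pixel coordinates with MediaPipe Hands is shown below; the function name, the parameter choices, and the mapping of landmark indices to finger numbers f are illustrative assumptions rather than part of the embodiment.

```python
import cv2
import mediapipe as mp

mp_hands = mp.solutions.hands

# Fingertip landmark indices in MediaPipe Hands: thumb=4, index=8, middle=12, ring=16, little=20.
FINGERTIP_IDS = {1: 4, 2: 8, 3: 12, 4: 16, 5: 20}

def detect_fingertips(frame_bgr):
    """Return, for each detected hand, a dict mapping finger number f to fingertip pixel coordinates."""
    height, width, _ = frame_bgr.shape
    hands = []
    with mp_hands.Hands(static_image_mode=True, max_num_hands=2) as detector:
        result = detector.process(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
        for landmarks in (result.multi_hand_landmarks or []):
            tips = {}
            for f, idx in FINGERTIP_IDS.items():
                lm = landmarks.landmark[idx]
                # MediaPipe landmarks are normalized to [0, 1]; convert to pixel coordinates.
                tips[f] = (lm.x * width, lm.y * height)
            hands.append(tips)
    return hands
```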
In the image analysis process Sa1, the positions c[h,1] to c[h,5] of the fingers of the user's first hand and the positions c[h,1] to c[h,5] of the fingers of the second hand are estimated, but it cannot be determined whether each of the first hand and the second hand corresponds to the user's left hand or right hand. In addition, when playing the keyboard instrument 200, the user's right arm and left arm may cross, so it is not appropriate to determine the left hand or right hand only from the coordinates x[h,f] of the positions c[h,f] estimated by the image analysis process Sa1. If a portion including the user's arms and torso were captured by the imaging device 15, the user's left hand and right hand could be estimated from the performance image G1 based on the coordinates of the user's shoulders and arms. However, the imaging device 15 would then need to capture a wide range, and the processing load of the image analysis process Sa1 would increase.
In view of the above circumstances, the finger position estimation unit 313 of the first embodiment executes the left/right determination process Sa2 of FIG. 5, which determines whether each of the first hand and the second hand corresponds to the user's left hand or right hand. That is, the finger position estimation unit 313 fixes the variable h in the finger positions c[h,f] of the first hand and the second hand to either the value "1" meaning the left hand or the value "2" meaning the right hand.
When the keyboard instrument 200 is being played, the backs of both the left hand and the right hand face vertically upward, so the performance image G1 captured by the imaging device 15 includes images of the backs of both the user's left and right hands. Therefore, for the left hand in the performance image G1, the thumb position c[h,1] is located to the right of the little finger position c[h,5], and for the right hand in the performance image G1, the thumb position c[h,1] is located to the left of the little finger position c[h,5]. In view of these circumstances, in the left/right determination process Sa2, the finger position estimation unit 313 determines that, of the first hand and the second hand, the hand whose thumb position c[h,1] is located to the right (the positive direction of the x-axis) of its little finger position c[h,5] is the left hand (h = 1). Conversely, the finger position estimation unit 313 determines that, of the first hand and the second hand, the hand whose thumb position c[h,1] is located to the left (the negative direction of the x-axis) of its little finger position c[h,5] is the right hand.
FIG. 6 is a flowchart illustrating a specific procedure of the left/right determination process Sa2. The finger position estimation unit 313 calculates a judgment index γ[h] for each of the first hand and the second hand (Sa21). The judgment index γ[h] is calculated, for example, by Equation (1) below.

[Equation (1): the judgment index γ[h], defined in terms of the fingertip coordinates x[h,1] to x[h,5] and their mean μ[h].]

The symbol μ[h] in Equation (1) is the average (for example, a simple average) of the coordinates x[h,1] to x[h,5] of the five fingers of the first hand or the second hand. As can be understood from Equation (1), when the coordinate x[h,f] decreases from the thumb to the little finger (left hand), the judgment index γ[h] is negative, and when the coordinate x[h,f] increases from the thumb to the little finger (right hand), the judgment index γ[h] is positive. Therefore, the finger position estimation unit 313 determines that, of the first hand and the second hand, the hand whose judgment index γ[h] is negative is the left hand, and sets the variable h to the value "1" (Sa22). The finger position estimation unit 313 also determines that, of the first hand and the second hand, the hand whose judgment index γ[h] is positive is the right hand, and sets the variable h to the value "2" (Sa23). According to the left/right determination process Sa2 described above, the positions c[h,f] of the user's fingers can be assigned to the right hand and the left hand by a simple process that uses the relationship between the thumb position and the little finger position.
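A minimal sketch of the left/right determination process Sa2 follows, assuming a covariance-like form for the judgment index γ[h] that is consistent with the sign behavior described above; the concrete formula and the function names are illustrative assumptions.

```python
import numpy as np

def judgment_index(x):
    """x: fingertip x-coordinates ordered thumb..little finger (x[h,1]..x[h,5]).

    Assumed form: sum over fingers of (f - 3) * (x[h,f] - mean), which is negative when the
    coordinate decreases from thumb to little finger (left hand) and positive when it increases
    (right hand), matching the behavior attributed to Equation (1).
    """
    x = np.asarray(x, dtype=float)
    mu = x.mean()                 # μ[h]
    f = np.arange(1, 6)           # finger numbers 1..5
    return float(np.sum((f - 3) * (x - mu)))

def assign_hands(hand_a, hand_b):
    """Given the fingertip x-coordinates of the two detected hands, return (left_hand, right_hand)."""
    return (hand_a, hand_b) if judgment_index(hand_a) < 0 else (hand_b, hand_a)
```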
The position c[h,f] of each finger of the user is estimated for each unit period by the image analysis process Sa1 and the left/right determination process Sa2. However, the position c[h,f] may not be estimated properly due to various circumstances such as noise in the performance image G1. Therefore, when the position c[h,f] is missing in a particular unit period (hereinafter referred to as a "missing period"), the finger position estimation unit 313 calculates the position c[h,f] in the missing period by an interpolation process Sa3 that uses the positions c[h,f] in the unit periods before and after the missing period. For example, if the position c[h,f] is missing in the middle unit period (the missing period) of three consecutive unit periods on the time axis, the average of the position c[h,f] in the unit period immediately before the missing period and the position c[h,f] in the unit period immediately after it is calculated as the position c[h,f] in the missing period.
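A minimal sketch of this interpolation for a single fingertip coordinate track, assuming missing samples are represented as NaN (that representation is an assumption):

```python
import numpy as np

def fill_missing_periods(track):
    """track: 1-D array with one fingertip coordinate per unit period, NaN where estimation failed.

    Each missing sample surrounded by valid neighbours is replaced by the average of the
    immediately preceding and immediately following unit periods, as in interpolation process Sa3.
    """
    track = np.asarray(track, dtype=float).copy()
    for t in range(1, len(track) - 1):
        if np.isnan(track[t]) and not np.isnan(track[t - 1]) and not np.isnan(track[t + 1]):
            track[t] = 0.5 * (track[t - 1] + track[t + 1])
    return track
```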
[Image extraction unit 311]
As described above, the performance image G1 includes the keyboard image g1 and the finger image g2. As illustrated in FIG. 7, the image extraction unit 311 of FIG. 3 extracts a specific area (hereinafter referred to as the "specific area") B from the performance image G1. The specific area B is the area of the performance image G1 that includes the keyboard image g1 and the finger image g2. The finger image g2 corresponds to an image of at least part of the user's body.
FIG. 8 is a flowchart illustrating a specific procedure of the process by which the image extraction unit 311 extracts the specific area B from the performance image G1 (hereinafter referred to as the "image extraction process"). The image extraction process includes an area estimation process Sb1 and an area extraction process Sb2.
The area estimation process Sb1 is a process of estimating the specific area B for the performance image G1 represented by the image data D1. Specifically, in the area estimation process Sb1, the image extraction unit 311 generates an image processing mask M representing the specific area B from the image data D1. As illustrated in FIG. 7, the image processing mask M is a mask of the same size as the performance image G1, composed of a plurality of elements corresponding to the respective pixels of the performance image G1. Specifically, the image processing mask M is a binary mask in which each element in the area corresponding to the specific area B of the performance image G1 is set to "1" and each element in the area other than the specific area B is set to "0". By the control device 11 executing the area estimation process Sb1, an element (area estimation unit) that estimates the specific area B of the performance image G1 is realized.
As illustrated in FIG. 3, an estimation model 51 is used for the generation of the image processing mask M by the image extraction unit 311. That is, the image extraction unit 311 generates the image processing mask M by inputting the image data D1 representing the performance image G1 into the estimation model 51. The estimation model 51 is a statistical model that has learned the relationship between image data D1 and image processing masks M by machine learning. The estimation model 51 is composed of, for example, a deep neural network (DNN). For example, a deep neural network of any form, such as a convolutional neural network (CNN) or a recurrent neural network (RNN), is used as the estimation model 51. The estimation model 51 may also be configured as a combination of multiple types of deep neural networks. Additional elements such as long short-term memory (LSTM) units may also be incorporated into the estimation model 51.
FIG. 9 is an explanatory diagram of the machine learning that establishes the estimation model 51. For example, the estimation model 51 is established by machine learning in a machine learning system 900 separate from the performance analysis system 100, and the estimation model 51 is provided to the performance analysis system 100. The machine learning system 900 is, for example, a server system capable of communicating with the performance analysis system 100 via a communication network such as the Internet. The estimation model 51 is transmitted from the machine learning system 900 to the performance analysis system 100 via the communication network.
A plurality of pieces of learning data T are used for the machine learning of the estimation model 51. Each piece of learning data T is composed of a combination of learning image data Dt and a learning image processing mask Mt. The image data Dt represents a known image including a keyboard image g1 of a keyboard instrument and an image of the surroundings of that keyboard instrument. The model of the keyboard instrument and the imaging conditions (for example, imaging range and imaging direction) differ for each piece of image data Dt. That is, the image data Dt are prepared in advance by imaging each of a plurality of types of keyboard instruments under different imaging conditions. The image data Dt may also be prepared by a known image synthesis technique. The image processing mask Mt of each piece of learning data T is a mask representing the specific area B in the known image represented by the image data Dt of that learning data T. Specifically, the elements of the image processing mask Mt in the area corresponding to the specific area B are set to "1", and the elements in the area other than the specific area B are set to "0". That is, the image processing mask Mt represents the correct answer that the estimation model 51 should output in response to the input of the image data Dt.
The machine learning system 900 calculates an error function representing the error between the image processing mask M output by an initial or provisional model (hereinafter referred to as the "provisional model") 51a when the image data Dt of each piece of learning data T is input, and the image processing mask Mt of that learning data T. The machine learning system 900 then updates the plurality of variables of the provisional model 51a so that the error function is reduced. The provisional model 51a at the point when the above processing has been repeated for each of the plurality of pieces of learning data T is finalized as the estimation model 51. Therefore, under the relationship latent between the image data Dt and the image processing masks Mt of the plurality of pieces of learning data T, the estimation model 51 outputs a statistically valid image processing mask M for unknown image data D1. That is, the estimation model 51 is a trained model that has learned the relationship between image data Dt and image processing masks Mt.
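A minimal sketch of this update loop, assuming a small convolutional network as the provisional model 51a and binary cross-entropy as the error function (both the architecture and the loss are illustrative assumptions):

```python
import torch
import torch.nn as nn

# Deliberately small fully-convolutional network standing in for the provisional model 51a.
provisional_model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 1, kernel_size=1), nn.Sigmoid(),  # per-pixel probability of the specific area B
)
optimizer = torch.optim.Adam(provisional_model.parameters(), lr=1e-3)
criterion = nn.BCELoss()

def training_step(batch_Dt, batch_Mt):
    """batch_Dt: (N, 3, H, W) learning images; batch_Mt: (N, 1, H, W) binary masks Mt."""
    optimizer.zero_grad()
    mask_pred = provisional_model(batch_Dt)   # mask M output by the provisional model
    loss = criterion(mask_pred, batch_Mt)     # error function against the correct mask Mt
    loss.backward()                           # update variables so the error is reduced
    optimizer.step()
    return loss.item()
```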
As described above, in the first embodiment, the image processing mask M representing the specific area B is generated by inputting the image data D1 of the performance image G1 into the machine-learned estimation model 51. Therefore, the specific area B can be identified with high accuracy for diverse unknown performance images G1.
The area extraction process Sb2 of FIG. 8 is a process of extracting the specific area B from the performance image G1 represented by the image data D1. Specifically, the area extraction process Sb2 is image processing that relatively emphasizes the specific area B by selectively removing the areas of the performance image G1 other than the specific area. The image extraction unit 311 of the first embodiment generates image data D2 by applying the image processing mask M to the image data D1 (performance image G1). Specifically, the image extraction unit 311 multiplies the pixel value of each pixel of the performance image G1 by the element of the image processing mask M corresponding to that pixel. As illustrated in FIG. 7, the area extraction process Sb2 generates image data D2 representing an image in which the areas other than the specific area B have been removed from the performance image G1 (hereinafter referred to as the "performance image G2"). That is, the performance image G2 represented by the image data D2 is an image in which the keyboard image g1 and the finger image g2 have been extracted from the performance image G1. By the control device 11 executing the area extraction process Sb2, an element (area extraction unit) that extracts the specific area B from the performance image G1 is realized.
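A minimal sketch of the area extraction process Sb2, assuming the performance image and the binary image processing mask M are given as arrays (the function name is illustrative):

```python
import numpy as np

def extract_specific_area(performance_image, mask):
    """performance_image: (H, W, 3) array of pixel values (image data D1).
    mask: (H, W) binary array, 1 inside the specific area B and 0 elsewhere (image processing mask M).

    Each pixel value is multiplied by the corresponding mask element, which removes everything
    outside the specific area B and yields the performance image G2.
    """
    return performance_image * mask[..., np.newaxis]
```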
[Projective transformation unit 314]
The position c[h,f] of each finger estimated by the finger position estimation process is a coordinate in the x-y coordinate system set in the performance image G1. The conditions under which the imaging device 15 captures the keyboard instrument 200 may differ depending on various circumstances such as the usage environment of the keyboard instrument 200. For example, the imaging range may be too wide (or too narrow) compared with the ideal imaging conditions illustrated in FIG. 2, or the imaging direction may be inclined with respect to the vertical direction. The numerical values of the coordinates x[h,f] and y[h,f] of each position c[h,f] therefore depend on the conditions under which the imaging device 15 captures the performance image G1. Accordingly, the projective transformation unit 314 of the first embodiment converts (image registration) the position c[h,f] of each finger in the performance image G1 into a position C[h,f] in an X-Y coordinate system that is substantially independent of the imaging conditions of the imaging device 15. The finger position data F generated by the finger position data generation unit 31 is data representing the positions C[h,f] after conversion by the projective transformation unit 314. That is, the finger position data F specifies the positions C[1,1] to C[1,5] of the fingers of the user's left hand and the positions C[2,1] to C[2,5] of the fingers of the user's right hand.
As illustrated in FIG. 10, the X-Y coordinate system is set in a predetermined image (hereinafter referred to as the "reference image") Gref. The reference image Gref is an image of the keyboard of a standard keyboard instrument (hereinafter referred to as the "reference instrument") captured under standard imaging conditions. The reference image Gref is not limited to an image of an actual keyboard. For example, an image synthesized by a known image synthesis technique may be used as the reference image Gref. Image data Dref representing the reference image Gref (hereinafter referred to as "reference data") and auxiliary data A relating to the reference image Gref are stored in the storage device 12.
The auxiliary data A is data specifying, for each key 21 of the reference instrument, the combination of the area in which that key 21 exists in the reference image Gref (hereinafter referred to as the "unit area") Rn and the pitch n corresponding to that key 21. In other words, the auxiliary data A is data defining the unit area Rn corresponding to each pitch n in the reference image Gref.
The conversion from the position c[h,f] in the x-y coordinate system to the position C[h,f] in the X-Y coordinate system uses a projective transformation with a transformation matrix W, as expressed by Equation (2) below. The symbol X in Equation (2) means the coordinate on the X-axis of the X-Y coordinate system, and the symbol Y means the coordinate on the Y-axis. The symbol s is an adjustment value for matching the scale between the x-y coordinate system and the X-Y coordinate system.

[Equation (2): the projective transformation from (x, y) to (X, Y) using the transformation matrix W and the scale adjustment value s.]
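A minimal sketch of converting a fingertip position with the transformation matrix W, assuming Equation (2) takes the usual homogeneous form of a projective transformation with s as the homogeneous scale factor (this concrete form and the function name are assumptions):

```python
import numpy as np

def project_finger_position(W, c):
    """W: 3x3 transformation matrix; c: fingertip position (x, y) in the performance-image
    x-y coordinate system. Returns the position (X, Y) in the X-Y coordinate system of the
    reference image Gref, assuming the usual homogeneous projective transform."""
    x, y = c
    X_h, Y_h, s = W @ np.array([x, y, 1.0])   # s plays the role of the scale adjustment value
    return X_h / s, Y_h / s
```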
[Matrix generation unit 312]
The matrix generation unit 312 of FIG. 3 generates the transformation matrix W of Equation (2) that the projective transformation unit 314 applies in the projective transformation. FIG. 11 is a flowchart illustrating a specific procedure of the process by which the matrix generation unit 312 generates the transformation matrix W (hereinafter referred to as the "matrix generation process"). The matrix generation process of the first embodiment is executed on the performance image G2 (image data D2) produced by the image extraction process. According to this configuration, compared with a configuration in which the matrix generation process is executed on the entire performance image G1 including the areas other than the specific area B, an appropriate transformation matrix W that approximates the keyboard image g1 to the reference image Gref with high accuracy can be generated.
The matrix generation process includes an initial setting process Sc1 and a matrix update process Sc2. The initial setting process Sc1 is a process of setting an initial matrix W0, which is the initial value of the transformation matrix W. The details of the initial setting process Sc1 are described later.
The matrix update process Sc2 is a process of generating the transformation matrix W by iteratively updating the initial matrix W0. That is, the projective transformation unit 314 generates the transformation matrix W by iteratively updating the initial matrix W0 so that the keyboard image g1 of the performance image G2 approaches the reference image Gref under the projective transformation using the transformation matrix W. For example, the transformation matrix W is generated so that the coordinate X/s on the X-axis of a particular point in the reference image Gref approximates or matches the coordinate x on the x-axis of the corresponding point in the keyboard image g1, and the coordinate Y/s on the Y-axis of that point in the reference image Gref approximates or matches the coordinate y on the y-axis of the corresponding point in the keyboard image g1. That is, the transformation matrix W is generated so that the coordinates of the key 21 corresponding to a particular pitch in the keyboard image g1 are converted, by the projective transformation applying the transformation matrix W, into the coordinates of the key 21 corresponding to that pitch in the reference image Gref. By the control device 11 executing the matrix update process Sc2 exemplified above, an element (matrix generation unit 312) that generates the transformation matrix W is realized.
As the matrix update process Sc2, a process of updating the transformation matrix W so that an image feature such as SIFT (Scale-Invariant Feature Transform) comes closer between the reference image Gref and the keyboard image g1 could also be assumed. However, because the keyboard image g1 repeats a pattern in which a plurality of keys 21 are arranged in the same manner, a form using such image features may not be able to estimate the transformation matrix W appropriately.
In view of the above circumstances, in the matrix update process Sc2, the matrix generation unit 312 of the first embodiment iteratively updates the initial matrix W0 so that the enhanced correlation coefficient (ECC) between the reference image Gref and the keyboard image g1 increases (ideally, is maximized). According to this form, an appropriate transformation matrix W that can approximate the keyboard image g1 to the reference image Gref with high accuracy can be generated, compared with the form using image features described above. The generation of a transformation matrix W using the enhanced correlation coefficient is also disclosed in Georgios D. Evangelidis and Emmanouil Z. Psarakis, "Parametric Image Alignment Using Enhanced Correlation Coefficient Maximization", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 30, No. 10, October 2008. As described above, the enhanced correlation coefficient is suitable for generating the transformation matrix W used for transforming the keyboard image g1, but the transformation matrix W may also be generated so that an image feature such as the aforementioned SIFT comes closer between the reference image Gref and the keyboard image g1.
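A minimal sketch of the matrix update process Sc2 using OpenCV's ECC alignment is shown below; whether findTransformECC accepts this exact argument list depends on the OpenCV version, and the parameter choices are illustrative assumptions.

```python
import cv2
import numpy as np

def refine_transformation_matrix(gref_gray, g2_gray, W0):
    """gref_gray: reference image Gref (grayscale); g2_gray: extracted performance image G2 (grayscale);
    W0: 3x3 initial matrix. Iteratively updates W0 so that the enhanced correlation coefficient
    between the two images increases, returning the refined transformation matrix W."""
    warp = np.asarray(W0, dtype=np.float32)
    criteria = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 200, 1e-6)
    _, warp = cv2.findTransformECC(
        gref_gray.astype(np.float32), g2_gray.astype(np.float32),
        warp, cv2.MOTION_HOMOGRAPHY, criteria, None, 5)
    return warp
```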
The projective transformation unit 314 of FIG. 3 executes a projective transformation process. The projective transformation process is a projective transformation of the performance image G1 using the transformation matrix W generated by the matrix generation process. By the projective transformation process, the performance image G1 is converted into an image captured under imaging conditions equivalent to those of the reference image Gref (hereinafter referred to as the "converted image"). For example, the area of the converted image corresponding to the key 21 of pitch n substantially matches the unit area Rn of that pitch n in the reference image Gref. The x-y coordinate system of the converted image also substantially matches the X-Y coordinate system of the reference image Gref. In the projective transformation process described above, the projective transformation unit 314 converts the position c[h,f] of each finger into a position C[h,f] in the X-Y coordinate system, as expressed by Equation (2) above. By the control device 11 executing the projective transformation process exemplified above, an element (projective transformation unit 314) that executes the projective transformation of the performance image G1 is realized.
The display control unit 40 causes the display device 14 to display the converted image generated by the projective transformation process. For example, the display control unit 40 causes the display device 14 to display the converted image and the reference image Gref superimposed on each other. As described above, the area of the converted image corresponding to the key 21 of each pitch n and the unit area Rn corresponding to that pitch n in the reference image Gref overlap each other.
As described above, in the first embodiment, the transformation matrix W is generated so that the keyboard image g1 of the performance image G1 approaches the reference image Gref, and the projective transformation process using the transformation matrix W is performed on the performance image G1. Therefore, the performance image G1 of the keyboard instrument 200 played by the user can be converted into a converted image corresponding to the imaging conditions of the reference instrument in the reference image Gref.
FIG. 12 is a flowchart illustrating a specific procedure of the initial setting process Sc1. When the initial setting process Sc1 is started, the projective transformation unit 314 causes the display device 14 to display the setting screen 62 illustrated in FIG. 13 (Sc11). The setting screen 62 includes the performance image G1 captured by the imaging device 15 and an instruction 622 to the user. The instruction 622 is a message prompting the user to select, in the keyboard image g1 of the performance image G1, an area (hereinafter referred to as the "target area") 621 corresponding to one or more specific pitches (hereinafter referred to as the "target pitches") n. By operating the operation device 13 while viewing the setting screen 62, the user selects the target area 621 corresponding to the target pitch n in the performance image G1. The projective transformation unit 314 accepts the selection of the target area 621 by the user (Sc12).
The projective transformation unit 314 identifies the one or more unit areas Rn that the auxiliary data A specifies for the target pitch n in the reference image Gref represented by the reference data Dref (Sc13). The projective transformation unit 314 then calculates, as the initial matrix W0, a matrix for projectively transforming the target area 621 of the performance image G1 into the one or more unit areas Rn identified in the reference image Gref (Sc14). As understood from the above description, the initial setting process Sc1 of the first embodiment is a process of setting the initial matrix W0 so that the target area 621 designated by the user in the keyboard image g1 approaches, under the projective transformation using the initial matrix W0, the unit area Rn corresponding to the target pitch n in the reference image Gref.
The setting of the initial matrix W0 is important for generating an appropriate transformation matrix W in the matrix update process Sc2. In particular, in the form that uses the enhanced correlation coefficient for the matrix update process Sc2, the suitability of the initial matrix W0 tends to affect the suitability of the final transformation matrix W. In the first embodiment, the initial matrix W0 is set so that the target area 621 corresponding to the user's instruction in the performance image G1 approaches the unit area Rn corresponding to the target pitch n in the reference image Gref. Therefore, an appropriate transformation matrix W that can approximate the keyboard image g1 to the reference image Gref with high accuracy can be generated. Furthermore, in the first embodiment, the area of the performance image G1 that the user designates by operating the operation device 13 is used as the target area 621 for setting the initial matrix W0. Therefore, compared with a form in which, for example, the area of the performance image G1 corresponding to the target pitch n is estimated by arithmetic processing, an appropriate initial matrix W0 can be generated while reducing the processing load. Although the initial setting process Sc1 is executed on the performance image G1 in the above description, the initial setting process Sc1 may also be executed on the performance image G2.
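A minimal sketch of the initial setting process Sc1, assuming the user-selected target area 621 and the corresponding unit area Rn are both available as four-corner quadrilaterals (this corner-based formulation and the function name are assumptions):

```python
import cv2
import numpy as np

def initial_matrix(target_area_corners, unit_area_corners):
    """target_area_corners: four (x, y) corners of the target area 621 selected in the performance image.
    unit_area_corners: four (X, Y) corners of the unit area Rn of the target pitch in the reference image Gref.
    Returns a 3x3 initial matrix W0 mapping the target area onto the unit area."""
    src = np.asarray(target_area_corners, dtype=np.float32)
    dst = np.asarray(unit_area_corners, dtype=np.float32)
    return cv2.getPerspectiveTransform(src, dst)
```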
B: Fingering data generation unit 32
As described above, the fingering data generation unit 32 of FIG. 3 generates the fingering data Q using the performance data P generated by the keyboard instrument 200 and the finger position data F generated by the finger position data generation unit 31. The fingering data Q is generated for each unit period. The fingering data generation unit 32 of the first embodiment includes a probability calculation unit 321 and a fingering estimation unit 322. In the description above, each finger of the user was expressed by the combination of the variables h and f; in the following description, each finger of the user is expressed by a finger number k (k = 1 to 10). Accordingly, the position C[h,f] that the finger position data F specifies for each finger is written as the position C[k] in the following description.
[Probability calculation unit 321]
The probability calculation unit 321 calculates, for each finger number k, the probability p that the pitch n specified by the performance data P was played by the finger of finger number k. The probability p is an index (likelihood) of the certainty that the finger of finger number k operated the key 21 of pitch n. The probability calculation unit 321 calculates the probability p according to whether the position C[k] of the finger of finger number k is within the unit area Rn of pitch n. The probability p is calculated for each unit period on the time axis. Specifically, when the performance data P specifies the pitch n, the probability calculation unit 321 calculates the probability p(C[k]|ηk=n) by the calculation of Equation (3) exemplified below.

[Equation (3): p(C[k]|ηk=n) = { I(C[k]∈Rn) / |Rn| } ∗ ν(0, σ²E), with the symbols defined below.]
The condition "ηk=n" in the probability p(C[k]|ηk=n) means the condition that the finger of finger number k is playing the pitch n. That is, the probability p(C[k]|ηk=n) means the probability that the position C[k] is observed for the finger of finger number k under the condition that that finger is playing the pitch n.
The symbol I(C[k]∈Rn) in Equation (3) is an indicator function that is set to "1" when the position C[k] is within the unit area Rn and to "0" when the position C[k] is outside the unit area Rn. The symbol |Rn| means the area of the unit area Rn. The symbol ν(0, σ²E) means observation noise, expressed as a normal distribution with mean 0 and variance σ². The symbol E is a 2-by-2 identity matrix. The symbol ∗ means convolution with the observation noise ν(0, σ²E).
As understood from the above description, the probability p(C[k]|ηk=n) calculated by the probability calculation unit 321 is the certainty that, under the condition that the pitch n specified by the performance data P is played by the finger of finger number k, the position of that finger is the position C[k] that the finger position data F specifies for it. Therefore, the probability p(C[k]|ηk=n) is maximal when the position C[k] of the finger of finger number k is within the unit area Rn being played, and decreases as the position C[k] moves away from the unit area Rn.
On the other hand, when the performance data P specifies no pitch n, that is, when the user is not operating any of the N keys 21, the probability calculation unit 321 calculates the probability p(C[k]|ηk=0) of each finger by Equation (4) below.

[Equation (4): p(C[k]|ηk=0) = 1 / |R|.]

The symbol |R| in Equation (4) means the total area of the N unit areas R1 to RN in the reference image Gref. As understood from Equation (4), in the state in which the user is not operating any key 21, the probability p(C[k]|ηk=0) is set to a value (1/|R|) common to all finger numbers k.
As described above, within a period in which the performance data P specifies a pitch n, the plurality of probabilities p(C[k]|ηk=n) corresponding to the different fingers are calculated for each unit period on the time axis. On the other hand, in each unit period within a period in which the performance data P specifies no pitch n, the plurality of probabilities p(C[k]|ηk=0) corresponding to the different fingers are set to a sufficiently small fixed value (1/|R|).
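A minimal sketch of the probability of Equation (3), assuming each unit area Rn is an axis-aligned rectangle so that the convolution with the Gaussian observation noise factorizes into one-dimensional CDF differences (the rectangle assumption and the function names are illustrative):

```python
import math

def norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def probability_key_played(C, rect, sigma):
    """C: observed fingertip position (X, Y) from the finger position data F.
    rect: (x1, y1, x2, y2) bounds of the unit area Rn, assumed axis-aligned.
    sigma: standard deviation of the observation noise ν(0, σ²E).

    Returns p(C[k] | ηk = n): the uniform density over Rn convolved with the Gaussian noise,
    evaluated at the observed position C[k].
    """
    (x1, y1, x2, y2), (X, Y) = rect, C
    area = (x2 - x1) * (y2 - y1)                        # |Rn|
    px = norm_cdf((x2 - X) / sigma) - norm_cdf((x1 - X) / sigma)
    py = norm_cdf((y2 - Y) / sigma) - norm_cdf((y1 - Y) / sigma)
    return px * py / area
```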
[Fingering estimation unit 322]
The fingering estimation unit 322 estimates the user's fingering. Specifically, the fingering estimation unit 322 estimates, from the probabilities p(C[k]|ηk=n) of the respective fingers, the finger (finger number k) that played the pitch n specified by the performance data P. The estimation of the finger number k by the fingering estimation unit 322 (the generation of the fingering data Q) is executed each time the probabilities p(C[k]|ηk=n) of the respective fingers are calculated (that is, for each unit period). Specifically, the fingering estimation unit 322 identifies the finger number k corresponding to the maximum value among the plurality of probabilities p(C[k]|ηk=n) corresponding to the different fingers. The fingering estimation unit 322 then generates fingering data Q specifying the pitch n specified by the performance data P and the finger number k identified from the probabilities p(C[k]|ηk=n).
If, within a period in which the performance data P specifies the pitch n, the maximum value among the plurality of probabilities p(C[k]|ηk=n) falls below a predetermined threshold, the reliability of the fingering estimation result is low. Therefore, in a unit period in which the maximum value of the plurality of probabilities p(C[k]|ηk=n) falls below the threshold, the fingering estimation unit 322 sets the finger number k to an invalid value meaning that the estimation result is invalid. For a note whose finger number k is set to the invalid value, the display control unit 40 displays the note image 611 in a mode different from the normal note image 611, as illustrated in FIG. 4, and displays the symbol "??" meaning that the estimation result of the finger number k is invalid. The configuration and operation of the fingering data generation unit 32 are as described above.
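A minimal sketch of the selection performed by the fingering estimation unit 322, with the threshold value and the representation of the invalid value left as assumptions:

```python
INVALID = -1  # stand-in for the invalid value meaning the estimation result is unreliable

def estimate_finger_number(probabilities, threshold):
    """probabilities: dict mapping finger number k (1..10) to p(C[k] | ηk = n) for the current unit period.
    Returns the finger number with the maximum probability, or INVALID when that maximum
    falls below the threshold."""
    k_best = max(probabilities, key=probabilities.get)
    return k_best if probabilities[k_best] >= threshold else INVALID
```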
FIG. 14 is a flowchart illustrating a specific procedure of the process executed by the performance analysis unit 30 (hereinafter referred to as the "performance analysis process"). The performance analysis process is started, for example, in response to an instruction from the user on the operation device 13.
When the performance analysis process is started, the control device 11 (image extraction unit 311) executes the image extraction process of FIG. 8 (S11). That is, the control device 11 generates the performance image G2 by extracting the specific area B including the keyboard image g1 and the finger image g2 from the performance image G1. As described above, the image extraction process includes the area estimation process Sb1 and the area extraction process Sb2.
After executing the image extraction process, the control device 11 (matrix generation unit 312) executes the matrix generation process of FIG. 11 (S12). That is, the control device 11 generates the transformation matrix W by iteratively updating the initial matrix W0 so that the enhanced correlation coefficient between the reference image Gref and the keyboard image g1 increases. As described above, the matrix generation process includes the initial setting process Sc1 and the matrix update process Sc2.
Once the transformation matrix W has been generated, the control device 11 repeats the processes exemplified below (S13 to S18) for each unit period. First, the control device 11 (finger position estimation unit 313) executes the finger position estimation process of FIG. 5 (S13). That is, the control device 11 estimates the positions c[h,f] of the fingers of the user's left hand and right hand by analyzing the performance image G1. As described above, the finger position estimation process includes the image analysis process Sa1, the left/right determination process Sa2, and the interpolation process Sa3.
The control device 11 (projective transformation unit 314) executes the projective transformation process (S14). That is, the control device 11 generates the converted image by the projective transformation of the performance image G1 using the transformation matrix W. In the projective transformation process, the control device 11 converts the position c[h,f] of each finger of the user into a position C[h,f] in the X-Y coordinate system, and generates the finger position data F representing the positions C[h,f] of the fingers.
After generating the finger position data F by the above processes, the control device 11 (probability calculation unit 321) executes the probability calculation process (S15). That is, the control device 11 calculates the probability p(C[k]|ηk=n) that the pitch n specified by the performance data P was played by the finger of each finger number k. The control device 11 (fingering estimation unit 322) then executes the fingering estimation process (S16). That is, the control device 11 estimates the finger number k of the finger that played the pitch n from the probabilities p(C[k]|ηk=n) of the respective fingers, and generates fingering data Q specifying the pitch n and the finger number k.
After generating the fingering data Q by the above processes, the control device 11 (display control unit 40) updates the analysis screen 61 according to the fingering data Q (S17). The control device 11 also determines whether a predetermined end condition is satisfied (S18). For example, when the user instructs the end of the performance analysis process by operating the operation device 13, the control device 11 determines that the end condition is satisfied. If the end condition is not satisfied (S18: NO), the control device 11 repeats the processes from the finger position estimation process onward (S13 to S18) for the immediately following unit period. If the end condition is satisfied (S18: YES), the control device 11 ends the performance analysis process.
As described above, in the first embodiment, the fingering data Q is generated using the finger position data F generated by analyzing the performance image G1 and the performance data P representing the performance by the user. Therefore, the fingering can be estimated with higher accuracy than in a configuration in which the fingering is estimated from the performance data P alone.
In the first embodiment, the positions c[h,f] of the fingers estimated by the finger position estimation process are also converted using the transformation matrix W for the projective transformation that brings the keyboard image g1 close to the reference image Gref. That is, the position C[h,f] of each finger is estimated with the reference image Gref as the reference. Therefore, the fingering can be estimated with higher accuracy than in a configuration in which the positions c[h,f] of the fingers are not converted into positions referenced to the reference image Gref.
In the first embodiment, the specific area B including the keyboard image g1 is extracted from the performance image G1. Therefore, as described above, an appropriate transformation matrix W that can approximate the keyboard image g1 to the reference image Gref with high accuracy can be generated. The extraction of the specific area B also makes it possible to improve the usability of the performance image G1. In particular, in the first embodiment, the specific area B including the keyboard image g1 and the finger image g2 is extracted from the performance image G1. Therefore, a performance image G2 in which the state of the keyboard 22 of the keyboard instrument 200 and the state of the user's fingers can be viewed efficiently can be generated.
2: Second Embodiment
A second embodiment will be described. In each of the forms exemplified below, elements whose functions are the same as in the first embodiment are given the same reference numerals as those used in the description of the first embodiment, and their detailed descriptions are omitted as appropriate.
In the first embodiment, the probability p(C[k]|ηk=n) is calculated according to whether the position C[k] of the finger of finger number k is within the unit area Rn of the pitch n. On the premise that only one finger is present within a unit area Rn, the fingering can be estimated with high accuracy in the first embodiment as well. In an actual performance on the keyboard instrument 200, however, it is possible for the positions C[k] of a plurality of fingers to be present within one unit area Rn.
For example, as illustrated in FIG. 15, when the user moves the index finger of the left hand vertically upward while operating one key 21 with the middle finger of the left hand, the middle finger and the index finger of the left hand overlap each other in the performance image G1. That is, the position C[k] of the middle finger and the position C[k] of the index finger of the left hand are present within one unit area Rn. Furthermore, in a playing technique in which another finger passes above or below a finger that is holding down a key 21 (thumb-under or finger-crossing), a plurality of fingers may overlap each other. When a plurality of fingers overlap each other within one unit area Rn in this way, the method of the first embodiment may not be able to estimate the fingering with high accuracy. The second embodiment is a form for solving this problem. Specifically, in the second embodiment, the positional relationship among the plurality of fingers and the temporal variation (dispersion) of the position of each finger are taken into account in the estimation of the fingering.
FIG. 16 is a block diagram illustrating the functional configuration of the performance analysis system 100 according to the second embodiment. The performance analysis system 100 of the second embodiment adds a control data generator 323 to the same elements as in the first embodiment.
The control data generator 323 generates N pieces of control data Z[1] to Z[N] corresponding to the different pitches n. FIG. 17 is a schematic diagram of the control data Z[n] corresponding to an arbitrary pitch n. The control data Z[n] is vector data representing features of the position of each finger relative to the unit region Rn of pitch n (hereinafter the "relative position" C'[k]). The relative position C'[k] is obtained by converting the position C[k] represented by the finger position data F into a position relative to the unit region Rn.
The control data Z[n] corresponding to one pitch n contains that pitch n and, for each of the plurality of fingers, a position mean Za[n,k], a position variance Zb[n,k], a velocity mean Zc[n,k], and a velocity variance Zd[n,k]. The position mean Za[n,k] is the average of the relative position C'[k] within a period of predetermined length that includes the current unit period (hereinafter the "observation period"). The observation period is, for example, a period consisting of a plurality of unit periods that precede and end at the current unit period on the time axis. The position variance Zb[n,k] is the variance of the relative position C'[k] within the observation period. The velocity mean Zc[n,k] is the average of the velocity (that is, the rate of change) at which the relative position C'[k] changes within the observation period. The velocity variance Zd[n,k] is the variance of that velocity within the observation period.
As described above, the control data Z[n] contains, for each of the plurality of fingers, information on the relative position C'[k] (Za[n,k], Zb[n,k], Zc[n,k], Zd[n,k]); it therefore reflects the positional relationship among the user's fingers. The control data Z[n] also contains, for each finger, information on the variation of the relative position C'[k] (Zb[n,k], Zd[n,k]); it therefore also reflects the temporal variation of each finger's position.
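The following is a minimal sketch, not the patent's reference implementation, of how the control data Z[n] for one pitch n could be assembled from a history of relative finger positions; the array layout (T unit periods in the observation period, K fingers, two coordinates per relative position C'[k]) and the ordering of the features in the output vector are assumptions.

```python
import numpy as np

def control_data(rel_positions: np.ndarray, pitch_n: int) -> np.ndarray:
    """rel_positions: shape (T, K, 2), relative positions C'[k] over the observation period."""
    velocities = np.diff(rel_positions, axis=0)   # per-period change of C'[k]
    za = rel_positions.mean(axis=0)               # position mean     Za[n, k]
    zb = rel_positions.var(axis=0)                # position variance Zb[n, k]
    zc = velocities.mean(axis=0)                  # velocity mean     Zc[n, k]
    zd = velocities.var(axis=0)                   # velocity variance Zd[n, k]
    # Z[n] is a vector containing the pitch n and the per-finger statistics.
    return np.concatenate(([float(pitch_n)], za.ravel(), zb.ravel(), zc.ravel(), zd.ravel()))
```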
The probability calculation processing by the probability calculator 321 of the second embodiment uses a plurality of estimation models 52[k] (52[1] to 52[10]) prepared in advance for the different fingers. The estimation model 52[k] for each finger is a trained model that has learned the relationship between the control data Z[n] and the probability p[k] for that finger. The probability p[k] is an index (probability) of the likelihood that the pitch n specified by the performance data P was played by the finger with finger number k. For each of the plurality of fingers, the probability calculator 321 calculates the probability p[k] by inputting the N pieces of control data Z[1] to Z[N] to the estimation model 52[k] for that finger.
The estimation model 52[k] corresponding to any one finger number k is a logistic regression model expressed by Equation (5) below.
[Equation (5): logistic regression model, shown as Figure JPOXMLDOC01-appb-M000005 in the source]
The variables βk and ωk,n in Equation (5) are set by machine learning performed by the machine learning system 900. That is, each estimation model 52[k] is established by machine learning in the machine learning system 900 and is provided to the performance analysis system 100; for example, the variables βk and ωk,n of each estimation model 52[k] are transmitted from the machine learning system 900 to the performance analysis system 100.
A finger positioned above a finger that is pressing a key, or a finger passing above or below a finger that is pressing a key, tends to move more readily than the key-pressing finger itself. Taking this tendency into account, the estimation model 52[k] learns the relationship between the control data Z[n] and the probability p[k] such that the probability p[k] becomes small for a finger whose relative position C'[k] has a high rate of change. The probability calculator 321 inputs the control data Z[n] to each of the plurality of estimation models 52[k], thereby calculating the probabilities p[k] for the different fingers every unit period.
The fingering estimation unit 322 estimates the user's fingering by a fingering estimation process that applies the plurality of probabilities p[k]. Specifically, the fingering estimation unit 322 estimates, from the probabilities p[k] of the individual fingers, the finger (finger number k) that played the pitch n specified by the performance data P. The estimation of the finger number k (generation of the fingering data Q) by the fingering estimation unit 322 is performed each time the probabilities p[k] are calculated (that is, every unit period). Specifically, the fingering estimation unit 322 identifies the finger number k corresponding to the maximum value among the probabilities p[k] of the different fingers, and generates fingering data Q specifying the pitch n specified by the performance data P and the finger number k identified from the probabilities p[k].
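A hedged sketch of the probability calculation and fingering estimation follows. Equation (5) is shown only as a figure in the source, so the exact regression form below (a bias βk plus a weighted sum of the control-data features, passed through a sigmoid) and the array shapes are assumptions; only the argmax step over p[k] is stated explicitly in the text.

```python
import numpy as np

def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))

def finger_probabilities(Z: np.ndarray, beta: np.ndarray, omega: np.ndarray) -> np.ndarray:
    # Z:     (N, D)    control data Z[1]..Z[N], one feature vector per pitch
    # beta:  (K,)      bias beta_k of each estimation model 52[k]
    # omega: (K, N, D) weights omega_{k,n} of each model (assumed layout)
    scores = beta + np.einsum('knd,nd->k', omega, Z)
    return sigmoid(scores)                            # p[k] for the K fingers

def estimate_fingering(Z: np.ndarray, beta: np.ndarray, omega: np.ndarray, pitch_n: int) -> dict:
    p = finger_probabilities(Z, beta, omega)
    k = int(np.argmax(p)) + 1                         # finger numbers are 1..K
    return {"pitch": pitch_n, "finger": k}            # corresponds to fingering data Q
```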
FIG. 18 is a flowchart illustrating a specific procedure of the performance analysis processing in the second embodiment. In the performance analysis processing of the second embodiment, generation of the control data Z[n] (S19) is added to the same processing as in the first embodiment. Specifically, the control device 11 (control data generator 323) generates the N pieces of control data Z[1] to Z[N] corresponding to the different pitches n from the finger position data F generated by the finger position data generator 31 (that is, from the positions C[h,f] of the individual fingers).
The control device 11 (probability calculator 321) calculates the probability p[k] corresponding to each finger number k by a probability calculation process that inputs the N pieces of control data Z[1] to Z[N] to the corresponding estimation model 52[k] (S15). The control device 11 (fingering estimation unit 322) then estimates the user's fingering by a fingering estimation process that applies the plurality of probabilities p[k] (S16). The operations of the elements other than the fingering data generator 32 (S11 to S14 and S17 to S18) are the same as in the first embodiment.
The second embodiment achieves the same effects as the first embodiment. In addition, the control data Z[n] input to the estimation model 52[k] in the second embodiment contains the mean Za[n,k] and variance Zb[n,k] of the relative position C'[k] of each finger, as well as the mean Zc[n,k] and variance Zd[n,k] of the rate of change of the relative position C'[k]. Therefore, the user's fingering can be estimated with high accuracy even when a plurality of fingers overlap one another, for example because of finger crossing.
In the above description, a logistic regression model was given as an example of the estimation model 52[k], but the type of the estimation model 52[k] is not limited to this example. For example, a statistical model such as a multilayer perceptron may be used as the estimation model 52[k], and a deep neural network such as a convolutional neural network or a recurrent neural network may also be used. A combination of plural types of statistical models may be used as the estimation model 52[k] as well. The various estimation models 52[k] exemplified above are comprehensively expressed as trained models that have learned the relationship between the control data Z[n] and the probability p[k].
3: Third Embodiment
 FIG. 19 is a flowchart illustrating a specific procedure of the performance analysis processing in the third embodiment. After executing the image extraction process and the matrix generation process, the control device 11 refers to the performance data P to determine whether the user is playing the keyboard instrument 200 (S21). Specifically, the control device 11 determines whether any of the plurality of keys 21 of the keyboard instrument 200 is being operated.
If the keyboard instrument 200 is being played (S21: YES), the control device 11 executes the generation of the finger position data F (S13 to S14), the generation of the fingering data Q (S15 to S16), and the updating of the analysis screen 61 (S17), as in the first embodiment. On the other hand, if the keyboard instrument 200 is not being played (S21: NO), the control device 11 moves the processing to step S18. That is, the generation of the finger position data F (S13 to S14), the generation of the fingering data Q (S15 to S16), and the updating of the analysis screen 61 (S17) are not executed.
The third embodiment achieves the same effects as the first embodiment. In addition, in the third embodiment, the generation of the finger position data F and the fingering data Q is stopped while the keyboard instrument 200 is not being played. Therefore, compared with a configuration in which the generation of the finger position data F continues regardless of whether the keyboard instrument 200 is being played, the processing load required to generate the fingering data Q can be reduced. The third embodiment is also applicable to the second embodiment.
4: Fourth Embodiment
 The fourth embodiment modifies the initial setting process Sc1 of the foregoing embodiments. FIG. 20 is a flowchart illustrating a specific procedure of the initial setting process Sc1 executed by the control device 11 (matrix generator 312) of the fourth embodiment.
When the initial setting process Sc1 starts, the user operates, with a specific finger (hereinafter the "specific finger"), the key 21 corresponding to a desired pitch (hereinafter the "specific pitch") n among the plurality of keys 21 of the keyboard instrument 200. The specific finger is, for example, a finger (for example, the index finger of the right hand) communicated to the user by the display on the display device 14, by the instruction manual of the keyboard instrument 200, or the like. As a result of the user's performance, performance data P specifying the specific pitch n is supplied from the keyboard instrument 200 to the performance analysis system 100. The control device 11 recognizes the user's performance of the specific pitch n by acquiring the performance data P from the keyboard instrument 200 (Sc15). The control device 11 then identifies, among the N unit regions R1 to RN of the reference image Gref, the unit region Rn corresponding to the specific pitch n (Sc16).
Meanwhile, the finger position data generator 31 generates the finger position data F by the finger position estimation process. The finger position data F includes the position C[h,f] of the specific finger that the user used to play the specific pitch n. The control device 11 identifies the position C[h,f] of the specific finger by acquiring the finger position data F (Sc17).
The control device 11 sets the initial matrix W0 using the unit region Rn corresponding to the specific pitch n and the position C[h,f] of the specific finger represented by the finger position data F (Sc18). That is, the control device 11 sets the initial matrix W0 so that the position C[h,f] of the specific finger represented by the finger position data F approaches the unit region Rn of the specific pitch n in the reference image Gref. Specifically, a matrix that projectively transforms the position C[h,f] of the specific finger onto the center of the unit region Rn is set as the initial matrix W0.
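Below is a minimal sketch of one way to seed the initial matrix W0 in step Sc18, assuming it is parameterized as a pure translation expressed as a 3x3 homography that carries the detected finger position C[h,f] onto the center of the unit region Rn; the patent only requires that the two points be brought close together, so other parameterizations are possible.

```python
import numpy as np

def initial_matrix(finger_xy: tuple, region_center_xy: tuple) -> np.ndarray:
    # Translation that maps the specific finger position onto the center of Rn.
    dx = region_center_xy[0] - finger_xy[0]
    dy = region_center_xy[1] - finger_xy[1]
    W0 = np.array([[1.0, 0.0, dx],
                   [0.0, 1.0, dy],
                   [0.0, 0.0, 1.0]])
    return W0   # refined afterwards by the matrix update process Sc2
```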
The fourth embodiment achieves the same effects as the first embodiment. In addition, in the fourth embodiment, when the user plays a desired specific pitch n with the specific finger, the initial matrix W0 is set so that the position c[h,f] of the specific finger in the performance image G1 approaches the portion of the reference image Gref corresponding to the specific pitch n (the unit region Rn). Since the user only has to play the desired pitch n, the user's workload for setting the initial matrix W0 is reduced compared with the first embodiment, in which the user must select the target region 621 by operating the operation device 13. Conversely, the first embodiment, in which the user designates the target region 621, does not require estimating the position C[h,f] of the user's fingers, so an appropriate initial matrix W0 can be set while the influence of estimation errors is reduced compared with the fourth embodiment. The fourth embodiment is likewise applicable to the second or third embodiment.
In the fourth embodiment it was assumed that the user plays one specific pitch n, but the user may play a plurality of specific pitches n with the specific finger. In that case, the control device 11 sets the initial matrix W0 so that, for each of the plurality of specific pitches n, the position C[h,f] of the specific finger at the time that specific pitch n is played approaches the unit region Rn of that specific pitch n.
5: Fifth Embodiment
 FIG. 21 is a block diagram illustrating the functional configuration of the performance analysis system 100 according to the fifth embodiment. The performance analysis system 100 of the fifth embodiment includes a sound collecting device 16. The sound collecting device 16 generates an acoustic signal V by picking up the sound reproduced by the keyboard instrument 200 in response to the user's performance. The acoustic signal V is a time-domain audio signal representing the waveform of the sound reproduced by the keyboard instrument 200. A sound collecting device 16 separate from the performance analysis system 100 may instead be connected to the performance analysis system 100 by wire or wirelessly. The time series of samples constituting the acoustic signal V may also be interpreted as "performance data P".
The control device 11 of the performance analysis system 100 functions as the performance analysis unit 30 by executing a program stored in the storage device 12. The performance analysis unit 30 generates the fingering data Q using the acoustic signal V supplied from the sound collecting device 16 and the image data D1 supplied from the imaging device 15. As in the first embodiment, the fingering data Q specifies the pitch n corresponding to the key 21 operated by the user and the finger number k of the finger the user used to operate that key 21. In the first embodiment the pitch n is specified by the performance data P, but the acoustic signal V of the fifth embodiment is not a signal that directly specifies the pitch n. The performance analysis unit 30 therefore estimates the pitch n and the finger number k simultaneously using the acoustic signal V and the image data D1.
Latent variables w_{t,n,k} are introduced for estimating the pitch n and the finger number k. The symbol t is a variable indicating time; one unit period on the time axis may be designated by the variable t. The finger number k in the fifth embodiment takes one of eleven values: ten values (k = 1 to 10) corresponding to the different fingers and a predetermined invalid value (k = 0).
A latent variable w_{t,n,k} is prepared for each combination of pitch n and finger number k. The latent variable w_{t,n,k} is a variable for a one-hot representation that is set to one of the two values "0" and "1". A value of "1" for the latent variable w_{t,n,k} means that the pitch n is being played by the finger with finger number k, and a value of "0" of the latent variable w_{t,n,k} means that no finger is being used for that performance.
A posterior probability U_{t,n} and a probability π_{t,n,k} are also introduced. The posterior probability U_{t,n} is the posterior probability that the pitch n is sounding at time t under the condition that the acoustic signal V has been observed. Accordingly, the probability (1 - U_{t,n}) corresponds to the probability that the latent variable w_{t,n,0} takes the value "1" under the condition that the acoustic signal V has been observed (the probability that the pitch n is not being played). The posterior probability U_{t,n} is estimated by a known estimation model that has learned the relationship between the acoustic signal V and the posterior probability U_{t,n}; this estimation model is a trained model for automatic transcription, and a deep neural network such as a convolutional neural network or a recurrent neural network, for example, is used as the estimation model for estimating the posterior probability U_{t,n}. The probability π_{t,n,k} is the probability that, while the pitch n is being played, it is being played by the finger with finger number k.
The probability p(w|V,π) of the latent variable w_{t,n,k} when the acoustic signal V and the probability π_{t,n,k} have been observed is expressed by Equation (6) below.
[Equation (6): shown as Figure JPOXMLDOC01-appb-M000006 in the source]
The first term on the right-hand side of Equation (6) represents the probability that the pitch n is not being sounded, and the second term represents the probability that, when the pitch n is being sounded, it is being played by the finger with finger number k.
The probability p(C[k]|w) that the position C[k] is observed from the performance image G1 when the latent variable w_{t,n,k} has been observed is expressed by Equation (7) below.
[Equation (7): shown as Figure JPOXMLDOC01-appb-M000007 in the source]
The probability p(C[k]|σ², Rn) in Equation (7) is the probability expressed by Equation (3) or Equation (4) above.
As the prior distribution of the probability π_{t,n,k}, a symmetric Dirichlet distribution (Dir) expressed by Equation (8) below is assumed.
[Equation (8): symmetric Dirichlet prior, shown as Figure JPOXMLDOC01-appb-M000008 in the source]
The symbol α in Equation (8) is a variable that defines the shape of the symmetric Dirichlet distribution.
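For reference, the standard density of a symmetric Dirichlet distribution, which Equation (8) presumably expresses, is written out below; the source shows the equation only as a figure, so the exact notation (in particular whether k runs over the ten fingers only or also over the invalid value) is an assumption. Here K is the number of finger classes and α is the shared concentration parameter.

```latex
\mathrm{Dir}(\pi_{t,n,1},\dots,\pi_{t,n,K};\,\alpha)
  = \frac{\Gamma(K\alpha)}{\Gamma(\alpha)^{K}}
    \prod_{k=1}^{K} \pi_{t,n,k}^{\,\alpha-1},
\qquad \sum_{k=1}^{K} \pi_{t,n,k} = 1 .
```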
On the above premise, the presence or absence of the pitch n and the finger number k can be estimated simultaneously by maximum a posteriori (MAP) estimation, which maximizes the posterior probability p(z|V,π,C[k]) of the latent variables w_{t,n,k}. However, because the probability distribution of the posterior probability p(z|V,π,C[k]) is difficult to estimate, the fifth embodiment uses a mean-field approximation (variational Bayesian estimation).
 Specifically, among the distributions that factorize as in Equation (9) below, the distribution that best approximates the probability distribution of the posterior probability p(z|V,π,C[k]) is identified; for example, the distribution that minimizes the KL (Kullback-Leibler) divergence from the posterior probability p(z|V,π,C[k]) is identified.
[Equation (9): factorized variational distribution, shown as Figure JPOXMLDOC01-appb-M000009 in the source]
Specifically, the performance analysis unit 30 iterates the calculations of Equations (10) and (11) below.
[Equation (10): shown as Figure JPOXMLDOC01-appb-M000010 in the source]
[Equation (11): shown as Figure JPOXMLDOC01-appb-M000011 in the source]
The symbol c in Equation (10) is a coefficient that normalizes the probability distribution ρ_{t,n,k} so that its sum over the finger numbers k equals "1", and the symbol 〈 〉 denotes an expected value.
Specifically, for each time t on the time axis, the performance analysis unit 30 iterates the calculations of Equations (10) and (11) over all combinations of the pitch n and the finger number k. The performance analysis unit 30 takes the result of Equation (10) at the point when Equations (10) and (11) have been iterated a predetermined number of times as the probability distribution ρ_{t,n,k} of the latent variable w_{t,n,k}. The probability distribution ρ_{t,n,k} is calculated for each time t on the time axis.
If the pitch n and the finger number k were determined at each time t from the probability distribution ρ_{t,n,k} calculated independently for each time t on the time axis, the finger number k might change between adjacent times t within the period in which the user plays a single note, or the period over which a pitch n continues might become excessively short. The performance analysis unit 30 of the fifth embodiment therefore uses an HMM (Hidden Markov Model) to which the probability distribution ρ_{t,n,k} is applied to generate the time series of combinations of the pitch n and the finger number k (that is, the fingering data Q).
Specifically, the HMM for fingering estimation consists of latent states corresponding to the sounding (key depression) and muting of the pitch n, and a plurality of latent states corresponding to the different finger numbers k. Only three kinds of state transitions are allowed: (1) self-transitions, (2) silence to an arbitrary finger number k, and (3) an arbitrary finger number k to silence; the transition probabilities of all other state transitions are set to "0". These conditions are constraints that prevent the finger number k from changing within the period in which a single note is sounding. The expected value of the probability distribution ρ_{t,n,k} calculated by Equations (10) and (11) is set as the observation probability for each latent state of the HMM. Using this HMM, the performance analysis unit 30 estimates the state sequence by dynamic programming such as the Viterbi algorithm, and generates the time series of the fingering data Q according to the result of estimating the state sequence.
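The following is a hedged sketch of the constrained Viterbi decoding described above for a single pitch n. The states are assumed to be {0: silence, 1 to 10: finger numbers}, with only self-transitions and transitions to or from the silence state allowed; obs is assumed to hold, per time step, the expected value of ρ_{t,n,k} used as the observation probability, and uniform probabilities over the allowed transitions are an additional assumption.

```python
import numpy as np

def viterbi_fingering(obs: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """obs: shape (T, 11) - observation probabilities for silence (0) and fingers 1..10."""
    T, S = obs.shape
    log_obs = np.log(obs + eps)
    # Allowed transitions: self-loops, silence -> any finger, any finger -> silence.
    allowed = np.full((S, S), -np.inf)
    allowed[np.arange(S), np.arange(S)] = 0.0   # self-transitions
    allowed[0, :] = 0.0                         # silence -> any state
    allowed[:, 0] = 0.0                         # any state -> silence
    delta = log_obs[0].copy()
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + allowed       # scores[previous, current]
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_obs[t]
    # Backtrack the most likely state sequence (0 = silence, k = finger number).
    path = np.empty(T, dtype=int)
    path[-1] = int(np.argmax(delta))
    for t in range(T - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return path
```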
According to the fifth embodiment, the fingering data Q is generated using the acoustic signal V and the image data D1; that is, the fingering data Q can be generated even in situations where the performance data P cannot be obtained. Furthermore, because the pitch n and the finger number k are estimated simultaneously using the acoustic signal V and the image data D1, the fingering can be estimated with high accuracy while the processing load is reduced compared with a configuration in which the pitch n and the finger number k are each estimated separately. The fifth embodiment is also applicable to the second to fourth embodiments.
6: Sixth Embodiment
 As illustrated in the foregoing embodiments, the projective transformation unit 314 generates a transformed image from the performance image G1; in other words, it changes the shooting conditions of the performance image G1. The sixth embodiment is an image processing system 700 that uses this function of changing the shooting conditions of the performance image G1. Focusing on the processing of the performance image G1 by the projective transformation unit 314, the performance analysis systems 100 of the first to fifth embodiments can also be expressed as image processing systems 700. In the sixth embodiment, estimation of the user's fingering is not essential.
FIG. 22 is a block diagram illustrating the functional configuration of the image processing system 700 according to the sixth embodiment. Like the performance analysis system 100 of the first embodiment, the image processing system 700 includes the control device 11, the storage device 12, the operation device 13, the display device 14, and the imaging device 15. As in the first embodiment, the imaging device 15 generates a time series of image data D1 representing the performance image G1 by capturing the keyboard instrument 200 under specific shooting conditions.
The storage device 12 stores a plurality of pieces of reference data Dref. Each piece of reference data Dref represents a reference image Gref obtained by capturing a reference instrument, which is the keyboard of a standard keyboard instrument. The shooting conditions of the reference instrument differ for each reference image Gref (for each piece of reference data Dref); specifically, one or more conditions such as the shooting range or the shooting direction differ between reference images Gref. The storage device 12 also stores auxiliary data A for each piece of reference data Dref.
By executing a program stored in the storage device 12, the control device 11 implements the matrix generator 312, the projective transformation unit 314, and the display control unit 40. The matrix generator 312 generates the transformation matrix W by selectively using one of the plurality of pieces of reference data Dref. The projective transformation unit 314 generates image data D3 of a converted image G3 from the image data D1 of the performance image G1 by a projective transformation using the transformation matrix W. The display control unit 40 causes the display device 14 to display the converted image G3 represented by the image data D3.
FIG. 23 is a flowchart illustrating a specific procedure of the processing executed by the control device 11 of the sixth embodiment (hereinafter the "first image processing"). The first image processing is started, for example, in response to an instruction from the user via the operation device 13.
By operating the operation device 13, the user selects one of a plurality of shooting conditions corresponding to the different reference images Gref. The control device 11 (matrix generator 312) determines whether a selection of shooting conditions has been received from the user (S31). When a selection has been received (S31: YES), the control device 11 (matrix generator 312) acquires, from among the plurality of pieces of reference data Dref stored in the storage device 12, the reference data Dref corresponding to the shooting conditions selected by the user (hereinafter the "selected reference data Dref") (S32). The user's selection of shooting conditions corresponds to an operation of selecting one of the plurality of reference images Gref (pieces of reference data Dref) corresponding to different shooting conditions.
The control device 11 (matrix generator 312) executes the same matrix generation processing as in the first embodiment using the selected reference data Dref (S33). Specifically, the control device 11 sets the initial matrix W0 by the initial setting process Sc1 using the selected reference data Dref, and generates the transformation matrix W by the matrix update process Sc2, which iteratively updates the initial matrix W0 so that the keyboard image g1 of the performance image G1 approaches the reference image Gref of the selected reference data Dref. On the other hand, when no selection of shooting conditions has been received (S31: NO), the selection of the reference data Dref (S32) and the matrix generation processing (S33) are not executed.
The control device 11 (projective transformation unit 314) generates the converted image G3 by executing projective transformation processing on the performance image G1 using the transformation matrix W (S34). The projective transformation processing is the same as in the first embodiment, and its result is image data D3 representing the converted image G3. Specifically, a converted image G3 corresponding to shooting conditions equivalent to those of the reference image Gref of the selected reference data Dref is generated from the performance image G1; that is, the converted image G3 is an image in which the shooting conditions of the performance image G1 have been converted into shooting conditions equivalent to those of the reference image Gref. As understood from the above description, the sixth embodiment generates a converted image G3 corresponding to the shooting conditions selected by the user.
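A minimal sketch of the projective transformation step (S34) follows, assuming OpenCV is available; W is the 3x3 transformation matrix obtained from the matrix generation processing, and taking the output size from the selected reference image Gref is an assumption.

```python
import cv2
import numpy as np

def to_converted_image(performance_img_g1: np.ndarray,
                       W: np.ndarray,
                       ref_image_gref: np.ndarray) -> np.ndarray:
    # Warp the performance image G1 so that its shooting conditions match
    # those of the reference image Gref (yielding the converted image G3).
    h, w = ref_image_gref.shape[:2]
    return cv2.warpPerspective(performance_img_g1, W, (w, h))
```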
The control device 11 (display control unit 40) causes the display device 14 to display the converted image G3 generated by the projective transformation processing (S35). The control device 11 then determines whether a termination condition is satisfied (S36); for example, when the user instructs the end of the first image processing by operating the operation device 13, the control device 11 determines that the termination condition is satisfied. If the termination condition is not satisfied (S36: NO), the control device 11 moves the processing to step S31; that is, the generation of the transformation matrix W on condition that a selection of shooting conditions is received (S31: YES, S32 to S33) and the generation and display of the converted image G3 (S34 to S35) are executed again. If the termination condition is satisfied (S36: YES), the control device 11 ends the first image processing.
As described above, in the sixth embodiment, the transformation matrix W is generated so that the keyboard image g1 in the performance image G1 approaches the reference image Gref, and projective transformation processing using that transformation matrix W is executed on the performance image G1. Therefore, the performance image G1 of the keyboard instrument 200 played by the user can be converted into a converted image G3 corresponding to the shooting conditions of the reference instrument in the reference image Gref.
Moreover, in the sixth embodiment, any one of the plurality of pieces of reference data Dref with different shooting conditions is selectively used for the matrix generation processing. Converted images G3 corresponding to a variety of shooting conditions can therefore be generated from a performance image G1 captured under particular shooting conditions. In the sixth embodiment in particular, the reference data Dref corresponding to the shooting conditions selected by the user is used for the matrix generation processing, so a converted image G3 corresponding to the user's desired shooting conditions can be generated. Changing the shooting conditions of the performance image G1 in this way makes it possible to generate converted images G3 usable for a variety of purposes. For example, by executing the first image processing of the sixth embodiment on each of a plurality of performance images G1 in which a music instructor has captured his or her own performances, a plurality of converted images G3 with unified shooting conditions can be generated as teaching materials for music lessons.
7: Seventh Embodiment
 As illustrated in the foregoing embodiments, the image extractor 311 extracts from the performance image G1 the specific region B including the keyboard image g1 and the finger image g2. The seventh embodiment is an image processing system 700 that uses this function of extracting the specific region B of the performance image G1. Focusing on the processing of the performance image G1 by the image extractor 311, the performance analysis systems 100 of the first to fifth embodiments can also be expressed as image processing systems 700. In the seventh embodiment, estimation of the user's fingering is not essential.
FIG. 24 is a block diagram illustrating the functional configuration of the image processing system 700 according to the seventh embodiment. Like the performance analysis system 100 of the first embodiment, the image processing system 700 includes the control device 11, the storage device 12, the operation device 13, the display device 14, and the imaging device 15. The imaging device 15 generates a time series of image data D1 representing the performance image G1 by capturing the keyboard instrument 200 under specific shooting conditions. As in the foregoing embodiments, the performance image G1 includes the keyboard image g1 and the finger image g2.
By executing a program stored in the storage device 12, the control device 11 functions as the image extractor 311 and the display control unit 40. The image extractor 311 generates image data D2 representing a performance image G2 obtained by extracting a partial region of the performance image G1. Specifically, as in the first embodiment, the image extractor 311 executes the region estimation process Sb1, which generates the image processing mask M, and the region extraction process Sb2, which applies the image processing mask M to the performance image G1. The display control unit 40 causes the display device 14 to display the performance image G2 represented by the image data D2.
The first embodiment used a single estimation model 51. In the seventh embodiment, the estimation model 51 used in the region estimation process Sb1 includes a first model 511 and a second model 512, each of which is a deep neural network such as a convolutional neural network or a recurrent neural network.
The first model 511 is a statistical model for generating a first mask representing a first region of the performance image G1. The first region is a region of the performance image G1 that includes the keyboard image g1; the finger image g2 is not included in the first region. The first mask is, for example, a binary mask in which each element inside the first region is set to the value "1" and each element outside the first region is set to the value "0". The image extractor 311 generates the first mask by inputting the image data D1 representing the performance image G1 to the first model 511. That is, the first model 511 is a trained model that has learned the relationship between the image data D1 and the first mask (first region) by machine learning.
The second model 512 is a statistical model for generating a second mask representing a second region of the performance image G1. The second region is a region of the performance image G1 that includes the finger image g2; the keyboard image g1 is not included in the second region. The second mask is, for example, a binary mask in which each element inside the second region is set to the value "1" and each element outside the second region is set to the value "0". The image extractor 311 generates the second mask by inputting the image data D1 representing the performance image G1 to the second model 512. That is, the second model 512 is a trained model that has learned the relationship between the image data D1 and the second mask (second region) by machine learning.
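The following is a hedged sketch of how such a model could be applied to obtain a binary mask; the patent specifies neither the network architecture nor its output format, so treating each model as a callable that returns a per-pixel probability map, and the thresholding step, are assumptions.

```python
import numpy as np

def estimate_region_mask(model, image_d1: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    # "model" stands in for the first model 511 or the second model 512 and is
    # assumed to return a per-pixel probability of belonging to its region.
    prob_map = model(image_d1)
    return (prob_map >= threshold).astype(np.uint8)   # binary mask: 1 inside the region
```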
FIG. 25 is a flowchart illustrating a specific procedure of the processing executed by the control device 11 of the seventh embodiment (hereinafter the "second image processing"). The second image processing is started, for example, in response to an instruction from the user via the operation device 13.
When the second image processing is started, the control device 11 (image extractor 311) executes the region estimation process Sb1 (S41 to S43). The region estimation process Sb1 of the seventh embodiment includes a first estimation process (S41), a second estimation process (S42), and a region combining process (S43).
The first estimation process estimates the first region of the performance image G1: the control device 11 generates the first mask representing the first region by inputting the image data D1 representing the performance image G1 to the first model 511 (S41). The second estimation process estimates the second region of the performance image G1: the control device 11 generates the second mask representing the second region by inputting the image data D1 representing the performance image G1 to the second model 512 (S42).
The region combining process generates the image processing mask M representing the specific region B, which includes the first region and the second region. Specifically, the specific region B represented by the image processing mask M corresponds to the union of the first region and the second region; that is, the control device 11 generates the image processing mask M by combining the first mask and the second mask (S43). As understood from the above description, the image processing mask M is, as in the first embodiment, a binary mask for extracting from the performance image G1 the specific region B including the keyboard image g1 and the finger image g2.
The control device 11 (image extractor 311) executes the same region extraction process Sb2 as in the first embodiment using the image processing mask M generated in the region estimation process Sb1 (S44). That is, the control device 11 generates the image data D2 representing the performance image G2 by using the image processing mask M to extract the specific region B from the performance image G1 represented by the image data D1.
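Below is a minimal sketch of the region combining process (S43) and the region extraction process (S44), assuming the performance image G1 is a colour array of shape (H, W, 3), that the two masks are binary arrays of shape (H, W), and that pixels outside the specific region B are simply zeroed.

```python
import numpy as np

def combine_masks(first_mask: np.ndarray, second_mask: np.ndarray) -> np.ndarray:
    # The specific region B is the union of the first region (keyboard image g1)
    # and the second region (finger image g2).
    return np.logical_or(first_mask, second_mask).astype(np.uint8)

def extract_specific_region(performance_img_g1: np.ndarray, mask_m: np.ndarray) -> np.ndarray:
    # Keep only the pixels inside the specific region B (the performance image G2).
    return performance_img_g1 * mask_m[..., np.newaxis]
```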
The control device 11 (display control unit 40) causes the display device 14 to display the performance image G2 generated by the region extraction process Sb2 (S45). The control device 11 then determines whether a termination condition is satisfied (S46); for example, when the user instructs the end of the second image processing by operating the operation device 13, the control device 11 determines that the termination condition is satisfied. If the termination condition is not satisfied (S46: NO), the control device 11 moves the processing to step S41; that is, the region estimation process Sb1 (S41 to S43), the region extraction process Sb2 (S44), and the display of the performance image G2 (S45) are executed again. If the termination condition is satisfied (S46: YES), the control device 11 ends the second image processing.
In the seventh embodiment, as in the first embodiment, the specific region B including the keyboard image g1 is extracted from the performance image G1, so the usability of the performance image G1 can be improved. In the seventh embodiment in particular, the specific region B extracted from the performance image G1 includes both the keyboard image g1 and the finger image g2, so a performance image G2 can be generated in which the state of the keyboard 22 of the keyboard instrument 200 and the state of the user's fingers can be viewed efficiently.
Furthermore, according to the seventh embodiment, the first region of the performance image G1 including the keyboard image g1 is estimated by the first model 511, and the second region of the performance image G1 including the finger image g2 is estimated by the second model 512. Therefore, compared with a configuration using a single estimation model 51 that extracts both the keyboard image g1 and the finger image g2 at once, the specific region B including the keyboard image g1 and the finger image g2 can be extracted with high accuracy. In addition, because the first model 511 and the second model 512 are each established by separate machine learning, the processing load of the machine learning for the first model 511 and the second model 512 is reduced.
A configuration in which the image extractor 311 can switch between a first mode and a second mode is also conceivable. The first mode is an operation mode in which both the keyboard image g1 and the finger image g2 are extracted from the performance image G1; in the first mode, the image extractor 311 executes both the first estimation process and the second estimation process, so the image processing mask M representing the specific region B is generated as in the seventh embodiment. That is, in the first mode, the specific region B including both the keyboard image g1 and the finger image g2 is extracted from the performance image G1.
The second mode is an operation mode in which the keyboard image g1 is extracted from the performance image G1. In the second mode, the image extractor 311 executes the first estimation process but does not execute the second estimation process; that is, the first mask generated by the first estimation process is adopted as the image processing mask M applied in the region extraction process Sb2. In the second mode, therefore, the keyboard image g1 is extracted from the performance image G1.
As described above, a configuration that can switch between the first mode and the second mode makes it possible to switch easily what is extracted from the performance image G1. In the above description, the image extractor 311 executes the first estimation process in the second mode, but a configuration in which, in the second mode, the image extractor 311 executes the second estimation process and not the first estimation process is also conceivable; in that configuration, the finger image g2 is extracted from the performance image G1. As understood from these examples, the second mode is expressed as an operation mode in which one of the first estimation process and the second estimation process is executed.
8: Modifications
 Specific modifications that may be added to the aspects exemplified above are illustrated below. Two or more aspects arbitrarily selected from the following examples may be combined as appropriate to the extent that they do not contradict one another.
(1) In the foregoing embodiments, the matrix generation processing was executed on the performance image G2 obtained by the image extraction processing (FIG. 8), but the matrix generation processing may instead be executed on the performance image G1 captured by the imaging device 15. That is, the image extraction processing (image extractor 311) that generates the performance image G2 from the performance image G1 may be omitted.
The foregoing embodiments exemplified finger position estimation processing that uses the performance image G1, but the finger position estimation processing may instead use the performance image G2 obtained by the image extraction processing; that is, the position C[h,f] of each of the user's fingers may be estimated by analyzing the performance image G2. Likewise, while the foregoing embodiments executed the projective transformation processing on the performance image G1, the projective transformation processing may be executed on the performance image G2 obtained by the image extraction processing; that is, the transformed image may be generated by projective transformation of the performance image G2.
(2) In the foregoing embodiments, the position c[h,f] of each of the user's fingers was converted by the projective transformation processing into the position C[h,f] in the X-Y coordinate system, but finger position data F representing the positions c[h,f] themselves may be generated instead. That is, the projective transformation processing (projective transformation unit 314) that converts the positions c[h,f] into the positions C[h,f] may be omitted.
(3) In the first to fifth embodiments, the transformation matrix W generated immediately after the start of the performance analysis processing is used continuously in the subsequent processing, but the transformation matrix W may be updated at an appropriate point during the execution of the performance analysis processing. For example, the transformation matrix W may be updated when the position of the imaging device 15 relative to the keyboard instrument 200 changes. Specifically, the transformation matrix W is updated when a change in the position of the imaging device 15 (hereinafter a "position change") is detected by analyzing the performance image G1, or when a position change of the imaging device 15 is indicated by the user.
 Specifically, the matrix generation unit 312 generates a transformation matrix δ representing the positional change (shift) of the photographing device 15. For example, for the coordinates (x, y) in the performance image G (G1, G2) after the positional change, the relationship expressed by the following equation (12) is assumed:

  (x'  y'  ε)ᵀ = δ (x  y  1)ᵀ   …(12)
 The matrix generation unit 312 generates the transformation matrix δ such that the coordinate x'/ε calculated by equation (12) from the x-coordinate of a specific point after the positional change approximates or matches the x-coordinate of the corresponding point in the performance image G before the positional change, and such that the coordinate y'/ε calculated by equation (12) from the y-coordinate of that point after the positional change approximates or matches the y-coordinate of the corresponding point in the performance image G before the positional change. The matrix generation unit 312 then generates, as the initial matrix W0, the product Wδ of the transformation matrix W before the positional change and the transformation matrix δ representing the positional change, and generates the transformation matrix W by updating the initial matrix W0 through the matrix update process Sc2.
 In the above configuration, the transformation matrix W after the positional change is generated using the transformation matrix W calculated before the positional change and the transformation matrix δ representing the positional change. Therefore, a transformation matrix W capable of specifying the position C[h,f] of each finger with high accuracy can be generated while reducing the load of the matrix generation process. Although the above description assumes the first to fifth embodiments, the transformation matrix W may likewise be updated at an appropriate point during execution of the first image processing in the sixth embodiment.
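 As a rough Python sketch of this update, assuming OpenCV is available: δ can be estimated from point correspondences between the performance images observed before and after the positional change, and the product of the previous W and δ is then used as the initial matrix W0. The final normalization merely stands in for the matrix update process Sc2, whose details are described in the embodiments; the function and argument names are assumptions.

```python
import cv2
import numpy as np

def updated_transform(W_before: np.ndarray,
                      points_after: np.ndarray,
                      points_before: np.ndarray) -> np.ndarray:
    """points_after / points_before: (N, 2) arrays of image coordinates of the
    same reference points observed after and before the positional change."""
    # delta maps coordinates after the positional change onto the
    # corresponding coordinates before the change, as in equation (12).
    delta, _ = cv2.findHomography(points_after.astype(np.float32),
                                  points_before.astype(np.float32),
                                  cv2.RANSAC)
    W0 = W_before @ delta        # initial matrix W0 = W * delta
    return W0 / W0[2, 2]         # placeholder for the matrix update process Sc2
```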
(4) In each of the above-described embodiments, the keyboard instrument 200 including the keyboard 22 is exemplified; however, the present disclosure can be applied to any type of musical instrument. For example, each of the above-described embodiments applies equally to any instrument that the user operates manually, such as a stringed instrument, a wind instrument, or a percussion instrument. A typical example is an instrument played with the fingers of one or both hands.
(5) The performance analysis system 100 may be realized by a server device that communicates with an information device such as a smartphone or a tablet terminal. For example, performance data P generated by a keyboard instrument 200 connected to the information device and image data D1 generated by a photographing device 15 mounted on or connected to the information device are transmitted from the information device to the performance analysis system 100. The performance analysis system 100 generates fingering data Q by executing the performance analysis process on the performance data P and the image data D1 received from the information device, and transmits the fingering data Q to the information device. Similarly, the image processing system 700 exemplified in the sixth or seventh embodiment may be realized by a server device that communicates with an information device.
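 A minimal sketch of such a server arrangement, assuming a simple HTTP interface (the transport and payload formats are not specified in the disclosure, and analyze_performance is a stub standing in for the performance analysis process):

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def analyze_performance(performance_data: bytes, image_data: bytes) -> dict:
    # Stub for the performance analysis process that produces fingering data Q
    # from performance data P and image data D1.
    return {"fingering": []}

@app.post("/analyze")
def analyze():
    performance_data = request.files["performance_data"].read()  # performance data P
    image_data = request.files["image_data"].read()              # image data D1
    return jsonify(analyze_performance(performance_data, image_data))
```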
(6) As described above, the functions of the performance analysis system 100 according to the first to fifth embodiments, or of the image processing system 700 according to the sixth and seventh embodiments, are realized by cooperation between the single or multiple processors constituting the control device 11 and the program stored in the storage device 12. The program according to the present disclosure may be provided in a form stored on a computer-readable recording medium and installed on a computer. The recording medium is, for example, a non-transitory recording medium; an optical recording medium (optical disc) such as a CD-ROM is a good example, but any known type of recording medium such as a semiconductor recording medium or a magnetic recording medium is also included. A non-transitory recording medium includes any recording medium other than a transitory, propagating signal, and volatile recording media are not excluded. In a configuration in which a distribution device distributes the program via a communication network, the storage device 12 that stores the program in the distribution device corresponds to the above-described non-transitory recording medium.
9: Supplementary Notes
 From the embodiments exemplified above, for example, the following configurations are understood.
 An image processing method according to one aspect (aspect 1) of the present disclosure estimates, in a performance image including an image of a musical instrument and images of a plurality of fingers of a user playing the musical instrument, a specific region including the image of the musical instrument, and extracts the specific region from the performance image. In this aspect, the specific region including the image of the musical instrument is extracted from the performance image including the image of the musical instrument and the images of the plurality of fingers of the user. Therefore, the convenience of the performance image can be improved.
 In a specific example of aspect 1 (aspect 2), the specific region is a region including the image of the musical instrument and an image of at least a part of the user's body. In this aspect, the specific region including the image of the musical instrument and the image of the user's body is extracted. Therefore, an image in which the state of the musical instrument and the state of the user's body can be efficiently viewed can be generated.
 In a specific example of aspect 2 (aspect 3), in estimating the specific region, an image processing mask representing the specific region is generated by inputting image data representing the performance image into a machine-learned estimation model, and in extracting the specific region, the specific region is extracted by applying the image processing mask to the performance image. In this aspect, the image processing mask representing the specific region is generated by inputting the image data of the performance image into the machine-learned estimation model. Therefore, the specific region can be specified with high accuracy for diverse unknown performance images.
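 As a rough illustration of aspect 3, the following sketch assumes an estimation model that returns a per-pixel probability map for the performance image; the mask is applied by zeroing out pixels outside the specific region. The function name and threshold are illustrative assumptions.

```python
import numpy as np

def extract_specific_region(image_data: np.ndarray, estimation_model) -> np.ndarray:
    """image_data: (H, W, 3) array representing the performance image."""
    probabilities = estimation_model(image_data)   # (H, W) map of values in [0, 1]
    mask = probabilities > 0.5                     # image processing mask
    # Keep only pixels inside the specific region; everything else becomes 0.
    return image_data * mask[..., None].astype(image_data.dtype)
```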
 In a specific example of aspect 3 (aspect 4), the estimation model includes a first model and a second model, and the estimation of the specific region includes: a first estimation process of estimating a first region of the performance image including the image of the musical instrument by inputting image data representing the performance image into the first model; a second estimation process of estimating a second region of the performance image including the images of the plurality of fingers by inputting the image data representing the performance image into the second model; and a region synthesis process of generating the image processing mask representing the specific region including the first region and the second region. In this aspect, the first region of the performance image including the image of the musical instrument is estimated by the first model, and the second region of the performance image including the image of the user is estimated by the second model. Therefore, compared with a configuration that uses a single model to extract both the image of the musical instrument and the image of the user at once, the specific region including the image of the musical instrument and the image of the user can be extracted with high accuracy. In addition, since the first model and the second model are each established by separate machine learning, the processing load of the machine learning for the first model and the second model is reduced.
 In a specific example of aspect 4 (aspect 5), it is possible to switch between a first mode in which both the first estimation process and the second estimation process are executed and a second mode in which one of the first estimation process and the second estimation process is executed. In this aspect, in the first mode, a specific region including the image of the musical instrument and the image of the user is extracted from the performance image. In the second mode, on the other hand, a specific region including one of the image of the musical instrument and the image of the user is extracted from the performance image. As described above, the target of extraction from the performance image can be switched easily.
 An image processing system according to one aspect (aspect 6) of the present disclosure includes: a region estimation unit that estimates, in a performance image including an image of a musical instrument and images of a plurality of fingers of a user playing the musical instrument, a specific region including the image of the musical instrument; and a region extraction unit that extracts the specific region from the performance image.
 A program according to one aspect (aspect 7) of the present disclosure causes a computer system to function as: a region estimation unit that estimates, in a performance image including an image of a musical instrument and images of a plurality of fingers of a user playing the musical instrument, a specific region including the image of the musical instrument; and a region extraction unit that extracts the specific region from the performance image.
 This application is based on Japanese Patent Application No. 2021-051181 filed on March 25, 2021, the content of which is incorporated herein by reference.
100…Performance analysis system
11…Control device
12…Storage device
13…Operation device
14…Display device
15…Photographing device
200…Keyboard instrument
21…Key
22…Keyboard
30…Performance analysis unit
31…Finger position data generation unit
311…Image extraction unit
312…Matrix generation unit
313…Finger position estimation unit
314…Projective transformation unit
32…Fingering data generation unit
321…Probability calculation unit
322…Fingering estimation unit
323…Control data generation unit
40…Display control unit
51…Estimation model
51a…Temporary model
52[k]…Estimation model
700…Image processing system

Claims (11)

  1.  An image processing method implemented by a computer system, the method comprising:
     estimating, in a performance image including an image of a musical instrument and images of a plurality of fingers of a user playing the musical instrument, a specific region including the image of the musical instrument; and
     extracting the specific region from the performance image.
  2.  The image processing method according to claim 1, wherein the specific region is a region including the image of the musical instrument and an image of at least a part of the user's body.
  3.  The image processing method according to claim 2, wherein:
     in estimating the specific region, an image processing mask representing the specific region is generated by inputting image data representing the performance image into a machine-learned estimation model; and
     in extracting the specific region, the specific region is extracted by applying the image processing mask to the performance image.
  4.  The image processing method according to claim 3, wherein:
     the estimation model includes a first model and a second model; and
     the estimation of the specific region includes:
     a first estimation process of estimating a first region of the performance image including the image of the musical instrument by inputting image data representing the performance image into the first model;
     a second estimation process of estimating a second region of the performance image including the images of the plurality of fingers by inputting the image data representing the performance image into the second model; and
     a region synthesis process of generating the image processing mask representing the specific region including the first region and the second region.
  5.  The image processing method according to claim 4, wherein switching is possible between a first mode in which both the first estimation process and the second estimation process are executed and a second mode in which one of the first estimation process and the second estimation process is executed.
  6.  An image processing system comprising:
     a region estimation unit that estimates, in a performance image including an image of a musical instrument and images of a plurality of fingers of a user playing the musical instrument, a specific region including the image of the musical instrument; and
     a region extraction unit that extracts the specific region from the performance image.
  7.  The image processing system according to claim 6, wherein the specific region is a region including the image of the musical instrument and an image of at least a part of the user's body.
  8.  The image processing system according to claim 7, wherein:
     the region estimation unit generates an image processing mask representing the specific region by inputting image data representing the performance image into a machine-learned estimation model; and
     in extracting the specific region, the specific region is extracted by applying the image processing mask to the performance image.
  9.  The image processing system according to claim 8, wherein:
     the estimation model includes a first model and a second model; and
     the region estimation unit performs:
     a first estimation process of estimating a first region of the performance image including the image of the musical instrument by inputting image data representing the performance image into the first model;
     a second estimation process of estimating a second region of the performance image including the images of the plurality of fingers by inputting the image data representing the performance image into the second model; and
     a region synthesis process of generating the image processing mask representing the specific region including the first region and the second region.
  10. The image processing system according to claim 9, wherein switching is possible between a first mode in which both the first estimation process and the second estimation process are executed and a second mode in which one of the first estimation process and the second estimation process is executed.
  11. A program that causes a computer system to function as:
     a region estimation unit that estimates, in a performance image including an image of a musical instrument and images of a plurality of fingers of a user playing the musical instrument, a specific region including the image of the musical instrument; and
     a region extraction unit that extracts the specific region from the performance image.
PCT/JP2022/009830 2021-03-25 2022-03-07 Image processing method, image processing system, and program WO2022202266A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202280022994.XA CN117043818A (en) 2021-03-25 2022-03-07 Image processing method, image processing system, and program

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2021051181A JP2022149159A (en) 2021-03-25 2021-03-25 Image processing method, image processing system, and program
JP2021-051181 2021-03-25

Publications (1)

Publication Number Publication Date
WO2022202266A1 true WO2022202266A1 (en) 2022-09-29

Family

ID=83397016

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2022/009830 WO2022202266A1 (en) 2021-03-25 2022-03-07 Image processing method, image processing system, and program

Country Status (3)

Country Link
JP (1) JP2022149159A (en)
CN (1) CN117043818A (en)
WO (1) WO2022202266A1 (en)

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2020046500A (en) * 2018-09-18 2020-03-26 ソニー株式会社 Information processing apparatus, information processing method and information processing program

Also Published As

Publication number Publication date
JP2022149159A (en) 2022-10-06
CN117043818A (en) 2023-11-10

Similar Documents

Publication Publication Date Title
US11557269B2 (en) Information processing method
EP3759707B1 (en) A method and system for musical synthesis using hand-drawn patterns/text on digital and non-digital surfaces
US20210151014A1 (en) Information processing device for musical score data
CN113421547B (en) Voice processing method and related equipment
US20240013754A1 (en) Performance analysis method, performance analysis system and non-transitory computer-readable medium
WO2020059245A1 (en) Information processing device, information processing method and information processing program
JP2021043258A (en) Control system and control method
JP2022115956A (en) Information processing method, information processing device and program
WO2022202266A1 (en) Image processing method, image processing system, and program
WO2022202265A1 (en) Image processing method, image processing system, and program
US20230230493A1 (en) Information Processing Method, Information Processing System, and Recording Medium
WO2022202267A1 (en) Information processing method, information processing system, and program
US20220414472A1 (en) Computer-Implemented Method, System, and Non-Transitory Computer-Readable Storage Medium for Inferring Audience&#39;s Evaluation of Performance Data
JP7152908B2 (en) Gesture control device and gesture control program
CN115437598A (en) Interactive processing method and device of virtual musical instrument and electronic equipment
Moryossef et al. At your fingertips: Automatic piano fingering detection
WO2023032422A1 (en) Processing method, program, and processing device
WO2023243293A1 (en) Performance motion estimation method and performance motion estimation device
WO2023181570A1 (en) Information processing method, information processing system, and program
WO2023053632A1 (en) Information processing device, information processing method, and program
CN113657185A (en) Intelligent auxiliary method, device and medium for piano practice
CN113782059A (en) Musical instrument audio evaluation method and device and non-transient storage medium
CN116343820A (en) Audio processing method, device, equipment and storage medium
CN116324932A (en) Information processing method and information processing system
CN115687668A (en) Music file generation method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22775060

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 202280022994.X

Country of ref document: CN

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 22775060

Country of ref document: EP

Kind code of ref document: A1