CN112466266B - Control system and control method - Google Patents

Control system and control method

Publication number: CN112466266B
Authority: CN (China)
Legal status: Active
Application number: CN202010876140.0A
Other languages: Chinese (zh)
Other versions: CN112466266A
Inventor: 前泽阳
Assignee (current and original): Yamaha Corp
Priority claimed from JP2019163227A (external priority, patent JP7383943B2)
Application filed by Yamaha Corp
Publication of CN112466266A (application published)
Publication of CN112466266B (application granted)


Abstract

A control system and a control method are provided that can estimate the timing at which an event occurs based on the motion of a face. The system includes: an acquisition unit that acquires image information; a determination unit that detects, from the image information, the movement of a face and the direction of a line of sight in the captured image represented by the image information, and determines, using the detection result, whether or not a preparatory action associated with a cue action indicating the timing at which an event is to occur has been performed; an estimation unit that estimates the timing at which the event occurs based on the image information when the determination unit determines that the preparatory action has been performed; and an output unit that outputs the estimation result obtained by the estimation unit.

Description

Control system and control method
Technical Field
The present invention relates to a control system and a control method.
Background
Conventionally, a score alignment technique has been proposed in which the current position of a performance within a musical piece (hereinafter referred to as the "performance position") is estimated by analyzing the sound of the piece being played (for example, Patent Document 1).
Prior art literature
Patent literature
Patent Document 1: Japanese Patent Application Laid-Open No. 2015-79183
Disclosure of Invention
Problems to be solved by the invention
In an ensemble system in which players perform together with an automatic playing instrument or the like, for example, the following processing is performed: based on the result of estimating the position on the score at which the players are performing, the timing of the event at which the automatic playing instrument produces the next sound is predicted. In reality, however, when people play together, they often coordinate timing by means of cues, for example at the start of a piece, at the release of a fermata, or at the final note of a piece.
The present invention has been made in view of such circumstances, and an object thereof is to provide a control system and a control method capable of estimating the timing at which an event occurs based on the motion of a face.
Means for solving the problems
In order to solve the above-described problems, an aspect of the present invention is a control system including: an acquisition unit that acquires image information in which a user is imaged over time; a determination unit that determines whether or not a preparatory action has been performed based on the movement of the user's face and the direction of the user's line of sight detected from the image information; an estimation unit that estimates the timing at which an event occurs when it is determined that the preparatory action has been performed; and an output unit that outputs the estimation result obtained by the estimation unit.
In order to solve the above-described problems, another aspect of the present invention is a control system including: an acquisition unit that acquires image information; a determination unit that detects, from the image information, the movement of a face and the direction of a line of sight in the captured image represented by the image information, and determines, using the detection result, whether or not a preparatory action associated with a cue action indicating the timing at which an event is to occur has been performed; an estimation unit that estimates the timing at which the event occurs based on the image information and the cue action when the determination unit determines that the preparatory action has been performed; and an output unit that outputs the estimation result obtained by the estimation unit.
Another aspect of the present invention is a control method in which: an acquisition unit acquires image information; a determination unit detects, from the image information, the movement of a face and the direction of a line of sight in the captured image represented by the image information, and determines, using the detection result, whether or not a preparatory action associated with a cue action indicating the timing at which an event is to occur has been performed; an estimation unit estimates the timing at which the event occurs based on the image information and the cue action when the determination unit determines that the preparatory action has been performed; and an output unit outputs the estimation result obtained by the estimation unit.
Effects of the invention
According to the present invention, the timing at which an event occurs can be estimated based on the motion of the face.
Drawings
Fig. 1 is a block diagram of an automatic playing system according to an embodiment of the present invention.
Fig. 2 is an explanatory diagram of the cue action and the performance position.
Fig. 3 is an explanatory diagram of the image synthesis performed by the image synthesis unit.
Fig. 4 is an explanatory diagram of a relationship between a performance position of a performance object song and an instruction position of an automatic performance.
Fig. 5 is an explanatory diagram of the relationship between the time point of the cue action and the start point of the performance of the performance object track.
Fig. 6 is an explanatory diagram of a performance image.
Fig. 7 is an explanatory diagram of a performance image.
Fig. 8 is a flowchart of the operation of the control device.
Fig. 9 is a block diagram of an analysis processing unit in embodiment 2.
Fig. 10 is an explanatory diagram of the operation of the analysis processing unit in embodiment 2.
Fig. 11 is a flowchart showing the operation of the analysis processing unit in embodiment 2.
Fig. 12 is a block diagram of the automatic playing system.
Fig. 13 is a simulation result of the sound emission timing of the player and the sound emission timing of the accompaniment part.
Fig. 14 is an evaluation result of the automatic playing system.
Fig. 15 is a block diagram of the detection processing unit 524 in embodiment 3.
Fig. 16 is a flowchart showing the operation of the detection processing unit 524 in embodiment 3.
Description of the reference numerals
100 … automatic performance system, 12 … control device, 22 … recording device, 222 … imaging device, 52 … cue detection unit, 522 … image synthesis unit, 524 … detection processing unit, 5240 … acquisition unit, 5241 … determination unit, 5242 … estimation unit, 5243 … output unit, 5244 … face part extraction model, 5245 … cue action estimation model
Detailed Description
Embodiment 1
Fig. 1 is a block diagram of an automatic playing system 100 according to Embodiment 1 of the present application. The automatic playing system 100 is a computer system installed in a space such as a concert hall in which a plurality of players P play musical instruments, and performs, in parallel with the performance by the plurality of players P, an automatic performance of the piece of music they are playing (hereinafter referred to as the "performance object track"). The player P is typically an instrumentalist, but a singer of the performance object track may also be a player P. That is, "performance" in the present application includes not only the playing of musical instruments but also singing. A person who is not actually responsible for playing an instrument (for example, a conductor at a concert or a sound director at a recording session) may also be included among the players P.
As illustrated in fig. 1, the automatic playing system 100 of the present embodiment includes a control device 12, a storage device 14, a recording device 22, an automatic playing device 24, and a display device 26. The control device 12 and the storage device 14 are implemented by an information processing device such as a personal computer, for example.
The control device 12 is a processing circuit such as a CPU (Central Processing Unit), for example, and controls the elements of the automatic playing system 100 in an integrated manner. The storage device 14 is composed of a known recording medium such as a magnetic recording medium or a semiconductor recording medium, or a combination of a plurality of recording media, and stores the program executed by the control device 12 and various data used by the control device 12. A storage device 14 separate from the automatic playing system 100 (for example, cloud storage) may also be prepared, and the control device 12 may write to and read from that storage device 14 via a communication network such as a mobile communication network or the Internet. In that case, the storage device 14 may be omitted from the automatic playing system 100.
The storage device 14 of the present embodiment stores music data M. The music data M specifies the performance content of the performance object track to be played automatically. For example, a file in a format conforming to the MIDI (Musical Instrument Digital Interface) standard (SMF: Standard MIDI File) is suitable as the music data M. Specifically, the music data M is time-series data in which instruction data indicating performance content and time data indicating the occurrence time of the instruction data are arranged. The instruction data specifies a pitch (note number) and an intensity (velocity), and instructs various events such as note-on (sounding) and note-off (muting). The time data specifies, for example, the interval (time difference) between successive pieces of instruction data.
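As a point of reference only, the time-series structure of the music data M described above can be sketched as follows; this is a simplified illustration assuming plain (delta-time, event) pairs rather than the actual binary SMF format, and all names are illustrative:

```python
from dataclasses import dataclass

@dataclass
class InstructionData:
    """One performance event, e.g. note-on or note-off with pitch and velocity."""
    event_type: str  # "note_on" or "note_off"
    pitch: int       # note number (0-127)
    velocity: int    # intensity (0-127)

@dataclass
class TimedEvent:
    delta_time: float            # interval to the preceding instruction, in seconds
    instruction: InstructionData

# Music data M as a time series of instruction data and time data
music_data_m = [
    TimedEvent(0.0, InstructionData("note_on", 60, 80)),
    TimedEvent(0.5, InstructionData("note_off", 60, 0)),
    TimedEvent(0.0, InstructionData("note_on", 64, 72)),
]
```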
The automatic playing device 24 of fig. 1 performs the automatic performance of the performance object song under the control of the control device 12. Specifically, among the plurality of performance parts constituting the performance object song, a part different from those of the plurality of players P (for example, string instruments) is played automatically by the automatic playing device 24. The automatic playing device 24 of the present embodiment is a keyboard instrument (i.e., an automatic player piano) provided with a driving mechanism 242 and a sounding mechanism 244. The sounding mechanism 244 is a string-striking mechanism that sounds strings (i.e., sounding bodies) in conjunction with the displacement of the keys of the keyboard, in the same manner as an acoustic piano. Specifically, the sounding mechanism 244 includes, for each key, an action mechanism composed of a hammer capable of striking a string and a plurality of transmission members (for example, a wippen, a jack, and a repetition lever) that transmit the displacement of the key to the hammer. The driving mechanism 242 performs the automatic performance of the performance object song by driving the sounding mechanism 244. Specifically, the driving mechanism 242 includes a plurality of driving bodies (for example, actuators such as solenoids) that displace the respective keys, and a driving circuit that drives the driving bodies. The driving mechanism 242 drives the sounding mechanism 244 in accordance with instructions from the control device 12, thereby realizing the automatic performance of the performance object track. The control device 12 or the storage device 14 may also be mounted on the automatic playing device 24.
The recording device 22 records the manner in which the plurality of players P perform the performance object track. As illustrated in fig. 1, the recording device 22 of the present embodiment includes a plurality of imaging devices 222 and a plurality of sound pickup devices 224. An imaging device 222 is provided for each player P and generates an image signal V0 by imaging that player P. The image signal V0 is a signal representing a moving image of the player P. A sound pickup device 224 is provided for each player P and picks up the sound (for example, instrument sound or singing voice) generated by the performance of that player P (for example, playing an instrument or singing) to generate an acoustic signal A0. The acoustic signal A0 is a signal representing the waveform of the sound. As understood from the above description, a plurality of image signals V0 obtained by imaging different players P and a plurality of acoustic signals A0 obtained by picking up the sounds played by different players P are recorded. An acoustic signal A0 output from an electronic musical instrument such as an electric string instrument may also be used; in that case, the sound pickup devices 224 may be omitted.
The control device 12 executes a program stored in the storage device 14 to implement a plurality of functions (a cue detection section 52, a performance analysis section 54, a performance control section 56, and a display control section 58) for realizing the automatic performance of the performance object track. The functions of the control device 12 may be realized by a set of a plurality of devices (i.e., a system), or some or all of the functions of the control device 12 may be realized by a dedicated electronic circuit. A server device located away from the space, such as the concert hall, in which the recording device 22, the automatic playing device 24, and the display device 26 are installed may also realize some or all of the functions of the control device 12.
Each player P performs an action (hereinafter referred to as a "cue action") that serves as a cue for the performance of the performance object track. The cue action is an action (gesture) that indicates one point on the time axis. For example, an action in which the player P lifts his or her instrument, or an action in which the player P moves his or her body, is a suitable example of the cue action. For example, as illustrated in fig. 2, a specific player P who leads the performance of the performance object song performs the cue action at a time point Q that precedes the start point at which the performance of the performance object song should begin by a predetermined period (hereinafter referred to as the "preparation period") B. The preparation period B is, for example, a period whose length corresponds to one beat of the performance object song. Accordingly, the duration of the preparation period B varies with the performance speed (tempo) of the performance object song: the faster the performance speed, the shorter the preparation period B. The player P performs the cue action at the point that precedes the start point of the performance object track by the preparation period B corresponding to one beat, based on the performance speed assumed for the performance object track, and then starts the performance of the performance object track at the arrival of the start point. The cue action serves not only as a trigger for the performance of the other players P but also as a trigger for the automatic performance of the automatic playing device 24. The duration of the preparation period B is arbitrary and may, for example, be set to a duration corresponding to a plurality of beats.
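The dependence of the preparation period B on the tempo can be written down directly. A minimal sketch, assuming the tempo is given in beats per minute and that B spans one beat (the function name and units are illustrative, not taken from the patent):

```python
def preparation_period_seconds(tempo_bpm: float, beats: float = 1.0) -> float:
    """Duration of the preparation period B spanning `beats` beats before the
    start point; the faster the tempo, the shorter the period."""
    return beats * 60.0 / tempo_bpm

# At 120 BPM one beat lasts 0.5 s; at 90 BPM it lasts about 0.667 s.
print(preparation_period_seconds(120.0))  # 0.5
print(preparation_period_seconds(90.0))   # 0.666...
```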
The cue detection section 52 of fig. 1 detects the cue action of a player P. Specifically, the cue detection section 52 detects the cue action by analyzing the image of each player P captured by the corresponding imaging device 222. As illustrated in fig. 1, the cue detection section 52 of the present embodiment includes an image synthesis section 522 and a detection processing section 524. The image synthesis section 522 synthesizes the plurality of image signals V0 generated by the plurality of imaging devices 222 to generate an image signal V. As illustrated in fig. 3, the image signal V is a signal representing an image in which the plurality of moving images (#1, #2, #3, ...) represented by the respective image signals V0 are arranged. That is, the image signal V representing the moving images of the players P is supplied from the image synthesis section 522 to the detection processing section 524.
The detection processing section 524 detects the cue action of any one of the plurality of players P by analyzing the image signal V generated by the image synthesis section 522. For the detection of the cue action by the detection processing section 524, a known image analysis technique may be used that includes image recognition processing for extracting from the image the elements (for example, the body or the instrument) that move when the player P performs a cue action, and moving-body detection processing for detecting the movement of those elements. A recognition model such as a neural network or a multi-way tree may also be used to detect the cue action. For example, machine learning (e.g., deep learning) of the recognition model is performed in advance using, as learning data, feature quantities extracted from image signals in which the performances of a plurality of players P are captured. The detection processing section 524 detects the cue action by applying the feature quantities extracted from the image signal V in the scene in which the automatic performance is actually performed to the recognition model obtained by the machine learning.
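The detection flow described above can be sketched schematically as follows. Both the feature extractor and the classifier are placeholders standing in for the image analysis and the trained recognition model; neither reflects the actual model disclosed in the patent:

```python
import numpy as np

def extract_features(frame: np.ndarray) -> np.ndarray:
    """Placeholder feature extraction, e.g. motion statistics of the elements
    (body, instrument) that move when a player performs a cue action."""
    return frame.astype(np.float32).mean(axis=(0, 1))

class CueClassifier:
    """Stand-in for a recognition model (e.g. a neural network) trained
    offline on feature quantities extracted from performance videos."""
    def __init__(self, threshold: float = 0.5):
        self.threshold = threshold

    def predict_proba(self, features: np.ndarray) -> float:
        # In practice this would be the output of the learned model.
        return float(features.mean() > 128)

def detect_cue(frame: np.ndarray, model: CueClassifier) -> bool:
    """True if the current video frame is judged to contain a cue action."""
    return model.predict_proba(extract_features(frame)) >= model.threshold
```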
The performance analysis section 54 of fig. 1 sequentially estimates the position (hereinafter referred to as the "performance position") T at which the plurality of players P are currently playing within the performance object track, in parallel with the performance by the players P. Specifically, the performance analysis section 54 estimates the performance position T by analyzing the sound picked up by each of the plurality of sound pickup devices 224. As illustrated in fig. 1, the performance analysis section 54 of the present embodiment includes an acoustic mixing section 542 and an analysis processing section 544. The acoustic mixing section 542 mixes the plurality of acoustic signals A0 generated by the plurality of sound pickup devices 224 to generate an acoustic signal A. That is, the acoustic signal A is a signal representing a mixture of the plurality of sounds represented by the different acoustic signals A0.
The analysis processing section 544 estimates the performance position T by analyzing the acoustic signal A generated by the acoustic mixing section 542. For example, the analysis processing section 544 determines the performance position T by collating the sound represented by the acoustic signal A with the performance content of the performance object track represented by the music data M. The analysis processing section 544 of the present embodiment also estimates the performance speed (tempo) R of the performance target song by analyzing the acoustic signal A. For example, the analysis processing section 544 determines the performance speed R from the temporal change of the performance position T (i.e., the change of the performance position T in the time-axis direction). Any known acoustic analysis technique (score alignment) may be used to estimate the performance position T and the performance speed R in the analysis processing section 544; for example, the analysis technique disclosed in Patent Document 1 can be used. A recognition model such as a neural network or a multi-way tree may also be used to estimate the performance position T and the performance speed R. For example, machine learning (e.g., deep learning) for generating the recognition model is performed before the automatic performance, using as learning data feature quantities extracted from acoustic signals A recorded from performances of a plurality of players P. The analysis processing section 544 estimates the performance position T and the performance speed R by applying the feature quantities extracted from the acoustic signal A in the scene in which the automatic performance is actually performed to the recognition model generated by the machine learning.
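As an aside, deriving the performance speed R from the temporal change of the performance position T reduces to a finite difference over recent estimates. A minimal sketch under that assumption (the score-alignment step itself is left to the techniques cited above):

```python
def estimate_speed(positions: list[float], frame_period: float) -> float:
    """Performance speed R as the change of the estimated performance position T
    (in beats) per unit time, averaged over the most recent unit sections."""
    if len(positions) < 2:
        return 0.0
    diffs = [(b - a) / frame_period for a, b in zip(positions, positions[1:])]
    return sum(diffs) / len(diffs)

# Positions estimated once per 0.1 s unit section, in beats:
print(estimate_speed([10.0, 10.2, 10.4, 10.6], frame_period=0.1))  # 2.0 beats/s
```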
The detection of the cue action by the cue detection section 52 and the estimation of the performance position T and the performance speed R by the performance analysis section 54 are performed in real time, in parallel with the performance of the performance object track by the plurality of players P. For example, the detection of the cue action and the estimation of the performance position T and the performance speed R are repeated in a predetermined cycle. The cycle at which the cue action is detected and the cycle at which the performance position T and the performance speed R are estimated may, however, differ from each other.
The performance control section 56 of fig. 1 causes the automatic playing device 24 to execute the automatic performance of the performance object song in synchronization with the cue action detected by the cue detection section 52 and the progress of the performance position T estimated by the performance analysis section 54. Specifically, the performance control section 56 instructs the automatic playing device 24 to start the automatic performance, triggered by the detection of the cue action by the cue detection section 52, and instructs the automatic playing device 24 on the performance content specified by the music data M for the time point corresponding to the performance position T in the performance target track. That is, the performance control section 56 is a sequencer that sequentially supplies the instruction data contained in the music data M of the performance object song to the automatic playing device 24. The automatic playing device 24 performs the automatic performance of the performance object song in accordance with the instructions from the performance control section 56. Since the performance position T moves toward the end of the performance object track as the performance of the plurality of players P proceeds, the automatic performance of the performance object track by the automatic playing device 24 also proceeds along with the movement of the performance position T. As understood from the above description, the performance control section 56 instructs the automatic playing device 24 to synchronize the tempo of the performance and the timing of each sound with the performance of the plurality of players P, while keeping the musical expression, such as the intensity of each note and the phrasing of the target song, at the content specified by the music data M. Therefore, when music data M representing the performance of a specific player (for example, a player who has since passed away) is used, it is possible to faithfully reproduce, through the automatic performance, a musical expression unique to that player, while creating an atmosphere in which that player and the plurality of players P actually present perform in close coordination, as if breathing together.
Note that, from the time the performance control section 56 outputs instruction data to instruct the automatic playing device 24 to perform, it takes on the order of several hundred milliseconds until the automatic playing device 24 actually produces sound (for example, until a hammer of the sounding mechanism 244 strikes a string). That is, the actual sound production of the automatic playing device 24 is inevitably delayed with respect to the instruction from the performance control section 56. Consequently, in a configuration in which the performance control section 56 instructs the automatic playing device 24 to play the performance position T itself as estimated by the performance analysis section 54 in the performance object track, the sound production of the automatic playing device 24 would lag behind the performance of the plurality of players P.
Therefore, as illustrated in fig. 2, the performance control section 56 of the present embodiment instructs the automatic playing device 24 to play a time point TA that is later (in the future) than the performance position T estimated by the performance analysis section 54 in the performance target track. That is, the performance control section 56 reads ahead the instruction data in the music data M of the performance object song so that the delayed sound production is synchronized with the performance of the plurality of players P (for example, so that a specific note of the performance object song is sounded substantially simultaneously by the automatic playing device 24 and each player P).
Fig. 4 is an explanatory diagram of temporal changes in the performance position T. The amount of fluctuation of the performance position T per unit time (gradient of the straight line of fig. 4) corresponds to the performance speed R. In fig. 4, for brevity, a case where the performance speed R is maintained constant is illustrated.
As illustrated in fig. 4, the performance control section 56 instructs the automatic playing device 24 to play the time point TA that is behind the performance position T in the performance target track by an adjustment amount α. The adjustment amount α is variably set in accordance with the delay amount D from the instruction of the automatic performance by the performance control section 56 until the automatic playing device 24 actually produces sound, and with the performance speed R estimated by the performance analysis section 54. Specifically, the performance control section 56 sets, as the adjustment amount α, the length of the section of the performance object song that is played during the time of the delay amount D at the performance speed R. Therefore, the adjustment amount α becomes larger as the performance speed R becomes faster (as the gradient of the straight line in fig. 4 becomes steeper). Although fig. 4 assumes that the performance speed R is maintained constant throughout the entire performance object song, in reality the performance speed R can vary, so the adjustment amount α can vary with time in conjunction with the performance speed R.
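The look-ahead described above amounts to TA = T + α with α = R · D. A minimal sketch, assuming R is expressed in score units (e.g. beats) per second and D in seconds:

```python
def lookahead_position(performance_position_t: float,
                       performance_speed_r: float,
                       delay_d: float) -> float:
    """Position TA to instruct to the automatic playing device: the adjustment
    amount alpha is the section of the piece played during the sounding delay D
    at the current performance speed R."""
    alpha = performance_speed_r * delay_d
    return performance_position_t + alpha

# Example: position 32.0 beats, 2 beats/s, 100 ms sounding delay -> instruct 32.2
print(lookahead_position(32.0, 2.0, 0.1))
```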
The delay amount D is set in advance to a predetermined value (for example, on the order of several tens to several hundreds of milliseconds) according to measurement results for the automatic playing device 24. In an actual automatic playing device 24, the delay amount D may differ depending on the pitch or intensity of the notes played. Therefore, the delay amount D (and, consequently, the adjustment amount α that depends on it) may be variably set in accordance with the pitch or intensity of the note to be played automatically.
The performance control section 56 instructs the automatic playing device 24 to start the automatic performance of the performance object song, triggered by the cue action detected by the cue detection section 52. Fig. 5 is an explanatory diagram of the relationship between the cue action and the automatic performance. As illustrated in fig. 5, the performance control section 56 starts the instruction for the automatic performance to the automatic playing device 24 at a time point QA at which a duration δ has elapsed from the time point Q at which the cue action is detected. The duration δ is obtained by subtracting the delay amount D of the automatic performance from the duration τ corresponding to the preparation period B. The duration τ of the preparation period B varies with the performance speed R of the performance object song: the faster the performance speed R (the steeper the gradient of the straight line in fig. 5), the shorter the duration τ becomes. However, at the time point of the cue action the performance of the performance object track has not yet started, so the performance speed R has not been estimated. The performance control section 56 therefore calculates the duration τ of the preparation period B based on a standard performance speed (standard tempo) R0 assumed for the performance object track. The performance speed R0 is specified, for example, in the music data M. Alternatively, a tempo commonly recognized by the plurality of players P for the performance object track (for example, the tempo assumed during rehearsal) may be set as the performance speed R0.
As described above, the performance control section 56 starts the instruction for the automatic performance at the time point QA at which the duration δ (δ = τ − D) has elapsed from the time point Q of the cue action. Therefore, the sound production of the automatic playing device 24 starts at the time point QB at which the preparation period B has elapsed from the time point Q of the cue action (i.e., the time point at which the plurality of players P start playing). That is, the automatic performance by the automatic playing device 24 starts substantially simultaneously with the start of the performance of the performance object track by the plurality of players P. The control of the automatic performance by the performance control section 56 of the present embodiment is as exemplified above.
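A minimal sketch combining the quantities Q, τ, D and δ introduced above, assuming the standard tempo R0 is given in beats per minute and the preparation period spans one beat:

```python
def instruction_start_time(cue_time_q: float,
                           standard_tempo_r0_bpm: float,
                           delay_d: float,
                           beats_in_preparation: float = 1.0) -> float:
    """Time point QA at which the instruction for the automatic performance is
    started: QA = Q + delta, where delta = tau - D and tau is the duration of
    the preparation period B at the standard tempo R0."""
    tau = beats_in_preparation * 60.0 / standard_tempo_r0_bpm
    delta = tau - delay_d
    return cue_time_q + delta

# Cue at t = 10.0 s, 120 BPM (tau = 0.5 s), 100 ms sounding delay -> QA = 10.4 s,
# so sounding starts at QB = Q + tau = 10.5 s, together with the players.
print(instruction_start_time(10.0, 120.0, 0.1))
```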
The display control section 58 of fig. 1 causes an image G that visually represents the progress of the automatic performance by the automatic playing device 24 (hereinafter referred to as the "performance image") to be displayed on the display device 26. Specifically, the display control section 58 generates image data representing the performance image G and outputs it to the display device 26, thereby causing the display device 26 to display the performance image G. The display device 26 displays the performance image G as instructed by the display control section 58. A liquid crystal display panel or a projector, for example, is suitable as the display device 26. The plurality of players P can view the performance image G displayed on the display device 26 at any time, in parallel with their performance of the performance object track.
The display control section 58 of the present embodiment causes the display device 26 to display, as the performance image G, a moving image that changes dynamically in conjunction with the automatic performance of the automatic playing device 24. Fig. 6 and 7 are display examples of the performance image G. As illustrated in fig. 6 and 7, the performance image G is a stereoscopic image in which a display body (object) 74 is arranged in a virtual space 70 having a bottom surface 72. As illustrated in fig. 6, the display 74 is a substantially spherical solid that floats in the virtual space 70 and descends at a predetermined speed. A shadow 75 of the display 74 is shown on the bottom surface 72 of the virtual space 70, and as the display 74 descends, the shadow 75 on the bottom surface 72 approaches it. As illustrated in fig. 7, the display 74 rises to a predetermined height in the virtual space 70 when the automatic playing device 24 starts producing a sound, and its shape deforms irregularly while the sound continues. When the sound of the automatic performance stops (is muted), the irregular deformation of the display 74 stops, the display 74 returns to the initial shape (spherical) of fig. 6, and it transitions back to the state of descending at the predetermined speed. The above behavior of the display 74 (rising and deformation) is repeated for each sound of the automatic performance. For example, the display 74 descends before the start of the performance object track, and its direction of movement changes from descending to rising at the moment the note at the start of the performance object track is sounded by the automatic performance. Accordingly, a player P viewing the performance image G displayed on the display device 26 can grasp the timing of the sound production by the automatic playing device 24 from the transition of the display 74 from descending to rising.
The display control section 58 of the present embodiment controls the display device 26 so as to display the performance image G exemplified above. The delay from when the display control section 58 instructs the display device 26 to display or change an image until that instruction is reflected in the displayed image is sufficiently smaller than the delay amount D of the automatic playing device 24. Accordingly, the display control section 58 causes the display device 26 to display the performance image G corresponding to the performance content at the performance position T itself estimated by the performance analysis section 54 in the performance target track. Therefore, as described above, the performance image G changes dynamically in synchronization with the actual sound production of the automatic playing device 24 (the time point delayed by the delay amount D from the instruction of the performance control section 56). That is, the movement of the display 74 of the performance image G switches from descending to rising at the moment the automatic playing device 24 actually starts sounding each note of the performance object song. Accordingly, each player P can visually confirm the timing at which the automatic playing device 24 sounds each note of the performance object song.
Fig. 8 is a flowchart illustrating the operation of the control device 12 of the automatic playing system 100. The process of fig. 8 is started, for example, in parallel with the performance of the performance object track by the plurality of players P, triggered by an interrupt signal generated at a predetermined cycle. When the process of fig. 8 starts, the control device 12 (cue detection section 52) analyzes the plurality of image signals V0 supplied from the plurality of imaging devices 222 to determine whether or not a cue action has been performed by any player P (SA1). The control device 12 (performance analysis section 54) analyzes the plurality of acoustic signals A0 supplied from the plurality of sound pickup devices 224 to estimate the performance position T and the performance speed R (SA2). The order of the detection of the cue action (SA1) and the estimation of the performance position T and the performance speed R (SA2) may be reversed.
The control device 12 (performance control section 56) instructs the automatic playing device 24 to perform the automatic performance corresponding to the performance position T and the performance speed R (SA3). Specifically, it causes the automatic playing device 24 to perform the automatic performance of the performance object song in synchronization with the cue action detected by the cue detection section 52 and the progress of the performance position T estimated by the performance analysis section 54. Further, the control device 12 (display control section 58) causes the display device 26 to display the performance image G representing the progress of the automatic performance (SA4).
In the embodiment exemplified above, the automatic performance of the automatic playing device 24 is performed in synchronization with the cue action of the player P and the progress of the performance position T, while the performance image G representing the progress of the automatic performance by the automatic playing device 24 is displayed on the display device 26. Accordingly, each player P can visually confirm the progress of the automatic performance by the automatic playing device 24 and reflect it in his or her own performance. That is, a natural ensemble is achieved in which the performance of the plurality of players P and the automatic performance of the automatic playing device 24 interact. In particular, the present embodiment has the advantage that, since the performance image G changing dynamically according to the performance content of the automatic performance is displayed on the display device 26, the players P can grasp the progress of the automatic performance visually and intuitively.
In the present embodiment, the performance content at the time point TA that is later in time than the performance position T estimated by the performance analysis section 54 is instructed to the automatic playing device 24. Therefore, even when the actual sound production of the automatic playing device 24 is delayed with respect to the instruction from the performance control section 56, the performance of the players P and the automatic performance can be synchronized with high accuracy. Furthermore, the automatic playing device 24 is instructed to play the time point TA that is behind the performance position T by the variable adjustment amount α corresponding to the performance speed R estimated by the performance analysis section 54. Therefore, even when the performance speed R fluctuates, for example, the performance of the players and the automatic performance can be synchronized with high accuracy.
Embodiment 2
Embodiment 2 of the present invention will be described. In the embodiments exemplified below, elements whose operations or functions are the same as those of Embodiment 1 are denoted by the reference numerals used in the description of Embodiment 1, and detailed descriptions thereof are omitted as appropriate.
Fig. 9 is a block diagram illustrating the configuration of the analysis processing unit 544 according to embodiment 2. As illustrated in fig. 9, the analysis processing unit 544 according to embodiment 2 includes a likelihood calculating unit 82 and a position estimating unit 84. Fig. 10 is an explanatory diagram of the operation of the likelihood calculating unit 82.
The likelihood calculating unit 82 calculates an observation likelihood L for each of a plurality of time points t within the performance object track, in parallel with the performance of the performance object track by the plurality of players P. That is, a distribution of the observation likelihood L over the plurality of time points t within the performance object track (hereinafter referred to as the "observation likelihood distribution") is calculated. The observation likelihood distribution is calculated for each unit section (frame) into which the acoustic signal A is divided on the time axis. The observation likelihood L at any one time point t in the observation likelihood distribution calculated for one unit section of the acoustic signal A is an index of the probability that the sound represented by the acoustic signal A in that unit section was produced at that time point t in the performance object track. In other words, the observation likelihood L can also be regarded as an index of the probability that the plurality of players P are performing at each time point t in the performance object track. That is, a time point t at which the observation likelihood L calculated for any one unit section is high is likely to coincide with the sound production position of the sound represented by the acoustic signal A in that unit section. Successive unit sections may overlap each other on the time axis.
As illustrated in fig. 9, the likelihood calculating unit 82 of Embodiment 2 includes a 1st arithmetic unit 821, a 2nd arithmetic unit 822, and a 3rd computing unit 823. The 1st arithmetic unit 821 calculates a 1st likelihood L1(A), and the 2nd arithmetic unit 822 calculates a 2nd likelihood L2(C). The 3rd computing unit 823 multiplies the 1st likelihood L1(A) calculated by the 1st arithmetic unit 821 by the 2nd likelihood L2(C) calculated by the 2nd arithmetic unit 822 to calculate the distribution of the observation likelihood L. That is, the observation likelihood L is expressed as the product of the 1st likelihood and the 2nd likelihood: L = L1(A) · L2(C).
The 1st arithmetic unit 821 calculates the 1st likelihood L1(A) for each of the plurality of time points t within the performance object track by comparing the acoustic signal A of each unit section with the music data M of the performance object track. That is, as illustrated in fig. 10, the distribution of the 1st likelihood L1(A) over the plurality of time points t within the performance object track is calculated for each unit section. The 1st likelihood L1(A) is a likelihood calculated by analyzing the acoustic signal A. The 1st likelihood L1(A) calculated for any one time point t from one unit section of the acoustic signal A is an index of the probability that the sound represented by the acoustic signal A in that unit section was produced at that time point t in the performance object track. The 1st likelihood L1(A) has peaks at those time points t, among the plurality of time points t on the time axis, that are likely to coincide with the performance position of that unit section of the acoustic signal A. As a method of calculating the 1st likelihood L1(A) from the acoustic signal A, for example, the technique of Japanese Patent Application Laid-Open No. 2014-178395 can be suitably used.
The 2nd arithmetic unit 822 of fig. 9 calculates the 2nd likelihood L2(C) according to whether or not a cue action has been detected. Specifically, the 2nd likelihood L2(C) is calculated from a variable C indicating whether or not a cue action has been detected. The variable C is notified from the cue detection section 52 to the likelihood calculating unit 82. When the cue detection section 52 detects a cue action, the variable C is set to 1; when the cue detection section 52 does not detect a cue action, the variable C is set to 0. The value of the variable C is not limited to the two values 0 and 1. For example, the variable C when no cue action is detected may be set to a predetermined positive number (but a value smaller than the value of the variable C when a cue action is detected).
As illustrated in fig. 10, a plurality of reference points a are specified on the time axis of the performance object track. A reference point a is, for example, the start point of the piece, or a point at which performance is resumed after a long rest indicated by a fermata or the like. The respective time points of the plurality of reference points a within the performance object track are specified, for example, by the music data M.
As illustrated in fig. 10, the 2nd likelihood L2(C) is maintained at 1 in a unit section in which no cue action is detected (C = 0). On the other hand, in a unit section in which a cue action is detected (C = 1), the 2nd likelihood L2(C) is set to 0 (an example of the second value) within a period ρ of predetermined length (hereinafter referred to as the "reference period") immediately preceding each reference point a on the time axis, and to 1 (an example of the first value) outside the reference periods ρ. The reference period ρ is set, for example, to a period of about one to two beats of the performance object song. As described above, the observation likelihood L is calculated as the product of the 1st likelihood L1(A) and the 2nd likelihood L2(C). Therefore, when a cue action is detected, the observation likelihood L within the reference period ρ in front of each of the plurality of reference points a specified for the performance object track is reduced to 0. On the other hand, when no cue action is detected, the 2nd likelihood L2(C) is maintained at 1, so the 1st likelihood L1(A) itself is obtained as the observation likelihood L.
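A minimal sketch of this weighting, assuming the time points t, the reference points a and the length of ρ are expressed in the same score units (the names are illustrative):

```python
import numpy as np

def second_likelihood(t_grid: np.ndarray,
                      cue_detected: bool,
                      reference_points: list[float],
                      rho: float) -> np.ndarray:
    """2nd likelihood L2(C): 1 everywhere when no cue action is detected; when a
    cue action is detected, 0 inside the reference period of length rho just
    before each reference point a, and 1 elsewhere."""
    l2 = np.ones_like(t_grid, dtype=float)
    if cue_detected:
        for a in reference_points:
            l2[(t_grid >= a - rho) & (t_grid < a)] = 0.0
    return l2

def observation_likelihood(l1: np.ndarray, l2: np.ndarray) -> np.ndarray:
    """Observation likelihood L = L1(A) * L2(C)."""
    return l1 * l2
```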
The position estimating unit 84 of fig. 9 estimates the performance position T based on the observation likelihood L calculated by the likelihood calculating unit 82. Specifically, the position estimating unit 84 calculates a posterior distribution of the performance position T from the observation likelihood L and estimates the performance position T from the posterior distribution. The posterior distribution of the performance position T is a probability distribution of the posterior probability that the sound in a unit section was produced at position t in the performance object track, given that the acoustic signal A of that unit section has been observed. For the calculation of the posterior distribution using the observation likelihood L, known statistical processing such as Bayesian estimation using a hidden semi-Markov model (HSMM) is used, as disclosed in, for example, Japanese Patent Application Laid-Open No. 2015-79183.
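As a highly simplified illustration only, the estimation of T from the observation likelihoods can be pictured as a plain recursive Bayesian filter over a discretized score axis; the uniform transition model below is an assumption and does not reproduce the hidden semi-Markov model of the cited literature:

```python
import numpy as np

def update_posterior(prior: np.ndarray,
                     observation_likelihood: np.ndarray,
                     max_advance: int = 4) -> np.ndarray:
    """One filtering step: propagate the previous posterior forward by up to
    `max_advance` grid points (uniform transition model), weight it by the
    observation likelihood of the current unit section, then renormalize."""
    n = len(prior)
    predicted = np.zeros(n)
    for step in range(max_advance + 1):
        predicted[step:] += prior[:n - step] / (max_advance + 1)
    posterior = predicted * observation_likelihood
    total = posterior.sum()
    return posterior / total if total > 0 else np.full(n, 1.0 / n)

def estimate_position(posterior: np.ndarray) -> int:
    """Performance position T as the maximum-a-posteriori grid index."""
    return int(np.argmax(posterior))
```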
As described above, since the observation likelihood L is set to 0 within the reference period ρ immediately before the reference point a corresponding to the cue action, the posterior distribution takes significant values only in the section at and after that reference point a. Therefore, a time point at or after the reference point a corresponding to the cue action is estimated as the performance position T. The position estimating unit 84 further determines the performance speed R from the temporal change of the performance position T. The remaining configuration and operation of the analysis processing unit 544 are the same as in Embodiment 1.
Fig. 11 is a flowchart illustrating the content of the process (step SA2 of fig. 8) in which the analysis processing section 544 estimates the performance position T and the performance velocity R. In parallel with the performance of the performance object tracks of the plurality of players P, the process of fig. 11 is performed for each unit section on the time axis.
The 1st arithmetic unit 821 analyzes the acoustic signal A in the unit section to calculate the 1st likelihood L1(A) for each of the plurality of time points t in the performance object track (SA21). The 2nd arithmetic unit 822 calculates the 2nd likelihood L2(C) according to whether or not a cue action has been detected (SA22). The order of the calculation of the 1st likelihood L1(A) by the 1st arithmetic unit 821 (SA21) and the calculation of the 2nd likelihood L2(C) by the 2nd arithmetic unit 822 (SA22) may be reversed. The 3rd computing unit 823 multiplies the 1st likelihood L1(A) calculated by the 1st arithmetic unit 821 by the 2nd likelihood L2(C) calculated by the 2nd arithmetic unit 822 to calculate the distribution of the observation likelihood L (SA23).
The position estimating unit 84 estimates the performance position T based on the observation likelihood distribution calculated by the likelihood calculating unit 82 (SA24). Further, the position estimating unit 84 calculates the performance speed R from the temporal change of the performance position T (SA25).
As described above, in Embodiment 2 the detection result of the cue action is taken into account, in addition to the analysis result of the acoustic signal A, when estimating the performance position T, so the performance position T can be estimated with higher accuracy than in a configuration that considers only the analysis result of the acoustic signal A. In particular, the performance position T is estimated with high accuracy at the start point of the piece or at a point where performance resumes after a rest. Furthermore, in Embodiment 2, when a cue action is detected, the observation likelihood L is reduced only within the reference periods ρ corresponding to the reference points a specified for the performance object track. That is, a cue action detected outside the reference periods ρ is not reflected in the estimation of the performance position T. This has the advantage that erroneous estimation of the performance position T can be suppressed even when a cue action is detected by mistake.
Modifications
The above-described embodiments can be modified in various ways. Specific modifications are exemplified below. Two or more modes arbitrarily selected from the following examples may be combined as appropriate to the extent that they do not contradict each other.
(1) In the above-described embodiments, the automatic performance of the performance object song is started upon detection of the cue action by the cue detection section 52, but the cue action may also be used to control the automatic performance at a point in the middle of the performance object song. For example, when performance is resumed after a long rest within the performance object track, the automatic performance of the performance object track is resumed with the cue action as a trigger, as in the embodiments described above. For example, as in the operation described with reference to fig. 5, a specific player P performs the cue action at a time point Q that precedes, by the preparation period B, the time point at which performance is to be resumed after the rest in the performance target track. Then, at the time point at which the duration δ corresponding to the delay amount D and the performance speed R has elapsed from this time point Q, the performance control section 56 resumes the instruction for the automatic performance to the automatic playing device 24. Since the performance speed R has already been estimated at a point in the middle of the performance object track, the performance speed R estimated by the performance analysis section 54 is applied in setting the duration δ.
The periods within the performance object track in which a cue action may be performed can be known in advance from the performance content of the performance object track. Therefore, the cue detection section 52 may monitor for a cue action only during specific periods of the performance object track in which a cue action may be performed (hereinafter referred to as "monitoring periods"). For example, section specification data specifying the start point and end point of each of the plurality of monitoring periods assumed for the performance object track is stored in the storage device 14. The section specification data may be included in the music data M. The cue detection section 52 monitors for a cue action when the performance position T lies within any of the monitoring periods specified by the section specification data in the performance target track, and stops monitoring for a cue action when the performance position T is outside the monitoring periods. According to the above configuration, since the cue action is detected only within the monitoring periods of the performance object track, the processing load of the cue detection section 52 is reduced compared to a configuration that monitors for a cue action over the entire performance object track. In addition, the possibility of falsely detecting a cue action in a period of the performance object track in which a cue action cannot actually be performed can be reduced.
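A sketch of this gating, with the section specification data assumed to be a simple list of (start, end) pairs expressed in score position units (the names are illustrative):

```python
def in_monitoring_period(performance_position_t: float,
                         monitoring_periods: list[tuple[float, float]]) -> bool:
    """True if the current performance position T lies inside any monitoring
    period in which a cue action may be performed."""
    return any(start <= performance_position_t <= end
               for start, end in monitoring_periods)

# Cue detection runs only while this returns True; outside the monitoring
# periods the detector is skipped, reducing load and false detections.
periods = [(0.0, 2.0), (63.0, 65.0)]         # e.g. piece start, after a fermata
print(in_monitoring_period(64.2, periods))   # True
print(in_monitoring_period(30.0, periods))   # False
```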
(2) In the above-described embodiments, the cue action is detected by analyzing the whole of the image represented by the image signal V (fig. 3), but the cue detection section 52 may monitor for a cue action only within a specific region of the image represented by the image signal V (hereinafter referred to as the "monitoring region"). For example, the cue detection section 52 selects, as the monitoring region, the range of the image represented by the image signal V that contains the specific player P who is scheduled to perform the cue action, and detects the cue action within that monitoring region. The range outside the monitoring region is excluded from the monitoring target of the cue detection section 52. According to the above configuration, since the cue action is detected only within the monitoring region, the processing load of the cue detection section 52 is reduced compared to a configuration that monitors for a cue action over the entire image represented by the image signal V. In addition, the possibility that an action of a player P who does not actually perform the cue action is erroneously determined to be a cue action can be reduced.
Further, as exemplified in modification (1), when the cue action is performed a plurality of times during the performance of the performance object track, the player P who performs the cue action may change for each cue action. For example, player P1 performs the cue action before the start of the performance object song, while player P2 performs the cue action in the middle of the performance object song. It is therefore also preferable to change the position (or size) of the monitoring region in the image represented by the image signal V over time. Since the player P who performs each cue action is determined before the performance, area specification data specifying the position of the monitoring region in time series is, for example, stored in advance in the storage device 14. The cue detection section 52 monitors for the cue action within each monitoring region specified by the area specification data in the image represented by the image signal V, and excludes areas other than the monitoring region from the monitoring target. According to the above configuration, the cue action can be detected appropriately even when the player P who performs the cue action changes as the piece progresses.
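A sketch of such a time-varying monitoring region, assuming the area specification data maps a time range to a rectangular crop of the image (the structure and names are illustrative):

```python
from dataclasses import dataclass
from typing import List, Optional
import numpy as np

@dataclass
class MonitoringArea:
    start_time: float  # time range (e.g. in seconds or beats) in which this
    end_time: float    # player is expected to give the cue action
    x: int
    y: int
    width: int
    height: int

def crop_monitoring_area(frame: np.ndarray, t: float,
                         areas: List[MonitoringArea]) -> Optional[np.ndarray]:
    """Return the image region to monitor for a cue action at time t, or None
    if no player is scheduled to give a cue at that time."""
    for a in areas:
        if a.start_time <= t < a.end_time:
            return frame[a.y:a.y + a.height, a.x:a.x + a.width]
    return None
```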
(3) In the above-described embodiments, the plurality of players P are imaged by the plurality of imaging devices 222, but the plurality of players P (for example, the entire stage on which the plurality of players P are located) may instead be imaged by a single imaging device 222. Similarly, the sounds played by the plurality of players P may be picked up by a single sound pickup device 224. Further, the cue detection section 52 may monitor for a cue action in each of the plurality of image signals V0 individually (in which case the image synthesis section 522 may be omitted).
(4) In the above-described embodiments, the cue action is detected by analyzing the image signal V captured by the imaging devices 222, but the method by which the cue detection section 52 detects the cue action is not limited to this example. For example, the cue detection section 52 may detect the cue action of a player P by analyzing the detection signal of a detector (for example, a sensor such as an acceleration sensor) worn on the body of the player P. However, the configuration of the foregoing embodiments, in which the cue action is detected by analyzing the image captured by the imaging devices 222, has the advantage that the cue action can be detected with less influence on the performance movements of the player P than when a detector is worn on the player's body.
(5) In the above-described embodiment, the performance position T and the performance speed R are estimated by analyzing the acoustic signal A in which a plurality of acoustic signals A0 representing the sounds of different musical instruments are mixed, but the performance position T and the performance speed R may be estimated by analyzing each acoustic signal A0 individually. For example, the performance analysis unit 54 estimates a provisional performance position T and performance speed R for each of the plurality of acoustic signals A0 by the same method as in the foregoing embodiment, and determines the definitive performance position T and performance speed R from the estimation results for the respective acoustic signals A0. For example, representative values (for example, average values) of the performance positions T and performance speeds R estimated from the respective acoustic signals A0 are calculated as the definitive performance position T and performance speed R. As understood from the above description, the acoustic mixing unit 542 of the performance analysis unit 54 may be omitted.
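The following is a minimal sketch of the representative-value computation described above: per-instrument estimates of the performance position T and performance speed R are combined by averaging. The optional confidence weights are an assumption added for illustration and are not part of the embodiment.

```python
import numpy as np

def combine_estimates(positions, velocities, weights=None):
    """Combine per-instrument estimates of performance position T and speed R
    into single representative values (a plain or weighted average)."""
    positions = np.asarray(positions, dtype=float)
    velocities = np.asarray(velocities, dtype=float)
    if weights is None:
        weights = np.ones_like(positions)
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    t_final = float(np.dot(weights, positions))   # representative performance position T
    r_final = float(np.dot(weights, velocities))  # representative performance speed R
    return t_final, r_final

# Example: three instruments, equal weights
print(combine_estimates([12.1, 12.3, 11.9], [1.02, 0.98, 1.00]))
```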
(6) As exemplified in the foregoing embodiment, the automatic performance system 100 is realized by the cooperation of the control device 12 and a program. A program according to a preferred embodiment of the present invention causes a computer to function as: a cue detection unit 52 that detects a cue action of a player P who plays a performance object track; a performance analysis unit 54 that sequentially estimates the performance position T in the performance object track by analyzing, in parallel with the performance, an acoustic signal A representing the played sound; a performance control unit 56 that causes the automatic performance device 24 to execute the automatic performance of the performance object track so as to be synchronized with the cue action detected by the cue detection unit 52 and the performance position T estimated by the performance analysis unit 54; and a display control unit 58 that causes the display device 26 to display a performance image G representing the progress of the automatic performance. That is, the program according to the preferred embodiment of the present invention is a program for causing a computer to execute the music data processing method according to the preferred embodiment of the present invention. The program exemplified above may be provided in a form stored in a computer-readable recording medium and installed on a computer. The recording medium is, for example, a non-transitory recording medium, of which an optical recording medium (optical disc) such as a CD-ROM is a preferable example, but it may include a semiconductor recording medium, a magnetic recording medium, or any other known type of recording medium. The program may also be distributed to the computer by a distribution system via a communication network.
(7) A preferred embodiment of the present invention may also be specified as an operation method (automatic performance method) of the automatic performance system 100 according to the foregoing embodiment. For example, in the automatic performance method according to the preferred embodiment of the present invention, a computer system (a single computer or a system composed of a plurality of computers) detects a cue action of a player P who plays a performance object track (SA1), sequentially estimates the performance position T in the performance object track by analyzing, in parallel with the performance, an acoustic signal A representing the played sound (SA2), causes the automatic performance device 24 to execute the automatic performance of the performance object track so as to be synchronized with the cue action and the performance position T (SA3), and causes the display device 26 to display a performance image G representing the progress of the automatic performance (SA4).
(8) For example, the following configurations can be derived from the embodiments exemplified above.
Mode A1
In the performance analysis method according to a preferred embodiment (mode A1) of the present invention, a cue action of a player who plays a musical piece is detected, a distribution of observation likelihoods, which serve as an index of the probability that each time point in the musical piece matches the performance position, is calculated by analyzing an acoustic signal representing the sound of the played musical piece, the performance position is estimated from the distribution of observation likelihoods, and, in the calculation of the distribution of observation likelihoods, when the cue action is detected, the observation likelihood is lowered in a period preceding a reference point designated on a time axis of the musical piece. In the above aspect, since the detection result of the cue action is taken into account in the estimation of the performance position in addition to the analysis result of the acoustic signal, the performance position can be estimated with higher accuracy than, for example, a configuration in which only the analysis result of the acoustic signal is taken into account.
Mode A2
In a preferred example (mode A2) of mode A1, in the calculation of the distribution of observation likelihoods, a 1st likelihood, which serves as an index of the probability that each time point in the musical piece matches the performance position, is calculated from the acoustic signal; a 2nd likelihood is calculated which is set to a 1st value in a state in which the cue action is not detected and which, when the cue action is detected, is set to a 2nd value smaller than the 1st value in the period preceding the reference point; and the observation likelihood is calculated by multiplying the 1st likelihood by the 2nd likelihood. In the above aspect, there is an advantage in that the observation likelihood can be calculated easily by multiplying the 1st likelihood calculated from the acoustic signal by the 2nd likelihood corresponding to the detection result of the cue action.
Mode A3
In a preferred example (mode A3) of mode A2, the 1st value is 1 and the 2nd value is 0. In this aspect, the observation likelihood can be clearly distinguished between the case where the cue action is detected and the case where it is not detected.
Mode A4
In the automatic performance method according to a preferred embodiment (mode A4) of the present invention, a cue action of a player who plays a musical piece is detected, the performance position in the musical piece is estimated by analyzing an acoustic signal representing the sound of the played musical piece, and an automatic performance device is caused to execute the automatic performance of the musical piece so as to be synchronized with the progress of the performance position. In the estimation of the performance position, a distribution of observation likelihoods, which serve as an index of the probability that each time point in the musical piece matches the performance position, is calculated by analyzing the acoustic signal, and the performance position is estimated from the distribution of observation likelihoods; in the calculation of the distribution of observation likelihoods, when the cue action is detected, the observation likelihood is lowered in a period preceding a reference point designated on a time axis of the musical piece. In the above aspect, since the detection result of the cue action is taken into account in the estimation of the performance position in addition to the analysis result of the acoustic signal, the performance position can be estimated with higher accuracy than, for example, a configuration in which only the analysis result of the acoustic signal is taken into account.
Mode A5
In a preferred example (mode A5) of mode A4, in the calculation of the distribution of observation likelihoods, a 1st likelihood, which serves as an index of the probability that each time point in the musical piece matches the performance position, is calculated from the acoustic signal; a 2nd likelihood is calculated which is set to a 1st value in a state in which the cue action is not detected and which, when the cue action is detected, is set to a 2nd value smaller than the 1st value in the period preceding the reference point; and the observation likelihood is calculated by multiplying the 1st likelihood by the 2nd likelihood. In the above aspect, there is an advantage in that the observation likelihood can be calculated easily by multiplying the 1st likelihood calculated from the acoustic signal by the 2nd likelihood corresponding to the detection result of the cue action.
Mode A6
In a preferred example (mode A6) of mode A4 or mode A5, the automatic performance device is caused to execute the automatic performance in accordance with musical piece data representing the performance content of the musical piece, and the plurality of reference points are designated by the musical piece data. In the above aspect, since each reference point is designated by the musical piece data that instructs the automatic performance device on the automatic performance, there is an advantage in that the configuration and the processing are simplified as compared with a configuration in which the plurality of reference points are designated separately from the musical piece data.
Mode A7
In a preferred example (mode A7) of any one of modes A4 to A6, an image representing the progress of the automatic performance is displayed on a display device. According to the above aspect, the player can visually check the progress of the automatic performance and reflect it in his or her own performance. That is, a natural ensemble in which the performance of the player and the automatic performance of the automatic performance device interact and coordinate with each other is realized.
Mode A8
An automatic performance system according to a preferred embodiment (mode A8) of the present invention includes: a cue detection unit that detects a cue action of a player who plays a musical piece; an analysis processing unit that estimates the performance position in the musical piece by analyzing an acoustic signal representing the sound of the played musical piece; and a performance control unit that causes an automatic performance device to execute the automatic performance of the musical piece so as to be synchronized with the cue action detected by the cue detection unit and the performance position estimated by the analysis processing unit. The analysis processing unit includes: a likelihood calculation unit that calculates, by analyzing the acoustic signal, a distribution of observation likelihoods that serve as an index of the probability that each time point in the musical piece matches the performance position; and a position estimation unit that estimates the performance position from the distribution of observation likelihoods, wherein the likelihood calculation unit lowers the observation likelihood in a period preceding a reference point designated on a time axis of the musical piece when the cue action is detected. In the above aspect, since the detection result of the cue action is taken into account in the estimation of the performance position in addition to the analysis result of the acoustic signal, the performance position can be estimated with higher accuracy than, for example, a configuration in which only the analysis result of the acoustic signal is taken into account.
(9) For example, the following configurations can be derived from the automatic performance system exemplified in the foregoing embodiment.
Mode B1
An automatic performance system according to a preferred embodiment (mode B1) of the present invention includes: a cue detection unit that detects a cue action of a player who plays a musical piece; a performance analysis unit that sequentially estimates the performance position in the musical piece by analyzing, in parallel with the performance, an acoustic signal representing the played sound; a performance control unit that causes an automatic performance device to execute the automatic performance of the musical piece so as to be synchronized with the cue action detected by the cue detection unit and the performance position estimated by the performance analysis unit; and a display control unit that causes a display device to display an image representing the progress of the automatic performance. In the above configuration, the automatic performance by the automatic performance device is executed so as to be synchronized with the cue action of the player and the performance position, while an image representing the progress of the automatic performance is displayed on the display device. Therefore, the player can visually check the progress of the automatic performance and reflect it in his or her own performance. That is, a natural ensemble in which the performance of the player and the automatic performance of the automatic performance device interact and coordinate with each other is realized.
Mode B2
In a preferred embodiment of the mode B1 (mode B2), the performance control unit instructs the automatic performance apparatus to perform a performance at a time point later than the performance position estimated by the performance analysis unit. In the above aspect, the performance content at a time point later than the performance position estimated by the performance analysis section is instructed to the automatic performance apparatus. Therefore, even when the actual sound production of the automatic performance apparatus is delayed with respect to the instruction of the performance control section, the performance of the player and the automatic performance can be synchronized with high accuracy.
Mode B3
In a preferred embodiment of the mode B2 (mode B3), the performance analysis unit estimates a performance tempo from the analysis of the acoustic signal, and the performance control unit instructs the automatic performance apparatus to perform a musical performance at a time point after the performance position estimated by the performance analysis unit by an adjustment amount corresponding to the performance tempo. In the above aspect, the performance at the time point which is later than the performance position by the variable adjustment amount according to the performance speed estimated by the performance analysis section is instructed to the automatic performance apparatus. Therefore, even in the case of, for example, a variation in performance speed, the performance of the player and the automatic performance can be synchronized with high accuracy.
Mode B4
In a preferred example (mode B4) of any one of modes B1 to B3, the cue detection unit detects the cue action by analyzing an image of the player captured by an imaging device. In the above aspect, detecting the cue action by analyzing the image captured by the imaging device has an advantage in that the cue action can be detected with less influence on the player's performance than in the case where the cue action is detected by, for example, wearing a detector on the body of the player.
Mode B5
In any one of the preferred embodiments (embodiment B5) of the embodiments B1 to B4, the display control unit causes the display device to display an image that dynamically changes according to the performance content of the automatic performance. In the above manner, since the image dynamically changing according to the performance content of the automatic performance is displayed on the display device, there is an advantage in that the player can visually and intuitively grasp the progress of the automatic performance.
Mode B6
In the automatic performance method according to a preferred embodiment (mode B6) of the present invention, a computer system detects a cue action of a player who plays a musical piece, sequentially estimates the performance position in the musical piece by analyzing, in parallel with the performance, an acoustic signal representing the played sound, causes an automatic performance device to execute the automatic performance of the musical piece so as to be synchronized with the cue action and the performance position, and causes a display device to display an image representing the progress of the automatic performance.
< Detailed description >
The preferred mode of the present invention can be expressed as follows.
1. Precondition of
An automatic performance system is a system that generates an accompaniment by mechanically coordinating with a human performance. Here, an automatic performance system is described in which, as in classical music, the automatic performance system and the human each play a part given by a score. Such an automatic performance system has a wide variety of applications, such as practice support for musical performance and augmented musical expression in which electronic musical instruments are driven in coordination with a player. In the following, the part performed by the ensemble engine is referred to as the "accompaniment part". In order to perform a musically coherent ensemble, the performance timing of the accompaniment part must be controlled appropriately. Appropriate timing control involves the following four requirements.
Requirement 1: in principle, the automatic performance system needs to play at the point where the human player is playing. Therefore, the automatic performance system needs to match the position in the musical piece being played to the human player. In classical music in particular, since shaping of the performance tempo is musically important, the system must follow the player's tempo changes. In order to follow the player with higher accuracy, it is preferable to acquire the player's habits by analyzing the player's rehearsals.
Requirement 2: the automatic performance system must generate a musically coherent performance. That is, it must follow the human performance only within a range in which the musicality of the accompaniment part is maintained.
Requirement 3: the degree to which the accompaniment part matches the player (the master-slave relationship) must be changeable according to the context of the musical piece. Depending on the music, the system should match the human even at some cost to musicality, or conversely should maintain the musicality of the accompaniment part even at some cost to followability. Accordingly, the balance between the "followability" of Requirement 1 and the "musicality" of Requirement 2 changes depending on the context of the musical piece. For example, a part whose rhythm is unclear tends to follow a part whose rhythm is clearer.
Requirement 4: the master-slave relationship must be changeable immediately in response to instructions from the player. The trade-off between followability and musicality is often adjusted through dialogue between people during rehearsal. When such an adjustment is made, the adjusted portion is played again to check the result. Therefore, an automatic performance system whose following behavior can be set during rehearsal is required.
In order to satisfy these requirements simultaneously, the accompaniment part must be generated so as to follow the position at which the player is playing without breaking down musically. To achieve this, the automatic performance system requires three elements: (1) a model that predicts the player's position, (2) a timing generation model that generates a musically coherent accompaniment part, and (3) a model that corrects the performance timing in accordance with the master-slave relationship. In addition, these elements need to be able to operate and be learned independently. Conventionally, however, it has been difficult to handle these elements independently. Therefore, in the following, three processes are considered, modeled independently, and integrated: (1) the generation process of the player's performance timing, (2) the generation process of performance timing expressing the range within which the automatic performance system can play musically, and (3) a process for coupling the performance timings of the automatic performance system and the player so that the system coordinates with the player while maintaining the master-slave relationship. Thanks to this independent representation, each element can be learned and operated independently. When the system is used, it infers the range of timings at which it can play while inferring the player's timing generation process, and plays the accompaniment part so that the timings of the ensemble and the player are coordinated. In this way, the automatic performance system can match the human and play an ensemble that does not break down musically.
2. Related art
In conventional automatic performance systems, the performance timing of the player is estimated using score following. On that basis, roughly two approaches have been used to coordinate the ensemble engine with the human. First, it has been proposed to obtain the average behavior, or behavior that varies over time within the musical piece, by regressing the relationship between the performance timings of the player and of the ensemble engine over a large number of rehearsals. In such a method, since the result of the ensemble itself is regressed, the musicality of the accompaniment part and its followability are obtained simultaneously as a result. On the other hand, because the player's timing prediction, the ensemble engine's generation process, and the degree of matching are difficult to represent separately, it is considered difficult to manipulate followability or musicality individually during rehearsal. In addition, in order to obtain musical followability, ensemble data between humans must be analyzed separately, which makes content preparation costly. Second, there is a method of imposing constraints on the tempo trajectory by using a dynamical system described by a small number of parameters. In this method, on the basis of prior information such as tempo continuity, the player's tempo trajectory and the like are learned through rehearsal. The accompaniment part may also learn its own sounding timing. Since the tempo trajectory is described with few parameters, the "habits" of the accompaniment part or of the human can easily be overridden manually during rehearsal. However, followability, which is obtained indirectly from the deviation between the sounding timings when the player and the ensemble engine each perform independently, is difficult to manipulate independently. To enable quick trial and error in rehearsal, it is considered effective to alternate between learning by the automatic performance system and dialogue between the automatic performance system and the player. Therefore, in order to manipulate followability independently, a method of adjusting the ensemble logic itself has been proposed. Based on this idea, the present method considers a mathematical model that allows the "matching method", the "performance timing of the accompaniment part", and the "performance timing of the player" to be controlled independently and interactively.
3. Overview of the System
The structure of the automatic performance system is shown in fig. 12. In the present method, in order to follow the player's position, score following is performed based on the acoustic signal and the camera image. The player's position is then predicted based on a generation process of the position being played, using statistical information obtained from the posterior distribution of the score following. To determine the sounding timing of the accompaniment part, the timing of the accompaniment part is generated by coupling a model that predicts the player's timing with a generation process of the timings desirable for the accompaniment part.
4. Score following
Score following is used to estimate the position in the musical piece that the player is currently playing. In the score following method of the present system, a discrete state space model that simultaneously expresses the score position and the performance tempo is considered. The observed sound is modeled as a hidden Markov model (HMM) over the state space, and the posterior distribution over the state space is estimated sequentially using a delayed-decision forward-backward algorithm. The delayed-decision forward-backward algorithm executes the forward algorithm sequentially and, by running the backward algorithm as if the current time were the end of the data, computes the posterior distribution for states several frames before the current time. At the point in time when the MAP value of the posterior distribution passes a position on the score regarded as an onset, a Laplace approximation of the posterior distribution is output.
The construction of the state space is now described. First, the musical piece is divided into R sections, and each section is treated as one state. In the r-th section, the number of frames n required to pass through the section and, for each n, the current elapsed frame 0 ≤ l < n are held as state variables. That is, n corresponds to the tempo of the section, and the combination of r and l corresponds to a position on the score. Such transitions over the state space are represented as the following Markov process.
[Number 1]
(1) From (r, n, l) to itself: probability p
(2) From (r, n, l), with l < n−1, to (r, n, l+1): probability 1−p
(3) From (r, n, n−1) to (r+1, n′, 0)
This model has the advantages of both an explicit-duration HMM and a left-to-right HMM. That is, the duration of a section is determined approximately by the selection of n, while small tempo fluctuations within the section are absorbed by the self-transition probability p. The section lengths and the self-transition probability are obtained by analyzing the music data; specifically, annotation information such as tempo indications and fermatas is used.
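A minimal sketch of this state space is given below: each state is a tuple (r, n, l), self-transition occurs with probability p, and at the last frame of a section a new frame count n′ is chosen for the next section. The uniform choice of n′ and the concrete numbers are assumptions for illustration; in the actual system the distribution over n′ and the value of p are derived from the music data as described above.

```python
import random

def sample_next_state(state, p, section_frame_options, num_sections):
    """One step of the (r, n, l) Markov chain described above.

    state:   tuple (r, n, l) -- section index, frames allotted to the section,
             current frame within the section (0 <= l < n)
    p:       self-transition probability (absorbs small tempo fluctuation)
    section_frame_options: dict r -> list of candidate frame counts n' for section r
    """
    r, n, l = state
    if random.random() < p:
        return (r, n, l)                      # stay (slightly slower local tempo)
    if l < n - 1:
        return (r, n, l + 1)                  # advance one frame inside the section
    if r + 1 < num_sections:
        n_next = random.choice(section_frame_options[r + 1])  # tempo of the next section
        return (r + 1, n_next, 0)             # enter the next section at its first frame
    return (r, n, l)                          # end of the piece: stay put

# Example: 2 sections, each may take 4 or 8 frames
opts = {0: [4, 8], 1: [4, 8]}
s = (0, 4, 0)
for _ in range(20):
    s = sample_next_state(s, p=0.1, section_frame_options=opts, num_sections=2)
print(s)
```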
Next, the observation likelihood of this model is defined. Each state (r, n, l) corresponds to a position s(r, n, l) in the musical piece. For an arbitrary position s in the musical piece, the mean values c̄_s and Δc̄_s of the observed constant-Q transform (CQT) and ΔCQT, together with the respective precisions κ_s^(c) and κ_s^(Δc), are assigned (an overline denotes a mean vector). Based on this, when the CQT c_t and the ΔCQT Δc_t are observed at time t, the observation likelihood corresponding to the state (r_t, n_t, l_t) is defined as follows.
[Number 2]
p(c_t, Δc_t | r_t, n_t, l_t) = vMF(c_t | c̄_{s_t}, κ_{s_t}^(c)) × vMF(Δc_t | Δc̄_{s_t}, κ_{s_t}^(Δc)), where s_t = s(r_t, n_t, l_t)
Here, vMF(x | μ, κ) denotes the von Mises-Fisher distribution; specifically, for x normalized onto S_D (the (D−1)-dimensional unit sphere), it is expressed by the following expression.
[Number 3]
vMF(x | μ, κ) = C_D(κ) exp(κ μ⊤x), where C_D(κ) = κ^(D/2−1) / ((2π)^(D/2) I_(D/2−1)(κ)) is the normalization constant and I_ν denotes the modified Bessel function of the first kind.
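A sketch of computing the von Mises-Fisher log-density and the resulting per-state observation log-likelihood (a CQT term multiplied by a ΔCQT term) is shown below; it assumes unit-norm feature vectors and uses SciPy's exponentially scaled Bessel function for numerical stability. The dimensionality and κ values are placeholders.

```python
import numpy as np
from scipy.special import ive  # exponentially scaled modified Bessel function of the first kind

def log_vmf(x, mu, kappa):
    """Log-density of the von Mises-Fisher distribution on the (D-1)-sphere.

    x, mu : unit-norm vectors of dimension D
    kappa : concentration (the 'precision' assigned to each score position)
    """
    d = x.shape[-1]
    # log C_D(kappa) = (d/2 - 1) log kappa - (d/2) log(2 pi) - log I_{d/2-1}(kappa)
    # ive(v, k) = iv(v, k) * exp(-k), so log iv = log ive + kappa (numerically stable).
    log_norm = (d / 2 - 1) * np.log(kappa) - (d / 2) * np.log(2 * np.pi) \
               - (np.log(ive(d / 2 - 1, kappa)) + kappa)
    return log_norm + kappa * float(np.dot(mu, x))

def log_observation_likelihood(c_t, dc_t, c_bar_s, dc_bar_s, kappa_c, kappa_dc):
    """Per-state observation log-likelihood: one vMF term for the CQT frame
    and one for the delta-CQT frame."""
    return log_vmf(c_t, c_bar_s, kappa_c) + log_vmf(dc_t, dc_bar_s, kappa_dc)

# Tiny example with 12-dimensional unit vectors
rng = np.random.default_rng(0)
v = rng.random(12); v /= np.linalg.norm(v)
print(log_observation_likelihood(v, v, v, v, kappa_c=20.0, kappa_dc=5.0))
```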
In determining c̄ and Δc̄, a piano-roll representation of the score and a CQT template assumed for each sound are used. First, a unique index i is assigned to each pair of a pitch and an instrument name appearing on the score, and an average observed CQT ω_if is assigned to the i-th sound. If the intensity of the i-th sound at position s on the score is denoted h_si, then c̄_{s,f} is given as follows. Δc̄ is obtained by differentiating c̄_{s,f} in the s direction and half-wave rectifying the result.
[Number 4]
c̄_{s,f} ∝ Σ_i h_{si} ω_{if}
When a musical piece starts from silence, visual information becomes more important. Therefore, in the present system, the cue action detected by the camera placed in front of the player is used as described above. In the present method, unlike methods that control the automatic performance system in a top-down manner, the acoustic signal and the cue action are handled uniformly by reflecting the presence or absence of the cue action directly in the observation likelihood. To this end, the positions {q̂_i} at which a cue action is required, such as the start of the piece and fermatas, are first extracted from the score information. When a cue action is detected during score following, the observation likelihood of the states corresponding to positions in the interval [q̂_i − T, q̂_i] on the score is set to 0, so that the posterior distribution is guided to positions after the cued point. Through score following, the ensemble engine receives the distribution of the currently estimated position and tempo, approximated as a normal distribution, several frames after a sound change on the score. That is, when the score following engine detects the switch to the n-th sound present in the music data (hereinafter referred to as an "onset event"), it notifies the ensemble timing generation unit of the time stamp t_n at which the onset event was detected, the estimated mean position μ_n on the score, and its variance σ_n². Since delayed-decision estimation is performed, the notification itself incurs a delay of 100 ms.
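The following sketch illustrates how the observation likelihood could be masked when a cue action is detected: all states whose score position lies in the interval [q̂_i − T, q̂_i] receive likelihood 0. The array layout and the concrete numbers are assumptions for the example.

```python
import numpy as np

def apply_cue_mask(obs_likelihood, positions, cue_position, window):
    """Zero out the observation likelihood for score positions in
    [cue_position - window, cue_position] once a cue action is detected,
    so that the posterior is guided past the cued position.

    obs_likelihood : array of per-state observation likelihoods
    positions      : array mapping each state to its score position s(r, n, l)
    cue_position   : the cued score position (start of the piece, a fermata, ...)
    window         : length T of the interval preceding the cue position
    """
    masked = obs_likelihood.copy()
    in_window = (positions >= cue_position - window) & (positions <= cue_position)
    masked[in_window] = 0.0
    return masked

# Example: 10 states at positions 0..9, cue at position 5, window of 3
lik = np.ones(10)
pos = np.arange(10, dtype=float)
print(apply_cue_mask(lik, pos, cue_position=5.0, window=3.0))
```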
5. Performance timing coupling model
The ensemble engine computes an appropriate playback position based on the information (t_n, μ_n, σ_n²) notified by score following. To match the player, the ensemble engine preferably models independently three processes: (1) the generation process of the player's performance timing, (2) the generation process of the accompaniment part's performance timing, and (3) the process by which the accompaniment part plays while listening to the player. Using such a model, the final timing of the accompaniment part is generated by taking into account the performance timing that the accompaniment part wants to produce and the predicted position of the player.
5.1 Performance timing Generation procedure for players
To express the player's performance timing, it is assumed that the player moves linearly on the score at a velocity v_n^(p) between t_n and t_{n+1}. That is, letting x_n^(p) be the score position the player is playing at t_n and ε_n^(p) be noise affecting the tempo and the score position, the following generation process is considered. Here, ΔT_{m,n} = t_m − t_n.
[Number 5]
x_n^(p) = x_{n−1}^(p) + ΔT_{n,n−1} v_{n−1}^(p) + ε_{n,0}^(p)
v_n^(p) = v_{n−1}^(p) + ε_{n,1}^(p)
The noise ε_n^(p) contains playing errors and sounding-timing errors in addition to tempo variation. To represent the former, considering that the sounding timing also changes with tempo variation, a model is assumed in which the state moves between t_{n−1} and t_n with an acceleration generated from a normal distribution with variance ψ². Then, letting h = [ΔT_{n,n−1}²/2, ΔT_{n,n−1}], the covariance of ε_n^(p) is given by ψ² h⊤h, so that tempo changes and sounding-timing changes become correlated. To represent the latter, white noise with standard deviation σ_n^(p) is considered, and its variance is added to the (0,0) element. Letting Σ_n^(p) be the matrix obtained in this way, ε_n^(p) ~ N(0, Σ_n^(p)). N(a, b) denotes a normal distribution with mean a and variance b.
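A sketch of the player's state transition and process-noise covariance described above is given below; ψ and σ are placeholder values, and the two-dimensional state is [score position, velocity].

```python
import numpy as np

def player_process_noise(dt, psi, sigma_onset):
    """Covariance of the noise term for the player's [position, velocity] state.

    dt          : elapsed time between the (n-1)-th and n-th onset events
    psi         : std. dev. of the tempo-change acceleration
    sigma_onset : std. dev. of the additional white noise on sounding timing
    """
    h = np.array([dt ** 2 / 2.0, dt])          # effect of a constant acceleration on [x, v]
    cov = psi ** 2 * np.outer(h, h)            # correlated tempo / timing change
    cov[0, 0] += sigma_onset ** 2              # independent sounding-timing error on position
    return cov

def player_transition(dt):
    """State-transition matrix for linear motion on the score: x <- x + dt * v, v <- v."""
    return np.array([[1.0, dt],
                     [0.0, 1.0]])

print(player_process_noise(0.5, psi=0.2, sigma_onset=0.01))
print(player_transition(0.5))
```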
Next, the histories of performance timings reported by the score following system, /μ_n = [μ_n, μ_{n−1}, …, μ_{n−I_n}] and /σ_n² = [σ_n², σ_{n−1}², …, σ_{n−I_n}²] (the symbol / denotes a vector), are coupled with this generation process. Here, I_n is the length of the history taken into account, and is set so as to include events up to one beat before t_n. The generation process of /μ_n and /σ_n² is defined as follows.
[Number 6]
/μ_n ~ N(/W_n [x_n^(p), v_n^(p)]⊤, diag(/σ_n²))
Here, /W_n is the matrix of regression coefficients used to predict the observations /μ_n from x_n^(p) and v_n^(p), and is defined as follows.
[Number 7]
/W_n = [[1, −ΔT_{n,n}], [1, −ΔT_{n,n−1}], …, [1, −ΔT_{n,n−I_n}]]
(that is, the row for the j-th most recent onset predicts μ_{n−j} as x_n^(p) − ΔT_{n,n−j} v_n^(p))
By using not only the latest μ_n but also the preceding history as observations, the behavior is less likely to break down even if score following fails over part of the score. It is also considered that performance styles depending on longer-term trends, such as patterns of gradual speeding up or slowing down, can be acquired through rehearsal. In the sense that the correlation between tempo and changes of position on the score is written down explicitly, this model corresponds to applying the idea of the trajectory HMM to a continuous state space.
5.2 Performance timing Generation procedure of accompaniment part
As described above, the internal state [x_n^(p), v_n^(p)] of the player can be inferred from the history of positions reported by score following by using the player's timing model. The automatic performance system reconciles the habits inferred in this way with how the accompaniment part "wants to play", and infers the final sounding timing. To this end, a generation process for the performance timing of the accompaniment part, that is, how the accompaniment part "wants to play", is considered here.
For the performance timing of the accompaniment part, a process is considered in which the performance proceeds with a tempo trajectory that stays within a certain range of a given tempo trajectory. The given tempo trajectory corresponds to performance data generated, for example, by a system that adds performance expression or by a human. When the automatic performance system receives the n-th onset event, the predicted value x_n^(a) of the position currently being performed in the musical piece and its relative velocity v_n^(a) are expressed as follows.
[Number 8]
x_n^(a) = x_{n−1}^(a) + ΔT_{n,n−1} v_{n−1}^(a) + ε_{n,x}^(a)
v_n^(a) = β v_{n−1}^(a) + (1 − β) ṽ_n^(a) + ε_{n,v}^(a)
Here, ṽ_n^(a) denotes the tempo given in advance at the score position reported at time t_n, and substitutes for a tempo trajectory given in advance. ε^(a) determines the allowable range of deviation of the performance timing from the tempo trajectory given in advance. These parameters determine the range of musically natural performance for the accompaniment part. β ∈ [0, 1] is a term expressing how strongly the tempo is pulled back toward the value given in advance, and has the effect of pulling the tempo trajectory back toward ṽ_n^(a). Such a model is known to be effective in audio alignment, so it is reasonable to interpret it as a generation process for the timing of performances of the same musical piece. Without such a constraint (β = 1), v follows a Wiener process, the tempo diverges, and an extremely fast or slow performance results.
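The sketch below illustrates one step of the accompaniment timing process under the reading given above: the position advances linearly, and the velocity is a convex combination of the previous velocity and the pre-given tempo ṽ, weighted by β, plus noise. The noise magnitudes are placeholders, and the way β enters the combination follows the expression quoted in section 6.2; both are assumptions for the example.

```python
import numpy as np

def accompaniment_step(x_prev, v_prev, dt, v_target, beta, rng,
                       sigma_x=0.01, sigma_v=0.005):
    """One step of the accompaniment part's timing generation.

    x_prev, v_prev : previous score position and relative velocity of the accompaniment
    dt             : elapsed time since the previous onset event
    v_target       : tempo given in advance at this score position (the trajectory to pull back to)
    beta           : weight on the previous velocity; beta = 1 degenerates to a Wiener
                     process on the velocity, which can let the tempo diverge
    sigma_x, sigma_v : std. devs. of the allowed deviation (the epsilon^(a) terms)
    """
    x = x_prev + dt * v_prev + rng.normal(0.0, sigma_x)
    v = beta * v_prev + (1.0 - beta) * v_target + rng.normal(0.0, sigma_v)
    return x, v

rng = np.random.default_rng(1)
x, v = 0.0, 1.0
for _ in range(8):
    x, v = accompaniment_step(x, v, dt=0.5, v_target=1.0, beta=0.9, rng=rng)
print(x, v)
```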
5.3 Performance timing coupling process of the player and the accompaniment part
So far, the sounding timing of the player and the sounding timing of the accompaniment part have been modeled independently. Here, based on these generation processes, the process by which the accompaniment part "matches" the player while listening to the player is described. To do so, the following behavior is described: when the accompaniment part matches a human, it gradually corrects the error between the predicted value of the position the accompaniment part is about to play and the predicted value of the player's current position. The variable describing the degree of such error correction is hereinafter referred to as the "coupling coefficient". The coupling coefficient is affected by the master-slave relationship between the accompaniment part and the player. For example, when the player's rhythm is clearer than that of the accompaniment part, the accompaniment part mostly matches the player closely. In addition, when the master-slave relationship is instructed by the player during rehearsal, the matching method must be changed in accordance with the instruction. That is, the coupling coefficient varies according to the context of the musical piece and the dialogue with the player. Thus, given the coupling coefficient γ_n ∈ [0, 1] at the score position at the time t_n is received, the process by which the accompaniment part coordinates with the player is described as follows.
[Number 9]
In this model, the degree of following varies with the magnitude of γ_n. For example, if γ_n = 0, the accompaniment part does not match the player at all, and if γ_n = 1, the accompaniment part matches the player completely. In this model, the variance of the performance x_n^(a) that the accompaniment part is about to produce and the prediction error of the player's performance timing x_n^(p) are also weighted by the coupling coefficient. The variances of x^(a) and v^(a) are thus the result of coordinating the probabilistic process of the player's performance timing with that of the accompaniment part itself. It can therefore be seen that the tempo trajectories that the player and the automatic performance system each "want to generate" can be unified naturally.
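Since the coupling expressions themselves are not reproduced here, the following sketch only illustrates the described behavior, namely correcting the accompaniment's prediction toward the player's prediction at a rate γ_n; it is an assumption introduced for illustration and not the expressions of the present method.

```python
def couple_accompaniment(x_acc, v_acc, x_player_pred, v_player_pred, gamma):
    """Illustrative coupling: move the accompaniment's predicted position and velocity
    toward the player's predicted position and velocity at a rate gamma in [0, 1].

    gamma = 0 : the accompaniment ignores the player entirely
    gamma = 1 : the accompaniment matches the player's prediction completely
    """
    x = (1.0 - gamma) * x_acc + gamma * x_player_pred
    v = (1.0 - gamma) * v_acc + gamma * v_player_pred
    return x, v

print(couple_accompaniment(10.0, 1.00, 10.4, 1.10, gamma=0.6))
```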
A simulation of this model with β = 0.9 is shown in fig. 13. It can be seen that by varying γ, it is possible to interpolate between the tempo trajectory of the accompaniment part (a sine wave) and the tempo trajectory of the player (a step function). It can also be seen that, owing to the influence of β, the generated tempo trajectory is closer to the target tempo trajectory set for the accompaniment part than to the player's tempo trajectory. That is, the behavior is considered to have the effect of "pulling back" a player who is faster than ṽ^(a) and "urging forward" a player who is slower.
5.4 Calculation method of the coupling coefficient γ
The degree of synchronization between the parts, represented by the coupling coefficient γ_n, is set by several factors. First, the master-slave relationship is affected by the context within the musical piece. For example, a part with an easily perceived rhythm mostly leads the ensemble. In addition, the master-slave relationship is sometimes changed through dialogue. To set the master-slave relationship according to the context of the musical piece, the note density φ_n = [moving average of the note density of the accompaniment part, moving average of the note density of the player part] is computed from the score information. Since a part containing many notes tends to determine the tempo trajectory, it is considered that the coupling coefficient can be extracted approximately using such a feature quantity. In this case, when the accompaniment part is not playing (φ_{n,0} = 0), the position prediction of the ensemble should be governed entirely by the player, and in portions where the player is not playing (φ_{n,1} = 0), the position prediction of the ensemble should ignore the player's behavior entirely. Therefore, γ_n is determined as follows.
[Number 10]
Here, ε > 0 is set to a sufficiently small value. Just as a completely one-sided master-slave relationship (γ_n = 0 or γ_n = 1) rarely occurs in ensembles between humans, the above heuristic does not produce a completely one-sided master-slave relationship when both the player and the accompaniment part are playing. A completely one-sided master-slave relationship occurs only when one of the player and the ensemble engine is temporarily silent, which is desirable.
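The exact formula for γ_n is likewise not reproduced here; the sketch below is an assumed heuristic that merely satisfies the boundary behavior described above (γ = 1 when the accompaniment part is silent, γ = 0 when the player is silent, and an intermediate value when both are playing).

```python
def coupling_coefficient(density_acc, density_player, eps=1e-3):
    """Illustrative heuristic for the coupling coefficient gamma_n from moving averages
    of note density. This is an assumption, not the formula of the present method:
      - accompaniment silent  (density_acc == 0)    -> gamma = 1 (follow the player completely)
      - player silent         (density_player == 0) -> gamma = 0 (ignore the player completely)
      - both playing                                -> gamma strictly between 0 and 1
    """
    if density_player == 0.0:
        return 0.0
    if density_acc == 0.0:
        return 1.0
    return (density_player + eps) / (density_player + density_acc + 2.0 * eps)

print(coupling_coefficient(0.0, 3.0), coupling_coefficient(3.0, 0.0), coupling_coefficient(2.0, 2.0))
```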
In addition, during rehearsal and the like, a player or operator can overwrite γ_n as necessary. When a human overwrites γ_n with an appropriate value during rehearsal, the following characteristics are considered desirable: the domain of γ_n is finite, its behavior at the boundary values is self-evident, and its behavior changes continuously with respect to changes in γ_n.
5.5 On-line inference
While the automatic performance system is operating, the posterior distribution of the performance timing model is updated at the moment (t_n, μ_n, σ_n²) is received. The proposed method can perform this estimation efficiently using a Kalman filter. At the point when (t_n, μ_n, σ_n²) is notified, the predict and update steps of the Kalman filter are executed, and the position at which the accompaniment part should be performing at time t is predicted as follows.
[Number 11]
x̂^(a)(t) = x_n^(a) + (t − t_n + τ^(s)) v_n^(a)
Here, τ^(s) denotes the input-output delay of the automatic performance system. In the present system, the state variables are also updated when the accompaniment part produces a sound. That is, in addition to performing the predict/update steps in response to score following results as described above, only the predict step is performed at the moment the accompaniment part sounds, and the obtained predicted value is substituted into the state variables.
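A minimal sketch of the online inference loop is shown below: a two-dimensional Kalman filter over [score position, velocity] performs predict/update steps when score following reports (t_n, μ_n, σ_n²), performs predict-only steps when the accompaniment sounds, and extrapolates by the input-output delay when asked for a playback position. The process-noise values are placeholders, and the simplifications (a single latest observation instead of a history, no coupling term) are assumptions for the example.

```python
import numpy as np

class TimingFilter:
    """Minimal Kalman filter over the [score position, velocity] state,
    used to predict where the accompaniment should be at an arbitrary time."""

    def __init__(self, x0, v0, p0=1.0):
        self.t = 0.0
        self.state = np.array([x0, v0], dtype=float)
        self.cov = np.eye(2) * p0

    def predict(self, t):
        dt = t - self.t
        F = np.array([[1.0, dt], [0.0, 1.0]])
        Q = np.diag([1e-4, 1e-4])                  # process noise (placeholder values)
        self.state = F @ self.state
        self.cov = F @ self.cov @ F.T + Q
        self.t = t

    def update(self, mu, sigma2):
        """Incorporate a score-following report (estimated position mu, variance sigma2)."""
        H = np.array([[1.0, 0.0]])
        y = mu - H @ self.state
        S = H @ self.cov @ H.T + sigma2
        K = self.cov @ H.T / S
        self.state = self.state + (K * y).ravel()
        self.cov = (np.eye(2) - K @ H) @ self.cov

    def position_at(self, t, io_delay):
        """Predicted score position at time t, compensating the input-output delay."""
        x, v = self.state
        return x + (t - self.t + io_delay) * v

f = TimingFilter(x0=0.0, v0=1.0)
f.predict(0.5); f.update(mu=0.52, sigma2=0.01)     # onset event reported by score following
f.predict(0.8)                                      # predict-only step at an accompaniment note
print(f.position_at(0.9, io_delay=0.05))
```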
6. Evaluation experiment
To evaluate the present system, the accuracy of estimating the player's position is evaluated first. Regarding the timing generation of the ensemble, the effectiveness of the term β, which pulls the tempo of the ensemble back toward a predetermined value, and of the index γ, which expresses how closely the accompaniment part matches the player, is evaluated through interviews with players.
6.1 Evaluation of score following
To evaluate the score following accuracy, the following accuracy on Burgmüller's etudes was evaluated. As evaluation data, recordings of a pianist playing 14 of Burgmüller's etudes, Op. 100 (Nos. 1, 4 to 10, 14, 15, 19, 20, 22, and 23), were used. Camera input was not used in this experiment. Following MIREX, total precision is used as the evaluation scale. Total precision denotes the precision over the entire corpus when an alignment error falling within a certain threshold τ is counted as correct.
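One common reading of this metric is sketched below: the fraction of onsets in the corpus whose absolute alignment error does not exceed τ. The sample errors are invented for the example.

```python
import numpy as np

def total_precision(alignment_errors_sec, tau):
    """Fraction of onsets over the whole corpus whose alignment error is within tau."""
    errors = np.abs(np.asarray(alignment_errors_sec, dtype=float))
    return float((errors <= tau).mean())

errors = [0.02, 0.07, 0.31, 0.12, 0.04]
print(total_precision(errors, tau=0.300), total_precision(errors, tau=0.100))
```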
First, to verify the validity of the delayed-decision inference, total precision (τ = 300 ms) was evaluated as a function of the number of delayed frames in the delayed-decision forward-backward algorithm. The results are shown in fig. 14. It can be seen that accuracy improves when the posterior distribution from several frames earlier is used, and that the accuracy gradually decreases once the delay exceeds 2 frames. With a delay of 2 frames, total precision was 82% at τ = 100 ms and 64% at τ = 50 ms.
6.2 Verification of the performance timing coupling model
The performance timing coupling model was verified through interviews with players. Features of the present model are the term β, with which the ensemble engine pulls back toward the predicted tempo, and the coupling coefficient γ, and the validity of both was verified. First, to eliminate the influence of the coupling coefficient, a system was prepared in which expression (4) was replaced by v_n^(p) = β v_{n−1}^(p) + (1 − β) ṽ_n^(a), with x_n^(a) = x_n^(p) and v_n^(a) = v_n^(p). That is, an ensemble engine was considered in which the filtered score following result is used directly for generating the accompaniment's performance timing, assuming that the expected value of the tempo lies at ṽ and that the variance is controlled by β. First, six pianists each used the automatic performance system set to β = 0 for one day and were then interviewed about the experience. The pieces covered a wide range of styles such as classical, romantic, and popular music. The dominant complaint was that when the human tried to match the ensemble, the accompaniment part in turn tried to match the human, so that the tempo became extremely slow or fast. Such a phenomenon occurs when the response of the system subtly fails to match the player because τ^(s) in expression (12) is set improperly. For example, when the system responds earlier than expected, the user speeds up in order to match the system, which has come back early; the system, following that tempo, responds even earlier, and the tempo keeps accelerating.
Next, under the condition β = 0.1, experiments were conducted on the same pieces with five other pianists and one pianist who had also taken part in the β = 0 experiment. Interviews were conducted with the same questions as for β = 0, but the problem of tempo divergence was not reported. The pianist who had also assisted in the β = 0 condition commented that followability had improved. On the other hand, it was reported that the system may appear to lag or rush when there is a large divergence between the tempo the player has in mind for a piece and the tempo toward which the system tries to pull back. This tendency arises when an unfamiliar piece is played, that is, when the player does not know the "common-sense" tempo. This means that while the system's attempt to draw the performance toward a certain tempo prevents tempo divergence, the accompaniment part gives the impression of being at odds with the player when their interpretations of the tempo differ greatly. It was also indicated that the followability should preferably change according to the musical context, because the preferred degree of matching, such as "it is better to pull back" or "it should follow more closely", differs according to the character of the piece.
Finally, with a professional string quartet, a system fixed at γ = 0 and a system in which γ is adjusted according to the performance context were both used, and the latter received more favorable opinions, indicating its effectiveness. However, since the subjects knew in this verification that the latter system was the improved one, additional verification using an AB test or the like is needed. In addition, since there were several situations in which γ was changed through dialogue during rehearsal, changing the coupling coefficient in rehearsal is shown to be useful.
7. Advanced learning process
To obtain the "habit" of the player, h si and ω if and the beat trajectory are estimated based on the MAP state of time t, s t, calculated by score following, and its input feature sequence { c t}T t=1. Here, their estimation methods are briefly described. In the estimation of h si and ω if, the posterior distribution is estimated taking into account the Informed NMF (notification NMF) model of the Poisson-Gamma system as shown below.
[ Number 12]
The hyperparameters appearing here are computed appropriately from an instrument sound database and from the piano roll of the score. The posterior distribution is approximated using a variational Bayesian method. Specifically, the posterior distribution p(h, ω | c) is approximated in the form q(h)q(ω), and the KL divergence between the posterior distribution and q(h)q(ω) is minimized while auxiliary variables are introduced. From the posterior distribution estimated in this way, the MAP estimate of the parameter ω, which corresponds to the timbre of the instrument sounds, is stored and used in subsequent operation of the system. The parameter h, which corresponds to the intensity of the piano roll, can also be used.
Next, the length of each section when the player performs the piece (that is, the tempo trajectory) is estimated. If the tempo trajectory can be estimated, tempo expression peculiar to the player can be reproduced, and the prediction of the player's position improves. On the other hand, when the number of rehearsals is small, the tempo trajectory may be estimated incorrectly owing to estimation errors and the like, and the accuracy of position prediction may instead deteriorate. Therefore, when the tempo trajectory is changed, prior information on the tempo trajectory is prepared first, and the tempo is changed only at those portions where the player's tempo trajectory consistently deviates from the prior information. To this end, the extent to which the player's tempo deviates is first computed. If the number of rehearsals is small, the estimate of the degree of deviation itself becomes unstable, so a prior distribution is also placed on the distribution of the player's tempo trajectory. Let the mean μ_s^(p) and the variance parameter λ_s^(p) of the player's tempo at position s in the piece follow N(μ_s^(p) | m_0, b_0 λ_s^(p)⁻¹) Gamma(λ_s^(p)⁻¹ | a_0^λ, b_0^λ). Then, letting μ_s^(R) and λ_s^(R)⁻¹ be the mean and variance of the tempo obtained from K performances, the posterior distribution of the tempo is given as follows.
[Number 13]
Regarding the posterior distribution obtained in this way as being generated from the tempo distribution N(μ_s^S, λ_s^S⁻¹) associated with position s in the piece, its mean value is given as follows.
[ Number 14]
Based on the tempo calculated in this way, the mean value of ε used in expressions (3) and (4) is updated.
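As an illustration of the kind of update described in this section, the sketch below shrinks the tempo mean observed over K rehearsals toward a prior mean, with the prior acting as b_0 pseudo-observations; this is a standard conjugate-style posterior mean and stands in for the expressions above, whose exact form is not reproduced here.

```python
def shrunken_tempo_mean(m0, b0, mu_rehearsal, k):
    """Illustrative conjugate-style update of the tempo mean at one score position:
    the mean from k rehearsal performances is shrunk toward the prior mean m0,
    with the prior weighted like b0 pseudo-observations (an assumption, not the
    patent's exact expression)."""
    return (b0 * m0 + k * mu_rehearsal) / (b0 + k)

# Prior tempo 120 BPM (with weight 4), observed mean 132 BPM over 2 rehearsals
print(shrunken_tempo_mean(120.0, 4.0, 132.0, 2))
```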
Embodiment 3
Embodiment 3 of the present invention will be described. In the present embodiment, the automatic performance system 100 recognizes a cue action of the player P and performs accordingly. For elements whose operations or functions are the same as those of embodiment 1, the reference numerals used in the description of embodiment 1 are reused in the following description, and detailed description thereof is omitted as appropriate.
In particular, the cue action in the present embodiment is premised on an action performed by the movement of the face of the player P. The cue action in the present embodiment indicates, by motion, the timing at which an event occurs. The event here is any of various behaviors during a performance, for example the timing indicating the start or end of a sound production, the cycle of a beat, and the like. The cue action in the present embodiment is, for example, directing the line of sight toward the partner to whom the cue is given, nodding, or raising the head while lightly breathing in.
Fig. 15 is a block diagram showing an example of the configuration of the detection processing unit 524 according to embodiment 3. The detection processing unit 524 includes, for example, an acquisition unit 5240, a determination unit 5241, an estimation unit 5242, an output unit 5243, a face portion extraction model 5244, and a cue action estimation model 5245.
The acquisition unit 5240 acquires image information. The image information is information of an image in which the performance of the player P is captured and includes, for example, the image signal V generated by the image synthesizing unit 522.
In the present embodiment, the image information includes depth information. The depth information indicates, for each pixel in the image, the distance from a predetermined position (for example, the imaging position) to the subject. In this case, the plurality of imaging devices 222 in the recording device 22 include at least one depth camera. A depth camera is a ranging sensor that measures the distance to an object, for example by emitting light such as infrared light and measuring the distance based on the time until the emitted light, reflected by the object, is received. Alternatively, the plurality of imaging devices 222 may include a stereo camera. A stereo camera captures an object from a plurality of different directions and computes the depth value (depth information) to the object.
The acquisition unit 5240 repeatedly acquires image information at predetermined time intervals. The predetermined time intervals may be arbitrary and may be periodic, random, or a mixture thereof. The acquisition unit 5240 outputs the acquired image information to the determination unit 5241.
The determination unit 5241 extracts, based on the image information acquired from the acquisition unit 5240, a portion of the face including the eyes of a person (hereinafter referred to as a face portion) from the image indicated by the image information (hereinafter referred to as a captured image).
Specifically, the determination unit 5241 first separates the background from the captured image. Using, for example, the depth information of each pixel, the determination unit 5241 determines pixels whose distance to the subject is greater than a predetermined threshold to be background, and separates the background from the captured image by extracting the regions whose distance to the subject is smaller than the predetermined threshold. In this case, even for a region whose distance to the subject is smaller than the predetermined threshold, the determination unit 5241 may determine the region to be background if its area is smaller than a predetermined threshold.
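A minimal sketch of the depth-based background separation described above is given below: pixels farther than a threshold are treated as background, and nearby regions whose area is below a second threshold are also discarded. The simple flood-fill area check and the numbers are assumptions for the example.

```python
import numpy as np

def separate_foreground(depth_map, depth_threshold, min_area):
    """Foreground mask from a depth map: pixels closer than depth_threshold are kept,
    and connected regions smaller than min_area are also treated as background.
    (A simple 4-connected flood fill is used for the area check.)"""
    near = depth_map < depth_threshold
    keep = np.zeros_like(near, dtype=bool)
    visited = np.zeros_like(near, dtype=bool)
    h, w = near.shape
    for i in range(h):
        for j in range(w):
            if near[i, j] and not visited[i, j]:
                stack, region = [(i, j)], []
                visited[i, j] = True
                while stack:
                    y, x = stack.pop()
                    region.append((y, x))
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w and near[ny, nx] and not visited[ny, nx]:
                            visited[ny, nx] = True
                            stack.append((ny, nx))
                if len(region) >= min_area:
                    for y, x in region:
                        keep[y, x] = True
    return keep

depth = np.full((4, 6), 5.0)
depth[1:3, 1:4] = 1.2           # a nearby subject
print(separate_foreground(depth, depth_threshold=2.0, min_area=4).astype(int))
```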
Next, the determination unit 5241 extracts the face portion using the image from which the background has been separated and the face portion extraction model 5244. The face portion extraction model 5244 is a learned model created by having a learning model learn teacher data. The learning model is, for example, a CNN (convolutional neural network). The teacher data is data (a data set) that associates a learning image in which a face portion including a person's eyes is captured with a determination result in which the face portion of the person in the learning image has been determined. By learning the teacher data, the face portion extraction model 5244 becomes a model that estimates, from an input image, the face portion of the person in the image and outputs the estimation result. The determination unit 5241 extracts the face portion based on the output obtained by inputting the image information acquired from the acquisition unit 5240 into the face portion extraction model 5244.
Next, the determination unit 5241 detects the movement of the face portion based on the images of the face portion extracted from the captured images (hereinafter referred to as extracted images). The determination unit 5241 detects the movement of the face portion, for example, by comparing the extracted images in time-series order. For example, the determination unit 5241 extracts feature points in the extracted images and detects the movement of the face portion from the temporal change of the position coordinates of the extracted feature points. The feature points here are points representing characteristic parts of the face portion, such as the corners of the eyes and the tips of the eyebrows. If the extracted image includes parts other than the eyes, the corners of the mouth and the like may also be extracted as feature points.
The determination unit 5241 also detects the direction of the line of sight based on the extracted image. The determination unit 5241 extracts the region of the eyes in the extracted image. The method of extracting the eye region may be arbitrary; for example, a learned model similar to the face portion extraction model 5244 may be used, or another image processing method may be used. For example, the determination unit 5241 determines the direction of the line of sight based on the orientation of the face. In general, this is because the player P is considered to look at the partner to whom the cue is given by turning the face toward that partner. The determination unit 5241 determines the left-right orientation of the face based on the depth information of parts of the face that are symmetric about the vertical center line, such as the left and right eyes or eyebrows. For example, when the difference between the distances of the left and right eyes is smaller than a predetermined threshold and the left and right eyes are regarded as roughly equidistant from the depth camera, the determination unit 5241 determines that the front of the face is turned toward the depth camera and that the line of sight is directed toward the front. The up-down direction can be determined by a similar method.
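The sketch below illustrates the left-right orientation check described above using the depths of the two eyes; the mapping from the sign of the depth difference to "left"/"right" is a convention assumed for the example.

```python
def face_turn_direction(depth_left_eye, depth_right_eye, tol):
    """Rough left/right face orientation from the depth (distance to the depth camera)
    of symmetric facial parts such as the two eyes.

    Returns 'front' if both eyes are roughly equidistant from the camera (the face,
    and hence the line of sight, is assumed to point toward the camera); otherwise
    returns the side assumed to be turned toward, i.e. the side whose eye is farther away.
    """
    diff = depth_left_eye - depth_right_eye
    if abs(diff) < tol:
        return "front"
    return "left" if diff > 0 else "right"

print(face_turn_direction(0.98, 1.00, tol=0.05))   # -> 'front'
print(face_turn_direction(1.20, 1.00, tol=0.05))   # -> 'left'
```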
The determination unit 5241 uses the detection results to determine whether a preliminary action associated with a cue action indicating the timing of an event has been performed. The preliminary action is a part of the cue action, or an action linked to the cue action, performed before the timing of the start of sound production or the like indicated by the cue action. For example, when the cue is given by nodding, the preliminary action is the action of lowering the face (hereinafter also referred to as "cue-down") performed before the action of raising the face (hereinafter also referred to as "cue-up"). Alternatively, when the cue is given by raising the head while lightly breathing in, the preliminary action is the action of exhaling before raising the face.
The determination unit 5241 determines that the preparatory action has been performed when, for example, the movement of the face portion is along the up-down direction of a nod (an example of the "1st direction") and the line of sight is directed toward the partner being cued (an example of the "2nd direction"). The determination unit 5241 outputs the result of this determination to the estimation unit 5242.
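This determination rule might be sketched as follows, assuming the 1st direction is the nodding direction and the 2nd direction points toward the cued partner (here, the camera); the vector representation and the angular tolerance are assumptions for illustration.

```python
# Minimal sketch of the preparatory-action rule: face motion along the nod
# direction AND gaze toward the partner direction.
import numpy as np

def is_preparatory_action(face_motion, gaze_dir, nod_dir=(0.0, 1.0),
                          partner_dir=(0.0, 0.0, 1.0), cos_tol=0.8):
    along_nod = np.dot(face_motion / (np.linalg.norm(face_motion) + 1e-9),
                       nod_dir) > cos_tol
    at_partner = np.dot(gaze_dir / (np.linalg.norm(gaze_dir) + 1e-9),
                        partner_dir) > cos_tol
    return bool(along_nod and at_partner)

print(is_preparatory_action(np.array([0.1, 2.0]),          # cue-down motion
                            np.array([0.05, 0.0, 1.0])))   # looking at camera
```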
The estimation unit 5242 estimates the timing of occurrence of the event from the images showing the preparatory action, based on the determination result of the determination unit 5241. The estimation unit 5242 estimates the timing of occurrence of the event using, for example, an image group showing the flow of a series of actions including the preparatory action, together with the cue action estimation model 5245. The cue action estimation model 5245 is a learned model created by training a learning model on teacher data. The learning model is, for example, an LSTM (Long Short-Term Memory) network. The teacher data is a data set that associates a time series of learning images capturing a face portion including a person's eyes with a determination result identifying the cue action in those images. The cue actions here may include various actions used to determine a cue, for example the cue action itself (cue-up), the preparatory action (cue-down), and whether or not the line of sight is directed in a specific direction. By learning the teacher data, the cue action estimation model 5245 becomes a model that estimates, from an input time-series image group, the action shown in the next image of the series and outputs the estimation result. The estimation unit 5242 estimates the timing at which the event occurs based on the output obtained by inputting an image group showing the flow of a series of actions including the preparatory action into the cue action estimation model 5245.
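A sequence model of this kind could be sketched as below, with per-frame feature vectors (assumed here to be embeddings of the extracted face images) going in and a score for "the cued event occurs next" coming out. The LSTM architecture, feature dimension, and names are assumptions and do not reproduce the cue action estimation model 5245.

```python
# Minimal sketch of an LSTM-based cue action estimator.
import torch
import torch.nn as nn

class CueActionEstimator(nn.Module):
    def __init__(self, feat_dim=64, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, seq):                 # seq: (batch, frames, feat_dim)
        out, _ = self.lstm(seq)
        return torch.sigmoid(self.head(out[:, -1]))   # score from last time step

model = CueActionEstimator().eval()
frames = torch.randn(1, 30, 64)             # about 1 s of per-frame face features
with torch.no_grad():
    p_event_next = model(frames)
print(float(p_event_next))                   # untrained weights: arbitrary score
```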
The output unit 5243 outputs information indicating the timing of occurrence of the event estimated by the estimation unit 5242.
The face portion extraction model 5244 is a model that learns, as teacher data, a data set associating a learning image capturing a face portion including a person's eyes with a determination result identifying the face portion of the person in that image, and that outputs the face portion of the person in an input image.
The cue action estimation model 5245 is a model that learns, as teacher data, a data set associating a learning image capturing a face portion including a person's eyes with a determination result identifying the cue action in that image, and that outputs whether or not the cue action is performed in an input image.
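For illustration, the teacher data for the two models might be organized as in the sketch below; the field names, file paths, and label vocabulary are assumptions, not part of the patent disclosure.

```python
# Minimal sketch of assumed teacher-data records for the two learned models.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class FacePortionSample:
    image_path: str                           # learning image including the eyes
    face_box: Tuple[int, int, int, int]       # judged face portion (x, y, w, h)

@dataclass
class CueActionSample:
    frame_paths: List[str]                    # time-series face-portion images
    label: str                                # e.g. "cue-up", "cue-down", "none"

face_portion_teacher_data = [
    FacePortionSample("frames/p1_0001.png", (52, 18, 64, 40)),
]
cue_action_teacher_data = [
    CueActionSample(["frames/p1_0001.png", "frames/p1_0002.png",
                     "frames/p1_0003.png"], "cue-down"),
]
print(face_portion_teacher_data[0])
print(cue_action_teacher_data[0])
```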
Fig. 16 is a flowchart showing a flow of processing performed by the detection processing unit 524.
The acquisition unit 5240 acquires image information and outputs the acquired image information to the determination unit 5241 (step S10).
The determination unit 5241 extracts the area of the image in which the face portion is captured based on the image information (step S11), and detects the movement of the face portion and the direction of the line of sight based on the extracted image. Based on the detection result, the determination unit 5241 determines whether or not the movement of the face portion is in a predetermined direction (step S12). The determination unit 5241 further determines whether or not the direction of the line of sight is a specific direction (the camera direction in Fig. 16) (step S13). From the movement of the face portion and the direction of the line of sight, the determination unit 5241 determines whether or not the image shows a preparatory action associated with a cue action, and outputs the determination result to the estimation unit 5242.
The estimation unit 5242 estimates the timing of occurrence of the event based on the image information of the images determined by the determination unit 5241 to show the preparatory action (step S14). The estimation unit 5242 estimates the timing at which the event occurs by estimating the next action using, for example, a time-series image group including the preparatory action together with the cue action estimation model 5245. The estimation unit 5242 outputs the estimation result to the output unit 5243.
The output unit 5243 outputs the estimation result estimated by the estimation unit 5242, for example a performance start signal corresponding to the estimated timing of occurrence of the event (step S15).
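The flow of steps S10 to S15 could be wired together as in the following sketch, in which every processing function is a stub standing in for the units described above; the function names, the window length, and the return values are assumptions for illustration.

```python
# Minimal sketch of the S10-S15 flow around a sliding window of recent frames.
from collections import deque

def process_stream(frames, extract_face, detect_motion_and_gaze,
                   is_preparatory, estimate_event_time, emit_start_signal,
                   window=30):
    history = deque(maxlen=window)            # recent face-portion images
    for frame in frames:                      # S10: acquire image information
        face = extract_face(frame)            # S11: extract face portion
        history.append(face)
        motion_ok, gaze_ok = detect_motion_and_gaze(history)   # S12, S13
        if motion_ok and gaze_ok and is_preparatory(history):
            event_time = estimate_event_time(history)          # S14
            emit_start_signal(event_time)                       # S15

# Example wiring with trivial stubs:
process_stream(
    frames=range(5),
    extract_face=lambda f: f,
    detect_motion_and_gaze=lambda h: (len(h) > 2, True),
    is_preparatory=lambda h: len(h) == 4,
    estimate_event_time=lambda h: 0.25,       # seconds until the event
    emit_start_signal=lambda t: print(f"start signal in {t:.2f} s"),
)
```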
As described above, the automatic playing system 100 (control system) of embodiment 3 includes the acquisition unit 5240, the determination unit 5241, the estimation unit 5242, and the output unit 5243. The acquisition unit 5240 acquires image information. The determination unit 5241 detects, based on the image information, the movement of a face portion including a person's eyes and the direction of that person's line of sight when the face portion appears in the captured image indicated by the image information, and uses the detection result to determine whether or not a preparatory action associated with a cue action indicating the timing of occurrence of an event has been performed. When the determination unit 5241 determines that the preparatory action has been performed, the estimation unit 5242 estimates the timing at which the event occurs based on the image information. The output unit 5243 outputs the estimation result estimated by the estimation unit 5242.
Thus, the automatic playing system 100 of embodiment 3 can estimate the timing at which an event occurs based on the motion of the face. That is, in situations in which the start timing of sound production during a musical piece, the release timing of an extension symbol (fermata), the sounding timing of the last note of a piece, or a stop timing is coordinated by an eye-contact cue, the player P can control the performance of the automatic playing system 100 through a cue action expressed by the motion of the face and the direction of the line of sight.
In embodiment 3, an image capturing a face portion including the eyes is used for the estimation. Therefore, even when part of the face of the player P is hidden by an instrument or the like (occlusion), as in an image of the player P playing a wind instrument, the cue action can be recognized from the area around the eyes, where occlusion rarely occurs during performance, and the timing of occurrence of the event can be estimated. The estimation can therefore be made reliably even when various actions are performed during the performance.
In embodiment 3, both the motion of the face portion and the direction of the line of sight are used for the estimation. A motion in which the player P moves the face or body simply from being absorbed in the performance can therefore be distinguished from a cue action, so the accuracy of the estimation can be improved compared with estimation based on the movement of the face portion alone.
In the automatic playing system 100 according to embodiment 3, the estimation unit 5242 estimates the timing at which the event occurs using the cue action estimation model 5245. The estimation can thus be performed by the simple method of inputting images into a model, without complicated image processing, so a reduction in processing load or a shortening of processing time can be expected compared with performing complicated image processing. Furthermore, depending on the teacher data learned by the cue action estimation model 5245, the timing of various events such as the start of sound production or the period of a beat can be estimated, so arbitrary events can be handled.
In the automatic playing system 100 according to embodiment 3, the determination unit 5241 determines, based on the image information, that the preparatory action has been performed when the motion of the face portion is along an up-down direction such as a nod (a specific 1st direction) and the line of sight is directed toward the partner being cued (a specific 2nd direction). This makes it possible to judge the movement in a specific direction and the direction of the line of sight that are characteristic of a cue action, thereby improving accuracy.
Further, in the automatic playing system 100 of embodiment 3, the determination unit 5241 detects the movement of the face portion using the face portion extraction model 5244. This can achieve the same effects as those described above.
In the automatic playing system 100 according to embodiment 3, the image information includes depth information indicating the distance to the subject for each pixel in the image, and the determination unit 5241 extracts the face portion in the captured image by separating the background based on the depth information. The eye region of the face is relatively narrow, so the number of pixels in the eye region extracted from an image is smaller than in other regions. In addition, the shape and color of the eyes are more complex than those of other parts. Even when the eye region is extracted accurately, it is therefore more prone to noise than other regions, and detecting the orientation of the face by image processing on the extracted eye region alone is difficult to do with high accuracy. In contrast, the present embodiment uses depth information. Depth values do not vary as much as color information, even around the eyes, so the orientation of the face can be detected with high accuracy from the depth information around the eyes. Moreover, the approximate distance from the imaging device 222 to the player P can be known in advance, so if depth information is used, the player P can be extracted simply by separating the background, without complicated image processing such as contour extraction. Excluding background pixels from the analysis not only speeds up the processing but also reduces false detections.
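Such depth-based background separation might look like the following sketch; the player distance, margin, and depth values are illustrative assumptions.

```python
# Minimal sketch: pixels farther than the roughly known player distance plus a
# margin are treated as background and excluded from analysis.
import numpy as np

def separate_background(depth_mm, player_distance_mm=800.0, margin_mm=300.0):
    """Return the depth image with background pixels zeroed, plus the mask."""
    foreground = depth_mm < (player_distance_mm + margin_mm)
    return np.where(foreground, depth_mm, 0.0), foreground

depth = np.array([[760.0, 790.0, 2400.0],
                  [810.0, 830.0, 2500.0]])   # far column = wall behind the player
masked, mask = separate_background(depth)
print(mask)        # False marks excluded background pixels
print(masked)
```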
In the above description, the case where the direction of the line of sight is detected from the image information has been described as an example, but the present invention is not limited thereto. For example, eye tracking may be used, in which the direction of the line of sight is detected from the relative positional relationship between the cornea and the pupil obtained from infrared light reflected by the eye.
Further, the automatic playing system 100 of embodiment 3 may be used to make an agent for ensemble performance react. For example, if the player P looks at a robot equipped with a camera, the robot may be made to look back at the player P. Further, if the player P performs a cue action (for example, cue-up) or a preparatory action (for example, cue-down), the robot follows along with that action. In this way, the automatic playing system 100 can perform in synchronization with the player P.
Some embodiments of the present invention have been described, but these embodiments are presented as examples and are not intended to limit the scope of the invention. These embodiments can be implemented in various other modes, and various omissions, substitutions, and changes can be made without departing from the scope of the invention. These embodiments and modifications are included in the invention described in the claims and their equivalents as well as in the scope and gist of the invention.

Claims (9)

1. A control system comprising:
an acquisition unit that acquires image information in which a user is imaged over time;
a determination unit configured to determine, based on a movement of the face of the user and a direction of a line of sight detected from the image information, whether or not a preparatory action associated with a cue action indicating a timing at which an event is to occur has been performed;
an estimation unit that estimates the timing at which the event occurs when it is determined that the preparatory action has been performed; and
an output unit configured to output the estimation result estimated by the estimation unit.
2. A control system comprising:
an acquisition unit that acquires image information;
a determination unit that detects a movement of a face portion and a direction of a line of sight in a captured image indicated by the image information based on the image information, and determines, using a result of the detection, whether or not a preparatory action associated with a cue action indicating a timing at which an event is to occur has been performed;
an estimation unit that estimates the timing at which the event occurs based on the image information and the cue action, when the determination unit determines that the preparatory action has been performed; and
an output unit configured to output the estimation result estimated by the estimation unit.
3. The control system according to claim 1 or claim 2, wherein
the estimation unit estimates the timing at which the event occurs using the output result of a cue action estimation model, the cue action estimation model being a model that learns, as teacher data, a data set associating a learning image capturing a face portion including a person's eyes with a determination result identifying, in the learning image, the cue action indicating the timing at which the event is to occur, and that outputs whether or not the cue action is performed in an input image.
4. The control system according to any one of claims 1 to 3, wherein
the event indicated by the cue action indicating the timing at which the event is to occur is the start of sound production, and
the estimation unit estimates the timing indicating the start of sound production using a cue action estimation model representing a learning result in which the relationship between an image and the cue action has been learned, with a movement of a face portion including a person's eyes that indicates the start of sound production serving as the cue action.
5. The control system according to any one of claims 1 to 4, wherein
the event indicated by the cue action indicating the timing at which the event is to occur is the period of a beat in a performance, and
the estimation unit estimates the timing representing the period of the beat in the performance using a cue action estimation model representing a learning result in which the relationship between an image and the cue action has been learned, with a movement of a face portion including a person's eyes that represents the period of the beat in the performance serving as the cue action.
6. The control system according to any one of claims 1 to 5, wherein
the determination unit determines, based on the image information, that the preparatory action has been performed when the movement of the face portion including the person's eyes is in a specific 1st direction and the direction of the line of sight is in a specific 2nd direction.
7. The control system according to any one of claims 1 to 6, wherein
the determination unit extracts the face portion in the captured image indicated by the image information using the output result of a face portion extraction model, the face portion extraction model being a model that learns, as teacher data, a data set associating a learning image capturing a face portion including a person's eyes with a determination result identifying the face portion in the learning image, and that outputs the face portion of the person in an input image.
8. The control system according to any one of claims 1 to 7, wherein
the image information includes depth information indicating a distance to a subject for each pixel in an image, and
the determination unit separates a background in the captured image indicated by the image information based on the depth information, and extracts a face portion including a person's eyes in the image based on the image from which the background has been separated.
9. A control method, wherein:
an acquisition unit acquires image information;
a determination unit detects, based on the image information, a movement of a face portion and a direction of a line of sight in a captured image indicated by the image information, and determines, using the detected result, whether or not a preparatory action associated with a cue action indicating a timing at which an event is to occur has been performed;
an estimation unit estimates the timing at which the event occurs based on the image information and the cue action, when the determination unit determines that the preparatory action has been performed; and
an output unit outputs the estimation result estimated by the estimation unit.
CN202010876140.0A 2019-09-06 2020-08-27 Control system and control method Active CN112466266B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2019-163227 2019-09-06
JP2019163227A JP7383943B2 (en) 2019-09-06 2019-09-06 Control system, control method, and program

Publications (2)

Publication Number Publication Date
CN112466266A CN112466266A (en) 2021-03-09
CN112466266B true CN112466266B (en) 2024-05-31


Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH09251342A (en) * 1996-03-15 1997-09-22 Toshiba Corp Device and method for estimating closely watched part and device, information display device/method using the same
JP2000347692A (en) * 1999-06-07 2000-12-15 Sanyo Electric Co Ltd Person detecting method, person detecting device, and control system using it
CN1624760A (en) * 2003-12-04 2005-06-08 雅马哈株式会社 Music session support method, musical instrument for music session, and music session support program
CN101354569A (en) * 2007-07-25 2009-01-28 索尼株式会社 Information processing apparatus, information processing method, and computer program
JP2009020539A (en) * 2008-10-27 2009-01-29 Yamaha Corp Automatic performance apparatus and program
CN103995685A (en) * 2013-02-15 2014-08-20 精工爱普生株式会社 Information processing device and control method for information processing device
JP2016142893A (en) * 2015-02-02 2016-08-08 ヤマハ株式会社 Signal processing apparatus and signal processing system
WO2017029915A1 (en) * 2015-08-17 2017-02-23 日本テレビ放送網株式会社 Program, display device, display method, broadcast system, and broadcast method
JP2017125911A (en) * 2016-01-13 2017-07-20 ヤマハ株式会社 Device and method for supporting play of keyboard instrument
CN109478399A (en) * 2016-07-22 2019-03-15 雅马哈株式会社 Play analysis method, automatic Playing method and automatic playing system
CN109804427A (en) * 2016-10-11 2019-05-24 雅马哈株式会社 It plays control method and plays control device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
王凯颖 (Wang Kaiying). 《音乐欣赏》 (Music Appreciation). Shandong People's Publishing House, 2016, pp. 119-120. *
纪建功 (Ji Jiangong). 《合唱与指挥》 (Chorus and Conducting). Guangming Daily Press, 2015, pp. 66-67. *

Similar Documents

Publication Publication Date Title
CN109478399B (en) Performance analysis method, automatic performance method, and automatic performance system
US10586520B2 (en) Music data processing method and program
JP7383943B2 (en) Control system, control method, and program
US10846519B2 (en) Control system and control method
US10825432B2 (en) Smart detecting and feedback system for smart piano
US10482856B2 (en) Automatic performance system, automatic performance method, and sign action learning method
JP6140579B2 (en) Sound processing apparatus, sound processing method, and sound processing program
CN108133709B (en) Speech recognition apparatus and speech recognition method
CN111052223B (en) Playback control method, playback control device, and recording medium
US10878789B1 (en) Prediction-based communication latency elimination in a distributed virtualized orchestra
US11557269B2 (en) Information processing method
KR20150024295A (en) Pronunciation correction apparatus
CN114446266A (en) Sound processing system, sound processing method, and program
CN112466266B (en) Control system and control method
CN114446268B (en) Audio data processing method, device, electronic equipment, medium and program product
JP7380008B2 (en) Pronunciation control method and pronunciation control device
Athanasopoulos et al. 3D immersive karaoke for the learning of foreign language pronunciation
JP2005209000A (en) Voice visualization method and storage medium storing the same
JP6977813B2 (en) Automatic performance system and automatic performance method
JP6838357B2 (en) Acoustic analysis method and acoustic analyzer
CN112102919A (en) Rehabilitation system and method based on stop-off sound measurement and sound position contrast feedback technology
Hirahara et al. Human Information Science: Opening up Communication Possibilities for the Future

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant