WO2023191322A1 - Method and apparatus for implementing virtual performance partner - Google Patents

Method and apparatus for implementing virtual performance partner

Info

Publication number
WO2023191322A1
Authority
WO
WIPO (PCT)
Prior art keywords
performance
repertoire
current
partner
performer
Application number
PCT/KR2023/002880
Other languages
French (fr)
Inventor
Ziheng Yan
Jiang Yu
Xin Jin
Jie Chen
Weiyang Su
Youxin Chen
Longhai WU
Original Assignee
Samsung Electronics Co., Ltd.
Application filed by Samsung Electronics Co., Ltd. filed Critical Samsung Electronics Co., Ltd.
Priority to US18/128,743 priority Critical patent/US20230237981A1/en
Publication of WO2023191322A1 publication Critical patent/WO2023191322A1/en

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00 Animation
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H1/00 Details of electrophonic musical instruments
    • G10H1/0008 Associated control or indicating means
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H1/00 Details of electrophonic musical instruments
    • G10H1/36 Accompaniment arrangements
    • G10H1/40 Rhythm
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2210/00 Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
    • G10H2210/031 Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal
    • G10H2210/076 Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal for extraction of timing, tempo; Beat detection
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2210/00 Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
    • G10H2210/031 Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal
    • G10H2210/091 Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal for performance evaluation, i.e. judging, grading or scoring the musical qualities or faithfulness of a performance, e.g. with respect to pitch, tempo or other timings of a reference performance
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2220/00 Input/output interfacing specifically adapted for electrophonic musical tools or instruments
    • G10H2220/005 Non-interactive screen display of musical or status data
    • G10H2220/015 Musical staff, tablature or score displays, e.g. for score reading during a performance
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2250/00 Aspects of algorithms or signal processing methods without intrinsic musical character, yet specifically adapted for or used in electrophonic musical processing
    • G10H2250/311 Neural networks for electrophonic musical instruments or musical processing, e.g. for musical recognition or control, automatic composition or improvisation

Definitions

  • the disclosure relates to an audio processing technology, and more particularly, to a method and apparatus for implementing a virtual performance partner.
  • a performer and a cooperator may cooperatively perform a piece of music.
  • the performer plays a dominant role in the musical cooperation.
  • the cooperator should follow the performance of the performer. For example, in the performance of violin, a performer plays the violin, and a symphony orchestra as a cooperator performs with the violin performer.
  • In traditional musical instrument performance exercises, the performer usually follows a recording, e.g., a compact disc (CD).
  • However, since the performer may have a limited skill level or the performed music may be a high-difficulty virtuoso composition, the performer is likely unable to keep up with the recording of an artist on the CD, so the experience of the performance exercise is poor.
  • Provided are a method and apparatus for implementing a virtual performance partner, which enable music playing to be adapted to the performance progress of performers and improve the performance experience of performers.
  • a method for providing a virtual performance partner includes: collecting audio frame data performed by a performer; and for each piece of current audio frame data collected, performing: converting the piece of current audio frame data collected into a current digital score, matching the current digital score with a range of digital scores in a repertoire, and determining a matching digital score in the range of digital scores that matches the current digital score; positioning a position of the matching digital score in the repertoire, and determining a start time of playing a cooperation part of music in a next bar of the matching digital score in the repertoire for a performance partner; and determining a performance error between the performer and the performance partner based on a performance time of the current digital score and a performance time of the matching digital score, and adjusting a playing speed of the performance partner for the cooperation part in the repertoire based on the performance error.
  • the method may include determining, based on a position of the current digital score, the range within which a next audio frame is matched.
  • the determining the start time of playing the cooperation part of music in the next bar of the matching digital score in the repertoire for the performance partner may include: determining a performance speed of the performer based on the position of the matching digital score and positions of matching digital scores corresponding to first N pieces of audio frame data in the current audio frame data, and identifying the performance speed as a reference playing speed of the repertoire; and determining a start time of playing a next bar of music of the matching digital score in the repertoire for the performance partner based on the reference playing speed.
  • the adjusting the playing speed of the performance partner for the repertoire based on the performance error may include: based on the performance error being less than one beat, adjusting the playing speed of the performance partner within a current bar of music based on the performance error based on the reference playing speed, to make the performance partner consistent with the performer in a performance end time of the current bar of music; and based on the performance error being greater than one beat, pausing playing, by the performance partner, at the current bar, and playing a next bar of music based on a playing time of the next bar of music.
  • the method may include, based on repeated segments being contained in the repertoire, receiving an inputted set performance segment, and identifying the performance segment as the range.
  • the method may include, based on the repertoire starting from a performance of the performance partner, playing, by the performance partner, a part of the repertoire prior to the performance of the performer based on a set playing speed.
  • the method may include, based on the repertoire transitioning from a solo part of the performer to a performance part of the performance partner: based on a performance speed of the performer changing, starting to play, by the performance partner, the repertoire based on a performance speed at an end of the solo part; and based on the performance speed of the performer staying constant, starting to play, by the performance partner, the repertoire based on a set playing speed.
  • the performance partner ends the playing of the repertoire based on the current digital score not being matched successfully within a first set time.
  • the converting the piece of current audio frame data collected into the current digital score may include: processing the piece of current audio frame data collected using a pre-trained neural network model, and outputting the current digital score corresponding to the piece of current audio frame data collected.
  • the current digital score may be represented using a binary saliency map, and the pre-trained neural network model may be trained using a binary classification cross entropy loss function.
  • the matching may be implemented using a neural-network processor.
  • the method may include outputting a score of the repertoire and the position determined by the positioning.
  • the method may include determining a current scene based on the position determined by the positioning, and synthesizing, corresponding to the current scene, a virtual performance animation corresponding to the current scene using an avatar pre-selected by the performer.
  • the performer may be a preset performance user among the plurality of performance users; and based on the matching being unsuccessful within a preset time, the performer may be switched to a next preset performance user among the plurality of performance users.
  • an avatar pre-selected by each user may be stored; based on the virtual performance animation being displayed, a virtual performance animation synthesized using an avatar pre-selected by a current performer may be displayed, and based on the performer being switched, the virtual performance animation may be switched to a virtual performance animation synthesized using an avatar pre-selected by a performer switched to; or, avatars pre-selected by all the performance users are displayed simultaneously, and a desired virtual performance animation may be synthesized.
  • the synthesizing, corresponding to the current scene, the virtual performance animation corresponding to the current scene using an avatar pre-selected by the performer may include: pre-setting an animation switching position in the repertoire, and based on a performance progress of the repertoire by the performance partner reaching the animation switching position, changing the virtual performance animation; and/or, based on the current digital score not being matched successfully and/or the performance error corresponding to the current digital score being greater than a set threshold, changing an avatar preset by the performer into a preset action, and synthesizing the virtual performance animation.
  • the animation switching position may be set based on an input of a performance user, or the animation switching position may be contained in the repertoire.
  • the animation switching position may be a position of switching between different musical instruments within the cooperation part in the repertoire, wherein the changing the virtual performance animation may include displaying a virtual performance animation preset corresponding to a performance of a musical instrument switched to corresponding to the switching position between the different musical instruments.
  • an apparatus for implementing a virtual performance partner includes: a processor configured to: collect audio frame data performed by a performer; convert, for each piece of current audio frame data collected, the piece of current audio frame data collected into a current digital score, match the current digital score with a range of digital scores in a repertoire, and determine a matching digital score in the range of digital scores that matches the current digital score; position, for each piece of the current audio frame data collected, a position of the matching digital score in the repertoire, and determine a start time of playing a cooperation part of music in a next bar of the matching digital score in the repertoire for a performance partner; and determine, for each piece of the current audio frame data collected, a performance error between the performer and the performance partner based on a performance time of the current digital score and a performance time of the matching digital score, and adjust a playing speed of the performance partner for the cooperation part in the repertoire based on the performance error.
  • a method for providing a virtual performance partner includes: receiving a current digital score corresponding to audio frame data; matching the current digital score with a range of digital scores in a repertoire; determining a matching digital score in the range of digital scores based on matching the current digital score; identifying a position of the matching digital score in the repertoire, and identifying a start time of playing a cooperation part of music in a next bar of the matching digital score in the repertoire for a performance partner; determining a performance error between the performer and the performance partner based on a performance time of the current digital score and a performance time of the matching digital score; and adjusting a playing speed of the performance partner for the cooperation part in the repertoire based on the performance error.
  • the method may include determining, based on a position of the current digital score, the range within which a next audio frame may be matched.
  • the method may include, based on the repertoire starting from a performance of the performance partner, playing, by the performance partner, a part of the repertoire prior to the performance of the performer based on a set playing speed.
  • the performance partner may end the playing of the repertoire based on the current digital score not being matched successfully within a first set time.
  • FIG. 1 is a schematic diagram of a basic flow of a method for implementing a virtual performance partner, according to an embodiment
  • FIG. 2 is a schematic diagram of a system architecture, according to an embodiment
  • FIG. 3 is a schematic diagram of a flow of a method for implementing a virtual performance partner, according to an embodiment
  • FIG. 4 is a schematic diagram of training a neural network model, according to an embodiment.
  • FIG. 5 is a schematic diagram of a basic structure of an apparatus for implementing a virtual performance partner, according to an embodiment.
  • Terms such as "first" and "second" used in the present disclosure may indicate various components regardless of a sequence and/or importance of the components. These expressions are only used in order to distinguish one component from the other components, and do not limit the corresponding components.
  • the expression "A or B,” “at least one of A and/or B” or “one or more of A and/or B” or the like, may include all possible combinations of items enumerated together.
  • “A or B,” “at least one of A and B,” or “at least one of A or B” may indicate all of 1) a case where at least one A is included, 2) a case where at least one B is included, or 3) a case where both of at least one A and at least one B are included.
  • When any component (for example, a first component) is referred to as being coupled to another component (for example, a second component), the any component may be directly coupled to the another component or may be coupled to the another component through another component (for example, a third component).
  • When any component (for example, the first component) is referred to as being directly coupled to another component, it may be understood that no other component (for example, the third component) exists between the two components.
  • An expression “configured (or set) to” used in the present disclosure may be replaced by an expression “suitable for,” “having the capacity to,” “designed to,” “adapted to,” “made to” or “capable of” based on a situation.
  • A term “configured (or set) to” may not necessarily indicate “specifically designed to” in hardware. Instead, an expression “an apparatus configured to” may indicate that the apparatus may “perform” an operation together with other apparatuses or components.
  • a processor configured (or set) to perform A, B, and C may indicate a dedicated processor (for example, an embedded processor) for performing the corresponding operations or a generic-purpose processor (for example, a central processing unit (CPU) or an application processor) which may perform the corresponding operations by executing one or more software programs stored in a memory apparatus.
  • The disclosure provides a method and apparatus for implementing a virtual performance partner, to be able to adaptively adjust the content and speed of a repertoire played by a player according to the audio of the performer, and especially to make adjustments according to the tempo of the performer.
  • In the disclosure, a performance partner is a virtual device that controls the player to play a piece of music, and a performer is a user who uses the performance partner to accompany the performance of his or her own musical instrument.
  • the performer performs an actual performance of certain music A.
  • the player plays a specified part of the corresponding music A, which is usually an accompaniment part for the music A performed by a certain musical instrument.
  • the entire music performed is referred to as a repertoire.
  • a complete score of the repertoire and a corresponding audio file to be played by the performance partner may be pre-stored, e.g., in a server.
  • the performance partner acquires, according to a request for the music from a performance user, the stored score of the repertoire and the audio file to be played by the performance partner.
  • the complete score includes a score of a part to be performed by the performer and a score corresponding to the audio file to be played by the performance partner (hereinafter referred to as a score of a cooperation part).
  • FIG. 1 is a schematic diagram of a basic flow of a method for implementing a virtual performance partner in the disclosure, according to an embodiment. As shown in FIG. 1, the method includes the following steps (e.g., operations).
  • audio frame data performed by a performer is collected.
  • the piece of current audio frame data collected is converted into a current digital score, the current digital score is matched with a range of digital scores in a repertoire, and a matching digital score in the range of digital scores that matches the current digital score is determined.
  • Operation 102 is used to perform the operations of score recognizing and score matching.
  • the collected audio frame data is first converted into digital scores.
  • a digital score into which the piece of current audio frame data collected is converted is referred to as the current digital score.
  • the audio data of the repertoire is pre-converted into a digital score, and the current digital score converted is matched with the entire repertoire or the range of digital scores in the repertoire.
  • a part of digital score in the repertoire that successfully matches the current digital score is referred to as the matching digital score. It can be seen therefrom that each piece of current audio frame data collected is converted into a current digital score and a corresponding matching digital score is determined.
  • the digital score of the entire repertoire matched with the current digital score should be the score of the part performed by the performer.
  • a position of the matching digital score is positioned in the repertoire, and a start time of playing a cooperation part of music in a next bar of the matching digital score in the repertoire is determined for a performance partner.
  • Operation 103 is used to perform score positioning.
  • Specifically, the position of the matching digital score in the entire repertoire is determined, i.e., the position of the part currently being performed by the performer in the entire music. It is determined, using this position information, when the performance partner starts to play the music content in the next bar of the positioned position. The method of determining the start time of playing the next bar is described in detail in the following embodiments.
  • If the performance partner supports displaying the score being performed, the positioned position may also be indicated in the displayed score.
  • a performance error between the performer and the performance partner is determined based on a performance time of the current digital score and a performance time of the matching digital score, and a playing speed of the performance partner for the cooperation part in the repertoire is adjusted based on the performance error.
  • Operation 104 is used to process tempo tracking. Specifically, an error between the performance of the performer and the play of the performance partner is determined, so that tempo tracking is performed to adjust the playing speed of the performance partner for the repertoire.
  • Operations 102-104 should be processed for each audio frame, according to an embodiment; therefore, the processing should be fast.
  • Hardware such as an NPU (neural-network processing unit, i.e., a neural-network processor) may be used to support the implementation of the above method.
  • The performance partner analyzes the audio and matches the scores in real time, which is a computation-intensive scenario. Issues such as power consumption may be ignored when the operations are deployed on a desktop device.
  • A sufficient output volume may be useful for the performance partner to cooperate with the musical instrument. Therefore, a television may be used as a suitable deployment platform to realize a specific method of the disclosure.
  • FIG. 3 is a schematic diagram of a specific flow of the method for implementing a virtual performance partner in the present embodiment. The flow is explained by taking the processing of an audio frame as an example. As shown in FIG. 3, the processing of an audio frame includes the following steps, according to an embodiment.
  • In operation 301, current audio frame data performed by the performer is collected. The processing of operation 301 may be performed in a variety of existing ways.
  • a microphone may be plugged into a television to collect the audio of the performer.
  • the current audio frame data is converted into a current digital score using a pre-trained neural network model.
  • Operation 302 is used for score recognizing, i.e., converting the audio collected by the microphone into a digital score that can be subsequently processed.
  • the digital score may be represented in various existing ways, such as a binary saliency map.
  • The pitch range of music includes 88 keys, from great A2 to small c5, at intervals of a semitone.
  • a digital score may be represented as a two-dimensional matrix with an X axis representing time coordinates and a Y axis representing pitch coordinates, thereby generating a binary saliency map.
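  • By way of illustration only, the binary saliency map described above may be sketched in Python as follows; the frame rate, the note-event format, and the function name are hypothetical assumptions for this sketch, not details from the disclosure:

```python
import numpy as np

N_PITCHES = 88          # one row per semitone over the 88-key range
FRAMES_PER_SECOND = 50  # assumed time resolution of the saliency map

def notes_to_saliency_map(notes, duration_s):
    """Build a binary saliency map (pitch rows x time columns).

    notes: iterable of (pitch_index, onset_s, offset_s) tuples.
    """
    n_frames = int(duration_s * FRAMES_PER_SECOND)
    score = np.zeros((N_PITCHES, n_frames), dtype=np.uint8)
    for pitch, onset, offset in notes:
        start = int(onset * FRAMES_PER_SECOND)
        end = int(offset * FRAMES_PER_SECOND)
        score[pitch, start:end] = 1  # mark the pitch as sounding
    return score

# Example: a one-second note on pitch index 48 starting at t = 0.
demo_map = notes_to_saliency_map([(48, 0.0, 1.0)], duration_s=2.0)
```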
  • the current audio frame data is inputted into the trained neural network model, and after it is processed by the model, the current digital score corresponding to the current audio frame data is outputted.
  • a neural network model for converting the audio data into the digital score should be pre-trained before the entire flow shown in FIG. 3 is started. The training of the neural network model is briefly described below.
  • the input of the neural network model is the audio frame data collected by the microphone, and the output is the corresponding digital score which is specifically a binary saliency map in the present embodiment.
  • a corresponding neural network model may be trained for each musical instrument.
  • training data may be prepared in advance, including a series of audio data (collected from a process of score performance by a corresponding musical instrument) and digital scores corresponding to the audio data.
  • A manner of acquiring the digital scores corresponding to the audio data is: knowing a score corresponding to audio data A, i.e., a score of the performed content (e.g., a staff), and representing the score in the form of a digital score (e.g., representing the corresponding score with a binary saliency map), where the digital score corresponds to audio data A.
  • The audio data and the digital score corresponding thereto constitute paired training data: the audio data is inputted into the neural network model to obtain an output of the model, and the output is compared with the digital score corresponding to the audio data to calculate a loss function, so as to update the model parameters accordingly.
  • In the present embodiment, the digital score is represented using the binary saliency map. Since the binary saliency map consists of binary classification labels, the neural network model may be trained using a binary classification cross entropy loss function.
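  • As a hedged illustration of this training setup, assuming PyTorch and a toy stand-in architecture (the real network, input features, and hyperparameters are not specified in the disclosure):

```python
import torch
import torch.nn as nn

# Toy stand-in model: audio-frame features in, 88 per-pitch probabilities out.
model = nn.Sequential(
    nn.Linear(1024, 512), nn.ReLU(),
    nn.Linear(512, 88), nn.Sigmoid(),
)
loss_fn = nn.BCELoss()  # binary classification cross entropy
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_step(audio_features, target_saliency):
    """One update: audio_features (B, 1024), target_saliency (B, 88) in {0, 1}."""
    optimizer.zero_grad()
    predicted = model(audio_features)
    loss = loss_fn(predicted, target_saliency)  # compare output with paired label
    loss.backward()
    optimizer.step()
    return loss.item()

# Example step on random stand-in data.
loss = train_step(torch.randn(8, 1024), torch.randint(0, 2, (8, 88)).float())
```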
  • the current digital score is matched with a specified range of digital scores in a repertoire, and a matching digital score in the specified range of digital scores that matches the current digital score is determined, according to an embodiment.
  • Operation 303 is used to perform score matching, according to an embodiment.
  • The digital score of the repertoire may be pre-stored in a database.
  • The current digital score is compared with the complete digital score or the specified partial digital score of the repertoire to perform a window search and find the part most similar to the current digital score; the most similar part is referred to as the matching digital score.
  • the digital score of the repertoire for comparison may be a complete digital score or a specified partial digital score.
  • the specified partial digital score may be specified by a user, or the content currently being performed may be targeted within a certain region of the repertoire after a plurality of audio frames have been processed, and the region is taken as the specified range.
  • the digital score of the part for the performer in the repertoire should be selected to be matched with the current digital score.
  • Matching may be realized using various existing searching and matching algorithms, or may be realized by pre-training the network model.
  • the above matching process may be performed by an NPU.
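  • A minimal sketch of the window search described above, assuming binary saliency maps and a simple cell-agreement similarity (the disclosure leaves the concrete search and matching algorithm open, so this measure is an assumption):

```python
import numpy as np

def match_score(current, repertoire, start, end):
    """Find the repertoire window most similar to the current digital score.

    current: (88, w) binary map; repertoire: (88, T) binary map.
    Only windows whose left edge lies in [start, end] are searched.
    """
    w = current.shape[1]
    best_pos, best_sim = None, -1.0
    for pos in range(start, min(end, repertoire.shape[1] - w) + 1):
        window = repertoire[:, pos:pos + w]
        sim = np.mean(window == current)  # fraction of agreeing cells
        if sim > best_sim:
            best_pos, best_sim = pos, sim
    return best_pos, best_sim

# Example: a 20-frame note placed at position 100 is located exactly.
rep = np.zeros((88, 500), dtype=np.uint8)
rep[40, 100:120] = 1
pos, sim = match_score(rep[:, 100:110], rep, 0, 490)  # pos == 100, sim == 1.0
```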
  • a position of the matching digital score is positioned in the repertoire, and a start time of playing a cooperation part of music in a next bar of the matching digital score in the repertoire is determined for a performance partner.
  • the implementation of the virtual performance partner is actually controlling the player (i.e., a television in this embodiment) to play the set playing content.
  • the set playing content e.g., violin accompaniment audio for a certain repertoire, may be preset by the user.
  • Operation 304 is used to first perform score positioning.
  • the position of the matching digital score matched in the repertoire in operation 303 is determined.
  • the position is a part currently being performed by the performer.
  • A performance speed of the performer, i.e., an average performance speed within N+1 audio frame times, is calculated according to the position of the matching digital score and the positions of the matching digital scores corresponding to the first N pieces of audio frame data preceding the current audio frame data.
  • the performance speed is used as a reference playing speed of the player playing the repertoire.
  • a start time of playing a next bar of music of the matching digital score in the repertoire is determined for the player based on the reference playing speed.
  • a performance start time of a next bar relative to a bar where the current audio frame is located is calculated based on the above reference playing speed.
  • the player determines the performance start time as a start time of playing the next bar of music, so that the performance of the performer and the playing of the player are synchronized at an initial position of the next bar.
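  • The positioning arithmetic above may be illustrated as follows; positions in score frames and times in seconds are assumed units, and the helper names are hypothetical:

```python
def reference_speed(positions, times):
    """Average performance speed over the last N+1 matched frames (score frames/s)."""
    return (positions[-1] - positions[0]) / (times[-1] - times[0])

def next_bar_start_time(current_pos, current_time, next_bar_pos, speed):
    """Time at which the partner should start the next bar at this speed."""
    return current_time + (next_bar_pos - current_pos) / speed

# Example: 50 score frames covered in 2 s gives 25 frames/s; a next bar that
# begins 75 frames ahead is then scheduled 3 s from now.
v = reference_speed([100, 150], [10.0, 12.0])  # 25.0
t = next_bar_start_time(150, 12.0, 225, v)     # 15.0
```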
  • As mentioned above, when performing the score matching, the matching may be performed within a specified range of the music.
  • The specific position of the specified range used when processing the next audio frame data may be determined according to the positioned result of the present operation.
  • The above processing for score positioning is the most basic processing manner, referred to herein as a random positioning manner; it is used when processing an ordinary audio frame, including when a performer enters the music for the first time and when the performer starts a performance from any position.
  • a music theory positioning manner may be further included, which refers to processing and playing a cooperation part according to information marked in a score.
  • the information marked in the score may be, e.g., music starting from a solo part, playing a segment at a free speed to cause difficulty in tracking, containing repeated segments in a composition, music starting from a band part, music changing from the solo part to the band part, etc.
  • corresponding processing may be used to determine a start time of playing a next bar of cooperation repertoire in different situations, and the foregoing determination manner is simply referred to as a random positioning algorithm.
  • the processing allocation of random positioning and music theory positioning may be performed as follows:
  • In an ordinary situation, the above random positioning algorithm may be used to determine a reference playing speed and a start time of playing a next bar.
  • When repeated segments are contained in the repertoire, an inputted set performance segment is received, the performance segment is used as the specified range, and the random positioning algorithm is executed to determine a reference playing speed and a start time of playing a next bar.
  • When the repertoire starts from a performance of the performance partner, the performance partner plays the part of the repertoire prior to the performance of the performer according to a set playing speed.
  • the set playing speed may be a default playing speed.
  • the performance partner When the repertoire transitions from a solo part of the performer to a performance part of the performance partner, the performance partner starts to play the repertoire according to a performance speed at the end of the solo part if a performance speed of the performer changes; or otherwise, the performance partner starts to play the repertoire according to a set playing speed.
  • a performance error between the performer and the performance partner is determined according to a performance time of the current digital score and a performance time of the matching digital score, and a playing speed of the performance partner for the cooperation part in the repertoire is adjusted according to the performance error.
  • the present operation 305 is used to perform tempo tracking and adjust an actual playing speed within a bar.
  • the digital score includes pitches and durations of respective tones in the score.
  • the durations are referred to as the performance time of the digital score.
  • a difference between the performance time of the current digital score and the performance time of the matching digital score is the performance error.
  • the manner of adjusting the playing speed according to the performance error may specifically include the following steps.
  • When the performance error is less than one beat, the playing speed within the current bar is adjusted according to the performance error, on the basis of the reference playing speed determined in operation 304, so that the performance partner is consistent with the performer in the performance end time of the current bar of music.
  • In this way, the playing speed can be adjusted within the current bar, and the performance partner can catch up with the performance speed of the performer within the current bar.
  • the processing cooperates with the processing of operation 304 to ensure that the start times of playing a next bar by the performer and the performance partner are consistent.
  • the performance partner When the performance error is greater than one beat, the performance partner pauses playing at the current bar and plays a next bar of music according to a playing time of the next bar of music. Since the non-synchronization in tempo is easily perceived, if the performance error is excessive, the performance of the performance partner is paused at the current bar, i.e., the playing of the player is paused, and the cooperation part of a next bar of score is played starting from the next bar.
  • the processing also cooperates with the processing of operation 304.
  • the performance partner ends the playing of the repertoire when the current digital score is not matched successfully within a first set time (e.g., 5 seconds).
  • When the performer skips a bar or pauses the performance for less than the first set time, the performance of the performance partner is paused at the current bar, i.e., the playing of the player is paused, and the cooperation part of the next bar of the score is played starting from the next bar.
  • the processing also cooperates with the processing of operation 304.
  • When the pause exceeds the first set time, the performance partner ends the playing of the repertoire.
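  • The tempo-tracking rules of operation 305 may be summarized in a sketch like the following; the beat-based error unit follows the description above, while the function names and the within-bar speed formula are assumptions of this sketch:

```python
FIRST_SET_TIME = 5.0  # seconds without a successful match before playback ends

def adjust_speed_within_bar(reference_speed, error_beats, beats_left_in_bar):
    """Rescale speed so the partner ends the current bar with the performer.

    A positive error means the partner lags the performer by error_beats, so it
    must cover beats_left_in_bar + error_beats beats in the remaining time.
    """
    return reference_speed * (beats_left_in_bar + error_beats) / beats_left_in_bar

def tempo_tracking_action(error_beats, unmatched_seconds):
    if unmatched_seconds >= FIRST_SET_TIME:
        return "end_playing"        # no match within the first set time
    if abs(error_beats) < 1.0:
        return "adjust_within_bar"  # small error: catch up inside the bar
    return "pause_until_next_bar"   # large error: resume at the next bar's start

# Example: 0.5 beats behind with 2 beats left -> play 25% faster for the rest
# of the bar; a 1.5-beat error instead pauses until the next bar's start time.
faster = adjust_speed_within_bar(1.0, 0.5, 2.0)  # 1.25
action = tempo_tracking_action(1.5, 0.0)         # "pause_until_next_bar"
```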
  • the following processing may also be further included to improve the user experience:
  • Information about a performance score and a current performance position is displayed and/or outputted.
  • the current performance position determined by positioning may be displayed in real time according to the positioned result of operation 304.
  • a user is allowed to select an avatar of the user by setting, and a virtual performance animation synthesized from the avatar is displayed in real time according to a positioned result of the score.
  • the avatar of the user may include a static image and a dynamic image.
  • the static image refers to fixed materials such as a portrait, clothing, decoration, musical instruments, and stage scenery of a virtual character.
  • the dynamic image refers to an animation action synthesized in real time by the television when the user performs, such as a character action and a camera movement.
  • the preset animation content may be displayed according to different scenes determined by positioning the score.
  • Animation switching positions may be preset in the repertoire. When a performance progress of the repertoire by the performance partner reaches a certain animation switching position, the displaying of a virtual performance animation is changed.
  • the virtual performance animation content switched to may be pre-designed.
  • the animation switching position may be set as: a position of switching between different musical instruments within the cooperation part in the repertoire. Accordingly, when the performance proceeds to the animation switching position (i.e., the position of switching between different musical instruments), a virtual performance animation set in advance corresponding to the performance of the musical instrument switched to is displayed.
  • the animation switching position may be set according to the input of the performance user before the performance starts, or the animation switching position may also be contained in the repertoire and already set when a file is initially established.
  • the action amplitude of each avatar in the virtual animation may change according to the volume change.
  • When the current digital score is not matched successfully, and/or a performance error corresponding to the current digital score is greater than a set threshold, the avatar preset by the performer may be changed into a preset action, and the virtual performance animation may be synthesized accordingly.
  • The following scenes and corresponding animations (dynamic images) may be preset:
    1. Wait for performance: a performer is in position and waves.
    2. Band scene 1: a shot shows all musicians.
    3. Band scene 2: a shot shows the band from various angles.
    4. Soloist is ready to enter: the soloist and the band make eye contact to indicate that they are ready.
    5. Soloist performance: a shot focuses on the soloist.
    6. New part enters music: a shot focuses on people for 2-3 seconds.
    7. Volume change: the action amplitude of the performance is adjusted with the volume.
    8. Tempo change: a shot focuses on the person with the greatest change amplitude for 2-3 seconds.
    9. End of performance: the band puts down the musical instruments in greeting, and then the animation ends.
    10. Performance interruption: the band grabs the musical instruments and waits, and the animation ends if the waiting time exceeds 5 seconds.
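  • For illustration, the preset scene-to-animation mapping above could be stored as a simple lookup table consulted by the animation synthesizer; the keys and description strings below paraphrase the listed scenes and are not an API from the disclosure:

```python
# Scene identifiers -> animation behaviors, paraphrasing the list above.
SCENE_ANIMATIONS = {
    "wait_for_performance": "performer in position and waves",
    "band_scene_1": "shot shows all musicians",
    "band_scene_2": "shots of the band from various angles",
    "soloist_ready": "soloist and band make eye contact",
    "soloist_performance": "shot focuses on the soloist",
    "new_part_enters": "shot focuses on people for 2-3 seconds",
    "volume_change": "action amplitude follows the volume",
    "tempo_change": "shot focuses on the player with the greatest change for 2-3 seconds",
    "end_of_performance": "band lowers instruments in greeting, then the animation ends",
    "performance_interruption": "band holds instruments; end after 5 seconds of waiting",
}

def animation_for(scene):
    """Return the preset animation for a scene determined by score positioning."""
    return SCENE_ANIMATIONS.get(scene, "default band shot")
```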
  • the above entire processing of the performance partner is performed for a specific performer.
  • both a single-user situation and a multi-user situation may also be set in specific implementation.
  • This is referred to herein as performance tracking, specifically including single-user tracking and multi-user tracking.
  • In a single-user scene, after a user completes basic settings and starts the performance, the performance partner always follows the set user.
  • In a multi-user scene, assuming that there are users A, B and C, the three users should perform simultaneously according to the normal musical cooperation. In this case, the performance partner follows the set user A to perform the cooperation part.
  • When the matching for the followed user is unsuccessful within a second set time (e.g., 2 seconds), the performance partner switches to follow the next preset performance user.
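  • A hedged sketch of the multi-user switching just described; the cyclic order of users is an assumed implementation detail:

```python
SECOND_SET_TIME = 2.0  # seconds without a match before switching performers

def maybe_switch_performer(users, current_index, unmatched_seconds):
    """Follow the next preset performance user when tracking is lost too long."""
    if unmatched_seconds >= SECOND_SET_TIME:
        return (current_index + 1) % len(users)  # assumed cyclic order
    return current_index

# Example: tracking of user A lost for 2.5 s -> the partner follows user B.
idx = maybe_switch_performer(["A", "B", "C"], 0, 2.5)  # 1
```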
  • each user may pre-select a corresponding avatar.
  • When a virtual performance animation is displayed, only an avatar corresponding to the current performer (i.e., the user being followed by the performance partner) may be displayed, and a corresponding virtual performance animation is synthesized.
  • When the performer is switched, the virtual performance animation is switched to an avatar corresponding to the performer switched to, and a corresponding virtual performance animation is synthesized.
  • Alternatively, it is also possible to simultaneously display the avatars of all performers and synthesize corresponding virtual performance animations.
  • the performance partner can perform tracking and playing according to audio performed by the performer, especially tracking in tempo, thereby adapting the performance of the music to the performance progress of the performer and improving the performance experience of the performer.
  • the disclosure also provides an apparatus for implementing a virtual performance partner.
  • the apparatus includes a processor for implementing: a collector 510, a score recognizer and matcher 520, a score positioner 530, and a tempo tracker 540.
  • the collector is configured to collect audio frame data performed by a performer in audio frames.
  • the score recognizer and matcher is configured to convert, for each piece of the current audio frame data collected, the piece of current audio frame data collected into a current digital score, match the current digital score with a specified range of digital scores in a repertoire, and determine a matching digital score in the specified range of digital scores that matches the current digital score.
  • the score positioner is configured to position, for each piece of the current audio frame data collected, a position of the matching digital score in the repertoire, and determine a start time of playing a cooperation part of music in a next bar of the matching digital score in the repertoire for a performance partner.
  • The tempo tracker is configured to determine, for each piece of the current audio frame data collected, a performance error between the performer and the performance partner according to a performance time of the current digital score and a performance time of the matching digital score, and adjust a playing speed of the performance partner for the cooperation part in the repertoire according to the performance error.
  • a virtual performance partner may be implemented.
  • a non-limiting example is provided below:
  • the system settings are shown in Table 2.
  • a user may select a composition to be performed through the settings, and set parts, a musical instrument(s) used by a performer(s), and a virtual animation image(s) of the performer(s).
  • a television acquires a digital score corresponding to the composition set by the user from a score library using the cloud service, and acquires a neural network model of the musical instrument set by the user from a sound library for performing audio-to-digital score conversion.
  • the television collects audio data generated by the performer performing with the selected musical instrument through a microphone connected to the television.
  • the television performs score recognition, converts the collected audio data into a digital score, positions a position of the current music after matching, and plays a cooperation part of the set music synchronously with the performance of the performer.
  • the television synthesizes a virtual performance animation in real time for output according to the positioned position, and outputs a score and the position positioned in the score.
  • The system modules are summarized below (Table 2):
    Cloud service
      Score library: stores public or copyrighted digital scores.
      Sound library: stores audio conversion models of various musical instruments.
    Television
      Microphone input: collects performed music using an internal/external microphone.
      Score recognizing: converts the audio input into score information (i.e., a digital score).
      Score matching: performs window matching of the current score segment in the global/partial score.
      Score positioning: positions the current music in the score.
    Output
      Score: displays the position of the current music in the score.
      Sound: synchronously plays the cooperation part of the music.
      Animation: generates, plays and stores the virtual animation in real time.
  • a method for providing a virtual performance partner may comprise: collecting audio frame data performed by a performer.
  • the method may further comprise: for each piece of current audio frame data collected, converting the piece of current audio frame data collected into a current digital score.
  • the method may further comprise: matching the current digital score with a range of digital scores in a repertoire.
  • the method may further comprise: determining a matching digital score in the range of digital scores that matches the current digital score.
  • the method may further comprise: positioning a position of the matching digital score in the repertoire.
  • the method may further comprise: determining a start time of playing a cooperation part of music in a next bar of the matching digital score in the repertoire for a performance partner.
  • the method may further comprise: determining a performance error between the performer and the performance partner based on a performance time of the current digital score and a performance time of the matching digital score.
  • the method may further comprise: adjusting a playing speed of the performance partner for the cooperation part in the repertoire based on the performance error.
  • the method may further comprise: determining, based on a position of the current digital score, the range within which a next audio frame is matched.
  • the determining the start time of playing the cooperation part of music in the next bar of the matching digital score in the repertoire for the performance partner may comprise: determining a performance speed of the performer based on the position of the matching digital score and positions of matching digital scores corresponding to first N pieces of audio frame data in the current audio frame data; identifying the performance speed as a reference playing speed of the repertoire; and determining a start time of playing a next bar of music of the matching digital score in the repertoire for the performance partner based on the reference playing speed.
  • the adjusting the playing speed of the performance partner for the repertoire based on the performance error may comprise: based on the performance error being less than one beat, adjusting the playing speed of the performance partner within a current bar of music according to the performance error based on the reference playing speed, to make the performance partner consistent with the performer in a performance end time of the current bar of music; and based on the performance error being greater than one beat, pausing playing, by the performance partner, at the current bar, and playing a next bar of music based on a playing time of the next bar of music.
  • the method may further comprise: based on repeated segments being contained in the repertoire, receiving an inputted set performance segment, and identifying the performance segment as the range.
  • the method may further comprise: based on the repertoire starting from a performance of the performance partner, playing, by the performance partner, a part of the repertoire prior to the performance of the performer based on a set playing speed.
  • the method may further comprise: based on the repertoire transitioning from a solo part of the performer to a performance part of the performance partner: based on a performance speed of the performer changing, starting to play, by the performance partner, the repertoire based on a performance speed at an end of the solo part; and based on the performance speed of the performer staying constant, starting to play, by the performance partner, the repertoire according to a set playing speed.
  • the performance partner may end the playing of the repertoire based on the current digital score not being matched successfully within a first set time.
  • the converting the piece of current audio frame data collected into the current digital score may comprise: processing the piece of current audio frame data collected using a pre-trained neural network model; and outputting the current digital score corresponding to the piece of current audio frame data collected.
  • the current digital score may be represented using a binary saliency map, and the pre-trained neural network model is trained using a binary classification cross entropy loss function.
  • the matching is implemented using a neural-network processor.
  • the method may further comprise: outputting a score of the repertoire and the position determined by the positioning.
  • the method may further comprise: determining a current scene based on the position determined by the positioning; and synthesizing, corresponding to the current scene, a virtual performance animation corresponding to the current scene using an avatar pre-selected by the performer.
  • the performer may be a preset performance user among the plurality of performance users; and based on the matching being unsuccessful within a preset time, the performer may be switched to a next preset performance user among the plurality of performance users.
  • an avatar pre-selected by each user may be stored; and based on the virtual performance animation being displayed, a virtual performance animation synthesized using an avatar pre-selected by a current performer may be displayed, and based on the performer being switched, the virtual performance animation may be switched to a virtual performance animation synthesized using an avatar pre-selected by a performer switched to; or, avatars pre-selected by all the performance users may be displayed simultaneously, and a desired virtual performance animation may be synthesized.
  • the synthesizing, corresponding to the current scene, the virtual performance animation corresponding to the current scene using the avatar pre-selected by the performer may comprise: pre-setting an animation switching position in the repertoire, and based on a performance progress of the repertoire by the performance partner reaching the animation switching position, changing the virtual performance animation; and/or based on the current digital score not being matched successfully and/or the performance error corresponding to the current digital score being greater than a set threshold, changing an avatar preset by the performer into a preset action, and synthesizing the virtual performance animation.
  • the animation switching position may be set based on an input of a performance user, or the animation switching position is contained in the repertoire.
  • the animation switching position may be a position of switching between different musical instruments within the cooperation part in the repertoire
  • the changing the virtual performance animation may comprise displaying a virtual performance animation preset corresponding to a performance of a musical instrument switched to corresponding to the switching position between the different musical instruments.
  • an apparatus for implementing a virtual performance partner may comprise: one or more processors configured to: collect audio frame data performed by a performer.
  • the one or more processors may be further configured to: convert, for each piece of current audio frame data collected, the piece of current audio frame data collected into a current digital score.
  • the one or more processors may be further configured to: match the current digital score with a range of digital scores in a repertoire.
  • the one or more processors may be further configured to: determine a matching digital score in the range of digital scores that matches the current digital score.
  • the one or more processors may be further configured to: position, for each piece of the current audio frame data collected, a position of the matching digital score in the repertoire.
  • the one or more processors may be further configured to: determine a start time of playing a cooperation part of music in a next bar of the matching digital score in the repertoire for a performance partner.
  • the one or more processors may be further configured to: determine, for each piece of the current audio frame data collected, a performance error between the performer and the performance partner based on a performance time of the current digital score and a performance time of the matching digital score.
  • the one or more processors may be further configured to: adjust a playing speed of the performance partner for the cooperation part in the repertoire based on the performance error.
  • a method for providing a virtual performance partner may comprise: receiving a current digital score corresponding to audio frame data.
  • the method may further comprise: matching the current digital score with a range of digital scores in a repertoire.
  • the method may further comprise: determining a matching digital score in the range of digital scores based on matching the current digital score.
  • the method may further comprise: identifying a position of the matching digital score in the repertoire.
  • the method may further comprise: identifying a start time of playing a cooperation part of music in a next bar of the matching digital score in the repertoire for a performance partner.
  • the method may further comprise: determining a performance error between the performer and the performance partner based on a performance time of the current digital score and a performance time of the matching digital score.
  • the method may further comprise: adjusting a playing speed of the performance partner for the cooperation part in the repertoire based on the performance error.
  • the method may further comprise: determining, based on a position of the current digital score, the range within which a next audio frame is matched.
  • the method may further comprise: based on the repertoire starting from a performance of the performance partner, playing, by the performance partner, a part of the repertoire prior to the performance of the performer based on a set playing speed.
  • the performance partner may end the playing of the repertoire based on the current digital score not being matched successfully within a first set time.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Auxiliary Devices For Music (AREA)
  • Electrophonic Musical Instruments (AREA)

Abstract

A method and apparatus for implementing a virtual performance partner are provided. The method includes collecting audio frame data performed by a performer; and for each piece of current audio frame data collected, performing: converting the piece of current audio frame data collected into a current digital score, matching the current digital score with a range of digital scores in a repertoire, and determining a matching digital score in the range of digital scores that matches the current digital score; positioning a position of the matching digital score in the repertoire, and determining a start time of playing a cooperation part of music in a next bar of the matching digital score in the repertoire for a performance partner.

Description

METHOD AND APPARATUS FOR IMPLEMENTING VIRTUAL PERFORMANCE PARTNER
The disclosure relates to an audio processing technology, and more particularly, to a method and apparatus for implementing a virtual performance partner.
As quality-oriented education in China has grown, so has enthusiasm for music, and an important part of music skills is musical instrument performance. In an ordinary musical instrument performance, a performer and a cooperator may cooperatively perform a piece of music. The performer plays the dominant role in the musical cooperation, and the cooperator should follow the performance of the performer. For example, in a violin performance, a performer plays the violin, and a symphony orchestra, as the cooperator, performs along with the violin performer.
In traditional musical instrument performance exercises, the performer usually follows a recording, e.g., a compact disc (CD). However, because the performer may have a limited skill level, or the performed music may be a high-difficulty virtuoso composition, the performer is often unable to keep up with the artist's recording on the CD, so the experience of such a performance exercise is poor.
Provided are a method and apparatus for implementing a virtual performance partner, which adapt the played music to the performance progress of a performer and thereby improve the performer's experience.
In accordance with an aspect of the disclosure, a method for providing a virtual performance partner includes: collecting audio frame data performed by a performer; and for each piece of current audio frame data collected, performing: converting the piece of current audio frame data collected into a current digital score, matching the current digital score with a range of digital scores in a repertoire, and determining a matching digital score in the range of digital scores that matches the current digital score; positioning a position of the matching digital score in the repertoire, and determining a start time of playing a cooperation part of music in a next bar of the matching digital score in the repertoire for a performance partner; and determining a performance error between the performer and the performance partner based on a performance time of the current digital score and a performance time of the matching digital score, and adjusting a playing speed of the performance partner for the cooperation part in the repertoire based on the performance error.
The method may include determining, based on a position of the current digital score, the range within which a next audio frame is matched.
The determining the start time of playing the cooperation part of music in the next bar of the matching digital score in the repertoire for the performance partner may include: determining a performance speed of the performer based on the position of the matching digital score and positions of matching digital scores corresponding to first N pieces of audio frame data in the current audio frame data, and identifying the performance speed as a reference playing speed of the repertoire; and determining a start time of playing a next bar of music of the matching digital score in the repertoire for the performance partner based on the reference playing speed.
The adjusting the playing speed of the performance partner for the repertoire based on the performance error may include: based on the performance error being less than one beat, adjusting the playing speed of the performance partner within a current bar of music according to the performance error, based on the reference playing speed, to make the performance partner consistent with the performer in a performance end time of the current bar of music; and based on the performance error being greater than one beat, pausing playing, by the performance partner, at the current bar, and playing a next bar of music based on a playing time of the next bar of music.
The method may include, based on repeated segments being contained in the repertoire, receiving an inputted set performance segment, and identifying the performance segment as the range.
The method may include, based on the repertoire starting from a performance of the performance partner, playing, by the performance partner, a part of the repertoire prior to the performance of the performer based on a set playing speed.
The method may include, based on the repertoire transitioning from a solo part of the performer to a performance part of the performance partner: based on a performance speed of the performer changing, starting to play, by the performance partner, the repertoire based on a performance speed at an end of the solo part; and based on the performance speed of the performer staying constant, starting to play, by the performance partner, the repertoire based on a set playing speed.
The performance partner ends the playing of the repertoire based on the current digital score not being matched successfully within a first set time.
The converting the piece of current audio frame data collected into the current digital score may include: processing the piece of current audio frame data collected using a pre-trained neural network model, and outputting the current digital score corresponding to the piece of current audio frame data collected.
The current digital score may be represented using a binary saliency map, and the pre-trained neural network model may be trained using a binary classification cross entropy loss function.
The matching may be implemented using a neural-network processor.
The method may include outputting a score of the repertoire and the position determined by the positioning.
The method may include determining a current scene based on the position determined by the positioning, and synthesizing, corresponding to the current scene, a virtual performance animation corresponding to the current scene using an avatar pre-selected by the performer.
Based on there being a plurality of performance users, the performer may be a preset performance user among the plurality of performance users; and based on the matching being unsuccessful within a preset time, the performer may be switched to a next preset performance user among the plurality of performance users.
Based on there being a plurality of performance users, an avatar pre-selected by each user may be stored; based on the virtual performance animation being displayed, a virtual performance animation synthesized using an avatar pre-selected by a current performer may be displayed, and based on the performer being switched, the virtual performance animation may be switched to a virtual performance animation synthesized using an avatar pre-selected by a performer switched to; or, avatars pre-selected by all the performance users may be displayed simultaneously, and a desired virtual performance animation may be synthesized.
The synthesizing, corresponding to the current scene, the virtual performance animation corresponding to the current scene using an avatar pre-selected by the performer may include: pre-setting an animation switching position in the repertoire, and based on a performance progress of the repertoire by the performance partner reaching the animation switching position, changing the virtual performance animation; and/or, based on the current digital score not being matched successfully and/or the performance error corresponding to the current digital score being greater than a set threshold, changing an avatar preset by the performer into a preset action, and synthesizing the virtual performance animation.
The animation switching position may be set based on an input of a performance user, or the animation switching position may be contained in the repertoire.
The animation switching position may be a position of switching between different musical instruments within the cooperation part in the repertoire, wherein the changing the virtual performance animation may include displaying a virtual performance animation preset corresponding to a performance of a musical instrument switched to corresponding to the switching position between the different musical instruments.
In accordance with an aspect of the disclosure, an apparatus for implementing a virtual performance partner includes: a processor configured to: collect audio frame data performed by a performer; convert, for each piece of current audio frame data collected, the piece of current audio frame data collected into a current digital score, match the current digital score with a range of digital scores in a repertoire, and determine a matching digital score in the range of digital scores that matches the current digital score; position, for each piece of the current audio frame data collected, a position of the matching digital score in the repertoire, and determine a start time of playing a cooperation part of music in a next bar of the matching digital score in the repertoire for a performance partner; and determine, for each piece of the current audio frame data collected, a performance error between the performer and the performance partner based on a performance time of the current digital score and a performance time of the matching digital score, and adjust a playing speed of the performance partner for the cooperation part in the repertoire based on the performance error.
In accordance with an aspect of the disclosure, a method for providing a virtual performance partner includes: receiving a current digital score corresponding to audio frame data; matching the current digital score with a range of digital scores in a repertoire; determining a matching digital score in the range of digital scores based on matching the current digital score; identifying a position of the matching digital score in the repertoire, and identifying a start time of playing a cooperation part of music in a next bar of the matching digital score in the repertoire for a performance partner; determining a performance error between the performer and the performance partner based on a performance time of the current digital score and a performance time of the matching digital score; and adjusting a playing speed of the performance partner for the cooperation part in the repertoire based on the performance error.
The method may include determining, based on a position of the current digital score, the range within which a next audio frame may be matched.
The method may include, based on the repertoire starting from a performance of the performance partner, playing, by the performance partner, a part of the repertoire prior to the performance of the performer based on a set playing speed.
The performance partner may end the playing of the repertoire based on the current digital score not being matched successfully within a first set time.
The above and other aspects, features, and advantages of certain embodiments of the present disclosure will be more apparent from the following description taken in conjunction with the accompanying drawings, in which:
FIG. 1 is a schematic diagram of a basic flow of a method for implementing a virtual performance partner, according to an embodiment;
FIG. 2 is a schematic diagram of a system architecture, according to an embodiment;
FIG. 3 is a schematic diagram of a flow of a method for implementing a virtual performance partner, according to an embodiment;
FIG. 4 is a schematic diagram of training a neural network model, according to an embodiment; and
FIG. 5 is a schematic diagram of a basic structure of an apparatus for implementing a virtual performance partner, according to an embodiment.
The embodiments described below do not represent all technical aspects of the disclosure. It should be understood that various equivalents or variations that may be substituted for them at the time of the present application belong to the scope of rights of the disclosure.
If a detailed description for the functions or configurations related to the present disclosure may unnecessarily obscure the gist of the present disclosure, the detailed description may be omitted. In addition, the following embodiments may be modified in several different forms, and the scope and spirit of the disclosure are not limited to the following embodiments. Rather, these embodiments are provided to make the present disclosure thorough and complete, and to completely transfer the spirit of the present disclosure to those skilled in the art.
However, it is to be understood that technologies mentioned in the present disclosure are not limited to specific embodiments, and include all modifications, equivalents and/or substitutions according to embodiments of the present disclosure. Throughout the accompanying drawings, similar components are denoted by similar reference numerals.
The expressions "first," "second" and the like, used in the present disclosure may indicate various components regardless of a sequence and/or importance of the components. These expressions are only used in order to distinguish one component from the other components, and do not limit the corresponding components.
In the present disclosure, the expression "A or B," "at least one of A and/or B" or "one or more of A and/or B" or the like, may include all possible combinations of items enumerated together. For example, "A or B," "at least one of A and B," or "at least one of A or B" may indicate all of 1) a case where at least one A is included, 2) a case where at least one B is included, or 3) a case where both of at least one A and at least one B are included.
A term of a singular form may include its plural forms unless the context clearly indicates otherwise. It is to be understood that a term "include" or "formed of" used in the specification specifies the presence of features, numerals, steps, operations, components, parts or combinations thereof, which is mentioned in the specification, and does not preclude the presence or addition of one or more other features, numerals, steps, operations, components, parts or combinations thereof.
In case that any component (for example, a first component) is mentioned to be "(operatively or communicatively) coupled with/to" or "connected to" another component (for example, a second component), it is to be understood that the any component is directly coupled to the another component or may be coupled to the another component through other component (for example, a third component). On the other hand, in case that any component (for example, the first component) is mentioned to be "directly coupled" or "directly connected to" another component (for example, the second component), it is to be understood that the other component (for example, the third component) is not present between any component and another component.
An expression "configured (or set) to" used in the present disclosure may be replaced by an expression "suitable for," "having the capacity to," "designed to," "adapted to," "made to" or "capable of" based on a situation. A term "configured (or set) to" may not necessarily indicate "specifically designed to" in hardware. Instead, an expression "an apparatus configured to" may indicate that the apparatus may "perform~" together with other apparatuses or components. For example, "a processor configured (or set) to perform A, B, and C" may indicate a dedicated processor (for example, an embedded processor) for performing the corresponding operations or a generic-purpose processor (for example, a central processing unit (CPU) or an application processor) which may perform the corresponding operations by executing one or more software programs stored in a memory apparatus.
In the performance of musical instruments, a cooperator should follow the performance of a performer. Since music is sensitive to tempo, even a slight delay may be uncomfortable to hear. Based on this, the disclosure provides a method and apparatus for implementing a virtual performance partner that can adaptively adjust the content and speed of a repertoire played by a player according to the audio of the performer, and in particular can make adjustments according to the performer's tempo.
Specifically, a performance partner is a virtual device that controls the player to play a piece of music, and a performer is a user who uses the performance partner to accompany his or her own musical instrument performance. The performer gives an actual performance of certain music A. The player plays a specified part of the corresponding music A, usually an accompaniment part for music A performed by a certain musical instrument. Hereinafter, the entire music performed is referred to as a repertoire. A complete score of the repertoire and a corresponding audio file to be played by the performance partner may be pre-stored, e.g., in a server. The performance partner acquires, according to a request for the music from a performance user, the stored score of the repertoire and the audio file to be played by the performance partner. The complete score includes a score of a part to be performed by the performer and a score corresponding to the audio file to be played by the performance partner (hereinafter referred to as a score of a cooperation part). In addition, in order to provide more options, a plurality of audio files to be played by the performance partner may be stored for one repertoire, e.g., audio files corresponding to different musical instruments. The performance user may thus be provided with accompaniment versions of different musical instruments for the same repertoire.
FIG. 1 is a schematic diagram of a basic flow of a method for implementing a virtual performance partner in the disclosure, according to an embodiment. As shown in FIG. 1, the method includes the following steps (e.g., operations).
In operation 101, audio frame data performed by a performer is collected.
For each piece of current audio frame data collected in operation 101, the following operations 102-104 are executed.
In operation 102, the piece of current audio frame data collected is converted into a current digital score, the current digital score is matched with a range of digital scores in a repertoire, and a matching digital score in the range of digital scores that matches the current digital score is determined.
Operation 102 is used to perform the operations of score recognizing and score matching. The collected audio frame data is first converted into digital scores. The digital score into which the piece of current audio frame data collected is converted is referred to as the current digital score. The audio data of the repertoire is pre-converted into a digital score, and the current digital score is matched with the entire repertoire or the range of digital scores in the repertoire. The part of the digital score in the repertoire that successfully matches the current digital score is referred to as the matching digital score. It can be seen therefrom that each piece of current audio frame data collected is converted into a current digital score, and a corresponding matching digital score is determined. When digital score matching is performed, the portion of the repertoire's digital score that is matched against the current digital score should be the score of the part performed by the performer.
In operation 103, a position of the matching digital score is positioned in the repertoire, and a start time of playing a cooperation part of music in a next bar of the matching digital score in the repertoire is determined for a performance partner.
Operation 103 is used to perform score positioning. For the matching digital score determined in operation 102, the position of the matching digital score in the entire repertoire, i.e., the position of a part currently being performed by the performer in the entire music, is determined. It is determined, using the position information, when the performance partner starts to play the music content in the next bar of the positioned position. The method of determining the start time of playing the next bar is described in detail in the following embodiments.
If the performance partner supports displaying the score being performed, the positioned position may also be indicated in the displayed score.
In operation 104, a performance error between the performer and the performance partner is determined based on a performance time of the current digital score and a performance time of the matching digital score, and a playing speed of the performance partner for the cooperation part in the repertoire is adjusted based on the performance error.
Operation 104 is used to perform tempo tracking. Specifically, an error between the performance of the performer and the playing of the performance partner is determined, and tempo tracking is performed to adjust the playing speed of the performance partner for the repertoire.
Through the above flow, it is possible to recognize the audio being performed by the performer and position the part being performed in the music, so as to enable the performer to control the playing progress of the player at each bar of the music. Tempo tracking is performed using the performance error between the performer and the performance partner, to further adjust the playing speed of the player.
As shown in the flow of FIG. 1, operations 102-104 should be processed for each audio frame, according to an embodiment. Therefore, the processing should be fast. Hardware such as an NPU (Neural-network Processing Unit) (e.g., a neural-network processor) may be used to support the above method. The performance partner analyzes the audio and matches the scores in real time, which is a high-energy-consumption computation scenario. Issues such as power may be ignored when the operations are deployed on a desktop device. In addition, sufficient audio volume may be useful for the performance partner to cooperate with the musical instrument. Therefore, a television may be used as a suitable deployment platform to realize a specific method of the disclosure.
A specific implementation of the method for implementing a virtual performance partner in the disclosure will be described below with an embodiment. In the present embodiment, a television is selected as a deployment platform of the corresponding method. The entire system architecture, as shown in FIG. 2, includes a television, a microphone connected to the television, and a sound box connected to the television. FIG. 3 is a schematic diagram of a specific flow of the method for implementing a virtual performance partner in the present embodiment. The flow is explained by taking the processing of an audio frame as an example. As shown in FIG. 3, the processing of an audio frame includes the following steps, according to an embodiment.
In operation 301, current audio frame data of a performer is acquired.
The processing of the operation 301 may be performed in a variety of existing ways. For example, a microphone may be plugged into a television to collect the audio of the performer.
In operation 302, the current audio frame data is converted into a current digital score using a pre-trained neural network model.
Operation 302 is used for score recognizing, i.e., converting the audio collected by the microphone into a digital score that can be subsequently processed. The digital score may be represented in various existing ways, such as a binary saliency map. Specifically, the pitch range of music covers 88 keys, from great A2 to small c5 (i.e., the full piano range), at semitone intervals. Based on this, a digital score may be represented as a two-dimensional binary matrix with an X axis representing time coordinates and a Y axis representing pitch coordinates, thereby generating a binary saliency map.
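As an illustration of this representation, the sketch below builds a binary saliency map with NumPy. The frame count, the note list, and indexing pitches as semitone steps from the lowest key are illustrative assumptions, not values fixed by the disclosure.

```python
import numpy as np

NUM_PITCHES = 88   # semitone steps across the full piano range
NUM_FRAMES = 100   # hypothetical number of time frames in this excerpt

def make_saliency_map(notes):
    """Build a binary saliency map: rows are pitch coordinates, columns
    are time coordinates; a cell is 1 while that pitch is sounding.
    notes: list of (pitch_index, start_frame, end_frame) tuples."""
    score = np.zeros((NUM_PITCHES, NUM_FRAMES), dtype=np.uint8)
    for pitch, start, end in notes:
        score[pitch, start:end] = 1
    return score

# Example: middle C (key index 39 counted from the lowest key) for 20
# frames, then the E above it for the next 20 frames.
current_score = make_saliency_map([(39, 0, 20), (43, 20, 40)])
```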
In the processing of operation 302, the current audio frame data is inputted into the trained neural network model, and after it is processed by the model, the current digital score corresponding to the current audio frame data is outputted. Of course, a neural network model for converting the audio data into the digital score should be pre-trained before the entire flow shown in FIG. 3 is started. The training of the neural network model is briefly described below.
The input of the neural network model is the audio frame data collected by the microphone, and the output is the corresponding digital score which is specifically a binary saliency map in the present embodiment. In view of the particularity of a musical instrument performance and tonal differences between different musical instruments, a corresponding neural network model may be trained for each musical instrument.
To train the neural network, training data may be prepared in advance, including a series of audio data (collected from performances of scores by the corresponding musical instrument) and digital scores corresponding to the audio data. The digital score corresponding to audio data A may be acquired as follows: the score corresponding to audio data A, i.e., the score of the performed content (e.g., a staff), is known, and that score is represented in the form of a digital score, e.g., as a binary saliency map; this digital score then corresponds to audio data A. In the training of the neural network, as shown in FIG. 4, audio data and its corresponding digital score constitute paired training data. The audio data is inputted into the neural network model to obtain an output of the model; the output is compared with the digital score corresponding to the audio data to calculate a loss function, and the model parameters are updated accordingly. Then the next audio data for the training is inputted into the model for processing, until the model training is completed. In the present embodiment, the digital score is represented using the binary saliency map. Since the binary saliency map consists of binary classification labels, the neural network model may be trained using a binary classification cross entropy loss function.
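A minimal sketch of this training step follows, assuming a PyTorch setup in which the model maps a 1024-dimensional audio frame feature to 88 pitch logits; the architecture, feature size, and optimizer are placeholders rather than the network of FIG. 4. BCEWithLogitsLoss applies the binary classification cross entropy described above, treating each cell of the saliency map as an independent binary label.

```python
import torch
import torch.nn as nn

# Placeholder model: audio frame feature -> 88 pitch logits (one column
# of the binary saliency map). The real architecture is not specified here.
model = nn.Sequential(nn.Linear(1024, 512), nn.ReLU(), nn.Linear(512, 88))
criterion = nn.BCEWithLogitsLoss()          # binary cross entropy on logits
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_step(audio_frames, target_saliency):
    """One update with paired data: audio_frames (batch, 1024) and the
    matching saliency labels target_saliency (batch, 88) with 0/1 entries."""
    optimizer.zero_grad()
    logits = model(audio_frames)
    loss = criterion(logits, target_saliency.float())  # compare output with label
    loss.backward()                                    # update model parameters
    optimizer.step()
    return loss.item()
```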
In operation 303, the current digital score is matched with a specified range of digital scores in a repertoire, and a matching digital score in the specified range of digital scores that matches the current digital score is determined, according to an embodiment.
Operation 303 is used to perform score matching, according to an embodiment. In order to perform the score matching, it may be necessary to acquire the digital score of the repertoire, which may be pre-stored in a database. The current digital score is compared with the complete digital score or a specified partial digital score of the repertoire in a window search, to find the part most similar to the current digital score; the most similar part is referred to as the matching digital score. As previously described, the digital score of the repertoire used for comparison may be the complete digital score or a specified partial digital score. Here, the specified partial digital score may be specified by a user, or, after a plurality of audio frames have been processed, the content currently being performed may be localized within a certain region of the repertoire, and that region is taken as the specified range. When digital score matching is performed, the digital score of the performer's part in the repertoire should be selected to be matched with the current digital score.
Matching may be realized using various existing searching and matching algorithms, or by a pre-trained network model. In addition, in order to accelerate the calculation and reduce CPU load, the above matching process may be performed by an NPU.
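Since the disclosure leaves the concrete search algorithm open, the sketch below shows one plausible window search over binary saliency maps: slide the current digital score across the specified range and keep the offset with the greatest overlap. The overlap count as a similarity measure is an assumption.

```python
import numpy as np

def find_matching_score(current, repertoire, lo, hi):
    """Window search: compare `current` (88 x w) with every w-column
    window of repertoire[:, lo:hi] and return the start column of the
    most similar window, i.e., the matching digital score's position.
    Similarity is the count of jointly active cells (an assumption)."""
    w = current.shape[1]
    best_pos, best_sim = lo, -1
    for pos in range(lo, hi - w + 1):
        sim = int(np.sum(repertoire[:, pos:pos + w] & current))
        if sim > best_sim:
            best_pos, best_sim = pos, sim
    return best_pos
```

As noted above, the position found for the current frame can then seed the specified range used when matching the next audio frame.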
In operation 304, a position of the matching digital score is positioned in the repertoire, and a start time of playing a cooperation part of music in a next bar of the matching digital score in the repertoire is determined for a performance partner.
As previously described, the implementation of the virtual performance partner is actually controlling the player (i.e., a television in this embodiment) to play the set playing content. The set playing content, e.g., violin accompaniment audio for a certain repertoire, may be preset by the user.
Operation 304 is used to first perform score positioning. The position of the matching digital score matched in the repertoire in operation 303 is determined. The position is a part currently being performed by the performer. A performance speed of the performer, i.e., an average performance speed within N+1 audio frame times, is calculated according to the position of the matching digital score and the positions of matching digital scores corresponding to first N pieces of audio frame data in the current audio frame data. The performance speed is used as a reference playing speed of the player playing the repertoire. Next, a start time of playing a next bar of music of the matching digital score in the repertoire is determined for the player based on the reference playing speed. That is to say, a performance start time of a next bar relative to a bar where the current audio frame is located is calculated based on the above reference playing speed. The player determines the performance start time as a start time of playing the next bar of music, so that the performance of the performer and the playing of the player are synchronized at an initial position of the next bar.
In addition, when performing the score matching, the matching may be performed within a specified range of the music. A specific position of the specified range in the processing of the next audio frame data may be determined according to the positioned position based on the positioned result of the present step.
The above processing for score positioning is the most basic processing manner, referred to herein as a random positioning manner, and is used when processing an ordinary audio frame, including when a performer enters the music for the first time and when the performer starts a performance from an arbitrary position. On this basis, a music theory positioning manner may further be included, which refers to processing and playing a cooperation part according to information marked in a score. The information marked in the score may indicate, e.g., music starting from a solo part, a segment played at a free tempo that is difficult to track, repeated segments contained in a composition, music starting from a band part, music changing from the solo part to the band part, etc.
For the above random positioning manner and music theory positioning manner, corresponding processing may be used to determine the start time of playing the next bar of the cooperation repertoire in different situations. The determination manner described in operation 304 is simply referred to as the random positioning algorithm. The processing allocation between random positioning and music theory positioning may be performed as follows:
When the performer enters the music for the first time, when the performer starts a performance from an arbitrary position, when the music starts from a solo part, and when a segment performed at a free tempo is difficult to track, the above random positioning algorithm may be used to determine a reference playing speed and a start time of playing the next bar.
When repeated segments are contained in the repertoire, an inputted set performance segment is received, the performance segment is used as the specified range, and the random positioning algorithm is executed to determine a reference playing speed and a start time of playing a next bar.
When the repertoire starts from the performance of the performance partner, the performance partner plays a part of the repertoire prior to the performance of the performer according to a set playing speed. The set playing speed may be a default playing speed.
When the repertoire transitions from a solo part of the performer to a performance part of the performance partner, the performance partner starts to play the repertoire according to a performance speed at the end of the solo part if a performance speed of the performer changes; or otherwise, the performance partner starts to play the repertoire according to a set playing speed.
In operation 305, a performance error between the performer and the performance partner is determined according to a performance time of the current digital score and a performance time of the matching digital score, and a playing speed of the performance partner for the cooperation part in the repertoire is adjusted according to the performance error.
The present operation 305 is used to perform tempo tracking and adjust an actual playing speed within a bar.
First, it may be necessary to determine the performance error between the performer and the performance partner. As previously described, the digital score includes pitches and durations of respective tones in the score. The durations are referred to as the performance time of the digital score. A difference between the performance time of the current digital score and the performance time of the matching digital score is the performance error.
The manner of adjusting the playing speed according to the performance error may specifically include the following. When the performance error is less than one beat, on the basis of the reference playing speed determined in operation 304, the playing speed within the current bar is adjusted according to the performance error, so that the performance partner is consistent with the performer in the performance end time of the current bar of music. In this way, the playing speed can be adjusted within the current bar, and the performance speed of the performer can be caught up with within the current bar. This processing cooperates with the processing of operation 304 to ensure that the start times of playing the next bar by the performer and the performance partner are consistent.
When the performance error is greater than one beat, the performance partner pauses playing at the current bar and plays a next bar of music according to a playing time of the next bar of music. Since the non-synchronization in tempo is easily perceived, if the performance error is excessive, the performance of the performance partner is paused at the current bar, i.e., the playing of the player is paused, and the cooperation part of a next bar of score is played starting from the next bar. The processing also cooperates with the processing of operation 304.
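The one-beat rule of operation 305 might be sketched as follows, where `error_beats` is the partner's lag behind the performer in beats (negative if the partner is ahead) and `remaining_beats` is what remains of the current bar; the names and sign convention are assumptions.

```python
def adjust_within_bar(error_beats, ref_speed, remaining_beats):
    """Return (new_speed, pause): if the error is under one beat, rescale
    the partner's speed so both parties end the current bar together;
    otherwise pause at the current bar and rejoin at the next bar's
    computed start time."""
    if abs(error_beats) < 1.0:
        # The performer will finish the bar after playing
        # (remaining_beats - error_beats) beats at ref_speed; cover the
        # partner's remaining_beats in that same time span.
        new_speed = ref_speed * remaining_beats / (remaining_beats - error_beats)
        return new_speed, False
    return ref_speed, True  # pause, then play the next bar on schedule
```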
In addition, the performance partner ends the playing of the repertoire when the current digital score is not matched successfully within a first set time (e.g., 5 seconds).
If the performer skips a bar and pauses performance for less than the first set time, the performance of the performance partner is paused at a current bar, i.e., the playing of the player is paused, and the cooperation part of a next bar of score is played starting from the next bar. The processing also cooperates with the processing of operation 304.
If the performance is ended or the performance is interrupted (i.e., corresponding audio frame data cannot be collected), the performance partner ends the playing of the repertoire.
Thus, the method flow in the present embodiment is ended, according to an embodiment.
On the basis of the above method, since the television also has a display function, the following processing may also be further included to improve the user experience:
1. Information about a performance score and a current performance position is displayed and/or outputted. Here, the current performance position determined by positioning may be displayed in real time according to the positioned result of operation 304.
2. A user is allowed to select an avatar of the user by setting, and a virtual performance animation synthesized from the avatar is displayed in real time according to a positioned result of the score. The avatar of the user may include a static image and a dynamic image. The static image refers to fixed materials such as a portrait, clothing, decoration, musical instruments, and stage scenery of a virtual character. The dynamic image refers to an animation action synthesized in real time by the television when the user performs, such as a character action and a camera movement. The preset animation content may be displayed according to different scenes determined by positioning the score.
Animation switching positions may be preset in the repertoire. When a performance progress of the repertoire by the performance partner reaches a certain animation switching position, the displaying of a virtual performance animation is changed. The virtual performance animation content switched to may be pre-designed. For example, the animation switching position may be set as: a position of switching between different musical instruments within the cooperation part in the repertoire. Accordingly, when the performance proceeds to the animation switching position (i.e., the position of switching between different musical instruments), a virtual performance animation set in advance corresponding to the performance of the musical instrument switched to is displayed. The animation switching position may be set according to the input of the performance user before the performance starts, or the animation switching position may also be contained in the repertoire and already set when a file is initially established.
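One simple realization of such switching positions is a table keyed by position in the repertoire, sketched below; keying by bar index and the clip names are purely hypothetical.

```python
# Hypothetical switch table: bar index in the repertoire -> animation clip.
# Entries may come from user input before the performance or be embedded
# in the repertoire file when it is established.
SWITCH_POSITIONS = {16: "band_scene_2", 48: "violin_section_focus"}

def animation_for_progress(bar_index, current_clip):
    """Change the displayed virtual performance animation when playing
    progress reaches a preset switching position; otherwise keep it."""
    return SWITCH_POSITIONS.get(bar_index, current_clip)
```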
When it is detected that the volume of a device where the performance partner is located changes, the action amplitude of each avatar in the virtual animation may change according to the volume change.
When the current digital score is not matched successfully and/or a problem occurs in the tempo of the performer (e.g., a performance error corresponding to the current digital score may be greater than a set threshold), it is also possible to transform the avatar preset by the performer into a preset action and synthesize a corresponding virtual performance animation, so that when the digital score fails to be matched and/or a problem occurs in the tempo, an animation character corresponding to the performer may have a corresponding performance.
An example of corresponding virtual performance animations displayed in different scenes is given below, as shown in Table 1.
Scene | Animation (dynamic image)
1. Wait for performance | A performer is in position and waves
2. Band scene 1 | A shot shows all musicians
3. Band scene 2 | A shot shows the band from various angles
4. Soloist is ready to enter | The soloist and the band are in eye contact to indicate that they are ready
5. Soloist performance | A shot focuses on the soloist
6. New part enters music | A shot focuses on people for 2-3 seconds
7. Volume change | The action amplitude of performance is adjusted with the volume
8. Tempo change | A shot focuses on a person with the greatest change amplitude for 2-3 seconds
9. End of performance | The band puts down musical instruments in greeting, and then the animation is ended
10. Performance interruption | The band grabs musical instruments and waits, and the animation ends if the waiting time exceeds 5 seconds
In addition, the above entire processing of the performance partner is performed for a specific performer. In specific implementations, both a single-user situation and a multi-user situation may be supported. This is referred to herein as performance tracking, specifically including single-user tracking and multi-user tracking. In a single-user scene, after a user completes the basic settings and starts performing, the performance partner always follows the set user. In a multi-user scene, assuming that there are users A, B and C, the three users should perform simultaneously according to the normal musical cooperation. In this case, the performance partner follows the set user A to perform the cooperation part. If, in the tempo tracking part, user A fails to be tracked for more than a second set time (e.g., 2 seconds), the tracked object is switched to another user, and so on.

If there are a plurality of performance users, each user may pre-select a corresponding avatar. When a virtual performance animation is displayed, only the avatar corresponding to the current performer (i.e., the user being followed by the performance partner) may be displayed, and a corresponding virtual performance animation is synthesized. When the performer is switched, the virtual performance animation is switched to the avatar corresponding to the performer switched to, and a corresponding virtual performance animation is synthesized. When a virtual performance animation is displayed, it is also possible to simultaneously display the avatars of all performers and synthesize corresponding virtual performance animations.
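A sketch of this multi-user tracking policy follows, assuming the score matcher reports a per-frame boolean `matched` signal for the currently tracked user; the class and its timeout handling are illustrative.

```python
import time

class PerformerTracker:
    """Follow one preset user; if that user's audio fails to match for
    longer than the second set time (e.g., 2 seconds), switch tracking
    to the next user, and so on."""
    def __init__(self, users, switch_timeout=2.0):
        self.users = users
        self.index = 0
        self.switch_timeout = switch_timeout
        self.last_match = time.monotonic()

    def on_frame(self, matched):
        """Feed one frame's matching result; return the tracked user."""
        now = time.monotonic()
        if matched:
            self.last_match = now
        elif now - self.last_match > self.switch_timeout:
            self.index = (self.index + 1) % len(self.users)  # switch user
            self.last_match = now
        return self.users[self.index]
```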
The above is the specific implementation of the method for implementing a virtual performance partner in the disclosure. By means of the above processing in the disclosure, the performance partner can perform tracking and playing according to audio performed by the performer, especially tracking in tempo, thereby adapting the performance of the music to the performance progress of the performer and improving the performance experience of the performer.
The disclosure also provides an apparatus for implementing a virtual performance partner. As shown in FIG. 5, the apparatus includes a processor for implementing: a collector 510, a score recognizer and matcher 520, a score positioner 530, and a tempo tracker 540.
The collector is configured to collect audio frame data performed by a performer in audio frames.
The score recognizer and matcher is configured to convert, for each piece of the current audio frame data collected, the piece of current audio frame data collected into a current digital score, match the current digital score with a specified range of digital scores in a repertoire, and determine a matching digital score in the specified range of digital scores that matches the current digital score.
The score positioner is configured to position, for each piece of the current audio frame data collected, a position of the matching digital score in the repertoire, and determine a start time of playing a cooperation part of music in a next bar of the matching digital score in the repertoire for a performance partner.
The tempo tracker is configured to determine, for each piece of the current audio frame data collected, a performance error between the performer and the performance partner according to a performance time of the current digital score and a performance time of the matching digital score, and adjust a playing speed of the performance partner for the cooperation part in the repertoire according to the performance error.
With the above method and apparatus of the disclosure, a virtual performance partner may be implemented. A non-limiting example is provided below:
The system settings are shown in Table 2. Under the system settings shown in the example, a user may select a composition to be performed through the settings, and set parts, a musical instrument(s) used by a performer(s), and a virtual animation image(s) of the performer(s). According to the user settings, a television acquires a digital score corresponding to the composition set by the user from a score library using the cloud service, and acquires a neural network model of the musical instrument set by the user from a sound library for performing audio-to-digital score conversion. The television collects audio data generated by the performer performing with the selected musical instrument through a microphone connected to the television. The television performs score recognition, converts the collected audio data into a digital score, positions a position of the current music after matching, and plays a cooperation part of the set music synchronously with the performance of the performer. The television synthesizes a virtual performance animation in real time for output according to the positioned position, and outputs a score and the position positioned in the score.
Module | Sub-module | Description
Cloud service | Score library | Store public or copyrighted digital scores
Cloud service | Sound library | Store audio conversion models of various musical instruments
Setting | Composition | Audio file of a composition included in score library
Setting | Part | Part to be performed by performer
Setting | Musical instrument | Musical instrument used by performer
Setting | Image | Face, dress, etc. of virtual character
Input | Microphone input | Collect performed music using internal/external microphone
Television | Score recognizing | Convert audio input into score information (i.e., digital score)
Television | Score matching | Window matching of current score segment in global/partial score
Television | Score positioning | Position current music in score
Television | Tempo tracking | Infer tempo of performer for synchronous playing
Television | Animation generating | Synthesize virtual performance animation in real time
Output | Score | Display position of current music in score
Output | Sound | Synchronously play cooperation part of music
Output | Animation | Generate, play and store virtual animation in real time
In an embodiment, a method for providing a virtual performance partner may comprise: collecting audio frame data performed by a performer.

In an embodiment, the method may further comprise: for each piece of current audio frame data collected, converting the piece of current audio frame data collected into a current digital score.
In an embodiment, the method may further comprise: matching the current digital score with a range of digital scores in a repertoire.
In an embodiment, the method may further comprise: determining a matching digital score in the range of digital scores that matches the current digital score.
In an embodiment, the method may further comprise: positioning a position of the matching digital score in the repertoire.
In an embodiment, the method may further comprise: determining a start time of playing a cooperation part of music in a next bar of the matching digital score in the repertoire for a performance partner.
In an embodiment, the method may further comprise: determining a performance error between the performer and the performance partner based on a performance time of the current digital score and a performance time of the matching digital score.
In an embodiment, the method may further comprise: adjusting a playing speed of the performance partner for the cooperation part in the repertoire based on the performance error.
In an embodiment, the method may further comprise: determining, based on a position of the current digital score, the range within which a next audio frame is matched.
In an embodiment, the determining the start time of playing the cooperation part of music in the next bar of the matching digital score in the repertoire for the performance partner may comprise: determining a performance speed of the performer based on the position of the matching digital score and positions of matching digital scores corresponding to first N pieces of audio frame data in the current audio frame data; identifying the performance speed as a reference playing speed of the repertoire; and determining a start time of playing a next bar of music of the matching digital score in the repertoire for the performance partner based on the reference playing speed.
In an embodiment, the adjusting the playing speed of the performance partner for the repertoire based on the performance error may comprise: based on the performance error being less than one beat, adjusting the playing speed of the performance partner within a current bar of music according to the performance error based on the reference playing speed, to make the performance partner consistent with the performer in a performance end time of the current bar of music; and based on the performance error being greater than one beat, pausing playing, by the performance partner, at the current bar, and playing a next bar of music based on a playing time of the next bar of music.
In an embodiment, the method may further comprise: based on repeated segments being contained in the repertoire, receiving an inputted set performance segment, and identifying the performance segment as the range.
In an embodiment, the method may further comprise: based on the repertoire starting from a performance of the performance partner, playing, by the performance partner, a part of the repertoire prior to the performance of the performer based on a set playing speed.
In an embodiment, the method may further comprise, based on the repertoire transitioning from a solo part of the performer to a performance part of the performance partner: based on a performance speed of the performer changing, starting to play, by the performance partner, the repertoire based on a performance speed at an end of the solo part; and based on the performance speed of the performer staying constant, starting to play, by the performance partner, the repertoire according to a set playing speed.
In an embodiment, the performance partner may end the playing of the repertoire based on the current digital score not being matched successfully within a first set time.
In an embodiment, the converting the piece of current audio frame data collected into the current digital score may comprise: processing the piece of current audio frame data collected using a pre-trained neural network model; and outputting the current digital score corresponding to the piece of current audio frame data collected.
In an embodiment, the current digital score may be represented using a binary saliency map, and the pre-trained neural network model is trained using a binary classification cross entropy loss function.
In an embodiment, the matching is implemented using a neural-network processor.
In an embodiment, the method may further comprise: outputting a score of the repertoire and the position determined by the positioning.
In an embodiment, the method may further comprise: determining a current scene based on the position determined by the positioning; and synthesizing, corresponding to the current scene, a virtual performance animation corresponding to the current scene using an avatar pre-selected by the performer.
In an embodiment, based on there being a plurality of performance users, the performer may be a preset performance user among the plurality of performance users; and based on the matching being unsuccessful within a preset time, the performer may be switched to a next preset performance user among the plurality of performance users.
In an embodiment, based on there being a plurality of performance users, an avatar pre-selected by each user may be stored; and based on the virtual performance animation being displayed, a virtual performance animation synthesized using an avatar pre-selected by a current performer may be displayed, and based on the performer being switched, the virtual performance animation may be switched to a virtual performance animation synthesized using an avatar pre-selected by a performer switched to; or, avatars pre-selected by all the performance users may be displayed simultaneously, and a desired virtual performance animation may be synthesized.
In an embodiment, the synthesizing, corresponding to the current scene, the virtual performance animation corresponding to the current scene using the avatar pre-selected by the performer may comprise: pre-setting an animation switching position in the repertoire, and based on a performance progress of the repertoire by the performance partner reaching the animation switching position, changing the virtual performance animation; and/or based on the current digital score not being matched successfully and/or the performance error corresponding to the current digital score being greater than a set threshold, changing an avatar preset by the performer into a preset action, and synthesizing the virtual performance animation.
In an embodiment, the animation switching position may be set based on an input of a performance user, or the animation switching position is contained in the repertoire.
In an embodiment, the animation switching position may be a position of switching between different musical instruments within the cooperation part in the repertoire, and the changing the virtual performance animation may comprise displaying a virtual performance animation preset corresponding to a performance of a musical instrument switched to corresponding to the switching position between the different musical instruments.
In an embodiment, an apparatus for implementing a virtual performance partner may comprise: one or more processors configured to: collect audio frame data performed by a performer.
In an embodiment, the one or more processors may be further configured to: convert, for each piece of current audio frame data collected, the piece of current audio frame data collected into a current digital score.
In an embodiment, the one or more processors may be further configured to: match the current digital score with a range of digital scores in a repertoire.
In an embodiment, the one or more processors may be further configured to: determine a matching digital score in the range of digital scores that matches the current digital score.
In an embodiment, the one or more processors may be further configured to: position, for each piece of the current audio frame data collected, a position of the matching digital score in the repertoire.
In an embodiment, the one or more processors may be further configured to: determine a start time of playing a cooperation part of music in a next bar of the matching digital score in the repertoire for a performance partner.
In an embodiment, the one or more processors may be further configured to: determine, for each piece of the current audio frame data collected, a performance error between the performer and the performance partner based on a performance time of the current digital score and a performance time of the matching digital score.
In an embodiment, the one or more processors may be further configured to: adjust a playing speed of the performance partner for the cooperation part in the repertoire based on the performance error.
In an embodiment, a method for providing a virtual performance partner may comprise: receiving a current digital score corresponding to audio frame data.
In an embodiment, the method may further comprise: matching the current digital score with a range of digital scores in a repertoire.
In an embodiment, the method may further comprise: determining a matching digital score in the range of digital scores based on matching the current digital score.
In an embodiment, the method may further comprise: identifying a position of the matching digital score in the repertoire.
In an embodiment, the method may further comprise: identifying a start time of playing a cooperation part of music in a next bar of the matching digital score in the repertoire for a performance partner.
In an embodiment, the method may further comprise: determining a performance error between the performer and the performance partner based on a performance time of the current digital score and a performance time of the matching digital score.
In an embodiment, the method may further comprise: adjusting a playing speed of the performance partner for the cooperation part in the repertoire based on the performance error.
In an embodiment, the method may further comprise: determining, based on a position of the current digital score, the range within which a next audio frame is matched.
In an embodiment, the method may further comprise: based on the repertoire starting from a performance of the performance partner, playing, by the performance partner, a part of the repertoire prior to the performance of the performer based on a set playing speed.
In an embodiment, the performance partner may end the playing of the repertoire based on the current digital score not being matched successfully within a first set time.
The above descriptions are merely embodiments, and are not intended to limit the disclosure. Any modification, equivalent substitution or improvement made within the spirit and principles of the disclosure should be included within the protection scope of the disclosure.

Claims (15)

  1. A method for providing a virtual performance partner, the method comprising:
    collecting audio frame data performed by a performer; and
    for each piece of current audio frame data collected, performing:
    converting the piece of current audio frame data collected into a current digital score;
    matching the current digital score with a range of digital scores in a repertoire;
    determining a matching digital score in the range of digital scores that matches the current digital score;
    positioning a position of the matching digital score in the repertoire;
    determining a start time of playing a cooperation part of music in a next bar of the matching digital score in the repertoire for a performance partner;
    determining a performance error between the performer and the performance partner based on a performance time of the current digital score and a performance time of the matching digital score; and
    adjusting a playing speed of the performance partner for the cooperation part in the repertoire based on the performance error.
  2. The method of claim 1, further comprising determining, based on a position of the current digital score, the range within which a next audio frame is matched.
  3. The method of any one of claims 1 to 2, wherein the determining the start time of playing the cooperation part of music in the next bar of the matching digital score in the repertoire for the performance partner comprises:
    determining a performance speed of the performer based on the position of the matching digital score and positions of matching digital scores corresponding to the first N pieces of audio frame data in the current audio frame data;
    identifying the performance speed as a reference playing speed of the repertoire; and
    determining a start time of playing a next bar of music of the matching digital score in the repertoire for the performance partner based on the reference playing speed.
  4. The method of claim 3, wherein the adjusting the playing speed of the performance partner for the cooperation part in the repertoire based on the performance error comprises:
    based on the performance error being less than one beat, adjusting the playing speed of the performance partner within a current bar of music according to the performance error based on the reference playing speed, so that the performance partner and the performer are consistent in a performance end time of the current bar of music; and
    based on the performance error being greater than one beat, pausing playing, by the performance partner, at the current bar, and resuming with a next bar of music according to a playing time of the next bar of music.
  5. The method of any one of claims 1 to 4, further comprising:
    based on repeated segments being contained in the repertoire, receiving an input setting a performance segment, and identifying the performance segment as the range.
  6. The method of any one of claims 1 to 5, further comprising, based on the repertoire starting from a performance of the performance partner, playing, by the performance partner, a part of the repertoire prior to the performance of the performer based on a set playing speed.
  7. The method of any one of claims 1 to 6, further comprising:
    based on the repertoire transitioning from a solo part of the performer to a performance part of the performance partner:
    based on a performance speed of the performer changing, starting to play, by the performance partner, the repertoire based on a performance speed at an end of the solo part; and
    based on the performance speed of the performer staying constant, starting to play, by the performance partner, the repertoire according to a set playing speed.
  8. The method of any one of claims 1 to 7, wherein the performance partner ends the playing of the repertoire based on the current digital score not being matched successfully within a first set time.
  9. The method of any one of claims 1 to 8, further comprising:
    determining a current scene based on the position determined by the locating; and
    synthesizing a virtual performance animation corresponding to the current scene using an avatar pre-selected by the performer.
  10. The method of any one of claims 1 to 9, wherein based on there being a plurality of performance users,
    the performer is a preset performance user among the plurality of performance users; and
    based on the matching being unsuccessful within a preset time, the performer is switched to a next preset performance user among the plurality of performance users.
  11. The method of any one of claims 1 to 10, wherein, based on there being a plurality of performance users, an avatar pre-selected by each user is stored; and
    based on the virtual performance animation being displayed, a virtual performance animation synthesized using the avatar pre-selected by a current performer is displayed, and, based on the performer being switched, the virtual performance animation is switched to one synthesized using the avatar pre-selected by the performer switched to; or the avatars pre-selected by all the performance users are displayed simultaneously and a desired virtual performance animation is synthesized.
  12. The method of claim 9, wherein the synthesizing the virtual performance animation corresponding to the current scene using the avatar pre-selected by the performer comprises:
    pre-setting an animation switching position in the repertoire, and based on a performance progress of the repertoire by the performance partner reaching the animation switching position, changing the virtual performance animation; and/or
    based on the current digital score not being matched successfully and/or the performance error corresponding to the current digital score being greater than a set threshold, changing an avatar preset by the performer into a preset action, and synthesizing the virtual performance animation.
  13. The method of claim 12, wherein the animation switching position is a position of switching between different musical instruments within the cooperation part in the repertoire, and
    wherein the changing the virtual performance animation comprises displaying, at the position of switching between the different musical instruments, a preset virtual performance animation corresponding to a performance of the musical instrument switched to.
  14. A computer-readable storage medium storing a computer program that, when executed by one or more processors, causes the one or more processors to perform the method of any one of claims 1 to 13.
  15. An apparatus for implementing a virtual performance partner, the apparatus comprising:
    one or more processors configured to:
    collect audio frame data of a performance by a performer;
    convert, for each piece of current audio frame data collected, the piece of current audio frame data collected into a current digital score,
    match the current digital score with a range of digital scores in a repertoire,
    determine a matching digital score in the range of digital scores that matches the current digital score,
    locate, for each piece of the current audio frame data collected, a position of the matching digital score in the repertoire,
    determine a start time of playing a cooperation part of music in a next bar of the matching digital score in the repertoire for a performance partner,
    determine, for each piece of the current audio frame data collected, a performance error between the performer and the performance partner based on a performance time of the current digital score and a performance time of the matching digital score, and
    adjust a playing speed of the performance partner for the cooperation part in the repertoire based on the performance error.
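Purely as an illustration of the tempo-following behavior recited in claims 3 and 4 (the value of N, the beat units, and all identifiers below are assumptions introduced here, not claim language), a minimal sketch:

    # Hypothetical sketch of claims 3-4: estimate a reference playing speed from
    # the first N matched positions, correct errors smaller than one beat within
    # the current bar, and defer larger errors to the next bar.
    def reference_speed(positions_beats, times_sec, n):
        # Beats per second over the performer's first n matched frames.
        span_beats = positions_beats[n - 1] - positions_beats[0]
        span_secs = times_sec[n - 1] - times_sec[0]
        return span_beats / span_secs if span_secs > 0 else 0.0

    def react_to_error(error_beats, ref_speed, beats_left_in_bar):
        # error_beats is positive when the performer is ahead of the partner.
        if abs(error_beats) < 1.0:
            # Sub-beat error: stretch or compress playback so that partner and
            # performer reach the end of the current bar together.
            remaining = max(beats_left_in_bar - error_beats, 1e-3)
            return ("adjust_speed", ref_speed * beats_left_in_bar / remaining)
        # One beat or more: pause at the current bar and resume at the scheduled
        # playing time of the next bar.
        return ("pause_until_next_bar", ref_speed)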
PCT/KR2023/002880 2022-03-30 2023-03-02 Method and apparatus for implementing virtual performance partner WO2023191322A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US18/128,743 US20230237981A1 (en) 2022-03-30 2023-03-30 Method and apparatus for implementing virtual performance partner

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210329134.2A CN114639394A (en) 2022-03-30 2022-03-30 Method and device for realizing virtual playing partner
CN202210329134.2 2022-03-30

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/128,743 Continuation US20230237981A1 (en) 2022-03-30 2023-03-30 Method and apparatus for implementing virtual performance partner

Publications (1)

Publication Number Publication Date
WO2023191322A1 (en)

Family ID: 81951294

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2023/002880 WO2023191322A1 (en) 2022-03-30 2023-03-02 Method and apparatus for implementing virtual performance partner

Country Status (2)

Country Link
CN (1) CN114639394A (en)
WO (1) WO2023191322A1 (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6376758B1 (en) * 1999-10-28 2002-04-23 Roland Corporation Electronic score tracking musical instrument
US20160063975A1 (en) * 2013-04-16 2016-03-03 Shaojun Chu Performance method of electronic musical instrument and music
JP2015079183A (en) * 2013-10-18 2015-04-23 ヤマハ株式会社 Score alignment device and score alignment program
WO2018207936A1 (en) * 2017-05-12 2018-11-15 株式会社デザインMプラス Automatic sheet music detection method and device
CN113689836A (en) * 2021-08-12 2021-11-23 福建星网视易信息系统有限公司 Method and terminal for converting audio frequency into musical notes and displaying same

Also Published As

Publication number Publication date
CN114639394A (en) 2022-06-17

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application

Ref document number: 23781180

Country of ref document: EP

Kind code of ref document: A1