JP2018155936A

JP2018155936A - Sound data edition method

Info

Publication number: JP2018155936A
Application number: JP2017052947A
Authority: JP
Inventors: Yoshitaka Uratani (浦谷 佳孝); Takuya Fujishima (藤島 琢哉)
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2017-03-17
Filing date: 2017-03-17
Publication date: 2018-10-04

Abstract

PROBLEM TO BE SOLVED: To synchronize two or more pieces of sound data over an entire period. SOLUTION: A sound data editing method includes a step of synchronizing, by using a reference sound, first sound data based on a first performance given during reproduction of the reference sound and second sound data based on a second performance, different from the first performance, given during reproduction of the reference sound. SELECTED DRAWING: Figure 7

Description

The present invention relates to a technique for editing a plurality of pieces of sound data.

Systems that allow a plurality of geographically separated performers to play in ensemble are known. For example, Patent Document 1 describes a system in which, in a performance session started with a synchronization signal as a reference, the terminal device corresponding to each performance part generates performance data for that part. In this system, each piece of performance data is corrected based on the synchronization signal.

Patent Document 1: JP 2014-153515 A

In the technique described in Patent Document 1, two or more pieces of sound data could not be synchronized over the entire period following the timing at which synchronization was started.
In contrast, the present invention provides a technique for synchronizing two or more pieces of sound data over the entire period.

The present invention provides a sound data editing method including a step of synchronizing, using a reference sound, first sound data based on a first performance given while the reference sound was being reproduced and second sound data based on a second performance, different from the first performance, given while the reference sound was being reproduced.

In this sound data editing method, the first sound data and the second sound data may each include data of a first channel and data of a second channel, where the data of the first channel represents the reference sound and the data of the second channel represents the sound of the performance.
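The two-channel arrangement described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that the samples are stored interleaved as [channel 1, channel 2, channel 1, channel 2, ...]; the function name `split_channels` and the interleaved layout are hypothetical and are not specified in this document.

```python
from typing import List, Sequence, Tuple


def split_channels(interleaved: Sequence[float]) -> Tuple[List[float], List[float]]:
    """Split interleaved two-channel samples into (reference, performance).

    Assumption: channel 1 carries the reference sound and channel 2 carries
    the sound of the performance, stored as [c1, c2, c1, c2, ...].
    """
    if len(interleaved) % 2 != 0:
        raise ValueError("expected an even number of interleaved samples")
    reference = list(interleaved[0::2])    # every even-indexed sample
    performance = list(interleaved[1::2])  # every odd-indexed sample
    return reference, performance
```

With the reference sound isolated on its own channel, a later synchronization step could compare only that channel against the original reference sound, without interference from the performance sound.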

This sound data editing method may include a step of acquiring an input sound signal representing an input sound, and a step of adjusting the tempo of at least one of the first sound data and the second sound data based on the input sound signal.

This sound data editing method may include a step of adjusting the tempo of the first sound data to match the reference sound and a step of adjusting the tempo of the second sound data to match the reference sound, and in the synchronizing step, the tempo-adjusted first sound data and second sound data may be synchronized.

The first sound data may be sound data included in first moving image data, the second sound data may be sound data included in second moving image data different from the first moving image data, and in the synchronizing step, the first moving image data and the second moving image data may be synchronized.

According to the present invention, two or more pieces of sound data can be synchronized over the entire period.

A diagram illustrating the functional configuration of the sound data editing system 1 according to the first embodiment.
A diagram illustrating the hardware configuration of the user terminal 10.
A diagram illustrating the hardware configuration of the server 20.
A sequence chart illustrating an outline of the operation of the sound data editing system 1.
A sequence chart illustrating the operation for uploading sound data.
A diagram illustrating a UI screen for inputting an instruction to reproduce a reference sound.
A sequence chart illustrating the operation for synchronizing sound data.
A diagram illustrating a UI screen for selecting performance videos to be synchronized.
A diagram illustrating the arrangement of the performance videos in a composite video.
A diagram illustrating the functional configuration of the sound data editing system 3 according to the third embodiment.
A conceptual diagram of reverse alignment processing.
A UI screen for inputting an instruction to reproduce a reference sound in the third embodiment.
A flowchart illustrating the reverse alignment processing.
A diagram illustrating the functional configuration of the sound data editing system 4 according to the fourth embodiment.
A diagram illustrating the functional configuration of the alignment unit 16.
A flowchart illustrating the operation for alignment playback of sound data.

1. First embodiment
1-1. Configuration
FIG. 1 is a diagram illustrating the functional configuration of a sound data editing system 1 according to the first embodiment. The sound data editing system 1 provides a service (hereinafter the “sound data editing service”) that synchronizes the sound data of a plurality of performances given while the same reference sound was being reproduced. Here, a performance means acting out, dancing, playing, singing, narrating, reciting, or otherwise performing a work that involves the generation of sound that changes over time. A work is an expression of thoughts, feelings, or facts and belongs, for example, to the field of music, literature, performing arts, or scholarship. As an example, the performance here is the playing of a piece of music on a musical instrument, and the sound data is performance sound data. In this example, the sound data editing system 1 synchronizes first sound data and second sound data. The first sound data is a digitized form of the sound based on (accompanying) a first performance (an example of a first sound). The second sound data is a digitized form of the sound based on (accompanying) a second performance (an example of a second sound). The first performance and the second performance are both given while the same reference sound is being reproduced. Here, “given while the same reference sound is being reproduced” means that the reference sound reproduced when each performance is given is the same; it does not necessarily mean that the first performance and the second performance are given simultaneously. Furthermore, “the same reference sound” does not only mean that the two reference sounds are completely identical. For example, the two may be recordings of the same player playing the same piece made at different times (so-called different takes), or recordings of different players playing the same piece. The first performance is given by user A (an example of a first user), and the second performance is given by user B (an example of a second user).

The sound data editing system 1 includes user terminals 10 and a server 20. A user terminal 10 is a client of the sound data editing service and provides a user interface. Two user terminals 10 are illustrated here. When the user terminal of user A, the user terminal of user B, and their elements need to be distinguished, suffixes are used, as in user terminal 10A and user terminal 10B.

The user terminal 10 includes a storage unit 11, a playback unit 12, a recording unit 13, a communication unit 14, and a UI unit 15. The storage unit 11 stores various data. In this example, the storage unit 11 stores data Dr for reproducing the reference sound. The playback unit 12 plays back various data. In this example, the playback unit 12 plays back the data Dr. The recording unit 13 digitizes the sound accompanying a performance given while the reference sound is being reproduced, and stores it in the storage unit 11 as sound data. The communication unit 14 communicates with other devices in accordance with a predetermined communication standard. In this example, the communication unit 14 uploads the sound data of the performance to the server 20. The UI unit 15 provides a user interface through which the user inputs various instructions, such as reproducing the reference sound, starting recording, and uploading.

The server 20 is the server of the sound data editing service. The server 20 communicates with the user terminals 10 via a network (not shown) such as the Internet. The server 20 includes a communication unit 21, a storage unit 23, and a synchronization unit 24. The communication unit 21 communicates with other devices in accordance with a predetermined communication standard. In this example, the communication unit 21 receives sound data transmitted from the user terminals 10. The storage unit 23 stores the sound data received by the communication unit 21. The synchronization unit 24 synchronizes two or more pieces of sound data and synthesizes them. The sound data after synthesis is referred to as “synthesized sound data”. Here, “synthesis” of sound data refers to processing that mixes two or more sounds (mixing).
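The mixing step mentioned above can be sketched as follows. This is only an illustration under stated assumptions: the tracks are taken to be already time-aligned lists of samples, shorter tracks are padded with silence, and the samples are averaged to keep the result within the original amplitude range. The function name `mix_tracks` and the averaging rule are hypothetical choices, not details given in this document.

```python
from typing import List, Sequence


def mix_tracks(tracks: Sequence[Sequence[float]]) -> List[float]:
    """Mix already-synchronized tracks into one track by averaging samples.

    Assumption: all tracks share the same sample rate and start time.
    Tracks shorter than the longest one are treated as silent afterwards.
    """
    if not tracks:
        return []
    length = max(len(track) for track in tracks)
    mixed = [0.0] * length
    for track in tracks:
        for i, sample in enumerate(track):
            mixed[i] += sample
    # Divide by the number of tracks so the mix does not exceed the input range.
    return [value / len(tracks) for value in mixed]
```

Averaging is one simple way to avoid clipping; a real mixer might instead apply per-track gains or a limiter.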

The communication unit 21 of the server 20 distributes the synthesized sound data generated by the synchronization unit 24 to the user terminal 10. In the user terminal 10, the playback unit 12 plays back the synthesized sound data distributed from the server 20.

FIG. 2 is a diagram illustrating the hardware configuration of the user terminal 10. The user terminal 10 is a computer device having a CPU (Central Processing Unit) 101, a memory 102, a storage 103, a communication IF 104, a display 105, an input device 106, a microphone 107, a speaker 108, and a camera 109; specifically, it is, for example, a smartphone, a tablet terminal, or a personal computer. The CPU 101 is a control device that executes programs and controls the other hardware elements of the user terminal 10. The memory 102 is a main storage device that functions as a work area when the CPU 101 executes a program, and includes, for example, a RAM (Random Access Memory). The storage 103 is a nonvolatile auxiliary storage device that stores various programs and data, and includes, for example, an SSD (Solid State Drive) or an HDD (Hard Disk Drive). The communication IF 104 is an interface for communicating with other devices, and includes, for example, a NIC (Network Interface Card). The display 105 is a display device that displays information, and includes, for example, an LCD (Liquid Crystal Display). The input device 106 is a device through which the user inputs instructions or information to the user terminal 10, and includes, for example, a touch sensor or a keyboard. The microphone 107 is a device that collects sound and converts the collected sound into an electrical signal. The speaker 108 is a device that outputs sound according to an electrical signal. The camera 109 is an imaging device for shooting video.

In this example, the storage 103 stores a program (hereinafter the “client program”) for causing a computer device to function as a client in the sound data editing system 1. While the CPU 101 is executing the client program, at least one of the memory 102 and the storage 103 is an example of the storage unit 11. The speaker 108 is an example of the playback unit 12. The microphone 107 and the camera 109 are an example of the recording unit 13. The display 105 and the input device 106 are an example of the UI unit 15. The communication IF 104 is an example of the communication unit 14.

FIG. 3 is a diagram illustrating the hardware configuration of the server 20. The server 20 is a computer device having a CPU 201, a memory 202, a storage 203, and a communication IF 204. The CPU 201 is a control device that executes programs and controls the other hardware elements of the server 20. The memory 202 is a main storage device that functions as a work area when the CPU 201 executes a program, and includes, for example, a RAM. The storage 203 is a nonvolatile auxiliary storage device that stores various programs and data, and includes, for example, an SSD or an HDD. The communication IF 204 is an interface for communicating with other devices, and includes, for example, a NIC.

In this example, the storage 203 stores a program (hereinafter the “server program”) for causing a computer device to function as the server in the sound data editing system 1. The communication IF 204 controlled by the CPU 201 executing the server program is an example of the communication unit 21. The CPU 201 executing the server program is an example of the synchronization unit 24. At least one of the memory 202 and the storage 203 controlled by the CPU 201 executing the server program is an example of the storage unit 23.

1-2. Operation
FIG. 4 is a sequence chart illustrating an outline of the operation of the sound data editing system 1. In the following example, the sound data is the audio data (audio track) included in the data of a video that records a performance on a musical instrument (hereinafter a “performance video”). An example is described here in which data D[1] (an example of first moving image data) and data D[2] (an example of second moving image data), both of which are moving image data, are synchronized.

In step S1, the user terminal 10 records moving image data. In step S2, the user terminal 10 uploads the moving image data to the server 20. Although not illustrated, a plurality of user terminals 10 can access the server 20, and a plurality of pieces of moving image data are uploaded to the server 20 from these user terminals 10. In this example, the plurality of pieces of moving image data include the data D[1] and the data D[2]. The server 20 stores these pieces of moving image data.

In step S3, the user terminal 10 transmits to the server 20 a synchronization request for synchronizing a plurality of pieces of moving image data. The synchronization request includes information specifying the moving image data to be synchronized. In this example, the moving image data to be synchronized includes the data D[1] and the data D[2]. In step S4, the server 20 synchronizes the plurality of pieces of moving image data designated by the synchronization request. The data generated by the synchronization processing is referred to as synchronized data. In step S5, the server 20 transmits the synchronized data to the user terminal 10. In step S6, the user terminal 10 plays back the synchronized data. These processes are described in detail below. In the following, functional elements such as the UI unit 15 are described as the subject of processing; this means that a hardware element such as the CPU 101 executing software such as the client program performs the processing using other hardware elements.

1-2-1. Uploading sound data
FIG. 5 is a flowchart illustrating the operation for uploading sound data. The flow in FIG. 5 is started, for example, when the user instructs the user terminal 10 to start recording a performance (recording a performance video).

In step S11, the UI unit 15 accepts the selection of a reference sound. The reference sound is a sound that indicates the progress of the performance (in this example, the playing of a musical instrument). It is, for example, the sound of the piece of music itself (for example, a sound imported from a CD or the like), the performance sound of some part of the piece (for example, the drum part), or the performance sound of the piece with some part (for example, the vocal part) removed. In other words, the reference sound is not something musically meaningless, such as a click or beep that is simply repeated periodically, but a sound that itself constitutes at least part of the piece of music. At least one piece of reference sound data is stored in at least one of the storage unit 11 and the storage unit 23 (server 20). The UI unit 15 searches the storage unit 11 and the storage unit 23 and generates a list of reference sounds. The UI unit 15 displays the list of reference sounds and prompts the user to select one. The user selects the desired reference sound from the displayed list. On accepting the user's selection, the UI unit 15 notifies the playback unit 12 of the selected reference sound.

In step S12, the recording unit 13 starts recording a performance video. Recording of the performance video is started, for example, when the user inputs a recording start instruction via the UI unit 15. The performance video includes the video shot by the camera 109 and the audio collected by the microphone 107. The performance video is recorded in a predetermined data format (for example, a general-purpose video format).

In step S13, the playback unit 12 starts reproducing the designated reference sound. The playback unit 12 reads the data of the designated reference sound from the storage unit 11 or the storage unit 23 and decodes the read data to generate a sound signal. The playback unit 12 outputs the generated sound signal to the speaker 108. The reference sound is thus output sequentially from the speaker 108. In this example, because the reference sound is output from the speaker 108, the audio track of the performance video is recorded with the performance sound superimposed on the reference sound (what is superimposed is audio, not video).

FIG. 6 is a diagram illustrating a UI screen for inputting an instruction to reproduce a reference sound. This UI screen includes a window 501, a button 502, and a window 505. The window 501 is an area that displays identification information (for example, the title) of the reference sound to be reproduced. The button 502 is a button for instructing reproduction or pausing of the reference sound. When the button 502 is pressed while playback is stopped, reproduction of the reference sound starts. When the button 502 is pressed while the sound is being played, reproduction of the reference sound is paused.

This UI screen further includes a button 506. The button 506 is a button for stopping the recording of the performance video. While the reference sound is being reproduced, the user listens to the reference sound and plays the instrument along with it. The recording unit 13 records the video of the user playing the instrument and the performance sound together with the reference sound. The UI screen of FIG. 6 is displayed while the reference sound is being reproduced, and the user can stop recording the performance video at any time.

Refer again to FIG. 5. In step S14, the recording unit 13 stops recording the performance video. Recording of the performance video stops, for example, when the user presses the button 506. The processing in steps S11 to S14 is the detail of the processing in step S1 of FIG. 4. In step S2, the communication unit 14 transmits (uploads) the moving image data of the performance video generated by the recording unit 13 to the server 20. When the reference sound data was acquired from the storage unit 11 rather than the storage unit 23, the communication unit 14 may transmit the reference sound data to the server 20 in addition to the moving image data. The storage unit 23 of the server 20 stores the reference sound data.

The communication unit 21 of the server 20 receives the moving image data from the user terminal 10. The storage unit 23 stores the moving image data received by the communication unit 21. In the storage unit 23, the moving image data is stored in association with attribute information indicating the attributes of the video. The attribute information includes, for example, identification information of the reference sound, identification information of the performance sound, identification information of the video creator, and the upload date and time. The attribute information associated with the moving image data is not limited to this example. For example, the attribute information need not include all of these items and may include only one or some of them. The identification information of the reference sound includes, for example, the title of the piece of music. The identification information of the performance sound includes, for example, the name of the instrument that produced the performance sound. These items are input by the user via the UI unit 15, for example.
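One way to picture the association between moving image data and its attribute information is the following sketch. The class names `VideoAttributes` and `StoredVideo`, the field names, and the helper `videos_for_reference` are all hypothetical; the document only lists the kinds of information that may be included and states that any subset is allowed, which is why every attribute is optional here.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class VideoAttributes:
    # Every field is optional because the attribute information may
    # contain only one or some of these items.
    reference_sound_id: Optional[str] = None    # e.g. the title of the piece
    performance_sound_id: Optional[str] = None  # e.g. the instrument name
    creator_id: Optional[str] = None
    uploaded_at: Optional[datetime] = None


@dataclass
class StoredVideo:
    video_id: str
    attributes: VideoAttributes = field(default_factory=VideoAttributes)


def videos_for_reference(videos: List[StoredVideo], reference_sound_id: str) -> List[StoredVideo]:
    """Return the stored videos recorded against the given reference sound."""
    return [v for v in videos if v.attributes.reference_sound_id == reference_sound_id]
```

Filtering by the reference sound identifier is one plausible use of the attribute information, since only performances given against the same reference sound are candidates for synchronization.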

1-2-2. Synchronization (synthesis) of sound data
FIG. 7 is a flowchart illustrating the operation for synchronizing sound data (an example of a sound data editing method). The flow in FIG. 7 corresponds to the processing in steps S4 to S5 of FIG. 4. First, with regard to step S3, matters related to the processing in which the user terminal 10 transmits to the server 20 a synchronization request for synchronizing a plurality of pieces of moving image data are described.

FIG. 8 is a diagram illustrating a UI screen for selecting the performance videos to be synchronized. This UI screen includes a window 601, a text box 602, a button 603, and a button 604. The window 601 is an area for displaying a list of the performance videos stored in the storage unit 23. In this example, a thumbnail image and the title of each performance video are displayed. The text box 602 is an area for entering a search key, and the button 603 is a button for instructing a search. When a search is executed, the window 601 displays a list of the performance videos included in the search results. In the window 601, the performance videos selected by the user are displayed so as to be distinguishable from those not selected. In this example, the performance video “I tried playing Sakura Sakura on guitar” and the performance video “I tried playing Sakura Sakura on bass” are selected. The button 604 is a button for instructing synchronization of the videos. When the user presses the button 604, the communication unit 14 transmits to the server 20 a synchronization request including information specifying the videos to be synchronized (step S3).

再び図７を参照して説明する。ステップＳ４１において、サーバ２０の同期部２４は、同期要求により指定される複数の動画データを同期するためのパラメーターを計算する。このパラメーターの計算は、参照音をキーとして行われる。具体的に、同期部２４は、データＤｒ（参照音）と同期対象のデータＤ［ｉ］との相互相関Ｃｒｉを最大にする時間差τを次式（１）に従って計算する。この例において、時間差τは複数の動画データを同期するためのパラメーターの一例である。

ここで、音信号ｙｒはデータＤｒに含まれる音信号であり、音信号ｙｉはデータＤ［ｉ］に含まれる音データにより示される音信号である。式（１）は、音信号ｙｒ（ｔ）の始点と音信号ｙｉ（ｔ）の始点とを時間領域において一致させてから、音信号ｙｒ（ｔ）に対する音信号ｙｉ（ｔ）の時間差（時間軸上のシフト量）τを変数として両者間の信号波形の相関の程度を示した数値列を示す。 A description will be given with reference to FIG. 7 again. In step S41, the synchronization unit 24 of the server 20 calculates a parameter for synchronizing a plurality of moving image data specified by the synchronization request. This parameter is calculated using the reference sound as a key. Specifically, the synchronization unit 24 calculates a time difference τ that maximizes the cross-correlation Cri between the data Dr (reference sound) and the synchronization target data D [i] according to the following equation (1). In this example, the time difference τ is an example of a parameter for synchronizing a plurality of moving image data.

Here, the sound signal yr is a sound signal included in the data Dr, and the sound signal yi is a sound signal indicated by the sound data included in the data D [i]. Equation (1) is obtained by matching the start point of the sound signal yr (t) and the start point of the sound signal yi (t) in the time domain, and then the time difference (time) of the sound signal yi (t) with respect to the sound signal yr (t). A numerical sequence showing the degree of correlation of signal waveforms between the two using the amount of shift on the axis τ as a variable.

The synchronization unit 24 calculates the time difference τ from the data Dr by equation (1) for every data D[i] to be synchronized. The positional relationship in the time domain between two data D to be synchronized is obtained from their time differences from the data Dr. For example, from the time difference τ[r,1] between the data Dr and the data D[1] and the time difference τ[r,2] between the data Dr and the data D[2], the time difference τ[1,2] between the data D[1] and the data D[2] is obtained by the following equation (2).
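The calculation in step S41 can be sketched as follows. This is a minimal illustration, assuming mono signals at a common sampling rate and the convention that a positive lag means the reference sound appears later in D[i] than in Dr. The function names `estimate_lag` and `pairwise_lag` are hypothetical, and the difference τ[1,2] = τ[r,2] − τ[r,1] is one plausible reading of equation (2), whose image is not reproduced in this text.

```python
import numpy as np

def estimate_lag(y_ref, y_i):
    """Return the lag (in samples) that maximizes the cross-correlation
    between the reference signal y_ref and the signal y_i.
    A positive value means y_ref appears that many samples later in y_i."""
    # np.correlate(a, v, "full")[k] corresponds to lag k - (len(v) - 1)
    c = np.correlate(y_i, y_ref, mode="full")
    return int(np.argmax(c)) - (len(y_ref) - 1)

def pairwise_lag(lag_r1, lag_r2):
    """Lag of D[2] relative to D[1], derived from their lags to the reference
    (assumed sign convention)."""
    return lag_r2 - lag_r1
```

For two recordings whose lags to the reference are 30 and 75 samples, `pairwise_lag(30, 75)` gives 45: the second recording is 45 samples behind the first.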

In step S42, the synchronization unit 24 determines the positional relationship of the images on the screen for synchronizing the images of the plurality of performance videos. In this example, the screen of the synchronized video (hereinafter referred to as the "composite video") is divided into n regions (n is a natural number equal to or greater than the number of performance videos to be synchronized), and the image of each performance video to be synchronized is displayed in one of these regions. The synchronization unit 24 determines the number of divisions n according to a predetermined algorithm and assigns a performance video to each region.

FIG. 9 illustrates the arrangement of the performance videos in the composite video. In this example, the screen is divided in two along the vertical direction. The image of "I tried playing 'Sakura Sakura' on guitar" is assigned to the region on the left as viewed facing the screen, and the image of "I tried playing 'Sakura Sakura' on bass" is assigned to the region on the right.
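The publication does not specify the "predetermined algorithm" used in step S42. As one hypothetical possibility, a near-square grid could be used, as in the sketch below; the function name `grid_layout` and the grid rule are assumptions for illustration only.

```python
import math

def grid_layout(num_videos, width, height):
    """Return (n, regions): n is the number of screen regions (n >= num_videos)
    and regions holds one (x, y, w, h) rectangle per video, in order."""
    if num_videos < 1:
        raise ValueError("at least one video is required")
    cols = math.ceil(math.sqrt(num_videos))   # number of columns
    rows = math.ceil(num_videos / cols)       # number of rows
    n = cols * rows
    cell_w, cell_h = width // cols, height // rows
    regions = [((i % cols) * cell_w, (i // cols) * cell_h, cell_w, cell_h)
               for i in range(num_videos)]
    return n, regions
```

For two videos on a 1280x720 screen this yields a left and a right region, which matches the arrangement described for FIG. 9.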

The description again returns to FIG. 7. In step S43, the synchronization unit 24 determines the positional relationship of the sound images for synchronizing the audio of the performance videos. In this example, the synchronization unit 24 does not perform processing that changes the sound image localization.

In step S44, the synchronization unit 24 reads (that is, acquires) the video data to be synchronized from the storage unit 23. In this example, the synchronization unit 24 reads (that is, acquires) user A's performance video data and user B's performance video data. The audio track of user A's performance video data is an example of first sound data based on a performance given during reproduction of the reference sound, and the audio track of user B's performance video data is an example of second sound data based on a different performance given during reproduction of the reference sound.

In step S45, the synchronization unit 24 generates the data of the composite video. The details are as follows. The synchronization unit 24 determines the position of each of the plurality of video data in the time domain by applying the time differences calculated in step S41. The synchronization unit 24 also synchronizes the images according to the screen arrangement determined in step S42, and generates the audio track according to the sound image positions determined in step S43. Steps S41 to S45 are the details of step S4 in FIG. 4. In step S5, the communication unit 21 transmits the data of the composite video generated by the synchronization unit 24 to the user terminal 10 that sent the synchronization request.
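The audio part of step S45 can be sketched as follows, assuming mono tracks and lags given in samples, where a positive lag means the reference sound appears later in that track. The function name `mix_with_offsets` and the equal-weight averaging are assumptions; the publication does not specify how the tracks are mixed.

```python
import numpy as np

def mix_with_offsets(tracks, lags):
    """Place each track on a common timeline so that the reference sound
    lines up across tracks, then average them."""
    base = max(lags)
    # A track that lags less must start later on the common timeline.
    starts = [base - lag for lag in lags]
    length = max(s + len(t) for s, t in zip(starts, tracks))
    out = np.zeros(length)
    for s, t in zip(starts, tracks):
        out[s:s + len(t)] += np.asarray(t, dtype=float)
    return out / len(tracks)
```

The same start offsets would apply to the video frames of each performance video, converted from samples to frame times.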

According to this embodiment, sound data of performances given while a reference sound designated by the user is being reproduced can be synchronized with a simple operation.

1-2-3. Reproduction of sound data
The communication unit 14 of the user terminal 10 receives (that is, downloads) the data of the composite video. The playback unit 12 plays back the downloaded composite video. This processing corresponds to step S6 in FIG. 4. A well-known technique is used to play back the video data.

2. Second Embodiment
In the first embodiment, each item of sound data subjected to the synchronization process is one in which the performance sound is superimposed on the reference sound. When such sound data are synchronized, it is difficult to exclude the reference sound and synchronize only the performance sounds. The second embodiment addresses this problem.

In the second embodiment, the reference sound and the performance sound are separated in the sound data to be synchronized. Specifically, in two-channel stereo sound data, the sound signal of the reference sound is recorded in the left channel (an example of a first channel), and the sound signal of the performance sound is recorded in the right channel (an example of a second channel). One way for the recording unit 13 of the user terminal 10 to record the reference sound and the performance sound separately is for the user terminal 10 to output the reference sound through headphones (not shown) rather than through the speaker 108. The user performs while listening to the reference sound through the headphones. At this time, the recording unit 13 records the sound signal of the reference sound being played back (output) in the left channel, and records the sound signal of the performance sound input through the microphone 107 in the right channel. Another way to record the reference sound and the performance sound separately is to record the performance sound without the microphone 107, for example through a so-called line input. In this case, the reference sound may be output through the speaker 108, or through headphones rather than the speaker 108.

In the second embodiment, in step S41, the synchronization unit 24 uses, as the sound signal yi in equation (1), the sound signal recorded in the left channel of the data D to be synchronized (that is, the sound signal of the reference sound). In the first embodiment, the time difference τ between the data D and the data Dr is calculated using the cross-correlation between a sound signal in which the reference sound and the performance sound are superimposed and the sound signal of the data Dr (the reference sound). Depending on circumstances such as the balance between the performance sound and the reference sound in the data D, the time difference τ between the data D and the data Dr could not always be calculated accurately. In this example, however, the time difference τ is calculated using the cross-correlation between the sound signal of the reference sound in the data D, which contains no performance sound, and the sound signal of the data Dr. The time difference τ can therefore be calculated more accurately; that is, the plurality of data D can be synchronized more accurately.

Further, in the second embodiment, in step S45, of the audio data included in the data D, the sound signals recorded in the right channel (the sound signals of the performance sounds) are combined. The sound signals of the left channel (the sound signals of the reference sound) are not combined. Conventionally, people have recorded performance videos in which a musical instrument is played along with a piece of music (the reference sound) playing in the background, and have published such videos on video posting sites on the Internet. People have also combined their own performance video with a performance video uploaded by someone else to create and publish a video in which the two appear to be playing in ensemble. In these performance videos, however, only audio in which the reference sound and the performance sound are mixed is recorded, so the reference sound becomes noise when the video is later combined with another person's performance video. In this embodiment, by contrast, the audio track of the composite video contains only the performance sounds and no reference sound. Therefore, a plurality of sound data can be synchronized and played back with less noise.

The specific examples of the first channel, which records the sound signal of the reference sound, and the second channel, which records the sound signal of the performance sound, are not limited to those in the embodiment described above. For example, of the two stereo channels, the sound signal of the performance sound may be recorded in the left channel and the sound signal of the reference sound in the right channel. In another example, in an audio system with three or more channels (for example, 5.1-channel surround), the sound signal of the reference sound may be recorded in one channel and the sound signal of the performance sound in another channel.
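The second embodiment's use of the two channels can be sketched as follows: the left channel (reference sound) is used only to estimate the lag, and only the right channels (performance sounds) are mixed. This is an illustration under assumptions: each recording is an array of shape (num_samples, 2), all recordings share one sampling rate, and the names `lag_to_reference` and `mix_performances` are hypothetical.

```python
import numpy as np

def lag_to_reference(y_ref, y):
    """Lag (in samples) of y relative to y_ref that maximizes cross-correlation."""
    c = np.correlate(y, y_ref, mode="full")
    return int(np.argmax(c)) - (len(y_ref) - 1)

def mix_performances(y_ref, recordings):
    """Align recordings on their left (reference) channels and mix only
    their right (performance) channels."""
    lefts = [np.asarray(r, dtype=float)[:, 0] for r in recordings]
    rights = [np.asarray(r, dtype=float)[:, 1] for r in recordings]
    lags = [lag_to_reference(y_ref, left) for left in lefts]
    base = max(lags)
    starts = [base - lag for lag in lags]
    length = max(s + len(r) for s, r in zip(starts, rights))
    out = np.zeros(length)
    for s, r in zip(starts, rights):
        out[s:s + len(r)] += r   # the reference channel is never added
    return out / len(rights)
```

Because the left channel never enters the mix, the result contains no reference sound, which is the effect the embodiment describes.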

3. Third Embodiment
FIG. 10 illustrates the functional configuration of a sound data editing system 3 according to the third embodiment. Description of matters that the sound data editing system 3 has in common with the sound data editing system 1 according to the first embodiment is omitted. In the sound data editing system 3, the server 20 has a reverse alignment unit 22. The reverse alignment unit 22 performs a reverse alignment process on the sound data received by the communication unit 21. As described in detail later, the reverse alignment process is a process that normalizes the tempo of the performance sound represented by the sound data. The storage unit 23 stores the sound data that has undergone the reverse alignment process.

FIG. 11 is a conceptual diagram of the reverse alignment process. It conceptually shows the reference sound RA and the performance sound SA included in user A's performance video, together with the reference sound R0 played back at the normal tempo. The tempo of the reference sound is also shown. In this figure, the horizontal axis represents real time. Each dot drawn for a reference sound represents a unit of time in the score of the reference sound (for example, the time corresponding to a 128th note). In the reference sound R0 played back at the normal tempo, the dots are arranged uniformly at equal intervals, and the unit time in the score advances uniformly. That is, the tempo of the reference sound is constant at T0 throughout the piece. In contrast, the reference sound RA that user A used to record the performance video was played back at a tempo T1, slower than normal, during part of the piece (period D1), and at the normal tempo T0 during the other periods. The reverse alignment process adjusts the tempo of the performance video so that it is synchronized, throughout the piece, with the data serving as the standard (in this example, the reference sound R0 played back at the normal tempo). In this example, the part of user A's performance video in period D1 is compressed in the time domain so that the reference sound RA in period D1 has the same tempo as the reference sound R0.

FIG. 12 illustrates a UI screen for entering a reference sound playback instruction in the third embodiment. In this example, the playback unit 12 of the user terminal 10 can change the playback speed of the reference sound. This UI screen includes a window 501, a button 502, a button 503, a button 504, and a window 505. The button 503 is a button for increasing the playback speed, and the button 504 is a button for decreasing it. The playback speed is specified, for example, uniformly for the entire piece of the reference sound; for example, the playback unit 12 plays the reference sound at a constant 0.8 times the normal speed. Alternatively, the playback unit 12 plays only a previously designated part of the reference sound at a speed different from the rest (for example, bars 35 to 42 at 0.8 times the normal speed and the other parts at normal speed). In yet another example, the playback unit 12 may change the playback speed dynamically in response to an instruction entered during playback of the reference sound. For example, when the button 503 is pressed during playback of the reference sound, the playback unit 12 increases the playback speed from that point on.

The sound data uploaded from the user terminal 10 is not stored as is; it is stored in the storage unit 23 after the reverse alignment unit 22 has applied the reverse alignment process to it.

FIG. 13 is a flowchart illustrating the reverse alignment process. In step S221, the reverse alignment unit 22 identifies the time series of the tempo of the reference sound in the performance video. A technique is known that, given that the piece is known and its score is provided in advance, estimates from the sound of a piece being performed in real time which part of the piece is currently being played. This technique includes, for example, the following processing. The reverse alignment unit 22 first divides the waveform of the reference sound recorded in the left channel of the audio track of the performance video into a plurality of periods (frames) and applies a constant-Q transform to obtain a frequency spectrogram. From this spectrogram, the reverse alignment unit 22 extracts onset times (the times at which sounds start) and pitches. The reverse alignment unit 22 sequentially estimates the posterior distribution of the current state by delayed decision and, when the peak of the posterior distribution passes a position regarded as an onset on the score, outputs a Laplace approximation of the posterior distribution and several statistics. Specifically, on detecting the n-th event in the music data, the reverse alignment unit 22 outputs the time T[n] at which the event was detected, and the mean position and variance on the score indicated by the posterior distribution. The mean position on the score is the estimate of the sounding position u[n], and the variance is the estimate of the observation noise q[n]. Details of the estimation of the sounding position are given, for example, in JP 2015-79183 A.

For each note included in the reference sound (hereinafter referred to as the "target note"), the reverse alignment unit 22 obtains the relative tempo at the time the target note sounds by comparing the real-time interval between the target note and at least one preceding or following note with a standard time interval (for example, the real-time interval between the target note and the other note when the piece of the reference sound is played at a constant tempo). The reverse alignment unit 22 obtains the time series of the tempo by performing this processing for every note included in the reference sound.
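The comparison of actual and standard note intervals can be sketched as follows. This is a simplified illustration: it assumes onset times and score positions are already known for each note, uses only the interval to the following note (the preceding note for the last one), and the name `relative_tempo` and the parameter `nominal_seconds_per_unit` are hypothetical.

```python
def relative_tempo(onset_times, score_positions, nominal_seconds_per_unit):
    """Return the relative tempo at each note (1.0 = standard tempo,
    below 1.0 = slower than standard). Requires at least two notes."""
    n = len(onset_times)
    if n < 2:
        raise ValueError("at least two notes are required")
    tempos = []
    for k in range(n):
        # Use the following note, or the preceding note for the last one.
        a, b = (k, k + 1) if k + 1 < n else (k - 1, k)
        actual = onset_times[b] - onset_times[a]
        nominal = (score_positions[b] - score_positions[a]) * nominal_seconds_per_unit
        tempos.append(nominal / actual)
    return tempos
```

A value of 0.8 at a note would correspond to the reference sound having been played at 0.8 times the normal speed around that note.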

In step S222, the reverse alignment unit 22 sequentially identifies, from among the plurality of periods, the period to be processed (hereinafter referred to as the "target period"). In step S223, the reverse alignment unit 22 normalizes the tempo of the target period according to a predetermined standard. To keep the pitch constant, the tempo is adjusted not by simply stretching or shrinking the waveform along the time axis but by a so-called time-stretch technique (sometimes called a time expander or time compressor). As the time-stretch technique, for example, a technique is used that divides the waveform of the composite sound into a plurality of blocks and adjusts the tempo by arranging the blocks at shifted positions in the time domain. Crossfading is used to reduce noise.
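The block-based time stretch with crossfading can be sketched with a naive overlap-add (OLA) scheme, shown below. This is an illustration only, not the method of the publication: the function name and the frame and hop sizes are assumptions, and practical implementations typically use WSOLA or a phase vocoder to reduce artifacts.

```python
import numpy as np

def ola_time_stretch(x, rate, frame=1024, hop_out=256):
    """Naive overlap-add time stretch of a mono signal.
    rate > 1 shortens the signal (faster tempo); rate < 1 lengthens it.
    Blocks are read every hop_out * rate samples and written every hop_out
    samples; the Hann window crossfades neighbouring blocks."""
    x = np.asarray(x, dtype=float)
    hop_in = max(1, int(round(hop_out * rate)))
    window = np.hanning(frame)
    n_frames = max(1, 1 + (len(x) - frame) // hop_in)
    out_len = (n_frames - 1) * hop_out + frame
    out = np.zeros(out_len)
    norm = np.zeros(out_len)
    for k in range(n_frames):
        seg = x[k * hop_in:k * hop_in + frame]
        if len(seg) < frame:                      # zero-pad a short last block
            seg = np.pad(seg, (0, frame - len(seg)))
        out[k * hop_out:k * hop_out + frame] += seg * window
        norm[k * hop_out:k * hop_out + frame] += window
    norm[norm < 1e-8] = 1.0                       # avoid division by zero at the edges
    return out / norm
```

To compress period D1 of FIG. 11 from tempo T1 back to T0, the rate would be T0/T1 for the samples in that period and 1.0 elsewhere.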

In step S224, the reverse alignment unit 22 determines whether processing has been completed for all periods included in the reference sound. If it determines that there is a period for which processing has not been completed (S224: NO), the reverse alignment unit 22 returns to step S222. If it determines that processing has been completed for all periods (S224: YES), the reverse alignment unit 22 ends the process. The reverse alignment unit 22 thus obtains a waveform of the performance sound whose tempo has been adjusted.

Although a detailed description is omitted here, the reverse alignment unit 22 also adjusts the tempo of the images of the performance video so that they remain synchronized with the performance sound.

Conventionally, people have recorded performance videos in which a musical instrument is played along with a piece of music (the reference sound) playing in the background, and have published such videos on video posting sites on the Internet. People have also synchronized their own performance video with a performance video uploaded by someone else to create and publish a video in which the two appear to be playing in ensemble. For these performance videos, however, the reference sound used by user A and the reference sound used by user B had to be played back at the same tempo. When, for example, the output of a music sequencer is used as the reference sound, the user can set the tempo of the piece freely. A performer who cannot play well at the original tempo may be able to play if the tempo is lowered. However, once the tempo is changed from that of the original piece, the video can no longer be synchronized with other users' videos. In this embodiment, by contrast, the performance video is stored after being adjusted to the standard tempo. It can therefore be synchronized with other users' performance videos regardless of the tempo of the reference sound that was being played back when the performance video was recorded.

4. Fourth Embodiment
FIG. 14 illustrates the functional configuration of a sound data editing system 4 according to the fourth embodiment. Description of matters that the sound data editing system 4 has in common with the sound data editing system 1 according to the first embodiment is omitted. In the sound data editing system 4, the user terminal 10 has an alignment unit 16. The alignment unit 16 dynamically adjusts the tempo at which the composite sound data is played back according to information input from outside.

FIG. 15 illustrates the functional configuration of the alignment unit 16. The alignment unit 16 has a function for alignment playback of the composite sound data. Alignment playback means playing back the composite sound data while dynamically adjusting the timing or tempo according to an input signal (for example, the sound signal of an instrument being played in real time).

The alignment unit 16 has an input unit 161, an estimation unit 162, a prediction unit 163, and an output unit 164. The input unit 161 receives the input signal. The estimation unit 162 analyzes the input signal and estimates which position in the score is currently being played. It is assumed to be known which piece is being performed in the input signal. Data representing the score of this piece is stored in the storage unit 11. The prediction unit 163 predicts the next playback timing of the composite sound data, using the estimates supplied by the estimation unit 162 as observations. The output unit 164 outputs, to the playback unit 12, a playback command for the period to be sounded next, according to the predicted time received from the prediction unit 163.

FIG. 16 is a flowchart illustrating the operation for alignment playback of the synchronized sound data. The flow in FIG. 16 starts, for example, when the user instructs the user terminal 10 to start alignment playback. In step S61, the input unit 161 starts accepting the input sound. That is, the input unit 161 acquires the input sound signal.

In step S62, the estimation unit 162 estimates the score position. In step S63, the prediction unit 163 predicts the tempo of the performance. The tempo of the performance is predicted from the progression of the score position against real time. In step S64, the prediction unit 163 predicts the next playback timing. The output unit 164 outputs, to the playback unit 12, a playback command for the period to be sounded next, according to the predicted time received from the prediction unit 163 (step S65). The processing of steps S62 to S64 is repeated periodically.

The composite sound data is divided into a plurality of periods. The division into periods is based on events in the score of the corresponding reference sound. Specifically, the composite sound data is divided, for example, at positions corresponding to the bars of the reference sound. The playback command output from the output unit 164 includes information specifying the period of the composite sound data to be played back and the tempo. The playback unit 12 plays back the specified period at the specified tempo. The tempo is adjusted using, for example, a time-stretch technique.
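Steps S63 and S64 can be sketched as follows. The publication does not state how the prediction is made; this sketch swaps in an ordinary least-squares line fit of score position against real time as a simple stand-in, and the function names are hypothetical.

```python
def fit_tempo(times, positions):
    """Least-squares fit of positions = tempo * times + intercept.
    tempo is in score units per second."""
    n = len(times)
    mean_t = sum(times) / n
    mean_p = sum(positions) / n
    var_t = sum((t - mean_t) ** 2 for t in times)
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(times, positions))
    tempo = cov / var_t
    return tempo, mean_p - tempo * mean_t

def predict_time(times, positions, target_position):
    """Predict the real time at which the performer reaches target_position
    (for example, the start of the next bar)."""
    tempo, intercept = fit_tempo(times, positions)
    if tempo <= 0:
        raise ValueError("tempo must be positive to predict a future time")
    return (target_position - intercept) / tempo
```

In use, `times` and `positions` would hold the most recent estimates from the estimation unit 162, and the predicted time and fitted tempo would go into the playback command for the next period.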

According to this example, the composite sound (composite video) can be played back in synchronization with, for example, a real-time instrumental performance. Even if fluctuations in tempo or shifts in timing arise in the real-time performance because of musical expression or playing technique, the composite sound is played back so as to follow the real-time performance. According to this example, the user can have a simulated ensemble experience in real time, playing an instrument with the composite sound as accompaniment.

5. Modifications
The present invention is not limited to the embodiments described above, and various modifications are possible. Two or more of the embodiments described above may be used in combination. In addition, at least one of the following modifications may be applied to each of the embodiments described above or to a combination of two or more of them.

5-1. Modification 1
In step S43, the synchronization unit 24 may adjust the localization of the sound signals included in the data D to be synchronized. For example, the synchronization unit 24 determines the sound image position of each performance sound according to a predetermined algorithm. Alternatively, the synchronization unit 24 determines the sound image position of each performance sound according to a combination of sound image positions that the user selects from templates prepared in advance (for example, guitar on the left, keyboard on the right, and bass, drums, and vocals in the center). As a further alternative, the synchronization unit 24 determines the sound image position of each performance sound according to the user's instructions.

5-2. Modification 2
The synchronization process in the synchronization unit 24 is not limited to the one illustrated in the embodiments. The specific method of calculating the time difference τ, which is a parameter for synchronizing two data D, is not limited to one that uses equation (1) or (2). For example, the time difference that maximizes another feature, such as MFCC (Mel-Frequency Cepstrum Coefficients) or PCP (Pitch Class Profile), may be used as the parameter for synchronizing the two data D. In another example, the synchronization unit 24 may adjust the relative volume of each performance sound. The volume is adjusted, for example, according to an attribute of the performance video. The attribute of the performance video is, for example, an attribute of the user who created it (the performer in the performance video). Specifically, if the performer's skill level is low compared with other users, the performance sound of that performance video is mixed at a lower volume than the other users' performance sounds. In another example, the attribute of the performance video is an attribute of the instrument played. Specifically, the performance sound of a particular instrument is mixed at a lower volume than those of the other instruments.

5-3. Modification 3
When synchronizing the performance video of user A with the performance video of user B, the synchronization unit 24 may replace the performance sound with other sound data. The sound data used for the replacement is obtained, for example, by extracting the part included in the performance video from the sound data of the original song being played. For example, if user A's performance video is a performance of the guitar part of a song, the sound of the guitar part is extracted from the sound data of the original song and substituted for user A's performance sound. According to this example, the synthesized moving image can show so-called miming, in which the performer appears to play along to the substituted sound.

5-4. Modification 4
The synchronization unit 24 may determine the sound image positions in the synthesized moving image according to the arrangement of the video images. For example, when synchronizing four performance videos in which the vocal, guitar, bass, and drum parts are respectively performed, if the vocal video is placed at the front of the screen, the guitar video on the right of the screen, the bass video on the left of the screen, and the drum video at the back of the screen, the vocal sound image may be localized at the front, the guitar sound image on the right, the bass sound image on the left, and the drum sound image at the back. According to this example, it is possible to provide a synthesized moving image in which the positional relationship between the video images and the sound images is consistent.
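One possible way to derive sound image positions from the video layout is sketched below. The text does not specify a localization method, so everything here is an assumption: the coordinate convention, the function name `stereo_gains`, the constant-power pan law for left and right, and the use of simple attenuation as a stand-in for front-to-back placement.

```python
import math


def stereo_gains(x, depth):
    """Map a screen position to (left, right) channel gains.

    x:     horizontal position, -1.0 (screen left) to 1.0 (screen right)
    depth: 0.0 (front of the screen) to 1.0 (back of the screen)
    """
    # Constant-power pan: left^2 + right^2 stays 1 before attenuation.
    angle = (x + 1.0) * math.pi / 4.0
    left, right = math.cos(angle), math.sin(angle)
    # Illustrative depth cue only: parts placed further back are quieter.
    attenuation = 1.0 - 0.5 * depth
    return left * attenuation, right * attenuation


# Hypothetical layout following the example in the text.
layout = {
    "vocal": (0.0, 0.0),    # front
    "guitar": (1.0, 0.5),   # right
    "bass": (-1.0, 0.5),    # left
    "drums": (0.0, 1.0),    # back
}
gains = {part: stereo_gains(x, depth) for part, (x, depth) in layout.items()}
```

Each part's signal would then be multiplied by its pair of gains before the parts are summed into the left and right output channels.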

5-5. Modification 5
The data that is synchronized and synthesized using the reference sound is not limited to video data and sound data. For example, a control signal used for video or audio synthesis (mixing), a control signal for video switching, a control signal for controlling the generation of image objects in CG, or a control signal for stage lighting may be synchronized and synthesized.

5-6. Modification 6
The sound data editing system 1 may provide an SNS (Social Networking Service) for supplying performance videos as material to be synchronized. In the SNS, users are grouped by, for example, performance proficiency (level), musical preference, performance tendency, type of instrument, or the phrase to be played.

5-7. Modification 7
The division of functions between the user terminal 10 and the server 20 is not limited to that illustrated in FIG. 1, FIG. 10, or FIG. 14. Some of the functions implemented in the user terminal 10 in FIG. 1, FIG. 10, or FIG. 14 may be implemented in the server 20, and some of the functions implemented in the server 20 may be implemented in the user terminal 10. As an example, the user terminal 10 may have a function corresponding to the reverse alignment unit 22. In this case, the user terminal 10 performs the reverse alignment processing on the performance video and transmits the data of the performance video with a normalized tempo to the server 20. In another example, the user terminal 10 may have a function corresponding to the synchronization unit 24. In this case, the user terminal 10 acquires the plurality of performance videos to be synchronized and synthesizes them. In yet another example, the server 20 may have a function corresponding to the alignment unit 16. In this case, the server 20 acquires the input sound via the user terminal 10 and performs processing for reproducing the synthesized sound according to the input sound.

5-8. Modification 8
Some of the functions of the sound data editing systems 1 to 4 may be omitted. For example, the sound data editing system 1 need not have the reverse alignment unit 22. In this case, the performance video recorded in the user terminal 10 is stored in the server 20 as it is, without reverse alignment processing. When the reverse alignment processing is not performed, a performance video recorded while part of the reference sound in the time domain was reproduced at a tempo different from normal cannot be synchronized with other performance videos.

Even without the reverse alignment unit 22, the tempo of the performance sound can be adjusted, for example by a time stretch technique, for a performance video recorded while the reference sound was reproduced at a constant tempo throughout. In this case, information specifying the tempo of the reference sound is provided from, for example, the user terminal 10. If the reference sound is restricted to one reproduced at a constant tempo throughout, the performance video can be synchronized with other performance videos even without reverse alignment processing.
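The arithmetic behind such a constant-tempo adjustment can be sketched as follows. The function name `stretch_ratio` and the tempo values are hypothetical; the sketch only computes the factor that a time stretch routine would be given and does not perform the stretch itself.

```python
def stretch_ratio(recorded_bpm, target_bpm):
    """Factor by which the recording's duration must be multiplied so
    that material recorded at recorded_bpm plays back at target_bpm."""
    return recorded_bpm / target_bpm


# Hypothetical case: a 60-second take recorded against a 90 BPM
# reference sound is to be matched to takes normalized to 120 BPM.
ratio = stretch_ratio(recorded_bpm=90.0, target_bpm=120.0)
new_duration = 60.0 * ratio
```

A ratio below 1 shortens the recording (faster playback), and a ratio above 1 lengthens it.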

Alternatively, when synchronizing, for example, the performance sound of user A with the performance sound of user B, the sound data editing system 1 may identify the relative tempo difference between user A and user B by expanding or compressing one of the two sound signals along the time axis while varying the expansion or compression ratio, calculating the cross-correlation coefficient between the two signals for each ratio in turn, and identifying the expansion or compression ratio at which the cross-correlation coefficient is largest. Once the tempo difference is identified, the tempos of user A's performance sound and user B's performance sound can be matched using a time stretch technique.
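The ratio search described above might look like the following minimal sketch. It assumes the two signals already start at the same point, uses plain linear-interpolation resampling as the time-axis expansion or compression (which, unlike a real time stretch, also shifts pitch), and evaluates the correlation at zero lag only. The names `resample`, `correlation`, and `best_ratio`, and the candidate ratios, are hypothetical.

```python
import math


def resample(signal, ratio):
    """Expand (ratio > 1) or compress (ratio < 1) a signal along the
    time axis using linear interpolation."""
    n_out = int(round(len(signal) * ratio))
    last = len(signal) - 1
    out = []
    for k in range(n_out):
        pos = k / ratio
        i = min(int(pos), last)
        frac = pos - i
        out.append(signal[i] * (1.0 - frac) + signal[min(i + 1, last)] * frac)
    return out


def correlation(a, b):
    """Normalized correlation coefficient over the overlapping samples."""
    n = min(len(a), len(b))
    a, b = a[:n], b[:n]
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return num / den if den else 0.0


def best_ratio(sig_a, sig_b, ratios):
    """Return the candidate ratio that, applied to sig_b, matches sig_a best."""
    return max(ratios, key=lambda r: correlation(sig_a, resample(sig_b, r)))


# Hypothetical click tracks: user A has a pulse every 10 samples,
# user B (playing faster) has a pulse every 8 samples.
sig_a = [1.0 if n % 10 == 0 else 0.0 for n in range(100)]
sig_b = [1.0 if n % 8 == 0 else 0.0 for n in range(80)]
ratio = best_ratio(sig_a, sig_b, [0.8, 0.9, 1.0, 1.1, 1.25])
```

In this toy case the search selects 1.25, meaning user B's signal must be expanded by that factor to line up with user A's.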

5-9. Other Modifications
The format of the moving image data described in the embodiment and the data accompanying it are merely examples. For example, the moving image data may have a data format unique to the sound data editing system 1. The moving image data need not include attribute information such as instrument identification information. The embodiment described an example in which the reference sound is listened to through headphones, so that the performance sound and the reference sound are not mixed. However, the reference sound may instead be output from the speaker 108, so that the reference sound is mixed into the performance sound. Even in such a case, the audio track of the performance video includes an audio channel in which only the reference sound that was being reproduced when the performance sound was recorded is stored. The reference sound can therefore be removed by a technique such as subtracting the signal of the reference-only channel from the signal of the channel in which the performance sound and the reference sound are mixed (equivalently, adding its phase-inverted signal).
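The removal of the reference sound can be illustrated with the following simplified sketch. It assumes that the two channels are sample-aligned and that the reference sound leaked into the mixed channel with a known, constant gain; real leakage through a speaker and microphone would generally need a more elaborate estimate. The function name `remove_reference` and the `gain` parameter are hypothetical.

```python
def remove_reference(mixed, reference, gain=1.0):
    """Subtract the reference-only channel from the mixed channel.

    gain is an assumed scale factor describing how loudly the
    reference sound leaked into the mixed channel.
    """
    return [m - gain * r for m, r in zip(mixed, reference)]


# Hypothetical samples: the mixed channel is performance + 0.5 * reference.
performance = [0.1, -0.2, 0.3, 0.0]
reference = [0.4, 0.4, -0.4, 0.2]
mixed = [p + 0.5 * r for p, r in zip(performance, reference)]
cleaned = remove_reference(mixed, reference, gain=0.5)
```

Under these assumptions `cleaned` matches `performance` up to floating-point rounding.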

The sound data to be synchronized is not limited to the performance sound of a musical instrument. The sound data to be synchronized may be any data that records sound generated by a demonstration. As already described, a "demonstration" includes, for example, musical performance, singing, recitation, and oral performance.

The hardware configuration for realizing the functions of the sound data editing system 1 is not limited to that illustrated in FIGS. 2 and 3. The sound data editing system 1 may have any hardware configuration as long as the required functions can be realized. In addition, the correspondence between functions and hardware is not limited to that illustrated in the embodiment. For example, the functions implemented in the server 20 in the embodiment may be distributed across two or more devices.

The programs executed in the user terminal 10 and the server 20 may be provided on a storage medium such as an optical disk, a magnetic disk, or a semiconductor memory, or may be downloaded via a communication line such as the Internet. These programs need not include all of the steps shown in FIGS. 4, 5, 7, and 13. Some of these steps may be omitted.

DESCRIPTION OF SYMBOLS
1 ... Sound data editing system, 3 ... Sound data editing system, 4 ... Sound data editing system, 10 ... User terminal, 11 ... Storage unit, 12 ... Reproduction unit, 13 ... Recording unit, 14 ... Communication unit, 15 ... UI unit, 16 ... Alignment unit, 20 ... Server, 21 ... Communication unit, 22 ... Reverse alignment unit, 23 ... Storage unit, 24 ... Synchronization unit, 101 ... CPU, 102 ... Memory, 103 ... Storage, 104 ... Communication IF, 105 ... Display, 106 ... Input device, 107 ... Microphone, 108 ... Speaker, 109 ... Camera, 201 ... CPU, 202 ... Memory, 203 ... Storage, 204 ... Communication IF

Claims

A sound data editing method comprising a step of synchronizing, using a reference sound, first sound data based on a first demonstration performed during playback of the reference sound and second sound data based on a second demonstration, different from the first demonstration, performed during playback of the reference sound.

The sound data editing method according to claim 1, wherein
the first sound data and the second sound data each include data of a first channel and data of a second channel,
the data of the first channel represents the reference sound, and
the data of the second channel represents a performance sound.

The sound data editing method according to claim 1, further comprising:
obtaining an input sound signal representing an input sound; and
adjusting a tempo of at least one of the first sound data and the second sound data based on the input sound signal.

The sound data editing method according to any one of claims 1 to 3, further comprising:
adjusting the tempo of the first sound data according to the reference sound; and
adjusting the tempo of the second sound data according to the reference sound,
wherein, in the synchronizing step, the first sound data and the second sound data whose tempos have been adjusted are synchronized.

The sound data editing method according to any one of claims 1 to 4, wherein
the first sound data is sound data included in first moving image data,
the second sound data is sound data included in second moving image data different from the first moving image data, and
in the synchronizing step, the first moving image data and the second moving image data are synchronized.