JP4382786B2

JP4382786B2 - Audio mixdown device, audio mixdown program

Info

Publication number: JP4382786B2
Application number: JP2006225702A
Authority: JP
Inventors: 哉内田
Original assignee: 株式会社タイトー
Priority date: 2006-08-22
Filing date: 2006-08-22
Publication date: 2009-12-16
Anticipated expiration: 2026-08-22
Also published as: JP2008051896A

Description

The present invention relates to an audio mixdown device that synthesizes audio input on karaoke equipment or similar devices, and to an audio mixdown program.

Conventionally, an online karaoke system has been proposed in which a user's singing voice is recorded before a karaoke session, so that even a single person can enjoy singing in chorus during the karaoke performance (see Patent Document 1). In the system described in Patent Document 1, before using karaoke, the user records a singing voice at a user terminal while the accompaniment of the desired song is played. The user terminal creates a singing-track recording file that includes information such as the tempo and key at the time of recording, associates it with a user ID and a song ID, and sends it to a file storage server for storage. The karaoke performance device accesses the file storage server, retrieves the singing-track recording file identified by the user ID and the song ID, takes the accompaniment of the same song from the karaoke database, outputs the accompaniment and the voice recorded in the file in musical synchronization, and mixes in the voice input to the microphone.
Patent Document 1: JP 2004-53736 A

In the conventional online karaoke system, the user had to record a singing voice at the user terminal in advance and store it on the file storage server. In addition, when singing in chorus on the karaoke performance device, the user could use only his or her own singing voice (singing-track recording file) previously stored on the file storage server. In other words, the user could not sing in chorus (including duets) with other users.

The present invention has been made in view of the above circumstances, and its object is to provide an audio mixdown device capable of generating the audio of a single song by synthesizing a plurality of voices, including voices recorded by other users.

The present invention is characterized by comprising: audio file input means for receiving, from a plurality of devices connected via a network, audio files in which voices sung by users along with a song were recorded on those devices, together with identification data for each song attached to each audio file and user data, including gender and age, about the individual user who input the voice of the audio file; audio file storage means for storing the plurality of audio files received by the audio file input means in association with the identification data and the user data attached to them; identification data input means for receiving, from a device requesting synthesis of audio files, the identification data indicating the song to be synthesized and the user data; selection means for selecting, according to the identification data and the user data received by the identification data input means, a plurality of audio files stored in the audio file storage means to which the common identification data and user data are attached; synthesis target file designating means for providing a list of the file names of the audio files selected by the selection means to the device requesting synthesis, and for having the user of that device designate, from the file name list, the audio files to be synthesized; synthesized audio data generating means for applying effect processing to the audio files designated through the synthesis target file designating means and synthesizing them to generate synthesized audio data; and synthesized audio data output means for outputting the synthesized audio data generated by the synthesized audio data generating means to the device requesting synthesis.

According to the present invention, a plurality of audio files to which identification data (for example, a song title) is attached are stored, and a plurality of audio files sharing common identification data are selected from them and synthesized. It is therefore possible to synthesize a plurality of voices, including voices recorded by other users, and to generate the audio of a single song that sounds as if, for example, a chorus or a duet were being sung.

Embodiments of the present invention will be described below with reference to the drawings.
FIG. 1 is a block diagram showing the configuration of the system in the present embodiment. In the system shown in FIG. 1, an audio mixdown server 10 is connected via a network 12 to various devices such as a karaoke device 14, a mobile phone 16, and a personal computer 18. The network 12 includes various communication networks such as the Internet, a public line network (telephone network), and a LAN (Local Area Network).

The audio mixdown server 10 receives and stores, from devices connected via the network 12, files of voices recorded along with karaoke (accompaniment audio), hereinafter referred to as recorded audio files. It also provides a service that synthesizes a plurality of recorded audio files, selected according to instructions received from a user through a device, to generate the audio of a single song that sounds as if, for example, several users were singing in chorus or as a duet.

The karaoke device 14, the mobile phone 16, and the personal computer 18 each have a communication function for connecting to the audio mixdown server 10 via the network 12. Each of these devices also has a karaoke function, which outputs karaoke audio and takes in the voice sung by the user along with it, and a recording function, which records the voice taken in by the karaoke function. The recorded audio file produced by the recording function is transmitted to the audio mixdown server 10 by the communication function.

FIG. 2 is a block diagram showing the functional configuration of the audio mixdown server 10. The audio mixdown server 10 is realized by a computer and implements its various functions by having a processor execute programs stored in a storage device. The programs provided on the audio mixdown server 10 include an audio mixdown program that implements the function of synthesizing the audio of recorded audio files.

As shown in FIG. 2, the audio mixdown server 10 provides the functions of an audio file input unit 20, a mixdown control unit 21, a synthesis pattern list storage unit 22, an effect processing unit 23, a synthesized audio data output unit 24, a list overview output unit 25, a vote input unit 26, a tallying unit 27, and a recorded audio file database 28.

The audio file input unit 20 receives recorded audio files transmitted over the network 12 from devices such as the karaoke device 14, the mobile phone 16, and the personal computer 18, and stores them in the recorded audio file database 28. Together with each recorded audio file, the audio file input unit 20 can also receive caption data, which contains various information about the recorded voice, and user data, which contains information about the user (the person using the device) who recorded the voice, and store both in the recorded audio file database 28 in association with the recorded audio file.

The mixdown control unit 21 controls the mixdown process, which, in response to an audio mixdown request received over the network 12 from the karaoke device 14, the mobile phone 16, the personal computer 18, or the like, synthesizes (mixes down) a plurality of recorded audio files stored in the recorded audio file database 28 to generate the audio of a single song. The mixdown control unit 21 also controls the synthesized audio data output process, which, in response to a request from one of these devices, synthesizes the recorded audio files stored in the database 28 on the basis of a synthesis pattern list generated by an earlier mixdown process. In addition, the mixdown control unit 21 has the list overview output unit 25 generate and output list overview data, and evaluates the results produced by the tallying unit 27 from the vote data that users enter for the songs covered by that list overview data. For example, it accepts evaluation votes from users who have listened to songs synthesized by the audio mixdown server 10, and thereby runs a virtual choral contest held over the network.

The synthesis pattern list storage unit 22 stores the synthesis pattern lists generated in the mixdown process. A synthesis pattern list contains the data needed to synthesize a plurality of recorded audio files stored in the recorded audio file database 28 into the audio of a single song (see FIG. 4).

Under the control of the mixdown control unit 21, the effect processing unit 23 synthesizes the audio of a plurality of recorded audio files stored in the recorded audio file database 28 and applies various effects to the audio. The effect processing unit 23 can apply individual effects, which are applied separately to each recorded audio file to be synthesized, as well as overall effects, which are applied to the whole of the audio obtained after the recorded audio files have been synthesized.
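The ordering described above (individual effects per file, then synthesis, then overall effects) can be sketched as follows. This is only an illustrative sketch, not the implementation of the effect processing unit 23: it assumes each recording is already decoded to a list of float samples at a common sample rate, and the names `mix_down` and `gain` are hypothetical. A simple gain stands in for real effects such as reverb.

```python
from typing import Callable, List, Sequence

Samples = List[float]
Effect = Callable[[Samples], Samples]


def gain(factor: float) -> Effect:
    """Hypothetical stand-in effect: scale every sample by `factor`."""
    return lambda samples: [s * factor for s in samples]


def mix_down(tracks: Sequence[Samples],
             individual_effects: Sequence[Sequence[Effect]],
             overall_effects: Sequence[Effect] = ()) -> Samples:
    """Apply per-track effects, sum the tracks, then apply overall effects.

    `individual_effects` must hold one effect chain (possibly empty) per track.
    """
    processed = []
    for samples, chain in zip(tracks, individual_effects):
        for effect in chain:          # individual effects, track by track
            samples = effect(samples)
        processed.append(samples)

    # Sum sample by sample; shorter tracks are treated as silent at the end.
    length = max((len(t) for t in processed), default=0)
    mixed = [sum(t[i] for t in processed if i < len(t)) for i in range(length)]

    for effect in overall_effects:    # overall effects on the mixed result
        mixed = effect(mixed)
    return mixed
```

For example, `mix_down([[0.5, 0.5], [0.25]], [[gain(2.0)], []], [gain(0.5)])` doubles the first track, sums it with the second, and then halves the mixed result.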

The synthesized audio data output unit 24 outputs the audio data synthesized (mixed down) by the effect processing unit 23, hereinafter referred to as synthesized audio data, over the network 12 to the karaoke device 14, the mobile phone 16, the personal computer 18, or the like.

The list overview output unit 25 generates, from the plurality of synthesis pattern lists stored in the synthesis pattern list storage unit 22, list overview data for displaying the synthesis pattern lists as a list, and outputs it to the karaoke device 14, the mobile phone 16, the personal computer 18, or the like. The receiving device displays the list overview on the basis of the list overview data, accepts the user's instruction selecting a synthesis pattern list (a synthesized song), and transmits that instruction to the audio mixdown server 10.

The vote input unit 26 receives, from the karaoke device 14, the mobile phone 16, the personal computer 18, or the like, vote data indicating the evaluation of the audio of a song synthesized by the audio mixdown server 10 (synthesized audio data).

The tallying unit 27 tallies the evaluation data received through the vote input unit 26 from a plurality of devices (the karaoke device 14, the mobile phone 16, the personal computer 18, and so on). For example, it tallies the evaluation data for each of the songs covered by the list overview produced by the list overview output unit 25, and uses the totals as, for example, the scoring results of a virtual choral contest.
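The tallying step can be illustrated with a short sketch. The description does not specify the format of the vote data, so the sketch assumes each vote is a pair of a synthesized song ID and an integer score; the function name `tally_votes` is likewise hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def tally_votes(votes: Iterable[Tuple[str, int]]) -> List[Tuple[str, int]]:
    """Sum the scores per synthesized song and rank them, highest total first.

    Assumption: a vote is (synthesized_song_id, score). Ties are broken by
    song ID so that the ranking is deterministic.
    """
    totals: Dict[str, int] = defaultdict(int)
    for song_id, score in votes:
        totals[song_id] += score
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))
```

Calling `tally_votes([("M1", 3), ("M2", 5), ("M1", 4)])` ranks song `M1` (total 7) ahead of song `M2` (total 5).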

The recorded audio file database 28 stores the recorded audio files, caption data, user data, and other data received from the karaoke device 14, the mobile phone 16, the personal computer 18, and the like.

FIG. 3 is a diagram showing an example of the format of the data stored in the recorded audio file database 28.
As shown in FIG. 3, the recorded audio file database 28 stores each recorded audio file received through the audio file input unit 20 in association with the caption data and the user data received with it. The caption data includes, as identification data identifying the song stored as the recorded audio file, data such as a song ID, a title, and an artist name. The identification data is taken from the data set for the karaoke track (audio data) that was used when the voice was recorded on the karaoke device 14, the mobile phone 16, the personal computer 18, or the like. The caption data also includes, as attribute data about the voice recorded by the user, the part (soprano, alto, tenor, bass, chorus, narration, and so on) and data indicating whether an effect was applied. When an effect was applied, data indicating the effect type (reverb, for example) is added. The attribute data is entered by the user through the input device (buttons, keyboard, and so on) of the device when the voice is recorded.
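One record of the format described for FIG. 3 could be modeled as below. This is a sketch under assumptions: the field names, the string encodings of part and gender, and the use of `None` to mean "no effect" are illustrative choices, not taken from the patent.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CaptionData:
    # Identification data, taken from the karaoke track used for recording.
    song_id: str
    title: str
    artist: str
    # Attribute data, entered by the user at recording time.
    part: str                          # e.g. "soprano", "alto", "narration"
    effect_type: Optional[str] = None  # e.g. "reverb"; None means no effect

    @property
    def has_effect(self) -> bool:
        return self.effect_type is not None


@dataclass(frozen=True)
class UserData:
    gender: str
    age: int


@dataclass(frozen=True)
class RecordedAudioEntry:
    """One recorded audio file stored with its caption data and user data."""
    file_name: str
    caption: CaptionData
    user: UserData
```

Keeping the effect type as an optional field lets the "effect applied or not" flag be derived rather than stored twice.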

The user data is data about the individual user who input the voice of the recorded audio file, and includes, for example, gender and age. The user data may be entered together with the attribute data of the caption data when the voice is recorded on the user's device, or it may be registered on the audio mixdown server 10 in advance and the registered data used. When the audio mixdown server 10 is used through another system, for example when it is accessed through a server that provides an SNS (Social Networking Service), the user data registered in advance on that SNS server may be used.

FIG. 4 is a diagram showing an example of the data format of a synthesis pattern list stored in the synthesis pattern list storage unit 22.
A synthesis pattern list is created when the mixdown process synthesizes a plurality of voices to generate the audio of a single song. In the synthesized audio data output process, the same synthesis can be repeated on the basis of the synthesis pattern list, so that the audio of the song is generated in the same way as in the mixdown process.

The synthesis pattern list shown in FIG. 4 corresponds to the audio of one song synthesized by the audio mixdown server 10. It associates the following data: song data identifying the synthesized audio, the file names of the recorded audio files that were synthesized, the individual effects indicating the effect types applied separately to each recorded audio file at synthesis time, and the overall effects indicating the effect types applied to the synthesized audio as a whole. The song data includes, for example, a unique synthesized song ID assigned to the synthesized song and the identification data (song ID, title) contained in the recorded audio files that were synthesized.
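A synthesis pattern list of this kind might be represented and persisted as in the following sketch. The field names and the choice of JSON as the storage format are assumptions made for illustration; the description only states which items are associated with one another.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class SynthesisPattern:
    synthesized_song_id: str  # unique ID assigned to the synthesized song
    song_id: str              # identification data shared by the source files
    title: str
    # Recorded audio file name -> effect types applied to that file alone.
    individual_effects: Dict[str, List[str]] = field(default_factory=dict)
    # Effect types applied to the mixed audio as a whole.
    overall_effects: List[str] = field(default_factory=list)

    def file_names(self) -> List[str]:
        """File names of the recordings that were synthesized."""
        return list(self.individual_effects)


def save_pattern(pattern: SynthesisPattern) -> str:
    """Serialize a pattern so the same mix can be regenerated later."""
    return json.dumps(asdict(pattern), ensure_ascii=False)


def load_pattern(text: str) -> SynthesisPattern:
    return SynthesisPattern(**json.loads(text))
```

Because the pattern records only file names and effect types, the mixed audio itself does not need to be stored; it can be rebuilt from the recorded audio files on request.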

Next, the operation of the present embodiment will be described.
FIG. 5 is a diagram conceptually showing the mixdown process executed on the audio mixdown server 10. On the audio mixdown server 10, the recorded audio files received from the karaoke device 14, the mobile phone 16, the personal computer 18, and the like are stored in the recorded audio file database 28. Each recorded audio file carries identification data identifying the karaoke track (song) that was used when it was recorded on the device. From the recorded audio files that share common identification data, that is, from the voices recorded along with the karaoke track of the same song, the audio mixdown server 10 selects those to be synthesized according to the user's instructions. The audio mixdown server 10 applies predetermined effect processing to the selected recorded audio files and synthesizes them, thereby generating the audio of a single song from recorded audio files registered on the server by several users. A user can therefore freely select recorded audio files recorded by other users to create a duet that suits his or her taste, or create a chorus sung by a number of people that could not be assembled in reality. Because the audio mixdown server 10 can apply effect processing when it synthesizes the recorded audio files, it can bring the files into line with one another so that the synthesized audio does not sound unnatural, and it can also add effects requested by the user.

First, the process by which devices such as the karaoke device 14, the mobile phone 16, and the personal computer 18 store a recorded audio file on the audio mixdown server 10 will be described. The description below takes execution on the personal computer 18 as an example; the process is executed in the same way on other devices such as the karaoke device 14 and the mobile phone 16.

When the service provided by the audio mixdown server 10 is used, that is, when a user wants to create a duet or chorus of a certain song with other people, an SNS (Social Networking Service) offered over the network 12 (the Internet or the like) can be used to gather participants. For example, users of the SNS create a community about a certain artist, announce in it the song or artist they want to sing as a duet or chorus, and invite participants. The participants who sign up are then asked to send recorded audio files of themselves singing the song to the audio mixdown server 10. Instead of creating a specific community to recruit participants, a user may also post the song or artist on an electronic bulletin board open to unspecified users and ask for recorded audio files of that song.

First, the recording process, in which a device such as the personal computer 18 creates the recorded audio file to be transmitted to the audio mixdown server 10, will be described with reference to the flowchart shown in FIG. 6.

First, the karaoke function is started on the personal computer 18, which selects the karaoke track (song) to be used for recording according to the user's instruction (step A1). For example, each song for which a karaoke track is available is assigned a unique song ID (a multi-digit number or the like), and the song is selected by specifying this song ID. The karaoke track (performance audio) used to create a recorded audio file for the audio mixdown server 10 is assumed to be one produced by, for example, a specific manufacturer. Because the length and structure of a karaoke track differ from manufacturer to manufacturer even for the same song, users of the audio mixdown server 10 service are assumed to use a predetermined karaoke track shared with the other users.

Next, the personal computer 18 selects the effect type to be used during recording according to the user's specification (step A2). Effect types include reverb, for example, and the user can specify one freely. Because an individual effect can be specified for each recorded audio file when the audio mixdown server 10 synthesizes the files, effect processing does not necessarily have to be performed at this stage.

The personal computer 18 outputs the karaoke audio and records the singing voice input through a microphone (not shown) (step A3). The singing may be recorded through the microphone together with the karaoke audio, or the karaoke audio may be output through headphones so that only the singer hears it and only the singing voice is picked up by the microphone. If the karaoke audio is recorded as well, some processing is needed at synthesis time to keep the karaoke portion from affecting the singing voices, so it is preferable that only the singing voice be input. The following description assumes that the recorded audio file contains only the singing voice. The personal computer 18 records the voice that is input while the karaoke audio is being output. If the type of effect processing to be applied at recording time was specified in advance, the personal computer 18 records while applying the specified effect processing to the input voice.

When the karaoke audio ends, the personal computer 18 generates a recorded audio file of the recorded voice (step A4). It also takes in the caption data and the user data for the voice in this recorded audio file (steps A5 and A6). For the caption data, it uses, for example, the song ID, title, and artist preset for the karaoke track used during recording. For the data on whether an effect was applied (and the effect type), it uses what the user specified in step A2. Other data, such as the part of the recorded voice, is entered by the user through an input device (a keyboard or the like).

The personal computer 18 transmits (uploads) the recorded audio file, the caption data, and the user data to the audio mixdown server 10 over the network 12 (step A7).
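Steps A4 to A7 assemble three things for upload: the recorded audio, the caption data, and the user data. The sketch below shows one way a client could put them together, following the sourcing rules given above (identification data from the karaoke track, effect data from step A2, part from user input). The dictionary layout and the function name `build_upload` are assumptions; the description does not define a wire format.

```python
from typing import Any, Dict, Optional


def build_upload(karaoke_track: Dict[str, str],
                 audio: bytes,
                 part: str,
                 effect_type: Optional[str],
                 user_data: Dict[str, Any]) -> Dict[str, Any]:
    """Bundle a recording with its caption data and user data.

    `karaoke_track` is assumed to carry the preset "song_id", "title" and
    "artist" of the karaoke track chosen in step A1.
    """
    caption = {
        # Identification data copied from the karaoke track.
        "song_id": karaoke_track["song_id"],
        "title": karaoke_track["title"],
        "artist": karaoke_track["artist"],
        # Attribute data: part from user input, effect from step A2.
        "part": part,
        "has_effect": effect_type is not None,
        "effect_type": effect_type,
    }
    return {"audio": audio, "caption": caption, "user": dict(user_data)}
```

A client would then send the returned bundle to the server by whatever transport it uses.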

When the audio file input unit 20 of the audio mixdown server 10 receives the recorded audio file, the caption data, and the user data from the personal computer 18, it stores them in the recorded audio file database 28 in association with one another.

Next, the mixdown process on the audio mixdown server 10 will be described with reference to the flowchart shown in FIG. 7.
First, the files to be synthesized (mixed down) are selected from the recorded audio files stored in the recorded audio file database 28 (step B1). For example, the mixdown control unit 21 of the audio mixdown server 10 receives from the personal computer 18 data specifying a song, such as identification data consisting of a song ID, a title, or an artist name. The mixdown control unit 21 searches the caption data in the recorded audio file database 28 using the identification data received from the personal computer 18 and extracts the matching recorded audio files.

In addition to the identification data, the mixdown control unit 21 can receive from the personal computer 18 data such as the part of the recorded voice and whether an effect was applied, as well as data specifying the gender, age, and so on of the user who made the recording, and use it to narrow down the recorded audio files retrieved from the recorded audio file database 28.
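The search and narrowing in step B1 amounts to filtering the stored records by identification data and, optionally, by attribute and user data. The following is a minimal sketch, assuming each database record is a dictionary with `file_name`, `caption`, and `user` keys; that layout, the parameter names, and the inclusive age range are illustrative assumptions.

```python
from typing import Any, Dict, Iterable, List, Optional, Tuple


def select_candidates(database: Iterable[Dict[str, Any]],
                      song_id: str,
                      part: Optional[str] = None,
                      gender: Optional[str] = None,
                      age_range: Optional[Tuple[int, int]] = None) -> List[str]:
    """Return the file names of recordings matching the requested criteria.

    `song_id` is required; every other criterion is skipped when it is None.
    `age_range` is an inclusive (minimum, maximum) pair.
    """
    matches = []
    for entry in database:
        caption, user = entry["caption"], entry["user"]
        if caption["song_id"] != song_id:
            continue  # different song: never a candidate
        if part is not None and caption["part"] != part:
            continue
        if gender is not None and user["gender"] != gender:
            continue
        if age_range is not None and not (age_range[0] <= user["age"] <= age_range[1]):
            continue
        matches.append(entry["file_name"])
    return matches
```

The returned file names correspond to the list that the server offers to the requesting device for the user to choose from.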

The mixdown control unit 21 provides the personal computer 18 with a list of the file names of the recorded audio files extracted from the recorded audio file database 28, and has the user of the personal computer 18 designate from this list the files to be synthesized. If a preview of a designated recorded audio file is requested (step B2, Yes), the mixdown control unit 21 plays back the corresponding recorded audio file (in part or in full) and transmits the audio data to the personal computer 18 over the network 12 (step B3). The personal computer 18 outputs the audio from the audio data so that the user can check it.

After this audio data has been transmitted, when an instruction confirming the selection is input from the personal computer 18, the mixdown control unit 21 fixes the recorded audio file as a synthesis target (step B4, Yes).

If the end of recorded audio file selection is not instructed (step B5, No), the mixdown control unit 21 again has the user designate a synthesis target from the file list in the same way as above. By repeating this process, the user can designate any number of recorded audio files. Because the user can preview the recorded audio files stored in the recorded audio file database 28 of the audio mixdown server 10, the user can freely choose a set of recorded audio files whose mixdown will produce audio to his or her taste.

When the end of recorded audio file selection is instructed (step B5, Yes), the mixdown control unit 21 transmits to the personal computer 18 the data needed to display the effect type selection screen. The personal computer 18 displays the effect type selection screen based on the data received from the mixdown control unit 21.

FIG. 8(a) shows an example of the effect type selection screen (menu). In this example, multiple effect types such as compressor, reverb, chorus, and time stretch are listed. The screen is also assumed to provide buttons (not shown) for choosing whether to execute an individual effect, which is applied to each recorded audio file, or an overall effect, which is applied to the synthesized audio as a whole.

When an individual effect is selected on the effect type selection screen (step B7, Yes), the mixdown control unit 21 has the user designate which of the previously selected recorded audio files is to receive the effect processing and which effect type is to be used (steps B8 and B9). Depending on the effect type, a level can also be specified. For reverb, for example, any one of several predefined levels can be chosen. The compressor adjusts volume, so the volume level can be adjusted for each recorded audio file. Time stretch adjusts the length of the audio, for example so that the overall length is the same across multiple recordings. Time stretch is treated as an overall effect and cannot be designated as an individual effect.
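The rule that time stretch is available only as an overall effect could be enforced with a small lookup table. The sketch below is an assumption-laden illustration: the names `ALLOWED_SCOPES`, `selectable_effects`, and `validate_choice` are hypothetical, and the text does not say which scopes the compressor, reverb, and chorus support, so both scopes are assumed for them.

```python
INDIVIDUAL = "individual"
OVERALL = "overall"

# Scopes in which each effect type may be chosen.
# Only the time-stretch restriction comes from the text; the rest is assumed.
ALLOWED_SCOPES = {
    "compressor": {INDIVIDUAL, OVERALL},
    "reverb": {INDIVIDUAL, OVERALL},
    "chorus": {INDIVIDUAL, OVERALL},
    "time_stretch": {OVERALL},
}


def selectable_effects(scope):
    """List the effect types the selection screen should offer for a scope."""
    return sorted(name for name, scopes in ALLOWED_SCOPES.items() if scope in scopes)


def validate_choice(effect, scope):
    """Raise ValueError when an effect is unknown or not allowed in the scope."""
    if effect not in ALLOWED_SCOPES:
        raise ValueError(f"unknown effect type: {effect}")
    if scope not in ALLOWED_SCOPES[effect]:
        raise ValueError(f"{effect} cannot be used as an {scope} effect")
```

Driving the menu from the same table that validates requests keeps the screen and the server-side check from drifting apart.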

In accordance with the instructions input from the personal computer 18, the mixdown control unit 21 selects the recorded audio file stored in the recorded audio file database 28 and selects the type of effect processing to be executed by the effect processing unit 23.

If a preview of the designated recorded audio file is requested (step B10, Yes), the mixdown control unit 21 has the effect processing unit 23 apply the effect processing to the file (in part or in whole), plays back the result, and transmits the audio data to the personal computer 18 via the network 12 (step B11). The personal computer 18 outputs the processed audio based on this data so that the user can check it.

After this preview audio data has been transmitted, when an instruction confirming the selection is input from the personal computer 18, the mixdown control unit 21 fixes the type of individual effect for the recorded audio file. If the user is not satisfied after previewing the processed audio, the user can set a different individual effect and preview again in the same way. Individual effects can be set in this manner for each of multiple recorded audio files (steps B7 to B11).

When an overall effect is selected on the effect type selection screen (step B12, Yes), the mixdown control unit 21 has the user designate which type of effect processing to apply to the previously selected recorded audio files (step B13).

The overall effect is also designated by the user on the effect type selection screen shown in FIG. 8(a). In that case, the screen allows only the effect types usable as overall effects to be designated. In addition, when an effect is designated on the effect type selection screen, a further selection screen (menu) may be provided on which the user can easily specify the details of that effect.

FIG. 8(b) shows a display example of the selection screen presented when reverb is selected on the effect type selection screen. This screen offers multiple items for choosing how the reverb is applied. In the example of FIG. 8(b), buildings such as "××× Hall" and "○○ Hall" are provided as items. For example, the effect can be designated so that the generated audio sounds as if a chorus of multiple users were singing inside the designated building. The user can therefore obtain audio with the desired effect even without detailed knowledge of effects.

In accordance with the instructions input from the personal computer 18, the mixdown control unit 21 selects the recorded audio files stored in the recorded audio file database 28 and sets in the effect processing unit 23 the type of effect processing designated on the effect type selection screen.

If a preview of the designated recorded audio files is requested (step B14, Yes), the mixdown control unit 21 applies to each of the files the individual effect set for it, synthesizes (mixes down) the audio of the files, applies the overall effect to the mixed audio, plays back the result, and transmits the audio data to the personal computer 18 via the network 12 (step B15). The personal computer 18 outputs the processed audio based on this data so that the user can check it.

After this preview audio data has been transmitted, when an instruction confirming the selection is input from the personal computer 18, the mixdown control unit 21 fixes the type of overall effect for the recorded audio files. If the user is not satisfied after previewing the processed audio, the user can set a different individual or overall effect and preview again in the same way.

When the end of effect type selection is instructed (step B16, Yes), the mixdown control unit 21 generates a synthesis pattern list such as that shown in FIG. 4 from the selections made so far and stores it in the synthesis pattern list storage unit 22 (step B17). The synthesized music ID included in the music data of the synthesis pattern list is set to a unique value each time a new synthesis pattern list is created.
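As a rough sketch of step B17, a synthesis pattern list can be modelled as a small record that receives a fresh ID on creation. The field names here are hypothetical and do not reproduce the data format of FIG. 4, and using a random UUID for the synthesized music ID is an assumption; the text only requires that each new list get unique data.

```python
import uuid
from dataclasses import dataclass


@dataclass
class SynthesisPatternList:
    # Hypothetical fields; the actual layout is the one shown in FIG. 4.
    synthesized_music_id: str
    music_id: str
    individual_effects: dict  # file name -> list of effect type names
    overall_effects: list     # effect type names applied after mixing


def create_pattern_list(music_id, individual_effects, overall_effects):
    """Build a pattern list with a newly generated, unique synthesized music ID."""
    return SynthesisPatternList(
        synthesized_music_id=uuid.uuid4().hex,  # assumption: random UUID as the unique ID
        music_id=music_id,
        individual_effects=dict(individual_effects),
        overall_effects=list(overall_effects),
    )
```

Because the record holds only file names and effect settings, storing it costs far less than storing rendered audio.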

In accordance with the contents of the synthesis pattern list, the mixdown control unit 21 has the effect processing unit 23 apply the specified effects to the audio of the recorded audio files and synthesize them. That is, the effect processing unit 23 applies to each recorded audio file selected for synthesis the individual effect set for it, synthesizes (mixes down) the audio of the files, and applies the overall effect to the mixed audio to generate synthesized audio data. The synthesized audio data output unit 24 transmits the synthesized audio data generated by the effect processing unit 23 to the personal computer 18 through the network 12 as a download (step B18).
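The processing order described above (individual effects per file, then mixing, then overall effects) can be sketched with plain lists of samples. This is an illustrative model only: `gain` stands in for a real effect, and `mix_down` and its arguments are hypothetical names.

```python
def gain(factor):
    """Return a stand-in effect that scales every sample by a constant factor."""
    def apply(samples):
        return [s * factor for s in samples]
    return apply


def mix_down(tracks, individual_effects, overall_effects):
    """Apply per-file effects, sum the tracks sample by sample, then apply overall effects.

    tracks: dict mapping file name -> list of float samples
    individual_effects: dict mapping file name -> list of effect functions
    overall_effects: list of effect functions applied to the mixed signal
    """
    processed = []
    for name, samples in tracks.items():
        for effect in individual_effects.get(name, []):
            samples = effect(samples)
        processed.append(samples)
    if not processed:
        return []
    length = max(len(t) for t in processed)
    # Shorter tracks simply contribute nothing past their end.
    mixed = [sum(t[i] for t in processed if i < len(t)) for i in range(length)]
    for effect in overall_effects:
        mixed = effect(mixed)
    return mixed
```

Keeping effects as plain functions means the same pipeline serves both the preview in step B15 and the final generation in step B18.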

When the effect processing unit 23 synthesizes audio from multiple recorded audio files, it executes not only the individual and overall effects designated by the user but also, automatically and without any instruction from the user, effect processing that keeps the files consistent with one another. For example, simply summing multiple voices can push the signal outside the predetermined dynamic range and distort the sound. The effect processing unit 23 therefore lowers the volume of each recorded audio file according to the number of files being synthesized so that the mixed audio stays within the dynamic range. The more files are synthesized, the larger the volume reduction applied to each.
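One simple way to realize this automatic volume reduction is to scale each track by 1/N before summing, where N is the number of files. The text does not give a formula, so the 1/N rule and the function names below are assumptions made for illustration.

```python
def headroom_gain(num_files):
    """Per-track gain that shrinks as more files are mixed (assumed 1/N rule)."""
    if num_files < 1:
        raise ValueError("at least one file is required")
    return 1.0 / num_files


def mix_with_headroom(tracks):
    """Sum tracks after scaling each one so the mix cannot exceed full scale.

    tracks: non-empty list of sample lists, each sample assumed to lie in [-1.0, 1.0]
    """
    g = headroom_gain(len(tracks))
    length = max(len(t) for t in tracks)
    return [sum(t[i] * g for t in tracks if i < len(t)) for i in range(length)]
```

With 1/N scaling, N full-scale inputs sum to at most full scale, at the cost of a quieter mix when the voices rarely peak together.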

The lengths of recorded audio files can also vary, even when they were recorded against the same karaoke audio, because of differences in the performance or operating environment of the recording devices. Therefore, even if the user has not designated time stretch as an overall effect, time stretch may be executed automatically when the variation in length among the recorded audio files exceeds a preset tolerance (for example, 0.5 seconds). It can be omitted when the variation in overall length is within the tolerance. Instead of stretching the whole recording, the length may be adjusted by time-stretching only part of it, such as a predetermined period at the end. Furthermore, rather than simply equalizing lengths, the length of each recorded audio file may be adjusted so that changes in the audio line up, for example so that the points at which corresponding passages reach peak volume coincide.
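The tolerance check for automatic time stretching can be sketched as below. The 0.5-second tolerance is the example value from the text; choosing the mean duration as the common target length is an assumption, as are the function names.

```python
TOLERANCE_SEC = 0.5  # example tolerance given in the text


def needs_time_stretch(durations, tolerance=TOLERANCE_SEC):
    """True when the spread between the longest and shortest file exceeds the tolerance."""
    return max(durations) - min(durations) > tolerance


def stretch_ratios(durations, tolerance=TOLERANCE_SEC):
    """Return one length ratio per file; all 1.0 when no stretch is needed.

    Assumption: every file is stretched to the mean duration.
    Durations are in seconds and must be positive.
    """
    if not needs_time_stretch(durations, tolerance):
        return [1.0] * len(durations)
    target = sum(durations) / len(durations)
    return [target / d for d in durations]
```

A ratio above 1.0 lengthens a file and a ratio below 1.0 shortens it; the actual time-stretch algorithm that preserves pitch is outside this sketch.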

For effects such as reverb as well, effect processing with preset default settings may be executed automatically even when no individual or overall effect has been set.

Beyond the volume and length of the recorded audio described above, other processing needed to keep multiple recorded audio files consistent when they are synthesized, such as processing of the pitch of the recorded voices, is also executed.

In the description above, both individual and overall effects can be previewed, but the preview processing may be omitted when the processing load is heavy and the processing takes time. In that case, the user designates the types of individual and overall effects for the recorded audio files selected for synthesis, the corresponding effect processing is applied, and the audio of the files is synthesized to generate the synthesized audio data.

In this way, the user of the personal computer 18 can obtain synthesized audio data for a single song generated from recorded audio files that other users have uploaded to the audio mixdown server 10. Using the synthesized audio data downloaded from the audio mixdown server 10, the personal computer 18 can output audio that sounds as if the user were singing a duet or chorus with other users.

Next, the synthesized audio data output process in the audio mixdown server 10 will be described with reference to the flowchart shown in FIG. 9.
When the audio mixdown server 10 generates synthesized audio data from multiple recorded audio files in the mixdown process, it stores a synthesis pattern list containing the data needed to generate that synthesized audio data in the synthesis pattern list storage unit 22. Based on the stored synthesis pattern list, the audio mixdown server 10 can regenerate, at any time, the same data as the synthesized audio data generated in the mixdown process. A user can therefore obtain a song that another user created (mixed down) from multiple recorded audio files by designating its synthesis pattern list. Moreover, even when the audio mixdown server 10 of this embodiment generates synthesized audio data in response to requests from many users, it does not store the synthesized audio data itself and so does not consume a large amount of storage capacity.

This embodiment is described using the example of a virtual chorus contest for songs created (mixed down) by multiple users.

It is assumed that the songs (synthesis pattern lists) eligible for the chorus contest are set in advance in the audio mixdown server 10. For example, the synthesis pattern lists of songs matching conditions such as the same song (title) or the same artist may be extracted and set as contest entries, or a user may request to take part in the contest and register the synthesis pattern list of the song to be entered. Other setting methods may also be used.

Under the control of the mixdown control unit 21, the list output unit 25 outputs to the personal computer 18 list data for the songs (synthesis pattern lists) set in advance as contest entries (step C1). Based on this list data, the personal computer 18 displays a list of the songs generated by mixdown on the audio mixdown server 10.

When the user selects an entry from the list, the personal computer 18 notifies the audio mixdown server 10. In response to this notification, the mixdown control unit 21 selects the corresponding synthesis pattern list stored in the synthesis pattern list storage unit 22 (step C2) and has the effect processing unit 23 generate synthesized audio data according to that list (step C3). The effect processing unit 23 synthesizes the audio of the recorded audio files into synthesized audio data in the same way as in the mixdown process described above.

The synthesized audio data output unit 24 outputs the synthesized audio data generated by the effect processing unit 23 to the personal computer 18. The personal computer 18 receives the synthesized audio data from the audio mixdown server 10 and plays it back.

In the same way, whenever the user designates any synthesis pattern list from the list, the personal computer 18 can receive and play back the synthesized audio data generated by mixdown according to that list.

The user of the personal computer 18 can thus listen to the songs generated on the audio mixdown server 10 from multiple recorded audio files, and can send the audio mixdown server 10 voting data expressing an evaluation of each song.

The voting input unit 26 of the audio mixdown server 10 receives the voting data transmitted from the personal computer 18 (step C6). The tallying unit 27 tallies the voting data received by the voting input unit 26 for each synthesis pattern list included in the list sent by the list output unit 25 (step C7).

The tallying unit 27 tallies the voting data sent by multiple users for each synthesis pattern list, evaluates the tally for, say, a predetermined period, and determines, for example, a ranking. The audio mixdown server 10 sends the result (ranking) to, for example, the users who submitted voting data or the users who created the contest entries.
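The tallying and ranking step can be sketched as follows. The vote format (a pattern list ID paired with a timestamp), the one-point-per-vote scoring, and the function names are assumptions for illustration; the text says only that voting data is tallied per synthesis pattern list over a predetermined period and that a ranking may be decided.

```python
from collections import Counter
from datetime import datetime


def tally_votes(votes, start, end):
    """Count one point per vote for each pattern list, ignoring votes outside [start, end].

    votes: iterable of (pattern_list_id, cast_at) pairs, cast_at being a datetime
    """
    return Counter(list_id for list_id, cast_at in votes if start <= cast_at <= end)


def ranking(counts):
    """Order pattern lists by vote count, highest first; ties are broken by ID."""
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Illustrative votes only.
VOTES = [
    ("P1", datetime(2006, 8, 1)),
    ("P2", datetime(2006, 8, 2)),
    ("P1", datetime(2006, 8, 3)),
    ("P2", datetime(2006, 9, 5)),  # falls outside the August voting period below
]
```

Calling `ranking(tally_votes(VOTES, datetime(2006, 8, 1), datetime(2006, 8, 31)))` places "P1" first with two votes and "P2" second with one.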

In this way, the audio mixdown server 10 can generate synthesized audio data from a synthesis pattern list, transmit it to devices such as the karaoke device 14, the mobile phone 16, and the personal computer 18, and receive and tally evaluations of the songs generated by mixdown. This can be expected to draw more users to the service provided by the audio mixdown server 10.

The description above assumes that voting data is accepted in a virtual chorus contest, but it is also possible to let users freely browse, through the network 12, the synthesis pattern lists stored in the synthesis pattern list storage unit 22, output the synthesized audio data generated according to the list they select, and accept their evaluations.

The description above also assumes that the audio mixdown server 10 receives recorded audio files from the karaoke device 14, the mobile phone 16, the personal computer 18, and the like and synthesizes (mixes down) them, but the synthesis of recorded audio files described above may instead be executed on devices such as the karaoke device 14, the mobile phone 16, and the personal computer 18. For example, the karaoke device 14 stores voices input along with karaoke during normal use as recorded audio files in a recorded audio file database. The karaoke device 14 then synthesizes multiple recorded audio files designated by the user from that database (voices for the same song, carrying common identification data) to generate the audio of one song. The generated audio may be output from the karaoke device 14 or output externally as an audio file. The karaoke device 14 may also combine the audio generated from the recorded audio files with the karaoke accompaniment of the corresponding song and output the result as karaoke audio. In addition to recorded audio files of directly input voices, the karaoke device 14 may synthesize recorded audio files received from other devices through the network 12. Devices such as the mobile phone 16 and the personal computer 18 can execute the same processing.

The description above uses the synthesis (mixdown) of singing voices as an example, but the invention can also be applied to the synthesis of other sounds, such as musical instrument sounds.

The present invention is not limited to the embodiment exactly as described above, and at the implementation stage the constituent elements can be modified without departing from the gist of the invention. Various inventions can also be formed by appropriately combining the constituent elements disclosed in the embodiment. For example, some constituent elements may be removed from the full set shown in the embodiment, and constituent elements from different embodiments may be combined as appropriate.

The methods described in the embodiment can also be distributed as a program (software means) executable by a computer, either stored on a storage medium such as a magnetic disk (flexible disk, hard disk, etc.), an optical disc (CD-ROM, DVD, MO, etc.), or a semiconductor memory (ROM, RAM, flash memory, etc.), or transmitted over a communication medium. The program stored on the medium also includes a setup program that configures, inside the computer, the software means to be executed by the computer (including tables and data structures as well as the executable program). A computer that implements this device reads the program stored on the storage medium, builds the software means with the setup program where necessary, and executes the processing described above under the control of that software means. The storage medium referred to in this specification is not limited to media for distribution and includes storage media such as magnetic disks and semiconductor memories provided inside the computer or in devices connected via a network.

FIG. 1 is a block diagram showing the configuration of the system in an embodiment of the present invention.
FIG. 2 is a block diagram showing the functional configuration of the audio mixdown server 10 in this embodiment.
FIG. 3 is a diagram showing an example of the data format of the data stored in the recorded audio file database 28.
FIG. 4 is a diagram showing an example of the data format of the synthesis pattern list stored in the synthesis pattern list storage unit 22.
FIG. 5 is a diagram conceptually showing the mixdown process executed in the audio mixdown server 10.
FIG. 6 is a flowchart for explaining the recording process that creates a recorded audio file.
FIG. 7 is a flowchart for explaining the mixdown process.
FIG. 8 is a diagram showing an example of the effect type selection screen (menu).
FIG. 9 is a flowchart for explaining the synthesized audio data output process.

Explanation of symbols

10: audio mixdown server, 12: network, 14: karaoke device, 16: mobile phone, 18: personal computer, 20: audio file input unit, 21: mixdown control unit, 22: synthesis pattern list storage unit, 23: effect processing unit, 24: synthesized audio data output unit, 25: list output unit, 26: voting input unit, 27: tallying unit, 28: recorded audio file database.

Claims

An audio mixdown device comprising:
audio file input means for receiving, from a plurality of devices connected via a network, audio files in which voices sung by users along with a song have been recorded, identification data for each song added to each audio file, and user data, including gender and age, relating to the individual user who input the voice of the audio file;
audio file storage means for storing the plurality of audio files received by the audio file input means in association with the identification data added to each audio file and the user data;
identification data input means for receiving, from a device that requests synthesis of audio files, the identification data indicating the song to be synthesized and the user data;
selection means for selecting, in accordance with the identification data and the user data received by the identification data input means, a plurality of audio files stored in the audio file storage means to which the common identification data and the user data have been added;
synthesis target file designation means for providing a list of the file names of the plurality of audio files selected by the selection means to the device that requests synthesis of audio files, and having the user of that device designate from the list the audio files to be synthesized;
synthesized audio data generation means for generating synthesized audio data by applying effect processing to, and synthesizing, the plurality of audio files designated via the synthesis target file designation means; and
synthesized audio data output means for outputting the synthesized audio data generated by the synthesized audio data generation means to the device that requests synthesis of audio files.

2. The audio mixdown device according to claim 1, wherein the synthesized audio data generation means receives, from the device that requests synthesis of audio files, an instruction to execute an individual effect, in which the effect processing is applied individually to the plurality of audio files designated as synthesis targets, and an overall effect, in which the effect processing is applied to the synthesized audio as a whole, and selectively executes the individual effect and the overall effect in accordance with the instruction.

The audio mixdown device according to claim 1, further comprising:
list generation means for generating a synthesis pattern list that stores the data necessary for the synthesized audio data generation means to generate synthesized audio data; and
synthesis pattern list storage means for storing the synthesis pattern list generated by the list generation means,
wherein the synthesized audio data generation means generates the synthesized audio data by synthesizing a plurality of audio files stored in the audio file storage means on the basis of the synthesis pattern list stored in the synthesis pattern list storage means.
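
One way to picture a synthesis pattern list is as a small serializable record that names the files to mix and the effects to apply, so the same mix can be regenerated later. The field names, the `make_pattern` helper, and the in-memory `PatternStore` below are hypothetical; the claim does not specify the list's contents beyond "data necessary for generating synthesized audio data".

```python
import json


def make_pattern(song_id, file_names, individual_effects, overall_effect):
    """Build a hypothetical synthesis pattern list as a plain dict.

    individual_effects maps a file name to an effect name;
    overall_effect is an effect name or None.
    """
    return {
        "song_id": song_id,
        "files": list(file_names),
        "individual_effects": dict(individual_effects),
        "overall_effect": overall_effect,
    }


class PatternStore:
    """In-memory stand-in for the synthesis pattern list storage means."""

    def __init__(self):
        self._patterns = {}

    def save(self, pattern_id, pattern):
        # Serialize so the stored copy is independent of the caller's object.
        self._patterns[pattern_id] = json.dumps(pattern)

    def load(self, pattern_id):
        return json.loads(self._patterns[pattern_id])
```

A generator that reads a stored pattern only needs to look up each listed file in the audio file storage and apply the named effects, which is the role the claim assigns to the synthesized audio data generation means.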

3. The audio mixdown device according to claim 1, further comprising effect type input means for inputting a designation of the type of effect processing to be executed by the synthesized audio data generation means,
wherein the synthesized audio data generation means executes effect processing of the type corresponding to the designation input by the effect type input means.

4. The audio mixdown device as described, further comprising:
input means for inputting data indicating an evaluation of the synthesized audio data output by the synthesized audio data output means; and
tallying means for tallying, on the basis of the data input by the input means, evaluation results for each of the plurality of synthesis pattern lists stored in the synthesis pattern list storage means.
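
The tallying step can be sketched as grouping evaluation data by synthesis pattern list. The sketch assumes, purely for illustration, that each evaluation arrives as a `(pattern_id, score)` pair with a numeric score; the claim itself does not fix the form of the evaluation data.

```python
from collections import defaultdict


def tally(evaluations):
    """Aggregate evaluation scores per synthesis pattern list.

    evaluations: iterable of (pattern_id, score) pairs (hypothetical format).
    Returns a dict mapping pattern_id to its count and average score.
    """
    counts = defaultdict(int)
    totals = defaultdict(float)
    for pattern_id, score in evaluations:
        counts[pattern_id] += 1
        totals[pattern_id] += score
    return {
        pattern_id: {
            "count": counts[pattern_id],
            "average": totals[pattern_id] / counts[pattern_id],
        }
        for pattern_id in counts
    }
```

The result could then be used, for example, to rank stored synthesis patterns by how well listeners rated the mixes they produced.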

An audio mixdown program for causing a computer to function as:
audio file input means for inputting, from a plurality of devices connected via a network, an audio file in which a user's singing is recorded along with a piece of music, identification data for each piece of music added to the audio file, and user data, including gender and age, relating to the individual user who input the audio of the audio file;
audio file storage means for storing the plurality of audio files input by the audio file input means in association with the identification data added to each audio file and the user data;
identification data input means for inputting, from a device that requests synthesis of audio files, the identification data indicating the piece of music to be synthesized and the user data;
selection means for selecting, in accordance with the identification data and the user data input by the identification data input means, a plurality of audio files stored in the audio file storage means to which common identification data and user data are added;
synthesis target file designation means for providing a file name list of the plurality of audio files selected by the selection means to the device that requests synthesis of audio files, and for having the user of that device designate, from the file name list, the audio files to be synthesized;
synthesized audio data generation means for generating synthesized audio data by applying effect processing to, and synthesizing, the plurality of audio files designated by the synthesis target file designation means; and
synthesized audio data output means for outputting the synthesized audio data generated by the synthesized audio data generation means to the device that requests synthesis of audio files.