JP2009258291A

JP2009258291A - Sound data processing device and program

Info

Publication number: JP2009258291A
Application number: JP2008105904A
Authority: JP
Inventors: Hiroshi Kayama; 啓嘉山; Hayato Oshita; 隼人大下
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2008-04-15
Filing date: 2008-04-15
Publication date: 2009-11-05
Anticipated expiration: 2028-04-15
Also published as: JP5509536B2

Abstract

PROBLEM TO BE SOLVED: To make it easy to check each designated note in a music information image.

SOLUTION: A storage device 12 stores a plurality of pieces of sound data D used to synthesize different sounds. An allocation section 34 allocates either a single piece of sound data D or a plurality of pieces of sound data D to one designated note. A display control section 26 causes a display device 16 to display a music information image 60 in which, for each designated note, an indicator P is arranged whose vertical position is selected according to the pitch of the designated note and whose horizontal position is selected according to the sounding time of the designated note. The display control section 26 displays the indicator P of a designated note allocated a single piece of sound data D and the indicator P of a designated note allocated a plurality of pieces of sound data D in different modes.

COPYRIGHT: (C)2010,JPO&INPIT

Description

The present invention relates to a technique for processing audio data used to synthesize sounds (human vocal sounds or musical instrument performance sounds).

Techniques have conventionally been proposed for synthesizing sounds from music information (score data) that specifies, for each sound, a pitch, a sounding time point, and a duration. The user creates and edits the music information while viewing, on a display device, an image in which the music information is visualized (hereinafter “music information image”). For example, as disclosed in Patent Document 1, a music information image is an image (a piano roll) in which figures (hereinafter “indicators”) corresponding to sounds designated as synthesis targets (hereinafter “designated sounds”) are arranged in time series. The position of each indicator along the vertical axis is selected according to the pitch of the designated sound, and its position along the horizontal axis is selected according to the time point at which the designated sound is sounded.

Patent Document 1: JP 2004-258563 A

Because each indicator in the music information image is placed separately for each singing voice used to synthesize a designated sound, even synthesizing a sound in which several singers sing a common melody (a chorus) requires the user to create a separate time series of indicators in the music information image for each singer. Checking each indicator in the music information image (and, further, creating or editing the music information) is therefore cumbersome for the user. Although the above example concerns a mixture of several singing voices, the same problem arises when synthesizing an ensemble of several musical instruments. In view of these circumstances, one object of the present invention is to make it easier to check each designated sound in a music information image.

To solve the above problems, an audio data processing device according to the present invention comprises: storage means for storing a plurality of pieces of audio data used to synthesize different sounds; assigning means for assigning two or more of the pieces of audio data to one designated sound; and display control means for causing a display device to display a music information image in which an indicator is arranged for each designated sound, the indicator's position along a first axis being selected according to the pitch of the designated sound and its position along a second axis being selected according to the time point at which the designated sound is sounded. In the present invention, one designated sound to which the assigning means has assigned two or more pieces of audio data is displayed as a single indicator in the music information image. The user can therefore check designated sounds more easily than when a separate indicator is displayed for each of the pieces of audio data assigned to a common designated sound.
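The core idea, one indicator per designated sound however many pieces of audio data are assigned to it, can be sketched in Python as follows. The class and function names, and the use of MIDI note numbers for pitch, are illustrative assumptions rather than part of the disclosure.

```python
from dataclasses import dataclass, field


@dataclass
class DesignatedSound:
    pitch: int    # MIDI note number (assumed pitch representation)
    start: float  # onset time in seconds
    end: float    # offset time in seconds
    voice_ids: list[str] = field(default_factory=list)  # assigned audio data (identifiers dA)


def assign(sound: DesignatedSound, voice_ids) -> None:
    """Assigning means: attach one or more pieces of audio data to a sound."""
    if not voice_ids:
        raise ValueError("at least one piece of audio data is required")
    sound.voice_ids = list(voice_ids)


def build_indicators(sounds):
    """Display control: exactly one indicator per designated sound.

    Each indicator is (horizontal position = onset, vertical position = pitch,
    number of assigned voices); the voice count never multiplies indicators.
    """
    return [(s.start, s.pitch, len(s.voice_ids)) for s in sounds]
```

A three-voice chorus note therefore still yields a single indicator; only its voice count differs from a solo note.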

In the present invention, “sound” means any acoustic signal. For example, human vocal sounds (such as singing voices) and musical instrument performance sounds both fall within the concept of “sound” in the present invention. A “designated sound” is a sound designated as a synthesis target, and any method of designating it may be used: for example, designated sounds may be specified by music information prepared in advance, or the user may designate them freely. The “mode of an indicator” means any visually recognizable state of the indicator; for example, its size, display color (hue, lightness, saturation), and shape all fall within the concept of “indicator mode”.

In a preferred aspect of the present invention, the display control means displays, in mutually different modes, the indicators of designated sounds whose combinations of audio data assigned by the assigning means differ. Because indicators with different combinations of assigned audio data are displayed differently, the user can easily distinguish, in the music information image, designated sounds generated from different combinations of audio data.

In a preferred aspect of the present invention, the assigning means can selectively assign either one piece of audio data or two or more pieces of audio data to one designated sound, and the display control means displays the indicator of a designated sound assigned one piece of audio data and the indicator of a designated sound assigned two or more pieces in mutually different modes. Because these two kinds of indicators are displayed differently, the user can easily tell from the music information image whether a designated sound has been assigned one piece or several pieces of audio data. For example, when one piece of audio data is used to synthesize a single sound (such as one person's voice or one instrument's performance sound), the user can tell from the mode of the indicator whether the designated sound will be synthesized as a solo (vocal or instrumental) or as a chorus or ensemble.

In a preferred aspect of the present invention, the display control means displays, in a common mode, the indicators of designated sounds that share the same combination of audio data assigned by the assigning means. Because indicators sharing a combination of assigned audio data are displayed in a common mode, the user can easily follow the time series of designated sounds that will be synthesized as the same kind of sound (sound synthesized from the same combination of audio data).
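The three display rules above (different combinations look different, single and multiple assignments look different, identical combinations look the same) can be captured in a small style-assignment function. This is a sketch under assumptions: the palette, the “filled versus outlined” convention, and the function name are hypothetical choices, not taken from the patent.

```python
# Hypothetical palette; colors repeat once there are more distinct
# combinations than entries, so a real UI would need a larger palette.
PALETTE = ["#1f77b4", "#ff7f0e", "#2ca02c", "#d62728"]


def assign_styles(voice_sets):
    """Return one style dict per designated sound.

    voice_sets: for each designated sound, the identifiers of its assigned
    audio data. Order does not matter: the same set means the same part.
    - Same combination -> same color (common mode).
    - Different combination -> next palette color (different mode).
    - One voice -> outlined; two or more voices -> filled.
    """
    colors = {}
    styles = []
    for voices in voice_sets:
        key = frozenset(voices)
        if key not in colors:
            colors[key] = PALETTE[len(colors) % len(PALETTE)]
        styles.append({"color": colors[key], "filled": len(key) > 1})
    return styles
```

Keying on a `frozenset` makes the rule independent of the order in which voices were added to a part.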

In a preferred aspect of the present invention, the display control means variably sets the mode of each indicator according to instructions from the user. Because each indicator's mode can be changed at the user's instruction, the indicators can be displayed in whatever mode each user finds easiest to check intuitively, according to that user's sensibilities and preferences.

In a preferred aspect of the present invention, the display control means changes the mode of a designated sound's indicator according to the number of pieces of audio data that the assigning means has assigned to that designated sound. This lets the user see at a glance how many pieces of audio data are assigned to each designated sound.
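One simple way to make the indicator's mode depend on the number of assigned pieces of audio data is to let its height grow with that count up to a cap. The pixel values and the linear mapping below are hypothetical; any monotonic mapping (thickness, saturation, opacity) would serve the same purpose.

```python
def indicator_height(n_voices: int, base: int = 8, step: int = 2, max_height: int = 20) -> int:
    """Indicator height in pixels for a sound with n_voices assigned audio data.

    A solo (one voice) gets the base height; each extra voice adds `step`
    pixels, capped at `max_height` so large choruses stay readable.
    """
    if n_voices < 1:
        raise ValueError("n_voices must be at least 1")
    return min(base + step * (n_voices - 1), max_height)
```

The cap keeps very large parts from overlapping neighbouring pitch rows in the piano roll.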

An audio data processing device according to a preferred aspect of the present invention includes first setting means (for example, the setting unit 42 in FIG. 8) for variably setting the distribution range of the pitches (pitch distribution range) of the sounds corresponding to the two or more pieces of audio data that the assigning means has assigned to one designated sound, and the display control means changes the mode of that designated sound's indicator according to how wide or narrow the distribution range set by the first setting means is. In this aspect, the user can easily see how much the pitches of the sounds corresponding to the individual pieces of audio data fluctuate (their distribution range).

An audio data processing device according to a preferred aspect of the present invention includes second setting means (for example, the setting unit 42 in FIG. 8) for variably setting the distribution range of the time points at which the sounds corresponding to the two or more pieces of audio data that the assigning means has assigned to one designated sound start, and the display control means changes the mode of that designated sound's indicator according to how wide or narrow the distribution range set by the second setting means is. In this aspect, the user can easily see how much the onset times of the sounds corresponding to the individual pieces of audio data fluctuate (their distribution range).
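As one possible rendering of the two distribution ranges, the sketch below enlarges the indicator rectangle: the pitch range (in cents) makes it taller, and the onset range (in seconds) extends its left edge earlier. The geometry parameters, the use of cents, and the assumption that onsets are centred on the nominal start point are all hypothetical; the patent only states that the indicator's mode changes with the width of each range.

```python
from dataclasses import dataclass


@dataclass
class Rect:
    x: float
    y: float
    width: float
    height: float


def spread_indicator(start, end, pitch, pitch_range_cents, onset_range_sec,
                     px_per_sec=100.0, px_per_semitone=10.0, top_pitch=84):
    """Indicator rectangle enlarged to show the voices' distribution ranges.

    - pitch_range_cents: total width of the pitch distribution range; the
      rectangle grows taller by that many cents, centred on the note row.
    - onset_range_sec: total width of the onset distribution range; the left
      edge moves earlier by half of it (onsets assumed centred on `start`).
    """
    if pitch_range_cents < 0 or onset_range_sec < 0:
        raise ValueError("distribution ranges must be non-negative")
    extra_h = px_per_semitone * pitch_range_cents / 100.0
    y = (top_pitch - pitch) * px_per_semitone - extra_h / 2
    x = (start - onset_range_sec / 2) * px_per_sec
    width = end * px_per_sec - x
    return Rect(x, y, width, px_per_semitone + extra_h)
```

With both ranges set to zero the function returns the ordinary piano-roll rectangle, so a tightly synchronized chorus looks like a solo note of the same size.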

In a preferred aspect of the present invention, the storage means stores a feature quantity for each of the pieces of audio data, and the device further comprises selecting means for selecting from the storage means two or more pieces of audio data whose feature quantities are similar to an instruction feature quantity that reflects an instruction from the user; the assigning means assigns the two or more pieces of audio data selected by the selecting means to one designated sound. Because two or more pieces of audio data with feature quantities similar to the instruction feature quantity are assigned to one designated sound, a combination of audio data with the feature quantities the user wants can be assigned to the designated sound even if the user does not know the musical feature quantities of the individual pieces of audio data. That said, any method of selecting the combination of audio data assigned to a designated sound may be used. For example, the user may designate each of the two or more pieces of audio data assigned to one designated sound, or two or more pieces of audio data selected at random from those stored in the storage device may be assigned to the designated sound.

In a specific example of the aspect that selects audio data according to the similarity between the instruction feature quantity and the stored feature quantities, the selecting means selects from the storage means, in descending order of similarity to the instruction feature quantity, as many pieces of audio data as the user specifies (a number the user can vary). Because the number of pieces of audio data assigned to a designated sound is set according to the user's instruction, synthesized sound can be generated at the scale the user wants (the total number of singers or performers).
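The “top k by similarity” selection can be sketched as a sort by distance. The patent does not name a similarity measure; Euclidean distance between factor-value vectors (smaller distance meaning more similar) is an assumption made here, as are the function and variable names.

```python
import math


def select_similar(features, target, k):
    """Return the identifiers dA of the k pieces of audio data most similar
    to the instruction feature quantity `target`.

    features: dict mapping identifier dA -> tuple of factor values.
    Similarity is taken as smaller Euclidean distance (an assumption);
    ties keep dictionary order because Python's sort is stable.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    ranked = sorted(features, key=lambda d: math.dist(features[d], target))
    return ranked[:k]
```

Here `k` plays the role of the user-specified number of singers or performers in the part.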

In the aspect that selects audio data according to the similarity between the instruction feature quantity and the stored feature quantities, each feature quantity includes, for example, a plurality of factor values obtained by multivariate analysis (factor analysis) of the audio data with respect to a plurality of factors relating to musical characteristics. In this configuration, the feature quantity consists of the values of factors that characterize the psychological impression a sound makes, so synthesized sound with the impression the user wants can be generated appropriately.

The instruction feature quantity is a feature quantity that reflects an instruction from the user; the method of determining it is arbitrary in the present invention. For example, in a configuration that uses as the instruction feature quantity a set of factor values the user specifies for each of the plurality of factors, the user can precisely specify the feature quantities of the audio data to be used to synthesize the designated sound. On the other hand, in a configuration that uses as the instruction feature quantity the feature quantity stored in the storage means for a piece of audio data the user has selected, the device can select audio data similar to audio data whose vocal impression the user already knows.
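The two ways of obtaining the instruction feature quantity described above (direct entry of factor values, or reuse of a reference voice's stored feature) can be expressed as one helper. The function name and parameters are hypothetical.

```python
def instruction_feature(stored, factor_values=None, reference_id=None):
    """Return the instruction feature quantity as a tuple of factor values.

    Exactly one source must be given:
    - factor_values: one value per factor, entered directly by the user;
    - reference_id: identifier dA of audio data the user picked as a
      reference; its stored feature quantity is reused.
    """
    if (factor_values is None) == (reference_id is None):
        raise ValueError("give exactly one of factor_values or reference_id")
    if factor_values is not None:
        return tuple(float(v) for v in factor_values)
    return tuple(stored[reference_id])
```

The result can be passed straight to a similarity search over the stored feature quantities.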

In a specific example of the aspect that selects audio data according to the similarity between the instruction feature quantity and the stored feature quantities, the storage means stores attributes of each piece of audio data, and the selecting means selects audio data whose feature quantity is similar to the instruction feature quantity and which corresponds to an attribute selected by the user. Because attributes of the audio data serve as a selection criterion in addition to the feature quantities, audio data capable of producing synthesized sound that better matches the user's preferences and sensibilities can be selected easily. Suitable attributes include, for example, the gender and age of the speaker of the vocal sound represented by the audio data, or the type and model of the instrument used to perform the sound represented by the audio data.
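Combining an attribute filter with the similarity ranking might look like this. The record layout (a `feature` tuple plus attribute keys such as `gender`), exact-match filtering, and Euclidean distance as the similarity measure are all assumptions for illustration.

```python
import math


def select_with_attributes(records, target, k, **required):
    """Select up to k identifiers dA whose attributes match every
    required key=value pair, ranked by Euclidean distance of their
    feature to the instruction feature quantity `target`.

    records: dict dA -> {"feature": tuple, "gender": str, "age": int, ...}
    """
    candidates = [
        d for d, r in records.items()
        if all(r.get(attr) == value for attr, value in required.items())
    ]
    candidates.sort(key=lambda d: math.dist(records[d]["feature"], target))
    return candidates[:k]
```

Filtering first and ranking second means the attribute acts as a hard constraint, while the feature quantity only orders the voices that pass it.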

The audio data processing device according to each of the above aspects may be implemented by hardware (electronic circuitry) such as a DSP (Digital Signal Processor) dedicated to processing audio data, or by a general-purpose processing unit such as a CPU (Central Processing Unit) cooperating with a program. A program according to the present invention causes a computer having storage means that stores a plurality of pieces of audio data used to synthesize different sounds to execute: an assigning process that assigns two or more of the pieces of audio data to one designated sound; and a display control process that causes a display device to display a music information image in which an indicator is arranged for each designated sound, the indicator's position along a first axis being selected according to the pitch of the designated sound and its position along a second axis being selected according to the time point at which the designated sound is sounded. The program of the present invention achieves the same operations and effects as the audio data processing device according to each of the above aspects. The program may be provided to users stored on a computer-readable recording medium and installed on a computer, or distributed from a server device over a communication network and installed on a computer.

<A: First Embodiment>
FIG. 1 is a block diagram of an audio data processing device 100A according to the first embodiment of the present invention. As shown in FIG. 1, the audio data processing device 100A is implemented by a computer system including a control device 10, a storage device 12, an input device 14, a display device 16, and a sound output device 18.

The control device 10 is a processing unit that executes a program. By functioning as a plurality of elements (an information generation unit 22, a speech synthesis unit 24, a display control unit 26, a selection unit 32, and an allocation unit 34), the control device 10 generates and outputs an audio signal SOUT. Each element of the control device 10 may also be implemented by a dedicated electronic circuit (DSP). The audio signal SOUT represents the waveform of a sound synthesized in response to the user's operation of the input device 14 (hereinafter “synthesized sound”). The storage device 12 stores the program executed by the control device 10 and various data used by the control device 10. Any known recording medium, such as a semiconductor storage device or a magnetic storage device, may be used as the storage device 12.

The storage device 12 stores n pieces of audio data D (D1 to Dn, where n is a natural number of 2 or more) used to synthesize different sounds. In this embodiment, each of the n pieces of audio data D is generated from the voice of a different speaker. One piece of audio data D consists of a plurality of pieces of segment data, one collected for each of a plurality of speech segments ([a], [i], [u], ...) obtained by dividing speech along the time axis. For example, data representing the waveform of a speech segment, or data representing feature quantities of that waveform, is used as segment data. A speech segment is either a phoneme (the smallest aurally distinguishable unit into which speech can be divided) or a phoneme chain in which several phonemes are linked. As shown in FIG. 1, each piece of audio data D is given a unique identifier dA (dA1 to dAn).
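A minimal in-memory model of one piece of audio data D might look like the following: an identifier dA plus a mapping from speech-segment labels to segment data. Representing segment data as a short list of waveform samples, and labelling phoneme chains like "k-a", are assumptions; the patent allows waveform data or waveform feature quantities.

```python
from dataclasses import dataclass, field


@dataclass
class AudioData:
    """One speaker's audio data D.

    dA: unique identifier (for example "Taro").
    segments: speech segment label (a phoneme such as "a", or a phoneme
    chain such as "k-a") -> segment data, here raw waveform samples.
    """
    dA: str
    segments: dict[str, list[float]] = field(default_factory=dict)

    def segment(self, label: str) -> list[float]:
        """Look up the segment data for one speech segment."""
        return self.segments[label]


# Hypothetical library keyed by identifier dA (storage device 12).
library = {
    "Taro": AudioData("Taro", {"a": [0.0, 0.5, 0.0, -0.5], "i": [0.0, 0.3, 0.0, -0.3]}),
    "Jiro": AudioData("Jiro", {"a": [0.0, 0.4, 0.0, -0.4]}),
}
```

Keying the library by dA lets both single parts and ensemble parts resolve directly to stored audio data.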

The input device 14 is a device (for example, a mouse or keyboard) that the user operates to input instructions to the audio data processing device 100A. The display device 16 (for example, a liquid crystal display) displays various images under the control of the control device 10. The sound output device 18 is a sound-emitting device (for example, speakers or headphones) that emits sound waves according to the audio signal SOUT supplied from the control device 10.

The information generation unit 22 in FIG. 1 generates music information (score data) SD that designates a plurality of sounds to be synthesized (designated sounds), and stores it in the storage device 12. FIG. 2 is a schematic diagram of the music information SD. For each of the designated sounds, the music information SD specifies a pitch, a sounding period, a phonetic symbol, and an identifier dB (dB1 to dBn). The sounding period includes the start point and end point of the designated sound.

In the music information SD, the identifier dB (dB1 to dBn) associated with a designated sound is a code identifying the combination of audio data D (hereinafter “part”) used to synthesize that designated sound. A designated sound may be assigned either a single piece of audio data D (hereinafter “single part”) or a set of several pieces of audio data D (hereinafter “ensemble part”). For a designated sound assigned a single part, the identifier dA of that piece of audio data D is set as the identifier dB in the music information SD; for a designated sound assigned an ensemble part, an identifier dB uniquely assigned to that combination of audio data D is set in the music information SD.
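Because a single part's dB is simply an identifier dA, while an ensemble part's dB names a stored combination, resolving a score entry to its audio data takes one table lookup with a fallback. The score layout (MIDI pitch, seconds, lyric string) and the table names below are hypothetical.

```python
# Hypothetical ensemble-part table: identifier dB -> identifiers dA.
ENSEMBLE_PARTS = {"Cho1": ["Taro", "Jiro", "Hanako"]}
LIBRARY_IDS = {"Taro", "Jiro", "Hanako"}

# Music information SD: one entry per designated sound
# (pitch as MIDI note number, sounding period in seconds, phonetic symbol, dB).
SCORE = [
    {"pitch": 60, "start": 0.0, "end": 0.5, "lyric": "a", "dB": "Taro"},  # single part
    {"pitch": 62, "start": 0.5, "end": 1.0, "lyric": "i", "dB": "Cho1"},  # ensemble part
]


def resolve_part(dB, ensemble_parts=ENSEMBLE_PARTS, library_ids=LIBRARY_IDS):
    """Return the identifiers dA of the audio data D that make up part dB."""
    if dB in ensemble_parts:
        return list(ensemble_parts[dB])
    if dB in library_ids:  # a single part reuses the identifier dA as dB
        return [dB]
    raise KeyError(f"unknown part identifier: {dB}")
```

Checking the ensemble table first assumes ensemble names never collide with voice names; a real implementation would enforce that when parts are created.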

The speech synthesis unit 24 in FIG. 1 synthesizes the audio signal SOUT from the music information SD generated by the information generation unit 22. More specifically, the speech synthesis unit 24 refers to the sounding periods in the music information SD and selects each designated sound in turn, in time order (hereinafter, the sound currently selected is called the “target designated sound”). When the part identified by the identifier dB set for the target designated sound in the music information SD is a single part, the speech synthesis unit 24 first acquires from the storage device 12 the segment data, within the audio data D indicated by the identifier dB, that corresponds to the phonetic symbol specified for the target designated sound in the music information SD. Second, the speech synthesis unit 24 adjusts the pitch of the acquired segment data to the pitch set for the target designated sound in the music information SD. When the part identified by the identifier dB is instead an ensemble part, the speech synthesis unit 24 acquires segment data from each of the pieces of audio data D constituting the ensemble part in the same way as for a single part, adjusts their pitches, and then combines (adds) them.
The audio signal SOUT is generated by D/A conversion (not shown) of the time series of segment data produced by the above procedure. A designated sound assigned a single part is therefore synthesized as the voice of a single speaker (a solo), and a designated sound assigned an ensemble part is synthesized as a mixture of the voices of several speakers (a chorus).

The display control unit 26 in FIG. 1 generates various images for generating and editing the music information SD and causes the display device 16 to display them. For example, the display control unit 26 causes the display device 16 to display setting images (FIGS. 3, 5, and 6) with which the user sets various items, and a music information image (FIG. 4) with which the user checks or edits (creates) the music information SD.

The selection unit 32 in FIG. 1 builds an ensemble part by selecting a plurality of pieces of audio data D according to the user's operation of the input device 14. When the user instructs generation of an ensemble part, the display control unit 26 causes the display device 16 to display the setting image 52 of FIG. 3. The setting image 52 contains an identification area 521, a selection area 522, and a candidate area 523. The identification area 521 displays the identifier dB of the ensemble part being built in the setting image 52 (named “Cho1” in the example of FIG. 3). The identifier dB (for example, the name “Cho1”) can be set freely through the user's operation of the input device 14.

The selection area 522 lists the identifiers dA (for example, names such as “Taro” and “Jiro”) of the n pieces of audio data D stored in the storage device 12. When the user selects an identifier dA in the selection area 522 and then operates the operator (command button) 524 (Add), the display control unit 26 adds that identifier dA to the candidate area 523. Conversely, when the user selects an identifier dA in the candidate area 523 and operates the operator 525 (Delete), the display control unit 26 removes that identifier dA from the candidate area 523.

The operator 526 is an image with which the user auditions the sound of the one or more pieces of audio data D the user has designated as candidates for the ensemble part (that is, the audio data D whose identifiers dA are placed in the candidate area 523). When the operator 526 is operated, the selection unit 32 outputs to the speech synthesis unit 24 the segment data of a predetermined speech segment for each of the one or more pieces of audio data D whose identifiers dA are in the candidate area 523. The speech synthesis unit 24 outputs the audio signal SOUT by mixing the sounds of a predetermined pitch generated from each piece of segment data. The sound output device 18 thus reproduces a synthesized sound corresponding to the one or more pieces of audio data D the user designated as candidates. By repeatedly selecting identifiers dA in the selection area 522 and deleting identifiers dA from the candidate area 523 while listening to (auditioning) the synthesized sound reproduced by the sound output device 18 as needed, the user can build the desired ensemble part.

Operating the operator 527 (OK) finalizes the contents of the ensemble part. More specifically, the selection unit 32 stores in the storage device 12 the identifiers dA listed in the candidate area 523 at the time the operator 527 is operated, in association with the identifier dB displayed in the identification area 521 (that is, the identifier dB of the ensemble part being created). In other words, the selection unit 32 generates an ensemble part by combining the pieces of audio data D selected by the user. When the operator 528 (Cancel) in FIG. 3 is operated, the settings made in the setting image 52 are discarded.
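The Add/Delete/OK/Cancel behaviour of the setting image 52 maps naturally onto a small editor object. This is a hypothetical model, not the patent's implementation; in particular, ignoring duplicate additions of the same identifier dA is an assumption.

```python
class EnsemblePartEditor:
    """Hypothetical model of setting image 52 (FIG. 3)."""

    def __init__(self, library_ids, store, dB):
        self.library_ids = list(library_ids)  # selection area 522
        self.store = store                    # storage device 12: dB -> [dA, ...]
        self.dB = dB                          # identification area 521
        self.candidates = []                  # candidate area 523

    def add(self, dA):
        """Operator 524 (Add): move a selected identifier into the candidates."""
        if dA not in self.library_ids:
            raise KeyError(dA)
        if dA not in self.candidates:  # duplicates ignored (assumption)
            self.candidates.append(dA)

    def delete(self, dA):
        """Operator 525 (Delete): remove an identifier from the candidates."""
        if dA in self.candidates:
            self.candidates.remove(dA)

    def ok(self):
        """Operator 527 (OK): store the candidates under identifier dB."""
        self.store[self.dB] = list(self.candidates)

    def cancel(self):
        """Operator 528 (Cancel): discard edits; the store is untouched."""
        self.candidates.clear()
```

Copying the candidate list in `ok()` keeps later edits in the dialog from silently changing an already committed part.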

Next, the music information image 60, which the user uses to check or edit the music information SD, will be described with reference to FIG. 4. When the user instructs display of the music information SD, the display control unit 26 causes the display device 16 to display the music information image 60 of FIG. 4. As shown in FIG. 4, the music information image 60 is divided into a work area 62 and an operation area 64. The work area 62 is the area in which the music information SD stored in the storage device 12 is displayed visually. More specifically, the work area 62 displays a piano-roll image in which a vertical axis corresponding to pitch (hereinafter “pitch axis”) and a horizontal axis corresponding to time (hereinafter “time axis”) are set.

While viewing the music information image 60, the user operates the input device 14 to specify the pitch of a designated sound and the start and end points of its sounding. The display control unit 26 places a figure P (hereinafter “indicator”) corresponding to the designated sound specified by the user in the work area 62. The position of the indicator P along the pitch axis is selected according to the pitch specified by the user, and its position along the time axis is selected according to the start point (or end point) of sounding specified by the user. The length of the indicator P along the time axis is selected according to the time from the start point to the end point of the designated sound. Each time a designated sound is specified in this way, the information generation unit 22 stores the pitch and the start and end points specified by the user in the storage device 12 as the pitch and sounding period of that designated sound in the music information SD.
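The mapping from a designated sound to the position and size of its indicator P can be sketched as follows. The pixel scales, the use of MIDI note numbers for pitch, and the names `Rect` and `indicator_rect` are assumptions for illustration only. The left edge is taken from the start point, and higher pitches are drawn nearer the top.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    x: float       # left edge, along the time axis
    y: float       # top edge, along the pitch axis
    width: float
    height: float


def indicator_rect(pitch: int, start_s: float, end_s: float,
                   px_per_second: float = 100.0, px_per_semitone: float = 10.0,
                   top_pitch: int = 127) -> Rect:
    """Place indicator P on a piano roll (hypothetical scale parameters).

    Vertical position follows the pitch, horizontal position follows the
    start point, and width follows the time from start point to end point.
    """
    if end_s < start_s:
        raise ValueError("end point must not precede start point")
    return Rect(
        x=start_s * px_per_second,
        y=(top_pitch - pitch) * px_per_semitone,
        width=(end_s - start_s) * px_per_second,
        height=px_per_semitone,
    )
```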

Repeating the above processing places a plurality of indicators P, each corresponding to a separate designated sound, in the work area 62. By operating the input device 14, the user can select one indicator P in the work area 62 (hereinafter the “selection indicator P”). The user then operates the input device 14 to specify phonetic symbols (characters) for the selection indicator P. The information generation unit 22 stores the phonetic symbols specified by the user in the music information SD as the phonetic symbols of the designated sound corresponding to the selection indicator P.

The assigning unit 34 in FIG. 1 selectively assigns either a single part or a composition part to each designated sound (selection indicator P) according to instructions from the user. Operators 641 and 642 in the operation area 64 of FIG. 4 are used to assign parts to designated sounds. Operator 641 assigns a single part, and operator 642 assigns a composition part. When the user selects one indicator P (the selection indicator P) and then operates operator 641, the display control unit 26 displays the identifiers dA (dA1 to dAn) of the n pieces of sound data D stored in the storage device 12 near operator 641 as selection candidates. The assigning unit 34 stores the identifier dA that the user selects from the n identifiers dA in the music information SD as the identifier dB of the designated sound corresponding to the selection indicator P. That is, the assigning unit 34 assigns a single part (a single piece of sound data D) to the designated sound corresponding to the selection indicator P.

When the user operates operator 642, the display control unit 26 displays the identifiers dB of the composition parts created by the selection unit 32 near operator 642 as selection candidates. The assigning unit 34 stores the identifier dB that the user selects from these identifiers dB in the music information SD as the identifier dB of the designated sound corresponding to the selection indicator P. That is, the assigning unit 34 assigns a composition part (plural pieces of sound data D) to the designated sound corresponding to the selection indicator P.
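The two assignment paths can be modeled as storing one identifier per designated sound and resolving it to sound data D at synthesis time. This sketch uses hypothetical in-memory tables in place of the storage device 12. An identifier dA stands for a single part and an identifier dB for a composition part. `Note`, `assign_part`, and `voices_for` are illustrative names only.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Hypothetical contents of storage device 12.
SOUND_DATA_IDS = {"dA1", "dA2", "dA3"}                               # n pieces of sound data D
COMPOSITION_PARTS: Dict[str, List[str]] = {"dB1": ["dA1", "dA2", "dA3"]}


@dataclass
class Note:
    """A designated sound in music information SD (simplified)."""
    pitch: int
    part_id: Optional[str] = None  # the field the text calls the note's identifier dB


def assign_part(note: Note, part_id: str) -> None:
    """Assign a single part (an identifier dA) or a composition part (an identifier dB)."""
    if part_id not in SOUND_DATA_IDS and part_id not in COMPOSITION_PARTS:
        raise KeyError(f"unknown part: {part_id}")
    note.part_id = part_id


def voices_for(note: Note) -> List[str]:
    """Resolve the note's part to the identifiers dA of the sound data D it uses."""
    if note.part_id is None:
        return []
    # A composition part expands to its members; a single part resolves to itself.
    return list(COMPOSITION_PARTS.get(note.part_id, [note.part_id]))
```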

Immediately after it is placed in the work area 62, an indicator P is displayed in an initial style common to all indicators P. The user can change the style of each indicator P (its size, display color (hue, lightness, saturation), and shape) by operating the input device 14. When the user instructs a change in the style of an indicator P, the display control unit 26 causes the display device 16 to display the setting image 54 of FIG. 5 or the setting image 56 of FIG. 6.

The setting image 54 of FIG. 5 is an image in which the user specifies the style of the frame (outline) of the selection indicator P. The identification area 540 displays the identifier dB of the part (single part or composition part) that the assigning unit 34 has assigned to the designated sound of the selection indicator P. The identifier dB displayed in the identification area 540 can be changed by operating the input device 14.

While viewing the setting image 54, the user operates the input device 14 as appropriate to choose one of several candidates for each of several items relating to the frame of the selection indicator P (line type, line width, line color, and so on). For example, by operating operator 541 of the setting image 54, the user selects the line type of the frame (for example, solid, wavy, or dashed) from the candidates displayed. Likewise, the line width of the frame is specified with operator 542, the line color of the frame with operator 543, and the shape of both ends of the selection indicator P along the time axis (square or rounded) with operator 544. The transparency of the frame (the degree to which the background shows through it) is specified with operator 545. The user can also enter numerical values directly for the line width and transparency of the frame. Each time the user specifies or changes an item, the display control unit 26 displays in the area 546 an indicator P whose frame reflects that setting.

The frame style of the selection indicator P is fixed to the style specified in the setting image 54 at the moment the user operates operator 547 (OK). That is, when operator 547 is operated, the display control unit 26 changes the frame of the selection indicator P actually placed in the work area 62 of the music information image 60 to the style set in the setting image 54. In addition, the display control unit 26 applies the same frame style to every other indicator P in the work area 62 to which the assigning unit 34 has assigned the same part as the selection indicator P (that is, every indicator P assigned the part whose identifier dB is displayed in the identification area 540). When operator 548 (Cancel) is operated, the settings made in the setting image 54 are discarded.

The setting image 56 of FIG. 6, on the other hand, is an image in which the user specifies the style of the area inside the frame of the selection indicator P (hereinafter the “inner area”). The identification area 560 displays the identifier dB of the part (single part or composition part) that the assigning unit 34 has assigned to the designated sound of the selection indicator P. The identifier dB displayed in the identification area 560 can be changed by operating the input device 14.

While viewing the setting image 56, the user operates the input device 14 as appropriate to choose one of several candidates for each of several items relating to the inner area of the selection indicator P (color and so on). More specifically, the color of the inner area is specified with operator 561, the type of shading displayed in the inner area (hatching pattern) with operator 562, the color of the hatching with operator 563, and the transparency of the inner area with operator 564. Each time the user specifies or changes an item, the display control unit 26 displays in the area 565 an indicator P whose inner area reflects that setting.

When operator 566 (OK) of the setting image 56 is operated, the display control unit 26 changes the inner area of the selection indicator P actually placed in the work area 62 of the music information image 60 to the style set in the setting image 56. In addition, the display control unit 26 applies the same inner-area style to every other indicator P in the work area 62 to which the assigning unit 34 has assigned the same part as the selection indicator P (that is, every indicator P assigned the part whose identifier dB is displayed in the identification area 560). When operator 567 (Cancel) is operated, the settings made in the setting image 56 are discarded.
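Both OK handlers (operator 547 for the frame, operator 566 for the inner area) follow the same rule. They restyle the selection indicator P and every indicator P whose assigned part matches the identifier dB shown in the identification area. The following is a minimal sketch of that rule. The `Indicator` type, the string-keyed style dictionary, and `apply_style` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Indicator:
    part_id: str                                        # part assigned to the designated sound
    style: Dict[str, str] = field(default_factory=dict)


def apply_style(indicators: List[Indicator], selected: Indicator,
                target_part: str, new_style: Dict[str, str]) -> None:
    """Apply new_style to the selection indicator and to every indicator assigned target_part."""
    for ind in indicators:
        if ind is selected or ind.part_id == target_part:
            ind.style.update(new_style)
```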

As described above, the indicators P of designated sounds to which the assigning unit 34 has assigned the same part are displayed in a common style. Because the user can pick any selection indicator P and change its style, the display control unit 26 can display the indicators P of designated sounds assigned different parts in different styles. For example, the indicator P of a designated sound assigned a single part and the indicator P of a designated sound assigned a composition part are displayed in different styles.

In the configuration described above, a designated sound to which plural pieces of sound data D (a composition part) are assigned is represented by a single indicator P in the music information image 60. The music information image 60 is therefore simpler than in a configuration that displays a separate indicator P for each piece of sound data D assigned to one designated sound, and the user can easily check (and edit) the time series of designated sounds. Moreover, because the indicators P of designated sounds assigned a single part and those assigned a composition part can be displayed in different styles, the user can grasp intuitively whether each designated sound will be reproduced as a single part (synthesized as a solo voice) or as a composition part (synthesized as a chorus).

<B: Second Embodiment>
Next, a second embodiment of the present invention will be described. In each of the following embodiments, elements common to the first embodiment are given the same reference signs as above, and their detailed descriptions are omitted as appropriate.

The display control unit 26 of this embodiment changes the style of the indicator P of a designated sound according to the number N of pieces of sound data D constituting the composition part that the assigning unit 34 has assigned to that designated sound (that is, the total number of voices in the composition part). Any aspect of the indicator P may be controlled according to N. One suitable configuration draws the frame of the indicator P with a thicker line the larger N is, as shown in part (A) of FIG. 7. Another draws the frame and inner area of the indicator P in a darker color the larger N is, as shown in part (B) of FIG. 7. When the frame of the indicator P is drawn as a wavy line, a configuration that increases the amplitude of the wave as N increases, as shown in part (C) of FIG. 7, may also be adopted.
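One way to realize the three examples of FIG. 7 is a monotonic mapping from N to style attributes. The linear scaling, the cap `max_voices`, the grey-level encoding of color density (lower value means darker), and the function name below are all assumptions. The text leaves the exact mapping open.

```python
from typing import Dict


def style_for_voice_count(n_voices: int, max_voices: int = 16) -> Dict[str, float]:
    """Map the voice count N of a composition part to indicator style attributes.

    Hypothetical linear mappings: line width grows from 1 px, the grey level
    falls from 200 toward 0 (darker), and the wavy-outline amplitude grows from 0 px.
    """
    if n_voices < 1:
        raise ValueError("a part needs at least one voice")
    t = min(n_voices, max_voices) / max_voices   # 0 < t <= 1
    return {
        "line_width": 1 + round(4 * t),          # FIG. 7(A): thicker outline
        "grey_level": round(200 * (1 - t)),      # FIG. 7(B): darker color
        "wave_amplitude": round(6 * t, 2),       # FIG. 7(C): larger wave amplitude
    }
```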

In this embodiment, the style of each indicator P is controlled according to the number N of pieces of sound data D constituting the composition part. The user can therefore grasp intuitively how many pieces of sound data D are used to synthesize the designated sound (that is, how many voices are mixed in the synthesized sound). Although the style of the indicator P is controlled here according to N, a configuration in which the display control unit 26 varies the style of the indicator P of a designated sound according to the characteristics of the sound data D assigned to it may also be adopted. For example, the display control unit 26 increases the color density of the frame or inner area of the indicator P as the volume or pitch of the voice represented by the sound data D increases.

<C: Third Embodiment>
When several singers sing the same melody in chorus, the pitch and onset timing of each singer's voice normally vary somewhat. Consequently, if the voices corresponding to the pieces of sound data D constituting a composition part coincide exactly in pitch and onset timing, the synthesized sound may sound unnatural. In this embodiment, therefore, variation (fluctuation) is applied to the pitch and onset timing of the voice corresponding to each piece of sound data D.

FIG. 8 is a block diagram of a sound data processing device 100B according to the third embodiment of the present invention. As shown in FIG. 8, the sound data processing device 100B of this embodiment adds a setting unit 42 to the sound data processing device 100A of the first embodiment. The setting unit 42 variably sets a pitch distribution range and a sounding-point distribution range. The pitch distribution range is the range over which the pitches of the voices corresponding to the pieces of sound data D constituting a composition part fluctuate (the range of pitch variation). The sounding-point distribution range is the range over which the sounding time points of those voices fluctuate (the range of onset variation).

FIG. 9 is a schematic diagram of a setting image 58 in which the user sets the pitch distribution range and the sounding-point distribution range. When the user performs a predetermined operation on the input device 14, the display control unit 26 causes the display device 16 to display the setting image 58. While viewing the setting image 58, the user operates the input device 14 to specify the pitch distribution range and the sounding-point distribution range.

The identification area 581 of FIG. 9 displays the identifier dB of the composition part that the user has selected from the composition parts generated by the selection unit 32 (that is, the composition part whose settings are being edited). The user sets the width of the pitch distribution range by operating operator 582 (sliding it left or right). The width of the sounding-point distribution range is likewise set by operating operator 583. The user can also specify the pitch distribution range and the sounding-point distribution range directly as numerical values.

The settings in the setting image 58 are confirmed by operating operator 584 (OK). That is, the setting unit 42 stores the pitch distribution range and sounding-point distribution range specified at the moment operator 584 is operated in the storage device 12, in association with the identifier dB of the composition part being edited (the identifier dB displayed in the identification area 581). When operator 585 (Cancel) is operated, the settings made in the setting image 58 are discarded.

The speech synthesizer 24 in FIG. 8 uses the pitch distribution range and the sounding-point distribution range when synthesizing a designated sound to which a composition part is assigned. That is, the speech synthesizer 24 makes the pitches of the voices corresponding to the pieces of sound data D in the composition part differ from one another within the pitch distribution range, and makes their sounding time points differ within the sounding-point distribution range. Because this configuration applies fluctuation to the pitch and onset timing of each voice in the synthesized sound of the composition part, it can generate a natural synthesized sound close to a real chorus.
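The fluctuation step can be sketched as drawing a per-voice pitch offset and onset offset inside the configured ranges. Several details are assumptions, since the text says only that the values differ within the ranges:

- offsets are drawn uniformly from a range centered on the written value,
- pitch offsets are expressed in cents,
- onsets are clamped at zero.

Passing a seeded `random.Random` makes the result reproducible. `VoiceEvent` and `scatter_voices` are hypothetical names.

```python
import random
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class VoiceEvent:
    sound_id: str        # identifier dA of the voice
    pitch_offset: float  # cents relative to the designated pitch
    onset: float         # sounding time point in seconds


def scatter_voices(sound_ids: Sequence[str], pitch_range: float,
                   onset_range: float, onset: float,
                   rng: Optional[random.Random] = None) -> List[VoiceEvent]:
    """Give each voice of a composition part its own pitch and onset within the ranges."""
    if rng is None:
        rng = random.Random()
    events = []
    for sid in sound_ids:
        # Each range is treated as a total width centered on the written value.
        dp = rng.uniform(-pitch_range / 2, pitch_range / 2)
        dt = rng.uniform(-onset_range / 2, onset_range / 2)
        events.append(VoiceEvent(sid, dp, max(0.0, onset + dt)))
    return events
```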

Meanwhile, the display control unit 26 changes the style of an indicator P to which a composition part is assigned according to the widths of the pitch distribution range and the sounding-point distribution range that the setting unit 42 has set for that composition part. FIG. 10 is a conceptual diagram explaining this change in the style of the indicator P. As shown in FIG. 10, the shape of the end portions PE at both ends of the indicator P along the time axis (horizontal axis) varies according to the widths of the pitch distribution range and the sounding-point distribution range. For example, the display control unit 26 increases the dimension L1 of the end portion PE along the pitch axis (as indicated by broken line a) as the pitch distribution range widens. It likewise increases the dimension L2 of the end portion PE along the time axis (as indicated by broken line b) as the sounding-point distribution range widens.
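The relationship in FIG. 10 amounts to two independent monotonic mappings: L1 from the pitch distribution range and L2 from the sounding-point distribution range. The base sizes, pixel scales, units (cents and seconds), and function name in this sketch are hypothetical.

```python
from typing import Tuple


def end_portion_size(pitch_range_cents: float, onset_range_s: float,
                     base_l1: float = 4.0, base_l2: float = 4.0,
                     px_per_cent: float = 0.2,
                     px_per_second: float = 100.0) -> Tuple[float, float]:
    """Return (L1, L2) for the end portions PE of indicator P (hypothetical scales).

    L1, the size along the pitch axis, grows only with the pitch distribution range.
    L2, the size along the time axis, grows only with the sounding-point distribution range.
    """
    l1 = base_l1 + pitch_range_cents * px_per_cent
    l2 = base_l2 + onset_range_s * px_per_second
    return l1, l2
```

Keeping each dimension tied to the range on its own axis mirrors the text's point that this pairing is easier to read than a crossed one.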

In this embodiment, the style of the indicator P varies with the widths of the pitch distribution range and the sounding-point distribution range. The user can therefore easily confirm visually the pitch distribution range and sounding-point distribution range of the composition part assigned to the indicator P. Moreover, the dimension L1 of the end portion PE along the pitch axis is controlled according to the width of the pitch distribution range, and the dimension L2 along the time axis is controlled according to the width of the sounding-point distribution range. Compared with, for example, a configuration that controls L1 according to the sounding-point distribution range or L2 according to the pitch distribution range, this has the further advantage that the user can grasp the widths of the two ranges intuitively. A configuration in which the setting unit 42 variably controls only one of the pitch distribution range and the sounding-point distribution range may also be adopted.

<D: Fourth Embodiment>
In each of the preceding embodiments, the user selects the plural pieces of sound data D that make up a composition part. However, it is laborious and difficult for users to select, on their own, sound data D that matches their taste or sensibility, or sound data D that is musically suitable for synthesizing a chorus (for example, sound data D that harmonizes musically). In this embodiment, therefore, sound data D matching a musical impression specified by the user is selected automatically and used as a composition part.

FIG. 11 is a block diagram of a sound data processing device 100C according to this embodiment. As shown in FIG. 11, the sound data processing device 100C adds an analysis unit 44 to the sound data processing device 100A of the first embodiment. The analysis unit 44 analyzes a musical feature value F for each of the n pieces of sound data D (D1 to Dn) stored in the storage device 12. The storage device 12 stores the feature value F (F1 to Fn) that the analysis unit 44 obtains for each piece of sound data D, in association with that sound data D. A configuration in which the sound data D and the feature values F are stored in separate storage devices may also be adopted. Where the sound data D and feature values F are prepared externally and then stored in the storage device 12, the analysis unit 44 may be omitted.

The analysis unit 44 extracts the feature value F using, for example, multivariate analysis (factor analysis). FIG. 12 is a conceptual diagram of the feature values F (F1 to Fn) stored in the storage device 12 for each piece of sound data D. As shown in FIG. 12, the feature value Fi of sound data Di (i = 1 to n) is a set of factor values X (X[i,I], X[i,II], X[i,III]). Each factor value is determined by multivariate analysis of the sound data Di for one of several factors that characterize the psychological impression of a voice (a metallic factor, a power factor, and an aesthetic factor). Although this embodiment uses three factors ([I] to [III]) as an example, the number of factors in the multivariate analysis of the sound data D (the number of factor values X contained in a feature value F) may be changed as desired.

The analysis unit 44 starts from the physical features of the voice represented by each piece of segment data of the sound data D (for example, volume, pitch, and frequency characteristics). From these, it evaluates an index value for each of several adjective pairs expressing psychological impressions of music (for example, “bright-dark” and “powerful-light”). It then identifies the factor values X by statistically aggregating the index values of the adjective pairs into the factors (metallic, power, and aesthetic). In FIG. 12, the factor values X of the metallic factor [I] (X[1,I], X[2,I], ..., X[n,I]) indicate the degree to which a listener perceives the voice as metallic. The factor values X of the power factor [II] (X[1,II], X[2,II], ..., X[n,II]) indicate the degree to which a listener perceives the voice as powerful. The factor values X of the aesthetic factor [III] (X[1,III], X[2,III], ..., X[n,III]) indicate the degree to which a listener perceives the voice as beautiful.
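The aggregation step can be pictured as a weighted sum in which each factor value X combines the adjective-pair index values, weighted by factor loadings. In an actual factor analysis the loadings would be estimated statistically. The numbers below are invented for illustration, and only the two adjective pairs named in the text are used. `LOADINGS` and `factor_values` are hypothetical names.

```python
from typing import Dict

# Hypothetical factor loadings (not from the text): weight of each
# adjective-pair index value in each factor.
LOADINGS: Dict[str, Dict[str, float]] = {
    "metallic":  {"bright-dark": 0.5,  "powerful-light": 0.25},
    "power":     {"bright-dark": 0.0,  "powerful-light": 1.0},
    "aesthetic": {"bright-dark": 0.75, "powerful-light": -0.25},
}


def factor_values(index_values: Dict[str, float]) -> Dict[str, float]:
    """Aggregate adjective-pair index values into one factor value X per factor."""
    return {
        factor: sum(w * index_values.get(pair, 0.0) for pair, w in weights.items())
        for factor, weights in LOADINGS.items()
    }
```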

The selection unit 32 in FIG. 11 selects (searches for), from the n pieces of sound data D stored in the storage device 12, plural pieces of sound data D whose feature values F are similar to a feature value FU set according to the user's instructions (hereinafter the “instructed feature value”). The pieces of sound data D selected by the selection unit 32 constitute a composition part.

When the user performs a predetermined operation on the input device 14, the display control unit 26 causes the display device 16 to display a setting image 72 (FIG. 13) for specifying the instructed feature value FU. While viewing the setting image 72, the user specifies the instructed feature value FU by operating the input device 14 as appropriate. The identification area 720 of the setting image 72 displays the identifier dB of the composition part being edited. The area 721 displays the number N of pieces of sound data D (the number of mixed voices) that should make up the composition part being edited. By operating the input device 14, the user can change the number N in the area 721, for example by entering N directly in the area 721 or by increasing or decreasing N with operator 722.

The target feature value FU contains a factor value U (U[I], U[II], U[III]) for each of the same three factors (metallic, power, and aesthetic) as the feature values F stored in the storage device 12. Each factor value U is set individually according to operations on the input device 14: the factor value U[I] of the metallic factor [I] is set with the control 723 in FIG. 13, the factor value U[II] of the power factor [II] with the control 724, and the factor value U[III] of the aesthetic factor [III] with the control 725. For example, a user who wants a metallic voice sets U[I] to a large value, and a user who wants a powerful voice sets U[II] to a large value. The user can also enter each factor value U directly as a number.

Operating the control 726 (Search) triggers a search for voice data D. More specifically, the selection unit 32 retrieves from the storage device 12, as a candidate ensemble part, the set of voice data D corresponding to the N feature values F most similar to the target feature value FU set at the moment the control 726 is operated. How the similarity between the target feature value FU and a feature value F is determined is described later.

The control 727 lets the user audition the sound corresponding to the N voice data D retrieved with the control 726 (a mixture of N kinds of voices). When the control 727 is operated, the selection unit 32 outputs to the voice synthesis unit 24, for each of the N voice data D retrieved immediately before, the segment data corresponding to a predetermined voice segment. The voice synthesis unit 24 outputs a voice signal SOUT by mixing voices of a predetermined pitch generated from the N segment data. The sound output device 18 therefore plays back a synthesized sound corresponding to the N voice data D retrieved from the factor values U specified by the user. By repeatedly changing the factor values U while listening to the synthesized sound from the sound output device 18, the user can build the desired ensemble part.
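The audition step mixes N voices rendered at a common pitch. The following Python sketch shows one way such a mix could be formed, assuming each voice has already been rendered to a list of float samples; the function name `mix_voices` and the 1/N gain are illustrative assumptions, not details from the specification.

```python
from typing import Sequence


def mix_voices(waveforms: Sequence[Sequence[float]]) -> list[float]:
    """Mix N mono waveforms sample by sample into a single signal.

    Shorter waveforms are treated as silent after they end. The sum is
    scaled by 1/N (an assumed gain rule), which keeps the mix within the
    range of the input samples.
    """
    if not waveforms:
        return []
    n = len(waveforms)
    length = max(len(w) for w in waveforms)
    mixed = [0.0] * length
    for w in waveforms:
        for t, sample in enumerate(w):
            mixed[t] += sample
    return [s / n for s in mixed]
```

A real synthesizer would more likely apply per-voice gain or panning; plain averaging is simply the easiest rule to check.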

The set of N voice data D that has been retrieved when the user operates the control 728 (that is, the N voice data D retrieved by the selection unit 32 immediately before the control 728 is operated) is confirmed as the ensemble part. More specifically, the selection unit 32 stores in the storage device 12 the identifier dA of each of the N voice data D retrieved at the time the control 728 is operated, in association with the identifier dB displayed in the identification area 720 (that is, the identifier dB of the ensemble part being created). In other words, the selection unit 32 builds an ensemble part by combining N voice data D whose feature values F are similar to the target feature value FU specified by the user. The ensemble part is used in the same way as in the first embodiment. When the control 729 (Cancel) is operated, the settings made in the setting image 72 are discarded.
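Confirming an ensemble part amounts to recording which voice data belong to which part. A minimal sketch, assuming string identifiers for dA and dB and an in-memory dictionary in place of the storage device 12; the `PartStore` class is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class PartStore:
    """Hypothetical in-memory stand-in for the part table in storage device 12."""

    # identifier dB of an ensemble part -> identifiers dA of its voice data D
    parts: dict[str, list[str]] = field(default_factory=dict)

    def commit(self, part_id: str, searched_ids: list[str]) -> None:
        """Control 728: keep the result of the most recent search as the part."""
        self.parts[part_id] = list(searched_ids)  # copy, so later searches do not alter it

    def cancel(self) -> None:
        """Control 729 (Cancel): discard the settings; stored parts stay unchanged."""
```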

Next, the determination of similarity between the target feature value FU and a feature value F is described. For each of the n feature values F1 to Fn stored in the storage device 12, the selection unit 32 calculates a numerical value R that serves as an index of similarity to the target feature value FU (hereinafter the "similarity index value R"). As shown in FIG. 14, the similarity index value R in this embodiment corresponds to a distance in a space whose coordinate axes correspond to the three factors (hereinafter the "factor space"). That is, the similarity index value Ri between the feature value Fi and the target feature value FU is the distance between the point in the factor space whose coordinates are the factor values U (U[I], U[II], U[III]) of the target feature value FU and the point whose coordinates are the factor values X (X[i,I], X[i,II], X[i,III]) of the feature value Fi. Specifically, the selection unit 32 calculates the Euclidean distance given by the following Equation (1) as the similarity index value Ri.
$$R_i = \sqrt{(X[i,\mathrm{I}] - U[\mathrm{I}])^2 + (X[i,\mathrm{II}] - U[\mathrm{II}])^2 + (X[i,\mathrm{III}] - U[\mathrm{III}])^2} \qquad (1)$$

As Equation (1) shows, the more similar the target feature value FU and the feature value F are, the smaller the similarity index value Ri becomes. The selection unit 32 therefore selects, as the elements of the ensemble part, the voice data D of the N feature values F with the smallest similarity index values R relative to the target feature value FU. Elements other than the analysis unit 44 and the selection unit 32 are the same as in the first embodiment.
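Equation (1) and the top-N rule can be sketched in a few lines of Python. This assumes each feature value F is a tuple of the three factor values (I, II, III); `similarity_index` and `select_top_n` are hypothetical names.

```python
import math
from typing import Sequence

# A feature value F (or target FU) as (metallic [I], power [II], aesthetic [III]).
Factors = tuple[float, float, float]


def similarity_index(x: Factors, u: Factors) -> float:
    """Equation (1): Euclidean distance between F_i and FU in the factor space."""
    return math.sqrt(sum((xk - uk) ** 2 for xk, uk in zip(x, u)))


def select_top_n(features: Sequence[Factors], target: Factors, n: int) -> list[int]:
    """Indices of the n feature values F with the smallest similarity index R."""
    ranked = sorted(range(len(features)), key=lambda i: similarity_index(features[i], target))
    return ranked[:n]
```

Sorting all n candidates costs O(n log n); for a large library, `heapq.nsmallest(n, ..., key=...)` returns the same N indices without a full sort.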

This embodiment achieves the same effects as the first embodiment. In addition, because a plurality of voice data D whose feature values F are similar to the user-specified target feature value FU are selected as the elements of an ensemble part, even a user who is not familiar with the characteristics of the voice corresponding to each voice data D obtains, for voice synthesis, an ensemble part made up of voice data D that match the user's taste and sensibility, or of voice data D that give similar musical impressions. This reduces the user's burden of organizing an ensemble part.

Moreover, because the number N of voice data D making up the ensemble part is set variably according to operations on the input device 14, the user can freely choose whether the ensemble part sounds like a small chorus or a large one. Furthermore, because the user can audition the actual synthesized sound with the control 727 in the setting image 72, the user can easily search for the target feature value FU (the combination of factor values U) that produces the desired synthesized sound.

The configuration of the second embodiment, in which the display mode of the indicator P of a designated sound changes according to the number N of voice data D making up the ensemble part, and the configuration of the third embodiment, in which fluctuations in pitch and sounding timing are applied to the voice of each voice data D of the ensemble part, can likewise be applied to this embodiment.

<E: Fifth Embodiment>
In the fourth embodiment, the user specifies the factor values U (U[I], U[II], U[III]) directly through the input device 14. In the fifth embodiment of the present invention, the feature value F of voice data D selected by the user is used as the target feature value FU. Description of parts common to the fourth embodiment is omitted.

By operating the input device 14, the user selects one voice data D (hereinafter the "selected voice data D") from among the n voice data D (D1 to Dn) stored in the storage device 12. The selection unit 32 acquires the feature value F of the selected voice data D from the storage device 12 and, as shown in FIG. 15, uses that feature value F (feature value F1 in FIG. 15) as the target feature value FU to calculate the similarity index values R (R1 to Rn) for each of the n voice data D (including the selected voice data D) by the same procedure as in the fourth embodiment. The selection unit 32 then selects, as the elements of the ensemble part, the N voice data D with the smallest similarity index values R (including the selected voice data D itself, whose similarity index value R takes the minimum value, zero).

With this configuration, the selection unit 32 automatically selects (that is, without requiring the user to choose them) N voice data D whose feature values F are similar to that of the selected voice data D designated by the user. This reduces the user's burden of selecting, as elements of an ensemble part, voice data D that give a musical impression similar to voice data D the user already knows (the selected voice data D). Although the above example includes the selected voice data D in the ensemble part, the selected voice data D may instead be excluded from the elements of the ensemble part.
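In the fifth embodiment the query point is simply the feature value of the user's selected voice data. A sketch under the same three-factor tuple assumption, using `math.dist` for Equation (1); the function `select_like` and its `include_selected` flag (covering the variant that excludes the selected voice data D) are hypothetical.

```python
import math
from typing import Sequence


def select_like(features: Sequence[Sequence[float]], selected: int, n: int,
                include_selected: bool = True) -> list[int]:
    """Fifth embodiment: use the selected voice data's feature value F as FU."""
    target = features[selected]
    candidates = [i for i in range(len(features)) if include_selected or i != selected]
    # Equation (1) via math.dist; the selected entry itself has R = 0.
    candidates.sort(key=lambda i: math.dist(features[i], target))
    return candidates[:n]
```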

<F: Sixth Embodiment>
In the fourth embodiment, the feature value F of each voice data D is used to select voice data D. In the sixth embodiment of the present invention, an attribute of the speaker of the voice corresponding to each voice data D is used together with the feature value F to select voice data D. Description of parts common to the fourth embodiment is omitted.

As shown in FIG. 16, the storage device 12 of the voice data processing device 100D according to this embodiment stores a feature value F (F1 to Fn) and an attribute A (A1 to An) for each of the n voice data D (D1 to Dn). The attribute A is information (a property or characteristic) relating to the speaker of the voice of the voice data D. In this embodiment, the speaker's gender is used as an example of the attribute A.

By operating the input device 14, the user specifies the desired speaker attribute A (gender). From among the n voice data D stored in the storage device 12, the selection unit 32 selects as the elements of the ensemble part the N voice data D that rank highest in ascending order of the similarity index value R between the target feature value FU and the feature value F (that is, in descending order of similarity) and whose attribute A matches the user's specification. Thus, for example, when the user specifies male as the attribute A, only N voice data D uttered by male speakers are selected and used for voice synthesis.
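The sixth embodiment adds an attribute filter to the ranking. The sketch below assumes gender is stored as a plain string and uses `math.dist` for Equation (1); `VoiceEntry` and `select_with_attribute` are illustrative names only. Filtering first and then taking the N nearest is one reading of "ranks high and matches the attribute".

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass
class VoiceEntry:
    voice_id: str                        # hypothetical identifier dA
    factors: tuple[float, float, float]  # feature value F
    gender: str                          # attribute A, e.g. "male" or "female"


def select_with_attribute(entries: Sequence[VoiceEntry], target: Sequence[float],
                          gender: str, n: int) -> list[str]:
    """The N entries closest to FU among those whose attribute A matches."""
    matching = [e for e in entries if e.gender == gender]
    matching.sort(key=lambda e: math.dist(e.factors, target))
    return [e.voice_id for e in matching[:n]]
```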

With this configuration, because the attribute A is used as a selection criterion for voice data D in addition to the feature value F, voice data D matching the user's taste and sensibility can be selected more easily and reliably than in the fourth embodiment. The target feature value FU may be the numerical values specified by the user in the setting image 72, as in the fourth embodiment, or the feature value F of the selected voice data D designated by the user, as in the fifth embodiment.

<G: Modifications>
The above embodiments can be modified in various ways. Specific modifications are exemplified below. Two or more of the following examples may be freely selected and combined.

(1) Modification 1
In each of the above embodiments, the indicator P of a designated sound to which a single part is assigned and the indicator P of a designated sound to which an ensemble part is assigned are changed to different display modes in response to user instructions (that is, all indicators P are initially displayed in a common display mode). Alternatively, the display control unit 26 may automatically (that is, independently of user instructions) display the indicator P of a single part and the indicator P of an ensemble part differently. For example, when the user assigns a single part to an indicator P, the display control unit 26 displays that indicator P in an initial display mode prepared for single parts, and when the user assigns an ensemble part to an indicator P, the display control unit 26 displays that indicator P in an initial display mode prepared for ensemble parts, separate from the one for single parts. As in the above embodiments, the display mode of each indicator P can then be changed from its initial display mode by user operation.
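Modification 1 amounts to choosing a default display mode from the part type at assignment time. A hypothetical Python sketch: the concrete style values (`border`, `fill`) are placeholders, since the specification only states that single and ensemble parts each have their own initial display mode.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class IndicatorStyle:
    border: str
    fill: str


# Placeholder initial display modes; the specification does not define concrete values.
DEFAULT_SINGLE = IndicatorStyle(border="thin", fill="white")
DEFAULT_ENSEMBLE = IndicatorStyle(border="thick", fill="gray")


def initial_style(num_voice_data: int) -> IndicatorStyle:
    """Pick the initial display mode automatically from the number of assigned voice data D."""
    return DEFAULT_SINGLE if num_voice_data == 1 else DEFAULT_ENSEMBLE
```

A later user edit can then start from the default, for example `replace(initial_style(3), fill="blue")`, which keeps the ensemble border and changes only the fill.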

(2) Modification 2
Each of the above embodiments illustrates the case where the user creates or edits the music information SD while viewing the music information image 60, but existing music information SD may also be displayed as the music information image 60 for the user to review. The existing music information SD is stored in the storage device 12, for example, via a portable recording medium or a communication network. Creating or editing the music information SD is not essential to the present invention.

(3) Modification 3
The display modes of the indicator P controlled according to the combination of voice data D assigned by the allocation unit 34 are not limited to the above examples. Any visually perceptible property of the indicator P, such as its size, display color (hue, lightness, saturation), or shape, can be controlled according to the combination of voice data D. A configuration in which the user variably sets the display mode of the indicator P is also not essential to the present invention; for example, the indicator P may be displayed in a display mode selected automatically (that is, independently of user instructions) by the display control unit 26 according to the combination of voice data D. Further, whereas in the above embodiments the display control unit 26 automatically displays the indicators P of designated sounds that share a part (combination of voice data D) in a common display mode, the user may instead set the display mode of the indicator P individually for each of several designated sounds assigned the same part.
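One way to realize the automatic rule that designated sounds sharing a part get a common display mode is to key each display color by the set of assigned voice data. This sketch assumes note indices map to lists of voice-data identifiers and that a small color palette is acceptable; the function name and palette are hypothetical.

```python
from itertools import cycle


def styles_by_combination(assignments: dict[int, list[str]],
                          palette: list[str]) -> dict[int, str]:
    """Give every designated sound a color; sounds sharing a voice-data combination share it.

    `assignments` maps a note index to the identifiers of its assigned voice data D.
    Colors are handed out in palette order and repeat once the palette runs out.
    """
    color_of_combination: dict[frozenset[str], str] = {}
    colors = cycle(palette)
    result: dict[int, str] = {}
    for note, voice_ids in assignments.items():
        key = frozenset(voice_ids)  # the order in which voice data were assigned does not matter
        if key not in color_of_combination:
            color_of_combination[key] = next(colors)
        result[note] = color_of_combination[key]
    return result
```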

(4) Modification 4
Each of the above embodiments uses Equation (1) to calculate the similarity index value Ri. Alternatively, the similarity index value Ri, which serves as the criterion for selecting voice data D, may preferably be calculated after applying a separate weight W for each factor to the factor values (X, U) of the feature value F and the target feature value FU. For example, the selection unit 32 calculates the similarity index value Ri using the following Equation (2) instead of Equation (1); Equation (2) includes a weight W[I] for the metallic factor, a weight W[II] for the power factor, and a weight W[III] for the aesthetic factor.
$$R_i = \sqrt{W[\mathrm{I}]\,(X[i,\mathrm{I}] - U[\mathrm{I}])^2 + W[\mathrm{II}]\,(X[i,\mathrm{II}] - U[\mathrm{II}])^2 + W[\mathrm{III}]\,(X[i,\mathrm{III}] - U[\mathrm{III}])^2} \qquad (2)$$

Each weight W (W[I], W[II], W[III]) is set freely according to user operations on the input device 14. In this configuration, the influence of each factor on selection by the selection unit 32 is variably controlled by the weights W, which has the advantage of diversifying the combinations of voice data D in the ensemble part. When the distance corresponding to a unit of the factor value X differs from factor to factor along the coordinate axes of the factor space (that is, when each axis has a different scale), it is preferable to normalize the factor values (coordinate values) by setting a separate weight W for each factor. For example, if Equation (2) is computed using, as the weight W of each factor, the reciprocal of the variance of the distribution of that factor's values X over the plurality of feature values F along its coordinate axis, the differences between the axes are compensated for and an appropriate similarity index value Ri is obtained. Equations (1) and (2) are, however, merely examples of formulas for calculating the similarity index value Ri; any known technique may be used to evaluate the similarity between the target feature value FU and a feature value F.
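Equation (2) with inverse-variance weights can be sketched as follows. Population variance (`statistics.pvariance`) is an assumption, since the source only says "the variance of the distribution". An axis on which all factor values are equal gets weight 0 here: it would add the same constant under the square root for every candidate, so it cannot change the ranking. The function names are hypothetical.

```python
import math
from statistics import pvariance
from typing import Sequence


def inverse_variance_weights(features: Sequence[Sequence[float]]) -> list[float]:
    """One weight W per factor axis: 1 / (population variance of that factor's values X)."""
    weights = []
    for axis_values in zip(*features):
        var = pvariance(axis_values)
        # A constant axis adds the same term for every candidate, so weight 0 is harmless.
        weights.append(0.0 if var == 0 else 1.0 / var)
    return weights


def weighted_index(x: Sequence[float], u: Sequence[float], w: Sequence[float]) -> float:
    """Equation (2): weighted Euclidean distance between F_i and FU."""
    return math.sqrt(sum(wk * (xk - uk) ** 2 for xk, uk, wk in zip(x, u, w)))
```

With these weights, a difference of 5 on a widely spread axis can count for less than a difference of 1 on a narrowly spread one, which is the normalization effect the text describes.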

(5) Modification 5
The feature value F (and the target feature value FU) in the fourth to sixth embodiments is not limited to factor values obtained by multivariate analysis. More specifically, a configuration that uses a characteristic value of the voice represented by the voice data D (for example, its frequency characteristics or volume) as the feature value F, or one that uses such characteristic values together with the factor value X of each factor, is also preferable. For example, the N voice data D that rank highest in similarity between their factor values X and the factor values U of the target feature value FU, and whose characteristic values satisfy a condition specified by the user, are selected as the elements of the ensemble part.
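Modification 5 combines factor-value ranking with a condition on a raw characteristic value. The sketch below passes the user's condition in as a predicate; the `Candidate` class, its `volume` field, and the threshold used in the test are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Candidate:
    voice_id: str
    factors: tuple[float, float, float]  # factor values X
    volume: float                        # hypothetical characteristic value (normalized level)


def select_with_condition(candidates: Sequence[Candidate], target: Sequence[float], n: int,
                          condition: Callable[[Candidate], bool]) -> list[str]:
    """N candidates nearest to FU (Equation (1)) among those meeting the user's condition."""
    eligible = [c for c in candidates if condition(c)]
    eligible.sort(key=lambda c: math.dist(c.factors, target))
    return [c.voice_id for c in eligible[:n]]
```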

(6) Modification 6
In the fourth to sixth embodiments, the user specifies the number N of voice data D making up the ensemble part, but the number N may instead be fixed at a predetermined value. Also in the fourth to sixth embodiments, the selection unit 32 may suitably select a single voice data D as a single part. That is, the selection unit 32 selects, as a single part, the one voice data D whose feature value F is most similar to the target feature value FU among the n voice data D stored in the storage device 12.
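Selecting a single part is the N = 1 case of the ranking, an argmin over Equation (1). A minimal sketch under the same three-factor tuple assumption; `select_single_part` is a hypothetical name.

```python
import math
from typing import Sequence


def select_single_part(features: Sequence[Sequence[float]], target: Sequence[float]) -> int:
    """Index of the one feature value F most similar to FU (smallest Equation (1) distance)."""
    return min(range(len(features)), key=lambda i: math.dist(features[i], target))
```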

(7) Modification 7
For convenience, each of the above embodiments illustrates the case where each voice data D is generated from the voice of a different speaker, but a plurality of voice data D may be generated from different voices uttered by the same speaker. Likewise, although the above embodiments assume the synthesis of human vocal sounds for convenience, a configuration that synthesizes the performance sounds of various musical instruments (that is, one in which each voice data D is generated from an instrument's performance sound) may also be adopted. As described above, "voice" in the present invention is a concept that encompasses both sounds uttered by humans (speech and singing) and the performance sounds of musical instruments.

(8) Modification 8
The total number of voices represented by one voice data D may be changed as appropriate. For example, each of the above embodiments illustrates, for convenience, the case where one voice data D is generated from the voice of one speaker (or the performance sound of one instrument), but one voice data D may also be generated from a mixture of several voices produced in parallel (for example, sounds uttered by several speakers (a chorus) or sounds played by several instruments (an ensemble)). In this aspect, the display control unit 26 causes the display device 16 (the work area 62 of the music information image 60) to display the indicator P of a designated sound to which the allocation unit 34 has assigned one voice data D corresponding to a mixture of several voices. The number M of voices mixed in the voice data D (that is, the number of singers in the chorus or the total number of instruments used in the ensemble) is stored in the storage device 12 as an attribute A of that voice data D. The display control unit 26 variably controls the display mode of the indicator P according to the mixture number M specified by the attribute A. Any method may be used to control the display mode of the indicator P according to the mixture number M; for example, the display mode illustrated in FIG. 7 (with the number N of voice data D of the second embodiment replaced by the mixture number M) is suitable. This configuration also has the advantage that the user can easily identify the indicator P of a designated sound to which a mixture of several voices is assigned.
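Modification 8 leaves the mapping from the mixture number M to a display mode open, and FIG. 7 is not reproduced in this text. Purely as an illustration, one monotonic mapping thickens the indicator border as M grows, up to a cap; the function name and every parameter value below are assumptions.

```python
def border_width_for_mix(m: int, base: float = 1.0, step: float = 0.5,
                         max_width: float = 4.0) -> float:
    """Hypothetical mapping from mixture number M to an indicator border width.

    The width grows linearly with M and is capped so that large choirs stay legible.
    """
    if m < 1:
        raise ValueError("mixture number M must be at least 1")
    return min(base + step * (m - 1), max_width)
```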

(9) Modification 9
The configuration in which a designated sound assigned a plurality of voice data D is displayed with a single indicator P (first to third embodiments) and the configuration in which voice data D whose feature value F is similar to a target feature value FU set according to user instructions are selected for synthesis (fourth to sixth embodiments) can each be implemented independently of the other. For example, in the fourth to sixth embodiments, the configuration in which a designated sound assigned a plurality of voice data D is displayed with a single indicator P, and the configuration in which the display mode of the indicator P is variable, may be omitted as appropriate.

(10) Modification 10
The output destination of the voice signal SOUT is not limited to the sound output device 18. For example, the voice signal SOUT may be stored in the storage device 12 (or another recording medium) or transmitted over a communication network.
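For Modification 10, storing SOUT on a recording medium could be as simple as writing a WAV file. A sketch using Python's standard `wave` module, assuming SOUT is a mono float signal in [-1.0, 1.0] and that 16-bit PCM output is acceptable; `signal_to_wav_bytes` and `save_signal` are hypothetical names.

```python
import io
import struct
import wave


def signal_to_wav_bytes(samples: list[float], sample_rate: int = 44100) -> bytes:
    """Encode a mono float signal in [-1.0, 1.0] as 16-bit PCM WAV data."""
    frames = b"".join(
        struct.pack("<h", int(round(max(-1.0, min(1.0, s)) * 32767))) for s in samples
    )
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes = 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(frames)
    return buffer.getvalue()


def save_signal(path: str, samples: list[float], sample_rate: int = 44100) -> None:
    """Store the signal in a file, e.g. on the storage device or another recording medium."""
    with open(path, "wb") as f:
        f.write(signal_to_wav_bytes(samples, sample_rate))
```

Returning bytes keeps the encoder independent of the destination, so the same data could equally be sent over a network.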

Brief Description of Drawings
FIG. 1 is a block diagram of the voice data processing device according to the first embodiment of the present invention.
FIG. 2 is a schematic diagram of music information.
FIG. 3 is a schematic diagram of a setting image for editing an ensemble part.
FIG. 4 is a schematic diagram of a music information image.
FIG. 5 is a schematic diagram of a setting image for specifying the display mode of the border of an indicator.
FIG. 6 is a schematic diagram of a setting image for specifying the display mode of the interior region of an indicator.
FIG. 7 is a conceptual diagram explaining control of the display mode of an indicator in the second embodiment of the present invention.
FIG. 8 is a block diagram of the voice data processing device according to the third embodiment of the present invention.
FIG. 9 is a schematic diagram of a setting image for specifying a pitch distribution range and a sounding-point distribution range.
FIG. 10 is a conceptual diagram explaining how the display mode of an indicator changes according to the pitch distribution range and the sounding-point distribution range.
FIG. 11 is a block diagram of the voice data processing device according to the fourth embodiment of the present invention.
FIG. 12 is a schematic diagram of feature values.
FIG. 13 is a schematic diagram of a setting image for specifying a target feature value.
FIG. 14 is a conceptual diagram explaining the calculation of the similarity index value.
FIG. 15 is a conceptual diagram explaining the calculation of the similarity index value in the fifth embodiment of the present invention.
FIG. 16 is a block diagram of the voice data processing device according to the sixth embodiment of the present invention.

Explanation of symbols

100A, 100B, 100C, 100D …… voice data processing device, 10 …… control device, 12 …… storage device, 14 …… input device, 16 …… display device, 18 …… sound output device, 22 …… information generation unit, 24 …… voice synthesis unit, 26 …… display control unit, 32 …… selection unit, 34 …… allocation unit, 42 …… setting unit, 44 …… analysis unit, 52, 54, 56, 58, 72 …… setting image, 60 …… music information image, D (D1 to Dn) …… voice data, SD …… music information, SOUT …… voice signal.

Claims

A voice data processing device comprising:
storage means for storing a plurality of voice data used for synthesizing different voices;
allocating means for allocating two or more of the plurality of voice data to one designated sound; and
display control means for causing a display device to display a music information image in which, for each designated sound, an indicator is arranged whose position along a first axis is selected according to the pitch of the designated sound and whose position along a second axis is selected according to the sounding time of the designated sound.

2. The sound data processing device according to claim 1, wherein the display control means displays, in mutually different display modes, the indicators of designated sounds that differ in the combination of sound data allocated to them by the allocating means.

3. The sound data processing device according to claim 1 or 2, wherein the allocating means is capable of selectively allocating either one piece of sound data or two or more pieces of sound data to one designated sound, and
the display control means displays, in mutually different display modes, the indicator of a designated sound to which the allocating means has allocated one piece of sound data and the indicator of a designated sound to which the allocating means has allocated two or more pieces of sound data.

4. The sound data processing device according to any one of claims 1 to 3, wherein the display control means displays, in a common display mode, the indicators of designated sounds that share the same combination of sound data allocated by the allocating means.

5. The sound data processing device according to claim 1, wherein the display control means variably sets the display mode of each indicator according to an instruction from a user.

6. The sound data processing device according to claim 1, wherein the display control means changes the display mode of the indicator of a designated sound according to the number of pieces of sound data allocated to that designated sound by the allocating means.

7. The sound data processing device according to any one of claims 1 to 6, further comprising first setting means for variably setting a distribution range of the pitches of the sounds corresponding to the respective two or more pieces of sound data allocated to one designated sound by the allocating means,
wherein the display control means changes the display mode of the indicator of the designated sound according to the distribution range set for that designated sound by the first setting means.

8. The sound data processing device according to any one of claims 1 to 7, further comprising second setting means for variably setting a distribution range of the time points at which the sounds corresponding to the respective two or more pieces of sound data allocated to one designated sound by the allocating means start,
wherein the display control means changes the display mode of the indicator of the designated sound according to the distribution range set for that designated sound by the second setting means.

9. The sound data processing device according to any one of claims 1 to 8, wherein the storage means stores a feature quantity of each of the plurality of pieces of sound data,
the device further comprises selecting means for selecting, from among the plurality of pieces of sound data stored in the storage means, two or more pieces of sound data whose feature quantities are similar to an instruction feature quantity specified by a user, and
the allocating means allocates the two or more pieces of sound data selected by the selecting means to one designated sound.

10. A program for causing a computer, which comprises storage means for storing a plurality of pieces of sound data used for synthesizing mutually different sounds, to execute:
an allocation process of allocating two or more pieces of sound data among the plurality of pieces of sound data to one designated sound; and
a display control process of causing a display device to display a music information image in which, for each designated sound, an indicator is arranged whose position along a first axis is selected according to the pitch of the designated sound and whose position along a second axis is selected according to the sounding time point of the designated sound.
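As a rough illustration of the mechanism described in claims 1 to 4 and 9, the following Python sketch places one indicator per designated sound on a piano roll (pitch on the first axis, sounding time point on the second), gives designated sounds with the same combination of allocated sound data the same fill, marks single versus multiple allocation with a different border, and selects sound data by feature similarity. It is a minimal sketch under stated assumptions, not the patented implementation: all names, the pixel scale, the MIDI pitch scale, the palette, and the use of Euclidean distance as the similarity measure are hypothetical choices that the claims do not specify.

```python
from __future__ import annotations

import math
from dataclasses import dataclass

# Hypothetical drawing scale (not specified in the patent).
PIXELS_PER_SECOND = 100.0
PIXELS_PER_SEMITONE = 10.0
TOP_PITCH = 127  # assumes MIDI note numbers; pitch 127 is drawn at y = 0


@dataclass(frozen=True)
class DesignatedSound:
    pitch: int                   # note number (assumed MIDI scale)
    onset: float                 # sounding time point, seconds
    duration: float              # time length, seconds
    sound_data: tuple[str, ...]  # IDs of allocated sound data, e.g. ("D1", "D3")


@dataclass(frozen=True)
class Indicator:
    x: float      # second axis: sounding time point
    y: float      # first axis: pitch (higher pitch -> smaller y, nearer the top)
    width: float  # drawn length proportional to the time length
    fill: str     # display mode keyed by the combination of sound data
    border: str   # display mode keyed by one vs. several sound data


def place(note: DesignatedSound, fill: str) -> Indicator:
    """Position one indicator from the pitch and sounding time point."""
    return Indicator(
        x=note.onset * PIXELS_PER_SECOND,
        y=(TOP_PITCH - note.pitch) * PIXELS_PER_SEMITONE,
        width=note.duration * PIXELS_PER_SECOND,
        fill=fill,
        border="single" if len(note.sound_data) == 1 else "double",
    )


def build_indicators(notes: list[DesignatedSound], palette: list[str]) -> list[Indicator]:
    """Sounds sharing a combination of sound data share a fill; each new
    combination takes the next palette entry. Fills only stay distinct if the
    palette has at least as many entries as there are combinations."""
    fills: dict[frozenset[str], str] = {}
    for note in notes:
        key = frozenset(note.sound_data)  # order-insensitive combination
        if key not in fills:
            fills[key] = palette[len(fills) % len(palette)]
    return [place(n, fills[frozenset(n.sound_data)]) for n in notes]


def select_similar(features: dict[str, tuple[float, ...]],
                   target: tuple[float, ...], k: int) -> list[str]:
    """Return the k sound data IDs whose feature vectors are closest to the
    target (Euclidean distance, an assumed similarity measure)."""
    return sorted(features, key=lambda name: math.dist(features[name], target))[:k]
```

Keying the fill on a `frozenset` makes an allocation of ("D1", "D2") and one of ("D2", "D1") count as the same combination, which is one plausible reading of "share the combination" in claim 4; a real implementation might instead key on an ordered tuple.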