JP5082760B2

JP5082760B2 - Sound control apparatus and program

Info

Publication number: JP5082760B2
Application number: JP2007275173A
Authority: JP
Inventors: 啓嘉山
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2007-10-23
Filing date: 2007-10-23
Publication date: 2012-11-28
Anticipated expiration: 2027-10-23
Also published as: JP2009103893A

Description

The present invention relates to a technique for controlling sound in response to voice input.

Techniques for generating a sound corresponding to the phoneme of an input voice have been proposed in the past. For example, Patent Document 1 discloses a technique that outputs a rhythm sound corresponding to a phoneme identified by speech recognition of the input voice. That is, among a plurality of voice patterns registered in advance, the voice pattern that correlates with the input voice is identified by speech recognition, and the rhythm sound corresponding to that voice pattern is output.
Patent Document 1: Japanese Patent Laid-Open No. 9-281968

However, the technique of Patent Document 1 requires speech recognition of the input voice. Consequently, a large-capacity storage device is needed to hold the voice patterns registered in advance by the user, and the speech recognition imposes an excessive processing load on the arithmetic processing device. In view of these circumstances, an object of the present invention is to generate a sound corresponding to the phoneme of an input voice without requiring speech recognition.

Exploiting the fact that the distribution of energy across the frequency bands of an input voice (its frequency spectrum) differs from phoneme to phoneme, the sound control device according to the present invention comprises: index calculation means that calculates a phoneme index value, which varies with the phoneme of the input voice, based on the intensity of the component of the input voice in a specific band; sound selection means that selects one of a plurality of sounds based on the phoneme index value; peak detection means that detects a peak value of the input voice; threshold setting means that variably sets a threshold according to the phoneme index value; sounding determination means that determines whether the peak value exceeds the threshold; and data generation means that generates sound data indicating generation of the sound selected by the sound selection means when the sounding determination means determines that the peak value exceeds the threshold.

In this configuration, the phoneme index value, which serves as an indicator of the phoneme of the input voice, is calculated from the intensity of a specific component of the input voice, so speech recognition of the input voice is unnecessary in principle. This has the advantage that a large-capacity storage device for storing voice patterns is not needed and that the processing load for discriminating phonemes is reduced. In addition, because the threshold that is compared with the peak value to decide whether the sound selected by the sound selection means is generated is set variably, the frequency of sounding in response to the peak value of the input voice can be made uniform regardless of the type of phoneme. For example, the threshold setting means variably controls the threshold according to the phoneme index value so that the threshold is lower for phonemes whose peak value tends to be low. Alternatively, lowering the threshold when the phoneme index value indicates a specific phoneme also realizes a configuration in which the frequency (likelihood) of generating the sound corresponding to that phoneme is raised relative to other phonemes (that is, the frequency of sounding is made non-uniform according to the type of phoneme). The format of the sound data is arbitrary. For example, data that includes a designation of a sound (a note number), such as MIDI data, or data representing the time waveform of a sound (waveform data) is suitable as the sound data. The specific band of the input voice used to calculate the phoneme index value is selected so that the difference in the phoneme index value between phonemes of the input voice is pronounced (that is, so that it includes a band in which the phoneme-dependent features of the energy distribution along the frequency axis appear prominently).

In a preferred aspect of the present invention, the index calculation means includes: filter processing means that extracts the component in the specific band from the input voice; first intensity detection means (for example, the intensity detection unit 144 in FIG. 1) that detects the intensity of the component processed by the filter processing means; second intensity detection means (for example, the intensity detection unit 146 in FIG. 1) that detects the intensity of the input voice; and calculation means that calculates the phoneme index value based on the relative ratio between the intensity detected by the first intensity detection means and the intensity detected by the second intensity detection means. According to this aspect, the phoneme index value is calculated from the relative ratio between the intensity of the component selectively extracted from the input voice and the intensity of the input voice itself, so a phoneme index value that varies appropriately with the phoneme can be calculated regardless of differences in the intensity of the input voice. Calculating the phoneme index value based on the relative ratio of intensities covers both using the ratio itself as the phoneme index value and calculating the phoneme index value from a function that includes the ratio as a variable.

In a preferred aspect of the present invention, the index calculation means calculates a phoneme index value for each of a plurality of components belonging to separate bands of the input voice, and the sound selection means selects a sound based on the plurality of phoneme index values. Compared with a configuration in which a single phoneme index value is calculated from the component in a single band of the input voice, this aspect makes it possible to diversify the candidates (types of sound) that can be selected according to the phoneme index values.

A sound control device according to a preferred aspect of the present invention includes corresponding sound setting means that variably sets the relationship between phoneme index values and sounds, and the sound selection means selects, under the relationship set by the corresponding sound setting means, the sound corresponding to the phoneme index value calculated by the index calculation means. According to this aspect, the relationship between phoneme index values and sounds is set variably, so that, for example, a sound desired by the user can be reproduced in response to each phoneme.

In a preferred aspect of the present invention, the threshold setting means variably sets each of a first threshold for determining sounding and a second threshold for determining muting according to the phoneme index value. The sounding determination means determines whether the peak value detected by the peak detection means exceeds the first threshold and whether the peak value detected by the peak detection means falls below the second threshold. The data generation means generates sound data indicating generation of the sound selected by the sound selection means when the sounding determination means determines that the peak value exceeds the first threshold, and generates sound data indicating muting of that sound when the sounding determination means determines that the peak value falls below the second threshold.

A sound control device according to a preferred aspect of the present invention includes corresponding volume setting means that variably sets the relationship between the intensity of the input voice (for example, its volume or peak value) and the volume of the sound indicated by the sound data, and volume determination means that determines, under the relationship set by the corresponding volume setting means, the volume corresponding to the peak value detected by the peak detection means. The data generation means generates sound data indicating a sound at the volume determined by the volume determination means. According to this aspect, the relationship between the intensity of the input voice and the volume of the sound indicated by the sound data is set variably, so that, for example, a mode that secures sufficient playback volume even when the input voice is quiet, or a mode that restrains the playback volume even when the input voice is loud, can be adopted as appropriate.

The sound control device according to the present invention may be realized by hardware (electronic circuitry) dedicated to each process, such as a DSP (Digital Signal Processor), or by cooperation between a program and a general-purpose arithmetic processing device such as a CPU (Central Processing Unit). The program according to the present invention causes a computer to execute: an index calculation process (for example, step S3 in FIG. 5) that calculates a phoneme index value, which varies with the phoneme of the input voice, based on the intensity of the component of the input voice in a specific band; a sound selection process (for example, step S5 in FIG. 5) that selects one of a plurality of sounds based on the phoneme index value; a peak detection process that detects a peak value of the input voice; a threshold setting process that variably sets a threshold according to the phoneme index value; a sounding determination process that determines whether the peak value exceeds the threshold; and a data generation process (for example, step S11 in FIG. 5) that generates sound data indicating generation of the sound selected in the sound selection process when the sounding determination process determines that the peak value exceeds the threshold. This program provides the same operation and effects as the sound control device according to the present invention. The program of the present invention may be provided to the user stored on a computer-readable recording medium and installed on a computer, or provided by distribution over a communication network and installed on a computer.

<A: First Embodiment>
FIG. 1 is a block diagram showing the configuration of a sound control device according to the first embodiment of the present invention. The sound control device 100 generates a percussion performance sound corresponding to the phoneme of an onomatopoeic word uttered by the user (for example, a vocalization such as "don" or "pan" that imitates the sound of a percussion instrument). For example, when the user utters the onomatopoeia "don", a bass drum sound is played, and when the user utters the onomatopoeia "pan", a hi-hat cymbal sound is played.

As shown in FIG. 1, the sound control device 100 is realized by a computer system that includes a control device 10 and a storage device 40. The control device 10 is an arithmetic processing device that performs various processes by executing a program. The storage device 40 stores the program executed by the control device 10 and the various data used by the control device 10.

An input device 50, an A/D converter 62, and a sound source circuit 72 are connected to the control device 10. The input device 50 consists of a plurality of controls operated by the user, who enters various instructions into the sound control device 100 by operating the input device 50 as appropriate. A sound collection device 64 is connected to the A/D converter 62. The sound collection device 64 picks up the voice V uttered by the user (hereinafter "input voice"). The A/D converter 62 generates a digital audio signal SV representing the time waveform of the input voice V picked up by the sound collection device 64.

By functioning as each of the elements illustrated in FIG. 1, the control device 10 generates and outputs sound data DS indicating a percussion performance sound that corresponds to the input voice V (audio signal SV). The sound data DS is digital data in a format conforming to the MIDI (Musical Instrument Digital Interface) standard. The sound source circuit 72 (a MIDI tone generator) generates, based on the sound data DS, a data sequence representing the waveform of the percussion performance sound. The data sequence output from the sound source circuit 72 is converted into an analog sound signal by a D/A converter 74. A sound emitting device 76 amplifies the sound signal output by the D/A converter 74 and radiates a sound wave corresponding to the amplified signal.

Next, the functional configuration of the control device 10 is described. The division unit 12 in FIG. 1 divides the audio signal SV (input voice V) into a plurality of frames along the time axis (for example, sections of about 1 millisecond each). The audio signal SV of each frame is supplied to the index calculation unit 14 and the peak detection unit 16.
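The framing performed by the division unit 12 can be sketched as follows. This is a minimal illustration only: the function name `split_into_frames`, the use of non-overlapping frames, and the sample rate in the usage note are assumptions, since the text specifies only a frame length of about 1 millisecond.

```python
def split_into_frames(samples, sample_rate_hz, frame_ms=1.0):
    """Split a sample sequence into consecutive, non-overlapping frames.

    Non-overlapping frames are an assumption; the text only gives the
    approximate frame duration.
    """
    # Number of samples per frame, at least one sample.
    frame_len = max(1, int(round(sample_rate_hz * frame_ms / 1000.0)))
    # The final frame may be shorter than frame_len.
    return [samples[i:i + frame_len] for i in range(0, len(samples), frame_len)]
```

For example, at an assumed sample rate of 8000 Hz a 1 ms frame holds 8 samples, so 100 samples yield 12 full frames and one shorter trailing frame.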

The index calculation unit 14 generates a phoneme index value A for the audio signal SV of each frame. The phoneme index value A is a numerical value that varies with the phoneme of the input voice V. That is, input voices V whose phoneme index values A differ sufficiently are discriminated as separate phonemes.

FIG. 2 is a graph showing the general shape of the frequency spectrum Q of an uttered sound for each type of phoneme. Part (A) of FIG. 2 is the frequency spectrum Q of bilabial sounds (/b/, /p/), part (B) is the frequency spectrum Q of alveolar sounds (/t/, /d/), and part (C) is the frequency spectrum Q of velar sounds (/k/, /g/). As each part of FIG. 2 shows, the frequency spectrum Q of an uttered sound differs from phoneme to phoneme depending on how the sound is articulated and on its combination with the following vowel. For example, the frequency spectrum Q of a bilabial sound (part (A)) is distributed so that the intensity decreases toward higher frequencies, whereas the frequency spectrum Q of an alveolar sound (part (B)) is distributed so that the intensity decreases toward lower frequencies. The frequency spectrum Q of a velar sound (part (C)) has its maximum intensity in the midrange, with the intensity falling off in the low and high ranges.

Exploiting this phenomenon, namely that the frequency spectrum Q of the input voice V differs according to the phoneme, the index calculation unit 14 in FIG. 1 calculates the phoneme index value A based on the intensity of the component of the input voice V in a specific frequency band (hereinafter "discrimination band"). As shown in FIG. 1, the index calculation unit 14 of this embodiment consists of a filter processing unit 142, an intensity detection unit 144, an intensity detection unit 146, and a calculation unit 148.

The filter processing unit 142 selectively extracts the component VC of the audio signal SV that lies within the discrimination band. For example, a low-pass filter whose cutoff frequency is the upper limit of the discrimination band, or a band-pass filter whose passband is the discrimination band, is suitable as the filter processing unit 142. The discrimination band is selected statistically or experimentally so that, among the phonemes to be distinguished by the phoneme index value A, the difference in the distribution of the frequency spectrum Q is pronounced within that band. For convenience, this embodiment assumes the case of distinguishing the bilabial sounds of part (A) of FIG. 2 (for example, onomatopoeia such as "ban" and "pan") from the alveolar sounds of part (B) of FIG. 2 (for example, onomatopoeia such as "tan" and "don"). In the band BL (low range) below the frequency fc1 shown in parts (A) and (B) of FIG. 2, the difference between the frequency spectra Q of bilabial and alveolar sounds is pronounced. The band BL is therefore set in the filter processing unit 142 as the discrimination band.
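As a rough illustration of extracting the low band BL, the sketch below uses a one-pole low-pass filter. The filter type is an assumption chosen for brevity (the text only calls for a low-pass or band-pass filter), and the function name `low_pass`, the sample rate, and the cutoff standing in for fc1 are all hypothetical.

```python
import math

def low_pass(samples, sample_rate_hz, cutoff_hz):
    """One-pole low-pass filter: y[n] = y[n-1] + alpha * (x[n] - y[n-1]).

    cutoff_hz plays the role of fc1 (the upper limit of band BL); its
    value is not given in the text and must be chosen by the caller.
    """
    # Standard smoothing coefficient for a one-pole low-pass filter.
    alpha = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / sample_rate_hz)
    out = []
    y = 0.0
    for x in samples:
        y += alpha * (x - y)
        out.append(y)
    return out
```

A one-pole filter has a gentle roll-off, so a practical design would likely use a steeper filter; the point here is only that the component VC is the input with content above the cutoff attenuated.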

The intensity detection unit 144 in FIG. 1 detects, for each frame, the intensity (power) PC of the component VC extracted by the filter processing unit 142. The intensity PC is, for example, the square root of the sum of the squared amplitude values of the samples in the frame that represent the waveform of the component VC (that is, the samples of the filtered audio signal SV), divided by the total number of samples in the frame. The intensity detection unit 146, on the other hand, detects for each frame the intensity (power) P0 of the input voice V that has not passed through the filter processing unit 142. The intensity P0 is calculated in the same way as the intensity PC.
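The intensity formula described above can be written directly as a short function. The name `frame_intensity` and the handling of an empty frame are assumptions added for the sketch.

```python
import math

def frame_intensity(frame):
    """Square root of the sum of squared amplitudes, divided by the sample count."""
    if not frame:
        # Assumed convention: an empty frame has zero intensity.
        return 0.0
    return math.sqrt(sum(x * x for x in frame)) / len(frame)
```

Note that this differs from the conventional root-mean-square value by a factor of the square root of the frame length. Because PC and P0 are computed over frames of the same length, that factor cancels when the two are divided.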

The calculation unit 148 calculates the ratio of the intensity PC to the intensity P0 as the phoneme index value A (A = PC/P0). As can be seen from part (A) of FIG. 2, the intensity PC within the discrimination band (band BL) is high for bilabial sounds, so the phoneme index value A is large when the phoneme of the input voice V is a bilabial sound. For alveolar sounds, as in part (B) of FIG. 2, the intensity PC within the discrimination band (band BL) is low, so the phoneme index value A is small when the phoneme of the input voice V is an alveolar sound. The phoneme of the input voice V can therefore be roughly discriminated according to the magnitude of the phoneme index value A calculated by the calculation unit 148.
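A minimal sketch of the ratio A = PC/P0 follows. The function names and the guard against a silent frame (where P0 is zero) are assumptions; the text does not say how silence is handled.

```python
import math

def intensity(frame):
    """Intensity as described in the text: sqrt(sum of squares) / sample count."""
    return math.sqrt(sum(x * x for x in frame)) / len(frame) if frame else 0.0

def phoneme_index(band_component, full_frame, eps=1e-12):
    """Return A = PC / P0 for one frame.

    band_component: samples of the filtered component VC
    full_frame:     samples of the unfiltered input for the same frame
    """
    p_c = intensity(band_component)
    p_0 = intensity(full_frame)
    if p_0 < eps:
        # Assumed convention: report 0.0 for an effectively silent frame.
        return 0.0
    return p_c / p_0
```

Because A is a ratio, scaling the input voice by a constant gain scales PC and P0 equally and leaves A unchanged, which is why the index tracks the phoneme rather than the loudness.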

The sound selection unit 22 in FIG. 1 selects one of the performance sounds of a plurality of percussion instruments based on the phoneme index value A. A code Nn designating the performance sound selected by the sound selection unit 22 (hereinafter "note number") is output from the sound selection unit 22 to the data generation unit 30. A table TBL stored in the storage device 40 (the "sound selection table") is used to identify the note number Nn corresponding to the phoneme index value A.

FIG. 3 is a schematic diagram showing the contents of the sound selection table TBL. As shown in the figure, the sound selection table TBL associates a note number Nn (a type of percussion instrument) with each of a plurality of ranges of the phoneme index value A. For example, the range a1 of the phoneme index value A corresponding to bilabial sounds is associated with the note number Nn1 designating a hi-hat cymbal, and the range a2 of the phoneme index value A corresponding to alveolar sounds is associated with the note number Nn2 designating a bass drum. The sound selection unit 22 searches the sound selection table TBL for the range to which the phoneme index value A calculated by the calculation unit 148 belongs, and obtains the note number Nn corresponding to that range from the storage device 40.
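The range lookup in the sound selection table TBL might look like the sketch below. The boundary value 0.5 between the ranges is purely hypothetical, and the concrete note numbers (36 for bass drum, 42 for closed hi-hat) are taken from the General MIDI percussion map as an assumption; the text only names them Nn1 and Nn2.

```python
# Each row: (lower bound inclusive, upper bound exclusive, note number).
# The 0.5 boundary is a hypothetical value, not taken from the text.
SOUND_SELECTION_TABLE = [
    (0.0, 0.5, 36),           # range a2 (small A, alveolar) -> bass drum
    (0.5, float("inf"), 42),  # range a1 (large A, bilabial) -> hi-hat cymbal
]

def select_note(index_a, table=SOUND_SELECTION_TABLE):
    """Return the note number whose range contains index_a, or None."""
    for low, high, note in table:
        if low <= index_a < high:
            return note
    return None
```

Keeping the table as plain data mirrors the role of the corresponding sound setting unit: reassigning an instrument to a phoneme only requires replacing the note number in a row.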

The corresponding sound setting unit 23 in FIG. 1 variably controls the relationship between the phoneme index value A and the note number Nn in the sound selection table TBL. For example, for each range of the phoneme index value A in the sound selection table TBL, the corresponding sound setting unit 23 associates the note number Nn of the type of percussion instrument that the user has designated by operating the input device 50, and stores the result in the storage device 40. The user can therefore change, as desired, the percussion performance sound that is output when each phoneme is uttered.

The peak detection unit 16 detects, for each frame, the intensity PK of the peak of the input voice V on the time axis (hereinafter "peak value"). Any known technique may be used to detect the peak value PK. For example, a suitable configuration identifies the envelope of the time waveform of the input voice V and detects the amplitude of the envelope's peak within the frame as the peak value PK.
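One possible envelope-based peak detector is sketched below. The envelope follower (rectification with an exponentially decaying peak hold), the `release` parameter, and the function names are assumptions; the text leaves the detection technique open.

```python
def envelope(samples, release=0.99):
    """Envelope follower: rectify, then hold peaks with exponential decay."""
    env = []
    level = 0.0
    for x in samples:
        # Jump up to a new peak immediately, otherwise decay gradually.
        level = max(abs(x), level * release)
        env.append(level)
    return env

def frame_peaks(samples, frame_len, release=0.99):
    """Return the peak value PK (maximum of the envelope) for each frame."""
    env = envelope(samples, release)
    return [max(env[i:i + frame_len]) for i in range(0, len(env), frame_len)]
```

Because the envelope decays rather than dropping to zero, a frame that follows a loud attack still reports a non-zero peak, which keeps the muting decision from firing on every zero crossing.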

The sounding determination unit 24 determines when to start and stop sounding according to the magnitude of the peak value PK detected by the peak detection unit 16. More specifically, the sounding determination unit 24 instructs the data generation unit 30 to start sounding in a frame in which the peak value PK exceeds the threshold TON, and instructs the data generation unit 30 to mute in a frame in which the peak value PK falls below the threshold TOFF.
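The two-threshold decision can be sketched as a small state machine. Tracking whether a note is currently sounding, so that a start is issued only once until the next mute, is an assumption added for the sketch; the function name and event labels are hypothetical.

```python
def sounding_events(peaks, t_on, t_off):
    """Return (frame_index, "on" | "off") events from per-frame peak values.

    A start is issued when PK rises above t_on while silent, and a mute
    when PK falls below t_off while sounding.
    """
    events = []
    sounding = False
    for i, pk in enumerate(peaks):
        if not sounding and pk > t_on:
            events.append((i, "on"))
            sounding = True
        elif sounding and pk < t_off:
            events.append((i, "off"))
            sounding = False
    return events
```

Choosing TOFF below TON gives hysteresis: once a note has started, small dips in the peak value do not immediately mute it.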

The magnitude of the peak value PK, however, tends to depend on the phoneme of the input voice V. That is, for some phonemes the peak value PK rises easily, and for others it does not. Consequently, in a configuration in which the threshold TON (and the threshold TOFF) is fixed regardless of the phoneme of the input voice V, the sounding determination unit 24 is less likely to decide that it is time to sound for phonemes whose peak value PK does not rise easily, and an inconsistency arises in which the frequency of sounding differs from phoneme to phoneme.

The threshold setting unit 25 in FIG. 1 therefore sets the threshold TON and the threshold TOFF variably according to the phoneme of the input voice V. To identify the phoneme, the threshold setting unit 25 reuses the phoneme index value A calculated by the index calculation unit 14. That is, when the phoneme index value A indicates a phoneme whose peak value PK does not rise easily, the threshold setting unit 25 lowers the threshold TON and the threshold TOFF relative to the case of a phoneme whose peak value PK rises easily. This configuration has the advantage that the frequency of sounding of the performance sound corresponding to each phoneme is made uniform across the phonemes.
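A table-driven sketch of the threshold setting follows. All numeric values are hypothetical, and so is the choice of which range of A receives the lower thresholds: the text does not state which phoneme class tends to have lower peak values.

```python
# Each row: (lower A inclusive, upper A exclusive, T_ON, T_OFF).
# Values and the assignment of the lower thresholds are illustrative only.
THRESHOLD_TABLE = [
    (0.0, 0.5, 0.30, 0.06),           # assumed: peaks rise less easily here
    (0.5, float("inf"), 0.50, 0.10),  # assumed: peaks rise easily here
]

def thresholds_for(index_a, table=THRESHOLD_TABLE, default=(0.5, 0.1)):
    """Return the (T_ON, T_OFF) pair for a phoneme index value."""
    for low, high, t_on, t_off in table:
        if low <= index_a < high:
            return t_on, t_off
    return default
```

The same index value A that selects the instrument is reused here, so no additional analysis of the input voice is needed to adapt the thresholds.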

The volume determination unit 26 in FIG. 1 determines the volume of the performance sound according to the peak value PK detected by the peak detection unit 16. A numerical value VEL designating the volume determined by the volume determination unit 26 (hereinafter "velocity") is output to the data generation unit 30. The corresponding volume setting unit 27 variably sets the relationship between the peak value PK and the velocity VEL, as described below.

The storage device 40 stores a plurality of functions F (hereinafter "volume functions") that define the relationship between the peak value PK and the velocity VEL. FIG. 4 is a conceptual diagram showing the volume functions F (F1 to F3). As shown in FIG. 4, the way the velocity VEL changes with the peak value PK differs for each volume function F. For example, the volume function F1 defines the relationship so that the slope decreases once the peak value PK exceeds a value p1, whereas the volume function F2 defines it so that the slope increases once the peak value PK exceeds a value p2. The volume function F3 defines the velocity VEL so that it increases linearly with the peak value PK. The corresponding volume setting unit 27 selects, from the storage device 40, the volume function F that the user has designated by operating the input device 50. The volume determination unit 26 calculates the velocity VEL by substituting the peak value PK into the volume function F selected by the corresponding volume setting unit 27. The user can therefore change, as appropriate, how the velocity VEL varies with the volume of the input voice V (that is, the volume function F). For example, when the user selects the volume function F1 in FIG. 4, a performance sound of sufficient volume (velocity VEL) is generated even when the utterance is quiet, and when the user selects the volume function F2, the volume of the performance sound is suppressed even when the utterance is loud.
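The three volume functions can be illustrated with a short Python sketch. It is only one possible reading of FIG. 4: the peak value PK is assumed to be normalized to the range 0 to 1, the velocity is assumed to be a MIDI-style value from 0 to 127, and the knee positions and slopes (standing in for p1 and p2) are hypothetical numbers chosen for illustration.

```python
def make_volume_function(knee, slope_below, slope_above, max_vel=127):
    """Build a two-segment piecewise-linear mapping from PK to VEL.

    All parameters are illustrative assumptions, not values from the source.
    """
    def f(pk):
        if pk <= knee:
            vel = slope_below * pk
        else:
            vel = slope_below * knee + slope_above * (pk - knee)
        # Clamp to the assumed velocity range.
        return max(0, min(max_vel, round(vel)))
    return f

# F1: slope decreases above p1 (assumed p1 = 0.3), so quiet input is boosted.
F1 = make_volume_function(knee=0.3, slope_below=300.0, slope_above=37.0 / 0.7)
# F2: slope increases above p2 (assumed p2 = 0.7), so most input is suppressed.
F2 = make_volume_function(knee=0.7, slope_below=50.0, slope_above=92.0 / 0.3)
# F3: a single straight line from (0, 0) to (1, 127).
F3 = make_volume_function(knee=1.0, slope_below=127.0, slope_above=127.0)

VOLUME_FUNCTIONS = {"F1": F1, "F2": F2, "F3": F3}

def velocity_for(pk, selected="F3"):
    """Substitute PK into the volume function selected by the user."""
    return VOLUME_FUNCTIONS[selected](pk)
```

For the same quiet peak value, `velocity_for(0.2, "F1")` is larger than `velocity_for(0.2, "F3")`, which in turn is larger than `velocity_for(0.2, "F2")`, matching the qualitative behavior described above.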

The data generation unit 30 generates sound data DS according to the results of the operations of the sound selection unit 22, the sound generation determination unit 24, and the volume determination unit 26. Specifically, when the sound generation determination unit 24 instructs sound generation, the data generation unit 30 generates sound data DS instructing sound generation (a note-on event) and outputs it to the sound source circuit 72. This sound data DS includes the note number Nn designated by the sound selection unit 22 and the velocity VEL designated by the volume determination unit 26. When this sound data DS is output to the sound source circuit 72, the performance sound of the type of percussion instrument corresponding to the phoneme of the input voice V is output from the sound emitting device 76 at a volume corresponding to the peak value PK of the input voice V. On the other hand, when the sound generation determination unit 24 instructs muting, the data generation unit 30 generates sound data DS instructing that the performance sound corresponding to the note number Nn be muted (a note-off event in which zero is specified as the velocity VEL) and outputs it to the sound source circuit 72.
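As a rough illustration of the two kinds of sound data DS, the following Python sketch encodes note-on and note-off events as three-byte MIDI channel messages. The status bytes 0x90 (note-on) and 0x80 (note-off) are standard MIDI; the choice of channel index 9 (the General MIDI percussion channel) and the function names are assumptions made for this example, not details stated in the source.

```python
def note_on(note_number, velocity, channel=9):
    """Encode a note-on event (status 0x90) as raw MIDI bytes.

    channel=9 is an assumption: the General MIDI percussion channel.
    """
    return bytes([0x90 | (channel & 0x0F), note_number & 0x7F, velocity & 0x7F])

def note_off(note_number, velocity=0, channel=9):
    """Encode a note-off event (status 0x80); velocity defaults to zero."""
    return bytes([0x80 | (channel & 0x0F), note_number & 0x7F, velocity & 0x7F])

def make_sound_data(instruction, note_number, velocity):
    """Return the sound data DS for a 'sound' or 'mute' instruction."""
    if instruction == "sound":
        return note_on(note_number, velocity)
    if instruction == "mute":
        return note_off(note_number)
    raise ValueError("instruction must be 'sound' or 'mute'")
```

For example, `make_sound_data("sound", 36, 100)` yields the bytes `99 24 64` in hexadecimal, which a General MIDI sound source would typically render as a bass drum hit.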

Next, the overall flow of the processing executed by the control device 10 will be described with reference to FIG. 5. The processing of FIG. 5 starts when the user operates the input device 50 to instruct activation of the program. When the processing of FIG. 5 starts, the dividing unit 12 extracts one frame from the audio signal SV supplied from the A/D converter 62 (step S1). Next, detection of the intensity PC by the filter processing unit 142 and the intensity detection unit 144, detection of the intensity P0 by the intensity detection unit 146, and detection of the peak value PK by the peak detection unit 16 are executed in sequence (step S2). The calculation unit 148 then calculates the phoneme index value A from the intensity P0 and the intensity PC (step S3).
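Steps S1 and S2 can be pictured with the minimal Python sketch below. It assumes non-overlapping frames of a fixed length and takes the peak value PK to be the largest absolute sample value in a frame; both choices, and the function names, are assumptions for illustration, since this passage does not fix them.

```python
def split_frames(samples, frame_len):
    """Cut the signal into consecutive, non-overlapping frames (step S1).

    A trailing partial frame is dropped in this sketch.
    """
    last_start = len(samples) - frame_len
    return [samples[i:i + frame_len] for i in range(0, last_start + 1, frame_len)]

def peak_value(frame):
    """Assumed definition of PK: the maximum absolute amplitude in the frame."""
    return max(abs(v) for v in frame)
```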

Next, the control device 10 updates various variables in response to operations on the input device 50 (step S4). More specifically, the corresponding sound setting unit 23 updates the contents of the sound selection table TBL (the correspondence between each range of the phoneme index value A and a note number Nn) in response to operations on the input device 50, and the corresponding volume setting unit 27 selects one of the plurality of volume functions F stored in the storage device 40 in response to operations on the input device 50. The control device 10 also sets the values TH1 and TH2, which are candidates for the threshold TON, and the values TL1 and TL2, which are candidates for the threshold TOFF, in response to operations on the input device 50 (step S4).

Next, the sound selection unit 22 identifies, from the sound selection table TBL, the note number Nn corresponding to the phoneme index value A calculated in step S3 (step S5). The threshold setting unit 25 sets the threshold TON and the threshold TOFF according to the phoneme index value A calculated in step S3 (step S6). For example, when the phoneme index value A is within the range a1 corresponding to bilabial sounds, the value TH1 is set as the threshold TON and the value TL1 as the threshold TOFF; when the phoneme index value A is within the range a2 corresponding to alveolar sounds, the value TH2 is set as the threshold TON and the value TL2 as the threshold TOFF.
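Steps S5 and S6 amount to a range lookup keyed on the phoneme index value A. The Python sketch below shows one way this could look. The boundary value 0.6, the note numbers, and the threshold values standing in for TH1, TL1, TH2, and TL2 are all hypothetical, as is the assumption that range a1 lies above range a2.

```python
import bisect

# Hypothetical boundary between range a2 (below) and range a1 (at or above).
BOUNDS = [0.6]

# One row per range of A; every number here is an illustrative assumption.
ROWS = [
    {"range": "a2", "note": 38, "t_on": 0.20, "t_off": 0.05},  # alveolar: TH2 / TL2
    {"range": "a1", "note": 36, "t_on": 0.30, "t_off": 0.10},  # bilabial: TH1 / TL1
]

def lookup(a):
    """Return the table row whose range contains the phoneme index value A."""
    return ROWS[bisect.bisect_right(BOUNDS, a)]

def select_note(a):
    """Step S5: note number Nn for the given phoneme index value."""
    return lookup(a)["note"]

def select_thresholds(a):
    """Step S6: (TON, TOFF) for the given phoneme index value."""
    row = lookup(a)
    return row["t_on"], row["t_off"]
```

Keeping the note number and both thresholds in the same row means that a single lookup serves the sound selection unit and the threshold setting unit alike; adding a further range only requires one more boundary and one more row.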

Next, the control device 10 determines whether the status flag SF indicates muting and the peak value PK detected in step S2 exceeds the threshold TON set in step S6 (step S7). The status flag SF is a code that identifies whether the device is currently in the sounding state or the muted state.

When the result of step S7 is affirmative (that is, when the current frame corresponds to the start of sound generation), the volume determination unit 26 calculates the velocity VEL by substituting the peak value PK detected in step S2 into the volume function F selected in step S4 (step S8). When the result of step S7 is negative (that is, when earlier sound generation is continuing or the peak value PK has not reached the threshold TON), the control device 10 determines whether the status flag SF indicates sounding and the peak value PK detected in step S2 is below the threshold TOFF set in step S6 (step S9).

When the result of step S9 is affirmative (that is, when the current frame corresponds to the end of sound generation), the volume determination unit 26 sets the velocity VEL to zero (step S10). When the result of step S9 is negative (that is, when the current frame involves no change between sounding and muting), the control device 10 returns to step S1 and executes the same processing for the next frame of the audio signal SV.

When step S8 or step S10 is completed, the data generation unit 30 generates sound data DS according to the result of the processing for the current frame (step S11). Specifically, when the status flag SF indicates muting (that is, the state has changed to sounding in the current frame), the data generation unit 30 generates, as the sound data DS, a note-on event including the note number Nn set in step S5 and the velocity VEL set in step S8, and outputs it to the sound source circuit 72. The performance sound of the percussion instrument corresponding to the phoneme uttered by the user is thereby output from the sound emitting device 76. When the status flag SF indicates sounding (that is, the state has changed to muting in the current frame), the data generation unit 30 generates, as the sound data DS, a note-off event including the note number Nn of step S5 and the velocity VEL set to zero in step S10, and outputs it to the sound source circuit 72.

Next, the control device 10 inverts the status flag SF from sounding to muting or vice versa (step S12) and then determines whether the time to end reproduction of the performance sound has arrived (step S13). The user can instruct the control device 10 to end reproduction by operating the input device 50 as appropriate. When the result of step S13 is negative (for example, when the end of reproduction has not yet been instructed), the control device 10 returns to step S1 and executes the same processing for the next frame of the audio signal SV. When the result of step S13 is affirmative, the control device 10 ends the processing of FIG. 5.
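The decision logic of steps S7 to S12 behaves like a two-state machine with hysteresis. The Python sketch below traces it for one frame at a time. The function names, the tuple representation of events, and the use of a boolean for the status flag SF are assumptions for illustration; the per-frame note number and thresholds are taken as inputs (steps S5 and S6 are not reproduced here).

```python
def process_frame(sounding, pk, note, t_on, t_off, volume_fn):
    """One pass through steps S7 to S12 for a single frame.

    sounding  -- status flag SF (True = sounding, False = muted)
    Returns (new_sounding, event), where event is None when nothing changes.
    """
    if not sounding and pk > t_on:             # step S7
        velocity = volume_fn(pk)               # step S8
        return True, ("note_on", note, velocity)   # steps S11 and S12
    if sounding and pk < t_off:                # step S9
        return False, ("note_off", note, 0)    # steps S10, S11 and S12
    return sounding, None                      # no change: next frame

def run(peaks, note, t_on, t_off, volume_fn):
    """Feed a sequence of per-frame peak values through the state machine."""
    sounding, events = False, []
    for pk in peaks:
        sounding, event = process_frame(sounding, pk, note, t_on, t_off, volume_fn)
        if event is not None:
            events.append(event)
    return events

def linear(pk):
    """Illustrative volume function: straight line clamped to 0..127."""
    return max(0, min(127, int(127 * pk)))
```

Because TON is larger than TOFF in this sketch, a peak value that hovers between the two thresholds neither retriggers nor cuts off the sound, which is the usual motivation for using separate on and off thresholds.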

As described above, in this embodiment the phoneme index value A, which serves as an index for distinguishing the phoneme of the input voice V, is calculated from the intensity PC of the component VC in the discrimination band (band BL) of the input voice V, so speech recognition of the input voice V is in principle unnecessary. The capacity required of the storage device 40 and the processing load on the control device 10 can therefore be reduced compared with the technique of Patent Document 1.

Although the intensity PC of the component VC differs depending on the phoneme, in a configuration in which the intensity PC itself is adopted as the phoneme index value A, the phoneme index value A also changes with the volume (intensity P0) of the input voice V, so the phoneme may not be properly distinguishable from the phoneme index value A alone. In this embodiment, the phoneme index value A is calculated from the relative ratio between the intensity PC of the component VC and the overall intensity P0 of the input voice V, which has the advantage that a phoneme index value A capable of properly separating the phonemes is obtained regardless of the magnitude of the intensity P0. As will be understood from the above, in a configuration in which the audio signal SV is normalized (standardized) frame by frame, for example so that the maximum amplitude becomes a predetermined value (for example, 1), before being supplied to the index calculation unit 14, the intensity PC detected by the intensity detection unit 144 may itself be used as the phoneme index value A.
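The volume independence of the ratio can be checked with a small Python sketch. It makes several assumptions that are not taken from the source: intensity is computed as the mean squared amplitude of a frame, the low band is extracted with a simple one-pole low-pass filter, and the filter coefficient `alpha` is an arbitrary illustrative value.

```python
import math

def intensity(frame):
    """Assumed intensity measure: mean squared amplitude of the frame."""
    return sum(v * v for v in frame) / len(frame)

def one_pole_lowpass(frame, alpha=0.2):
    """Very simple low-pass filter standing in for the band BL filter."""
    y, out = 0.0, []
    for v in frame:
        y += alpha * (v - y)
        out.append(y)
    return out

def phoneme_index(frame, alpha=0.2):
    """A = PC / P0, the low-band intensity relative to the overall intensity."""
    p0 = intensity(frame)
    if p0 == 0.0:
        return 0.0  # silent frame: no meaningful ratio
    pc = intensity(one_pole_lowpass(frame, alpha))
    return pc / p0

def tone(cycles, n=256, gain=1.0):
    """Test signal: a sine wave with the given number of cycles per frame."""
    return [gain * math.sin(2 * math.pi * cycles * k / n) for k in range(n)]
```

Multiplying the frame by a constant gain scales PC and P0 by the same factor, so `phoneme_index(tone(5))` and `phoneme_index(tone(5, gain=4.0))` agree up to rounding error, while a low-frequency tone gives a larger index than a high-frequency one.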

A note-off event is necessary when the sounds to be reproduced include percussion sounds that are sustained over time (for example, a cymbal), but it is unnecessary when only percussion sounds that are not sustained (that is, sounds that occur only momentarily, such as a bass drum or hi-hat cymbal) are to be reproduced. A configuration in which the data generation unit 30 generates only note-on events as the sound data DS may therefore also be adopted. In a configuration that uses note-off events, the velocity of the note-off event may be set to a value other than zero.

<B: Second Embodiment>
Next, a second embodiment of the present invention will be described. Elements of this embodiment whose operation and function are the same as in the first embodiment are given the same reference numerals as above, and their detailed description is omitted as appropriate.

FIG. 6 is a block diagram showing a specific configuration of the index calculation unit 14. The filter processing unit 142 of this embodiment extracts a plurality of components VC (VC1 to VC3) in different frequency bands from the audio signal SV. As shown in FIG. 6, a filter bank composed of three filter units FL (FL1 to FL3) with different passbands is suitably employed as the filter processing unit 142. The filter unit FL1 is a band-pass or low-pass filter that extracts from the audio signal SV the component VC1 of the low-frequency band BL (up to fc1) in FIG. 2; the filter unit FL3 is a band-pass or high-pass filter that extracts the component VC3 of the high-frequency band BH (fc2 and above); and the filter unit FL2 is a band-pass filter that extracts the component VC2 of the intermediate band BM (fc1 to fc2).

The intensity detection unit 144 detects an intensity PC (PC1 to PC3) for each of the three components VC1 to VC3. The method of detecting the intensity PC from a component VC is the same as in the first embodiment. The intensity detection unit 146 detects the intensity P0 of the audio signal SV as in the first embodiment.

The calculation unit 148 calculates the relative ratio between each of the intensities PC1 to PC3 and the intensity P0 as a phoneme index value A (A1 to A3). The phoneme index value A1 corresponds to the intensity PC1 of the component VC1 in the band BL (A1 = PC1 / P0), the phoneme index value A2 corresponds to the intensity PC2 of the component VC2 in the band BM (A2 = PC2 / P0), and the phoneme index value A3 corresponds to the intensity PC3 of the component VC3 in the band BH (A3 = PC3 / P0). The phoneme of the input voice V can therefore be distinguished according to the magnitudes of the phoneme index values A1 to A3.

For example, as can be understood from FIG. 2, when the phoneme index values A1 and A2 exceed a predetermined threshold and the phoneme index value A3 is below the threshold (that is, when the intensities of the bands BL and BM in the frequency spectrum Q are high compared with the band BH), the phoneme of the input voice V is classified as a bilabial sound. When the phoneme index values A2 and A3 exceed the threshold and the phoneme index value A1 is below the threshold (that is, when the intensities of the bands BM and BH in the frequency spectrum Q are high compared with the band BL), the phoneme of the input voice V is classified as an alveolar sound. Further, when the phoneme index value A2 exceeds the threshold and the phoneme index values A1 and A3 are below the threshold, the phoneme of the input voice V is classified as a velar sound.
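The three-way classification described above can be written as a small decision function. The Python sketch below assumes a single shared threshold of 0.3 (a hypothetical value), treats a value equal to the threshold as "below", and returns `None` for patterns the text does not mention; these are all choices made for the example.

```python
def classify_phoneme(a1, a2, a3, threshold=0.3):
    """Classify a frame from its three phoneme index values.

    a1, a2, a3 -- PC1/P0, PC2/P0, PC3/P0 for bands BL, BM, BH
    threshold  -- assumed common threshold (illustrative value)
    """
    pattern = (a1 > threshold, a2 > threshold, a3 > threshold)
    if pattern == (True, True, False):
        return "bilabial"   # BL and BM strong, BH weak
    if pattern == (False, True, True):
        return "alveolar"   # BM and BH strong, BL weak
    if pattern == (False, True, False):
        return "velar"      # only BM strong
    return None             # pattern not covered by the description
```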

The sound selection table TBL associates each range of the phoneme index values A1 to A3, corresponding to a distinct phoneme, with a note number Nn. The sound selection unit 22 searches the sound selection table TBL for the note number Nn corresponding to the range of the phoneme index values A1 to A3 calculated by the index calculation unit 14 and indicates it to the data generation unit 30. The threshold setting unit 25 variably controls the threshold TON and the threshold TOFF according to the phoneme distinguished from the phoneme index values A1 to A3.

The above configuration also provides the same operation and effects as the first embodiment. In addition, because the phoneme index values A1 to A3 are calculated for the respective components VC1 to VC3 corresponding to separate bands (BL, BM, BH), a larger number of phonemes can be distinguished than in the first embodiment. This has the advantage that a wide variety of performance sounds can be selectively reproduced according to the phoneme of the input voice V. Although three phoneme index values A1 to A3 are calculated in the above embodiment, the number of phoneme index values A (the number of components VC extracted from the input voice V) is arbitrary.

<C: Modifications>
Various modifications, such as those exemplified below, can be made to each of the above embodiments. Two or more aspects arbitrarily selected from the following examples may also be combined.

(1) Modification 1
The format of the sound data DS is not limited to the above example (MIDI format). A configuration in which the data generation unit 30 generates, as the sound data DS, a data sequence (sample sequence) representing the time-domain waveform of the percussion performance sound is also suitable. For example, the storage device 40 stores waveform data representing the waveform of the performance sound for each of a plurality of types of percussion instruments. When the sound generation determination unit 24 instructs sound generation, the data generation unit 30 selects, from the plurality of waveform data, the waveform data of the percussion instrument corresponding to the note number Nn designated by the sound selection unit 22, increases or decreases the volume (amplitude) of that waveform data according to the velocity VEL, and outputs the result to the D/A converter 74. This configuration has the advantage that the MIDI-compliant sound source circuit 72 is unnecessary.
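A minimal Python sketch of this waveform-based variant follows. The stored waveforms are tiny placeholder lists rather than real recordings, the dictionary keyed by note number is a hypothetical stand-in for the waveform data in the storage device 40, and the linear mapping from velocity to gain is an assumption (the source only says that the amplitude is increased or decreased according to the velocity).

```python
# Placeholder waveform store: note number -> sample sequence (illustrative only).
WAVEFORMS = {
    36: [0.0, 1.0, 0.5, -0.5, -0.25, 0.0],   # stands in for a bass drum
    38: [0.0, 0.8, -0.8, 0.4, -0.4, 0.0],    # stands in for a snare drum
}

def render(note_number, velocity, max_velocity=127):
    """Select the waveform for the note and scale its amplitude by velocity.

    The gain is assumed to be linear in velocity.
    """
    gain = velocity / max_velocity
    return [sample * gain for sample in WAVEFORMS[note_number]]
```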

(2) Modification 2
Although the above embodiments use percussion performance sounds as an example, the reproduced sound may be changed freely. A configuration in which the sound selection unit 22 selects, according to the phoneme index value A, the performance sound of any of a plurality of musical instruments including instruments other than percussion instruments is also suitable. The reproduced sound is also not limited to the performance sound of a musical instrument. For example, the sound of applause may be reproduced.

A configuration in which the data generation unit 30 generates sound data DS indicating any one of a plurality of performance sounds produced by a single musical instrument is also suitable. For example, sound data DS that designates the note number Nn generated by the sound selection unit 22 as the pitch of the performance sound of a single musical instrument is output from the data generation unit 30 to the sound source circuit 72. In a configuration in which the sound data DS is waveform data, as in Modification 1, the pitch of the waveform data of the performance sound of a specific musical instrument is converted according to the note number Nn before the data is output to the D/A converter 74.

(3) Modification 3
Although the above embodiments set the velocity VEL according to the peak value PK, the peak value PK and the intensity P0 of the audio signal SV are highly likely to vary together, so a configuration in which the volume determination unit 26 determines the velocity VEL based on the intensity P0 detected by the intensity detection unit 146 may also be adopted.

(4) Modification 4
Although the above embodiments process the input voice V in the time domain, a configuration in which the intensity PC and the peak value PK are determined from a frequency spectrum obtained by transforming the audio signal SV into the frequency domain may also be adopted. That said, processing in the time domain as in the above embodiments requires no frequency analysis such as FFT (Fast Fourier Transform) processing, which has the advantage of reducing the processing load on the control device 10.
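For comparison, the frequency-domain variant mentioned here could look like the Python sketch below. To keep the example self-contained it uses a naive discrete Fourier transform from the standard library rather than an FFT, so it is far slower than a real implementation; the cutoff expressed as a bin index and the use of summed power as the intensity are assumptions for illustration.

```python
import cmath
import math

def power_spectrum(frame):
    """Power at each non-negative frequency bin, via a naive O(N^2) DFT."""
    n = len(frame)
    spectrum = []
    for b in range(n // 2 + 1):
        coeff = sum(frame[k] * cmath.exp(-2j * math.pi * b * k / n) for k in range(n))
        spectrum.append(abs(coeff) ** 2)
    return spectrum

def low_band_ratio(frame, cutoff_bin):
    """Share of the total power that lies below cutoff_bin (a PC / P0 analogue)."""
    spectrum = power_spectrum(frame)
    total = sum(spectrum)
    if total == 0.0:
        return 0.0
    return sum(spectrum[:cutoff_bin]) / total

def sine(cycles, n=64):
    """Test signal with an exact number of cycles per frame."""
    return [math.sin(2 * math.pi * cycles * k / n) for k in range(n)]
```

With a 64-sample frame and `cutoff_bin=8`, a tone at bin 3 gives a ratio close to 1 and a tone at bin 20 gives a ratio close to 0. The quadratic cost of the transform in this sketch also hints at why the time-domain approach is described as lighter.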

(5) Modification 5
A configuration in which the functions of the control device 10 in the above embodiments are realized by an electronic circuit such as a DSP, and a configuration in which they are realized by a plurality of integrated circuits, are also suitable. The sound collection device 64 and the sound emitting device 76 are not essential to the sound control device 100. For example, in a configuration that processes an audio signal SV stored in the storage device 40 or an audio signal SV distributed via a communication network, the sound collection device 64 and the A/D converter 62 are omitted. Likewise, in a configuration in which the sound data DS generated by the data generation unit 30 is stored in the storage device 40, or in which the sound data DS is transmitted to another device via a communication network, the sound emitting device 76 and the D/A converter 74 (and also the sound source circuit 72) are omitted.

FIG. 1 is a block diagram showing the configuration of a sound control device according to the first embodiment of the present invention.
FIG. 2 is a conceptual diagram for explaining the difference in frequency spectrum according to phoneme.
FIG. 3 is a conceptual diagram showing the contents of the sound selection table.
FIG. 4 is a conceptual diagram illustrating a plurality of volume functions that define the relationship between the peak value and the velocity.
FIG. 5 is a flowchart of the processing performed by the control device.
FIG. 6 is a block diagram showing the configuration of the index calculation unit in the second embodiment of the present invention.

Explanation of symbols

100: sound control device; 10: control device; 12: dividing unit; 14: index calculation unit; 142: filter processing unit; 144: intensity detection unit; 146: intensity detection unit; 148: calculation unit; 16: peak detection unit; 22: sound selection unit; 23: corresponding sound setting unit; 24: sound generation determination unit; 25: threshold setting unit; 26: volume determination unit; 27: corresponding volume setting unit; 30: data generation unit; 40: storage device; 50: input device; 62: A/D converter; 64: sound collection device; 72: sound source circuit; 74: D/A converter; 76: sound emitting device.

Claims

A sound control apparatus comprising:
index calculation means for calculating a phoneme index value that changes according to the phoneme of an input voice, based on the intensity of a component of a specific band of the input voice;
sound selection means for selecting one of a plurality of sounds based on the phoneme index value;
peak detection means for detecting a peak value of the input voice;
threshold setting means for variably setting a threshold according to the phoneme index value;
sound generation determination means for determining whether or not the peak value exceeds the threshold; and
data generation means for generating sound data indicating generation of the sound selected by the sound selection means when the sound generation determination means determines that the peak value exceeds the threshold.

The sound control apparatus according to claim 1, further comprising:
corresponding volume setting means for variably setting the relationship between the intensity of the input voice and the volume of the sound indicated by the sound data; and
volume determination means for determining the volume corresponding to the peak value detected by the peak detection means under the relationship set by the corresponding volume setting means,
wherein the data generation means generates sound data indicating a sound having the volume determined by the volume determination means.

The sound control apparatus according to claim 1 or 2, wherein the threshold setting means variably sets, according to the phoneme index value, each of a first threshold for determining sound generation and a second threshold for determining muting,
the sound generation determination means determines whether the peak value detected by the peak detection means exceeds the first threshold and whether the peak value detected by the peak detection means is below the second threshold, and
the data generation means generates sound data indicating generation of the sound selected by the sound selection means when the sound generation determination means determines that the peak value detected by the peak detection means exceeds the first threshold, and generates sound data indicating muting of the sound when the sound generation determination means determines that the peak value detected by the peak detection means is below the second threshold.

A program for causing a computer to execute:
an index calculation process for calculating a phoneme index value that changes according to the phoneme of an input voice, based on the intensity of a component of a specific band of the input voice;
a sound selection process for selecting one of a plurality of sounds based on the phoneme index value;
a peak detection process for detecting a peak value of the input voice;
a threshold setting process for variably setting a threshold according to the phoneme index value;
a sound generation determination process for determining whether or not the peak value exceeds the threshold; and
a data generation process for generating sound data indicating generation of the sound selected in the sound selection process when it is determined in the sound generation determination process that the peak value exceeds the threshold.