JP2017067903A

JP2017067903A - Acoustic analysis device

Info

Publication number: JP2017067903A
Application number: JP2015191028A
Authority: JP
Inventors: 慶太有元; Keita Arimoto
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2015-09-29
Filing date: 2015-09-29
Publication date: 2017-04-06
Anticipated expiration: 2035-09-29
Also published as: JP6565549B2

Abstract

PROBLEM TO BE SOLVED: To specify, with high accuracy, an utterance section immediately after the utterance of an utterance source from among acoustic signals.

SOLUTION: An utterance section detection part 40 is an element for analyzing, from among the acoustic signals XA, an utterance start point TS where the utterance of the acoustic sound is started and an utterance end point TE where the utterance of the acoustic sound is finished. It includes: a start point analysis part 44 for specifying a maximum point of intensity of the acoustic signal XE as the utterance start point TS; and an end point analysis part 46 for specifying, as the utterance end point TE, a minimum point where the intensity of the acoustic signal XE reverses to an increase in the course of decreasing with time after the utterance start point TS, in the case where a fluctuation index according to a difference between the intensity at the maximum point after the increase and the minimum value of the intensity until that maximum point exceeds an end point threshold value.

SELECTED DRAWING: Figure 4

Description

本発明は、音響を解析する技術に関する。 The present invention relates to a technique for analyzing sound.

音響信号のうち音響が発音される発音区間を解析するための各種の技術が従来から提案されている。例えば特許文献１には、音響信号のＳＮ（Signal to Noise）比が所定の条件を充足する期間を発音区間として特定する構成が開示されている。 Conventionally, various techniques for analyzing a sound generation section in which sound is generated in an acoustic signal have been proposed. For example, Patent Document 1 discloses a configuration in which a period in which an SN (Signal to Noise) ratio of an acoustic signal satisfies a predetermined condition is specified as a sounding section.

特開２００８−１７０８０６号公報 JP 2008-170806 A

ところで、打楽器等の各種の楽器が発音する音響の解析には当該楽器の発音区間の特定が重要である。しかし、例えば打楽器が素早く連打された場合のように発音源が短い間隔で複数回にわたり発音した場合には、最初の発音による音響が充分に減衰する以前に直後の発音が開始する。したがって、特許文献１の技術のもとでは、複数回にわたる発音が１個の発音区間に包含される可能性がある。しかし、実際の演奏音の解析の場面では、発音の開始直後の特性の解析が重要である場合が想定されるから、発音源が短い間隔で複数回にわたり発音した場合でも、最初の発音による音響のみを含む発音区間を高精度に特定することが重要である。以上の事情を考慮して、本発明は、音響信号のうち発音源の発音の直後の発音区間を高精度に特定することを目的とする。 Specifying the sound generation section of a musical instrument is important for analyzing the sound produced by various musical instruments such as percussion instruments. However, when a sound source sounds a plurality of times at short intervals, for example when a percussion instrument is struck in rapid succession, the next sounding starts before the sound of the first sounding has sufficiently attenuated. Therefore, under the technique of Patent Document 1, a plurality of soundings may be included in a single sound generation section. In the analysis of actual performance sounds, however, the characteristics immediately after the start of sounding can be important, so it is important to specify with high accuracy a sound generation section that includes only the sound of the first sounding, even when the sound source sounds a plurality of times at short intervals. In view of the above circumstances, an object of the present invention is to specify, with high accuracy, the sound generation section immediately after a sound source starts sounding in an acoustic signal.

以上の課題を解決するために、本発明の好適な態様に係る音響解析装置は、音響信号のうち音響の発音が開始される発音始点と当該音響の発音が終了する発音終点とを解析する音響解析装置であって、音響信号の強度の極大点を発音始点として特定する始点解析部と、発音始点の経過後に音響信号の強度が経時的に減少する過程で強度が増加に反転する極小点を、増加後の極大点での強度と当該極大点までの強度の最小値との差分に応じた変動指標が終点閾値を上回る場合に発音終点として特定する終点解析部とを具備する。以上の態様では、発音始点の経過後に音響信号の強度が経時的に減少する過程で強度が増加に反転する極小点を、変動指標が終点閾値を上回る場合に発音終点として特定する。すなわち、発音源が短い間隔で複数回にわたり発音した場合（最初の発音による音響が充分に減衰する以前に直後の発音が開始する場合）には、発音始点に対応する最初の発音のみを発音区間が包含するように発音終点が特定される。したがって、音響信号のうち発音源の発音の直後の発音区間を高精度に特定することが可能である。 In order to solve the above problems, an acoustic analysis device according to a preferred aspect of the present invention analyzes, in an acoustic signal, a sound generation start point at which sound generation starts and a sound generation end point at which that sound generation ends. The device includes a start point analysis unit that specifies a maximum point of the intensity of the acoustic signal as the sound generation start point, and an end point analysis unit that specifies, as the sound generation end point, a minimum point at which the intensity reverses to an increase while the intensity of the acoustic signal is decreasing over time after the sound generation start point, when a variation index according to the difference between the intensity at the maximum point after the increase and the minimum value of the intensity up to that maximum point exceeds an end point threshold. In the above aspect, the minimum point at which the intensity reverses to an increase while the intensity of the acoustic signal is decreasing over time after the sound generation start point is specified as the sound generation end point when the variation index exceeds the end point threshold. In other words, when the sound source sounds a plurality of times at short intervals (when the next sounding starts before the sound of the first sounding has sufficiently attenuated), the sound generation end point is specified so that the sound generation section includes only the first sounding, which corresponds to the sound generation start point. Therefore, it is possible to specify the sound generation section immediately after the sounding of the sound source in the acoustic signal with high accuracy.

本発明の好適な態様において、終点解析部は、極小点を発音終点として特定する以前に、発音始点での強度に応じた減衰閾値を下回るまで音響信号の強度が当該発音始点から減少した場合に、減衰閾値を強度が下回る時点を発音終点として特定する。以上の態様では、極小点が発音終点として特定される以前に、発音始点での強度に応じた減衰閾値を下回るまで音響信号の強度が発音始点から減少した場合に、強度が減衰閾値を下回る時点が発音終点として特定される。したがって、発音始点の経過後に発音源が発音することなく音響信号が減衰する場合に、発音始点からの減衰の度合に応じた適切な発音終点を設定できるという利点がある。 In a preferred aspect of the present invention, when the intensity of the acoustic signal decreases from the sound generation start point until it falls below an attenuation threshold corresponding to the intensity at that start point, before any minimum point has been specified as the sound generation end point, the end point analysis unit specifies the time at which the intensity falls below the attenuation threshold as the sound generation end point. In the above aspect, when the intensity of the acoustic signal decreases from the sound generation start point until it falls below the attenuation threshold corresponding to the intensity at the start point, before a minimum point is specified as the sound generation end point, the time at which the intensity falls below the attenuation threshold is specified as the sound generation end point. Accordingly, there is an advantage that, when the acoustic signal attenuates without the sound source sounding again after the sound generation start point, an appropriate sound generation end point can be set according to the degree of attenuation from the start point.
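
The two end point rules summarized above (a minimum point followed by a sufficiently large rise, or decay below an attenuation threshold, whichever comes first) can be sketched as follows. This is a minimal illustration built only from the summary in this section, not the flow of FIG. 7. The function name `find_end_point`, the parameter names, and the choice of an attenuation threshold equal to a fixed ratio of the start point intensity are all assumptions made for the example.

```python
def find_end_point(env, ts, ze, atten_ratio):
    """Return the index of the end point for the start point at index ts, or None.

    env         -- envelope intensities (non-negative floats)
    ts          -- index of the start point (a local maximum of env)
    ze          -- end point threshold for the variation index (hypothetical value)
    atten_ratio -- assumed form of the attenuation threshold: atten_ratio * env[ts]
    """
    atten_threshold = atten_ratio * env[ts]
    q_ref = env[ts]      # running minimum of the intensity since the start point
    last_min = None      # most recent point where the intensity turned upward
    for n in range(ts + 1, len(env)):
        q = env[n]
        # Rule 2: the signal decayed below the attenuation threshold first.
        if q < atten_threshold:
            return n
        q_ref = min(q_ref, q)
        if n + 1 < len(env):
            prev, nxt = env[n - 1], env[n + 1]
            if prev > q and q < nxt:
                last_min = n  # decrease reverses to increase here
            elif prev < q and q > nxt and last_min is not None and q > 0:
                # Rule 1: a peak after the reversal; test its normalized rise.
                if (q - q_ref) / q > ze:
                    return last_min
    return None


env = [0.0, 0.1, 1.0, 0.8, 0.6, 0.65, 0.5, 0.2, 0.9, 0.4]
end_after_restrike = find_end_point(env, ts=2, ze=0.3, atten_ratio=0.1)
end_after_decay = find_end_point([0.0, 1.0, 0.5, 0.2, 0.05, 0.01], ts=1, ze=0.3, atten_ratio=0.1)
```

In the first call the small bump at index 5 is ignored because its normalized rise is below `ze`, and the section ends at the dip before the second strike. In the second call no re-strike occurs, so the attenuation rule ends the section.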

本発明の好適な態様において、始点解析部は、音響信号の強度の極大点を順次に検出する一方、極大点での強度と当該極大点までの強度の最小値との差分に応じた変動指標が、終点閾値よりも大きい始点閾値を上回る場合に、当該極大点を発音始点として特定する。以上の態様では、音響信号の強度の極大点が順次に検出される一方、極大点での強度と当該極大点までの強度の最小値である基準値との差分に応じた変動指標が始点閾値を上回る場合に、当該極大点が発音始点として特定される。したがって、音響信号から検出される複数の極大点のうち発音源の明瞭な発音の開始を発音始点として高精度に特定できるという利点がある。 In a preferred aspect of the present invention, the start point analysis unit sequentially detects maximum points of the intensity of the acoustic signal, and specifies a maximum point as the sound generation start point when the variation index according to the difference between the intensity at that maximum point and the minimum value of the intensity up to that maximum point exceeds a start point threshold that is larger than the end point threshold. In the above aspect, maximum points of the intensity of the acoustic signal are sequentially detected, and a maximum point is specified as the sound generation start point when the variation index according to the difference between the intensity at that maximum point and a reference value, which is the minimum value of the intensity up to that maximum point, exceeds the start point threshold. Therefore, there is an advantage that, among the plurality of maximum points detected from the acoustic signal, the start of a clear sounding by the sound source can be specified as the sound generation start point with high accuracy.

本発明の好適な態様において、変動指標は、極大点での強度と当該極大点までの強度の最小値との差分を当該極大点での強度により除算した数値である。以上の態様では、極大点での強度と基準値との差分を極大点での強度により除算することで変動指標が算定される。すなわち、差分が音響信号の音量の大小に依存しない数値に正規化される。したがって、音響信号の音量に関わらず発音始点および発音終点を適切に特定することが可能である。 In a preferred aspect of the present invention, the variation index is a numerical value obtained by dividing the difference between the intensity at the maximum point and the minimum value of the intensity up to the maximum point by the intensity at the maximum point. In the above aspect, the variation index is calculated by dividing the difference between the intensity at the maximum point and the reference value by the intensity at the maximum point. That is, the difference is normalized to a numerical value that does not depend on the volume of the sound signal. Therefore, it is possible to appropriately specify the sound generation start point and the sound generation end point regardless of the volume of the acoustic signal.
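
As a small illustration of the normalization described above, the following sketch computes the variation index as the difference between the peak intensity and the minimum intensity, divided by the peak intensity. The function name `variation_index` and the handling of a non-positive peak are assumptions for the example.

```python
def variation_index(peak_intensity: float, reference_min: float) -> float:
    """Normalized rise: (peak - minimum) / peak.

    Returning 0.0 for a non-positive peak is a choice made for this sketch
    to avoid division by zero; the source does not discuss that case.
    """
    if peak_intensity <= 0.0:
        return 0.0
    return (peak_intensity - reference_min) / peak_intensity


# The same relative rise at two different loudness levels gives the same index.
quiet = variation_index(0.2, 0.05)
loud = variation_index(2.0, 0.5)
```

Because both the numerator and the denominator scale with the signal level, a fixed threshold on this index behaves the same for quiet and loud recordings.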

本発明の好適な態様において、始点解析部は、音響信号の強度の第１極大点以降の待機区間内に、第１極大点を上回る強度の第２極大点を検出した場合に、第１極大点を発音始点の候補から除外する。以上の態様では、音響信号の強度の極大点以降の待機区間内に、当該極大点を上回る強度の極大点が検出された場合に、極大点が発音始点の候補から除外される。したがって、発音源による１回の発音の開始から音響信号の強度が増加する過程で複数の極大点が検出される場合でも、当該発音に対応した１個の極大点を含む発音区間を適切に特定することが可能である。 In a preferred aspect of the present invention, when the start point analysis unit detects, within a standby section following a first maximum point of the intensity of the acoustic signal, a second maximum point whose intensity exceeds that of the first maximum point, it excludes the first maximum point from the candidates for the sound generation start point. In the above aspect, when a maximum point with a higher intensity is detected within the standby section following a maximum point of the intensity of the acoustic signal, the earlier maximum point is excluded from the candidates for the sound generation start point. Therefore, even when a plurality of maximum points are detected while the intensity of the acoustic signal is increasing after the start of a single sounding by the sound source, a sound generation section including the one maximum point corresponding to that sounding can be appropriately specified.
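
The standby section rule can be pictured with the following sketch, which drops any peak that is followed, within a fixed number of samples, by a stronger peak. The function name `filter_candidates`, the representation of peaks as `(sample_index, intensity)` pairs, and a standby length measured in samples are assumptions for illustration only.

```python
def filter_candidates(peaks, standby):
    """Keep only peaks not superseded by a stronger peak within the standby section.

    peaks   -- list of (sample_index, intensity), sorted by sample_index
    standby -- length of the standby section in samples (hypothetical unit)
    """
    kept = []
    for i, (t1, q1) in enumerate(peaks):
        superseded = any(
            t1 < t2 <= t1 + standby and q2 > q1
            for t2, q2 in peaks[i + 1:]
        )
        if not superseded:
            kept.append((t1, q1))
    return kept


# Three peaks on one rising attack, then a separate, weaker strike much later.
peaks = [(0, 0.3), (10, 0.6), (20, 1.0), (500, 0.8)]
candidates = filter_candidates(peaks, standby=50)
```

Only the strongest peak of the attack survives as a candidate, while the later strike is kept because it lies outside the standby section of every earlier peak.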

本発明の第１実施形態に係る音響処理装置の構成図である。 It is a configuration diagram of a sound processing apparatus according to a first embodiment of the present invention. 音響解析部の構成図である。 It is a block diagram of an acoustic analysis unit. 音響信号の各発音区間の説明図である。 It is an explanatory drawing of each sound generation section of an acoustic signal. 発音区間検出部の構成図である。 It is a block diagram of a sound generation section detection unit. 発音区間検出部の動作の説明図である。 It is an explanatory drawing of the operation of the sound generation section detection unit. 始点解析処理のフローチャートである。 It is a flowchart of a start point analysis process. 終点解析処理のフローチャートである。 It is a flowchart of an end point analysis process. 音源識別部の構成図である。 It is a block diagram of a sound source identification unit. 調波解析処理のフローチャートである。 It is a flowchart of a harmonic analysis process. 音源識別処理のフローチャートである。 It is a flowchart of a sound source identification process. 第２実施形態における始点解析処理の説明図である。 It is an explanatory drawing of the start point analysis process in the second embodiment.

＜第１実施形態＞ <First Embodiment>
図１は、本発明の第１実施形態の音響処理装置１２の構成図である。図１に例示される通り、音響処理装置１２には複数の収音装置１４と放音装置１６とが接続される。複数の収音装置１４の各々は、当該収音装置１４の周囲の音響を表す音響信号ＸAを生成する。音響信号ＸAは、例えば左右２チャネルのステレオ形式の信号である。複数の収音装置１４が生成した複数の音響信号ＸAが音響処理装置１２に並列に供給される。なお、収音装置１４が生成した音響信号ＸAをアナログからデジタルに変換するＡ/Ｄ変換器の図示は便宜的に省略した。 FIG. 1 is a configuration diagram of the sound processing apparatus 12 according to the first embodiment of the present invention. As illustrated in FIG. 1, a plurality of sound collection devices 14 and a sound emission device 16 are connected to the sound processing device 12. Each of the plurality of sound collection devices 14 generates an acoustic signal XA representing the sound around that sound collection device 14. The acoustic signal XA is, for example, a stereo signal with two channels, left and right. The plurality of acoustic signals XA generated by the plurality of sound collection devices 14 are supplied to the sound processing device 12 in parallel. The A/D converter that converts the acoustic signal XA generated by each sound collection device 14 from analog to digital is not shown for convenience.

各収音装置１４は相異なる発音源の近傍に配置される。発音源は、例えば演奏により楽音を発音する楽器や歌唱音声を発音する歌唱者である。第１実施形態では、収録スタジオ等の音響空間の内部で歌唱者と複数の楽器とにより音楽を演奏する場合を想定する。各収音装置１４が生成する音響信号ＸAには、当該収音装置１４の近傍の発音源から発音された音響が優勢に含有されるが、当該音響と比較して小音量で他の発音源の音響も含有され得る。 Each sound collection device 14 is arranged in the vicinity of a different sound source. A sound source is, for example, a musical instrument that produces musical tones when played, or a singer who produces a singing voice. The first embodiment assumes that music is performed by a singer and a plurality of musical instruments inside an acoustic space such as a recording studio. The acoustic signal XA generated by each sound collection device 14 predominantly contains the sound produced by the sound source near that sound collection device 14, but may also contain sound from other sound sources at a lower volume.

第１実施形態の各発音源は、調波音または非調波音を発音する。調波音は、基本周波数の基音成分と複数の倍音成分とを周波数軸上に配列した調波構造が明瞭に観測される調波性の音響である。例えば弦楽器または管楽器等の調波楽器の楽音や歌唱音声等の人間の発声音が調波音の典型例である。他方、非調波音は、調波構造が明瞭に観測されない非調波性の音響である。例えばドラムやシンバル等の打楽器の楽音が非調波音の典型例である。 Each sound source in the first embodiment generates a harmonic sound or a non-harmonic sound. The harmonic sound is a harmonic sound in which the harmonic structure in which the fundamental frequency component of the fundamental frequency and a plurality of harmonic components are arranged on the frequency axis is clearly observed. For example, a musical sound of a harmonic instrument such as a stringed instrument or a wind instrument or a human vocal sound such as a singing voice is a typical example of a harmonic sound. On the other hand, non-harmonic sound is non-harmonic sound in which the harmonic structure is not clearly observed. For example, percussion musical sounds such as drums and cymbals are typical examples of non-harmonic sounds.

なお、調波音は、調波性の音響成分を非調波性の音響成分と比較して優勢に含有する音響を意味する。したがって、調波性の音響成分のみで構成される音響のほか、調波性の音響成分と非調波性の音響成分との双方を含有するが全体としては調波性が優勢である音響も、調波音の概念に包含される。同様に、非調波音は、非調波性の音響成分を調波性の音響成分と比較して優勢に含有する音響を意味する。したがって、非調波性の音響成分のみで構成される音響のほか、調波性の音響成分と非調波性の音響成分との双方を含有するが全体としては非調波性が優勢である音響も、非調波音の概念に包含される。以下の説明では、調波音に関連する要素の符号に添字Ｈ（Ｈ：Harmonic）を付加し、非調波音に関連する要素の符号に添字Ｐ（Ｐ：Percussive）を付加する場合がある。 Note that the harmonic sound means sound containing a harmonic acoustic component predominantly compared to a non-harmonic acoustic component. Therefore, in addition to sound composed only of harmonic acoustic components, there is also acoustic that contains both harmonic acoustic components and non-harmonic acoustic components, but the harmonics predominate as a whole. Included in the concept of harmonic sounds. Similarly, non-harmonic sound refers to sound that predominately contains non-harmonic acoustic components compared to harmonic acoustic components. Therefore, it contains both harmonic and non-harmonic acoustic components in addition to the sound composed only of non-harmonic acoustic components, but the non-harmonic property is dominant as a whole. Sound is also included in the concept of non-harmonic sound. In the following description, the subscript H (H: Harmonic) may be added to the code of the element related to the harmonic sound, and the subscript P (P: Percussive) may be added to the code of the element related to the non-harmonic sound.

音響処理装置１２は、複数の音響信号ＸAに対する音響処理で音響信号ＸBを生成する。具体的には、第１実施形態の音響処理装置１２は、複数の音響信号ＸAの混合（ミキシング）により左右２チャネルのステレオ形式の音響信号ＸBを生成する。放音装置１６（例えばスピーカやヘッドホン）は、音響処理装置１２が生成した音響信号ＸBに応じた音響を放音する。なお、音響処理装置１２が生成した音響信号ＸBをデジタルからアナログに変換するＤ/Ａ変換器の図示は便宜的に省略した。また、図１では各収音装置１４と放音装置１６とを音響処理装置１２とは別個の要素として図示したが、複数の収音装置１４と放音装置１６とを音響処理装置１２に搭載することも可能である。 The sound processing device 12 generates an acoustic signal XB by applying sound processing to the plurality of acoustic signals XA. Specifically, the sound processing device 12 of the first embodiment generates a two-channel (left and right) stereo acoustic signal XB by mixing the plurality of acoustic signals XA. The sound emission device 16 (for example, a speaker or headphones) emits sound according to the acoustic signal XB generated by the sound processing device 12. The D/A converter that converts the acoustic signal XB generated by the sound processing device 12 from digital to analog is not shown for convenience. In FIG. 1, each sound collection device 14 and the sound emission device 16 are illustrated as elements separate from the sound processing device 12, but the plurality of sound collection devices 14 and the sound emission device 16 may also be mounted in the sound processing device 12.

図１に例示される通り、音響処理装置１２は、制御装置１２２と記憶装置１２４とを具備するコンピュータシステムで実現される。記憶装置１２４は、例えば磁気記録媒体や半導体記録媒体等の公知の記録媒体または複数種の記録媒体の組合せであり、制御装置１２２が実行するプログラムや制御装置１２２が使用する各種のデータを記憶する。制御装置１２２は、記憶装置１２４が記憶するプログラムを実行することで、複数の音響信号ＸAの各々を解析する音響解析部２０と、音響解析部２０による解析結果を利用して複数の音響信号ＸAから音響信号ＸBを生成する音響処理部３０とを実現する。なお、制御装置１２２の機能の一部または全部を専用の電子回路で実現する構成や、制御装置１２２の機能を複数の装置に分散した構成も採用され得る。 As illustrated in FIG. 1, the sound processing device 12 is realized by a computer system including a control device 122 and a storage device 124. The storage device 124 is a known recording medium such as a magnetic recording medium or a semiconductor recording medium, or a combination of a plurality of types of recording media, and stores a program executed by the control device 122 and various data used by the control device 122. By executing the program stored in the storage device 124, the control device 122 realizes an acoustic analysis unit 20 that analyzes each of the plurality of acoustic signals XA, and a sound processing unit 30 that generates the acoustic signal XB from the plurality of acoustic signals XA using the analysis results of the acoustic analysis unit 20. A configuration in which some or all of the functions of the control device 122 are realized by a dedicated electronic circuit, or a configuration in which the functions of the control device 122 are distributed over a plurality of devices, may also be employed.

音響解析部２０は、複数の収音装置１４から供給される複数の音響信号ＸAの各々について、当該音響信号ＸAが表す音響の発音源の種類を特定する。具体的には、音響解析部２０は、各音響信号ＸAの発音源の種類を示す情報（以下「音源識別情報」という）Ｄを生成する。音源識別情報Ｄは、例えば発音源の名称（具体的には楽器名や演奏パート名）である。 The acoustic analysis unit 20 specifies, for each of the plurality of acoustic signals XA supplied from the plurality of sound collection devices 14, the type of sound generation source represented by the acoustic signal XA. Specifically, the acoustic analysis unit 20 generates information (hereinafter referred to as “sound source identification information”) D indicating the type of sound source of each acoustic signal XA. The sound source identification information D is, for example, the name of the sound source (specifically, the instrument name or performance part name).

図２は、音響解析部２０の構成図である。図２に例示される通り、第１実施形態の音響解析部２０は、発音区間検出部４０と特徴量抽出部５０と音源識別部６０とを具備する。なお、以下の説明では、任意の１系統の音響信号ＸAに対する処理に便宜的に着目するが、複数の音響信号ＸAの各々について同様の処理が実行される。 FIG. 2 is a configuration diagram of the acoustic analysis unit 20. As illustrated in FIG. 2, the acoustic analysis unit 20 of the first embodiment includes a sounding section detection unit 40, a feature amount extraction unit 50, and a sound source identification unit 60. In the following description, although attention is paid to the processing for an arbitrary one system of acoustic signals XA for convenience, the same processing is executed for each of the plurality of acoustic signals XA.

図２の発音区間検出部４０は、音響信号ＸAについて複数の発音区間Ｐを検出する。図３には、音響信号ＸAの波形と発音区間Ｐとの関係が図示されている。図３から理解される通り、各発音区間Ｐは、音響信号ＸAが表す音響が発音される時間軸上の区間であり、音響の発音が開始する時点（以下「発音始点」という）ＴSから終点（以下「発音終点」という）ＴEまでの区間である。 The sounding section detection unit 40 in FIG. 2 detects a plurality of sounding sections P for the acoustic signal XA. FIG. 3 shows the relationship between the waveform of the acoustic signal XA and the sound generation section P. As can be understood from FIG. 3, each sound generation section P is a time-axis section where the sound represented by the sound signal XA is generated, and from the point in time when sound generation starts (hereinafter referred to as “pronunciation start point”) TS to the end point This is a section up to TE (hereinafter referred to as “pronunciation end point”).

図２の特徴量抽出部５０は、音響信号ＸAの特徴量Ｆを抽出する。第１実施形態の特徴量抽出部５０は、発音区間検出部４０が検出した発音区間Ｐ毎に特徴量Ｆを順次に抽出する。特徴量Ｆは、発音区間Ｐ内の音響信号ＸAの音響的な特徴を表す指標である。第１実施形態の特徴量Ｆは、相異なる複数種の特性値ｆ（ｆ1，ｆ2，……）を包含するベクトルで表現される。具体的には、音響信号ＸAの音色を表すＭＦＣＣ（Mel-frequency cepstral coefficients），発音区間Ｐ内の音響の立上がりの急峻度，基音成分に対する倍音成分の強度比，音響信号ＸAの強度の符号が反転する回数または頻度である零交差数等の複数種の特性値ｆが特徴量Ｆに包含される。 The feature amount extraction unit 50 in FIG. 2 extracts a feature amount F of the acoustic signal XA. The feature amount extraction unit 50 of the first embodiment sequentially extracts the feature amount F for each sound generation section P detected by the sound generation section detection unit 40. The feature amount F is an index representing the acoustic features of the acoustic signal XA within the sound generation section P. The feature amount F of the first embodiment is expressed as a vector containing a plurality of different types of characteristic values f (f1, f2, ...). Specifically, the feature amount F includes characteristic values f such as MFCCs (Mel-frequency cepstral coefficients) representing the timbre of the acoustic signal XA, the steepness of the rise of the sound within the sound generation section P, the intensity ratio of the harmonic components to the fundamental component, and the zero-crossing count, which is the number of times or the frequency with which the sign of the acoustic signal XA inverts.
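
Of the characteristic values listed above, the zero-crossing count is the simplest to illustrate. The sketch below counts sign inversions in a block of samples. The function name `zero_crossings` and the decision to skip samples that are exactly zero are assumptions of this example, since the source does not say how zero-valued samples are treated.

```python
def zero_crossings(samples):
    """Count how many times the sign of the signal inverts.

    Samples equal to zero are skipped, so the sequence +, 0, - counts as
    one crossing (a choice made for this sketch).
    """
    count = 0
    prev_sign = 0
    for x in samples:
        sign = (x > 0) - (x < 0)  # +1, 0, or -1
        if sign != 0:
            if prev_sign != 0 and sign != prev_sign:
                count += 1
            prev_sign = sign
    return count


crossings = zero_crossings([0.4, -0.2, 0.3, -0.1])
```

Dividing the count by the length of the section would give the frequency form of the same characteristic value.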

各発音源が発音する音響の特徴は、発音始点ＴSの直後に特に顕著となる。第１実施形態では、音響信号ＸAの発音始点ＴS毎（発音区間Ｐ毎）に音響信号ＸAの特徴量Ｆが抽出されるから、発音の有無や時点とは無関係に音響信号ＸAを区分した区間毎に特徴量Ｆを抽出する構成と比較して、発音源の種類毎に固有の特徴が顕著に反映された特徴量Ｆを抽出できるという利点がある。もっとも、発音源による発音の有無や時点とは無関係に音響信号ＸAを時間軸上で区分した区間毎に特徴量Ｆを抽出する（したがって発音区間検出部４０は省略される）ことも可能である。音源識別部６０は、特徴量抽出部５０が抽出した特徴量Ｆを利用して音響信号ＸAの発音源の種類を識別することで音源識別情報Ｄを生成する。 The characteristics of the sound produced by each sound source are particularly prominent immediately after the sound generation start point TS. In the first embodiment, the feature amount F of the acoustic signal XA is extracted for each sound generation start point TS (for each sound generation section P). Compared with a configuration in which the feature amount F is extracted for each section obtained by dividing the acoustic signal XA regardless of whether or when sounding occurs, this has the advantage that a feature amount F that clearly reflects the features unique to each type of sound source can be extracted. It is nevertheless also possible to extract the feature amount F for each section obtained by dividing the acoustic signal XA on the time axis regardless of whether or when the sound source sounds (in which case the sound generation section detection unit 40 is omitted). The sound source identification unit 60 generates the sound source identification information D by identifying the type of the sound source of the acoustic signal XA using the feature amount F extracted by the feature amount extraction unit 50.

図１の音響処理部３０は、音響解析部２０が音響信号ＸA毎に解析した音源識別情報Ｄを参照して複数の音響信号ＸAに音響処理を実行することで音響信号ＸBを生成する。具体的には、音響信号ＸAの音源識別情報Ｄが示す発音源の種類毎に事前に設定された音響処理が当該音響信号ＸAに対して実行される。音響信号ＸAに対する音響処理としては、例えば残響効果や歪効果等の各種の音響効果を付与する効果付与処理（エフェクタ）や、周波数帯域毎の音量を調整する特性調整処理（イコライザ），音像が定位する位置を調整する定位調整処理（パン），音量を調整する音量調整処理が例示される。効果付与処理で音響信号ＸAに付与される音響効果の種類や度合，特性調整処理で音響信号ＸAに付与される周波数特性，定位調整処理で調整される音像の位置，音量調整処理による調整内容（ゲイン）等の各種のパラメータが、音源識別情報Ｄが示す発音源の種類毎に個別に設定される。そして、音響処理部３０は、以上に例示した音響処理後の複数の音響信号ＸAを混合（ミキシング）することで音響信号ＸBを生成する。すなわち、第１実施形態の音響処理部３０は、調波性解析部６２による発音源の識別結果を反映した自動ミキシングを実現する。以下、第１実施形態における発音区間検出部４０および音源識別部６０の各々の具体的な構成を説明する。 The sound processing unit 30 in FIG. 1 generates the acoustic signal XB by performing sound processing on the plurality of acoustic signals XA with reference to the sound source identification information D that the acoustic analysis unit 20 obtained for each acoustic signal XA. Specifically, sound processing set in advance for each type of sound source indicated by the sound source identification information D of an acoustic signal XA is executed on that acoustic signal XA. Examples of the sound processing applied to the acoustic signal XA include effect imparting processing (effector) that imparts various acoustic effects such as reverberation and distortion, characteristic adjustment processing (equalizer) that adjusts the volume of each frequency band, localization adjustment processing (pan) that adjusts the position at which the sound image is localized, and volume adjustment processing that adjusts the volume. Various parameters, such as the type and degree of the acoustic effect imparted by the effect imparting processing, the frequency characteristic imparted by the characteristic adjustment processing, the position of the sound image adjusted by the localization adjustment processing, and the adjustment applied by the volume adjustment processing (gain), are set individually for each type of sound source indicated by the sound source identification information D. The sound processing unit 30 then generates the acoustic signal XB by mixing the plurality of acoustic signals XA after the sound processing exemplified above. That is, the sound processing unit 30 of the first embodiment realizes automatic mixing that reflects the sound source identification result obtained by the harmonic analysis unit 62. Specific configurations of the sound generation section detection unit 40 and the sound source identification unit 60 in the first embodiment are described below.

＜発音区間検出部４０＞ <Sound generation section detection unit 40>
図４は、発音区間検出部４０の構成図である。図４に例示される通り、第１実施形態の発音区間検出部４０Aは、信号処理部４２と始点解析部４４と終点解析部４６とを具備する。なお、以下の説明では、任意の１系統の音響信号ＸAに対する処理に便宜的に着目するが、実際には複数の音響信号ＸAの各々について同様の処理が実行される。 FIG. 4 is a configuration diagram of the sound generation section detection unit 40. As illustrated in FIG. 4, the sound generation section detection unit 40A of the first embodiment includes a signal processing unit 42, a start point analysis unit 44, and an end point analysis unit 46. In the following description, attention is paid for convenience to the processing of any one acoustic signal XA, but in practice the same processing is executed for each of the plurality of acoustic signals XA.

信号処理部４２は、収音装置１４から供給される音響信号ＸAの信号処理で音響信号ＸEを生成する。音響信号ＸEは、音響信号ＸAの時間軸上の包絡線（エンベロープ）に相当する。具体的には、信号処理部４２は、音響信号ＸAの各信号値を絶対値に変換したうえで高周波成分を抑圧（平滑化処理）することで音響信号ＸEを生成する。音響信号ＸEの波形が図５に例示されている。なお、外部装置で生成された音響信号ＸEが音響処理装置１２に供給される構成では、音響処理装置１２から信号処理部４２が省略され得る。 The signal processing unit 42 generates an acoustic signal XE by signal processing of the acoustic signal XA supplied from the sound collection device 14. The acoustic signal XE corresponds to an envelope (envelope) on the time axis of the acoustic signal XA. Specifically, the signal processing unit 42 generates the acoustic signal XE by converting each signal value of the acoustic signal XA into an absolute value and then suppressing (smoothing) the high frequency component. The waveform of the acoustic signal XE is illustrated in FIG. In the configuration in which the acoustic signal XE generated by the external device is supplied to the acoustic processing device 12, the signal processing unit 42 can be omitted from the acoustic processing device 12.
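
The envelope generation described above (absolute value followed by suppression of high-frequency components) can be sketched as follows. The source does not name a particular smoothing filter, so the one-pole low-pass filter used here, the function name `envelope`, and the smoothing coefficient `alpha` are assumptions chosen for illustration.

```python
def envelope(samples, alpha=0.1):
    """Rectify the signal and smooth it with a one-pole low-pass filter.

    alpha -- smoothing coefficient in (0, 1]; smaller values smooth more
             (hypothetical parameter, not taken from the source)
    """
    env = []
    state = 0.0
    for x in samples:
        # Move the state a fraction alpha toward the rectified sample.
        state += alpha * (abs(x) - state)
        env.append(state)
    return env


xe = envelope([1.0, -1.0, 1.0, -1.0], alpha=0.5)
```

The oscillating input has a constant magnitude, so the output rises monotonically toward 1.0 instead of following the sign changes, which is the behavior expected of an envelope.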

図４の始点解析部４４は、音響信号ＸEのうち音響の発音が開始される発音始点ＴSを特定する。終点解析部４６は、音響信号ＸEのうち音響の発音が終了する発音終点ＴEを特定する。第１実施形態では、始点解析部４４による発音始点ＴSの特定と終点解析部４６による発音終点ＴEの特定とが、音響信号ＸEの生成に並行して実時間的に音響信号ＸEの始点から時間の経過とともに順次に実行される。始点解析部４４および終点解析部４６の各々の動作を以下に説明する。 The start point analysis unit 44 in FIG. 4 specifies a sound generation start point TS at which sound generation is started in the sound signal XE. The end point analysis unit 46 specifies a sound generation end point TE at which sound generation ends in the sound signal XE. In the first embodiment, the specification of the sound generation start point TS by the start point analysis unit 44 and the specification of the sound generation end point TE by the end point analysis unit 46 are timed from the start point of the sound signal XE in real time in parallel with the generation of the sound signal XE. It is executed sequentially with the passage of time. The operations of the start point analysis unit 44 and the end point analysis unit 46 will be described below.

＜始点解析部４４＞ <Start point analysis unit 44>
図５に例示される通り、第１実施形態の始点解析部４４は、音響信号ＸEの強度（振幅またはパワー）Ｑが増加から減少に反転する極大点（ピーク）ｘHを発音始点ＴSとして特定する。ただし、第１実施形態の始点解析部４４は、音響信号ＸEから検出される全部の極大点ｘHを発音始点ＴSとするのではなく、音響信号ＸEから検出される複数の極大点ｘHのうち所定の条件を充足する極大点ｘHを選択的に発音始点ＴSとして特定する。 As illustrated in FIG. 5, the start point analysis unit 44 of the first embodiment specifies a maximum point (peak) xH, at which the intensity (amplitude or power) Q of the acoustic signal XE reverses from increase to decrease, as the sound generation start point TS. However, the start point analysis unit 44 of the first embodiment does not treat every maximum point xH detected from the acoustic signal XE as a sound generation start point TS. Instead, among the plurality of maximum points xH detected from the acoustic signal XE, it selectively specifies a maximum point xH that satisfies a predetermined condition as the sound generation start point TS.

具体的には、始点解析部４４は、図５に例示された極大点ｘH1のように、極大点ｘHでの音響信号ＸEの強度ＱHと基準値ＱREFとの差分(ＱH−ＱREF)に応じた変動指標δが所定の閾値（以下「始点閾値」という）ＺSを上回る場合（δ＞ＺS）に当該極大点ｘHを発音始点ＴSとして確定する。他方、図５に例示された極大点ｘH0のように、変動指標δが始点閾値ＺSを下回る極大値ｘHは発音始点ＴSとされない。 Specifically, the start point analysis unit 44 responds to the difference (QH−QREF) between the intensity QH of the acoustic signal XE at the local maximum point xH and the reference value QREF as the local maximum point xH1 illustrated in FIG. When the variation index δ exceeds a predetermined threshold value (hereinafter referred to as “starting point threshold value”) ZS (δ> ZS), the local maximum point xH is determined as the pronunciation starting point TS. On the other hand, like the local maximum point xH0 illustrated in FIG. 5, the local maximum value xH whose fluctuation index δ is lower than the starting point threshold value ZS is not set as the pronunciation starting point TS.

基準値ＱREFは、直前の発音始点ＴS（処理開始の直後は音響信号ＸEの始点）以降における音響信号ＸEの強度Ｑの最小値となるように発音始点ＴSの解析処理の進行とともに随時に更新される。変動指標δは、例えば、極大点ｘHでの強度ＱHと基準値ＱREFとの差分(ＱH−ＱREF)を当該強度ＱHで除算した数値（δ＝(ＱH−ＱREF)／ＱH）である。強度ＱHでの除算により、変動指標δは、音響信号ＸEの全体的な音量の大小に依存しない数値に正規化される。始点閾値ＺSは、事前に選定された所定の正数である。 The reference value QREF is updated as needed as the analysis of the sound generation start point TS progresses, so that it equals the minimum value of the intensity Q of the acoustic signal XE since the immediately preceding sound generation start point TS (or, immediately after processing starts, since the start point of the acoustic signal XE). The variation index δ is, for example, the value obtained by dividing the difference (QH−QREF) between the intensity QH at the maximum point xH and the reference value QREF by the intensity QH (δ = (QH−QREF)/QH). Dividing by the intensity QH normalizes the variation index δ to a value that does not depend on the overall volume of the acoustic signal XE. The start point threshold ZS is a predetermined positive number selected in advance.

図６は、始点解析部４４が発音始点ＴSを特定する処理（以下「始点解析処理」という）のフローチャートである。始点解析部４４は、音響信号ＸEの始点から順次に極大点ｘHを検出し、極大点ｘHの検出毎に図６の始点解析処理を開始する。 FIG. 6 is a flowchart of processing (hereinafter referred to as “start point analysis processing”) in which the start point analysis unit 44 specifies the pronunciation start point TS. The start point analysis unit 44 sequentially detects the maximum point xH from the start point of the acoustic signal XE, and starts the start point analysis process of FIG. 6 each time the maximum point xH is detected.

When the start point analysis process starts upon detection of a local maximum xH of the acoustic signal XE, the start point analysis unit 44 determines whether the variation index δ, which depends on the difference (QH - QREF) between the intensity QH at that local maximum xH and the current reference value QREF, exceeds the start point threshold ZS (SC1). When the variation index δ is below the start point threshold ZS (SC1: NO), the start point analysis unit 44 ends the start point analysis process without identifying the current local maximum xH as a sounding start point TS. When the variation index δ exceeds the start point threshold ZS (SC1: YES), the start point analysis unit 44 identifies the current local maximum xH as a sounding start point TS (SC2) and then updates the reference value QREF to the intensity QH at that local maximum xH (SC3). Because the acoustic signal XE decays after the sounding start point TS, the reference value QREF decreases over time once the sounding start point TS has passed. The above is a preferred example of the start point analysis process.
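Steps SC1 to SC3 can be sketched as follows. This is an illustrative reading under stated assumptions, not the implementation of the embodiment: the intensity Q is assumed to be available as a list of non-negative samples, a local maximum is assumed to be a sample greater than its predecessor and not smaller than its successor, and the name `detect_start_points` is hypothetical.

```python
def detect_start_points(q: list[float], z_s: float) -> list[int]:
    """Return sample indices identified as sounding start points TS."""
    start_points: list[int] = []
    q_ref = q[0]  # QREF starts at the beginning of the signal
    for t in range(1, len(q) - 1):
        # Keep QREF equal to the minimum intensity since the last TS.
        q_ref = min(q_ref, q[t])
        is_local_max = q[t - 1] < q[t] >= q[t + 1]
        if not is_local_max or q[t] <= 0.0:
            continue
        delta = (q[t] - q_ref) / q[t]
        if delta > z_s:             # SC1: YES
            start_points.append(t)  # SC2: this maximum becomes TS
            q_ref = q[t]            # SC3: reset QREF to QH
    return start_points
```

For the sequence `[0.1, 1.0, 0.8, 0.7, 0.75, 0.6, 0.2, 0.9, 0.3]` with ZS = 0.5, the small bump at index 4 is rejected because its variation index is far below the threshold, while the maxima at indices 1 and 7 are accepted.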

＜End point analysis unit 46＞
As described above, the end point analysis unit 46 of FIG. 4 identifies the sounding end point TE at which a sounding ends in the acoustic signal XE. FIG. 7 is a flowchart of the process by which the end point analysis unit 46 identifies the sounding end point TE (hereinafter the "end point analysis process"). The end point analysis process of FIG. 7 starts when the start point analysis unit 44 identifies a sounding start point TS (SC2).

When the end point analysis process starts upon identification of a sounding start point TS, the end point analysis unit 46 determines whether a predetermined time τ has elapsed since that sounding start point TS (SD1). When the time τ has not yet elapsed (SD1: NO), the end point analysis unit 46 determines whether the current intensity Q of the acoustic signal XE is below a predetermined threshold Z0 (hereinafter the "attenuation threshold") (SD2). The attenuation threshold Z0 is set to a value that depends on the intensity QH of the acoustic signal XE at the immediately preceding sounding start point TS. Specifically, the intensity QH at the sounding start point TS multiplied by a positive number less than 1 (for example, any value from 0.4 to 0.6) is suitable as the attenuation threshold Z0. When the intensity Q is below the attenuation threshold Z0 (SD2: YES), the end point analysis unit 46 identifies the current time as the sounding end point TE (SD3). That is, the time at which the intensity Q of the acoustic signal XE, after the sounding start point TS, has decreased to below the attenuation threshold Z0 is identified as the sounding end point TE.

Incidentally, when a sound source sounds several times at short intervals, for example when a percussion instrument is struck in rapid succession, the next sounding begins before the sound of the first sounding has sufficiently decayed. Consequently, if the sounding end point TE were identified only as the time at which the intensity Q of the acoustic signal XE falls below the attenuation threshold Z0, a single sounding section P from the sounding start point TS to the sounding end point TE would contain multiple soundings of the sound source. However, when the acoustic signal XA is analyzed, for example when the feature extraction unit 50 extracts the feature quantity F or the sound source identification unit 60 identifies the type of sound source, it is the characteristics immediately after the start of a sounding that matter. In view of this, the end point analysis unit 46 of the first embodiment identifies the sounding end point TE so that, even when the sound source sounds several times at short intervals, the sounding section P contains only the first sounding corresponding to the sounding start point TS (that is, so that the second and subsequent soundings are not included in the sounding section P).

Specifically, when the intensity Q of the acoustic signal XE is above the attenuation threshold Z0 (SD2: NO), the end point analysis unit 46 determines whether a local minimum (dip) xL, at which the intensity Q of the acoustic signal XE turns from decreasing to increasing, has been detected after the sounding start point TS (SD4). When no local minimum xL has been detected (SD4: NO), the end point analysis unit 46 returns to step SD1 and keeps monitoring for a local minimum xL until either the time τ has elapsed since the sounding start point TS (SD1: YES) or the intensity Q of the acoustic signal XE falls below the attenuation threshold Z0 (SD2: YES).

On the other hand, when a local minimum xL (hereinafter specifically the "target local minimum") is detected before the intensity Q of the acoustic signal XE falls below the attenuation threshold Z0 (SD4: YES), the end point analysis unit 46 determines whether the local maximum xH immediately following the target local minimum xL has been detected (SD5). When that local maximum xH has not been detected (SD5: NO), the end point analysis unit 46 returns to step SD1. Note that when the target local minimum xL is detected (SD4: YES) and the intensity Q at that point is below the current reference value QREF (Q < QREF), the reference value QREF is updated to the intensity Q at the target local minimum xL. That is, as described above, the reference value QREF is updated so that it equals the minimum of the intensity Q since the sounding start point TS (for example, the intensity Q at the target local minimum xL).

FIG. 5 illustrates a local maximum xH2 immediately following the target local minimum xL. When the local maximum xH2 is detected (SD5: YES), the end point analysis unit 46 determines whether the variation index δ, which depends on the difference (QH - QREF) between the intensity QH at the local maximum xH2 and the current reference value QREF, exceeds an end point threshold ZE (SD6). As described above, the variation index δ is the difference (QH - QREF) between the intensity QH and the reference value QREF, divided by the intensity QH. The current reference value QREF is most likely the intensity Q at the target local minimum xL. The end point threshold ZE is set to a predetermined positive number smaller than the start point threshold ZS used to identify the sounding start point TS (ZE < ZS).

When the variation index δ is below the end point threshold ZE (SD6: NO), a local maximum xH has been observed immediately after the target local minimum xL, but the increase in the intensity Q cannot be attributed with confidence to a further sounding of the sound source immediately after the sounding start point TS (a second or later sounding). The sounding end point TE therefore cannot be fixed yet, and the intensity Q of the acoustic signal XE must continue to be monitored. The end point analysis unit 46 accordingly returns to step SD1 and keeps monitoring for a local minimum xL until either the time τ has elapsed since the sounding start point TS (SD1: YES) or the intensity Q of the acoustic signal XE falls below the attenuation threshold Z0 (SD2: YES).

On the other hand, when the intensity at the local maximum xH has risen far enough that the variation index δ exceeds the end point threshold ZE (SD6: YES), the local maximum xH immediately following the target local minimum xL is presumed to be an increase in the intensity Q caused by a further sounding of the sound source immediately after the sounding start point TS (that is, a second or later sounding following the first). The span from the sounding start point TS to the target local minimum xL therefore needs to be fixed as the sounding section P, with the subsequent local maximum xH, which corresponds to the second or later sounding, excluded from the sounding section P. The end point analysis unit 46 accordingly identifies the target local minimum xL as the sounding end point TE (SD7). In other words, when the variation index δ for the local maximum xH immediately following the target local minimum xL exceeds the end point threshold ZE, that target local minimum xL is retroactively fixed as the sounding end point TE.

As understood from the above description, the end point analysis unit 46 of the first embodiment identifies the target local minimum xL, detected while the intensity Q of the acoustic signal XE decreases over time after the sounding start point TS, as the sounding end point TE (SD7) when the variation index δ exceeds the end point threshold ZE (SD6: YES), and does not do so when the variation index δ is below the end point threshold ZE (SD6: NO). Note that the local maximum xH2 of FIG. 5, detected immediately after the target local minimum xL, is itself identified as a sounding start point TS provided that the variation index δ exceeds the start point threshold ZS, as described with reference to FIG. 6. Because ZE < ZS, a variation index δ that exceeds the start point threshold ZS necessarily exceeds the end point threshold ZE as well, so the target local minimum xL immediately preceding that local maximum xH is fixed as the sounding end point TE.

On the other hand, when the time τ elapses from the immediately preceding sounding start point TS (SD1: YES) without the intensity Q of the acoustic signal XE having fallen below the attenuation threshold Z0 (SD2: YES) and without a local minimum xL after the sounding start point TS having been identified as the sounding end point TE (SD7), the end point analysis unit 46 identifies the time at which the time τ has elapsed since the sounding start point TS as the sounding end point TE (SD8). As understood from the above description, the end point analysis unit 46 basically identifies the time at which the intensity Q of the acoustic signal XE falls below the attenuation threshold Z0 as the sounding end point TE (SD3); when a further sounding of the sound source immediately after the sounding start point TS is presumed (SD6: YES), it fixes the local minimum xL as the sounding end point TE so that this sounding is excluded from the sounding section P (SD7); and when neither condition holds, it identifies the time at which the time τ has elapsed since the sounding start point TS as the sounding end point TE (SD8).
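The three outcomes of the end point analysis process (SD3, SD7, SD8) can be sketched in Python as below. This is an illustration under several assumptions that are not in the specification: time is measured in sample indices, local minima and maxima are found by comparing three neighbouring samples, the whole intensity sequence is available in advance, and the names `find_end_point` and `decay_ratio` are hypothetical. A fourth label, `"end_of_signal"`, is added only so the sketch terminates when the data runs out.

```python
def find_end_point(q: list[float], ts: int, tau: int, z_e: float,
                   decay_ratio: float = 0.5) -> tuple[int, str]:
    """Return (index of TE, reason) for the start point at index ts."""
    q_h = q[ts]
    z0 = decay_ratio * q_h  # attenuation threshold Z0 (e.g. 0.4 to 0.6 of QH)
    q_ref = q_h             # QREF was reset to QH in step SC3
    dip = None              # index of the target local minimum xL, if any
    n = len(q)
    for t in range(ts + 1, n):
        if t - ts >= tau:            # SD1: YES
            return t, "timeout"      # SD8
        if q[t] < z0:                # SD2: YES
            return t, "decay"        # SD3
        q_ref = min(q_ref, q[t])     # QREF tracks the minimum since TS
        if t + 1 >= n:
            break                    # no look-ahead sample left
        if q[t - 1] > q[t] and q[t + 1] > q[t]:
            dip = t                  # SD4: target local minimum detected
        elif dip is not None and q[t - 1] < q[t] >= q[t + 1]:  # SD5
            delta = (q[t] - q_ref) / q[t]
            if delta > z_e:          # SD6: YES
                return dip, "retrigger"  # SD7: the dip becomes TE
            dip = None               # SD6: NO, keep monitoring
    return n - 1, "end_of_signal"
```

With `q = [1.0, 0.8, 0.6, 0.9, 0.7, 0.3]`, `ts = 0`, `tau = 10` and `z_e = 0.2`, the rise from 0.6 to 0.9 gives a variation index of about 0.33, so the dip at index 2 is returned as the end point and the second strike is excluded from the section.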

As described above, in the first embodiment, a local minimum xL at which the intensity Q of the acoustic signal XE turns to increasing, in the course of its decrease over time after the sounding start point TS, is identified as the sounding end point TE when the variation index δ exceeds the end point threshold ZE. That is, when the sound source sounds several times at short intervals (when the next sounding begins before the sound of the first sounding has sufficiently decayed), the sounding end point TE is identified so that the sounding section P contains only the first sounding corresponding to the sounding start point TS. The section immediately after a sounding, which is important for analyzing the acoustic signal XA, can therefore be identified as the sounding section P with high accuracy. For the identification of the sound source by the sound source identification unit 60, the characteristics immediately after a sounding, where differences between types of sound source are most pronounced, are especially important. The first embodiment, which can identify the section immediately after a sounding as the sounding section P with high accuracy, is therefore particularly suitable.

Furthermore, in the first embodiment, when the intensity Q of the acoustic signal XE decreases, relative to the sounding start point TS, to below the attenuation threshold Z0 that depends on the intensity QH at the sounding start point TS (SD2: YES) before a local minimum xL whose variation index δ exceeds the end point threshold ZE arrives, the time at which the intensity Q falls below the attenuation threshold Z0 is identified as the sounding end point TE. This has the advantage that, when the acoustic signal XE decays without the sound source sounding again after the sounding start point TS, an appropriate sounding end point TE can be set according to the degree of decay from the sounding start point TS.

In the first embodiment, local maxima xH of the intensity Q of the acoustic signal XE are detected one after another, and a local maximum xH is identified as a sounding start point TS when the variation index δ, which depends on the difference (QH - QREF) between the intensity QH at that local maximum xH and the reference value QREF (the minimum of the intensity Q up to that local maximum xH), exceeds the start point threshold ZS. This has the advantage that, among the multiple local maxima xH detected in the acoustic signal XE, the start of a distinct sounding of the sound source can be identified as the sounding start point TS with high accuracy.

Furthermore, the variation index δ is calculated by dividing the difference (QH - QREF) between the intensity QH at the local maximum xH and the reference value QREF (the minimum of the intensity Q up to that local maximum xH) by the intensity QH at the local maximum xH. That is, the difference (QH - QREF) is normalized to a value that does not depend on the volume of the acoustic signal XE. The sounding start point TS and the sounding end point TE can therefore be identified appropriately regardless of the volume of the acoustic signal XE.

＜Sound source identification unit 60＞
FIG. 8 is a configuration diagram of the sound source identification unit 60 of the first embodiment. As illustrated in FIG. 8, the sound source identification unit 60 of the first embodiment includes a harmonic analysis unit 62, a first analysis unit 64, a second analysis unit 66, and a sound source specifying unit 68.

The harmonic analysis unit 62 analyzes, from the feature quantity F of the acoustic signal XA, whether the sound represented by the acoustic signal XA (hereinafter the "target sound") is a harmonic sound or a non-harmonic sound. The harmonic analysis unit 62 of the first embodiment calculates a confidence WH (first confidence) that the target sound is a harmonic sound and a confidence WP (second confidence) that the target sound is a non-harmonic sound.

Specifically, any known pattern recognizer that discriminates between harmonic and non-harmonic sounds by analyzing the feature quantity F may be used as the harmonic analysis unit 62. The first embodiment takes as an example a support vector machine (SVM), a representative statistical model based on supervised learning. That is, the harmonic analysis unit 62 uses a hyperplane determined in advance by machine learning on training data comprising a large number of sounds, both harmonic and non-harmonic, to determine in turn, for each feature quantity F (for each sounding section P), whether the target sound of that feature quantity F is a harmonic sound or a non-harmonic sound. The harmonic analysis unit 62 then calculates, for example, the proportion of determinations within a predetermined period in which the target sound was judged to be a harmonic sound (number of determinations as harmonic / total number of determinations in the period) as the confidence WH of a harmonic sound, and the proportion in which the target sound was judged to be a non-harmonic sound as the confidence WP of a non-harmonic sound (WH + WP = 1). As understood from the above description, the more likely the target sound of the acoustic signal XA is to be a harmonic sound, the larger the confidence WH, and the more likely it is to be a non-harmonic sound, the larger the confidence WP.
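The ratio-based confidences WH and WP can be sketched as follows. The function name `harmonicity_confidence` is hypothetical, and the per-section decisions are assumed to have been produced already by whatever classifier is used (an SVM in the embodiment); the classifier itself is outside this sketch.

```python
def harmonicity_confidence(decisions: list[bool]) -> tuple[float, float]:
    """Return (WH, WP) from per-section decisions in one period.

    Each element of decisions is True when the target sound of that
    sounding section was judged harmonic and False when non-harmonic.
    """
    if not decisions:
        raise ValueError("at least one decision is required")
    w_h = sum(decisions) / len(decisions)  # share judged harmonic
    w_p = 1.0 - w_h                        # WH + WP = 1 by construction
    return w_h, w_p
```

For example, three harmonic decisions out of four give WH = 0.75 and WP = 0.25.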

The first analysis unit 64 analyzes, from the feature quantity F of the acoustic signal XA, which of a plurality of types of harmonic sound source the sound source of the target sound of the acoustic signal XA corresponds to. A harmonic sound source is a sound source that produces harmonic sounds (for example, a harmonic instrument). FIG. 8 illustrates four types, bass (Bass), guitar (Guitar), male singer (male Vo.), and female singer (female Vo.), as harmonic sound sources that are candidates for the sound source of the target sound. Specifically, for each of N types of harmonic sound source (N being a natural number of 2 or more), the first analysis unit 64 of the first embodiment sets an evaluation value EH(n) (EH(1) to EH(N)) that depends on the confidence that the sound source of the target sound is that harmonic sound source.

FIG. 9 is a flowchart of the process by which the first analysis unit 64 sets the evaluation values EH(1) to EH(N) (hereinafter the "harmonic analysis process"). The harmonic analysis process of FIG. 9 is executed each time the feature extraction unit 50 extracts a feature quantity F (and hence for each sounding section P).

When the harmonic analysis process starts, the first analysis unit 64 takes every combination of two harmonic sound sources chosen from the N types selected in advance (NC2 = N(N-1)/2 combinations in total) and, for each combination, uses the feature quantity F to determine which of the two harmonic sound sources in that combination the sound source of the target sound corresponds to (SA1). A support vector machine with the two harmonic sound sources as its candidate classes is well suited to this determination. That is, by applying the feature quantity F to the NC2 support vector machines, one per combination of harmonic sound sources, one of the two harmonic sound sources is selected for each combination as the sound source of the target sound.

For each of the N types of harmonic sound source, the first analysis unit 64 calculates a confidence CH(n) (CH(1) to CH(N)) that the sound source of the target sound is that harmonic sound source (SA2). The confidence CH(n) of any one (the n-th) harmonic sound source is, for example, the proportion of the NC2 determinations in which the sound source of the target sound was judged to be the n-th harmonic sound source (number of determinations as that harmonic sound source / NC2). As understood from the above description, the more likely the sound source of the target sound of the acoustic signal XA is to be the n-th of the N types of harmonic sound source, the larger the confidence CH(n).
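Steps SA1 and SA2 amount to one-versus-one voting over all pairs of candidate sources. The sketch below illustrates this in Python; `pairwise_confidence` and the `classify_pair` callable are hypothetical names, and the callable stands in for the trained two-class support vector machines, which are not reproduced here.

```python
from collections import Counter
from itertools import combinations
from typing import Callable, Hashable, Sequence


def pairwise_confidence(
    sources: Sequence[Hashable],
    classify_pair: Callable[[Hashable, Hashable, object], Hashable],
    feature: object,
) -> dict:
    """Return CH(n) for every source as (wins / number of pairs).

    classify_pair(a, b, feature) must return either a or b, i.e. the
    source that the two-class discriminator for that pair selects.
    """
    pairs = list(combinations(sources, 2))  # all NC2 combinations (SA1)
    if not pairs:
        raise ValueError("at least two candidate sources are required")
    wins = Counter(classify_pair(a, b, feature) for a, b in pairs)
    return {s: wins[s] / len(pairs) for s in sources}  # SA2
```

With four candidate sources there are six pairwise determinations, so a source selected in three of them receives a confidence of 0.5. Because each determination selects exactly one source, the confidences over all sources sum to 1.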

For each of the N types of harmonic sound source, the first analysis unit 64 sets, as the evaluation value EH(n), a value (score) corresponding to the rank of the confidence CH(n) calculated for that harmonic sound source (SA3). Specifically, each harmonic sound source is given an evaluation value EH(n) according to the rank of its confidence CH(n), such that a larger confidence CH(n) yields a larger evaluation value EH(n). For example, the evaluation value EH(n) of the harmonic sound source ranked first in descending order of confidence CH(n) is set to a value ε1 (for example, ε1 = 100), that of the harmonic sound source ranked second is set to a value ε2 smaller than ε1 (for example, ε2 = 80), that of the harmonic sound source ranked third is set to a value ε3 smaller than ε2 (for example, ε3 = 60), and the evaluation values EH(n) of the remaining harmonic sound sources below a predetermined rank are set to a minimum value (for example, 0). As understood from the above description, the more likely the sound source of the target sound of the acoustic signal XA is to be the n-th of the N types of harmonic sound source, the larger the evaluation value EH(n). The above is a preferred example of the harmonic analysis process.
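Step SA3 maps confidence ranks to fixed scores. A minimal Python sketch follows, using the example values ε1 = 100, ε2 = 80, ε3 = 60 and a floor of 0 from the text. The name `rank_scores` is hypothetical, and the handling of ties (left to the stable sort order) is an assumption, since the specification does not address ties.

```python
def rank_scores(confidence: dict, points: tuple = (100, 80, 60),
                floor: int = 0) -> dict:
    """Return an evaluation value per source based on confidence rank."""
    # Sort source names by confidence, highest first.
    ranked = sorted(confidence, key=confidence.get, reverse=True)
    return {
        name: (points[i] if i < len(points) else floor)
        for i, name in enumerate(ranked)
    }
```

Applied to the confidences `{"bass": 0.0, "guitar": 0.5, "male_vo": 1/3, "female_vo": 1/6}`, this gives guitar 100, male_vo 80, female_vo 60, and bass 0.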

The second analysis unit 66 of FIG. 8 analyzes, from the feature quantity F of the acoustic signal XA, which of a plurality of types of non-harmonic sound source the sound source of the target sound of the acoustic signal XA corresponds to. A non-harmonic sound source is a sound source that produces non-harmonic sounds (for example, a non-harmonic instrument such as a percussion instrument). FIG. 8 illustrates five types, bass drum (Kick), snare drum (Snare), hi-hat (Hi-Hat), floor tom (F-Tom), and cymbal (Cymbal), as non-harmonic sound sources that are candidates for the sound source of the target sound. Specifically, for each of M types of non-harmonic sound source (M being a natural number of 2 or more), the second analysis unit 66 of the first embodiment sets an evaluation value EP(m) (EP(1) to EP(M)) that depends on the confidence that the sound source of the target sound is that non-harmonic sound source. The number N of types of harmonic sound source and the number M of types of non-harmonic sound source may be the same or different.

The setting of the M evaluation values EP(1) to EP(M) by the second analysis unit 66 (the non-harmonic analysis process) is similar to the harmonic analysis process illustrated in FIG. 9 (the setting of the evaluation values EH(n) by the first analysis unit 64). Specifically, for each of all combinations of two types chosen from the M types of non-harmonic sound source (MC2 = M(M-1)/2 combinations), the second analysis unit 66 determines which of the two non-harmonic sound sources in that combination the sound source of the target sound corresponds to, and calculates, for each non-harmonic sound source, a confidence CP(m) that the sound source of the target sound is the m-th non-harmonic sound source. As with the discrimination of harmonic sound sources in the harmonic analysis process, a support vector machine is well suited to discriminating non-harmonic sound sources.

The second analysis unit 66 then sets, for each of the M types of non-harmonic sound source, a value corresponding to the rank of the confidence CP(m) as the evaluation value EP(m). A non-harmonic sound source at a given rank of confidence CP(m) is given the same evaluation value EP(m) as the evaluation value EH(n) of the harmonic sound source at the same rank of confidence CH(n). Specifically, the evaluation value EP(m) of the non-harmonic sound source ranked first in descending order of confidence CP(m) is set to the value ε1, that of the non-harmonic sound source ranked second is set to the value ε2, that of the non-harmonic sound source ranked third is set to the value ε3, and the evaluation values EP(m) of the remaining non-harmonic sound sources below a predetermined rank are set to the minimum value (for example, 0). Accordingly, the more likely the sound source of the target sound of the acoustic signal XA is to be the m-th of the M types of non-harmonic sound source, the larger the evaluation value EP(m).

特徴量抽出部５０が音響信号ＸAから抽出する任意の１個の特徴量Ｆは、前述の通り、相異なる特性値ｆ1（第１特性値）および特性値ｆ2（第２特性値）を含む複数の特性値ｆで構成される。第１実施形態の第１解析部６４は、特徴量Ｆの特性値ｆ1を利用して、対象音の発音源がＮ種類の調波音源の各々に該当する確度ＣH(n)を解析する。他方、第２解析部６６は、特徴量Ｆの特性値ｆ2を利用して、対象音の発音源がＭ種類の非調波音源の各々に該当する確度ＣP(m)を解析する。すなわち、第１解析部６４が調波音源の確度ＣH(n)の算定に利用する特徴量Ｆ（特性値ｆ1）と第２解析部６６が非調波音源の確度ＣP(m)の算定に適用する特徴量Ｆ（特性値ｆ2）とは相違する。 As described above, any one feature quantity F that the feature quantity extraction unit 50 extracts from the acoustic signal XA is composed of a plurality of characteristic values f, including mutually different characteristic values f1 (first characteristic values) and f2 (second characteristic values). The first analysis unit 64 of the first embodiment uses the characteristic values f1 of the feature quantity F to analyze the accuracy CH(n) that the sound source of the target sound corresponds to each of the N types of harmonic sound sources. The second analysis unit 66, on the other hand, uses the characteristic values f2 of the feature quantity F to analyze the accuracy CP(m) that the sound source of the target sound corresponds to each of the M types of non-harmonic sound sources. That is, the feature quantity F (characteristic values f1) that the first analysis unit 64 uses to calculate the accuracy CH(n) of the harmonic sound sources differs from the feature quantity F (characteristic values f2) that the second analysis unit 66 applies to calculate the accuracy CP(m) of the non-harmonic sound sources.

具体的には、第１解析部６４による確度ＣH(n)の算定には、調波音源の種類毎に相違が顕著となる特性値ｆ1が利用される。例えば、音色を表すＭＦＣＣや、基音成分に対する倍音成分の強度比等の特性値ｆ1が、調波音の確度ＣH(n)の算定に好適に利用される。他方、第２解析部６６による確度ＣP(m)の算定には、非調波音源の種類毎に相違が顕著となる特性値ｆ2が利用される。例えば、音響の立上がりの急峻度や零交差数等の特性値ｆ2が、非調波音の確度ＣP(m)の算定に好適に利用される。なお、第１解析部６４が利用する特性値ｆ1と第２解析部６６が利用する特性値ｆ2とを部分的に共通させることも可能である。 Specifically, the first analysis unit 64 calculates the accuracy CH(n) using characteristic values f1 that differ markedly among the types of harmonic sound sources. For example, characteristic values f1 such as MFCCs representing timbre, or the intensity ratio of the overtone components to the fundamental component, are suitably used to calculate the accuracy CH(n) for harmonic sounds. The second analysis unit 66, on the other hand, calculates the accuracy CP(m) using characteristic values f2 that differ markedly among the types of non-harmonic sound sources. For example, characteristic values f2 such as the steepness of the rise of the sound or the number of zero crossings are suitably used to calculate the accuracy CP(m) for non-harmonic sounds. The characteristic values f1 used by the first analysis unit 64 and the characteristic values f2 used by the second analysis unit 66 may also partially overlap.

図８の音源特定部６８は、調波性解析部６２と第１解析部６４と第２解析部６６とによる以上の解析の結果に応じて音響信号ＸAの発音源の種類を特定する。発音源の種類の特定は発音区間Ｐ毎に実行される。図８に例示される通り、第１実施形態の音源特定部６８は、乗算部６８２と乗算部６８４と選択処理部６８６とを包含する。 The sound source specifying unit 68 in FIG. 8 specifies the type of sound source of the acoustic signal XA according to the results of the above analyses by the harmonicity analysis unit 62, the first analysis unit 64, and the second analysis unit 66. The type of sound source is specified for each sound generation section P. As illustrated in FIG. 8, the sound source specifying unit 68 of the first embodiment includes a multiplication unit 682, a multiplication unit 684, and a selection processing unit 686.

乗算部６８２は、第１解析部６４がＮ種類の調波音源について設定したＮ個の評価値ＥH(1)〜ＥH(N)の各々に、調波性解析部６２が解析した調波音の確度ＷHを乗算することでＮ個の識別指標Ｒ（Ｒ＝ＥH(n)×ＷH）を算定する。他方、乗算部６８４は、第２解析部６６がＭ種類の非調波音源について設定したＭ個の評価値ＥP(1)〜ＥP(M)の各々に、調波性解析部６２が解析した非調波音の確度ＷPを乗算することでＭ個の識別指標Ｒ（Ｒ＝ＥP(m)×ＷP）を算定する。乗算部６８２および乗算部６８４の処理により、Ｎ種類の調波音源とＭ種類の非調波音源とを含むＫ種類（Ｋ＝Ｎ＋Ｍ）の候補音源の各々について識別指標Ｒが算定される。以上の説明から理解される通り、確度ＷHは、調波音の各評価値ＥH(n)に対する加重値に相当し、確度ＷPは、非調波音の各評価値ＥP(m)に対する加重値に相当する。対象音が調波音に該当する確度ＷHが大きいほど調波音源の識別指標Ｒが相対的に優勢となり、対象音が非調波音に該当する確度ＷPが大きいほど非調波音源の識別指標Ｒが相対的に優勢となる。 The multiplication unit 682 calculates N identification indices R (R = EH(n) × WH) by multiplying each of the N evaluation values EH(1) to EH(N), which the first analysis unit 64 set for the N types of harmonic sound sources, by the harmonic-sound accuracy WH analyzed by the harmonicity analysis unit 62. The multiplication unit 684, on the other hand, calculates M identification indices R (R = EP(m) × WP) by multiplying each of the M evaluation values EP(1) to EP(M), which the second analysis unit 66 set for the M types of non-harmonic sound sources, by the non-harmonic-sound accuracy WP analyzed by the harmonicity analysis unit 62. Through the processing of the multiplication units 682 and 684, an identification index R is calculated for each of K types (K = N + M) of candidate sound sources comprising the N types of harmonic sound sources and the M types of non-harmonic sound sources. As understood from the above description, the accuracy WH corresponds to a weight on each evaluation value EH(n) of the harmonic sounds, and the accuracy WP corresponds to a weight on each evaluation value EP(m) of the non-harmonic sounds. The larger the accuracy WH that the target sound is a harmonic sound, the more dominant the identification indices R of the harmonic sound sources become; the larger the accuracy WP that the target sound is a non-harmonic sound, the more dominant the identification indices R of the non-harmonic sound sources become.

選択処理部６８６は、乗算部６８２および乗算部６８４が算定したＫ個の識別指標Ｒに応じて音響信号ＸAの対象音の発音源の種類を特定し、当該発音源の種類を示す音源識別情報Ｄ（例えば楽器名）を生成する。具体的には、選択処理部６８６は、Ｋ種類の候補音源のうち識別指標Ｒが最大となる１種類の候補音源を対象音の発音源として選択し、当該候補音源を指定する音源識別情報Ｄを生成する。すなわち、音響信号ＸAの対象音の発音源の種類が識別される。以上に例示した処理が複数の音響信号ＸAの各々について実行されることで、対象音の発音源の種類を示す音源識別情報Ｄが音響信号ＸA毎に生成される。音響解析部２０の具体例は以上の通りである。 The selection processing unit 686 identifies the type of sound source of the target sound of the acoustic signal XA according to the K identification indices R calculated by the multiplication units 682 and 684, and generates sound source identification information D (for example, an instrument name) indicating that type of sound source. Specifically, the selection processing unit 686 selects, from the K types of candidate sound sources, the one candidate sound source with the largest identification index R as the sound source of the target sound, and generates sound source identification information D designating that candidate sound source. The type of sound source of the target sound of the acoustic signal XA is thus identified. By executing the processing exemplified above for each of the plurality of acoustic signals XA, sound source identification information D indicating the type of sound source of the target sound is generated for each acoustic signal XA. This concludes the specific example of the acoustic analysis unit 20.
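The weighting by the multiplication units and the selection of the largest identification index R can be sketched as below. The function name `identify_source`, the instrument names, and the numeric values are hypothetical and serve only as an illustration; the description itself fixes only the products R = EH(n) × WH and R = EP(m) × WP and the choice of the maximum.

```python
def identify_source(eh, ep, wh, wp, harmonic_names, nonharmonic_names):
    """Return (name, R) of the candidate source with the largest index R.

    eh, ep: evaluation values EH(1..N) and EP(1..M)
    wh, wp: accuracies that the target sound is harmonic / non-harmonic
    """
    # R = EH(n) * WH for the N harmonic candidates.
    candidates = [(e * wh, name) for e, name in zip(eh, harmonic_names)]
    # R = EP(m) * WP for the M non-harmonic candidates.
    candidates += [(e * wp, name) for e, name in zip(ep, nonharmonic_names)]
    # Select the one candidate whose identification index is largest.
    best_r, best_name = max(candidates, key=lambda c: c[0])
    return best_name, best_r


if __name__ == "__main__":
    name, r = identify_source(
        eh=[1.0, 0.5, 0.25], ep=[0.5, 1.0], wh=0.3, wp=0.7,
        harmonic_names=["piano", "guitar", "violin"],
        nonharmonic_names=["snare", "kick"],
    )
    print(name, r)
```

In this example the top-ranked harmonic and non-harmonic candidates carry the same evaluation value, so the harmonicity weights WH and WP decide which group wins.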

図１の音響処理部３０は、調波性解析部６２が音響信号ＸA毎に解析した音源識別情報Ｄを参照して複数の音響信号ＸAに音響処理を実行することで音響信号ＸBを生成する。具体的には、音響信号ＸAの音源識別情報Ｄが示す発音源の種類毎に事前に設定された音響処理が当該音響信号ＸAに対して実行される。音響信号ＸAに対する音響処理としては、例えば残響効果や歪効果等の各種の音響効果を付与する効果付与処理（エフェクタ）や、周波数帯域毎の音量を調整する特性調整処理（イコライザ），音像が定位する位置を調整する定位調整処理（パン），音量を調整する音量調整処理が例示される。効果付与処理で音響信号ＸAに付与される音響効果の種類や度合，特性調整処理で音響信号ＸAに付与される周波数特性，定位調整処理で調整される音像の位置，音量調整処理による調整内容（ゲイン）等の各種のパラメータが、音源識別情報Ｄが示す発音源の種類毎に個別に設定される。そして、音響処理部３０は、以上に例示した音響処理後の複数の音響信号ＸAを混合（ミキシング）することで音響信号ＸBを生成する。すなわち、第１実施形態の音響処理部３０は、調波性解析部６２による発音源の識別結果を反映した自動ミキシングを実現する。 The acoustic processing unit 30 in FIG. 1 generates an acoustic signal XB by performing acoustic processing on the plurality of acoustic signals XA with reference to the sound source identification information D analyzed for each acoustic signal XA by the harmonicity analysis unit 62. Specifically, acoustic processing set in advance for each type of sound source indicated by the sound source identification information D of an acoustic signal XA is executed on that acoustic signal XA. Examples of the acoustic processing applied to an acoustic signal XA include effect imparting processing (effector) that imparts various acoustic effects such as reverberation or distortion, characteristic adjustment processing (equalizer) that adjusts the volume of each frequency band, localization adjustment processing (pan) that adjusts the position at which the sound image is localized, and volume adjustment processing that adjusts the volume. Various parameters, such as the type and degree of the acoustic effect imparted by the effect imparting processing, the frequency characteristic imparted by the characteristic adjustment processing, the position of the sound image adjusted by the localization adjustment processing, and the adjustment (gain) applied by the volume adjustment processing, are set individually for each type of sound source indicated by the sound source identification information D. The acoustic processing unit 30 then generates the acoustic signal XB by mixing the plurality of acoustic signals XA after the acoustic processing exemplified above. That is, the acoustic processing unit 30 of the first embodiment realizes automatic mixing that reflects the sound source identification result of the harmonicity analysis unit 62.
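As a rough illustration of automatic mixing driven by the sound source identification information D, the sketch below applies a per-source gain and pan and sums the results into a stereo pair. Everything concrete here is an assumption made for illustration: the preset table, the source names, the linear pan law, and the function name `auto_mix` do not appear in the description, and effects and equalization are left out.

```python
# Hypothetical presets: one gain and one pan position (-1 = left, +1 = right)
# per sound source type.
PRESETS = {
    "vocal": {"gain": 1.0, "pan": 0.0},
    "guitar": {"gain": 0.5, "pan": 1.0},
    "kick": {"gain": 0.8, "pan": 0.0},
}


def auto_mix(signals, source_ids, presets=PRESETS):
    """Mix mono signals XA into a stereo signal XB using per-source presets."""
    length = max((len(s) for s in signals), default=0)
    left = [0.0] * length
    right = [0.0] * length
    for signal, source_id in zip(signals, source_ids):
        preset = presets[source_id]
        # Simple linear pan law, used here only to keep the arithmetic obvious.
        gain_left = preset["gain"] * (1.0 - preset["pan"]) / 2.0
        gain_right = preset["gain"] * (1.0 + preset["pan"]) / 2.0
        for i, sample in enumerate(signal):
            left[i] += gain_left * sample
            right[i] += gain_right * sample
    return left, right


if __name__ == "__main__":
    xb_left, xb_right = auto_mix([[1.0, 1.0], [2.0]], ["vocal", "guitar"])
    print(xb_left, xb_right)
```

Keeping the parameters in a table keyed by source type mirrors the idea that each type of sound source has its own preconfigured processing.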

図１０は、第１実施形態の音源識別部６０が任意の１系統の音響信号ＸAについて対象音の発音源の種類を特定する処理（以下「音源識別処理」という）のフローチャートである。複数の音響信号ＸAの各々について、特徴量抽出部５０による特徴量Ｆの抽出毎（発音区間Ｐ毎）に図１０の音源識別処理が実行される。 FIG. 10 is a flowchart of the process (hereinafter the “sound source identification process”) by which the sound source identification unit 60 of the first embodiment specifies the type of sound source of the target sound for any one acoustic signal XA. For each of the plurality of acoustic signals XA, the sound source identification process of FIG. 10 is executed each time the feature quantity extraction unit 50 extracts a feature quantity F (that is, for each sound generation section P).

音源識別処理を開始すると、調波性解析部６２は、音響信号ＸAが表す対象音が調波音および非調波音の何れに該当するかを音響信号ＸAの特徴量Ｆから解析する（ＳB1）。他方、第１解析部６４は、図９を参照して説明した調波解析処理によりＮ種類の調波音源の各々について評価値ＥH(n)（ＥH(1)〜ＥH(N)）を算定し（ＳB2）、第２解析部６６は、調波解析処理と同様の非調波解析処理によりＭ種類の非調波音源の各々について評価値ＥP(m)（ＥP(1)〜ＥP(M)）を算定する（ＳB3）。そして、音源特定部６８は、調波性解析部６２と第１解析部６４と第２解析部６６とによる以上の解析の結果に応じて音響信号ＸAの発音源の種類を特定する（ＳB4）。なお、調波性解析部６２による調波性の解析と、第１解析部６４による調波解析処理と、第２解析部６６による非調波解析処理との順序は任意である。例えば調波解析処理（ＳB2）および非調波解析処理（ＳB3）の実行後に調波性解析部６２が調波性を解析することも可能である。 When the sound source identification process starts, the harmonicity analysis unit 62 analyzes, from the feature quantity F of the acoustic signal XA, whether the target sound represented by the acoustic signal XA corresponds to a harmonic sound or a non-harmonic sound (SB1). The first analysis unit 64 calculates the evaluation values EH(n) (EH(1) to EH(N)) for each of the N types of harmonic sound sources by the harmonic analysis process described with reference to FIG. 9 (SB2), and the second analysis unit 66 calculates the evaluation values EP(m) (EP(1) to EP(M)) for each of the M types of non-harmonic sound sources by a non-harmonic analysis process similar to the harmonic analysis process (SB3). The sound source specifying unit 68 then specifies the type of sound source of the acoustic signal XA according to the results of the above analyses by the harmonicity analysis unit 62, the first analysis unit 64, and the second analysis unit 66 (SB4). The order of the harmonicity analysis by the harmonicity analysis unit 62, the harmonic analysis process by the first analysis unit 64, and the non-harmonic analysis process by the second analysis unit 66 is arbitrary. For example, the harmonicity analysis unit 62 may analyze the harmonicity after the harmonic analysis process (SB2) and the non-harmonic analysis process (SB3) have been executed.

以上に説明した通り、第１実施形態では、調波音と非調波音とを相互に区別して対象音の発音源の種類が特定される。具体的には、対象音が調波音および非調波音の各々に該当する確度（ＷH，ＷP）を調波性解析部６２が解析した結果と、対象音の発音源がＮ種類の調波音源の各々に該当する確度ＣH(n)を第１解析部６４が解析した結果と、対象音の発音源がＭ種類の非調波音源の各々に該当する確度ＣP(m)を第２解析部６６が解析した結果とを利用して、対象音の発音源の種類が特定される。したがって、調波音と非調波音とを区別せずに発音源の種類を特定する構成と比較して対象音の発音源の種類を高精度に特定することが可能である。第１解析部６４や第２解析部６６の未学習の発音源についても音響処理部３０による調波音／非調波音の識別は可能であるという利点もある。 As described above, in the first embodiment, the type of sound source of the target sound is specified while distinguishing harmonic sounds and non-harmonic sounds from each other. Specifically, the type of sound source of the target sound is specified using the result of the harmonicity analysis unit 62 analyzing the accuracies (WH, WP) that the target sound corresponds to a harmonic sound and to a non-harmonic sound, the result of the first analysis unit 64 analyzing the accuracy CH(n) that the sound source of the target sound corresponds to each of the N types of harmonic sound sources, and the result of the second analysis unit 66 analyzing the accuracy CP(m) that the sound source of the target sound corresponds to each of the M types of non-harmonic sound sources. Therefore, the type of sound source of the target sound can be specified with higher accuracy than in a configuration that specifies the type of sound source without distinguishing harmonic sounds from non-harmonic sounds. A further advantage is that, even for sound sources that the first analysis unit 64 and the second analysis unit 66 have not learned, the acoustic processing unit 30 can still discriminate harmonic sounds from non-harmonic sounds.

また、第１実施形態では、対象音が調波音に該当する確度ＷHと各調波音源の評価値ＥH(n)との乗算、および、対象音が非調波音に該当する確度ＷPと各非調波音源の評価値ＥP(m)との乗算により、Ｋ種類の候補楽器（Ｎ種類の調波音源およびＭ種類の非調波音源）の各々について識別指標Ｒが算定され、各識別指標Ｒに応じて対象音の発音源の種類が特定される。すなわち、対象音が調波音に該当する確度ＷHが大きいほど調波音源の識別指標Ｒが相対的に優勢となり、対象音が非調波音に該当する確度ＷPが大きいほど非調波音源の識別指標Ｒが相対的に優勢となる。したがって、Ｋ個の識別指標Ｒの比較により対象音の発音源の種類を簡便かつ高精度に特定できるという利点がある。 In the first embodiment, the identification index R is calculated for each of the K types of candidate instruments (the N types of harmonic sound sources and the M types of non-harmonic sound sources) by multiplying the accuracy WH that the target sound is a harmonic sound by the evaluation value EH(n) of each harmonic sound source, and by multiplying the accuracy WP that the target sound is a non-harmonic sound by the evaluation value EP(m) of each non-harmonic sound source, and the type of sound source of the target sound is specified according to the identification indices R. That is, the larger the accuracy WH that the target sound is a harmonic sound, the more dominant the identification indices R of the harmonic sound sources become, and the larger the accuracy WP that the target sound is a non-harmonic sound, the more dominant the identification indices R of the non-harmonic sound sources become. Therefore, there is an advantage that the type of sound source of the target sound can be specified simply and with high accuracy by comparing the K identification indices R.

ところで、例えば対象音の発音源が調波音源に該当する確度ＣH(n)を評価値ＥH(n)として利用するとともに対象音の発音源が非調波音源に該当する確度ＣP(m)を評価値ＥP(m)として利用する構成（以下「比較例」という）では、評価値ＥH(n)の数値が調波音源の種類数Ｎに依存するとともに評価値ＥP(m)の数値が非調波音源の種類数Ｍに依存する。例えば、調波音源の種類数Ｎが多いほど確度ＣH(n)は小さい数値となる。したがって、調波音源の種類数Ｎと非調波音源の種類数Ｍとが相違する場合には、評価値ＥH(n)と評価値ＥP(m)とを適切に比較できないという問題がある。第１実施形態では、対象音の発音源が調波音源に該当する確度ＣH(n)の順位に応じた数値が評価値ＥH(n)として調波音源毎に設定され、対象音の発音源が非調波音源に該当する確度ＣP(m)の順位に応じた数値が評価値ＥP(m)として非調波音源毎に設定される。すなわち、評価値ＥH(n)は調波音源の種類数Ｎに依存しない数値に設定され、評価値ＥP(m)は非調波音源の種類数Ｍに依存しない数値に設定される。したがって、第１実施形態によれば、例えば調波音源の種類数Ｎと非調波音源の種類数Ｍとが相違する場合でも評価値ＥH(n)と評価値ＥP(m)とを適切に比較できるという利点がある。調波音源の種類数Ｎおよび非調波音源の種類数Ｍの制約が緩和されると換言することも可能である。ただし、前述の比較例も本発明の範囲には包含される。 Incidentally, in a configuration (hereinafter the “comparative example”) that uses the accuracy CH(n) that the sound source of the target sound corresponds to a harmonic sound source as the evaluation value EH(n), and the accuracy CP(m) that it corresponds to a non-harmonic sound source as the evaluation value EP(m), the value of EH(n) depends on the number N of types of harmonic sound sources and the value of EP(m) depends on the number M of types of non-harmonic sound sources. For example, the larger the number N of types of harmonic sound sources, the smaller the accuracy CH(n) becomes. Therefore, when the number N of types of harmonic sound sources differs from the number M of types of non-harmonic sound sources, there is a problem that the evaluation values EH(n) and EP(m) cannot be compared appropriately. In the first embodiment, a numerical value corresponding to the rank of the accuracy CH(n) that the sound source of the target sound corresponds to a harmonic sound source is set for each harmonic sound source as the evaluation value EH(n), and a numerical value corresponding to the rank of the accuracy CP(m) that it corresponds to a non-harmonic sound source is set for each non-harmonic sound source as the evaluation value EP(m). That is, EH(n) is set to a value that does not depend on N, and EP(m) is set to a value that does not depend on M. Therefore, according to the first embodiment, there is an advantage that EH(n) and EP(m) can be compared appropriately even when, for example, N and M differ. Put another way, the constraints on the numbers N and M of sound source types are relaxed. Note, however, that the comparative example described above is also included within the scope of the present invention.

また、第１実施形態では、第１解析部６４が調波音源の確度ＣH(n)の算定に利用する特徴量Ｆ（特性値ｆ1）と第２解析部６６が非調波音源の確度ＣP(m)の算定に適用する特徴量Ｆ（特性値ｆ2）とが相違する。具体的には、例えば第１解析部６４による確度ＣH(n)の算定には調波音の識別に好適な特性値ｆ1が利用され、第２解析部６６による確度ＣP(m)の算定には非調波音の識別に好適な特性値ｆ2が利用される。したがって、調波音源の確度ＣH(n)の算定と非調波音源の確度ＣP(m)の算定とに同種の特徴量を利用する構成と比較して、対象音の発音源を高精度に特定できるという利点がある。ただし、第１解析部６４と第２解析部６６とが共通の特徴量Ｆを利用することも可能である。 Also, in the first embodiment, the feature quantity F (characteristic values f1) that the first analysis unit 64 uses to calculate the accuracy CH(n) of the harmonic sound sources differs from the feature quantity F (characteristic values f2) that the second analysis unit 66 applies to calculate the accuracy CP(m) of the non-harmonic sound sources. Specifically, for example, characteristic values f1 suited to identifying harmonic sounds are used by the first analysis unit 64 to calculate the accuracy CH(n), and characteristic values f2 suited to identifying non-harmonic sounds are used by the second analysis unit 66 to calculate the accuracy CP(m). Therefore, there is an advantage that the sound source of the target sound can be specified with higher accuracy than in a configuration that uses the same kind of feature quantity for calculating both the accuracy CH(n) of the harmonic sound sources and the accuracy CP(m) of the non-harmonic sound sources. The first analysis unit 64 and the second analysis unit 66 may nevertheless use a common feature quantity F.

＜第２実施形態＞ <Second Embodiment>
本発明の第２実施形態を説明する。なお、以下に例示する各形態において作用や機能が第１実施形態と同様である要素については、第１実施形態の説明で使用した符号を流用して各々の詳細な説明を適宜に省略する。 A second embodiment of the present invention will be described. In each of the embodiments illustrated below, elements whose operations and functions are the same as in the first embodiment are denoted by the reference signs used in the description of the first embodiment, and detailed description of each is omitted as appropriate.

図１１に例示される通り、発音源による１回の発音（例えば打楽器の１回の打撃による発音）の開始直後に強度Ｑが増加する過程において複数回の極大点ｘH（ｘH1，ｘH2）が観測される場合がある。図１１の極大点ｘH1の変動指標δは始点閾値ＺSを上回るから、変動指標δが始点閾値ＺSを上回る全部の極大点ｘHを発音始点ＴSとして確定する第１実施形態では、極大点ｘH1および極大点ｘH2の双方が発音始点ＴSとして特定される。しかし、音響信号ＸEの強度は、極大点ｘH1の直後の極大点ｘH2まで増加する。すなわち、極大点ｘH1および極大点ｘH2は、実際には発音源の１回の発音に対応すると推定される。したがって、極大点ｘH1を発音始点ＴSとして特定することなく直後の極大点ｘH2のみを発音始点ＴSとして特定し、極大点ｘH1および極大点ｘH2の双方を１個の発音区間Ｐに包含させるべきである。以上の事情を考慮して、第２実施形態では、音響信号ＸEの強度Ｑの１個の極大点ｘH1の直後に、当該極大点ｘH1を上回る強度Ｑの極大点ｘH2を検出した場合に、先行の極大点ｘH1を発音始点ＴSの候補から除外する。 As illustrated in FIG. 11, a plurality of local maximum points xH (xH1, xH2) may be observed while the intensity Q is increasing immediately after the start of a single sound generation by a sound source (for example, the sound produced by one strike of a percussion instrument). Since the variation index δ of the local maximum point xH1 in FIG. 11 exceeds the start point threshold ZS, in the first embodiment, which confirms every local maximum point xH whose variation index δ exceeds the start point threshold ZS as a sound generation start point TS, both xH1 and xH2 are specified as sound generation start points TS. However, the intensity of the acoustic signal XE continues to increase up to the local maximum point xH2 immediately after xH1. That is, xH1 and xH2 are presumed to actually correspond to a single sound generation by the sound source. Therefore, only the immediately following local maximum point xH2 should be specified as the sound generation start point TS, without specifying xH1 as a start point, and both xH1 and xH2 should be included in one sound generation section P. In view of the above, in the second embodiment, when a local maximum point xH2 whose intensity Q exceeds that of xH1 is detected immediately after one local maximum point xH1 of the intensity Q of the acoustic signal XE, the preceding local maximum point xH1 is excluded from the candidates for the sound generation start point TS.

具体的には、変動指標δが始点閾値ＺSを上回る任意の１個の極大点ｘH1（第１極大点）を第１実施形態と同様の方法で検出すると、始点解析部４４は、図１１に例示される通り、当該極大点ｘH1に対応する時間軸上の位置に待機区間Ｖを設定する。待機区間Ｖは、極大点ｘH1を発音始点ＴSとして確定することを留保する区間であり、極大点ｘH1以降に設定される。第２実施形態の始点解析部４４は、極大点ｘH1を始点とする所定長の待機区間Ｖを設定する。 Specifically, upon detecting, by the same method as in the first embodiment, any one local maximum point xH1 (first local maximum point) whose variation index δ exceeds the start point threshold ZS, the start point analysis unit 44 sets a standby section V at a position on the time axis corresponding to that local maximum point xH1, as illustrated in FIG. 11. The standby section V is a section during which confirmation of xH1 as the sound generation start point TS is withheld, and it is set at or after xH1. The start point analysis unit 44 of the second embodiment sets a standby section V of predetermined length starting at the local maximum point xH1.

待機区間Ｖを設定すると、始点解析部４４は、極大点ｘH1以降の音響信号ＸEについて極大点ｘHの探索を継続する。前述の通り、音響信号ＸEの強度Ｑは、極大点ｘH1以降に増加する可能性がある。極大点ｘH1を上回る強度の極大点ｘH2（第２極大点）を待機区間Ｖ内に検出した場合、始点解析部４４は、先行の極大点ｘH1を発音始点ＴSの候補から除外する。以上の処理を順次に実行し、検出済の極大点ｘHを上回る強度の極大点ｘHを検出することなく待機区間Ｖが経過すると、始点解析部４４は、待機区間Ｖの満了前に最後に検出した極大点ｘHを発音始点ＴSとして確定する。 After setting the standby section V, the start point analysis unit 44 continues searching for local maximum points xH in the acoustic signal XE after xH1. As described above, the intensity Q of the acoustic signal XE may increase after xH1. When a local maximum point xH2 (second local maximum point) whose intensity exceeds that of xH1 is detected within the standby section V, the start point analysis unit 44 excludes the preceding local maximum point xH1 from the candidates for the sound generation start point TS. The above processing is executed successively, and when the standby section V elapses without a local maximum point xH being detected whose intensity exceeds that of the already detected local maximum point xH, the start point analysis unit 44 confirms the local maximum point xH last detected before the expiry of the standby section V as the sound generation start point TS.
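The standby-section logic can be sketched as below. The sketch assumes that the input already contains only local maxima whose variation index δ exceeded the start point threshold ZS, sorted by time. The function name `confirm_onsets`, the handling of a maximum lying exactly on the end of the section, and the choice to ignore weaker maxima inside the section are assumptions of this illustration. The optional `restart` flag restarts the section at each stronger maximum.

```python
def confirm_onsets(maxima, hold, restart=False):
    """Confirm sound generation start points from candidate local maxima.

    maxima:  list of (time, intensity) pairs, sorted by time
    hold:    length of the standby section V
    restart: if True, a stronger maximum starts a new standby section
    """
    onsets = []
    pending = None    # candidate whose confirmation is currently withheld
    deadline = None   # end of the current standby section
    for time, intensity in maxima:
        # The standby section expired without a stronger maximum: confirm.
        if pending is not None and time > deadline:
            onsets.append(pending[0])
            pending = None
        if pending is None:
            # First maximum: open a standby section starting here.
            pending = (time, intensity)
            deadline = time + hold
        elif intensity > pending[1]:
            # Stronger maximum inside the section: drop the earlier candidate.
            pending = (time, intensity)
            if restart:
                deadline = time + hold
        # Weaker maxima inside the section are ignored in this sketch.
    if pending is not None:
        onsets.append(pending[0])
    return onsets


if __name__ == "__main__":
    candidates = [(0.00, 0.4), (0.02, 0.9), (0.50, 0.7)]
    print(confirm_onsets(candidates, hold=0.05))
```

In the usage example the first two maxima fall within one standby section, so only the stronger one at 0.02 is confirmed, while the maximum at 0.50 begins a separate sound generation.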

以上の説明から理解される通り、第２実施形態では、音響信号ＸEの強度Ｑの極大点ｘH1以降の待機区間Ｖ内に、当該極大点ｘH1を上回る強度Ｑの極大点ｘH2が検出された場合に、極大点ｘH1が発音始点ＴSの候補から除外される。したがって、発音源による１回の発音の開始から音響信号ＸEの強度Ｑが増加する過程で複数の極大点ｘHが検出される場合でも、当該発音に対応した１個の極大点ｘHを含む発音区間Ｐを適切に特定することが可能である。 As understood from the above description, in the second embodiment, when a local maximum point xH2 whose intensity Q exceeds that of xH1 is detected within the standby section V following a local maximum point xH1 of the intensity Q of the acoustic signal XE, xH1 is excluded from the candidates for the sound generation start point TS. Therefore, even when a plurality of local maximum points xH are detected while the intensity Q of the acoustic signal XE is increasing from the start of a single sound generation by a sound source, the sound generation section P containing the one local maximum point xH corresponding to that sound generation can be specified appropriately.

なお、第２実施形態では、１個の極大点ｘH1を始点とする待機区間Ｖを設定したが、極大点ｘH1を上回る強度Ｑの極大点ｘH2を検出した場合に、当該極大点ｘH2を始点とする待機区間Ｖを新規に設定する（すなわち極大点ｘHの検出毎に待機区間Ｖを更新する）ことも可能である。 In the second embodiment, a standby section V starting at one local maximum point xH1 is set, but when a local maximum point xH2 whose intensity Q exceeds that of xH1 is detected, a new standby section V starting at xH2 may instead be set (that is, the standby section V may be updated each time a local maximum point xH is detected).

＜変形例＞ <Modifications>
以上に例示した各態様は多様に変形され得る。具体的な変形の態様を以下に例示する。以下の例示から任意に選択された２個以上の態様は、相互に矛盾しない範囲で適宜に併合され得る。 Each of the aspects illustrated above can be modified in various ways. Specific modifications are illustrated below. Two or more aspects arbitrarily selected from the following examples may be combined as appropriate to the extent that they do not contradict each other.

（１）前述の各形態では、調波性解析部６２がサポートベクターマシンにより調波音と非調波音とを判別したが、調波性解析部６２による調波音／非調波音の判別方法は以上の例示に限定されない。例えば、調波音および非調波音の各々の特徴量Ｆの分布傾向を表現する混合正規分布を利用して対象音を調波音と非調波音とに判別する方法や、K-meansアルゴリズムを利用したクラスタリングで対象音を調波音と非調波音とに判別する方法も採用され得る。第１解析部６４および第２解析部６６の各々が対象音の発音源の種類を推定する方法についても同様に、前述の各形態で例示したサポートベクターマシンには限定されず、公知のパターン認識技術を任意に採用することが可能である。 (1) In each of the above embodiments, the harmonicity analysis unit 62 discriminates harmonic sounds from non-harmonic sounds using a support vector machine, but the method by which the harmonicity analysis unit 62 discriminates harmonic and non-harmonic sounds is not limited to this example. For example, a method that classifies the target sound as harmonic or non-harmonic using normal mixture distributions (Gaussian mixtures) expressing the distribution tendencies of the feature quantities F of harmonic sounds and of non-harmonic sounds, or a method that classifies the target sound as harmonic or non-harmonic by clustering with the K-means algorithm, may also be employed. Likewise, the method by which each of the first analysis unit 64 and the second analysis unit 66 estimates the type of sound source of the target sound is not limited to the support vector machine exemplified in the above embodiments, and any known pattern recognition technique may be employed.

（２）前述の各形態では、調波性解析部６２が解析した調波音の確度ＷHをＮ個の評価値ＥH(1)〜ＥH(N)に乗算するとともに非調波音の確度ＷPをＭ個の評価値ＥP(1)〜ＥP(M)に乗算したが、調波音の確度ＷHおよび非調波音の確度ＷPを音響信号ＸAの発音源の種類に反映させる方法は以上の例示に限定されない。例えば、音響信号ＸAの対象音が調波音および非調波音の何れに該当するかを確度ＷHおよび確度ＷPに応じて判別し、Ｎ個の評価値ＥH(1)〜ＥH(N)およびＭ個の評価値ＥP(1)〜ＥP(M)の何れかを調波性の判別結果に応じて選択的に利用して、音源特定部６８が発音源の種類を特定することも可能である。 (2) In each of the above embodiments, the N evaluation values EH(1) to EH(N) are multiplied by the harmonic-sound accuracy WH analyzed by the harmonicity analysis unit 62 and the M evaluation values EP(1) to EP(M) are multiplied by the non-harmonic-sound accuracy WP, but the method of reflecting the accuracies WH and WP in the type of sound source of the acoustic signal XA is not limited to this example. For example, the sound source specifying unit 68 may specify the type of sound source by determining, according to the accuracies WH and WP, whether the target sound of the acoustic signal XA is a harmonic sound or a non-harmonic sound, and selectively using either the N evaluation values EH(1) to EH(N) or the M evaluation values EP(1) to EP(M) according to the result of that harmonicity determination.

具体的には、調波性解析部６２は、確度ＷHが確度ＷPを上回る場合には対象音を調波音と判別し、確度ＷPが確度ＷHを上回る場合には対象音を非調波音と判別する。音源特定部６８は、対象音が調波音であると判別された場合には、第１解析部６４が算定したＮ個の評価値ＥH(1)〜ＥH(N)のなかの最大値に対応する調波音源を発音源の種類として特定する一方、対象音が非調波音であると判別された場合には、第２解析部６６が算定したＭ個の評価値ＥP(1)〜ＥP(M)のなかの最大値に対応する非調波音源を発音源の種類として特定する。以上に例示した構成は、前述の各形態において、確度ＷHおよび確度ＷPの一方を１に設定するとともに他方を０に設定した構成とも換言される。なお、対象音が調波音であると調波性解析部６２が判別した場合に第２解析部６６による非調波解析処理（Ｍ個の評価値ＥP(1)〜ＥP(M)の算定）を省略する構成や、対象音が非調波音であると調波性解析部６２が解析した場合に第１解析部６４による調波解析処理（Ｎ個の評価値ＥH(1)〜ＥH(N)の算定）を省略する構成も採用され得る。 Specifically, the harmonicity analysis unit 62 determines the target sound to be a harmonic sound when the accuracy WH exceeds the accuracy WP, and a non-harmonic sound when the accuracy WP exceeds the accuracy WH. When the target sound is determined to be a harmonic sound, the sound source specifying unit 68 specifies, as the type of sound source, the harmonic sound source corresponding to the maximum of the N evaluation values EH(1) to EH(N) calculated by the first analysis unit 64; when the target sound is determined to be a non-harmonic sound, it specifies the non-harmonic sound source corresponding to the maximum of the M evaluation values EP(1) to EP(M) calculated by the second analysis unit 66. The configuration exemplified above can also be described as one in which, in each of the above embodiments, one of the accuracies WH and WP is set to 1 and the other to 0. A configuration may also be employed in which the non-harmonic analysis process by the second analysis unit 66 (calculation of the M evaluation values EP(1) to EP(M)) is omitted when the harmonicity analysis unit 62 determines that the target sound is a harmonic sound, or in which the harmonic analysis process by the first analysis unit 64 (calculation of the N evaluation values EH(1) to EH(N)) is omitted when the harmonicity analysis unit 62 determines that the target sound is a non-harmonic sound.

以上の例示から理解される通り、音源特定部６８は、調波性解析部６２と第１解析部６４と第２解析部６６とによる解析結果に応じて対象音の発音源の種類を特定する要素として包括的に表現され、第１解析部６４および第２解析部６６の双方の解析結果を利用するか一方の解析結果のみを利用するかは、本発明において不問である。 As understood from the above examples, the sound source specifying unit 68 is comprehensively expressed as an element that specifies the type of sound source of the target sound according to the analysis results of the harmonicity analysis unit 62, the first analysis unit 64, and the second analysis unit 66; whether the analysis results of both the first analysis unit 64 and the second analysis unit 66 are used, or only one of them, does not matter in the present invention.

（３）前述の各形態では始点閾値ＺSを固定値としたが、始点閾値ＺSを可変値とすることも可能である。例えば、極大点ｘHでの音響信号ＸEの強度ＱHに応じた数値（例えば強度ＱHを所定値に乗算した数値）を始点閾値ＺSとして利用し、図６のステップＳC1では、極大点ｘHでの強度ＱHと基準値ＱREFとの差分(ＱH−ＱREF)を変動指標δとして始点閾値ＺSと比較することも可能である。終点閾値ＺEについても同様に可変値とすることが可能である。また、始点閾値ＺSまたは終点閾値ＺEを利用者からの指示に応じて可変に設定することも可能である。 (3) In each of the above embodiments the start point threshold ZS is a fixed value, but the start point threshold ZS may also be a variable value. For example, a numerical value corresponding to the intensity QH of the acoustic signal XE at the local maximum point xH (for example, the product of the intensity QH and a predetermined value) may be used as the start point threshold ZS, and in step SC1 of FIG. 6 the difference (QH − QREF) between the intensity QH at the local maximum point xH and the reference value QREF may be compared, as the variation index δ, with the start point threshold ZS. The end point threshold ZE may likewise be a variable value. The start point threshold ZS or the end point threshold ZE may also be set variably according to an instruction from the user.

（４）移動体通信網やインターネット等の通信網を介して端末装置（例えば携帯電話機やスマートフォン）と通信するサーバ装置で音響処理装置１２を実現することも可能である。具体的には、音響処理装置１２は、端末装置から通信網を介して受信した複数の音響信号ＸAから前述の各形態と同様の処理で音響信号ＸBを生成して端末装置に送信する。なお、端末装置から受信した複数の音響信号ＸAの各々の発音源の種類（音源識別情報Ｄ）を音響解析部２０が識別して端末装置に通知し、端末装置に搭載された音響処理部３０が識別結果に応じて複数の音響信号ＸAから音響信号ＸBを生成することも可能である。すなわち、音響処理部３０は音響処理装置１２から省略され得る。また、音響信号ＸAの発音区間Ｐの発音始点ＴSおよび発音終点ＴEを端末装置に通知する構成（例えば端末装置が特徴量抽出部５０および音源解析部２０を具備する構成）では、音響処理装置１２の音響解析部２０から特徴量抽出部５０と音源識別部６０とが省略される。 (4) The acoustic processing device 12 may also be realized as a server device that communicates with a terminal device (for example, a mobile phone or a smartphone) via a communication network such as a mobile communication network or the Internet. Specifically, the acoustic processing device 12 generates the acoustic signal XB, by the same processing as in the above embodiments, from a plurality of acoustic signals XA received from the terminal device via the communication network, and transmits it to the terminal device. Alternatively, the acoustic analysis unit 20 may identify the type of sound source (sound source identification information D) of each of the plurality of acoustic signals XA received from the terminal device and notify the terminal device, and an acoustic processing unit 30 mounted on the terminal device may generate the acoustic signal XB from the plurality of acoustic signals XA according to the identification result. That is, the acoustic processing unit 30 may be omitted from the acoustic processing device 12. Further, in a configuration in which the terminal device is notified of the sound generation start point TS and the sound generation end point TE of the sound generation section P of the acoustic signal XA (for example, a configuration in which the terminal device includes the feature quantity extraction unit 50 and the sound source analysis unit 20), the feature quantity extraction unit 50 and the sound source identification unit 60 are omitted from the acoustic analysis unit 20 of the acoustic processing device 12.

As understood from the above description, a preferred aspect of the present invention is comprehensively expressed as a device (acoustic analysis device) that analyzes, within the acoustic signal XA, the sound generation start point TS at which sounding begins and the sound generation end point TE at which that sounding ends. Whether or not the acoustic analysis device includes the sound processing unit 30 is immaterial.

(5) As described above, the sound processing device 12 exemplified in each of the above-described embodiments is realized by cooperation between the control device 122 and a program. The program may be provided in a form stored on a computer-readable recording medium and installed on a computer. The recording medium is, for example, a non-transitory recording medium; an optical recording medium (optical disc) such as a CD-ROM is a good example, but any known type of recording medium, such as a semiconductor recording medium or a magnetic recording medium, may be used. The program exemplified above may also be provided by distribution via a communication network and installed on a computer.

(6) The present invention is also specified as a method of operating the sound processing device 12 according to each of the above-described embodiments. For example, in a method (acoustic analysis method) by which the sound generation section detection unit 40 analyzes, within the acoustic signal XA, the sound generation start point TS at which sounding begins and the sound generation end point TE at which that sounding ends, a computer (either a single device or a computer system composed of a plurality of mutually separate devices) specifies a local maximum point xH of the intensity Q of the acoustic signal XA as the sound generation start point TS (the start point analysis process of FIG. 6). The computer then specifies, as the sound generation end point TE, a local minimum point xL at which the intensity Q turns from decreasing to increasing while the intensity Q of the acoustic signal XA decreases over time after the sound generation start point TS, provided that the variation index δ, which corresponds to the difference between the intensity QH at the local maximum point xH following the increase and the minimum value QREF of the intensity Q up to that local maximum point xH, exceeds the end point threshold ZE (the end point analysis process of FIG. 7).
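
The end point step of the method in (6) can be sketched as below. This is a minimal illustration, not the procedure of FIG. 7 itself. The function name `find_end_point`, the per-sample local maximum and minimum tests, the choice to track QREF as a running minimum from the start point, and the use of the normalized index (QH−QREF)/QH (the form recited in the claims) are all assumptions made for the example.

```python
from typing import Optional, Sequence


def find_end_point(q: Sequence[float], start: int, z_e: float) -> Optional[int]:
    """Return the index of the sound generation end point TE, or None.

    q     : intensity Q of the acoustic signal per time step (assumed >= 0)
    start : index of the sound generation start point TS (a local maximum)
    z_e   : end point threshold ZE applied to the variation index
    """
    q_ref = q[start]   # running minimum QREF since TS (assumption)
    last_min = None    # most recent local minimum xL (reversal to increase)
    for i in range(start + 1, len(q) - 1):
        q_ref = min(q_ref, q[i])
        if q[i] < q[i - 1] and q[i] <= q[i + 1]:
            last_min = i  # intensity stops decreasing here
        elif q[i] > q[i - 1] and q[i] >= q[i + 1] and last_min is not None:
            q_h = q[i]    # local maximum xH reached after the increase
            delta = (q_h - q_ref) / q_h if q_h > 0 else 0.0
            if delta > z_e:
                return last_min  # the rise was large enough: xL is TE
    return None  # no sufficiently large rise found in the available samples
```

With `q = [0.1, 1.0, 0.8, 0.6, 0.62, 0.5, 0.3, 0.9, 0.7]`, `start = 1` and `z_e = 0.5`, the small bump at index 4 is ignored because its variation index is far below the threshold, while the rise to 0.9 at index 7 marks the preceding local minimum at index 6 as the end point.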

DESCRIPTION OF SYMBOLS 12 ... Sound processing device, 14 ... Sound collection device, 16 ... Sound emission device, 122 ... Control device, 124 ... Storage device, 20 ... Acoustic analysis unit, 30 ... Sound processing unit, 40 ... Sound generation section detection unit, 42 ... Signal processing unit, 44 ... Start point analysis unit, 46 ... End point analysis unit, 50 ... Feature amount extraction unit, 60 ... Sound source identification unit, 62 ... Harmonicity analysis unit, 64 ... First analysis unit, 66 ... Second analysis unit, 68 ... Sound source specifying unit, 682 ... Multiplication unit, 684 ... Multiplication unit, 686 ... Selection processing unit.

Claims

An acoustic analysis device that analyzes, in an acoustic signal, a sound generation start point at which sounding begins and a sound generation end point at which the sounding ends, the acoustic analysis device comprising:
a start point analysis unit that specifies a local maximum point of the intensity of the acoustic signal as the sound generation start point; and
an end point analysis unit that specifies, as the sound generation end point, a local minimum point at which the intensity turns from decreasing to increasing while the intensity of the acoustic signal decreases over time after the sound generation start point, when a variation index corresponding to the difference between the intensity at the local maximum point following the increase and the minimum value of the intensity up to that local maximum point exceeds an end point threshold.

The acoustic analysis device according to claim 1, wherein, when the intensity of the acoustic signal, while decreasing from the sound generation start point, falls below an attenuation threshold corresponding to the intensity at the sound generation start point before a local minimum point has been specified as the sound generation end point, the end point analysis unit specifies the time point at which the intensity falls below the attenuation threshold as the sound generation end point.

The acoustic analysis device according to claim 1, wherein the start point analysis unit sequentially detects local maximum points of the intensity of the acoustic signal and specifies a local maximum point as the sound generation start point when a variation index corresponding to the difference between the intensity at that local maximum point and the minimum value of the intensity up to that local maximum point exceeds a start point threshold that is greater than the end point threshold.

The acoustic analysis device according to any one of claims 1 to 3, wherein the variation index is a value obtained by dividing the difference between the intensity at the local maximum point and the minimum value of the intensity up to that local maximum point by the intensity at the local maximum point.

The acoustic analysis device according to any one of claims 1 to 4, wherein, when the start point analysis unit detects, within a standby section following a first local maximum point of the intensity of the acoustic signal, a second local maximum point whose intensity exceeds that of the first local maximum point, the start point analysis unit excludes the first local maximum point from the candidates for the sound generation start point.
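
The start point behaviour recited in the claims (sequential detection of local maxima, a start point threshold, a normalized variation index, and a standby section) can be sketched as follows. This is a hedged illustration rather than the claimed implementation. The name `find_start_point`, the measurement of the standby section in samples, restarting the standby section when a stronger maximum replaces the candidate, and returning a pending candidate at the end of the data are all assumptions of this example.

```python
from typing import Optional, Sequence


def find_start_point(q: Sequence[float], z_s: float, standby: int) -> Optional[int]:
    """Return the index of the sound generation start point TS, or None.

    q       : intensity of the acoustic signal per time step (assumed >= 0)
    z_s     : start point threshold ZS (greater than the end point threshold)
    standby : length of the standby section in samples (assumption)
    """
    q_ref = q[0]       # running minimum of the intensity so far
    candidate = None   # current candidate for the start point
    for i in range(1, len(q) - 1):
        if candidate is not None and i - candidate > standby:
            return candidate  # standby section elapsed without a stronger peak
        q_ref = min(q_ref, q[i])
        is_max = q[i] > q[i - 1] and q[i] >= q[i + 1]
        if not is_max:
            continue
        if candidate is None:
            # normalized variation index: (QH - QREF) / QH
            delta = (q[i] - q_ref) / q[i] if q[i] > 0 else 0.0
            if delta > z_s:
                candidate = i
        elif q[i] > q[candidate]:
            candidate = i  # the earlier maximum is excluded as a candidate
    return candidate
```

For `q = [0.0, 0.1, 0.6, 0.5, 1.0, 0.7, 0.4, 0.3, 0.2, 0.1]` and `z_s = 0.5`, a standby section of 3 samples lets the stronger maximum at index 4 replace the first candidate at index 2, whereas a standby section of 1 sample confirms index 2 before the second maximum is examined.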