JP5169297B2

JP5169297B2 - Sound processing apparatus and program

Info

Publication number: JP5169297B2
Application number: JP2008041520A
Authority: JP
Inventors: 靖雄吉岡
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2008-02-22
Filing date: 2008-02-22
Publication date: 2013-03-27
Anticipated expiration: 2028-02-22
Also published as: JP2009198892A

Description

The present invention relates to a technique for determining the type of a sound.

Techniques have been proposed for dividing a sound, such as a sound recorded by a sound collecting device (hereinafter referred to as the “input sound”), into speech sections and non-speech sections. For example, Patent Document 1 discloses a technique for detecting speech based on the intensity of the components of the input sound that belong to a predetermined frequency band.

Patent Document 1: JP 2000-132177 A

However, with the technique of Patent Document 1 it is difficult to detect unvoiced sounds with high accuracy, because their acoustic characteristics resemble those of non-speech. Consequently, even within a section in which speech (voiced and unvoiced sounds) actually continues, an unvoiced section may be erroneously determined to be non-speech, so that the section determined to be speech is interrupted. In view of these circumstances, one object of the present invention is to discriminate unvoiced sounds with high accuracy.

To solve the above problem, a sound processing apparatus according to a first aspect of the present invention comprises: modulation spectrum specifying means for specifying a modulation spectrum for each unit section of an input sound; first index calculation means (for example, the index calculation unit 42 in FIG. 2) for calculating a first index value corresponding to the intensity of the components of the modulation spectrum whose modulation frequency belongs to a predetermined range; second index calculation means (for example, the index calculation unit 44 in FIG. 2) for calculating, for each unit section, a second index value corresponding to the number of zero crossings of the input sound; speech determination means for determining whether the input sound of each unit section is speech based on a comparison between the first index value and a first threshold; and unvoiced sound determination means for determining whether the input sound of each unit section is an unvoiced sound based on the second index value and on a comparison between the first index value and a second threshold that differs from the first threshold.

In the above configuration, whether the input sound of each unit section is an unvoiced sound is determined based on the intensity of the components of the modulation spectrum whose modulation frequency belongs to the predetermined range, so unvoiced sounds can be discriminated with higher accuracy than with a technique that uses the frequency spectrum of the input sound. Furthermore, because the second index value, which corresponds to the number of zero crossings of the input sound, is used by the unvoiced sound determination means, unvoiced sounds are discriminated with higher accuracy than in a configuration that uses only the first index value. For example, voiced sounds tend to have fewer zero crossings than unvoiced sounds, so using the second index value makes it possible to distinguish unvoiced sounds from voiced sounds with high accuracy.
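The zero-crossing count behind the second index value can be sketched as follows. This is a minimal illustration, not the patent's formula: the text here does not define the second index value exactly, so the function name `zero_crossings` and the choice to skip samples that are exactly zero are assumptions.

```python
def zero_crossings(samples):
    """Count sign changes between consecutive non-zero samples.

    Assumption: samples equal to zero carry no sign and are skipped,
    so [1.0, 0.0, -1.0] counts as a single crossing.
    """
    count = 0
    prev = 0.0  # last non-zero sample seen (0.0 means "none yet")
    for x in samples:
        if x == 0.0:
            continue
        if prev != 0.0 and (x > 0.0) != (prev > 0.0):
            count += 1
        prev = x
    return count
```

A noise-like unvoiced section would be expected to give a larger count per unit section than a voiced section of the same length, which is the tendency the paragraph above relies on.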

A sound processing apparatus according to a preferred aspect of the present invention comprises third index calculation means (for example, the index calculation unit 46 in FIG. 2) for calculating, for each unit section, a third index value corresponding to the flatness of the frequency spectrum of the input sound, and the unvoiced sound determination means determines whether the input sound of each unit section is an unvoiced sound based on the first, second, and third index values. In this aspect, because the third index value, which corresponds to the flatness of the frequency spectrum of the input sound, is used by the unvoiced sound determination means, unvoiced sounds can be discriminated with higher accuracy than in a configuration that uses only the first and second index values. For example, sounds such as voiced sounds and environmental sounds (for example, push tones) tend to have a less flat frequency spectrum than unvoiced sounds, so using the third index value makes it possible to distinguish unvoiced sounds from voiced sounds and environmental sounds with high accuracy.
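As an illustration of a flatness measure, one common choice is the ratio of the geometric mean to the arithmetic mean of the power spectrum. The sketch below uses that measure as an assumption; the patent text here does not state how the third index value is actually computed, and the name `spectral_flatness` and the `floor` parameter are hypothetical.

```python
import math


def spectral_flatness(power_spectrum, floor=1e-12):
    """Geometric mean divided by arithmetic mean of a power spectrum.

    The result lies in (0, 1]: close to 1 for a flat (noise-like)
    spectrum and close to 0 for a peaky (tonal) spectrum.
    `floor` (an assumption) guards against log(0).
    """
    if not power_spectrum:
        raise ValueError("power_spectrum must not be empty")
    values = [max(p, floor) for p in power_spectrum]
    log_mean = sum(math.log(v) for v in values) / len(values)
    arithmetic_mean = sum(values) / len(values)
    return math.exp(log_mean) / arithmetic_mean
```

This measure increases with flatness. An index defined to decrease with flatness, which the text also allows, would simply reverse the direction of the threshold comparison.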

A sound processing apparatus according to a second aspect of the present invention comprises: modulation spectrum specifying means for specifying a modulation spectrum for each unit section of an input sound; first index calculation means for calculating a first index value corresponding to the intensity of the components of the modulation spectrum whose modulation frequency belongs to a predetermined range; third index calculation means for calculating, for each unit section, a third index value corresponding to the flatness of the frequency spectrum of the input sound; speech determination means for determining whether the input sound of each unit section is speech based on a comparison between the first index value and a first threshold; and unvoiced sound determination means for determining whether the input sound of each unit section is an unvoiced sound based on the third index value and on a comparison between the first index value and a second threshold that differs from the first threshold.

In the second aspect, whether the input sound of each unit section is an unvoiced sound is determined based on the intensity of the components of the modulation spectrum whose modulation frequency belongs to the predetermined range, so unvoiced sounds can be discriminated with higher accuracy than with a technique that uses the frequency spectrum of the input sound. Furthermore, because the third index value, which corresponds to the flatness of the frequency spectrum of the input sound, is used by the unvoiced sound determination means, unvoiced sounds are discriminated with higher accuracy than in a configuration that uses only the first index value. For example, sounds such as voiced sounds and environmental sounds (for example, push tones) tend to have a less flat frequency spectrum than unvoiced sounds, so using the third index value makes it possible to distinguish unvoiced sounds from voiced sounds and environmental sounds with high accuracy.

Note that an unvoiced sound is speech that is not accompanied by vibration of the vocal cords (speech other than voiced sounds). Unvoiced consonants and devoiced voiced sounds (vowels) are unvoiced sounds. Devoicing is the phenomenon in which a voiced sound that would ordinarily be uttered with vocal cord vibration is, under certain conditions, uttered without it.

Furthermore, “determining whether the input sound is an unvoiced sound based on an index value” means that the result of the determination changes depending on the magnitude of that index value. The specific method of determination by the unvoiced sound determination means is selected as appropriate according to the definition of each index value, as illustrated below.

In the sound processing apparatus according to the first and second aspects, when the first index value is defined so that it decreases as the intensity of the components of the modulation spectrum whose modulation frequency lies within the predetermined range increases, the determination by the unvoiced sound determination means is designed, for example, so that the smaller the first index value, the more likely the input sound is to be determined to be an unvoiced sound. For example, when the index values other than the first index value satisfy the conditions for determining the input sound to be an unvoiced sound, the unvoiced sound determination means determines that the input sound is an unvoiced sound if the first index value is below a predetermined threshold (for example, the threshold T1B in FIG. 9) and that it is not an unvoiced sound if the first index value exceeds that threshold. Conversely, when the first index value is defined so that it increases as that intensity increases, the determination is designed, for example, so that the larger the first index value, the more likely the input sound is to be determined to be an unvoiced sound. For example, when the index values other than the first index value satisfy the conditions for determining the input sound to be an unvoiced sound, the unvoiced sound determination means determines that the input sound is an unvoiced sound if the first index value exceeds a predetermined threshold and that it is not an unvoiced sound if the first index value is below that threshold.

Likewise, in the sound processing apparatus according to the first aspect, when the second index value is defined so that it increases as the number of zero crossings of the input sound increases, the determination by the unvoiced sound determination means is designed, for example, so that the larger the second index value, the more likely the input sound is to be determined to be an unvoiced sound. For example, when the index values other than the second index value satisfy the conditions for determining the input sound to be an unvoiced sound, the unvoiced sound determination means determines that the input sound is an unvoiced sound if the second index value exceeds a predetermined threshold (for example, the threshold T2 in FIG. 9) and that it is not an unvoiced sound if the second index value is below that threshold. Conversely, when the second index value is defined so that it decreases as the number of zero crossings increases (for example, when the reciprocal of the number of zero crossings is calculated as the second index value), the determination is designed, for example, so that the smaller the second index value, the more likely the input sound is to be determined to be an unvoiced sound. For example, when the index values other than the second index value satisfy the conditions for determining the input sound to be an unvoiced sound, the unvoiced sound determination means determines that the input sound is an unvoiced sound if the second index value is below a predetermined threshold and that it is not an unvoiced sound if the second index value exceeds that threshold.

In the sound processing apparatus according to the second aspect, when the third index value is defined so that it decreases as the flatness of the frequency spectrum of the input sound increases, the determination by the unvoiced sound determination means is designed, for example, so that the smaller the third index value, the more likely the input sound is to be determined to be an unvoiced sound. For example, when the index values other than the third index value satisfy the conditions for determining the input sound to be an unvoiced sound, the unvoiced sound determination means determines that the input sound is an unvoiced sound if the third index value is below a predetermined threshold (for example, the threshold T3 in FIG. 9) and that it is not an unvoiced sound if the third index value exceeds that threshold. Conversely, when the third index value is defined so that it increases as the flatness of the frequency spectrum increases, the determination is designed, for example, so that the larger the third index value, the more likely the input sound is to be determined to be an unvoiced sound. For example, when the index values other than the third index value satisfy the conditions for determining the input sound to be an unvoiced sound, the unvoiced sound determination means determines that the input sound is an unvoiced sound if the third index value exceeds a predetermined threshold and that it is not an unvoiced sound if the third index value is below that threshold.

In a configuration in which the first index value is calculated so that it decreases as the intensity of the components of the modulation spectrum whose modulation frequency belongs to the predetermined range increases, the speech determination means determines that the input sound of a unit section whose first index value is below the first threshold (for example, the threshold T1A in FIG. 6) is speech, and the unvoiced sound determination means determines that the input sound of a unit section whose first index value is below a second threshold, which is larger than the first threshold, is an unvoiced sound. In a configuration in which the first index value is calculated so that it increases as that intensity increases, the speech determination means determines that the input sound of a unit section whose first index value exceeds the first threshold is speech, and the unvoiced sound determination means determines that the input sound of a unit section whose first index value exceeds a second threshold, which is smaller than the first threshold, is an unvoiced sound. According to each of these aspects, voiced and unvoiced sounds can be distinguished with high accuracy.
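The threshold comparisons described above can be combined as in the following sketch. It assumes the variant in which the first index value decreases with the modulation-band intensity, the second increases with the zero-crossing count, and the third decreases with spectral flatness. The function name `classify`, the string labels, the order of the checks, and the AND-combination of the three conditions are assumptions for illustration, since the flowcharts of FIG. 6 and FIG. 9 are not reproduced in this text.

```python
def classify(d1, d2, d3, t1a, t1b, t2, t3):
    """Label one unit section from three index values.

    d1: first index value (smaller = stronger modulation in the range)
    d2: second index value (larger = more zero crossings)
    d3: third index value (smaller = flatter spectrum)
    t1a < t1b: first and second thresholds for d1; t2, t3: thresholds
    for d2 and d3. All threshold values are hypothetical.
    """
    if not t1a < t1b:
        raise ValueError("expected first threshold < second threshold")
    if d1 < t1a:
        return "speech"       # speech determination (first threshold)
    if d1 < t1b and d2 > t2 and d3 < t3:
        return "unvoiced"     # relaxed threshold plus supporting indices
    return "non-speech"
```

The relaxed second threshold lets a section whose modulation evidence is weaker still be kept as unvoiced speech, but only when the zero-crossing and flatness indices agree.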

In a specific example of the sound processing apparatus according to the first and second aspects, sound processing means is provided that executes different processing on the input sound of a unit section that the unvoiced sound determination means determined to be an unvoiced sound and on the input sound of the other unit sections. In another specific example, sound processing means is provided that executes different processing on the input sound of a unit section that the speech determination means determined to be speech and on the input sound of a unit section that the unvoiced sound determination means determined to be an unvoiced sound. With these configurations, sound with the desired characteristics can be generated by executing processing appropriate to the type of the input sound (speech / non-speech or voiced sound / unvoiced sound).

The sound processing apparatus according to each of the above aspects may be realized by hardware (an electronic circuit) dedicated to processing the input sound, such as a DSP (Digital Signal Processor), or by cooperation between a general-purpose arithmetic processing device, such as a CPU (Central Processing Unit), and a program. A program according to the first aspect causes a computer to execute: modulation spectrum specifying processing for specifying a modulation spectrum for each unit section of an input sound; first index calculation processing for calculating a first index value corresponding to the intensity of the components of the modulation spectrum whose modulation frequency belongs to a predetermined range; second index calculation processing for calculating, for each unit section, a second index value corresponding to the number of zero crossings of the input sound; speech determination processing for determining whether the input sound of each unit section is speech based on a comparison between the first index value and a first threshold; and unvoiced sound determination processing for determining whether the input sound of each unit section is an unvoiced sound based on the second index value and on a comparison between the first index value and a second threshold that differs from the first threshold. A program according to the second aspect causes a computer to execute: modulation spectrum specifying processing for specifying a modulation spectrum for each unit section of an input sound; first index calculation processing for calculating a first index value corresponding to the intensity of the components of the modulation spectrum whose modulation frequency belongs to a predetermined range; third index calculation processing for calculating, for each unit section, a third index value corresponding to the flatness of the frequency spectrum of the input sound; speech determination processing for determining whether the input sound of each unit section is speech based on a comparison between the first index value and a first threshold; and unvoiced sound determination processing for determining whether the input sound of each unit section is an unvoiced sound based on the third index value and on a comparison between the first index value and a second threshold that differs from the first threshold. The programs of the present invention provide the same operation and effects as the sound processing apparatus according to each of the above aspects. The programs of the present invention may be provided to a user stored on a computer-readable recording medium and installed on a computer, or distributed from a server device over a communication network and installed on a computer.

FIG. 1 is a block diagram of a remote conference system according to an embodiment of the present invention. The remote conference system 100 is a system in which a plurality of users U (conference participants) in geographically separated spaces R1 and R2 exchange speech with one another. A sound collecting device 12, a sound processing device 14, a sound processing device 16, and a sound emitting device 18 are installed in each space R (R1, R2).

The sound collecting device 12 is a device (microphone) that generates an acoustic signal SIN representing the waveform of the input sound VIN present in the space R. The sound processing device 14 in each of the spaces R1 and R2 generates an output signal SOUT from the acoustic signal SIN and transmits it to the sound processing device 16 in the other space. The sound processing device 16 amplifies the output signal SOUT and outputs it to the sound emitting device 18. The sound emitting device 18 is a device (speaker) that emits sound waves corresponding to the amplified output signal SOUT supplied from the sound processing device 16. With this configuration, the speech of each user U in the space R1 is output from the sound emitting device 18 in the space R2, and the speech of each user U in the space R2 is output from the sound emitting device 18 in the space R1.

FIG. 2 is a block diagram showing the configuration of the sound processing device 14 installed in each of the spaces R1 and R2. As shown in FIG. 2, the sound processing device 14 comprises a control device 22 and a storage device 24. The control device 22 is an arithmetic processing device that functions as each element of FIG. 2 by executing a program. Each element of FIG. 2 may also be realized by an electronic circuit such as a DSP. The storage device 24 stores the program executed by the control device 22 and the various data used by the control device 22. Any known storage medium, such as a semiconductor storage device or a magnetic storage device, may be used as the storage device 24.

The control device 22 implements two functions. The first determines the type of the input sound VIN (voiced sound / unvoiced sound / non-speech) for each of a plurality of sections (hereinafter referred to as “unit sections”) obtained by dividing the acoustic signal SIN (input sound VIN) supplied from the sound collecting device 12 along the time axis. The second generates the output signal SOUT by executing, on the acoustic signal SIN, processing that depends on the result of the determination. A voiced sound is speech (an utterance) accompanied by vibration of the vocal cords. An unvoiced sound is speech not accompanied by vibration of the vocal cords. Non-speech, in contrast, is sound other than speech (noise). Various kinds of background noise (for example, the operating sound of air-conditioning equipment) and various environmental sounds (for example, the ringtone of a mobile phone or the sound of a door opening or closing) are non-speech.

The frequency analysis unit 32 in FIG. 2 executes frequency analysis including a Fourier transform (for example, an FFT (Fast Fourier Transform)) on the acoustic signal SIN, thereby calculating a frequency spectrum (logarithmic spectrum) S0 for each of a plurality of frames obtained by dividing the acoustic signal SIN along the time axis. Each frame is set to a time length that is sufficiently short compared with a unit section.
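The per-frame analysis can be sketched as follows. This is an illustration under stated assumptions, not the patent's implementation: it uses a direct DFT instead of an FFT to stay dependency-free, applies no window function, and uses non-overlapping frames. The names `split_frames` and `log_spectrum` and the `floor` parameter are hypothetical.

```python
import cmath
import math


def split_frames(signal, frame_len):
    """Split a list of samples into consecutive, non-overlapping frames."""
    last_start = len(signal) - frame_len
    return [signal[i:i + frame_len]
            for i in range(0, last_start + 1, frame_len)]


def log_spectrum(frame, floor=1e-12):
    """Log power spectrum of one frame for bins 0..N/2 (direct DFT).

    `floor` (an assumption) avoids log(0) for empty bins.
    """
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                  for t, x in enumerate(frame))
        spectrum.append(math.log(max(abs(acc) ** 2, floor)))
    return spectrum
```

Applying `log_spectrum` to every frame returned by `split_frames` yields one spectrum per frame. With a sample rate of `fs` and a frame of `N` samples, bin `k` corresponds to `k * fs / N` Hz.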

The modulation spectrum specifying unit 34 specifies the modulation spectrum MS of the acoustic signal SIN (input sound VIN). The modulation spectrum MS corresponds to the result of executing a Fourier transform on the temporal variation (hereinafter referred to as the “time trajectory”) of the components of the frequency spectrum S0 of the acoustic signal SIN that belong to a specific frequency band.

FIG. 3 is a block diagram of the modulation spectrum specifying unit 34. FIG. 4 is a conceptual diagram for explaining the processing performed by the modulation spectrum specifying unit 34. Part (A) of FIG. 4 shows a spectrogram SP in which the frequency spectra S0 specified for each frame by the frequency analysis unit 32 are arranged in time series.

As shown in FIG. 3, the modulation spectrum specifying unit 34 consists of a component extraction unit 342 and a frequency analysis unit 344. As shown in parts (A) and (B) of FIG. 4, the component extraction unit 342 extracts the time trajectory ST of the intensity (energy) of the components of the spectrogram SP that belong to a specific frequency band ω. More specifically, the component extraction unit 342 calculates, for the frequency spectrum S0 of each frame, the intensity of the components belonging to the frequency band ω, and generates the time trajectory ST by arranging these intensities in time series over a plurality of frames. The frequency band ω is selected experimentally or statistically so that the frequency characteristic of the time trajectory ST (the modulation spectrum MS) when the input sound VIN is speech differs markedly from that when the input sound VIN is non-speech. For example, the frequency band ω is selected to be the range from 10 Hz (more preferably 50 Hz) to 800 Hz. A configuration in which the component extraction unit 342 extracts, as the time trajectory ST, the time series of the intensity of a single frequency component in each frequency spectrum S0 may also be adopted.
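The extraction of the time trajectory described above can be sketched as follows. The sketch assumes the spectrogram is a list of per-frame lists of bin intensities with uniform bin spacing `bin_hz`, and that the band intensity is the plain sum of the bins inside the band. The text does not say whether linear or logarithmic values are summed, and the function and parameter names are hypothetical.

```python
import math


def time_trajectory(spectrogram, bin_hz, band_low_hz, band_high_hz):
    """Per-frame intensity of the band [band_low_hz, band_high_hz].

    spectrogram: list of frames, each a list of per-bin intensities,
                 where bin k sits at k * bin_hz.
    Returns one summed band intensity per frame, in frame order.
    """
    lo = math.ceil(band_low_hz / bin_hz)    # first bin inside the band
    hi = math.floor(band_high_hz / bin_hz)  # last bin inside the band
    if lo > hi:
        raise ValueError("band contains no frequency bin")
    return [sum(frame[lo:hi + 1]) for frame in spectrogram]
```

For the band mentioned in the text, a call might look like `time_trajectory(spectrogram, bin_hz, 50.0, 800.0)`, where `bin_hz` depends on the sample rate and frame length actually used.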

As shown in parts (B) and (C) of FIG. 4, the frequency analysis unit 344 in FIG. 3 executes a Fourier transform on the time trajectory ST, thereby calculating the modulation spectrum MS for each of a plurality of unit sections TU obtained by dividing the time trajectory ST along the time axis. A unit section TU is a period of a predetermined time length (for example, about one second) made up of a plurality of frames. This embodiment illustrates, for convenience, a configuration in which the unit sections TU do not overlap, but a configuration in which successive unit sections TU partially overlap may also be adopted.
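The step from the time trajectory to one modulation spectrum per unit section can be sketched as follows. The sketch assumes non-overlapping unit sections, a direct DFT instead of an FFT, and the magnitude of each DFT bin as the "intensity". The names `modulation_spectrum` and `modulation_spectra` are hypothetical.

```python
import cmath


def modulation_spectrum(segment):
    """Magnitude of the DFT of one unit section of the time trajectory.

    Returns bins 0..N/2. With a frame rate of F frames per second and
    N frames per unit section, bin k corresponds to k * F / N Hz.
    """
    n = len(segment)
    return [abs(sum(v * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, v in enumerate(segment)))
            for k in range(n // 2 + 1)]


def modulation_spectra(trajectory, frames_per_unit):
    """One modulation spectrum per non-overlapping unit section."""
    last_start = len(trajectory) - frames_per_unit
    return [modulation_spectrum(trajectory[i:i + frames_per_unit])
            for i in range(0, last_start + 1, frames_per_unit)]
```

For example, with 100 frames per second and a one-second unit section, the bins are spaced 1 Hz apart, so a trajectory that fluctuates at a syllable-like rate of 4 Hz produces a peak at bin 4.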

FIG. 5 shows the modulation spectra MS of several kinds of sound (voiced sound / unvoiced sound / non-speech). Part (A) of FIG. 5 is the modulation spectrum MS of a voiced sound (speech uttered as “a-i-u-e-o”). Part (B) of FIG. 5 is the modulation spectrum MS of a sound rich in unvoiced sounds. More specifically, part (B) of FIG. 5 is the modulation spectrum MS of speech uttered as “sa-shi-su-se-so” in such a way that “sa” and “shi” in particular are almost entirely devoiced. Parts (C) and (D) of FIG. 5 are modulation spectra MS of non-speech. More specifically, part (C) of FIG. 5 is the modulation spectrum MS of white noise (background noise), and part (D) of FIG. 5 is the modulation spectrum MS of a telephone push tone (environmental sound).

In the modulation spectrum MS of ordinary human speech, as shown in part (C) of FIG. 4, the intensity often peaks at a modulation frequency of about 4 Hz, which corresponds to the rate at which syllables change during speech. More specifically, the modulation spectrum MS of speech (voiced and unvoiced sounds; parts (A) and (B) of FIG. 5) tends to have high intensity in the range where the modulation frequency is below 10 Hz. In contrast, many non-speech modulation spectra MS (parts (C) and (D) of FIG. 5) tend to have high intensity in components whose modulation frequency exceeds 10 Hz.

In view of this difference in characteristics, this embodiment uses the intensity of those components of the modulation spectrum MS specified by the modulation spectrum specifying unit 34 whose modulation frequency belongs to a predetermined range A (hereinafter the "determination target range") to determine the type of the input sound VIN. The determination target range A is set to a range in which the difference in intensity of the modulation spectrum MS between speech and non-speech is pronounced. For example, a range of 10 Hz or less (more preferably 2 Hz to 8 Hz) is appropriate as the determination target range A.

The index calculation unit 42 in FIG. 2 calculates, for the modulation spectrum MS that the modulation spectrum specifying unit 34 has specified for each unit section TU, an index value D1 corresponding to the intensity (energy) of the components belonging to the determination target range A. More specifically, the index calculation unit 42 first calculates the intensity L1 of the components of the modulation spectrum MS whose modulation frequency belongs to the determination target range A (for example, the sum or average of the intensities at the modulation frequencies within the range A), and the intensity L2 of the modulation spectrum MS over the entire range of modulation frequencies (the sum or average of the intensities at all modulation frequencies). Second, the index calculation unit 42 calculates the index value D1 from the following expression (A), which contains the ratio (L1/L2) of the intensity L1 to the intensity L2.
D1 = 1 - (L1 / L2) ... (A)
As expression (A) shows, the higher the intensity L1 of the components within the determination target range A of the modulation spectrum MS (that is, the more likely the input sound VIN is speech), the smaller the index value D1. The index value D1 therefore serves as an index for judging whether the input sound VIN is speech or non-speech. Moreover, because the determination target range A is rich in components at the rate at which syllables change during speech, the index value D1 can also be regarded as an index for judging whether the input sound VIN contains the rhythm peculiar to speech (the rhythm of utterance).
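
Expression (A) can be illustrated with a short sketch. The function name, the `bin_hz` parameter (modulation-frequency spacing of the spectrum bins), the use of sums rather than averages for L1 and L2, and the handling of an all-zero spectrum are assumptions made for this example.

```python
def index_d1(mod_spectrum, bin_hz, target_range=(2.0, 8.0)):
    """D1 = 1 - L1/L2, where L1 is the intensity inside the determination
    target range A and L2 is the intensity over all modulation frequencies."""
    lo, hi = target_range
    l1 = sum(m for k, m in enumerate(mod_spectrum) if lo <= k * bin_hz <= hi)
    l2 = sum(mod_spectrum)
    if l2 == 0:
        # Assumption: a section with no modulation energy is treated like
        # non-speech, i.e. it gets the largest possible index value.
        return 1.0
    return 1.0 - l1 / l2
```

With non-negative intensities, D1 lies between 0 (all energy inside the range A) and 1 (no energy inside the range A).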

The index calculation unit 44 in FIG. 2 calculates, for each unit section TU, an index value D2 corresponding to the number of zero crossings of the input sound VIN (acoustic signal SIN). The number of zero crossings is the number of times the sign of the intensity of the acoustic signal SIN is inverted (that is, the number of times the signal value crosses zero). For example, the index calculation unit 44 calculates, as the index value D2, the total number of zero crossings in the unit section TU or the average of the numbers of zero crossings counted per predetermined time within the unit section TU. Accordingly, the more zero crossings the acoustic signal SIN has, the larger the index value D2.
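
A minimal sketch of the total-count variant of D2 follows. The function name is hypothetical, and skipping samples that are exactly zero is an assumption, since the text does not say how such samples are treated.

```python
def index_d2(samples):
    """Count sign inversions of the signal within one unit section."""
    crossings = 0
    prev_sign = 0
    for s in samples:
        sign = (s > 0) - (s < 0)      # +1, -1, or 0
        if sign == 0:
            continue                  # assumption: exact zeros are ignored
        if prev_sign != 0 and sign != prev_sign:
            crossings += 1
        prev_sign = sign
    return crossings
```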

The index calculation unit 46 in FIG. 2 calculates, for each unit section TU, an index value D3 corresponding to the flatness of the shape of the frequency spectrum S0 specified by the frequency analysis unit 32. For example, the index calculation unit 46 first divides the frequency spectrum S0 averaged over the frames in the unit section TU into a plurality of frequency bands and calculates the intensity (average energy) of the components in each band. Second, it calculates, for each unit section TU, the ratio of the maximum value Emax to the minimum value Emin among the band intensities as the index value D3 (D3 = Emax/Emin). Accordingly, the flatter the frequency spectrum S0 (that is, the smaller the difference between Emax and Emin), the smaller the index value D3. Alternatively, the index value D3 may be calculated by averaging, over the frames in the unit section TU, the ratio of Emax to Emin in the frequency spectrum S0 of each frame.
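
The band-ratio form of D3 might be sketched as below. The function name, the equal-width bands, the discarding of leftover bins, and returning infinity when the weakest band is empty are assumptions for illustration.

```python
import math

def index_d3(avg_spectrum, num_bands):
    """D3 = Emax / Emin over equal-width bands of the averaged spectrum S0."""
    band_size = len(avg_spectrum) // num_bands
    if num_bands < 1 or band_size < 1:
        raise ValueError("need at least one bin per band")
    # Average energy of each band; bins beyond num_bands * band_size are dropped.
    energies = [sum(avg_spectrum[b * band_size:(b + 1) * band_size]) / band_size
                for b in range(num_bands)]
    e_max, e_min = max(energies), min(energies)
    if e_min == 0:
        return math.inf               # assumption: an empty band means "not flat"
    return e_max / e_min
```

A perfectly flat spectrum gives D3 = 1, the smallest value this definition can take.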

The determination unit 50 in FIG. 2 determines the type of the input sound VIN (voiced sound, unvoiced sound, or non-speech) on the basis of the index values D1, D2, and D3. The determination unit 50 of this embodiment consists of a voice determination unit 52 and an unvoiced sound determination unit 54. The voice determination unit 52 determines, on the basis of the index value D1, whether the input sound VIN of each unit section TU is speech (in particular, a voiced sound). FIG. 6 is a flowchart showing the specific operation of the voice determination unit 52. The process of FIG. 6 is executed each time the index value D1 is calculated for one unit section TU.

The voice determination unit 52 determines whether the index value D1 is below a threshold T1A (step SA1). The threshold T1A is set experimentally or statistically so that the index value D1 of a voiced sound falls below T1A while the index value D1 of non-speech exceeds T1A. If the result of step SA1 is negative, the voice determination unit 52 determines that the input sound VIN of the unit section TU being processed is not a voiced sound (step SA2). If the result of step SA1 is affirmative, the voice determination unit 52 determines that the input sound VIN of the current unit section TU is a voiced sound (step SA3). In step SA3, the voice determination unit 52 generates identification data d designating a voiced sound and outputs it to the sound processing unit 60.

As FIG. 5 shows, however, the characteristics of the modulation spectrum MS differ between voiced sound (part (A)) and unvoiced sound (part (B)). More specifically, the modulation spectrum MS of an unvoiced sound tends to peak at a higher modulation frequency than that of a voiced sound. In other words, because the intensity L1 of the components within the determination target range A is lower for an unvoiced sound than for a voiced sound, the index value D1 calculated from the modulation spectrum MS of an unvoiced sound is often larger than the index value D1 of a voiced sound. Consequently, the comparison between the index value D1 and the threshold T1A in step SA1 of FIG. 6 alone does not classify unvoiced sounds as speech. The unvoiced sound determination unit 54 therefore determines whether the input sound VIN of each unit section TU is an unvoiced sound.

FIG. 7 shows time waveforms of the acoustic signal SIN, and FIG. 8 shows frequency spectra S0 of the acoustic signal SIN. Part (A) of each of FIGS. 7 and 8 shows the characteristics of the voiced sound whose modulation spectrum MS appears in part (A) of FIG. 5. Likewise, part (B) of each of FIGS. 7 and 8 shows the characteristics of the unvoiced sound (part (B) of FIG. 5). Part (C) of each of FIGS. 7 and 8 shows the characteristics of the white noise whose modulation spectrum MS appears in part (C) of FIG. 5, and part (D) shows those of the push tone whose modulation spectrum MS appears in part (D) of FIG. 5.

As a comparison of parts (A) to (D) of FIG. 7 shows, unvoiced sound (part (B)) and non-speech (parts (C) and (D)) have more zero crossings per unit time than voiced sound (part (A)). The index value D2 calculated for a unit section TU of unvoiced sound or non-speech (white noise or push tone) is therefore larger than the index value D2 of a unit section TU of voiced sound.

Similarly, as a comparison of parts (A) to (D) of FIG. 8 shows, the frequency spectrum S0 of the unvoiced sound in part (B) and of the white noise in part (C) is flatter (that is, varies less in intensity) than that of the voiced sound in part (A) or the push tone in part (D). The index value D3 calculated for a unit section TU of unvoiced sound or white noise is therefore smaller than the index value D3 of a unit section TU of voiced sound or push tone.

Furthermore, as a comparison of parts (A) to (D) of FIG. 5 shows, the voiced sound in part (A) and the unvoiced sound in part (B) (that is, speech) have higher intensity within the determination target range A of the modulation spectrum MS than the white noise in part (C) or the push tone in part (D) (that is, non-speech). The index value D1 calculated for a unit section TU of unvoiced or voiced sound is therefore smaller than the index value D1 of a unit section TU of non-speech.

In view of these relationships between the index values D (D1 to D3) and the type of the input sound VIN, the unvoiced sound determination unit 54 determines whether the input sound VIN of each unit section TU is an unvoiced sound on the basis of the index values D1 to D3. FIG. 9 is a flowchart showing the specific operation of the unvoiced sound determination unit 54. The process of FIG. 9 is executed each time the index values D1 to D3 are calculated for one unit section TU.

The unvoiced sound determination unit 54 determines whether the index value D2 calculated by the index calculation unit 44 exceeds a threshold T2 (step SB1). The threshold T2 is selected experimentally or statistically so that the index value D2 of unvoiced sound and non-speech exceeds T2 while the index value D2 of voiced sound falls below T2. If the result of step SB1 is negative, the unvoiced sound determination unit 54 determines that the input sound VIN of the current unit section TU is not an unvoiced sound (step SB2).

If the result of step SB1 is affirmative, the unvoiced sound determination unit 54 determines whether the index value D3 calculated by the index calculation unit 46 is below a threshold T3 (step SB3). The threshold T3 is selected experimentally or statistically so that the index value D3 of unvoiced sound and background noise (white noise) falls below T3 while the index value D3 of voiced sound and environmental sound (push tone) exceeds T3. If the result of step SB3 is negative, the unvoiced sound determination unit 54 determines that the input sound VIN of the current unit section TU is not an unvoiced sound (step SB2).

If the result of step SB3 is affirmative, the unvoiced sound determination unit 54 determines whether the index value D1 calculated by the index calculation unit 42 is below a threshold T1B (step SB4). The threshold T1B is selected experimentally or statistically so that the index value D1 of both unvoiced and voiced sound falls below T1B while the index value D1 of non-speech exceeds T1B. As described above with reference to FIG. 5, the modulation frequency at which the modulation spectrum MS of an unvoiced sound reaches its maximum intensity lies higher than that of a voiced sound. The threshold T1B, which is set so as to exceed the index values D1 of both unvoiced and voiced sound, is therefore larger than the threshold T1A that the voice determination unit 52 uses to identify voiced sound.

If the result of step SB4 is affirmative, the unvoiced sound determination unit 54 determines that the input sound VIN of the current unit section TU is an unvoiced sound (step SB5). In step SB5, the unvoiced sound determination unit 54 generates identification data d designating an unvoiced sound and outputs it to the sound processing unit 60. If the result of step SB4 is negative, it determines that the input sound VIN of the current unit section TU is not an unvoiced sound (step SB2). A unit section TU that the voice determination unit 52 has determined not to be a voiced sound (step SA2) and that the unvoiced sound determination unit 54 has determined not to be an unvoiced sound (step SB2) is classified as non-speech; that is, identification data d designating non-speech is output from the determination unit 50 to the sound processing unit 60.
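
The decision flow of FIG. 6 and FIG. 9 can be summarised in a short sketch. The function name, the string labels, and the threshold values used in the usage note are hypothetical; the text only states that the thresholds are chosen experimentally or statistically, with T1B larger than T1A. Checking the voiced condition first is also an assumption, since the text does not say which result wins if both determinations succeed.

```python
def classify_unit(d1, d2, d3, t1a, t1b, t2, t3):
    """Classify one unit section TU from its three index values."""
    if d1 < t1a:                                  # steps SA1 -> SA3
        return "voiced"
    if d2 > t2 and d3 < t3 and d1 < t1b:          # steps SB1, SB3, SB4 -> SB5
        return "unvoiced"
    return "non-speech"                           # steps SA2 and SB2
```

For example, with the made-up thresholds t1a=0.3, t1b=0.6, t2=100, t3=10.0, a section with (d1, d2, d3) = (0.5, 400, 3.0) is labelled "unvoiced", while (0.9, 400, 3.0), which resembles white noise, is labelled "non-speech".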

The sound processing unit 60 in FIG. 2 generates an output signal SOUT by applying, to the acoustic signal SIN of each unit section TU, processing that depends on the result determined for that unit section TU by the determination unit 50 (the voice determination unit 52 and the unvoiced sound determination unit 54). The specific processing of the sound processing unit 60 is described below.

First, the sound processing unit 60 applies different processing to the acoustic signal SIN of a unit section TU that the voice determination unit 52 has determined to be voiced sound and to the acoustic signal SIN of a unit section TU that the unvoiced sound determination unit 54 has determined to be unvoiced sound. For example, the sound processing unit 60 suppresses the high-frequency components of the acoustic signal SIN with a low-pass filter for unit sections TU of unvoiced sound, while applying no filtering to unit sections TU of voiced sound.

Second, the sound processing unit 60 applies different processing to the acoustic signal SIN of unit sections TU of speech (voiced or unvoiced sound) and to that of unit sections TU of non-speech. For example, the sound processing unit 60 outputs the acoustic signal SIN as the output signal SOUT for unit sections TU determined to be voiced sound by the voice determination unit 52 and for unit sections TU determined to be unvoiced sound by the unvoiced sound determination unit 54, whereas for unit sections TU determined to be non-speech it outputs an output signal SOUT whose volume is set to zero (that is, the acoustic signal SIN is not output). Consequently, in each of the spaces R1 and R2, the non-speech portion of the input sound VIN picked up in the other space R is removed, and only the speech that the user actually needs to hear is emitted from the sound emitting device 18 via the sound processing device 16. In addition, because the high-frequency components of unvoiced sound are suppressed, the sound emitted from the sound emitting device 18 is easy for the user to listen to.
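
The two kinds of per-section processing can be sketched together as follows. The function name and string labels are hypothetical, and the one-pole low-pass filter with coefficient `alpha` is only a stand-in: the text says a low-pass filter is applied to unvoiced sections but does not specify its type or cutoff.

```python
def process_unit(samples, label, alpha=0.5):
    """Produce the output samples for one unit section TU."""
    if label == "voiced":
        return list(samples)                  # passed through unfiltered
    if label == "unvoiced":
        out, y = [], 0.0
        for s in samples:
            y += alpha * (s - y)              # one-pole low-pass (assumed filter)
            out.append(y)
        return out
    return [0.0] * len(samples)               # non-speech: volume set to zero
```

In this sketch the filter state starts from zero in every section; a real implementation would more likely carry the state across consecutive sections to avoid discontinuities.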

In a configuration that outputs as the output signal SOUT only the unit sections TU determined to be voiced sound by the voice determination unit 52, unvoiced sound is treated as non-speech, so the sound emitted in response to the output signal SOUT breaks off at unit sections TU of unvoiced sound. In this embodiment, not only does the voice determination unit 52 identify unit sections TU of voiced sound, but the unvoiced sound determination unit 54 also identifies unit sections TU of unvoiced sound on the basis of the index values D1 to D3, so both the voiced and the unvoiced sounds of the input sound VIN can be emitted as the output signal SOUT. This has the advantage of preventing the sound emitted in response to the output signal SOUT from being interrupted in sections corresponding to unvoiced sounds of the input sound VIN.

Moreover, because speech (voiced and unvoiced sound) and non-speech are distinguished on the basis of the intensity L1 of the components within the determination target range A of the modulation spectrum MS (that is, the presence or absence of speech rhythm), speech and non-speech can be discriminated more accurately than with the technique of Patent Document 1, which uses the frequency spectrum of the input sound VIN. Note that when the volume of non-speech is high, the intensity of the modulation spectrum MS is high over the entire band of modulation frequencies. In a configuration that distinguishes speech from non-speech solely on the basis of the intensity L1 within the determination target range A, loud non-speech may therefore be misjudged as speech. In this embodiment, the ratio of the intensity L1 within the determination target range A to the intensity L2 over the entire range of modulation frequencies is used for the determination by the determination unit 50, which has the advantage that speech and non-speech can be determined accurately even when the volume of non-speech is high.
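
The volume argument can be made explicit with a one-line calculation. As an illustrative assumption (not stated in the text), suppose that raising the volume multiplies every component of the modulation spectrum MS by the same gain g > 0, so that L1 and L2 become gL1 and gL2. The index value of expression (A) is then unchanged:

```latex
D_1' \;=\; 1 - \frac{g\,L_1}{g\,L_2} \;=\; 1 - \frac{L_1}{L_2} \;=\; D_1
```

The intensity L1 taken alone, by contrast, grows in proportion to g, which is why a threshold on L1 by itself could be exceeded by loud non-speech.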

<Modifications>
Various modifications can be made to the above embodiment. Specific modifications are exemplified below. Two or more of the following modes may be selected arbitrarily and combined.

(1) Modification 1
In the above embodiment, the index value D1 corresponding to the modulation spectrum MS, the index value D2 corresponding to the number of zero crossings, and the index value D3 corresponding to the flatness of the frequency spectrum S0 are all used to discriminate the input sound VIN, but a configuration that discriminates the input sound VIN on the basis of the index value D1 and only one of the index values D2 and D3 may also be adopted. For example, in a mode that uses the index values D1 and D2, the index calculation unit 46 in FIG. 2 and step SB3 in FIG. 9 are omitted. In a mode that uses the index values D1 and D3, the index calculation unit 44 in FIG. 2 and step SB1 in FIG. 9 are omitted. However, a configuration that uses the three index values (D1 to D3), as in FIG. 2, has the advantage that the input sound VIN can be discriminated more accurately than in a configuration that uses only two index values D.

(2) Modification 2
In the above embodiment, the threshold T1A is set so as to exceed the index value D1 of voiced sound and fall below the index value D1 of unvoiced sound, so the voice determination unit 52 functions as means for distinguishing voiced sound from unvoiced sound and non-speech. However, the voice determination unit 52 may instead be made to function as means for distinguishing speech (voiced and unvoiced sound) from non-speech by setting the threshold T1A so as to exceed the index values D1 of both voiced and unvoiced sound. Even when the threshold T1A is set in this way, a unit section TU of unvoiced sound may still be misjudged as non-speech, so the unvoiced sound determination unit 54 remains useful for classifying unvoiced sound as speech (that is, for excluding it from non-speech).

(3) Modification 3
In the above embodiment, the unvoiced sound determination unit 54 executes the process of FIG. 9 (discrimination of unvoiced sound) for every unit section TU of the input sound VIN, but the process of FIG. 9 may instead be executed only for unit sections TU that the voice determination unit 52 has determined not to be speech (voiced sound). In this modification, the processing of the unvoiced sound determination unit 54 is skipped for unit sections TU that the voice determination unit 52 has determined to be speech, which has the advantage of reducing the processing load on the determination unit 50 (unvoiced sound determination unit 54).

(4) Modification 4
The definition of each index value D (D1, D2, D3) may be changed as appropriate, so the relationship between the magnitude of each index value D and the type of the input sound VIN is arbitrary. For example, the above embodiment illustrates a configuration in which the index value D1 is defined so as to decrease as the intensity L1 within the determination target range A of the modulation spectrum MS increases (that is, the smaller the index value D1, the more likely the input sound VIN is determined to be speech), but a configuration in which the index value D1 is defined so as to increase as the intensity L1 increases (that is, the larger the index value D1, the more likely the input sound VIN is determined to be speech) may also be adopted. In such a configuration, the voice determination unit 52 determines that the input sound VIN of a unit section TU whose index value D1 exceeds the threshold T1A is a voiced sound (steps SA1 and SA3), and the unvoiced sound determination unit 54 determines that the input sound VIN of a unit section TU whose index value D1 exceeds the threshold T1B is an unvoiced sound (steps SB4 and SB5). In this configuration the threshold T1A is set to a larger value than the threshold T1B, because voiced sound yields a higher intensity L1 than unvoiced sound.

A configuration in which the index value D2 is defined so as to decrease as the number of zero crossings of the acoustic signal SIN increases (for example, a configuration in which the reciprocal of the number of zero crossings is used as the index value D2), or a configuration in which the index value D3 is defined so as to increase as the flatness of the frequency spectrum S0 increases, is also suitable. The variance of the frequency spectrum S0 may also be calculated as the index value D3. That is, the index calculation unit 46 calculates, as the index value D3, the value obtained by averaging over all frequency bands the square of the difference between the intensity (energy) of each frequency band into which the frequency spectrum S0 is divided and the average intensity over the entire band. The index value D3 calculated in this way becomes smaller as the flatness of the frequency spectrum S0 increases.
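
The variance form of D3 described here amounts to a population variance of the band intensities. A minimal sketch, assuming the per-band intensities have already been computed and using a hypothetical function name:

```python
def index_d3_variance(band_energies):
    """Mean squared deviation of the band intensities from their average."""
    if not band_energies:
        raise ValueError("at least one frequency band is required")
    mean = sum(band_energies) / len(band_energies)
    return sum((e - mean) ** 2 for e in band_energies) / len(band_energies)
```

A perfectly flat spectrum gives 0, so the comparison direction in step SB3 (D3 below T3 for unvoiced sound) stays the same as in the embodiment.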

(5) Modification 5
In the above embodiment, the modulation spectrum MS is specified by performing a Fourier transform on the time trajectory ST of the components of the frequency spectrum S0 that belong to the frequency band ω, but a configuration that specifies the modulation spectrum MS by performing a Fourier transform on the time trajectory of the cepstrum of the acoustic signal SIN (input sound VIN) may also be adopted. More specifically, the component extraction unit 342 of the modulation spectrum specifying unit 34 extracts the time trajectory ST of the components whose quefrency lies within a specific range in the cepstrum of each frame of the acoustic signal SIN, and the frequency analysis unit 344 calculates the modulation spectrum MS of each unit section TU by performing a Fourier transform on the cepstrum time trajectory ST for each unit section TU.
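
A cepstrum-based trajectory might be sketched as follows. The function names, the real-cepstrum definition (inverse DFT of the log magnitude spectrum), the small constant added before the logarithm, and the choice to sum the cepstral values inside the quefrency range are all assumptions; the text states only that components within a specific quefrency range are extracted per frame.

```python
import cmath
import math

def real_cepstrum(frame):
    """Real cepstrum of one frame, computed with naive DFTs."""
    n = len(frame)
    log_mag = []
    for k in range(n):
        x_k = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(frame))
        log_mag.append(math.log(abs(x_k) + 1e-12))   # offset avoids log(0)
    return [sum(v * cmath.exp(2j * math.pi * k * q / n)
                for k, v in enumerate(log_mag)).real / n
            for q in range(n)]

def cepstrum_trajectory(frames, q_lo, q_hi):
    """One value per frame: sum of cepstral coefficients q_lo..q_hi (inclusive)."""
    return [sum(real_cepstrum(f)[q_lo:q_hi + 1]) for f in frames]
```

The resulting list plays the role of the time trajectory ST and can be transformed per unit section in the same way as the band-energy trajectory.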

(6) Modification 6
In the above embodiment, the index value D1 is used for the determination by the voice determination unit 52, but any known technique may be adopted as the method of determining whether the input sound VIN is speech (voiced sound). For example, a configuration in which the voice determination unit 52 detects the pitch (fundamental frequency) of the acoustic signal SIN and determines unit sections TU in which a clear pitch is detected to be speech and unit sections TU in which no pitch is detected to be non-speech is also suitable. In the configuration of FIG. 2, however, the index value D1 used by the unvoiced sound determination unit 54 is also used by the voice determination unit 52, which has the advantage of reducing the load of calculating index values compared with a configuration in which an index value separate from D1 (for example, pitch) is used by the voice determination unit 52. In a sound processing device 14 that only detects unit sections TU of unvoiced sound in the input sound VIN, for example, the voice determination unit 52 is omitted.

(7) Modification 7
In the above embodiments, the identification data d and the output signal SOUT are generated by the sound processing device 14 in the space R where the input sound VIN is picked up, but the location where the identification data d is generated (where the input sound VIN is classified) and the location where the output signal SOUT is generated may be changed as appropriate. For example, in a configuration in which the sound processing device 14 outputs the acoustic signal SIN generated by the sound collecting device 12 together with the identification data d generated by the determination unit 50, the sound processing unit 60 that generates the output signal SOUT from the acoustic signal SIN and the identification data d is installed in the sound processing device 16 on the receiving side. In a configuration in which the sound processing device 14 transmits the acoustic signal SIN generated by the sound collecting device 12, elements similar to those in FIG. 2 are installed in the sound processing device 16 on the receiving side. The remote conference system 100 is, however, merely one example of an application of the present invention. Transmission and reception of the output signal SOUT and the acoustic signal SIN are therefore not essential to the present invention.

(8) Modification 8
The above embodiments illustrate a configuration in which the sound processing unit 60 does not output the acoustic signal SIN of a unit section TU determined to be non-speech (that is, the volume of the output signal SOUT is set to zero), but the content of the processing by the sound processing unit 60 may be changed as appropriate. For example, a configuration in which the sound processing unit 60 outputs, as the output signal SOUT, a signal obtained by lowering the volume of the acoustic signal SIN for a unit section TU determined to be non-speech is also suitable. A configuration in which the output signal SOUT is generated by applying different acoustic effects to the acoustic signal SIN for unit sections TU of voice (voiced or unvoiced sound) and for unit sections TU of non-speech, or a configuration in which different acoustic effects are applied to the acoustic signal SIN for unit sections TU of voiced sound and for unit sections TU of unvoiced sound, may also be adopted. Furthermore, in a configuration in which speaker recognition (speaker identification or speaker authentication) or speech recognition is performed at the output destination of the output signal SOUT (the sound processing device 16), the sound processing unit 60, for example, extracts from the acoustic signal SIN the feature quantities used for speech recognition or speaker recognition and outputs them as the output signal SOUT for unit sections TU determined to be voiced or unvoiced sound, while stopping the extraction of feature quantities for unit sections TU determined to be non-speech.
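The muting and volume-reduction variants of the sound processing unit 60 can be sketched as a per-section gain. The string labels used to encode the identification data d and the parameter `non_speech_gain` are hypothetical; the text does not define how d is represented.

```python
def process_unit_section(samples, label, non_speech_gain=0.0):
    """Produce the output samples (SOUT) for one unit section TU.

    label: hypothetical encoding of the identification data d,
           one of "voiced", "unvoiced", or "non_speech".
    non_speech_gain: 0.0 mutes non-speech sections as in the embodiment;
                     a value between 0 and 1 only lowers their volume.
    """
    if label == "non_speech":
        return [s * non_speech_gain for s in samples]
    # Voiced and unvoiced sections pass through unchanged in this sketch.
    return list(samples)
```

For example, `process_unit_section(section, "non_speech", non_speech_gain=0.5)` halves the amplitude of a non-speech section instead of removing it.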

A block diagram of a remote conference system according to an embodiment of the present invention.
A block diagram of the sound processing device of FIG. 1.
A block diagram of the modulation spectrum specifying unit of FIG. 2.
A conceptual diagram showing the procedure of the processing by the modulation spectrum specifying unit of FIG. 2.
Modulation spectra of several types of sound.
A flowchart showing the operation of the voice determination unit of FIG. 2.
Time waveforms of several types of sound.
Frequency spectra of several types of sound.
A flowchart showing the operation of the unvoiced sound determination unit of FIG. 2.

Explanation of symbols

100 …… remote conference system, 12 …… sound collecting device, 14 …… sound processing device, 16 …… sound processing device, 18 …… sound emitting device, 22 …… control device, 24 …… storage device, 32 …… frequency analysis unit, 34 …… modulation spectrum specifying unit, 42 …… index calculation unit, 44 …… index calculation unit, 46 …… index calculation unit, 50 …… determination unit, 52 …… voice determination unit, 54 …… unvoiced sound determination unit, 60 …… sound processing unit, VIN …… input sound, SIN …… acoustic signal, SOUT …… output signal, d …… identification data, MS …… modulation spectrum, D1, D2, D3 …… index values, TU …… unit section.

Claims

A sound processing apparatus comprising:
modulation spectrum specifying means for specifying a modulation spectrum for each unit section of an input sound;
first index calculation means for calculating a first index value according to the intensity of a component of the modulation spectrum whose modulation frequency belongs to a predetermined range;
second index calculation means for calculating, for each unit section, a second index value according to the number of zero crossings of the input sound;
voice determination means for determining whether or not the input sound of each unit section is voice based on a comparison between the first index value and a first threshold; and
unvoiced sound determination means for determining whether or not the input sound of each unit section is an unvoiced sound based on a comparison between the first index value and a second threshold different from the first threshold, and on the second index value.

A sound processing apparatus comprising:
modulation spectrum specifying means for specifying a modulation spectrum for each unit section of an input sound;
first index calculation means for calculating a first index value according to the intensity of a component of the modulation spectrum whose modulation frequency belongs to a predetermined range;
third index calculation means for calculating, for each unit section, a third index value according to the flatness of the frequency spectrum of the input sound;
voice determination means for determining whether or not the input sound of each unit section is voice based on a comparison between the first index value and a first threshold; and
unvoiced sound determination means for determining whether or not the input sound of each unit section is an unvoiced sound based on a comparison between the first index value and a second threshold different from the first threshold, and on the third index value.

The sound processing apparatus according to claim 1 or 2, wherein
the first index calculation means calculates the first index value so that the first index value decreases as the intensity of the component of the modulation spectrum whose modulation frequency belongs to the predetermined range increases,
the voice determination means determines that the input sound of a unit section in which the first index value is below the first threshold is voice, and
the unvoiced sound determination means determines that the input sound of a unit section in which the first index value is below a second threshold greater than the first threshold is an unvoiced sound.

The sound processing apparatus according to claim 1 or 2, wherein
the first index calculation means calculates the first index value so that the first index value increases as the intensity of the component of the modulation spectrum whose modulation frequency belongs to the predetermined range increases,
the voice determination means determines that the input sound of a unit section in which the first index value exceeds the first threshold is voice, and
the unvoiced sound determination means determines that the input sound of a unit section in which the first index value exceeds a second threshold smaller than the first threshold is an unvoiced sound.

The sound processing apparatus according to any one of claims 1 to 4, wherein
the unvoiced sound determination means determines whether or not the input sound is an unvoiced sound only for unit sections that the voice determination means has determined not to be voice.

The sound processing apparatus according to claim 1, comprising
sound processing means that performs low-pass filter processing on unit sections determined by the unvoiced sound determination means to be unvoiced sound, and does not perform the low-pass filter processing on unit sections determined by the voice determination means to be voice.

A program that causes a computer to execute:
a modulation spectrum specifying process of specifying a modulation spectrum for each unit section of an input sound;
a first index calculation process of calculating a first index value according to the intensity of a component of the modulation spectrum whose modulation frequency belongs to a predetermined range;
a second index calculation process of calculating, for each unit section, a second index value according to the number of zero crossings of the input sound;
a voice determination process of determining whether or not the input sound of each unit section is voice based on a comparison between the first index value and a first threshold; and
an unvoiced sound determination process of determining whether or not the input sound of each unit section is an unvoiced sound based on a comparison between the first index value and a second threshold different from the first threshold, and on the second index value.

A program that causes a computer to execute:
a modulation spectrum specifying process of specifying a modulation spectrum for each unit section of an input sound;
a first index calculation process of calculating a first index value according to the intensity of a component of the modulation spectrum whose modulation frequency belongs to a predetermined range;
a third index calculation process of calculating, for each unit section, a third index value according to the flatness of the frequency spectrum of the input sound;
a voice determination process of determining whether or not the input sound of each unit section is voice based on a comparison between the first index value and a first threshold; and
an unvoiced sound determination process of determining whether or not the input sound of each unit section is an unvoiced sound based on a comparison between the first index value and a second threshold different from the first threshold, and on the third index value.