JP2007010892A

JP2007010892A - Device for determining speech signal

Info

Publication number: JP2007010892A
Application number: JP2005190160A
Authority: JP
Inventors: Izumi Taniguchi; いずみ谷口
Original assignee: Toa Corp
Current assignee: Toa Corp
Priority date: 2005-06-29
Filing date: 2005-06-29
Publication date: 2007-01-18
Anticipated expiration: 2025-06-29
Also published as: JP4493557B2

Abstract

PROBLEM TO BE SOLVED: To provide a device for determining speech signals capable of reliably determining the point at which speech ends in an audio signal.

SOLUTION: First signal generation means of the device generates a first signal E, which is an envelope signal of the level value of the audio signal. Second signal generation means generates a second signal F by smoothing the first signal E on the time axis, on condition that the level value of the second signal F does not change over time while it is equal to or greater than the level value of the audio signal plus a predetermined value. Determination means determines, as the end of speech in the audio signal, the point at which the state where the level value of the second signal F is equal to or greater than the level value of the audio signal plus the predetermined value has continued for a first predetermined period.

COPYRIGHT: (C)2007,JPO&INPIT

Description

The present invention relates to an audio signal determination device that determines the state and type of an audio signal.

The term "audio signal" as used here means a signal having components in the audible frequency band, regardless of its form. It may be an analog signal or a digital signal, and it may be an electrical signal or an optical signal. For example, the speech signal output from a microphone that picks up a speaker's voice is an audio signal. A music signal recorded on a music CD is also an audio signal.

Some audio equipment determines the point at which speech starts or ends in an input audio signal. One example is an automatic recording device that starts recording when speech starts and stops recording when speech ends.

To detect the start and end of speech, conventional devices set a predetermined gate level and compare the level of the input audio signal with that gate level.

However, this method, which relies on a preset gate level, often produces erroneous determinations.

For example, if the voice entering the microphone is quiet, the audio signal level never reaches the gate level. The device may then fail to recognize the start of speech even though the speaker has begun a conversation or announcement.

Conversely, if the ambient noise around the speaker is loud, the microphone output stays above the gate level. The device may then fail to detect the end of speech even though the speaker has finished the conversation or announcement.
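The two failure modes of a fixed gate can be illustrated with a minimal Python sketch. The function name `gate_open`, the gate level of -20 dB and the sample levels are hypothetical and chosen only for illustration; none of them comes from the specification.

```python
def gate_open(level_db, gate_db=-20.0):
    """Conventional fixed-gate test: speech is assumed present
    while the signal level exceeds the preset gate level."""
    return level_db > gate_db


# Hypothetical level readings in dB (illustrative assumptions).
quiet_talker = [-32.0, -30.0, -31.0]  # real speech that never reaches the gate
noisy_room = [-15.0, -12.0, -14.0]    # ambient noise after the talker has stopped

# The quiet talker never opens the gate, so the start of speech is missed.
missed_start = not any(gate_open(x) for x in quiet_talker)

# The noise keeps the gate open, so the end of speech is missed.
missed_end = all(gate_open(x) for x in noisy_room)
```

Both `missed_start` and `missed_end` evaluate to `True` for these inputs, which is the kind of misjudgment the invention aims to avoid.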

In addition, there are cases where it is desirable to determine (identify) whether the input audio signal is a speech signal from a conversation or announcement, or a music signal, and to adjust the gain of a public address device differently depending on the result.

Specifically, when the audio signal is a speech signal, its level fluctuates as the speaker moves closer to or farther from the microphone. In a conversation among several people, the level also fluctuates because speakers differ in loudness. For a speech signal, it is therefore desirable to apply relatively responsive gain adjustment that follows the fluctuations in signal level.

When the audio signal is a music signal, its level also fluctuates, because a single piece of music contains both relatively loud and relatively quiet passages. For a music signal, however, it is better not to apply responsive gain adjustment that follows level fluctuations, so as not to impair the musical character.

Thus, there are cases where the gain adjustment method should differ depending on whether the audio signal is a speech signal or a music signal. This requires a method for determining which of the two the audio signal is, but no effective method has yet been proposed.

Note that a technique for discriminating between speech signals and music signals based on the autocorrelation of the signal is disclosed in Patent Document 1.
Patent Document 1: JP 2002-366189 A (page 1, abstract)

An object of the invention according to the present application is to provide an audio signal determination device that can reliably determine the point at which speech ends in an audio signal.

An object of another invention according to the present application is to provide an audio signal determination device that can accurately determine whether an audio signal is a speech signal or a music signal.

To solve the above problems, an audio signal determination device according to the present invention comprises first signal generation means for generating a first signal, second signal generation means for generating a second signal, and determination means. The first signal generation means generates the first signal as an envelope signal of the level value of an audio signal. The second signal generation means generates the second signal by smoothing the first signal on the time axis, subject to the condition that the second signal does not change over time while its level value is equal to or greater than the level value of the audio signal plus a predetermined value. The determination means determines, as the end of speech in the audio signal, the point at which the state where the level value of the second signal is equal to or greater than the level value of the audio signal plus the predetermined value has continued for a first predetermined period. The device exploits the fact that this state can be regarded as a silent state, and it determines that speech has ended when the silent state has persisted for a fixed time. Because the end of speech is determined from the detection and observation of the silent state rather than from a gate level, it can be determined accurately without being affected by ambient noise.

In the above audio signal determination device, the determination means may determine, as the start of speech in the audio signal, the point after the end of speech at which the level value of the second signal becomes smaller than the level value of the audio signal plus the predetermined value. Because the end of speech is determined accurately, the device can also determine the start of speech accurately.

In the above audio signal determination device, the second signal generation means may make the level value of the second signal match the level value of the first signal during a second predetermined period following the start of speech in the audio signal.

To solve the above problems, another audio signal determination device according to the present invention comprises first signal generation means for generating a first signal, second signal generation means for generating a second signal, and determination means. The first signal generation means generates the first signal as an envelope signal of the level value of an audio signal. The second signal generation means generates the second signal by smoothing the first signal on the time axis, subject to the condition that the second signal does not change over time while its level value is equal to or greater than the level value of the audio signal plus a predetermined value. The determination means regards the state where the level value of the second signal is equal to or greater than the level value of the audio signal plus the predetermined value as a silent state, and determines whether the audio signal is a speech signal or a music signal based on how frequently the silent state occurs. Because silent states appear with different frequency in speech signals and in music signals, this device can accurately determine whether the audio signal is a speech signal or a music signal.

In the above audio signal determination device, the determination means may count the number of times the silent state occurs within a third predetermined period, determine that the audio signal is a speech signal when the count is equal to or greater than a predetermined number, and determine that the audio signal is a music signal when the count is less than the predetermined number.
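The counting rule can be sketched in Python as follows. This is a minimal illustration that assumes the silent state is available as a sequence of 0/1 samples covering the third predetermined period, and that one "occurrence" means one 0-to-1 transition. The function name `classify_by_silence_count` and the parameter `min_count` are hypothetical.

```python
def classify_by_silence_count(j_flags, min_count):
    """Classify one observation window as "speech" or "music".

    j_flags   : 0/1 silent-state samples covering the observation window
    min_count : the predetermined number of silent-state occurrences
    """
    flags = list(j_flags)
    # Count 0 -> 1 transitions; the state before the window is assumed to be 0.
    onsets = sum(
        1 for prev, cur in zip([0] + flags, flags) if prev == 0 and cur == 1
    )
    return "speech" if onsets >= min_count else "music"
```

For example, `classify_by_silence_count([0, 1, 1, 0, 0, 1, 0, 1, 1, 1], 3)` counts three occurrences and returns `"speech"`, while a window with a single long silent state returns `"music"` for the same threshold.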

In the above audio signal determination device, the determination means may also determine whether the audio signal is a speech signal or a music signal based on the time from the end of the earlier of two consecutive silent states to the start of the later one.
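The interval used by this variant can be measured as in the Python sketch below. It only extracts the intervals, in samples, between consecutive silent states; how those intervals are mapped to a speech or music decision is not stated in this passage, so no decision rule is shown. The function name `gaps_between_silences` is hypothetical, and the silent state is assumed to be given as a sequence of 0/1 samples.

```python
def gaps_between_silences(j_flags):
    """Return the lengths (in samples) of the non-silent intervals that lie
    between the end of one silent state and the start of the next."""
    gaps = []
    gap = None  # None until a silent state has ended
    prev = 0
    for j in j_flags:
        if prev == 1 and j == 0:
            gap = 1                # a silent state has just ended
        elif prev == 0 and j == 0 and gap is not None:
            gap += 1               # still between two silent states
        elif prev == 0 and j == 1 and gap is not None:
            gaps.append(gap)       # the next silent state has started
            gap = None
        prev = j
    return gaps
```

Intervals before the first silent state and after the last one are not bounded by two silent states, so this sketch leaves them out.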

In the above audio signal determination device, the first signal generation means may also generate the first signal by detecting, at every fourth predetermined period, the maximum level value of the audio signal during the most recent fourth predetermined period, and holding the detected maximum for the duration of one fourth predetermined period.

The audio signal determination device according to the present application can reliably detect the end of speech with little influence from the ambient noise around the speaker.

The other audio signal determination device according to the present application can accurately determine whether an audio signal is a speech signal or a music signal.

An embodiment of the present invention is described below with reference to the drawings.

First, an automatic gain adjustment device 20 that adopts the configuration of the audio signal determination device according to the present application is described with reference to FIG. 1.

FIG. 1 is a block diagram showing the schematic configuration of an acoustic system S. The acoustic system S is installed in a public address space and is used to deliver announcements and music to that space. A typical example of such a space is the premises of an airport.

The acoustic system S mainly consists of a CD (compact disc) player 11, microphones Ma, Mb and Mc, a changeover switch 15, the automatic gain adjustment device 20, a power amplifier 17 and a speaker 18. In the acoustic system S, the changeover switch 15 selects the sound source from among the CD player 11, the microphone Ma, the microphone Mb and the microphone Mc.

To deliver an announcement to the public address space, the changeover switch 15 selects the microphone Ma, Mb or Mc as the sound source. When an announcer speaks into the selected microphone, that microphone generates a speech signal, which is the audio signal to be input to the automatic gain adjustment device 20.

To deliver music to the public address space, the changeover switch 15 selects the CD player 11 as the sound source. A music CD is loaded in the CD player 11, so the CD player 11 generates a music signal, which is the audio signal to be input to the automatic gain adjustment device 20.

The automatic gain adjustment device 20 receives the audio signal from the sound source (the CD player 11 or the microphones Ma, Mb, Mc) via the changeover switch 15, applies gain to it, and sends it to the power amplifier 17. The gain of the automatic gain adjustment device 20 is adjusted automatically based on the input audio signal.

The power amplifier 17 amplifies the audio signal from the automatic gain adjustment device 20 and sends it to the speaker 18. The speaker 18 then radiates the announcement or music into the public address space.

The automatic gain adjustment device 20 has an audio signal input unit 21, a signal amplification unit 22, a control signal generation unit 23 and an audio signal output unit 24. The audio signal input unit 21 receives the audio signal from the sound source and passes it to both the signal amplification unit 22 and the control signal generation unit 23. The signal amplification unit 22 applies gain to the audio signal and sends the result to the audio signal output unit 24. The gain of the signal amplification unit 22 is controlled by an externally supplied control signal. The control signal generation unit 23 generates this control signal by performing various calculations and processing on the input audio signal, and sends it to the signal amplification unit 22. The audio signal output unit 24 sends the audio signal from the signal amplification unit 22 to the power amplifier 17.

Next, the calculations and processing performed in the control signal generation unit 23 are described with reference to FIG. 2.

The control signal generation unit 23 generates several signals from the input audio signal: a level signal, a first signal, a second signal, and the control signal to be supplied to the signal amplification unit 22. It also maintains several flags: a silence detection flag, an end-of-speech detection flag and a start-of-speech detection flag. Based on these signals and flags, the control signal generation unit 23 makes several determinations: when speech ends, when speech starts, and whether the audio signal is a speech signal or a music signal.

FIG. 2 shows the signals and flags generated by the calculations and processing in the control signal generation unit 23. The horizontal axis of FIG. 2 indicates time, and the vertical axis indicates the level of each signal and the state of each flag.

The control signal generation unit 23 generates a level signal D indicating the level of the input audio signal. A first signal E is generated from the level signal D. A second signal F is generated from the level signal D and the first signal E. The state of a silence detection flag J is determined from the level signal D and the second signal F. The state of an end-of-speech detection flag K is determined from the silence detection flag J. The state of a start-of-speech detection flag L is determined from the silence detection flag J and the end-of-speech detection flag K. In addition, the second signal F is corrected based on the start-of-speech detection flag L and the first signal E.

A control signal is then generated based on the second signal F. The control signal generation unit 23 sends this control signal to the signal amplification unit 22.

In FIG. 2, the level signal D is drawn as a broken line, the first signal E as a solid line, and the second signal F as a dash-dot line.

To describe the level signal D further, the level signal D in the figure was generated from the speech signal produced when several announcers took turns delivering announcements to the public address space.

The period indicated by the arrow Pa in the figure (hereinafter "period Pa") is the period during which announcer A made an announcement using the microphone Ma. The period indicated by the arrow Pb (hereinafter "period Pb") is the period during which announcer B made an announcement using the microphone Mb. The period indicated by the arrow Pc (hereinafter "period Pc") is the period during which announcer C made an announcement using the microphone Mc.

The level signal D shows that the magnitude of the audio signal differs from announcer to announcer. The audio signal in period Pa is relatively small, and the peak value of the level signal D is about -30 dB. The audio signal in period Pb is relatively large, with a peak value of about +5 dB. The audio signal in period Pc is larger still, with a peak value of about +10 dB.

Next, the first signal E is described.

The first signal E appears as the envelope of the level signal D. In other words, the first signal E is an envelope signal of the level signal D.

Various methods can be used to generate the first signal E. The simplest is to obtain the envelope signal by passing the level signal D through a low-pass filter. In this embodiment, however, the first signal E is generated by the following method.

At every fourth predetermined period (P4 seconds), the maximum value of the level signal D over the most recent P4 seconds is detected, and the detected maximum is held for the following P4 seconds. This produces the first signal E. The fourth predetermined period P4 may be, for example, between 0.1 seconds and 5 seconds inclusive. Because the first signal E is generated in this way, its waveform in the figure is stepped. Although stepped, the first signal E still forms an envelope of the level signal D.
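The block-wise peak hold described above can be sketched in Python. This is an illustrative reading rather than the embodiment's implementation: it assumes the level signal D is sampled at a fixed rate, so that P4 seconds corresponds to `block_len` samples, and it assumes a starting value `initial_db` for the first block, which has no preceding block to take a maximum from. All names are hypothetical.

```python
def peak_hold_envelope(levels_db, block_len, initial_db=-100.0):
    """Stepped envelope: during each block, output the maximum level
    observed in the previous block (a maximum detected, then held)."""
    if block_len <= 0:
        raise ValueError("block_len must be a positive number of samples")
    envelope = []
    held = initial_db  # assumed value before any maximum has been detected
    for start in range(0, len(levels_db), block_len):
        block = levels_db[start:start + block_len]
        envelope.extend([held] * len(block))  # hold the previous maximum
        held = max(block)                     # detect the maximum for the next block
    return envelope
```

With `levels_db = [-40, -30, -35, -10, -20, -50, -60, -55]` and `block_len = 2`, the result is `[-100.0, -100.0, -30, -30, -10, -10, -20, -20]`, a stepped waveform that follows the peaks of the input one block later.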

The waveform of the first signal E does not contain the sharp peaks and dips seen in the waveform of the level signal D.

Next, the second signal F is described.

The second signal F is generated based on the first signal E and other signals. The control signal supplied to the signal amplification unit 22 is generated based on the second signal F: when the second signal F is large, the control signal reduces the gain of the signal amplification unit 22, and when the second signal F is small, it increases the gain.
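One simple mapping with this property is sketched below in Python. The text only states the direction of the relationship (larger F, smaller gain), so the linear rule `target_db - f_db`, the target level and the gain limits are all assumptions made for illustration, as is the function name.

```python
def gain_from_second_signal(f_db, target_db=0.0,
                            min_gain_db=-20.0, max_gain_db=30.0):
    """Return a gain in dB that decreases as the second signal F increases.

    target_db, min_gain_db and max_gain_db are hypothetical parameters;
    the source does not specify how the control signal is computed.
    """
    gain_db = target_db - f_db  # larger F gives smaller gain
    return max(min_gain_db, min(max_gain_db, gain_db))
```

Under these assumptions, an F of -30 dB yields a gain of 30 dB and an F of +5 dB yields -5 dB, so quiet and loud announcers are brought toward the same output level.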

How the second signal F changes depends on the state of the start-of-speech detection flag L, which is described later. While the start-of-speech detection flag L is "1", the control signal generation unit 23 makes the second signal F match the first signal E. The reason is given later.

While the start-of-speech detection flag L is "0", the second signal F is generated based on the level signal D and the first signal E. More specifically, it is generated as follows.

The control signal generation unit 23 continuously observes the second signal F and the level signal D. While the second signal F exceeds the level signal D by a predetermined value W (dB) or more, the second signal F is not changed over time and is kept at its current value; it neither increases nor decreases. The predetermined value W may be, for example, between 10 dB and 40 dB inclusive.

When the level of the second signal F is smaller than the level of the level signal D plus W (dB), the second signal F is changed according to how it compares with the first signal E. If the first signal E is larger than the second signal F, the second signal F is increased at a predetermined first rate of change, R1 (dB) per second. The value of R1 may be between 0.1 and 3 inclusive. If the first signal E is smaller than the second signal F, the second signal F is decreased at the rate of R1 (dB) per second. If the second signal F and the first signal E are equal, the second signal F is not changed over time.

As a result, during the period in which the start-of-speech detection flag L is "0", the waveform of the second signal F is a horizontal line wherever the second signal F exceeds the level signal D by W (dB) or more, and elsewhere it resembles the waveform of the first signal E smoothed on the time axis.
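The update rule for the second signal F while the flag L is "0" can be written as a single time-step function. The Python sketch below assumes discrete updates every `dt` seconds and clamps each step so that F does not overshoot E; the clamping, the default values and the function name are assumptions, while W and R1 follow the ranges given in the text.

```python
def update_second_signal(f_db, d_db, e_db, w_db=20.0, r1_db_per_s=1.0, dt=0.1):
    """Advance the second signal F by one time step of dt seconds.

    f_db : current value of the second signal F
    d_db : current value of the level signal D
    e_db : current value of the first signal E
    """
    if f_db >= d_db + w_db:
        return f_db  # F exceeds D by W or more: hold F unchanged
    step = r1_db_per_s * dt
    if e_db > f_db:
        return min(f_db + step, e_db)  # rise toward E at R1 dB/s (clamp assumed)
    if e_db < f_db:
        return max(f_db - step, e_db)  # fall toward E at R1 dB/s (clamp assumed)
    return f_db  # F equals E: no change
```

Calling this function once per time step produces the behavior described: a flat line during pauses and a slew-limited version of E otherwise.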

When the second signal F exceeds the level signal D by W (dB) or more, the announcement can be judged to be in a silent state. In that state the waveform of the second signal F is a horizontal line, so the gain of the signal amplification unit 22 does not change. Consequently, even if the announcement resumes abruptly, an appropriate gain is applied to the audio signal.

When the announcement is not in a silent state, the waveform of the second signal F resembles the waveform of the first signal E smoothed on the time axis. The gain of the signal amplification unit 22 therefore changes gradually in response to changes in the level of the announcement's speech signal. As a result, announcements that sound natural and are easy to listen to can be delivered to the public address space.

Next, the silence detection flag J is described.

The state of the silence detection flag J is determined from the level signal D and the second signal F. The control signal generation unit 23 continuously observes the level signal D and the second signal F and sets the state of the silence detection flag J according to their levels.

More specifically, when the second signal F exceeds the level signal D by W (dB) or more, the silence detection flag J is set to "1"; otherwise it is set to "0". Because the flag is set in this way, the audio signal can be judged to be in a silent state whenever the silence detection flag J is "1".
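Expressed as code, the flag is a single comparison. The Python sketch below uses a hypothetical function name and a default W of 20 dB, which lies within the 10 dB to 40 dB range mentioned earlier for W.

```python
def silence_flag(f_db, d_db, w_db=20.0):
    """Silence detection flag J: 1 when the second signal F exceeds
    the level signal D by W dB or more, otherwise 0."""
    return 1 if f_db - d_db >= w_db else 0
```

Because the comparison is relative to F rather than to a fixed gate level, the same test works for quiet and loud announcers alike.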

Such silent states occur frequently in speech such as announcements and conversations. They appear, for example, at phrase boundaries in an announcement or conversation, and when the speaker changes during a conversation. It is therefore not appropriate to conclude that speech has ended as soon as a silent state is detected. In this embodiment, the result of determining whether the speech signal has reached the end of speech is reflected in the state of the end-of-speech detection flag K, described next.

Next, the end-of-speech detection flag K is described.

The initial state of the end-of-speech detection flag K is "0". Its state is determined from the silence detection flag J. The control signal generation unit 23 continuously observes the silence detection flag J, and when the flag J has remained "1" for a first predetermined period (P1 seconds), it sets the end-of-speech detection flag K from "0" to "1". When the silence detection flag J subsequently becomes "0", the end-of-speech detection flag K is returned to "0". The first predetermined period (P1 seconds) should be longer than the fourth predetermined period (P4 seconds). The first predetermined period may be between 0.5 seconds and 10 seconds inclusive.

Thus, the end-of-speech detection flag K changes from "0" to "1" when the silent state has lasted for P1 seconds. The point at which the end-of-speech detection flag K changes from "0" to "1" can therefore be regarded as the end of speech in the speech signal.
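The behavior of the flag K amounts to a run-length test on the flag J. The Python sketch below assumes that J is available as a sequence of 0/1 samples and that P1 seconds corresponds to `p1_samples` samples; both the function name and the sample-based timing are illustrative assumptions.

```python
def end_of_speech_flags(j_flags, p1_samples):
    """Derive the end-of-speech flag K from the silence flag J.

    K becomes 1 once J has been 1 for p1_samples consecutive samples,
    and returns to 0 as soon as J returns to 0.
    """
    k_flags = []
    silent_run = 0  # number of consecutive samples with J == 1
    for j in j_flags:
        silent_run = silent_run + 1 if j == 1 else 0
        k_flags.append(1 if silent_run >= p1_samples else 0)
    return k_flags
```

For `j_flags = [0, 1, 1, 1, 1, 0, 1]` and `p1_samples = 3`, the result is `[0, 0, 0, 1, 1, 0, 0]`: short pauses never set K, and only the sustained pause does.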

In this way, the end of speech is determined from the detection and observation of the silent state, without using a fixed gate level. The end of speech can therefore be determined accurately with little influence from the ambient noise around the speaker.

Next, the talk start detection flag L will be described.

The initial state of the talk start detection flag L is “0”. The state of the talk start detection flag L is determined based on the talk end detection flag K. The control signal generation unit 23 constantly observes the state of the talk end detection flag K, and at the point when the talk end detection flag K changes from “1” to “0”, it sets the talk start detection flag L, which has been “0”, to “1”. Then, when the second predetermined period (P2 seconds) has elapsed, it returns the talk start detection flag L to “0”. Here, the second predetermined period (P2 seconds) should be longer than the fourth predetermined period (P4 seconds). The second predetermined period (P2 seconds) may be 0.5 seconds or more and 10 seconds or less.

Thus, the talk start detection flag L changes from “0” to “1” at the point when, after a talk end has been detected, the audio signal is no longer in the silent state. The point at which the talk start detection flag L changes from “0” to “1” can therefore be judged to be the talk start point. In this way, detecting the talk end properly makes it possible to detect the talk start properly.
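The behavior of the talk start detection flag L can likewise be sketched in Python. The sketch assumes that K is given as a sequence of 0/1 samples and that P2 is expressed as a number of samples (`p2_samples`); these names are hypothetical. It also assumes that a further falling edge of K while L is already “1” simply restarts the P2 timer, a case this specification does not address.

```python
def talk_start_flags(k_flags, p2_samples):
    """Derive the talk start detection flag L from sampled values of flag K.

    L is set to 1 at the sample where K changes from 1 to 0 and is held
    at 1 for p2_samples samples, after which it returns to 0.
    """
    l_flags = []
    remaining = 0  # samples for which L must still be held at 1
    prev_k = 0     # K starts in the "0" state
    for k in k_flags:
        if prev_k == 1 and k == 0:
            remaining = p2_samples  # assumption: restart the timer on each edge
        l_flags.append(1 if remaining > 0 else 0)
        if remaining > 0:
            remaining -= 1
        prev_k = k
    return l_flags
```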

Next, the relationship between the talk start detection flag L and the second signal F will be described.

The point at which the talk start detection flag L changes from “0” to “1” can be judged to be the point at which a new announcement begins. At this time, it is quite possible that the announcer or the microphone in use has changed. The gain of the signal amplifier 22 must therefore be adjusted quickly to suit the level of the newly input audio signal. For this purpose, the control signal generation unit 23 makes the second signal F coincide with the first signal E only during the P2 seconds in which the talk start detection flag L is set to “1”. This allows the second signal F to change quickly in accordance with the level of the audio signal. As a result, the gain of the signal amplifier 22 can be changed quickly to a level suited to the audio signal of the newly started announcement.

Next, the gain of the signal amplifier 22 will be described.

FIG. 3 is a diagram showing the second signal F and the gain G of the signal amplifier 22. The horizontal axis represents time, and the vertical axis represents level. The waveform of the second signal F in FIG. 3 is identical to the waveform of the second signal F in FIG. 2.

As stated earlier, the control signal supplied to the signal amplifier 22 is generated based on the second signal F, and the gain of the signal amplifier 22 is decreased when the second signal F is large and increased when the second signal F is small.

The gain G (dB) of the signal amplifier 22 is determined as follows. Let F (dB) denote the level of the second signal F. When F is less than 0, the gain is set to “G (dB) = 12 (dB)”. When F is greater than 12, it is set to “G (dB) = 0 (dB)”. When F is 0 or more and 12 or less, it is set to “G (dB) = (12 − F) (dB)”.

The control signal generation unit 23 generates the control signal (the control signal to be supplied to the signal amplifier 22) based on the second signal F so that the gain of the signal amplifier 22 is set as described above.
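The gain rule stated above amounts to clamping (12 − F) to the range from 0 dB to 12 dB. A minimal Python sketch follows; the function name `gain_db` is hypothetical, and how the resulting gain is encoded into the control signal is not covered here.

```python
def gain_db(f_db):
    """Return the gain G (dB) of the signal amplifier for a second-signal level F (dB)."""
    if f_db < 0:
        return 12.0       # F below 0 dB: maximum gain of 12 dB
    if f_db > 12:
        return 0.0        # F above 12 dB: no gain
    return 12.0 - f_db    # 0 <= F <= 12: gain falls linearly as F rises
```

The three branches meet at the boundaries (G = 12 dB at F = 0 and G = 0 dB at F = 12), so the gain changes continuously as F changes.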

It can be seen from FIG. 3 that the gain G of the signal amplifier 22 changes quickly at the beginning of the period Pc. This is because the second signal F is made to coincide with the first signal E during the P2 seconds in which the talk start detection flag L is set to “1”.

Next, it will be explained that observing the silence detection flag J makes it possible to identify whether the audio signal input to the automatic gain adjustment device 20 is a speech signal or a music signal.

The inventor of the present application analyzed how the state of the silence detection flag J changes for various speech signals and for various music signals, and found that the changes in the state of the silence detection flag J obtained from speech signals and those obtained from music signals have different characteristics.

Specifically, the silence detection flag J obtained from a speech signal changes from “0” to “1”, that is, rises, more frequently than the silence detection flag J obtained from a music signal. The reason is presumed to be as follows.

As can be understood from the level signal D in FIG. 2, when the audio signal is a speech signal, steep dips often occur in the level signal D. These dips are caused by, for example, phrase boundaries in an announcement; that is, the signal falls silent at phrase boundaries and the like. Such silent states occur frequently not only in announcements but also in conversations among multiple speakers, because conversations also contain phrase boundaries and silent states also occur when speakers change.

Compared with speech such as announcements and conversations, music contains few silent states. Thus, when the audio signal is a music signal, steep dips appear in the level signal D infrequently. As a result, the silence detection flag J also changes from “0” to “1” infrequently.

This can be used to determine (identify) whether the audio signal is a speech signal or a music signal. That is, the determination (identification) can be made based on the silence detection flag J. In other words, the frequency of silent states is detected, and based on it, whether the audio signal is a speech signal or a music signal can be determined (identified). There are several specific methods for doing so.

For example, if the silence detection flag J rises (changes from “0” to “1”) a first predetermined number of times (N1 times) or more during a third predetermined period (P3 seconds), the input audio signal may be determined to be a speech signal, and if the number of rises is less than the first predetermined number, the input audio signal may be determined to be a music signal. In other words, the number of times the signal falls into the silent state within a fixed time is counted based on the level signal D and the second signal F, and whether the input audio signal is a speech signal or a music signal is determined (identified) based on the counted number. Here, the third predetermined period (P3 seconds) should be longer than the fourth predetermined period (P4 seconds). The third predetermined period (P3 seconds) may be 2 seconds or more and 20 seconds or less. The first predetermined number (N1) should be 2 or more, and may be 2 or more and 5 or less.

To determine that the input audio signal is a music signal, the state of the silence detection flag J must be observed for at least P3 seconds after timing of the third predetermined period (P3 seconds) starts. This is because only after P3 seconds have elapsed can it be known that the number of rises of the silence detection flag J during those P3 seconds is less than N1. If the observation over P3 seconds shows that the number of rises of the silence detection flag J during that time is less than N1, the input audio signal can be determined at that point to be a music signal.

To determine that the input audio signal is a speech signal, it is possible to observe the state of the silence detection flag J for P3 seconds after timing of the third predetermined period (P3 seconds) starts, detect how many times the silence detection flag J rose during those P3 seconds, and then make the determination. If the counted number is, for example, (N1 + 2), the audio signal can be determined to be a speech signal.

However, to determine that the input audio signal is a speech signal, it is not always necessary to observe the state of the silence detection flag J for the full P3 seconds after timing of the third predetermined period (P3 seconds) starts. The audio signal can be determined to be a speech signal at the point when the N1-th rise of the silence detection flag J is detected during timing of the third predetermined period (P3 seconds). Timing of the next third predetermined period (P3 seconds) then starts from that point.
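The counting method described above, including the early decision on the N1-th rise, can be sketched in Python as follows. The sketch assumes that J is given as 0/1 samples at a fixed interval, that P3 is expressed as a number of samples (`p3_samples`), and that J starts in the “0” state. It further assumes that a new observation window also starts immediately after a "music" decision, which this specification states only for the "speech" case. All names are hypothetical.

```python
def classify_by_rise_count(j_flags, p3_samples, n1):
    """Classify the input as speech or music from the rises of flag J.

    Returns a list of (sample_index, label) decisions. "speech" is decided
    as soon as n1 rises have been seen within the current window; "music"
    is decided only when a full window of p3_samples has elapsed with
    fewer than n1 rises.
    """
    decisions = []
    prev_j = 0    # assumption: J starts in the "0" state
    rises = 0     # rises of J counted in the current window
    elapsed = 0   # samples elapsed in the current window
    for i, j in enumerate(j_flags):
        if prev_j == 0 and j == 1:
            rises += 1
        prev_j = j
        elapsed += 1
        if rises >= n1:
            decisions.append((i, "speech"))
            rises, elapsed = 0, 0  # the next window starts from this point
        elif elapsed >= p3_samples:
            decisions.append((i, "music"))
            rises, elapsed = 0, 0  # assumption: restart after a full window
    return decisions
```

A "speech" decision can therefore arrive well before the window ends, whereas a "music" decision always takes the full window.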

Another specific method for detecting the frequency of silent states and determining (identifying), based on it, whether the audio signal is a speech signal or a music signal is as follows.

That is, the time T from when the silence detection flag J changes from “1” to “0” until it returns to “1” is observed, and while the time T continues to be within a predetermined time of T1 seconds, the input audio signal can be determined to be a speech signal. If the time T is subsequently observed to exceed T1 seconds a second predetermined number of times (N2 times) in succession, it can be determined that the audio signal has changed from a speech signal to a music signal. If, after that, the time T is observed to be within T1 seconds a third predetermined number of times (N3 times) in succession, it can be determined that the audio signal has changed from a music signal to a speech signal. In other words, the time from the end of one silent state to the start of the next silent state is observed continuously, and whether the input audio signal is a speech signal or a music signal is identified based on the observed time. FIG. 2 illustrates a case in which the time T from when the silence detection flag J changes from “1” to “0” until it returns to “1” is t seconds. Here, the predetermined time (T1 seconds) may be 2 seconds or more and 10 seconds or less. The second predetermined number (N2) may be 2 or more and 10 or less, and the third predetermined number (N3) may be 2 or more and 10 or less.
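The interval-based method can be sketched in Python as a two-state machine. The sketch assumes that J is given as 0/1 samples, that the time T is measured in samples, and that the initial judgment is "speech"; this specification does not state an initial judgment, so that choice and all names below are hypothetical.

```python
def gaps_between_silences(j_flags):
    """Return each time T (in samples) from a fall of J (1 to 0) to its next rise (0 to 1)."""
    gaps = []
    start = None  # index at which the current non-silent interval began
    prev = 0
    for i, j in enumerate(j_flags):
        if prev == 1 and j == 0:
            start = i
        elif prev == 0 and j == 1 and start is not None:
            gaps.append(i - start)
            start = None
        prev = j
    return gaps


def classify_by_gaps(gaps, t1, n2, n3, initial="speech"):
    """Track the speech/music judgment after each observed time T.

    speech -> music after n2 consecutive gaps longer than t1;
    music -> speech after n3 consecutive gaps of t1 or less.
    """
    state = initial
    streak = 0  # consecutive gaps that contradict the current state
    states = []
    for t in gaps:
        if state == "speech":
            streak = streak + 1 if t > t1 else 0
            if streak >= n2:
                state, streak = "music", 0
        else:
            streak = streak + 1 if t <= t1 else 0
            if streak >= n3:
                state, streak = "speech", 0
        states.append(state)
    return states
```

Requiring N2 or N3 consecutive observations before switching gives the judgment some hysteresis, so a single unusually long or short interval does not change it.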

As described above, the frequency of silent states can be detected and used to determine (identify) whether the audio signal is a speech signal or a music signal. The result of that determination can also be reflected in the adjustment of the gain of the signal amplifier 22. For example, the rate of change of the second signal F can be varied depending on whether the audio signal is a speech signal or a music signal. That is, as stated earlier, the second signal F increases at a rate of (R1 (dB)/second) or decreases at a rate of (R1 (dB)/second) under predetermined conditions; this rate of change is varied according to the type of audio signal.

For example, when the audio signal is determined to be a speech signal, the rate of change of the second signal F may be set to a first rate of change (R1 (dB)/second), and when the audio signal is determined to be a music signal, it may be set to a second rate of change (R2 (dB)/second). R2 is smaller than R1, so the rate of change is slower for a music signal. R2 may be 0.01 or more and 2 or less. When the rate of change of the second signal F is varied according to the type of audio signal in this way, the rate of change of the gain of the signal amplification unit 22 also varies according to the type of audio signal.

The rate of change is made slower for a music signal, as in the above example, so as not to impair the musical quality.
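A rough Python sketch of how the rate of change of the second signal F might be switched by signal type is given below. It is a simplification under several assumptions: F is assumed to move toward the first signal E by at most (rate × dt) dB per update, the conditions under which F is held constant are omitted, and the default values `r1=3.0` and `r2=0.5` are purely illustrative (this specification only requires that R2 be smaller than R1). All names are hypothetical.

```python
def next_second_signal(f_db, e_db, dt, is_music, talk_start, r1=3.0, r2=0.5):
    """Return the next level of the second signal F (dB).

    While the talk start detection flag L is 1 (talk_start), F is made to
    coincide with E. Otherwise F moves toward E at no more than R1 dB/s
    for speech or R2 dB/s for music.
    """
    if talk_start:
        return e_db
    rate = r2 if is_music else r1  # slower tracking for music
    max_step = rate * dt
    diff = e_db - f_db
    if diff > max_step:
        return f_db + max_step
    if diff < -max_step:
        return f_db - max_step
    return e_db  # E is within one step of F
```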

An embodiment of the present invention has been described above. In the above embodiment, an example was shown in which the audio signal determination device according to the present application is used in an automatic gain adjustment device. However, the audio signal determination device according to the present application has uses other than automatic gain adjustment devices. For example, it can be used in an automatic recording device that starts recording an audio signal at the talk start and stops recording at the talk end.

The audio signal determination device of the present application can determine the talk end point in an audio signal and can determine the type of the audio signal, and can therefore be used in the technical field of electroacoustics.

FIG. 1 is a block diagram showing the schematic configuration of an acoustic system.
FIG. 2 is a diagram showing the various signals and flags generated by the various calculations and processing in the control signal generation unit.
FIG. 3 is a diagram showing the second signal and the gain of the signal amplifier.

Explanation of symbols

11 CD player
15 changeover switch
17 power amplifier
18 speaker
20 automatic gain adjustment device
21 audio signal input unit
22 signal amplification unit
23 control signal generation unit
24 audio signal output unit
Ma, Mb, Mc microphone
D level signal
E first signal
F second signal
G gain
S acoustic system

Claims

1. An audio signal determination device comprising: first signal generating means for generating a first signal; second signal generating means for generating a second signal; and determination means, wherein
the first signal generating means generates the first signal, which is an envelope signal of the level value of an audio signal,
the second signal generating means generates the second signal by smoothing the first signal on a time axis, on condition that the level value of the second signal does not change over time while it is equal to or higher than a value obtained by adding a predetermined value to the level value of the audio signal, and
the determination means determines, as a talk end point in the audio signal, the point at which the state in which the level value of the second signal is equal to or higher than the value obtained by adding the predetermined value to the level value of the audio signal has continued for a first predetermined period.

2. The audio signal determination device according to claim 1, wherein the determination means determines, as a talk start point, the point at which, after the talk end point in the audio signal, the level value of the second signal becomes smaller than the value obtained by adding the predetermined value to the level value of the audio signal.

3. The audio signal determination device according to claim 2, wherein the second signal generating means matches the level value of the second signal to the level value of the first signal during a second predetermined period following the talk start point of the audio signal.

4. An audio signal determination device comprising: first signal generating means for generating a first signal; second signal generating means for generating a second signal; and determination means, wherein
the first signal generating means generates the first signal, which is an envelope signal of the level value of an audio signal,
the second signal generating means generates the second signal by smoothing the first signal on a time axis, on condition that the level value of the second signal does not change over time while it is equal to or higher than a value obtained by adding a predetermined value to the level value of the audio signal, and
the determination means determines that a state in which the level value of the second signal is equal to or higher than the value obtained by adding the predetermined value to the level value of the audio signal is a silent state, and determines whether the audio signal is a speech signal or a music signal based on the frequency at which the silent state occurs.

5. The audio signal determination device according to claim 4, wherein the determination means counts the number of times the silent state occurs within a third predetermined period, determines that the audio signal is a speech signal when the counted number is equal to or greater than a predetermined number, and determines that the audio signal is a music signal when the counted number is less than the predetermined number.

6. The audio signal determination device according to claim 4, wherein the determination means determines whether the audio signal is a speech signal or a music signal based on the time from the end of one silent state to the start of the next silent state.

7. The audio signal determination device according to claim 1, wherein the first signal generating means generates the first signal by detecting, every fourth predetermined period, the maximum level value of the audio signal in the most recent fourth predetermined period, and holding the detected maximum value for the fourth predetermined period.