JP2021167853A

JP2021167853A - Abnormal sound detection device and program therefor

Info

Publication number: JP2021167853A
Application number: JP2020070171A
Authority: JP
Inventors: 琴子古屋; Kotoko Furuya; 隆弘松田; Takahiro Matsuda
Original assignee: Nippon Hoso Kyokai NHK; Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 2020-04-09
Filing date: 2020-04-09
Publication date: 2021-10-21
Anticipated expiration: 2040-04-09
Also published as: JP7445503B2

Abstract

To provide an automatic sound monitor capable of detecting a variety of abnormal sounds included in a sound signal.
SOLUTION: An automatic sound monitor 1 comprises: acoustic feature quantity calculation means 10 for calculating an acoustic feature quantity of a sound signal; sound signal level detection means 20 for detecting whether the sound signal level is proper; single-frequency signal detection means 30 for detecting whether the sound signal includes a single-frequency signal; predicted value calculation means 40 for using a learning model to calculate a predicted value of sound distortion; sound distortion detection means 50 for detecting whether the sound signal includes sound distortion based upon the predicted value; abnormal sound detection means 60 for detecting whether the sound signal includes abnormal sound; and switching control means 70 for performing switching control over the sound signal.
SELECTED DRAWING: Figure 1

Description

The present invention relates to an abnormal sound detection device that detects abnormal sounds included in an audio signal, and to a program therefor.

Automatic audio monitors that detect abnormal sounds in audio broadcasting such as radio broadcasting are conventionally known (for example, Patent Document 1). This conventional method detects a failure of broadcasting equipment by comparing the levels of two input audio signals. When the two audio signal levels do not match, the conventional method outputs an alarm and switches from the production system to the backup system.

Patent Document 1: Japanese Unexamined Utility Model Application Publication No. H5-43634

However, with the method described in Patent Document 1, an abnormality cannot be detected when a single-frequency signal (test signal) unsuitable for audio broadcasting, or an audio signal with a low signal-to-noise ratio, is mixed in, because the level of such an audio signal is itself unproblematic.

A single-frequency signal is a test signal used for maintenance and inspection of broadcasting equipment, for example a signal with a constant frequency of 1 kHz. When this single-frequency signal is input to the broadcasting equipment, the operator is supposed to press a "maintenance button" that prevents automatic switching. If the operator forgets to press the "maintenance button", the automatic audio monitor misjudges the high-level single-frequency signal as a normal audio signal, which can lead to the accident of switching to this single-frequency signal and broadcasting it.

Therefore, an object of the present invention is to provide an abnormal sound detection device capable of detecting various abnormal sounds included in an audio signal, and a program therefor.

To solve the above problem, the abnormal sound detection device according to the present invention is an abnormal sound detection device that detects abnormal sounds included in an audio signal, and comprises acoustic feature amount calculating means, audio signal level detecting means, single-frequency signal detecting means, predicted value calculating means, audio distortion detecting means, and abnormal sound detecting means.

With this configuration, the acoustic feature amount calculating means calculates, from the audio signal, acoustic features related to frequency perception characteristics and acoustic features related to musical scale information.
The audio signal level detecting means detects whether or not the level of the audio signal is appropriate.
The single-frequency signal detecting means detects whether or not the audio signal includes a single-frequency signal.
The predicted value calculating means calculates a predicted value indicating the probability that the audio signal includes audio distortion, using a learning model trained in advance on the acoustic features related to frequency perception characteristics and the acoustic features related to musical scale information.

The audio distortion detecting means detects whether or not the audio signal includes audio distortion, based on the predicted value calculated by the predicted value calculating means.
The abnormal sound detecting means detects whether or not the audio signal contains an abnormal sound, based on the detection results of the audio signal level detecting means, the single-frequency signal detecting means, and the audio distortion detecting means.
In this way, the abnormal sound detection device can detect not only an improper audio signal level but also abnormal sounds caused by a single-frequency signal or audio distortion.

The present invention can also be realized as a program that causes a computer to function as the abnormal sound detection device described above.

According to the present invention, various abnormal sounds included in an audio signal can be detected.

FIG. 1 is a block diagram showing the configuration of the automatic audio monitor according to the embodiment.
FIG. 2(a) is a graph showing the spectral centroid of a typical audio signal, and FIG. 2(b) is a graph showing the spectral centroid of a single-frequency signal.
FIG. 3 is an explanatory diagram of the training data in the embodiment.
FIG. 4 is an explanatory diagram of audio distortion detection in the embodiment, in which (a) shows the acoustic features input to the learning model, (b) shows the predicted values obtained from the learning model, and (c) shows the audio distortion detection result.
FIG. 5 is a flowchart showing the operation of the automatic audio monitor according to the embodiment.
FIG. 6 is an explanatory diagram of the evaluation results of the predicted value calculating means and the audio distortion detecting means in the example.

Embodiments of the present invention are described below with reference to the drawings. The embodiments described below are intended to embody the technical idea of the present invention and, unless specifically stated otherwise, do not limit the present invention to what follows.

As shown in FIG. 1, the automatic audio monitor (abnormal sound detection device) 1 receives, in radio broadcasting, two systems of audio signals, a production system and a backup system, and detects abnormal sounds included in the input audio signal of each system. When the production-system audio signal contains an abnormal sound and the backup-system audio signal does not, the automatic audio monitor 1 commands the control panel 2 to switch from the production-system audio signal to the backup-system audio signal.

The production-system audio signal is the audio signal that is actually being broadcast on the radio. The backup-system audio signal is the audio signal that is switched to and broadcast when some abnormality occurs in the production-system audio signal. The two systems of audio signals are input to the automatic audio monitor 1 as monitoring audio signals (two systems), and to the control panel 2 as broadcast audio signals (two systems).

The control panel 2 switches between the two systems of audio signals and outputs one of them in accordance with a switching command from the automatic audio monitor 1. That is, when the automatic audio monitor 1 commands a switch, the control panel 2 switches from the production-system audio signal to the backup-system audio signal.

[Configuration of the automatic audio monitor]
The configuration of the automatic audio monitor 1 is described in detail below.
As shown in FIG. 1, the automatic audio monitor 1 includes acoustic feature amount calculating means 10, audio signal level detecting means 20, single-frequency signal detecting means 30, predicted value calculating means 40, audio distortion detecting means 50, abnormal sound detecting means 60, and switching control means 70.

The acoustic feature amount calculating means 10 calculates, from the input audio signal, acoustic features related to frequency perception characteristics and acoustic features related to musical scale information. As the acoustic features related to frequency perception characteristics, it calculates the mel frequency spectrum (mel spectrogram) and the mel frequency cepstrum coefficients (MFCC) from the audio signal. As the acoustic feature related to musical scale information, it calculates a chromagram from the audio signal. In addition, it calculates the root mean square (RMS) of the audio signal level and the spectral centroid of the audio signal. The root mean square of the audio signal level is calculated over a predetermined set time (for example, 512 or more data samples).

The acoustic feature amount calculating means 10 calculates the acoustic features (mel frequency spectrum, mel frequency cepstrum coefficients, chromagram, root mean square, and spectral centroid) from each of the two systems of audio signals. It outputs the root mean square of the audio signal level of each system to the audio signal level detecting means 20, the spectral centroid of the audio signal of each system to the single-frequency signal detecting means 30, and the mel frequency spectrum, mel frequency cepstrum coefficients, and chromagram of the audio signal of each system to the predicted value calculating means 40.

The audio signal level detecting means 20 detects whether or not the audio signal level is appropriate. Specifically, as shown in Equation (1) below, it detects whether the root mean square LV_RMS of the audio signal level input from the acoustic feature amount calculating means 10 is within a preset appropriate level range. In Equation (1), LV_MIN is the minimum value of the appropriate level and LV_MAX is the maximum value of the appropriate level. LV_MIN and LV_MAX are set in advance to arbitrary values (for example, LV_MIN = -55 dBm and LV_MAX = -24 dBm).
$$LV_{\mathrm{MIN}} \le LV_{\mathrm{RMS}} \le LV_{\mathrm{MAX}} \qquad \text{... Equation (1)}$$

When Equation (1) is satisfied, the audio signal level detecting means 20 outputs normal "0", indicating that the audio signal level is appropriate, to the abnormal sound detecting means 60 as the audio signal level detection result. When Equation (1) is not satisfied, it outputs abnormal "1", indicating that the audio signal level is inappropriate, to the abnormal sound detecting means 60 as the audio signal level detection result.
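The level check of Equation (1) can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names `rms_dbm` and `level_flag` are hypothetical, and the conversion to dBm assumes samples in volts across a 600-ohm load, which the source does not specify.

```python
import math
import numpy as np

# Example limits quoted in the text for the appropriate level range.
LV_MIN_DBM = -55.0
LV_MAX_DBM = -24.0

def rms_dbm(samples, impedance_ohm=600.0):
    """RMS level of a block of samples (volts) expressed in dBm.

    The 600-ohm reference load is an assumption, not taken from the source.
    """
    x = np.asarray(samples, dtype=float)
    v_rms = float(np.sqrt(np.mean(np.square(x))))
    if v_rms == 0.0:
        return -math.inf  # digital silence has no finite level
    power_mw = (v_rms ** 2 / impedance_ohm) * 1000.0
    return 10.0 * math.log10(power_mw)

def level_flag(lv_rms, lv_min=LV_MIN_DBM, lv_max=LV_MAX_DBM):
    """Return 0 (normal) if LV_MIN <= LV_RMS <= LV_MAX, otherwise 1 (abnormal)."""
    return 0 if lv_min <= lv_rms <= lv_max else 1
```

In use, a block of 512 or more samples would be passed to `rms_dbm` and the result to `level_flag`; silence and over-level blocks both fall outside the range and yield "1".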

The audio signal level detecting means 20 uses the above method to detect whether each of the two systems of audio signals is within the appropriate level range, and outputs the audio signal level detection result for each system to the abnormal sound detecting means 60.

The single-frequency signal detecting means 30 detects whether or not the audio signal includes a single-frequency signal. As shown in FIG. 2(a), the spectral centroid of an ordinary audio signal is not constant. In contrast, as shown in FIG. 2(b), a single-frequency signal has a constant frequency and level, so its spectral centroid is also constant. The single-frequency signal detecting means 30 therefore detects whether a single-frequency signal is included based on the spectral centroid of the audio signal input from the acoustic feature amount calculating means 10.

Specifically, as shown in Equation (2) below, the single-frequency signal detecting means 30 detects that the audio signal includes a single-frequency signal when the spectral centroid Centroid of the audio signal exceeds a first threshold TH_1 and the variance σ^2 of the spectral centroid is less than a second threshold TH_2. The first threshold TH_1 and the second threshold TH_2 are set in advance to arbitrary values (for example, TH_1 = 1 and TH_2 = 0.02).
$$\mathrm{Centroid} > TH_1 \ \text{and} \ \sigma^2 < TH_2 \qquad \text{... Equation (2)}$$

When Equation (2) is not satisfied, the single-frequency signal detecting means 30 outputs normal "0", indicating that the audio signal does not include a single-frequency signal, to the abnormal sound detecting means 60 as the single-frequency signal detection result. When Equation (2) is satisfied, it outputs abnormal "1", indicating that the audio signal includes a single-frequency signal, to the abnormal sound detecting means 60 as the single-frequency signal detection result.
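The test of Equation (2) can be illustrated with a short NumPy sketch. Several points here are assumptions rather than details from the source: the centroid is computed per Hann-windowed frame in Hz, the first condition is applied to the mean centroid over the frames, and the function names are hypothetical. The source does not state the units of its example thresholds (TH_1 = 1, TH_2 = 0.02), so the thresholds are left as required parameters.

```python
import numpy as np

def spectral_centroids(samples, sample_rate, frame_len=512, hop=256):
    """Spectral centroid (Hz) of each Hann-windowed frame."""
    x = np.asarray(samples, dtype=float)
    window = np.hanning(frame_len)
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / sample_rate)
    centroids = []
    for start in range(0, len(x) - frame_len + 1, hop):
        mag = np.abs(np.fft.rfft(x[start:start + frame_len] * window))
        total = mag.sum()
        # A silent frame has no defined centroid; report 0 Hz for it.
        centroids.append(float((freqs * mag).sum() / total) if total > 0 else 0.0)
    return np.array(centroids)

def single_frequency_flag(centroids, th1, th2):
    """Return 1 (abnormal) if mean centroid > th1 and variance < th2, else 0."""
    return 1 if (np.mean(centroids) > th1 and np.var(centroids) < th2) else 0
```

A steady 1 kHz tone gives nearly identical centroids in every frame, so the variance condition holds, whereas a signal whose spectrum moves over time (speech, music, or a frequency sweep) gives a large variance and is treated as normal.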

The single-frequency signal detecting means 30 uses the above method to detect whether each of the two systems of audio signals includes a single-frequency signal, and outputs the single-frequency signal detection result for each system to the abnormal sound detecting means 60.

The predicted value calculating means 40 calculates a predicted value indicating the probability that the audio signal includes audio distortion, using a learning model trained in advance on the acoustic features related to frequency perception characteristics and the acoustic features related to musical scale information.
The audio distortion detecting means 50 detects whether or not the audio signal includes audio distortion, based on the predicted value input from the predicted value calculating means 40.

<Learning model generation method>
The method by which the predicted value calculating means 40 generates the learning model is described with reference to FIG. 3.
The learning model is generated by machine learning on the mel frequency spectrum, mel frequency cepstrum coefficients, and chromagram as acoustic features. For example, a normal audio signal containing no audio distortion and an abnormal audio signal containing artificially generated audio distortion are produced from the same sound source material. Then, as shown in FIG. 3, the mel frequency spectrum (mel), mel frequency cepstrum coefficients (mfcc), and chromagram (chr) at each time are calculated from each of the normal and abnormal audio signals, and these multidimensional acoustic features are used as the training data.

The training data in FIG. 3 also includes set values obtained through a subjective evaluation experiment. A set value indicates whether the audio was perceived by humans as normal or abnormal: "0" denotes a normal audio signal containing no audio distortion, and "1" denotes an abnormal audio signal containing audio distortion.

In FIG. 3, the acoustic features are drawn as three-dimensional data to keep the figure readable, but in practice they usually have many more dimensions. For example, the training data contains 268-dimensional acoustic features consisting of a 128-dimensional mel frequency spectrum, 128-dimensional mel frequency cepstrum coefficients, and a 12-dimensional chromagram (not shown).
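The 268-dimensional feature vector (128 + 128 + 12) can be assembled per frame by simple concatenation. The sketch below assumes the three features have already been computed as arrays of shape (frames, dims); the function name `stack_features` and the array layout are illustrative assumptions, not details from the source.

```python
import numpy as np

N_MEL, N_MFCC, N_CHROMA = 128, 128, 12  # dimensions quoted in the text

def stack_features(mel, mfcc, chroma):
    """Concatenate per-frame features of shape (frames, dims) into (frames, 268)."""
    mel, mfcc, chroma = (np.asarray(a, dtype=float) for a in (mel, mfcc, chroma))
    expected = {"mel": (mel, N_MEL), "mfcc": (mfcc, N_MFCC), "chroma": (chroma, N_CHROMA)}
    for name, (arr, dims) in expected.items():
        if arr.ndim != 2 or arr.shape[1] != dims:
            raise ValueError(f"{name} must have shape (frames, {dims}), got {arr.shape}")
    if not (len(mel) == len(mfcc) == len(chroma)):
        raise ValueError("all features must cover the same number of frames")
    return np.concatenate([mel, mfcc, chroma], axis=1)
```

Each row of the result is one training or inference sample; for training, the subjective-evaluation label ("0" or "1") would be attached to each row separately.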

The machine learning method is arbitrary; for example, a machine learning platform such as DataRobot can be used (Reference 1). DataRobot has more than 100 built-in algorithms and can train multiple learning models in parallel, so an optimal learning model can be generated efficiently.
Reference 1: DataRobot, [online], [retrieved March 24, 2020], Internet <URL: https://www.datarobot.com/jp/platform/>

<Audio distortion detection method>
The method by which the predicted value calculating means 40 and the audio distortion detecting means 50 detect audio distortion is described with reference to FIG. 4.
As shown in FIG. 4(a), the predicted value calculating means 40 receives multidimensional acoustic features consisting of the mel frequency spectrum, mel frequency cepstrum coefficients, and chromagram of the audio signal. As shown in FIG. 4(b), it then inputs the acoustic features at each time into the trained learning model and obtains a predicted value for each time from the model. The predicted value calculating means 40 then averages the predicted values at each time while shifting by a preset time window.

Next, the audio distortion detecting means 50 applies a threshold decision to the predicted values averaged by the predicted value calculating means 40, using a preset third threshold (for example, 0.5). As shown in FIG. 4(c), when the averaged predicted value is less than the third threshold, the audio distortion detecting means 50 outputs normal "0", indicating that the audio signal does not include audio distortion, to the abnormal sound detecting means 60 as the audio distortion detection result. When the averaged predicted value is equal to or greater than the third threshold, it outputs abnormal "1", indicating that the audio signal includes audio distortion, to the abnormal sound detecting means 60 as the audio distortion detection result.
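The averaging and threshold decision can be sketched as a moving average followed by a comparison. This is an illustration under assumptions: the window is advanced one prediction at a time, the window length of 3 is arbitrary, and the abnormal result is written as 1 to match the other detectors; `distortion_flags` is a hypothetical name.

```python
import numpy as np

def distortion_flags(predictions, window=3, th3=0.5):
    """Moving-average the per-time predicted values, then threshold them.

    Returns one flag per window position: 0 (normal) if the averaged
    prediction is below th3, 1 (abnormal) if it is th3 or above.
    """
    p = np.asarray(predictions, dtype=float)
    if window < 1 or window > len(p):
        raise ValueError("window must be between 1 and len(predictions)")
    averaged = np.convolve(p, np.ones(window) / window, mode="valid")
    return (averaged >= th3).astype(int)
```

Averaging before thresholding suppresses isolated high predictions, so a single noisy frame does not by itself trigger a distortion result.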

The predicted value calculating means 40 uses the above method to calculate a predicted value from each of the two systems of audio signals, and outputs the predicted value for each system to the audio distortion detecting means 50.
The audio distortion detecting means 50 uses the above method to detect whether each of the two systems of audio signals includes audio distortion, and outputs the audio distortion detection result for each system to the abnormal sound detecting means 60.

Returning to FIG. 1, the description of the configuration of the automatic audio monitor 1 continues.
The abnormal sound detecting means 60 detects whether or not the audio signal contains an abnormal sound, based on the detection results input from the audio signal level detecting means 20, the single-frequency signal detecting means 30, and the audio distortion detecting means 50.

<Abnormal sound detection method: first example>
A first example of the abnormal sound detection method used by the abnormal sound detecting means 60 is described below.
Specifically, the abnormal sound detecting means 60 detects that the audio signal contains an abnormal sound in any of the following cases: the audio signal level is inappropriate, the audio signal includes a single-frequency signal, or the audio signal includes audio distortion. That is, if even one of the detection results input from the audio signal level detecting means 20, the single-frequency signal detecting means 30, and the audio distortion detecting means 50 is abnormal "1", the abnormal sound detecting means 60 detects that the audio signal contains an abnormal sound.

Conversely, when the audio signal level is appropriate, the audio signal does not include a single-frequency signal, and the audio signal does not include audio distortion, the abnormal sound detecting means 60 detects that the audio signal does not contain an abnormal sound. That is, when all of the detection results input from the audio signal level detecting means 20, the single-frequency signal detecting means 30, and the audio distortion detecting means 50 are normal "0", the abnormal sound detecting means 60 detects that the audio signal does not contain an abnormal sound.

<Abnormal sound detection method: second example>
The abnormal sound detecting means 60 may instead detect abnormal sounds using the method of the second example.
Specifically, the abnormal sound detecting means 60 detects whether or not the audio signal contains an abnormal sound by a majority vote over the detection results input from the audio signal level detecting means 20, the single-frequency signal detecting means 30, and the audio distortion detecting means 50. That is, it compares the number of normal "0" results with the number of abnormal "1" results; when the normal "0" results outnumber the abnormal "1" results, it detects that the audio signal does not contain an abnormal sound. When the abnormal "1" results outnumber the normal "0" results, it detects that the audio signal contains an abnormal sound.
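The two combination rules can be sketched as follows. The function names are hypothetical, and each takes the three detector flags (level, single-frequency, distortion) as a sequence of 0s and 1s. With three detectors a tie cannot occur; treating a tie as normal in the majority rule is an assumption that only matters for an even number of detectors.

```python
def detect_any(flags):
    """First example: abnormal (1) if at least one detector reports 1."""
    return 1 if any(f == 1 for f in flags) else 0

def detect_majority(flags):
    """Second example: abnormal (1) if 1s outnumber 0s among the detectors."""
    abnormal = sum(1 for f in flags if f == 1)
    normal = len(flags) - abnormal
    return 1 if abnormal > normal else 0
```

For the flags `[0, 1, 0]`, the first rule reports an abnormal sound while the majority rule does not, which shows the trade-off: the first rule is more sensitive, and the second is more tolerant of a single false alarm.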

The abnormal sound detecting means 60 uses the method of the first or second example described above to detect whether each of the two systems of audio signals contains an abnormal sound, and outputs a detection result indicating whether the audio signal of each system contains an abnormal sound to the switching control means 70.

The switching control means 70 controls switching between the two systems of audio signals, the production system and the backup system, based on the detection results input from the abnormal sound detecting means 60. For example, when the production-system audio signal contains an abnormal sound and the backup-system audio signal does not, the switching control means 70 outputs to the control panel 2 a command to switch from the production-system audio signal to the backup-system audio signal.
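The switching condition given as an example can be written as a single predicate. The name `switch_command` is hypothetical, and the source only describes the case shown here; how other combinations (for example, both systems abnormal) are handled is not stated, so this sketch simply issues no command for them.

```python
def switch_command(production_abnormal, backup_abnormal):
    """Return True when a switch to the backup system should be commanded.

    Arguments are the per-system results of the abnormal sound detection:
    0 for no abnormal sound, 1 for abnormal sound.
    """
    return production_abnormal == 1 and backup_abnormal == 0
```

Requiring the backup system to be clean avoids switching from one faulty signal to another.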

[Operation of the automatic audio monitor]
The operation of the automatic audio monitor 1 is described with reference to FIG. 5. In FIG. 5, the learning model is assumed to have already been generated.
As shown in FIG. 5, in step S1, the acoustic feature amount calculating means 10 calculates the acoustic features of the audio signal (mel frequency spectrum, mel frequency cepstrum coefficients, chromagram, root mean square, and spectral centroid).

In step S2, the audio signal level detecting means 20 detects whether or not the audio signal level is appropriate, based on the root mean square of the audio signal level calculated in step S1.
In step S3, the single-frequency signal detecting means 30 detects whether or not the audio signal includes a single-frequency signal, based on the spectral centroid calculated in step S1.

In step S4, the predicted value calculating means 40 calculates the predicted value by inputting the mel frequency spectrum, mel frequency cepstrum coefficients, and chromagram calculated in step S1 into the learning model.
In step S5, the audio distortion detecting means 50 detects whether or not the audio signal includes audio distortion, based on the predicted value calculated in step S4.

ステップＳ６において、異常音検知手段６０は、ステップＳ２、ステップＳ３及びステップＳ５の検知結果に基づいて、音声信号に異常音が含まれるか否かを検知する。
ここで、音声信号に異常音が含まれる場合（ステップＳ６でＹｅｓ）、自動音声モニタ１は、ステップＳ７の処理に進む。
一方、音声信号に異常音が含まれない場合（ステップＳ６でＮｏ）、自動音声モニタ１は、ステップＳ８の処理に進む。 In step S6, the abnormal sound detecting means 60 detects whether or not the audio signal contains an abnormal sound based on the detection results of steps S2, S3, and S5.
Here, when the audio signal includes an abnormal sound (Yes in step S6), the automatic audio monitor 1 proceeds to the process of step S7.
On the other hand, when the audio signal does not include an abnormal sound (No in step S6), the automatic audio monitor 1 proceeds to the process of step S8.

ステップＳ７において、切替制御手段７０は、ステップＳ６の検知結果に基づいて、本番系及び予備系の２系統の音声信号の切り替え制御を行う。 In step S7, the switching control means 70 performs switching control of the audio signals of the production system and the backup system based on the detection result of step S6.

ステップＳ８において、自動音声モニタ１は、処理を終了するか否かを判定する。例えば、音声信号が終了した場合、自動音声モニタ１は、処理を終了すると判定する。
ここで、処理を終了しない場合（ステップＳ８でＮｏ）、自動音声モニタ１は、ステップＳ１の処理に戻る。 In step S8, the automatic audio monitor 1 determines whether or not to end the process. For example, when the audio signal ends, the automatic audio monitor 1 determines that the process is to be ended.
Here, when the process is not to be ended (No in step S8), the automatic audio monitor 1 returns to the process of step S1.
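
As a non-limiting illustration, the flow of steps S5 to S8 can be sketched in Python as follows. The sketch assumes that the results of steps S2 to S4 are already available for each system at each processing interval. The names Detection, select_system, and monitor, the system labels "main" and "backup", the threshold of 0.5 applied to the predicted value, and the policy of keeping the current system when both systems are abnormal are all hypothetical and are not specified in the embodiment.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    level_ok: bool            # result of step S2
    single_frequency: bool    # result of step S3
    predicted_value: float    # output of the learning model in step S4


def has_distortion(predicted_value, threshold=0.5):
    """Step S5 (sketch): the threshold value is an assumption."""
    return predicted_value >= threshold


def is_abnormal(detection, threshold=0.5):
    """Step S6: abnormal if the level is inappropriate, a single frequency
    signal is present, or audio distortion is present."""
    return (not detection.level_ok
            or detection.single_frequency
            or has_distortion(detection.predicted_value, threshold))


def select_system(active, detections):
    """Step S7 (sketch): switch only if the active system is abnormal
    and the other system is normal (assumed policy)."""
    if not is_abnormal(detections[active]):
        return active
    other = "backup" if active == "main" else "main"
    return other if not is_abnormal(detections[other]) else active


def monitor(stream, active="main"):
    """Repeat the per-interval processing until the stream ends (step S8),
    yielding the system selected for each interval."""
    for detections in stream:
        active = select_system(active, detections)
        yield active
```

In this sketch, the end of the input stream plays the role of the end determination in step S8.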

［作用・効果］
自動音声モニタ１は、ラジオ放送の際、音声信号に含まれる様々な異常音を検知し、異常音が含まれない系統の音声信号に切り替えることができる。すなわち、自動音声モニタ１は、２系統の音声信号のそれぞれに対し、レベルの検知、単一周波数信号の検知、音声歪みの検知を行い、正常な系統の音声信号に切り替えることができる。 [Action / Effect]
The automatic audio monitor 1 can detect various abnormal sounds included in the audio signal during radio broadcasting and switch to the audio signal of a system that does not include an abnormal sound. That is, the automatic audio monitor 1 performs level detection, single frequency signal detection, and audio distortion detection on each of the two systems of audio signals, and can switch to the audio signal of the normal system.

以上、本発明の実施形態を詳述してきたが、本発明は前記した実施形態に限られるものではなく、本発明の要旨を逸脱しない範囲の設計変更等も含まれる。
前記した実施形態では、音声信号が２系統であることとして説明したが、これに限定されない。例えば、自動音声モニタは、１系統の音声信号に含まれる異常音を検知してもよい。 Although the embodiments of the present invention have been described in detail above, the present invention is not limited to the above-described embodiments, and includes design changes and the like within a range that does not deviate from the gist of the present invention.
In the above-described embodiment, the audio signals have been described as comprising two systems, but the present invention is not limited to this. For example, the automatic audio monitor may detect an abnormal sound included in a single system of audio signal.

前記した実施形態では、ラジオ放送の音声信号であることとして説明したが、これに限定されない。例えば、自動音声モニタは、テレビ放送やストリーミング配信の音声信号に含まれる異常音も検知できる。 In the above-described embodiment, it has been described as being an audio signal of radio broadcasting, but the present invention is not limited to this. For example, an automatic audio monitor can also detect abnormal sounds included in audio signals of television broadcasting and streaming distribution.

前記した実施形態では、自動音声モニタは、音声信号の切り替え制御を行うこととして説明したがこれに限定されない。例えば、自動音声モニタは、音声信号に異常音が含まれることを検知した場合、任意の手法で警報を出力してもよい。 In the above-described embodiment, the automatic audio monitor has been described as performing switching control of the audio signals, but the present invention is not limited to this. For example, when the automatic audio monitor detects that the audio signal includes an abnormal sound, it may output an alarm by any method.

前記した実施形態では、音響特徴量算出手段が、音声信号の音響特徴量として、メル周波数スペクトル、メル周波数ケプストラム係数、クロマグラム、二乗平均平方根及びスペクトル重心を算出することとして説明したが、これに限定されない。 In the above-described embodiment, the acoustic feature amount calculating means has been described as calculating the mel frequency spectrum, the mel frequency cepstrum coefficients, the chromagram, the root mean square, and the spectral centroid as the acoustic feature amounts of the audio signal, but the present invention is not limited to this.

前記した実施形態では、予測値算出手段が、機械学習として、ＤａｔａＲｏｂｏｔなどの機械学習プラットフォームを用いることとして説明したが、これに限定されない。 In the above-described embodiment, the predicted value calculation means has been described as using a machine learning platform such as DataRobot as machine learning, but the present invention is not limited thereto.

前記した各実施形態では、自動音声モニタを独立したハードウェアとして説明したが、本発明は、これに限定されない。例えば、本発明は、コンピュータが備えるＣＰＵ、メモリ、ハードディスク等のハードウェア資源を、前記した自動音声モニタとして動作させるプログラムで実現することもできる。これらのプログラムは、通信回線を介して配布してもよく、ＣＤ−ＲＯＭやフラッシュメモリ等の記録媒体に書き込んで配布してもよい。 In each of the above embodiments, the automatic audio monitor has been described as independent hardware, but the present invention is not limited to this. For example, the present invention can also be realized by a program that causes hardware resources of a computer, such as a CPU, a memory, and a hard disk, to operate as the automatic audio monitor described above. Such a program may be distributed via a communication line, or may be written to a recording medium such as a CD-ROM or a flash memory and distributed.

以下、実施例として、図１の予測値算出手段４０及び音声歪み検知手段５０の評価結果について説明する。
予測値算出手段４０の学習モデルに検証データを入力し、その検知結果を評価した。この検証データには、学習に使用していない評価用テストデータを使用した。また、検証データには、約１分４８秒の音声データから抽出した、正常音データ数５４２０個、異常音(歪み音)データ数３７８０個、計９２９０個を用いた。そして、学習モデルから出力される予測値と設定値との比較を行った。 Hereinafter, as an example, the evaluation results of the predicted value calculating means 40 and the audio distortion detecting means 50 of FIG. 1 will be described.
Verification data was input to the learning model of the predicted value calculating means 40, and the detection results were evaluated. Evaluation test data that had not been used for training was used as the verification data. The verification data consisted of 5420 normal sound data items and 3780 abnormal sound (distorted sound) data items, 9290 items in total, extracted from about 1 minute and 48 seconds of audio data. The predicted values output from the learning model were then compared with the set values.

図６には評価結果を示した。図６の横軸は、評価用テストデータを主観評価したときの設定値を示す。また、図６の縦軸は、音声歪み検知手段５０が算出した予測値（予測結果）を示す。この設定値に対する予測値をプロットし、それぞれのポイントにおけるデータ密度を算出した。そして、データ密度の高いポイントを濃い色、低いポイントを薄い色で示した。 FIG. 6 shows the evaluation results. The horizontal axis of FIG. 6 shows the set value assigned when the evaluation test data was subjectively evaluated. The vertical axis of FIG. 6 shows the predicted value (prediction result) calculated by the audio distortion detecting means 50. The predicted values were plotted against the set values, and the data density at each point was calculated. Points with high data density are shown in dark colors, and points with low data density are shown in light colors.

図６に示すように、設定値“０”に対して予測値が約０．０ポイント、設定値“１”に対して予測値が約１．０ポイントにデータが集中しており、学習モデルの精度が高いことを確認できた。さらに、音声歪み検知手段５０の誤検知が入力データ９２９０個の中でわずか１個であり、音声歪み検知手段５０の検知精度が高いことも確認できた。 As shown in FIG. 6, the data are concentrated at predicted values of about 0.0 for the set value "0" and about 1.0 for the set value "1", which confirms that the accuracy of the learning model is high. Furthermore, the audio distortion detecting means 50 produced only one false detection among the 9290 input data items, which confirms that its detection accuracy is also high.
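
As a non-limiting illustration, the reported result can be expressed as a detection accuracy with the following Python sketch. The function names and the threshold of 0.5 used to convert a predicted value into a detection result are assumptions of this example. The evaluation does not state whether the single false detection was a normal sound detected as distorted or a distorted sound detected as normal, so the sketch counts both kinds together.

```python
def count_false_detections(set_values, predicted_values, threshold=0.5):
    """Count items whose thresholded predicted value differs from the
    set value (0 = normal, 1 = distorted). The threshold is an assumption."""
    return sum(
        1
        for set_value, predicted in zip(set_values, predicted_values)
        if (predicted >= threshold) != bool(set_value)
    )


def detection_accuracy(total, false_detections):
    """Fraction of input data items that were detected correctly."""
    return (total - false_detections) / total


# Figures reported in the evaluation: 1 false detection among 9290 items.
accuracy = detection_accuracy(9290, 1)
print(f"{accuracy:.4%}")
```

With the reported figures, the false detection rate is 1/9290, which corresponds to a detection accuracy of roughly 99.99 percent.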

１自動音声モニタ（異常音検知装置）
１０音響特徴量算出手段
２０音声信号レベル検知手段
３０単一周波数信号検知手段
４０予測値算出手段
５０音声歪み検知手段
６０異常音検知手段
７０切替制御手段 1 Automatic audio monitor (abnormal sound detection device)
10 Acoustic feature amount calculating means
20 Audio signal level detecting means
30 Single frequency signal detecting means
40 Predicted value calculating means
50 Audio distortion detecting means
60 Abnormal sound detecting means
70 Switching control means

Claims

An abnormal sound detection device that detects an abnormal sound included in an audio signal, the device comprising:
an acoustic feature amount calculating means for calculating, from the audio signal, an acoustic feature amount related to frequency perception characteristics and an acoustic feature amount related to scale information;
an audio signal level detecting means for detecting whether or not the level of the audio signal is appropriate;
a single frequency signal detecting means for detecting whether or not the audio signal includes a single frequency signal;
a predicted value calculating means for calculating a predicted value indicating the probability that the audio signal includes audio distortion by using a learning model trained in advance on the acoustic feature amount related to the frequency perception characteristics and the acoustic feature amount related to the scale information;
an audio distortion detecting means for detecting whether or not the audio signal includes the audio distortion based on the predicted value calculated by the predicted value calculating means; and
an abnormal sound detecting means for detecting whether or not the audio signal includes the abnormal sound based on the detection results of the audio signal level detecting means, the single frequency signal detecting means, and the audio distortion detecting means.

The abnormal sound detection device according to claim 1, wherein
the acoustic feature amount calculating means
calculates a mel frequency spectrum and mel frequency cepstrum coefficients from the audio signal as the acoustic feature amount related to the frequency perception characteristics,
calculates a chromagram from the audio signal as the acoustic feature amount related to the scale information, and
further calculates the root mean square of the level of the audio signal and the spectral centroid of the audio signal,
the audio signal level detecting means
detects whether or not the root mean square of the level of the audio signal is within a preset appropriate level range,
the single frequency signal detecting means
detects that the audio signal includes the single frequency signal when the spectral centroid of the audio signal exceeds a preset first threshold value and the variance of the spectral centroid of the audio signal is less than a preset second threshold value, and
the predicted value calculating means
uses the learning model trained in advance on the mel frequency spectrum, the mel frequency cepstrum coefficients, and the chromagram.

The abnormal sound detection device according to claim 1 or 2, wherein the abnormal sound detecting means detects that the audio signal includes the abnormal sound when the level of the audio signal is inappropriate, when the audio signal includes the single frequency signal, or when the audio signal includes the audio distortion.

The abnormal sound detection device according to any one of claims 1 to 3, wherein
the acoustic feature amount calculating means receives two systems of the audio signals and calculates the acoustic feature amount related to the frequency perception characteristics and the acoustic feature amount related to the scale information from the input audio signal of each system,
the audio signal level detecting means detects whether or not the level of the audio signal of each system is appropriate,
the single frequency signal detecting means detects whether or not the audio signal of each system includes the single frequency signal,
the predicted value calculating means calculates the predicted value from the audio signal of each system by using the learning model,
the audio distortion detecting means detects whether or not the audio signal of each system includes the audio distortion,
the abnormal sound detecting means detects whether or not the audio signal of each system includes the abnormal sound, and
the device further comprises a switching control means for performing switching control of the two systems of the audio signals based on the detection result of the abnormal sound detecting means.

A program for causing a computer to function as the abnormal sound detection device according to any one of claims 1 to 4.