JP4787979B2 - Noise detection apparatus and noise detection method

Info

Publication number: JP4787979B2
Application number: JP2006336336A
Authority: JP
Inventors: 哲也滝口; 康雄有木; 信之三宅; 健太郎古賀
Original assignee: Denso Ten Ltd; Kobe University NUC
Current assignee: Denso Ten Ltd; Kobe University NUC
Priority date: 2006-12-13
Filing date: 2006-12-13
Publication date: 2011-10-05
Anticipated expiration: 2026-12-13
Also published as: JP2008145988A

Abstract

PROBLEM TO BE SOLVED: To identify the type (sound source) of a noise.

SOLUTION: For each predetermined sound source, a final discriminator is held that discriminates between two values indicating whether or not data of noise-superimposed speech, in which noise is superimposed on a speech section, is noise generated by that sound source. Input data of noise-superimposed speech is discriminated using each of the held final discriminators, and the final discriminator with the highest score for one of the two values is determined from the results. The sound source of the noise in the data is thereby detected as the predetermined sound source that the determined final discriminator indicates. Further, a plurality of data including data of noise-superimposed speech is held as learning data, and boosting is used to derive the final discriminator for each predetermined sound source from the held learning data.

COPYRIGHT: (C)2008, JPO&INPIT

Description

The present invention relates to a noise detection device and a noise detection method.

Conventionally, when speech recognition technology is used, noise superimposed on an utterance often causes misrecognition. With this in mind, many studies on noise removal, spectral subtraction among them, have been carried out. Concretely, noise removal proceeds by first estimating the noise and then subtracting the estimated noise from the noise-superimposed speech (speech on which noise is superimposed). The noise is often estimated either from the noise-only section (non-speech section) immediately before the utterance, or by probabilistically tracking information obtained from noise-only sections. For example, a noise estimation method based on minimum statistics is used (V. Stahl, A. Fischer, and R. Bippus, “Quantile based noise estimation for spectral subtraction and Wiener filtering”, Proc. ICASSP 2000, pp. 1875-1878, May 2000).
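The estimate-then-subtract procedure described above can be sketched as follows. This is only an illustrative sketch, not the method of any cited reference: it assumes each frame is already available as a list of magnitude-spectrum values, and the function names `estimate_noise` and `spectral_subtract` are hypothetical.

```python
def estimate_noise(noise_frames):
    """Average the magnitude spectra of noise-only (non-speech) frames."""
    n_bins = len(noise_frames[0])
    return [sum(f[k] for f in noise_frames) / len(noise_frames)
            for k in range(n_bins)]


def spectral_subtract(frame, noise, floor=0.0):
    """Subtract the noise estimate from one frame, clamping at `floor`."""
    return [max(s - n, floor) for s, n in zip(frame, noise)]


# Two noise-only frames observed just before the utterance (made-up values).
noise_only = [[1.0, 2.0, 1.0], [3.0, 2.0, 1.0]]
noise = estimate_noise(noise_only)

# One frame of noise-superimposed speech (made-up values).
cleaned = spectral_subtract([5.0, 1.0, 4.0], noise)
```

Because the estimate comes only from frames before the utterance, a noise that starts in the middle of the utterance is not reflected in `noise`, which is the limitation the following paragraphs discuss.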

Noise estimation as a preliminary step to noise removal is considered very effective for stationary noise and for noise that changes slowly over time, and can be expected to yield effective noise removal (suppression). However, when speech recognition technology is used in a real environment such as a home, a good deal of the noise arises suddenly during an utterance (sudden noise), a telephone call sound being one example. FIG. 18 shows a waveform in which a telephone call sound is superimposed on speech. When noise occurs suddenly during an utterance in this way, the speech recognition rate drops even if the noise lasts only a short time.

An effective method for such sudden noise is therefore needed, but it is usually difficult to estimate sudden noise with the methods described above. HMM composition could also be used (Kazuhiro Miki, Takanobu Nishiura, Satoshi Nakamura, Kiyohiro Shikano, “Examination of environmental sound discrimination using HMM”, IEICE Speech Society, SP99-106, pp. 79-84 (1999-12); Masaki Ida, Satoshi Nakamura, “Model-adapted HMM using noise DB for speech recognition under noise by multipath model with SN ratio”, IEICE Technical Report, Vol. 101, No. 522, pp. 51-56, 2001-12). However, unless the kind of noise that will be superimposed on the speech is specified in advance, the number of combinations grows and speech recognition takes longer as a result, so HMM composition cannot be called an appropriate method.

For these reasons, when the aim is to deal with sudden noise, it is preferable to detect the noise rather than to estimate it as a preliminary step to noise removal. Possible detection methods include examining the power of the speech signal and using AdaBoost. Detection by examining power works to some extent when the SNR (signal-to-noise ratio) is extremely poor, as in the waveform of FIG. 18. However, when three kinds of noise (“spray sound”, “paper tearing sound”, “telephone call sound”) with an SNR of 5 dB are superimposed on the speech section, as in the waveform of FIG. 19 (the “spray sound” and the “telephone call sound” lie entirely within the speech section), detecting them is nearly impossible.

AdaBoost, on the other hand, is a powerful method for binary discrimination problems and is one of the methods known as boosting. Boosting generates a final discriminator by a weighted majority vote of a plurality of weak discriminators with low discrimination performance, and outputs the result of discrimination by that final discriminator. Because AdaBoost is both accurate and fast, it is often used to detect objects such as faces in image information (Paul Viola and Michael Jones: “Rapid Object Detection using a Boosted Cascade of Simple Features”, IEEE CVPR, vol. 1, pp. 511-518, 2001). Non-Patent Document 1 and Non-Patent Document 2 disclose methods of detecting speech sections that do not contain noise using AdaBoost.
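The weighted majority vote mentioned above can be illustrated with a short sketch. The weak discriminators, their weights, and the scalar feature below are made-up assumptions for illustration; the function name `final_discriminator` is hypothetical.

```python
def final_discriminator(x, weak_discriminators, alphas):
    """Weighted majority vote: each weak discriminator returns +1 or -1."""
    score = sum(a * h(x) for h, a in zip(weak_discriminators, alphas))
    label = 1 if score >= 0 else -1
    return label, score


# Three weak discriminators, each a simple threshold on a scalar feature.
weak = [
    lambda x: 1 if x > 0.2 else -1,
    lambda x: 1 if x > 0.5 else -1,
    lambda x: 1 if x > 0.8 else -1,
]
alphas = [0.4, 0.9, 0.3]  # vote weights (assumed, not learned here)

label, score = final_discriminator(0.6, weak, alphas)
```

The sign of the weighted sum gives the binary decision, while its value serves as a score that can be compared across discriminators.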

Non-Patent Document 1: Kwon, O., Lee, T.: “Optimizing speech/non-speech classifier design using adaboost”, Proc. IEEE ICASSP 2003, pp. I-436-I-439, Apr. 2003.
Non-Patent Document 2: Hiroyoshi Matsuda, Tetsuya Takiguchi, Yasuo Ariki: “Speech detection by Real Adaboost”, Acoustical Society of Japan 2006 Autumn Meeting, 2-P-12, pp. 117-118, 2006-09.

The conventional techniques described above, however, cannot identify the type of noise (sound source). In the method of detecting noise with AdaBoost, the discriminator distinguishes only between two values, for example “noise” and “not noise”, and therefore cannot identify the type of noise (sound source).

The present invention has been made to solve the above problem of the prior art, and its object is to provide a noise detection device and a noise detection method capable of identifying the type of noise (sound source).

To solve the above problem and achieve the object, the invention according to claim 1 comprises: final discriminator holding means for holding, for each predetermined sound source, a final discriminator that discriminates between two values indicating whether or not data of noise-superimposed speech, in which noise is superimposed on a speech section, is noise from that predetermined sound source; and detection means for discriminating input data of the noise-superimposed speech using each of the final discriminators for the predetermined sound sources held by the final discriminator holding means, determining the final discriminator with the highest score for discrimination into one of the two values, and thereby detecting that the sound source of the noise present in the data is the predetermined sound source indicated by the determined final discriminator.

The invention according to claim 2, in the above invention, further comprises: learning data holding means for holding, as learning data, a plurality of data including data of noise-superimposed speech; and final discriminator deriving means for deriving the final discriminator for each predetermined sound source from the learning data held by the learning data holding means, using boosting, which derives a fully trained final discriminator by training, on the learning data, a discriminator that discriminates between two values indicating whether or not data is noise from a predetermined sound source.

In the invention according to claim 3, in the above invention, the final discriminator deriving means derives the final discriminator using AdaBoost as the boosting.

In the invention according to claim 4, in the above invention, the detection means discriminates the data of the noise-superimposed speech frame by frame, and further detects that the noise section of the data is the section delimited by the frame.

The invention according to claim 5, in the above invention, further comprises smoothing means for smoothing a frame when, among consecutive frames of the input data, the discrimination result that the final discriminator determined by the detection means gives for that frame differs from the results for the other frames.

The invention according to claim 6 includes: a final discriminator holding step of holding, for each predetermined sound source, a final discriminator that discriminates between two values indicating whether or not data of noise-superimposed speech, in which noise is superimposed on a speech section, is noise from that predetermined sound source; and a detection step of discriminating input data of the noise-superimposed speech using each of the final discriminators for the predetermined sound sources held in the final discriminator holding step, determining the final discriminator with the highest score for discrimination into one of the two values, and thereby detecting that the sound source of the noise present in the data is the predetermined sound source indicated by the determined final discriminator.

According to the invention of claim 1 or 6, a final discriminator that discriminates between two values indicating whether or not data of noise-superimposed speech, in which noise is superimposed on a speech section, is noise from a predetermined sound source is held for each predetermined sound source; input data of noise-superimposed speech is discriminated using each of the held final discriminators; and the final discriminator with the highest score for discrimination into one of the two values is determined, whereby the sound source of the noise present in the data is detected as the predetermined sound source indicated by the determined final discriminator. The type of noise (sound source) can therefore be identified.

According to the invention of claim 2, a plurality of data including data of noise-superimposed speech is held as learning data, and the final discriminator for each predetermined sound source is derived from the held learning data using boosting, which derives a fully trained final discriminator by training, on the learning data, a discriminator that discriminates between two values indicating whether or not data is noise from a predetermined sound source. The type of noise (sound source) can therefore be identified appropriately.

According to the invention of claim 3, the noise detection device derives the final discriminator using AdaBoost as the boosting, so the type of noise (sound source) can be identified appropriately.

According to the invention of claim 4, the noise detection device discriminates the data of the noise-superimposed speech frame by frame and further detects that the noise section of the data is the section delimited by the frame. In addition to the above effects, the noise section can therefore also be detected.

According to the invention of claim 5, when consecutive frames of the input data include a frame whose discrimination result from the final discriminator determined by the detection means differs from the results for the other frames, the noise detection device smooths that frame, so the type of noise (sound source) can be identified accurately.
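The text at this point does not specify how the smoothing is carried out. As one possible illustration only, the sketch below applies a sliding majority vote over per-frame labels; the window length of 3 and the function name `smooth_labels` are assumptions, not part of the disclosure.

```python
from collections import Counter


def smooth_labels(labels, window=3):
    """Replace each frame label by the strict-majority label of its window."""
    half = window // 2
    smoothed = []
    for i in range(len(labels)):
        lo, hi = max(0, i - half), min(len(labels), i + half + 1)
        top, count = Counter(labels[lo:hi]).most_common(1)[0]
        # Keep the original label unless another label has a strict majority.
        smoothed.append(top if count > (hi - lo) / 2 else labels[i])
    return smoothed


# A single "spray" frame isolated among "phone" frames is treated as an outlier.
frame_labels = ["phone", "phone", "spray", "phone", "phone"]
smoothed = smooth_labels(frame_labels)
```

With these inputs the isolated frame is relabelled to match its neighbours, which is the kind of correction the smoothing is meant to provide.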

Embodiments of the noise detection device and the noise detection method according to the present invention are described in detail below with reference to the accompanying drawings. The description covers, in order, the main terms used in the embodiments, the outline and features of the noise detection device according to the first embodiment, its configuration and processing procedure, and the effects of the first embodiment, followed by other embodiments.

[Explanation of terms]
First, the main terms used in the following embodiments are explained. “Noise” in the following embodiments means “sound” that differs from the “speech” to be recognized when speech recognition technology is used, and that is normally considered to interfere with recognition of that “speech”. In the following, the “speech” subject to recognition is divided broadly into two kinds of section: the “speech section”, in which “speech” to be recognized exists, and the “non-speech section”, in which it does not. “Noise-superimposed speech” is defined to cover both the case where “noise” is superimposed on a “speech section” (that is, “speech” in which the “speech” to be recognized and “noise” overlap) and the case where only “noise” exists in a “non-speech section”.

When the “speech” subject to recognition contains “noise”, the speech recognition rate drops, so speech recognition should be performed after the “noise” has been removed (suppressed). As a preliminary step to removing (suppressing) the “noise”, the “noise” must be detected. Moreover, it is desirable that the “noise” be detected with its type (sound source) identified.

To explain “sound source” concretely, “noise” is considered to have various kinds of “sound source”, such as a “spray sound” (for example, a hissing sound), a “paper tearing sound” (for example, a ripping sound), and a “telephone call sound” (for example, a trilling ring). Because differences between these “sound sources” appear as differences in waveform, as shown in FIG. 19, identifying the “sound source” when detecting “noise” also helps when removing (suppressing) the “noise”. In other words, once the “sound source” of the “noise” has been identified (once it is known what kind of “noise” has been mixed into the “speech”), the “noise” can be removed (suppressed) using “noise” data stored in advance for each “sound source”. How the “noise detection device” according to the present invention identifies the “sound source” of the “noise” is therefore the important point.

[Outline and Features of the Noise Detection Device According to the First Embodiment]
Next, the outline and features of the noise detection device according to the first embodiment are described with reference to FIG. 1. FIG. 1 is a diagram for explaining the outline and features of the noise detection device according to the first embodiment.

As described above, the noise detection device according to the first embodiment detects “noise” in the “speech” subject to recognition, and its main feature is that it identifies the type of noise (sound source).

To outline this main feature: the noise detection device according to the first embodiment holds data including noise-superimposed speech as learning data in a learning data holding unit.

For example, as shown in (1) of FIG. 1, the noise detection device according to the first embodiment holds, in the learning data holding unit, one group composed of a plurality of data as the learning data, and labels the data in this group by sound source, such as “speech only”, “spray sound”, “paper tearing sound”, and “telephone call sound”.

The learning data is the data from which the noise detection device according to the first embodiment derives the final discriminators (the discriminators used for noise detection). The device therefore holds it in advance, for example because a user of the device has entered it beforehand.

With this configuration, the noise detection device according to the first embodiment derives a final discriminator for each sound source using boosting. Here, boosting is an algorithm that derives a fully trained final discriminator by training, on the learning data, a discriminator that discriminates between two values indicating whether or not data is from a predetermined sound source. In the first embodiment, AdaBoost is used as the boosting to derive the final discriminator for each sound source. The AdaBoost algorithm is described in detail together with the configuration of the noise detection device.
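Ahead of that detailed description, the general idea of deriving a final discriminator by boosting can be sketched with textbook discrete AdaBoost. This is not necessarily the exact procedure of the embodiment: the threshold-type weak discriminators on a single scalar feature, the number of rounds, and the names `train_adaboost` and `score` are all assumptions made for illustration.

```python
import math


def train_adaboost(xs, ys, rounds=5):
    """Discrete AdaBoost with threshold stumps; ys hold +1 or -1.

    Returns a list of (alpha, threshold, polarity) tuples.
    """
    n = len(xs)
    weights = [1.0 / n] * n
    model = []
    for _ in range(rounds):
        best = None
        # Pick the stump with the lowest weighted error.
        for t in sorted(set(xs)):
            for p in (1, -1):
                preds = [p if x >= t else -p for x in xs]
                err = sum(w for w, pr, y in zip(weights, preds, ys) if pr != y)
                if best is None or err < best[0]:
                    best = (err, t, p, preds)
        err, t, p, preds = best
        err = min(max(err, 1e-10), 1 - 1e-10)  # avoid division by zero
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, t, p))
        # Raise the weight of misclassified samples, then renormalise.
        weights = [w * math.exp(-alpha * y * pr)
                   for w, y, pr in zip(weights, ys, preds)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return model


def score(model, x):
    """Weighted vote of the learned stumps; the sign gives the binary decision."""
    return sum(a * (p if x >= t else -p) for a, t, p in model)


# Toy learning data: one scalar feature per sample, +1 = target sound source.
xs = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
ys = [-1, -1, -1, 1, 1, 1]
model = train_adaboost(xs, ys)
```

In a real system each sample would be a feature vector computed from one frame of audio rather than a single number.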

For example, as shown in (2) of FIG. 1, the noise detection device uses AdaBoost to train a “speech-only discriminator” (a discriminator that decides whether or not data is “speech only”) on the learning data, and thereby derives and holds a “speech-only final discriminator”. In the same way, as shown in (2) of FIG. 1, the device uses AdaBoost to train a “spray sound discriminator” (a discriminator that decides whether or not data is a “spray sound”) and derives and holds a “spray sound final discriminator”; trains a “paper tearing sound discriminator” (a discriminator that decides whether or not data is a “paper tearing sound”) and derives and holds a “paper tearing sound final discriminator”; and trains a “telephone call sound discriminator” (a discriminator that decides whether or not data is a “telephone call sound”) and derives and holds a “telephone call sound final discriminator”.
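Training one binary discriminator per sound source from a single labelled group amounts to relabelling the same data once per source. A minimal sketch of that relabelling follows; the label strings and the function name `one_vs_rest_targets` are hypothetical.

```python
def one_vs_rest_targets(labels, sources):
    """For each sound source, map every sample label to +1 (that source) or -1."""
    return {s: [1 if lab == s else -1 for lab in labels] for s in sources}


sources = ["speech only", "spray", "paper", "phone"]
sample_labels = ["speech only", "phone", "spray", "phone", "paper"]
targets = one_vs_rest_targets(sample_labels, sources)
```

Each entry of `targets` can then be handed, together with the shared feature data, to a binary learner to obtain the final discriminator for that source.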

Next, the noise detection device according to the first embodiment discriminates the input noise-superimposed speech data, frame by frame, using the held final discriminator for each sound source. Here, a frame is a chunk of data delimited by a fixed length of time (for example, “20 ms”), and denotes a section of the data.
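Cutting data into fixed-length frames can be sketched as below. The 16 kHz sampling rate, the absence of overlap between frames, and the function name `split_into_frames` are assumptions for illustration; the text specifies only the example frame length of 20 ms.

```python
def split_into_frames(samples, sample_rate, frame_ms=20):
    """Cut a sample sequence into consecutive, non-overlapping frames."""
    frame_len = sample_rate * frame_ms // 1000
    last_start = len(samples) - frame_len
    return [samples[i:i + frame_len]
            for i in range(0, last_start + 1, frame_len)]


# 1000 dummy samples at an assumed 16 kHz: each 20 ms frame holds 320 samples.
samples = list(range(1000))
frames = split_into_frames(samples, sample_rate=16000)
```

Trailing samples that do not fill a whole frame are dropped in this sketch.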

For example, as shown in (3) of FIG. 1, the noise detection device discriminates the section “frame No. 100” of the input data using final discriminators such as the “speech-only final discriminator”, the “spray sound final discriminator”, the “paper tearing sound final discriminator”, and the “telephone call sound final discriminator”. Then, as shown in (3) of FIG. 1, “frame No. 100” is discriminated into one of the two values with a score of “-0.3” by the “speech-only final discriminator”. Similarly, it is discriminated into one of the two values with a score of “-0.1” by the “spray sound final discriminator”, with a score of “-0.2” by the “paper tearing sound final discriminator”, and with a score of “0.5” by the “telephone call sound final discriminator” (here, the value “telephone call sound”).

Next, the noise detection device according to the first embodiment determines the final discriminator with the highest score for discrimination into one of the two values. For example, as shown in (4) of FIG. 1, the device determines that the final discriminator that gave the highest score, “0.5”, is the “telephone call sound final discriminator”.

The noise detection device then detects that the noise section of the data is the section delimited by this frame, and that the sound source of the noise present in the data is the sound source indicated by the determined final discriminator. For example, as shown in (5) of FIG. 1, the device detects that the noise section of the data is the section delimited by “frame No. 100”, and that the sound source of the noise present in the data is the “telephone call sound”.
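The selection step for one frame can be sketched with the example scores given above for frame No. 100. The dictionary keys and the function name `detect_source` are hypothetical.

```python
def detect_source(scores):
    """Return the (sound source, score) pair with the highest score."""
    return max(scores.items(), key=lambda item: item[1])


# Scores of the four final discriminators for frame No. 100 (from the example).
frame_scores = {
    "speech only": -0.3,
    "spray sound": -0.1,
    "paper tearing sound": -0.2,
    "telephone call sound": 0.5,
}
source, best_score = detect_source(frame_scores)
```

Repeating this for every frame yields both the sound source and, through the frame index, the section in which the noise occurs.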

The noise detection device according to the first embodiment thus makes it possible to identify the type of noise (sound source).

In the first embodiment, the noise detection device holds the learning data in advance, derives and holds the final discriminators from the learning data, and then uses the held final discriminators to perform detection on input noise-superimposed speech data; however, the present invention is not limited to this. For example, the present invention applies equally to a case in which a separate device holds the learning data and derives the final discriminators from it, and the noise detection device according to the present invention holds the final discriminators derived by that separate device and uses them to perform detection on input noise-superimposed speech data.

In the first embodiment, the noise detection device discriminates the input noise-superimposed speech data frame by frame and also detects that the noise section of the data is the section delimited by the frame; however, the present invention is not limited to this, and applies equally when the noise detection device does not discriminate the input noise-superimposed speech data frame by frame.

In the first embodiment, the noise detection device holds one group and labels the data in that group by sound source, such as “speech only”, “spray sound”, “paper tearing sound”, and “telephone call sound”; however, the present invention is not limited to this. For example, the present invention applies equally when other sound sources are chosen, or when, instead of labelling one group by sound source, a plurality of overlapping groups is held (with a label assigned per sound source). That is, the sound sources may be chosen as appropriate for the environment in which the noise detection device according to the present invention is used, and the learning data may be held in any form.

[Configuration of the Noise Detection Device According to the First Embodiment]
Next, the noise detection device according to the first embodiment is described with reference to FIGS. 2 to 10. FIG. 2 is a block diagram showing the configuration of the noise detection device according to the first embodiment. FIG. 3 is a diagram for explaining the output unit, FIG. 4 the learning data holding unit, FIG. 5 the final discriminator holding unit, and FIG. 6 the detection result storage unit. FIGS. 7 and 8 are diagrams for explaining the final discriminator derivation process (AdaBoost), and FIGS. 9 and 10 are diagrams for explaining the section and sound source detection process (Multi-class AdaBoost).

As shown in FIG. 2, the noise detection device 10 according to the first embodiment includes an input unit 11, an output unit 12, an input/output control I/F unit 13, a storage unit 20, and a control unit 30.

The input unit 11 inputs data used in the various processes of the control unit 30, operating instructions for those processes, and the like, by means of a microphone, a keyboard, a mouse, or the like. Specifically, when the input unit 11 inputs learning data (a plurality of data including noise-superimposed speech data) through the microphone, the input data is held by the learning data holding unit 21 described later. When the input unit 11 inputs, through the microphone, data to be subjected to detection by the section sound source detection unit 32 described later, the input data is held by the input data temporary storage unit 23 described later.

なお、実施例１においては、学習用データや検出対象となるデータを、マイクとしての入力部１１が雑音検出装置１０に入力する手法について説明したが、本発明はこれに限られるものではなく、学習用データを電子データ化した音声ファイルや、検出対象となるデータを電子データ化した音声ファイルなどを、外部記憶装置や通信部としての入力部１１が雑音検出装置１０に入力する手法などにも、本発明を同様に適用することができる。 In the first embodiment, the input unit 11, acting as a microphone, inputs the learning data and the data to be detected into the noise detection apparatus 10. However, the present invention is not limited to this. It can be applied equally when the input unit 11, acting as an external storage device or a communication unit, inputs into the noise detection apparatus 10 audio files in which the learning data or the data to be detected has been converted into electronic data.

出力部１２は、制御部３０による各種処理の結果や、各種処理をするための操作指示などを、ディスプレイまたはプリンタなどに出力する。具体的には、出力部１２は、後述する検出結果記憶部２４に記憶された検出結果を、ディスプレイまたはプリンタなどに出力する。 The output unit 12 outputs the results of various processes by the control unit 30 and operation instructions for performing various processes to a display or a printer. Specifically, the output unit 12 outputs the detection result stored in the detection result storage unit 24 described later to a display or a printer.

例えば、出力部１２は、図３に示すような検出結果をディスプレイに出力する。図３について具体的に説明すると、図３の上半分は、検出対象となるデータであって、音声区間に雑音が重畳して存在する雑音重畳音声の波形を示す図である。また、図３の下半分は、上半分に示した雑音重畳音声から、本発明に係る雑音検出装置１０が、雑音の区間を検出し、かつ、雑音の種類（音源）を識別した結果を示したものである。すなわち、図３に例示されている雑音重畳音声の波形においては、「スプレー音」（ｓｐｒａｙ）を音源とする雑音と、「紙を破る音」（ｐａｐｅｒ）を音源とする雑音と、「電話のコール音」（ｐｈｏｎｅ）を音源とする雑音とが、音声に重畳して存在していたことがわかる。 For example, the output unit 12 outputs a detection result such as that shown in FIG. 3 to the display. The upper half of FIG. 3 shows the data to be detected, namely the waveform of noise-superimposed speech in which noise is superimposed on a speech section. The lower half of FIG. 3 shows the result obtained when the noise detection apparatus 10 according to the present invention detects the noise sections in the noise-superimposed speech of the upper half and identifies the type (sound source) of each noise. That is, the waveform illustrated in FIG. 3 shows that noise whose sound source is a “spray sound” (spray), noise whose sound source is a “paper breaking sound” (paper), and noise whose sound source is a “phone call sound” (phone) were superimposed on the speech.

なお、図３の例では、出力部１２が、波形等で検出結果を出力する手法について説明したが、本発明はこれに限られるものではなく、検出結果をテキスト（例えば、「フレームＮｏ．１００」で区切られた区間のデータに存在する雑音の音源は『電話のコール音』である、など）で出力したり、ディスプレイやプリンタに出力するのではなく、雑音を除去（抑圧）する他の装置に出力するなど、いずれでもよい。 In the example of FIG. 3, the output unit 12 outputs the detection result as a waveform or the like, but the present invention is not limited to this. The detection result may be output as text (for example, a statement that the sound source of the noise present in the data of the section delimited by “frame No. 100” is the “phone call sound”), or it may be output to another device that removes (suppresses) noise instead of to a display or printer.

入出力制御Ｉ／Ｆ部１３は、入力部１１と、出力部１２と、記憶部２０と、制御部３０との間におけるデータ転送を制御する。 The input / output control I / F unit 13 controls data transfer among the input unit 11, the output unit 12, the storage unit 20, and the control unit 30.

記憶部２０は、制御部３０における各種制御に用いられるデータを記憶し、特に本発明に密接に関連するものとしては、図２に示すように、学習用データ保持部２１と、最終識別器保持部２２と、入力データ一時記憶部２３と、検出結果記憶部２４とを備える。なお、学習用データ保持部２１は、特許請求の範囲に記載の「学習用データ保持手段」に対応し、最終識別器保持部２２は、特許請求の範囲に記載の「最終識別器保持手段」に対応する。 The storage unit 20 stores data used for the various controls performed by the control unit 30. As the components most closely related to the present invention, it includes, as shown in FIG. 2, a learning data holding unit 21, a final discriminator holding unit 22, an input data temporary storage unit 23, and a detection result storage unit 24. The learning data holding unit 21 corresponds to the “learning data holding means” recited in the claims, and the final discriminator holding unit 22 corresponds to the “final discriminator holding means” recited in the claims.

学習用データ保持部２１は、雑音重畳音声のデータを含む複数のデータを、学習用データとして保持する。具体的には、学習用データ保持部２１は、入力部１１によって入力された学習用データを保持し、保持した学習用データは、後述する最終識別器導出部３１による処理に利用される。 The learning data holding unit 21 holds a plurality of data including noise superimposed voice data as learning data. Specifically, the learning data holding unit 21 holds the learning data input by the input unit 11, and the held learning data is used for processing by the final discriminator derivation unit 31 described later.

ここで、学習用データ保持部２１が保持する学習用データは、最終識別器導出部３１が最終識別器（雑音の検出に利用する識別器）を導出するためのデータであるので、雑音検出装置１０の利用者によって予め入力されたりすることで、学習用データ保持部２１が予め保持しているものである。なお、実施例１においては、雑音検出装置１０が学習用データ保持部２１に学習用データを予め保持する事例について説明したが、本発明はこれに限られるものではなく、学習用データの保持や、学習用データから最終識別器を導出する処理については、雑音検出装置１０とは異なる別の装置が行う事例にも、本発明を同様に適用することができる。この場合には、雑音検出装置１０は、学習用データ保持部２１を備えなくてもよい。 Here, the learning data held by the learning data holding unit 21 is the data from which the final discriminator deriving unit 31 derives the final discriminators (the discriminators used for noise detection). It is therefore held in advance by the learning data holding unit 21, for example by being input beforehand by a user of the noise detection apparatus 10. The first embodiment describes the case where the noise detection apparatus 10 holds the learning data in the learning data holding unit 21 in advance, but the present invention is not limited to this. It can be applied equally when a device other than the noise detection apparatus 10 holds the learning data and derives the final discriminators from it. In that case, the noise detection apparatus 10 need not include the learning data holding unit 21.

学習用データについて例を挙げて説明すると、例えば、学習用データ保持部２１は、図４に示すような学習用データを保持する。すなわち、学習用データ保持部２１は、「音声のみ」の学習用データのグループ、「スプレー音」の学習用データのグループ、「紙を破る音」の学習用データのグループ、および、「電話のコール音」の学習用データのグループとして、複数の音声データ（図４においては波形で図示する）で構成された共通の一つのグループを保持する。この学習用データのグループのことを、ＡｄａＢｏｏｓｔの理論においては、「特徴ベクトル」と表現したりする。 To give an example, the learning data holding unit 21 holds learning data such as that shown in FIG. 4. That is, the learning data holding unit 21 holds a single common group made up of a plurality of audio data items (shown as waveforms in FIG. 4), and this one group serves as the learning data group for “voice only”, for “spray sound”, for “paper breaking sound”, and for “phone call sound”. In AdaBoost theory, this group of learning data is referred to as the “feature vectors”.

また、学習用データ保持部２１は、音源ごとに、「音声のみであるのか、それ以外であるのか」、「スプレー音であるのか、それ以外であるのか」、「紙を破る音であるのか、それ以外であるのか」、「電話のコール音であるのか、それ以外であるのか」などの情報を、音声データ各々について対応づけて保持している。これらの情報のことを、ＡｄａＢｏｏｓｔの理論においては、「ラベル」と表現したりする。「特徴ベクトル」や「ラベル」については、最終識別器導出部３１を説明する際に、詳述する。 In addition, for each sound source, the learning data holding unit 21 holds information such as “voice only, or something else”, “spray sound, or something else”, “paper breaking sound, or something else”, and “phone call sound, or something else” in association with each audio data item. In AdaBoost theory, this information is referred to as the “labels”. The “feature vectors” and “labels” are described in detail in the explanation of the final discriminator deriving unit 31.

なお、実施例１においては、学習用データ保持部２１が、共通の一つのグループを保持し、この一つのグループに対して、「音声のみ」、「スプレー音」、「紙を破る音」、「電話のコール音」などの音源ごとにラベルを付与して保持している事例について説明したが、本発明はこれに限られるものではない。例えば、音源としてその他の音源を選択する場合や、一つのグループに対して音源ごとのラベルを付与するのではなく、重複して複数のグループを保持する場合にも（ラベルは音源ごとに付与される）、本発明を同様に適用することができる。すなわち、音源は、本発明に係る雑音検出装置１０が利用される環境等に合わせて適宜選択されればよく、また、保持の形態はいずれでもよい。 In the first embodiment, the learning data holding unit 21 holds one common group, and for this one group, “voice only”, “spray sound”, “paper breaking sound”, Although the case where a label is assigned and held for each sound source such as “phone call sound” has been described, the present invention is not limited to this. For example, when another sound source is selected as the sound source, or when a plurality of groups are held in duplicate instead of giving a label for each sound source to one group (a label is assigned to each sound source). The present invention can be similarly applied. That is, the sound source may be appropriately selected according to the environment in which the noise detection apparatus 10 according to the present invention is used, and any form of holding may be used.

最終識別器保持部２２は、所定の音源による雑音重畳音声のデータを識別する最終識別器（所定の音源による雑音であるか否かの二値を識別する最終識別器）を、所定の音源ごとに保持する。具体的には、最終識別器保持部２２は、後述する最終識別器導出部３１によって導出された最終識別器を所定の音源ごとに保持し、保持した最終識別器は、後述する区間音源検出部３２による処理に利用される。 The final discriminator holding unit 22 holds, for each predetermined sound source, a final discriminator that identifies noise-superimposed speech data for that sound source (a final discriminator that makes a binary decision as to whether or not the noise is caused by the predetermined sound source). Specifically, the final discriminator holding unit 22 holds, for each predetermined sound source, the final discriminator derived by the final discriminator deriving unit 31 described later, and the held final discriminators are used in the processing performed by the section sound source detection unit 32 described later.

ここで、最終識別器保持部２２が保持する最終識別器は、区間音源検出部３２が雑音を検出するための識別器であるので、区間音源検出部３２による検出処理の前に、最終識別器導出部３１によって予め導出され、最終識別器保持部２２が予め保持しているものである。なお、実施例１においては、雑音検出装置１０が最終識別器導出部３１を備え、最終識別器導出部３１によって導出された最終識別器を最終識別器保持部２２が保持する事例について説明したが、本発明はこれに限られるものではなく、学習用データの保持や、学習用データから最終識別器を導出する処理については、雑音検出装置１０とは異なる別の装置が行う事例にも、本発明を同様に適用することができる。この場合には、雑音検出装置１０は、雑音検出装置１０の利用者によって予め入力されたりすることで、最終識別器を保持するなどする。 Here, since the final discriminator held by the final discriminator holding unit 22 is a discriminator for the section sound source detection unit 32 to detect noise, the final discriminator before the detection processing by the section sound source detection unit 32 is performed. Derived in advance by the deriving unit 31 and held in advance by the final discriminator holding unit 22. In the first embodiment, the noise detection apparatus 10 includes the final classifier deriving unit 31 and the final classifier derived by the final classifier deriving unit 31 is held by the final classifier holding unit 22. However, the present invention is not limited to this, and the processing for holding the learning data and deriving the final discriminator from the learning data is also performed in a case where another device different from the noise detection device 10 performs. The invention can be applied as well. In this case, the noise detection apparatus 10 holds the final discriminator by being input in advance by the user of the noise detection apparatus 10.

最終識別器について例を挙げて説明すると、例えば、最終識別器保持部２２は、図５に示すような最終識別器を保持する。すなわち、最終識別器保持部２２は、識別内容として「音声のみであるか否かを識別」する『「音声のみ」の最終識別器』、識別内容として「スプレー音であるか否かを識別」する『「スプレー音」の最終識別器』、識別内容として「紙を破る音であるか否かを識別」する『「紙を破る音」の最終識別器』、識別内容として「電話のコール音であるか否かを識別」する『「電話のコール音」の最終識別器』などを保持する。なお、実施例１において、最終識別器はＡｄａＢｏｏｓｔの理論を用いて導出されたものであるが、ＡｄａＢｏｏｓｔの理論については、最終識別器導出部３１を説明する際に、詳述する。また、図５に示す「識別内容」の項目などは、説明の便宜上付与したものであって、最終識別器保持部２２が必ず保持しなければならない項目ではない。 To give an example, the final discriminator holding unit 22 holds final discriminators such as those shown in FIG. 5. That is, it holds a “voice only” final discriminator whose identification content is “identify whether or not the data is voice only”, a “spray sound” final discriminator whose identification content is “identify whether or not the data is a spray sound”, a “paper breaking sound” final discriminator whose identification content is “identify whether or not the data is a paper breaking sound”, a “phone call sound” final discriminator whose identification content is “identify whether or not the data is a phone call sound”, and so on. In the first embodiment, the final discriminators are derived using AdaBoost theory, which is described in detail in the explanation of the final discriminator deriving unit 31. Items such as the “identification content” shown in FIG. 5 are included only for convenience of explanation; the final discriminator holding unit 22 is not required to hold them.

入力データ一時記憶部２３は、雑音検出装置１０の検出対象となるデータを一時的に記憶する。具体的には、入力データ一時記憶部２３は、入力部１１によって入力された検出対象となるデータを一時的に記憶し、一時的に記憶した検出対象のデータは、後述する区間音源検出部３２による処理に利用される。 The input data temporary storage unit 23 temporarily stores data to be detected by the noise detection device 10. Specifically, the input data temporary storage unit 23 temporarily stores the data to be detected input by the input unit 11, and the temporarily stored data to be detected is the section sound source detection unit 32 described later. It is used for processing by.

ここで、入力データ一時記憶部２３が保持するデータは、区間音源検出部３２が雑音を検出するためのデータであるので、雑音検出装置１０の利用者によって予め入力されたり、雑音検出にあたりその都度入力されたりすることで、入力データ一時記憶部２３が一時的に記憶するものである。また、入力データ一時記憶部２３が一時的に記憶した入力データは、区間音源検出部３２による処理が終了した直後に削除されてもよく、あるいは、必要に応じて所定の期間記憶し続けていてもよく、入力データ一時記憶部２３が入力データを記憶する期間は、運用に応じて適宜変更することができる。 Here, since the data held in the input data temporary storage unit 23 is data for the section sound source detection unit 32 to detect noise, it is input in advance by the user of the noise detection device 10 or each time noise detection is performed. The input data temporary storage unit 23 temporarily stores the input data. Further, the input data temporarily stored in the input data temporary storage unit 23 may be deleted immediately after the processing by the section sound source detection unit 32 is completed, or may be stored for a predetermined period as necessary. In addition, the period during which the input data temporary storage unit 23 stores the input data can be appropriately changed according to the operation.

検出結果記憶部２４は、検出対象となるデータの検出結果を記憶する。具体的には、検出結果記憶部２４は、後述する区間音源検出部３２や平滑化部３３によって検出（もしくは検出後平滑化）された検出結果を記憶し、記憶した検出結果は、出力部１２によって出力されるなどする。 The detection result storage unit 24 stores the detection results for the data to be detected. Specifically, the detection result storage unit 24 stores the results detected by the section sound source detection unit 32 (or detected and then smoothed by the smoothing unit 33), both described later, and the stored detection results are, for example, output by the output unit 12.

ここで、検出結果記憶部２４が記憶した検出結果は、出力部１２による出力処理が終了した直後に削除されてもよく、あるいは、必要に応じて所定の期間記憶し続けていてもよく、検出結果記憶部２４が検出結果を記憶する期間は、運用に応じて適宜変更することができる。 Here, the detection result stored in the detection result storage unit 24 may be deleted immediately after the output processing by the output unit 12 is completed, or may be stored for a predetermined period as necessary. The period during which the result storage unit 24 stores the detection result can be changed as appropriate according to the operation.

検出結果について例を挙げて説明すると、例えば、検出結果記憶部２４は、図６に示すような検出結果を記憶する。すなわち、検出結果記憶部２４は、データのフレームで区切られた区間ごとに、データに存在する雑音の音源に関する情報を対応づけて保持する。ここで、図６の上半分は、区間音源検出部３２によって検出処理を行ったデータの平滑化前の検出結果であり、図６の下半分は、区間音源検出部３２によって検出処理を行ったデータに対して、平滑化部３３によって平滑化処理を行った後の検出結果を示すものである。両者の違いについては、平滑化部３３を説明する際に、詳述する。 For example, the detection result storage unit 24 stores the detection result as illustrated in FIG. 6. That is, the detection result storage unit 24 holds information related to the noise source of noise present in the data for each section divided by the data frame. Here, the upper half of FIG. 6 is a detection result before smoothing of the data subjected to the detection process by the section sound source detection unit 32, and the lower half of FIG. The detection result after performing the smoothing process with respect to data by the smoothing part 33 is shown. The difference between the two will be described in detail when the smoothing unit 33 is described.

なお、実施例１においては、検出結果記憶部２４が図６に示すような検出結果を記憶する手法について説明したが、本発明はこれに限られるものではなく、その他の形態で検出結果を記憶する手法や、平滑化前の検出結果を記憶しない手法など、検出結果記憶部２４が記憶する検出結果については、運用に応じて適宜変更することができる。 In the first embodiment, the method in which the detection result storage unit 24 stores the detection results as shown in FIG. 6 has been described. However, the present invention is not limited to this, and the detection results are stored in other forms. The detection result stored in the detection result storage unit 24, such as a method for performing the detection and a method for not storing the detection result before smoothing, can be appropriately changed according to the operation.

制御部３０は、雑音検出装置１０における各種制御を行い、特に本発明に密接に関連するものとしては、図２に示すように、最終識別器導出部３１と、区間音源検出部３２と、平滑化部３３とを備える。なお、最終識別器導出部３１は、特許請求の範囲に記載の「最終識別器導出手段」に対応し、区間音源検出部３２は、特許請求の範囲に記載の「検出手段」に対応し、平滑化部３３は、特許請求の範囲に記載の「平滑化手段」に対応する。 The control unit 30 performs various controls in the noise detection apparatus 10, and particularly those closely related to the present invention include a final discriminator derivation unit 31, a section sound source detection unit 32, and a smoothing unit as shown in FIG. And a conversion unit 33. The final discriminator deriving unit 31 corresponds to the “final discriminator deriving unit” described in the claims, and the section sound source detection unit 32 corresponds to the “detecting unit” described in the claims. The smoothing unit 33 corresponds to “smoothing means” described in the claims.

最終識別器導出部３１は、識別器（データが所定の音源による雑音であるか否かの二値を識別する識別器、以下、弱識別器）を学習用データから学習させることで学習が終了した最終識別器を導出するブースティングを用いて、所定の音源ごとの最終識別器を導出する。具体的には、最終識別器導出部３１は、学習用データ保持部２１によって保持された学習用データから、ＡｄａＢｏｏｓｔの理論を用いて最終識別器を導出し、導出した最終識別器を、最終識別器保持部２２に保持させる。 The final discriminator deriving unit 31 derives a final discriminator for each predetermined sound source using boosting, in which discriminators (discriminators that make a binary decision as to whether or not data is noise caused by a predetermined sound source; hereinafter, weak classifiers) are trained on the learning data and a final discriminator is derived once training is complete. Specifically, the final discriminator deriving unit 31 derives the final discriminators from the learning data held by the learning data holding unit 21 using AdaBoost theory, and has the final discriminator holding unit 22 hold the derived final discriminators.

なお、実施例１においては、雑音検出装置１０が学習用データ保持部２１に学習用データを予め保持し、最終識別器導出部３１が予め保持された学習用データから最終識別器を導出する事例について説明したが、本発明はこれに限られるものではなく、学習用データの保持や、学習用データから最終識別器を導出する処理については、雑音検出装置１０とは異なる別の装置が行う事例にも、本発明を同様に適用することができる。この場合には、雑音検出装置１０は、最終識別器導出部３１を備えなくてもよい。 In the first embodiment, the noise detection apparatus 10 holds learning data in the learning data holding unit 21 in advance, and the final discriminator derivation unit 31 derives the final discriminator from the learning data held in advance. However, the present invention is not limited to this, and a case where a process different from the noise detection apparatus 10 performs the process of holding the learning data and deriving the final discriminator from the learning data is performed. In addition, the present invention can be similarly applied. In this case, the noise detection apparatus 10 may not include the final discriminator deriving unit 31.

以下、最終識別器導出部３１による最終識別器導出処理について、詳述する。最終識別器導出処理は、ブースティング（Ｂｏｏｓｔｉｎｇ）を用いて行われるが、実施例１においては、ブースティングの一つであるアダブースト（ＡｄａＢｏｏｓｔ）を用いて行われる例について説明する。 The final discriminator derivation process performed by the final discriminator deriving unit 31 is described in detail below. This process uses boosting; the first embodiment describes an example that uses AdaBoost, which is one boosting method.

ＡｄａＢｏｏｓｔの手順について、図７を用いて概要を説明すると、ＡｄａＢｏｏｓｔでは、まず、初期の重みで重み付けされた学習用データ（ステップＳ７０１〜ステップＳ７０２）から弱識別器を学習した後（ステップＳ７０３）、その弱識別器で誤識別を起こした学習用データの重みが大きくなるように、学習用データの重みを更新する（ステップＳ７０５）。次に、更新された新しい重みで重み付けされた学習用データから新しい弱識別器を学習し（ステップＳ７０３）、再び、その新しい弱識別器で誤識別を起こした学習用データの重みが大きくなるように、学習用データの重みを更新する（ステップＳ７０５）。こうして、弱識別器を自動で生成（学習）していき、最後に、複数生成された弱識別器の重み付き多数決で（ステップＳ７０４）、最終的な識別器（最終識別器）を生成する（ステップＳ７０６）。 The outline of the AdaBoost procedure will be described with reference to FIG. 7. In AdaBoost, first, a weak classifier is learned from learning data weighted with initial weights (steps S701 to S702) (step S703). The weight of the learning data is updated so that the weight of the learning data that has been erroneously identified by the weak classifier is increased (step S705). Next, a new weak classifier is learned from the learning data weighted with the updated new weight (step S703), and again the weight of the learning data that has caused erroneous identification with the new weak classifier is increased. Then, the weight of the learning data is updated (step S705). In this way, weak classifiers are automatically generated (learned), and finally, a final classifier (final classifier) is generated by weighted majority of a plurality of weak classifiers generated (step S704). Step S706).

上記したＡｄａＢｏｏｓｔの手順について、図８を用いてより詳細に説明すると、ＡｄａＢｏｏｓｔでは、まず、図８の（Ａ）に示すように、「特徴ベクトル」と「ラベル」とが対応づけられた学習用データを与える（ステップＳ７０１に相当）。ここで、「特徴ベクトル」および「ラベル」は、実施例１においては、例えば、学習用データ保持部２１によって保持される学習用データのグループ、および、「スプレー音であるのか、それ以外であるのか」の情報のことである。 The AdaBoost procedure described above will be described in more detail with reference to FIG. 8. In AdaBoost, first, as shown in FIG. 8A, a learning feature in which “feature vectors” and “labels” are associated with each other. Data is provided (corresponding to step S701). Here, the “feature vector” and “label” are, for example, a group of learning data held by the learning data holding unit 21 and “spray sound or otherwise” in the first embodiment. Information.

次に、ＡｄａＢｏｏｓｔでは、図８の（Ｂ）に示すように、学習用データの重みを初期化する（ステップＳ７０２に相当）。すなわち、例えば、ある学習用データの「ラベル」が「スプレー音」であれば、学習用データ全体の中で「ラベル」が「スプレー音」となるデータの数の２倍の値で割ったものを、この学習用データの初期の重みとする。同様に、例えば、ある学習用データの「ラベル」が「それ以外（スプレー音以外）」であれば、学習用データ全体の中で「ラベル」が「それ以外（スプレー音以外）」となるデータの数の２倍の値で割ったものを、この学習用データの初期の重みとする。なお、２倍の値で割るのは、正規化するためである。 Next, as shown in FIG. 8B, AdaBoost initializes the weights of the learning data (corresponding to step S702). For example, if the “label” of a learning data item is “spray sound”, its initial weight is 1 divided by twice the number of items in the whole learning data set whose “label” is “spray sound”. Similarly, if the “label” of a learning data item is “other (other than spray sound)”, its initial weight is 1 divided by twice the number of items in the whole learning data set whose “label” is “other (other than spray sound)”. The division by twice the count is for normalization, so that the weights of all learning data items sum to 1.
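The weight initialization of step S702 can be sketched as follows. This is only an illustration under assumptions: the function name `initial_weights` and the encoding of the label as +1 (target sound source) and -1 (other) are not from the patent, and the quantity being divided is taken to be 1, which is the reading under which the weights are normalized.

```python
def initial_weights(labels):
    """Return one initial weight per learning data item.

    labels: list of +1 (target source, e.g. spray sound) or -1 (other).
    Each item gets 1 / (2 * count of items sharing its label), so both
    classes carry a total weight of 0.5 and all weights sum to 1.
    """
    n_pos = sum(1 for y in labels if y == +1)
    n_neg = sum(1 for y in labels if y == -1)
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one item of each label")
    return [1.0 / (2 * n_pos) if y == +1 else 1.0 / (2 * n_neg)
            for y in labels]


# Hypothetical toy set: one spray-sound item and three other items.
labels = [+1, -1, -1, -1]
weights = initial_weights(labels)
```

With this initialization a rare sound source is not swamped by the much larger "other" class, because each class starts with the same total weight.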

続いて、ＡｄａＢｏｏｓｔでは、図８の（Ｃ）に示すように、初期の重みで重み付けされた学習用データから弱識別器で学習した後、弱識別器自体の重みを決定するとともに、その弱識別器で誤識別を起こした学習用データの重みが大きくなるように、学習用データの重みを更新する（ステップＳ７０３〜７０５に相当）。すなわち、例えば、図８の（Ｂ）で初期の重みで重み付けされた「スプレー音」の学習用データから、（２．１）で示すように、誤識別が最小となるように弱識別器を学習した後（弱識別器１とする）、（２．２）式および（２．３）式で示すように、弱識別器１自体の重みを決定するとともに、（２．４）式で示すように、弱識別器１で誤識別を起こした学習用データの重みが大きくなるように、学習用データの重みを更新する。 Subsequently, in AdaBoost, as shown in FIG. 8C, after learning with the weak classifier from the learning data weighted with the initial weight, the weight of the weak classifier itself is determined and its weak classification is performed. The weight of the learning data is updated so that the weight of the learning data that has been erroneously identified by the device increases (corresponding to steps S703 to 705). That is, for example, from the learning data of “spray sound” weighted with the initial weight in FIG. 8B, as shown in (2.1), the weak classifier is set so as to minimize the misclassification. After learning (referred to as weak classifier 1), the weight of the weak classifier 1 itself is determined and expressed by formula (2.4) as shown in formulas (2.2) and (2.3). As described above, the weight of the learning data is updated so that the weight of the learning data that has been erroneously identified by the weak classifier 1 is increased.

こうして、ＡｄａＢｏｏｓｔでは、弱識別器１から弱識別器Ｔまで、複数の弱識別器を自動で生成（学習）していき、最後に、図８の（Ｄ）に示すように、複数生成された弱識別器の重み付き多数決で、最終的な識別器（最終識別器）を生成する（ステップＳ７０６に相当）。例えば、「スプレー音」の最終識別器を生成する。言い換えると、弱識別器は、各次元（１〜Ｔ）において、重み付きエラーが最小になるように閾値を設定し、その中でさらに重み付きエラーが最小となる次元を選択したことになる。 In this way, AdaBoost automatically generates (learns) a plurality of weak classifiers from weak classifier 1 to weak classifier T, and finally, a plurality of weak classifiers are generated as shown in FIG. 8D. A final discriminator (final discriminator) is generated by the weighted majority of the weak discriminators (corresponding to step S706). For example, a final discriminator of “spray sound” is generated. In other words, the weak discriminator sets a threshold value so that the weighted error is minimized in each dimension (1 to T), and selects a dimension that further minimizes the weighted error.
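The loop of steps S701 to S706 can be sketched in code. Equations (2.1) to (2.4) appear only in FIG. 8 and are not reproduced in this text, so the sketch below substitutes the commonly used discrete AdaBoost formulas as an assumption: weighted error `err`, weak classifier weight `alpha = 0.5 * ln((1 - err) / err)`, and a multiplicative weight update followed by renormalization. The weak classifier is a single-dimension threshold rule, matching the description of choosing a threshold per dimension and then the best dimension. All function names are hypothetical.

```python
import math


def train_stump(X, y, w):
    """Find the (error, dimension, threshold, polarity) with least weighted error."""
    best = None
    for d in range(len(X[0])):
        for thr in sorted({x[d] for x in X}):
            for pol in (+1, -1):
                err = sum(wi for xi, yi, wi in zip(X, y, w)
                          if (pol if xi[d] >= thr else -pol) != yi)
                if best is None or err < best[0]:
                    best = (err, d, thr, pol)
    return best


def stump_predict(stump, x):
    _, d, thr, pol = stump
    return pol if x[d] >= thr else -pol


def adaboost(X, y, rounds):
    """Return a list of (alpha, stump) pairs forming the final discriminator."""
    n_pos, n_neg = y.count(+1), y.count(-1)
    w = [1.0 / (2 * n_pos) if yi == +1 else 1.0 / (2 * n_neg) for yi in y]
    ensemble = []
    for _ in range(rounds):
        stump = train_stump(X, y, w)
        # Clamp the error so the logarithm stays finite (assumption).
        err = min(max(stump[0], 1e-10), 1.0 - 1e-10)
        alpha = 0.5 * math.log((1.0 - err) / err)
        ensemble.append((alpha, stump))
        # Misclassified items (y * h = -1) have their weight increased.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, xi))
             for xi, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble


def score(ensemble, x):
    """Weighted vote of the weak classifiers; positive means 'target source'."""
    return sum(alpha * stump_predict(stump, x) for alpha, stump in ensemble)


# Hypothetical one-dimensional toy data: large values are the target source.
X = [[0.1], [0.2], [0.8], [0.9]]
y = [-1, -1, +1, +1]
model = adaboost(X, y, rounds=3)
```

The signed value returned by `score` plays the role of the per-discriminator score that the section sound source detection unit later compares across sound sources.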

このようにして、実施例１における最終識別器導出部３１は、上記してきたようなＡｄａＢｏｏｓｔの理論を用いて、学習用データから所定の音源ごとの最終識別器を導出する。具体的には、最終識別器導出部３１は、図９の（１．１）式に示すように、学習用データ保持部２１によって保持された学習用データ（共通の一つのグループ）の「ラベル」を所定の音源ごとに付け替え、（１．２）式に示すように、所定の音源ごとの最終識別器を導出する。 In this way, the final discriminator deriving unit 31 according to the first embodiment derives a final discriminator for each predetermined sound source from the learning data by using the AdaBoost theory as described above. Specifically, the final discriminator deriving unit 31, as shown in the equation (1.1) in FIG. 9, “label” of the learning data (a common group) held by the learning data holding unit 21. ”Is replaced for each predetermined sound source, and a final discriminator for each predetermined sound source is derived as shown in Equation (1.2).
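The relabeling of the single common group for each sound source can be illustrated as below. Equations (1.1) and (1.2) are in FIG. 9 and are not reproduced here, so this is only a sketch of the one-versus-rest idea; the source names and the +1 / -1 encoding are assumptions.

```python
def relabel(sources, target):
    """Label each item +1 if it belongs to the target source, otherwise -1."""
    return [+1 if s == target else -1 for s in sources]


# Hypothetical annotation of the common learning data group.
sources = ["speech", "spray", "paper", "phone", "spray"]

# One label vector per sound source; each would train one final discriminator.
labels_by_source = {name: relabel(sources, name)
                    for name in sorted(set(sources))}
```

The feature vectors are shared by all sound sources; only the label vector changes from one final discriminator to the next.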

ここで、上記してきたように、ＡｄａＢｏｏｓｔは、基本的に二クラス判別であるが、雑音の除去（抑圧）を考えるのであれば、雑音の音源を識別し（どのような雑音が音声に混入したのかまでを知り）、雑音除去（抑圧）時には、あらかじめ音源ごとに保存された雑音のデータを用いて雑音を除去（抑圧）することが望ましい。このようなことから、最終識別器導出部３１は、ＡｄａＢｏｏｓｔを多クラス問題に適応できるように拡張することで（１クラス対その他のクラスの二値判別器を複数作成）、Ｍｕｌｔｉ―ｃｌａｓｓを実現し、区間音源検出部３２において雑音の種類（音源）の識別まで行えるようにしているのである。 As described above, AdaBoost is fundamentally a two-class discriminator. When noise removal (suppression) is the goal, however, it is desirable to identify the noise source (that is, to know what kind of noise has been mixed into the speech) and then, at removal time, to remove (suppress) the noise using noise data stored in advance for each sound source. For this reason, the final discriminator deriving unit 31 extends AdaBoost so that it can be applied to multi-class problems, by creating a plurality of binary discriminators that each separate one class from all other classes. This realizes multi-class discrimination and allows the section sound source detection unit 32 to go as far as identifying the type (sound source) of the noise.

区間音源検出部３２は、入力された雑音重畳音声のデータの雑音の区間と雑音の音源とを検出する。具体的には、区間音源検出部３２は、入力データ一時記憶部２３によって記憶された雑音重畳音声データを、最終識別器保持部２２によって保持された所定の音源ごとの最終識別器各々を用いてフレーム単位で識別し、識別の結果、二値のうちいずれか一つの値に識別されたスコアが最も高い最終識別器を判定することで、データの雑音の区間がフレームで区切られた区間であること、並びに、データに存在する雑音の音源が判定された最終識別器が示す所定の音源であることを検出し、検出した結果を、検出結果記憶部２４に記憶させたり、平滑化部３３による処理に利用させたりする。 The section sound source detection unit 32 detects the noise section and the noise source of the input noise superimposed speech data. Specifically, the section sound source detection unit 32 uses the final discriminator for each predetermined sound source stored in the final discriminator holding unit 22 by using the noise superimposed voice data stored in the input data temporary storage unit 23. By identifying each frame and determining the final classifier with the highest score identified as one of the binary values as a result of identification, the data noise section is a section divided by frames. In addition, it is detected that the noise source included in the data is the predetermined sound source indicated by the determined final discriminator, and the detection result is stored in the detection result storage unit 24, or by the smoothing unit 33. It is used for processing.

以下、区間音源検出部３２の検出処理について、図１０を用いて説明すると、区間音源検出部３２は、まず、入力データ一時記憶部２３によって記憶された雑音重畳音声データ（ステップＳ１００１）を、最終識別器保持部２２によって保持された所定の音源ごとの最終識別器各々を用いてフレーム単位で識別する（ステップＳ１００２）。言い換えると、区間音源検出部３２は、Ｍｕｌｔｉ−ｃｌａｓｓＡｄａＢｏｏｓｔに、フレームごとの特徴量である対数メルフィルタバンクを入力し、「音声のみであるか否か」、「スプレー音であるか否か」、「紙を破る音であるか否か」、「電話のコール音であるか否か」、というように識別していく。そして、識別の結果、二値のうちいずれか一つの値に識別されたスコアが最も高い最終識別器を判定することで（ステップＳ１００３）、データの雑音の区間がフレームで区切られた区間であること、並びに、データに存在する雑音の音源が判定された最終識別器が示す所定の音源であることを検出する。 Hereinafter, the detection process of the section sound source detection unit 32 will be described with reference to FIG. 10. The section sound source detection unit 32 first converts the noise superimposed speech data (step S1001) stored in the input data temporary storage unit 23 to the final. Each final discriminator for each predetermined sound source held by the discriminator holding unit 22 is used to identify each frame (step S1002). In other words, the section sound source detection unit 32 inputs a logarithmic mel filter bank, which is a feature amount for each frame, to the Multi-class AdaBoost, and “whether it is only sound” or “whether it is a spray sound”. , “Whether it is a sound that breaks paper”, “Whether it is a phone call sound”, and so on. Then, as a result of the identification, by determining the final discriminator having the highest score identified as one of the two values (step S1003), the data noise section is a section divided by frames. In addition, it is detected that the noise source included in the data is a predetermined sound source indicated by the determined final discriminator.

区間音源検出部３２の検出処理について具体的に例を挙げて説明すると、区間音源検出部３２は、入力された雑音重畳音声のデータを、保持された音源ごとの最終識別器によってデータのフレーム単位で識別する。ここで、データのフレームとは、データをある時間で区切った固まりのことであり（例えば、『２０ｍｓ』など）、データの区間を示すものである。 The detection processing of the section sound source detection unit 32 will be described with a specific example. The section sound source detection unit 32 uses the final discriminator for each sound source to store the input noise superimposed speech data in units of data frames. Identify with Here, the data frame is a group of data divided by a certain time (for example, “20 ms”) and indicates a data section.
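Splitting the input into frames can be sketched as follows. Only the 20 ms frame length comes from the text; the sampling rate, the absence of overlap between frames, and the dropping of a trailing partial frame are assumptions made for illustration.

```python
def split_into_frames(samples, sample_rate_hz, frame_ms=20):
    """Cut a sample sequence into consecutive, non-overlapping frames."""
    frame_len = sample_rate_hz * frame_ms // 1000
    if frame_len <= 0:
        raise ValueError("frame length must be at least one sample")
    # A trailing partial frame is dropped (assumption).
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]


# Hypothetical input: 1000 samples at an assumed 16 kHz sampling rate.
frames = split_into_frames(list(range(1000)), sample_rate_hz=16000)
```

The index of a frame in the returned list corresponds to the frame number used to report a noise section.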

例えば、区間音源検出部３２は、入力されたデータを、『音声のみの最終識別器』、『スプレー音の最終識別器』、『紙を破る音の最終識別器』、『電話のコール音の最終識別器』などの最終識別器を用いて、データの「フレームＮｏ．１００」の区間について識別する。すると、例えば、データの「フレームＮｏ．１００」は、『音声のみの最終識別器』を用いた識別の結果、二値のうちいずれか一つの値に『−０．３』というスコアで識別される。同様に、データの「フレームＮｏ．１００」は、『スプレー音の最終識別器』を用いた識別の結果、二値のうちいずれか一つの値に『−０．１』というスコアで識別され、『紙を破る音の最終識別器』を用いた識別の結果、二値のうちいずれか一つの値に『−０．２』というスコアで識別され、『電話のコール音の最終識別器』を用いた識別の結果、二値のうちいずれか一つの値（ここでは、『電話のコール音』）に『０．５』というスコアで識別される。 For example, the section sound source detection unit 32 converts the input data into “sound-only final identifier”, “spray sound final identifier”, “paper-breaking sound final identifier”, “phone call sound” Using a final discriminator such as “final discriminator”, the section of “frame No. 100” of the data is identified. Then, for example, “frame No. 100” of the data is identified with a score of “−0.3” as one of the two values as a result of identification using the “final classifier only for voice”. The Similarly, “frame No. 100” of the data is identified with a score of “−0.1” in any one of the binary values as a result of identification using the “final identifier of spray sound”. As a result of identification using “the final discriminator of the sound that breaks the paper”, one of the two values is identified with a score of “−0.2”, and the “final discriminator of the phone call tone” As a result of the identification used, any one of the two values (here, “phone call tone”) is identified with a score of “0.5”.

続いて、区間音源検出部３２は、二値のうちいずれか一つの値に識別されたスコアが最も高い最終識別器を判定する。例えば、区間音源検出部３２は、二値のうちいずれか一つの値に『０．５』という最も高いスコアで識別された最終識別器が『電話のコール音の最終識別器』であることを判定する。 Subsequently, the section sound source detection unit 32 determines the final discriminator having the highest score identified by any one of the binary values. For example, the section sound source detection unit 32 determines that the final discriminator identified with the highest score of “0.5” as one of the two values is the “final discriminator of the telephone call sound”. judge.

そして、区間音源検出部３２は、データの雑音区間がこのフレームで区切られた区間であること、並びに、データに存在する雑音の音源が判定された最終識別器が示す音源であることを検出する。例えば、区間音源検出部３２は、データの雑音区間が「フレームＮｏ．１００」で区切られた区間であること、並びに、データに存在する雑音の音源が『電話のコール音』であることを検出する。 Then, the section sound source detection unit 32 detects that the noise section of the data is a section divided by this frame, and that the noise source existing in the data is the sound source indicated by the determined final discriminator. . For example, the section sound source detection unit 32 detects that the noise section of the data is a section divided by “Frame No. 100” and that the sound source of the noise present in the data is “phone call sound”. To do.
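The decision for frame No. 100 in the example above amounts to picking the final discriminator with the highest score. A minimal sketch, using the scores quoted in the text and hypothetical short names for the four sound sources:

```python
def detect_source(scores):
    """Return the sound source whose final discriminator scored highest."""
    return max(scores, key=scores.get)


# Scores for frame No. 100 as given in the example.
frame_100 = {"speech": -0.3, "spray": -0.1, "paper": -0.2, "phone": 0.5}
detected = detect_source(frame_100)
```

Here `detected` is `"phone"`, which corresponds to the phone call sound being reported for the section delimited by frame No. 100.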

このように、区間音源検出部３２は、ＡｄａＢｏｏｓｔを多クラス問題に適応できるように拡張することで（複数の二値判別器のうち、最も結果の高かったものを識別結果とすることで）、Ｍｕｌｔｉ―ｃｌａｓｓを実現し、雑音の種類（音源）の識別まで行えるようにしているのである。 Thus, the section sound source detection unit 32 extends AdaBoost so that it can be applied to the multi-class problem (by making the highest result among the plurality of binary discriminators the identification result). Multi-class is realized, and the type of noise (sound source) can be identified.

In the first embodiment, detection by the section sound source detection unit 32 is followed by smoothing processing by the smoothing unit 33 (step S1004), output processing by the output unit 12 (step S1005), and the like. The first embodiment describes a technique in which the section sound source detection unit 32 identifies the input noise-superimposed speech data frame by frame and also detects that the noise section of the data is a section delimited by frames. The present invention is not limited to this, however, and can be applied in the same way when the section sound source detection unit 32 does not identify the input noise-superimposed speech data frame by frame.

When the consecutive frames of the input data include a frame whose identification result differs from that of the other frames, the smoothing unit 33 smooths the differing frame. Specifically, the smoothing unit 33 examines the detection results that were produced by the section sound source detection unit 32 and stored in the detection result storage unit 24. If they include a frame whose identification result differs from that of the other frames, it smooths that frame and stores the smoothed result in the detection result storage unit 24.

For example, when frames detected as “noise” are consecutive and a very small number of frames among them are detected as “speech”, the smoothing unit 33 regards those frames not as “speech” but as misidentified “noise”. Specifically, for each frame, the smoothing unit 33 takes the most frequent detection result among seven frames in total (the frame itself and the three frames before and after it) as the detection result of that frame, and repeats this until no further change occurs. A section that is continuously determined to be “noise” is regarded as a single “noise” even if the detection result differs from frame to frame.

As shown in FIG. 6, taking “frame No. 101” as an example, the most frequent detection result among the seven frames consisting of “frame No. 101” and the three frames before and after it is “phone call sound”, so the smoothing unit 33 sets the detection result of “frame No. 101” to “phone call sound” as well.
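The seven-frame majority vote can be sketched as below. The function name `smooth_labels`, the truncation of the window at the ends of the data, the tie-breaking rule, the update of all frames from the previous pass, and the `max_passes` safety limit are assumptions made for this illustration; the description only specifies the window size and the repetition until nothing changes.

```python
from collections import Counter
from typing import List


def smooth_labels(labels: List[str], half_width: int = 3,
                  max_passes: int = 100) -> List[str]:
    """Repeated majority vote over a window of 2 * half_width + 1 frames."""
    current = list(labels)
    for _ in range(max_passes):  # safety limit (assumption)
        updated = []
        for i, own in enumerate(current):
            # Window is truncated at the edges of the data (assumption).
            window = current[max(0, i - half_width): i + half_width + 1]
            counts = Counter(window)
            top = max(counts.values())
            # On a tie, keep the frame's own label (assumption).
            if counts[own] == top:
                updated.append(own)
            else:
                updated.append(counts.most_common(1)[0][0])
        if updated == current:  # stop once a pass changes nothing
            break
        current = updated
    return current


raw = ["phone call sound"] * 3 + ["speech"] + ["phone call sound"] * 3
smoothed = smooth_labels(raw)
print(smoothed)
```

In this toy input the isolated “speech” frame is surrounded by six “phone call sound” frames, so it is relabeled in the first pass and the second pass confirms that nothing else changes.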

The smoothing unit 33 in the first embodiment targets, for example, noise with a duration of 200 ms or longer, so anything determined to be noise with a clearly shorter duration is regarded as a spurious detection (an “insertion”). For example, noise with a duration of 100 ms or less, which is half of 200 ms, is cut out.
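A minimal sketch of this cut-out step follows. It assumes that a run of n frames lasts roughly n times the frame shift, that the 10 ms frame shift of the evaluation experiment applies, and that a removed run is relabeled as speech; the function name `drop_short_noise` and the label strings are hypothetical.

```python
from itertools import groupby
from typing import List


def drop_short_noise(labels: List[str], frame_shift_ms: float = 10.0,
                     max_drop_ms: float = 100.0,
                     speech_label: str = "speech") -> List[str]:
    """Relabel noise runs lasting max_drop_ms or less as speech."""
    out: List[str] = []
    for label, run in groupby(labels):
        n = len(list(run))
        # Run duration is approximated as n * frame shift (assumption).
        if label != speech_label and n * frame_shift_ms <= max_drop_ms:
            out.extend([speech_label] * n)  # treated as a spurious detection
        else:
            out.extend([label] * n)
    return out


frames = (["speech"] * 5 + ["spray sound"] * 4
          + ["speech"] * 5 + ["phone call sound"] * 25)
cleaned = drop_short_noise(frames)
print(cleaned.count("spray sound"), cleaned.count("phone call sound"))
```

Here the 4-frame “spray sound” run (about 40 ms) is removed, while the 25-frame “phone call sound” run (about 250 ms) is kept.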

The first embodiment describes a technique in which the smoothing unit 33 smooths frames with differing results and the smoothed result is output as the detection result. The present invention is not limited to this, however, and can be applied in the same way to a technique in which the smoothing unit 33 performs no smoothing and the unsmoothed result is output as the detection result.

[Procedure of processing by the noise detection apparatus according to the first embodiment]
Next, an example of the processing procedure performed by the noise detection apparatus according to the first embodiment will be described with reference to FIG. 11. FIG. 11 is a flowchart showing the processing procedure performed by the noise detection apparatus according to the first embodiment.

First, the noise detection apparatus 10 determines, in the section sound source detection unit 32, whether input of data to be examined has been received (step S1101). If no data has been received (No at step S1101), the noise detection apparatus 10 repeats this determination.

If data has been received (Yes at step S1101), the noise detection apparatus 10 identifies, in the section sound source detection unit 32, one frame of the input data with the final discriminator of a predetermined sound source (step S1102).

Subsequently, the noise detection apparatus 10 determines, in the section sound source detection unit 32, whether the frame has been identified with the final discriminators of all sound sources (step S1103). If not (No at step S1103), the noise detection apparatus 10 switches to another final discriminator (step S1104) and returns to the processing of identifying one frame of the input data, now with the final discriminator of the newly selected sound source.

If the frame has been identified with the final discriminators of all sound sources (Yes at step S1103), the noise detection apparatus 10 determines, in the section sound source detection unit 32, the final discriminator whose score for one of the two values is the highest (step S1105).

The noise detection apparatus 10 then detects, in the section sound source detection unit 32, that the noise section of the data is a section delimited by frames, and that the sound source of the noise present in the data is the predetermined sound source indicated by the determined final discriminator (step S1106).

Next, the noise detection apparatus 10 determines, in the section sound source detection unit 32, whether detection has been performed for all frames (step S1107). If not (No at step S1107), the noise detection apparatus 10 moves to the next frame (step S1108) and returns to the processing of identifying one frame of the input data with the final discriminator of a predetermined sound source.

If detection has been performed for all frames (Yes at step S1107), the noise detection apparatus 10 performs smoothing processing in the smoothing unit 33 (step S1109), outputs the detection result to the output unit 12 or the like (step S1110), and ends the processing.

In this way, the noise detection apparatus 10 according to the first embodiment can identify the type of noise (sound source).
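The nested loops of steps S1102 to S1108 can be sketched as follows, with the step numbers of FIG. 11 noted in comments. The function name `detect_sections`, the grouping of equally labeled consecutive frames into (first frame, last frame, sound source) tuples, and the toy discriminators are assumptions for illustration; smoothing (step S1109) is omitted.

```python
from itertools import groupby
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical type: a final discriminator maps a feature vector to a signed score.
Scorer = Callable[[Sequence[float]], float]


def detect_sections(frames: List[Sequence[float]],
                    discriminators: Dict[str, Scorer],
                    speech_label: str = "speech only") -> List[Tuple[int, int, str]]:
    """Label every frame, then report noise sections as (first, last, source)."""
    labels: List[str] = []
    for frame in frames:                               # S1107 / S1108: next frame
        best_label, best_score = "", float("-inf")
        for name, scorer in discriminators.items():    # S1102 to S1104: every source
            score = scorer(frame)
            if score > best_score:                     # S1105: keep the highest score
                best_label, best_score = name, score
        labels.append(best_label)                      # S1106: source for this frame

    sections: List[Tuple[int, int, str]] = []
    start = 0
    for label, run in groupby(labels):                 # merge consecutive equal labels
        n = len(list(run))
        if label != speech_label:
            sections.append((start, start + n - 1, label))
        start += n
    return sections


# Toy discriminators: speech scores a constant, the phone model echoes the feature.
toy = {"speech only": lambda f: 0.1, "phone call sound": lambda f: f[0]}
sections = detect_sections([[0.0], [0.0], [0.9], [0.8], [0.0]], toy)
print(sections)
```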

[Effects of the first embodiment]
As described above, according to the first embodiment, a final discriminator that discriminates between two values, indicating whether data of noise-superimposed speech in which noise is superimposed on a speech section is noise from a predetermined sound source, is held for each predetermined sound source. Input noise-superimposed speech data is identified using each of the held final discriminators, and the final discriminator whose score for one of the two values is the highest is determined from the identification results. The sound source of the noise present in the data is thereby detected to be the predetermined sound source indicated by the determined final discriminator, so the type of noise (sound source) can be identified.

Further, according to the first embodiment, a plurality of data including noise-superimposed speech data is held as learning data, and the final discriminator for each predetermined sound source is derived from the held learning data by boosting, which trains a discriminator that discriminates between two values indicating whether data is noise from a predetermined sound source and yields the final discriminator once training is complete. The type of noise (sound source) can therefore be identified appropriately.

Further, according to the first embodiment, the noise detection apparatus derives the final discriminators using AdaBoost as the boosting, so the type of noise (sound source) can be identified appropriately.

Further, according to the first embodiment, the noise detection apparatus identifies the noise-superimposed speech data frame by frame and additionally detects that the noise section of the data is a section delimited by frames, so that, in addition to the above effects, the noise section can also be detected.

Further, according to the first embodiment, when the consecutive frames of the input data include a frame whose identification result, obtained with the final discriminator determined by the detecting means, differs from that of the other frames, the noise detection apparatus smooths the differing frame, so the type of noise (sound source) can be identified accurately.

The overview, features, configuration, and processing procedure of the noise detection apparatus 10 have been described above as the first embodiment. Next, an evaluation experiment using the noise detection apparatus 10 according to the present invention will be described as the second embodiment. The main purpose of the evaluation experiment in the second embodiment is to evaluate the recall, the precision, and the noise identification rate of the noise detection apparatus 10 according to the present invention.

[Experimental conditions]
First, the experimental conditions of the evaluation experiment in the second embodiment will be described. The “learning data” used utterance data of 21 male speakers × 10 utterances taken from the research continuous speech database provided by ASJ. The “evaluation data” to be examined used utterance data of 5 male speakers × 240 utterances on average, taken from the same ASJ database. For the “noise”, three types of data, “spray sound”, “paper-tearing sound”, and “phone call sound”, were taken from the non-speech dry sources provided by RWCP.

The “learning data” consisted of the utterance data and of the same utterance data with each “noise” superimposed on it at an adjusted SNR. The SNR of the “learning data” was varied randomly between “-5 dB” and “5 dB”. The “evaluation data” consisted of utterances on each of which one to three “noises” with a duration of 200 ms or longer were superimposed at an adjusted SNR. No data contains a section in which a further noise is superimposed on an already noise-superimposed section. The “evaluation data” uses three SNRs: “-5 dB”, “0 dB”, and “5 dB”.
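Superimposing a noise at an adjusted SNR can be sketched as follows. The SNR formulas of FIG. 12 are not reproduced in this text, so the sketch assumes the conventional definition, SNR = 10 log10(speech power / noise power), measured over the overlapped span; the function names `power` and `mix_at_snr` and the synthetic signals are hypothetical.

```python
import math
import random
from typing import List, Sequence, Tuple


def power(x: Sequence[float]) -> float:
    """Mean squared amplitude of a signal."""
    return sum(v * v for v in x) / len(x)


def mix_at_snr(speech: Sequence[float], noise: Sequence[float],
               snr_db: float, offset: int) -> Tuple[List[float], float]:
    """Add `noise` to `speech` starting at `offset`, scaled to the requested SNR."""
    span = speech[offset: offset + len(noise)]
    if len(span) != len(noise) or power(noise) == 0.0:
        raise ValueError("noise must fit inside the speech and be non-silent")
    # Conventional SNR definition over the overlapped span (assumption).
    gain = math.sqrt(power(span) / (power(noise) * 10.0 ** (snr_db / 10.0)))
    mixed = list(speech)
    for i, n in enumerate(noise):
        mixed[offset + i] += gain * n
    return mixed, gain


speech = [math.sin(0.1 * i) for i in range(1000)]     # synthetic stand-in signal
noise = [0.3 * (-1) ** i for i in range(200)]         # synthetic stand-in noise
mixed, gain = mix_at_snr(speech, noise, snr_db=-5.0, offset=100)

# Training-style usage: draw the SNR at random between -5 dB and 5 dB.
train_snr = random.uniform(-5.0, 5.0)
train_mixed, _ = mix_at_snr(speech, noise, train_snr, offset=100)
```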

In the evaluation experiment of the second embodiment, the SNR was obtained using the formulas shown in (A) to (C) of FIG. 12. The log mel filter bank was used as the feature. For both the “learning data” and the “evaluation data”, the frame width is “20 ms” and the frame shift is “10 ms”, and pre-emphasis of 1 - 0.97 z^(-1) and a Hamming window are applied.
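The framing front end can be sketched as below. The 16 kHz sampling rate is an assumption (the text does not state one), as are the function names; the log mel filter bank stage that would follow is omitted.

```python
import math
from typing import List, Sequence


def pre_emphasize(x: Sequence[float], coeff: float = 0.97) -> List[float]:
    """Apply the filter 1 - coeff * z^(-1): y[n] = x[n] - coeff * x[n - 1]."""
    return list(x[:1]) + [x[i] - coeff * x[i - 1] for i in range(1, len(x))]


def hamming(n: int) -> List[float]:
    """Hamming window of length n (n >= 2)."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]


def frame_signal(x: Sequence[float], sample_rate: int = 16000,
                 width_ms: float = 20.0, shift_ms: float = 10.0) -> List[List[float]]:
    """Split a signal into pre-emphasized, Hamming-windowed frames."""
    width = int(sample_rate * width_ms / 1000)   # 320 samples at the assumed 16 kHz
    shift = int(sample_rate * shift_ms / 1000)   # 160 samples at the assumed 16 kHz
    emphasized = pre_emphasize(x)
    window = hamming(width)
    return [[emphasized[start + i] * window[i] for i in range(width)]
            for start in range(0, len(emphasized) - width + 1, shift)]


one_second = [1.0] * 16000
frames = frame_signal(one_second)
print(len(frames), len(frames[0]))
```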

[Noise detection]
The judgement used in the evaluation experiment of the second embodiment is as follows. First, from the viewpoint of detection alone, a noise whose section is detected correctly is judged “correct” even if the type of noise (sound source) is wrong. An error margin is also set, and a detection whose error relative to the reference data is within that margin is likewise judged “correct”. The margin was “30 ms” in the evaluation experiment of the second embodiment. A detected section that is too large is counted as a “false detection”, and a detected section that is too small is counted as “undetected”.

Three measures are used for the evaluation: the detection rate shown in (A) of FIG. 13, the recall shown in (B) of FIG. 13, and the precision shown in (C) of FIG. 13. The detection rate, recall, and precision are calculated by formulas (A) to (C) of FIG. 13 from the number of correct detections “Tp”, the number of false detections “Fp”, the number of undetected noises “Tn”, and the total number of noises “Ta”. The detection rate and the recall are normally equal, but in the evaluation experiment of the second embodiment a noise whose section is detected too large is counted as a “false detection”, so the two can take different values. Both are therefore reported.
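The formulas of FIG. 13 are not reproduced in this text. As a hedged sketch, the following assumes the conventional definitions that are consistent with the description: detection rate = Tp / Ta, recall = Tp / (Tp + Tn), and precision = Tp / (Tp + Fp). The counts in the usage lines are made-up illustrative numbers, not results from the experiment.

```python
from typing import Dict


def ratio(numerator: int, denominator: int) -> float:
    """Safe division that returns 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def detection_metrics(tp: int, fp: int, tn: int, ta: int) -> Dict[str, float]:
    """tp: correct, fp: false detections, tn: undetected, ta: total noises."""
    return {
        "detection_rate": ratio(tp, ta),    # assumed form of FIG. 13 (A)
        "recall": ratio(tp, tp + tn),       # assumed form of FIG. 13 (B)
        "precision": ratio(tp, tp + fp),    # assumed form of FIG. 13 (C)
    }


# Illustrative counts only: 100 noises, 95 correct, 3 oversized, 2 missed.
metrics = detection_metrics(tp=95, fp=3, tn=2, ta=100)
print(metrics)
```

Under these assumed definitions the three oversized detections lower the detection rate (95/100) but not the recall (95/97), which matches the remark that the two measures can differ.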

The results of the evaluation experiment are shown in FIG. 14. For every SNR, the detection rate, recall, and precision were all “95% or higher”, a good result, confirming that “noise” with a strength of “5 dB” or more can be detected.

[Noise identification]
In the above, a detection was judged “correct” as long as the section was right, even if the type of noise (sound source) was wrong. Next, the noise identification rate among the noises whose sections were correctly determined was evaluated. In addition, combining this with the detection rate, the proportion of noises for which both the section and the identification result were correct was obtained as the noise correct-answer rate. The results are shown in FIG. 15.

FIG. 15 shows that a high noise identification rate exceeding “99.5%” was obtained at every SNR. In other words, the type of noise (sound source) was identified correctly for almost all of the noises that were detected. An example of output from the evaluation experiment in which detection and identification succeeded is the one shown in FIG. 3. As FIG. 3 shows, the “phone call sound” cannot be distinguished from the waveform alone, yet the noise detection apparatus 10 according to the present invention detected every occurrence correctly and identified the type of noise (sound source) correctly.

[Change in detection accuracy with mismatched models]
In the above, the SNR of the “noise” in the “evaluation data” was equal to that in the “learning data”. Next, the SNRs of the “evaluation data” and the “learning data” are varied to examine how much the detection accuracy changes. Four models are compared: a model trained, as above, on “learning data” at “SNR -5 dB to 5 dB”, a model trained on “learning data” at “SNR -5 dB only”, a model trained on “learning data” at “SNR 0 dB only”, and a model trained on “learning data” at “SNR 5 dB only”. For each model, the detection rate, recall, and precision on “evaluation data” at “SNR -10 dB to 10 dB” are calculated and compared. The number of AdaBoost training rounds is 1,000, as before.

The results are shown in FIG. 16. Because changing the SNR of the “noise” in the “evaluation data” did not affect the number of false detections, the precision is determined almost entirely by the SNR of the “learning data” used during training. The lower the SNR of the “learning data”, the higher the precision.

The number of undetected noises also tends to increase as the SNR of the “evaluation data” rises. According to FIG. 16, with the discriminator trained on “learning data” at “SNR -5 dB”, the detection rate falls to “76.7%” for “evaluation data” at “SNR 5 dB”, and with the discriminator trained on “learning data” at “SNR 0 dB”, it falls to “69.9%” for “evaluation data” at “SNR 10 dB”. The discriminator trained on all SNRs had lower precision than those trained at “-5 dB” and “0 dB”, but its detection rate and recall decreased only slightly.

[Change in noise identification accuracy with mismatched models]
The identification rate and correct-answer rate for “noise” are evaluated under the same conditions as above. The results are shown in FIG. 17. FIG. 17 shows that the identification rate decreases as the difference between the SNR of the “noise” in the “learning data” and the SNR of the “evaluation data” grows. With the discriminator trained at “SNR -5 dB”, the identification rate for “evaluation data” at “SNR 10 dB” drops to “94.4%” but remains high, whereas with the discriminator trained at “SNR 5 dB”, the identification rate for “evaluation data” at “SNR -10 dB” was a relatively low “80.1%”. With the discriminator trained at “SNR 0 dB”, the rate was “95.9%” for “evaluation data” at “SNR 10 dB” and “93.7%” for “evaluation data” at “SNR -10 dB”. The correct-answer rate for “noise” is high where the model matches the data well, but on average the model trained at “SNR -5 dB to 5 dB” showed the highest value.

Although embodiments of the present invention have been described so far, the present invention may be implemented in various forms other than the embodiments described above.

[System configuration, etc.]
Of the processes described in the embodiments, all or part of the processes described as being performed automatically (for example, deriving the final discriminators from the held learning data) can also be performed manually (for example, by entering a command when needed to derive the final discriminators from the held learning data). Conversely, all or part of the processes described as being performed manually (for example, the input of learning data) can be performed automatically by a known method (for example, by automatic download over a network). In addition, the processing procedures, control procedures, specific names, and information including various data and parameters shown in the above description and the drawings can be changed arbitrarily unless otherwise specified.

The components of each illustrated apparatus are functional and conceptual, and need not be physically configured as illustrated (for example, as in FIG. 2). That is, the specific form of distribution and integration of each apparatus is not limited to the illustrated one, and all or part of it can be functionally or physically distributed or integrated in arbitrary units according to various loads, usage conditions, and the like (for example, the detection result storage unit may be split into one unit for results before smoothing and one for results after smoothing). Furthermore, all or any part of the processing functions performed in each apparatus can be realized by a CPU and a program analyzed and executed by that CPU, or realized as hardware using wired logic.

The noise detection method described in the embodiments can be realized by executing a program prepared in advance on a computer such as a personal computer or a workstation. This program can be distributed via a network such as the Internet. The program can also be recorded on a computer-readable recording medium such as a hard disk, a flexible disk (FD), a CD-ROM, an MO, or a DVD, and executed by being read from the recording medium by the computer.

As described above, the noise detection apparatus and noise detection method according to the present invention are useful for detecting “noise”, and are particularly suited to identifying the type of noise (sound source).

Brief description of the drawings

FIG. 1 is a diagram for explaining the overview and features of the noise detection apparatus according to the first embodiment.
FIG. 2 is a block diagram showing the configuration of the noise detection apparatus according to the first embodiment.
FIG. 3 is a diagram for explaining the output unit.
FIG. 4 is a diagram for explaining the learning data holding unit.
FIG. 5 is a diagram for explaining the final discriminator holding unit.
FIG. 6 is a diagram for explaining the detection result storage unit.
FIG. 7 is a diagram for explaining the final discriminator derivation processing (AdaBoost).
FIG. 8 is a diagram for explaining the final discriminator derivation processing (AdaBoost).
FIG. 9 is a diagram for explaining the section sound source detection processing (Multi-class AdaBoost).
FIG. 10 is a diagram for explaining the section sound source detection processing (Multi-class AdaBoost).
FIG. 11 is a flowchart showing the processing procedure performed by the noise detection apparatus according to the first embodiment.
FIG. 12 is a diagram for explaining the formulas for calculating the SNR.
FIG. 13 is a diagram for explaining the detection rate, recall, and precision.
FIG. 14 is a diagram showing evaluation results of the noise detection apparatus according to the second embodiment.
FIG. 15 is a diagram showing evaluation results of the noise detection apparatus according to the second embodiment.
FIG. 16 is a diagram showing evaluation results of the noise detection apparatus according to the second embodiment.
FIG. 17 is a diagram showing evaluation results of the noise detection apparatus according to the second embodiment.
FIG. 18 is a diagram showing a waveform in which a telephone sound is superimposed on speech.
FIG. 19 is a diagram showing waveforms in which each noise is superimposed on speech.

Explanation of symbols

10 Noise detection apparatus
11 Input unit
12 Output unit
13 Input/output control I/F unit
20 Storage unit
21 Learning data holding unit
22 Final discriminator holding unit
23 Input data temporary storage unit
24 Detection result storage unit
30 Control unit
31 Final discriminator deriving unit
32 Section sound source detection unit
33 Smoothing unit

Claims

1. A noise detection apparatus comprising:
final discriminator holding means for holding, for each predetermined sound source, a final discriminator that discriminates between two values indicating whether data of noise-superimposed speech, in which noise is superimposed on a speech section, is noise from the predetermined sound source; and
detecting means for identifying input noise-superimposed speech data using each of the final discriminators held for the respective predetermined sound sources by the final discriminator holding means, determining, from the results of the identification, the final discriminator whose score for one of the two values is the highest, and thereby detecting that the sound source of the noise present in the data is the predetermined sound source indicated by the determined final discriminator.

2. The noise detection apparatus according to claim 1, further comprising:
learning data holding means for holding, as learning data, a plurality of data including noise-superimposed speech data; and
final discriminator deriving means for deriving the final discriminator for each predetermined sound source from the learning data held by the learning data holding means, using boosting in which a discriminator that discriminates between two values indicating whether data is noise from a predetermined sound source is trained on the learning data to yield a final discriminator whose training is complete.

3. The noise detection apparatus according to claim 2, wherein the final discriminator deriving means derives the final discriminator using AdaBoost as the boosting.

4. The noise detection apparatus according to any one of the preceding claims, wherein the detecting means identifies the noise-superimposed speech data frame by frame and further detects that the noise section of the data is a section delimited by the frames.

5. The noise detection apparatus according to claim 4, further comprising smoothing means for smoothing, when consecutive frames of the input data include a frame whose identification result, obtained with the final discriminator determined by the detecting means, differs from that of the other frames, the frame with the differing result.

6. A noise detection method comprising:
a final discriminator holding step of holding, for each predetermined sound source, a final discriminator that discriminates between two values indicating whether data of noise-superimposed speech, in which noise is superimposed on a speech section, is noise from the predetermined sound source; and
a detecting step of identifying input noise-superimposed speech data using each of the final discriminators held for the respective predetermined sound sources in the final discriminator holding step, determining, from the results of the identification, the final discriminator whose score for one of the two values is the highest, and thereby detecting that the sound source of the noise present in the data is the predetermined sound source indicated by the determined final discriminator.