JP2011248296A

JP2011248296A - Sound signal section extracting device and sound signal section extracting method

Info

Publication number: JP2011248296A
Application number: JP2010124299A
Authority: JP
Inventors: Seiki Sugi (杉 聖基); Yoichi Matsui (松井 洋一); Ikuko Ueno (上野 育子); Kenzo Ito (伊藤 憲三)
Original assignee: Kanto Jidosha Kogyo KK; Kanto Auto Works Ltd; Iwate Prefectural University
Current assignee: Kanto Jidosha Kogyo KK; Toyota Motor East Japan Inc; Iwate Prefectural University
Priority date: 2010-05-31
Filing date: 2010-05-31
Publication date: 2011-12-08
Anticipated expiration: 2030-05-31
Also published as: JP5351835B2

Abstract

PROBLEM TO BE SOLVED: To enable precise extraction of a sound signal section, even when the input sound signal contains non-stationary noise, without setting a threshold on the power of the input signal or performing complex arithmetic operations.
SOLUTION: When an analog sound signal of vibration generated by a sound is input, it is converted into a digital sound signal by sampling and quantization (step 101). A two-dimensional image in which the amplitude values are plotted as a time series is created from the digital sound signal acquired in step 101 and used as binary image data (step 102). The binary image data created in step 102 is subjected to contraction processing, which changes the value of a pixel of interest from 1 to 0 if one or more of a predetermined number of surrounding pixels has the value 0; this deletes the noise sections present in the binary image data and extracts the sound signal section as an agglomerated region (step 103). The start point and end point of the agglomerated region in the binary image data are then detected along the time axis to identify the sound signal section extracted in step 103 (step 106).

Description

The present invention relates to a sound signal section extraction device and a sound signal section extraction method that remove noise components contained in an input sound signal and extract the signal section of a desired sound signal.

Various voice section detection means have conventionally been known for removing noise components contained in an input sound signal and extracting the signal section of a desired sound signal. For example, speech recognition devices and speech section detection devices are known that separate a signal section, which is a section of normal sound such as human speech or the operating sound of a machine, from a noise section, which is a section of noise other than normal sound. They do so by setting a threshold on the power of the input signal and comparing that threshold with the input signal power (see, for example, Patent Document 1 and Patent Document 2).

Also known, for separating signal sections from noise sections, are the following: the zero-crossing method, which sets the threshold to be compared with the input signal power on the basis of the number of times the signal crosses the zero level within a fixed time range (see, for example, Patent Document 3); the speech spectral envelope, which is used to normalize the input speech signal and thereby improve the speech recognition rate (see, for example, Patent Document 4); and autocorrelation, which is used as a feature (periodicity information of the input signal) for capturing speech in order to detect signal sections (see, for example, Patent Document 5).

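Patent Document 1: Japanese Examined Patent Publication No. S63-29754
Patent Document 2: Japanese Unexamined Patent Publication No. S58-130395
Patent Document 3: Japanese Unexamined Patent Publication No. H5-165496
Patent Document 4: Japanese Examined Patent Publication No. H1-36959
Patent Document 5: Japanese Unexamined Patent Publication No. 2007-328228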

However, the speech recognition device described in Patent Document 1 and the speech section detection device described in Patent Document 2 are effective only for so-called stationary noise, whose physical characteristics such as magnitude do not vary greatly over time, and for noise whose power is relatively small compared with the signal section. With non-stationary noise, whose magnitude fluctuates irregularly over time, the noise may exceed the threshold, so a noise section may be misrecognized as a signal section.

In contrast, the speech detection device described in Patent Document 3, the speech recognition device described in Patent Document 4, and the signal processing device described in Patent Document 5 use other physical indices and can therefore detect signal sections more accurately than the devices of Patent Documents 1 and 2. Their drawback is that they require complex arithmetic processing, and hence a processor capable of handling that processing.

The present invention has been made to overcome these conventional drawbacks. Its object is to provide a sound signal section extraction device and a sound signal section extraction method that can accurately extract a sound signal section, even from an input sound signal containing non-stationary noise, without setting a threshold on the power of the input signal or performing complex arithmetic processing.

A sound signal section extraction device according to a first aspect of the present invention, which achieves the above object, comprises: a sound information detection unit that detects vibration generated by sound and converts it into an analog sound signal; a signal input unit that, when the analog sound signal detected by the sound information detection unit is input, converts it into a digital sound signal by sampling and quantization; an image creation unit that creates, from the digital sound signal acquired by the signal input unit, a two-dimensional image in which the amplitude values are plotted as a time series, and uses it as binary image data; an image processing unit having a contraction processing function that applies, to the binary image data created by the image creation unit, contraction processing that changes the value of a pixel of interest from 1 to 0 if one or more of a predetermined number of peripheral pixels has the value 0, thereby deleting the noise sections present in the binary image data and extracting the sound signal section as an agglomerated region; and a sound signal section determination unit that detects, along the time axis, the start point and end point of the agglomerated region in the binary image data in order to identify the sound signal section extracted by the image processing unit.

According to a second aspect of the present invention, in the sound signal section extraction device of the first aspect, the image processing unit has an expansion processing function that applies, to the agglomerated region in the binary image data, expansion processing that changes the value of a pixel of interest from 0 to 1 if one or more of a predetermined number of peripheral pixels has the value 1, thereby restoring the parts of the region that were deleted by the contraction processing of the contraction processing function.

According to a third aspect of the present invention, in the sound signal section extraction device of the second aspect, the image processing unit extracts the agglomerated region of the sound signal section in the binary image data by morphological operations.

A sound signal section extraction method according to a fourth aspect of the present invention comprises: a first step of converting an input analog sound signal of vibration generated by sound into a digital sound signal by sampling and quantization; a second step of creating, from the digital sound signal acquired in the first step, a two-dimensional image in which the amplitude values are plotted as a time series, and using it as binary image data; a third step of applying, to the binary image data created in the second step, contraction processing that changes the value of a pixel of interest from 1 to 0 if one or more of a predetermined number of peripheral pixels has the value 0, thereby deleting the noise sections present in the binary image data and extracting the sound signal section as an agglomerated region; and a fourth step of detecting, along the time axis, the start point and end point of the agglomerated region in the binary image data in order to identify the sound signal section extracted in the third step.

According to a fifth aspect of the present invention, in the sound signal section extraction method of the fourth aspect, the third step includes applying, to the agglomerated region in the binary image data, expansion processing that changes the value of a pixel of interest from 0 to 1 if one or more of a predetermined number of peripheral pixels has the value 1, thereby restoring the parts of the region that were deleted by the contraction processing.

The sound signal section extraction device of the first aspect and the sound signal section extraction method of the fourth aspect do not remove noise components from the input sound signal on the basis of physical features such as signal power or spectral information, as conventional voice section detection means do. Instead, they provide a previously unavailable means of extracting sound signal sections on the basis of the visual information by which a person can recognize the presence of a sound signal section in the signal waveform. Specifically, to turn the waveform of the sound signal into visual information that a person can grasp, the waveform is drawn as a graph whose horizontal axis (X axis) is time and whose vertical axis (Y axis) shows positive and negative amplitude values above and below a zero line at the center of the axis. When the waveform is graphed in this way, a person visually judges that an agglomerated region, in which the amplitude is large in both vertical directions and which extends continuously along the horizontal axis, is a sound signal section.

The device of the first aspect and the method of the fourth aspect apply this human visual image recognition. The analog sound signal detected by the sound information detection unit is converted into a digital sound signal by the signal input unit and sent to the image creation unit, which creates binary image data from the digital sound signal. Because this binary image data is an image of a digital sound signal that contains noise components, applying contraction processing with the contraction processing function of the image processing unit deletes the noise sections present in the binary image data and extracts the sound signal section as an agglomerated region. Even when the noise component of the sound signal is non-stationary, it either has a low amplitude or, if its amplitude is high, forms a long, thin region that is narrow along the time axis. Applying contraction processing, which changes the value of a pixel of interest from 1 to 0 if one or more of a predetermined number of peripheral pixels has the value 0, to the binary image data of a sound signal containing noise therefore removes the noise components and leaves the sound signal section as an agglomerated region. The sound signal section determination unit then detects, along the time axis, the start point and end point of the agglomerated region in the binary image data. In this way the sound signal section can be extracted, even from an input sound signal containing non-stationary noise, without setting a threshold on the power of the input signal or performing complex arithmetic processing.

In the device of the second aspect and the method of the fifth aspect, the expansion processing function of the image processing unit applies expansion processing to the agglomerated region left in the binary image data by the contraction processing, changing the value of a pixel of interest from 0 to 1 if one or more of a predetermined number of peripheral pixels has the value 1. This restores the parts of the region that were deleted, which improves the extraction accuracy of the sound signal section.

In the device of the third aspect, the image processing unit extracts the agglomerated region of the sound signal section in the binary image data by morphological operations, so that the contraction processing of the contraction processing function and the expansion processing of the expansion processing function can be used in combination.

According to the sound signal section extraction device and the sound signal section extraction method of the present invention, a sound signal section can be extracted accurately, even from an input sound signal containing non-stationary noise, without setting a threshold on the power of the input signal or performing complex arithmetic processing.

FIG. 1 is a block diagram of a system configuration showing a preferred embodiment of the sound signal section extraction device of the present invention.
FIG. 2 is a flowchart showing a preferred embodiment of the sound signal section extraction method of the present invention.
FIG. 3 illustrates the sound signal section extraction method of the present invention: (A) is an amplitude pattern image of an input sound signal, (B) is the image obtained by applying contraction processing to the amplitude pattern image of (A), and (C) is the image obtained by applying expansion processing to the image of (B).
FIG. 4 is an explanatory diagram showing the relationship between the pixel of interest and the peripheral pixels in the contraction processing and expansion processing of an image.
FIG. 5 is an explanatory diagram showing the relationship between the amplitude pattern image of FIG. 3 and the image after expansion processing.

The best mode for carrying out the sound signal section extraction device and the sound signal section extraction method of the present invention is described below with reference to the drawings.

As shown in FIG. 1, the sound signal section extraction device of the present invention comprises: a sound information detection unit 11 that detects vibration generated by sound and converts it into an analog sound signal; a signal input unit 12 that includes an A/D conversion unit (not shown) and, when the analog sound signal detected by the sound information detection unit 11 is input, converts it into a digital sound signal by sampling and quantization; an image creation unit 13 that creates, from the digital sound signal acquired by the signal input unit 12, a two-dimensional image in which the amplitude values are plotted as a time series, and uses it as binary image data; an image processing unit 14 having a contraction processing function that applies contraction processing to the binary image data created by the image creation unit 13; and a sound signal section determination unit 15 that detects, along the time axis, the start point and end point of the agglomerated region in the binary image data in order to identify the sound signal section extracted by the image processing unit 14.

The sound information detection unit 11 is a microphone or vibration sensor capable of detecting vibration generated by sound and converting it into an analog sound signal. Because sound arises from vibration of the air, a microphone converts that air vibration into an electrical signal, so that the signal input unit 12 can acquire an analog waveform corresponding to the air vibration. A vibration sensor captures vibration quantitatively as displacement, velocity, or acceleration and converts the measured physical quantity into an electrical signal, so that the signal input unit 12 can acquire an analog waveform corresponding to the vibration.

To store the signal information of the digital sound signal, the signal input unit 12 uses, for example, an N-row, 2-column matrix in which the first column holds the sample point number, the second column holds the corresponding amplitude value, and the number of rows N is the total number of samples (sample points 0, 1, 2, ..., N−1). The sampling frequency F is also stored, so that the time of any sample point n (0 ≤ n ≤ N−1) can be calculated as n × (1/F). The first column of the signal information matrix therefore effectively stores time information.
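The storage layout described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names `build_signal_matrix` and `sample_time`, the amplitude values, and the 8000 Hz sampling frequency are all assumptions chosen for the example.

```python
def build_signal_matrix(amplitudes):
    """Return an N x 2 matrix: column 0 = sample point number, column 1 = amplitude."""
    return [[n, a] for n, a in enumerate(amplitudes)]


def sample_time(n, sampling_rate_hz):
    """Time in seconds of sample point n, computed as n * (1 / F)."""
    return n * (1.0 / sampling_rate_hz)


# Made-up quantized amplitude values (hypothetical data).
amplitudes = [0, 3, -2, 5]
matrix = build_signal_matrix(amplitudes)

# Assumed sampling frequency F in Hz; the patent does not fix a value.
F = 8000
t_last = sample_time(matrix[-1][0], F)  # time of the last sample point
```

Storing only the sample number and F keeps the matrix integer-valued while still allowing any pixel column to be mapped back to a time later on.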

The image creation unit 13 creates an image in which the sample points in the first column of the signal information matrix are laid out along the X axis, which serves as the time axis, and the amplitude values in the second column are laid out along the Y axis, which represents signal power. For example, an arbitrary range of sample points n_a to n_b (0 ≤ a < b ≤ N−1) is mapped to one pixel in the X-axis direction, and a number of pixels corresponding to the mean, median, or maximum of the amplitude values between n_a and n_b is drawn along the Y axis. Positive amplitude values are drawn above the center of the Y axis and negative amplitude values below it.

This image is a binary image in which the black portions representing the amplitude values are 1 and the remaining white background is 0. The image size in the X-axis direction is S_x = N / (n_b − n_a). If the original amplitude values are used as they are, the image size in the Y-axis direction, S_y, is the sum of the maximum amplitude value A_p and the minimum amplitude value A_m, but any size may be used. To obtain an arbitrary size, the relation S_y = (A_p + A_m) × C, which corresponds to image enlargement or reduction, is preferred. In that case the number of pixels corresponding to the mean, median, or maximum of the amplitude values in each range n_a to n_b is also multiplied by C before being drawn along the Y axis. C > 1 gives enlargement and 0 < C < 1 gives reduction.
Although the amplitude values are represented linearly here, they may also be represented nonlinearly.
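One possible reading of this image creation step is sketched below. The function name `waveform_to_binary_image` is hypothetical, and several details are assumptions rather than statements from the patent: the "maximum value" option is used for each block of samples, A_m is treated as the magnitude of the most negative amplitude, each column is filled solidly from the zero line outward, and row 0 is the top of the image.

```python
def waveform_to_binary_image(amplitudes, block, scale=1.0):
    """Rasterize a waveform into a binary image (list of rows, 1 = waveform).

    Each column covers `block` consecutive samples (n_b - n_a). A column is
    filled upward from the zero line to the block's largest positive
    amplitude and downward to its most negative amplitude.
    """
    peak_pos = max(0, max(amplitudes))    # A_p
    peak_neg = max(0, -min(amplitudes))   # A_m, taken as a magnitude (assumption)
    up = round(peak_pos * scale)          # rows above the zero line
    down = round(peak_neg * scale)        # rows below the zero line
    width = len(amplitudes) // block      # S_x = N / (n_b - n_a)
    image = [[0] * width for _ in range(up + down)]  # S_y = (A_p + A_m) * C
    for x in range(width):
        chunk = amplitudes[x * block:(x + 1) * block]
        hi = round(max(0, max(chunk)) * scale)
        lo = round(max(0, -min(chunk)) * scale)
        for k in range(hi):
            image[up - 1 - k][x] = 1      # pixels above the zero line
        for k in range(lo):
            image[up + k][x] = 1          # pixels below the zero line
    return image


# Hypothetical samples: a quiet block, a loud block, then silence.
image = waveform_to_binary_image([1, -1, 3, -2, 0, 0], block=2)
```

With `scale` playing the role of C, values above 1 stretch the image vertically and values between 0 and 1 compress it, as in the enlargement and reduction described above.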

The contraction processing function of the image processing unit 14 performs contraction processing (erosion), which changes the value of a pixel of interest from 1 to 0 if one or more of a predetermined number of peripheral pixels has the value 0. Contraction processing is effective for smoothing narrow convex shapes that protrude outward from the contour of an object in a binary image and for removing isolated points in the background of the object, so-called salt-and-pepper noise: if at least one peripheral pixel of the pixel of interest is 0, the pixel of interest is set to 0. The pixel coordinates x and y are restricted to 0 ≤ x < S_x − 1 and 0 ≤ y < S_y − 1 so that the contraction processing does not delete a large part of the sound signal section region. The extent of the peripheral pixels is arbitrary, but if it is too large the region containing the signal section may be deleted, so it is best to use the eight pixels adjacent to the pixel of interest, which is the minimum structuring element. That is, the center of a 3 × 3 pixel block is the pixel of interest, and when the pixel of interest is 1 and at least one of the eight surrounding pixels is 0, the pixel of interest is set to 0. When this contraction processing is applied repeatedly to the entire binary image data created by the image creation unit 13, the regions of value 1 shrink gradually; eventually the small regions disappear and only the regions that were originally large remain in the image.
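A single pass of the 3 × 3 contraction (erosion) rule can be sketched in plain Python as follows. The function name `erode` is hypothetical, and the border handling is an assumption: neighbours that fall outside the image are simply ignored, which is one of several possible conventions and is not specified in this form by the patent.

```python
def erode(image):
    """One 3x3 erosion pass on a binary image given as a list of rows.

    A pixel with value 1 is set to 0 if at least one of its (up to eight)
    in-image neighbours is 0. Off-image neighbours are ignored (assumption).
    """
    h, w = len(image), len(image[0])
    out = [row[:] for row in image]
    for y in range(h):
        for x in range(w):
            if image[y][x] != 1:
                continue
            # Scan the 3x3 window, clipped to the image bounds.
            for ny in range(max(0, y - 1), min(h, y + 2)):
                for nx in range(max(0, x - 1), min(w, x + 2)):
                    if image[ny][nx] == 0:
                        out[y][x] = 0
    return out


# Hypothetical test image: a 3x3 block plus one isolated pixel.
image = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0, 1],
    [0, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
eroded = erode(image)
```

In this example only the center of the 3 × 3 block survives, and the isolated pixel disappears, which is the behaviour the text relies on to remove small regions.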

In a noise section the amplitude is low, so the part of the image with value 1 is a small region and is consequently deleted by the contraction processing. A noise section may also contain amplitudes that become large for a short time, but the corresponding part of the image with value 1 is a long, thin region that is narrow in the X-axis direction, so it too is deleted by the contraction processing. In a signal section, by contrast, the area with value 1 is large and is not deleted by contraction processing. The contraction processing is repeated at least k times (k ≥ 1), until the region containing the target sound signal section can be extracted or the regions containing noise sections have been deleted.

Depending on the shape of the image, contraction processing may delete part of the region containing the sound signal section, so the image processing unit 14 should also be given an expansion processing function. This function performs expansion processing (dilation), which changes the value of a pixel of interest from 0 to 1 if one or more of a predetermined number of peripheral pixels has the value 1. Expansion processing is effective for smoothing narrow concave shapes that extend into the interior of an object in a binary image and for removing salt-and-pepper noise inside the object. It is repeated at least k times (k ≥ 1) on the entire binary image data created by the image creation unit 13 until the parts of the region deleted by the contraction processing reappear. Amplitude pattern regions that contain no sound information have been deleted by the contraction processing and do not reappear when expansion processing is applied. The regions of value 1 that remain in the image after expansion processing are therefore the regions in which sound signal sections exist. Applying expansion processing in this way to the agglomerated region left by the contraction processing restores the parts of the region that were deleted, which improves the extraction accuracy of the sound signal section.
The contraction processing and expansion processing of the image processing unit 14 are preferably implemented as morphological operations, because morphological operations allow contraction processing and expansion processing to be used in combination.
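The expansion (dilation) rule is the mirror image of the contraction rule and can be sketched the same way. The function name `dilate` is hypothetical, and ignoring neighbours that fall outside the image is an assumption of this sketch.

```python
def dilate(image):
    """One 3x3 dilation pass on a binary image given as a list of rows.

    A pixel with value 0 is set to 1 if at least one of its (up to eight)
    in-image neighbours is 1. Off-image neighbours are ignored (assumption).
    """
    h, w = len(image), len(image[0])
    out = [row[:] for row in image]
    for y in range(h):
        for x in range(w):
            if image[y][x] != 0:
                continue
            # Collect the values in the 3x3 window, clipped to the image.
            window = [
                image[ny][nx]
                for ny in range(max(0, y - 1), min(h, y + 2))
                for nx in range(max(0, x - 1), min(w, x + 2))
            ]
            if 1 in window:
                out[y][x] = 1
    return out


# Hypothetical input: a single surviving pixel grows back into a 3x3 block.
image = [
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
dilated = dilate(image)
```

Because dilation only grows regions that still exist, anything that contraction processing removed completely stays removed, which matches the statement that deleted noise regions do not reappear.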

Because the signal input unit 12, the image creation unit 13, and the image processing unit 14 have already extracted the agglomerated region of the binary image data that corresponds to the sound signal section, the sound signal section determination unit 15 can identify the sound signal section efficiently and accurately simply by detecting the start point and end point of that region along the time axis.
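Detecting the start and end points can be sketched as a scan over the image columns. The function name `find_sections` and its parameters are hypothetical, and two choices are assumptions of this sketch: a column counts as part of a region if it contains any pixel with value 1, and the end time is taken at the right-hand edge of the last active column.

```python
def find_sections(image, block, sampling_rate_hz):
    """Return (start_s, end_s) pairs for each run of columns containing a 1.

    `block` is the number of samples mapped to one pixel column, so column x
    starts at sample x * block, i.e. at time x * block / F.
    """
    width = len(image[0])
    active = [any(row[x] for row in image) for x in range(width)]
    sections = []
    start = None
    # A trailing False closes a run that reaches the last column.
    for x, on in enumerate(active + [False]):
        if on and start is None:
            start = x
        elif not on and start is not None:
            sections.append((start * block / sampling_rate_hz,
                             x * block / sampling_rate_hz))
            start = None
    return sections


# Hypothetical processed image: regions at columns 1-2 and column 5.
image = [
    [0, 1, 1, 0, 0, 1],
    [0, 1, 0, 0, 0, 1],
]
sections = find_sections(image, block=100, sampling_rate_hz=1000)
```

Here each column spans 100 samples at an assumed 1000 Hz, so the two regions map to roughly 0.1 to 0.3 s and 0.5 to 0.6 s.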

The sound signal section extraction method performed by the device configured as above is described below with reference to FIGS. 1 to 5. The image processing unit 14 is assumed to have both the contraction processing function and the expansion processing function.

The analog sound signal detected by the sound information detection unit 11 is converted into a digital sound signal by the signal input unit 12 and sent to the image creation unit 13 (step 101), and the image creation unit 13 creates binary image data from the digital sound signal (step 102). As shown in FIG. 3(A), this binary image data is a binary image in which the black portions representing the amplitude values are 1 and the remaining white background is 0. Because the binary image data is an image of a digital sound signal that contains noise components, applying contraction processing with the contraction processing function of the image processing unit 14 (step 103) deletes the noise sections present in the binary image data and extracts the sound signal section as an agglomerated region (step 104). As shown in FIG. 4, the contraction processing takes the central pixel e of a 3 × 3 pixel block as the pixel of interest; when the pixel of interest e is 1 and at least one of the eight surrounding pixels a, b, c, d, f, g, h, and i is 0, the pixel of interest is set to 0. Steps 103 and 104 are repeated until the regions containing noise sections have been deleted. The sound signal section that remains as an agglomerated region after the noise sections are deleted appears as shown in FIG. 3(B).

The image processing unit 14 can also use its expansion processing function to restore regions that were partially deleted by the contraction process, by applying the expansion process to the agglomerated region in the binary image data (step 105). As shown in FIG. 4, when the target pixel e is 0 and at least one of its eight surrounding pixels a, b, c, d, f, g, h, and i is 1, e is set to 1. Step 105 is repeated until the regions partially deleted by the contraction process have reappeared. The sound signal section, as an agglomerated region with those regions restored, appears as the image shown in FIG. 3(C).
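The expansion rule of step 105 mirrors the contraction rule and can be sketched the same way. The function name `dilate` and the list-of-lists image representation are assumptions made for illustration only; pixels outside the image are simply ignored.

```python
def dilate(image):
    """One expansion pass: a 0-pixel becomes 1 if any of its 8 neighbours is 1."""
    h, w = len(image), len(image[0])
    out = [row[:] for row in image]  # existing 1-pixels are kept unchanged
    for y in range(h):
        for x in range(w):
            if image[y][x] == 1:
                continue
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < h and 0 <= nx < w and image[ny][nx] == 1:
                        out[y][x] = 1
    return out


# A single 1-pixel in the middle of a 5 x 5 image grows into a 3 x 3 block.
image = [[0] * 5 for _ in range(5)]
image[2][2] = 1
for row in dilate(image):
    print("".join("#" if v else "." for v in row))
```

Each pass reads only from the input image and writes to a copy, so newly set pixels do not trigger further growth within the same pass.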

The reason for applying an opening operation, in which the image is contracted n times and then expanded n times, is that an image obtained by binarizing the waveform of the input sound signal does not exhibit so-called salt-and-pepper noise but does exhibit unevenness caused by variations in the amplitude values. Impulse noise, where the amplitude value is large in a pulse-like manner, appears as a protrusion that is narrow in the X-axis direction, and portions outside the sound signal section appear as regions that are narrow in the Y-axis direction. The contraction process removes both kinds of noise, so that only the sound signal section, as an agglomerated region, remains in the image. However, the width in the X-axis direction of the sound signal section extracted by the contraction process has been reduced in proportion to the number of contractions, so the expansion process must be applied the same number of times to return the region to its original X-axis width.
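The opening operation described above (n contractions followed by n expansions) can be sketched as follows. The helper names and the fixed parameter `n` are assumptions: the text repeats contraction until the noise regions are gone, whereas this sketch takes the count as an argument. Off-image pixels are again treated as 0.

```python
def _neighbours(image, y, x):
    """Values of the 8 pixels around (y, x); off-image pixels count as 0 (assumption)."""
    h, w = len(image), len(image[0])
    return [
        image[y + dy][x + dx] if 0 <= y + dy < h and 0 <= x + dx < w else 0
        for dy in (-1, 0, 1)
        for dx in (-1, 0, 1)
        if (dy, dx) != (0, 0)
    ]


def erode(image):
    """Contraction: keep a 1-pixel only if none of its 8 neighbours is 0."""
    return [[1 if v == 1 and 0 not in _neighbours(image, y, x) else 0
             for x, v in enumerate(row)] for y, row in enumerate(image)]


def dilate(image):
    """Expansion: set a pixel to 1 if it is 1 or any of its 8 neighbours is 1."""
    return [[1 if v == 1 or 1 in _neighbours(image, y, x) else 0
             for x, v in enumerate(row)] for y, row in enumerate(image)]


def opening(image, n):
    """Apply n contraction passes, then n expansion passes."""
    for _ in range(n):
        image = erode(image)
    for _ in range(n):
        image = dilate(image)
    return image


# Toy image: a 1-pixel-wide impulse in column 2 and a 5 x 5 block in columns 5 to 9.
image = [[0] * 12 for _ in range(7)]
for y in range(7):
    image[y][2] = 1
for y in range(1, 6):
    for x in range(5, 10):
        image[y][x] = 1

opened = opening(image, 1)
for row in opened:
    print("".join("#" if v else "." for v in row))
```

With n = 1 the narrow impulse column is removed and the block returns to its original 5-column width, which is the behaviour the paragraph above relies on. A region narrower than 2n + 1 pixels in either direction would be removed entirely, so the choice of n trades noise rejection against the shortest sound section that can be kept.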

In the image of the sound signal section obtained in this way, the sound signal section determination unit 15 detects the start point A and the end point B of the agglomerated region in the binary image data in time series, as shown in FIG. 5 (step 106). With this image processing, even for an input sound signal containing non-stationary noise, the sound signal section can be extracted as an agglomerated region efficiently and accurately, without setting a threshold on the power of the input signal or performing complex arithmetic processing.
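Detecting the start point A and end point B in step 106 amounts to scanning the processed image along the time (X) axis. The sketch below is one hypothetical way to do that; the function name `find_sections` is an assumption, and returning several (start, end) pairs is an extension for images that contain more than one agglomerated region, which the text does not discuss.

```python
def find_sections(image):
    """Return inclusive (start, end) column index pairs for each run of
    consecutive columns that contain at least one 1-pixel."""
    width = len(image[0])
    occupied = [any(row[x] == 1 for row in image) for x in range(width)]
    sections = []
    start = None
    for x, on in enumerate(occupied):
        if on and start is None:
            start = x                      # start point A of a region
        elif not on and start is not None:
            sections.append((start, x - 1))  # end point B is the previous column
            start = None
    if start is not None:                  # region runs to the right edge
        sections.append((start, width - 1))
    return sections


# Toy image with one region spanning columns 3 to 6.
image = [[0] * 10 for _ in range(3)]
for x in range(3, 7):
    image[1][x] = 1
print(find_sections(image))
```

Converting the column indices to times requires knowing how many samples each image column represents, which depends on how the image was created and is not specified here.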

The sound signal section extraction device and sound signal section extraction method of the present invention can improve the extraction accuracy of sound signal sections when used together with conventional speech section detection means.
The object of the present invention can also be achieved by having a computer read and execute a recording medium on which is recorded the program code of software that implements the functions of the sound signal section extraction device and sound signal section extraction method of the present invention.

Although the present invention has been described with reference to the specific embodiment shown in the drawings, it is not limited to that embodiment; it goes without saying that any known configuration may be adopted as long as the effects of the present invention are obtained.

The sound signal section extraction device and sound signal section extraction method of the present invention can, for example, prevent misrecognition caused by noise when the operating sound of a machine is analyzed in order to detect early signs of an abnormality in the machine.

DESCRIPTION OF SYMBOLS
1 ... Sound signal section extraction device
11 ... Sound information detection unit
12 ... Signal input unit
13 ... Image creation unit
14 ... Image processing unit
15 ... Sound signal section determination unit

Claims

A sound signal section extraction device comprising:
a sound information detection unit that detects vibration generated by sound and converts it into an analog sound signal;
a signal input unit that, when the analog sound signal of the sound detected by the sound information detection unit is input, converts it into a digital sound signal by sampling and quantizing;
an image creation unit that creates binary image data by creating a two-dimensional image in which amplitude values are graphed in time series on the basis of the digital sound signal acquired by the signal input unit;
an image processing unit having a contraction processing function that applies, to the binary image data created by the image creation unit, a contraction process that converts the value of a target pixel from 1 to 0 when one or more of a predetermined number of surrounding pixels has the value 0, thereby deleting noise sections present in the binary image data and extracting a sound signal section as an agglomerated region; and
a sound signal section determination unit that detects, in time series, the start point and the end point of the agglomerated region in the binary image data in order to identify the sound signal section extracted by the image processing unit.

The sound signal section extraction device according to claim 1, wherein the image processing unit further has an expansion processing function that applies, to the agglomerated region in the binary image data, an expansion process that converts the value of a target pixel from 0 to 1 when one or more of the predetermined number of surrounding pixels has the value 1, thereby restoring regions partially deleted by the contraction process performed by the contraction processing function.

The sound signal section extraction device according to claim 2, wherein the image processing unit extracts the agglomerated region of the sound signal section in the binary image data by morphological operation processing.

A sound signal section extraction method comprising:
a first step of, when an analog sound signal of vibration generated by sound is input, converting it into a digital sound signal by sampling and quantizing;
a second step of creating binary image data by creating a two-dimensional image in which amplitude values are graphed in time series on the basis of the digital sound signal acquired in the first step;
a third step of applying, to the binary image data created in the second step, a contraction process that converts the value of a target pixel from 1 to 0 when one or more of a predetermined number of surrounding pixels has the value 0, thereby deleting noise sections present in the binary image data and extracting a sound signal section as an agglomerated region; and
a fourth step of detecting, in time series, the start point and the end point of the agglomerated region in the binary image data in order to identify the sound signal section extracted in the third step.

The sound signal section extraction method according to claim 4, wherein the third step further includes applying, to the agglomerated region in the binary image data, an expansion process that converts the value of a target pixel from 0 to 1 when one or more of the predetermined number of surrounding pixels has the value 1, thereby restoring regions partially deleted by the contraction process.