JPS63289599A - Segmentation apparatus

Info

Publication number: JPS63289599A
Application number: JP62125439A
Authority: JP
Inventors: 浩明服部
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 1987-05-21
Filing date: 1987-05-21
Publication date: 1988-11-28

Abstract

(57) [Abstract] This publication consists of application data from before electronic filing, so no abstract data is recorded.

Description

[Detailed Description of the Invention]

(Field of Industrial Application) The present invention relates to an improvement of a segmentation apparatus for speech recognition of unspecified speakers.

(Prior Art and Its Problems) In speech recognition methods that use segments smaller than words, such as syllables or phonemes, the accuracy with which input speech is divided into segments, that is, the segmentation accuracy, has a very large effect on the recognition rate. When the recognition target is a specific speaker, features that depend on that speaker can be used. When the target is unspecified speakers, however, the acoustic features differ from speaker to speaker, so a conventional segmentation apparatus that divides the input speech into segments using only the acoustic features of that input speech sometimes produces extreme segmentation errors. This has been the problem with conventional apparatus.

The present invention therefore aims to avoid extreme segmentation errors and to perform segmentation accurately by estimating in advance the range in which each segment boundary lies.

(Means for Solving the Problems) To solve the above problems, the present invention provides an apparatus that performs speech recognition using segments smaller than words, such as syllables or phonemes, characterized by comprising: means for extracting acoustic feature parameters from input speech; means for storing, as word standard patterns, feature parameters, segment boundaries, and features of the segment boundaries; means for temporally associating the input speech with the word standard pattern using the acoustic feature parameters and determining the time point in the input speech that corresponds to each segment boundary of the word standard pattern; and means for searching for the segment boundary of the input speech within a certain range around that time point using the features of the segment boundary.

(Operation) The input speech is converted into a time series of feature parameters by the acoustic feature parameter extraction means. Here, a feature parameter is a parameter that reflects acoustic characteristics. Using the pattern matching method (hereinafter referred to as DP matching) described in Sakoe and Chiba, "Continuous word recognition based on time normalization of speech using dynamic programming," Journal of the Acoustical Society of Japan, Vol. 27, No. 1, pp. 483-490 (1971-09), the feature parameter time series of the input speech can be temporally associated with a word standard pattern registered in advance. Therefore, if the result of dividing the standard pattern into segments is stored in advance, the time point in the input speech that corresponds to each segment boundary of the standard pattern (hereinafter referred to as a segment boundary candidate) can be obtained. The actual segment boundary is considered to lie in the vicinity of the segment boundary candidate obtained in this way, so segmentation can be performed efficiently by searching for the segment boundary in a separately determined interval that contains the segment boundary candidate. Furthermore, when searching for the segment boundary near the segment boundary candidate, the segment being sought is already known, so the segment boundary can be found by selecting appropriate feature parameters.

In the case of a specific speaker, it is of course also possible to perform word recognition using the distance to the word standard pattern obtained as a result of DP matching. In the case of unspecified speakers, however, sufficient recognition accuracy cannot be obtained this way, so here DP matching is used only to temporally associate the input speech with the standard pattern. Consequently, fewer feature parameters need to be stored than when performing word recognition, and the amounts of storage and computation can be reduced.

(Embodiment) FIG. 1 is a block diagram showing an embodiment of the present invention, FIG. 2 is a diagram illustrating a method of temporally associating input speech with a standard pattern, and FIG. 3 is a diagram for explaining an example of the segment boundary determination method according to the present invention. The operation is briefly described below.

Suppose that the input speech "3" (pronounced "san" in Japanese) is to be segmented into three segments, /s/, /a/, and /N/. The input speech is converted into a time series of feature parameters in the acoustic analysis unit 1. The feature parameters may be any parameters that reflect acoustic characteristics; examples include filter bank outputs, mel-cepstral coefficients, zero-crossing counts, and formant frequencies. The feature parameters of each word, its segment boundaries, and the features of those segment boundaries are registered in advance in the storage unit 2 as word standard patterns. The matching unit 3 reads a word standard pattern from the storage unit 2 and matches it against the output of the analysis unit 1.
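As a rough illustration of what the acoustic analysis unit 1 might output, the following sketch computes two simple per-frame features, log energy and zero-crossing count, from a list of waveform samples. The function name `frame_features`, the frame length, and the hop size are assumptions made for this example; the source does not specify which features or framing parameters are actually used.

```python
import math


def frame_features(samples, frame_len=256, hop=128):
    """Return one (log_energy, zero_crossings) tuple per analysis frame.

    Hypothetical helper: frame_len and hop are illustrative values only.
    """
    features = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        # Short-time energy; the small constant avoids log(0) on silence.
        energy = sum(x * x for x in frame)
        log_energy = math.log(energy + 1e-10)
        # Count sign changes between consecutive samples (0 counts as non-negative).
        zero_crossings = sum(
            1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0)
        )
        features.append((log_energy, zero_crossings))
    return features
```

A fricative such as /s/ would typically show a much higher zero-crossing count than a vowel such as /a/, which is one reason such a feature can be useful near that boundary.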

After optimization by DP matching, tracing back the path that gives the optimal value associates the input speech with the standard pattern, and the segment boundary candidates can be obtained. FIG. 2 shows this temporal association. In FIG. 2, 21 represents the energy envelope of the word standard pattern, 22 represents the energy envelope of the input speech, 23 represents the matching plane, and 24 represents the path that gives the optimal value. The broken lines in the figure represent the associated segment boundaries.
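The alignment and backtracking step can be sketched as follows. This is a minimal dynamic time warping example, not the exact DP matching formulation of the cited paper: it assumes scalar features, an absolute-difference local distance, and a basic three-way step pattern with no slope constraint. The names `dtw_path` and `boundary_candidates` are hypothetical.

```python
def dtw_path(ref, inp):
    """Align two scalar feature sequences and return the optimal path.

    The path is a list of (ref_frame, input_frame) pairs from (0, 0)
    to (len(ref) - 1, len(inp) - 1).
    """
    n, m = len(ref), len(inp)
    inf = float("inf")
    cost = [[inf] * m for _ in range(n)]
    for r in range(n):
        for t in range(m):
            d = abs(ref[r] - inp[t])  # assumed local distance
            if r == 0 and t == 0:
                cost[r][t] = d
                continue
            best = min(
                cost[r - 1][t] if r > 0 else inf,
                cost[r][t - 1] if t > 0 else inf,
                cost[r - 1][t - 1] if r > 0 and t > 0 else inf,
            )
            cost[r][t] = d + best

    # Trace back from the end along the cheapest predecessors.
    r, t = n - 1, m - 1
    path = [(r, t)]
    while (r, t) != (0, 0):
        candidates = []
        if r > 0 and t > 0:
            candidates.append((cost[r - 1][t - 1], r - 1, t - 1))
        if r > 0:
            candidates.append((cost[r - 1][t], r - 1, t))
        if t > 0:
            candidates.append((cost[r][t - 1], r, t - 1))
        _, r, t = min(candidates)
        path.append((r, t))
    path.reverse()
    return path


def boundary_candidates(path, ref_boundaries):
    """Map each stored boundary frame of the standard pattern to the
    first input frame aligned with it on the path."""
    first_match = {}
    for r, t in path:
        first_match.setdefault(r, t)
    return [first_match[b] for b in ref_boundaries]
```

For example, with a standard pattern `[0, 0, 5, 5, 0]` whose boundaries are stored at frames 2 and 4, and a longer input `[0, 0, 0, 5, 5, 5, 0, 0]`, the candidates fall at input frames 3 and 6.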

There are various methods for determining the interval in which to search for a segment boundary. Suppose the segment boundary candidate is at time point i.

1) A method using a fixed value.

For a certain value j, the interval is [i-j, i+j].

2) A method in which 1) is modified to reflect the difference in duration from the standard pattern.

When the input speech has a duration α times that of the standard pattern, the interval is [i-j×α, i+j×α].

Other methods are also conceivable.
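The two interval rules above can be written as one small function, since method 1) is method 2) with α = 1. This is only a sketch: the name `search_interval`, the rounding of j×α to whole frames, and the clamping of the interval to the utterance are assumptions not stated in the source.

```python
def search_interval(i, j, n_frames, alpha=1.0):
    """Return the inclusive frame interval (i1, i2) around candidate i.

    alpha is the ratio of the input duration to the standard pattern
    duration; alpha=1.0 gives the fixed-width method 1).
    """
    half_width = int(round(j * alpha))
    i1 = max(0, i - half_width)             # assumed: clamp to first frame
    i2 = min(n_frames - 1, i + half_width)  # assumed: clamp to last frame
    return i1, i2
```

With `i=10` and `j=3`, the fixed method gives `(7, 13)`, while an input spoken twice as slowly as the standard pattern (`alpha=2.0`) widens the interval to `(4, 16)`.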

When searching for the segment boundary in the interval [i1, i2] determined in this way, the acoustic features of the segment boundary being sought can be taken into account. In the example above, when searching for the boundary between the segments /s/ and /a/, one can rely on the knowledge that in the segment /s/ energy is concentrated at frequencies of 3 kHz and above: the amount of change in the ratio of the energy at 3 kHz and above to the energy over the entire frequency range is computed, and the time point at which it takes its minimum value within the interval [i1, i2] is taken as the segment boundary. Such features of segment boundaries are stored in the storage unit 2.
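The /s/ to /a/ boundary search described above might look like the following sketch. It assumes each frame is given as a list of filter bank band energies, that the bands from index `first_high_band` upward cover 3 kHz and above, and that the "amount of change" is the simple frame-to-frame difference of the ratio. These choices and the function names are assumptions for illustration.

```python
def high_band_ratio(band_energies, first_high_band):
    """Ratio of the energy in the high bands to the total energy of one frame."""
    total = sum(band_energies)
    if total == 0:
        return 0.0
    return sum(band_energies[first_high_band:]) / total


def find_boundary(ratios, i1, i2):
    """Return the frame in [i1, i2] where the frame-to-frame change of the
    ratio is smallest (most negative), or None if the interval is empty."""
    best_t, best_delta = None, None
    for t in range(max(i1, 1), i2 + 1):
        delta = ratios[t] - ratios[t - 1]  # assumed definition of the change
        if best_delta is None or delta < best_delta:
            best_t, best_delta = t, delta
    return best_t
```

Because the high-frequency ratio drops sharply when the fricative /s/ gives way to the vowel /a/, the most negative change marks the transition. Restricting the search to [i1, i2] keeps a similar drop elsewhere in the utterance from being picked by mistake.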

The boundary determination unit 4 determines the search interval from the result of the matching unit 3, reads the features of the segment boundary from the storage unit 2, and determines the segment boundary from the analysis result supplied by the analysis unit 1. In FIG. 3, 31 indicates the energy envelope of the input speech, and 32 indicates the amount of change in the ratio of the energy at 3 kHz and above to the energy over the entire frequency range. In FIG. 3, i is the segment boundary candidate, and i' is the segment boundary, obtained as the time point within the interval [i1, i2] at which the amount of change 32 takes its minimum value.

(Effects of the Invention) As described above, the apparatus according to the present invention can segment the speech of unspecified speakers with high accuracy, making it possible to realize a highly accurate speech recognition apparatus.

[Brief Description of the Drawings]

FIG. 1 is a block diagram showing an embodiment of the present invention, FIG. 2 is a diagram illustrating a method of temporally associating input speech with a standard pattern, and FIG. 3 is a diagram for explaining an example of the segment boundary determination method according to the present invention. In the figures, 1 is the acoustic analysis unit, 2 is the word standard pattern storage unit, 3 is the matching unit, 4 is the boundary determination unit, 21 is the energy envelope of the input speech, 22 is the energy envelope of the word standard pattern, 23 is the matching plane, 24 is the matching path, 31 is the energy envelope of the input speech, and 32 is the amount of change in the ratio of the energy at 3 kHz and above to the energy over the entire frequency range.

Claims

[Claims]

In a device that performs speech recognition using segments smaller than words, such as syllables or phonemes, a segmentation apparatus characterized by comprising: means for extracting acoustic feature parameters from input speech; means for storing, as word standard patterns, feature parameters, segment boundaries, and features of the segment boundaries; means for temporally associating the input speech with the word standard pattern using the acoustic feature parameters and determining the time point in the input speech that corresponds to the segment boundary of the word standard pattern; and means for searching for the segment boundary of the input speech within a certain range from that time point using the features of the segment boundary.