JPH0466040B2 - Voice pattern generator - Google Patents

Info

Publication number
JPH0466040B2
Authority
JP
Japan
Prior art keywords
threshold
band
value
pattern
section
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired
Application number
JP17893583A
Other languages
Japanese (ja)
Other versions
JPS6069699A (en)
Inventor
Masahide Yoneyama
Junichiro Fujimoto
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ricoh Co Ltd
Original Assignee
Ricoh Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ricoh Co Ltd filed Critical Ricoh Co Ltd
Priority to JP17893583A priority Critical patent/JPS6069699A/en
Publication of JPS6069699A publication Critical patent/JPS6069699A/en
Publication of JPH0466040B2 publication Critical patent/JPH0466040B2/ja
Granted legal-status Critical Current

Description

DETAILED DESCRIPTION OF THE INVENTION

Technical Field

The present invention relates to the creation of speech patterns in a speech recognition apparatus.

Prior Art

For word speech recognition, matching based on the DP method (dynamic programming) is currently in general use, but it has the drawback of requiring a large amount of computation. A method has therefore been proposed in which the speech is represented as a time-frequency pattern (TSP) and matched.

FIG. 1 is a block diagram showing an example of the above TSP method. In the figure, 1 is a microphone, 2 is a switch, 3a and 3b are amplifiers, 4a and 4b are filter banks, 5a and 5b are binarization circuits, 6 is a dictionary section, 7 is a similarity matching section, 8 is a comparison section, and 9 is a result output section. In this method, a pattern obtained by binarizing a word TSP at a certain threshold is registered as a standard pattern; an unknown input pattern is binarized at a different threshold and superimposed on the registered pattern, and the similarity is computed from the degree of overlap. Because a wide pattern binarized at a low threshold is overlaid on a narrow pattern binarized at a high threshold, this matching absorbs fluctuations in the frequency direction, which makes it well suited to speaker-independent recognition.
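
As a rough illustration of this prior-art matching step, the following Python sketch binarizes a time-frequency pattern at two thresholds and scores similarity by counting overlapping cells. The array shapes, threshold values, and the normalized-overlap score are assumptions made for the example; the patent does not define the exact similarity measure.

```python
import numpy as np

def binarize(tsp, threshold):
    """Binarize a time-frequency pattern (frames x bands) at a fixed threshold."""
    return (tsp >= threshold).astype(np.uint8)

def overlap_similarity(reference_bin, input_bin):
    """Score similarity as the fraction of reference '1' cells covered by the input.

    Both patterns are assumed to already have the same shape (e.g. after
    time normalization); this normalized overlap count is only one plausible score.
    """
    overlap = np.logical_and(reference_bin, input_bin).sum()
    return overlap / max(reference_bin.sum(), 1)

# Example: register a word at a low threshold, then match an unknown input
# binarized at a higher threshold (hypothetical values).
rng = np.random.default_rng(0)
tsp_reference = rng.random((20, 16))   # 20 frames x 16 filter bands
tsp_unknown = tsp_reference + 0.05 * rng.standard_normal((20, 16))

reference_pattern = binarize(tsp_reference, threshold=0.4)   # wide pattern
unknown_pattern = binarize(tsp_unknown, threshold=0.6)       # narrow pattern
print(overlap_similarity(reference_pattern, unknown_pattern))
```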

FIG. 2 shows one time sample of the TSP of a certain word, with frequency on the horizontal axis and level on the vertical axis. Because the vocal-cord (sound source) characteristic falls off as frequency rises, the level in the high-frequency range of the frequency pattern of an uttered word also drops. When this sample is binarized into "1" and "0" with thresholds L1 and L2, the regions that become "1" differ even for the same pattern, as shown in FIGS. 2b and 2c. One conceivable remedy is to give the threshold a slope close to the sound source characteristic. This reduces the differences between binarized patterns of the kind shown in FIG. 2, but it still has the drawback that the patterns differ when, as shown in FIG. 3, the peaks of F2 and F3 are small compared with that of F1.

Purpose

The present invention has been made in view of the circumstances described above, and its particular object is to prevent peaks on the TSP from being lost during binarization.

Configuration

The configuration of the present invention will be described below on the basis of embodiments.

The present invention detects each peak (convex portion) or dip (concave portion) on the pattern and determines the binarization threshold by multiplying the peak or dip value by a constant, or by adding or subtracting a constant.
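
Put concretely, the threshold attached to a detected peak (or dip) can be derived in either of two ways. The small sketch below shows both; the factor k and offset c are purely illustrative values, not taken from the patent.

```python
def threshold_from_peak(peak_value, k=None, c=0.0):
    """Binarization threshold derived from a detected peak value:
    either a constant fraction of the peak (multiplicative, k) or the peak
    lowered by a constant (additive, c). For a dip, the threshold would
    instead be raised: k * dip_value or dip_value + c."""
    if k is not None:
        return k * peak_value   # e.g. k = 0.8
    return peak_value - c       # e.g. c = 3.0 (dB)
```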

FIG. 4 is a configuration diagram for explaining one embodiment of the present invention. In the figure, 10 is a microphone, 11 is a filter bank, 12 is a speech-interval extraction section, 13 is a peak detection section (local-maximum detection section), 14 is a register, 15 is a binarization section, 16 is a threshold determination section, 17 is a dictionary section, 18 is a similarity matching section, and 19 is a result display section. First, the word speech input through the microphone 10 is converted into frequency feature parameters by the band-pass filter bank 11, and only the portion corresponding to speech is extracted. The peak detection section 13 then detects peaks in the frequency direction for each time sample of this signal. A peak can be detected, for example, as the point at which the sign of the difference between adjacent filter outputs reverses. The detected peak values are sent to the threshold determination section 16, where the threshold is set to a value (solid line in FIG. 5) lower by a fixed amount than the straight line connecting adjacent peaks (broken line in FIG. 5). If the binarization threshold is to differ between dictionary creation and recognition, the fixed amount need only be changed accordingly. After the dictionary has been registered in this way, an unknown input is binarized, the matching section computes its similarity to each word registered in the dictionary, and the word with the maximum similarity is output as the recognition result. The similarity is obtained by superimposing the binarized time-frequency pattern of the unknown speech on the corresponding dictionary pattern and measuring the degree of overlap. If the patterns to be superimposed differ in duration, their lengths may be equalized, for example by linear expansion or contraction; when the dictionary pattern is the sum of binary patterns from several utterances of the same word, equalizing the lengths may not be necessary. It is also possible to replace the peak detection of FIG. 4 with dip detection and to set the threshold to a value higher than the dips by a fixed amount, and the same effect can be obtained in this way as well.
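
A minimal sketch of the peak-based threshold rule just described, for a single spectral slice from the filter bank: detect local maxima as sign reversals of the adjacent-band differences, interpolate a straight line between adjacent peaks, and lower it by a fixed offset to obtain the per-band threshold. The function names, the treatment of the end bands as anchor points, and the offset value are assumptions made for illustration.

```python
import numpy as np

def detect_peaks(spectrum):
    """Band indices where the adjacent-band difference changes sign from + to -,
    i.e. local maxima of one spectral slice."""
    diff_sign = np.sign(np.diff(spectrum))
    peaks = [i for i in range(1, len(spectrum) - 1)
             if diff_sign[i - 1] > 0 and diff_sign[i] < 0]
    # Assumption: treat the first and last band as anchor points so that the
    # connecting line spans the whole spectrum.
    return [0] + peaks + [len(spectrum) - 1]

def adaptive_threshold(spectrum, offset):
    """Straight line through adjacent peaks (broken line in FIG. 5),
    lowered by a fixed offset (solid line in FIG. 5)."""
    anchors = detect_peaks(spectrum)
    bands = np.arange(len(spectrum))
    peak_envelope = np.interp(bands, anchors, spectrum[anchors])
    return peak_envelope - offset

def binarize_slice(spectrum, offset=3.0):
    """Binarize one time sample of the TSP with the peak-based threshold."""
    spectrum = np.asarray(spectrum, dtype=float)
    return (spectrum >= adaptive_threshold(spectrum, offset)).astype(np.uint8)

# Hypothetical 16-band spectral slice in dB.
slice_db = [30, 42, 55, 48, 36, 33, 40, 47, 41, 35, 30, 34, 38, 33, 28, 25]
print(binarize_slice(slice_db))
```

Because the threshold always lies a fixed amount below the line through the detected peaks, every peak stays above its own threshold, so no peak is lost in the binarized pattern.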

FIG. 6 is a configuration diagram for explaining another embodiment of the present invention. This embodiment is basically the same as the embodiment shown in FIG. 4, but, as illustrated, a peak detection section 13 and a dip detection section (local-minimum detection section) 20 are provided after the speech-interval extraction; the peaks and dips of the feature pattern are found, and the binarization threshold is determined from the difference between the peak values and the dip values.
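
The patent does not spell out the exact formula for this embodiment. One plausible reading, placing the per-band threshold between the peak envelope and the dip envelope, is sketched below; the midpoint-style rule and the ratio value are assumptions.

```python
import numpy as np

def threshold_from_peaks_and_dips(spectrum, ratio=0.5):
    """Per-band threshold placed between the peak envelope and the dip envelope
    of one spectral slice. The exact use of the peak/dip difference is not given
    in the patent; this rule (ratio = 0.5 -> midpoint) is only one interpretation."""
    spectrum = np.asarray(spectrum, dtype=float)
    bands = np.arange(len(spectrum))
    s = np.sign(np.diff(spectrum))
    maxima = [0] + [i for i in range(1, len(spectrum) - 1)
                    if s[i - 1] > 0 and s[i] < 0] + [len(spectrum) - 1]
    minima = [0] + [i for i in range(1, len(spectrum) - 1)
                    if s[i - 1] < 0 and s[i] > 0] + [len(spectrum) - 1]
    peak_env = np.interp(bands, maxima, spectrum[maxima])
    dip_env = np.interp(bands, minima, spectrum[minima])
    return peak_env - ratio * (peak_env - dip_env)
```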

FIG. 7 is a configuration diagram for explaining still another embodiment of the present invention; section A in the figure corresponds to section A in FIG. 4 or FIG. 6. In this embodiment, as illustrated, a tilt correction section 21 is provided after the speech-interval extraction section, that is, before the peak and dip detection sections. It subtracts the slope in the frequency direction of the kind shown in FIG. 3 so that the peak levels become approximately constant, and the operations described above are then performed. After the peak levels have been corrected in this way, the signal need only be passed to the portion corresponding to section A in FIG. 4 or FIG. 6. The tilt correction can be achieved, for example, by a least-squares line (Miwa, Kido: Speech Study Group Material S79-24).
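
A least-squares tilt correction of this kind can be sketched as follows: fit a straight line to the spectral levels over the band index and subtract it before peak or dip detection. Using the band index as the regression variable, and the example values, are assumptions for illustration.

```python
import numpy as np

def remove_spectral_tilt(spectrum):
    """Fit a least-squares line over the band index and subtract it,
    flattening the overall spectral slope before peak/dip detection."""
    spectrum = np.asarray(spectrum, dtype=float)
    bands = np.arange(len(spectrum))
    slope, intercept = np.polyfit(bands, spectrum, deg=1)
    return spectrum - (slope * bands + intercept)

# Hypothetical sloping spectrum (dB): level falls roughly 1.5 dB per band.
tilted = [50, 49, 52, 46, 44, 47, 41, 39, 42, 36, 34, 37, 31, 29, 32, 26]
print(np.round(remove_spectral_tilt(tilted), 1))
```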

Effects

As is clear from the above description, the present invention makes it possible to create accurate speech patterns without losing the characteristic peaks on the time-frequency pattern of the speech.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a configuration diagram showing an example of the TSP matching method; FIGS. 2 and 3 are diagrams each showing one time sample of a TSP; FIG. 4 is a configuration diagram for explaining one embodiment of the present invention; FIG. 5 is a waveform diagram for explaining its operation; and FIGS. 6 and 7 are configuration diagrams for explaining other embodiments of the present invention.

10: microphone, 11: filter bank, 12: speech-interval extraction section, 13: peak detection section (local-maximum detection section), 14: register, 15: binarization section, 16: threshold determination section, 17: dictionary section, 18: similarity matching section, 19: result display section, 20: dip detection section (local-minimum detection section), 21: tilt correction section.

Claims (1)

[Scope of Claims]

1. A speech pattern creation device comprising band division means for dividing a speech signal into a plurality of frequency bands and binarization means for binarizing each of the band signals divided by the band division means, the device further comprising: detection means for detecting local maximum points of the band signals with respect to frequency; and threshold setting means for obtaining a straight line determined by two adjacent points among the local maximum points detected by the detection means and setting as a threshold a value lower by a predetermined amount than the value on the straight line; wherein each of the band signals is binarized with the threshold set by the threshold setting means.

2. A speech pattern creation device comprising band division means for dividing a speech signal into a plurality of frequency bands and binarization means for binarizing each of the band signals divided by the band division means, the device further comprising: detection means for detecting local minimum points of the band signals with respect to frequency; and threshold setting means for obtaining a straight line determined by two adjacent points among the local minimum points detected by the detection means and setting as a threshold a value higher by a predetermined amount than the value on the straight line; wherein each of the band signals is binarized with the threshold set by the threshold setting means.
JP17893583A 1983-09-26 1983-09-26 Voice pattern generator Granted JPS6069699A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP17893583A JPS6069699A (en) 1983-09-26 1983-09-26 Voice pattern generator

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP17893583A JPS6069699A (en) 1983-09-26 1983-09-26 Voice pattern generator

Publications (2)

Publication Number Publication Date
JPS6069699A JPS6069699A (en) 1985-04-20
JPH0466040B2 true JPH0466040B2 (en) 1992-10-21

Family

ID=16057212

Family Applications (1)

Application Number Title Priority Date Filing Date
JP17893583A Granted JPS6069699A (en) 1983-09-26 1983-09-26 Voice pattern generator

Country Status (1)

Country Link
JP (1) JPS6069699A (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0446037B1 (en) * 1990-03-09 1997-10-08 AT&T Corp. Hybrid perceptual audio coding
IT1249940B (en) * 1991-06-28 1995-03-30 Sip IMPROVEMENTS TO VOICE CODERS BASED ON SYNTHESIS ANALYSIS TECHNIQUES.

Also Published As

Publication number Publication date
JPS6069699A (en) 1985-04-20

Similar Documents

Publication Publication Date Title
US4401849A (en) Speech detecting method
US4516215A (en) Recognition of speech or speech-like sounds
US4513436A (en) Speech recognition system
EP0240329A2 (en) Noise compensation in speech recognition
JPH0466040B2 (en)
KR930011738B1 (en) Last point error amendment method of speech recognition system
JPS584198A (en) Standard pattern registration system for voice recognition unit
KR20070049831A (en) Method for dividing initial state by dividing into a syllables and a phoneme, system for implementing the same
JP2997007B2 (en) Voice pattern matching method
JPH0585917B2 (en)
JP2666296B2 (en) Voice recognition device
JP2594028B2 (en) Voice recognition device
JPH05210397A (en) Voice recognizing device
JP2844592B2 (en) Discrete word speech recognition device
JPS59205680A (en) Pattern comparator
JPH0342480B2 (en)
JP2712586B2 (en) Pattern matching method for word speech recognition device
JP2547541B2 (en) Monosyllabic speech recognizer
JPS5886598A (en) Voice recognition equipment
JPS60175098A (en) Voice recognition equipment
JPH0469959B2 (en)
JPH02108936A (en) Method for recognizing voice
JPH071437B2 (en) Voice recognizer
JPS59195296A (en) Voice recognition equipment
JPH044600B2 (en)