JP2601448B2

JP2601448B2 - Voice recognition method and apparatus

Info

Publication number: JP2601448B2
Application number: JP60207131A
Authority: JP
Inventors: 潤一郎藤本; 哲也室井
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 1985-09-19
Filing date: 1985-09-19
Publication date: 1997-04-16
Anticipated expiration: 2012-04-16
Also published as: JPS6266300A

Description

DETAILED DESCRIPTION OF THE INVENTION

Technical Field

The present invention relates to a speech recognition method and an apparatus therefor.

Prior Art

The applicant has already proposed various forms of speech recognition based on the so-called BTSP (Binary Time-Spectrum Pattern) method, in which speech is binarized to obtain a feature pattern and the binarized input pattern is recognized by linear matching against dictionary patterns. In the BTSP method, however, because the speech is binarized, the energy or power information that represents the loudness of the speech is lost, which can cause misrecognition. For example, it is hard to distinguish /P/, a plosive whose consonant part rises rapidly, from /K/, which rises comparatively gently. One could binarize the power information of the speech by an ordinary method and hold it together with the binary TSP (BTSP) pattern. In that case, however, the BTSP part and the power part use different arithmetic, so a dedicated arithmetic unit is needed to obtain the similarity of the power part. The apparatus becomes more complex, and the high-speed computation that is the advantage of the BTSP method is lost.

Object

The present invention has been made in view of the circumstances described above. In particular, its object is to provide a recognition method and apparatus capable of high-accuracy, high-speed recognition while adding speech energy or power information to a recognition system based on binarized patterns expressed by the two values "1" and "0".

Configuration

To achieve the above object, the present invention is characterized by the following.

(1) In a speech recognition method in which feature quantities of speech are extracted and held as standard patterns, and a recognition result is determined by matching against the speech pattern of an unknown input speech: the energy or power shape of the speech is expressed as a two-dimensional pattern whose two orthogonal axes are time and power magnitude; a two-dimensional speech pattern is created by representing the position on the two-dimensional pattern corresponding to the energy or power of the speech at each time, or its vicinity, and the remaining portion by "1" and "0"; one or more patterns created by the same procedure are superimposed and added to form a standard pattern; the unknown input speech is likewise converted into a two-dimensional pattern and superimposed on the standard patterns to determine similarity; and the standard pattern with the greatest similarity is taken as the recognition result.

(2) In a speech recognition apparatus in which feature quantities of speech are extracted and held as standard patterns, and a recognition result is determined by matching against the speech pattern of an unknown input speech, the apparatus comprises: a power detection unit that detects the energy or power of the speech; a binarization unit that obtains the speech section, expresses the power shape of only the portion belonging to that section as a two-dimensional pattern whose two orthogonal axes are time and power magnitude, and creates a two-dimensional pattern in which the position corresponding to the energy or power of the speech at each time, or its vicinity, and the remaining portion are represented by "1" and "0"; a switching unit that switches between standard pattern creation and recognition; a standard pattern creation unit that, when a standard pattern is created, superimposes and adds the two-dimensional patterns created by the binarization unit from a plurality of utterances; a superimposition unit that, at recognition time when unknown speech is input, matches the two-dimensional pattern created by the binarization unit from the unknown input speech against the standard pattern created by the standard pattern creation unit; a similarity determination unit that determines the similarity of the patterns superimposed by the superimposition unit; and a recognition result output unit that outputs as the recognition result the standard pattern judged most similar by the similarity determination unit.

(3) In (2), after the speech section is cut out, a power detection unit and a feature quantity conversion unit are arranged in parallel; the two-dimensional pattern is used together with a two-dimensional pattern created from another feature quantity; a similarity is obtained between patterns of each type; and one similarity is made to act on the other to determine the final similarity and decide the recognition result.

(4) In (2), after the speech section is cut out, a power detection unit and a feature quantity conversion unit are arranged in parallel; the apparatus has a power pattern matching unit that matches the two-dimensional standard pattern against the two-dimensional power pattern of the unknown input to determine similarity, and a similarity judgment unit that decides, based on the result of the power pattern matching unit, whether to calculate the similarity; the two-dimensional pattern is used together with a two-dimensional pattern created from another feature quantity; and when similarities are obtained between patterns of each type, the other similarity is also calculated, and the recognition result determined, only when one similarity satisfies, or fails to satisfy, a specific condition.

The invention is described below on the basis of embodiments.

FIG. 1 is an electrical block diagram for explaining one embodiment of the invention. In the figure, 1 is a microphone, 2 a power detection unit, 3 a speech section detection unit, 4 a binarization unit, 5 a register, 6 an addition unit, 7 a standard pattern, 8 a superimposition unit, 9 a similarity determination unit, and 10 a recognition result output unit. First, the power of the speech entering from microphone 1 is detected, the speech section is obtained, and the power of only the portion belonging to the speech section is binarized and expressed as a binary pattern of "1" and "0". The power may be obtained, for example, by detecting the envelope of the amplitude of the speech waveform, and the speech section is obtained as the portion where the power found by power detection unit 2 is at or above a fixed value. The binarization unit converts the power signal shown in FIG. 2(A) into the binarized information shown in FIG. 2(B), representing the portion that traces the shape of the speech power by "1" and everything else by "0". In this example the power magnitude is quantized into five levels, and the waveform of (A) is easy to picture from (B). When a standard pattern is created, switch S is set to the standard-pattern (標) side.
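The speech-section detection and two-dimensional binarization described for FIG. 1 and FIG. 2 can be sketched as follows. This is a minimal illustration, not the circuit of the patent: the function names, the sample power values, the threshold, and the choice to quantize relative to the peak power of the section are all assumptions made for the example.

```python
def detect_speech_section(power, threshold):
    """Return (start, end) frame indices spanning the frames with power >= threshold."""
    active = [i for i, p in enumerate(power) if p >= threshold]
    if not active:
        return None
    return active[0], active[-1] + 1


def binarize_power(power, levels=5):
    """Map a power contour to a levels x frames grid of 0/1.

    Row 0 is the lowest power level. Each frame gets a single 1 in the row
    of its quantized power; every other cell is 0.
    Assumption: levels are scaled to the peak power of the section.
    """
    peak = max(power)
    grid = [[0] * len(power) for _ in range(levels)]
    for t, p in enumerate(power):
        # Scale to 0..levels-1; the peak frame lands in the top row.
        level = min(levels - 1, int(p / peak * levels)) if peak > 0 else 0
        grid[level][t] = 1
    return grid


# Hypothetical per-frame power values for one utterance.
power = [0.1, 0.9, 2.4, 4.8, 5.0, 3.1, 1.2, 0.2]
start, end = detect_speech_section(power, threshold=0.5)
pattern = binarize_power(power[start:end])
for row in reversed(pattern):  # print the highest level first, like a plot
    print("".join(str(v) for v in row))
```

Printed with the highest level on top, the "1" cells trace the rise and fall of the power contour, which is the sense in which (A) can be pictured from (B).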

With switch S on the standard-pattern side, one utterance, for example /Pa/, is spoken three times. First, the pattern of the first utterance is placed in register 5; it is superimposed on and added to the pattern of the second utterance, and the result is placed in register 5 again. The third pattern and the contents of register 5 are then added and registered as the standard pattern. In other words, the standard pattern (E) is created by adding patterns (B), (C) and (D) of FIG. 2. After this has been repeated for each utterance to be registered, recognition begins.

In recognition, unknown speech is input from microphone 1 and, through the same process as in standard pattern creation, a pattern binarized to "1" and "0" is produced and matched against the standard patterns created earlier. In matching, the binarized pattern is superimposed on one of the standard patterns and the similarity is calculated. The binarized pattern of the unknown speech has the same form as FIG. 2(B). If the two have similar waveforms, superimposing it on the standard pattern (E) causes the "1" elements of pattern (B) to fall on the elements of (E) that have large values. The similarity may therefore be defined as the sum of the products of the elements paired by the superimposition. The similarity between the unknown pattern and every registered standard pattern is obtained in this way, and the standard pattern giving the maximum similarity is output as the recognition result. This makes it possible to calculate similarity with power information included within binarized (1, 0) processing. It is difficult, however, to recognize speech from power information alone.

FIG. 3 is an electrical block diagram of another embodiment that overcomes this drawback. In this embodiment the pattern created in the preceding embodiment is used together with a pattern created from another feature quantity; similarities are obtained for both kinds of pattern, and one similarity is made to act on the other to obtain the final similarity. Here, the method using the binary TSP described as prior art was chosen as the other method, because the binary TSP is binarized just as the power pattern is and the same arithmetic can be applied to both; other methods may of course be used in combination instead. In FIG. 3, after the speech section is cut out by speech section detection unit 3, power detection unit 2 detects the power, while the same signal undergoes feature conversion in feature quantity conversion unit 11. In this embodiment a spectrum is a suitable feature quantity. Binarization unit 4 binarizes the shapes of the power and the spectrum into "1" and "0". With these binarized patterns, later computation is easier if the spectral pattern and the power pattern are joined into a single pattern. An example of the pattern at binarization unit 4 is shown in FIG. 4, where F is the ordinary BTSP and G corresponds to the pattern of FIG. 2(B). The similarity is calculated by the same procedure as in the example of FIG. 1 and the result is derived. Similarity determination unit 9 need only regard the pattern as having become larger; nothing in the procedure changes, and the result is obtained from the similarity of the sum of both patterns. Accuracy thereby improves markedly over the example of FIG. 1. In this case, weight may be placed on either the power pattern or the spectral pattern, with the other used as an auxiliary means.

FIG. 5 is an electrical block diagram for explaining an embodiment based on this viewpoint. According to this embodiment, when similarities are obtained, the other similarity is also calculated, and the recognition result determined, only when one similarity satisfies, or fails to satisfy, a specific condition. As in the embodiment of FIG. 3, a "1"/"0" pattern combining spectrum and power as in FIG. 4 is created, and several such patterns are superimposed and registered. At recognition, the power part of the unknown input produced by binarization unit 4 is matched in power pattern matching unit 12 to check similarity. For a standard pattern whose power similarity differs greatly, judgment unit 13 decides not to calculate the similarity of the spectral part, and processing moves on to matching with the next standard pattern. If judgment unit 13 decides that the spectral-pattern similarity is to be calculated, the similarity between patterns is obtained as in FIG. 3. The similarity in this case may be calculated with or without the power part. The example here compares the power of the entire pattern, but the comparison may of course be made frame by frame within the pattern rather than over one whole speech pattern.

Effects

As is clear from the above description, the invention adds power information to spectral patterns binarized to "1" and "0" and can thereby improve the accuracy of speech recognition.
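The registration and matching procedure of the first embodiment (adding several binarized utterances into a standard pattern, scoring by the sum of element products, and picking the maximum) can be sketched as below. The toy 3-level by 4-frame patterns, the labels, and the function names are invented for illustration, and the sketch assumes every pattern has already been brought to the same size.

```python
def add_patterns(patterns):
    """Superimpose equally sized 0/1 grids and add them cell by cell."""
    rows, cols = len(patterns[0]), len(patterns[0][0])
    return [[sum(p[r][c] for p in patterns) for c in range(cols)]
            for r in range(rows)]


def similarity(unknown, standard):
    """Sum of the products of the cells paired by superimposition."""
    return sum(u * s
               for u_row, s_row in zip(unknown, standard)
               for u, s in zip(u_row, s_row))


def recognize(unknown, dictionary):
    """Return the label of the standard pattern with the largest similarity."""
    return max(dictionary, key=lambda label: similarity(unknown, dictionary[label]))


# Hypothetical binarized power patterns (row 0 = lowest level),
# three utterances per registered word.
pa = [
    [[1, 0, 0, 1], [0, 0, 1, 0], [0, 1, 0, 0]],
    [[1, 0, 0, 1], [0, 0, 1, 0], [0, 1, 0, 0]],
    [[1, 0, 0, 1], [0, 1, 1, 0], [0, 0, 0, 0]],
]
ka = [
    [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 1]],
    [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 1]],
    [[1, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]],
]
dictionary = {"pa": add_patterns(pa), "ka": add_patterns(ka)}

unknown = [[1, 0, 0, 1], [0, 0, 1, 0], [0, 1, 0, 0]]
print(recognize(unknown, dictionary))  # prints "pa"
```

Because the unknown pattern holds only 0 and 1, each product simply selects a cell of the standard pattern, so the score is a count of how often the registered utterances agreed with the unknown contour.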

[Brief description of the drawings]

FIG. 1 is an electrical block diagram for explaining one embodiment of the present invention; FIG. 2 is a diagram showing binarized patterns for explaining the operation of the invention; FIG. 3 is an electrical block diagram for explaining another embodiment of the invention; FIG. 4 is a diagram showing an example of a binarized pattern; and FIG. 5 is an electrical block diagram showing a further embodiment of the invention.

1: microphone, 2: power detection unit, 3: speech section detection unit, 4: binarization unit, 5: register, 6: addition unit, 7: standard pattern, 8: superimposition unit, 9: similarity determination unit, 10: recognition result output unit, 11: feature quantity conversion unit, 12: power pattern matching unit, 13: judgment unit.

Continuation of the front page

(56) References: JP-A-60-19199 (JP, A); JP-A-59-222900 (JP, A); JP-A-59-205680 (JP, A); JP-A-59-186073 (JP, A); Proceedings of the Acoustical Society of Japan (October 1983), 3-1-8, pp. 195-196

Claims

(57) [Claims]

1. A speech recognition method in which feature quantities of speech are extracted and held as standard patterns and a recognition result is determined by matching against the speech pattern of an unknown input speech, characterized in that: the energy or power shape of the speech is expressed as a two-dimensional pattern whose two orthogonal axes are time and power magnitude; a two-dimensional speech pattern is created by representing the position on the two-dimensional pattern corresponding to the energy or power of the speech at each time, or its vicinity, and the remaining portion by "1" and "0"; one or more patterns created by the same procedure are superimposed and added to form a standard pattern; the unknown input speech is likewise converted into a two-dimensional pattern and superimposed on the standard patterns to determine similarity; and the standard pattern with the greatest similarity is taken as the recognition result.

2. A speech recognition apparatus in which feature quantities of speech are extracted and held as standard patterns and a recognition result is determined by matching against the speech pattern of an unknown input speech, characterized by comprising: a power detection unit that detects the energy or power of the speech; a binarization unit that obtains the speech section, expresses the power shape of only the portion belonging to that section as a two-dimensional pattern whose two orthogonal axes are time and power magnitude, and creates a two-dimensional pattern in which the position corresponding to the energy or power of the speech at each time, or its vicinity, and the remaining portion are represented by "1" and "0"; a switching unit that switches between standard pattern creation and recognition; a standard pattern creation unit that, when a standard pattern is created, superimposes and adds the two-dimensional patterns created by the binarization unit from a plurality of utterances; a superimposition unit that, at recognition time when unknown speech is input, matches the two-dimensional pattern created by the binarization unit from the unknown input speech against the standard pattern created by the standard pattern creation unit; a similarity determination unit that determines the similarity of the patterns superimposed by the superimposition unit; and a recognition result output unit that outputs as the recognition result the standard pattern judged most similar by the similarity determination unit.

3. The speech recognition apparatus according to claim 2, characterized in that, after the speech section is cut out, a power detection unit and a feature quantity conversion unit are arranged in parallel; the two-dimensional pattern is used together with a two-dimensional pattern created from another feature quantity; a similarity is obtained between patterns of each type; and one similarity is made to act on the other to determine the final similarity and decide the recognition result.
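As an illustration of how one similarity might act on the other, the sketch below scores the spectral (BTSP) part and the power part separately and combines them with weights. The weighted sum, the weight values, and the small sample grids are assumptions; the description only states that the two patterns may be joined into one and that either may be given more weight.

```python
def similarity(unknown, standard):
    """Sum of the products of the cells paired by superimposition."""
    return sum(u * s
               for u_row, s_row in zip(unknown, standard)
               for u, s in zip(u_row, s_row))


def combined_similarity(unk_spec, unk_pow, std_spec, std_pow,
                        w_spec=1.0, w_pow=1.0):
    """Weighted sum of spectral and power similarities (weights are hypothetical)."""
    return (w_spec * similarity(unk_spec, std_spec)
            + w_pow * similarity(unk_pow, std_pow))


# Hypothetical 2 x 3 grids: spectral part (F) and power part (G).
unk_spec = [[1, 0, 1], [0, 1, 0]]
std_spec = [[3, 0, 2], [1, 3, 0]]
unk_pow = [[0, 1, 0], [1, 0, 1]]
std_pow = [[0, 2, 1], [3, 1, 2]]

plain = combined_similarity(unk_spec, unk_pow, std_spec, std_pow)
# Joining the two grids into one larger pattern gives the same score
# when both weights are 1.
stacked = similarity(unk_spec + unk_pow, std_spec + std_pow)
# Treating power as an auxiliary cue by halving its weight.
power_as_auxiliary = combined_similarity(unk_spec, unk_pow, std_spec, std_pow,
                                         w_pow=0.5)
print(plain, stacked, power_as_auxiliary)
```

With unit weights the result is identical to matching the joined pattern, which is why the matching procedure itself does not change when the power rows are appended.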

4. The speech recognition apparatus according to claim 2, characterized in that, after the speech section is cut out, a power detection unit and a feature quantity conversion unit are arranged in parallel; the apparatus has a power pattern matching unit that matches the two-dimensional standard pattern against the two-dimensional power pattern of the unknown input to determine similarity, and a similarity judgment unit that decides, based on the result of the power pattern matching unit, whether to calculate the similarity; the two-dimensional pattern is used together with a two-dimensional pattern created from another feature quantity; and when similarities are obtained between patterns of each type, the other similarity is also calculated, and the recognition result determined, only when one similarity satisfies, or fails to satisfy, a specific condition.
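The conditional calculation can be pictured as a pre-screening loop: the power similarity is computed first, and the spectral similarity is computed only for standard patterns that pass. In this sketch the "specific condition" is taken to be a minimum power score, which is an assumption, as are the function names, the threshold value, and the sample data.

```python
def similarity(unknown, standard):
    """Sum of the products of the cells paired by superimposition."""
    return sum(u * s
               for u_row, s_row in zip(unknown, standard)
               for u, s in zip(u_row, s_row))


def recognize_with_prescreen(unk_spec, unk_pow, dictionary, min_power_score):
    """Return (best_label, skipped_labels).

    dictionary maps label -> (standard spectral grid, standard power grid).
    Spectral matching is skipped when the power score is below min_power_score.
    """
    best_label, best_score, skipped = None, None, []
    for label, (std_spec, std_pow) in dictionary.items():
        power_score = similarity(unk_pow, std_pow)
        if power_score < min_power_score:
            skipped.append(label)  # power shapes differ too much; move on
            continue
        # Here the power part is included in the final score; it may be left out.
        score = similarity(unk_spec, std_spec) + power_score
        if best_score is None or score > best_score:
            best_label, best_score = label, score
    return best_label, skipped


# Hypothetical standard patterns for two registered words.
dictionary = {
    "pa": ([[3, 0, 2], [1, 3, 0]], [[0, 3, 0], [3, 0, 3]]),
    "ka": ([[2, 1, 3], [0, 2, 1]], [[3, 0, 3], [0, 3, 0]]),
}
unk_spec = [[1, 0, 1], [0, 1, 0]]
unk_pow = [[0, 1, 0], [1, 0, 1]]

print(recognize_with_prescreen(unk_spec, unk_pow, dictionary, min_power_score=5))
```

In this toy case the power pattern of "ka" does not overlap the unknown input at all, so its spectral part is never scored. If no standard pattern passes the screen, the function returns None as the label.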