JPS59121398A - Voice recognition system

Info

Publication number: JPS59121398A
Application number: JP57228623A
Authority: JP
Inventors: 浜田　隆史; 荻田　隆彦; 上柳　裕
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 1982-12-28
Filing date: 1982-12-28
Publication date: 1984-07-13

Abstract

(57) [Abstract] This publication contains application data filed before electronic filing was introduced, so no abstract data is recorded.

Description

DETAILED DESCRIPTION OF THE INVENTION

(a) Technical Field of the Invention

The present invention relates to a speech recognition device, and in particular to a speech recognition method capable of identifying each speaker among a plurality of speakers.

(b) Prior Art and Problems

Methods in which a recognition device recognizes input speech by comparing it against speech pattern dictionaries prepared in advance fall into two types: the speaker-independent method and the speaker-dependent method.

The speaker-independent method uses a single, shared speech pattern dictionary to recognize the speech of unspecified speakers. It can recognize the speech of an unspecified number of speakers, but because the dictionary is shared, a high recognition rate cannot be obtained.

The speaker-dependent method registers a speech pattern dictionary for each speaker and recognizes speech using the dictionary that corresponds to the speaker. It requires the work of registering a dictionary for each speaker, but a high recognition rate can be obtained.

One application of speech recognition devices is preparing the minutes of meetings and the like. When several speakers sit together and talk in this way, the device must identify which speaker's voice is being input before it performs speech recognition. The following recognition methods are conceivable for such cases.

(1) Recognize each person's speech using the speaker-independent method.

(2) Prepare, for each speaker, a speech pattern dictionary and a recognition device corresponding to that speaker, and recognize each person's speech using the speaker-dependent method.

(3) Prepare a speech pattern dictionary for each speaker and a single recognition device. Before inputting speech, each person inputs personal identification information indicating which speech pattern dictionary is to be used, thereby designating a dictionary, and each person's speech is recognized using the designated dictionary.

However, (1), (2), and (3) each have drawbacks. In (1), the recognizable vocabulary is small and the recognition rate is low. In (2), as many speech pattern dictionaries and recognition devices are needed as there are speakers, which is a problem in terms of cost. In (3), inputting the personal identification information is cumbersome, and a speaker may forget to input it before speaking.

(c) Object of the Invention

The present invention therefore proposes a speech recognition method that overcomes the drawbacks described above and can identify each speaker among a plurality of speakers.

(d) Structure of the Invention

To this end, the present invention proposes a speech recognition method for a speech recognition device in which one speech pattern dictionary is selected from a plurality of prepared speech pattern dictionaries and recognition means recognizes speech input from input means while comparing the speech with the selected dictionary. The method is characterized in that a plurality of speech input means are provided and the speech pattern dictionary is selected according to the phase difference, or the ratio of input levels, of the same speech input from at least two of the speech input means.

(e) Embodiments of the Invention

FIG. 1 shows a speech recognition device used with a speech recognition method that is one embodiment of the present invention. 1 and 2 are microphones, 3 is a speaker determination unit, 4 is a feature extraction unit, 5 is a recognition unit, 6 is a registration unit, 7 is a control unit, 8 is a keyboard display, 9 is a dictionary unit, 10 is a switching unit, 11 to 13 are speech pattern dictionaries, and 14 is the speech recognition device.

FIG. 2 shows the positions of the speakers and microphones when this embodiment is carried out. A, B, and C are speakers, 1 and 2 are microphones, and 14 is the speech recognition device.

FIG. 3 shows the waveforms of the speech that speakers A, B, and C input through microphones 1 and 2, and FIG. 4 shows the waveforms of the speech that speaker A inputs through microphones 1 and 2. A-1 is the waveform of speaker A's speech input through microphone 1, and A-2 is the waveform of speaker A's speech input through microphone 2. B-1 and B-2 are the corresponding waveforms for speaker B through microphones 1 and 2, and C-1 and C-2 are the corresponding waveforms for speaker C through microphones 1 and 2.

FIG. 4 is an enlarged view of A-1 and A-2 in FIG. 3.

In the present invention, two microphones are used as the speech input means. The speakers (three people) and the two microphones are arranged so that each speaker's distances to the two microphones differ from those of the other speakers. As a result, the same utterance of each speaker, input through the two microphones, differs in phase or in input level. The phase relationship, or the relative magnitude of the input levels, is associated with personal identification information.
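As a rough illustration of why this arrangement produces such differences, the sketch below computes the arrival-time difference and an idealized level ratio for a given seat. The speed of sound (343 m/s), the 1/distance amplitude falloff, and the seat distances are assumptions chosen for illustration; none of them comes from the specification.

```python
SPEED_OF_SOUND_M_S = 343.0  # assumed value for air at roughly 20 degrees C


def arrival_delay_s(dist_to_mic1_m: float, dist_to_mic2_m: float) -> float:
    """Arrival time at microphone 2 minus arrival time at microphone 1 (seconds)."""
    return (dist_to_mic2_m - dist_to_mic1_m) / SPEED_OF_SOUND_M_S


def level_ratio(dist_to_mic1_m: float, dist_to_mic2_m: float) -> float:
    """Microphone 1 level divided by microphone 2 level, assuming 1/distance falloff."""
    return dist_to_mic2_m / dist_to_mic1_m


# Hypothetical seat distances (metres) to microphones 1 and 2.
SEATS = {"A": (0.5, 1.5), "B": (1.0, 1.0), "C": (1.5, 0.5)}

if __name__ == "__main__":
    for speaker, (d1, d2) in SEATS.items():
        print(speaker, arrival_delay_s(d1, d2), level_ratio(d1, d2))
```

With these assumed distances, speaker A reaches microphone 1 first (positive delay, ratio above 1), speaker B reaches both at the same time, and speaker C reaches microphone 2 first.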

As the positions of speakers A, B, and C and microphones 1 and 2 in FIG. 2 show, microphone 1 is closer to speaker A than microphone 2 is. Accordingly, as shown in A-1 and A-2 of FIG. 3, if the moment at which the speech waveform exceeds a fixed level (determined in consideration of the ambient noise level) is taken as the start of the speech (the onset), the onset ■ of the waveform input through microphone 1 is earlier in time than the onset ■′ of the waveform input through microphone 2. As FIG. 4 also shows, A-1 and A-2 differ in input level (the input level of A-2 is lower), but using the same onset level for both emphasizes the time difference, so there is no need to vary the onset level according to the input level.
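The onset comparison described above can be sketched as follows. This is a minimal illustration that assumes the two inputs are available as equally sampled sequences and that the absolute sample value is compared with the threshold; the function names and sample data are hypothetical.

```python
from typing import Optional, Sequence


def find_onset(samples: Sequence[float], threshold: float) -> Optional[int]:
    """Index of the first sample whose magnitude exceeds the fixed threshold."""
    for index, value in enumerate(samples):
        if abs(value) > threshold:
            return index
    return None  # the signal never rose above the noise-based threshold


def onset_lag(mic1: Sequence[float], mic2: Sequence[float],
              threshold: float) -> Optional[int]:
    """Onset at microphone 2 minus onset at microphone 1, in samples.

    The same threshold is applied to both inputs, as the text prescribes,
    even when one input is quieter than the other.
    """
    onset1 = find_onset(mic1, threshold)
    onset2 = find_onset(mic2, threshold)
    if onset1 is None or onset2 is None:
        return None
    return onset2 - onset1


# Hypothetical data: microphone 2 receives a later, quieter copy.
mic1 = [0.0, 0.0, 0.5, 0.9, 0.7]
mic2 = [0.0, 0.0, 0.0, 0.1, 0.3, 0.6]
lag = onset_lag(mic1, mic2, threshold=0.2)
```

A positive lag means microphone 1 heard the utterance first, zero means both onsets coincide, and a negative lag means microphone 2 heard it first.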

Similarly, as shown in B-1 and B-2, for speaker B's speech the onsets ■ and ■′ are almost simultaneous, and as shown in C-1 and C-2, for speaker C's speech onset ■ is later than onset ■′.

The phase difference of the speech input through microphones 1 and 2, as described above, is therefore used by associating it with personal identification information.

Before speech recognition begins, the speech of speakers A, B, and C is registered in dictionaries 11 to 13. For registration, keyboard 8 is used to instruct switching unit 10 to select one of dictionaries 11 to 13. Each speaker then inputs speech (the fifty-sound kana syllabary, numerals, and so on) through microphone 1 or microphone 2.

Extraction unit 4 extracts the frequency features of the input speech, and registration unit 6 registers the extracted features in the selected dictionary. Dictionary 11 corresponds to A, dictionary 12 to B, and dictionary 13 to C.

Once speakers A, B, and C have taken the positions shown in FIG. 2, and before the conversation begins, A, B, and C each input some utterance (for example, speaking their name) through microphones 1 and 2. For each of these utterances, determination unit 3 detects the phase difference between the speech input through microphones 1 and 2. At the same time, personal identification information (kana, numerals, and so on) is input from keyboard 8, and determination unit 3 stores it in association with the phase difference. This personal identification information also corresponds to dictionaries 11, 12, and 13, respectively.

When A, B, and C begin their conversation, determination unit 3 identifies the speaker from the phase difference of the speech input through microphones 1 and 2 and sends the personal identification information to dictionary unit 9. Within dictionary unit 9, switching unit 10 selects one of dictionaries 11 to 13 on the basis of the personal identification information. Recognition unit 5 recognizes the speech by comparing the frequency features extracted by extraction unit 4 against the selected dictionary. The speech fed to extraction unit 4 is whichever of the two microphone inputs has the earlier phase; if there is no phase difference, it is the speech input through a predetermined microphone. Determination unit 3 instructs extraction unit 4 as to which of the two microphone inputs to extract features from.
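The enrollment and determination flow above might be sketched as follows. The specification does not say how an observed phase difference is matched to the stored ones, so the nearest-value matching here is an assumption, as are the class and function names and the lag values (expressed in samples).

```python
class SpeakerDeterminer:
    """Hypothetical stand-in for determination unit 3."""

    def __init__(self) -> None:
        self._enrolled = {}  # personal identification info -> stored lag

    def enroll(self, person_id: str, lag: int) -> None:
        """Store the lag measured before the conversation with the keyed-in ID."""
        self._enrolled[person_id] = lag

    def identify(self, lag: int) -> str:
        """Return the ID whose stored lag is closest (assumed matching rule)."""
        if not self._enrolled:
            raise ValueError("no speakers enrolled")
        return min(self._enrolled, key=lambda pid: abs(self._enrolled[pid] - lag))


def select_feature_source(lag: int, default_mic: int = 1) -> int:
    """Pick the microphone with the earlier onset; fall back to a fixed one."""
    if lag > 0:
        return 1  # microphone 1 heard the speech first
    if lag < 0:
        return 2  # microphone 2 heard the speech first
    return default_mic


# Dictionaries 11 to 13 correspond to speakers A, B, and C.
DICTIONARY_FOR = {"A": 11, "B": 12, "C": 13}

determiner = SpeakerDeterminer()
determiner.enroll("A", 8)    # lag = onset at mic 2 minus onset at mic 1
determiner.enroll("B", 0)
determiner.enroll("C", -8)

speaker = determiner.identify(6)
dictionary_number = DICTIONARY_FOR[speaker]
source_mic = select_feature_source(6)
```

In this sketch the switching unit is reduced to a dictionary lookup; a real device would route the selected dictionary to the recognition unit.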

In another embodiment of the present invention, the input levels of the speech input through the two microphones are compared. The greater the distance between speaker and microphone, the lower the input level; the smaller the distance, the higher the input level. Using this relationship, the input levels of the same utterance at microphone 1 and microphone 2 are compared, and the speaker is identified from their relative magnitude. As FIG. 3 shows, the following relationships hold: for speaker A, input level of microphone 1 > input level of microphone 2; for speaker B, input level of microphone 1 ≈ input level of microphone 2; and for speaker C, input level of microphone 1 < input level of microphone 2.
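A minimal sketch of this three-way comparison follows. The tolerance that decides when two levels count as approximately equal is an assumption, since the specification gives no numeric criterion, and the function names are hypothetical.

```python
def level_relation(level1: float, level2: float, tolerance: float = 0.1) -> str:
    """Classify the microphone 1 to microphone 2 level relationship.

    Returns ">", "~" (approximately equal), or "<". The tolerance is an
    assumed parameter, applied symmetrically to the ratio level1 / level2.
    """
    if level1 <= 0 or level2 <= 0:
        raise ValueError("levels must be positive")
    ratio = level1 / level2
    if ratio > 1.0 + tolerance:
        return ">"
    if ratio < 1.0 / (1.0 + tolerance):
        return "<"
    return "~"


# Relationships stated in the text for the seating of FIG. 2.
SPEAKER_FOR_RELATION = {">": "A", "~": "B", "<": "C"}


def identify_by_level(level1: float, level2: float) -> str:
    return SPEAKER_FOR_RELATION[level_relation(level1, level2)]
```

For example, `identify_by_level(0.9, 0.3)` selects speaker A, because microphone 1 receives the clearly stronger signal.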

These input level relationships are therefore associated with personal identification information, in the same way as the phase difference relationships in the embodiment described earlier.

The method of measuring the input level ratio is described with reference to FIG. 4.

First, the moment at which the speech waveform exceeds a fixed level (set in consideration of the ambient noise level) is taken as the start of the speech (the onset). Then, within a fixed time after the onsets ■ and ■′ found for the speech input through microphones 1 and 2, the peak values of the two inputs (L1, L2) are found. The level ratio of L1 to L2 is then computed and associated with personal identification information.
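The peak measurement might look like the sketch below. It assumes sampled inputs, a window length given in samples, and peak magnitude taken as the largest absolute sample value; the names and data are hypothetical.

```python
from typing import Sequence


def peak_after_onset(samples: Sequence[float], onset: int, window: int) -> float:
    """Largest absolute sample value in the window that starts at the onset."""
    segment = samples[onset:onset + window]
    if not segment:
        raise ValueError("window contains no samples")
    return max(abs(value) for value in segment)


def peak_level_ratio(mic1: Sequence[float], onset1: int,
                     mic2: Sequence[float], onset2: int,
                     window: int) -> float:
    """Ratio L1 / L2 of the peaks found after each input's own onset."""
    l1 = peak_after_onset(mic1, onset1, window)
    l2 = peak_after_onset(mic2, onset2, window)
    if l2 == 0:
        raise ValueError("peak of microphone 2 is zero")
    return l1 / l2


# Hypothetical inputs with onsets already located at indices 1 and 2.
mic1 = [0.0, 0.3, 0.8, -1.0, 0.4]
mic2 = [0.0, 0.0, 0.2, 0.4, -0.5, 0.1]
ratio = peak_level_ratio(mic1, 1, mic2, 2, window=3)
```

Each window starts at that input's own onset, so the later arrival at the farther microphone does not shorten the interval over which its peak is searched.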

When speakers are identified from the input level difference in this way, the speech fed to extraction unit 4 is the input with the higher level; if the input levels are equal, it is the input from a predetermined microphone.

This embodiment shows the case of three speakers, but the method can be applied in the same way to more than three speakers by dividing the phase difference or the input level ratio more finely.

This embodiment uses two microphones, but using three or more microphones makes more information available for identifying the speaker.
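One way to picture the extra information from a third microphone is the sketch below, which describes each speaker by a vector of onset lags relative to microphone 1 and picks the closest enrolled vector. The specification does not describe such a procedure; the squared-distance matching, the names, and the lag values are all assumptions.

```python
from typing import Dict, Sequence, Tuple


def lag_vector(onsets: Sequence[int]) -> Tuple[int, ...]:
    """Onset of each further microphone minus the onset at microphone 1."""
    reference = onsets[0]
    return tuple(onset - reference for onset in onsets[1:])


def nearest_speaker(enrolled: Dict[str, Tuple[int, ...]],
                    observed: Tuple[int, ...]) -> str:
    """Enrolled speaker whose lag vector is closest in squared distance."""
    def squared_distance(a: Sequence[int], b: Sequence[int]) -> int:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(enrolled, key=lambda pid: squared_distance(enrolled[pid], observed))


# Hypothetical enrollment with three microphones: (lag to mic 2, lag to mic 3).
# Speakers B and D share the same lag to microphone 2 and differ only at mic 3.
enrolled = {"A": (8, 4), "B": (0, 4), "C": (-8, 4), "D": (0, -6)}

observed = lag_vector([100, 101, 95])
speaker = nearest_speaker(enrolled, observed)
```

With only microphones 1 and 2, speakers B and D in this example would be indistinguishable; the lag to the third microphone separates them.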

Effect of the Invention

According to the present invention, the incurring of a large cost also …

[Brief explanation of the drawing]

FIG. 1 shows a speech recognition device used with a speech recognition method that is one embodiment of the present invention. 1 and 2 are microphones, 3 is a speaker determination unit, 4 is a feature extraction unit, 5 is a recognition unit, 6 is a registration unit, 7 is a control unit, 8 is a keyboard display, 9 is a dictionary unit, 10 is a switching unit, 11 to 13 are speech pattern dictionaries, and 14 is the speech recognition device. FIG. 2 shows the positions of the speakers and microphones when this embodiment is carried out. A, B, and C are speakers, 1 and 2 are microphones, and 14 is the speech recognition device.

Claims

[Claims]

In a speech recognition device in which one speech pattern dictionary is selected from a plurality of prepared speech pattern dictionaries and recognition means recognizes speech input from input means while comparing the speech with the selected dictionary, a speech recognition method characterized in that a plurality of speech input means are provided and the speech pattern dictionary is selected according to the phase difference, or the ratio of input levels, of the same speech input from at least two of the speech input means.