JP2753255B2 - Voice-based interactive information retrieval device

Publication number: JP2753255B2
Application number: JP63082928A
Authority: JP
Inventors: 信夫畑岡; 明雄天野; 熹市川
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 1988-04-06
Filing date: 1988-04-06
Publication date: 1998-05-18
Anticipated expiration: 2013-05-18
Also published as: JPH01255925A

Description

[Detailed description of the invention]

[Industrial field of application]

The present invention relates to an information retrieval system using speech recognition. In particular, it relates to an information retrieval device suited to letting a user find needed information from a vague memory while conversing freely with the system, and suited to improving speech recognition performance.

[Conventional technology]

There have been many attempts to use speech as an input/output interface for information retrieval systems and the like. However, mainly because of the insufficient performance of the speech recognition unit, these have not yet reached a practical level. Against this background, two approaches are known. The first, described in IEICE Transactions D, Vol. J70-D, No. 11, pp. 2121-2127 (October 1987), aims at a smooth man-machine interface for voice-interactive services and makes the ambiguous input caused by the recognition system more reliable by exploiting the relatedness between pieces of input information. The second, described in IEICE Technical Report Vol. 87, No. 209, NL87-11 (1987), is a speech understanding system that extracts the topic of the input content from the full set of input candidates and efficiently narrows down the vocabulary associated with that topic, thereby improving recognition performance.

[Problems to be solved by the invention]

Both of the above prior-art techniques are excellent at narrowing down recognition candidates by using the relatedness between pieces of input information (for example, spoken words). However, the former requires the attribute of each input (for example, place, date and time, or name) to be known. As a result, the input interface becomes system-driven, and uncertain content cannot be entered for any attribute. The latter is a language processing technique for sentence understanding and the like, and cannot be used directly in an information retrieval system.

An object of the present invention is to solve the above problems of the prior art and to provide a voice-based information retrieval device that lets the user retrieve information even from a vague memory while conversing freely with the system, and that improves recognition performance.

[Means for solving the problem]

The above object is achieved as follows. Without restricting the attributes of the input, freely input speech (for example, several spoken words) is recognized, yielding recognition candidates (several for each input word). Using information that describes the relationships between words in advance, only those combinations of word candidates that can be related to one another are extracted, which narrows down the recognition candidates with high accuracy. Information is then retrieved using the resulting word sequence of recognition candidates. Because the multiple recognition candidates can be narrowed down with high accuracy in this process, recognition performance improves as a result.

The information describing the relationships between words for the content to be searched can readily be obtained by using, for example, the concept network technique described in detail in Hitachi Hyoron, Vol. 69, No. 3 (March 1987).

[Action]

The concept network of the present invention, which describes the relationships between words, serves as the basis for narrowing down the recognition candidates of input words. Suppose, for example, that the three words "onsei" (speech), "ninshiki" (recognition), and "gijutsu" (technology) are input, and that the first and second recognition candidates for each input are ("onsen" (hot spring), "onsei" (speech)), ("ninshiki" (recognition), "minshuku" (guesthouse)), and ("jijitsu" (fact), "gijutsu" (technology)) respectively. The relationship "speech"-"recognition"-"technology" is very likely to be described in the concept network, whereas relationships among the other words are unlikely to be, so the sequence "speech"-"recognition"-"technology" is finally chosen as the recognition result. The concept network describes the relationships between the important words (keywords, for example the announcing organization or words expressing the content) contained in the information to be searched (for example, newspaper articles).
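The narrowing step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `RELATED` set stands in for the concept network, the English glosses stand in for the Japanese words, and relatedness is tested between adjacent inputs only, which the text does not prescribe.

```python
from itertools import product

# Hypothetical stand-in for the concept network: unordered pairs of related words.
RELATED = {
    frozenset(("speech", "recognition")),
    frozenset(("recognition", "technology")),
    frozenset(("hot spring", "guesthouse")),
}


def related(a, b):
    """Return True when the two words are linked in the (assumed) network."""
    return frozenset((a, b)) in RELATED


def select_sequences(candidates_per_input):
    """Keep only candidate combinations whose adjacent words are all related."""
    kept = []
    for combo in product(*candidates_per_input):
        if all(related(a, b) for a, b in zip(combo, combo[1:])):
            kept.append(combo)
    return kept


# First and second recognition candidates for each of the three inputs.
candidates = [
    ["hot spring", "speech"],
    ["recognition", "guesthouse"],
    ["fact", "technology"],
]
result = select_sequences(candidates)
```

With these assumed relations, "hot spring"-"guesthouse" survives the first two inputs but cannot be extended by any candidate of the third input, so only one sequence remains.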

[Example]

An embodiment of the present invention is described below with reference to FIG. 1. FIG. 1 is a block diagram showing one embodiment of the voice-based interactive information retrieval device of the present invention. Speech 1, input interactively, is passed to the speech recognition unit 2, which obtains the feature pattern of the speech and then recognizes it by comparing and matching it against standard speech read from the standard speech pattern memory 6. For each input utterance, the recognition result is output as multiple candidates in order of likelihood.

The word sequence selection unit 3 uses information read from the concept network table 8, which describes the relationships between the important words of the information to be searched (keywords; for a newspaper article, for example, the announcing organization, the date and time, and words expressing the content). From the multiple recognition candidates obtained by the speech recognition unit 2 each time speech is input, it selects only those sequences of words that are related to one another. In other words, the concept network is used to narrow down the recognized word candidates.

Next, the search processing unit 4 generates, from the selected word sequence (the sequence of search terms), the search condition statement needed to execute the search. It then performs a test (concept matching) using the knowledge, read from the concept network table 8, that describes the relationships between keywords of the information to be searched, and finally selects the retrieved information (articles and the like). A specific embodiment of the search processing unit 4 is described in detail in Hitachi Hyoron, Vol. 69, No. 3 (March 1987). The display unit 5 fetches the information selected by the search from the information storage unit 9 and displays it.

The input control unit 10 generates an appropriate input sequence based on the hierarchical structure of the knowledge needed for the search, read from the concept network table 8, and issues input requests 11 accordingly. In addition, the standard speech word dictionary generation unit 7 generates speech patterns corresponding to the concepts (words) described in the concept network table 8 and stores them in the standard speech pattern memory 6. As a result, when the information to be searched changes, the standard speech needed for that search can be generated as required.

All of the above processing is executed under the control of the control unit 12.

FIG. 2 shows one embodiment of the speech recognition unit 2 in detail. The input speech passes through the speech input unit 21, the speech analysis unit 22 obtains its feature pattern, and the word recognition unit 23 performs recognition to obtain a word candidate string. When the input speech is a conversational sentence or a word string, the keyword location search unit 24 uses prosodic information in the input speech to locate the keyword portion (the main word portion), and the word recognition unit recognizes the content of that word.

FIG. 3 shows one embodiment of the speech input unit 21 and the speech analysis unit 22 in detail. The analog input speech $x_n$ is converted to digital values by the LPF (low-pass filter) 211 and the ADC (analog-to-digital converter) 212, with the aliasing noise from sampling removed. Next, the feature pattern extraction unit 221 computes the feature parameters of the speech at fixed intervals (frames) and extracts the feature pattern of the input speech. BPF (band-pass filter) output values, the various parameters obtained from LPC (linear prediction) analysis, and the like are used as feature parameters. The prosodic information extraction unit 222 extracts feature parameters (for example, power and pitch period) that represent prosodic information such as stress or intonation.

FIG. 4 shows one embodiment of the feature pattern extraction unit 221 in detail. This embodiment uses BPF analysis. The registered speech $x_n$, converted to digital values, is input to a bank 2211 of $K$ BPFs with different center frequencies and bandwidths. Each BPF is a second-order Butterworth filter built from two adders, four multipliers, and two delay elements. The BPF output waveform is rectified by the absolute-value unit (ABS) 2212, and the LPF 2213 cuts its high-frequency components to give the registered speech pattern $X_i$ ($i$: frame). The LPF is a Butterworth filter of the same processing scale as the BPF.
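One channel of such a filter bank (band-pass, rectify, smooth, sample once per frame) might look like the sketch below. It is an assumption-laden illustration, not the patent's circuit: the coefficients come from a generic second-order band-pass design rather than the Butterworth design named in the text, the smoothing stage is a one-pole low-pass rather than a second-order one, and `f0`, `q`, `frame_len`, and `smooth` are hypothetical parameters.

```python
import math


def bandpass_coeffs(f0, q, fs):
    """Coefficients of a generic second-order band-pass section (assumed design)."""
    w0 = 2.0 * math.pi * f0 / fs
    alpha = math.sin(w0) / (2.0 * q)
    a0 = 1.0 + alpha
    b = (alpha / a0, 0.0, -alpha / a0)                     # feed-forward taps
    a = (-2.0 * math.cos(w0) / a0, (1.0 - alpha) / a0)     # feedback taps
    return b, a


def channel_envelope(x, f0, q, fs, frame_len, smooth=0.05):
    """Band-pass filter x, rectify, low-pass, and keep one value per frame."""
    b, a = bandpass_coeffs(f0, q, fs)
    x1 = x2 = y1 = y2 = 0.0   # the two input and two output delay elements
    env = 0.0
    frames = []
    for n, xn in enumerate(x, start=1):
        yn = b[0] * xn + b[1] * x1 + b[2] * x2 - a[0] * y1 - a[1] * y2
        x2, x1 = x1, xn
        y2, y1 = y1, yn
        env += smooth * (abs(yn) - env)   # absolute value, then one-pole low-pass
        if n % frame_len == 0:
            frames.append(env)
    return frames
```

Running `channel_envelope` once per channel, each with its own center frequency, would give the $K$ components of one frame pattern $X_i$.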

In the present invention the feature pattern extraction unit 221 is configured for BPF analysis, but LPC analysis may be used instead. A detailed embodiment for that case is given in B. S. Atal et al., "Speech Analysis and Synthesis by Linear Prediction of the Speech Wave," Journal of the Acoustical Society of America, Vol. 50, pp. 637-655 (1971).

FIG. 5 shows one embodiment of the prosodic information extraction unit 222 in detail. Power and pitch period are used as the parameters representing prosodic information.

The power calculation unit 2221 calculates the power $P$ (short-time energy) of the input speech $x_n$ starting from time $M$ according to the following equation.

$$P = \sum_{n=M+1}^{M+N} x_n^2 \qquad (1)$$

Here, $N$ is the number of sample points in one frame. In the embodiment of FIG. 5, the multiplier 22211 takes the input speech $x_n$ and computes $x_n^2 = x_n \times x_n$. The adder 22212 adds $x_n^2$ to the intermediate power value $P_{n-1}$ accumulated up to time $n-1$ (taking $M = 0$ in equation (1) for simplicity), giving the new intermediate power value $P_n$ at time $n$. The same processing is then repeated via the delay buffer 22213 until the final power $P$ is obtained (corresponding to $n = N$).
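The multiply-and-accumulate loop described for the power calculation unit can be written out as a short Python sketch. The function name and the `start` argument (playing the role of $M$, counted from zero) are assumptions made for the illustration.

```python
def short_time_power(x, frame_len, start=0):
    """Accumulate x_n * x_n over one frame of frame_len samples."""
    p = 0.0                                   # running intermediate power value
    for xn in x[start:start + frame_len]:
        p += xn * xn                          # multiplier output fed to the adder
    return p


power = short_time_power([1.0, 2.0, 3.0, 4.0], frame_len=4)
```

Each pass through the loop corresponds to one trip around the delay buffer, so `p` after the last sample is the final power for the frame.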

The pitch period calculation unit 2222 obtains the pitch period of the input speech waveform from the autocorrelation function of the center-clipped waveform. The pitch period (its reciprocal is called the pitch frequency or fundamental frequency) is an important parameter governing the height of a sound. It is determined basically by a physical characteristic of the speaker, namely the shape of the mouth (technically, the length of the vocal tract), and varies somewhat with emphasis or intonation. There are many methods of calculating the pitch period, but no perfect method has yet been found. This embodiment derives it from the autocorrelation function of the waveform. The method is described in detail in L. R. Rabiner et al., "Digital Processing of Speech Signals," Prentice-Hall, pp. 150-157, and is outlined briefly below. The center-clipped waveform $y_n$ is obtained from

$$y_n = C[x_n] \qquad (2)$$

where $C[x]$ is the center-clipping function. The pitch period is found by comparing the magnitudes of the $i$-th order autocorrelation function $R(i)$ of the center-clipped waveform,

$$R(i) = \sum_{n} y_n \, y_{n-i} \qquad (3)$$

That is, if the pitch period is $t_p$, then

$$R(i) \approx R(0) \quad (i = t_p,\ 2t_p,\ 3t_p,\ \ldots), \qquad R(i) \approx 0 \quad (\text{all other } i) \qquad (4)$$

so the pitch period $t_p$ is found from the relative magnitudes of $R(i)$.

In the embodiment of FIG. 5, the multiplier 22221 multiplies the input speech waveform $x_n$ by the clipping function $C[x]$ read from the clipping function memory 22222, giving the center-clipped waveform $y_n$. Next, using the $i$-th order delay buffer 22223, the multiplier 22224 computes the product of $y_n$ and $y_{n-i}$. The adder 22225 adds $y_n y_{n-i}$ to the intermediate value $R_{n-1}(i)$ of the $i$-th order autocorrelation function accumulated up to time $n-1$, giving the new intermediate value $R_n(i)$ at time $n$. The same processing is repeated via the delay buffer 22226 until the final value $R(i)$ is obtained. The values of $R(i)$ are then input to the comparator 22227, which compares their magnitudes, and the pitch period $t_p$ is obtained from the relationship in equation (4).
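A minimal Python sketch of pitch estimation by center clipping and autocorrelation follows. Several points are assumptions rather than details from the text: the clipping function is the common "subtract the threshold, zero the middle" form, the threshold is a fixed fraction `clip_ratio` of the peak amplitude, and the pitch period is taken as the lag with the largest $R(i)$ inside a caller-supplied range.

```python
import math


def center_clip(x, level):
    """Assumed clipping function: zero the band [-level, level], shift the rest."""
    if x > level:
        return x - level
    if x < -level:
        return x + level
    return 0.0


def autocorr(y, lag):
    """R(lag) = sum over n of y[n] * y[n - lag]."""
    return sum(y[n] * y[n - lag] for n in range(lag, len(y)))


def pitch_period(x, min_lag, max_lag, clip_ratio=0.3):
    """Return the lag in [min_lag, max_lag] with the largest autocorrelation."""
    level = clip_ratio * max(abs(v) for v in x)
    y = [center_clip(v, level) for v in x]
    return max(range(min_lag, max_lag + 1), key=lambda i: autocorr(y, i))


# Synthetic test signal with a period of 40 samples.
signal = [math.sin(2.0 * math.pi * n / 40.0) for n in range(400)]
period = pitch_period(signal, min_lag=20, max_lag=100)
```

Restricting the search to a lag range is a practical way to skip the trivial peak at lag 0; the comparator in the text could equally be applied to all lags with a rule derived from equation (4).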

FIG. 6 shows one embodiment of the word recognition unit 23 in detail. Taking the input speech pattern $X_i$ as input, the phoneme recognition unit 231 recognizes the phonemes making up the word, using the feature patterns of standard phonemes read from the phoneme standard pattern memory 61. The word matching unit 232 compares and matches, at the symbol-sequence level, the phoneme symbol strings of the vocabulary read from the word dictionary memory 62 against the phoneme sequence of the word obtained by the phoneme recognition unit 231. The decision unit 233 uses the symbol matching result to output the recognition result for the word.

FIG. 7 shows one embodiment of the phoneme recognition unit 231 in detail. The distance calculation unit 2311 calculates the inter-frame distance $d_{ij}$ between the input speech pattern $X_i$ and the phoneme standard pattern $Y_j$, and the matching unit 2312 matches the input speech against the phoneme standards. The matching unit generally performs DP (dynamic programming) matching. Next, the candidate decision unit 2313 takes the matching values $D^m_{i,J_m}$ for each standard pattern $m$ (where $J_m$ is the frame length of standard pattern $m$), finds, for example, the standard pattern $m$ giving the minimum value, and outputs the phoneme symbol string IPHCD($k$) (where $k$ is the position in the phoneme string).

FIG. 8 shows one embodiment of the distance calculation unit 2311 in detail. This embodiment uses the absolute-value distance. The absolute-value distance $d_{ij}$ between the feature patterns $X_i$ and $Y_j$ of two utterances is obtained as

$$d_{ij} = \sum_{k=1}^{K} \left| x_{ki} - y_{kj} \right| \qquad (5)$$

where $i$ and $j$ are frames and $K$ is the number of BPF channels. In the embodiment, the two feature patterns $X_i$ and $Y_j$ are input via the frame pattern registers 23111 and 23112 respectively. The subtractor 23113 computes $x_{ki} - y_{kj}$, the absolute-value converter 23114 computes $|x_{ki} - y_{kj}|$, and the adder 23115 accumulates the result from $k = 1$ to $K$. The result $d_{ij}$ is stored in the distance register 23116. Although this embodiment uses the absolute-value distance, a correlation measure between feature patterns obtained by LPC analysis could also be used. A specific embodiment for that case is described in detail in F. Itakura et al., "Minimum Prediction Residual Principle Applied to Speech Recognition," IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. ASSP-23, pp. 57-72 (Feb. 1975).
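The absolute-value distance between two frames, and the full table of distances between an input pattern and a standard pattern, can be sketched as follows. The function names and the list-of-lists layout (one inner list of $K$ channel values per frame) are assumptions for the illustration.

```python
def frame_distance(x_frame, y_frame):
    """Sum of |x_k - y_k| over the K channels of two frames."""
    if len(x_frame) != len(y_frame):
        raise ValueError("frames must have the same number of channels")
    return sum(abs(xk - yk) for xk, yk in zip(x_frame, y_frame))


def distance_matrix(X, Y):
    """d[i][j] for every input frame i and every standard-pattern frame j."""
    return [[frame_distance(xi, yj) for yj in Y] for xi in X]


d_example = frame_distance([1, 2, 3], [2, 0, 3])
```

The matrix form is convenient because the DP matching stage only ever needs $d_{ij}$ and never the raw channel values.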

FIG. 9 shows one embodiment of the matching unit 2312 in detail. The principle is an improvement on the "continuous DP method" of JP-A-55-2205. From the inter-frame distance $d_{ij}$ between frame $i$ of the input speech and frame $j$ of the standard pattern, the cumulative distance $D_{ij}$ is calculated with the following recurrence.

$$D_{ij} = \min \begin{cases} D_{i-1,j-2} + d_{i-1,j-1} \\ D_{i-1,j-1} \\ D_{i-2,j-1} + d_{i-1,j-1} \end{cases} + 2d_{ij} \qquad (6)$$

From this recurrence, the optimal matching value $D^m_{i,J_m}$ against standard pattern $m$ is obtained for each frame $i$ of the input speech ($J_m$ is the frame length of standard pattern $m$).

In a specific embodiment of the matching unit 2312, the inter-frame distance $d_{ij}$ between the input speech and the standard pattern is input via the frame distance register 23121, and the delay memory 23122 and the intermediate cumulative distance storage memory 23127 hold the values $d_{i-1,j-1}$, $D_{i-1,j-2}$, $D_{i-1,j-1}$, and $D_{i-2,j-1}$ of equation (6). From these values, the adder 23123 calculates $D_{i-1,j-2} + d_{i-1,j-1}$ for one path and the adder 23124 calculates $D_{i-2,j-1} + d_{i-1,j-1}$ for another; the comparator 23125 then finds the minimum of these two and $D_{i-1,j-1}$ for the remaining path. The adder 23126 adds $2d_{ij}$ to the minimum, giving the new intermediate cumulative distance $D_{ij}$. This result is stored in the intermediate cumulative distance storage memory 23127 and is used in calculating $D_{i+1,j+1}$.

For each frame $i$ of the input speech, the matching unit outputs the optimal matching value $D^m_{i,J_m}$ against standard pattern $m$ (with frame $i$ inside the vowel interval $i_{sk}$ to $i_{ek}$), which becomes the input to the candidate decision unit 2313. The decision unit determines which standard speech the input speech most resembles from the relative magnitudes of the matching values $D^m_{i,J_m}$. It consists of a simple magnitude comparator.
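The accumulation described above (take the smallest of three predecessor terms, then add $2d_{ij}$) can be sketched in Python as below. The boundary handling is an assumption, since the text does not state it: a match is allowed to start at any input frame, so the first column is set to $2d_{i,0}$, and predecessors that fall outside the table are treated as infinitely costly. Indices are zero-based.

```python
INF = float("inf")


def continuous_dp(d):
    """Cumulative distances D[i][j] from frame distances d[i][j] (0-based)."""
    n_in, n_ref = len(d), len(d[0])
    D = [[INF] * n_ref for _ in range(n_in)]

    def prev(i, j):
        # Predecessors outside the table are unreachable.
        return D[i][j] if i >= 0 and j >= 0 else INF

    for i in range(n_in):
        for j in range(n_ref):
            if j == 0:
                D[i][0] = 2 * d[i][0]     # assumed start-anywhere boundary
                continue
            skip = d[i - 1][j - 1] if i >= 1 else INF
            best = min(
                prev(i - 1, j - 2) + skip,   # path that skips a reference frame
                prev(i - 1, j - 1),          # diagonal path
                prev(i - 2, j - 1) + skip,   # path that skips an input frame
            )
            D[i][j] = best + 2 * d[i][j]
    return D


def end_scores(d):
    """Matching value at the last reference frame, one per input frame."""
    return [row[-1] for row in continuous_dp(d)]


# Scalar "frames": the reference 1, 2, 3 is embedded in a longer input.
reference = [1, 2, 3]
observed = [0, 0, 1, 2, 3, 0, 0]
d = [[abs(a - b) for b in reference] for a in observed]
scores = end_scores(d)
```

Here the score reaches zero at the input frame where the embedded pattern ends, which is the kind of minimum the candidate decision unit looks for.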

FIG. 10 shows one embodiment of the word matching unit 232 in detail. The phoneme symbol string IPHCD($k$) from phoneme recognition of the word and the phoneme symbol string of a standard word read from the word dictionary are input via the phoneme symbol string registers 2321 and 2322 respectively. The comparator 2323 then compares the two phoneme symbol strings one code at a time, and the adder 2324 obtains the difference over the whole sequence (the total distance). The total distances for all standard words are compared via the distance register 2325 by the comparator 2326, and the recognition result (the standard word with the minimum total distance) is output.
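Symbol-level word matching of this kind reduces to counting mismatches and taking the minimum. The sketch below assumes a position-by-position comparison in which any length difference is counted as extra mismatches; the text does not say how strings of unequal length are handled, and the phoneme symbols and dictionary entries are hypothetical.

```python
def symbol_distance(recognized, reference):
    """Position-wise mismatch count; a length difference adds to the count."""
    mismatches = sum(1 for a, b in zip(recognized, reference) if a != b)
    return mismatches + abs(len(recognized) - len(reference))


def match_word(recognized, dictionary):
    """Return the dictionary word whose symbol string is closest."""
    return min(dictionary, key=lambda w: symbol_distance(recognized, dictionary[w]))


# Hypothetical word dictionary of phoneme symbol strings.
WORDS = {
    "onsei": ["o", "N", "s", "e", "i"],
    "onsen": ["o", "N", "s", "e", "N"],
    "ninshiki": ["n", "i", "N", "sh", "i", "k", "i"],
}
best = match_word(["o", "N", "z", "e", "i"], WORDS)
```

Sorting the dictionary by `symbol_distance` instead of taking only the minimum would give the ranked candidate list that the word sequence selection stage expects.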

The decision unit 233 consists of a simple magnitude comparator.

FIG. 11 shows one embodiment of the keyword location search unit 24 in detail. Before describing the embodiment, the principle of keyword search is explained. It is known that a speaker generally utters the information they intend to convey slowly, or with emphasis on its content (mainly words). Prosodic information is thus reasonable and natural information about the content of an utterance, and is important for dividing spoken language into semantic units. This is described in detail in Japanese Patent Application No. 61-75528, "Speech conversation sentence structure estimation method". The keyword location search unit 24 of the present invention uses one embodiment of that application.

Specifically, the input speech division unit 241 divides the input speech into semantic units based on features of prosodic information (pitch period and power) such as emphasis or intonation. The phrasing estimation unit 242 estimates the phrasing of the utterance from those prosodic features, after which the sentence structure estimation unit 243 estimates the sentence structure of the input speech using information read from the sentence structure dictionary 244. Next, using the sentence structure information, the keyword extraction unit 245 finds the semantically important words (keywords), which completes the keyword search.

Before describing an embodiment of the word sequence selection unit 3, which is the main point of the present invention, the principle of the invention is explained in detail. FIG. 12 shows an example of the operation of the word sequence selection unit 3. Suppose, for example, that the three words "onsei" (speech), "ninshiki" (recognition), and "gijutsu" (technology) are input, and that the word candidate strings $T_{ij}$ output by the speech recognition unit ($i$ is the input order of the utterance and $j$ is the rank among the multiple candidates output for each input) are as in the figure. For the input "onsei" the first and second candidates are "onsen" (hot spring) and "onsei"; for the input "ninshiki" they are "ninshiki" and "minshuku" (guesthouse); and for the input "gijutsu" the candidates are "jijitsu" (fact), "...", and "gijutsu".

The word sequence selection unit then narrows down the candidate words using the information on semantic relationships between words described in the concept network. In the example of the figure, the relationship "speech"-"recognition"-"technology" is described in the concept network, so the word sequence finally judged possible is "speech"-"recognition"-"technology". The relationship "hot spring"-"guesthouse" may well also be described in the concept network, but none of the candidates obtained for the third input ("fact", "...", "technology") can be related to it, so it is excluded from the final word sequences.

FIG. 13 shows the structure of the concept network used in the word sequence selection unit in simplified form. The concept network is a knowledge representation scheme that expresses the semantic connections between concepts (words) as a network of two kinds of relationship: hierarchical subsumption relationships (for example, "company"-"Hitachi"-"Hitachi Central Research Laboratory") and concrete relationships that express specific information such as the articles to be searched (for example, "Hitachi Central Research Laboratory"-(developed)-"S820"). For details see, for example, Hitachi Hyoron, Vol. 69, No. 3 (March 1987). The concept network for the example of FIG. 12 is also shown.

FIG. 14 shows one embodiment of the word sequence selection unit 3, the main point of the present invention, in detail. The input is a candidate word string holding multiple recognition candidates for each of several input spoken words. Each word candidate is assigned a word number and passed via the word number buffer 31 to the comparison unit 32. The comparison unit uses the concept network 78, which represents knowledge of the hierarchical subsumption relationships among the recognizable vocabulary (concepts) and of the relationships between concepts, to select only those recognition candidates that can be related. The relationships tested here are not those among the multiple candidates obtained for a single input utterance, but those between candidates obtained for different input utterances. The comparison unit 32 therefore reads the results already selected for earlier utterances from the word sequence intermediate memory 33, and the processing is repeated until all input utterances have been handled. The word sequences finally obtained are output via the word sequence buffer 34 and are used as the search word sequences.
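The incremental behaviour described for FIG. 14, in which each new utterance is compared against the sequences already held in the intermediate memory, can be sketched as follows. As before, the `RELATED` set is a hypothetical stand-in for the concept network, English glosses replace the Japanese words, and a new candidate is tested only against the last word of each stored sequence, which is an assumption.

```python
# Hypothetical stand-in for the concept network: unordered pairs of related words.
RELATED = {
    frozenset(("speech", "recognition")),
    frozenset(("recognition", "technology")),
    frozenset(("hot spring", "guesthouse")),
}


def related(a, b):
    return frozenset((a, b)) in RELATED


def select_incrementally(inputs):
    """Process utterances one at a time, recording the memory after each step."""
    memory = None            # plays the role of the word sequence intermediate memory
    history = []
    for candidates in inputs:
        if memory is None:
            memory = [(word,) for word in candidates]
        else:
            memory = [
                seq + (word,)
                for seq in memory
                for word in candidates
                if related(seq[-1], word)
            ]
        history.append(list(memory))
    return history


steps = select_incrementally([
    ["hot spring", "speech"],
    ["recognition", "guesthouse"],
    ["fact", "technology"],
])
```

The second entry of `steps` still holds both "hot spring"-"guesthouse" and "speech"-"recognition"; only the third utterance eliminates the former, matching the walk-through of FIG. 12.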

As described above, the main point of the present invention is to use the knowledge of relationships between concepts expressed in the concept network to pick out the most plausible result from the multiple recognition candidates obtained for each input spoken word.

FIG. 15 shows one embodiment of the search processing unit 4 in detail. The search condition statement generation unit 41 takes the search word sequence as input and uses the relationships between concepts expressed in the concept network to generate the search condition statement needed for information retrieval. The search condition statement generated here is an abstract concept built from vague input (for example, when the input is "article", "company", "computer", "development", the search condition statement is "an article about a computer developed at some company"). Next, the concept matching unit 42 refers to the knowledge expressed in the concept network and tests whether the search condition statement can be related through more specific concepts (concepts lower in the subsumption hierarchy). As a result, information such as articles expressed in specific concepts that satisfy the search condition statement can be located. This processing is called concept matching. The retrieved results are output via the search number buffer 43.
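Concept matching, as described here, checks whether a concrete fact satisfies an abstract condition by walking up the subsumption hierarchy. The Python sketch below illustrates the idea with the examples named in the text; the table layout, the function names, and the article number 101 are hypothetical.

```python
# Hypothetical subsumption links (child -> parent).
PARENT = {
    "Hitachi": "company",
    "Hitachi Central Research Laboratory": "Hitachi",
    "S820": "computer",
}

# Hypothetical concrete relationships: (subject, relation, object, article number).
FACTS = [
    ("Hitachi Central Research Laboratory", "developed", "S820", 101),
]


def is_a(concept, ancestor):
    """True if concept equals ancestor or lies below it in the hierarchy."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = PARENT.get(concept)
    return False


def concept_match(subject_class, relation, object_class):
    """Article numbers whose concrete fact satisfies the abstract condition."""
    return [
        article
        for subject, rel, obj, article in FACTS
        if rel == relation and is_a(subject, subject_class) and is_a(obj, object_class)
    ]


hits = concept_match("company", "developed", "computer")
```

The abstract condition "a computer developed at some company" is satisfied because "Hitachi Central Research Laboratory" lies below "company" and "S820" lies below "computer" in the assumed hierarchy.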

Based on the search number, the display unit 5 reads the contents of the retrieved information from the information storage unit 9 and displays them on a display device or the like. Details of the search condition sentence generation are described in Hitachi Review (Hitachi Hyoron), Vol. 69, No. 3 (1987-3).

FIG. 16 shows in detail an embodiment of the standard speech word dictionary generation unit 7. From the concepts (words) described in the concept network table obtained for the information to be searched, it generates the standard speech patterns used to recognize input speech (specifically, a word dictionary expressed in syllable codes or the like). The concepts in the concept network are expressed as word syllable codes (for example, "Hitachi" is /hi ta chi/). The transformation comparison unit 71 compares the input word syllable codes against the transformation rules 73, which describe how speech varies and is transformed by coarticulation and the like, and the word dictionary generation unit 72 then generates the word dictionary. In the case of "Hitachi", for example, the i of /chi/ is often devoiced, so the generated entry carries a notation indicating that this /i/ may be devoiced. The result is stored in the word dictionary of the standard speech patterns 6.
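The dictionary generation step can be illustrated with a small sketch. It assumes that each transformation rule maps a syllable code to its possible realizations and that the dictionary lists every resulting variant. The rule table, the `ch(i)` marker for a devoiced /i/, and the function name `generate_entries` are hypothetical. The patent's own notation for devoicing is not reproduced here.

```python
from itertools import product

# Hypothetical transformation rules: syllable code -> possible realizations.
# "ch(i)" is an ad hoc marker meaning that the /i/ may be devoiced.
TRANSFORMATION_RULES = {
    "chi": ["chi", "ch(i)"],
}

def generate_entries(syllables):
    """Expand a word's syllable codes into every variant allowed by the rules."""
    options = [TRANSFORMATION_RULES.get(s, [s]) for s in syllables]
    return [" ".join(variant) for variant in product(*options)]

# "Hitachi" written as syllable codes, as in the example in the text.
entries = generate_entries(["hi", "ta", "chi"])
print(entries)
```

Real devoicing depends on the neighboring sounds, so a context-sensitive rule format would be closer to the comparison step described in the text. The table lookup is kept deliberately simple.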

FIG. 17 is a block diagram showing another embodiment of the voice-based interactive information retrieval apparatus, which is one aspect of the present invention. This embodiment uses the content of the words already input and recognized, together with knowledge about the relationships between words expressed in the concept network, to narrow down and restrict the categories of standard speech used when recognizing the next input speech. Specifically, the vocabulary selection unit 13 takes as input the word string recognized by the speech recognition unit 2 and, from the combination of words input up to that point, narrows down the words that may be input next by using the knowledge about the relationships between words read from the concept network table 8. The narrowed-down result is input to the standard speech pattern input control unit 14, which restricts the categories of standard speech used when recognizing the next input speech. The speech recognition unit 2, search processing unit 4, display unit 5, standard speech pattern memory 6, concept network table 8, information storage unit 9, and control unit 12 have the same configuration as in the voice-based interactive information retrieval apparatus shown in FIG. 1.

FIG. 18 shows in detail an embodiment of the vocabulary selection unit 13. The word recognition result obtained by the speech recognition unit 2 passes through the word number buffer 131 and is stored in the word sequence intermediate memory 132, appended to the word sequence recognized up to that point. Next, the vocabulary narrowing unit 133 uses the combination of words read from the word sequence intermediate memory 132 and the knowledge about the relationships between words read from the concept network table 8 to narrow down the vocabulary of words that may be input next. The result is input to the standard speech pattern input control unit 14 via the vocabulary category buffer 134.

Based on the narrowed-down vocabulary categories, the standard speech pattern input control unit 14 loads from the standard speech pattern memory 6 only the standard speech belonging to those categories.
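The vocabulary narrowing and the restricted loading of standard speech patterns can be sketched as follows. The sketch assumes that the concept network maps each word to its related words and that the next word must be related to every word recognized so far. That intersection rule, the sample data, and the names `NETWORK`, `PATTERN_MEMORY`, `narrow_vocabulary`, and `load_patterns` are hypothetical. The patent does not specify the narrowing criterion at this level of detail.

```python
# Hypothetical concept network: word -> words related to it.
NETWORK = {
    "article": {"computer", "company", "development"},
    "computer": {"article", "company", "development"},
    "company": {"article", "computer", "development"},
    "development": {"article", "computer", "company"},
    "weather": {"forecast"},
    "forecast": {"weather"},
}

# Hypothetical standard speech pattern memory: word -> stored template (placeholder strings).
PATTERN_MEMORY = {word: f"<template:{word}>" for word in NETWORK}

def narrow_vocabulary(recognized):
    """Words related to every word recognized so far, excluding words already input."""
    if not recognized:
        return set(NETWORK)  # no context yet: the whole vocabulary stays active
    allowed = set.intersection(*(NETWORK[w] for w in recognized))
    return allowed - set(recognized)

def load_patterns(allowed):
    """Fetch only the templates that belong to the narrowed vocabulary."""
    return {word: PATTERN_MEMORY[word] for word in sorted(allowed)}

active = load_patterns(narrow_vocabulary(["article", "computer"]))
print(list(active))
```

Restricting the active templates shrinks the set the recognizer has to compare against for the next word, which is the benefit this embodiment aims at.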

[Effects of the Invention]

According to the present invention, the most likely candidate can be selected from the multiple recognition candidates that are output. The invention is therefore fully usable even at the current level of speech recognition performance, and it enables interactive, speech-based information retrieval even from a vague recollection.

[Brief description of the drawings]

FIG. 1 is a block diagram of a voice-based interactive information retrieval apparatus showing an embodiment of the present invention. FIGS. 2 to 16 are block diagrams showing in detail an embodiment and the operating principle of each processing unit in FIG. 1. FIG. 17 is a block diagram showing another embodiment of the voice-based interactive information retrieval apparatus, which is one aspect of the present invention. FIG. 18 is a block diagram showing part of FIG. 17 in detail.

3: word sequence selection unit; 4: search processing unit; 7: standard speech word dictionary generation unit; 8: concept relation table memory; 10: input control unit; 13: vocabulary selection unit.

Continuation of front page. (56) References: JP-A-62-226374 (JP, A); JP-A-59-53985 (JP, A); Fujisawa, Hatakeyama, Fujinawa, "Intelligent filing system using concept network," Hitachi Review (Hitachi Hyoron), Vol. 69, No. 3, 1987, pp. 231-238.

Claims

(57) [Claims]

1. A voice-based interactive information retrieval apparatus comprising: means for inputting speech; speech analysis means for extracting features of the input speech; recognition processing means for performing recognition, mainly of words, by comparing and collating the features of the input speech obtained from the speech analysis means with features of standard speech analyzed in advance and stored as standard speech patterns; word sequence selection means for selecting, using a concept network expressed in advance by relationships between keywords relating to the information to be searched, one or more word sequences to be used for information retrieval from the plurality of word candidate sequences of the input speech obtained by the recognition processing means; means for generating a search condition sentence necessary for search processing, using the information retrieval word sequence obtained by the word sequence selection means and the concept network; concept matching means for selecting the content to be searched by collating the search condition sentence with the concept network; and means for displaying, on a display, the search result obtained by the concept matching means.

2. The voice-based interactive information retrieval apparatus according to claim 1, wherein the input speech is speech uttered naturally in a conversational manner, the apparatus further comprising keyword location searching means for dividing the input speech into word or phrase units using prosodic information of the input speech obtained from the speech analysis means, and for searching for the portions containing important words.

3. The voice-based interactive information retrieval apparatus according to claim 1, further comprising input control means for controlling the content of the speech to be input.

4. The voice-based interactive information retrieval apparatus according to claim 1, further comprising means for outputting the search result by speech.

5. The voice-based interactive information retrieval apparatus according to claim 1, further comprising means for arbitrarily generating, as the standard speech pattern, an acoustic feature pattern of a word from the word name representing a concept of the concept network.

6. A voice-based interactive information retrieval apparatus comprising: means for inputting speech; speech analysis means for extracting features of the input speech; recognition processing means for performing recognition, mainly of words, by comparing and collating the features of the input speech obtained from the speech analysis means with features of standard speech analyzed in advance and stored as standard speech patterns; vocabulary selection means for narrowing down the categories of the standard speech using a concept network expressed in advance by relationships between keywords relating to the information to be searched; means for generating a search condition sentence necessary for search processing, using the word sequence of the word recognition result of the input speech obtained by the recognition processing means and the concept network; and concept matching means for selecting the content to be searched by collating the search condition sentence with the concept network.