JP2010009446A

JP2010009446A - System, method and program for retrieving voice file

Info

Publication number: JP2010009446A
Application number: JP2008170021A
Authority: JP
Inventors: Nobuyasu Ito; 伸泰伊東; Takehito Kurata; 岳人倉田
Original assignee: International Business Machines Corp
Current assignee: International Business Machines Corp
Priority date: 2008-06-30
Filing date: 2008-06-30
Publication date: 2010-01-14
Anticipated expiration: 2028-06-30
Also published as: JP5068225B2

Abstract

PROBLEM TO BE SOLVED: To make it easy to register new words and enter keywords, with as little awareness as possible of the contents of the speech recognition dictionary, so that speech recognition can be connected to subsequent language processing and retrieval.

SOLUTION: When a new word is registered or a keyword is entered, the user first inputs its "reading". The reading is converted from pronunciation to notation by the same language model that is used for speech recognition, which yields a kana-kanji notation. The obtained kana-kanji notation is then compared, as appropriate, with the corrected word and the original character string to identify words unknown to the speech recognition dictionary. The converted keyword can be used to search the retrieval data created by speech recognition of voice files. The unknown word portion can be registered in the speech recognition dictionary as appropriate.

COPYRIGHT: (C)2010, JPO&INPIT

Description

The present invention relates to the processing of text created by speech recognition, and more particularly to searching text created from voice files by speech recognition.

In recent years, for purposes such as business integrity, attempts have been made to record conversations at call centers and sales offices, convert the recordings to text by speech recognition, and then apply search and text mining to that text.

In text mining and keyword search, language processing using a dictionary is generally performed. When an unknown or new word is encountered, the word is newly registered, so the dictionary is updated.

Japanese in particular shows large variation in notation, such as whether a word is written in kana, kanji, or the alphabet, so dictionaries for language processing commonly register synonyms in several notations. In speech recognition, however, if synonyms with the same pronunciation are registered, there is no means of deciding which one to output, and the enlarged vocabulary lowers recognition accuracy. Care is therefore needed to register only a single notation wherever possible and to keep the entropy of the language model low.

For example, the speech recognition result for "shikumisai" (structured bond) could be written in two ways, 「仕組み債」 and 「仕組債」, and the notation is unified to 「仕組み債」.

Speech recognition and the subsequent processing also often need to use different units, each for its own reasons, which complicates the problem further. Dictionary registration and keyword input for language processing must always be done with awareness of which units and notations the upstream speech recognition uses.

However, because the user does not know the details of the language model behind the speech recognition results, it has been difficult to perform dictionary registration and keyword input for language processing in a way that matches the upstream speech recognition.

Japanese Patent Application Laid-Open No. 7-152756 discloses combining purpose-specific dictionaries into one for dictionary-based processes such as kana-kanji conversion, speech synthesis, and speech recognition, in order to simplify maintenance and reduce storage capacity.

Japanese Patent Application Laid-Open No. 8-314915 discloses an information expression mode conversion device that provides dictionary means corresponding to a plurality of conversion functions and searches the dictionary means using a plurality of indexes.

Japanese Patent Application Laid-Open No. 2000-339305 discloses a technique for creating documents with improved input accuracy and operability by using two input methods, keyboard input and voice input.

Japanese Patent Application Laid-Open No. 2004-145014 discloses an automatic voice response device and an automatic voice response method that make grammar and dictionary management easy and can respond faithfully to the input voice.

[NAGATA 1999] (details below) discloses techniques related to unknown word search.

[MORI 1999] (details below) discloses a kana-kanji conversion technique based on a probabilistic model.

Japanese Patent Application Laid-Open No. 7-152756
Japanese Patent Application Laid-Open No. 8-314915
Japanese Patent Application Laid-Open No. 2000-339305
Japanese Patent Application Laid-Open No. 2004-145014
[NAGATA 1999] Nagata, M.: A part of speech estimation method for Japanese unknown words using a statistical model of morphology and context, Proc. of the 37th ACL, pp. 277-284, 1999.
[MORI 1999] Mori, S., Tsuchiya, M., Yamaji, O., Nagao, M.: Kana-kanji conversion by a stochastic model, IPSJ Transactions, Vol. 40, No. 7, pp. 2946-2953, 1999.

Even if the above prior art is combined, the underlying problem remains: to search voice files properly, keywords must be entered with the speech recognition language model in mind.

Accordingly, an object of the present invention is to make new word registration and keyword input easy, with as little awareness as possible of the contents of the speech recognition dictionary, so that speech recognition can be connected to subsequent language processing and search.

New word registration and keyword input are normally done with the computer's kana-kanji conversion function. The "reading" is entered at that stage, but it is discarded once the notation is confirmed. A dictionary for speech recognition, that is, a language model, on the other hand always requires readings, and it holds notations, readings, and word co-occurrence probabilities.

The present invention focuses on this point. The same language model as the speech recognition dictionary, that is, the language model used to attach text keywords to voice files, is also used to convert the "reading" during subsequent new word registration and search.

To convert a voice file into text, the analog voice signal is first converted into a digital signal. A frequency-domain signal is generated from it by discrete Fourier transform over a predetermined time window, a logarithmic spectrum is generated from that signal, and a cepstrum is then generated by discrete cosine transform. From the cepstrum, the waveform amplitude, fundamental frequency, power spectrum envelope, and the like are further extracted by known techniques, and these become the acoustic features. The acoustic model uses these acoustic features to determine which words each part of the utterance may correspond to. In doing so, the acoustic model uses a probabilistic model to obtain the possible words stochastically.

Once the possible word sequences have been determined in this way, the language model is applied to them. The language model uses a probabilistic model to predict and determine, from the context (neighboring words), which word sequence is most likely.

According to the present invention, in new word registration and keyword input, the user first inputs the "reading". This reading is converted from pronunciation to notation by the same language model as the one used for speech recognition, which yields a kana-kanji notation. The obtained kana-kanji notation is then compared, as appropriate, with the corrected word and the original character string, and words unknown to the speech recognition dictionary are thereby identified. The resulting keyword can be used to search the index data created by speech recognition of voice files. The unknown word portion can be registered in the speech recognition dictionary as appropriate.

According to the present invention, new words can be registered and keywords entered easily, without awareness of the contents of the speech recognition dictionary, in speech recognition and the subsequent language processing and search.

The configuration and processing of an embodiment of the present invention are described below with reference to the drawings. Unless otherwise noted, the same elements are referred to by the same reference numerals throughout the drawings. The configuration and processing described here are given as one embodiment, and there is no intention to limit the technical scope of the present invention to this embodiment.

FIG. 1 is a schematic block diagram showing one embodiment of a hardware configuration for carrying out the present invention. The configuration in FIG. 1 consists of client systems 110, preferably a plurality of them, with which individual users perform searches; a voice storage server 120 that stores voice data files in a searchable form; a speech recognition server 130 that performs speech recognition on the voice data according to an acoustic model and a language model to create index data; and a network 140 that connects the client systems 110, the voice storage server 120, and the speech recognition server 130.

The network 140 can use any form of connection, such as a LAN, a WAN, the Internet, or an intranet.

A configuration in which the client system 110, the voice storage server 120, and the speech recognition server 130 are physically separate and connected by the network 140 is not essential. The system of the present invention can also be configured stand-alone by copying the voice data and the index to a local system.

A web browser 112 is installed on the client system 110. The web browser 112 can interpret in-content scripts such as JavaScript (trademark). It accepts input from the user and performs search operations in cooperation with programs such as Perl or PHP on the voice storage server 120, via the communication interface 114 on the client system 110 side and the communication interface 122 on the voice storage server 120 side. A combination of ordinary built-in HTML forms and CGI can also be used instead of JavaScript (trademark).

Alternatively, the search function may be built on the voice storage server 120 with a server-side Java (trademark) mechanism such as servlets or JSP.

The voice storage server 120 holds voice data 124 and index data 126 used to search for the individual voice files stored in the voice data 124. The voice data 124 is, for example, call center conversations, broadcast programs, or podcast data. Because voice data is difficult to search in its raw form, it is sent sequentially to the speech recognition server 130 via the communication interface 122, the network 140, and the communication interface 136.

The speech recognition server 130 has an acoustic model 132 and a language model 134 for recognizing voice files and generating index text. The generated index text is provided to the voice storage server 120 as index data 126 via the communication interface 136 and the communication interface 122.

The search function on the voice storage server 120 side can be implemented with a database search system such as PostgreSQL or MySQL.
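As a rough illustration of searching index data with a database, the following sketch uses Python's built-in `sqlite3` module as a stand-in for PostgreSQL or MySQL. The table name `index_data`, its columns, the file names, and the sample rows are all hypothetical and are not taken from the embodiment.

```python
import sqlite3


def build_index(rows):
    """Create an in-memory table that maps voice files to recognized text."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE index_data (voice_file TEXT, recognized_text TEXT)"
    )
    conn.executemany("INSERT INTO index_data VALUES (?, ?)", rows)
    return conn


def search(conn, keyword):
    """Return the voice files whose recognized text contains the keyword."""
    cur = conn.execute(
        "SELECT voice_file FROM index_data "
        "WHERE instr(recognized_text, ?) > 0 ORDER BY voice_file",
        (keyword,),
    )
    return [row[0] for row in cur]


# Hypothetical recognition results for two recorded calls.
conn = build_index([
    ("call_001.wav", "株券 貸借 取引 の ご 案内"),
    ("call_002.wav", "仕組み債 の 説明"),
])
```

With this data, a search for 「仕組み債」 finds `call_002.wav`, while a search for the variant notation 「仕組債」 finds nothing, which is the kind of notation mismatch between keyword and recognizer output that the invention aims to avoid.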

The language model 134 of the speech recognition server 130 is also accessible from the client system 110 and the voice storage server 120 via the network 140 and the respective communication interfaces.

The language model 134 may also accept operations such as registering, editing, and deleting words, performed from the client system 110 of an authenticated privileged user.

FIG. 2 is a more detailed block diagram showing, generically, the hardware configuration of the client system 110, the voice storage server 120, and the speech recognition server 130.

The configuration in FIG. 2 has a main memory 204 and a CPU 206, both connected to a bus 202. The CPU is preferably based on a 32-bit or 64-bit architecture; for example, an Intel Pentium (R) 4, Xeon (R), or Core 2 Duo, or an AMD Athlon (R) can be used. A display 210 such as an LCD monitor is connected to the bus 202 via a display controller 208. On the client system 110, the display 210 is used by the user to perform searches while viewing the web browser 112. On the voice storage server 120, it is used by the system administrator to write programs in JavaScript (trademark), PHP, and the like and register them so that they can be called from the client system 110, and to register the user IDs and passwords of users who access the server through the client system 110.

A hard disk 214 and a DVD drive 216 are also connected to the bus 202 via an IDE controller 212.

On the client system 110, the hard disk 214 stores the operating system, the web browser 112, and other programs so that they can be loaded into the main memory 204. Any operating system that supports TCP/IP networking can be used, including but not limited to Windows (R) XP, Windows (R) Vista, Linux (R), and Mac OS.

On the voice storage server 120, the hard disk 214 stores the operating system, the voice data files 124, and the index data files 126. Any operating system that supports TCP/IP networking can be used, including but not limited to Windows (R) 2003 Server, Linux (R), and Mac OS. Programs such as Apache and Tomcat are also installed on the hard disk 214 so that the voice storage server 120 can act as a database server.

On the speech recognition server 130, the hard disk 214 stores the operating system, the acoustic model files 132, and the language model files 134. Any operating system that supports TCP/IP networking can be used, including but not limited to Windows (R) 2003 Server, Linux (R), and Mac OS. Programs such as Apache and Tomcat are also installed on the hard disk 214 so that the speech recognition server 130 can act as an application server.

The DVD drive 216 is used, as needed, to install additional programs onto the hard disk 214 from a CD-ROM or DVD disc. A keyboard 220 and a mouse 222 are also connected to the bus 202 via a keyboard/mouse controller 220.

The communication interface 224 preferably conforms to the Ethernet protocol. It physically connects the computer to the network 140 and provides the network interface layer for the TCP/IP communication protocol of the operating system's communication functions. The illustrated configuration is a wired connection, but a wireless LAN connection based on a wireless LAN standard such as IEEE 802.11a/b/g may be used instead.

The communication interface 224 is not limited to the Ethernet protocol. It may conform to any protocol, such as Token Ring, and is not limited to a specific physical communication protocol.

FIG. 3 is a functional block diagram explaining the speech recognition processing executed on the speech recognition server 130. The processing program is stored on the hard disk drive of the speech recognition server 130 and loaded into main memory as needed. In FIG. 3, the input signal 302 is preferably provided over the network 140 as an individual voice file from the voice data files 124 of the voice storage server 120. The speech recognition server 130 may also save the voice data file to its hard disk, play it back with a voice file player program, and use the reproduced analog signal as the input signal 302.

In the acoustic processing block 304, the analog input signal 302 is first converted into a digital signal by A/D conversion. If the input signal 302 is already digital, A/D conversion is unnecessary.

The acoustic processing block 304 then generates a frequency-domain signal by discrete Fourier transform over a predetermined time window, generates a logarithmic spectrum from it, and generates a cepstrum by discrete cosine transform. From the cepstrum, the waveform amplitude, fundamental frequency, power spectrum envelope, and the like are further extracted by known techniques, and these become the acoustic features.
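The chain of windowed discrete Fourier transform, logarithmic spectrum, and discrete cosine transform can be sketched for a single frame as follows. This is a minimal illustration in pure Python, not the embodiment's implementation. The Hamming window, the number of coefficients kept, and the flooring constant are assumptions, and practical front ends typically add steps such as mel filter banks that the text does not mention.

```python
import cmath
import math


def hamming(n):
    """Hamming window of length n (an assumed choice of time window)."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def dft_power(frame):
    """Power spectrum of a real frame via a direct DFT (bins 0..n/2)."""
    n = len(frame)
    power = []
    for k in range(n // 2 + 1):
        s = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(frame))
        power.append(abs(s) ** 2)
    return power


def dct2(values):
    """Unnormalized type-II discrete cosine transform."""
    m = len(values)
    return [
        sum(v * math.cos(math.pi * k * (i + 0.5) / m)
            for i, v in enumerate(values))
        for k in range(m)
    ]


def cepstrum(frame, n_coeffs=12, floor=1e-10):
    """Window -> DFT power spectrum -> log spectrum -> DCT -> first coefficients."""
    windowed = [x * w for x, w in zip(frame, hamming(len(frame)))]
    log_spec = [math.log(max(p, floor)) for p in dft_power(windowed)]
    return dct2(log_spec)[:n_coeffs]


# Example: one 64-sample frame of a pure tone.
frame = [math.sin(2 * math.pi * 5 * t / 64) for t in range(64)]
coeffs = cepstrum(frame)
```

The direct DFT is quadratic in the frame length and is used here only to keep the sketch dependency-free; a real system would use an FFT.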

In the decoding block 306, the acoustic model 308 and the language model 310 are applied to the acoustic features received from the acoustic processing block 304, which yields the text resulting from speech recognition of the input signal 302.

More specifically, the acoustic model 308 is used to obtain a highly likely sequence of phonemes with a probabilistic model such as an HMM.

The language model 310, on the other hand, is used to determine, with the vocabulary dictionary 312, which word sequence is most likely for the phoneme sequence obtained by applying the acoustic model 308. For example, for an utterance such as "ぽすとはあかい" (posuto wa akai), word sequences that sound roughly the same include 「ポストは赤い」 (the mailbox is red), 「コストわ高い」, 「ホスト輪仲居」, and so on. Among these, the language model 310 determines that 「ポストは赤い」 has the maximum likelihood.
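How a language model might rank such similar-sounding candidates can be sketched with a word bigram model. The probabilities below are invented purely for illustration, and the word segmentation of each candidate is an assumption. Only the ranking idea comes from the text.

```python
import math

# Hypothetical bigram probabilities P(word | previous word); "<s>" marks
# the start of the utterance. The values are illustrative, not measured.
BIGRAM = {
    ("<s>", "ポスト"): 0.002,
    ("ポスト", "は"): 0.3,
    ("は", "赤い"): 0.01,
    ("<s>", "コスト"): 0.003,
    ("コスト", "わ"): 1e-6,
    ("わ", "高い"): 1e-4,
}
UNSEEN = 1e-8  # assumed probability for word pairs not in the table


def log_prob(words):
    """Log probability of a word sequence under the bigram table."""
    total = 0.0
    prev = "<s>"
    for word in words:
        total += math.log(BIGRAM.get((prev, word), UNSEEN))
        prev = word
    return total


def most_likely(candidates):
    """Return the candidate word sequence with the highest probability."""
    return max(candidates, key=log_prob)


candidates = [
    ["ポスト", "は", "赤い"],
    ["コスト", "わ", "高い"],
    ["ホスト", "輪", "仲居"],
]
best = most_likely(candidates)
```

Working in log probabilities avoids numerical underflow when longer word sequences are scored.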

The text (word sequence) obtained as the speech recognition result can be used by any other application program 314. In this embodiment, the recognized text data is sent from the speech recognition server 130 to the voice storage server 120 and stored in the index data file 126 in association with the voice file. Techniques for associating the recognized text of a voice file with that file as an index are described in, but not limited to, JP 2000-348064 A and JP 2006-178087 A by the present applicant.

Next, the processing of one embodiment of the present invention is described with reference to the flowcharts in FIG. 4 and the subsequent figures. In step 402 of FIG. 4, the user inputs the "reading" of a word or phrase using a predetermined graphical user interface (GUI). An example of this GUI is shown in FIG. 6. Note that this GUI runs on the client system 110 of FIG. 1.

For example, suppose the user wants to enter the compound word 「株券貸借取引」 (stock certificate lending transaction). The user types "かぶけんたいしゃくとりひき" (kabuken taishaku torihiki) into the text field 602, which is shown as the reading field in FIG. 6.

In step 404, pronunciation-to-notation conversion using the language model is executed. That is, the speech recognition dictionary and its language model are used to convert the pronunciation into a notation. The processing program for this preferably resides on the speech recognition server 130, and the client system 110 simply calls the program on the speech recognition server 130 through a mechanism such as CGI or JSP. The speech recognition dictionary and its language model on the speech recognition server 130 are accessed over the network 140. In FIG. 1, the speech recognition dictionary and its language model are shown generically as the language model 134. If necessary, the voice storage server 120 may create a replica of the speech recognition dictionary and language model held on the speech recognition server 130 and use the replica instead.

The method described in [MORI 1999] above, for example, can be used to obtain the optimum output sequence for a given input symbol sequence based on a language model. In that method, the word sequence W is selected that maximizes the conditional probability of a mixed kana-kanji word sequence given the input kana sequence Y. Expressed as a formula:

$$\hat{W} = \operatorname*{argmax}_{W} P(W \mid Y) \tag{1}$$

FIG. 5 shows a flowchart of the detailed steps of step 404 in FIG. 4. In step 502, the processing program converts the input kana string into a phoneme symbol string so that it can be matched against the speech recognition dictionary. Phoneme symbols classify the kinds of sounds that make up speech, with one symbol per kind; Japanese has about 50, counting vowels and consonants. The conversion from a reading (sounds-like) to a phonetic symbol string is not always one-to-one, in which case several symbol strings are output. Their number is at most a handful, however, and the result is obtained by running the same processing on each, comparing their probabilities at the end, and selecting the maximum. The description below therefore assumes that the phonetic symbol string has been determined uniquely.

For example, the following phonetic symbol string (H) is obtained from "かぶけんたいしゃくとりひき":
H = /k/a/b/u/k/e/_n/t/a/i/sy/a/k/u/t/o/r/i/h/i/k/i/
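The conversion in step 502 can be sketched as a longest-match table lookup. The table below is a hypothetical fragment that covers only the kana in this example, and the function returns a single phoneme string, whereas the text notes that several may be produced when the mapping is not one-to-one.

```python
# Hypothetical kana-to-phoneme table covering only the example reading.
# A full table would list every kana and every two-character combination.
KANA_TO_PHONEMES = {
    "しゃ": ["sy", "a"],
    "か": ["k", "a"], "き": ["k", "i"], "く": ["k", "u"], "け": ["k", "e"],
    "た": ["t", "a"], "と": ["t", "o"],
    "ひ": ["h", "i"], "ぶ": ["b", "u"], "り": ["r", "i"],
    "い": ["i"], "ん": ["_n"],
}


def kana_to_phonemes(kana):
    """Convert a kana string to phoneme symbols, preferring 2-character units."""
    phonemes = []
    i = 0
    while i < len(kana):
        for length in (2, 1):
            chunk = kana[i:i + length]
            if chunk in KANA_TO_PHONEMES:
                phonemes.extend(KANA_TO_PHONEMES[chunk])
                i += len(chunk)
                break
        else:
            raise ValueError(f"no phoneme mapping for {kana[i]!r}")
    return phonemes


def format_phonemes(phonemes):
    """Render phonemes in the slash-delimited notation used in the text."""
    return "/" + "/".join(phonemes) + "/"


H = kana_to_phonemes("かぶけんたいしゃくとりひき")
```

Checking two-character units first ensures that a combination such as 「しゃ」 becomes /sy/a/ rather than being split into separate kana.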

The optimum word sequence W is then selected using equation (1) with Y replaced by the phonetic symbol string H. For this purpose, equation (1) is transformed as follows.

$$\hat{W} = \operatorname*{argmax}_{W} P(W \mid H) = \operatorname*{argmax}_{W} \frac{P(H \mid W)\,P(W)}{P(H)} = \operatorname*{argmax}_{W} P(H \mid W)\,P(W)$$

Here P(H|W) is the probability describing how each word is read, and P(W) is the occurrence probability of the word sequence, calculated for example with an N-gram model. Both values can be computed from information that a speech recognition engine generally holds as its dictionary and language model. In other words, the W with the maximum probability is selected based on the right-hand side of the above formula. This processing can be interpreted as the result (word sequence) of speech recognition performed when the sound has no ambiguity and the acoustic model is ideal, and it corresponds to steps 504, 506, and 508 below. In the implementation, as shown in FIG. 3, the phoneme sequence is treated as having been uniquely determined by the acoustic processing of the speech recognition engine and is input to the decoding block 306.

In step 504, a pointer that indexes the symbol of interest is set to the leftmost symbol of the phonetic symbol string (/k/ in this example).

In step 506, a dictionary lookup is performed on the right subsequence starting at the pointer, and words whose pronunciation matches are obtained as candidates. Here, "課/ka", "株/kabu", "株券/kabuke_n", and so on become candidates. This assumes that no unknown word is present, but in practice every dictionary has unknown words. In this step, therefore, a partial phoneme sequence not found in the dictionary, for example "kabuke", may also be assumed to be an unknown word and added to the candidates as a word of unknown notation, W_unk = <kabuke>. The question is then how to compute P(H|W) for a word whose notation is unknown. Many unknown-word models have been proposed for morphological analysis and the like (for example, [NAGATA 1999] cited above), but they have hardly been discussed for pronunciation-to-notation conversion, because there is little benefit in having a portion flagged as an unknown word and left unconverted. Here, letting h (= h₁h₂…h_N) be the phoneme sequence of the unknown-word portion W_unk and P(h_i) the occurrence probability of each phoneme h_i, the following model is considered:

$$P(h \mid W_{unk}) = P(N \mid W_{unk}) \prod_{i=1}^{N} P(h_i)$$

In this equation, the first term on the right side, P(N|W_unk), is the probability that the word has a reading consisting of N phonemes, and the second term approximates, with a phoneme 1-gram, the probability that the phoneme sequence making up this part of the input occurs. A phoneme sequence is one kind of symbol sequence, and various other methods exist for efficiently approximating the occurrence probability of a symbol sequence. For example, a Poisson distribution can be used for the first term and a higher-order N-gram for the second term.
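The unknown-word model described above can be sketched in a few lines. This is a minimal illustration, not the patent's implementation: it assumes the Poisson option mentioned in the text for the length term, and the mean length `mean_length` and the phoneme probabilities in `UNIGRAM` are made-up values chosen only for demonstration.

```python
import math

def poisson_pmf(n: int, mean_length: float) -> float:
    """P(N = n) under a Poisson distribution (assumed form of the length term)."""
    return math.exp(-mean_length) * mean_length ** n / math.factorial(n)

def unknown_word_prob(phonemes, unigram, mean_length=6.0):
    """Length term times the phoneme 1-gram product for an unknown word."""
    length_term = poisson_pmf(len(phonemes), mean_length)
    phoneme_term = math.prod(unigram[h] for h in phonemes)
    return length_term * phoneme_term

# Hypothetical phoneme unigram probabilities (illustrative values only).
UNIGRAM = {"k": 0.08, "a": 0.12, "b": 0.03, "u": 0.07, "e": 0.06}

# Probability assigned to the unknown-word candidate <kabuke>.
p_kabuke = unknown_word_prob(["k", "a", "b", "u", "k", "e"], UNIGRAM)
```

Because every factor is below 1, longer unknown-word candidates receive smaller probabilities, which keeps the model from absorbing the whole input as a single unknown word unless the dictionary words fit poorly.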

In step 508, the language model is consulted and the occurrence probability is computed for each word (sequence) candidate obtained in step 506. With an N-gram model, for example, the language model yields values such as:
P("課") = P(start symbol → "課") = 0.0001
P("株") = P(start symbol → "株") = 0.0005
P("株券") = P(start symbol → "株券") = 0.0025

In step 510, words (sequences) whose probability is judged sufficiently small, either in absolute terms or relative to the other candidates, are excluded, and no further iteration is performed for them. For each remaining candidate, the pointer is updated in step 512 and the processing is repeated from step 506. In the example above, the two most probable candidates, "株" and "株券", are kept and "課" is rejected. If "株" is selected, the pointer is placed on the /k/ immediately after /k/a/b/u/ (the fifth phoneme from the left), and step 506 is repeated from that position to find candidate words that match from the fifth phoneme onward, for example "倦怠/ke_ntai". Such iteration yields many candidate word sequences, but most are rejected along the way because their probabilities are sufficiently low. Likewise, for "株券" the pointer is advanced to /t/, the phoneme following /k/a/b/u/k/e/_n/, and step 506 onward is repeated.
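The loop over steps 506 to 512 can be sketched as follows. This is a simplified illustration under stated assumptions: `LEXICON` is a hypothetical toy dictionary, each word carries a single context-independent probability instead of a true N-gram value (the entry for "倦怠" is invented), and pruning simply keeps the top `beam` candidates.

```python
from typing import NamedTuple

class Hyp(NamedTuple):
    words: tuple   # notations chosen so far
    pos: int       # pointer: index of the next unread phoneme
    prob: float    # accumulated probability

# Hypothetical toy lexicon: phoneme sequence -> [(notation, probability)].
LEXICON = {
    ("k", "a"): [("課", 0.0001)],
    ("k", "a", "b", "u"): [("株", 0.0005)],
    ("k", "a", "b", "u", "k", "e", "_n"): [("株券", 0.0025)],
    ("k", "e", "_n", "t", "a", "i"): [("倦怠", 0.0002)],
}

def expand(hyp, phonemes):
    """Steps 506/508: look up every word starting at the pointer and score it."""
    out = []
    for end in range(hyp.pos + 1, len(phonemes) + 1):
        for word, p in LEXICON.get(tuple(phonemes[hyp.pos:end]), []):
            # Step 512: the pointer of the new hypothesis moves past the word.
            out.append(Hyp(hyp.words + (word,), end, hyp.prob * p))
    return out

def prune(hyps, beam=2):
    """Step 510: keep only the `beam` most probable candidates."""
    return sorted(hyps, key=lambda h: h.prob, reverse=True)[:beam]

phonemes = "k a b u k e _n t a i".split()
first = prune(expand(Hyp((), 0, 1.0), phonemes))            # 株券 and 株 survive
second = [h2 for h in first for h2 in expand(h, phonemes)]  # 株 continues with 倦怠
```

With this toy lexicon, "課" is dropped in the first round, and the hypothesis that chose "株" resumes at the fifth phoneme, where "倦怠" matches.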

As a result of the processing of FIG. 5, one or more conversion results obtained with the speech recognition dictionary and the language model are listed, and the user selects one of them with the mouse. As shown in FIG. 6, the selected result, "株券体癪取引", is then displayed in text field 604, below text field 602 in which the reading "かぶけんたいしゃくとりひき" is displayed.

In actual implementations, various techniques based on Viterbi search and dynamic programming are used to speed up the above processing, but these are well known and are not described in detail here.

Returning to FIG. 4, if there is an erroneous portion, or a portion judged to be an unknown word, the user in step 406 places the cursor on that portion and corrects it, preferably with an ordinary kana-kanji conversion function. The conversion is invoked by clicking the "Convert" button 606 in FIG. 6, or by pressing the conversion key on the keyboard. The kana-kanji conversion function provided on the client computer 110 may be used. In FIG. 6, for example, the "たいしゃく" portion has been converted to "体癪", so it is corrected to "貸借".

In step 408, the system compares the initially converted "株券体癪取引" with the corrected "株券貸借取引". This reveals that the "株券/かぶけん" and "取引/とりひき" portions are correct, and that the portion corresponding to "貸借/たいしゃく" is an unknown word for the speech recognition dictionary.
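The comparison in step 408 can be illustrated with a character-level diff. The text does not specify how the comparison is performed, so the use of Python's standard `difflib.SequenceMatcher` here is an assumption made for illustration, and `corrected_spans` is a hypothetical helper name.

```python
from difflib import SequenceMatcher

def corrected_spans(converted: str, corrected: str):
    """Return (original, replacement) pairs where the two strings differ."""
    matcher = SequenceMatcher(None, converted, corrected, autojunk=False)
    return [
        (converted[i1:i2], corrected[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]

# The unchanged parts (株券, 取引) are treated as correct;
# the replaced part marks the unknown-word candidate.
spans = corrected_spans("株券体癪取引", "株券貸借取引")
```

Here `spans` holds the single pair `("体癪", "貸借")`, so "貸借" is the portion that the speech recognition dictionary did not know.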

Proceeding to step 410, when the user clicks the "Confirm" button 608, "貸借/たいしゃく", which is an unknown word for the speech recognition dictionary, is registered in the speech recognition dictionary together with the known word "株券" or "取引", in the context "株券→貸借" or "貸借→取引". Once the word is registered, this speech recognition dictionary is used by the language model 310 in the next speech recognition of audio files.
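Registration with context, as described in step 410, might look like the following sketch. The class `RecognitionDictionary` and its fields are hypothetical; the text does not describe the dictionary's internal format, so a reading table plus a set of word-pair contexts is assumed here.

```python
class RecognitionDictionary:
    """Hypothetical in-memory stand-in for the speech recognition dictionary."""

    def __init__(self):
        self.readings = {}     # notation -> reading
        self.bigrams = set()   # (previous word, next word) contexts

    def is_unknown(self, notation: str) -> bool:
        return notation not in self.readings

    def register(self, notation, reading, left=None, right=None):
        """Add a word together with its neighbouring known words as context."""
        self.readings[notation] = reading
        if left is not None:
            self.bigrams.add((left, notation))
        if right is not None:
            self.bigrams.add((notation, right))

dictionary = RecognitionDictionary()
dictionary.readings.update({"株券": "かぶけん", "取引": "とりひき"})

# Register the unknown word with the contexts 株券→貸借 and 貸借→取引.
dictionary.register("貸借", "たいしゃく", left="株券", right="取引")
```

Storing the neighbouring known words gives the newly added word at least some N-gram context, so that the next recognition run can prefer it in similar phrases.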

On the other hand, when the user clicks the "Search" button 610 in step 410, "貸借" is at this stage still an unknown word for the speech recognition dictionary, so the search may not work correctly. The system therefore warns the user, for example by opening a message window. In either case, when the user clicks the "Search" button 610, the client computer 110 sends the voice storage server 120 the keyword that the user selected, from the conversion results based on the speech recognition dictionary, for the reading in text field 602, and the voice storage server 120 searches the index data 126 with that keyword.

When the voice storage server 120 finds entries in the index data 126 that match the keyword, it returns to the client computer 110 a list of the audio files associated with the matching index data.

The client computer 110 displays the received list of audio files in a separate window, where the user of the client computer 110 can click the link (not shown) of an audio file in the list to listen to its contents.
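The server-side lookup described above can be sketched as a keyword match over recognized text. The structure of the index data 126 is not given in the text, so the mapping below, the file names, and the function `search_index` are all hypothetical.

```python
# Hypothetical index data: audio file name -> text produced by speech recognition.
INDEX_DATA = {
    "call_001.wav": "株券貸借取引について",
    "call_002.wav": "トリプルＡの銘柄",
}

def search_index(keyword: str, index=INDEX_DATA):
    """Return the audio files whose index text contains the keyword."""
    return sorted(name for name, text in index.items() if keyword in text)

hits = search_index("株券")
```

Because the index text is produced by the recognizer, a keyword written with the same dictionary and language model is more likely to match it literally, which is the motivation for converting the reading with that model before searching.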

For reference, the following are samples of pronunciation-to-notation conversion.
In the examples below, the number on the left is -kΣlogP, where P is the probability of the result (k is a coefficient for converting the value to an integer, here 256, and the logarithm is base 10), and <..>u denotes an unknown word.
<Example 1>
Input>>かぶけんたいしゃくとりひき<<
4565 < 株券たい癪取引 >
4640 < 株券退社く取引 >
4673 < 株券退社九取引 >
4732 < 株券体癪取引 >
4867 < 株券対癪取引 >
4937 < 株券タイ癪取引 >
<Example 2>
Input>>とりぷるえーのめいがら<<
2247 < トリプルＡの銘柄 >
3239 < トリプルええの銘柄 >
3514 < トリプルＡ野銘柄 >
3792 < トリプルＡの銘柄 >
3921 < トリプルエーの銘柄 >
3942 < トリプル D_エーの銘柄 >
4188 < トリプルええ野銘柄 >
<Example 3>
Input>>かぶけんたいしゃくとりひき<<
4165 < 株券 <たいしゃく>u 取引 >
4565 < 株券たい癪取引 >
4640 < 株 <けんたいしゃく>u 取引 >
4673 < 株券退社九取引 >
4732 < 株券体癪取引 >
4758 < <かぶけんたいしゃく>u 取引 >
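The score shown to the left of each result can be reproduced from its definition, -kΣlogP with k = 256 and base-10 logarithms. In this sketch, rounding to the nearest integer and summing over per-word probability factors are assumptions; the text states only that k serves to make the value an integer.

```python
import math

K = 256  # integerizing coefficient given in the text

def score(probabilities):
    """-K * sum(log10 p), rounded to an integer (rounding is an assumption)."""
    return round(-K * sum(math.log10(p) for p in probabilities))

def score_to_probability(s: float) -> float:
    """Invert the score back to an overall probability."""
    return 10 ** (-s / K)

two_factor_score = score([0.01, 0.1])  # -256 * (-2 + -1)
```

A smaller score means a more probable result, so in Example 3 the candidate scored 4165, which contains the unknown word <たいしゃく>u, ranks above every candidate built only from dictionary words.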

The technique of the present invention has been described above with reference to a specific embodiment, but the technical scope of the invention is not limited to that embodiment, and various modifications are possible. For example, a stand-alone configuration may be used instead of the network-connected configuration of FIG. 1, and the voice storage server and the speech recognition server may be implemented as a single server.

FIG. 1 is an overall schematic diagram of the hardware configuration.
FIG. 2 is a diagram showing a more detailed configuration of a computer used in the configuration of FIG. 1.
FIG. 3 is a block diagram showing the functions of the speech recognition system.
FIG. 4 is a flowchart of the process of converting a reading using the language model for speech recognition.
FIG. 5 is a flowchart of the process of converting a reading using the language model for speech recognition.
FIG. 6 is a diagram showing a window for entering and displaying a reading and its conversion result.

Claims

An audio file search system for searching, by computer processing, data stored by associating an audio file with an index text obtained as a result of speech recognition of the audio file, the system comprising:
means for accepting input of a reading from a user by processing of the computer;
means for converting the reading into a phonetic symbol string by processing of the computer;
means for converting the phonetic symbol string into a keyword, by processing of the computer, using a language model substantially the same as the language model used for the speech recognition of the audio file; and
means for searching the index text using the keyword by processing of the computer.

The audio file search system according to claim 1, further comprising:
means for correcting the keyword in accordance with a user operation by processing of the computer;
means for identifying an unknown word based on the corrected portion of the keyword by processing of the computer; and
means for notifying the user of the presence of the unknown word.

A speech recognition dictionary update system for updating, by computer processing, a speech recognition dictionary used to recognize audio files and create index texts for searching the audio files, the system comprising:
means for accepting input of a reading from a user by processing of the computer;
means for converting the reading into a phonetic symbol string by processing of the computer;
means for converting the phonetic symbol string into a keyword, by processing of the computer, using a language model substantially the same as the language model used for the speech recognition of the audio files;
means for correcting the keyword in accordance with a user operation by processing of the computer;
means for identifying an unknown word based on the corrected portion of the keyword by processing of the computer; and
means for registering, in the speech recognition dictionary, the user-corrected word corresponding to the unknown word by processing of the computer.

An audio file search method for searching, by computer processing, data stored by associating an audio file with an index text obtained as a result of speech recognition of the audio file, the method comprising the steps of:
accepting input of a reading from a user by processing of the computer;
converting the reading into a phonetic symbol string by processing of the computer;
converting the phonetic symbol string into a keyword, by processing of the computer, using a language model substantially the same as the language model used for the speech recognition of the audio file; and
searching the index text using the keyword by processing of the computer.

The audio file search method according to claim 4, further comprising the steps of:
correcting the keyword in accordance with a user operation by processing of the computer;
identifying an unknown word based on the corrected portion of the keyword by processing of the computer; and
notifying the user of the presence of the unknown word.

A speech recognition dictionary update method for updating, by computer processing, a speech recognition dictionary used to recognize audio files and create index texts for searching the audio files, the method comprising the steps of:
accepting input of a reading from a user by processing of the computer;
converting the reading into a phonetic symbol string by processing of the computer;
converting the phonetic symbol string into a keyword, by processing of the computer, using a language model substantially the same as the language model used for the speech recognition of the audio files;
correcting the keyword in accordance with a user operation by processing of the computer;
identifying an unknown word based on the corrected portion of the keyword by processing of the computer; and
registering, in the speech recognition dictionary, the user-corrected word corresponding to the unknown word by processing of the computer.

An audio file search program for searching, by computer processing, data stored by associating an audio file with an index text obtained as a result of speech recognition of the audio file, the program causing the computer to execute the steps of:
accepting input of a reading from a user;
converting the reading into a phonetic symbol string;
converting the phonetic symbol string into a keyword using a language model substantially the same as the language model used for the speech recognition of the audio file; and
searching the index text using the keyword.

The audio file search program according to claim 7, further causing the computer to execute the steps of:
correcting the keyword in accordance with a user operation by processing of the computer;
identifying an unknown word based on the corrected portion of the keyword by processing of the computer; and
notifying the user of the presence of the unknown word.

A speech recognition dictionary update program for updating, by computer processing, a speech recognition dictionary used to recognize audio files and create index texts for searching the audio files, the program causing the computer to execute the steps of:
accepting input of a reading from a user by processing of the computer;
converting the reading into a phonetic symbol string by processing of the computer;
converting the phonetic symbol string into a keyword, by processing of the computer, using a language model substantially the same as the language model used for the speech recognition of the audio files;
correcting the keyword in accordance with a user operation by processing of the computer;
identifying an unknown word based on the corrected portion of the keyword by processing of the computer; and
registering, in the speech recognition dictionary, the user-corrected word corresponding to the unknown word by processing of the computer.