JP5558284B2

JP5558284B2 - Speech recognition system, speech recognition method, and speech recognition program

Info

Publication number: JP5558284B2
Application number: JP2010207048A
Authority: JP
Inventors: 孝輔辻野; 真也飯塚
Original assignee: NTT Docomo Inc
Current assignee: NTT Docomo Inc
Priority date: 2010-09-15
Filing date: 2010-09-15
Publication date: 2014-07-23
Anticipated expiration: 2030-09-15
Also published as: JP2012063537A

Description

The present invention relates to a communication terminal, a speech recognition method, and a speech recognition program.

Speech recognition processing may be performed either within the client terminal or on the server side. When it is performed within the client terminal, the terminal is in many cases dedicated to a single user, so the language model and the acoustic model are easy to personalize using user-specific information held in the terminal, such as the user dictionary, the user's past input speech and call speech, and acoustic training results. User-adaptive speech recognition is therefore possible, but there are drawbacks: memory and computing resources are limited, which constrains the vocabulary size and the range of the hypothesis search.

When it is performed on the server side, on the other hand, far more memory and computing resources are available than for in-terminal speech recognition, which has the advantage of enabling large-vocabulary, high-accuracy speech recognition. However, because the server is shared by many users, it is costly to train a language model or acoustic model for each user, or to load a language model or acoustic model customized for each user immediately and quickly upon access. Server-side speech recognition therefore has the drawback that per-user customization of the language model or acoustic model is difficult.

The challenge, therefore, is to realize speech recognition processing that combines the strengths of both: large-vocabulary, high-accuracy speech recognition together with per-user customization of the language model or acoustic model. Patent Document 1 describes one attempt to solve this problem. In Patent Document 1, the server returns time information on the word boundaries in its recognition result to the terminal, and the terminal performs re-recognition by referring to that time information and using its own dictionary. In particular, it aims to improve recognition accuracy by re-recognizing only the words judged to be unknown words or proper nouns.

Patent Document 1: JP 2010-85536 A

However, with the technique of Patent Document 1, if the server-side recognition does not identify the word boundaries correctly, the terminal refers to incorrect time information, and a correct recognition result may not be obtained even after re-recognition at the terminal. In addition, when a word to be recognized is outside the server's vocabulary, the server may fail to judge correctly whether it is an unknown word or a proper noun, and again a correct recognition result may not be obtained even after re-recognition at the terminal.

The present invention has been made in view of the above, and its object is to provide a communication terminal, a speech recognition method, and a speech recognition program capable of realizing speech recognition processing that achieves both large-vocabulary, high-accuracy speech recognition and per-user customization of the language model or acoustic model.

To solve the above problem, a communication terminal of the present invention comprises: speech input means for inputting a speech signal; storage means for storing a language model or an acoustic model that is used for speech recognition processing and is adapted to a user; speech recognition processing means for performing first speech recognition processing on the speech signal using the language model or the acoustic model; vocabulary extraction means for extracting the vocabulary constituting the recognition result of the speech recognition processing means; and transmission means for transmitting information representing the vocabulary, together with the speech signal, to a server that performs second speech recognition processing on the speech signal using the vocabulary extracted by the vocabulary extraction means as a recognition dictionary.

In the speech recognition method of the present invention, a language model or an acoustic model that is used for speech recognition processing and is adapted to a user is stored in storage means, and the method comprises: a speech input step in which speech input means inputs a speech signal; a speech recognition processing step in which speech recognition processing means performs first speech recognition processing on the speech signal using the language model or the acoustic model; a vocabulary extraction step in which vocabulary extraction means extracts the vocabulary constituting the recognition result of the speech recognition processing means; and a transmission step in which transmission means transmits information representing the vocabulary, together with the speech signal, to a server that performs second speech recognition processing on the speech signal using the vocabulary extracted by the vocabulary extraction means as a recognition dictionary.

In the speech recognition program of the present invention, a language model or an acoustic model that is used for speech recognition processing and is adapted to a user is stored in storage means, and the program comprises: a speech input module that inputs a speech signal; a speech recognition processing module that performs first speech recognition processing on the speech signal using the language model or the acoustic model; a vocabulary extraction module that extracts the vocabulary constituting the recognition result of the speech recognition processing module; and a transmission module that transmits information representing the vocabulary, together with the speech signal, to a server that performs second speech recognition processing on the speech signal using the vocabulary extracted by the vocabulary extraction module as a recognition dictionary.

According to the communication terminal, speech recognition method, and speech recognition program of the present invention, the vocabulary constituting the result of the first speech recognition processing, performed by the speech recognition processing means of the communication terminal, is transmitted to the server that performs the second speech recognition processing. Because the speech recognition processing means of the communication terminal uses a language model or acoustic model adapted to the user, it can perform speech recognition customized to the user. The vocabulary constituting the result of this processing is transmitted to the server and used as a recognition dictionary, so the server can perform large-vocabulary, high-accuracy speech recognition with its recognition dictionary expanded. It is therefore possible to reduce the unknown words in the second speech recognition processing while realizing speech recognition that achieves both large-vocabulary, high-accuracy recognition and per-user customization of the language model or acoustic model.

In the present invention, the language model may include a user dictionary based on user data existing in the communication terminal, or on user-dependent language data obtained from the user's usage history.

According to this invention, the language model of the communication terminal can be a language model customized to the user. The user dictionary can contain, for example, the names of the user's acquaintances.

In the present invention, the acoustic model may be adapted to the user by using the user's past input speech or call speech, or the results of acoustic training.

According to this invention, a specific method is provided for customizing the acoustic model of the communication terminal to the user.

In the communication terminal of the present invention, the communication terminal may be connected to the server over a network.

According to this invention, the communication terminal transmits to the server only the vocabulary constituting the result of the first speech recognition processing, not the entire user dictionary. When the communication terminal and the server are connected over a network, this keeps the cost of information transmission low. In addition, because the amount of data to be transmitted is small, the overall processing time is shorter, and so is the delay until the speech recognition processing completes.

In the present invention, the vocabulary extraction means may extract, from the vocabulary, only the entries that exist in the user data or the user dictionary.

According to this invention, the server's recognition dictionary can be reliably expanded with the vocabulary that exists in the user data or user dictionary of the communication terminal. In addition, the vocabulary that the vocabulary extraction means has to extract is reduced, which further reduces the amount of data to be transmitted from the communication terminal to the server.

In the present invention, the transmission means may further transmit to the server information indicating whether or not the vocabulary exists in the user dictionary.

According to this invention, by referring to that information, the vocabulary that exists in the user data or user dictionary of the communication terminal can be reliably identified, and the server's recognition dictionary can be reliably expanded with that vocabulary.

According to the present invention, it is possible to provide a communication terminal, a speech recognition method, and a speech recognition program that reduce the unknown words in the second speech recognition processing while realizing speech recognition that achieves both large-vocabulary, high-accuracy recognition and per-user customization of the language model or acoustic model.

FIG. 1 is a schematic configuration diagram of the speech recognition system 1.
FIG. 2 is a hardware configuration diagram of the client terminal 100 and the server 200.
FIG. 3 is a sequence diagram showing the operations performed in the speech recognition system 1.
FIG. 4 is a diagram showing an example of the result of the first speech recognition processing by the client terminal 100.

Preferred embodiments of the communication terminal, speech recognition method, and speech recognition program according to the present invention are described in detail below with reference to the accompanying drawings. In the description of the drawings, identical elements are given identical reference numerals, and duplicate description is omitted. In the following description, "speech recognition processing" means processing that analyzes the spoken language uttered by a speaker and extracts the spoken content as character data.

(Overall configuration of the speech recognition system 1)
First, the configuration of the speech recognition system 1 according to an embodiment of the present invention is described with reference to FIG. 1. FIG. 1 is a schematic configuration diagram of the speech recognition system 1. As shown in FIG. 1, the speech recognition system 1 consists of a client terminal 100 (corresponding to the "communication terminal" in the claims) and a server 200, which are connected to each other over a network 300. FIG. 1 shows only one client terminal 100 as a representative, but a plurality of client terminals 100 can communicate with the server 200. The client terminal 100 performs the first speech recognition processing, and the server 200 performs the second speech recognition processing. The result of the second speech recognition processing is the final result of the speech recognition processing.

(Configuration of the client terminal 100)
The client terminal 100 is now described in detail. The client terminal 100 is, for example, a mobile phone or a smartphone, and FIG. 2 is its hardware configuration diagram. As shown in FIG. 2, the client terminal 100 physically comprises a CPU 11, a ROM 12 and a RAM 13 serving as main storage devices, an input device 14 such as operation buttons and a microphone, an output device 15 such as an LCD or organic EL display, a communication module 16 that exchanges data with the server 200, and an auxiliary storage device 17 such as a memory device. Each function of the client terminal 100 described below is realized by loading predetermined software onto hardware such as the CPU 11, the ROM 12, and the RAM 13, thereby operating the input device 14, the output device 15, and the communication module 16 under the control of the CPU 11, and by reading and writing data in the main storage devices 12 and 13 and the auxiliary storage device 17.

Returning to FIG. 1, the client terminal 100 functionally comprises a speech input unit 110 (corresponding to the "speech input means" in the claims), a terminal-side storage unit 120 (corresponding to the "storage means" in the claims), a terminal-side speech recognition unit 130 (corresponding to the "speech recognition processing means" in the claims), a vocabulary extraction unit 140 (corresponding to the "vocabulary extraction means" in the claims), and a transmission unit 150 (corresponding to the "transmission means" in the claims).

The speech input unit 110 inputs a speech signal from the user and can be configured by, for example, the input device 14 shown in FIG. 2. The speech input unit 110 performs A/D conversion on the speech signal input from the microphone and generates speech data. The speech input unit 110 may further convert the generated speech data into a compressed code to generate encoded data, or may extract feature data from the speech data. In what follows, the term "speech data" also covers such encoded data and feature data. The speech input unit 110 outputs the generated speech data to the terminal-side speech recognition unit 130 and the transmission unit 150.
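As a rough illustration of the A/D conversion and feature extraction mentioned above, the following sketch quantizes samples to 16-bit PCM and computes one log-energy value per frame. The 16 kHz sampling rate, the 160-sample frame length, the log-energy feature, and the function names `quantize_pcm16` and `frame_log_energy` are all assumptions made for this example; the source does not specify any of them.

```python
import math

def quantize_pcm16(samples):
    """Map float samples in [-1.0, 1.0] to signed 16-bit integers (illustrative A/D step)."""
    pcm = []
    for s in samples:
        s = max(-1.0, min(1.0, s))            # clip out-of-range input
        pcm.append(int(round(s * 32767)))
    return pcm

def frame_log_energy(pcm, frame_len=160):
    """Split PCM into fixed-length frames and return one log-energy value per full frame."""
    feats = []
    for start in range(0, len(pcm) - frame_len + 1, frame_len):
        frame = pcm[start:start + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        feats.append(math.log(energy + 1.0))  # +1 avoids log(0) on silent frames
    return feats

# Hypothetical input: 20 ms of a 440 Hz tone sampled at 16 kHz (320 samples).
analog = [0.5 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(320)]
pcm = quantize_pcm16(analog)
features = frame_log_energy(pcm)
print(len(pcm), len(features))
```

A practical front end would more likely use spectral features; log energy is used here only because it keeps the sketch short.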

The terminal-side storage unit 120 stores a language model or acoustic model that the terminal-side speech recognition unit 130 uses for speech recognition processing and that is adapted to the user. In many cases the client terminal 100 in this embodiment is dedicated to one user, or is a communication terminal that only a limited set of users can use. In such cases, personalizing the language model and the acoustic model is easy. The terminal-side language model stored in the terminal-side storage unit 120 can be customized to the user of the client terminal 100 by including a user dictionary based on user data existing in the client terminal 100, or on user-dependent language data obtained from the user's usage history. Likewise, the terminal-side acoustic model stored in the terminal-side storage unit 120 can be customized by adapting it to the user using the user's past input speech or call speech, or the results of acoustic training. The terminal-side storage unit 120 can be configured by, for example, the auxiliary storage device 17 shown in FIG. 2.

The terminal-side speech recognition unit 130 is configured by, for example, the CPU 11 shown in FIG. 2, and performs the first speech recognition processing on the speech data input from the speech input unit 110, using the language model or acoustic model stored in the terminal-side storage unit 120. Because the first speech recognition processing runs on the client terminal 100, it is not large-vocabulary, high-accuracy recognition that draws on abundant memory and computing resources; it is, however, recognition adapted to the user, using a language model or acoustic model customized to the user. The speech recognition processing itself performed by the terminal-side speech recognition unit 130 is a well-known technique, for example using a unigram or bigram as the language model and a hidden Markov model as the acoustic model, so a detailed description is omitted here. The recognition in the terminal-side speech recognition unit 130 may be continuous speech recognition or isolated word recognition over the entire input speech, or it may be speech recognition or word spotting over part of the speech. The terminal-side speech recognition unit 130 outputs the result of the first speech recognition processing to the vocabulary extraction unit 140.
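To make the bigram language model mentioned above concrete, here is a minimal sketch that trains bigram counts on a small user-specific corpus and uses them to rank two candidate word sequences. The corpus, the add-one smoothing, and the function names `train_bigram` and `log_prob` are assumptions for illustration; the source only says that a unigram or bigram model may be used, and a real recognizer would combine such a score with acoustic scores.

```python
import math
from collections import Counter

def train_bigram(sentences):
    """Count unigrams and bigrams over tokenized sentences, with <s> and </s> markers."""
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        tokens = ["<s>"] + words + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def log_prob(words, unigrams, bigrams):
    """Add-one smoothed bigram log-probability of a word sequence."""
    vocab_size = len(unigrams)
    tokens = ["<s>"] + words + ["</s>"]
    total = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        total += math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size))
    return total

# Hypothetical user text, e.g. drawn from the user's own messages.
corpus = [["今日", "は", "横島", "に", "会った"],
          ["横島", "に", "会った"]]
uni, bi = train_bigram(corpus)

# Two hypothetical candidates that differ only in the proper noun.
hypotheses = [["今日", "は", "横浜", "に", "会った"],
              ["今日", "は", "横島", "に", "会った"]]
best = max(hypotheses, key=lambda h: log_prob(h, uni, bi))
print(" ".join(best))
```

Because the user corpus contains the name "横島", the model ranks the hypothesis containing it above the one containing "横浜", which is the kind of user adaptation the terminal-side model is meant to provide.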

The vocabulary extraction unit 140 receives the result of the first speech recognition processing from the terminal-side speech recognition unit 130 and extracts the vocabulary (pairs of notation and reading) constituting that result. The vocabulary extraction unit 140 can be configured by, for example, the CPU 11 shown in FIG. 2. The vocabulary extraction unit 140 may extract all of the vocabulary constituting the result of the first speech recognition processing, or it may extract only the entries that exist in the user data or the user dictionary. Alternatively, the vocabulary extraction unit 140 may perform the vocabulary extraction together with processing that, when an extracted entry exists in the user data or the user dictionary, generates information indicating that fact (hereinafter, the "instruction signal"). The vocabulary extraction unit 140 outputs information representing the extracted vocabulary (hereinafter, "vocabulary information") to the transmission unit 150, together with the instruction signal if there is one.
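The third variant above, extraction combined with generation of the instruction signal, can be sketched as follows. The `VocabEntry` class, the boolean encoding of the instruction signal, and the sample data are assumptions for illustration; the source does not define a data format.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class VocabEntry:
    notation: str   # written form, e.g. "横島"
    reading: str    # pronunciation, e.g. "ヨコシマ"

def extract_with_flags(result_words, user_dictionary):
    """Return (entry, in_user_dict) pairs for each distinct word in a recognition result.

    The boolean plays the role of the "instruction signal": it marks entries
    that are present in the user dictionary.
    """
    seen = set()
    flagged = []
    for entry in result_words:
        if entry in seen:
            continue                      # report each entry only once
        seen.add(entry)
        flagged.append((entry, entry in user_dictionary))
    return flagged

# Hypothetical user dictionary and first-pass recognition result.
user_dictionary = {VocabEntry("横島", "ヨコシマ")}
result = [VocabEntry("今日", "キョウ"), VocabEntry("は", "ワ"),
          VocabEntry("横島", "ヨコシマ"), VocabEntry("に", "ニ"),
          VocabEntry("会った", "アッタ")]

flagged = extract_with_flags(result, user_dictionary)
for entry, in_user_dict in flagged:
    print(entry.notation, entry.reading, in_user_dict)
```

Keeping the flag next to each entry lets the receiving side tell user-dictionary words apart from ordinary words without a second lookup.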

The transmission unit 150 transmits to the server 200 the speech data input from the speech input unit 110, together with the vocabulary information input from the vocabulary extraction unit 140 and, if there is one, the instruction signal. The transmission unit 150 can be configured by, for example, the communication module 16 shown in FIG. 2.
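One possible way to bundle the three items for transmission is sketched below as a JSON message with base64-encoded speech data. The message layout, the field names, and the functions `build_request` and `parse_request` are hypothetical; the source does not specify a wire format or transport protocol.

```python
import base64
import json

def build_request(speech_data: bytes, vocabulary, flags=None) -> str:
    """Bundle speech data, vocabulary information and optional flags into one JSON message."""
    message = {
        "speech": base64.b64encode(speech_data).decode("ascii"),
        "vocabulary": [{"notation": n, "reading": r} for n, r in vocabulary],
    }
    if flags is not None:                 # the instruction signal is optional
        message["in_user_dictionary"] = list(flags)
    return json.dumps(message, ensure_ascii=False)

def parse_request(text: str):
    """Server-side counterpart: recover the three parts from the JSON message."""
    message = json.loads(text)
    speech = base64.b64decode(message["speech"])
    vocabulary = [(v["notation"], v["reading"]) for v in message["vocabulary"]]
    return speech, vocabulary, message.get("in_user_dictionary")

# Hypothetical speech bytes and a single vocabulary entry.
request = build_request(b"\x00\x01\x02\x03", [("横島", "ヨコシマ")], [True])
print(parse_request(request))
```

The instruction signal is omitted from the message when it is not generated, mirroring the "if there is one" wording of the description.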

(Configuration of the server 200)
Next, the server 200 is described. FIG. 2 is also the hardware configuration diagram of the server 200. As shown in FIG. 2, the server 200 is physically configured as an ordinary computer system comprising a CPU 21, main storage devices such as a ROM 22 and a RAM 23, an input device 24 such as a keyboard and mouse, an output device 25 such as a display, a communication module 26 such as a network card for exchanging data with the client terminal 100, and an auxiliary storage device 27 such as a hard disk. Each function of the server 200 described below is realized by loading predetermined computer software onto hardware such as the CPU 21, the ROM 22, and the RAM 23, thereby operating the input device 24, the output device 25, and the communication module 26 under the control of the CPU 21, and by reading and writing data in the main storage devices 22 and 23 and the auxiliary storage device 27.

Returning to FIG. 1, the server 200 functionally comprises a reception unit 210, a server-side storage unit 220, a recognition dictionary expansion unit 230, a server-side speech recognition unit 240, and a recognition result transmission unit 250.

The reception unit 210 receives, from the transmission unit 150 of the client terminal 100, the speech data, the vocabulary information, and, if there is one, the instruction signal. The reception unit 210 outputs the received speech data to the server-side speech recognition unit 240, and outputs the received vocabulary information and instruction signal to the recognition dictionary expansion unit 230.

The server-side storage unit 220 stores the language model or acoustic model that the server-side speech recognition unit 240 uses for speech recognition processing. Because the server 200 in this embodiment is in many cases shared by a large number of users, personalizing the language model or acoustic model on the server side is not easy; the server-side storage unit 220 can, however, store a language model or acoustic model suited to large-vocabulary, high-accuracy speech recognition.

The recognition dictionary expansion unit 230 receives from the reception unit 210 the vocabulary information and, if there is one, the instruction signal, and on the basis of this information expands the recognition dictionary of the server-side language model stored in the server-side storage unit 220. "Expanding the recognition dictionary of the server-side language model" means that, when the vocabulary represented by the vocabulary information input from the reception unit 210 is an unknown word in the server-side language model, the unknown word is registered as a known word in the recognition dictionary of the server-side language model, so that it is no longer an unknown word. Expanding the recognition dictionary in this way reduces the unknown words that the server-side speech recognition unit 240 encounters when it performs speech recognition using the server-side language model.
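The registration of unknown words described above can be sketched as a set update. Modelling the recognition dictionary as a set of (notation, reading) pairs and the function name `expand_recognition_dictionary` are assumptions for illustration; the source does not say how the dictionary is stored, whether the expansion persists across requests, or how language model probabilities are assigned to newly registered words.

```python
def expand_recognition_dictionary(recognition_dictionary, vocabulary):
    """Register every received (notation, reading) pair the server does not know yet.

    Returns the list of newly registered pairs; already known words are left untouched.
    """
    added = []
    for entry in vocabulary:
        if entry not in recognition_dictionary:   # unknown word on the server
            recognition_dictionary.add(entry)     # from now on a known word
            added.append(entry)
    return added

# Hypothetical server dictionary and vocabulary received from the terminal.
server_dictionary = {("今日", "キョウ"), ("は", "ワ"), ("横浜", "ヨコハマ")}
received = [("今日", "キョウ"), ("横島", "ヨコシマ")]

newly_known = expand_recognition_dictionary(server_dictionary, received)
print(newly_known)
```

Only "横島／ヨコシマ" is registered here, because "今日／キョウ" was already known; running the function again with the same input registers nothing.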

The server-side speech recognition unit 240 performs the second speech recognition processing on the speech data input from the reception unit 210, using the language model or acoustic model stored in the server-side storage unit 220. Because the second speech recognition processing runs on the server 200, it can be large-vocabulary, high-accuracy recognition that draws on abundant memory and computing resources. Furthermore, the second speech recognition processing uses the language model after its recognition dictionary has been expanded by the recognition dictionary expansion unit 230. Recognition is therefore performed with fewer unknown words, which improves the accuracy of the recognition result. The speech recognition processing itself performed by the server-side speech recognition unit 240 is a well-known technique, for example using a trigram as the language model and a hidden Markov model as the acoustic model, so a detailed description is omitted here. The server-side speech recognition unit 240 outputs the result of the second speech recognition processing to the recognition result transmission unit 250.

The recognition result transmission unit 250 receives the result of the second speech recognition processing from the server-side speech recognition unit 240 and transmits it to the client terminal 100. The client terminal 100 may include means for receiving the result of the second speech recognition processing (which can be configured by, for example, the communication module 16 in FIG. 2) and means for displaying the result to the user (which can be configured by, for example, the output device 15 in FIG. 2).

(Operation of the speech recognition system 1)
Next, the operations performed by the speech recognition system 1 are described with reference to FIG. 3. FIG. 3 is a sequence diagram showing the operations performed in the speech recognition system 1.

(Step S1, corresponding to the "speech input step" in the claims)
First, the speech input unit 110 of the client terminal 100 inputs a speech signal from the user, performs A/D conversion and the like, and then outputs the speech data to the terminal-side speech recognition unit 130 and the transmission unit 150.

(Step S2, corresponding to the "speech recognition processing step" in the claims)
Next, the terminal-side speech recognition unit 130 performs the first speech recognition processing on the speech data input from the speech input unit 110 in step S1, using the language model or acoustic model stored in the terminal-side storage unit 120. FIG. 4 shows an example of the result of the first speech recognition processing. FIG. 4 shows the recognition result as an N-best list, but the result is not limited to this form and may be a confusion network or a word lattice. The terminal-side speech recognition unit 130 outputs the result of the first speech recognition processing, such as that shown in FIG. 4, to the vocabulary extraction unit 140.

(Step S3, corresponding to the "vocabulary extraction step" in the claims)
Next, the vocabulary extraction unit 140 receives the result of the first speech recognition process of step S2 from the terminal-side speech recognition unit 130 and extracts the vocabulary that makes up that result. Given a recognition result like the example in FIG. 4, the vocabulary extraction unit 140 extracts the vocabulary listed below and outputs vocabulary information representing it to the transmission unit 150 (vocabulary extraction pattern 1 of the vocabulary extraction unit 140).
"今日／キョウ" (kyō, "today"), "は／ワ" (wa, topic particle), "横浜／ヨコハマ" (Yokohama), "高浜／タカハマ" (Takahama), "横島／ヨコシマ" (Yokoshima), "へ／エ" (e, "to"), "に／ニ" (ni, "to" or "with"), "行った／イッタ" (itta, "went"), "会った／アッタ" (atta, "met")
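Vocabulary extraction pattern 1 amounts to collecting each distinct word that appears anywhere in the recognition result. The following is a minimal Python sketch under stated assumptions: FIG. 4 is not reproduced in this text, so the three hypotheses in `N_BEST` are invented to be consistent with the vocabulary listed above, and the names `Word`, `N_BEST`, and `extract_vocabulary` are hypothetical rather than taken from the patent.

```python
from typing import List, Tuple

Word = Tuple[str, str]  # (surface form, reading), e.g. ("横島", "ヨコシマ")

# Assumed N-best list; FIG. 4 is not available here, so these hypotheses
# are illustrative only.
N_BEST: List[List[Word]] = [
    [("今日", "キョウ"), ("は", "ワ"), ("横浜", "ヨコハマ"), ("へ", "エ"), ("行った", "イッタ")],
    [("今日", "キョウ"), ("は", "ワ"), ("高浜", "タカハマ"), ("へ", "エ"), ("行った", "イッタ")],
    [("今日", "キョウ"), ("は", "ワ"), ("横島", "ヨコシマ"), ("に", "ニ"), ("会った", "アッタ")],
]


def extract_vocabulary(n_best: List[List[Word]]) -> List[Word]:
    """Return every distinct word in the hypotheses, in first-seen order."""
    seen = set()
    vocabulary: List[Word] = []
    for hypothesis in n_best:
        for word in hypothesis:
            if word not in seen:
                seen.add(word)
                vocabulary.append(word)
    return vocabulary


if __name__ == "__main__":
    for surface, reading in extract_vocabulary(N_BEST):
        print(f"{surface}／{reading}")
```

Each word is held as a (surface form, reading) pair, mirroring the "surface／reading" notation of the list above. A confusion network or word lattice would need a different traversal, which this sketch does not cover.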

Here, the vocabulary extraction unit 140 may instead extract, from the vocabulary in the recognition result shown in FIG. 4, only the words that exist in the user data or the user dictionary. For example, suppose that in the example of FIG. 4 only "横島／ヨコシマ" (Yokoshima) is present in the user data or user dictionary of the client terminal 100, and that the other words, such as "今日／キョウ" and "は／ワ", are not. In this case, the vocabulary extraction unit 140 extracts only "横島／ヨコシマ" and outputs vocabulary information representing it to the transmission unit 150 (vocabulary extraction pattern 2 of the vocabulary extraction unit 140).
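Vocabulary extraction pattern 2 can be sketched as a filter over the recognized words. This is only an illustration: the function name `extract_user_vocabulary` is hypothetical, and the second user-dictionary entry ("佐藤", "サトウ") is an invented acquaintance name added to show that user-dictionary words absent from the recognition result are not sent.

```python
from typing import Iterable, List, Set, Tuple

Word = Tuple[str, str]  # (surface form, reading)


def extract_user_vocabulary(recognized: Iterable[Word],
                            user_dictionary: Set[Word]) -> List[Word]:
    """Keep only recognized words that also exist in the user dictionary."""
    result: List[Word] = []
    for word in recognized:
        if word in user_dictionary and word not in result:
            result.append(word)
    return result


# Words taken from the example vocabulary list in step S3.
RECOGNIZED: List[Word] = [
    ("今日", "キョウ"), ("は", "ワ"), ("横浜", "ヨコハマ"),
    ("高浜", "タカハマ"), ("横島", "ヨコシマ"), ("に", "ニ"),
]

# Assumed user dictionary; the second entry is hypothetical.
USER_DICTIONARY: Set[Word] = {("横島", "ヨコシマ"), ("佐藤", "サトウ")}

if __name__ == "__main__":
    print(extract_user_vocabulary(RECOGNIZED, USER_DICTIONARY))
```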

Furthermore, the vocabulary extraction unit 140 may perform the same extraction as in vocabulary extraction pattern 1 and, in addition, generate information (an instruction signal) indicating which extracted words exist in the user data or the user dictionary. For example, in the example of FIG. 4, the vocabulary extraction unit 140 extracts every word in the recognition result, such as "今日／キョウ" and "は／ワ", and generates an instruction signal indicating that the word "横島／ヨコシマ" (Yokoshima) exists in the user data or the user dictionary. The vocabulary extraction unit 140 then outputs the vocabulary information representing the extracted vocabulary, together with the instruction signal, to the transmission unit 150 (vocabulary extraction pattern 3 of the vocabulary extraction unit 140).
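One way to picture vocabulary extraction pattern 3 is to attach a per-word boolean to every extracted word. The patent does not state how the instruction signal is encoded, so the `in_user_dictionary` field below is an assumption, as are the names `TaggedWord` and `tag_vocabulary`.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set, Tuple

Word = Tuple[str, str]  # (surface form, reading)


@dataclass(frozen=True)
class TaggedWord:
    surface: str
    reading: str
    # Assumed encoding of the instruction signal: True when the word
    # exists in the user data or user dictionary of the terminal.
    in_user_dictionary: bool


def tag_vocabulary(recognized: Iterable[Word],
                   user_dictionary: Set[Word]) -> List[TaggedWord]:
    """Extract every word (as in pattern 1) and flag the user-dictionary ones."""
    return [
        TaggedWord(surface, reading, (surface, reading) in user_dictionary)
        for surface, reading in recognized
    ]


RECOGNIZED: List[Word] = [
    ("今日", "キョウ"), ("は", "ワ"), ("横浜", "ヨコハマ"), ("横島", "ヨコシマ"),
]
USER_DICTIONARY: Set[Word] = {("横島", "ヨコシマ")}

if __name__ == "__main__":
    for tagged in tag_vocabulary(RECOGNIZED, USER_DICTIONARY):
        print(tagged)
```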

(Step S4, corresponding to the "transmission step" in the claims)
Next, the transmission unit 150 transmits to the server 200 the speech data input from the voice input unit 110 in step S1, together with the vocabulary information input from the vocabulary extraction unit 140 in step S3 and, if present, the instruction signal.
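The patent does not specify a wire format for step S4. Purely as an illustration, the sketch below bundles the three items into one JSON message, with the speech data base64-encoded and the instruction signal optional. The field names and the functions `build_request` and `parse_request` are all assumptions.

```python
import base64
import json
from typing import Dict, List, Optional, Tuple


def build_request(speech_data: bytes,
                  vocabulary: List[Dict[str, str]],
                  instruction_signal: Optional[List[bool]] = None) -> str:
    """Serialize speech data, vocabulary information, and an optional signal."""
    message = {
        "speech_data": base64.b64encode(speech_data).decode("ascii"),
        "vocabulary": vocabulary,
    }
    if instruction_signal is not None:
        if len(instruction_signal) != len(vocabulary):
            raise ValueError("one flag is expected per vocabulary entry")
        message["instruction_signal"] = instruction_signal
    # ensure_ascii=False keeps the Japanese text readable in the payload.
    return json.dumps(message, ensure_ascii=False)


def parse_request(payload: str) -> Tuple[bytes, List[Dict[str, str]], Optional[List[bool]]]:
    """Server-side counterpart (step S5): recover the three items."""
    message = json.loads(payload)
    speech_data = base64.b64decode(message["speech_data"])
    return speech_data, message["vocabulary"], message.get("instruction_signal")


if __name__ == "__main__":
    vocab = [{"surface": "横島", "reading": "ヨコシマ"}]
    print(build_request(b"\x00\x01\x02", vocab, [True]))
```

Making the instruction signal optional reflects the "if present" wording of steps S4 and S5: patterns 1 and 2 send no signal, while pattern 3 does.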

(Step S5)
Next, the reception unit 210 of the server 200 receives the speech data, the vocabulary information, and, if present, the instruction signal from the transmission unit 150 of the client terminal 100. The reception unit 210 outputs the received speech data to the server-side speech recognition unit 240 and outputs the received vocabulary information and instruction signal to the recognition dictionary expansion unit 230.

(Step S6)
Next, the recognition dictionary expansion unit 230 receives the vocabulary information and, if present, the instruction signal from the reception unit 210 and, based on this information, expands the recognition dictionary of the server-side language model stored in the server-side storage unit 220.

When vocabulary information representing all of the vocabulary extracted by the vocabulary extraction unit 140 is input, as in vocabulary extraction pattern 1 above, the recognition dictionary expansion unit 230 may compare the vocabulary represented by the input vocabulary information with the vocabulary registered in the server's own recognition dictionary, and newly register as known words only those words that are not yet registered. In the example of FIG. 4, if the comparison shows that, for example, the two words "横浜／ヨコハマ" (Yokohama) and "高浜／タカハマ" (Takahama) are not registered in the server's recognition dictionary, the recognition dictionary expansion unit 230 newly registers these two words in the server's recognition dictionary as known words (dictionary expansion pattern 1 of the recognition dictionary expansion unit 230).
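Dictionary expansion pattern 1 is essentially a set difference followed by registration. The sketch below assumes the recognition dictionary can be modelled as a plain set of (surface form, reading) pairs, which is a simplification; the names `expand_dictionary`, `SERVER_DICTIONARY`, and `RECEIVED` are hypothetical.

```python
from typing import Iterable, List, Set, Tuple

Word = Tuple[str, str]  # (surface form, reading)


def expand_dictionary(recognition_dictionary: Set[Word],
                      received: Iterable[Word]) -> List[Word]:
    """Register received words that are unknown to the server.

    Returns the words that were newly registered, in the order received.
    """
    newly_registered: List[Word] = []
    for word in received:
        if word not in recognition_dictionary:
            recognition_dictionary.add(word)
            newly_registered.append(word)
    return newly_registered


# Assumed server dictionary: it lacks 横浜 and 高浜, matching the example.
SERVER_DICTIONARY: Set[Word] = {
    ("今日", "キョウ"), ("は", "ワ"), ("横島", "ヨコシマ"), ("へ", "エ"),
    ("に", "ニ"), ("行った", "イッタ"), ("会った", "アッタ"),
}
RECEIVED: List[Word] = [
    ("今日", "キョウ"), ("は", "ワ"), ("横浜", "ヨコハマ"), ("高浜", "タカハマ"),
    ("横島", "ヨコシマ"), ("へ", "エ"), ("に", "ニ"),
    ("行った", "イッタ"), ("会った", "アッタ"),
]

if __name__ == "__main__":
    print(expand_dictionary(SERVER_DICTIONARY, RECEIVED))
```

A real recognition dictionary would also need pronunciation and language-model entries for each new word; the sketch only tracks membership.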

Alternatively, when it is guaranteed that every word represented by the input vocabulary information exists in the user data or user dictionary of the client terminal 100, as in vocabulary extraction pattern 2 above, the recognition dictionary expansion unit 230 may register all of the words represented by the input vocabulary information in the recognition dictionary as known words. This applies when the server 200 knows in advance, for example because the client terminal 100 and the server 200 have exchanged predetermined information beforehand, that the vocabulary extraction unit 140 of the client terminal 100 extracts only vocabulary that exists in the user data or the user dictionary. In the example of FIG. 4, for instance, only "横島／ヨコシマ" (Yokoshima) is extracted by the vocabulary extraction unit 140, vocabulary information representing only that word is input to the recognition dictionary expansion unit 230, and the recognition dictionary expansion unit 230 registers the word in the server's recognition dictionary as a known word. If the word "横島／ヨコシマ" is already registered as a known word in the recognition dictionary of the server-side language model, the recognition dictionary expansion unit 230 need not register it again (dictionary expansion pattern 2 of the recognition dictionary expansion unit 230).

Furthermore, when an instruction signal is present, as in vocabulary extraction pattern 3 above, the recognition dictionary expansion unit 230 may refer to the instruction signal and register a word in the recognition dictionary as a known word only when the signal shows that the word exists in the user data or user dictionary of the client terminal 100. For example, in the example of FIG. 4, when an instruction signal indicating that "横島／ヨコシマ" (Yokoshima) exists in the user data or the user dictionary is input, the recognition dictionary expansion unit 230 registers the word "横島／ヨコシマ" in the server's recognition dictionary as a known word. If the word "横島／ヨコシマ" is already registered as a known word in the recognition dictionary of the server-side language model, the recognition dictionary expansion unit 230 need not register it again (dictionary expansion pattern 3 of the recognition dictionary expansion unit 230).
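Dictionary expansion pattern 3 differs from pattern 1 in that an unknown word is registered only if the instruction signal marks it as a user-dictionary word. The sketch below assumes the signal arrives as one boolean per received word, which the patent does not specify; the function name and the example data are hypothetical.

```python
from typing import List, Sequence, Set, Tuple

Word = Tuple[str, str]  # (surface form, reading)


def expand_with_instruction_signal(recognition_dictionary: Set[Word],
                                   received: Sequence[Word],
                                   instruction_signal: Sequence[bool]) -> List[Word]:
    """Register only flagged words that the server does not know yet."""
    if len(received) != len(instruction_signal):
        raise ValueError("one flag is expected per received word")
    newly_registered: List[Word] = []
    for word, is_user_word in zip(received, instruction_signal):
        if is_user_word and word not in recognition_dictionary:
            recognition_dictionary.add(word)
            newly_registered.append(word)
    return newly_registered


if __name__ == "__main__":
    # Assumed state: the server knows 今日 but neither 横浜 nor 横島.
    dictionary: Set[Word] = {("今日", "キョウ")}
    received = [("今日", "キョウ"), ("横浜", "ヨコハマ"), ("横島", "ヨコシマ")]
    signal = [False, False, True]  # only 横島 is a user-dictionary word
    print(expand_with_instruction_signal(dictionary, received, signal))
```

In this example "横浜／ヨコハマ" stays unregistered even though the server does not know it, because the signal does not mark it as a user-dictionary word.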

(Step S7)
Next, the server-side speech recognition unit 240 performs the second speech recognition process on the speech data input from the reception unit 210 in step S5, using the language model or the acoustic model stored in the server-side storage unit 220. The second speech recognition process uses the language model after its recognition dictionary has been expanded by the recognition dictionary expansion unit 230. The server-side speech recognition unit 240 outputs the result of the second speech recognition process to the recognition result transmission unit 250.

(Step S8)
Next, the recognition result transmission unit 250 receives the result of the second speech recognition process from the server-side speech recognition unit 240 and transmits it to the client terminal 100.

(Step S9)
Next, the client terminal 100 receives the result of the second speech recognition process and displays it to the user.
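The data flow of steps S1 to S9 can be summarized in a short sketch. It is a toy model under several assumptions: the two recognizers are passed in as functions, network transmission is reduced to a plain hand-off, extraction pattern 1 and expansion pattern 1 are used, and the `fake_*` functions are stand-ins that merely look words up rather than recognizing speech. All names are hypothetical.

```python
from typing import Callable, List, Set, Tuple

Word = Tuple[str, str]  # (surface form, reading)
NBest = List[List[Word]]


def run_sequence(speech_data: bytes,
                 terminal_recognize: Callable[[bytes], NBest],
                 server_recognize: Callable[[bytes, Set[Word]], List[Word]],
                 server_dictionary: Set[Word]) -> List[Word]:
    # S1 is assumed done: speech_data is the digitized user speech.
    # S2: first recognition on the terminal with user-adapted models.
    n_best = terminal_recognize(speech_data)
    # S3: extraction pattern 1 (distinct words in first-seen order).
    vocabulary = list(dict.fromkeys(w for hyp in n_best for w in hyp))
    # S4/S5: transmission and reception, modelled as a plain hand-off.
    # S6: expansion pattern 1 (set update leaves known words untouched).
    server_dictionary.update(vocabulary)
    # S7: second recognition on the server with the expanded dictionary.
    result = server_recognize(speech_data, server_dictionary)
    # S8/S9: the result goes back to the terminal for display.
    return result


# Toy stand-ins: the "spoken" sentence is fixed, and the fake server drops
# any word missing from its dictionary to mimic an unknown word.
SPOKEN: List[Word] = [("今日", "キョウ"), ("は", "ワ"), ("横島", "ヨコシマ"),
                      ("に", "ニ"), ("会った", "アッタ")]


def fake_terminal_recognize(speech_data: bytes) -> NBest:
    return [SPOKEN]


def fake_server_recognize(speech_data: bytes, dictionary: Set[Word]) -> List[Word]:
    return [word for word in SPOKEN if word in dictionary]


if __name__ == "__main__":
    dictionary = set(SPOKEN) - {("横島", "ヨコシマ")}
    print(run_sequence(b"", fake_terminal_recognize, fake_server_recognize, dictionary))
```

With these stand-ins, the server returns "横島／ヨコシマ" only because step S6 registered it first, which is the effect the sequence is designed to achieve.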

In the above description, the speech recognition system 1 including the client terminal 100 and the server 200 has been described as an embodiment of the present invention. However, the invention is not limited to this and may be configured as programs having modules for executing the functions of the client terminal 100 and the server 200. Specifically, as the counterpart of the client terminal 100, a program is configured that includes a voice input module corresponding to the voice input unit 110, a speech recognition processing module corresponding to the terminal-side speech recognition unit 130, a vocabulary extraction module corresponding to the vocabulary extraction unit 140, and a transmission module corresponding to the transmission unit 150, with a user-adapted language model or acoustic model for speech recognition processing stored in storage means. Similarly, a program is configured that includes modules corresponding to the respective components of the server 200. Loading these programs into a mobile terminal, a smartphone, a server, or the like realizes functions equivalent to those of the speech recognition system 1 including the client terminal 100 and the server 200 described above.
Such a program can be recorded on a recording medium. Here, a recording medium is something that can cause, in a reading device provided among the hardware resources of a computer, a change in the state of energy such as magnetism, light, or electricity in accordance with the described content of the program, and can thereby convey that content to the reading device in the form of a corresponding signal. Examples of such a recording medium include a magnetic disk, an optical disk, a CD-ROM, and memory built into a computer.

(Operation and effects of the present embodiment)
Next, the operation and effects of the speech recognition system 1 according to the present embodiment are described. In the speech recognition system 1 of the present embodiment, the vocabulary that makes up the result of the first speech recognition process by the terminal-side speech recognition unit 130 of the client terminal 100 is transmitted to the server 200, which performs the second speech recognition process. Because the terminal-side speech recognition unit 130 uses a language model or acoustic model adapted to the user, it can perform speech recognition customized to the user. Because the vocabulary making up this recognition result is transmitted to the server 200 and used in its recognition dictionary, the server 200 can expand its recognition dictionary and then perform large-vocabulary, high-accuracy speech recognition. It is therefore possible to realize speech recognition that reduces unknown words in the second speech recognition process while combining large-vocabulary, high-accuracy recognition with per-user customization of the language model or acoustic model.

Also, according to the present embodiment, including in the language model of the client terminal 100 a user dictionary based on user data existing in the client terminal 100, or on user-dependent language data obtained from the user's usage history, makes the language model of the client terminal 100 one that is customized to the user. The user dictionary can include, for example, the names of the user's acquaintances.

Also, the present embodiment provides concrete methods for customizing the acoustic model of the client terminal 100 to the user, namely using the user's past input speech or call speech, or the results of acoustic training.

Also, according to the present embodiment, the client terminal 100 transmits to the server 200 only the vocabulary that makes up the result of the first speech recognition process rather than the entire user dictionary. When the client terminal 100 and the server 200 are connected over a network, as in the present embodiment, this keeps the cost of information transmission low. In addition, because the amount of data to be transmitted is small, the overall processing time is shorter, and so is the delay until the speech recognition process finishes.

Also, according to the present embodiment, in vocabulary extraction pattern 2 and dictionary expansion pattern 2 in particular, the recognition dictionary of the server 200 can be reliably expanded with the vocabulary that exists in the user data or user dictionary of the client terminal 100. Moreover, the vocabulary extraction unit 140 has fewer words to extract, which further reduces the amount of data transmitted from the client terminal 100 to the server 200.

Also, according to the present embodiment, in vocabulary extraction pattern 3 and dictionary expansion pattern 3 in particular, referring to the instruction signal makes it possible to reliably identify the vocabulary that exists in the user data or user dictionary of the client terminal 100 and to reliably expand the recognition dictionary of the server 200 with that vocabulary.

DESCRIPTION OF SYMBOLS
1 ... speech recognition system, 100 ... client terminal, 110 ... voice input unit, 120 ... terminal-side storage unit, 130 ... terminal-side speech recognition unit, 140 ... vocabulary extraction unit, 150 ... transmission unit, 200 ... server, 210 ... reception unit, 220 ... server-side storage unit, 230 ... recognition dictionary expansion unit, 240 ... server-side speech recognition unit, 250 ... recognition result transmission unit, 300 ... network.

Claims

A speech recognition system that performs speech recognition processing by communication between a communication terminal and a server, wherein
The communication terminal comprises:
Voice input means for inputting a speech signal;
Storage means for storing a language model or an acoustic model for performing speech recognition processing, the model being adapted to a user;
Speech recognition processing means for performing a first speech recognition process on the speech signal using the language model or the acoustic model;
Vocabulary extracting means for extracting a vocabulary constituting a recognition processing result of the speech recognition processing means and generating an instruction signal, which is information indicating whether or not the extracted vocabulary is present in the language model; and
Transmitting means for transmitting to the server, together with the speech signal, vocabulary information, which is information representing the vocabulary, and the instruction signal, and
The server comprises:
Server-side storage means for storing a server-side language model, which is a language model for performing speech recognition processing and is adapted to speech recognition processing with a larger vocabulary and higher accuracy than the language model stored in the storage means;
Recognition dictionary expansion means for registering an unknown word in the server-side language model as a known word based on the vocabulary information and the instruction signal; and
Server-side speech recognition processing means for performing a second speech recognition process on the speech signal using the server-side language model.

The speech recognition system according to claim 1, wherein the language model stored by the storage means includes a user dictionary based on user data existing in the communication terminal or on user-dependent language data obtained from the user's usage history.

The speech recognition system according to claim 1, wherein the acoustic model stored by the storage means is adapted to the user by using the user's past input speech or call speech, or the results of acoustic training.

The speech recognition system according to any one of claims 1 to 3, wherein the server is connected to a network.

The speech recognition system according to claim 2, wherein the vocabulary extracting means extracts only the vocabulary existing in the user data or the user dictionary.

A speech recognition method for performing speech recognition processing by communication between a communication terminal and a server, wherein
Storage means of the communication terminal stores a language model or an acoustic model for performing speech recognition processing, the model being adapted to a user, and
Server-side storage means of the server stores a server-side language model, which is a language model for performing speech recognition processing and is adapted to speech recognition processing with a larger vocabulary and higher accuracy than the language model stored in the storage means,
The method comprising:
A voice input step in which voice input means of the communication terminal inputs a speech signal;
A speech recognition processing step in which speech recognition processing means of the communication terminal performs a first speech recognition process on the speech signal using the language model or the acoustic model;
A vocabulary extraction step in which vocabulary extracting means of the communication terminal extracts a vocabulary constituting a recognition processing result of the speech recognition processing means and generates an instruction signal, which is information indicating whether or not the extracted vocabulary is present in the language model;
A transmission step in which transmitting means of the communication terminal transmits to the server, together with the speech signal, vocabulary information, which is information representing the vocabulary, and the instruction signal;
A recognition dictionary expansion step in which recognition dictionary expansion means of the server registers an unknown word in the server-side language model as a known word based on the vocabulary information and the instruction signal; and
A server-side speech recognition processing step in which server-side speech recognition processing means of the server performs a second speech recognition process on the speech signal using the server-side language model.

A speech recognition program for speech recognition processing performed by communication between a communication terminal and a server, comprising:
A voice input module for inputting a speech signal;
A speech recognition processing module that performs a first speech recognition process on the speech signal using a language model or an acoustic model that is stored in the communication terminal and is adapted to a user;
A vocabulary extraction module that extracts a vocabulary constituting a recognition processing result of the speech recognition processing module and generates an instruction signal, which is information indicating whether or not the extracted vocabulary is present in the language model;
A transmission module that transmits to the server, together with the speech signal, vocabulary information, which is information representing the vocabulary, and the instruction signal;
A recognition dictionary expansion module that, based on the vocabulary information and the instruction signal, registers an unknown word in a server-side language model as a known word, the server-side language model being a language model for performing speech recognition processing that is adapted to speech recognition processing with a larger vocabulary and higher accuracy than the language model stored in the communication terminal; and
A server-side speech recognition processing module that performs a second speech recognition process on the speech signal using the server-side language model.