JP2010032865A

JP2010032865A - Speech recognizer, speech recognition system, and program

Info

Publication number: JP2010032865A
Application number: JP2008196078A
Authority: JP
Inventors: Toshiki Endo; 俊樹遠藤; Koji Sase; 孝司佐瀬
Original assignee: KDDI Corp
Current assignee: KDDI Corp
Priority date: 2008-07-30
Filing date: 2008-07-30
Publication date: 2010-02-12
Anticipated expiration: 2028-07-30
Also published as: JP5112978B2

Abstract

PROBLEM TO BE SOLVED: To perform speech recognition using either a speaker-adapted acoustic model or a general-purpose acoustic model, depending on how frequently a user has made speech recognition requests, when acoustic models are speaker-adapted for each group.

SOLUTION: A speech recognizer includes: a usage frequency extraction means 60 that extracts, from an access information database 32 storing access information including speech recognition requests and user IDs, the user IDs of users whose frequency of speech recognition requests is equal to or higher than a predetermined threshold; a high-frequency user grouping means 50 that groups the users based on the extracted user IDs; a speaker adaptation processing means 70 that performs speaker adaptation of the acoustic model for each group; and a speech recognition processing means 20 that recognizes the speech of the user corresponding to a user ID using the acoustic model speaker-adapted for that user's group.

COPYRIGHT: (C)2010, JPO&INPIT

Description

The present invention relates to a technique applied to a speech recognition system used by an unspecified large number of users.

Techniques that automatically adapt an acoustic model using the recognized character string and the input speech data have long been known as a way to improve recognition accuracy in speech recognition systems (Patent Document 1). Techniques that divide users into a plurality of groups based on user profile information and speaker-adapt the acoustic model for each group are also known (Patent Documents 2 and 3). In Patent Document 2, a plurality of speaker classes are set based on speech feature data obtained from a large number of unspecified speakers, and an unspecified-speaker codebook is created from the speech feature data of the speakers belonging to each speaker class. A centroid vector sequence for each speaker adaptation word is then calculated from the speech feature vector sequences of the speakers belonging to each speaker class and stored. The speech feature vector sequence of an input speaker is matched against the centroid vector sequences to determine which speaker class the input speaker's speech belongs to, and the corresponding unspecified-speaker codebook is selected. At recognition time, the input speaker's speech is coded with the selected codebook, and speech recognition is performed using the corresponding speech model.

In Patent Document 3, speech samples classified by speaker characteristic are used to determine an optimal output function for each category, and those output functions are used to determine the overall output function. Speech recognition matched to the speaker's attributes is then performed from the output probabilities of the hidden Markov model of each classification and the attribute probabilities of the speaker. Recognition based on this speaker classification is also used to calculate an estimated score. Furthermore, the calculation of the output probabilities and of the estimated score is accelerated using fuzzy inference together with a region segmentation method and a variable separation method.

In Patent Document 4, recognition processing is performed using, from among a plurality of acoustic models, the acoustic model corresponding to a group.

Patent Document 1: JP 09-034485 A
Patent Document 2: JP 09-258769 A
Patent Document 3: JP 10-282986 A
Patent Document 4: JP 2002-182682 A

However, when speaker adaptation of the acoustic model is performed for each group as in the prior art described above, the processing load of grouping becomes very large once the number of users reaches several thousand. With a user base of this size, increasing the number of groups increases the number of speaker adaptation runs. Conversely, reducing the number of groups increases the number of users belonging to each group, so that even a speaker-adapted acoustic model cannot be expected to improve recognition accuracy and may even lower it for particular speakers.

The present invention has been made in view of these circumstances. Its object is to provide a speech recognition device, a speech recognition system, and a program that, when acoustic models are speaker-adapted for each group, can perform speech recognition using either a speaker-adapted acoustic model or a general-purpose acoustic model depending on the frequency with which speech recognition requests have been made.

(1) To achieve the above object, the present invention takes the following measures. The speech recognition device of the present invention performs speech recognition using either a speaker-adapted acoustic model or a general-purpose acoustic model according to the frequency with which speech recognition requests have been made, and comprises: usage frequency extraction means for extracting, from an access information database storing access information including speech recognition requests and user IDs, the user IDs of users whose frequency of speech recognition requests is equal to or higher than a predetermined threshold; high-frequency user grouping means for grouping users based on the extracted user IDs; speaker adaptation processing means for performing speaker adaptation of the acoustic model for each group; and speech recognition processing means for recognizing the speech of the user corresponding to a user ID using the acoustic model speaker-adapted for each group.

Because the user IDs of users whose request frequency is equal to or higher than a predetermined threshold are extracted, the users are grouped based on the extracted user IDs, and speaker adaptation of the acoustic model is performed for each group, speaker adaptation that achieves high recognition accuracy can be carried out while limiting the number of adaptation runs, even when the number of users reaches several thousand. Users thus gain access to an increasingly accurate speech recognition service as they continue to make recognition requests. The service provider, in turn, can retain users as regulars, promoting use of the service and encouraging repeat use by those regulars.

(2) In the speech recognition device of the present invention, the high-frequency user grouping means performs grouping based on user profile information.

Because grouping is performed based on user profile information, the grouping can be both effective and efficient.
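As an illustration only, profile-based grouping could be sketched as below. The profile fields (`gender`, `age`), the ten-year age bands, and the function names are assumptions made for this example; the text only states that grouping uses user profile information.

```python
from collections import defaultdict


def age_band(age, width=10):
    """Map an age to the lower bound of its band, e.g. 37 -> 30."""
    return (age // width) * width


def group_by_profile(profiles):
    """Group user IDs by a (gender, age band) key taken from profile records."""
    groups = defaultdict(list)
    for user_id, profile in profiles.items():
        key = (profile["gender"], age_band(profile["age"]))
        groups[key].append(user_id)
    return dict(groups)


# Hypothetical profile records for three high-frequency users.
profiles = {
    "u1": {"gender": "f", "age": 34},
    "u2": {"gender": "f", "age": 38},
    "u3": {"gender": "m", "age": 52},
}
groups = group_by_profile(profiles)
```

Each resulting key would then stand for one group whose acoustic model is adapted separately.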

(3) In the speech recognition device of the present invention, the high-frequency user grouping means performs grouping by mapping each extracted user ID to the group whose group-specific acoustic model yields the highest speech recognition score.

Because grouping is performed by mapping each extracted user ID to the group whose group-specific acoustic model yields the highest speech recognition score, the grouping can be both effective and efficient.

(4) In the speech recognition device of the present invention, the high-frequency user grouping means performs grouping based on the distribution tendency of the speech recognition scores obtained from the group-specific acoustic models.

Because grouping is performed based on the distribution tendency of the speech recognition scores obtained from the group-specific acoustic models, the grouping can be both effective and efficient.

(5) In the speech recognition device of the present invention, when a speech recognition request is received from a user whose request frequency is equal to or lower than a predetermined threshold, the speech recognition processing means performs speech recognition using the general-purpose acoustic model.

Because speech recognition is performed with the general-purpose acoustic model when a request comes from a user whose request frequency is equal to or lower than a predetermined threshold, a wide range of users can be served.
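The choice between the two kinds of model can be sketched as follows. Note that the text uses "equal to or higher than" for the adapted case and "equal to or lower than" for the general-purpose case, so the behaviour at exactly the threshold is not pinned down. This sketch assumes that a count equal to the threshold counts as high frequency; the function and parameter names are hypothetical.

```python
def choose_model(request_count, threshold_n, group_model, general_model):
    """Return the group-adapted model for frequent users, else the general one.

    Assumption: a count equal to threshold_n is treated as high frequency.
    A frequent user without an adapted model yet falls back to the general model.
    """
    if request_count >= threshold_n and group_model is not None:
        return group_model
    return general_model


# Strings stand in for real acoustic models.
picked_frequent = choose_model(12, 10, "group-3 model", "general model")
picked_rare = choose_model(2, 10, "group-3 model", "general model")
picked_boundary = choose_model(10, 10, "group-3 model", "general model")
```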

(6) The speech recognition system of the present invention comprises a server device having the speech recognition device according to any one of claims 1 to 5 and at least one portable terminal device. The portable terminal device makes a speech recognition request to the server device, and the server device performs speech recognition and transmits the recognition result to the portable terminal device.

With this configuration, the user IDs of users whose request frequency is equal to or higher than a predetermined threshold are extracted, the users are grouped based on the extracted user IDs, and speaker adaptation of the acoustic model is performed for each group. Speaker adaptation that achieves high recognition accuracy can therefore be carried out while limiting the number of adaptation runs, even when the number of users reaches several thousand. Users thus gain access to an increasingly accurate speech recognition service as they continue to make recognition requests. The service provider, in turn, can retain users as regulars, promoting use of the service and encouraging repeat use by those regulars.

(7) The program of the present invention performs speech recognition using either a speaker-adapted acoustic model or a general-purpose acoustic model according to the frequency with which speech recognition requests have been made. It codes, as commands readable and executable by a computer, a series of processes including: a process of extracting, from an access information database storing access information including speech recognition requests and user IDs, the user IDs of users whose frequency of speech recognition requests is equal to or higher than a predetermined threshold; a process of grouping users based on the extracted user IDs; a process of performing speaker adaptation of the acoustic model for each group; and a process of recognizing the speech of the user corresponding to a user ID using the acoustic model speaker-adapted for each group.

According to the present invention, the user IDs of users whose request frequency is equal to or higher than a predetermined threshold are extracted, the users are grouped based on the extracted user IDs, and speaker adaptation of the acoustic model is performed for each group. Speaker adaptation that achieves high recognition accuracy can therefore be carried out while limiting the number of adaptation runs, even when the number of users reaches several thousand. Users thus gain access to an increasingly accurate speech recognition service as they continue to make recognition requests. The service provider, in turn, can retain users as regulars, promoting use of the service and encouraging repeat use by those regulars.

Next, an embodiment of the present invention will be described with reference to the drawings. FIG. 1 is a diagram showing the concept of the speech recognition device according to this embodiment. In this embodiment, users whose number of speech recognition requests (hereinafter referred to as "usage frequency") is N or more (N is a natural number) are divided into arbitrary groups based on usage frequency and recognition score, and speech recognition is performed using group-specific acoustic models that have been speaker-adapted for each group. This provides a highly accurate speech recognition service. As shown in FIG. 1, users are divided into a high-usage group and a low-usage group. For the high-usage group, speech recognition is performed using a speaker-adapted acoustic model, which makes it possible to provide a highly accurate service to so-called regulars. For the low-usage group, speech recognition is performed using a general-purpose acoustic model for unspecified speakers, which makes it possible to serve a wide range of users.

FIG. 2 is a block diagram showing the schematic configuration of the speech recognition system according to this embodiment. The recognition processing request receiving means 10 receives a speech recognition request and speech data from a user terminal and instructs the speech recognition processing means 20 to perform recognition processing. It also returns the recognition result to the user terminal and stores the input speech, the recognition result, and the access information in their respective DBs. The group ID search means 11 obtains, from the user ID / group ID table, the group ID to which the user ID received with the recognition request belongs.

The speech recognition processing means 20 performs recognition processing using the general-purpose acoustic model 21, the group-specific acoustic models 22, and the language model 23 in accordance with the recognition request received from the recognition processing request receiving means 10, and returns the recognition result to the recognition processing request receiving means 10. The recognition request also includes a group ID, and recognition is performed using the acoustic model of the corresponding group. The stored speech DB 33 stores the input speech. The speech data may be in PCM or a similar format, or may be spectral-domain data, cepstral-domain data, or VQ data.

The recognition result DB 31 holds the recognized character string and the recognition score. The recognition score may additionally be held split into an acoustic likelihood and a language probability. The access information / profile information DB 32 holds the time of each recognition request, the user ID, the name of the corresponding speech data stored in the stored speech DB 33, and the name of the recognition result file stored in the recognition result DB 31. The user ID / group ID table 40 is a mapping table between user IDs and their assigned group IDs. This table is updated after the high-frequency user grouping means 50 has performed grouping and the speaker-adapted models have been updated.
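The records described above could be modelled roughly as follows. This is a sketch under assumptions: the class names, field names, file-name values, and the use of an in-memory dictionary for table 40 are all hypothetical, since the text does not specify a storage format.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, Optional


@dataclass
class RecognitionResult:
    """One entry of the recognition result DB 31."""
    text: str                                      # recognized character string
    score: float                                   # recognition score
    acoustic_likelihood: Optional[float] = None    # optional split of the score
    language_probability: Optional[float] = None


@dataclass
class AccessRecord:
    """One entry of the access information / profile information DB 32."""
    requested_at: datetime   # time of the recognition request
    user_id: str
    audio_name: str          # name of the speech data in the stored speech DB 33
    result_name: str         # name of the result file in the recognition result DB 31


# User ID / group ID table 40, modelled here as a plain dictionary.
user_group_table: Dict[str, int] = {}

record = AccessRecord(datetime(2008, 7, 30, 9, 0), "u1", "u1_0001.pcm", "u1_0001.txt")
result = RecognitionResult(text="hello", score=64.1)
user_group_table[record.user_id] = 1
```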

The usage frequency extraction means 60 extracts the usage frequency of each user ID from the access information / profile information DB 32 and sends the user IDs that have used the service N or more times to the high-frequency user grouping means 50. The high-frequency user grouping means 50 groups the users who have used the service N or more times. The available grouping methods are a method based on profile information and methods based on either the maximum score or the score distribution of the speaker-adapted models. These grouping methods are described later.
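A minimal sketch of the extraction step, assuming the access log is available as a sequence of `(user_id, timestamp)` pairs (the log layout and the function name are assumptions for illustration):

```python
from collections import Counter


def extract_high_frequency_users(access_log, threshold_n):
    """Return the user IDs that appear threshold_n or more times in the log."""
    counts = Counter(user_id for user_id, _ in access_log)
    return sorted(u for u, c in counts.items() if c >= threshold_n)


# Hypothetical access log: one entry per recognition request.
access_log = [
    ("u1", "t1"), ("u2", "t2"), ("u1", "t3"),
    ("u3", "t4"), ("u1", "t5"), ("u2", "t6"),
]
high = extract_high_frequency_users(access_log, threshold_n=2)
```

With N = 2, users u1 (three requests) and u2 (two requests) are passed on for grouping, while u3 stays on the general-purpose model.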

The speaker adaptation processing means 70 performs speaker adaptation of the acoustic model for each group using the general-purpose language model, the stored speech data, and the recognition results. The data acquisition means 71 obtains the speech data and recognition results used for speaker adaptation from the stored speech DB 33 and the recognition result DB 31 and sends them to the acoustic model processing means 72. The acoustic model processing means 72 performs the speaker adaptation and updates the acoustic model with the resulting model.

This completes the schematic configuration of the speech recognition system according to this embodiment. The recognition processing request receiving means 10, the speech recognition processing means 20, the general-purpose acoustic model 21, and the language model 23 can be implemented on the portable terminal device, with the remaining components implemented on the server device.

FIG. 3 is a flowchart showing the speaker adaptation procedure. First, the usage frequency extraction means 60 calculates the usage frequency of the users who have made speech recognition requests (step S1). Next, the high-frequency user grouping means 50 groups the users (step S2). Here, for example, users whose usage frequency is N or more are divided into M groups. M is an arbitrary natural number of 1 or more, and its maximum value is the number of users whose usage frequency is N or more. Next, the speaker adaptation processing means 70 performs speaker adaptation of the acoustic model (step S3). Here, the data acquisition means 71 obtains the speech data of the user IDs assigned to each group from the stored speech DB 33 and obtains the recognition results from the recognition result DB 31. The acoustic model processing means 72 then performs adaptation processing on the general-purpose acoustic model for unspecified speakers. A conventionally known technique such as MLLR can be used for this adaptation. Next, the speaker adaptation processing means 70 updates the user ID / group ID mapping table via the acoustic model processing means 72 (step S4).
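Steps S1 to S4 can be sketched end to end as below. The grouping rule and the adaptation routine are passed in as callables because the text leaves both open (profile or score based grouping, MLLR or a similar technique). The stubs used in the example are placeholders, not real adaptation code, and all names are hypothetical.

```python
from collections import Counter, defaultdict


def run_adaptation_cycle(access_log, assign_group, adapt_model, general_model, threshold_n):
    """Run S1-S4 once and return (group models, user ID -> group ID table)."""
    # S1: count requests per user and keep users at or above the threshold N.
    counts = Counter(user_id for user_id, _ in access_log)
    frequent = [u for u, c in counts.items() if c >= threshold_n]

    # S2: assign each frequent user to a group.
    members = defaultdict(list)
    for user_id in frequent:
        members[assign_group(user_id)].append(user_id)

    # S3: adapt the general-purpose model on the pooled utterances of each group.
    utterances = defaultdict(list)
    for user_id, utterance in access_log:
        utterances[user_id].append(utterance)
    group_models = {}
    for group_id, users in members.items():
        data = [utt for u in users for utt in utterances[u]]
        group_models[group_id] = adapt_model(general_model, data)

    # S4: rebuild the user ID / group ID mapping table.
    table = {u: gid for gid, users in members.items() for u in users}
    return group_models, table


# Placeholder log, grouping rule, and adaptation routine for illustration.
log = [("u1", "a1"), ("u2", "b1"), ("u1", "a2"), ("u3", "c1"), ("u1", "a3"), ("u2", "b2")]
assign = {"u1": 1, "u2": 2}.get
adapt = lambda base, data: f"{base}+{len(data)}utt"
models, table = run_adaptation_cycle(log, assign, adapt, "general", threshold_n=2)
```

Only users at or above the threshold appear in the returned table, so the number of adaptation runs is bounded by the number of groups rather than the number of users.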

FIG. 4 is a flowchart showing the speech recognition procedure. When the recognition processing request receiving means 10 receives a recognition request from a user (step S11), the group ID search means 11 searches for the group ID (step S12), using the user ID as the key. Next, the speech recognition processing means 20 performs speech recognition using the acoustic model corresponding to the group ID (step S13). Here, for example, unspecified speakers are assigned a group ID of 0.
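The lookup in steps S12 and S13 might look like the sketch below. Acoustic models are represented by plain callables, and treating a user missing from the table as group 0 is an assumption for this example; the text only says that unspecified speakers are assigned group ID 0.

```python
GENERAL_GROUP_ID = 0  # group ID used for unspecified speakers


def recognize(user_id, audio, user_group_table, models):
    """Look up the user's group (S12) and recognize with that group's model (S13)."""
    group_id = user_group_table.get(user_id, GENERAL_GROUP_ID)
    # Fall back to the general-purpose model if the group has no model yet.
    model = models.get(group_id, models[GENERAL_GROUP_ID])
    return group_id, model(audio)


# Stub recognizers standing in for real acoustic models.
models = {
    0: lambda audio: "general:" + audio,
    1: lambda audio: "group1:" + audio,
}
table = {"u1": 1}
known = recognize("u1", "x", table, models)
unknown = recognize("u9", "x", table, models)
```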

FIG. 5 is a diagram showing the acoustic-model-based grouping method performed by the high-frequency user grouping means 50. As shown in FIG. 5, the high-frequency user grouping means 50 obtains all the speech of a particular user ID and matches it against the group-specific acoustic models. For example, the average recognition score is calculated against each of the group-specific acoustic models with group IDs "1", "2", ..., "M". Suppose that the normalized score is 64.1 for group ID "1", 39.4 for group ID "2", and 72.9 for group ID "M". In this case, the group with the highest score is selected. Alternatively, grouping may be performed based on the score distribution. The high-frequency user grouping means 50 can also group frequent users based on profile information such as the user's age, gender, and place of origin.
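The maximum-score rule can be sketched as follows. The per-utterance scores are made-up values, chosen only so that the group averages come out close to the figures quoted above (64.1, 39.4, 72.9), and the integer 3 stands in for group "M". How scores are normalized is not specified in the text, so the inputs are assumed to be already normalized.

```python
from statistics import mean


def assign_group_by_score(scores_by_group):
    """Return (best group ID, per-group averages) for one user's utterances."""
    averages = {gid: mean(scores) for gid, scores in scores_by_group.items()}
    best = max(averages, key=averages.get)
    return best, averages


# Hypothetical normalized scores of one user's utterances under each group model.
scores_by_group = {
    1: [62.0, 66.2],
    2: [40.1, 38.7],
    3: [71.8, 74.0],  # 3 plays the role of group "M"
}
best_group, averages = assign_group_by_score(scores_by_group)
```

The user is mapped to the group whose model explains the user's speech best, here group 3.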

As described above, the speech recognition system according to this embodiment extracts the user IDs of users whose request frequency is equal to or higher than a predetermined value (N), groups the users based on the extracted user IDs, and performs speaker adaptation of the acoustic model for each group. Speaker adaptation that achieves high recognition accuracy can therefore be carried out while limiting the number of adaptation runs, even when the number of users reaches several thousand. Users thus gain access to an increasingly accurate speech recognition service as they continue to make recognition requests. The service provider, in turn, can retain users as regulars, promoting use of the service and encouraging repeat use by those regulars.

The characteristic operations of this embodiment can be carried out by having a computer execute a program. The program according to this embodiment performs speech recognition using either a speaker-adapted acoustic model or a general-purpose acoustic model according to the frequency with which speech recognition requests have been made. It codes, as commands readable and executable by a computer, a series of processes including: a process of extracting, from an access information database storing access information including speech recognition requests and user IDs, the user IDs of users whose frequency of speech recognition requests is equal to or higher than a predetermined threshold; a process of grouping users based on the extracted user IDs; a process of performing speaker adaptation of the acoustic model for each group; and a process of recognizing the speech of the user corresponding to a user ID using the acoustic model speaker-adapted for each group.

As a result, speaker adaptation that achieves high recognition accuracy can be carried out while limiting the number of adaptation runs, even when the number of users reaches several thousand. Users thus gain access to an increasingly accurate speech recognition service as they continue to make recognition requests. The service provider, in turn, can retain users as regulars, promoting use of the service and encouraging repeat use by those regulars.

FIG. 1 is a diagram showing the concept of the speech recognition device according to this embodiment.
FIG. 2 is a block diagram showing the schematic configuration of the speech recognition system according to this embodiment.
FIG. 3 is a flowchart showing the speaker adaptation procedure.
FIG. 4 is a flowchart showing the speech recognition procedure.
FIG. 5 is a diagram showing the acoustic-model-based grouping method performed by the high-frequency user grouping means 50.

Explanation of symbols

10 Recognition processing request receiving means
11 Group ID search means
20 Speech recognition processing means
21 General-purpose acoustic model
22 Group-specific acoustic models
23 Language model
40 User ID / group ID table
50 High-frequency user grouping means
60 Usage frequency extraction means
70 Speaker adaptation processing means
71 Data acquisition means
72 Acoustic model processing means

Claims

1. A speech recognition device that performs speech recognition using either a speaker-adapted acoustic model or a general-purpose acoustic model according to the frequency with which speech recognition requests have been made, the device comprising:
usage frequency extraction means for extracting, from an access information database storing access information including speech recognition requests and user IDs, the user IDs of users whose frequency of speech recognition requests is equal to or higher than a predetermined threshold;
high-frequency user grouping means for grouping users based on the extracted user IDs;
speaker adaptation processing means for performing speaker adaptation of the acoustic model for each group; and
speech recognition processing means for recognizing the speech of the user corresponding to a user ID using the acoustic model speaker-adapted for each group.

2. The speech recognition device according to claim 1, wherein the high-frequency user grouping means performs grouping based on user profile information.

3. The speech recognition device according to claim 1, wherein the high-frequency user grouping means performs grouping by mapping each extracted user ID to the group whose group-specific acoustic model yields the highest speech recognition score.

4. The speech recognition device according to claim 1, wherein the high-frequency user grouping means performs grouping based on the distribution tendency of the speech recognition scores obtained from the group-specific acoustic models.

5. The speech recognition device according to claim 1, wherein the speech recognition processing means performs speech recognition using the general-purpose acoustic model when a speech recognition request is received from a user whose frequency of speech recognition requests is equal to or lower than a predetermined threshold.

6. A speech recognition system comprising:
a server device having the speech recognition device according to any one of claims 1 to 5; and
at least one portable terminal device,
wherein the portable terminal device makes a speech recognition request to the server device, and the server device performs speech recognition and transmits the speech recognition result to the portable terminal device.

7. A program for performing speech recognition using either a speaker-adapted acoustic model or a general-purpose acoustic model according to the frequency with which speech recognition requests have been made, the program coding, as commands readable and executable by a computer, a series of processes including:
a process of extracting, from an access information database storing access information including speech recognition requests and user IDs, the user IDs of users whose frequency of speech recognition requests is equal to or higher than a predetermined threshold;
a process of grouping users based on the extracted user IDs;
a process of performing speaker adaptation of the acoustic model for each group; and
a process of recognizing the speech of the user corresponding to a user ID using the acoustic model speaker-adapted for each group.