JP6363478B2

JP6363478B2 - Speech recognition apparatus, speech recognition method, and speech recognition program

Info

Publication number: JP6363478B2
Application number: JP2014236529A
Authority: JP
Inventors: 麻衣子井元; 丈二中山; 智広山田; 滋藤村; えりか足利
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2014-11-21
Filing date: 2014-11-21
Publication date: 2018-07-25
Anticipated expiration: 2034-11-21
Also published as: JP2016099501A

Description

The present invention relates to speech recognition technology.

In recent years, speech recognition services for mobile terminals such as smartphones and tablets have become widespread, and services that provide speech recognition functions are expected to expand further.

Conventional speech recognition technology converts input speech into text in four steps (see Non-Patent Document 1). In the first step, the input speech is analyzed, noise is removed from the speech signal, and acoustic features that serve as cues for recognition are extracted. In the second step, an acoustic model that stores the features of each phoneme is used to convert the input speech into symbols representing phonemes, the smallest units of speech. In the third step, a recognition dictionary that stores the correspondence between phoneme strings and words is used to convert the phoneme string into words. In the fourth step, a language model that stores wording and phrasing is used to calculate, for each conversion candidate, a score that indicates its validity. The language model holds the rules governing how words connect as statistical values. Depending on the speech recognition service, the output is either only the most valid conversion candidate or the N most valid candidates in descending order of validity, known as the N-best solutions.

Non-Patent Document 1: "Speech recognition solutions: speech recognition technology that has entered the practical stage through improved recognition accuracy", [online], October 30, 2012, Impress Corporation, [retrieved September 26, 2014], Internet <URL: http://it.impressbm.co.jp/articles/-/10240/>

The validity of a conversion candidate depends on the situation and background (context) of the speaker at the time of the utterance. When the speaker's context is not taken into account, an inappropriate speech recognition result may be presented. For example, when the utterance 「おいしいかきをたべたい」 ("oishii kaki wo tabetai", "I want to eat delicious kaki") is input, it is difficult to determine whether the word "kaki" should be rendered as the fruit 「柿」 (persimmon) or the shellfish 「牡蠣」 (oyster).

The present invention has been made in view of the above, and its object is to present more appropriate speech recognition results.

A speech recognition apparatus according to a first aspect of the present invention comprises: user information storage means that stores user information in which a date, or a date and time, is associated with information indicating the user's situation on that date or at that date and time; co-occurrence frequency storage means that stores co-occurrence frequency information including a plurality of words and the co-occurrence frequency between those words; speech recognition means that receives the user's speech, performs speech recognition to obtain conversion candidates, and determines the tense indicated by the content of the user's speech input; and conversion candidate sorting means that acquires, from the user information storage means, the user information corresponding to the tense indicated by the speech input content, extracts the words contained in that user information, extracts the words contained in the conversion candidates, and reorders the conversion candidates based on the co-occurrence frequency in the co-occurrence frequency information that contains the words extracted from each.

In the above speech recognition apparatus, the user information excludes speech recognition results.

In the above speech recognition apparatus, the user information is information that the user has registered or updated in a service.

A speech recognition method according to a second aspect of the present invention is a speech recognition method executed by a computer, comprising the steps of: receiving a user's speech, performing speech recognition to obtain conversion candidates, and determining the tense indicated by the content of the user's speech input; acquiring the user information corresponding to the tense indicated by the speech input content from user information storage means that stores user information in which a date, or a date and time, is associated with information indicating the user's situation on that date or at that date and time, and extracting the words contained in that user information; extracting the words contained in the conversion candidates; and acquiring, from co-occurrence frequency storage means that stores co-occurrence frequency information including a plurality of words and the co-occurrence frequency between those words, the co-occurrence frequency information that contains the words extracted from each of the user information and the conversion candidates, and reordering the conversion candidates based on the co-occurrence frequency in that information.

A speech recognition program according to a third aspect of the present invention causes a computer to operate as each of the means of the above speech recognition apparatus.

According to the present invention, more appropriate speech recognition results can be presented.

FIG. 1 is a functional block diagram showing the configuration of the speech recognition system in the present embodiment.
FIG. 2 is a diagram showing an example of the data held in the user information database.
FIG. 3 is a diagram showing an example of the data held in the context information database.
FIG. 4 is a flowchart showing the flow of processing of the speech recognition system in the present embodiment.
FIG. 5 is a flowchart showing the flow of processing of the reranking execution unit in the present embodiment.
FIG. 6 is a diagram illustrating the processing of the reranking execution unit with a concrete example.

Embodiments of the present invention are described below with reference to the drawings.

FIG. 1 is a functional block diagram showing the configuration of the speech recognition system in the present embodiment. The speech recognition system shown in the figure includes a client terminal 1 and a server 3. The system acquires, from the services the user uses, user information for estimating the user's context and stores it in advance. When the user inputs speech, the system recognizes the speech to obtain conversion candidates, then reorders the candidates based on the user information into an order appropriate to the user's context and presents them. The client terminal 1 and the server 3 are described below.

The client terminal 1 includes a user information storage unit 11, a user information database (DB) 12, a voice input unit 13, an information transmission unit 14, a recognition result receiving unit 15, and a display unit 16.

The user information storage unit 11 acquires user information from the services the user uses and stores it in the user information DB 12. Examples of user information include information about the user's schedule, obtainable from a schedule management service, and information about comments the user has posted, obtainable from a comment posting service. The services from which user information is acquired are registered in advance and linked to the speech recognition system. The user information storage unit 11 runs whenever user information is updated in a linked service and updates the user information stored in the user information DB 12 accordingly. For example, when a new schedule entry is added in the schedule management service, it adds a record to register the new user information; when a schedule entry is updated, it rewrites the information stored in the user information DB 12.

The user information DB 12 holds the user information acquired from each service. FIG. 2 shows an example of the data held in the user information DB 12. In the example of FIG. 2, the user information DB 12 holds records consisting of an item column, a date column, and a time column. The item column stores information indicating the user's situation. For example, when a schedule management service is set as a linked service, the user information storage unit 11 acquires the item and the date and time of each entry registered in the schedule management service and stores them in the item, date, and time columns of the user information DB 12. Likewise, when a comment posting service is set as a linked service, the user information storage unit 11 acquires the content and the date and time of each comment the user has posted to the service and stores them in the item, date, and time columns of the user information DB 12.
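The record layout and the add-or-rewrite behaviour described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the class names, the `(service, external_id)` key, and the use of a start/end pair for the time column are all assumptions introduced here, since the source only names the three columns.

```python
from dataclasses import dataclass
from datetime import date, time
from typing import Optional


@dataclass
class UserInfoRecord:
    """One row of the user information DB (item, date, time columns)."""
    item: str                      # text describing the user's situation
    day: date                      # date column
    start: Optional[time] = None   # time column (assumed to be a range)
    end: Optional[time] = None


class UserInfoDB:
    """In-memory stand-in for the user information DB 12 (hypothetical)."""

    def __init__(self):
        # Keyed by (service name, id of the entry in that service); this key
        # is an assumption used to tell "new entry" from "updated entry".
        self._records = {}

    def upsert(self, service, external_id, record):
        # A new key adds a record; an existing key rewrites the stored one.
        self._records[(service, external_id)] = record

    def records(self):
        return list(self._records.values())


db = UserInfoDB()
db.upsert("schedule", "ev1",
          UserInfoRecord("Family trip to Hiroshima", date(2014, 9, 27),
                         time(9, 0), time(21, 0)))
db.upsert("comments", "c1",
          UserInfoRecord("Itsukushima Shrine is amazing.", date(2014, 9, 27),
                         time(15, 30), time(15, 30)))
```

Calling `upsert` again with the key `("schedule", "ev1")` would overwrite the first record rather than add a third one, which mirrors the rewrite-on-update behaviour. The times used are invented for illustration.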

OAuth, for example, can be used to acquire user information from the linked services (reference URL: http://oauth.net/). OAuth is a mechanism by which another web service B, with the user's permission, accesses the resources the user holds in a web service A and the functions the user is authorized to use there. Once the user has granted web service B permission to access web service A, web service B can access the API (Application Programming Interface) provided by web service A within the scope of that permission.

The voice input unit 13 receives the user's speech to be recognized and sends the input voice information to the information transmission unit 14.

The information transmission unit 14 receives the voice information from the voice input unit 13, acquires from the user information DB 12 the user information corresponding to the voice input time, that is, the time at which the speech was input to the voice input unit 13, and transmits the acquired user information and the voice information to the server 3. To acquire the user information, it takes the item column of each record whose date and time columns include the voice input time. For example, when the data shown in FIG. 2 is stored in the user information DB 12 and the voice input time is "2014/9/27 18:40", the information transmission unit 14 takes the information stored in the item column of the record that includes the voice input time ("Family trip to Hiroshima" in FIG. 2) as the user information I_u = {Family trip to Hiroshima}. It may additionally acquire the information stored in the item column of records whose date column matches the date of the voice input time ("Itsukushima Shrine is amazing." in FIG. 2), giving I_u = {Family trip to Hiroshima, Itsukushima Shrine is amazing.}. The rule that determines which records are acquired from the user information DB 12 for a given voice input time is set in advance.
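The selection rule for the voice input time can be sketched as below. This is only an illustration under stated assumptions: records are modelled as hypothetical `(item, start, end)` tuples, the specific times and the "Dentist" row are invented, and the `include_same_day` flag stands in for the preconfigured rule that the text leaves unspecified.

```python
from datetime import datetime

# Hypothetical rows (item, start, end); the times are invented for illustration.
RECORDS = [
    ("Family trip to Hiroshima",
     datetime(2014, 9, 27, 9, 0), datetime(2014, 9, 27, 21, 0)),
    ("Itsukushima Shrine is amazing.",
     datetime(2014, 9, 27, 15, 30), datetime(2014, 9, 27, 15, 30)),
    ("Dentist",
     datetime(2014, 9, 29, 10, 0), datetime(2014, 9, 29, 11, 0)),
]


def select_user_info(records, input_time, include_same_day=False):
    """Return the item texts that form the user information I_u."""
    selected = []
    for item, start, end in records:
        # Basic rule: the record's time range includes the voice input time.
        covers = start <= input_time <= end
        # Optional rule: the record falls on the same date as the input.
        same_day = (include_same_day
                    and start.date() <= input_time.date() <= end.date())
        if covers or same_day:
            selected.append(item)
    return selected


input_time = datetime(2014, 9, 27, 18, 40)
strict = select_user_info(RECORDS, input_time)
relaxed = select_user_info(RECORDS, input_time, include_same_day=True)
```

With these sample rows, `strict` contains only the trip entry, while `relaxed` also picks up the comment posted earlier the same day.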

The recognition result receiving unit 15 receives from the server 3 the conversion candidates that are the speech recognition result for the voice information and has the display unit 16 display them.

The display unit 16 displays the conversion candidates for the input speech at a predetermined position.

The server 3 includes an information receiving unit 31, a voice recognition unit 32, a reranking execution unit 33, and a context information DB 34.

The information receiving unit 31 receives the voice information and the user information from the client terminal 1 and sends them to the voice recognition unit 32.

The voice recognition unit 32 performs speech recognition on the received voice information to obtain conversion candidates and sends the candidates and the user information to the reranking execution unit 33. Each conversion candidate is given a score indicating its validity. Speech recognition is performed with a well-known speech recognition technique.

The voice recognition unit 32 also determines the tense of the voice information. When the determined tense refers to a specific time in the future or a specific time in the past, it acquires the user information corresponding to that time from the user information DB 12 of the client terminal 1 and sends it to the reranking execution unit 33. For example, when the voice information contains an expression that points to a specific future or past time, such as 「明日」 (tomorrow) or 「昨日」 (yesterday), the user information corresponding to the date and time indicated by that expression is acquired from the user information DB 12 and sent to the reranking execution unit 33.
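One simple way to picture the tense determination is a lookup of relative-day expressions in the recognized text. The sketch below is an assumption-laden illustration: the source does not say how the tense is determined, and the expression table, the function name, and the "present" fallback label are all hypothetical.

```python
from datetime import date, timedelta

# Hypothetical table of relative-day expressions and their day offsets.
RELATIVE_DAYS = {
    "明日": 1,      # tomorrow
    "昨日": -1,     # yesterday
    "明後日": 2,    # the day after tomorrow
    "一昨日": -2,   # the day before yesterday
}


def resolve_referenced_date(text, today):
    """Return (tense, date) for the first matching expression in the text.

    Longer expressions are tried first so that "一昨日" is not mistaken
    for the "昨日" it contains.
    """
    for expr in sorted(RELATIVE_DAYS, key=len, reverse=True):
        if expr in text:
            offset = RELATIVE_DAYS[expr]
            tense = "future" if offset > 0 else "past"
            return tense, today + timedelta(days=offset)
    # No specific future or past time: fall back to the voice input time.
    return "present", None


today = date(2014, 9, 26)
result = resolve_referenced_date("明日おいしいかきをたべたい", today)
```

The returned date would then be used in place of the voice input time when selecting records from the user information DB 12. A real system would more likely derive the tense from morphological analysis than from substring matching.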

The reranking execution unit 33 refers to the information on co-occurrence relationships between word pairs stored in the context information DB 34 and recalculates the scores of the conversion candidates based on the co-occurrence relationships between the words contained in the conversion candidates and the words contained in the user information. The specific processing of the reranking execution unit 33 is described later.

The context information DB 34 holds information on the co-occurrence relationship between pairs of words. FIG. 3 shows an example of the data held in the context information DB 34. In the example of the figure, it holds records consisting of a word 1 column, a word 2 column, and a co-occurrence frequency column. For example, the co-occurrence frequency between two words is calculated using a known co-occurrence frequency calculation program based on N-grams (reference URL: http://oscar.gsid.nagoya-u.ac.jp/project/elc/genkou/ngrampaper2/node8.html) or a word co-occurrence frequency database (reference URL: https://alaginrc.nict.go.jp), and the resulting value is stored in the co-occurrence frequency column.
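To make the table layout concrete, the following sketch builds (word 1, word 2, co-occurrence frequency) rows from a tiny tokenized corpus. It is not the N-gram program or the database cited above: counting pairs that appear in the same sentence is a simplifying assumption, and the corpus and function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def count_cooccurrences(tokenized_sentences):
    """Count how many sentences contain each unordered pair of words."""
    counts = Counter()
    for tokens in tokenized_sentences:
        # Sorting the de-duplicated tokens gives each pair one canonical order.
        for w1, w2 in combinations(sorted(set(tokens)), 2):
            counts[(w1, w2)] += 1
    return counts


def to_rows(counts):
    """Flatten the counter into (word1, word2, frequency) rows."""
    return [(w1, w2, n) for (w1, w2), n in sorted(counts.items())]


# Hypothetical corpus, already split into words.
corpus = [
    ["Hiroshima", "oyster"],
    ["Hiroshima", "oyster", "trip"],
    ["Hiroshima", "persimmon"],
]
rows = to_rows(count_cooccurrences(corpus))
```

Each row in `rows` corresponds to one record of the context information DB 34.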

Each unit of the client terminal 1 and the server 3 may be implemented on a computer equipped with a processing unit, a storage device, and the like, with the processing of each unit executed by a program. The program is stored in the storage device of the client terminal 1 or the server 3, and it can be recorded on a recording medium such as a magnetic disk, an optical disk, or a semiconductor memory, or provided over a network. Although the functions are divided between the client terminal 1 and the server 3 here, the system may also be implemented as a single apparatus.

Next, the operation of the speech recognition system in the present embodiment is described.

FIG. 4 is a flowchart showing the flow of processing of the speech recognition system in the present embodiment. It is assumed that the user information storage unit 11 collects user information from the linked services as needed and stores it in the user information DB 12.

When speech is input to the voice input unit 13, the information transmission unit 14 acquires the user information corresponding to the voice input time from the user information DB 12 and transmits the voice information and the user information to the server 3 (step S11).

The information receiving unit 31 sends the voice information and the user information received from the client terminal 1 to the voice recognition unit 32, and the voice recognition unit 32 performs speech recognition on the voice information to obtain conversion candidates (step S12).

The voice recognition unit 32 determines the tense from the result of the speech recognition (step S13). When the determined tense refers to a specific time in the future or the past (YES in step S14), it acquires the user information corresponding to that time from the user information DB 12 (step S15).

The reranking execution unit 33 searches the context information DB 34 for co-occurrence relationships between the words contained in the user information and the words contained in the conversion candidates, recalculates the scores of the conversion candidates based on those relationships, and reranks the candidates according to the recalculated scores (step S16). The client terminal 1 presents the reranked conversion candidates to the user in descending order of score.

Next, the flow of the conversion candidate reranking process is described.

FIG. 5 is a flowchart showing the flow of processing of the reranking execution unit 33 in the present embodiment. FIG. 6 is a diagram illustrating the processing of FIG. 5 with a concrete example. On receiving the conversion candidates R_1 and the user information I_u from the voice recognition unit 32, the reranking execution unit 33 executes the following processing. As shown in FIG. 6, each of the conversion candidates R_1 produced by the speech recognition of the voice recognition unit 32 has a score attached.

The reranking execution unit 33 performs morphological analysis on the user information I_u and extracts the nouns N (step S21). For example, when the user information is I_u = {Family trip to Hiroshima, Itsukushima Shrine is amazing}, N = {family, Hiroshima, trip, Itsukushima, shrine} is extracted. For the morphological analysis, MeCab, a known morphological analysis engine, can be used, for example (reference URL: https://code.google.com/p/mecab/).

For each noun in N extracted in step S21, the reranking execution unit 33 searches the context information DB 34 for records in which that noun is stored in the word 1 column or the word 2 column, and obtains the set S of word co-occurrence frequencies (step S22). For example, when N = {family, Hiroshima, trip, Itsukushima, shrine} and the context information DB 34 holds the data shown in FIG. 3, the reranking execution unit 33 obtains the set S = {[Hiroshima, persimmon (柿), 2], [Hiroshima, oyster (牡蠣), 5]}, as shown in FIG. 6.
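Step S22 amounts to a filter over the context table. The sketch below illustrates it with an in-memory list; the first two rows follow the example in the text, while the third row and the function name are hypothetical additions that show non-matching records being left out.

```python
# Rows are (word1, word2, co-occurrence frequency).
CONTEXT_DB = [
    ("Hiroshima", "persimmon", 2),
    ("Hiroshima", "oyster", 5),
    ("Aomori", "apple", 7),   # hypothetical row unrelated to the nouns below
]


def lookup_cooccurrence_set(nouns, context_db):
    """Return the rows whose word1 or word2 is one of the given nouns (set S)."""
    noun_set = set(nouns)
    return [row for row in context_db
            if row[0] in noun_set or row[1] in noun_set]


N = ["family", "Hiroshima", "trip", "Itsukushima", "shrine"]
S = lookup_cooccurrence_set(N, CONTEXT_DB)
```

Here `S` holds the two rows that mention "Hiroshima", matching the example set in the text.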

The reranking execution unit 33 recalculates the score of each of the conversion candidates R_1 (step S23). Specifically, for example, from the set S of word co-occurrence frequencies obtained in step S22, it extracts the subset S' of entries that contain a noun included in the n-th conversion candidate R_1(n). It then calculates the score rescore(R_1(n)) by the following equation (1).

Here, S'(i) denotes the value stored in the co-occurrence frequency column of the i-th element of the set S' of word co-occurrence frequencies, and m denotes the number of elements in S'. In addition, α > 0.

The reranking execution unit 33 sorts the conversion candidates R_1 in descending order of the score rescore(R_1(n)) calculated in step S23 to generate the conversion candidates R_2 (step S24). In the example shown in FIG. 6, the conversion candidates R_1 are reordered into the conversion candidates R_2 and transmitted to the client terminal 1.
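Steps S23 and S24 can be sketched together as below. Equation (1) is not reproduced in this text, so the additive form used here, the original score plus α times the sum of the frequencies in S', is purely an assumption chosen to be consistent with the symbols S'(i), m, and α > 0. The candidate scores, the value of α, and the pre-extracted noun sets are likewise hypothetical.

```python
def rescore(score, candidate_nouns, s, alpha=0.1):
    """Assumed form: score + alpha * sum of S'(i) for i = 1..m.

    This is NOT equation (1) from the source, only a stand-in for it.
    """
    # S': rows of S that contain a noun from this conversion candidate.
    s_prime = [row for row in s
               if row[0] in candidate_nouns or row[1] in candidate_nouns]
    return score + alpha * sum(freq for _, _, freq in s_prime)


def rerank(candidates, s, alpha=0.1):
    """Rescore every candidate and sort by the new score, highest first."""
    rescored = [(text, rescore(score, nouns, s, alpha))
                for text, score, nouns in candidates]
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)


S = [("Hiroshima", "persimmon", 2), ("Hiroshima", "oyster", 5)]

# R_1: (text, recognizer score, nouns in the candidate); scores are invented.
R1 = [
    ("I want to eat delicious persimmons", 0.8, {"persimmon"}),
    ("I want to eat delicious oysters", 0.7, {"oyster"}),
]
R2 = rerank(R1, S)
```

Under these assumed numbers the oyster candidate overtakes the persimmon candidate, because "oyster" co-occurs with "Hiroshima" more often than "persimmon" does. A different form of equation (1), for example one that normalizes by m, could change the magnitudes but would be plugged in at the same place.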

As described above, according to the present embodiment, the user information storage unit 11 acquires user information indicating the user's situation from the services the user uses and stores it in the user information DB 12. When the user's speech is input to the voice input unit 13, the information transmission unit 14 acquires the user information corresponding to the voice input time. The reranking execution unit 33 extracts the words contained in the user information and searches the context information DB 34, which stores information on the co-occurrence relationships between word pairs, for the set S of word co-occurrence frequencies that contain the extracted words. From S, it extracts the subset S' of entries that contain a word included in each of the conversion candidates produced by the voice recognition unit 32, and reorders the conversion candidates based on those co-occurrence frequencies. This makes it possible to present more appropriate speech recognition results that take the user's context into account.

In addition, according to the present embodiment, the voice recognition unit 32 determines the specific time indicated by the user's speech, and the reranking execution unit 33 acquires the user information corresponding to that time. This makes it possible to present more appropriate speech recognition results based on the context that corresponds to the time the user's speech refers to.

DESCRIPTION OF SYMBOLS
1 … Client terminal
11 … User information storage unit
12 … User information DB
13 … Voice input unit
14 … Information transmission unit
15 … Recognition result receiving unit
16 … Display unit
3 … Server
31 … Information receiving unit
32 … Voice recognition unit
33 … Reranking execution unit
34 … Context information DB

Claims

1. A speech recognition apparatus comprising:
user information storage means that stores user information in which a date, or a date and time, is associated with information indicating the status of a user on that date or at that date and time;
co-occurrence frequency storage means that stores co-occurrence frequency information including a plurality of words and the co-occurrence frequency between the words;
speech recognition means that receives the user's speech, performs speech recognition to obtain conversion candidates, and determines the tense indicated by the content of the speech input by the user; and
conversion candidate sorting means that acquires, from the user information storage means, the user information corresponding to the tense indicated by the speech input content, extracts the words included in the user information, extracts the words included in the conversion candidates, and rearranges the conversion candidates based on the co-occurrence frequency of the co-occurrence frequency information that includes the words extracted from each.

2. The speech recognition apparatus according to claim 1, wherein the user information excludes speech recognition results.

3. The speech recognition apparatus according to claim 1, wherein the user information is information registered or updated by the user in a service.

4. A speech recognition method executed by a computer, comprising the steps of:
receiving a user's speech, performing speech recognition to obtain conversion candidates, and determining the tense indicated by the content of the speech input by the user;
acquiring the user information corresponding to the tense indicated by the speech input content from user information storage means that stores user information in which a date, or a date and time, is associated with information indicating the status of the user on that date or at that date and time, and extracting the words included in the user information;
extracting the words included in the conversion candidates; and
acquiring, from co-occurrence frequency storage means that stores co-occurrence frequency information including a plurality of words and the co-occurrence frequency between the words, the co-occurrence frequency information that includes the words extracted from each of the user information and the conversion candidates, and rearranging the conversion candidates based on the co-occurrence frequency of that co-occurrence frequency information.

5. The speech recognition method according to claim 4, wherein the user information excludes speech recognition results.

6. The speech recognition method according to claim 4, wherein the user information is information registered or updated by the user in a service.

7. A speech recognition program for causing a computer to operate as each means of the speech recognition apparatus according to claim 1.