JP7058574B2

JP7058574B2 - Information processing equipment, information processing methods, and programs

Info

Publication number: JP7058574B2
Application number: JP2018168724A
Authority: JP
Inventors: 賢昭佐藤; 純平三宅
Original assignee: Yahoo Japan Corp
Current assignee: Yahoo Japan Corp
Priority date: 2018-09-10
Filing date: 2018-09-10
Publication date: 2022-04-22
Anticipated expiration: 2038-09-10
Also published as: JP2020042131A

Description

The present invention relates to an information processing apparatus, an information processing method, and a program.

A technique for performing speech recognition using a latent word language model is known (see Patent Document 1). A latent word language model is a model that takes into account the latent words for each word in the training text (corpus).

Japanese Patent No. 5975938

However, the conventional technique must compute a probability for each of the tens of thousands of latent word candidates for every word in the corpus. When the vocabulary is large, for example, the processing load becomes high and it may take time to output the speech recognition result.

The present invention has been made in view of these circumstances, and one of its objects is to provide an information processing apparatus, an information processing method, and a program capable of performing speech recognition processing more efficiently.

One aspect of the present invention is an information processing apparatus including: an acquisition unit that acquires voice data; an analysis unit that analyzes the voice data, converts it into text, and outputs one or more analysis results; a vector conversion unit that converts each of a plurality of words included in the input text of the analysis results into a vector value in a distributed representation; and a selection unit that selects, from the one or more analysis results, the analysis result that is likely to reflect the input intention of the user who uttered the voice in the voice data, based on the vector values converted by the vector conversion unit and on previously obtained vector values of known input text whose input intention is known.

According to one aspect of the present invention, speech recognition processing can be performed more efficiently.

A diagram showing an example of the usage environment of the information processing apparatus 100 according to the embodiment.
A diagram schematically showing the processing of the information processing apparatus 100.
A configuration diagram of the information processing apparatus 100 according to the embodiment.
A diagram for explaining the vector conversion process performed by the W2V execution unit 106.
A diagram for explaining a sentence vector.
A diagram schematically showing suitable candidate selection by the sorting unit 110.
A diagram for explaining the task text.
A diagram for explaining the reliability derivation process performed by the reliability derivation unit 110a.
A diagram schematically showing the task text vector list 120g.
A diagram for explaining a representative vector.
A diagram for explaining a similarity evaluation method.
A diagram schematically showing cluster selection by the language model calculation unit 112.
A flowchart showing an example of the flow of the language model generation process performed by the information processing apparatus 100.
A flowchart showing an example of the flow of the speech recognition process performed by the information processing apparatus 100.

Embodiments of the information processing apparatus, information processing method, and program of the present invention are described below with reference to the drawings.

[Overview]
The information processing apparatus is realized by one or more processors. It receives voice data in which a user's utterance is recorded, performs speech recognition processing on the received data, and performs various kinds of processing based on the recognition result. These include controlling IoT (Internet of Things) devices in line with the intention of the user who spoke and responding to the user's questions. Hereinafter, the operation of the information processing apparatus that the user intends may be referred to as a task. The voice data may have undergone processing such as compression or encryption.

FIG. 1 is a diagram showing an example of the usage environment of the information processing apparatus 100 according to the embodiment. In the illustrated environment, the terminal device 20, the controlled devices 30, and the service server 40 communicate with one another via the network NW. The network NW includes, for example, some or all of a WAN (Wide Area Network), a LAN (Local Area Network), the Internet, a provider device, a wireless base station, a dedicated line, and the like. In the example shown in FIG. 1, there are N controlled devices 30 (N is an integer of 1 or more). In this specification, when the individual controlled devices 30-1 to 30-N need not be distinguished, for example when describing matters common to them, they are simply referred to as the controlled device 30.

The terminal device 20 is a device that accepts the user's voice input. The terminal device 20 is, for example, a mobile phone such as a smartphone, a tablet terminal, a personal computer, or a smart speaker (AI speaker).

The controlled device 30 is an IoT device that has a communication function and an interface for accepting external control, and that can be controlled in response to a request from the terminal device 20 operated by the user. The controlled device 30 is, for example, a television, a radio, a lighting fixture, a refrigerator, a microwave oven, a washing machine, a rice cooker, a self-propelled vacuum cleaner, or an air conditioner.

The service server 40 is, for example, a web server device that provides a web page in response to a request from the terminal device 20 operated by the user, or an application server device that communicates with a terminal device 20 on which an application has been launched and exchanges various information with it to provide content.

FIG. 2 is a diagram schematically showing the processing of the information processing apparatus 100. The information processing apparatus 100 applies the voice data that the user has input via the terminal device 20 to an acoustic model to convert it into phonemes, and generates one or more candidate texts (textual renderings of the sounds contained in the voice data) based on the phonemes. It then selects a suitable candidate by applying to a language model those candidate texts that were selected based on a comparison with known task features. A suitable candidate is a candidate text judged likely to reflect the user's intention.

The acoustic model is a model that statistically analyzes frequency components and changes over time to determine which phonemes make up the input voice data (that is, what is being said). A phoneme is a label for identifying the smallest unit of a language, such as a letter of the alphabet or a kana character, and includes, for example, vowels and consonants. The information processing apparatus 100 obtains candidate texts by combining phonemes as appropriate according to language rules.

As shown in FIG. 2, when a candidate text generated as a result of phoneme conversion is "kyonotenki", "k" and "t", for example, are phonemes contained in that candidate text. When speech recognition is performed on the premise of Japanese, candidate texts may be written in alphabetic, hiragana, or katakana notation. In the example shown in FIG. 2, the information processing apparatus 100 generates candidate texts including "kyonotenki", "kyonotenkii", and "kyonodenki" based on the received voice data.

In the example shown in FIG. 2, the information processing apparatus 100 performs morphological analysis on each of the conversion candidates, including "kyonotenki", "kyonotenkii", and "kyonodenki". Morphological analysis is a process that determines the word boundaries of a candidate text and derives, for example, the part of speech of each resulting word. It is performed using a morphological analysis engine such as MeCab. For example, analyzing the candidate text "kyonotenki" yields three words: 今日 (kyo, "today"), の (no, a particle), and 天気 (tenki, "weather"). Similarly, analyzing "kyonotenkii" yields 今日 (kyo, "today"), の (no), and テンキー (tenkii, "numeric keypad"), and analyzing "kyonodenki" yields 京 (kyo, "capital"), の (no), and 電気 (denki, "electricity").

The information processing apparatus 100 evaluates the analysis results generated from each of the one or more candidate texts and selects one candidate text based on the evaluation values. More specifically, the information processing apparatus 100 evaluates the matching rate between the analysis result of each candidate text and the features obtained from known task speech, and selects the suitable candidate presumed to match the user's intention. The information processing apparatus 100 then outputs an instruction for a task that generates output information corresponding to that intention. The matching rate is described later.

[Configuration]
FIG. 3 is a configuration diagram of the information processing apparatus 100. The information processing apparatus 100 includes, for example, an acquisition unit 102, an analysis unit 104, a W2V (Word2Vec) execution unit 106, a text vector generation unit 108, a sorting unit 110, a language model calculation unit 112, a selection unit 114, an output information generation unit 116, an output unit 118, and a storage unit 120. These components are realized, for example, by a hardware processor such as a CPU (Central Processing Unit) executing a program (software). Some or all of these components may instead be realized by hardware (including circuitry) such as an LSI (Large Scale Integration), an ASIC (Application Specific Integrated Circuit), an FPGA (Field-Programmable Gate Array), or a GPU (Graphics Processing Unit), or by software and hardware working together. The program may be stored in advance in a storage device of the information processing apparatus 100 such as an HDD or flash memory (a storage device with a non-transitory storage medium), or it may be stored on a removable storage medium such as a DVD or CD-ROM (a non-transitory storage medium) and installed on the HDD or flash memory of the information processing apparatus 100 when the medium is loaded into a drive device.

The storage unit 120 is realized by, for example, a RAM (Random Access Memory), a register, a flash memory, or an EEPROM (Electrically Erasable Programmable Read Only Memory). The storage unit 120 stores information such as an acoustic model 120a, a language model 120b, a corpus analysis result 120c, a task text analysis result 120d, a candidate text analysis result 120e, a word vector list 120f, a task text vector list 120g, and language model calculation text 120h.

The acquisition unit 102 acquires character information that the information processing apparatus 100 uses as a corpus for speech recognition processing (for example, article data such as news, or posts on an SNS (Social Networking Service)) and outputs it to the analysis unit 104. The character information used as the corpus should preferably include colloquial text (for example, SNS posting history, conversation history between a user and an automatic response device, transcripts of real conversations, or the apparatus's own speech processing history). The acquisition unit 102 also acquires a data set of character information indicating routine tasks set by the administrator of the information processing apparatus 100 (hereinafter, task text) and outputs it to the analysis unit 104.

The acquisition unit 102 also acquires the voice data input by the user of the terminal device 20 and outputs it to the analysis unit 104. When the user's position information is attached to the voice data acquired by the acquisition unit 102, that position information is managed together with the candidate text.

The analysis unit 104 analyzes the character information output by the acquisition unit 102 for use as the corpus, using a predetermined analysis method. The predetermined analysis method is, for example, morphological analysis, which breaks character information down into part-of-speech units such as nouns, verbs, and particles. The analysis unit 104 stores the result in the storage unit 120 as the corpus analysis result 120c. The analysis unit 104 also analyzes the task text output by the acquisition unit 102 and stores the result in the storage unit 120 as the task text analysis result 120d.

The analysis unit 104 also applies the voice data output by the acquisition unit 102 to the acoustic model 120a to generate one or more candidate texts, and then performs analysis such as morphological analysis on each candidate text. The analysis unit 104 stores the result in the storage unit 120 as the candidate text analysis result 120e.

FIG. 4 is a diagram for explaining the vector conversion process performed by the W2V execution unit 106. The W2V execution unit 106 generates word vectors by, for example, expressing the meaning of each word in the corpus analysis result 120c as a vector (a distributed representation). In the example of FIG. 4, the W2V execution unit 106 generates the word vector for "volume" (ボリューム). The W2V execution unit 106 generates word vectors so that words with similar meanings, such as "sound" (音) and "volume" (ボリューム), or the loanword ミュージック (myujikku, "music") and the native word 音楽 (ongaku, "music"), are close to each other in terms of the distance between their word vectors (cosine similarity). The W2V execution unit 106 stores the generated vector values in the storage unit 120 as the word vector list 120f. The W2V execution unit 106 is an example of the "vector conversion unit".
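The closeness of word vectors described above can be illustrated with a minimal sketch. The three-dimensional vectors and the dictionary name word_vectors below are hypothetical values chosen for illustration; they are not taken from the embodiment, and real Word2Vec vectors typically have far more dimensions.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_u * norm_v)


# Hypothetical word vectors (illustrative values only).
word_vectors = {
    "sound": [0.9, 0.1, 0.0],
    "volume": [0.8, 0.2, 0.1],
    "today": [0.0, 0.1, 0.9],
}

near = cosine_similarity(word_vectors["sound"], word_vectors["volume"])
far = cosine_similarity(word_vectors["sound"], word_vectors["today"])
print(near > far)  # words with similar meanings score higher
```

With these illustrative values, "sound" and "volume" point in nearly the same direction, while "sound" and "today" are nearly orthogonal.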

When the task text or a candidate text contains a word that is not stored in the word vector list 120f, the W2V execution unit 106 may, for example, add the task text analysis result 120d or the candidate text analysis result 120e to the corpus, analyze it in the same way, and generate vector values for those words. These vector values may or may not be stored in the word vector list 120f.

Returning to FIG. 3, the text vector generation unit 108 uses the task text analysis result 120d, the candidate text analysis result 120e, and the vector values in the word vector list 120f to generate a sentence-level vector value for each candidate text (hereinafter, a sentence vector). The text vector generation unit 108 outputs the generated sentence vectors to the sorting unit 110.

FIG. 5 is a diagram for explaining a sentence vector. For example, to generate the sentence vector for "turn the volume down" (ボリュームを下げて), the text vector generation unit 108 performs a predetermined operation on the word vectors of ボリューム ("volume"), を (a particle), and 下げて ("turn down"), for example adding the word vectors together. As a result, the sentence vector obtained by summing the word vectors of a sentence's words behaves in the same way as the word vectors: sentence vectors of sentences with similar meanings, such as "make the music quieter" (音楽の音を小さくして) and "turn the volume down" (ボリュームを下げて), are close to each other.
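The additive construction of a sentence vector can be sketched as follows. The two-dimensional vectors are hypothetical values, and skipping words that have no entry in the list is an assumption of this sketch rather than behavior stated in the embodiment.

```python
def sentence_vector(words, word_vectors):
    """Sum the word vectors of the given words, element by element."""
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    for word in words:
        vec = word_vectors.get(word)
        if vec is None:
            continue  # assumption of this sketch: ignore words without a vector
        total = [t + x for t, x in zip(total, vec)]
    return total


# Hypothetical word vectors for the three words of the example sentence.
word_vectors = {
    "ボリューム": [0.5, 0.25],
    "を": [0.0, 0.25],
    "下げて": [0.25, 0.5],
}

vec = sentence_vector(["ボリューム", "を", "下げて"], word_vectors)
print(vec)  # [0.75, 1.0]
```

Summation is only one possible "predetermined operation"; averaging the word vectors would give the same direction and therefore the same cosine similarity.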

The text vector generation unit 108 also generates sentence vectors for the task text using the task text analysis result 120d and the word vectors output by the W2V execution unit 106, and stores them in the storage unit 120 as the task text vector list 120g. The task text is text known to contain a user's intention and is set in advance by, for example, the administrator of the information processing apparatus 100.

Returning to FIG. 3, the sorting unit 110 selects the sentence vectors on which the language model 120b is to be based, using the sentence vectors of the candidate texts, the sentence vectors of the task text, and the sentence vectors of the language model calculation text 120h. The sorting unit 110 outputs the sorting result to the language model calculation unit 112. The language model calculation text 120h is, for example, the sentence vectors of task text anticipated by the administrator of the information processing apparatus 100, or sentence vectors retained as the past speech recognition processing history of the information processing apparatus 100.

The sorting unit 110 includes, for example, a reliability derivation unit 110a. The priority derivation process performed by the reliability derivation unit 110a is described later.

The language model calculation unit 112 includes, for example, a language model generation unit 112a. The language model generation unit 112a generates a language model to which the sorting result output by the sorting unit 110 is applied, and stores it in the storage unit 120 as a per-corpus language model 120b. The language model generation unit 112a generates the language model 120b based on, for example, the language model calculation text 120h set in advance by the administrator of the information processing apparatus 100 and the conversion candidates selected by the sorting unit 110.

The language model calculation unit 112 also applies the candidate texts output by the sorting unit 110 to the language model 120b and outputs the result to the selection unit 114.

The selection unit 114 evaluates the candidate texts output by the language model calculation unit 112 based on their evaluation values, and selects the suitable candidate that is likely to reflect the user's input intention. The selection unit 114 outputs the selected suitable candidate to the output information generation unit 116.

When position information is attached to a candidate text, the selection unit 114 may estimate the user's input environment from that position information, evaluate whether the candidate text contains the user's intention to execute a task, and select a candidate text based on the evaluation result. For example, when the position information of the candidate text suggests that the user is at home, the selection unit 114 may change the likelihood that the corresponding task is selected by setting a high matching rate for tasks related to controlled devices 30 used at home and, at the same time, a low matching rate for tasks related to controlled devices 30 used at the workplace.
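The position-based adjustment described above might be sketched as follows. The function name adjust_matching_rates, the task labels, and the boost and penalty factors are all hypothetical; the embodiment does not specify how strongly the matching rate is raised or lowered.

```python
def adjust_matching_rates(rates, task_places, user_place, boost=1.2, penalty=0.8):
    """Scale each task's matching rate by where the user is estimated to be.

    boost and penalty are illustrative factors, not values from the embodiment.
    """
    adjusted = {}
    for task, rate in rates.items():
        place = task_places.get(task)
        if place is None:
            factor = 1.0  # task not tied to a place: leave unchanged
        elif place == user_place:
            factor = boost
        else:
            factor = penalty
        adjusted[task] = rate * factor
    return adjusted


# Hypothetical tasks and the places where their controlled devices are used.
rates = {"home_light_on": 0.5, "office_printer_on": 0.5, "weather_query": 0.5}
task_places = {"home_light_on": "home", "office_printer_on": "office"}

adjusted = adjust_matching_rates(rates, task_places, user_place="home")
print(adjusted["home_light_on"] > adjusted["office_printer_on"])
```

Multiplicative factors are one simple choice; an additive offset or a learned weighting would serve the same purpose.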

FIG. 6 is a diagram schematically showing suitable candidate selection by the selection unit 114. The language model is a model for generating a suitable candidate from the candidate texts. The sorting unit 110, for example, gives a higher evaluation value to a candidate whose sentence vector is more similar to the sentence vector of a task text, and, using the language model, gives a higher evaluation value to a candidate whose word sequence scores higher. A suitable candidate is selected by evaluating these evaluation values together. The language model may also take the user's surrounding environment into account in its evaluation.
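One way to picture this combined evaluation is the sketch below. The weighted sum, the weights w_task and w_lm, the candidate vectors, and the language model scores are all assumptions made for illustration; the embodiment states only that the two evaluation values are evaluated together.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def total_score(candidate_vec, task_vecs, lm_score, w_task=0.5, w_lm=0.5):
    """Combine closeness to the nearest task text with a language model score."""
    task_score = max(
        (cosine_similarity(candidate_vec, t) for t in task_vecs), default=0.0
    )
    return w_task * task_score + w_lm * lm_score


def select_suitable_candidate(candidates, task_vecs):
    """Return the text of the candidate with the highest combined score."""
    best = max(
        candidates,
        key=lambda c: total_score(c["vector"], task_vecs, c["lm_score"]),
    )
    return best["text"]


# Hypothetical sentence vectors and language model scores.
task_vecs = [[0.1, 0.9]]
candidates = [
    {"text": "kyonotenki", "vector": [0.2, 0.8], "lm_score": 0.7},
    {"text": "kyonotenkii", "vector": [0.7, 0.3], "lm_score": 0.4},
    {"text": "kyonodenki", "vector": [0.9, 0.1], "lm_score": 0.5},
]

best = select_suitable_candidate(candidates, task_vecs)
print(best)  # kyonotenki
```

In this illustration "kyonotenki" wins on both criteria; with other weights or scores the two criteria could disagree, which is why they are combined rather than applied separately.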

Returning to FIG. 3, the output information generation unit 116 generates the output information the user intends based on the suitable candidate output by the selection unit 114, and outputs it to the output unit 118. The output information includes information identifying the destination device, a processing request for the destination device, and the like.

For example, when the suitable candidate is "tell me today's weather", the output information generation unit 116 sends a request to the weather forecast website provided by the service server 40 and uses, as the output information to be sent to the terminal device 20, information that includes part or all of the response to that request. When the suitable candidate is "turn down the music volume", the output information generation unit 116 identifies the controlled device 30 that is playing music and outputs an instruction to lower the volume. When generating output information whose destination is a controlled device 30, the output information generation unit 116 may also generate output information that notifies the terminal device 20 that output information has been output to the controlled device 30.

The output unit 118 outputs the output information output by the output information generation unit 116 to the terminal device 20 or the controlled device 30.

[Task text]
The task text is described below. The administrator of the information processing apparatus 100 extracts the task text that the selection unit 114 uses as its evaluation criterion based on, for example, the past voice input history of the terminal device 20 and the processing history of the information processing apparatus 100.

FIG. 7 is a diagram for explaining the task text. The left part of FIG. 7 shows speech recognition results R1 to R7 from the past voice input history of the terminal device 20. The speech recognition results include ones that reflect the input intention of the user of the terminal device 20 and ones that were recognized even though the user had no intention of input. For example, when the administrator of the information processing apparatus 100 judges that recognition result R4 is text close to a task, the administrator sets its priority high, as shown in the upper right part of FIG. 7. Being close to a task means that the result contains text that is likely to reflect the user's input intention. When the administrator judges that recognition result R6 is text far from a task, the administrator sets its priority low, as shown in the lower right part of FIG. 7.

The administrator of the information processing apparatus 100 likewise judges R1, R2, R3, R5, and R7 to be text far from a task and sets their priority low. The priority of a task text is registered in the task text vector list 120g together with, for example, the sentence vector value of that task text.
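An entry of the task text vector list 120g, which pairs a sentence vector with a priority, could be represented roughly as below. The class name TaskTextEntry, the field names, the numeric priorities, the vectors, and the threshold are hypothetical; the embodiment does not disclose the actual texts of R1 to R7 or the scale used for priority.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TaskTextEntry:
    """One hypothetical row of a task text vector list."""

    label: str                    # identifier of the recognition result, e.g. "R4"
    sentence_vector: List[float]  # sentence vector of the task text
    priority: float               # set by the administrator; higher means closer to a task


# Illustrative entries: R4 judged close to a task, R6 judged far from one.
task_text_vector_list = [
    TaskTextEntry("R4", [0.1, 0.9], 1.0),
    TaskTextEntry("R6", [0.8, 0.2], 0.1),
]


def high_priority_entries(entries, threshold=0.5):
    """Return the entries whose priority is at or above the threshold."""
    return [entry for entry in entries if entry.priority >= threshold]


print([entry.label for entry in high_priority_entries(task_text_vector_list)])  # ['R4']
```

Keeping the priority next to the vector lets a later similarity comparison weight or filter task texts without a second lookup.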

[Language model generation process flow]
The process by which the information processing apparatus 100 generates the language model 120b is described below. The information processing apparatus 100 generates a language model 120b for, for example, each type of corpus. The administrator of the information processing apparatus 100 may also change or update the language model calculation text 120h periodically, in which case the language model is regenerated at that time, for example.

FIG. 8 is a flowchart showing an example of the flow of the process by which the information processing apparatus 100 generates the language model 120b.

まず、取得部１０２は、コーパスとして利用する文字情報を取得する（Ｓ１００）。次に、解析部１０４は、コーパスとして利用する文字情報を解析し、解析結果をコーパスの解析結果１２０ｃとして記憶部１２０に記憶させる（Ｓ１０２）。次に、Ｗ２Ｖ実行部１０６は、コーパスの解析結果１２０ｃに含まれる単語のベクトル値を生成し、単語ベクトルリスト１２０ｆとして記憶部１２０に記憶させる（Ｓ１０４）。 First, the acquisition unit 102 acquires character information to be used as a corpus (S100). Next, the analysis unit 104 analyzes the character information used as the corpus, and stores the analysis result in the storage unit 120 as the analysis result 120c of the corpus (S102). Next, the W2V execution unit 106 generates a vector value of the word included in the analysis result 120c of the corpus and stores it in the storage unit 120 as the word vector list 120f (S104).

次に、取得部１０２は、タスクテキストを取得する（Ｓ１０６）。次に、解析部１０４は、タスクテキストを解析し、解析結果をタスクテキストの解析結果１２０ｄとして記憶部１２０に記憶させる（Ｓ１０８）。 Next, the acquisition unit 102 acquires the task text (S106). Next, the analysis unit 104 analyzes the task text and stores the analysis result in the storage unit 120 as the analysis result 120d of the task text (S108).

次に、取得部１０２は、候補テキストを取得する（Ｓ１１０）。次に、解析部１０４は、候補テキストを解析し、解析結果を候補テキストの解析結果１２０ｅとして記憶部１２０に記憶させる（Ｓ１１２）。 Next, the acquisition unit 102 acquires the candidate text (S110). Next, the analysis unit 104 analyzes the candidate text and stores the analysis result in the storage unit 120 as the analysis result 120e of the candidate text (S112).

次に、テキストベクトル生成部１０８は、タスクテキストの解析結果１２０ｄと単語ベクトルリスト１２０ｆを参照して、タスクテキストの文ベクトルを生成し、タスクテキストベクトルリスト１２０ｇとして記憶部１２０に記憶させる（Ｓ１１４）。 Next, the text vector generation unit 108 generates a sentence vector of the task text by referring to the analysis result 120d of the task text and the word vector list 120f, and stores it in the storage unit 120 as the task text vector list 120g (S114).
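The passage does not state how a sentence vector is derived from the word vectors. The following is a minimal sketch that assumes a simple element-wise average of the word vectors; the function name `sentence_vector`, the 2-dimensional vectors, and the sample words are hypothetical and are not taken from the specification.

```python
from typing import Dict, List, Sequence


def sentence_vector(words: Sequence[str],
                    word_vectors: Dict[str, List[float]]) -> List[float]:
    """Average the vectors of the words found in the word vector list.

    Assumption: averaging is only one possible way to build a sentence
    vector; the specification does not fix the method in this passage.
    """
    known = [word_vectors[w] for w in words if w in word_vectors]
    if not known:
        raise ValueError("no word of the sentence has a registered vector")
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


# Hypothetical stand-in for the word vector list 120f.
word_vector_list = {
    "volume": [1.0, 0.0],
    "lower": [0.0, 1.0],
}

# Hypothetical stand-in for one entry of the task text analysis result 120d.
task_text_words = ["lower", "the", "volume"]

# Words without a registered vector ("the") are skipped.
task_vector = sentence_vector(task_text_words, word_vector_list)
```

Skipping unregistered words is a design choice of this sketch; an implementation could instead assign them a default vector.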

次に、選別部１１０は、候補テキストを選別し、言語モデル生成部１１２ａに出力する（Ｓ１１６）。 Next, the sorting unit 110 screens the candidate texts and outputs the result to the language model generation unit 112a (S116).

次に、言語モデル生成部１１２ａは、選別部１１０により出力された候補テキストと、言語モデル演算用テキスト１２０ｈとに基づいて、言語モデル１２０ｂを生成する（Ｓ１２０）。以上、本フローチャートの処理の説明を終了する。 Next, the language model generation unit 112a generates the language model 120b based on the candidate text output by the sorting unit 110 and the language model calculation text 120h (S120). This concludes the description of the processing of this flowchart.

［信頼度］ [Degree of reliability]
以下、信頼度導出部１１０ａの信頼度導出処理についてより具体的に説明する。信頼度とは、音声認識結果の信頼性を評価する度合を０から１．０の間の数値で示すものである。信頼度導出部１１０ａは、例えば、テキストの信頼性が高い場合、すなわち、他の競合候補となるテキストが存在しない場合に信頼度を１．０に設定する。信頼度は、例えば、大語彙連続音声認識エンジンの検索結果として得られる単語の事後確率を用いて導出される。 Hereinafter, the reliability derivation process of the reliability derivation unit 110a will be described more specifically. The reliability is a numerical value between 0 and 1.0 indicating the degree of evaluation of the reliability of the speech recognition result. The reliability derivation unit 110a sets the reliability to 1.0, for example, when the reliability of the text is high, that is, when there is no other text that can be a competitor. The reliability is derived, for example, using the posterior probabilities of words obtained as search results of a large vocabulary continuous speech recognition engine.

図９は、信頼度導出部１１０ａによる信頼度導出処理を説明するための図である。信頼度導出部１１０ａは、例えば、候補テキストＥ１～Ｅ４のそれぞれの信頼度を導出する。選別部１１０は、例えば、信頼度導出部１１０ａが導出した信頼度が閾値（例えば、０．８程度）以上である候補テキストＥ１およびＥ４をタスクテキストとして選択する。なお、選別部１１０は、複数のタスクテキストが選択可能である場合、信頼度の高いタスクテキストを優先的に選択してもよい。 FIG. 9 is a diagram for explaining the reliability derivation process by the reliability derivation unit 110a. The reliability derivation unit 110a derives, for example, the reliability of each of the candidate texts E1 to E4. The sorting unit 110 selects, for example, the candidate texts E1 and E4 whose reliability derived by the reliability deriving unit 110a is equal to or higher than a threshold value (for example, about 0.8) as task texts. When a plurality of task texts can be selected, the sorting unit 110 may preferentially select a highly reliable task text.
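The threshold-based selection described above can be sketched as follows. This is an illustration only: the reliability values for E1 to E4 are made up (the values in FIG. 9 are not reproduced in this text), and the function name `select_by_reliability` is hypothetical. The threshold of 0.8 follows the example in the description.

```python
from typing import Dict, List


def select_by_reliability(reliabilities: Dict[str, float],
                          threshold: float = 0.8) -> List[str]:
    """Keep texts whose reliability is at or above the threshold,
    ordered from most to least reliable."""
    selected = [text for text, r in reliabilities.items() if r >= threshold]
    return sorted(selected, key=lambda text: reliabilities[text], reverse=True)


# Illustrative reliabilities only; not the values shown in FIG. 9.
candidates = {"E1": 0.92, "E2": 0.41, "E3": 0.67, "E4": 0.85}

chosen = select_by_reliability(candidates)
```

Sorting in descending order reflects the option of preferring the more reliable text when several texts pass the threshold.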

［ベクトルリストのクラスタリング］ [Vector list clustering]
図１０は、タスクテキストベクトルリスト１２０ｇを模式的に示す図である。タスクテキストベクトルリスト１２０ｇは、例えば、１０個程度のクラスタ構造をとる。類似するタスクテキストをクラスタとして取りまとめる。クラスタは、例えば、ｋ平均法（k-means clustering）等により構成される。 FIG. 10 is a diagram schematically showing the task text vector list 120g. The task text vector list 120g is organized, for example, into about 10 clusters. Similar task texts are grouped together as a cluster. The clusters are formed by, for example, k-means clustering or the like.

また、タスクテキストベクトルリスト１２０ｇは、クラスタ毎に代表ベクトルを導出しておくことで、被検索効率を高めることができる。代表ベクトルとは、例えば、クラスタを構成するタスクテキストの文ベクトルの平均でもよいし、タスクテキストの優先度と文ベクトルによる加重平均であってもよい。 Further, deriving a representative vector for each cluster of the task text vector list 120g in advance makes the list more efficient to search. The representative vector may be, for example, the average of the sentence vectors of the task texts constituting the cluster, or an average of those sentence vectors weighted by the priority of each task text.
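The two options for the representative vector can be sketched with one function, shown below. The function name `representative_vector`, the numeric form of the priorities, and the sample vectors are assumptions made for illustration; the specification does not state how priorities are encoded.

```python
from typing import List, Optional, Sequence


def representative_vector(vectors: Sequence[Sequence[float]],
                          priorities: Optional[Sequence[float]] = None
                          ) -> List[float]:
    """Return the plain mean of the sentence vectors, or the mean weighted
    by priority when priorities are given.

    Assumption: priorities are positive numbers used directly as weights.
    """
    if not vectors:
        raise ValueError("a cluster needs at least one sentence vector")
    weights = list(priorities) if priorities is not None else [1.0] * len(vectors)
    if len(weights) != len(vectors):
        raise ValueError("one priority per sentence vector is required")
    total = sum(weights)
    if total <= 0:
        raise ValueError("priorities must sum to a positive value")
    dim = len(vectors[0])
    return [sum(w * v[i] for w, v in zip(weights, vectors)) / total
            for i in range(dim)]


# Hypothetical cluster with two task text sentence vectors.
cluster = [[1.0, 0.0], [0.0, 1.0]]

plain_mean = representative_vector(cluster)                     # unweighted
weighted_mean = representative_vector(cluster, priorities=[3.0, 1.0])
```

With the weighted form, a high-priority task text pulls the representative vector toward itself, so candidates resembling that text are more likely to match the cluster.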

図１１は、代表ベクトルを説明するための図である。選別部１１０は、タスクテキストを選択する際に、まず代表ベクトルと、候補テキストの文ベクトルとを比較してクラスタを選択し、次に選択したクラスタの中から、好適なタスクテキストを選択する。 FIG. 11 is a diagram for explaining a representative vector. When selecting a task text, the sorting unit 110 first compares the representative vector with the sentence vector of the candidate text to select a cluster, and then selects a suitable task text from the selected clusters.

［テキストの類似評価］ [Similarity evaluation of text]
以下、テキストの類似評価方法について説明する。図１２は、類似評価方法について説明するための図である。 Hereinafter, a method for evaluating the similarity of texts will be described. FIG. 12 is a diagram for explaining a similarity evaluation method.

言語モデル演算部１１２は、例えば、「ボリュームを下げて」の文ベクトルｖ１、および「音楽の音を小さくして」の文ベクトルｖ２を、式（１）に示すコサイン類似度を求める数式に適用することで、テキストの類似度を評価する。 The language model calculation unit 112 evaluates the similarity of texts by, for example, applying the sentence vector v1 of "Lower the volume" and the sentence vector v2 of "Make the music quieter" to the formula for cosine similarity shown in equation (1).

式（１）は、文ベクトルｖ１と文ベクトルｖ２の積を、文ベクトルｖ１の絶対値と文ベクトルｖ２の絶対値の積で除算することを表す式であり、演算結果が１に近ければ文ベクトルｖ１と文ベクトルｖ２が類似していることを示す。 Equation (1) expresses dividing the inner product of the sentence vector v1 and the sentence vector v2 by the product of the magnitude (norm) of the sentence vector v1 and the magnitude of the sentence vector v2. A result close to 1 indicates that the sentence vector v1 and the sentence vector v2 are similar.
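Equation (1) itself does not survive in this text. From the description above, it presumably takes the standard cosine similarity form below, where the symbol name sim is an assumption and the numerator is read as the inner product of the two vectors:

```latex
\mathrm{sim}(v_1, v_2) = \frac{v_1 \cdot v_2}{\lVert v_1 \rVert \, \lVert v_2 \rVert} \tag{1}
```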

言語モデル演算部１１２は、コサイン類似度が閾値以上であれば、文ベクトルｖ１と文ベクトルｖ２とが類似である、すなわち、元のテキストが同一または類似の入力意図を示すと判別する。 If the cosine similarity is equal to or higher than the threshold value, the language model calculation unit 112 determines that the sentence vector v1 and the sentence vector v2 are similar, that is, the original texts indicate the same or similar input intentions.
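As a minimal sketch of this threshold check, the code below computes the cosine similarity of two sentence vectors and compares it with a threshold. The threshold value 0.8, the function names, and the sample vectors are assumptions; the specification does not give a threshold for the similarity.

```python
import math
from typing import Sequence


def cosine_similarity(v1: Sequence[float], v2: Sequence[float]) -> float:
    """Inner product of v1 and v2 divided by the product of their norms."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0.0 or norm2 == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm1 * norm2)


def is_similar(v1: Sequence[float], v2: Sequence[float],
               threshold: float = 0.8) -> bool:
    """True when the similarity reaches the (assumed) threshold."""
    return cosine_similarity(v1, v2) >= threshold


# Hypothetical 2-dimensional sentence vectors.
v1 = [3.0, 4.0]
v2 = [4.0, 3.0]
similarity = cosine_similarity(v1, v2)  # 24 / (5 * 5)
```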

言語モデル演算部１１２は、例えば、クラスタの代表ベクトルと候補テキストの文ベクトルとの類似評価を行う。図１３は、言語モデル演算部１１２による、クラスタ選択を模式的に示す図である。 The language model calculation unit 112, for example, evaluates the similarity between the representative vector of the cluster and the sentence vector of the candidate text. FIG. 13 is a diagram schematically showing cluster selection by the language model calculation unit 112.

言語モデル演算部１１２は、図１３に示すように、例えば、候補テキスト「ボリュームを下げてほしいなあ」の文ベクトルと、クラスタＣ１およびＣ２の代表ベクトルとの類似度をそれぞれ導出し、類似度が高いクラスタＣ２を第１段階の選択対象として選択する。さらに、言語モデル演算部１１２は、選択したクラスタＣ２の中から第２段階の選択対象として１以上の好適なタスクテキストを選択する。 As shown in FIG. 13, the language model calculation unit 112 derives, for example, the similarity between the sentence vector of the candidate text "I wish you would lower the volume" and the representative vector of each of the clusters C1 and C2, and selects the cluster C2, which has the higher similarity, as the first-stage selection. Further, the language model calculation unit 112 selects one or more suitable task texts from the selected cluster C2 as the second-stage selection.
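The two-stage selection can be sketched as follows. This is an illustration under stated assumptions: the `Cluster` data structure, the function `two_stage_select`, the threshold of 0.8, and all vectors and task texts are hypothetical, and the second stage is assumed to reuse cosine similarity against each task text in the chosen cluster.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple


@dataclass
class Cluster:
    representative: Sequence[float]        # representative vector
    members: Dict[str, Sequence[float]]    # task text -> sentence vector


def cosine_similarity(v1: Sequence[float], v2: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0.0 or norm2 == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm1 * norm2)


def two_stage_select(candidate: Sequence[float],
                     clusters: Dict[str, Cluster],
                     threshold: float = 0.8) -> Tuple[str, List[str]]:
    # Stage 1: pick the cluster whose representative vector is most similar.
    best = max(clusters, key=lambda name: cosine_similarity(
        candidate, clusters[name].representative))
    # Stage 2: compare only with the task texts inside that cluster.
    hits = [text for text, vec in clusters[best].members.items()
            if cosine_similarity(candidate, vec) >= threshold]
    return best, hits


# Hypothetical clusters and vectors, for illustration only.
clusters = {
    "C1": Cluster([1.0, 0.0], {"turn on the light": [1.0, 0.1],
                               "turn off the light": [0.9, -0.1]}),
    "C2": Cluster([0.0, 1.0], {"lower the volume": [0.1, 1.0],
                               "raise the volume": [-0.6, 0.8]}),
}
candidate_vector = [0.2, 0.9]  # e.g. "I wish you would lower the volume"

cluster_name, task_texts = two_stage_select(candidate_vector, clusters)
```

Comparing against the representative vectors first means the candidate is scored against one vector per cluster, and against individual task texts only within the selected cluster.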

［音声認識処理］ [Voice recognition processing]
図１４は、情報処理装置１００による音声認識処理の流れの一例を示すフローチャートである。 FIG. 14 is a flowchart showing an example of the flow of voice recognition processing by the information processing apparatus 100.

まず、取得部１０２は、音声データを取得する（Ｓ２００）。次に、解析部１０４は、取得部１０２により出力された音声データを音響モデル１２０ａに適用し、候補テキストを生成する（Ｓ２０２）。次に、言語モデル演算部１１２は、解析部１０４により出力された候補テキストを言語モデル１２０ｂに適用する（Ｓ２０４）。次に、選択部１１４は、言語モデル演算部１１２により出力された適用結果から、好適候補を選択する（Ｓ２０６）。次に、出力情報生成部１１６は、好適候補に基づいて出力情報を生成する（Ｓ２０８）。次に、出力部１１８は、出力情報を端末装置２０等に出力する（Ｓ２１０）。以上、本フローチャートの処理の説明を終了する。 First, the acquisition unit 102 acquires voice data (S200). Next, the analysis unit 104 applies the voice data output by the acquisition unit 102 to the acoustic model 120a to generate a candidate text (S202). Next, the language model calculation unit 112 applies the candidate text output by the analysis unit 104 to the language model 120b (S204). Next, the selection unit 114 selects a suitable candidate from the application result output by the language model calculation unit 112 (S206). Next, the output information generation unit 116 generates output information based on suitable candidates (S208). Next, the output unit 118 outputs the output information to the terminal device 20 or the like (S210). This is the end of the description of the processing of this flowchart.

以上、説明した実施形態の情報処理装置１００によれば、音声データを取得する取得部１０２と、音声データを解析して候補テキストに変換した、１以上の解析結果を出力する解析部１０４と、解析結果に係る候補テキストに含まれる複数の単語のそれぞれを示す分散表現によるベクトル値に変換するＷ２Ｖ実行部１０６と、Ｗ２Ｖ実行部１０６により変換されたベクトル値と、音声データに係る音声を発した利用者の入力テキストの入力意図が既知の入力テキストに対応し、予め求められている単語ベクトルリスト１２０ｆとに基づいて、１以上の解析結果から入力意図が反映された可能性の高い解析結果を選択する選択部１１４と、を備えることにより、より効率的に音声認識処理を行うことができる。 According to the information processing apparatus 100 of the embodiment described above, voice recognition processing can be performed more efficiently by providing: the acquisition unit 102 that acquires voice data; the analysis unit 104 that analyzes the voice data, converts it into candidate text, and outputs one or more analysis results; the W2V execution unit 106 that converts each of a plurality of words included in the candidate text of the analysis result into a vector value in a distributed representation; and the selection unit 114 that selects, from the one or more analysis results, an analysis result that is likely to reflect the input intention of the input text of the user who uttered the voice of the voice data, based on the vector value converted by the W2V execution unit 106 and the word vector list 120f, which is obtained in advance and corresponds to input text whose input intention is known.

以上、本発明を実施するための形態について実施形態を用いて説明したが、本発明はこうした実施形態に何等限定されるものではなく、本発明の要旨を逸脱しない範囲内において種々の変形及び置換を加えることができる。 Although modes for carrying out the present invention have been described above using embodiments, the present invention is not limited to these embodiments in any way, and various modifications and substitutions can be made without departing from the gist of the present invention.

１００…情報処理装置、２０…端末装置、３０…制御対象デバイス、４０…サービスサーバ、１００…情報処理装置、１０２…取得部、１０４…解析部、１０６…Ｗ２Ｖ実行部、１０８…テキストベクトル生成部、１１０…選別部、１１０ａ…信頼度導出部、１１２…言語モデル演算部、１１４…選択部、１１６…出力情報生成部、１１８…出力部 100 ... Information processing device, 20 ... Terminal device, 30 ... Control target device, 40 ... Service server, 100 ... Information processing device, 102 ... Acquisition unit, 104 ... Analysis unit, 106 ... W2V execution unit, 108 ... Text vector generation unit , 110 ... Sorting unit, 110a ... Reliability derivation unit, 112 ... Language model calculation unit, 114 ... Selection unit, 116 ... Output information generation unit, 118 ... Output unit

Claims

The acquisition unit that acquires audio data,
An analysis unit that analyzes the voice data and converts it into text, and outputs one or more analysis results.
A vector conversion unit that converts each of a plurality of words included in the text of the analysis result into a vector value in a distributed representation.
A selection unit that selects, from the one or more analysis results, the analysis result that is likely to reflect the input intention of the input text of the user who uttered the voice of the voice data, based on the vector value converted by the vector conversion unit and the vector value of a known input text, which is obtained in advance and corresponds to an input text whose input intention is known.
A reliability derivation unit for deriving the reliability of the analysis result using the posterior probability of the word obtained as the search result of the speech recognition engine is provided.
The selection unit changes the analysis result to be selected based on the reliability.
Information processing equipment.

The selection unit preferentially selects the analysis result whose reliability is equal to or higher than the threshold value.
The information processing apparatus according to claim 1 .

The acquisition unit that acquires audio data,
An analysis unit that analyzes the voice data and converts it into text, and outputs one or more analysis results.
A vector conversion unit that converts each of a plurality of words included in the text of the analysis result into a vector value in a distributed representation.
A selection unit that selects, from the one or more analysis results, the analysis result that is likely to reflect the input intention of the input text of the user who uttered the voice of the voice data, based on the vector value converted by the vector conversion unit and the vector value of a known input text, which is obtained in advance and corresponds to an input text whose input intention is known.
Equipped with
The vector conversion unit derives a representative vector of a cluster which is a group of the known input texts having a semantic similarity of a predetermined degree or more.
The selection unit performs a first-stage selection of the analysis result using the representative vector, and then selects, from the cluster selected in the first stage, the analysis result that is likely to reflect the input intention of the user's input text.
Information processing equipment.

The acquisition unit that acquires audio data,
An analysis unit that analyzes the voice data and converts it into text, and outputs one or more analysis results.
A vector conversion unit that converts each of a plurality of words included in the text of the analysis result into a vector value in a distributed representation.
A selection unit that selects, from the one or more analysis results, the analysis result that is likely to reflect the input intention of the input text of the user who uttered the voice of the voice data, based on the vector value converted by the vector conversion unit and the vector value of a known input text, which is obtained in advance and corresponds to an input text whose input intention is known.
Equipped with
The selection unit determines whether or not the voice data includes the user's task execution intention based on the position information given to the voice data.
Information processing equipment.

The selection unit changes the accuracy of selection of the corresponding task according to the input environment of the voice data estimated based on the position information.
The information processing apparatus according to claim 4 .

The acquisition unit that acquires audio data,
An analysis unit that analyzes the voice data and converts it into text, and outputs one or more analysis results.
A vector conversion unit that converts each of a plurality of words included in the text of the analysis result into a vector value in a distributed representation.
A selection unit that selects, from the one or more analysis results, the analysis result that is likely to reflect the input intention of the input text of the user who uttered the voice of the voice data, based on the vector value converted by the vector conversion unit and the vector value of a known input text, which is obtained in advance and corresponds to an input text whose input intention is known.
Equipped with
An output information generation unit is further provided that, based on the selection result by the selection unit, outputs an instruction relating to a task for generating output information corresponding to the input intention.
Information processing equipment.

The computer
Get voice data,
One or more analysis results obtained by analyzing the voice data and converting it into text are output.
Each of a plurality of words included in the text of the analysis result is converted into a vector value in a distributed representation.
Based on the converted vector value and the vector value of a known input text, which is obtained in advance and corresponds to an input text whose input intention is known, the analysis result that is likely to reflect the input intention of the input text of the user who uttered the voice of the voice data is selected from the one or more analysis results.
The reliability of the analysis result is derived by using the posterior probability of the word obtained as the search result of the speech recognition engine.
At the time of the selection, the analysis result to be selected is changed based on the reliability.
Information processing method.

The computer
Get voice data,
One or more analysis results obtained by analyzing the voice data and converting it into text are output.
Each of a plurality of words included in the text of the analysis result is converted into a vector value in a distributed representation.
Based on the converted vector value and the vector value of a known input text, which is obtained in advance and corresponds to an input text whose input intention is known, the analysis result that is likely to reflect the input intention of the input text of the user who uttered the voice of the voice data is selected from the one or more analysis results.
When converting to the vector value, a representative vector of a cluster, which is a group of the known input texts having a semantic similarity of a predetermined degree or more, is derived.
At the time of the selection, a first-stage selection of the analysis result is performed using the representative vector, and then the analysis result that is likely to reflect the input intention of the user's input text is selected from the cluster selected in the first stage.
Information processing method.

The computer
Get voice data,
One or more analysis results obtained by analyzing the voice data and converting it into text are output.
Each of a plurality of words included in the text of the analysis result is converted into a vector value in a distributed representation.
Based on the converted vector value and the vector value of a known input text, which is obtained in advance and corresponds to an input text whose input intention is known, the analysis result that is likely to reflect the input intention of the input text of the user who uttered the voice of the voice data is selected from the one or more analysis results.
At the time of the selection, it is determined whether or not the voice data includes the execution intention of the user's task based on the position information given to the voice data.
Information processing method.

The computer
Get voice data,
One or more analysis results obtained by analyzing the voice data and converting it into text are output.
Each of a plurality of words included in the text of the analysis result is converted into a vector value in a distributed representation.
Based on the converted vector value and the vector value of a known input text, which is obtained in advance and corresponds to an input text whose input intention is known, the analysis result that is likely to reflect the input intention of the input text of the user who uttered the voice of the voice data is selected from the one or more analysis results.
Based on the result of the selection, an instruction relating to a task for generating output information corresponding to the input intention is output.
Information processing method.

On the computer
Get voice data,
One or more analysis results obtained by analyzing the voice data and converting it into text are output.
Each of a plurality of words included in the text of the analysis result is converted into a vector value in a distributed representation.
Based on the converted vector value and the vector value of a known input text, which is obtained in advance and corresponds to an input text whose input intention is known, the analysis result that is likely to reflect the input intention of the input text of the user who uttered the voice of the voice data is selected from the one or more analysis results.
The reliability of the analysis result is derived by using the posterior probability of the word obtained as the search result of the speech recognition engine.
At the time of the selection, the analysis result to be selected is changed based on the reliability.
program.

On the computer
Get voice data,
One or more analysis results obtained by analyzing the voice data and converting it into text are output.
Each of a plurality of words included in the text of the analysis result is converted into a vector value in a distributed representation.
Based on the converted vector value and the vector value of a known input text, which is obtained in advance and corresponds to an input text whose input intention is known, the analysis result that is likely to reflect the input intention of the input text of the user who uttered the voice of the voice data is selected from the one or more analysis results.
When converting to the vector value, a representative vector of a cluster, which is a group of the known input texts having a semantic similarity of a predetermined degree or more, is derived.
At the time of the selection, a first-stage selection of the analysis result is performed using the representative vector, and then the analysis result that is likely to reflect the input intention of the user's input text is selected from the cluster selected in the first stage.
program.

On the computer
Get voice data,
One or more analysis results obtained by analyzing the voice data and converting it into text are output.
Each of a plurality of words included in the text of the analysis result is converted into a vector value in a distributed representation.
Based on the converted vector value and the vector value of a known input text, which is obtained in advance and corresponds to an input text whose input intention is known, the analysis result that is likely to reflect the input intention of the input text of the user who uttered the voice of the voice data is selected from the one or more analysis results.
At the time of the selection, it is determined whether or not the voice data includes the execution intention of the user's task based on the position information given to the voice data.
program.

On the computer
Get voice data,
One or more analysis results obtained by analyzing the voice data and converting it into text are output.
Each of a plurality of words included in the text of the analysis result is converted into a vector value in a distributed representation.
Based on the converted vector value and the vector value of a known input text, which is obtained in advance and corresponds to an input text whose input intention is known, the analysis result that is likely to reflect the input intention of the input text of the user who uttered the voice of the voice data is selected from the one or more analysis results.
Based on the result of the selection, an instruction relating to a task for generating output information corresponding to the input intention is output.
program.