JP7096199B2

JP7096199B2 - Information processing equipment, information processing methods, and programs

Info

Publication number: JP7096199B2
Application number: JP2019092572A
Authority: JP
Inventors: 賢昭佐藤; 純平三宅
Original assignee: Yahoo Japan Corp
Current assignee: Yahoo Japan Corp
Priority date: 2019-05-16
Filing date: 2019-05-16
Publication date: 2022-07-05
Anticipated expiration: 2039-05-16
Also published as: JP2020187282A

Description

The present invention relates to an information processing apparatus, an information processing method, and a program.

A technique is known that performs speech recognition based on a likelihood that includes the speech recognition result (see Patent Document 1). The likelihood is set based on, for example, a simple comparison with a corpus or an evaluation of the similarity between the speech recognition result and the corpus.

Patent Document 1: Japanese Unexamined Patent Application Publication No. 2016-206487

However, with the conventional technique, extracting suitable candidates from the tens of thousands of latent word candidates that exist for each word of the corpus takes time, so efficient speech recognition processing may not be achieved. In addition, improving the accuracy of extracting suitable candidates from each word of the corpus may not have been sufficiently studied.

The present invention has been made in view of these circumstances, and one of its objects is to provide an information processing apparatus, an information processing method, and a program capable of performing speech recognition processing more efficiently and with higher accuracy.

One aspect of the present invention is an information processing apparatus including: an acquisition unit that acquires speech data; an analysis unit that analyzes the speech data and converts it into text; an index value derivation unit that, for each of a plurality of first words included in the text resulting from the analysis by the analysis unit, derives a first index value that evaluates the frequency of the first word within an analyzed sentence that is included in the text and contains the first word, and the rarity of the first word relative to the sentences included in library information, and associates the first index value with the analyzed sentence; a vector conversion unit that converts the sentences analyzed by the analysis unit into vector values in a distributed representation; a selection unit that selects some sentences from the analyzed sentences or the sentences of interest based on the first index value derived by the index value derivation unit and the conversion result of the vector conversion unit; and a generation unit that generates data in which, among teacher sentences whose meanings are known and whose vector values have been obtained, the meaning of a teacher sentence whose vector value is close to that of a selected sentence selected by the selection unit is associated with the selected sentence as its meaning.

According to one aspect of the present invention, speech recognition processing can be performed more efficiently and with higher accuracy.

A diagram showing an example of the usage environment of the information processing apparatus 100 according to the embodiment.
A diagram schematically showing the processing of the information processing apparatus 100.
A diagram for explaining the WFST.
A diagram for explaining the WFST.
A diagram for explaining the WFST.
A configuration diagram of the information processing apparatus 100 according to the embodiment.
A diagram for explaining the vector conversion processing by the W2V execution unit 110.
A diagram for explaining a sentence vector.
A diagram schematically showing the selection of a suitable candidate by the selection unit 114.
A diagram for explaining the task text.
A diagram for explaining a representative vector.
A diagram for explaining the index values of the extraction target text.
A diagram showing an example of the tf-idf values derived by the vector conversion unit 112.
A diagram for explaining the tf-idf vector of a sentence vector.
A diagram for explaining the reliability derivation processing by the reliability derivation unit 114a.
A diagram for explaining the similarity evaluation method.
A flowchart showing an example of the flow of the language model generation processing by the information processing apparatus 100.
A flowchart showing an example of the flow of the speech recognition processing by the information processing apparatus 100.

Hereinafter, embodiments of the information processing apparatus, information processing method, and program of the present invention will be described with reference to the drawings.

[Overview]
The information processing apparatus is realized by one or more processors. It is an apparatus for providing a language model to a device (hereinafter referred to as a "terminal device") that receives speech data recording a user's utterance, performs speech recognition processing on the received input data, and performs various kinds of processing based on the recognition result. The various kinds of processing include controlling IoT (Internet of Things) devices in line with the intention of the user who spoke, and responding to the user's questions.

A language model is a natural language processing model that converts input data into text in speech recognition processing, and it holds probabilities for the conversion results that are likely to be correct when the input is converted into text. Hereinafter, the operation of the terminal device intended by the user may be referred to as a "task". The speech data may have been subjected to processing such as compression or encryption.

FIG. 1 is a diagram showing an example of the usage environment of the information processing apparatus 100 according to the embodiment.

In the illustrated environment, the terminal device 20, the controlled devices 30, and the service server 40 communicate with one another via a network NW. The network NW includes, for example, some or all of a WAN (Wide Area Network), a LAN (Local Area Network), the Internet, a provider device, a wireless base station, a dedicated line, and the like. In the example shown in FIG. 1, the number of controlled devices 30 is N (N is an integer of 1 or more). In this specification, when the individual controlled devices 30-1 to 30-N are not distinguished from one another, for example when describing matters common to all of them, they are simply referred to as the controlled device 30.

The terminal device 20 is a device that accepts a user's speech input. The terminal device 20 is, for example, a mobile phone such as a smartphone, a tablet terminal, a personal computer, or a smart speaker (AI speaker).

The controlled device 30 is an IoT device that has a communication function and an interface for accepting external control, and that can be controlled in response to commands from the terminal device 20 operated by the user. The controlled device 30 is, for example, a television, a radio, a lighting fixture, a refrigerator, a microwave oven, a washing machine, a rice cooker, a self-propelled vacuum cleaner, an air conditioner, or a vehicle.

The controlled device 30 may be the terminal device 20 itself. That is, the terminal device 20 may perform some kind of search processing, make a phone call, or send a message according to the processing result of the information processing apparatus 100.

The service server 40 is, for example, a web server device that provides web pages in response to commands from the terminal device 20 operated by the user, or an application server device that communicates with a terminal device 20 on which an application has been launched, exchanges various information with it, and provides content.

FIG. 2 is a diagram schematically showing the processing of the information processing apparatus 100.

The information processing apparatus 100 converts the speech data that the user has input via the terminal device 20 into phonemes by applying it to an acoustic model, and generates one or more extraction target texts (text renderings of the sounds contained in the speech data) based on the phonemes. It then selects a suitable candidate by applying to the language model those extraction target texts that were selected based on a comparison with known task features. A suitable candidate is an extraction target text judged likely to reflect the user's intention, and it is text that suggests an operation of the terminal device 20 or the controlled device 30.

An acoustic model is a model that statistically analyzes frequency components and temporal changes to determine which phonemes make up the input speech data (that is, what is being said). A phoneme is a label that identifies the smallest unit of a language, such as a letter of the alphabet or a kana character, and includes, for example, vowels and consonants. The information processing apparatus 100 obtains an extraction target text by combining phonemes as appropriate according to language rules.

As shown in FIG. 2, when the extraction target text generated by the phoneme conversion is "kyonotenki", "k" and "t", for example, are phonemes contained in that text. When the speech recognition processing assumes Japanese, the extraction target text may be written in the Latin alphabet, in hiragana, or in katakana. In the example shown in FIG. 2, the information processing apparatus 100 generates extraction target texts including "kyonotenki", "kyonotenkii", and "kyonodenki" based on the received speech data.

In the example shown in FIG. 2, the language model generated by the information processing apparatus 100 performs morphological analysis on each of the conversion candidates, including "kyonotenki", "kyonotenkii", and "kyonodenki". Morphological analysis is processing that determines the word boundaries in an extraction target text and derives, for example, the part of speech of each resulting word. Morphological analysis is performed using, for example, a morphological analysis engine such as MeCab.

For example, by analyzing the extraction target text "kyonotenki", the language model derives the three words 今日 (kyo, "today"), の (no, a possessive particle), and 天気 (tenki, "weather"). Similarly, analyzing "kyonotenkii" yields 今日 (kyo, "today"), の (no), and テンキー (tenkii, "numeric keypad"), and analyzing "kyonodenki" yields 京 (kyo, "capital"), の (no), and 電気 (denki, "electricity"). In this way, when speech input is converted from hiragana into kanji, multiple patterns of conversion candidates may be generated.
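As an illustration of how a romanized extraction target text can be split into words, the following is a minimal sketch of lowest-cost dictionary segmentation by dynamic programming. It is not the analysis method of MeCab or of the apparatus described here. The lexicon `LEXICON`, its cost values, and the function name `segment` are hypothetical and chosen only for this example.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical toy lexicon: romanized reading -> (surface form, cost).
# A lower cost means a more plausible word; the numbers are invented.
LEXICON: Dict[str, Tuple[str, float]] = {
    "kyo": ("今日", 1.0),
    "no": ("の", 0.5),
    "tenki": ("天気", 1.0),
    "tenkii": ("テンキー", 2.0),
}


def segment(text: str) -> Optional[Tuple[float, List[str]]]:
    """Return the lowest-cost split of `text` into lexicon words, or None."""
    # best[i] holds the cheapest (cost, words) that covers text[:i].
    best: List[Optional[Tuple[float, List[str]]]] = [None] * (len(text) + 1)
    best[0] = (0.0, [])
    for end in range(1, len(text) + 1):
        for start in range(end):
            prev = best[start]
            entry = LEXICON.get(text[start:end])
            if prev is None or entry is None:
                continue  # no valid path reaches `start`, or not a word
            cost = prev[0] + entry[1]
            current = best[end]
            if current is None or cost < current[0]:
                best[end] = (cost, prev[1] + [entry[0]])
    return best[len(text)]


print(segment("kyonotenki"))
print(segment("kyonotenkii"))
```

A real analyzer would also keep several surface forms per reading (for example both 今日 and 京 for "kyo") and would use costs that depend on neighboring words, which this sketch omits.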

The language model generates an evaluation value for the analysis result generated from each of the one or more extraction target texts, and selects one extraction target text from the conversion candidates based on the evaluation values. More specifically, the information processing apparatus 100 evaluates how well the analysis result of each extraction target text matches the features obtained from known task speech, and selects the suitable candidate presumed to match the user's intention. The information processing apparatus 100 then outputs a command for the task that generates output information corresponding to that intention.

[WFST]
FIGS. 3 to 5 are diagrams for explaining a WFST (Weighted Finite-State Transducer), which is realized by the acoustic model and the language model. A WFST is an example of a mechanism that converts input data into "conversion candidates" and "estimates of the likelihood of those conversion candidates".

When speech recognition using a WFST is performed, the speech input received by the terminal device 20 is converted by the acoustic model into context-dependent phonemes such as triphones (FIG. 3). Next, the acoustic model (or the language model) converts the phonemes into words (FIG. 4). The language model then generates, from the words, the text that is the conversion result of the speech input (FIG. 5). The language model is, for example, an N-gram language model. For example, when a 3-gram model is adopted, the text is divided into groups of three words, and the composition of the entire text is determined based on whether each three-word group makes sense.
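To make the 3-gram idea concrete, here is a minimal sketch of a count-based trigram model with add-one smoothing, which is one standard way to score word sequences. The source does not specify how its N-gram model is estimated, so the estimation method, the function names `train_trigrams` and `log_prob`, and the toy training sentences are all assumptions made for illustration.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple


def train_trigrams(
    sentences: Sequence[Sequence[str]],
) -> Tuple[Counter, Counter, int]:
    """Count trigrams and their two-word contexts over tokenized sentences."""
    tri: Counter = Counter()
    ctx: Counter = Counter()
    vocab = set()
    for words in sentences:
        padded = ["<s>", "<s>"] + list(words) + ["</s>"]
        vocab.update(padded)
        for i in range(2, len(padded)):
            tri[tuple(padded[i - 2 : i + 1])] += 1
            ctx[tuple(padded[i - 2 : i])] += 1
    return tri, ctx, len(vocab)


def log_prob(
    words: Sequence[str], tri: Counter, ctx: Counter, vocab_size: int
) -> float:
    """Log probability of a word sequence under the add-one-smoothed model."""
    padded: List[str] = ["<s>", "<s>"] + list(words) + ["</s>"]
    total = 0.0
    for i in range(2, len(padded)):
        numerator = tri[tuple(padded[i - 2 : i + 1])] + 1
        denominator = ctx[tuple(padded[i - 2 : i])] + vocab_size
        total += math.log(numerator / denominator)
    return total


# Invented toy training data, already split into words.
corpus = [
    ["今日", "の", "天気"],
    ["今日", "の", "天気", "は", "晴れ"],
    ["部屋", "の", "電気"],
]
tri, ctx, vocab_size = train_trigrams(corpus)
weather = log_prob(["今日", "の", "天気"], tri, ctx, vocab_size)
capital = log_prob(["京", "の", "電気"], tri, ctx, vocab_size)
print(weather > capital)
```

With this toy data, the candidate whose three-word windows were seen in training receives the higher score, which is the property used to rank conversion candidates.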

The information processing apparatus 100 generates a suitable language model so that the morphological analysis described above and speech recognition using the WFST can be performed faster and with higher accuracy.

[Overall configuration]
FIG. 6 is a configuration diagram of the information processing apparatus 100. The information processing apparatus 100 includes, for example, an acquisition unit 102, an analysis unit 104, a frequency calculation unit 106, a rarity calculation unit 108, a W2V (Word2Vec) execution unit 110, a vector conversion unit 112, a selection unit 114, a language model calculation unit 116, a command output unit 118, and a storage unit 120. These components (except the storage unit 120) are realized, for example, by a hardware processor such as a CPU (Central Processing Unit) executing a program (software).

Some or all of these components may be realized by hardware (including circuitry) such as an LSI (Large Scale Integration), an ASIC (Application Specific Integrated Circuit), an FPGA (Field-Programmable Gate Array), or a GPU (Graphics Processing Unit), or by software and hardware working together. The program may be stored in advance in a storage device of the information processing apparatus 100 (a storage device having a non-transitory storage medium) such as an HDD or flash memory. Alternatively, it may be stored on a removable storage medium (a non-transitory storage medium) such as a DVD or CD-ROM and installed on the HDD or flash memory of the information processing apparatus 100 when the storage medium is mounted in a drive device.

The storage unit 120 is realized by, for example, a RAM (Random Access Memory), a register, a flash memory, or an EEPROM (Electrically Erasable Programmable Read Only Memory). The storage unit 120 stores information such as an acoustic model 120a, a language model 120b, a corpus analysis result 120c, a task text analysis result 120d, an extraction target text analysis result 120e, a word vector list 120f, a vector list 120g, and language model calculation text 120h. The vector list 120g includes, for example, a task text vector list 120i and an extraction target text vector list 120j.

The acquisition unit 102 acquires the character information that the information processing apparatus 100 uses as a corpus when performing speech recognition processing (hereinafter referred to as "corpus I1") and outputs it to the analysis unit 104. The corpus I1 includes, for example, article data such as news and posts from an SNS (Social Networking Service). The corpus I1 is an example of "library information".

The corpus I1 is preferably colloquial text, for example SNS post histories, transcripts of conversations between a user and an automatic response device, transcripts of real conversations, or transcripts of the apparatus's own processing history for speech input acquired from the terminal device 20.

The acquisition unit 102 also acquires a data set of character information representing the routine tasks set by the administrator of the information processing apparatus 100 (hereinafter referred to as "task text I2") and outputs it to the analysis unit 104. The task text I2 is an example of a "teacher sentence".

The acquisition unit 102 also acquires the speech data input by the user of the terminal device 20 (hereinafter referred to as "speech data I3") and outputs it to the analysis unit 104. The speech data I3 acquired by the acquisition unit 102 may include the user's position information. The position information may be, for example, the result of processing by a GNSS (Global Navigation Satellite System) receiver included in the terminal device 20. When the terminal device 20 is used mainly in a specific place (for example, the user's living room or office), information about that place corresponds to the position information.

The analysis unit 104 analyzes the information acquired by the acquisition unit 102 and converts it into text (character data). The analysis performed by the analysis unit 104 is, for example, morphological analysis.

For example, the analysis unit 104 analyzes the corpus I1 output by the acquisition unit 102, breaking it down into parts of speech such as nouns, verbs, and particles. The analysis unit 104 stores the result in the storage unit 120 as the corpus analysis result 120c.

The analysis unit 104 also analyzes the task text I2 output by the acquisition unit 102 and stores the result in the storage unit 120 as the task text analysis result 120d.

The analysis unit 104 also applies the speech data I3 output by the acquisition unit 102 to the acoustic model 120a to generate one or more extraction target texts, and then performs analysis processing such as morphological analysis on each of them. The analysis unit 104 stores the result in the storage unit 120 as the extraction target text analysis result 120e.

From the extraction target text analysis result 120e, the frequency calculation unit 106 calculates an index value indicating frequency for each of the words (hereinafter referred to as "first words") contained in a sentence of the extraction target text (hereinafter referred to as an "analyzed sentence"), and associates the index values with the analyzed sentence. For example, the frequency calculation unit 106 calculates a tf value (term frequency, an index value indicating frequency) for each first word in the analyzed sentence and associates it with the analyzed sentence.

The frequency calculation unit 106 calculates in advance the tf value of each word (hereinafter referred to as a "second word") contained in the sentences of the corpus analysis result 120c. That is, for each of the second words in the corpus analysis result 120c, it calculates in advance the tf value of the second word within a corpus sentence that contains it (hereinafter referred to as a "sentence of interest"), and associates the value with the sentence of interest.

From the extraction target text analysis result 120e, the rarity calculation unit 108 calculates an index value indicating rarity for each of the first words contained in an analyzed sentence of the extraction target text, and associates the index values with the analyzed sentence. For example, the rarity calculation unit 108 calculates an idf value (inverse document frequency, an index value indicating rarity) for each first word in the analyzed sentence and associates it with the analyzed sentence.

The rarity calculation unit 108 calculates in advance the idf value of each second word contained in the sentences of the corpus analysis result 120c and associates it with the sentence of interest.

The frequency calculation unit 106 and the rarity calculation unit 108 perform at least one of setting index values for the first words and setting index values for the second words. The combination of the frequency calculation unit 106 and the rarity calculation unit 108 is an example of an "index value derivation unit". The results calculated by the two units for the extraction target text analysis result 120e are an example of a "first index value", and the results calculated by the two units for the corpus analysis result 120c are an example of a "second index value".
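The frequency and rarity index values described above can be sketched as follows. The source names tf and idf but gives no formulas, so this example assumes the common definitions: tf is the share of a sentence's tokens taken by a word, and idf is a smoothed logarithm of the inverse sentence frequency in the library. The function names and the toy library sentences are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Sequence


def tf_values(sentence: Sequence[str]) -> Dict[str, float]:
    """Frequency of each word within one analyzed sentence."""
    counts = Counter(sentence)
    return {word: count / len(sentence) for word, count in counts.items()}


def idf_values(
    words: Sequence[str], library: Sequence[Sequence[str]]
) -> Dict[str, float]:
    """Rarity of each word relative to the sentences of the library.

    Assumption: smoothed idf, log((1 + N) / (1 + df)) + 1, so that words
    absent from the library do not cause a division by zero.
    """
    n = len(library)
    result: Dict[str, float] = {}
    for word in set(words):
        df = sum(1 for sentence in library if word in sentence)
        result[word] = math.log((1 + n) / (1 + df)) + 1.0
    return result


def tf_idf(
    sentence: Sequence[str], library: Sequence[Sequence[str]]
) -> Dict[str, float]:
    """Combine both index values into one score per word of the sentence."""
    tf = tf_values(sentence)
    idf = idf_values(sentence, library)
    return {word: tf[word] * idf[word] for word in tf}


# Invented toy library, already split into words.
library = [
    ["今日", "の", "天気"],
    ["明日", "の", "天気"],
    ["部屋", "の", "電気"],
]
scores = tf_idf(["今日", "の", "天気"], library)
print(sorted(scores, key=scores.get, reverse=True))
```

In this toy library the particle の occurs in every sentence and therefore gets the lowest score, while 今日 occurs in only one sentence and gets the highest.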

The W2V execution unit 110 converts each of the words contained in the sentences analyzed by the analysis unit 104 into a vector value in a distributed representation. For example, the W2V execution unit 110 converts the corpus analysis result 120c into vector values and stores the result in the word vector list 120f.

The vector conversion unit 112 converts the sentences analyzed by the analysis unit 104 into vector values in a distributed representation. The vector values generated by the vector conversion unit 112 are based on the vector values produced by the W2V execution unit 110 and on at least one of the index values for the first words and the index values for the second words calculated by the frequency calculation unit 106 and the rarity calculation unit 108.

Using the extraction target text analysis result 120e and the vector values in the word vector list 120f, the vector conversion unit 112 generates a sentence-level vector value for the extraction target text (hereinafter referred to as the "sentence vector of the extraction target text", or simply the "sentence vector").

The sentence vector includes, for example, the vector values obtained by the W2V execution unit 110 from the extraction target text analysis result 120e, and the results calculated by the frequency calculation unit 106 and the rarity calculation unit 108 (hereinafter referred to as "tf-idf values"). The vector conversion unit 112 outputs the sentence vector of the extraction target text to the selection unit 114.
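The source states that a sentence vector combines word vectors with tf-idf values but does not say how. One plausible combination, shown here purely as an assumption, is a tf-idf-weighted average of the word vectors. The function name `sentence_vector`, the two-dimensional toy vectors, and the weights are hypothetical.

```python
from typing import Dict, List, Sequence


def sentence_vector(
    words: Sequence[str],
    word_vectors: Dict[str, List[float]],
    weights: Dict[str, float],
) -> List[float]:
    """Weighted average of the vectors of the words in one sentence.

    Assumption: each word vector is weighted by that word's tf-idf value;
    words without a vector are skipped.
    """
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for word in words:
        vector = word_vectors.get(word)
        if vector is None:
            continue  # out-of-vocabulary word
        weight = weights.get(word, 0.0)
        weight_sum += weight
        for i, component in enumerate(vector):
            total[i] += weight * component
    if weight_sum == 0.0:
        return total  # all-zero vector when nothing contributed
    return [component / weight_sum for component in total]


# Invented two-dimensional word vectors and tf-idf weights.
toy_vectors = {"今日": [1.0, 0.0], "天気": [0.0, 1.0]}
toy_weights = {"今日": 3.0, "天気": 1.0}
print(sentence_vector(["今日", "の", "天気"], toy_vectors, toy_weights))
```

Weighting by tf-idf lets distinctive words dominate the sentence vector, while very common words such as particles contribute little.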

Using the task text analysis result 120d and the vector values in the word vector list 120f, the vector conversion unit 112 also generates a sentence-level vector value for the task text (hereinafter referred to as the "sentence vector of the task text"). The vector conversion unit 112 outputs the sentence vector of the task text to the selection unit 114.

In the process of generating the language model 120b, the selection unit 114 selects, based on the sentence vectors of the extraction target text and the sentence vectors of the task text, the sentence vectors that will serve as the basis of the language model 120b (that is, that will be reflected in it). The text from which such a sentence vector was derived is an example of a "selected sentence". The selection unit 114 outputs the selection result to the language model calculation unit 116.
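The idea of attaching the meaning of a nearby task text to a selected sentence can be sketched with cosine similarity. The source says only that the vector values should be "close", so the similarity measure, the threshold of 0.8, the task labels, and the function names below are assumptions for illustration.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors, 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def label_by_nearest_task(
    candidate: Sequence[float],
    task_vectors: Sequence[Tuple[str, List[float]]],
    threshold: float = 0.8,
) -> Optional[str]:
    """Return the meaning of the closest task text, or None if none is close.

    `threshold` is a hypothetical cut-off; candidates whose best similarity
    falls below it are treated as not selected.
    """
    best_label: Optional[str] = None
    best_similarity = threshold
    for label, vector in task_vectors:
        similarity = cosine(candidate, vector)
        if similarity >= best_similarity:
            best_label, best_similarity = label, similarity
    return best_label


# Invented task text sentence vectors with known meanings.
tasks = [
    ("weather_query", [1.0, 0.1]),
    ("light_control", [0.0, 1.0]),
]
print(label_by_nearest_task([0.9, 0.2], tasks))
print(label_by_nearest_task([1.0, -1.0], tasks))
```

The first candidate lies close to the "weather_query" vector and inherits that meaning, while the second is far from both task vectors and is left unlabeled.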

In the process of using the language model 120b (that is, during speech recognition processing by the information processing apparatus 100), the selection unit 114 outputs some or all of the conversion results of the vector conversion unit 112 to the language model calculation unit 116.

選択部１１４は、例えば、信頼度導出部１１４ａを備える。信頼度導出部１１４ａによる優先度導出処理については後述する。 The selection unit 114 includes, for example, a reliability derivation unit 114a. The priority derivation process by the reliability derivation unit 114a will be described later.

言語モデル演算部１１６は、言語モデル１２０ｂに関連する処理を行う。 The language model calculation unit 116 performs processing related to the language model 120b.

言語モデル演算部１１６は、例えば、言語モデル生成部１１６ａを備える。言語モデル生成部１１６ａは、言語モデル１２０ｂの生成過程において、選択部１１４により出力された選択結果を適用した言語モデルを生成し、言語モデル１２０ｂとして記憶部１２０に格納する。言語モデル生成部１１６ａは、例えば、情報処理装置１００の管理者があらかじめ設定した言語モデル演算用テキスト１２０ｈ、および選択部１１４により選択された変換候補に基づいて言語モデル１２０ｂを生成する。 The language model calculation unit 116 includes, for example, a language model generation unit 116a. The language model generation unit 116a generates a language model to which the selection result output by the selection unit 114 is applied in the generation process of the language model 120b, and stores it in the storage unit 120 as the language model 120b. The language model generation unit 116a generates a language model 120b based on, for example, a language model calculation text 120h preset by the administrator of the information processing apparatus 100 and conversion candidates selected by the selection unit 114.

言語モデル演算用テキスト１２０ｈとは、例えば、情報処理装置１００の管理者が想定するタスクテキストの文ベクトルや、過去の情報処理装置１００の音声認識処理履歴として保持する文ベクトルである。言語モデル演算用テキスト１２０ｈには、コーパスＩ１やタスクテキストＩ２、音声データＩ３などと同一または類似の文から生成された文ベクトルが含まれてもよい。選択部１１４は、頻出性計算部１０６および希少性計算部１０８による第１ワードのｔｆ－ｉｄｆ値または第２ワードのｔｆ－ｉｄｆ値のうち少なくとも一方と、ベクトル変換部１１２による変換結果とに基づいて、被解析文または着目文から一部の文を選択する。 The language model calculation text 120h is, for example, a sentence vector of a task text assumed by the administrator of the information processing apparatus 100, or a sentence vector held as a past speech recognition processing history of the information processing apparatus 100. The language model calculation text 120h may include sentence vectors generated from sentences identical or similar to the corpus I1, the task text I2, the voice data I3, and the like. The selection unit 114 selects some sentences from the analyzed sentences or the sentences of interest based on the conversion result of the vector conversion unit 112 and on at least one of the tf-idf value of the first word and the tf-idf value of the second word calculated by the frequency calculation unit 106 and the rarity calculation unit 108.

また、言語モデル演算部１１６は、言語モデル１２０ｂの使用過程（情報処理装置１００による音声認識処理過程）において選択部１１４により出力された選択結果を言語モデル１２０ｂに適用し、適用した結果を指令出力部１１８に出力する。 Further, in the process of using the language model 120b (the speech recognition processing performed by the information processing apparatus 100), the language model calculation unit 116 applies the selection result output by the selection unit 114 to the language model 120b, and outputs the result of the application to the command output unit 118.

指令出力部１１８は、言語モデル１２０ｂの使用過程（情報処理装置１００による音声認識処理過程）において、ベクトル変換部１１２により変換されたベクトル値に基づいて、被認識文（選択された被解析文、または着目文）の意味合いを推定し、推定結果に基づく指令に関する情報（または指令そのもの）を出力する。指令出力部１１８により出力される指令には、端末装置２０に行わせたい処理の指示、出力先の制御対象デバイス３０を特定する情報、出力先の制御対象デバイス３０に対する処理リクエストなどが含まれる。 In the process of using the language model 120b (the speech recognition processing performed by the information processing apparatus 100), the command output unit 118 estimates the meaning of the recognized sentence (the selected analyzed sentence or sentence of interest) based on the vector value converted by the vector conversion unit 112, and outputs information about a command based on the estimation result (or the command itself). The command output by the command output unit 118 includes an instruction for processing to be performed by the terminal device 20, information specifying the control target device 30 that is the output destination, a processing request to that control target device 30, and the like.

指令出力部１１８は、例えば、言語モデル演算部１１６により出力された、言語モデル１２０ｂへの適用結果である好適候補が「今日の天気を教えて」である場合、サービスサーバ４０の提供する天気予報のウェブサイトに対してリクエストを送信し、端末装置２０に送信するための指令の応答の一部または全部を含む情報を出力情報とする。 For example, when the suitable candidate output by the language model calculation unit 116 as the result of application to the language model 120b is "Tell me today's weather", the command output unit 118 sends a request to a weather forecast website provided by the service server 40, and uses information including part or all of the response to that command as the output information to be transmitted to the terminal device 20.

また、指令出力部１１８は、例えば、好適候補が「音楽の音量を下げて」である場合、音楽再生中の制御対象デバイス３０を特定し、音量を下げる命令を出力する。なお、指令出力部１１８は、出力先が制御対象デバイス３０の出力情報を生成する場合、端末装置２０に制御対象デバイス３０に対して出力情報を出力したことを通知する出力情報を併せて生成してもよい。 Further, for example, when the suitable candidate is "Lower the volume of the music", the command output unit 118 identifies the control target device 30 that is playing music and outputs a command to lower the volume. When generating output information whose output destination is the control target device 30, the command output unit 118 may also generate output information that notifies the terminal device 20 that output information has been output to the control target device 30.

〔Ｗ２Ｖベクトル変換〕 [W2V vector conversion]
図７は、Ｗ２Ｖ実行部１１０によるベクトル変換処理を説明するための図である。 FIG. 7 is a diagram for explaining the vector conversion process by the W2V execution unit 110.

Ｗ２Ｖ実行部１１０は、例えば、コーパスの解析結果１２０ｃに含まれる各単語の意味をベクトル表現化（分散表現化）して単語ベクトルを生成する。図７の例では、Ｗ２Ｖ実行部１１０は、「ボリューム」の単語ベクトルを生成している。 The W2V execution unit 110 generates a word vector by vector-expressing (distributed representation) the meaning of each word included in the analysis result 120c of the corpus, for example. In the example of FIG. 7, the W2V execution unit 110 generates a word vector of “volume”.

Ｗ２Ｖ実行部１１０は、「音」と「ボリューム」、「ミュージック」と「音楽」のように意味の近い単語同士で単語ベクトル間の距離（コサイン類似度）が近くなるように、単語ベクトルを生成する。Ｗ２Ｖ実行部１１０は、生成したベクトル値を記憶部に単語ベクトルリスト１２０ｆとして記憶部１２０に格納する。 The W2V execution unit 110 generates word vectors so that words with similar meanings, such as "oto" (sound) and "boryūmu" (volume), or "myūjikku" and "ongaku" (both meaning music), have word vectors that are close to each other in terms of cosine similarity. The W2V execution unit 110 stores the generated vector values in the storage unit 120 as the word vector list 120f.

また、Ｗ２Ｖ実行部１１０は、単語ベクトルリスト１２０ｆに記憶されていない単語がタスクテキストまたは抽出対象テキストに含まれる場合、タスクテキストの解析結果１２０ｄ、または抽出対象テキストの解析結果１２０ｅを、例えばコーパスに追加することで同様に解析し、それらのベクトル値を生成してもよい。このベクトル値は、Ｗ２Ｖ実行部１１０による処理の都度、単語ベクトルリスト１２０ｆに反映されてもよいし、反映されなくてもよい。 Further, when the task text or the extraction target text contains a word that is not stored in the word vector list 120f, the W2V execution unit 110 sets the analysis result 120d of the task text or the analysis result 120e of the extraction target text into, for example, a corpus. By adding them, they may be analyzed in the same manner and their vector values may be generated. This vector value may or may not be reflected in the word vector list 120f each time the processing is performed by the W2V execution unit 110.

［文ベクトル］ [Sentence vector]
図８は、文ベクトルについて説明するための図である。 FIG. 8 is a diagram for explaining a sentence vector.

ベクトル変換部１１２は、例えば、「ボリュームを下げて」の文ベクトルを生成する場合、「ボリューム」、「を」、および「下げて」の単語ベクトルに所定の演算を行うことで（例えば、それぞれの単語ベクトルを加算することで）、文ベクトルを生成する。 For example, when generating the sentence vector of "Lower the volume", the vector conversion unit 112 generates the sentence vector by performing a predetermined operation on the word vectors of "volume", the particle "o", and "lower" (for example, by adding the respective word vectors).

この結果、文を構成する単語の単語ベクトルを合計した文ベクトルについても同様に、「音楽の音を小さくして」と「ボリュームを下げて」のように意味が近い文の文ベクトル同士の距離は近くなる。 As a result, sentence vectors obtained by summing the word vectors of the words that make up a sentence behave in the same way: the sentence vectors of sentences with similar meanings, such as "Make the sound of the music quieter" and "Lower the volume", are close to each other.
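The addition-based composition described above can be sketched as follows. This is only an illustration under stated assumptions: the three-dimensional word vectors are invented toy values (not output of the W2V execution unit 110), and the names `WORD_VECTORS`, `sentence_vector`, and `cosine_similarity` are hypothetical.

```python
import math

# Toy word vectors; the values are invented purely for illustration.
WORD_VECTORS = {
    "volume": [0.9, 0.1, 0.0],
    "o": [0.0, 0.0, 0.1],        # particle
    "lower": [0.1, 0.8, 0.0],
    "sound": [0.7, 0.3, 0.0],
    "quieter": [0.2, 0.6, 0.0],
    "weather": [0.0, 0.1, 0.9],
}


def sentence_vector(words):
    """Element-wise sum of the word vectors of a tokenized sentence."""
    dims = len(next(iter(WORD_VECTORS.values())))
    total = [0.0] * dims
    for word in words:
        for i, value in enumerate(WORD_VECTORS[word]):
            total[i] += value
    return total


def cosine_similarity(a, b):
    """Inner product divided by the product of the Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


lower_volume = sentence_vector(["volume", "o", "lower"])
quieter_sound = sentence_vector(["sound", "o", "quieter"])
weather = sentence_vector(["weather", "o"])
```

With these toy values, the two volume-related sentences end up much closer to each other than either is to the weather sentence, which is the property the description relies on.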

また、ベクトル変換部１１２は、タスクテキストの解析結果１２０ｄおよびＷ２Ｖ実行部１１０により出力された単語ベクトルを用いて、タスクテキストの文ベクトルを生成し、タスクテキストベクトルリスト１２０ｉとして記憶部１２０に格納する。タスクテキストは、利用者の意図を含んでいることが既知のテキストであり、例えば、情報処理装置１００の管理者によってあらかじめ設定される。 Further, the vector conversion unit 112 generates the sentence vector of the task text using the analysis result 120d of the task text and the word vectors output by the W2V execution unit 110, and stores it in the storage unit 120 as the task text vector list 120i. The task text is text that is known to contain the user's intention, and is set in advance by, for example, the administrator of the information processing apparatus 100.

［候補選択］ [Candidate selection]
選択部１１４は、言語モデル演算部１１６により出力された抽出対象テキストを評価値に基づいて評価することで、利用者の入力意図が反映された可能性の高い好適候補を選択する。選択部１１４は、選択結果である好適候補を言語モデル演算部１１６に出力する。 The selection unit 114 evaluates the extraction target text output by the language model calculation unit 116 based on an evaluation value, and thereby selects a suitable candidate that is highly likely to reflect the input intention of the user. The selection unit 114 outputs the suitable candidate, which is the selection result, to the language model calculation unit 116.

図９は、選択部１１４による好適候補選択を模式的に示す図である。 FIG. 9 is a diagram schematically showing suitable candidate selection by the selection unit 114.

言語モデルとは、抽出対象テキストから、好適候補を生成するためのモデルである。選択部１１４は、例えば、候補ベクトルの文ベクトルとタスクテキストの文ベクトルの類似度から、タスクテキストに近いものほど高い評価値を与え、更に、言語モデルを用いて、単語の並びに関するスコアが高いものほど高い評価値を与える、これらの評価値を総合評価することで、好適候補を選択する。なお、言語モデルは、利用者の周辺環境を加味して評価を行うものでもよい。 The language model is a model for generating suitable candidates from the extraction target text. For example, the selection unit 114 gives a higher evaluation value to a candidate whose sentence vector is more similar to the sentence vector of the task text, and further uses the language model to give a higher evaluation value to a candidate with a higher word-sequence score. The selection unit 114 then selects a suitable candidate by comprehensively evaluating these evaluation values. The language model may also perform the evaluation in consideration of the user's surrounding environment.

［タスクテキスト］ [Task text]
以下、タスクテキストについて説明する。情報処理装置１００の管理者は、例えば、端末装置２０の過去の音声入力履歴や、情報処理装置１００の処理履歴に基づいて、言語モデル１２０ｂが生成される過程において選択部１１４が評価基準として参照するタスクテキストＩ２を抽出する。 The task text will be described below. The administrator of the information processing apparatus 100 extracts, based on, for example, the past voice input history of the terminal device 20 or the processing history of the information processing apparatus 100, the task text I2 that the selection unit 114 refers to as an evaluation criterion in the process of generating the language model 120b.

図１０は、タスクテキストを説明するための図である。 FIG. 10 is a diagram for explaining the task text.

図１０の左図は、端末装置２０の過去の音声入力履歴の音声認識結果Ｒ１～Ｒ７を示す。音声認識結果には、端末装置２０の利用者の入力意図が反映されたものと、利用者には入力意図はないが音声認識されたものとが含まれる。 The left figure of FIG. 10 shows the voice recognition results R1 to R7 of the past voice input history of the terminal device 20. The voice recognition result includes a result reflecting the input intention of the user of the terminal device 20 and a voice recognition result having no input intention by the user.

情報処理装置１００の管理者は、例えば、音声認識結果Ｒ４をタスクに近いテキストであると判別した場合、図１０の右上図に示すように優先度を高く設定する。「タスクに近い」とは、利用者の入力意図が反映された可能性が高いテキストが含まれることであり、端末装置２０または制御対象デバイス３０に対する操作の意味合いが高いテキストが含まれることである。 For example, when the administrator of the information processing apparatus 100 determines that the voice recognition result R4 is text close to a task, the administrator sets a high priority as shown in the upper right part of FIG. 10. "Close to a task" means that the result contains text that is highly likely to reflect the input intention of the user, that is, text that strongly implies an operation on the terminal device 20 or the control target device 30.

また、情報処理装置１００の管理者は、音声認識結果のＲ６をタスクから遠いテキストであると判別した場合、図１０の右下図に示すように優先度を低く設定する。 Further, when the administrator of the information processing apparatus 100 determines that the voice recognition result R6 is a text far from the task, the administrator sets the priority low as shown in the lower right figure of FIG.

また、情報処理装置１００の管理者は、音声認識結果Ｒ１、Ｒ２、Ｒ３、Ｒ５、およびＲ７についてもタスクから遠いテキストであると判別し、優先度を低く設定する。タスクテキストの優先度は、例えば、タスクテキストの文ベクトル値とともに、タスクテキストベクトルリスト１２０ｉに登録される。 Further, the administrator of the information processing apparatus 100 determines that the voice recognition results R1, R2, R3, R5, and R7 are texts far from the task, and sets the priority low. The priority of the task text is registered in the task text vector list 120i together with the sentence vector value of the task text, for example.

タスクテキストベクトルリスト１２０ｉは、１０個程度のクラスタ構造をとってもよく、その場合タスクの意味内容が類似するタスクテキストをクラスタとして取りまとめる。クラスタは、例えば、ｋ平均法（k-means clustering）等により構成される。意味内容の類似評価については後述する。 The task text vector list 120i may have a cluster structure of about 10, and in that case, the task texts having similar meanings and contents of the tasks are collected as a cluster. The cluster is configured by, for example, the k-means clustering method or the like. Similar evaluation of meaning and content will be described later.

また、タスクテキストベクトルリスト１２０ｉには、被検索効率を高めることを目的としてクラスタ毎に代表ベクトルが設定され、その代表ベクトルが格納されてもよい。代表ベクトルとは、例えば、クラスタを構成するタスクテキストの文ベクトルの平均でもよいし、タスクテキストの優先度と文ベクトルによる加重平均であってもよい。 Further, in the task text vector list 120i, a representative vector may be set for each cluster for the purpose of improving the search efficiency, and the representative vector may be stored. The representative vector may be, for example, the average of the sentence vectors of the task texts constituting the cluster, or may be the weighted average of the priority of the task texts and the sentence vectors.
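The two options for the representative vector mentioned above can be sketched as follows. This is an illustration only: the function names are hypothetical, and using the priority directly as the weight of each sentence vector is one assumed reading of "weighted average of the priority and the sentence vectors".

```python
def mean_vector(vectors):
    """Plain average of the sentence vectors in one cluster."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def weighted_mean_vector(vectors, priorities):
    """Average of the sentence vectors, weighted by task-text priority.

    Assumption: each priority is a positive number used directly as a weight.
    """
    total_weight = sum(priorities)
    return [
        sum(p * x for p, x in zip(priorities, column)) / total_weight
        for column in zip(*vectors)
    ]


cluster_vectors = [[1.0, 0.0], [0.0, 1.0]]
plain = mean_vector(cluster_vectors)
weighted = weighted_mean_vector(cluster_vectors, [3.0, 1.0])
```

In the weighted variant, a high-priority task text pulls the representative vector toward itself, so the cluster is matched more readily by inputs resembling that text.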

なお、選択部１１４は、抽出対象テキストに位置情報が付与される場合、その位置情報から利用者の入力環境を推定し、抽出対象テキスト利用者のタスクの実行意図を含むものであるか否かを判別し、判別結果に基づいて後続の処理を行ってもよい。 When position information is given to the text to be extracted, the selection unit 114 estimates the input environment of the user from the position information and determines whether or not the text includes the execution intention of the user of the text to be extracted. Then, subsequent processing may be performed based on the determination result.

例えば、選択部１１４は、抽出対象テキストの位置情報から利用者が自宅リビングにいることが推定される場合には、リビングで利用する制御対象デバイス３０に関するタスクの適合率を高く設定し、同時にオフィスで利用する制御対象デバイス３０に関するタスクの適合率を低く設定することで対応するタスクが選択される確度（適合率の高さ）を変更してよい。 For example, when the user is estimated to be in the living room at home from the position information of the text to be extracted, the selection unit 114 sets a high matching rate of the task related to the controlled device 30 used in the living room, and at the same time, the office. By setting the matching rate of the task related to the controlled device 30 used in the above to a low value, the accuracy (high matching rate) of selecting the corresponding task may be changed.

例えば、図１０の例においては、音声データＩ３が利用者の自宅リビングに対応付いた位置情報を持つ場合に、「年休がほしい」よりも「電球がほしい」というタスクの実行意図を含むテキストが認識される可能性が高いため、「電球がほしい」の適合率を高く設定している。一方、音声データＩ３が利用者のオフィスに対応付いた位置情報を持つ場合に、「電球がほしい」よりも「年休がほしい」というタスクの実行意図を含むテキストが認識される可能性が高い場合（「電球が欲しい」という音声データＩ３を受け付ける可能性が低い場合）には、図示の例とは異なる適合率（例えば、「電球がほしい」と「年休がほしい」の適合率を逆にするなど）が設定されてもよい。 For example, in the example of FIG. 10, when the voice data I3 has position information associated with the living room of the user's home, text containing the intention to execute the task "I want a light bulb" is more likely to be recognized than "I want annual leave", so the matching rate of "I want a light bulb" is set high. On the other hand, when the voice data I3 has position information associated with the user's office and text containing the intention to execute the task "I want annual leave" is more likely to be recognized than "I want a light bulb" (that is, when voice data I3 saying "I want a light bulb" is unlikely to be received), matching rates different from the illustrated example may be set (for example, the matching rates of "I want a light bulb" and "I want annual leave" may be reversed).

図１１は、代表ベクトルを説明するための図である。 FIG. 11 is a diagram for explaining a representative vector.

選択部１１４は、例えば、タスクテキストを選択する際に、まず代表ベクトルと、抽出対象テキストの文ベクトルとを比較してクラスタを選択し、次に選択したクラスタの中から、好適なタスクテキストを選択する。 For example, when selecting a task text, the selection unit 114 first compares the representative vector with the sentence vector of the text to be extracted to select a cluster, and then selects a suitable task text from the selected clusters. select.
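The two-stage selection described above (first a cluster by its representative vector, then a task text within that cluster) can be sketched as follows. This is a minimal illustration under stated assumptions: cosine similarity is assumed as the comparison measure, the data layout (`representative`, `members`) and the function names are hypothetical, and the two-dimensional vectors are invented toy values.

```python
import math


def cosine_similarity(a, b):
    """Inner product divided by the product of the Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def select_task_text(query_vector, clusters):
    """Pick the closest cluster first, then the closest task text inside it."""
    best_cluster = max(
        clusters,
        key=lambda c: cosine_similarity(query_vector, c["representative"]),
    )
    best_text, _ = max(
        best_cluster["members"],
        key=lambda member: cosine_similarity(query_vector, member[1]),
    )
    return best_text


# Toy clusters; every vector here is an invented example value.
clusters = [
    {
        "representative": [1.0, 0.1],
        "members": [
            ("Tell me today's weather", [1.0, 0.0]),
            ("Tell me tomorrow's weather", [0.9, 0.3]),
        ],
    },
    {
        "representative": [0.0, 1.0],
        "members": [("Make the sound of the music quieter", [0.0, 1.0])],
    },
]

chosen = select_task_text([0.95, 0.05], clusters)
```

Comparing against one representative vector per cluster before looking at individual task texts keeps the number of similarity computations small when the task text list is large.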

［抽出対象テキストの指標値］ [Index value of extraction target text]
選択部１１４は、上述のような「タスクに近い」テキストであるか否かの判定要素として、ｔｆ－ｉｄｆ値を用いる。 The selection unit 114 uses the tf-idf value as a factor for determining whether or not text is "close to a task" as described above.

図１２は、抽出対象テキストの指標値を説明するための図である。 FIG. 12 is a diagram for explaining an index value of the text to be extracted.

抽出対象テキストに含まれる一文Ｓ１（以下、「抽出対象テキストＳ１」と称する）が「来週／の／土曜／温泉／に／行きたい／ん／だけど／いい／温泉／は／ある／の」（／：単語の区切り位置）という１４単語である場合、ベクトル変換部１１２は、頻出性計算部１０６および希少性計算部１０８による計算結果に基づいて、単語ごとのテキスト内での「重要度」の判定元情報となる文ベクトルを生成する。 When one sentence S1 included in the extraction target text (hereinafter referred to as "extraction target text S1") consists of the 14 words "next week / no / Saturday / hot spring / ni / want to go / n / but / good / hot spring / wa / aru / no" (where "/" marks a word boundary), the vector conversion unit 112 generates, based on the calculation results of the frequency calculation unit 106 and the rarity calculation unit 108, a sentence vector that serves as the basis for judging the "importance" of each word within the text.

以下の説明において、コーパスＩ１に２００，０００文が含まれており、コーパスＩ１に単語「温泉」という単語を含む文が１５０文含まれ、コーパスＩ１に単語「の」を含む文が３０,０００文含まれるものとして説明する。 The following description assumes that the corpus I1 contains 200,000 sentences, of which 150 sentences contain the word "hot spring" and 30,000 sentences contain the word "no".

なお、図１２の例において、抽出対象テキストＳ１は「被解析文」の一例である。また、抽出対象テキストＳ１に含まれる二重下線を引いた単語「温泉」は「第１ワード」の一例である。また、抽出対象テキストＳ１に含まれる下線を引いた単語「の」や、抽出対象テキストＳ２に含まれる下線を引いた単語「の」は、それぞれ以下の説明において着目する「第１ワード」の一例である。 In the example of FIG. 12, the extraction target text S1 is an example of the “analyzed sentence”. Further, the double underlined word "hot spring" included in the extraction target text S1 is an example of the "first word". Further, the underlined word "no" included in the extraction target text S1 and the underlined word "no" included in the extraction target text S2 are examples of the "first word" to be focused on in the following description. Is.

また、タスクテキストＳ３およびタスクテキストＳ４は、図６のタスクテキストＩ２に含まれるタスクテキストの一例である。タスクテキストＳ３は、「着目文」の一例であり、タスクテキストＳ４は「着目文以外の文」の一例である。タスクテキストＳ３に含まれる二重下線を引いた単語「温泉」は「第２ワード」の一例である。また、タスクテキストＳ３およびタスクテキストＳ４に含まれる下線を引いた単語「の」は、「第２ワード」の一例である。 Further, the task text S3 and the task text S4 are examples of the task text included in the task text I2 of FIG. The task text S3 is an example of a "sentence of interest", and the task text S4 is an example of a "sentence other than the sentence of interest". The double underlined word "hot spring" included in the task text S3 is an example of the "second word". Further, the underlined word "no" included in the task text S3 and the task text S4 is an example of the "second word".

図１２の例において、頻出性計算部１０６は、抽出対象テキストＳ１に含まれる単語「温泉」のｔｆ値を、２／１４（抽出対象テキストＳ１を構成する１４単語のうち２単語を占める）であると計算する。同様に、頻出性計算部１０６は、抽出対象テキストＳ１に含まれる単語「の」のｔｆ値を、２／１４であると計算する。 In the example of FIG. 12, the frequency calculation unit 106 sets the tf value of the word “hot spring” included in the extraction target text S1 to 2/14 (occupies 2 words out of the 14 words constituting the extraction target text S1). Calculate that there is. Similarly, the frequency calculation unit 106 calculates that the tf value of the word "no" included in the extraction target text S1 is 2/14.

希少性計算部１０８は、抽出対象テキストＳ１に含まれる単語「温泉」のｉｄｆ値を、log(200000／150)と計算する。同様に、希少性計算部１０８は、抽出対象テキストＳ１に含まれる単語「の」のｉｄｆ値を、log(200000／30000)であると計算する。 The rarity calculation unit 108 calculates the idf value of the word "hot spring" included in the extraction target text S1 as log(200000/150). Similarly, the rarity calculation unit 108 calculates the idf value of the word "no" included in the extraction target text S1 as log(200000/30000).

次に、ベクトル変換部１１２は、抽出対象テキストＳ１に含まれる単語のそれぞれの頻出性計算部１０６および希少性計算部１０８による計算結果を乗算して、抽出対象テキストＳ１に含まれる単語のそれぞれのｔｆ－ｉｄｆ値を導出する。 Next, the vector conversion unit 112 multiplies the calculation results by the frequency calculation unit 106 and the rarity calculation unit 108 of the words included in the extraction target text S1, and each of the words included in the extraction target text S1. The tf-idf value is derived.

例えば、ベクトル変換部１１２は、抽出対象テキストＳ１に含まれる単語「温泉」のｔｆ－ｉｄｆ値を、２／１４×ｌｏｇ（２０００００／１５０）≒０．４４６であると導出する。同様に、ベクトル変換部１１２は、抽出対象テキストＳ１に含まれる単語「の」のｔｆ－ｉｄｆ値を、２／１４×ｌｏｇ（２０００００／３００００）≒０．１１８であると導出する。 For example, the vector conversion unit 112 derives that the tf-idf value of the word “hot spring” included in the extraction target text S1 is 2/14 × log (200000/150) ≈0.446. Similarly, the vector conversion unit 112 derives the tf-idf value of the word “no” included in the extraction target text S1 as 2/14 × log (200,000 / 30,000) ≈0.118.
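The arithmetic above can be reproduced with the short sketch below. The function name `tf_idf` and its parameters are hypothetical. The base of the logarithm is not stated in the description; base 10 is assumed here because it reproduces the stated approximations 0.446 and 0.118.

```python
import math


def tf_idf(count_in_sentence, sentence_length, total_sentences, sentences_with_word):
    """tf-idf as described: (occurrences / sentence length) * log(N / df).

    Assumption: the logarithm is base 10, inferred from the stated results.
    """
    tf = count_in_sentence / sentence_length
    idf = math.log10(total_sentences / sentences_with_word)
    return tf * idf


# "hot spring": 2 of 14 words, appears in 150 of 200,000 corpus sentences.
hot_spring = tf_idf(2, 14, 200_000, 150)
# "no": 2 of 14 words, appears in 30,000 of 200,000 corpus sentences.
particle_no = tf_idf(2, 14, 200_000, 30_000)

print(round(hot_spring, 3), round(particle_no, 3))  # 0.446 0.118
```

Both words have the same tf value (2/14), so the difference in the result comes entirely from the idf term: the rarer word "hot spring" is weighted far more heavily than the common particle "no".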

ベクトル変換部１１２により導出されたｔｆ－ｉｄｆ値がより大きい値となる単語は、抽出対象テキストＳ１においてより「重要度」の高い単語である。すなわち、図１２の抽出対象テキストＳ１において、ベクトル変換部１１２により導出されたｔｆ－ｉｄｆ値に基づいて評価すると、単語「温泉」がより重要度の高い単語である。 A word having a larger tf-idf value derived by the vector conversion unit 112 is a word having a higher "importance" in the extraction target text S1. That is, in the extraction target text S1 of FIG. 12, the word "hot spring" is a more important word when evaluated based on the tf-idf value derived by the vector conversion unit 112.

ベクトル変換部１１２は、抽出対象テキストに含まれる一文Ｓ２「来週／の／天気／の／情報」に対して抽出対象テキストＳ１と同様にｔｆ－ｉｄｆ値を導出する。 The vector conversion unit 112 derives a tf-idf value for the sentence S2 “next week / of / weather / of / information” included in the extraction target text, as in the extraction target text S1.

また、ベクトル変換部１１２は、タスクテキストＩ２に含まれる一文Ｓ３「近く／の／温泉／を／調べて／ほしい」およびタスクテキストＩ２に含まれる一文Ｓ４「明日／の／東京／の／天気」のそれぞれに対して、タスクテキストに含まれる第２ワードのｔｆ値およびｉｄｆ値を導出して、ｔｆ－ｉｄｆ値を導出する。 Further, the vector conversion unit 112 includes a sentence S3 "near / no / hot spring / check / want" included in the task text I2 and a sentence S4 "tomorrow / no / Tokyo / no / weather" included in the task text I2. The tf value and the idf value of the second word included in the task text are derived for each of the above, and the tf-idf value is derived.

［文ベクトル（ｔｆ－ｉｄｆベクトル）］ [Sentence vector (tf-idf vector)]
図１３は、ベクトル変換部１１２により導出されたｔｆ－ｉｄｆ値の一例を示す図である。 FIG. 13 is a diagram showing an example of the tf-idf values derived by the vector conversion unit 112.

ベクトル変換部１１２は、抽出対象テキストが「今日／の／天気／を／教えて」である場合、抽出対象テキストに含まれる単語のそれぞれのｔｆ－ｉｄｆ値を導出する。ベクトル変換部１１２は、例えば、単語「今日」のｔｆ－ｉｄｆ値は０．５であり、単語「の」のｔｆ－ｉｄｆ値は０．０２であると導出したとする。 When the extraction target text is "today / no / weather / o / tell me", the vector conversion unit 112 derives the tf-idf value of each word included in the extraction target text. Assume, for example, that the vector conversion unit 112 derives a tf-idf value of 0.5 for the word "today" and a tf-idf value of 0.02 for the word "no".

図１４は、文ベクトルのｔｆ－ｉｄｆベクトルを説明するための図である。 FIG. 14 is a diagram for explaining the tf-idf vector of the sentence vector.

ベクトル変換部１１２は、図１２に示したように抽出対象テキストに含まれる単語のそれぞれのｔｆ－ｉｄｆ値の導出結果を用いて、ｔｆ－ｉｄｆベクトルを生成する。例えば、ベクトル変換部１１２がテキスト「今日／の／天気／を／教えて／が」からｔｆ－ｉｄｆベクトルを生成する場合、図１４に示すような分散表現によるベクトルで表現することができる。なお、テキストに含まれる単語「が」は、抽出対象テキストに含まれない単語の一例である。抽出対象テキストに含まれない単語のｔｆ－ｉｄｆベクトル値は０である。 The vector conversion unit 112 generates a tf-idf vector using the tf-idf values derived for each word included in the extraction target text, as shown in FIG. 12. For example, when the vector conversion unit 112 generates a tf-idf vector from the text "today / no / weather / o / tell me / ga", the result can be expressed as a vector in a distributed representation as shown in FIG. 14. The word "ga" in this text is an example of a word that is not included in the extraction target text. The tf-idf vector value of a word not included in the extraction target text is 0.
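As a minimal sketch of how such a tf-idf vector might be laid out, the example below maps each vocabulary word to one vector element and fills in 0 for words absent from the extraction target text. The values for "today" (0.5) and "no" (0.02) come from the description; the other three values and all names are hypothetical, chosen only for illustration.

```python
def tfidf_vector(vocabulary, tfidf_by_word):
    """One element per vocabulary word; words absent from the text get 0.0."""
    return [tfidf_by_word.get(word, 0.0) for word in vocabulary]


vocabulary = ["today", "no", "weather", "o", "tell me", "ga"]

# "today" and "no" follow the description; the rest are invented values.
tfidf_by_word = {
    "today": 0.5,
    "no": 0.02,
    "weather": 0.4,   # hypothetical
    "o": 0.02,        # hypothetical
    "tell me": 0.3,   # hypothetical
}

vector = tfidf_vector(vocabulary, tfidf_by_word)
print(vector)  # [0.5, 0.02, 0.4, 0.02, 0.3, 0.0]
```

Because every text is projected onto the same vocabulary order, vectors built this way can be compared element by element even when the texts share only some of their words.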

同様に、ベクトル変換部１１２は、コーパスの解析結果１２０ｃに対してもｔｆ－ｉｄｆベクトル値を導出する処理を行っておく。そのようにすることによって、選択部１１４による選択処理においてｔｆ－ｉｄｆベクトル値を参照することが可能になるため、言語モデル１２０ｂの生成のために好適な文ベクトルを選択することができ、高精度の言語モデル１２０ｂの生成が言語モデル生成部１１６ａにより実現される。 Similarly, the vector conversion unit 112 also performs a process of deriving the tf-idf vector value for the analysis result 120c of the corpus. By doing so, it becomes possible to refer to the tf-idf vector value in the selection process by the selection unit 114, so that a sentence vector suitable for generating the language model 120b can be selected, and the sentence vector can be selected with high accuracy. The generation of the language model 120b is realized by the language model generation unit 116a.

［信頼度］ [Reliability]
以下、信頼度導出部１１４ａの信頼度導出処理についてより具体的に説明する。信頼度とは、音声認識結果の信頼性を評価する度合を０から１．０の間の数値で示すものであって、認識結果をどれだけ信頼してよいかを表す尺度である。 Hereinafter, the reliability derivation process of the reliability derivation unit 114a will be described more specifically. The reliability is a numerical value between 0 and 1.0 that expresses the degree to which a speech recognition result is evaluated as reliable, and is a measure of how much the recognition result can be trusted.

信頼度導出部１１４ａは、例えば、テキストの信頼性が高い場合、すなわち、他の競合候補となるテキストが存在しない場合に信頼度を１．０に設定する。信頼度は、例えば、大語彙連続音声認識エンジンの検索結果として得られる単語の事後確率を用いて導出される。なお、信頼度の導出には、ｐ＊(ｔｆ－ｉｄｆベクトル値の類似度)が用いられてもよい。 The reliability derivation unit 114a sets the reliability to 1.0, for example, when the reliability of the text is high, that is, when there is no other text that is a candidate for competition. The reliability is derived, for example, using the posterior probabilities of words obtained as search results of a large vocabulary continuous speech recognition engine. In addition, p * (similarity of tf-idf vector value) may be used for deriving the reliability.

図１５は、信頼度導出部１１４ａによる信頼度導出処理を説明するための図である。 FIG. 15 is a diagram for explaining the reliability derivation process by the reliability derivation unit 114a.

信頼度導出部１１４ａは、例えば、抽出対象テキストＥ１～Ｅ４のそれぞれの信頼度を導出する。選択部１１４は、例えば、信頼度導出部１１４ａが導出した信頼度が閾値（例えば、０．８程度）以上である抽出対象テキストＥ１およびＥ４を優先的にタスクテキストとして選択する。なお、選択部１１４は、複数のタスクテキストが選択可能である場合、信頼度の高いタスクテキストを優先的に選択してもよい。 The reliability derivation unit 114a derives, for example, the reliability of each of the extraction target texts E1 to E4. For example, the selection unit 114 preferentially selects the extraction target texts E1 and E4 whose reliability derived by the reliability derivation unit 114a is a threshold value (for example, about 0.8) or more as task text. When a plurality of task texts can be selected, the selection unit 114 may preferentially select a highly reliable task text.

また、信頼度導出部１１４ａは、信頼度を所定の周期で再設定してもよい。その場合、信頼度導出部１１４ａは、抽出対象テキストＥ１～Ｅ４のうち、誤り（誤変換が含まれたり、タスクテキストに適合するものがなかったりするなどのこと）である可能性の高い抽出対象テキストＥ２およびＥ３に対して、より低い信頼度を設定することで、選択部１１４による処理精度を高めてもよい。 Further, the reliability derivation unit 114a may reset the reliability at a predetermined cycle. In that case, the reliability derivation unit 114a has a high possibility that the extraction target texts E1 to E4 are erroneous (such as erroneous conversion being included or none matching the task text). By setting a lower reliability for the texts E2 and E3, the processing accuracy by the selection unit 114 may be improved.

選択部１１４は、信頼度導出部１１４ａにより導出された信頼度に基づいて、被解析文に対応する文ベクトルを選択する。選択部１１４は、例えば、信頼度導出部１１４ａにより導出された信頼度が閾値以上である解析結果から得られた被解析文を優先的に選択する。信頼度導出部１１４ａにより信頼度が設定されることによって、誤った被認識文が言語モデル１２０ｂに反映されることを避けることができる。 The selection unit 114 selects the sentence vector corresponding to the sentence to be analyzed based on the reliability derived by the reliability derivation unit 114a. The selection unit 114 preferentially selects, for example, the analyzed sentence obtained from the analysis result whose reliability derived by the reliability derivation unit 114a is equal to or higher than the threshold value. By setting the reliability by the reliability deriving unit 114a, it is possible to prevent the erroneous recognized sentence from being reflected in the language model 120b.

また、選択部１１４は、信頼度導出部１１４ａにより導出された信頼度が閾値以上である文ベクトルが見つかった場合、選択処理が途中であったとしても、その選択処理を中断することによって、言語モデル１２０ｂの生成処理に要する処理時間を短縮してもよい。 Further, when the selection unit 114 finds a sentence vector whose reliability is equal to or higher than the threshold value derived by the reliability derivation unit 114a, the selection unit 114 interrupts the selection process even if the selection process is in progress, thereby causing the language. The processing time required for the generation processing of the model 120b may be shortened.
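The threshold-based selection and the optional early termination described above can be sketched as follows. This is an illustration only: the function name and parameters are hypothetical, and the reliability values for E1 to E4 are invented so that E1 and E4 meet the example threshold of 0.8, as in the description.

```python
def select_by_reliability(candidates, threshold=0.8, stop_at_first=False):
    """Keep candidates whose reliability is at or above the threshold.

    When stop_at_first is True, scanning stops at the first qualifying
    candidate, trading completeness for a shorter processing time.
    """
    selected = []
    for text, reliability in candidates:
        if reliability >= threshold:
            selected.append(text)
            if stop_at_first:
                break
    return selected


# Invented reliabilities; only "E1 and E4 pass" follows the description.
candidates = [("E1", 0.9), ("E2", 0.4), ("E3", 0.3), ("E4", 0.85)]

all_selected = select_by_reliability(candidates)
first_only = select_by_reliability(candidates, stop_at_first=True)
```

The early-termination variant never inspects E2 to E4 once E1 qualifies, which corresponds to interrupting the selection process partway through.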

［テキストの意味内容の類似評価］ [Similarity evaluation of the semantic content of text]
以下、テキストの意味内容の類似評価方法について説明する。 Hereinafter, a method for evaluating the similarity of the semantic content of text will be described.

言語モデル演算部１１６は、例えば、抽出対象テキストの文ベクトル（以下、「ベクトルｖｉ」と称する）と、各クラスタの代表ベクトルＶとに対してコサイン類似度を求める数式に適用することで、テキストの意味内容の類似評価を行う。コサイン類似度を求める数式は、例えば、任意の文ベクトルｖ１と任意の文ベクトルｖ２の積を、文ベクトルｖ１の絶対値と文ベクトルｖ２の絶対値の積で除算する式であり、演算結果が１に近ければ文ベクトルｖ１と文ベクトルｖ２が類似していることを示す式である。 The language model calculation unit 116 evaluates the similarity of the semantic content of text by, for example, applying the sentence vector of the extraction target text (hereinafter referred to as "vector vi") and the representative vector V of each cluster to a formula for obtaining the cosine similarity. The formula for obtaining the cosine similarity is, for example, a formula that divides the product of an arbitrary sentence vector v1 and an arbitrary sentence vector v2 by the product of the absolute value of the sentence vector v1 and the absolute value of the sentence vector v2; a result close to 1 indicates that the sentence vector v1 and the sentence vector v2 are similar.

言語モデル演算部１１６は、導出したコサイン類似度が閾値以上であれば、文ベクトルｖ１と文ベクトルｖ２とが類似である、すなわち、文ベクトルｖ１の導出元のテキストと文ベクトルｖ２の導出元のテキストが同一または類似の意味内容であると判定する。 If the derived cosine similarity is equal to or greater than a threshold value, the language model calculation unit 116 determines that the sentence vector v1 and the sentence vector v2 are similar, that is, that the text from which the sentence vector v1 was derived and the text from which the sentence vector v2 was derived have the same or similar semantic content.
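Written out, the verbal definition of cosine similarity given above corresponds to the standard formula below. Reading "product" as the inner product and "absolute value" as the Euclidean norm is an assumption consistent with the usual definition; the threshold symbol $\theta$ is introduced here only for illustration.

```latex
\[
\operatorname{sim}(v_1, v_2)
  = \frac{v_1 \cdot v_2}{\lVert v_1 \rVert \, \lVert v_2 \rVert},
\qquad
v_1 \text{ and } v_2 \text{ are judged similar when } \operatorname{sim}(v_1, v_2) \ge \theta .
\]
```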

図１６は、類似評価方法について説明するための図である。 FIG. 16 is a diagram for explaining a similarity evaluation method.

言語モデル演算部１１６は、例えば、抽出対象テキスト「今日の天気はどうかな」のベクトルｖｉを導出する。言語モデル演算部１１６は、「今日の天気を教えて」、「明日の天気を教えて」、「天気は晴れか教えて」などの文ベクトルを含むクラスタＣ１の代表ベクトル（以下、「クラスタ代表ベクトルＣＶ１」と称する）や、「音楽の音を小さくして」などの文ベクトルを含むクラスタＣ２の代表ベクトル（以下、「クラスタ代表ベクトルＣＶ２」と称する）と、ベクトルｖｉとをコサイン類似度を求める数式に適用してテキストの意味内容の類似度を評価する。 The language model calculation unit 116 derives, for example, the vector vi of the extraction target text "I wonder how today's weather is". The language model calculation unit 116 then evaluates the similarity of the semantic content of the text by applying the vector vi to the cosine similarity formula together with each of the following: the representative vector of cluster C1 (hereinafter referred to as "cluster representative vector CV1"), which includes sentence vectors such as "Tell me today's weather", "Tell me tomorrow's weather", and "Tell me whether the weather is sunny", and the representative vector of cluster C2 (hereinafter referred to as "cluster representative vector CV2"), which includes sentence vectors such as "Make the sound of the music quieter".

なお、クラスタＣ１に含まれるタスクテキストのそれぞれは、「教師文」の一例である。 Each of the task texts included in the cluster C1 is an example of a "teacher sentence".

例えば、図示のように、ベクトルｖｉとクラスタ代表ベクトルＣＶ１の類似度が０．７５であり、ベクトルｖｉとクラスタ代表ベクトルＣＶ２の類似度が０．１である場合、言語モデル演算部１１６は、より類似度の高いクラスタ代表ベクトルＣＶ１の導出元であるクラスタＣ１が、抽出対象テキストのベクトルｖｉとの同一または類似の意味内容であると判定する。 For example, as illustrated, when the similarity between the vector vi and the cluster representative vector CV1 is 0.75 and the similarity between the vector vi and the cluster representative vector CV2 is 0.1, the language model calculation unit 116 determines that the cluster C1, from which the cluster representative vector CV1 with the higher similarity was derived, has the same or similar semantic content as the vector vi of the extraction target text.

言語モデル演算部１１６は、さらに、クラスタＣ１に含まれるタスクテキストの中から、抽出対象テキストのベクトルｖｉと同一または類似の意味内容であるタスクテキストを選択する。 Further, the language model calculation unit 116 selects a task text having the same or similar meaning as the vector vi of the extraction target text from the task texts included in the cluster C1.

The language model generation unit 116a generates a language model 120b that produces data in which the meaning of the task sentence selected in this way is associated with the extraction target text S1 as its meaning.

In the illustrated example, suppose that, among the task texts in the cluster C1, the task text "Tell me today's weather" is determined to have the vector value most similar to (closest in meaning to) that of the extraction target text "How is the weather today?". In that case, when the extraction target text "How is the weather today?" is input, the language model 120b reflects the result of the vector similarity evaluation described above in its estimation and estimates that the extraction target text has the same or a similar meaning as the task text "Tell me today's weather".
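The within-cluster step, in which the closest task text supplies the meaning of the extraction target text, might look like the sketch below. The helper names, the three-dimensional vectors, and the shape of the returned record are all hypothetical; the text does not specify how the associated data is stored.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def associate_meaning(target_text, target_vector, cluster):
    """Pick the task text in the cluster closest to the target and record it as the meaning.

    cluster is a list of (task_text, sentence_vector) pairs, i.e. the teacher sentences.
    """
    task_text, _ = max(cluster, key=lambda item: cosine_similarity(target_vector, item[1]))
    return {"text": target_text, "meaning": task_text}

# Hypothetical sentence vectors for the task texts of cluster C1.
cluster_c1 = [
    ("Tell me today's weather", [1.0, 0.0, 0.0]),
    ("Tell me tomorrow's weather", [0.6, 0.8, 0.0]),
    ("Tell me if the weather is sunny", [0.5, 0.0, 0.9]),
]
record = associate_meaning("How is the weather today?", [0.9, 0.1, 0.0], cluster_c1)
```

Here `record["meaning"]` is "Tell me today's weather", which corresponds to the estimation result in the illustrated example.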

The command output unit 118 outputs, to the terminal device 20, a command based on the task text "Tell me today's weather", which is the estimation result. The terminal device 20 then executes, based on the processing result of the information processing apparatus 100, the command based on the task text "Tell me today's weather" (for example, acquiring information about today's weather via the network NW).

Note that the similarity of the semantic content of texts may be evaluated by a method other than cosine similarity; for example, a text comparison based on the Levenshtein distance or on the Jaro-Winkler distance may be performed.
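As one of the alternatives mentioned, the Levenshtein distance can be computed with the standard dynamic-programming recurrence. The sketch below is a generic textbook version; the normalization into a 0-to-1 similarity score is an assumption added for comparability with cosine similarity and does not come from the text.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop to save memory
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def levenshtein_similarity(a: str, b: str) -> float:
    """Hypothetical normalization: 1.0 for identical strings, 0.0 for entirely different ones."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest
```

Unlike cosine similarity over sentence vectors, this compares surface character strings, so it captures spelling-level closeness rather than closeness in meaning.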

[Language model generation process flow]
The process by which the information processing apparatus 100 generates the language model 120b is described below. The information processing apparatus 100 generates, for example, a language model 120b for each type of corpus I1. The administrator of the information processing apparatus 100 may also change or update the language model calculation text 120h periodically; for example, the language model 120b is regenerated whenever the language model calculation text 120h is changed or updated.

FIG. 17 is a flowchart showing an example of the flow of the process by which the information processing apparatus 100 generates the language model 120b.

First, the acquisition unit 102 acquires character information (corpus I1) to be used as a corpus (S100). Next, the analysis unit 104 analyzes the corpus I1 by an analysis method such as morphological analysis, realized, for example, by applying the corpus I1 to the acoustic model 120a, and stores the analysis result in the storage unit 120 as the corpus analysis result 120c (S102). Next, the W2V execution unit 110 generates a vector value (word vector) for each word constituting the character information included in the corpus analysis result 120c (S104) and stores the vector values in the storage unit 120 as the word vector list 120f (S106).

Next, the acquisition unit 102 acquires the task text I2 (S106). Next, the analysis unit 104 analyzes the task text I2 in the same manner as the corpus I1 (S108) and stores the analysis result in the storage unit 120 as the task text analysis result 120d (S110).

Next, the acquisition unit 102 acquires the voice data I3, which is the source of the extraction target text (S112). Next, the analysis unit 104 analyzes the voice data I3 in the same manner as the corpus I1 and the task text I2 and stores the analysis result in the storage unit 120 as the extraction target text analysis result 120e (S114).

Next, the vector conversion unit 112 refers to the task text analysis result 120d and the word vector list 120f, generates sentence vectors for the task texts, and stores them in the storage unit 120 as the task text vector list 120i (S114). Next, the vector conversion unit 112 generates a sentence vector for the extraction target text (S116).
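One common way to obtain a sentence vector from an analysis result and a word vector list is to average the vectors of the words in the sentence. The sketch below assumes that approach purely for illustration; this passage does not state how the vector conversion unit 112 combines word vectors, and the function name, the dictionary layout, and the handling of unknown words are hypothetical.

```python
def sentence_vector(tokens, word_vectors):
    """Average the word vectors of the tokens found in word_vectors.

    Assumption: unknown tokens are skipped, and None is returned when no token is known.
    """
    known = [word_vectors[token] for token in tokens if token in word_vectors]
    if not known:
        return None
    dimension = len(known[0])
    return [sum(vector[k] for vector in known) / len(known) for k in range(dimension)]

# Hypothetical two-dimensional word vector list (stand-in for the word vector list 120f).
word_vectors = {"today": [1.0, 0.0], "weather": [0.0, 1.0]}
vector = sentence_vector(["today", "weather", "please"], word_vectors)
```

In this toy case the token "please" has no word vector and is ignored, so `vector` is the midpoint of the two known word vectors.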

Next, the selection unit 114 selects, based on the sentence vector of the extraction target text and the sentence vectors of the task texts, the sentence vector that serves as the basis of (is to be reflected in) the language model 120b (S118). Next, the language model generation unit 116a generates the language model 120b based on the selection result of the selection unit 114 (S120). This concludes the description of the processing of this flowchart.

[Voice recognition processing]
FIG. 18 is a flowchart showing an example of the flow of voice recognition processing by the information processing apparatus 100.

First, the acquisition unit 102 acquires the voice data I2 from the terminal device 20 (S200). Next, the analysis unit 104 applies the voice data I2 output by the acquisition unit 102 to the acoustic model 120a and generates the extraction target text (S202).

Next, the language model calculation unit 116 applies the extraction target text output by the analysis unit 104 to the language model 120b (S204). Next, the selection unit 114 selects a suitable candidate from the application result output by the language model calculation unit 116 (S206).

Next, the language model generation unit 116a generates output information based on the suitable candidate (S208). Next, the command output unit 118 outputs the output information to the terminal device 20 or the like (S210). This concludes the description of the processing of this flowchart.

According to the information processing apparatus 100 of the embodiment described above, voice recognition processing can be performed more efficiently and with higher accuracy by providing the following: an acquisition unit 102 that acquires voice data; an analysis unit 104 that analyzes the voice data acquired by the acquisition unit 102 and converts it into text; a frequency calculation unit 106 and a rarity calculation unit 108 that perform at least one of (i) deriving, for each of a plurality of first words included in the text resulting from analysis by the analysis unit 104, a tf value and an idf value (or a tf-idf vector) as a first index value that evaluates the frequency of the first word within the analyzed sentence (voice data I3) that is included in the text and contains the first word, and the rarity of the first word with respect to the sentences included in library information, and associating the first index value with the analyzed sentence, and (ii) deriving, for each of a plurality of second words included in library information such as the corpus I1, the task text I2, and the language model calculation text 120h, a tf value and an idf value (or a tf-idf vector) as a second index value that evaluates the frequency of the second word within a sentence of interest that is included in the library information and contains the second word, and the rarity of the second word with respect to sentences in the library information other than the sentence of interest, and associating the second index value with the sentence of interest; a vector conversion unit 112 that converts the sentences analyzed by the analysis unit 104 into vector values in a distributed representation; a selection unit 114 that selects some sentences from the analyzed sentences or the sentences of interest based on at least one of the first index value and the second index value derived by the frequency calculation unit 106 and the rarity calculation unit 108, and on the result of the vector conversion by the vector conversion unit 112; and a language model generation unit 116a that generates data in which the meaning of a teacher sentence, from among teacher sentences whose meanings are known and whose sentence vectors have been obtained, whose sentence vector is close to that of a selected sentence selected by the selection unit 114 is associated with the selected sentence as its meaning.
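The tf and idf values referred to above can be illustrated with the common textbook definitions. The sketch below is an assumption-laden example: it takes tf as the share of a sentence's tokens equal to the word and idf as the logarithm of the number of library sentences divided by the number containing the word. The exact formulas used by the frequency calculation unit 106 and the rarity calculation unit 108 may differ, and the function names and toy library are hypothetical.

```python
import math
from collections import Counter

def tf(word, sentence_tokens):
    """Frequency of the word within one sentence (share of its tokens)."""
    if not sentence_tokens:
        return 0.0
    return Counter(sentence_tokens)[word] / len(sentence_tokens)

def idf(word, library):
    """Rarity of the word across the library; 0.0 if no sentence contains it (assumption)."""
    containing = sum(1 for tokens in library if word in tokens)
    return math.log(len(library) / containing) if containing else 0.0

def tf_idf(word, sentence_tokens, library):
    """Index value that is high for words frequent in the sentence but rare in the library."""
    return tf(word, sentence_tokens) * idf(word, library)

# Hypothetical tokenized library information.
library = [
    ["today", "weather"],
    ["tomorrow", "weather"],
    ["play", "music"],
]
score_weather = tf_idf("weather", library[0], library)
score_today = tf_idf("today", library[0], library)
```

Because "weather" appears in two of the three library sentences while "today" appears in only one, `score_today` is larger than `score_weather` for the first sentence.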

[Modification]
The language model 120b generated by the language model generation unit 116a may be a language model specialized for a fixed word. "Specialized for a fixed word" means, for example, that the input language is assumed always to contain a fixed word (such as "weather", "hot spring", or "baseball" in the examples above), or a word identical or similar to the fixed word, and that only processing related to the fixed word is assumed.

In that case, the frequency calculation unit 106 and the rarity calculation unit 108 perform processing with the first word fixed when the language model is generated based on the extraction target text, and with the second word fixed when the language model is generated based on the corpus. When the language model is generated based on both the corpus and the extraction target text, the frequency calculation unit 106 and the rarity calculation unit 108 perform processing with the first word and the second word fixed to the same word. This makes it possible to generate, for example, a language model 120b specialized for the word "hot spring" or a language model 120b specialized for the word "weather".
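One way to picture generating data per fixed word is to partition the tokenized sentences by the fixed word they contain, so that each fixed word gets its own set of sentences from which a specialized model could be built. The sketch below is only an illustration under that assumption; the function name, the token lists, and the grouping strategy are hypothetical and are not stated in the text.

```python
def group_by_fixed_word(fixed_words, sentences):
    """Map each fixed word to the tokenized sentences that contain it."""
    grouped = {word: [] for word in fixed_words}
    for tokens in sentences:
        for word in fixed_words:
            if word in tokens:
                grouped[word].append(tokens)
    return grouped

# Hypothetical tokenized sentences; "hot spring" is treated as a single token.
sentences = [
    ["tell", "me", "today's", "weather"],
    ["find", "a", "hot spring", "nearby"],
    ["how", "is", "the", "weather", "tomorrow"],
]
grouped = group_by_fixed_word(["weather", "hot spring"], sentences)
```

Each entry of `grouped` would then feed the index value derivation and selection steps for that word alone, yielding one set of associated data per fixed word.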

Although modes for carrying out the present invention have been described above using embodiments, the present invention is in no way limited to these embodiments, and various modifications and substitutions can be made without departing from the gist of the present invention.

20 ... Terminal device, 30 ... Device to be controlled, 40 ... Service server, 100 ... Information processing apparatus, 102 ... Acquisition unit, 104 ... Analysis unit, 106 ... Frequency calculation unit, 108 ... Rarity calculation unit, 110 ... W2V execution unit, 112 ... Vector conversion unit, 114 ... Selection unit, 114a ... Reliability derivation unit, 116 ... Language model calculation unit, 116a ... Language model generation unit, 118 ... Command output unit, 120b ... Language model

Claims

An information processing apparatus comprising:
an acquisition unit that acquires voice data;
an analysis unit that analyzes the voice data and converts it into text;
an index value derivation unit that, for each of a plurality of first words included in the text resulting from analysis by the analysis unit, derives a first index value that evaluates the frequency of the first word within an analyzed sentence that is included in the text and contains the first word, and the rarity of the first word with respect to sentences included in library information, and associates the first index value with the analyzed sentence;
a vector conversion unit that converts the sentences analyzed by the analysis unit into vector values in a distributed representation;
a selection unit that selects some of the analyzed sentences based on the first index value derived by the index value derivation unit and the conversion result of the vector conversion unit; and
a generation unit that generates data in which the meaning of a teacher sentence, from among teacher sentences whose meanings are known and whose vector values have been obtained, whose vector value is close to that of a selected sentence selected by the selection unit is associated with the selected sentence as its meaning,
wherein the index value derivation unit, for each of a plurality of second words included in the library information, derives a second index value that evaluates the frequency of the second word within a sentence of interest that is included in the library information and contains the second word, and the rarity of the second word with respect to sentences in the library information other than the sentence of interest, and associates the second index value with the sentence of interest,
the selection unit selects some sentences from the analyzed sentences or the sentences of interest based on at least one of the first index value and the second index value derived by the index value derivation unit and on the conversion result of the vector conversion unit,
the index value derivation unit
performs processing with the first word fixed when deriving only the first index value,
performs processing with the second word fixed when deriving only the second index value, and
performs processing with the first word and the second word fixed to the same word when deriving both the first index value and the second index value, and
the generation unit generates the associated data for each fixed word.

An information processing apparatus comprising:
an acquisition unit that acquires voice data;
an analysis unit that analyzes the voice data and converts it into text;
an index value derivation unit that, for each of a plurality of first words included in the text resulting from analysis by the analysis unit, derives a first index value that evaluates the frequency of the first word within an analyzed sentence that is included in the text and contains the first word, and the rarity of the first word with respect to sentences included in library information, and associates the first index value with the analyzed sentence;
a vector conversion unit that converts the sentences analyzed by the analysis unit into vector values in a distributed representation;
a selection unit that selects some of the analyzed sentences based on the first index value derived by the index value derivation unit and the conversion result of the vector conversion unit; and
a generation unit that generates data in which the meaning of a teacher sentence, from among teacher sentences whose meanings are known and whose vector values have been obtained, whose vector value is close to that of a selected sentence selected by the selection unit is associated with the selected sentence as its meaning,
wherein the index value derivation unit, for each of a plurality of second words included in the library information, derives a second index value that evaluates the frequency of the second word within a sentence of interest that is included in the library information and contains the second word, and the rarity of the second word with respect to sentences in the library information other than the sentence of interest, and associates the second index value with the sentence of interest,
the selection unit selects some sentences from the analyzed sentences or the sentences of interest based on at least one of the first index value and the second index value derived by the index value derivation unit and on the conversion result of the vector conversion unit,
the information processing apparatus further comprises a command output unit that estimates the meaning of a recognized sentence based on the vector value converted by the vector conversion unit and outputs a command based on the estimation result,
the vector conversion unit converts the recognized sentence included in the text resulting from analysis by the analysis unit into a vector value in a distributed representation,
the command output unit estimates the meaning of the recognized sentence based on the similarity between its vector value and those of the sentences included in the associated data, and outputs a command based on the estimation result, and
the selection unit determines, based on position information attached to the voice data, whether the voice data includes the user's intention to execute a task.

The information processing apparatus according to claim 1 or 2, wherein at least one of the first index value and the second index value is a tf-idf value.

The information processing apparatus according to any one of claims 1 to 3, further comprising a reliability derivation unit that derives a reliability of the analysis result, wherein the selection unit selects the analyzed sentences based on the reliability.

The information processing apparatus according to claim 4, wherein the selection unit preferentially selects analyzed sentences obtained from an analysis result whose reliability is equal to or higher than a threshold value.

The information processing apparatus according to claim 5, wherein the selection unit ends the selection process when a selected sentence has been selected from an analysis result whose reliability is equal to or higher than the threshold value.

The information processing apparatus according to claim 2, wherein the selection unit changes the accuracy with which the corresponding task is selected according to the input environment of the voice data estimated based on the position information.

An information processing method in which a computer:
acquires voice data;
analyzes the voice data and converts it into text;
for each of a plurality of first words included in the text of the analysis result, derives a first index value that evaluates the frequency of the first word within an analyzed sentence that is included in the text and contains the first word, and the rarity of the first word with respect to sentences included in library information, and associates the first index value with the analyzed sentence;
converts the analyzed sentences into vector values in a distributed representation;
selects some of the analyzed sentences based on the first index value and the vector conversion result;
generates data in which the meaning of a teacher sentence, from among teacher sentences whose meanings are known and whose vector values have been obtained, whose vector value is close to that of a selected sentence is associated with the selected sentence as its meaning;
for each of a plurality of second words included in the library information, derives a second index value that evaluates the frequency of the second word within a sentence of interest that is included in the library information and contains the second word, and the rarity of the second word with respect to sentences in the library information other than the sentence of interest, and associates the second index value with the sentence of interest;
in the selection, selects some sentences from the analyzed sentences or the sentences of interest based on at least one of the first index value and the second index value and on the vector conversion result;
when deriving the first index value or the second index value,
performs processing with the first word fixed when deriving only the first index value,
performs processing with the second word fixed when deriving only the second index value, and
performs processing with the first word and the second word fixed to the same word when deriving both the first index value and the second index value; and
generates the associated data for each fixed word.

An information processing method in which a computer:
acquires voice data;
analyzes the voice data and converts it into text;
for each of a plurality of first words included in the text of the analysis result, derives a first index value that evaluates the frequency of the first word within an analyzed sentence that is included in the text and contains the first word, and the rarity of the first word with respect to sentences included in library information, and associates the first index value with the analyzed sentence;
converts the analyzed sentences into vector values in a distributed representation;
selects some of the analyzed sentences based on the first index value and the vector conversion result;
generates data in which the meaning of a teacher sentence, from among teacher sentences whose meanings are known and whose vector values have been obtained, whose vector value is close to that of a selected sentence is associated with the selected sentence as its meaning;
for each of a plurality of second words included in the library information, derives a second index value that evaluates the frequency of the second word within a sentence of interest that is included in the library information and contains the second word, and the rarity of the second word with respect to sentences in the library information other than the sentence of interest, and associates the second index value with the sentence of interest;
in the selection, selects some sentences from the analyzed sentences or the sentences of interest based on at least one of the first index value and the second index value and on the vector conversion result;
estimates the meaning of a recognized sentence based on the converted vector value and outputs a command based on the estimation result;
converts the recognized sentence included in the text of the analysis result into a vector value in a distributed representation;
estimates the meaning of the recognized sentence based on the similarity between its vector value and those of the sentences included in the associated data, and outputs a command based on the estimation result; and
determines, based on position information attached to the voice data, whether the voice data includes the user's intention to execute a task.

A program that causes a computer to:
acquire voice data;
analyze the voice data and convert it into text;
for each of a plurality of first words included in the text of the analysis result, derive a first index value that evaluates the frequency of the first word within an analyzed sentence that is included in the text and contains the first word, and the rarity of the first word with respect to sentences included in library information, and associate the first index value with the analyzed sentence;
convert the analyzed sentences into vector values in a distributed representation;
select some of the analyzed sentences based on the first index value and the vector conversion result;
generate data in which the meaning of a teacher sentence, from among teacher sentences whose meanings are known and whose vector values have been obtained, whose vector value is close to that of a selected sentence is associated with the selected sentence as its meaning;
for each of a plurality of second words included in the library information, derive a second index value that evaluates the frequency of the second word within a sentence of interest that is included in the library information and contains the second word, and the rarity of the second word with respect to sentences in the library information other than the sentence of interest, and associate the second index value with the sentence of interest;
in the selection, select some sentences from the analyzed sentences or the sentences of interest based on at least one of the first index value and the second index value and on the vector conversion result;
when deriving the first index value or the second index value,
perform processing with the first word fixed when deriving only the first index value,
perform processing with the second word fixed when deriving only the second index value, and
perform processing with the first word and the second word fixed to the same word when deriving both the first index value and the second index value; and
generate the associated data for each fixed word.

A program that causes a computer to:
acquire voice data;
analyze the voice data and convert it into text;
for each of a plurality of first words included in the text of the analysis result, derive a first index value that evaluates the frequency of the first word within an analyzed sentence that is included in the text and contains the first word, and the rarity of the first word with respect to sentences included in library information, and associate the first index value with the analyzed sentence;
convert the analyzed sentences into vector values in a distributed representation;
select some of the analyzed sentences based on the first index value and the vector conversion result;
generate data in which the meaning of a teacher sentence, from among teacher sentences whose meanings are known and whose vector values have been obtained, whose vector value is close to that of a selected sentence is associated with the selected sentence as its meaning;
for each of a plurality of second words included in the library information, derive a second index value that evaluates the frequency of the second word within a sentence of interest that is included in the library information and contains the second word, and the rarity of the second word with respect to sentences in the library information other than the sentence of interest, and associate the second index value with the sentence of interest;
in the selection, select some sentences from the analyzed sentences or the sentences of interest based on at least one of the first index value and the second index value and on the vector conversion result;
estimate the meaning of a recognized sentence based on the converted vector value and output a command based on the estimation result;
convert the recognized sentence included in the text of the analysis result into a vector value in a distributed representation;
estimate the meaning of the recognized sentence based on the similarity between its vector value and those of the sentences included in the associated data, and output a command based on the estimation result; and
determine, based on position information attached to the voice data, whether the voice data includes the user's intention to execute a task.