JP2022111564A

JP2022111564A - Information processing device, program, and information processing method

Info

Publication number: JP2022111564A
Application number: JP2021007078A
Authority: JP
Inventors: 尚史福江; Naofumi Fukue
Original assignee: TIS Inc
Current assignee: TIS Inc
Priority date: 2021-01-20
Filing date: 2021-01-20
Publication date: 2022-08-01

Abstract

To provide an information processing device, a program, and an information processing method capable of evaluating voice recognition accuracy based on a plurality of states of voice output. SOLUTION: An information processing device includes: a voice output control unit that controls a voice output from a voice output device, based on voice output conditions; an acquisition unit that acquires recognition result text information indicating the recognition result of the voice, from a voice recognition device that recognizes the voice output from the voice output device; a calculation unit for calculating recognition accuracy of the recognition result based on test text information and the recognition result text information, referring to a text storage unit that stores the test text information indicating a content of test voice included in the voice; and an output unit for outputting the calculated recognition accuracy and the conditions in association with each other. The voice output control unit changes the conditions based on the recognition accuracy calculated by the calculation unit, and controls an output of the voice from the voice output device based on the changed conditions. SELECTED DRAWING: Figure 3

Description

The present invention relates to an information processing device, a program, and an information processing method.

Techniques for evaluating speech recognition accuracy already exist. For example, the speech recognition accuracy estimation device disclosed in Patent Document 1 below evaluates a speech recognition result by recognizing the input speech and estimating the recognition accuracy of each recognized word based on a word alignment network that indicates, among other things, the probability that the word is correct.

Patent Document 1: JP 2018-25717 A

Speech recognition devices are used in a variety of situations (hereinafter also referred to as "usage scenes"), such as taking minutes in a meeting or controlling devices at home. The state of the speech output to be recognized, such as where the speaker is located and how the speaker speaks, can differ from one usage scene to another. The above patent document, however, has the problem that speech recognition accuracy cannot be evaluated in light of these multiple states of speech output.

In view of the above problem, an object of the present invention is to provide an information processing device, a program, and an information processing method that allow speech recognition accuracy to be evaluated in light of a plurality of states of speech output.

An information processing device according to one aspect of the present invention includes: an audio output control unit that controls the output of audio from an audio output device based on an audio output condition; an acquisition unit that acquires, from a speech recognition device that has recognized the audio output from the audio output device, recognition result text information indicating the recognition result of the audio; a calculation unit that refers to a text storage unit storing test text information indicating the content of test speech included in the audio, and calculates the recognition accuracy of the recognition result based on the test text information and the recognition result text information; and an output unit that outputs the calculated recognition accuracy and the condition in association with each other. The audio output control unit changes the condition based on the recognition accuracy calculated by the calculation unit, and controls the output of audio from the audio output device based on the changed condition.

A program according to one aspect of the present invention causes a computer to realize: an audio output control function that controls the output of audio from an audio output device based on an audio output condition; an acquisition function that acquires, from a speech recognition device that has recognized the audio output from the audio output device, recognition result text information indicating the recognition result of the audio; a calculation function that refers to a text storage unit storing test text information indicating the content of test speech included in the audio, and calculates the recognition accuracy of the recognition result based on the test text information and the recognition result text information; and an output function that outputs the calculated recognition accuracy and the condition in association with each other. The audio output control function changes the condition based on the recognition accuracy calculated by the calculation function, and controls the output of audio from the audio output device based on the changed condition.

In an information processing method according to one aspect of the present invention, a computer: controls the output of audio from an audio output device based on an audio output condition; acquires, from a speech recognition device that has recognized the audio output from the audio output device, recognition result text information indicating the recognition result of the audio; refers to a text storage unit storing test text information indicating the content of test speech included in the audio, and calculates the recognition accuracy of the recognition result based on the test text information and the recognition result text information; outputs the calculated recognition accuracy and the condition in association with each other; changes the condition based on the calculated recognition accuracy; and controls the output of audio from the audio output device based on the changed condition.

According to the present invention, it is possible to provide an information processing device, a program, and an information processing method that allow speech recognition accuracy to be evaluated in light of a plurality of states of speech output.

A diagram for explaining an example system configuration of the evaluation system according to the present embodiment.
A diagram for explaining an overview of the evaluation system according to the present embodiment.
A diagram showing an example of the functional configuration of the control device according to the present embodiment.
A table showing an example of output conditions, each including a plurality of parameters, of the evaluation system according to the present embodiment.
A diagram showing an example screen of the evaluation system according to the present embodiment.
A diagram showing an example screen of the evaluation system according to the present embodiment.
A diagram showing an example of the relationship between the recognition accuracy of the evaluation system according to the present embodiment and the volume of the audio output device.
A diagram showing an example of the relationship between the recognition accuracy of the evaluation system according to the present embodiment and the distance from the audio output device.
A diagram showing an example of the operation of the control device according to the present embodiment.
A diagram showing an example of the hardware configuration of the control device according to the present embodiment.

A preferred embodiment of the present invention (hereinafter referred to as "the present embodiment") will be described with reference to the accompanying drawings. In the drawings, elements given the same reference numerals have the same or similar configurations.

In the present embodiment, the terms "unit", "means", "device", and "system" do not simply denote physical means; they also cover cases where the functions of the "unit", "means", "device", or "system" are realized by software. The functions of one "unit", "means", "device", or "system" may be realized by two or more physical means or devices, and the functions of two or more "units", "means", "devices", or "systems" may be realized by one physical means or device.

<1. System configuration>
An example system configuration of the evaluation system 1 according to the present embodiment will be described with reference to FIG. 1. The evaluation system 1 is a system with which users such as developers and customers evaluate the speech recognition accuracy of a speech recognition device. For example, in the development of a speech recognition device, the evaluation system 1 can be used to evaluate the accuracy of the current product and of the next product of the same model and to check whether the recognition accuracy has improved. The evaluation system 1 can also be used, for example, to evaluate and compare the recognition accuracy of a company's own speech recognition device and that of another company's speech recognition device.

As shown in FIG. 1, the evaluation system 1 includes a control device 100, a speech recognition device 200 to be evaluated, and audio output devices 300a to 300c that output audio. The evaluation system 1 may also be connected to a speech recognition system 500 via a network N, for example. The audio output devices 300a to 300c are collectively referred to as the "audio output device 300" when there is no particular need to distinguish between them.

The network N is composed of a wireless network, a wired network, or both. Examples of the network include networks compliant with a mobile phone network, a PHS (Personal Handy-phone System) network, a wireless LAN (Local Area Network), 3G (3rd Generation), LTE (Long Term Evolution), 4G (4th Generation), 5G (5th Generation), WiMax (registered trademark), infrared communication, Bluetooth (registered trademark), a wired LAN, a telephone line, a power line network, IEEE 1394, and the like.

The control device 100 is an information processing device that controls the output of audio from the audio output device 300. The control device 100 may be connected to the audio output device 300 by wire (for example, USB, HDMI (registered trademark), or a plug and jack) or wirelessly (for example, Wi-Fi or Bluetooth). The control device 100 also acquires recognition result text information from the speech recognition device 200. For simplicity, the control device 100 is described here as a single terminal device, but it is not limited to this. For example, the functions of the control device 100 may be distributed across a plurality of terminal devices.

The "recognition result text information" is information that represents, as text (a character string), the recognition result of the audio output from the audio output device 300.

The speech recognition device 200 is the device evaluated by the evaluation system 1. The speech recognition device 200 is, for example, an information processing device that can communicate with the control device 100 and the speech recognition system 500. The speech recognition device 200 acquires a speaker's speech, converts the acquired speech into text by speech recognition, and records the text.

The speech recognition device 200 may be, for example, a so-called smart speaker that responds to acquired speech through dialogue or the like. As another example, the speech recognition device 200 may be a general-purpose tablet terminal, a smartphone, or the like. For example, a general-purpose tablet terminal may be used as the speech recognition device 200 by installing a dedicated program on it and executing that program.

The audio output device 300 is a device that includes an amplifier circuit and a speaker and outputs audio under the control of the control device 100. The audio output device 300 may be, for example, a speaker device, or an information processing device such as a smartphone or laptop that can communicate with the control device 100. There may be one audio output device 300 or more than one. The audio data of the audio that the control device 100 causes the audio output device 300 to output (hereinafter also simply referred to as "audio data") may be supplied by the control device 100 at the time of control, or may be stored in a storage unit (not shown) of the audio output device 300 or in an external storage device.

The audio that the control device 100 causes the audio output device 300 to output includes test speech for evaluating recognition accuracy. In other words, the evaluation system 1 evaluates how well the speech recognition device 200 was able to recognize the content of the test speech. The output audio may include, for example, noise in addition to the test speech. The test speech may also include a first speech and a second speech, where the second speech differs from the first speech in length, volume, frequency, or the like.

The speech recognition system 500 is a system that can communicate with the speech recognition device 200. The speech recognition system 500 receives the speech data acquired by the speech recognition device 200 and recognizes the speech based on the received data.

<2. System overview>
An overview of the evaluation system 1 will be described with reference to FIG. 2. In this example, a plurality of audio output devices 300 are installed around the speech recognition device 200 to be evaluated, and a test is performed to evaluate the recognition accuracy for the audio output from the plurality of audio output devices 300. When a test is conducted with the evaluation system 1, the speech recognition device 200 and the audio output devices 300 may be installed in a room dedicated to evaluation (a measurement room in this example) so that as little outside sound as possible reaches the speech recognition device 200.

(1) As shown in FIG. 2, the control device 100 first instructs the audio output device 300 to output audio based on an audio output condition (hereinafter also referred to as an "output condition").

(2) The audio output device 300 outputs audio based on this instruction.

The "audio output condition (output condition)" is, for example, a condition specifying how audio is to be output from the audio output device 300. The audio output condition may include, for example, a plurality of parameters relating to audio output (hereinafter also simply referred to as the "plurality of parameters" or the "parameters"). The parameters are described in detail later.

(3) The speech recognition device 200 acquires the audio output from the audio output device 300 and recognizes the acquired audio. This speech recognition processing may be executed by the speech recognition device 200 itself, or may be delegated in part or in whole to the speech recognition system 500. The speech recognition device 200 records the result of the speech recognition as recognition result text information.

(4) The control device 100 acquires the recorded recognition result text information via an API of the speech recognition device 200.

(5) The control device 100 calculates the recognition accuracy of the recognition result based on the test text information and the recognition result text information. Here, the "test text information" is information that represents, as text, the content of the test speech included in the audio output by the audio output device 300.

(6) The control device 100 associates the recognition accuracy calculated in (5) with the output condition applied in (1), and outputs them to a screen, a file, or the like.

(7) The control device 100 feeds back the recognition accuracy calculated in (5) and instructs the audio output device 300 to output audio again. Specifically, the control device 100 changes the output condition based on this recognition accuracy and instructs the audio output device 300 to output audio based on the changed output condition.

The evaluation system 1 may repeat steps (1) to (7) a predetermined number of times. Specifically, the audio output by the audio output device 300 and the speech recognition of that audio output by the speech recognition device 200 may be executed multiple times. In that case, the control device 100 evaluates the recognition accuracy of the speech recognition device 200 while changing the output condition on each execution.
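Steps (1) to (7) and their repetition can be sketched as a simple loop. This is a minimal illustration, not the implementation described in the publication: the class `OutputCondition`, its fields, the callables passed to `run_evaluation`, and the update policy `lower_volume_while_accurate` (with its threshold and 5 dB step) are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class OutputCondition:
    """Hypothetical output condition; the field names are illustrative."""
    audio_file: str
    volume_db: float


def run_evaluation(
    initial: OutputCondition,
    play: Callable[[OutputCondition], None],      # steps (1)-(2)
    fetch_text: Callable[[], str],                # steps (3)-(4)
    score: Callable[[str, str], float],           # step (5)
    update: Callable[[OutputCondition, float], OutputCondition],  # step (7)
    test_text: str,
    rounds: int,
) -> List[Tuple[OutputCondition, float]]:
    """Repeat steps (1)-(7) and return (condition, accuracy) pairs."""
    results: List[Tuple[OutputCondition, float]] = []
    condition = initial
    for _ in range(rounds):
        play(condition)                            # instruct the audio output
        recognized = fetch_text()                  # recognition result text
        accuracy = score(test_text, recognized)
        results.append((condition, accuracy))      # step (6): keep them paired
        condition = update(condition, accuracy)    # feed the accuracy back
    return results


def lower_volume_while_accurate(
    cond: OutputCondition,
    accuracy: float,
    threshold: float = 0.9,   # assumed value
    step_db: float = 5.0,     # assumed value
) -> OutputCondition:
    """Example policy (an assumption): lower the volume while accuracy holds."""
    if accuracy >= threshold:
        return replace(cond, volume_db=cond.volume_db - step_db)
    return cond
```

Passing the device access as callables keeps the loop independent of how the audio output device and the speech recognition device are actually reached, which the text leaves open.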

According to the above configuration, the control device 100 can create a plurality of audio output states by changing the output condition when speech recognition accuracy is evaluated. The control device 100 can therefore have the user evaluate speech recognition accuracy in light of a plurality of states of audio output. By creating various audio output states, the control device 100 allows the user to evaluate speech recognition accuracy in environments that assume various usage scenes of the speech recognition device 200.
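Step (5) does not fix a particular accuracy metric in this section. As one possible sketch, the recognition accuracy could be taken as a character-level accuracy derived from the edit distance between the test text and the recognized text. The metric and the function names `edit_distance` and `recognition_accuracy` are assumptions made for illustration only.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def recognition_accuracy(test_text: str, recognized_text: str) -> float:
    """Assumed metric: 1 - (edit distance / length of test text), floored at 0."""
    if not test_text:
        raise ValueError("test_text must not be empty")
    errors = edit_distance(test_text, recognized_text)
    return max(0.0, 1.0 - errors / len(test_text))
```

A character-level measure is used here because it applies equally to languages without word delimiters; a word-level variant would follow the same pattern on token lists.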

<3. Functional configuration>
The functional configuration of the control device 100 according to the present embodiment will be described with reference to FIG. 3. As shown in FIG. 3, the control device 100 includes a control unit 110, a storage unit 130, and a communication unit 140.

The control unit 110 includes an audio output control unit 111, an acquisition unit 112, a calculation unit 113, an output unit 114, and a generation unit 115.

<Audio output control unit>
The audio output control unit 111 controls the output of audio from the audio output device 300 based on the output condition. For example, when the output condition is "output audio data A", the audio output control unit 111 may refer to the storage unit 130, which stores audio data A, and transmit audio data A to the audio output device 300 via the communication unit 140. As another example, when the output condition is "output audio data with a volume of 〇〇 [dB] and a length of 〇〇 [s] at 1x playback speed, twice, at h1h1:m1m1 and h1h1:m2m2 on yy/mm/dd", the audio output control unit 111 may acquire audio data matching the output condition from the storage unit 130, establish a communicable connection with the audio output device 300, and transmit to the audio output device 300 the acquired audio data together with an instruction to output the audio. The instruction includes the parameters contained in the output condition (in this example, the output timing, the number of outputs, and the playback speed).

The output condition may be stored in the storage unit 130 as condition information, for example. A plurality of output conditions may also be combined into a test scenario and stored as condition information.

The audio data may include, for example, test speech data for the test speech and noise data for producing noise. The test speech data and the noise data may be separate, individual pieces of data, or may be mixed together. The test speech data may also include first test speech data for a first test speech and second test speech data for a second test speech whose length differs from that of the first test speech.

The audio output control unit 111 changes the output condition based on the recognition accuracy calculated by the calculation unit 113. The audio output control unit 111 then controls the output of audio from the audio output device 300 based on the changed output condition.

According to the above configuration, the audio output control unit 111 can create a plurality of audio output states by changing the output condition when speech recognition accuracy is evaluated. The audio output control unit 111 can therefore have the user evaluate speech recognition accuracy in light of a plurality of states of audio output. By creating various audio output states, the audio output control unit 111 allows the user to evaluate speech recognition accuracy in environments that assume various usage scenes of the speech recognition device 200.

For example, when the audio output by the audio output device 300 and the speech recognition of that audio output by the speech recognition device 200 are executed multiple times, the audio output control unit 111 may control the audio output by changing, on each of those executions, one or more of the parameters used among the plurality of parameters.

According to the above configuration, by changing which of the plurality of audio output parameters are used, the audio output control unit 111 can vary where the output conditions differ from one another. The audio output control unit 111 can therefore have the user evaluate speech recognition accuracy in a varied set of environments.

The plurality of parameters may include, for example, at least two of the following: information for specifying the audio data of the audio to be output by the audio output device 300; the sound source localization position of the audio (including the distance and direction from each audio output device 300, with the speech recognition device 200 as the reference); the speech rate of the audio (including not only the rate at which a speaker read the content of the test speech aloud but also the playback speed of the audio); the volume of the audio; the length of the audio; information about the speaker of the audio (for example, gender, age group, or information for identifying whether the voice is human or synthesized); the output timing; and the number of outputs. The "information for specifying the audio data" is, for example, the name of the file that stores the audio data or information for accessing that file (such as a URL). The parameters may be combined for each output condition, for example as shown in FIG. 4.
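The parameter list above could be represented as a record in which every parameter is optional, so that each output condition uses only a subset. The following is a sketch under that assumption; the class name `OutputParameters`, the field names, and the units are hypothetical and do not come from the publication.

```python
from dataclasses import dataclass, fields
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class OutputParameters:
    """Hypothetical container for the audio output parameters listed above."""
    audio_ref: Optional[str] = None          # file name or URL of the audio data
    distance_m: Optional[float] = None       # sound source distance (assumed unit)
    direction_deg: Optional[float] = None    # sound source direction (assumed unit)
    speech_rate: Optional[float] = None      # e.g. 1.0 for 1x playback
    volume_db: Optional[float] = None
    length_s: Optional[float] = None
    speaker_info: Optional[str] = None       # e.g. gender, age group, human/synthetic
    output_times: Optional[Tuple[str, ...]] = None  # output timing
    output_count: Optional[int] = None       # number of outputs

    def used_parameters(self) -> List[str]:
        """Names of the parameters that this output condition actually sets."""
        return [f.name for f in fields(self) if getattr(self, f.name) is not None]


# Example: a condition that varies only the audio data and the volume.
condition = OutputParameters(audio_ref="test_a.wav", volume_db=60.0)
print(condition.used_parameters())
```

Keeping unused parameters as `None` makes it easy to see which parameters differ between two output conditions in a test scenario.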

The parameters may include, for example, device selection information. Here, the "device selection information" is information that selects, for each of the multiple executions, which of the plurality of audio output devices 300 have their output turned on.

The parameters may include, for example, device designation information. Here, the "device designation information" is information that designates, for each of the multiple executions, whether each audio output device 300 is to output the test speech or noise.

The parameters may include, for example, the length of the audio to be output by the audio output device 300.

For example, when the audio data is data obtained by mixing test speech data and noise data, the parameters may be parameters relating to this mixing (hereinafter also referred to as "mixing parameters").

A parameter may be, for example, the difference between the volume of the test speech and the volume of the noise. A parameter may also be, for example, the difference between the volume of the first test speech and the volume of the second test speech.
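As an illustration of a mixing parameter, the volume difference between test speech and noise could be applied by scaling the noise before adding it to the speech. The sketch below assumes that the difference is expressed in decibels and measured on RMS amplitude, and that both signals are equal-length lists of samples; none of this is specified in the publication, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def rms(samples: Sequence[float]) -> float:
    """Root-mean-square amplitude of a non-empty sample sequence."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def mix_with_level_difference(
    speech: Sequence[float],
    noise: Sequence[float],
    diff_db: float,
) -> List[float]:
    """Mix so that the noise RMS is diff_db below the speech RMS (assumed meaning)."""
    if not speech or len(speech) != len(noise):
        raise ValueError("speech and noise must be non-empty and equally long")
    noise_rms = rms(noise)
    if noise_rms == 0.0:
        return list(speech)                       # silent noise: nothing to add
    target_noise_rms = rms(speech) / (10 ** (diff_db / 20))
    gain = target_noise_rms / noise_rms           # amplitude gain for the noise
    return [s + gain * n for s, n in zip(speech, noise)]
```

Sweeping `diff_db` from large to small values would then move the condition from nearly clean speech toward noisy speech.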

The audio output control unit 111 may, for example, control the audio output by changing the value of each parameter. For example, the audio output control unit 111 may change the value of each parameter in preset increments (for example, raising the volume by 5 [dB] at a time). The audio output control unit 111 may also change the value of each parameter in steps of a threshold given in threshold information or a reference value given in reference value information.
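Changing a parameter in preset increments amounts to generating an evenly spaced list of values to try. A small sketch follows; the function name `stepped_values` and the example range of 40 to 60 dB are assumptions for illustration.

```python
import math
from typing import List


def stepped_values(start: float, stop: float, step: float) -> List[float]:
    """Values from start up to stop (inclusive) in fixed increments of step."""
    if step <= 0:
        raise ValueError("step must be positive")
    # The small tolerance guards against floating-point error at the upper end.
    count = int(math.floor((stop - start) / step + 1e-9)) + 1
    return [start + i * step for i in range(max(count, 0))]


# Example: raise the volume by 5 dB per round, from 40 dB to 60 dB.
for volume_db in stepped_values(40, 60, 5):
    print(f"round with volume {volume_db} dB")
```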

The audio output control unit 111 may, for example, further control the audio output based on the device selection information by switching the audio output of each of the plurality of audio output devices 300 on or off on each execution.

According to the above configuration, by switching the audio output of each of the plurality of audio output devices 300 on or off, the audio output control unit 111 can vary the sound source localization and have the recognition accuracy evaluated accordingly.

The audio output control unit 111 may, for example, further control the audio output based on the device designation information by switching the audio output by each of the plurality of audio output devices 300 between the test speech and noise on each execution.

例えば、屋外や商業施設など利用場面によっては周囲の騒音などの雑音が音声認識装置２００に認識精度に影響をあたえることも考えられる。上記構成によれば、音声出力制御部１１１は、このような雑音の影響をふまえたうえで認識精度を評価させることができる。このため、上記構成によれば、音声出力制御部１１１は、雑音が多い場面を含む様々な利用場面を想定した評価をさせることができる。 For example, it is conceivable that noise such as ambient noise affects the recognition accuracy of the speech recognition apparatus 200 depending on the usage scene, such as outdoors or in commercial facilities. According to the above configuration, the speech output control section 111 can evaluate the recognition accuracy in consideration of the influence of such noise. Therefore, according to the above configuration, the voice output control unit 111 can perform evaluations assuming various usage scenes including scenes with a lot of noise.

音声出力制御部１１１は、例えば、音声出力装置３００それぞれに出力させる音声を、第１試験音声と第２試験音声とで複数回のそれぞれの回で切り替えて音声出力を制御してもよい。 For example, the audio output control unit 111 may control the audio output by switching the audio output from each audio output device 300 between the first test audio and the second test audio multiple times.

例えば、会議室での会話では複数の発声者がそれぞれの席から発言するなど複数の音声がそれぞれ異なる音源から発声する場面で音声認識装置２００が利用されることが考えられる。上記構成によれば、音声出力制御部１１１は、このような利用場面にそくした出力条件で出力させるよう制御することができる。 For example, in a conversation in a conference room, the speech recognition apparatus 200 may be used in situations where a plurality of voices are uttered from different sound sources, such as when a plurality of speakers speak from their respective seats. According to the above configuration, the voice output control unit 111 can perform control so that voice is output under an output condition suitable for such a usage scene.

＜取得部＞
取得部１１２は、音声出力装置３００から出力された音声を認識した音声認識装置２００から、この音声の認識結果を示す認識結果テキスト情報を取得する。取得部１１２の取得の態様は、どのような態様でもよく、例えば、音声出力装置３００から送信された認識結果テキスト情報のテキストファイルを受信してもよい。また、取得部１１２が認識結果テキスト情報を取得する態様は、他の例として、音声認識装置２００にリモートアクセスして認識結果テキスト情報のテキストファイルを取得してもよいし、音声認識装置２００が実装するＡＰＩに認識結果テキスト情報の参照を指示してその結果として認識結果テキスト情報を取得してもよい。 <Acquisition part>
The acquisition unit 112 acquires, from the speech recognition device 200 that has recognized the speech output from the speech output device 300, recognition result text information indicating the recognition result of that speech. The acquisition unit 112 may acquire this information in any manner. For example, it may receive a text file of the recognition result text information transmitted from the voice output device 300. As other examples, the acquisition unit 112 may remotely access the speech recognition device 200 to obtain a text file of the recognition result text information, or may instruct an API implemented by the speech recognition device 200 to reference the recognition result text information and acquire it as the result.

＜算出部＞
算出部１１３は、試験用テキスト情報を記憶するテキスト記憶部１３１を参照して、この試験用テキスト情報と取得部１１２により取得された認識結果テキスト情報とに基づいて、認識結果の認識精度を算出する。算出部１１３は、算出した認識精度を精度情報として記憶部１３０に記憶する。算出部１１３は、例えば、算出した認識精度と適用した出力条件とを対応付けて精度情報として記憶部１３０に記憶してもよい。算出部１１３は、例えば、試験用テキスト情報のテキストと認識結果テキスト情報のテキストとが一致している度合いを認識精度として算出してもよい。例えば、試験用テキスト情報のテキストと認識結果テキスト情報のテキストとが８割一致した場合には、算出部１１３は、認識精度を「０．８」または「８０％」と算出してもよい。 <Calculation unit>
The calculation unit 113 refers to the text storage unit 131 that stores the test text information, and calculates the recognition accuracy of the recognition result based on this test text information and the recognition result text information acquired by the acquisition unit 112. The calculation unit 113 stores the calculated recognition accuracy in the storage unit 130 as accuracy information. For example, the calculation unit 113 may associate the calculated recognition accuracy with the applied output condition and store them in the storage unit 130 as accuracy information. For example, the calculation unit 113 may calculate, as the recognition accuracy, the degree to which the text of the test text information and the text of the recognition result text information match. For example, when the two texts match by 80%, the calculation unit 113 may calculate the recognition accuracy as "0.8" or "80%".
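
The text does not specify how the degree of matching is measured. As one possible sketch, the similarity ratio from Python's standard `difflib` module can serve as a character-level matching degree. The function name `recognition_accuracy` is hypothetical, and this ratio is an assumption, not the metric of the disclosure (word or character error rate would be common alternatives).

```python
from difflib import SequenceMatcher


def recognition_accuracy(test_text: str, recognized_text: str) -> float:
    """Return a matching degree in [0.0, 1.0] between the two texts.

    Uses difflib's ratio, 2 * M / T, where M is the number of matched
    characters and T is the total length of both texts.
    """
    if not test_text and not recognized_text:
        return 1.0  # two empty texts are treated as a perfect match
    matcher = SequenceMatcher(None, test_text, recognized_text, autojunk=False)
    return matcher.ratio()


# Four of five characters match: 2 * 4 / 10 = 0.8, i.e. "80%".
accuracy = recognition_accuracy("abcde", "abcdx")
```

Because the comparison is character-based, it works on Japanese text without prior word segmentation.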

算出部１１３は、例えば、所定の学習期間において適用した出力条件と当該出力条件下で算出された認識精度との組み合わせに基づいて、複数のパラメータそれぞれの閾値または基準値を特定してもよい。算出部１１３は、特定した閾値を閾値情報として閾値記憶部１３２に記憶する。また、算出部１１３は、例えば、特定した基準値を基準値情報として記憶部１３０に記憶してもよい。 For example, the calculation unit 113 may specify the threshold values or reference values for each of the plurality of parameters based on the combination of the output condition applied in a predetermined learning period and the recognition accuracy calculated under the output condition. The calculation unit 113 stores the identified threshold in the threshold storage unit 132 as threshold information. Further, the calculation unit 113 may store the specified reference value in the storage unit 130 as reference value information, for example.

＜音量の閾値または基準値＞
算出部１１３は、例えば、音量の閾値を特定するにあたって、所定の学習期間における音声の音量とそれに対応する音声の認識精度を学習データとして入力することにより図７に示すような音量と認識精度の第１パターンモデルを構築してもよい。算出部１１３は、例えば、音量を説明変数（特徴量）とし認識精度を目的変数（特徴量）として、回帰分析による統計処理を用いて第１パターンモデルを構築してもよい。算出部１１３は、構築した第１パターンモデルに音声の音量を入力して、認識精度を算出してもよい。算出部１１３は、例えば、利用場面ごとの音量の取りうる範囲を、複数の認識精度の値（本例では、十分な認識精度（０．９）と許容できる認識精度（０．７）とする）により複数の段階（「高：十分な認識精度が得られる音量」「中：許容できる認識精度が得られる音量」）に区分けする。算出部１１３は、例えば、区分けした２つの範囲（Ｒ１またはＲ２）のうちいずれかの範囲の上限値および／または下限値を、音量の閾値として特定してもよい。また、算出部１１３は、例えば、第１パターンモデルにおける認識精度が最大となる音量の値（ｄ１）を音量の基準値として特定してもよい。 <Volume threshold or reference value>
For example, when specifying the volume threshold, the calculation unit 113 may construct a first pattern model of volume and recognition accuracy, as shown in FIG. 7, by inputting as learning data the volume of the voice in a predetermined learning period and the corresponding recognition accuracy. For example, the calculation unit 113 may construct the first pattern model using statistical processing based on regression analysis, with volume as an explanatory variable (feature amount) and recognition accuracy as an objective variable (feature amount). The calculation unit 113 may calculate the recognition accuracy by inputting the sound volume to the constructed first pattern model. For example, the calculation unit 113 divides the possible range of volume for each usage scene into a plurality of levels ("high: volume at which sufficient recognition accuracy is obtained" and "medium: volume at which acceptable recognition accuracy is obtained") using a plurality of recognition accuracy values (in this example, a sufficient recognition accuracy of 0.9 and an acceptable recognition accuracy of 0.7). The calculation unit 113 may specify, for example, the upper limit value and/or the lower limit value of one of the two divided ranges (R1 or R2) as the volume threshold. Further, the calculation unit 113 may specify, for example, the volume value (d1) that maximizes the recognition accuracy in the first pattern model as the volume reference value.
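
A minimal sketch of how such thresholds could be derived, assuming the simplest regression model (a straight line fitted by ordinary least squares). The learning data, function names, and the linear model itself are illustrative assumptions; the curve in FIG. 7 may have a different shape, and a linear model has no interior maximum, so the reference value d1 is not computed here.

```python
from typing import Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def volume_for_accuracy(slope: float, intercept: float, target: float) -> float:
    """Invert the fitted line: the volume at which the model predicts `target`."""
    if slope == 0:
        raise ValueError("model is flat; no volume reaches the target")
    return (target - intercept) / slope


# Hypothetical learning data: volume in dB and the accuracy measured at it.
volumes_db = [40.0, 50.0, 60.0, 70.0]
accuracies = [0.5, 0.6, 0.7, 0.8]

slope, intercept = fit_line(volumes_db, accuracies)
lower_bound_medium = volume_for_accuracy(slope, intercept, 0.7)  # start of "medium"
lower_bound_high = volume_for_accuracy(slope, intercept, 0.9)    # start of "high"
```

With this data, the "medium" range begins where the model predicts 0.7 and the "high" range where it predicts 0.9. Either boundary could be stored as the volume threshold.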

＜距離の閾値または基準値＞
算出部１１３は、例えば、所定の学習期間における音声認識装置２００と音声出力装置３００との距離とそれに対応する認識精度を学習データとして入力することにより図８に示すような音声出力装置３００との距離と認識精度の第２パターンモデルを構築してもよい。算出部１１３は、例えば、音声出力装置３００との距離を説明変数（特徴量）とし認識精度を目的変数（特徴量）として、回帰分析による統計処理を用いて第２パターンモデルを構築してもよい。算出部１１３は、構築した第２パターンモデルに音声出力装置３００との距離を入力して、認識精度を算出してもよい。算出部１１３は、例えば、発声者との距離の取りうる範囲を、上記音量の例と同様に、複数の認識精度の値により複数の段階に区分けする。算出部１１３は、例えば、区分けした２つの範囲（Ｒ３またはＲ４）のうちいずれかの範囲の上限値および／または下限値を、音声出力装置３００との距離の閾値として特定してもよい。また、算出部１１３は、例えば、第２パターンモデルにおける認識精度が最大となる音量の値（ｄ２）を距離の基準値として特定してもよい。 <Distance threshold or reference value>
For example, the calculation unit 113 may construct a second pattern model of the distance from the speech output device 300 and the recognition accuracy, as shown in FIG. 8, by inputting as learning data the distance between the speech recognition device 200 and the speech output device 300 in a predetermined learning period and the corresponding recognition accuracy. For example, the calculation unit 113 may construct the second pattern model using statistical processing based on regression analysis, with the distance from the audio output device 300 as an explanatory variable (feature amount) and the recognition accuracy as an objective variable (feature amount). The calculation unit 113 may input the distance from the voice output device 300 to the constructed second pattern model to calculate the recognition accuracy. The calculation unit 113, for example, divides the possible range of distance from the speaker into a plurality of levels based on a plurality of recognition accuracy values, as in the volume example above. The calculation unit 113 may specify, for example, the upper limit value and/or the lower limit value of one of the two divided ranges (R3 or R4) as the threshold for the distance from the audio output device 300. Further, the calculation unit 113 may specify, for example, the value (d2) that maximizes the recognition accuracy in the second pattern model as the distance reference value.

＜周波数の閾値または基準値＞
算出部１１３は、例えば、所定の学習期間における音声出力装置３００に出力させた音声の周波数に基づいて、認識精度を算出してもよい。算出部１１３は、例えば、周波数の統計値（平均値や中央値）を算出し、算出した統計値を周波数の基準値として特定してもよい。また、算出部１１３は、例えば、周波数帯域を算出し、算出した周波数帯域の上限値もしくは下限値を周波数の閾値として特定してもよい。 <Frequency threshold or reference value>
The calculation unit 113 may calculate the recognition accuracy, for example, based on the frequency of the sound output by the sound output device 300 during a predetermined learning period. The calculation unit 113 may, for example, calculate a statistical value (average value or median value) of the frequency and specify the calculated statistical value as the reference value of the frequency. Further, the calculation unit 113 may, for example, calculate the frequency band and specify the upper limit value or the lower limit value of the calculated frequency band as the frequency threshold.

算出部１１３は、例えば、所定の学習期間における音声データの音声に含まれる、子音または所定の閾値以上の高周波数域の少なくともいずれかのパワー（または音圧レベル）を特徴量として抽出してもよい。ここでいう「パワー」とは、いわゆる音響パワーであり、音の周波数分析において、周波数ごとの重み（パワー）を示し、人の聴覚が感じる音の大きさや強さ（音量）とは相違する。パワーは、子音または所定の閾値以上の高周波数域の音声の強さとする。算出部１１３は、抽出した特徴量に基づいて、認識精度を算出してもよい。算出部１１３は、例えば、子音のパワーにより上記で算出した認識精度に重み付けを行い、重み付けを行った複数の認識精度により上記の音量の例と同様に２つの段階（「高」「中」）に区分けしてもよい。 For example, the calculation unit 113 may extract, as a feature amount, the power (or sound pressure level) of at least one of the consonants or the high-frequency range at or above a predetermined threshold contained in the speech of the speech data in a predetermined learning period. The "power" here is so-called acoustic power, which indicates the weight (power) of each frequency in the frequency analysis of sound, and differs from the loudness or strength (volume) of sound as perceived by human hearing. The power is taken to be the strength of the consonants or of the speech in the high-frequency range at or above the predetermined threshold. The calculation unit 113 may calculate the recognition accuracy based on the extracted feature amount. For example, the calculation unit 113 may weight the recognition accuracy calculated above by the power of the consonants and, using the plurality of weighted recognition accuracies, classify into two levels ("high" and "medium") in the same manner as the volume example above.

算出部１１３は、例えば、音声の音圧レベルと周波数とについて、縦軸を音圧レベルとし横軸を周波数とするグラフにプロットしてもよい。算出部１１３は、上記音量の例と同様に、プロットしたデータが取りうる範囲を、２つの認識精度のエリア（「高」「中」）に区分けする。算出部１１３は、例えば、音声の音圧レベルと周波数とが区分けした二つのエリアのいずれに属するかによって、認識精度を算出してもよい。 For example, the calculation unit 113 may plot the sound pressure level and frequency of the sound on a graph in which the vertical axis is the sound pressure level and the horizontal axis is the frequency. The calculation unit 113 divides the possible range of the plotted data into two recognition accuracy areas (“high” and “medium”) in the same manner as the volume example. The calculation unit 113 may calculate the recognition accuracy depending on, for example, to which of two areas the sound pressure level and frequency of the voice belong.

算出部１１３は、例えば、上記のように（ア）音量、（イ）発声者との距離、（ウ）周波数、（エ）子音または所定の閾値以上の高周波数域のパワー、の少なくともいずれかにより算出した認識精度の組み合わせに基づいて、複合的な認識精度（以下、「複合認識精度」ともいう）を算出してもよい。算出部１１３は、複合的な認識精度に基づいて、各パラメータの閾値または基準値を特定してもよい。例えば、算出部１１３は、複合認識精度が最大となるときの各パラメータの値を閾値または基準値として特定してもよい。 For example, the calculation unit 113 may calculate a composite recognition accuracy (hereinafter also referred to as "composite recognition accuracy") based on a combination of the recognition accuracies calculated, as described above, from at least one of (a) the volume, (b) the distance from the speaker, (c) the frequency, and (d) the power of the consonants or of the high-frequency range at or above a predetermined threshold. The calculation unit 113 may specify the threshold value or reference value of each parameter based on the composite recognition accuracy. For example, the calculation unit 113 may specify the value of each parameter at which the composite recognition accuracy is maximized as the threshold value or the reference value.

算出部１１３は、例えば、上記（ア）～（オ）それぞれの認識精度の加重平均を算出して、算出した加重平均を複合認識精度として算出してもよい。算出部１１３は、例えば、この加重平均にあたって、上記の（ア）と（イ）の重要度を他の（ウ）～（オ）より高く設定してもよい。算出部１１３は、例えば、この重要度に比例した係数をそれぞれの認識精度にかけて重み付けをしてもよい。算出部１１３は、具体的には、以下の式によって複合認識精度を算出してもよい。 The calculation unit 113 may, for example, calculate a weighted average of the recognition accuracies of (a) to (e), and calculate the calculated weighted average as the composite recognition accuracy. For example, the calculation unit 113 may set the importance of the above (a) and (b) higher than the other (c) to (e) in this weighted average. For example, the calculation unit 113 may weight each recognition accuracy by applying a coefficient proportional to the degree of importance. Specifically, the calculation unit 113 may calculate the composite recognition accuracy using the following formula.

複合認識精度＝（α×上記（オ）の認識精度＋β×上記（ア）の認識精度＋θ×上記（イ）の認識精度＋δ・上記（ウ）の認識精度）／（α＋β＋θ＋δ） Composite recognition accuracy = (α × recognition accuracy of (e) above + β × recognition accuracy of (a) above + θ × recognition accuracy of (b) above + δ × recognition accuracy of (c) above) / (α + β + θ + δ)

「α」は、上記（オ）の重み係数であり、「β」は、上記（ア）、すなわち音量の重み係数であり、「θ」は上記（イ）、すなわち距離の重み係数であり、「δ」は、上記（ウ）、すなわち周波数の重み係数である。βとθは、設定した重要度に応じて、αおよびδより大きい値としてもよい。 "α" is the weighting factor for (e) above, "β" is the weighting factor for (a) above, that is, the volume, "θ" is the weighting factor for (b) above, that is, the distance, and "δ" is the weighting factor for (c) above, that is, the frequency. β and θ may be set to values larger than α and δ according to the set importance.
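
The weighted average above can be sketched as follows. The dictionary keys and the sample numbers are hypothetical; only the form of the formula (weighted sum divided by the sum of the weights) is taken from the text.

```python
from typing import Dict


def composite_accuracy(accuracies: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of per-factor recognition accuracies.

    Implements sum(w_k * a_k) / sum(w_k) over the factors in `accuracies`.
    """
    total_weight = sum(weights[name] for name in accuracies)
    if total_weight <= 0:
        raise ValueError("sum of weights must be positive")
    weighted_sum = sum(weights[name] * value for name, value in accuracies.items())
    return weighted_sum / total_weight


# Hypothetical values: volume and distance are given twice the weight of the
# other factors, reflecting the higher importance mentioned in the text.
accuracies = {"volume": 0.9, "distance": 0.8, "frequency": 0.6, "other": 0.5}
weights = {"volume": 2.0, "distance": 2.0, "frequency": 1.0, "other": 1.0}
result = composite_accuracy(accuracies, weights)  # (1.8 + 1.6 + 0.6 + 0.5) / 6
```

Dividing by the sum of the weights keeps the result in the same 0 to 1 range as the individual accuracies, whatever scale is chosen for the weights.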

算出部１１３は、例えば、複数のパラメータの中で少なくとも二以上のパラメータの間の相関関係の度合いを示す相関度（相関係数）を算出してもよい。算出部１１３は、例えば、音量と認識精度との相関度として、音量と認識精度との共分散を音量の標準偏差と認識精度の標準偏差との積で割って、音量と認識精度の相関係数を算出してもよい。算出部１１３は、他の例として、多変量解析の技術（重回帰分析やロジスティック回帰など）を用いて、二以上のパラメータの間の相関度として、二以上のパラメータの間の相関係数を算出してもよい。 The calculation unit 113 may calculate, for example, a degree of correlation (correlation coefficient) indicating the degree of correlation between at least two parameters among a plurality of parameters. For example, as the degree of correlation between the volume and the recognition accuracy, the calculation unit 113 may calculate the correlation coefficient between the two by dividing the covariance of the volume and the recognition accuracy by the product of the standard deviation of the volume and the standard deviation of the recognition accuracy. As another example, the calculation unit 113 may use a multivariate analysis technique (such as multiple regression analysis or logistic regression) to calculate a correlation coefficient between two or more parameters as the degree of correlation between them.
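
The covariance-over-standard-deviations calculation described above is the Pearson correlation coefficient. A minimal sketch follows; the function name and sample data are hypothetical.

```python
import math
from typing import Sequence


def correlation(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation: cov(x, y) / (std(x) * std(y))."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    if std_x == 0 or std_y == 0:
        raise ValueError("correlation is undefined for constant data")
    return cov / (std_x * std_y)


# Hypothetical measurements: accuracy rises steadily with volume.
r_volume = correlation([40.0, 50.0, 60.0, 70.0], [0.5, 0.6, 0.7, 0.8])
```

The result lies between -1 and 1. A value near 1 means accuracy rose with the parameter, and a value near -1 means it fell.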

算出部１１３は、例えば、認識結果の信頼度を算出してもよい。算出部１１３は、例えば、認識結果に含まれる単語ごとの信頼度を算出し、算出した単語ごとの信頼度を集計して認識結果の認識精度を算出してもよい。 The calculation unit 113 may calculate, for example, the reliability of the recognition result. For example, the calculation unit 113 may calculate the reliability of each word included in the recognition result, and aggregate the calculated reliability of each word to calculate the recognition accuracy of the recognition result.

単語ごとの信頼度は、例えば、所定の範囲の値（例えば、０．０～１．０の範囲）を有してもよい。この所定の範囲の値の中で数値が１．０、すなわち上限に近いほど、単語ごとの信頼度は、その単語に似たスコアをもつ他の競合候補が相対的に少ないことを示す。他方、この所定の範囲の値の中で数値が０．０、すなわち下限に近いほど、単語ごとの信頼度は、その単語に似たスコアをもつ他の競合候補が相対的に多いことを示す。すなわち、所定の範囲の中で数値が上限に近ければ近いほど、単語ごとの信頼度は、認識結果の一位候補の単語に近い他の候補がなく、信頼（確信）をもってその認識結果を出力したということがいえる。 The confidence of each word may, for example, take a value within a predetermined range (for example, 0.0 to 1.0). The closer the value is to 1.0, that is, the upper limit of this range, the fewer competing candidates there are with scores similar to that word. Conversely, the closer the value is to 0.0, that is, the lower limit, the more competing candidates there are with scores similar to that word. In other words, the closer the value is to the upper limit of the range, the more it can be said that no other candidate came close to the top candidate word and that the recognition result was output with confidence.

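単語の信頼度の算出方法は、いくつかの方法が考えられるが、例えば、駒谷、河原著「音声認識結果の信頼度を用いた効率的な確認・誘導を行う対話処理」（情報処理学会論文誌、Ｖｏｌ．４３、Ｎｏ．１０、ｐｐ３０７８－３０８６）が知られている。 Several methods for calculating the confidence of a word are conceivable. One known example is Komatani and Kawahara, "Dialogue processing for efficient confirmation and guidance using the confidence of speech recognition results" (IPSJ Journal, Vol. 43, No. 10, pp. 3078-3086).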
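
The text leaves open how the per-word confidences are aggregated into a recognition accuracy. As one hedged possibility, the sketch below takes their arithmetic mean. The function name, the input format of (word, confidence) pairs, and the choice of the mean are all assumptions.

```python
from typing import Sequence, Tuple


def aggregate_confidence(word_confidences: Sequence[Tuple[str, float]]) -> float:
    """Aggregate per-word confidences (each in 0.0 to 1.0) by their mean."""
    if not word_confidences:
        raise ValueError("no words in the recognition result")
    for word, confidence in word_confidences:
        if not 0.0 <= confidence <= 1.0:
            raise ValueError(f"confidence out of range for {word!r}")
    return sum(confidence for _, confidence in word_confidences) / len(word_confidences)


# Hypothetical recognition result with two words.
utterance_score = aggregate_confidence([("会議", 0.9), ("議事録", 0.7)])
```

Other aggregations, such as the minimum or a length-weighted mean, would fit the description equally well.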

＜出力部＞
出力部１１４は、精度情報を記憶する記憶部１３０を参照して、算出部１１３により算出された認識精度と出力条件とを対応付けて出力する。出力部１１４がこの対応付けを出力する態様はどのような態様であってもよく、例えば、画面に出力してもよいし、ｃｓｖファイルや表形式のファイルに出力してもよい。 <Output part>
The output unit 114 refers to the storage unit 130 that stores accuracy information, and outputs the recognition accuracy calculated by the calculation unit 113 and the output condition in association with each other. The output unit 114 may output the correspondence in any manner. For example, the correspondence may be output to a screen, a csv file, or a tabular file.

ここで図５および図６を参照して、出力部１１４の出力の一例を説明する。 Here, an example of the output of the output unit 114 will be described with reference to FIGS. 5 and 6.

図５に示すように、出力部１１４は、認識精度評価画面Ａ１を、評価対象の音声認識装置２００ごとに、ユーザ端末（不図示）に出力（表示）させてもよい。認識精度評価画面Ａ１は、評価対象の音声認識装置２００に適用する出力条件ごとに、出力条件を識別するための情報（本例ではＮｏ）と、出力条件の内容と、出力条件に対応する認識精度と、出力条件による音声出力させるための実行ボタンと、が含まれている。この実行ボタンがユーザにより押下されると、制御装置１００から音声出力装置３００に対して音声出力が指示される。 As shown in FIG. 5, the output unit 114 may output (display) a recognition accuracy evaluation screen A1 on a user terminal (not shown) for each speech recognition device 200 to be evaluated. For each output condition applied to the speech recognition device 200 to be evaluated, the recognition accuracy evaluation screen A1 includes information for identifying the output condition (No in this example), the details of the output condition, the recognition accuracy corresponding to the output condition, and an execution button for outputting audio under the output condition. When the user presses the execution button, the control device 100 instructs the audio output device 300 to output audio.

図６に示すように、出力部１１４は、認識精度評価画面Ａ２を、評価対象の音声認識装置ごと、かつ利用場面（本例では、「会議室（小規模）での議事録作成」とする）ごとに、ユーザ端末に出力させてもよい。認識精度評価画面Ａ２では、評価対象の音声認識装置２００および利用場面に適用する出力条件として、この利用場面のパラメータの閾値を超えないものを抽出して出力される。認識精度評価画面Ａ２は、抽出された出力条件ごとに、出力条件を識別するための情報（本例ではＮｏ）と、出力条件の内容と、出力条件に対応する認識精度と、出力条件による音声出力させるための実行ボタンと、が含まれている。この実行ボタンがユーザにより押下されると、制御装置１００から音声出力装置３００に対して音声出力が指示される。 As shown in FIG. 6, the output unit 114 may output a recognition accuracy evaluation screen A2 on the user terminal for each speech recognition device to be evaluated and for each usage scene (in this example, "creating minutes in a (small) conference room"). On the recognition accuracy evaluation screen A2, the output conditions applied to the speech recognition device 200 to be evaluated and to the usage scene are limited to those that do not exceed the parameter thresholds of that usage scene, and these are extracted and output. For each extracted output condition, the recognition accuracy evaluation screen A2 includes information for identifying the output condition (No in this example), the details of the output condition, the recognition accuracy corresponding to the output condition, and an execution button for outputting audio under the output condition. When the user presses the execution button, the control device 100 instructs the audio output device 300 to output audio.

出力部１１４は、例えば、閾値情報を記憶する閾値記憶部１３２を参照して、閾値情報に基づいて、利用画面ごとに、認識精度が、所定の精度以上のもの、かつ対応付けられたパラメータが閾値を超えないものを抽出して出力してもよい。また、出力部１１４は、例えば、閾値情報に基づいて、利用画面ごとに、認識精度に対応付けられたパラメータが閾値を超えないものを抽出して出力してもよい。「利用場面」とは、音声認識装置２００を利用する場面である。 For example, the output unit 114 may refer to the threshold storage unit 132 that stores the threshold information and, based on the threshold information, extract and output, for each usage scene, the results whose recognition accuracy is equal to or higher than a predetermined accuracy and whose associated parameters do not exceed the thresholds. Alternatively, the output unit 114 may, for example, extract and output, for each usage scene and based on the threshold information, the results whose parameters associated with the recognition accuracy do not exceed the thresholds. A "usage scene" is a scene in which the speech recognition device 200 is used.

「閾値情報」とは、利用場面ごとの複数のパラメータそれぞれの閾値を示す情報である。閾値情報に示される閾値は、例えば、予め設定された値（固定値）であってもよいし、算出部１１３により特定された値であってもよい。 The “threshold information” is information indicating thresholds for each of a plurality of parameters for each usage scene. The threshold indicated by the threshold information may be, for example, a preset value (fixed value) or a value specified by the calculator 113 .

一般的に音量をあげれば認識精度もよくなる傾向にあるが、利用場面によっては、例えば会議など大きい音量が相応しくないまたは許容できない場面もある。上記構成によれば、出力部１１４は、このように利用場面に応じて、各パラメータを適切なまたは許容できる範囲におさまるもののみ抽出してユーザに対して出力させることができる。このため、上記構成によれば、ユーザは、評価したい利用場面ごとに効率的に評価することができる。 In general, increasing the volume tends to improve the recognition accuracy, but depending on the usage scene, there are situations where a high volume is not appropriate or acceptable, such as in a meeting. According to the above configuration, the output unit 114 can extract only parameters within an appropriate or permissible range and output them to the user according to the situation of use. Therefore, according to the above configuration, the user can efficiently evaluate each use scene that the user wants to evaluate.
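
The threshold-based extraction described above can be sketched as a simple filter. The record layout (a dictionary with `parameters` and `accuracy` keys), the function name, and the sample values are hypothetical, and thresholds are treated here as upper limits.

```python
from typing import Any, Dict, List


def extract_conditions(
    results: List[Dict[str, Any]],
    thresholds: Dict[str, float],
    min_accuracy: float = 0.0,
) -> List[Dict[str, Any]]:
    """Keep results whose accuracy is at least min_accuracy and whose
    parameters do not exceed the thresholds of the usage scene."""
    selected = []
    for result in results:
        if result["accuracy"] < min_accuracy:
            continue
        parameters = result["parameters"]
        within_limits = all(
            parameters[name] <= limit
            for name, limit in thresholds.items()
            if name in parameters
        )
        if within_limits:
            selected.append(result)
    return selected


# Hypothetical accuracy information and thresholds for a small meeting room.
results = [
    {"no": 1, "parameters": {"volume_db": 55, "distance_m": 1.0}, "accuracy": 0.92},
    {"no": 2, "parameters": {"volume_db": 75, "distance_m": 1.0}, "accuracy": 0.97},
    {"no": 3, "parameters": {"volume_db": 50, "distance_m": 2.0}, "accuracy": 0.65},
]
meeting_room_thresholds = {"volume_db": 60}
shown = extract_conditions(results, meeting_room_thresholds, min_accuracy=0.7)
```

In this example, condition 2 is dropped because it is too loud for the scene even though it scores best, and condition 3 is dropped for low accuracy.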

出力部１１４は、例えば、基準値情報を記憶する基準値記憶部（不図示）を参照して、基準値情報に基づいて、利用画面ごとに、認識精度が、所定の精度以上のもの、かつ対応付けられたパラメータが基準値から所定範囲内のものを抽出して出力してもよい。また、出力部１１４は、例えば、基準値情報に基づいて、利用画面ごとに、認識精度に対応付けられたパラメータが基準値から所定範囲内のものを抽出して出力してもよい。 For example, the output unit 114 may refer to a reference value storage unit (not shown) that stores the reference value information and, based on the reference value information, extract and output, for each usage scene, the results whose recognition accuracy is equal to or higher than a predetermined accuracy and whose associated parameters are within a predetermined range of the reference values. Alternatively, the output unit 114 may, for example, extract and output, for each usage scene and based on the reference value information, the results whose parameters associated with the recognition accuracy are within a predetermined range of the reference values.

「基準値情報」とは、利用場面ごとの複数のパラメータそれぞれの基準値を示す情報である。基準値情報に示される閾値は、例えば、予め設定された値（固定値）であってもよいし、算出部１１３により特定された値であってもよい。 “Reference value information” is information indicating reference values for each of a plurality of parameters for each usage scene. The threshold indicated in the reference value information may be, for example, a preset value (fixed value) or a value specified by the calculator 113 .

一般的に音量をあげれば認識精度もよくなる傾向にあるが、利用場面によっては、例えば会議など大きい音量が相応しくないまたは許容できない場面もある。上記構成によれば、出力部１１４は、このように利用場面に応じて、各パラメータを基準となる値に基づいて抽出してユーザに対して出力させることができる。このため、上記構成によれば、ユーザは、評価したい利用場面ごとに効率的に評価することができる。 In general, increasing the volume tends to improve the recognition accuracy, but depending on the usage scene, there are situations where a high volume is not appropriate or acceptable, such as in a meeting. According to the above configuration, the output unit 114 can thus extract each parameter based on the reference value and output it to the user according to the situation of use. Therefore, according to the above configuration, the user can efficiently evaluate each use scene that the user wants to evaluate.

出力部１１４は、例えば、出力条件を出力するにあたって、出力条件に含まれる複数のパラメータを認識精度との相関度の大きい順に並べ替えて（ソートして）出力してもよい。また、出力部１１４は、例えば、出力条件を出力するにあたって、出力条件に含まれる複数のパラメータの中で認識精度と相関度が最大のパラメータを識別可能に（例えば、協調表示など）出力してもよい。このような構成によれば、出力部１１４は、どのパラメータが認識精度により影響をあたえたかをユーザに把握させつつ、評価させることができる。 For example, when outputting the output condition, the output unit 114 may sort the plurality of parameters included in the output condition in descending order of their degree of correlation with the recognition accuracy and output them. In addition, when outputting the output condition, the output unit 114 may, for example, output the parameter with the highest degree of correlation with the recognition accuracy among the plurality of parameters in an identifiable manner (for example, by highlighting it). With such a configuration, the output unit 114 lets the user perform the evaluation while understanding which parameter had the greater influence on the recognition accuracy.
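
A minimal sketch of this ordering, assuming the correlation of each parameter with the recognition accuracy has already been calculated. Ranking by absolute value (so that a strong negative correlation also ranks high) is an assumption, as are the function name and the sample values.

```python
from typing import Dict, List, Tuple


def rank_parameters(correlations: Dict[str, float]) -> List[Tuple[str, float, bool]]:
    """Sort parameters by the strength of their correlation with accuracy.

    Returns (name, correlation, highlight) tuples; highlight is True only for
    the most strongly correlated parameter.
    """
    ranked = sorted(correlations.items(), key=lambda item: abs(item[1]), reverse=True)
    return [(name, value, index == 0) for index, (name, value) in enumerate(ranked)]


# Hypothetical correlations: accuracy falls sharply as distance grows.
ranking = rank_parameters({"volume": 0.8, "distance": -0.9, "frequency": 0.2})
```

A display layer could then render the rows in this order and highlight the one whose flag is set.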

＜生成部＞
生成部１１５は、音声合成処理を用いて、試験用テキスト情報に基づいて、試験音声を含む音声データを生成してもよい。 <Generation unit>
The generation unit 115 may generate speech data including the test speech based on the test text information using speech synthesis processing.

生成部１１５は、例えば、第１試験音声データおよび第２試験音声データを、ステレオ音声データに加工してもよい。ここで「ステレオ音声データ」とは、第１試験音声と、第２試験音声とをステレオフォニック再生するための音声データである。生成部１１５は、例えば、ステレオ音声データの加工の前処理として、第１試験音声データまたは第２試験音声データの音声の音像を定位させてもよい。加工部１２２は、例えば、第１試験音声データについて、発声者（チャンネル）ごとに仮想音源の位置に第１試験音声の音像を定位させてもよい。この仮想音源の位置は、例えば、出力条件の音源定位位置にあわせて設定してもよい。 The generator 115 may, for example, process the first test audio data and the second test audio data into stereo audio data. Here, "stereo audio data" is audio data for stereophonically reproducing the first test audio and the second test audio. For example, the generation unit 115 may localize the sound image of the first test sound data or the second test sound data as preprocessing for stereo sound data processing. For example, the processing unit 122 may localize the sound image of the first test sound at the position of the virtual sound source for each speaker (channel) with respect to the first test sound data. The position of the virtual sound source may be set according to the sound source localization position of the output condition, for example.

＜記憶部＞
記憶部１３０は、音声データや基準値情報、精度情報を記憶する。記憶部１３０は、データベースマネジメントシステム（ＤＢＭＳ）を利用して上記の各種情報・データを記憶してもよいし、ファイルシステムを利用して上記の情報を記憶してもよい。ＤＢＭＳを利用する場合は、上記の情報ごとにテーブルを設けて、テーブル間を関連付けてこれらの情報を管理してもよい。また記憶部１３０は、テキスト記憶部１３１と、閾値記憶部１３２と、を備える。 <Memory section>
The storage unit 130 stores audio data, reference value information, and accuracy information. The storage unit 130 may store the above various information and data using a database management system (DBMS), or may store the above information using a file system. When using a DBMS, a table may be provided for each of the above information, and the information may be managed by associating the tables. The storage unit 130 also includes a text storage unit 131 and a threshold storage unit 132 .

テキスト記憶部１３１は、試験用テキスト情報や認識結果テキスト情報を記憶する。閾値記憶部１３２は、閾値情報を記憶する。 The text storage unit 131 stores test text information and recognition result text information. The threshold storage unit 132 stores threshold information.

＜通信部＞
通信部１４０は、有線ネットワークや無線ネットワークを介して、音声認識装置２００や音声出力装置３００などとの間で音声データやテキスト情報などの各種情報・データを送受信する。 <Communication part>
The communication unit 140 transmits and receives various information and data such as voice data and text information to and from the voice recognition device 200 and the voice output device 300 via a wired network or a wireless network.

＜４．動作例＞
図８を参照して、制御装置１００の動作例を説明する。なお、以下に示す図８の動作例の処理の順番は一例であって、適宜、変更されてもよい。本例では、試験シナリオとして１～Ｎ（Ｎは自然数）番目の出力条件に順次基づいて、複数回音声出力と音声認識とを実行し、音声認識装置２００の認識精度をそれぞれの回で算出していく例を説明する。 <4. Operation example>
An operation example of the control device 100 will be described with reference to FIG. 8. Note that the order of processing in the operation example of FIG. 8 shown below is an example, and may be changed as appropriate. In this example, as a test scenario, speech output and speech recognition are executed a plurality of times based in turn on the 1st to Nth (N is a natural number) output conditions, and the recognition accuracy of the speech recognition device 200 is calculated each time.

図８に示すように、制御装置１００の音声出力制御部１１１は、Ｎ番目の出力条件に基づいて、音声出力装置３００からの音声の出力を制御する（Ｓ１０）。次に、取得部１１２は、音声出力装置３００から出力された音声を認識した音声認識装置２００から、この音声の認識結果を示す認識結果テキスト情報を取得する（Ｓ１１）。 As shown in FIG. 8, the audio output control unit 111 of the control device 100 controls audio output from the audio output device 300 based on the Nth output condition (S10). Next, the acquisition unit 112 acquires recognition result text information indicating the recognition result of the speech from the speech recognition device 200 that has recognized the speech output from the speech output device 300 (S11).

次に、算出部１１３は、試験用テキスト情報を記憶するテキスト記憶部１３１を参照して、試験用テキスト情報と認識結果テキスト情報とに基づいて、認識結果の認識精度を算出する（Ｓ１２）。次に、出力部１１４は、算出された認識精度とＮ番目の出力条件とを対応付けて出力する（Ｓ１３）。 Next, the calculation unit 113 refers to the text storage unit 131 that stores the test text information, and calculates the recognition accuracy of the recognition result based on the test text information and the recognition result text information (S12). Next, the output unit 114 outputs the calculated recognition accuracy and the N-th output condition in association with each other (S13).

次に、試験シナリオにＮ＋１番目の出力条件が存在する場合（Ｓ１４のＹｅｓ）、音声出力制御部１１１はＮをインクリメントする（Ｓ１５）。フローはステップＳ１０の前に戻る。 Next, if the test scenario has the (N+1)th output condition (Yes in S14), the audio output control unit 111 increments N (S15). The flow returns to before step S10.

次に、試験シナリオにＮ＋１番目の出力条件が存在しない場合（Ｓ１４のＮｏ）、フローは終了する。 Next, if there is no N+1th output condition in the test scenario (No in S14), the flow ends.
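
The flow of steps S10 to S15 can be sketched as a loop over the output conditions of the test scenario. The callables `play`, `recognize`, and `score` are hypothetical stand-ins for the audio output device, the speech recognition device, and the accuracy calculation. None of these names come from the text.

```python
from typing import Any, Callable, Dict, List, Tuple

Condition = Dict[str, Any]


def run_test_scenario(
    conditions: List[Condition],
    play: Callable[[Condition], None],
    recognize: Callable[[], str],
    score: Callable[[str, str], float],
    reference_text: str,
) -> List[Tuple[Condition, float]]:
    """Run every output condition once and pair it with its accuracy."""
    report = []
    for condition in conditions:                      # S14/S15: next condition
        play(condition)                               # S10: output the audio
        recognized = recognize()                      # S11: get recognized text
        accuracy = score(reference_text, recognized)  # S12: calculate accuracy
        report.append((condition, accuracy))          # S13: associate and output
    return report
```

Passing the devices in as callables keeps the loop testable without real hardware. Stub functions can take the place of the speaker and the recognizer.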

＜５．ハードウェア構成＞
図９を参照して、上述してきた制御装置１００をコンピュータ８００により実現する場合のハードウェア構成の一例を説明する。なお、それぞれの装置の機能は、複数台の装置に分けて実現することもできる。 <5. Hardware configuration>
An example of a hardware configuration in which the control device 100 described above is implemented by a computer 800 will be described with reference to FIG. 9. The functions of each device can also be divided among and implemented by a plurality of devices.

図９に示すように、コンピュータ８００は、プロセッサ８０１と、メモリ８０３と、記憶装置８０５と、入力Ｉ／Ｆ部８０７と、データＩ／Ｆ部８０９と、通信Ｉ／Ｆ部８１１、および表示装置８１３を含む。 As shown in FIG. 9, the computer 800 includes a processor 801, a memory 803, a storage device 805, an input I/F unit 807, a data I/F unit 809, a communication I/F unit 811, and a display device 813.

プロセッサ８０１は、メモリ８０３に記憶されているプログラムを実行することによりコンピュータ８００における様々な処理を制御する。例えば、制御装置１００の制御部１１０が備える各機能部などは、メモリ８０３に一時記憶されたプログラムをプロセッサ８０１が実行することにより実現可能である。 Processor 801 controls various processes in computer 800 by executing programs stored in memory 803 . For example, each functional unit included in the control unit 110 of the control device 100 can be realized by the processor 801 executing a program temporarily stored in the memory 803 .

メモリ８０３は、例えばＲＡＭ（ＲａｎｄｏｍＡｃｃｅｓｓＭｅｍｏｒｙ）等の記憶媒体である。メモリ８０３は、プロセッサ８０１によって実行されるプログラムのプログラムコードや、プログラムの実行時に必要となるデータを一時的に記憶する。 The memory 803 is a storage medium such as a RAM (Random Access Memory). The memory 803 temporarily stores program codes of programs executed by the processor 801 and data necessary for executing the programs.

記憶装置８０５は、例えばハードディスクドライブ（ＨＤＤ）やフラッシュメモリ等の不揮発性の記憶媒体である。記憶装置８０５は、オペレーティングシステムや、上記各構成を実現するための各種プログラムを記憶する。この他、記憶装置８０５は、音声データ、テキスト情報（試験用テキスト情報や認識結果テキスト情報）、閾値情報などを登録するテーブルと、このテーブルを管理するＤＢを記憶することも可能である。このようなプログラムやデータは、必要に応じてメモリ８０３にロードされることにより、プロセッサ８０１から参照される。 The storage device 805 is a non-volatile storage medium such as a hard disk drive (HDD) or flash memory. A storage device 805 stores an operating system and various programs for realizing the above configurations. In addition, the storage device 805 can also store a table for registering voice data, text information (test text information and recognition result text information), threshold information, etc., and a DB for managing this table. Such programs and data are referred to by the processor 801 by being loaded into the memory 803 as necessary.

The input I/F unit 807 is a device for receiving input from the user. Specific examples of the input I/F unit 807 include a keyboard, a mouse, a touch panel, various sensors, and wearable devices. The input I/F unit 807 may be connected to the computer 800 via an interface such as USB (Universal Serial Bus).

The data I/F unit 809 is a device for inputting data from outside the computer 800. A specific example of the data I/F unit 809 is a drive device for reading data stored in various storage media. The data I/F unit 809 may also be provided outside the computer 800. In that case, the data I/F unit 809 is connected to the computer 800 via an interface such as USB.

The communication I/F unit 811 is a device for performing wired or wireless data communication with devices external to the computer 800 via the Internet N. The communication I/F unit 811 may also be provided outside the computer 800. In that case, the communication I/F unit 811 is connected to the computer 800 via an interface such as USB.

The display device 813 is a device for displaying various information. Specific examples of the display device 813 include a liquid crystal display, an organic EL (Electro-Luminescence) display, and the display of a wearable device. The display device 813 may be provided outside the computer 800. In that case, the display device 813 is connected to the computer 800 via, for example, a display cable. When a touch panel is adopted as the input I/F unit 807, the display device 813 can be integrated with the input I/F unit 807.

Note that this embodiment is an example for describing the present invention and is not intended to limit the present invention to this embodiment alone. The present invention can be modified in various ways without departing from its gist. Furthermore, those skilled in the art can adopt embodiments in which the elements described above are replaced with equivalents, and such embodiments are also included in the scope of the present invention.

[Modifications]
Although the present invention has been described based on the above embodiment, the following cases are also included in the present invention.

[Modification 1]
At least part of each component of the control device 100 according to the above embodiment may be provided by a program dedicated to the evaluation system 1 that is installed in the speech recognition device 200. For example, this program may be given the functional units of the control unit 110 of the control device 100, so that the evaluation system 1 consists of the speech recognition device 200 and the voice output device 300 without a separate control device 100.

[Modification 2]
In the above embodiment, the text storage unit and the threshold storage unit were described as being provided in the storage unit 130 of the control device 100, but they are not limited to this arrangement. For example, the text storage unit and the threshold storage unit may be provided in the speech recognition device 200 or in a device of an external system.

[Modification 3]
Although not shown in the above embodiment, the voice output device 300 may be positioned by hand, or the control device 100 and the voice output device 300 may be linked by an arm or the like that positions the device. In the evaluation system 1, the voice output control unit 111 may position the voice output device 300 by controlling this arm (for example, rotating it or extending and retracting it) based on the output condition to be applied.

DESCRIPTION OF SYMBOLS
1: evaluation system, 100: control device, 110: control unit, 111: voice output control unit, 112: acquisition unit, 113: calculation unit, 114: output unit, 115: generation unit, 130: storage unit, 140: communication unit, 200: speech recognition device, 300: voice output device, 500: speech recognition system, 800: computer, 801: processor, 803: memory, 805: storage device, 807: input I/F unit, 809: data I/F unit, 811: communication I/F unit, 813: display device, 817: voice input device, 819: voice output device.

Claims

An information processing device comprising:
a voice output control unit that controls output of voice from a voice output device based on a voice output condition;
an acquisition unit that acquires, from a speech recognition device that has recognized the voice output from the voice output device, recognition result text information indicating a recognition result of the voice;
a calculation unit that refers to a text storage unit storing test text information indicating the content of a test voice included in the voice, and calculates a recognition accuracy of the recognition result based on the test text information and the recognition result text information; and
an output unit that outputs the calculated recognition accuracy and the condition in association with each other,
wherein the voice output control unit changes the condition based on the recognition accuracy calculated by the calculation unit, and controls the output of the voice from the voice output device based on the changed condition.
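
The claim does not fix how the recognition accuracy is derived from the two texts. As one possible illustration only, the sketch below assumes a character-level accuracy defined as 1 minus the edit distance normalized by the length of the test text. The function names `edit_distance` and `recognition_accuracy` are hypothetical and do not appear in the source.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: insertions, deletions, substitutions each cost 1."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a character of ref
                            curr[j - 1] + 1,      # insert a character of hyp
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def recognition_accuracy(test_text: str, recognized_text: str) -> float:
    """Assumed metric: 1 - (edit distance / length of test text), floored at 0."""
    if not test_text:
        return 1.0 if not recognized_text else 0.0
    distance = edit_distance(test_text, recognized_text)
    return max(0.0, 1.0 - distance / len(test_text))
```

Under this assumed metric, a recognition result identical to the test text scores 1.0, and each character-level error lowers the score in proportion to the length of the test text. A word-level metric could be substituted without changing the surrounding flow.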

The information processing device according to claim 1, wherein
the output of the voice by the voice output device and the recognition of the output voice by the speech recognition device are each performed a plurality of times,
the condition includes a plurality of parameters related to the voice output, and
the voice output control unit controls the voice output by changing, at each of the plurality of times, one or more parameters used among the plurality of parameters.

The information processing device according to claim 2, wherein
the output unit outputs the recognition accuracy of each of the speech recognitions performed the plurality of times in association with the one or more parameters used to control the voice output at that time.
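
Repeating output and recognition while varying parameters, and recording each accuracy together with the parameters used, can be sketched as a simple grid sweep. This is a minimal sketch under stated assumptions: the parameter names (`volume_db`, `speech_rate`), the stand-in `fake_output_and_recognize`, and the exact-match scorer are all hypothetical and replace real playback, recognition, and accuracy calculation.

```python
from itertools import product
from typing import Callable, Dict, List, Sequence


def run_trials(
    parameter_grid: Dict[str, Sequence],
    output_and_recognize: Callable[[dict], str],
    score: Callable[[str], float],
) -> List[dict]:
    """Run one trial per parameter combination and keep accuracy with its parameters."""
    names = list(parameter_grid)
    results = []
    for values in product(*(parameter_grid[name] for name in names)):
        params = dict(zip(names, values))
        recognized_text = output_and_recognize(params)  # play voice, get recognized text
        results.append({"params": params, "accuracy": score(recognized_text)})
    return results


# Hypothetical stand-in: pretend that quiet or fast speech is recognized only partially.
def fake_output_and_recognize(params: dict) -> str:
    ok = params["volume_db"] >= 60 and params["speech_rate"] <= 1.0
    return "turn on the light" if ok else "turn on"


# Hypothetical scorer: 1.0 on an exact match with the test text, otherwise 0.0.
def exact_match(recognized_text: str) -> float:
    return 1.0 if recognized_text == "turn on the light" else 0.0


results = run_trials(
    {"volume_db": [50, 60], "speech_rate": [1.0, 1.5]},
    fake_output_and_recognize,
    exact_match,
)
```

Each entry of `results` pairs one parameter combination with the accuracy measured under it, which is the association the output unit is described as producing.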

The information processing device according to claim 3, wherein
the output unit refers to a threshold storage unit that stores threshold information indicating a threshold for each of the plurality of parameters for each usage scene in which the speech recognition device is used, and further, based on the threshold information, extracts and outputs, for each usage scene, the results whose recognition accuracy is equal to or higher than a predetermined accuracy and whose associated parameters do not exceed the thresholds.
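
The per-scene extraction described here can be illustrated as a filter over recorded trial results. The sketch below is an assumption-laden illustration: the result layout (`params`, `accuracy`), the scene names, and the function `extract_by_scene` are hypothetical, and a parameter with no threshold in a given scene is simply not checked.

```python
from typing import Dict, List


def extract_by_scene(
    results: List[dict],
    thresholds_by_scene: Dict[str, Dict[str, float]],
    min_accuracy: float,
) -> Dict[str, List[dict]]:
    """For each usage scene, keep results that are accurate enough and within thresholds."""
    extracted = {}
    for scene, thresholds in thresholds_by_scene.items():
        extracted[scene] = [
            r for r in results
            if r["accuracy"] >= min_accuracy
            and all(
                r["params"][name] <= limit          # parameter must not exceed threshold
                for name, limit in thresholds.items()
                if name in r["params"]
            )
        ]
    return extracted


# Hypothetical trial results and thresholds, for illustration only.
trial_results = [
    {"params": {"volume_db": 60, "distance_m": 1.0}, "accuracy": 0.95},
    {"params": {"volume_db": 80, "distance_m": 1.0}, "accuracy": 0.97},
    {"params": {"volume_db": 60, "distance_m": 3.0}, "accuracy": 0.80},
]
scene_thresholds = {
    "living_room": {"volume_db": 70},
    "car": {"volume_db": 85},
}
usable = extract_by_scene(trial_results, scene_thresholds, min_accuracy=0.9)
```

With these made-up values, the 80 dB trial is kept for the `car` scene but dropped for `living_room`, and the low-accuracy trial is dropped for both.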

The information processing device according to any one of claims 2 to 4, wherein
the plurality of parameters include at least any two of: information on the sound source localization position of the voice to be output from the voice output device, the speech rate of the voice, the volume of the voice, and the speaker of the voice, and
the voice output control unit controls the voice output by changing the values of the parameters.

The information processing device according to any one of claims 2 to 5, wherein
a plurality of the voice output devices exist,
the parameters include device selection information for selecting, at each of the times, the device whose output is to be turned on among the plurality of voice output devices, and the voice output control unit further controls the voice output by switching the voice output of each of the plurality of voice output devices on or off at each of the times based on the device selection information.

The information processing device according to claim 6, wherein
the voice data of the voice to be output by the voice output devices includes test voice data of the test voice and noise data that produces noise,
the parameters include device designation information that designates, at each of the times, which of the test voice and the noise is to be output from each of the voice output devices, and the voice output control unit further controls the voice output by switching, at each of the times and based on the device designation information, the voice to be output by each of the plurality of voice output devices between the test voice and the noise.

The information processing device according to claim 7, wherein
the parameters include the length of the voice to be output by the voice output devices,
the test voice data includes first test voice data of a first test voice and second test voice data of a second test voice that differs in length from the first test voice, and
the voice output control unit controls the voice output by switching, at each of the times, the voice to be output by each of the voice output devices between the first test voice and the second test voice.

A program causing a computer to realize:
a voice output control function of controlling output of voice from a voice output device based on a voice output condition;
an acquisition function of acquiring, from a speech recognition device that has recognized the voice output from the voice output device, recognition result text information indicating a recognition result of the voice;
a calculation function of referring to a text storage unit storing test text information indicating the content of a test voice included in the voice, and calculating a recognition accuracy of the recognition result based on the test text information and the recognition result text information; and
an output function of outputting the calculated recognition accuracy and the condition in association with each other,
wherein the voice output control function changes the condition based on the recognition accuracy calculated by the calculation function, and controls the output of the voice from the voice output device based on the changed condition.

An information processing method in which a computer:
controls output of voice from a voice output device based on a voice output condition;
acquires, from a speech recognition device that has recognized the voice output from the voice output device, recognition result text information indicating a recognition result of the voice;
refers to a text storage unit storing test text information indicating the content of a test voice included in the voice, and calculates a recognition accuracy of the recognition result based on the test text information and the recognition result text information;
outputs the calculated recognition accuracy and the condition in association with each other;
changes the condition based on the calculated recognition accuracy; and
controls the output of the voice from the voice output device based on the changed condition.
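
The method leaves open how the condition is changed in response to the calculated accuracy. One hypothetical policy, shown here only as a sketch, lowers the output volume step by step while the accuracy stays at or above a target and stops at the first condition under which it falls below. The names `adaptive_volume_test` and `measure_accuracy`, and the choice of volume as the parameter to change, are assumptions for illustration.

```python
from typing import Callable, List, Tuple


def adaptive_volume_test(
    initial_volume: int,
    step: int,
    min_volume: int,
    measure_accuracy: Callable[[int], float],
    target_accuracy: float,
) -> List[Tuple[int, float]]:
    """Lower the volume until accuracy drops below the target; return (volume, accuracy) pairs."""
    history = []
    volume = initial_volume
    while volume >= min_volume:
        # Output under the current condition, recognize, and score the result.
        accuracy = measure_accuracy(volume)
        history.append((volume, accuracy))  # associate accuracy with the condition
        if accuracy < target_accuracy:
            break                           # limit found; stop changing the condition
        volume -= step                      # change the condition based on the accuracy
    return history


# Hypothetical stand-in for playback + recognition + scoring.
def fake_measure(volume: int) -> float:
    return 1.0 if volume >= 50 else 0.5


history = adaptive_volume_test(70, 10, 30, fake_measure, target_accuracy=0.9)
```

The returned history keeps every tested condition next to its accuracy, so the last pair marks the condition at which recognition first fell below the target in this made-up scenario.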