JP2022181759A - Voice quality evaluation device, voice quality evaluation method, and voice quality evaluation program

Info

Publication number: JP2022181759A
Application number: JP2021088897A
Authority: JP
Inventors: 尚大辻; Hisashi Otsuji
Original assignee: GREEN KK
Current assignee: GREEN KK
Priority date: 2021-05-27
Filing date: 2021-05-27
Publication date: 2022-12-08

Abstract

PROBLEM TO BE SOLVED: To provide a voice quality evaluation device, a voice quality evaluation method, and a voice quality evaluation program that simplify evaluation of transmission quality for voice data transmitted via a network.
SOLUTION: A voice quality evaluation device 1 includes: a voice generation unit 12 that converts reference text into reference voice data; a reference voice transmission unit 13 that transmits the reference voice data generated by the voice generation unit to a playback device 50 via a network NW; an evaluation target voice acquisition unit 14 that acquires evaluation target voice data that is played back from the playback device; a voice recognition unit 15 that generates evaluation target text by voice-recognizing words included in the evaluation target voice data; and an evaluation unit 16 that evaluates quality of the voice received via the network based on the evaluation target text.
SELECTED DRAWING: Figure 1

Description

The present invention relates to technology for evaluating voice quality.

In recent years, opportunities to hold video conferences, remote classes, and the like over communication networks have increased. There is therefore a need for technology that can easily evaluate the quality of the speech transmitted in such settings.

For example, Patent Document 1 proposes a speech evaluation method and the like that calculates a speech evaluation score by comparing the initial speech recognition result of a sample, obtained after error correction and filtering of the input speech, with the filtered original text.
Patent Document 2 proposes a technique in which input speech recognition data is recorded in a storage unit and a search unit matches the speech recognition data against dictionary data to produce a speech recognition result.

Patent Document 1: JP 2016-51179 A
Patent Document 2: JP 2018-54717 A

A subjective quality evaluation method called MOS (Mean Opinion Score) has long been used to evaluate the voice quality of telephone systems. Also known is POLQA (Perceptual Objective Listening Quality Assessment), an objective quality method that uses a computer to estimate the result of a subjective MOS evaluation. POLQA compares an original speech signal, called the reference speech, with a recording of that speech made on the receiving side, and calculates a MOS value.

In voice quality evaluation for video conferences, remote classes, and the like, what matters is whether the receiving side can accurately hear the content of the speaker's utterances over a period ranging from several tens of minutes to several hours. POLQA, however, is not suited to evaluating the quality of speech transmitted over a long period. POLQA defines standards for the reference audio and reference video, and ordinary conversation and speech that do not meet those standards fall outside its scope. In addition, because POLQA compares speech data with speech data, evaluating long speech data involves a large amount of data and an enormous processing cost. It is therefore difficult to evaluate changes over time, such as how packetized audio and video are affected by changes in network conditions.

Furthermore, because POLQA evaluates the difference between the original speech and the speech on the receiving side, it cannot be said to properly evaluate how accurately the utterance content can be heard on the receiving side. For example, even when processing that improves voice quality, such as equalizing, has been applied, POLQA determines that the speech differs and gives a low evaluation. Similarly, in situations such as video conferences and remote classes where the collected sound contains ambient noise or reverberation, even if the speech is played back after noise removal, POLQA determines that the speech differs and the evaluation is lowered.

Accordingly, an object of the present invention is to easily evaluate the transmission quality of speech data transmitted via a network.

To achieve the above object, a voice quality evaluation device according to one aspect of the present invention comprises: a speech generation unit that converts a reference text into reference speech data; a reference speech transmission unit that transmits the reference speech data generated by the speech generation unit to a playback device via a network; an evaluation target speech acquisition unit that acquires evaluation target speech data played back from the playback device; a speech recognition unit that performs speech recognition on words included in the evaluation target speech data to generate evaluation target text; and an evaluation unit that evaluates the quality of speech received via the network based on the evaluation target text.

The evaluation unit may perform the quality evaluation by comparing the evaluation target text with the reference text.

The speech recognition unit may perform speech recognition on words included in the reference speech data to generate a second reference text, and the evaluation unit may perform the quality evaluation by comparing the evaluation target text with the second reference text.

The speech recognition unit may perform speech recognition on words included in the reference speech data to generate a second reference text, and the evaluation unit may perform a first evaluation by comparing the reference text with the second reference text, perform a second evaluation by comparing the reference text with the evaluation target text, and perform the quality evaluation based on the results of the first evaluation and the second evaluation.

To achieve the above object, a voice quality evaluation method according to another aspect of the present invention includes: a speech generation process of converting a reference text into reference speech data; a reference speech transmission process of transmitting the reference speech data generated by the speech generation process to a playback device via a network; an evaluation target speech acquisition process of acquiring evaluation target speech data played back from the playback device; a speech recognition process of performing speech recognition on words included in the evaluation target speech data to generate evaluation target text; and an evaluation process of evaluating the quality of speech received via the network based on the evaluation target text.

To achieve the above object, a voice quality evaluation program according to still another aspect of the present invention causes a computer to execute: a speech generation instruction for converting a reference text into reference speech data; a reference speech transmission instruction for transmitting the reference speech data generated by the speech generation instruction to a playback device via a network; an evaluation target speech acquisition instruction for acquiring evaluation target speech data played back from the playback device; a speech recognition instruction for performing speech recognition on words included in the evaluation target speech data to generate evaluation target text; and an evaluation instruction for evaluating the quality of speech received via the network based on the evaluation target text.
The computer program can be provided stored on various data-readable recording media, or provided for download via a network such as the Internet.

According to the voice quality evaluation device of the present invention, the transmission quality of speech data transmitted via a network can be evaluated easily.

FIG. 1 is a functional block diagram of a voice quality evaluation device according to a first embodiment of the present invention.
FIG. 2 is a conceptual diagram showing an example of data processed in the voice quality evaluation device: (a) an example of a reference text, (b) an example of reference speech data obtained by reading the reference text aloud, (c) an example of evaluation target speech data received via a network, and (d) an example of evaluation target text obtained by speech recognition of the speech data received via the network.
FIG. 3 is a flowchart showing the flow of a series of processes executed by the voice quality evaluation device.
FIG. 4 is a flowchart showing the flow of a series of processes executed by a voice quality evaluation device according to a second embodiment of the present invention.
FIG. 5 is a flowchart showing the flow of a series of processes executed by a voice quality evaluation device according to a third embodiment of the present invention.

A voice quality evaluation device, a voice quality evaluation method, and a voice quality evaluation program according to embodiments of the present invention will be described below with reference to the drawings.

<First Embodiment>
●Configuration of the Voice Quality Evaluation Device
As shown in FIG. 1, the voice quality evaluation device 1 is a device that evaluates the quality of speech transmitted and received via a network NW. The network NW may be, for example, the Internet or any other suitable communication line connected by wire or wirelessly, and may take any form.

For example, the voice quality evaluation device 1 is connected to a playback device 50 via the network NW. The playback device 50 is a terminal such as a personal computer, a smartphone, or a tablet, on which a participant in a video call is watching and listening.

The voice quality evaluation device 1 transmits speech data of a reference speech (hereinafter also referred to as "reference speech data") to the playback device 50 via the network NW. The playback device 50 converts this reference speech data into sound and plays it back. The reference speech data may reach the playback device 50 by way of a suitable device on the network NW, such as a server. The voice quality evaluation device 1 acquires the speech data that was transmitted via the network NW and is played back by the playback device 50, and evaluates the quality of this speech.

The voice quality evaluation device 1 includes a storage medium such as a memory, a processor, a communication module, an input/output interface, and the like, and the processor executes a computer program recorded in the storage medium to realize the functional blocks shown in FIG. 1. The storage medium is a computer-readable recording medium and may include storage devices such as RAM (random access memory), ROM (read only memory), a disk drive, an SSD (solid state drive), and flash memory. A non-transitory storage device such as ROM, a disk drive, an SSD, or flash memory may be included in the voice quality evaluation device 1 as a storage device separate from the memory.

With the hardware configuration described above, the voice quality evaluation device 1 mainly includes, for example, a reference text acquisition unit 11, a speech generation unit 12, a reference speech transmission unit 13, an evaluation target speech acquisition unit 14, a speech recognition unit 15, and an evaluation unit 16. Part or all of the configuration of the voice quality evaluation device 1 may be realized by a different hardware configuration, and part or all of it may be realized by a cloud computer. Some of the functions of the voice quality evaluation device 1 may also be implemented inside the playback device 50. In that case, for example, the evaluation target speech acquisition unit 14, the speech recognition unit 15, and the evaluation unit 16 may be implemented in the playback device 50.

The reference text acquisition unit 11 is a functional unit that acquires the reference text used for speech quality evaluation. The reference text is text data such as that shown in FIG. 2(a), for example, and is not limited to Japanese; it may be in any suitable language. The reference text acquisition unit 11 may acquire the reference text via a suitable network, or may accept it as input through a suitable input means of the voice quality evaluation device 1.

As shown in the conceptual diagram of FIG. 2(b), the speech generation unit 12 is a functional unit that converts the reference text acquired by the reference text acquisition unit 11 into reference speech data. The speech generation unit 12 may read the reference text aloud with a synthetic voice to convert it into reference speech data. Alternatively, the speech generation unit 12 may display the reference text on a display unit such as a monitor and prompt a speaker who enunciates accurately, such as an announcer, to read it aloud. In that case, the speech generation unit 12 includes a means for picking up the speech read aloud by the speaker.

The reference speech transmission unit 13 is a functional unit that transmits the reference speech data generated by the speech generation unit 12 to the playback device 50 via the network NW. The playback device 50 plays back the received reference speech data as sound. At this time, the playback device 50 may apply suitable signal processing to the speech data before playback. This signal processing may be acoustic signal processing that makes the words in the speech easier to hear, such as equalizing, noise canceling, frequency filtering, or amplification, or it may be processing that fills in information lost during transmission over the network NW.

The reference speech data may reach the playback device 50 by way of a suitable device on the network NW, such as a server. The signal processing described above may also be applied to the reference speech data by that device.

The evaluation target speech acquisition unit 14 is a functional unit that acquires the speech data played back from the playback device 50 (hereinafter also referred to as "evaluation target speech data"). The evaluation target speech acquisition unit 14 is connected to the playback device 50 and acquires the speech data.
As shown in the conceptual diagram of FIG. 2(c), the evaluation target speech data differs in part from the reference speech data. The example in the figure shows the amplitude of part of the data, indicated as region L, being reduced. The evaluation target speech data may be data that has degraded relative to the reference speech data in the course of transmission, or it may be data that has been processed to make the utterance content easier to hear by the signal processing described above, performed in the playback device 50 or in a device on the network NW.

The speech recognition unit 15 is a functional unit that performs speech recognition on words included in the evaluation target speech data to generate evaluation target text. The example of FIG. 2(d) shows part of the text, indicated as region W, differing from the reference text.

The evaluation unit 16 is a functional unit that evaluates the quality of speech received via the network NW based on the evaluation target text. In this embodiment, the evaluation unit 16 evaluates speech quality by comparing the evaluation target text with the reference text.
The speech quality evaluation score is calculated, for example, by the following equation (1).

Quality evaluation score = (number of sounds correctly recognized in the evaluation target text / number of sounds in the reference text) × 100 ... (1)

A larger quality evaluation score indicates better speech quality, and the score is 100 when every sound is correctly recognized in the evaluation target text. In the example of FIG. 2, the reference text contains 93 characters of sound, of which 89 are correctly recognized in the evaluation target text, so the quality evaluation score is 96 (89 / 93 × 100 ≈ 95.7, rounded). The quality evaluation score is displayed on a suitable display unit that is part of, or connected to, the voice quality evaluation device 1.
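Equation (1) can be sketched in a few lines of Python. The description does not say how "correctly recognized" sounds are counted, so the sketch below makes an assumption: it counts the characters that Python's `difflib.SequenceMatcher` aligns between the two texts. The function names are hypothetical and are not taken from the patent.

```python
from difflib import SequenceMatcher


def count_matching_sounds(reference: str, recognized: str) -> int:
    """Count reference characters that SequenceMatcher aligns with `recognized`.

    Assumption: one character stands for one sound, and an aligned character
    counts as "correctly recognized".
    """
    matcher = SequenceMatcher(None, reference, recognized, autojunk=False)
    return sum(block.size for block in matcher.get_matching_blocks())


def quality_score(reference: str, recognized: str) -> float:
    """Equation (1): matched sounds / sounds in the reference text x 100."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return count_matching_sounds(reference, recognized) * 100 / len(reference)


# The FIG. 2 numbers from the description: 89 of 93 sounds recognized.
print(round(89 * 100 / 93))  # 96
```

A true edit-distance alignment could be swapped in for `SequenceMatcher`; the choice here only keeps the sketch within the standard library.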

The evaluation unit 16 may calculate a single quality evaluation score for the entire evaluation target text, or it may divide the evaluation target text into a plurality of parts and calculate a quality evaluation score for each part, so that a plurality of quality evaluation scores along the time axis are calculated for one reference text. Calculating a plurality of quality evaluation scores for one reference text makes it possible to evaluate how the transmission state of the network changes over time. In this case, the evaluation unit 16 may divide the evaluation target text into parts that do not overlap, or into parts that partially overlap on the time axis. Alternatively, the speech recognition unit 15 may divide the speech data and convert each part into text data, in which case the evaluation unit 16 evaluates each piece of text data.
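A minimal sketch of per-segment scoring follows. It rests on several assumptions that the description does not state: the texts are aligned once with `difflib.SequenceMatcher`, segments are windows of a fixed number of characters taken along the reference text, and character position stands in for the time axis. `window` and `step` are hypothetical parameters.

```python
from difflib import SequenceMatcher
from typing import List


def matched_flags(reference: str, recognized: str) -> List[bool]:
    """Flag each reference character as matched or unmatched in `recognized`."""
    flags = [False] * len(reference)
    matcher = SequenceMatcher(None, reference, recognized, autojunk=False)
    for block in matcher.get_matching_blocks():
        for i in range(block.a, block.a + block.size):
            flags[i] = True
    return flags


def windowed_scores(reference: str, recognized: str,
                    window: int, step: int) -> List[float]:
    """Score windows of `window` characters, advancing by `step` characters.

    step == window gives non-overlapping segments; step < window gives
    partially overlapping segments. The final window may be shorter.
    """
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    flags = matched_flags(reference, recognized)
    scores = []
    for start in range(0, len(flags), step):
        segment = flags[start:start + window]
        scores.append(sum(segment) * 100 / len(segment))
    return scores
```

A dip in the returned list localizes where the utterance was damaged, which is the kind of time-dependent view the paragraph above describes.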

●Processing Flow
The processing flow by which the voice quality evaluation device 1 evaluates transmitted speech will be described with reference to FIG. 3.
First, the reference text acquisition unit 11 acquires the reference text (step S11). Next, the speech generation unit 12 obtains reference speech data based on the reference text (step S12). Next, the reference speech transmission unit 13 transmits the reference speech to the playback device 50 (step S13).

Next, the evaluation target speech acquisition unit 14 acquires the evaluation target speech data played back by the playback device 50 (step S14). Next, the speech recognition unit 15 performs speech recognition on the evaluation target speech data and generates text data (step S15). The evaluation unit 16 then evaluates the match rate between the reference text and the evaluation target text and calculates the quality evaluation score (step S16). Finally, the quality evaluation score is displayed on a suitable display unit (step S17).
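Steps S11 to S17 can be sketched as a small pipeline. Everything here is illustrative: the speech synthesis, transmission, and recognition stages are passed in as callables, and the stand-ins `fake_tts`, `lossy_channel`, and `fake_asr` are hypothetical placeholders that treat UTF-8 bytes as "audio" so the sketch runs without audio hardware or a real recognizer.

```python
from difflib import SequenceMatcher
from typing import Callable


def match_rate(reference: str, recognized: str) -> float:
    """Percentage of reference characters aligned with `recognized` (assumed metric)."""
    if not reference:
        raise ValueError("reference text must not be empty")
    matcher = SequenceMatcher(None, reference, recognized, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched * 100 / len(reference)


def run_evaluation(
    reference_text: str,                         # S11: acquired reference text
    synthesize: Callable[[str], bytes],          # S12: text -> reference speech data
    send_and_capture: Callable[[bytes], bytes],  # S13-S14: transmit, capture playback
    recognize: Callable[[bytes], str],           # S15: speech data -> text
) -> float:
    reference_audio = synthesize(reference_text)
    target_audio = send_and_capture(reference_audio)
    target_text = recognize(target_audio)
    return match_rate(reference_text, target_text)  # S16


# Hypothetical stand-ins: "audio" is simply the UTF-8 encoding of the text.
def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


def lossy_channel(audio: bytes) -> bytes:
    return audio.replace(b"quality", b"qua_ity")  # simulate one damaged sound


def fake_asr(audio: bytes) -> str:
    return audio.decode("utf-8")


score = run_evaluation("voice quality test", fake_tts, lossy_channel, fake_asr)
print(f"quality evaluation score: {score:.1f}")  # S17: display the score
```

Passing the stages in as callables mirrors the statement that parts of the device may run elsewhere, for example in the playback device or in the cloud.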

With the voice quality evaluation device according to the present invention, the quality of transmission over a network can be evaluated from the standpoint of whether the receiving side can accurately hear what the speaker is saying. In addition, because the device compares text data with text data, the amount of data to be analyzed is smaller than in a configuration that compares speech data with speech data. Transmission quality can therefore be evaluated even for long utterances. Moreover, because the evaluation target speech is converted to text data before evaluation, it is possible to properly evaluate whether the utterance content has been transmitted accurately even when processing that improves sound quality has been applied.

<Second Embodiment>
A voice quality evaluation device according to a second embodiment of the present invention will be described, focusing on the differences from the first embodiment. In this embodiment, the speech recognition unit 15 performs speech recognition on words included in the reference speech data to generate a second reference text, and the evaluation unit 16 evaluates speech quality by comparing the evaluation target text with the second reference text. Configurations that are the same as in the first embodiment are given the same reference signs, and their description is omitted as appropriate.

As shown in FIG. 4, the voice quality evaluation device according to the second embodiment acquires the reference text (step S11) and generates the reference speech data (step S12), after which the speech recognition unit 15 performs speech recognition on the reference speech data to generate the second reference text (step S21). The device also transmits the reference speech data (step S13), acquires the evaluation target speech data via the playback device 50 (step S14), and performs speech recognition (step S15). Step S21 and steps S13 to S15 may be performed in any order, and may be performed simultaneously.

Next, the evaluation unit 16 evaluates the match rate between the second reference text and the evaluation target text and calculates the quality evaluation score (step S22), and this quality evaluation score is displayed on the display unit (step S23).
The quality evaluation score in this case is expressed, for example, by the following equation (2).

Quality evaluation score = (number of sounds correctly recognized in the evaluation target text / number of sounds in the second reference text) × 100 ... (2)

With this configuration, when the speech recognition unit 15 makes recognition errors, the same errors appear in both the second reference text and the evaluation target text, so the effect of those errors on the quality evaluation score can be removed. In other words, in this embodiment, speech quality can be evaluated with the influence of the speech recognition unit 15 excluded.
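The cancellation effect can be illustrated with a toy example. The recognizer below is purely hypothetical: it always mishears "r" as "l". The channel is assumed to be lossless, and `match_rate` is an assumed way of counting correctly recognized sounds, not one given in the description.

```python
from difflib import SequenceMatcher


def match_rate(reference: str, recognized: str) -> float:
    """Percentage of reference characters aligned with `recognized` (assumed metric)."""
    matcher = SequenceMatcher(None, reference, recognized, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched * 100 / len(reference)


def biased_recognizer(spoken: str) -> str:
    """Hypothetical recognizer with a systematic error: 'r' is heard as 'l'."""
    return spoken.replace("r", "l")


reference_text = "right turn"
second_reference_text = biased_recognizer(reference_text)   # S21
evaluation_target_text = biased_recognizer(reference_text)  # S15, lossless channel

score_eq1 = match_rate(reference_text, evaluation_target_text)         # equation (1)
score_eq2 = match_rate(second_reference_text, evaluation_target_text)  # equation (2)
print(score_eq1, score_eq2)
```

In this toy case equation (1) falls below 100 even though nothing was lost in transmission, whereas equation (2) stays at 100 because both texts carry the same recognition errors.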

<Third Embodiment>
A voice quality evaluation device according to a third embodiment of the present invention will be described, focusing on the differences from the second embodiment. In this embodiment, the speech recognition unit 15 performs speech recognition on words included in the reference speech data to generate a second reference text, and the evaluation unit 16 performs a first evaluation by comparing the reference text with the second reference text, performs a second evaluation by comparing the reference text with the evaluation target text, and then evaluates speech quality based on the results of the first evaluation and the second evaluation.
Configurations that are the same as in the first or second embodiment are given the same reference signs, and their description is omitted as appropriate.

As shown in FIG. 5, the voice quality evaluation device according to the third embodiment acquires the reference text (step S11) and generates the reference speech data (step S12), after which the speech recognition unit 15 performs speech recognition on the reference speech data to generate the second reference text (step S21). Next, a first evaluation is performed to calculate the match rate between the reference text and the second reference text (hereinafter also referred to as the "first match rate") (step S31).

The device also transmits the reference speech data (step S13), acquires the evaluation target speech data via the playback device 50 (step S14), and performs speech recognition (step S15). Next, a second evaluation is performed to calculate the match rate between the reference text and the evaluation target text (hereinafter also referred to as the "second match rate") (step S32). Steps S21 and S31 on the one hand, and steps S13 to S15 and S32 on the other, may be performed in any order, and may be performed simultaneously.

Next, the evaluation unit 16 compares the first match rate and the second match rate to calculate the quality evaluation score (step S33), and this quality evaluation score is displayed on the display unit (step S34). For example, the quality evaluation score is expressed by the following equation (3).

Quality evaluation score = (second match rate / first match rate) × 100 ... (3)

In addition to the quality evaluation score, the first match rate and the second match rate may each be displayed on the display unit.

With this configuration, the accuracy of speech recognition by the speech recognition unit 15 can be checked from the first match rate and the second match rate, and the speech quality before and after transmission can be checked from the quality evaluation score.
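Equation (3) normalizes the end-to-end match rate by the recognizer-only match rate. A minimal sketch follows, with hypothetical input values and a guard against division by zero that the description does not mention:

```python
def combined_quality_score(first_match_rate: float,
                           second_match_rate: float) -> float:
    """Equation (3): second match rate / first match rate x 100."""
    if first_match_rate <= 0:
        raise ValueError("first match rate must be positive")
    return second_match_rate * 100 / first_match_rate


# Hypothetical rates: the recognizer alone reaches 98 %, the full path 90 %.
print(round(combined_quality_score(98.0, 90.0), 1))  # 91.8
```

When the two rates are equal the score is 100, meaning that, under this measure, transmission added no errors beyond those of the recognizer itself.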

As described above, according to the voice quality evaluation device of the present invention, the transmission quality of speech data transmitted via a network can be evaluated easily.

Although this description has used the evaluation of speech quality before and after transmission over a network as an example, the voice quality evaluation device according to the present invention is not limited to networks and can be used to evaluate any mechanism that conveys utterance content. Nor is the device limited to evaluating networks: a system may be built in which a speaker conducting a video conference or remote class is shown the evaluation results in near real time or after the fact, can confirm which parts were not transmitted accurately, and is prompted to say those parts again, thereby helping participants share information accurately in video conferences and the like.

1 Voice quality evaluation device
11 Reference text acquisition unit
12 Voice generation unit
13 Reference voice transmission unit
14 Evaluation target voice acquisition unit
15 Speech recognition unit
16 Evaluation unit
NW Network

Claims

1. A voice quality evaluation device comprising:
a voice generation unit that converts a reference text into reference voice data;
a reference voice transmission unit that transmits the reference voice data generated by the voice generation unit to a playback device via a network;
an evaluation target voice acquisition unit that acquires evaluation target voice data played back by the playback device;
a speech recognition unit that performs speech recognition on words included in the evaluation target voice data to generate an evaluation target text; and
an evaluation unit that evaluates the quality of the voice received via the network based on the evaluation target text.

2. The voice quality evaluation device according to claim 1, wherein
the evaluation unit performs the quality evaluation by comparing the evaluation target text with the reference text.

3. The voice quality evaluation device according to claim 1 or 2, wherein
the speech recognition unit performs speech recognition on words included in the reference voice data to generate a second reference text, and
the evaluation unit performs the quality evaluation by comparing the evaluation target text with the second reference text.

4. The voice quality evaluation device according to any one of claims 1 to 3, wherein
the speech recognition unit performs speech recognition on words included in the reference voice data to generate a second reference text, and
the evaluation unit performs a first evaluation by comparing the reference text with the second reference text, performs a second evaluation by comparing the reference text with the evaluation target text, and performs the quality evaluation based on the results of the first evaluation and the second evaluation.

5. A voice quality evaluation method including:
a voice generation process of converting a reference text into reference voice data;
a reference voice transmission process of transmitting the reference voice data generated by the voice generation process to a playback device via a network;
an evaluation target voice acquisition process of acquiring evaluation target voice data played back by the playback device;
a speech recognition process of generating an evaluation target text by performing speech recognition on words included in the evaluation target voice data; and
an evaluation process of evaluating the quality of the voice received via the network based on the evaluation target text.

6. A voice quality evaluation program that causes a computer to execute:
a voice generation instruction for converting a reference text into reference voice data;
a reference voice transmission instruction for transmitting the reference voice data generated by the voice generation instruction to a playback device via a network;
an evaluation target voice acquisition instruction for acquiring evaluation target voice data played back by the playback device;
a speech recognition instruction for generating an evaluation target text by performing speech recognition on words included in the evaluation target voice data; and
an evaluation instruction for evaluating the quality of the voice received via the network based on the evaluation target text.