JP2009157050A - Uttering verification device and uttering verification method

Info

Publication number: JP2009157050A
Application number: JP2007334330A
Authority: JP
Inventors: Takashi Sumiyoshi; 貴志住吉
Original assignee: Hitachi Omron Terminal Solutions Corp
Current assignee: Hitachi Omron Terminal Solutions Corp
Priority date: 2007-12-26
Filing date: 2007-12-26
Publication date: 2009-07-16

Abstract

PROBLEM TO BE SOLVED: To reduce the cost of verifying utterance content by using automatic speech detection technology.

SOLUTION: An utterance verification system includes a user interface that allows two stages of work to be performed separately: first-stage word detection result verification operations S11 to S14, in which a person verifies only whether the results detected from the target speech waveform by the automatic speech detection technology are correct; and second-stage content verification operations S15 to S18, in which a listening range is determined based on those verification results and a person verifies from the listening range whether the content is correct. The number of detected words to be verified is reduced based on past verification results.

COPYRIGHT: (C)2009, JPO&INPIT

Description

The present invention relates to an utterance verification system that verifies whether predetermined content is uttered in a speech waveform.

In recent years, ways of using voice information have been devised in which conversations between employees and customers are recorded, for example to monitor call center operations in a company, and the validity of the conversations is verified for use in complaint handling and corporate compliance.

When a human listens to recorded utterances to verify them, an enormous amount of time is required in proportion to the length of the recording. Conventional techniques such as speech rate conversion could be applied to shorten the verification time, but human listening ability is limited, so a great deal of time is still needed. Automation by computer processing has also been considered, but it is difficult because of the variability of speech (speaker characteristics, environment, and so on), and no technique that is effective in practical environments with a wide variety of utterances has yet been established. At the current level of technology, therefore, computer automation and human verification should be combined into a system that maximizes efficiency.

One such system is described in Japanese Patent No. 3285145, which sets out a method for verifying a recorded speech database. In this method, speech recognition is used to check automatically whether the labels attached to a large utterance database are correct, and a human listens only to the items that do not match. This is said to greatly reduce the time humans spend on verification.

Japanese Patent No. 3285145

The problem to be solved by the present invention is that, when verifying whether the content of speech uttered at a call center or the like conforms to predetermined rules, the accumulated utterance data must be verified almost as is, which takes a great deal of time.

The prior art uses a computer to assist utterance verification at the word level. However, word-level verification is insufficient for verifying the content of an utterance. The content of an utterance normally consists of at least a sentence, that is, multiple words, and even when the same content is uttered, the choice and order of the words vary. Consequently, being able to verify utterances word by word does not mean that the content has been verified, and it does not solve the problem addressed by the present invention.

Furthermore, the task of verifying words and the task of verifying content demand different levels of knowledge from the person doing the verification. To judge whether content is correct by listening to the speech, the verifier must be familiar with the relevant business and with the judgment criteria. When personnel with such specialized knowledge are required, labor costs rise. This is also a problem to be solved.

The present invention provides an utterance verification system with a user interface that allows two stages of work to be performed separately: a first-stage "word detection result verification" task, which verifies only whether the results detected from the target speech waveform by automatic speech detection technology are correct, and a second-stage "content verification" task, which determines a listening range based on those verification results and verifies from the listening range whether the content is correct. In addition, the number of detected words to be verified is reduced based on past verification results.

The utterance verification device of the present invention comprises: a word detection unit that finds, as a word detection result, each position in an input speech waveform that contains an utterance of a specified word; a detected word verification processing unit that plays back a waveform including the position found as the word detection result and stores the affirmative or negative judgment entered for that word detection as a detected word verification result; a file storing word-based listening range determination rules; and a listening range determination unit that determines the listening range of the speech waveform based on the detected word verification results and the listening range determination rules. The device also has a content verification unit that plays back a waveform including the listening range from the input speech waveform and stores the affirmative or negative judgment entered for the played-back content as a content verification result.

The word detection unit may obtain, as the word detection result, a score indicating the likelihood of the detection together with the position in the input speech waveform that contains the utterance of the specified word. The detected word verification processing unit may then obtain score prior distribution information from a database consisting of detected word verification results previously obtained by the detected word verification processing unit and the scores of those detected words, and judge each word detection affirmative or negative based on the score in the word detection result and the score prior distribution information. In this case, the detected word verification processing unit can estimate a positive score distribution from the scores of the entries judged affirmative in the score prior distribution information, calculate an allowable false rejection score threshold from the positive score distribution and a pre-specified allowable false rejection rate, and compare that threshold with the score of the word detection result to judge the word detection affirmative or negative. Likewise, it can estimate a negative score distribution from the scores of the entries judged negative, calculate an allowable false acceptance score threshold from the negative score distribution and a pre-specified allowable false acceptance rate, and compare that threshold with the score of the word detection result to judge the word detection affirmative or negative.

The utterance verification method according to the present invention determines the listening range of an input speech waveform using an utterance verification device that has a word detection unit, a detected word verification processing unit, a file storing listening range determination rules, and a listening range determination unit. The method comprises: a step in which the word detection unit finds, as a word detection result, each position in the input speech waveform that contains an utterance of a specified word; a step in which the detected word verification processing unit plays back a waveform including the position found as the word detection result and stores the affirmative or negative judgment entered for that word detection as a detected word verification result; and a step in which the listening range determination unit determines the listening range of the speech waveform based on the detected word verification results and the listening range determination rules.

The first-stage word detection result verification is simple, repetitive work that requires no specialized knowledge and can be done quickly. Because only the limited portions identified from its results then need content verification, accurate checking becomes possible while the cost of listening to everything is reduced. Furthermore, by using past verification results to shorten the word detection result verification, the first-stage work itself is shortened, and the working time can be reduced to roughly half or less.

Embodiments of the present invention are described below with reference to the drawings.

First, a first embodiment of the present invention is described. The utterance verification system of the present invention is constituted by the utterance verification device shown in FIG. 1. The utterance verification device 100 comprises an arithmetic device 200, a storage device 300, an output device 400, and an input device 500. The storage device 300 stores programs and data. As programs, it holds a word detection program 311, a listening range determination program 314, a detected word verification UI (User Interface) processing program 313, and a content verification UI processing program 315. As data, it holds a speech waveform 360, a word detection result 370, a detected word verification result 380, and a content verification result 390. It also holds a word dictionary 320, a phoneme model 330, and a listening range determination rule 340. This embodiment assumes the usual mode of computer operation, in which a program is read from a storage device and the arithmetic device operates according to it, but the present invention does not depend on that configuration and can be applied to any computer that operates according to a written program.

Next, the operation of the utterance verification device 100 is described. First, as shown in FIG. 2, the word detection program 311 uses the phoneme model 330 to search the speech waveform 360 for locations judged to correspond to words in the word dictionary 320, and records the results as the word detection result 370. The description below assumes that the speech waveform 360 comprises two speech waveforms, 361 and 362. A speech waveform is data containing human speech, music, sounds, noise, and the like, stored in a format such as linear PCM (Pulse Code Modulation) or MP3 (MPEG-1 Layer 3).

The word dictionary 320 is a database describing the words to be detected. Each record consists of a word ID, a pronunciation string, and notation information. The word ID is the unique identifier of the record, the pronunciation string gives the pronunciation of the word as a sequence of pronunciation symbols (phonological units, or the phonemes of the language concerned), and the notation information gives the written form of the word. The word ID need not be provided separately and may be replaced by the pronunciation string or the notation information; the notation information may be omitted; and the pronunciation string may be omitted and derived directly from the notation information using conversion rules defined for the language. The description below assumes that, as illustrated, two records with word IDs "a" and "b" are registered in the word dictionary 320. The phoneme model 330 is a database that represents the speech waveform features corresponding to the pronunciation symbols used in the pronunciation strings of the word dictionary. A format such as an HMM (Hidden Markov Model) is used, for example.

The word detection program 311 uses a known technique generally called word spotting: using the phoneme model 330, it searches the speech waveform 360 for locations judged to correspond to words in the word dictionary 320. The specific method is covered in detail in the literature of the field and is omitted here. It is assumed here that detection sections 501 to 509 are detected, as illustrated.

The results of word detection by the word detection program 311 are stored in the word detection result 370. The word detection result 370 is a database in which each record consists of a detection ID, a waveform ID, a word ID, and a position. The detection ID is the unique identifier of the record in the word detection result 370, the waveform ID is an identifier specifying a speech waveform, the word ID is an identifier specifying a word recorded in the word dictionary 320, and the position is the section (start time and end time) in which the word was detected. Here, detection IDs 501 to 509 are assigned to detection sections 501 to 509, and the illustrated contents are assumed to have been obtained. For example, the record with detection ID "501" records that the word corresponding to word ID "a" was detected at position 5.02 to 5.89 of waveform ID "361".
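As an illustration only, the two record layouts described so far could be modelled as below. The class names, field names, and placeholder dictionary strings are hypothetical and do not come from the patent; only the values of the detection ID 501 record are taken from the text, and treating positions as seconds is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WordEntry:
    """One record of the word dictionary 320 (field names are assumed)."""
    word_id: str
    pronunciation: str  # pronunciation symbol string
    notation: str       # written form of the word


@dataclass(frozen=True)
class Detection:
    """One record of the word detection result 370 (field names are assumed)."""
    detection_id: int
    waveform_id: int
    word_id: str
    start: float  # start time of the detected section (assumed to be seconds)
    end: float    # end time of the detected section (assumed to be seconds)


# Placeholder contents: the text only states that IDs "a" and "b" are registered.
word_dictionary = {
    "a": WordEntry("a", "<pronunciation of a>", "<notation of a>"),
    "b": WordEntry("b", "<pronunciation of b>", "<notation of b>"),
}

# The one record whose values are spelled out in the text.
detection_501 = Detection(501, 361, "a", 5.02, 5.89)
```

Keying the dictionary by word ID mirrors the way each detection record refers to a dictionary record through its word ID.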

Next, based on the word detection results, the detected word verification UI processing program 313 creates a detected word verification UI such as that shown in FIG. 3 and presents it to the operator who judges the word detection results (the word detection result operator). The detected word verification UI consists of a word detection result information presentation unit 610, a speech waveform playback instruction unit 620, an affirmative judgment unit 630, and a negative judgment unit 640.

The word detection result information presentation unit 610 shows information related to the word detection result, for example the pronunciation string or notation information corresponding to the word ID of the word detection result record currently being processed, and information indicating the remaining work. When the speech waveform playback instruction unit 620 is selected, the speech waveform specified by the waveform ID and position of the current word detection result record is presented to the word detection result operator through the output means. The speech waveform presented may be only the portion specified by the waveform ID and position, a section that also includes several seconds before and after the position, or the entire speech waveform specified by the waveform ID.
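The three playback options just described can be sketched as a small helper. This is a minimal sketch under stated assumptions: the function name, the mode names, and the default margin of 2.0 seconds are all hypothetical, since the text only says "several seconds".

```python
def playback_range(start, end, total_length, mode="segment", margin=2.0):
    """Return the (start, end) times to play back for one detection.

    mode "segment": only the detected section
    mode "padded":  the section plus `margin` seconds on each side,
                    clipped to the bounds of the waveform
    mode "whole":   the entire waveform
    """
    if mode == "segment":
        return (start, end)
    if mode == "padded":
        return (max(0.0, start - margin), min(total_length, end + margin))
    if mode == "whole":
        return (0.0, total_length)
    raise ValueError(f"unknown playback mode: {mode!r}")
```

Clipping in the padded mode keeps the range valid when a detection lies close to the start or end of the recording.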

When the affirmative judgment unit 630 or the negative judgment unit 640 is selected, the fact that an affirmative or negative judgment, respectively, was made for the current word detection result record is stored in the detected word verification result 380. As shown in FIG. 4, the detected word verification result 380 consists of a detection ID and a verification result.

The detected word verification UI processing program 313 repeats the above processing for every record in the word detection result 370 and thereby creates the detected word verification result 380. Here, the word detection result operator is assumed to have entered the verification results shown in FIG. 4. The records may be presented in any order, but if records with the same word ID are presented together, the operator can judge the same word consecutively, so improved work efficiency can be expected.

Next, as shown in FIG. 5, the listening range determination program 314 determines listening ranges based on the word detection result 370, the detected word verification result 380, and the listening range determination rule 340, and stores them in the listening range 385. The listening range determination rule 340 consists of a content ID and a condition. The content ID is an identifier specifying the content that the rule defines, and the condition states what state of the detected word verification results satisfies the rule. Here, for content ID "901", the condition "detected word verification results for word ID = a and word ID = b exist within 10 seconds of each other" is assumed to be specified.

In general, results judged negative in the detected word verification result 380 are excluded when the condition is evaluated. The condition is also evaluated only between entries detected within the same waveform ID.

The listening range consists of a content ID, a waveform ID, and a section. The content ID has the same meaning as the content ID in the listening range determination rule, the waveform ID is an identifier specifying a speech waveform, and the section is the range to be listened to, for example the smallest contiguous section that contains all the sections of the detection entries used when the condition in the listening range determination rule was met. From the word detection results shown in FIG. 2, the detected word verification results shown in FIG. 4, and the listening range determination rule shown in FIG. 5, the section "11.42 to 14.31" of waveform ID "362" is obtained as the listening range, as shown in FIG. 5. Because the verification result for detection 502 is negative in the detected word verification result 380 shown in FIG. 4, the listening range determination program judges that the rule is not matched there and excludes it from the subsequent content judgment processing. Cases like this are typical examples of how, in the present invention, the results of detected word judgment reduce the work cost of content judgment.

The following are possible listening range determination rules.
(1) Among the entries in the word detection result whose detection IDs were affirmed in the detected word verification result, if all members of a given group of word IDs exist within a given time, the range covered by that group of word IDs is taken as the listening range.
(2) Among the entries in the word detection result whose detection IDs were affirmed in the detected word verification result, if all members of a given group of word IDs exist within a given time and in a prescribed order, the range covered by that group of word IDs is taken as the listening range.
(3) The listening range above is extended to the nearest speech boundaries. A speech boundary is any position within a section judged to contain no speech according to an index based on speech power or speech likeness. This makes the later content judgment work easier.
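Rules (1) and (2) above can be sketched as follows. This is one possible reading, not the patent's implementation: "within a given time" is interpreted as the span from the earliest start to the latest end of the chosen detections, and unverified detections are treated as not affirmed. The function name and tuple layout are hypothetical. Only the detection ID 501 values and the resulting range for waveform 362 come from the text; the other positions are made up so that the example reproduces that range.

```python
from collections import defaultdict
from itertools import product


def listening_ranges(detections, verdicts, word_ids, window, ordered=False):
    """Apply rule (1), or rule (2) when ordered=True.

    detections: iterable of (detection_id, waveform_id, word_id, start, end)
    verdicts:   dict mapping detection_id to True (affirmed) or False (negated)
    word_ids:   list of word IDs that must all be present (in this order if ordered)
    window:     maximum time span that may cover the whole group
    Returns a sorted list of (waveform_id, start, end).
    """
    # Keep affirmed detections only, grouped per waveform and per word ID.
    by_waveform = defaultdict(lambda: defaultdict(list))
    for det_id, wave_id, word_id, start, end in detections:
        if verdicts.get(det_id, False) and word_id in word_ids:
            by_waveform[wave_id][word_id].append((start, end))

    found = set()
    for wave_id, by_word in by_waveform.items():
        if any(w not in by_word for w in word_ids):
            continue  # some word of the group is missing in this waveform
        # Try every way of picking one affirmed detection per word ID.
        for combo in product(*(by_word[w] for w in word_ids)):
            lo = min(s for s, _ in combo)
            hi = max(e for _, e in combo)
            if hi - lo > window:
                continue
            starts = [s for s, _ in combo]
            if ordered and starts != sorted(starts):
                continue
            # Smallest contiguous section containing all chosen detections.
            found.add((wave_id, lo, hi))
    return sorted(found)


detections = [
    (501, 361, "a", 5.02, 5.89),    # values stated in the text
    (502, 361, "b", 7.10, 7.95),    # hypothetical word ID and position
    (503, 362, "a", 11.42, 12.20),  # hypothetical
    (504, 362, "b", 13.50, 14.31),  # hypothetical
]
verdicts = {501: True, 502: False, 503: True, 504: True}
ranges = listening_ranges(detections, verdicts, ["a", "b"], window=10.0)
```

Because detection 502 is negated, waveform 361 no longer contains an affirmed "b" and yields no listening range, which is the cost saving the text describes. Rule (3) would need a separate speech activity measure and is not covered by this sketch.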

Next, based on the listening range 385, the content verification UI processing program 315 creates a content verification UI such as that shown in FIG. 6 and presents it to the content verification operator. The content verification UI consists of a listening range information presentation unit 710, a speech waveform playback instruction unit 720, an affirmative judgment unit 730, and a negative judgment unit 740.

The listening range information presentation unit 710 shows information related to the listening range, for example information explaining the content to be verified for the listening range currently being processed, and information indicating the remaining work. When the speech waveform playback instruction unit 720 is selected, the speech waveform specified by the waveform ID and section of the current listening range record is presented to the content verification operator through the output means. The speech waveform presented may be only the portion specified by the section, a section that also includes several seconds before and after it, or the entire speech waveform specified by the waveform ID.

When the affirmative judgment unit 730 or the negative judgment unit 740 is selected, the fact that an affirmative or negative judgment, respectively, was made for the current listening range record is stored in the content verification result 390. As shown in FIG. 7, the content verification result 390 consists of a content ID and a verification result.

The content verification UI processing program 315 repeats the above processing for every record in the listening range 385 and thereby creates the content verification result 390. Here, the content verification operator is assumed to have entered the verification results shown in FIG. 7. The records may be presented in any order, but if records with the same content ID are presented together, the content verification operator can judge the same content consecutively, so improved work efficiency can be expected.

Through the operation of the utterance verification device described above, content verification results are obtained from the speech waveform, and the object of the present invention is achieved.

In this embodiment the utterance verification device is assumed to be a single device, but it may consist of two or more devices. For example, the functions of the word detection program and the detected word verification UI processing program could be placed in a first utterance verification device and the functions of the listening range determination program and the content verification UI processing program in a second utterance verification device, with the necessary data shared over a network or the like to realize the utterance verification system.

Also, in this embodiment the detected word verification UI and the content verification UI are assumed to be GUIs (Graphical User Interfaces), but they may be replaced with a CUI (Character User Interface), a VUI (Voice User Interface), or the like that has equivalent functions. As an example of a VUI implementation, the system could play a spoken explanation of the word or content to be judged, after which the user indicates affirmation, negation, or a request to listen again by voice input or with a button at hand.

FIG. 12 is a flowchart showing the flow of processing according to the first embodiment of the present invention. First, in step 11, the word detection program 311 performs word detection on the input speech waveform. Each detection is recorded as a word detection result together with the word ID of the detected word and its position in the speech waveform. Next, the detected word verification UI processing program 313 performs steps 12 and 13 for every word detection result X. In step 12, the detected word verification UI for X is displayed, the portion of the input speech waveform where the word was detected is played back, and the program waits for a judgment on the word detection to be entered from the input device 500. In step 13, the entered judgment is stored in the detected word verification result. In step 14, it is determined whether any word detection results X remain; if so, the process returns to step 12 and repeats, and if not, it proceeds to step 15.

In step 15, the listening range determination program 314 determines the listening ranges based on the detected word verification results and the listening range determination rules. The process then proceeds to step 16, in which the content verification UI processing program 315 displays the content verification UI for a listening range Y, plays back the part of the speech waveform specified by listening range Y, and waits for an affirmative or negative judgment on the played-back content to be entered from the input device 500. In step 17, the entered judgment is stored as a content verification result. In step 18, it is determined whether any listening ranges Y remain; if so, the process returns to step 16 and repeats, and if not, the process ends.
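The control flow of steps 11 to 18 can be sketched as below. This is only an outline under stated assumptions: the function name is hypothetical, each detection is assumed to be a dict with a "detection_id" key, and the four callables (automatic detection, the two human judgment prompts, and range determination) stand in for the programs 311, 313, 314, and 315 and must be supplied by the caller.

```python
def run_verification(waveforms, detect_words, ask_word, decide_ranges, ask_content):
    """Two-stage flow of FIG. 12; all four callables are caller-supplied stand-ins."""
    # Step 11: automatic word detection on the input waveforms.
    detections = detect_words(waveforms)

    # Steps 12 to 14: a human affirms or negates every word detection.
    word_verdicts = {}
    for det in detections:
        word_verdicts[det["detection_id"]] = ask_word(det)

    # Step 15: derive listening ranges from the verdicts and the rules.
    ranges = decide_ranges(detections, word_verdicts)

    # Steps 16 to 18: a human affirms or negates the content of every range.
    content_verdicts = []
    for rng in ranges:
        content_verdicts.append((rng, ask_content(rng)))
    return word_verdicts, content_verdicts
```

Passing the two prompts in as functions reflects the point that the stages can be handled by different operators, or even by different devices, as long as the intermediate results are shared.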

Next, a second embodiment of the present invention is described. The second embodiment builds on the first by using the score that the word detection program outputs for each detected word and reducing the cost of detected word verification based on that score. The utterance verification device 101 of this embodiment is shown in FIG. 8. It differs from the utterance verification device 100 shown in FIG. 1 in that the storage device 300 contains a threshold discrimination database 350 and, in place of the word detection result, a scored word detection result 371. In addition, the word detection program 311 and the detected word verification UI processing program 313 operate differently, as follows.

The word detection program 311 operates as described for the first embodiment (FIG. 2), except that it outputs the scored word detection result 371 shown in FIG. 9 instead of the word detection result. In the scored word detection result, each record additionally carries the score of its detected word, which indicates how likely the detection is to be correct. Methods that compute such a score from, for example, the difference in acoustic likelihood relative to an acoustic model are generally known. The specific methods are covered in detail in the literature of the field and are omitted here.

The detected word verification UI processing program 313 operates as described for the first embodiment, except that it selects the detection results to be presented for verification based on the score information in the scored word detection result and on the threshold discrimination database.

FIG. 10 shows the details of the threshold discrimination database. Each record in the threshold discrimination database consists of a word ID, a score, and a verification result. The database stores the verification results of detected words that this system has judged in the past.

The detected word verification UI processing program 313 first obtains, from the threshold discrimination database 350, the affirmative score distribution and the negative score distribution for a given word ID. Assuming that both distributions are normal, each is characterized by its sample mean and sample variance. Next, based on a predetermined allowable false acceptance rate A and allowable false rejection rate B, the program obtains the allowable false acceptance score threshold α and the allowable false rejection score threshold β. Letting f₁(X) and f₂(X) denote the probability distribution functions of the affirmative score distribution and the negative score distribution, respectively, α and β are the values that satisfy f₂(X = α) = 1 − A and f₁(X = β) = B. These are illustrated in FIG. 11.
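The threshold calculation described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes that f₁ and f₂ are cumulative distribution functions of normal distributions fitted to past scores, that the threshold discrimination database is available as a list of `(word_id, score, verified)` tuples, and that both rates lie strictly between 0 and 1. The function name `score_thresholds` and the record layout are hypothetical.

```python
from statistics import NormalDist


def score_thresholds(records, word_id, rate_a, rate_b):
    """Return (alpha, beta) for one word ID, or None if history is too thin.

    records: iterable of (word_id, score, verified) tuples (assumed layout).
    rate_a:  allowable false acceptance rate A, with 0 < A < 1.
    rate_b:  allowable false rejection rate B, with 0 < B < 1.
    """
    positive = [s for w, s, ok in records if w == word_id and ok]
    negative = [s for w, s, ok in records if w == word_id and not ok]
    # A sample variance needs at least two observations per class.
    if len(positive) < 2 or len(negative) < 2:
        return None
    f1 = NormalDist.from_samples(positive)  # affirmative score distribution
    f2 = NormalDist.from_samples(negative)  # negative score distribution
    if f1.stdev == 0 or f2.stdev == 0:
        return None  # degenerate fit, no usable quantile
    alpha = f2.inv_cdf(1.0 - rate_a)  # solves f2(alpha) = 1 - A
    beta = f1.inv_cdf(rate_b)         # solves f1(beta) = B
    return alpha, beta
```

When the two fitted distributions overlap, β falls below α and the scores between them form the band that still needs a human judgment.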

Based on this result, each record for that word ID in the scored word detection result is then processed according to its score S as follows.
(1) If S > α, the record is automatically accepted as a correct detection.
(2) If S < β, the record is automatically rejected as an incorrect detection.
(3) Otherwise, a word detection result verification UI is created for the record and presented to the operator who verifies word detection results. When the affirmative judgment control or the negative judgment control is selected, the result is stored in the detected word verification result 380 and, at the same time, in the threshold discrimination database 350.
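The three cases above amount to a three-way triage on the score. The sketch below is one hedged way to express it; the names `Decision`, `triage`, and `verify_records`, the `(word_id, score)` record layout, and the `ask_operator` callback standing in for the verification UI are all assumptions made for illustration.

```python
from enum import Enum


class Decision(Enum):
    ACCEPT = "accept"   # case (1): S > alpha
    REJECT = "reject"   # case (2): S < beta
    MANUAL = "manual"   # case (3): a human must listen


def triage(score, alpha, beta):
    """Apply cases (1) to (3) in the order they are listed."""
    if score > alpha:
        return Decision.ACCEPT
    if score < beta:
        return Decision.REJECT
    return Decision.MANUAL


def verify_records(records, alpha, beta, ask_operator, results, threshold_db):
    """Fill `results` with (word_id, score, verdict) for every record.

    ask_operator(word_id, score) -> bool is a stand-in for the UI.
    Only operator judgments are appended to `threshold_db`, mirroring
    case (3), where the result is stored in both places.
    """
    for word_id, score in records:
        decision = triage(score, alpha, beta)
        if decision is Decision.MANUAL:
            verdict = ask_operator(word_id, score)
            threshold_db.append((word_id, score, verdict))
        else:
            verdict = decision is Decision.ACCEPT
        results.append((word_id, score, verdict))
```

Because case (1) is tested first, a score that exceeds α is accepted even in the unusual situation where β happens to be larger than α.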

The score-based judgment method described above is only one example, and the following variations are also conceivable.
1. The allowable false acceptance score threshold α and the allowable false rejection score threshold β are obtained from the threshold discrimination database not for every detection ID, but at fixed intervals or at arbitrary timing.
2. If false rejection cannot be tolerated, only α is used and β is not. This is equivalent to setting the allowable false rejection rate B = 0 and the allowable false rejection score threshold β = −∞. This is effective for tasks in which false rejection is not permitted.

On the other hand, even in a task where false acceptance cannot be tolerated, a detection that is falsely accepted at the word verification stage is rejected at the subsequent content verification stage, so it only increases the cost of content verification and is unlikely to lead to a final false acceptance. If false acceptance nevertheless cannot be tolerated at the word verification stage, only β is used and α is not, in the same manner. This is equivalent to setting the allowable false acceptance rate A = 0 and the allowable false acceptance score threshold α = ∞.
3. An identifier for a set within which the score distribution can be assumed to be the same, such as a speaker or an environment, is additionally attached to the scored word detection result and to the threshold discrimination database, and α and β are obtained using only the records that share the same identifier. This method is effective when the score distribution is known to vary with the speaker or the environment.
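Variations 2 and 3 can be combined in one small sketch. It assumes a history of `(word_id, group_id, score, verified)` tuples, where `group_id` is a hypothetical speaker or environment identifier, and it represents a disabled threshold as α = ∞ or β = −∞ so that the corresponding automatic decision never fires. The function name and flags are illustrative, and the rates are assumed to lie strictly between 0 and 1.

```python
import math
from collections import defaultdict
from statistics import NormalDist


def grouped_thresholds(history, rate_a, rate_b, use_alpha=True, use_beta=True):
    """Return {(word_id, group_id): (alpha, beta)}.

    use_alpha=False keeps alpha at +inf (no automatic acceptance).
    use_beta=False keeps beta at -inf (no automatic rejection).
    """
    buckets = defaultdict(lambda: ([], []))  # key -> (positive, negative)
    for word_id, group_id, score, verified in history:
        positive, negative = buckets[(word_id, group_id)]
        (positive if verified else negative).append(score)

    thresholds = {}
    for key, (positive, negative) in buckets.items():
        alpha, beta = math.inf, -math.inf
        if use_alpha and len(negative) >= 2:
            f2 = NormalDist.from_samples(negative)
            if f2.stdev > 0:
                alpha = f2.inv_cdf(1.0 - rate_a)
        if use_beta and len(positive) >= 2:
            f1 = NormalDist.from_samples(positive)
            if f1.stdev > 0:
                beta = f1.inv_cdf(rate_b)
        thresholds[key] = (alpha, beta)
    return thresholds
```

A group with too little history simply keeps both thresholds disabled, so every detection in that group is still routed to the operator.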

Through the above processing, the verification result of some records is determined automatically on a probabilistic basis, as in (1) and (2) above, so the work of verifying word detection results can be shortened while a given level of detection accuracy is maintained.

The score may also be used in the listening range determination rule of the first embodiment, as follows.
(1) Among the entries in the word detection result whose detection ID has been affirmed by the detected word verification result, if all members of a given group of word IDs are present within a given time, and the sum or the average of the scores of those entries exceeds a predetermined value, the range covered by that group of word IDs is taken as a listening range.
(2) Among the entries in the word detection result whose detection ID has been affirmed by the detected word verification result, if all members of a given group of word IDs are present within a given time in a predetermined order, and the sum or the average of the scores of those entries exceeds a predetermined value, the range covered by that group of word IDs is taken as a listening range.
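Rule (1) can be sketched as a scan over the affirmed detections. The sketch below is only one possible reading: it assumes each affirmed detection is a `Hit` with a word ID, start and end times in seconds, and a score, it uses the average score (not the sum), and for each required word it picks the first occurrence inside the window. `Hit`, `listening_ranges`, and the parameter names are hypothetical, and the ordering constraint of rule (2) is not implemented.

```python
from dataclasses import dataclass
from statistics import fmean


@dataclass(frozen=True)
class Hit:
    word_id: str
    start: float  # seconds from the beginning of the waveform
    end: float
    score: float


def listening_ranges(hits, required, max_span, min_mean_score):
    """Return (start, end) ranges in which every word ID in `required`
    occurs within `max_span` seconds and the mean score of the chosen
    hits exceeds `min_mean_score`. `hits` must already be limited to
    detections affirmed by the detected word verification result."""
    required = set(required)
    ordered = sorted(hits, key=lambda h: h.start)
    ranges = []
    for i, anchor in enumerate(ordered):
        if anchor.word_id not in required:
            continue
        limit = anchor.start + max_span
        chosen = {}
        for h in ordered[i:]:
            if h.start > limit:
                break  # later hits start even further away
            if h.end > limit or h.word_id not in required:
                continue
            chosen.setdefault(h.word_id, h)  # keep the first occurrence
        if len(chosen) != len(required):
            continue
        picked = list(chosen.values())
        if fmean([h.score for h in picked]) > min_mean_score:
            rng = (min(h.start for h in picked), max(h.end for h in picked))
            if rng not in ranges:
                ranges.append(rng)
    return ranges
```

Replacing `fmean` with `sum` gives the total-score form of the rule.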

FIG. 13 is a flowchart showing the flow of processing according to the second embodiment of the present invention. First, in step 21, the word detection program 311 performs word detection on the input speech waveform. Each detection is recorded as a word detection result consisting of the word ID of the detected word, its position within the speech waveform, and its score. Next, the detected word verification UI processing program 313 performs steps 22 to 25 for every word detection result X. In step 22, it is determined whether X can be judged by the score thresholds; if so, the process proceeds to step 23 and the automatic judgment is stored as the detected word verification result. If X cannot be judged by the score thresholds, the process proceeds to step 24, where the detected word verification UI for X is displayed, the portion of the input speech waveform containing the detected word is played back, and the program waits for a judgment on the word detection to be entered from the input device 500. In step 25, the entered judgment is stored in the detected word verification result. The process then proceeds to step 26, where the score thresholds are updated. The update is performed by obtaining the affirmative score distribution and the negative score distribution for a given word ID from the detected word verification results accumulated so far, and then obtaining the allowable false acceptance score threshold α and the allowable false rejection score threshold β based on the predetermined allowable false acceptance rate A and allowable false rejection rate B. In step 27, it is determined whether any word detection result X remains; if so, the process returns to step 22 and repeats, and if not, the process proceeds to step 28.
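Steps 22 to 27 form a loop in which every operator judgment immediately refines the thresholds for later detections. The sketch below illustrates that loop under several assumptions: detections are `(word_id, score)` pairs, `history` maps a word ID to a pair of lists of past affirmative and negative scores, `ask_operator` stands in for the verification UI, the rates lie strictly between 0 and 1, and only operator judgments (not automatic ones) feed the threshold update. All names are hypothetical.

```python
import math
from statistics import NormalDist


def thresholds(positive, negative, rate_a, rate_b):
    """Normal-fit thresholds; a side with too little data stays disabled."""
    alpha, beta = math.inf, -math.inf
    if len(negative) >= 2:
        f2 = NormalDist.from_samples(negative)
        if f2.stdev > 0:
            alpha = f2.inv_cdf(1.0 - rate_a)
    if len(positive) >= 2:
        f1 = NormalDist.from_samples(positive)
        if f1.stdev > 0:
            beta = f1.inv_cdf(rate_b)
    return alpha, beta


def run_word_verification(detections, history, rate_a, rate_b, ask_operator):
    """Return a list of (word_id, score, verdict, automatic) tuples."""
    results = []
    current = {}  # word_id -> (alpha, beta) currently in force
    for word_id, score in detections:
        positive, negative = history.setdefault(word_id, ([], []))
        if word_id not in current:
            current[word_id] = thresholds(positive, negative, rate_a, rate_b)
        alpha, beta = current[word_id]
        if score > alpha:            # steps 22-23: automatic acceptance
            verdict, automatic = True, True
        elif score < beta:           # steps 22-23: automatic rejection
            verdict, automatic = False, True
        else:                        # steps 24-25: ask the operator
            verdict, automatic = ask_operator(word_id, score), False
            (positive if verdict else negative).append(score)
            # step 26: refresh thresholds with the new judgment
            current[word_id] = thresholds(positive, negative, rate_a, rate_b)
        results.append((word_id, score, verdict, automatic))
    return results
```

With an empty history both thresholds start disabled, so the first detections of a word all go to the operator and the automatic share grows as judgments accumulate.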

In step 28, the listening range determination program 314 determines the listening ranges based on the verified word detection results and the listening range determination rule. Next, in step 29, the content verification UI program 315 displays the content verification UI for a listening range Y, plays back the portion of the speech waveform specified by listening range Y, and waits for an affirmative or negative judgment on the played-back content to be entered from the input device 500. In step 30, the entered judgment is saved as a content verification result. In step 31, it is determined whether any listening range Y remains; if so, the process returns to step 29 and repeats, and if not, the process ends.

FIG. 1 is a diagram showing a configuration example of the utterance verification device according to the present invention.
FIG. 2 is a diagram explaining the details of the speech waveform and the word detection result, and the operation of the word detection program.
FIG. 3 is a diagram showing the detected word verification UI.
FIG. 4 is an explanatory diagram of the detected word verification result.
FIG. 5 is a diagram explaining the details of the listening range and the operation of the listening range determination program.
FIG. 6 is a diagram showing the content verification UI.
FIG. 7 is an explanatory diagram of the content verification result.
FIG. 8 is a diagram showing a configuration example of the utterance verification device according to the present invention.
FIG. 9 is an explanatory diagram of the scored word detection result.
FIG. 10 is an explanatory diagram of the threshold discrimination database.
FIG. 11 is an explanatory diagram of the affirmative score distribution and the negative score distribution.
FIG. 12 is a flowchart showing the flow of processing.
FIG. 13 is a flowchart showing the flow of processing.

Explanation of symbols

100, 101 Utterance verification device
100 Arithmetic device
300 Storage device
400 Output device
500 Input device
501 to 509 Word segments detected by the word detection program

Claims

1. An utterance verification device comprising:
a word detection unit that obtains, as a word detection result, a position in an input speech waveform at which an utterance of a specified word is included;
a detected word verification processing unit that reproduces, from the speech waveform, a waveform including the position obtained as the word detection result, and stores an input affirmative/negative determination of the word detection as a detected word verification result;
a file that stores a listening range determination rule for determining a listening range based on words; and
a listening range determination unit that determines a listening range of the speech waveform based on the detected word verification result and the listening range determination rule.

2. The utterance verification device according to claim 1, further comprising a content verification unit that reproduces, from the speech waveform, a waveform including the listening range, and stores an input affirmative/negative determination on the reproduced content as a content verification result.

3. The utterance verification device according to claim 2, further comprising a storage unit that holds the speech waveform, the word detection result, the detected word verification result, the listening range, and the content verification result.

4. The utterance verification device according to claim 1, wherein
the word detection unit obtains, as the word detection result, a score indicating the likelihood of the detection together with the position in the input speech waveform at which the utterance of the specified word is included, and
the detected word verification processing unit obtains score prior distribution information from a database containing detected word verification results previously obtained by the detected word verification processing unit and the scores of the corresponding detected words, and makes the affirmative/negative determination of the word detection based on the score in the word detection result and the score prior distribution information.

5. The utterance verification device according to claim 4, wherein
the detected word verification processing unit estimates an affirmative score distribution from the scores in the score prior distribution information that were determined to be affirmative, obtains an allowable false rejection score threshold based on the affirmative score distribution and a pre-specified allowable false rejection rate, and compares the allowable false rejection score threshold with the score of the word detection result to make the affirmative/negative determination of the word detection.

6. The utterance verification device according to claim 4, wherein
the detected word verification processing unit estimates a negative score distribution from the scores in the score prior distribution information that were determined to be negative, obtains an allowable false acceptance score threshold based on the negative score distribution and a pre-specified allowable false acceptance rate, and compares the allowable false acceptance score threshold with the score of the word detection result to make the affirmative/negative determination of the word detection.

7. An utterance verification method for determining a listening range of an input speech waveform using an utterance verification device having a word detection unit, a detected word verification processing unit, a file storing a listening range determination rule, and a listening range determination unit, the method comprising:
a step in which the word detection unit obtains, as a word detection result, a position in the input speech waveform at which an utterance of a specified word is included;
a step in which the detected word verification processing unit reproduces, from the speech waveform, a waveform including the position obtained as the word detection result, and stores an input affirmative/negative determination of the word detection as a detected word verification result; and
a step in which the listening range determination unit determines a listening range of the speech waveform based on the detected word verification result and the listening range determination rule.

8. The utterance verification method according to claim 7, further comprising a step of reproducing, from the speech waveform, a waveform including the listening range, and storing an input affirmative/negative determination on the reproduced content as a content verification result.

9. The utterance verification method according to claim 7, wherein a score indicating the likelihood of the detection is also obtained as part of the word detection result, and wherein, in the step of obtaining the detected word verification result, score prior distribution information is obtained from previously obtained detected word verification results that include scores, and the affirmative/negative determination of the word detection is made based on the score in the word detection result and the score prior distribution information.

10. The utterance verification method according to claim 9, wherein an affirmative score distribution is estimated from the scores in the score prior distribution information that were determined to be affirmative, an allowable false rejection score threshold is obtained based on the affirmative score distribution and a pre-specified allowable false rejection rate, and the allowable false rejection score threshold is compared with the score of the word detection result to make the affirmative/negative determination of the word detection.

11. The utterance verification method according to claim 9, wherein a negative score distribution is estimated from the scores in the score prior distribution information that were determined to be negative, an allowable false acceptance score threshold is obtained based on the negative score distribution and a pre-specified allowable false acceptance rate, and the allowable false acceptance score threshold is compared with the score of the word detection result to make the affirmative/negative determination of the word detection.