JP2023146216A

JP2023146216A - Conversion-into-text support device and conversion-into-text support method

Info

Publication number: JP2023146216A
Application number: JP2022053293A
Authority: JP
Inventors: 義明山添; Yoshiaki Yamazoe; 稜松本; Ryo Matsumoto; 洋輔谷澤; Yosuke Yazawa; 雪城高橋; Yukishiro Takahashi
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2022-03-29
Filing date: 2022-03-29
Publication date: 2023-10-12

Abstract

To enable keyword matching of call content to be carried out with high accuracy regardless of the characteristics of speech-to-text conversion of that call content. SOLUTION: A conversion-into-text support device 100 includes: a storage device 101 that holds master data 126 defining information on the correct phonemes of each vocabulary item; and an arithmetic device 104 that applies call recording data to an acoustic model 110 to extract phonemes, calculates, for vocabulary items whose phonemes are defined in the master data 126 and which are assumed to appear in the call recording data, a coincidence rate between the correct phonemes of the vocabulary item and the extracted phonemes, and, as a result of this calculation, specifies as a keyword matching result any vocabulary item whose phonemes show a predetermined coincidence rate. SELECTED DRAWING: Figure 2

Description

The present invention relates to a text conversion support device and a text conversion support method.

There is a need to confirm whether the content of calls made by salespeople, call center staff, and the like is appropriate from the standpoint of compliance and similar concerns. In recent years, instead of the traditional approach of listening again to recordings of such calls, methods have been proposed in which the voice data is first converted into text and the text is then checked.
As conventional technology related to such text conversion, a compliance check system and a compliance check program have been proposed (see Patent Document 1) that, based on data such as the content of explanations given to customers during business negotiations and sales activities, check both whether "prohibited expressions" are present and whether "required matters" are included.

This technology is a compliance check system that checks whether each utterance made by a sales representative to a customer observes compliance requirements. The system has: a text analysis unit that performs natural language analysis, including morphological analysis, on text data obtained by converting the content of each utterance of the sales representative into text with speech recognition technology, and outputs the result as analyzed text data; a determination unit that groups the utterances in the analyzed text data into blocks of one or more consecutive utterances according to a predetermined criterion, and determines, for each block, whether the content of first text data predefined as required matters to be explained to the customer has been explained; a keyword matching unit that, when the analyzed text data for an utterance matches second text data predefined as prohibited expressions that must not be said to the customer, determines that the prohibited expression in question was said in that utterance; and a data recording unit that records the data of each utterance made by the sales representative to the customer in association with management information identifying the sales representative and/or the customer. The text analysis unit also includes, in the data of the utterances made by the sales representative to the customer, past utterances made by the sales representative to the customer that are extracted based on the management information. When the determination unit determines that the content of the first text data has been explained in a block, it assigns the category of the required matter to that block and records it, and also determines, for each required matter, the degree to which it was explained based on predetermined evaluation criteria set in advance.

Japanese Patent Application Publication No. 2018-120640

The accuracy of text conversion as described above has been improving thanks to advances in deep learning and related technologies, and its use is spreading. One example of using call recording data in the financial field is checking whether NG words were spoken and whether the correct customer name and product name were pronounced.
This check is often carried out by performing keyword matching on the text obtained from the call recording data. However, it has become clear that, owing to factors such as recording conditions and the speaker's habits, there are calls for which text conversion accuracy tends to be low (calls with many false detections), and accurate keyword matching has been difficult for such calls.

In other words, even if keyword matching is performed on calls for which speech-to-text accuracy tends to be low, good matching accuracy cannot be expected, and this ultimately causes items to be missed in the check.

Therefore, an object of the present invention is to provide a technique that enables keyword matching of call content to be performed with good accuracy, regardless of the characteristics of speech-to-text conversion of that call content.

The text conversion support device of the present invention that solves the above problem is characterized by including: a storage device that holds master data defining correct phoneme information for each vocabulary item expected to appear in each conversation scene or subject; and an arithmetic device that executes a process of applying call recording data obtained from a predetermined device to an acoustic model to extract phonemes, a process of calculating a match rate between the extracted phonemes and the correct phonemes of those vocabulary items, among the vocabulary items whose phonemes are defined in the master data, that are expected to appear in the conversation scene or subject of the call recording data, and a process of identifying, as a keyword matching result, any vocabulary item whose phonemes show a predetermined match rate as a result of the calculation.
Further, the text conversion support method of the present invention is characterized in that an information processing device holds, in a storage device, master data defining correct phoneme information for each vocabulary item expected to appear in each conversation scene or subject, and executes: a process of applying call recording data obtained from a predetermined device to an acoustic model to extract phonemes; a process of calculating a match rate between the extracted phonemes and the correct phonemes of those vocabulary items, among the vocabulary items whose phonemes are defined in the master data, that are expected to appear in the conversation scene or subject of the call recording data; and a process of identifying, as a keyword matching result, any vocabulary item whose phonemes show a predetermined match rate as a result of the calculation.

According to the present invention, keyword matching of call content can be performed with good accuracy regardless of the characteristics of speech-to-text conversion of that call content.

FIG. 1 is a network configuration diagram including the text conversion support device of this embodiment.
FIG. 2 is a diagram showing an example of the hardware configuration of the text conversion support device in this embodiment.
FIG. 3 is a diagram showing an example of the hardware configuration of the operator terminal in this embodiment.
FIG. 4 is a diagram showing an example of the hardware configuration of the call center system in this embodiment.
FIG. 5 is a diagram showing an example of the hardware configuration of the administrator terminal in this embodiment.
FIG. 6 is a diagram showing a configuration example of the call recording DB of this embodiment.
FIG. 7 is a diagram showing a configuration example of the phoneme master table of this embodiment.
FIG. 8 is a diagram showing a configuration example of the utterance similarity table of this embodiment.
FIG. 9 is a diagram showing flow example 1 of the text conversion support method in this embodiment.
FIG. 10 is a diagram showing flow example 2 of the text conversion support method in this embodiment.
FIG. 11 is a diagram showing flow example 3 of the text conversion support method in this embodiment.

<Network configuration>
Embodiments of the present invention will be described in detail below with reference to the drawings. FIG. 1 is a network configuration diagram including a text conversion support device 100 of this embodiment. The text conversion support device 100 shown in FIG. 1 is a computer that enables keyword matching of call content to be performed with good accuracy, regardless of the characteristics of speech-to-text conversion of that call content.

As shown in FIG. 1, the text conversion support device 100 of this embodiment is communicably connected, as necessary, to an operator terminal 200, a call center system 300, and an administrator terminal 400 via an appropriate network 1 such as the Internet or a secure line within an organization. These may therefore be collectively referred to as a text conversion system 10.

The text conversion support device 100 of this embodiment can be described as a support device that uses keyword matching to identify, for example, whether the content of a conversation between an operator and a customer at a call center was appropriate from the standpoint of compliance and customer service, by detecting events such as the appearance of NG words or the non-appearance of required words in the conversation.

Of course, the present invention is not limited to keyword matching on conversations between operators and customers in call center operations. It is applicable to any business in which there is occasion to verify whether required or prohibited keywords appear in voice data.

The operator terminal 200, meanwhile, is a terminal used by a person in charge of responding to customer inquiries about various products and services, or of conducting telephone sales to prospective customers. Specifically, it may be a telephone terminal integrated with a PC, a smartphone, a tablet terminal, a personal computer, or the like. Conversations between the person in charge and customers at the operator terminal 200 are recorded, and are managed and used as call recording data.

The call center system 300 is a system that manages incoming and outgoing calls between the operator terminal 200 and customers' telephones, and manages the call recording data, that is, the content of conversations at the operator terminal 200. The call center system 300 therefore holds and manages the call recording data in a storage device and distributes it to the text conversion support device 100 as appropriate.

The administrator terminal 400 is a terminal operated by an administrator of the call center. At an appropriate time, such as the end of the business day at the call center, the administrator terminal 400 instructs the text conversion support device 100 to perform keyword matching processing on the call recording data for a predetermined period, such as one day, in order to check it from a predetermined standpoint such as compliance, and obtains the processing results.
<Hardware configuration>
The hardware configuration of the text conversion support device 100 of this embodiment is as shown in FIG. 2 and described below.

That is, the text conversion support device 100 includes a storage device 101, a memory 103, an arithmetic device 104, and a communication device 105.

The storage device 101 is composed of an appropriate nonvolatile storage element such as an SSD (Solid State Drive) or a hard disk drive.

The memory 103 is composed of a volatile storage element such as RAM.

The arithmetic device 104 is a CPU that executes a program 102 held in the storage device 101, for example by reading it into the memory 103, to perform overall control of the device itself as well as various determination, calculation, and control processes.

The communication device 105 is assumed to be a network interface card or the like that connects to the network 1 and handles communication with at least the call center system 300.

When the text conversion support device 100 is a stand-alone machine, it preferably further includes an input device that accepts key input and voice input from the user, and an output device such as a display that shows processed data.

In addition to the program 102 for implementing the functions required of the text conversion support device of this embodiment, the storage device 101 stores at least a call recording DB 125, a phoneme master table 126, and an utterance similarity table 127. Details of these databases and tables are described later.

The program 102 includes an acoustic model 110 and a language model 111. The acoustic model 110 is a function that extracts, from call recording data of a conversation between an operator and a customer, the phonemes that make up the speech in the call.

To this end, the text conversion support device 100 first performs acoustic analysis, which analyzes the features of the speech in the call recording data (such as frequency and loudness) and converts them into data that is easy to handle, and then supplies the features obtained from this acoustic analysis to the acoustic model 110.

The acoustic model 110 is a model, obtained through appropriate deep learning or the like, that defines the correspondence between these features and phonemes. Given the speech features, it extracts phonemes, the smallest units of the sound wave.

A phoneme is the smallest component of the sound wave that can be observed when speech is uttered. Japanese phonemes fall into three categories: vowels (a, i, u, e, o), the moraic nasal (n), and consonants (23 types). For example, the phoneme sequence for "田中さん" (Tanaka-san) is "t-a-n-a-k-a-s-a-n".
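
As a minimal illustration of the hyphenated notation used here, a phoneme sequence could be held as a list of symbols. The function name `parse_phonemes` is an assumption made for this sketch and does not appear in the source.

```python
def parse_phonemes(notation: str) -> list[str]:
    """Split a hyphen-separated phoneme string into a list of symbols."""
    # Empty fragments (e.g. from a trailing hyphen) are dropped.
    return [p.strip().lower() for p in notation.split("-") if p.strip()]


tanaka = parse_phonemes("t-a-n-a-k-a-s-a-n")
print(tanaka)       # ['t', 'a', 'n', 'a', 'k', 'a', 's', 'a', 'n']
print(len(tanaka))  # 9
```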

The text conversion support device 100 of this embodiment performs keyword matching based on the phonemes obtained by the acoustic model 110. In the above case, this corresponds to the process of identifying the phoneme sequence "t-a-n-a-k-a-s-a-n" as the Japanese word "田中さん" (Tanaka-san). More specifically, the phonemes are replaced with vocabulary by applying the text conversion support method of the present invention, using the phoneme master table 126 as appropriate to determine which word each phoneme sequence corresponds to.

The language model 111, on the other hand, is responsible for assembling the group of vocabulary items obtained by keyword matching into sentences as appropriate. For example, from a group of items such as "Tanaka-san", "in Shinshu", "snow", and "has piled up", it composes a meaningful sentence as the most likely combination, based on statistical data or the like on the relationship between groups of vocabulary items and correct (or frequently occurring) sentences.

The hardware configuration of the operator terminal 200 of this embodiment is as shown in FIG. 3 and described below.

That is, the operator terminal 200 includes a storage device 201, a memory 203, an arithmetic device 204, an input device 205, an output device 206, and a communication device 207.

The storage device 201 is composed of an appropriate nonvolatile storage element such as an SSD (Solid State Drive) or a hard disk drive.

The memory 203 is composed of a volatile storage element such as RAM.

The arithmetic device 204 is a CPU that executes a program 202 held in the storage device 201, for example by reading it into the memory 203, to perform overall control of the device itself as well as various determination, calculation, and control processes.

The input device 205 is composed of devices such as a keyboard and a mouse that accept key input and voice input from the operator as the user.

The output device 206 is composed of devices such as a display and speakers that present the results of processing by the arithmetic device 204.

The communication device 207 is assumed to be a network interface card or the like that connects to the network 1 and handles communication with the call center system 300 and the administrator terminal 400 (or the text conversion support device 100).

The hardware configuration of the call center system 300 of this embodiment is as shown in FIG. 4 and described below.

That is, the call center system 300 includes a storage device 301, a memory 303, an arithmetic device 304, and a communication device 305.

The storage device 301 is composed of an appropriate nonvolatile storage element such as an SSD (Solid State Drive) or a hard disk drive.

The memory 303 is composed of a volatile storage element such as RAM.

The arithmetic device 304 is a CPU that executes a program 302 held in the storage device 301, for example by reading it into the memory 303, to perform overall control of the device itself as well as various determination, calculation, and control processes.

The communication device 305 is assumed to be a network interface card or the like that connects to the network 1 and handles communication with at least the text conversion support device 100 and the operator terminal 200.

When the call center system 300 is a stand-alone machine, it preferably further includes an input device that accepts key input and voice input from the user, and an output device such as a display that shows processed data.

In addition to the program 302 for implementing the functions required of the call center system 300 of this embodiment, the storage device 301 stores at least call recording data 325. This call recording data 325 is the data that becomes the records of the call recording DB 125 in the text conversion support device 100.

The hardware configuration of the administrator terminal 400 of this embodiment is as shown in FIG. 5 and described below.

That is, the administrator terminal 400 includes a storage device 401, a memory 403, an arithmetic device 404, an input device 405, an output device 406, and a communication device 407.

The storage device 401 is composed of an appropriate nonvolatile storage element such as an SSD (Solid State Drive) or a hard disk drive.

The memory 403 is composed of a volatile storage element such as RAM.

The arithmetic device 404 is a CPU that executes a program 402 held in the storage device 401, for example by reading it into the memory 403, to perform overall control of the device itself as well as various determination, calculation, and control processes.

The input device 405 is composed of devices such as a keyboard and a mouse that accept key input and voice input from the operator as the user.

The output device 406 is composed of devices such as a display and speakers that present the results of processing by the arithmetic device 404.

The communication device 407 is assumed to be a network interface card or the like that connects to the network 1 and handles communication with the text conversion support device 100 and the call center system 300.
<Data structure example>
Next, the various kinds of information used by the text conversion support device 100 of this embodiment are described. FIG. 6 shows an example of the call recording DB 125 in this embodiment. The call recording DB 125 of this embodiment is, for example, a database that stores call recording data of calls between operators and customers, acquired from the call center system 300 (or from the operator terminal 200).

The call recording DB 125 is, for example, a collection of records in which data such as the customer's name, the name of the product or service specified by the customer, the ID of the operator who handled the call, and the recording data file are linked, keyed by the call date and time and the customer ID identifying the customer on the call.
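
The record layout described above might be sketched as follows. This is only an illustration under stated assumptions: the class name `CallRecord`, the field names, and the sample values other than the customer ID format are hypothetical and are not taken from FIG. 6.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class CallRecord:
    """One record of the call recording DB (field names are illustrative)."""
    call_datetime: datetime   # key part 1: date and time of the call
    customer_id: str          # key part 2: ID of the customer on the call
    customer_name: str
    product_or_service: str   # product/service name specified by the customer
    operator_id: str          # ID of the operator who handled the call
    recording_file: str       # name of the recorded audio file

    @property
    def key(self) -> tuple[datetime, str]:
        """Composite key: (call date and time, customer ID)."""
        return (self.call_datetime, self.customer_id)


# Hypothetical sample values for illustration only.
record = CallRecord(datetime(2022, 3, 29, 10, 15), "C018122", "Saeki",
                    "Sample Fund A", "OP0001", "rec_0001.wav")
call_db = {record.key: record}
```

Keying the dictionary on the composite key mirrors the description that records are looked up by call date and time together with the customer ID.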

FIG. 7 shows a configuration example of the phoneme master table 126 in this embodiment. The phoneme master table 126 of this embodiment is a table that defines the correct phonemes for each vocabulary item.

The phoneme master table 126 is structured, for example, so that conversation scenes and subjects serve as keys, and for each key it defines the correct phoneme information of the vocabulary items expected to appear in conversations about that scene or subject.
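
One way to picture this table is as a nested mapping from a scene or subject key to the vocabulary expected for it. The sketch below assumes a customer ID serves as the subject key, borrowing the "C018122" and "S-A-E-K-I-S-A-N" values that appear in flow example 1. The names `PHONEME_MASTER` and `expected_vocabulary` are hypothetical.

```python
# Scene/subject key -> {vocabulary item -> correct phoneme sequence}.
# The single entry reuses the customer example from flow example 1.
PHONEME_MASTER: dict[str, dict[str, list[str]]] = {
    "C018122": {"佐伯さん": "S-A-E-K-I-S-A-N".split("-")},
}


def expected_vocabulary(subject_key: str) -> dict[str, list[str]]:
    """Return the vocabulary and correct phonemes for one scene or subject."""
    # An unknown key yields an empty mapping rather than an error.
    return PHONEME_MASTER.get(subject_key, {})


print(expected_vocabulary("C018122"))
```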

FIG. 8 shows a configuration example of the utterance similarity table 127 in this embodiment. The utterance similarity table 127 of this embodiment is a table that defines the similarity between each pair of Japanese vowels when uttered.

The utterance similarity table 127 lists the vowels along both rows and columns, forming a matrix that defines the similarity between each pair of vowels as a discrete value ranging from a maximum of 1 (exact match) to a minimum of 0 (no similarity).
<Flow example 1>
The actual procedure of the text conversion support method in this embodiment is described below with reference to the drawings. The various operations corresponding to the text conversion support method described below are realized by a program that the text conversion support device 100 reads into memory or the like and executes. This program consists of code for performing the various operations described below.

FIG. 9 is a diagram showing flow example 1 of the text conversion support method in this embodiment. In this case, the text conversion support device 100 acquires call recording data 325 from, for example, the call center system 300 (or the operator terminal 200) and stores it in the call recording DB 125 (s1).

Next, upon detecting the arrival of a predetermined time or receiving an instruction from the administrator terminal 400, the text conversion support device 100 extracts, from the call recording data held in the call recording DB 125, the data for a predetermined period, for example, and applies it to the acoustic model 110 to extract phonemes (s2).

For example, suppose that, for call recording data in which a call center operator said "佐伯さん" (Saeki-san) to a customer named Saeki after the standard greeting, the phoneme sequence "T-A-I-K-I-S-A-N" was extracted. The customer name is the processing target here, but this is only an example. A financial product name, for instance, would also be a suitable target.

Next, the text conversion support device 100 identifies, from the customer ID linked to this call recording data, that the customer on the call is Saeki, and, taking "佐伯さん" (Saeki-san) as the vocabulary item targeted for keyword matching, extracts its phonemes from the phoneme master table 126 (s3). In this case, the phoneme sequence "S-A-E-K-I-S-A-N" is extracted from the record for customer ID "C018122: Saeki***" in the phoneme master table 126.

Next, the text conversion support device 100 compares the phoneme sequences obtained in s2 and s3 and calculates their match rate (s4). In the above case, comparing the phoneme sequence "T-A-I-K-I-S-A-N" with the phoneme sequence "S-A-E-K-I-S-A-N" shows that six of the eight phonemes match (all except the first and third), so the match rate is 6/8 = 0.75.
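
The calculation in s4 can be sketched as a position-by-position comparison. This is a minimal sketch, not the patented implementation: the function name `phoneme_match_rate` is hypothetical, and dividing by the longer sequence when the lengths differ is an assumption, since the text only gives an equal-length example.

```python
def phoneme_match_rate(extracted: str, correct: str) -> float:
    """Position-wise match rate between two hyphen-separated phoneme strings."""
    a = [p for p in extracted.split("-") if p]
    b = [p for p in correct.split("-") if p]
    # Assumption: unequal lengths are penalized by dividing by the longer one.
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / longest


rate = phoneme_match_rate("T-A-I-K-I-S-A-N", "S-A-E-K-I-S-A-N")
print(rate)  # 0.75
```

A strictly positional comparison is the simplest reading of the 6/8 example. An alignment-based measure would tolerate inserted or dropped phonemes, but the text does not describe one.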

If, as in the conventional approach, the phoneme sequence "T-A-I-K-I-S-A-N" obtained from the call recording data were applied to the language model 111 to obtain the text "大輝さん" (Daiki-san), and this text were compared with the text of the vocabulary "佐伯さん" (Saeki-san) defined in the phoneme master table 126, two of the four characters would match, giving a match rate of 2/4 = 0.5. If the pass criterion for keyword matching were, for example, a match rate of 0.6, the two would be judged not to match because of the conversion accuracy of the language model 111, even though the operator did in fact say the customer name "Saeki-san".

By contrast, the text conversion support device 100 of the present invention avoids this problem of conversion accuracy in the language model 111 and performs keyword matching based on the match rate between phoneme sequences, which enables more accurate keyword matching than before.

<Flow example 2>
FIG. 10 is a diagram showing flow example 2 of the text conversion support method in this embodiment. To further enhance the effect of flow example 1 above, this section describes a method of calculating the degree of match between phoneme sequences that also takes vowels into account. In this flow, steps s1 and s2 are the same as in flow example 1, so only the subsequent processing is described.

The text conversion support device 100 extracts only the vowels (a, i, u, e, o) from the phoneme sequence extracted as in flow example 1 above (s10). In the example above, the vowel sequence "A, I, I, A" is extracted.

The text conversion support device 100 also extracts the phonemes and vowels for the customer "Saeki" in the same manner as in s3 and s10 (s11). In this case, the vowel sequence "A, E, I, A" is extracted from the phoneme sequence "S-A-E-K-I-S-A-N".

Next, the text conversion support device 100 looks up the vowels of the vowel sequences obtained in s10 and s11 in the utterance similarity table 127, in order from the head of each sequence, and identifies the similarity between the vowels at corresponding positions in the two sequences (s12).

For example, according to the utterance similarity table 127, the similarity is "1" between the vowels "A" and "A", "0" between "A" and "I", "0" between "A" and "U", "0.5" between "A" and "E", and "0.5" between "A" and "O".

As a result, in the example above, comparing "A, I, I, A" with "A, E, I, A" gives a similarity of "1" for "A" and "A", "0.5" for "I" and "E", "1" for "I" and "I", and "1" for "A" and "A".

The text conversion support device 100 then calculates, from the per-vowel similarities obtained in s12, the vowel similarity of the phoneme sequences as (1 + 0.5 + 1 + 1) / 4 = 0.875 (s13).
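Steps s10 to s13 can be sketched as follows. The names are illustrative, and the `SIMILARITY` dictionary is only a stand-in for the utterance similarity table 127: it contains just the pairs quoted in the text, it assumes the table is symmetric, and it assumes a default of 0.0 for any pair the text does not mention.

```python
VOWELS = set("AIUEO")

# Stand-in for the utterance similarity table 127 (only pairs quoted in the text).
SIMILARITY = {
    frozenset("AE"): 0.5,
    frozenset("AO"): 0.5,
    frozenset("IE"): 0.5,
    frozenset("AI"): 0.0,
    frozenset("AU"): 0.0,
}


def extract_vowels(phonemes: str) -> list:
    """Keep only the vowels of a hyphen-separated phoneme sequence (s10, s11)."""
    return [p for p in phonemes.split("-") if p in VOWELS]


def vowel_similarity(a: str, b: str) -> float:
    """Look up one vowel pair (s12); identical vowels score 1.0."""
    if a == b:
        return 1.0
    # Assumption: pairs missing from the table score 0.0.
    return SIMILARITY.get(frozenset((a, b)), 0.0)


def vowel_sequence_similarity(xs: list, ys: list) -> float:
    """Average the position-wise similarities of two vowel sequences (s13)."""
    pairs = list(zip(xs, ys))
    if not pairs:
        return 0.0
    return sum(vowel_similarity(x, y) for x, y in pairs) / len(pairs)


extracted = extract_vowels("T-A-I-K-I-S-A-N")  # ['A', 'I', 'I', 'A']
correct = extract_vowels("S-A-E-K-I-S-A-N")    # ['A', 'E', 'I', 'A']
print(vowel_sequence_similarity(extracted, correct))  # 0.875
```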

The text conversion support device 100 also calculates a match rate for the consonants, based on the phoneme sequences obtained in s2 and s3 (s14). In the example above, comparing the consonants "T, K, S, N" in the phoneme sequence "T-A-I-K-I-S-A-N" with the consonants "S, K, S, N" in the phoneme sequence "S-A-E-K-I-S-A-N" shows that three of the four consonants match, so the match rate is 3/4 = 0.75.

Next, the text conversion support device 100 weights the vowel similarity obtained in s13 and takes a weighted average with the consonant match rate to calculate the match rate between the phoneme sequences (s15).
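The consonant comparison of s14 and the weighted average of s15 can be sketched as below. The function names are hypothetical, every non-vowel symbol (including the final "N") is treated as a consonant as in the worked example, and the default weight of 2.0 is simply the value used in this flow's worked example rather than a prescribed setting.

```python
VOWELS = set("AIUEO")


def consonant_match_rate(extracted: str, correct: str) -> float:
    """Position-wise match rate over the non-vowel phonemes (s14)."""
    a = [p for p in extracted.split("-") if p not in VOWELS]
    b = [p for p in correct.split("-") if p not in VOWELS]
    # Assumption: unequal lengths are scored against the longer sequence.
    total = max(len(a), len(b))
    if total == 0:
        return 0.0
    return sum(1 for x, y in zip(a, b) if x == y) / total


def combined_match_rate(consonant_rate: float, vowel_sim: float,
                        vowel_weight: float = 2.0) -> float:
    """Weighted average in which the vowel similarity counts vowel_weight times (s15)."""
    return (consonant_rate + vowel_sim * vowel_weight) / (1.0 + vowel_weight)


c_rate = consonant_match_rate("T-A-I-K-I-S-A-N", "S-A-E-K-I-S-A-N")
print(c_rate)                                        # 0.75
print(round(combined_match_rate(c_rate, 0.875), 2))  # 0.83
```

Raising `vowel_weight` makes the result depend more on the vowel similarity, which is the element the description treats as less prone to misdetection.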

For example, if the weight is "2", that is, if the vowel similarity is given twice the weight of the consonant match rate in the weighted average, the match rate is calculated as (consonant match rate 0.75 + vowel similarity 0.875 × weight 2) / 3 ≈ 0.83.

<Flow example 3>
FIG. 11 is a diagram showing flow example 3 of the text conversion support method in this embodiment. To further enhance the effects of flow examples 1 and 2 above, this section describes a method of calculating the degree of match between phoneme sequences that also handles omitted and superfluous characters. In this flow, steps s1 and s2 of flow example 1 and steps s10 and s11 of flow example 2 are the same, so only the subsequent processing is described.

For each of the vowel sequence in the phoneme sequence extracted from the call recording data as described above and the vowel sequence in the phoneme sequence read from the corresponding record of the phoneme master table 126, the text conversion support device 100 identifies, based on the utterance similarity table 127, the similarity within each pair of two consecutive vowels in that vowel sequence (s20).

For example, take the vowel sequence "O, A, O, U, O, A, I, A, U, A, I, A" in the phoneme sequence "O-H-A-Y-O-U-G-O-Z-A-I-M-A-S-U-S-A-K-I-S-A-N" obtained from the call recording data. Selecting the vowels two at a time from the head forms six pairs in total: pair (1) "O, A", pair (2) "O, U", pair (3) "O, A", pair (4) "I, A", pair (5) "U, A", and pair (6) "I, A". Based on the utterance similarity table 127, pairs (1), (2), and (3) each have a similarity of "0.5", and pairs (4), (5), and (6) each have a similarity of "0".

Next, for each pair identified in s20 whose similarity is at or above a criterion, for example 0.5, the text conversion support device 100 folds the pair into one predetermined vowel (e.g. A, I, U). For each pair whose similarity is below the criterion, it adopts the leading vowel and has the trailing vowel form the next pair with the adjacent vowel in the phoneme sequence. In this way it generates a syllable sequence (s21).

In the example above, pair (1) is consolidated (that is, folded; the same applies below) into the vowel "A", pair (2) into the vowel "U", and pair (3) into the vowel "A". For pair (4), the leading vowel "I" is adopted, and the trailing vowel "A" is combined with the leading vowel "U" of the original pair (5) to form a new pair (5)'. The pairs are likewise re-formed for the rest of the vowel sequence, and the similarity-based consolidation described above is carried out.

As a result, the syllable sequence remaining after each pair has been consolidated is "A, U, A, I, A, U, A, I, A".
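The folding of s20 and s21 can be sketched as a single left-to-right pass. This is one reading of the procedure, with assumptions stated: the text says only that a similar pair is folded into "one predetermined vowel", and the sketch folds a pair into its trailing vowel because that choice reproduces the worked examples; the similarity values are the ones quoted in the text, with an assumed symmetric lookup and an assumed default of 0.0 for unlisted pairs.

```python
# Stand-in for the utterance similarity table 127 (only pairs quoted in the text).
SIMILARITY = {
    frozenset("AO"): 0.5,
    frozenset("OU"): 0.5,
    frozenset("AI"): 0.0,
    frozenset("AU"): 0.0,
}


def vowel_similarity(a: str, b: str) -> float:
    if a == b:
        return 1.0
    return SIMILARITY.get(frozenset((a, b)), 0.0)  # assumed default for unlisted pairs


def fold_vowels(vowels: list, threshold: float = 0.5) -> list:
    """Fold similar consecutive vowel pairs into one vowel (s20, s21)."""
    folded = []
    i = 0
    while i < len(vowels):
        has_partner = i + 1 < len(vowels)
        if has_partner and vowel_similarity(vowels[i], vowels[i + 1]) >= threshold:
            # Assumption: the pair folds into its trailing vowel.
            folded.append(vowels[i + 1])
            i += 2
        else:
            # Keep the leading vowel; the trailing vowel starts the next pair.
            folded.append(vowels[i])
            i += 1
    return folded


call = "O A O U O A I A U A I A".split()
print(fold_vowels(call))          # ['A', 'U', 'A', 'I', 'A', 'U', 'A', 'I', 'A']
print(fold_vowels(list("AAIA")))  # ['A', 'I', 'A']
```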

The text conversion support device 100 generates a syllable sequence in the same way for the vowel sequence "A, A, I, A" in the phoneme sequence "S-A-S-A-K-I-S-A-N" of the corresponding record in the phoneme master table 126, and obtains "A, I, A".

Next, for the portion of the syllable sequence derived from the call recording data that matches the syllable sequence generated in s21 from the phoneme master table 126, the text conversion support device 100 compares the number of vowels with that of the sequence derived from the phoneme master table 126. If the numbers of vowels are equal (s22: equal), it calculates, based on the utterance similarity table 127, the vowel match rate between the corresponding vowel sequences of that portion and of the sequence derived from the phoneme master table 126 (s23).

For example, within the vowel sequence "O, A, O, U, O, A, I, A, U, A, I, A" in the phoneme sequence "O-H-A-Y-O-U-G-O-Z-A-I-M-A-S-U-S-A-K-I-S-A-N" of the call recording data, the portion whose syllable sequence matches the syllable sequence "A, I, A" derived from the phoneme master table 126 (which is based on the vowel sequence "A, A, I, A") is "O, A, I, A".

The text conversion support device 100 therefore identifies, based on the utterance similarity table 127, the similarity between each pair of corresponding vowels in the portion "O, A, I, A" of the vowel sequence derived from the call recording data and the vowel sequence "A, A, I, A" derived from the phoneme master table 126, and calculates, for example, (0.5 + 1 + 1 + 1) / 4 = 0.875.

On the other hand, if the comparison of the number of vowels in s22 shows that the number of vowels derived from the master table 126 is greater than the number derived from the call recording data (s22: greater), the text conversion support device 100 presumes that characters have been omitted. Taking the syllable sequence derived from the master table 126 as correct, it corrects the portion of the sequence derived from the call recording data where a vowel is missing by filling it in with the corresponding phoneme from the master table 126 (s24), and calculates, based on the utterance similarity table 127, the vowel match rate between the corrected vowel sequence and the vowel sequence derived from the master table 126 (s25).

For example, the syllable sequence for the vowel sequence "A, I, A" in the phoneme sequence "S-A-K-I-S-A" of the call recording data matches the syllable sequence "A, I, A" derived from the phoneme master table 126 (which is based on the vowel sequence "A, A, I, A"). However, the corresponding vowel sequence derived from the master table 126 contains one more vowel.

Compared with the vowel sequence "A, A, I, A" derived from the master table 126, what is missing from the vowel sequence "A, I, A" derived from the call recording data is the second vowel from the head, "A". The text conversion support device 100 therefore corrects the vowel sequence "A, I, A" derived from the call recording data by inserting "A" between the leading "A" and the second vowel "I".

The text conversion support device 100 then calculates, based on the utterance similarity table 127, the similarity between the corrected vowel sequence and the vowel sequence derived from the master table 126 as, for example, (1 + 1 + 1 + 1) / 4 = 1.

Conversely, if the comparison of the number of vowels in s22 shows that the number of vowels derived from the master table 126 is smaller than the number derived from the call recording data (s22: smaller), the text conversion support device 100 presumes that superfluous characters have occurred. Taking the syllable sequence derived from the master table 126 as correct, it corrects the sequence derived from the call recording data by deleting the portion where vowels are in excess (s26), and calculates, based on the utterance similarity table 127, the vowel match rate between the corrected vowel sequence and the vowel sequence derived from the master table 126 (s27).

For example, the syllable sequence for the vowel sequence "A, A, A, I, A" in the phoneme sequence "A-K-A-S-A-K-I-S-A" of the call recording data matches the syllable sequence "A, I, A" derived from the phoneme master table 126 (which is based on the vowel sequence "A, A, I, A"). However, the corresponding vowel sequence derived from the master table 126 contains one fewer vowel.

Compared with the vowel sequence "A, A, I, A" derived from the master table 126, what is in excess in the vowel sequence "A, A, A, I, A" derived from the call recording data is the leading "A". The text conversion support device 100 therefore corrects the vowel sequence "A, A, A, I, A" derived from the call recording data by deleting the leading "A".
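The corrections of s24 and s26 can be illustrated with a generic sequence alignment. The description does not say how the missing or excess position is located, so the use of Python's standard `difflib.SequenceMatcher` here is a substitute technique chosen for the sketch, and the function name `correct_vowel_count` is hypothetical. Substituted vowels are deliberately left untouched, since they are scored afterwards against the utterance similarity table 127.

```python
from difflib import SequenceMatcher


def correct_vowel_count(call_vowels: list, master_vowels: list) -> list:
    """Fill in omitted vowels and drop superfluous ones, treating the master as correct."""
    corrected = []
    matcher = SequenceMatcher(None, call_vowels, master_vowels, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "insert":
            # Vowel missing from the recording: take it from the master (s24).
            corrected.extend(master_vowels[j1:j2])
        elif tag == "delete":
            # Vowel in excess in the recording: drop it (s26).
            continue
        else:
            # "equal" or "replace": keep what was recognized in the recording.
            corrected.extend(call_vowels[i1:i2])
    return corrected


master = list("AAIA")
print(correct_vowel_count(list("AIA"), master))    # ['A', 'A', 'I', 'A']
print(correct_vowel_count(list("AAAIA"), master))  # ['A', 'A', 'I', 'A']
```

In the omission example the alignment inserts the missing "A" at the head rather than between "A" and "I"; the corrected sequence is the same either way because the two vowels are identical.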

Because it has already been described in flow example 2, the explanation of calculating the match rate by also taking the consonant sequence match into account alongside the vowel sequence similarity is omitted here.

Although the best mode for carrying out the present invention has been specifically described above, the present invention is not limited to it and can be modified in various ways without departing from its gist.

According to this embodiment, keyword matching of call content can be performed with good accuracy regardless of the characteristics of the speech-to-text conversion applied to that call content.

The description in this specification makes at least the following clear. In the text conversion support device of the present embodiment, the storage device may further hold information defining the utterance similarity between vowels, and the arithmetic device may, when calculating the match rate, execute a process of calculating a match rate between the vowels contained in the correct phonemes and in the extracted phonemes based on the utterance similarity information and a process of calculating a match rate between the consonants contained in the correct phonemes and in the extracted phonemes, weight the match rate between the vowels more heavily than the match rate between the consonants, and calculate the match rate between the phonemes based on the respective match rates between the vowels and between the consonants.

According to this, when matching phonemes as described above, priority is given to vowels as the elements to be matched (vowels are few in kind and easy to distinguish, and are therefore less prone to misdetection), which makes it easier to obtain an accurate match rate. Consequently, keyword matching of call content can be performed with better accuracy regardless of the characteristics of the speech-to-text conversion applied to that call content.

In the text conversion support device of the present embodiment, the arithmetic device may, when calculating the match rate, execute the following for each of the extracted phonemes and the correct phonemes: identify, using the utterance similarity, the similarity within each pair of two consecutive vowels in the sequence of vowels contained in those phonemes; fold each pair whose similarity is at or above a criterion into one predetermined vowel; and, for each pair whose similarity is below the criterion, adopt the leading vowel and have the trailing vowel form the next pair with the adjacent vowel in the sequence, thereby generating a syllable sequence. The arithmetic device may then compare the number of vowels between the syllable sequences generated for the extracted phonemes and for the correct phonemes and, if the numbers of vowels are equal, calculate, based on the utterance similarity, the vowel match rate between the vowel sequences of the extracted phonemes and of the correct phonemes from which the syllable sequences were generated.

According to this, it becomes possible to deal appropriately with the problem that, in Japanese, omitted or superfluous characters tend to arise when vowels with high mutual similarity occur consecutively: two characters may be pronounced as one syllable, only one character may be pronounced, the first character may not be pronounced, or the same character may be needlessly repeated. Consequently, keyword matching of call content can be performed with better accuracy regardless of the characteristics of the speech-to-text conversion applied to that call content.

In the text conversion support device of the present embodiment, the arithmetic device may compare the number of vowels between the syllable sequences and, if the syllable sequence corresponding to the correct phonemes has more vowels than that of the extracted phonemes, take the syllable sequence of the correct phonemes as correct, correct the portion of the syllable sequence of the extracted phonemes where a vowel is missing by filling it in with the corresponding phoneme of the correct phonemes, and calculate, based on the utterance similarity, the vowel match rate between the vowel sequences of the corrected extracted phonemes and of the correct phonemes.

According to this, the omitted-character phenomenon described above is handled appropriately, and keyword matching of call content can be performed with better accuracy regardless of the characteristics of the speech-to-text conversion applied to that call content.

In the text conversion support device of the present embodiment, the arithmetic device may compare the number of vowels between the syllable sequences and, if the syllable sequence corresponding to the correct phonemes has fewer vowels than that of the extracted phonemes, take the syllable sequence of the correct phonemes as correct, correct the syllable sequence of the extracted phonemes by deleting the portion where vowels are in excess, and calculate, based on the utterance similarity, the vowel match rate between the vowel sequences of the corrected extracted phonemes and of the correct phonemes.

According to this, the superfluous-character phenomenon described above is handled appropriately, and keyword matching of call content can be performed with better accuracy regardless of the characteristics of the speech-to-text conversion applied to that call content.

In the text conversion support method of the present embodiment, the information processing device may further hold, in the storage device, information defining the utterance similarity between vowels and, when calculating the match rate, execute a process of calculating a match rate between the vowels contained in the correct phonemes and in the extracted phonemes based on the utterance similarity information and a process of calculating a match rate between the consonants contained in the correct phonemes and in the extracted phonemes, weight the match rate between the vowels more heavily than the match rate between the consonants, and calculate the match rate between the phonemes based on the respective match rates between the vowels and between the consonants.

In the text conversion support method of the present embodiment, the information processing device may, when calculating the match rate, execute the following for each of the extracted phonemes and the correct phonemes: identify, using the utterance similarity, the similarity within each pair of two consecutive vowels in the sequence of vowels contained in those phonemes; fold each pair whose similarity is at or above a criterion into one predetermined vowel; and, for each pair whose similarity is below the criterion, adopt the leading vowel and have the trailing vowel form the next pair with the adjacent vowel in the sequence, thereby generating a syllable sequence. The information processing device may then compare the number of vowels between the syllable sequences generated for the extracted phonemes and for the correct phonemes and, if the numbers of vowels are equal, calculate, based on the utterance similarity, the vowel match rate between the vowel sequences of the extracted phonemes and of the correct phonemes from which the syllable sequences were generated.

In the text conversion support method of the present embodiment, the information processing device may compare the number of vowels between the syllable sequences and, if the syllable sequence corresponding to the correct phonemes has more vowels than that of the extracted phonemes, take the syllable sequence of the correct phonemes as correct, correct the portion of the syllable sequence of the extracted phonemes where a vowel is missing by filling it in with the corresponding phoneme of the correct phonemes, and calculate, based on the utterance similarity, the vowel match rate between the vowel sequences of the corrected extracted phonemes and of the correct phonemes.

In the text conversion support method of the present embodiment, the information processing device may compare the number of vowels between the syllable sequences and, if the syllable sequence corresponding to the correct phonemes has fewer vowels than that of the extracted phonemes, take the syllable sequence of the correct phonemes as correct, correct the syllable sequence of the extracted phonemes by deleting the portion where vowels are in excess, and calculate, based on the utterance similarity, the vowel match rate between the vowel sequences of the corrected extracted phonemes and of the correct phonemes.

1 Network
100 Text conversion support device
101 Storage device
102 Program
103 Memory
104 Arithmetic device
105 Communication device
110 Acoustic model
111 Language model
125 Call recording DB
126 Phoneme master table
127 Utterance similarity table
200 Operator terminal
300 Call center system
400 Administrator terminal

Claims

A text conversion support device comprising:
a storage device that holds master data defining correct phoneme information for each vocabulary expected to appear in each conversation scene or subject; and
an arithmetic device that executes a process of applying call recording data obtained from a predetermined device to an acoustic model to extract phonemes, a process of calculating, among the vocabulary whose phonemes are defined in the master data, a match rate between the correct phonemes of the vocabulary expected to appear in the conversation scene or subject of the call recording data and the extracted phonemes, and a process of identifying, as a keyword matching result, the vocabulary whose phonemes show a predetermined match rate as a result of the calculation.

The text conversion support device according to claim 1, wherein
the storage device further holds information defining the utterance similarity between vowels, and
the arithmetic device, when calculating the match rate, executes a process of calculating a match rate between the vowels contained in the correct phonemes and in the extracted phonemes based on the utterance similarity information and a process of calculating a match rate between the consonants contained in the correct phonemes and in the extracted phonemes, weights the match rate between the vowels more heavily than the match rate between the consonants, and calculates the match rate between the phonemes based on the respective match rates between the vowels and between the consonants.

The arithmetic device, when calculating the match rate,
generates a syllable array for each of the extracted phonemes and the correct phonemes by taking, in the sequence of vowels included in the phonemes, each pair of two consecutive vowels and determining its degree of similarity from the utterance similarity, wherein a pair whose similarity is equal to or higher than a reference value is collapsed into one predetermined vowel, whereas for a pair whose similarity is below the reference value the first vowel is adopted and the next pair is formed with the adjacent vowel in the sequence, this processing being repeated through the last vowel,
compares the number of vowels between the syllable arrays generated for the extracted phonemes and for the correct phonemes and, if the numbers are equal, calculates, based on the utterance similarity, a vowel match rate between the vowel sequences of the extracted phonemes and the correct phonemes from which the syllable arrays were generated,
The text conversion support device according to claim 2, characterized by the above.
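
The pairwise collapsing step might look like the following sketch. Several points are assumptions rather than statements of the claim: the collapsed pair is represented by its first vowel, scanning resumes after a collapsed pair, the reference value is 0.9, the match rate is computed on the collapsed arrays, and the similarity values are hypothetical.

```python
# Hypothetical utterance-similarity values between different vowels (0..1).
SIMILAR_PAIRS = {
    frozenset({"a", "o"}): 0.6,
    frozenset({"i", "e"}): 0.7,
    frozenset({"u", "o"}): 0.5,
}


def vowel_similarity(v1, v2):
    """Identical vowels score 1.0; unlisted pairs score 0.0."""
    if v1 == v2:
        return 1.0
    return SIMILAR_PAIRS.get(frozenset({v1, v2}), 0.0)


def build_syllable_array(vowels, reference=0.9):
    """Collapse consecutive vowel pairs whose similarity reaches the reference value."""
    result = []
    i = 0
    while i < len(vowels):
        has_next = i + 1 < len(vowels)
        if has_next and vowel_similarity(vowels[i], vowels[i + 1]) >= reference:
            result.append(vowels[i])  # assumption: the pair becomes its first vowel
            i += 2
        else:
            result.append(vowels[i])  # adopt the first vowel, pair up the next one
            i += 1
    return result


def vowel_match_rate(correct_vowels, extracted_vowels, reference=0.9):
    """Average similarity per position, or None when the vowel counts differ."""
    correct = build_syllable_array(correct_vowels, reference)
    extracted = build_syllable_array(extracted_vowels, reference)
    if len(correct) != len(extracted):
        return None  # unequal counts are the subject of the dependent claims
    if not correct:
        return 1.0
    return sum(vowel_similarity(c, e) for c, e in zip(correct, extracted)) / len(correct)


collapsed = build_syllable_array(list("euuou"))  # vowels of "tesuuryou"
rate = vowel_match_rate(list("euuou"), list("euau"))
```

With these values the long vowel `uu` collapses to a single `u`, so a speaker who shortens it is still compared position by position against the correct sequence.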

The arithmetic device,
when the comparison of the number of vowels between the syllable arrays shows that the syllable array of the correct phonemes contains more vowels than that of the extracted phonemes, treats the syllable array of the correct phonemes as correct and corrects the syllable array of the extracted phonemes by filling each position where a vowel is missing with the corresponding phoneme of the correct phonemes,
calculates, based on the utterance similarity, a vowel match rate between the vowel sequences of the corrected extracted phonemes and the correct phonemes,
The text conversion support device according to claim 3, characterized by the above.
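
A sketch of the compensation step follows. The claim does not say how the missing position is located, so this example assumes a brute-force search over candidate positions that keeps the alignment with the highest total similarity; the similarity values are hypothetical.

```python
from itertools import combinations

# Hypothetical utterance-similarity values between different vowels (0..1).
SIMILAR_PAIRS = {
    frozenset({"a", "o"}): 0.6,
    frozenset({"i", "e"}): 0.7,
    frozenset({"u", "o"}): 0.5,
}


def vowel_similarity(v1, v2):
    """Identical vowels score 1.0; unlisted pairs score 0.0."""
    if v1 == v2:
        return 1.0
    return SIMILAR_PAIRS.get(frozenset({v1, v2}), 0.0)


def compensate_missing(correct, extracted):
    """Fill the positions missing from `extracted` with the vowels of `correct`."""
    missing = len(correct) - len(extracted)
    if missing <= 0:
        return list(extracted)
    best_score, best_fill = -1.0, set()
    # Try every choice of missing positions and keep the best-aligned one.
    for skipped in combinations(range(len(correct)), missing):
        kept = [i for i in range(len(correct)) if i not in skipped]
        score = sum(vowel_similarity(correct[i], e) for i, e in zip(kept, extracted))
        if score > best_score:
            best_score, best_fill = score, set(skipped)
    remaining = iter(extracted)
    return [correct[i] if i in best_fill else next(remaining)
            for i in range(len(correct))]


fixed = compensate_missing(list("euou"), list("eau"))
```

After compensation both sequences have the same number of vowels, so the vowel match rate can be computed position by position as in the equal-count case.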

The arithmetic device,
when the comparison of the number of vowels between the syllable arrays shows that the syllable array of the correct phonemes contains fewer vowels than that of the extracted phonemes, treats the syllable array of the correct phonemes as correct and corrects the syllable array of the extracted phonemes by deleting each position where a vowel is redundant,
calculates, based on the utterance similarity, a vowel match rate between the vowel sequences of the corrected extracted phonemes and the correct phonemes,
The text conversion support device according to claim 3, characterized by the above.
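
The deletion case can be sketched in the same spirit. How the redundant vowel is identified is not stated in the claim; this example assumes that the vowels to drop are the ones whose removal leaves the best alignment with the correct sequence, and the similarity values are hypothetical.

```python
from itertools import combinations

# Hypothetical utterance-similarity values between different vowels (0..1).
SIMILAR_PAIRS = {
    frozenset({"a", "o"}): 0.6,
    frozenset({"i", "e"}): 0.7,
    frozenset({"u", "o"}): 0.5,
}


def vowel_similarity(v1, v2):
    """Identical vowels score 1.0; unlisted pairs score 0.0."""
    if v1 == v2:
        return 1.0
    return SIMILAR_PAIRS.get(frozenset({v1, v2}), 0.0)


def delete_redundant(correct, extracted):
    """Drop surplus vowels from `extracted` so its length equals that of `correct`."""
    extra = len(extracted) - len(correct)
    if extra <= 0:
        return list(extracted)
    best_score, best_kept = -1.0, list(extracted[:len(correct)])
    # Try every choice of vowels to drop and keep the best-aligned remainder.
    for dropped in combinations(range(len(extracted)), extra):
        kept = [v for i, v in enumerate(extracted) if i not in dropped]
        score = sum(vowel_similarity(c, e) for c, e in zip(correct, kept))
        if score > best_score:
            best_score, best_kept = score, kept
    return best_kept


trimmed = delete_redundant(list("iuo"), list("iueo"))
```

The exhaustive search is only practical because keyword vowel sequences are short; a longer input would call for a dynamic-programming alignment instead.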

The information processing device
holds, in a storage device, master data that defines correct phoneme information for each vocabulary item expected to appear in each conversation scene or subject matter,
and executes: a process of applying call recording data obtained from a predetermined device to an acoustic model to extract phonemes; a process of calculating, for each vocabulary item whose phonemes are defined in the master data and which is expected to appear in the conversation scene or subject matter of the call recording data, a match rate between the correct phonemes of that vocabulary item and the extracted phonemes; and a process of identifying, as a keyword matching result, the vocabulary item whose phonemes show a predetermined match rate as a result of the calculation,
A text conversion support method characterized by the above.

The information processing device
further holds, in the storage device, information defining the utterance similarity between vowels,
and, when calculating the match rate, executes: a process of calculating, based on the utterance similarity information, a match rate between the vowels included in the correct phonemes and the vowels included in the extracted phonemes; a process of calculating a match rate between the consonants included in the correct phonemes and the consonants included in the extracted phonemes; and a process of calculating the match rate between the phonemes from the vowel match rate and the consonant match rate, with the vowel match rate weighted more heavily than the consonant match rate,
The text conversion support method according to claim 6, characterized by the above.

The information processing device, when calculating the match rate,
generates a syllable array for each of the extracted phonemes and the correct phonemes by taking, in the sequence of vowels included in the phonemes, each pair of two consecutive vowels and determining its degree of similarity from the utterance similarity, wherein a pair whose similarity is equal to or higher than a reference value is collapsed into one predetermined vowel, whereas for a pair whose similarity is below the reference value the first vowel is adopted and the next pair is formed with the adjacent vowel in the sequence, this processing being repeated through the last vowel,
compares the number of vowels between the syllable arrays generated for the extracted phonemes and for the correct phonemes and, if the numbers are equal, calculates, based on the utterance similarity, a vowel match rate between the vowel sequences of the extracted phonemes and the correct phonemes from which the syllable arrays were generated,
The text conversion support method according to claim 7, characterized by the above.

The information processing device,
when the comparison of the number of vowels between the syllable arrays shows that the syllable array of the correct phonemes contains more vowels than that of the extracted phonemes, treats the syllable array of the correct phonemes as correct and corrects the syllable array of the extracted phonemes by filling each position where a vowel is missing with the corresponding phoneme of the correct phonemes,
calculates, based on the utterance similarity, a vowel match rate between the vowel sequences of the corrected extracted phonemes and the correct phonemes,
The text conversion support method according to claim 8, characterized by the above.

The information processing device,
when the comparison of the number of vowels between the syllable arrays shows that the syllable array of the correct phonemes contains fewer vowels than that of the extracted phonemes, treats the syllable array of the correct phonemes as correct and corrects the syllable array of the extracted phonemes by deleting each position where a vowel is redundant,
calculates, based on the utterance similarity, a vowel match rate between the vowel sequences of the corrected extracted phonemes and the correct phonemes,
The text conversion support method according to claim 8, characterized by the above.