JP5685702B2 - Speech recognition result management apparatus and speech recognition result display method

Info

Publication number: JP5685702B2
Application number: JP2009257349A
Authority: JP
Inventors: 章悟安藤; 政悟新井; 泰之高木
Original assignee: Advanced Media Inc
Current assignee: Advanced Media Inc
Priority date: 2009-11-10
Filing date: 2009-11-10
Publication date: 2015-03-18
Anticipated expiration: 2029-11-10
Also published as: JP2011102862A

Description

The present invention relates to a speech recognition result management apparatus that manages the results of speech recognition processing on the audio data of a conversation, and to a speech recognition result display method that displays the results of that processing.

Conventionally, call centers, contact centers, and the like have widely recorded the call audio between customers and operators. Making the call audio available for playback lets operators themselves, as well as supervisors, center managers, and the like (hereinafter referred to as "administrators"), review the content of a call afterward and improve service.

However, with recorded call audio alone, operators and administrators have difficulty locating the call audio they need and cannot easily listen to it again. For this reason, as described for example in Non-Patent Document 1, an apparatus has been proposed that performs speech recognition processing on speech and displays the character strings that are the speech recognition results, together with speech waveforms accompanied by utterance sections (the time ranges in which the speech recognition results were obtained), distinguished by speaker and in order of utterance timing. This apparatus displays the utterance sections and the speech waveforms together along a rightward time axis, and displays a character string for each utterance section below the speech waveform display. Whether an utterance is the operator's or the customer's is indicated by speaker information displayed to the left of the character string. With such an apparatus, operators and administrators can easily locate and analyze the call audio they need, either by visually inspecting the speech waveforms, utterance sections, and recognized character strings or by full-text search, and service can be improved.

Non-Patent Document 1: Masami Nakamura (中村雅巳), "Assembly Minutes Preparation Support System" (議会議事録作成支援システム), Automatic Recognition (自動認識), Nippon Kogyo Publishing, October 2004, Vol. 17, No. 12, pp. 38-43

However, the apparatus described in Non-Patent Document 1 displays the character strings of the operator's and the customer's utterance sections intermixed, in an order sorted by utterance start time, and furthermore displays each character string apart from its utterance section. With this apparatus, the administrator therefore has difficulty associating character strings with utterance sections and must also move their line of sight, which makes it inconvenient to review the content of a call afterward.

One conceivable way to solve this problem is to place the character strings above and below the waveform display of each utterance section. In that case, however, widening the time width of the range displayed on the screen at one time (hereinafter referred to as the "display time width") increases the number of character strings that must be displayed at one time, and the character strings no longer fit in the horizontal direction of the screen.

An object of the present invention is to provide a speech recognition result management apparatus and a speech recognition result display method that make it easy to review the content of a call.

A speech recognition result management apparatus according to one aspect of the present invention manages the results of speech recognition processing on the audio data of a conversation. The apparatus includes a screen generation unit that generates a conversation display screen, which displays character strings that are the results of the speech recognition processing and utterance sections that are the time ranges in which the speech recognition results were obtained, distinguished by speaker and in order of utterance timing, and a screen output unit that outputs the conversation display screen generated by the screen generation unit. The conversation display screen displays the utterance sections and the character strings in association with each other for each speaker, with the time axis direction of the utterance section display not coinciding with the arrangement direction of the character strings.

A speech recognition result display method according to one aspect of the present invention displays the results of speech recognition processing on the audio data of a conversation. The method includes a step of generating a conversation display screen, which displays character strings that are the results of the speech recognition processing and utterance sections that are the time ranges in which the speech recognition results were obtained, distinguished by speaker and in order of utterance timing, and a step of outputting the generated conversation display screen. The conversation display screen displays the utterance sections and the character strings in association with each other for each speaker, with the time axis direction of the utterance section display not coinciding with the arrangement direction of the character strings.

According to the present invention, the utterance sections and the character strings are displayed with their directions not coinciding. Compared with the case where the directions coincide, the character strings fit on the screen more easily, and the content of a call can be reviewed easily.

System configuration diagram showing an example of the configuration of a call center system including the speech recognition result management apparatus according to an embodiment of the present invention
Block diagram showing the configuration of the call recording/management apparatus including the speech recognition result management apparatus according to the present embodiment
Diagram showing an example of the configuration of the conversation display screen in the present embodiment
Diagram partially showing the configuration of the conversation display screen in Modification 1 of the present embodiment
Diagram showing the configuration of the conversation display screen in Modification 2 of the present embodiment
Diagram showing the configuration of the conversation display screen in Modification 3 of the present embodiment
Diagram partially showing the configuration of the conversation display screen in Modification 4 of the present embodiment
Diagram partially showing the configuration of the conversation display screen in Modification 5 of the present embodiment
Diagram showing the configuration of the conversation display screen in Modification 6 of the present embodiment

Hereinafter, an embodiment of the present invention will be described in detail with reference to the drawings.

FIG. 1 is a system configuration diagram showing an example of the configuration of a call center system including the speech recognition result management apparatus according to the embodiment of the present invention.

In FIG. 1, a call center system 100 includes customer terminals 200-1 to 200-M, operator terminals 300-1 to 300-N, a call recording/management apparatus 400 that includes the speech recognition result management apparatus according to the present embodiment, and a speech recognition server 500.

The customer terminals 200-1 to 200-M and the operator terminals 300-1 to 300-N are connected via a communication network 700 so that calls can be made between them. The call recording/management apparatus 400 is placed between the communication network 700 and the operator terminals 300-1 to 300-N. The call recording/management apparatus 400 and the speech recognition server 500 are communicably connected via, for example, a LAN (local area network).

The customer terminals 200-1 to 200-M are customer-side devices having a call function, for example a fixed-line telephone, a mobile phone, or a personal computer having an IP telephone function.

The operator terminals 300-1 to 300-N are call-center-side devices having a call function and an information processing function, for example a unit consisting of a personal computer having an IP telephone function and a headset for calls.

The call recording/management apparatus 400 records the call audio between a customer terminal 200 and an operator terminal 300, and manages the audio data of the recording with the following information attached: identification information of the call (hereinafter referred to as "call information"), identification information of the speaker (hereinafter referred to as "speaker information"), and information indicating the time at each point (hereinafter referred to as "time information").

The call recording/management apparatus 400 also uses the speech recognition server 500, described later, to obtain the speech recognition results for the audio data of the call audio separately for each speaker, and generates and displays a conversation display screen based on the obtained speech recognition result data. The conversation display screen displays the speech waveforms and the speech recognition results distinguished by speaker and in order of utterance timing. The call recording/management apparatus 400 also accepts operations on the conversation display screen, for example from a call center administrator. The conversation display screen is described later.

The speech recognition server 500 is a device that performs speech recognition processing, for example a personal computer having a LAN function.

The communication network 700 is the Internet, a public telephone network, or the like.

The call center system 100 configured in this way can display a conversation display screen that shows not only the content of calls between the customer terminals 200-1 to 200-M and the operator terminals 300-1 to 300-N but also the tone of voice of the customer and the operator (hereinafter referred to as the "call status" as appropriate). The administrator can therefore easily check the call status of each operator's calls with customers.

Next, the configuration of the call recording/management apparatus 400 according to the present invention will be described.

FIG. 2 is a block diagram showing the configuration of the call recording/management apparatus 400.

As shown in FIG. 2, the call recording/management apparatus 400 includes an audio data input unit 410, a feature extraction unit 420, a screen generation unit 430, a screen output unit 440, an operation reception unit 450, an audio output unit 460, and a file output unit 470.

In the present embodiment, the speech recognition processing is split between two devices: extraction of feature amounts from the audio data is performed in the call recording/management apparatus 400, and speech recognition based on the extracted feature amounts is performed in the speech recognition server 500.

The audio data input unit 410 receives audio data from a customer terminal 200 via the communication network 700, stores it, and outputs the stored audio data to the feature extraction unit 420 and the screen generation unit 430. The audio data includes at least speech waveform data indicating changes in sound pressure. The audio data input unit 410 attaches call information and speaker information to each piece of audio data.

The feature extraction unit 420 detects utterance sections in the audio data input from the audio data input unit 410, extracts feature amounts from the audio data of each utterance section, and generates feature amount data in which call information, speaker information, and time information of the utterance section are attached to the time-series data of the feature amounts. Here, an utterance section is a section in which uttered speech is detected continuously, without a gap of a predetermined time or longer. The feature extraction unit 420 then outputs the generated feature amount data to the speech recognition server 500 and instructs the speech recognition server 500 to execute speech recognition processing.
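The definition of an utterance section given above (speech detected continuously, with no gap of a predetermined time or longer) can be sketched as follows. This is only an illustration under stated assumptions, not the detection method of the embodiment: the per-frame speech flags, the frame length, the gap threshold, and the names `detect_utterance_sections`, `frame_s`, and `max_gap_s` are all hypothetical.

```python
from typing import List, Tuple


def detect_utterance_sections(
    speech_flags: List[bool],
    frame_s: float = 0.01,   # assumed frame length in seconds
    max_gap_s: float = 0.5,  # assumed value for the "predetermined time"
) -> List[Tuple[float, float]]:
    """Return (start, end) times of utterance sections.

    Speech frames separated by a silence shorter than max_gap_s belong to
    the same section; a silence of max_gap_s or longer ends the section.
    """
    max_gap_frames = round(max_gap_s / frame_s)
    sections: List[Tuple[float, float]] = []
    start = None        # index of the first speech frame of the open section
    last_speech = None  # index of the most recent speech frame
    for i, is_speech in enumerate(speech_flags):
        if not is_speech:
            continue
        if start is None:
            start = i
        elif i - last_speech - 1 >= max_gap_frames:
            # The silence before this frame was long enough: close the section.
            sections.append((start * frame_s, (last_speech + 1) * frame_s))
            start = i
        last_speech = i
    if start is not None:
        sections.append((start * frame_s, (last_speech + 1) * frame_s))
    return sections
```

Each returned pair corresponds to the time information that is attached to the feature amount data of one utterance section.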

As a result, the speech recognition server 500 generates text data representing the uttered speech from the feature amount data of each utterance section, using acoustic model data, dictionary data, and language model data prepared in advance. The speech recognition server 500 then outputs each piece of text data that is a speech recognition result to the call recording/management apparatus 400, with call information, speaker information, and time information of the utterance section attached.
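A recognition result carrying call information, speaker information, and utterance-section time information might be modeled as below. This is a sketch only: the record layout, the field names, and the helper `by_speaker_in_order` are assumptions made for illustration and do not appear in the source.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class RecognitionResult:
    call_id: str    # call information (hypothetical field name)
    speaker: str    # speaker information, e.g. "customer" or "operator"
    start_s: float  # utterance section start time
    end_s: float    # utterance section end time
    text: str       # recognized character string


def by_speaker_in_order(
    results: Iterable[RecognitionResult],
) -> Dict[str, List[RecognitionResult]]:
    """Group results by speaker, each group in order of utterance timing."""
    lanes: Dict[str, List[RecognitionResult]] = {}
    for r in sorted(results, key=lambda r: r.start_s):
        lanes.setdefault(r.speaker, []).append(r)
    return lanes
```

Grouping by speaker while preserving time order matches how the conversation display screen distinguishes speakers and orders utterances.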

The screen generation unit 430 of the call recording/management apparatus 400 generates, for each call, image data for composing the conversation display screen, based on the audio data input from the audio data input unit 410 and the text data input from the speech recognition server 500. The screen generation unit 430 then stores screen data, in which the image data and the audio data are integrated on a time axis, in a data storage unit 431. As described above, the conversation display screen displays the speech waveforms and the speech recognition results distinguished by speaker and in order of utterance timing. More specifically, the conversation display screen displays the utterance sections and the character strings in association with each other for each speaker, with the time axis direction of the utterance section display orthogonal to the arrangement direction of the character strings. Based on the screen data, and in response to operations performed via the operation reception unit 450, the screen generation unit 430 generates the conversation display screen and displays it via the screen output unit 440. Likewise, based on the screen data and in response to operations performed via the operation reception unit 450, the screen generation unit 430 outputs audio via the audio output unit 460. Details of the conversation display screen are described later.

The screen output unit 440 is, for example, a liquid crystal display, and displays the conversation display screen under the control of the screen generation unit 430.

The operation reception unit 450 is, for example, a keyboard and a mouse with a vertical scroll wheel, and accepts operations on the conversation display screen displayed by the screen output unit 440. The content of these operations is described later.

The audio output unit 460 outputs audio under the control of the screen generation unit 430.

The file output unit 470 outputs data to files under the control of the screen generation unit 430.

Although not shown, the call recording/management apparatus 400 can be implemented with a CPU (central processing unit), a storage medium such as a ROM (read only memory) storing a control program, a working memory such as a RAM (random access memory), a communication circuit, and the like. In this case, the functions of the units described above are realized by the CPU executing the control program.

Because the call recording/management apparatus 400 configured in this way displays the utterance sections and the character strings orthogonally to each other, the character strings fit on the screen even when the display time width is widened, and the content of a call can be reviewed easily. The call recording/management apparatus 400 also displays the speech waveforms distinguished by speaker, which allows the administrator to analyze in detail the relationship between the course of the conversation and changes in the speakers' emotions.

Next, the configuration of the conversation display screen will be described.

FIG. 3 is a diagram showing an example of the configuration of the conversation display screen.

As shown in FIG. 3, the conversation display screen 800 includes a call information display unit 810, an overall image display unit 820, a call status display unit 830, a display state change unit 840, a toolbar display unit 850, and a search bar 860.

The call information display unit 810 displays call information, which includes the content of the call information, speaker information, and time information that were attached to the original audio data. The displayed call information includes, for example, the speaker, call start date and time, call duration, call direction, call attributes, editor, last edit date and time, version number, comments, evaluation, speech recognition processing start date and time, speech recognition processing end date and time, speech recognition processing time, speech recognition confidence, number of utterances, speech recognition parameters, public information, and access history.

With this call information display unit 810, the administrator can tell which operator's call, and which call in time, the call status displayed on the conversation display screen 800 belongs to.

The overall image display unit 820 displays an overall view 821 of the speech waveform and a display range marker 822. The overall view 821 of the speech waveform displays the speech waveform of the entire call along a vertical time axis, distinguished by speaker. This lets the administrator easily find important parts, such as places where the customer's or the operator's tone becomes harsh, or silent sections. The display range marker 822 indicates the range currently displayed in the call status display unit 830 (hereinafter referred to as the "display range"), and can be moved via the operation reception unit 450 to any vertical position (any position in the call).

The call status display unit 830 displays the speech waveform 831 around the time corresponding to the position of the display range marker 822 in the overall image display unit 820, distinguished by speaker, along a time axis 832 that extends vertically. That is, the time of the speech waveform 831 is indicated by vertical position, and the sound pressure of the speech waveform 831 is indicated by horizontal position. Here, the call status display unit 830 displays the customer's speech waveform 831-1 on the left and the operator's speech waveform 831-2 on the right.
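The mapping described above, with time on the vertical axis and sound pressure on the horizontal axis, can be sketched as two small coordinate functions. The `Viewport` type, the pixel units, and the normalization of amplitude to the range -1.0 to 1.0 are assumptions made for this illustration; the source does not specify how the drawing is computed.

```python
from dataclasses import dataclass


@dataclass
class Viewport:
    start_s: float   # time shown at the top edge of the call status display
    span_s: float    # display time width in seconds
    height_px: int   # height of the display area in pixels


def time_to_y(t_s: float, vp: Viewport) -> float:
    """Vertical pixel position of time t_s (time runs downward)."""
    return (t_s - vp.start_s) / vp.span_s * vp.height_px


def amplitude_to_x(amplitude: float, lane_center_px: float,
                   lane_half_width_px: float) -> float:
    """Horizontal pixel position of a normalized sample (-1.0 to 1.0)
    drawn around the center line of one speaker's lane."""
    return lane_center_px + amplitude * lane_half_width_px
```

With one lane centered on the left for the customer and one on the right for the operator, the same two functions place both waveforms. Note that widening `span_s` compresses only the vertical direction, so the horizontal room available for text is unchanged.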

The call status display unit 830 also displays utterance section bars 833, which indicate the utterance sections in which speech recognition results were obtained, along the time axis 832 and distinguished by speaker. Here, the call status display unit 830 displays the customer's utterance section bar 833-1 to the left of center and the operator's utterance section bar 833-2 to the right of center. This lets the administrator easily tell which speaker each piece of information belongs to.

The call status display unit 830 also displays text boxes 834, which show the speech recognition results in horizontally written characters, each linked to its corresponding utterance section bar 833 and superimposed on the speech waveform 831. That is, the call status display unit 830 displays the character strings of the speech recognition results in an arrangement direction orthogonal to the direction of the time axis 832. This lets the administrator view the speech waveforms and the character strings at the same time. In addition, even when the display time width is wide (the scale is small, that is, the scale denominator is large), the horizontal width of the screen can be kept constant, which makes it easier to display the character strings on one screen and improves the efficiency of screen use. Each text box 834 has a text display field and control icons for the playback, copy, return to the state before editing, return to the initial state, and delete operations. Simply by moving the mouse over a text box 834, the administrator can make that text box 834 active (the operation target) and immediately select a character string or operate a control icon. In other words, an operation on a specific character string or on the audio data corresponding to it can be performed with a single click.

Each text box 834 may display its control icons only while the pointer is positioned over that text box 834. Each text box 834 may also normally display its character string in a small font size, and display it in a larger font size or a different color only while the pointer is positioned over that text box 834. This simplifies the screen and makes it easier for the administrator to recognize which text box 834 the pointer is over.

Each text box 834 may also change the font size of its character string according to the display time width of the call status display unit 830, or be displayed overlapping other text boxes 834 as necessary. In this way, even when the display time width is set large with the zoom bar 841 described later and the number of character strings to be displayed increases, omission of character strings can be avoided as far as possible.

During audio playback, the call status display unit 830 also displays a playback position marker 835, which indicates the playback position, superimposed on the speech waveform 831. This lets the administrator easily identify the speech waveform 831 and the character string corresponding to the audio being played.

The image displayed in the call status display unit 830 can be dragged vertically.

The display state change unit 840 displays a zoom bar 841 and a summary bar 842. The zoom bar 841 is a slide bar for changing the display time width of the call status display unit 830, that is, for changing the scale of the time axis 832 (zooming in and out). The summary bar 842 is a slide bar for specifying that the character strings displayed in the text boxes 834 of the call status display unit 830 be replaced with summarized character strings (hereinafter referred to as "summary character strings"), and for specifying the degree of summarization. Both the zoom bar 841 and the summary bar 842 slide vertically. This gives the administrator an operating feel for the display state change unit 840 that is consistent with the overall image display unit 820 and the call status display unit 830.

The toolbar display unit 850 displays a plurality of control icons, including those for continuous playback of the entire audio data. More specifically, the toolbar display unit 850 displays control icons for the following operations: stop automatic scrolling, start playback, stop playback, select playback range, copy, memo, save, return to the state before editing, and return to the initial state.

When the start playback control icon is clicked, the conversation display screen 800 starts playing the audio data, and each time the playback position reaches the bottom edge of the call status display unit 830, it scrolls the displayed image of the call status display unit 830 up by one page. Alternatively, the conversation display screen 800 may scroll the displayed image of the call status display unit 830 up continuously so that the playback position stays fixed at the center of the call status display unit 830.
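The two scrolling behaviors described above (page-by-page and centered continuous scrolling) can be sketched as functions that compute the time shown at the top edge of the call status display. The function names and the clamping at time zero are assumptions made for this illustration.

```python
def paged_start(playback_s: float, page_start_s: float, span_s: float) -> float:
    """Page-by-page scrolling: advance by one display time width each time
    the playback position reaches the bottom edge of the current page."""
    while playback_s >= page_start_s + span_s:
        page_start_s += span_s
    return page_start_s


def centered_start(playback_s: float, span_s: float) -> float:
    """Continuous scrolling: keep the playback position at the vertical
    center of the display. Clamping at 0.0 is an assumption so that the
    view never starts before the beginning of the call."""
    return max(0.0, playback_s - span_s / 2)
```

Either function would be called on each playback tick, and the returned start time determines which part of the call is drawn.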

When the copy control icon is clicked, the conversation display screen 800 stores the currently selected character string as the copy target.

When the memo control icon is clicked, the conversation display screen 800 accepts input of a memo, such as a supplementary explanation or a point to note, and stores the input.

When the save control icon is clicked, the conversation display screen 800 saves the result of editing the character strings.

When the control icon for reverting to the state before editing is clicked, the conversation display screen 800 returns the edited character string to the character string as it was before the most recent edit.

When the control icon for reverting to the initial state is clicked, the conversation display screen 800 returns the edited character string to its initial state (that is, the speech recognition result).
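
The two revert operations above could be realized, for example, by keeping an edit history for each character string. The following is a minimal sketch under that assumption; the class `EditableString` and its methods are hypothetical names introduced for illustration only.

```python
class EditableString:
    """Holds a recognized character string together with its edit history."""

    def __init__(self, recognized: str) -> None:
        # Index 0 always holds the speech recognition result (initial state).
        self._history = [recognized]

    @property
    def text(self) -> str:
        """The character string currently shown in the text box."""
        return self._history[-1]

    def edit(self, new_text: str) -> None:
        """Record an edit made by the user."""
        self._history.append(new_text)

    def revert_last_edit(self) -> None:
        """Return to the string as it was before the most recent edit."""
        if len(self._history) > 1:
            self._history.pop()

    def reset_to_initial(self) -> None:
        """Return to the initial state, i.e. the speech recognition result."""
        del self._history[1:]
```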

The search bar 860 is a text box for searching the speech recognition results for an input character string.

With the conversation display screen 800 displayed in this way, the administrator can tell from changes in the characteristics of the voice waveform 831 when the customer or the operator put the call on hold and when either of them fell silent. The administrator can also confirm at a glance, from the characteristics of the voice waveform 831, how loud the voices are and whether the recorded voice contains abnormalities (clipping or noise). In addition, the administrator can view many character strings at a glance over a wide display time width. The administrator can therefore check the call status easily.

The screen generation unit 430 holds in advance the data for composing the conversation display screen 800. For each character string displayed in a text box 834, the screen generation unit 430 also generates summary character strings according to summarization rules predetermined for each of a plurality of summary levels, and stores the generated summary character strings. A summary character string has fewer characters than the original character string and is used when display of summarized character strings is instructed via the summary bar 842.

The operation of the call recording/management apparatus 400 with respect to the conversation display screen 800 is described below.

First, the screen output unit 440 displays, for example, a call list that lists the calls whose voice data is stored in the voice data input unit 410, and accepts selection of a call. When a call is selected, the screen output unit 440 instructs the voice data input unit 410 to output the voice data of that call to the feature extraction unit 420. As a result, the voice data input unit 410 outputs the selected voice data to the feature extraction unit 420.

The call list may be displayed in suitable groups, for example "my calls", "my group's calls", "favorite calls", and "calls commented on by the administrator". Instead of using the call list, the screen output unit 440 may accept selection of a call through a conditional search (by date and time, operator name, telephone number, various classifications, full-text search of recognized or edited character strings, and so on).

The feature extraction unit 420 detects utterance sections in the input voice data and extracts feature quantities for each utterance section, outputs the feature data to the speech recognition server 500, and has the server return the speech recognition result. The feature extraction unit 420 and the speech recognition server 500 may perform the speech recognition processing in advance, for example at the time the call is recorded.

When the speech recognition result is input, the screen generation unit 430 acquires the corresponding voice data from the voice data input unit 410. The screen generation unit 430 then generates the above-described screen data from the speech recognition result and the voice data, and displays the conversation display screen 800 via the screen output unit 440. In the initial state, the screen generation unit 430 places the display range marker 822 of the overall image display unit 820 at the position where the voice waveform 831 at the start of the call is displayed. Also in the initial state, the screen generation unit 430 sets the display time width of the call status display unit 830 to a relatively small width.

The screen generation unit 430 changes the display content of the conversation display screen 800 in response to operations received from the operation reception unit 450, as follows.

When the position of the display range marker 822 is changed, the screen generation unit 430 changes the display range of the call status display unit 830 accordingly. This allows the administrator to move easily to any position to be checked. Moreover, because the overall view 821 of the voice waveform is displayed, the administrator can easily find places where the customer's or the operator's tone has become harsh and can cue up important portions precisely.
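
The correspondence between the display range marker 822 and the display range of the call status display unit 830 can be illustrated as below. This is a sketch only: it assumes that the marker position is expressed as a fraction of the overall waveform length, and the function names and parameters are hypothetical.

```python
def display_range(marker_fraction: float, total_s: float,
                  window_s: float) -> tuple[float, float]:
    """Map the marker position on the overall view to a displayed time range.

    marker_fraction is the marker's top edge as a fraction (0.0 to 1.0) of
    the overall waveform length; the result is (start_s, end_s).
    """
    if total_s <= 0 or window_s <= 0:
        raise ValueError("durations must be positive")
    window_s = min(window_s, total_s)
    start_s = marker_fraction * total_s
    # Clamp so that the displayed window always stays inside the call.
    start_s = max(0.0, min(start_s, total_s - window_s))
    return (start_s, start_s + window_s)


def marker_fraction_for(start_s: float, total_s: float) -> float:
    """Inverse mapping: where to draw the marker after the view is scrolled."""
    if total_s <= 0:
        raise ValueError("total_s must be positive")
    return max(0.0, min(start_s / total_s, 1.0))
```

The inverse function corresponds to the case where the display range is scrolled directly and the marker has to follow it.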

When a portion of the call status display unit 830 other than a text box 834 is selected and dragged vertically, the screen generation unit 430 scrolls the display range vertically according to the direction and amount of the operation. At this time, the screen generation unit 430 also moves the display range marker 822 in step with the scrolling of the display range. This allows the administrator to follow changes in the call status continuously and at his or her own pace. Because the scroll direction is vertical, scrolling with an ordinary mouse wheel is easy, which reduces the administrator's operating burden. The conventional technique, in which the time axis is arranged horizontally, cannot provide this effect.

When the slider position of the zoom bar 841 is changed, the screen generation unit 430 changes the display time width of the call status display unit 830 accordingly. This makes it easy for the administrator to browse in a way that suits the purpose, such as taking an overview of the whole call or examining details. In response to the change in display time width, the screen generation unit 430 also changes the character string displayed in each text box 834 to the summary character string of the corresponding summary level. The character strings of the speech recognition result can thus be shortened while the content of the conversation remains understandable, so that the character strings in the text boxes 834 and the voice waveform 831 are displayed legibly.

When the slider position of the summary bar 842 is changed, the screen generation unit 430 changes the character string displayed in each text box 834 to the summary character string of the corresponding summary level.
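
The handling of summary levels described above can be sketched as follows. The embodiment does not specify the summarization rules or how display time widths map to summary levels, so the simple truncation rule, the level numbers, and the thresholds below are all placeholders chosen for illustration.

```python
# Hypothetical summarization rules: maximum number of characters per level.
# Level 0 means "no summarization". The actual rules are not specified.
MAX_CHARS_BY_LEVEL = {0: None, 1: 20, 2: 10}

# Hypothetical thresholds: the largest display time width (in seconds)
# for which each summary level is used when the zoom bar is operated.
LEVEL_BY_MAX_WIDTH_S = [(60.0, 0), (300.0, 1)]
WIDEST_LEVEL = 2


def build_summaries(text: str) -> dict[int, str]:
    """Pre-generate one summary character string per summary level."""
    summaries = {}
    for level, max_chars in MAX_CHARS_BY_LEVEL.items():
        if max_chars is None or len(text) <= max_chars:
            summaries[level] = text
        else:
            # Placeholder rule: truncate and mark the omission.
            summaries[level] = text[: max_chars - 1] + "…"
    return summaries


def level_for_width(width_s: float) -> int:
    """Pick the summary level corresponding to a display time width."""
    for max_width_s, level in LEVEL_BY_MAX_WIDTH_S:
        if width_s <= max_width_s:
            return level
    return WIDEST_LEVEL
```

Because the summaries are generated and stored in advance, moving the zoom bar or the summary bar only requires a lookup by level rather than a new summarization pass.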

When a control icon of a text box 834 is clicked, the screen generation unit 430 performs the corresponding processing. For example, when the playback icon of a text box 834 is clicked, the screen generation unit 430 plays back the voice data corresponding to the displayed character string using the voice output unit 460. When the character string portion of a text box 834 is clicked, the screen generation unit 430 accepts editing of the character string.

During voice playback, the screen generation unit 430 displays the playback position marker 835 superimposed on the voice waveform 831 and moves the marker as the playback position advances.

When a control icon of the toolbar display unit 850 is clicked, the screen generation unit 430 performs the corresponding processing. For example, when the playback icon of the toolbar display unit 850 is clicked, the screen generation unit 430 plays back the voice data corresponding to the displayed voice waveform 831 using the voice output unit 460. When the save icon of the toolbar display unit 850 is clicked, the screen generation unit 430 outputs some or all of the data displayed on the conversation display screen 800 to a file using the file output unit 470.

When a character string is entered in the search bar 860, the screen generation unit 430 searches the speech recognition results for the entered character string and highlights the matching portions of the character strings displayed in the text boxes 834, for example by inverting their colors or applying a highlight. This allows the administrator to check easily where and how often a particular character string appears, and makes it easy to play back precisely the portions that contain it.
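
The search described above amounts to locating every occurrence of the entered character string in the text of each text box. A minimal sketch is given below; the function name and the representation of the text boxes as a list of strings are assumptions made for illustration.

```python
def find_hits(texts: list[str], query: str) -> list[tuple[int, int, int]]:
    """Return (text_box_index, start, end) for every occurrence of query.

    Occurrences are non-overlapping; `end` is exclusive, so
    texts[i][start:end] is the portion to highlight.
    """
    hits: list[tuple[int, int, int]] = []
    if not query:
        return hits
    for box_index, text in enumerate(texts):
        start = text.find(query)
        while start != -1:
            hits.append((box_index, start, start + len(query)))
            start = text.find(query, start + len(query))
    return hits
```

The length of the returned list gives the appearance frequency, and the text box index of each hit identifies the utterance section from which playback could be started.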

As described above, the voice output unit 460 plays back voice data under the control of the screen generation unit 430, and does so in stereo. More specifically, the voice output unit 460 plays back the customer's voice data on the left channel and the operator's voice data on the right channel. That is, for each speaker, the call recording/management apparatus 400 matches the left-right position of the call status display with the left-right position of the voice output. This makes it easy for the administrator to distinguish the voice of each speaker and to associate it with the content of the conversation display screen 800. With the conventional technique, in which the information for each speaker is arranged one above the other, ordinary stereo sound cannot distinguish up from down, so it is difficult to associate the playback direction intuitively with the display direction.
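
The channel assignment described above can be illustrated as follows, assuming that the customer's voice and the operator's voice are available as two separate monaural sample sequences at the same sampling rate. The function name is hypothetical, and actual audio output is outside the scope of this sketch.

```python
from itertools import zip_longest


def to_stereo_frames(customer: list[int],
                     operator: list[int]) -> list[tuple[int, int]]:
    """Combine two mono sample sequences into (left, right) stereo frames.

    Customer samples go to the left channel and operator samples to the
    right channel; the shorter sequence is padded with silence (0).
    """
    return list(zip_longest(customer, operator, fillvalue=0))
```

Swapping the two arguments would give the mirrored arrangement, which would be appropriate if the display also places the customer on the right and the operator on the left.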

As described above, the call recording/management apparatus 400 of the present embodiment displays the utterance sections and the character strings orthogonally to each other, so the character strings fit on the screen even when the display time width is widened, and the content of a call can be checked easily.

In the present embodiment, the conversation display screen displays information on the customer's voice data on the left side and information on the operator's voice data on the right side, but this is not a limitation. For example, the conversation display screen may arrange the customer's information and the operator's information with left and right reversed.

In the present embodiment, the text boxes are positioned relative to the center side, where the utterance section bars are displayed, but this is not a limitation either. For example, the conversation display screen may align the left edges of the text boxes vertically for each speaker. In that case, the administrator can grasp the time intervals between the utterance timings of the character strings more accurately from the positions of the left edges of the text boxes.

The conversation display screen may also arrange the customer's information and the operator's information without separating them into left and right. In that case, for example, the conversation display screen displays the utterance section bars at the edge of the screen, arranged in two vertical columns for the customer and the operator as in FIG. 3, and arranges the text boxes in a single vertical column without separating them, so that the customer's character strings and the operator's character strings are displayed closer together. The administrator can then follow the flow of the conversation between the two in a shorter time. In this case, however, so that the speakers can be distinguished, it is desirable that the conversation display screen not only display each text box in association with the corresponding voice bar but also vary the display form of the text boxes, such as color and placement, between the customer and the operator.

The conversation display screen may display only the customer's information or only the operator's information. Furthermore, the conversation display screen may switch between displaying the customer's information and the operator's information.

The direction of the time axis and the arrangement direction of the character strings on the conversation display screen need not be orthogonal. As long as the two directions do not coincide, the character strings fit on one screen more easily than in the conventional technique, in which they coincide, and screen utilization efficiency is improved. Nor does the time axis have to be vertical.

The conversation display screen may vary the display state of a character string (for example, its color or size) according to the score (confidence) of the speech recognition result. This allows the administrator to analyze calls while taking the reliability of the speech recognition result into account. In this case, the call recording/management apparatus 400 needs, for example, to obtain the scores for the respective parts of the speech recognition result from the speech recognition server 500.

The conversation display screen may highlight at least one of the character string, the text box, and the utterance section bar of a portion containing a complaint. This allows the administrator to find quickly where a complaint occurred. In this case, the call recording/management apparatus 400 needs to extract complaint portions, for example by searching for phrases characteristic of complaints.

The conversation display screen may highlight at least one of the character string, the text box, and the utterance section bar at a location where so-called talk-over occurs, that is, where the operator starts speaking while interrupting the customer. In this case, the call recording/management apparatus 400 needs, for example, to extract locations where an utterance section of the operator starts partway through an utterance section of the customer.
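
The extraction of talk-over locations described above can be sketched as an interval comparison. The sketch assumes that the utterance sections of each speaker are given as (start, end) pairs in seconds; the function name is hypothetical.

```python
def find_talk_over(customer: list[tuple[float, float]],
                   operator: list[tuple[float, float]]
                   ) -> list[tuple[float, float]]:
    """Return operator utterance sections that start partway through a
    customer utterance section (strictly after its start, before its end)."""
    flagged = []
    for op_start, op_end in operator:
        for cu_start, cu_end in customer:
            if cu_start < op_start < cu_end:
                flagged.append((op_start, op_end))
                break  # one overlapping customer section is enough
    return flagged
```

For example, with customer sections (0.0, 5.0) and (10.0, 12.0), an operator section starting at 4.0 seconds is flagged, while one starting at 7.0 seconds or exactly at 12.0 seconds is not.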

The call recording/management apparatus 400 may analyze the text information of the speech recognition result and classify the utterances by type of utterance content in time series. For example, the call recording/management apparatus 400 classifies each utterance section into a type such as "greeting", "question", "answer", "complaint", or "request", attaches a tag indicating the type to the utterance section, and displays it clearly on the conversation display screen. This allows the administrator to determine the type of each utterance quickly. In this case, the call recording/management apparatus 400 needs to determine the type of each utterance section by statistical processing, for example by searching the words appearing in each utterance section against phrase lists prepared for each type of utterance content and taking the type of the list with the most hits.
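
The hit-count classification described above can be sketched as follows. The phrase lists shown are invented English examples, not lists from the embodiment, and simple substring counting is used as a stand-in for word search; both are assumptions made for illustration.

```python
from typing import Optional

# Hypothetical phrase lists, one per utterance type (illustrative only).
PHRASES_BY_TYPE = {
    "greeting": ["hello", "thank you for calling"],
    "question": ["could you tell me", "what is", "how do i"],
    "complaint": ["unacceptable", "still broken", "refund"],
}


def classify_utterance(text: str) -> Optional[str]:
    """Return the type whose phrase list has the most hits in the text.

    Returns None when no phrase from any list appears. Ties are resolved
    in favor of the type listed first in PHRASES_BY_TYPE.
    """
    lowered = text.lower()
    best_type, best_hits = None, 0
    for utterance_type, phrases in PHRASES_BY_TYPE.items():
        # Substring counting: a simplification of word-level search.
        hits = sum(lowered.count(phrase) for phrase in phrases)
        if hits > best_hits:
            best_type, best_hits = utterance_type, hits
    return best_type
```

The returned type would then be attached as a tag to the utterance section. The same phrase-list search, restricted to the complaint list, could also serve to extract complaint portions.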

The call recording/management apparatus 400 may copy the display content or screen image of the conversation display screen and save it as a separate file or print it. This broadens the use of the speech recognition results and allows the administrator to analyze calls in greater depth. In this case, it is desirable that the target of copying or printing can be specified freely, for example character strings only, a predetermined time range only, or a specific speaker only.

In other words, the form of the conversation display screen is not limited to that shown in FIG. 3. Other forms of the conversation display screen are described below as modifications of the present embodiment.

(Modification 1)
FIG. 4 is a diagram partially showing the configuration of the conversation display screen in Modification 1 of the present embodiment.

As shown in FIG. 4, the conversation display screen 800a in Modification 1 displays, in the call status display unit 830a, a plurality of scale lines 836a that extend vertically and are arranged side by side horizontally with reference to sound pressure 0 of the voice waveform 831.

This makes it easier for the administrator to grasp changes in voice volume. Because the volume becomes easier to grasp quantitatively, the administrator can easily extract, in quantitative terms, "portions where the voice is too loud" and "portions where the voice is too quiet". For example, by looking for portions where the voice waveform 831 extends beyond the second scale line 836a on either side of the sound pressure 0 position, the administrator can easily pick out portions where the volume exceeds a predetermined value.
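
The visual check against a scale line described above has a straightforward programmatic counterpart. The following sketch assumes that the waveform is given as a sequence of amplitude values (for example, one peak value per display frame) and that the scale line corresponds to a fixed amplitude `threshold`; these names and the representation are hypothetical.

```python
def loud_ranges(samples: list[float], threshold: float,
                sample_rate: int) -> list[tuple[float, float]]:
    """Return (start_s, end_s) ranges where |amplitude| exceeds threshold.

    The absolute value is used because the waveform extends to both
    sides of the sound pressure 0 position.
    """
    ranges = []
    start = None
    for i, sample in enumerate(samples):
        if abs(sample) > threshold:
            if start is None:
                start = i  # a loud run begins here
        elif start is not None:
            ranges.append((start / sample_rate, i / sample_rate))
            start = None
    if start is not None:
        # The loud run continues to the end of the data.
        ranges.append((start / sample_rate, len(samples) / sample_rate))
    return ranges
```

Reversing the comparison would yield the portions where the voice is too quiet.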

(Modification 2)
FIG. 5 is a diagram showing the configuration of the conversation display screen in Modification 2 of the present embodiment.

As shown in FIG. 5, the conversation display screen 800b in Modification 2 displays the customer's information and the operator's information in the call status display unit 830b in different forms. Specifically, the conversation display screen 800b displays the customer's voice waveform 831b-1 and the operator's voice waveform 831b-2 in different colors, and displays the customer's text box 834b-1 and the operator's text box 834b-2 in different colors. This makes it even easier for the administrator to distinguish the information for each speaker.

The conversation display screen 800b further displays, in the toolbar display unit 850, control icons for the following operations: moving to the first utterance section, moving to the previous utterance section, moving to the next utterance section, moving to the last utterance section, and repeat playback. When one of these control icons is clicked, the conversation display screen 800 moves the playback position to the start of the relevant utterance section and makes the text box 834b of that utterance section active.
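
The movement between utterance sections described above can be sketched with a sorted list of utterance section start times. The sketch assumes that `starts` is sorted in ascending order and is not empty; the function names are hypothetical.

```python
import bisect


def previous_start(starts: list[float], play_pos_s: float) -> float:
    """Start of the last utterance section that begins before play_pos_s.

    If no section begins earlier, the first section's start is returned.
    """
    i = bisect.bisect_left(starts, play_pos_s)
    return starts[max(i - 1, 0)]


def next_start(starts: list[float], play_pos_s: float) -> float:
    """Start of the first utterance section that begins after play_pos_s.

    If no section begins later, the last section's start is returned.
    """
    i = bisect.bisect_right(starts, play_pos_s)
    return starts[min(i, len(starts) - 1)]
```

Moving to the first or last utterance section corresponds simply to `starts[0]` and `starts[-1]`.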

The conversation display screen 800b may also use different line types or fonts for the customer's information and the operator's information. In addition, the conversation display screen 800b may change the display form in the call status display unit 830b according to, for example, the volume level or whether the displayed character string contains a specific phrase. For example, the conversation display screen 800b highlights the text box 834b or the character string at a location where the volume exceeds a certain level. This allows the administrator to easily find places where the customer's or the operator's tone has become harsh.

(Modification 3)
FIG. 6 is a diagram showing the configuration of the conversation display screen in Modification 3 of the present embodiment.

As shown in FIG. 6, the conversation display screen 800c in Modification 3 displays, in the call status display unit 830c, voice spectrograms 837c-1 and 837c-2 instead of the voice waveforms of the voice data.

A voice spectrogram is a graph that displays the power in each frequency band as shading so that it is easy to grasp visually, and is also called a voiceprint. Voice characteristics that are hard to see in a voice waveform can be read from a voice spectrogram. With training, the administrator can read from the voice spectrogram the pitch of the voice, the speaker's gender, phonemes such as consonants and vowels, the speaker's emotions, and so on, and can perform kinds of call analysis that are difficult with the voice waveform alone. The conversation display screen 800c may switch between displaying the voice waveform 831 and the voice spectrogram 837c, in which case calls can be analyzed from more angles.
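
For reference, the data behind such a spectrogram (power per frequency bin for each short frame) can be computed as sketched below. This is a deliberately simple illustration using a direct discrete Fourier transform without a window function; the embodiment does not state how the spectrogram is computed, and the function name and parameters are hypothetical.

```python
import cmath
import math


def spectrogram(samples: list[float], frame_len: int,
                hop: int) -> list[list[float]]:
    """Return power per frequency bin for each frame of the signal.

    Result[t][k] is the power of bin k (0 to frame_len // 2) in frame t.
    Each value would be drawn as one shaded cell of the spectrogram.
    """
    if frame_len <= 0 or hop <= 0:
        raise ValueError("frame_len and hop must be positive")
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        powers = []
        for k in range(frame_len // 2 + 1):
            # Direct DFT of the frame at frequency bin k.
            acc = sum(x * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n, x in enumerate(frame))
            powers.append(abs(acc) ** 2)
        frames.append(powers)
    return frames
```

A practical implementation would normally use a fast Fourier transform and a window function; the direct transform is used here only to keep the sketch self-contained.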

(Modification 4)
FIG. 7 is a diagram partially showing the configuration of the conversation display screen in Modification 4 of the present embodiment.

As shown in FIG. 7, the conversation display screen 800d in Modification 4 uses, in the call status display unit 830d, the text boxes 834d instead of utterance section bars as the information indicating the utterance sections. Specifically, the conversation display screen 800d displays each text box 834d at the position on the voice waveform 831 that corresponds to the utterance section. This simplifies the screen. It also makes it even easier for the administrator to grasp the correspondence between the voice waveform 831 and the character strings.

(Modification 5)
FIG. 8 is a diagram partially showing the configuration of the conversation display screen in Modification 5 of the present embodiment.

As shown in FIG. 8, the conversation display screen 800e in Modification 5 places, in the call status display unit 830e, the utterance section bar 833e on the voice waveform 831 rather than at the center of the call status display unit 830. The area that was used to display the utterance section bar 833e can thus be allocated to other information, so the screen is used effectively. It also makes it even easier for the administrator to grasp the correspondence between the utterance sections and the voice waveform 831.

Furthermore, as shown in FIG. 8, the conversation display screen 800e may display the text box 834d at the corresponding position on the voice waveform 831, as in Modification 4. This makes it even easier for the administrator to grasp the correspondence among the voice waveform 831, the utterance sections, and the character strings.

(Modification 6)
FIG. 9 is a diagram showing the configuration of the conversation display screen in Modification 6 of the present embodiment.

As shown in FIG. 9, the conversation display screen 800f in Modification 6 displays the overall image display unit 820f at the center of the call status display unit 830f. The conversation display screen 800f displays the overall view 821 of the voice waveform stretched vertically, and displays the overall image display unit 820f in association with the display range marker 822f of the overall image display unit 820f.

This allows the information and the operation targets to be grouped at the center of the screen, improving the visibility and operability of the screen. It also makes it easier for the administrator to grasp intuitively where the display range of the overall image display unit 820f lies within the whole call.

The speech recognition result management apparatus and the speech recognition result display method according to the present invention are useful as a speech recognition result management apparatus and a speech recognition result display method that allow the content of a call to be checked easily.

DESCRIPTION OF SYMBOLS
100 Call center system
200 Customer terminal
300 Operator terminal
400 Call recording/management apparatus
410 Voice data input unit
420 Feature extraction unit
430 Screen generation unit
431 Data storage unit
440 Screen output unit
450 Operation reception unit
460 Voice output unit
470 File output unit
500 Speech recognition server
700 Communication network
800, 800a to 800f Conversation display screen
810 Call information display unit
820, 820f Overall image display unit
821, 821f Overall view of voice waveform
822, 822f Display range marker
830, 830a, 830b, 830c, 830d, 830e, 830f Call status display unit
831, 831b Voice waveform
832 Time axis
833, 833e Utterance section bar
834, 834b, 834d Text box
835 Playback position marker
836a Scale line
837c Voice spectrogram
840 Display state change unit
841 Zoom bar
842 Summary bar
850 Toolbar display unit
860 Search bar

Claims

A speech recognition result management apparatus for managing a result of speech recognition processing for speech data of conversation,
A screen generation unit that generates a conversation display screen that distinguishes each character string that is a result of the voice recognition processing for each speaker and displays them in the order of speech timing;
A screen output unit for outputting the conversation display screen generated by the screen generation unit,
The conversation display screen is
For each speaker, information indicating an utterance interval and information indicating a silent interval other than the utterance interval are displayed along the same time axis, and the character string is displayed in association with the information indicating the utterance interval, without the direction of the time axis coinciding with the arrangement direction of the character string,
Speech recognition result management apparatus.

The speech recognition result management apparatus according to claim 1,
wherein the conversation display screen displays the information indicating the utterance interval, the information indicating the silent interval, and the character string with the direction of the time axis orthogonal to the arrangement direction of the character string.

The speech recognition result management apparatus according to claim 1,
wherein the direction of the time axis is a vertical direction.

The speech recognition result management apparatus according to claim 1,
wherein the conversation display screen displays, for each speaker, information indicating the state of the speaker's tone of voice in association with the information indicating the utterance interval.

The speech recognition result management apparatus according to claim 4,
wherein the information indicating the state of the tone of voice includes a voice waveform of the speech data arranged along the time axis.

The speech recognition result management apparatus according to claim 4,
wherein the information indicating the state of the tone of voice includes information that emphasizes the character string.

The speech recognition result management apparatus according to claim 1,
wherein the screen output unit displays the conversation display screen,
the apparatus further comprising an operation receiving unit that includes, for each character string displayed on the conversation display screen, a control icon for receiving an editing operation on that character string.

The speech recognition result management apparatus according to claim 3,
wherein the screen output unit displays the conversation display screen,
the apparatus further comprising an operation receiving unit that receives a vertical scroll operation on the conversation display screen.

The speech recognition result management apparatus according to claim 1,
wherein the screen output unit displays the conversation display screen,
the apparatus further comprising an operation receiving unit that receives an operation for changing the time width of the range displayed at one time on the conversation display screen,
and wherein the screen generation unit generates the character string summarized at a summary level corresponding to the time width and displays it on the conversation display screen.

The speech recognition result management apparatus according to claim 1,
wherein the screen output unit displays the conversation display screen,
the apparatus further comprising:
an audio reproduction unit that reproduces the speech data; and
an operation receiving unit that receives designation of a playback position in the speech data and instructions to start and stop playback,
and wherein the conversation display screen indicates the playback position while the speech data is being played back.

The speech recognition result management apparatus according to claim 3,
wherein the screen output unit displays the conversation display screen,
the conversation is a conversation between two speakers,
the conversation display screen displays the voice waveforms of the two speakers separately on the left and right,
and the apparatus further comprises an audio reproduction unit that reproduces the speech data of the two speakers separated to the left and right, in directions corresponding to the left-right arrangement of the displayed voice waveforms.

The speech recognition result management apparatus according to claim 1,
wherein the conversation display screen displays, for each utterance interval, information indicating the type of utterance content of that utterance interval in association with the utterance interval.

A speech recognition result display method for displaying a result of speech recognition processing on speech data of a conversation, comprising:
generating a conversation display screen that distinguishes, for each speaker, the character strings that are results of the speech recognition processing and displays them in order of utterance timing; and
outputting the generated conversation display screen,
wherein the conversation display screen displays, for each speaker, information indicating an utterance interval and information indicating a silent interval other than the utterance interval along the same time axis, and displays the character string in association with the information indicating the utterance interval, without matching the direction of the time axis to the arrangement direction of the character string.
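
For illustration only, the display layout recited in the claims above might be sketched as follows. This is not the patented implementation: the `Utterance` record, the `silent_intervals` and `render` functions, the speaker labels, and the `#` / `.` markers for utterance and silent intervals are all hypothetical names and choices made for this sketch. It assumes a vertical time axis (one text row per time step) with each recognized character string running horizontally, so the two directions are orthogonal.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Utterance:
    speaker: str  # illustrative label, e.g. "operator" or "customer"
    start: float  # seconds from the start of the call
    end: float
    text: str     # recognized character string


def silent_intervals(utterances, speaker, total):
    """Return the gaps on the shared time axis where `speaker` is not talking."""
    gaps, cursor = [], 0.0
    own = sorted((u for u in utterances if u.speaker == speaker),
                 key=lambda u: u.start)
    for u in own:
        if u.start > cursor:
            gaps.append((cursor, u.start))
        cursor = max(cursor, u.end)
    if cursor < total:
        gaps.append((cursor, total))
    return gaps


def render(utterances, speakers, total, step=1.0):
    """Build one text row per time step (time runs top to bottom).

    Each speaker gets a column: '#' marks an utterance interval, '.' a
    silent interval, and the recognized string is written left to right
    on the row where its utterance begins.
    """
    rows = []
    for i in range(math.ceil(total / step)):
        t = i * step
        cells = []
        for s in speakers:
            active = [u for u in utterances
                      if u.speaker == s and u.start < t + step and u.end > t]
            bar = "#" if active else "."
            text = " ".join(u.text for u in active if t <= u.start < t + step)
            cells.append(f"{bar} {text:<20}")
        rows.append(f"{t:5.1f} | " + " | ".join(cells))
    return rows
```

Calling `render(calls, ["operator", "customer"], total)` with a list of `Utterance` records yields rows that can be printed one per line; both speakers share the same vertical time axis, which keeps overlaps and pauses visible at a glance.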