JP2006154568A

JP2006154568A - Logging system having voice recognition function, terminal device in same system and program

Info

Publication number: JP2006154568A
Application number: JP2004347770A
Authority: JP
Inventors: Matsuaki Terada; 松昭寺田; Kota Oshima; 浩太大島; Masatoshi Oka; 正俊岡; Hiroki Ono; 博樹大野
Original assignee: Toppan Forms Co Ltd; Tokyo University of Agriculture and Technology NUC; Tokyo University of Agriculture
Current assignee: Tokyo University of Agriculture and Technology NUC; Tokyo University of Agriculture; Toppan Edge Inc
Priority date: 2004-11-30
Filing date: 2004-11-30
Publication date: 2006-06-15

Abstract

PROBLEM TO BE SOLVED: To improve the recognition rate and the convenience of word retrieval without being adversely affected by the presence or absence of encryption.
SOLUTION: In a logging system having a voice recognition function, each terminal device (IP telephones 11 and 12) holds a specific speaker voice recognition engine (specific speaker voice recognition sections 13 and 14). The recognition text generated with that engine, or the learning data of that engine, is transmitted to the terminal device of the other party at a prescribed timing. Voice recognition is performed on the voice data transmitted and received at each terminal device, and the results are stored (information storing regions 15 and 16).
COPYRIGHT: (C)2006, JPO&NCIPI

Description

The present invention relates to a logging system with a voice recognition function that is suitable for storing and utilizing voice information exchanged over an IP (Internet Protocol) connection environment, and to a terminal device and a program for use in that system.

VoIP (Voice over Internet Protocol) telephones have become widespread as a way to reduce communication costs, and their application to CTI (Computer Telephony Integration) systems, which combine computers and telephones, has attracted attention.

For example, call centers use voice logging systems in which a logging server collects the voice of multiple telephones and records it as logs. Some of these systems store the telephone voice transparently. In others, the logging server relays the voice to the target telephone and stores it during the relay. In addition, systems that recognize and use voice recorded by a call recording device include voice command systems, which use the recognized voice to operate equipment (see, for example, Non-Patent Documents 1 and 2).
Non-Patent Document 1: http://advanced-media.co.jp/prooducts/1502.html <Internet>, viewed November 11, 2004, "AmiVoice series product information"
Non-Patent Document 2: http://www.logit.co.jp/products/nicelog/voip.html <Internet>, viewed November 11, 2004, Logit Corporation, Products [Product Introduction], "Latest IP Recording that Enables VoIP Recording"

The voice data exchanged by the IP telephones described above has no mechanism, such as encryption, for preventing eavesdropping by a third party. From the viewpoint of privacy protection, however, an encryption mechanism is highly likely to be incorporated. In the United States, there is also a movement to legally require the preservation of IP telephone voice as a measure for preserving evidence in litigation and similar proceedings.
Once encryption is taken into account, a single server that stores voice transparently holds only encrypted data, which is not easy to use. A logging system that relays voice may be able to decrypt the encrypted data and store it, but to protect privacy it must encrypt the decrypted data again before relaying it, and the processing required for the relay may hinder smooth communication. In addition, because the load is concentrated on the server, there is a risk of logging errors and the like.

As for voice recognition, the speaker in telephone communication is not uniquely determined, so a voice recognition engine aimed at an unspecified large number of speakers is required. Such an unspecified speaker voice recognition engine has a lower recognition rate than a specific speaker voice recognition engine, which can absorb individual differences in speech.
In addition, the quality of the voice exchanged by telephone varies with the state of the device and the state of the speaker. A microphone with poor sensitivity, the distance between the microphone and the vocal organs, and the loudness of the speech all adversely affect the recognition result. Furthermore, because the voice and the recognized text are not correlated with each other, it is difficult, after a word search, to start playback partway through the recording at the point where the word occurs.

The present invention has been made in view of the above circumstances. Its object is to provide a logging system with a voice recognition function, and a terminal device and a program for use in that system, that improve convenience in word search without being affected by conditions such as the presence or absence of encryption, the voice input environment, and the recognition environment.

To solve the above problems, the logging system with a voice recognition function according to the present invention is a system that logs call voice between terminal devices connected to a network. It is characterized by comprising means for transmitting to the counterpart terminal device, at a predetermined timing, either the recognition text created with the specific speaker voice recognition engine that each terminal device has (an engine that performs voice recognition for a specific speaker) or the learning data of that engine, performing voice recognition of the voice data transmitted and received at the counterpart terminal device, and storing the result.

The present invention is also a terminal device in a logging system with a voice recognition function that logs call voice between terminal devices connected to a network. The terminal device is characterized by comprising: means for receiving and storing voice data transmitted from a counterpart terminal device; means for storing voice data to be transmitted to the counterpart terminal device, generating recognition text from that voice data using a voice recognition engine that performs voice recognition for the terminal's own user, and storing the recognition text in association with the voice data; and means for transmitting the recognition text to the other party of the call at an arbitrary timing.

The present invention is further characterized in that the recognition text is transmitted when the end of the call with the counterpart terminal device is detected.

The present invention is further characterized in that a session for transmitting the recognition text is established separately from the call session, and the stored recognition text is transmitted in each such session.

The present invention is also a terminal device in a logging system with a voice recognition function that logs call voice between terminal devices connected to a network. The terminal device is characterized by comprising: means for storing voice data to be transmitted to a counterpart terminal device, generating recognition text from that voice data using a voice recognition engine that performs voice recognition on that voice data, and storing the recognition text in association with the voice data; means for receiving and storing voice data transmitted from the counterpart terminal device; and means for receiving learning data of the counterpart terminal device's voice recognition engine, transmitted from the counterpart terminal device at an arbitrary timing, and generating recognition text from the stored received voice data using that learning data.

The present invention is also a program used in a terminal device of a logging system with a voice recognition function that logs call voice between terminal devices connected to a network. The program causes a computer to execute: a process of receiving and storing voice data transmitted from the other party; a process of storing voice data to be transmitted to the counterpart terminal device, generating recognition text from that voice data using a voice recognition engine that performs voice recognition on that voice data, and storing the recognition text in association with the voice data; and a process of transmitting the recognition text to the other party of the call at an arbitrary timing.

The present invention is also a program used in a terminal device of a logging system with a voice recognition function that logs call voice between terminal devices connected to a network. The program causes a computer to execute: a process of storing voice data to be transmitted to a counterpart terminal device, generating recognition text from that voice data using a voice recognition engine that performs voice recognition on that voice data, and storing the recognition text in association with the voice data; a process of receiving and storing voice data transmitted from the counterpart terminal device; and a process of receiving learning data of the counterpart terminal device's voice recognition engine, transmitted from the counterpart terminal device at an arbitrary timing, and generating recognition text from the stored received voice data using that learning data.

According to the present invention, each terminal device can use recognition text produced by a specific speaker voice recognition engine, which performs voice recognition for a specific speaker, so both the recognition rate and the processing speed are improved. In addition, because the transmitted and received data and the generated recognition text are stored in the terminal devices, logging is not affected by encryption of the exchanged voice data, and the logging errors caused by load concentrating on a particular device are eliminated.
Further, according to the present invention, each terminal sends its own specific speaker recognition result to the other party, either when the end of the call is detected or in each session established separately from the call session. Real-time performance is sacrificed, but a result with a high recognition rate is obtained. Furthermore, by storing the correlation between the voice and the recognition text, the invention enables partial playback: a part hit by a word search can be listened to starting from the phrase that contains the hit word. This makes searching more convenient.

Embodiments of the present invention are described below with reference to FIGS. 1 to 7. FIG. 1 is a system configuration diagram of a logging system with a voice recognition function according to an embodiment of the present invention. It shows an example in which users talk to each other using IP telephones 11 and 12.
The IP telephone 11 (12), which functions as the terminal device of the present invention, includes a specific speaker voice recognition unit 13 (14) equipped with a specific speaker voice recognition engine. The result of the voice recognition performed there is stored in the information storage areas 15 and 16 in association with the transmitted data. In both telephones, received voice data is stored only temporarily at first, and is later stored in association with the recognition text transmitted from the other IP telephone 12 (11).

FIG. 2 is a block diagram showing the internal configuration of the logging system with a voice recognition function of FIG. 1, broken down by function.
In FIG. 2, blocks with the same numbers as blocks in FIG. 1 have the same names and functions as in FIG. 1. Here, the IP telephone 11 (12) is assumed to have only a voice transmission unit 111 (122) and a voice reception unit 112 (121).
The voice data transmitted by the voice transmission unit 111 (122) is temporarily stored in the transmission voice storage unit 151 (161) of the information storage area 15. It is also supplied to the specific speaker voice recognition unit 13 (14), where the specific speaker recognition engine recognizes the user's own speech. Because abundant learning data is available here, the recognition rate is high and results are obtained quickly. The specific speaker voice recognition unit 13 (14) holds the digital voice data until the amount needed for recognition has accumulated, and performs recognition as soon as it becomes possible. The recognized digital voice data and the recognition text remain stored in the transmission voice recognition text storage unit 152 (162) of the information storage area 15 even after recognition of the digital voice data for the entire call has finished.

Meanwhile, the packetized digital voice data from the IP telephone 11 (12) is received by the voice reception unit 121 (112) of the IP telephone 12 (11) and temporarily stored in the received voice storage unit 163 (153). If the data is encrypted, it is decrypted at this point. As described above, the IP telephone 12 (11) simultaneously performs voice recognition of the digital voice data of its own user's speech (specific speaker voice recognition unit 14 (13)) and stores the result in the transmission voice recognition text storage unit 162 (152).
The recognition text synchronization unit 18 (17) monitors the timing for transmitting the recognition text stored in the transmission voice recognition text storage unit 162 (152) to the received voice recognition text storage unit 154 in the information storage area 15 of the IP telephone 11 (12) that sent the voice data. Here, the unit waits for the end of the call to be detected and transmits the text once voice recognition has completed after the call ends. In VoIP (Voice over IP) communication the host name is known in advance, so, triggered by detection of the end of the call, the IP telephones 11 (12) can easily synchronize the contents of the call exchanged between them.

In this example the recognition text is transmitted to the counterpart IP telephone 11 (12) when the end of the call is detected. Alternatively, a connection for transmitting the recognition text, for example an FTP (File Transfer Protocol) connection, may be established separately from the VoIP call session, and the text transmitted in each such session. Various other modifications are conceivable, such as transmitting the recognition text whenever a silent section is detected.
Also, although the specific speaker voice recognition unit 13 (14) here carries only one specific speaker recognition engine, it could carry several, with the speaker selecting among them by a switch. A "specific speaker" here means a speaker whose individual characteristics have been stored through a learning process over a predetermined period.

FIG. 3 is a conceptual diagram of the voice recognition processing used in the embodiment of the present invention. From the digital voice data stored in the temporary data storage area 51 (corresponding to the transmission voice storage units 151 and 161 and the received voice storage units 153 and 163 in the information storage areas 15 and 16 of FIG. 2), the phrase segmentation processing unit 52 generates phrase-segmented voice data 53.
Next, the voice recognition unit 54 (corresponding to the specific speaker voice recognition units 13 and 14 in FIG. 2) performs voice recognition on the phrase-segmented voice data 53 and generates recognition text 55. The recognition text 55 and the phrase-segmented voice data 53 are stored as a pair in the data/recognition text storage area 56 (corresponding to the transmission voice recognition text storage units 152 and 162 and the received voice recognition text storage units 154 and 164 in FIG. 2). This operation is repeated until no digital voice data remains in the temporary data storage area 51.
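The pairing of phrase-segmented voice data with its recognition text can be illustrated with a minimal Python sketch. The names `PhraseRecord`, `recognize_buffer` and `find_phrases` are hypothetical and do not appear in the publication, and the recognizer is passed in as a plain callable because the publication does not specify an engine interface.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class PhraseRecord:
    """One phrase of audio kept together with its recognition text (hypothetical type)."""
    audio: bytes
    text: str


def recognize_buffer(phrases: Iterable[bytes],
                     recognize: Callable[[bytes], str]) -> List[PhraseRecord]:
    """Recognize each phrase and store audio and text as a pair, as in FIG. 3."""
    return [PhraseRecord(audio=p, text=recognize(p)) for p in phrases]


def find_phrases(records: List[PhraseRecord], word: str) -> List[int]:
    """Return the indices of phrases whose text contains the word.

    Because each record also holds its audio, playback can start at any hit.
    """
    return [i for i, r in enumerate(records) if word in r.text]
```

Keeping the audio and the text in one record is what makes playback from the matching phrase possible after a word search.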

FIGS. 4 and 5 are flowcharts showing the procedure of the phrase segmentation processing shown in FIG. 3.
In FIG. 4, phrase segmentation uses the silence control of RTP (Real-time Transport Protocol), the standard media stream transport protocol for IP telephony. RTP is a protocol that controls stream transmission over IP networks, which are highly fault tolerant but are not designed for real-time data arrival. Specifically, packet loss on the transmission path is detected from a sequence number that is incremented by 1 each time a voice packet is sent, and playback timing is controlled by a timestamp expressed as the cumulative amount of data sent.
Silence control means not sending data, in order to save bandwidth, when there is no voice input from a voice input device such as a microphone for a certain period. When silence control occurs, the sequence number of the next packet is that of the last packet before the silence plus 1, while the timestamp advances by the same amount as if data had been sent during the silent period. Silence control therefore appears as a large increase in the timestamp with no packet loss.
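The silence signature described above (consecutive sequence numbers with a large timestamp jump) can be checked with a short Python sketch. The function name `silence_boundaries` and the threshold value are illustrative assumptions. The 16-bit sequence number and 32-bit timestamp widths come from the RTP specification, and the modular arithmetic handles wraparound of both fields.

```python
from typing import List, Sequence, Tuple

SEQ_MOD = 1 << 16  # RTP sequence numbers are 16 bits wide
TS_MOD = 1 << 32   # RTP timestamps are 32 bits wide


def silence_boundaries(packets: Sequence[Tuple[int, int]],
                       gap_threshold: int) -> List[int]:
    """Return indices of packets that follow a silence-suppressed gap.

    packets is a list of (sequence_number, timestamp) pairs in arrival order.
    A boundary is reported when the sequence number advanced by exactly 1
    (no packet loss) but the timestamp advanced by at least gap_threshold.
    """
    boundaries = []
    for i in range(1, len(packets)):
        prev_seq, prev_ts = packets[i - 1]
        seq, ts = packets[i]
        consecutive = (seq - prev_seq) % SEQ_MOD == 1
        ts_gap = (ts - prev_ts) % TS_MOD
        if consecutive and ts_gap >= gap_threshold:
            boundaries.append(i)
    return boundaries
```

A gap in the sequence numbers is deliberately not treated as silence, since in that case the timestamp jump may simply reflect lost packets.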

The procedure is described with reference to the flowchart of FIG. 4. First, in step S61, it is determined whether the temporary data storage area 51 holds enough digital voice data for recognition processing. If not enough data is stored, it is determined in step S62 whether the stored data has reached the end of the call. If the end of the call has been reached, recognition processing is performed in step S66. If not, the process waits for a fixed time in step S63 and then returns to step S61.
If it is determined in step S61 that enough data for recognition is stored, it is determined in step S64 whether there is a point at which the RTP timestamps are separated by at least a certain threshold, that is, a point at which silence control occurred. If no silence control has occurred, it is determined in step S62 whether the stored data has reached the end of the call. If the end of the call has been reached, control passes to the recognition processing of step S66. If not, the stored data is likely to continue, so the process waits for a fixed time in step S63 and then returns to step S61.

If silence control is detected in step S64, then in step S65 the data from the beginning of the stored data up to the silent section is taken as one phrase of voice data. The data taken here is removed from the stored data. The phrase-segmented voice data is recognized by the recognition engine (voice recognition processing unit 54) in step S66, and the result is stored in the area for recognition results.
After recognition, it is determined in step S67 whether any stored data remains. If data remains, the process waits for a fixed time in step S63 and then returns to step S61. If no data remains, the entire call is regarded as recognized and the process ends.
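The control flow of steps S61 to S67 can be sketched in Python as follows. This is a simplified reading of the flowchart, not the publication's implementation: the names `segment_step` and `drain` are hypothetical, the buffer is an in-memory list, and the fixed wait of step S63 is replaced by returning to the caller, who would retry once more audio has arrived.

```python
from typing import Callable, List, Optional, Tuple


def segment_step(buffer: list, call_ended: bool, min_items: int,
                 find_silence: Callable[[list], Optional[int]]
                 ) -> Tuple[Optional[list], list]:
    """One pass through S61 to S65; returns (phrase or None, remaining buffer)."""
    if len(buffer) >= min_items:               # S61: enough data stored?
        cut = find_silence(buffer)             # S64: silence boundary present?
        if cut:                                # S65: cut one phrase off the front
            return buffer[:cut], buffer[cut:]
    if call_ended and buffer:                  # S62: end of call reached
        return buffer, []                      # the remainder goes to recognition
    return None, buffer                        # S63: caller waits and retries


def drain(buffer: list, call_ended: bool, min_items: int,
          find_silence: Callable[[list], Optional[int]],
          recognize: Callable[[list], str]) -> Tuple[List[str], list]:
    """Repeat segmentation and recognition (S66) while data remains (S67)."""
    results = []
    while buffer:
        phrase, buffer = segment_step(buffer, call_ended, min_items, find_silence)
        if phrase is None:
            break                              # a live system would wait here (S63)
        results.append(recognize(phrase))
    return results, buffer
```

Here `find_silence` is expected to return the index just past the first silence boundary, or `None` when no boundary exists. It could be built on RTP timestamp gaps as in FIG. 4.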

Next, the procedure is described with reference to the flowchart of FIG. 5. FIG. 5 shows a method that examines the voice level of the digital voice data and treats a section in which a low voice level continues for a certain time as a phrase break.
First, in step S71, it is determined whether the temporary data storage area 51 holds enough digital voice data for recognition processing. If not enough data is stored, it is determined in step S72 whether the stored data has reached the end of the call. If the end of the call has been reached, recognition processing is performed in step S77. If not, the process waits for a fixed time in step S73 and then returns to step S71.
If it is determined in step S71 that enough data for recognition is stored, a noise removal filter is applied in step S74 to remove noise and smooth out white noise.

Next, in step S75, it is determined whether there is a section in which the voice level stays low for a certain time. If there is no such section, it is determined in step S72 whether the data has reached the end of the call. If it has, control passes to step S77. If not, the process waits for a fixed time in step S73 and then returns to step S71.
If there is a section with a low voice level, then in step S76 the data from the beginning of the stored data up to the low-level section is taken as one phrase of voice data. The data taken here is removed from the stored data. The phrase-segmented voice data is recognized by the recognition engine (specific speaker voice recognition units 13 and 14) in step S77, and the result is stored in the area for recognition results. After recognition, it is determined in step S78 whether any stored data remains. If data remains, the process waits for a fixed time in step S73 and then returns to step S71. If no data remains, the entire call is regarded as recognized and the process ends.
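The level-based detection of steps S75 and S76 can be sketched in Python as below. The sketch assumes the audio is already a sequence of noise-filtered integer samples. The function names and parameters are hypothetical, and discarding the low-level section itself after the cut is an assumption, since the publication only says that the data up to that section is taken as one phrase.

```python
from typing import List, Optional, Sequence, Tuple


def find_low_level_run(samples: Sequence[int], level_threshold: int,
                       min_run: int) -> Optional[int]:
    """Return the start index of the first run of at least min_run samples
    whose absolute value is below level_threshold, or None (step S75)."""
    run_start = None
    for i, s in enumerate(samples):
        if abs(s) < level_threshold:
            if run_start is None:
                run_start = i
            if i - run_start + 1 >= min_run:
                return run_start
        else:
            run_start = None
    return None


def split_first_phrase(samples: Sequence[int], level_threshold: int,
                       min_run: int) -> Tuple[Optional[List[int]], List[int]]:
    """Cut the data before the low-level section off as one phrase (step S76).

    Returns (phrase, remaining samples); phrase is None when no break exists.
    """
    start = find_low_level_run(samples, level_threshold, min_run)
    if start is None:
        return None, list(samples)
    end = start
    while end < len(samples) and abs(samples[end]) < level_threshold:
        end += 1                               # skip the whole quiet section
    return list(samples[:start]), list(samples[end:])
```

In practice `min_run` would be derived from the sampling rate and the desired pause length, for example 0.3 s at 8 kHz gives 2400 samples. Both figures are illustrative only.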

FIG. 6 is a conceptual diagram of the search processing. It illustrates a method that uses a fuzzy search dictionary so that words likely to result from misrecognition by the voice recognizer are also used as search terms.
First, a word is entered in the word input field of a search GUI (graphical user interface) 81 created with browser software, and the search button is pressed. The entered word is thereby passed to the search system 82 (S81).
The search system 82 obtains from the fuzzy search dictionary 83 a list of possible misrecognitions, that is, words that the input word may have been misrecognized as (S82). The search system 82 then checks, for every word in the list, whether it matches any of the voice recognition text stored in the data/recognition text storage area 84 (S83). The search results are displayed on the result display interface 85, ranked according to the ranks assigned in advance to the words in the list of possible misrecognitions (S84).
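The search steps S81 to S84 can be sketched in Python as follows. The dictionary layout (input word mapped to misrecognized strings with occurrence counts) and the name `fuzzy_search` are assumptions made for illustration. Searching for the exact input word first, at rank 0, is also an assumption, since the publication only describes searching the words in the list.

```python
from typing import Dict, List, Tuple


def fuzzy_search(word: str,
                 fuzzy_dict: Dict[str, Dict[str, int]],
                 phrases: List[str]) -> List[Tuple[int, str, int]]:
    """Search stored recognition text for a word and its likely misrecognitions.

    Returns (rank, matched candidate, phrase index) tuples, best rank first.
    Misrecognitions seen more often get a better (smaller) rank.
    """
    ranked = sorted(fuzzy_dict.get(word, {}).items(), key=lambda kv: -kv[1])
    candidates = [word] + [w for w, _ in ranked]      # S82: candidate list
    hits = []
    for rank, candidate in enumerate(candidates):     # S83: match every candidate
        for idx, text in enumerate(phrases):
            if candidate in text:
                hits.append((rank, candidate, idx))
    return hits                                       # S84: already in rank order
```

Each hit carries a phrase index, so a caller that stores audio per phrase can start playback at the phrase containing the hit.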

FIG. 7 is a conceptual diagram of the process for creating the fuzzy dictionary. Voice recognition works by pattern matching on the features of the input voice waveform, so the possible misrecognition results for a given word are limited to a fairly small set. The process exploits this fact.
First, in the voice/word input unit 91, a word is entered in the word input field, and the voice corresponding to that word is input through a voice input device such as a microphone. The entered word is held by the fuzzy dictionary creation unit 92. The voice is then actually recognized by the voice recognition processing unit 93, and the recognition result is passed to the fuzzy dictionary creation unit 92 and associated with the word entered first. If the word entered in the input field and the recognition result text are identical, the result text is discarded. If the result text is already associated with the input word, its occurrence count is incremented, and the count is used for ranking when the results of the search processing of FIG. 6 are displayed.

Next, the voice that has been recognized is passed to the voice adjustment processing unit 94, which applies effects such as voice level adjustment, noise addition, and tempo adjustment, and the voice recognition processing unit 93 then performs voice recognition on it again. Repeating these operations enlarges the list of misrecognition results in the fuzzy search dictionary.
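The dictionary creation procedure can be sketched in Python as below. The names `add_observation` and `build_entries` are hypothetical, and the recognizer and the audio effects are passed in as callables because the publication does not define their interfaces. A `Counter` per input word holds the occurrence counts that are later used for ranking.

```python
from collections import Counter, defaultdict
from typing import Callable, DefaultDict, Iterable


def add_observation(fuzzy_dict: DefaultDict[str, Counter],
                    word: str, recognized: str) -> None:
    """Record one recognition result for an input word."""
    if recognized == word:
        return                        # correct recognition: result is discarded
    fuzzy_dict[word][recognized] += 1  # new or repeated misrecognition


def build_entries(fuzzy_dict: DefaultDict[str, Counter], word: str, audio,
                  recognize: Callable, effects: Iterable[Callable]) -> None:
    """Recognize the original audio, then each adjusted variant of it."""
    add_observation(fuzzy_dict, word, recognize(audio))
    for effect in effects:            # e.g. level change, added noise, tempo change
        add_observation(fuzzy_dict, word, recognize(effect(audio)))
```

A dictionary created with `defaultdict(Counter)` can be passed straight to these functions. Sorting the counts of one entry in descending order then yields the ranking used at search time.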

As described above, according to the present invention, each terminal device can perform speech recognition using a specific speaker speech recognition engine, that is, an engine that recognizes the speech of a specific speaker, which improves both the recognition rate and the processing speed. In addition, because the transmitted and received data and the generated recognition text are stored in the terminal device, logging is unaffected by encryption of the exchanged voice data, and logging errors caused by load concentrating on a specific device are eliminated.
Furthermore, according to the present invention, each terminal device transmits its own specific speaker recognition result to the other party, either upon detecting the end of the call or for each session established separately from the call session. Real-time performance is sacrificed, but a result with a high recognition rate is obtained.
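As an illustration of the first transmission timing, the following Python sketch buffers recognition text during a call and hands it to a sender only when the end of the call is detected. The class `RecognitionTextSync`, its method names, and the `send` callback are hypothetical; how the text is actually carried to the other party is outside the sketch.

```python
from typing import Callable, List, Tuple

# One logged item: (start offset of the phrase in milliseconds, recognition text).
LogItem = Tuple[int, str]


class RecognitionTextSync:
    """Holds recognition text locally and transmits it when the call ends."""

    def __init__(self, send: Callable[[List[LogItem]], None]) -> None:
        self._send = send                 # assumed transport to the other party
        self._pending: List[LogItem] = []

    def on_phrase_recognized(self, start_ms: int, text: str) -> None:
        # During the call the text is only stored, so nothing is sent in real time.
        self._pending.append((start_ms, text))

    def on_call_ended(self) -> None:
        # End of call detected: transmit everything recognized so far, once.
        if self._pending:
            self._send(list(self._pending))
            self._pending.clear()
```

The same class would serve the second timing if `on_call_ended` were instead triggered whenever the separately established session becomes available.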

The present invention also divides the speech into phrases before performing speech recognition, which improves the recognition rate. Furthermore, because the correspondence between the speech and the recognition text is stored, a search can offer partial playback: a portion hit by a word search can be played back starting from the phrase containing the hit word. Searching with the fuzzy search dictionary also lowers the rate of missed hits even when the speech recognition results are not accurate. Moreover, because the fuzzy search dictionary is created using an actual speech recognition engine, it reflects actual misrecognition patterns, and using it yields search processing with even fewer missed hits.
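The combination of fuzzy search and partial playback can be sketched as follows in Python. The `Phrase` record, the function names, and the representation of the fuzzy dictionary as a plain mapping from a word to its known misrecognitions are all assumptions made for illustration; matching is done by simple substring search.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Phrase:
    start_ms: int  # where the phrase begins in the stored speech
    text: str      # recognition text stored in association with that phrase


def expand_query(word: str, fuzzy_dictionary: Dict[str, List[str]]) -> List[str]:
    # Search for the word itself plus every text it is known to be misrecognized as.
    return [word] + fuzzy_dictionary.get(word, [])


def find_playback_offsets(
    word: str, phrases: List[Phrase], fuzzy_dictionary: Dict[str, List[str]]
) -> List[int]:
    # Return the start offset of each phrase containing a candidate, so that
    # playback can begin from the phrase in which the hit occurred.
    candidates = expand_query(word, fuzzy_dictionary)
    return [p.start_ms for p in phrases if any(c in p.text for c in candidates)]
```

With an empty dictionary the function degrades to an exact word search, which makes the contribution of the fuzzy dictionary to reducing missed hits easy to compare.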

FIG. 1 is a diagram showing the system configuration of a logging system with a speech recognition function according to an embodiment of the present invention.
FIG. 2 is a block diagram showing the internal configuration of the logging system with a speech recognition function according to the embodiment, broken down by function.
FIG. 3 is a conceptual diagram of the operation of the speech phrase segmentation process and the speech recognition process according to the embodiment.
FIG. 4 is a flowchart showing an example of the procedure of the phrase segmentation process shown in FIG. 3.
FIG. 5 is a flowchart showing another example of the procedure of the phrase segmentation process shown in FIG. 3.
FIG. 6 is a conceptual diagram of the operation of a search system using fuzzy search.
FIG. 7 is a conceptual diagram of the operation of creating the dictionary used for fuzzy search.

Explanation of symbols

11, 12: IP telephones; 13, 14: specific speaker voice recognition sections; 15, 16: information storage areas; 17, 18: recognition text synchronization sections

Claims

A logging system with a voice recognition function for logging call voice between terminal devices connected to a network,
wherein each of the terminal devices comprises means for transmitting to the counterpart terminal device, at a predetermined timing, either recognition text created using a specific speaker voice recognition engine that performs voice recognition for a specific speaker, or learning data of the specific speaker voice recognition engine, whereby voice recognition of the voice data transmitted and received at the counterpart terminal device is performed and the result is stored.

A terminal device in a logging system with a voice recognition function for logging call voice between terminal devices connected to a network, the terminal device comprising:
means for receiving and storing voice data transmitted from the counterpart terminal device;
means for storing voice data to be transmitted to a destination terminal device, generating a recognition text from the voice data using a voice recognition engine that performs voice recognition on the subject, and storing the text in association with the voice data; and
means for transmitting the recognition text to a call destination terminal device at an arbitrary timing.

The terminal device according to claim 2, wherein the recognition text is transmitted when it is detected that the call with the counterpart terminal device has ended.

The terminal device according to claim 2, wherein a session for transmitting the recognized text is established separately from a call session, and the stored recognized text is transmitted for each session.

A terminal device in a logging system with a voice recognition function for logging call voice between terminal devices connected to a network, the terminal device comprising:
means for storing voice data to be transmitted to a destination terminal device, generating a recognition text from the voice data using a voice recognition engine that performs voice recognition on the voice data, and storing the text in association with the voice data;
means for receiving and storing voice data transmitted from the counterpart terminal device; and
means for receiving learning data of the voice recognition engine of the counterpart terminal device, transmitted at an arbitrary timing from the counterpart terminal device, and generating recognition text from the stored received voice data using the learning data.

A program used for the terminal device in a logging system with a voice recognition function for logging call voice between terminal devices connected to a network, the program causing a computer to execute:
a process of receiving and storing voice data transmitted from the counterpart terminal device;
a process of storing voice data to be transmitted to a partner terminal device, generating a recognition text from the voice data using a voice recognition engine that performs voice recognition on the voice data, and storing the text in association with the voice data; and
a process of transmitting the recognition text to a call destination terminal device at an arbitrary timing.

A program used for the terminal device in a logging system with a voice recognition function for logging call voice between terminal devices connected to a network, the program causing a computer to execute:
a process of storing voice data to be transmitted to a partner terminal device, generating a recognition text from the voice data using a voice recognition engine that performs voice recognition on the voice data, and storing the text in association with the voice data;
a process of receiving and storing voice data transmitted from the counterpart terminal device; and
a process of receiving learning data of the voice recognition engine of the counterpart terminal device, transmitted at an arbitrary timing from the counterpart terminal device, and generating recognition text from the stored received voice data using the learning data.