JP2007228114A

JP2007228114A - Telephone system

Info

Publication number: JP2007228114A
Application number: JP2006044730A
Authority: JP
Inventors: Shigeru Murata (村田滋); Hideo Azuma (東英男)
Original assignee: Fujitsu FSAS Inc
Current assignee: Fujitsu FSAS Inc
Priority date: 2006-02-22
Filing date: 2006-02-22
Publication date: 2007-09-06
Anticipated expiration: 2026-02-22
Also published as: JP4607028B2

Abstract

PROBLEM TO BE SOLVED: To provide a telephone system with the following behavior during a call. Voice is transmitted from a sending side and received on a receiving side. The text recognized from the voice is displayed on the receiving side in synchronization with the voice, even when the voice is interrupted, and the interrupted portion is highlighted. The receiving side can also notify the sending side of the interruption, so that the interruption is displayed on the sending side as well.

SOLUTION: The telephone system includes means on the sending side that perform four steps: packetize the voice, packetize the text information produced by speech recognition of that voice, add identification information linking the voice and the text to these packets, and transmit the packets. It also includes means on the receiving side that receive the packets and, when outputting the voice and the text information, output them in synchronization based on the identification information.

COPYRIGHT: (C)2007,JPO&INPIT

Description

The present invention relates to a telephone system in which voice is transmitted by a sending side and received by a receiving side.

Conventionally, telephone systems using IP technology give voice packets preferential treatment through packet prioritization in the network, so that the voice is interrupted as little as possible.

There is also a technique in which a communication server on the network performs speech recognition on the voice (packets) from a telephone system, converts it into text, and delivers the text by e-mail to a designated destination (Patent Document 1).

There is also a technique for calls that cannot be answered (Patent Document 2). The caller's voice message is stored, recognized, and converted into a text message. The converted text message is then transmitted and displayed at the caller's message-notification timing or when the user requests it.

Patent Document 1: JP 2005-12391 A
Patent Document 2: JP 2004-328525 A

However, even with conventional packet prioritization, priority control becomes difficult to operate completely once the telephone system is fully IP-based. As a result, the voice may still be interrupted.

Consequently, the sending side cannot tell that the voice has not been reproduced properly on the receiving side. The conversation simply continues, and the speaker later has to supplement what was said or repeat part of the conversation.

Further, in the technique of the former patent document (Patent Document 1), a server on the network recognizes the voice and distributes the result by e-mail. This has two drawbacks. The recognized text cannot be displayed in synchronization with the voice, and the sending side cannot recognize that the voice was not reproduced properly on the receiving side.

Likewise, the technique of the latter patent document (Patent Document 2) converts the caller's voice message into text data and displays it when an incoming call cannot be answered. It has the same two drawbacks: the recognized text cannot be displayed in synchronization with the voice, and the sending side cannot recognize that the voice was not reproduced properly on the receiving side.

To solve these problems, the present invention acts during a call between the sending side and the receiving side of a telephone system. Even if the voice is interrupted partway through on the receiving side, the voice text is displayed in synchronization and the interrupted portion is highlighted. In addition, the receiving side notifies the sending side of the interruption, and the notification is displayed there.

According to the present invention, during a call between the sending side and the receiving side of a telephone system, the voice text is displayed in synchronization even if the voice is interrupted partway through on the receiving side. The interrupted portion is highlighted, and the receiving side sends a notice of the interruption to the sending side, where it is displayed. Thus, even when the voice of an IP phone is interrupted, the voice text is reliably displayed in synchronization, the missing portion is highlighted, and the sending side can be informed of the interruption.

The present invention achieves two things during a call between the sending side and the receiving side of a telephone system. First, even if the voice is interrupted partway through on the receiving side, the voice text is displayed in synchronization and the interrupted portion is highlighted. Second, the receiving side notifies the sending side of the interruption, and the notification is displayed there.

FIG. 1 shows a system configuration diagram of the present invention.
FIG. 1A shows the overall system configuration.

In FIG. 1A, the terminals 1 are call terminals such as fixed-line telephones or mobile phones. Here, the terminal on the sending side is terminal A and the terminal on the receiving side is terminal B. The two terminals are used to talk with each other and to display text information synchronized with the call.

The accommodating base stations 2 are connected to the network 4. They exchange call packets with the terminals 1 over subscriber lines (wireless or wired). Here, the station serving terminal A on the sending side is accommodating base station A, and the station serving terminal B on the receiving side is accommodating base station B. During a call, the terminal 1 and its accommodating base station 2 exchange three kinds of information: voice information, text information obtained by recognizing that voice information, and identification information (see FIGS. 2 to 11).

The server 3 is connected to the network 4 and performs call management (call connection, billing management, etc.) between terminal A on the sending side and terminal B on the receiving side.

The network 4 is a communication path over which packets are exchanged. Here it connects accommodating base station A, accommodating base station B, the server 3, and other equipment. This allows terminals A and B to communicate with each other, both talking and displaying text information.

FIG. 1B shows an example of packets transmitted from terminal A to terminal B via the network 4.

In FIG. 1B, the transmission block 100 is sent from terminal A on the sending side of FIG. 1A toward the network 4. It contains two kinds of packets:
- voice information packets 300, which carry the digital voice signal obtained by sampling the user's analog voice signal;
- text information packets 400, which carry the text information obtained by recognizing that voice signal.

The reception block 200 is the block that terminal B on the receiving side receives from the network 4. The voice information and the text information are each extracted from it.

FIG. 1C shows an example of voice collation information. In the text information of FIG. 1B, the data portion contains voice collation information 402 in addition to the text information body 401. The receiving side extracts this information and uses it for two purposes, described later (see FIGS. 2 to 11):
- to display the text information body 401 in synchronization with the spoken voice;
- to highlight the corresponding portion of the text information when voice packets are lost.

FIG. 2 shows an example of the terminal (sending side) of the present invention.
In FIG. 2, the call setting/management means 11 sets up and manages a call with the network 4 (and, beyond it, with terminal B on the receiving side). Once the call has been set up and talking becomes possible, it sends start information to the input sampling means 12, and this call-start information is embedded in the voice information (see (1) A in FIG. 5).

The input sampling means 12 records the input sampling time of the voice signal from user A. It also embeds the call-start information from the call setting/management means 11 in that voice signal (see (1) A in FIG. 5).

The encoding processing means 13 samples the voice signal (including the call-start information) at each sampling time unit to generate a digital voice signal (see (2) B in FIG. 5).

The RTP information generating means 14 bundles several pieces of encoded digital voice information from the encoding processing means 13 into packet data (RTP information) (see (5) E in FIG. 6).

The voice information packetizing means 15 converts the packet data (RTP information) into packets and sends the generated packets to the network.

The text conversion processing means 16 converts the voice into text information by recognition and adds voice collation information to it. It includes a voice reading means 161, a text information conversion means 162, and other components (see (4) D in FIG. 5).

The voice reading means 161 reads in the voice.
The text information conversion means 162 records which input sampling time the converted voice signal belongs to (see (4) D in FIG. 5).

The time information matching means 17 aligns the input sampling time that contains the start information with the RTP timestamp position. It then sends the resulting time correspondence information to the text information packetizing means 18 (see (4) D in FIG. 5).

The text information packetizing means 18 generates packets conforming to, for example, IETF standards, in order to maintain network compatibility and interconnectivity (see (6) F in FIG. 6).

Next, the operation of the configurations in FIGS. 1 and 2 will be described in detail, following the flowcharts of FIGS. 3 and 4.

FIG. 3 is a flowchart explaining the operation of the present invention from call setup to the start of the call. Its columns are as follows:
- Terminal A is the speaking-side terminal A, the server 3 is the server 3, and terminal B is the receiving-side terminal B, all as in FIGS. 1 and 2.
- "User A (sending)" represents user A's operations on the sending side, such as dialing and speaking.
- The voice information and text information columns of terminal A represent processing on the terminal A side, such as encoding the voice information and the text information.
- The voice information and text information columns of terminal B represent processing on the terminal B side, such as displaying the voice information and the text information.

In FIG. 3, in S1, user A (speaking) enters a number: user A on the speaking side dials the telephone number of terminal B on the receiving side.

In S2, call origination is performed: the number dialed by user A in S1 is called.

In S3, the server 3 identifies the receiving terminal: it recognizes the telephone number of receiving terminal B called in S2.

In S4, call origination is performed toward the telephone number of receiving terminal B recognized by the server 3 in S3.

In S5, terminal B performs incoming-call processing.
In S6, user B is alerted. In S5 and S6, receiving terminal B, called by the server 3 in S4, performs incoming-call processing and rings a bell to alert user B.

In S7, an incoming-call notification is sent: having recognized the incoming call in S5, terminal B returns a response to the server 3.

In S8, call-establishment monitoring is started, and a "ringing" response is returned to the sending side.
In S9, terminal A on the sending side displays a ringing indication (or plays a ringback tone).

In S11, user B goes off-hook: in response to the alert in S6, user B picks up the handset of receiving terminal B.

In S12, an answer notification is sent.
In S13, the server 3 starts billing in response to the answer notification of S12.

In S14, terminal B starts a call session.
In S15, the server 3 sends a call-establishment notification to terminal A.

In S16, terminal A starts a call session.
In S17, the call starts. User A of speaking terminal A and user B of receiving terminal B can now talk with each other, and text information synchronized with the call can be displayed.

Next, the synchronized display of the call and the text information will be described in detail.
In FIG. 3, in S21, user A speaks.

In S22, the voice is encoded.
In S23, the voice is packetized. In S22 and S23, when user A speaks into the handset of sending terminal A, the voice is sampled, encoded, packetized, and transmitted sequentially over the network toward receiving terminal B.

In S24, the transmitted voice is read in.
In S25, the voice is converted into text.

In S26, the text is packetized. In S24 to S26, when user A speaks into the handset of sending terminal A, the spoken voice is read in and recognized. It is then converted into text with voice collation information added, packetized, and transmitted sequentially over the network toward receiving terminal B.

Through S21 to S26, when user A speaks into the handset of sending terminal A, the voice is handled in two parallel ways:
- it is encoded and packetized;
- it is recognized and converted into text, to which voice collation information is added before packetization.
Both kinds of packets can then be transmitted toward receiving terminal B. Text information packets are sent when the voice level is below a predetermined threshold or when there is no voice.
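The rule that text packets go out only while the voice level is below a threshold can be sketched as a small gate. This is a minimal sketch under stated assumptions. The class name `TextPacketGate`, the RMS level measure, and the threshold value are hypothetical illustrations. The source only says "below a predetermined threshold or when there is no voice".

```python
import math
from collections import deque


def frame_rms(samples):
    """Root-mean-square level of one frame of PCM samples (assumed level measure)."""
    if not samples:
        return 0.0  # no voice at all counts as silence
    return math.sqrt(sum(s * s for s in samples) / len(samples))


class TextPacketGate:
    """Hypothetical gate: holds text packets until the voice level drops below a threshold."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.pending = deque()

    def queue_text(self, packet):
        self.pending.append(packet)

    def on_voice_frame(self, samples):
        """Return the text packets to transmit alongside this voice frame."""
        if frame_rms(samples) >= self.threshold:
            return []  # voice is active: keep text packets waiting
        sent = list(self.pending)
        self.pending.clear()
        return sent
```

For example, with `TextPacketGate(threshold=500.0)`, a queued text packet is withheld during a loud frame and released on the next quiet or empty frame. Holding text until pauses keeps the text packets from competing with voice packets while the speaker is talking.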

In S27, initialization is performed.
In S28, buffering is performed.

In S29, the voice is decoded.
In S30, the voice is played back to the listener. S28 to S30 handle the packets into which user A's speech was packetized in S23. When terminal B receives these packets, it first stores them temporarily in a buffer. After a predetermined delay, it takes them out of the buffer in time order and decodes them back into the original voice signal. The original voice is then output through the handset for user B to hear.

In S31, synchronization processing is performed.
In S32, display processing is performed.

In S33, the text is displayed. S31 to S33 handle the packets into which the text information recognized from user A's speech was packetized in S26. When terminal B receives these packets, it performs synchronization based on the voice collation information in them. It then displays the text information on its screen in synchronization with the voice (see FIGS. 7 to 11).

As indicated by "repeat" in the flowchart, S21 to S33 are repeated. Each repetition packetizes three items and sends them from sending terminal A to receiving terminal B: the voice, the text information recognized from it, and the associated voice collation information. This allows the speech and the text information to be output in synchronization (see FIGS. 7 to 11).

FIG. 4 is a flowchart explaining the operation of the present invention, from the call to call release.
In FIG. 4, in S41, the call ends.

In S42, user B (receiving) goes on-hook: user B replaces the handset of terminal B, ending the call.

In S43, terminal B sends a call-session end notification to the server 3.
In S44, the server 3 ends billing.

In S45, the server 3 sends a billing notification to terminal A on the sending side.
In S46, terminal A ends the call session.

In S47, terminal A ends voice encoding.
In S48, terminal A ends text conversion.

In S49, the server 3 releases the call.
In S50, terminal B ends the text information display processing.

In S51, terminal B ends decoding of the voice packets.
The call termination processing is performed as described above.

FIGS. 5 and 6 are explanatory diagrams of the present invention. They show the form of the information at points A to F in FIG. 2. The voice is sampled in units of 125 µs, and the RTP interval is 20 ms.
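The two figures above determine the packet size. The 8 kHz rate follows directly from the 125 µs interval, and the result is 160 samples per 20 ms RTP packet. One note is added here and is not from the source: standard RTP (RFC 3550) counts timestamps in sampling-clock units, so the timestamp would advance by 160 per packet. This document instead expresses the same step as 20 ms.

```latex
f_s = \frac{1}{125\,\mu\mathrm{s}} = 8000\ \mathrm{Hz}, \qquad
N = \frac{20\,\mathrm{ms}}{125\,\mu\mathrm{s}} = 160\ \text{samples per RTP packet}, \qquad
\frac{1\,\mathrm{s}}{20\,\mathrm{ms}} = 50\ \text{packets per second}
```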

(1) A in FIG. 5 shows an example of the signal input from the input sampling means 12 to the encoding processing means 13 in FIG. 2. Here, the start information from the call setting/management means 11 is inserted at the illustrated position in the voice signal spoken by user A. The sampling interval is 125 µs. The position of the start information then serves as a reference for synchronizing two streams: the voice (digital voice sampled at 125 µs intervals) and the text information recognized from it.

(2) B in FIG. 5 shows an example of the signal input from the encoding processing means 13 to the RTP information generating means 14 in FIG. 2. As illustrated, information is generated for each sampling time (at 125 µs intervals). Each item pairs the sampling time with the sampled voice signal (digital value).

(3) C in FIG. 5 shows an example of the signal output from the time information matching means 17 to the text information packetizing means 18 in FIG. 2. It associates input sampling times with RTP timestamps (ms), including the start information. Locating the start information is what allows input sampling times to be mapped to RTP timestamps. The RTP timestamp takes one value per packet generation interval (20 ms in the illustrated example).

(4) D in FIG. 5 shows an example of the signal input from the text conversion processing means 16 to the text information packetizing means 18 in FIG. 2. It associates the input sampling time, the text information sequence number, and the text information content. For each piece of text information, two values are recorded: the position of the start information and the first input sampling time of the converted text. Together they determine which RTP packet the text information corresponds to.
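The mapping in (3) C and (4) D, from an input sampling time through the start-information position to an RTP timestamp, can be sketched in a few lines. This is a minimal sketch under stated assumptions. It represents sampling times as sample indices and assumes the start information falls on a packet boundary. The function names `rtp_timestamp_ms` and `text_rtp_range` are hypothetical.

```python
SAMPLE_PERIOD_US = 125  # sampling interval given for FIG. 5
RTP_INTERVAL_MS = 20    # packet generation interval given for FIG. 5
SAMPLES_PER_PACKET = RTP_INTERVAL_MS * 1000 // SAMPLE_PERIOD_US  # 160


def rtp_timestamp_ms(sample_index, start_sample_index, start_rtp_ms=0):
    """Map an input sampling index to the RTP timestamp (ms) of the packet carrying it.

    start_sample_index: sampling index where the start information was embedded.
    start_rtp_ms: RTP timestamp of the packet that begins at that index (assumed).
    """
    offset = sample_index - start_sample_index
    if offset < 0:
        raise ValueError("sample precedes the start information")
    return start_rtp_ms + (offset // SAMPLES_PER_PACKET) * RTP_INTERVAL_MS


def text_rtp_range(first_sample, last_sample, start_sample_index, start_rtp_ms=0):
    """RTP timestamp range (ms) covered by one recognized text segment."""
    return (rtp_timestamp_ms(first_sample, start_sample_index, start_rtp_ms),
            rtp_timestamp_ms(last_sample, start_sample_index, start_rtp_ms))
```

For instance, with the start information at sample 100, a text segment spanning samples 100 to 579 covers RTP timestamps 0 ms to 40 ms, which is three packets.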

(5) E in FIG. 6 shows an example of the signal input from the RTP information generating means 14 to the voice information packetizing means 15 in FIG. 2. The RTP information is generated according to the standard specification and does not carry the collation information for the text. This maintains communication compatibility with terminals that cannot display text.

(6) F in FIG. 6 shows an example of the signal output from the text information packetizing means 18 in FIG. 2. The text information header (header portion) contains a sequence number indicating the order of text conversion and the timestamp information of the corresponding RTP packets. As described for FIG. 1C, the data portion contains the voice collation information (here, the sequence number and RTP timestamp) and the text information body.
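The packet in (6) F can be pictured as a fixed header followed by the voice collation information and the text body. The source names the fields but gives no byte layout. The layout below (network byte order, 16-bit sequence number, 32-bit millisecond timestamps, UTF-8 body) and the `TextPacket` class are therefore hypothetical. They are not the IETF format the document refers to.

```python
import struct
from dataclasses import dataclass

# Hypothetical wire layout, network byte order.
HEADER = struct.Struct("!HI")      # sequence number, RTP timestamp (ms)
COLLATION = struct.Struct("!HII")  # sequence number, first and last RTP timestamp (ms)


@dataclass
class TextPacket:
    seq: int
    ts_first: int
    ts_last: int
    text: str

    def encode(self) -> bytes:
        header = HEADER.pack(self.seq, self.ts_first)
        collation = COLLATION.pack(self.seq, self.ts_first, self.ts_last)
        return header + collation + self.text.encode("utf-8")

    @classmethod
    def decode(cls, data: bytes) -> "TextPacket":
        seq_h, ts_h = HEADER.unpack_from(data, 0)
        seq, ts_first, ts_last = COLLATION.unpack_from(data, HEADER.size)
        if (seq_h, ts_h) != (seq, ts_first):
            raise ValueError("header and collation information disagree")
        body = data[HEADER.size + COLLATION.size:].decode("utf-8")
        return cls(seq, ts_first, ts_last, body)
```

A round trip of `TextPacket(3, 40, 80, "ひゃくまんえん")` gives the same packet back. The encoded size is 6 header bytes, 10 collation bytes, and 21 bytes of UTF-8 text.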

Next, the configuration and operation of terminal B on the receiving side will be described in detail with reference to FIGS. 7 to 11.
FIG. 7 shows an example of the terminal (receiving side) of the present invention.

In FIG. 7, the call setting/management means 21 sets up and manages a call. Here, it instructs the start of reception when the call session starts.

The voice information reception processing means 22 receives voice information packets. Here, it reads the timestamp information from the voice information in each packet and passes it to the synchronization processing means 26.

The buffer 23 temporarily stores received voice packets. The synchronization processing means 26 notifies it of an adjustment time, and the buffer releases the packets delayed by that amount so that they are synchronized.

The decoding processing means 24 decodes voice packets back into digital voice.
The text information reception processing means 25 receives text information packets. Here, it extracts the voice collation information from the text information, namely the range of timestamps of the voice information packets that were converted into text. It passes this range to the synchronization processing means 26.

The synchronization processing means 26 synchronizes the voice packets with the text information by adjusting the depth of the buffer 23 (see FIGS. 8 to 11). It does so based on two inputs:
- the timestamp information of the voice packets, passed from the voice information reception processing means 22;
- the timestamp range of the voice packets whose voice was converted into the corresponding text, passed from the text information reception processing means 25.
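Adjusting the buffer depth aligns the two streams. Choosing which text belongs with the voice packet now being played can then be done by a lookup on the timestamp ranges. The sketch below shows only that lookup. The class name `TextSynchronizer` and the assumption of non-overlapping ranges are illustrative, and this is not presented as the patent's own mechanism.

```python
import bisect


class TextSynchronizer:
    """Hypothetical lookup of the text segment covering a voice packet's RTP timestamp.

    Segments come from the voice collation information as (ts_first, ts_last, text);
    their ranges are assumed not to overlap.
    """

    def __init__(self):
        self._starts = []    # sorted ts_first values
        self._segments = []  # segments kept in the same order as _starts

    def add_segment(self, ts_first, ts_last, text):
        i = bisect.bisect_left(self._starts, ts_first)
        self._starts.insert(i, ts_first)
        self._segments.insert(i, (ts_first, ts_last, text))

    def text_for(self, playout_ts):
        """Return the text to display while this voice packet plays, or None."""
        i = bisect.bisect_right(self._starts, playout_ts) - 1
        if i >= 0:
            first, last, text = self._segments[i]
            if first <= playout_ts <= last:
                return text
        return None
```

Segments may arrive in any order, since text packets lag the voice by t1, because insertion keeps them sorted by start timestamp.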

The display processing means 27 displays the text information (see FIGS. 8 to 11).
FIG. 8 shows a configuration example of the reception block 200 of the present invention. As described for FIGS. 1B and 1C, the reception block 200 contains a mixture of voice information packets and text information packets. Here, the text information packets arrive with the illustrated arrival time difference t1 relative to the voice information packets.

FIG. 9 is an explanatory diagram of the present invention showing the arrival time difference and related quantities.
In FIG. 9, let $t_B$ be the depth (time) of the buffer 23, $t_c$ the specified voice information interval (time), and $\Delta t$ the deviation (time) of a voice packet from the specified interval. The time $t_2$ from when a voice information packet enters the buffer 23 until it is output is then

$$t_2 = t_B - \Delta t$$

Voice packets with these irregular times $t_2$ are stored temporarily in the buffer 23. They are output at the constant interval $t_c$ to the decoding processing means 24 of FIG. 7 and decoded into voice. In addition, synchronization is achieved by adjusting the buffer depth so that $t_B$ is approximately equal to the arrival time difference $t_1$ described for FIG. 8. This synchronizes the voice decoded from the voice information packets with the text information decoded from the text information packets.

FIG. 10 shows a display example of the present invention: text display when voice information is missing.
FIG. 10A shows an example of the original sound, here "hyakuman-en" (one million yen).

FIG. 10B shows an example of the voice packets output by the sending side. Here, the voice packets are output without loss.

FIG. 10C shows an example of the voice information packets that arrived at the receiving side. Here, the packets shown with black marks are assumed to have been lost. As illustrated, the lost packets are displayed with black marks to inform user B.

FIG. 10D shows an example of the reproduced sound. Among the voice information packets in FIG. 10C, packets within the voice groups for "ma" and "n" were lost, so those sounds cannot be reproduced. The sound is therefore played back as "hyaku--en", and the listener may mistake it for "hyaku-en" (one hundred yen). When voice information packets are lost in this way, the receiving side may notify the sending side of the missing voice portion (sequence numbers and timestamps). The sending side's screen may then likewise highlight the missing packet portion, as in FIG. 10C.
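Detecting the lost voice packets, and building the notice sent back to the sending side, can be sketched by looking for gaps in the 20 ms timestamp sequence within each text segment's range. The function names and the `(seq, missing timestamps)` report format are hypothetical. The source states only that the sequence number and timestamp of the missing portion are notified.

```python
RTP_INTERVAL_MS = 20  # packet interval given for FIG. 5


def missing_timestamps(received_ts, first_ts, last_ts, interval_ms=RTP_INTERVAL_MS):
    """Timestamps in [first_ts, last_ts] for which no voice packet arrived."""
    got = set(received_ts)
    return [ts for ts in range(first_ts, last_ts + 1, interval_ms) if ts not in got]


def loss_report(received_ts, segments, interval_ms=RTP_INTERVAL_MS):
    """Hypothetical loss notice for the sending side.

    segments: (seq, ts_first, ts_last, text) taken from the text information packets.
    Returns (seq, missing timestamps) for each text segment whose voice was lost,
    which is also what the display would highlight.
    """
    report = []
    for seq, first, last, _text in segments:
        lost = missing_timestamps(received_ts, first, last, interval_ms)
        if lost:
            report.append((seq, lost))
    return report
```

For the "hyakuman-en" case, a segment spanning 0 to 140 ms with packets 40, 80 and 120 lost yields `[(1, [40, 80, 120])]`. The receiving side can highlight these packets and send the same list to the sending side.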

FIG. 10E shows an example of the text display. The original voice information in the transmitted voice packets of FIG. 10B is converted by speech recognition into the text information “hyakumanen”. That text is packetized, received by the receiving side, and displayed.
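The text arrives intact even when voice packets are lost. The receiving side can therefore display the whole word and highlight only the portion whose audio was interrupted. In this sketch, the mapping from each syllable to the voice packets that carry it is hypothetical, as are the bracket marks and the name render_text. The text does not say how that alignment is obtained.

```python
def render_text(syllables, lost_seqs, open_mark="[", close_mark="]"):
    """Show every syllable; wrap each run whose voice packets were lost.

    syllables: list of (text, [sequence numbers of the voice packets for it]).
    Consecutive affected syllables share one pair of marks.
    """
    lost = set(lost_seqs)
    out = []
    in_gap = False
    for text, seqs in syllables:
        hit = bool(lost.intersection(seqs))
        if hit and not in_gap:
            out.append(open_mark)
        elif not hit and in_gap:
            out.append(close_mark)
        out.append(text)
        in_gap = hit
    if in_gap:
        out.append(close_mark)
    return "".join(out)


word = [("hya", [0, 1]), ("ku", [2, 3]), ("ma", [4, 5]),
        ("n", [6]), ("e", [7]), ("n", [8])]
print(render_text(word, lost_seqs={5, 6}))  # hyaku[man]en
```

When packets 5 and 6 are lost, the display reads "hyaku[man]en". User B can then see that the amount is one million yen, even though the audio played as "hyaku--en".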

FIG. 11 shows a display example of the present invention.
FIG. 11A shows an example of the packets received at the receiving side, with the missing packets highlighted.

FIG. 11B shows an example of the voice information packets. Only the voice information packets are extracted from the received packets of FIG. 11A, and the missing ones are highlighted.

FIG. 11C shows an example of the text information packets. Only the text information packets are extracted from the received packets of FIG. 11A, and the missing ones are highlighted.
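The following sketch separates the received stream of FIG. 11A into the voice view (FIG. 11B) and the text view (FIG. 11C). It rests on three assumptions: each packet is tagged with its kind, voice and text packets are numbered independently, and the receiver knows the sequence range that was sent. split_and_mark is a hypothetical helper.

```python
from collections import defaultdict


def split_and_mark(packets, expected):
    """Split a mixed packet stream per kind and flag the missing packets.

    packets:  iterable of (kind, seq) in arrival order, kind is "voice" or "text".
    expected: dict kind -> (first_seq, last_seq) that the sender transmitted.
    Returns dict kind -> list of (seq, received) pairs; received=False
    marks a packet to be highlighted.
    """
    got = defaultdict(set)
    for kind, seq in packets:
        got[kind].add(seq)
    return {kind: [(s, s in got[kind]) for s in range(first, last + 1)]
            for kind, (first, last) in expected.items()}


stream = [("voice", 0), ("text", 0), ("voice", 1), ("voice", 3), ("text", 2)]
views = split_and_mark(stream, {"voice": (0, 3), "text": (0, 2)})
print(views["voice"])  # [(0, True), (1, True), (2, False), (3, True)]
print(views["text"])   # [(0, True), (1, False), (2, True)]
```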

The present invention relates to a telephone system for calls between a transmitting side and a receiving side. If the received voice is interrupted partway through, the voice text is still displayed on the receiving side in synchronization with the voice, and the interrupted portion is highlighted. In addition, the receiving side sends a notice of the interruption to the transmitting side, where it is displayed. Thus, even when the voice on an IP phone is interrupted, the voice text is reliably displayed in synchronization, the missing portion is highlighted, and the transmitting side is informed that the voice was interrupted.

FIG. 1 is a system configuration diagram of the present invention.
FIG. 2 shows a terminal (transmitting side) of the present invention.
FIG. 3 is a flowchart explaining the operation of the present invention (from call setup to call start).
FIG. 4 is a flowchart explaining the operation of the present invention (from the call to call release).
FIG. 5 is an explanatory diagram (part 1) of the present invention.
FIG. 6 is an explanatory diagram (part 2) of the present invention.
FIG. 7 shows a terminal (receiving side) of the present invention.
FIG. 8 is a configuration example of the receiving block 200 of the present invention.
FIG. 9 is an explanatory diagram (arrival time difference, etc.) of the present invention.
FIG. 10 is a display example of the present invention (text display when voice information is missing).
FIG. 11 is a display example of the present invention.

Explanation of symbols

1: Terminal
2: Accommodating base station
3: Server
4: Network
11, 21: Call setup/management means
12: Input sampling means
13: Encoding processing means
14: RTP information generating means
15: Voice information packetizing means
16: Text conversion processing means
161: Voice reading means
162: Text information converting means
17: Time information collating means
18: Text information packetizing means
22: Voice information reception processing means
23: Buffer
24: Decoding processing means
25: Text information reception processing means
26: Synchronization processing means
27: Display processing means

Claims

1. A telephone system in which voice is transmitted on a transmitting side and received on a receiving side, comprising:
means for, on the transmitting side, packetizing the voice and the text information obtained by voice recognition of the voice, adding identification information indicating synchronization between the two, and transmitting the packets; and
means for, on the receiving side, receiving the packets and, when outputting the voice and the text information, outputting them in synchronization based on the identification information.

2. The telephone system according to claim 1, wherein when the voice level is lower than a predetermined threshold or when there is no voice on the transmission side, the identification information is added to the packet and transmitted.
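The sketch below shows one way the transmitting side might decide where to attach the identification information of claim 2. Three details are assumptions, not taken from the claims: the RMS level measure, the frame layout (a list of PCM sample values per frame), and marking only the first quiet frame after speech so that a long pause does not produce a marker on every frame. rms and sync_points are hypothetical names.

```python
import math


def rms(frame):
    """Root-mean-square level of one frame of PCM samples (0.0 for an empty frame)."""
    if not frame:
        return 0.0
    return math.sqrt(sum(x * x for x in frame) / len(frame))


def sync_points(frames, threshold):
    """Indices of frames where the level first drops below threshold after speech.

    An empty frame counts as "no voice" and is treated like a quiet frame.
    """
    points = []
    was_loud = False
    for i, frame in enumerate(frames):
        quiet = rms(frame) < threshold
        if quiet and was_loud:
            points.append(i)
        was_loud = not quiet
    return points


frames = [[0, 0], [1000, -1000], [900, -900], [5, -5], [0, 0], [800, -800], []]
print(sync_points(frames, threshold=100))  # [3, 6]
```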

3. The telephone system according to claim 1 or 2, wherein, as the identification information, the sequence number of the voice packet to be synchronized is added to the text information on the transmitting side, and, on the receiving side, the depth of a buffer that temporarily stores voice packets is adjusted based on the sequence number in the received text information so that the voice is output in synchronization.
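The buffer-depth adjustment of claim 3 can be read this way: choose the smallest depth at which each text packet has arrived by the time the voice packet whose sequence number it carries is played. The following is a sketch under that reading, assuming a constant packet interval tc. required_depth is a hypothetical name.

```python
def required_depth(first_voice_seq, first_voice_arrival, tc, text_packets):
    """Smallest buffer depth tb that keeps text ready for its voice packet.

    text_packets: list of (voice_seq, arrival_time). voice_seq is the voice
    sequence number carried in the text packet, and arrival_time is when
    that text packet reached the receiver. Voice packet voice_seq is played
    at first_voice_arrival + tb + (voice_seq - first_voice_seq) * tc.
    """
    tb = 0.0
    for seq, arrival in text_packets:
        nominal = first_voice_arrival + (seq - first_voice_seq) * tc
        tb = max(tb, arrival - nominal)
    return tb


# Text for voice packet 0 arrives 50 ms late; text for packet 4 only 10 ms late.
print(required_depth(0, 0.0, 0.02, [(0, 0.05), (4, 0.09)]))
```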

4. The telephone system according to claim 1 or 2, wherein, as the identification information, the time stamp of the voice packet to be synchronized is added to the text information on the transmitting side, and, on the receiving side, the depth of a buffer that temporarily stores voice packets is adjusted based on the time stamp in the received text information so that the voice is output in synchronization.
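For the time-stamp variant of claim 4, the text display time can be derived from the RTP time stamp carried in the text. The sketch assumes an 8 kHz RTP clock and ignores time-stamp wrap-around. CLOCK_RATE and text_display_times are hypothetical names.

```python
CLOCK_RATE = 8000  # assumption: 8 kHz RTP clock, as is common for narrowband voice


def text_display_times(first_voice_ts, first_voice_arrival, tb, text_items):
    """Pair each text with the moment its matching voice leaves the buffer.

    text_items: list of (rtp_timestamp, text). A text item is shown at
    first_voice_arrival + tb + (rtp_timestamp - first_voice_ts) / CLOCK_RATE,
    which is the playout time of the voice sample with the same time stamp.
    """
    base = first_voice_arrival + tb
    return [(base + (ts - first_voice_ts) / CLOCK_RATE, text)
            for ts, text in text_items]


items = [(16000, "hyaku"), (17600, "man"), (18400, "en")]
for when, text in text_display_times(16000, 1.0, 0.06, items):
    print(f"{when:.2f}s {text}")
```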

5. The telephone system as described in …, wherein a voice information packet received on the receiving side is highlighted when it is missing or cannot be received within a predetermined time.

6. The telephone system as described in any one of claims …, wherein a text information packet received on the receiving side is highlighted when it is missing or cannot be received within a predetermined time.
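Claims 5 and 6 treat a packet that never arrives and one that arrives too late in the same way. Below is a minimal sketch of that classification, assuming each packet has a known deadline (for example, its playout time). classify and to_highlight are hypothetical names.

```python
def classify(deadlines, arrivals):
    """Label each expected packet as 'ok', 'late' or 'missing'.

    deadlines: dict seq -> latest acceptable arrival time.
    arrivals:  dict seq -> actual arrival time (absent if never received).
    """
    result = {}
    for seq, deadline in deadlines.items():
        if seq not in arrivals:
            result[seq] = "missing"
        elif arrivals[seq] > deadline:
            result[seq] = "late"
        else:
            result[seq] = "ok"
    return result


def to_highlight(status):
    """Sequence numbers to highlight: anything not received in time."""
    return sorted(seq for seq, state in status.items() if state != "ok")


status = classify({0: 0.10, 1: 0.12, 2: 0.14}, {0: 0.05, 2: 0.20})
print(status)                # {0: 'ok', 1: 'missing', 2: 'late'}
print(to_highlight(status))  # [1, 2]
```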

7. The telephone system according to any one of claims 1 to 6, wherein, when a voice packet on the receiving side is lost or cannot be received within a predetermined time, the receiving side sends a message to that effect to the transmitting side so that the failure to receive the voice packet can be output on the transmitting side.