JP2014176033A

JP2014176033A - Communication system, communication method and program

Info

Publication number: JP2014176033A
Application number: JP2013049679A
Authority: JP
Inventors: Yohei Tsuzuki; 洋平都筑
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 2013-03-12
Filing date: 2013-03-12
Publication date: 2014-09-22

Abstract

PROBLEM TO BE SOLVED: To automatically generate accurate meeting minutes.

SOLUTION: A communication system including a plurality of information processing devices connected through a network includes: first voice conversion means for converting voice information acquired by one of the plurality of information processing devices into character data; second voice conversion means for converting into character data, in another of the plurality of information processing devices, voice information acquired in accordance with the timing at which the one information processing device acquired its voice information; and determination means for comparing the two sets of character data produced by the first and second voice conversion means to determine whether the voice information acquired by the one information processing device and by the other information processing device is the same utterance.

Description

The present invention relates to a communication system, a communication method, and a program.

Remote conferences that use information processing apparatuses at multiple sites connected to a network are known. For such communication conference systems, inventions have been proposed that use automatic speech recognition to convert what is said in a conference into text and record it. Patent Document 1 proposes an invention for creating minutes easily from automatically generated text information. Specifically, to simplify the creation of minutes, Patent Document 1 discloses a video conference system that detects statements to be included in the minutes, such as important remarks, as index information and inserts them into a minutes file while the conference is in progress, so that simple minutes are created automatically.

Automatic speech recognition cannot technically achieve 100% accuracy; even in conditions without unusual noise, it generally recognizes only about 60 to 90% of the transcribed content correctly. The resulting text therefore contains many errors.

Creating accurate minutes therefore requires correcting this text. However, because the initial error rate is high, mistakes can occur during correction, and information that differs from what was actually said may end up in the minutes.

Patent Document 2 accordingly proposes, among other things, an invention that registers a user dictionary so that frequently used words are easier to detect, in order to improve the reliability of speech recognition.

However, although Patent Documents 1 and 2 improve the accuracy of automatic speech recognition and make the creation of minutes more efficient, the error rate remains high, and further technical improvement is needed to generate accurate minutes automatically.

In view of the above problems, an object of the present invention is to provide a communication system, a communication method, and a program capable of automatically generating minutes more accurately.

To solve the above problems, according to one aspect of the present invention, there is provided a communication system having a plurality of information processing apparatuses connected via a network, the communication system comprising:
first voice conversion means for converting voice information acquired by one information processing apparatus of the plurality of information processing apparatuses into character data;
second voice conversion means for converting into character data, in another information processing apparatus of the plurality of information processing apparatuses, voice information acquired in accordance with the timing at which the one information processing apparatus acquired voice information; and
determination means for comparing the two sets of character data converted by the first and second voice conversion means and determining whether the voice information acquired by the one information processing apparatus and by the other information processing apparatus is the same utterance.

According to the present invention, minutes can be automatically generated more accurately.

FIG. 1 is an overall configuration diagram of a communication conference system according to an embodiment.
FIG. 2 is a functional configuration diagram of an information processing apparatus according to the embodiment.
FIG. 3 shows an example of voice information and text data according to the embodiment.
FIG. 4 is a flowchart of utterance record generation processing according to the embodiment.
FIG. 5 shows an example of the utterance content at each site selected for comparison according to the embodiment.
FIG. 6 shows an example of output processing according to the embodiment.
FIG. 7 shows an example of output processing according to the embodiment.
FIG. 8 is a flowchart of utterance record generation processing according to Modification 1 of the embodiment.
FIG. 9 shows an example of output processing according to Modification 1 of the embodiment.
FIG. 10 is a flowchart of utterance record generation processing according to Modification 2 of the embodiment.
FIG. 11 shows an example of output processing according to Modification 2 of the embodiment.
FIG. 12 is a flowchart of utterance record generation processing according to Modification 3 of the embodiment.
FIG. 13 is a flowchart of utterance record generation processing according to Modification 4 of the embodiment.
FIG. 14 is a functional configuration diagram of a communication conference server according to the embodiment.
FIG. 15 is a hardware configuration diagram of the information processing apparatus according to the embodiment.

Hereinafter, preferred embodiments of the present invention will be described with reference to the accompanying drawings. In this specification and the drawings, substantially identical components are given the same reference numerals, and duplicate description is omitted.

<Introduction>
Remote conferences that use information processing apparatuses at multiple sites connected to a network are known. Such communication conference systems have long used automatic speech recognition to convert what is said in a conference into text and record it as minutes. However, automatic speech recognition cannot technically achieve 100% accuracy, and the resulting text contains many errors.

In particular, the communication environment differs from one site to another where the information processing apparatuses are located. For example:
(1) Sound quality varies with the type, performance, and placement of microphones and other equipment.
(2) Audio from the local site is input directly from the microphone, whereas audio from other sites is input as digitized data via the server. In many cases, that data is encoded.

Because of these environmental differences, the text obtained from the same voice information may differ from one site's information processing apparatus to another's.

Therefore, in the communication system of the present embodiment described below, information processing apparatuses at two or more sites convert voice information into text by speech recognition, and the system includes a technique for matching the text data (hereinafter also referred to as character data) produced at each site utterance by utterance and identifying whether they represent the same utterance. This reduces the number of conversion errors made when voice information is converted into text data. As a result, highly accurate minutes can be created.

In addition, merging by hand the multiple sets of text data created by the information processing apparatuses at the individual sites is very laborious when creating minutes. The system of the present embodiment therefore compares the sets of text data to identify whether they represent the same utterance, and can automatically merge them appropriately based on the result. Automatically determining whether utterances are the same in this way allows minutes to be created with less effort.

Furthermore, voice information is converted into text by speech recognition at information processing apparatuses at two or more sites. Consequently, even if network conditions degrade communication between certain sites and the audio is interrupted, text data can still be obtained reliably from the voice information recognized by the information processing apparatus at one of those sites.
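The redundancy described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `first_available_text`, the dictionary layout, and the site order are hypothetical and do not come from the source, which does not say how a site is chosen.

```python
def first_available_text(transcripts_by_site, preferred_order):
    """Return (site, text) for the first site in preferred_order that has text.

    A missing key, None, or an empty string stands for a site whose copy of
    the audio was interrupted and produced no transcription.
    """
    for site in preferred_order:
        text = transcripts_by_site.get(site)
        if text:
            return site, text
    return None  # no site produced any text for this utterance


# Hypothetical case: site A's copy of the audio dropped out, site B's did not.
result = first_available_text({"A": None, "B": "Yes"}, ["A", "B"])
```

A real system would also have to decide which text to keep when several sites produce differing transcriptions; the sketch covers only the drop-out case.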

The communication system of the present embodiment, which provides the above functions and effects, is described below. Although a communication conference system is used as the example, the communication system according to the present embodiment is not limited to a communication conference system. For example, it can be used in an interactive information providing system, an interactive service counter system, or the like, as a communication system that records, as text data, exchanges of voice information transmitted or received using two or more information processing apparatuses.

[Overall system configuration]
First, a communication conference system according to an embodiment of the present invention will be described with reference to FIG. 1. FIG. 1 is an overall configuration diagram of the communication conference system according to the embodiment. In the communication conference system 1 according to the present embodiment, a plurality of information processing apparatuses 10a, 10b, and 10c (hereinafter also collectively referred to as the information processing apparatus 10) and a communication conference server 50 are connected via an IP network 110 and transmit and receive voice information and the like. The communication conference system 1 can handle not only voice information but also image and video information, and can function as a video conference system.

The information processing apparatuses 10a, 10b, and 10c are examples of communication conference terminals and may be PC terminals or tablet terminals. Each is placed at a site and has a function for connecting to the IP network 110. In FIG. 1, the information processing apparatus 10a is placed at site A, the information processing apparatus 10b at site B, and the information processing apparatus 10c at site C. The information processing apparatuses 10a, 10b, and 10c exchange audio via the IP network 110 to hold a conference between remote locations.

The information processing apparatus 10 includes voice conversion means 19. The communication conference system 1 does not have voice conversion means 19 at only one location; rather, it contains a plurality of voice conversion means 19. In other words, not every information processing apparatus 10 needs to have voice conversion means 19, as long as the communication conference system 1 contains more than one. For example, in FIG. 1 the information processing apparatuses 10a, 10b, and 10c each have voice conversion means 19, but a configuration in which the information processing apparatuses 10a and 10b each have voice conversion means 19 and the information processing apparatus 10c does not is also possible. However, because the communication conference system 1 requires a plurality of voice conversion means 19, a configuration in which, for example, only the information processing apparatus 10a has voice conversion means 19 and the information processing apparatuses 10b and 10c do not is not possible.
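The constraint in the preceding paragraph (at least two voice conversion means somewhere in the system, but not necessarily one per apparatus) can be expressed as a small configuration check. The following is a minimal sketch; the `Apparatus` class and the `is_valid_configuration` function are hypothetical names introduced for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Apparatus:
    """One information processing apparatus (hypothetical model)."""
    name: str                   # reference numeral, e.g. "10a"
    site: str                   # site label, e.g. "A"
    has_voice_conversion: bool  # True if it hosts voice conversion means 19


def is_valid_configuration(apparatuses):
    """Return True if two or more apparatuses host voice conversion means."""
    converters = [a for a in apparatuses if a.has_voice_conversion]
    return len(converters) >= 2


# The three configurations mentioned in the text.
all_three = [Apparatus("10a", "A", True), Apparatus("10b", "B", True),
             Apparatus("10c", "C", True)]
two_of_three = [Apparatus("10a", "A", True), Apparatus("10b", "B", True),
                Apparatus("10c", "C", False)]
only_one = [Apparatus("10a", "A", True), Apparatus("10b", "B", False),
            Apparatus("10c", "C", False)]
```

Under this check, `all_three` and `two_of_three` are accepted and `only_one` is rejected, matching the three cases described above.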

The communication conference server 50 is a device that relays voice information among the information processing apparatuses 10a, 10b, and 10c located at the sites. In video conference systems, such a device is generally called a multipoint control unit (MCU). The communication conference server 50 may be implemented in software or in hardware.

In the communication conference system 1 according to the present embodiment, the information processing apparatuses 10 at the sites communicate via the communication conference server 50. However, the communication conference system 1 is not limited to this; the information processing apparatuses 10 at the sites may communicate with each other directly over the IP network 110 without going through the communication conference server 50. Also, although the devices are connected by the IP network 110 in the present embodiment, they may be connected by other means.

[Functional configuration of the information processing apparatus]
Next, the functional configuration of the information processing apparatus according to the present embodiment will be described with reference to FIG. 2. FIG. 2 shows the functional configuration of the information processing apparatus according to the present embodiment. The information processing apparatus 10 includes communication means 11, data processing means 12, voice input means 13, input voice processing means 14, voice storage means 15, output voice processing means 16, time measuring means 17, voice output means 18, voice conversion means 19, voice recognition result storage means 20, utterance record generation means 21, utterance record output means 22, determination means 23, and speaker identification means 24.

The communication means 11 is connected to the IP network 110 and communicates with the other information processing apparatuses 10 and the communication conference server 50. It transmits and receives various digital data, including voice information. Specifically, in addition to the information needed to connect to the IP network 110, the following information is transmitted and received:
・Voice information
・Utterance record information
・Time information
・Device setting information
In general, the information an apparatus transmits is information generated by the information processing apparatus 10 at its own site, and the information it receives is in most cases information generated by the information processing apparatuses 10 at other sites. In addition, in the communication conference system 1 according to the present embodiment, the text data corresponding to the voice information must be gathered in one place after the conference ends; the final text data is also transmitted and received using the communication means 11.
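The four kinds of information listed above could be modeled as a tagged message, as in the sketch below. The source does not define a wire format, so `PayloadKind`, `Message`, and `generated_locally` are hypothetical names, and the fields are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class PayloadKind(Enum):
    """Kinds of information exchanged through communication means 11."""
    VOICE = "voice information"
    UTTERANCE_RECORD = "utterance record information"
    TIME = "time information"
    DEVICE_SETTINGS = "device setting information"


@dataclass(frozen=True)
class Message:
    kind: PayloadKind
    origin_site: str  # site whose apparatus generated the payload (assumed field)
    body: bytes       # opaque payload; the encoding is outside this sketch


def generated_locally(message, local_site):
    """Return True if the message was generated at the local site."""
    return message.origin_site == local_site


outgoing = Message(PayloadKind.VOICE, "A", b"\x00\x01")
incoming = Message(PayloadKind.UTTERANCE_RECORD, "B", b"")
```

Tagging each message with its originating site reflects the distinction drawn in the text between locally generated information and information generated at other sites.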

The data processing means 12 processes received information or information to be transmitted. Received information and information to be transmitted are passed to the data processing means 12 as acquired information. Voice information is generally encoded for transmission and reception; in that case, the data processing means 12 encodes the information to be transmitted and decodes the received information.

The voice input means 13 is means for inputting audio. In the communication conference system 1, a microphone is the typical example. The voice input means 13 may also include a terminal through which data from a recorder or the like can be input.

The input voice processing means 14 processes audio input from the voice input means 13. It comprises, for example, A/D (analog/digital) conversion means for converting analog audio input from the microphone into digital data, an equalizer for changing and adjusting the frequency characteristics of the audio, and noise removal means for removing noise from the input audio data.

The voice storage means 15 stores voice information. It may be a volatile memory such as RAM, or a nonvolatile memory such as an HDD (hard disk drive) or any of various flash memories. One or more of these may be provided. The voice information stored in the voice storage means 15 includes the following data:
・Input audio digital data processed by the input voice processing means 14, and data obtained by encoding it.
・Received audio digital data, and data obtained by decoding it.

The output voice processing means 16 comprises, for example, D/A (digital/analog) conversion means for converting digital audio data into analog data, an equalizer for changing and adjusting the frequency characteristics of the audio, and an amplifier for amplifying the audio.

The time measuring means 17 measures time and may comprise, for example, a clock or a timer. It preferably has a function for synchronizing with the clocks built into the information processing apparatuses 10 at other sites. The time measuring means 17 measures the transmission time, reception time, and generation time of voice information, treating these as the utterance time of the voice information.

The voice output means 18 outputs audio and may comprise, for example, various speakers or earphones.

The voice conversion means 19 converts what is said into text data using speech recognition technology, that is, technology in which a computer analyzes human speech and converts it into text data (character data). Various methods are known for the speech recognition technology used by the voice conversion means 19. Automatic speech recognition cannot technically achieve 100% accuracy; even in conditions without unusual noise, it generally recognizes only about 60 to 90% of the transcribed content correctly, so the resulting character data contains many errors. Incidentally, Japanese, which has many homophones, is known to have a low speech recognition success rate.

The communication conference system 1 according to the present embodiment requires two or more voice conversion means 19. For example, let the voice conversion means 19 that converts voice information acquired by one of the information processing apparatuses 10 into character data be the first voice conversion means, and let the voice conversion means 19 that converts voice information acquired by another of the information processing apparatuses 10 into character data be the second voice conversion means. The communication conference system 1 then requires at least these two voice conversion means 19. For example, the voice conversion means of the information processing apparatus 10a at site A in FIG. 1 may serve as the first voice conversion means and that of the information processing apparatus 10b at site B as the second. Alternatively, the voice conversion means of the information processing apparatus 10b at site B may serve as the first voice conversion means and that of the information processing apparatus 10c at site C as the second. The system configuration of the present embodiment is only an example; needless to say, any configuration having two or more voice conversion means may be used.

The voice recognition result storage means 20 stores the character data produced when the voice conversion means 19 converts voice information using speech recognition technology. It also receives and stores character data produced from voice information by the information processing apparatuses at other sites. The voice recognition result storage means 20 may share storage with the voice storage means 15.

The utterance record generation means 21 creates an utterance record using, for example, the following information:
・Text data of the local information processing apparatus 10 stored in the voice recognition result storage means 20.
・Text data of the information processing apparatuses 10 at other sites stored in the voice recognition result storage means 20.
・Time information measured by the time measuring means 17.
・Speaker information identified by the speaker identification means 24, and information associating each utterance with its speaker.
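The inputs listed above suggest a simple per-utterance record. The sketch below is one possible shape, not the data format of the source: `UtteranceEntry`, its field names, `build_record`, and the sample times and sentences are all hypothetical. It only orders entries by time and does not remove duplicates of the same utterance, which is the job of the determination means 23.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class UtteranceEntry:
    time: datetime  # utterance time from time measuring means 17
    site: str       # site of the speaker
    speaker: str    # speaker identified by speaker identification means 24
    text: str       # text data from voice recognition result storage means 20


def build_record(local_entries, remote_entries):
    """Combine local and remote entries into one chronologically ordered list."""
    return sorted([*local_entries, *remote_entries], key=lambda e: e.time)


# Hypothetical sample data; times and sentences are invented for illustration.
local = [UtteranceEntry(datetime(2013, 3, 12, 10, 0, 5), "A", "Sato", "Shall we begin?")]
remote = [UtteranceEntry(datetime(2013, 3, 12, 10, 0, 1), "B", "Suzuki", "Good morning."),
          UtteranceEntry(datetime(2013, 3, 12, 10, 0, 8), "B", "Tanaka", "Yes.")]
record = build_record(local, remote)
```

Because `sorted` is stable, entries with identical times keep their input order, local entries first.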

Based on the determination by the determination means 23 described later, the utterance record generation means 21 generates a record of the utterances made in communication between one of the information processing apparatuses 10a, 10b, and 10c and another.

The voice recognition result storage means 20 includes means for associating voice data with time, and creates a simple utterance record from the transcribed voice information of the local and remote information processing apparatuses using the time information. It may also create the utterance record using only the transcribed voice information of the local information processing apparatus. The utterance record data created here is recorded in the utterance record output means 22, or is transmitted or output to an external device via the communication means 11.

The utterance record output means 22 outputs the utterance record data created by the utterance record generation means 21. Various output methods are conceivable, and the utterance record output means 22 provides at least one of the following.
・Output as image data (for example, an analog RGB component signal).

In this case, the utterance record output means 22 outputs the information to a projector, monitor, or the like. If the utterance record is updated and displayed in real time, the conference can proceed while the participants check the record. The utterance record output means 22 then needs means for converting the utterance record data into image data.
・Output of the utterance record data as digital data.

In this case, specifically, the utterance record output means 22 includes interfaces that provide this function, such as memory interfaces (for example, a USB host interface or an SD card interface) and other interfaces such as RS-232C.

The determination means 23 compares the two or more sets of character data converted by the two or more voice conversion means 19 (the first and second voice conversion means) and determines whether the voice information represents the same utterance.
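This passage does not say how the character data is compared. As a minimal sketch, assuming a character-level similarity ratio from Python's `difflib.SequenceMatcher` combined with a time window, the determination could look like the following. The function name `is_same_utterance`, both thresholds, the sample sentence, and the times are hypothetical and are not taken from the source.

```python
from difflib import SequenceMatcher


def is_same_utterance(text_a, time_a, text_b, time_b,
                      max_gap_seconds=2.0, min_similarity=0.6):
    """Guess whether two transcriptions describe the same utterance.

    Times are seconds from a shared reference. Both thresholds are
    illustrative assumptions, not values from the source.
    """
    # Utterances recognized far apart in time are treated as different.
    if abs(time_a - time_b) > max_gap_seconds:
        return False
    # ratio() is 2 * matched_characters / total_characters, in [0, 1].
    similarity = SequenceMatcher(None, text_a, text_b).ratio()
    return similarity >= min_similarity


# Hypothetical sentence using the homophone pair from the FIG. 3 example.
site_a_text = "仕様を確認します"  # "I will check the specification"
site_b_text = "使用を確認します"  # same sounds, misrecognized as "use"
same = is_same_utterance(site_a_text, 12.0, site_b_text, 12.4)
```

A similarity threshold below 1.0 is what lets two transcriptions that differ only by a recognition error still be paired as one utterance; the right value would have to be tuned on real data.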

The speaker identification means 24 identifies the speaker of each piece of voice information by voiceprint authentication.

[Utterance record generation processing]
Next, the utterance record generation processing according to the present embodiment will be described with reference to FIG. 4. FIG. 4 is a flowchart of the utterance record generation processing according to the present embodiment. The flowchart is explained using an example of actual text data processing (FIG. 3) in which, in a communication conference between information processing apparatuses at two sites, the apparatus at each site converts voice information into text data.

図３（ａ）は、通信会議で実際に行われた音声情報のやり取りを示している。拠点Ａに佐藤さん、拠点Ｂに鈴木さんと田中さんがいて、計３名で２拠点の情報処理装置１０ａ、１０ｂを用いて通信会議を行った場合を想定している。ここに示したような会話がされた場合について以下で考える。図３（ａ）は、音声が発せられ順に、拠点３０、発言者３２、発言内容３４の情報が示されている。 FIG. 3A shows the exchange of voice information actually performed in the communication conference. It is assumed that Mr. Sato is at the base A and Mr. Suzuki and Mr. Tanaka are at the base B, and that these three people hold a communication conference using the information processing apparatuses 10a and 10b at the two bases. The case where the conversation shown here takes place is considered below. FIG. 3A lists the base 30, the speaker 32, and the speech content 34 in the order in which the remarks were made.

図３（ｂ）は拠点Ａの情報処理装置１０ａにてテキスト化したデータの一例を示している。図３（ｂ）の拠点Ａのテキストデータには、発言時刻３６、拠点３０、発言者３２、発言内容３４が含まれている。図３（ａ）のＮｏ．２の田中さんの発言「はい」が拠点Ａの情報処理装置１０ａにおいてテキストデータ化できず、記録されていない。なお、発言時刻は各拠点の情報処理装置１０にて認識した時刻である。 FIG. 3B shows an example of data converted into text by the information processing apparatus 10a at the base A. The text data of the base A in FIG. 3B includes a speech time 36, a base 30, a speaker 32, and a speech content 34. Mr. Tanaka's remark “Yes” (No. 2 in FIG. 3A) could not be converted into text data by the information processing apparatus 10a at the base A and is therefore not recorded. Note that the speech time is the time at which the information processing apparatus 10 at each base recognized the remark.

図３（ｃ）は拠点Ｂの情報処理装置１０ｂにてテキスト化したデータの一例を示している。図３（ｃ）の拠点Ａのテキストデータには、発言時刻３６、拠点３０、発言者３２、発言内容３４が含まれている。図３（ａ）のＮｏ．５の佐藤さん発言の「仕様」という言葉を「使用」という言葉に誤認識しているものとする。 FIG. 3C shows an example of data converted into text by the information processing apparatus 10b at the base B. The text data of the base A in FIG. 3C includes a speech time 36, a base 30, a speaker 32, and a speech content 34. It is assumed that the word “仕様” (“specification”) in Mr. Sato's remark No. 5 in FIG. 3A is misrecognized as the homophone “使用” (“use”).

以上の前提において、図４のフローチャートと、図５及び図６のテキストデータの状態とを参照しながら本実施形態の発言記録生成処理を説明する。 Based on the above assumptions, the statement record generation processing of this embodiment will be described with reference to the flowchart of FIG. 4 and the states of the text data shown in FIGS. 5 and 6.

Ｓ１０１：議事録作成命令が情報処理装置１０に通知される。会議終了時などが適切と考えられるが、タイミングは任意である。また命令の通知方法もどのような方法でもよい。例えば、情報処理装置１０に議事録作成ボタンを備え、そのボタンを押すことで議事録作成命令を情報処理装置１０に通知するような方法が考えられる。 S101: A minutes creation command is notified to the information processing apparatus 10. It is considered appropriate at the end of the meeting, but the timing is arbitrary. The instruction notification method may be any method. For example, a method of providing a minutes creation button in the information processing apparatus 10 and notifying the information processing apparatus 10 of a minutes creation command by pressing the button can be considered.

Ｓ１０２：議事録を作成する情報処理装置１０に作成したテキストデータを送信する。議事録作成は、各情報処理装置１０がそれぞれ行ってもよいし、１つまたは複数のあらかじめ定めた情報処理装置１０のみで実施してもよい。議事録作成命令を受けると、情報処理装置１０は、音声情報から生成したテキストデータを議事録作成を行う情報処理装置１０に送信する。議事録作成を行う情報処理装置１０は、各装置から送信されるテキストデータを受信する。なお、議事録作成を行う装置を、情報処理装置１０に替えて通信会議サーバとしてもよい。 S102: The created text data is transmitted to the information processing apparatus 10 that creates the minutes. The minutes may be created by each information processing apparatus 10 or by only one or a plurality of predetermined information processing apparatuses 10. When receiving the minutes creation command, the information processing apparatus 10 transmits text data generated from the voice information to the information processing apparatus 10 that creates the minutes. The information processing apparatus 10 that creates the minutes receives text data transmitted from each apparatus. Note that the apparatus for creating the minutes may be a communication conference server instead of the information processing apparatus 10.

Ｓ１０３：各拠点のテキストデータを時系列に並べる（図５参照）。ここでは拠点Ａ及び拠点Ｂ間の通信に４秒かかるものと仮定して考えている。そのためα１１の発言を拠点Ｂで４秒後に認識するため、α１１とβ１２の発言時刻３６が４秒ずれている。つまり、一の情報処理装置１０がα１１の発言を取得するタイミングと、他の情報処理装置１０がβ１２の発言を取得するタイミングとは、同時又は数秒程度離れた近時の時間内であり、２つの発言の取得タイミングは類似（対応）する。この図５のように発言時刻３６の順に各拠点のテキストデータを並べるものとする。 S103: The text data of each base is arranged in time series (see FIG. 5). Here, communication between the base A and the base B is assumed to take 4 seconds. Because the base B therefore recognizes the remark α11 4 seconds later, the speech times 36 of α11 and β12 differ by 4 seconds. In other words, the timing at which one information processing apparatus 10 acquires the remark α11 and the timing at which the other information processing apparatus 10 acquires the remark β12 are either simultaneous or within a few seconds of each other, so the acquisition timings of the two remarks are similar (correspond to each other). The text data of each base is arranged in order of speech time 36, as shown in FIG. 5.
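
As an illustration of S103, the ordering step can be sketched in Python as follows. This is a minimal sketch and not the implementation described in this document: the `Utterance` record, the `merge_by_time` function, and the sample times and texts are hypothetical, chosen only to mirror the fields listed for FIG. 3B (speech time 36, base 30, speaker 32, speech content 34) and the assumed 4-second delay.

```python
from dataclasses import dataclass
from datetime import time


@dataclass(frozen=True)
class Utterance:
    spoken_at: time  # speech time 36, as recognized at the recording base
    site: str        # base whose apparatus produced this text record
    speaker: str     # speaker 32
    text: str        # speech content 34


def merge_by_time(*site_records):
    """Arrange the text data of all bases in one list ordered by speech time (S103)."""
    merged = [u for records in site_records for u in records]
    # sorted() is stable, so records with equal times keep their input order.
    return sorted(merged, key=lambda u: u.spoken_at)


# Hypothetical sample data: the same remark is recognized 4 seconds later at base B.
site_a = [Utterance(time(10, 0, 0), "A", "A1", "hello from base A")]
site_b = [Utterance(time(10, 0, 4), "B", "A1", "hello from base A"),
          Utterance(time(10, 0, 9), "B", "B1", "we hear you")]
timeline = merge_by_time(site_b, site_a)
```

Sorting on the recognized time alone is enough here because the later steps, not this one, decide which neighbouring records are the same remark.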

Ｓ１０４：判定手段２３は、比較処理を行う各拠点の発言を選定する。ここではまず図５のα１１とβ１２を比べるものとする。 S104: The determination means 23 selects the remarks of each base that performs the comparison process. Here, α11 and β12 in FIG. 5 are first compared.

Ｓ１０５：判定手段２３は、各発言を単語レベルに分解する処理を行う。一例として、α１１とβ１２は次のように分解される。「こちら／拠点／Ａの／佐藤／です。／聞こえますか？」
Ｓ１０６：判定手段２３は、各発言の単語を比較し、一致する単語とその出現順序を記録する。α１１とβ１２の場合はまったく同一となる。 S105: The determination means 23 decomposes each utterance into words. As an example, α11 and β12 are each decomposed as follows: “This side / base / A's / Sato / is. / Can you hear me?” (a word-by-word gloss of the six Japanese segments).
S106: The determination means 23 compares the words of each utterance and records the matching words and their order of appearance. For α11 and β12, the words are exactly the same.
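
Steps S105 and S106 can be sketched as follows. The sketch makes two assumptions that are not in the source: words are separated by whitespace (real Japanese text would need a morphological analyzer for S105), and equal words are paired greedily from left to right. The function names `split_words` and `matching_words` are hypothetical.

```python
def split_words(text):
    """Stand-in for S105: whitespace tokenization (an assumption of this sketch)."""
    return text.split()


def matching_words(words_a, words_b):
    """S106: list (word, index_in_a, index_in_b) for each word of A also found in B.

    Each word of B is paired at most once, so repeated words are not double counted.
    """
    matches = []
    used = set()
    for i, w in enumerate(words_a):
        for j, v in enumerate(words_b):
            if j not in used and v == w:
                matches.append((w, i, j))
                used.add(j)
                break
    return matches


# Hypothetical English stand-ins for a correctly and an incorrectly recognized remark.
words_a = split_words("check the specification today")
words_b = split_words("check the use today")
pairs = matching_words(words_a, words_b)
```

Recording both indices keeps the order of appearance available for the later order check.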

Ｓ１０７：判定手段２３は、一致する単語数が指定した割合以上か判定する。一致する単語数が指定した割合以上の場合には、Ｓ１０８へ進む。一致する単語数が指定した割合より少ない場合には、Ｓ１１１へ進む。なお、ここでいう「指定した割合」は任意に決めてよい。パラメータとして設定できるようにすると、一致と判断するレベルを調整することができる。例えば、全単語数の何割以上が一致、などと設定できる。 S107: The determination unit 23 determines whether the number of matching words is equal to or greater than a specified ratio. If it is equal to or greater than the specified ratio, the process proceeds to S108. If it is less than the specified ratio, the process proceeds to S111. The “specified ratio” here may be chosen freely. Making it a configurable parameter allows the level at which a match is declared to be adjusted. For example, the setting can require that a given proportion or more of all the words match.

Ｓ１０８：判定手段２３は、一致した単語の出現順序が一致するかを判定する。一致する場合には、Ｓ１０９へ進む。一致しない場合には、Ｓ１１１へ進む。 S108: The determination unit 23 determines whether the appearance order of the matched words matches. If they match, the process proceeds to S109. If not, the process proceeds to S111.

同じ発言であれば出現順序は一致するはずである。逆に出現順序が一致しなければ、一致する単語数が多いとしても同一発言ではないはずである。よって、
Ｓ１０９：一致した単語の出現順序が一致する場合、判定手段２３は、比較対象は「同一発言である」と判定する。 If the two are the same statement, the order of appearance should match. Conversely, if the order of appearance does not match, they should not be the same statement even if many words match. Therefore,
S109: When the appearance order of the matched words matches, the determination unit 23 determines that the comparison target is “same statement”.

Ｓ１１０：この場合、発言記録生成手段２１は、同一発言として処理を行う。具体的には、図５の例では、α１１とβ１２は同一の発言と判断し、発言記録生成手段２１は、図６のα２１とβ２１のように、拠点Ａのテキストデータ及び拠点Ｂのテキストデータの同一行に記載する処理を行う。 S110: In this case, the utterance record generating means 21 processes the two as the same utterance. Specifically, in the example of FIG. 5, α11 and β12 are determined to be the same utterance, and the utterance record generating means 21 writes them on the same row of the text data of the base A and the text data of the base B, as with α21 and β21 in FIG. 6.

Ｓ１１１：一致した単語の出現順序が一致しない場合、判定手段２３は、比較対象は「同一発言ではない」と判定する。 S111: When the appearance order of the matched words does not match, the determination unit 23 determines that the comparison target is “not the same statement”.
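
The decision in S107 to S111 can be sketched as a single predicate. This is an illustrative sketch under stated assumptions: the function name `is_same_utterance`, the default `min_ratio` of 0.8, and the choice of the longer word list as the denominator for the ratio are all hypothetical, since the text leaves the specified ratio open.

```python
def is_same_utterance(words_a, words_b, min_ratio=0.8):
    """Return True when two word lists are judged to be the same utterance."""
    # S106: greedy pairing of equal words; each word of B is used at most once.
    pairs, used = [], set()
    for i, w in enumerate(words_a):
        for j, v in enumerate(words_b):
            if j not in used and v == w:
                pairs.append((i, j))
                used.add(j)
                break
    # S107: the share of matching words must reach the specified ratio.
    total = max(len(words_a), len(words_b))
    if total == 0 or len(pairs) / total < min_ratio:
        return False  # S111: not the same utterance
    # S108: matched words must appear in the same relative order on both sides.
    positions_in_b = [j for _, j in pairs]
    return positions_in_b == sorted(positions_in_b)  # S109 if True, S111 if False
```

With the default ratio, a remark with one misrecognized word out of five is still judged the same, while a remark containing the same words in reverse order is rejected by the order check. Greedy pairing can misjudge remarks with many repeated words, which a fuller implementation would need to handle.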

Ｓ１１２：この場合、同一発言ではないので、発言記録生成手段２１は、図６の例では異なる行に記載する処理を行う。よって、拠点Ｂのβ２２のテキストデータに対応する拠点Ａのテキストデータは存在しない。 S112: In this case, since the utterances are not the same, the utterance record generating means 21 performs processing described in different lines in the example of FIG. Therefore, there is no text data of base A corresponding to the text data of β22 of base B.

Ｓ１１３：判定手段２３は、全ての発言の比較が完了したかを判定する。完了したと判定した場合、Ｓ１１４に進む。完了していないと判定した場合、Ｓ１０４の処理へ戻る。 S113: The determination means 23 determines whether comparison of all the statements has been completed. If it is determined that the process has been completed, the process proceeds to S114. If it is determined that the process has not been completed, the process returns to S104.

Ｓ１１４：発言記録出力手段２２は、出力処理を実施し、本処理を終了する。不要な場合は何も実施せず、終了しても構わない。 S114: The statement record output means 22 performs an output process and ends this process. If it is unnecessary, nothing may be done and the process may be terminated.

発言記録出力手段２２による出力処理の例としては、拠点情報が重要な場合には、図６のように各拠点のテキストデータを並べて出力することが挙げられる。または、拠点情報が不要な場合には、図７のように同一発言のテキストデータを重複せずに一つ出力してもよい。 As an example of output processing by the utterance record output means 22, when base information is important, it is possible to output text data of each base side by side as shown in FIG. Alternatively, when the base information is unnecessary, one piece of text data of the same message may be output without duplication as shown in FIG.
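
The two output styles can be sketched as follows, assuming the merged record is held as a list of rows with one text per base and `None` where a base recorded nothing. The row layout, the function names, and the sample texts are hypothetical and are not taken from FIG. 6 or FIG. 7.

```python
def render_side_by_side(rows):
    """One output line per row with a column for each base (the FIG. 6 style)."""
    return [f"{a or '-'} | {b or '-'}" for a, b in rows]


def render_deduplicated(rows):
    """One text per row, taking whichever base has a record (the FIG. 7 style)."""
    return [a if a is not None else b for a, b in rows]


rows = [("hello from base A", "hello from base A"),  # judged the same utterance
        (None, "yes")]                               # recorded only at base B
side_by_side = render_side_by_side(rows)
single = render_deduplicated(rows)
```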

以上のフローにより、テキストデータを正確にかつわかりやすく記録することができる。１００％完全な発言記録の議事録が保証されるわけではないが、生成された発言記録を参照して人為的に議事録を作成する場合に非常に分かりやすくなり、短期間で簡単に正確な議事録を作成することができる。 With the above flow, text data can be recorded accurately and in an easy-to-read form. Although minutes with a 100% complete utterance record are not guaranteed, the generated utterance record is very easy to follow when a person prepares the minutes manually by referring to it, so accurate minutes can be created easily and in a short time.

以上に説明したように、本実施形態に係る通信会議システム１によれば、少なくとも２つ以上の拠点の情報処理装置１０にて音声情報を音声認識によってテキスト情報に変換する。よって、各情報処理装置１０において互いの変換ミスの箇所を補うことができ、より修正ミスが減り、効率よく発言記録を議事録にして作成できる。さらに、同じ発言かどうかを識別する技術を備えることで複数のテキストデータを適切にマージできる。これにより、議事録作成時間を削減することができる。 As described above, according to the communication conference system 1 according to the present embodiment, the information processing apparatuses 10 at two or more bases each convert voice information into text information by voice recognition. The information processing apparatuses 10 can therefore compensate for one another's conversion errors, which reduces correction mistakes and allows the statement record to be turned into minutes efficiently. Furthermore, providing a technique for identifying whether two remarks are the same speech allows a plurality of text data to be merged appropriately. As a result, the time needed to create the minutes can be reduced.

以下、本実施形態に係る発言記録生成処理の変形例１〜変形例４について、図８〜図１３を参照しながら説明する。
（変形例１）
図８は、本実施形態の変形例１に係る発言記録生成処理を示したフローチャートであり、図９は、本実施形態の変形例１に係る出力処理例である。 Hereinafter, Modifications 1 to 4 of the statement record generation process according to the present embodiment will be described with reference to FIGS. 8 to 13.
(Modification 1)
FIG. 8 is a flowchart showing an utterance record generation process according to the first modification of the present embodiment, and FIG. 9 is an example of output processing according to the first modification of the present embodiment.

変形例１では、上記実施形態の図４のＳ１１０を図８のＳ１１５に変更している点のみ異なる。具体的には、上記実施形態では、判定手段２３により同一発言と判定された場合、一の情報処理装置１０ａ及び他の情報処理装置１０ｂの各拠点Ａ、Ｂの発言記録に同一発言を含ませる（図６）。これに対して、変形例１では、判定手段２３により同一発言と判定された場合、一の情報処理装置１０ａ及び他の情報処理装置１０ｂのいずれか一方の拠点側の発言記録に同一発言を含ませ、いずれか他方の拠点側の発言記録には同一発言を含ませない（図９）。 Modification 1 differs from the above embodiment only in that S110 of FIG. 4 is replaced with S115 of FIG. 8. Specifically, in the above embodiment, when the determination unit 23 determines that the statements are the same statement, the same statement is included in the statement records of both bases A and B of the one information processing apparatus 10a and the other information processing apparatus 10b (FIG. 6). In Modification 1, by contrast, when the determination unit 23 determines that the statements are the same statement, the same statement is included in the statement record of only one of the bases of the information processing apparatuses 10a and 10b and is not included in the statement record of the other base (FIG. 9).
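
The behaviour of Modification 1 can be sketched on the same kind of row structure, where each row holds the text of base A and of base B and a row with both texts present has been judged the same statement. The function name `drop_duplicate`, the `keep` parameter, and the sample rows are hypothetical. As an assumption of this sketch, a copy is dropped only when the two texts are identical, so differing texts stay available for inspection.

```python
def drop_duplicate(rows, keep="A"):
    """Keep one base's text for identical same-statement rows and blank the other."""
    result = []
    for a, b in rows:
        if a is not None and b is not None and a == b:
            # Same statement with no textual difference: keep a single copy.
            a, b = (a, None) if keep == "A" else (None, b)
        result.append((a, b))
    return result


rows = [("hello from base A", "hello from base A"),
        (None, "yes"),
        ("check the specification", "check the use")]
kept_a = drop_duplicate(rows)
kept_b = drop_duplicate(rows, keep="B")
```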

以上のように、変形例１では、同一と判断された発言があり、拠点間の情報処理装置１０にて識別されたテキストデータの差異がない場合、一つの拠点側のテキストデータを残し、他の拠点側のテキストデータを削除する処理が実行される。これによれば、発言が記録された議事録中の重複テキストデータが削除されるので、発言内容が見やすくなるという効果がある。なお、図９では、拠点Ａ側のテキストデータを残し、拠点Ｂ側のテキストデータを削除したが、これに限らず、拠点Ｂ側のテキストデータを残し、拠点Ａ側のテキストデータを削除してもよい。
（変形例２）
図１０は、本実施形態の変形例２に係る発言記録生成処理を示したフローチャートであり、図１１は、本実施形態の変形例２に係る出力処理例である。 As described above, in Modification 1, when remarks are determined to be the same and the text data recognized by the information processing apparatuses 10 at the bases do not differ, the text data of one base is kept and the text data of the other base is deleted. Because duplicate text data is thereby removed from the minutes in which the remarks are recorded, the remarks become easier to read. In FIG. 9, the text data on the base A side is kept and the text data on the base B side is deleted, but this is not a limitation: the text data on the base B side may be kept and the text data on the base A side deleted instead.
(Modification 2)
FIG. 10 is a flowchart showing an utterance record generation process according to the second modification of the present embodiment, and FIG. 11 is an example of output processing according to the second modification of the present embodiment.

変形例２では、上記変形例１の図８のＳ１１５を図１０のＳ１１６に変更している点のみ異なる。具体的には、変形例１では、判定手段２３により同一発言と判定された場合、一の情報処理装置１０ａ及び他の情報処理装置１０ｂのいずれか一方の拠点の発言記録に同一発言を含ませ、いずれか他方の拠点の発言記録には同一発言を含ませない（図９）。変形例２では、これに加えて、判定手段２３により同一発言と判定された場合であって同一発言中に差異がある場合、同一発言中の差異部分が認識可能なように発言記録を生成する。例えば、図１１では、同一と判断された発言であってテキストデータに一部差異がある場合、その差異部分をマーキングして示している。このようにして、変形例２によれば、拠点間の同一と判断された発言に含まれる差異を見やすくする効果がある。 Modification 2 differs from Modification 1 only in that S115 of FIG. 8 is replaced with S116 of FIG. 10. Specifically, in Modification 1, when the determination unit 23 determines that the statements are the same statement, the same statement is included in the statement record of only one of the bases of the information processing apparatuses 10a and 10b and is not included in the statement record of the other base (FIG. 9). In Modification 2, in addition, when the determination unit 23 determines that the statements are the same statement but they contain a difference, the statement record is generated so that the differing portion can be recognized. For example, in FIG. 11, when statements determined to be the same differ partially in their text data, the differing portion is marked. Modification 2 thus makes differences within statements determined to be the same across bases easy to see.
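
One way to make the differing portion recognizable is to align the two word lists and wrap the unmatched words in brackets. The sketch below uses `difflib.SequenceMatcher` from the Python standard library for the alignment, which is a choice made for this illustration and not a technique named in the source. The function name `mark_differences` and the English sample sentences, which stand in for the “仕様” and “使用” misrecognition, are hypothetical.

```python
from difflib import SequenceMatcher


def mark_differences(words_a, words_b, left="[", right="]"):
    """Return both word lists as strings with the differing words wrapped in brackets."""
    out_a, out_b = [], []
    matcher = SequenceMatcher(a=words_a, b=words_b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            out_a.extend(words_a[i1:i2])
            out_b.extend(words_b[j1:j2])
        else:
            # replace, delete, or insert: bracket whatever each side has in this span
            out_a.extend(f"{left}{w}{right}" for w in words_a[i1:i2])
            out_b.extend(f"{left}{w}{right}" for w in words_b[j1:j2])
    return " ".join(out_a), " ".join(out_b)


marked_a, marked_b = mark_differences(
    "please check the specification today".split(),
    "please check the use today".split(),
)
```

The `left` and `right` markers could equally be replaced by markup for bold, color, or underlining.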

なお、同一発言中の差異部分を認識可能に表示するためには、差異部分をマーキングする他、差異部分を太字にする、差異部分の色を変える、差異部分をカッコで括る、差異部分を下線で示す等様々な方法を用いることができる。
（変形例３）
図１２は、本実施形態の変形例３に係る発言記録生成処理を示したフローチャートである。変形例３では、変形例２の図１０の全ての処理を含み、更に図１２のＳ１１７のステップが加えられている。具体的には、変形例３では、Ｓ１０４の後ステップのＳ１１７にて、判定手段２３は、選定された各拠点の発言時刻の差分が予め定められた所定時間以上であるかを判定する。各拠点の発言時刻の差分が予め定められた所定時間未満であると判定された場合には、Ｓ１０５以降の処理を実行する。一方、選定された各拠点の発言時刻の差分が予め定められた所定時間以上であると判定された場合には、選定された発言の比較処理を行わずに、Ｓ１０４に戻り、次に比較処理を行う発言を選定する。 To display the differing portion of the same statement recognizably, various methods other than marking can be used, such as setting the differing portion in bold, changing its color, enclosing it in parentheses, or underlining it.
(Modification 3)
FIG. 12 is a flowchart showing a statement record generation process according to the third modification of the present embodiment. Modification 3 includes all the processes of FIG. 10 of Modification 2 and adds step S117 of FIG. 12. Specifically, in S117, which follows S104, the determination unit 23 determines whether the difference between the speech times of the selected remarks of the bases is equal to or greater than a predetermined time. When the difference is determined to be less than the predetermined time, the processes from S105 onward are executed. When the difference is determined to be equal to or greater than the predetermined time, the selected remarks are not compared; the process returns to S104 and selects the next remarks to be compared.
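
The time check of S117 reduces to a single comparison, sketched below. The function name `should_compare` and the 10-second default for `max_gap` are hypothetical, since the text only says that the time is predetermined.

```python
from datetime import datetime, timedelta


def should_compare(time_a, time_b, max_gap=timedelta(seconds=10)):
    """S117: compare two remarks only when their speech times differ by less than max_gap."""
    return abs(time_a - time_b) < max_gap


t0 = datetime(2013, 3, 12, 10, 0, 0)  # arbitrary sample timestamp
near = should_compare(t0, t0 + timedelta(seconds=4))
at_limit = should_compare(t0, t0 + timedelta(seconds=10))
far = should_compare(t0 + timedelta(minutes=5), t0)
```

A gap equal to the threshold is treated as too large, matching the "equal to or greater than" wording of S117.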

以上のように、変形例３では、判定手段２３は、比較対象である各拠点の２以上の文字データの変換前の音声情報の発言時刻の差分が予め定められた閾値（所定時間）を上回る場合、前記２以上の文字データについて音声情報が同一発言か否かの判定を止める。このように比較対象が同じであっても、「発言時刻が大きく異なる場合は比較対象としない」という制御を行う。これにより、全体の処理を減らし、処理時間の短縮を図ることができる。
（変形例４）
図１３は、本実施形態の変形例４に係る発言記録生成処理を示したフローチャートである。変形例４では、変形例２の図１０の全ての処理を含み、更に図１３のＳ１１８のステップが加えられている。具体的には、変形例４では、Ｓ１０４の後ステップのＳ１１８にて、判定手段２３は、選定された各発言の発言者が異なるかを判定する。各拠点の発言者が同一人であると判定された場合には、Ｓ１０５以降の処理を実行する。一方、各拠点の発言者が異なると判定された場合には、選定された発言の比較処理を行わずに、Ｓ１０４に戻り、次に比較処理を行う発言を選定する。 As described above, in the third modification, the determination unit 23 determines that the difference in the speech time of the speech information before conversion of two or more character data at each base to be compared exceeds a predetermined threshold (predetermined time). In this case, the determination as to whether or not the voice information is the same speech for the two or more character data is stopped. In this way, even when the comparison target is the same, control is performed such that “when the speech time is significantly different, the comparison target is not included”. Thereby, the whole process can be reduced and the processing time can be shortened.
(Modification 4)
FIG. 13 is a flowchart showing a statement record generation process according to Modification 4 of the present embodiment. Modification 4 includes all the processes of FIG. 10 of Modification 2 and adds step S118 of FIG. 13. Specifically, in S118, which follows S104, the determination unit 23 determines whether the speakers of the selected remarks differ. When the speakers of the remarks at the bases are determined to be the same person, the processing from S105 onward is executed. When the speakers are determined to differ, the selected remarks are not compared; the process returns to S104 and selects the next remarks to be compared.

声紋認証の機能を有する情報処理装置１０においては、発言と発言者を関連付けた情報を取得することができる。その場合、判定手段２３は、Ｓ１１８に示したように、まず発言者３２を比較してもよい。例えば、図５のα１１とβ１２はともに発言者がＡ１なので、比較対象とするが、もしこれらの発言者が異なった場合は比較対象とはしない。このように比較対象が同じであっても、比較対象である発言内容の話者が異なる場合、音声情報が同一発言か否かの判定を止める、という制御を行う。これにより、全体の処理を減らし、処理時間の短縮を図ることができる。 An information processing apparatus 10 having a voiceprint authentication function can acquire information that associates each remark with its speaker. In that case, the determination means 23 may first compare the speakers 32, as shown in S118. For example, α11 and β12 in FIG. 5 both have the speaker A1 and are therefore compared, but if their speakers differed they would not be compared. In this way, even if the comparison target is the same, control is performed so that the determination as to whether the voice information is the same speech is skipped when the speakers of the speech content to be compared differ. This reduces the overall processing and shortens the processing time.
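
The speaker check of S118 can be sketched as a filter applied before the word comparison. The dictionary layout, the function name `candidate_pairs`, and the speaker label `B1` are hypothetical; only the label `A1` appears in the description above.

```python
def candidate_pairs(site_a, site_b):
    """Yield only the remark pairs whose identified speakers match (S118)."""
    for a in site_a:
        for b in site_b:
            if a["speaker"] == b["speaker"]:
                yield a, b


site_a = [{"speaker": "A1", "text": "hello from base A"}]
site_b = [{"speaker": "A1", "text": "hello from base A"},
          {"speaker": "B1", "text": "yes"}]
pairs = list(candidate_pairs(site_a, site_b))
```

Only the pairs that survive this filter would go on to the word-level comparison from S105 onward.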

以上、上記実施形態及び変形例１〜変形例４によれば、音声認識によって作成したテキストデータに基づきに発言記録（議事録）を作成する際、２以上の情報処理装置にてそれぞれ取得した音声情報から変換された２以上の文字データを比較する。これにより、音声情報の発言を正確に判定することができる。この結果、議事録の自動作成の精度を高め、修正ミスを減らし、効率よく発言の議事録を作成することができ、議事録作成の時間の削減することができる。 As described above, according to the above embodiment and Modifications 1 to 4, when an utterance record (minutes) is created based on text data produced by voice recognition, two or more pieces of character data converted from the voice information acquired by two or more information processing apparatuses are compared. This allows the remarks in the voice information to be determined accurately. As a result, the accuracy of automatic minutes creation is improved, correction mistakes are reduced, the minutes of the remarks can be created efficiently, and the time needed to create the minutes can be reduced.

なお、上記形態は本発明の範囲を限定するものではなく、通信会議サーバが情報処理装置の判定機能、発言記録生成機能の一部又は全部を備えても良い。また、システムを構成する通信会議サーバや情報処理装置は複数台でも良く、通信会議サーバや情報処理装置のいずれに上記機能を備えさせても良い。なお、この実施形態で説明する情報処理装置と通信会議サーバとが接続されたシステム構成は一例であり、用途や目的に応じて様々なシステム構成例があることは言うまでもない。 In addition, the said form does not limit the scope of the present invention, and a communication conference server may be provided with a part or all of the determination function of the information processing apparatus and the statement record generation function. In addition, a plurality of communication conference servers and information processing apparatuses that constitute the system may be provided, and either the communication conference server or the information processing apparatus may be provided with the above function. It should be noted that the system configuration in which the information processing apparatus and the communication conference server described in this embodiment are connected is an example, and it goes without saying that there are various system configuration examples depending on applications and purposes.

システム構成の他の例としては、図２に示した情報処理装置１０の機能構成のうち、音声変換手段１９、音声認識結果記憶手段２０、発言記録生成手段２１、発言記録出力手段２２、判定手段２３及び話者特定手段２４の機能を、情報処理装置１０の替わりにサーバ５０が有するシステム構成でもよい。その場合、図１４に示したように、通信会議サーバ５０は、通信手段５７、データ処理手段５８、計時手段５９の他、音声変換手段５１、音声認識結果記憶手段５２、発言記録生成手段５３、発言記録出力手段５４、判定手段５５及び話者特定手段５６の機能を有する。 As another example of the system configuration, the server 50, instead of the information processing apparatus 10, may have the functions of the voice conversion unit 19, the voice recognition result storage unit 20, the utterance record generation unit 21, the utterance record output unit 22, the determination unit 23, and the speaker specifying unit 24 from the functional configuration of the information processing apparatus 10 illustrated in FIG. 2. In this case, as shown in FIG. 14, the communication conference server 50 has, in addition to the communication means 57, the data processing means 58, and the time measuring means 59, the functions of the voice conversion means 51, the voice recognition result storage means 52, the statement record generation means 53, the utterance record output means 54, the determination means 55, and the speaker identification means 56.

音声変換手段５１、音声認識結果記憶手段５２、発言記録生成手段５３、発言記録出力手段５４、判定手段５５及び話者特定手段５６の各機能は、情報処理装置１０の各機能と同様である。例えば、音声変換手段５１は、複数の情報処理装置１０のうち一の情報処理装置にて取得した音声情報を一の文字データに変換する。また、音声変換手段５１は、他の情報処理装置にて取得した音声情報であって、前記一の情報処理装置にて音声情報を取得したタイミングに応じて取得した音声情報を他の文字データに変換する。 The functions of the voice conversion unit 51, the voice recognition result storage unit 52, the utterance record generation unit 53, the utterance record output unit 54, the determination unit 55, and the speaker identification unit 56 are the same as the corresponding functions of the information processing apparatus 10. For example, the voice conversion unit 51 converts voice information acquired by one information processing apparatus among the plurality of information processing apparatuses 10 into one character data. The voice conversion unit 51 also converts, into other character data, voice information that is acquired by another information processing apparatus in accordance with the timing at which the one information processing apparatus acquired its voice information.

また、判定手段２３は、音声変換手段１９により変換された前記一の文字データと他の文字データとを比較し、前記一の情報処理装置及び前記他の情報処理装置にて取得された音声情報が同一発言か否かを判定する。 The determination unit 23 compares the one character data converted by the voice conversion unit 19 with the other character data, and determines whether the voice information acquired by the one information processing apparatus and the other information processing apparatus is the same statement.

本システム構成例では、通信会議サーバ５０が、上記実施形態及び各変形例の発言記録生成処理を実行する。この場合、図４、図８、図１０、図１２、図１３に示した発言記録生成処理は、通信会議サーバ５０側で実行される。その際、通信会議サーバ５０の音声変換手段５１は、少なくとも２つの拠点の情報処理装置１０にて取得した音声情報を音声認識技術を用いてテキスト情報に変換する。よって、本システム構成例においても、各情報処理装置１０にて取得した音声情報の変換ミスの箇所を補うことができ、これにより、より正確な議事録を作成することができる。なお、通信会議サーバ５０は、複数の情報処理装置１０とネットワークを介して接続されたサーバ機器に相当する。 In this system configuration example, the communication conference server 50 executes the statement record generation processing of the above embodiment and each modification. In this case, the statement record generation process shown in FIGS. 4, 8, 10, 12, and 13 is executed on the communication conference server 50 side. At that time, the voice conversion means 51 of the communication conference server 50 converts the voice information acquired by the information processing apparatuses 10 in at least two bases into text information using a voice recognition technique. Therefore, also in the present system configuration example, it is possible to compensate for a conversion error portion of the voice information acquired by each information processing apparatus 10, and thereby it is possible to create a more accurate minutes. The communication conference server 50 corresponds to a server device connected to a plurality of information processing apparatuses 10 via a network.

以上、添付図面を参照しながら本発明の通信システム及び通信方法の好適な実施形態について詳細に説明したが、本発明の通信システム及び通信方法の技術的範囲はかかる例に限定されない。本発明の技術分野における通常の知識を有する者であれば、特許請求の範囲に記載された技術的思想の範疇において、各種の変更例または修正例に想到し得ることは明らかであり、これらについても、当然に本発明の通信システム及び通信方法の技術的範囲に属する。また、上記実施形態及び変形例が複数存在する場合、矛盾しない範囲で組み合わせることができる。 The preferred embodiments of the communication system and communication method of the present invention have been described in detail above with reference to the accompanying drawings, but the technical scope of the communication system and communication method of the present invention is not limited to such examples. It is obvious that a person having ordinary knowledge in the technical field of the present invention can come up with various changes or modifications within the scope of the technical idea described in the claims. Of course, it belongs to the technical scope of the communication system and communication method of the present invention. In addition, when there are a plurality of the above-described embodiments and modifications, they can be combined within a consistent range.

なお、本発明に係る情報処理装置及び通信会議サーバのハードウェア構成例を、図１５を参照しながら簡単に説明する。発明に係る情報処理装置及び通信会議サーバには、ＣＰＵ１０６が内蔵されている。ＣＰＵ１０６により実行される各機能を実現するためのプログラムは、ＲＯＭ１０４、ＲＡＭ１０５、あるいはＨＤＤ１０８等の記憶手段に予め格納されてもよい。前記プログラムは、記録媒体であるＣＤ−ＲＯＭあるいはフレキシブルディスク，ＳＲＡＭ，ＥＥＰＲＯＭ，メモリカード等の不揮発性記録媒体（メモリ）に記録されてもよい。本発明に係る情報処理装置及び通信会議サーバの機能は、これらのメモリに記録されたプログラムをＣＰＵ１０６に実行させることにより実現され得る。さらに、前記プログラムは、通信回路１０３の機能を用いてＩＰネットワーク網１１０に接続され、プログラムを記録した記録媒体を備える外部機器あるいはプログラムを記憶手段に記憶した外部機器からダウンロードすることもできる。キーボード１０１は、入力装置の一例であり、各装置に各操作信号を入力するのに用いられる。キーボード１０１の替わりにマウスやタッチパネルを用いることもできる。ディスプレイ１０２は、表示装置の一例であり、各装置による処理結果を表示する。 A hardware configuration example of the information processing apparatus and the communication conference server according to the present invention will be briefly described with reference to FIG. The information processing apparatus and communication conference server according to the invention have a CPU 106 built therein. A program for realizing each function executed by the CPU 106 may be stored in advance in a storage unit such as the ROM 104, the RAM 105, or the HDD 108. The program may be recorded on a recording medium such as a CD-ROM or a non-volatile recording medium (memory) such as a flexible disk, SRAM, EEPROM, or memory card. The functions of the information processing apparatus and the communication conference server according to the present invention can be realized by causing the CPU 106 to execute the programs recorded in these memories. Furthermore, the program can be downloaded from an external device connected to the IP network 110 using the function of the communication circuit 103 and having a recording medium recording the program or an external device storing the program in a storage means. The keyboard 101 is an example of an input device, and is used to input each operation signal to each device. A mouse or touch panel can be used instead of the keyboard 101. The display 102 is an example of a display device, and displays a processing result by each device.

以上のように、本実施形態に係る情報処理装置及び通信会議サーバは、上記ハードウェア構成により、上述した各種機能を実現することができる。 As described above, the information processing apparatus and the communication conference server according to the present embodiment can realize the various functions described above by the hardware configuration.

１：通信会議システム、１０ａ，１０ｂ，１０ｃ、１０：情報処理装置、１１：通信手段、１２：データ処理手段、１３：音声入力手段、１４：入力音声処理手段、１５：音声記憶手段、１６：出力音声処理手段、１７：計時手段、１８：音声出力手段、１９：音声変換手段、２０：音声認識結果記憶手段、２１：発言記録生成手段、２２：発言記録出力手段、２３：判定手段、２４：話者特定手段、５０：通信会議サーバ、１１０：ＩＰネットワーク網 1: communication conference system, 10a, 10b, 10c, 10: information processing apparatus, 11: communication means, 12: data processing means, 13: voice input means, 14: input voice processing means, 15: voice storage means, 16: Output voice processing means, 17: timing means, 18: voice output means, 19: voice conversion means, 20: voice recognition result storage means, 21: utterance record generation means, 22: utterance record output means, 23: determination means, 24 : Speaker identification means, 50: Teleconference server, 110: IP network

特開２００５−３４１０１５号公報JP 2005-341015 A 特開２０１１−０６５３２２号公報JP 2011-066532 A

Claims

A communication system having a plurality of information processing devices connected via a network, the communication system comprising:
first voice conversion means for converting voice information acquired by one information processing apparatus of the plurality of information processing apparatuses into character data;
second voice conversion means for converting, into character data, voice information acquired by another information processing apparatus of the plurality of information processing apparatuses in accordance with the timing at which the one information processing apparatus acquired the voice information; and
determination means for comparing the two character data converted by the first and second voice conversion means and determining whether or not the voice information acquired by the one information processing apparatus and the other information processing apparatus is the same utterance.

2. The communication system, further comprising statement record generation means for generating, based on the determination by the determination means, a statement record of the communication between the one information processing apparatus and the other information processing apparatus.

The communication system according to claim 2, wherein, when the determination means determines that the voice information is the same statement, the statement record generation means includes the same statement in the statement records of the respective bases of the one information processing apparatus and the other information processing apparatus.

The communication system according to claim 2, wherein, when the determination means determines that the voice information is the same statement, the statement record generation means includes the same statement in the statement record of one of the bases of the one information processing apparatus and the other information processing apparatus, and does not include the same statement in the statement record of the other base.

The communication system according to claim 3 or 4, wherein, when the determination means determines that the voice information is the same statement and there is a difference within the same statement, the statement record generation means generates the statement record so that the differing portion of the same statement can be recognized.

Further comprising a time measuring means for measuring the speech time of the voice information;
When the difference between the speech times of the speech information before the conversion of the two or more character data to be compared exceeds a predetermined threshold, the determination means has the same speech information for the two or more character data. The communication system according to any one of claims 1 to 5, wherein the determination as to whether or not is stopped.

The communication system according to any one of claims 1 to 6, further comprising speaker specifying means for specifying a speaker of the voice information,
wherein the determination means stops determining whether the voice information is the same utterance for the two or more character data when the speakers of the voice information before conversion into the two or more character data to be compared are different.
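The two conditions under which the determination is stopped (a speech-time difference above a threshold, or different speakers) can be sketched as a simple pre-check in Python. The `Utterance` class, its field names, and the default threshold of 1.0 second are hypothetical values chosen for illustration; the specification only refers to "a predetermined threshold".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Utterance:
    text: str                      # character data after voice conversion
    speech_time_s: float           # measured speech time, in seconds (assumed unit)
    speaker: Optional[str] = None  # identified speaker, if available


def should_compare(a, b, max_time_diff_s=1.0):
    """Return False when the same-utterance determination should be skipped."""
    # Stop when the speech times differ by more than the threshold.
    if abs(a.speech_time_s - b.speech_time_s) > max_time_diff_s:
        return False
    # Stop when both speakers are known and they differ.
    if a.speaker is not None and b.speaker is not None and a.speaker != b.speaker:
        return False
    return True


u1 = Utterance("hello everyone", 3.0, "Sato")
u2 = Utterance("hello everyone", 3.4, "Sato")
u3 = Utterance("hello everyone", 5.5, "Sato")
u4 = Utterance("hello everyone", 3.2, "Suzuki")
```

Only the pair `(u1, u2)` passes the pre-check; `(u1, u3)` is skipped because of the speech-time difference and `(u1, u4)` because the speakers differ.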

A communication system having a plurality of information processing devices and a server device connected via a network, the communication system comprising:
voice conversion means for converting voice information acquired by one information processing device of the plurality of information processing devices into one character data, and for converting voice information acquired by another information processing device of the plurality of information processing devices, in accordance with the timing at which the voice information is acquired by the one information processing device, into other character data; and
determination means for comparing the one character data and the other character data converted by the voice conversion means and determining whether or not the voice information acquired by the one information processing device and the other information processing device is the same utterance.

The communication system according to claim 1, wherein the determination means decomposes each of the voice information acquired by the one information processing apparatus and the other information processing apparatus into one or a plurality of words, compares them word by word, and determines whether or not the voice information acquired by the one information processing apparatus and the other information processing apparatus is the same utterance.
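The word-by-word comparison recited above can be sketched in Python as follows. Splitting on whitespace is only a stand-in for word decomposition (Japanese text would require a morphological analyzer), and the position-by-position matching and the `min_match_ratio` threshold are assumptions made for illustration, not the method defined in the specification.

```python
def is_same_utterance(text_a, text_b, min_match_ratio=0.8):
    """Judge two recognition results as the same utterance by per-word comparison."""
    # Decompose each result into words (whitespace split is an assumed stand-in).
    words_a = text_a.lower().split()
    words_b = text_b.lower().split()
    if not words_a or not words_b:
        return False
    # Compare word by word at the same position and count agreements.
    matches = sum(1 for wa, wb in zip(words_a, words_b) if wa == wb)
    # Normalize by the longer utterance so extra words lower the ratio.
    ratio = matches / max(len(words_a), len(words_b))
    return ratio >= min_match_ratio


same = is_same_utterance("Please share the agenda now", "please share the agenda now")
different = is_same_utterance("please share the agenda", "good morning everyone")
near = is_same_utterance(
    "the meeting starts at nine", "the meeting starts at five", min_match_ratio=0.7
)
```

A tolerance below 1.0 allows for recognition errors at either base; a stricter or looser threshold would be chosen according to the accuracy of the voice conversion.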

A communication method performed by a plurality of information processing apparatuses connected via a network, the communication method comprising:
a first voice conversion step of converting voice information acquired by one information processing apparatus of the plurality of information processing apparatuses into character data;
a second voice conversion step of converting, in another information processing apparatus among the plurality of information processing apparatuses, voice information acquired according to the timing at which the voice information is acquired by the one information processing apparatus into character data; and
a determination step of comparing the two character data converted in the first and second voice conversion steps and determining whether or not the voice information acquired by the one information processing apparatus and the other information processing apparatus is the same utterance.

A program for executing functions of a communication system having a plurality of information processing devices connected via a network, the program causing a computer to execute:
a first voice conversion process of converting voice information acquired by one information processing apparatus of the plurality of information processing apparatuses into character data;
a second voice conversion process of converting, in another information processing apparatus among the plurality of information processing apparatuses, voice information acquired according to the timing at which the voice information is acquired by the one information processing apparatus into character data; and
a determination process of comparing the two character data converted by the first and second voice conversion processes and determining whether or not the voice information acquired by the one information processing apparatus and the other information processing apparatus is the same utterance.