JP2020060735A

JP2020060735A - Voice recognition system

Info

Publication number: JP2020060735A
Application number: JP2018193388A
Authority: JP
Inventors: 三浦　浩之; Hiroyuki Miura; 浩之三浦
Original assignee: Individual
Current assignee: Individual
Priority date: 2018-10-12
Filing date: 2018-10-12
Publication date: 2020-04-16
Anticipated expiration: 2038-10-12
Also published as: JP7110057B2

Abstract

PROBLEM TO BE SOLVED: To convert sound files into text and identify speakers using learning-type servers via the Internet. SOLUTION: A voice recognition system 1 includes: a sound collecting unit 2 into which sounds are input; a processing unit 3 that generates sound files F from the sounds, transmits the sound files F to a text conversion server and a speaker identification server, and receives text files W and identification results for speakers H; and a monitor unit 4 that displays processing results from the processing unit 3. Sound data V, separated at the silent intervals between utterances by the speakers H, is transmitted as the sound files F to a text conversion server 5 and a speaker identification server 6, which are cloud services that perform data analysis through self-learning based on data collected from many users via the Internet. The received text files W and identification results for the speakers H are then displayed on the monitor unit 4 in chronological order, in association with the sound files F. SELECTED DRAWING: Figure 1

Description

The present invention relates to a voice recognition system that achieves high recognition accuracy even with, for example, multiple speakers.

Voice recognition devices that convert voice data input from a microphone into text are in widespread use. Patent Document 1 discloses a voice recognition device that improves recognition accuracy by learning the utterance features of each speaker.

In addition, learning systems that use deep learning have been built into various cloud services. Based on data collected from many users via the Internet, these learning systems use neural-network-based processing devices to analyze data through self-learning.

By learning on its own without waiting for human instruction, such a system can efficiently improve the output accuracy of its processing device, and users make use of the results of data analysis performed by cloud services that employ deep learning.

Patent Document 1: Japanese Unexamined Patent Application Publication No. 2002-2151848 (JP 2002-2151848 A)

However, the voice recognition device of Patent Document 1 learns only from voice collected through its microphone, so the voice data it can collect is limited. Moreover, whereas the cloud services described above are built as large-scale systems with many arithmetic processing units arranged in parallel, the voice recognition device of Patent Document 1 is small in scale. As a result, its learning accuracy improves slowly, and the accuracy of its text conversion and speaker identification is slow to improve.

An object of the present invention is to solve the above problem and to provide a voice recognition system that, by using learning-type servers offered as cloud services via the Internet, accurately converts collected voice into text and accurately identifies speakers.

To achieve the above object, a voice recognition system according to the present invention comprises a sound collection unit that inputs ambient sound, a processing unit that generates a voice file by processing the sound data input from the sound collection unit, and a monitor unit that displays the processing results of the processing unit. The processing unit is connected via the Internet to a character conversion server and a speaker identification server, each having a self-learning function. The processing unit transmits the voice file to the character conversion server and receives from it a text file obtained by converting the voice file into text; transmits the voice file and the speakers' user ID information to the speaker identification server and receives the speaker identification result for the voice file; and displays the text file and the speaker identification result corresponding to the voice file on the monitor unit.

According to the voice recognition system of the present invention, the system uses a character conversion server and a speaker identification server, which are cloud services that analyze data through self-learning based on data collected from many users via the Internet. Text conversion and speaker identification can therefore be performed accurately on voice files without building a text conversion function or a speaker identification function into the system itself.

In addition, each voice file can be associated with its text file and identified speaker and displayed on the monitor unit in chronological order in near real time. The speakers and what they said can be checked in writing, and past remarks can easily be reviewed by scrolling the screen.

FIG. 1 is a system configuration diagram of the voice recognition system.
FIG. 2 is a flowchart for generating voice files from voice data.
FIG. 3 is an explanatory diagram showing speakers' voice data as a waveform.
FIG. 4 is a list of the voice data for each speaker.
FIG. 5 is an explanatory diagram showing another set of speakers' voice data as a waveform.
FIG. 6 is an explanatory diagram of how the voice data of each speaker is discriminated.
FIG. 7 is an explanatory diagram of the text displayed on the monitor unit.

The present invention will be described in detail with reference to the illustrated embodiment.

The voice recognition system 1 comprises a sound collection unit 2 that inputs ambient sound; a processing unit 3 that generates a voice file F by processing the sound data input from the sound collection unit 2, transmits the voice file F to a character conversion server and a speaker identification server, and receives a text file W and an identification result for the speaker H; and a monitor unit 4 that displays the processing results of the processing unit 3.

A commercially available notebook or desktop PC may be used for the voice recognition system 1, with, for example, an external microphone as the sound collection unit 2. The sound collection unit 2 should be a monaural type rather than a stereo type that records separate left and right channels, and is preferably of high quality. This monaural sound collection unit 2 is placed at the center among the speakers, for example on a table.

The processing unit 3 comprises a calculation unit 3a, a memory unit 3b, and a storage unit 3c, and performs various kinds of data processing by launching software stored in the storage unit 3c. The processing unit 3 and the sound collection unit 2 are connected by wire or wirelessly.

The monitor unit 4 is connected to the processing unit 3 and consists of, for example, a liquid crystal display, on which the various processing results of the processing unit 3 can be displayed. The monitor unit 4 may also be the monitor of another PC or mobile terminal connected via a network.

The character conversion server 5 is a neural-network-based API (Application Programming Interface) on the Internet IN and is connected to the voice recognition system 1 via the Internet IN.

The character conversion server 5 is external to the voice recognition system 1. When the processing unit 3 of the voice recognition system 1 uploads a voice file F, the server converts it into a text file W, which the processing unit 3 can then download. Because text conversion takes a long time when a voice file F lasting several minutes is uploaded, it is preferable to divide the voice data V into segments of a few tens of seconds or less before uploading them to the character conversion server 5.

The character conversion server 5 also performs deep learning on the voice files uploaded by a large number of users and self-corrects its text conversion processing. Its conversion accuracy therefore improves over time.

Likewise, the speaker identification server 6, which is separate from the character conversion server 5, is a neural-network-based API on the Internet IN and is connected to the voice recognition system 1 via the Internet IN.

Voice samples for each speaker are registered in advance on the speaker identification server 6. When the voice recognition system 1 uploads voice data V to the speaker identification server 6, the server can identify the speaker of the voice data V based on the registered speaker data. For example, if the voice data of speaker Ha is uploaded and speaker Ha is already registered on the speaker identification server 6, the voice in the data is identified as belonging to speaker Ha. Because many speakers are registered on the speaker identification server 6, uploading the group of relevant registered user IDs together with the voice data V allows the server to identify the speaker efficiently from among those user IDs.

The speaker identification server 6 likewise identifies speakers while performing self-analysis using deep learning on the voice files uploaded by a large number of users, so its identification accuracy improves over time.

For example, when the user IDs of the speakers Ha, Hb, and Hc are uploaded to the speaker identification server 6 together with the voice data V in which their conversation is recorded, each speaker is identified by selecting from just those three user IDs. This speeds up speaker identification and improves its accuracy.
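
The effect of sending a candidate group of user IDs can be illustrated with a small local stand-in for the server. This is only a toy sketch: the real speaker identification server is a neural-network-based cloud API whose interface the text does not describe. The function name `identify_speaker`, the one-number "voice feature" per enrolled user, and the extra user ID `Hx` are all assumptions made for illustration.

```python
from typing import Dict, Iterable, Optional


def identify_speaker(
    feature: float,
    enrolled: Dict[str, float],
    candidate_ids: Iterable[str],
) -> Optional[str]:
    """Return the candidate user ID whose enrolled feature is closest.

    `enrolled` maps user IDs to a single hypothetical voice feature.
    Only IDs listed in `candidate_ids` are considered, which mirrors
    uploading the group of user IDs together with the voice data.
    """
    candidates = [uid for uid in candidate_ids if uid in enrolled]
    if not candidates:
        return None
    return min(candidates, key=lambda uid: abs(enrolled[uid] - feature))


# Hypothetical enrollment data: "Hx" is an unrelated registered user
# whose voice happens to resemble that of Ha.
enrolled = {"Ha": 120.0, "Hb": 180.0, "Hc": 240.0, "Hx": 125.0}

# Searching every registered user picks the unrelated look-alike.
print(identify_speaker(124.0, enrolled, enrolled.keys()))
# Restricting the search to the meeting participants picks Ha.
print(identify_speaker(124.0, enrolled, ["Ha", "Hb", "Hc"]))
```

With fewer candidates there is less to compare and fewer look-alike voices to confuse, which is the speed and accuracy benefit the paragraph above describes.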

FIG. 2 is a flowchart of voice file generation for sound data input to the processing unit 3 via the sound collection unit 2. Sound data is the data input to the sound collection unit 2; voice data V is the recorded sound data after processing.

The processing performed by the processing unit 3 is described for the case where, as shown in FIG. 1, a single sound collection unit 2 is placed at the center of the speakers Ha, Hb, and Hc and a meeting is started. Once the meeting starts, the speakers Ha, Hb, and Hc speak in turn over time as shown in FIG. 3, and a single stream of sound data combining their voices is obtained.

In step S1, only the frequencies of human speech are extracted from the sound data stored in the storage unit 3c, and the result is stored as voice data V. For example, if the sound of a chair being moved or an ambulance siren is mixed into the sound data, those frequency ranges are cut as noise and the remainder is stored as voice data V.
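
One simple way to picture step S1 is a frequency-band filter that zeroes every spectral component outside a chosen band. The sketch below is an assumption-laden illustration, not the method of the text: the function name `keep_band` and the band limits are hypothetical, the text gives no cutoff frequencies, and a naive discrete Fourier transform is used only to keep the example dependency-free. Real noise such as a siren can overlap the speech band, so a plain band filter shows only the basic idea.

```python
import cmath
import math
from typing import List


def keep_band(
    samples: List[float],
    sample_rate: float,
    low_hz: float,
    high_hz: float,
) -> List[float]:
    """Zero all frequency components outside [low_hz, high_hz].

    Uses a naive O(n^2) DFT, which is fine for short illustrative
    inputs but far too slow for real recordings.
    """
    n = len(samples)
    # Forward DFT.
    spectrum = [
        sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]
    # Bin k and bin n - k describe the same physical frequency, so both
    # are kept or dropped together to keep the output real-valued.
    for k in range(n):
        freq = min(k, n - k) * sample_rate / n
        if not (low_hz <= freq <= high_hz):
            spectrum[k] = 0
    # Inverse DFT; the imaginary part is only rounding error.
    return [
        (sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
        for t in range(n)
    ]
```

For example, a 200 Hz tone mixed with a 3000 Hz tone and filtered with `keep_band(mixed, 8000, 100.0, 1000.0)` leaves essentially only the 200 Hz tone.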

FIG. 3 is an explanatory diagram showing, as a simplified waveform, voice data V whose recording started at time t11. For example, it is the waveform obtained when speaker Ha first says "We will now begin the meeting.", speaker Hb then says "Understood.", speaker Hc follows with "I see.", and speaker Ha then says "Then let us move on to the agenda."

Next, the process moves to step S2 in FIG. 2, which measures the silent time m, that is, the silent interval between utterances in the voice data V. For example, with the threshold for the silent time m set to 1 second, the process moves to step S3 when a silent time m1 of 1 second or more occurs. If step S2 finds only a silent time m0 shorter than 1 second, step S2 is repeated.
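
The silence test of step S2 can be sketched as follows. The sketch assumes the voice data has already been reduced to a per-frame voiced/silent flag (how that flag is computed is not covered here), and the names `split_on_silence`, `frame_sec`, and `min_silence_sec` are hypothetical. It also flushes the last segment at the end of the input, whereas the flow described above waits for the next long silence.

```python
from typing import List, Tuple


def split_on_silence(
    frame_is_voiced: List[bool],
    frame_sec: float = 0.1,
    min_silence_sec: float = 1.0,
) -> List[Tuple[int, int]]:
    """Return (start, end) frame indices, end exclusive, of speech segments.

    A segment is closed only when a silent run lasts at least
    `min_silence_sec` (silent time m1). Shorter pauses (silent time m0)
    stay inside the current segment.
    """
    min_silent_frames = int(round(min_silence_sec / frame_sec))
    segments: List[Tuple[int, int]] = []
    start = None        # first voiced frame of the open segment
    last_voiced = None  # most recent voiced frame
    silent_run = 0
    for i, voiced in enumerate(frame_is_voiced):
        if voiced:
            if start is None:
                start = i
            last_voiced = i
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run >= min_silent_frames:
                segments.append((start, last_voiced + 1))
                start = None
                silent_run = 0
    if start is not None:
        # Assumption: flush whatever is still open when the input ends.
        segments.append((start, last_voiced + 1))
    return segments


# 0.5 s speech, 0.3 s pause, 0.4 s speech, 1.2 s silence, 0.3 s speech.
frames = [True] * 5 + [False] * 3 + [True] * 4 + [False] * 12 + [True] * 3
print(split_on_silence(frames))  # [(0, 12), (24, 27)]
```

The 0.3 second pause does not split the first segment, while the 1.2 second silence does, matching the m0 and m1 cases.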

In step S3, it is determined whether the voice data V immediately preceding the break at the silent time m1 contains more than one speaker H. This discrimination samples the voice data V at predetermined intervals and exploits the fact that the center frequency differs from speaker to speaker. From the shifts in center frequency, the number of speakers H in the voice data immediately preceding the break can be determined.
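
A minimal sketch of the speaker count in step S3 is given below. It assumes that a center frequency estimate (in Hz) is already available for each sampling window, and it treats values within a tolerance of each other as the same speaker. The function name `count_speakers`, the 20 Hz tolerance, and the greedy grouping are illustrative assumptions; the text does not say how the center frequency is measured or compared.

```python
from typing import List


def count_speakers(center_freqs_hz: List[float], tolerance_hz: float = 20.0) -> int:
    """Count distinct center-frequency groups, one group per speaker.

    Each new value either falls within `tolerance_hz` of an existing
    group representative or starts a new group.
    """
    representatives: List[float] = []
    for f in center_freqs_hz:
        if not any(abs(f - r) <= tolerance_hz for r in representatives):
            representatives.append(f)
    return len(representatives)


# One speaker throughout: proceed directly to step S6.
print(count_speakers([150.0, 152.0, 149.0]))                       # 1
# Two distinct center frequencies: go through step S5 first.
print(count_speakers([120.0, 122.0, 118.0, 210.0, 208.0, 121.0]))  # 2
```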

After the number of speakers H has been determined, the process moves to step S4. If there are multiple speakers H, the process moves to step S5; if there is only one speaker H, step S5 is skipped and the process moves to step S6.

In the voice data V shown in FIG. 3, each piece of voice data V preceding the breaks at times t12, t13, t14, and t15 has the frequency characteristics of only one person, so at step S4 the process skips step S5 and moves to step S6.

Step S5 is described later. Turning first to step S6, the segmented voice data V is saved, as shown in FIG. 4, as voice file F1:t11 for speaker Ha's opening remark "We will now begin the meeting.", voice file F2:t12 for speaker Hb's "Understood.", voice file F3:t13 for speaker Hc's "I see.", and voice file F4:t14 for speaker Ha's "Then let us move on to the agenda." Note that the processing unit 3 cannot itself identify whose utterance each of these voice files F contains.

Each generated voice file F is then transmitted to the character conversion server 5 and the speaker identification server 6. After transmission, the process returns to step S2, and steps S2 to S6 are repeated.

As described above, FIG. 3 shows voice data in which, when the speakers Ha, Hb, and Hc converse, a silent time m1 occurs after the first speaker Ha finishes and before the next speaker Hb begins. FIG. 5, by contrast, is an explanatory diagram showing, as a simplified waveform, voice data V for the case where the pauses between the utterances of the speakers Ha, Hb, and Hc are silent times m0 below the threshold.

In FIG. 5, when the first silent time m1 occurs after speaker Ha's remark "Then let us move on to the agenda.", step S3 samples the immediately preceding voice data V0, delimited at time t21, at predetermined intervals and measures the center frequency. Then, in step S4, if there are multiple center frequencies, that is, multiple speakers H, the process moves to step S5.

In step S5, a voice file F is generated for each speaker H that was distinguished. FIG. 6 illustrates how the speakers H are distinguished, from shifts in center frequency, in the immediately preceding voice data V0 delimited by the silent time m1 occurring at time t25. By determining the center frequencies of the voice data V0, it can be divided into voice data VH1 starting at time t21, voice data VH2 starting at time t22, and voice data VH3 starting at time t23.
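
The division in step S5 can be sketched as splitting a sequence of per-window center frequencies into runs wherever the center frequency shifts. As before, this is an illustration under assumptions: the per-window estimates are taken as given, and the name `split_by_center_frequency`, the 20 Hz tolerance, and the integer speaker labels are hypothetical. The labels only say that two runs share a voice; they do not say who the speaker is, which is left to the speaker identification server.

```python
from typing import List, Tuple


def split_by_center_frequency(
    center_freqs_hz: List[float],
    tolerance_hz: float = 20.0,
) -> List[Tuple[int, int, int]]:
    """Split windows into runs of (speaker_label, start, end), end exclusive.

    A window joins a known label when its center frequency lies within
    `tolerance_hz` of that label's first observed value. Otherwise it
    starts a new label. Consecutive windows with one label form a run.
    """
    representatives: List[float] = []
    runs: List[List[int]] = []
    for i, f in enumerate(center_freqs_hz):
        label = next(
            (k for k, r in enumerate(representatives) if abs(f - r) <= tolerance_hz),
            None,
        )
        if label is None:
            representatives.append(f)
            label = len(representatives) - 1
        if runs and runs[-1][0] == label:
            runs[-1][2] = i + 1  # extend the current run
        else:
            runs.append([label, i, i + 1])
    return [(label, start, end) for label, start, end in runs]


# Three voices in turn, then the first voice again.
windows = [120.0, 121.0, 200.0, 201.0, 199.0, 300.0, 119.0]
print(split_by_center_frequency(windows))
# [(0, 0, 2), (1, 2, 5), (2, 5, 6), (0, 6, 7)]
```

Runs sharing label 0 correspond to the case where one speaker's voice data contains two separate utterances.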

Even when two speakers H overlap in part of the recorded voice data V0, shortening the sampling time, for example to 10 msec, makes it possible to identify the speaker H who dominates each sampling window, so the overlapping voice data V0 can be separated into individual voice data V.

Furthermore, since the voice data VH1 consists of two utterances with a silent time m1 between them, two voice files F can be generated from it: voice file F1:t21 and voice file F4:t24.

Through the above discrimination processing, voice files F1:t21 to F4:t24, similar to the voice files F1:t11 to F4:t14 shown in FIG. 4, can be generated. Although the processing unit 3 can tell that the speakers of these voice files F1:t21 to F4:t24 differ, it cannot identify who is speaking.

Alternatively, the processing that generates two voice files F from the voice data VH1 may be omitted, so that only a single voice file F1:t21 is generated. In this case, to make clear the chronological order between the latter utterance in voice file F1:t21 and the utterances in voice files F2:t22 and F3:t23, the information of times t21 to t24 must be stored in the respective voice files. That is, by storing times t21 and t24 in voice file F1:t21, each utterance can be displayed on the monitor unit 4, described later, in the chronological order of times t21 to t24.

The processing unit 3 handles the voice data V of FIG. 3 and that of FIG. 5 differently in the following respect. For the voice data V of FIG. 3, a voice file F is generated each time a silent time m1 occurs; the voice files F1:t11 to F4:t14 of FIG. 4 are generated in order from the top, and the process moves to step S6 each time one is generated. For the voice data V of FIG. 5, by contrast, the voice files F1:t21 to F4:t24 are generated almost simultaneously before the process moves to step S6.

In step S6, when the generated voice files F are transmitted to the character conversion server 5, the voice files F1:t11 to F4:t14 and F1:t21 to F4:t24 are converted into text files W1:t11 to W4:t14 and W1:t21 to W4:t24, respectively, and the voice recognition system 1 receives these files.

When the generated voice files F are transmitted to the speaker identification server 6, the user IDs of the speakers Ha to Hc taking part in the conversation are sent along with the voice files F1:t11 to F4:t14 and F1:t21 to F4:t24. For each of these voice files received from the processing unit 3, the speaker identification server 6 identifies the speaker H from among the user IDs sent with them, and the processing unit 3 receives the identified speakers Ha to Hc in association with the voice files F.

The processing unit 3 then associates each voice file F with its text file W and the identified speaker H, and displays them on the monitor unit 4 in chronological order. That is, the voice file F1:t11 of "We will now begin the meeting.", whose speaker H the processing unit 3 could not identify, is displayed as shown in FIG. 7 together with the text file W1:t11 reading "We will now begin the meeting." and the identified speaker Ha.

Each voice file F is saved with a file name whose ending is a serial number corresponding to time t, and, as shown in FIG. 7, the text files W and speakers H are displayed in chronological order of time t. In FIG. 7, to make the speakers H easier to tell apart, speaker Ha is shown on the left and speakers Hb and Hc on the right.
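
Ordering the display by the serial number at the end of the file name can be sketched as below. The file naming pattern (`voice_0001.wav` and so on), the `Utterance` record, and the helper names are assumptions for illustration only; the text states just that the file name ends in a serial number corresponding to time t.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Utterance:
    file_name: str  # hypothetical naming: trailing digits are the serial number
    speaker: str    # identification result from the speaker identification server
    text: str       # text returned by the character conversion server


def serial_number(file_name: str) -> int:
    """Extract the trailing serial number, ignoring a file extension."""
    match = re.search(r"(\d+)(?:\.\w+)?$", file_name)
    if match is None:
        raise ValueError(f"no serial number in file name: {file_name!r}")
    return int(match.group(1))


def timeline(utterances: List[Utterance]) -> List[str]:
    """Return display lines sorted by the serial number in each file name."""
    ordered = sorted(utterances, key=lambda u: serial_number(u.file_name))
    return [f"{u.speaker}: {u.text}" for u in ordered]


# Results may arrive from the servers out of order.
received = [
    Utterance("voice_0003.wav", "Hc", "I see."),
    Utterance("voice_0001.wav", "Ha", "We will now begin the meeting."),
    Utterance("voice_0002.wav", "Hb", "Understood."),
]
for line in timeline(received):
    print(line)
```

Sorting on the parsed integer rather than on the raw string keeps the order correct even if the serial numbers are not zero-padded.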

Thus, although generating each file and exchanging it with the cloud services introduces a slight time lag, the utterance date and time, text file W, and speaker H for the latest voice file F are displayed in near real time, in order from the bottom of the screen of the monitor unit 4.

Although the display order here is determined from the file names of the voice files F, times t11 to t14 may instead be stored as utterance dates and times in, for example, the file headers, and the chronological display can then be based on that information.

Also, instead of the utterance date and time, the date and time at which the entry was processed for display on the screen may be shown. In this case, without storing the utterance date and time described above, the processing unit 3 may transmit the voice files F to the cloud services in the order in which they were generated, sending the next voice file F only after the response for the previous one has been received.
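
The send-then-wait variant can be sketched as a simple blocking loop. The two callables stand in for requests to the character conversion server and the speaker identification server; their names and signatures are assumptions, since the text does not describe the actual server interfaces.

```python
from typing import Callable, Iterable, List, Tuple


def process_in_order(
    voice_files: Iterable[str],
    transcribe: Callable[[str], str],
    identify: Callable[[str], str],
) -> List[Tuple[str, str, str]]:
    """Send files one at a time, in generation order.

    Each call is assumed to block until the server replies, so the next
    voice file is sent only after the previous result has arrived. The
    returned list is therefore already in display order.
    """
    results: List[Tuple[str, str, str]] = []
    for voice_file in voice_files:
        text = transcribe(voice_file)   # hypothetical character conversion call
        speaker = identify(voice_file)  # hypothetical speaker identification call
        results.append((voice_file, speaker, text))
    return results
```

Because arrival order equals generation order, no utterance timestamp has to be stored for sorting, at the cost of not overlapping the requests.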

Displaying the conversation in dialogue form as shown in FIG. 7 makes it easy to check later who said what. Furthermore, by making the screen of FIG. 7 viewable on another terminal device such as a connected PC or mobile terminal, the content of the meeting can be followed visually from another location in near real time.

In particular, someone listening to the audio of a meeting from another location may be unable to tell the speakers H apart and so find it hard to follow the overall content, whereas with the voice recognition system 1 the speakers and their remarks can be checked in writing, making the meeting easy to follow.

Moreover, the content of a meeting can be grasped easily when it is checked in a place where playing audio is difficult, or when a hearing-impaired person checks it. Past remarks can also be reviewed simply by scrolling the screen.

The above description used sound data collected by the sound collection unit 2 of the voice recognition system 1 placed at the center of the speakers Ha to Hc, for example on a table. However, a sound data file containing voice data recorded elsewhere may instead be stored in the storage unit 3c via a network, a storage medium, or the like, or read in directly, and the calculation unit 3a may then perform the processing of the flowchart described above.

In this way, by using the character conversion server 5 and the speaker identification server 6, which are cloud services that analyze data through self-learning based on data collected from many users via the Internet IN, the voice recognition system 1 can perform accurate text conversion and speaker identification without having its own text conversion or speaker identification functions.

Moreover, meeting content can be put into text in near real time, and recorded voice files can also be put into text after the fact, which helps in grasping meeting content quickly.

1 Voice recognition system
2 Sound collection unit
3 Processing unit
4 Monitor unit
5 Character conversion server
6 Speaker identification server
IN Internet

Claims

A voice recognition system comprising a processing unit that generates a voice file by processing sound data, including sound data input from a sound collection unit that inputs ambient sound, and a monitor unit that displays a processing result of the processing unit, wherein
the processing unit is connected via the Internet to a character conversion server and a speaker identification server each having a self-learning function,
transmits the voice file to the character conversion server and receives from the character conversion server a text file obtained by converting the voice file into text,
transmits the voice file and user ID information of the speakers to the speaker identification server and receives a speaker identification result for the voice file, and
displays the text file and the speaker identification result corresponding to the voice file on the monitor unit.

The voice recognition system according to claim 1, wherein voice data is generated by extracting only the frequencies of human speech from the sound data, the voice data is divided when a silent state between utterances of the speakers lasts a predetermined time or longer, and the voice file is generated from the voice data immediately preceding the division.

The voice recognition system according to claim 2, wherein the voice data is sampled at predetermined intervals and the voice file for each speaker is generated by discriminating center-frequency characteristics.

The voice recognition system according to any one of claims 1 to 3, wherein the text file corresponding to the voice file and the identified speaker are displayed on the monitor unit in association with each other in chronological order.