JP2004309965A

JP2004309965A - Conference recording/dictation system

Info

Publication number: JP2004309965A
Application number: JP2003106567A
Authority: JP
Inventors: Masami Nakamura; 雅巳中村; Jitsuichi Yamada; 実一山田; Masao Shinkai; 正男新開; Hidehito Aoki; 秀仁青木
Original assignee: Advanced Media Inc
Current assignee: Advanced Media Inc
Priority date: 2003-04-10
Filing date: 2003-04-10
Publication date: 2004-11-04
Anticipated expiration: 2023-04-10
Also published as: JP3859612B2

Abstract

PROBLEM TO BE SOLVED: To transcribe the utterances of persons attending a conference, in time series, as the minutes of the conference.
SOLUTION: There are as many microphones as attendees, allocated to the attendees one to one. An input selection part selects up to a predetermined number of the microphones that are outputting speech. There are as many speech recognition processing parts as microphones that the input selection part can select. The speech recognized by each speech recognition processing part is saved as is in a speech storage part, and the recognized data are converted into a character string, which is saved in a character string storage part. The character string saved in the character string storage part is displayed by a display means, and an editor can write the minutes of the conference using the displayed recognition result and the speech.
COPYRIGHT: (C)2005,JPO&NCIPI

Description

【０００１】
【発明の属する技術分野】
本発明は、会議等に出席した者の発言を議事録という形で時系列的に書き起こすことができる会議録音・書き起こしシステムに関するものである。本明細書においていう「会議」は、広い意味のものであり、通常の会議以外に、話し合い、集団面談、団体交渉等多数の発言者がいることを前提としている。
【０００２】
【従来の技術】
図３は従来例の会議における議事録を時系列で書き起こすための概念図である。図３において、マイクロホンａ３１１、マイクロホンｂ３１２、・・・、マイクロホンｎ３１３でひろい上げた音声は、たとえば、テープレコーダのような音声記憶手段ａ３１４、音声記憶手段ｂ３１５、・・・、音声記憶手段ｎ３１６によって一時記憶される。
【０００３】
前記各音声記憶手段３１４ないし３１６の内容は、たとえば、速記者によるワードプロセッサ、あるいは音声認識手段（図示されていない）等の文字列編集手段ａ３１７、・・・、ｎ３１９によって文字列とした後、時系列編集手段３２０によって、文字列を時系列的に並べることができる。
【０００４】
さらに、一般的に、複数の話者が一つのマイクロホンを通じて音声記憶手段に記憶され、前記音声記憶手段に記憶されている話者の声を聞きながら、たとえば、速記者がワードプロセッサ等を用い、議事録として書き起こしていた。
【０００５】
また、特開平５−１２２４６号公報に記載されている音声文書作成装置は、複数の入力を選択して一つとし、無意味語を削除して漢字変換を容易にするとともに、口語調文章を所定の書式の文語調文章に作成する会議議事録編集の機能を備えている。
【０００６】
【特許文献１】
特開平５−１２２４６号公報
【０００７】
【発明が解決しようとする課題】
一般に、会議議事録は、テープレコーダによって録音された録音テープを聞くことにより、ワードプロセッサやパーソナルコンピュータのキーボード入力を行って、文字列として、書き起こされていた。また、前記人手による書き起こしの代わりに、前記録音された録音テープの音声は、音声認識により、文字列に変換することが行われていた。
【０００８】
しかし、前記録音テープに録音されている会議中の音声は、一つのマイクロホンで複数人の発言を同時に記録しているため、人手または音声認識技術によっても、多くの問題を有していた。特に、発言者とマイクロホンとの距離が離れている場合、録音テープに録音されている発言者の録音内容は、周囲のノイズとともに、他の発言者の内容が重なり、内容を明瞭に文字列化することが困難であった。
【０００９】
図３に示すように、発言者一人一人にマイクロホンを用意して、議事録を録音テープにとることが考えられた。
【００１０】
しかし、複数のマイクロホンを用いて録音された録音テープは、発言内容を書き下ろす場合、発言のない無音区間を含む複数の音声を全て聞き取る必要があるため、議事録の作成が困難であるだけでなく、効率が非常に悪くなるという問題があった。また、異なるマイクロホンに入力された発言は、発言順に時系列を揃えて書き下ろす必要があり、作業を煩雑にするという問題があった。
【００１１】
前記問題は、たとえば、複数のマイクロホンからの音声をミキシングして、一つの録音テープに録音することで解消されるが、複数の発言者の発言が重なる場合、書き起こしが困難になる。
【００１２】
また、録音テープに録音された音声の聞き取りや書き起こしは、書き起こしたい発言者の頭出しや、前記発言者の不明瞭な部分を繰り返して聞く等の操作が必要になる。そして、前記操作は、再生、早送り、巻き戻し等、頻繁に行う必要があり、効率を悪くするという問題があった。
【００１３】
前記問題を解決する方法としては、多数のマイクロホンに対して、それぞれ音声認識処理を行うことである。しかし、一台のコンピュータで、発言者の数と同じ数の音声認識処理手段を備えておくことは、２人ないし３人の会議を除いて、前記コンピュータにとって重い負荷となるため、現状では無理である。
【００１４】
また、音声認識処理は、普通、音量が最大のものを一つ選択して入力しているので、複数の発言がある場合、声の小さい発言者が無視される恐れがある。さらに、会議の議事録は、無音時間が問題になる場合もあるが、従来の書き起こし方法を使用すると、会議の時系列が正確に表現できない。
【００１５】
以上のような課題を解決するために、本発明は、複数の発言者から同時に発言があっても、重複録音がなく別々に、正確に録音され、かつ、一台のパーソナルコンピュータで音声認識処理ができるとともに、書き起こしおよび編集が容易である会議録音・書き起こしシステムを提供することを目的とする。
【００１６】
本出願人は、発言者が同時に発言する場合、多くとも２人ないし３人程度であり、これ以上多くの発言者が発言を行った場合、会議が紛糾して、議事録をとる必要がないということに着目した。すなわち、本発明は、多数の発言者にそれぞれマイクロホンが備えられていても、マイクロホンの数だけ音声認識処理手段を設ける必要がなく、多くとも２ないし３程度にした点にある。
【００１７】
【課題を解決するための手段】
（第１発明）
第１発明の会議録音・書き起こしシステムは、複数の話者の発言を文字列として編集することができるものであり、複数のマイクロホンと、前記マイクロホンの内、予め決められた数以下のマイクロホンの音声出力を選択する入力選択部と、前記予め決められた数の音声認識処理部と、前記それぞれの音声認識処理部によって認識された音声およびその文字列を保存する音声保存部および文字列保存部と、前記音声保存部の音声と、文字列保存部の文字列に基づいて時系列的な文字列に編集する文字列編集部と、を少なくとも備えたことを特徴とする。
【００１８】
（第２発明）
第２発明の会議録音・書き起こしシステムにおいて、前記入力選択部は、前記マイクロホンから入力した零個の信号、一個の信号、複数個の信号を選択し、前記選択された信号を異なる音声認識処理部にそれぞれ出力することを特徴とする。
【００１９】
（第３発明）
第３発明の会議録音・書き起こしシステムにおいて、前記入力選択部は、前記マイクロホンから入力した信号をデジタルまたはアナログ信号として異なる音声認識処理部にそれぞれ出力することを特徴とする。
【００２０】
（第４発明）
第４発明の会議録音・書き起こしシステムにおいて、前記入力選択部は、前記信号の強さが予め決められたレベルを一定時間連続して超えているものを先着順に選択することを特徴とする。
【００２１】
（第５発明）
第５発明の会議録音・書き起こしシステムにおいて、前記入力選択部は、コンピュータソフトウエアであることを特徴とする。
【００２２】
（第６発明）
第６発明の会議録音・書き起こしシステムにおいて、前記各音声認識処理部における音声認識結果は、時系列に並べて表示されると同時に、音声認識結果に対する音声を再生する機能を有することを特徴とする。
【００２３】
【発明の実施の形態】
（第１発明）
第１発明は、複数の話者の発言が同じ時間に重複していても、発言内容が正確な文字列として、時系列通りに編集することができる会議録音・書き起こしシステムである。マイクロホンは、当該会議の出席者の数だけあり、前記出席者のそれぞれに一つずつが割り当てられている。入力選択部は、前記マイクロホンの内、音声出力のある予め決められた数以下のマイクロホンの音声出力を選択するようになっている。また、前記マイクロホンの数は、必ずしも、出席者と同じにする必要がない。その際に、出席者は、名前を言ってから発言するとか、あるいは発言者の声紋によって識別される。
【００２４】
通常の会議において、たとえば、３名以上の者が同時に発言することは少ないと仮定している。２名の発言者は、ある発言の終わりと始まり、合いの手等において重複発言が発生するので、書き起こす必要がある。この場合、会議の参加者が多くても、音声認識処理部の数は、２個で済むことになる。
【００２５】
前記入力選択部におけるマイクロホンの選択は、多数出席している会議であっても、同時に発言を行うのは２人ないし３人程度であるということを前提にしている。もし、これ以上多くの者が同時に発言した場合、会議は、紛糾状態であり、それらの発言を書き起こしても意味がない。
【００２６】
音声認識処理部の数は、前記入力選択部において選択できるマイクロホンの数と同じである。前記それぞれの音声認識処理部によって認識された音声は、そのまま音声保存部に保存される。また、前記それぞれの音声認識処理部によって認識されたデータは、文字列になり、文字列保存部に保存される。
【００２７】
前記音声保存部に保存された音声、および文字列保存部に保存された文字列は、コンピュータの表示手段に音声の波形と認識結果としての文字列が表示される。音声は、必要に応じて、コンピュータのスピーカーから録音された通りに出力される。会議の議事録を書き起こす者は、コンピュータの表示手段に表示されている音声の認識結果が正しいと判断した場合、そのまま採用し、前記音声の認識結果が間違いであると判断した場合、実際の音声を聞きながら修正した文字列とする。
【００２８】
たとえば、前記会議の議事録を書き起こす者は、前記音声の認識結果が不明であると判断した場合、表示手段に表示されている音声の波形部分またはその近傍のボタンをクリックすることにより、会議の発言内容をスピーカーによって再現する。前記会議の議事録を書き起こす者は、前記音声の認識結果と実際の録音内容とに基づいて、正しい文字列を時系列的に作成する。
【００２９】
会議の議事録を書き起こす者は、音声認識処理部における認識結果が文字列保存部に保存されており、前記認識結果を表示手段により表示させると同時に、音声保存部に保存された音声を聞きながら、文字列編集部によって、正確な文字列にするだけでなく、時系列的にも正しく編集することができる。
【００３０】
（第２発明）
第２発明は、会議中に誰も発言しない場合であっても、無音の入力があるとみなしている。すなわち、入力選択部は、前記マイクロホンから入力した零個の信号（無音）、一個の信号、複数個の信号を選択し、前記選択された信号を異なる音声認識処理部にそれぞれ出力する。１ないし２個の重複した発言がある場合、異なる音声認識処理部は、それぞれを認識結果として、異なる発言者毎の文字列とする。
【００３１】
（第３発明）
第３発明は、マイクロホンから入力した信号をデジタル信号またはアナログ信号とすることで、入力選択部にソフトウエアを使用したり、あるいは、オーディオ機器に使用するハードウエアを使用することができる。前記入力選択部によって選択されたそれぞれの信号は、異なる音声認識処理部によって音声認識される。前記マイクロホンから入力されたアナログ信号は、音声認識処理部の前でＡ／Ｄ変換されるようにすることもできる。
【００３２】
（第４発明）
第４発明は、入力選択部における選択順位が付けられている。すなわち、前記選択順位は、入力された信号の強さが予め決められたレベルを一定時間以上連続して超えているものを先着順としている。第４発明は、複数の発言者が重複して発言していても、ある一定レベルの強さと連続時間で優先順位をつけているため、隣り合った席の者が小さな声で内輪話をしても、選択されないようにすることができる。
【００３３】
（第５発明）
第５発明は、前記入力選択部にコンピュータソフトウエアを使用することで、複数の発言者が同時であった場合、それぞれのマイクロホンで録音した内容を別々に録音した後、音声認識を正確にすることができる。
【００３４】
（第６発明）
第６発明は、各音声認識処理部における音声認識結果を時系列に並べて表示すると同時に、音声認識結果に対する音声を再生する機能を有する。会議の議事録を書き起こす者は、必要に応じて、再生された音声を聞きながら、前記音声認識結果を時系列的に修正して、議事録を容易に作成することができる。
【００３５】
【実施例】
図１は本発明の一実施例で、会議室における会議の議事録を書き起こすシステムを説明するための概略ブロック構成図である。図１において、会議室１１には、発言者（会議参加者）の数だけマイクロホン１、マイクロホン２、・・・、マイクロホンｍが設置されている。そして、前記各マイクロホンには、コンピュータ１２の音声入力端子に接続できるコンピュータ入力端子１１４が接続されている。
【００３６】
コンピュータ１２には、前記コンピュータ入力端子１１４に接続されている入力選択部１２１と、前記入力選択部１２１によって選択された音声を認識できる音声認識処理部（１）１２２、音声認識処理部（２）１２３、・・・、音声認識処理部（ｎ）１２４と、音声をそのまま保存する音声保存部１２５と、前記各音声認識処理部（１）ないし（ｎ）で認識した文字列を保存する文字列保存部１２６と、前記音声保存部１２５によって保存された音声を出力するスピーカー出力部１２７と、前記文字列保存部１２６によって保存された文字列を表示する表示部１２８と、スピーカー出力部１２７の出力、および／または表示部１２８に表示された文字列を編集する文字列編集部１２９と、編集された文字列１３０とから構成される。
【００３７】
会議の出席者数＝マイクロホン数ｍ＞＞音声認識処理部ｎである。そして、前記音声認識処理部（１）ないし（ｎ）は、発言者の音声信号が一定レベルを超えたものを採用し、一定レベル以下のものを無視する。また、前記音声認識処理部（１）ないし（ｎ）は、前記優先順位にしたがい、発言者の１ないしｍの中のｎ個を同時に選択することができる。
【００３８】
図２は本発明の一実施例におけるコンピュータの表示部を使用した編集の例を説明するための図である。図２において、表示部１２８（図１参照）は、編集領域２１と、音声パターン表示部２２と、会議情報表示部２３とから構成されている。
【００３９】
編集領域２１には、会議情報表示部２３の一部を選択することによって、音声認識処理部によって認識された認識結果２３３が表示される。会議情報表示部２３は、発言者の氏名が表示される発言者表示部２３１と、発言時間が表示される時間表示部２３２と、発言が音声認識処理部１ないしｎによって認識された結果を表示する認識結果表示部２３３と、前記編集領域２１で編集された結果を表示する編集結果表示部２３４とから構成されている。
【００４０】
図１および図２を参照して会議における発言者の音声認識に基づく文字列の編集について説明する。会議は、ｍ人が参加しており、それぞれにマイクロホン１、２、・・・ｍが備えられている。前記各マイクロホンには、コンピュータ入力端子１１４が設けられており、コンピュータ１２の端子（図示されていない）に接続されている。
【００４１】
今、二人の発言者は、同時に、マイクロホン１とマイクロホン２を通して発言したとする。入力選択部１２１は、マイクロホン１およびマイクロホン２の出力を同時に選択して、マイクロホン１の出力を音声認識処理部（１）１２２に、マイクロホン２の出力を音声認識処理部（２）１２３に割り当てる。
【００４２】
前記音声認識処理部（１）１２２および音声認識処理部（２）１２３によって認識された音声は、音声保存部１２５にそれぞれ保存される。前記音声認識処理部（１）１２２および音声認識処理部（２）１２３によって認識された文字列は、文字列保存部１２６にそれぞれ保存される。
【００４３】
前記音声保存部１２５および文字列保存部１２６に保存された音声および文字列は、編集する際にコンピュータ１２の表示部１２８における編集領域２１に表示される（図２）。たとえば、編集者は、１４時５４分２４秒の発言を選択（２３４１）すると、発言者ヤスモトマイが、「大議論をもの中で忘れててました。」という認識結果が編集結果表示部２３４に表示される。また、同じ時間に、アベユウコは、「はは」と笑ったと認識されている。
【００４４】
前記文字列保存部１２６に保存されていた前記ヤスモトマイの発言の認識結果は、内容が判らないため、編集者は、音声パターン表示部２２またはその近傍のボタンをクリックすることにより、音声保存部１２５に保存されていた音声出力がコンピュータ１２のスピーカー出力部１２７から音声信号が出力される。また、発言者表示部２３１、時間表示部２３２、認識結果表示部２３３、および編集結果表示部２３４のいづれかを選択した際に、自動的に音声出力することもできる。
【００４５】
前記編集者は、前記音声信号を聞きながら、前記「大議論をもの中で忘れててました。」が「大事なものを書くのを忘れてました。」であると判断して、編集領域２１において、正しく編集する。前記発言と同時に、全く異なる「はは」というアベユウコの発言は、別のマイクロホン２を介して音声認識処理部（２）１２３で処理されたため、正確に認識されている。
【００４６】
図２に示す画面には、１４時５４分４３秒に、ヤスモトマイが「こうしようとおもうしを大島まで全部作ってくださ」という認識結果が選択され、編集領域２１において、「仕様書まで全部作ってください。発想は、・・・」と編集した状態が示されている。
【００４７】
発言者とマイクロホンとの対応関係は、最初に自己紹介として氏名をいうことにより、次回から、声紋により識別することができるため、発言毎に氏名をいう必要がない。あるいは、マイクロホン１は○○さん、マイクロホン２は□□さんと予め対応させておくこともできる。
【００４８】
以上、本実施例を詳述したが、本発明は、前記本実施例に限定されるものではない。そして、本発明は、特許請求の範囲に記載された本発明を逸脱することがなければ、種々の設計変更を行うことが可能である。本発明の音声認識処理部等ブロック構成図の具体的技術は、周知または公知の技術を使用することができるため、詳細が省略されている。
【００４９】
【発明の効果】
本発明によれば、会議において、同時に発言する者の数は、２人ないし３人程度であることに着目し、出席者の数に対応するマイクロホンからの出力を選択して、２ないし３個の音声認識処理部を有するだけで、重複した発言を明瞭に書き起こすことが容易にできる。
【００５０】
本発明によれば、音声保存部と文字列保存部とをそれぞれ設けることにより、音声を聞きながら文字列を時系列的に書き起こすことができる。
【００５１】
本発明によれば、発言者の氏名は、対応しているマイクロホンの番号または声紋によって識別ができるだけでなく、発言時間とともに、編集結果が書き起こされる。前記発言時間は、会場内の無音時間が判り、時系列としても正確な編集結果が書き起こされる。
【００５２】
本発明によれば、発言者の数とマイクロホンの数が一致しているため、重複発言があっても、正確かつ容易に編集結果が書き起こされる。本発明によれば、前記マイクロホンの数が多くても、入力選択部によって、発言のあったマイクロホンの出力のみを選択できるため、音声認識処理部の数をパーソナルコンピュータ一台で処理できる程度にすることができる。
【００５３】
本発明によれば、多数の発言者の発言が時系列的に編集されているため、特定の発言者の発言内容に着目して編集することができる。また、本発明によれば、テーマに対する賛成派、反対派の意見を別々に対応して編集する等、いろいろな目的に適った編集が可能である。
【図面の簡単な説明】
【図１】本発明の一実施例で、会議室における会議の議事録を書き起こすシステムを説明するための概略ブロック構成図である。
【図２】本発明の一実施例におけるコンピュータの表示部を使用した編集の例を説明するための図である。
【図３】従来例の会議における議事録を時系列で書き起こすための概念図である。
【符号の説明】
１１・・・会議室
１１１・・・マイクロホン１
１１２・・・マイクロホン２
１１３・・・マイクロホンｍ
１１４・・・コンピュータ入力端子
１２・・・コンピュータ
１２１・・・入力選択部
１２２・・・音声認識処理部１
１２３・・・音声認識処理部２
１２４・・・音声認識処理部ｎ
１２５・・・音声保存部
１２６・・・文字列保存部
１２７・・・スピーカー出力部
１２８・・・表示部
１２９・・・文字列編集部
１３０・・・文字列
２１・・・編集領域
２１１・・・編集中の文字列
２２・・・音声パターン表示部
２３・・・会議情報表示部
２３１・・・発言者表示部
２３２・・・時間表示部
２３３・・・認識結果表示部
２３４・・・編集結果表示部
２３４１・・・認識結果選択部
[0001]
[Technical field of the invention]
The present invention relates to a conference recording / transcription system that can transcribe, in time series and in the form of minutes, the utterances of persons attending a conference or the like. The term "conference" as used in this specification has a broad meaning: in addition to ordinary meetings, it covers discussions, group interviews, collective bargaining, and similar settings, on the premise that there are many speakers.
[0002]
[Prior art]
FIG. 3 is a conceptual diagram of a conventional arrangement for transcribing conference minutes in time series. In FIG. 3, the speech picked up by microphone a 311, microphone b 312, ..., microphone n 313 is temporarily stored by voice storage means a 314, voice storage means b 315, ..., voice storage means n 316, each of which is, for example, a tape recorder.
[0003]
The contents of the voice storage means 314 to 316 are converted into character strings by character string editing means a 317, ..., n 319, for example a stenographer using a word processor or a voice recognition means (not shown), after which the time-series editing means 320 can arrange the character strings in time series.
[0004]
Further, it was common for the speech of a plurality of speakers to be stored in a voice storage means through a single microphone, and for a stenographer, for example, to transcribe it as minutes using a word processor or the like while listening to the speakers' voices stored in the voice storage means.
[0005]
In addition, the voice document creation apparatus described in Japanese Patent Laid-Open No. 5-12246 selects among a plurality of inputs to obtain a single input, deletes meaningless words to facilitate kanji conversion, and has a conference minutes editing function that turns colloquial sentences into literary-style sentences in a predetermined format.
[0006]
[Patent Document 1]
Japanese Patent Laid-Open No. 5-12246
[0007]
[Problems to be solved by the invention]
In general, conference minutes were transcribed as character strings by listening to a tape recorded with a tape recorder and typing on the keyboard of a word processor or personal computer. Alternatively, instead of this manual transcription, the speech on the recorded tape was converted into character strings by voice recognition.
[0008]
However, because the conference speech on the recording tape captures the utterances of several people at once through a single microphone, it posed many problems for both manual transcription and voice recognition. In particular, when a speaker was far from the microphone, that speaker's recorded speech was overlaid with ambient noise and with other speakers' utterances, making it difficult to convert the content clearly into character strings.
[0009]
As shown in FIG. 3, preparing a microphone for each speaker and recording the minutes on recording tapes was therefore considered.
[0010]
However, with tapes recorded using multiple microphones, writing down the utterances requires listening to all of the recordings, including silent sections with no speech, which not only makes creating the minutes difficult but also makes the work very inefficient. In addition, utterances input to different microphones must be written down aligned in time series in the order in which they were spoken, which complicates the work.
[0011]
The above problem can be resolved, for example, by mixing the sound from the multiple microphones and recording it on a single tape. However, when the utterances of several speakers overlap, transcription becomes difficult.
[0012]
Also, listening to and transcribing speech recorded on tape requires operations such as cueing to the speaker to be transcribed and repeatedly listening to unclear parts of that speaker's speech. These operations, such as playback, fast-forward, and rewind, must be performed frequently, which lowers efficiency.
[0013]
One way to solve the above problem is to perform voice recognition processing on each of the many microphones. However, providing a single computer with as many voice recognition processing means as there are speakers places a heavy load on the computer, except for conferences of two or three people, and is impractical at present.
[0014]
Further, voice recognition processing normally selects and inputs the single signal with the highest volume, so when there are several simultaneous utterances, a speaker with a quiet voice may be ignored. Furthermore, although silent periods can matter in conference minutes, conventional transcription methods cannot represent the time series of the conference accurately.
[0015]
To solve the above problems, an object of the present invention is to provide a conference recording / transcription system in which, even when a plurality of speakers speak at the same time, their speech is recorded separately and accurately without overlapping, voice recognition processing can be performed on a single personal computer, and transcription and editing are easy.
[0016]
The applicant focused on the fact that at most about two or three speakers speak at the same time, and that if more speakers than this speak at once, the conference is in disarray and there is no need to take minutes. That is, the point of the present invention is that, even when many speakers are each provided with a microphone, it is not necessary to provide as many voice recognition processing means as there are microphones; at most about two or three suffice.
[0017]
[Means for Solving the Problems]
(First invention)
The conference recording / transcription system of the first invention is capable of editing the utterances of a plurality of speakers as character strings, and comprises at least: a plurality of microphones; an input selection unit that selects the audio outputs of no more than a predetermined number of the microphones; the predetermined number of voice recognition processing units; a voice storage unit and a character string storage unit that store the speech recognized by the respective voice recognition processing units and its character strings; and a character string editing unit that edits the character strings into a time-series character string based on the speech in the voice storage unit and the character strings in the character string storage unit.
[0018]
(Second invention)
In the conference recording / transcription system of the second invention, the input selection unit selects zero, one, or a plurality of the signals input from the microphones and outputs each selected signal to a different voice recognition processing unit.
[0019]
(Third invention)
In the conference recording / transcription system of the third invention, the input selection unit outputs the signals input from the microphones, as digital or analog signals, each to a different voice recognition processing unit.
[0020]
(Fourth invention)
In the conference recording / transcription system of the fourth invention, the input selection unit selects, on a first-come, first-served basis, signals whose strength continuously exceeds a predetermined level for a certain period of time.
[0021]
(Fifth invention)
In the conference recording / transcription system according to the fifth aspect of the present invention, the input selection unit is computer software.
[0022]
(Sixth invention)
In the conference recording / transcription system of the sixth invention, the voice recognition results of the respective voice recognition processing units are displayed in time series, and the system also has a function of playing back the speech corresponding to each voice recognition result.
[0023]
[Embodiments of the invention]
(First invention)
The first invention is a conference recording / transcription system in which, even when the utterances of a plurality of speakers overlap in time, the content can be edited into accurate character strings in the correct time series. There are as many microphones as attendees at the conference, one assigned to each attendee. The input selection unit selects the audio outputs of no more than a predetermined number of the microphones that are producing audio output. The number of microphones need not necessarily equal the number of attendees; in that case, attendees state their names before speaking, or speakers are identified by voiceprint.
[0024]
It is assumed that in an ordinary conference it is rare for, say, three or more people to speak at the same time. Overlapping speech between two speakers does occur, for example where one utterance ends and another begins or when someone interjects, and so it needs to be transcribed. In this case, two voice recognition processing units suffice even if the conference has many participants.
[0025]
The selection of microphones in the input selection unit is based on the premise that only about two or three people speak at the same time, even in a conference with many attendees. If more people than this speak at the same time, the conference is in disarray and there is no point in transcribing those utterances.
[0026]
The number of voice recognition processing units is the same as the number of microphones that can be selected by the input selection unit. The speech recognized by each voice recognition processing unit is stored as is in the voice storage unit. The data recognized by each voice recognition processing unit are converted into a character string and stored in the character string storage unit.
[0027]
The speech stored in the voice storage unit and the character strings stored in the character string storage unit are shown on the display means of the computer as a speech waveform and a character string of the recognition result. The speech is output as recorded from the computer's speakers as needed. If the person transcribing the minutes judges the recognition result shown on the display means to be correct, it is adopted as is; if the result is judged to be wrong, the person corrects the character string while listening to the actual speech.
[0028]
For example, when the person transcribing the minutes judges the recognition result to be unclear, clicking the speech waveform shown on the display means, or a button near it, replays the conference utterance through the speaker. The person then creates the correct character strings in time series based on the recognition result and the actual recording.
[0029]
Because the recognition results of the voice recognition processing units are stored in the character string storage unit, the person transcribing the minutes can display them on the display means while listening to the speech stored in the voice storage unit, and can use the character string editing unit not only to produce accurate character strings but also to edit them correctly in time series.
[0030]
(Second invention)
The second invention treats the case where no one speaks during the conference as a silent input. That is, the input selection unit selects zero signals (silence), one signal, or a plurality of signals input from the microphones, and outputs each selected signal to a different voice recognition processing unit. When there are one or two overlapping utterances, the different voice recognition processing units each produce a recognition result, yielding a separate character string for each speaker.
[0031]
(Third invention)
In the third invention, the signals input from the microphones may be digital or analog, so that the input selection unit can be implemented in software or with hardware of the kind used in audio equipment. Each signal selected by the input selection unit is recognized by a different voice recognition processing unit. Analog signals input from the microphones may also be A/D converted ahead of the voice recognition processing units.
[0032]
(Fourth invention)
In the fourth invention, a selection order is assigned in the input selection unit. That is, signals whose strength continuously exceeds a predetermined level for at least a certain time are selected on a first-come, first-served basis. Because priority is assigned by a certain strength level and duration even when several speakers overlap, the fourth invention can keep people in adjacent seats who talk privately in low voices from being selected.
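The selection rule of the fourth invention can be illustrated with a minimal Python sketch. It is not taken from the specification: the class name `InputSelector`, the frame-based timing, the parameter names, and the rule that a microphone loses its slot as soon as its level drops to or below the threshold are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class InputSelector:
    """Hypothetical input selection unit (all names are illustrative)."""
    threshold: float   # assumed level that a signal must exceed
    min_frames: int    # assumed number of consecutive frames above the level
    max_selected: int  # number of voice recognition processing units (n)
    _run: dict = field(default_factory=dict)       # mic id -> consecutive frames above threshold
    _selected: list = field(default_factory=list)  # selected mic ids, in order of arrival

    def update(self, levels):
        """levels: dict of mic id -> signal level for the current frame."""
        for mic, level in levels.items():
            if level > self.threshold:
                self._run[mic] = self._run.get(mic, 0) + 1
            else:
                self._run[mic] = 0
                if mic in self._selected:
                    # Assumption: the slot is released as soon as the level drops.
                    self._selected.remove(mic)
        for mic, run in self._run.items():
            if (run >= self.min_frames
                    and mic not in self._selected
                    and len(self._selected) < self.max_selected):
                self._selected.append(mic)
        return list(self._selected)


selector = InputSelector(threshold=0.5, min_frames=2, max_selected=2)
frames = [
    {1: 0.9, 2: 0.1, 3: 0.6},
    {1: 0.9, 2: 0.8, 3: 0.2},
    {1: 0.9, 2: 0.8, 3: 0.7},
    {1: 0.9, 2: 0.8, 3: 0.7},
    {1: 0.1, 2: 0.8, 3: 0.7},
]
history = [selector.update(frame) for frame in frames]
```

In this sketch a quiet side conversation never exceeds the threshold and is therefore never selected, while a third speaker has to wait until one of the two slots becomes free.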
[0033]
(Fifth invention)
In the fifth invention, computer software is used for the input selection unit, so that when several speakers speak at the same time, the content picked up by each microphone is recorded separately, after which voice recognition can be performed accurately.
[0034]
(Sixth invention)
The sixth aspect of the invention has a function of displaying the voice recognition results in the respective voice recognition processing units in time series and reproducing the voice corresponding to the voice recognition results. The person who writes the minutes of the meeting can easily create the minutes by modifying the voice recognition result in time series while listening to the reproduced voice as necessary.
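As a rough illustration of arranging the results of several voice recognition processing units in time series, the following Python sketch merges per-unit result lists by start time. The tuple layout `(start_seconds, speaker, text)` and the function name `merge_time_series` are assumptions; the specification does not prescribe a data format.

```python
import heapq


def merge_time_series(*unit_results):
    """Merge per-unit results, each a list of (start_seconds, speaker, text)
    already sorted by start time, into a single time-ordered list."""
    return list(heapq.merge(*unit_results, key=lambda result: result[0]))


# 14:54:24 is 53664 seconds after midnight; 14:54:43 is 53683 seconds.
unit_1 = [
    (53664, "Yasumoto Mai", "I forgot the big discussion."),
    (53683, "Yasumoto Mai", "Please make all of these attempts to Oshima"),
]
unit_2 = [(53664, "Abe Yuko", "haha")]
timeline = merge_time_series(unit_1, unit_2)
```

Utterances that begin at the same time stay adjacent in the merged list, so overlapping speech from different speakers appears side by side for the editor.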
[0035]
[Example]
FIG. 1 is a schematic block diagram for explaining a system for transcribing the minutes of a conference in a conference room according to an embodiment of the present invention. In FIG. 1, microphone 1, microphone 2, ..., microphone m, as many as there are speakers (conference participants), are installed in the conference room 11. Each microphone is connected to a computer input terminal 114 that can be connected to an audio input terminal of the computer 12.
[0036]
The computer 12 comprises: an input selection unit 121 connected to the computer input terminal 114; voice recognition processing unit (1) 122, voice recognition processing unit (2) 123, ..., voice recognition processing unit (n) 124, which can recognize the speech selected by the input selection unit 121; a voice storage unit 125 that stores the speech as is; a character string storage unit 126 that stores the character strings recognized by the voice recognition processing units (1) to (n); a speaker output unit 127 that outputs the speech stored in the voice storage unit 125; a display unit 128 that displays the character strings stored in the character string storage unit 126; a character string editing unit 129 that edits the character strings on the basis of the output of the speaker output unit 127 and/or the character strings displayed on the display unit 128; and the edited character string 130.
[0037]
Here, the number of conference attendees = the number of microphones m >> the number of voice recognition processing units n. The voice recognition processing units (1) to (n) accept speech signals from speakers that exceed a certain level and ignore those at or below that level. In accordance with the priority order described above, the voice recognition processing units (1) to (n) can simultaneously select n of the speakers 1 to m.
[0038]
FIG. 2 is a diagram for explaining an example of editing using a display unit of a computer according to an embodiment of the present invention. In FIG. 2, the display unit 128 (see FIG. 1) includes an editing area 21, a voice pattern display unit 22, and a conference information display unit 23.
[0039]
When part of the conference information display unit 23 is selected, the recognition result 233 produced by a voice recognition processing unit is displayed in the editing area 21. The conference information display unit 23 comprises a speaker display unit 231 that displays the speaker's name, a time display unit 232 that displays the time of the utterance, a recognition result display unit 233 that displays the result of recognizing the utterance by the voice recognition processing units 1 to n, and an editing result display unit 234 that displays the result of editing in the editing area 21.
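One row of the conference information display unit 23 could be modelled as a small record, as in the Python sketch below. The class name `Utterance`, its field names, and the fallback rule in `final_text` are illustrative assumptions rather than part of the specification.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Utterance:
    speaker: str                  # shown in the speaker display unit 231
    time: str                     # shown in the time display unit 232, e.g. "14:54:24"
    recognized: str               # shown in the recognition result display unit 233
    edited: Optional[str] = None  # shown in the editing result display unit 234

    def final_text(self):
        """Return the editor's correction if there is one, else the recognition result."""
        return self.edited if self.edited is not None else self.recognized


row = Utterance("Abe Yuko", "14:54:24", "haha")
unedited_text = row.final_text()
row.edited = "(laughs)"  # hypothetical correction entered in the editing area 21
edited_text = row.final_text()
```

Keeping the recognition result and the edited text as separate fields mirrors the separate display units 233 and 234, so the original recognition output is not lost when the editor makes a correction.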
[0040]
With reference to FIG. 1 and FIG. 2, the editing of character strings based on voice recognition of the speakers in a conference will be described. The conference is attended by m people, each provided with one of microphones 1, 2, ..., m. Each microphone is provided with a computer input terminal 114, which is connected to a terminal (not shown) of the computer 12.
[0041]
Now, suppose that two speakers speak simultaneously through microphone 1 and microphone 2. The input selection unit 121 selects the outputs of microphone 1 and microphone 2 at the same time, and assigns the output of microphone 1 to voice recognition processing unit (1) 122 and the output of microphone 2 to voice recognition processing unit (2) 123.
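The assignment of selected microphones to voice recognition processing units can be sketched as follows in Python. The function `route_to_units` and its policy (a microphone stays on its current unit while it remains selected, and a newly selected microphone takes the lowest-numbered free unit) are assumptions for illustration; the specification only states that each selected output goes to a different unit.

```python
def route_to_units(selected, assignment, n_units):
    """Return an updated {unit number: mic id} mapping.

    selected   -- mic ids currently chosen by the input selection unit
    assignment -- previous {unit number: mic id} mapping
    n_units    -- number of voice recognition processing units (n)
    """
    # Keep microphones that are still selected on the unit they already occupy.
    updated = {unit: mic for unit, mic in assignment.items() if mic in selected}
    busy_mics = set(updated.values())
    free_units = [unit for unit in range(1, n_units + 1) if unit not in updated]
    # Give each newly selected microphone the lowest-numbered free unit.
    for mic in selected:
        if mic not in busy_mics and free_units:
            updated[free_units.pop(0)] = mic
    return updated


first = route_to_units([1, 2], {}, n_units=2)      # microphones 1 and 2 start speaking
second = route_to_units([2, 3], first, n_units=2)  # microphone 1 stops, microphone 3 starts
```

With two units, microphone 1 is routed to unit 1 and microphone 2 to unit 2, matching the example above; when microphone 1 falls silent, unit 1 becomes available for microphone 3 while microphone 2 stays on unit 2.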
[0042]
The voices recognized by the voice recognition processing unit (1) 122 and the voice recognition processing unit (2) 123 are stored in the voice storage unit 125, respectively. The character strings recognized by the voice recognition processing unit (1) 122 and the voice recognition processing unit (2) 123 are stored in the character string storage unit 126, respectively.
[0043]
The speech and character strings stored in the voice storage unit 125 and the character string storage unit 126 are displayed in the editing area 21 of the display unit 128 of the computer 12 during editing (FIG. 2). For example, when the editor selects (2341) the utterance at 14:54:24, the recognition result for the speaker Yasumoto Mai, "I forgot the big discussion.", is displayed in the editing result display unit 234. At the same time, Abe Yuko is recognized as having laughed "haha".
[0044]
Because the recognition result of Yasumoto Mai's utterance stored in the character string storage unit 126 is unintelligible, the editor clicks the voice pattern display unit 22 or a button near it, whereupon the speech stored in the voice storage unit 125 is output as an audio signal from the speaker output unit 127 of the computer 12. The speech can also be output automatically when any of the speaker display unit 231, the time display unit 232, the recognition result display unit 233, or the editing result display unit 234 is selected.
[0045]
While listening to the audio signal, the editor judges that "I forgot the big discussion." should be "I forgot to write something important." and corrects it in the editing area 21. Abe Yuko's entirely different utterance "haha", made at the same time, was processed by voice recognition processing unit (2) 123 via the separate microphone 2 and was therefore recognized accurately.
[0046]
The screen shown in FIG. 2 shows the state in which the recognition result for Yasumoto Mai at 14:54:43, "Please make all of these attempts to Oshima", has been selected and edited in the editing area 21 to "Please create everything up to the specifications. The idea is ...".
[0047]
The correspondence between speakers and microphones can be established by having each speaker first state his or her name as a self-introduction; from then on, speakers can be identified by voiceprint, so there is no need to state a name with every utterance. Alternatively, the correspondence can be set in advance, for example microphone 1 for Mr. ○○ and microphone 2 for Mr. □□.
[0048]
Although the present embodiment has been described in detail above, the present invention is not limited to this embodiment. Various design changes can be made without departing from the invention as described in the claims. Well-known or publicly known techniques can be used for the specific implementation of the blocks in the block diagram, such as the voice recognition processing units, so their details are omitted.
[0049]
[Effects of the invention]
According to the present invention, by noting that the number of people who speak at the same time in a conference is about two or three, and by selecting among the outputs of the microphones provided for the attendees, overlapping utterances can be transcribed clearly and easily with only two or three voice recognition processing units.
[0050]
According to the present invention, by providing the voice storage unit and the character string storage unit, the character string can be written in time series while listening to the voice.
[0051]
According to the present invention, not only can the speaker's name be identified from the number of the corresponding microphone or from the voiceprint, but the editing result is also transcribed together with the time of each utterance. The utterance times reveal the silent periods in the venue, so the editing result is also accurate as a time series.
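How silent periods follow from the recorded utterance times can be shown with a short Python sketch. It assumes, beyond what the specification states, that both a start and an end time in seconds are available for each utterance; the function name `silent_intervals` and the `min_gap` parameter are hypothetical.

```python
def silent_intervals(utterances, min_gap=0.0):
    """utterances: list of (start_seconds, end_seconds) from all speakers.
    Returns the (start, end) intervals during which nobody was speaking."""
    gaps = []
    latest_end = None
    for start, end in sorted(utterances):
        if latest_end is not None and start - latest_end > min_gap:
            gaps.append((latest_end, start))
        latest_end = end if latest_end is None else max(latest_end, end)
    return gaps


# Two overlapping utterances followed by a third after a pause.
gaps = silent_intervals([(0, 5), (3, 8), (12, 15)])
```

Overlapping utterances are treated as one continuous stretch of speech, so only the pause between second 8 and second 12 is reported as silence.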
[0052]
According to the present invention, since the number of speakers and the number of microphones are the same, the editing result is transcribed accurately and easily even when utterances overlap. According to the present invention, even if the number of microphones is large, the input selection unit can select only the outputs of the microphones at which someone is speaking, so the number of voice recognition processing units can be kept to a level that a single personal computer can handle.
[0053]
According to the present invention, since the utterances of a large number of speakers are edited in time series, editing can focus on the utterances of a specific speaker. In addition, according to the present invention, editing suited to various purposes is possible, such as separately editing the opinions of those for and those against a topic.
[Brief description of the drawings]
FIG. 1 is a schematic block diagram for explaining a system for writing up minutes of a meeting in a meeting room according to an embodiment of the present invention.
FIG. 2 is a diagram for explaining an example of editing using a display unit of a computer according to an embodiment of the present invention.
FIG. 3 is a conceptual diagram for writing up minutes in a conventional conference.
[Explanation of symbols]
11 ... Conference room
111 ... Microphone 1
112 ... Microphone 2
113 ... Microphone m
114 ... Computer input terminal
12 ... Computer
121 ... Input selection unit
122 ... Voice recognition processing unit 1
123 ... Voice recognition processing unit 2
124 ... Voice recognition processing unit n
125 ... Voice storage unit
126 ... Character string storage unit
127 ... Speaker output unit
128 ... Display unit
129 ... Character string editing unit
130 ... Character string
21 ... Editing area
211 ... Character string being edited
22 ... Voice pattern display unit
23 ... Conference information display unit
231 ... Speaker display unit
232 ... Time display unit
233 ... Recognition result display unit
234 ... Editing result display unit
2341 ... Recognition result selection unit

Claims

1. A conference recording / transcription system capable of editing the utterances of a plurality of speakers as character strings, comprising at least:
a plurality of microphones;
an input selection unit that selects the audio outputs of no more than a predetermined number of the microphones;
the predetermined number of voice recognition processing units;
a voice storage unit and a character string storage unit that store the speech recognized by the respective voice recognition processing units and its character strings; and
a character string editing unit that edits the character strings into a time-series character string based on the speech in the voice storage unit and the character strings in the character string storage unit.

2. The conference recording / transcription system according to claim 1, wherein the input selection unit selects zero, one, or a plurality of the signals input from the microphones and outputs each selected signal to a different voice recognition processing unit.

3. The conference recording / transcription system according to claim 1 or 2, wherein the input selection unit outputs the signals input from the microphones, as digital or analog signals, each to a different voice recognition processing unit.

4. The conference recording / transcription system according to any one of claims 1 to 3, wherein the input selection unit selects, in order of arrival, signals whose strength continuously exceeds a predetermined level for a predetermined time.

5. The conference recording / transcription system according to any one of claims 1 to 4, wherein the input selection unit is computer software.

6. The conference recording / transcription system according to claim 1, wherein the voice recognition results of the respective voice recognition processing units are displayed in time series, and the system has a function of playing back the speech corresponding to the voice recognition results.