JP2009301125A

JP2009301125A - Conference voice recording system

Info

Publication number: JP2009301125A
Application number: JP2008152030A
Authority: JP
Inventors: Naoyuki Kanda; 直之神田; Takashi Sumiyoshi; 貴志住吉; Yasunari Obuchi; 康成大淵
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2008-06-10
Filing date: 2008-06-10
Publication date: 2009-12-24
Anticipated expiration: 2028-06-10
Also published as: JP5030868B2

Abstract

PROBLEM TO BE SOLVED: To provide a framework that allows meeting participants to easily search and correct their own statements with appropriate authority, while minimizing the operations required of the participants.

SOLUTION: The speaker of each statement is identified based on the speaker's acoustic features in the speech recorded at a meeting, direction information, and the like. A correction right setting unit 107 gives each statement an appropriate speech correction right according to the reliability of that identification. In addition, a voice data search unit 111 and a voice correction unit 112 for easily searching and correcting audio after recording are provided, so that conference participants can correct the conference audio after the conference with ease and with appropriate authority. This allows participants to hold a free discussion while appropriate conference audio is recorded and shared with minimal operation.

COPYRIGHT: (C)2010, JPO&INPIT

Description

The present invention relates to a conference audio recording system for recording, searching, and sharing audio in meetings such as discussions and brainstorming sessions.

Techniques have been disclosed for recording the audio of a conference and creating minutes from it. Patent Document 1 describes a method of automatically creating minutes by converting conference speech into text with a speech recognition device. Non-Patent Document 1 describes a technique in which a secretary transcribes every statement made during a meeting with a dedicated tool and saves the result as the minutes.

Patent Document 2 concerns a system in which multiple participants edit and share a document. By giving participants management rights over document editing for individual statements, it allows opinions to be expressed freely while the document is managed appropriately, even for agenda items involving complex interests. This technique presupposes some kind of terminal for identifying each individual in order to attribute statements to participants.

Patent Document 3 describes a technique for recording a conference with microphones and granting an approval right for each utterance by means of speaker identification. If the speaker cannot be identified, the chairperson transcribes the utterance with everyone's consent. Patent Document 4 describes interrupting the recording by pressing a specific button when a participant is about to make a statement that should not be disclosed.

Patent Document 1: JP 2000-112931 A
Patent Document 2: JP 2007-328471 A
Patent Document 3: JP 2000-352995 A
Patent Document 4: JP 2005-072768 A
Non-Patent Document 1: Discussion Mining: Knowledge and Discovery from Minutes, IPSJ 67th National Convention, 2005.

In Patent Document 1 and Non-Patent Document 1, participants' statements are treated as official statements, and no framework is provided for managing participants' authority to correct their own statements or for correcting them easily. In actual meetings it is rather rare for every statement to be treated as official, and such a premise may inhibit participants from speaking freely. In particular, when the purpose of a meeting such as a discussion or brainstorming session is to gather opinions widely and cultivate knowledge, that purpose may not be fully achieved.

Further, in a face-to-face conference (including a TV conference), requiring each participant to hold a dedicated input terminal is inconvenient in the following respects. First, no more people can participate than there are dedicated input terminals. Second, being able to speak only through a dedicated input terminal places excessive psychological stress on participants. Third, speaking through a dedicated input terminal differs greatly from conventional meeting practice, so participants need considerable time to get used to the system. Fourth, installing such a dedicated system is very costly. From these viewpoints, a conference in which each person holds a dedicated input terminal is considered effective only in limited environments.

The method of Patent Document 3 does not require participants to hold dedicated input terminals, and statements are managed; however, its purpose is strictly to create accurate minutes, and providing an environment in which participants can speak freely is not considered. For participants to speak freely even while being recorded, each participant needs a function for easily searching and editing their own statements. In addition, this method uniformly requires everyone's consent whenever speaker identification fails, so the total amount of work needed before the minutes are complete is large. Considering everyday use of a meeting recording system, the work after the meeting should be minimal, and improvement is needed in this respect as well.

With the method of interrupting recording by pressing a button, a participant cannot deal with the case of realizing only afterwards that a statement was inappropriate. As a result, participants' free speech is inhibited.

As described above, although the conventional techniques provide systems for recording and sharing meeting audio, they do not provide a sufficient framework for participants to speak freely. The present invention provides a framework that lets participants easily search and correct their own statements when necessary, while minimizing the work required of meeting participants. It also provides a framework that allows statements to be corrected with appropriate authority even in conferences where participants' interests are complex.

In the present invention, the speaker of each utterance is identified from the speaker acoustic features, direction information, and the like of the meeting speech, and an appropriate audio correction right is granted to each utterance according to the reliability of that identification. In addition, a voice search unit and a voice correction unit that allow the audio to be easily searched and corrected after recording are provided, so that conference participants can correct the conference audio after the conference easily and with appropriate authority. This enables participants to discuss freely while appropriate conference audio is recorded and shared with minimal work.

According to the present invention, participants can discuss freely in a system that records and shares conference proceedings.

Embodiments of the present invention are described below with reference to the drawings.
FIG. 1 is a functional block diagram showing a configuration example of a conference audio recording/sharing system according to the present invention. The system includes a user management unit 001 for registering conference participants in advance, an audio recording unit 002 that operates while conference audio is being recorded, and a conference recording correction/authentication unit 003 for correcting the recorded content after the conference ends.

The processing in the conference audio recording/sharing system shown in FIG. 1 is described below in order.

First, the processing in the user management unit 001 is shown in the flowchart of FIG. 2. When a user uses the system for the first time, the user's information, such as a user name, is registered by the user registration unit 116. The user's voice is also recorded and stored in the user information holding unit 118. If necessary, a password is registered or a user-specific ID card is registered and issued. A photograph of the user's face or the like can also be stored at this time.

Next, the processing in the audio recording unit 002 is described. The audio recording unit 002 includes: a participant identification unit 110 that identifies the conference participants; an audio input unit 101; a voice recording unit 102 that stores the input audio; a speaker direction detection unit 103 that determines the speaker direction from the recorded multi-channel audio; a sound source separation unit 104 that separates the input audio by speaker; a speaker acoustic feature extraction unit 105 that extracts feature quantities representing speaker characteristics from each separated audio stream; a speaker identification unit 106 that determines the speaker of the audio from the speaker direction information, the speaker acoustic features, and the participant information identified by the participant identification unit; a correction right setting unit 107 that sets a correction right or a correction right transfer certificate on the audio based on the reliability of the identified speaker; an access right information registration unit 108 that stores the correction right information; an audio database 109 that stores audio and access rights; and a voice indexing unit 119.

In this embodiment, the audio input unit 101 accepts synchronized input from a plurality of microphones. The audio recording unit 002 may additionally include an image input unit and an image storage unit. Such an application is considered usable in TV conference systems in particular.

A flowchart of the processing in the audio recording unit 002 is shown in FIG. 3.
Before the conference starts, the participant identification unit 110 first identifies the participants. For example, each participant or the chairperson speaks the participant names at the outset, and the participants are identified by speech recognition of those names. A speech recognition dictionary can be created for this purpose from the user names registered in the user information holding unit 118 by the user registration unit 116. Since speech recognition techniques themselves are well known in this technical field, their description is omitted.

Other possible methods include issuing an ID card to each speaker and having a card reader read it at the time of participation, entering the conference participants from a keyboard, and presenting participant candidates on a display device for selection. Participant names can also be determined from the acoustic speaker characteristics of utterances during the conference, or from face images.

Furthermore, when a user who has not been registered in the system through the user management unit joins the conference, that user participates with a guest account. The user may also be asked to say something such as "I am a guest" at this time, so that the speaker's voice quality can be learned from that speech and used by the speaker identification unit 106 at a later stage.

The conference participants identified above are granted viewing rights to all audio recorded in that conference.

When the conference actually starts, the system successively captures the conference audio from the audio input unit 101, which has a plurality of microphones, and stores it in the audio database 109. The microphones are installed in a predetermined arrangement such as a line or a circle, and the input from each microphone is passed in synchronization to the speaker direction detection unit 103 through a dedicated A/D board. The speaker direction detection unit 103 detects the directions of sound sources from this multi-channel audio. Since several speakers may speak at the same time, it is desirable that all sound source directions be detected accurately even in that case.

Here, the number of microphone elements is M = 2, and the signal obtained from each microphone is denoted x_i(τ) (i = 1, 2). First, a short-time Fourier transform is applied to each x_i(τ), and the result is denoted X_i(f, τ). Here f is the frequency and τ is the frame index of the short-time Fourier transform.

For each time-frequency point obtained, the phase difference θ relative to the 0th microphone input is estimated.

The sound source direction γ is derived by the following equation.

Here, r is the distance between microphone 1 and microphone 2, and v is the speed of sound. The sound source direction is expressed as an angle, where the direction of the perpendicular bisector of microphones 1 and 2 is taken as 0 degrees.
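For a far-field source and the geometry just described, the phase difference and the direction are usually related as below. This is the standard plane-wave relation written with the symbols of this section (θ, f, r, v, γ). It is offered as an assumption, since the equation itself does not appear in this text, and the sign depends on which microphone is taken as the reference.

```latex
\theta(f,\tau) = \arg\frac{X_2(f,\tau)}{X_1(f,\tau)},
\qquad
\gamma(f,\tau) = \arcsin\!\left(\frac{v\,\theta(f,\tau)}{2\pi f\, r}\right)
```

The relation is unambiguous only while |θ| stays below π, that is, for frequencies below roughly v / (2r).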

After the above is computed for each time-frequency bin, multiple sound sources can be localized by building a histogram with the sound source direction on the horizontal axis and performing a peak search.
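The histogram-and-peak-search procedure can be sketched as follows for two microphones. This is a minimal illustration, not the patent's implementation: the function name `estimate_directions`, the frame and hop sizes, the 5-degree bin width, and the far-field relation sin γ = vθ / (2πfr) are all assumptions, and frequencies above v / (2r) are skipped to avoid phase wrapping.

```python
import numpy as np

def estimate_directions(x1, x2, fs, r, v=340.0, frame=512, hop=256,
                        bin_deg=5.0, n_sources=1):
    """Return up to n_sources direction estimates in degrees (0 = broadside)."""
    window = np.hanning(frame)
    freqs = np.fft.rfftfreq(frame, d=1.0 / fs)
    # Keep bins above DC and below the spatial-aliasing limit v / (2 r).
    usable = (freqs > 0) & (freqs < v / (2.0 * r))
    angles = []
    for start in range(0, len(x1) - frame + 1, hop):
        X1 = np.fft.rfft(window * x1[start:start + frame])
        X2 = np.fft.rfft(window * x2[start:start + frame])
        # Phase difference of microphone 2 relative to microphone 1.
        theta = np.angle(X2 * np.conj(X1))[usable]
        # Assumed far-field relation: sin(gamma) = v * theta / (2 pi f r).
        s = theta * v / (2.0 * np.pi * freqs[usable] * r)
        angles.append(np.degrees(np.arcsin(s[np.abs(s) <= 1.0])))
    if not angles:
        return []
    n_bins = int(round(180.0 / bin_deg))
    hist, edges = np.histogram(np.concatenate(angles), bins=n_bins,
                               range=(-90.0, 90.0))
    centers = 0.5 * (edges[:-1] + edges[1:])
    # Peak search: non-empty bins that are local maxima of the histogram.
    padded = np.concatenate(([0], hist, [0]))
    is_peak = (hist > 0) & (hist >= padded[:-2]) & (hist > padded[2:])
    peak_idx = np.flatnonzero(is_peak)
    order = peak_idx[np.argsort(hist[peak_idx])[::-1]]
    return [float(centers[i]) for i in order[:n_sources]]
```

Calling the function with `n_sources=2` returns the two strongest histogram peaks, which corresponds to the case of two people speaking at once.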

The above is an example for M = 2 microphone elements, but cases with more than two microphones can be handled by extending the algorithm. Methods with improved localization accuracy can also be used, such as "Masahito Togami et al.: Localization performance evaluation of the sound source localization method SPIRE based on sequential phase difference correction processing, Acoustical Society of Japan Spring Meeting, 2007." Since these details are well known to those skilled in the art, they are not described here.

Next, the sound source separation unit 104 separates the audio by sound source based on the direction information obtained above. This can be realized, for example, with a minimum variance beamformer. Other sound source separation techniques, such as independent component analysis, can of course be used instead.

In the minimum variance beamformer, X_i(f, τ) is multiplied by a linear filter w(f) obtained by the following expression, which emphasizes sound from the target direction and suppresses other sound.

Here, a(f) denotes the spatial transfer characteristic in the target sound direction, and R(f) denotes the spatial correlation matrix.
Since further details of this processing are well known to those skilled in the art, they are not described here.
With the sound source separation described above, audio separated per utterance is obtained even when several people speak at the same time. In the following processing, each separated audio stream is treated as an audio segment, and each audio segment is processed separately.
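The filter expression for the beamformer is not reproduced in this text. As a point of reference only, the textbook minimum variance filter, written with the same symbols a(f) and R(f), has the form below. The output symbol Y(f, τ) and the stacked vector X(f, τ) are notation introduced here for illustration, and whether the original expression includes additional normalization or regularization is not known from this text.

```latex
w(f) = \frac{R(f)^{-1}\, a(f)}{a(f)^{H}\, R(f)^{-1}\, a(f)},
\qquad
Y(f,\tau) = w(f)^{H} X(f,\tau),
\quad
X(f,\tau) = \bigl[X_1(f,\tau), \dots, X_M(f,\tau)\bigr]^{T}
```

With this choice, w(f)^H a(f) = 1, so sound from the target direction passes undistorted while the output power contributed by other directions is minimized.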

First, the speaker acoustic feature extraction unit 105 extracts, for each separated audio segment X, speaker acoustic features that represent speaker characteristics. MFCCs (Mel-Frequency Cepstral Coefficients) or the like can be used as the speaker acoustic features. Since the details of these features are well known to those skilled in the art, their description is omitted.

Next, the speaker identification unit 106 determines the speaker based on the speaker acoustic features above, the speaker direction obtained by the speaker direction detection unit, and the participant information obtained from the participant identification unit. This embodiment uses speaker identification based on a GMM (Gaussian Mixture Model). Given a sequence of speaker acoustic features X = {X_1, ..., X_n}, the likelihood that it was produced by speaker A is expressed as follows.

Here, m_j, v_j, and λ_j are the mean, the variance, and the mixture weight of the j-th normal distribution, respectively; their values are learned in advance from the voice of speaker A held in the user information holding unit.

To obtain the acoustic reliability CM_AC(A|X) that the speech is from speaker A, the likelihood P(X|GMM_bg) of a GMM representing general acoustic information, called a background model, is also computed, and its ratio to the speaker likelihood is calculated.
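The GMM likelihood and its ratio to the background model can be sketched as below. The sketch assumes diagonal covariances, treats frames as independent, and works in the log domain, so the ratio becomes a difference of log-likelihoods; it is averaged per frame so that segments of different length are comparable. None of these choices is stated in the source, and the function names `gmm_log_likelihood` and `acoustic_log_ratio` are hypothetical.

```python
import math

def gmm_log_likelihood(frames, means, variances, weights):
    """Sum over frames of log sum_j lambda_j * N(x; m_j, v_j).

    frames:    list of feature vectors (e.g. MFCC frames)
    means:     list of mean vectors m_j
    variances: list of per-dimension variance vectors v_j (diagonal covariance)
    weights:   list of mixture weights lambda_j
    """
    total = 0.0
    for x in frames:
        log_terms = []
        for m, v, lam in zip(means, variances, weights):
            log_gauss = -0.5 * sum(
                math.log(2.0 * math.pi * vd) + (xd - md) ** 2 / vd
                for xd, md, vd in zip(x, m, v)
            )
            log_terms.append(math.log(lam) + log_gauss)
        # Log-sum-exp over mixture components for numerical stability.
        peak = max(log_terms)
        total += peak + math.log(sum(math.exp(t - peak) for t in log_terms))
    return total

def acoustic_log_ratio(frames, speaker_gmm, background_gmm):
    """Per-frame log-likelihood ratio of the speaker GMM to the background GMM.

    Each GMM is a (means, variances, weights) tuple. A larger value means the
    segment fits the speaker model better than the background model.
    """
    diff = (gmm_log_likelihood(frames, *speaker_gmm)
            - gmm_log_likelihood(frames, *background_gmm))
    return diff / len(frames)
```

How this score is mapped onto the reliability CM_AC(A|X) and compared with a threshold is left open here, since the source does not give the mapping.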

In addition to the above, the reliability CM_DOA(A|X) that the speech is an utterance of speaker A can be calculated based on the speaker direction estimated by the speaker direction detection unit. For this purpose, for example, the information that speaker A often sits in a particular seat can be expressed as a probability P(D|A), and the reliability obtained as follows.

Here, D denotes the direction of arrival of the speech.
One way to obtain P(A|D) is to compute it as follows from the set of speech utterances obtained in previous conferences or in the conference currently being recorded.

The sum over X above is taken over the set of audio segments obtained in previous conferences or in the conference currently being recorded, and a denotes an individual user who is an element of A.
From CM_AC(A|X) and CM_DOA(A|X) obtained above, the reliability CM(A|X) that the speech is an utterance of speaker A is obtained. For example, it can be expressed as a linear sum of the two.

The larger this reliability value, the more probable it is that the audio segment is a statement by that speaker.

Although only two reliabilities are used in the example above, namely CM_AC(A|X) obtained from speaker acoustic information and CM_DOA(A|X) obtained from speaker direction information, the system may also include imaging means such as a camera that captures the conference room or the participants. In that case, a per-speaker reliability for the utterance can be obtained from feature information of the speaker's face image in the captured images and combined with the others. Similarly, a per-speaker reliability for the utterance can be obtained from the speaker's face image together with the direction information.

The correction right setting unit 107 assigns correction rights for each speaker to each audio segment based on the speaker reliability obtained above. The assignment differs among the following cases (1) to (3).
(1) When exactly one speaker has a speaker reliability at or above a predetermined threshold θ₁:
The correction right is granted to that speaker.
(2) When multiple speakers have a speaker reliability at or above the threshold θ₁:
A correction right transfer certificate is granted to each of those speakers.
(3) When no speaker has a speaker reliability at or above the threshold θ₁:
A correction right transfer certificate is granted to every participant.

Here, a correction right is the authority to correct or delete the content of the audio segment X. A correction right transfer certificate allows a user to obtain the correction right once that user has received the transfer certificates of all users who hold one.

If the speaker determined in case (1) is a guest account, correction right transfer certificates are granted to all users participating in the conference. Alternatively, a chairperson can be designated in advance and the correction right granted to that user.

The flow of the above processing is shown in FIG. 4. When an audio segment is input, the system estimates which conference participant made the statement and calculates the reliability. It then issues a correction right or a correction right transfer certificate according to rules (1) to (3) above.
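The rule set (1) to (3), the guest-account exception, and the transfer certificate mechanism can be sketched as follows. The names `Grant`, `combine_confidence`, `assign_rights`, and `holds_correction_right` are hypothetical, the weight `alpha` of the linear sum is an assumed parameter, and the representation of transfers as a giver-to-receiver mapping is an assumption made for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Grant:
    kind: str      # "correction_right" or "transfer_certificate"
    users: tuple   # users who receive the right or a certificate

def combine_confidence(cm_ac, cm_doa, alpha=0.5):
    """Linear sum of the acoustic and direction reliabilities (alpha assumed)."""
    return alpha * cm_ac + (1.0 - alpha) * cm_doa

def assign_rights(confidences, participants, threshold,
                  guests=frozenset(), chair=None):
    """Apply rules (1) to (3) to one audio segment.

    confidences: mapping user -> speaker reliability CM(A|X)
    participants: all identified participants of the conference
    """
    above = sorted(u for u, c in confidences.items() if c >= threshold)
    if len(above) == 1:
        speaker = above[0]
        if speaker in guests:
            # Guest exception: the chairperson if one is designated,
            # otherwise transfer certificates for every participant.
            if chair is not None:
                return Grant("correction_right", (chair,))
            return Grant("transfer_certificate", tuple(participants))
        return Grant("correction_right", (speaker,))          # rule (1)
    if len(above) > 1:
        return Grant("transfer_certificate", tuple(above))    # rule (2)
    return Grant("transfer_certificate", tuple(participants)) # rule (3)

def holds_correction_right(user, grant, transfers):
    """transfers maps a certificate holder to the user they transferred to."""
    if grant.kind == "correction_right":
        return user in grant.users
    if user not in grant.users:
        return False
    # The right is obtained once every other holder has transferred to user.
    return all(transfers.get(h) == user for h in grant.users if h != user)
```

For instance, if two participants both exceed the threshold for a segment, each holds a certificate, and one of them can correct the segment only after the other has approved the transfer.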

Finally, the access right information registration unit 108 stores the correction rights and correction right transfer certificates described above, together with the viewing rights given to the conference participants, in the audio database 109. The direction information obtained by the speaker direction detection unit 103 and the speaker information obtained by the speaker identification unit 106 can also be stored at this point.

FIG. 5 shows an example of this information stored in the audio database 109. Here, the audio file ID is an identifier uniquely assigned to each conference recording, and the audio segment ID is an identifier uniquely assigned to each audio segment. In FIG. 5, the "speaker" column stores the speaker obtained by the speaker identification unit 106 and the associated reliability, and the "direction" column stores the speaker direction obtained from the speaker direction detection unit 103. This information can be used by the voice data search unit 111 described later.

In addition to the above, data such as that in FIG. 6 is also stored, listing the audio file ID together with the conference name, participants, recording date and time, and the storage location of the file on the storage device.
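One possible in-memory shape for the two kinds of records is sketched below. FIG. 5 and FIG. 6 themselves are not reproduced in this text, so the class names, field names, and types are assumptions derived only from the columns mentioned in the description; `viewable_recordings` reflects the rule that identified participants hold viewing rights.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class SegmentRecord:
    """Per-segment data of the kind described for FIG. 5 (fields assumed)."""
    audio_file_id: str
    segment_id: str
    speaker: str
    speaker_confidence: float
    direction_deg: float
    correction_right: tuple = ()        # users holding the correction right
    transfer_certificates: tuple = ()   # users holding a transfer certificate

@dataclass
class RecordingInfo:
    """Per-recording data of the kind described for FIG. 6 (fields assumed)."""
    audio_file_id: str
    conference_name: str
    participants: tuple
    recorded_at: datetime
    storage_path: str

def viewable_recordings(user, recordings):
    """Recordings for which the user holds viewing rights, i.e. participated."""
    return [rec for rec in recordings if user in rec.participants]
```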

Next, for each audio segment X, the voice indexing unit 119 creates a database that the voice data search unit 111 uses to search the audio data by keyword. Various known techniques already exist for searching audio databases; here, a method using large-vocabulary continuous speech recognition is described.

First, the voice indexing unit 119 converts each speaker-identified audio segment X into a word sequence using a large-vocabulary continuous speech recognizer. Each word in the word sequence carries the reliability output by the recognizer. Since large-vocabulary continuous speech recognition is known to those skilled in the art, its description is omitted.

Next, index data expressing in which audio file ID and audio segment ID each word appears is created from the resulting word sequences. An example is shown in FIG. 7. Here, triples of {audio file ID, audio segment ID, reliability output by speech recognition} corresponding to each word are created as the index. For example, the word "product" was uttered in audio file ID 0012, audio segment ID 0003 with reliability 0.8, and in audio file ID 0010, audio segment ID 0001 with reliability 0.5.

This allows the user to enter a keyword in the voice data search unit 111, described later, and obtain the audio files in which the keyword was uttered and the segment positions.
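The index of triples and the keyword lookup can be sketched as an inverted index. The function names `build_index` and `search`, the input layout, and the ordering of hits by descending reliability are assumptions for illustration; the recognizer that would produce the word sequences is outside the sketch.

```python
from collections import defaultdict

def build_index(recognized_segments):
    """Build word -> list of (audio_file_id, segment_id, reliability).

    recognized_segments: iterable of (audio_file_id, segment_id, words),
    where words is a list of (word, reliability) pairs from the recognizer.
    """
    index = defaultdict(list)
    for audio_file_id, segment_id, words in recognized_segments:
        for word, reliability in words:
            index[word].append((audio_file_id, segment_id, reliability))
    return index

def search(index, keyword, min_reliability=0.0):
    """Return hits for keyword, most reliable first."""
    hits = [h for h in index.get(keyword, []) if h[2] >= min_reliability]
    return sorted(hits, key=lambda h: h[2], reverse=True)

# The "product" example from the description, plus one unrelated word.
example_index = build_index([
    ("0010", "0001", [("product", 0.5)]),
    ("0012", "0003", [("product", 0.8), ("schedule", 0.7)]),
])
```

Raising `min_reliability` trades recall for precision: with a threshold of 0.6, only the hit in audio file 0012 remains for "product".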

The voice indexing unit stores the index data created above in the audio database.

Besides the above, known methods for keyword search of an audio database include creating a word lattice with a large-vocabulary continuous speech recognizer and searching with a subword speech recognizer whose units are subwords finer than words; these can be used instead. Keyword search can also handle the cases where several keywords or a compound word are entered. Since these techniques are known to those skilled in the art, their description is omitted. This concludes the processing in the audio recording unit 002.

Next, the processing in the recording correction/authentication unit 003, in which conference participants search, edit, and publish the recorded audio, is described.

The recording correction/authentication unit 003 has a user authentication unit 117 that authenticates users, a voice data search unit 111 that can search audio data by keyword, speaker, and the like, a voice correction unit 112 with which a user can correct or delete only audio data for which the user holds the correction right, and a voice publication authentication unit 113 with which a user can approve publication only of audio for which the user holds the publication right. It further has a correction right transfer request unit 114, with which a user holding a correction right transfer certificate requests transfer of the correction right, and a correction right transfer approval unit 115, which approves such requests.

First, to view, correct, or approve publication of a conference recording, the user enters user-specific information at the user authentication unit 117. The user authentication unit 117 identifies the user operating the system from the entered information and the information stored in the user information holding unit 118. The user may be asked to enter a password, or a more advanced authentication technique such as finger vein authentication may be used. It is also possible to issue an ID card for each user in the user registration unit 116 and use it for authentication.

When the user accesses the system through the user authentication unit, an interface such as that shown in FIG. 8 lets the user choose to view or search audio, or, for conferences in which the user participated, to approve publication, correct audio, request transfer of a correction right, or approve a correction right transfer.

A user identified by the participant identification unit 110 as a participant in a conference is granted the viewing right for all audio in that conference and can therefore view and search it. Clicking "View conferences from list" 202 in FIG. 8 displays a list of the conferences the user can view, as shown in FIG. 9, and the user can listen to their content. Conferences for which the user has not been granted the viewing right are not displayed and cannot be viewed. A user who participated in a conference can freely give the conference audio a name. The name may be set individually for each user, or it may be shared automatically among the conference participants.
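The listing behavior described above can be illustrated with a minimal Python sketch. The `Conference` class, its fields, and both function names are hypothetical and are not part of the disclosed system; the sketch assumes the per-user naming option rather than the shared one.

```python
from dataclasses import dataclass, field


@dataclass
class Conference:
    """Hypothetical record for one recorded conference."""
    conference_id: str
    # Users holding the viewing right (assumed: the identified participants).
    viewers: set = field(default_factory=set)
    # Optional per-user display names for the conference audio.
    names: dict = field(default_factory=dict)


def viewable_conferences(user, conferences):
    """Return only the conferences for which the user holds the viewing right."""
    return [c for c in conferences if user in c.viewers]


def display_name(conference, user):
    """Return the user's own name for the conference, or its ID if none is set."""
    return conference.names.get(user, conference.conference_id)
```

A conference without the viewing right is simply filtered out of the list, which corresponds to it not being displayed at all.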

The conference voice search 203 in FIG. 8 searches conference content by conference name and keyword. The user enters the keyword to search for in the text box 208 in FIG. 8 and presses the search button 209. To search by only the conference name or only the keyword, the user clears the check box 210 of the item not to be used. When the search button is pressed, the voice data search unit 111 runs and displays a list of the matching files to the user.
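As a rough sketch of how the two search criteria and their check boxes might combine, the following Python example filters hypothetical index records. `SegmentRecord`, its fields, and the plain substring match on a transcript are assumptions made for illustration; the actual index format and matching method of the voice data search unit 111 are not specified here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SegmentRecord:
    """Hypothetical index entry for one voice segment."""
    conference_name: str
    speaker: str
    transcript: str


def search_segments(records, name_query="", keyword="",
                    use_name=True, use_keyword=True):
    """Return records matching every enabled, non-empty criterion.

    use_name and use_keyword play the role of the check boxes: clearing
    one of them disables that criterion.
    """
    hits = []
    for record in records:
        if use_name and name_query:
            if name_query.lower() not in record.conference_name.lower():
                continue
        if use_keyword and keyword:
            if keyword.lower() not in record.transcript.lower():
                continue
        hits.append(record)
    return hits
```

Calling `search_segments(records, "budget", "deadline", use_name=False)` ignores the conference name and searches by keyword only.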

Although search by conference name and keyword has been described here, search by speaker, search based on speaker direction, and the like are also possible. Furthermore, if images taken during the conference have been stored, search based on the face images stored in the user information holding unit 118 is also possible.

When a user who participated in the conference decides that utterances for which the user has been given the correction right and the correction transfer right may be published, the user approves publication of those utterances at the voice publication approval unit 113. An interface such as that shown in FIG. 10 may be provided for this purpose, so that publication can be approved for individual utterances, and it is desirable that all utterances can also be approved at once. To approve individual utterances, the user checks only the voice segments to be published and then presses the "Approve checked audio for publication" button. To approve all utterances at once, the user first checks "Check all audio," which checks every voice segment.

An utterance for which the correction right has been given to one user becomes viewable and searchable by users who did not participate in the conference once that user approves its publication. An utterance for which correction right transfer certificates have been given to a plurality of users becomes viewable and searchable by non-participants once all users holding the certificates have approved its publication.
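The publication condition stated above can be expressed as a short Python sketch. The `Utterance` class and its field names are hypothetical; the sketch assumes that each utterance has either a single correction right holder or a set of certificate holders.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Utterance:
    """Hypothetical access-control record for one utterance."""
    # The single user holding the correction right, if any.
    correction_right_holder: Optional[str] = None
    # Users holding a correction right transfer certificate.
    certificate_holders: frozenset = frozenset()
    # Users who have approved publication so far.
    approvals: set = field(default_factory=set)


def is_public(utterance):
    """True once non-participants may view and search the utterance."""
    if utterance.correction_right_holder is not None:
        # Single holder: that user's approval alone is sufficient.
        return utterance.correction_right_holder in utterance.approvals
    # Several certificate holders: every one of them must approve.
    holders = utterance.certificate_holders
    return bool(holders) and holders <= utterance.approvals
```

An utterance with neither a right holder nor certificate holders stays private in this sketch, which is a conservative assumption rather than something the description states.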

A participant who decides that an utterance needs editing before publication first clicks conference voice correction 205 in FIG. 8 to start the voice correction unit 112. The voice correction unit 112 provides a search interface similar to that of FIG. 8, so audio can be searched by keyword or by conference name.

What the user does after finding the voice segment to edit depends on which access rights the user holds for that segment. If the user holds the correction right for the segment, the user performs operations such as deleting the audio or masking unwanted portions.

The voice correction unit 112 provides an interface such as that shown in FIG. 11, in which the user specifies the start and end points of the section to be corrected by dragging the mouse. The user can also isolate just the portion of the audio containing a keyword by entering that keyword: when a keyword is entered in the text box 301, the start and end points of the keyword section are set. This technique is called word spotting and is well known to those skilled in the art, so it is not described in detail here.

Clicking "Mask specified section" replaces the section specified above with white noise or a beep. Deleting the specified section removes it from the audio.
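The two operations can be sketched in Python as follows, assuming the audio is held as a list of float samples and the section is given as sample indices. The function names, the 16 kHz sample rate, the 1 kHz beep, and the amplitude are illustrative assumptions, not values taken from this description.

```python
import math
import random


def _check_range(samples, start, end):
    if not 0 <= start <= end <= len(samples):
        raise ValueError("section must lie within the audio")


def mask_section(samples, start, end, mode="beep",
                 sample_rate=16000, beep_hz=1000.0, amplitude=0.3):
    """Return a copy with samples[start:end] replaced by a beep or white noise."""
    _check_range(samples, start, end)
    length = end - start
    if mode == "beep":
        fill = [amplitude * math.sin(2 * math.pi * beep_hz * n / sample_rate)
                for n in range(length)]
    elif mode == "noise":
        rng = random.Random(0)  # fixed seed only to keep the sketch repeatable
        fill = [rng.uniform(-amplitude, amplitude) for _ in range(length)]
    else:
        raise ValueError("mode must be 'beep' or 'noise'")
    return samples[:start] + fill + samples[end:]


def delete_section(samples, start, end):
    """Return a copy with samples[start:end] removed."""
    _check_range(samples, start, end)
    return samples[:start] + samples[end:]
```

Masking keeps the total duration unchanged, whereas deletion shortens the audio by the length of the section.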

The corrections made here can be reflected when users view the conference audio while the actual voice database itself is left unmodified. In that case, operations such as restoring the original waveform can be performed with system administrator authority.
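One way to keep the stored audio untouched is to record corrections as an edit list and apply them only at playback. The following Python sketch assumes non-overlapping edits expressed in original sample positions and uses silence as a stand-in for the masking sound; `Edit`, `StoredSegment`, and their methods are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Edit:
    kind: str   # "mask" or "delete"
    start: int  # first sample index in the original audio
    end: int    # one past the last sample index


@dataclass
class StoredSegment:
    """Original samples plus a list of corrections applied only at playback."""
    original: tuple
    edits: list = field(default_factory=list)

    def render(self):
        """Return the audio as a viewer hears it; the original is not changed."""
        out = list(self.original)
        # Apply from the end backwards so earlier indices stay valid
        # after a deletion (assumes the edits do not overlap).
        for edit in sorted(self.edits, key=lambda e: e.start, reverse=True):
            if edit.kind == "mask":
                out[edit.start:edit.end] = [0.0] * (edit.end - edit.start)
            elif edit.kind == "delete":
                del out[edit.start:edit.end]
        return out

    def restore(self):
        """Administrator operation: discard all corrections."""
        self.edits.clear()
```

Because `original` is never written to, restoring the waveform amounts to clearing the edit list.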

The above is the processing when the user holds the correction right for the voice segment. If the user does not hold the correction right for the audio to be edited, the user cannot correct it as is. In this case, the user clicks correction right transfer request 206 in FIG. 8 to start the correction right transfer request unit 114.

The correction right transfer request unit 114 has an interface such as that shown in FIG. 12 and notifies all users holding a correction right transfer certificate for the voice segment of the request to transfer its correction right. For example, the system may be linked with a mail system so that the participants to whom the request is addressed are notified by e-mail. A message from the requesting user (user A) may also be attached to the request.

A user who receives notification of a correction right transfer request accesses the system through the user authentication unit 117 and then clicks correction right transfer approval 207 in FIG. 8 to start the correction right transfer approval unit 115. The correction right transfer approval unit 115 has an interface such as that shown in FIG. 13, in which the user can check the requesting user's name, listen to the requested audio, and read the message from the requester. Where needed, the interface of FIG. 13 should also allow playback of a specified section so that the context before and after the audio can be checked.

After listening to the audio, if the user decides that the requesting user may be given the correction right, the user clicks the "Approve transfer of voice correction right" button, which issues the user's correction right transfer certificate for that audio to the requesting user.

Once all users holding a correction right transfer certificate have issued their certificates to user A, the correction right transfer approval unit 115 grants user A the correction right for the audio. User A can then correct or delete the audio.
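The all-holders-must-agree rule for the transfer can be sketched in Python as below. `TransferRequest` and its members are hypothetical names chosen for illustration, and the sketch does not model notification or authentication.

```python
from dataclasses import dataclass, field


@dataclass
class TransferRequest:
    """Hypothetical state of one correction right transfer request."""
    requester: str                  # the requesting user (user A)
    certificate_holders: frozenset  # users who must each issue a certificate
    issued_by: set = field(default_factory=set)

    def approve(self, holder):
        """Record that a holder issued a certificate; return True if now granted."""
        if holder not in self.certificate_holders:
            raise PermissionError(f"{holder} holds no certificate for this audio")
        self.issued_by.add(holder)
        return self.granted()

    def granted(self):
        """The correction right passes to the requester only when all holders agree."""
        holders = self.certificate_holders
        return bool(holders) and holders <= self.issued_by
```

With two certificate holders, the first call to `approve` returns `False` and the second returns `True`, at which point the requester would be given the correction right.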

The above is the framework for viewing, searching, correcting, and publishing conference audio. In this framework, only the few conference participants for whom it is uncertain whether they spoke the utterance need to respond to user A's request to issue a correction right transfer certificate. The great majority of users need not be involved in that process, which greatly reduces user effort overall. In addition, because correction rights and access rights are managed for each utterance, even when users with conflicting interests have talked together, there is no concern that they will later improperly modify each other's audio, which allows freer discussion.

In the above example, other users can listen to an utterance once the user holding its correction right approves its publication. Alternatively, the audio can be made public only once all users who participated in the conference have approved publication of all of the audio.

FIG. 14 shows the hardware configuration of the above system. The system comprises a computer consisting of a CPU and memory, and the computer is equipped with a voice input device, a data storage device, a keyboard, and a display device. The functional units 101 to 119 shown in FIG. 1 are all stored in the memory of the computer. When image input is also accepted, an image input device is also connected to the computer.

FIG. 15 shows the hardware configuration when this system is combined with a video conference system. It differs from FIG. 14 mainly in that the voice input devices and image input devices are distributed across multiple sites and are connected to the computer via a network.

FIG. 1 is a functional block diagram showing a configuration example of the system according to the present invention.
FIG. 2 is a flowchart showing processing in the user management unit.
FIG. 3 is a flowchart showing processing in the voice recording unit.
FIG. 4 is a flowchart showing the access right setting procedure.
FIG. 5 is a diagram showing a storage example of voice files and voice segment information.
FIG. 6 is a diagram showing a storage example of voice file information.
FIG. 7 is a diagram showing an example of index data created by the voice indexing unit.
FIG. 8 is a diagram showing an example of the user screen after user authentication.
FIG. 9 is a diagram showing an example of the conference audio list display.
FIG. 10 is a diagram showing the publication approval interface.
FIG. 11 is a diagram showing the interface of the voice correction unit.
FIG. 12 is a diagram showing the interface of the correction right transfer request unit.
FIG. 13 is a diagram showing the interface of the correction right transfer approval unit.
FIG. 14 is a diagram showing a hardware configuration example of the system.
FIG. 15 is a diagram showing a hardware configuration example when combined with a video conference system.

Explanation of symbols

001: User management unit
002: Voice recording unit
003: Recording correction/authentication unit

Claims

1. A conference voice recording system comprising: a voice recording unit that records voice; a speaker identification unit that identifies a speaker from the input voice; a correction right setting unit that gives participants a different type of correction right for each utterance of the input voice according to the reliability of the speaker identification by the speaker identification unit; a user authentication unit that authenticates a user; and a voice correction unit that allows a user who has been given a correction right by the correction right setting unit to correct the utterance for which the correction right was given.

2. The conference voice recording system according to claim 1, wherein, for a given utterance, the correction right setting unit grants the correction right for the utterance to the speaker when only one speaker has a speaker identification reliability higher than a predetermined threshold, and issues a correction right transfer certificate for the utterance to each of a plurality of speakers when the plurality of speakers have a speaker identification reliability higher than the threshold.

3. The conference voice recording system according to claim 1, wherein, when no speaker has a speaker identification reliability for a given utterance higher than a predetermined threshold, the correction right setting unit issues a correction right transfer certificate to a speaker.

4. The conference voice recording system according to claim 2 or 3, wherein the correction right for the utterance is given to a user to whom all users holding the correction right transfer certificate have issued their correction right transfer certificates.

5. The conference voice recording system according to claim 1, further comprising a voice search unit that allows a participant to search for voice by keyword or speaker name.

6. The conference voice recording system according to claim 1, further comprising a participant identification unit that identifies conference participants.

7. The conference voice recording system according to claim 6, wherein the participant identification unit identifies a participant based on a result of recognizing voice spoken during the conference.

8. The conference voice recording system according to claim 1, further comprising either or both of an imaging unit that images the conference venue and a speaker direction detection unit, wherein the speaker identification unit performs speaker identification from any one or a combination of a speaker direction detection result output from the speaker direction detection unit, an acoustic feature amount representing speaker characteristics, and an image feature amount representing speaker characteristics obtained from an image captured by the imaging unit.

9. The conference voice recording system according to claim 8, wherein a plurality of microphones are used as the voice input unit.