JP2007298876A - Voice data recording and reproducing apparatus

Info

Publication number: JP2007298876A
Application number: JP2006128514A
Authority: JP
Inventors: Takuya Takahashi
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2006-05-02
Filing date: 2006-05-02
Publication date: 2007-11-15

Abstract

PROBLEM TO BE SOLVED: To provide an apparatus with which, from voice data recording a conversation among a plurality of speakers, a specified speaker's remarks can be heard efficiently and the utterances of other speakers leading up to those remarks can also be understood.

SOLUTION: A network interface (I/F) 4 of a recording server 101 acquires the voice data of a multipoint voice conference communicated over a network 100, and a recording section 3 records the voice data as minutes. The recording section 3 also records a phoneme feature amount of each conference attendant. On receiving a reproduction request from a personal computer 102, a feature amount data extracting section 2 extracts the phoneme feature amount from the recorded voice data and compares it with the phoneme feature amount of each conference attendant. As a result, the recorded voice data of the specified speaker is extracted, utterance speed conversion (high-speed reproduction) is applied to the data other than that speaker's voice data, and digest data is generated and distributed as streaming data. The personal computer 102 receives the streaming data and reproduces it as the digest data.

COPYRIGHT: (C)2008, JPO&INPIT

Description

The present invention relates to an apparatus that records and makes use of audio such as conference audio.

Conventionally, it has been common practice in conferences to create minutes by recording each speaker's remarks. When such minutes are reviewed later, listening back to every remark takes time, so listeners often want to hear only the important parts (a so-called digest).

For this reason, there have been proposed an apparatus that identifies each speaker from recorded voice data and converts the speech into text data by voice recognition processing (see, for example, Patent Document 1), and apparatuses that create a summary from this text (see, for example, Patent Documents 2 and 3).

There has also been proposed a recorded information processing apparatus that divides voice data by speaker and by topic and performs search processing using these speakers and topics as search keys (see, for example, Patent Document 4). With this recorded information processing apparatus, it is possible to listen to only the remarks of a specific speaker.

In addition, there has been proposed an information reproducing apparatus that, when the content of a remark could not be caught, rewinds by a specified time and plays back at a slower speed than before (see, for example, Patent Document 5).
Patent Document 1: JP 2005-43628 A
Patent Document 2: JP 2004-20739 A
Patent Document 3: JP 2002-99530 A
Patent Document 4: JP 2004-23661 A
Patent Document 5: JP H08-147794 A

The apparatuses described in Patent Documents 1 to 3 perform voice recognition processing to extract text, but such processing requires advanced voice recognition technology.
Because the apparatus described in Patent Document 4 is configured for listening only to the remarks of a specific speaker, the preceding and following remarks of other speakers cannot be heard, which makes it difficult to grasp the outline of the conference as a whole.
With the apparatus described in Patent Document 5, the user has to specify manually whenever low-speed playback is desired.

In view of the above problems, an object of the present invention is to provide an apparatus with which, from voice data in which the conversation of a plurality of speakers is recorded, a specific speaker's remarks can be listened to efficiently and the content of other speakers' remarks leading up to those remarks can also be grasped.

The voice data recording and reproducing apparatus of the present invention comprises: recording means for recording voice data in which the utterances of a plurality of speakers are recorded, and a voice feature amount of a specific speaker; voice feature amount extraction means for extracting a voice feature amount; speaker extraction means for comparing the voice feature amount of the specific speaker with the voice feature amount extracted by the voice feature amount extraction means and extracting, from the voice data, the voice data recording section of the specific speaker; speech speed conversion means for compressing, along the time axis, the voice data in sections other than the voice data recording section of the specific speaker; and output means for outputting the voice data including the compressed sections to the outside.

In the present invention, the voice feature amount (formants and the like) of a specific speaker is recorded in the recording means in advance. The voice feature amount extraction means extracts voice feature amounts from the recorded voice data. The speaker extraction means compares the previously recorded voice feature amount of the specific speaker with the extracted voice feature amounts and extracts the utterance sections of the specific speaker. For sections other than the extracted utterance sections of the specific speaker, the voice data is compressed along the time axis to perform speech speed conversion (speed-up).
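As a rough illustration of the processing described above, the following Python sketch builds a playback plan in which the specific speaker's utterance sections keep their original speed and all other sections are sped up. The function names, the `(start, end)` interval representation in seconds, and the fixed speed factor `speed_other` are assumptions made for illustration only; the specification does not fix a particular compression algorithm or ratio.

```python
from typing import List, Tuple

Interval = Tuple[float, float]          # (start_sec, end_sec)
Segment = Tuple[float, float, float]    # (start_sec, end_sec, playback_speed)


def build_digest_plan(total_sec: float,
                      speaker_sections: List[Interval],
                      speed_other: float = 2.0) -> List[Segment]:
    """Cover [0, total_sec] with segments. Sections of the specific speaker
    keep speed 1.0; everything else gets speed_other (hypothetical value).
    Assumes the sections do not overlap and lie inside [0, total_sec]."""
    plan: List[Segment] = []
    cursor = 0.0
    for start, end in sorted(speaker_sections):
        if start > cursor:                      # gap before this utterance
            plan.append((cursor, start, speed_other))
        plan.append((start, end, 1.0))          # specific speaker: unchanged
        cursor = end
    if cursor < total_sec:                      # tail after the last utterance
        plan.append((cursor, total_sec, speed_other))
    return plan


def digest_duration(plan: List[Segment]) -> float:
    """Playback time of the digest after time-axis compression."""
    return sum((end - start) / speed for start, end, speed in plan)
```

For a 60-second recording in which the specific speaker talks from 20 s to 30 s, `build_digest_plan(60.0, [(20.0, 30.0)])` yields three segments, and `digest_duration` gives 35 seconds at the assumed 2x speed-up.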

Further, the present invention is characterized in that the speech speed conversion means compresses the voice data in sections other than the voice data recording section of the specific speaker at a lower compression ratio the closer the section is to the voice data recording section of the specific speaker.

In the present invention, the speech speed conversion means lowers the compression ratio the closer a section is to an utterance section of the specific speaker. This makes it possible to grasp more accurately what other speakers said in the lead-up to the specific speaker's remark.
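One way to picture this graded compression is a speed factor that depends on the time distance to the nearest utterance section of the specific speaker. The sketch below is only an assumption-laden example: the linear ramp, the names `playback_speed` and `distance_to_sections`, and the values `near_speed`, `far_speed`, and `ramp_sec` are hypothetical and are not taken from the specification.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start_sec, end_sec)


def distance_to_sections(t: float, sections: List[Interval]) -> float:
    """Time distance from t to the nearest section (0.0 if t is inside one)."""
    best = float("inf")
    for start, end in sections:
        if start <= t <= end:
            return 0.0
        best = min(best, start - t if t < start else t - end)
    return best


def playback_speed(t: float,
                   sections: List[Interval],
                   near_speed: float = 1.5,
                   far_speed: float = 3.0,
                   ramp_sec: float = 10.0) -> float:
    """Speed factor at time t: 1.0 inside the specific speaker's sections,
    rising linearly from near_speed to far_speed over ramp_sec seconds of
    distance (all three parameters are illustrative assumptions)."""
    dist = distance_to_sections(t, sections)
    if dist == 0.0:
        return 1.0
    frac = min(dist / ramp_sec, 1.0)
    return near_speed + (far_speed - near_speed) * frac
```

With these assumed values, audio 2 seconds away from the specific speaker's section is played only slightly faster than audio right next to it, while audio 10 or more seconds away is played at the full `far_speed`.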

Further, the present invention is characterized in that the speech speed conversion means performs processing to expand the voice data recording section of the specific speaker along the time axis.

In the present invention, for the extracted utterance sections of the specific speaker, the voice data is expanded along the time axis to perform speech speed conversion (slow-down). Because the specific speaker's voice is played back slowly, the specific speaker's remarks become easier to understand.

Further, the present invention comprises data acquisition means for acquiring, over time, voice data and speaker identification data identifying the speaker of the voice data, wherein the recording means records the voice data, the speaker identification data, and the voice feature amount of the specific speaker, and the speaker extraction means extracts the voice data recording section of the specific speaker based on at least one of the speaker identification data and the result of the voice feature amount comparison.

In the present invention, the data acquisition means acquires voice data and speaker identification data over time and records them in the recording means. The speaker extraction means extracts the utterance sections of the specific speaker from the speaker identification data. This makes it possible to extract the utterance sections of the specific speaker with even higher accuracy.
Further, the present invention is the voice data recording and reproducing apparatus according to claim 4 connected to a sound emitting and collecting device having a microphone array, wherein the sound emitting and collecting device forms, based on the sound signals collected by the microphones of the microphone array, a plurality of sound collection beam signals each having strong directivity in a different direction, compares the plurality of sound collection beam signals to select the one with the highest signal strength, detects the direction corresponding to the selected sound collection beam signal, and outputs the selected sound collection beam signal as voice data and the detected direction as speaker identification data.

In the present invention, the voice data recording and reproducing apparatus is connected to a sound emitting and collecting device. This device forms a plurality of sound collection beam signals from the sound signals collected by the microphones of the microphone array, selects the sound collection beam signal with the highest signal strength, and detects the direction corresponding to that beam signal. It then outputs the selected sound collection beam signal as voice data and the detected direction as speaker identification data.
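The selection step described above can be sketched as follows. This is a minimal example that assumes the beam signals have already been formed (the beamforming itself is not shown) and that "signal strength" is measured as the sum of squared samples; both the energy measure and the names `signal_energy` and `select_beam` are illustrative assumptions.

```python
from typing import Dict, Sequence, Tuple


def signal_energy(samples: Sequence[float]) -> float:
    """Sum of squared samples, used here as an assumed strength measure."""
    return sum(s * s for s in samples)


def select_beam(beams: Dict[str, Sequence[float]]) -> Tuple[str, Sequence[float]]:
    """Pick the strongest sound collection beam.

    beams maps a direction label (e.g. "Dir11") to that beam's samples.
    Returns (direction, samples): the direction serves as the speaker
    identification data and the samples as the voice data."""
    if not beams:
        raise ValueError("at least one beam signal is required")
    direction = max(beams, key=lambda d: signal_energy(beams[d]))
    return direction, beams[direction]
```

For example, with beams for `Dir11`, `Dir12`, and `Dir13` where only the `Dir12` beam carries significant energy, `select_beam` returns `"Dir12"` together with that beam's samples.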

According to the present invention, the previously recorded voice feature amount of a specific speaker is compared with voice feature amounts extracted from the voice data to extract the utterance sections of the specific speaker, and the sections other than these utterance sections are subjected to speech speed conversion (speed-up). As a result, without any need for advanced processing such as voice recognition (text extraction), the user can listen to the specific speaker's remarks while also grasping what other speakers said leading up to them. It also becomes unnecessary for the user to manually specify the utterance sections of the specific speaker.

A minutes recording and reproducing system according to an embodiment of the present invention will be described with reference to the drawings.
FIG. 1 is a diagram showing the configuration of the minutes recording and reproducing system of this embodiment. The system includes an audio conference apparatus 111, an audio conference apparatus 112, a recording server 101, and a personal computer 102, all connected to a network 100.

The audio conference apparatuses 111 and 112 are placed at separate points a and b, respectively. At point a, seven speakers A to G are seated around the audio conference apparatus 111, in directions Dir11 to Dir16 and Dir18 relative to the apparatus. At point b, five conference participants H to L are seated around the audio conference apparatus 112, in directions Dir21, Dir22, Dir24, Dir26, and Dir28 relative to the apparatus.

FIG. 2 is a block diagram showing the main configuration of the audio conference apparatus 111 of this embodiment. The audio conference apparatus 112 has the same configuration as the audio conference apparatus 111, so its description is omitted.
The audio conference apparatus 111 includes a control unit 11, an input/output I/F 12, a D/A converter 14, a sound emission amplifier 15, a speaker SP1, a microphone MIC101, a sound collection amplifier 16, an A/D converter 17, an echo cancellation circuit 20, an operation unit 31, and a display unit 32.

The control unit 11 performs overall control of the audio conference apparatus 111. The input/output I/F 12 is connected to the network 100. It converts voice data received from the partner apparatus via the network 100 from network-format data into a general voice signal and outputs it to the D/A converter 14 via the echo cancellation circuit 20.

The D/A converter 14 converts the digital sound emission signal into analog form, the sound emission amplifier 15 amplifies the signal and supplies it to the speaker SP1, and the speaker SP1 converts the signal into sound and emits it. In this way, the voices of the participants at the partner apparatus connected via the network are emitted to the participants at the local apparatus.

The microphone MIC101 collects the surrounding sound (directions Dir11 to Dir18), including the speech of the participants at the local apparatus, converts it into an electrical signal, and generates a collected sound signal. The sound collection amplifier 16 amplifies the collected sound signal, and the A/D converter 17 converts the analog collected sound signal into digital form.

In the echo cancellation circuit 20, an adaptive filter 21 generates a pseudo echo signal based on the input voice signal, and a post processor 22 subtracts the pseudo echo signal from the collected sound signal. This suppresses the sound that wraps around from the speaker SP to the microphone MIC. The input/output I/F 12 converts the collected sound signal from the echo cancellation circuit 20 into voice data of a predetermined data length in network format, attaches the sound collection time data obtained from the control unit 11, and outputs it to the network 100.
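The specification does not state which adaptation algorithm the adaptive filter 21 uses. Purely as an illustration, the sketch below uses a normalized LMS (NLMS) filter, a common choice for echo cancellation, to generate the pseudo echo signal from the far-end (input) signal and subtract it from the microphone signal. The function name `cancel_echo`, the parameters `taps`, `mu`, and `eps`, and the simulated echo path in the demo are all assumptions.

```python
import random
from typing import List, Sequence


def cancel_echo(far_end: Sequence[float],
                mic: Sequence[float],
                taps: int = 4,
                mu: float = 0.5,
                eps: float = 1e-8) -> List[float]:
    """NLMS echo canceller (assumed algorithm, not taken from the patent).

    far_end: signal sent to the loudspeaker; mic: collected sound signal.
    Returns the collected signal with the estimated echo subtracted."""
    weights = [0.0] * taps
    history = [0.0] * taps                      # most recent far-end samples
    output = []
    for x, d in zip(far_end, mic):
        history = [x] + history[:-1]
        pseudo_echo = sum(w * h for w, h in zip(weights, history))
        error = d - pseudo_echo                 # the subtraction step
        norm = sum(h * h for h in history) + eps
        weights = [w + mu * error * h / norm for w, h in zip(weights, history)]
        output.append(error)
    return output


# Demo with a simulated echo path: the microphone hears the far-end signal
# attenuated to 60 % and delayed by one sample, with no local speech.
rng = random.Random(0)
far_end = [rng.uniform(-1.0, 1.0) for _ in range(500)]
mic = [0.0] + [0.6 * x for x in far_end[:-1]]
residual = cancel_echo(far_end, mic)
```

In this noise-free demo the residual shrinks toward zero as the filter adapts, which corresponds to the suppression of the wrap-around sound described above.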

With this configuration, a multipoint conference can be held using the audio conference apparatuses 111 and 112 connected to the network 100.

FIG. 3 is a block diagram showing the configuration of the recording server 101.
The recording server 101 includes a control unit 1, a feature data extraction unit 2, a recording unit 3, and a network I/F 4. The recording server 101 may be placed at the same location as either of the audio conference apparatuses 111 and 112, or at a completely different location.

The control unit 1 performs overall control of the recording server 101, such as network communication control for the network I/F 4 and recording control for the recording unit 3.
The feature data extraction unit 2 extracts the voice feature amount of each conference participant from the voice data. The voice feature amount typically represents each speaker's formants, pitch, and the like, and is extracted from the frequency spectrum (power spectrum) obtained by Fourier-transforming the voice data and from the cepstrum obtained by taking the logarithm of this power spectrum and then applying an inverse Fourier transform. Prior to the conference, the feature data extraction unit 2 extracts the voice feature amount of each speaker and records it in the recording unit 3 as that speaker's voice feature amount (feature data). The identification information of each speaker (that is, which speaker each piece of feature data belongs to) is registered in advance by a conference participant (the chairperson). For example, when registering the voice feature amount of speaker A in the recording unit 3, the chairperson has speaker A speak and uses the operation unit 31 of the audio conference apparatus 111 to record speaker A's information (personal name and the like) in the recording unit 3. When the minutes recording and reproducing system of this embodiment is used within a company, or when the conference participants do not change, the voice feature amount of each employee may be recorded in the recording unit 3 in advance.
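The cepstrum computation described above (Fourier transform, logarithm of the power spectrum, inverse Fourier transform) can be sketched in plain Python as follows. A naive DFT is used so that the example has no external dependencies; a real implementation would normally use an FFT. The frame length, the small `floor` added before the logarithm, and the choice of the first 12 coefficients as a feature vector are assumptions for illustration.

```python
import cmath
import math
from typing import List, Sequence


def dft(values: Sequence[complex], inverse: bool = False) -> List[complex]:
    """Naive O(n^2) discrete Fourier transform (inverse scaled by 1/n)."""
    n = len(values)
    sign = 1.0 if inverse else -1.0
    result = []
    for k in range(n):
        acc = sum(values[t] * cmath.exp(sign * 2j * math.pi * k * t / n)
                  for t in range(n))
        result.append(acc / n if inverse else acc)
    return result


def real_cepstrum(frame: Sequence[float], floor: float = 1e-12) -> List[float]:
    """Inverse DFT of the log power spectrum of one frame."""
    spectrum = dft(frame)
    # floor avoids log(0) for empty bins (assumed safeguard)
    log_power = [math.log(abs(c) ** 2 + floor) for c in spectrum]
    return [c.real for c in dft(log_power, inverse=True)]


# Demo frame: two sinusoids standing in for a short stretch of speech.
frame = [math.sin(0.3 * n) + 0.5 * math.sin(0.9 * n) for n in range(32)]
cepstrum = real_cepstrum(frame)
feature_vector = cepstrum[1:13]   # hypothetical feature: low-order coefficients
```

Low-order cepstral coefficients describe the spectral envelope (related to formants), while a peak at higher quefrency relates to pitch, which is why both the power spectrum and the cepstrum are useful sources for speaker features.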

During the conference, the feature data extraction unit 2 also generates voice status data that identifies the attributes of each part of the input voice data. These attributes include the source apparatus of the voice data, the sound collection time at that apparatus, and so on.

The recording unit 3 consists of a large-capacity magnetic disk or the like and functionally includes a voice data recording unit 301, a voice status data recording unit 302, and a feature data recording unit 303. The voice data recording unit 301 sequentially records the voice data input via the network I/F 4. The voice data recording unit 301 provides a recording area for the audio conference apparatus 111 and a recording area for the audio conference apparatus 112, and the voice data is recorded in the corresponding area. The voice status data recording unit 302 records the voice status data input from the feature data extraction unit 2, that is, information such as the source apparatus of the voice data and the sound collection time. The feature data recording unit 303 records the voice feature amounts of the conference participants that the feature data extraction unit 2 extracted prior to the conference.

When the personal computer 102 instructs playback of the recorded voice data, the feature data extraction unit 2 extracts voice feature amounts from the voice data recorded in the voice data recording unit 301 and compares them with the voice feature amounts recorded in the feature data recording unit 303. As a result, the recording sections of voice data spoken by a specific speaker (for example, the chairperson) can be extracted. The recording server 101 leaves the extracted recording sections as they are, applies speech speed conversion (high-speed playback) to the audio outside these sections, and streams the result to the personal computer 102 as conference digest data. This allows the personal computer 102 to perform digest playback of the minutes.
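The comparison step can be pictured as follows: each frame of the recording has a feature vector, and frames whose features are close to the registered features of the specific speaker are merged into recording sections. The Euclidean distance, the `threshold`, the fixed `frame_sec`, and the function name are assumptions; the specification only states that the feature amounts are compared.

```python
import math
from typing import List, Sequence, Tuple


def extract_speaker_sections(frame_features: Sequence[Sequence[float]],
                             enrolled: Sequence[float],
                             frame_sec: float = 1.0,
                             threshold: float = 0.5) -> List[Tuple[float, float]]:
    """Return (start_sec, end_sec) sections whose frames match the
    enrolled speaker (distance metric and threshold are assumed)."""
    sections = []
    start = None
    for i, features in enumerate(frame_features):
        matches = math.dist(features, enrolled) <= threshold
        if matches and start is None:
            start = i                               # section begins
        elif not matches and start is not None:
            sections.append((start * frame_sec, i * frame_sec))
            start = None                            # section ends
    if start is not None:                           # section runs to the end
        sections.append((start * frame_sec, len(frame_features) * frame_sec))
    return sections
```

The resulting sections are the ones that would be left at normal speed, while everything outside them would be passed to the speech speed conversion.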

FIG. 4 is a block diagram showing the configuration of the personal computer 102.
The personal computer 102 includes a CPU 121, a storage unit 122 such as a hard disk, a display unit 123, an operation input unit 124, a network I/F 125, and a speaker 126.

The CPU 121 performs the processing control of an ordinary personal computer and also functions as voice data playback means by reading and executing a playback application program (hereinafter, the playback application) stored in the storage unit 122. In response to the user's request to play back minutes, the playback application transmits a playback request for the recorded voice data to the recording server 101. The playback application also accepts the user's speaker designation, designates a specific speaker for the voice data recorded in the recording server 101, and receives and plays back the digest data. This realizes digest playback in which the designated speaker's remarks can be listened to efficiently. The user can also designate a plurality of speakers, in which case the voice sections of the designated speakers are left as they are and the other sections are played back at high speed.

The storage unit 122 consists of a magnetic disk, semiconductor memory, or the like. It stores the playback application and is used as a work area when the CPU 121 executes each function.

The display unit 123 consists of a liquid crystal display or the like. When the CPU 121 executes the playback application, the application starts, display screen information is supplied from the CPU 121, and the display unit 123 shows various screens.

The operation input unit 124 consists of a keyboard and a mouse, accepts the user's operation input, and passes it to the CPU 121. For example, when the user moves the cursor on the display screen with the mouse and clicks at the desired position, the click information is passed to the CPU 121, which determines the content of the operation input from the click position and click state and performs the corresponding playback processing.

The network I/F 125 connects the personal computer 102 to the network 100 and, in accordance with communication control from the CPU 121, receives control signals from the CPU 121 and voice data (streaming data) from the recording server 101.

The speaker 126 emits the voice data as sound under the control of the CPU 121.

Next, the recording flow of the recording server 101 will be described with reference to FIG. 5.
FIG. 5 is a flowchart showing the recording processing flow of the recording server 101. It is assumed that the voice feature amount of each conference participant has been registered in the recording unit 3 before this recording processing flow is performed.
The recording server 101 monitors voice data communication on the network 100 and starts recording when it detects a conference start trigger (S1 → S2). The conference start trigger can be obtained by detecting that voice data has been communicated on the network 100, or by detecting a conference start pulse that the audio conference apparatus 111 or 112 emits when its conference start switch is pressed. If the recording server 101 has a recording start switch, the trigger can also be detected when that switch is pressed.

When recording starts, the recording server 101 (control unit 1) acquires the recording start time from a built-in timer or the like and passes it to the feature data extraction unit 2. The feature data extraction unit 2 saves this recording start time as the title of one voice data file (S3).

The network I/F 4 acquires the voice data communicated on the network 100 and passes it to the control unit 1, the feature data extraction unit 2, and the recording unit 3, and the recording unit 3 sequentially stores the voice data (S4).

At this time, the control unit 1 acquires apparatus data and time data from the information attached to the voice data acquired by the network I/F 4 (S5) and passes the apparatus data to the recording unit 3. In accordance with the apparatus data obtained from the control unit 1, the recording unit 3 sequentially records the voice data, apparatus by apparatus, in the voice data recording unit 301.

The control unit 1 also passes the apparatus data and time data acquired from the voice data to the feature data extraction unit 2 (S5). The feature data extraction unit 2 generates voice status data from the apparatus data and time data and passes it to the recording unit 3. The recording unit 3 records the voice status data from the feature data extraction unit 2 in the voice status data recording unit 302 (S6).

This generation and recording of voice status data and the recording of voice data are repeated until a recording end trigger is detected. When a recording end trigger or a recording pause trigger is detected (S7), the control unit 1 gives a recording end control instruction to the feature data extraction unit 2. The recording end trigger is obtained by detecting, for example, that the conference end switch of the audio conference apparatus 111 or 112 connected to the network 100 has been pressed or that its power has been turned off. The feature data extraction unit 2 generates and records the final voice status data, and also generates grouping instruction data that groups the voice status data recorded so far in the voice status data recording unit 302 under the title acquired at the start of recording, and records it in the voice status data recording unit 302 (S8).
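The recording flow S1 to S8 can be summarized in a small sketch. The class below is a hypothetical model of the recording server's bookkeeping, not the actual implementation: the class name, field names, and the representation of voice status data as `(apparatus, time)` pairs are assumptions, and trigger detection and network handling are left out.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Status = Tuple[str, float]   # (source apparatus, sound collection time)


@dataclass
class MinutesRecorder:
    title: str = ""
    recording: bool = False
    audio_by_apparatus: Dict[str, List[bytes]] = field(default_factory=dict)
    status: List[Status] = field(default_factory=list)
    groups: Dict[str, List[Status]] = field(default_factory=dict)

    def start(self, start_time: str) -> None:
        """S1 -> S2, S3: begin recording; the start time becomes the title."""
        self.title = start_time
        self.recording = True

    def on_voice_data(self, apparatus: str, time_sec: float, audio: bytes) -> None:
        """S4 to S6: store audio per apparatus and record its status data."""
        if not self.recording:
            return
        self.audio_by_apparatus.setdefault(apparatus, []).append(audio)
        self.status.append((apparatus, time_sec))

    def stop(self) -> None:
        """S7, S8: group the recorded status data under the title."""
        self.groups[self.title] = list(self.status)
        self.recording = False


# Demo: two packets, one from each audio conference apparatus.
recorder = MinutesRecorder()
recorder.start("2006-05-02T10:00:00")
recorder.on_voice_data("111", 0.0, b"\x00\x01")
recorder.on_voice_data("112", 0.5, b"\x02\x03")
recorder.stop()
```

Keeping the audio in per-apparatus lists mirrors the separate recording areas for the audio conference apparatuses 111 and 112, and the grouping step ties all status data of one conference to a single title.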

Through this configuration and processing, voice data that is continuous over time is recorded for each apparatus in the voice data recording unit 301 and stored as minutes.
Next, the playback flow of the recording server 101 and the personal computer 102 will be described with reference to the flowcharts of FIGS. 6 and 7.
FIG. 6 is a flowchart showing the playback processing flow of the personal computer 102.
First, the CPU 121 determines whether the user has entered a minutes search keyword using the operation input unit 124 (S21). The minutes search keyword is, for example, a conference date and time, a conference name, or an apparatus name. The CPU 121 transmits the entered keyword to the recording server 101 (S22). The recording server searches the minutes using the keyword, and the personal computer 102 receives the matching results (S23). The CPU 121 displays the received results on the display unit 123 (S24). The user can thus check the list of recorded minutes and designate the minutes to be played back.

The CPU 121 then determines whether the user has entered a playback request (S25). If no playback request has been entered, the process repeats from S21. A playback request is entered when the user designates, with the mouse or the like, the minutes to be played back from the search results displayed on the display unit 123. The playback request includes information specifying either normal playback (no speech speed conversion) or digest playback. For digest playback, speaker designation information indicating which speaker's utterances are to be heard preferentially is also entered (when a plurality of speakers are designated, this includes priority order information for those speakers).
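The content of such a playback request can be modeled as a small data structure. The sketch below is hypothetical: the specification does not define a message format, so the class name `PlaybackRequest`, its field names, and the JSON encoding are assumptions chosen only to make the listed items (minutes designation, playback mode, prioritized speaker list) concrete.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class PlaybackRequest:
    minutes_title: str                    # which minutes to play back
    mode: str = "normal"                  # "normal" or "digest"
    speakers: List[str] = field(default_factory=list)  # highest priority first

    def validate(self) -> None:
        if self.mode not in ("normal", "digest"):
            raise ValueError("mode must be 'normal' or 'digest'")
        if self.mode == "digest" and not self.speakers:
            raise ValueError("digest playback needs at least one speaker")

    def to_json(self) -> str:
        """Serialize the request (assumed wire format) after validation."""
        self.validate()
        return json.dumps(asdict(self))
```

For example, `PlaybackRequest("2006-05-02 10:00", "digest", ["A", "C"]).to_json()` describes a digest playback in which speaker A has priority over speaker C.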

When a playback request has been entered, the CPU 121 transmits the playback request and the speaker designation information to the recording server 101 (S26). The recording server 101 reads the designated minutes, performs analysis processing (described later) based on the speaker designation information, and generates streaming data. The personal computer 102 receives this streaming data (S27). The CPU 121 plays back the received stream and emits the sound from the speaker 126 (S28).

Thereafter, the CPU 121 determines whether the user has entered a playback change instruction (S29). A playback change instruction is, for example, a pause instruction or a fast-forward instruction. The CPU 121 performs playback change processing according to the instruction entered (S30). For example, if a pause instruction has been entered, playback of the streaming data is paused and sound emission stops.

Thereafter, the CPU 121 determines whether the user has entered an end instruction (S31). If there is an end instruction, end instruction information is transmitted to the recording server 101 (S32). If there is no end instruction, processing repeats from the reception of the streaming data. Unlike the pause instruction described above, the end instruction of S32 stops reception of the streaming data and ends this playback processing flow.

FIG. 7 is a flowchart showing the playback processing flow of the recording server 101.
The flow in FIG. 7A is triggered when a minutes search keyword is transmitted from the personal computer 102. On receiving the minutes search keyword from the personal computer 102, the control unit 1 searches the audio status data recording unit 302 of the recording unit 3 for minutes matching the keyword (S51). As described above, the minutes search keyword is a meeting date and time, a device name, or the like. The control unit 1 returns the meeting date and time, meeting name, device name, and so on of the matching minutes as the search result (S52). As a result, the search result is displayed on the display unit 123 of the personal computer 102.

The flow in FIG. 7B is triggered when a playback request and speaker designation information are transmitted from the personal computer 102. For normal playback (not digest playback), the minutes are simply read from the recording unit 3 and distributed by streaming, so that case is not described. On receiving the playback request and the speaker designation information from the personal computer 102, the control unit 1 passes them to the feature data extraction unit 2. Based on the speaker designation information, the feature data extraction unit 2 reads the feature data of the designated speaker from among the voice feature amounts of the conference participants recorded in the feature data recording unit 303 of the recording unit 3 (S71). The feature data extraction unit 2 also reads the minutes designated in the playback request from the audio data recording unit 301 of the recording unit 3 (S72). It then performs analysis processing that extracts voice feature amounts from the audio data read out and compares them with the feature data of the specific speaker read in S71 (S73). The minutes are read out a predetermined length of time (for example, 2 to 3 seconds) at a time, and the analysis is first performed on just these few seconds.

Using a technique such as pattern matching, the feature data extraction unit 2 extracts from the audio data read out the sections whose voice feature amounts match the feature data of the specific speaker. These sections are treated as the utterance sections of the specific speaker. The feature data extraction unit 2 applies speech speed conversion to the sections other than the utterance sections of the specific speaker (S74) and generates streaming data (S75). The conversion speed is, for example, double speed. The conversion speed may also be variable. For example, portions close to an utterance section of the specific speaker are set to a low multiple (such as 1.5 times) or to normal speed, and the multiple is raised as the distance from the utterance section of the specific speaker increases.
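The variable conversion speed described above can be sketched as a function of the distance to the nearest utterance section of the designated speaker. The thresholds `near` and `far` and the 3.0 upper multiple are illustrative assumptions; only the 1.5 times and double-speed values come from the description.

```python
from typing import List, Tuple


def distance_to_sections(t: float, sections: List[Tuple[float, float]]) -> float:
    """Distance in seconds from time t to the nearest (start, end) section."""
    best = float("inf")
    for start, end in sections:
        if start <= t <= end:
            return 0.0
        best = min(best, abs(t - start), abs(t - end))
    return best


def speed_factor(t: float, sections: List[Tuple[float, float]],
                 near: float = 2.0, far: float = 6.0) -> float:
    """Playback speed multiple at time t (thresholds are assumed values)."""
    d = distance_to_sections(t, sections)
    if d == 0.0:
        return 1.0  # inside the designated speaker's utterance: normal speed
    if d <= near:
        return 1.5  # close to the utterance: low multiple
    if d <= far:
        return 2.0  # the example double speed
    return 3.0      # far away: higher multiple (assumed value)
```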

The speech speed conversion does not simply read out and output the audio data at double speed; it is performed as follows. The audio data (audio signal) is divided into single-period waveforms, and new periodic waveforms are generated by combining adjacent periods. This reduces the number of periodic waveforms in the audio signal and compresses the signal while preserving its pitch.

FIG. 8A is a flowchart showing the procedure of the speech speed conversion processing, and FIG. 8B is a diagram explaining the compression method. In FIG. 8A, the number of samples in one period at the head of the audio signal (for example, sampling frequency × 1/signal frequency) is first detected (S91). Two periodic waveforms, each consisting of one period of sample data, are taken out. As shown in FIG. 8B, an attenuating wave is created by multiplying the first periodic waveform A by an attenuation gain coefficient, and an increasing wave is created by multiplying the second periodic waveform B by an increasing gain coefficient (S92). Adding these together synthesizes a periodic waveform whose shape is intermediate between A and B (S93). This synthesized waveform is output in place of the periodic waveforms A and B (S94). Outputting the synthesized waveform instead of the periodic waveforms A and B achieves acoustically natural time-axis compression.
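Steps S91 to S94 can be illustrated with a short sketch. The description only speaks of an attenuation gain coefficient and an increasing gain coefficient; the linear ramps used here are an assumption, and the function names `samples_per_period` and `merge_periods` are hypothetical.

```python
from typing import List


def samples_per_period(sampling_hz: float, signal_hz: float) -> int:
    """S91: samples in one period = sampling frequency x 1/signal frequency."""
    return round(sampling_hz / signal_hz)


def merge_periods(a: List[float], b: List[float]) -> List[float]:
    """S92-S93: fade period A out, fade period B in, and add them.

    The result has the length of ONE period and replaces both A and B
    in the output (S94), which shortens the signal without changing pitch.
    """
    if len(a) != len(b):
        raise ValueError("both periods must have the same number of samples")
    n = len(a)
    out = []
    for i in range(n):
        rise = i / (n - 1) if n > 1 else 0.5  # assumed linear gain ramp
        out.append(a[i] * (1.0 - rise) + b[i] * rise)
    return out
```

With a constant period of 1.0 followed by a constant period of 3.0, the merged period starts at A's value and ends at B's value, which is the intermediate shape the description refers to.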

The conversion speed can be made variable by specifying the interval at which this speech speed conversion is applied. For example, merging two periodic waveforms in every two periods, as shown in FIG. 9A, gives double speed, and merging two periodic waveforms in every three periods, as shown in FIG. 9B, gives 1.5 times speed. In the processing of S73 in FIG. 7, speech speed conversion may also be applied to the utterance sections of the specific speaker. In that case the audio may be expanded to make that speaker's statements easier to understand. For expansion, the synthesized waveform described above is inserted between periodic waveform A and periodic waveform B of FIG. 8B, which increases the number of periodic waveforms in the audio signal. Only the head of the section (for example, several hundred msec) may be expanded, with the remainder output at normal speed, so that the audio is not expanded more than necessary. Alternatively, the head may be expanded and the remainder compressed.
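The relation between the merge interval and the resulting speed can be checked with a small sketch. Merging two periods into one within every group of `group` periods turns `group` periods into `group - 1`, so the speed multiple is `group / (group - 1)`: 2.0 for groups of two and 1.5 for groups of three. The helper names, the linear cross-fade, and the choice to merge the first two periods of each group are assumptions made for illustration.

```python
from typing import List

Period = List[float]


def crossfade(a: Period, b: Period) -> Period:
    """One period shaped between a and b (linear ramp is an assumption)."""
    n = len(a)
    span = max(n - 1, 1)
    return [a[i] * (1 - i / span) + b[i] * (i / span) for i in range(n)]


def compress(periods: List[Period], group: int) -> List[Period]:
    """In each group of `group` periods, replace the first two by one merged period."""
    if group < 2:
        raise ValueError("group must be at least 2")
    out: List[Period] = []
    for i in range(0, len(periods), group):
        chunk = periods[i:i + group]
        if len(chunk) >= 2:
            out.append(crossfade(chunk[0], chunk[1]))
            out.extend(chunk[2:])
        else:
            out.extend(chunk)
    return out


def expand(periods: List[Period], group: int) -> List[Period]:
    """In each group, insert a merged period between the first two periods."""
    if group < 2:
        raise ValueError("group must be at least 2")
    out: List[Period] = []
    for i in range(0, len(periods), group):
        chunk = periods[i:i + group]
        if len(chunk) >= 2:
            out.extend([chunk[0], crossfade(chunk[0], chunk[1]), chunk[1]])
            out.extend(chunk[2:])
        else:
            out.extend(chunk)
    return out
```

For twelve input periods, `compress(periods, 2)` leaves six periods (double speed) and `compress(periods, 3)` leaves eight (1.5 times speed), while `expand(periods, 2)` yields eighteen.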

Sections other than the utterance sections of the specific speaker may instead be skipped (compressed at an infinite ratio). In addition, the voice feature amounts of meaningless utterances (for example, fillers such as "er" and "ah") may be recorded in the recording unit 3 as feature data, and in the processing of S72 these utterance sections may be extracted by speech recognition so that only the meaningless utterances are skipped.

The processing described above generates the streaming data after speech speed conversion. In FIG. 7B, the feature data extraction unit 2 transmits this converted streaming data to the personal computer 102 (S76). The processing of S72 to S76 is repeated until end instruction information is received from the personal computer 102 (S77 → S72). If end instruction information has been received from the personal computer 102, the operation ends (S77 → END). Because S72 to S76 are performed on each block of audio data of the predetermined length (for example, 2 to 3 seconds), the personal computer 102 waits for reception until the first few seconds of streaming data have been distributed; after that, the streaming data is generated in a task separate from the playback processing of the personal computer 102. Once the first distribution has started, the user can therefore listen to the minutes digest without further waiting.
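The block-by-block loop of S72 to S77 might be organized as a generator, as in the sketch below. The callables `read_chunk`, `process_chunk`, and `end_requested` are hypothetical stand-ins for reading a few seconds of minutes (S72), analysis and speech speed conversion (S73 to S75), and checking for end instruction information (S77). Stopping when the minutes run out is an assumption; the description only mentions the end instruction.

```python
from typing import Callable, Iterator, List, Optional

Chunk = List[float]


def stream_digest(read_chunk: Callable[[], Optional[Chunk]],
                  process_chunk: Callable[[Chunk], Chunk],
                  end_requested: Callable[[], bool]) -> Iterator[Chunk]:
    """Yield converted blocks one at a time until told to stop."""
    while True:
        chunk = read_chunk()        # S72: read the next 2 to 3 seconds
        if chunk is None:           # assumption: stop at the end of the minutes
            return
        yield process_chunk(chunk)  # S73-S76: analyze, convert, send
        if end_requested():         # S77: end instruction received?
            return
```

Because each block is yielded as soon as it is converted, the receiver only waits for the first block; later blocks are produced while earlier ones are being played.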

The minutes recording and playback system of the present invention also allows the following application example. FIG. 10 is a block diagram showing the configuration of an audio conference apparatus according to the application example. This audio conference apparatus is used in place of each of the audio conference apparatuses 111 and 112 in FIG. 1. Components shared with the audio conference apparatus 111 shown in FIG. 2 are given the same reference numerals, and their description is omitted.

The audio conference apparatus according to the application example includes a control unit 11, an input/output I/F 12, a sound emission directivity control unit 13, a D/A converter 14, a sound emission amplifier 15, speakers SP1 to SP16, microphones MIC101 to MIC116 and MIC201 to MIC216, a sound collection amplifier 16, an A/D converter 17, a sound collection beam generation unit 18, a sound collection beam selection unit 19, an echo cancellation circuit 20, an operation unit 31, and a display unit 32.

The control unit 11 of this audio conference apparatus converts the audio data from the partner apparatus, input through the input/output I/F 12, from network-format data into an ordinary audio signal and outputs it to the sound emission directivity control unit 13 via the echo cancellation circuit 20. It also acquires the direction data attached to the input audio signal and performs sound emission control on the sound emission directivity control unit 13.

The sound emission directivity control unit 13 generates the emitted audio signals for the speakers SP1 to SP16 according to the content of the sound emission control. These signals are formed by applying signal control processing such as delay control and amplitude control to the input audio data. The D/A converter 14 converts the digital emitted audio signals to analog form, the sound emission amplifier 15 amplifies them and supplies them to the speakers SP1 to SP16, and the speakers SP1 to SP16 convert the signals to sound and emit it. The voice of a conference participant at the partner apparatus connected over the network is thereby emitted to the conference participants at the local apparatus.

The microphones MIC101 to MIC116 and MIC201 to MIC216 collect the surrounding sound, including the voices of the conference participants at the local apparatus, convert it into electrical signals, and generate collected audio signals.

The sound collection beam generation unit 18 applies delay processing and the like to the signals collected by the microphones MIC101 to MIC116 and MIC201 to MIC216 and generates sound collection beam audio signals MB1 to MB8, each having strong directivity in a predetermined direction. The sound collection beam audio signals MB1 to MB8 are set to have strong directivity in mutually different directions. When the audio conference apparatus of FIG. 10 replaces the audio conference apparatus 111 of FIG. 1, MB1 to MB8 are set to the directions Dir11 to Dir18, respectively. When it replaces the audio conference apparatus 112 of FIG. 1, MB1 to MB8 are set to the directions Dir21 to Dir28, respectively.
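The description says only that delay processing and the like are applied to the microphone signals. One common way to obtain a directional beam from such delays is delay-and-sum beamforming, sketched below as an assumption rather than as the method of the apparatus. The `lags` argument (the number of samples by which each microphone's signal lags the reference for the chosen direction) and the function name are hypothetical.

```python
from typing import List


def delay_and_sum(mic_signals: List[List[float]], lags: List[int]) -> List[float]:
    """Align each microphone signal by its lag and average the results.

    Sound arriving from the steered direction adds up in phase, while
    sound from other directions is misaligned and partly cancels.
    """
    if len(mic_signals) != len(lags):
        raise ValueError("one lag per microphone is required")
    n = min(len(sig) - lag for sig, lag in zip(mic_signals, lags))
    count = len(mic_signals)
    return [
        sum(sig[i + lag] for sig, lag in zip(mic_signals, lags)) / count
        for i in range(n)
    ]
```

Computing one such beam per set of lags would give a family of signals comparable to MB1 to MB8, each favoring a different direction.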

The sound collection beam selection unit 19 compares the signal strengths of the sound collection beam audio signals MB1 to MB8, selects the one with the highest strength, and outputs it to the echo cancellation circuit 20 as the sound collection beam audio signal MB. The sound collection beam selection unit 19 also detects the direction Dir corresponding to the selected sound collection beam audio signal MB and supplies it to the control unit 11. The input/output I/F 12 converts the sound collection beam audio signal MB from the echo cancellation circuit 20 into network-format audio data of a predetermined data length, attaches the direction data and the sound collection time data obtained from the control unit 11, and outputs the result to the network 100.
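The selection step can be sketched as picking the beam with the greatest signal strength and reporting its direction. Measuring strength as the sum of squared samples is an assumption, and `select_beam` is a hypothetical name.

```python
from typing import Dict, List, Tuple


def select_beam(beams: Dict[str, List[float]]) -> Tuple[str, List[float]]:
    """Return (direction, samples) of the beam with the highest energy.

    `beams` maps a direction label such as "Dir11" to that beam's samples.
    """
    def energy(samples: List[float]) -> float:
        return sum(x * x for x in samples)  # assumed strength measure

    direction = max(beams, key=lambda d: energy(beams[d]))
    return direction, beams[direction]
```

The returned direction corresponds to the direction data that is attached to the outgoing audio data.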

Next, the recording flow of the recording server 101 according to the application example will be described with reference to FIG. 11.
FIG. 11 is a flowchart showing the recording processing flow of the recording server 101.
The recording server 101 monitors audio data communication on the network 100. When the recording server 101 detects a conference start trigger, it starts recording (S101 → S102).

When recording starts, the recording server 101 (control unit 1) acquires the recording start time and passes it to the feature data extraction unit 2. The feature data extraction unit 2 stores this recording start time as the title of one set of audio status data (S103).

The network I/F 4 acquires the audio data communicated over the network 100 and passes it to the recording unit 3, which stores the audio data sequentially (S104).

At this time, the control unit 1 acquires device data, direction data, and time data from the audio data acquired by the network I/F 4 and passes the device data to the recording unit 3. According to the device data from the control unit 1, the recording unit 3 records the audio data sequentially, device by device, in the audio data recording unit 301.

The control unit 1 also acquires the device data, direction data, and time data from the audio data and passes them to the feature data extraction unit 2 (S105). The feature data extraction unit 2 temporarily stores the device data, direction data, and time data.

This processing is repeated, for each device, until the control unit 1 detects a change in the direction data. When the control unit 1 detects a change in the direction data (S106), it issues a session end processing control to the feature data extraction unit 2 (S107). To associate the group of audio data sharing the same direction data, the feature data extraction unit 2 generates audio status data containing the device data, direction data, and start time data of that group and passes it to the recording unit 3. The recording unit 3 records the audio status data from the feature data extraction unit 2 in the audio status data recording unit 302 (S108). This generation and recording of audio status data, together with the recording of audio data, is repeated until a recording end trigger is detected, so that audio status data is generated and recorded every time the direction data changes.
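The segmentation in S105 to S108 amounts to emitting one record per run of identical direction data for each device. The sketch below assumes the incoming audio data can be reduced to `(device, direction, time)` tuples; the record layout and the function name `build_status_records` are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def build_status_records(packets: Iterable[Tuple[str, str, float]]) -> List[dict]:
    """Create one audio status record each time a device's direction changes."""
    records: List[dict] = []
    current: Dict[str, Tuple[str, float]] = {}  # device -> (direction, start time)
    for device, direction, t in packets:
        prev = current.get(device)
        if prev is None:
            current[device] = (direction, t)
        elif prev[0] != direction:
            # Direction changed (S106): close the previous run (S107-S108).
            records.append({"device": device, "direction": prev[0], "start": prev[1]})
            current[device] = (direction, t)
    # Recording end: write the final record for every device.
    for device, (direction, start) in current.items():
        records.append({"device": device, "direction": direction, "start": start})
    return records
```

Each record carries the device data, the direction data, and the start time of the run, which matches the contents of the audio status data described above.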

When a recording end trigger is detected (S110), the control unit 1 gives the feature data extraction unit 2 a recording end control instruction. The feature data extraction unit 2 generates and records the final audio status data. It also generates grouping instruction data that groups the audio status data already recorded in the audio status data recording unit 302 under the title acquired at the start of recording, and records it in the audio status data recording unit 302 (S111).

The audio data recording unit 301 holds temporally continuous audio data recorded for each device. The audio data is divided by direction data according to the voice feature data recorded in the audio status data recording unit 302. The relation between the direction data and the speakers (that is, which speaker is located in which direction) is registered in advance by a conference participant (the chairperson). The audio data is thus divided by speaker.

The audio data of site a is divided into audio data of speaker A, speaker B, speaker C, speaker D, speaker E, speaker F, and speaker G, and silent (noise) audio data for which no speaker is designated. Each division of the audio data is associated with the start time data of that division. If there is no utterance, nothing is recorded in the voice feature data.

Similarly, the audio data of site b is divided into audio data of speaker H, speaker I, speaker J, speaker K, and speaker L, and silent (noise) audio data for which no speaker is designated, and the start time data of each division is associated with it. Here too, if there is no utterance, nothing is recorded in the voice feature data.

With the configuration and processing of the application example, the minutes can be recorded together with voice feature data corresponding to each conference participant. Because time data is also associated, the minutes can be recorded including each participant's speaking status. As a result, the statements of the designated speaker can be extracted accurately when digest playback processing is performed.

In the application example, during the analysis processing of FIG. 7B (S73), the recording server 101 extracts voice feature amounts from the audio data read out and also determines the utterance sections of the specific speaker according to the speaker divisions recorded with the audio data. For each utterance section of the specific speaker extracted from the voice feature amounts, it further checks whether the speaker division recorded with the audio data corresponds to the specific speaker. If the two agree, the section is determined to be an utterance section of the specific speaker; if not, it is treated as an utterance section of another speaker. The utterance sections of the specific speaker are thereby extracted with high accuracy.
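The two-stage decision described above can be sketched as follows: a section counts as the designated speaker's utterance only if the feature comparison matched and the recorded direction maps to that same speaker. The input format (sections that already passed the feature match, given as `(start, direction)` pairs), the `direction_to_speaker` table standing in for the chairperson's registration, and the function name are all assumptions.

```python
from typing import Dict, List, Tuple


def confirm_sections(candidates: List[Tuple[float, str]],
                     direction_to_speaker: Dict[str, str],
                     target: str) -> Tuple[List[float], List[float]]:
    """Split feature-matched sections into confirmed and rejected start times."""
    confirmed: List[float] = []
    others: List[float] = []
    for start, direction in candidates:
        if direction_to_speaker.get(direction) == target:
            confirmed.append(start)  # both criteria agree
        else:
            others.append(start)     # treated as another speaker's section
    return confirmed, others
```

Sections in the second list would be handled like any other non-designated section, that is, passed to speech speed conversion.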

This embodiment has described a multipoint conference held with a plurality of audio conference apparatuses connected to a network, but the same operation and effect can be obtained when only a single audio conference apparatus is used.

This embodiment has described an example of recording audio data, but image data (still images or video) can also be recorded by installing a camera or the like on the audio conference apparatus. The image data is recorded in synchronization with the audio data, and on playback the image data is displayed in synchronization with the audio data.

FIG. 1 is a configuration diagram of a minutes recording and playback system according to an embodiment of the present invention.
FIG. 2 is a block diagram showing the main configuration of the audio conference apparatus 111 of this embodiment.
FIG. 3 is a block diagram showing the main configuration of the recording server 101 of this embodiment.
FIG. 4 is a block diagram showing the main configuration of the personal computer 102.
FIG. 5 is a flowchart showing the recording processing flow of the recording server 101.
FIG. 6 is a flowchart showing the playback processing flow of the personal computer 102.
FIG. 7 is a flowchart showing the playback processing flow of the recording server 101.
FIG. 8 is a diagram showing the speech speed conversion processing.
FIG. 9 is a diagram showing the speech speed conversion processing for variable speed.
FIG. 10 is a block diagram showing the configuration of the audio conference apparatus according to the application example.
FIG. 11 is a flowchart showing the recording processing flow of the recording server 101 according to the application example.

Explanation of symbols

100 - Network
101 - Recording server
1 - Control unit
2 - Feature data generation unit
3 - Recording unit
4 - Network I/F
111, 112 - Audio conference apparatus

Claims

Recording means for recording voice data of utterances by a plurality of speakers and a voice feature amount of a specific speaker;
Voice feature amount extraction means for extracting voice feature amounts;
Speaker extraction means for comparing the voice feature amount of the specific speaker with the voice feature amounts extracted by the voice feature amount extraction means and extracting the voice data recording section of the specific speaker from the voice data;
Speech speed conversion means for compressing, on the time axis, the voice data of sections other than the voice data recording section of the specific speaker; and
Output means for outputting the voice data including the compressed sections to the outside;
An audio data recording/reproducing apparatus comprising the above.

The audio data recording/reproducing apparatus according to claim 1, wherein the speech speed conversion means compresses the voice data of sections other than the voice data recording section of the specific speaker at a lower compression ratio the closer a section is to the voice data recording section of the specific speaker.

The audio data recording/reproducing apparatus according to claim 1, wherein the speech speed conversion means expands the voice data recording section of the specific speaker on the time axis.

Further comprising data acquisition means for acquiring, over time, voice data and speaker identification data identifying the speaker of the voice data;
The recording means records the voice data, the speaker identification data, and the voice feature amount of the specific speaker;
The speaker extraction means extracts the voice data recording section of the specific speaker based on at least one of the speaker identification data and the comparison result of the voice feature amounts. Or the audio data recording/reproducing apparatus of claim 3.

The audio data recording/reproducing apparatus according to claim 4, wherein the audio data recording/reproducing apparatus is connected to a sound emission and collection device including a microphone array, and
the sound emission and collection device forms, based on the sound signals collected by the microphones of the microphone array, a plurality of sound collection beam signals each having strong directivity in a different direction, compares the plurality of sound collection beam signals, selects the sound collection beam signal with the highest signal intensity, detects the direction corresponding to the selected sound collection beam signal, and outputs the selected sound collection beam signal as the voice data and the detected direction as the speaker identification data.