JP2009020460A - Voice processing device and program - Google Patents

Voice processing device and program

Info

Publication number
JP2009020460A
Authority
JP
Japan
Prior art keywords
section
acoustic model
sound
voice
correlation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
JP2007184874A
Other languages
Japanese (ja)
Other versions
JP5050698B2 (en)
Inventor
Yasuo Yoshioka (靖雄 吉岡)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yamaha Corp
Original Assignee
Yamaha Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yamaha Corp filed Critical Yamaha Corp
Priority claimed from JP2007184874A
Publication of JP2009020460A
Application granted
Publication of JP5050698B2
Legal status: Expired - Fee Related
Anticipated expiration


Landscapes

  • Telephonic Communication Services (AREA)
  • Telephone Function (AREA)

Abstract

PROBLEM TO BE SOLVED: To distinguish, within an audio signal, sections in which a target sound exists from sections containing only sound other than the target sound.
SOLUTION: A speech segmentation unit 12 divides the audio signal S on the time axis into sounding sections PA and non-sounding sections PB. A storage device 20 stores a general acoustic model of the target sound. A selection processing unit 13 determines, for each sounding section PA, whether the acoustic model correlates with the feature amounts of the audio signal S in that section, selects the sounding sections PA that correlate with the acoustic model as effective sections PA1, and selects the sounding sections PA that do not correlate with the acoustic model as rejection sections PA2. A speech classification unit 14 classifies, by speaker, the sounding sections PA selected as effective sections PA1 from among the sounding sections PA demarcated by the speech segmentation unit 12, based on the feature amounts of the audio signal S in each section.
COPYRIGHT: (C)2009,JPO&INPIT

Description

The present invention relates to a technique for dividing an audio signal into a plurality of sections on the time axis.

Various techniques for dividing an audio signal into a plurality of sections along the time axis have been proposed. For example, Patent Document 1 and Patent Document 2 disclose techniques for classifying an audio signal into sounding sections and non-sounding sections according to the result of comparing the S/N ratio of the audio signal with a predetermined threshold.
JP 59-99497 A; International Publication No. WO 2007/017993

However, with techniques such as those of Patent Document 1 and Patent Document 2, which sort sounding sections from non-sounding sections according to the S/N ratio, sections containing noise present at the time of recording (for example, the operating sound of air-conditioning equipment or the sound of a door opening and closing) may be selected as sounding sections. If sounds other than the originally intended sound (for example, human utterances) are mixed into the sounding sections, the accuracy of processing that targets the sounding sections (for example, classification of each section) deteriorates. Against this background, an object of the present invention is to distinguish, within an audio signal, sections in which a target sound exists from sections in which it does not.

To solve the above problems, a speech processing apparatus according to the present invention comprises: storage means for storing an acoustic model of a target sound; speech segmentation means for dividing an audio signal into a plurality of sections on the time axis; correlation determination means for determining whether the acoustic model correlates with the feature amounts of the audio signal in each section; and section selection means for selecting, among the plurality of sections, sections determined to correlate with the acoustic model as effective sections and sections determined not to correlate with the acoustic model as rejection sections. With this configuration, each section can be sorted into effective sections and rejection sections according to whether the acoustic model of the target sound correlates with the feature amounts of the audio signal in that section. Accordingly, by selectively using only the effective sections, for example, the accuracy of speech processing applied to the audio signal of each section (for example, classification by speaker or speech recognition) can be improved.

In a preferred aspect of the invention, the speech segmentation means divides the audio signal into sounding sections and non-sounding sections, the correlation determination means determines the presence or absence of correlation by comparing an index value of the correlation between the acoustic model and the feature amounts of the audio signal in each section with a first threshold, and the apparatus further comprises threshold setting means for setting the first threshold according to the characteristics of the audio signal in the non-sounding sections. Because the first threshold is set variably according to the characteristics of the audio signal in the non-sounding sections, the accuracy of the determination by the correlation determination means is improved compared with a configuration in which the first threshold is fixed.

In another preferred aspect, the apparatus comprises voiced determination means for determining, for each of the plurality of sections, whether the ratio of the number of voiced frames in that section to the total number of frames in that section exceeds a second threshold, and the section selection means selects as an effective section a section that the correlation determination means has determined to correlate with the acoustic model and for which the voiced determination means has determined that the ratio of voiced frames exceeds the second threshold. Because only sections in which the ratio of voiced frames exceeds the second threshold are selected as effective sections, sections of noise that merely resembles the target sound can be sorted into rejection sections.

In a further preferred aspect, the correlation determination means compares only the feature amounts of the voiced frames in each section with the acoustic model. Because the difference between a target sound such as a human utterance and noise is particularly pronounced in the characteristics of voiced sound, comparing only the feature amounts of voiced frames with the acoustic model improves the accuracy of the determination by the correlation determination means.

A speech processing apparatus according to a specific aspect of the invention comprises speech classification means for classifying, by speaker, the sections that the section selection means has selected as effective sections, based on the feature amounts of the audio signal in each of those sections. Because an effective section is likely to contain the target sound, feature amounts extracted from the audio signal of an effective section faithfully reflect the characteristics of the target sound. Accordingly, by restricting classification to effective sections, each section can be classified with high accuracy according to the characteristics of the audio signal.

The speech processing apparatus according to the present invention may be realized by hardware (electronic circuitry) such as a DSP (Digital Signal Processor) dedicated to audio processing, or by the cooperation of a general-purpose arithmetic processing device such as a CPU (Central Processing Unit) with a program. A program according to the present invention causes a computer equipped with storage means storing an acoustic model of a target sound to execute: a speech segmentation process that divides an audio signal into a plurality of sections on the time axis; a correlation determination process (for example, step S5 in FIG. 3) that determines whether the acoustic model correlates with the feature amounts of the audio signal in each section; and a section selection process (for example, steps S6 and S10 in FIG. 3) that selects sections determined to correlate with the acoustic model as effective sections to be processed and sections determined not to correlate with the acoustic model as rejection sections excluded from processing. This program provides the same operation and effects as the speech processing apparatus according to the present invention. The program may be provided to users in a form stored on a computer-readable recording medium and installed on a computer, or may be distributed from a server apparatus via a communication network and installed on a computer.

The present invention is also specified as a method of processing speech. A speech processing method according to one aspect of the invention includes: a speech segmentation step of dividing an audio signal into a plurality of sections on the time axis; a correlation determination step of determining whether an acoustic model stored in a storage device correlates with the feature amounts of the audio signal in each section; and a section selection step of selecting sections determined to correlate with the acoustic model as effective sections to be processed and sections determined not to correlate with the acoustic model as rejection sections excluded from processing. This method provides the same operation and effects as the speech processing apparatus according to the present invention.

A speech processing apparatus according to another aspect of the present invention comprises: speech segmentation means for dividing an audio signal into a plurality of sections on the time axis; voiced determination means for determining, for each of the plurality of sections, whether the ratio of the number of voiced frames in that section to the total number of frames in that section exceeds a threshold; and section selection means for selecting, as effective sections, sections in which the ratio of voiced frames exceeds the threshold and, as rejection sections, sections in which that ratio falls below the threshold. In this aspect, each section is sorted into effective sections and rejection sections according to the ratio of voiced frames in that section. Accordingly, by selectively using only the effective sections, for example, the accuracy of speech processing applied to the audio signal of each section (for example, classification by speaker or speech recognition) can be improved.

<A: First Embodiment>
FIG. 1 is a block diagram showing the configuration of a speech processing apparatus according to the first embodiment of the present invention. As shown in the figure, the speech processing apparatus 100 is a computer system comprising a control device 10 and a storage device 20. The control device 10 is an arithmetic processing device that executes a program. The storage device 20 stores the program executed by the control device 10 and various data used by the control device 10; any known storage medium, such as a semiconductor storage device or a magnetic storage device, may be used as the storage device 20. An output device 30 is connected to the control device 10. In this embodiment the output device 30 is a display device that displays various images under the control of the control device 10.

The storage device 20 stores an audio signal S representing the waveform of sound on the time axis. In this embodiment the audio signal S represents sound recorded with a sound-collecting device at a meeting in which a plurality of participants speak at arbitrary times. Part (A) of FIG. 2 illustrates the waveform of the audio signal S on the time axis. By executing the program stored in the storage device 20, the control device 10 generates minutes of the meeting from the audio signal S. The minutes are a record of the meeting in which the contents (text) of each participant's utterances are arranged in chronological order.

The storage device 20 further stores an acoustic model representing the acoustic characteristics of the sound to be processed by the speech processing apparatus 100 (hereinafter the "target sound"). In this embodiment the target sound is human speech. That is, acoustic feature amounts (for example, MFCCs (Mel Frequency Cepstral Coefficients)) are extracted from the utterances of many speakers and of varied content, and the extracted feature amounts are processed statistically to generate an acoustic model representing the general (average) characteristics of human speech. The control device 10 may generate the acoustic model, or an acoustic model generated by an external device may be stored in the storage device 20.

The acoustic model of this embodiment is, for example, a mixture model λ that models the distribution of the feature amounts (MFCC vectors) extracted from the many and varied sample utterances as a weighted sum of M probability distributions (M being a natural number of 2 or more). Any known technique, such as the EM (Expectation-Maximization) algorithm, may be used to generate the mixture model λ. The mixture model λ of this embodiment is a Gaussian mixture model expressed, as a weighted sum of M normal distributions, by the following equation (1):
λ = {pi, μi, Σi} (i = 1 to M) ……(1)
In equation (1), pi is the weight of the i-th normal distribution; the sum of the weights p1 to pM is 1. μi is the mean vector of the i-th normal distribution, and Σi is the covariance matrix of the i-th normal distribution. Note that for symbols that actually denote vectors, such as μi in equation (1), this specification states that the symbol denotes a vector (for example, "mean vector") but omits the vector notation (the rightward arrow above the character).
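As an illustration only (this code is not part of the patent disclosure), the following Python sketch shows how such a general acoustic model could be built from MFCC vectors of many sample utterances; the choice of librosa and scikit-learn, the number of mixture components, and the function names are assumptions made for the example.

```python
# Hypothetical sketch: a GMM acoustic model lambda = {p_i, mu_i, Sigma_i}
# fitted (EM algorithm) to MFCC feature vectors pooled from sample utterances.
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def extract_mfcc(signal, sr, n_mfcc=13):
    """Return one MFCC feature vector per frame, shape (num_frames, n_mfcc)."""
    return librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc).T

def train_acoustic_model(sample_signals, sr, M=32):
    """Fit a Gaussian mixture to MFCCs pooled over many sample utterances."""
    features = np.vstack([extract_mfcc(s, sr) for s in sample_signals])
    gmm = GaussianMixture(n_components=M, covariance_type='full', random_state=0)
    gmm.fit(features)
    # gmm.weights_, gmm.means_, gmm.covariances_ correspond to p_i, mu_i, Sigma_i.
    return gmm
```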

As shown in FIG. 1, the control device 10 functions as a speech segmentation unit 12, a selection processing unit 13, a speech classification unit 14, and a speech recognition unit 16. Each function of the control device 10 in FIG. 1 may also be realized by an electronic circuit such as a DSP dedicated to audio processing, and the control device 10 may be implemented as a plurality of integrated circuits.

As shown in part (D) of FIG. 2, the speech segmentation unit 12 divides the audio signal S stored in the storage device 20 into a plurality of sounding sections PA and a plurality of non-sounding sections PB along the time axis. A sounding section PA is a section in which sound (the target sound or noise) is present, and a non-sounding section PB is a section in which no sound is present or in which the volume is sufficiently low.

The speech segmentation unit 12 executes a first process and a second process. As shown in part (B) of FIG. 2, the first process detects, as sounding sections PA, sections of the audio signal S in which the S/N ratio or the volume (amplitude) exceeds a threshold. Sections other than the sounding sections PA become non-sounding sections PB.

When utterances by a plurality of speakers follow one another without a gap or partially overlap, the first process alone cannot readily separate the audio signal S by speaker. As shown in parts (C) and (D) of FIG. 2, the speech segmentation unit 12 therefore executes a second process that further divides the sounding sections PA at each of the valleys D appearing in the envelope E of the waveform of the audio signal S. In a typical human utterance, the volume gradually increases from the start of the utterance and gradually decreases from some midpoint toward its end. Consequently, by dividing sounding sections PA at the valleys D, utterances by different speakers are separated into distinct sounding sections PA even when several utterances are continuous or overlap (see the sketch after this paragraph). The total number of sounding sections PA after division by the speech segmentation unit 12 is hereinafter denoted J (J being an integer of 2 or more). Any known technique other than those exemplified above may be used to detect the sounding sections PA and the non-sounding sections PB.
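The following is a minimal Python sketch, not part of the patent disclosure, of one way the two processes could be combined: an energy threshold marks sounding frames (first process), and each sounding run is then split at valleys of a smoothed envelope (second process). Frame length, hop size, thresholds, and the smoothing window are illustrative assumptions.

```python
# Hypothetical sketch of the speech segmentation unit 12 (steps are assumptions).
import numpy as np
from scipy.signal import find_peaks

def frame_energy(signal, frame_len=1024, hop=512):
    n = 1 + max(0, (len(signal) - frame_len) // hop)
    return np.array([np.sum(signal[i*hop:i*hop+frame_len]**2) for i in range(n)])

def segment(signal, frame_len=1024, hop=512, energy_thresh=1e-3, smooth=15):
    energy = frame_energy(signal, frame_len, hop)
    sounding = energy > energy_thresh                       # first process
    envelope = np.convolve(energy, np.ones(smooth)/smooth, mode='same')
    sections, start = [], None
    for i, s in enumerate(sounding):
        if s and start is None:
            start = i
        elif not s and start is not None:
            sections.append((start, i))
            start = None
    if start is not None:
        sections.append((start, len(sounding)))
    split_sections = []                                     # second process:
    for a, b in sections:                                   # split at valleys D of envelope E
        valleys, _ = find_peaks(-envelope[a:b])
        bounds = [a] + [a + v for v in valleys] + [b]
        split_sections += [(bounds[k], bounds[k+1]) for k in range(len(bounds)-1)]
    return split_sections   # sounding sections PA (frame indices); the gaps are PB
```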

When sections in which the S/N ratio or the volume exceeds a threshold are detected as sounding sections PA in this way, sections of the audio signal S containing sound other than the target sound (for example, the ringing of a telephone) may also be detected as sounding sections PA. As shown in part (E) of FIG. 2, the selection processing unit 13 therefore sorts the sounding sections PA demarcated by the speech segmentation unit 12 into sections in which the target sound is likely to be present (hereinafter "effective sections" PA1) and sections in which the target sound is unlikely to be present (hereinafter "rejection sections" PA2). That is, among the sounding sections PA, sections in which the target sound is absent are removed as rejection sections PA2. The specific operation of the selection processing unit 13 is described later.

The speech classification unit 14 in FIG. 1 classifies, by speaker, the audio signal S of each effective section PA1 selected by the selection processing unit 13 from among the sounding sections PA. The non-sounding sections PB demarcated by the speech segmentation unit 12 and the rejection sections PA2 selected by the selection processing unit 13 are excluded from classification. Any known clustering technique may be used to classify the effective sections PA1.

For example, the speech classification unit 14 performs a frequency analysis including an FFT (Fast Fourier Transform) on the audio signal S in each effective section PA1 to extract acoustic feature amounts (for example, MFCCs) of the audio signal S in that section, and classifies the effective sections PA1 into clusters such that effective sections PA1 with similar feature amounts belong to a common cluster. Effective sections PA1 that were likely uttered by the same speaker are therefore classified into a common cluster. The speech classification unit 14 then stores in the storage device 20, in association with one another, an identification code for each speaker, the start and end times of each effective section PA1 classified into that speaker's cluster, and the audio signal S of each such effective section PA1. In a configuration in which the user specifies the number of meeting participants as a known quantity, the effective sections PA1 are preferably classified into that number of clusters.
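A minimal sketch of this clustering step, not taken from the patent, is shown below; summarizing each effective section by its mean MFCC vector and using k-means with the known number of participants are illustrative assumptions.

```python
# Hypothetical sketch of the speech classification unit 14.
import numpy as np
from sklearn.cluster import KMeans

def classify_by_speaker(effective_sections_mfcc, n_speakers):
    """effective_sections_mfcc: list of (num_frames, n_mfcc) arrays, one per PA1."""
    summaries = np.vstack([m.mean(axis=0) for m in effective_sections_mfcc])
    labels = KMeans(n_clusters=n_speakers, random_state=0).fit_predict(summaries)
    return labels   # labels[j] = speaker cluster of the j-th effective section
```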

The speech recognition unit 16 identifies, as text, the content of each speaker's utterances from the audio signal S of the effective sections PA1 classified into each cluster. Any known speech recognition technique may be used to recognize text from the audio signal S of each effective section PA1. For example, the speech recognition unit 16 first updates an initial acoustic model according to the acoustic feature amounts of the audio signal S of the effective sections PA1 classified into one cluster (speaker adaptation), thereby generating an acoustic model that specifically reflects the voice characteristics of the speaker corresponding to that cluster, and second, identifies the text of the utterances by comparing the speaker-adapted acoustic model with the feature amounts extracted from the audio signal S of each effective section PA1 in the cluster.
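The adaptation step can be pictured with the following sketch, which is not part of the patent disclosure: the means of an initial GMM are shifted toward the MFCC statistics of one cluster's effective sections (a MAP-style mean update). The relevance factor r is an assumption, and decoding the adapted model into text is left to an external recognizer.

```python
# Hypothetical sketch of the speaker-adaptation step of the speech recognition unit 16.
import numpy as np

def adapt_means(gmm, cluster_features, r=16.0):
    """cluster_features: (N, n_mfcc) MFCCs pooled from one cluster's PA1 sections."""
    gamma = gmm.predict_proba(cluster_features)          # responsibilities, shape (N, M)
    n = gamma.sum(axis=0)                                # soft counts per component
    x_bar = (gamma.T @ cluster_features) / np.maximum(n[:, None], 1e-10)
    alpha = (n / (n + r))[:, None]
    return alpha * x_bar + (1.0 - alpha) * gmm.means_    # adapted mean vectors
```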

The control device 10 outputs the result of the processing by the speech recognition unit 16 to the output device 30. The output device 30 displays an image of the minutes in which the time of each utterance, the speaker's identification code (for example, the speaker's name), and the text identified by the speech recognition unit 16 for that utterance are arranged in chronological order.

Next, a specific example of the processing by the selection processing unit 13 is described with reference to FIG. 3. The process in FIG. 3 starts when the processing by the speech segmentation unit 12 is completed. As shown in the figure, the selection processing unit 13 extracts feature amounts of the audio signal S for each of the J sounding sections PA (step S1). More specifically, the selection processing unit 13 performs a frequency analysis on each of the frames F into which each sounding section PA is divided (see part (A) of FIG. 2), and extracts as feature amounts the time series of MFCC vectors (hereinafter "feature vectors") of the frames F in that sounding section PA. The feature amounts extracted in step S1 are, however, not limited to MFCCs.

Next, the selection processing unit 13 selects, from the J sounding sections PA, the earliest (oldest) sounding section PA not yet selected (step S2). The selection processing unit 13 then sets a threshold TH1 according to the audio signal S of the non-sounding section PB immediately preceding the start of the sounding section PA selected in step S2 (hereinafter the "selected section PA_S") (step S3). Because a non-sounding section PB is basically a section containing only noise (environmental sound), step S3 amounts to setting the threshold TH1 according to the noise present when the audio signal S was recorded. Specifically, the selection processing unit 13 calculates the average intensity of the audio signal S in the non-sounding section PB immediately preceding the selected section PA_S (hereinafter the "noise level") and variably controls the threshold TH1 so that the higher the noise level, the smaller TH1 becomes.
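A minimal sketch of step S3 follows; it is not from the patent, which only states that TH1 is made smaller as the noise level rises, so the particular noise measure (RMS of the preceding PB) and the mapping from noise level to threshold are assumptions for illustration.

```python
# Hypothetical sketch of step S3: noise-adaptive setting of threshold TH1.
import numpy as np

def set_threshold_th1(preceding_pb_samples, base_th1=1e-3, alpha=4.0):
    # noise level taken as the RMS amplitude of the preceding non-sounding section PB
    noise_level = np.sqrt(np.mean(preceding_pb_samples ** 2))
    return base_th1 / (1.0 + alpha * noise_level)   # higher noise -> smaller TH1
```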

Next, the selection processing unit 13 calculates a correlation index value indicating the degree of correlation between the mixture model λ stored in the storage device 20 and the feature vectors x of the audio signal S in the selected section PA_S (step S4). More specifically, the selection processing unit 13 calculates, as the correlation index value, a value L (hereinafter the "average likelihood") obtained by averaging, over all feature vectors x in the selected section PA_S, the likelihood that each feature vector x is generated by the mixture model λ.

If a feature vector x is a D-dimensional vector, the likelihood p(x|λ) that the feature vector x is generated by the mixture model λ is calculated by the following equation (2):
p(x|λ) = Σ[i=1 to M] pi・N(x; μi, Σi) ……(2)
where N(x; μi, Σi) = (2π)^(-D/2)・|Σi|^(-1/2)・exp{-(1/2)(x-μi)'Σi^(-1)(x-μi)} is the D-dimensional normal density with mean vector μi and covariance matrix Σi.

In step S4, the selection processing unit 13 calculates the average likelihood L by substituting the K feature vectors x (x1 to xK) of the selected section PA_S into equation (3). As understood from equation (3), the more similar the speech characteristics represented by the acoustic model are to the characteristics of the audio signal S in the selected section PA_S, the larger the average likelihood L becomes:
L = (1/K)・Σ[k=1 to K] p(xk|λ) ……(3)
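As an illustration only (not part of the patent disclosure), the average likelihood of equation (3) could be computed with the GMM sketched earlier; GaussianMixture.score_samples returns the per-frame log-likelihood log p(x|λ) of equation (2), so the exponential is taken before averaging. The use of scikit-learn is an assumption.

```python
# Hypothetical sketch of step S4 in the first embodiment.
import numpy as np

def average_likelihood(gmm, section_features):
    """section_features: (K, n_mfcc) feature vectors x1..xK of the selected section PA_S."""
    log_p = gmm.score_samples(section_features)   # log p(x_k | lambda), equation (2)
    return float(np.mean(np.exp(log_p)))          # L = (1/K) * sum_k p(x_k | lambda)
```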

The selection processing unit 13 then determines whether the average likelihood L of the selected section PA_S falls below the threshold TH1 (step S5). Because the mixture model λ comprehensively reflects a wide variety of human utterances, the sound in a selected section PA_S whose average likelihood L falls below the threshold TH1 is unlikely to be a human utterance. Accordingly, when the result of step S5 is affirmative (L < TH1), the selection processing unit 13 sorts the current selected section PA_S into a rejection section PA2 (step S6). As described above, step S5 determines, from the presence or absence of correlation between the audio signal S and the mixture model λ, whether the sound in the selected section PA_S could be a human utterance.

Because the feature vectors x extracted in step S1 are affected by noise in the audio signal S, if the threshold TH1 were a fixed value, the higher the noise level of the audio signal S, the more likely the result of step S5 would be affirmative even for sounding sections PA that actually contain the target sound. In this embodiment the threshold TH1 is set to a smaller value as the noise level of the audio signal S increases (that is, the rate at which step S5 yields an affirmative result decreases), which reduces the likelihood that a selected section PA_S containing the target sound is misjudged as a rejection section PA2.

Even when the target sound is not actually present in the selected section PA_S, the result of step S5 is negative (that is, the section is not judged to be a rejection section PA2) if the selected section PA_S contains noise resembling a human utterance. When the result of step S5 is negative, the selection processing unit 13 therefore sorts the selected section PA_S into an effective section PA1 or a rejection section PA2 by a method that does not use the mixture model λ (steps S7 to S9).

When a person speaks naturally (that is, unless the person deliberately and continuously utters only unvoiced sound), voiced sound tends to be present for more than a certain proportion of the duration of the utterance. In this embodiment, therefore, the selected section PA_S is sorted into an effective section PA1 (a sounding section PA rich in voiced sound) or a rejection section PA2 (a sounding section PA rich in unvoiced sound) according to the proportion of voiced sound within the selected section PA_S.

In step S7, the selection processing unit 13 determines, for each of the frames F in the selected section PA_S, whether the sound indicated by the audio signal S is voiced or unvoiced. Any known technique may be used for the voiced/unvoiced determination. For example, the selection processing unit 13 calculates for each frame F the maximum value of the autocorrelation function, which serves as an index of the periodicity of the audio signal S (hereinafter the "autocorrelation value"), judges frames F whose autocorrelation value exceeds a predetermined value (that is, frames F in which the audio signal S is highly periodic) to be voiced, and judges frames F whose autocorrelation value falls below the predetermined value to be unvoiced. A configuration in which only frames F from which a clear pitch (fundamental frequency) is detected in the audio signal S are judged to be voiced may also suitably be employed.

Next, the selection processing unit 13 calculates the ratio R of the number of frames F judged to be voiced in step S7 to the total number of frames F in the selected section PA_S (step S8), and determines whether the ratio R exceeds a predetermined threshold TH2 (step S9). When the determination in step S9 is negative (that is, when the proportion of unvoiced frames F in the selected section PA_S is high), the selection processing unit 13 sorts the current selected section PA_S into a rejection section PA2 (step S6). When the determination in step S9 is affirmative (that is, when the proportion of voiced frames F in the selected section PA_S is high), the selection processing unit 13 sorts the selected section PA_S into an effective section PA1 (step S10).
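A minimal sketch of steps S7 to S9, not from the patent, is given below: a frame is judged voiced when its maximum normalized autocorrelation within a plausible pitch-lag range exceeds a fixed value, and the section passes when the ratio R of voiced frames exceeds TH2. The lag range (80-400 Hz) and the constants are assumptions.

```python
# Hypothetical sketch of the voiced/unvoiced judgement and the ratio test.
import numpy as np

def is_voiced(frame, sr, fmin=80.0, fmax=400.0, min_corr=0.5):
    frame = frame - np.mean(frame)
    denom = np.sum(frame ** 2)
    if denom <= 0.0:
        return False
    corr = np.correlate(frame, frame, mode='full')[len(frame)-1:] / denom
    lag_lo, lag_hi = int(sr / fmax), min(int(sr / fmin), len(corr) - 1)
    if lag_lo >= lag_hi:
        return False
    return np.max(corr[lag_lo:lag_hi]) > min_corr   # autocorrelation value vs. fixed value

def voiced_ratio_ok(frames, sr, th2=0.4):
    flags = [is_voiced(f, sr) for f in frames]
    ratio_r = sum(flags) / len(flags)                # ratio R of step S8
    return ratio_r > th2                             # step S9: effective section if R > TH2
```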

After executing step S6 or step S10, the selection processing unit 13 determines whether all sounding sections PA of the audio signal S have been sorted (step S11). If the result of step S11 is negative, the selection processing unit 13 selects the sounding section PA immediately following the current selected section PA_S as the new selected section PA_S in step S2 and executes the processing from step S3 onward. When all sounding sections PA have been sorted (step S11: YES), the selection processing unit 13 ends the processing of FIG. 3.

As described above, in this embodiment the sounding sections PA are distinguished into effective sections PA1 and rejection sections PA2 according to whether the target sound is present, so the influence of noise can be effectively reduced by excluding the rejection sections PA2, which do not contain the target sound, from processing by the speech classification unit 14 and the speech recognition unit 16. For example, reducing the influence of noise makes it possible to extract feature amounts that faithfully reflect the characteristics of each speaker's voice, which improves the accuracy of speech processing that uses feature amounts, such as the classification of the sounding sections PA (effective sections PA1) by the speech classification unit 14 and the speaker adaptation and speech recognition performed by the speech recognition unit 16. Accurate minutes can therefore be created from the audio signal S.

In the embodiment described above, a sounding section PA is sorted into an effective section PA1 or a rejection section PA2 according to whether the audio signal S correlates with an acoustic model of human speech (the mixture model λ). A configuration (hereinafter the "comparative example") that instead uses an acoustic model of the noise that may occur when the audio signal S is recorded (hereinafter a "noise model") is also conceivable. In the comparative example, a sounding section PA is sorted into a rejection section PA2 when the correlation between the audio signal S and the noise model is high, and into an effective section PA1 when that correlation is low.

However, the characteristics of noise are far more diverse than those of human speech. Even if a noise model is created for specific assumed noises, the audio signal S is therefore likely to contain noise that the noise model does not cover; that is, the comparative example cannot sufficiently remove sounding sections PA that do not contain the target sound. In this embodiment, by contrast, an acoustic model of human speech is used, so sounding sections PA that do not contain the target sound can be removed effectively even when the audio signal S contains a wide variety of noises.

<B: Second Embodiment>
A second embodiment of the present invention is described next. In this embodiment, VQ (Vector Quantization) distortion is adopted as the correlation index value between the acoustic model and the audio signal S in place of the average likelihood L of the first embodiment. In the following embodiments, elements whose functions and operations are equivalent to those of the first embodiment are given the same reference signs, and their detailed description is omitted as appropriate.

The acoustic model stored in advance in the storage device 20 is a codebook CA generated from the many feature amount (MFCC) vectors extracted from the many and varied sample utterances. Any known technique, such as the k-means method or the LBG algorithm, may be used to generate the codebook.

In step S4 of FIG. 3, the selection processing unit 13 calculates the VQ distortion D based on the codebook CA stored in the storage device 20 and the feature vectors x (for example, MFCCs) extracted in step S1 from the audio signal S of the selected section PA_S. The VQ distortion D is calculated, for example, by the following equation (4):
D = (1/nB)・Σ[j=1 to nB] min[1≤i≤|CA|] d(xj, CA(i)) ……(4)

In equation (4), |CA| is the size of the codebook CA, and CA(i) is the i-th code vector (centroid vector) in the codebook CA. xj denotes the j-th (j = 1 to nB) of the nB feature vectors x1 to xnB extracted from the selected section PA_S (nB being the number of frames F in the selected section PA_S), and d(X, Y) is the Euclidean distance between vectors X and Y. That is, the VQ distortion D is the value obtained by averaging, over the nB feature vectors x1 to xnB, the minimum (min) distance between each feature vector x of the selected section PA_S and the |CA| centroid vectors of the codebook CA serving as the acoustic model.
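The following Python sketch, not part of the patent disclosure, illustrates this second embodiment: the codebook CA is obtained with k-means from the sample MFCC vectors, and equation (4) averages, over the section's nB frames, the distance from each feature vector to its nearest centroid. Codebook size and library choices are assumptions.

```python
# Hypothetical sketch of the codebook and the VQ distortion of equation (4).
import numpy as np
from sklearn.cluster import KMeans
from scipy.spatial.distance import cdist

def train_codebook(sample_features, size=64):
    """sample_features: (N, n_mfcc) MFCC vectors pooled from the sample utterances."""
    return KMeans(n_clusters=size, random_state=0).fit(sample_features).cluster_centers_

def vq_distortion(codebook, section_features):
    """Equation (4): mean over frames of the distance to the nearest centroid."""
    distances = cdist(section_features, codebook)   # shape (n_B, |C_A|)
    return float(np.mean(np.min(distances, axis=1)))
```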

As understood from the above, the more the sound in the selected section PA_S resembles human speech, the smaller the VQ distortion D becomes. Accordingly, in step S3 of FIG. 3, the selection processing unit 13 variably controls the threshold TH1 so that the higher the noise level in the non-sounding section PB immediately preceding the selected section PA_S, the larger TH1 becomes. In step S5 of FIG. 3, the selection processing unit 13 determines whether the VQ distortion D exceeds the threshold TH1, sorting the selected section PA_S into a rejection section PA2 when it does (step S5: YES) and proceeding to step S7 when it does not (step S5: NO). The other operations are the same as in the first embodiment, and this embodiment provides the same effects as the first embodiment.

<C: Modifications>
Various modifications can be made to each of the above embodiments. Specific examples of modifications are given below. Two or more of the following modes may be selected and combined arbitrarily.

(1) Modification 1
In the embodiments above, the feature amounts (feature vectors x) extracted from every frame F of a sounding section PA are compared with the acoustic model regardless of whether the frame is voiced or unvoiced, but a configuration in which only the feature amounts extracted from the voiced frames F of the sounding section PA are compared with the acoustic model may also be employed. In this case the acoustic model stored in the storage device 20 is generated from feature amounts of the voiced portions of the sample speech, excluding unvoiced and silent portions. The selection processing unit 13 calculates the average likelihood L (the VQ distortion D in the second embodiment) in step S4 of FIG. 3 using only the feature amounts extracted from the voiced frames F among the frames F in the selected section PA_S, and determines in step S5 whether the acoustic model correlates with the audio signal S in the selected section PA_S. Because the difference between noise and the target sound is particularly pronounced in the characteristics of voiced sound, using only the voiced frames F of the sounding section PA for comparison with the acoustic model, as in this modification, improves the accuracy of the determination in step S5.
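As a small illustrative sketch (not from the patent), this modification amounts to filtering the frames before computing the correlation index; is_voiced() and average_likelihood() refer to the earlier illustrative sketches, and the fallback value for a section with no voiced frame is an assumption.

```python
# Hypothetical sketch of Modification 1: voiced frames only in step S4.
import numpy as np

def average_likelihood_voiced_only(gmm, frames, frame_features, sr):
    voiced = [i for i, f in enumerate(frames) if is_voiced(f, sr)]
    if not voiced:
        return 0.0                        # no voiced frame: treat as uncorrelated
    return average_likelihood(gmm, frame_features[np.asarray(voiced)])
```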

(2) Modification 2
In the embodiments above, the threshold TH1 is set based on the noise level in the non-sounding section PB immediately preceding the selected section PA_S (step S3), but the basis for setting the threshold TH1 may be changed as appropriate. For example, a configuration may be adopted in which the threshold TH1 is set based on the noise level in the first non-sounding section PB of the audio signal S and that threshold TH1 is applied in common in step S5 when sorting every sounding section PA. With the configuration of the first embodiment, in which the noise level of the non-sounding section PB immediately preceding the selected section PA_S is applied to the sorting of that selected section PA_S, the threshold TH1 is updated according to the changed noise level even when the noise level changes partway through the audio signal S, which reduces the possibility that the accuracy of the sorting in step S5 deteriorates.

In the embodiments above, the threshold TH2 in step S9 is a fixed value, but a configuration in which the threshold TH2 is variably controlled in the same manner as the threshold TH1 (by the methods exemplified in the first embodiment and in this modification) may also be adopted. The higher the noise level of the audio signal S, the larger the error in the ratio R calculated in step S8, so with a fixed threshold TH2 the possibility increases that a selected section PA_S containing the target sound is misjudged as a rejection section PA2. The selection processing unit 13 therefore sets the threshold TH2 so that the higher the noise level in the non-sounding section PB immediately preceding the selected section PA_S (or in the first non-sounding section PB of the audio signal S), the smaller TH2 becomes. This configuration reduces the possibility that a selected section PA_S containing the target sound is misjudged as a rejection section PA2.

(3) Modification 3
In the embodiments above, a single acoustic model is used, but a plurality of acoustic models may be used selectively to sort the sounding sections PA into effective sections PA1 and rejection sections PA2. For example, a plurality of acoustic models generated from a plurality of kinds of speech with different average pitches are created in advance and stored in the storage device 20. In step S4 of FIG. 3, the selection processing unit 13 detects the pitch (average pitch) of the audio signal S in the selected section PA_S and calculates the average likelihood L (the VQ distortion D in the second embodiment) using the acoustic model, among the plurality of acoustic models, that corresponds to that pitch. With this configuration, the sounding sections PA can be accurately sorted into effective sections PA1 and rejection sections PA2 even when the audio signal S contains speech of diverse pitches, as when male and female voices are mixed.
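A minimal sketch of this model selection, not from the patent, is shown below; the use of librosa.yin for pitch estimation, the pitch search range, and pairing each model with a reference pitch are assumptions for illustration.

```python
# Hypothetical sketch of Modification 3: choosing the acoustic model by average pitch.
import numpy as np
import librosa

def select_model_by_pitch(section_signal, sr, models_by_pitch):
    """models_by_pitch: list of (reference_pitch_hz, acoustic_model) pairs."""
    f0 = librosa.yin(section_signal, fmin=65.0, fmax=500.0, sr=sr)
    avg_pitch = float(np.mean(f0))
    ref, model = min(models_by_pitch, key=lambda pm: abs(pm[0] - avg_pitch))
    return model   # used in step S4 for this selected section
```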

(4) Modification 4
The method by which the speech segmentation unit 12 divides the audio signal S is not limited to the examples above. For example, only one of the first process, which divides the audio signal S into sounding sections PA and non-sounding sections PB according to its S/N ratio or volume, and the second process, which divides the audio signal S at the valleys D of the envelope E, may be executed. A configuration in which the audio signal S is divided into sections of fixed or variable length set independently of the characteristics of the audio signal S may also be adopted. That is, dividing the signal into sounding sections PA and non-sounding sections PB is not essential to the present invention.

(5) Modification 5
In the embodiments above, both the determination in step S5, which uses the correlation index value with respect to the acoustic model (the average likelihood L or the VQ distortion D), and the determination in step S9, which uses the ratio R of voiced frames F, are executed. However, a configuration in which each sounding section PA is sorted into an effective section PA1 or a rejection section PA2 based only on the result of the determination in step S5 (that is, a configuration in which steps S7 to S9 of FIG. 3 are omitted) may also be adopted, as may a configuration based only on the determination in step S9 (that is, a configuration in which steps S3 to S5 of FIG. 3 are omitted).

(6) Modification 6
A printing device that prints the minutes created by the speech processing apparatus 100 may be adopted as the output device 30. The result of the processing by the speech processing apparatus 100 need not be output in the form of minutes (text); for example, the result of the classification by the speech classification unit 14 may be output instead. For example, with a configuration in which the audio signal S of the effective section PA1 containing a time specified by the user, among the effective sections PA1 classified by the speech classification unit 14, is output as sound waves from a sound-emitting device (for example, a loudspeaker), the user can be effectively assisted in creating the minutes of a meeting while selectively listening to and verifying each speaker's utterances. A configuration may also be adopted in which the result of the selection processing unit 13 sorting the sounding sections PA into effective sections PA1 and rejection sections PA2 is output from the speech processing apparatus 100 to an external device. In the external device, processing equivalent to that of the speech classification unit 14 in FIG. 1, or other suitable processing, is applied to the output from the speech processing apparatus 100. For example, only the effective sections PA1 selected by the selection processing unit 13 from among the sounding sections PA may be selectively output to the external device, and predetermined processing (classification by speaker or speech recognition) executed on each effective section PA1 in the external device. As described above, the speech classification unit 14 and the speech recognition unit 16 are not essential elements of the speech processing apparatus 100.

(7) Modification 7
In each of the above embodiments, the audio signal S stored in advance in the storage device 20 is the target of processing, but the processing may instead be executed in real time on an audio signal S supplied from a sound collection device (microphone) or supplied sequentially via a communication network. Furthermore, the type of sound represented by the audio signal S is arbitrary in the present invention. For example, in a configuration in which an acoustic model whose target sound is the performance sound of a specific musical instrument is stored in the storage device 20, sections of sound other than the target sound (for example, sections of applause) can be excluded as rejection sections PA2 from the audio signal S recorded at a concert of that instrument.
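A real-time variant could, for example, buffer blocks arriving from a microphone or a network stream and apply the division and selection described above whenever enough audio has accumulated; the following sketch is only a rough assumption of such a loop, and the helper `classify_section` is hypothetical.

```python
# Rough sketch of real-time operation (an assumption, not the disclosed design).
def run_realtime(block_source, classify_section, min_samples=16000):
    """block_source: iterable yielding audio sample blocks (e.g. from a microphone);
    classify_section: callable applying the division/selection sketched above."""
    buffer = []
    for block in block_source:
        buffer.extend(block)
        if len(buffer) >= min_samples:   # roughly one second at 16 kHz
            classify_section(buffer)
            buffer = []
```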

FIG. 1 is a block diagram showing the configuration of a speech processing apparatus according to the first embodiment of the present invention. FIG. 2 is a conceptual diagram for explaining the operation of the speech processing apparatus. FIG. 3 is a flowchart showing the operation of the selection processing unit.

Explanation of Symbols

100: speech processing apparatus, 10: control device, 12: voice division unit, 13: selection processing unit, 14: voice classification unit, 16: voice recognition unit, 20: storage device, 30: output device, PA: sounding section, PB: non-sounding section, S: audio signal.

Claims (5)

1. A speech processing apparatus comprising:
storage means for storing an acoustic model of a target sound;
voice division means for dividing an audio signal into a plurality of sections on a time axis;
correlation determination means for determining whether or not there is a correlation between the acoustic model and a feature amount of the audio signal in each of the sections; and
section selection means for selecting, from among the plurality of sections, a section determined to be correlated with the acoustic model as an effective section, and selecting a section determined not to be correlated with the acoustic model as a rejection section.
2. The speech processing apparatus according to claim 1, wherein
the voice division means divides the audio signal into a sounding section and a non-sounding section,
the correlation determination means determines the presence or absence of the correlation by comparing an index value of the correlation between the acoustic model and the feature amount of the audio signal in each section with a first threshold value, and
the apparatus further comprises threshold setting means for setting the first threshold value in accordance with characteristics of the audio signal in the non-sounding section.
3. The speech processing apparatus according to claim 1 or claim 2, further comprising voiced-sound determination means for determining, for each of the plurality of sections, whether or not the ratio of the number of voiced-sound frames in the section to the total number of frames in the section exceeds a second threshold value,
wherein the section selection means selects, as an effective section, a section that the correlation determination means determines to be correlated with the acoustic model and for which the voiced-sound determination means determines that the ratio of the number of voiced-sound frames exceeds the second threshold value.
4. The speech processing apparatus according to any one of claims 1 to 3, wherein the correlation determination means compares only feature amounts of voiced-sound frames in each of the plurality of sections with the acoustic model.
5. A program for causing a computer comprising storage means for storing an acoustic model of a target sound to execute:
a voice division process of dividing an audio signal into a plurality of sections on a time axis;
a correlation determination process of determining whether or not there is a correlation between the acoustic model and a feature amount of the audio signal in each of the sections; and
a section selection process of selecting, from among the plurality of sections, a section determined to be correlated with the acoustic model as an effective section to be subjected to processing, and selecting a section determined not to be correlated with the acoustic model as a rejection section excluded from the processing.
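For orientation only, the three processes recited in claim 5 can be sketched as follows; the helpers `divide_into_sections` and `correlates_with_model` are hypothetical stand-ins, and the sketch is not the claimed program itself.

```python
# Illustrative sketch of the three processes of claim 5, wired together with
# hypothetical helpers (divide_into_sections, correlates_with_model).
def process(signal, acoustic_model, threshold):
    sections = divide_into_sections(signal)            # voice division process
    effective, rejected = [], []
    for sec in sections:
        # correlation determination process: compare the section's feature
        # amounts with the stored acoustic model of the target sound
        if correlates_with_model(sec, acoustic_model, threshold):
            effective.append(sec)                      # selected as effective section
        else:
            rejected.append(sec)                       # selected as rejection section
    return effective, rejected
```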
JP2007184874A 2007-07-13 2007-07-13 Voice processing apparatus and program Expired - Fee Related JP5050698B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2007184874A JP5050698B2 (en) 2007-07-13 2007-07-13 Voice processing apparatus and program

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP2007184874A JP5050698B2 (en) 2007-07-13 2007-07-13 Voice processing apparatus and program

Publications (2)

Publication Number Publication Date
JP2009020460A true JP2009020460A (en) 2009-01-29
JP5050698B2 JP5050698B2 (en) 2012-10-17

Family

ID=40360112

Family Applications (1)

Application Number Title Priority Date Filing Date
JP2007184874A Expired - Fee Related JP5050698B2 (en) 2007-07-13 2007-07-13 Voice processing apparatus and program

Country Status (1)

Country Link
JP (1) JP5050698B2 (en)


Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS61138299A (en) * 1984-12-10 1986-06-25 日本電信電話株式会社 Voice section detection system
JPS6456499A (en) * 1987-08-27 1989-03-03 Matsushita Electric Ind Co Ltd Voice recognition
JPH04369695A (en) * 1991-06-19 1992-12-22 Matsushita Electric Ind Co Ltd Voice decision device
JPH06110488A (en) * 1992-09-30 1994-04-22 Matsushita Electric Ind Co Ltd Method and device for speech detection
JPH08305388A (en) * 1995-04-28 1996-11-22 Matsushita Electric Ind Co Ltd Voice range detection device
JP2002023800A (en) * 1998-08-21 2002-01-25 Matsushita Electric Ind Co Ltd Multi-mode sound encoder and decoder
JP2005195955A (en) * 2004-01-08 2005-07-21 Toshiba Corp Device and method for noise suppression
JP2006133284A (en) * 2004-11-02 2006-05-25 Kddi Corp Voice information extracting device


Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2009053430A (en) * 2007-08-27 2009-03-12 Yamaha Corp Speech processing device and program
WO2010092914A1 (en) * 2009-02-13 2010-08-19 日本電気株式会社 Method for processing multichannel acoustic signal, system thereof, and program
JP5605574B2 (en) * 2009-02-13 2014-10-15 日本電気株式会社 Multi-channel acoustic signal processing method, system and program thereof
US9009035B2 (en) 2009-02-13 2015-04-14 Nec Corporation Method for processing multichannel acoustic signal, system therefor, and program
JP2012048119A (en) * 2010-08-30 2012-03-08 Nippon Telegr & Teleph Corp <Ntt> Voice interval detecting method, speech recognition method, voice interval detector, speech recognition device, and program and storage method therefor
US9117456B2 (en) 2010-11-25 2015-08-25 Fujitsu Limited Noise suppression apparatus, method, and a storage medium storing a noise suppression program
JP2018200617A (en) * 2017-05-29 2018-12-20 京セラドキュメントソリューションズ株式会社 Information processing system
JP2021021749A (en) * 2019-07-24 2021-02-18 富士通株式会社 Detection program, detection method, and detection device
JP7331523B2 (en) 2019-07-24 2023-08-23 富士通株式会社 Detection program, detection method, detection device
JPWO2022168251A1 (en) * 2021-02-05 2022-08-11
JP7333878B2 (en) 2021-02-05 2023-08-25 三菱電機株式会社 SIGNAL PROCESSING DEVICE, SIGNAL PROCESSING METHOD, AND SIGNAL PROCESSING PROGRAM
CN114242116A (en) * 2022-01-05 2022-03-25 成都锦江电子系统工程有限公司 Comprehensive judgment method for voice and non-voice of voice

Also Published As

Publication number Publication date
JP5050698B2 (en) 2012-10-17

Similar Documents

Publication Publication Date Title
CN108305615B (en) Object identification method and device, storage medium and terminal thereof
EP1210711B1 (en) Sound source classification
JP5050698B2 (en) Voice processing apparatus and program
EP0625774B1 (en) A method and an apparatus for speech detection
JP4568371B2 (en) Computerized method and computer program for distinguishing between at least two event classes
Kos et al. Acoustic classification and segmentation using modified spectral roll-off and variance-based features
JP7342915B2 (en) Audio processing device, audio processing method, and program
US20050192795A1 (en) Identification of the presence of speech in digital audio data
EP2083417B1 (en) Sound processing device and program
Archana et al. Gender identification and performance analysis of speech signals
JP5647455B2 (en) Apparatus, method, and program for detecting inspiratory sound contained in voice
JP4973352B2 (en) Voice processing apparatus and program
WO2018163279A1 (en) Voice processing device, voice processing method and voice processing program
JP5083951B2 (en) Voice processing apparatus and program
Grewal et al. Isolated word recognition system for English language
CN114303186A (en) System and method for adapting human speaker embedding in speech synthesis
JP4877114B2 (en) Voice processing apparatus and program
JP5109050B2 (en) Voice processing apparatus and program
JPH06110488A (en) Method and device for speech detection
JP2006154212A (en) Speech evaluation method and evaluation device
JP7159655B2 (en) Emotion estimation system and program
Gelling Bird song recognition using gmms and hmms
Zeng et al. Adaptive context recognition based on audio signal
JPWO2020049687A1 (en) Speech processing equipment, audio processing methods, and programs
US20240079027A1 (en) Synthetic voice detection method based on biological sound, recording medium and apparatus for performing the same

Legal Events

Date Code Title Description
A621 Written request for application examination

Free format text: JAPANESE INTERMEDIATE CODE: A621

Effective date: 20100520

A977 Report on retrieval

Free format text: JAPANESE INTERMEDIATE CODE: A971007

Effective date: 20110822

A131 Notification of reasons for refusal

Free format text: JAPANESE INTERMEDIATE CODE: A131

Effective date: 20110830

A521 Request for written amendment filed

Free format text: JAPANESE INTERMEDIATE CODE: A523

Effective date: 20111027

A131 Notification of reasons for refusal

Free format text: JAPANESE INTERMEDIATE CODE: A131

Effective date: 20111122

A521 Request for written amendment filed

Free format text: JAPANESE INTERMEDIATE CODE: A523

Effective date: 20120117

TRDD Decision of grant or rejection written
A01 Written decision to grant a patent or to grant a registration (utility model)

Free format text: JAPANESE INTERMEDIATE CODE: A01

Effective date: 20120626


A61 First payment of annual fees (during grant procedure)

Free format text: JAPANESE INTERMEDIATE CODE: A61

Effective date: 20120709

R150 Certificate of patent or registration of utility model

Free format text: JAPANESE INTERMEDIATE CODE: R150

FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20150803

Year of fee payment: 3

LAPS Cancellation because of no payment of annual fees