JP2009020460A - Voice processing device and program - Google Patents

Voice processing device and program

Info

Publication number
JP2009020460A
Authority
JP
Japan
Prior art keywords
section
acoustic model
sound
voice
correlation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
JP2007184874A
Other languages
Japanese (ja)
Other versions
JP5050698B2 (en)
Inventor
Yasuo Yoshioka (靖雄 吉岡)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yamaha Corp
Original Assignee
Yamaha Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yamaha Corp filed Critical Yamaha Corp
Priority claimed from JP2007184874A
Publication of JP2009020460A
Application granted
Publication of JP5050698B2
Legal status: Expired - Fee Related
Anticipated expiration


Landscapes

  • Telephonic Communication Services (AREA)
  • Telephone Function (AREA)

Abstract

PROBLEM TO BE SOLVED: To distinguish, within an audio signal, sections in which a target sound exists from sections containing only sound other than the target sound.
SOLUTION: A speech segmentation unit 12 divides the audio signal S on the time axis into sounding sections PA and non-sounding sections PB. A storage device 20 stores a general acoustic model of the target sound. A selection processing unit 13 determines, for each sounding section PA, whether the acoustic model correlates with the feature amounts of the audio signal S in that section, selects the sounding sections PA that correlate with the acoustic model as effective sections PA1, and selects the sounding sections PA that do not correlate with the acoustic model as rejection sections PA2. A speech classification unit 14 classifies, by speaker, the sounding sections PA selected as effective sections PA1 from among the sounding sections PA demarcated by the speech segmentation unit 12, based on the feature amounts of the audio signal S in each section.
COPYRIGHT: (C)2009,JPO&INPIT

Description

The present invention relates to a technique for dividing an audio signal into a plurality of sections on the time axis.

Various techniques for dividing an audio signal into a plurality of sections along the time axis have been proposed. For example, Patent Document 1 and Patent Document 2 disclose techniques for classifying an audio signal into sounding sections and non-sounding sections according to the result of comparing the S/N ratio of the audio signal with a predetermined threshold.
JP 59-99497 A; International Publication No. WO 2007/017993

However, with techniques such as those of Patent Document 1 and Patent Document 2, which sort sounding sections from non-sounding sections according to the S/N ratio, sections containing noise present at the time of recording (for example, the operating sound of air-conditioning equipment or the sound of a door opening and closing) may be selected as sounding sections. If sounds other than the originally intended sound (for example, human utterances) are mixed into the sounding sections, the accuracy of processing that targets the sounding sections (for example, classification of each section) deteriorates. Against this background, an object of the present invention is to distinguish, within an audio signal, sections in which a target sound exists from sections in which it does not.

To solve the above problems, a speech processing apparatus according to the present invention comprises: storage means for storing an acoustic model of a target sound; speech segmentation means for dividing an audio signal into a plurality of sections on the time axis; correlation determination means for determining whether the acoustic model correlates with the feature amounts of the audio signal in each section; and section selection means for selecting, among the plurality of sections, sections determined to correlate with the acoustic model as effective sections and sections determined not to correlate with the acoustic model as rejection sections. With this configuration, each section can be sorted into effective sections and rejection sections according to whether the acoustic model of the target sound correlates with the feature amounts of the audio signal in that section. Accordingly, by selectively using only the effective sections, for example, the accuracy of speech processing applied to the audio signal of each section (for example, classification by speaker or speech recognition) can be improved.

In a preferred aspect of the invention, the speech segmentation means divides the audio signal into sounding sections and non-sounding sections, the correlation determination means determines the presence or absence of correlation by comparing an index value of the correlation between the acoustic model and the feature amounts of the audio signal in each section with a first threshold, and the apparatus further comprises threshold setting means for setting the first threshold according to the characteristics of the audio signal in the non-sounding sections. Because the first threshold is set variably according to the characteristics of the audio signal in the non-sounding sections, the accuracy of the determination by the correlation determination means is improved compared with a configuration in which the first threshold is fixed.

In another preferred aspect, the apparatus comprises voiced determination means for determining, for each of the plurality of sections, whether the ratio of the number of voiced frames in that section to the total number of frames in that section exceeds a second threshold, and the section selection means selects as an effective section a section that the correlation determination means has determined to correlate with the acoustic model and for which the voiced determination means has determined that the ratio of voiced frames exceeds the second threshold. Because only sections in which the ratio of voiced frames exceeds the second threshold are selected as effective sections, sections of noise that merely resembles the target sound can be sorted into rejection sections.

In a further preferred aspect, the correlation determination means compares only the feature amounts of the voiced frames in each section with the acoustic model. Because the difference between a target sound such as a human utterance and noise is particularly pronounced in the characteristics of voiced sound, comparing only the feature amounts of voiced frames with the acoustic model improves the accuracy of the determination by the correlation determination means.

A speech processing apparatus according to a specific aspect of the invention comprises speech classification means for classifying, by speaker, the sections that the section selection means has selected as effective sections, based on the feature amounts of the audio signal in each of those sections. Because an effective section is likely to contain the target sound, feature amounts extracted from the audio signal of an effective section faithfully reflect the characteristics of the target sound. Accordingly, by restricting classification to effective sections, each section can be classified with high accuracy according to the characteristics of the audio signal.

The speech processing apparatus according to the present invention may be realized by hardware (electronic circuitry) such as a DSP (Digital Signal Processor) dedicated to audio processing, or by the cooperation of a general-purpose arithmetic processing device such as a CPU (Central Processing Unit) with a program. A program according to the present invention causes a computer equipped with storage means storing an acoustic model of a target sound to execute: a speech segmentation process that divides an audio signal into a plurality of sections on the time axis; a correlation determination process (for example, step S5 in FIG. 3) that determines whether the acoustic model correlates with the feature amounts of the audio signal in each section; and a section selection process (for example, steps S6 and S10 in FIG. 3) that selects sections determined to correlate with the acoustic model as effective sections to be processed and sections determined not to correlate with the acoustic model as rejection sections excluded from processing. This program provides the same operation and effects as the speech processing apparatus according to the present invention. The program may be provided to users in a form stored on a computer-readable recording medium and installed on a computer, or may be distributed from a server apparatus via a communication network and installed on a computer.

The present invention is also specified as a method of processing speech. A speech processing method according to one aspect of the invention includes: a speech segmentation step of dividing an audio signal into a plurality of sections on the time axis; a correlation determination step of determining whether an acoustic model stored in a storage device correlates with the feature amounts of the audio signal in each section; and a section selection step of selecting sections determined to correlate with the acoustic model as effective sections to be processed and sections determined not to correlate with the acoustic model as rejection sections excluded from processing. This method provides the same operation and effects as the speech processing apparatus according to the present invention.

A speech processing apparatus according to another aspect of the present invention comprises: speech segmentation means for dividing an audio signal into a plurality of sections on the time axis; voiced determination means for determining, for each of the plurality of sections, whether the ratio of the number of voiced frames in that section to the total number of frames in that section exceeds a threshold; and section selection means for selecting, as effective sections, sections in which the ratio of voiced frames exceeds the threshold and, as rejection sections, sections in which that ratio falls below the threshold. In this aspect, each section is sorted into effective sections and rejection sections according to the ratio of voiced frames in that section. Accordingly, by selectively using only the effective sections, for example, the accuracy of speech processing applied to the audio signal of each section (for example, classification by speaker or speech recognition) can be improved.

<A: First Embodiment>
FIG. 1 is a block diagram showing the configuration of a speech processing apparatus according to the first embodiment of the present invention. As shown in the figure, the speech processing apparatus 100 is a computer system comprising a control device 10 and a storage device 20. The control device 10 is an arithmetic processing device that executes a program. The storage device 20 stores the program executed by the control device 10 and various data used by the control device 10; any known storage medium, such as a semiconductor storage device or a magnetic storage device, may be used as the storage device 20. An output device 30 is connected to the control device 10. In this embodiment the output device 30 is a display device that displays various images under the control of the control device 10.

The storage device 20 stores an audio signal S representing the waveform of sound on the time axis. In this embodiment the audio signal S represents sound recorded with a sound-collecting device at a meeting in which a plurality of participants speak at arbitrary times. Part (A) of FIG. 2 illustrates the waveform of the audio signal S on the time axis. By executing the program stored in the storage device 20, the control device 10 generates minutes of the meeting from the audio signal S. The minutes are a record of the meeting in which the contents (text) of each participant's utterances are arranged in chronological order.

The storage device 20 further stores an acoustic model representing the acoustic characteristics of the sound to be processed by the speech processing apparatus 100 (hereinafter the "target sound"). In this embodiment the target sound is human speech. That is, acoustic feature amounts (for example, MFCCs (Mel Frequency Cepstral Coefficients)) are extracted from the utterances of many speakers and of varied content, and the extracted feature amounts are processed statistically to generate an acoustic model representing the general (average) characteristics of human speech. The control device 10 may generate the acoustic model, or an acoustic model generated by an external device may be stored in the storage device 20.

The acoustic model of this embodiment is, for example, a mixture model λ that models the distribution of the feature amounts (MFCC vectors) extracted from the many and varied sample utterances as a weighted sum of M probability distributions (M being a natural number of 2 or more). Any known technique, such as the EM (Expectation-Maximization) algorithm, may be used to generate the mixture model λ. The mixture model λ of this embodiment is a Gaussian mixture model expressed, as a weighted sum of M normal distributions, by the following equation (1):
λ = {pi, μi, Σi} (i = 1 to M) ……(1)
In equation (1), pi is the weight of the i-th normal distribution; the sum of the weights p1 to pM is 1. μi is the mean vector of the i-th normal distribution, and Σi is the covariance matrix of the i-th normal distribution. Note that for symbols that actually denote vectors, such as μi in equation (1), this specification states that the symbol denotes a vector (for example, "mean vector") but omits the vector notation (the rightward arrow above the character).
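As an illustration only (this code is not part of the patent disclosure), the following Python sketch shows how such a general acoustic model could be built from MFCC vectors of many sample utterances; the choice of librosa and scikit-learn, the number of mixture components, and the function names are assumptions made for the example.

```python
# Hypothetical sketch: a GMM acoustic model lambda = {p_i, mu_i, Sigma_i}
# fitted (EM algorithm) to MFCC feature vectors pooled from sample utterances.
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def extract_mfcc(signal, sr, n_mfcc=13):
    """Return one MFCC feature vector per frame, shape (num_frames, n_mfcc)."""
    return librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc).T

def train_acoustic_model(sample_signals, sr, M=32):
    """Fit a Gaussian mixture to MFCCs pooled over many sample utterances."""
    features = np.vstack([extract_mfcc(s, sr) for s in sample_signals])
    gmm = GaussianMixture(n_components=M, covariance_type='full', random_state=0)
    gmm.fit(features)
    # gmm.weights_, gmm.means_, gmm.covariances_ correspond to p_i, mu_i, Sigma_i.
    return gmm
```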

As shown in FIG. 1, the control device 10 functions as a speech segmentation unit 12, a selection processing unit 13, a speech classification unit 14, and a speech recognition unit 16. Each function of the control device 10 in FIG. 1 may also be realized by an electronic circuit such as a DSP dedicated to audio processing, and the control device 10 may be implemented as a plurality of integrated circuits.

As shown in part (D) of FIG. 2, the speech segmentation unit 12 divides the audio signal S stored in the storage device 20 into a plurality of sounding sections PA and a plurality of non-sounding sections PB along the time axis. A sounding section PA is a section in which sound (the target sound or noise) is present, and a non-sounding section PB is a section in which no sound is present or in which the volume is sufficiently low.

The speech segmentation unit 12 executes a first process and a second process. As shown in part (B) of FIG. 2, the first process detects, as sounding sections PA, sections of the audio signal S in which the S/N ratio or the volume (amplitude) exceeds a threshold. Sections other than the sounding sections PA become non-sounding sections PB.

When utterances by a plurality of speakers follow one another without a gap or partially overlap, the first process alone cannot readily separate the audio signal S by speaker. As shown in parts (C) and (D) of FIG. 2, the speech segmentation unit 12 therefore executes a second process that further divides the sounding sections PA at each of the valleys D appearing in the envelope E of the waveform of the audio signal S. In a typical human utterance, the volume gradually increases from the start of the utterance and gradually decreases from some midpoint toward its end. Consequently, by dividing sounding sections PA at the valleys D, utterances by different speakers are separated into distinct sounding sections PA even when several utterances are continuous or overlap (see the sketch after this paragraph). The total number of sounding sections PA after division by the speech segmentation unit 12 is hereinafter denoted J (J being an integer of 2 or more). Any known technique other than those exemplified above may be used to detect the sounding sections PA and the non-sounding sections PB.
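The following is a minimal Python sketch, not part of the patent disclosure, of one way the two processes could be combined: an energy threshold marks sounding frames (first process), and each sounding run is then split at valleys of a smoothed envelope (second process). Frame length, hop size, thresholds, and the smoothing window are illustrative assumptions.

```python
# Hypothetical sketch of the speech segmentation unit 12 (steps are assumptions).
import numpy as np
from scipy.signal import find_peaks

def frame_energy(signal, frame_len=1024, hop=512):
    n = 1 + max(0, (len(signal) - frame_len) // hop)
    return np.array([np.sum(signal[i*hop:i*hop+frame_len]**2) for i in range(n)])

def segment(signal, frame_len=1024, hop=512, energy_thresh=1e-3, smooth=15):
    energy = frame_energy(signal, frame_len, hop)
    sounding = energy > energy_thresh                       # first process
    envelope = np.convolve(energy, np.ones(smooth)/smooth, mode='same')
    sections, start = [], None
    for i, s in enumerate(sounding):
        if s and start is None:
            start = i
        elif not s and start is not None:
            sections.append((start, i))
            start = None
    if start is not None:
        sections.append((start, len(sounding)))
    split_sections = []                                     # second process:
    for a, b in sections:                                   # split at valleys D of envelope E
        valleys, _ = find_peaks(-envelope[a:b])
        bounds = [a] + [a + v for v in valleys] + [b]
        split_sections += [(bounds[k], bounds[k+1]) for k in range(len(bounds)-1)]
    return split_sections   # sounding sections PA (frame indices); the gaps are PB
```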

When sections in which the S/N ratio or the volume exceeds a threshold are detected as sounding sections PA in this way, sections of the audio signal S containing sound other than the target sound (for example, the ringing of a telephone) may also be detected as sounding sections PA. As shown in part (E) of FIG. 2, the selection processing unit 13 therefore sorts the sounding sections PA demarcated by the speech segmentation unit 12 into sections in which the target sound is likely to be present (hereinafter "effective sections" PA1) and sections in which the target sound is unlikely to be present (hereinafter "rejection sections" PA2). That is, among the sounding sections PA, sections in which the target sound is absent are removed as rejection sections PA2. The specific operation of the selection processing unit 13 is described later.

The speech classification unit 14 in FIG. 1 classifies, by speaker, the audio signal S of each effective section PA1 selected by the selection processing unit 13 from among the sounding sections PA. The non-sounding sections PB demarcated by the speech segmentation unit 12 and the rejection sections PA2 selected by the selection processing unit 13 are excluded from classification. Any known clustering technique may be used to classify the effective sections PA1.

For example, the speech classification unit 14 performs a frequency analysis including an FFT (Fast Fourier Transform) on the audio signal S in each effective section PA1 to extract acoustic feature amounts (for example, MFCCs) of the audio signal S in that section, and classifies the effective sections PA1 into clusters such that effective sections PA1 with similar feature amounts belong to a common cluster. Effective sections PA1 that were likely uttered by the same speaker are therefore classified into a common cluster. The speech classification unit 14 then stores in the storage device 20, in association with one another, an identification code for each speaker, the start and end times of each effective section PA1 classified into that speaker's cluster, and the audio signal S of each such effective section PA1. In a configuration in which the user specifies the number of meeting participants as a known quantity, the effective sections PA1 are preferably classified into that number of clusters.
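A minimal sketch of this clustering step, not taken from the patent, is shown below; summarizing each effective section by its mean MFCC vector and using k-means with the known number of participants are illustrative assumptions.

```python
# Hypothetical sketch of the speech classification unit 14.
import numpy as np
from sklearn.cluster import KMeans

def classify_by_speaker(effective_sections_mfcc, n_speakers):
    """effective_sections_mfcc: list of (num_frames, n_mfcc) arrays, one per PA1."""
    summaries = np.vstack([m.mean(axis=0) for m in effective_sections_mfcc])
    labels = KMeans(n_clusters=n_speakers, random_state=0).fit_predict(summaries)
    return labels   # labels[j] = speaker cluster of the j-th effective section
```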

The speech recognition unit 16 identifies, as text, the content of each speaker's utterances from the audio signal S of the effective sections PA1 classified into each cluster. Any known speech recognition technique may be used to recognize text from the audio signal S of each effective section PA1. For example, the speech recognition unit 16 first updates an initial acoustic model according to the acoustic feature amounts of the audio signal S of the effective sections PA1 classified into one cluster (speaker adaptation), thereby generating an acoustic model that specifically reflects the voice characteristics of the speaker corresponding to that cluster, and second, identifies the text of the utterances by comparing the speaker-adapted acoustic model with the feature amounts extracted from the audio signal S of each effective section PA1 in the cluster.
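The adaptation step can be pictured with the following sketch, which is not part of the patent disclosure: the means of an initial GMM are shifted toward the MFCC statistics of one cluster's effective sections (a MAP-style mean update). The relevance factor r is an assumption, and decoding the adapted model into text is left to an external recognizer.

```python
# Hypothetical sketch of the speaker-adaptation step of the speech recognition unit 16.
import numpy as np

def adapt_means(gmm, cluster_features, r=16.0):
    """cluster_features: (N, n_mfcc) MFCCs pooled from one cluster's PA1 sections."""
    gamma = gmm.predict_proba(cluster_features)          # responsibilities, shape (N, M)
    n = gamma.sum(axis=0)                                # soft counts per component
    x_bar = (gamma.T @ cluster_features) / np.maximum(n[:, None], 1e-10)
    alpha = (n / (n + r))[:, None]
    return alpha * x_bar + (1.0 - alpha) * gmm.means_    # adapted mean vectors
```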

The control device 10 outputs the result of the processing by the speech recognition unit 16 to the output device 30. The output device 30 displays an image of the minutes in which the time of each utterance, the speaker's identification code (for example, the speaker's name), and the text identified by the speech recognition unit 16 for that utterance are arranged in chronological order.

Next, a specific example of the processing by the selection processing unit 13 is described with reference to FIG. 3. The process in FIG. 3 starts when the processing by the speech segmentation unit 12 is completed. As shown in the figure, the selection processing unit 13 extracts feature amounts of the audio signal S for each of the J sounding sections PA (step S1). More specifically, the selection processing unit 13 performs a frequency analysis on each of the frames F into which each sounding section PA is divided (see part (A) of FIG. 2), and extracts as feature amounts the time series of MFCC vectors (hereinafter "feature vectors") of the frames F in that sounding section PA. The feature amounts extracted in step S1 are, however, not limited to MFCCs.

Next, the selection processing unit 13 selects, from the J sounding sections PA, the earliest (oldest) sounding section PA not yet selected (step S2). The selection processing unit 13 then sets a threshold TH1 according to the audio signal S of the non-sounding section PB immediately preceding the start of the sounding section PA selected in step S2 (hereinafter the "selected section PA_S") (step S3). Because a non-sounding section PB is basically a section containing only noise (environmental sound), step S3 amounts to setting the threshold TH1 according to the noise present when the audio signal S was recorded. Specifically, the selection processing unit 13 calculates the average intensity of the audio signal S in the non-sounding section PB immediately preceding the selected section PA_S (hereinafter the "noise level") and variably controls the threshold TH1 so that the higher the noise level, the smaller TH1 becomes.
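A minimal sketch of step S3 follows; it is not from the patent, which only states that TH1 is made smaller as the noise level rises, so the particular noise measure (RMS of the preceding PB) and the mapping from noise level to threshold are assumptions for illustration.

```python
# Hypothetical sketch of step S3: noise-adaptive setting of threshold TH1.
import numpy as np

def set_threshold_th1(preceding_pb_samples, base_th1=1e-3, alpha=4.0):
    # noise level taken as the RMS amplitude of the preceding non-sounding section PB
    noise_level = np.sqrt(np.mean(preceding_pb_samples ** 2))
    return base_th1 / (1.0 + alpha * noise_level)   # higher noise -> smaller TH1
```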

Next, the selection processing unit 13 calculates a correlation index value indicating the degree of correlation between the mixture model λ stored in the storage device 20 and the feature vectors x of the audio signal S in the selected section PA_S (step S4). More specifically, the selection processing unit 13 calculates, as the correlation index value, a value L (hereinafter the "average likelihood") obtained by averaging, over all feature vectors x in the selected section PA_S, the likelihood that each feature vector x is generated by the mixture model λ.

If a feature vector x is a D-dimensional vector, the likelihood p(x|λ) that the feature vector x is generated by the mixture model λ is calculated by the following equation (2):
p(x|λ) = Σ[i=1 to M] pi・N(x; μi, Σi) ……(2)
where N(x; μi, Σi) = (2π)^(-D/2)・|Σi|^(-1/2)・exp{-(1/2)(x-μi)'Σi^(-1)(x-μi)} is the D-dimensional normal density with mean vector μi and covariance matrix Σi.

In step S4, the selection processing unit 13 calculates the average likelihood L by substituting the K feature vectors x (x1 to xK) of the selected section PA_S into equation (3). As understood from equation (3), the more similar the speech characteristics represented by the acoustic model are to the characteristics of the audio signal S in the selected section PA_S, the larger the average likelihood L becomes:
L = (1/K)・Σ[k=1 to K] p(xk|λ) ……(3)
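As an illustration only (not part of the patent disclosure), the average likelihood of equation (3) could be computed with the GMM sketched earlier; GaussianMixture.score_samples returns the per-frame log-likelihood log p(x|λ) of equation (2), so the exponential is taken before averaging. The use of scikit-learn is an assumption.

```python
# Hypothetical sketch of step S4 in the first embodiment.
import numpy as np

def average_likelihood(gmm, section_features):
    """section_features: (K, n_mfcc) feature vectors x1..xK of the selected section PA_S."""
    log_p = gmm.score_samples(section_features)   # log p(x_k | lambda), equation (2)
    return float(np.mean(np.exp(log_p)))          # L = (1/K) * sum_k p(x_k | lambda)
```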

The selection processing unit 13 then determines whether the average likelihood L of the selected section PA_S falls below the threshold TH1 (step S5). Because the mixture model λ comprehensively reflects a wide variety of human utterances, the sound in a selected section PA_S whose average likelihood L falls below the threshold TH1 is unlikely to be a human utterance. Accordingly, when the result of step S5 is affirmative (L < TH1), the selection processing unit 13 sorts the current selected section PA_S into a rejection section PA2 (step S6). As described above, step S5 determines, from the presence or absence of correlation between the audio signal S and the mixture model λ, whether the sound in the selected section PA_S could be a human utterance.

Because the feature vectors x extracted in step S1 are affected by noise in the audio signal S, if the threshold TH1 were a fixed value, the higher the noise level of the audio signal S, the more likely the result of step S5 would be affirmative even for sounding sections PA that actually contain the target sound. In this embodiment the threshold TH1 is set to a smaller value as the noise level of the audio signal S increases (that is, the rate at which step S5 yields an affirmative result decreases), which reduces the likelihood that a selected section PA_S containing the target sound is misjudged as a rejection section PA2.

Even when the target sound is not actually present in the selected section PA_S, the result of step S5 is negative (that is, the section is not judged to be a rejection section PA2) if the selected section PA_S contains noise resembling a human utterance. When the result of step S5 is negative, the selection processing unit 13 therefore sorts the selected section PA_S into an effective section PA1 or a rejection section PA2 by a method that does not use the mixture model λ (steps S7 to S9).

When a person speaks naturally (that is, unless the person deliberately and continuously utters only unvoiced sound), voiced sound tends to be present for more than a certain proportion of the duration of the utterance. In this embodiment, therefore, the selected section PA_S is sorted into an effective section PA1 (a sounding section PA rich in voiced sound) or a rejection section PA2 (a sounding section PA rich in unvoiced sound) according to the proportion of voiced sound within the selected section PA_S.

In step S7, the selection processing unit 13 determines, for each of the frames F in the selected section PA_S, whether the sound indicated by the audio signal S is voiced or unvoiced. Any known technique may be used for the voiced/unvoiced determination. For example, the selection processing unit 13 calculates for each frame F the maximum value of the autocorrelation function, which serves as an index of the periodicity of the audio signal S (hereinafter the "autocorrelation value"), judges frames F whose autocorrelation value exceeds a predetermined value (that is, frames F in which the audio signal S is highly periodic) to be voiced, and judges frames F whose autocorrelation value falls below the predetermined value to be unvoiced. A configuration in which only frames F from which a clear pitch (fundamental frequency) is detected in the audio signal S are judged to be voiced may also suitably be employed.

Next, the selection processing unit 13 calculates the ratio R of the number of frames F judged to be voiced in step S7 to the total number of frames F in the selected section PA_S (step S8), and determines whether the ratio R exceeds a predetermined threshold TH2 (step S9). When the determination in step S9 is negative (that is, when the proportion of unvoiced frames F in the selected section PA_S is high), the selection processing unit 13 sorts the current selected section PA_S into a rejection section PA2 (step S6). When the determination in step S9 is affirmative (that is, when the proportion of voiced frames F in the selected section PA_S is high), the selection processing unit 13 sorts the selected section PA_S into an effective section PA1 (step S10).
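A minimal sketch of steps S7 to S9, not from the patent, is given below: a frame is judged voiced when its maximum normalized autocorrelation within a plausible pitch-lag range exceeds a fixed value, and the section passes when the ratio R of voiced frames exceeds TH2. The lag range (80-400 Hz) and the constants are assumptions.

```python
# Hypothetical sketch of the voiced/unvoiced judgement and the ratio test.
import numpy as np

def is_voiced(frame, sr, fmin=80.0, fmax=400.0, min_corr=0.5):
    frame = frame - np.mean(frame)
    denom = np.sum(frame ** 2)
    if denom <= 0.0:
        return False
    corr = np.correlate(frame, frame, mode='full')[len(frame)-1:] / denom
    lag_lo, lag_hi = int(sr / fmax), min(int(sr / fmin), len(corr) - 1)
    if lag_lo >= lag_hi:
        return False
    return np.max(corr[lag_lo:lag_hi]) > min_corr   # autocorrelation value vs. fixed value

def voiced_ratio_ok(frames, sr, th2=0.4):
    flags = [is_voiced(f, sr) for f in frames]
    ratio_r = sum(flags) / len(flags)                # ratio R of step S8
    return ratio_r > th2                             # step S9: effective section if R > TH2
```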

After executing step S6 or step S10, the selection processing unit 13 determines whether all sounding sections PA of the audio signal S have been sorted (step S11). If the result of step S11 is negative, the selection processing unit 13 selects the sounding section PA immediately following the current selected section PA_S as the new selected section PA_S in step S2 and executes the processing from step S3 onward. When all sounding sections PA have been sorted (step S11: YES), the selection processing unit 13 ends the processing of FIG. 3.

As described above, in this embodiment the sounding sections PA are distinguished into effective sections PA1 and rejection sections PA2 according to whether the target sound is present, so the influence of noise can be effectively reduced by excluding the rejection sections PA2, which do not contain the target sound, from processing by the speech classification unit 14 and the speech recognition unit 16. For example, reducing the influence of noise makes it possible to extract feature amounts that faithfully reflect the characteristics of each speaker's voice, which improves the accuracy of speech processing that uses feature amounts, such as the classification of the sounding sections PA (effective sections PA1) by the speech classification unit 14 and the speaker adaptation and speech recognition performed by the speech recognition unit 16. Accurate minutes can therefore be created from the audio signal S.

In the embodiment described above, a sounding section PA is sorted into an effective section PA1 or a rejection section PA2 according to whether the audio signal S correlates with an acoustic model of human speech (the mixture model λ). A configuration (hereinafter the "comparative example") that instead uses an acoustic model of the noise that may occur when the audio signal S is recorded (hereinafter a "noise model") is also conceivable. In the comparative example, a sounding section PA is sorted into a rejection section PA2 when the correlation between the audio signal S and the noise model is high, and into an effective section PA1 when that correlation is low.

However, the characteristics of noise are far more diverse than those of human speech. Even if a noise model is created for specific assumed noises, the audio signal S is therefore likely to contain noise that the noise model does not cover; that is, the comparative example cannot sufficiently remove sounding sections PA that do not contain the target sound. In this embodiment, by contrast, an acoustic model of human speech is used, so sounding sections PA that do not contain the target sound can be removed effectively even when the audio signal S contains a wide variety of noises.

<B: Second Embodiment>
A second embodiment of the present invention is described next. In this embodiment, VQ (Vector Quantization) distortion is adopted as the correlation index value between the acoustic model and the audio signal S in place of the average likelihood L of the first embodiment. In the following embodiments, elements whose functions and operations are equivalent to those of the first embodiment are given the same reference signs, and their detailed description is omitted as appropriate.

The acoustic model stored in advance in the storage device 20 is a codebook CA generated from the many feature amount (MFCC) vectors extracted from the many and varied sample utterances. Any known technique, such as the k-means method or the LBG algorithm, may be used to generate the codebook.

In step S4 of FIG. 3, the selection processing unit 13 calculates the VQ distortion D based on the codebook CA stored in the storage device 20 and the feature vectors x (for example, MFCCs) extracted in step S1 from the audio signal S of the selected section PA_S. The VQ distortion D is calculated, for example, by the following equation (4):
D = (1/nB)・Σ[j=1 to nB] min[1≤i≤|CA|] d(xj, CA(i)) ……(4)

In equation (4), |CA| is the size of the codebook CA, and CA(i) is the i-th code vector (centroid vector) in the codebook CA. xj denotes the j-th (j = 1 to nB) of the nB feature vectors x1 to xnB extracted from the selected section PA_S (nB being the number of frames F in the selected section PA_S), and d(X, Y) is the Euclidean distance between vectors X and Y. That is, the VQ distortion D is the value obtained by averaging, over the nB feature vectors x1 to xnB, the minimum (min) distance between each feature vector x of the selected section PA_S and the |CA| centroid vectors of the codebook CA serving as the acoustic model.
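The following Python sketch, not part of the patent disclosure, illustrates this second embodiment: the codebook CA is obtained with k-means from the sample MFCC vectors, and equation (4) averages, over the section's nB frames, the distance from each feature vector to its nearest centroid. Codebook size and library choices are assumptions.

```python
# Hypothetical sketch of the codebook and the VQ distortion of equation (4).
import numpy as np
from sklearn.cluster import KMeans
from scipy.spatial.distance import cdist

def train_codebook(sample_features, size=64):
    """sample_features: (N, n_mfcc) MFCC vectors pooled from the sample utterances."""
    return KMeans(n_clusters=size, random_state=0).fit(sample_features).cluster_centers_

def vq_distortion(codebook, section_features):
    """Equation (4): mean over frames of the distance to the nearest centroid."""
    distances = cdist(section_features, codebook)   # shape (n_B, |C_A|)
    return float(np.mean(np.min(distances, axis=1)))
```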

As understood from the above, the more the sound in the selected section PA_S resembles human speech, the smaller the VQ distortion D becomes. Accordingly, in step S3 of FIG. 3, the selection processing unit 13 variably controls the threshold TH1 so that the higher the noise level in the non-sounding section PB immediately preceding the selected section PA_S, the larger TH1 becomes. In step S5 of FIG. 3, the selection processing unit 13 determines whether the VQ distortion D exceeds the threshold TH1, sorting the selected section PA_S into a rejection section PA2 when it does (step S5: YES) and proceeding to step S7 when it does not (step S5: NO). The other operations are the same as in the first embodiment, and this embodiment provides the same effects as the first embodiment.

<C: Modifications>
Various modifications can be made to each of the above embodiments. Specific examples of modifications are given below. Two or more of the following modes may be selected and combined arbitrarily.

(1) Modification 1
In the embodiments above, the feature amounts (feature vectors x) extracted from every frame F of a sounding section PA are compared with the acoustic model regardless of whether the frame is voiced or unvoiced, but a configuration in which only the feature amounts extracted from the voiced frames F of the sounding section PA are compared with the acoustic model may also be employed. In this case the acoustic model stored in the storage device 20 is generated from feature amounts of the voiced portions of the sample speech, excluding unvoiced and silent portions. The selection processing unit 13 calculates the average likelihood L (the VQ distortion D in the second embodiment) in step S4 of FIG. 3 using only the feature amounts extracted from the voiced frames F among the frames F in the selected section PA_S, and determines in step S5 whether the acoustic model correlates with the audio signal S in the selected section PA_S. Because the difference between noise and the target sound is particularly pronounced in the characteristics of voiced sound, using only the voiced frames F of the sounding section PA for comparison with the acoustic model, as in this modification, improves the accuracy of the determination in step S5.
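As a small illustrative sketch (not from the patent), this modification amounts to filtering the frames before computing the correlation index; is_voiced() and average_likelihood() refer to the earlier illustrative sketches, and the fallback value for a section with no voiced frame is an assumption.

```python
# Hypothetical sketch of Modification 1: voiced frames only in step S4.
import numpy as np

def average_likelihood_voiced_only(gmm, frames, frame_features, sr):
    voiced = [i for i, f in enumerate(frames) if is_voiced(f, sr)]
    if not voiced:
        return 0.0                        # no voiced frame: treat as uncorrelated
    return average_likelihood(gmm, frame_features[np.asarray(voiced)])
```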

(2) Modification 2
In the embodiments above, the threshold TH1 is set based on the noise level in the non-sounding section PB immediately preceding the selected section PA_S (step S3), but the basis for setting the threshold TH1 may be changed as appropriate. For example, a configuration may be adopted in which the threshold TH1 is set based on the noise level in the first non-sounding section PB of the audio signal S and that threshold TH1 is applied in common in step S5 when sorting every sounding section PA. With the configuration of the first embodiment, in which the noise level of the non-sounding section PB immediately preceding the selected section PA_S is applied to the sorting of that selected section PA_S, the threshold TH1 is updated according to the changed noise level even when the noise level changes partway through the audio signal S, which reduces the possibility that the accuracy of the sorting in step S5 deteriorates.

In the embodiments above, the threshold TH2 in step S9 is a fixed value, but a configuration in which the threshold TH2 is variably controlled in the same manner as the threshold TH1 (by the methods exemplified in the first embodiment and in this modification) may also be adopted. The higher the noise level of the audio signal S, the larger the error in the ratio R calculated in step S8, so with a fixed threshold TH2 the possibility increases that a selected section PA_S containing the target sound is misjudged as a rejection section PA2. The selection processing unit 13 therefore sets the threshold TH2 so that the higher the noise level in the non-sounding section PB immediately preceding the selected section PA_S (or in the first non-sounding section PB of the audio signal S), the smaller TH2 becomes. This configuration reduces the possibility that a selected section PA_S containing the target sound is misjudged as a rejection section PA2.

(3) Modification 3
In the embodiments above, a single acoustic model is used, but a plurality of acoustic models may be used selectively to sort the sounding sections PA into effective sections PA1 and rejection sections PA2. For example, a plurality of acoustic models generated from a plurality of kinds of speech with different average pitches are created in advance and stored in the storage device 20. In step S4 of FIG. 3, the selection processing unit 13 detects the pitch (average pitch) of the audio signal S in the selected section PA_S and calculates the average likelihood L (the VQ distortion D in the second embodiment) using the acoustic model, among the plurality of acoustic models, that corresponds to that pitch. With this configuration, the sounding sections PA can be accurately sorted into effective sections PA1 and rejection sections PA2 even when the audio signal S contains speech of diverse pitches, as when male and female voices are mixed.
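A minimal sketch of this model selection, not from the patent, is shown below; the use of librosa.yin for pitch estimation, the pitch search range, and pairing each model with a reference pitch are assumptions for illustration.

```python
# Hypothetical sketch of Modification 3: choosing the acoustic model by average pitch.
import numpy as np
import librosa

def select_model_by_pitch(section_signal, sr, models_by_pitch):
    """models_by_pitch: list of (reference_pitch_hz, acoustic_model) pairs."""
    f0 = librosa.yin(section_signal, fmin=65.0, fmax=500.0, sr=sr)
    avg_pitch = float(np.mean(f0))
    ref, model = min(models_by_pitch, key=lambda pm: abs(pm[0] - avg_pitch))
    return model   # used in step S4 for this selected section
```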

(4) Modification 4
The method by which the speech segmentation unit 12 divides the audio signal S is not limited to the examples above. For example, only one of the first process, which divides the audio signal S into sounding sections PA and non-sounding sections PB according to its S/N ratio or volume, and the second process, which divides the audio signal S at the valleys D of the envelope E, may be executed. A configuration in which the audio signal S is divided into sections of fixed or variable length set independently of the characteristics of the audio signal S may also be adopted. That is, dividing the signal into sounding sections PA and non-sounding sections PB is not essential to the present invention.

(5) Modification 5
In the embodiments above, both the determination in step S5, which uses the correlation index value with respect to the acoustic model (the average likelihood L or the VQ distortion D), and the determination in step S9, which uses the ratio R of voiced frames F, are executed. However, a configuration in which each sounding section PA is sorted into an effective section PA1 or a rejection section PA2 based only on the result of the determination in step S5 (that is, a configuration in which steps S7 to S9 of FIG. 3 are omitted) may also be adopted, as may a configuration based only on the determination in step S9 (that is, a configuration in which steps S3 to S5 of FIG. 3 are omitted).

(6) Modification 6
A printing device that prints the minutes created by the speech processing apparatus 100 may be adopted as the output device 30. The result of the processing by the speech processing apparatus 100 need not be output in the form of minutes (text); for example, the result of the classification by the speech classification unit 14 may be output instead. For example, with a configuration in which the audio signal S of the effective section PA1 containing a time specified by the user, among the effective sections PA1 classified by the speech classification unit 14, is output as sound waves from a sound-emitting device (for example, a loudspeaker), the user can be effectively assisted in creating the minutes of a meeting while selectively listening to and verifying each speaker's utterances. A configuration may also be adopted in which the result of the selection processing unit 13 sorting the sounding sections PA into effective sections PA1 and rejection sections PA2 is output from the speech processing apparatus 100 to an external device. In the external device, processing equivalent to that of the speech classification unit 14 in FIG. 1, or other suitable processing, is applied to the output from the speech processing apparatus 100. For example, only the effective sections PA1 selected by the selection processing unit 13 from among the sounding sections PA may be selectively output to the external device, and predetermined processing (classification by speaker or speech recognition) executed on each effective section PA1 in the external device. As described above, the speech classification unit 14 and the speech recognition unit 16 are not essential elements of the speech processing apparatus 100.

(7) Modification 7
In each of the above embodiments, the audio signal S stored in advance in the storage device 20 is the target of processing, but the processing may instead be executed in real time on an audio signal S supplied from a sound collection device (microphone) or supplied sequentially via a communication network. Furthermore, the type of sound represented by the audio signal S is arbitrary in the present invention. For example, in a configuration in which an acoustic model whose target sound is the performance sound of a specific musical instrument is stored in the storage device 20, sections of sound other than the target sound (for example, sections of applause) can be excluded as rejection sections PA2 from the audio signal S recorded at a concert of that instrument.
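A real-time variant could, for example, buffer blocks arriving from a microphone or a network stream and apply the division and selection described above whenever enough audio has accumulated; the following sketch is only a rough assumption of such a loop, and the helper `classify_section` is hypothetical.

```python
# Rough sketch of real-time operation (an assumption, not the disclosed design).
def run_realtime(block_source, classify_section, min_samples=16000):
    """block_source: iterable yielding audio sample blocks (e.g. from a microphone);
    classify_section: callable applying the division/selection sketched above."""
    buffer = []
    for block in block_source:
        buffer.extend(block)
        if len(buffer) >= min_samples:   # roughly one second at 16 kHz
            classify_section(buffer)
            buffer = []
```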

FIG. 1 is a block diagram showing the configuration of a speech processing apparatus according to the first embodiment of the present invention. FIG. 2 is a conceptual diagram for explaining the operation of the speech processing apparatus. FIG. 3 is a flowchart showing the operation of the selection processing unit.

Explanation of Symbols

100: speech processing apparatus, 10: control device, 12: voice division unit, 13: selection processing unit, 14: voice classification unit, 16: voice recognition unit, 20: storage device, 30: output device, PA: sounding section, PB: non-sounding section, S: audio signal.

Claims (5)

1. A speech processing apparatus comprising:
storage means for storing an acoustic model of a target sound;
voice division means for dividing an audio signal into a plurality of sections on a time axis;
correlation determination means for determining whether or not there is a correlation between the acoustic model and a feature amount of the audio signal in each of the sections; and
section selection means for selecting, from among the plurality of sections, a section determined to be correlated with the acoustic model as an effective section, and selecting a section determined not to be correlated with the acoustic model as a rejection section.
2. The speech processing apparatus according to claim 1, wherein
the voice division means divides the audio signal into a sounding section and a non-sounding section,
the correlation determination means determines the presence or absence of the correlation by comparing an index value of the correlation between the acoustic model and the feature amount of the audio signal in each section with a first threshold value, and
the apparatus further comprises threshold setting means for setting the first threshold value in accordance with characteristics of the audio signal in the non-sounding section.
3. The speech processing apparatus according to claim 1 or claim 2, further comprising voiced-sound determination means for determining, for each of the plurality of sections, whether or not the ratio of the number of voiced-sound frames in the section to the total number of frames in the section exceeds a second threshold value,
wherein the section selection means selects, as an effective section, a section that the correlation determination means determines to be correlated with the acoustic model and for which the voiced-sound determination means determines that the ratio of the number of voiced-sound frames exceeds the second threshold value.
4. The speech processing apparatus according to any one of claims 1 to 3, wherein the correlation determination means compares only feature amounts of voiced-sound frames in each of the plurality of sections with the acoustic model.
5. A program for causing a computer comprising storage means for storing an acoustic model of a target sound to execute:
a voice division process of dividing an audio signal into a plurality of sections on a time axis;
a correlation determination process of determining whether or not there is a correlation between the acoustic model and a feature amount of the audio signal in each of the sections; and
a section selection process of selecting, from among the plurality of sections, a section determined to be correlated with the acoustic model as an effective section to be subjected to processing, and selecting a section determined not to be correlated with the acoustic model as a rejection section excluded from the processing.
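For orientation only, the three processes recited in claim 5 can be sketched as follows; the helpers `divide_into_sections` and `correlates_with_model` are hypothetical stand-ins, and the sketch is not the claimed program itself.

```python
# Illustrative sketch of the three processes of claim 5, wired together with
# hypothetical helpers (divide_into_sections, correlates_with_model).
def process(signal, acoustic_model, threshold):
    sections = divide_into_sections(signal)            # voice division process
    effective, rejected = [], []
    for sec in sections:
        # correlation determination process: compare the section's feature
        # amounts with the stored acoustic model of the target sound
        if correlates_with_model(sec, acoustic_model, threshold):
            effective.append(sec)                      # selected as effective section
        else:
            rejected.append(sec)                       # selected as rejection section
    return effective, rejected
```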
JP2007184874A 2007-07-13 2007-07-13 Voice processing apparatus and program Expired - Fee Related JP5050698B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2007184874A JP5050698B2 (en) 2007-07-13 2007-07-13 Voice processing apparatus and program

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP2007184874A JP5050698B2 (en) 2007-07-13 2007-07-13 Voice processing apparatus and program

Publications (2)

Publication Number Publication Date
JP2009020460A true JP2009020460A (en) 2009-01-29
JP5050698B2 JP5050698B2 (en) 2012-10-17

Family

ID=40360112

Family Applications (1)

Application Number Title Priority Date Filing Date
JP2007184874A Expired - Fee Related JP5050698B2 (en) 2007-07-13 2007-07-13 Voice processing apparatus and program

Country Status (1)

Country Link
JP (1) JP5050698B2 (en)


Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS61138299A (en) * 1984-12-10 1986-06-25 日本電信電話株式会社 Voice section detection system
JPS6456499A (en) * 1987-08-27 1989-03-03 Matsushita Electric Ind Co Ltd Voice recognition
JPH04369695A (en) * 1991-06-19 1992-12-22 Matsushita Electric Ind Co Ltd Voice decision device
JPH06110488A (en) * 1992-09-30 1994-04-22 Matsushita Electric Ind Co Ltd Method and device for speech detection
JPH08305388A (en) * 1995-04-28 1996-11-22 Matsushita Electric Ind Co Ltd Voice range detection device
JP2002023800A (en) * 1998-08-21 2002-01-25 Matsushita Electric Ind Co Ltd Multi-mode sound encoder and decoder
JP2005195955A (en) * 2004-01-08 2005-07-21 Toshiba Corp Device and method for noise suppression
JP2006133284A (en) * 2004-11-02 2006-05-25 Kddi Corp Voice information extracting device


Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2009053430A (en) * 2007-08-27 2009-03-12 Yamaha Corp Speech processing device and program
WO2010092914A1 (en) * 2009-02-13 2010-08-19 日本電気株式会社 Method for processing multichannel acoustic signal, system thereof, and program
JP5605574B2 (en) * 2009-02-13 2014-10-15 日本電気株式会社 Multi-channel acoustic signal processing method, system and program thereof
US9009035B2 (en) 2009-02-13 2015-04-14 Nec Corporation Method for processing multichannel acoustic signal, system therefor, and program
JP2012048119A (en) * 2010-08-30 2012-03-08 Nippon Telegr & Teleph Corp <Ntt> Voice interval detecting method, speech recognition method, voice interval detector, speech recognition device, and program and storage method therefor
US9117456B2 (en) 2010-11-25 2015-08-25 Fujitsu Limited Noise suppression apparatus, method, and a storage medium storing a noise suppression program
JP2018200617A (en) * 2017-05-29 2018-12-20 京セラドキュメントソリューションズ株式会社 Information processing system
JP2021021749A (en) * 2019-07-24 2021-02-18 富士通株式会社 Detection program, detection method, and detection device
JP7331523B2 (en) 2019-07-24 2023-08-23 富士通株式会社 Detection program, detection method, detection device
JPWO2022168251A1 (en) * 2021-02-05 2022-08-11
JP7333878B2 (en) 2021-02-05 2023-08-25 三菱電機株式会社 SIGNAL PROCESSING DEVICE, SIGNAL PROCESSING METHOD, AND SIGNAL PROCESSING PROGRAM
CN114242116A (en) * 2022-01-05 2022-03-25 成都锦江电子系统工程有限公司 Comprehensive judgment method for voice and non-voice of voice

Also Published As

Publication number Publication date
JP5050698B2 (en) 2012-10-17

Similar Documents

Publication Publication Date Title
CN108305615B (en) Object identification method and device, storage medium and terminal thereof
EP1210711B1 (en) Sound source classification
JP5050698B2 (en) Voice processing apparatus and program
EP0625774B1 (en) A method and an apparatus for speech detection
JP4568371B2 (en) Computerized method and computer program for distinguishing between at least two event classes
Kos et al. Acoustic classification and segmentation using modified spectral roll-off and variance-based features
JP7342915B2 (en) Audio processing device, audio processing method, and program
US20050192795A1 (en) Identification of the presence of speech in digital audio data
EP2083417B1 (en) Sound processing device and program
Archana et al. Gender identification and performance analysis of speech signals
JP5647455B2 (en) Apparatus, method, and program for detecting inspiratory sound contained in voice
JP4973352B2 (en) Voice processing apparatus and program
WO2018163279A1 (en) Voice processing device, voice processing method and voice processing program
JP5083951B2 (en) Voice processing apparatus and program
Grewal et al. Isolated word recognition system for English language
CN114303186A (en) System and method for adapting human speaker embedding in speech synthesis
JP4877114B2 (en) Voice processing apparatus and program
JP5109050B2 (en) Voice processing apparatus and program
JPH06110488A (en) Method and device for speech detection
JP2006154212A (en) Speech evaluation method and evaluation device
JP7159655B2 (en) Emotion estimation system and program
Gelling Bird song recognition using gmms and hmms
Zeng et al. Adaptive context recognition based on audio signal
JPWO2020049687A1 (en) Speech processing equipment, audio processing methods, and programs
US20240079027A1 (en) Synthetic voice detection method based on biological sound, recording medium and apparatus for performing the same

Legal Events

Date Code Title Description
A621 Written request for application examination

Free format text: JAPANESE INTERMEDIATE CODE: A621

Effective date: 20100520

A977 Report on retrieval

Free format text: JAPANESE INTERMEDIATE CODE: A971007

Effective date: 20110822

A131 Notification of reasons for refusal

Free format text: JAPANESE INTERMEDIATE CODE: A131

Effective date: 20110830

A521 Request for written amendment filed

Free format text: JAPANESE INTERMEDIATE CODE: A523

Effective date: 20111027

A131 Notification of reasons for refusal

Free format text: JAPANESE INTERMEDIATE CODE: A131

Effective date: 20111122

A521 Request for written amendment filed

Free format text: JAPANESE INTERMEDIATE CODE: A523

Effective date: 20120117

TRDD Decision of grant or rejection written
A01 Written decision to grant a patent or to grant a registration (utility model)

Free format text: JAPANESE INTERMEDIATE CODE: A01

Effective date: 20120626


A61 First payment of annual fees (during grant procedure)

Free format text: JAPANESE INTERMEDIATE CODE: A61

Effective date: 20120709

R150 Certificate of patent or registration of utility model

Free format text: JAPANESE INTERMEDIATE CODE: R150

FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20150803

Year of fee payment: 3

LAPS Cancellation because of no payment of annual fees