JP2005173007A - Voice analysis processing, voice processor using same, and medium

Info

Publication number: JP2005173007A
Application number: JP2003410337A
Authority: JP
Inventors: Hirotaka Shiiyama; 弘隆椎山
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2003-12-09
Filing date: 2003-12-09
Publication date: 2005-06-30

Abstract

PROBLEM TO BE SOLVED: Not only to treat voice data on which speaker discrimination is difficult or impossible as voice data of an unknown speaker for the time being, but also to reduce the amount of such unknown-speaker voice data as far as possible. SOLUTION: The decision for an utterance section that lacks sufficient information is deferred, the analysis is first run to completion, and speaker discrimination accuracy is improved by reprocessing the deferred sections after the speaker characteristic information has been enriched. In addition, the characteristics of the detected speakers are compared with one another to reduce errors in which one speaker is mistaken for several speakers. COPYRIGHT: (C)2005,JPO&NCIPI

Description

The present invention relates to voice analysis processing and to a voice processing apparatus using the same.

Conventionally, Japanese Patent Application Laid-Open No. 8-286693 discloses an invention that analyzes voice and uses the result to sort voice data by speaker, thereby allowing the voice data of a desired speaker to be searched for and reproduced. Japanese Patent Application Laid-Open No. 2000-148189 discloses an invention that, when voice analysis is performed, records the state in which the speaker cannot be identified for voice data on which speaker discrimination is impossible, which inevitably arises in practice.

As described above, when voice data is analyzed and speaker discrimination is actually performed, insufficient consideration has been given to voice data on which speaker discrimination is difficult or impossible because of a shortage of information, in particular because the voice data is short. An object of the present invention is not only to treat such voice data temporarily as voice data of an unknown speaker, but also to reduce the amount of unknown-speaker voice data as far as possible.

A further problem is that discrimination sometimes fails and a single speaker is mistakenly discriminated as several speakers. Reducing this error as far as possible is also an important object.

To solve the above problems, in the voice analysis processing of the present invention, when the speakers who appear are discriminated, the personal features of an utterance are extracted each time a human utterance is detected. If the amount of information in those features does not satisfy a predetermined condition, the utterance section is stored in the analysis result information as an unknown-speaker utterance section, treated as a separate speaker on its own. After the voice signal under analysis has been processed to the end, such unknown-speaker utterance sections are processed again, once or recursively.

If the amount of information in the extracted personal features satisfies the predetermined condition and no speaker has been detected yet, then, as an exceptional case, the utterance is treated as that of the first speaker, and its personal features are stored in the analysis result information together with its utterance section. If speakers have already been detected, the personal features of the utterance in this section are compared with those of the already detected persons; when the utterance is estimated to belong to a person different from all of them, its personal features and utterance section are stored in the analysis result information as those of a new person.

On the other hand, when the utterance in this section is estimated to belong to an already detected person, the utterance section is appended to the analysis result information.

The personal features of an utterance described above include at least a set of acoustic feature information for each vowel present in the utterance section, because it is well known that the individuality of a voice appears in its vowels. In the comparison processing, for each already detected person, a distance is computed between the sets of acoustic feature information for the vowels present in the detected utterance section, and the value is compared with a predetermined threshold. This makes it possible to judge whether the detected utterance section belongs to one of the already detected persons or to a new person.

More specifically, the sets of acoustic feature information for vowels used as personal features can be expressed as sets of formant information or cepstrum information for all vowels present in the utterance section; as the voice pitch together with sets of formant information for all vowels present in the utterance section; or as the voice pitch together with sets of formant information or cepstrum information for all vowels present in the utterance section.

This is because vowels are known to have vocal-tract resonance frequencies called formants, which are also used for personal authentication. Recognizing the vowels and obtaining formant information at the stable point of each vowel is therefore useful for discriminating individuals.

Like formants, the cepstrum yields an analysis result that reflects the resonance frequencies of the vocal tract, and is likewise useful for discriminating individuals.

The stable point is a point within the span judged to be a vowel that lies near the center of that span and near the point of maximum energy.

Voice pitch, formants, and the cepstrum are all used in known speech recognition processing.

The comparison processing is performed by obtaining a known formant distance or cepstrum distance and comparing it with a predetermined threshold.

The voice pitch reflects differences in the height of the voice, such as those between men and women or children, and helps to discriminate people whose vocal-cord vibration differs even when their vocal-tract resonance frequencies are the same. The comparison may therefore also be performed by obtaining a distance as a weighted sum of the formant distance or cepstrum distance and the absolute value of the difference in pitch frequency, and comparing it with a predetermined threshold.

The predetermined condition on the amount of personal feature information under which a section is treated as an unknown-speaker utterance section, that is, as a separate speaker on its own, is that the number of vowel types present in the detected utterance section is at or below a certain threshold, or that the minimum required combination of vowel types is not present.

For example, when the utterance section is short and contains few vowels, or when the vowels present are similar in nature and do not form a combination that is effective for comparison, a meaningful formant distance or cepstrum distance cannot be calculated.

Therefore, for an utterance section whose vowel types, or combination of vowel types, do not meet the condition under which a meaningful formant distance or cepstrum distance can be calculated, the speaker decision is postponed and the section is treated as an unknown-speaker utterance section. The decision is postponed because the personal features of the detected speakers become richer as processing proceeds.
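
The deferral condition described above might be sketched in Python as follows. This is only an illustrative sketch: the function name `has_enough_vowels`, the default threshold `min_kinds`, and the `required_sets` encoding of the "minimum required combination" are assumptions and are not given in the specification.

```python
def has_enough_vowels(present, min_kinds=3, required_sets=None):
    """Return True when a section carries enough vowel information to compare.

    present: iterable of vowel labels found in the section, e.g. "a", "i".
    min_kinds: hypothetical threshold on the number of distinct vowel types.
    required_sets: optional list of vowel sets (hypothetical); when given,
        at least one of them must be fully present in the section.
    """
    kinds = set(present)
    if required_sets is not None:
        # Combination criterion: some required combination is fully covered.
        return any(set(req) <= kinds for req in required_sets)
    # Count criterion: enough distinct vowel types were observed.
    return len(kinds) >= min_kinds
```

A section for which this returns False would be stored as an unknown-speaker section and revisited later.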

This enrichment is achieved as follows: when a detected utterance section is judged to belong to an already detected person, and that person's personal features lack a vowel that is present in the detected utterance section, the set of acoustic feature information for that vowel is stored in the personal features.

After the entire voice has been analyzed, the personal features of the already detected persons are compared with those of each unknown-speaker utterance section, and the unknown-speaker sections are remapped to already detected persons. This remapping may be performed once or recursively.

To address the problem of one speaker being mistakenly discriminated as several speakers, the personal features of the detected persons are compared with one another, and when the distance is at or below a certain threshold, their analysis information is integrated as that of a single speaker. Specifically, the utterance section information that had been attributed to the other speaker is merged into the utterance section information of the one speaker.

The reprocessing of unknown-speaker utterance sections and the merging of utterance section information attributed to another speaker into that of a single speaker may be performed in either order.

According to the present invention, voice data on which speaker discrimination is difficult or impossible is temporarily treated as voice data of an unknown speaker, while the feature information for speaker discrimination is learned from the analysis results of voice data on which speaker discrimination is possible. After the analysis of a body of voice signals has finished, the voice data temporarily treated as unknown-speaker data is reprocessed using the feature information enriched by this learning, which makes it possible to reduce the amount of unknown-speaker voice data. Furthermore, by comparing the speaker features again after speaker discrimination, errors in which one speaker is mistakenly discriminated as several speakers can be reduced.

Using these processing results provides valuable information for realizing useful functions such as listening only to the utterance sections of a designated speaker.

Embodiments of the present invention are described in detail below.

The embodiment describes voice analysis processing in which no personal identification information is registered in advance, and utterance section information is generated for each speaker by discriminating the speakers who appear.

As shown in the conceptual diagram of FIG. 1, the embodiment consists of utterance section detection processing, utterance feature extraction processing, utterance feature comparison processing, and processing that determines the speaker from the comparison result and appends to the speaker information DB and the per-speaker utterance section information DB.

FIG. 2 shows the flow in this embodiment from utterance section detection through to generating the per-speaker utterance section information and storing it in the attribute information and analysis result information storage means. The description follows this flow.

First, in S201, the voice signal is read.

In the flow of FIG. 2 the voice signal is read into memory all at once, but an embodiment that reads it into a memory buffer successively as needed is of course also readily conceivable.

A known technique is used for the utterance section detection processing of S202. The voice signal is thereby divided into a plurality of utterance sections.

In S203, each time the start and then the end of a speech section are detected, the utterance feature extraction processing of S204 extracts, from the voice signal corresponding to that utterance section, features useful for discriminating the individual who is speaking.

As regards utterances, vowels are known to have vocal-tract resonance frequencies called formants, which are also used for personal authentication. The vowels are therefore recognized, formant information is obtained at the stable point of each vowel, and this information is stored for discriminating individuals.

Formants may be estimated with a known technique, for example by performing linear predictive analysis (LPC) based on an all-pole speech production model, finding the poles of the vocal-tract transfer function, and deriving the resonance frequencies from them.

The data schema shown in FIG. 5 is taken as an example of the utterance features.

The schema stores the utterance section, a set of formant information for each of the five Japanese vowels "a, i, u, e, o" present in that section, and the voice pitch frequency. If not all five vowels occur in the utterance, only the vowels that are present are described.

The utterance section field in FIG. 5 consists of a pair of a start position and an end position, as shown in FIG. 6, and the positions can be expressed as times.

As for the sets of formant information, it is generally held that formants up to the third are sufficient to express the nature of a vowel. Therefore, as shown in FIG. 7, a field consisting of three frequencies, the first, second, and third formant frequencies, is stored for each vowel in FIG. 5.
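
The schema just described (an utterance section, up to five vowels each with three formant frequencies, and a pitch frequency) could be represented, for illustration, as the following Python record. The class name `UtteranceFeature`, the field names, and the use of seconds and hertz as units are assumptions; the specification only describes the fields of FIG. 5 to FIG. 7 in prose.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# The five Japanese vowels named in the text.
VOWELS = ("a", "i", "u", "e", "o")


@dataclass
class UtteranceFeature:
    """One utterance section with its per-vowel formants and pitch."""
    start_sec: float                      # section start, expressed as a time
    end_sec: float                        # section end, expressed as a time
    # vowel -> (first, second, third formant frequency); absent vowels omitted
    formants: Dict[str, Tuple[float, float, float]] = field(default_factory=dict)
    pitch_hz: Optional[float] = None      # voice pitch frequency

    def missing_vowels(self) -> List[str]:
        """Vowels for which no formant information has been stored yet."""
        return [v for v in VOWELS if v not in self.formants]
```

For example, `UtteranceFeature(1.2, 2.0, {"a": (800.0, 1200.0, 2500.0)}, 120.0)` describes a section in which only the vowel "a" was observed.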

As the portion of the voice signal to be analyzed, that is, the stable point of a vowel, a span that is close to the peak of the smoothed energy and in which the pitch frequency is stable should be used.
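
One way to pick such a stable point from per-frame measurements is sketched below. It is an assumption-laden illustration rather than the method of the specification: the three-frame moving average used as "smoothed energy", the tolerance `pitch_tol_hz`, and the function name `stable_frame_index` are all hypothetical.

```python
def stable_frame_index(energies, pitches, pitch_tol_hz=5.0):
    """Pick the frame of a vowel span with peak smoothed energy and stable pitch.

    energies, pitches: per-frame values for one vowel span (same length).
    pitch_tol_hz: hypothetical tolerance for calling the local pitch stable.
    """
    n = len(energies)
    if n == 0 or n != len(pitches):
        raise ValueError("energies and pitches must be non-empty and equal length")

    def smoothed(i):
        # Moving average over the frame and its immediate neighbours.
        lo, hi = max(0, i - 1), min(n, i + 2)
        return sum(energies[lo:hi]) / (hi - lo)

    def pitch_is_stable(i):
        # Pitch varies by at most the tolerance within the same neighbourhood.
        lo, hi = max(0, i - 1), min(n, i + 2)
        window = pitches[lo:hi]
        return max(window) - min(window) <= pitch_tol_hz

    # Fall back to all frames when no frame has a stable pitch.
    candidates = [i for i in range(n) if pitch_is_stable(i)] or list(range(n))
    return max(candidates, key=smoothed)
```

The formant and pitch values for the vowel would then be taken at the returned frame.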

However, not every type of vowel is necessarily present in an utterance section, and accurate comparison is not possible when there are few vowel types or when the vowels are similar in nature and do not form a combination that is effective for comparison.

Therefore, if it is judged in S205 that the number of vowel types present in the detected utterance section is below a certain threshold, or that the minimum required combination of vowel types is not present, the section is stored in the analysis result information in S206 as an unknown-speaker utterance section, treated as a separate speaker on its own.

If it is judged in S205 that the number of vowel types present in the detected utterance section is at or above the threshold, or that the minimum required combination of vowel types is present, then in S207 the extracted utterance features are compared with the personal utterance features of the already detected persons shown in FIG. 8, specifically by means of the sum of the distances between the sets of formant information, to judge whether the utterance belongs to a person different from those already detected.

In the utterance feature comparison processing of S207, for each already detected person, the distance between the sets of formant information is computed only for the vowels present in the detected utterance section, and the minimum of these distances is obtained. In S208, if this minimum is at or below a predetermined threshold, the utterance is judged to be that of the person with the minimum distance. If the minimum is greater than the threshold, the utterance is judged to be that of a new person.

If the utterance is estimated in S208 to belong to an already detected person, then in S209 the utterance section is appended to the utterance section information record of FIG. 9 that corresponds to that person's speaker ID.

In addition, when the personal features of an already detected person lack formant information for some vowel, and the newly detected utterance section contains that vowel, the formant information of the vowel is stored in the personal features, thereby enriching them.

On the other hand, if the utterance is estimated in S208 not to belong to any already detected person, then in S211 a new speaker ID is issued for a new person, a new record is created in the speaker information DB shown in FIG. 6 and the personal features of the utterance are stored in it, and a new record is also created under that speaker ID in the utterance section information DB of FIG. 9 and the utterance section is stored in it.

At the start, of course, no person has been detected yet, so when the number of vowel types is at or above the threshold, or the minimum required combination of vowel types is present, the personal features of the utterance and its utterance section are simply stored in the analysis result information as those of a new person.
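
The per-section decision of S205 to S211 can be summarized in a short Python sketch. Everything named here is hypothetical: `assign_section`, the dictionary layout used for speakers, the `min_kinds` threshold, and in particular `toy_distance`, which stands in for the formant distance and uses one number per vowel purely to keep the example small.

```python
def toy_distance(a, b):
    """Stand-in distance: mean absolute difference over shared vowels."""
    common = set(a) & set(b)
    if not common:
        return float("inf")          # nothing to compare
    return sum(abs(a[v] - b[v]) for v in common) / len(common)


def assign_section(section, features, speakers, unknown, distance, threshold,
                   min_kinds=3):
    """Classify one utterance section; updates speakers/unknown in place.

    Returns "unknown", "existing", or "new".
    """
    if len(features) < min_kinds:                      # S205 -> S206: defer
        unknown.append((section, features))
        return "unknown"
    if speakers:
        # S207: nearest already detected speaker.
        best = min(speakers, key=lambda s: distance(features, s["features"]))
        if distance(features, best["features"]) <= threshold:   # S208
            best["sections"].append(section)                    # S209
            for vowel, value in features.items():
                # Enrich: store only vowels the speaker was missing.
                best["features"].setdefault(vowel, value)
            return "existing"
    # S211 (also the initial case with no speakers yet): register a new person.
    speakers.append({"features": dict(features), "sections": [section]})
    return "new"
```

Calling `assign_section` once per detected section builds up both the speaker list and the list of deferred sections.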

In FIG. 8 and FIG. 9 the personal feature information of the utterances and the utterance section information are kept separate, but they can of course also be merged into a single record for each speaker ID.

After the entire voice signal has been analyzed in this way, in S212 the utterance sections that were treated as unknown-speaker sections are compared once more with the personal features of the already detected persons and remapped to those persons.

This is done because the personal features are by then richer than when the section was first compared with the already detected speakers, so a more accurate judgment is possible.

Performing this processing once or recursively reduces the number of sections left as unknown-speaker utterance sections.

The processing flow of S212 is described in detail with reference to FIG. 3.

After the entire voice signal has been analyzed, S301 judges whether any unprocessed unknown-speaker utterance section remains. If none remains, the processing ends. If one remains, S302 reads the utterance features of that section, which were stored temporarily in S206 of FIG. 2, and S303 compares them with the utterance features of all already detected speakers.

S304 determines the minimum of the distances obtained in the comparison of S303 and the already detected speaker to which it corresponds, and compares that distance with a predetermined threshold. If the distance is at or below the threshold, the judgment is Y: S305 appends the unknown-speaker utterance section to the utterance section information of that speaker, S306 supplements any information missing from that speaker's utterance features, and the flow returns to S301 to judge whether any unprocessed unknown-speaker utterance section remains. This is repeated until no unprocessed section remains. If the distance is greater than the threshold in S304, the judgment is N, and in S307 the section is left in temporary storage as an unknown-speaker utterance section.

If the processing of FIG. 3 is performed recursively, then thanks to S306 a section that remained an unknown-speaker section in the previous pass may now be judged to be an utterance of an already detected speaker. A method that repeats the processing until no further section is newly attributed to an already detected speaker is therefore naturally also conceivable.
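
The repeated remapping pass of FIG. 3 might look like the following Python sketch. The names `remap_unknown` and `toy_distance` and the dictionary layout are hypothetical, and the toy distance (one number per vowel) only stands in for the formant distance of the specification.

```python
def toy_distance(a, b):
    """Stand-in distance: mean absolute difference over shared vowels."""
    common = set(a) & set(b)
    if not common:
        return float("inf")
    return sum(abs(a[v] - b[v]) for v in common) / len(common)


def remap_unknown(unknown, speakers, distance, threshold):
    """Repeat the remapping pass until no deferred section can be assigned.

    unknown: list of (section, features) pairs deferred earlier.
    speakers: list of {"features": ..., "sections": ...}, updated in place.
    Returns the pairs that are still unknown.
    """
    pending = list(unknown)
    changed = True
    while changed and pending and speakers:
        changed = False
        still_unknown = []
        for section, features in pending:
            # S303/S304: nearest detected speaker and threshold test.
            best = min(speakers, key=lambda s: distance(features, s["features"]))
            if distance(features, best["features"]) <= threshold:
                best["sections"].append(section)               # S305
                for vowel, value in features.items():
                    best["features"].setdefault(vowel, value)  # S306
                changed = True
            else:
                still_unknown.append((section, features))      # S307
        pending = still_unknown
    return pending
```

In the accompanying test scenario, a section containing only "i" cannot be matched in the first pass, but becomes matchable in the second pass once another remapped section has supplied the speaker's "i" information.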

Furthermore, after the processing of S212, the personal features of the detected persons can finally be compared with one another in S213 of FIG. 2 to eliminate errors in which one person was mistakenly split into several persons.

The processing flow of S213 is described in detail with reference to FIG. 4.

In S401, the utterance features of the already detected speakers are read from the speaker information DB, and all of these utterance features are compared with one another.

In S403, the smallest of the mutual comparison distances obtained in S402 is found, and the corresponding two already detected speakers are identified.

If this mutual comparison distance is at or below a predetermined threshold in S404, the information of these two already detected speakers is merged in S405 into that of a single already detected speaker.

Specifically, the other speaker's utterance features are merged into the utterance features in one speaker's speaker information DB record, the other speaker's utterance section information is merged into the utterance section information in that speaker's utterance section information DB record, and the other speaker's speaker information DB record and utterance section information DB record are then deleted.

After the set of already detected speakers and their utterance features have been updated in this way, the above processing is repeated until the minimum mutual comparison distance exceeds the predetermined threshold in S404.
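
The merge loop of FIG. 4 can be sketched in Python as below. The function names, the dictionary layout, and the toy distance are hypothetical; in particular, keeping the first speaker's value when both speakers have the same vowel is an assumption, since the text does not say how overlapping feature values are combined.

```python
from itertools import combinations


def toy_distance(a, b):
    """Stand-in distance: mean absolute difference over shared vowels."""
    common = set(a) & set(b)
    if not common:
        return float("inf")
    return sum(abs(a[v] - b[v]) for v in common) / len(common)


def merge_speakers(speakers, distance, threshold):
    """Merge the closest pair repeatedly until the closest pair is too far apart."""
    # Work on copies so the caller's records are left untouched.
    speakers = [{"features": dict(s["features"]), "sections": list(s["sections"])}
                for s in speakers]

    def pair_distance(pair):
        return distance(speakers[pair[0]]["features"],
                        speakers[pair[1]]["features"])

    while len(speakers) >= 2:
        # S402/S403: closest pair among all mutual comparisons.
        a, b = min(combinations(range(len(speakers)), 2), key=pair_distance)
        if pair_distance((a, b)) > threshold:          # S404: stop condition
            break
        keep, drop = speakers[a], speakers[b]
        for vowel, value in drop["features"].items():  # S405: merge features
            keep["features"].setdefault(vowel, value)
        keep["sections"] = sorted(keep["sections"] + drop["sections"])
        del speakers[b]                                # delete the other record
    return speakers
```

Because `combinations` always yields index pairs with the smaller index first, deleting the second record never shifts the position of the record that is kept.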

This corrects errors in which one person was mistakenly split into several persons.

Next, as an example of the utterance feature comparison processing of S207, a distance calculation using the sets of formant information and the pitch information is described.

As the vowel portion from which the formant information and pitch frequency are analyzed, a span that is close to the peak of the smoothed energy and in which the pitch frequency is stable is used.

Since formants up to the third are generally said to be sufficient to express the nature of a vowel, this embodiment introduces an evaluation function that takes formants up to the third into account.

The first, second, and third formants of each vowel in the utterance section of interest are compared with the first, second, and third formant information of each vowel of the already detected speakers, read from the speaker information DB shown in FIG. 6, to judge which speaker produced the utterance, or whether the speaker is new or unknown.

With this evaluation function, the more similar the personal utterance features are, the closer the value is to zero, and it is zero when they match completely.
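
The expression for the evaluation function itself is not reproduced in this text. Judging only from the symbol definitions given with it (per-vowel weights, a pitch weight, absolute formant differences over shared vowels, and normalization by WTotal), one form consistent with them would be the following. This is a reconstruction under those assumptions, not the original expression; the set C of vowels present in both feature sets is notation introduced here, and the original may scale or combine the terms differently.

```latex
\[
E \;=\; \frac{1}{W_{Total}} \sum_{i \in C} W_i \sum_{j=0}^{2}
        \bigl|\, Vow[i][j] - Vowref[i][j] \,\bigr|
   \;+\; W_{f0}\,\bigl|\, f - f_{ref} \,\bigr|,
\qquad
W_{Total} \;=\; \sum_{i \in C} W_i
\]
```

Under this form E is zero when all shared formants and the pitch frequencies coincide, which agrees with the behavior stated above.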

Here, Vow[i][j] is the value of the j-th formant frequency of vowel i in the detected utterance section, and f is the pitch frequency in that utterance section. Vowref[i][j] is the value of the j-th formant frequency of vowel i of an already detected person, and fref is the pitch frequency of that person.

Wi: weight according to how strongly each vowel expresses personal features; the more a vowel reflects personal features, the larger the value.

Wf0: weight for the pitch frequency
i: 0-4, index identifying the vowel
j: 0-2, index identifying the formant order
WTotal: sum of the weights Wi of the vowels for which formant information is present in both feature sets

The term |Vow[i][j] − Vowref[i][j]| is computed only when formant information for vowel i is present in both Vow and Vowref, and the formant comparison term is normalized by WTotal, the sum of the weights Wi of the vowels for which formant information is present in both.
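
As a rough illustration of these definitions, the following Python sketch evaluates a weighted sum of absolute formant differences over the shared vowels, normalized by WTotal, plus a weighted absolute pitch difference. The exact form of the evaluation function is assumed from the definitions above rather than taken from the specification, and the function name `evaluation`, the argument layout, and the choice to return infinity when no vowel is shared are all hypothetical.

```python
def evaluation(vow, f, vowref, fref, weights, w_f0):
    """Evaluation value between a section and a detected person (assumed form).

    vow, vowref: dict vowel -> (F1, F2, F3) formant frequencies.
    f, fref: pitch frequencies of the section and of the person.
    weights: dict vowel -> Wi; w_f0: weight Wf0 for the pitch term.
    """
    # Only vowels present on both sides take part in the formant term.
    common = [v for v in vow if v in vowref]
    if not common:
        return float("inf")      # assumption: nothing comparable, treat as far
    w_total = sum(weights[v] for v in common)
    formant_term = sum(
        weights[v] * sum(abs(x - y) for x, y in zip(vow[v], vowref[v]))
        for v in common
    ) / w_total
    return formant_term + w_f0 * abs(f - fref)
```

The value is zero for identical features and grows as the formants or the pitch diverge, matching the behavior described for the evaluation function.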

The personal utterance features of every person detected so far and those of the newly detected utterance section are evaluated with evaluation function (1.1) to obtain a set of evaluation values; the minimum value Emin is found, and the corresponding person becomes the candidate speaker. Emin is then compared with a threshold ε0, and if Emin < ε0 is satisfied, the candidate is judged to be the speaker and the utterance section is appended to the analysis result information.

If the personal utterance features of the candidate so determined lack a vowel, and the current utterance contains that vowel, the information for the vowel is additionally stored in the personal utterance features, thereby enriching them.

If Emin < ε0 is not satisfied, the utterance is judged to be that of a new speaker; new speaker information is created in the analysis result information, and the personal utterance features and the utterance section are stored in it.

Performing the above processing each time a human utterance is detected makes it possible to refer to the utterance sections of every person who appears, and to store the personal features of their utterances in the analysis result information.

（他の実施例）
本実施例においては、日本語を例にあげて説明を行ったが、外国語においては夫々の言語の母音、或いは母音が多い場合には人の弁別を行うのに有用な母音のみを選択して採用する事により本発明による処理を行う事も当然考えられる。 (Other examples)
In this embodiment, the explanation has been given by taking Japanese as an example. However, in a foreign language, if there are many vowels of each language, or if there are many vowels, only vowels useful for human discrimination are selected. Of course, it is conceivable to perform the processing according to the present invention.

An embodiment that stores the features of each utterance section in the utterance section information DB has been described, but it is of course also possible to discard this information once the speaker has been identified by comparing utterance features, rather than storing it in the DB.

However, utterance sections whose speaker is unknown must be held temporarily in a form similar to the schema of the utterance section information DB.

This embodiment gave an example of distance calculation using pairs of formant information together with pitch information, but an embodiment in which all formant information, including that in the DB schema, is replaced with cepstrum information is naturally also possible.

Furthermore, an example has been described in which utterance sections of unknown speakers, each treated as a single speaker, are first compared with the personal features of persons already detected and remapped to those persons, after which the personal features of the detected persons are compared with one another to eliminate errors in which one person was mistakenly split into several persons. An example in which this order is reversed is also conceivable.

It is also naturally conceivable to alternate the two processes: first compare the personal features of the detected persons with one another to eliminate errors in which one person was mistakenly split into several persons; then compare the utterance sections of unknown speakers, each treated as a single speaker, with the personal features of persons already detected and remap them to those persons; and then compare the personal features of the detected persons with one another once more to eliminate any remaining split-person errors.
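The two post-processing passes discussed above, remapping unknown-speaker sections and merging persons that were mistakenly split, might be sketched as below. This is a hedged illustration, not the patented procedure itself: the distance in `evaluate` is an assumed stand-in for evaluation function (1.1), reusing the threshold for merging is an assumption, and the function names and data layout are hypothetical.

```python
import math

def evaluate(known, new):
    """Assumed stand-in for evaluation function (1.1): mean Euclidean
    distance between formant pairs over the vowels both sets share."""
    shared = known.keys() & new.keys()
    if not shared:
        return math.inf
    return sum(math.dist(known[v], new[v]) for v in shared) / len(shared)

def remap_unknown(speakers, sections, unknown, eps0):
    """Try to remap each held (features, section) entry to a detected
    speaker. Returns the entries that still cannot be attributed."""
    remaining = []
    for features, section in unknown:
        scores = [evaluate(s, features) for s in speakers]
        e_min = min(scores, default=math.inf)
        if e_min < eps0:
            sections[scores.index(e_min)].append(section)
        else:
            remaining.append((features, section))
    return remaining

def merge_duplicates(speakers, sections, eps0):
    """Merge, in place, speakers whose features are closer than eps0,
    correcting cases where one person was split into several."""
    i = 0
    while i < len(speakers):
        j = i + 1
        while j < len(speakers):
            if evaluate(speakers[i], speakers[j]) < eps0:
                # Fold speaker j into speaker i: features, then sections.
                for vowel, pair in speakers[j].items():
                    speakers[i].setdefault(vowel, pair)
                sections[i].extend(sections[j])
                del speakers[j], sections[j]
            else:
                j += 1
        i += 1
```

Because both functions operate on the same lists, they can be called in either order, or alternately (merge, remap, merge again), matching the variations mentioned in the text.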

A conceptual diagram of the voice analysis processing of the present invention.
A diagram showing the utterance detection processing flow of the present invention.
A diagram showing the reprocessing flow for utterance sections whose speaker is unknown.
A diagram showing the processing flow that corrects errors in which one speaker was mistakenly discriminated as several.
A diagram showing an example of the utterance section information schema.
A diagram showing an example of the schema of the utterance section field.
A diagram showing an example of the schema of the formant field for each vowel.
A diagram showing an example of the schema of the speaker information DB.
A diagram showing an example of the schema of the per-speaker utterance section information DB.

Claims

A voice processing apparatus comprising:
means for detecting a person's utterance and its section; and
means for extracting personal features of the utterance,
wherein, in processing that stores utterance section information for each individual,
each time a person's utterance is detected, the personal features of the utterance are extracted; if the amount of information does not satisfy a predetermined condition, the utterance is stored in the analysis result information as an utterance section of an unknown speaker, treated as a single speaker; if the amount of information satisfies the predetermined condition, the personal features are compared with the personal features of the utterances of persons already detected, and, when the speaker is estimated to differ from every person already detected, the personal features of the utterance and the utterance section are stored in the analysis result information as a new person,
or, when the speaker is estimated to be the same as a person already detected, the utterance section is appended to the analysis result information, whereby analysis result information that can be referenced per speaker for all speakers is stored.

In claim 1, a voice processing apparatus characterized in that the personal features of an utterance include at least a pair of acoustic feature information for every vowel existing in the utterance section, and the comparison processing calculates, for each vowel of every person already detected, the distance between its pair of acoustic feature information and that of the same vowel present in the detected utterance section, and compares the value with a predetermined threshold to determine whether the detected utterance section belongs to a person already detected or to a new person.

In claim 2, a voice processing apparatus characterized in that the pair of acoustic feature information for vowels used as personal features is a pair of formant information or cepstrum information for all vowels existing in the utterance section, or the voice pitch together with a pair of formant information or cepstrum information for all vowels existing in the utterance section.

In claim 1, a voice processing apparatus characterized in that the predetermined condition on the amount of personal feature information, under which an utterance is handled as an utterance section of an unknown speaker treated as a single speaker, is that the number of vowel types present in the detected utterance section is at or below a threshold, or that a minimum required combination of vowel types is lacking.

In claim 1, claim 2, claim 3, and claim 4, a voice processing apparatus characterized in that, when the detected utterance section is determined to belong to a person already detected, that person's personal features lack a vowel, and that vowel exists in the detected utterance section, the personal features are enriched by storing the pair of acoustic feature information of that vowel in the personal features.

In claim 1, claim 2, claim 3, claim 4, claim 5 and claim 7, a voice processing apparatus characterized in that, after all the voice has been analyzed, the personal features of persons already detected are compared with the personal features of each utterance section of an unknown speaker treated as a single speaker, and processing that remaps such a section to a person already detected is performed once or recursively.

In claim 1, claim 2, claim 3, claim 4, claim 5, and claim 6, a voice processing apparatus characterized in that, after all the voice has been analyzed, the personal features of the detected persons are compared with one another and processing is performed to correct errors in which one person was divided into a plurality of persons.

An audio signal analysis processing method having:
means for detecting a person's utterance and its section; and
means for extracting personal features of the utterance,
wherein, in processing that stores utterance section information for each individual,
each time a person's utterance is detected, the personal features of the utterance are extracted; if the amount of information does not satisfy a predetermined condition, the utterance is stored in the analysis result information as an utterance section of an unknown speaker, treated as a single speaker; if the amount of information satisfies the predetermined condition, the personal features are compared with the personal features of the utterances of persons already detected, and, when the speaker is estimated to differ from every person already detected, the personal features of the utterance and the utterance section are stored in the analysis result information as a new person,
or, when the speaker is estimated to be the same as a person already detected, the utterance section is appended to the analysis result information, whereby analysis result information that can be referenced per speaker for all speakers is stored.

9. An audio signal analysis processing method characterized in that the personal features of an utterance include at least a pair of acoustic feature information for every vowel existing in the utterance section, and the comparison processing calculates, for each vowel of every person already detected, the distance between its pair of acoustic feature information and that of the same vowel present in the detected utterance section, and compares the value with a predetermined threshold to determine whether the detected utterance section belongs to a person already detected or to a new person.

In claim 9, an audio signal analysis processing method characterized in that the pair of acoustic feature information for vowels used as personal features is a pair of formant information or cepstrum information for all vowels existing in the utterance section, or the voice pitch together with a pair of formant information or cepstrum information for all vowels existing in the utterance section.

9. An audio signal analysis processing method characterized in that the predetermined condition on the amount of personal feature information, under which an utterance is handled as an utterance section of an unknown speaker treated as a single speaker, is that the number of vowel types present in the detected utterance section is at or below a threshold, or that a minimum required combination of vowel types is lacking.

In claim 8, claim 9, claim 10, and claim 11, an audio signal analysis processing method characterized in that, when the detected utterance section is determined to belong to a person already detected, that person's personal features lack a vowel, and that vowel exists in the detected utterance section, the personal features are enriched by storing the pair of acoustic feature information of that vowel in the personal features.

In claim 8, claim 9, claim 10, claim 11, and claim 14, an audio signal analysis processing method characterized in that, after all the voice has been analyzed, the personal features of persons already detected are compared with the personal features of each utterance section of an unknown speaker treated as a single speaker, and processing that remaps such a section to a person already detected is performed once or recursively.

In claim 8, claim 9, claim 10, claim 11, claim 12 and claim 13, an audio signal analysis processing method characterized in that the personal features of the detected persons are compared with one another and processing is performed to correct errors in which one person was divided into a plurality of persons.

8. A computer program for instructing an operation of a computer as the voice processing apparatus according to claim 1.

15. A computer program for instructing an operation capable of realizing the audio signal analysis processing method according to claim 8 by a computer.

8. A computer-readable storage medium storing program code for operating a computer as the voice processing apparatus according to claim 1.

15. A computer-readable storage medium storing a program code capable of implementing the audio signal analysis processing method according to claim 8 by a computer.