JPH1055195A

JPH1055195A - Speaker characteristic discrimination method by voice recognition

Info

Publication number: JPH1055195A
Application number: JP21083396A
Authority: JP
Inventors: Kazuyoshi Okura; 計美大倉; Shoji Takeda; 昭二武田
Original assignee: Sanyo Electric Co Ltd
Current assignee: Sanyo Electric Co Ltd
Priority date: 1996-08-09
Filing date: 1996-08-09
Publication date: 1998-02-24

Abstract

PROBLEM TO BE SOLVED: To provide a speaker characteristic discrimination method by speech recognition that can determine a characteristic of a speaker from speech uttered by an unspecified speaker. SOLUTION: The speech of many speakers with different speaker characteristics is analyzed. A first speech recognition model is created by statistically processing all of the analysis results, using only those analysis parameters in each result that carry much phoneme-discriminating information. Each analysis result is then segmented using the first speech recognition model, and the correspondence between the analysis parameters of each frame of each analysis result and the model parameters of the first speech recognition model is obtained. A second speech recognition model is then created for each speaker characteristic, based on that correspondence and on those analysis parameters of the associated frames that express the characteristics of the speaker.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Field of the Invention] The present invention relates to a speaker characteristic discrimination method by speech recognition, which determines speaker characteristics such as the speaker's gender and age by means of speech recognition.

[0002]

[Description of the Related Art] Methods for recognizing speech uttered by unspecified speakers have already been developed. Methods for determining a speaker's gender, age, or both from speech uttered by an unspecified speaker are also under development.

[0003]

[Problem to Be Solved by the Invention] An object of the present invention is to provide a speaker characteristic discrimination method by speech recognition that can determine speaker characteristics, such as the speaker's gender and age, from speech uttered by an unspecified speaker.

[0004]

[Means for Solving the Problem] A first speaker characteristic discrimination method by speech recognition according to the present invention comprises a learning process and a recognition process. The learning process comprises: a step of analyzing the speech of each of a large number of speakers having different speaker characteristics; a step of creating a first speech recognition model by statistically processing all of the analysis results, using only those analysis parameters in each analysis result that contain much phoneme-discriminating information; a step of segmenting each analysis result using the first speech recognition model and obtaining the correspondence between the analysis parameters of each frame of each analysis result and the model parameters of the first speech recognition model; and a step of creating a second speech recognition model for each speaker characteristic, based on that correspondence and on those analysis parameters of the associated frames that express the characteristics of the speaker. The recognition process comprises: a step of analyzing the speech data to be recognized; a step of segmenting the analysis result using the first speech recognition model and obtaining the correspondence between the analysis parameters of each frame of the analysis result and the model parameters of the first speech recognition model; and a step of determining, based on that correspondence and on the second speech recognition models for the respective speaker characteristics, which of the second speech recognition models the analysis result for the speech data to be recognized fits best, and taking the speaker characteristic corresponding to the best-fitting model as the speaker characteristic for the speech data to be recognized.

[0005] A second speaker characteristic discrimination method by speech recognition according to the present invention comprises a learning process and a recognition process. The learning process comprises: a step of subjecting the speech of each of a large number of speakers having different speaker characteristics to FFT cepstrum analysis; a step of creating a first phoneme HMM set by statistically processing all of the analysis results, using only the low-order analysis parameters in each analysis result, which contain much phoneme-discriminating information; a step of segmenting each analysis result using the first phoneme HMM set and obtaining the correspondence between the analysis parameters of each frame of each analysis result and the phoneme HMMs in the first phoneme HMM set; and a step of creating a second phoneme HMM set for each speaker characteristic, based on that correspondence and on the high-order parameters, among the analysis parameters of the associated frames, that express the characteristics of the speaker. The recognition process comprises: a step of analyzing the speech data to be recognized; a step of segmenting the analysis result using the first phoneme HMM set and obtaining the correspondence between the analysis parameters of each frame of the analysis result and the phoneme HMMs in the first phoneme HMM set; and a step of determining, based on that correspondence and on the second phoneme HMM sets for the respective speaker characteristics, which of the second phoneme HMM sets the analysis result for the speech data to be recognized fits best, and taking the speaker characteristic corresponding to the best-fitting model as the speaker characteristic for the speech data to be recognized.

[0006] The speaker characteristics referred to above are attributes such as age, gender, height, and weight, that is, items made up of features by which a speaker can be identified.

[0007]

[Embodiments of the Invention] An embodiment in which the present invention is applied to a method for determining the gender and age of a speaker is described below.

[0008] [1] Outline of the method for determining the gender and age of a speaker

[0009] The method for determining the gender and age of a speaker consists of a learning process, performed in advance of any gender and age determination, and a recognition process, which determines the speaker's gender and age based on the results of the learning process.

[0010] [1-1] Learning process

[0011] The learning process is performed as follows.

[0012] (1) The speech of a large number of speakers of different genders and ages is analyzed.

[0013] (2) A first speech recognition model is created by statistically processing all of the analysis results, using only those analysis parameters in each analysis result that contain much phoneme-discriminating information.

[0014] (3) Each analysis result is segmented using the first speech recognition model, and the correspondence between the analysis parameters of each frame of each analysis result and the model parameters of the first speech recognition model is obtained.

[0015] (4) A second speech recognition model is created for each gender and age category, based on the correspondence between the analysis parameters of each frame of each analysis result and the model parameters of the first speech recognition model, and on those analysis parameters of the associated frames that express gender and age.

[0016] [1-2] Recognition process

[0017] The recognition process is performed as follows.

[0018] (1) The speech data to be recognized is analyzed.

[0019] (2) The analysis result is segmented using the first speech recognition model, and the correspondence between the analysis parameters of each frame of the analysis result and the model parameters of the first speech recognition model is obtained.

[0020] (3) Based on that correspondence and on the second speech recognition models for the respective gender and age categories, it is determined which of the second speech recognition models the analysis result for the speech data to be recognized fits best. The gender and age corresponding to the best-fitting model are taken as the gender and age for the speech data to be recognized.

[0021] [2] Specific description of the method for determining the gender and age of a speaker using HMMs

[0022] The method for determining the gender and age of a speaker using HMMs (hidden Markov models) is described specifically below.

[0023] An HMM is a probabilistic model that approximates the statistical characteristics of speech by distributions such as Gaussian distributions. In an HMM, the correspondence between frames and the model is a correspondence between frames and individual states. Within each state, the gender and age characteristics are represented by the base distributions. Furthermore, when subword models are used, sentences and the like are recognized by concatenating the subword models, and segmentation is performed according to the recognized model sequence.

[0024] The following describes the case of a Gaussian-mixture HMM with diagonal covariance matrices. The phoneme HMMs are assumed to have a left-to-right structure with four states and three loops.
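The left-to-right, four-state, three-loop topology can be illustrated with a small transition table. The sketch below is one plausible reading, not the structure confirmed by the patent: it assumes that the first three states each carry a self-loop and that the fourth state is a loop-free exit state. The function name `left_to_right_transitions` and the self-loop probability of 0.6 are hypothetical.

```python
def left_to_right_transitions(num_states=4, self_loop=0.6):
    """Build a[s][q], the probability of moving from state s to state q.

    Assumption: every state except the last has a self-loop and a forward
    transition to the next state; the last state is an exit with no loop.
    """
    a = [[0.0] * num_states for _ in range(num_states)]
    for s in range(num_states - 1):
        a[s][s] = self_loop            # stay in the same state (loop)
        a[s][s + 1] = 1.0 - self_loop  # advance to the next state
    return a


transitions = left_to_right_transitions()
```

With the defaults this gives exactly three self-loops and no backward transitions, which is what makes the model left-to-right.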

[0025] [2-1] Learning process

The learning process is performed as follows.

[0026] (1) The speech of a large number of speakers of different genders and ages is subjected to FFT cepstrum analysis. Here, 128-dimensional parameters are assumed to be obtained.
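As a rough illustration of FFT cepstrum analysis, the sketch below computes the real cepstrum of one frame as the inverse transform of the log magnitude spectrum and keeps 128 coefficients. The patent does not state the frame length, window, or how the 128 dimensions are chosen, so the 256-sample frame, the Hamming window, the synthetic test signal, and the choice of the first 128 coefficients are all assumptions. A naive DFT stands in for the FFT to keep the example dependency-free.

```python
import cmath
import math


def dft(x):
    """Naive O(N^2) discrete Fourier transform; an FFT yields the same values."""
    n = len(x)
    return [
        sum(x[k] * cmath.exp(-2j * math.pi * b * k / n) for k in range(n))
        for b in range(n)
    ]


def real_cepstrum(frame, floor=1e-10):
    """Real cepstrum of one frame: inverse DFT of the log magnitude spectrum."""
    n = len(frame)
    # Floor the magnitude so that log() never sees zero.
    log_mag = [math.log(max(abs(c), floor)) for c in dft(frame)]
    return [
        sum(log_mag[b] * cmath.exp(2j * math.pi * b * k / n) for b in range(n)).real / n
        for k in range(n)
    ]


# Hypothetical frame: Hamming-windowed sum of two sinusoids.
FRAME_LEN = 256
frame = [
    (0.54 - 0.46 * math.cos(2 * math.pi * t / (FRAME_LEN - 1)))
    * (math.sin(2 * math.pi * 5 * t / FRAME_LEN)
       + 0.5 * math.sin(2 * math.pi * 40 * t / FRAME_LEN))
    for t in range(FRAME_LEN)
]
cepstrum = real_cepstrum(frame)
features = cepstrum[:128]  # one 128-dimensional parameter vector per frame
```

Because the real cepstrum of a real signal is symmetric, a 256-sample frame carries 128 distinct coefficients beyond the midpoint, which is one way a 128-dimensional vector could arise.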

[0027] (2) Of the analysis parameters contained in each analysis result, only the low-order parameters that contain much phoneme-discriminating information, for example dimensions 1 to 16, are used, and a first set of phoneme HMMs (the first speech recognition model) is created by a training rule such as the well-known forward-backward algorithm.
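The division of each parameter vector into a low-order part (used for the first phoneme HMM set) and a high-order part (used later for the speaker-characteristic models) can be sketched as follows. The helper name `split_features` is hypothetical, and the zero-based slicing convention is an assumption made for the example; the value 16 follows the "dimensions 1 to 16" example in the text.

```python
LOW_DIM = 16  # number of low-order dimensions, following the example in the text


def split_features(vector, low_dim=LOW_DIM):
    """Split one analysis vector into (low-order, high-order) parts."""
    return vector[:low_dim], vector[low_dim:]


# Hypothetical 128-dimensional vector whose entries are their 1-based positions.
vector = list(range(1, 129))
low, high = split_features(vector)
```

Here `low` holds dimensions 1 to 16 and `high` holds dimensions 17 to 128.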

[0028] The resulting set of speaker-independent phoneme HMMs is defined as $\Lambda = \{\lambda_1, \ldots, \lambda_i, \ldots, \lambda_I\}$, where $i$ is the index of a phoneme HMM and $I$ is the number of phoneme HMMs.

[0029] The $i$-th phoneme HMM $\lambda_i$ is expressed as $\lambda_i = \{w_{i,s,m},\ a_{i,s,q},\ \mu_{i,s,m},\ \sigma^2_{i,s,m}\}$.

[0030] Here, $w_{i,s,m}$, $\mu_{i,s,m}$, and $\sigma^2_{i,s,m}$ denote, respectively, the branch probability (mixture weight), the mean vector, and the vector of variances of the $m$-th Gaussian distribution in state $s$ of the $i$-th phoneme HMM, and $a_{i,s,q}$ denotes the transition probability from state $s$ to state $q$ of the $i$-th phoneme HMM.

[0031] (3) Next, for each analysis result classified by gender and age (training speech material $O$), the analysis result is segmented using the first set of phoneme HMMs, and the correspondence between the analysis parameters of each frame and the phoneme HMMs in the first phoneme HMM set is obtained. This processing is described below.

[0032] The training speech material $O$ is defined as shown in Equation 1 below.

[0033]

(Equation 1)

[0034] $O$ represents a training word, $o_t$ represents the feature vector at frame number $t$, and $T$ represents the number of frames in $O$.
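The image for Equation 1 is not reproduced in this text. Judging only from the definitions given for $o_t$ and $T$, a form consistent with them would be the frame sequence below. This is a reconstruction from context, not the published equation.

```latex
O = \{\, o_1,\ o_2,\ \ldots,\ o_t,\ \ldots,\ o_T \,\}
```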

[0035] Segmentation using the Viterbi algorithm (hereinafter, Viterbi segmentation) is performed on the training word $O$. A word HMM corresponding to $O$ is generated by concatenating phoneme HMMs from the first phoneme HMM set, and the correspondence $\Theta_c$ between the phoneme HMMs in the resulting word HMM and each frame of $O$ is obtained. The feature vectors of $O$ used in obtaining $\Theta_c$ are the low-order feature vectors, which contain much phoneme-discriminating information.
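Viterbi segmentation against a concatenated word HMM can be sketched as a forced alignment over a left-to-right chain of states. The example below is a simplified illustration, not the patent's procedure: it assumes a single diagonal Gaussian per state instead of a mixture, fixed stay/advance probabilities of 0.5, and one-dimensional toy features. The names `viterbi_align` and `log_gauss_diag`, the phoneme labels, and all numeric values are hypothetical.

```python
import math


def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def viterbi_align(frames, states, log_stay=math.log(0.5), log_move=math.log(0.5)):
    """Align frames to a left-to-right chain of states.

    states: list of (phoneme, state_index, mean, var) in word-HMM order.
    Returns one (phoneme, state_index) pair per frame.
    """
    num_frames, num_states = len(frames), len(states)
    if num_frames < num_states:
        raise ValueError("need at least one frame per state")
    neg_inf = float("-inf")
    score = [[neg_inf] * num_states for _ in range(num_frames)]
    back = [[0] * num_states for _ in range(num_frames)]
    score[0][0] = log_gauss_diag(frames[0], states[0][2], states[0][3])
    for t in range(1, num_frames):
        for s in range(num_states):
            stay = score[t - 1][s] + log_stay
            move = score[t - 1][s - 1] + log_move if s > 0 else neg_inf
            best, prev = (stay, s) if stay >= move else (move, s - 1)
            if best > neg_inf:
                score[t][s] = best + log_gauss_diag(frames[t], states[s][2], states[s][3])
                back[t][s] = prev
    # Trace back from the final state at the last frame.
    path = [num_states - 1]
    for t in range(num_frames - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return [(states[s][0], states[s][1]) for s in path]


# Toy word made of phoneme "a" (two states) followed by phoneme "b" (one state).
word_states = [
    ("a", 0, [0.0], [1.0]),
    ("a", 1, [1.0], [1.0]),
    ("b", 0, [5.0], [1.0]),
]
low_order_frames = [[0.1], [0.0], [1.1], [0.9], [5.2], [4.8]]
alignment = viterbi_align(low_order_frames, word_states)
```

The returned list plays the role of the frame-to-state correspondence: for every frame it records which phoneme HMM and which state of that HMM the frame was assigned to.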

[0036] The correspondence $\Theta_c$ is defined as shown in Equation 2 below.

[0037]

(Equation 2)

[0038] In Equation 2, $\psi_{c,t}$ and $\theta_{c,t}$ denote, respectively, the index of the phoneme HMM and the index of the state of that phoneme HMM at frame number $t$. This processing is performed on each of the training words $O$, which have been classified in advance by gender and age.
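The image for Equation 2 is not reproduced in this text. Given that $\psi_{c,t}$ and $\theta_{c,t}$ are described as the phoneme HMM index and the state index at frame $t$, one form consistent with that description is a sequence of per-frame pairs, as below. This is a reconstruction from context, not the published equation.

```latex
\Theta_c = \{\, (\psi_{c,1}, \theta_{c,1}),\ \ldots,\ (\psi_{c,t}, \theta_{c,t}),\ \ldots,\ (\psi_{c,T}, \theta_{c,T}) \,\}
```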

[0039] Letting $\xi_c$ be the phoneme HMM sequence corresponding to the training word $O$, the likelihood $P(O, \Theta_c \mid \xi_c)$ between the training word $O$ and the phoneme HMM sequence $\xi_c$ is defined as shown in Equations 3, 4, and 5.

[0040]

(Equation 3)

[0041]

(Equation 4)

[0042]

(Equation 5)

[0043] In Equation 4, $M$ is the number of mixture components in each state.

[0044] In Equation 5, $D^l$ is the low-order dimension of $o_t$.

[0045] In Equation 5, $o_{t,d}$, $\mu_{i,s,m,d}$, and $\sigma^2_{i,s,m,d}$ denote the values of the $d$-th elements of $o_t$, $\mu_{i,s,m}$, and $\sigma^2_{i,s,m}$, respectively.

[0046] The low-order dimensions of $o_t$ are those in the range $0 \le d \le D^l$, and the high-order dimensions are those in the range $D^l + 1 \le d \le D$.
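For orientation, the usual likelihood of a diagonal-covariance Gaussian-mixture HMM along a fixed alignment, written with the symbols defined in this description, has the shape shown below. The published Equations 3 to 5 are not reproduced in this text, so this is only a sketch of the standard form: the exact indexing, in particular the handling of the initial state and of transitions across phoneme boundaries, is an assumption.

```latex
\begin{align}
P(O, \Theta_c \mid \xi_c) &= \prod_{t=1}^{T} a_{\psi_{c,t},\,\theta_{c,t-1},\,\theta_{c,t}}\; b_{\psi_{c,t},\,\theta_{c,t}}(o_t) \\
b_{i,s}(o_t) &= \sum_{m=1}^{M} w_{i,s,m}\, \mathcal{N}\!\left(o_t;\ \mu_{i,s,m},\ \sigma^2_{i,s,m}\right) \\
\mathcal{N}\!\left(o_t;\ \mu_{i,s,m},\ \sigma^2_{i,s,m}\right) &= \prod_{d=0}^{D^l} \frac{1}{\sqrt{2\pi\,\sigma^2_{i,s,m,d}}}\, \exp\!\left(-\frac{(o_{t,d}-\mu_{i,s,m,d})^2}{2\,\sigma^2_{i,s,m,d}}\right)
\end{align}
```

The product over $d$ runs only over the low-order range, which matches the statement that the first phoneme HMM set and the correspondence $\Theta_c$ use only the low-order parameters.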

[0047] (4) Next, a second phoneme HMM set (second speech recognition model) is obtained for each gender and age category. That is, the following processing is performed for each correspondence $\Theta_c$ obtained for each gender and age category.

[0048] The set of frame numbers $t$ that the Viterbi segmentation has associated with $\theta_{c,t} = i$ and $\psi_{c,t} = s$ is defined as $\Psi_{c,i,s}$. A probability density function is obtained by maximum likelihood estimation from the frames $o_t$ belonging to the same $\Psi_{c,i,s}$. The probability density function may be obtained as a mixture distribution. In this way, a second set of phoneme HMMs is obtained for each gender and age category.
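The per-state maximum likelihood estimation can be sketched as follows: frames that the segmentation assigned to the same phoneme and state are pooled, and the mean and variance of their high-order parameters are computed. This is a minimal sketch that assumes a single diagonal Gaussian per state (the text also allows a mixture) and would be run once per gender and age category. The function name `estimate_state_gaussians`, the variance floor, and the toy data are hypothetical.

```python
from collections import defaultdict


def estimate_state_gaussians(high_order_frames, alignment, var_floor=1e-4):
    """Maximum likelihood mean and variance per (phoneme, state).

    high_order_frames: one high-order feature vector per frame.
    alignment: one (phoneme, state_index) pair per frame, from segmentation.
    """
    groups = defaultdict(list)
    for vector, key in zip(high_order_frames, alignment):
        groups[key].append(vector)
    model = {}
    for key, vectors in groups.items():
        count, dim = len(vectors), len(vectors[0])
        mean = [sum(v[d] for v in vectors) / count for d in range(dim)]
        # ML variance (divide by count), floored to avoid degenerate zeros.
        var = [
            max(sum((v[d] - mean[d]) ** 2 for v in vectors) / count, var_floor)
            for d in range(dim)
        ]
        model[key] = (mean, var)
    return model


# Toy data for one speaker category: three frames with two high-order dimensions.
toy_frames = [[1.0, 2.0], [3.0, 4.0], [10.0, 0.0]]
toy_alignment = [("a", 0), ("a", 0), ("a", 1)]
category_model = estimate_state_gaussians(toy_frames, toy_alignment)
```

Repeating this for the training data of each category yields one set of state distributions per category, which together stand in for the second phoneme HMM sets.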

[0049] Alternatively, instead of training state by state, the second phoneme HMM set may be obtained by using the segmentation result for each phoneme and applying the well-known forward-backward algorithm to the analysis data associated with each segmented interval.

[0050] The second set of phoneme HMMs for each gender and age category ($\Lambda^r$) and each of its phoneme HMMs ($\lambda_i^r$) are defined as shown in Equations 6 and 7 below, respectively.

[0051]

(Equation 6)

[0052]

(Equation 7)

[0053] [2-2] Recognition process

[0054] The recognition process is performed as follows.

[0055] (1) The speech to be recognized is subjected to FFT cepstrum analysis, and a recognition target word $O$, defined in the same way as in Equation 1, is generated.

[0056] (2) For the recognition target word $O$, a word HMM corresponding to $O$ is generated by concatenating phoneme HMMs from the first phoneme HMM set (first speech recognition model). If the content of the utterance is unknown, the word HMM may be the one corresponding to the phoneme string obtained by recognizing $O$ with the phoneme HMMs. The correspondence $\Theta_c$ between the phoneme HMMs in the resulting word HMM and the recognition target word $O$ is then obtained by Viterbi segmentation. The feature vectors of $O$ used in obtaining $\Theta_c$ are the low-order feature vectors, which contain much phoneme-discriminating information.

[0057] (3) Following the state sequence of the phoneme HMMs in the resulting correspondence $\Theta_c$, the likelihood $P(O, \Theta_c \mid \xi_c)^r$ with respect to the second phoneme HMM set ($\Lambda^r$) of each gender and age category is calculated based on Equations 8, 9, and 10. The gender and age corresponding to the second phoneme HMM set with the maximum likelihood are taken as the recognition result.
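The decision step, scoring the fixed alignment under each category's model and choosing the maximum, can be sketched as below. This is a simplified illustration: it assumes one diagonal Gaussian per state, works in the log domain, and omits transition terms for brevity. The function name `classify`, the category labels, and all model values are hypothetical.

```python
import math


def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def classify(high_order_frames, alignment, category_models):
    """Return (best_category, scores) for a fixed frame-to-state alignment.

    category_models: {category: {(phoneme, state_index): (mean, var)}}
    """
    scores = {}
    for category, model in category_models.items():
        scores[category] = sum(
            log_gauss_diag(vector, *model[key])
            for vector, key in zip(high_order_frames, alignment)
        )
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical models for two categories, one high-order dimension each.
models = {
    "male": {("a", 0): ([0.0], [1.0]), ("a", 1): ([0.5], [1.0])},
    "female": {("a", 0): ([2.0], [1.0]), ("a", 1): ([2.5], [1.0])},
}
test_frames = [[1.8], [2.1]]
test_alignment = [("a", 0), ("a", 1)]
best_category, log_scores = classify(test_frames, test_alignment, models)
```

The alignment is computed once with the first phoneme HMM set and then reused for every category, so only the emission scores on the high-order parameters differ between categories.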

[0058]

(Equation 8)

[0059]

(Equation 9)

[0060]

(Equation 10)

[0061] As can be seen from Equation 10, the high-order parameters, which express gender and age, are used in calculating the likelihood $P(O, \Theta_c \mid \xi_c)^r$ with respect to the second phoneme HMM set ($\Lambda^r$) of each gender and age category.
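The image for Equation 10 is not reproduced in this text. Assuming a diagonal-covariance Gaussian as in the embodiment, and writing the parameters of category $r$ with a superscript $r$ (a notational assumption modeled on $\lambda_i^r$), a Gaussian term restricted to the high-order range would take the form below. This is a sketch of the standard form, not the published equation.

```latex
\mathcal{N}\!\left(o_t;\ \mu^{r}_{i,s,m},\ \sigma^{2\,r}_{i,s,m}\right)
  = \prod_{d=D^l+1}^{D} \frac{1}{\sqrt{2\pi\,\sigma^{2\,r}_{i,s,m,d}}}\,
    \exp\!\left(-\frac{\left(o_{t,d}-\mu^{r}_{i,s,m,d}\right)^2}{2\,\sigma^{2\,r}_{i,s,m,d}}\right)
```

The only structural difference from a low-order Gaussian term is the range of $d$, which here covers $D^l + 1 \le d \le D$.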

[0062] According to the embodiment described above, the gender, the age, or both of a speaker can be determined from speech uttered by an unspecified speaker. In addition, because the creation of the first phoneme HMM set and the computation of the correspondence $\Theta_c$ are based only on the low-order analysis parameters that contain much phoneme-discriminating information, for example dimensions 1 to 16, the computation speed can be improved.

[0063] The embodiment described above determines the gender and age of a speaker simultaneously from speech uttered by an unspecified speaker. Needless to say, the present invention can also be applied when only the gender or only the age of the speaker is to be determined.

[0064] The present invention can also be applied to determining various characteristics other than the gender and age of a speaker, for example the speaker's height and weight.

[0065]

[Effect of the Invention] According to the present invention, speaker characteristics such as the speaker's gender and age can be determined from speech uttered by an unspecified speaker.

Claims

1. A speaker characteristic discrimination method by speech recognition, comprising a learning process and a recognition process, wherein the learning process comprises: a step of analyzing the speech of each of a large number of speakers having different speaker characteristics; a step of creating a first speech recognition model by statistically processing all of the analysis results, using only those analysis parameters in each analysis result that contain much phoneme-discriminating information; a step of segmenting each analysis result using the first speech recognition model and obtaining the correspondence between the analysis parameters of each frame of each analysis result and the model parameters of the first speech recognition model; and a step of creating a second speech recognition model for each speaker characteristic, based on that correspondence and on those analysis parameters of the associated frames that express the characteristics of the speaker; and wherein the recognition process comprises: a step of analyzing the speech data to be recognized; a step of segmenting the analysis result using the first speech recognition model and obtaining the correspondence between the analysis parameters of each frame of the analysis result and the model parameters of the first speech recognition model; and a step of determining, based on that correspondence and on the second speech recognition models for the respective speaker characteristics, which of the second speech recognition models the analysis result for the speech data to be recognized fits best, and taking the speaker characteristic corresponding to the best-fitting model as the speaker characteristic for the speech data to be recognized.

2. A speaker characteristic discrimination method by speech recognition, comprising a learning process and a recognition process, wherein the learning process comprises: a step of subjecting the speech of each of a large number of speakers having different speaker characteristics to FFT cepstrum analysis; a step of creating a first phoneme HMM set by statistically processing all of the analysis results, using only the low-order analysis parameters in each analysis result, which contain much phoneme-discriminating information; a step of segmenting each analysis result using the first phoneme HMM set and obtaining the correspondence between the analysis parameters of each frame of each analysis result and the phoneme HMMs in the first phoneme HMM set; and a step of creating a second phoneme HMM set for each speaker characteristic, based on that correspondence and on the high-order parameters, among the analysis parameters of the associated frames, that express the characteristics of the speaker; and wherein the recognition process comprises: a step of analyzing the speech data to be recognized; a step of segmenting the analysis result using the first phoneme HMM set and obtaining the correspondence between the analysis parameters of each frame of the analysis result and the phoneme HMMs in the first phoneme HMM set; and a step of determining, based on that correspondence and on the second phoneme HMM sets for the respective speaker characteristics, which of the second phoneme HMM sets the analysis result for the speech data to be recognized fits best, and taking the speaker characteristic corresponding to the best-fitting model as the speaker characteristic for the speech data to be recognized.

3. The speaker characteristic discrimination method by speech recognition according to claim 1, wherein the speaker characteristic is any one selected from age, gender, height, and weight, or any combination thereof.