JP2015172736A

JP2015172736A - Voice analysis device

Info

Publication number: JP2015172736A
Application number: JP2015025055A
Authority: JP
Inventors: Nobutoshi Uesugi; Shuichi Matsumoto
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2014-02-19
Filing date: 2015-02-12
Publication date: 2015-10-01
Anticipated expiration: 2035-02-12
Also published as: JP6350325B2; WO2015125893A1

Abstract

PROBLEM TO BE SOLVED: To estimate the property of an object person for analysis easily with the use of a singing voice. SOLUTION: The voice analysis device includes: an indicator calculation unit 32 which, for each of mutually different K properties, uses a recognition model M[k] (k = 1 to K) indicating a trend of a singing voice corresponding to the property to calculate an estimation indicator E[k], which is an indicator of the likelihood that a singing voice V of an object person for analysis coincides with the singing voice of the property; and an estimation processing unit 34 for estimating a property of the object person for analysis according to the estimation indicator E[k] calculated by the indicator calculation unit 32 for each property.

Description

The present invention relates to a technique for analyzing voice.

Patent Document 1 discloses a technique for analyzing consumer trends (consumption trends) regarding products. The analysis uses singing history information from a plurality of consumers. According to the technique of Patent Document 1, consumption trends can be analyzed with high accuracy even when sufficient purchase history information is not available.

Patent Document 1: JP 2011-221595 A

The technique of Patent Document 1 can analyze the overall consumption trend across a large number of consumers, but it cannot estimate characteristics of an individual, such as tendencies in that person's activities. In view of these circumstances, an object of the present invention is to easily estimate the characteristics of an analysis target person using a singing voice.

To solve the above problem, the voice analysis device of the present invention includes: an index calculation unit that, for each of a plurality of different attributes, uses a recognition model representing the tendency of singing voices corresponding to that attribute to calculate an evaluation index, which is an index of the likelihood that the singing voice of the analysis target person corresponds to the voice of that attribute; and an estimation processing unit that estimates the characteristics of the analysis target person according to the evaluation indices calculated for the respective attributes by the index calculation unit. In this configuration, the evaluation index for each attribute is calculated with the recognition model of that attribute, so the characteristics of the analysis target person can be estimated easily from the evaluation indices.

In a preferred aspect of the present invention, the estimation processing unit estimates the characteristics of the analysis target person according to a combination of two or more attributes selected from the plurality of attributes according to the evaluation indices calculated by the index calculation unit. For example, the characteristics may be estimated from a combination of two or more attributes selected in numerical order of the evaluation indices, or from a combination of two or more attributes whose evaluation indices exceed a predetermined value. In this aspect, because the characteristics are estimated from a combination of two or more attributes selected according to the evaluation indices, the estimation results are more diverse than in a configuration that selects only one attribute according to the evaluation indices.

In a preferred aspect of the present invention, the estimation processing unit calculates, for each of a plurality of weight series corresponding to different estimation results, a weighted sum of the evaluation indices using the weights contained in that weight series, and identifies the estimation result corresponding to the weight series selected according to the weighted sums. In this aspect, because the estimation result is identified according to the weighted sum of the evaluation indices under each weight series, the relative importance of each evaluation index to the estimation result can be adjusted through the weights.
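The weighted-sum aspect can be illustrated with a minimal Python sketch. The text only says that a weight series is "selected according to the weighted sum"; choosing the series with the largest sum is an assumption made here, and the function name, result labels, and numeric values are all hypothetical.

```python
from typing import Dict, List


def select_result(indices: List[float], weight_series: Dict[str, List[float]]) -> str:
    """Return the estimation result whose weight series yields the largest weighted sum.

    indices:        evaluation indices E[1]..E[K]
    weight_series:  estimation result label -> K weights (hypothetical values)
    Picking the maximum is an assumption; the source only says "according to" the sum.
    """
    sums = {
        result: sum(w * e for w, e in zip(weights, indices))
        for result, weights in weight_series.items()
    }
    return max(sums, key=sums.get)


E = [0.8, 0.1, 0.6]  # illustrative evaluation indices for K = 3 attributes
W = {
    "serious and wealthy": [0.5, 0.0, 0.5],
    "night owl": [0.1, 0.9, 0.0],
}
chosen = select_result(E, W)
```

Raising or lowering a single weight changes how strongly the corresponding evaluation index influences which result is chosen, which is the adjustment the paragraph describes.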

In a preferred aspect of the present invention, the recognition model of each attribute is generated by machine learning using the reference voices of the set corresponding to that attribute, among a plurality of sets obtained by classifying a plurality of reference voices by attribute. Because the recognition models are generated by machine learning from reference voices classified by attribute, highly accurate estimation is possible that reflects the real-world relationship between voice features and the characteristics of the vocalizer. Specifically, the reference voices are classified into the plurality of sets according to at least one of the time period in which the reference voice was produced, the place where the reference voice was produced, and the personality of the vocalizer of the reference voice. The reference voices may also be classified into the plurality of sets according to the product purchase history of the vocalizer of each reference voice.

A voice analysis device according to a preferred aspect of the present invention includes a singing evaluation unit that evaluates the skill of the analysis target person's singing by an evaluation method that depends on the estimation result of the estimation processing unit. In this aspect, because singing skill is evaluated by an evaluation method that depends on the estimation result, singing evaluation appropriate to the characteristics of the analysis target person is achieved. Here, an "evaluation method that depends on the estimation result" means that one or more matters relating to singing evaluation change according to the estimation result, such as the content of the evaluation processing, the point allocation (weights) contributing to the evaluation result, the evaluation criteria and evaluation items, and the variables applied in the evaluation processing.

FIG. 1 is a configuration diagram of a voice analysis device according to a first embodiment of the present invention.
FIG. 2 is an explanatory diagram of learning data.
FIG. 3 is an explanatory diagram of the learning process.
FIG. 4 is a flowchart of the learning process.
FIG. 5 is a configuration diagram of the voice analysis unit.
FIG. 6 is a flowchart of the estimation process.
FIG. 7 is an explanatory diagram of reference data.
FIG. 8 is a flowchart of the estimation process in a second embodiment.
FIG. 9 is a configuration diagram of a voice analysis system according to a fourth embodiment.
FIG. 10 is a configuration diagram of a voice analysis device according to a fifth embodiment.
FIG. 11 is a display example of a singing evaluation result in the fifth embodiment.
FIG. 12 is a display example of a singing evaluation result in the fifth embodiment.

<First Embodiment>
FIG. 1 is a configuration diagram of a voice analysis device 100 according to the first embodiment of the present invention. The voice analysis device 100 is an information processing device that estimates the characteristics (personality, disposition, behavior patterns, etc.) of an arbitrary user (hereinafter the "analysis target person") by analyzing a singing voice produced by that user. It is realized by a computer system comprising an arithmetic processing device 12, a storage device 14, a sound collection device 16, and a display device 18.

The sound collection device 16 is a device (microphone) that picks up ambient sound. In the first embodiment, the sound collection device 16 picks up the singing voice V of the analysis target person singing a piece of music. The display device 18 (for example, a liquid crystal display panel) displays images as instructed by the arithmetic processing device 12. For example, the result of analyzing the characteristics of the analysis target person is displayed on the display device 18. The analysis result may also be output as audio from a sound emitting device (a speaker or earphones).

The arithmetic processing device 12 controls the elements of the voice analysis device 100 as a whole by executing a program stored in the storage device 14. Specifically, as illustrated in FIG. 1, the arithmetic processing device 12 implements a learning processing unit 22, which generates the recognition models used to analyze the characteristics of the analysis target person, and a voice analysis unit 24, which estimates those characteristics by analyzing the singing voice V with the recognition models generated by the learning processing unit 22. A configuration in which the functions of the arithmetic processing device 12 are distributed across a plurality of devices, or one in which a dedicated electronic circuit implements part of those functions, may also be adopted.

The storage device 14 stores the program executed by the arithmetic processing device 12 and the various data it uses. Any known recording medium, such as a semiconductor or magnetic recording medium, or a combination of several types of recording media, may be used as the storage device 14. In the first embodiment, the storage device 14 stores a learning data group GL. The learning data group GL is a collection (big data) of a large number of learning data items DL gathered in advance. The learning processing unit 22 generates the recognition models by machine learning using the learning data group GL stored in the storage device 14.

As illustrated in FIG. 2, each learning data item DL in the learning data group GL is sample data relating to a singing voice used for machine learning of the recognition models (hereinafter a "reference voice"), and comprises voice data DA, feature data DB, and related data DC. The voice data DA is an audio file representing the time waveform of the reference voice. The feature data DB and the related data DC are used to classify the reference voice represented by the voice data DA.

The feature data DB represents feature quantities of the reference voice, in particular feature quantities specific to singing. The feature data DB contains several kinds of feature quantities: basic ones such as the trajectories of the pitch and volume of the reference voice, feature quantities that characterize its voice quality (for example, MFCCs and the singing formant), and an evaluation result (score) for the reference voice.

The related data DC is information relating to the singer of the reference voice or to the singing itself. As illustrated in FIG. 2, the related data DC of the first embodiment contains personal information DC1, singing information DC2, and tendency information DC3. The personal information DC1 is information about the individual singer of the reference voice (for example, age, sex, address, and occupation). The singing information DC2 is information relating to the singing of the reference voice (singing environment, history, and tendencies). For example, the singing information DC2 may include the singing time, the type of singing venue (for example, upscale venue versus karaoke box), singing frequency, number of songs sung (within a particular group), number of singers (size of the visiting party), and score (for example, score ranking within a particular group).

The tendency information DC3 is information about the personality, preferences, and the like of the singer of the reference voice. For example, information such as the singer's personality and emotions while singing is extracted from the results of a questionnaire given to the singer and included in the tendency information DC3. Information about the singer's preferences and behavior patterns estimated from the singer's use of blogs or social networking services (SNS), for example from posts or profile content, may also be included in the tendency information DC3.

The above are specific examples of the learning data DL. The learning data group GL is built by collecting learning data DL for a large number of singers. Each learning data item DL is collected from a singer, for example, when the singer uses a karaoke venue. For example, the voice data DA and the feature data DB are generated from the singer's singing voice, and the personal information DC1 and singing information DC2 of the related data DC may be extracted from the member information of a service card issued when the singer registers as a member of the karaoke venue. The method of acquiring the attributes of a reference voice (for example, the related data DC) is not limited to these examples. For example, the personal information on a singer's karaoke service card may be matched against per-user purchase histories registered at various stores to extract behavior patterns such as the singer's purchase history as the tendency information DC3, or the personal information on the service card may be matched against personal information registered in a blog or SNS profile to extract the singer's preferences and behavior patterns as the tendency information DC3.
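The layout of one learning data item DL described above might be modeled as follows. This is only a sketch: the class names, field names, and sample values are hypothetical and chosen to mirror the labels DA, DB, DC, and DC1 to DC3; the source does not specify a concrete data format.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class RelatedData:
    """Related data DC (field names are illustrative assumptions)."""
    personal: Dict[str, str] = field(default_factory=dict)   # DC1: age, sex, address, occupation
    singing: Dict[str, str] = field(default_factory=dict)    # DC2: time, venue type, frequency, score
    tendency: Dict[str, str] = field(default_factory=dict)   # DC3: personality, preferences


@dataclass
class LearningData:
    """One learning data item DL."""
    audio_path: str                      # DA: audio file holding the reference voice waveform
    features: Dict[str, List[float]]     # DB: e.g. pitch and volume trajectories, MFCCs, score
    related: RelatedData                 # DC


sample = LearningData(
    audio_path="reference_0001.wav",  # hypothetical file name
    features={"pitch_hz": [220.0, 221.5, 219.8], "score": [87.0]},
    related=RelatedData(
        personal={"age": "34", "occupation": "engineer"},
        singing={"hour": "1", "venue_type": "upscale"},
        tendency={"personality": "serious"},
    ),
)
```

Keeping DB and DC alongside DA in one record reflects their role in the text: they are used to decide which set each reference voice belongs to.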

<Learning processing unit 22>
As illustrated in FIG. 3, the learning processing unit 22 of the first embodiment generates a plurality (K) of recognition models M[1] to M[K] by machine learning using the learning data group GL stored in the storage device 14. The reference voices (voice data DA) are classified by attribute into K sets C[1] to C[K] using the feature data DB and related data DC of each learning data item DL, and each recognition model M[k] (k = 1 to K) is generated by machine learning using the reference voices belonging to the k-th set C[k]. The recognition model M[k] of the first embodiment is therefore a statistical model representing the tendency of the feature quantities of the reference voices classified into one set C[k]. For example, a statistical model based on a mixture distribution such as a GMM (Gaussian Mixture Model), or a probabilistic model such as an HMM (Hidden Markov Model), is suitable as the recognition model M[k].

FIG. 4 is a flowchart of the learning process in which the learning processing unit 22 generates the K recognition models M[1] to M[K] from the learning data group GL. When the learning process of FIG. 4 starts, the learning processing unit 22 classifies the voice data DA (reference voices) of the learning data DL into K sets C[1] to C[K], as illustrated in FIG. 3 (SA1). That is, the reference voices are classified into K kinds of attributes. A single reference voice may belong to more than one set C[k]. As illustrated below, the feature data DB and related data DC of each learning data item DL are used to classify the reference voices.

Broadly speaking, the feature data DB of each learning data item DL is used for classification based on the acoustic features of the reference voice. For example, the following are classified into the set C[k] for the attribute "serious": a reference voice whose pitch, as represented by the feature data DB, accurately follows the intended note sequence of the song; a monotonous reference voice with little variation in volume or voice quality within a note; a reference voice that is quiet and lacks tension in its voice quality; and a reference voice that makes little use of singing techniques such as scooping (shakuri) or vibrato.

Broadly speaking, the related data DC is used for classification based on the preferences and disposition (behavior patterns) of the singer of the reference voice. For example, a reference voice whose singing time in the related data DC falls in the late-night period is classified into the set C[k] for the attribute "nocturnal", and a reference voice whose singing venue type in the related data DC is an upscale venue is classified into the set C[k] for the attribute "wealthy". A reference voice whose singer's personality is given as "tenacious" in the related data DC, or whose number of songs sung in the related data DC is large, is classified into the set C[k] for the attribute "tenacious". Any known data analysis technique may be used to classify (cluster) the reference voices. For example, clustering by the k-means method disclosed in JP 2005-222138 A can be used.
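The related-data rules given as examples above could be sketched as a simple rule-based classifier in Python. This is an illustration only: the record fields, the late-night hour range (23:00 to 05:00), and the "large number of songs" threshold (20) are assumptions not stated in the source, which also permits clustering methods such as k-means instead of fixed rules.

```python
from collections import defaultdict
from typing import Dict, List


def classify(records: List[dict]) -> Dict[str, List[str]]:
    """Assign each reference voice to zero or more attribute sets C[k].

    A record may land in several sets, as the text allows.
    Hour range and song-count threshold are hypothetical.
    """
    sets: Dict[str, List[str]] = defaultdict(list)
    for r in records:
        if r["hour"] >= 23 or r["hour"] < 5:          # late-night singing time
            sets["nocturnal"].append(r["id"])
        if r["venue"] == "upscale":                   # upscale singing venue
            sets["wealthy"].append(r["id"])
        if r.get("personality") == "tenacious" or r["song_count"] >= 20:
            sets["tenacious"].append(r["id"])
    return dict(sets)


records = [
    {"id": "a", "hour": 1, "venue": "karaoke_box", "personality": "cheerful", "song_count": 3},
    {"id": "b", "hour": 20, "venue": "upscale", "personality": "tenacious", "song_count": 2},
    {"id": "c", "hour": 23, "venue": "upscale", "song_count": 25},
]
sets = classify(records)
```

Record "c" ends up in all three sets, which shows how one reference voice can contribute to several recognition models.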

After classifying the reference voices, the learning processing unit 22 selects one set C[k] from the K sets C[1] to C[K] (SA2). Then, as illustrated in FIG. 3, the learning processing unit 22 generates the recognition model M[k] by machine learning using the voice data DA belonging to the set C[k] (SA3). Specifically, the learning processing unit 22 generates the recognition model M[k] so that it expresses the statistical tendency of the reference voices belonging to the set C[k]. Any known machine learning technique, such as decision tree learning, may be used for the machine learning of the recognition model M[k]. For example, decision tree learning using C4.5 (J. Ross Quinlan, "C4.5: Programs for Machine Learning," Morgan Kaufmann Publishers, 1993) is suitable.

The recognition model M[k] of the first embodiment is a statistical model for calculating an evaluation index (likelihood) E[k] for the singing voice V to be recognized. The evaluation index E[k] calculated with the recognition model M[k] is an index of the likelihood that the singing voice V matches the statistical tendency of the reference voices belonging to the set C[k], that is, the likelihood that the singing voice V is classified into the set C[k]. The more closely the feature quantities of the singing voice V match the tendency of the reference-voice feature quantities expressed by the recognition model M[k], the larger the value of the evaluation index E[k]. The recognition model M[k] generated by the learning processing unit 22 is stored in the storage device 14. Specifically, the variables that define the recognition model M[k] are stored in the storage device 14.

As illustrated in FIG. 3, the learning processing unit 22 attaches attribute information A[k] to the recognition model M[k] (SA4). The attribute information A[k] is information (a label) expressing the attribute corresponding to the set C[k]. For example, attribute information A[k] designating the attribute "nightlife" is attached to the recognition model M[k] of the set C[k] of reference voices whose singing time falls in the late-night period, and attribute information A[k] designating the attribute "wealthy" is attached to the recognition model M[k] of the set C[k] of reference voices whose singing venue type is an upscale venue.

The learning processing unit 22 determines whether recognition models M[k] have been generated for all K sets C[1] to C[K] (SA5). If not (SA5: NO), the learning processing unit 22 selects a set C[k] for which no recognition model M[k] has yet been generated (SA2), and then generates the recognition model M[k] (SA3) and attaches the attribute information A[k] (SA4). When all K recognition models M[1] to M[K] have been generated (SA5: YES), the learning processing unit 22 ends the learning process of FIG. 4.
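The loop SA2 to SA5 can be sketched in Python as below. To keep the example dependency-free, each recognition model is a single diagonal Gaussian (per-dimension mean and variance), which is a deliberate simplification standing in for the GMM, HMM, or decision-tree models mentioned in the text; the function name, dictionary layout, and feature vectors are hypothetical.

```python
from statistics import fmean, pvariance
from typing import Dict, List


def train_models(sets: Dict[str, List[List[float]]]) -> Dict[str, dict]:
    """Build one model per attribute set C[k] (the output of step SA1).

    Simplified stand-in: a diagonal Gaussian instead of a GMM/HMM.
    """
    models = {}
    for label, vectors in sets.items():           # SA2: pick the next set; loop end = SA5
        columns = list(zip(*vectors))             # regroup feature vectors by dimension
        models[label] = {
            "mean": [fmean(c) for c in columns],                  # SA3: fit the model
            "var": [max(pvariance(c), 1e-6) for c in columns],    # floor avoids zero variance
            "attribute": label,                                   # SA4: attach A[k]
        }
    return models


models = train_models({
    "serious": [[1.0, 2.0], [3.0, 4.0]],
    "nocturnal": [[5.0, 5.0], [5.0, 7.0]],
})
```

The variance floor is a practical guard added for this sketch so that a later likelihood computation never divides by zero.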

<Voice analysis unit 24>
The voice analysis unit 24 of FIG. 1 estimates the characteristics of the analysis target person by analyzing the singing voice V picked up by the sound collection device 16. FIG. 5 is a detailed configuration diagram of the voice analysis unit 24. As illustrated in FIG. 5, the voice analysis unit 24 of the first embodiment comprises an index calculation unit 32 and an estimation processing unit 34.

The index calculation unit 32 applies the singing voice V of the analysis target person to each of the K recognition models M[1] to M[K] generated by the learning processing unit 22, thereby calculating K evaluation indices E[1] to E[K] corresponding to the different attributes (recognition models M[k]). The estimation processing unit 34 estimates the characteristics of the analysis target person according to the evaluation index E[k] calculated for each attribute by the index calculation unit 32. The estimation processing unit 34 of the first embodiment generates analysis information Q indicating the result of estimating the characteristics of the analysis target person.

FIG. 6 is a flowchart of the estimation process in which the voice analysis unit 24 generates the analysis information Q from the singing voice V of the analysis target person. The estimation process of FIG. 6 is started, for example, by an instruction given by the analysis target person before singing begins.

When the estimation process starts, the index calculation unit 32 selects one recognition model M[k] from the K recognition models M[1] to M[K] stored in the storage device 14 (SB1). The index calculation unit 32 then calculates the evaluation index E[k] by applying the singing voice V supplied from the sound collection device 16 to the recognition model M[k] (SB2). Specifically, the index calculation unit 32 extracts feature quantities such as pitch, volume, and voice quality from the singing voice V, and calculates as the evaluation index E[k] the likelihood that those feature quantities are observed in the set C[k] expressed by the recognition model M[k]. For example, the index calculation unit 32 generates a probability density distribution (a mixture distribution model such as a GMM) of the feature quantities extracted from the singing voice V, and calculates the likelihood between that probability density distribution and the recognition model M[k] as the evaluation index E[k]. The probability density distribution of each feature quantity is generated, for example, from 13-dimensional MFCCs for voice quality, and from an approximation curve of the trajectory within a unit time (for example, one second) for pitch and volume.
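Step SB2 can be illustrated with a small Python sketch. It scores each feature frame of the singing voice directly under a diagonal Gaussian model and averages the log-likelihoods; this is a simplification of the procedure described above, which compares a fitted probability density distribution with the recognition model. The model layout, function names, and numeric values are hypothetical.

```python
import math
from typing import Dict, List


def log_likelihood(x: List[float], mean: List[float], var: List[float]) -> float:
    """Log density of x under a diagonal Gaussian with the given mean and variance."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def evaluation_indices(frames: List[List[float]], models: Dict[str, dict]) -> Dict[str, float]:
    """Average per-frame log-likelihood of the singing voice under each model M[k]."""
    return {
        label: sum(log_likelihood(f, m["mean"], m["var"]) for f in frames) / len(frames)
        for label, m in models.items()
    }


models = {
    "serious": {"mean": [0.0, 0.0], "var": [1.0, 1.0]},
    "nocturnal": {"mean": [5.0, 5.0], "var": [1.0, 1.0]},
}
frames = [[0.1, -0.2], [0.0, 0.3]]   # illustrative feature vectors from singing voice V
E = evaluation_indices(frames, models)
```

Because the frames lie near the "serious" mean, that model yields the larger index, matching the statement that a closer match to the model's tendency gives a larger E[k].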

The index calculation unit 32 determines whether the evaluation index E[k] (E[1] to E[K]) has been calculated for each of the K recognition models M[1] to M[K] (SB3). If not (SB3: NO), the index calculation unit 32 selects a recognition model M[k] for which the evaluation index E[k] has not yet been calculated (SB1) and calculates the evaluation index E[k] using that recognition model (SB2). When the K evaluation indices E[1] to E[K] corresponding to the different recognition models M[k] have all been calculated (SB3: YES), the estimation processing unit 34 generates the analysis information Q (SB4, SB5).

Specifically, the estimation processing unit 34 selects, from the K attributes corresponding to the recognition models M[k], two or more attributes (attribute information A[k]) according to the evaluation indices E[1] to E[K] (SB4). For example, the estimation processing unit 34 selects the attribute information A[k] attached to each of a predetermined number of recognition models M[k] that rank highest among the K recognition models M[1] to M[K] when ordered by evaluation index E[k] (descending order). As understood from the above description, attributes with a high likelihood of matching the nature of the analysis target person are selected. For example, attribute information A[k] designating “serious” is selected for a singing voice V whose pitch accurately follows the intended note sequence of the song, a monotonous singing voice V with little variation in volume or voice quality within a note, a singing voice V with low volume and little tension in the voice quality, or a singing voice V that uses few singing techniques such as shakuri (scooping up to a pitch) or vibrato. The method of selecting attributes (attribute information A[k]) of the singing voice V according to the evaluation indices E[k] is not limited to this example. For example, a configuration that selects, among the K attributes, those whose evaluation index E[k] exceeds a predetermined threshold (a fixed or variable value), or a configuration that selects the single attribute whose evaluation index E[k] is largest, may also be adopted.
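The loop over the recognition models (SB1 to SB3) and the attribute selection (SB4) can be sketched as follows. This is only an illustrative sketch: the text here does not say how a recognition model M[k] computes its evaluation index, so each model is represented by a hypothetical callable, the voice by a single number, and the function names and sample values are assumptions.

```python
from typing import Callable, List, Sequence

# Hypothetical stand-in for a recognition model M[k]: any callable that maps a
# singing-voice feature to an evaluation index E[k] (higher = better match).
Model = Callable[[float], float]


def compute_indices(voice: float, models: Sequence[Model]) -> List[float]:
    """Steps SB1 to SB3: evaluate every model once and collect E[1]..E[K]."""
    return [model(voice) for model in models]


def select_top(indices: Sequence[float], attributes: Sequence[str],
               count: int = 2) -> List[str]:
    """Step SB4: attributes of the `count` highest evaluation indices."""
    order = sorted(range(len(indices)), key=lambda k: indices[k], reverse=True)
    return [attributes[k] for k in order[:count]]


def select_above(indices: Sequence[float], attributes: Sequence[str],
                 threshold: float) -> List[str]:
    """Alternative: every attribute whose index exceeds a threshold."""
    return [a for e, a in zip(indices, attributes) if e > threshold]


# Made-up attribute labels and index values, for illustration only.
attributes = ["nightlife", "serious", "clingy", "wealthy"]
indices = [0.2, 0.9, 0.5, 0.7]
print(select_top(indices, attributes))         # ['serious', 'wealthy']
print(select_above(indices, attributes, 0.6))  # ['serious', 'wealthy']
print(select_top(indices, attributes, 1))      # ['serious']
```

Passing `count=1` reproduces the single-attribute variant mentioned in the text, and `select_above` the threshold variant.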

The estimation processing unit 34 generates the analysis information Q according to the combination of the two or more attributes (attribute information A[k]) selected with reference to the evaluation indices E[k] (SB5). Specifically, a sentence containing the character strings of the attributes designated by the attribute information A[k] is generated as the analysis information Q. For example, when attribute information A[k1] designating “nightlife” and attribute information A[k2] designating “clingy” are selected, analysis information Q with a sentence such as “You have a clingy personality and love nightlife, don't you?” is generated; when attribute information A[k3] designating “serious” and attribute information A[k4] designating “wealthy” are selected, analysis information Q with a sentence such as “You are a serious, wealthy person, aren't you?” is generated. The estimation processing unit 34 causes the display device 18 to display the analysis information Q generated by the above procedure (SB6).

As described above, in the first embodiment the evaluation index E[k] (E[1] to E[K]), which indicates the likelihood that the voice belongs to each of the K attributes, is calculated with the recognition model M[k] of each attribute, so the nature of the analysis target person (personality, disposition, behavior pattern) can be estimated easily from the evaluation indices E[k]. In particular, because the nature of the analysis target person is estimated from a combination of two or more attributes selected from the K attributes according to the evaluation indices E[k], the first embodiment can estimate that nature in more varied ways than, for example, a configuration that selects only one of the K attributes according to the evaluation indices E[k]. Furthermore, because the reference voices are classified into the K sets C[1] to C[K] according to the singing time of the reference voice, the singing place (type of establishment), and the singer's personality, the nature of the analysis target person can be estimated from a variety of viewpoints.

In the first embodiment, moreover, each recognition model M[k] is generated by machine learning using the reference voices (voice data DA) classified into the set C[k] among a plurality of reference voices recorded in advance. Compared with, for example, a configuration in which each recognition model M[k] is set by hand without using actual reference voices, this has the advantage of enabling highly accurate estimation that reflects real-world tendencies in the relationship between voice features and the singer's nature.

<Second Embodiment>
A second embodiment of the present invention is described below. For elements in the embodiments illustrated below whose operation and function are the same as in the first embodiment, the reference signs used in the description of the first embodiment are reused and detailed description of each is omitted as appropriate.

The voice analysis unit 24 of the second embodiment estimates the nature of the analysis target person using K recognition models M[1] to M[K] generated by the learning process for K sets C[1] to C[K] that correspond to different time slots of singing time. That is, the recognition model M[k] represents the statistical tendency of the reference voices sung in the k-th of the K time slots.

Specifically, as in the first embodiment, the index calculation unit 32 of the voice analysis unit 24 calculates K evaluation indices E[1] to E[K] by applying the singing voice V of the analysis target person to each of the K recognition models M[k]. The evaluation index E[k] corresponds to the degree to which the singing voice V matches the statistical tendency of the reference voices sung in the k-th time slot. The estimation processing unit 34 generates the analysis information Q according to the K evaluation indices E[1] to E[K] calculated by the index calculation unit 32. In the second embodiment, the estimation processing unit 34 uses the reference data DR of FIG. 7 to generate the analysis information Q (that is, to estimate the personality of the analysis target person).

As illustrated in FIG. 7, the reference data DR is a data table in which weight series W[1] to W[R] are registered for a plurality (R) of different pieces of analysis information Q[1] to Q[R]. Any one weight series W[r] (r = 1 to R) is a series of K weights w[1] to w[K] (K = 10 in the example of FIG. 7) corresponding to the different evaluation indices E[k] (time slots of singing time). FIG. 7 shows, for convenience, the case where each weight w[k] is binary (0/1), but the specific values of the weights w[k] are arbitrary. For example, the weights w[k] are not limited to integers, and it does not matter whether individual weights w[k] are equal to or different from one another.

As in the first embodiment, each piece of analysis information Q[r] represents a sentence (comment) giving the estimated personality of the analysis target person. In the weight series W[r] of analysis information Q[r] that describes the personality of singers who tend to sing in a particular time slot, the one or more weights w[k] corresponding to that time slot are set to larger values than those of the other time slots. For example, as illustrated in FIG. 7, in the weight series W[1] of the analysis information Q[1] “nocturnal”, the weights w[7] to w[10] corresponding to the night-time slots (20:00 onward) are set to 1, and the remaining weights w[k], corresponding to the daytime slots, are set to 0. The reference data DR exemplified above is prepared in advance and stored in the storage device 14.

FIG. 8 is a flowchart of the estimation process in the second embodiment. Part of the operation of the estimation processing unit 34 in the first embodiment (FIG. 6), namely steps SB4 to SB6, is replaced in the second embodiment by a process that selects and displays the analysis information Q[r] using the reference data DR (SC4 to SC6).

When the index calculation unit 32 has calculated the K evaluation indices E[1] to E[K] by applying the singing voice V of the analysis target person to each recognition model M[k] (SB3: YES), the estimation processing unit 34 calculates an evaluation index X[r] for each of the R weight series W[1] to W[R] (that is, for each of the R pieces of analysis information Q[1] to Q[R]) (SC4). Each evaluation index X[r] is the weighted sum of the evaluation indices E[k] using the weights w[k] of the weight series W[r] (X[r] = w[1]E[1] + w[2]E[2] + ... + w[K]E[K]). Accordingly, the evaluation index X[r] takes a larger value the more closely the singing voice V of the analysis target person resembles the reference voices sung in the time slots whose weights w[k] are large. As understood from the above description, the evaluation index E[k] corresponds to the likelihood that the singing voice V fits the statistical tendency of the reference voices belonging to a single set C[k], whereas the evaluation index X[r] corresponds to the likelihood that the singing voice V fits the overall tendency of the reference voices across a plurality of different sets C[k].
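Written in summation form, with w[k] denoting the k-th weight of the weight series W[r], the weighted sum of step SC4 reads as below. The second line is a worked instance that assumes the binary weights described for the “nocturnal” series W[1] of FIG. 7 (K = 10, w[7] to w[10] equal to 1, all other weights 0).

```latex
X[r] = \sum_{k=1}^{K} w[k]\, E[k], \qquad r = 1, \dots, R

X[1] = E[7] + E[8] + E[9] + E[10]
```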

The estimation processing unit 34 selects, from the R pieces of analysis information Q[1] to Q[R] in the reference data DR, one piece of analysis information Q[r] according to the evaluation indices X[r] (SC5). Specifically, it selects the piece of analysis information Q[r] corresponding to the weight series W[r] for which the evaluation index X[r] is largest. The estimation processing unit 34 acquires the analysis information Q[r] selected from the reference data DR from the storage device 14 and causes the display device 18 to display it (SC6).
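Steps SC4 and SC5 amount to a weighted sum followed by an arg-max over the rows of the reference data DR. The sketch below illustrates this with a made-up table of K = 4 time slots and R = 2 comments; the function names, the sample values, and the “daytime” comment are assumptions for illustration and do not reproduce FIG. 7.

```python
from typing import List, Sequence, Tuple


def weighted_indices(e: Sequence[float],
                     weight_table: Sequence[Sequence[float]]) -> List[float]:
    """Step SC4: X[r] = sum over k of w[k] * E[k], one value per weight series W[r]."""
    scores = []
    for weights in weight_table:
        if len(weights) != len(e):
            raise ValueError("each weight series needs one weight per evaluation index")
        scores.append(sum(w * x for w, x in zip(weights, e)))
    return scores


def select_analysis(e: Sequence[float],
                    weight_table: Sequence[Sequence[float]],
                    comments: Sequence[str]) -> Tuple[str, List[float]]:
    """Step SC5: return the comment Q[r] whose X[r] is largest, plus all X values."""
    x = weighted_indices(e, weight_table)
    best = max(range(len(x)), key=lambda r: x[r])
    return comments[best], x


# Illustrative data: four time slots, the last two being night-time slots.
e = [1, 2, 5, 4]
weight_table = [
    [0, 0, 1, 1],  # W[1]: weights concentrated on the night-time slots
    [1, 1, 0, 0],  # W[2]: weights concentrated on the daytime slots
]
comments = ["nocturnal", "daytime"]
print(select_analysis(e, weight_table, comments))  # ('nocturnal', [9, 3])
```

Non-binary or non-integer weights work unchanged, which matches the remark that the specific values of w[k] are arbitrary.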

The second embodiment achieves the same effects as the first embodiment. In addition, because the evaluation index X[r] is calculated as a weighted sum of the evaluation indices E[k] using the K weights w[1] to w[K] corresponding to different time slots, the nature of the analysis target person (in particular the behavior pattern) can be estimated from the viewpoint of the statistical tendency of the voices in each time slot.

<Third Embodiment>
The related data DC of each piece of learning data DL in the third embodiment includes a history of the singer of the reference voice having purchased a specific product (hereinafter “purchase history”). The purchase history is extracted from, for example, a service card (such as a loyalty point card) that the purchaser presents at a store. The learning processing unit 22 refers to the purchase history designated by the related data DC when classifying the reference voices (SA1). As a result, a recognition model M[k] is generated for the attribute “reference voice sung by a purchaser who tends to buy a specific product” (that is, a statistical model representing the statistical tendency of the reference voices of purchasers of that product).

The operation of the voice analysis unit 24 (index calculation unit 32, estimation processing unit 34) is the same as in the first embodiment. For an analysis target person who is presumed to tend to buy a specific product, the evaluation index E[k] of the recognition model M[k] corresponding to the set C[k] for that tendency is likely to be large. The third embodiment therefore has the advantage that, by referring to the evaluation indices E[k], the nature of the analysis target person, including the tendency to purchase a specific product (consumption trend), can be estimated. The analysis information Q generated in the third embodiment can also be used to propose specific products effectively to each analysis target person, or for market research (marketing) by stores and companies. Specifically, the estimation processing unit 34 registers the analysis information Q (the estimated personality of the analysis target person) generated according to the K evaluation indices E[1] to E[K] in a database for marketing (advertising). With this configuration, efficient promotion and advertising can be realized using the marketing database. For example, for an analysis target person whose analysis information Q suggests a liking for a particular item (for example food and drink such as hamburgers, energy drinks, alcoholic beverages, or candy), an image advertising that item is displayed on the display device 18 in the interval between songs. The marketing database is built, for example, on a server device that can communicate with the voice analysis device 100. However, the database can also be built in the storage device 14 of the voice analysis device 100.

<Fourth Embodiment>
FIG. 9 is a configuration diagram of the voice analysis system 200 in the fourth embodiment. As illustrated in FIG. 9, the voice analysis system 200 of the fourth embodiment includes a management device 52 and a plurality of voice analysis devices 54. Each of the voice analysis devices 54 is realized by a communication terminal such as an online karaoke device and communicates with the management device 52 via a communication network 58 such as the Internet.

The management device 52 is a distribution server device (typically a web server) that includes a learning processing unit 22 similar to that of the embodiments described above. The learning processing unit 22 of the management device 52 generates the K recognition models M[1] to M[K] by the same learning process as in the embodiments described above (FIG. 4) and distributes them to the voice analysis devices 54. The recognition models M[k] distributed from the management device 52 are transmitted over the communication network 58 to each voice analysis device 54 and held there.

Each voice analysis device 54 includes a voice analysis unit 24 similar to that of the embodiments described above. Using the K recognition models M[1] to M[K] distributed from the management device 52, the voice analysis unit 24 of each voice analysis device 54 analyzes the singing voice V of the analysis target person in the same way as in the embodiments described above, and thereby estimates and displays the nature of the analysis target person.

The fourth embodiment achieves the same effects as the embodiments described above. In addition, because the recognition models M[k] generated by the management device 52, which is separate from the voice analysis devices 54, are distributed to each of the voice analysis devices 54, the voice analysis devices 54 do not need to include the learning processing unit 22. This has the advantage of simplifying the configuration and processing of each voice analysis device 54.

<Fifth Embodiment>
FIG. 10 is a configuration diagram of the voice analysis device 100 in the fifth embodiment. As illustrated in FIG. 10, the arithmetic processing device 12 of the voice analysis device 100 of the fifth embodiment executes a program stored in the storage device 14 and thereby functions as a singing evaluation unit 26 in addition to the same elements as in the first embodiment (learning processing unit 22, voice analysis unit 24).

The singing evaluation unit 26 evaluates the singing skill of the analysis target person by analyzing the singing voice V picked up by the sound collecting device 16. Specifically, the singing evaluation unit 26 of the fifth embodiment evaluates singing skill by analyzing the singing voice V with an evaluation method that depends on the result of the voice analysis unit 24 estimating the nature of the analysis target person (that is, on the analysis information Q), and calculates an index value S indicating the evaluation result (hereinafter “singing evaluation value”). Any known technique may be adopted for the singing evaluation process itself. The following description assumes that the singing evaluation value S is set to a larger value the better the analysis target person sings. An image corresponding to the singing evaluation value S generated by the singing evaluation unit 26 is displayed on the display device 18. For example, the singing evaluation value S and an evaluation comment corresponding to its value are displayed on the display device 18. Specific examples of methods for evaluating the singing voice V according to the analysis information Q are listed below.

[Evaluation Example 1]
The K recognition models M[1] to M[K] include a recognition model M for the attribute “male” and a recognition model M for the attribute “female”. The singing evaluation unit 26 evaluates the singing voice V with a singing evaluation process for men when the analysis information Q indicates the attribute “male”, and with a singing evaluation process for women when the analysis information Q indicates the attribute “female”.

For example, in the singing evaluation process, the singing voice V is analyzed so that the singing evaluation value S becomes larger the more accurately the analysis target person sings pitches in the upper part of a predetermined pitch range (hereinafter “reference vocal range”), that is, the more closely the sung pitches match or approximate the pitches of the notes of the song. The reference vocal range applied in the singing evaluation process for men lies lower than the reference vocal range applied in the singing evaluation process for women. Therefore, if the pitch of the singing voice V is high when the analysis information Q indicates the attribute “male”, the singing evaluation value S is larger than it would be if the analysis information Q indicated the attribute “female”.
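One possible reading of Evaluation Example 1 is sketched below. The text does not specify the reference vocal ranges, the pitch tolerance, or the scoring formula, so the MIDI note ranges, the definition of “upper part” as the upper half of the range, the half-semitone tolerance, and the 0 to 100 scale are all assumptions chosen only to make the idea concrete.

```python
from typing import List, Tuple

# Hypothetical reference vocal ranges as MIDI note numbers (low, high).
# The range for "male" lies lower than the range for "female", as in the text.
REFERENCE_RANGE = {"male": (48, 67), "female": (57, 76)}


def upper_range_score(notes: List[Tuple[float, float]], attribute: str,
                      tolerance: float = 0.5) -> float:
    """Score (0 to 100) for how accurately upper-range notes are sung.

    `notes` holds (target_pitch, sung_pitch) pairs in semitones. Only notes
    whose target lies in the upper half of the reference vocal range count.
    """
    low, high = REFERENCE_RANGE[attribute]
    upper_start = (low + high) / 2
    in_upper = [(t, s) for t, s in notes if upper_start <= t <= high]
    if not in_upper:
        return 0.0
    hits = sum(1 for t, s in in_upper if abs(t - s) <= tolerance)
    return 100.0 * hits / len(in_upper)


# The same performance scored under each attribute.
notes = [(60, 60.0), (62, 62.2), (64, 64.0), (66, 67.5)]
print(upper_range_score(notes, "male"))    # 75.0
print(upper_range_score(notes, "female"))  # 0.0
```

With these made-up ranges, the four notes fall in the upper half of the male range but below the upper half of the female range, so the same performance scores higher under the “male” attribute.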

[Evaluation Example 2]
As in Evaluation Example 1, the K recognition models M[1] to M[K] include a recognition model M for the attribute “male” and a recognition model M for the attribute “female”. The singing evaluation unit 26 uses different voice-quality evaluation items depending on whether the analysis information Q indicates the attribute “male” or the attribute “female”. For example, when the analysis information Q indicates the attribute “male”, the singing evaluation unit 26 evaluates the singing voice V so that the singing evaluation value S becomes larger the higher the degree of masculine voice quality in the singing voice V (for example, the degree of a husky voice quality in which non-harmonic components dominate over harmonic components). When the analysis result indicates the attribute “female”, the singing evaluation unit 26 evaluates the singing voice V so that the singing evaluation value S becomes larger the higher the degree of feminine voice quality in the singing voice V (for example, the degree of a clear voice quality in which harmonic components dominate over non-harmonic components).

A configuration in which the singing evaluation unit 26 analyzes the voice quality (male voice or female voice) of the singing voice V is also conceivable. The singing evaluation unit 26 evaluates the singing voice V so that, when the analysis information Q indicates the attribute “male”, the singing evaluation value S is large if the singing voice V is analyzed as a male voice (that is, if the sex of the analysis target person estimated from the singing voice V matches the sex estimated using the recognition models M), and, when the analysis information Q indicates the attribute “female”, the singing evaluation value S is large if the singing voice V is analyzed as a female voice. Conversely, when the sex of the analysis target person estimated from the singing voice V differs from the sex estimated using the recognition models M, the singing voice V is evaluated so that the singing evaluation value S is small.

[Evaluation Example 3]
The K recognition models M[1] to M[K] include a recognition model M for the attribute “adult” and a recognition model M for the attribute “child”. When the analysis information Q indicates the attribute “child”, the singing evaluation unit 26 evaluates the singing voice V so that the singing evaluation value S tends to be larger than when the analysis information Q indicates the attribute “adult”. Specifically, the evaluation criteria are relaxed when the analysis information Q indicates the attribute “child” compared with when it indicates the attribute “adult”.

[Evaluation Example 4]
The K recognition models M[1] to M[K] include a plurality of recognition models M for attributes relating to age (“20s”, “50s”, and so on). The singing evaluation unit 26 analyzes the voice quality (degree of huskiness) of the singing voice V. When the analysis information Q indicates the attribute “20s”, compared with when it indicates the attribute “50s”, the singing evaluation unit 26 evaluates the singing voice V so that the singing evaluation value S becomes larger the huskier the voice quality of the singing voice V is (specifically, the more non-harmonic components dominate over harmonic components).

[Evaluation Example 5]
The K recognition models M[1] to M[K] include a recognition model M for the attribute “pops” and a recognition model M for the attribute “enka”. The recognition model M for “pops” is generated from voice data DA of voices suited to pop songs, and the recognition model M for “enka” is generated from voice data DA of voices suited to enka songs. The singing evaluation unit 26 evaluates the frequency of singing techniques in the singing voice V, such as kobushi (melismatic ornamentation), shakuri (scooping up to a pitch), and tame (a deliberately delayed onset). Singing techniques such as kobushi and shakuri tend to be used more heavily in enka than in pops. In view of this tendency, when the analysis information Q indicates the attribute “enka”, the singing evaluation unit 26 gives the frequency of singing techniques a greater weight in the singing evaluation value S than when the analysis information Q indicates the attribute “pops”. Accordingly, when the analysis information Q indicates the attribute “enka”, the singing evaluation value S increases more strongly with heavier use of singing techniques than when it indicates the attribute “pops”. Any known technique may be adopted for detecting the various singing techniques; for example, the techniques disclosed in JP 2008-268370 A (kobushi), JP 2004-102146 A (vibrato), JP 2012-008596 A (long tone), and JP 2007-334364 A (vibrato, inflection, properties, timing, shakuri) can be suitably adopted. The meaning of each singing technique is also explained in those publications.
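The attribute-dependent weighting of Evaluation Example 5 can be illustrated with a simple blend of two partial scores. The text only states that technique frequency weighs more heavily for “enka” than for “pops”; the linear blend, the weight values, and the names `TECHNIQUE_WEIGHT` and `singing_score` below are assumptions for illustration.

```python
# Hypothetical share of the singing evaluation value S that comes from the
# frequency of singing techniques; larger for "enka" than for "pops".
TECHNIQUE_WEIGHT = {"enka": 0.5, "pops": 0.25}


def singing_score(pitch_score: float, technique_score: float, attribute: str) -> float:
    """Blend a pitch-accuracy score and a technique-frequency score (both 0 to 100)."""
    weight = TECHNIQUE_WEIGHT[attribute]
    return (1 - weight) * pitch_score + weight * technique_score


# The same performance (pitch 80, technique-heavy 90) under each attribute.
print(singing_score(80, 90, "enka"))  # 85.0
print(singing_score(80, 90, "pops"))  # 82.5
```

Raising the technique score from 40 to 90 moves the result by 25 points under “enka” but only 12.5 points under “pops”, which is the sensitivity difference the example describes.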

[Evaluation Example 6]
The K recognition models M[1] to M[K] include a recognition model M for the attribute “passionate” and a recognition model M for the attribute “calm”. When the analysis information Q indicates the attribute “passionate”, the singing evaluation unit 26 evaluates the singing voice V so that the singing evaluation value S becomes larger the greater the groove (for example, the amplitude of volume variation synchronized with the beat points of the song) and the clarity (for example, the dominance of harmonic components over non-harmonic components). When the analysis information Q indicates the attribute “calm”, it evaluates the singing voice V so that the singing evaluation value S becomes larger the more frequently singing techniques are used and the darker the voice quality (a voice quality in which non-harmonic components dominate over harmonic components more in the high range than in the low range).

The fifth embodiment exemplified above also achieves the same effects as the first embodiment. In addition, because the singing skill of the analysis target person is evaluated with an evaluation method that depends on the analysis result of the voice analysis unit 24 (estimation processing unit 34), a singing evaluation suited to the nature of the analysis target person can be realized. The configurations of the second to fourth embodiments can also be applied to the fifth embodiment. For example, as illustrated in the fourth embodiment, the learning processing unit 22 may be omitted from the voice analysis device 100 of the fifth embodiment.

In the examples above, the singing skill of the analysis target person is evaluated with an evaluation method that depends on the estimation result of the voice analysis unit 24. A configuration may also be adopted in which the evaluation comment that the singing evaluation unit 26 presents to the analysis target person according to the singing evaluation value S is varied according to the analysis information Q. For example, assume that, as in Evaluation Example 1 above, the K recognition models M[1] to M[K] include a recognition model M for the attribute “male” and a recognition model M for the attribute “female”. When the analysis information Q indicates the attribute “male”, an evaluation comment suited to evaluating a man's singing is displayed on the display device 18 together with the singing evaluation value S, as illustrated in FIG. 11; when the analysis information Q indicates the attribute “female”, an evaluation comment suited to evaluating a woman's singing is displayed on the display device 18 together with the singing evaluation value S, as illustrated in FIG. 12. That is, even for the same singing evaluation value S, the content of the evaluation comment differs when the attribute indicated by the analysis information Q differs.

Information other than the analysis information Q can also be applied to the singing evaluation. For example, a configuration may be adopted in which the past singing history of the analysis target person (the tendency of the songs sung in the past) or attributes such as the age of the analysis target person are reflected in the singing evaluation. The singing evaluation unit 26 can also evaluate the singing of the analysis target person for each of a plurality of sections into which the song is divided on the time axis. A section serving as the unit of singing evaluation is, for example, a section of the song corresponding to a predetermined number of bars or notes. That is, the evaluation result of the singing evaluation unit 26 is updated as needed for each section (for example, every predetermined number of bars or every predetermined number of notes) in parallel with the progress of the singing of the song.
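The section-by-section update can be sketched as follows. The text does not say how the per-section result is computed, so this sketch makes two assumptions: a score is available for each sung note, and the evaluation result is the running mean of all scores seen so far. The function name `evaluate_by_section` and its parameters are hypothetical.

```python
from typing import Iterable, Iterator, List


def evaluate_by_section(note_scores: Iterable[float],
                        notes_per_section: int) -> Iterator[float]:
    """Yield an updated evaluation each time a section of notes is completed.

    Assumption: the evaluation is the running mean of the per-note scores.
    A trailing section with fewer than notes_per_section notes yields nothing.
    """
    if notes_per_section <= 0:
        raise ValueError("notes_per_section must be positive")
    seen: List[float] = []
    for count, score in enumerate(note_scores, start=1):
        seen.append(score)
        if count % notes_per_section == 0:  # a section has just ended
            yield sum(seen) / len(seen)


if __name__ == "__main__":
    # Two-note sections: the result is refreshed after notes 2 and 4.
    for result in evaluate_by_section([80, 100, 60, 60], notes_per_section=2):
        print(result)
```

Because the function is a generator, it can consume scores as the song progresses instead of waiting for the whole performance to finish.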

<Modification>
Each of the above embodiments can be modified in various ways. Specific modifications are illustrated below. Two or more modes arbitrarily selected from the following examples may be combined as appropriate.

(1) The content of the related data DC of each piece of learning data DL is not limited to the examples given above. For example, if the singing information DC2 includes whether or not the singer of the reference voice tends to sing first within a group of several people, a reference voice for which that tendency is specified can be classified into the set C[k] for the attribute "proactive" (or "attention-seeking").

(2) In each of the above embodiments, each recognition model M[k] generated in advance by the learning process is used for the estimation process. However, each recognition model M[k] can also be corrected after the fact according to an instruction from the user (the analysis target person) who has checked the analysis information Q.

(3) In the second embodiment, the evaluation index X[r] is calculated as a weighted sum of the evaluation indexes E[1] to E[K] using a plurality of weight values w[1] to w[K] (that is, a time series) corresponding to different time periods. However, the attribute to which each weight value w[k] corresponds is not limited to a time period. For example, the evaluation index X[r] can also be calculated as a weighted sum of the evaluation indexes E[1] to E[K] using a plurality of (K) weight values w[1] to w[K] corresponding to sets C[k] of different attributes.
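Whatever attribute the weight values correspond to, the weighted sum itself has the same form. The sketch below computes X[r] as the sum over k of w[k] times E[k] for each weight value series W[r], and then picks the analysis information Q[r] whose X[r] is largest, as in the second embodiment. The function names and all numeric values are hypothetical and serve only to illustrate the calculation.

```python
from typing import List, Sequence, Tuple


def weighted_sums(E: Sequence[float],
                  W: Sequence[Sequence[float]]) -> List[float]:
    """Return X[r] = sum over k of W[r][k] * E[k] for every series W[r]."""
    for series in W:
        if len(series) != len(E):
            raise ValueError("each weight value series needs K weight values")
    return [sum(w * e for w, e in zip(series, E)) for series in W]


def select_analysis_info(E: Sequence[float],
                         W: Sequence[Sequence[float]],
                         Q: Sequence[str]) -> Tuple[str, List[float]]:
    """Pick the Q[r] whose weighted sum X[r] is largest."""
    X = weighted_sums(E, W)
    best_r = max(range(len(X)), key=X.__getitem__)
    return Q[best_r], X


if __name__ == "__main__":
    E = [0.1, 0.7, 0.2]          # evaluation indexes E[1..K], K = 3
    W = [[1.0, 0.0, 0.0],        # weight value series W[1]
         [0.0, 1.0, 0.0],        # weight value series W[2]
         [0.5, 0.0, 0.5]]        # weight value series W[3]
    Q = ["Q[1]", "Q[2]", "Q[3]"]  # analysis information per series
    print(select_analysis_info(E, W, Q)[0])
```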

In the second embodiment, the analysis information Q[r] corresponding to the weight value series W[r] that maximizes the evaluation index X[r] is selected. However, the analysis information Q can also be specified from the features of the distribution of the K evaluation indexes E[1] to E[K] calculated by the index calculation unit 32. For example, a typical distribution of the K evaluation indexes E[1] to E[K] (for example, statistics such as the mean and the variance) may be prepared in advance for each of the R pieces of analysis information Q[1] to Q[R], and the estimation processing unit 34 may select, from among the R pieces of analysis information Q[1] to Q[R], the analysis information Q[r] whose prepared distribution is closest to the distribution of the K evaluation indexes E[1] to E[K] (for example, a distribution with a similar peak value or a similar variance).
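The distribution-based selection can be sketched as a nearest-neighbor lookup over summary statistics. The text names the mean and the variance as example statistics but does not define "closest", so the squared Euclidean distance between (mean, variance) pairs used here is an assumption, as are the function names and the numeric values.

```python
from statistics import fmean, pvariance
from typing import Dict, Sequence, Tuple


def distribution_stats(E: Sequence[float]) -> Tuple[float, float]:
    """Summarize the K evaluation indexes by their mean and variance."""
    return fmean(E), pvariance(E)


def select_by_distribution(E: Sequence[float],
                           typical: Dict[str, Tuple[float, float]]) -> str:
    """Return the analysis information whose prepared (mean, variance)
    is nearest to the statistics of E (squared Euclidean distance,
    which is an assumption of this sketch)."""
    mean, var = distribution_stats(E)
    return min(
        typical,
        key=lambda q: (typical[q][0] - mean) ** 2 + (typical[q][1] - var) ** 2,
    )


if __name__ == "__main__":
    # Hypothetical prepared statistics for two pieces of analysis information.
    typical = {"Q[1]": (0.33, 0.11),   # one dominant attribute (peaked)
               "Q[2]": (0.33, 0.00)}   # indexes spread evenly (flat)
    print(select_by_distribution([0.1, 0.1, 0.8], typical))
    print(select_by_distribution([0.30, 0.35, 0.35], typical))
```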

(4) The voice analysis device 100 can also be realized by a server device that communicates with a communication terminal such as a networked karaoke device. For example, the voice analysis device 100 applies the singing voice V received from the communication terminal via a communication network to each of the K recognition models M[1] to M[K] (estimation process), and transmits analysis information Q indicating the result of analyzing the property of the analysis target person to the communication terminal.

(5) A synthesized voice generated by a voice synthesis technique can also be analyzed as the singing voice V. That is, the analysis target person includes not only real singers but also virtual singers assumed in voice synthesis.

(6) The voice analysis device according to each of the above embodiments can be realized by hardware (electronic circuitry) such as a DSP (Digital Signal Processor) dedicated to estimating the property of the analysis target person, or by cooperation between a general-purpose arithmetic processing device such as a CPU (Central Processing Unit) and a program. A program according to a preferred aspect of the present invention causes a computer to function as: an index calculation unit that, for each of a plurality of different attributes, uses a recognition model representing the tendency of singing voices corresponding to the attribute to calculate an evaluation index, which is an index of the likelihood that the singing voice of the analysis target person corresponds to a voice of the attribute; and an estimation processing unit that estimates the property of the analysis target person according to the evaluation indexes calculated by the index calculation unit for each of the plurality of attributes. The program of the present invention can be provided in a form stored on a computer-readable recording medium and installed on a computer. The recording medium is, for example, a non-transitory recording medium. An optical recording medium (optical disc) such as a CD-ROM is a good example, but any known type of recording medium, such as a semiconductor recording medium or a magnetic recording medium, may be used. The program of the present invention can also be provided, for example, in the form of distribution via a communication network and installed on a computer.

The present invention is also specified as a method of operating the voice analysis device according to each of the aspects described above (a voice analysis method). A voice analysis method according to a preferred aspect of the present invention includes: an index calculation step of calculating, for each of a plurality of different attributes, an evaluation index that is an index of the likelihood that the singing voice of the analysis target person corresponds to a voice of the attribute, using a recognition model representing the tendency of singing voices corresponding to the attribute; and an estimation processing step of estimating the property of the analysis target person according to the evaluation indexes calculated for each of the plurality of attributes in the index calculation step.
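The two steps of the method can be sketched as two small functions. This is an illustration under stated assumptions, not the implementation of the embodiments: each recognition model is represented as a callable that maps voice features to a score, the toy models score a voice by how close its mean pitch is to a reference pitch, and the estimate is simply the attribute with the largest evaluation index. The type aliases, function names, reference pitches, and input values are all hypothetical.

```python
from statistics import fmean
from typing import Callable, Dict, Sequence

Features = Sequence[float]                       # e.g. pitch values in Hz
RecognitionModel = Callable[[Features], float]   # higher score = better match


def calculate_indexes(models: Dict[str, RecognitionModel],
                      voice: Features) -> Dict[str, float]:
    """Index calculation step: one evaluation index E[k] per attribute."""
    return {attribute: model(voice) for attribute, model in models.items()}


def estimate_property(indexes: Dict[str, float]) -> str:
    """Estimation step (assumed rule): the attribute with the largest index."""
    return max(indexes, key=indexes.__getitem__)


def pitch_model(reference_hz: float) -> RecognitionModel:
    """Toy stand-in for a trained model: closeness of mean pitch to a reference."""
    return lambda voice: -abs(fmean(voice) - reference_hz)


if __name__ == "__main__":
    models = {"male": pitch_model(120.0), "female": pitch_model(220.0)}
    singing_voice = [210.0, 230.0, 225.0]        # hypothetical pitch track of V
    indexes = calculate_indexes(models, singing_voice)
    print(indexes)
    print(estimate_property(indexes))
```

Separating the two steps mirrors the split between the index calculation unit 32 and the estimation processing unit 34, so the selection rule can be replaced (for example by a combination of two or more attributes) without touching the models.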

100, 54 ... voice analysis device, 12 ... arithmetic processing device, 14 ... storage device, 16 ... sound collection device, 18 ... display device, 22 ... learning processing unit, 24 ... voice analysis unit, 26 ... singing evaluation unit, 32 ... index calculation unit, 34 ... estimation processing unit, 52 ... management device.

Claims

A speech analysis apparatus comprising:
an index calculation unit that, for each of a plurality of different attributes, uses a recognition model representing the tendency of singing voices corresponding to the attribute to calculate an evaluation index that is an index of the likelihood that the singing voice of an analysis target person corresponds to a voice of the attribute; and
an estimation processing unit that estimates a property of the analysis target person according to the evaluation index calculated by the index calculation unit for each of the plurality of attributes.

The voice analysis device, wherein the estimation processing unit estimates the property of the analysis target person according to a combination of two or more attributes that are selected from among the plurality of attributes according to the evaluation indexes calculated by the index calculation unit.

The speech analysis apparatus according to claim 1, wherein the estimation processing unit calculates, for each of a plurality of weight value series corresponding to different estimation results, a weighted sum of the evaluation indexes to which the plurality of weight values included in that weight value series are applied, and specifies the estimation result corresponding to a weight value series selected from among the plurality of weight value series according to the weighted sum.

The recognition model of each of the plurality of attributes is generated by machine learning using each reference voice of a set corresponding to the attribute among a plurality of sets obtained by classifying a plurality of reference voices for each attribute. Item 4. The voice analysis device according to any one of items 3 to 4.

The voice analysis device, wherein the plurality of reference voices are classified into the plurality of sets according to at least one of a time period in which the reference voice was uttered, a place where the reference voice was uttered, and the character of the person who uttered the reference voice.

The speech analysis apparatus according to any one of claims 1 to 5, wherein the estimation processing unit registers a result of estimating the property of the analysis target person in a marketing database.

The speech analysis apparatus according to claim 1, further comprising: a singing evaluation unit that evaluates the singing skill of the person to be analyzed by an evaluation method according to an estimation result by the estimation processing unit.

The voice analysis device according to any one of claims 1 to 5, further comprising: a singing evaluation unit that presents an evaluation comment according to a result of evaluating the singing skill of the person to be analyzed and an estimation result by the estimation processing unit.