JP4612329B2

JP4612329B2 - Information processing apparatus and program

Info

Publication number: JP4612329B2
Application number: JP2004133082A
Authority: JP
Inventors: 裕子石若; 覚安居; 崇正佐藤; 侑昇嘉数
Original assignee: 株式会社テクノフェイス
Priority date: 2004-04-28
Filing date: 2004-04-28
Publication date: 2011-01-12
Anticipated expiration: 2024-04-28
Also published as: JP2005316077A

Abstract

PROBLEM TO BE SOLVED: To provide an information processing apparatus and the like with which users can practice voice imitation, addressing the problem that conventional automatic performance devices and the like do not assume that users practice voice imitation.
SOLUTION: The information processing apparatus comprises: an audio data storage unit 101 in which audio data is stored; an audio acquisition unit 102 that acquires voice; a first feature quantity extraction unit 103 that extracts predetermined feature quantities of the voice acquired by the audio acquisition unit; a second feature quantity extraction unit 104 that extracts the predetermined feature quantities from the audio data stored in the audio data storage unit; a comparison unit 105 that compares the feature quantities extracted by the first feature quantity extraction unit with those extracted by the second feature quantity extraction unit; and an output unit 106 that outputs the result of the comparison by the comparison unit. With this configuration, users can practice voice imitation easily.
COPYRIGHT: (C)2006, JPO&NCIPI

Description

The present invention relates to an information processing apparatus and the like with which a user can practice voice imitation and similar skills.

The voice a person hears when speaking is a combination of the voice carried through the air (air-conduction feedback) and the voice carried through the skull (bone-conduction feedback). Because other people hear only the sound carried through the air, it is impossible to let others hear the voice that the speaker hears. This is why one's own voice heard through a microphone, or in a recording, sounds different from usual and feels strange. When imitating a voice at karaoke or at a banquet, a performer may believe the imitation is very close, yet the voice through the microphone is less similar than the performer thinks, and the audience responds poorly.
Against this background, technology is needed to support people who have realized that their impressions are convincing only to themselves, people who want to acquire a party trick, people who are troubled at being pressed to perform a party trick at a banquet, and people who want to improve their voice imitation.

The following automatic performance device is related to technology for solving the above problem. This device generates musical tones based on musical tone data stored in advance and performs automatically. It comprises: conversion means for receiving voice input and converting it into a singing signal; signal generation means for generating a trigger signal at predetermined intervals while musical tones are being generated; first counting means for counting the number of times the trigger signal is generated by the signal generation means; second counting means for counting, each time the trigger signal is generated, whether a singing signal converted by the conversion means is present; calculation means for calculating an evaluation result according to the ratio of the number counted by the second counting means to the number counted by the first counting means; and notification means for reporting the evaluation result calculated by the calculation means (see Patent Document 1).
This automatic performance device with a singing ability evaluation function is suited to early education and the like and can evaluate a voice sung to an accompaniment. Its object is to provide a device, suitable for the education of young children, that has a karaoke function allowing a song to be sung over an accompaniment and a singing ability evaluation function, and that gives the child a real sense that his or her singing ability is being evaluated.

Primavista (registered trademark) is a music software product that incorporates related technology (see Non-Patent Document 1). It is choral practice software with four functions: "pitch graph", "note-learning mode", "sight-singing training", and "harmony measurement". The pitch graph function displays changes in pitch as a graph when the user sings into a PC microphone, which allows accurate pitch to be practiced. The note-learning mode is for practicing a choral part: when the user sings while listening to other parts or to the user's own part, the pitch of the sound is shown on the score. The sight-singing training function displays scale and interval exercises as a score, and the user practices score reading and pitch by singing them. The harmony measurement function is for practicing harmony: when the user harmonizes with sound from the PC, or when two people harmonize, the interval of the chord is displayed.
Patent Document 1: JP-A-5-11687 (page 1, FIG. 1, etc.)
Non-Patent Document 1: Kawai Musical Instruments Mfg. Co., Ltd. website, Internet <URL: http://www.kawai.co.jp/cmusic/products/primavista.htm>

However, the conventional technology described above does not assume that the user practices voice imitation. That is, it does not consider which feature quantities of the audio data must be highly similar for a listener to feel that a voice uttered in imitation of something resembles the original.
Therefore, although the conventional technology makes it possible to evaluate singing ability and to practice singing without going off pitch, it has been difficult to judge, in a way that matches human perception, whether an uttered voice resembles the target voice.
Also, with the conventional technology it has not been possible to practice polishing an imitation of only a part of the stored audio. An act of this kind is what is called a party trick (ippatsu-gei).
Furthermore, the conventional technology could not display the degree of resemblance in real time using an index close to what a person perceives, so that, for example, a user imitating a song could not correct course partway through.

Moreover, the conventional technology could not, for example, forcibly change the audio data of a correctly sung song into off-key audio data and output the similarity to that off-key data, so users could not practice singing deliberately off-key. Being able to sing deliberately off-key is useful enough as a banquet entertainment.

The information processing apparatus of the first invention comprises: an audio acquisition unit that acquires voice; a first feature quantity extraction unit that extracts a predetermined feature quantity of the voice acquired by the audio acquisition unit; a comparison unit that compares the feature quantity extracted by the first feature quantity extraction unit with a feature quantity of audio data serving as the comparison reference; and an output unit that outputs the result of the comparison by the comparison unit. The predetermined feature quantity preferably includes one or more of: information on vibrato extracted from the audio data, information on how a sound is entered, and information on changes in pitch.
With this configuration, voice imitation can be practiced easily. The user can also acquire the ability to imitate voices in a way that people perceive as similar.

In the information processing apparatus of the second invention, which builds on the first invention, the audio data can be divided into predetermined parts; the comparison unit compares, for each part, the feature quantity extracted by the first feature quantity extraction unit with the feature quantity of the reference audio data; and the output unit outputs the comparison result for each part produced by the comparison unit.
With this configuration, it is easy to practice imitating a part of the voice.

The information processing apparatus of the third invention, which builds on the second invention, further comprises: an input receiving unit that receives an input designating one of the parts; and an audio output unit that, when the input receiving unit receives the input, reads out the portion of the audio data corresponding to the designated part and outputs it as sound.
With this configuration, it is easy to practice imitating a part of the voice.

The information processing apparatus of the fourth invention, which builds on the second invention, further comprises an input receiving unit that receives an input designating one of the parts. When the input receiving unit receives the input designating the part, the audio acquisition unit acquires voice, the first feature quantity extraction unit extracts the predetermined feature quantity of the acquired voice, the comparison unit compares the extracted feature quantity with the feature quantity of the reference audio data, and the output unit outputs the result of the comparison.
With this configuration, it is easy to practice imitating a part of the voice.
In the information processing apparatus of the fifth invention, which builds on any of the above information processing apparatuses, the output unit outputs the result of the comparison by the comparison unit visually.
With this configuration, the voice imitation index is apparent at a glance, which makes it easy for the user to practice voice imitation.

The information processing apparatus of the sixth invention, which builds on any of the above information processing apparatuses, further comprises: a sound shift information input receiving unit that receives input of sound shift information, which indicates the degree to which the audio data stored in the audio data storage unit is to be changed; and an audio data changing unit that changes the audio data based on the sound shift information.
With this configuration, for example, the audio data of a correctly sung song can be forcibly changed into off-key audio data so that the user can practice singing deliberately off-key.
The above information processing apparatuses may be realized in software.

According to the present invention, voice imitation and the like can be practiced.

Embodiments of the information processing apparatus and the like are described below with reference to the drawings. Components given the same reference numerals in the embodiments operate in the same way, so repeated description may be omitted.
(Embodiment 1)

FIG. 1 is a block diagram of the information processing apparatus according to this embodiment. The information processing apparatus comprises an audio data storage unit 101, an audio acquisition unit 102, a first feature quantity extraction unit 103, a second feature quantity extraction unit 104, a comparison unit 105, an output unit 106, a sound shift information input receiving unit 107, an audio data changing unit 108, and an input receiving unit 109.
The first feature quantity extraction unit 103 comprises first vibrato information acquisition means 1031, first onset information acquisition means 1032, and first pitch change information acquisition means 1033.
The second feature quantity extraction unit 104 comprises second vibrato information acquisition means 1041, second onset information acquisition means 1042, and second pitch change information acquisition means 1043.

The audio data storage unit 101 stores the audio data to be imitated (hereinafter referred to as "teacher data" where appropriate). The audio data is, for example, musical tone data in MIDI format or sound data in WAV format, but any format may be used. The audio data may be a singer's singing voice, an animal call, a machine sound, or audio of English or Korean words and sentences being read aloud, among others. The audio data storage unit 101 is preferably a non-volatile recording medium but can also be realized with a volatile recording medium.
The audio acquisition unit 102 acquires voice uttered by a person and converts it into audio data. It can be realized by, for example, a microphone and software that converts the sound collected by the microphone into audio data.

The first feature quantity extraction unit 103 extracts predetermined feature quantities of the voice acquired by the audio acquisition unit 102. The predetermined feature quantities are one or more feature quantities for which a high degree of similarity makes a person perceive two voices as alike. They include, for example, one or more of: information on vibrato extracted from the audio data, information on how a sound is entered, and information on changes in pitch. The first feature quantity extraction unit 103 can usually be realized by an MPU, memory, and the like. Its processing procedure is usually realized in software, and the software is recorded on a recording medium such as a ROM. However, it may instead be realized in hardware (a dedicated circuit).

The second feature quantity extraction unit 104 extracts the predetermined feature quantities from the audio data stored in the audio data storage unit 101. The feature quantities it extracts are of the same kind as those extracted by the first feature quantity extraction unit 103. The second feature quantity extraction unit 104 can usually be realized by an MPU, memory, and the like. Its processing procedure is usually realized in software, and the software is recorded on a recording medium such as a ROM. However, it may instead be realized in hardware (a dedicated circuit).

The comparison unit 105 compares the feature quantities extracted by the first feature quantity extraction unit 103 with those extracted by the second feature quantity extraction unit 104 and passes the comparison result to the output unit 106. When two or more feature quantities are to be compared, the comparison unit 105 compares them feature by feature. In that case, the comparison result may be output for each feature quantity, or a single result may be generated from the two or more comparison results and output. The comparison result may be a voice imitation index indicating how close the imitation is as a whole, or a comparison result for each part (for example, one measure). The comparison unit 105 can usually be realized by an MPU, memory, and the like. Its processing procedure is usually realized in software, and the software is recorded on a recording medium such as a ROM. However, it may instead be realized in hardware (a dedicated circuit).
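By way of a non-limiting illustration, the feature-by-feature comparison and the merging of several comparison results into a single result could be sketched as follows. The `Features` container, the `similarity` measure, and the weighted mean in `combine` are assumptions made for this sketch; the specification does not fix how the comparison is computed.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Features:
    """Hypothetical container: one number per feature type named in the text."""
    vibrato: float
    onset: float
    pitch_change: float


def similarity(a: float, b: float) -> float:
    """Assumed measure in [0, 1]; 1.0 means the two values are identical."""
    scale = max(abs(a), abs(b))
    if scale == 0.0:
        return 1.0
    return 1.0 - min(abs(a - b) / scale, 1.0)


def compare(user: Features, teacher: Features) -> Dict[str, float]:
    """Compare feature by feature and return one result per feature."""
    return {
        "vibrato": similarity(user.vibrato, teacher.vibrato),
        "onset": similarity(user.onset, teacher.onset),
        "pitch_change": similarity(user.pitch_change, teacher.pitch_change),
    }


def combine(results: Dict[str, float],
            weights: Optional[Dict[str, float]] = None) -> float:
    """Merge per-feature results into one value (weighted mean, an assumption)."""
    if weights is None:
        weights = {name: 1.0 for name in results}
    total_weight = sum(weights[name] for name in results)
    return sum(results[name] * weights[name] for name in results) / total_weight
```

Returning the dictionary from `compare` corresponds to outputting a result for each feature quantity, while `combine` corresponds to generating a single result from two or more comparison results.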

The output unit 106 outputs the result of the comparison by the comparison unit 105. It preferably outputs the result visually, and if possible as an image, because this makes the comparison result easy for the user to understand. It also preferably outputs the result in real time, because when the comparison continues for some length of time the user can then easily adjust toward a closer resemblance. More preferably, the output unit 106 changes a face image that includes images of the eyes, nose, and/or mouth, and displays it in a manner that leads to a better comparison result. The voice a user utters changes as the shape of the face changes (mainly the shape of the mouth). Displaying the face that produces a similar utterance makes it much easier for the user to adjust toward the voice being imitated. Output usually means display on a display device, but the concept also includes printing on a printer, transmission to an external device, and the like. The output unit 106 may or may not be regarded as including an output device such as a display. It can be realized by driver software for an output device, or by such driver software together with the output device.

The sound shift information input receiving unit 107 receives input of sound shift information, which indicates the degree to which the audio data stored in the audio data storage unit 101 is to be changed. Any input means may be used, such as a numeric keypad, a keyboard, a mouse, or a menu screen. The sound shift information input receiving unit 107 can be realized by a device driver for input means such as a numeric keypad or keyboard, by control software for a menu screen, and the like.

The audio data changing unit 108 automatically changes the audio data stored in the audio data storage unit 101 based on the sound shift information received by the sound shift information input receiving unit 107. Any algorithm may be used to change the audio data. When the sound shift information is a ratio, the audio data changing unit 108, for example, changes that ratio of the sound information by random amounts. The random amounts are obtained from, for example, random numbers. The audio data changing unit 108 can usually be realized by an MPU, memory, and the like. Its processing procedure is usually realized in software, and the software is recorded on a recording medium such as a ROM. However, it may instead be realized in hardware (a dedicated circuit).
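As one possible illustration of changing a given ratio of the sound information by random amounts, the following sketch assumes that the teacher data is a list of MIDI note numbers (0 to 127). The function name `detune`, the `max_shift` parameter, and the use of semitone steps are assumptions made for this sketch; the specification leaves the changing algorithm open.

```python
import random
from typing import List, Optional


def detune(notes: List[int], ratio: float, max_shift: int = 2,
           seed: Optional[int] = None) -> List[int]:
    """Shift a `ratio` share of the notes by a random non-zero number of semitones."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    rng = random.Random(seed)
    count = round(len(notes) * ratio)               # how many notes to alter
    chosen = rng.sample(range(len(notes)), count)   # which notes to alter
    shifts = [s for s in range(-max_shift, max_shift + 1) if s != 0]
    changed = list(notes)
    for i in chosen:
        # Clamp to the valid MIDI note range after shifting.
        changed[i] = min(127, max(0, notes[i] + rng.choice(shifts)))
    return changed
```

For example, `detune([60, 62, 64, 65, 67, 69, 71, 72], 0.5)` returns a copy of the scale in which four of the eight notes have been moved by one or two semitones.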

The input receiving unit 109 receives an instruction to start or end processing. When a start instruction is received, the audio acquisition unit 102 starts acquiring voice, and comparison between the stored audio data and the acquired audio data begins. When an end instruction is received, the processing of the information processing apparatus ends. Any means may be used to input the instruction, such as a numeric keypad, a keyboard, a mouse, or a menu screen. The input receiving unit 109 can be realized by a device driver for input means such as a numeric keypad or keyboard, by control software for a menu screen, and the like.
The first vibrato information acquisition means 1031 and the second vibrato information acquisition means 1041 acquire vibrato information, which is information on vibrato, from audio data. Specific examples of vibrato information are described later.
The first onset information acquisition means 1032 and the second onset information acquisition means 1042 acquire onset information, which is information on how a sound is entered, from audio data. Specific examples of onset information are described later.
The first pitch change information acquisition means 1033 and the second pitch change information acquisition means 1043 acquire pitch change information, which is information on changes in pitch, from audio data. Specific examples of pitch change information are described later.

Each of the above means can usually be realized by an MPU, memory, and the like. The processing procedure of each means is usually realized in software, and the software is recorded on a recording medium such as a ROM. However, each means may instead be realized in hardware (a dedicated circuit).
The operation in which the information processing apparatus judges how closely the voice uttered by the user resembles the audio data to be imitated, and outputs the judgment result, is described below with reference to the flowchart of FIG. 2.
(Step S201) The input receiving unit 109 determines whether a start instruction has been received. If so, the process goes to step S202; if not, it returns to step S201.
(Step S202) The audio acquisition unit 102 acquires voice uttered by a person and converts it into audio data. The converted audio data is appended to a buffer. The converted audio data is, for example, the waveform data described later.

(Step S203) The first feature quantity extraction unit 103 determines whether a boundary at which audio data is to be compared has been reached. If so, the process goes to step S204; if not, it returns to step S202. Whether a boundary has been reached is determined, for example, by whether a predetermined time has elapsed. As described later, a predetermined time of about 0.03 seconds is preferable. The first feature quantity extraction unit 103 may also determine that a boundary has been reached when the audio data acquired by the audio acquisition unit 102 has been silent for a certain time or longer.
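The two boundary criteria in step S203 can be illustrated with the following sketch, which assumes that the buffered audio data is a sequence of amplitude samples normalized to the range -1.0 to 1.0. The function names, the amplitude `threshold`, and the 0.5-second silence length are assumptions made for this sketch; only the figure of about 0.03 seconds comes from the description above.

```python
from typing import List, Sequence


def split_into_segments(samples: Sequence[float], sample_rate: int,
                        segment_seconds: float = 0.03) -> List[Sequence[float]]:
    """Cut the buffer into fixed-length segments (the last one may be shorter)."""
    size = max(1, int(round(sample_rate * segment_seconds)))
    return [samples[i:i + size] for i in range(0, len(samples), size)]


def ends_in_silence(samples: Sequence[float], sample_rate: int,
                    min_silence_seconds: float = 0.5,
                    threshold: float = 0.01) -> bool:
    """Return True when the buffer ends with at least the given length of silence."""
    need = int(round(sample_rate * min_silence_seconds))
    if need == 0 or len(samples) < need:
        return False
    return all(abs(s) < threshold for s in samples[-need:])
```

At a sampling rate of 44,100 Hz, a 0.03-second segment corresponds to 1,323 samples.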

(Step S204) The first vibrato information acquisition means 1031 acquires first vibrato information from the audio data stored in the buffer. An example of an algorithm for acquiring vibrato information from audio data is described later.
(Step S205) The first onset information acquisition means 1032 acquires first onset information from the audio data stored in the buffer. An example of an algorithm for acquiring onset information from audio data is described later.
(Step S206) The first pitch change information acquisition means 1033 acquires first pitch change information from the audio data stored in the buffer. An example of an algorithm for acquiring pitch change information from audio data is described later.

(Step S207) The second vibrato information acquisition means 1041 acquires second vibrato information from the corresponding audio data within the audio data in the audio data storage unit 101. An example of an algorithm for acquiring vibrato information from audio data is described later. The corresponding audio data is the audio data that corresponds to the audio data acquired by the audio acquisition unit 102.
(Step S208) The second onset information acquisition means 1042 acquires second onset information from the corresponding audio data in the audio data storage unit 101. An example of an algorithm for acquiring onset information from audio data is described later.
(Step S209) The second pitch change information acquisition means 1043 acquires second pitch change information from the audio data stored in the buffer. An example of an algorithm for acquiring pitch change information from audio data is described later.
(Step S210) The comparison unit 105 compares the first vibrato information with the second vibrato information and outputs the comparison result.
(Step S211) The comparison unit 105 compares the first onset information with the second onset information and outputs the comparison result.
(Step S212) The comparison unit 105 compares the first pitch change information with the second pitch change information and outputs the comparison result.
(Step S213) The comparison unit 105 calculates a score using the comparison results output in steps S210 to S212 as parameters. The score calculated here is a partial voice imitation index.

(Step S214) The output unit 106 composes the image to be output based on the comparison results output in steps S210 to S212. The process of composing an image may be a process of reading out stored image data.
(Step S215) The output unit 106 outputs the image composed in step S214.
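Where composing the image amounts to reading out stored image data, the selection could, purely as an illustration, look like the sketch below. It assumes that the comparison results have already been reduced to a single score between 0 and 1; the thresholds and the image file names are hypothetical and do not appear in this specification.

```python
import bisect

# Hypothetical score boundaries and stored face images, ordered from
# "least similar" to "most similar".
THRESHOLDS = [0.25, 0.5, 0.75]
IMAGES = ["face_far.png", "face_closer.png", "face_close.png", "face_match.png"]


def select_face_image(score: float) -> str:
    """Pick the stored image whose score band contains `score`."""
    return IMAGES[bisect.bisect_right(THRESHOLDS, score)]
```

Because the lookup is a single binary search, it is cheap enough to repeat for every comparison segment, which suits the real-time output preferred for the output unit 106.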

(Step S216) It is determined whether processing is to end. This is determined by whether the input receiving unit 109 has received an end instruction, or whether the comparison of the audio data has finished. If processing is to end, the process goes to step S217; otherwise it returns to step S202. Before the return to step S202, the pointer used to read audio data from the audio data storage unit 101 is advanced. That is, the pointer is moved to the head address of the next audio data to be compared.

(Step S217) The comparison unit 105 calculates an overall score from the one or more scores calculated in step S213. The overall score is the voice imitation index. It may be the sum of the one or more scores calculated in step S213, their average, or the sum rescaled so that the maximum is 100 points.
(Step S218) The output unit 106 outputs the voice imitation index calculated in step S217.
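The three ways of forming the overall score mentioned in step S217 can be sketched as follows. This is only an illustration in Python: the function name, the `mode` argument, and the assumed maximum of 100 points per partial score (`max_partial`) are not taken from the source.

```python
from statistics import mean

def overall_score(partial_scores, mode="sum", max_partial=100.0):
    """Combine the partial scores of step S213 into one voice imitation index."""
    if not partial_scores:
        raise ValueError("at least one partial score is required")
    if mode == "sum":
        return float(sum(partial_scores))
    if mode == "average":
        return float(mean(partial_scores))
    if mode == "scaled":
        # Rescale the sum so that a perfect run maps to 100 points.
        # max_partial (best possible partial score) is an assumption.
        return 100.0 * sum(partial_scores) / (max_partial * len(partial_scores))
    raise ValueError(f"unknown mode: {mode}")
```

For example, two partial scores of 60 and 80 give 140 as a sum and 70 both as an average and as a sum rescaled to a 100-point maximum.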

Next, processing when the information processing apparatus receives input of sound shift information will be described. When the sound shift information input receiving unit 107 receives sound shift information, the audio data changing unit 108 automatically changes the audio data stored in the audio data storage unit 101 based on that sound shift information.

Next, an experiment performed in the preparatory stage of developing this information processing apparatus will be described. In developing the apparatus, it is important to extract the feature amounts of voice imitation, which requires determining the criteria by which people judge a voice imitation to be similar. Even when listening to the same imitation, some people say it is similar and others say it is not, because the criteria for judging similarity differ between individuals. For certain characteristic sounds, however, a common evaluation criterion may exist. A voice imitation evaluation experiment was therefore conducted to extract the feature amounts of voice imitation.

The purpose of this evaluation experiment is to investigate whether there are voice imitations that most evaluators judge to be similar, whether evaluators can recognize the teacher data as the same sound when it is played back to them as if it were a voice imitation, and what score it then receives. Teacher data is the data to be imitated, and corresponds to the audio data in the audio data storage unit 101 in the configuration of this information processing apparatus.

In the experiment, participants first listened once to a short piece of teacher data lasting less than 3 seconds. They then listened to voice imitations by 10 amateurs at 5-second intervals and, during each 5-second blank, scored the imitation from 0 to 100 points based on their own intuition. The evaluators were not told the true purpose of the experiment; they were told that its purpose was to collect data for use as teacher data in machine learning. Twenty-three people performed the evaluation on five types of sound data. The evaluation results are shown in FIG. 3.

In the table of FIG. 3, "overall average" is the average score over all evaluators. "Best imitation" is the average, over all graders, of the scores given to the imitator who scored highest, and "worst" is the corresponding average for the imitator who scored lowest. "TOP rate" is the proportion of graders who gave the highest score. "Teacher data recognition rate" is the recognition rate and average score when the teacher data itself was mixed into the imitation list; a grader who gave the teacher data the highest score is counted as having "recognized" it. Lists into which no teacher data was mixed are marked with "−". All graders were Japanese; the imitators were Malaysian, Swiss German, Italian, French, and Japanese, and both the order and the people were varied for each sound list. In the table of FIG. 3, the first teacher data is the electronic sound of a dinosaur toy, the second is the bleat of a real goat, the third is the short Japanese phrase "Onushi mo waru yo no" ("You are a scoundrel too"), the fourth is a Swiss German word meaning cupboard, and the fifth is a long French utterance.

The first teacher data, the electronic sound of the dinosaur toy, was imitated by ten people in total: one Malaysian man, two Swiss German men, one French man, two Japanese men, one Italian woman, and three Japanese women. The teacher data was not included among these imitations. In the evaluation, high scores concentrated on one particular Japanese woman. FIG. 4 shows the waveform data and time-frequency analysis results for this highly rated woman's data, for poorly rated data, and for the teacher data. The top row of FIG. 4 shows the waveform, the middle row the frequency spectrum, and the bottom row the time-frequency analysis (x-axis: time, y-axis: frequency). White areas indicate large amplitude.

The analysis results in FIG. 4 show three points about the imitation judged to be similar: its frequency spectrum resembles that of the teacher data, its sound onset is similar, and its waveform is similar. In duration, however, it is about 1 second longer than the roughly 2-second teacher data. This suggests that, for the first teacher data (the electronic dinosaur toy sound), tempo accuracy is not among the features used to judge similarity.

For teacher data 2 (the goat), whose features the graders could recognize, the teacher data recognition rate was high, at 82.6%. For teacher data 4 and 5 (Swiss German and French), whose features were difficult to recognize, the recognition rates were low, at 39.1% and 30.4% respectively. For the French of teacher data 5, exactly the same imitation was included more than once in the imitation list, yet only four graders gave it the same score. This shows that sounds whose features are easy to grasp can be scored, whereas sounds whose features cannot be grasped cannot be evaluated. Because it is difficult to grasp the features of an unfamiliar foreign language even from 3 seconds of data, it is conjectured that, for example, a tone-deaf person fails to catch pitch because they cannot grasp the features of the musical scale. In music as well, catching the pitch of a long phrase all at once is very difficult for people without musical experience and for tone-deaf people. Teaching data for correcting tone deafness is therefore considered more effective when divided by bar or by melody and taught in short time segments.

Because a listener compares the sound currently heard against the features they remember, an exaggerated imitation tends to be perceived as more similar by everyone. The evaluation experiment showed that, for sounds whose features cannot be grasped, people cannot accurately memorize the sound itself even when it lasts only a few seconds. This is also suggested by the results for unfamiliar foreign languages, where graders could not judge the imitations, heard them all as the same, or failed to identify the teacher data as teacher data. Consequently, even an imitation that reproduces the teacher data exactly may not be judged similar by other people. It therefore appears more effective to teach with processed data in which the features of the teacher data have been made more pronounced.

Based on the experiments of FIGS. 3 and 4, the feature amounts were examined using imitation data that in some cases was rated higher than the teacher data and imitation data that was rated highly overall. Specifically, the feature amounts needed for voice imitation were compared. As sound features, pitch, loudness, and the elements that determine timbre (harmonic components, rise time, rise characteristics, vibrato, amplitude modulation, pitch fluctuation, and so on) were extracted, their influence on voice imitation was examined, and the feature amounts to be actually used were decided. As a result, the features of audio data that lead to a judgment of similarity were determined to be mainly three: vibrato, the way the sound begins (onset), and the relative change in pitch. When comparing two pieces of audio data, a person judges them similar mainly when these features are similar. A difference in timing (tempo) does not affect the evaluation.

In addition, a time resolution of 0.03 seconds or more is required to capture the characteristics of vibrato. With such a time resolution, the characteristics of the sound onset and of the relative change in pitch can also be obtained. The time resolution required for classifying sounds is therefore set to 0.03 seconds here, although approximately 0.03 seconds is sufficient. The feature amounts include the presence or absence and strength of vibrato obtained from the change in amplitude at each time, the presence or absence of an opening crescendo, the strength of the attack (the start of the sound), and the time difference of the volume.
Hereinafter, a specific operation of the information processing apparatus according to the present embodiment will be described. First, in the audio data storage unit 101, raw waveform data of teacher data that is a target of voice imitation is stored.

First, the second feature amount extraction unit 104 obtains the second vibrato information, the second input information, and the second pitch change information from the raw waveform data of the teacher data stored in the audio data storage unit 101. Specifically, it performs the following processing.

That is, the second feature amount extraction unit 104 of the information processing apparatus reads the raw waveform data from the audio data storage unit 101. This raw waveform data looks like the data shown in FIG. 5(a). The second feature amount extraction unit 104 then rectifies the raw waveform data it has read, and averages the rectified waveform over 0.03-second intervals. Next, it extracts the portion actually uttered, based on the rise and fall of the sound (see FIG. 5(b)). It then performs time-frequency analysis by short-time Fourier transform (STFT) at 0.03-second intervals, and obtains the template of FIG. 5(c).
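The extraction steps described above (rectify, average per 0.03 s block, trim to the uttered portion, per-block spectrum) can be sketched roughly as follows. This is a minimal illustration in Python, not the patented implementation: the function names, the silence threshold, and the use of a naive DFT on non-overlapping blocks in place of a windowed STFT are all assumptions made here.

```python
import cmath
import math

def rectify(samples):
    """Full-wave rectification: take the absolute value of each sample."""
    return [abs(s) for s in samples]

def block_average(values, block_len):
    """Average consecutive non-overlapping blocks of block_len samples."""
    return [sum(values[i:i + block_len]) / block_len
            for i in range(0, len(values) - block_len + 1, block_len)]

def voiced_range(envelope, threshold):
    """Return (first, last + 1) block indices whose envelope exceeds threshold."""
    active = [i for i, v in enumerate(envelope) if v > threshold]
    if not active:
        return (0, 0)
    return (active[0], active[-1] + 1)

def dft_magnitudes(frame):
    """Magnitude spectrum of one block via a naive DFT (bins 0..N/2)."""
    n = len(frame)
    return [abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n)))
            for k in range(n // 2 + 1)]

def make_template(samples, sample_rate, block_sec=0.03, threshold=0.01):
    """Build a template: one magnitude spectrum per voiced 0.03 s block."""
    block_len = max(1, int(round(sample_rate * block_sec)))
    envelope = block_average(rectify(samples), block_len)
    first, last = voiced_range(envelope, threshold)  # hypothetical onset/offset rule
    return [dft_magnitudes(samples[b * block_len:(b + 1) * block_len])
            for b in range(first, last)]
```

The result is a list of blocks, each holding the amplitude at every frequency bin, which corresponds loosely to the grid of FIG. 5(c).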

Next, suppose the user inputs an instruction to start the voice imitation. The information processing apparatus accepts the start instruction, and the voice acquisition unit 102 acquires the audio data uttered by the user. The acquired audio data is raw waveform data like that shown in FIG. 5(a).

Next, the first feature amount extraction unit 103 of the information processing apparatus obtains a template (data like that of FIG. 5(c)) by the same processing as the second feature amount extraction unit 104 described above. When doing so, it matches the overall length to that of the template of the teacher data (the data in the audio data storage unit 101); this process is called "Normalize". For example, if the teacher data is 1 s long and the imitation is only 0.8 s long, the imitation is stretched to 1 s; conversely, if it is about 1.2 s long, it is shortened to 1 s. The template (the voice imitation template, FIG. 5(c)) is created in that state, so it has the same length as the teacher data template.
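The source does not say how the "Normalize" stretching is carried out. One simple possibility, shown here purely as an assumption, is to linearly resample a sequence of values to the target length. The function name is hypothetical. Note that resampling a raw waveform this way also shifts its pitch, so a practical implementation might instead resample a per-block feature sequence or use a pitch-preserving time-stretch.

```python
def stretch_to_length(values, target_len):
    """Linearly resample values so the result has exactly target_len points."""
    if target_len <= 0 or not values:
        return []
    if len(values) == 1 or target_len == 1:
        return [float(values[0])] * target_len
    # Map output index i onto a fractional position in the input.
    scale = (len(values) - 1) / (target_len - 1)
    out = []
    for i in range(target_len):
        pos = i * scale
        lo = int(pos)
        hi = min(lo + 1, len(values) - 1)
        frac = pos - lo
        out.append(values[lo] * (1.0 - frac) + values[hi] * frac)
    return out
```

With this sketch, a 0.8 s imitation and a 1.2 s imitation would both be brought to the length of a 1 s teacher recording before the template is built.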
Through the above processing, a teacher data template (referred to as template 2) and a template obtained from user input speech (referred to as template 1) were obtained.

Then the first vibrato information acquisition unit 1031, the first input information acquisition unit 1032, and the first pitch change information acquisition unit 1033 obtain the first vibrato information, the first input information, and the first pitch change information, respectively, from template 1, which was obtained from the user's input voice.

Specifically, the first vibrato information acquisition unit 1031 calculates the frequency with the strongest amplitude in template 1 (Fmax1) for each unit time (one unit time is one block of FIG. 5(c), and is at most 0.03 s), and thereby obtains the first vibrato information. The first vibrato information is a sequence of frequency values (Fmax1). Likewise, the second vibrato information acquisition unit 1041 calculates the frequency with the strongest amplitude in template 2 (Fmax2) for each unit time and obtains the second vibrato information, which is also a sequence of frequency values (Fmax2).
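Picking the strongest frequency per block, as described above, amounts to an argmax over each block's spectrum. A minimal Python sketch follows; it assumes the template is a list of per-block magnitude spectra whose bin `k` corresponds to `k * sample_rate / block_len` Hz. That layout and the function name are assumptions, not something the source specifies.

```python
def peak_frequencies(template, sample_rate, block_len):
    """Return, for each block, the frequency in Hz of the largest-magnitude bin."""
    bin_hz = sample_rate / block_len  # width of one frequency bin
    return [max(range(len(spectrum)), key=spectrum.__getitem__) * bin_hz
            for spectrum in template]
```

Applied to template 1 and template 2, this yields the Fmax1 and Fmax2 sequences respectively.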

Next, the first input information acquisition unit 1032 acquires the amplitude at each frequency in the first predetermined number (for example, five) of blocks of template 1. Likewise, the second input information acquisition unit 1042 acquires the amplitude at each frequency in the first predetermined number (for example, five) of blocks of template 2.

Next, the first pitch change information acquisition unit 1033 acquires the frequency with the strongest amplitude for each unit time of template 1; that is, the first pitch change information is a sequence of the frequencies at which the amplitude peaks. The second pitch change information acquisition unit 1043 acquires the frequency with the strongest amplitude for each unit time of template 2, so the second pitch change information is likewise a sequence of such frequencies.
The comparison unit 105 then compares the vibrato information, the input information, and the pitch change information of template 1 with those of template 2.

First, the comparison unit 105 compares the first vibrato information with the second vibrato information and calculates the vibrato similarity between the teacher data and the voice input by the user. The vibrato similarity is an example of the information on vibrato described above. Specifically, the comparison unit 105 calculates the vibrato similarity using, as parameters, the difference in position and the difference in number of the data in the two templates. Alternatively, the comparison unit 105 may calculate the similarity by machine learning with an artificial neural network (ANN); in that case it determines the similarity based on questionnaire data. For example, the ANN is trained on teacher data such as "A's imitations scored 60 points on average" and "B's imitations scored 70 points on average" to determine its weights, and then produces a score for the unseen imitation data of a person C.

Machine learning with the ANN is described below. The input of the ANN is the feature amounts (the information obtained from the template), and the output is a score. When the feature amounts of A's imitation are input and A's average score is 60 points, the weights of the ANN are trained until its output becomes 60 points. The data used for such training is called a pattern signal. Because training on a single pattern signal would be biased, the same neural network is trained in the same way on B, C, and so on, further adjusting the weights (for example, with five patterns). The ANN trained in this way is prepared on the system side in advance, and outputs the score of the user's imitation. In other words, the ANN corresponds to, for example, a judge who represents the average of several people.
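The training idea described above (adjust weights until the output matches each imitator's average score) can be illustrated with the simplest possible network, a single linear neuron trained by batch gradient descent. This is a deliberate simplification: the source does not state the network architecture, learning rule, or feature scaling, and the function names, learning rate, and example feature vectors below are all hypothetical.

```python
def train_linear_neuron(patterns, targets, lr=0.1, epochs=5000):
    """Fit weights and bias so the neuron's output approximates target scores (0-100)."""
    n_features = len(patterns[0])
    weights = [0.0] * n_features
    bias = 0.0
    n = len(patterns)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, t in zip(patterns, targets):
            # Error between the current output and the score scaled to 0..1.
            err = bias + sum(w * v for w, v in zip(weights, x)) - t / 100.0
            grad_b += err
            for j, v in enumerate(x):
                grad_w[j] += err * v
        # Move against the mean gradient of the squared error.
        bias -= lr * grad_b / n
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
    return weights, bias

def predict_score(weights, bias, features):
    """Score (0-100 scale) that the trained neuron assigns to a feature vector."""
    return 100.0 * (bias + sum(w * v for w, v in zip(weights, features)))
```

A usage sketch: train on a few (feature vector, average questionnaire score) pairs, one per imitator, then call `predict_score` on the feature vector of a new imitation. A real system would likely use a multi-layer network and far more training patterns.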

The comparison unit 105 also compares the first input information with the second input information and calculates the similarity in sound onset between the teacher data and the voice input by the user. This similarity is an example of the information on how the sound begins described above. Specifically, suppose the first input information is the amplitude at each frequency in the first five blocks of template 1, and the second input information is the amplitude at each frequency in the first five blocks of template 2. The comparison unit 105 then takes the reciprocal of the sum of the differences between corresponding elements of the first and second input information, multiplies it by a predetermined integer, and uses the result as the similarity of the two pieces of input information, that is, the similarity in sound onset.
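The reciprocal-of-differences similarity described above can be written compactly. The Python sketch below makes three assumptions the source leaves open: differences are taken as absolute values, the "predetermined integer" defaults to 100, and identical inputs (sum of differences equal to zero) are reported as infinite similarity. The function names are hypothetical.

```python
import math

def reciprocal_similarity(first, second, scale=100):
    """scale / (sum of absolute element-wise differences); inf if identical."""
    if len(first) != len(second):
        raise ValueError("sequences must have the same length")
    total = sum(abs(a - b) for a, b in zip(first, second))
    return math.inf if total == 0 else scale / total

def onset_similarity(template1, template2, n_blocks=5, scale=100):
    """Compare the amplitudes at every frequency in the first n_blocks blocks."""
    flat1 = [v for block in template1[:n_blocks] for v in block]
    flat2 = [v for block in template2[:n_blocks] for v in block]
    return reciprocal_similarity(flat1, flat2, scale)
```

The same `reciprocal_similarity` helper could be applied to two peak-frequency sequences of equal length, since the text describes the pitch change similarity with the same formula.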

Further, the comparison unit 105 compares the first pitch change information with the second pitch change information and calculates an overall similarity, which reflects the overall tendency of the teacher data and the user's voice to resemble each other. This similarity is an example of the information on pitch change. Specifically, the comparison unit 105 takes the reciprocal of the sum of the differences between the frequency sequence that is the first pitch change information and the frequency sequence that is the second pitch change information, multiplies it by a predetermined integer, and uses the result as the similarity in pitch change.
Note that the frequency having the strongest amplitude is calculated by, for example, an average of the frequencies.

Further, the comparison unit 105 calculates the voice imitation index based on the vibrato similarity, the sound onset similarity, and the pitch change similarity described above. For example, the comparison unit 105 calculates the sum of the three similarities as the voice imitation index. It may instead use the average of the three similarities.

Next, the output unit 106 holds, for example, the output face image determination table shown in FIG. 6 and the one or more output face images shown in FIG. 7. The output face image determination table holds one or more records, each having an "ID", a "condition", and an "image ID". The "ID" identifies the record and exists for table management purposes. The "condition" is a condition, with feature amounts as parameters, for deciding which image to output. If the result calculated by the comparison unit 105 satisfies the "condition", the image with the corresponding "image ID" is output. The "image ID" is an identifier of an image. FIG. 7 shows four output face images. The image with "ID = 1" is displayed to tell the user to make the sound onset softer, the image with "ID = 2" to make the onset harder, the image with "ID = 3" to weaken the vibrato, and the image with "ID = 4" to strengthen the vibrato.

The output unit 106 selects and displays an image according to the result of the comparison by the comparison unit 105, checked against the conditions in the output face image determination table. For example, if the comparison result is "first vibrato information" − "second vibrato information" = 12, it matches the condition of the record with "ID = 3" in the output face image determination table, so the image with "image ID = 3" is selected and displayed. The information processing apparatus performs this selection and display continuously, in real time, while the user is inputting voice.
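The table lookup described above can be sketched as a list of records, each pairing a condition with an image ID. The actual conditions of FIG. 6 are not reproduced in the text, so the thresholds (±10), the sign convention for the onset difference, and all key and function names in this Python sketch are hypothetical placeholders; only the example "vibrato difference of 12 selects image 3" comes from the description.

```python
# Hypothetical stand-in for the determination table of FIG. 6.
# The thresholds (10 and -10) are invented for illustration only.
FACE_TABLE = [
    {"id": 1, "condition": lambda d: d["onset_diff"] > 10, "image_id": 1},     # soften the onset
    {"id": 2, "condition": lambda d: d["onset_diff"] < -10, "image_id": 2},    # harden the onset
    {"id": 3, "condition": lambda d: d["vibrato_diff"] > 10, "image_id": 3},   # weaken the vibrato
    {"id": 4, "condition": lambda d: d["vibrato_diff"] < -10, "image_id": 4},  # strengthen the vibrato
]

def select_image_id(diffs, table=FACE_TABLE):
    """Return the image ID of the first record whose condition holds, or None."""
    for record in table:
        if record["condition"](diffs):
            return record["image_id"]
    return None
```

Called once per analysis block with the current differences (user minus teacher), such a lookup would support the continuous, real-time feedback the text describes.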

The output unit 106 also outputs the voice imitation index calculated above in the manner shown in FIG. 8. On the screen of FIG. 8, when the user clicks the "MidiOpen" button 1, the information processing apparatus reads data for MIDI playback. When the user presses the "PLAY" button 2, the information processing apparatus indicates the correct pitch with hatched squares on the spectrum display screen 3 (the large black rectangle in the center of FIG. 8) and with black dots on the concentric spectrum display screen (the round black screen on the right of FIG. 8). When the user sings, the sung pitch is shown on each display screen in the same way, in a first predetermined color (for example, orange). When the sung pitch is correct, the indicated pitch changes to a second predetermined color (for example, yellow). When the pitch is off, the face image (face 4 at the lower right of FIG. 8) changes to one that tells the user to sing "higher" or "lower" (see FIG. 12). When the pitch is correct, an expressionless face image is displayed, as shown in FIG. 8. With this display, the user can practice voice imitation while correcting course in real time.
Next, the operation of the information processing apparatus for acquiring the party trick of deliberately singing out of tune will be described.

Suppose now that audio data of a song sung by a singer is stored in the audio data storage unit 101 of the information processing apparatus. In this situation, the user inputs sound shift information, which indicates the degree to which the audio data is to be changed. Here, the sound shift information comprises sound shift information in the narrow sense, which indicates how frequently shifts occur, and a sound shift level, which indicates the width (magnitude) of the shift. When the user sets the sound shift information to "50%" and the sound shift level to "7" (see the left side of FIG. 8), the sound shift information input receiving unit 107 accepts these settings. The audio data changing unit 108 then automatically changes the audio data stored in the audio data storage unit 101 based on the sound shift information "50%" and the sound shift level "7". That is, the audio data changing unit 108 changes the audio data so that 50% of the data in the audio data storage unit 101 is raised or lowered in pitch by at most "7".

Specifically, suppose for example that the original teacher data uses a 12-tone scale, as shown in FIG. 9(a). Under the condition that 50% of all the data is to be shifted, the audio data changing unit 108 decides which data to shift, as shown in FIG. 9(b). Any algorithm may be used for this decision: the audio data changing unit 108 may select every other item, or may generate random numbers and select the data according to them. In FIG. 9(b), the data to be shifted is underlined. Next, the audio data changing unit 108 decides the amount of shift so that each selected sound deviates from the original by at most ±7. The algorithm for deciding the amount of shift is likewise not limited. For example, the audio data changing unit 108 generates a random number and assigns a number from "−7" to "+7" according to the remainder of dividing that random number by 14. The audio data changing unit 108 thereby obtains the changed teacher data shown in FIG. 9(d). By practicing voice imitation against this changed teacher data (the original, well-sung song turned into an out-of-tune one), the user can acquire the party trick of deliberately singing out of tune. The operation of the information processing apparatus during voice imitation is as described above.
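The random variant of the sound shift procedure described above can be sketched as follows. The Python sketch assumes the teacher data is a list of integer note numbers (for example MIDI note numbers), which the source does not state, and the function name is hypothetical. The text describes taking a remainder modulo 14; the sketch instead draws from `2 * level + 1` = 15 values so that every offset from −7 to +7 is reachable, which is an assumption about the intent. An offset of 0 is possible, in which case a selected note stays unchanged.

```python
import random

def detune(notes, rate=0.5, level=7, rng=None):
    """Shift a fraction `rate` of the notes by a random offset in [-level, +level]."""
    rng = rng or random.Random()
    count = round(len(notes) * rate)
    # Randomly choose which positions are shifted (the underlined data of FIG. 9(b)).
    chosen = set(rng.sample(range(len(notes)), count))
    shifted = []
    for i, note in enumerate(notes):
        if i in chosen:
            offset = rng.randrange(2 * level + 1) - level  # -level .. +level
            shifted.append(note + offset)
        else:
            shifted.append(note)
    return shifted, sorted(chosen)
```

Passing a seeded `random.Random` makes the deliberately out-of-tune version reproducible, so the same altered melody can be practiced repeatedly.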
As described above, according to the present embodiment, voice imitation can be practiced easily, and the user can acquire the ability to produce imitations that other people perceive as similar.
Further, according to the present embodiment, the degree of similarity can be displayed in real time using an index close to the one people perceive, so that, for example, a user imitating a song can correct course partway through.

Furthermore, according to the present embodiment, the audio data of a correctly sung song can, for example, be forcibly changed into out-of-tune audio data, and the similarity to that out-of-tune audio data can be output, so that the user can acquire the party trick of deliberately singing out of tune.

In the present embodiment, voice imitation includes imitation of singing, of animal cries, of mechanical sounds, of the pronunciation of a foreign language, and the like. That is, the teacher data in the present embodiment may be a singer's singing voice data, animal cry data, mechanical sound data, language pronunciation data, and so on. The same applies to the other embodiments.

また、本実施の形態によれば、声まねの判断のための特徴量は、ビブラート情報、音の入り方に関する情報である入情報、および音程の変化に関する情報である音程変化情報が有効であったが、声まね指数の算出のために他の特徴量を用いても良い。かかることも他の実施の形態においても同様である。 Further, according to the present embodiment, the feature quantity for determining voice imitation is effective as vibrato information, incoming information that is information on how to enter sound, and pitch change information that is information related to pitch change. However, other feature amounts may be used for calculating the voice imitation index. This also applies to other embodiments.

また、本実施の形態によれば、教師データのテンプレートを得る動作をユーザからの音声を取得し、２つのテンプレートを比較しながら、リアルタイムに行った。しかし、教師データのテンプレートを得る動作は、ユーザからの音声の取得の前に、前もって行っていても良い。教師データのテンプレートを得る動作を予め他の装置で行って、情報処理装置は他の装置が行って抽出した教師データの特徴量を保持していても良い。かかる場合、情報処理装置は、音声を取得する音声取得部と、前記音声取得部が取得した音声の所定の特徴量を抽出する第一特徴量抽出部と、所定の特徴量を抽出する第二特徴量抽出部と、前記第一特徴量抽出部が抽出した特徴量と、比較対照の音声データの特徴量を比較する比較部と、前記比較部が比較した結果を出力する出力部を具備する装置である。 Further, according to the present embodiment, the operation of obtaining the teacher data template is performed in real time, while the user's voice is acquired and the two templates are compared. However, the teacher data template may instead be obtained in advance, before the user's voice is acquired. The template may also be obtained beforehand by another apparatus, with the information processing apparatus holding the teacher data feature amounts extracted by that apparatus. In this case, the information processing apparatus comprises: a voice acquisition unit that acquires voice; a first feature amount extraction unit that extracts a predetermined feature amount of the voice acquired by the voice acquisition unit; a second feature amount extraction unit that extracts a predetermined feature amount; a comparison unit that compares the feature amount extracted by the first feature amount extraction unit with the feature amount of the voice data to be compared; and an output unit that outputs the result of the comparison by the comparison unit.

また、本実施の形態によれば、情報処理装置は、スタンドアロンで動作したが、サーバ・クライアントシステムにおいて動作しても良い。かかることも他の実施の形態においても同様である。なお、かかる場合の情報処理システムは、図１０に示すシステム構成となる。つまり、情報処理システムは、クライアント装置９１とサーバ装置９２を有する。クライアント装置９１は、音声取得部１０２、出力部１０６、音ズレ情報入力受付部１０７、入力受付部１０９、第一送受信部９１０１を具備する。サーバ装置９２は、第二送受信部９２０１、音声データ格納部１０１、第一特徴量抽出部１０３、第二特徴量抽出部１０４、比較部１０５、音声データ変更部１０８を具備する。クライアント装置９１の第一送受信部９１０１は、ユーザの発生した音声データをサーバ装置９２に送信する。サーバ装置９２の第二送受信部９２０１は、音声データを受信する。比較部１０５は、当該受信した音声データと格納している音声データとの１以上の特徴量を比較する。第二送受信部９２０１は、当該比較結果をクライアント装置９１に送信する。次に、クライアント装置９１の第一送受信部９１０１は、比較結果を受信し、出力部１０６は出力する。つまり、上述した情報処理装置の処理を、クライアント装置９１とサーバ装置９２で分散して処理する態様である。 Further, according to the present embodiment, the information processing apparatus operates stand-alone, but it may instead operate in a server-client system. The same applies to the other embodiments. In that case, the information processing system has the configuration shown in FIG. 10. That is, the information processing system includes a client device 91 and a server device 92. The client device 91 includes the voice acquisition unit 102, the output unit 106, the sound shift information input receiving unit 107, the input receiving unit 109, and a first transmission/reception unit 9101. The server device 92 includes a second transmission/reception unit 9201, the voice data storage unit 101, the first feature amount extraction unit 103, the second feature amount extraction unit 104, the comparison unit 105, and the voice data changing unit 108. The first transmission/reception unit 9101 of the client device 91 transmits the voice data uttered by the user to the server device 92. The second transmission/reception unit 9201 of the server device 92 receives the voice data. The comparison unit 105 compares one or more feature amounts of the received voice data with those of the stored voice data. The second transmission/reception unit 9201 transmits the comparison result to the client device 91. Next, the first transmission/reception unit 9101 of the client device 91 receives the comparison result, and the output unit 106 outputs it. In other words, the processing of the information processing apparatus described above is distributed between the client device 91 and the server device 92.

また、本実施の形態によれば、情報処理装置が声まね指数を算出している際に、音声データ格納部１０１に格納されている音声データを音声出力しなかったが、音声出力しても良い。音声データを音声出力することは、ユーザの声まねを助け、好適である場合が多い。 Further, according to the present embodiment, the voice data stored in the voice data storage unit 101 is not output as sound while the information processing apparatus calculates the voice imitation index, but it may be output. Outputting the voice data as sound helps the user's imitation and is preferable in many cases.

また、本実施の形態における具体例によれば、出力部１０６は、目または／および鼻または／および口の画像を有する顔画像を変化させ、声まねの結果が良好になるような態様で顔画像を表示したが、顔画像以外の画像を表示することにより声まねの比較結果を表示しても良い。つまり、ユーザの発声した音声と比較対象となる音（音声データ格納部１０１の音声データ）の特徴量の差異（差分データ）を視覚化できれば良い。差分データの表示のために、図７に示すような"顔"ではなく、図１１（ａ）に示す"棘"、図１１（ｂ）に示す"コーン"、図１１（ｃ）に示す"ボール"などでも良い。"棘"は３つの球体から棘が１２本ずつ延びてくるオブジェクトで、それぞれの棘の長さで差分データの大きさを表現している。"コーン"は円形に回転する１２本の円錐があり、それぞれの長さで差分データの大きさを表現している。"ボール"は外周を左回りに回転する８つの球体と、内周を右回りに回転する４つの球体がそれぞれ、基底の軌道から逸れた距離と球体の色の変化で差分データの大きさを表現している。なお、図７に示す"顔"は、目、鼻、口で表現された顔の各部の大きさと位置が規定の大きさ、場所との違いで差分データの大きさを表現している。また、顔については差分データが一定の値を超えると表情が大きく変化するようなバリエーションが存在しても良い。たとえば、音程が教師データと比較して非常に低い場合は、図１２（ａ）のような"顔"を表示し、「音をもっと高くする」ことを直感的に教示したり、音程が教師データと比較して非常に高い場合は、図１２（ｂ）のような"顔"を表示し、「音をもっと低くする」ことを直感的に教示したりしても良い。特徴量の差異を、直感的なわかりやすさとリアルタイムな入力に対応して表示するため、声まねが上達するために好適である。 Further, according to the specific example in the present embodiment, the output unit 106 changes a face image having images of eyes and/or a nose and/or a mouth, and displays the face image in a manner corresponding to how good the voice imitation result is; however, the comparison result may be displayed using an image other than a face image. That is, it suffices to visualize the difference (difference data) between the feature amounts of the voice uttered by the user and those of the sound to be compared (the voice data in the voice data storage unit 101). To display the difference data, the “thorn” shown in FIG. 11A, the “cone” shown in FIG. 11B, or the “ball” shown in FIG. 11C may be used instead of the “face” shown in FIG. 7. The “thorn” is an object in which twelve thorns extend from each of three spheres, and the length of each thorn represents the magnitude of the difference data. The “cone” consists of twelve cones rotating in a circle, and the length of each cone represents the magnitude of the difference data. The “ball” consists of eight spheres rotating counterclockwise along an outer circumference and four spheres rotating clockwise along an inner circumference; the distance by which each sphere deviates from its base orbit and the change in its color represent the magnitude of the difference data. The “face” shown in FIG. 7 represents the magnitude of the difference data by how far the size and position of each part of the face (eyes, nose, and mouth) differ from the prescribed size and position. For the face, there may also be variations in which the expression changes greatly when the difference data exceeds a certain value. For example, when the pitch is much lower than the teacher data, a “face” as shown in FIG. 12A may be displayed to intuitively tell the user to “raise the pitch”, and when the pitch is much higher than the teacher data, a “face” as shown in FIG. 12B may be displayed to intuitively tell the user to “lower the pitch”. Because differences in the feature amounts are displayed in an intuitively understandable way and in response to real-time input, this is well suited to improving voice imitation.

さらに、本実施の形態における処理は、ソフトウェアで実現しても良い。そして、このソフトウェアをソフトウェアダウンロード等により配布しても良い。また、このソフトウェアをＣＤ−ＲＯＭなどの記録媒体に記録して流布しても良い。なお、このことは、本明細書における他の実施の形態においても該当する。なお、本実施の形態における情報処理装置を実現するソフトウェアは、以下のようなプログラムである。つまり、このプログラムは、コンピュータに、音声を取得する音声取得ステップと、前記音声取得ステップで取得した音声の所定の特徴量を抽出する第一特徴量抽出ステップと、前記第一特徴量抽出部が抽出した特徴量と、比較対照の音声データの特徴量を比較する比較ステップと、前記比較ステップで比較した結果を出力する出力ステップを実行させるためのプログラムである。 Furthermore, the processing in the present embodiment may be realized by software. Then, this software may be distributed by software download or the like. Further, this software may be recorded and distributed on a recording medium such as a CD-ROM. This also applies to other embodiments in this specification. Note that the software that implements the information processing apparatus according to the present embodiment is the following program. That is, the program includes: a sound acquisition step for acquiring sound; a first feature amount extraction step for extracting a predetermined feature amount of the sound acquired in the sound acquisition step; and the first feature amount extraction unit. This is a program for executing a comparison step for comparing the extracted feature amount with the feature amount of the audio data for comparison and an output step for outputting the result of comparison in the comparison step.

また、本プログラムは、コンピュータに、音声を取得する音声取得ステップと、前記音声取得ステップで取得した音声の所定の特徴量を抽出する第一特徴量抽出ステップと、格納されている音声データから所定の特徴量を抽出する第二特徴量抽出ステップと、前記第一特徴量抽出ステップで抽出した特徴量と、前記第二特徴量抽出ステップで抽出した特徴量を比較する比較ステップと、前記比較ステップで比較した結果を出力する出力ステップを実行させるためのプログラムである。
（実施の形態２）
本実施の形態において、格納している音声データに対して声まねの練習を行える情報処理装置であり、かつ、音声データの各部分の声まね指数が表示され、一部分に対する声まねができる情報処理装置である。 In addition, this program causes a computer to execute: a voice acquisition step of acquiring voice; a first feature amount extraction step of extracting a predetermined feature amount of the voice acquired in the voice acquisition step; a second feature amount extraction step of extracting a predetermined feature amount from stored voice data; a comparison step of comparing the feature amount extracted in the first feature amount extraction step with the feature amount extracted in the second feature amount extraction step; and an output step of outputting the result of the comparison in the comparison step.
(Embodiment 2)
The present embodiment describes an information processing apparatus with which the user can practice voice imitation against stored voice data, which displays a voice imitation index for each part of the voice data, and with which the user can practice imitating a single part.

図１３は、本実施の形態における情報処理装置のブロック図である。本情報処理装置は、音声データ格納部１０１、音声取得部１０２、第一特徴量抽出部１０３、第二特徴量抽出部１０４、比較部１１０５、出力部１１０６、音ズレ情報入力受付部１０７、音声データ変更部１０８、入力受付部１１０９、音声出力部１１１０を具備する。 FIG. 13 is a block diagram of the information processing apparatus according to this embodiment. The information processing apparatus includes an audio data storage unit 101, an audio acquisition unit 102, a first feature quantity extraction unit 103, a second feature quantity extraction unit 104, a comparison unit 1105, an output unit 1106, a sound shift information input reception unit 107, an audio A data changing unit 108, an input receiving unit 1109, and an audio output unit 1110 are provided.

比較部１１０５は、音声データの部分ごとに、第一特徴量抽出部１０３が抽出した特徴量と、第二特徴量抽出部１０４が抽出した特徴量を比較する。音声データは、例えば、歌手が歌った歌のデータである。音声データの部分は、例えば、所定の出力時間の経過により、切り出される。なお、所定の時間は、上述したように０．０３秒ぐらいが好適である。また、音声データの区切りは、一定時間以上の無音声である場合に区切りであると判断されても良い。なお、比較部１１０５は、第一特徴量抽出部１０３が抽出した２以上の特徴量と、第二特徴量抽出部１０４が抽出した２以上特徴量を、特徴量ごとに比較しても良い。比較部１１０５は、通常、ＭＰＵやメモリ等から実現され得る。比較部１１０５の処理手順は、通常、ソフトウェアで実現され、当該ソフトウェアはＲＯＭ等の記録媒体に記録されている。但し、ハードウェア（専用回路）で実現しても良い。 The comparison unit 1105 compares the feature amount extracted by the first feature amount extraction unit 103 with the feature amount extracted by the second feature amount extraction unit 104 for each part of the audio data. The voice data is, for example, data of a song sung by a singer. The audio data portion is cut out, for example, when a predetermined output time elapses. The predetermined time is preferably about 0.03 seconds as described above. In addition, the audio data may be determined to be a delimiter when there is no sound for a certain time or longer. Note that the comparison unit 1105 may compare the two or more feature amounts extracted by the first feature amount extraction unit 103 with the two or more feature amounts extracted by the second feature amount extraction unit 104 for each feature amount. The comparison unit 1105 can be usually realized by an MPU, a memory, or the like. The processing procedure of the comparison unit 1105 is usually realized by software, and the software is recorded in a recording medium such as a ROM. However, it may be realized by hardware (dedicated circuit).
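One of the two segmentation rules mentioned above, treating silence of at least a certain duration as a boundary between parts, can be sketched as follows. The function name `split_on_silence`, the per-block loudness list, and the default gap of 0.3 seconds are hypothetical; the text does not state a value for the "certain time".

```python
def split_on_silence(levels, hop_s, min_gap_s=0.3):
    """Split per-block loudness values into parts separated by long silences.

    Returns a list of (start, end) block indices, with `end` exclusive.
    """
    min_gap = max(1, round(min_gap_s / hop_s))
    parts, start, silent = [], None, 0
    for i, level in enumerate(levels):
        if level > 0:
            if start is None:
                start = i          # a new part begins at the first sounding block
            silent = 0
        elif start is not None:
            silent += 1
            if silent >= min_gap:  # silence long enough to close the part
                parts.append((start, i - silent + 1))
                start, silent = None, 0
    if start is not None:          # close a part still open at the end
        parts.append((start, len(levels) - silent))
    return parts
```

Short pauses inside a phrase (shorter than `min_gap_s`) stay inside the same part, which is the behavior the silence rule appears intended to give.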

出力部１１０６は、比較部１１０５が出力した部分ごとの比較結果を出力する。比較結果は、点数により示されても良いし、画像により示されても良い。出力とは、通常、ディスプレイへの表示を言うが、プリンタへの印字、外部の装置への送信等を含む概念である。出力部１１０６は、ディスプレイ等の出力デバイスを含むと考えても含まないと考えても良い。出力部１１０６は、出力デバイスのドライバーソフトまたは、出力デバイスのドライバーソフトと出力デバイス等で実現され得る。 The output unit 1106 outputs a comparison result for each part output by the comparison unit 1105. The comparison result may be indicated by a score or an image. Output usually refers to display on a display, but is a concept including printing on a printer, transmission to an external device, and the like. The output unit 1106 may or may not include an output device such as a display. The output unit 1106 can be realized by output device driver software, or output device driver software and an output device.

入力受付部１１０９は、処理の開始指示、終了指示、または部分を指示する入力を受け付ける。「部分を指示する入力」とは、例えば、出力部１１０６が出力した部分ごとの比較結果に対する指示入力であり、部分の指示入力である。部分とは、音声データの一部分である。指示の入力手段は、テンキーやキーボードやマウスやメニュー画面によるもの等、何でも良い。入力受付部１１０９は、テンキーやキーボード等の入力手段のデバイスドライバーや、メニュー画面の制御ソフトウェア等で実現され得る。 The input receiving unit 1109 receives an input for instructing a process start instruction, an end instruction, or a part. The “input for instructing a part” is, for example, an instruction input for the comparison result for each part output by the output unit 1106, and is an instruction input for the part. A portion is a portion of audio data. The instruction input means may be anything such as a numeric keypad, a keyboard, a mouse, or a menu screen. The input receiving unit 1109 can be realized by a device driver for input means such as a numeric keypad or a keyboard, control software for a menu screen, and the like.

音声出力部１１１０は、指示された部分に対応する音声データの一部分を読み出し、音声出力する。なお、かかる音声データは、通常、音声データ格納部１０１の音声データである。ただし、かかる音声データは、ユーザが先に入力し、音声取得部１０２が取得した音声データでも良い。 The audio output unit 1110 reads out a part of audio data corresponding to the instructed part and outputs the audio. Such audio data is usually audio data in the audio data storage unit 101. However, the voice data may be voice data that is input by the user first and acquired by the voice acquisition unit 102.

なお、入力受付部１１０９が部分を指示する入力を受け付けた場合に、音声取得部１０２が音声を取得し、第一特徴量抽出部１０３は、音声取得部１０２が取得した音声の所定の特徴量を抽出し、第二特徴量抽出部１０４は、一部分の音声データから所定の特徴量を抽出し、比較部１１０５は、第一特徴量抽出部１０３が抽出した１以上の特徴量と、第二特徴量抽出部１０４が抽出した１以上の特徴量を比較し、出力部１１０６は、比較部１１０５が比較した結果を出力する。
以下、本情報処理装置が各部分の声まね指数を出力する動作について図１４のフローチャートを用いて説明する。図１４のフローチャートにおいて、図２のフローチャートと同様の処理に関しては、説明を省略する。 When the input receiving unit 1109 receives an input designating a part, the voice acquisition unit 102 acquires voice, the first feature amount extraction unit 103 extracts a predetermined feature amount of the voice acquired by the voice acquisition unit 102, the second feature amount extraction unit 104 extracts a predetermined feature amount from the corresponding part of the voice data, the comparison unit 1105 compares the one or more feature amounts extracted by the first feature amount extraction unit 103 with the one or more feature amounts extracted by the second feature amount extraction unit 104, and the output unit 1106 outputs the result of the comparison by the comparison unit 1105.
Hereinafter, the operation in which the information processing apparatus outputs the voice imitation index of each part will be described with reference to the flowchart of FIG. 14. In the flowchart of FIG. 14, description of processing that is the same as in the flowchart of FIG. 2 is omitted.

（ステップＳ１４０１）出力部１０６は、ステップＳ２１３で算出した得点を出力する。この得点は、部分ごとの比較結果である。部分ごとの比較結果の表示態様や表示タイミングは問わない。部分ごとの得点の表示態様は、上述した顔画像によるものでも良いし、部分ごとの声まねの得点を数値で表示しても良い。
次に、本情報処理装置が、部分的な声まねの練習に利用される場合の動作について図１５のフローチャートを用いて説明する。 (Step S1401) The output unit 106 outputs the score calculated in step S213. This score is a comparison result for each part. The display mode and display timing of the comparison result for each part are not limited. The display mode of the score for each part may be based on the above-described face image, or the score of the voice imitation for each part may be displayed as a numerical value.
Next, the operation when the information processing apparatus is used for practicing part of a voice imitation will be described with reference to the flowchart of FIG. 15.

（ステップＳ１５０１）入力受付部１１０９は、部分を指示する入力を受け付けたか否かを判断する。部分を指示する入力を受け付ければステップＳ１５０２に行き、部分を指示する入力を受け付けなければステップＳ１５０１に戻る。
（ステップＳ１５０２）音声出力部１１１０は、ステップＳ１５０１で受け付けた入力が示す部分に対応する音声データを音声データ格納部１０１から読み出す。
（ステップＳ１５０３）音声出力部１１１０は、ステップＳ１５０２で読み出した音声データを出力する。 (Step S1501) The input receiving unit 1109 determines whether or not an input indicating a part has been received. If an input indicating a part is accepted, the process proceeds to step S1502, and if an input indicating a part is not accepted, the process returns to step S1501.
(Step S1502) The audio output unit 1110 reads audio data corresponding to the portion indicated by the input received in step S1501 from the audio data storage unit 101.
(Step S1503) The audio output unit 1110 outputs the audio data read in step S1502.

（ステップＳ１５０４）音声取得部１０２は、ユーザが発生する音声を取得する。なお、ステップＳ１５０３の音声データ出力と、ステップＳ１５０４の音声の取得は、並行して実行されることが好適である。なお、本ループが２回以上繰り返される場合は、取得した音声は追記される。
（ステップＳ１５０５）ステップＳ１５０１で受け付けた入力が示す部分のすべての出力が終了したか否かを判断する。終了していればステップＳ１５０６に行き、終了していなければステップＳ１５０２に戻る。 (Step S1504) The voice acquisition unit 102 acquires voice generated by the user. Note that the audio data output in step S1503 and the audio acquisition in step S1504 are preferably executed in parallel. In addition, when this loop is repeated twice or more, the acquired sound is appended.
(Step S1505) It is determined whether or not all the outputs of the portion indicated by the input received in step S1501 have been completed. If completed, go to step S1506, and if not completed, return to step S1502.

（ステップＳ１５０６）第一特徴量抽出部１０３は、ステップＳ１５０４で取得された音声から第一の特徴量を抽出する。第一の特徴量は、例えば、実施の形態１で説明したビブラート情報、入情報、音程変化情報であるが、他の特徴量でも良い。 (Step S1506) The first feature quantity extraction unit 103 extracts the first feature quantity from the voice acquired in step S1504. The first feature amount is, for example, the vibrato information, the incoming information, and the pitch change information described in the first embodiment, but may be another feature amount.

（ステップＳ１５０７）第二特徴量抽出部１０４は、ステップＳ１５０２で読み出した音声データから第二の特徴量を抽出する。第二の特徴量は、例えば、実施の形態１で説明したビブラート情報、入情報、音程変化情報であるが、他の特徴量でも良い。
（ステップＳ１５０８）比較部１１０５は、ステップＳ１５０６で取得した第一の特徴量と、ステップＳ１５０７で取得した第二の特徴量を比較する。
（ステップＳ１５０９）出力部１１０６は、ステップＳ１５０８における比較結果を出力する。処理を終了する。
以上の処理により、ユーザは、例えば、音声データ格納部１０１に格納されている歌の音データの真似を、一部のフレーズ（例えば、一小節）について練習できる。 (Step S1507) The second feature quantity extraction unit 104 extracts a second feature quantity from the audio data read in step S1502. The second feature amount is, for example, the vibrato information, the input information, and the pitch change information described in the first embodiment, but may be another feature amount.
(Step S1508) The comparison unit 1105 compares the first feature value acquired in Step S1506 with the second feature value acquired in Step S1507.
(Step S1509) The output unit 1106 outputs the comparison result in step S1508. The process ends.
Through the above processing, the user can practice, for example, imitation of the sound data of a song stored in the audio data storage unit 101 for some phrases (for example, one measure).

以下、本実施の形態における情報処理装置の具体的な動作について説明する。まず、音声データ格納部１０１には、声まねの対象である教師データの生波形データが格納されている。教師データは、ここでは、歌の音声データである。 Hereinafter, a specific operation of the information processing apparatus according to the present embodiment will be described. First, in the audio data storage unit 101, raw waveform data of teacher data that is a target of voice imitation is stored. Here, the teacher data is voice data of a song.

そして、ユーザは、歌まねの開始指示を入力する。次に、情報処理装置は、開始指示の入力を受け付け、音声取得部１０２は、ユーザが発生する音声データを取得する。取得した音声データは、図５（ａ）に示すような生波形データである。 Then, the user inputs an instruction to start the song imitation. Next, the information processing apparatus receives the start instruction, and the voice acquisition unit 102 acquires the voice data uttered by the user. The acquired voice data is raw waveform data as shown in FIG. 5(a).

そして、情報処理装置の第一特徴量抽出部１０３は、音声取得部１０２が取得した音声に対して、実施の形態１において説明した処理と同様の処理を行う。そして、第一特徴量抽出部１０３は、第一のビブラート情報、第一の入情報、第一の音程変化情報を得る。
次に、第二特徴量抽出部１０４は、音声データ格納部１０１に格納されている教師データの生波形データから、第二のビブラート情報、第二の入情報、第二の音程変化情報を得る。 Then, the first feature amount extraction unit 103 of the information processing apparatus performs the same process as the process described in Embodiment 1 on the voice acquired by the voice acquisition unit 102. And the 1st feature-value extraction part 103 acquires 1st vibrato information, 1st incoming information, and 1st pitch change information.
Next, the second feature amount extraction unit 104 obtains second vibrato information, second incoming information, and second pitch change information from the raw waveform data of the teacher data stored in the voice data storage unit 101.

そして、比較部１０５は、第一のビブラート情報と第二のビブラート情報を比較して、教師データとユーザが入力した音声のビブラートの類似度を算出する。また、比較部１０５は、第一の入情報と第二の入情報を比較して、教師データとユーザが入力した音声の、音の入り方の類似度を算出する。さらに、比較部１０５は、第一の音程変化情報と第二の音程変化情報を比較して、教師データとユーザが入力した音声の、全体的な類似傾向である全体的な類似度を算出する。さらに、比較部１０５は上述したビブラートの類似度、音の入り方の類似度および全体的な類似度に基づいて、声まね指数を算出する。声まね指数は、１００点満点の点数である。そして、出力部１０６は、声まね指数を出力する。かかる処理は、実施の形態１で説明した処理と同様であるので、詳細な説明は省略する。 Then, the comparison unit 105 compares the first vibrato information and the second vibrato information, and calculates the similarity between the teacher data and the voice vibrato input by the user. Further, the comparison unit 105 compares the first input information with the second input information, and calculates the similarity of the sound input between the teacher data and the voice input by the user. Further, the comparison unit 105 compares the first pitch change information and the second pitch change information, and calculates an overall similarity that is an overall similarity tendency between the teacher data and the voice input by the user. . Further, the comparison unit 105 calculates a voice imitation index based on the above-described vibrato similarity, sound entry similarity, and overall similarity. The voice imitation index is a score of 100 points. Then, the output unit 106 outputs a voice imitation index. Since this process is the same as the process described in the first embodiment, detailed description thereof is omitted.

かかる処理を、一小節ごとに繰り返す。その結果、図１６に示す。図１６は、ユーザが歌を歌い進める間、リアルタイムに一小節ずつ、歌まねの度合いである声まね指数が出力されている。また、出力部１０６は、所定の点数より低い小節を、網掛けで示している。
次に、ユーザは、図１６の表示に対して、網掛けの点数が付いている「ＰｈｒａｓｅＮｏ．」を指示する、とする。この指示が、上述した「部分を指示する入力」である。 This processing is repeated for each measure. The result is shown in FIG. 16. In FIG. 16, while the user sings through the song, the voice imitation index, which indicates how close the imitation is, is output measure by measure in real time. The output unit 106 also shades the measures whose score is lower than a predetermined score.
Next, assume that the user designates a “PhraseNo.” whose score is shaded in the display of FIG. 16. This designation is the “input for instructing a part” described above.
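The shading rule for FIG. 16 amounts to selecting the phrases whose voice imitation index falls below a predetermined score. A minimal sketch, assuming scores are held in a dictionary keyed by phrase number; the function name and the passing score of 60 are hypothetical, since the text gives no concrete threshold:

```python
def phrases_to_practice(scores, passing=60):
    """Return the phrase numbers whose score is below `passing`, in phrase order."""
    return [no for no, score in sorted(scores.items()) if score < passing]
```

For example, `phrases_to_practice({1: 82, 2: 45, 3: 91, 4: 58})` returns `[2, 4]`, the candidates a user might then designate for partial practice.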

次に、入力受付部１１０９は、かかる部分を指示する入力を受け付ける。そして、音声出力部１１１０は、受け付けた入力が示す部分「ＰｈｒａｓｅＮｏ．＝２」に対応する音声データを音声データ格納部１０１から読み出す。次に、音声出力部１１１０は、「ＰｈｒａｓｅＮｏ．＝２」の音声データを出力する。そして、その間、「ＰｈｒａｓｅＮｏ．＝２」に対応する小節の歌まねの練習をするために、ユーザは発声する。その間、音声取得部１０２は、ユーザが発生する音声を取得する。
次に、第一特徴量抽出部１０３はユーザが発声し、取得した音声から第一の特徴量を抽出する。その特徴量は、ビブラート情報、入情報、音程変化情報である。 Next, the input receiving unit 1109 receives an input for instructing the portion. Then, the audio output unit 1110 reads out the audio data corresponding to the portion “PhraseNo. = 2” indicated by the received input from the audio data storage unit 101. Next, the audio output unit 1110 outputs audio data of “PhraseNo. = 2”. In the meantime, the user speaks in order to practice singing a measure corresponding to “PhraseNo. = 2”. Meanwhile, the voice acquisition unit 102 acquires voice generated by the user.
Next, the first feature amount extraction unit 103 extracts the first feature amount from the acquired voice uttered by the user. The feature amounts are vibrato information, incoming information, and pitch change information.

次に、第二特徴量抽出部１０４は、読み出した音声データから第二の特徴量を抽出する。第二の特徴量も、ビブラート情報、入情報、音程変化情報である。次に、比較部１１０５は、第一の特徴量と第二の特徴量を比較する。そして、出力部１１０６は、その比較結果を出力する（図１７参照）。
以上、本実施の形態によれば、声まねの練習が容易にできる。特に、本実施の形態によれば、一部分の声まねの練習が容易である。それにより、人が似ていると感じるような声まねの能力を手にいれることができる。 Next, the second feature amount extraction unit 104 extracts a second feature amount from the read audio data. The second feature amount is also vibrato information, incoming information, and pitch change information. Next, the comparison unit 1105 compares the first feature value and the second feature value. Then, the output unit 1106 outputs the comparison result (see FIG. 17).
As described above, according to the present embodiment, voice imitation can be practiced easily. In particular, according to the present embodiment, it is easy to practice a part of voice imitation. As a result, the ability to imitate voices that people feel similar to can be obtained.

なお、本実施の形態において、実施の形態１におけるように顔画像を表示しなかったが、声まねを行っている間、実施の形態１と同様に、顔画像やその他の画像を表示することにより、ユーザに声まね指数をわかりやすく提示しても良い。 In the present embodiment, the face image is not displayed as in the first embodiment, but the face image and other images are displayed in the same manner as in the first embodiment while the voice is imitated. Thus, the imitation index may be presented to the user in an easy-to-understand manner.

また、本実施の形態における具体例によれば、例えば、正しい歌の音声データを強制的に音痴な音声データに変更して、強制的に音痴に歌を歌うことを練習する機能について述べなかったが、実施の形態１で述べた機能と同様に、かかる機能があっても良い。かかる機能は、音ズレ情報入力受付部１０７、音声データ変更部１０８により可能である。 Further, the specific example in the present embodiment did not describe the function of, for example, forcibly changing the voice data of a correctly sung song into off-key voice data so that the user can practice deliberately singing off-key; however, such a function may be provided, as in Embodiment 1. This function is made possible by the sound shift information input receiving unit 107 and the voice data changing unit 108.

さらに、本実施の形態における情報処理装置を実現するソフトウェアは、以下のようなプログラムである。つまり、このプログラムは、コンピュータに、音声を取得する音声取得ステップと、前記音声取得ステップで取得した音声の所定の特徴量を抽出する第一特徴量抽出ステップと、前記第一特徴量抽出部が抽出した特徴量と、比較対照の音声データの特徴量を、音声データの部分ごとに比較する比較ステップと、前記比較ステップで比較した部分ごとの比較結果を出力する出力ステップを実行させるためのプログラムである。 Furthermore, the software that implements the information processing apparatus according to the present embodiment is the following program. That is, the program includes: a sound acquisition step for acquiring sound; a first feature amount extraction step for extracting a predetermined feature amount of the sound acquired in the sound acquisition step; and the first feature amount extraction unit. A program for executing a comparison step for comparing the extracted feature amount and the feature amount of the comparison audio data for each portion of the audio data, and an output step for outputting a comparison result for each portion compared in the comparison step It is.

なお、上記プログラムにおいて、音声取得ステップなどでは、ハードウェアによって行われる処理、例えば、音声取得ステップにおけるスピーカーなどで行われる処理（ハードウェアでしか行われない処理）は含まれない。
また、上記のプログラムを実行するコンピュータは、単数であってもよく、複数であってもよい。すなわち、集中処理を行ってもよく、あるいは分散処理を行ってもよい。 In the above program, the sound acquisition step or the like does not include processing performed by hardware, for example, processing performed by a speaker or the like in the sound acquisition step (processing performed only by hardware).
The computer that executes the above program may be a single computer or a plurality of computers. That is, centralized processing or distributed processing may be performed.

また、上記各実施の形態において、以下のようなアルゴリズムで、声まね指数を算出しても良い。つまり、情報処理装置の比較部がビブラートの類似度、音の入り方の類似度および音程の変化に関する類似度に基づいて、声まね指数を算出する場合のアルゴリズムの詳細を以下に説明する。 In each of the above embodiments, the voice imitation index may be calculated by the following algorithm. That is, the details of the algorithm when the comparison unit of the information processing apparatus calculates the voice imitation index based on the similarity of vibrato, the similarity of how to enter sound, and the similarity related to the change in pitch will be described below.

まず、情報処理装置の第二特徴量抽出部は、以下の前処理を行う。今、教師データ（ａ）が音声データ格納部に格納されている、とする。つまり、（ａ）は生波形である（図１８参照）。そして、まず、第二特徴量抽出部は、ある値（ここでは閾値０．０５）以下のものをゼロとし、ノイズの削減し、図１８（ｂ）を得る。次に、第二特徴量抽出部は、ノイズを減らした波形を整流し、図１８（ｃ）を得る。次に、第二特徴量抽出部は、Ｗｉｎｄｏｗ幅で加算平均をとる。ただし、ビブラートが取れるように、０．０３秒以下の長さにする。その結果、第二特徴量抽出部は、図１８（ｄ）のデータを得る。そして、第二特徴量抽出部は、テンプレートを作るために、音のない部分をカットする。そして、第二特徴量抽出部は、途中で途切れた場合も、後ろの部分はカットし、図１８（ｅ）を得る。以上により、第二特徴量抽出部は、音の出だしｔ１とおわりｔ２を抽出する。 First, the second feature amount extraction unit of the information processing apparatus performs the following preprocessing. Assume that teacher data (a) is stored in the voice data storage unit; (a) is a raw waveform (see FIG. 18). First, the second feature amount extraction unit reduces noise by setting every value at or below a certain value (here, a threshold of 0.05) to zero, obtaining FIG. 18(b). Next, it rectifies the noise-reduced waveform, obtaining FIG. 18(c). Next, it takes the average over a window width; the window is 0.03 seconds or shorter so that vibrato remains detectable. As a result, it obtains the data of FIG. 18(d). Then, to make a template, it cuts the portions without sound. If the sound breaks off partway, the portion after the break is also cut, yielding FIG. 18(e). Through these steps, the second feature amount extraction unit extracts the sound start t1 and end t2.
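The preprocessing chain (noise gate, rectification, windowed averaging, trimming to t1 and t2) can be sketched in plain Python as below. The function name `preprocess`, the trailing-window form of the average, and the treatment of "no sound" as an envelope value of exactly zero are assumptions made for illustration; the gate of 0.05 and the 0.03-second window come from the text.

```python
def preprocess(samples, rate, gate=0.05, window_s=0.03):
    """Return (envelope between t1 and t2, t1 in seconds, t2 in seconds)."""
    # (b) Noise reduction: zero every sample whose magnitude is at or below the gate.
    gated = [s if abs(s) > gate else 0.0 for s in samples]
    # (c) Rectification.
    rect = [abs(s) for s in gated]
    # (d) Average over a trailing window of about window_s seconds.
    w = max(1, round(window_s * rate))
    env = [sum(rect[max(0, i - w + 1):i + 1]) / w for i in range(len(rect))]
    # (e) Cut leading silence, then keep only the first sounding run, so that
    # anything after a break in the sound is discarded as well.
    start = next((i for i, v in enumerate(env) if v > 0.0), None)
    if start is None:
        return [], None, None
    end = start
    while end < len(env) and env[end] > 0.0:
        end += 1
    return env[start:end], start / rate, end / rate
```

Summing each window directly, rather than keeping a running total, avoids floating-point residue that would stop the envelope from returning to exactly zero during silence.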

次に、第二特徴量抽出部は、図１８の（ｅ）のデータから抽出したｔ１、ｔ２の範囲で、Ｗｉｎｄｏｗ幅０．０１でＳＴＦＴ（ＳｈｏｒｔＴｉｍｅＦｏｕｒｉｅｒＴｒａｎｓｆｏｒｍ）し、図１９（ｆ）を得る。次に、第二特徴量抽出部は、（ｆ）のＳＴＦＴ結果より、各時間における最大値を持つ周波数のみ抜き出し、図１９（ｆ）を得る。さらに、第二特徴量抽出部は、（ｇ）より、最大値を一番多く持つ周波数を求め、その周波数の上下１オクターブ内でのみ、１０成分大きいものから順に抜き出し、図１９（ｈ）を得る。
次に、第一特徴量抽出部は、上述した第二特徴量抽出部のアルゴリズムと同様のアルゴリズムで、声まねデータの最大値を持つ周波数から上下１オクターブ内の１０成分を抜き出し、図１９（ｉ）を得る。 Next, the second feature amount extraction unit applies an STFT (Short-Time Fourier Transform) with a window width of 0.01 over the range from t1 to t2 extracted from the data of FIG. 18(e), obtaining FIG. 19(f). Next, from the STFT result of (f), it extracts only the frequency having the maximum value at each time, obtaining FIG. 19(g). Further, from (g), it finds the frequency that most often has the maximum value and, only within one octave above and below that frequency, extracts the 10 largest components in descending order, obtaining FIG. 19(h).
Next, using the same algorithm as the second feature amount extraction unit described above, the first feature amount extraction unit extracts 10 components within one octave above and below the frequency having the maximum value in the voice imitation data, obtaining FIG. 19(i).
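The template construction in steps (f) to (h) can be sketched as follows. Several details are assumptions rather than statements from the text: a naive DFT over non-overlapping rectangular frames stands in for the STFT, the 10 components are selected per frame, and the names `dft_magnitudes` and `build_template` are hypothetical. A production version would more likely use an FFT routine from a numerical library.

```python
import cmath
from collections import Counter

def dft_magnitudes(frame):
    """Magnitude spectrum (bins 0..n/2) of one frame, via a naive DFT."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]

def build_template(samples, rate, window_s=0.01, keep=10):
    """Return (center frequency in Hz, per-frame lists of kept bin indices)."""
    n = max(2, round(window_s * rate))
    # (f) Short-time spectra over non-overlapping frames.
    spectra = [dft_magnitudes(samples[i:i + n])
               for i in range(0, len(samples) - n + 1, n)]
    if not spectra:
        return None, []
    # (g) Strongest bin in each frame; bin 0 (DC) is skipped.
    peaks = [max(range(1, len(s)), key=s.__getitem__) for s in spectra]
    # The bin that is the per-frame maximum most often.
    center = Counter(peaks).most_common(1)[0][0]
    # (h) Within one octave either side, keep the `keep` strongest bins per frame.
    band = [k for k in range(1, n // 2 + 1) if center / 2 <= k <= center * 2]
    template = [sorted(sorted(band, key=s.__getitem__, reverse=True)[:keep])
                for s in spectra]
    return center * rate / n, template
```

With a 0.01-second window the frequency resolution is 100 Hz, so the octave band around a low center frequency may contain fewer than 10 bins; in that case the sketch simply keeps all of them.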

次に、比較部は、以下のように２つのテンプレート（図１９（ｈ）、図１９（ｉ））を比較する。まず、比較部は、音の入り方の類似度について比較する。つまり、比較部は、図１９（ｈ）において最初の１０ブロック分のみのデータを比較する（０．１秒分）。図１９（ｈ）のＴ＿Ｂで囲まれた四角い部分がここに相当する。図１９（ｉ）についても同様の時間Ｔ＿Ｂを抜き出し、時間ごとの差分をとる。本データの場合、全体の誤差平均値「ｄｉｆｆ＝０．０１９７」となった、とする。比較部が音の入り方が似ていると判断する場合は、「Ｔｈｒｅｓｈｏｌｄ１（−ｘ）<ｄｉｆｆ<Ｔｈｒｅｓｈｏｌｄ２（＋ｘ）」であり、比較部が音の入りが弱いと判断する場合は、「Ｔｈｒｅｓｈｏｌｄ１>ｄｉｆｆ」であり、比較部が音の入りが強いと判断する場合は、「Ｔｈｒｅｓｈｏｌｄ２<ｄｉｆｆ」である、とする。上記の例にあげたデータでは、非常によく似ていると判断される。なお、音の入り方に関する情報を取得するのは、第一入情報取得手段および第二入情報取得手段である。 Next, the comparison unit compares the two templates (FIG. 19 (h) and FIG. 19 (i)) as follows. First, the comparison unit compares the degree of similarity of how the sound enters. That is, the comparison unit compares data for only the first 10 blocks in FIG. 19H (for 0.1 seconds). A square portion surrounded by T_B in FIG. 19H corresponds to this. The same time T_B is extracted from FIG. 19 (i), and the difference for each time is taken. In the case of this data, it is assumed that the total error average value “diff = 0.0197” is obtained. When the comparison unit determines that the sound is similar, “Threshold 1 (−x) <diff <Threshold 2 (+ x)”, and when the comparison unit determines that the sound is weak, “Threshold 1 > diff ”, and when the comparison unit determines that the sound is strong, it is assumed that“ Threshold2 <diff ”. The data given in the above example is judged to be very similar. In addition, it is the 1st input information acquisition means and the 2nd input information acquisition means that acquire the information regarding how to enter the sound.
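The onset comparison reduces to averaging the per-block difference over the first 0.1 seconds and testing it against a symmetric pair of thresholds. In the sketch below, the sign convention (user minus teacher), the value of `x`, and the function name `onset_verdict` are assumptions; the text gives only the form of the inequalities and the example value diff = 0.0197.

```python
def onset_verdict(teacher, user, blocks=10, x=0.05):
    """Compare the first `blocks` template values; return (diff, verdict)."""
    n = min(blocks, len(teacher), len(user))
    if n == 0:
        raise ValueError("both templates need at least one block")
    # Mean signed difference over the onset region T_B.
    diff = sum(u - t for t, u in zip(teacher[:n], user[:n])) / n
    if diff < -x:
        return diff, "weak"     # Threshold1 > diff
    if diff > x:
        return diff, "strong"   # Threshold2 < diff
    return diff, "similar"      # Threshold1 <= diff <= Threshold2
```

The text uses strict inequalities throughout and leaves the boundary cases unspecified; the sketch treats a difference exactly equal to a threshold as similar.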

Next, the comparison unit compares the similarity between the first vibrato information and the second vibrato information. That is, using the template of FIG. 19 (h), the comparison unit checks, along the time axis at each frequency, whether a component is present (the arrow in FIG. 19 (h)). The comparison unit checks the time width of the on-off repetition at a certain frequency (several blocks on either side of the strongest frequency). When there is no on-off repetition, the comparison unit judges "no vibrato". When the on-off pattern repeats, the comparison unit judges "vibrato present" and, in that case, obtains the vibrato Δt ((h) Δt).
Next, in the same manner, the comparison unit obtains the vibrato Δt' from the voice imitation data.
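One way to read the on-off check is as a search for repeated off-to-on transitions in a single frequency row of the template. The sketch below makes that assumption; the function name, the boolean on/off representation and the block length of 0.01 seconds are illustrative choices, not details given in the description.

```python
def vibrato_period(active, block_sec=0.01):
    """Estimate the vibrato repetition width from an on/off sequence.

    active: one boolean per time block, True when the watched frequency is present.
    Returns the mean time between successive off-to-on transitions in seconds,
    or None when the pattern does not repeat ("no vibrato").
    """
    rises = [i for i in range(1, len(active))
             if active[i] and not active[i - 1]]
    if len(rises) < 2:
        return None
    gaps = [later - earlier for earlier, later in zip(rises, rises[1:])]
    return block_sec * sum(gaps) / len(gaps)
```

Applying the same function to the reference row and to the imitation row would give Δt and Δt' respectively.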

When "Δt' >> Δt", the comparison unit determines that the vibrato should be made finer, and the output unit outputs an instruction to make the vibrato finer. When "Δt' << Δt", or when no vibrato exists, the comparison unit determines that the vibrato should be made larger, and the output unit outputs an instruction to make the vibrato larger. Further, when "Δt' == Δt", the comparison unit determines that the vibrato information is similar, and the output unit either outputs nothing or outputs that the result is good.
At the time of output, the similarity of the sound entry and the similarity of the vibrato may be indexed and weighted so that a single numerical value is output. Such a single numerical value is, for example, the voice imitation index.
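A minimal sketch of this decision and of the weighted single value follows. The relative tolerance that stands in for ">>" and "<<", the equal default weights, the 0 to 1 similarity scale and the 0 to 100 index scale are all assumptions; the description does not fix any of them. The sketch also assumes the reference audio does contain vibrato.

```python
def vibrato_advice(dt_ref, dt_imi, tolerance=0.25):
    """Compare the imitation's vibrato width dt_imi with the reference dt_ref.

    dt_imi is None when no vibrato was found in the imitation.
    """
    if dt_imi is None or dt_imi < dt_ref * (1 - tolerance):
        return "make the vibrato larger"
    if dt_imi > dt_ref * (1 + tolerance):
        return "make the vibrato finer"
    return "good"


def imitation_index(onset_similarity, vibrato_similarity,
                    w_onset=0.5, w_vibrato=0.5):
    """Weighted average of two similarities (each in 0..1), scaled to 0..100."""
    weighted = w_onset * onset_similarity + w_vibrato * vibrato_similarity
    return 100.0 * weighted / (w_onset + w_vibrato)
```

Changing the weights shifts how strongly each similarity influences the single output value.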

The voice imitation index may also be calculated as follows. The comparison unit obtains a difference template from FIG. 18 (h) and FIG. 18 (i) (FIG. 21 (l)). From this difference template, the average difference data at each time is used as the input of an ANN. Three patterns obtained from the questionnaire results (good, average, and poor results) are used as teaching data for training the ANN. The ANN has, for example, 72 input values. The output is the average score obtained from the questionnaire results, normalized by dividing it by 100 points. The ANN is trained by error backpropagation. When the average difference data (m) just obtained is input to the trained ANN, the ANN outputs the voice imitation index. FIG. 22 is a model diagram of the ANN. In this case, the result was 80 points.
Note that the similarity between the first vibrato information and the second vibrato information may also be determined on data to which the STFT has been applied twice.
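The ANN-based calculation could look roughly like the following sketch. Only the 72 inputs, the normalized output and the use of error backpropagation come from the description; the single hidden layer, its size, the sigmoid activations, the learning rate and the class and function names are assumptions, and multiplying the output by 100 to recover points is an inference from the "80 points" example.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class TinyANN:
    """72 inputs -> one hidden layer -> one sigmoid output in (0, 1)."""

    def __init__(self, n_in=72, n_hidden=4, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
                   for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [rng.uniform(-0.1, 0.1) for _ in range(n_hidden)]
        self.b2 = 0.0

    def forward(self, x):
        h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
             for row, b in zip(self.w1, self.b1)]
        y = sigmoid(sum(w * hi for w, hi in zip(self.w2, h)) + self.b2)
        return h, y

    def train_step(self, x, target, lr=0.1):
        """One backpropagation update for the loss 0.5 * (y - target)^2."""
        h, y = self.forward(x)
        delta_out = (y - target) * y * (1.0 - y)
        for j, hj in enumerate(h):
            # Hidden-layer error uses the output weight before it is updated.
            delta_h = delta_out * self.w2[j] * hj * (1.0 - hj)
            self.w2[j] -= lr * delta_out * hj
            self.b1[j] -= lr * delta_h
            row = self.w1[j]
            for i, xi in enumerate(x):
                row[i] -= lr * delta_h * xi
        self.b2 -= lr * delta_out
        return 0.5 * (y - target) ** 2

    def index(self, x):
        """Voice imitation index in points (normalized output times 100)."""
        return 100.0 * self.forward(x)[1]


def train(net, samples, epochs=1000, lr=0.1):
    """Online training; returns the summed loss seen during the last epoch."""
    loss = 0.0
    for _ in range(epochs):
        loss = sum(net.train_step(x, t, lr) for x, t in samples)
    return loss
```

For training, `samples` would hold pairs of a 72-value average difference vector and the questionnaire score divided by 100; after training, `net.index(vector)` returns the estimated index for a new difference vector.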

The algorithm described above calculates the voice imitation index based on two of three similarities: the similarity of vibrato, the similarity of how the sound enters, and the similarity regarding the change in pitch. It is merely one example of an algorithm for calculating the voice imitation index, and it goes without saying that other algorithms may be used.

In each of the above embodiments, each process (each function) may be realized by centralized processing in a single apparatus (system), or by distributed processing across a plurality of apparatuses.
The present invention is not limited to the above embodiments; various modifications are possible, and it goes without saying that these are also included within the scope of the present invention.

As described above, the information processing apparatus according to the present invention has the effect of enabling practice of voice imitation and the like, and is useful, for example, as equipment installed in a karaoke apparatus.

FIG. 1: Block diagram of the information processing apparatus according to Embodiment 1
FIG. 2: Flowchart explaining the operation of the information processing apparatus
FIG. 3: Table of voice imitation evaluation results
FIG. 4: Voice imitation evaluation results
FIG. 5: Diagram explaining the algorithm for calculating the voice imitation index
FIG. 6: Output face image judgment table
FIG. 7: Output face image
FIG. 8: Example of an output screen showing the voice imitation index and the like
FIG. 9: Diagram explaining the change of the audio data
FIG. 10: Block diagram of the system configuration of the information processing system
FIG. 11: Image showing the voice imitation index
FIG. 12: Output face image
FIG. 13: Block diagram of the information processing apparatus according to Embodiment 2
FIG. 14: Flowchart explaining the operation of the information processing apparatus
FIG. 15: Flowchart explaining the operation of the information processing apparatus
FIG. 16: Example of an output screen showing the voice imitation index and the like
FIG. 17: Example of a display screen showing the comparison result
FIG. 18: Diagram explaining a specific example of the data conversion
FIG. 19: Diagram explaining a specific example of the data conversion
FIG. 20: Diagram explaining a specific example of the data conversion
FIG. 21: Diagram explaining a specific example of the data conversion
FIG. 22: Model diagram explaining the ANN

Explanation of symbols

91 Client device
92 Server device
101 Audio data storage unit
102 Audio acquisition unit
103 First feature quantity extraction unit
104 Second feature quantity extraction unit
105, 1105 Comparison unit
106, 1106 Output unit
107 Sound shift information input reception unit
108 Audio data change unit
109, 1109 Input reception unit
1031 First vibrato information acquisition means
1032 First input information acquisition means
1033 First pitch change information acquisition means
1041 Second vibrato information acquisition means
1042 Second input information acquisition means
1043 Second pitch change information acquisition means
1110 Audio output unit
9101 First transmission/reception unit
9201 Second transmission/reception unit

Claims

An information processing device that performs voice imitation evaluation,
An audio acquisition unit for acquiring audio;
A first feature quantity extraction unit for extracting a predetermined feature quantity of the voice acquired by the voice acquisition unit;
A comparison unit for comparing the feature quantity extracted by the first feature quantity extraction unit with the feature quantity of the audio data to be compared;
An information processing apparatus including an output unit that outputs a result of comparison by the comparison unit;
The feature quantity compared by the comparison unit includes information on how the sound enters,
The first feature amount extraction unit includes:
From the sound acquired by the sound acquisition unit, comprising first input information acquisition means for acquiring first input information that is amplitude at each frequency of the first predetermined number of blocks,
The comparison unit includes:
The second input information that is the amplitude at each frequency of the first predetermined number of blocks of the audio data to be compared is compared with the first input information, and the similarity of the sound input is obtained, An information processing apparatus that obtains the result of the comparison using the similarity of how sound enters.

The feature amount compared by the comparison unit further includes information regarding a change in pitch,
The first feature amount extraction unit includes:
From the voice acquired by the voice acquisition unit further comprises first pitch change information acquisition means for acquiring first pitch change information that is a set of frequencies having the strongest amplitude per unit time,
The comparison unit includes:
Compare the second pitch change information and the first pitch change information, which is a set of frequencies of the strongest amplitude for each unit time of the audio data to be compared, and obtain the similarity regarding the pitch change, The information processing apparatus according to claim 1, wherein the comparison result is acquired using a similarity degree related to the change in pitch.

The feature amount compared by the comparison unit further includes information on vibrato,
The first feature amount extraction unit includes:
From the voice acquired by the voice acquisition unit, further comprising a first vibrato information acquisition means for acquiring first vibrato information that is a set of frequencies having the strongest amplitude per unit time,
The comparison unit includes:
It is a set of frequencies acquired from the audio data to be compared, and the second vibrato information, which is a set of frequencies having the strongest amplitude per unit time, is compared with the first vibrato information, and similarities related to vibrato The information processing apparatus according to claim 1, wherein the information processing apparatus acquires a degree and obtains the result of the comparison using a similarity degree related to the vibrato.

An information processing device that performs voice imitation evaluation,
An audio data storage unit storing audio data;
An audio acquisition unit for acquiring audio;
A first feature quantity extraction unit for extracting a predetermined feature quantity of the voice acquired by the voice acquisition unit;
A second feature quantity extraction unit for extracting a predetermined feature quantity from the voice data stored in the voice data storage unit;
A comparison unit that compares the feature amount extracted by the first feature amount extraction unit with the feature amount extracted by the second feature amount extraction unit;
An information processing apparatus including an output unit that outputs a result of comparison by the comparison unit;
The feature quantity compared by the comparison unit includes information on how the sound enters,
The first feature amount extraction unit includes:
From the sound acquired by the sound acquisition unit, comprising first input information acquisition means for acquiring first input information that is amplitude at each frequency of the first predetermined number of blocks,
The second feature amount extraction unit includes:
From the audio data stored in the audio data storage unit, comprising second input information acquisition means for acquiring second input information that is the amplitude at each frequency of the first predetermined number of blocks,
The comparison unit includes:
An information processing apparatus that compares the second input information with the first input information, obtains the similarity of how the sound enters, and obtains the result of the comparison using the similarity of how the sound enters.

The feature amount compared by the comparison unit further includes information regarding a change in pitch,
The first feature amount extraction unit includes:
From the voice acquired by the voice acquisition unit further comprises first pitch change information acquisition means for acquiring first pitch change information that is a set of frequencies having the strongest amplitude per unit time,
The second feature amount extraction unit includes:
Further comprising second pitch change information acquisition means for acquiring second pitch change information that is a set of frequencies having the strongest amplitude per unit time from the voice data stored in the voice data storage unit;
The comparison unit includes:
The second pitch change information and the first pitch change information are compared, a similarity regarding the change in pitch is acquired, and the result of the comparison is acquired also using the similarity regarding the change in pitch. The information processing apparatus according to claim 4.

The feature amount compared by the comparison unit further includes information on vibrato,
The first feature amount extraction unit includes:
From the voice acquired by the voice acquisition unit, the frequency with the strongest amplitude is calculated per unit time, further comprising first vibrato information acquisition means for acquiring first vibrato information that is a set of the frequencies,
The second feature amount extraction unit includes:
Further comprising second vibrato information acquisition means for calculating the frequency with the strongest amplitude per unit time from the audio data stored in the audio data storage unit and acquiring second vibrato information that is a set of the frequencies,
The comparison unit includes:
The second vibrato information and the first vibrato information are compared, a similarity regarding vibrato is acquired, and the result of the comparison is acquired also using the similarity regarding the vibrato. Information processing apparatus.

The information processing apparatus according to claim 1, wherein the feature quantity compared by the comparison unit does not include information on a tempo.

The information processing apparatus according to claim 1, wherein the unit time is approximately 0.03 seconds.

A program for assessing voice imitation,
On the computer,
An audio acquisition step for acquiring audio;
A first feature amount extraction step for extracting a predetermined feature amount of the voice acquired in the voice acquisition step;
A comparison step of comparing the feature amount extracted in the first feature amount extraction step with the feature amount of the audio data to be compared;
A program for executing an output step of outputting a result of comparison in the comparison step;
The feature amount to be compared in the comparison step has information on how to enter the sound,
The first feature amount extraction step includes:
From the sound acquired in the sound acquisition step, comprising a first input information acquisition step of acquiring first input information that is the amplitude at each frequency of the first predetermined number of blocks,
The comparison step includes
The second input information that is the amplitude at each frequency of the first predetermined number of blocks of the audio data to be compared is compared with the first input information, and the similarity of the sound input is obtained, A program that obtains the result of the comparison using the similarity of sound entry.

The feature amount to be compared in the comparison step further includes information on a change in pitch,
The first feature amount extraction step includes:
From the voice acquired in the voice acquisition step, further comprising a first pitch change information acquisition step of acquiring first pitch change information that is a set of frequencies having the strongest amplitude per unit time,
The comparison step includes
Compare the second pitch change information and the first pitch change information, which is a set of frequencies of the strongest amplitude for each unit time of the audio data to be compared, and obtain the similarity regarding the pitch change, The program according to claim 9, wherein the comparison result is acquired using a similarity degree related to the change in the pitch.

The feature amount to be compared in the comparison step further includes information on vibrato,
The first feature amount extraction step includes:
A first vibrato information acquisition step of acquiring first vibrato information that is a set of frequencies having the strongest amplitude per unit time from the voice acquired in the voice acquisition step;
The comparison step includes
It is a set of frequencies acquired from the audio data to be compared, and the second vibrato information, which is a set of frequencies having the strongest amplitude per unit time, is compared with the first vibrato information, and similarities related to vibrato The program according to claim 9 or 10, wherein a degree is obtained, and the result of the comparison is obtained using the degree of similarity related to the vibrato.

12. The program according to claim 9, wherein the feature amount compared in the comparison step does not include information on tempo.

The program according to any one of claims 9 to 12, wherein the unit time is approximately 0.03 seconds.