JP5957798B2 - Falsetto detection device and singing evaluation device

Info

Publication number: JP5957798B2
Application number: JP2011058386A
Authority: JP
Inventors: 隆一成山; 神谷　伸悟; 伸悟神谷
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2011-03-16
Filing date: 2011-03-16
Publication date: 2016-07-27
Anticipated expiration: 2031-03-16
Also published as: JP2012194389A

Description

The present invention relates to a technique for detecting falsetto in the voice of a singer.

Some karaoke apparatuses have a function for scoring how well a singer sings. To do so, the karaoke apparatus records the singer's voice, analyzes the recording to detect its features, and calculates a score by comparing the analysis result against evaluation criteria.

Singers sometimes use what is called falsetto. Falsetto is the voice produced above the passaggio (the register break), in the so-called flipped state. Because some songs make extensive use of falsetto, scoring accuracy can be improved if the skill of falsetto singing is reflected in the score. For example, Patent Document 1 discloses a method for detecting falsetto in a singer's performance. The technique of Patent Document 1 relies on the observation that the proportion of harmonic components in falsetto is extremely small compared with the chest voice, and recognizes the singing voice as falsetto when its spectral characteristics transition abruptly to a predetermined state.

Patent Document 1: JP 2007-310204 A

However, how the proportion of harmonic components in the chest voice relates to that in falsetto varies from singer to singer. In other words, a singer's falsetto is defined relative to that singer's own chest voice and cannot be fixed uniformly for all singers. The technique of Patent Document 1 defines in advance a state in which the proportion of harmonic components is small and uses it as a common criterion for recognizing falsetto for every singer. Consequently, for some singers, the spectral characteristics may not match the predetermined state even when they sing in falsetto, and the falsetto may go undetected. If falsetto actually sung by the singer is sometimes not detected, the karaoke apparatus cannot properly score the skill of falsetto singing, and the user may be highly dissatisfied with the result.

The present invention has been made in view of the above background, and its object is to reduce missed detections when detecting falsetto in a singer's voice.

To solve the above problem, the present invention provides a falsetto detection device comprising: voice data acquisition means for acquiring voice data representing the voice of a singer singing; calculation means for calculating, based on the voice data acquired by the voice data acquisition means, a harmonic ratio and a pitch of the voice represented by the voice data, the harmonic ratio being the ratio of the overtone frequencies to the fundamental frequency; assignment means for assigning, in a coordinate system composed of a first axis representing the harmonic ratio and a second axis representing the pitch, each pair of harmonic ratio and pitch calculated by the calculation means to the coordinates corresponding to that harmonic ratio and pitch; and falsetto detection means for detecting, as voice data representing falsetto, the voice data representing the portion of the singing that corresponds to the pairs assigned to a partial region having a relatively low harmonic ratio and a relatively high pitch within the region to which the pairs have been assigned by the assignment means. The falsetto detection means comprises: a filter that is moved within the coordinate system and has a plus region carrying a positive weight and a minus region carrying a negative weight; and addition means for adding, each time the filter is moved within a range of the coordinate system in which the harmonic ratio is lower than a predetermined first reference value and the pitch is higher than a predetermined second reference value, a negative calculated value obtained by applying the negative weight to the number of pairs contained in the minus region of the filter and a positive calculated value obtained by applying the positive weight to the number of pairs contained in the plus region. The falsetto detection means detects the voice data representing falsetto based on the result of the addition.
The present invention further provides a falsetto detection device comprising the voice data acquisition means, calculation means, assignment means, and falsetto detection means defined above, wherein, in the distribution of the pitches contained in the pairs for each harmonic ratio on the first axis, the falsetto detection means detects, as voice data representing falsetto, the voice data corresponding to the pairs from the range of harmonic ratios in which a local maximum appears at a predetermined reference pitch.

The present invention may also be provided as a singing evaluation device comprising: the above falsetto detection device; reference voice data acquisition means for acquiring reference voice data that represents the constituent notes of the song to be sung and in which a falsetto flag is attached to each constituent note that is to be sung in falsetto; and evaluation means for associating each voice represented by the voice data acquired by the voice data acquisition means of the falsetto detection device with the corresponding constituent note represented by the reference voice data, and evaluating the singer's singing according to the result of comparing the associated items, wherein the evaluation means gives a higher evaluation when voice data detected as falsetto by the falsetto detection means corresponds to reference voice data carrying the falsetto flag than when it does not.

According to the present invention, missed detections can be reduced when detecting falsetto in a singer's voice.

Block diagram showing the hardware configuration of the karaoke apparatus
Flowchart of the falsetto detection process
Diagram for explaining the formula for the harmonic ratio
Diagram showing a distribution table of voice information data based on chest voice and falsetto
Diagram showing the filter applied to the voice distribution table of FIG. 4
Diagram showing the distribution table of voice information data based on chest voice and falsetto with the filter applied
Diagram showing the correspondence between detected falsetto and the guide melody
Diagram showing the determination results
Block diagram showing the functional configuration of the control unit
Diagram showing a distribution table of voice information data based on chest voice and falsetto according to Modification 2
Diagram showing a distribution table of voice information data based on chest voice and falsetto according to Modification 2
Diagram showing the filter applied to the voice distribution table according to Modification 3
Diagram showing the filter applied to the voice distribution table according to Modification 3
Flowchart of the falsetto detection process according to Modification 4
Diagram showing the distribution count acquisition reference line according to Modification 4
Diagram showing the distribution of voice information data according to Modification 4
Diagram showing the distribution of voice information data according to Modification 4
Diagram showing the distribution of voice information data according to Modification 4
Diagram showing the distribution of voice information data according to Modification 4
Diagram showing the distribution of voice information data according to Modification 4
Diagram explaining the process of determining the falsetto region according to Modification 4

An embodiment of the present invention is described below.
<Embodiment>
<Configuration>
FIG. 1 is a block diagram showing the hardware configuration of the karaoke apparatus 100.
The karaoke apparatus 100 scores the user's singing; in particular, it detects the portions the user sang in falsetto and includes them in the scoring. The karaoke apparatus 100 uses a deduction method for scoring. In the deduction method, the score starts at the maximum when the user begins singing a karaoke song (100 points on a 100-point scale), and points are deducted whenever the user's singing fails to meet the evaluation criteria. As shown in FIG. 1, the karaoke apparatus 100 includes a control unit 10, a storage unit 20, an operation unit 30, a display unit 40, a communication control unit 50, an audio processing unit 60, a microphone 61, and a speaker 62, which are connected to one another via a bus 70. The control unit 10 includes a CPU (Central Processing Unit), a RAM (Random Access Memory), a ROM (Read Only Memory), and the like. In the control unit 10, the CPU reads a computer program stored in the ROM or the storage unit 20, loads it into the RAM, and executes it, thereby controlling each unit of the karaoke apparatus 100. The control unit 10 also has a timekeeping function for measuring time.
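The deduction method can be sketched as follows. This is only a minimal illustration, not the scoring logic of the apparatus: the function name `deduction_score`, the fixed penalty of 2 points per unmet criterion, and the floor at zero are assumptions, since the text states only that the score starts at full marks and is reduced when the singing fails the evaluation criteria.

```python
def deduction_score(criteria_met, full_score=100, penalty=2):
    """Deduction-style scoring sketch.

    criteria_met: iterable of booleans, one per evaluated item
                  (True = the singing met the criterion).
    penalty:      points removed per unmet criterion (assumed value).
    """
    score = full_score                 # singing starts at full marks
    for met in criteria_met:
        if not met:
            score -= penalty           # deduct only when a criterion is missed
    return max(score, 0)               # assumed floor: never go below zero


if __name__ == "__main__":
    print(deduction_score([True, False, True, False]))
```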

The operation unit 30 includes various controls and outputs to the control unit 10 an operation signal representing the user's operation. The display unit 40 includes, for example, a liquid crystal panel and, under the control of the control unit 10, displays lyric captions, background video, and the like for each karaoke song. The communication control unit 50 controls data communication over a network (not shown) between the karaoke apparatus 100 and a server apparatus (not shown).

The microphone 61 outputs an analog audio signal representing the collected sound to the audio processing unit 60. The audio processing unit 60 has an A/D (Analog/Digital) converter; it converts the analog audio signal output by the microphone 61 into digital voice data and outputs the data to the control unit 10, which acquires it. The control unit 10 thus functions as voice acquisition means for acquiring voice data representing the voice of the user (singer) singing. The audio processing unit 60 also has a D/A (Digital/Analog) converter; it converts digital audio data received from the control unit 10 into an analog audio signal and outputs the signal to the speaker 62. The speaker 62 emits sound based on the analog audio signal received from the audio processing unit 60. The storage unit 20 is storage means for storing various data, for example an HDD or a nonvolatile memory. The storage unit 20 includes a plurality of storage areas: an accompaniment data storage area 21, a video data storage area 22, a GM (Guide Melody) data storage area 23, and a user singing voice data storage area 24.

The accompaniment data storage area 21 stores information on the accompaniment data representing the accompaniment of each song. It holds a plurality of accompaniment data records, each consisting of items such as a "song number" that uniquely identifies the song, a "song title" giving the name of the song, a "singer name" giving the name of the song's performer, and a "file storage location" indicating where the data file containing the accompaniment data itself is stored. The accompaniment data file is, for example, a file in MIDI (Musical Instrument Digital Interface) format.

The video data storage area 22 stores, in association with one another, the song number described above, lyrics data indicating the lyrics of each song, and background video data representing the background video displayed behind the lyrics. During karaoke singing, the lyrics indicated by the lyrics data are displayed on the display unit 40 as lyric captions as the song progresses, and the background video represented by the background video data is displayed on the display unit 40 behind the lyric captions. The GM data storage area 23 stores, in association with the song number described above, guide melody data (hereinafter, GM data), which indicates the melody of the vocal part of the song, that is, data specifying the constituent notes to be sung. The GM data serves as the reference for comparison when the control unit 10 evaluates how well the user sings. In the GM data, each note carries a falsetto flag indicating whether the note should be sung in falsetto. For example, if the original singer of the song "AAA" sings a certain note in falsetto, the falsetto flag for that note is set to "ON" in the GM data. For a note that the original singer sings in chest voice, the falsetto flag is set to "OFF" in the GM data. The GM data is described in MIDI format, for example. Here, the chest voice is the voice a person uses in ordinary speech. A voice sung in chest voice is rich in overtones (that is, it has many harmonic components in the frequency domain). Falsetto, by contrast, is the voice that has flipped over from the chest voice (beyond the passaggio). A voice sung in falsetto has fewer harmonic components than the chest voice but a higher pitch.
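One way to picture guide melody notes carrying a falsetto flag is sketched here. The text says only that the GM data is in MIDI format and that each note has an ON/OFF falsetto flag; the `GuideNote` class, its field names, the millisecond timing, and the `note_at` helper are all hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class GuideNote:
    start_ms: int    # note onset, measured from the start of the song
    end_ms: int      # note offset (exclusive)
    midi_pitch: int  # MIDI note number of the guide melody note
    falsetto: bool   # True when the falsetto flag is ON


def note_at(notes: List[GuideNote], elapsed_ms: int) -> Optional[GuideNote]:
    """Return the guide note sounding at elapsed_ms, or None during a rest."""
    for note in notes:
        if note.start_ms <= elapsed_ms < note.end_ms:
            return note
    return None


# Hypothetical two-note melody: the second note is flagged as falsetto.
melody = [GuideNote(0, 500, 60, False), GuideNote(500, 1200, 72, True)]
```

A lookup such as `note_at(melody, 600)` then tells the evaluator whether the note being sung at that moment is expected to be falsetto.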

For each song sung in karaoke, the user singing voice data storage area 24 stores the voice data generated when the audio processing unit 60 converts into digital data the user's singing voice collected by the microphone 61 while the accompaniment data is being played back. This voice data is called user singing voice data. The user singing voice data is stored, for example, as a data file in WAVE (RIFF Waveform Audio Format) format. The control unit 10 associates the user singing voice data for each song with the GM data of that song.

<Operation>
Next, the method by which the control unit 10 detects falsetto is described with reference to FIGS. 2 to 6.
FIG. 2 is a flowchart of the falsetto detection process. When the user reserves a song via the operation unit 30 (step S100; Yes), the control unit 10 searches the storage unit 20 for the reserved song (step S102). Specifically, in step S102, the control unit 10 searches each of the accompaniment data storage area 21, the video data storage area 22, and the GM data storage area 23 for data on the selected song, using its song number as the key, and loads the retrieved data into the RAM. After step S102, the control unit 10 plays back the song based on the accompaniment data, video data, and GM data stored in the RAM (step S104). Specifically, in step S104, the control unit 10 causes the speaker 62 to emit sound based on the accompaniment data and the GM data and causes the display unit 40 to display video based on the video data. During playback of the song, the control unit 10 stores in the user singing voice data storage area 24 the user singing voice data, that is, the user's singing voice collected by the microphone 61 and converted into digital data by the audio processing unit 60 (step S106).

Next, the control unit 10 calculates from the user singing voice data a harmonic ratio, which represents the proportion of overtone components (step S108). Some basic terms are explained first. The "fundamental frequency" is the frequency of the lowest frequency component when the signal of a given note is expressed as a sum of sine waves. The component that determines the pitch of the note is called the "fundamental". An "overtone" is a sound component whose frequency is an integer multiple (two or more) of the frequency of the fundamental. Given these definitions, the harmonic ratio can be understood as follows in a two-axis coordinate system whose vertical axis represents the power of each frequency component and whose horizontal axis represents frequency. Define the "power of the fundamental frequency" as the area of the power of the frequency component at the fundamental frequency, centered on the peak at the fundamental frequency and spanning the frequency width from the start of the peak to its end. Define the "power of the overtone frequencies" as the total area of the power of the frequency components at the overtone frequencies, each centered on the peak at 2 to n times the fundamental frequency and spanning the frequency width from the start of that peak to its end. The harmonic ratio is then expressed as "power of the overtone frequencies / power of the fundamental frequency". Put another way, the harmonic ratio can be described as the ratio of the overtone frequencies to the fundamental frequency.

The control unit 10 calculates the harmonic ratio for falsetto detection for the following reason. As noted above, falsetto has fewer harmonic components than the chest voice but a higher pitch. Therefore, in a two-axis coordinate system with the harmonic ratio on the vertical axis and pitch on the horizontal axis, falsetto is expected to fall mostly in the region where the harmonic ratio is low and the pitch is high. This is why the control unit 10 calculates the harmonic ratio and uses it to detect falsetto.

In step S108, the control unit 10 calculates the harmonic ratio from the user singing voice data at a predetermined sampling period, for example 100 msec (milliseconds), as time advances through the user singing voice data. Using its internal timekeeping function, the control unit 10 also obtains the time elapsed since the start of the karaoke song at each point where the harmonic ratio is calculated. The control unit 10 stores in the RAM voice information data, a combination that associates this elapsed time with the calculated harmonic ratio and the pitch of the user singing voice data at the time the harmonic ratio was calculated (step S110). One unit of voice information data is one set consisting of one elapsed time, one harmonic ratio, and one pitch. Because the harmonic ratio is calculated once per sampling period, the RAM of the control unit 10 ends up holding as many pieces of voice information data as there are samples in the playback period of the song. Hereinafter, the user who produced the voice (that is, who sang) from which voice information data was created is called the owner of the voice information data. Voice information data generated by the above procedure from the user singing voice data representing a voice is called voice information data based on that voice. By repeating the above processing at the sampling period, the control unit 10 calculates the harmonic ratio for the user singing voice data over the entire playback period of the song and stores the voice information data in the RAM.
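The per-sample records described in steps S108 and S110 might be represented as in the following sketch. The 100 ms period comes from the text; the `VoiceInfo` type, the `collect_voice_info` function, and the `analyze` callback (standing in for whatever routine returns the harmonic ratio and pitch at a given time) are hypothetical names introduced for illustration.

```python
from typing import Callable, List, NamedTuple, Tuple


class VoiceInfo(NamedTuple):
    """One unit of voice information data: elapsed time, harmonic ratio, pitch."""
    elapsed_ms: int
    harmonic_ratio: float
    pitch: float


def collect_voice_info(
    duration_ms: int,
    analyze: Callable[[int], Tuple[float, float]],
    period_ms: int = 100,
) -> List[VoiceInfo]:
    """Sample the recording every period_ms and store one record per sample.

    analyze(t) is assumed to return (harmonic_ratio, pitch) for the
    user singing voice data at elapsed time t in milliseconds.
    """
    records = []
    for t in range(0, duration_ms, period_ms):
        ratio, pitch = analyze(t)
        records.append(VoiceInfo(t, ratio, pitch))
    return records
```

With this layout, the number of records equals the number of sampling periods in the playback time, matching the statement that one record is stored per sample.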

FIG. 3 is a diagram for explaining the formula for the harmonic ratio. In FIG. 3, the vertical axis represents the power of each frequency component, increasing from bottom to top. The horizontal axis represents frequency, increasing from left to right in FIG. 3. Region A1 is the area of the power of the frequency component at the fundamental frequency, centered on the peak at the fundamental frequency and spanning the frequency width from the start of the peak to its end; that is, the power of the fundamental frequency defined above. Regions A2 and A3 are the corresponding areas for overtones, each centered on the peak at a frequency 2 to n times the fundamental frequency and spanning the width from the start of that peak to its end; their total is the power of the overtone frequencies defined above. Since the harmonic ratio is "power of the overtone frequencies / power of the fundamental frequency", it is given by formula (a).
(a) Harmonic ratio = (A2 + A3 + ... + An) / A1
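Formula (a) can be illustrated with a short sketch operating on a discrete power spectrum. Several simplifications here are assumptions rather than part of the described method: the spectrum is a plain list of per-bin power values, the fundamental is given as a bin index `f0_bin`, and each peak area is approximated by summing a fixed number of bins (`half_width`) on either side of the peak instead of locating the actual start and end of each peak.

```python
def band_power(power, center_bin, half_width):
    """Sum the power in bins center_bin - half_width .. center_bin + half_width."""
    lo = max(center_bin - half_width, 0)
    hi = min(center_bin + half_width + 1, len(power))
    return sum(power[lo:hi])  # empty slice (peak beyond the spectrum) sums to 0


def harmonic_ratio(power, f0_bin, n_harmonics, half_width=1):
    """Approximate (A2 + A3 + ... + An) / A1 for a discrete power spectrum."""
    a1 = band_power(power, f0_bin, half_width)          # area A1 (fundamental)
    if a1 == 0:
        raise ValueError("no power at the fundamental frequency")
    overtones = sum(
        band_power(power, k * f0_bin, half_width)       # areas A2 .. An
        for k in range(2, n_harmonics + 1)
    )
    return overtones / a1


# Toy spectrum: fundamental at bin 10, overtones at bins 20 and 30.
spectrum = [0.0] * 50
spectrum[10], spectrum[20], spectrum[30] = 4.0, 2.0, 1.0
print(harmonic_ratio(spectrum, f0_bin=10, n_harmonics=3))  # (2 + 1) / 4 = 0.75
```

A voice with weak overtones, as the text expects for falsetto, yields a small value, while a voice rich in overtones yields a larger one.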

After step S110, the control unit 10 assigns the plurality of pieces of voice information data stored in the RAM to a voice distribution table (step S112). The voice distribution table is a table in the two-axis coordinate system described above, with the harmonic ratio on the vertical axis and pitch on the horizontal axis, and is stored in the RAM. FIG. 4 shows an example of a voice distribution table for voice information data based on chest voice and falsetto. In FIG. 4, the harmonic ratio increases from bottom to top along the vertical axis, and the pitch increases from left to right along the horizontal axis.
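The assignment in step S112 resembles building a two-dimensional histogram. The sketch below assumes uniform bins, clamps out-of-range values to the edge bins, and treats pitch as a MIDI-style note number; none of these details are given in the text, and the names `bin_index` and `build_table` are hypothetical.

```python
def bin_index(value, low, high, bins):
    """Map value to a bin index in 0..bins-1, clamping values outside [low, high]."""
    if value <= low:
        return 0
    if value >= high:
        return bins - 1
    return int((value - low) / (high - low) * bins)


def build_table(pairs, ratio_range, pitch_range, ratio_bins, pitch_bins):
    """Count (harmonic_ratio, pitch) pairs per cell.

    table[r][p]: r indexes the harmonic ratio (0 = lowest),
                 p indexes the pitch (0 = lowest).
    """
    table = [[0] * pitch_bins for _ in range(ratio_bins)]
    for ratio, pitch in pairs:
        r = bin_index(ratio, ratio_range[0], ratio_range[1], ratio_bins)
        p = bin_index(pitch, pitch_range[0], pitch_range[1], pitch_bins)
        table[r][p] += 1
    return table


# Two low-ratio, high-pitch samples and one high-ratio, low-pitch sample.
pairs = [(0.1, 70), (0.15, 71), (0.9, 50)]
table = build_table(pairs, (0.0, 1.0), (40, 80), 4, 4)
```

In this toy example the two falsetto-like samples land in the same cell at the low-ratio, high-pitch corner of the table.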

Chest voice region a is an example of the region occupied when voice information data based on the singer's chest-voice singing is assigned to the voice distribution table. That is, when a point is plotted in the voice distribution table at the position corresponding to the harmonic ratio and pitch of each piece of voice information data for the chest voice, the set of points falls within chest voice region a. Falsetto region b is an example of the region occupied when voice information data based on the singer's falsetto singing is assigned to the voice distribution table. That is, when a point is plotted at the position corresponding to the harmonic ratio and pitch of each piece of voice information data for falsetto, the set of points falls within falsetto region b. Comparing the two, chest voice region a spans relatively low to medium pitches and is spread evenly from low to high harmonic ratios, whereas falsetto region b spans medium to relatively high pitches and low to medium harmonic ratios. The voice distribution table takes this form because, as noted above, falsetto has fewer harmonic components and a higher pitch than the chest voice, while the chest voice has a lower pitch than falsetto and shows no comparable bias in its frequency content.

When the voice information data is assigned to the voice distribution table in step S112, it is distributed over two regions, chest voice region a and falsetto region b, as shown in FIG. 4. In this state, however, it is not possible to determine accurately which pieces of voice information data derive from falsetto, so the control unit 10 performs the following processing to detect them with high accuracy. Following step S112, the control unit 10 applies a filter to the group of voice information data assigned to the voice distribution table and computes a quantity referred to as the calculated value (step S114).

FIG. 5 shows the filter F applied to the voice distribution table of FIG. 4; the vertical and horizontal axes have the same meaning as in FIG. 4. As shown in FIG. 5, the filter F is a combination of four rectangular regions. The upper-left, upper-right, and lower-left regions Fa, Fb, and Fc are assigned negative weights (the minus regions), and the lower-right region Fd is assigned a positive weight (the plus region). As FIG. 5 shows, the minus regions lie on the higher-overtone-ratio side or the lower-pitch side of the plus region. The control unit 10 multiplies the number of pieces of voice information data contained in each region by the weight assigned to that region and takes the sum of the products as the calculated value. As described above, falsetto has fewer harmonic components than chest voice but a higher pitch, so its voice information data is likely to be distributed toward the lower right of the voice distribution table. Multiplying the number of pieces of voice information data in that region by a positive weight therefore yields a high calculated value for falsetto. This is why only the lower-right region Fd of the filter F is given a positive weight.

FIG. 6 shows the filter F applied to the distribution table of voice information data derived from chest voice and falsetto. The axes in FIG. 6 have the same meaning as in FIG. 4. With reference to FIGS. 4 to 6, an example of computing the calculated value in step S114 is as follows. The control unit 10 adds the negative value obtained by applying the negative weights to the number of sets (each a pair of overtone ratio and pitch) contained in the minus regions of the filter F to the positive value obtained by applying the positive weight to the number of sets contained in the plus region, giving the total calculated value. Suppose, for example, that the upper-left region Fa is assigned a weight of −2, the upper-right and lower-left regions Fb and Fc are each assigned a weight of −1, and the lower-right region Fd is assigned a weight of +3. Suppose further that the upper-left region contains 2 pieces of voice information data, the upper-right region 2, the lower-left region 4, and the lower-right region 10. The control unit 10 then computes the calculated value by the following equation (b).
(b) (−2 × 2) + (−1 × 2) + (−1 × 4) + (3 × 10) = 20
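The arithmetic of equation (b) can be sketched in a few lines of Python. This is only an illustration: the function name `calculated_value` and the dictionary keys are chosen here for readability and are not part of the patent; the weights and counts are those of the worked example.

```python
from typing import Dict


def calculated_value(counts: Dict[str, int], weights: Dict[str, int]) -> int:
    """Sum over the filter regions of (number of points) x (region weight)."""
    return sum(weights[region] * counts[region] for region in weights)


# Weights and counts taken from the worked example in the text.
weights = {"Fa": -2, "Fb": -1, "Fc": -1, "Fd": 3}
counts = {"Fa": 2, "Fb": 2, "Fc": 4, "Fd": 10}

print(calculated_value(counts, weights))  # prints 20
```

A filter position that covers mostly chest-voice points drives the sum negative, while one whose lower-right region covers many points drives it positive.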

Returning to FIG. 2, if the calculated value does not exceed a predetermined threshold (step S116; No), the control unit 10 moves the filter F (step S118). The control unit 10 then repeats the computation of the calculated value (step S114) and the movement of the filter F (step S118) until the result of step S116 is Yes. The threshold used in step S116 may be determined experimentally when the karaoke apparatus 100 is designed, from the results of assigning voice information data created from the singing of a large number of unspecified users to the voice distribution table, so that falsetto can be detected for as many users as possible.

Returning to FIG. 6, the movement of the filter F by the control unit 10 is described next. The rectangle drawn with a broken line represents the range MR within which the filter F can move. When moving the filter F in step S118, the control unit 10 excludes low-pitch regions and high-overtone-ratio regions from the movement range MR. Specifically, with all the voice information data assigned to the voice distribution table, the control unit 10 counts the pieces of voice information data in ascending order of overtone ratio. When the count reaches a predetermined proportion of the total number of pieces, the control unit 10 takes the overtone ratio of the piece reached at that point as the upper limit of movement in the overtone-ratio direction within the movement range MR (the first reference value). The lower limit of movement in the overtone-ratio direction is the lowest overtone ratio found in the voice information data.
The upper and lower limits of movement of the filter F in the pitch direction within the movement range MR are determined by the control unit 10 as follows. With all the voice information data assigned to the voice distribution table, the control unit 10 counts the pieces of voice information data in descending order of pitch. When the count reaches a predetermined proportion of the total number of pieces, the control unit 10 takes the pitch of the piece reached at that point as the lower limit of movement of the filter F in the pitch direction (the second reference value). The upper limit of movement in the pitch direction is the highest pitch found in the voice information data. In other words, the control unit 10 moves the filter within the range of the voice distribution table in which the overtone ratio is lower than the predetermined first reference value and the pitch is higher than the predetermined second reference value.
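One way to read the counting procedure for the movement range MR is as a percentile cut on each axis. The sketch below assumes that each piece of voice information data is a `(pitch, overtone_ratio)` tuple and that the count "reaches" the proportion at the smallest count not below `fraction * total`; the function name, the tuple layout, and the rounding rule are assumptions made for illustration, not details given in the text.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]  # (pitch, overtone_ratio); layout assumed here


def movement_range(points: List[Point], ratio_fraction: float,
                   pitch_fraction: float) -> Tuple[float, float, float, float]:
    """Return (pitch_low, pitch_high, ratio_low, ratio_high) of the range MR.

    Both fractions are expected to lie in (0, 1].
    """
    n = len(points)
    ratios = sorted(r for _, r in points)                    # ascending overtone ratio
    pitches = sorted((p for p, _ in points), reverse=True)   # descending pitch
    # Smallest count that reaches the requested share of all data points.
    k_ratio = max(1, math.ceil(ratio_fraction * n))
    k_pitch = max(1, math.ceil(pitch_fraction * n))
    ratio_high = ratios[k_ratio - 1]   # first reference value (upper ratio limit)
    pitch_low = pitches[k_pitch - 1]   # second reference value (lower pitch limit)
    return pitch_low, pitches[0], ratios[0], ratio_high


# Ten synthetic points: pitches 100..1000, overtone ratios 1..10.
points = [(100 * i, i) for i in range(1, 11)]
print(movement_range(points, 0.5, 0.5))  # prints (600, 1000, 1, 5)
```

With both proportions set to one half, the range keeps the lower half of the overtone ratios and the upper half of the pitches, which is the corner of the table where falsetto data is expected.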

The movement range MR is set in this way because, as described above, voice information data derived from falsetto is unlikely to be assigned to the low-pitch and high-overtone-ratio regions of the voice distribution table. The predetermined proportions may be determined experimentally when the karaoke apparatus 100 is designed: voice information data created from the singing of a large number of unspecified users is assigned to the voice distribution table, chest voice and falsetto are distinguished, and values considered suitable for evaluating falsetto are chosen. The proportions are stored in the ROM of the control unit.

The filter F may be moved in step S118 as follows. With the upper-left corner of the filter F at a given overtone-ratio height in the voice distribution table, the control unit 10 computes the calculated value each time it moves the filter F by a predetermined width (for example, 50 cents) in the positive direction of the horizontal axis. When the right edge of the filter F reaches the right edge of the movement range MR, the control unit 10 returns the left edge of the filter F to the left edge of the movement range MR and moves the filter by a predetermined amount (for example, one unit of power) in the negative overtone-ratio direction. The control unit 10 repeats this trajectory until the calculated value exceeds the threshold.
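The row-by-row movement described above resembles a raster scan. The following sketch is a simplified model under several assumptions that the text does not state: the four regions are equal-sized halves of the filter, region boundaries are half-open, points are integer `(pitch, overtone_ratio)` tuples, and the scan starts at the upper-left corner of MR. The names `score_at` and `scan` are hypothetical.

```python
from typing import Optional, Sequence, Tuple

Point = Tuple[int, int]  # (pitch in cents, overtone ratio in integer units)

# Region weights: row 0 = upper half, row 1 = lower half,
# column 0 = left half, column 1 = right half, so WEIGHTS[1][1] is Fd.
WEIGHTS = ((-2, -1), (-1, 3))


def score_at(points: Sequence[Point], left: int, top: int,
             half_w: int, half_h: int) -> int:
    """Calculated value with the filter's upper-left corner at (left, top)."""
    total = 0
    for pitch, ratio in points:
        dx = pitch - left  # distance to the right of the filter's left edge
        dy = top - ratio   # distance below the filter's top edge
        if 0 <= dx < 2 * half_w and 0 <= dy < 2 * half_h:
            total += WEIGHTS[int(dy >= half_h)][int(dx >= half_w)]
    return total


def scan(points: Sequence[Point], mr: Tuple[int, int, int, int],
         half_w: int, half_h: int, pitch_step: int, ratio_step: int,
         threshold: int) -> Optional[Tuple[int, int]]:
    """Sweep left to right, then step down, until the score exceeds threshold."""
    pitch_low, pitch_high, ratio_low, ratio_high = mr
    top = ratio_high
    while top - 2 * half_h >= ratio_low:
        left = pitch_low
        while left + 2 * half_w <= pitch_high:
            if score_at(points, left, top, half_w, half_h) > threshold:
                return left, top  # detection position
            left += pitch_step
        top -= ratio_step
    return None  # threshold never exceeded inside MR


chest = [(100, 8), (200, 6), (300, 7), (300, 3)]
falsetto = [(750, 2), (800, 1), (850, 2), (800, 3), (900, 1)]
position = scan(chest + falsetto, (400, 1000, 0, 6), 200, 2, 50, 1, 8)
print(position)  # prints (500, 5)
```

Returning `None` when the range is exhausted is a choice made for this sketch; the text only describes the case in which the threshold is eventually exceeded.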

When the calculated value exceeds the predetermined threshold (step S116; Yes), the control unit 10 fixes the position of the filter F at the position it occupied when the threshold was exceeded (the detection position) (step S120). A calculated value above the threshold means that region Fd, the only positively weighted region of the filter F, contains enough pieces of voice information data to push the calculated value over the threshold. Also, as described above, falsetto has fewer harmonic components than chest voice but a higher pitch, so its voice information data is likely to be distributed toward the lower right of the voice distribution table. Accordingly, following step S120, the control unit 10 detects the voice information data contained in the positively weighted region of the filter F (that is, the lower-right region Fd) as voice information data derived from falsetto (step S122).
Put differently, within the area to which the voice information data (sets of overtone ratio and pitch) has been assigned, the control unit 10 detects the user singing voice data corresponding to the voice information data in the sub-area with relatively low overtone ratio and high pitch as user singing voice data representing falsetto. The control unit 10 therefore treats all user singing voice data detected in this way as falsetto, even if it is, for example, a hoarse voice or a whisper. In other words, in the present invention the term "falsetto" (裏声, literally "back voice") covers every sound detected as described above.
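Once the detection position is known, step S122 amounts to selecting the points that fall inside region Fd. The sketch below assumes, purely for illustration, that Fd is the lower-right quarter of a filter whose upper-left corner is at `(left, top)` and whose quarters measure `half_w` by `half_h`; the function name and the half-open boundaries are likewise assumptions.

```python
from typing import List, Sequence, Tuple

Point = Tuple[int, int]  # (pitch, overtone_ratio); layout assumed here


def points_in_fd(points: Sequence[Point], left: int, top: int,
                 half_w: int, half_h: int) -> List[Point]:
    """Return the points inside the lower-right (positively weighted) region."""
    return [
        (pitch, ratio)
        for pitch, ratio in points
        if left + half_w <= pitch < left + 2 * half_w   # right half of the filter
        and half_h <= top - ratio < 2 * half_h          # lower half of the filter
    ]


data = [(300, 3), (750, 2), (800, 1), (850, 2), (800, 3), (900, 1)]
detected = points_in_fd(data, left=500, top=5, half_w=200, half_h=2)
print(detected)  # prints [(750, 2), (850, 2), (800, 3)]
```

Every point returned is treated as falsetto regardless of how the sound was actually produced, which matches the broad definition of the term given above.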

Following step S122, the control unit 10 associates the falsetto-derived voice information data with the GM data (step S124). The association is made as follows. As described above, each piece of voice information data stores, in association with one another, the overtone ratio, the pitch of the user singing voice data at the time the overtone ratio was calculated, and the time elapsed from the start of the karaoke song to the moment at which the overtone ratio was calculated at the predetermined sampling period. Using this, the control unit 10 obtains the elapsed time from each piece of voice information data detected as falsetto and associates that piece with the GM data at the timing corresponding to the elapsed time.
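The association in step S124 can be pictured as a lookup of the guide-melody note whose time span contains the elapsed time of a data point. The sketch below is illustrative only: the `VoiceInfo` and `GuideNote` classes, their field names, and the note timings are hypothetical and are not taken from the GM data format.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class VoiceInfo:
    overtone_ratio: float
    pitch_cents: float
    elapsed_s: float  # time since the start of the song


@dataclass
class GuideNote:
    name: str
    start_s: float
    end_s: float
    falsetto_flag: bool


def note_at(notes: List[GuideNote], elapsed_s: float) -> Optional[GuideNote]:
    """Return the guide note sounding at elapsed_s, or None during silence."""
    for note in notes:
        if note.start_s <= elapsed_s < note.end_s:
            return note
    return None


# Hypothetical timings: three notes followed by silence.
notes = [
    GuideNote("GM1", 0.0, 1.0, False),
    GuideNote("GM2", 1.0, 2.0, False),
    GuideNote("GM3", 2.0, 3.5, True),
]
g = VoiceInfo(overtone_ratio=0.2, pitch_cents=6700.0, elapsed_s=2.4)
print(note_at(notes, g.elapsed_s).name)  # prints GM3
```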

FIG. 7 schematically shows the correspondence between the detected falsetto and the guide melody. In FIG. 7, the horizontal axis represents time, which advances from left to right, and the vertical axis represents pitch, which rises from bottom to top. One division of the vertical axis corresponds to a pitch interval of 200 cents (a whole tone). Thus, in FIG. 7, the division one step above the division for pitch "A4" represents "B4", and the division one step below it represents "G4".
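For readers unfamiliar with cents: 1200 cents make an octave, so a shift of c cents multiplies the frequency by 2^(c/1200). The short sketch below illustrates what a 200-cent division means in hertz. It assumes A4 = 440 Hz, a common convention that the text itself does not specify, and the function name is chosen here for illustration.

```python
def shift_by_cents(frequency_hz: float, cents: float) -> float:
    """Frequency obtained by moving frequency_hz up (or down) by `cents`."""
    return frequency_hz * 2.0 ** (cents / 1200.0)


A4_HZ = 440.0  # assumed reference pitch

one_up = shift_by_cents(A4_HZ, 200)     # a whole tone above A4, about 493.88 Hz (B4)
one_down = shift_by_cents(A4_HZ, -200)  # a whole tone below A4, about 392.00 Hz (G4)
print(round(one_up, 2), round(one_down, 2))
```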

In FIG. 7, the regions GM1 to GM3 and GM5 to GM7 represent the guide melody, whose pitches are based on the GM data. In the period shown in FIG. 7, for example, a note at pitch A4 lasts for the period T1, followed by a note at D4 for the period T2 and a note at G4 for the period T3, after which there is silence for the period T4, and so on. The guide melodies GM1, GM2, and GM4, drawn with a lattice pattern in FIG. 7, are assumed to be set to be sung in chest voice (falsetto flag "OFF"), and the guide melodies GM3, GM6, and GM7, drawn with diagonal stripes, are assumed to be set to be sung in falsetto (falsetto flag "ON"). The solid line 300 represents the pitch of the user's voice during singing, as given by the user singing voice data described above; it is referred to below as the user singing voice curve 300.

Returning to FIG. 2, following step S124, the control unit 10 determines, for the GM data whose falsetto flag is "ON", whether the user sang in falsetto (step S126). In FIG. 7, for example, an excerpt of the voice distribution table is shown at the bottom, and within region Fd, which was detected as falsetto, the points corresponding to the pieces of voice information data are drawn as black circles. Suppose that the guide melody at the timing corresponding to the elapsed time contained in a piece of voice information data g is guide melody GM3, and that the guide melody at the timing corresponding to the elapsed time contained in a piece of voice information data h is guide melody GM7. This association between voice information data and guide melodies was made in step S124, as described above.

Because the falsetto flag of guide melody GM3 is "ON", the control unit 10 determines that the user sang in falsetto at a timing where falsetto was called for. For guide melody GM6, whose falsetto flag is also "ON", no falsetto-derived voice information data has been associated, so the control unit 10 determines that the user did not sing in falsetto at a timing where falsetto was called for. The control unit 10 can use these determinations for scoring. For example, since the karaoke apparatus 100 uses a deduction-based scoring scheme, the control unit 10 deducts points at the timing of guide melody GM6 in the case above. Stated another way, when the user sings in falsetto where falsetto is called for, the control unit 10 gives a higher evaluation than when the user does not. To compute the evaluation result used for the deduction, the control unit 10 may analyze the user's singing voice with any of various known techniques, such as frequency analysis using an FFT (Fast Fourier Transform) or volume analysis, and compute an evaluation result for predetermined evaluation items; alternatively, it may simply deduct a predetermined number of points whenever falsetto is not sung at a timing where the falsetto flag is "ON".
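The simplest variant mentioned above, a fixed deduction for each falsetto-flagged note that has no falsetto data associated with it, can be sketched as follows. The note timings, the penalty of 5 points, and the function name are hypothetical values chosen for illustration.

```python
from typing import List, Sequence, Tuple

GuideNote = Tuple[str, float, float, bool]  # (name, start_s, end_s, falsetto_flag)


def falsetto_deduction(notes: Sequence[GuideNote],
                       falsetto_times: Sequence[float],
                       penalty: int) -> Tuple[List[str], int]:
    """Return the flagged notes sung without falsetto and the total deduction."""
    missed = [
        name
        for name, start, end, flag in notes
        if flag and not any(start <= t < end for t in falsetto_times)
    ]
    return missed, penalty * len(missed)


# Hypothetical timings for four guide-melody notes.
notes = [
    ("GM3", 2.0, 3.5, True),
    ("GM5", 4.0, 5.0, False),
    ("GM6", 5.0, 6.0, True),
    ("GM7", 6.0, 7.0, True),
]
falsetto_times = [2.4, 6.3]  # elapsed times of falsetto data g and h
missed, deduction = falsetto_deduction(notes, falsetto_times, penalty=5)
print(missed, deduction)  # prints ['GM6'] 5
```

Notes whose flag is "OFF" are ignored here; whether falsetto sung on such notes should also affect the score is not addressed by this sketch.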

Following step S126, the control unit 10 displays the determination result on the display unit 40 (step S128). FIG. 8 shows the determination result. As explained with the example of FIG. 7, a "○" mark is displayed for the periods T3 and T7, in which the falsetto flag is "ON", because the user sang in falsetto. A "△" mark is displayed for the period T6, in which the falsetto flag is also "ON", because the user sang in chest voice. After the user finishes singing, the control unit 10 displays these falsetto determination results on the display unit 40 together with the scoring result.

FIG. 9 is a block diagram showing the functional configuration of the control unit 10. As shown in FIG. 9, the control unit 10 functions as voice data acquisition means 11, calculation means 12, assignment means 13, and falsetto detection means 14. The falsetto detection means 14 comprises a filter 141, filter moving means 142, and adding means 143. The calculation means 12 periodically calculates, as the singing progresses, the overtone ratio of the voice represented by the user singing voice data and the pitch of that voice. The assignment means 13 assigns each set of overtone ratio and pitch calculated by the calculation means 12 to the corresponding coordinates in a coordinate system formed by a first axis representing the overtone ratio and a second axis representing the pitch. Within the area to which the assignment means 13 has assigned the sets, the falsetto detection means 14 detects the voice data corresponding to the sets assigned to the sub-area with relatively low overtone ratio and high pitch as voice data representing falsetto.
The filter 141 is applied to the coordinate system and has a plus region, which carries a positive weight, and minus regions, which carry negative weights and lie on the higher-overtone-ratio side or the lower-pitch side of the plus region. The filter moving means 142 moves the filter within the range of the coordinate system in which the overtone ratio is lower than a predetermined first reference value and the pitch is higher than a predetermined second reference value. Each time the filter is moved by the filter moving means 142, the adding means 143 adds the negative value obtained by applying the negative weights to the number of sets contained in the minus regions of the filter to the positive value obtained by applying the positive weight to the number of sets contained in the plus region.

Thus, according to the present invention, missed detections can be reduced when detecting falsetto from a singer's singing voice. Moreover, because the present invention does not detect falsetto by comparison with a data group prepared in advance, no such data group needs to be prepared, and the invention avoids the problem in which voice information data created from the user's falsetto fails to match the data group, so that falsetto that should be detected is missed. Furthermore, in the present embodiment, filtering is performed on the two-axis coordinate system using a filter made up of multiple regions, each given a positive or negative weight. Because the filter consists of negatively weighted regions and a positively weighted region, voice information data derived from chest voice tends to produce a negative calculated value in the voice distribution table, and voice information data derived from falsetto tends to produce a positive one.
The two kinds of data are therefore easily separated, which reduces the cases in which falsetto-derived voice information data is mistaken for chest-voice data and escapes detection.

<Modifications>
The above embodiment can be modified as follows. The following modifications may also be combined as appropriate.

<Modification 1>
The widths of the rectangles that form the regions of the filter F need not be fixed as described in the embodiment; the control unit 10 may be allowed to correct them. The correction may be carried out as follows. For example, the karaoke apparatus 100 allows the user to enter, via the operation unit 30, which user will sing when a karaoke song is reserved. Taking user "A" as an example, the control unit 10 creates voice information data from the user singing voice data based on A's singing. The control unit 10 creates voice information data each time a song for which user "A" was entered as the singer is played.

As the control unit 10 repeatedly creates voice information data belonging to user "A" in this way, A's voice information data becomes progressively more accurate. Compared with the case in which A has sung only a single song with few falsetto passages, the amount of voice information data grows as data accumulates from songs with a fair number of falsetto passages and from songs sung in falsetto throughout. In general, statistical results become more accurate as the number of data points increases, so the voice information data accumulated in this way comes to represent the characteristics of A's voice more closely.

Based on the voice information data created as described above, the control unit 10 corrects the rectangle widths of the regions of the filter F. For example, the control unit 10 narrows or widens the spacing between the line segments drawn as solid lines in FIG. 5. As a result, the size of each of the four rectangles in FIG. 5 changes. This correction of the spacing is based on the accumulated voice information data and should be made so that more of the falsetto-derived voice information data falls within the positively weighted region.

The control unit 10 continues to apply this correction until the user's karaoke session ends. That is, the longer the user sings and the more voice information data accumulates, the more closely the settings are tuned to the characteristics of the user's voice. This allows the control unit 10 to detect falsetto more accurately for the individual user, and lets the user obtain a scoring result matched to his or her own voice.

<Modification 2>
The control unit 10 need not use the falsetto detection result only for scoring the user's singing; it may also use it as follows. In FIG. 4, for example, chest voice region a and falsetto region b are not far apart in the pitch direction. This means that the person whose voice information data is shown in FIG. 4 has few pitches between chest voice and falsetto that he or she cannot produce. FIGS. 10 and 11 show distribution tables of voice information data derived from chest voice and falsetto according to Modification 2. In FIG. 10, as is clear from comparison with FIG. 4, there is a certain distance in the pitch direction between chest voice region a2 and falsetto region b2. This means that in the singing of the person whose data is shown in FIG. 10 there is a certain range of pitches between chest voice and falsetto that cannot be produced. In other words, the shorter the distance in the pitch direction between chest voice region a2 and falsetto region b2, the wider and less interrupted the pitch range in which that user can sing.

In FIG. 11, by contrast, as is clear from comparison with FIG. 4, chest voice region a3 and falsetto region b3 overlap in the pitch direction. This means that in the singing of the person whose data is shown in FIG. 11 there is a certain range of pitches at which the user can use either chest voice or falsetto. In other words, the more negative the distance in the pitch direction between chest voice region a3 and falsetto region b3 (that is, the more the two regions overlap in the pitch direction), the greater that user's skill in switching between chest voice and falsetto.

Based on this idea, when the control unit 10 detects falsetto in step S122, it may determine the user's skill at switching between chest voice and falsetto and display the determination result on the display unit 40. One possible determination method is as follows. In the voice distribution table, the control unit 10 takes the pitch of the leftmost voice information data in the chest voice region a as the lowest pitch that can be sung in chest voice, and the pitch of the rightmost voice information data in the chest voice region a as the highest pitch that can be sung in chest voice. Likewise, the control unit 10 takes the pitch of the leftmost voice information data in the region detected as falsetto (the region assigned positive weighting at the detection position of the filter F) as the lowest pitch that can be sung in falsetto, and the pitch of the rightmost voice information data in that region as the highest pitch that can be sung in falsetto. Furthermore, when the result is "highest pitch that can be sung in chest voice > lowest pitch that can be sung in falsetto", the control unit 10 determines that the owner of this voice information data can sing in a pitch range in which chest voice and falsetto can be used interchangeably.
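The determination described above can be sketched in Python as follows. This is only an illustrative sketch, not the implementation of the control unit 10: the function name `register_ranges`, the dictionary keys, and the use of MIDI note numbers for pitch are assumptions made for the example.

```python
def register_ranges(chest_pitches, falsetto_pitches):
    """Derive singable ranges from the pitches found in each region.

    Pitches are assumed to be MIDI note numbers (a hypothetical choice);
    both lists must be non-empty.
    """
    # Leftmost / rightmost data in each region give the lowest / highest pitch.
    chest_low, chest_high = min(chest_pitches), max(chest_pitches)
    falsetto_low, falsetto_high = min(falsetto_pitches), max(falsetto_pitches)

    # Distance between the two regions in the pitch direction.
    # A negative value means the regions overlap.
    distance = falsetto_low - chest_high

    # Condition from the text: highest chest pitch > lowest falsetto pitch.
    if chest_high > falsetto_low:
        shared = (falsetto_low, min(chest_high, falsetto_high))
    else:
        shared = None

    return {
        "chest": (chest_low, chest_high),
        "falsetto": (falsetto_low, falsetto_high),
        "distance": distance,
        "shared": shared,
    }
```

With this sketch, a singer whose chest voice data spans 48 to 69 and whose falsetto data spans 67 to 72 gets a negative distance and a shared range, while a singer with no overlap gets `shared = None`.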

The control unit 10 causes the display unit 40 to display the determination result. As for the display method, the control unit 10 may, for example, display on the display unit 40 a message such as "You can sing in chest voice from C3 to E4 and in falsetto from G4 to A5." or "You can sing in chest voice from C3 to A4 and in falsetto from G4 to C5. In the range from G4 to A4, you can use either chest voice or falsetto." Alternatively, the control unit 10 may display on the display unit 40 the range in which the user can sing as an image in which chest voice and falsetto are visually distinguishable. In this way, the user can recognize, for his or her own singing, the ranges that can be sung in chest voice and in falsetto and the range in which the two can be used interchangeably. This allows multiple users singing karaoke together to enjoy learning each other's vocal ranges as a scoring result. In addition, a user who wants to practice singing alone, for example, can learn from the determination result, for each performance, the difference between the chest voice and falsetto ranges and the range in which the two can be used interchangeably, and can therefore practice with reference to the determination result.
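As a rough illustration of how such a message could be assembled, the following Python sketch converts pitch numbers to note names and builds the text. The MIDI convention in which note 60 is C4, the helper names `note_name` and `range_message`, and the exact wording are all assumptions for this example; the source does not specify how the message is generated.

```python
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def note_name(midi_note):
    """Convert a MIDI note number to a name such as 'C4' (assumes 60 = C4)."""
    return NOTE_NAMES[midi_note % 12] + str(midi_note // 12 - 1)


def range_message(chest, falsetto):
    """Build the display text from (low, high) pitch pairs for each register."""
    msg = (
        f"You can sing in chest voice from {note_name(chest[0])} to "
        f"{note_name(chest[1])} and in falsetto from "
        f"{note_name(falsetto[0])} to {note_name(falsetto[1])}."
    )
    # Add the second sentence only when the two ranges overlap.
    if chest[1] > falsetto[0]:
        shared_high = min(chest[1], falsetto[1])
        msg += (
            f" In the range from {note_name(falsetto[0])} to "
            f"{note_name(shared_high)}, you can use either chest voice or falsetto."
        )
    return msg
```

Calling `range_message((48, 69), (67, 72))` produces a two-sentence message corresponding to the second example above, and `range_message((48, 64), (67, 81))` produces only the first sentence.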

<Modification 3>
In the embodiment, the filter F is a combination of four rectangular regions, but the shape of the filter F is not limited to this. FIGS. 12 and 13 show filters applied to the voice distribution table according to Modification 3. In the filter F2 in the example of FIG. 12, the region assigned positive weighting in the voice distribution table is pentagonal; compared with FIG. 6, the part with low pitch and high harmonic ratio is missing. A region assigned negative weighting is provided outside the positively weighted region so as to enclose it. In FIG. 12, the negatively weighted region is divided into three, and a weight is assigned to each of the divided regions. Alternatively, the negatively weighted region need not be divided, and a single weight may be assigned to the whole of it. The filter F2 may be moved by the same method as in the embodiment. In this case, the midpoint of the oblique line segment formed at the upper left corner of the filter F2 takes the place of the upper left corner of the filter F in the embodiment.

In the filter F3 in the example of FIG. 13, the region assigned positive weighting in the voice distribution table is a triangle whose right angle is at the lower right, that is, at the vertex with the highest pitch and the lowest harmonic ratio. A region assigned negative weighting is provided outside the positively weighted region so as to enclose it. In this case, the control unit 10 obtains the centroid of the whole triangle constituting the filter F3 and stores it in the RAM. The filter F3 is then moved with this centroid taking the place of the upper left corner of the filter F in the embodiment.

The filters F2 and F3 according to Modification 3 also have a positively weighted region and a negatively weighted region. In short, in the voice distribution table represented by a two-axis coordinate system, it suffices that the filter consists of two kinds of regions: a positively weighted region covering the range to which voice information data based on falsetto is likely to be assigned, and a negatively weighted region provided so as to enclose the positively weighted region. The negatively weighted region should be placed so that it covers a range in which voice information data based on falsetto is unlikely to be distributed. Using a filter according to Modification 3 in this way provides the same effect as the embodiment.
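The idea of a filter with positively and negatively weighted cells of arbitrary shape can be sketched in Python as below. This is a simplified illustration under stated assumptions, not the patent's implementation: the voice distribution table is assumed to be a grid of counts (rows are harmonic ratio bins with the highest ratio at the top, columns are pitch bins), the filter is a grid of weights, and the sketch scans every position rather than a restricted movement range. The names `filter_score` and `best_position` are hypothetical.

```python
def filter_score(counts, weights, top, left):
    """Sum of weight * count over all filter cells placed at (top, left)."""
    total = 0
    for dr, row in enumerate(weights):
        for dc, w in enumerate(row):
            total += w * counts[top + dr][left + dc]
    return total


def best_position(counts, weights):
    """Return (score, top, left) of the placement with the highest score."""
    rows, cols = len(counts), len(counts[0])
    f_rows, f_cols = len(weights), len(weights[0])
    best = None
    for top in range(rows - f_rows + 1):
        for left in range(cols - f_cols + 1):
            score = filter_score(counts, weights, top, left)
            if best is None or score > best[0]:
                best = (score, top, left)
    return best


# A small F3-like filter: a positive triangle with its right angle at the
# lower right (high pitch, low harmonic ratio), enclosed by negative cells.
TRIANGLE_FILTER = [
    [-1, -1, -1, -1, -1],
    [-1, -1, -1,  1, -1],
    [-1, -1,  1,  1, -1],
    [-1,  1,  1,  1, -1],
    [-1, -1, -1, -1, -1],
]
```

Because the shape is encoded only in the weight grid, a pentagonal filter such as F2 could be expressed the same way by changing which cells carry positive weights.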

<Modification 4>
The method of detecting falsetto from the group of voice information data is not limited to the filter-based method of the embodiment. FIG. 14 is a flowchart of the falsetto detection processing according to Modification 4. In FIG. 14, the processing performed by the control unit 10 up to step S112 is the same as in FIG. 2. After step S112, the control unit 10 moves a distribution number acquisition reference line in the harmonic ratio direction (step S114b); this is the reference line used to acquire the number of pieces of voice information data distributed in the pitch direction of the voice distribution table (referred to as the distribution number). The control unit 10 then generates a voice information data number distribution line representing the distribution of the number of pieces of voice information data in the pitch direction (step S116b).

FIG. 15 shows distribution number acquisition reference lines according to Modification 4. In FIG. 15, the vertical and horizontal axes are the same as in FIG. 4. The distribution number acquisition reference lines L1 to L5, drawn as broken lines, are examples of the distribution number acquisition reference line. FIGS. 16a to 16e show the distribution of the voice information data in Modification 4. In FIGS. 16a to 16e, the vertical axis represents the number of pieces of voice information data, increasing from bottom to top, and the horizontal axis represents pitch, increasing from left to right. The voice information data number distribution lines M1 to M5 in FIGS. 16a to 16e correspond to the distribution number acquisition reference lines L1 to L5 in FIG. 15, respectively. That is, the voice information data number distribution line M1 represents the number of pieces of voice information data in the pitch direction obtained along the distribution number acquisition reference line L1 in FIG. 15.

In step S114b of FIG. 14, the control unit 10 moves the distribution number acquisition reference line by a predetermined width (for example, one unit of power) from the low harmonic ratio side toward the high harmonic ratio side. In step S116b, the control unit 10 acquires the distribution number, that is, the number of pieces of voice information data distributed in the pitch direction of the voice distribution table, along the distribution number acquisition reference line located at the current harmonic ratio value, and generates a voice information data number distribution line from the acquired numbers. The control unit 10 thus functions as distribution specifying means that specifies, for each harmonic ratio in the harmonic ratio direction of the voice distribution table, the distribution of pitches contained in the voice information data. As shown in FIG. 16a, the voice information data number distribution line M1 corresponding to the distribution number acquisition reference line L1 has one peak. As FIG. 15 shows, this peak represents the number of pieces of voice information data based on falsetto. The voice information data number distribution lines M2 to M4 each have two peaks. As FIG. 15 shows, these peaks represent the number of pieces of voice information data based on chest voice and the number based on falsetto. Furthermore, the voice information data number distribution line M5 corresponding to the distribution number acquisition reference line L5 has one peak. As FIG. 15 shows, this peak represents the number of pieces of voice information data based on chest voice.
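The counting and peak-finding steps can be sketched in Python as follows. This is a minimal sketch under assumptions not stated in the source: each piece of voice information data is represented as a `(pitch_bin, ratio_bin)` pair of integer bin indices, a peak is taken to be a local maximum of the per-pitch counts (a flat top is counted once), and the names `count_distribution`, `find_peaks`, and `min_count` are hypothetical.

```python
def count_distribution(points, ratio_bin, n_pitch_bins):
    """Count the points lying on one reference line, per pitch bin.

    points: iterable of (pitch_bin, ratio_bin) pairs.
    ratio_bin: position of the reference line in the harmonic ratio direction.
    """
    counts = [0] * n_pitch_bins
    for pitch_bin, r_bin in points:
        if r_bin == ratio_bin:
            counts[pitch_bin] += 1
    return counts


def find_peaks(counts, min_count=1):
    """Return pitch-bin indices of local maxima in a count distribution."""
    peaks = []
    n = len(counts)
    i = 0
    while i < n:
        # Extend j over a run of equal values so a plateau counts once.
        j = i
        while j + 1 < n and counts[j + 1] == counts[i]:
            j += 1
        left = counts[i - 1] if i > 0 else -1
        right = counts[j + 1] if j + 1 < n else -1
        if counts[i] >= min_count and counts[i] > left and counts[i] > right:
            peaks.append((i + j) // 2)
        i = j + 1
    return peaks
```

A distribution with two separated humps, like M2 to M4, yields two indices from `find_peaks`, while a single hump, like M1 or M5, yields one. Real data would likely need smoothing before peak detection, which this sketch omits.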

FIG. 15 and FIGS. 16a to 16e show that the number of peaks in the voice information data number distribution line goes from one (on the right side in the pitch direction in FIG. 16a) to two and then back to one (on the left side in the pitch direction in FIG. 16e), and that at the point where it returns to one, the control unit 10 no longer acquires voice information data based on falsetto along the distribution number acquisition reference line. After step S116b in FIG. 14, the control unit 10 determines whether the number of peaks in the voice information data number distribution line has changed from two to one (step S118b). While the determination in step S118b is No, the control unit 10 repeats the processing of steps S114b and S116b. When the control unit 10 determines that the number of peaks in the voice information data number distribution line has changed from two to one (step S118b; Yes), it determines the region in which voice information data based on falsetto exists (step S120b).

The determination in step S120b is made, for example, as follows. First, in step S116b, when a single peak first appears at a position higher than a predetermined reference in the pitch direction of the voice information data number distribution line, the control unit 10 stores in the RAM the value in the harmonic ratio direction (referred to as the first value) of the distribution number acquisition reference line corresponding to that distribution line. The predetermined reference may be determined experimentally when designing the karaoke apparatus 100, from the result of assigning to the voice distribution table multiple pieces of voice information data created from the singing voices of many unspecified users, so that falsetto can be detected for as many users as possible. Then, in step S118b, the control unit 10 stores in the RAM the value in the harmonic ratio direction (referred to as the second value) of the distribution number acquisition reference line corresponding to the immediately preceding voice information data number distribution line.

Based on the first value and the second value obtained in this way and on the predetermined reference in the pitch direction, the control unit 10 determines the region in which voice information data based on falsetto exists (step S120b). FIG. 17 illustrates the processing for determining the falsetto region according to Modification 4. In FIG. 17, the vertical axis represents the harmonic ratio, increasing from bottom to top, and the horizontal axis represents pitch, increasing from left to right. The distribution number acquisition reference line L101, drawn as a broken line, has the first value as its harmonic ratio, and the distribution number acquisition reference line L102 has the second value as its harmonic ratio. The solid line S is the predetermined reference in the pitch direction (hereinafter referred to as the pitch reference S). The chest voice region a4 is an example of the region occupied when voice information data based on the singer's chest voice singing is assigned to the voice distribution table. The falsetto region b4 is an example of the region occupied when voice information data based on the singer's falsetto singing is assigned to the voice distribution table.

As shown in FIG. 17, the falsetto region b4 lies within the region enclosed by the distribution number acquisition reference line L101, the distribution number acquisition reference line L102, and the pitch reference S. In step S120b of FIG. 14, the control unit 10 determines that the region enclosed by the distribution number acquisition reference line L101, the distribution number acquisition reference line L102, and the pitch reference S is the region in which voice information data based on falsetto exists. The control unit 10 then detects the voice information data contained in the region determined in step S120b as voice information data based on falsetto (step S122b). In other words, based on the pitch distribution specified for each harmonic ratio, the control unit 10 detects, as voice data representing falsetto, the voice data corresponding to the voice information data in the range of harmonic ratios over which a local maximum of the distribution appears at a pitch equal to or higher than a predetermined reference value (the pitch reference S). The subsequent processing is the same as in the embodiment. The method according to Modification 4 thus provides the same effect as the embodiment. In Modification 4, falsetto is detected based on the positions and the number of peaks in the number of pieces of voice information data distributed in the pitch direction, obtained while varying the value in the harmonic ratio direction in the voice distribution table. Because Modification 4 determines, as the falsetto detection region, a region that is bounded by values derived from peaks in the distribution of the number of pieces of voice information data and that contains many pieces of voice information data, it reduces the chance that voice information data based on falsetto is missed.
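The region determination of steps S118b to S122b can be sketched in Python as below. This is an illustrative sketch only: it assumes the peaks have already been found for each reference line position (ordered from low to high harmonic ratio), that positions and pitches are integer bin indices, and that "at or above the pitch reference" is the intended comparison. The names `falsetto_ratio_range` and `select_falsetto` are hypothetical.

```python
def falsetto_ratio_range(peaks_by_ratio, pitch_reference):
    """Find the (first value, second value) pair in the harmonic ratio direction.

    peaks_by_ratio[k] is the list of peak pitch bins found on the reference
    line at ratio bin k, scanning from low to high harmonic ratio.
    Returns None if the pattern (one peak, two peaks, one peak) is not found.
    """
    first = None
    prev_count = 0
    for ratio_bin, peaks in enumerate(peaks_by_ratio):
        if first is None:
            # First value: a single peak first appears at or above the reference.
            if len(peaks) == 1 and peaks[0] >= pitch_reference:
                first = ratio_bin
        elif prev_count == 2 and len(peaks) == 1:
            # Second value: the line just before the peaks go from two to one.
            return first, ratio_bin - 1
        prev_count = len(peaks)
    return None


def select_falsetto(points, ratio_range, pitch_reference):
    """Keep the (pitch_bin, ratio_bin) points inside the determined region."""
    first, second = ratio_range
    return [
        (pitch, ratio)
        for pitch, ratio in points
        if first <= ratio <= second and pitch >= pitch_reference
    ]
```

For example, with peaks `[[], [8], [3, 8], [3, 8], [3]]` and a pitch reference of 6, the sketch returns the range `(1, 3)`, and `select_falsetto` then keeps only the points in that ratio range whose pitch is at or above the reference.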

<Modification 5>
Besides the singing evaluation apparatus, the present invention can also be understood as a method for realizing these functions, or as a program for causing a computer to realize the voice evaluation function. Such a program may be provided in the form of a recording medium, such as an optical disc, on which it is stored, or may be provided by having it downloaded to a computer via the Internet or the like and installed for use.

DESCRIPTION OF SYMBOLS 10 ... control unit, 20 ... storage unit, 21 ... accompaniment data storage area, 22 ... video data storage area, 23 ... GM data storage area, 24 ... user singing voice data storage area, 30 ... operation unit, 40 ... display unit, 50 ... communication control unit, 60 ... voice processing unit, 61 ... microphone, 62 ... speaker, 70 ... bus, 100 ... karaoke apparatus, 300 ... user singing voice curve, a to a4 ... chest voice regions, b to b4 ... falsetto regions, g, h ... voice information data, A1 ... power of the fundamental frequency, A2, A3 ... power of the harmonic frequencies, F to F3 ... filters, Fa to Fd ... regions, GM1 to GM3, GM5 to GM7 ... guide melodies, L1 to L5, L101, L102 ... distribution number acquisition reference lines, M1 to M5 ... voice information data number distribution lines, MR ... movement range, S ... pitch reference

Claims

A falsetto detection device comprising:
voice data acquisition means for acquiring voice data representing the voice of a singer singing;
calculation means for calculating, based on the voice data acquired by the voice data acquisition means, the harmonic ratio and the pitch of the voice represented by the voice data, the harmonic ratio being the ratio of the harmonic frequencies to the fundamental frequency;
assignment means for assigning, in a coordinate system composed of a first axis representing the harmonic ratio and a second axis representing the pitch, each set of a harmonic ratio and a pitch calculated by the calculation means to the coordinates corresponding to that harmonic ratio and pitch; and
falsetto detection means for detecting, as voice data representing falsetto, the voice data representing the singing voice of the portion corresponding to the sets assigned to a part, where the harmonic ratio is relatively low and the pitch is high, of the region to which the plurality of sets have been assigned by the assignment means,
wherein the falsetto detection means includes:
a filter that is moved in the coordinate system and has a positive region, which is a positively weighted region, and a negative region, which is a negatively weighted region; and
addition means for adding, each time the filter is moved within a range of the coordinate system in which the harmonic ratio is lower than a predetermined first reference value and the pitch is higher than a predetermined second reference value, a positive calculated value obtained by applying the positive weighting to the number of sets included in the positive region and a negative calculated value obtained by applying the negative weighting to the number of sets included in the negative region,
and wherein the falsetto detection means detects the voice data representing falsetto based on the addition result of the addition means.

A falsetto detection device comprising:
voice data acquisition means for acquiring voice data representing the voice of a singer singing;
calculation means for calculating, based on the voice data acquired by the voice data acquisition means, the harmonic ratio and the pitch of the voice represented by the voice data, the harmonic ratio being the ratio of the harmonic frequencies to the fundamental frequency;
assignment means for assigning, in a coordinate system composed of a first axis representing the harmonic ratio and a second axis representing the pitch, each set of a harmonic ratio and a pitch calculated by the calculation means to the coordinates corresponding to that harmonic ratio and pitch; and
falsetto detection means for detecting, as voice data representing falsetto, the voice data representing the singing voice of the portion corresponding to the sets assigned to a part, where the harmonic ratio is relatively low and the pitch is high, of the region to which the plurality of sets have been assigned by the assignment means,
wherein the falsetto detection means detects, as voice data representing falsetto, the voice data corresponding to the sets in the range of harmonic ratios over which, in the distribution of pitches contained in the sets for each harmonic ratio on the first axis, a local maximum appears at a pitch equal to or higher than a predetermined reference value.

A singing evaluation device comprising:
the falsetto detection device according to claim 1 or 2;
reference voice data acquisition means for acquiring reference voice data that represents each constituent sound of a song to be sung and in which a falsetto flag is attached to each constituent sound that is to be sung in falsetto; and
evaluation means for comparing the voice represented by the voice data acquired by the voice data acquisition means of the falsetto detection device with the corresponding constituent sound represented by the reference voice data acquired by the reference voice data acquisition means, and for evaluating the singer's singing according to the result of the comparison, the evaluation means giving a higher evaluation when the voice data representing falsetto detected by the falsetto detection means of the falsetto detection device corresponds to a constituent sound to which the falsetto flag is attached in the reference voice data than when it does not.