WO2007088820A1 - Karaoke machine and sound processing method - Google Patents

Karaoke machine and sound processing method Download PDF

Info

Publication number
WO2007088820A1
Authority
WO
WIPO (PCT)
Prior art keywords
voice
singing
lyrics
model
section
Prior art date
Application number
PCT/JP2007/051413
Other languages
French (fr)
Japanese (ja)
Inventor
Akane Noguchi
Original Assignee
Yamaha Corporation
Priority date
Filing date
Publication date
Application filed by Yamaha Corporation
Publication of WO2007088820A1 publication Critical patent/WO2007088820A1/en

Links

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H - ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H1/00 - Details of electrophonic musical instruments
    • G10H1/36 - Accompaniment arrangements
    • G10H1/361 - Recording/reproducing of accompaniment for use with an external source, e.g. karaoke systems
    • G10H1/363 - Recording/reproducing of accompaniment for use with an external source, e.g. karaoke systems, using optical disks, e.g. CD, CD-ROM, to store accompaniment information in digital form
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H - ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H1/00 - Details of electrophonic musical instruments
    • G10H1/36 - Accompaniment arrangements
    • G10H1/361 - Recording/reproducing of accompaniment for use with an external source, e.g. karaoke systems
    • G10H1/366 - Recording/reproducing of accompaniment for use with an external source, e.g. karaoke systems, with means for modifying or correcting the external signal, e.g. pitch correction, reverberation, changing a singer's voice
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H - ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2210/00 - Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
    • G10H2210/031 - Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal
    • G10H2210/046 - Musical analysis for differentiation between music and non-music signals, based on the identification of musical parameters, e.g. based on tempo detection
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H - ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2210/00 - Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
    • G10H2210/031 - Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal
    • G10H2210/066 - Musical analysis for pitch analysis as part of wider processing for musical purposes, e.g. transcription, musical performance evaluation; Pitch recognition, e.g. in polyphonic sounds; Estimation or use of missing fundamental
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H - ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2210/00 - Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
    • G10H2210/031 - Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal
    • G10H2210/091 - Musical analysis for performance evaluation, i.e. judging, grading or scoring the musical qualities or faithfulness of a performance, e.g. with respect to pitch, tempo or other timings of a reference performance
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/15 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being formant information

Definitions

  • The present invention relates to a technique for scoring a singer's singing ability.
  • Some karaoke devices that perform automatic accompaniment based on music data analyze the singer's voice input to a microphone and score the singer's singing ability.
  • The karaoke apparatus disclosed in Patent Document 1 recognizes the words in a singer's voice input to a microphone and evaluates how closely they match the lyrics of the song. With this karaoke apparatus, it is possible to evaluate whether or not the singer remembers the lyrics correctly.
  • Patent Document 1: JP-A-10-91172
  • The present invention has been made against this background, and its purpose is to make it possible to evaluate whether a singer remembers the lyrics correctly without complicating the system.
  • To solve this problem, the present invention provides a karaoke apparatus having: a storage unit that stores model voice data representing a model voice produced when a song is sung exactly according to its lyrics; a voice input unit to which a singer's singing voice is input; a specifying unit that divides the model voice represented by the model voice data into a plurality of model voice sections and specifies, in the singing voice input to the voice input unit, the singing voice section corresponding to each of the divided model voice sections; an evaluation unit that performs an evaluation by comparing the singing voice of each singing voice section specified by the specifying unit with the model voice of the corresponding model voice section; and a display unit that displays the evaluation result of the evaluation unit.
  • The evaluation unit may obtain a degree of coincidence between the singing voice of the singing voice section specified by the specifying unit and the model voice of the model voice section corresponding to that singing voice section, and perform the evaluation based on the obtained degree of coincidence.
  • The storage unit may store lyrics data representing the lyrics of the song, and the apparatus may have a lyrics specifying unit that, when the degree of coincidence obtained by the evaluation unit is less than a predetermined value, specifies the lyrics corresponding to the model voice section whose degree of coincidence fell below the predetermined value from among the lyrics represented by the lyrics data; the display unit may then display the specified lyrics.
  • The evaluation unit may obtain the degree of coincidence between the formant frequencies of the singing voice and the formant frequencies of the model voice.
  • The evaluation unit may obtain a first spectrum envelope from the voice waveform of the singing voice section specified by the specifying unit, obtain a second spectrum envelope from the voice waveform of the model voice section corresponding to that singing voice section, extract the formant frequencies of the singing voice from the first spectrum envelope, and extract the formant frequencies of the model voice from the second spectrum envelope.
  • The evaluation unit may obtain a degree of coincidence between the voice waveform of the singing voice section specified by the specifying unit and the voice waveform of the model voice of the corresponding model voice section, and perform the evaluation based on the obtained degree of coincidence.
  • The present invention also provides a voice processing method that divides the model voice represented by model voice data, that is, the voice produced when a song is sung exactly according to its lyrics, into a plurality of model voice sections, specifies, in the singer's singing voice input to a voice input unit, the singing voice section corresponding to each of the divided model voice sections, compares the singing voice of each specified singing voice section with the model voice of the corresponding model voice section, performs an evaluation based on the result of the comparison, and displays the result of the evaluation.
  • The comparison may obtain a degree of coincidence between the singing voice of the specified singing voice section and the model voice of the corresponding model voice section, and the evaluation may be performed based on the obtained degree of coincidence.
  • The voice processing method may further specify, when the obtained degree of coincidence is less than a predetermined value, the lyrics of the song corresponding to the model voice section whose degree of coincidence fell below the predetermined value from among the lyrics represented by lyrics data, and the specified lyrics may be displayed on a display unit.
  • In the evaluation, the degree of coincidence between the formant frequencies of the singing voice and the formant frequencies of the model voice may be obtained.
  • The evaluation may obtain a first spectrum envelope from the voice waveform of the specified singing voice section, obtain a second spectrum envelope from the voice waveform of the corresponding model voice section, extract the formant frequencies of the singing voice from the first spectrum envelope, and extract the formant frequencies of the model voice from the second spectrum envelope.
  • The evaluation may obtain a degree of coincidence between the voice waveform of the singing voice of the specified singing voice section and the voice waveform of the model voice of the corresponding model voice section, and be performed based on the obtained degree of coincidence.
  • FIG. 1 is an external view of a karaoke apparatus according to an embodiment of the present invention.
  • FIG. 2 is a block diagram showing a hardware configuration of the karaoke apparatus according to the embodiment of the present invention.
  • FIG. 3 is a diagram illustrating a format of music data in the embodiment of the present invention.
  • FIG. 4 is a diagram illustrating an example of a lyrics table format.
  • FIG. 5 is a diagram illustrating a waveform of a sample voice and a waveform of a singing voice.
  • FIG. 6 is a diagram showing the waveforms of the model voice and the singing voice divided into a plurality of frames.
  • FIG. 1 is an external view of a karaoke apparatus according to an embodiment of the present invention. As shown in the figure, a monitor 2, speakers 3L and 3R, and a microphone 4 are connected to the karaoke device 1. The karaoke device 1 is remotely operated by infrared signals transmitted from a remote control device 5.
  • FIG. 2 is a block diagram showing a hardware configuration of the karaoke apparatus 1.
  • A CPU (Central Processing Unit) 102 uses a RAM (Random Access Memory) 104 as a work area and controls each part of the karaoke device 1 by executing various programs stored in a ROM (Read Only Memory) 103.
  • the RAM 104 has a music storage area for temporarily storing music data.
  • the storage unit 105 includes a hard disk device, and stores various data such as music data described later and digital data of singing voice input from the microphone 4.
  • The communication unit 108 receives music data from a host computer (not shown) that distributes music data, via a communication network (not shown) such as the Internet, and transfers the received music data to the storage unit 105 under the control of the CPU 102.
  • In the present embodiment, the music data may instead be stored in the storage unit 105 in advance.
  • Alternatively, the karaoke device 1 may be provided with a reading device that reads recording media such as CD-ROMs and DVDs, and music data recorded on such media may be read by the reading device and transferred to the storage unit 105 for storage.
  • As shown in FIG. 3, the music data in the present embodiment includes a header, musical tone data in WAVE format representing the karaoke performance sound, model voice data in WAVE format representing the waveform of a model voice in which the lyrics of the song are sung correctly, and a lyrics table storing lyrics data representing the lyrics of the song.
  • FIG. 4 is a diagram illustrating a format of the lyrics table.
  • In the lyrics table, lyrics data representing the lyrics of the song to be played is stored in association with time interval data indicating the time interval during which those lyrics are sung in the model voice when the musical tones are output according to the tone data.
  • The lyrics data in the first row represents the lyrics "Kamereon ga", and the time interval data "01:00-01:02" associated with it indicates that, in the model voice, these lyrics are sung between 1 minute 0 seconds and 1 minute 2 seconds after the start of the performance.
  • The lyrics data in the second row represents the lyrics "Yattekita", and the time interval data "01:03-01:06" associated with it indicates that, in the model voice, these lyrics are sung between 1 minute 3 seconds and 1 minute 6 seconds after the start of the performance.
  • The microphone 4 converts the singer's singing voice into a voice signal and outputs it.
  • The voice signal output from the microphone 4 is input to a voice processing DSP (Digital Signal Processor) 111 and an amplifier 112.
  • The voice processing DSP 111 A/D-converts the input voice signal and generates singing voice data representing the singing voice.
  • This singing voice data is stored in the storage unit 105 and is compared with the model voice data to score the singer's singing ability.
  • The input unit 106 detects signals generated by input operations on the operation panel provided on the karaoke apparatus 1 or on the remote control apparatus 5, and outputs the detection results to the CPU 102.
  • The display control unit 107 displays video and the scoring result of the singer's singing ability on the monitor 2 under the control of the CPU 102.
  • The tone generator 109 generates a musical tone signal corresponding to the supplied musical tone data and outputs the generated signal to the effect DSP 110 as the karaoke performance sound.
  • The effect DSP 110 applies effects such as reverb and echo to the musical tone signal generated by the tone generator 109.
  • The musical tone signal with the effects applied is D/A-converted by the effect DSP 110 and output to the amplifier 112.
  • The amplifier 112 mixes and amplifies the musical tone signal output from the effect DSP 110 and the voice signal output from the microphone 4, and outputs the result to the speakers 3L and 3R. As a result, the melody of the song and the singer's voice are output from the speakers 3L and 3R.
  • the song data of the designated song is transferred from the storage unit 105 to the song storage area of the RAM 104 by the CPU 102.
  • The CPU 102 executes karaoke accompaniment processing by sequentially reading out the various data included in the music data stored in this music storage area.
  • CPU 102 reads out the musical sound data included in the music data, and outputs the read musical sound data to tone generator 109.
  • The tone generator 109 generates a musical tone signal of a predetermined timbre based on the supplied musical tone data and outputs the generated signal to the effect DSP 110.
  • The effect DSP 110 applies effects such as reverb and echo to the musical tone signal output from the tone generator 109.
  • The musical tone signal with the effects applied is D/A-converted by the effect DSP 110 and output to the amplifier 112.
  • the amplifier 112 amplifies the musical sound signal output from the effect DSP 110 and outputs it to the speakers 3L and 3R. As a result, the melody of the music is output from the speakers 3L and 3R.
  • When the CPU 102 supplies the music data to the tone generator 109 and output of the musical tones begins, it starts counting the time elapsed since the start of playback.
  • When the singer sings along with the playback of the song, the singer's voice is input to the microphone 4 and a voice signal is output from the microphone 4.
  • The voice processing DSP 111 A/D-converts the voice signal output from the microphone 4 to generate singing voice data representing the singing voice.
  • This singing voice data is stored in the storage unit 105.
  • The CPU 102 continues counting the elapsed time and searches the lyrics table for the time interval whose start time is the counted time (that is, a time interval that contains the counted time). It then reads out the retrieved time interval and the lyrics data stored in association with it. For example, when the counted elapsed time is 01:00, the time interval "01:00-01:02" and the lyrics data "Kamereon ga" in the first row of the lyrics table shown in FIG. 4 are read out.
  • When the CPU 102 has read a time interval, it compares the voice input to the microphone 4 during that time interval with the model voice in the same time interval and judges whether the singer sang the lyrics correctly. Specifically, the CPU 102 analyzes the voice represented by the model voice data and, as shown in FIG. 5, extracts the voice waveform A lying within the read time interval (01:00-01:02) on the time axis of the waveform represented by the model voice data. The CPU 102 also analyzes the stored singing voice data and, as shown in FIG. 5, extracts the voice waveform B lying within the read time interval on the time axis represented by the singing voice data.
  • The extracted voice waveform A is divided into a plurality of frames at a predetermined time interval (for example, 10 ms), as shown in FIG. 6(a).
  • The extracted voice waveform B is likewise divided into a plurality of frames at a predetermined time interval (for example, 10 ms), as shown in FIG. 6(b).
  • The CPU 102 associates the voice waveform of each frame of the model voice with the voice waveform of the corresponding frame or frames of the singing voice using DP (Dynamic Programming) matching.
  • For example, in the waveforms illustrated in FIG. 6, if the voice waveform of frame A1 of the model voice corresponds to the voice waveform of frame B1 of the singing voice, frame A1 and frame B1 are associated with each other.
  • If the voice waveform of frame A2 of the model voice corresponds to the voice waveforms from frame B2 to frame B3 of the singing voice, frame A2 is associated with frames B2 through B3.
  • The CPU 102 then compares the features of the voice waveforms of the corresponding frames. Specifically, the CPU 102 applies a Fourier transform to the voice waveform of each frame of the model voice, takes the logarithm of the resulting amplitude spectrum, and applies an inverse Fourier transform to it to generate a spectrum envelope for each frame. From the obtained spectrum envelope, the CPU 102 extracts the frequency f11 of the first formant, the frequency f12 of the second formant, and the frequency f13 of the third formant.
  • Likewise, the CPU 102 applies a Fourier transform to the voice waveform of each frame of the singer's voice associated with a frame of the model voice, takes the logarithm of the resulting amplitude spectrum, and applies an inverse Fourier transform to it to generate a spectrum envelope for each frame. From the obtained spectrum envelope, the CPU 102 extracts the frequency f21 of the first formant, the frequency f22 of the second formant, and the frequency f23 of the third formant.
  • For example, the CPU 102 generates the spectrum envelope of frame A1 of the model voice and extracts the formant frequencies f11 to f13 of the first to third formants from it. The CPU 102 then generates the spectrum envelope of the voice waveform of frame B1, which is associated with frame A1, and extracts the formant frequencies f21 to f23 of the first to third formants from it. Similarly, the CPU 102 generates the spectrum envelope of frame A2 of the model voice and extracts the formant frequencies f11 to f13 from it, then generates the spectrum envelope of the voice waveform from frame B2 to frame B3, which is associated with frame A2, and extracts the formant frequencies f21 to f23 from it.
  • Next, the CPU 102 compares the formant frequencies f11 to f13 extracted from each frame of the model voice with the formant frequencies f21 to f23 extracted from the frame or frames of the singer's voice associated with that frame. If, for the corresponding voice waveforms, the difference between formant frequency f11 and formant frequency f21, the difference between f12 and f22, and the difference between f13 and f23 are equal to or greater than a predetermined value, the CPU 102 adds mismatch information D, indicating that the formant frequencies do not match, to that frame of the model voice.
  • For example, if the formant frequencies f11 to f13 of the voice waveform of frame A1 match the formant frequencies of the voice waveform of frame B1, the CPU 102 judges that the voices of the corresponding frames match and does not add mismatch information D to frame A1.
  • On the other hand, if the differences between the formant frequencies f11 to f13 of frame A2 and the formant frequencies f21 to f23 of the voice waveform from frame B2 to frame B3 are equal to or greater than the predetermined value, mismatch information D indicating that the formant frequencies do not match is added to frame A2.
  • After judging, for the voice waveform of each frame of the model voice, whether the formant frequencies of the singer's voice waveform match those of the model voice waveform, the CPU 102 counts the number N of frames to which mismatch information D has been added. Next, the CPU 102 compares N with the total number M of frames into which the model voice data was divided. If N is equal to or greater than half of the total number M of frames, the CPU 102 determines that, for the lyrics represented by the read lyrics data, the lyrics pronounced by the singer differ from the lyrics of the model voice.
  • If N is less than half of the total number M of frames, the CPU 102 determines that the lyrics pronounced by the singer are the same as the lyrics of the model voice. For example, if the number N of mismatched frames for the voice "Kamereon ga" represented by the model voice data is less than half of the total number M of frames, the CPU 102 determines that the lyrics pronounced by the singer are the same as the lyrics of the model voice.
  • Although in the present embodiment the lyrics are judged to differ when N is half of M or more, the lyrics pronounced by the singer may instead be judged to differ from the lyrics of the model voice when the ratio of N to M is equal to or greater than a predetermined ratio other than 50%.
  • The CPU 102 continues counting the elapsed time in parallel with the comparison between the model voice and the singing voice.
  • When the counted elapsed time reaches 01:03, the CPU 102 reads out the time interval "01:03-01:06" and the lyrics data "Yattekita" in the second row of the lyrics table shown in FIG. 4.
  • When the singer sings during this read time interval as the song plays back, singing voice data is stored in the storage unit 105.
  • If, for example, the singer mistakes the lyrics and sings "Ittekuru", which differs from the lyrics "Yattekita" represented by the read lyrics data, singing voice data representing the voice "Ittekuru" is generated and stored in the storage unit 105.
  • In the same way as before, the CPU 102 divides the waveform of the voice input to the microphone 4 during this time interval and the waveform of the model voice in this time interval into a plurality of frames, associates the voice waveform of each frame of the model voice with the voice waveform of the corresponding frames of the singing voice, and compares the formant frequencies of the voice waveforms between the associated frames. After judging, for each frame of the model voice, whether the formant frequencies match and adding mismatch information D where they do not, the CPU 102 compares the total number M of frames of the divided model voice data with the number N of frames to which mismatch information has been added, and determines whether the singer sang the lyrics correctly.
  • the CPU 102 repeats the reading of the lyrics data and the model voice data and the determination of the correctness of the lyrics sung by the singer as the music is played back. Then, when all performance event data is read, the karaoke accompaniment process is terminated.
  • As described above, in the present embodiment it is possible to determine whether the singer has sung according to the lyrics without performing speech recognition against a dictionary. Because the determination only requires model voice data in which the lyrics are sung correctly, songs with lyrics in various languages can be handled without complicating the system, and it is possible to evaluate whether the singer has learned the lyrics correctly.
  • The pitch of the voice represented by the singing voice data may also be corrected before the comparison so that the pitch of the voice waveform represented by the singing voice data matches the pitch of the voice waveform represented by the model voice data.
  • Alternatively, the pitch variation of the voice waveform represented by the model voice data may be detected (for example, a singing technique in which a note is first sung below its nominal pitch and then brought up toward the original pitch), and the degree of coincidence between the pitch variation of the voice waveform represented by the model voice data and the pitch variation of the voice waveform represented by the singing voice data may be determined to judge whether the singer is singing correctly; a rough sketch of such a pitch-contour comparison is given after this list.
  • The voice waveform represented by the model voice data and the voice waveform represented by the singing voice data may also be divided into a plurality of frequency bands by a plurality of bandpass filters, and the correctness of the lyrics may be determined from the degree of coincidence of the feature quantities obtained for each frequency band.
  • Instead of storing only the model voice data representing the model voice waveform, the formant frequencies obtained by analyzing the voice waveform represented by the model voice data may be stored in the storage unit 105 in advance, and the stored formant frequencies may be compared with the formant frequencies of each frame of the singer's voice waveform to determine the degree of coincidence.
  • The correctness of the lyrics sung by the singer may also be determined after the singer has finished singing the song.
  • When the lyrics are judged to be incorrect, a message or image informing the user that the lyrics were incorrect may be displayed on the monitor.
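
The pitch-variation check mentioned above can be prototyped separately from the formant comparison. The following is a minimal sketch, assuming the model voice and the singing voice have already been aligned frame by frame and that a per-frame fundamental-frequency estimate is available as a NumPy array; the function name, the 100-cent tolerance, and the use of NumPy are illustrative assumptions, not details taken from the patent.

```python
import numpy as np

def pitch_contour_match(model_f0, sung_f0, tol_cents=100.0):
    """Compare two per-frame pitch contours (Hz) of equal length.

    Returns the fraction of voiced frames whose pitch differs by less
    than `tol_cents`. Unvoiced frames (f0 <= 0) are ignored.
    """
    model_f0 = np.asarray(model_f0, dtype=float)
    sung_f0 = np.asarray(sung_f0, dtype=float)
    voiced = (model_f0 > 0) & (sung_f0 > 0)
    if not np.any(voiced):
        return 0.0
    # Pitch difference in cents between the two contours on voiced frames.
    cents = 1200.0 * np.abs(np.log2(sung_f0[voiced] / model_f0[voiced]))
    return float(np.mean(cents < tol_cents))

# Example: a sung contour that starts below the target pitch and rises to it.
model = np.array([220.0, 220.0, 220.0, 220.0])
sung = np.array([200.0, 210.0, 218.0, 221.0])
print(pitch_contour_match(model, sung))   # fraction of frames within 100 cents
```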

Abstract

Whether a singer has correctly memorized the lyrics of a song can be evaluated without making the system complex. A CPU (102) divides the sound waveform of a model sound represented by model sound data into frames, and likewise divides into frames the sound waveform of a sung sound represented by sung sound data stored in a storage section (105). Next, the CPU (102) associates each frame of the model sound with the corresponding frames of the sung sound, compares the formant frequencies of the sound waveforms of the corresponding frames, and judges whether or not the model sound agrees with the sung sound. If they do not agree, the CPU (102) displays on a monitor (2) the lyrics corresponding to the sound represented by the model sound data.

Description

Karaoke apparatus and voice processing method

Technical field

[0001] The present invention relates to a technique for scoring a singer's singing ability.

Background art

[0002] Some karaoke devices that perform automatic accompaniment based on music data analyze the singer's voice input to a microphone and score the singer's singing ability. For example, the karaoke apparatus disclosed in Patent Document 1 recognizes the words in a singer's voice input to a microphone and evaluates how closely they match the lyrics of the song. With this karaoke apparatus, it is possible to evaluate whether or not the singer remembers the lyrics correctly.

Patent Document 1: JP-A-10-91172

Disclosure of the invention

Problems to be solved by the invention

[0003] To recognize the words in a voice as in the apparatus of Patent Document 1, speech recognition must be performed. In speech recognition, the input voice is analyzed and its acoustic features are extracted; then, among the words stored in a dictionary, the word whose acoustic features are closest to those of the input voice is found and output as the recognition result. To recognize words correctly, the words stored in the dictionary matter, and many words must be stored in the dictionary. However, when many words are stored in the dictionary, finding the closest word among them takes time, and the evaluation result can no longer be shown immediately. Moreover, the songs sung at karaoke include many foreign-language songs in addition to Japanese ones. Performing speech recognition for many languages requires a dictionary for each language, and adding songs in a new language to a karaoke device requires preparing a new dictionary as well, so the system becomes complicated and songs cannot easily be added.

[0004] The present invention has been made against this background, and its purpose is to make it possible to evaluate whether a singer remembers the lyrics correctly without complicating the system.

Means for solving the problem
[0005] To solve the problems described above, the present invention provides a karaoke apparatus comprising: a storage unit that stores model voice data representing a model voice produced when a song is sung exactly according to its lyrics; a voice input unit to which a singer's singing voice is input; a specifying unit that divides the model voice represented by the model voice data into a plurality of model voice sections and specifies, in the singing voice input to the voice input unit, the singing voice section corresponding to each of the divided model voice sections; an evaluation unit that performs an evaluation by comparing the singing voice of each singing voice section specified by the specifying unit with the model voice of the model voice section corresponding to that singing voice section; and a display unit that displays the evaluation result of the evaluation unit.

[0006] In this aspect, the evaluation unit may obtain a degree of coincidence between the singing voice of the singing voice section specified by the specifying unit and the model voice of the corresponding model voice section, and perform the evaluation based on the obtained degree of coincidence.

The storage unit may also store lyrics data representing the lyrics of the song, and the apparatus may have a lyrics specifying unit that, when the degree of coincidence obtained by the evaluation unit is less than a predetermined value, specifies the lyrics corresponding to the model voice section whose degree of coincidence fell below the predetermined value from among the lyrics represented by the lyrics data stored in the storage unit; the display unit may then display the lyrics specified by the lyrics specifying unit.

The evaluation unit may obtain the degree of coincidence between the formant frequencies of the singing voice and the formant frequencies of the model voice.

The evaluation unit may also obtain a first spectrum envelope from the voice waveform of the singing voice section specified by the specifying unit, obtain a second spectrum envelope from the voice waveform of the model voice section corresponding to that singing voice section, extract the formant frequencies of the singing voice from the first spectrum envelope, and extract the formant frequencies of the model voice from the second spectrum envelope.

The evaluation unit may also obtain a degree of coincidence between the voice waveform of the singing voice section specified by the specifying unit and the voice waveform of the model voice of the corresponding model voice section, and perform the evaluation based on the obtained degree of coincidence.

The present invention also provides a voice processing method comprising: dividing the model voice represented by model voice data, that is, the voice produced when a song is sung exactly according to its lyrics, into a plurality of model voice sections; specifying, in a singer's singing voice input to a voice input unit, the singing voice section corresponding to each of the divided model voice sections; comparing the singing voice of each specified singing voice section with the model voice of the corresponding model voice section; performing an evaluation based on the result of the comparison; and displaying the result of the evaluation.

The comparison may obtain a degree of coincidence between the singing voice of the specified singing voice section and the model voice of the corresponding model voice section, and the evaluation may be performed based on the obtained degree of coincidence.

The voice processing method may further specify, when the obtained degree of coincidence is less than a predetermined value, the lyrics of the song corresponding to the model voice section whose degree of coincidence fell below the predetermined value from among the lyrics represented by lyrics data, and the display step may display the specified lyrics on a display unit.

In the evaluation, the degree of coincidence between the formant frequencies of the singing voice and the formant frequencies of the model voice may be obtained.

The evaluation may also obtain a first spectrum envelope from the voice waveform of the specified singing voice section, obtain a second spectrum envelope from the voice waveform of the corresponding model voice section, extract the formant frequencies of the singing voice from the first spectrum envelope, and extract the formant frequencies of the model voice from the second spectrum envelope.

The evaluation may also obtain a degree of coincidence between the voice waveform of the singing voice in the specified singing voice section and the voice waveform of the model voice in the corresponding model voice section, and be performed based on the obtained degree of coincidence.

Effects of the invention

According to the present invention, it is possible to evaluate whether or not a singer remembers the lyrics correctly, without complicating the system.
Brief description of the drawings

[0008]
FIG. 1 is an external view of a karaoke apparatus according to an embodiment of the present invention.
FIG. 2 is a block diagram showing the hardware configuration of the karaoke apparatus according to the embodiment of the present invention.
FIG. 3 is a diagram illustrating the format of music data in the embodiment of the present invention.
FIG. 4 is a diagram illustrating the format of the lyrics table.
FIG. 5 is a diagram illustrating the waveform of the model voice and the waveform of the singing voice.
FIG. 6 is a diagram showing the waveforms of the model voice and the singing voice divided into a plurality of frames.

Explanation of reference numerals

[0009]
1 Karaoke device
2 Monitor
3L, 3R Speakers
4 Microphone
5 Remote control device
101 Bus
102 CPU
103 ROM
104 RAM
105 Storage unit
106 Input unit
107 Display control unit
108 Communication unit
109 Tone generator
110 Effect DSP
111 Voice processing DSP
112 Amplifier

Best mode for carrying out the invention
[0010] [Configuration of the embodiment]
FIG. 1 is an external view of a karaoke apparatus according to an embodiment of the present invention. As shown in the figure, a monitor 2, speakers 3L and 3R, and a microphone 4 are connected to the karaoke device 1. The karaoke device 1 is remotely operated by infrared signals transmitted from a remote control device 5.

[0011] FIG. 2 is a block diagram showing the hardware configuration of the karaoke device 1. The units connected to a bus 101 communicate with one another via the bus 101. A CPU (Central Processing Unit) 102 uses a RAM (Random Access Memory) 104 as a work area and controls each part of the karaoke device 1 by executing various programs stored in a ROM (Read Only Memory) 103. A music storage area for temporarily storing music data is reserved in the RAM 104. A storage unit 105 comprises a hard disk device and stores various data such as the music data described later and digital data of the singing voice input from the microphone 4.

[0012] A communication unit 108 receives music data from a host computer (not shown) that distributes music data, via a communication network (not shown) such as the Internet, and transfers the received music data to the storage unit 105 under the control of the CPU 102. In the present embodiment, the music data may instead be stored in the storage unit 105 in advance. Alternatively, the karaoke device 1 may be provided with a reading device that reads recording media such as CD-ROMs and DVDs, and music data recorded on such media may be read by the reading device and transferred to the storage unit 105 for storage.

The structure of the music data used in the present embodiment is as follows. As shown in FIG. 3, the music data includes a header, musical tone data in WAVE format representing the karaoke performance sound, model voice data in WAVE format representing the waveform of a model voice in which the lyrics of the song are sung correctly without mistakes, and a lyrics table storing lyrics data representing the lyrics of the song.
[0013] FIG. 4 is a diagram illustrating the format of the lyrics table. In the lyrics table, lyrics data representing the lyrics of the song to be played is stored in association with time interval data indicating the time interval during which those lyrics are sung in the model voice when the musical tones are output according to the tone data. For example, in the lyrics table shown in FIG. 4, the lyrics data in the first row represents the lyrics "Kamereon ga", and the time interval data "01:00-01:02" associated with it indicates that, in the model voice, these lyrics are sung between 1 minute 0 seconds and 1 minute 2 seconds after the start of the performance. The lyrics data in the second row represents the lyrics "Yattekita", and the time interval data "01:03-01:06" associated with it indicates that, in the model voice, these lyrics are sung between 1 minute 3 seconds and 1 minute 6 seconds after the start of the performance.
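
One way to picture the lyrics table of FIG. 4 is as a list of records pairing a time interval with the lyrics sung in it, plus a lookup by elapsed time. The sketch below is illustrative only: the field names, the representation of times in seconds, and the `find_entry` helper are assumptions, not structures defined by the patent.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class LyricsEntry:
    start: float   # seconds from the start of playback
    end: float     # seconds from the start of playback
    lyrics: str    # lyrics sung in this interval by the model voice

# Example table corresponding to FIG. 4 ("01:00-01:02" and "01:03-01:06").
lyrics_table = [
    LyricsEntry(60.0, 62.0, "Kamereon ga"),
    LyricsEntry(63.0, 66.0, "Yattekita"),
]

def find_entry(table, elapsed: float) -> Optional[LyricsEntry]:
    """Return the entry whose time interval contains the counted elapsed time."""
    for entry in table:
        if entry.start <= elapsed <= entry.end:
            return entry
    return None

print(find_entry(lyrics_table, 60.0).lyrics)   # -> "Kamereon ga"
```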
[0014] The microphone 4 converts the singer's singing voice into a voice signal and outputs it. The voice signal output from the microphone 4 is input to a voice processing DSP (Digital Signal Processor) 111 and an amplifier 112. The voice processing DSP 111 A/D-converts the input voice signal and generates singing voice data representing the singing voice. This singing voice data is stored in the storage unit 105 and is compared with the model voice data to score the singer's singing ability.

[0015] An input unit 106 detects signals generated by input operations on the operation panel provided on the karaoke device 1 or on the remote control device 5, and outputs the detection results to the CPU 102. A display control unit 107 displays video and the scoring result of the singer's singing ability on the monitor 2 under the control of the CPU 102.

[0016] A tone generator 109 generates a musical tone signal corresponding to the supplied musical tone data and outputs the generated signal to an effect DSP 110 as the karaoke performance sound. The effect DSP 110 applies effects such as reverb and echo to the musical tone signal generated by the tone generator 109. The musical tone signal with the effects applied is D/A-converted by the effect DSP 110 and output to the amplifier 112. The amplifier 112 mixes and amplifies the musical tone signal output from the effect DSP 110 and the voice signal output from the microphone 4, and outputs the result to the speakers 3L and 3R. As a result, the melody of the song and the singer's voice are output from the speakers 3L and 3R.
[0017] [Operation of the embodiment]
Next, the operation of this embodiment will be described. When the user operates the remote control device 5 to designate a song, the CPU 102 transfers the music data of the designated song from the storage unit 105 to the music storage area of the RAM 104. The CPU 102 then executes karaoke accompaniment processing by sequentially reading out the various data included in the music data stored in this area.

[0018] Specifically, the CPU 102 reads out the musical tone data included in the music data and outputs it to the tone generator 109. The tone generator 109 generates a musical tone signal of a predetermined timbre based on the supplied data and outputs it to the effect DSP 110. The effect DSP 110 applies effects such as reverb and echo to the musical tone signal output from the tone generator 109. The musical tone signal with the effects applied is D/A-converted by the effect DSP 110 and output to the amplifier 112. The amplifier 112 amplifies the musical tone signal output from the effect DSP 110 and outputs it to the speakers 3L and 3R, so that the melody of the song is output from the speakers 3L and 3R. When the CPU 102 supplies the music data to the tone generator 109 and output of the musical tones begins, it starts counting the time elapsed since the start of playback.

[0019] Meanwhile, when the singer sings along with the playback of the song, the singer's voice is input to the microphone 4 and a voice signal is output from the microphone 4. The voice processing DSP 111 A/D-converts the voice signal output from the microphone 4 and generates singing voice data representing the singing voice. This singing voice data is stored in the storage unit 105.

[0020] The CPU 102 continues counting the elapsed time and searches the lyrics table for the time interval whose start time is the counted time (that is, a time interval that contains the counted time). It then reads out the retrieved time interval and the lyrics data stored in association with it. For example, when the counted elapsed time is 01:00, the time interval "01:00-01:02" and the lyrics data "Kamereon ga" in the first row of the lyrics table shown in FIG. 4 are read out.
[0021] When the CPU 102 has read a time interval, it compares the voice input to the microphone 4 during that time interval with the model voice in the same time interval, and judges whether the singer sang the lyrics correctly. Specifically, the CPU 102 analyzes the voice represented by the model voice data and, as shown in FIG. 5, extracts the voice waveform A lying within the read time interval (01:00-01:02) on the time axis of the waveform represented by the model voice data. The CPU 102 also analyzes the stored singing voice data and, as shown in FIG. 5, extracts the voice waveform B lying within the read time interval on the time axis represented by the singing voice data. The extracted voice waveform A is then divided into a plurality of frames at a predetermined time interval (for example, 10 ms), as shown in FIG. 6(a), and the extracted voice waveform B is likewise divided into a plurality of frames at the same time interval, as shown in FIG. 6(b).
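
The segment extraction and 10 ms framing of waveforms A and B can be sketched as follows, assuming the decoded WAVE data is available as one-dimensional NumPy arrays at a known sample rate; the sample rate, array contents, and use of random stand-in data are illustrative assumptions.

```python
import numpy as np

def extract_segment(signal, sample_rate, start_s, end_s):
    """Cut the samples lying inside [start_s, end_s) out of a 1-D signal."""
    return signal[int(start_s * sample_rate):int(end_s * sample_rate)]

def split_into_frames(segment, sample_rate, frame_ms=10.0):
    """Split a segment into consecutive, non-overlapping frames of frame_ms."""
    frame_len = int(sample_rate * frame_ms / 1000.0)
    n_frames = len(segment) // frame_len
    return segment[:n_frames * frame_len].reshape(n_frames, frame_len)

# Example: waveform A (model voice) and B (singing voice) for 01:00-01:02.
sr = 16000
model_wave = np.random.randn(sr * 180)   # stand-in for decoded model voice data
sung_wave = np.random.randn(sr * 180)    # stand-in for recorded singing voice
frames_a = split_into_frames(extract_segment(model_wave, sr, 60.0, 62.0), sr)
frames_b = split_into_frames(extract_segment(sung_wave, sr, 60.0, 62.0), sr)
print(frames_a.shape, frames_b.shape)    # (200, 160): 200 frames of 10 ms each
```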
[0022] Next, the CPU 102 associates the voice waveform of each frame of the model voice with the voice waveform of the corresponding frame or frames of the singing voice using DP (Dynamic Programming) matching. For example, in the waveforms illustrated in FIG. 6, if the voice waveform of frame A1 of the model voice corresponds to the voice waveform of frame B1 of the singing voice, frame A1 and frame B1 are associated with each other. If the voice waveform of frame A2 of the model voice corresponds to the voice waveforms from frame B2 to frame B3 of the singing voice, frame A2 is associated with frames B2 through B3.
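
The patent names DP (Dynamic Programming) matching but does not spell out the per-frame cost it uses. The sketch below is one common realization: a dynamic-time-warping recursion over the Euclidean distance between per-frame feature vectors, returning a frame-to-frame correspondence. The feature choice and the random stand-in data are assumptions, not details taken from the text.

```python
import numpy as np

def dtw_align(feats_a, feats_b):
    """Align two feature sequences with dynamic programming (DTW).

    feats_a, feats_b: arrays of shape (n_frames, n_dims).
    Returns a list of (i, j) pairs mapping model frame i to singer frame j.
    """
    n, m = len(feats_a), len(feats_b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(feats_a[i - 1] - feats_b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    # Backtrack the cheapest path to recover the frame-to-frame mapping.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

# Example with stand-in per-frame features (e.g. log-magnitude spectra).
a = np.random.randn(20, 8)   # model-voice frame features
b = np.random.randn(23, 8)   # singing-voice frame features
print(dtw_align(a, b)[:3])
```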
[0023] Next, the CPU 102 compares the features of the voice waveforms of the corresponding frames. Specifically, the CPU 102 applies a Fourier transform to the voice waveform of each frame of the model voice, takes the logarithm of the resulting amplitude spectrum, and applies an inverse Fourier transform to it to generate a spectrum envelope for each frame. From the obtained spectrum envelope, the CPU 102 extracts the frequency f11 of the first formant, the frequency f12 of the second formant, and the frequency f13 of the third formant.

Likewise, the CPU 102 applies a Fourier transform to the voice waveform of each frame of the singer's voice associated with a frame of the model voice, takes the logarithm of the resulting amplitude spectrum, and applies an inverse Fourier transform to it to generate a spectrum envelope for each frame. From the obtained spectrum envelope, the CPU 102 extracts the frequency f21 of the first formant, the frequency f22 of the second formant, and the frequency f23 of the third formant.
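
A rough Python rendering of this envelope-and-formant step: the log amplitude spectrum of a windowed frame is inverse-transformed, the resulting cepstrum is truncated (liftered) to keep only its low-quefrency part, and the smoothed envelope is transformed back so that its first few peaks can be read off as formant candidates. The Hann window, the lifter length of 30, and the peak picking with `scipy.signal.find_peaks` are assumptions; the patent only specifies the log-spectrum and inverse-transform steps.

```python
import numpy as np
from scipy.signal import find_peaks

def spectral_envelope(frame, n_cepstrum=30):
    """Smoothed log-amplitude spectrum of one frame via cepstral liftering."""
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    log_spec = np.log(spectrum + 1e-9)
    cepstrum = np.fft.irfft(log_spec)
    cepstrum[n_cepstrum:-n_cepstrum] = 0.0       # keep only the low quefrencies
    return np.fft.rfft(cepstrum).real            # smoothed log envelope

def first_formants(frame, sample_rate, n_formants=3):
    """Return the frequencies of the first few peaks of the spectral envelope."""
    env = spectral_envelope(frame)
    peaks, _ = find_peaks(env)
    freqs = peaks * sample_rate / (2.0 * (len(env) - 1))
    return freqs[:n_formants]                    # candidates for f1, f2, f3

sr = 16000
frame = np.random.randn(160)                     # stand-in for one 10 ms frame
print(first_formants(frame, sr))
```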
[0024] For example, the CPU 102 generates the spectrum envelope of frame A1 of the model voice and extracts the formant frequencies f11 to f13 of the first to third formants from it. The CPU 102 then generates the spectrum envelope of the voice waveform of frame B1, which is associated with frame A1, and extracts the formant frequencies f21 to f23 of the first to third formants from it. Similarly, the CPU 102 generates the spectrum envelope of frame A2 of the model voice and extracts the formant frequencies f11 to f13 from it, then generates the spectrum envelope of the voice waveform from frame B2 to frame B3, which is associated with frame A2, and extracts the formant frequencies f21 to f23 from it.
[0025] 次に CPU102は、手本音声の各フレームから抽出したフォルマント周波数 fl l〜fl 3と、手本音声の各フレームに対応付けされた歌唱者の音声のフレーム力 抽出した フォルマント周波数 f21〜f23とを比較する。そして、 CPU102は、対応する音声波 形同士でフォルマント周波数 f 11とフォルマント周波数 f21との差、フォルマント周波 数 f 12とフォルマント周波数 f 22との差、及びフォルマント周波数 f 13とフォルマント周 波数 f23との差が、所定の値以上である場合には、フォルマント周波数が不一致であ つたことを示す不一致情報 Dを手本音声のフレームに付加する。 [0025] Next, the CPU 102 extracts the formant frequencies fl 1 to fl 3 extracted from each frame of the sample voice and the frame force of the singer's voice associated with each frame of the sample voice f21 to Compare with f23. Then, the CPU 102 compares the difference between the formant frequency f11 and the formant frequency f21, the difference between the formant frequency f12 and the formant frequency f22, and the formant frequency f13 and the formant frequency f23. If the difference is greater than or equal to a predetermined value, mismatch information D indicating that the formant frequencies do not match is added to the frame of the model voice.
For example, when the formant frequencies f11 to f13 of the voice waveform of frame A1 match the formant frequencies of the voice waveform of frame B1, the CPU 102 judges that the voices of the corresponding frames match and does not add mismatch information D to frame A1. On the other hand, when the differences between the formant frequencies f11 to f13 of frame A2 and the formant frequencies f21 to f23 of the voice waveform spanning frames B2 to B3 are each equal to or greater than the predetermined value, the CPU 102 adds mismatch information D, indicating that the formant frequencies did not match, to frame A2.
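A sketch of the per-frame test in paragraph [0025]: following the wording above, a model frame receives mismatch information D when all three formant differences reach the threshold. The passage only speaks of "a predetermined value", so the 200 Hz figure below is purely a placeholder assumption.

```python
# Hypothetical threshold; the patent does not name a concrete value.
FORMANT_THRESHOLD_HZ = 200.0


def needs_mismatch_info(model_formants, sung_formants,
                        threshold=FORMANT_THRESHOLD_HZ):
    """True when f11/f21, f12/f22 and f13/f23 each differ by >= threshold."""
    return all(abs(fm - fs) >= threshold
               for fm, fs in zip(model_formants, sung_formants))
```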
[0026] After judging, for the voice waveform of each frame of the model voice, whether the formant frequencies of the singer's frame match those of the model frame, the CPU 102 counts the number N of frames to which mismatch information D has been added. Next, the CPU 102 compares the total number M of frames into which the model voice data was divided with the value of N. When N is equal to or greater than half of the total frame count M, the CPU 102 judges that, for the lyrics represented by the read lyric data, the lyrics pronounced by the singer differ from the lyrics of the model voice. Conversely, when N is less than half of M, the CPU 102 judges that the lyrics pronounced by the singer are the same as the lyrics of the model voice. For example, for the voice "kamereonga" (かめれおんが) represented by the model voice data, when the number N of pieces of mismatch information is less than half of the total frame count M, the CPU 102 judges that the lyrics pronounced by the singer are the same as the lyrics of the model voice. In the present embodiment the lyrics are judged to differ when N is equal to or greater than half of M; however, the lyrics represented by the read lyric data may instead be judged to differ when the ratio of N to the total frame count M is equal to or greater than some predetermined ratio other than 50%.
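The phrase-level decision in paragraph [0026] reduces to a count-and-compare, sketched below. The 0.5 ratio follows the embodiment; as the paragraph notes, any other predetermined ratio could be substituted.

```python
def lyrics_judged_different(mismatch_flags, ratio=0.5):
    """mismatch_flags holds one bool per model frame (True = mismatch info D)."""
    total_m = len(mismatch_flags)      # total number of model frames M
    count_n = sum(mismatch_flags)      # frames carrying mismatch information D
    return count_n >= total_m * ratio  # True -> singer's lyrics judged different
```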
[0027] In parallel with the comparison between the model voice and the singing voice, the CPU 102 continues counting the elapsed time. When the counted elapsed time reaches 01:03, it reads the time section "01:03-01:06" and the lyric data "yattekita" (やってきた) from the second row of the lyric table shown in FIG. 4. When the singer sings during this time section as the music is played back, the singing voice data is stored in the storage unit 105. Here, if the singer mistakes the lyrics and sings, for example, "ittekuru" (いってくる), which differs from the lyrics "yattekita" represented by the read lyric data, singing voice data representing the voice "ittekuru" is generated and stored in the storage unit 105.
[0028] Next, the CPU 102 divides the waveform of the voice input to the microphone 4 during this time section and the waveform of the model voice for this time section into a plurality of frames. It then associates the voice waveform of each frame of the model voice with the voice waveform of a frame of the singing voice, and compares the formant frequencies of the voice waveforms between the associated frames. For the voice waveform of each frame of the model voice, the CPU 102 judges whether its formant frequencies match those of the singer's voice waveform and adds mismatch information D where they do not; it then compares the total number M of frames of the divided model voice data with the number N of frames to which mismatch information was added, and judges whether the singer sang the lyrics correctly.
[0029] Here, because the singer sang "ittekuru", which differs from the lyrics "yattekita", comparing the formant frequencies of the model voice waveform with those of the singer's voice waveform shows that the formant frequencies do not match, and the number N of pieces of mismatch information reaches at least half of the total frame count M. Since N is equal to or greater than half of M, the CPU 102 judges that, for the lyrics represented by the read lyric data, the lyrics pronounced by the singer differ from the lyrics of the model voice; it then controls the display control unit 107 so as to display the lyrics "yattekita" represented by the read lyric data on the monitor 2, thereby notifying the singer that the lyrics were wrong.
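Putting the pieces together, the check that runs for each lyric phrase (paragraphs [0027]-[0029]) might look like the sketch below. It reuses the helper functions sketched earlier, assumes the phrase's model and singing waveforms have already been cut out, framed, and aligned one-to-one, and simply prints the phrase instead of driving the display control unit 107; the actual embodiment also allows one model frame to map to a run of singing frames.

```python
def check_phrase(model_frames, sung_frames, lyric_text, sample_rate=16000):
    """Flag the phrase when the per-frame formant comparison fails too often."""
    flags = []
    for model_frame, sung_frame in zip(model_frames, sung_frames):
        f_model = first_three_formants(spectral_envelope(model_frame), sample_rate)
        f_sung = first_three_formants(spectral_envelope(sung_frame), sample_rate)
        flags.append(needs_mismatch_info(f_model, f_sung))
    if lyrics_judged_different(flags):
        print(f"Lyrics mismatch - expected: {lyric_text}")  # stand-in for monitor 2
```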
[0030] Thereafter, as the music is played back, the CPU 102 repeats the reading of lyric data and model voice data and the judgment of whether the lyrics sung by the singer are correct, as described above. When all of the performance event data has been read, the karaoke accompaniment processing ends.
[0031] As described above, according to the present embodiment, it is possible to judge whether the singer sang according to the lyrics without performing speech recognition that uses a dictionary. Moreover, as long as voice data of the lyrics sung correctly is available, the present embodiment can evaluate whether a performance was sung correctly according to the lyrics; therefore, for lyrics in a variety of languages, it can evaluate whether the singer has memorized the lyrics correctly without complicating the system in the way an arrangement that performs language recognition with a dictionary would.
[0032] [Modifications]
Although an embodiment of the present invention has been described above, the present invention is not limited to that embodiment. For example, the embodiment may be modified as follows.
[0033] In the embodiment described above, the pitch of the voice represented by the singing voice data may be corrected so that the pitch of the voice waveform represented by the singing voice data matches the pitch of the voice waveform represented by the model voice data.
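One way this pitch-correction modification could be realized, as an assumed sketch only: estimate the two average pitches with librosa's pyin tracker and shift the singing voice by the corresponding number of semitones. The use of librosa, the pitch range, and a single global shift (rather than a frame-by-frame correction) are illustrative assumptions, not details from the patent.

```python
import numpy as np
import librosa


def correct_pitch(sung, model, sr=16000):
    """Shift the singing voice so its median pitch matches the model voice."""
    f0_sung, _, _ = librosa.pyin(sung, fmin=80.0, fmax=800.0, sr=sr)
    f0_model, _, _ = librosa.pyin(model, fmin=80.0, fmax=800.0, sr=sr)
    semitones = 12.0 * np.log2(np.nanmedian(f0_model) / np.nanmedian(f0_sung))
    return librosa.effects.pitch_shift(sung, sr=sr, n_steps=semitones)
```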
[0034] In the embodiment described above, periodic fluctuation in the pitch of the voice waveform represented by the model voice data may be detected to judge whether vibrato is applied to the model voice; when vibrato is judged to be present, the degree of agreement between the pitch fluctuation of the voice waveform represented by the model voice data and the pitch fluctuation of the voice waveform represented by the singing voice data may be determined, so as to judge whether the singer is singing with correct vibrato.
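This vibrato variant could be realized along the lines below: look for periodic energy in the 4-8 Hz region of the model's pitch-deviation contour, and when vibrato is found, score the agreement between the model's and the singer's pitch fluctuations with a simple correlation. The rate band, the power-ratio test, and correlation as the agreement measure are assumptions, not details from the patent.

```python
import numpy as np


def has_vibrato(f0, hop_s=0.01, lo_hz=4.0, hi_hz=8.0, min_power_ratio=0.2):
    """True when the pitch contour (Hz per frame) fluctuates periodically."""
    deviation = np.asarray(f0, dtype=float)
    deviation = deviation - deviation.mean()
    power = np.abs(np.fft.rfft(deviation)) ** 2
    rate = np.fft.rfftfreq(len(deviation), d=hop_s)   # fluctuation rate in Hz
    band = (rate >= lo_hz) & (rate <= hi_hz)
    return power[band].sum() >= min_power_ratio * power[1:].sum()


def fluctuation_agreement(f0_model, f0_sung):
    """Correlation between the two pitch-deviation contours as a match score."""
    a = np.asarray(f0_model, dtype=float) - np.mean(f0_model)
    b = np.asarray(f0_sung, dtype=float) - np.mean(f0_sung)
    n = min(len(a), len(b))
    return float(np.corrcoef(a[:n], b[:n])[0, 1])
```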
Similarly, the pitch fluctuation of the voice waveform represented by the model voice data may be detected to judge whether the model voice contains "shakuri" (a singing technique in which a note is first produced below the intended pitch and then brought up to it); when shakuri is judged to be present, the degree of agreement between the pitch fluctuation of the voice waveform represented by the model voice data and that of the voice waveform represented by the singing voice data may be determined, so as to judge whether the singer is singing with correct shakuri.
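A corresponding sketch for the shakuri check: decide that a note scoops when its pitch starts well below the target and then rises toward it. The target pitch, the 50-cent margin, and the frame counts are assumed values introduced here.

```python
import numpy as np


def has_shakuri(f0, target_hz, onset_frames=10, margin_cents=50.0):
    """True when the note starts clearly below target_hz and then rises."""
    start = np.nanmean(f0[:onset_frames])
    later = np.nanmean(f0[onset_frames:2 * onset_frames])
    start_cents = 1200.0 * np.log2(start / target_hz)
    later_cents = 1200.0 * np.log2(later / target_hz)
    return start_cents <= -margin_cents and later_cents > start_cents
```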
[0035] In the embodiment described above, the voice waveform represented by the model voice data and the voice waveform represented by the singing voice data may be divided into a plurality of frequency bands by a plurality of bandpass filters, and the correctness of the lyrics may be judged by determining the degree of agreement of the voice feature quantities for each frequency band.
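The band-splitting variant could be sketched with a small bank of Butterworth bandpass filters, comparing features band by band afterwards. The band edges, the filter order, and the use of scipy are assumptions introduced here, not values from the patent.

```python
from scipy.signal import butter, sosfiltfilt

# Hypothetical band layout; the patent does not specify the bands.
BAND_EDGES_HZ = [(80, 300), (300, 1000), (1000, 3000), (3000, 7000)]


def split_into_bands(signal, sr=16000, bands=BAND_EDGES_HZ):
    """Return one band-limited copy of the waveform per frequency band."""
    filtered = []
    for lo, hi in bands:
        sos = butter(4, [lo, hi], btype="bandpass", fs=sr, output="sos")
        filtered.append(sosfiltfilt(sos, signal))
    return filtered
```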
[0036] In the embodiment described above, model voice data representing the model voice waveform is stored, and the formant frequencies are obtained by analyzing the voice waveform represented by that data. Instead, the formant frequencies of each frame obtained by dividing the voice waveform into a plurality of frames may be stored in the storage unit 105 in advance, and the degree of agreement may be judged by comparing these stored formant frequencies with the formant frequencies of each frame of the singer's voice waveform.
[0037] In the embodiment described above, the judgment of whether the lyrics sung by the singer are correct may instead be made after the singer has finished singing the music. Furthermore, when it is judged that the lyrics pronounced by the singer differ from the lyrics of the model voice, a message or an image notifying the singer that the lyrics were wrong may be displayed on the monitor 2 instead of displaying the lyrics themselves.
[0038] The present invention is based on a Japanese patent application filed on January 31, 2006 (Japanese Patent Application No. 2006-022648), the contents of which are incorporated herein by reference.

Claims

[1] A karaoke apparatus comprising:
a storage unit that stores model voice data representing a model voice produced when a music piece is sung according to its lyrics;
a voice input unit to which a singing voice of a singer is input;
a specifying unit that divides the model voice represented by the model voice data into a plurality of model voice sections and specifies, in the singing voice input to the voice input unit, a singing voice section corresponding to each of the divided model voice sections;
an evaluation unit that performs an evaluation by comparing the singing voice of a singing voice section specified by the specifying unit with the model voice of the model voice section corresponding to that singing voice section; and
a display unit that displays an evaluation result of the evaluation unit.
[2] The karaoke apparatus according to claim 1, wherein the evaluation unit obtains a degree of agreement between the singing voice of the singing voice section specified by the specifying unit and the model voice of the model voice section corresponding to that singing voice section, and performs the evaluation based on the obtained degree of agreement.
[3] The karaoke apparatus according to claim 2, wherein
the storage unit stores lyric data representing the lyrics of the music piece,
the karaoke apparatus further comprises a lyric specifying unit that, when the degree of agreement obtained by the evaluation unit is less than a predetermined value, specifies, from among the lyrics represented by the lyric data stored in the storage unit, the lyrics corresponding to the voice of the model voice section for which the degree of agreement is less than the predetermined value, and
the display unit displays the lyrics specified by the lyric specifying unit.
[4] The karaoke apparatus according to claim 2, wherein the evaluation unit obtains a degree of agreement between formant frequencies of the singing voice and formant frequencies of the model voice.
[5] The karaoke apparatus according to claim 4, wherein the evaluation unit obtains a first spectral envelope from the voice waveform of the singing voice of the singing voice section specified by the specifying unit, obtains a second spectral envelope from the voice waveform of the model voice of the model voice section corresponding to that singing voice section, extracts the formant frequencies of the singing voice from the first spectral envelope, and extracts the formant frequencies of the model voice from the second spectral envelope.
[6] The karaoke apparatus according to claim 1, wherein the evaluation unit obtains a degree of agreement between the voice waveform of the singing voice of the singing voice section specified by the specifying unit and the voice waveform of the model voice of the model voice section corresponding to that singing voice section, and performs the evaluation based on the obtained degree of agreement.
[7] A sound processing method comprising:
dividing a model voice, which is represented by model voice data and produced when a music piece is sung according to its lyrics, into a plurality of model voice sections;
specifying, in a singing voice of a singer input to a voice input unit, a singing voice section corresponding to each of the divided model voice sections;
comparing the singing voice of a specified singing voice section with the model voice of the model voice section corresponding to that singing voice section;
performing an evaluation based on a result of the comparison; and
displaying a result of the evaluation.
[8] The sound processing method according to claim 7, wherein the comparing obtains a degree of agreement between the singing voice of the specified singing voice section and the model voice of the model voice section corresponding to that singing voice section, and the evaluation is performed based on the obtained degree of agreement.
[9] The sound processing method according to claim 8, further comprising specifying, from among the lyrics represented by lyric data, the lyrics of the music piece corresponding to the voice of the model voice section for which the degree of agreement obtained by the evaluation is less than a predetermined value, wherein the displaying displays the specified lyrics on a display unit.
[10] The sound processing method according to claim 8, wherein the evaluation obtains a degree of agreement between formant frequencies of the singing voice and formant frequencies of the model voice.
[11] The sound processing method according to claim 10, wherein the evaluation obtains a first spectral envelope from the voice waveform of the singing voice of the specified singing voice section, obtains a second spectral envelope from the voice waveform of the model voice of the model voice section corresponding to that singing voice section, extracts the formant frequencies of the singing voice from the first spectral envelope, and extracts the formant frequencies of the model voice from the second spectral envelope.
[12] The sound processing method according to claim 7, wherein the evaluation obtains a degree of agreement between the voice waveform of the singing voice of the specified singing voice section and the voice waveform of the model voice of the model voice section corresponding to that singing voice section, and the evaluation is performed based on the obtained degree of agreement.
PCT/JP2007/051413 2006-01-31 2007-01-29 Karaoke machine and sound processing method WO2007088820A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2006-022648 2006-01-31
JP2006022648A JP4862413B2 (en) 2006-01-31 2006-01-31 Karaoke equipment

Publications (1)

Publication Number Publication Date
WO2007088820A1 true WO2007088820A1 (en) 2007-08-09

Family

ID=38327393

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2007/051413 WO2007088820A1 (en) 2006-01-31 2007-01-29 Karaoke machine and sound processing method

Country Status (2)

Country Link
JP (1) JP4862413B2 (en)
WO (1) WO2007088820A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6217304B2 (en) * 2013-10-17 2017-10-25 ヤマハ株式会社 Singing evaluation device and program
JP6586514B2 (en) * 2015-05-25 2019-10-02 ▲広▼州酷狗▲計▼算机科技有限公司 Audio processing method, apparatus and terminal
CN104978961B (en) * 2015-05-25 2019-10-15 广州酷狗计算机科技有限公司 A kind of audio-frequency processing method, device and terminal

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH1195760A (en) * 1997-09-16 1999-04-09 Ricoh Co Ltd Musical tone reproducing device
JP2001117568A (en) * 1999-10-21 2001-04-27 Yamaha Corp Singing evaluation device and karaoke device
JP2006227587A (en) * 2005-01-20 2006-08-31 Advanced Telecommunication Research Institute International Pronunciation evaluating device and program

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS60262187A (en) * 1984-06-08 1985-12-25 松下電器産業株式会社 Scoring apparatus
JP3754741B2 (en) * 1996-03-07 2006-03-15 株式会社エクシング Karaoke equipment
JP3673405B2 (en) * 1998-07-08 2005-07-20 株式会社リコー Performance song playback device


Also Published As

Publication number Publication date
JP2007206183A (en) 2007-08-16
JP4862413B2 (en) 2012-01-25

Similar Documents

Publication Publication Date Title
Yamada et al. A rhythm practice support system with annotation-free real-time onset detection
US6856923B2 (en) Method for analyzing music using sounds instruments
Durrieu et al. A musically motivated mid-level representation for pitch estimation and musical audio source separation
Saitou et al. Speech-to-singing synthesis: Converting speaking voices to singing voices by controlling acoustic features unique to singing voices
US5889224A (en) Karaoke scoring apparatus analyzing singing voice relative to melody data
US6182044B1 (en) System and methods for analyzing and critiquing a vocal performance
Cano et al. Voice Morphing System for Impersonating in Karaoke Applications.
KR20090041392A (en) Song practice support device
JP2008026622A (en) Evaluation apparatus
TWI742486B (en) Singing assisting system, singing assisting method, and non-transitory computer-readable medium comprising instructions for executing the same
JP4205824B2 (en) Singing evaluation device and karaoke device
JP5598516B2 (en) Voice synthesis system for karaoke and parameter extraction device
Lerch Software-based extraction of objective parameters from music performances
JP4862413B2 (en) Karaoke equipment
JP2007233077A (en) Evaluation device, control method, and program
US20150112687A1 (en) Method for rerecording audio materials and device for implementation thereof
JP3362491B2 (en) Voice utterance device
Ikemiya et al. Transcribing vocal expression from polyphonic music
JP2008040260A (en) Musical piece practice assisting device, dynamic time warping module, and program
JP2002041068A (en) Singing rating method in karaoke equipment
JPH11249674A (en) Singing marking system for karaoke device
JP4048249B2 (en) Karaoke equipment
Sharma et al. Singing characterization using temporal and spectral features in indian musical notes
JP5092311B2 (en) Voice evaluation device
JP6365483B2 (en) Karaoke device, karaoke system, and program

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 07707644

Country of ref document: EP

Kind code of ref document: A1