WO2007088820A1

WO2007088820A1 - Karaoke machine and sound processing method

Info

Publication number: WO2007088820A1
Application number: PCT/JP2007/051413
Authority: WO
Inventors: Akane Noguchi
Original assignee: Yamaha Corporation
Priority date: 2006-01-31
Filing date: 2007-01-29
Publication date: 2007-08-09
Also published as: JP2007206183A; JP4862413B2

Abstract

It can be evaluated that the singer correctly memorizes the lyrics of a song without making the system complex. A CPU (102) divides the sound waveform of a model sound represented by model sound data into frames and divides the sound waveform of a sung sound represented by sung sound data stored in a storage section (105) into frames. Next, the CPU (102) associates the sound waveform of each frame of the model sound with the sound waveform of each frame of the sung sound, compares the formant frequencies of the sound waveforms of the corresponding frames, and judges whether or not the model sound agrees with the sung sound. If not, the CPU (102) displays on a monitor (2) the lyrics corresponding to the sound represented by the model sound data.

Description

Karaoke apparatus and voice processing method

Technical field

[0001] The present invention relates to a technique for scoring a singer's singing ability.

Background art

[0002] Some karaoke devices that perform automatically based on music data analyze a singer's voice input to a microphone and score the singer's singing ability. For example, the karaoke apparatus disclosed in Patent Document 1 recognizes a singer's voice text input to a microphone and evaluates how much it matches the lyric text of the music. According to this karaoke apparatus, it is possible to evaluate whether or not the singer remembers the lyrics correctly.

Patent Document 1: JP-A-10-91172

Disclosure of the invention

Problems to be solved by the invention

[0003] By the way, in order to recognize the wording of speech as disclosed in Patent Document 1, it is necessary to perform speech recognition. When performing speech recognition, the input speech is analyzed and the acoustic features of the speech are extracted. From the words stored in the dictionary, the words whose acoustic features are closest to the acoustic features of the input speech are searched for and output as speech recognition results. Here, in order to correctly recognize words, words stored in the dictionary are important, and in order to correctly recognize words, it is necessary to store many words in the dictionary. However, if many words are stored in the dictionary, it will take time to find the closest word among many words, and it will not be possible to show the evaluation results immediately. In addition, there are many songs in foreign languages as well as Japanese. When performing speech recognition for a large number of languages, it is necessary to prepare a dictionary for each language. When adding a new language song to a karaoke device, a dictionary must also be prepared. There is a problem that the system becomes complicated and it becomes difficult to add music easily

[0004] The present invention has been made under the background described above, and its purpose is to complicate the system. The goal is to enable singers to evaluate the ability to remember the lyrics correctly.

Means for solving the problem

[0005] In order to solve the above-described problems, the present invention inputs a storage means unit storing example voice data representing a model voice when a song is sung according to lyrics, and a singer's singing voice. A voice input unit and a model voice represented by the model voice data are divided into a plurality of model voice sections, and each of the divided sample voice sections in the singing voice input to the voice input means unit A singing voice section corresponding to the singing voice section corresponding to the singing voice section of the singing voice section, There is provided a karaoke apparatus having an evaluation unit that performs evaluation by comparing the display unit and a display unit that displays an evaluation result of the evaluation unit.

[0006] In this aspect, the evaluation unit matches the singing voice of the singing voice section specified by the specifying unit with the model voice of the example voice section corresponding to the singing voice section. You can ask for a degree, and evaluate it based on the degree of agreement!

In addition, the storage unit stores lyrics data representing the lyrics of the music, and when the matching degree obtained by the evaluation unit is less than a predetermined value, the sample voice section in which the matching degree is less than the predetermined value A lyrics specifying unit for specifying the lyrics corresponding to the voice from the lyrics represented by the lyrics data stored in the storage unit, and the display unit displays the lyrics specified by the lyrics specifying unit Even so, よう.

Further, the evaluation unit may obtain a degree of coincidence between the formant frequency of the singing voice and the formant frequency of the model voice.

In addition, the evaluation unit obtains a first spectrum envelope from the speech waveform of the singing speech section specified by the specifying unit, and determines from the speech waveform of the example speech between the sample speech sections corresponding to the singing speech section. The second spectrum envelope may be obtained, the formant frequency of the singing voice may be extracted from the first spectrum envelope, and the formant frequency of the sample voice may be extracted from the second spectrum envelope.

Further, the evaluation unit includes a voice waveform of the singing voice section of the singing voice section specified by the specifying section and a voice waveform of the model voice of the model voice section corresponding to the singing voice section. Try to find the degree of agreement and evaluate it based on the degree of agreement.

In addition, the present invention divides the sample voice when the song is sung according to the lyrics represented by the sample voice data into a plurality of sample voice sections, and the singing voice of the singer input to the voice input unit is included. A singing voice section corresponding to each divided sample voice section, and a singing voice of the specified singing voice section and a sample voice of the sample voice section corresponding to the singing voice section. There is provided a voice processing method for comparing and evaluating based on the result of the comparison process and displaying the result of the evaluation.

The comparison process obtains a degree of coincidence between the singing voice of the singing voice section specified by the specifying unit and the model voice of the model voice section corresponding to the singing voice section, and the evaluation process Depending on the degree of agreement you find, you may want to make an evaluation.

In addition, the speech processing method may further include, when the matching degree obtained by the evaluation process is less than a predetermined value, the lyrics of the music corresponding to the voice of the sample voice section in which the matching degree is less than the predetermined value. May be identified from the lyrics represented by the lyrics data, and the display processing may display the identified lyrics on the display unit.

In the evaluation process, the degree of coincidence between the formant frequency of the singing voice and the formant frequency of the model voice may be obtained.

Further, the evaluation process obtains a first spectrum envelope from the voice waveform of the singing voice section of the specified singing voice section, and secondly calculates from the voice waveform of the model voice of the model voice section corresponding to the singing voice section. The spectrum envelope may be obtained, the formant frequency of the singing voice is extracted from the first spectrum envelope, and the formant frequency of the model voice may be extracted from the second spectrum envelope.

In addition, the evaluation process obtains the degree of coincidence between the voice waveform of the singing voice in the specified singing voice section and the voice waveform of the model voice in the sample voice section corresponding to the singing voice section. You can make an evaluation based on the degree of agreement!

The invention's effect

According to the present invention, it is possible to evaluate whether or not a singer who does not complicate the system can correctly remember the lyrics.

Brief Description of Drawings FIG. 1 is an external view of a karaoke apparatus according to an embodiment of the present invention.

FIG. 2 is a block diagram showing a hardware configuration of the karaoke apparatus according to the embodiment of the present invention.

FIG. 3 is a diagram illustrating a format of music data in the embodiment of the present invention.

FIG. 4 is a diagram illustrating an example of a lyrics table format.

FIG. 5 is a diagram illustrating a waveform of a sample voice and a waveform of a singing voice.

[Fig. 6] This is a diagram when the waveform of the model voice and the waveform of the singing voice are divided into multiple frames.

Explanation of symbols

[0009] 1 Karaoke device

2 Monitor

3L, 3R speaker

4 Microphone

5 Remote control device

101 passes

102 CPU

103 ROM

104 RAM

105 Memory

106 Input section

107 Display controller

108 Communications Department

109 Sound generator

110 DSP for effects

111 DSP for voice processing

112 amplifier

BEST MODE FOR CARRYING OUT THE INVENTION

[0010] [Configuration of the embodiment] FIG. 1 is an external view of a karaoke apparatus according to an embodiment of the present invention. As shown in the figure, the karaoke device 1 is connected with a monitor 2, a speaker 3L, a speaker 3R, and a microphone 4. Karaoke device 1 is remotely operated by an infrared signal transmitted from remote control device 5.

FIG. 2 is a block diagram showing a hardware configuration of the karaoke apparatus 1. Each unit connected to the bus 101 communicates with each other via the bus 101. A CPU (Central Processing Unit) 102 uses a RAM (Random Access Memory) 104 as a work area and executes various programs stored in a ROM (Read Only Memory) 103 to control each part of the karaoke device 1. To do. The RAM 104 has a music storage area for temporarily storing music data. The storage unit 105 includes a hard disk device, and stores various data such as music data described later and digital data of singing voice input from the microphone 4.

The communication unit 108 receives music data from a host computer (not shown), which is a music data distribution source, via a communication network (not shown) such as the Internet, and receives the received music data in the CPU 102. Transfer to the storage unit 105 under control. In the present embodiment, the music data may be stored in the storage unit 105 in advance. The karaoke device 1 is provided with a reading device that reads various recording media such as CD-ROM and DVD, and the music data recorded on the various recording media is read by the reading device and transferred to the storage unit 105 for storage. It may be.

Here, the structure of music data used in the present embodiment will be described. As shown in FIG. 3, the music data in the present embodiment includes the header, the musical sound data that is WAVE data representing the contents of the karaoke performance sound, and the voice of the model when the lyrics of the music are correctly sung. It has a WAVE format sample voice data representing the waveform and a lyrics table storing lyrics data representing the lyrics of the music.

FIG. 4 is a diagram illustrating a format of the lyrics table. In the lyrics table, there is a correspondence between the lyrics data representing the lyrics of the music to be played and the time interval data indicating the time interval in which the lyrics represented by the lyrics data should be pronounced when the tone is output according to the tone data. Stored with. For example, in the lyrics table shown in FIG. 4, the lyric data on the first line represents the lyrics “Kamereon force”, and the time interval data “01: 00—01: 02” associated with this lyric data. Indicates that in the example voice, the lyrics “Kamereon power ^” is pronounced between 1 minute and 2 seconds after the beginning of the music performance. The lyric data on the second line represents the lyric “Ichikitaichi”, and the time interval data “01: 03—01: 06” associated with this lyric data is the voice of the model. This shows that the lyrics “Ichikitaichi” is pronounced between the time when 1 minute and 3 seconds have passed since the music started playing and the time when 1 minute and 6 seconds passed.

[0014] The microphone 4 converts the singing voice of the singer that is input into a voice signal and outputs the voice signal. The audio signal from which the microphone 4 is also output is input to an audio processing DSP (Digital Signal Processor) 111 and an amplifier 112. The voice processing DSP 111 performs AZD conversion on the input voice signal and generates singing voice data representing the singing voice. This singing voice data is stored in the storage unit 105, and compared with the model voice data and used for scoring the singer's singing ability.

Input unit 106 detects a signal generated by an input operation to operation panel provided on karaoke apparatus 1 or remote control apparatus 5 and outputs the detection result to CPU 102. The display control unit 107 displays the video and the score result of the singer's singing ability on the motor 2 under the control of the CPU 102.

The tone generator 109 generates a tone signal corresponding to the supplied tone data, and outputs the generated tone signal to the effect DSP 110 as a karaoke performance sound. The effect DSP 110 gives an effect such as reverberation echo to the musical sound signal generated by the sound source device 109. The effected tone signal is DZA converted by the effect DSP 110 and output to the amplifier 112. The amplifier 112 synthesizes and amplifies the musical tone signal output from the effect DSP 110 and the audio signal output from the microphone 4 and outputs it to the speakers 3L and 3R. As a result, the melody of the music and the voice of the singer are output from the speakers 3L and 3R.

[0017] [Operation of the embodiment]

Next, the operation of this embodiment will be described. First, when the user operates the remote controller 5 to designate a song, the song data of the designated song is transferred from the storage unit 105 to the song storage area of the RAM 104 by the CPU 102. CPU102 stores this music Karaoke accompaniment processing is executed by sequentially reading various data included in the music data stored in the area.

Specifically, CPU 102 reads out the musical sound data included in the music data, and outputs the read musical sound data to tone generator 109. The tone generator 109 generates a tone signal of a predetermined tone color based on the supplied music data, and outputs the generated tone signal to the effect DSP 110. In the effect DSP 110, an effect such as reverberation echo is given to the musical sound signal output from the sound source device 109. The musical sound signal to which the effect is applied is DZ A converted by the effect DSP 110 and output to the amplifier 112. The amplifier 112 amplifies the musical sound signal output from the effect DSP 110 and outputs it to the speakers 3L and 3R. As a result, the melody of the music is output from the speakers 3L and 3R. Further, when the CPU 102 supplies the music data to the tone generator 109 and the output of the musical sound is started, the CPU 102 starts counting the elapsed time after the music output is started.

On the other hand, when the singer sings according to the reproduction of the music, the singer's voice is input to the microphone 4 and an audio signal is output from the microphone 4. The voice processing DSP 111 performs AZD conversion on the voice signal output from the microphone 4 to generate singing voice data representing the singing voice. This singing voice data is stored in the storage unit 105.

[0020] The CPU 102 continues counting elapsed time, and searches the lyrics table for a time interval including the counted time as the start time of the time interval (a time interval in which the counted time is included). Then, the retrieved time interval and the lyrics data stored in association with the retrieved time interval are read out. For example, if the counted elapsed time is 01:00:00, the time section “01: 01: 00: 02” and the lyrics data “Kamereon force” on the first line are read out in the lyrics table shown in FIG.

[0021] When the CPU 102 reads the time interval, the CPU 102 compares the audio input to the microphone 4 in this time interval with the model audio in this time interval, and determines whether or not the singer sang the lyrics correctly. Judging. Specifically, the CPU 102 prays the voice represented by the model voice data and, as shown in FIG. 5, reads the time interval (01: 00 on the time axis of the voice waveform represented by the model voice data. — Extract voice waveform A between 01: 02). In addition, the CPU 102 analyzes the stored singing voice data, and as shown in FIG. On the time axis shown, extract the speech waveform B between the read time intervals. Then, the extracted speech waveform A is divided into a plurality of frames by being divided at a predetermined time interval (for example, 10 ms) as shown in FIG. 6 (a). Further, the extracted speech waveform B is divided into a plurality of frames by being divided at a predetermined time interval (for example, 10 ms) as shown in FIG. 6 (b).

Next, the CPU 102 associates the speech waveform of each frame of the model speech with the speech waveform of each frame of the singing speech using a DP (Dynamic Programming) matching method. For example, in the waveform illustrated in Fig. 6, if the speech waveform of the sample voice frame A1 and the voice waveform of the singing voice frame B1 correspond to each other, the frame A1 and the frame B1 are associated with each other. The Also, if the voice waveform of frame A2 of the sample voice and the voice waveform of frame B2 force frame B3 of the singing voice correspond to each other! /, Then frame A2 and frame B2 to frame B3 are associated. Is done.

Next, the CPU 102 compares the characteristics of the speech waveform between corresponding frames. Specifically, the CPU 102 Fourier transforms the speech waveform for each speech waveform of each frame of the model speech. The CPU 102 obtains the logarithm of the amplitude spectrum obtained by the Fourier transform, and inversely transforms it to generate a spectrum envelope for each frame. The CPU 102 extracts the first formant frequency fl 1, the second formant frequency fl 2, and the third formant frequency fl 3 from the obtained spectral envelope.

Further, the CPU 102 performs Fourier transform on the voice waveform for each voice waveform of the singer's voice frame associated with each frame of the model voice. Then, the CPU 102 obtains the logarithm of the amplitude spectrum obtained by the Fourier transform, and inversely transforms it to generate a spectrum envelope for each frame. Then, the CPU 102 extracts the obtained spectrum envelope force frequency f21 of the first formant, frequency f22 of the second formant, and frequency 23 of the third formant.

[0024] For example, the CPU 102 generates a spectrum envelope of the frame A1 of the sample voice, and extracts the spectrum envelope forces fl1 to f13 of the first to third formants. Then, the CPU 102 generates a spectral envelope of the speech waveform of the frame B1 associated with the frame A1, and extracts the first to third formant formant frequencies f21 to f23 from the spectrum envelope. Further, the CPU 102 generates a spectrum envelope of the frame A2 of the sample voice, and the spectrum envelope force extracts the formant frequencies fl 1 to f 13 of the first to third formants. Then, the CPU 102 generates a spectrum envelope of the speech waveform from the frame B2 to the frame B3 associated with the frame A2, and extracts the formant frequencies f21 to f23 of the first to third formants from the spectrum envelope. .

[0025] Next, the CPU 102 extracts the formant frequencies fl 1 to fl 3 extracted from each frame of the sample voice and the frame force of the singer's voice associated with each frame of the sample voice f21 to Compare with f23. Then, the CPU 102 compares the difference between the formant frequency f11 and the formant frequency f21, the difference between the formant frequency f12 and the formant frequency f22, and the formant frequency f13 and the formant frequency f23. If the difference is greater than or equal to a predetermined value, mismatch information D indicating that the formant frequencies do not match is added to the frame of the model voice.

For example, if the formant frequency fl l to f13 of the speech waveform of frame A1 matches the formant frequency of the speech waveform of frame B1, the CPU 102 matches the speech between the corresponding frames. The discrepancy information D is not added to the frame A1.On the other hand, the formant frequencies fl 1 to fl3 of the frame A2 and the formant frequencies f21 to f23 of the speech waveform from the frames B2 to B3 If the difference is equal to or greater than the predetermined value, mismatch information D indicating that the formant frequencies are mismatched is added to frame A2.

[0026] When the CPU 102 determines a match Z mismatch between the formant frequency of the sound waveform of each frame of the singer and the formant frequency of the sound waveform of each frame of the sample sound for the sound waveform of each frame of the sample sound, Count the number N of frames with mismatch information D added. Next, the CPU 102 compares the total number M of the divided sample voice data frames with the value of the number N. If the value of the number N is more than half of the total number M of frames, the read lyrics data is The lyrics to represent! It is determined that the lyrics of the singer's pronunciation are different from the lyrics of the model voice. On the other hand, if the value of the number N is less than half of the total number M of frames, the lyrics expressed by the read lyrics data and the lyrics Judge that the lyrics are the same. For example, if the number N of discrepancy information is less than half of the total number of frames M for the voice “Kamere-onga” represented by the model voice data,

The CPU 102 determines that the lyrics pronounced by the singer are the same as the lyrics of the model voice. In the present embodiment, when the value of the number N is more than half of the total number M of frames, the lyrics expressed by the read lyrics data are different from the lyrics of the singer's pronunciation and the lyrics of the model voice. Determining power Total number of frames Number of M Number of N When the ratio of N is more than a predetermined ratio other than 50%, the lyrics expressed by the lyrics data read out are the lyrics of the singer and the lyrics of the model voice. May be determined to be different.

[0027] CPU 102 continues to count the elapsed time in parallel with the comparison between the model voice and the singing voice. When the counted elapsed time becomes 01:03, the time zone in the second row of the lyrics table shown in FIG. Read “01: 03—01: 06” and lyric data “Ichikita-ichi”. Further, when the singer sings in the read time section according to the reproduction of the music, the singing voice data is stored in the storage unit 105. Here, for example, if the singer sings with the wrong lyrics, and the singing is different from the lyric that the read singing data 2 shows, the song sings with the lyric. Singing voice data representing the voice “come” is generated and gc fe to the storage unit 105.

Next, the CPU 102 divides the waveform of the sound input to the microphone 4 in this time interval and the waveform of the model sound in this time interval into a plurality of frames. Then, the speech waveform of each frame of the model speech is associated with the speech waveform of each frame of the singing speech, and the formant frequencies of the speech waveforms are compared between the associated frames. Then, the CPU 102 judges whether the voice waveform of each frame of the model voice matches the formant frequency of the singer's voice waveform, and adds the mismatch information D, and then the divided sample voice data. The total number M of frames is compared with the value of the number N of frames with discrepancy information added to determine whether or not the singer has sung the lyrics correctly.

[0029] Here, because the singer sang with the lyrics "I came" with a different lyric from "I come", the formant frequency of the voice waveform of the model voice and the formant frequency of the singer's voice waveform And formant frequencies do not match, and the number N of mismatch information is equal to or greater than the total number M of frames. CPU102 has a value of number N that is more than half of the total number of frames M If it is, the lyrics expressed by the read lyrics data are determined to be different from the lyrics of the singer's pronunciation and the lyrics of the model voice, and the lyrics “Kitakita” expressed by the read lyrics data are displayed. The control unit 107 is controlled and displayed on the monitor 2 to notify that the lyrics are wrong.

Hereinafter, as described above, the CPU 102 repeats the reading of the lyrics data and the model voice data and the determination of the correctness of the lyrics sung by the singer as the music is played back. Then, when all performance event data is read, the karaoke accompaniment process is terminated.

As described above, according to the present embodiment, it is possible to determine whether or not the singer has sung according to the lyrics without performing voice recognition using a dictionary. In addition, in this embodiment, if there is voice data that is sung correctly according to the lyrics, it is possible to evaluate whether or not the song is correctly sung according to the lyrics, so that the system recognizes the language using a dictionary. For lyrics in various languages without complicating music, it is possible to evaluate the ability of the singer to learn the lyrics correctly!

[0032] [Modification]

The embodiment of the present invention has been described above, but the present invention is not limited to the above-described embodiment. For example, the above-described embodiment may be modified as follows to implement the present invention.

In the embodiment described above, the pitch of the voice represented by the singing voice data is corrected so that the pitch of the voice waveform represented by the singing voice data becomes the pitch of the voice waveform represented by the model voice data. Anyway.

[0034] In the above-described embodiment, periodic fluctuations in the pitch of the voice waveform represented by the model voice data are detected to determine whether or not the voice serving as the model is vibrato, and vibrato If it is determined that the pitch fluctuation of the voice waveform represented by the model voice data matches the pitch fluctuation of the voice waveform represented by the singing voice data, the singer sings with vibrato correctly. You can decide whether you have the power!

In addition, it detects the pitch variation of the voice waveform represented by the model voice data and makes it a model voice (singing a sound that is lower than the set pitch first and then approaching the original pitch) If there is a law or Determine the degree of coincidence between the pitch fluctuation of the voice waveform represented and the pitch fluctuation of the voice waveform represented by the singing voice data, and try to determine whether or not the singer is singing correctly. You can do it.

[0035] In the above-described embodiment, the voice waveform represented by the sample voice data and the voice waveform represented by the singing voice data are divided into a plurality of frequency bands by a plurality of bandpass filters, and the voice is represented for each frequency band. The correctness of the lyrics may be determined by determining the degree of coincidence of the feature quantities.

In the above-described embodiment, the model voice data representing the model voice waveform is stored, and the formant frequency is analyzed by analyzing the voice waveform represented by the model voice data. Is stored in advance in the storage unit 105, and the stored formant frequency is compared with the formant frequency of each frame of the singer's voice waveform to determine the degree of coincidence. You may decide to judge.

[0037] In the embodiment described above, the correctness of the lyrics sung by the singer may be determined after the singer has finished singing the music. In the above-described embodiment, when it is determined that the lyrics of the singer's pronunciation are different from the lyrics of the model voice, a message or image that informs the user that the lyrics are incorrect is not displayed. You may make it display on.

[0038] The present invention is based on a Japanese patent application filed on January 31, 2006 (Japanese Patent Application No. 2006-0222648), the contents of which are incorporated herein by reference.

Claims

The scope of the claims

[1] Karaoke equipment

A storage unit storing sample voice data representing a model voice when a song is sung according to lyrics;

A voice input unit for inputting the singing voice of the singer,

The model voice represented by the model voice data is divided into a plurality of model voice sections, and in the singing voice inputted to the voice input unit, the singing voice sections corresponding to the divided sample voice sections are divided. A specific part to identify;

An evaluation unit for performing an evaluation by comparing the singing voice of the singing voice section specified by the specifying unit with the model voice of the model voice section corresponding to the singing voice section;

A display unit for displaying the evaluation result of the evaluation unit;

Is provided.

[2] The karaoke apparatus according to claim 1,

The evaluation unit obtains a degree of coincidence between the singing voice of the singing voice section specified by the specifying part and the example voice of the example voice section corresponding to the singing voice section, and based on the obtained degree of coincidence. To evaluate.

[3] The karaoke apparatus according to claim 2,

The storage unit stores lyrics data representing the lyrics of the music,

When the degree of coincidence obtained by the evaluation unit is less than a predetermined value, the lyrics that the lyrics data stored in the storage unit represent the lyrics corresponding to the voice of the sample voice section in which the degree of coincidence is less than the predetermined value There is a lyrics specific part to specify from

The display unit displays the lyrics specified by the lyrics specifying unit.

[4] The karaoke apparatus according to claim 2,

The evaluation unit obtains a degree of coincidence between the formant frequency of the singing voice and the formant frequency of the model voice.

[5] The karaoke apparatus according to claim 4,

The evaluation unit obtains a first spectrum envelope from the speech waveform of the singing voice section specified by the specifying unit, and the hand of the sample voice section corresponding to the singing voice section is obtained. Obtain the speech waveform power of this speech, the second spectral envelope,

The evaluation unit extracts a formant frequency of the singing voice from the first spectrum envelope, and extracts a formant frequency of the sample voice from the second spectrum envelope.

[6] The karaoke apparatus according to claim 1,

The evaluation unit obtains a degree of coincidence between the voice waveform of the singing voice section of the singing voice section specified by the specifying section and the voice waveform of the model voice of the model voice section corresponding to the singing voice section. Evaluation is performed based on the degree of coincidence.

[7] The voice processing method is

The sample voice when the song is sung according to the lyrics expressed by the sample voice data is divided into multiple sample voice sections,

In the singing voice of the singer input to the voice input unit, the singing voice section corresponding to each divided example voice section is specified,

Comparing the singing voice of the specified singing voice section with the model voice of the model voice section corresponding to the singing voice section;

Based on the result of the comparison process, an evaluation is performed,

The result of the evaluation is displayed.

[8] In the comparison process, the degree of coincidence between the singing voice of the singing voice section specified by the specifying unit and the example voice of the example voice section corresponding to the singing voice section is obtained.

8. The voice processing method according to claim 7, wherein the evaluation process is performed based on the obtained degree of coincidence.

[9] The voice processing method according to claim 8,

When the degree of coincidence obtained by the evaluation process is less than a predetermined value, the lyrics of the song corresponding to the voice of the sample voice section in which the degree of coincidence is less than the predetermined value are represented by the song data. Identified from

The display process displays the specified lyrics on a display unit.

[10] The speech processing method according to claim 8,

The evaluation process includes a formant frequency of the singing voice and a format of the model voice. The degree of coincidence with the current frequency is obtained.

[11] The speech processing method according to claim 10,

The evaluation process includes

Obtaining the first waveform envelope of the voice waveform power of the singing voice section of the specified singing voice section, obtaining the second spectrum envelope from the voice waveform of the model voice of the model voice section corresponding to the singing voice section,

The formant frequency of the singing voice is extracted from the first spectrum envelope, and the formant frequency of the model voice is also extracted from the second spectrum envelope force.

[12] The speech processing method according to claim 7,

The evaluation processing is performed by obtaining a degree of coincidence between the voice waveform of the singing voice in the specified singing voice section and the voice waveform of the model voice in the model voice section corresponding to the singing voice section. Evaluation is based on the degree of agreement.