JP4678672B2 - Pronunciation learning device and pronunciation learning program

Info

Publication number: JP4678672B2
Application number: JP2005110310A
Other versions: JP2006251744A (en)
Other languages: Japanese (ja)
Inventor / Original Assignee: 誠 後藤
Prior art keywords: voice, segmentation, segmentation pattern, sound, timing
Legal status: Active (granted)
Abstract

PROBLEM TO BE SOLVED: To provide a pronunciation learning device that improves pronunciation of a foreign language by making the learner perceive the same number of single sounds as a native speaker does.

SOLUTION: Based on a segmentation pattern 009 that divides voice data 005 into a plurality of single-sound sections on the time axis, timing stimulus presenting means 012, which presents a timing stimulus 013 generated by timing stimulus generating means 011 so that the learner perceives the switching timing of the single-sound sections, is synchronized with voice presenting means 007 that presents the voice data 005.

COPYRIGHT: (C)2006, JPO&NCIPI

Description

The present invention relates to a pronunciation learning device and a pronunciation learning program for foreign-language learning, and also to a method for producing pronunciation learning materials for the same purpose.

Conventionally, pronunciation learning of foreign languages has been carried out by repeatedly listening to and imitating model voices recorded on cassette tapes or CDs. However, this method requires a great deal of learning time and does not necessarily lead to full acquisition. In recent years, therefore, interactive pronunciation learning devices using computer technology have been proposed. For example, a learning device has been put into practical use that extracts features from the speech uttered by the learner, displays them on a screen as a spectrum diagram or formant diagram, and feeds them back, so that the learner can proceed with learning while judging the defects of his or her own pronunciation. Even with such a device, however, it can hardly be said that the difficulty of learning the pronunciation of a foreign language has been fundamentally solved.
JP 2001-265211 A

One of the reasons why learners whose native language is Japanese are poor at pronouncing foreign languages is the phenomenon that, when listening to the same voice, a native speaker of the target language and the learner perceive different numbers of single sounds. For example, a Japanese learner perceives as one single sound "la" what an English native speaker perceives as two single sounds, "l" and "a". In the present invention, a sound unit that a speaker of a given native language psychologically feels when speaking or listening, such as a phoneme in English or a mora in Japanese, is called a single sound. In conventional pronunciation learning, this "single-sound number discrepancy phenomenon" has been regarded as inevitable.

The problem to be solved by the present invention is to provide a pronunciation learning device that drastically improves pronunciation by allowing the learner to perceive the same number of single sounds as a native speaker of the target language. In the following, English is used as an example of the target language, but the same effect can be obtained when another language such as German or French is the target language.

If the "single-sound number discrepancy phenomenon" during listening is considered in terms of an engineering model, it can be said to be a phenomenon in which the segmentation process performed as pre-processing for pattern recognition in the brain during listening varies depending on the listener's native language. FIG. 1 is an explanatory diagram schematically showing this difference in brain segmentation processing. When a continuous speech waveform 001 is input to the brain, perception in the Japanese-type segmentation style cuts it out as a single whole, as in the cut-out section 002, whereas perception in the English-type segmentation style cuts it out as two separate fragments, a cut-out section (consonant part) 003 and a cut-out section (vowel part) 004. Since these cut-out fragments serve as inputs to higher-order recognition functions, the number of single sounds perceived is considered to differ. If learning enables the learner to perceive in the same segmentation style as the native speaker, the pronunciation difficulty caused by this "single-sound number discrepancy" should be resolved.

The first solving means of the present invention aims to make the learner aware, during auditory learning, of the segmentation style in which the presented voice should be perceived. FIG. 2 is a basic configuration diagram of the pronunciation learning device according to claim 1. This pronunciation learning device comprises: voice data storage means 006 for storing voice data 005; segmentation pattern storage means 010 for storing a segmentation pattern 009 that divides the voice data into a plurality of sections on the time axis; timing stimulus generating means 011 for generating, from the segmentation pattern 009, a timing stimulus for causing the learner to perceive the switching timing of the sections; voice presenting means 007 for presenting the voice data 005; and timing stimulus presenting means 012 for presenting the timing stimulus 013 in synchronization with the presentation of the voice 008 by the voice presenting means.

In the present invention, a pattern that divides voice data into a plurality of sections on the time axis is called a segmentation pattern, and it is handled as information that can be associated with the voice data. It is expressed as the times corresponding to the section boundaries. In the present invention, the timing stimulus is a stimulus for making the learner perceive the switching timing of the sections, and it is given as a sensory stimulus through any sense other than smell and taste, which cannot be used because their time resolution is too low; that is, through vision, hearing, or touch.
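As a concrete illustration, a minimal Python sketch of one way a segmentation pattern might be represented and queried is given below. The names (SegmentationPattern, switch_times) and the example times are illustrative assumptions and do not appear in the specification.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class SegmentationPattern:
    """A pattern dividing voice data into sections on the time axis.

    Each single-sound section is a (start_time, end_time) pair in seconds;
    any time not covered by a single-sound section is treated as a
    background section (e.g. a gap or word-initial noise).
    """
    single_sound_sections: List[Tuple[float, float]]

    def switch_times(self) -> List[float]:
        """Return all section-boundary times, i.e. the instants at which a
        timing stimulus should change."""
        times = []
        for start, end in self.single_sound_sections:
            times.extend([start, end])
        return sorted(set(times))

# Example: the voice "la" segmented English-style into "l" and "a",
# with a short background (gap) section between them.
pattern_en = SegmentationPattern(single_sound_sections=[(0.05, 0.20), (0.23, 0.45)])
print(pattern_en.switch_times())   # [0.05, 0.2, 0.23, 0.45]
```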

Preferably, the segmentation pattern includes one or more single-sound sections and one or more background sections. The speech in a single-sound section is passed to the higher-order recognition functions in the brain as a recognition target and reaches consciousness as a single sound. The speech in a background section, on the other hand, is discarded as background sound and is not consciously perceived as a language sound. By classifying the sections in this way, the time zones that should be perceived as language sounds and the time zones that should not be so perceived can be presented explicitly to the learner. Time zones that should not be perceived include "transitional sounds" and "word-initial noise".

To designate a section corresponding to a transitional sound, a background section is provided as a gap between the end time of the preceding single sound and the start time of the following single sound. In the present invention, a background section that forms such a gap is called a gap section. A transitional sound is a concept from phonetics and a kind of articulatory linkage: it is the sound produced during the time period in which the articulators move from one phoneme to the next. Normally, the listener is not conscious that any sound exists in this transitional time zone, so the learner should come to relegate the sound of the transitional time zone unconsciously to the background, as is done in the target-language segmentation style. However, because speech from which only the transitional portion has been artificially removed sounds unnatural, the possibility cannot be denied that the sound of this section secondarily supports perception of the background section.

To designate a section corresponding to word-initial noise, a non-silent background section is provided before the start time of the first single sound. Word-initial noise is a term defined in the present invention, and is a phenomenon encountered, for example, in the following scene. When an English native speaker explains the difference between the "r" and "l" sounds, he or she may stretch out and pronounce only the consonant. A Japanese listener, however, regards the speech as a language sound only from the onset of the transitional sound, and tries to pay attention there no matter how long the "r" or "l" is sustained. In other words, in the Japanese segmentation style, when a consonant continues unnaturally at the beginning of a word, it is processed as noise and is not recognized as a language sound. Such a section is called word-initial noise in the present invention, and can be designated as a background section.

Preferably, a second segmentation pattern storage means different from the segmentation pattern storage means is provided, and the timing stimulus makes the learner perceive simultaneously the switching timing of the sections of the segmentation pattern and the switching timing of the sections of the second segmentation pattern stored in the second segmentation pattern storage means. With this configuration, the learner can learn while comparing the segmentation style of the native language with that of the target language.

Still preferably, the timing stimulus comprises single-sound phonetic symbols. With this configuration, the learner can learn while being conscious not only of the start and end times of each single sound but also of which single sound it is. In the present invention, a single-sound phonetic symbol is not limited to notation in International Phonetic Alphabet (IPA) symbols, but means any symbol that can distinguish single sounds. As symbols for a visual stimulus, ordinary alphabetic display or katakana display may be used; for example, a katakana notation that corresponds one-to-one with English single sounds may be used as English single-sound phonetic symbols. As a symbol for an auditory stimulus, the single sound pronounced in isolation may be presented as the phonetic symbol. As a tactile symbol, a braille symbol may be presented as the phonetic symbol.

The second solving means of the present invention suppresses perception of a single-sound string based on the native-language segmentation style by presenting to the learner, during auditory learning, voice variations that can occur only in the target language. FIG. 3 shows a basic configuration diagram of the pronunciation learning device according to claim 5. This pronunciation learning device comprises voice data acquisition means 016 for acquiring voice data 005 and voice presenting means 007 for presenting the voice data 005 acquired by the voice data acquisition means, the voice data being data that constitute a voice variation 014 composed of a plurality of voice data items 015 whose target-language-specific variables differ.

In the present invention, a voice perceived as natural is defined as "natural voice", and its voice data is called natural voice data. This means voice data obtained by instructing a native speaker of the target language to speak naturally, or voice data synthesized with standard parameters. A voice to which artificial adjustment has been applied is defined as "adjusted voice", and its voice data is called adjusted voice data. This means voice obtained by instructing a native speaker of the target language to utter intentionally differently from natural speech, voice obtained by artificially converting natural voice data by filtering, or voice obtained by speech synthesis with non-standard parameters.

In the present invention, a target-language-specific variable is a variable whose change alters the single-sound string perceived in the native-language segmentation style without altering the single-sound string perceived in the target-language segmentation style. FIG. 4 is an explanatory diagram of voice adjustment using a target-language-specific variable. The learner's native language and the target language are considered to have different ranges, along the axis of the target-language-specific variable 017, that are perceived as the same single-sound string. That is, in the native-language segmentation style only a narrow range 018 on this axis is perceived as that single-sound string, whereas in the target-language segmentation style a wide range 019 on this axis is so perceived. If the learner listens to the natural voice 020 in FIG. 4, it is perceived in the familiar native-language segmentation style, and no matter how hard the learner tries, it is difficult to perceive it in the target-language segmentation style. However, if the learner listens to the adjusted voice 021 in the figure, he or she is not dragged into the native-language segmentation style.

Therefore, the learner is first made to listen to the adjusted voice and to try to perceive it in the target-language segmentation style. If natural voice is heard immediately afterwards, the learner, having just perceived in the target-language segmentation style, becomes able to perceive the natural voice in the target-language segmentation style instead of the native-language segmentation style. Presenting the adjusted voice and the natural voice alternately is even more effective.

Specifically, the target-language-specific variable is one of the following: the time length of the background section between single sounds, the degree of artificial division between single sounds, the duration of a single sound, or an acoustic parameter of a single sound in the target language. A vector whose components are some of these four types of variables is also a target-language-specific variable. Note that a target-language-specific variable may be expressed absolutely, as a physical quantity with a unit such as seconds or Hz, or relatively, by dividing the distribution range of the variable into several levels and assigning the number of the level to which a value belongs.

The first option for the target-language-specific variable is the time length of the background section between single sounds of the target language. If a native speaker of the target language is instructed to utter so that the time length of the background section becomes sufficiently long, each single sound is pronounced in isolation. As an example, consider the case where "la" uttered at natural speed is used as the natural voice, and "l" and "a" pronounced in isolation are used as the adjusted voice. In the English-type segmentation style, both of these voices are perceived as the single-sound string "l + a" when viewed as a discrete symbol string. In the Japanese-type segmentation style, on the other hand, the natural voice is perceived as the one single sound "la", whereas the adjusted voice is perceived as the single-sound string "l + a". Thus, the single-sound string perceived in the target-language segmentation style does not change between the natural voice and the adjusted voice, while the single-sound string perceived in the learner's native-language segmentation style does change. Therefore, "the time length of the background section between single sounds" can be selected as a target-language-specific variable.

The second option for the target-language-specific variable is the degree of artificial division between single sounds of the target language. In the present invention, the degree of artificial division means the degree to which the amplitude gain is reduced when, by signal processing such as filtering, the amplitude gain is lowered locally only in the speech waveform of the transitional portion so that the preceding and following single sounds are perceived as artificially separated. In the English-type segmentation style the transitional sound is not consciously perceived, so although there is a slight sense of unnaturalness, the perceived single-sound string does not change before and after filtering. In the Japanese-type segmentation style, on the other hand, the transitional portion plays an important role as part of the single sound, so the consonant can no longer be heard well.

A third option for the target-language-specific variable is the duration of a single sound of the target language. In the English-type segmentation style, the length of consonants classified as continuants, such as fricatives and liquids, can be varied freely; in the Japanese-type segmentation style, it is not possible to pronounce with varying consonant lengths. In general, if speakers of a certain segmentation style can produce such variations, they are also accustomed to those variations as listeners, so the variations are readily recognized as the same class when perceived. Conversely, if speakers cannot produce such variations, listeners are not used to hearing them and cannot recognize them as the same class.

For example, if an English native speaker is asked to prolong the duration of "r" in "ra", the percept in the Japanese-type segmentation style changes from "ra" to something like "u-ra". This "u" is a sound heard more as noise than as the Japanese "u", but here the appearance of this noise-like sound is also regarded as a change in the single-sound string. When perceived in the English-type segmentation style, on the other hand, the duration of one single sound is simply extended, and the resulting single-sound string does not change.

A fourth option for the target-language-specific variable is an acoustic parameter of a single sound of the target language. The acoustic parameter represents, for example, the pitch or loudness of the sound. In the English-type segmentation style, even if "l" and "a" are intentionally uttered at different pitches or loudness, they are originally different single sounds, so the perceived single-sound string does not change. When perceived in the Japanese-type segmentation style, however, the transitional portion changes in a way that cannot occur in Japanese, making the voice difficult to hear.
A fifth option for the target-language-specific variable is a combination of the first to fourth options. In this case a vector whose components are the combined variables can be regarded as the target-language-specific variable.

Preferably, the voice variation includes a plurality of voice data items whose target-language-specific variables differ from that of the natural voice data. With this configuration, the target-language-specific variable can be adjusted step by step. FIG. 5 is an explanatory diagram showing stepwise voice adjustment. Assume that the voice can change continuously along the target-language-specific variable 017, and that there are several adjusted voices on this axis besides the natural voice. Listening to the natural voice 020 immediately after listening to the first adjusted voice 022, which differs greatly from the natural voice, carries the risk of pulling the segmentation style back to the native-language style. Therefore, after listening to the first adjusted voice 022, the learner listens to the second adjusted voice 023, which does not differ much from it, and practices perceiving it in the target-language segmentation style. When sufficient practice has been completed, learning advances to the third adjusted voice 024, which is a little more natural. In this way, the difficulty of perceiving in the target-language segmentation style can be overcome in stages.

The voice data acquisition means acquires the voice data constituting a voice variation by one of the following three methods. FIG. 6 shows a configuration diagram of the pronunciation learning device of claim 5 including the configuration of the first option of the voice data acquisition means: natural voice data storage means 026 for recording natural voice data 025 and segmentation pattern storage means 010 for storing a segmentation pattern 009 that divides the natural voice data into a plurality of sections are provided, and the voice data acquisition means 027 obtains the voice data 005 by applying filtering to the natural voice data, using as parameters the segmentation pattern 009 stored in the segmentation pattern storage means 010 and the target-language-specific variable 017.

The second option of the voice data acquisition means is characterized in that the voice data are acquired by reading them from the computer-readable medium according to claim 7 on which the voice variation is stored.
The third option of the voice data acquisition means is characterized in that the voice data are acquired by performing speech synthesis using the target-language-specific variable as a parameter.

Further, the voice data acquisition means determines which voice data to acquire by one of the following three methods. The first option for data determination is to acquire the voices in order from the voice variation as described above: in FIG. 4 the presentation order is, for example, adjusted voice 021 then natural voice 020, and in FIG. 5 the first adjusted voice 022 is presented, then the second adjusted voice 023, and so on in a predetermined order such as third adjusted voice 024 and finally natural voice 020. The second option for data determination is to pick up voices in random order from the voice variation; for example, in FIG. 5 the adjusted voices are presented in random order. After learning has progressed to some extent with the sequential presentation method, additional learning with the random presentation method reinforces auditory learning in a state where the learner does not know which adjusted voice will be presented.

FIG. 7 shows a block diagram of a pronunciation learning device that determines the voice data by the third data-determination option. This pronunciation learning device has input means 028, and the voice data acquisition means acquires the voice data constituting the voice variation in accordance with the input obtained from the input means. With this configuration, the voice data can be determined based on input from the learner.

The third solving means of the present invention aims to feed back to the learner, at the time of vocal learning, what kind of segmentation pattern results when the speech uttered by the learner is perceived in the target-language segmentation style. FIG. 8 shows a basic configuration diagram of the pronunciation learning device according to claim 9. This pronunciation learning device comprises voice input means 029 for inputting speech, segmentation means 030 for recognizing a segmentation pattern 009 from the voice data 005 input by the voice input means, and segmentation pattern feature presenting means 031 for presenting a feature of the segmentation pattern 009.

In the present invention, a feature of the segmentation pattern is a value expressed as a function that takes the segmentation pattern as input; examples are the length of a single-sound section, the length of a background section between single sounds, the segmentation pattern itself, and an evaluation value indicating its appropriateness.
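As an illustration only, a minimal Python sketch of two such feature functions is given below; the section boundaries and function names are assumptions made for the example and are not part of the specification.

```python
from typing import List, Tuple

def single_sound_lengths(sections: List[Tuple[float, float]]) -> List[float]:
    """Length of each single-sound section, in seconds."""
    return [end - start for start, end in sections]

def gap_lengths(sections: List[Tuple[float, float]]) -> List[float]:
    """Length of each background (gap) section between consecutive single sounds."""
    ordered = sorted(sections)
    return [next_start - prev_end
            for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])]

# Example: English-style pattern for "la" with a gap between "l" and "a"
sections = [(0.05, 0.20), (0.23, 0.45)]
print(single_sound_lengths(sections))  # approximately [0.15, 0.22]
print(gap_lengths(sections))           # approximately [0.03]
```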

Preferably, the pronunciation learning device according to claim 9 further comprises voice data storage means for storing voice data and segmentation pattern storage means for storing a segmentation pattern that divides the voice data into a plurality of sections on the time axis, and the segmentation means performs DP matching between the stored voice data, used as a collation pattern, and the input voice data. With this configuration, segmentation can be performed for a plurality of segmentation styles by switching collation patterns.

By using the pronunciation learning device of the present invention, the effect of pronunciation learning can be improved by eliminating the phenomenon in which the learner's native language and the target language yield different numbers of single sounds even when the same voice is heard, as with "la" versus "l + a" or "vowel + tsu" versus "vowel + t + s".
By using the pronunciation learning device according to claim 1, the learner can be made aware, during auditory learning, of the segmentation style in which the presented voice should be perceived.
By using a computer-readable medium storing data produced by the pronunciation learning material manufacturing method according to claim 4 and reproducing it with a general-purpose media player, the same learning effect as learning with the pronunciation learning device of claim 1 can be obtained.

By using the pronunciation learning device according to claim 5, perception of a single-sound string based on the native-language segmentation style can be suppressed by presenting the learner, during auditory learning, with voice variations that can occur only in the target language.
By using the pronunciation learning device according to claim 9, the learner can be given feedback, at the time of vocal learning, on how the speech he or she utters is perceived when it is perceived in the target-language segmentation style.

Embodiments of the present invention include the following configurations.
(Configuration 1) The pronunciation learning device according to claim 1, wherein the plurality of sections include one or more single-sound sections and one or more background sections.
(Configuration 2) The pronunciation learning device according to claim 1, further comprising second segmentation pattern storage means different from the segmentation pattern storage means, wherein the timing stimulus makes the learner perceive simultaneously the switching timing of the sections of the segmentation pattern and the switching timing of the sections of the second segmentation pattern stored in the second segmentation pattern storage means.
(Configuration 3) The pronunciation learning device according to claim 1, wherein the timing stimulus is composed of single-sound phonetic symbols.

(Configuration 4) The pronunciation learning device according to claim 5, wherein the voice variation includes a plurality of voice data items whose target-language-specific variables differ from that of the natural voice.
(Configuration 5) The pronunciation learning device according to claim 5, further comprising natural voice data storage means for recording natural voice data and segmentation pattern storage means for storing a segmentation pattern that divides the natural voice data into a plurality of sections, wherein the voice data acquisition means acquires the voice data by applying filtering to the natural voice data using the segmentation pattern and a target-language-specific variable as parameters.

(Configuration 6) The pronunciation learning device according to claim 5, wherein the plurality of sections include one or more single-sound sections and one or more background sections, the target-language-specific variable is the degree of artificial division, and the filtering reduces the amplitude of the speech waveform corresponding to a background section in accordance with the degree of artificial division.
(Configuration 7) The pronunciation learning device according to claim 5 or Configuration 4, wherein the voice data acquisition means acquires the voice data by reading them from the medium of claim 7.
(Configuration 8) The pronunciation learning device according to claim 5 or Configuration 4, wherein the voice data acquisition means acquires the voice data by performing speech synthesis using a target-language-specific variable as a parameter.

(Configuration 9) The pronunciation learning device according to claim 5 or any of Configurations 4 to 8, wherein the voice data acquisition means acquires the voice data constituting the voice variation sequentially.
(Configuration 10) The pronunciation learning device according to claim 5 or any of Configurations 4 to 8, wherein the voice data constituting the voice variation are acquired in random order.
(Configuration 11) The pronunciation learning device according to any of Configurations 4 to 8, further comprising input means, wherein the voice data acquisition means acquires the voice data constituting the voice variation in accordance with the input obtained from the input means.

(Configuration 12) The pronunciation learning device according to Configuration 11, wherein the input means is voice input means, segmentation means for recognizing a segmentation pattern from the voice data input by the voice input means is provided, and the voice data acquisition means acquires the voice data constituting the voice variation in accordance with a feature of the segmentation pattern.
(Configuration 13) The pronunciation learning device according to Configuration 11 or 12, wherein the voice data acquisition means acquires the voice data constituting the voice variation in accordance with the target-language-specific variable of the voice data presented immediately before.
(Configuration 14) The computer-readable medium according to claim 7, wherein a target-language-specific variable is stored in association with each of the voice data items constituting the voice variation.

(Configuration 15) The pronunciation learning device according to claim 9, wherein the feature is the length of a single-sound section.
(Configuration 16) The pronunciation learning device according to claim 9, wherein the feature is the length of a background section between single sounds.
(Configuration 17) The pronunciation learning device according to claim 9, wherein the feature is an evaluation value representing the appropriateness of the segmentation pattern.
(Configuration 18) The pronunciation learning device according to claim 9, wherein the feature is the segmentation pattern itself.
(Configuration 19) The pronunciation learning device according to claim 9, further comprising voice data storage means for storing voice data and segmentation pattern storage means for storing a segmentation pattern that divides the voice data into a plurality of sections on the time axis, wherein the segmentation means performs DP matching between the voice data, used as a collation pattern, and the input voice data.

An example of the pronunciation learning device of claim 1 is shown. The voice data storage means 006 stores voice data obtained by converting, through a microphone, the voice produced by a native speaker of the target language. The segmentation pattern storage means 010 stores a segmentation pattern 009 that an operator has manually input in advance while viewing a screen on which the corresponding voice data 005 is visualized. The voice data and the segmentation pattern are recorded on a computer-readable medium according to claim 3 and loaded into the voice data storage means and the segmentation pattern storage means when necessary.

FIG. 9 shows an example of a GUI (graphical user interface) screen used when the operator manually inputs a segmentation pattern. FIG. 9(a) shows the screen before input. On the screen, the voice data are visualized as a spectrum diagram with frequency on the vertical axis and time on the horizontal axis. In the figure only the first to third formants are drawn schematically, but it is desirable to visualize all of the spectral information as a grayscale image. As for the visualization method, any diagram that includes a time axis, such as a spectrum diagram or a speech waveform diagram, allows the times to be input manually. Here it is assumed that the voice corresponding to the single-sound string "la" is presented. While viewing this figure, the operator designates the start and end times of each single-sound section using an input device such as a mouse; in this case, positions corresponding to the three times T0, T1, and T2 are designated. Then, as shown in FIG. 9(b), vertical dotted lines are displayed to express that the three times T0, T1, and T2 have been input. As a result, a segmentation pattern indicating division into two single-sound sections, namely the set of times (T0, T1, T2), can be manually input.

FIG. 10 shows an example of the timing stimulus generated by the timing stimulus generating means 011. In this example it is presented to the learner as a visual stimulus on a computer screen. The timing stimulus generating means reads the segmentation pattern input in FIG. 9, interprets that two single sounds are included, and generates an image sequence of still images consisting of image A 102 and image B 103 sandwiched between two blank images 101. The image sequence and the set of time information for switching images are delivered to the timing stimulus presenting means 012. In the example of FIG. 10, a circle whose color changes for each single sound is drawn. As in this example, the visual stimulus used as the timing stimulus 013 is preferably composed of simple figures or the like so that the switching timing is easy to grasp even when the screen switches at high speed.

FIG. 11 shows a flowchart for the case where the timing stimulus presenting means presents the timing stimulus shown in FIG. 10. To synchronize with the presentation of the voice data, the timing stimulus presenting means 012 resets a timer T to T = 0 when the presentation of the voice data is started (104) and presents a blank image (105). T is then counted up with time; T is compared with T0 (106), and when T ≥ T0 image A is presented (107); T is compared with T1 (108), and when T ≥ T1 image B is presented (109); T is compared with T2 (110), and when T ≥ T2 a blank screen is presented (105). While listening to the voice, the learner simultaneously perceives the timing stimulus and thereby perceives the switching timing of the single-sound sections.
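A minimal Python sketch of this presentation loop is shown below; the timer granularity, the `show` callback, and the example boundary times are illustrative assumptions, not part of the specification.

```python
import time

def present_timing_stimulus(t0: float, t1: float, t2: float, show) -> None:
    """Present blank -> image A -> image B -> blank in sync with voice playback.

    `show` is a callback that displays one of "blank", "A", or "B";
    t0, t1, t2 are the section-boundary times in seconds (T0, T1, T2 in FIG. 11).
    """
    start = time.monotonic()           # corresponds to resetting timer T = 0
    show("blank")
    current = "blank"
    while True:
        t = time.monotonic() - start   # timer T counted up with real time
        if t >= t2:
            show("blank")              # end of the last single-sound section
            break
        elif t >= t1:
            desired = "B"
        elif t >= t0:
            desired = "A"
        else:
            desired = "blank"
        if desired != current:         # switch the image only at boundary times
            show(desired)
            current = desired
        time.sleep(0.001)

# Example: print the image name instead of drawing to a screen
present_timing_stimulus(0.05, 0.20, 0.45, show=lambda name: print(name))
```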

Another example of the pronunciation learning device according to claim 1 is shown. The basic embodiment is the same as in the first example, except that the segmentation pattern includes a gap section between single sounds (Configuration 1), and the form of the timing stimulus differs accordingly. FIG. 12 shows a schematic spectrum diagram of the voice "la" and an example of a segmentation pattern, including a gap section, set for it. Here the start time and end time of each single sound are set as the segmentation pattern, so in this case the segmentation pattern is ((T0, T3), (T4, T2)).

FIG. 13 shows an example of the timing stimulus generated from this segmentation pattern and delivered to the timing stimulus presenting means. In this case it suffices to alternate the blank image 101 and image A 102, and there is no need to change the stimulus for each single sound, because the start and end timing of each single sound can be clearly perceived through the blank image corresponding to the background section in the segmentation pattern.

Another example of the pronunciation learning device according to claim 1 is shown. The basic embodiment is the same as in the second example, except that both an English pattern and a Japanese pattern are associated with the same voice as segmentation patterns (Configuration 2), and the form of the timing stimulus differs accordingly. FIG. 14 shows a schematic spectrum diagram of the voice "la" together with an English pattern, FIG. 14(a), and a Japanese pattern, FIG. 14(b), set for it separately. In this case the English pattern is ((T0, T3), (T4, T2)) and the Japanese pattern is ((T5, T2)).

FIG. 15 shows an example of the timing stimulus obtained from these two segmentation patterns. Here, to make it easier for the learner to understand the correspondence between the voice and the segmentation patterns, the spectrum diagram of the voice data is displayed at the same time during pronunciation learning. A white vertical bar 301 is moved along the time axis in synchronization with the voice presentation. The region 302 below the vertical bar is changed to red during the English single-sound sections, that is, while T0 ≤ T < T3 and T4 ≤ T < T2; the region 303 above the vertical bar is changed to red during the Japanese single-sound section, that is, while T5 ≤ T < T2. In this way, the timing stimulus presenting means presents a timing stimulus that conveys the switching timing of the sections of the two segmentation patterns simultaneously, so that the learner can learn pronunciation while contrasting the segmentation styles of the native language and the target language.

Another example of the pronunciation learning device according to claim 1 is shown. The basic embodiment is the same as in the second example, except that phonetic symbols corresponding to the single sounds of the target language are presented as the timing stimulus (Configuration 3). First, when the operator manually inputs the segmentation pattern, he or she also inputs, at the same time, which single sound each section corresponds to. When the timing stimulus is presented, the phonetic-symbol image corresponding to each single sound is presented. FIG. 16 shows an example of the presentation; here, letters of the alphabet are used as phonetic symbols, and instead of image A 102 shown in FIG. 13, an image 401 displaying "L" and an image 402 displaying "A" are presented.

In this way, the timing stimulus presenting means presents single-sound phonetic symbols so that the learner can understand which single sound is being presented.
In the first to fourth examples, if the screen transitions of the presented timing stimulus are reconstructed as moving-image data and stored on a computer-readable medium as multimedia data together with the voice data, this constitutes an example of the pronunciation learning material manufacturing method of claim 4.

An example of the pronunciation learning material manufacturing method according to claim 4 is shown (Configuration 3). As the segmentation pattern, the same pattern as illustrated in FIG. 12 is used. The timing stimulus generation step reads this segmentation pattern and generates an auditory stimulus as the timing stimulus. More specifically, monaural voice data are synthesized by taking each single sound pronounced in isolation by a native speaker as material and arranging them in synchronization with the start times of the single-sound sections. In the data storage step, stereo voice data are recorded on the recording medium such that the monaural voice data created in this way are used for the left channel and the original continuous voice is used for the right channel. By reproducing these voice data with an ordinary player through stereo headphones, the learner hears the model voice from the right ear and, from the left ear, a timing stimulus consisting of single-sound phonetic symbols given as an auditory stimulus.
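A minimal sketch of the data storage step, assuming 16-bit mono arrays already prepared for both channels, is shown below; the function name, sampling rate, and synthetic test tones are illustrative assumptions only.

```python
import wave
import numpy as np

def write_stereo(path: str, right_model: np.ndarray, left_timing: np.ndarray,
                 rate: int = 16000) -> None:
    """Store the model voice on the right channel and the auditory timing
    stimulus on the left channel of one stereo WAV file.

    Both inputs are int16 mono arrays of equal length sampled at `rate` Hz.
    """
    stereo = np.empty((len(right_model), 2), dtype=np.int16)
    stereo[:, 0] = left_timing     # left ear: isolated single sounds (timing stimulus)
    stereo[:, 1] = right_model     # right ear: original continuous model voice
    with wave.open(path, "wb") as f:
        f.setnchannels(2)
        f.setsampwidth(2)          # 16-bit samples
        f.setframerate(rate)
        f.writeframes(stereo.tobytes())

# Example with two seconds of synthetic tones standing in for real recordings
t = np.arange(2 * 16000) / 16000
model = (0.3 * np.sin(2 * np.pi * 220 * t) * 32767).astype(np.int16)
timing = (0.3 * np.sin(2 * np.pi * 440 * t) * 32767).astype(np.int16)
write_stereo("timing_stimulus_demo.wav", right_model=model, left_timing=timing)
```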

An example of the pronunciation learning device according to claim 5 will be described (Configuration 7, Configuration 9). The duration of a single sound is used as the target-language-specific variable. As the natural voice, "continuant consonant + vowel" pronounced naturally by a native speaker is used. As the adjusted voice, the same utterance with the consonant part intentionally pronounced long by the native speaker is used. The native speaker should be asked to keep the utterance conditions other than the consonant duration as unchanged as possible.
The voice data recorded and collected in this way are regarded as voice variations, each consisting of a set of natural voice data and adjusted voice data, are given voice variation numbers, and are recorded on the recording medium according to claim 7 in the format shown in FIG. 17. The contents of the voice variations differ from one another in the single-sound string to be learned; for example, voice variation 1 is "la", voice variation 2 is "li", and voice variation 3 is "lu".

The control procedure of the pronunciation learning device during learning is as follows. First, the first voice variation is targeted. The voice data acquisition means 016 searches the recording medium using the voice variation number as a key and reads natural voice data 1. The voice presenting means 007 presents the read voice data to the learner. After a few seconds, the voice data acquisition means reads adjusted voice data 1, and the voice presenting means presents the read voice data to the learner.
The natural voice and the adjusted voice may each be presented only once, but it is more effective to continue presenting them alternately. When the learner feels that the first voice variation has been learned sufficiently, he or she presses the "next" button on the GUI to proceed to learning of the second voice variation.

An example of the pronunciation learning device according to claim 5 will be described (Configuration 4, Configuration 5, Configuration 6, Configuration 11, Configuration 13). The degree of artificial division is used as the target-language-specific variable. As the natural voice, "consonant + vowel" pronounced naturally by a native speaker is used. A segmentation pattern is assumed to have been manually input in advance for the natural voice data by an operator.
The adjusted voice data are obtained by filtering the natural voice data with a weighting function derived from the segmentation pattern. FIG. 18 is a diagram explaining the method of generating adjusted voices while varying the degree of artificial division (v) by filtering.

FIG. 18(a) shows the natural speech waveform. Since the segmentation pattern has been input manually in advance, it is known which part of the speech waveform the gap section 701 corresponds to.
FIG. 18(b) shows the weighting function for v = 100%. The weighting function at v = 100% is a function that is approximately 0 inside the gap section and approximately 1 outside it; however, a step function would generate high-frequency noise in the adjusted voice, so it is desirable to use a smoothly changing function such as a Gaussian. The adjusted voice for v = 100% is obtained by multiplying the natural speech waveform of FIG. 18(a) and the weighting function of FIG. 18(b) at each time t.

FIG. 18(c) shows the weighting function for v = 50%, and FIG. 18(d) that for v = 25%. In general, if the weighting function at v = 100% is W(t), the weighting function at v = V% is obtained as {1 - (1 - W(t)) × V / 100}. Multiplying the weighting function obtained in this way by the natural speech waveform of FIG. 18(a) yields adjusted voice data corresponding to an arbitrary degree of artificial division.
As described above, the voice data can be acquired by applying filtering to the natural voice data with the segmentation pattern and the target-language-specific variable as parameters (Configuration 5).
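A minimal numpy sketch of this filtering is shown below. It assumes 16 kHz audio, and a simple moving-average smoothing stands in for the smoothly changing (e.g. Gaussian) window mentioned in the text; the function and variable names are illustrative.

```python
import numpy as np

def gap_weight_full(n_samples: int, gap_start: int, gap_end: int,
                    smooth: int = 80) -> np.ndarray:
    """Weighting function W(t) for v = 100%: ~0 inside the gap section,
    ~1 outside it, with smoothed edges to avoid high-frequency noise."""
    w = np.ones(n_samples)
    w[gap_start:gap_end] = 0.0
    if smooth > 1:                      # simple moving-average smoothing
        kernel = np.ones(smooth) / smooth
        w = np.convolve(w, kernel, mode="same")
    return w

def adjusted_voice(natural: np.ndarray, gap_start: int, gap_end: int,
                   v_percent: float) -> np.ndarray:
    """Apply the artificial-division filtering:
    W_V(t) = 1 - (1 - W(t)) * V / 100, then multiply the waveform by W_V."""
    w_full = gap_weight_full(len(natural), gap_start, gap_end)
    w_v = 1.0 - (1.0 - w_full) * v_percent / 100.0
    return natural * w_v

# Example on a synthetic 0.5 s waveform with a gap section at 0.20-0.23 s
rate = 16000
x = np.random.randn(int(0.5 * rate)) * 0.1
y50 = adjusted_voice(x, gap_start=int(0.20 * rate), gap_end=int(0.23 * rate),
                     v_percent=50)
```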

FIG. 19 is a flowchart specifically illustrating the procedure for adjusting the degree of artificial division, which is the target-language-specific variable, using input obtained from the input means (Configuration 11, Configuration 13). Buttons A and B are provided as input devices. First, as the initial setting, the degree of artificial division (v) is set to 100% (702). Next, the adjusted voice for the set degree of division is presented (703). The input is then checked (704); if there is no input, the device keeps waiting, but when a press of button A is detected, v is decreased by 5% and the adjusted voice for the new value of v is presented (705), and when a press of button B is detected, v is increased by 5% and the adjusted voice for the new value of v is presented (706). This is repeated many times. Since v cannot go outside the range 0% to 100%, when v reaches a boundary value it is kept there. Note that the adjusted voice at v = 0% is identical to the natural voice, because the weighting function is then the constant 1.
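A minimal sketch of this interactive loop, with the button reader and the voice player supplied as callbacks, is shown below; the callback signatures and the console wiring are assumptions made for the example.

```python
def adjust_division_loop(get_button, present_adjusted_voice) -> None:
    """Interactive adjustment of the artificial-division degree v (cf. FIG. 19).

    `get_button` blocks until the learner presses a button and returns
    "A", "B", or any other value to finish; `present_adjusted_voice(v)`
    plays the adjusted voice for the given v.
    """
    v = 100                                  # initial setting: fully divided
    present_adjusted_voice(v)
    while True:
        button = get_button()
        if button == "A":
            v = max(0, v - 5)                # more joined, closer to natural voice
        elif button == "B":
            v = min(100, v + 5)              # more clearly divided
        else:
            break                            # learner finished
        present_adjusted_voice(v)

# Example wiring with console input and a dummy player
if __name__ == "__main__":
    adjust_division_loop(
        get_button=lambda: input("A = join more, B = separate more, other = quit: ").strip().upper(),
        present_adjusted_voice=lambda v: print(f"presenting adjusted voice, v = {v}%"),
    )
```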

Viewed from the learner's side, the operation of this device proceeds as follows. First, since the adjusted voice with an artificial-division degree of 100% is presented as the initial setting, the consonant and the vowel are heard clearly separated. Thereafter, whenever the consonant and the vowel are heard as separate, the learner presses button A, which lowers the degree of artificial division by 5% and presents an adjusted voice that is more joined. The learner must always try to hear the consonant and the vowel as separate, but despite the effort they sometimes merge and are heard as one single sound, like a Japanese single sound. In that case the learner presses button B, which raises the degree of artificial division by 5% and presents an adjusted voice that is more separated and easier to hear. If this process is continued, the degree of artificial division oscillates up and down around a certain constant value.

If this learning is continued every day, the degree of artificial division around which the process oscillates gradually shifts to the smaller side as proficiency increases, until the consonant and the vowel are perceived as separate even when the adjusted voice at 0% artificial division, that is, the natural voice, is heard. Learning is then complete.
The seventh example becomes an example of Configuration 7 if a function of storing the data obtained by the voice data acquisition means on a computer-readable medium is added.

An example of the method for manufacturing the pronunciation learning material according to claim 8 is one in which all the voice variations obtained with the filtering described in the seventh example are collected into a single voice file, in ascending or random order, and stored on a computer-readable medium. By reproducing this voice file with a standard media player, the same effect as practicing with the learning device of Configuration 9 or Configuration 10 can be obtained.

An example of the pronunciation learning device according to claim 5 will be described (Configuration 9, Configuration 10). As the target-language-specific variable, the time length of the background section between single sounds is used. In the English-type segmentation style, the length of the transitional part between a consonant and a vowel can be varied when pronouncing; in the Japanese-type segmentation style, the consonant and the vowel are integrated into one single sound, so it is not possible to pronounce while varying the length of the transitional part between them.

As the natural voice, "consonant + vowel" pronounced naturally by a native speaker is used. As the adjusted voices, utterances in which the native speaker intentionally pronounces with various gap lengths between the single sounds are used; however, the gap sections of the adjusted voices are all longer than the gap section of the natural voice. The recorded voice data are structured in the format shown in FIG. 20 by assigning voice variation numbers to the natural and adjusted voices and attaching the length of its gap section to each adjusted voice, and are recorded on the recording medium according to claim 7 (Configuration 14). The length of a gap section can be obtained by manually inputting a segmentation pattern for each voice data item.

The pronunciation learning procedure is as follows. In one configuration, the voice data are presented in order (Configuration 9). First, the voice data acquisition means reads all the adjusted voice data included in the first voice variation (adjusted voice data 11, adjusted voice data 12, ...) and sorts them in descending order of gap-section length. The voice data are then presented in order, starting from the longest gap section. When all of them have been presented, the voice data acquisition means reads natural voice data 1 from the medium and presents it through the voice presenting means. In this way, presentation proceeds in order from long gap sections to short ones.

As another configuration, the voice data can be presented in random order (Configuration 10). In this case a random number is generated within the range that the gap-section length can take, and the adjusted voice data whose gap-section length is closest to that random value is selected. The selected adjusted voice data is then presented a few seconds after the natural voice is presented. Adjusted voice data with randomly chosen gap-section lengths continue to be presented until the learner presses the "next" button on the GUI and proceeds to the next set of pronunciation learning. Listening practice while the target-language-specific variable is varied randomly in this way is useful for improving listening ability, from the viewpoint of coping with the variations that can occur in conversation.
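A minimal sketch of both selection strategies described in the two preceding paragraphs (ordered presentation, Configuration 9, and random presentation, Configuration 10) is given below; the dictionary fields, the fixed repetition count, and the `present` callback are illustrative assumptions.

```python
import random
from typing import Dict, List

def ordered_presentation(adjusted: List[Dict], natural: Dict, present) -> None:
    """Configuration 9: present adjusted voices from longest gap to shortest,
    then the natural voice. Each item is a dict with 'gap_ms' and 'audio'."""
    for item in sorted(adjusted, key=lambda a: a["gap_ms"], reverse=True):
        present(item)
    present(natural)

def random_presentation(adjusted: List[Dict], natural: Dict, present,
                        repetitions: int = 5) -> None:
    """Configuration 10: after the natural voice, repeatedly pick the adjusted
    voice whose gap length is closest to a random value in the observed range."""
    lo = min(a["gap_ms"] for a in adjusted)
    hi = max(a["gap_ms"] for a in adjusted)
    for _ in range(repetitions):
        present(natural)
        target = random.uniform(lo, hi)
        closest = min(adjusted, key=lambda a: abs(a["gap_ms"] - target))
        present(closest)

# Example with dummy data; `present` would normally play the audio
variation = [{"gap_ms": 150, "audio": "adj_11"}, {"gap_ms": 220, "audio": "adj_12"}]
natural = {"gap_ms": 60, "audio": "nat_1"}
ordered_presentation(variation, natural, present=lambda a: print("play", a["audio"]))
random_presentation(variation, natural, present=lambda a: print("play", a["audio"]))
```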
Since the segmentation pattern is manually input in Example 8, it is even more effective when used in combination with the configuration of the pronunciation learning device of claim 1.

An example of the pronunciation learning device according to claim 5 will be described (Configuration 8, Configuration 11). As the target-language-specific variable, the "pitch of a voiced consonant" is used as an example of a single-sound acoustic parameter. In the English-type segmentation style, although the result sounds somewhat unnatural, the pitch of a voiced consonant can be changed independently of the pitch of the vowel; in the Japanese-type segmentation style, the pitch of a voiced consonant cannot be changed independently of the pitch of the vowel.

As the input means, a slider bar on the GUI and a voice presentation button are used. When the learner presses the voice presentation button, if the slider bar is at the center position the voiced consonant is synthesized at the same fundamental frequency as the vowel and presented as the natural voice; if the slider bar deviates from the center position, the fundamental frequency of the voiced consonant is shifted from that of the vowel in accordance with the slider position, and the result is synthesized and presented as an adjusted voice. By freely adjusting the pitch of the voiced consonant through the slider input, the learner can quickly match the presentation to his or her own level of learning.

An example of the pronunciation learning device of claim 5 is shown (Configuration 4, Configuration 5, Configuration 6, Configuration 11, Configuration 13). The configuration other than the target-language-specific variable is the same as that of the pronunciation learning device shown in the seventh example. As the target-language-specific variable, a two-dimensional vector is used whose first component is the length of the background section between single sounds and whose second component is the degree of artificial division.

In Example 7, the degree of artificial division was implemented in 21 levels: 0%, 5%, ..., 100%. Here the 21 levels are divided into three groups of seven levels each. To obtain the speech waveform that serves as the basis for the first filtering (the weighting filter based on the degree of artificial division described above), a second filtering that changes only the length of the background section between single sounds is performed as pre-processing. Concretely, the second filtering applies a standard speech-rate conversion, which lengthens a section while preserving pitch, only to the speech waveform inside the gap section. For example, if the length of the gap section of the natural voice data is 100 milliseconds, two intermediate adjusted voice data items with gap lengths of 150 milliseconds and 200 milliseconds are generated.

For the first group (0%, 5%, ..., 30%) the natural voice itself, with a gap-section length of 100 milliseconds, is used as the original waveform; for the second group (35%, 40%, ..., 65%) the intermediate adjusted voice with a gap-section length of 150 milliseconds is used; and for the third group (70%, 75%, ..., 100%) the intermediate adjusted voice with a gap-section length of 200 milliseconds is used. The first filtering is then applied to each in the same way as in the seventh example. This yields a sequence of 21 points in the two-dimensional vector space: (100 ms, 0%), (100 ms, 5%), ..., (150 ms, 35%), (150 ms, 40%), ..., (200 ms, 70%), (200 ms, 75%), ..., (200 ms, 100%). By using these 21 points as the target-language-specific variable of the pronunciation learning device of the seventh example, a pronunciation learning device that is even more effective than that of the seventh example can be provided.
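A minimal sketch that enumerates these 21 two-dimensional variable points is shown below; the function name is an illustrative assumption.

```python
def two_dim_variable_points():
    """Enumerate the 21 (gap_ms, v_percent) points described above:
    gap 100 ms for v = 0..30 %, 150 ms for v = 35..65 %, 200 ms for v = 70..100 %."""
    points = []
    for v in range(0, 101, 5):
        if v <= 30:
            gap_ms = 100
        elif v <= 65:
            gap_ms = 150
        else:
            gap_ms = 200
        points.append((gap_ms, v))
    return points

print(two_dim_variable_points())
# [(100, 0), (100, 5), ..., (150, 35), ..., (200, 70), ..., (200, 100)]  (21 points)
```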
If the local speech-rate conversion shown in this tenth example is used, the adjusted voice can also be acquired by filtering when the single-sound duration is used as the target-language-specific variable.

An example of the pronunciation learning device according to claim 9 will be described. The voice input means comprises a microphone that converts speech into an electrical voice signal, and converts the electrical signal output from it into voice data as digital data.
The segmentation means uses DP (dynamic programming) matching, which is the standard technique for time-axis normalization in the field of speech recognition. That is, the voice data input by the voice input means is used as the input pattern, the model voice is used as the collation pattern, and the time axis is expanded and contracted non-linearly so as to align them optimally. Since a segmentation pattern can be manually input in advance for this model voice, after collation it is possible to know which time of the input voice data corresponds to which time of the segmentation pattern attached to the model voice.
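A minimal sketch of such DP matching and of mapping a boundary time from the model voice onto the input voice is shown below. It assumes one-dimensional per-frame features (for example frame energy); the function names, toy data, and the nearest-frame mapping rule are illustrative assumptions, not the specification's algorithm in detail.

```python
import numpy as np

def dtw_path(ref: np.ndarray, inp: np.ndarray):
    """Plain DP matching (DTW) between a reference (model) feature sequence and
    an input feature sequence; returns the warping path as (ref_idx, inp_idx) pairs."""
    n, m = len(ref), len(inp)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(ref[i - 1] - inp[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    # backtrack from the end of both sequences
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

def map_boundary(path, ref_boundary_idx: int) -> int:
    """Map a section-boundary frame index of the model voice onto the input voice."""
    return min((abs(r - ref_boundary_idx), i) for r, i in path)[1]

# Toy example with 1-D "features" (e.g. per-frame energy)
ref = np.array([0.1, 0.8, 0.9, 0.2, 0.7, 0.7, 0.1])
inp = np.array([0.1, 0.1, 0.8, 0.9, 0.9, 0.2, 0.7, 0.1])
path = dtw_path(ref, inp)
print(map_boundary(path, ref_boundary_idx=3))  # input frame aligned with model frame 3
```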

The segmentation pattern feature presenting means is a means for feeding back to the learner an element that constitutes the recognized segmentation pattern, or an evaluation value representing its appropriateness (Configuration 17). The elements constituting the segmentation pattern are the length of a single-sound section, the length of a background section between single sounds, or the segmentation pattern itself (Configuration 15, Configuration 16, Configuration 18). For example, if the duration of a single sound is fed back, it can be shown to the learner as a text display such as "the duration of the consonant of the voice you uttered was N milliseconds". By practicing utterance while watching this display, the learner can train until he or she can control the duration of a single sound and the duration of a transitional sound at will. Being able to control these values at will means that the target-language segmentation style has been acquired.

A function of switching the collation pattern may also be added (Configuration 19). For example, it gives the learner an opportunity to proceed with learning while comparing which segmentation patterns result when the same utterance is treated as English pronunciation and when it is treated as Japanese pronunciation. This can be implemented by having the segmentation means switch the collation pattern for DP matching to voice data uttered by a Japanese speaker, and recognize the segmentation pattern of the input voice in correspondence with the Japanese segmentation pattern manually input for that voice data.

  Further, the segmentation pattern itself may be fed back (Configuration 18). That is, all of the time points constituting the segmentation pattern may be fed back without omission. The feedback may be given as a character display, but some kind of visualization is more effective; for example, feedback may be given using the same kind of stimulus as the timing stimulus described in the first embodiment. In this case, if, before the learner speaks into the microphone, the pronunciation learning device according to claim 1 is used to present the model voice and the timing stimulus and "repeat after me" pronunciation practice is performed, the learner can compare the segmentation pattern of the model voice with the segmentation pattern of his or her own voice, which is still more effective.
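As a rough illustration of such a visual comparison (an assumption made for illustration, not the timing stimulus of the first embodiment itself), the model-voice and own-voice segmentation patterns could be rendered as text timelines, one character per 10 milliseconds:

def timeline(segments, total_ms, step_ms=10):
    # Render a segmentation pattern, given as (label, start_ms, end_ms) tuples,
    # as a line of characters; '.' marks time outside any single-sound section.
    line = []
    for t in range(0, total_ms, step_ms):
        mark = "."
        for name, start_ms, end_ms in segments:
            if start_ms <= t < end_ms:
                mark = name[0]
                break
        line.append(mark)
    return "".join(line)

model   = [("l", 0, 150), ("a", 150, 400)]   # hypothetical model-voice pattern
learner = [("l", 0, 60),  ("a", 60, 400)]    # hypothetical own-voice pattern
print("model  :", timeline(model, 400))
print("learner:", timeline(learner, 400))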

  An example according to claim 9 is shown (Configuration 17). The learner practices pronunciation many times into the microphone, trying to pronounce with gap sections of various lengths. When a certain number of voice data samples have been collected, the segmentation pattern of each voice data is recognized by the segmentation means and the variance of the gap-section lengths is calculated. If this variance is less than a threshold value, it cannot be concluded that the learner is able to utter while varying the length of the gap section, so a Boolean evaluation value of "false" is produced and fed back, for example by sounding a buzzer. Conversely, if the variance is equal to or greater than the threshold value, it is concluded that the learner is able to utter while varying the length of the gap section, and a Boolean evaluation value of "true" is produced and fed back, for example by a sound different from the buzzer.

Conversely, when this device is used by an English native speaker learning the pronunciation of Japanese, the true/false of the fed-back evaluation value may be inverted.
Alternatively, the variance value itself may be displayed as characters on the screen and fed back as another form of evaluation value.
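A minimal sketch of this variance-based evaluation; the threshold value and the data representation are illustrative assumptions, not values taken from this specification.

import statistics

def gap_variation_evaluation(gap_lengths_ms, threshold=400.0, invert=False):
    # gap_lengths_ms: gap-section lengths recognized from the repeated utterances.
    # Returns (Boolean evaluation value, variance); invert=True corresponds to the
    # reversed true/false used when an English native speaker learns Japanese.
    variance = statistics.pvariance(gap_lengths_ms)
    ok = variance >= threshold
    if invert:
        ok = not ok
    return ok, variance

ok, var = gap_variation_evaluation([80, 150, 40, 220, 120])
print("chime" if ok else "buzzer", f"(variance = {var:.1f})")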

Examples of pronunciation learning devices configured by combining claim 5 and claim 9 are shown (Configuration 11, Configuration 12, Configuration 13). These realize the so-called "repeat after me" learning method, in which vocal learning and auditory learning are performed simultaneously.
Viewed as the pronunciation learning device of claim 9, the configuration is as follows: if the duration of the continuous consonant becomes shorter than a threshold value, a buzzer is sounded to give feedback to the learner. In the Japanese-type segmentation style the duration of a continuous consonant cannot be kept long, so this makes it possible to confirm whether or not the utterance follows the English-type segmentation style.

  On the other hand, viewed as the pronunciation learning device of claim 5, the basic configuration is exactly the same as that of the pronunciation learning device shown in the seventh embodiment; the difference is that, instead of the state of pressing button A or button B, the determination is made according to whether the duration of the continuous consonant is above or below the threshold value.

  When the above processing is viewed from the learner's side, the device operates as follows. First, since an adjusted voice with an artificial separation degree of 100% is presented as the initial setting, the consonant and the vowel are heard clearly separated. Therefore, if the learner imitates the sound while trying to keep the duration of the continuous consonant long, the duration can be kept long because the voice is heard in the English-type segmentation style, and as a result the buzzer does not sound. This takes the place of pressing button A in the seventh embodiment, and an adjusted voice whose artificial separation degree is reduced by 5%, that is, a more combined voice that is harder to hear as separated, is presented. The learner must keep trying to hear the consonant and the vowel as separate, but despite this effort the consonant and the vowel sometimes fuse and are heard as one single sound, like a Japanese single sound. In that case the duration of the continuous consonant cannot be kept long, so instead of pressing button B of the seventh embodiment, the buzzer sounds, and an adjusted voice whose artificial separation degree is increased by 5%, one that is easier to hear as separated, is presented.

  If this process is continued, the artificial separation degree reaches an equilibrium, moving up and down around a certain constant value. With this pronunciation learning device the learner does not need to press any button, so it can be used more easily than the device of the seventh embodiment.
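A minimal sketch of this button-free adjustment loop, with the audio input/output left as hypothetical callables and the duration threshold chosen arbitrarily for illustration.

def training_loop(present_adjusted_voice, record_and_measure_duration,
                  rounds=50, threshold_ms=120):
    # v is the artificial separation degree in percent; it starts at 100% and is
    # stepped down while the learner can keep the continuous consonant long,
    # and stepped up (with a buzzer) when the consonant duration falls short.
    v = 100
    for _ in range(rounds):
        present_adjusted_voice(v)                 # standard voice is presented when v == 0
        duration_ms = record_and_measure_duration()
        if duration_ms >= threshold_ms:
            v = max(0, v - 5)                     # equivalent to pressing button A
        else:
            print("buzzer")
            v = min(100, v + 5)                   # equivalent to pressing button B
    return v                                      # tends to oscillate around a constant value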

  An example of a method of producing the medium of claim 3 is shown. In the method of manually inputting the segmentation pattern described in the eighth embodiment, manual input takes time, and when a native speaker provides the model voices it is unclear whether data covering the necessary range of the target-language-specific parameter has been collected evenly, so extra voice data has to be provided just in case. Therefore, an authoring tool that collects data efficiently by using the segmentation means, a component of the pronunciation learning device shown in the eleventh embodiment, is exemplified.

  First, when the native speaker provides natural voice data, its segmentation pattern is entered by manual input. After that, by performing DP matching using this natural voice data as the collation data, segmentation is carried out at the same time as data collection when voice variations are collected by deliberately uttering with long gap sections, and a segmentation pattern is obtained for each variation. This greatly reduces the operator's manual input work, and the effort of the voice provider can also be saved, because it can be determined at what point sufficient voice variations have been collected.

  In addition, when the voice provider provides the adjusted voices, a histogram of the target-language-specific variable is presented on the screen and fed back each time a word is uttered. This is more effective, because the native speaker can grasp which data have already been input and can deliberately try to input data around the values that are still under-represented.
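A minimal sketch of such histogram feedback, assuming the monitored target-language-specific variable is the gap section length in milliseconds; the bin width is an arbitrary illustrative choice.

from collections import Counter

def gap_length_histogram(collected_gaps_ms, bin_ms=25):
    # Bin the gap-section lengths collected so far and print a text histogram,
    # so the voice provider can see which values are still under-represented.
    bins = Counter((g // bin_ms) * bin_ms for g in collected_gaps_ms)
    for start in sorted(bins):
        print(f"{start:4d}-{start + bin_ms - 1:4d} ms | {'#' * bins[start]}")

gap_length_histogram([30, 42, 55, 60, 61, 110, 130, 180])
#   25-  49 ms | ##
#   50-  74 ms | ###
#  100- 124 ms | #
#  125- 149 ms | #
#  175- 199 ms | #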

Explanatory diagram schematically showing the difference in brain segmentation processing style
Basic configuration diagram of the first solution means
Basic configuration diagram of the second solution means
Explanatory diagram of audio adjustment by the target language specific variable
Diagram illustrating step-by-step audio adjustment by the target language specific variable
Configuration diagram of the second solution means when acquiring voice data by filtering processing
Configuration diagram of the second solution means when using input means
Basic configuration diagram of the third solution means
Diagram explaining manual input of a segmentation pattern
Diagram illustrating a timing stimulus
Explanatory drawing of the processing of the timing stimulus presenting means
Diagram illustrating a segmentation pattern with gap sections
Explanatory drawing of a timing stimulus generated from a segmentation pattern with gap sections
Explanatory drawing of a configuration that associates multiple segmentation patterns with the same audio data
Diagram illustrating timing stimuli generated from multiple segmentation patterns
Diagram illustrating a timing stimulus consisting of single-sound phonetic symbols
Diagram illustrating the data format stored in the recording medium of claim 7
Explanatory diagram of how to generate the adjusted voice
Diagram showing the processing procedure for adjusting the artificial separation degree
Diagram illustrating the data format stored in the recording medium in Example 9

Explanation of symbols

001 Speech waveform
002 Cutout section when perceived in the Japanese-type segmentation style
003 Cutout section when perceived in the English-type segmentation style (consonant part)
004 Cutout section when perceived in the English-type segmentation style (vowel part)
005 Voice data
006 Voice data storage means
007 Voice presentation means
008 Voice
009 Segmentation pattern
010 Segmentation pattern storage means
011 Timing stimulus generation means
012 Timing stimulus presentation means
013 Timing stimulus
014 Voice variation
015 Multiple pieces of voice data having different target language specific parameters
016 Voice data acquisition means
017 Target language specific variable
018 Range perceived as the same phone sequence in the learner's native language
019 Range perceived as the same phone sequence in the target language
020 Natural voice
021 Adjusted voice
022 First adjusted voice
023 Second adjusted voice
024 Third adjusted voice
025 Natural voice data
026 Natural voice data storage means
027 Voice data acquisition means (filtering)
028 Input means
029 Voice input means
030 Segmentation means
031 Segmentation pattern feature presentation means
101 Blank image
102 Image A
103 Image B
104 Timer reset
105 Blank image presentation
106 Comparison of T and T0
107 Image A presentation
108 Comparison of T and T1
109 Image B presentation
110 Comparison of T and T2
301 Vertical bar moved to the right in synchronization with the voice presentation
302 Region that turns red when a single-sound section of the segmentation pattern for English is entered
303 Region that turns red when a single-sound section of the segmentation pattern for Japanese is entered
401 Image presented during the duration of the single sound "l"
402 Single sound
701 Gap section
702 Set the artificial separation degree (v) to 100%
703 Present the adjusted voice (the standard voice is presented when v = 0%)
704 Input check
705 Reduce v by 5% (leave as is if v = 0%)
706 Increase v by 5% (leave as is if v = 100%)

Claims (2)

  1. Voice data storage means for storing voice data;
    Segmentation pattern storage means for storing a segmentation pattern for dividing the voice data into a plurality of sections on a time axis based on a segmentation style of a first natural language;
    Second segmentation pattern storage means for storing a second segmentation pattern for dividing the voice data into a plurality of sections on the time axis based on a segmentation style of a second natural language different from the first natural language;
    Timing stimulus generating means for generating a timing stimulus that allows the learner to compare the timing of switching between sections of the segmentation pattern stored in the segmentation pattern storage means with the timing of switching between sections of the second segmentation pattern stored in the second segmentation pattern storage means;
    Voice presentation means for presenting the voice data;
    Timing stimulus presentation means for presenting the timing stimulus in synchronization with voice presentation by the voice presentation means;
    A pronunciation learning device characterized by comprising:
  2. A program for causing a computer comprising:
    voice data storage means for storing voice data;
    segmentation pattern storage means for storing a segmentation pattern for dividing the voice data into a plurality of sections on a time axis based on a segmentation style of a first natural language; and
    second segmentation pattern storage means for storing a second segmentation pattern for dividing the voice data into a plurality of sections on the time axis based on a segmentation style of a second natural language different from the first natural language,
    to function as:
    timing stimulus generating means for generating a timing stimulus that allows the learner to compare the timing of switching between sections of the segmentation pattern stored in the segmentation pattern storage means with the timing of switching between sections of the second segmentation pattern stored in the second segmentation pattern storage means;
    voice presentation means for presenting the voice data; and
    timing stimulus presentation means for presenting the timing stimulus in synchronization with the voice presentation by the voice presentation means.
JP2005110310A 2005-03-09 2005-03-09 Pronunciation learning device and pronunciation learning program Active JP4678672B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2005110310A JP4678672B2 (en) 2005-03-09 2005-03-09 Pronunciation learning device and pronunciation learning program


Publications (2)

Publication Number Publication Date
JP2006251744A JP2006251744A (en) 2006-09-21
JP4678672B2 true JP4678672B2 (en) 2011-04-27

Family

ID=37092261

Family Applications (1)

Application Number Title Priority Date Filing Date
JP2005110310A Active JP4678672B2 (en) 2005-03-09 2005-03-09 Pronunciation learning device and pronunciation learning program

Country Status (1)

Country Link
JP (1) JP4678672B2 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5605005B2 (en) * 2010-06-16 2014-10-15 住友電気工業株式会社 Silicon carbide semiconductor device manufacturing method and silicon carbide semiconductor device manufacturing apparatus

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH10187028A (en) * 1996-12-20 1998-07-14 Matsushita Electric Ind Co Ltd Vocalization training device
JP2003162291A (en) * 2001-11-22 2003-06-06 Ricoh Co Ltd Language learning device
JP2004334164A (en) * 2002-10-24 2004-11-25 Toshimasa Ishihara System for learning pronunciation and identification of english phonemes "l" and "r"
JP2004347786A (en) * 2003-05-21 2004-12-09 Casio Comput Co Ltd Speech display output controller, image display controller, and speech display output control processing program, image display control processing program


Also Published As

Publication number Publication date
JP2006251744A (en) 2006-09-21

Similar Documents

Publication Publication Date Title
Van Bezooijen Characteristics and recognizability of vocal expressions of emotion
Bundgaard-Nielsen et al. Vocabulary size matters: The assimilation of second-language Australian English vowels to first-language Japanese vowel categories
Ives et al. Discrimination of speaker size from syllable phrases
Denes Effect of duration on the perception of voicing
US7286749B2 (en) Moving image playback apparatus, moving image playback method, and computer program thereof with determining of first voice period which represents a human utterance period and second voice period other than the first voice period
CN102779508B (en) Sound bank generates Apparatus for () and method therefor, speech synthesis system and method thereof
JP3984207B2 (en) Speech recognition evaluation apparatus, speech recognition evaluation method, and speech recognition evaluation program
McDermott The cocktail party problem
US8140326B2 (en) Systems and methods for reducing speech intelligibility while preserving environmental sounds
Liss et al. Syllabic strength and lexical boundary decisions in the perception of hypokinetic dysarthric speech
US8170878B2 (en) Method and apparatus for automatically converting voice
US6358054B1 (en) Method and apparatus for teaching prosodic features of speech
EP1028410B1 (en) Speech recognition enrolment system
Strange et al. Acoustic and perceptual similarity of North German and American English vowels
Kooijman et al. Electrophysiological evidence for prelinguistic infants' word recognition in continuous speech
Strand et al. Gradient and Visual Speaker Normalization in the Perception of Fricatives.
US6290504B1 (en) Method and apparatus for reporting progress of a subject using audio/visual adaptive training stimulii
Kuhl et al. Infant vocalizations in response to speech: Vocal imitation and developmental change
JP4363590B2 (en) Speech synthesis
Sadakata et al. Enhanced perception of various linguistic features by musicians: a cross-linguistic study
Haggard Encoding and the REA for speech signals
Jovičić Formant feature differences between whispered and voiced sustained vowels
Leinonen et al. Expression of emotional–motivational connotations with a one-word utterance
JP4355772B2 (en) Force conversion device, speech conversion device, speech synthesis device, speech conversion method, speech synthesis method, and program
Murray et al. Applying an analysis of acted vocal emotions to improve the simulation of synthetic speech

Legal Events

Date Code Title Description
A621 Written request for application examination

Free format text: JAPANESE INTERMEDIATE CODE: A621

Effective date: 20080218

A977 Report on retrieval

Free format text: JAPANESE INTERMEDIATE CODE: A971007

Effective date: 20100630

A131 Notification of reasons for refusal

Free format text: JAPANESE INTERMEDIATE CODE: A131

Effective date: 20100727

A521 Written amendment

Free format text: JAPANESE INTERMEDIATE CODE: A523

Effective date: 20100913

A02 Decision of refusal

Free format text: JAPANESE INTERMEDIATE CODE: A02

Effective date: 20101022

A521 Written amendment

Free format text: JAPANESE INTERMEDIATE CODE: A523

Effective date: 20101101

A911 Transfer of reconsideration by examiner before appeal (zenchi)

Free format text: JAPANESE INTERMEDIATE CODE: A911

Effective date: 20101116

A521 Written amendment

Free format text: JAPANESE INTERMEDIATE CODE: A523

Effective date: 20101217

A131 Notification of reasons for refusal

Free format text: JAPANESE INTERMEDIATE CODE: A131

Effective date: 20101217

TRDD Decision of grant or rejection written
A01 Written decision to grant a patent or to grant a registration (utility model)

Free format text: JAPANESE INTERMEDIATE CODE: A01

Effective date: 20110126


A61 First payment of annual fees (during grant procedure)

Free format text: JAPANESE INTERMEDIATE CODE: A61

Effective date: 20110126

R150 Certificate of patent or registration of utility model

Ref document number: 4678672

Country of ref document: JP

Free format text: JAPANESE INTERMEDIATE CODE: R150

FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20140210

Year of fee payment: 3

R250 Receipt of annual fees

Free format text: JAPANESE INTERMEDIATE CODE: R250

R250 Receipt of annual fees

Free format text: JAPANESE INTERMEDIATE CODE: R250

R250 Receipt of annual fees

Free format text: JAPANESE INTERMEDIATE CODE: R250

R250 Receipt of annual fees

Free format text: JAPANESE INTERMEDIATE CODE: R250

R250 Receipt of annual fees

Free format text: JAPANESE INTERMEDIATE CODE: R250

R250 Receipt of annual fees

Free format text: JAPANESE INTERMEDIATE CODE: R250