WO2013172179A1 - Voice-information presentation device and voice-information presentation method - Google Patents

Voice-information presentation device and voice-information presentation method

Info

Publication number
WO2013172179A1
Authority
WO
WIPO (PCT)
Prior art keywords
voice
converted
information presentation
generation unit
speech
Prior art date
Application number
PCT/JP2013/062326
Other languages
French (fr)
Japanese (ja)
Inventor
充伸 神沼
健太 南
早苗 平井
Original Assignee
日産自動車株式会社
学校法人同志社
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 日産自動車株式会社, 学校法人同志社
Publication of WO2013172179A1 publication Critical patent/WO2013172179A1/en

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/003 Changing voice quality, e.g. pitch or formants

Definitions

  • the present invention relates to a voice information presentation device and a voice information presentation method for presenting voice information whose meaning can be understood without drawing too much of the driver's attention, even when mounted on a vehicle.
  • Patent Document 1 discloses an example of such a voice guidance system, in which voice guidance is provided in accordance with the user's past information-provision history and preferences.
  • the present invention has been proposed in view of the above circumstances, and its object is to provide an audio information presentation device and an audio information presentation method that can present audio information without attracting too much of the driver's attention.
  • the voice information presentation device generates a reference voice that represents the language information to be presented as a voice, converts the generated reference voice to generate a converted voice having lower intelligibility than the reference voice, and outputs the generated converted voice.
  • FIG. 1 is a block diagram showing the configuration of the audio information presentation apparatus according to the first embodiment to which the present invention is applied.
  • FIG. 2 is a flowchart showing a processing procedure of voice information presentation processing by the voice information presentation device according to the first embodiment to which the present invention is applied.
  • FIG. 3 is a flowchart showing a processing procedure of pitch processing by the audio information presentation device according to the first embodiment to which the present invention is applied.
  • FIG. 4 is a diagram for explaining pitch frequency conversion by the audio information presentation device according to the first embodiment to which the present invention is applied.
  • FIG. 5 is a diagram for explaining pitch frequency conversion by the audio information presentation device according to the first embodiment to which the present invention is applied.
  • FIG. 6 is a diagram for explaining pitch frequency conversion by the audio information presentation device according to the first embodiment to which the present invention is applied.
  • FIG. 7 is a diagram for explaining pitch processing by the audio information presentation device according to the first embodiment to which the present invention is applied.
  • FIG. 8 is a diagram for explaining the result of pitch processing by the audio information presentation apparatus according to the first embodiment to which the present invention is applied.
  • FIG. 9 is a diagram for explaining the result of pitch processing by the audio information presentation device according to the first embodiment to which the present invention is applied.
  • FIG. 10 is a diagram for explaining the envelope processing by the audio information presentation device according to the first embodiment to which the present invention is applied.
  • FIG. 11 is a diagram for explaining the result of the envelope processing by the audio information presentation device according to the first embodiment to which the present invention is applied.
  • FIG. 12 is a diagram for explaining amplitude processing by the audio information presentation apparatus according to the first embodiment to which the present invention is applied.
  • FIG. 13 is a diagram for explaining the result of amplitude processing by the audio information presentation device according to the first embodiment to which the present invention is applied.
  • FIG. 14 is a diagram for explaining the amplitude processing by the audio information presentation device according to the first embodiment to which the present invention is applied.
  • FIG. 15 is a diagram for explaining the result of amplitude processing by the audio information presentation device according to the first embodiment to which the present invention is applied.
  • FIG. 16 is a diagram for explaining audio information presentation processing by the audio information presentation apparatus according to the first embodiment to which the present invention is applied.
  • FIG. 17 is a diagram for explaining voice information presentation processing by the voice information presentation apparatus according to the first embodiment to which the present invention is applied.
  • FIG. 18 is a diagram for explaining audio information presentation processing by the audio information presentation apparatus according to the first embodiment to which the present invention is applied.
  • FIG. 19 is a diagram for explaining the result of the voice information presentation processing by the voice information presentation device according to the first embodiment to which the present invention is applied.
  • FIG. 20 is a diagram for explaining the effect of the voice information presentation processing by the voice information presentation device according to the first embodiment to which the present invention is applied.
  • FIG. 21 is a diagram for explaining the effect of the voice information presentation processing by the voice information presentation device according to the first embodiment to which the present invention is applied.
  • FIG. 22 is a diagram for explaining audio information presentation processing by the audio information presentation apparatus according to the second embodiment to which the present invention is applied.
  • FIG. 1 is a block diagram showing the configuration of the audio information presentation apparatus according to this embodiment.
  • the audio information presentation device 1 includes a reference audio generation unit 2 that generates a reference audio expressing the language information to be presented as audio, a converted audio generation unit 3 that converts the reference audio to generate a converted audio with lower clarity than the reference audio, an audio output unit 4 that outputs the reference audio or the converted audio, and a storage unit 5 that stores the reference audio and the information necessary for the audio presentation processing.
  • the voice information presentation device 1 is mounted on a vehicle and applied to, for example, a navigation device, and converts the voice guidance provided during route guidance into converted voice and outputs it.
  • the converted voice can convey the language information that the original voice guidance intends to present, so the driver can easily understand the meaning of the language information.
  • the audio information presentation device 1 operates as the reference audio generation unit 2, the converted audio generation unit 3, and the audio output unit 4 by executing a specific program on a general-purpose electronic circuit including a microcomputer, a microprocessor, and a CPU. It can also be realized as hardware consisting of dedicated electronic circuits.
  • the reference voice generation unit 2 generates, for example, voice guidance provided at the time of route guidance of the navigation device as reference voice.
  • the reference voice may be synthesized as needed, or may be stored in advance in the storage unit 5 and retrieved.
  • the converted voice generation unit 3 generates a converted voice having a lower clarity than the reference voice by executing a conversion process on the reference voice.
  • intelligibility is one of the measures of the quality of a speech signal, such as a telephone signal, and various indices have been proposed for evaluating it, for example the articulation index (AI) and the speech transmission index (STI).
  • the conversion processing for reducing the intelligibility by the converted speech generation unit 3 includes pitch processing, envelope processing, and amplitude processing.
  • the pitch process is a process of converting a pitch frequency, which is a frequency related to the vocal cord vibration of the reference voice, to a specific frequency specified in advance or changing the pitch frequency based on a specific function.
  • the envelope process is a process for alleviating a sharp change in amplitude in the spectrum envelope of the reference speech.
  • the amplitude process is a process for modulating the amplitude in the time waveform of the reference sound.
  • the audio output unit 4 includes a DA converter 21 that converts reference audio or converted audio into an analog signal, an amplifier 22 that amplifies the audio converted into the analog signal, and a speaker 23 that outputs the amplified audio.
  • the reference voice or converted voice is output to the outside.
  • the storage unit 5 stores reference voice corresponding to preset voice guidance and the like, and also stores information necessary for performing voice information presentation processing. Also, the converted voice may be stored in advance.
  • the reference sound generation unit 2 generates a reference sound.
  • the reference voice is voice information similar to normal voice guidance output from the navigation device, and expresses language information to be presented as voice.
  • the reference sound generation unit 2 may generate the reference sound by acquiring the reference sound from the storage unit 5, or may generate the reference sound by synthesizing it.
  • in step S102, primary design processing is performed.
  • the primary design process is a process that lowers the intelligibility of speech using various sound effectors according to the designer's sense and subjectivity.
  • the converted voice generation unit 3 performs pitch processing on the generated reference voice to convert the pitch frequency in order to reduce the clarity.
  • the specific procedure of the pitch processing can be realized by three processes of separation of pitch and envelope, pitch conversion, and recombination of the converted pitch and envelope.
  • the processing can be performed by using general pitch correction software such as “Auto-Tune”.
  • the converted pitch frequency 33 can be obtained by setting the target 31 of the frequency to be converted and converting the original pitch frequency 32 indicated by “+” to the target 31.
  • a plurality of target frequencies may be set; for example, the first target 34 and the second target 35 may be set as shown in FIG. 5.
  • the converted pitch frequency 37 is obtained by converting the original pitch frequency 36 to whichever of the two targets 34 and 35 is closer to it.
  • the target may also be a specific function such as the sine function 38 shown in FIG. 6, so that the pitch frequency is made to vary.
  • when the pitch processing is performed on the reference voice 45 shown in FIG. 8 and the pitch frequency is converted to a single frequency, the converted voice 44 is output. In FIG. 8, the pitch frequency is converted to C4 (about 262 Hz), and it can be seen that the converted voice 44 is contracted in the time direction because its frequency is higher than that of the reference voice 45.
  • examples of processing using pitch adjustment, a spectral gate, an equalizer, a compressor, and the like are shown in FIGS. 16 and 18. In these cases, the check items of steps S106 to S109 described later may not be satisfied, so it is necessary to confirm, after performing the correction processing of steps S104 and S105 described later, whether the converted voices shown in the waveforms of FIGS. 16 and 18 pass those checks.
  • the amplitude may also be modified by superimposing the waveform shown in FIG. 14(b) onto the reference voice shown in FIG. 14(a) by a convolution operation, which produces a waveform such as that shown in FIG. 15.
  • the clarity can be reduced by performing the amplitude processing.
  • the language information intended to be presented with the reference voice is expressed in the same manner even after the amplitude processing, so that the meaning can be easily understood.
  • in step S103, it is determined whether the voice converted by the primary design process has lower intelligibility than the reference voice. If the intelligibility has not decreased, the process returns to step S102 and the primary design process is performed again; if it has decreased, the process proceeds to step S104, and processing for reducing the annoyance is performed in steps S104 and S105. Note that reducing the annoyance of the voice also lowers its intelligibility.
  • in step S104, the converted speech generation unit 3 performs envelope processing in order to reduce the annoyance.
  • the annoyance is removed from the voice by suppressing the narrow-band peak 51 shown in FIG. 10(a) and smoothing the spectral envelope 52. Concretely, a low-pass filter such as that shown in FIG. 10(b) is applied to the spectral envelope 52 of FIG. 10(a); as a result, the narrow-band peak 51 is suppressed and the envelope is converted into the gentle spectral envelope 52 shown in FIG. 11.
  • suppressing the narrow-band peak by this envelope processing removes the annoyance from the voice and thereby prevents it from drawing too much of the driver's attention. Since the language information intended to be presented with the reference voice is still expressed in the same way after the envelope processing, its meaning remains easy to understand.
  • in step S105, the converted speech generation unit 3 performs amplitude processing to remove the annoyance. In this amplitude processing, the amplitude is modified by correcting amplitude distortion and by amplitude modulation. For example, when the reference voice contains a rectangular-wave portion as amplitude distortion, as shown in FIG. 12, a low-pass filter is applied to convert it into a smooth waveform as shown in FIG. 13. This processing is applied not only to rectangular waves but also to other discontinuous waveforms such as triangular waves.
  • steep rises and falls may also be relaxed in order to remove the annoyance from the voice. In the time waveform of FIG. 7, for example, the rise at A1 is steep and is therefore subject to correction, whereas the rise at A2 and the fall at A3 are gentle and are not corrected. Specifically, a rise or fall is corrected when, for example, it takes less than 0.01 seconds. As a result, at A1 in FIG. 7 the rise time of the converted voice is relaxed as shown in FIG. 9 so that the rise takes 0.01 seconds or longer.
  • by performing the processing of steps S104 and S105, a voice from which the annoyance has been removed can be generated from the converted voice. Although FIG. 2 shows the case where both steps S104 and S105 are performed, only one of them may be performed; even then the annoyance can be sufficiently removed from the converted speech. Once the converted voice has been generated, the converted voice generation unit 3 checks whether the annoyance has been removed by executing the following processing.
  • in step S106, the converted speech generation unit 3 determines whether a narrow-band peak exists, for example whether a narrow-band peak about 100 to 300 Hz wide exists in the high-frequency region of the frequency spectrum of the previously created converted speech shown in FIG. 16. If such a peak exists, as at B in FIG. 16, the converted voice would sound harsh, so the process returns to step S104 and the processing for removing the annoyance is performed again; if it does not exist, the process proceeds to step S107.
  • in step S107, the converted voice generation unit 3 determines whether an energy peak exists in or above the mid-to-high frequency band of 0.5 to 0.8 kHz. If such a peak exists, the process returns to step S104 and the processing for removing the annoyance is performed again; if no peak exists in that region, the process proceeds to step S108. For example, when an energy peak exists in the region of 6 kHz or higher, as at C in FIG. 17, the converted sound would be harsh, so the process returns to step S104 so that the energy can be reduced with a low-pass filter.
  • in step S108, the converted voice generation unit 3 determines whether a steep rise or fall exists in the time waveform, for example whether there is a rise or fall shorter than 0.01 seconds, as at A1 in FIG. 7. If such a steep rise or fall exists, the converted voice would sound harsh, so the process returns to step S104 and the processing for removing the annoyance is performed again; otherwise the process proceeds to step S109.
  • in step S109, the converted speech generation unit 3 determines whether nonlinear distortion exists in the time waveform, for example as shown in FIG. 18, which is part of the time waveform of the previously generated converted sound. If nonlinear distortion exists, the converted voice would sound harsh, so the process returns to step S104 and the processing for removing the annoyance is performed again; if it does not exist, the process proceeds to step S110.
  • FIG. 19(a) is the time waveform of the reference sound before conversion, FIG. 19(b) is the frequency spectrum of the reference sound, FIG. 19(c) is the time waveform of the converted sound, and FIG. 19(d) is the frequency spectrum of the converted sound.
  • as shown in FIG. 19(c), steep changes in the time waveform of the converted voice have been relaxed so that it does not attract the driver's attention.
  • as shown in FIG. 19(d), the energy in the high-frequency region of the frequency spectrum of the converted voice is suppressed.
  • the voice output unit 4 next outputs the converted voice in step S110, and the voice information presentation processing by the voice information presentation device 1 according to the present embodiment ends.
  • the rate at which it took one second or more from the lighting of the LED to the response is 16.42% for the notification sound, a lower value than for the converted voice and the voice guidance, both of which exceed 20%; this shows that the notification sound did not draw the driver's attention.
  • when the converted voice and the voice guidance are compared, the converted voice is 23.18% whereas the voice guidance is 25.00%, so the converted voice is lower. That is, the converted voice takes on a symbolic, notification-sound-like character and draws the driver's attention less than the voice guidance, so the probability that the reaction takes one second or more is reduced. In other words, the voice converted by this method is less likely to draw attention.
  • the converted voice according to the present invention is a voice that draws less attention than normal voice guidance.
  • the result of the evaluation experiment 2 for the voice information presentation device 1 according to the present embodiment will be described with reference to FIG. 21.
  • the meanings of the normal voice guidance, the converted voice according to the present invention, and the sine sound as the notification sound are explained in advance to the subject.
  • “right caution” and “left caution” are set for the voice guidance
  • the converted voice is a voice guidance with reduced clarity
  • for the notification sound, five discrete melody tones made up of three notes were assigned to "right", and a continuously varying tone centered on three notes was assigned to "left".
  • the reaction time was measured while the subjects drove the driving simulator and the voice guidance, the converted voice, and the notification sound were each presented; the results are shown in FIG. 21.
  • the average reaction time is shortest for the voice guidance, with a reaction occurring 1.22 seconds after presentation of the stimulus begins.
  • the reaction time for the converted voice is about 1.38 seconds, a delay of 0.16 seconds, though not as fast as the voice guidance.
  • the response time of the notification sound is 1.81 seconds, which causes a delay of 0.59 seconds compared to the voice guidance and 0.43 seconds compared to the converted voice.
  • although the converted voice is a symbolic sound, it shows a reaction time comparable to that of the voice guidance, suggesting that it is advantageous in terms of information understanding.
  • the converted voice according to the present invention can convey the meaning of the voice information to the same extent as the voice guidance.
  • a reference voice expressing the language information to be presented is generated, and a converted voice with lower clarity than the reference voice is generated and output; therefore, even when it is provided as vehicle voice guidance, it does not draw too much of the driver's attention and the information to be conveyed can be easily understood.
  • the clarity of the voice is reduced by converting the frequency related to the vocal cord vibration of the reference voice into a specific frequency specified in advance, so a converted voice that does not draw too much of the driver's attention can be generated.
  • the intelligibility is lowered by changing the frequency related to the vocal cord vibration of the reference audio based on a specific function. Information that is to be transmitted can be easily understood while the driver's attention is not drawn too much.
  • according to the voice information presentation device 1, since the annoyance is removed from the voice by suppressing sharp amplitude changes in the spectrum envelope of the reference voice, a converted voice that does not excessively attract the driver's attention can be generated.
  • since the clarity is lowered by modulating the amplitude in the time waveform of the reference speech, amplitude distortion can be eliminated and the clarity can be reliably reduced.
  • the converted audio generation processing by the converted audio generation unit 3 is different from that of the first embodiment.
  • the converted voice generation unit 3 of the present embodiment generates a signal by shifting the reference voice in the time direction, and adds this signal to the reference voice to reduce the clarity (an illustrative sketch of this time-shift addition is given after this list).
  • a signal 72 is generated by shifting the reference sound 71 in the time direction, and the converted sound is generated by adding these signals. An echo-like effect is thereby applied to the generated converted speech, and the intelligibility can be reduced.
  • although FIG. 22 shows a case in which the reference sound is delayed in the time direction, a signal advanced in the time direction may instead be generated and added to the reference sound.
  • a signal whose energy is smaller than that of the reference sound, or a signal slightly distorted relative to the reference sound, may also be generated and added to the reference sound.
  • the clarity is lowered by adding a signal obtained by moving the reference audio in the time direction to the reference audio.
  • the clarity can thus be reliably lowered.
  • a reference voice expressing the language information to be presented is generated, and a converted voice with lower clarity than the reference voice is generated and output. As a result, even when it is provided as vehicle voice guidance, it does not draw too much of the driver's attention and the information to be conveyed can be easily understood. The audio information presentation device and the audio information presentation method according to one aspect of the present invention are therefore industrially applicable.
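For reference, the time-shift addition described above for the second embodiment can be sketched as follows. This is a minimal illustrative sketch only: the delay, gain, and normalization are assumed values, not taken from the embodiment, and NumPy plus a known sampling rate `fs` are assumed.

```python
import numpy as np

def add_time_shifted_copy(voice, fs, delay_s=0.05, gain=0.5):
    """Add a delayed, attenuated copy of the reference voice to itself, giving an
    echo-like effect that lowers intelligibility, as in FIG. 22. The delay and
    gain values here are illustrative stand-ins."""
    d = int(delay_s * fs)
    shifted = np.zeros_like(voice, dtype=float)
    if d < len(voice):
        shifted[d:] = gain * voice[: len(voice) - d]
    out = voice + shifted
    # rescale so the result stays within the original peak level
    peak_in, peak_out = np.max(np.abs(voice)), np.max(np.abs(out))
    return out * (peak_in / peak_out) if peak_out > 0 else out
```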

Abstract

This voice-information presentation device (1) is equipped with: a reference-voice generation unit (2) for generating a reference voice expressing the linguistic information to be presented as a voice; a converted-voice generation unit (3) for generating a converted voice by converting the reference voice and decreasing the clarity thereof in comparison to the reference voice; and a voice output unit (4) for outputting the converted voice.

Description

Audio information presentation apparatus and audio information presentation method
The present invention relates to a voice information presentation device and a voice information presentation method for presenting voice information whose meaning can be understood without drawing too much of the driver's attention, even when the device is mounted on a vehicle.
Vehicles have conventionally been equipped with car navigation systems, and voice guidance has been provided when guiding a route. Patent Document 1 discloses an example of such a voice guidance system, in which voice guidance is provided in accordance with the user's past information-provision history and preferences.
JP 2008-40373 A
As described above, using voice guidance is an effective means of presenting information safely while driving. However, providing voice guidance during driving may draw too much of the driver's attention.
The present invention has therefore been proposed in view of the above circumstances, and its object is to provide a voice information presentation device and a voice information presentation method that can present voice information without drawing too much of the driver's attention.
A voice information presentation device according to the present invention generates a reference voice that expresses the language information to be presented as voice, converts the generated reference voice to generate a converted voice with lower intelligibility than the reference voice, and outputs the generated converted voice.
FIG. 1 is a block diagram showing the configuration of the voice information presentation device according to the first embodiment to which the present invention is applied. FIG. 2 is a flowchart showing the procedure of the voice information presentation processing performed by the voice information presentation device according to the first embodiment. FIG. 3 is a flowchart showing the procedure of the pitch processing performed by the voice information presentation device according to the first embodiment. FIGS. 4, 5, and 6 are diagrams for explaining pitch frequency conversion by the voice information presentation device according to the first embodiment. FIG. 7 is a diagram for explaining the pitch processing, and FIGS. 8 and 9 are diagrams for explaining the results of the pitch processing. FIG. 10 is a diagram for explaining the envelope processing, and FIG. 11 is a diagram for explaining the result of the envelope processing. FIGS. 12 and 14 are diagrams for explaining the amplitude processing, and FIGS. 13 and 15 are diagrams for explaining the results of the amplitude processing. FIGS. 16, 17, and 18 are diagrams for explaining the voice information presentation processing by the voice information presentation device according to the first embodiment, and FIG. 19 is a diagram for explaining the result of that processing. FIGS. 20 and 21 are diagrams for explaining the effects of the voice information presentation processing according to the first embodiment. FIG. 22 is a diagram for explaining the voice information presentation processing by the voice information presentation device according to the second embodiment to which the present invention is applied.
Hereinafter, the first and second embodiments to which the present invention is applied will be described with reference to the drawings.
[First Embodiment]
[Configuration of the voice information presentation device]
FIG. 1 is a block diagram showing the configuration of the voice information presentation device according to this embodiment. As shown in FIG. 1, the voice information presentation device 1 according to this embodiment includes a reference voice generation unit 2 that generates a reference voice expressing the language information to be presented as voice, a converted voice generation unit 3 that converts the reference voice to generate a converted voice with lower intelligibility than the reference voice, a voice output unit 4 that outputs the reference voice or the converted voice, and a storage unit 5 that stores the reference voice and the information necessary for the voice presentation processing.
Here, the voice information presentation device 1 according to this embodiment is, for example, mounted on a vehicle and applied to a navigation device or the like, and converts the voice guidance provided during route guidance into a converted voice and outputs it. Because the intelligibility of the converted voice has been lowered, it does not draw too much of the driver's attention compared with the original voice guidance. On the other hand, compared with a mere notification sound, the converted voice can convey the language information that the original voice guidance was intended to present, so the driver can easily understand its meaning. The voice information presentation device 1 operates as the reference voice generation unit 2, the converted voice generation unit 3, and the voice output unit 4 by executing a specific program on a general-purpose electronic circuit including a microcomputer, a microprocessor, and a CPU. It can also be realized as hardware consisting of dedicated electronic circuits.
The reference voice generation unit 2 generates, as the reference voice, for example the voice guidance provided during route guidance by the navigation device. The reference voice may be synthesized as needed, or may be stored in advance in the storage unit 5 and retrieved.
The converted voice generation unit 3 generates a converted voice with lower intelligibility than the reference voice by applying conversion processing to the reference voice. Intelligibility is one of the measures of the quality of a speech signal, such as a telephone signal, and various indices have been proposed for evaluating it, for example the articulation index (AI) and the speech transmission index (STI). The conversion processing used by the converted voice generation unit 3 to lower the intelligibility includes pitch processing, envelope processing, and amplitude processing. The pitch processing converts the pitch frequency, which is the frequency related to the vocal-cord vibration of the reference voice, to a specific frequency designated in advance, or varies it based on a specific function. The envelope processing relaxes sharp amplitude changes in the spectral envelope of the reference voice. The amplitude processing modulates the amplitude in the time waveform of the reference voice.
The voice output unit 4 consists of a DA converter 21 that converts the reference voice or the converted voice into an analog signal, an amplifier 22 that amplifies the analog signal, and a speaker 23 that outputs the amplified voice, and it outputs the reference voice or the converted voice to the outside.
The storage unit 5 stores reference voices corresponding to preset voice guidance and the like, as well as the other information necessary for the voice information presentation processing. The converted voice may also be stored in advance.
[Procedure of the voice information presentation processing]
Next, the procedure of the voice information presentation processing performed by the voice information presentation device 1 according to this embodiment will be described with reference to the flowchart of FIG. 2.
As shown in FIG. 2, first, in step S101, the reference voice generation unit 2 generates the reference voice. The reference voice is voice information similar to the normal voice guidance output from a navigation device, and expresses the language information to be presented as voice. The reference voice generation unit 2 may generate the reference voice by retrieving it from the storage unit 5, or by synthesizing it.
In step S102, primary design processing is performed. The primary design processing lowers the intelligibility of the voice using various sound effectors according to the designer's sense and subjective judgment.
For example, the converted voice generation unit 3 performs pitch processing on the generated reference voice to convert the pitch frequency and thereby lower the intelligibility. As shown in FIG. 3, the pitch processing can be realized by three steps: separating the pitch and the envelope, converting the pitch, and recombining the converted pitch with the envelope. Such processing can be carried out with general pitch-correction software such as "Auto-Tune".
In this pitch processing, the pitch frequency is converted to a single frequency designated in advance, as shown in FIG. 4. In FIG. 4, a target 31 is set for the frequency to be converted to, and the converted pitch frequency 33 is obtained by converting the original pitch frequency 32, indicated by "+", toward the target 31. A plurality of target frequencies may also be set; for example, a first target 34 and a second target 35 may be set as shown in FIG. 5. In this case, the converted pitch frequency 37 is obtained by converting the original pitch frequency 36 to whichever of the two targets 34 and 35 is closer to it. Furthermore, the target may be a specific function such as the sine function 38 shown in FIG. 6, so that the pitch frequency is made to vary.
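For illustration, the target-snapping logic of FIGS. 4 to 6 can be sketched as follows, assuming a pitch contour has already been extracted by some pitch tracker. The function names, frame times, and modulation depth below are illustrative assumptions, not part of the embodiment; NumPy is assumed.

```python
import numpy as np

def snap_pitch_to_targets(pitch_hz, targets_hz):
    """Map each voiced frame's pitch to the nearest pre-specified target frequency
    (cf. FIGS. 4 and 5). Unvoiced frames (pitch == 0) are left unchanged."""
    pitch = np.asarray(pitch_hz, dtype=float)
    targets = np.asarray(targets_hz, dtype=float)
    voiced = pitch > 0
    # distance from every voiced frame to every target; pick the closest target
    dist = np.abs(pitch[voiced, None] - targets[None, :])
    snapped = pitch.copy()
    snapped[voiced] = targets[np.argmin(dist, axis=1)]
    return snapped

def sine_target(frame_times_s, center_hz=262.0, depth_hz=30.0, rate_hz=2.0):
    """A target that varies along a sine function of time instead of a constant (cf. FIG. 6)."""
    t = np.asarray(frame_times_s, dtype=float)
    return center_hz + depth_hz * np.sin(2.0 * np.pi * rate_hz * t)

# Example: a contour around 180-220 Hz snapped to a single C4 target (about 262 Hz).
contour = np.array([0.0, 185.0, 190.0, 210.0, 0.0, 220.0])
print(snap_pitch_to_targets(contour, [262.0]))   # -> [0., 262., 262., 262., 0., 262.]
```

Re-synthesizing the waveform from the snapped contour would be handled by the pitch-correction software mentioned above.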
The result of the pitch processing will now be described with reference to FIG. 8. When the pitch processing is applied to the reference voice 45 shown in FIG. 8 and the pitch frequency is converted to a single frequency, the converted voice 44 is output. In FIG. 8, the pitch frequency has been converted to match C4 (about 262 Hz), and it can be seen that the converted voice 44 is contracted in the time direction because its frequency is higher than that of the reference voice 45.
Converting the pitch frequency in this way lowers the intelligibility.
Examples of processing using pitch adjustment, a spectral gate, an equalizer, a compressor, and the like are shown in FIGS. 16 and 18. In these cases, the check items of steps S106 to S109 described later may not be satisfied. It is therefore necessary to confirm, after performing the correction processing of steps S104 and S105 described later, whether the converted voices shown in the waveforms of FIGS. 16 and 18 pass those checks.
The amplitude may also be modified by superimposing the waveform shown in FIG. 14(b) onto the reference voice shown in FIG. 14(a) by a convolution operation. This convolution produces a waveform such as that shown in FIG. 15.
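A minimal sketch of such a convolution-based amplitude modification might look like the following, with a short decaying kernel standing in for the waveform of FIG. 14(b), which is not reproduced here. NumPy is assumed, and the kernel shape and rescaling are illustrative choices.

```python
import numpy as np

def convolve_modulation(voice, kernel):
    """Superimpose a modulation waveform onto the reference voice by convolution
    and rescale the result to the original peak level."""
    y = np.convolve(voice, kernel, mode="same")
    peak_in, peak_out = np.max(np.abs(voice)), np.max(np.abs(y))
    return y * (peak_in / peak_out) if peak_out > 0 else y

# Toy example: 0.5 s of a 200 Hz tone at 16 kHz, convolved with a ~10 ms decaying kernel.
fs = 16000
t = np.arange(int(0.5 * fs)) / fs
voice = 0.5 * np.sin(2.0 * np.pi * 200.0 * t)
converted = convolve_modulation(voice, np.exp(-np.linspace(0.0, 5.0, 160)))
```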
Performing amplitude processing in this way lowers the intelligibility. Since the language information that the reference voice was intended to present is still expressed in the same way after the amplitude processing, its meaning remains easy to understand.
When the primary design processing of step S102 is finished, it is determined in step S103 whether the voice converted by the primary design processing has lower intelligibility than the reference voice. If the intelligibility has not decreased, the process returns to step S102 and the primary design processing is performed again; if it has decreased, the process proceeds to step S104, and processing for reducing the annoyance is performed in steps S104 and S105. Note that reducing the annoyance of the voice also lowers its intelligibility.
Next, in step S104, the converted voice generation unit 3 performs envelope processing to reduce the annoyance. In this envelope processing, the annoyance is removed from the voice by suppressing the narrow-band peak 51 shown in FIG. 10(a) and smoothing the spectral envelope 52. Concretely, a low-pass filter such as that shown in FIG. 10(b) is applied to the spectral envelope 52 of FIG. 10(a). As a result, the narrow-band peak 51 is suppressed and the envelope is converted into the gentle spectral envelope 52 shown in FIG. 11.
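The narrow-band-peak suppression of FIG. 10 can be illustrated, for a single analysis frame, by low-pass filtering the magnitude spectrum along the frequency axis. This sketch only computes the smoothed envelope and leaves resynthesis of the time signal (for example with an LPC or cepstral vocoder) aside; the window and filter length are assumed values, and NumPy is assumed.

```python
import numpy as np

def smooth_envelope(frame, fs, win_bins=31):
    """Flatten narrow-band peaks in one frame's spectral envelope with a
    moving-average (low-pass) filter along the frequency axis (cf. FIG. 10)."""
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    kernel = np.ones(win_bins) / win_bins
    smoothed = np.convolve(spec, kernel, mode="same")
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    return freqs, spec, smoothed
```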
Suppressing the narrow-band peak by this envelope processing removes the annoyance from the voice, so that it no longer draws too much of the driver's attention. Since the language information that the reference voice was intended to present is still expressed in the same way after the envelope processing, its meaning remains easy to understand.
Next, in step S105, the converted voice generation unit 3 performs amplitude processing to remove the annoyance. In this amplitude processing, the amplitude is modified by correcting amplitude distortion and by amplitude modulation. For example, when the reference voice contains a rectangular-wave portion as amplitude distortion, as shown in FIG. 12, a low-pass filter is applied to convert it into a smooth waveform as shown in FIG. 13. This processing is applied not only to rectangular waves but also to other discontinuous waveforms such as triangular waves.
Steep rises and falls may also be relaxed in order to remove the annoyance from the voice. For example, in the time waveform shown in FIG. 7, the rise at A1 is steep and is therefore subject to correction, whereas the rise at A2 and the fall at A3 are gentle and are not corrected. Specifically, a rise or fall is corrected when, for example, it takes less than 0.01 seconds. As a result, at A1 in FIG. 7 the rise time of the converted voice is relaxed as shown in FIG. 9 so that the rise takes 0.01 seconds or longer.
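A hedged sketch of this rise-time correction is shown below, assuming the steep onset has already been located and handed over as a short segment; the fade shape and the helper name are illustrative assumptions rather than the method prescribed here.

```python
import numpy as np

def soften_attack(segment, fs, min_rise_s=0.01):
    """Apply a raised-cosine fade-in over the first `min_rise_s` seconds of a
    segment so that its onset takes at least 0.01 s (cf. the correction of A1 in
    FIG. 7). Locating steep onsets inside a long utterance is a separate step."""
    n = min(int(min_rise_s * fs), len(segment))
    ramp = 0.5 * (1.0 - np.cos(np.pi * np.arange(n) / max(n, 1)))  # rises from 0 toward 1
    out = np.asarray(segment, dtype=float).copy()
    out[:n] *= ramp
    return out
```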
Even if the rises and falls of the voice are relaxed, the language information that the reference voice was intended to present is still expressed in the same way, so its meaning remains easy to understand.
By performing the processing of steps S104 and S105, a voice from which the annoyance has been removed can be generated from the converted voice. Although the flowchart of FIG. 2 shows the case where both steps S104 and S105 are performed, only one of them may be performed; even then the annoyance can be sufficiently removed from the converted voice. Once the converted voice has been generated, the converted voice generation unit 3 checks whether the annoyance has been removed by executing the following processing.
First, in step S106, the converted voice generation unit 3 determines whether a narrow-band peak exists. For example, it determines whether a narrow-band peak about 100 to 300 Hz wide exists in the high-frequency region of the frequency spectrum of the previously created converted voice shown in FIG. 16. If such a peak exists, as at B in FIG. 16, the converted voice would sound harsh, so the process returns to step S104 and the processing for removing the annoyance is performed again; if no such peak exists, the process proceeds to step S107.
Next, in step S107, the converted voice generation unit 3 determines whether an energy peak exists in or above the mid-to-high frequency band of 0.5 to 0.8 kHz. If such a peak exists, the process returns to step S104 and the processing for removing the annoyance is performed again; if not, the process proceeds to step S108. For example, when an energy peak exists in the region of 6 kHz or higher, as at C in FIG. 17, the converted voice would sound harsh, so the process returns to step S104 so that the energy can be reduced with a low-pass filter.
Next, in step S108, the converted voice generation unit 3 determines whether a steep rise or fall exists in the time waveform, for example whether there is a rise or fall shorter than 0.01 seconds, as at A1 in FIG. 7. If such a steep rise or fall exists, the converted voice would sound harsh, so the process must return to step S104 and the processing for removing the annoyance must be performed again; if not, the process proceeds to step S109.
Next, in step S109, the converted voice generation unit 3 determines whether nonlinear distortion exists in the time waveform. For example, when nonlinear distortion exists in the converted voice, as in FIG. 18, which shows part of the time waveform of the previously created converted voice, the process returns to step S104 and the processing for removing the annoyance is performed again; if no such distortion exists, the process proceeds to step S110. If nonlinear distortion exists, the converted voice would sound harsh, so the processing for removing the annoyance must be repeated to eliminate it.
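Purely for illustration, the checks of steps S107 and S108 can be approximated with simple spectrum and envelope measurements. The thresholds and the exact measurements below are stand-ins, not the ones prescribed above, and detecting nonlinear distortion (step S109) is not attempted; NumPy is assumed.

```python
import numpy as np

def annoyance_checks(voice, fs, hf_limit_hz=6000.0, rise_limit_s=0.01):
    """Two illustrative checks: (1) is the largest spectral peak located above
    ~6 kHz (cf. C in FIG. 17)?  (2) does the overall amplitude rise from 10% to
    90% of its peak in less than 0.01 s (cf. A1 in FIG. 7)?"""
    spec = np.abs(np.fft.rfft(voice))
    freqs = np.fft.rfftfreq(len(voice), d=1.0 / fs)
    high_frequency_peak = bool(freqs[np.argmax(spec)] >= hf_limit_hz)

    env = np.abs(voice)
    i10 = int(np.argmax(env >= 0.1 * env.max()))
    i90 = int(np.argmax(env >= 0.9 * env.max()))
    steep_rise = bool((i90 - i10) / fs < rise_limit_s)

    # If either check fails, the flow of FIG. 2 would return to step S104.
    return {"high_frequency_peak": high_frequency_peak, "steep_rise": steep_rise}
```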
A converted voice that has passed these checks on whether the annoyance has been removed will be described with reference to FIG. 19. FIG. 19(a) is the time waveform of the reference voice before conversion, FIG. 19(b) is the frequency spectrum of the reference voice, FIG. 19(c) is the time waveform of the converted voice, and FIG. 19(d) is the frequency spectrum of the converted voice.
As shown in FIG. 19(c), steep changes in the time waveform of the converted voice have been relaxed so that the waveform does not attract the driver's attention. As shown in FIG. 19(d), the energy in the high-frequency region of the frequency spectrum of the converted voice is suppressed. Overall, therefore, harsh sounds are reduced and the voice has been converted into a low-intelligibility voice that does not draw too much of the driver's attention.
When such a converted voice has been generated, the voice output unit 4 outputs it in step S110, and the voice information presentation processing by the voice information presentation device 1 according to this embodiment ends.
[Effects of the first embodiment]
The results of evaluation experiment 1 for the voice information presentation device 1 according to this embodiment will now be described with reference to FIG. 20. In evaluation experiment 1, subjects drove a driving simulator simulating virtual driving, and when one of the LEDs placed at either end of the forward field of view lit up, they were asked to press the switch on the steering wheel on the side that lit up. During this experiment, normal voice guidance, the converted voice according to the present invention, and a sine tone serving as a notification sound were presented as distracting sounds, and the probability that a subject's reaction was delayed by one second or more after the LED lit up was calculated. The results are shown in FIG. 20.
 図20によれば、LEDが点灯してから反応までに1秒以上かかった率は、報知音で16.42%であり、20%を超えている変換音声と音声ガイダンスと比較して低い値となっており、運転者の注意を引いていないことが分かる。 According to FIG. 20, the rate that it took more than 1 second from the lighting of the LED to the response is 16.42% for the notification sound, which is a lower value than the converted voice and the voice guidance exceeding 20%. It can be seen that the driver's attention has not been drawn.
 一方、変換音声と音声ガイダンスとを比較すると、変換音声が23.18%であるのに対して音声ガイダンスは25.00%となっており、変換音声のほうが低くなっている。すなわち、変換音声のほうが報知音のような記号音的な機能性が付与され、音声ガイダンスよりも運転者の注意を引くことが少ないので、反応に1秒以上かかる確率が減少している。つまり、今回の手法で変換した音声は注意を引きにくい。 On the other hand, when the converted voice and the voice guidance are compared, the converted voice is 23.18%, whereas the voice guidance is 25.00%, and the converted voice is lower. That is, the converted voice is provided with symbolic sound functionality such as a notification sound, and is less likely to attract the driver's attention than the voice guidance. Therefore, the probability that the reaction takes 1 second or more is reduced. In other words, the voice converted by this method is difficult to draw attention.
 したがって、本発明による変換音声は、通常の音声ガイダンスよりも注意を引きすぎない音声であることがこの評価実験1によって分かる。 Therefore, it can be seen from this evaluation experiment 1 that the converted voice according to the present invention is a voice that draws less attention than normal voice guidance.
 次に、図21を参照して本実施形態に係る音声情報提示装置1に対する評価実験2の結果を説明する。評価実験2では、通常の音声ガイダンスと本発明による変換音声と報知音であるサイン音について、予め被験者に意味を説明しておく。例えば、音声ガイダンスは「右注意」と「左注意」を設定し、変換音声は音声ガイダンスの明瞭度を低下させたもの、報知音は3音から成る5つの離散的メロディ音を右、3音を中心とする連続変化音を左とする。そして、被験者に運転シミュレーターで運転させ、音声ガイダンスと変換音声と報知音とを提示した場合の反応時間をそれぞれ測定し、その結果を図21に示した。 Next, the result of the evaluation experiment 2 for the voice information presentation device 1 according to the present embodiment will be described with reference to FIG. In the evaluation experiment 2, the meanings of the normal voice guidance, the converted voice according to the present invention, and the sine sound as the notification sound are explained in advance to the subject. For example, “right caution” and “left caution” are set for the voice guidance, the converted voice is a voice guidance with reduced clarity, and the notification sound is five discrete melody sounds consisting of three to the right, three The continuously changing sound centered at is left. Then, the reaction time was measured when the test subject was driven by the driving simulator and voice guidance, converted voice, and notification sound were presented, and the results are shown in FIG.
As shown in FIG. 21, the average reaction time was shortest for the voice guidance, with subjects responding 1.22 seconds after presentation of the stimulus began. The reaction time for the converted voice was about 1.38 seconds, only 0.16 seconds slower than for the voice guidance. In contrast, the reaction time for the notification sound was 1.81 seconds, a delay of 0.59 seconds relative to the voice guidance and 0.43 seconds relative to the converted voice. This result suggests that, compared with the voice guidance and the converted voice, the notification sound requires more cognitive resources to understand the meaning of the presented information, making it a disadvantageous channel from the standpoint of information comprehension.
On the other hand, although the converted voice is a symbol-like sound, it shows a reaction time comparable to that of the voice guidance, suggesting that it is advantageous in terms of information comprehension.
Evaluation experiment 2 therefore shows that the converted voice according to the present invention can convey the meaning of voice information about as well as voice guidance.
In view of the above, the voice information presentation device 1 according to this embodiment generates a reference voice that represents the language information to be presented as speech, generates a converted voice whose intelligibility is lower than that of the reference voice, and outputs it. Consequently, even when provided as vehicle voice guidance, the converted voice does not draw too much of the driver's attention, and the information to be conveyed can be understood easily.
Further, the voice information presentation device 1 according to this embodiment reduces intelligibility by converting the frequency related to the vocal-cord vibration of the reference voice to a specific frequency designated in advance, so a converted voice that does not draw too much of the driver's attention can be generated.
Furthermore, the voice information presentation device 1 according to this embodiment reduces intelligibility by varying the frequency related to the vocal-cord vibration of the reference voice according to a specific function. Even when provided as vehicle voice guidance, the converted voice therefore does not draw too much of the driver's attention, and the information to be conveyed can be understood easily.
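As a rough illustration of these two pitch-based conversions, the following sketch uses the third-party WORLD vocoder bindings (pyworld) to decompose the reference voice and then either fixes the vocal-cord frequency at a preset value or varies it along a simple function. This is a minimal sketch under stated assumptions, not the patent's actual implementation; the file name "guidance.wav", the 150 Hz target, and the 2 Hz sinusoid are illustrative choices.

```python
import numpy as np
import soundfile as sf
import pyworld as pw

x, fs = sf.read("guidance.wav")                    # hypothetical mono reference voice
x = np.ascontiguousarray(x, dtype=np.float64)      # pyworld expects contiguous float64

# Decompose into pitch (f0), spectral envelope (sp) and aperiodicity (ap).
f0, sp, ap = pw.wav2world(x, fs)

# Claim-2 style: replace the vocal-cord frequency with one preset value (150 Hz here).
f0_fixed = np.where(f0 > 0.0, 150.0, 0.0)

# Claim-3 style: vary the frequency along a specific function (a slow 2 Hz sinusoid here).
t = np.arange(len(f0)) * 5.0 / 1000.0              # 5 ms frame period (pyworld default)
f0_varied = np.where(f0 > 0.0, 150.0 + 30.0 * np.sin(2.0 * np.pi * 2.0 * t), 0.0)

converted = pw.synthesize(f0_fixed, sp, ap, fs)    # or pass f0_varied for the claim-3 variant
sf.write("converted.wav", converted, fs)
```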
The voice information presentation device 1 according to this embodiment also removes the annoying quality from the voice by suppressing sharp changes in amplitude in the spectral envelope of the reference voice, so a converted voice that does not draw too much of the driver's attention can be generated.
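One way such envelope smoothing could be realized, sketched under the assumption that a frames-by-bins spectral envelope such as the WORLD envelope sp from the previous snippet is available, is a moving average along the frequency axis of each frame; the 9-bin window width is an illustrative assumption, not a value from the patent.

```python
import numpy as np

def smooth_envelope(sp, width=9):
    """Moving-average smoothing along the frequency axis of each frame,
    suppressing sharp amplitude changes in the spectral envelope."""
    kernel = np.ones(width) / width
    out = np.empty_like(sp)
    for i, frame in enumerate(sp):
        log_frame = np.log(frame + 1e-12)          # smooth in the log domain
        out[i] = np.exp(np.convolve(log_frame, kernel, mode="same"))
    return out

# Usage with the WORLD envelope from the previous sketch (frames x bins):
# sp_smoothed = smooth_envelope(sp)
# converted = pw.synthesize(f0, sp_smoothed, ap, fs)
```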
Furthermore, the voice information presentation device 1 according to this embodiment reduces intelligibility by modulating the amplitude of the time waveform of the reference voice; amplitude distortion can thus be avoided, and the intelligibility can be reliably lowered.
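A minimal sketch of such time-domain amplitude modulation is shown below, assuming a mono reference voice file; the 4 Hz modulation rate and 0.6 modulation depth are illustrative assumptions rather than values given in the patent.

```python
import numpy as np
import soundfile as sf

x, fs = sf.read("guidance.wav")                    # hypothetical mono reference voice
t = np.arange(len(x)) / fs

# Slow amplitude modulation: the envelope swings between 0.4 and 1.0 at 4 Hz.
envelope = 1.0 - 0.6 * (0.5 + 0.5 * np.sin(2.0 * np.pi * 4.0 * t))
converted = x * envelope

sf.write("converted_am.wav", converted, fs)
```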
[Second Embodiment]

Next, a voice information presentation device according to a second embodiment of the present invention is described. Since the configuration of the voice information presentation device according to this embodiment is the same as that of the first embodiment, a detailed description is omitted.
In the voice information presentation device according to this embodiment, the converted-voice generation processing performed by the converted voice generation unit 3 differs from that of the first embodiment. When the reference voice has been generated, the converted voice generation unit 3 of this embodiment generates a signal obtained by shifting the reference voice in the time direction and adds this signal to the reference voice, thereby reducing intelligibility.
For example, as shown in FIG. 22, a signal 72 obtained by shifting the reference voice 71 in the time direction is generated. The converted voice is then generated by adding these signals. As a result, the generated converted voice carries an echo effect, and its intelligibility can be reduced. Although FIG. 22 shows the case where the reference voice is delayed in the time direction, a signal advanced in the time direction may instead be generated and added to the reference voice.
Alternatively, a signal whose energy is smaller than that of the reference voice, or a signal that is slightly distorted relative to the reference voice, may be generated and added to the reference voice.
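A minimal sketch of this second-embodiment idea is shown below: a delayed, attenuated copy of the reference voice is added back to it, giving the echo effect described above. The 80 ms delay and 0.6 gain are illustrative assumptions, not values from the patent.

```python
import numpy as np
import soundfile as sf

x, fs = sf.read("guidance.wav")                    # hypothetical mono reference voice
delay = int(0.080 * fs)                            # copy shifted 80 ms later in time

shifted = np.zeros_like(x)
shifted[delay:] = x[:-delay]                       # the time-shifted copy (signal 72 in FIG. 22)

converted = x + 0.6 * shifted                      # add the attenuated copy to the reference voice
converted /= np.max(np.abs(converted))             # normalize to avoid clipping
sf.write("converted_echo.wav", converted, fs)
```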
[Effects of the Second Embodiment]

As described above in detail, the voice information presentation device according to this embodiment reduces intelligibility by adding, to the reference voice, a signal obtained by shifting the reference voice in the time direction. The converted voice is thereby given an echo effect, which reliably lowers its intelligibility.
The embodiments described above are examples of the present invention. The present invention is therefore not limited to these embodiments, and various modifications other than these embodiments are of course possible according to design and other factors without departing from the technical idea of the present invention.
This application claims priority based on Japanese Patent Application No. 2012-114280 filed on May 18, 2012, the contents of which are incorporated into this specification by reference.
According to the voice information presentation device and voice information presentation method of one aspect of the present invention, a reference voice representing the language information to be presented as speech is generated, and a converted voice whose intelligibility is lower than that of the reference voice is generated and output. As a result, even when provided as vehicle voice guidance, the converted voice does not draw too much of the driver's attention, and the information to be conveyed can be understood easily. The voice information presentation device and voice information presentation method according to one aspect of the present invention are therefore industrially applicable.
DESCRIPTION OF SYMBOLS

1 Voice information presentation device
2 Reference voice generation unit
3 Converted voice generation unit
4 Voice output unit
5 Storage unit

Claims (8)

1.  A voice information presentation device comprising:
    a reference voice generation unit that generates a reference voice representing language information to be presented as speech;
    a converted voice generation unit that converts the reference voice to generate a converted voice whose intelligibility is lower than that of the reference voice; and
    a voice output unit that outputs the converted voice.
2.  The voice information presentation device according to claim 1, wherein the converted voice generation unit reduces the intelligibility by converting a frequency related to vocal-cord vibration of the reference voice to a specific frequency designated in advance.
3.  The voice information presentation device according to claim 1, wherein the converted voice generation unit reduces the intelligibility by varying a frequency related to vocal-cord vibration of the reference voice according to a specific function.
4.  The voice information presentation device according to any one of claims 1 to 3, wherein the converted voice generation unit reduces the intelligibility by suppressing sharp changes in amplitude in a spectral envelope of the reference voice.
5.  The voice information presentation device according to any one of claims 1 to 4, wherein the converted voice generation unit reduces the intelligibility by modulating an amplitude of a time waveform of the reference voice.
6.  The voice information presentation device according to any one of claims 1 to 5, wherein the converted voice generation unit reduces the intelligibility by adding, to the reference voice, a signal obtained by shifting the reference voice in a time direction.
7.  A voice information presentation method using a voice information presentation device, the method comprising:
    generating a reference voice representing language information to be presented as speech;
    converting the reference voice to generate a converted voice whose intelligibility is lower than that of the reference voice; and
    outputting the converted voice.
8.  A voice information presentation device comprising:
    reference voice generation means for generating a reference voice representing language information to be presented as speech;
    converted voice generation means for converting the reference voice to generate a converted voice whose intelligibility is lower than that of the reference voice; and
    voice output means for outputting the converted voice.
PCT/JP2013/062326 2012-05-18 2013-04-26 Voice-information presentation device and voice-information presentation method WO2013172179A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2012114280 2012-05-18
JP2012-114280 2012-05-18

Publications (1)

Publication Number Publication Date
WO2013172179A1 true WO2013172179A1 (en) 2013-11-21

Family

ID=49583588

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2013/062326 WO2013172179A1 (en) 2012-05-18 2013-04-26 Voice-information presentation device and voice-information presentation method

Country Status (1)

Country Link
WO (1) WO2013172179A1 (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2002116045A (en) * 2000-10-11 2002-04-19 Clarion Co Ltd Sound volume controller
WO2006008871A1 (en) * 2004-07-21 2006-01-26 Matsushita Electric Industrial Co., Ltd. Speech synthesizer

Similar Documents

Publication Publication Date Title
US9580010B2 (en) Vehicle approach notification apparatus
US20190389376A1 (en) Apparatus for providing environmental noise compensation for a synthesized vehicle sound
US20130038435A1 (en) Vehicle running warning device
EP1865494B1 (en) Engine sound processing device
JP4173891B2 (en) Sound effect generator for moving objects
CN103253185B (en) Vehicular active sound effect generating apparatus
US9073477B2 (en) Vehicle approach notification device
EP3757986B1 (en) Adaptive noise masking method and system
WO2014112110A1 (en) Speech synthesizer, electronic watermark information detection device, speech synthesis method, electronic watermark information detection method, speech synthesis program, and electronic watermark information detection program
US9050925B2 (en) Vehicle having an electric drive
JP2016134662A (en) Alarm apparatus
JP4983694B2 (en) Audio playback device
JP5454432B2 (en) Vehicle approach notification device
WO2013172179A1 (en) Voice-information presentation device and voice-information presentation method
JP7454119B2 (en) Vehicle sound generator
JP5985306B2 (en) Noise reduction apparatus and noise reduction method
JP5704022B2 (en) Vehicle approach notification device
JP4888163B2 (en) Karaoke equipment
JP5533795B2 (en) Vehicle approach notification device
JP2007256838A (en) Effective sound generation apparatus for vehicle
JP5699920B2 (en) Vehicle acoustic device
KR20210046124A (en) Indoor sound control method and system of vehicle
WO2013172150A1 (en) Voice-information presentation device and voice-information presentation method
JP2008227681A (en) Acoustic characteristic correction system
JP2010156725A (en) Noise canceling method and circuit

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 13791579

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 13791579

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: JP