WO2013172179A1 - Voice-information presentation device and voice-information presentation method - Google Patents

Voice-information presentation device and voice-information presentation method

Info

Publication number
WO2013172179A1
Authority
WO
WIPO (PCT)
Prior art keywords
voice
converted
information presentation
generation unit
speech
Prior art date
Application number
PCT/JP2013/062326
Other languages
French (fr)
Japanese (ja)
Inventor
充伸 神沼
健太 南
早苗 平井
Original Assignee
日産自動車株式会社
学校法人同志社
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 日産自動車株式会社, 学校法人同志社
Publication of WO2013172179A1 publication Critical patent/WO2013172179A1/en

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/003 Changing voice quality, e.g. pitch or formants

Definitions

  • the present invention relates to a voice information presentation device and a voice information presentation method for presenting voice information whose meaning can be understood without drawing too much of the driver's attention, even when mounted on a vehicle.
  • Patent Document 1 discloses an example of such a voice guidance system, in which voice guidance is provided in accordance with the user's past information-provision history and preferences.
  • the present invention has been proposed in view of the above circumstances, and its object is to provide an audio information presentation device and an audio information presentation method that can present audio information without attracting too much of the driver's attention.
  • the voice information presentation device generates a reference voice that represents the language information to be presented as a voice, converts the generated reference voice to generate a converted voice having lower intelligibility than the reference voice, and outputs the generated converted voice.
  • FIG. 1 is a block diagram showing the configuration of the audio information presentation apparatus according to the first embodiment to which the present invention is applied.
  • FIG. 2 is a flowchart showing a processing procedure of voice information presentation processing by the voice information presentation device according to the first embodiment to which the present invention is applied.
  • FIG. 3 is a flowchart showing a processing procedure of pitch processing by the audio information presentation device according to the first embodiment to which the present invention is applied.
  • FIG. 4 is a diagram for explaining pitch frequency conversion by the audio information presentation device according to the first embodiment to which the present invention is applied.
  • FIG. 5 is a diagram for explaining pitch frequency conversion by the audio information presentation device according to the first embodiment to which the present invention is applied.
  • FIG. 6 is a diagram for explaining pitch frequency conversion by the audio information presentation device according to the first embodiment to which the present invention is applied.
  • FIG. 7 is a diagram for explaining pitch processing by the audio information presentation device according to the first embodiment to which the present invention is applied.
  • FIG. 8 is a diagram for explaining the result of pitch processing by the audio information presentation apparatus according to the first embodiment to which the present invention is applied.
  • FIG. 9 is a diagram for explaining the result of pitch processing by the audio information presentation device according to the first embodiment to which the present invention is applied.
  • FIG. 10 is a diagram for explaining the envelope processing by the audio information presentation device according to the first embodiment to which the present invention is applied.
  • FIG. 11 is a diagram for explaining the result of the envelope processing by the audio information presentation device according to the first embodiment to which the present invention is applied.
  • FIG. 12 is a diagram for explaining amplitude processing by the audio information presentation apparatus according to the first embodiment to which the present invention is applied.
  • FIG. 13 is a diagram for explaining the result of amplitude processing by the audio information presentation device according to the first embodiment to which the present invention is applied.
  • FIG. 14 is a diagram for explaining the amplitude processing by the audio information presentation device according to the first embodiment to which the present invention is applied.
  • FIG. 15 is a diagram for explaining the result of amplitude processing by the audio information presentation device according to the first embodiment to which the present invention is applied.
  • FIG. 16 is a diagram for explaining audio information presentation processing by the audio information presentation apparatus according to the first embodiment to which the present invention is applied.
  • FIG. 17 is a diagram for explaining voice information presentation processing by the voice information presentation apparatus according to the first embodiment to which the present invention is applied.
  • FIG. 18 is a diagram for explaining audio information presentation processing by the audio information presentation apparatus according to the first embodiment to which the present invention is applied.
  • FIG. 19 is a diagram for explaining the result of the voice information presentation processing by the voice information presentation device according to the first embodiment to which the present invention is applied.
  • FIG. 20 is a diagram for explaining the effect of the voice information presentation processing by the voice information presentation device according to the first embodiment to which the present invention is applied.
  • FIG. 21 is a diagram for explaining the effect of the voice information presentation processing by the voice information presentation device according to the first embodiment to which the present invention is applied.
  • FIG. 22 is a diagram for explaining audio information presentation processing by the audio information presentation apparatus according to the second embodiment to which the present invention is applied.
  • FIG. 1 is a block diagram showing the configuration of the audio information presentation apparatus according to this embodiment.
  • the audio information presentation device 1 includes a reference audio generation unit 2 that generates a reference audio expressing the language information to be presented as audio, a converted audio generation unit 3 that converts the reference audio to generate a converted audio with lower clarity than the reference audio, an audio output unit 4 that outputs the reference audio or the converted audio, and a storage unit 5 that stores the reference audio and the information necessary for the audio presentation processing.
  • the voice information presentation device 1 is mounted on a vehicle and applied to, for example, a navigation device, and converts the voice guidance provided during route guidance into converted voice and outputs it.
  • the converted voice can convey the language information that the original voice guidance intends to present, so the driver can easily understand the meaning of the language information.
  • the audio information presentation device 1 operates as the reference audio generation unit 2, the converted audio generation unit 3, and the audio output unit 4 by executing a specific program on a general-purpose electronic circuit including a microcomputer, a microprocessor, and a CPU. It can also be realized as hardware consisting of dedicated electronic circuits.
  • the reference voice generation unit 2 generates, for example, voice guidance provided at the time of route guidance of the navigation device as reference voice.
  • the reference voice may be synthesized as needed, or may be stored in advance in the storage unit 5 and retrieved.
  • the converted voice generation unit 3 generates a converted voice having a lower clarity than the reference voice by executing a conversion process on the reference voice.
  • intelligibility is one of the measures of the quality of a speech signal, such as a telephone signal, and various indices have been proposed for evaluating it, for example the articulation index (AI) and the speech transmission index (STI).
  • the conversion processing for reducing the intelligibility by the converted speech generation unit 3 includes pitch processing, envelope processing, and amplitude processing.
  • the pitch process is a process of converting a pitch frequency, which is a frequency related to the vocal cord vibration of the reference voice, to a specific frequency specified in advance or changing the pitch frequency based on a specific function.
  • the envelope process is a process for alleviating a sharp change in amplitude in the spectrum envelope of the reference speech.
  • the amplitude process is a process for modulating the amplitude in the time waveform of the reference sound.
  • the audio output unit 4 includes a DA converter 21 that converts reference audio or converted audio into an analog signal, an amplifier 22 that amplifies the audio converted into the analog signal, and a speaker 23 that outputs the amplified audio.
  • the reference voice or converted voice is output to the outside.
  • the storage unit 5 stores reference voice corresponding to preset voice guidance and the like, and also stores information necessary for performing voice information presentation processing. Also, the converted voice may be stored in advance.
  • the reference sound generation unit 2 generates a reference sound.
  • the reference voice is voice information similar to normal voice guidance output from the navigation device, and expresses language information to be presented as voice.
  • the reference sound generation unit 2 may generate the reference sound by acquiring the reference sound from the storage unit 5, or may generate the reference sound by synthesizing it.
  • in step S102, primary design processing is performed.
  • the primary design process is a process that lowers the intelligibility of speech using various sound effectors according to the designer's sense and subjectivity.
  • the converted voice generation unit 3 performs pitch processing on the generated reference voice to convert the pitch frequency in order to reduce the clarity.
  • the specific procedure of the pitch processing can be realized by three processes of separation of pitch and envelope, pitch conversion, and recombination of the converted pitch and envelope.
  • the processing can be performed by using general pitch correction software such as “Auto-Tune”.
  • the converted pitch frequency 33 can be obtained by setting the target 31 of the frequency to be converted and converting the original pitch frequency 32 indicated by “+” to the target 31.
  • a plurality of target frequencies may be set; for example, the first target 34 and the second target 35 may be set as shown in FIG. 5.
  • the converted pitch frequency 37 is obtained by converting the original pitch frequency 36 to whichever of the two targets 34 and 35 is closer to it.
  • the target may also be a specific function such as the sine function 38 shown in FIG. 6, so that the pitch frequency is made to vary.
  • when the pitch processing is performed on the reference voice 45 shown in FIG. 8 and the pitch frequency is converted to a single frequency, the converted voice 44 is output. In FIG. 8, the pitch frequency is converted to C4 (about 262 Hz), and it can be seen that the converted voice 44 is contracted in the time direction because its frequency is higher than that of the reference voice 45.
  • examples of processing using pitch adjustment, a spectral gate, an equalizer, a compressor, and the like are shown in FIGS. 16 and 18. In these cases, the check items of steps S106 to S109 described later may not be satisfied, so it is necessary to confirm, after performing the correction processing of steps S104 and S105 described later, whether the converted voices shown in the waveforms of FIGS. 16 and 18 pass those checks.
  • the amplitude may also be modified by superimposing the waveform shown in FIG. 14(b) onto the reference voice shown in FIG. 14(a) by a convolution operation, which produces a waveform such as that shown in FIG. 15.
  • the clarity can be reduced by performing the amplitude processing.
  • the language information intended to be presented with the reference voice is expressed in the same manner even after the amplitude processing, so that the meaning can be easily understood.
  • in step S103, it is determined whether the voice converted by the primary design process has lower intelligibility than the reference voice. If the intelligibility has not decreased, the process returns to step S102 and the primary design process is performed again; if it has decreased, the process proceeds to step S104, and processing for reducing the annoyance is performed in steps S104 and S105. Note that reducing the annoyance of the voice also lowers its intelligibility.
  • in step S104, the converted speech generation unit 3 performs envelope processing in order to reduce the annoyance.
  • the annoyance is removed from the voice by suppressing the narrow-band peak 51 shown in FIG. 10(a) and smoothing the spectral envelope 52. Concretely, a low-pass filter such as that shown in FIG. 10(b) is applied to the spectral envelope 52 of FIG. 10(a); as a result, the narrow-band peak 51 is suppressed and the envelope is converted into the gentle spectral envelope 52 shown in FIG. 11.
  • suppressing the narrow-band peak by this envelope processing removes the annoyance from the voice and thereby prevents it from drawing too much of the driver's attention. Since the language information intended to be presented with the reference voice is still expressed in the same way after the envelope processing, its meaning remains easy to understand.
  • in step S105, the converted speech generation unit 3 performs amplitude processing to remove the annoyance. In this amplitude processing, the amplitude is modified by correcting amplitude distortion and by amplitude modulation. For example, when the reference voice contains a rectangular-wave portion as amplitude distortion, as shown in FIG. 12, a low-pass filter is applied to convert it into a smooth waveform as shown in FIG. 13. This processing is applied not only to rectangular waves but also to other discontinuous waveforms such as triangular waves.
  • steep rises and falls may also be relaxed in order to remove the annoyance from the voice. In the time waveform of FIG. 7, for example, the rise at A1 is steep and is therefore subject to correction, whereas the rise at A2 and the fall at A3 are gentle and are not corrected. Specifically, a rise or fall is corrected when, for example, it takes less than 0.01 seconds. As a result, at A1 in FIG. 7 the rise time of the converted voice is relaxed as shown in FIG. 9 so that the rise takes 0.01 seconds or longer.
  • by performing the processing of steps S104 and S105, a voice from which the annoyance has been removed can be generated from the converted voice. Although FIG. 2 shows the case where both steps S104 and S105 are performed, only one of them may be performed; even then the annoyance can be sufficiently removed from the converted speech. Once the converted voice has been generated, the converted voice generation unit 3 checks whether the annoyance has been removed by executing the following processing.
  • in step S106, the converted speech generation unit 3 determines whether a narrow-band peak exists, for example whether a narrow-band peak about 100 to 300 Hz wide exists in the high-frequency region of the frequency spectrum of the previously created converted speech shown in FIG. 16. If such a peak exists, as at B in FIG. 16, the converted voice would sound harsh, so the process returns to step S104 and the processing for removing the annoyance is performed again; if it does not exist, the process proceeds to step S107.
  • in step S107, the converted voice generation unit 3 determines whether an energy peak exists in or above the mid-to-high frequency band of 0.5 to 0.8 kHz. If such a peak exists, the process returns to step S104 and the processing for removing the annoyance is performed again; if no peak exists in that region, the process proceeds to step S108. For example, when an energy peak exists in the region of 6 kHz or higher, as at C in FIG. 17, the converted sound would be harsh, so the process returns to step S104 so that the energy can be reduced with a low-pass filter.
  • in step S108, the converted voice generation unit 3 determines whether a steep rise or fall exists in the time waveform, for example whether there is a rise or fall shorter than 0.01 seconds, as at A1 in FIG. 7. If such a steep rise or fall exists, the converted voice would sound harsh, so the process returns to step S104 and the processing for removing the annoyance is performed again; otherwise the process proceeds to step S109.
  • in step S109, the converted speech generation unit 3 determines whether nonlinear distortion exists in the time waveform, for example as shown in FIG. 18, which is part of the time waveform of the previously generated converted sound. If nonlinear distortion exists, the converted voice would sound harsh, so the process returns to step S104 and the processing for removing the annoyance is performed again; if it does not exist, the process proceeds to step S110.
  • FIG. 19(a) is the time waveform of the reference sound before conversion, FIG. 19(b) is the frequency spectrum of the reference sound, FIG. 19(c) is the time waveform of the converted sound, and FIG. 19(d) is the frequency spectrum of the converted sound.
  • as shown in FIG. 19(c), steep changes in the time waveform of the converted voice have been relaxed so that it does not attract the driver's attention.
  • as shown in FIG. 19(d), the energy in the high-frequency region of the frequency spectrum of the converted voice is suppressed.
  • the voice output unit 4 next outputs the converted voice in step S110, and the voice information presentation processing by the voice information presentation device 1 according to the present embodiment ends.
  • the rate at which it took one second or more from the lighting of the LED to the response is 16.42% for the notification sound, a lower value than for the converted voice and the voice guidance, both of which exceed 20%; this shows that the notification sound did not draw the driver's attention.
  • when the converted voice and the voice guidance are compared, the converted voice is 23.18% whereas the voice guidance is 25.00%, so the converted voice is lower. That is, the converted voice takes on a symbolic, notification-sound-like character and draws the driver's attention less than the voice guidance, so the probability that the reaction takes one second or more is reduced. In other words, the voice converted by this method is less likely to draw attention.
  • the converted voice according to the present invention is a voice that draws less attention than normal voice guidance.
  • the result of the evaluation experiment 2 for the voice information presentation device 1 according to the present embodiment will be described with reference to FIG. 21.
  • the meanings of the normal voice guidance, the converted voice according to the present invention, and the sine sound as the notification sound are explained in advance to the subject.
  • “right caution” and “left caution” are set for the voice guidance
  • the converted voice is a voice guidance with reduced clarity
  • for the notification sound, five discrete melody tones made up of three notes were assigned to "right", and a continuously varying tone centered on three notes was assigned to "left".
  • the reaction time was measured while the subjects drove the driving simulator and the voice guidance, the converted voice, and the notification sound were each presented; the results are shown in FIG. 21.
  • the average reaction time is shortest for the voice guidance, with a reaction occurring 1.22 seconds after presentation of the stimulus begins.
  • the reaction time for the converted voice is about 1.38 seconds, a delay of 0.16 seconds, though not as fast as the voice guidance.
  • the response time of the notification sound is 1.81 seconds, which causes a delay of 0.59 seconds compared to the voice guidance and 0.43 seconds compared to the converted voice.
  • although the converted voice is a symbolic sound, it shows a reaction time comparable to that of the voice guidance, suggesting that it is advantageous in terms of information understanding.
  • the converted voice according to the present invention can convey the meaning of the voice information to the same extent as the voice guidance.
  • a reference voice expressing the language information to be presented is generated, and a converted voice with lower clarity than the reference voice is generated and output; therefore, even when it is provided as vehicle voice guidance, it does not draw too much of the driver's attention and the information to be conveyed can be easily understood.
  • the clarity of the voice is reduced by converting the frequency related to the vocal cord vibration of the reference voice into a specific frequency specified in advance, so a converted voice that does not draw too much of the driver's attention can be generated.
  • the intelligibility is lowered by changing the frequency related to the vocal cord vibration of the reference audio based on a specific function. Information that is to be transmitted can be easily understood while the driver's attention is not drawn too much.
  • according to the voice information presentation device 1, since the annoyance is removed from the voice by suppressing sharp amplitude changes in the spectrum envelope of the reference voice, a converted voice that does not excessively attract the driver's attention can be generated.
  • since the clarity is lowered by modulating the amplitude in the time waveform of the reference speech, amplitude distortion can be eliminated and the clarity can be reliably reduced.
  • the converted audio generation processing by the converted audio generation unit 3 is different from that of the first embodiment.
  • the converted voice generation unit 3 of the present embodiment generates a signal by shifting the reference voice in the time direction, and adds this signal to the reference voice to reduce the clarity (an illustrative sketch of this time-shift addition is given after this list).
  • a signal 72 is generated by shifting the reference sound 71 in the time direction, and the converted sound is generated by adding these signals. An echo-like effect is thereby applied to the generated converted speech, and the intelligibility can be reduced.
  • although FIG. 22 shows a case in which the reference sound is delayed in the time direction, a signal advanced in the time direction may instead be generated and added to the reference sound.
  • a signal whose energy is smaller than that of the reference sound, or a signal slightly distorted relative to the reference sound, may also be generated and added to the reference sound.
  • the clarity is lowered by adding a signal obtained by moving the reference audio in the time direction to the reference audio.
  • the clarity can thus be reliably lowered.
  • a reference voice expressing the language information to be presented is generated, and a converted voice with lower clarity than the reference voice is generated and output. As a result, even when it is provided as vehicle voice guidance, it does not draw too much of the driver's attention and the information to be conveyed can be easily understood. The audio information presentation device and the audio information presentation method according to one aspect of the present invention are therefore industrially applicable.
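For reference, the time-shift addition described above for the second embodiment can be sketched as follows. This is a minimal illustrative sketch only: the delay, gain, and normalization are assumed values, not taken from the embodiment, and NumPy plus a known sampling rate `fs` are assumed.

```python
import numpy as np

def add_time_shifted_copy(voice, fs, delay_s=0.05, gain=0.5):
    """Add a delayed, attenuated copy of the reference voice to itself, giving an
    echo-like effect that lowers intelligibility, as in FIG. 22. The delay and
    gain values here are illustrative stand-ins."""
    d = int(delay_s * fs)
    shifted = np.zeros_like(voice, dtype=float)
    if d < len(voice):
        shifted[d:] = gain * voice[: len(voice) - d]
    out = voice + shifted
    # rescale so the result stays within the original peak level
    peak_in, peak_out = np.max(np.abs(voice)), np.max(np.abs(out))
    return out * (peak_in / peak_out) if peak_out > 0 else out
```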

Abstract

This voice-information presentation device (1) is equipped with: a reference-voice generation unit (2) for generating a reference voice expressing the linguistic information to be presented as a voice; a converted-voice generation unit (3) for generating a converted voice by converting the reference voice and decreasing the clarity thereof in comparison to the reference voice; and a voice output unit (4) for outputting the converted voice.

Description

Audio information presentation apparatus and audio information presentation method
The present invention relates to a voice information presentation device and a voice information presentation method for presenting voice information whose meaning can be understood without drawing too much of the driver's attention, even when the device is mounted on a vehicle.
Vehicles have conventionally been equipped with car navigation systems, and voice guidance has been provided when guiding a route. Patent Document 1 discloses an example of such a voice guidance system, in which voice guidance is provided in accordance with the user's past information-provision history and preferences.
JP 2008-40373 A
As described above, using voice guidance is an effective means of presenting information safely while driving. However, providing voice guidance during driving may draw too much of the driver's attention.
The present invention has therefore been proposed in view of the above circumstances, and its object is to provide a voice information presentation device and a voice information presentation method that can present voice information without drawing too much of the driver's attention.
A voice information presentation device according to the present invention generates a reference voice that expresses the language information to be presented as voice, converts the generated reference voice to generate a converted voice with lower intelligibility than the reference voice, and outputs the generated converted voice.
FIG. 1 is a block diagram showing the configuration of the voice information presentation device according to the first embodiment to which the present invention is applied. FIG. 2 is a flowchart showing the procedure of the voice information presentation processing performed by the voice information presentation device according to the first embodiment. FIG. 3 is a flowchart showing the procedure of the pitch processing performed by the voice information presentation device according to the first embodiment. FIGS. 4, 5, and 6 are diagrams for explaining pitch frequency conversion by the voice information presentation device according to the first embodiment. FIG. 7 is a diagram for explaining the pitch processing, and FIGS. 8 and 9 are diagrams for explaining the results of the pitch processing. FIG. 10 is a diagram for explaining the envelope processing, and FIG. 11 is a diagram for explaining the result of the envelope processing. FIGS. 12 and 14 are diagrams for explaining the amplitude processing, and FIGS. 13 and 15 are diagrams for explaining the results of the amplitude processing. FIGS. 16, 17, and 18 are diagrams for explaining the voice information presentation processing by the voice information presentation device according to the first embodiment, and FIG. 19 is a diagram for explaining the result of that processing. FIGS. 20 and 21 are diagrams for explaining the effects of the voice information presentation processing according to the first embodiment. FIG. 22 is a diagram for explaining the voice information presentation processing by the voice information presentation device according to the second embodiment to which the present invention is applied.
Hereinafter, the first and second embodiments to which the present invention is applied will be described with reference to the drawings.
[First Embodiment]
[Configuration of the voice information presentation device]
FIG. 1 is a block diagram showing the configuration of the voice information presentation device according to this embodiment. As shown in FIG. 1, the voice information presentation device 1 according to this embodiment includes a reference voice generation unit 2 that generates a reference voice expressing the language information to be presented as voice, a converted voice generation unit 3 that converts the reference voice to generate a converted voice with lower intelligibility than the reference voice, a voice output unit 4 that outputs the reference voice or the converted voice, and a storage unit 5 that stores the reference voice and the information necessary for the voice presentation processing.
Here, the voice information presentation device 1 according to this embodiment is, for example, mounted on a vehicle and applied to a navigation device or the like, and converts the voice guidance provided during route guidance into a converted voice and outputs it. Because the intelligibility of the converted voice has been lowered, it does not draw too much of the driver's attention compared with the original voice guidance. On the other hand, compared with a mere notification sound, the converted voice can convey the language information that the original voice guidance was intended to present, so the driver can easily understand its meaning. The voice information presentation device 1 operates as the reference voice generation unit 2, the converted voice generation unit 3, and the voice output unit 4 by executing a specific program on a general-purpose electronic circuit including a microcomputer, a microprocessor, and a CPU. It can also be realized as hardware consisting of dedicated electronic circuits.
The reference voice generation unit 2 generates, as the reference voice, for example the voice guidance provided during route guidance by the navigation device. The reference voice may be synthesized as needed, or may be stored in advance in the storage unit 5 and retrieved.
The converted voice generation unit 3 generates a converted voice with lower intelligibility than the reference voice by applying conversion processing to the reference voice. Intelligibility is one of the measures of the quality of a speech signal, such as a telephone signal, and various indices have been proposed for evaluating it, for example the articulation index (AI) and the speech transmission index (STI). The conversion processing used by the converted voice generation unit 3 to lower the intelligibility includes pitch processing, envelope processing, and amplitude processing. The pitch processing converts the pitch frequency, which is the frequency related to the vocal-cord vibration of the reference voice, to a specific frequency designated in advance, or varies it based on a specific function. The envelope processing relaxes sharp amplitude changes in the spectral envelope of the reference voice. The amplitude processing modulates the amplitude in the time waveform of the reference voice.
The voice output unit 4 consists of a DA converter 21 that converts the reference voice or the converted voice into an analog signal, an amplifier 22 that amplifies the analog signal, and a speaker 23 that outputs the amplified voice, and it outputs the reference voice or the converted voice to the outside.
The storage unit 5 stores reference voices corresponding to preset voice guidance and the like, as well as the other information necessary for the voice information presentation processing. The converted voice may also be stored in advance.
[Procedure of the voice information presentation processing]
Next, the procedure of the voice information presentation processing performed by the voice information presentation device 1 according to this embodiment will be described with reference to the flowchart of FIG. 2.
As shown in FIG. 2, first, in step S101, the reference voice generation unit 2 generates the reference voice. The reference voice is voice information similar to the normal voice guidance output from a navigation device, and expresses the language information to be presented as voice. The reference voice generation unit 2 may generate the reference voice by retrieving it from the storage unit 5, or by synthesizing it.
In step S102, primary design processing is performed. The primary design processing lowers the intelligibility of the voice using various sound effectors according to the designer's sense and subjective judgment.
For example, the converted voice generation unit 3 performs pitch processing on the generated reference voice to convert the pitch frequency and thereby lower the intelligibility. As shown in FIG. 3, the pitch processing can be realized by three steps: separating the pitch and the envelope, converting the pitch, and recombining the converted pitch with the envelope. Such processing can be carried out with general pitch-correction software such as "Auto-Tune".
In this pitch processing, the pitch frequency is converted to a single frequency designated in advance, as shown in FIG. 4. In FIG. 4, a target 31 is set for the frequency to be converted to, and the converted pitch frequency 33 is obtained by converting the original pitch frequency 32, indicated by "+", toward the target 31. A plurality of target frequencies may also be set; for example, a first target 34 and a second target 35 may be set as shown in FIG. 5. In this case, the converted pitch frequency 37 is obtained by converting the original pitch frequency 36 to whichever of the two targets 34 and 35 is closer to it. Furthermore, the target may be a specific function such as the sine function 38 shown in FIG. 6, so that the pitch frequency is made to vary.
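For illustration, the target-snapping logic of FIGS. 4 to 6 can be sketched as follows, assuming a pitch contour has already been extracted by some pitch tracker. The function names, frame times, and modulation depth below are illustrative assumptions, not part of the embodiment; NumPy is assumed.

```python
import numpy as np

def snap_pitch_to_targets(pitch_hz, targets_hz):
    """Map each voiced frame's pitch to the nearest pre-specified target frequency
    (cf. FIGS. 4 and 5). Unvoiced frames (pitch == 0) are left unchanged."""
    pitch = np.asarray(pitch_hz, dtype=float)
    targets = np.asarray(targets_hz, dtype=float)
    voiced = pitch > 0
    # distance from every voiced frame to every target; pick the closest target
    dist = np.abs(pitch[voiced, None] - targets[None, :])
    snapped = pitch.copy()
    snapped[voiced] = targets[np.argmin(dist, axis=1)]
    return snapped

def sine_target(frame_times_s, center_hz=262.0, depth_hz=30.0, rate_hz=2.0):
    """A target that varies along a sine function of time instead of a constant (cf. FIG. 6)."""
    t = np.asarray(frame_times_s, dtype=float)
    return center_hz + depth_hz * np.sin(2.0 * np.pi * rate_hz * t)

# Example: a contour around 180-220 Hz snapped to a single C4 target (about 262 Hz).
contour = np.array([0.0, 185.0, 190.0, 210.0, 0.0, 220.0])
print(snap_pitch_to_targets(contour, [262.0]))   # -> [0., 262., 262., 262., 0., 262.]
```

Re-synthesizing the waveform from the snapped contour would be handled by the pitch-correction software mentioned above.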
The result of the pitch processing will now be described with reference to FIG. 8. When the pitch processing is applied to the reference voice 45 shown in FIG. 8 and the pitch frequency is converted to a single frequency, the converted voice 44 is output. In FIG. 8, the pitch frequency has been converted to match C4 (about 262 Hz), and it can be seen that the converted voice 44 is contracted in the time direction because its frequency is higher than that of the reference voice 45.
Converting the pitch frequency in this way lowers the intelligibility.
Examples of processing using pitch adjustment, a spectral gate, an equalizer, a compressor, and the like are shown in FIGS. 16 and 18. In these cases, the check items of steps S106 to S109 described later may not be satisfied. It is therefore necessary to confirm, after performing the correction processing of steps S104 and S105 described later, whether the converted voices shown in the waveforms of FIGS. 16 and 18 pass those checks.
The amplitude may also be modified by superimposing the waveform shown in FIG. 14(b) onto the reference voice shown in FIG. 14(a) by a convolution operation. This convolution produces a waveform such as that shown in FIG. 15.
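A minimal sketch of such a convolution-based amplitude modification might look like the following, with a short decaying kernel standing in for the waveform of FIG. 14(b), which is not reproduced here. NumPy is assumed, and the kernel shape and rescaling are illustrative choices.

```python
import numpy as np

def convolve_modulation(voice, kernel):
    """Superimpose a modulation waveform onto the reference voice by convolution
    and rescale the result to the original peak level."""
    y = np.convolve(voice, kernel, mode="same")
    peak_in, peak_out = np.max(np.abs(voice)), np.max(np.abs(y))
    return y * (peak_in / peak_out) if peak_out > 0 else y

# Toy example: 0.5 s of a 200 Hz tone at 16 kHz, convolved with a ~10 ms decaying kernel.
fs = 16000
t = np.arange(int(0.5 * fs)) / fs
voice = 0.5 * np.sin(2.0 * np.pi * 200.0 * t)
converted = convolve_modulation(voice, np.exp(-np.linspace(0.0, 5.0, 160)))
```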
Performing amplitude processing in this way lowers the intelligibility. Since the language information that the reference voice was intended to present is still expressed in the same way after the amplitude processing, its meaning remains easy to understand.
When the primary design processing of step S102 is finished, it is determined in step S103 whether the voice converted by the primary design processing has lower intelligibility than the reference voice. If the intelligibility has not decreased, the process returns to step S102 and the primary design processing is performed again; if it has decreased, the process proceeds to step S104, and processing for reducing the annoyance is performed in steps S104 and S105. Note that reducing the annoyance of the voice also lowers its intelligibility.
Next, in step S104, the converted voice generation unit 3 performs envelope processing to reduce the annoyance. In this envelope processing, the annoyance is removed from the voice by suppressing the narrow-band peak 51 shown in FIG. 10(a) and smoothing the spectral envelope 52. Concretely, a low-pass filter such as that shown in FIG. 10(b) is applied to the spectral envelope 52 of FIG. 10(a). As a result, the narrow-band peak 51 is suppressed and the envelope is converted into the gentle spectral envelope 52 shown in FIG. 11.
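The narrow-band-peak suppression of FIG. 10 can be illustrated, for a single analysis frame, by low-pass filtering the magnitude spectrum along the frequency axis. This sketch only computes the smoothed envelope and leaves resynthesis of the time signal (for example with an LPC or cepstral vocoder) aside; the window and filter length are assumed values, and NumPy is assumed.

```python
import numpy as np

def smooth_envelope(frame, fs, win_bins=31):
    """Flatten narrow-band peaks in one frame's spectral envelope with a
    moving-average (low-pass) filter along the frequency axis (cf. FIG. 10)."""
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    kernel = np.ones(win_bins) / win_bins
    smoothed = np.convolve(spec, kernel, mode="same")
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    return freqs, spec, smoothed
```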
Suppressing the narrow-band peak by this envelope processing removes the annoyance from the voice, so that it no longer draws too much of the driver's attention. Since the language information that the reference voice was intended to present is still expressed in the same way after the envelope processing, its meaning remains easy to understand.
Next, in step S105, the converted voice generation unit 3 performs amplitude processing to remove the annoyance. In this amplitude processing, the amplitude is modified by correcting amplitude distortion and by amplitude modulation. For example, when the reference voice contains a rectangular-wave portion as amplitude distortion, as shown in FIG. 12, a low-pass filter is applied to convert it into a smooth waveform as shown in FIG. 13. This processing is applied not only to rectangular waves but also to other discontinuous waveforms such as triangular waves.
Steep rises and falls may also be relaxed in order to remove the annoyance from the voice. For example, in the time waveform shown in FIG. 7, the rise at A1 is steep and is therefore subject to correction, whereas the rise at A2 and the fall at A3 are gentle and are not corrected. Specifically, a rise or fall is corrected when, for example, it takes less than 0.01 seconds. As a result, at A1 in FIG. 7 the rise time of the converted voice is relaxed as shown in FIG. 9 so that the rise takes 0.01 seconds or longer.
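A hedged sketch of this rise-time correction is shown below, assuming the steep onset has already been located and handed over as a short segment; the fade shape and the helper name are illustrative assumptions rather than the method prescribed here.

```python
import numpy as np

def soften_attack(segment, fs, min_rise_s=0.01):
    """Apply a raised-cosine fade-in over the first `min_rise_s` seconds of a
    segment so that its onset takes at least 0.01 s (cf. the correction of A1 in
    FIG. 7). Locating steep onsets inside a long utterance is a separate step."""
    n = min(int(min_rise_s * fs), len(segment))
    ramp = 0.5 * (1.0 - np.cos(np.pi * np.arange(n) / max(n, 1)))  # rises from 0 toward 1
    out = np.asarray(segment, dtype=float).copy()
    out[:n] *= ramp
    return out
```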
Even if the rises and falls of the voice are relaxed, the language information that the reference voice was intended to present is still expressed in the same way, so its meaning remains easy to understand.
By performing the processing of steps S104 and S105, a voice from which the annoyance has been removed can be generated from the converted voice. Although the flowchart of FIG. 2 shows the case where both steps S104 and S105 are performed, only one of them may be performed; even then the annoyance can be sufficiently removed from the converted voice. Once the converted voice has been generated, the converted voice generation unit 3 checks whether the annoyance has been removed by executing the following processing.
First, in step S106, the converted voice generation unit 3 determines whether a narrow-band peak exists. For example, it determines whether a narrow-band peak about 100 to 300 Hz wide exists in the high-frequency region of the frequency spectrum of the previously created converted voice shown in FIG. 16. If such a peak exists, as at B in FIG. 16, the converted voice would sound harsh, so the process returns to step S104 and the processing for removing the annoyance is performed again; if no such peak exists, the process proceeds to step S107.
Next, in step S107, the converted voice generation unit 3 determines whether an energy peak exists in or above the mid-to-high frequency band of 0.5 to 0.8 kHz. If such a peak exists, the process returns to step S104 and the processing for removing the annoyance is performed again; if not, the process proceeds to step S108. For example, when an energy peak exists in the region of 6 kHz or higher, as at C in FIG. 17, the converted voice would sound harsh, so the process returns to step S104 so that the energy can be reduced with a low-pass filter.
Next, in step S108, the converted voice generation unit 3 determines whether a steep rise or fall exists in the time waveform, for example whether there is a rise or fall shorter than 0.01 seconds, as at A1 in FIG. 7. If such a steep rise or fall exists, the converted voice would sound harsh, so the process must return to step S104 and the processing for removing the annoyance must be performed again; if not, the process proceeds to step S109.
Next, in step S109, the converted voice generation unit 3 determines whether nonlinear distortion exists in the time waveform. For example, when nonlinear distortion exists in the converted voice, as in FIG. 18, which shows part of the time waveform of the previously created converted voice, the process returns to step S104 and the processing for removing the annoyance is performed again; if no such distortion exists, the process proceeds to step S110. If nonlinear distortion exists, the converted voice would sound harsh, so the processing for removing the annoyance must be repeated to eliminate it.
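Purely for illustration, the checks of steps S107 and S108 can be approximated with simple spectrum and envelope measurements. The thresholds and the exact measurements below are stand-ins, not the ones prescribed above, and detecting nonlinear distortion (step S109) is not attempted; NumPy is assumed.

```python
import numpy as np

def annoyance_checks(voice, fs, hf_limit_hz=6000.0, rise_limit_s=0.01):
    """Two illustrative checks: (1) is the largest spectral peak located above
    ~6 kHz (cf. C in FIG. 17)?  (2) does the overall amplitude rise from 10% to
    90% of its peak in less than 0.01 s (cf. A1 in FIG. 7)?"""
    spec = np.abs(np.fft.rfft(voice))
    freqs = np.fft.rfftfreq(len(voice), d=1.0 / fs)
    high_frequency_peak = bool(freqs[np.argmax(spec)] >= hf_limit_hz)

    env = np.abs(voice)
    i10 = int(np.argmax(env >= 0.1 * env.max()))
    i90 = int(np.argmax(env >= 0.9 * env.max()))
    steep_rise = bool((i90 - i10) / fs < rise_limit_s)

    # If either check fails, the flow of FIG. 2 would return to step S104.
    return {"high_frequency_peak": high_frequency_peak, "steep_rise": steep_rise}
```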
A converted voice that has passed these checks on whether the annoyance has been removed will be described with reference to FIG. 19. FIG. 19(a) is the time waveform of the reference voice before conversion, FIG. 19(b) is the frequency spectrum of the reference voice, FIG. 19(c) is the time waveform of the converted voice, and FIG. 19(d) is the frequency spectrum of the converted voice.
As shown in FIG. 19(c), steep changes in the time waveform of the converted voice have been relaxed so that the waveform does not attract the driver's attention. As shown in FIG. 19(d), the energy in the high-frequency region of the frequency spectrum of the converted voice is suppressed. Overall, therefore, harsh sounds are reduced and the voice has been converted into a low-intelligibility voice that does not draw too much of the driver's attention.
When such a converted voice has been generated, the voice output unit 4 outputs it in step S110, and the voice information presentation processing by the voice information presentation device 1 according to this embodiment ends.
[Effects of the first embodiment]
The results of evaluation experiment 1 for the voice information presentation device 1 according to this embodiment will now be described with reference to FIG. 20. In evaluation experiment 1, subjects drove a driving simulator simulating virtual driving, and when one of the LEDs placed at either end of the forward field of view lit up, they were asked to press the switch on the steering wheel on the side that lit up. During this experiment, normal voice guidance, the converted voice according to the present invention, and a sine tone serving as a notification sound were presented as distracting sounds, and the probability that a subject's reaction was delayed by one second or more after the LED lit up was calculated. The results are shown in FIG. 20.
 図20によれば、LEDが点灯してから反応までに1秒以上かかった率は、報知音で16.42%であり、20%を超えている変換音声と音声ガイダンスと比較して低い値となっており、運転者の注意を引いていないことが分かる。 According to FIG. 20, the rate that it took more than 1 second from the lighting of the LED to the response is 16.42% for the notification sound, which is a lower value than the converted voice and the voice guidance exceeding 20%. It can be seen that the driver's attention has not been drawn.
 一方、変換音声と音声ガイダンスとを比較すると、変換音声が23.18%であるのに対して音声ガイダンスは25.00%となっており、変換音声のほうが低くなっている。すなわち、変換音声のほうが報知音のような記号音的な機能性が付与され、音声ガイダンスよりも運転者の注意を引くことが少ないので、反応に1秒以上かかる確率が減少している。つまり、今回の手法で変換した音声は注意を引きにくい。 On the other hand, when the converted voice and the voice guidance are compared, the converted voice is 23.18%, whereas the voice guidance is 25.00%, and the converted voice is lower. That is, the converted voice is provided with symbolic sound functionality such as a notification sound, and is less likely to attract the driver's attention than the voice guidance. Therefore, the probability that the reaction takes 1 second or more is reduced. In other words, the voice converted by this method is difficult to draw attention.
 したがって、本発明による変換音声は、通常の音声ガイダンスよりも注意を引きすぎない音声であることがこの評価実験1によって分かる。 Therefore, it can be seen from this evaluation experiment 1 that the converted voice according to the present invention is a voice that draws less attention than normal voice guidance.
 次に、図21を参照して本実施形態に係る音声情報提示装置1に対する評価実験2の結果を説明する。評価実験2では、通常の音声ガイダンスと本発明による変換音声と報知音であるサイン音について、予め被験者に意味を説明しておく。例えば、音声ガイダンスは「右注意」と「左注意」を設定し、変換音声は音声ガイダンスの明瞭度を低下させたもの、報知音は3音から成る5つの離散的メロディ音を右、3音を中心とする連続変化音を左とする。そして、被験者に運転シミュレーターで運転させ、音声ガイダンスと変換音声と報知音とを提示した場合の反応時間をそれぞれ測定し、その結果を図21に示した。 Next, the result of the evaluation experiment 2 for the voice information presentation device 1 according to the present embodiment will be described with reference to FIG. In the evaluation experiment 2, the meanings of the normal voice guidance, the converted voice according to the present invention, and the sine sound as the notification sound are explained in advance to the subject. For example, “right caution” and “left caution” are set for the voice guidance, the converted voice is a voice guidance with reduced clarity, and the notification sound is five discrete melody sounds consisting of three to the right, three The continuously changing sound centered at is left. Then, the reaction time was measured when the test subject was driven by the driving simulator and voice guidance, converted voice, and notification sound were presented, and the results are shown in FIG.
As shown in FIG. 21, the average reaction time was shortest for the voice guidance, with subjects responding 1.22 seconds after presentation of the stimulus began. The reaction time for the converted voice was about 1.38 seconds, only 0.16 seconds slower than for the voice guidance. In contrast, the reaction time for the notification sound was 1.81 seconds, a delay of 0.59 seconds relative to the voice guidance and 0.43 seconds relative to the converted voice. This result suggests that, compared with the voice guidance and the converted voice, the notification sound requires more cognitive resources to understand the meaning of the presented information, making it a disadvantageous channel from the standpoint of information comprehension.
On the other hand, although the converted voice is a symbol-like sound, it shows a reaction time comparable to that of the voice guidance, suggesting that it is advantageous in terms of information comprehension.
Evaluation experiment 2 therefore shows that the converted voice according to the present invention can convey the meaning of voice information about as well as voice guidance.
In view of the above, the voice information presentation device 1 according to this embodiment generates a reference voice that represents the language information to be presented as speech, generates a converted voice whose intelligibility is lower than that of the reference voice, and outputs it. Consequently, even when provided as vehicle voice guidance, the converted voice does not draw too much of the driver's attention, and the information to be conveyed can be understood easily.
Further, the voice information presentation device 1 according to this embodiment reduces intelligibility by converting the frequency related to the vocal-cord vibration of the reference voice to a specific frequency designated in advance, so a converted voice that does not draw too much of the driver's attention can be generated.
Furthermore, the voice information presentation device 1 according to this embodiment reduces intelligibility by varying the frequency related to the vocal-cord vibration of the reference voice according to a specific function. Even when provided as vehicle voice guidance, the converted voice therefore does not draw too much of the driver's attention, and the information to be conveyed can be understood easily.
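As a rough illustration of these two pitch-based conversions, the following sketch uses the third-party WORLD vocoder bindings (pyworld) to decompose the reference voice and then either fixes the vocal-cord frequency at a preset value or varies it along a simple function. This is a minimal sketch under stated assumptions, not the patent's actual implementation; the file name "guidance.wav", the 150 Hz target, and the 2 Hz sinusoid are illustrative choices.

```python
import numpy as np
import soundfile as sf
import pyworld as pw

x, fs = sf.read("guidance.wav")                    # hypothetical mono reference voice
x = np.ascontiguousarray(x, dtype=np.float64)      # pyworld expects contiguous float64

# Decompose into pitch (f0), spectral envelope (sp) and aperiodicity (ap).
f0, sp, ap = pw.wav2world(x, fs)

# Claim-2 style: replace the vocal-cord frequency with one preset value (150 Hz here).
f0_fixed = np.where(f0 > 0.0, 150.0, 0.0)

# Claim-3 style: vary the frequency along a specific function (a slow 2 Hz sinusoid here).
t = np.arange(len(f0)) * 5.0 / 1000.0              # 5 ms frame period (pyworld default)
f0_varied = np.where(f0 > 0.0, 150.0 + 30.0 * np.sin(2.0 * np.pi * 2.0 * t), 0.0)

converted = pw.synthesize(f0_fixed, sp, ap, fs)    # or pass f0_varied for the claim-3 variant
sf.write("converted.wav", converted, fs)
```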
The voice information presentation device 1 according to this embodiment also removes the annoying quality from the voice by suppressing sharp changes in amplitude in the spectral envelope of the reference voice, so a converted voice that does not draw too much of the driver's attention can be generated.
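One way such envelope smoothing could be realized, sketched under the assumption that a frames-by-bins spectral envelope such as the WORLD envelope sp from the previous snippet is available, is a moving average along the frequency axis of each frame; the 9-bin window width is an illustrative assumption, not a value from the patent.

```python
import numpy as np

def smooth_envelope(sp, width=9):
    """Moving-average smoothing along the frequency axis of each frame,
    suppressing sharp amplitude changes in the spectral envelope."""
    kernel = np.ones(width) / width
    out = np.empty_like(sp)
    for i, frame in enumerate(sp):
        log_frame = np.log(frame + 1e-12)          # smooth in the log domain
        out[i] = np.exp(np.convolve(log_frame, kernel, mode="same"))
    return out

# Usage with the WORLD envelope from the previous sketch (frames x bins):
# sp_smoothed = smooth_envelope(sp)
# converted = pw.synthesize(f0, sp_smoothed, ap, fs)
```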
Furthermore, the voice information presentation device 1 according to this embodiment reduces intelligibility by modulating the amplitude of the time waveform of the reference voice; amplitude distortion can thus be avoided, and the intelligibility can be reliably lowered.
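A minimal sketch of such time-domain amplitude modulation is shown below, assuming a mono reference voice file; the 4 Hz modulation rate and 0.6 modulation depth are illustrative assumptions rather than values given in the patent.

```python
import numpy as np
import soundfile as sf

x, fs = sf.read("guidance.wav")                    # hypothetical mono reference voice
t = np.arange(len(x)) / fs

# Slow amplitude modulation: the envelope swings between 0.4 and 1.0 at 4 Hz.
envelope = 1.0 - 0.6 * (0.5 + 0.5 * np.sin(2.0 * np.pi * 4.0 * t))
converted = x * envelope

sf.write("converted_am.wav", converted, fs)
```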
[Second Embodiment]

Next, a voice information presentation device according to a second embodiment of the present invention is described. Since the configuration of the voice information presentation device according to this embodiment is the same as that of the first embodiment, a detailed description is omitted.
In the voice information presentation device according to this embodiment, the converted-voice generation processing performed by the converted voice generation unit 3 differs from that of the first embodiment. When the reference voice has been generated, the converted voice generation unit 3 of this embodiment generates a signal obtained by shifting the reference voice in the time direction and adds this signal to the reference voice, thereby reducing intelligibility.
For example, as shown in FIG. 22, a signal 72 obtained by shifting the reference voice 71 in the time direction is generated. The converted voice is then generated by adding these signals. As a result, the generated converted voice carries an echo effect, and its intelligibility can be reduced. Although FIG. 22 shows the case where the reference voice is delayed in the time direction, a signal advanced in the time direction may instead be generated and added to the reference voice.
Alternatively, a signal whose energy is smaller than that of the reference voice, or a signal that is slightly distorted relative to the reference voice, may be generated and added to the reference voice.
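A minimal sketch of this second-embodiment idea is shown below: a delayed, attenuated copy of the reference voice is added back to it, giving the echo effect described above. The 80 ms delay and 0.6 gain are illustrative assumptions, not values from the patent.

```python
import numpy as np
import soundfile as sf

x, fs = sf.read("guidance.wav")                    # hypothetical mono reference voice
delay = int(0.080 * fs)                            # copy shifted 80 ms later in time

shifted = np.zeros_like(x)
shifted[delay:] = x[:-delay]                       # the time-shifted copy (signal 72 in FIG. 22)

converted = x + 0.6 * shifted                      # add the attenuated copy to the reference voice
converted /= np.max(np.abs(converted))             # normalize to avoid clipping
sf.write("converted_echo.wav", converted, fs)
```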
[Effects of the Second Embodiment]

As described above in detail, the voice information presentation device according to this embodiment reduces intelligibility by adding, to the reference voice, a signal obtained by shifting the reference voice in the time direction. The converted voice is thereby given an echo effect, which reliably lowers its intelligibility.
The embodiments described above are examples of the present invention. The present invention is therefore not limited to these embodiments, and various modifications other than these embodiments are of course possible according to design and other factors without departing from the technical idea of the present invention.
This application claims priority based on Japanese Patent Application No. 2012-114280 filed on May 18, 2012, the contents of which are incorporated into this specification by reference.
According to the voice information presentation device and voice information presentation method of one aspect of the present invention, a reference voice representing the language information to be presented as speech is generated, and a converted voice whose intelligibility is lower than that of the reference voice is generated and output. As a result, even when provided as vehicle voice guidance, the converted voice does not draw too much of the driver's attention, and the information to be conveyed can be understood easily. The voice information presentation device and voice information presentation method according to one aspect of the present invention are therefore industrially applicable.
DESCRIPTION OF SYMBOLS

1 Voice information presentation device
2 Reference voice generation unit
3 Converted voice generation unit
4 Voice output unit
5 Storage unit

Claims (8)

1.  A voice information presentation device comprising:
    a reference voice generation unit that generates a reference voice representing language information to be presented as speech;
    a converted voice generation unit that converts the reference voice to generate a converted voice whose intelligibility is lower than that of the reference voice; and
    a voice output unit that outputs the converted voice.
2.  The voice information presentation device according to claim 1, wherein the converted voice generation unit reduces the intelligibility by converting a frequency related to vocal-cord vibration of the reference voice to a specific frequency designated in advance.
3.  The voice information presentation device according to claim 1, wherein the converted voice generation unit reduces the intelligibility by varying a frequency related to vocal-cord vibration of the reference voice according to a specific function.
4.  The voice information presentation device according to any one of claims 1 to 3, wherein the converted voice generation unit reduces the intelligibility by suppressing sharp changes in amplitude in a spectral envelope of the reference voice.
5.  The voice information presentation device according to any one of claims 1 to 4, wherein the converted voice generation unit reduces the intelligibility by modulating an amplitude of a time waveform of the reference voice.
6.  The voice information presentation device according to any one of claims 1 to 5, wherein the converted voice generation unit reduces the intelligibility by adding, to the reference voice, a signal obtained by shifting the reference voice in a time direction.
7.  A voice information presentation method using a voice information presentation device, the method comprising:
    generating a reference voice representing language information to be presented as speech;
    converting the reference voice to generate a converted voice whose intelligibility is lower than that of the reference voice; and
    outputting the converted voice.
8.  A voice information presentation device comprising:
    reference voice generation means for generating a reference voice representing language information to be presented as speech;
    converted voice generation means for converting the reference voice to generate a converted voice whose intelligibility is lower than that of the reference voice; and
    voice output means for outputting the converted voice.
PCT/JP2013/062326 2012-05-18 2013-04-26 Voice-information presentation device and voice-information presentation method WO2013172179A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2012114280 2012-05-18
JP2012-114280 2012-05-18

Publications (1)

Publication Number Publication Date
WO2013172179A1 true WO2013172179A1 (en) 2013-11-21

Family

ID=49583588

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2013/062326 WO2013172179A1 (en) 2012-05-18 2013-04-26 Voice-information presentation device and voice-information presentation method

Country Status (1)

Country Link
WO (1) WO2013172179A1 (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2002116045A (en) * 2000-10-11 2002-04-19 Clarion Co Ltd Sound volume controller
WO2006008871A1 (en) * 2004-07-21 2006-01-26 Matsushita Electric Industrial Co., Ltd. Speech synthesizer

Similar Documents

Publication Publication Date Title
US9580010B2 (en) Vehicle approach notification apparatus
US20190389376A1 (en) Apparatus for providing environmental noise compensation for a synthesized vehicle sound
US20130038435A1 (en) Vehicle running warning device
EP1865494B1 (en) Engine sound processing device
JP4173891B2 (en) Sound effect generator for moving objects
CN103253185B (en) Vehicular active sound effect generating apparatus
US9073477B2 (en) Vehicle approach notification device
EP3757986B1 (en) Adaptive noise masking method and system
WO2014112110A1 (en) Speech synthesizer, electronic watermark information detection device, speech synthesis method, electronic watermark information detection method, speech synthesis program, and electronic watermark information detection program
US9050925B2 (en) Vehicle having an electric drive
JP2016134662A (en) Alarm apparatus
JP4983694B2 (en) Audio playback device
JP5454432B2 (en) Vehicle approach notification device
WO2013172179A1 (en) Voice-information presentation device and voice-information presentation method
JP7454119B2 (en) Vehicle sound generator
JP5985306B2 (en) Noise reduction apparatus and noise reduction method
JP5704022B2 (en) Vehicle approach notification device
JP4888163B2 (en) Karaoke equipment
JP5533795B2 (en) Vehicle approach notification device
JP2007256838A (en) Effective sound generation apparatus for vehicle
JP5699920B2 (en) Vehicle acoustic device
KR20210046124A (en) Indoor sound control method and system of vehicle
WO2013172150A1 (en) Voice-information presentation device and voice-information presentation method
JP2008227681A (en) Acoustic characteristic correction system
JP2010156725A (en) Noise canceling method and circuit

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 13791579

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 13791579

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: JP