WO2023073949A1 - Voice output device, server device, voice output method, control method, program, and storage medium - Google Patents


Info

Publication number
WO2023073949A1
Authority
WO
WIPO (PCT)
Prior art keywords
character string
voice
external device
control
control unit
Prior art date
Application number
PCT/JP2021/040103
Other languages
French (fr)
Japanese (ja)
Inventor
匡弘 岩田
Original Assignee
Pioneer Corporation (パイオニア株式会社)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Pioneer Corporation
Priority to PCT/JP2021/040103 priority Critical patent/WO2023073949A1/en
Priority to JP2023556054A priority patent/JPWO2023073949A1/ja
Publication of WO2023073949A1 publication Critical patent/WO2023073949A1/en

Links

Images

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04 Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L13/047 Architecture of speech synthesisers
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination

Definitions

  • the present invention relates to technology that can be used in voice guidance that accompanies communication.
  • a configuration disclosed in Patent Literature 1 is conventionally known as a configuration for guiding a route to a destination of a vehicle by voice.
  • Patent Document 1 discloses a voice guidance system comprising an in-vehicle device that is mounted in a vehicle and has a voice guidance function, and a server device that can communicate with the in-vehicle device via a communication network.
  • Patent Document 1 does not particularly disclose a solution to the above-mentioned problem. Therefore, according to the configuration disclosed in Patent Document 1, the problem described above still remains.
  • the present invention has been made to solve the above problem, and a main object of the present invention is to provide a voice output device capable of suppressing an increase in the amount of communication according to the frequency of utterances in voice guidance involving communication.
  • the claimed invention is a voice output device comprising: a character string generation unit that generates a character string for performing voice guidance; a communication unit that, when the character string is a first character string including a proper noun, transmits the first character string to an external device and receives first voice data corresponding to the first character string from the external device; and a control unit that performs control for outputting a voice corresponding to the first voice data when the character string is the first character string, and performs control for outputting a voice corresponding to second voice data stored in a storage unit as the voice data corresponding to the second character string when the character string is a second character string that does not contain a proper noun.
  • the invention described in the claims is a server device comprising: a receiving unit that receives, from an external device, a character string generated for performing voice guidance; and a control unit that performs control for transmitting first voice data corresponding to the first character string to the external device when the character string is a first character string including a proper noun, and controls the external device to output a voice corresponding to second voice data stored in the external device as the voice data corresponding to the second character string when the character string is a second character string that does not contain a proper noun.
  • the claimed invention is a voice output method in which a character string for performing voice guidance is generated; when the character string is a first character string including a proper noun, the first character string is transmitted to an external device and first voice data corresponding to the first character string is received from the external device; when the character string is the first character string, control is performed to output a voice corresponding to the first voice data, while when the character string is a second character string that does not include a proper noun, control is performed to output the voice corresponding to second voice data stored in a storage unit as the voice data corresponding to the second character string.
  • the claimed invention is a control method in which a character string generated for performing voice guidance is received from an external device; when the character string is a first character string including a proper noun, control is performed for transmitting first voice data corresponding to the first character string to the external device, while when the character string is a second character string that does not include a proper noun, control is performed to cause the external device to output a voice corresponding to second voice data stored in the external device as the voice data corresponding to the second character string.
  • the claimed invention is a program executed by a voice output device having a computer, the program causing the computer to function as: a character string generation unit that generates a character string for performing voice guidance; a communication unit that transmits the first character string to an external device and receives first voice data corresponding to the first character string from the external device when the character string is a first character string including a proper noun; and a control unit that performs control for outputting a voice corresponding to the first voice data when the character string is the first character string, and performs control for outputting the voice corresponding to second voice data stored in a storage unit as the voice data corresponding to the second character string when the character string is a second character string that does not contain a proper noun.
  • the claimed invention is a program executed by a server device having a computer, the program causing the computer to function as: a receiving unit that receives, from an external device, a character string generated for performing voice guidance; and a control unit that performs control for transmitting first voice data corresponding to the first character string to the external device when the character string is a first character string including a proper noun, and performs control for causing the external device to output a voice corresponding to second voice data stored in the external device as the voice data corresponding to the second character string when the character string is a second character string that does not include a proper noun.
  • FIG. 1 is a diagram showing a configuration example of an audio output system according to an embodiment.
  • FIG. 2 is a block diagram showing a schematic configuration of an audio output device.
  • FIG. 3 is a diagram showing an example of a schematic configuration of a server device.
  • FIG. 4 is a flowchart for explaining processing performed in the audio output system according to the first embodiment.
  • FIG. 5 is a flowchart for explaining processing performed in the audio output system according to the second embodiment.
  • the voice output device includes: a character string generation unit that generates a character string for providing voice guidance; a communication unit that, when the character string is a first character string including a proper noun, transmits the first character string to an external device and receives first voice data corresponding to the first character string from the external device; and a control unit that performs control for outputting a voice corresponding to the first voice data when the character string is the first character string, and performs control for outputting a voice corresponding to second voice data stored in a storage unit as the voice data corresponding to the second character string when the character string is a second character string that does not contain a proper noun.
  • the above voice output device has a character string generation unit, a communication unit, and a control unit.
  • the character string generator generates a character string for voice guidance.
  • the communication unit, when the character string is a first character string including a proper noun, transmits the first character string to the external device and receives first voice data corresponding to the first character string from the external device.
  • the control unit performs control for outputting a voice corresponding to the first voice data when the character string is the first character string; when the character string is a second character string that does not contain a proper noun, control is performed to output the voice corresponding to the second voice data stored in the storage unit as the voice data corresponding to the second character string. This makes it possible to suppress an increase in the amount of communication according to the frequency of speech in voice guidance involving communication.
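The branching described above can be sketched in a few lines. This is a hedged illustration only: the function names (`contains_proper_noun`, `output_guidance`) and the dictionary-based voice store are assumptions made for the example, not details of the publication.

```python
# Illustrative sketch of the device-side branching: scripts with a proper noun
# ("first character string") are sent to the external device for synthesis;
# scripts without one ("second character string") reuse stored voice data.

def contains_proper_noun(script, proper_nouns):
    """Rough stand-in for deciding whether a script is a 'first character string'."""
    return any(name in script for name in proper_nouns)

def output_guidance(script, proper_nouns, stored_voice_db, request_tts, play):
    if contains_proper_noun(script, proper_nouns):
        # First character string: obtain first voice data from the external device.
        voice = request_tts(script)
    else:
        # Second character string: use second voice data already in the storage
        # unit, so no communication is needed.
        voice = stored_voice_db[script]
    play(voice)
```

Only the proper-noun branch ever touches the network, which is the mechanism by which the communication amount is kept from growing with utterance frequency.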
  • the control unit, when the character string is the first character string, stores the first voice data received from the external device in the storage unit as cache data.
  • the control unit, when the cache data corresponding to the first character string is stored in the storage unit, performs control to output the voice corresponding to the cache data without communicating with the external device.
  • the character string is a script including at least one sentence.
  • a server device in another embodiment includes: a receiving unit that receives, from an external device, a character string generated for performing voice guidance; and a control unit that performs control for transmitting first voice data corresponding to the first character string to the external device when the character string is a first character string including a proper noun, and controls the external device to output a voice corresponding to the second voice data stored in the external device as the voice data corresponding to the second character string when the character string is a second character string that does not include a proper noun.
  • the control unit stores the first voice data in the storage unit as cache data when the character string is the first character string.
  • the control unit performs control to cause the cache data to be transmitted to the external device when the cache data corresponding to the first character string is stored in the storage unit.
  • the character string is a script containing at least one sentence.
  • the voice output method generates a character string for performing voice guidance; when the character string is a first character string including a proper noun, the first character string is transmitted to an external device and first voice data corresponding to the first character string is received from the external device; when the character string is the first character string, control is performed to output a voice corresponding to the first voice data, while when the character string is a second character string that does not contain a proper noun, control is performed to output the voice corresponding to the second voice data stored in the storage unit as the voice data corresponding to the second character string.
  • This makes it possible to suppress an increase in the amount of communication according to the frequency of speech in voice guidance involving communication.
  • the control method receives, from an external device, a character string generated for providing voice guidance; when the character string is a first character string including a proper noun, control is performed for transmitting first voice data corresponding to the first character string to the external device, while when the character string is a second character string that does not include a proper noun, control is performed to cause the external device to output the voice corresponding to the second voice data stored in the external device as the voice data corresponding to the second character string. This makes it possible to suppress an increase in the amount of communication according to the frequency of speech in voice guidance involving communication.
  • a program executed by a voice output device provided with a computer causes the computer to function as: a character string generation unit that generates a character string for providing voice guidance; a communication unit that transmits the first character string to an external device and receives first voice data corresponding to the first character string from the external device when the character string is a first character string including a proper noun; and a control unit that performs control for outputting a voice corresponding to the first voice data when the character string is the first character string, and performs control for outputting the voice corresponding to the second voice data stored in the storage unit as the voice data corresponding to the second character string when the character string is a second character string that does not contain a proper noun.
  • a program executed by a server device having a computer causes the computer to function as: a receiving unit that receives, from an external device, a character string generated for providing voice guidance; and a control unit that performs control for transmitting first voice data corresponding to the first character string to the external device when the character string is a first character string containing a proper noun, and performs control for causing the external device to output a voice corresponding to the second voice data stored in the external device as the voice data corresponding to the second character string when the character string is a second character string that does not contain a proper noun.
  • FIG. 1 is a diagram illustrating a configuration example of an audio output system according to an embodiment.
  • a voice output system 1 according to this embodiment includes a voice output device 100 and a server device 200.
  • the audio output device 100 is mounted on the vehicle Ve.
  • the server device 200 communicates with a plurality of audio output devices 100 mounted on a plurality of vehicles Ve.
  • the voice output device 100 basically performs route search processing, route guidance processing, and the like for the user, who is a passenger of the vehicle Ve. For example, when the user inputs a destination or the like, the voice output device 100 transmits an upload signal S1 including position information of the vehicle Ve and information on the designated destination to the server device 200. The server device 200 calculates the route to the destination by referring to the map data, and transmits a control signal S2 indicating the route to the destination to the voice output device 100. The voice output device 100 provides route guidance to the user by voice output based on the received control signal S2.
  • the voice output device 100 provides various types of information to the user through interaction with the user.
  • the audio output device 100 supplies the server device 200 with an upload signal S1 including information indicating the content or type of the information request and information about the running state of the vehicle Ve.
  • the server device 200 acquires and generates information requested by the user, and transmits it to the audio output device 100 as a control signal S2.
  • the audio output device 100 provides the received information to the user by audio output.
  • the voice output device 100 moves together with the vehicle Ve and performs route guidance mainly by voice so that the vehicle Ve travels along the guidance route.
  • route guidance based mainly on voice refers to route guidance in which the user can grasp the information necessary for driving the vehicle Ve along the guidance route from voice alone; it does not exclude the voice output device 100 auxiliarily displaying a map of the area around the current position or the like.
  • the voice output device 100 outputs at least various information related to driving, such as points on the route that require guidance (also referred to as “guidance points”), by voice.
  • the guidance point corresponds to, for example, an intersection at which the vehicle Ve turns right or left, or other passing points important for the vehicle Ve to travel along the guidance route.
  • the voice output device 100 provides voice guidance regarding guidance points such as, for example, the distance from the vehicle Ve to the next guidance point and the traveling direction at the guidance point.
  • the voice regarding the guidance for the guidance route is also referred to as "route voice guidance”.
  • the audio output device 100 is installed, for example, on the upper part of the windshield of the vehicle Ve or on the dashboard. Note that the audio output device 100 may be incorporated in the vehicle Ve.
  • FIG. 2 is a block diagram showing a schematic configuration of the audio output device 100.
  • the audio output device 100 mainly includes a communication unit 111, a storage unit 112, an input unit 113, a control unit 114, a sensor group 115, a display unit 116, a microphone 117, a speaker 118, an exterior camera 119, and an in-vehicle camera 120.
  • Each element in the audio output device 100 is interconnected via a bus line 110 .
  • the communication unit 111 performs data communication with the server device 200 under the control of the control unit 114 .
  • the communication unit 111 may receive, for example, map data for updating a map DB (DataBase) 4 to be described later from the server device 200 .
  • the storage unit 112 is composed of various memories such as RAM (Random Access Memory), ROM (Read Only Memory), and non-volatile memory (including hard disk drive, flash memory, etc.).
  • the storage unit 112 stores a program for the audio output device 100 to execute predetermined processing.
  • the above programs may include an application program for providing route guidance by voice, an application program for playing back music, an application program for outputting content other than music (such as television), and the like.
  • Storage unit 112 is also used as a working memory for control unit 114 . Note that the program executed by the audio output device 100 may be stored in a storage medium other than the storage unit 112 .
  • the storage unit 112 also stores a map database (hereinafter, the database is referred to as "DB") 4. Various data required for route guidance are recorded in the map DB 4 .
  • the map DB 4 stores, for example, road data representing a road network by a combination of nodes and links, and facility data indicating facilities that are candidates for destinations, stop-off points, or landmarks.
  • the map DB 4 may be updated based on the map information received by the communication section 111 from the map management server under the control of the control section 114 .
  • the storage unit 112 stores voice data corresponding to a pre-generated script (character string) that does not contain proper nouns and contains at least one sentence.
  • the general-purpose voice data DB 112a of the storage unit 112 stores, in advance, voice data corresponding to scripts such as "Go along the road" and "Soon, turn left at the traffic light. Please stay in the leftmost lane."
  • voice data corresponding to a script (character string) generated to include a proper noun and at least one sentence can be stored in the voice cache data DB 112b of the storage unit 112 as cache data.
  • Proper nouns include, for example, place names, interchange names, intersection names, road names, and landmark names.
  • the input unit 113 is a button, touch panel, remote controller, etc. for user operation.
  • the display unit 116 is a display or the like that displays based on the control of the control unit 114 .
  • the microphone 117 collects sounds inside the vehicle Ve, particularly the driver's utterances.
  • a speaker 118 outputs audio for route guidance to the driver or the like.
  • the sensor group 115 includes an external sensor 121 and an internal sensor 122 .
  • the external sensor 121 is, for example, one or more sensors for recognizing the surrounding environment of the vehicle Ve, such as a lidar, radar, ultrasonic sensor, infrared sensor, and sonar.
  • the internal sensor 122 is a sensor that performs positioning of the vehicle Ve, and is, for example, a GNSS (Global Navigation Satellite System) receiver, a gyro sensor, an IMU (Inertial Measurement Unit), a vehicle speed sensor, or a combination thereof.
  • the sensor group 115 may have a sensor that allows the control unit 114 to directly or indirectly derive the position of the vehicle Ve from the output of the sensor group 115 (that is, by performing estimation processing).
  • the vehicle exterior camera 119 is a camera that captures the exterior of the vehicle Ve.
  • the exterior camera 119 may be only a front camera that captures the front of the vehicle, or may include a rear camera that captures the rear of the vehicle in addition to the front camera.
  • the in-vehicle camera 120 is a camera for photographing the interior of the vehicle Ve, and is provided at a position capable of photographing at least the vicinity of the driver's seat.
  • the control unit 114 includes a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), and the like, and controls the audio output device 100 as a whole. For example, the control unit 114 estimates the position (including the traveling direction) of the vehicle Ve based on the outputs of one or more sensors in the sensor group 115. Further, when a destination is specified via the input unit 113 or the microphone 117, the control unit 114 generates route information indicating a guidance route to the destination, and provides route guidance based on the estimated position information and the map DB 4. In this case, the control unit 114 causes the speaker 118 to output route voice guidance. Further, the control unit 114 controls the display unit 116 to display information about the music being played, video content, a map of the vicinity of the current position, or the like.
  • control unit 114 is not limited to being implemented by program-based software, and may be implemented by any combination of hardware, firmware, and software. Also, the processing executed by the control unit 114 may be implemented using a user-programmable integrated circuit such as an FPGA (field-programmable gate array) or a microcomputer. In this case, this integrated circuit may be used to implement the program executed by the control unit 114 in this embodiment. Thus, the control unit 114 may be realized by hardware other than the processor.
  • the configuration of the audio output device 100 shown in FIG. 2 is an example, and various changes may be made to the configuration shown in FIG.
  • the control unit 114 may receive information necessary for route guidance from the server device 200 via the communication unit 111 .
  • the audio output device 100 may be connected, electrically or by known communication means, to an audio output unit configured separately from the audio output device 100, and the audio may be output from that audio output unit.
  • the audio output unit may be a speaker provided in the vehicle Ve.
  • the audio output device 100 does not have to include the display section 116 .
  • the audio output device 100 need not perform any display-related control itself; such control may be executed by a separate device.
  • the audio output device 100 may acquire, from the vehicle Ve, information output by sensors installed in the vehicle Ve based on a communication protocol such as CAN (Controller Area Network).
  • the server device 200 generates route information indicating a guidance route that the vehicle Ve should travel based on the upload signal S1 including the destination and the like received from the voice output device 100 .
  • the server device 200 then generates a control signal S2 relating to information output in response to the user's information request based on the user's information request indicated by the upload signal S1 transmitted by the audio output device 100 and the running state of the vehicle Ve.
  • the server device 200 then transmits the generated control signal S2 to the audio output device 100.
  • FIG. 3 is a diagram showing an example of a schematic configuration of the server device 200.
  • the server device 200 mainly has a communication section 211 , a storage section 212 and a control section 214 .
  • Each element in the server device 200 is interconnected via a bus line 210 .
  • the communication unit 211 performs data communication with an external device such as the audio output device 100 under the control of the control unit 214 .
  • the storage unit 212 is composed of various types of memory such as RAM, ROM, and non-volatile memory (including hard disk drives, flash memory, etc.). The storage unit 212 stores a program for the server device 200 to execute predetermined processing.
  • the control unit 214 includes a CPU, GPU, etc., and controls the server device 200 as a whole. Further, the control unit 214 operates together with the audio output device 100 by executing a program stored in the storage unit 212, and executes route guidance processing, information provision processing, and the like for the user. For example, based on the upload signal S1 received from the audio output device 100 via the communication unit 211, the control unit 214 generates route information indicating a guidance route or a control signal S2 relating to information output in response to a user's information request. Then, the control unit 214 transmits the generated control signal S2 to the audio output device 100 through the communication unit 211 .
  • FIG. 4 is a flowchart for explaining processing performed in the audio output system according to the first embodiment.
  • control unit 114 of the voice output device 100 acquires driving situation information including information indicating the driving situation of the vehicle Ve, for example, at any timing during route guidance (step S11).
  • the driving situation information may include at least one piece of information that can be acquired based on the functions of each unit of the voice output device 100, such as the direction of the vehicle Ve, the speed of the vehicle Ve, traffic information around the position of the vehicle Ve (including speed regulation and traffic congestion information, etc.), and the current time. Further, the driving situation information may include any of the voice obtained by the microphone 117, the image captured by the exterior camera 119, and the image captured by the in-vehicle camera 120. The driving situation information may also include information received from the server device 200 through the communication unit 111.
  • control unit 114 generates a script SC1 for providing voice guidance to passengers of the vehicle Ve based on the driving situation information acquired in step S11 (step S12).
  • when a script SC1 that does not contain a proper noun is generated in step S12, the control unit 114 acquires voice data SD1 corresponding to the script SC1 from the storage unit 112 (step S20) and performs control to output the corresponding voice from the speaker 118 (step S21).
  • when a script SC2 including a proper noun is generated in step S12, the control unit 114 checks whether or not voice data corresponding to the script SC2 is stored in the storage unit 112 as cache data (step S13).
  • when the control unit 114 detects that the voice data SD2 corresponding to the script SC2 is stored in the storage unit 112 as the cache data CD2, the control unit 114 acquires the cache data CD2 from the storage unit 112 (step S20) and performs control to output the voice corresponding to the cache data CD2 from the speaker 118 (step S21). That is, if the cache data CD2 corresponding to the character string containing the proper noun generated in step S12 is stored in the storage unit 112, the control unit 114 performs control for outputting the voice corresponding to the cache data CD2 without communicating with the server device 200.
  • when the control unit 114 detects that the voice data SD2 corresponding to the script SC2 is not stored in the storage unit 112 as cache data, the control unit 114 controls the communication unit 111 to transmit the script SC2 to the server device 200. According to such control, the communication unit 111 transmits the script SC2 to the server device 200 (step S14).
  • the communication unit 211 of the server device 200 receives the script SC2 transmitted from the audio output device 100 (step S15).
  • the control unit 214 of the server device 200 performs processing for generating the voice data SD2 corresponding to the script SC2 received by the communication unit 211 (step S16), and then controls the communication unit 211 to transmit the voice data SD2 to the audio output device 100. According to such control, the communication unit 211 transmits the voice data SD2 to the audio output device 100 (step S17).
  • the processing related to the generation of the voice data SD2 corresponding to the script SC2 is not limited to being performed in the control unit 214; it may be performed on an external server having a speech synthesis function such as TTS (Text To Speech). In that case, processing for requesting the external server to generate the voice data SD2 corresponding to the script SC2 and processing for acquiring the voice data SD2 from the external server may be performed in step S16.
  • the communication unit 111 receives the audio data SD2 transmitted from the server device 200 (step S18).
  • the control unit 114 stores the audio data SD2 received by the communication unit 111 as the cache data CD2 in the storage unit 112 (step S19), and then controls the speaker 118 to output audio corresponding to the audio data SD2. (Step S21).
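The device-side flow of FIG. 4 (steps S12 through S21) can be summarized in a short sketch. This is a hedged illustration: the class name, the `request_server_tts` callable standing in for steps S14 to S18, and the dictionary-based cache are assumptions made for the example, not details of the embodiment.

```python
# Sketch of the device-side flow of FIG. 4; names are illustrative only.

class VoiceOutputDeviceSketch:
    def __init__(self, general_db, request_server_tts, speaker):
        self.general_db = general_db                  # general-purpose voice data DB 112a
        self.cache = {}                               # voice cache data DB 112b
        self.request_server_tts = request_server_tts  # stands in for steps S14 to S18
        self.speaker = speaker

    def guide(self, script, has_proper_noun):
        if not has_proper_noun:                      # script SC1: no proper noun
            voice = self.general_db[script]          # step S20
        elif script in self.cache:                   # cache hit: no communication
            voice = self.cache[script]               # step S20
        else:
            voice = self.request_server_tts(script)  # steps S14 to S18
            self.cache[script] = voice               # step S19
        self.speaker(voice)                          # step S21
```

Requesting the same proper-noun script twice reaches the server only once; the second request is served from the cache data stored in step S19.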
  • in the present embodiment, the control unit 114 functions as a character string generation unit.
  • as described above, the voice output device 100 can provide voice guidance without communicating with the server device 200 both when voice data corresponding to a script (character string) that does not contain proper nouns is stored in the storage unit 112 and when cache data corresponding to a script that contains proper nouns is stored in the storage unit 112. Therefore, according to the present embodiment, it is possible to suppress an increase in the amount of communication according to the frequency of speech in voice guidance involving communication.
  • since the voice corresponding to the entire script is output, the voice is easier to hear than a voice generated by inserting different proper nouns into a sentence depending on the situation, so audibility can be improved.
  • Modification 1: The control unit 114 may set an upper limit value CMV on the amount of cache data that can be stored in the storage unit 112. When the amount of cache data in the storage unit 112 exceeds the upper limit value CMV due to the storage of new cache data, the control unit 114 may delete from the storage unit 112 the cache data whose voice output is the oldest. In addition, the control unit 114 may associate a use count with the cache data of scripts that include proper nouns, so that scripts with a large number of uses are not deleted from the storage unit 112.
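One way to realize this modification is a size-bounded cache that evicts the entries whose voice output is oldest while protecting frequently used scripts. The sketch below is an illustration under assumptions: the `OrderedDict` recency scheme and the `protect_after_uses` threshold are choices made for the example, not details of the publication.

```python
from collections import OrderedDict

class VoiceCacheSketch:
    """Size-bounded voice cache: oldest-output entries evicted first,
    heavily used entries protected from deletion."""

    def __init__(self, max_bytes, protect_after_uses=5):
        self.max_bytes = max_bytes              # upper limit value CMV
        self.protect_after_uses = protect_after_uses
        self.entries = OrderedDict()            # script -> (voice_bytes, use_count)

    def put(self, script, voice):
        self.entries[script] = (voice, 0)
        self._evict()

    def get(self, script):
        voice, uses = self.entries.pop(script)
        self.entries[script] = (voice, uses + 1)  # re-insert at most-recent end
        return voice

    def _size(self):
        return sum(len(v) for v, _ in self.entries.values())

    def _evict(self):
        # Walk entries from oldest to newest, dropping unprotected ones
        # until the total size fits under the limit.
        for script in list(self.entries):
            if self._size() <= self.max_bytes:
                break
            if self.entries[script][1] < self.protect_after_uses:
                del self.entries[script]
```

With a 5-byte limit, storing two 3-byte entries evicts the older one, matching the "delete oldest first" behaviour described above.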
  • Modification 2: when the storage capacity of the storage unit 112 is limited, for example, the voice output device 100 (control unit 114) may acquire from the storage unit 112 the voice data corresponding to scripts that are output with medium to high frequency among the many scripts that do not contain proper nouns, while acquiring (receiving) from the server device 200 the voice data corresponding to scripts whose voice output frequency is low.
  • Modification 3: for example, voice data corresponding to scripts that do not contain proper nouns but contain variable values, such as "To the destination, x kilometers; the required time is y minutes.", may also be stored in the storage unit 112.
  • FIG. 5 is a flow chart for explaining the processing performed in the audio output system according to the second embodiment.
  • the control unit 114 acquires driving situation information indicating the driving situation of the vehicle Ve, for example at an arbitrary timing during route guidance (step S31).
  • the control unit 114 generates a script for providing voice guidance to the passengers of the vehicle Ve based on the driving situation information acquired in step S31 (step S32), and performs control for transmitting the generated script from the communication unit 111 to the server device 200.
  • the communication unit 111 transmits the script generated in step S32 to the server device 200 (step S33).
  • the communication unit 211 receives the script transmitted from the audio output device 100 (step S34).
  • when the control unit 214 receives, in step S34, the script SC3 that does not include a proper noun, it performs control for transmitting, from the communication unit 211 to the audio output device 100, a control signal CS for outputting the voice corresponding to the script SC3. According to this control, the communication unit 211 transmits the control signal CS to the audio output device 100 (step S39). That is, when the character string received by the communication unit 211 is a character string that does not contain a proper noun, the control unit 214 controls the audio output device 100 so as to output the audio corresponding to the audio data stored in the audio output device 100 (storage unit 112) as the audio data corresponding to that character string.
  • on the other hand, when the script SC4 including a proper noun is received in step S34, the control unit 214 checks whether the voice data corresponding to the script SC4 is stored in the storage unit 212 as cache data (step S35).
  • when the control unit 214 detects that the voice data SD4 corresponding to the script SC4 is stored in the storage unit 212 as the cache data CD4, it acquires the cache data CD4 from the storage unit 212 (step S38) and performs control for transmitting the cache data CD4 from the communication unit 211 to the audio output device 100. According to this control, the communication unit 211 transmits the cache data CD4 to the audio output device 100 (step S39).
  • when the control unit 214 detects that the audio data SD4 corresponding to the script SC4 is not stored in the storage unit 212 as cache data, it performs processing for generating the audio data SD4 (step S36), and then stores the audio data SD4 in the storage unit 212 as the cache data CD4 (step S37). The control unit 214 then performs control for transmitting the audio data SD4 from the communication unit 211 to the audio output device 100. According to this control, the communication unit 211 transmits the audio data SD4 to the audio output device 100 (step S39).
  • the communication unit 111 receives the audio data SD4, the cache data CD4, or the control signal CS transmitted from the server device 200 (step S40).
  • when the communication unit 111 receives the audio data SD4, the control unit 114 performs control to output the audio corresponding to the audio data SD4 from the speaker 118 (step S41).
  • when the communication unit 111 receives the cache data CD4, the control unit 114 performs control to output the sound corresponding to the cache data CD4 from the speaker 118 (step S41).
  • when the communication unit 111 receives the control signal CS, the control unit 114 acquires the voice data SD3 corresponding to the script SC3 generated in step S32 from the storage unit 112, and then performs control to output the sound corresponding to the voice data SD3 from the speaker 118 (step S41).
  • the communication unit 211 has a function as a receiving unit.
  • as described above, when the server device 200 according to the present embodiment receives, from the voice output device 100, a script (character string) that does not include a proper noun as the script (character string) for voice guidance, it transmits, instead of voice data, a control signal for outputting the sound corresponding to the sound data stored in the storage unit 112. Therefore, according to the present embodiment, it is possible to suppress an increase in the amount of communication according to the frequency of utterances in voice guidance involving communication. Moreover, according to the present embodiment, the use of cache data makes it possible to reduce the server load related to voice generation.
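The server-side decision flow of steps S34 to S39 can be summarized in code. This is a minimal sketch; the function name, the returned `(kind, payload)` convention, and the `contains_proper_noun` and `synthesize` callbacks are illustrative assumptions standing in for the proper-noun detection and speech synthesis that the patent leaves unspecified.

```python
def handle_script(script, cache, contains_proper_noun, synthesize):
    """Server-side dispatch for a received guidance script (steps S34-S39).

    Returns a (kind, payload) pair representing what the server transmits:
    - ("control", script): the script has no proper noun, so the device is
      told to play its locally stored audio (control signal CS).
    - ("audio", data): the script has a proper noun; the audio comes from
      the server cache (step S38) or is newly synthesized and cached
      (steps S36-S37).
    """
    if not contains_proper_noun(script):
        return ("control", script)          # step S39: control signal CS only
    if script in cache:                     # step S35: cache check
        return ("audio", cache[script])     # step S38: reuse cached audio
    audio = synthesize(script)              # step S36: generate voice data
    cache[script] = audio                   # step S37: store as cache data
    return ("audio", audio)                 # step S39: transmit audio data
```

A second request for the same proper-noun script is then served from the cache without invoking synthesis again, which is how the cache reduces server load.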
  • Non-transitory computer readable media include various types of tangible storage media.
  • Examples of non-transitory computer-readable media include magnetic storage media (e.g., floppy disks, magnetic tapes, hard disk drives), magneto-optical storage media (e.g., magneto-optical discs), CD-ROM (Read Only Memory), CD-R, CD-R/W, and semiconductor memory (e.g., mask ROM, PROM (Programmable ROM), EPROM (Erasable PROM), flash ROM, RAM (Random Access Memory)).
  • 100 audio output device, 200 server device, 111, 211 communication unit, 112, 212 storage unit, 113 input unit, 114, 214 control unit, 115 sensor group, 116 display unit, 117 microphone, 118 speaker, 119 exterior camera, 120 interior camera


Abstract

A voice output device comprising a character string generation unit, a communication unit, and a control unit. The character string generation unit generates a character string for performing voice guidance. When the character string is a first character string including a proper noun, the communication unit transmits the first character string to an external device and receives first voice data corresponding to the first character string from the external device. When the character string is the first character string, the control unit performs control for causing a voice corresponding to the first voice data to be output. Meanwhile, when the character string is a second character string that does not include a proper noun, the control unit performs control for causing a voice corresponding to second voice data to be output, the second voice data being stored in a storage unit as voice data corresponding to the second character string.

Description

Audio output device, server device, audio output method, control method, program, and storage medium

The present invention relates to technology that can be used in voice guidance involving communication.

As a configuration for guiding a vehicle along a route to its destination by voice, a configuration such as that disclosed in Patent Literature 1, for example, is conventionally known.

Specifically, Patent Literature 1 discloses a voice guidance system comprising an in-vehicle device that is mounted in a vehicle and has a voice guidance function, and a server device capable of communicating with the in-vehicle device via a communication network.

JP 2012-173702 A

In voice guidance involving communication, as disclosed in Patent Literature 1, there is the problem that the amount of communication between the in-vehicle device and the server device increases according to the frequency of utterances from the in-vehicle device.

However, Patent Literature 1 does not disclose anything regarding this problem. Therefore, the configuration disclosed in Patent Literature 1 still leaves an issue corresponding to it.

The present invention has been made to solve the above issue, and its main object is to provide a voice output device capable of suppressing an increase in the amount of communication according to the frequency of utterances in voice guidance involving communication.
The claimed invention is a voice output device comprising: a character string generation unit that generates a character string for performing voice guidance; a communication unit that, when the character string is a first character string including a proper noun, transmits the first character string to an external device and receives first voice data corresponding to the first character string from the external device; and a control unit that, when the character string is the first character string, performs control for outputting a voice corresponding to the first voice data, and, when the character string is a second character string that does not include a proper noun, performs control for outputting a voice corresponding to second voice data stored in a storage unit as the voice data corresponding to the second character string.

The claimed invention is also a server device comprising: a receiving unit that receives, from an external device, a character string generated for performing voice guidance; and a control unit that, when the character string is a first character string including a proper noun, performs control for transmitting first voice data corresponding to the first character string to the external device, and, when the character string is a second character string that does not include a proper noun, performs control for causing the external device to output a voice corresponding to second voice data stored in the external device as the voice data corresponding to the second character string.

The claimed invention is also a voice output method comprising: generating a character string for performing voice guidance; when the character string is a first character string including a proper noun, transmitting the first character string to an external device and receiving first voice data corresponding to the first character string from the external device; and, when the character string is the first character string, performing control for outputting a voice corresponding to the first voice data, while, when the character string is a second character string that does not include a proper noun, performing control for outputting a voice corresponding to second voice data stored in a storage unit as the voice data corresponding to the second character string.

The claimed invention is also a control method comprising: receiving, from an external device, a character string generated for performing voice guidance; and, when the character string is a first character string including a proper noun, performing control for transmitting first voice data corresponding to the first character string to the external device, while, when the character string is a second character string that does not include a proper noun, performing control for causing the external device to output a voice corresponding to second voice data stored in the external device as the voice data corresponding to the second character string.

The claimed invention is also a program executed by a voice output device comprising a computer, the program causing the computer to function as: a character string generation unit that generates a character string for performing voice guidance; a communication unit that, when the character string is a first character string including a proper noun, transmits the first character string to an external device and receives first voice data corresponding to the first character string from the external device; and a control unit that, when the character string is the first character string, performs control for outputting a voice corresponding to the first voice data, and, when the character string is a second character string that does not include a proper noun, performs control for outputting a voice corresponding to second voice data stored in a storage unit as the voice data corresponding to the second character string.

The claimed invention is also a program executed by a server device comprising a computer, the program causing the computer to function as: a receiving unit that receives, from an external device, a character string generated for performing voice guidance; and a control unit that, when the character string is a first character string including a proper noun, performs control for transmitting first voice data corresponding to the first character string to the external device, and, when the character string is a second character string that does not include a proper noun, performs control for causing the external device to output a voice corresponding to second voice data stored in the external device as the voice data corresponding to the second character string.
FIG. 1 is a diagram showing a configuration example of an audio output system according to an embodiment. FIG. 2 is a block diagram showing a schematic configuration of an audio output device. FIG. 3 is a diagram showing an example of the schematic configuration of a server device. FIG. 4 is a flowchart for explaining the processing performed in the audio output system according to the first embodiment. FIG. 5 is a flowchart for explaining the processing performed in the audio output system according to the second embodiment.
In one preferred embodiment of the present invention, the voice output device comprises: a character string generation unit that generates a character string for performing voice guidance; a communication unit that, when the character string is a first character string including a proper noun, transmits the first character string to an external device and receives first voice data corresponding to the first character string from the external device; and a control unit that, when the character string is the first character string, performs control for outputting a voice corresponding to the first voice data, and, when the character string is a second character string that does not include a proper noun, performs control for outputting a voice corresponding to second voice data stored in a storage unit as the voice data corresponding to the second character string.

The above voice output device has a character string generation unit, a communication unit, and a control unit. The character string generation unit generates a character string for performing voice guidance. When the character string is a first character string including a proper noun, the communication unit transmits the first character string to an external device and receives first voice data corresponding to the first character string from the external device. When the character string is the first character string, the control unit performs control for outputting a voice corresponding to the first voice data; when the character string is a second character string that does not include a proper noun, the control unit performs control for outputting a voice corresponding to second voice data stored in a storage unit as the voice data corresponding to the second character string. This makes it possible to suppress an increase in the amount of communication according to the frequency of utterances in voice guidance involving communication.

In one aspect of the above voice output device, when the character string is the first character string, the control unit stores the first voice data received from the external device in the storage unit as cache data.

In one aspect of the above voice output device, when the cache data corresponding to the first character string is stored in the storage unit, the control unit performs control for outputting a voice corresponding to the cache data without communicating with the external device.

In one aspect of the above voice output device, the character string is a script including at least one sentence.
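The device-side control flow described in the above aspects — pre-stored audio for strings without proper nouns, and a server round-trip with local caching otherwise — can be sketched as follows. This is illustrative only; the function name and the `request_audio` callback are assumptions, not the patent's API.

```python
def resolve_audio(script, local_db, cache, contains_proper_noun, request_audio):
    """Device-side selection of voice data for a guidance script.

    - No proper noun: use the second voice data pre-stored in the storage
      unit; no communication with the external device occurs.
    - Proper noun present: use cached first voice data if available;
      otherwise transmit the script to the external device, receive the
      voice data, and store it as cache data for reuse.
    """
    if not contains_proper_noun(script):
        return local_db[script]        # pre-stored general-purpose audio
    if script in cache:
        return cache[script]           # cached audio: no communication needed
    audio = request_audio(script)      # transmit script, receive voice data
    cache[script] = audio              # store as cache data
    return audio
```

With this flow, communication with the external device happens only the first time a given proper-noun script is voiced, which is the mechanism by which utterance frequency stops driving communication volume.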
In another embodiment of the present invention, the server device comprises: a receiving unit that receives, from an external device, a character string generated for performing voice guidance; and a control unit that, when the character string is a first character string including a proper noun, performs control for transmitting first voice data corresponding to the first character string to the external device, and, when the character string is a second character string that does not include a proper noun, performs control for causing the external device to output a voice corresponding to second voice data stored in the external device as the voice data corresponding to the second character string. This makes it possible to suppress an increase in the amount of communication according to the frequency of utterances in voice guidance involving communication.

In one aspect of the above server device, when the character string is the first character string, the control unit stores the first voice data in a storage unit as cache data.

In one aspect of the above server device, when the cache data corresponding to the first character string is stored in the storage unit, the control unit performs control for transmitting the cache data to the external device.

In one aspect of the above server device, the character string is a script including at least one sentence.
In still another embodiment of the present invention, the voice output method generates a character string for performing voice guidance; when the character string is a first character string including a proper noun, transmits the first character string to an external device and receives first voice data corresponding to the first character string from the external device; and, when the character string is the first character string, performs control for outputting a voice corresponding to the first voice data, while, when the character string is a second character string that does not include a proper noun, performs control for outputting a voice corresponding to second voice data stored in a storage unit as the voice data corresponding to the second character string. This makes it possible to suppress an increase in the amount of communication according to the frequency of utterances in voice guidance involving communication.

In still another embodiment of the present invention, the control method receives, from an external device, a character string generated for performing voice guidance; when the character string is a first character string including a proper noun, performs control for transmitting first voice data corresponding to the first character string to the external device; and, when the character string is a second character string that does not include a proper noun, performs control for causing the external device to output a voice corresponding to second voice data stored in the external device as the voice data corresponding to the second character string. This makes it possible to suppress an increase in the amount of communication according to the frequency of utterances in voice guidance involving communication.

In still another embodiment of the present invention, a program executed by a voice output device comprising a computer causes the computer to function as: a character string generation unit that generates a character string for performing voice guidance; a communication unit that, when the character string is a first character string including a proper noun, transmits the first character string to an external device and receives first voice data corresponding to the first character string from the external device; and a control unit that, when the character string is the first character string, performs control for outputting a voice corresponding to the first voice data, and, when the character string is a second character string that does not include a proper noun, performs control for outputting a voice corresponding to second voice data stored in a storage unit as the voice data corresponding to the second character string. By executing this program on a computer, the above voice output device can be realized. This program can be stored in a storage medium and used.

In still another embodiment of the present invention, a program executed by a server device comprising a computer causes the computer to function as: a receiving unit that receives, from an external device, a character string generated for performing voice guidance; and a control unit that, when the character string is a first character string including a proper noun, performs control for transmitting first voice data corresponding to the first character string to the external device, and, when the character string is a second character string that does not include a proper noun, performs control for causing the external device to output a voice corresponding to second voice data stored in the external device as the voice data corresponding to the second character string. By executing this program on a computer, the above server device can be realized. This program can be stored in a storage medium and used.
Preferred embodiments of the present invention will be described below with reference to the drawings.
<First embodiment>

First, the first embodiment will be described.

[System configuration]

(Overall configuration)

FIG. 1 is a diagram illustrating a configuration example of an audio output system according to an embodiment. The voice output system 1 according to this embodiment includes a voice output device 100 and a server device 200. The audio output device 100 is mounted on a vehicle Ve. The server device 200 communicates with a plurality of audio output devices 100 mounted on a plurality of vehicles Ve.
The voice output device 100 basically performs route search processing, route guidance processing, and the like for the user who is a passenger of the vehicle Ve. For example, when the user inputs a destination or the like, the voice output device 100 transmits to the server device 200 an upload signal S1 including the position information of the vehicle Ve and information on the designated destination. The server device 200 calculates the route to the destination by referring to map data, and transmits a control signal S2 indicating the route to the destination to the voice output device 100. The voice output device 100 provides route guidance to the user by voice output based on the received control signal S2.
The voice output device 100 also provides various kinds of information to the user through dialogue with the user. For example, when the user makes an information request, the voice output device 100 supplies the server device 200 with an upload signal S1 including information indicating the content or type of the information request and information on the running state of the vehicle Ve. The server device 200 acquires or generates the information requested by the user and transmits it to the voice output device 100 as a control signal S2. The voice output device 100 provides the received information to the user by voice output.
(Voice output device)

The voice output device 100 moves together with the vehicle Ve and provides route guidance mainly by voice so that the vehicle Ve travels along the guidance route. Note that "route guidance mainly by voice" refers to route guidance in which the user can grasp, at least from voice alone, the information necessary for driving the vehicle Ve along the guidance route; it does not exclude the voice output device 100 additionally displaying a map of the surroundings of the current position or the like. In this embodiment, the voice output device 100 outputs by voice at least various kinds of information related to driving, such as points on the route that require guidance (also referred to as "guidance points"). Here, a guidance point corresponds to, for example, an intersection at which the vehicle Ve turns right or left, or another passing point that is important for the vehicle Ve to travel along the guidance route. The voice output device 100 provides voice guidance regarding guidance points, such as the distance from the vehicle Ve to the next guidance point and the traveling direction at that guidance point. Hereinafter, the voice related to guidance along the guidance route is also referred to as "route voice guidance".
The voice output device 100 is attached, for example, to the upper part of the windshield of the vehicle Ve or on the dashboard. Note that the voice output device 100 may be built into the vehicle Ve.
FIG. 2 is a block diagram showing the schematic configuration of the voice output device 100. The voice output device 100 mainly includes a communication unit 111, a storage unit 112, an input unit 113, a control unit 114, a sensor group 115, a display unit 116, a microphone 117, a speaker 118, an exterior camera 119, and an interior camera 120. The elements in the voice output device 100 are interconnected via a bus line 110.
The communication unit 111 performs data communication with the server device 200 under the control of the control unit 114. The communication unit 111 may receive, for example, map data for updating a map DB (DataBase) 4, which will be described later, from the server device 200.
The storage unit 112 is composed of various memories such as a RAM (Random Access Memory), a ROM (Read Only Memory), and nonvolatile memory (including a hard disk drive, flash memory, and the like). The storage unit 112 stores a program for the voice output device 100 to execute predetermined processing. This program may include an application program for providing route guidance by voice, an application program for playing music, an application program for outputting content other than music (such as television), and the like. The storage unit 112 is also used as a working memory of the control unit 114. Note that the program executed by the voice output device 100 may be stored in a storage medium other than the storage unit 112.
 また、記憶部112は、地図データベース(以下、データベースを「DB」と記す。)4を記憶する。地図DB4には、経路案内に必要な種々のデータが記録されている。地図DB4は、例えば、道路網をノードとリンクの組合せにより表した道路データ、及び、目的地、立寄地、又はランドマークの候補となる施設を示す施設データなどを記憶している。地図DB4は、制御部114の制御に基づき、通信部111が地図管理サーバから受信する地図情報に基づき更新されてもよい。 The storage unit 112 also stores a map database (hereinafter, the database is referred to as "DB") 4. Various data required for route guidance are recorded in the map DB 4 . The map DB 4 stores, for example, road data representing a road network by a combination of nodes and links, and facility data indicating facilities that are candidates for destinations, stop-off points, or landmarks. The map DB 4 may be updated based on the map information received by the communication section 111 from the map management server under the control of the control section 114 .
 また、記憶部112には、固有名詞を含まずかつ少なくとも1つの文を含むように予め生成されたスクリプト(文字列)に対応する音声データが格納されている。具体的には、記憶部112の汎用音声データDB112aには、例えば、「このまま道なりに進みます。」、及び、「まもなく信号を左です。左端の車線を進んでください。」等のようなスクリプトに対応する音声データが予め格納されている。また、記憶部112の音声キャッシュデータDB112bには、固有名詞を含みかつ少なくとも1つの文を含むように生成されたスクリプト(文字列)に対応する音声データをキャッシュデータとして格納することができる。固有名詞としては、例えば、地名、インターチェンジ名、交差点名、道路名、及び、ランドマーク名等が挙げられる。 In addition, the storage unit 112 stores voice data corresponding to a pre-generated script (character string) that does not contain proper nouns and contains at least one sentence. Specifically, the general-purpose voice data DB 112a of the storage unit 112 stores, for example, "Go along the road" and "Soon turn left at the traffic light. Please stay in the leftmost lane." Voice data corresponding to the script is stored in advance. In addition, voice data corresponding to a script (character string) generated to include a proper noun and at least one sentence can be stored in the voice cache data DB 112b of the storage unit 112 as cache data. Proper nouns include, for example, place names, interchange names, intersection names, road names, and landmark names.
 The input unit 113 includes buttons, a touch panel, a remote controller, and the like for user operation. The display unit 116 is a display or the like that performs display under the control of the control unit 114. The microphone 117 collects sound inside the vehicle Ve, in particular the driver's utterances. The speaker 118 outputs voice for route guidance to the driver and other occupants.
 The sensor group 115 includes an external sensor 121 and an internal sensor 122. The external sensor 121 is one or more sensors for recognizing the surrounding environment of the vehicle Ve, such as a lidar, a radar, an ultrasonic sensor, an infrared sensor, or a sonar. The internal sensor 122 is a sensor for positioning the vehicle Ve, and is, for example, a GNSS (Global Navigation Satellite System) receiver, a gyro sensor, an IMU (Inertial Measurement Unit), a vehicle speed sensor, or a combination thereof. It suffices that the sensor group 115 includes sensors from whose output the control unit 114 can derive the position of the vehicle Ve directly or indirectly (that is, by performing estimation processing).
 The exterior camera 119 is a camera that captures the outside of the vehicle Ve. The exterior camera 119 may be only a front camera that captures the area ahead of the vehicle, may include a rear camera that captures the area behind the vehicle in addition to the front camera, or may be an omnidirectional camera capable of capturing the entire surroundings of the vehicle Ve. The in-vehicle camera 120, on the other hand, is a camera that captures the interior of the vehicle Ve, and is provided at a position from which at least the vicinity of the driver's seat can be captured.
 The control unit 114 includes a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), and the like, and controls the audio output device 100 as a whole. For example, the control unit 114 estimates the position (including the heading) of the vehicle Ve based on the outputs of one or more sensors in the sensor group 115. When a destination is specified through the input unit 113 or the microphone 117, the control unit 114 generates route information indicating a guidance route to that destination, and provides route guidance based on the route information, the estimated position information of the vehicle Ve, and the map DB 4. In this case, the control unit 114 causes the speaker 118 to output route voice guidance. The control unit 114 also controls the display unit 116 to display information about the music being played, video content, a map of the area around the current position, and the like.
 The processing executed by the control unit 114 is not limited to implementation in software by programs, and may be implemented by any combination of hardware, firmware, and software. The processing executed by the control unit 114 may also be implemented using a user-programmable integrated circuit such as an FPGA (field-programmable gate array) or a microcontroller. In that case, the integrated circuit may be used to realize the program that the control unit 114 executes in the present embodiment. In this way, the control unit 114 may be realized by hardware other than a processor.
 The configuration of the audio output device 100 shown in FIG. 2 is an example, and various changes may be made to it. For example, instead of the storage unit 112 storing the map DB 4, the control unit 114 may receive information necessary for route guidance from the server device 200 via the communication unit 111. In another example, instead of including the speaker 118, the audio output device 100 may be connected, electrically or by known communication means, to an audio output unit configured separately from the audio output device 100, and cause that audio output unit to output voice. In this case, the audio output unit may be a speaker provided in the vehicle Ve. In still another example, the audio output device 100 need not include the display unit 116. In this case, the audio output device 100 need not perform any display-related control, and may be electrically connected, by wire or wirelessly, to a display unit provided in the vehicle Ve or elsewhere and cause that display unit to perform predetermined display. Similarly, instead of including the sensor group 115, the audio output device 100 may acquire information output by sensors installed in the vehicle Ve from the vehicle Ve based on a communication protocol such as CAN (Controller Area Network).
 (Server device)
 The server device 200 generates route information indicating a guidance route that the vehicle Ve should travel, based on the upload signal S1, including the destination and the like, received from the audio output device 100. The server device 200 then generates a control signal S2 relating to information output in response to a user's information request, based on the user's information request indicated by a subsequent upload signal S1 transmitted by the audio output device 100 and on the running state of the vehicle Ve, and transmits the generated control signal S2 to the audio output device 100.
 FIG. 3 is a diagram showing an example of a schematic configuration of the server device 200. The server device 200 mainly includes a communication unit 211, a storage unit 212, and a control unit 214. The elements in the server device 200 are interconnected via a bus line 210.
 The communication unit 211 performs data communication with an external device such as the audio output device 100 under the control of the control unit 214. The storage unit 212 is composed of various memories such as a RAM, a ROM, and a non-volatile memory (including a hard disk drive, a flash memory, and the like). The storage unit 212 stores programs for the server device 200 to execute predetermined processing, and also contains the map DB 4. The storage unit 212 is further provided with a voice cache data DB 212b capable of storing, as cache data, voice data corresponding to scripts (character strings) generated so as to contain a proper noun and at least one sentence.
 The control unit 214 includes a CPU, a GPU, and the like, and controls the server device 200 as a whole. By executing the programs stored in the storage unit 212, the control unit 214 operates together with the audio output device 100 to execute route guidance processing, information provision processing, and the like for the user. For example, based on the upload signal S1 received from the audio output device 100 via the communication unit 211, the control unit 214 generates route information indicating a guidance route, or a control signal S2 relating to information output in response to a user's information request. The control unit 214 then transmits the generated control signal S2 to the audio output device 100 through the communication unit 211.
 [Processing flow]
 Next, processing performed in the audio output system 1 will be described. FIG. 4 is a flowchart for explaining the processing performed in the audio output system according to the first embodiment.
 First, the control unit 114 of the audio output device 100 acquires driving situation information, including information indicating the driving situation of the vehicle Ve, at some timing during route guidance, for example (step S11).
 The driving situation information only needs to include at least one piece of information that can be acquired based on the functions of the units of the audio output device 100, such as the heading of the vehicle Ve, the speed of the vehicle Ve, traffic information around the position of the vehicle Ve (including speed regulations, congestion information, and the like), and the current time. The driving situation information may also include any of the sound picked up by the microphone 117, images captured by the exterior camera 119, and images captured by the in-vehicle camera 120, as well as information received from the server device 200 through the communication unit 111.
 Next, based on the driving situation information acquired in step S11, the control unit 114 generates a script SC1 for providing voice guidance to the occupants of the vehicle Ve (step S12).
 When the script SC1 generated in step S12 contains no proper noun, the control unit 114 acquires the voice data SD1 corresponding to the script SC1 from the storage unit 112 (step S20), and then performs control to cause the speaker 118 to output voice corresponding to the voice data SD1 (step S21).
 On the other hand, when a script SC2 containing a proper noun is generated in step S12, the control unit 114 checks whether voice data corresponding to the script SC2 is stored in the storage unit 112 as cache data (step S13).
 When the control unit 114 detects that the voice data SD2 corresponding to the script SC2 is stored in the storage unit 112 as cache data CD2, it acquires the cache data CD2 from the storage unit 112 (step S20), and then performs control to cause the speaker 118 to output voice corresponding to the cache data CD2 (step S21). That is, when the cache data CD2 corresponding to the character string containing the proper noun generated in step S12 is stored in the storage unit 112, the control unit 114 performs control to output the voice corresponding to the cache data CD2 without communicating with the server device 200.
 When the control unit 114 detects that the voice data SD2 corresponding to the script SC2 is not stored in the storage unit 112 as cache data, it performs control to cause the communication unit 111 to transmit the script SC2 to the server device 200. In response to this control, the communication unit 111 transmits the script SC2 to the server device 200 (step S14).
 The communication unit 211 of the server device 200 receives the script SC2 transmitted from the audio output device 100 (step S15).
 The control unit 214 of the server device 200 performs processing for generating voice data SD2 corresponding to the script SC2 received by the communication unit 211 (step S16), and then performs control to cause the communication unit 211 to transmit the voice data SD2 to the audio output device 100. In response to this control, the communication unit 211 transmits the voice data SD2 to the audio output device 100 (step S17).
 In the present embodiment, the processing for generating the voice data SD2 corresponding to the script SC2 is not limited to being performed in the control unit 214; it may be performed, for example, in an external server having a speech synthesis function such as TTS (Text To Speech). In such a case, processing for requesting the external server to generate the voice data SD2 corresponding to the script SC2, and processing for acquiring the voice data SD2 from the external server, may be performed in step S16.
 The communication unit 111 receives the voice data SD2 transmitted from the server device 200 (step S18).
 The control unit 114 stores the voice data SD2 received by the communication unit 111 in the storage unit 112 as cache data CD2 (step S19), and then performs control to cause the speaker 118 to output voice corresponding to the voice data SD2 (step S21).
 According to the present embodiment, the control unit 114 functions as a character string generation unit.
 As described above, the audio output device 100 according to the present embodiment can provide voice guidance without communicating with the server device 200 both when voice data corresponding to a script (character string) containing no proper noun is stored in the storage unit 112 and when cache data corresponding to a script (character string) containing a proper noun is stored in the storage unit 112. Therefore, according to the present embodiment, in voice guidance involving communication, an increase in the amount of communication corresponding to the utterance frequency can be suppressed. Furthermore, since the present embodiment outputs voice corresponding to a whole script, intelligibility can be improved compared with, for example, outputting voice generated by splicing situation-dependent proper nouns into the middle of a fixed sentence.
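The client-side branch of the flow of FIG. 4 (steps S12 to S21) can be summarized in code. The sketch below is an illustrative reconstruction, not the embodiment's implementation: the helper names (`generic_db`, `request_tts`, `speak`, and so on) are hypothetical, and synthesized audio is represented by plain strings.

```python
# Sketch of the client-side flow of Fig. 4 (steps S12-S21).
# All names are hypothetical; audio data is modeled as strings.

class VoiceOutputDevice:
    def __init__(self, generic_db, request_tts):
        self.generic_db = generic_db    # pre-stored audio for proper-noun-free scripts (DB 112a)
        self.cache_db = {}              # audio cache for proper-noun scripts (DB 112b)
        self.request_tts = request_tts  # round trip to the server device 200

    def speak(self, script, has_proper_noun):
        if not has_proper_noun:
            # Script SC1: audio SD1 is already on the device (step S20).
            return self.generic_db[script]
        if script in self.cache_db:
            # Cache hit: no communication with the server (steps S13, S20).
            return self.cache_db[script]
        # Cache miss: fetch SD2 from the server and keep it as CD2
        # (steps S14-S19).
        audio = self.request_tts(script)
        self.cache_db[script] = audio
        return audio
```

A second request for the same proper-noun script is then served locally, which is exactly the communication saving the embodiment describes.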
 [Modifications]
 Modifications suitable for the above embodiment will be described below.
 (Modification 1)
 The control unit 114 may set an upper limit value CMV for the amount of cache data that can be stored in the storage unit 112. When storing new cache data would cause the amount of cache data in the storage unit 112 to exceed the upper limit value CMV, the control unit 114 may delete from the storage unit 112 the cache data whose voice was output least recently. The control unit 114 may also associate a use count with the cache data of each script containing a proper noun, so that frequently used scripts are not deleted from the storage unit 112.
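The eviction policy of Modification 1 can be sketched with Python's `collections.OrderedDict`, whose insertion order is used here to track which cached script was output least recently. The entry-count limit and the use-count threshold below are illustrative assumptions; the embodiment itself specifies only an upper limit CMV on the amount of cache data and protection for frequently used scripts.

```python
from collections import OrderedDict

class VoiceCache:
    """Least-recently-output eviction with a use-count exemption (Modification 1).

    The limit is counted in entries for simplicity; the embodiment
    describes an upper limit CMV on the amount of cache data.
    """
    def __init__(self, limit, protect_after=3):
        self.limit = limit
        self.protect_after = protect_after  # hypothetical "frequently used" threshold
        self.entries = OrderedDict()        # script -> (audio, use_count)

    def output(self, script):
        audio, count = self.entries.pop(script)
        self.entries[script] = (audio, count + 1)  # move to most-recent end
        return audio

    def store(self, script, audio):
        self.entries[script] = (audio, 0)
        while len(self.entries) > self.limit:
            # Scan from the least recently output entry, skipping
            # entries whose use count marks them as frequently used.
            for key, (_, count) in self.entries.items():
                if count < self.protect_after:
                    del self.entries[key]
                    break
            else:
                break  # every entry is protected; stop evicting
```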
 (Modification 2)
 When the storage capacity of the storage unit 112 is limited, for example, the audio output device 100 (control unit 114) may acquire from the storage unit 112 the voice data corresponding to those proper-noun-free scripts that are output with medium to high frequency, while acquiring (receiving) from the server device 200 the voice data corresponding to scripts that are output only infrequently.
 (Modification 3)
 For scripts that contain no proper noun but contain variable values, such as "x kilometers to the destination; the estimated travel time is y minutes.", voice data may be stored in the storage unit 112 only for the combinations of x and y that can actually be expected to occur.
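One way Modification 3 might bound the set of pre-stored variants is to generate only the (x, y) combinations whose implied average speed is realistic. The sketch below is illustrative: the distance and duration grids and the speed bounds are assumptions, not values from the embodiment.

```python
def plausible_variants(distances_km, durations_min, min_kmh=10, max_kmh=120):
    """Enumerate only (x, y) pairs whose implied average speed is realistic."""
    scripts = []
    for x in distances_km:
        for y in durations_min:
            speed = x / (y / 60)  # km/h implied by "x km in y minutes"
            if min_kmh <= speed <= max_kmh:
                scripts.append(
                    f"{x} kilometers to the destination; "
                    f"the estimated travel time is {y} minutes."
                )
    return scripts
```

Only the scripts returned by such a filter would then be synthesized in advance and stored in the general-purpose voice data DB 112a.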
 <Second embodiment>
 Next, a second embodiment will be described. In the present embodiment, description of parts to which the same configuration and the like as in the first embodiment can be applied is omitted as appropriate, and the description focuses on the parts that differ from the first embodiment. Specifically, the present embodiment has the same system configuration as the first embodiment, but processing is performed according to a different processing flow. The processing flow according to the present embodiment is therefore mainly described below.
 [Processing flow]
 FIG. 5 is a flowchart for explaining the processing performed in the audio output system according to the second embodiment.
 First, the control unit 114 acquires driving situation information, including information indicating the driving situation of the vehicle Ve, at some timing during route guidance, for example (step S31).
 Next, based on the driving situation information acquired in step S31, the control unit 114 generates a script for providing voice guidance to the occupants of the vehicle Ve (step S32), and performs control to cause the communication unit 111 to transmit the generated script to the server device 200. In response to this control, the communication unit 111 transmits the script generated in step S32 to the server device 200 (step S33).
 The communication unit 211 receives the script transmitted from the audio output device 100 (step S34).
 When a script SC3 containing no proper noun is received in step S34, the control unit 214 performs control to cause the communication unit 211 to transmit to the audio output device 100 a control signal CS for outputting the voice corresponding to the script SC3. In response to this control, the communication unit 211 transmits the control signal CS to the audio output device 100 (step S39). That is, when the character string received by the communication unit 211 contains no proper noun, the control unit 214 performs control to cause the audio output device 100 to output the voice corresponding to the voice data stored in the audio output device 100 (storage unit 112) as the voice data corresponding to that character string.
 When a script SC4 containing a proper noun is received in step S34, the control unit 214 checks whether voice data corresponding to the script SC4 is stored in the storage unit 212 as cache data (step S35).
 When the control unit 214 detects that the voice data SD4 corresponding to the script SC4 is stored in the storage unit 212 as cache data CD4, it acquires the cache data CD4 from the storage unit 212 (step S38), and then performs control to cause the communication unit 211 to transmit the cache data CD4 to the audio output device 100. In response to this control, the communication unit 211 transmits the cache data CD4 to the audio output device 100 (step S39).
 When the control unit 214 detects that the voice data SD4 corresponding to the script SC4 is not stored in the storage unit 212 as cache data, it performs processing for generating the voice data SD4 (step S36), and then stores the voice data SD4 in the storage unit 212 as cache data CD4 (step S37). The control unit 214 then performs control to cause the communication unit 211 to transmit the voice data SD4 to the audio output device 100. In response to this control, the communication unit 211 transmits the voice data SD4 to the audio output device 100 (step S39).
 The communication unit 111 receives the voice data SD4, the cache data CD4, or the control signal CS transmitted from the server device 200 (step S40).
 When the communication unit 111 receives the voice data SD4, the control unit 114 performs control to cause the speaker 118 to output voice corresponding to the voice data SD4 (step S41).
 Likewise, when the communication unit 111 receives the cache data CD4, the control unit 114 performs control to cause the speaker 118 to output voice corresponding to the cache data CD4 (step S41).
 When the communication unit 111 receives the control signal CS, on the other hand, the control unit 114 acquires from the storage unit 112 the voice data SD3 corresponding to the script SC3 generated in step S32, and then performs control to cause the speaker 118 to output voice corresponding to the voice data SD3 (step S41).
 According to the present embodiment, the communication unit 211 functions as a receiving unit.
 As described above, when the server device 200 according to the present embodiment receives from the audio output device 100 a voice-guidance script (character string) containing no proper noun, it transmits, instead of voice data, a control signal for outputting the voice corresponding to the voice data stored in the storage unit 112. Therefore, according to the present embodiment, in voice guidance involving communication, an increase in the amount of communication corresponding to the utterance frequency can be suppressed. Furthermore, according to the present embodiment, the use of cache data can reduce the server load associated with voice generation.
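The server-side branch of the flow of FIG. 5 (steps S34 to S39) can be summarized in the same style. The sketch below is an illustrative reconstruction: the helper names are hypothetical, and the control signal and synthesized audio are represented by strings.

```python
# Sketch of the server-side flow of Fig. 5 (steps S34-S39).
# All names are hypothetical; audio data is modeled as strings.

CONTROL_SIGNAL = "CS"  # tells the device to use its own stored audio

class ServerDevice:
    def __init__(self, synthesize):
        self.synthesize = synthesize  # TTS backend (may itself be an external server)
        self.cache_db = {}            # voice cache data DB 212b

    def handle_script(self, script, has_proper_noun):
        if not has_proper_noun:
            # Script SC3: reply with only a control signal (step S39);
            # the audio SD3 is already stored on the device side.
            return ("control", CONTROL_SIGNAL)
        if script in self.cache_db:
            # Cache hit: return CD4 without re-synthesizing (steps S35, S38).
            return ("audio", self.cache_db[script])
        # Cache miss: synthesize SD4 and cache it as CD4 (steps S36-S37).
        audio = self.synthesize(script)
        self.cache_db[script] = audio
        return ("audio", audio)
```

The proper-noun-free reply carries only a small control signal, and repeated proper-noun scripts are served from the server cache, which is where the communication and synthesis savings of this embodiment come from.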
 In each of the embodiments described above, the programs can be stored using various types of non-transitory computer-readable media and supplied to a computer such as the control unit. Non-transitory computer-readable media include various types of tangible storage media. Examples of non-transitory computer-readable media include magnetic storage media (e.g., flexible disks, magnetic tapes, hard disk drives), magneto-optical storage media (e.g., magneto-optical disks), CD-ROMs (Read Only Memory), CD-Rs, CD-R/Ws, and semiconductor memories (e.g., mask ROMs, PROMs (Programmable ROMs), EPROMs (Erasable PROMs), flash ROMs, and RAMs (Random Access Memory)).
 Although the present invention has been described above with reference to the embodiments, the present invention is not limited to those embodiments. Various changes that those skilled in the art can understand may be made to the configuration and details of the present invention within the scope of the present invention. That is, the present invention naturally includes various variations and modifications that those skilled in the art could make in accordance with the entire disclosure, including the claims, and the technical ideas thereof. The disclosures of the patent documents and other references cited above are incorporated herein by reference.
 100 audio output device
 200 server device
 111, 211 communication unit
 112, 212 storage unit
 113 input unit
 114, 214 control unit
 115 sensor group
 116 display unit
 117 microphone
 118 speaker
 119 exterior camera
 120 in-vehicle camera

Claims (13)

  1.  音声案内を行うための文字列を生成する文字列生成部と、
     前記文字列が固有名詞を含む第1の文字列である場合に、前記第1の文字列を外部装置へ送信し、前記第1の文字列に対応する第1の音声データを前記外部装置から受信する通信部と、
     前記文字列が前記第1の文字列である場合に、前記第1の音声データに対応する音声を出力させるための制御を行う一方で、前記文字列が固有名詞を含まない第2の文字列である場合に、当該第2の文字列に対応する音声データとして記憶部に格納されている第2の音声データに対応する音声を出力させるための制御を行う制御部と、
     を有する音声出力装置。
    a character string generation unit that generates a character string for performing voice guidance;
    when the character string is a first character string including a proper noun, the first character string is transmitted to an external device, and first voice data corresponding to the first character string is transmitted from the external device a receiving communication unit;
    When the character string is the first character string, control is performed to output a voice corresponding to the first voice data, while the second character string does not include a proper noun. a control unit for controlling to output a sound corresponding to the second sound data stored in the storage unit as the sound data corresponding to the second character string when
    an audio output device having
  2.  前記制御部は、前記文字列が前記第1の文字列である場合に、前記外部装置から受信した前記第1の音声データをキャッシュデータとして前記記憶部に格納する請求項1に記載の音声出力装置。 2. The audio output according to claim 1, wherein the control unit stores the first audio data received from the external device as cache data in the storage unit when the character string is the first character string. Device.
  3.  The voice output device according to claim 2, wherein, when the cache data corresponding to the first character string is stored in the storage unit, the control unit performs control to output voice corresponding to the cache data without communicating with the external device.
  4.  The voice output device according to any one of claims 1 to 3, wherein the character string is a script containing at least one sentence.
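The device-side behaviour recited in claims 1 to 3 can be illustrated with a minimal sketch: strings containing a proper noun are sent to the external server for synthesis (with the response cached, so a repeat request needs no communication), while fixed phrases are played back from pre-stored local voice data. All names here (`contains_proper_noun`, `request_tts_from_server`, the byte payloads) are illustrative assumptions, not part of the disclosure.

```python
# Hypothetical sketch of the voice output device of claims 1-3.
# Pre-stored voice data for fixed guidance phrases (the "second voice data").
LOCAL_VOICE_DATA = {
    "Turn left ahead.": b"<pcm: turn-left>",
    "You have arrived.": b"<pcm: arrived>",
}

def contains_proper_noun(text, proper_nouns=("Shibuya", "Tokyo Tower")):
    # Stand-in for real morphological analysis / named-entity detection.
    return any(name in text for name in proper_nouns)

def request_tts_from_server(text):
    # Stand-in for the network round trip to the external device (server).
    return b"<pcm synthesized for: " + text.encode() + b">"

class VoiceOutputDevice:
    def __init__(self):
        self.cache = {}  # claim 2: server responses kept as cache data

    def speak(self, text):
        if contains_proper_noun(text):       # first character string
            if text in self.cache:           # claim 3: no server round trip
                return self.cache[text]
            audio = request_tts_from_server(text)
            self.cache[text] = audio
            return audio
        # second character string: play pre-stored local voice data
        return LOCAL_VOICE_DATA[text]
```

The split keeps the large, open-ended vocabulary (place names, facility names) on the server while the small, fixed guidance vocabulary stays on the device.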
  5.  A server device comprising:
     a receiving unit that receives, from an external device, a character string generated for voice guidance; and
     a control unit that performs control to transmit first voice data corresponding to a first character string to the external device when the character string is the first character string, which contains a proper noun, and performs control to cause the external device to output voice corresponding to second voice data stored in the external device as voice data corresponding to a second character string when the character string is the second character string, which contains no proper noun.
  6.  The server device according to claim 5, wherein, when the character string is the first character string, the control unit stores the first voice data in a storage unit as cache data.
  7.  The server device according to claim 6, wherein, when the cache data corresponding to the first character string is stored in the storage unit, the control unit performs control to transmit the cache data to the external device.
  8.  The server device according to any one of claims 5 to 7, wherein the character string is a script containing at least one sentence.
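The server side of claims 5 to 7 admits a similarly small sketch: the server synthesizes (and caches) audio only for proper-noun strings, and otherwise instructs the requesting device to play its own stored data. The `("audio", ...)` / `("use_local", None)` response protocol and the `synthesize` stub are assumptions for illustration only.

```python
# Hypothetical sketch of the server device of claims 5-7.
class GuidanceTtsServer:
    def __init__(self, proper_nouns):
        self.proper_nouns = set(proper_nouns)
        self.cache = {}  # claim 6: synthesized audio cached per string

    def synthesize(self, text):
        # Stand-in for an actual TTS engine.
        return b"<pcm: " + text.encode() + b">"

    def handle(self, text):
        """Return ('audio', data) or ('use_local', None)."""
        if any(name in text for name in self.proper_nouns):
            if text not in self.cache:      # claim 7: reuse cached audio
                self.cache[text] = self.synthesize(text)
            return ("audio", self.cache[text])
        # second character string: device plays its own stored voice data
        return ("use_local", None)
```

Responding with `use_local` for fixed phrases avoids transmitting audio the device already holds, which is the bandwidth-saving point of the claimed split.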
  9.  A voice output method comprising:
     generating a character string for voice guidance;
     when the character string is a first character string containing a proper noun, transmitting the first character string to an external device and receiving first voice data corresponding to the first character string from the external device; and
     performing control to output voice corresponding to the first voice data when the character string is the first character string, and performing control to output voice corresponding to second voice data stored in a storage unit as voice data corresponding to a second character string when the character string is the second character string, which contains no proper noun.
  10.  A control method comprising:
     receiving, from an external device, a character string generated for voice guidance; and
     performing control to transmit first voice data corresponding to a first character string to the external device when the character string is the first character string, which contains a proper noun, and performing control to cause the external device to output voice corresponding to second voice data stored in the external device as voice data corresponding to a second character string when the character string is the second character string, which contains no proper noun.
  11.  A program executed by a voice output device comprising a computer, the program causing the computer to function as:
     a character string generation unit that generates a character string for voice guidance;
     a communication unit that, when the character string is a first character string containing a proper noun, transmits the first character string to an external device and receives first voice data corresponding to the first character string from the external device; and
     a control unit that performs control to output voice corresponding to the first voice data when the character string is the first character string, and performs control to output voice corresponding to second voice data stored in a storage unit as voice data corresponding to a second character string when the character string is the second character string, which contains no proper noun.
  12.  A program executed by a server device comprising a computer, the program causing the computer to function as:
     a receiving unit that receives, from an external device, a character string generated for voice guidance; and
     a control unit that performs control to transmit first voice data corresponding to a first character string to the external device when the character string is the first character string, which contains a proper noun, and performs control to cause the external device to output voice corresponding to second voice data stored in the external device as voice data corresponding to a second character string when the character string is the second character string, which contains no proper noun.
  13.  A storage medium storing the program according to claim 11 or 12.
PCT/JP2021/040103 2021-10-29 2021-10-29 Voice output device, server device, voice output method, control method, program, and storage medium WO2023073949A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/JP2021/040103 WO2023073949A1 (en) 2021-10-29 2021-10-29 Voice output device, server device, voice output method, control method, program, and storage medium
JP2023556054A JPWO2023073949A1 (en) 2021-10-29 2021-10-29

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2021/040103 WO2023073949A1 (en) 2021-10-29 2021-10-29 Voice output device, server device, voice output method, control method, program, and storage medium

Publications (1)

Publication Number Publication Date
WO2023073949A1 (en) 2023-05-04

Family

ID=86157627

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2021/040103 WO2023073949A1 (en) 2021-10-29 2021-10-29 Voice output device, server device, voice output method, control method, program, and storage medium

Country Status (2)

Country Link
JP (1) JPWO2023073949A1 (en)
WO (1) WO2023073949A1 (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH116743A (en) * 1997-04-22 1999-01-12 Toyota Motor Corp Mobile terminal device and voice output system for it
JP2004170887A (en) * 2002-11-22 2004-06-17 Canon Inc Data processing system and data storing method
JP2011033764A (en) * 2009-07-31 2011-02-17 Hitachi Ltd Voice read system and voice read terminal
JP2012173702A (en) * 2011-02-24 2012-09-10 Denso Corp Voice guidance system
JP2012194284A (en) * 2011-03-15 2012-10-11 Toshiba Corp Voice conversion supporting device, program and voice conversion supporting method
US20150073770A1 (en) * 2013-09-10 2015-03-12 At&T Intellectual Property I, L.P. System and method for intelligent language switching in automated text-to-speech systems


Also Published As

Publication number Publication date
JPWO2023073949A1 (en) 2023-05-04

Similar Documents

Publication Publication Date Title
JP6604151B2 (en) Speech recognition control system
JPH11201770A (en) Navigation device
US10884700B2 (en) Sound outputting device, sound outputting method, and sound outputting program storage medium
JP2023105143A (en) Information processor, method for outputting information, program, and recording medium
WO2023073949A1 (en) Voice output device, server device, voice output method, control method, program, and storage medium
WO2021192511A1 (en) Information processing device, information output method, program and storage medium
WO2023063405A1 (en) Content generation device, content generation method, program, and storage medium
WO2023286827A1 (en) Content output device, content output method, program, and storage medium
WO2023286826A1 (en) Content output device, content output method, program, and storage medium
WO2023163047A1 (en) Terminal device, information providing system, information processing method, program, and storage medium
US20240134596A1 (en) Content output device, content output method, program and storage medium
WO2023062816A1 (en) Content output device, content output method, program, and storage medium
WO2023162189A1 (en) Content output device, content output method, program, and storage medium
WO2023163197A1 (en) Content evaluation device, content evaluation method, program, and storage medium
WO2023112147A1 (en) Voice output device, voice output method, program, and storage medium
WO2023163045A1 (en) Content output device, content output method, program, and storage medium
JP7153191B2 (en) Information provision device and in-vehicle device
WO2023073856A1 (en) Audio output device, audio output method, program, and storage medium
WO2023163196A1 (en) Content output device, content output method, program, and recording medium
WO2023162192A1 (en) Content output device, content output method, program, and recording medium
JP2023011136A (en) Content output device, method for outputting content, program, and recording medium
WO2023112148A1 (en) Audio output device, audio output method, program, and storage medium
WO2023062817A1 (en) Voice recognition device, control method, program, and storage medium
JP2023012733A (en) Content generator, method for generating content, program, and recording medium
WO2023073912A1 (en) Voice output device, voice output method, program, and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21962490

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2023556054

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21962490

Country of ref document: EP

Kind code of ref document: A1