WO2017135214A1 - Speech translation system, speech translation method, and speech translation program


Info

Publication number: WO2017135214A1
Authority: WO (WIPO, PCT)
Prior art keywords: speech, translation, unit, call, content
Application number: PCT/JP2017/003300
Other languages: French (fr), Japanese (ja)
Inventor: 知高 大越
Original Assignee: 株式会社リクルートライフスタイル (Recruit Lifestyle Co., Ltd.)
Application filed by 株式会社リクルートライフスタイル
Publication of WO2017135214A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/40: Processing or translation of natural language
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00: Speech synthesis; Text to speech systems
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition

Definitions

  • the present invention relates to a speech translation system, a speech translation method, and a speech translation program.
  • In a conventional speech translation apparatus, when calling an interpreter while speech translation processing is running, identification information for identifying the interpreter (for example, a telephone number) must be looked up in a telephone directory, communication history, or the like stored in the apparatus.
  • After the telephone number is specified, a call operation must additionally be performed, which may increase the burden on the user (the speaker) and decrease convenience.
  • An object is therefore to provide a speech translation system, a speech translation method, and a speech translation program that can reduce the burden on the user, improve convenience, prevent the occurrence of mistranslation, and realize smooth communication.
  • According to one aspect, a speech translation system comprises an information terminal that inputs a user's speech, a server device that translates the content of the speech input to the information terminal, and an interpreter terminal that performs call processing with the information terminal.
  • The server device comprises a speech recognition unit that recognizes the content of the speech input to the information terminal, and a translation unit that translates the content recognized by the speech recognition unit into content in a different language.
  • The information terminal comprises a speech output unit that outputs, by voice, the content translated by the translation unit of the server device; a first display processing control unit that controls a process of displaying the text of the translated content and, in addition to the text, a process of selectively displaying a first image; and a call processing control unit that controls call processing with the interpreter terminal and, when the first image is selected, transmits a call processing start request for starting the call processing to the interpreter terminal.
  • The server device may further include a score calculation unit that calculates a score related to translation accuracy, and the first display processing control unit may control a process of displaying the first image when the score is equal to or less than a predetermined threshold.
  • The server device may further include a storage unit that stores, for each user, the translated content associated with the input speech content as a translation history, and the interpreter terminal may further include a second display processing control unit that controls a process of displaying the translation history in association with each user.
  • The first display processing control unit may control a process of further displaying two or more second images respectively indicating two or more languages, and when one of the second images is selected, the call processing control unit may control call processing with the interpreter terminal associated with an interpreter who can use the language indicated by the selected second image.
  • According to another aspect, a speech translation method comprises: outputting, by voice, the content of a user's speech translated into content in a different language; controlling a process of displaying the text of the translated content; controlling a process of selectively displaying a first image in addition to the text; controlling call processing with an interpreter terminal; and, when the first image is selected, transmitting a call processing start request for starting the call processing to the interpreter terminal.
  • According to yet another aspect, a speech translation program causes a computer to function as: a speech output unit that outputs, by voice, the content of a user's speech translated into content in a different language; a first display processing control unit that controls a process of displaying the translated text and a process of selectively displaying a first image in addition to the text; and a call processing control unit that controls call processing with an interpreter terminal and, when the first image is selected, transmits a call processing start request for starting the call processing to the interpreter terminal.
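To make the division of roles in the claims above concrete, here is a minimal, illustrative Python sketch of the terminal-side units (speech output is omitted). This is not the patent's implementation; all identifiers (`CallStartRequest`, `InformationTerminal`, `send`, and so on) are assumptions made for this sketch only.

```python
from dataclasses import dataclass, field

@dataclass
class CallStartRequest:
    """Call processing start request sent to the interpreter terminal."""
    terminal_id: str                      # identifies the information terminal
    translation_history: list = field(default_factory=list)

class InformationTerminal:
    """Terminal-side units from the claims: display control and call control."""

    def __init__(self, terminal_id: str, send):
        self.terminal_id = terminal_id
        self.send = send                  # transport to the interpreter terminal
        self.history = []

    def show_translation(self, translated_text: str) -> None:
        # First display processing control: show the translated text and,
        # in addition to the text, selectively show the first image
        # (the call start button).
        self.history.append(translated_text)
        print(f"[screen] {translated_text}  [call start button]")

    def on_first_image_selected(self) -> None:
        # Call processing control: when the first image is selected,
        # transmit a call processing start request to the interpreter terminal.
        self.send(CallStartRequest(self.terminal_id, list(self.history)))

terminal = InformationTerminal("info-terminal-10", send=print)
terminal.show_translation("May I take your order?")
terminal.on_first_image_selected()
```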
  • In the present disclosure, “part”, “apparatus”, and “system” do not simply mean physical means; they also include cases where the functions of the “part”, “apparatus”, or “system” are realized by software. Further, the functions of one “part”, “apparatus”, or “system” may be realized by two or more physical means or apparatuses, and the functions of two or more “parts”, “apparatuses”, or “systems” may be realized by one physical means or apparatus.
  • According to the present disclosure, the burden on the user can be reduced, convenience can be improved, the occurrence of mistranslation can be prevented, and smooth communication can be realized.
  • FIG. 1 is a system block diagram schematically illustrating a preferred embodiment of the network configuration of a speech translation system according to the present disclosure.
  • FIG. 2 is a system block diagram schematically showing an example of the configuration of a user device (information terminal) in the speech translation system according to the present disclosure.
  • FIG. 3 is a functional block diagram schematically showing an example of the functional configuration of the user device (information terminal).
  • FIG. 4 is a system block diagram schematically showing an example of the configuration of a server device in the speech translation system.
  • FIG. 5 is a functional block diagram schematically showing an example of the functional configuration of the server device.
  • FIG. 6 is a system block diagram schematically showing an example of the configuration of an operator terminal (interpreter terminal) in the speech translation system.
  • FIG. 7 is a functional block diagram schematically showing an example of the functional configuration of the operator terminal.
  • FIG. 8 is a flowchart showing an example of a process flow (part) in the speech translation system.
  • FIGS. 9(A) to 9(C), 10(A) to 10(C), and 11(A) to 11(D) are plan views showing examples of display screen transitions in the information terminal.
  • FIG. 12 is a diagram showing an example of a display screen in the interpreter terminal.
  • FIG. 13 is a flowchart showing another example of the process flow (part) in the speech translation system.
  • FIG. 1 is a system block diagram schematically illustrating a preferred embodiment of the network configuration of the speech translation system according to the present disclosure.
  • As shown in FIG. 1, the speech translation system 100 illustratively includes an information terminal 10, used by the user (the speaker or a conversation partner), for inputting the user's voice; a server device 20, electronically connected to the information terminal 10 via the network N, that translates the content of the voice input to the information terminal 10; and an operator terminal 30 (interpreter terminal), used by an interpreter and electronically connected to the information terminal 10 and the server device 20 via the network N, that performs call processing with the information terminal 10.
  • FIG. 2 is a system block diagram schematically illustrating an example of the configuration of the user device (information terminal) in the speech translation system according to the present disclosure.
  • As shown in FIG. 2, the information terminal 10 illustratively includes a processor 11, a storage resource 12, a voice input/output device 13 (for example, a microphone and a speaker, separate or integrated), a communication interface 14, an input device 15, a display device 16, and a camera 17.
  • The information terminal 10 operates with installed speech translation application software (at least a part of a speech translation program according to an embodiment of the present disclosure), and thereby functions as a part or the whole of the speech translation system according to the embodiment of the present disclosure.
  • The information terminal 10 here is, for example, a portable tablet terminal device, including a mobile phone represented by a smartphone, having a function of communicating with the network N.
  • the processor 11 includes an arithmetic logic unit and various registers (program counter, data register, instruction register, general-purpose register, etc.). Further, the processor 11 interprets and executes speech translation application software, which is the program P10 stored in the storage resource 12, and performs various processes.
  • the speech translation application software as the program P10 can be distributed from the server device 20 through the network N, for example, and may be installed and updated manually or automatically.
  • The network N includes, for example, wired networks (a short-range communication network (LAN), a wide-area communication network (WAN), a value-added communication network (VAN), etc.) and wireless networks (a mobile communication network, a satellite communication network, Bluetooth (registered trademark), WiFi (Wireless Fidelity), HSDPA (High Speed Downlink Packet Access), etc.).
  • The storage resource 12 is a logical device provided by the storage area of a physical device (for example, a computer-readable storage medium such as a semiconductor memory), and stores an operating system program, driver programs, various information, and the like used for processing of the information terminal 10.
  • Examples of the driver programs include an input/output device driver program for controlling the voice input/output device 13, an input device driver program for controlling the input device 15, an output device driver program for controlling the display device 16, and the like.
  • the voice input / output device 13 is, for example, a general microphone and a sound player capable of reproducing sound data.
  • the communication interface 14 provides a connection interface with the server device 20 and the operator terminal 30, for example, and is configured by a wireless communication interface and / or a wired communication interface.
  • The input device 15 provides an interface for accepting input operations by tap operations on icons, buttons, a virtual keyboard, or the like displayed on the display device 16; in addition to a touch panel, various input devices externally attached to the information terminal 10 can be exemplified.
  • The display device 16 provides various information, as an image display interface, to the user and, as necessary, to the conversation partner. Examples include an organic EL display, a liquid crystal display, and a CRT display, preferably including those of various types using a touch panel.
  • the camera 17 is for capturing still images and moving images of various subjects.
  • FIG. 3 is a functional block diagram schematically illustrating an example of a functional configuration of a user device (information terminal) in the speech translation system according to the present disclosure.
  • the information terminal 10 functionally includes a voice input / output unit 101, a transmission / reception unit 103, an input operation reception unit 105, a display unit 107, an information processing unit 109, and a storage unit 117.
  • the information processing unit 109 functionally includes a score comparison unit 111, a first display processing control unit 113, a call processing control unit 115, and an operator terminal specifying unit 116.
  • The voice input/output unit 101 inputs, for example, the user's voice, and outputs by voice, for example, the content translated by the server device 20 shown in FIG. 1. Here, the voice input/output device 13 shown in FIG. 2 functions as the voice input/output unit 101.
  • The transmission/reception unit 103 transmits and receives various information to and from the server device 20 and the operator terminal 30 shown in FIG. 1.
  • the transmission / reception unit 103 transmits the content of the input voice to the server device 20.
  • the transmission / reception unit 103 receives, for example, text information, audio information, and the like of content translated by the server device 20.
  • The transmission/reception unit 103 also receives, for example, the score related to translation accuracy from the server device 20.
  • the communication interface 14 illustrated in FIG. 2 functions as the transmission / reception unit 103.
  • the input operation accepting unit 105 is a block that accepts a user's input operation, for example.
  • the input device 15 illustrated in FIG. 2 functions as the input operation reception unit 105.
  • the display unit 107 displays various information.
  • The display unit 107 displays, for example, translated text. Further, the display unit 107 displays, for example, the language buttons 61 (second images) shown in FIG. 9(A) and the call start button 73 (first image) described later.
  • the display device 16 illustrated in FIG. 2 functions as the display unit 107.
  • The information processing unit 109 represents functions of the processor 11 shown in FIG. 2. The score comparison unit 111 compares, for example, a score related to the translation accuracy of the translation processing performed by the server device 20 with a predetermined threshold.
  • the first display processing control unit 113 is a block that controls processing for displaying various types of information on the display unit 107.
  • The first display processing control unit 113 controls, for example, a process of displaying the text of the content translated by the server device 20 and, in addition to that text, a process of selectively displaying the call start button 73 (first image).
  • The call processing control unit 115 is, for example, a block that controls call processing between the information terminal 10 and the operator terminal 30; when the call start button 73 displayed on the display unit 107 is selected, it transmits a call processing start request for starting the call processing to the operator terminal 30.
  • The operator terminal specifying unit 116 specifies, for example, the operator terminal 30 used by an interpreter who can use the language indicated by the button selected among the language buttons 61 shown in FIG. 9(A) (for example, the English button).
  • the storage unit 117 is a block that stores various programs and information used for processing of the information terminal 10.
  • the storage unit 117 stores, for example, text information, audio information, and the like of the content received by the transmission / reception unit 103 and translated by the server device 20.
  • the storage unit 117 stores a score related to the translation accuracy of the server device 20 received by the transmission / reception unit 103.
  • the storage resource 12 illustrated in FIG. 2 functions as the storage unit 117.
  • the camera 17 shown in FIG. 2 functions as, for example, an imaging unit (not shown in FIG. 3).
  • FIG. 4 is a system block diagram schematically illustrating an example of the configuration of the server device in the speech translation system according to the present disclosure.
  • the server device 20 illustratively includes a processor 21, a communication interface 22, and a storage resource 23.
  • The server device 20 is configured by, for example, a host computer with high arithmetic processing capability, and realizes its server function by a predetermined server program operating on that host computer. It may be configured by a single host computer or by a plurality of host computers (indicated as a single computer in the figure, but not limited thereto).
  • The processor 21 is composed of an arithmetic logic unit for processing arithmetic operations, logical operations, bit operations, and the like, and various registers (program counter, data register, instruction register, general-purpose register, etc.); it interprets and executes the program P20 stored in the storage resource 23 and outputs predetermined arithmetic processing results.
  • the communication interface 22 is a hardware module for connecting to the information terminal 10 via the network N.
  • the communication interface 22 is a modulation / demodulation device such as an ISDN modem, an ADSL modem, a cable modem, an optical modem, or a soft modem.
  • The storage resource 23 is a logical device provided by, for example, the storage area of a physical device (a computer-readable storage medium such as a disk drive or a semiconductor memory), and stores one or more programs P20, various modules L20, various databases D20, and various models M20.
  • the program P20 is the above-described server program that is the main program of the server device 20.
  • The various modules L20 are software modules (modularized subprograms) that perform a series of information processing on requests and information transmitted from the information terminal 10, and are appropriately called and executed during the operation of the program P20.
  • Examples of the module L20 include a speech recognition module, a translation module, and a speech synthesis module.
  • The various databases D20 include various corpora required for speech translation processing (for example, in the case of Japanese-English speech translation, a Japanese speech corpus, an English speech corpus, a Japanese character (vocabulary) corpus, an English character (vocabulary) corpus, a Japanese dictionary, an English dictionary, a Japanese-English bilingual dictionary, a Japanese-English bilingual corpus, etc.), a speech database described later, a management database for managing information related to users, and the like.
  • examples of the various models M20 include an acoustic model and a language model used for speech recognition described later.
  • FIG. 5 is a functional block diagram schematically showing an example of the functional configuration of the server device in the speech translation system according to the present disclosure.
  • the server device 20 functionally includes a transmission / reception unit 201, an information processing unit 203, and a storage unit 213.
  • the information processing unit 203 includes, for example, a speech recognition unit 205, a multilingual translation unit 207, a score calculation unit 209, and a speech synthesis unit 211.
  • The transmission/reception unit 201 transmits and receives various information to and from the information terminal 10 and the operator terminal 30 shown in FIG. 1.
  • the transmission / reception unit 201 receives the content of the voice input to the information terminal 10 from the information terminal 10.
  • the transmission / reception unit 201 transmits, for example, text information, voice information, and the like of contents translated by the multilingual translation unit 207 described later to the information terminal 10.
  • the transmission / reception unit 201 transmits, for example, a score related to translation accuracy calculated by a score calculation unit 209 described later to the information terminal 10.
  • the communication interface 22 illustrated in FIG. 4 functions as the transmission / reception unit 201.
  • The information processing unit 203 represents functions of the processor 21 shown in FIG. 4. The voice recognition unit 205 recognizes, for example, the content of the voice input to the information terminal 10.
  • the multilingual translation unit 207 translates the content recognized by the speech recognition unit 205 into the content of a different language.
  • the score calculation unit 209 calculates a score related to the translation accuracy of the multilingual translation unit 207.
  • the speech synthesis unit 211 performs speech synthesis based on the translation result by the multilingual translation unit 207.
  • The storage unit 213 is, for example, a block that stores various programs and information used for processing of the server device 20, for example the modules L20, databases D20, and models M20 described above.
  • The storage unit 213 also stores, for each user, the translated content associated with the content of the input voice as a translation history.
  • the storage resource 23 illustrated in FIG. 4 functions as the storage unit 213.
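The server-side units just listed form a pipeline: recognition, translation with score calculation, and synthesis. The following is a hedged Python sketch of that pipeline; the three step functions are stubs standing in for the corpus- and model-driven processing described later, and every name here is an assumption of this sketch, not the patent's API.

```python
from dataclasses import dataclass

@dataclass
class TranslationResult:
    source_text: str        # recognized "reading" of the input speech
    translated_text: str    # content translated into the other language
    score: float            # translation-accuracy score (e.g. a percentage)
    audio: bytes            # synthesized speech for the translated text

def recognize(audio: bytes) -> str:
    """Stub for the speech recognition unit 205."""
    return "gochuumon wa okimari desu ka"

def translate(text: str) -> tuple[str, float]:
    """Stub for the multilingual translation unit 207 together with the
    score calculation unit 209."""
    return "May I take your order?", 92.0

def synthesize(text: str) -> bytes:
    """Stub for the speech synthesis unit 211."""
    return text.encode("utf-8")

def handle_utterance(audio: bytes) -> TranslationResult:
    # The order mirrors steps SS1, SS2/SS3, and SS4 in FIG. 8.
    source = recognize(audio)
    target, score = translate(source)
    return TranslationResult(source, target, score, synthesize(target))

print(handle_utterance(b"...pcm frames..."))
```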
  • FIG. 6 is a system block diagram schematically illustrating an example of a configuration of an operator terminal (interpreter device) in the speech translation system according to the present disclosure.
  • The operator terminal 30 includes a processor 31, a storage resource 32, a voice input/output device 33 (for example, a microphone and a speaker, separate or integrated), a communication interface 34, an input device 35, a display device 36, and a camera 37.
  • The operator terminal 30 has the same block configuration as the information terminal 10 shown in FIG. 2. In the following, configurations different from those of the information terminal 10 are described in particular.
  • The operator terminal 30 operates, for example, with installed CTI (Computer Telephony Integration) application software executed as at least a part of a speech translation program according to an embodiment of the present disclosure, and thereby functions as a part or the whole of the speech translation system.
  • The operator terminal 30 receives a call from the information terminal 10 shown in FIG. 1.
  • the interpreter performs interpretation via the operator terminal 30.
  • The operator terminal 30 displays, on the display device 36, information related to the other party of the call, for example at least one of the information terminal 10 and its user, and a translation history, which will be described in detail later.
  • the operator terminal 30 is, for example, a stationary terminal device including a desktop personal computer having a communication function with the network N.
  • the processor 31 interprets and executes CTI application software, which is the program P30 stored in the storage resource 32, and performs various processes.
  • The input device 35 provides an interface for accepting input operations by tap operations on icons, buttons, a virtual keyboard, or the like displayed on the display device 36; various input devices externally attached to the operator terminal 30, for example a keyboard and a mouse, can be exemplified.
  • the input device 35 may be a device such as a touch panel of various types including the function of the display device 36.
  • FIG. 7 is a functional block diagram schematically illustrating an example of a functional configuration of an operator terminal (interpreter device) in the speech translation system according to the present disclosure.
  • the operator terminal 30 functionally includes a voice input / output unit 301, a transmission / reception unit 303, an input operation reception unit 305, a display unit 307, an information processing unit 309, and a storage unit 315.
  • the information processing unit 309 functionally includes a call processing unit 311 and a second display processing control unit 313.
  • the voice input / output unit 301 inputs the voice of an operator including an interpreter, for example.
  • the voice input / output unit 301 may be configured to output the content indicating the translation history received by the transmission / reception unit 303 by voice as described later, for example.
  • the voice input / output device 33 illustrated in FIG. 6 functions as the voice input / output unit 301.
  • The transmission/reception unit 303 transmits and receives various information to and from the information terminal 10 and the server device 20 shown in FIG. 1.
  • the transmission / reception unit 303 receives, for example, a translation history transmitted from the server device 20 via the information terminal 10.
  • the transmission / reception unit 303 receives, for example, a call processing start request transmitted from the information terminal 10.
  • the transmission / reception unit 303 transmits a response signal to the call processing start request.
  • the communication interface 34 illustrated in FIG. 6 functions as the transmission / reception unit 303.
  • the input operation accepting unit 305 is a block that accepts an operator's input operation, for example.
  • the input device 35 illustrated in FIG. 6 functions as the input operation reception unit 305.
  • the display unit 307 displays various information.
  • the display unit 307 displays the translation history in association with each user.
  • the display device 36 illustrated in FIG. 6 functions as the display unit 307.
  • The information processing unit 309 represents functions of the processor 31 shown in FIG. 6. The call processing unit 311 performs, for example, call processing between the operator terminal 30 and the information terminal 10 based on a call processing start request transmitted from the information terminal 10, and generates a response signal to the request.
  • The response signal includes either a signal indicating that a call between the operator terminal 30 and the information terminal 10 is permitted or a signal indicating that it is not permitted.
  • the second display processing control unit 313 is a block that controls processing for displaying various types of information on the display unit 307.
  • the second display processing control unit 313 controls the display unit 307 to display the translation history in association with each user.
  • the storage unit 315 is a block that stores various programs and information used for processing of the operator terminal 30.
  • the storage unit 315 stores, for example, a translation history transmitted from the server device 20 via the information terminal 10 received by the transmission / reception unit 303.
  • the storage resource 32 illustrated in FIG. 6 functions as the storage unit 315.
  • The camera 37 shown in FIG. 6 functions as, for example, an imaging unit (not shown in FIG. 7).
  • FIG. 8 is a flowchart illustrating an example of a process flow (part) in the speech translation system according to the present disclosure.
  • FIGS. 9(A) to 9(C), 10(A) to 10(C), and 11(A) to 11(D) are plan views illustrating examples of display screen transitions in the information terminal according to the present disclosure.
  • FIG. 12 is a diagram illustrating an example of a display screen in the interpreter terminal according to the present disclosure.
  • Here, a conversation is assumed in which the user of the information terminal 10 is a restaurant clerk who speaks Japanese and the conversation partner is a customer who speaks English, that is, the input language is Japanese and the translation language is English; however, the conversation is not limited to this.
  • When the application is activated, a customer language selection screen is displayed on the display unit 107 (FIG. 8; step SJ2). As shown in FIG. 9(A), this language selection screen displays, for example, a Japanese text T21 for asking the customer about their language, an English text T22 for the same purpose, and language buttons 61 (second images) indicating a plurality of typical languages assumed (here English, two types of Chinese depending on the typeface, and Hangul).
  • The Japanese text T21 and the English text T22 are displayed by the first display processing control unit 113 and the display unit 107 in, for example, differently colored areas on the screen of the display unit 107 of the information terminal 10, and oriented in opposite directions (different directions; upside down relative to each other in the figure).
  • Thereby, the user can easily read the Japanese text T21, while the customer can easily read the English text T22.
  • Moreover, since the text T21 and the text T22 are displayed in separate areas, there is an advantage that they are clearly distinguished from each other.
  • the user presents the text T21 displayed on the language selection screen of FIG. 9A to the customer, and has the customer tap the English button, so that the customer's language is selected.
  • a standby screen for voice input in Japanese and English is displayed on the display device as the home screen (FIG. 8; step SJ3).
  • text T23 asking which of the user's or customer's language is to be spoken is displayed on this home screen.
  • The home screen also displays a history display button 63 for displaying a history of input contents, a language selection button 64 for returning to the language selection screen and switching the customer language (re-selecting the language), and a setting button 65 for performing various settings of the application software.
  • When the Japanese input button 62a is tapped on the home screen, a voice input screen (FIG. 9(C)) for accepting the user's Japanese utterance content is displayed, and voice input from the voice input/output unit 101 is enabled.
  • a text T24 for prompting the user to input voice and a microphone design 66 indicating that the voice input is in a standby state are displayed.
  • On the voice input screen of FIG. 9(C), the Japanese input button 62a is not displayed, indicating that Japanese voice input was selected on the previous screen of FIG. 9(B); the English input button 62b is displayed in a light color, partly hidden behind the microphone design 66 (the same applies to FIGS. 10(A) and 10(B) described later).
  • a cancel button 67 is displayed at the bottom of the voice input screen. By tapping this button, it is possible to return to the voice input standby screen (FIG. 9B) and perform voice input again. (Same as in FIGS. 10A and 10B described later).
  • When the user speaks, the volume of the input voice is schematically shown on the screen of the display unit 107, together with the text T24, by a dynamically drawn multiple-circle design 68, and the voice input level is thus visually fed back to the user who is the speaker (FIG. 8; step SJ4).
  • When the information processing unit 109 of the information terminal 10 detects that there has been no voice input for a certain period of time, it ends the acceptance of the user's utterance content, generates an audio signal based on the voice input, and transmits the audio signal to the server device 20 through the transmission/reception unit 103 and the network N.
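The level feedback of step SJ4 and the end-of-input detection just described can be illustrated with a small, assumption-laden sketch: an RMS level is computed per frame for display, and acceptance ends after a run of quiet frames. The threshold and frame-count constants are invented for illustration; the patent does not specify them.

```python
import math

SILENCE_RMS = 500      # assumed amplitude threshold for "no voice input"
SILENCE_FRAMES = 50    # assumed number of consecutive quiet frames

def rms(frame):
    """Root-mean-square level of one frame of PCM samples; a value like
    this could drive the size of the circular design 68."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))

def accept_utterance(frames):
    """Collect frames until a long enough run of quiet frames is seen."""
    voiced, quiet = [], 0
    for frame in frames:
        level = rms(frame)
        print(f"input level: {level:8.1f}")   # visual feedback to the speaker
        quiet = quiet + 1 if level < SILENCE_RMS else 0
        voiced.append(frame)
        if quiet >= SILENCE_FRAMES:
            break                             # end of utterance detected
    return voiced

speech, silence = [1000] * 160, [0] * 160
frames = accept_utterance([speech] * 3 + [silence] * 60)
print(f"accepted {len(frames)} frames")
```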
  • The voice recognition unit 205 of the information processing unit 203 of the server device 20 receives the voice signal through the transmission/reception unit 201 and performs voice recognition processing (FIG. 8; step SS1). At this time, the speech recognition unit 205 calls the necessary module L20, database D20, and model M20 (speech recognition module, Japanese speech corpus, acoustic model, language model, etc.) from the storage unit 213, and converts the voice signal into its "reading" (characters).
  • the information processing unit 203 generates a text signal for text output based on the recognized “reading” (characters) of the voice, and transmits the text signal to the information terminal 10 through the transmission / reception unit 201 and the network N.
  • More precisely, the information processing unit 203 selects, based on the content of the recognized speech itself and a Japanese conversation corpus stored in advance in the storage unit 213, the entry corresponding to the actual utterance content, and generates the text signal based on it.
  • The first display processing control unit 113 of the information terminal 10, having received the text signal through the transmission/reception unit 103, displays on the screen the Japanese text T25, which is the recognized content of the Japanese utterance input by the user.
  • Next, the multilingual translation unit 207 proceeds to multilingual translation processing for translating the recognized "reading" (characters) of the speech into another language (FIG. 8; step SS2).
  • At this time, the multilingual translation unit 207 calls the necessary module L20 and database D20 (translation module, Japanese character corpus, Japanese dictionary, English dictionary, Japanese-English bilingual dictionary, Japanese-English bilingual corpus, etc.) from the storage unit 213, appropriately sorts and converts the input speech "reading" (character string) that is the recognition result into Japanese phrases, clauses, sentences, and the like, extracts the English corresponding to the conversion result, and arranges it in accordance with English grammar.
  • While the translation is in progress, the display unit 107 displays a standby screen including a Japanese text T26 indicating that translation is in progress and a circular design 69 indicating the same.
  • The storage unit 213 stores, for each user, the translation result (translated content) associated with the content of the input speech as a translation history (FIG. 8; step SS3).
  • the storage unit 213 stores an English conversation corpus or the like corresponding to the translated English phrase, clause, sentence, or the like as a translation history in association with the content of the input speech.
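A minimal sketch of the per-user translation-history storage performed in step SS3 might look as follows; the class and field names are illustrative assumptions, and a real storage unit 213 would persist to the databases D20 rather than to an in-memory dict.

```python
from collections import defaultdict
from datetime import datetime, timezone

class TranslationHistoryStore:
    """In-memory stand-in for the storage unit 213: each translation result
    is kept together with the input speech content, grouped per user."""

    def __init__(self):
        self._by_user = defaultdict(list)

    def add(self, user_id, source_text, translated_text):
        self._by_user[user_id].append({
            "at": datetime.now(timezone.utc).isoformat(),
            "input": source_text,
            "translation": translated_text,
        })

    def history(self, user_id):
        """Translation history for one user, oldest first."""
        return list(self._by_user[user_id])

store = TranslationHistoryStore()
store.add("user-1", "ご注文はお決まりですか", "May I take your order?")
print(store.history("user-1"))
```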
  • The speech synthesis unit 211 calls the module L20, database D20, and model M20 necessary for speech synthesis (speech synthesis module, English speech corpus, acoustic model, language model, etc.) from the storage unit 213, and converts the English conversation corpus entry corresponding to the translated English phrase, clause, sentence, or the like, which is the translation result, into natural speech (FIG. 8; step SS4).
  • When the multilingual translation processing and the speech synthesis processing are completed, the information processing unit 203 generates a text signal for text display based on the English conversation corpus entry that is the translation result (translated content), generates an audio signal for audio output based on the synthesized speech, and transmits them to the information terminal 10 through the transmission/reception unit 201 and the network N.
  • The first display processing control unit 113 of the information terminal 10, having received the text signal and the audio signal through the transmission/reception unit 103, displays as a conversation screen the Japanese conversation corpus text T27 corresponding to the text T25 (here the same as the text T25, though not limited thereto) and the English conversation corpus text T28 that is its translation result, and controls a process of selectively displaying the call start button 73 (first image) on that screen (FIG. 8; step SJ5).
  • The storage unit 117 of the information terminal 10 may store the received text information, audio information, and the like of the translated content.
  • The voice input/output unit 101 then outputs (reads out) the content of the English text T28, which is the translation result (FIG. 8; step SJ6). Note that step SJ6 may be executed before or after step SJ5.
  • The Japanese texts T25 and T27 and the English text T28 are likewise divided on the screen of the display unit 107 of the information terminal 10, for example by differently colored areas and line segments, and displayed in opposite directions (different directions; upside down relative to each other in the figure).
  • Thereby, if the user and the customer are in a face-to-face conversation and both can see the screen of the display unit 107, the user can easily confirm the Japanese texts T25 and T27 (input content), while the customer can easily confirm the English text T28 (translated content).
  • Moreover, since the texts T25 and T27 and the text T28 are displayed in separate areas, there is an advantage that they are clearly distinguished from each other.
  • The audio output can be repeated by tapping the audio output button 70 displayed on the conversation screen of FIG. 10(C). This conversation screen also displays a check button 71 indicating that the translation at that time is finished; by tapping it, the translation processing is ended and the home screen (FIG. 9(B)) is restored.
  • When the conversation continues, voice processing such as input, recognition, translation, and voice synthesis of the customer's voice is performed next (FIG. 8; No in step SJ7).
  • In that case, the check button 71 displayed in FIG. 10(C) is tapped to display the home screen (FIG. 9(B)), and the English input button 62b is tapped to select English voice input by the customer.
  • The subsequent processing is basically the same as the processing described above, except that the speaker changes from the user to the customer, Japanese voice input is switched to English voice input, and English voice and text output is replaced with Japanese voice and text output; detailed description is therefore omitted here.
  • When the conversation ends, the series of speech translation processes is terminated.
  • When the call start button 73 (first image) displayed on the display unit 107 is selected, the call processing control unit 115 transmits a call processing start request for starting a call with the interpreter.
  • The call processing control unit 115 may generate the call processing start request when the call start button 73 is selected, or may generate it in advance, before the call start button 73 is selected.
  • The call processing start request is generated including, for example, identification information of the information terminal 10 and the translation history from the server device 20.
  • The identification information of the information terminal 10 includes, for example, attributes of the user of the information terminal 10, that is, the user's name, address, date of birth, age, affiliation, family structure, and the like, and the telephone number or identification number (ID) of the information terminal 10.
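Under the assumption that the request is serialized as JSON (the patent does not specify a wire format), a sketch of a call processing start request carrying the identification information and translation history listed above could look like this; every field name is hypothetical.

```python
import json

def build_call_start_request(terminal, user, translation_history):
    """Assemble the request contents listed above; the JSON layout and
    every field name are hypothetical."""
    return json.dumps({
        "terminal": {
            "phone_number": terminal["phone_number"],
            "terminal_id": terminal["terminal_id"],
        },
        "user": {   # user attributes carried as identification information
            "name": user["name"],
            "address": user["address"],
            "date_of_birth": user["date_of_birth"],
        },
        "translation_history": translation_history,
    })

request = build_call_start_request(
    {"phone_number": "+81-3-0000-0000", "terminal_id": "info-terminal-10"},
    {"name": "Taro", "address": "Tokyo", "date_of_birth": "1990-01-01"},
    [{"input": "ご注文は?", "translation": "May I take your order?"}],
)
print(request)
```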
  • A call between the store clerk or customer who uses the information terminal 10 and the interpreter who uses the operator terminal 30 is executed through the network N, including a general telephone line network, an IP telephone line network, and the like. There is no particular limitation on the calling means; it suffices that the two parties can talk to each other.
  • When there are a plurality of interpreters available for a call, the store clerk will want to talk to the most appropriate interpreter.
  • For this purpose, the storage unit 117 stores identification information of each interpreter or of the terminal used by each interpreter in association with language information indicating one or more languages that the interpreter can use.
  • In step SJ3 shown in FIG. 8, the user presents the text T21 displayed on the language selection screen shown in FIG. 9(A) to the customer and has the customer tap the English button, so that the customer's language is selected.
  • The operator terminal specifying unit 116 then refers to the terminal identification information and the language information stored in the storage unit 117, and specifies the operator terminal 30 used by an interpreter who can use the language indicated by the selected English button, that is, English. The call processing control unit 115 transmits a call processing start request to the operator terminal used by that interpreter, and the call between the two parties is started. In this way, an interpreter who can handle the language used in the communication between the store clerk and the customer can be appropriately identified.
  • The storage unit 117 of the information terminal 10 may store, in addition to the terminal identification information and the language information of each interpreter, information indicating each interpreter's interpretation level and interpretation ability, in association with the identification information of the interpreter or of the terminal the interpreter uses. Then, when the English button is selected in step SJ3 shown in FIG. 8, the operator terminal specifying unit 116 may specify the operator terminal used by the interpreter with the highest interpretation level and ability among the plurality of interpreters who can use English (see the sketch below).
  • The operator terminal specifying unit 116 may instead specify the interpreter when the call start button 73 (first image) displayed on the display unit 107 of the information terminal 10 is selected in step SJ5 of FIG. 8, or may be configured in advance to specify, for each language used in the communication between the store clerk and the customer, the operator terminal used by the interpreter who takes the call.
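A sketch of the operator-terminal selection just described, filtering stored interpreter records by usable language and preferring the highest interpretation level; the record layout and level scale are assumptions for illustration.

```python
interpreters = [
    # terminal identification, usable languages, interpretation level
    {"terminal": "operator-30a", "languages": {"en"}, "level": 2},
    {"terminal": "operator-30b", "languages": {"en", "zh"}, "level": 5},
    {"terminal": "operator-30c", "languages": {"ko"}, "level": 4},
]

def specify_operator_terminal(language):
    """Among interpreters who can use the selected language, prefer the
    one with the highest interpretation level; None if nobody matches."""
    candidates = [i for i in interpreters if language in i["languages"]]
    if not candidates:
        return None
    return max(candidates, key=lambda i: i["level"])["terminal"]

print(specify_operator_terminal("en"))   # -> operator-30b
print(specify_operator_terminal("fr"))   # -> None
```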
  • the transmission / reception unit 303 of the operator terminal 30 receives the call processing start request from the information terminal 10 (FIG. 8; step SO1).
  • the transmission / reception unit 303 transmits a response signal to the information terminal 10 (FIG. 8; step SO2).
  • When permitting a call between the information terminal 10 and the operator terminal 30, the call processing unit 311 generates a response signal indicating that the call is permitted.
  • Specifically, the call processing unit 311 determines whether or not to permit the call with the information terminal 10 by comparing the identification information of the information terminal 10 included in the received call processing start request with identification information of call-permitted information terminals stored in advance in the storage unit 315 or in another storage resource with which the operator terminal 30 can communicate. When not permitting the call, the call processing unit 311 generates a response signal indicating that the call is not permitted.
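The permission check described above reduces to comparing the requesting terminal's identification information against a pre-stored list and answering with a response signal either way; a minimal sketch, with invented terminal IDs:

```python
# Identification information of call-permitted information terminals,
# as stored in advance in the storage unit 315 (IDs invented here).
ALLOWED_TERMINALS = {"info-terminal-10", "info-terminal-11"}

def handle_call_start_request(request_terminal_id):
    """Compare the requesting terminal's identification information with
    the pre-stored list and generate the response signal either way."""
    permitted = request_terminal_id in ALLOWED_TERMINALS
    return {"type": "response", "call_permitted": permitted}

print(handle_call_start_request("info-terminal-10"))  # call permitted
print(handle_call_start_request("info-terminal-99"))  # call not permitted
```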
  • the second display processing control unit 313 causes the display unit 307 to display the translation history transmitted from the server device 20 via the information terminal 10 in association with each user (FIG. 8; step SO3).
  • Specifically, the second display processing control unit 313 controls a process of displaying, on the screen of the display unit 307, an image 81 indicating "calling", a process of displaying an image 83 including a column indicating the user's name, a column indicating the telephone number of the information terminal used by the user, a column indicating the identification number of the information terminal, and a column indicating attribute information such as the user's address, and a process of displaying the translation history for each user.
  • Thereby, the interpreter can check the speech translation history and can respond based on the flow of the communication between the store clerk and the customer so far.
  • Moreover, since the operator terminal 30 displays the speech translation history in time series on the display unit 307, the interpreter can grasp the flow of the communication between the store clerk and the customer more easily and respond appropriately based on it.
  • When the information terminal 10 receives a response signal from the operator terminal 30 (FIG. 8; step SJ8), the connection between the information terminal 10 and the operator terminal 30 is established, and a call between the store clerk or customer and the interpreter is realized (FIG. 8; steps SJ9 and SO4).
  • The processor 11 displays the text T30 on the screen of the display unit 107.
  • Next, a second embodiment will be described. In the first embodiment, the information terminal displays the call start button (first image) whenever it outputs a translation result. The second embodiment differs from the first in that the information terminal compares the score related to translation accuracy calculated by the server device with a predetermined threshold, and displays the call start button (first image) only when the score is equal to or lower than that threshold.
  • The second embodiment will be described with reference to FIG. 13; differences from the flowchart of FIG. 8 describing the first embodiment are described in particular, and description of points similar to the flowchart of FIG. 8 is omitted.
  • FIG. 13 is a flowchart showing another example of the process flow (part) in the speech translation system.
  • The multilingual translation unit 207 of the server device 20 executes multilingual translation processing for translating the recognized "reading" (characters) of the speech into another language (FIG. 13; step SS12).
  • The storage unit 213 stores, as a translation history for each user, the translation result (translated content) associated with the content of the input speech and the score related to translation accuracy corresponding to the translation result (FIG. 13; step SS13).
  • In the translation processing, for example, statistical machine translation is performed: correspondences between words and phrases of the two languages are extracted from bilingual data, for example in the form of a probabilistic bilingual dictionary and a probabilistic word-order conversion table.
  • The score calculation unit 209 is configured to calculate, for each translation result, a score related to its translation accuracy, expressed for example as a percentage.
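As a toy illustration of the statistical-translation idea above (not the patent's algorithm), one can look up each source phrase in a probabilistic bilingual dictionary and fold the phrase probabilities into a rough percentage score; the phrase table below is fabricated for the example.

```python
# Fabricated probabilistic bilingual dictionary:
# source phrase -> list of (target phrase, probability)
phrase_table = {
    "gochuumon": [("your order", 0.8), ("the order", 0.2)],
    "okimari desu ka": [("have you decided", 0.7), ("is it decided", 0.3)],
}

def translate_phrases(phrases):
    """Pick the most probable target phrase for each source phrase and
    fold the probabilities into a rough percentage score."""
    words, score = [], 1.0
    for phrase in phrases:
        target, prob = max(phrase_table.get(phrase, [(phrase, 0.1)]),
                           key=lambda pair: pair[1])
        words.append(target)
        score *= prob
    return " ".join(words), round(score * 100, 1)

print(translate_phrases(["gochuumon", "okimari desu ka"]))
# -> ('your order have you decided', 56.0)
```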
  • When the multilingual translation processing and the speech synthesis processing are completed, the information processing unit 203 generates a text signal for text display based on the English conversation corpus entry that is the translation result (translated content), and generates an audio signal for audio output based on the synthesized speech. The generated text signal, the generated audio signal, and the score related to translation accuracy are then transmitted to the information terminal 10 through the transmission/reception unit 201 and the network N.
  • The score comparison unit 111 of the information terminal 10 compares the score related to translation accuracy calculated by the server device 20 with a predetermined threshold (FIG. 13; step SJ15). If the score is higher than the predetermined threshold (FIG. 13; No in step SJ15), the translation accuracy is good, and the first display processing control unit 113 displays the translation result on the display unit 107 while the synthesized speech is output (FIG. 13; step SJ16). For example, if the predetermined threshold is 80% and the score related to the translation accuracy of the translation processing in the server device 20 is 90%, the translation accuracy is regarded as good. If the translation is performed accurately and the customer can understand the user's (clerk's) questions, the process returns to step SJ13 shown in FIG. 13, and this time the customer's voice is subjected to voice processing such as input, recognition, translation, and voice synthesis.
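The threshold comparison in steps SJ15 to SJ17 can be sketched in a few lines; the 80% threshold follows the example in the text, while the rendering of the screen as a list of elements is an assumption of this sketch.

```python
SCORE_THRESHOLD = 80.0   # the example threshold used in the text

def render_result(translated_text, score):
    """Show the call start button only when the translation-accuracy
    score is at or below the threshold (steps SJ15 to SJ17)."""
    screen = [translated_text]
    if score <= SCORE_THRESHOLD:
        screen.append("[call start button]")   # accuracy judged poor
    return screen

print(render_result("May I take your order?", 90.0))  # text only
print(render_result("May I take your order?", 60.0))  # text + button
```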
  • In contrast, if the score related to translation accuracy is equal to or lower than the predetermined threshold (FIG. 13; Yes in step SJ15), the translation accuracy is poor, and the first display processing control unit 113 displays the translation result and a call start button (FIG. 13; step SJ17).
  • the first display processing control unit of the information terminal controls the process of selectively displaying the call start button when the translation result is displayed on the display unit.
  • When the call start button is selected, a call processing start request for starting a call between the user and an interpreter is transmitted; thus the burden on the user can be reduced, convenience can be improved, the occurrence of mistranslation can be prevented, and smooth communication can be realized.
  • Furthermore, the information terminal compares the score related to the translation accuracy of the speech translation with a predetermined threshold and displays the call start button when the translation accuracy is low. Since the call start button is displayed only when the need for a call with an interpreter is high, the call with the interpreter can be started more smoothly.
  • In the above embodiments, the processes of speech recognition, translation, and speech synthesis are executed by the server device 20, but these processes may instead be executed by the information terminal 10.
  • the module L20 used for these processes may be stored in the storage resource 12 of the information terminal 10 or may be stored in the storage resource 23 of the server device 20.
  • Likewise, the databases D20 such as the voice database and/or the models M20 such as the acoustic model may be stored in the storage resource 12 of the information terminal 10 or in the storage resource 23 of the server device 20.
  • the speech translation system may not include the network N and the server device 20.
  • Conversely, processing performed by the information terminal 10 may be configured to be performed in the server device 20.
  • The step of displaying the translation history in association with each user in step SO3 shown in FIG. 8 may be executed simultaneously with step SO1, or after step SO1 and simultaneously with or before step SO2. Similarly, the step of displaying the translation history in association with each user in step SO13 shown in FIG. 13 may be executed simultaneously with step SO11, or after step SO11 and simultaneously with or before step SO12.
  • the operator terminal 30 can obtain the translation history by receiving a call processing start request including the translation history, but is not limited thereto.
  • the operator terminal 30 may be configured to receive the translation history directly from the server device 20 before or after receiving the call processing start request.
  • A gateway server for converting the communication protocol may be interposed between the information terminal 10 and the network N, or between the operator terminal 30 and the network N.
  • the information terminal 10 is not limited to a portable device, and may be a desktop personal computer, a notebook personal computer, a tablet personal computer, a laptop personal computer, or the like.
  • the operator terminal 30 is not limited to a stationary device, and may be configured with a portable tablet terminal device having a communication function with the network N.

Abstract

With the present invention, it is possible to reduce the burden on a user and improve convenience, and also to prevent the occurrence of mistranslation and achieve smooth communication. A speech translation system is provided with an information terminal for inputting the speech of a user, a server device for translating the content of the speech input to the information terminal, and an interpreter terminal for performing call processing with the information terminal. The server device is provided with a speech recognition unit for recognizing the content of the speech input to the information terminal, and a translation unit for translating the content recognized by the speech recognition unit into content in a different language. The information terminal is provided with a speech output unit for outputting, in speech, the content translated by the translation unit of the server device, a first display process control unit for controlling a process for displaying the text of the translated content and for selectively displaying a first image in addition to the text, and a call process control unit for transmitting to the interpreter terminal a call process start request to start the call process when the first image is selected.

Description

Speech translation system, speech translation method, and speech translation program

Cross-reference of related applications
 This application is based on Japanese Patent Application No. 2016-017071 filed on February 1, 2016, the contents of which are incorporated herein by reference.
 The present invention relates to a speech translation system, a speech translation method, and a speech translation program.
 In order to enable conversation between people who cannot understand each other's language, for example conversation between a store clerk (a salesperson at a store such as a restaurant) and a customer (a tourist from abroad, etc.), speech translation techniques have been proposed that convert a speaker's utterance into text, machine-translate the content of that text into the other party's language and display it on a screen, or reproduce the content of the text as speech using speech synthesis technology (see, for example, Patent Document 1). Speech translation applications that embody such speech translation technology and operate on information terminals such as smartphones have also been put into practical use (see, for example, Non-Patent Document 1). Meanwhile, interpretation systems that enable telephone calls between a plurality of users are known (see, for example, Patent Document 2).
特開平9-34895号公報JP-A-9-34895 特開2010-21692号公報JP 2010-21692 A
 In the conventional speech translation apparatus described above, when a store clerk in a restaurant asks about the contents of a customer's order or explains the ingredients of a dish, machine translation by a translation engine is executed as soon as speech is input. Consequently, when the input speech does not follow a basic sentence pattern of the language, or when the spoken word order differs from what is expected, mistranslation is likely to occur. When the machine translation is inaccurate and the two parties cannot communicate smoothly, the store clerk can, for example, call an interpreter from the speech translation device the clerk carries and have the interpreter translate, thereby enabling smooth communication between the two parties.
 However, when calling an interpreter while speech translation processing is being executed on a conventional speech translation apparatus, identification information for identifying the interpreter (the interpreter terminal used by the interpreter), such as a telephone number, must be looked up in the telephone directory, communication history, or the like stored in the apparatus. After the telephone number has been found, a call operation must additionally be performed, which may increase the burden on the user (the speaker) and reduce convenience.
 Accordingly, some aspects of the present invention have been made in view of such circumstances, and an object thereof is to provide a speech translation system, a speech translation method, and a speech translation program that can reduce the burden on the user and improve convenience, while preventing the occurrence of mistranslation and realizing smooth communication.
 To solve the above problems, a speech translation system according to one aspect of the present invention comprises an information terminal that receives a user's speech as input, a server device that translates the content of the speech input to the information terminal, and an interpreter terminal that performs call processing with the information terminal. The server device comprises a speech recognition unit that recognizes the content of the speech input to the information terminal, and a translation unit that translates the content recognized by the speech recognition unit into content in a different language. The information terminal comprises a speech output unit that outputs, as speech, the content translated by the translation unit of the server device; a first display processing control unit that controls processing for displaying the text of the translated content and, in addition to the text, for selectively displaying a first image; and a call processing control unit that controls call processing with the interpreter terminal and that, when the first image is selected, transmits a call processing start request for starting the call processing to the interpreter terminal.
 In the above speech translation system, the server device may further comprise a score calculation unit that calculates a score relating to translation accuracy, and the first display processing control unit may control processing for displaying the first image when the score is equal to or less than a predetermined threshold.
 In the above speech translation system, the server device may further comprise a storage unit that stores the translated content associated with the content of the input speech as a translation history in association with each user, and the interpreter terminal may further comprise a second display processing control unit that controls processing for displaying the translation history in association with each user.
 In the above speech translation system, the first display processing control unit may control processing for further displaying two or more second images respectively indicating two or more languages, and when the first image is selected after one of the second images has been selected, the call processing control unit may control call processing with an interpreter terminal associated with an interpreter who can use the language indicated by the selected one of the second images.
 To solve the above problems, a speech translation method according to one aspect of the present invention includes: outputting, as speech, the content of a user's speech translated into content in a different language; controlling processing for displaying the text of the translated content and, in addition to the text, for selectively displaying a first image; and controlling call processing with an interpreter terminal, including transmitting, when the first image is selected, a call processing start request for starting the call processing to the interpreter terminal.
 To solve the above problems, a speech translation program according to one aspect of the present invention causes a computer to function as: a speech output unit that outputs, as speech, the content of a user's speech translated into content in a different language; a first display processing control unit that controls processing for displaying the text of the translated content and, in addition to the text, for selectively displaying a first image; and a call processing control unit that controls call processing with an interpreter terminal and that, when the first image is selected, transmits a call processing start request for starting the call processing to the interpreter terminal.
 In the present disclosure, the terms "unit", "device", and "system" do not simply mean physical means; they also encompass cases where the functions of the "unit", "device", or "system" are realized by software. In addition, the functions of one "unit", "device", or "system" may be realized by two or more physical means or devices, and the functions of two or more "units", "devices", or "systems" may be realized by a single physical means or device.
 According to the present disclosure, the burden on the user can be reduced and convenience can be improved, while the occurrence of mistranslation can be prevented and smooth communication can be realized.
FIG. 1 is a system block diagram schematically showing a preferred embodiment of a network configuration according to the speech translation system of the present disclosure.
FIG. 2 is a system block diagram schematically showing an example of the configuration of a user device (information terminal) in the speech translation system of the present disclosure.
FIG. 3 is a functional block diagram schematically showing an example of the functional configuration of a user device (information terminal) in the speech translation system of the present disclosure.
FIG. 4 is a system block diagram schematically showing an example of the configuration of a server device in the speech translation system of the present disclosure.
FIG. 5 is a functional block diagram schematically showing an example of the functional configuration of a server device in the speech translation system of the present disclosure.
FIG. 6 is a system block diagram schematically showing an example of the configuration of an operator terminal (interpreter device) in the speech translation system of the present disclosure.
FIG. 7 is a functional block diagram schematically showing an example of the functional configuration of an operator terminal in the speech translation system of the present disclosure.
FIG. 8 is a flowchart showing an example of (part of) the processing flow in the speech translation system of the present disclosure.
FIGS. 9(A) to 9(C) are plan views showing an example of display screen transitions on the information terminal of the present disclosure.
FIGS. 10(A) to 10(C) are plan views showing an example of display screen transitions on the information terminal of the present disclosure.
FIGS. 11(A) to 11(D) are plan views showing an example of display screen transitions on the information terminal of the present disclosure.
FIG. 12 is a diagram showing an example of a display screen on the interpreter terminal of the present disclosure.
FIG. 13 is a flowchart showing another example of (part of) the processing flow in the speech translation system of the present disclosure.
 Hereinafter, embodiments of the present invention will be described in detail. The following embodiments are examples for explaining the present invention and are not intended to limit the present invention to those embodiments alone. The present invention can be modified in various ways without departing from the gist thereof. Furthermore, those skilled in the art can adopt embodiments in which each element described below is replaced with an equivalent, and such embodiments are also included within the scope of the present invention. Positional relationships such as up, down, left, and right indicated as needed are based on the illustrated representations unless otherwise specified. The various dimensional ratios in the drawings are not limited to the illustrated ratios.
(System configuration)
 FIG. 1 is a system block diagram schematically showing a preferred embodiment of a network configuration according to the speech translation system of the present disclosure. In this example, the speech translation system 100 illustratively comprises an information terminal 10 that is used by a user (a speaker or another speaker) and that receives the user's speech as input; a server device 20 that is electronically connected to the information terminal 10 via a network N and that translates the content of the speech input to the information terminal 10; and an operator terminal 30 (interpreter terminal) that is electronically connected to the information terminal 10 and the server device 20 via the network N, that is used by an interpreter, and that performs call processing with the information terminal 10.
 FIG. 2 is a system block diagram schematically showing an example of the configuration of the user device (information terminal) in the speech translation system according to the present disclosure. As shown in FIG. 2, the information terminal 10 illustratively comprises a processor 11, a storage resource 12, a voice input/output device 13 (including configurations in which the microphone and speaker are separate or integrated), a communication interface 14, an input device 15, a display device 16, and a camera 17. When installed speech translation application software (at least part of a speech translation program according to an embodiment of the present disclosure) runs on it, the information terminal 10 functions as part or all of the speech translation system according to an embodiment of the present disclosure. The information terminal 10 here is, for example, a portable tablet-type terminal device, including a mobile phone typified by a smartphone, having a function for communicating with the network N.
 The processor 11 comprises an arithmetic logic unit and various registers (a program counter, data registers, instruction registers, general-purpose registers, and the like). The processor 11 interprets and executes the speech translation application software, which is the program P10 stored in the storage resource 12, and performs various kinds of processing. The speech translation application software as the program P10 can be distributed from, for example, the server device 20 through the network N, and may be installed and updated manually or automatically.
 The network N is, for example, a communication network configured as a mixture of wired networks (such as a local area network (LAN), a wide area network (WAN), or a value-added network (VAN)) and wireless networks (such as a mobile communication network, a satellite communication network, Bluetooth (registered trademark), WiFi (Wireless Fidelity), and HSDPA (High Speed Downlink Packet Access)).
 The storage resource 12 is a logical device provided by the storage area of a physical device (for example, a computer-readable storage medium such as a semiconductor memory), and stores an operating system program, driver programs, various kinds of information, and the like used for the processing of the information terminal 10. Examples of the driver programs include an input/output device driver program for controlling the voice input/output device 13, an input device driver program for controlling the input device 15, and an output device driver program for controlling the display device 16. The voice input/output device 13 is, for example, a general microphone and a sound player capable of reproducing sound data.
 The communication interface 14 provides a connection interface with, for example, the server device 20 and the operator terminal 30, and is configured as a wireless communication interface and/or a wired communication interface. The input device 15 provides an interface for accepting input operations such as tap operations on icons, buttons, a virtual keyboard, and the like displayed on the display device 16; in addition to a touch panel, various input devices externally attached to the information terminal 10 can be given as examples.
 The display device 16 serves as an image display interface that provides various kinds of information to the user and, as necessary, to the conversation partner; examples include organic EL displays, liquid crystal displays, and CRT displays, preferably including those employing touch panels of various types. The camera 17 is for capturing still images and moving images of various subjects.
 FIG. 3 is a functional block diagram schematically showing an example of the functional configuration of the user device (information terminal) in the speech translation system according to the present disclosure. As shown in FIG. 3, the information terminal 10 functionally comprises a voice input/output unit 101, a transmission/reception unit 103, an input operation reception unit 105, a display unit 107, an information processing unit 109, and a storage unit 117. The information processing unit 109 functionally comprises a score comparison unit 111, a first display processing control unit 113, a call processing control unit 115, and an operator terminal specifying unit 116.
 The voice input/output unit 101, for example, receives the user's speech as input. The voice input/output unit 101 also outputs, as speech, the content translated by the server device 20 shown in FIG. 1, as described later. The voice input/output device 13 shown in FIG. 2 functions as the voice input/output unit 101.
 The transmission/reception unit 103 transmits and receives various kinds of information to and from, for example, the server device 20 and the operator terminal 30 shown in FIG. 1. For example, the transmission/reception unit 103 transmits the content of the input speech to the server device 20, and receives the text information, voice information, and the like of the content translated by the server device 20. The transmission/reception unit 103 also receives, for example, a score relating to translation accuracy from the server device 20. The communication interface 14 shown in FIG. 2 functions as the transmission/reception unit 103.
 The input operation reception unit 105 is, for example, a block that accepts the user's input operations. The input device 15 shown in FIG. 2 functions as the input operation reception unit 105.
 The display unit 107 displays various kinds of information. For example, the display unit 107 displays the text of the translated content. The display unit 107 also displays, for example, the language buttons 61 (second images) shown in FIG. 9(A) and the call start button 73 (first image) shown in FIG. 10(C). The display device 16 shown in FIG. 2 functions as the display unit 107.
 The information processing unit 109 represents the functions of the processor 11 shown in FIG. 2. The score comparison unit 111, for example, compares a score relating to the translation accuracy of the translation processing performed by the server device 20 with a predetermined threshold (score). The first display processing control unit 113 is a block that controls processing for displaying various kinds of information on the display unit 107; for example, it controls processing for displaying the text of the content translated by the server device 20 and, in addition to that text, for selectively displaying the call start button 73 (first image) shown in FIG. 10(C). The call processing control unit 115 is, for example, a block that controls call processing between the information terminal 10 and the operator terminal 30; when the call start button 73 displayed on the display unit 107 is selected, it transmits a call processing start request for starting the call processing to the operator terminal 30. The operator terminal specifying unit 116, for example, specifies the operator terminal 30 used by an interpreter who can use the language indicated by the English button selected from among the language buttons 61 shown in FIG. 9(A).
 The storage unit 117 is a block that stores various programs, information, and the like used for the processing of the information terminal 10. For example, the storage unit 117 stores the text information, voice information, and the like of the content translated by the server device 20 and received by the transmission/reception unit 103, as well as the score relating to the translation accuracy of the server device 20 received by the transmission/reception unit 103. The storage resource 12 shown in FIG. 2 functions as the storage unit 117. The camera 17 shown in FIG. 2, although not shown in FIG. 3, functions as, for example, an imaging unit.
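 As one illustration of how these terminal-side functional blocks could fit together, the following is a minimal Python sketch of the decision logic: comparing the translation-accuracy score against a threshold and, when appropriate, showing the call start button and issuing a call processing start request. All class and field names (TranslationResult, CallProcessingControl, and so on), the threshold value, and the request format are hypothetical illustrations, not details given in this disclosure.

```python
from dataclasses import dataclass

SCORE_THRESHOLD = 0.7  # hypothetical "predetermined threshold" for the score comparison unit

@dataclass
class TranslationResult:
    source_text: str       # recognized input content ("reading")
    translated_text: str   # translated content returned by the server device
    score: float           # translation-accuracy score from the score calculation unit

class FirstDisplayProcessingControl:
    """Controls what the display unit shows: the text, plus (selectively) the call start button."""
    def render(self, result: TranslationResult, show_call_button: bool) -> None:
        print(f"[screen] {result.source_text} -> {result.translated_text}")
        if show_call_button:
            print("[screen] (call start button displayed)")

class CallProcessingControl:
    """Builds the call processing start request sent to the operator (interpreter) terminal."""
    def request_call(self, terminal_id: str, history: list[TranslationResult]) -> dict:
        return {
            "type": "call_processing_start_request",
            "terminal_id": terminal_id,
            "translation_history": [(r.source_text, r.translated_text) for r in history],
        }

def handle_translation(result, history, display, call_control, call_button_tapped):
    # Show the call start button selectively, e.g. only when the score is at or
    # below the threshold; send the start request if the user then taps it.
    show_button = result.score <= SCORE_THRESHOLD
    display.render(result, show_button)
    history.append(result)
    if show_button and call_button_tapped:
        return call_control.request_call("terminal-001", history)
    return None
```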
 FIG. 4 is a system block diagram schematically showing an example of the configuration of the server device in the speech translation system according to the present disclosure. As shown in FIG. 4, the server device 20 illustratively comprises a processor 21, a communication interface 22, and a storage resource 23. The server device 20 is configured by, for example, a host computer with high arithmetic processing capability, and exhibits its server functions when a predetermined server program runs on that host computer; for example, it is configured by one or more host computers functioning as a speech recognition server, a translation server, and a speech synthesis server (shown as a single computer in the figure, but not limited thereto).
 The processor 21 comprises an arithmetic logic unit that processes arithmetic operations, logical operations, bit operations, and the like, and various registers (a program counter, data registers, instruction registers, general-purpose registers, and the like); it interprets and executes the program P20 stored in the storage resource 23 and outputs predetermined computation results. The communication interface 22 is a hardware module for connecting to the information terminal 10 via the network N; it is, for example, a modulation/demodulation device such as an ISDN modem, an ADSL modem, a cable modem, an optical modem, or a soft modem.
 The storage resource 23 is, for example, a logical device provided by the storage area of a physical device (a computer-readable storage medium such as a disk drive or a semiconductor memory), and stores one or more of each of the program P20, various modules L20, various databases D20, and various models M20.
 The program P20 is the above-described server program, which is the main program of the server device 20. The various modules L20 are software modules (modularized subprograms) that are called and executed as appropriate while the program P20 is running, in order to perform a series of information processing relating to requests and information transmitted from the information terminal 10. Examples of such modules L20 include a speech recognition module, a translation module, and a speech synthesis module.
 The various databases D20 include the various corpora required for speech translation processing (for example, in the case of Japanese-English speech translation, a Japanese speech corpus, an English speech corpus, a Japanese character (vocabulary) corpus, an English character (vocabulary) corpus, a Japanese dictionary, an English dictionary, a Japanese-English bilingual dictionary, a Japanese-English bilingual corpus, and the like), a speech database described later, a management database for managing information about users, and the like. Examples of the various models M20 include acoustic models and language models used for the speech recognition described later.
 FIG. 5 is a functional block diagram schematically showing an example of the functional configuration of the server device in the speech translation system according to the present disclosure. As shown in FIG. 5, the server device 20 functionally comprises a transmission/reception unit 201, an information processing unit 203, and a storage unit 213. The information processing unit 203 comprises, for example, a speech recognition unit 205, a multilingual translation unit 207, a score calculation unit 209, and a speech synthesis unit 211.
 The transmission/reception unit 201 transmits and receives various kinds of information to and from, for example, the information terminal 10 and the operator terminal 30 shown in FIG. 1. For example, the transmission/reception unit 201 receives from the information terminal 10 the content of the speech input to the information terminal 10, and transmits to the information terminal 10 the text information, voice information, and the like of the content translated by the multilingual translation unit 207 described later. The transmission/reception unit 201 also transmits to the information terminal 10, for example, the score relating to translation accuracy calculated by the score calculation unit 209 described later. The communication interface 22 shown in FIG. 4 functions as the transmission/reception unit 201.
 The information processing unit 203 represents the functions of the processor 21 shown in FIG. 4. The speech recognition unit 205, for example, recognizes the content of the speech input to the information terminal 10. The multilingual translation unit 207, for example, translates the content recognized by the speech recognition unit 205 into content in a different language. The score calculation unit 209, for example, calculates a score relating to the translation accuracy of the multilingual translation unit 207. The speech synthesis unit 211, for example, performs speech synthesis based on the translation result produced by the multilingual translation unit 207.
 The storage unit 213 is, for example, a block that stores various programs, information, and the like used for the processing of the server device 20. For example, the storage unit 213 stores the content of the speech input to the information terminal 10 and received by the transmission/reception unit 201, as well as the translated content. The storage unit 213 also stores, for example, the translated content associated with the content of the input speech as a translation history in association with each user. The storage resource 23 shown in FIG. 4 functions as the storage unit 213.
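 To make the division of labor among these server-side units concrete, here is a minimal Python sketch of the pipeline (recognize, translate, score, synthesize). The function bodies are placeholders: the naive stand-ins for recognition, translation, scoring, and synthesis are hypothetical and are not the methods actually used by the disclosed server device.

```python
def recognize_speech(audio: bytes) -> str:
    """Speech recognition unit 205: convert the 'sound' of the input speech to a
    'reading' (text). A real implementation would consult acoustic and language
    models (models M20) and the relevant speech corpus (databases D20)."""
    return "placeholder recognized text"  # stand-in for a real recognizer

def translate(text: str, src: str, dst: str) -> str:
    """Multilingual translation unit 207: translate the recognized text into
    another language, e.g. via bilingual dictionaries/corpora (databases D20)."""
    return f"[{dst}] {text}"  # stand-in for a real translation engine

def score_translation(src_text: str, translated: str) -> float:
    """Score calculation unit 209: return a translation-accuracy score.
    A dummy constant here; a real scorer might use model confidence."""
    return 0.5

def synthesize(text: str, lang: str) -> bytes:
    """Speech synthesis unit 211: convert the translated text into audio."""
    return text.encode("utf-8")  # stand-in for synthesized waveform data

def handle_request(audio: bytes, src_lang: str, dst_lang: str) -> dict:
    """End-to-end server flow corresponding to steps SS1, SS2, and SS4 in FIG. 8."""
    recognized = recognize_speech(audio)
    translated = translate(recognized, src_lang, dst_lang)
    return {
        "recognized_text": recognized,
        "translated_text": translated,
        "score": score_translation(recognized, translated),
        "synthesized_audio": synthesize(translated, dst_lang),
    }
```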
 FIG. 6 is a system block diagram schematically showing an example of the configuration of the operator terminal (interpreter device) in the speech translation system according to the present disclosure. As shown in FIG. 6, the operator terminal 30 comprises a processor 31, a storage resource 32, a voice input/output device 33 (including configurations in which the microphone and speaker are separate or integrated), a communication interface 34, an input device 35, a display device 36, and a camera 37. As noted above, the operator terminal 30 has a block configuration similar to that of the information terminal 10 shown in FIG. 2; the following description therefore focuses on the parts of the configuration that differ from the information terminal 10. The operator terminal 30 functions as part or all of the speech translation system according to an embodiment of the present disclosure when, for example, installed CTI (Computer Telephony Integration) application software, executed as at least part of the speech translation program according to an embodiment of the present disclosure, runs on it.
 The operator terminal 30 receives calls from the information terminal 10 shown in FIG. 1, and the interpreter performs interpretation via the operator terminal 30. The operator terminal 30 displays on the display device 36 information relating to the other party of the call, for example at least one of the information terminal 10 and the operator of the information terminal 10, as well as the translation history described in detail later. The operator terminal 30 is, for example, a stationary terminal device, including a desktop personal computer, having a function for communicating with the network N.
 The processor 31 interprets and executes the CTI application software, which is the program P30 stored in the storage resource 32, and performs various kinds of processing. The input device 35 provides an interface for accepting input operations such as tap operations on icons, buttons, a virtual keyboard, and the like displayed on the display device 36; examples include various input devices externally attached to the operator terminal 30, such as a keyboard and a mouse. The input device 35 may also be a device such as a touch panel of various types incorporating the functions of the display device 36.
 FIG. 7 is a functional block diagram schematically showing an example of the functional configuration of the operator terminal (interpreter device) in the speech translation system according to the present disclosure. As shown in FIG. 7, the operator terminal 30 functionally comprises a voice input/output unit 301, a transmission/reception unit 303, an input operation reception unit 305, a display unit 307, an information processing unit 309, and a storage unit 315. The information processing unit 309 functionally comprises a call processing unit 311 and a second display processing control unit 313.
 The voice input/output unit 301, for example, receives as input the speech of an operator, including the interpreter. The voice input/output unit 301 may also be configured, for example, to output by voice the content indicating the translation history received by the transmission/reception unit 303, as described later. The voice input/output device 33 shown in FIG. 6 functions as the voice input/output unit 301.
 The transmission/reception unit 303 transmits and receives various kinds of information to and from, for example, the information terminal 10 and the server device 20 shown in FIG. 1. For example, the transmission/reception unit 303 receives the translation history transmitted from the server device 20 via the information terminal 10, receives the call processing start request transmitted from the information terminal 10, and transmits a response signal to the call processing start request. The communication interface 34 shown in FIG. 6 functions as the transmission/reception unit 303.
 The input operation reception unit 305 is, for example, a block that accepts the operator's input operations. The input device 35 shown in FIG. 6 functions as the input operation reception unit 305.
 The display unit 307 displays various kinds of information. For example, the display unit 307 displays the translation history in association with each user. The display device 36 shown in FIG. 6 functions as the display unit 307.
 The information processing unit 309 represents the functions of the processor 31 shown in FIG. 6. The call processing unit 311, for example, determines, based on the call processing start request transmitted from the information terminal 10, whether or not a call between the operator terminal 30 and the information terminal 10 is possible, and generates a response signal to the call processing start request. The response signal includes a signal indicating that a call between the operator terminal 30 and the information terminal 10 is possible, or a signal indicating that such a call is not possible. The second display processing control unit 313 is, for example, a block that controls processing for displaying various kinds of information on the display unit 307; for example, it controls processing for displaying the translation history on the display unit 307 in association with each user.
 The storage unit 315 is a block that stores various programs, information, and the like used for the processing of the operator terminal 30. For example, the storage unit 315 stores the translation history received by the transmission/reception unit 303 and transmitted from the server device 20 via the information terminal 10. The storage resource 32 shown in FIG. 6 functions as the storage unit 315. The camera 37 shown in FIG. 6, although not shown in FIG. 7, functions as, for example, an imaging unit.
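 The following short Python sketch illustrates, under the same hypothetical naming as the earlier sketches, how the call processing unit of the operator terminal might answer a call processing start request and hand the accompanying translation history to the display side. The availability check is a placeholder assumption, not the mechanism disclosed here.

```python
class OperatorCallProcessing:
    """Call processing unit 311: decide whether a call is possible and build the response."""
    def __init__(self):
        self.busy = False  # hypothetical interpreter-availability flag

    def handle_start_request(self, request: dict) -> dict:
        possible = not self.busy
        if possible:
            # Second display processing control unit 313: show the translation
            # history associated with the requesting user before the call starts.
            for src, dst in request.get("translation_history", []):
                print(f"[operator screen] {src} -> {dst}")
            self.busy = True
        return {
            "type": "call_processing_start_response",
            "terminal_id": request["terminal_id"],
            "call_possible": possible,
        }
```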
 An example of the operations and behavior of the speech translation processing and the call processing in the speech translation system 100 configured as described above is further described below.
(Speech translation processing and call processing)
(First embodiment)
 FIG. 8 is a flowchart showing an example of (part of) the processing flow in the speech translation system according to the present disclosure. FIGS. 9(A) to 9(C), 10(A) to 10(C), and 11(A) to 11(D) are plan views showing examples of display screen transitions on the information terminal according to the present disclosure. FIG. 12 is a diagram showing an example of a display screen on the interpreter terminal according to the present disclosure. The following assumes a conversation in which the user of the information terminal 10 is a restaurant clerk who speaks Japanese and the conversation partner is a customer who speaks English, that is, a conversation in which the input language is Japanese and the translation language is English. However, the present invention is not limited to this.
 First, when the user (store clerk) taps an icon (not shown) of the speech translation application software displayed on the display unit 107 of the information terminal 10, the application is started on the information terminal 10 (FIG. 8; step SJ1).
 When the application starts, a language selection screen for the customer is displayed on the display unit 107 (FIG. 8; step SJ2). As shown in FIG. 9(A), this language selection screen displays, for example, Japanese text T21 asking the customer about their language, English text T22 to the same effect, and language buttons 61 (second images) indicating a plurality of expected representative languages (here, English, Chinese (two variants depending on the typeface), and Korean).
 At this time, the Japanese text T21 and the English text T22 are displayed on the screen of the display unit 107 of the information terminal 10 by the first display processing control unit 113 and the display unit 107 so as to be separated into, for example, regions of different colors, and oriented in opposite directions (different directions from each other; upside down relative to each other in the figure). As a result, when the user and the customer converse face to face, the user can easily read the Japanese text T21 while the customer can easily read the English text T22. Moreover, since the text T21 and the text T22 are displayed in separate regions, there is the advantage that the two are clearly distinguished and even easier to view.
 Then, the user presents the text T21 displayed on the language selection screen of FIG. 9(A) to the customer and has the customer tap the English button, whereby the customer's language is selected. As a result, a standby screen for Japanese and English voice input is displayed on the display device as the home screen (FIG. 8; step SJ3). This home screen displays text T23 asking which of the user's and the customer's languages will be spoken, as well as a Japanese input button 62a for Japanese voice input and an English input button 62b for English voice input. The home screen also displays a history display button 63 for displaying the history of input content, a language selection button 64 for returning to the language selection screen and switching the customer's language (redoing the language selection), and a settings button 65 for configuring various settings of the application software.
 Next, when the user (store clerk) taps the Japanese input button 62a on the home screen of FIG. 9(B) to select Japanese voice input, a voice input screen for accepting the user's utterance in Japanese is displayed (FIG. 9(C)). When this voice input screen is displayed, voice input from the voice input/output unit 101 becomes possible. The voice input screen displays text T24 prompting the user to speak and a microphone graphic 66 indicating that the terminal is waiting for voice input. To indicate that Japanese voice input was selected on the preceding screen of FIG. 9(B), the Japanese input button 62a is not displayed on the voice input screen of FIG. 9(C); the English input button 62b is displayed partially hidden behind the microphone graphic 66, for example in a pale color (the same applies in FIGS. 10(A) and 10(B) described later).
 A cancel button 67 is displayed at the bottom of this voice input screen; tapping it returns the screen to the voice input standby screen that is the home screen (FIG. 9(B)) so that voice input can be redone (the same applies in FIGS. 10(A) and 10(B) described later). In this state, when the user speaks in Japanese the matters to be conveyed to the customer, a multi-circle graphic 68 that schematically and dynamically indicates the loudness of the voice is displayed on the screen of the display unit 107 together with the text T24, so that the voice input level is visually fed back to the user, who is the speaker (FIG. 8; step SJ4).
 Then, when the user's utterance ends, for example when the information processing unit 109 of the information terminal 10 detects that there has been no voice input for a certain period of time, the information processing unit 109 ends the acceptance of the user's utterance. The information processing unit 109 then generates an audio signal based on the voice input and transmits the audio signal to the server device 20 through the transmission/reception unit 103 and the network N.
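 A minimal Python sketch of this end-of-utterance detection and transmission step might look as follows; the silence interval, the frame-reading and voice-detection callables, and the message format are all illustrative assumptions rather than details given in the disclosure.

```python
import time

SILENCE_SECONDS = 1.5  # hypothetical "certain period" with no voice input

def record_until_silence(read_frame, is_voiced) -> bytes:
    """Accept the utterance until no voice input is detected for SILENCE_SECONDS."""
    frames, last_voiced = [], time.monotonic()
    while time.monotonic() - last_voiced < SILENCE_SECONDS:
        frame = read_frame()          # one short chunk of microphone audio
        frames.append(frame)
        if is_voiced(frame):          # e.g. a simple energy-based voice detector
            last_voiced = time.monotonic()
    return b"".join(frames)           # the generated audio signal

def send_to_server(audio: bytes, send) -> None:
    """Transmit the audio signal to the server device over the network."""
    send({"type": "speech_input", "audio": audio})
```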
 Next, the speech recognition unit 205 of the information processing unit 203 of the server device 20 receives the audio signal through the transmission/reception unit 201 and performs speech recognition processing (FIG. 8; step SS1). At this time, the speech recognition unit 205 calls the necessary module L20, database D20, and model M20 (a speech recognition module, a Japanese speech corpus, an acoustic model, a language model, and the like) from the storage unit 213, and converts the "sound" of the input speech into a "reading" (characters).
 Here, the information processing unit 203 generates a text signal for text output based on the "reading" (characters) of the recognized speech, and transmits it to the information terminal 10 through the transmission/reception unit 201 and the network N. At this time, the information processing unit 203 generates a text signal based on the content of the recognized speech itself, and also retrieves, from the Japanese conversation corpus stored in advance in the storage unit 213, the entry corresponding to the actual utterance content and generates a text signal based on it. Then, as shown in FIG. 10(B), the first display processing control unit 113 of the information terminal 10, having received the text signal through the transmission/reception unit 103, displays on the screen, as the recognition result of the Japanese utterance input by the user, the Japanese text T25 that is the content of the recognized speech.
 Next, the multilingual translation unit 207 proceeds to multilingual translation processing, which translates the "reading" (characters) of the recognized speech into another language (FIG. 8; step SS2). At this time, the multilingual translation unit 207 calls the necessary modules L20 and databases D20 (a translation module, a Japanese character corpus, a Japanese dictionary, an English dictionary, a Japanese-English bilingual dictionary, a Japanese-English bilingual corpus, and the like) from the storage unit 213; appropriately rearranges the "reading" (character string) of the input speech, which is the recognition result, and converts it into Japanese phrases, clauses, sentences, and the like; extracts the English corresponding to the conversion result; rearranges these according to English grammar and converts them into natural English phrases, clauses, sentences, and the like; and selects the corresponding English conversation corpus from the storage unit 213. Meanwhile, as shown in FIG. 10(B), the display unit 107 displays a standby screen including Japanese text T26 indicating that translation is in progress and a circular graphic 69 indicating that translation is in progress.
 The storage unit 213 stores the translation result (translated content) associated with the content of the input speech as a translation history in association with each user (FIG. 8; step SS3). For example, the storage unit 213 stores the English conversation corpus entries corresponding to the translated English phrases, clauses, sentences, and the like as a translation history in association with the content of the input speech.
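 As a sketch of what this per-user translation history (step SS3) could look like, the following Python snippet keeps translated content keyed by a user identifier; the in-memory dictionary is a hypothetical stand-in for whatever management database the storage unit actually uses.

```python
from collections import defaultdict

# user_id -> list of (input content, translated content) pairs
translation_history: dict[str, list[tuple[str, str]]] = defaultdict(list)

def store_translation(user_id: str, input_text: str, translated_text: str) -> None:
    """Store the translated content, associated with the input content, per user."""
    translation_history[user_id].append((input_text, translated_text))

def history_for_user(user_id: str) -> list[tuple[str, str]]:
    """Retrieve the history later, e.g. for display on the interpreter terminal (FIG. 12)."""
    return translation_history[user_id]
```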
 Next, the speech synthesis unit 211 calls from the storage unit 213 the module L20, database D20, and model M20 required for speech synthesis (a speech synthesis module, an English speech corpus, an acoustic model, a language model, and the like), and converts the English conversation corpus entry corresponding to the translation result, that is, the English phrase, clause, sentence, or the like, into natural speech (FIG. 8; step SS4).
 When the multilingual translation processing and speech synthesis processing are complete, the information processing unit 203 generates a text signal for text display based on the English conversation corpus that is the translation result (translated content), generates an audio signal for voice output based on the synthesized speech, and transmits them to the information terminal 10 through the transmission/reception unit 201 and the network N.
 Then, as shown in FIG. 10(C), the first display processing control unit 113 of the information terminal 10, having received the text signal and audio signal through the transmission/reception unit 103, displays as a conversation screen the text T25, the text T27 of the Japanese conversation corpus corresponding to the text T25 (here the same as the text T25, but not limited thereto), and the text T28 of the English conversation corpus that is the translation result; it further controls processing for selectively displaying the call start button 73 (first image) on that screen (FIG. 8; step SJ5). Here, the storage unit 117 of the information terminal 10 may, for example, store the text signal and audio signal received from the server device 20 as a translation history.
 Simultaneously with step SJ5, the voice input/output unit 101 outputs (reads aloud) as speech the content (translated content) of the English text T28 that is the translation result (FIG. 8; step SJ6). Step SJ6 may instead be executed before or after step SJ5.
 At this time, as shown in FIG. 10(C), the Japanese texts T25 and T27 and the English text T28 are likewise displayed on the screen of the display unit 107 of the information terminal 10 separated into, for example, regions of different colors or divided by line segments, and oriented in opposite directions (different directions from each other; upside down relative to each other in the figure). As a result, when the user and the customer converse face to face and both can view the screen of the display unit 107, the user can easily read the Japanese texts T25 and T27 (the input content), while the customer can easily read the English text T28 (the translated content). Moreover, since the texts T25 and T27 and the text T28 are displayed in separate regions, there is the advantage that the two are clearly distinguished and even easier to view.
 Tapping the voice output button 70 displayed on the conversation screen of FIG. 10(C) repeats the voice output. The conversation screen also displays a check button 71 for ending the translation at that point; tapping it ends the translation processing and returns the screen to the home screen (FIG. 9(B)).
 Next, if the translation is performed accurately and the customer can understand the user's (store clerk's) question, speech processing of the customer's voice, namely input, recognition, translation, and speech synthesis, is performed in turn (FIG. 8; No in step SJ7). In this speech processing for the customer, the check button 71 displayed in FIG. 10(C) is first tapped to display the home screen (FIG. 9(B)). Next, on the home screen, the English input button 62b is tapped to select English voice input by the customer. The subsequent processing is basically the same as the processing described above, except that the speaker is the customer instead of the user, Japanese voice input is replaced with English voice input, and English voice and text output is replaced with Japanese voice and text output, so a detailed description is omitted here. When the conversation between the user and the customer is complete, the series of speech translation processes ends.
 On the other hand, when the content of the Japanese input by the store clerk or of the English input by the customer does not follow the basic sentence patterns of that language, or when the spoken word order differs from the norm, the likelihood of mistranslation tends to rise. And when the translation accuracy is not high, for example when a mistranslation actually occurs, communication between the store clerk and the customer may not proceed smoothly. In such a case, therefore, when at least one of the store clerk and the customer selects the call start button 73 (first image) displayed on the display unit 107 of the information terminal 10 in step SJ5 of FIG. 8, the call processing control unit 115 transmits a call processing start request to the operator terminal 30 in order to talk with an interpreter (FIG. 8; Yes in step SJ7).
 Specifically, when at least one of the store clerk and the customer selects the call start button 73 displayed on the display unit 107 of the information terminal 10 in step SJ5 of FIG. 8, the screen of the display unit 107 is grayed out, as shown in FIG. 11(B), and an image 75 for confirming whether or not to call an interpreter is displayed on that screen. Then, when at least one of the store clerk and the customer selects "Yes" displayed in the image 75, the first display processing control unit 113 controls a process of displaying text T29 on the screen of the display unit 107, as shown in FIG. 11(C). For example, the call processing control unit 115 may be configured to transmit the call processing start request for talking with an interpreter when at least one of the store clerk and the customer selects "Yes" displayed in the image 75.
 The call processing control unit 115 may, for example, generate the call processing start request when the call start button 73 is selected, or may generate it in advance, before the call start button 73 is selected. The call processing start request includes, for example, identification information of the information terminal 10, and is generated so as to also include the translation history from the server device 20. The identification information of the information terminal 10 includes, for example, attributes of the user of the information terminal 10, that is, the user's name, address, date of birth, age, affiliation, family structure, and the like, as well as the telephone number and identification number (ID) of the information terminal 10. A call between the store clerk or customer using the information terminal 10 and the interpreter using the operator terminal 30 is carried out via the network N, which may include a general telephone network, an IP telephone network, or the like. There is no particular restriction on the calling means, as long as the two parties can talk.
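 As one minimal sketch, in Python, of how such a request might be assembled, assuming a simple record layout that the disclosure does not itself specify:

    # Illustrative sketch; the field names are assumptions, not part of
    # the disclosure.
    from dataclasses import dataclass, field

    @dataclass
    class CallStartRequest:
        terminal_id: str       # identification number (ID) of terminal 10
        phone_number: str      # telephone number of terminal 10
        user_attributes: dict  # e.g. name, address, date of birth, affiliation
        translation_history: list = field(default_factory=list)  # from server 20

    def build_call_start_request(terminal, history):
        # May be generated when button 73 is selected, or prepared in advance.
        return CallStartRequest(
            terminal_id=terminal["id"],
            phone_number=terminal["phone"],
            user_attributes=terminal["attributes"],
            translation_history=history,
        )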
 Here, when more than one interpreter is available for a call, the store clerk wants to talk with the most suitable one. For example, the storage unit 117 of the information terminal 10 stores, in association with each other, identification information of each interpreter or of the terminal each interpreter uses, and language information indicating one or more languages each interpreter can use. In step SJ3 shown in FIG. 8, the user presents the text T21 displayed on the language selection screen shown in FIG. 9(A) to the customer and has the customer tap the English button, whereby the customer's language is selected. The operator terminal specifying unit 116 then refers to the terminal identification information and the language information stored in the storage unit 117 and specifies the operator terminal 30 used by an interpreter who can use the language indicated by the selected English button, that is, English. The call processing control unit 115 then transmits the call processing start request to the operator terminal used by that interpreter, whereby the call between the two parties is started. In this way, an interpreter who can handle the language used in the communication between the store clerk and the customer can be appropriately specified.
 Further, when there are several interpreters who can use English, the store clerk presumably wants to talk with the one who interprets best. For example, the storage unit 117 of the information terminal 10 may store, in addition to the identification information of the terminal each interpreter uses and the language information indicating one or more languages each interpreter can use, information indicating each interpreter's interpreting level or ability, in association with the identification information of each interpreter or of the terminal each interpreter uses. Then, when the English button is selected in step SJ3 shown in FIG. 8, the operator terminal specifying unit 116 may be configured to specify, from among the interpreters who can use English, the operator terminal used by the interpreter with the higher interpreting level or ability.
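 A minimal sketch of how the operator terminal specifying unit 116 might combine the language filter with the interpreting-level ranking follows; the in-memory registry and its record layout are illustrative assumptions, as the real data would come from the storage unit 117.

    # Illustrative registry; real identification, language, and level data
    # would reside in the storage unit 117.
    interpreters = [
        {"terminal_id": "op-01", "languages": {"en", "zh"}, "level": 3},
        {"terminal_id": "op-02", "languages": {"en"}, "level": 5},
        {"terminal_id": "op-03", "languages": {"ko"}, "level": 4},
    ]

    def specify_operator_terminal(language, registry=interpreters):
        # Keep only interpreters who can use the selected language,
        # then prefer the highest interpreting level among them.
        candidates = [r for r in registry if language in r["languages"]]
        if not candidates:
            return None  # no interpreter available for this language
        return max(candidates, key=lambda r: r["level"])["terminal_id"]

    assert specify_operator_terminal("en") == "op-02"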
 Note that the operator terminal specifying unit 116 may specify the interpreter at the time the call start button 73 (first image) displayed on the display unit 107 of the information terminal 10 is selected in step SJ5 of FIG. 8. Alternatively, the operator terminal specifying unit 116 may be configured to specify in advance, for each language used in the communication between the store clerk and the customer, the operator terminal used by the interpreter to be called.
 On the other hand, when at least one of the store clerk and the customer selects "No" displayed in the image 75, the display returns to the screen shown in FIG. 11(A).
 Next, the transmission/reception unit 303 of the operator terminal 30 receives the call processing start request from the information terminal 10 (FIG. 8; step SO1). The transmission/reception unit 303 transmits a response signal to the information terminal 10 (FIG. 8; step SO2). For example, when permitting the call between the information terminal 10 and the operator terminal 30, the call processing unit 311 generates a response signal indicating that the call is permitted. The call processing unit 311 determines whether or not to permit a call with the information terminal 10 by, for example, comparing the identification information of the information terminal 10 included in the received call processing start request with identification information of call-permitted information terminals stored in advance in the storage unit 315 or in another storage resource with which the operator terminal 30 can communicate. When not permitting the call between the information terminal 10 and the operator terminal 30, the call processing unit 311 instead generates a response signal indicating that the call is not permitted.
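 A minimal sketch of this permit/deny decision is given below, assuming the allowlist comparison just described; the response format is an assumption of the sketch.

    # Illustrative sketch; in practice the allowlist would reside in the
    # storage unit 315 or another storage resource reachable by terminal 30.
    ALLOWED_TERMINALS = {"term-100", "term-101"}

    def handle_call_start_request(request):
        # `request` is assumed to carry the identification information of
        # the information terminal 10 (cf. the CallStartRequest sketch above).
        permitted = request.terminal_id in ALLOWED_TERMINALS
        return {"type": "response_signal", "call_permitted": permitted}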
 The second display processing control unit 313 causes the display unit 307 to display the translation history transmitted from the server device 20 via the information terminal 10, associated with each user (FIG. 8; step SO3). For example, as shown in FIG. 12, the second display processing control unit 313 controls, on the screen of the display unit 307, a process of displaying an image 81 indicating that a call is in progress; a process of displaying an image 83 including a field showing the name of the user on the call, a field showing the telephone number of the information terminal used by that user, a field showing the identification number of that information terminal, and a field showing other attribute information such as the user's address; and a process of displaying, as a translation history image 85, the translation history of the speech input by the store clerk (user 1) and the customer (user X), associated with each user.
 In this way, since the operator terminal 30 displays the speech translation history on the display unit 307 in association with each user, the interpreter can review the speech translation history and can therefore respond with an understanding of the communication that has taken place so far between the store clerk and the customer.
 Also, as shown in FIG. 12, since the operator terminal 30 displays the speech translation history on the display unit 307 in chronological order, the interpreter can grasp the flow of the communication between the store clerk and the customer even more easily and can respond appropriately based on that flow.
 On the other hand, when the information terminal 10 receives the response signal from the operator terminal 30 (FIG. 8; step SJ8), the connection between the information terminal 10 and the operator terminal 30 is established, and a call between the store clerk or the customer and the interpreter is realized (FIG. 8; steps SJ9 and SO4). When the call between the store clerk or the customer and the interpreter is realized, the processor 11 displays text T30 on the screen of the display unit 107, as shown in FIG. 11(D).
(Second Embodiment)
 In the first embodiment, the information terminal displays the call start button (first image) whenever it outputs a translation result. In the second embodiment, by contrast, the information terminal compares a score relating to translation accuracy calculated by the server device with a predetermined threshold and displays the call start button (first image) only when the score is equal to or less than that threshold; this is the point on which the two embodiments differ. The second embodiment is described below with reference to FIG. 13, focusing on the differences from the flowchart of FIG. 8 used for the first embodiment; description of the points shared with the flowchart of FIG. 8 is omitted.
 FIG. 13 is a flowchart showing another example of (part of) the flow of processing in the speech translation system. As shown in FIG. 13, the multilingual translation unit 207 of the server device 20 executes multilingual translation processing that translates the "reading" (characters) of the recognized speech into another language (FIG. 13; step SS12). The storage unit 213 stores, as a translation history associated with each user, the translation result (translated content) corresponding to the content of the input speech and a score relating to the translation accuracy of that translation result (FIG. 13; step SS13).
 Here, the translation processing uses, for example, statistical machine translation: based on a translation model, containing for example a probabilistic bilingual dictionary and a probabilistic word-order conversion table extracted from parallel corpora that capture the correspondences between words and phrases of the two languages, and a language model, containing probabilistic Japanese word-chain data that expresses how natural the translated sentence is as language and how natural its word order is, the system outputs the translation candidate that maximizes the product of these probabilities. The score calculation unit 209 is accordingly configured to calculate, for each translation result, a score relating to translation accuracy, expressed for example as a percentage.
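 As a minimal sketch of this decoding criterion, the candidate maximizing the product of the translation-model and language-model probabilities can be selected in log space; mapping that product to a percentage score is one plausible reading of the score calculation, not a method fixed by the disclosure.

    import math

    def best_candidate(candidates, tm_prob, lm_prob):
        # tm_prob/lm_prob are assumed model interfaces returning nonzero
        # probabilities for a candidate translation.
        def log_score(c):
            return math.log(tm_prob(c)) + math.log(lm_prob(c))
        best = max(candidates, key=log_score)
        # One plausible percentage-style accuracy score for the chosen output.
        score_pct = 100.0 * tm_prob(best) * lm_prob(best)
        return best, score_pct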
 When the multilingual translation processing and speech synthesis processing are complete, the speech synthesis unit 211 generates a text signal for text display based on the English conversation corpus that is the translation result (translated content), and generates a speech signal for speech output based on the synthesized speech. The generated text signal, the generated speech signal, and the translation accuracy score are then transmitted to the information terminal 10 through the transmission/reception unit 201 and the network N.
 Next, the score comparison unit 111 of the information terminal 10 compares the score relating to translation accuracy calculated by the server device 20 with a predetermined threshold (FIG. 13; step SJ15). If the score is higher than the predetermined threshold (FIG. 13; No in step SJ15), the translation accuracy is good, and the first display processing control unit 113 displays the translation result on the display unit 107 and outputs the synthesized speech (FIG. 13; step SJ16). For example, when the predetermined threshold is 80% and the score relating to the translation accuracy of the translation processing in the server device 20 is 90%, the translation accuracy is good. When the accurate translation has enabled the customer to understand the questions of the user (store clerk), the flow returns to step SJ13 shown in FIG. 13, and speech processing such as input, recognition, translation, and speech synthesis is now performed on the customer's speech.
 On the other hand, if the score relating to translation accuracy is equal to or less than the predetermined threshold (FIG. 13; Yes in step SJ15), the translation accuracy is poor, and the first display processing control unit 113 displays both the translation result and the call start button on the display unit 107 (FIG. 13; step SJ17).
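 The branch of steps SJ15 to SJ17 reduces to a single comparison; the sketch below assumes the 80% threshold used in the example above.

    THRESHOLD = 80.0  # percent; the "predetermined threshold" of step SJ15

    def on_translation_result(score_pct, show_result, show_call_button):
        show_result()               # text display and speech output (step SJ16)
        if score_pct <= THRESHOLD:  # Yes in step SJ15: accuracy deemed poor
            show_call_button()      # also display call start button 73 (step SJ17)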
 According to the present disclosure, the first display processing control unit of the information terminal controls a process of selectively displaying the call start button when the translation result is displayed on the display unit, and the call processing control unit of the information terminal transmits, when the call start button is selected, a call processing start request for starting a call between the user and an interpreter. This reduces the burden on the user and improves convenience, while also preventing mistranslation and realizing smooth communication.
 Also, according to the present disclosure, the information terminal compares the score relating to the translation accuracy of the speech translation with a predetermined threshold and displays the call start button when the translation accuracy is low. Since the call start button is thus displayed only when the need for a call with an interpreter is greater, a call with an interpreter can be started more smoothly.
(Other Embodiments)
 The embodiments above are intended to facilitate understanding of the present disclosure and should not be construed as limiting it. The present disclosure may be changed and improved without departing from its spirit, and the present disclosure includes its equivalents. It may also be practiced with various modifications, such as combinations of the embodiments, without departing from that spirit.
 In each of the embodiments above, an example was described in which the speech recognition, translation, and speech synthesis processes are executed by the server device 20, but these processes may instead be configured to run on the information terminal 10. In that case, the modules L20 used for these processes may be stored in the storage resource 12 of the information terminal 10 or in the storage resource 23 of the server device 20. Furthermore, the database D20 of the speech database and/or the models M20 such as the acoustic model may also be stored in the storage resource 12 of the information terminal 10 or in the storage resource 23 of the server device 20. In this configuration, the speech translation system need not include the network N and the server device 20. Also, although the embodiments above described an example in which the process of judging translation accuracy is executed by the information terminal 10, this process may instead be configured to run on the server device 20.
 Note that the step of displaying the translation history associated with each user in step SO3 shown in FIG. 8 may be executed simultaneously with step SO1, or after step SO1 and either simultaneously with or before step SO2. Likewise, the step of displaying the translation history associated with each user in step SO13 shown in FIG. 10 may be executed simultaneously with step SO11, or after step SO11 and either simultaneously with or before step SO12.
 In the embodiments above, the operator terminal 30 was described as obtaining the translation history by receiving a call processing start request that includes the translation history, but this is not limiting. For example, the operator terminal 30 may be configured to receive the translation history directly from the server device 20, either before or after receiving the call processing start request.
 A gateway server or the like that converts communication protocols may of course be interposed between the information terminal 10 and the network N, or between the operator terminal 30 and the network N. The information terminal 10 is not limited to a portable device and may be, for example, a desktop personal computer, a notebook personal computer, a tablet personal computer, a laptop personal computer, or the like. Further, the operator terminal 30 is not limited to a stationary device and may be configured as a portable tablet terminal device or the like having a function of communicating with the network N.

Claims (6)

  1.  A speech translation system comprising: an information terminal that receives input of a user's speech; a server device that translates the content of the speech input to the information terminal; and an interpreter terminal that performs call processing with the information terminal, wherein
     the server device comprises:
     a speech recognition unit that recognizes the content of the speech input to the information terminal; and
     a translation unit that translates the content recognized by the speech recognition unit into content in a different language, and
     the information terminal comprises:
     a speech output unit that audibly outputs the content translated by the translation unit of the server device;
     a first display processing control unit that controls a process of displaying text of the translated content and controls, in addition to the text, a process of selectively displaying a first image; and
     a call processing control unit that controls the call processing with the interpreter terminal and, when the first image is selected, transmits to the interpreter terminal a call processing start request for starting the call processing.
  2.  The speech translation system according to claim 1, wherein
     the server device further comprises a score calculation unit that calculates a score relating to translation accuracy, and
     the first display processing control unit controls a process of displaying the first image when the score is equal to or less than a predetermined threshold.
  3.  The speech translation system according to claim 1 or claim 2, wherein
     the server device further comprises a storage unit that stores, as a translation history associated with each user, the translated content in correspondence with the content of the input speech, and
     the interpreter terminal further comprises a second display processing control unit that controls a process of displaying the translation history in association with each user.
  4.  The speech translation system according to any one of claims 1 to 3, wherein
     the first display processing control unit further controls a process of displaying two or more second images respectively indicating two or more languages, and
     the call processing control unit controls, when the first image is selected after one of the second images has been selected, the call processing with the interpreter terminal associated with an interpreter able to use the language indicated by the selected one of the second images.
  5.  A speech translation method comprising:
     audibly outputting content of a user's speech that has been translated into content in a different language;
     controlling a process of displaying text of the translated content and, in addition to the text, a process of selectively displaying a first image; and
     controlling call processing with an interpreter terminal, including transmitting to the interpreter terminal, when the first image is selected, a call processing start request for starting the call processing.
  6.  A speech translation program causing a computer to function as:
     a speech output unit that audibly outputs content of a user's speech that has been translated into content in a different language;
     a first display processing control unit that controls a process of displaying text of the translated content and controls, in addition to the text, a process of selectively displaying a first image; and
     a call processing control unit that controls call processing with an interpreter terminal and, when the first image is selected, transmits to the interpreter terminal a call processing start request for starting the call processing.
PCT/JP2017/003300 2016-02-01 2017-01-31 Speech translation system, speech translation method, and speech translation program WO2017135214A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2016-017071 2016-02-01
JP2016017071A JP6449181B2 (en) 2016-02-01 2016-02-01 Speech translation system, speech translation method, and speech translation program

Publications (1)

Publication Number Publication Date
WO2017135214A1

Family

ID=59499823

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2017/003300 WO2017135214A1 (en) 2016-02-01 2017-01-31 Speech translation system, speech translation method, and speech translation program

Country Status (2)

Country Link
JP (1) JP6449181B2 (en)
WO (1) WO2017135214A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107507615A (en) * 2017-08-29 2017-12-22 百度在线网络技术(北京)有限公司 Interface intelligent interaction control method, device, system and storage medium
CN111478971A (en) * 2020-04-14 2020-07-31 青岛联合视界数字传媒有限公司 Multilingual translation telephone system and translation method
CN112818707B (en) * 2021-01-19 2024-02-27 传神语联网网络科技股份有限公司 Reverse text consensus-based multi-turn engine collaborative speech translation system and method

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS62286172A (en) * 1986-06-04 1987-12-12 Ricoh Co Ltd Document processor
JPS63106866A (en) * 1986-10-24 1988-05-11 Toshiba Corp Machine translation device
JPH01230177A (en) * 1988-03-10 1989-09-13 Oki Electric Ind Co Ltd Translation processing system
JPH07105220A (en) * 1993-09-30 1995-04-21 Hitachi Ltd Conference translating device
JP5821096B2 (en) * 2011-06-30 2015-11-24 三井金属アクト株式会社 Door lock device for automobile

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2002223299A (en) * 2001-01-26 2002-08-09 Hitachi Ltd Interpretation service system
JP2004157882A (en) * 2002-11-07 2004-06-03 Patolis Corp Online document retrieval/translation method
JP2017010311A (en) * 2015-06-23 2017-01-12 株式会社Nttドコモ Translation support system, information processing device, and program

Also Published As

Publication number Publication date
JP2017138650A (en) 2017-08-10
JP6449181B2 (en) 2019-01-09

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 17747370; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 EP: PCT application non-entry in European phase (Ref document number: 17747370; Country of ref document: EP; Kind code of ref document: A1)