WO2017135214A1 - Speech translation system, speech translation method, and speech translation program


Info

Publication number: WO2017135214A1
Authority: WO (WIPO, PCT)
Prior art keywords: speech, translation, unit, call, content
Application number: PCT/JP2017/003300
Other languages: French (fr), Japanese (ja)
Inventor: 知高 大越
Original Assignee: 株式会社リクルートライフスタイル (Recruit Lifestyle Co., Ltd.)
Application filed by 株式会社リクルートライフスタイル
Publication of WO2017135214A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/40: Processing or translation of natural language
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00: Speech synthesis; Text to speech systems
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition

Definitions

  • the present invention relates to a speech translation system, a speech translation method, and a speech translation program.
  • In a conventional speech translation apparatus, when calling an interpreter while speech translation processing is running, identification information for identifying the interpreter (for example, a telephone number) must be looked up in a telephone directory, communication history, or the like stored in the apparatus.
  • After the telephone number is specified, a call operation must additionally be performed, which may increase the burden on the user (the speaker) and decrease convenience.
  • An object is therefore to provide a speech translation system, a speech translation method, and a speech translation program that can reduce the burden on the user, improve convenience, prevent the occurrence of mistranslation, and realize smooth communication.
  • According to one aspect, a speech translation system comprises an information terminal that inputs a user's speech, a server device that translates the content of the speech input to the information terminal, and an interpreter terminal that performs call processing with the information terminal.
  • The server device comprises a speech recognition unit that recognizes the content of the speech input to the information terminal, and a translation unit that translates the content recognized by the speech recognition unit into content in a different language.
  • The information terminal comprises a speech output unit that outputs, by voice, the content translated by the translation unit of the server device; a first display processing control unit that controls a process of displaying the text of the translated content and, in addition to the text, a process of selectively displaying a first image; and a call processing control unit that controls call processing with the interpreter terminal and, when the first image is selected, transmits a call processing start request for starting the call processing to the interpreter terminal.
  • The server device may further include a score calculation unit that calculates a score related to translation accuracy, and the first display processing control unit may control a process of displaying the first image when the score is equal to or less than a predetermined threshold.
  • The server device may further include a storage unit that stores, for each user, the translated content associated with the input speech content as a translation history, and the interpreter terminal may further include a second display processing control unit that controls a process of displaying the translation history in association with each user.
  • The first display processing control unit may control a process of further displaying two or more second images respectively indicating two or more languages, and when one of the second images is selected, the call processing control unit may control call processing with the interpreter terminal associated with an interpreter who can use the language indicated by the selected second image.
  • According to another aspect, a speech translation method comprises: outputting, by voice, the content of a user's speech translated into content in a different language; controlling a process of displaying the text of the translated content; controlling a process of selectively displaying a first image in addition to the text; controlling call processing with an interpreter terminal; and, when the first image is selected, transmitting a call processing start request for starting the call processing to the interpreter terminal.
  • According to yet another aspect, a speech translation program causes a computer to function as: a speech output unit that outputs, by voice, the content of a user's speech translated into content in a different language; a first display processing control unit that controls a process of displaying the translated text and a process of selectively displaying a first image in addition to the text; and a call processing control unit that controls call processing with an interpreter terminal and, when the first image is selected, transmits a call processing start request for starting the call processing to the interpreter terminal.
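To make the division of roles in the claims above concrete, here is a minimal, illustrative Python sketch of the terminal-side units (speech output is omitted). This is not the patent's implementation; all identifiers (`CallStartRequest`, `InformationTerminal`, `send`, and so on) are assumptions made for this sketch only.

```python
from dataclasses import dataclass, field

@dataclass
class CallStartRequest:
    """Call processing start request sent to the interpreter terminal."""
    terminal_id: str                      # identifies the information terminal
    translation_history: list = field(default_factory=list)

class InformationTerminal:
    """Terminal-side units from the claims: display control and call control."""

    def __init__(self, terminal_id: str, send):
        self.terminal_id = terminal_id
        self.send = send                  # transport to the interpreter terminal
        self.history = []

    def show_translation(self, translated_text: str) -> None:
        # First display processing control: show the translated text and,
        # in addition to the text, selectively show the first image
        # (the call start button).
        self.history.append(translated_text)
        print(f"[screen] {translated_text}  [call start button]")

    def on_first_image_selected(self) -> None:
        # Call processing control: when the first image is selected,
        # transmit a call processing start request to the interpreter terminal.
        self.send(CallStartRequest(self.terminal_id, list(self.history)))

terminal = InformationTerminal("info-terminal-10", send=print)
terminal.show_translation("May I take your order?")
terminal.on_first_image_selected()
```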
  • In the present disclosure, “part”, “apparatus”, and “system” do not simply mean physical means; they also include cases where the functions of the “part”, “apparatus”, or “system” are realized by software. Further, the functions of one “part”, “apparatus”, or “system” may be realized by two or more physical means or apparatuses, and the functions of two or more “parts”, “apparatuses”, or “systems” may be realized by one physical means or apparatus.
  • According to the present disclosure, the burden on the user can be reduced, convenience can be improved, the occurrence of mistranslation can be prevented, and smooth communication can be realized.
  • FIG. 1 is a system block diagram schematically illustrating a preferred embodiment of the network configuration of a speech translation system according to the present disclosure.
  • FIG. 2 is a system block diagram schematically showing an example of the configuration of a user device (information terminal) in the speech translation system according to the present disclosure.
  • FIG. 3 is a functional block diagram schematically showing an example of the functional configuration of the user device (information terminal).
  • FIG. 4 is a system block diagram schematically showing an example of the configuration of a server device in the speech translation system.
  • FIG. 5 is a functional block diagram schematically showing an example of the functional configuration of the server device.
  • FIG. 6 is a system block diagram schematically showing an example of the configuration of an operator terminal (interpreter terminal) in the speech translation system.
  • FIG. 7 is a functional block diagram schematically showing an example of the functional configuration of the operator terminal.
  • FIG. 8 is a flowchart showing an example of a process flow (part) in the speech translation system.
  • FIGS. 9(A) to 9(C), 10(A) to 10(C), and 11(A) to 11(D) are plan views showing examples of display screen transitions in the information terminal.
  • FIG. 12 is a diagram showing an example of a display screen in the interpreter terminal.
  • FIG. 13 is a flowchart showing another example of the process flow (part) in the speech translation system.
  • FIG. 1 is a system block diagram schematically illustrating a preferred embodiment of the network configuration of the speech translation system according to the present disclosure.
  • As shown in FIG. 1, the speech translation system 100 illustratively includes an information terminal 10, used by the user (the speaker or a conversation partner), for inputting the user's voice; a server device 20, electronically connected to the information terminal 10 via the network N, that translates the content of the voice input to the information terminal 10; and an operator terminal 30 (interpreter terminal), used by an interpreter and electronically connected to the information terminal 10 and the server device 20 via the network N, that performs call processing with the information terminal 10.
  • FIG. 2 is a system block diagram schematically illustrating an example of the configuration of the user device (information terminal) in the speech translation system according to the present disclosure.
  • As shown in FIG. 2, the information terminal 10 illustratively includes a processor 11, a storage resource 12, a voice input/output device 13 (for example, a microphone and a speaker, separate or integrated), a communication interface 14, an input device 15, a display device 16, and a camera 17.
  • The information terminal 10 operates with installed speech translation application software (at least a part of a speech translation program according to an embodiment of the present disclosure), and thereby functions as a part or the whole of the speech translation system according to the embodiment of the present disclosure.
  • The information terminal 10 here is, for example, a portable tablet terminal device, including a mobile phone represented by a smartphone, having a function of communicating with the network N.
  • the processor 11 includes an arithmetic logic unit and various registers (program counter, data register, instruction register, general-purpose register, etc.). Further, the processor 11 interprets and executes speech translation application software, which is the program P10 stored in the storage resource 12, and performs various processes.
  • the speech translation application software as the program P10 can be distributed from the server device 20 through the network N, for example, and may be installed and updated manually or automatically.
  • The network N includes, for example, wired networks (a short-range communication network (LAN), a wide-area communication network (WAN), a value-added communication network (VAN), etc.) and wireless networks (a mobile communication network, a satellite communication network, Bluetooth (registered trademark), WiFi (Wireless Fidelity), HSDPA (High Speed Downlink Packet Access), etc.).
  • The storage resource 12 is a logical device provided by the storage area of a physical device (for example, a computer-readable storage medium such as a semiconductor memory), and stores an operating system program, driver programs, various information, and the like used for processing of the information terminal 10.
  • Examples of the driver programs include an input/output device driver program for controlling the voice input/output device 13, an input device driver program for controlling the input device 15, an output device driver program for controlling the display device 16, and the like.
  • the voice input / output device 13 is, for example, a general microphone and a sound player capable of reproducing sound data.
  • the communication interface 14 provides a connection interface with the server device 20 and the operator terminal 30, for example, and is configured by a wireless communication interface and / or a wired communication interface.
  • The input device 15 provides an interface for accepting input operations by tap operations on icons, buttons, a virtual keyboard, or the like displayed on the display device 16; in addition to a touch panel, various input devices externally attached to the information terminal 10 can be exemplified.
  • The display device 16 provides various information, as an image display interface, to the user and, as necessary, to the conversation partner. Examples include an organic EL display, a liquid crystal display, and a CRT display, preferably including those of various types using a touch panel.
  • the camera 17 is for capturing still images and moving images of various subjects.
  • FIG. 3 is a functional block diagram schematically illustrating an example of a functional configuration of a user device (information terminal) in the speech translation system according to the present disclosure.
  • the information terminal 10 functionally includes a voice input / output unit 101, a transmission / reception unit 103, an input operation reception unit 105, a display unit 107, an information processing unit 109, and a storage unit 117.
  • the information processing unit 109 functionally includes a score comparison unit 111, a first display processing control unit 113, a call processing control unit 115, and an operator terminal specifying unit 116.
  • The voice input/output unit 101 inputs, for example, the user's voice, and outputs by voice, for example, the content translated by the server device 20 shown in FIG. 1. Here, the voice input/output device 13 shown in FIG. 2 functions as the voice input/output unit 101.
  • The transmission/reception unit 103 transmits and receives various information to and from the server device 20 and the operator terminal 30 shown in FIG. 1.
  • the transmission / reception unit 103 transmits the content of the input voice to the server device 20.
  • the transmission / reception unit 103 receives, for example, text information, audio information, and the like of content translated by the server device 20.
  • The transmission/reception unit 103 also receives, for example, the score related to translation accuracy from the server device 20.
  • the communication interface 14 illustrated in FIG. 2 functions as the transmission / reception unit 103.
  • the input operation accepting unit 105 is a block that accepts a user's input operation, for example.
  • the input device 15 illustrated in FIG. 2 functions as the input operation reception unit 105.
  • the display unit 107 displays various information.
  • The display unit 107 displays, for example, translated text. Further, the display unit 107 displays, for example, the language buttons 61 (second images) shown in FIG. 9(A) and the call start button 73 (first image) described later.
  • the display device 16 illustrated in FIG. 2 functions as the display unit 107.
  • The information processing unit 109 represents functions of the processor 11 shown in FIG. 2. The score comparison unit 111 compares, for example, a score related to the translation accuracy of the translation processing performed by the server device 20 with a predetermined threshold.
  • the first display processing control unit 113 is a block that controls processing for displaying various types of information on the display unit 107.
  • The first display processing control unit 113 controls, for example, a process of displaying the text of the content translated by the server device 20 and, in addition to that text, a process of selectively displaying the call start button 73 (first image).
  • The call processing control unit 115 is, for example, a block that controls call processing between the information terminal 10 and the operator terminal 30; when the call start button 73 displayed on the display unit 107 is selected, it transmits a call processing start request for starting the call processing to the operator terminal 30.
  • The operator terminal specifying unit 116 specifies, for example, the operator terminal 30 used by an interpreter who can use the language indicated by the button selected among the language buttons 61 shown in FIG. 9(A) (for example, the English button).
  • the storage unit 117 is a block that stores various programs and information used for processing of the information terminal 10.
  • the storage unit 117 stores, for example, text information, audio information, and the like of the content received by the transmission / reception unit 103 and translated by the server device 20.
  • the storage unit 117 stores a score related to the translation accuracy of the server device 20 received by the transmission / reception unit 103.
  • the storage resource 12 illustrated in FIG. 2 functions as the storage unit 117.
  • the camera 17 shown in FIG. 2 functions as, for example, an imaging unit (not shown in FIG. 3).
  • FIG. 4 is a system block diagram schematically illustrating an example of the configuration of the server device in the speech translation system according to the present disclosure.
  • the server device 20 illustratively includes a processor 21, a communication interface 22, and a storage resource 23.
  • The server device 20 is configured by, for example, a host computer with high arithmetic processing capability, and realizes its server function by a predetermined server program operating on that host computer. It may be configured by a single host computer or by a plurality of host computers (indicated as a single computer in the figure, but not limited thereto).
  • The processor 21 is composed of an arithmetic logic unit for processing arithmetic operations, logical operations, bit operations, and the like, and various registers (program counter, data register, instruction register, general-purpose register, etc.); it interprets and executes the program P20 stored in the storage resource 23 and outputs predetermined arithmetic processing results.
  • the communication interface 22 is a hardware module for connecting to the information terminal 10 via the network N.
  • the communication interface 22 is a modulation / demodulation device such as an ISDN modem, an ADSL modem, a cable modem, an optical modem, or a soft modem.
  • The storage resource 23 is a logical device provided by, for example, the storage area of a physical device (a computer-readable storage medium such as a disk drive or a semiconductor memory), and stores one or more programs P20, various modules L20, various databases D20, and various models M20.
  • the program P20 is the above-described server program that is the main program of the server device 20.
  • The various modules L20 are software modules (modularized subprograms) that perform a series of information processing on requests and information transmitted from the information terminal 10, and are appropriately called and executed during the operation of the program P20.
  • Examples of the module L20 include a speech recognition module, a translation module, and a speech synthesis module.
  • The various databases D20 include various corpora required for speech translation processing (for example, in the case of Japanese-English speech translation, a Japanese speech corpus, an English speech corpus, a Japanese character (vocabulary) corpus, an English character (vocabulary) corpus, a Japanese dictionary, an English dictionary, a Japanese-English bilingual dictionary, a Japanese-English bilingual corpus, etc.), a speech database described later, a management database for managing information related to users, and the like.
  • examples of the various models M20 include an acoustic model and a language model used for speech recognition described later.
  • FIG. 5 is a functional block diagram schematically showing an example of the functional configuration of the server device in the speech translation system according to the present disclosure.
  • the server device 20 functionally includes a transmission / reception unit 201, an information processing unit 203, and a storage unit 213.
  • the information processing unit 203 includes, for example, a speech recognition unit 205, a multilingual translation unit 207, a score calculation unit 209, and a speech synthesis unit 211.
  • The transmission/reception unit 201 transmits and receives various information to and from the information terminal 10 and the operator terminal 30 shown in FIG. 1.
  • the transmission / reception unit 201 receives the content of the voice input to the information terminal 10 from the information terminal 10.
  • the transmission / reception unit 201 transmits, for example, text information, voice information, and the like of contents translated by the multilingual translation unit 207 described later to the information terminal 10.
  • the transmission / reception unit 201 transmits, for example, a score related to translation accuracy calculated by a score calculation unit 209 described later to the information terminal 10.
  • the communication interface 22 illustrated in FIG. 4 functions as the transmission / reception unit 201.
  • The information processing unit 203 represents functions of the processor 21 shown in FIG. 4. The voice recognition unit 205 recognizes, for example, the content of the voice input to the information terminal 10.
  • the multilingual translation unit 207 translates the content recognized by the speech recognition unit 205 into the content of a different language.
  • the score calculation unit 209 calculates a score related to the translation accuracy of the multilingual translation unit 207.
  • the speech synthesis unit 211 performs speech synthesis based on the translation result by the multilingual translation unit 207.
  • The storage unit 213 is, for example, a block that stores various programs and information used for processing of the server device 20, for example the modules L20, databases D20, and models M20 described above.
  • The storage unit 213 also stores, for each user, the translated content associated with the content of the input voice as a translation history.
  • the storage resource 23 illustrated in FIG. 4 functions as the storage unit 213.
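The server-side units just listed form a pipeline: recognition, translation with score calculation, and synthesis. The following is a hedged Python sketch of that pipeline; the three step functions are stubs standing in for the corpus- and model-driven processing described later, and every name here is an assumption of this sketch, not the patent's API.

```python
from dataclasses import dataclass

@dataclass
class TranslationResult:
    source_text: str        # recognized "reading" of the input speech
    translated_text: str    # content translated into the other language
    score: float            # translation-accuracy score (e.g. a percentage)
    audio: bytes            # synthesized speech for the translated text

def recognize(audio: bytes) -> str:
    """Stub for the speech recognition unit 205."""
    return "gochuumon wa okimari desu ka"

def translate(text: str) -> tuple[str, float]:
    """Stub for the multilingual translation unit 207 together with the
    score calculation unit 209."""
    return "May I take your order?", 92.0

def synthesize(text: str) -> bytes:
    """Stub for the speech synthesis unit 211."""
    return text.encode("utf-8")

def handle_utterance(audio: bytes) -> TranslationResult:
    # The order mirrors steps SS1, SS2/SS3, and SS4 in FIG. 8.
    source = recognize(audio)
    target, score = translate(source)
    return TranslationResult(source, target, score, synthesize(target))

print(handle_utterance(b"...pcm frames..."))
```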
  • FIG. 6 is a system block diagram schematically illustrating an example of a configuration of an operator terminal (interpreter device) in the speech translation system according to the present disclosure.
  • The operator terminal 30 includes a processor 31, a storage resource 32, a voice input/output device 33 (for example, a microphone and a speaker, separate or integrated), a communication interface 34, an input device 35, a display device 36, and a camera 37.
  • The operator terminal 30 has the same block configuration as the information terminal 10 shown in FIG. 2. In the following, configurations different from those of the information terminal 10 are described in particular.
  • The operator terminal 30 operates, for example, with installed CTI (Computer Telephony Integration) application software executed as at least a part of a speech translation program according to an embodiment of the present disclosure, and thereby functions as a part or the whole of the speech translation system.
  • The operator terminal 30 receives a call from the information terminal 10 shown in FIG. 1.
  • the interpreter performs interpretation via the operator terminal 30.
  • The operator terminal 30 displays, on the display device 36, information related to the other party of the call, for example at least one of the information terminal 10 and its user, and a translation history, which will be described in detail later.
  • the operator terminal 30 is, for example, a stationary terminal device including a desktop personal computer having a communication function with the network N.
  • the processor 31 interprets and executes CTI application software, which is the program P30 stored in the storage resource 32, and performs various processes.
  • The input device 35 provides an interface for accepting input operations by tap operations on icons, buttons, a virtual keyboard, or the like displayed on the display device 36; various input devices externally attached to the operator terminal 30, for example a keyboard and a mouse, can be exemplified.
  • the input device 35 may be a device such as a touch panel of various types including the function of the display device 36.
  • FIG. 7 is a functional block diagram schematically illustrating an example of a functional configuration of an operator terminal (interpreter device) in the speech translation system according to the present disclosure.
  • the operator terminal 30 functionally includes a voice input / output unit 301, a transmission / reception unit 303, an input operation reception unit 305, a display unit 307, an information processing unit 309, and a storage unit 315.
  • the information processing unit 309 functionally includes a call processing unit 311 and a second display processing control unit 313.
  • the voice input / output unit 301 inputs the voice of an operator including an interpreter, for example.
  • the voice input / output unit 301 may be configured to output the content indicating the translation history received by the transmission / reception unit 303 by voice as described later, for example.
  • the voice input / output device 33 illustrated in FIG. 6 functions as the voice input / output unit 301.
  • The transmission/reception unit 303 transmits and receives various information to and from the information terminal 10 and the server device 20 shown in FIG. 1.
  • the transmission / reception unit 303 receives, for example, a translation history transmitted from the server device 20 via the information terminal 10.
  • the transmission / reception unit 303 receives, for example, a call processing start request transmitted from the information terminal 10.
  • the transmission / reception unit 303 transmits a response signal to the call processing start request.
  • the communication interface 34 illustrated in FIG. 6 functions as the transmission / reception unit 303.
  • the input operation accepting unit 305 is a block that accepts an operator's input operation, for example.
  • the input device 35 illustrated in FIG. 6 functions as the input operation reception unit 305.
  • the display unit 307 displays various information.
  • the display unit 307 displays the translation history in association with each user.
  • the display device 36 illustrated in FIG. 6 functions as the display unit 307.
  • The information processing unit 309 represents functions of the processor 31 shown in FIG. 6. The call processing unit 311 performs, for example, call processing between the operator terminal 30 and the information terminal 10 based on a call processing start request transmitted from the information terminal 10, and generates a response signal to the request.
  • The response signal includes either a signal indicating that a call between the operator terminal 30 and the information terminal 10 is permitted or a signal indicating that it is not permitted.
  • the second display processing control unit 313 is a block that controls processing for displaying various types of information on the display unit 307.
  • the second display processing control unit 313 controls the display unit 307 to display the translation history in association with each user.
  • the storage unit 315 is a block that stores various programs and information used for processing of the operator terminal 30.
  • the storage unit 315 stores, for example, a translation history transmitted from the server device 20 via the information terminal 10 received by the transmission / reception unit 303.
  • the storage resource 32 illustrated in FIG. 6 functions as the storage unit 315.
  • The camera 37 shown in FIG. 6 functions as, for example, an imaging unit (not shown in FIG. 7).
  • FIG. 8 is a flowchart illustrating an example of a process flow (part) in the speech translation system according to the present disclosure.
  • FIGS. 9(A) to 9(C), 10(A) to 10(C), and 11(A) to 11(D) are plan views illustrating examples of display screen transitions in the information terminal according to the present disclosure.
  • FIG. 12 is a diagram illustrating an example of a display screen in the interpreter terminal according to the present disclosure.
  • Here, a conversation is assumed in which the user of the information terminal 10 is a restaurant clerk who speaks Japanese and the conversation partner is a customer who speaks English, that is, the input language is Japanese and the translation language is English; however, the conversation is not limited to this.
  • When the application is activated, a customer language selection screen is displayed on the display unit 107 (FIG. 8; step SJ2). As shown in FIG. 9(A), this language selection screen displays, for example, a Japanese text T21 for asking the customer about their language, an English text T22 for the same purpose, and language buttons 61 (second images) indicating a plurality of typical languages assumed (here English, two types of Chinese depending on the typeface, and Hangul).
  • The Japanese text T21 and the English text T22 are displayed by the first display processing control unit 113 and the display unit 107 in, for example, differently colored areas on the screen of the display unit 107 of the information terminal 10, and oriented in opposite directions (different directions; upside down relative to each other in the figure).
  • Thereby, the user can easily read the Japanese text T21, while the customer can easily read the English text T22.
  • Moreover, since the text T21 and the text T22 are displayed in separate areas, there is an advantage that they are clearly distinguished from each other.
  • the user presents the text T21 displayed on the language selection screen of FIG. 9A to the customer, and has the customer tap the English button, so that the customer's language is selected.
  • a standby screen for voice input in Japanese and English is displayed on the display device as the home screen (FIG. 8; step SJ3).
  • text T23 asking which of the user's or customer's language is to be spoken is displayed on this home screen.
  • The home screen also displays a history display button 63 for displaying a history of input contents, a language selection button 64 for returning to the language selection screen and switching the customer language (re-selecting the language), and a setting button 65 for performing various settings of the application software.
  • When the Japanese input button 62a is tapped on the home screen, a voice input screen (FIG. 9(C)) for accepting the user's Japanese utterance content is displayed, and voice input from the voice input/output unit 101 is enabled.
  • a text T24 for prompting the user to input voice and a microphone design 66 indicating that the voice input is in a standby state are displayed.
  • On the voice input screen of FIG. 9(C), the Japanese input button 62a is not displayed, indicating that Japanese voice input was selected on the previous screen of FIG. 9(B); the English input button 62b is displayed in a light color, partly hidden behind the microphone design 66 (the same applies to FIGS. 10(A) and 10(B) described later).
  • a cancel button 67 is displayed at the bottom of the voice input screen. By tapping this button, it is possible to return to the voice input standby screen (FIG. 9B) and perform voice input again. (Same as in FIGS. 10A and 10B described later).
  • When the user speaks, the volume of the input voice is schematically shown on the screen of the display unit 107, together with the text T24, by a dynamically drawn multiple-circle design 68, and the voice input level is thus visually fed back to the user who is the speaker (FIG. 8; step SJ4).
  • When the information processing unit 109 of the information terminal 10 detects that there has been no voice input for a certain period of time, it ends the acceptance of the user's utterance content, generates an audio signal based on the voice input, and transmits the audio signal to the server device 20 through the transmission/reception unit 103 and the network N.
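The level feedback of step SJ4 and the end-of-input detection just described can be illustrated with a small, assumption-laden sketch: an RMS level is computed per frame for display, and acceptance ends after a run of quiet frames. The threshold and frame-count constants are invented for illustration; the patent does not specify them.

```python
import math

SILENCE_RMS = 500      # assumed amplitude threshold for "no voice input"
SILENCE_FRAMES = 50    # assumed number of consecutive quiet frames

def rms(frame):
    """Root-mean-square level of one frame of PCM samples; a value like
    this could drive the size of the circular design 68."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))

def accept_utterance(frames):
    """Collect frames until a long enough run of quiet frames is seen."""
    voiced, quiet = [], 0
    for frame in frames:
        level = rms(frame)
        print(f"input level: {level:8.1f}")   # visual feedback to the speaker
        quiet = quiet + 1 if level < SILENCE_RMS else 0
        voiced.append(frame)
        if quiet >= SILENCE_FRAMES:
            break                             # end of utterance detected
    return voiced

speech, silence = [1000] * 160, [0] * 160
frames = accept_utterance([speech] * 3 + [silence] * 60)
print(f"accepted {len(frames)} frames")
```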
  • The voice recognition unit 205 of the information processing unit 203 of the server device 20 receives the voice signal through the transmission/reception unit 201 and performs voice recognition processing (FIG. 8; step SS1). At this time, the speech recognition unit 205 calls the necessary module L20, database D20, and model M20 (speech recognition module, Japanese speech corpus, acoustic model, language model, etc.) from the storage unit 213, and converts the voice signal into its "reading" (characters).
  • the information processing unit 203 generates a text signal for text output based on the recognized “reading” (characters) of the voice, and transmits the text signal to the information terminal 10 through the transmission / reception unit 201 and the network N.
  • More precisely, the information processing unit 203 selects, based on the content of the recognized speech itself and a Japanese conversation corpus stored in advance in the storage unit 213, the entry corresponding to the actual utterance content, and generates the text signal based on it.
  • The first display processing control unit 113 of the information terminal 10, having received the text signal through the transmission/reception unit 103, displays on the screen the Japanese text T25, which is the recognized content of the Japanese utterance input by the user.
  • Next, the multilingual translation unit 207 proceeds to multilingual translation processing for translating the recognized "reading" (characters) of the speech into another language (FIG. 8; step SS2).
  • At this time, the multilingual translation unit 207 calls the necessary module L20 and database D20 (translation module, Japanese character corpus, Japanese dictionary, English dictionary, Japanese-English bilingual dictionary, Japanese-English bilingual corpus, etc.) from the storage unit 213, appropriately sorts and converts the input speech "reading" (character string) that is the recognition result into Japanese phrases, clauses, sentences, and the like, extracts the English corresponding to the conversion result, and arranges it in accordance with English grammar.
  • While the translation is in progress, the display unit 107 displays a standby screen including a Japanese text T26 indicating that translation is in progress and a circular design 69 indicating the same.
  • The storage unit 213 stores, for each user, the translation result (translated content) associated with the content of the input speech as a translation history (FIG. 8; step SS3).
  • the storage unit 213 stores an English conversation corpus or the like corresponding to the translated English phrase, clause, sentence, or the like as a translation history in association with the content of the input speech.
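A minimal sketch of the per-user translation-history storage performed in step SS3 might look as follows; the class and field names are illustrative assumptions, and a real storage unit 213 would persist to the databases D20 rather than to an in-memory dict.

```python
from collections import defaultdict
from datetime import datetime, timezone

class TranslationHistoryStore:
    """In-memory stand-in for the storage unit 213: each translation result
    is kept together with the input speech content, grouped per user."""

    def __init__(self):
        self._by_user = defaultdict(list)

    def add(self, user_id, source_text, translated_text):
        self._by_user[user_id].append({
            "at": datetime.now(timezone.utc).isoformat(),
            "input": source_text,
            "translation": translated_text,
        })

    def history(self, user_id):
        """Translation history for one user, oldest first."""
        return list(self._by_user[user_id])

store = TranslationHistoryStore()
store.add("user-1", "ご注文はお決まりですか", "May I take your order?")
print(store.history("user-1"))
```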
  • The speech synthesis unit 211 calls the module L20, database D20, and model M20 necessary for speech synthesis (speech synthesis module, English speech corpus, acoustic model, language model, etc.) from the storage unit 213, and converts the English conversation corpus entry corresponding to the translated English phrase, clause, sentence, or the like, which is the translation result, into natural speech (FIG. 8; step SS4).
  • When the multilingual translation processing and the speech synthesis processing are completed, the information processing unit 203 generates a text signal for text display based on the English conversation corpus entry that is the translation result (translated content), generates an audio signal for audio output based on the synthesized speech, and transmits them to the information terminal 10 through the transmission/reception unit 201 and the network N.
  • The first display processing control unit 113 of the information terminal 10, having received the text signal and the audio signal through the transmission/reception unit 103, displays as a conversation screen the Japanese conversation corpus text T27 corresponding to the text T25 (here the same as the text T25, though not limited thereto) and the English conversation corpus text T28 that is its translation result, and controls a process of selectively displaying the call start button 73 (first image) on that screen (FIG. 8; step SJ5).
  • The storage unit 117 of the information terminal 10 may store the received text information, audio information, and the like of the translated content.
  • The voice input/output unit 101 then outputs (reads out) the content of the English text T28, which is the translation result (FIG. 8; step SJ6). Note that step SJ6 may be executed before or after step SJ5.
  • The Japanese texts T25 and T27 and the English text T28 are likewise divided on the screen of the display unit 107 of the information terminal 10, for example by differently colored areas and line segments, and displayed in opposite directions (different directions; upside down relative to each other in the figure).
  • Thereby, if the user and the customer are in a face-to-face conversation and both can see the screen of the display unit 107, the user can easily confirm the Japanese texts T25 and T27 (input content), while the customer can easily confirm the English text T28 (translated content).
  • Moreover, since the texts T25 and T27 and the text T28 are displayed in separate areas, there is an advantage that they are clearly distinguished from each other.
  • The audio output can be repeated by tapping the audio output button 70 displayed on the conversation screen of FIG. 10(C). This conversation screen also displays a check button 71 indicating that the translation at that time is finished; by tapping it, the translation processing is ended and the home screen (FIG. 9(B)) is restored.
  • When the conversation continues, voice processing such as input, recognition, translation, and voice synthesis of the customer's voice is performed next (FIG. 8; No in step SJ7).
  • In that case, the check button 71 displayed in FIG. 10(C) is tapped to display the home screen (FIG. 9(B)), and the English input button 62b is tapped to select English voice input by the customer.
  • The subsequent processing is basically the same as the processing described above, except that the speaker changes from the user to the customer, Japanese voice input is switched to English voice input, and English voice and text output is replaced with Japanese voice and text output; detailed description is therefore omitted here.
  • When the conversation ends, the series of speech translation processes is terminated.
  • When the call start button 73 (first image) displayed on the display unit 107 is selected, the call processing control unit 115 transmits a call processing start request for starting a call with the interpreter.
  • The call processing control unit 115 may generate the call processing start request when the call start button 73 is selected, or may generate it in advance, before the call start button 73 is selected.
  • The call processing start request is generated including, for example, identification information of the information terminal 10 and the translation history from the server device 20.
  • The identification information of the information terminal 10 includes, for example, attributes of the user of the information terminal 10, that is, the user's name, address, date of birth, age, affiliation, family structure, and the like, and the telephone number or identification number (ID) of the information terminal 10.
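Under the assumption that the request is serialized as JSON (the patent does not specify a wire format), a sketch of a call processing start request carrying the identification information and translation history listed above could look like this; every field name is hypothetical.

```python
import json

def build_call_start_request(terminal, user, translation_history):
    """Assemble the request contents listed above; the JSON layout and
    every field name are hypothetical."""
    return json.dumps({
        "terminal": {
            "phone_number": terminal["phone_number"],
            "terminal_id": terminal["terminal_id"],
        },
        "user": {   # user attributes carried as identification information
            "name": user["name"],
            "address": user["address"],
            "date_of_birth": user["date_of_birth"],
        },
        "translation_history": translation_history,
    })

request = build_call_start_request(
    {"phone_number": "+81-3-0000-0000", "terminal_id": "info-terminal-10"},
    {"name": "Taro", "address": "Tokyo", "date_of_birth": "1990-01-01"},
    [{"input": "ご注文は?", "translation": "May I take your order?"}],
)
print(request)
```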
  • A call between the store clerk or customer who uses the information terminal 10 and the interpreter who uses the operator terminal 30 is executed through the network N, including a general telephone line network, an IP telephone line network, and the like. There is no particular limitation on the calling means; it suffices that the two parties can talk to each other.
  • When there are a plurality of interpreters available for a call, the store clerk will want to talk to the most appropriate interpreter.
  • For this purpose, the storage unit 117 stores identification information of each interpreter or of the terminal used by each interpreter in association with language information indicating one or more languages that the interpreter can use.
  • In step SJ3 shown in FIG. 8, the user presents the text T21 displayed on the language selection screen shown in FIG. 9(A) to the customer and has the customer tap the English button, so that the customer's language is selected.
  • The operator terminal specifying unit 116 then refers to the terminal identification information and the language information stored in the storage unit 117, and specifies the operator terminal 30 used by an interpreter who can use the language indicated by the selected English button, that is, English. The call processing control unit 115 transmits a call processing start request to the operator terminal used by that interpreter, and the call between the two parties is started. In this way, an interpreter who can handle the language used in the communication between the store clerk and the customer can be appropriately identified.
  • The storage unit 117 of the information terminal 10 may store, in addition to the terminal identification information and the language information of each interpreter, information indicating each interpreter's interpretation level and interpretation ability, in association with the identification information of the interpreter or of the terminal the interpreter uses. Then, when the English button is selected in step SJ3 shown in FIG. 8, the operator terminal specifying unit 116 may specify the operator terminal used by the interpreter with the highest interpretation level and ability among the plurality of interpreters who can use English (see the sketch below).
  • The operator terminal specifying unit 116 may instead specify the interpreter when the call start button 73 (first image) displayed on the display unit 107 of the information terminal 10 is selected in step SJ5 of FIG. 8, or may be configured in advance to specify, for each language used in the communication between the store clerk and the customer, the operator terminal used by the interpreter who takes the call.
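A sketch of the operator-terminal selection just described, filtering stored interpreter records by usable language and preferring the highest interpretation level; the record layout and level scale are assumptions for illustration.

```python
interpreters = [
    # terminal identification, usable languages, interpretation level
    {"terminal": "operator-30a", "languages": {"en"}, "level": 2},
    {"terminal": "operator-30b", "languages": {"en", "zh"}, "level": 5},
    {"terminal": "operator-30c", "languages": {"ko"}, "level": 4},
]

def specify_operator_terminal(language):
    """Among interpreters who can use the selected language, prefer the
    one with the highest interpretation level; None if nobody matches."""
    candidates = [i for i in interpreters if language in i["languages"]]
    if not candidates:
        return None
    return max(candidates, key=lambda i: i["level"])["terminal"]

print(specify_operator_terminal("en"))   # -> operator-30b
print(specify_operator_terminal("fr"))   # -> None
```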
  • the transmission / reception unit 303 of the operator terminal 30 receives the call processing start request from the information terminal 10 (FIG. 8; step SO1).
  • the transmission / reception unit 303 transmits a response signal to the information terminal 10 (FIG. 8; step SO2).
  • When permitting a call between the information terminal 10 and the operator terminal 30, the call processing unit 311 generates a response signal indicating that the call is permitted.
  • Specifically, the call processing unit 311 determines whether or not to permit the call with the information terminal 10 by comparing the identification information of the information terminal 10 included in the received call processing start request with identification information of call-permitted information terminals stored in advance in the storage unit 315 or in another storage resource with which the operator terminal 30 can communicate. When not permitting the call, the call processing unit 311 generates a response signal indicating that the call is not permitted.
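The permission check described above reduces to comparing the requesting terminal's identification information against a pre-stored list and answering with a response signal either way; a minimal sketch, with invented terminal IDs:

```python
# Identification information of call-permitted information terminals,
# as stored in advance in the storage unit 315 (IDs invented here).
ALLOWED_TERMINALS = {"info-terminal-10", "info-terminal-11"}

def handle_call_start_request(request_terminal_id):
    """Compare the requesting terminal's identification information with
    the pre-stored list and generate the response signal either way."""
    permitted = request_terminal_id in ALLOWED_TERMINALS
    return {"type": "response", "call_permitted": permitted}

print(handle_call_start_request("info-terminal-10"))  # call permitted
print(handle_call_start_request("info-terminal-99"))  # call not permitted
```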
  • the second display processing control unit 313 causes the display unit 307 to display the translation history transmitted from the server device 20 via the information terminal 10 in association with each user (FIG. 8; step SO3).
  • Specifically, the second display processing control unit 313 controls a process of displaying, on the screen of the display unit 307, an image 81 indicating "calling", a process of displaying an image 83 including a column indicating the user's name, a column indicating the telephone number of the information terminal used by the user, a column indicating the identification number of the information terminal, and a column indicating attribute information such as the user's address, and a process of displaying the translation history for each user.
  • Thereby, the interpreter can check the speech translation history and can respond based on the flow of the communication between the store clerk and the customer so far.
  • Moreover, since the operator terminal 30 displays the speech translation history in time series on the display unit 307, the interpreter can grasp the flow of the communication between the store clerk and the customer more easily and respond appropriately based on it.
  • When the information terminal 10 receives a response signal from the operator terminal 30 (FIG. 8; step SJ8), the connection between the information terminal 10 and the operator terminal 30 is established, and a call between the store clerk or customer and the interpreter is realized (FIG. 8; steps SJ9 and SO4).
  • The processor 11 displays the text T30 on the screen of the display unit 107.
  • Next, a second embodiment will be described. In the first embodiment, the information terminal displays the call start button (first image) whenever it outputs a translation result. The second embodiment differs from the first in that the information terminal compares the score related to translation accuracy calculated by the server device with a predetermined threshold, and displays the call start button (first image) only when the score is equal to or lower than that threshold.
  • The second embodiment will be described with reference to FIG. 13; differences from the flowchart of FIG. 8 describing the first embodiment are described in particular, and description of points similar to the flowchart of FIG. 8 is omitted.
  • FIG. 13 is a flowchart showing another example of the process flow (part) in the speech translation system.
  • The multilingual translation unit 207 of the server device 20 executes multilingual translation processing for translating the recognized "reading" (characters) of the speech into another language (FIG. 13; step SS12).
  • The storage unit 213 stores, as a translation history for each user, the translation result (translated content) associated with the content of the input speech and the score related to translation accuracy corresponding to the translation result (FIG. 13; step SS13).
  • In the translation processing, for example, statistical machine translation is performed: correspondences between words and phrases of the two languages are extracted from bilingual data, for example in the form of a probabilistic bilingual dictionary and a probabilistic word-order conversion table.
  • The score calculation unit 209 is configured to calculate, for each translation result, a score related to its translation accuracy, expressed for example as a percentage.
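As a toy illustration of the statistical-translation idea above (not the patent's algorithm), one can look up each source phrase in a probabilistic bilingual dictionary and fold the phrase probabilities into a rough percentage score; the phrase table below is fabricated for the example.

```python
# Fabricated probabilistic bilingual dictionary:
# source phrase -> list of (target phrase, probability)
phrase_table = {
    "gochuumon": [("your order", 0.8), ("the order", 0.2)],
    "okimari desu ka": [("have you decided", 0.7), ("is it decided", 0.3)],
}

def translate_phrases(phrases):
    """Pick the most probable target phrase for each source phrase and
    fold the probabilities into a rough percentage score."""
    words, score = [], 1.0
    for phrase in phrases:
        target, prob = max(phrase_table.get(phrase, [(phrase, 0.1)]),
                           key=lambda pair: pair[1])
        words.append(target)
        score *= prob
    return " ".join(words), round(score * 100, 1)

print(translate_phrases(["gochuumon", "okimari desu ka"]))
# -> ('your order have you decided', 56.0)
```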
  • When the multilingual translation processing and the speech synthesis processing are completed, the information processing unit 203 generates a text signal for text display based on the English conversation corpus entry that is the translation result (translated content), and generates an audio signal for audio output based on the synthesized speech. The generated text signal, the generated audio signal, and the score related to translation accuracy are then transmitted to the information terminal 10 through the transmission/reception unit 201 and the network N.
  • The score comparison unit 111 of the information terminal 10 compares the score related to translation accuracy calculated by the server device 20 with a predetermined threshold (FIG. 13; step SJ15). If the score is higher than the predetermined threshold (FIG. 13; No in step SJ15), the translation accuracy is good, and the first display processing control unit 113 displays the translation result on the display unit 107 while the synthesized speech is output (FIG. 13; step SJ16). For example, if the predetermined threshold is 80% and the score related to the translation accuracy of the translation processing in the server device 20 is 90%, the translation accuracy is regarded as good. If the translation is performed accurately and the customer can understand the user's (clerk's) questions, the process returns to step SJ13 shown in FIG. 13, and this time the customer's voice is subjected to voice processing such as input, recognition, translation, and voice synthesis.
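The threshold comparison in steps SJ15 to SJ17 can be sketched in a few lines; the 80% threshold follows the example in the text, while the rendering of the screen as a list of elements is an assumption of this sketch.

```python
SCORE_THRESHOLD = 80.0   # the example threshold used in the text

def render_result(translated_text, score):
    """Show the call start button only when the translation-accuracy
    score is at or below the threshold (steps SJ15 to SJ17)."""
    screen = [translated_text]
    if score <= SCORE_THRESHOLD:
        screen.append("[call start button]")   # accuracy judged poor
    return screen

print(render_result("May I take your order?", 90.0))  # text only
print(render_result("May I take your order?", 60.0))  # text + button
```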
  • In contrast, if the score related to translation accuracy is equal to or lower than the predetermined threshold (FIG. 13; Yes in step SJ15), the translation accuracy is poor, and the first display processing control unit 113 displays the translation result and a call start button (FIG. 13; step SJ17).
  • the first display processing control unit of the information terminal controls the process of selectively displaying the call start button when the translation result is displayed on the display unit.
  • When the call start button is selected, a call processing start request for starting a call between the user and an interpreter is transmitted; thus the burden on the user can be reduced, convenience can be improved, the occurrence of mistranslation can be prevented, and smooth communication can be realized.
  • Furthermore, the information terminal compares the score related to the translation accuracy of the speech translation with a predetermined threshold and displays the call start button when the translation accuracy is low. Since the call start button is displayed only when the need for a call with an interpreter is high, the call with the interpreter can be started more smoothly.
  • In the above embodiments, the processes of speech recognition, translation, and speech synthesis are executed by the server device 20, but these processes may instead be executed by the information terminal 10.
  • the module L20 used for these processes may be stored in the storage resource 12 of the information terminal 10 or may be stored in the storage resource 23 of the server device 20.
  • Likewise, the databases D20 such as the voice database and/or the models M20 such as the acoustic model may be stored in the storage resource 12 of the information terminal 10 or in the storage resource 23 of the server device 20.
  • the speech translation system may not include the network N and the server device 20.
  • Conversely, processing performed by the information terminal 10 may be configured to be performed in the server device 20.
  • The step of displaying the translation history in association with each user in step SO3 shown in FIG. 8 may be executed simultaneously with step SO1, or after step SO1 and simultaneously with or before step SO2. Similarly, the step of displaying the translation history in association with each user in step SO13 shown in FIG. 13 may be executed simultaneously with step SO11, or after step SO11 and simultaneously with or before step SO12.
  • the operator terminal 30 can obtain the translation history by receiving a call processing start request including the translation history, but is not limited thereto.
  • the operator terminal 30 may be configured to receive the translation history directly from the server device 20 before or after receiving the call processing start request.
  • A gateway server for converting the communication protocol may be interposed between the information terminal 10 and the network N, or between the operator terminal 30 and the network N.
  • the information terminal 10 is not limited to a portable device, and may be a desktop personal computer, a notebook personal computer, a tablet personal computer, a laptop personal computer, or the like.
  • the operator terminal 30 is not limited to a stationary device, and may be configured with a portable tablet terminal device having a communication function with the network N.

Abstract

With the present invention, it is possible to reduce the burden on a user and improve convenience, and also to prevent the occurrence of mistranslation and achieve smooth communication. A speech translation system is provided with an information terminal for inputting the speech of a user, a server device for translating the content of the speech input to the information terminal, and an interpreter terminal for performing call processing with the information terminal. The server device is provided with a speech recognition unit for recognizing the content of the speech input to the information terminal, and a translation unit for translating the content recognized by the speech recognition unit into content in a different language. The information terminal is provided with a speech output unit for outputting, in speech, the content translated by the translation unit of the server device, a first display process control unit for controlling a process for displaying the text of the translated content and for selectively displaying a first image in addition to the text, and a call process control unit for transmitting to the interpreter terminal a call process start request to start the call process when the first image is selected.

Description

Speech translation system, speech translation method, and speech translation program

Cross-reference of related applications
 This application is based on Japanese Patent Application No. 2016-017071 filed on February 1, 2016, the contents of which are incorporated herein by reference.
 The present invention relates to a speech translation system, a speech translation method, and a speech translation program.
 In order to enable conversation between people who cannot understand each other's language, for example conversation between a store clerk (a salesperson at a store such as a restaurant) and a customer (a tourist from abroad, etc.), speech translation techniques have been proposed that convert a speaker's utterance into text, machine-translate the content of that text into the other party's language and display it on a screen, or reproduce the content of the text as speech using speech synthesis technology (see, for example, Patent Document 1). Speech translation applications that embody such speech translation technology and operate on information terminals such as smartphones have also been put into practical use (see, for example, Non-Patent Document 1). Meanwhile, interpretation systems that enable telephone calls between a plurality of users are known (see, for example, Patent Document 2).
特開平9-34895号公報JP-A-9-34895 特開2010-21692号公報JP 2010-21692 A
 In the conventional speech translation apparatus described above, when a store clerk in a restaurant asks about the contents of a customer's order or explains the ingredients of a dish, machine translation by a translation engine is executed as soon as speech is input. Consequently, when the input speech does not follow a basic sentence pattern of the language, or when the spoken word order differs from what is expected, mistranslation is likely to occur. When the machine translation is inaccurate and the two parties cannot communicate smoothly, the store clerk can, for example, call an interpreter from the speech translation device the clerk carries and have the interpreter translate, thereby enabling smooth communication between the two parties.
 However, when calling an interpreter while speech translation processing is being executed on a conventional speech translation apparatus, identification information for identifying the interpreter (the interpreter terminal used by the interpreter), such as a telephone number, must be looked up in the telephone directory, communication history, or the like stored in the apparatus. After the telephone number has been found, a call operation must additionally be performed, which may increase the burden on the user (the speaker) and reduce convenience.
 Accordingly, some aspects of the present invention have been made in view of such circumstances, and an object thereof is to provide a speech translation system, a speech translation method, and a speech translation program that can reduce the burden on the user and improve convenience, while preventing the occurrence of mistranslation and realizing smooth communication.
 To solve the above problems, a speech translation system according to one aspect of the present invention comprises an information terminal that receives a user's speech as input, a server device that translates the content of the speech input to the information terminal, and an interpreter terminal that performs call processing with the information terminal. The server device comprises a speech recognition unit that recognizes the content of the speech input to the information terminal, and a translation unit that translates the content recognized by the speech recognition unit into content in a different language. The information terminal comprises a speech output unit that outputs, as speech, the content translated by the translation unit of the server device; a first display processing control unit that controls processing for displaying the text of the translated content and, in addition to the text, for selectively displaying a first image; and a call processing control unit that controls call processing with the interpreter terminal and that, when the first image is selected, transmits a call processing start request for starting the call processing to the interpreter terminal.
 In the above speech translation system, the server device may further comprise a score calculation unit that calculates a score relating to translation accuracy, and the first display processing control unit may control processing for displaying the first image when the score is equal to or less than a predetermined threshold.
 In the above speech translation system, the server device may further comprise a storage unit that stores the translated content associated with the content of the input speech as a translation history in association with each user, and the interpreter terminal may further comprise a second display processing control unit that controls processing for displaying the translation history in association with each user.
 In the above speech translation system, the first display processing control unit may control processing for further displaying two or more second images respectively indicating two or more languages, and when the first image is selected after one of the second images has been selected, the call processing control unit may control call processing with an interpreter terminal associated with an interpreter who can use the language indicated by the selected one of the second images.
 To solve the above problems, a speech translation method according to one aspect of the present invention includes: outputting, as speech, the content of a user's speech translated into content in a different language; controlling processing for displaying the text of the translated content and, in addition to the text, for selectively displaying a first image; and controlling call processing with an interpreter terminal, including transmitting, when the first image is selected, a call processing start request for starting the call processing to the interpreter terminal.
 To solve the above problems, a speech translation program according to one aspect of the present invention causes a computer to function as: a speech output unit that outputs, as speech, the content of a user's speech translated into content in a different language; a first display processing control unit that controls processing for displaying the text of the translated content and, in addition to the text, for selectively displaying a first image; and a call processing control unit that controls call processing with an interpreter terminal and that, when the first image is selected, transmits a call processing start request for starting the call processing to the interpreter terminal.
 In the present disclosure, the terms "unit", "device", and "system" do not simply mean physical means; they also encompass cases where the functions of the "unit", "device", or "system" are realized by software. In addition, the functions of one "unit", "device", or "system" may be realized by two or more physical means or devices, and the functions of two or more "units", "devices", or "systems" may be realized by a single physical means or device.
 According to the present disclosure, the burden on the user can be reduced and convenience can be improved, while the occurrence of mistranslation can be prevented and smooth communication can be realized.
FIG. 1 is a system block diagram schematically showing a preferred embodiment of a network configuration according to the speech translation system of the present disclosure.
FIG. 2 is a system block diagram schematically showing an example of the configuration of a user device (information terminal) in the speech translation system of the present disclosure.
FIG. 3 is a functional block diagram schematically showing an example of the functional configuration of a user device (information terminal) in the speech translation system of the present disclosure.
FIG. 4 is a system block diagram schematically showing an example of the configuration of a server device in the speech translation system of the present disclosure.
FIG. 5 is a functional block diagram schematically showing an example of the functional configuration of a server device in the speech translation system of the present disclosure.
FIG. 6 is a system block diagram schematically showing an example of the configuration of an operator terminal (interpreter device) in the speech translation system of the present disclosure.
FIG. 7 is a functional block diagram schematically showing an example of the functional configuration of an operator terminal in the speech translation system of the present disclosure.
FIG. 8 is a flowchart showing an example of (part of) the processing flow in the speech translation system of the present disclosure.
FIGS. 9(A) to 9(C) are plan views showing an example of display screen transitions on the information terminal of the present disclosure.
FIGS. 10(A) to 10(C) are plan views showing an example of display screen transitions on the information terminal of the present disclosure.
FIGS. 11(A) to 11(D) are plan views showing an example of display screen transitions on the information terminal of the present disclosure.
FIG. 12 is a diagram showing an example of a display screen on the interpreter terminal of the present disclosure.
FIG. 13 is a flowchart showing another example of (part of) the processing flow in the speech translation system of the present disclosure.
 Hereinafter, embodiments of the present invention will be described in detail. The following embodiments are examples for explaining the present invention and are not intended to limit the present invention to those embodiments alone. The present invention can be modified in various ways without departing from the gist thereof. Furthermore, those skilled in the art can adopt embodiments in which each element described below is replaced with an equivalent, and such embodiments are also included within the scope of the present invention. Positional relationships such as up, down, left, and right indicated as needed are based on the illustrated representations unless otherwise specified. The various dimensional ratios in the drawings are not limited to the illustrated ratios.
(System configuration)
 FIG. 1 is a system block diagram schematically showing a preferred embodiment of a network configuration according to the speech translation system of the present disclosure. In this example, the speech translation system 100 illustratively comprises an information terminal 10 that is used by a user (a speaker or another speaker) and that receives the user's speech as input; a server device 20 that is electronically connected to the information terminal 10 via a network N and that translates the content of the speech input to the information terminal 10; and an operator terminal 30 (interpreter terminal) that is electronically connected to the information terminal 10 and the server device 20 via the network N, that is used by an interpreter, and that performs call processing with the information terminal 10.
 FIG. 2 is a system block diagram schematically showing an example of the configuration of the user device (information terminal) in the speech translation system according to the present disclosure. As shown in FIG. 2, the information terminal 10 illustratively comprises a processor 11, a storage resource 12, a voice input/output device 13 (including configurations in which the microphone and speaker are separate or integrated), a communication interface 14, an input device 15, a display device 16, and a camera 17. When installed speech translation application software (at least part of a speech translation program according to an embodiment of the present disclosure) runs on it, the information terminal 10 functions as part or all of the speech translation system according to an embodiment of the present disclosure. The information terminal 10 here is, for example, a portable tablet-type terminal device, including a mobile phone typified by a smartphone, having a function for communicating with the network N.
 The processor 11 comprises an arithmetic logic unit and various registers (a program counter, data registers, instruction registers, general-purpose registers, and the like). The processor 11 interprets and executes the speech translation application software, which is the program P10 stored in the storage resource 12, and performs various kinds of processing. The speech translation application software as the program P10 can be distributed from, for example, the server device 20 through the network N, and may be installed and updated manually or automatically.
 The network N is, for example, a communication network configured as a mixture of wired networks (such as a local area network (LAN), a wide area network (WAN), or a value-added network (VAN)) and wireless networks (such as a mobile communication network, a satellite communication network, Bluetooth (registered trademark), WiFi (Wireless Fidelity), and HSDPA (High Speed Downlink Packet Access)).
 The storage resource 12 is a logical device provided by the storage area of a physical device (for example, a computer-readable storage medium such as a semiconductor memory), and stores an operating system program, driver programs, various kinds of information, and the like used for the processing of the information terminal 10. Examples of the driver programs include an input/output device driver program for controlling the voice input/output device 13, an input device driver program for controlling the input device 15, and an output device driver program for controlling the display device 16. The voice input/output device 13 is, for example, a general microphone and a sound player capable of reproducing sound data.
 The communication interface 14 provides a connection interface with, for example, the server device 20 and the operator terminal 30, and is configured as a wireless communication interface and/or a wired communication interface. The input device 15 provides an interface for accepting input operations such as tap operations on icons, buttons, a virtual keyboard, and the like displayed on the display device 16; in addition to a touch panel, various input devices externally attached to the information terminal 10 can be given as examples.
 The display device 16 serves as an image display interface that provides various kinds of information to the user and, as necessary, to the conversation partner; examples include organic EL displays, liquid crystal displays, and CRT displays, preferably including those employing touch panels of various types. The camera 17 is for capturing still images and moving images of various subjects.
 FIG. 3 is a functional block diagram schematically showing an example of the functional configuration of the user device (information terminal) in the speech translation system according to the present disclosure. As shown in FIG. 3, the information terminal 10 functionally comprises a voice input/output unit 101, a transmission/reception unit 103, an input operation reception unit 105, a display unit 107, an information processing unit 109, and a storage unit 117. The information processing unit 109 functionally comprises a score comparison unit 111, a first display processing control unit 113, a call processing control unit 115, and an operator terminal specifying unit 116.
 The voice input/output unit 101, for example, receives the user's speech as input. The voice input/output unit 101 also outputs, as speech, the content translated by the server device 20 shown in FIG. 1, as described later. The voice input/output device 13 shown in FIG. 2 functions as the voice input/output unit 101.
 The transmission/reception unit 103 transmits and receives various kinds of information to and from, for example, the server device 20 and the operator terminal 30 shown in FIG. 1. For example, the transmission/reception unit 103 transmits the content of the input speech to the server device 20, and receives the text information, voice information, and the like of the content translated by the server device 20. The transmission/reception unit 103 also receives, for example, a score relating to translation accuracy from the server device 20. The communication interface 14 shown in FIG. 2 functions as the transmission/reception unit 103.
 The input operation reception unit 105 is, for example, a block that accepts the user's input operations. The input device 15 shown in FIG. 2 functions as the input operation reception unit 105.
 The display unit 107 displays various kinds of information. For example, the display unit 107 displays the text of the translated content. The display unit 107 also displays, for example, the language buttons 61 (second images) shown in FIG. 9(A) and the call start button 73 (first image) shown in FIG. 10(C). The display device 16 shown in FIG. 2 functions as the display unit 107.
 The information processing unit 109 represents the functions of the processor 11 shown in FIG. 2. The score comparison unit 111, for example, compares a score relating to the translation accuracy of the translation processing performed by the server device 20 with a predetermined threshold (score). The first display processing control unit 113 is a block that controls processing for displaying various kinds of information on the display unit 107; for example, it controls processing for displaying the text of the content translated by the server device 20 and, in addition to that text, for selectively displaying the call start button 73 (first image) shown in FIG. 10(C). The call processing control unit 115 is, for example, a block that controls call processing between the information terminal 10 and the operator terminal 30; when the call start button 73 displayed on the display unit 107 is selected, it transmits a call processing start request for starting the call processing to the operator terminal 30. The operator terminal specifying unit 116, for example, specifies the operator terminal 30 used by an interpreter who can use the language indicated by the English button selected from among the language buttons 61 shown in FIG. 9(A).
 The storage unit 117 is a block that stores various programs, information, and the like used for the processing of the information terminal 10. For example, the storage unit 117 stores the text information, voice information, and the like of the content translated by the server device 20 and received by the transmission/reception unit 103, as well as the score relating to the translation accuracy of the server device 20 received by the transmission/reception unit 103. The storage resource 12 shown in FIG. 2 functions as the storage unit 117. The camera 17 shown in FIG. 2, although not shown in FIG. 3, functions as, for example, an imaging unit.
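 As one illustration of how these terminal-side functional blocks could fit together, the following is a minimal Python sketch of the decision logic: comparing the translation-accuracy score against a threshold and, when appropriate, showing the call start button and issuing a call processing start request. All class and field names (TranslationResult, CallProcessingControl, and so on), the threshold value, and the request format are hypothetical illustrations, not details given in this disclosure.

```python
from dataclasses import dataclass

SCORE_THRESHOLD = 0.7  # hypothetical "predetermined threshold" for the score comparison unit

@dataclass
class TranslationResult:
    source_text: str       # recognized input content ("reading")
    translated_text: str   # translated content returned by the server device
    score: float           # translation-accuracy score from the score calculation unit

class FirstDisplayProcessingControl:
    """Controls what the display unit shows: the text, plus (selectively) the call start button."""
    def render(self, result: TranslationResult, show_call_button: bool) -> None:
        print(f"[screen] {result.source_text} -> {result.translated_text}")
        if show_call_button:
            print("[screen] (call start button displayed)")

class CallProcessingControl:
    """Builds the call processing start request sent to the operator (interpreter) terminal."""
    def request_call(self, terminal_id: str, history: list[TranslationResult]) -> dict:
        return {
            "type": "call_processing_start_request",
            "terminal_id": terminal_id,
            "translation_history": [(r.source_text, r.translated_text) for r in history],
        }

def handle_translation(result, history, display, call_control, call_button_tapped):
    # Show the call start button selectively, e.g. only when the score is at or
    # below the threshold; send the start request if the user then taps it.
    show_button = result.score <= SCORE_THRESHOLD
    display.render(result, show_button)
    history.append(result)
    if show_button and call_button_tapped:
        return call_control.request_call("terminal-001", history)
    return None
```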
 FIG. 4 is a system block diagram schematically showing an example of the configuration of the server device in the speech translation system according to the present disclosure. As shown in FIG. 4, the server device 20 illustratively comprises a processor 21, a communication interface 22, and a storage resource 23. The server device 20 is configured by, for example, a host computer with high arithmetic processing capability, and exhibits its server functions when a predetermined server program runs on that host computer; for example, it is configured by one or more host computers functioning as a speech recognition server, a translation server, and a speech synthesis server (shown as a single computer in the figure, but not limited thereto).
 The processor 21 comprises an arithmetic logic unit that processes arithmetic operations, logical operations, bit operations, and the like, and various registers (a program counter, data registers, instruction registers, general-purpose registers, and the like); it interprets and executes the program P20 stored in the storage resource 23 and outputs predetermined computation results. The communication interface 22 is a hardware module for connecting to the information terminal 10 via the network N; it is, for example, a modulation/demodulation device such as an ISDN modem, an ADSL modem, a cable modem, an optical modem, or a soft modem.
 The storage resource 23 is, for example, a logical device provided by the storage area of a physical device (a computer-readable storage medium such as a disk drive or a semiconductor memory), and stores one or more of each of the program P20, various modules L20, various databases D20, and various models M20.
 The program P20 is the above-described server program, which is the main program of the server device 20. The various modules L20 are software modules (modularized subprograms) that are called and executed as appropriate while the program P20 is running, in order to perform a series of information processing relating to requests and information transmitted from the information terminal 10. Examples of such modules L20 include a speech recognition module, a translation module, and a speech synthesis module.
 The various databases D20 include the various corpora required for speech translation processing (for example, in the case of Japanese-English speech translation, a Japanese speech corpus, an English speech corpus, a Japanese character (vocabulary) corpus, an English character (vocabulary) corpus, a Japanese dictionary, an English dictionary, a Japanese-English bilingual dictionary, a Japanese-English bilingual corpus, and the like), a speech database described later, a management database for managing information about users, and the like. Examples of the various models M20 include acoustic models and language models used for the speech recognition described later.
 FIG. 5 is a functional block diagram schematically showing an example of the functional configuration of the server device in the speech translation system according to the present disclosure. As shown in FIG. 5, the server device 20 functionally comprises a transmission/reception unit 201, an information processing unit 203, and a storage unit 213. The information processing unit 203 comprises, for example, a speech recognition unit 205, a multilingual translation unit 207, a score calculation unit 209, and a speech synthesis unit 211.
 The transmission/reception unit 201 transmits and receives various kinds of information to and from, for example, the information terminal 10 and the operator terminal 30 shown in FIG. 1. For example, the transmission/reception unit 201 receives from the information terminal 10 the content of the speech input to the information terminal 10, and transmits to the information terminal 10 the text information, voice information, and the like of the content translated by the multilingual translation unit 207 described later. The transmission/reception unit 201 also transmits to the information terminal 10, for example, the score relating to translation accuracy calculated by the score calculation unit 209 described later. The communication interface 22 shown in FIG. 4 functions as the transmission/reception unit 201.
 The information processing unit 203 represents the functions of the processor 21 shown in FIG. 4. The speech recognition unit 205, for example, recognizes the content of the speech input to the information terminal 10. The multilingual translation unit 207, for example, translates the content recognized by the speech recognition unit 205 into content in a different language. The score calculation unit 209, for example, calculates a score relating to the translation accuracy of the multilingual translation unit 207. The speech synthesis unit 211, for example, performs speech synthesis based on the translation result produced by the multilingual translation unit 207.
 The storage unit 213 is, for example, a block that stores various programs, information, and the like used for the processing of the server device 20. For example, the storage unit 213 stores the content of the speech input to the information terminal 10 and received by the transmission/reception unit 201, as well as the translated content. The storage unit 213 also stores, for example, the translated content associated with the content of the input speech as a translation history in association with each user. The storage resource 23 shown in FIG. 4 functions as the storage unit 213.
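 To make the division of labor among these server-side units concrete, here is a minimal Python sketch of the pipeline (recognize, translate, score, synthesize). The function bodies are placeholders: the naive stand-ins for recognition, translation, scoring, and synthesis are hypothetical and are not the methods actually used by the disclosed server device.

```python
def recognize_speech(audio: bytes) -> str:
    """Speech recognition unit 205: convert the 'sound' of the input speech to a
    'reading' (text). A real implementation would consult acoustic and language
    models (models M20) and the relevant speech corpus (databases D20)."""
    return "placeholder recognized text"  # stand-in for a real recognizer

def translate(text: str, src: str, dst: str) -> str:
    """Multilingual translation unit 207: translate the recognized text into
    another language, e.g. via bilingual dictionaries/corpora (databases D20)."""
    return f"[{dst}] {text}"  # stand-in for a real translation engine

def score_translation(src_text: str, translated: str) -> float:
    """Score calculation unit 209: return a translation-accuracy score.
    A dummy constant here; a real scorer might use model confidence."""
    return 0.5

def synthesize(text: str, lang: str) -> bytes:
    """Speech synthesis unit 211: convert the translated text into audio."""
    return text.encode("utf-8")  # stand-in for synthesized waveform data

def handle_request(audio: bytes, src_lang: str, dst_lang: str) -> dict:
    """End-to-end server flow corresponding to steps SS1, SS2, and SS4 in FIG. 8."""
    recognized = recognize_speech(audio)
    translated = translate(recognized, src_lang, dst_lang)
    return {
        "recognized_text": recognized,
        "translated_text": translated,
        "score": score_translation(recognized, translated),
        "synthesized_audio": synthesize(translated, dst_lang),
    }
```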
 FIG. 6 is a system block diagram schematically showing an example of the configuration of the operator terminal (interpreter device) in the speech translation system according to the present disclosure. As shown in FIG. 6, the operator terminal 30 comprises a processor 31, a storage resource 32, a voice input/output device 33 (including configurations in which the microphone and speaker are separate or integrated), a communication interface 34, an input device 35, a display device 36, and a camera 37. As noted above, the operator terminal 30 has a block configuration similar to that of the information terminal 10 shown in FIG. 2; the following description therefore focuses on the parts of the configuration that differ from the information terminal 10. The operator terminal 30 functions as part or all of the speech translation system according to an embodiment of the present disclosure when, for example, installed CTI (Computer Telephony Integration) application software, executed as at least part of the speech translation program according to an embodiment of the present disclosure, runs on it.
 The operator terminal 30 receives calls from the information terminal 10 shown in FIG. 1, and the interpreter performs interpretation via the operator terminal 30. The operator terminal 30 displays on the display device 36 information relating to the other party of the call, for example at least one of the information terminal 10 and the operator of the information terminal 10, as well as the translation history described in detail later. The operator terminal 30 is, for example, a stationary terminal device, including a desktop personal computer, having a function for communicating with the network N.
 The processor 31 interprets and executes the CTI application software, which is the program P30 stored in the storage resource 32, and performs various kinds of processing. The input device 35 provides an interface for accepting input operations such as tap operations on icons, buttons, a virtual keyboard, and the like displayed on the display device 36; examples include various input devices externally attached to the operator terminal 30, such as a keyboard and a mouse. The input device 35 may also be a device such as a touch panel of various types incorporating the functions of the display device 36.
 FIG. 7 is a functional block diagram schematically showing an example of the functional configuration of the operator terminal (interpreter device) in the speech translation system according to the present disclosure. As shown in FIG. 7, the operator terminal 30 functionally comprises a voice input/output unit 301, a transmission/reception unit 303, an input operation reception unit 305, a display unit 307, an information processing unit 309, and a storage unit 315. The information processing unit 309 functionally comprises a call processing unit 311 and a second display processing control unit 313.
 The voice input/output unit 301, for example, receives as input the speech of an operator, including the interpreter. The voice input/output unit 301 may also be configured, for example, to output by voice the content indicating the translation history received by the transmission/reception unit 303, as described later. The voice input/output device 33 shown in FIG. 6 functions as the voice input/output unit 301.
 The transmission/reception unit 303 transmits and receives various kinds of information to and from, for example, the information terminal 10 and the server device 20 shown in FIG. 1. For example, the transmission/reception unit 303 receives the translation history transmitted from the server device 20 via the information terminal 10, receives the call processing start request transmitted from the information terminal 10, and transmits a response signal to the call processing start request. The communication interface 34 shown in FIG. 6 functions as the transmission/reception unit 303.
 The input operation reception unit 305 is, for example, a block that accepts the operator's input operations. The input device 35 shown in FIG. 6 functions as the input operation reception unit 305.
 The display unit 307 displays various kinds of information. For example, the display unit 307 displays the translation history in association with each user. The display device 36 shown in FIG. 6 functions as the display unit 307.
 The information processing unit 309 represents the functions of the processor 31 shown in FIG. 6. The call processing unit 311, for example, determines, based on the call processing start request transmitted from the information terminal 10, whether or not a call between the operator terminal 30 and the information terminal 10 is possible, and generates a response signal to the call processing start request. The response signal includes a signal indicating that a call between the operator terminal 30 and the information terminal 10 is possible, or a signal indicating that such a call is not possible. The second display processing control unit 313 is, for example, a block that controls processing for displaying various kinds of information on the display unit 307; for example, it controls processing for displaying the translation history on the display unit 307 in association with each user.
 The storage unit 315 is a block that stores various programs, information, and the like used for the processing of the operator terminal 30. For example, the storage unit 315 stores the translation history received by the transmission/reception unit 303 and transmitted from the server device 20 via the information terminal 10. The storage resource 32 shown in FIG. 6 functions as the storage unit 315. The camera 37 shown in FIG. 6, although not shown in FIG. 7, functions as, for example, an imaging unit.
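 The following short Python sketch illustrates, under the same hypothetical naming as the earlier sketches, how the call processing unit of the operator terminal might answer a call processing start request and hand the accompanying translation history to the display side. The availability check is a placeholder assumption, not the mechanism disclosed here.

```python
class OperatorCallProcessing:
    """Call processing unit 311: decide whether a call is possible and build the response."""
    def __init__(self):
        self.busy = False  # hypothetical interpreter-availability flag

    def handle_start_request(self, request: dict) -> dict:
        possible = not self.busy
        if possible:
            # Second display processing control unit 313: show the translation
            # history associated with the requesting user before the call starts.
            for src, dst in request.get("translation_history", []):
                print(f"[operator screen] {src} -> {dst}")
            self.busy = True
        return {
            "type": "call_processing_start_response",
            "terminal_id": request["terminal_id"],
            "call_possible": possible,
        }
```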
 An example of the operations and behavior of the speech translation processing and the call processing in the speech translation system 100 configured as described above is further described below.
(Speech translation processing and call processing)
(First embodiment)
 FIG. 8 is a flowchart showing an example of (part of) the processing flow in the speech translation system according to the present disclosure. FIGS. 9(A) to 9(C), 10(A) to 10(C), and 11(A) to 11(D) are plan views showing examples of display screen transitions on the information terminal according to the present disclosure. FIG. 12 is a diagram showing an example of a display screen on the interpreter terminal according to the present disclosure. The following assumes a conversation in which the user of the information terminal 10 is a restaurant clerk who speaks Japanese and the conversation partner is a customer who speaks English, that is, a conversation in which the input language is Japanese and the translation language is English. However, the present invention is not limited to this.
 First, when the user (store clerk) taps an icon (not shown) of the speech translation application software displayed on the display unit 107 of the information terminal 10, the application is started on the information terminal 10 (FIG. 8; step SJ1).
 When the application starts, a language selection screen for the customer is displayed on the display unit 107 (FIG. 8; step SJ2). As shown in FIG. 9(A), this language selection screen displays, for example, Japanese text T21 asking the customer about their language, English text T22 to the same effect, and language buttons 61 (second images) indicating a plurality of expected representative languages (here, English, Chinese (two variants depending on the typeface), and Korean).
 At this time, the Japanese text T21 and the English text T22 are displayed on the screen of the display unit 107 of the information terminal 10 by the first display processing control unit 113 and the display unit 107 so as to be separated into, for example, regions of different colors, and oriented in opposite directions (different directions from each other; upside down relative to each other in the figure). As a result, when the user and the customer converse face to face, the user can easily read the Japanese text T21 while the customer can easily read the English text T22. Moreover, since the text T21 and the text T22 are displayed in separate regions, there is the advantage that the two are clearly distinguished and even easier to view.
 Then, the user presents the text T21 displayed on the language selection screen of FIG. 9(A) to the customer and has the customer tap the English button, whereby the customer's language is selected. As a result, a standby screen for Japanese and English voice input is displayed on the display device as the home screen (FIG. 8; step SJ3). This home screen displays text T23 asking which of the user's and the customer's languages will be spoken, as well as a Japanese input button 62a for Japanese voice input and an English input button 62b for English voice input. The home screen also displays a history display button 63 for displaying the history of input content, a language selection button 64 for returning to the language selection screen and switching the customer's language (redoing the language selection), and a settings button 65 for configuring various settings of the application software.
 Next, when the user (store clerk) taps the Japanese input button 62a on the home screen of FIG. 9(B) to select Japanese voice input, a voice input screen for accepting the user's utterance in Japanese is displayed (FIG. 9(C)). When this voice input screen is displayed, voice input from the voice input/output unit 101 becomes possible. The voice input screen displays text T24 prompting the user to speak and a microphone graphic 66 indicating that the terminal is waiting for voice input. To indicate that Japanese voice input was selected on the preceding screen of FIG. 9(B), the Japanese input button 62a is not displayed on the voice input screen of FIG. 9(C); the English input button 62b is displayed partially hidden behind the microphone graphic 66, for example in a pale color (the same applies in FIGS. 10(A) and 10(B) described later).
 A cancel button 67 is displayed at the bottom of this voice input screen; tapping it returns the screen to the voice input standby screen that is the home screen (FIG. 9(B)) so that voice input can be redone (the same applies in FIGS. 10(A) and 10(B) described later). In this state, when the user speaks in Japanese the matters to be conveyed to the customer, a multi-circle graphic 68 that schematically and dynamically indicates the loudness of the voice is displayed on the screen of the display unit 107 together with the text T24, so that the voice input level is visually fed back to the user, who is the speaker (FIG. 8; step SJ4).
 Then, when the user's utterance ends, for example when the information processing unit 109 of the information terminal 10 detects that there has been no voice input for a certain period of time, the information processing unit 109 ends the acceptance of the user's utterance. The information processing unit 109 then generates an audio signal based on the voice input and transmits the audio signal to the server device 20 through the transmission/reception unit 103 and the network N.
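 A minimal Python sketch of this end-of-utterance detection and transmission step might look as follows; the silence interval, the frame-reading and voice-detection callables, and the message format are all illustrative assumptions rather than details given in the disclosure.

```python
import time

SILENCE_SECONDS = 1.5  # hypothetical "certain period" with no voice input

def record_until_silence(read_frame, is_voiced) -> bytes:
    """Accept the utterance until no voice input is detected for SILENCE_SECONDS."""
    frames, last_voiced = [], time.monotonic()
    while time.monotonic() - last_voiced < SILENCE_SECONDS:
        frame = read_frame()          # one short chunk of microphone audio
        frames.append(frame)
        if is_voiced(frame):          # e.g. a simple energy-based voice detector
            last_voiced = time.monotonic()
    return b"".join(frames)           # the generated audio signal

def send_to_server(audio: bytes, send) -> None:
    """Transmit the audio signal to the server device over the network."""
    send({"type": "speech_input", "audio": audio})
```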
 Next, the speech recognition unit 205 of the information processing unit 203 of the server device 20 receives the audio signal through the transmission/reception unit 201 and performs speech recognition processing (FIG. 8; step SS1). At this time, the speech recognition unit 205 calls the necessary module L20, database D20, and model M20 (a speech recognition module, a Japanese speech corpus, an acoustic model, a language model, and the like) from the storage unit 213, and converts the "sound" of the input speech into a "reading" (characters).
 Here, the information processing unit 203 generates a text signal for text output based on the "reading" (characters) of the recognized speech, and transmits it to the information terminal 10 through the transmission/reception unit 201 and the network N. At this time, the information processing unit 203 generates a text signal based on the content of the recognized speech itself, and also retrieves, from the Japanese conversation corpus stored in advance in the storage unit 213, the entry corresponding to the actual utterance content and generates a text signal based on it. Then, as shown in FIG. 10(B), the first display processing control unit 113 of the information terminal 10, having received the text signal through the transmission/reception unit 103, displays on the screen, as the recognition result of the Japanese utterance input by the user, the Japanese text T25 that is the content of the recognized speech.
 Next, the multilingual translation unit 207 proceeds to multilingual translation processing, which translates the "reading" (characters) of the recognized speech into another language (FIG. 8; step SS2). At this time, the multilingual translation unit 207 calls the necessary modules L20 and databases D20 (a translation module, a Japanese character corpus, a Japanese dictionary, an English dictionary, a Japanese-English bilingual dictionary, a Japanese-English bilingual corpus, and the like) from the storage unit 213; appropriately rearranges the "reading" (character string) of the input speech, which is the recognition result, and converts it into Japanese phrases, clauses, sentences, and the like; extracts the English corresponding to the conversion result; rearranges these according to English grammar and converts them into natural English phrases, clauses, sentences, and the like; and selects the corresponding English conversation corpus from the storage unit 213. Meanwhile, as shown in FIG. 10(B), the display unit 107 displays a standby screen including Japanese text T26 indicating that translation is in progress and a circular graphic 69 indicating that translation is in progress.
 The storage unit 213 stores the translation result (translated content) associated with the content of the input speech as a translation history in association with each user (FIG. 8; step SS3). For example, the storage unit 213 stores the English conversation corpus entries corresponding to the translated English phrases, clauses, sentences, and the like as a translation history in association with the content of the input speech.
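 As a sketch of what this per-user translation history (step SS3) could look like, the following Python snippet keeps translated content keyed by a user identifier; the in-memory dictionary is a hypothetical stand-in for whatever management database the storage unit actually uses.

```python
from collections import defaultdict

# user_id -> list of (input content, translated content) pairs
translation_history: dict[str, list[tuple[str, str]]] = defaultdict(list)

def store_translation(user_id: str, input_text: str, translated_text: str) -> None:
    """Store the translated content, associated with the input content, per user."""
    translation_history[user_id].append((input_text, translated_text))

def history_for_user(user_id: str) -> list[tuple[str, str]]:
    """Retrieve the history later, e.g. for display on the interpreter terminal (FIG. 12)."""
    return translation_history[user_id]
```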
 Next, the speech synthesis unit 211 calls from the storage unit 213 the module L20, database D20, and model M20 required for speech synthesis (a speech synthesis module, an English speech corpus, an acoustic model, a language model, and the like), and converts the English conversation corpus entry corresponding to the translation result, that is, the English phrase, clause, sentence, or the like, into natural speech (FIG. 8; step SS4).
 When the multilingual translation processing and speech synthesis processing are complete, the information processing unit 203 generates a text signal for text display based on the English conversation corpus that is the translation result (translated content), generates an audio signal for voice output based on the synthesized speech, and transmits them to the information terminal 10 through the transmission/reception unit 201 and the network N.
 Then, as shown in FIG. 10(C), the first display processing control unit 113 of the information terminal 10, having received the text signal and audio signal through the transmission/reception unit 103, displays as a conversation screen the text T25, the text T27 of the Japanese conversation corpus corresponding to the text T25 (here the same as the text T25, but not limited thereto), and the text T28 of the English conversation corpus that is the translation result; it further controls processing for selectively displaying the call start button 73 (first image) on that screen (FIG. 8; step SJ5). Here, the storage unit 117 of the information terminal 10 may, for example, store the text signal and audio signal received from the server device 20 as a translation history.
 Simultaneously with step SJ5, the voice input/output unit 101 outputs (reads aloud) as speech the content (translated content) of the English text T28 that is the translation result (FIG. 8; step SJ6). Step SJ6 may instead be executed before or after step SJ5.
 At this time, as shown in FIG. 10(C), the Japanese texts T25 and T27 and the English text T28 are likewise displayed on the screen of the display unit 107 of the information terminal 10 separated into, for example, regions of different colors or divided by line segments, and oriented in opposite directions (different directions from each other; upside down relative to each other in the figure). As a result, when the user and the customer converse face to face and both can view the screen of the display unit 107, the user can easily read the Japanese texts T25 and T27 (the input content), while the customer can easily read the English text T28 (the translated content). Moreover, since the texts T25 and T27 and the text T28 are displayed in separate regions, there is the advantage that the two are clearly distinguished and even easier to view.
 Tapping the voice output button 70 displayed on the conversation screen of FIG. 10(C) repeats the voice output. The conversation screen also displays a check button 71 for ending the translation at that point; tapping it ends the translation processing and returns the screen to the home screen (FIG. 9(B)).
 Next, if the translation is performed accurately and the customer can understand the user's (store clerk's) question, speech processing of the customer's voice, namely input, recognition, translation, and speech synthesis, is performed in turn (FIG. 8; No in step SJ7). In this speech processing for the customer, the check button 71 displayed in FIG. 10(C) is first tapped to display the home screen (FIG. 9(B)). Next, on the home screen, the English input button 62b is tapped to select English voice input by the customer. The subsequent processing is basically the same as the processing described above, except that the speaker is the customer instead of the user, Japanese voice input is replaced with English voice input, and English voice and text output is replaced with Japanese voice and text output, so a detailed description is omitted here. When the conversation between the user and the customer is complete, the series of speech translation processes ends.
 On the other hand, when the content of the Japanese input by the store clerk or of the English input by the customer does not follow the basic sentence patterns of that language, or when the spoken word order differs from the norm, the likelihood of mistranslation tends to rise. And when the translation accuracy is not high, for example when a mistranslation actually occurs, communication between the store clerk and the customer may not proceed smoothly. In such a case, therefore, when at least one of the store clerk and the customer selects the call start button 73 (first image) displayed on the display unit 107 of the information terminal 10 in step SJ5 of FIG. 8, the call processing control unit 115 transmits a call processing start request to the operator terminal 30 in order to talk with an interpreter (FIG. 8; Yes in step SJ7).
 Specifically, when at least one of the store clerk and the customer selects the call start button 73 displayed on the display unit 107 of the information terminal 10 in step SJ5 of FIG. 8, the screen of the display unit 107 is grayed out, as shown in FIG. 11(B), and an image 75 for confirming whether or not to call an interpreter is displayed on that screen. Then, when at least one of the store clerk and the customer selects "Yes" displayed in the image 75, the first display processing control unit 113 controls a process of displaying text T29 on the screen of the display unit 107, as shown in FIG. 11(C). For example, the call processing control unit 115 may be configured to transmit the call processing start request for talking with an interpreter when at least one of the store clerk and the customer selects "Yes" displayed in the image 75.
 The call processing control unit 115 may, for example, generate the call processing start request when the call start button 73 is selected, or may generate it in advance, before the call start button 73 is selected. The call processing start request includes, for example, identification information of the information terminal 10, and is generated so as to also include the translation history from the server device 20. The identification information of the information terminal 10 includes, for example, attributes of the user of the information terminal 10, that is, the user's name, address, date of birth, age, affiliation, family structure, and the like, as well as the telephone number and identification number (ID) of the information terminal 10. A call between the store clerk or customer using the information terminal 10 and the interpreter using the operator terminal 30 is carried out via the network N, which may include a general telephone network, an IP telephone network, or the like. There is no particular restriction on the calling means, as long as the two parties can talk.
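 As one minimal sketch, in Python, of how such a request might be assembled, assuming a simple record layout that the disclosure does not itself specify:

    # Illustrative sketch; the field names are assumptions, not part of
    # the disclosure.
    from dataclasses import dataclass, field

    @dataclass
    class CallStartRequest:
        terminal_id: str       # identification number (ID) of terminal 10
        phone_number: str      # telephone number of terminal 10
        user_attributes: dict  # e.g. name, address, date of birth, affiliation
        translation_history: list = field(default_factory=list)  # from server 20

    def build_call_start_request(terminal, history):
        # May be generated when button 73 is selected, or prepared in advance.
        return CallStartRequest(
            terminal_id=terminal["id"],
            phone_number=terminal["phone"],
            user_attributes=terminal["attributes"],
            translation_history=history,
        )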
 Here, when more than one interpreter is available for a call, the store clerk wants to talk with the most suitable one. For example, the storage unit 117 of the information terminal 10 stores, in association with each other, identification information of each interpreter or of the terminal each interpreter uses, and language information indicating one or more languages each interpreter can use. In step SJ3 shown in FIG. 8, the user presents the text T21 displayed on the language selection screen shown in FIG. 9(A) to the customer and has the customer tap the English button, whereby the customer's language is selected. The operator terminal specifying unit 116 then refers to the terminal identification information and the language information stored in the storage unit 117 and specifies the operator terminal 30 used by an interpreter who can use the language indicated by the selected English button, that is, English. The call processing control unit 115 then transmits the call processing start request to the operator terminal used by that interpreter, whereby the call between the two parties is started. In this way, an interpreter who can handle the language used in the communication between the store clerk and the customer can be appropriately specified.
 Further, when there are several interpreters who can use English, the store clerk presumably wants to talk with the one who interprets best. For example, the storage unit 117 of the information terminal 10 may store, in addition to the identification information of the terminal each interpreter uses and the language information indicating one or more languages each interpreter can use, information indicating each interpreter's interpreting level or ability, in association with the identification information of each interpreter or of the terminal each interpreter uses. Then, when the English button is selected in step SJ3 shown in FIG. 8, the operator terminal specifying unit 116 may be configured to specify, from among the interpreters who can use English, the operator terminal used by the interpreter with the higher interpreting level or ability.
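 A minimal sketch of how the operator terminal specifying unit 116 might combine the language filter with the interpreting-level ranking follows; the in-memory registry and its record layout are illustrative assumptions, as the real data would come from the storage unit 117.

    # Illustrative registry; real identification, language, and level data
    # would reside in the storage unit 117.
    interpreters = [
        {"terminal_id": "op-01", "languages": {"en", "zh"}, "level": 3},
        {"terminal_id": "op-02", "languages": {"en"}, "level": 5},
        {"terminal_id": "op-03", "languages": {"ko"}, "level": 4},
    ]

    def specify_operator_terminal(language, registry=interpreters):
        # Keep only interpreters who can use the selected language,
        # then prefer the highest interpreting level among them.
        candidates = [r for r in registry if language in r["languages"]]
        if not candidates:
            return None  # no interpreter available for this language
        return max(candidates, key=lambda r: r["level"])["terminal_id"]

    assert specify_operator_terminal("en") == "op-02"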
 Note that the operator terminal specifying unit 116 may specify the interpreter at the time the call start button 73 (first image) displayed on the display unit 107 of the information terminal 10 is selected in step SJ5 of FIG. 8. Alternatively, the operator terminal specifying unit 116 may be configured to specify in advance, for each language used in the communication between the store clerk and the customer, the operator terminal used by the interpreter to be called.
 On the other hand, when at least one of the store clerk and the customer selects "No" displayed in the image 75, the display returns to the screen shown in FIG. 11(A).
 Next, the transmission/reception unit 303 of the operator terminal 30 receives the call processing start request from the information terminal 10 (FIG. 8; step SO1). The transmission/reception unit 303 transmits a response signal to the information terminal 10 (FIG. 8; step SO2). For example, when permitting the call between the information terminal 10 and the operator terminal 30, the call processing unit 311 generates a response signal indicating that the call is permitted. The call processing unit 311 determines whether or not to permit a call with the information terminal 10 by, for example, comparing the identification information of the information terminal 10 included in the received call processing start request with identification information of call-permitted information terminals stored in advance in the storage unit 315 or in another storage resource with which the operator terminal 30 can communicate. When not permitting the call between the information terminal 10 and the operator terminal 30, the call processing unit 311 instead generates a response signal indicating that the call is not permitted.
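 A minimal sketch of this permit/deny decision is given below, assuming the allowlist comparison just described; the response format is an assumption of the sketch.

    # Illustrative sketch; in practice the allowlist would reside in the
    # storage unit 315 or another storage resource reachable by terminal 30.
    ALLOWED_TERMINALS = {"term-100", "term-101"}

    def handle_call_start_request(request):
        # `request` is assumed to carry the identification information of
        # the information terminal 10 (cf. the CallStartRequest sketch above).
        permitted = request.terminal_id in ALLOWED_TERMINALS
        return {"type": "response_signal", "call_permitted": permitted}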
 The second display processing control unit 313 causes the display unit 307 to display the translation history transmitted from the server device 20 via the information terminal 10, associated with each user (FIG. 8; step SO3). For example, as shown in FIG. 12, the second display processing control unit 313 controls, on the screen of the display unit 307, a process of displaying an image 81 indicating that a call is in progress; a process of displaying an image 83 including a field showing the name of the user on the call, a field showing the telephone number of the information terminal used by that user, a field showing the identification number of that information terminal, and a field showing other attribute information such as the user's address; and a process of displaying, as a translation history image 85, the translation history of the speech input by the store clerk (user 1) and the customer (user X), associated with each user.
 In this way, since the operator terminal 30 displays the speech translation history on the display unit 307 in association with each user, the interpreter can review the speech translation history and can therefore respond with an understanding of the communication that has taken place so far between the store clerk and the customer.
 Also, as shown in FIG. 12, since the operator terminal 30 displays the speech translation history on the display unit 307 in chronological order, the interpreter can grasp the flow of the communication between the store clerk and the customer even more easily and can respond appropriately based on that flow.
 On the other hand, when the information terminal 10 receives the response signal from the operator terminal 30 (FIG. 8; step SJ8), the connection between the information terminal 10 and the operator terminal 30 is established, and a call between the store clerk or the customer and the interpreter is realized (FIG. 8; steps SJ9 and SO4). When the call between the store clerk or the customer and the interpreter is realized, the processor 11 displays text T30 on the screen of the display unit 107, as shown in FIG. 11(D).
(Second Embodiment)
 In the first embodiment, the information terminal displays the call start button (first image) whenever it outputs a translation result. In the second embodiment, by contrast, the information terminal compares a score relating to translation accuracy calculated by the server device with a predetermined threshold and displays the call start button (first image) only when the score is equal to or less than that threshold; this is the point on which the two embodiments differ. The second embodiment is described below with reference to FIG. 13, focusing on the differences from the flowchart of FIG. 8 used for the first embodiment; description of the points shared with the flowchart of FIG. 8 is omitted.
 FIG. 13 is a flowchart showing another example of (part of) the flow of processing in the speech translation system. As shown in FIG. 13, the multilingual translation unit 207 of the server device 20 executes multilingual translation processing that translates the "reading" (characters) of the recognized speech into another language (FIG. 13; step SS12). The storage unit 213 stores, as a translation history associated with each user, the translation result (translated content) corresponding to the content of the input speech and a score relating to the translation accuracy of that translation result (FIG. 13; step SS13).
 Here, the translation processing uses, for example, statistical machine translation: based on a translation model, containing for example a probabilistic bilingual dictionary and a probabilistic word-order conversion table extracted from parallel corpora that capture the correspondences between words and phrases of the two languages, and a language model, containing probabilistic Japanese word-chain data that expresses how natural the translated sentence is as language and how natural its word order is, the system outputs the translation candidate that maximizes the product of these probabilities. The score calculation unit 209 is accordingly configured to calculate, for each translation result, a score relating to translation accuracy, expressed for example as a percentage.
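 As a minimal sketch of this decoding criterion, the candidate maximizing the product of the translation-model and language-model probabilities can be selected in log space; mapping that product to a percentage score is one plausible reading of the score calculation, not a method fixed by the disclosure.

    import math

    def best_candidate(candidates, tm_prob, lm_prob):
        # tm_prob/lm_prob are assumed model interfaces returning nonzero
        # probabilities for a candidate translation.
        def log_score(c):
            return math.log(tm_prob(c)) + math.log(lm_prob(c))
        best = max(candidates, key=log_score)
        # One plausible percentage-style accuracy score for the chosen output.
        score_pct = 100.0 * tm_prob(best) * lm_prob(best)
        return best, score_pct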
 When the multilingual translation processing and speech synthesis processing are complete, the speech synthesis unit 211 generates a text signal for text display based on the English conversation corpus that is the translation result (translated content), and generates a speech signal for speech output based on the synthesized speech. The generated text signal, the generated speech signal, and the translation accuracy score are then transmitted to the information terminal 10 through the transmission/reception unit 201 and the network N.
 Next, the score comparison unit 111 of the information terminal 10 compares the score relating to translation accuracy calculated by the server device 20 with a predetermined threshold (FIG. 13; step SJ15). If the score is higher than the predetermined threshold (FIG. 13; No in step SJ15), the translation accuracy is good, and the first display processing control unit 113 displays the translation result on the display unit 107 and outputs the synthesized speech (FIG. 13; step SJ16). For example, when the predetermined threshold is 80% and the score relating to the translation accuracy of the translation processing in the server device 20 is 90%, the translation accuracy is good. When the accurate translation has enabled the customer to understand the questions of the user (store clerk), the flow returns to step SJ13 shown in FIG. 13, and speech processing such as input, recognition, translation, and speech synthesis is now performed on the customer's speech.
 On the other hand, if the score relating to translation accuracy is equal to or less than the predetermined threshold (FIG. 13; Yes in step SJ15), the translation accuracy is poor, and the first display processing control unit 113 displays both the translation result and the call start button on the display unit 107 (FIG. 13; step SJ17).
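 The branch of steps SJ15 to SJ17 reduces to a single comparison; the sketch below assumes the 80% threshold used in the example above.

    THRESHOLD = 80.0  # percent; the "predetermined threshold" of step SJ15

    def on_translation_result(score_pct, show_result, show_call_button):
        show_result()               # text display and speech output (step SJ16)
        if score_pct <= THRESHOLD:  # Yes in step SJ15: accuracy deemed poor
            show_call_button()      # also display call start button 73 (step SJ17)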
 According to the present disclosure, the first display processing control unit of the information terminal controls a process of selectively displaying the call start button when the translation result is displayed on the display unit, and the call processing control unit of the information terminal transmits, when the call start button is selected, a call processing start request for starting a call between the user and an interpreter. This reduces the burden on the user and improves convenience, while also preventing mistranslation and realizing smooth communication.
 Also, according to the present disclosure, the information terminal compares the score relating to the translation accuracy of the speech translation with a predetermined threshold and displays the call start button when the translation accuracy is low. Since the call start button is thus displayed only when the need for a call with an interpreter is greater, a call with an interpreter can be started more smoothly.
(Other Embodiments)
 The embodiments above are intended to facilitate understanding of the present disclosure and should not be construed as limiting it. The present disclosure may be changed and improved without departing from its spirit, and the present disclosure includes its equivalents. It may also be practiced with various modifications, such as combinations of the embodiments, without departing from that spirit.
 In each of the embodiments above, an example was described in which the speech recognition, translation, and speech synthesis processes are executed by the server device 20, but these processes may instead be configured to run on the information terminal 10. In that case, the modules L20 used for these processes may be stored in the storage resource 12 of the information terminal 10 or in the storage resource 23 of the server device 20. Furthermore, the database D20 of the speech database and/or the models M20 such as the acoustic model may also be stored in the storage resource 12 of the information terminal 10 or in the storage resource 23 of the server device 20. In this configuration, the speech translation system need not include the network N and the server device 20. Also, although the embodiments above described an example in which the process of judging translation accuracy is executed by the information terminal 10, this process may instead be configured to run on the server device 20.
 Note that the step of displaying the translation history associated with each user in step SO3 shown in FIG. 8 may be executed simultaneously with step SO1, or after step SO1 and either simultaneously with or before step SO2. Likewise, the step of displaying the translation history associated with each user in step SO13 shown in FIG. 10 may be executed simultaneously with step SO11, or after step SO11 and either simultaneously with or before step SO12.
 In the embodiments above, the operator terminal 30 was described as obtaining the translation history by receiving a call processing start request that includes the translation history, but this is not limiting. For example, the operator terminal 30 may be configured to receive the translation history directly from the server device 20, either before or after receiving the call processing start request.
 A gateway server or the like that converts communication protocols may of course be interposed between the information terminal 10 and the network N, or between the operator terminal 30 and the network N. The information terminal 10 is not limited to a portable device and may be, for example, a desktop personal computer, a notebook personal computer, a tablet personal computer, a laptop personal computer, or the like. Further, the operator terminal 30 is not limited to a stationary device and may be configured as a portable tablet terminal device or the like having a function of communicating with the network N.

Claims (6)

  1.  A speech translation system comprising: an information terminal that receives input of a user's speech; a server device that translates the content of the speech input to the information terminal; and an interpreter terminal that performs call processing with the information terminal, wherein
     the server device comprises:
     a speech recognition unit that recognizes the content of the speech input to the information terminal; and
     a translation unit that translates the content recognized by the speech recognition unit into content in a different language, and
     the information terminal comprises:
     a speech output unit that audibly outputs the content translated by the translation unit of the server device;
     a first display processing control unit that controls a process of displaying text of the translated content and controls, in addition to the text, a process of selectively displaying a first image; and
     a call processing control unit that controls the call processing with the interpreter terminal and, when the first image is selected, transmits to the interpreter terminal a call processing start request for starting the call processing.
  2.  The speech translation system according to claim 1, wherein
     the server device further comprises a score calculation unit that calculates a score relating to translation accuracy, and
     the first display processing control unit controls a process of displaying the first image when the score is equal to or less than a predetermined threshold.
  3.  The speech translation system according to claim 1 or claim 2, wherein
     the server device further comprises a storage unit that stores, as a translation history associated with each user, the translated content in correspondence with the content of the input speech, and
     the interpreter terminal further comprises a second display processing control unit that controls a process of displaying the translation history in association with each user.
  4.  The speech translation system according to any one of claims 1 to 3, wherein
     the first display processing control unit further controls a process of displaying two or more second images respectively indicating two or more languages, and
     the call processing control unit controls, when the first image is selected after one of the second images has been selected, the call processing with the interpreter terminal associated with an interpreter able to use the language indicated by the selected one of the second images.
  5.  A speech translation method comprising:
     audibly outputting content of a user's speech that has been translated into content in a different language;
     controlling a process of displaying text of the translated content and, in addition to the text, a process of selectively displaying a first image; and
     controlling call processing with an interpreter terminal, including transmitting to the interpreter terminal, when the first image is selected, a call processing start request for starting the call processing.
  6.  A speech translation program causing a computer to function as:
     a speech output unit that audibly outputs content of a user's speech that has been translated into content in a different language;
     a first display processing control unit that controls a process of displaying text of the translated content and controls, in addition to the text, a process of selectively displaying a first image; and
     a call processing control unit that controls call processing with an interpreter terminal and, when the first image is selected, transmits to the interpreter terminal a call processing start request for starting the call processing.
PCT/JP2017/003300 2016-02-01 2017-01-31 Speech translation system, speech translation method, and speech translation program WO2017135214A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2016-017071 2016-02-01
JP2016017071A JP6449181B2 (en) 2016-02-01 2016-02-01 Speech translation system, speech translation method, and speech translation program

Publications (1)

Publication Number Publication Date
WO2017135214A1

Family

ID=59499823

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2017/003300 WO2017135214A1 (en) 2016-02-01 2017-01-31 Speech translation system, speech translation method, and speech translation program

Country Status (2)

Country Link
JP (1) JP6449181B2 (en)
WO (1) WO2017135214A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107507615A (en) * 2017-08-29 2017-12-22 百度在线网络技术(北京)有限公司 Interface intelligent interaction control method, device, system and storage medium
CN111478971A (en) * 2020-04-14 2020-07-31 青岛联合视界数字传媒有限公司 Multilingual translation telephone system and translation method
CN112818707B (en) * 2021-01-19 2024-02-27 传神语联网网络科技股份有限公司 Reverse text consensus-based multi-turn engine collaborative speech translation system and method

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS62286172A (en) * 1986-06-04 1987-12-12 Ricoh Co Ltd Document processor
JPS63106866A (en) * 1986-10-24 1988-05-11 Toshiba Corp Machine translation device
JPH01230177A (en) * 1988-03-10 1989-09-13 Oki Electric Ind Co Ltd Translation processing system
JPH07105220A (en) * 1993-09-30 1995-04-21 Hitachi Ltd Conference translating device
JP5821096B2 (en) * 2011-06-30 2015-11-24 三井金属アクト株式会社 Door lock device for automobile

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2002223299A (en) * 2001-01-26 2002-08-09 Hitachi Ltd Interpretation service system
JP2004157882A (en) * 2002-11-07 2004-06-03 Patolis Corp Online document retrieval/translation method
JP2017010311A (en) * 2015-06-23 2017-01-12 株式会社Nttドコモ Translation support system, information processing device, and program

Also Published As

Publication number Publication date
JP2017138650A (en) 2017-08-10
JP6449181B2 (en) 2019-01-09

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 17747370; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 EP: PCT application non-entry in European phase (Ref document number: 17747370; Country of ref document: EP; Kind code of ref document: A1)