WO2017122657A1 - Speech translation device, speech translation method, and speech translation program - Google Patents

Speech translation device, speech translation method, and speech translation program Download PDF

Info

Publication number
WO2017122657A1
Authority
WO
WIPO (PCT)
Prior art keywords
content
input
user
history
speech
Prior art date
2016-01-13
Application number
PCT/JP2017/000564
Other languages
French (fr)
Japanese (ja)
Inventor
知高 大越
諒俊 武藤
Original Assignee
株式会社リクルートライフスタイル
Application filed by 株式会社リクルートライフスタイル
Publication of WO2017122657A1 publication Critical patent/WO2017122657A1/en

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00: Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/16: Sound input; Sound output
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/40: Processing or translation of natural language
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition

Definitions

  • the present invention relates to a speech translation device, a speech translation method, and a speech translation program.
  • a speech translation technique has been proposed in which a speaker's utterance is converted into text, and the text is machine-translated into the language of the other party and displayed on the screen, or is played back as audio using speech synthesis technology (see, for example, Patent Document 1).
  • a speech translation application that operates on an information terminal such as a smartphone that embodies such speech translation technology has been put into practical use (see, for example, Non-Patent Document 1).
  • An object of the present invention is to provide a speech translation device, a speech translation method, and a speech translation program that can reduce the burden on the user, improve convenience, and prevent the occurrence of mistranslation.
  • a speech translation apparatus according to one aspect includes an input unit for inputting a user's speech, a storage unit for storing the content of the input speech, a translation unit that translates the content of the input speech into content in a different language, an output unit that outputs the translated content (parallel translation) as speech and/or text, and a history display unit that displays a history of the input content.
  • the storage unit stores specific input content separately from the other input content in the history, either at the user's instruction or based on input frequency.
  • examples of the "specific input content" include phrases the user frequently uses in conversation (common turns of phrase) and the content of fixed phrases.
  • the speech translation device may further include an information acquisition unit that acquires information on the user's attributes (for example, sex, occupation, type of business, business category, and so on), and the storage unit may be configured to store the specific input content in association with the user's attributes.
  • the history display unit may switch the display of the history according to the attribute of the user.
  • the speech translation apparatus may further include a library creation unit that creates a library for each attribute from specific input content stored in association with the user's attribute.
  • the library for each attribute can be shared by the user and other users (that is, among a plurality of users).
  • a speech translation method according to one aspect uses a speech translation device including an input unit, a storage unit, a translation unit, an output unit, and a history display unit, and includes the steps of inputting the user's speech, storing the content of the input speech, translating the content of the input speech into content in a different language, outputting the translated content as speech and/or text, and displaying a history of the input content.
  • in the storing step, specific input content is stored separately from the other input content in the history, either at the user's instruction or based on input frequency.
  • in the step of displaying the history, the specific input content is displayed so that the user can select it.
  • in the translating step, when specific input content is selected, that specific input content is translated into content in a different language.
  • a speech translation program according to one aspect causes a computer (not limited to a single computer or a single kind; a plurality of computers, or plural kinds, are also possible, and the same applies below) to function as an input unit for inputting a user's speech, a storage unit for storing the content of the input speech, a translation unit that translates the content of the input speech into content in a different language, an output unit that outputs the translated content as speech and/or text, and a history display unit that displays a history of the input content. The program causes the storage unit to store specific input content separately from the other input content in the history, either at the user's instruction or based on input frequency; causes the history display unit to display the specific input content so that the user can select it; and causes the translation unit, when specific input content is selected, to translate that content into content in a different language.
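  • the following is a minimal sketch in Python of the storage behavior described above (an illustration only, not code from the patent; the class and field names and the frequency threshold are assumptions):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class HistoryEntry:
    """One item of recognized input content in the history."""
    text: str                          # the recognized input content
    language: str                      # e.g. "ja"
    count: int = 1                     # how many times it has been input
    pinned: bool = False               # flag marking "specific input content"
    translation: Optional[str] = None  # cached parallel translation, if any

class InputHistory:
    """History store that distinguishes specific (pinned) input content."""

    def __init__(self, pin_threshold: int = 5):
        self.entries: dict[str, HistoryEntry] = {}
        self.pin_threshold = pin_threshold  # input-frequency criterion

    def record(self, text: str, language: str) -> HistoryEntry:
        entry = self.entries.get(text)
        if entry is None:
            entry = self.entries[text] = HistoryEntry(text, language)
        else:
            entry.count += 1
        if entry.count >= self.pin_threshold:
            entry.pinned = True        # distinguished automatically by frequency
        return entry

    def set_pinned(self, text: str, pinned: bool) -> None:
        """Pin or unpin an entry at the user's explicit instruction."""
        self.entries[text].pinned = pinned

    def specific(self) -> list[HistoryEntry]:
        """The entries shown selectably in the history display."""
        return [e for e in self.entries.values() if e.pinned]
```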
  • as methods of acquiring information on the "attributes", examples include having the user fill in a user-information registration screen when starting to use the service of the speech translation device, or when installing the application (the speech translation program) on a computer such as an information terminal, and having the user answer a questionnaire about attributes when using the speech translation device.
  • a history of the input content of speech uttered by the user is kept, specific input content such as frequent phrases is stored from that history, and the specific input content is displayed so that the user can select it. By selecting a desired phrase from the specific input content, the user is spared the trouble of uttering frequent phrases each time, which reduces the burden on the user and improves convenience. Moreover, the occurrence of mistranslation can be prevented, so the accuracy of speech translation can be improved easily and effectively. Preventing mistranslation in this way also speeds up processing, saves memory, reduces the amount of communication data, and increases the reliability of processing.
  • FIG. 1 is a system block diagram schematically showing a preferred embodiment of the network configuration and the like of a speech translation apparatus according to the present disclosure.
  • FIG. 2 is a system block diagram schematically showing an example of the configuration of a user apparatus (information terminal) in the speech translation apparatus according to the present disclosure.
  • FIG. 3 is a system block diagram schematically showing an example of the configuration of a server in the speech translation apparatus according to the present disclosure.
  • FIG. 4 is a flowchart showing an example of (part of) the flow of processing in the speech translation apparatus according to the present disclosure.
  • FIGS. 5(A) to 5(D) are plan views showing an example of the transition of the display screen on the information terminal.
  • FIG. 1 is a system block diagram schematically showing a preferred embodiment of the network configuration and the like of a speech translation apparatus according to the present disclosure.
  • the speech translation apparatus 100 includes a server 20 that is electronically connected via a network N to an information terminal 10 (user apparatus) used by users (a speaker and another speaker) (though the configuration is not limited to this).
  • the information terminal 10 employs a user interface such as a touch panel and a display with high visibility, for example.
  • the information terminal 10 here is a portable tablet terminal device including a mobile phone represented by a smartphone having a communication function with the network N.
  • the information terminal 10 further includes a processor 11, a storage resource 12, a voice input / output device 13, a communication interface 14, an input device 15, a display device 16, and a camera 17.
  • by running the installed speech translation application software (at least part of the speech translation program according to an embodiment of the present disclosure), the information terminal 10 functions as part or all of the speech translation device according to an embodiment of the present disclosure.
  • the processor 11 includes an arithmetic logic unit and various registers (program counter, data register, instruction register, general-purpose register, etc.). Further, the processor 11 interprets and executes speech translation application software, which is the program P10 stored in the storage resource 12, and performs various processes.
  • the speech translation application software as the program P10 can be distributed from the server 20 through the network N, for example, and may be installed and updated manually or automatically.
  • the network N is a communication network in which wired networks (such as a local area network (LAN), a wide area network (WAN), and a value-added network (VAN)) and wireless networks (such as a mobile communication network, a satellite communication network, Bluetooth (registered trademark), WiFi (Wireless Fidelity), and HSDPA (High Speed Downlink Packet Access)) are mixed.
  • the storage resource 12 is a logical device provided by the storage area of a physical device (for example, a computer-readable recording medium such as a semiconductor memory), and stores the operating system program, driver programs, various data, and the like used in the processing of the information terminal 10.
  • examples of the driver programs include an input/output device driver program for controlling the audio input/output device 13, an input device driver program for controlling the input device 15, and an output device driver program for controlling the display device 16.
  • the voice input / output device 13 is, for example, a general microphone and a sound player capable of reproducing sound data.
  • the communication interface 14 provides, for example, a connection interface with the server 20 and includes a wireless communication interface and / or a wired communication interface.
  • the input device 15 provides an interface for accepting input operations by tap actions on icons, buttons, a virtual keyboard, and the like displayed on the display device 16; besides the touch panel, various input devices externally attached to the information terminal 10 can also serve as examples.
  • the display device 16 provides various information as an image display interface to a user or a conversation partner as necessary, and examples thereof include an organic EL display, a liquid crystal display, and a CRT display.
  • the camera 17 is for capturing still images and moving images of various subjects.
  • the server 20 is constituted by, for example, a host computer with high arithmetic processing capability, and provides its server functions by running predetermined server programs on that host computer; it is composed of one or more host computers functioning, for example, as a speech recognition server, a translation server, and a speech synthesis server (shown as a single computer in the drawing, but not limited to this).
  • Each server 20 includes a processor 21, a communication interface 22, and a storage resource 23 (storage unit).
  • the processor 21 is composed of an arithmetic and logic unit for processing arithmetic operations, logical operations, bit operations, and the like, together with various registers (program counter, data registers, instruction registers, general-purpose registers, etc.); it interprets and executes the program P20 stored in the storage resource 23 and outputs predetermined processing results.
  • the communication interface 22 is a hardware module for connecting to the information terminal 10 via the network N, for example a modulation/demodulation device such as an ISDN modem, an ADSL modem, a cable modem, an optical modem, or a soft modem.
  • the storage resource 23 is a logical device provided by the storage area of a physical device (a computer-readable recording medium such as a disk drive or semiconductor memory); it stores one or more each of the program P20, various modules L20, various databases D20, and various models M20.
  • the program P20 is the server program described above, which is the main program of the server 20.
  • the various modules L20 are software modules (modularized subprograms) that are called and executed as needed while the program P20 runs, in order to carry out the series of information processing for requests and information sent from the information terminal 10.
  • Examples of the module L20 include a speech recognition module, a translation module, and a speech synthesis module.
  • the various databases D20 include the various corpora required for speech translation processing (for example, for Japanese-English speech translation, a Japanese speech corpus, an English speech corpus, a Japanese character (vocabulary) corpus, an English character (vocabulary) corpus, a Japanese dictionary, an English dictionary, a Japanese-English bilingual dictionary, a Japanese-English bilingual corpus, and the like), a speech database described later, a management database for managing information on users, and so on.
  • examples of the various models M20 include an acoustic model and a language model used for speech recognition described later.
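  • to make the division of labor concrete, here is a minimal sketch in Python of how a server might dispatch these modules (an illustration only; the class names and call signatures are assumptions, not interfaces described in the patent):

```python
class SpeechTranslationServer:
    """Dispatches requests from the terminal to the modules L20 (sketch)."""

    def __init__(self, recognizer, translator, synthesizer, history):
        # Each module is assumed to consult its corpora (D20) and
        # acoustic/language models (M20) loaded from the storage resource 23.
        self.modules = {
            "recognize": recognizer,    # speech -> "reading" (text)
            "translate": translator,    # text -> text in the other language
            "synthesize": synthesizer,  # text -> natural speech audio
        }
        self.history = history          # e.g. the InputHistory sketched above

    def handle(self, kind: str, payload):
        # Called while the server program runs, once per request
        # arriving from the information terminal 10 over the network N.
        return self.modules[kind].run(payload)
```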
  • FIG. 4 is a flowchart showing an example of a process flow (part) in the speech translation apparatus 100 of the present embodiment.
  • FIGS. 5(A) to 5(D) are plan views showing an example of the transition of the display screen on the information terminal 10.
  • here we assume a conversation in which the user of the information terminal 10 is a restaurant clerk who speaks Japanese and the conversation partner is a customer who speaks English, that is, the input language is Japanese and the translation language is English (though the scenario is not limited to this).
  • when the user (clerk) starts the application (step SU1), a language selection screen for the customer is displayed on the display device 16 (FIG. 5(A); step SJ1).
  • this language selection screen displays a Japanese text T21 asking the customer for their language, an English text T22 to the same effect, and language buttons 61 indicating a plurality of expected representative languages (here, English, Chinese (for example, two variants by script), and Korean).
  • the Japanese text T21 and the English text T22 are placed by the processor 11 and the display device 16 in separate areas of the screen of the display device 16 of the information terminal 10, and are displayed facing in opposite directions (mutually different orientations; upside down relative to each other in the figure).
  • as a result, when the user and the customer converse face to face, the user can easily read the Japanese text T21 while the customer can easily read the English text T22.
  • since the text T21 and the text T22 are displayed in separate areas, they are clearly distinguished from each other and all the easier to check.
  • the user can present the display of the text T22 on the language selection screen to the customer and have the customer tap the English button, or can select the customer's language themselves.
  • when the customer's language has been selected in this way, a standby screen for voice input in Japanese and English is displayed as the home screen (FIG. 5(B); step SJ2).
  • the standby screen displays a text T23 asking which language, the user's or the customer's, will be spoken, a Japanese input button 62a for Japanese voice input, and an English input button 62b for English voice input.
  • the standby screen also displays a history button 63 for displaying the history of input content, a language selection button 64 for returning to the language selection screen and switching the customer's language (redoing the language selection), and a setting button 65 for making various settings of the application software.
  • FIG. 4 shows the branch (step SU2) on whether or not the user taps the history button 63; in normal speech translation processing, voice input is performed from the standby screen shown in FIG. 5(B).
  • the flow of speech translation processing in that case (that is, "No" in step SU2) will be described first.
  • the processor 21 of the server 20 receives the voice signal through the communication interface 22 and performs speech recognition processing (step SJ4). At this time, the processor 21 calls the necessary module L20, database D20, and model M20 (speech recognition module, Japanese speech corpus, acoustic model, language model, etc.) from the storage resource 23 and converts the "sound" of the input speech into a "reading" (characters). In this way, the processor 21, or the server 20 as a whole, functions as a "speech recognition server".
  • the processor 21 shifts to a multilingual translation process for translating the “reading” (characters) of the recognized speech into another language (step SJ5).
  • the processor 21 calls the necessary module L20 and databases D20 (translation module, Japanese character corpus, Japanese dictionary, English dictionary, Japanese-English bilingual dictionary, Japanese-English bilingual corpus, etc.) from the storage resource 23, appropriately segments the "reading" (character string) of the recognized input speech and converts it into Japanese phrases, clauses, sentences, and so on, extracts the English corresponding to the conversion result, and orders the results according to English grammar.
  • the processor 21 also functions as a “translation unit”, and the server 20 also functions as a “translation server” as a whole. If the input voice is not recognized well, the voice can be re-input (screen display is not shown).
  • the processor 21 stores the content of the recognized input voice in the storage resource 23.
  • the processor 21 proceeds to speech synthesis processing (step SJ6).
  • the processor 21 calls the necessary module L20, database D20, and model M20 (speech synthesis module, English speech corpus, acoustic model, language model, etc.) from the storage resource 23 and converts the English phrases, clauses, sentences, and so on that make up the translation result into natural speech.
  • in this respect the processor 21 also functions as a "speech synthesis unit", and the server 20 as a whole also functions as a "speech synthesis server".
  • the processor 21 generates a voice signal for voice output based on the synthesized voice, and transmits the voice signal to the information terminal 10 through the communication interface 22 and the network N.
  • the processor 11 of the information terminal 10 receives the audio signal through the communication interface 14 and performs an audio output process (step SJ7).
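  • taken together, steps SJ4 to SJ7 form one recognize-translate-synthesize round trip between the terminal and the server. The following minimal Python sketch strings the steps together (illustrative only; it reuses the hypothetical server and history objects sketched above, and the function name is an assumption):

```python
def speech_translation_round_trip(server, audio_in: bytes, src: str = "ja"):
    """One pass through steps SJ4-SJ7 for a normal utterance."""
    # Step SJ4: speech recognition -- convert "sound" into a "reading" (text).
    reading = server.handle("recognize", audio_in)

    # The recognized input content is stored in the history at this point.
    entry = server.history.record(reading, language=src)

    # Step SJ5: multilingual translation of the recognized text.
    entry.translation = server.handle("translate", reading)

    # Step SJ6: speech synthesis of the translation result.
    audio_out = server.handle("synthesize", entry.translation)

    # Step SJ7: the terminal receives this signal and plays it back,
    # typically also displaying the input text and its parallel translation.
    return reading, entry.translation, audio_out
```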
  • on the history display screen, a display order selection button 66 for switching the order of the list of input speech content between, for example, "most recent first" and "most frequent first" is displayed above the list in which the content is shown as texts.
  • the user can switch between the "most recent" list and the "most frequent" list by tapping the display order selection button 66.
  • a pin-shaped design P is additionally displayed with the text of each item of input content.
  • from the input content displayed on the history display screen, the user can use this pin to, so to speak, "clip" content that the user utters frequently or that forms a fixed phrase.
  • suppose the user taps the pin-shaped design P of the input content (specific input content) shown in the texts T31, T32, and T33 among the input content listed in FIG. 5(C) ("Yes" in step SU3).
  • then the processor 11 of the information terminal 10 moves the input content of the texts T31, T32, and T33 to the upper area R1 of the screen and displays it together there, while moving the other input content to the lower area R2 of the screen and grouping it there, visually distinguishing the two (step SJ4). In the vicinity of the upper area R1, a text T23 indicating that the input content has been clipped with a pin is also clearly shown.
  • the processor 11 of the information terminal 10 transmits to the server 20 a command signal indicating that the input contents of the texts T31, T32, and T33 have been selected by the user.
  • the processor 21 of the server 20 sets a flag on the input contents (specific input contents) of the texts T31, T32, and T33 held in the storage resource 23, so as to distinguish them from other input contents.
  • for the input content of the texts T31, T32, and T33 clipped with pins, an x-mark design 67 is additionally displayed instead of the pin-shaped design P.
  • the user can unpin the texts T31, T32, and T33 by tapping the x-mark design 67 as necessary.
  • in that case, the processor 21 of the server 20 removes the flag from the flagged input content stored in the storage resource 23, for example in response to a command signal from the processor 11 of the information terminal 10.
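  • a minimal sketch of this pin/unpin exchange between terminal and server (the message format and transport are assumptions; the text only specifies that a command signal is sent and a flag is set or removed):

```python
import json

def send_pin_command(channel, text: str, pinned: bool) -> None:
    """Terminal side: report a pin or unpin tap to the server."""
    message = {"kind": "pin", "text": text, "pinned": pinned}
    channel.send(json.dumps(message).encode("utf-8"))

def handle_pin_command(history, message: dict) -> None:
    """Server side: set or remove the flag on the stored input content."""
    # Setting the flag distinguishes specific input content; removing it
    # returns the entry to the ordinary (unpinned) input content.
    history.set_pinned(message["text"], message["pinned"])
```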
  • the user can then select desired input content from the texts T31, T32, and T33 clipped with pins, instead of uttering the question to the customer.
  • the command signal is transmitted from the processor 11 of the information terminal 10 to the server 20.
  • the processor 21 of the server 20, having received the command signal, sequentially executes the multilingual translation processing (step SJ5), speech synthesis processing (step SJ6), and speech output processing (step SJ7) for the content of the selected text T31. The user can thereby output a parallel translation of a desired phrase or the like (specific input content) without performing voice input.
  • when input content to be clipped with a pin is not selected in step SU3 ("No" in step SU3), or when specific input content is not selected in place of an utterance in step SU4 ("No" in step SU4), the processor 21 of the server 20 sequentially executes the normal speech translation processing shown in steps SJ3 to SJ7 described above. Specifically, when the user taps the close button 68 on the history display screen shown in FIG. 5(C) or FIG. 5(D), the standby screen shown in FIG. 5(B) is displayed, and processing can revert to normal speech translation.
  • a standby screen for selecting a target language for speech translation is displayed on the display device 16 of the information terminal 10.
  • an information registration screen for inputting information related to the user is displayed on the display device 16 of the information terminal 10.
  • this includes attribute information such as the occupation of the user (or of the user's store), the type of business, the business category, age, sex, birthplace, and place of residence.
  • when the user inputs the user information, the processor 11 of the information terminal 10 generates an information signal based on the input and transmits it to the server 20 through the communication interface 14 and the network N.
  • the information terminal 10 itself or the processor 11 also functions as an “information acquisition unit”.
  • when the processor 21 of the server 20 receives the information signal through the communication interface 22, processing moves for the time being to the steps from step SJ2 onward shown in FIG. 4. Then, when the user selects input content to be clipped, displayed for example in the texts T31, T32, and T33 in step SU3 ("Yes" in step SU3), the history display screen shown in FIG. 5(C) or FIG. 5(D) is displayed, as in the first or second embodiment. Meanwhile, the processor 21 of the server 20 flags the input content (specific input content) of the texts T31, T32, and T33 held in the storage resource 23 to distinguish it from the other input content, and stores it again in association with the user's attributes.
  • the processor 21 then extracts or narrows down specific input content based on any of the user's attributes (in particular, the occupation, type of business, or business category of the user (or of the user's store)).
  • the processor 21 may collect the specific input content extracted or narrowed down by user attribute, together with the corresponding translated content, into a library for each attribute and store it in the storage resource 23. A library created for each attribute in this way is all the more useful if it is shared among a plurality of users.
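  • a minimal sketch of such per-attribute library creation, reusing the hypothetical history entries above (the attribute record and field names are assumptions for illustration):

```python
from collections import defaultdict

def build_attribute_libraries(users):
    """Collect each user's pinned phrases, with their translations,
    into one shared library per attribute value."""
    libraries = defaultdict(dict)   # attribute value -> {phrase: translation}
    for user in users:
        key = user.attributes["business_type"]   # e.g. "restaurant"
        for entry in user.history.specific():
            # Deduplicated across all users sharing the same attribute.
            libraries[key][entry.text] = entry.translation
    return libraries
```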
  • as described above, with the speech translation device, the speech translation method using it, and the speech translation program, specific input content such as frequent phrases and fixed sentences can be clipped from the history of the input content of speech uttered by the user and stored.
  • the user can therefore easily call up frequent phrases, fixed phrases, and the like, and is spared the trouble of uttering them each time.
  • as a result, the burden on the user can be reduced and convenience improved, and since the occurrence of mistranslation can be effectively prevented, an improvement in the accuracy of speech translation can be realized easily and effectively.
  • furthermore, the clipped specific input content is stored in association with the user's attributes and displayed accordingly on the history display screen, so frequent phrases, fixed phrases, and the like that match the user's attributes can be selected efficiently.
  • frequent phrases and fixed phrases necessary for the user can be easily found, so that the burden on the user can be further reduced and convenience can be further improved.
  • in particular, when grouped by the occupation, type of business, and business category serving as user attributes, frequent phrases and the like can be expected to be all the more standardized, which makes the libraries created for each attribute all the more effective.
  • each of the above embodiments is an example for explaining the present invention, and the present invention is not limited to the embodiment.
  • the present disclosure can be variously modified without departing from the gist thereof.
  • those skilled in the art can replace the resources (hardware resources or software resources) described in the embodiments with equivalents, and such replacements are also included in the scope of the present disclosure.
  • in the embodiments, each process of speech recognition, translation, and speech synthesis is executed by the server 20, but these processes may instead be executed in the information terminal 10.
  • the module L20 used for these processes may be stored in the storage resource 12 of the information terminal 10 or may be stored in the storage resource 23 of the server 20.
  • the databases D20 such as the speech database and/or the models M20 such as the acoustic model may likewise be stored in the storage resource 12 of the information terminal 10, or may be stored in the storage resource 23 of the server 20.
  • the speech translation apparatus may not include the network N and the server 20.
  • specific input content may also be extracted by the processor 21 of the server 20 based on input frequency, and a database or library in which such content is clipped may be generated automatically.
  • the input content extracted by the processor 21 based on input frequency can be displayed on the screen on which the display order selection button 66 shown in FIG. 5(C) or FIG. 5(D) appears.
  • a translation result into a given language, once produced, may be stored together with (that is, in association with) the clipped specific input content. For example, in the flow shown in FIG. 4, when the user taps and selects the portion of the text T31 ("Yes" in step SU4), the multilingual translation processing (step SJ5) may be skipped and the speech synthesis processing (step SJ6) performed directly.
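  • a minimal sketch of this short-circuit, again reusing the hypothetical objects above (the caching policy shown is an assumption beyond what the text states):

```python
def output_specific_content(server, entry) -> bytes:
    """Output the parallel translation of selected specific input content."""
    if entry.translation is None:
        # First selection: run the translation (step SJ5) once and store
        # the result in association with the clipped specific input content.
        entry.translation = server.handle("translate", entry.text)
    # On later selections, step SJ5 is skipped entirely; processing goes
    # straight to speech synthesis (step SJ6) and speech output (step SJ7).
    return server.handle("synthesize", entry.translation)
```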
  • the information terminal 10 is not limited to a portable device, and may be a desktop personal computer, a notebook personal computer, a tablet personal computer, a laptop personal computer, or the like.
  • according to one aspect, the speech translation apparatus comprises: an input unit for inputting the user's voice; a storage unit for storing the content of the input voice; a translation unit that translates the content of the input voice into content in a different language; an output unit that outputs the translated content as voice and/or text; and a history display unit that displays a history of the input content. The history display unit displays, attached to each item of input content in the history, a design indicating that the user can select specific input content from the history. When the user selects the specific input content from the history using the design, the storage unit stores the specific input content separately from the other input content. The history display unit then displays the specific input content visually distinguished from the other input content, such that the user can select desired input content from the specific input content so displayed. When the desired input content is selected by the user, the translation unit may translate the desired input content into content in a different language.
  • when the history display unit displays the specific input content visually distinguished from the other input content, a design indicating that the user can remove unnecessary input content from the specific input content may be displayed along with the specific input content.
  • an information acquisition unit that acquires information on the user's attributes may further be provided, and the storage unit may store the specific input content in association with the user's attributes.
  • the history display unit may switch the display of the history according to the attribute of the user.
  • a library creation unit that creates a library for each attribute from the specific input content stored in association with the attribute of the user may be further provided.
  • the library for each attribute may be shared by the user and other users.
  • according to one aspect, the speech translation method uses a speech translation device comprising an input unit, a storage unit, a translation unit, an output unit, and a history display unit, and includes the steps of: inputting the user's voice; storing the content of the input voice; translating the content of the input voice into content in a different language; outputting the translated content as voice and/or text; and displaying a history of the input content. In the step of displaying the history, a design indicating that the user can select specific input content from the history is displayed along with each item of input content in the history. In the storing step, when the user selects the specific input content from the history using the design, the specific input content is stored separately from the other input content. In the step of displaying the history, the specific input content is displayed visually distinguished from the other input content, and desired input content is displayed selectably from the visually distinguished specific input content. In the translating step, when the desired input content is selected by the user, the desired input content may be translated into content in a different language.
  • according to one aspect, the speech translation program causes a computer to function as: an input unit for inputting the user's voice; a storage unit for storing the content of the input voice; a translation unit that translates the content of the input voice into content in a different language; an output unit that outputs the translated content as voice and/or text; and a history display unit that displays a history of the input content. The history display unit displays, along with each item of input content in the history, a design indicating that the user can select specific input content from the history. When the user selects the specific input content from the history using the design, the storage unit stores the specific input content separately from the other input content. The history display unit displays the specific input content visually distinguished from the other input content, such that the user can select desired input content from the specific input content so displayed. When the desired input content is selected by the user, the translation unit may translate the desired input content into content in a different language.
  • the burden on the user in speech translation processing can be reduced and convenience can be improved, and the accuracy of speech translation can be improved easily and effectively by preventing the occurrence of mistranslation.
  • the present invention can be widely used for activities such as designing, manufacturing, providing, and selling programs, apparatuses, systems, and methods in the field of providing services related to conversations between people who cannot understand each other's languages.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Artificial Intelligence (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

A speech translation device according to an embodiment of the present invention is equipped with: an input part for inputting a user's speech; a storage part for storing the content of the input speech; a translation part for translating the content of the input speech into a different language; an output part for outputting the translated content (parallel translation) in the form of speech and/or text; and a history display part for displaying the history of input content entries. In addition, the storage part stores a specific input content entry in the history separately from the other input content entries in response to an instruction given by the user or on the basis of the input frequency thereof. Furthermore, when the specific input content is selected, the translation part translates the specific input content into a different language.

Description

Speech translation device, speech translation method, and speech translation program

Cross-reference to related applications
 This application is based on Japanese Patent Application No. 2016-004337, filed on January 13, 2016, the contents of which are incorporated herein by reference.
 The present invention relates to a speech translation device, a speech translation method, and a speech translation program.
 To enable conversation between people who cannot understand each other's language, for example between a store clerk (a salesperson at a restaurant or other store) and a customer (such as a tourist from abroad), speech translation techniques have been proposed that convert a speaker's utterance into text and then machine-translate the text into the other party's language and display it on a screen, or play it back as audio using speech synthesis technology (see, for example, Patent Document 1). Speech translation applications that embody such techniques and run on information terminals such as smartphones have also been put into practical use (see, for example, Non-Patent Document 1).
Patent Document 1: JP-A-9-34895
 In a conversation between a clerk and a customer, for example, frequently used phrases (questions, guidance, explanations, and so on) and fixed content are often uttered. In a restaurant, for instance, the same wording or phrases with the same content may recur when the clerk asks for the customer's order or explains the ingredients of a dish. With the conventional speech translation device described above, however, the user (clerk) needs to utter even such frequent phrases each and every time. Moreover, if the content of the input speech does not follow the basic sentence patterns of the language in the first place, mistranslation tends to become more likely in the machine translation performed by the translation engine. Thus, even for phrases with substantially the same content, a slight difference in, say, the spoken word order can produce a mistranslation, and even a frequent phrase may have to be uttered again. As a result, the burden on the user (speaker) may increase and convenience may decrease.
 The present invention has been made in view of such circumstances, and its object is to provide a speech translation device, a speech translation method, and a speech translation program that can reduce the burden on the user and improve convenience by saving the trouble of uttering phrases that frequently appear in conversation, and that can prevent the occurrence of mistranslation.
 To solve the above problems, a speech translation device according to one aspect of the present disclosure first includes an input unit for inputting a user's speech, a storage unit for storing the content of the input speech, a translation unit that translates the content of the input speech into content in a different language, an output unit that outputs the translated content (parallel translation) as speech and/or text, and a history display unit that displays a history of the input content. The storage unit stores specific input content separately from the other input content in the history, either at the user's instruction or based on input frequency. When specific input content is selected, the translation unit translates that specific input content into content in a different language. Here, examples of the "specific input content" include phrases the user frequently uses in conversation (common turns of phrase) and the content of fixed phrases.
 The speech translation device according to one aspect of the present disclosure may further include an information acquisition unit that acquires information on the user's attributes (for example, sex, occupation, type of business, business category, and so on), and the storage unit may be configured to store the specific input content in association with the user's attributes. In this case, the history display unit may switch the display of the history according to the user's attributes.
 The speech translation device according to one aspect of the present disclosure may further include a library creation unit that creates a library for each attribute from the specific input content stored in association with the user's attributes. The library for each attribute may then be shareable between the user and other users (that is, among a plurality of users).
 A speech translation method according to one aspect of the present disclosure uses a speech translation device including an input unit, a storage unit, a translation unit, an output unit, and a history display unit, and includes the steps of inputting the user's speech, storing the content of the input speech, translating the content of the input speech into content in a different language, outputting the translated content as speech and/or text, and displaying a history of the input content. In the storing step, specific input content is stored separately from the other input content in the history, either at the user's instruction or based on input frequency. In the step of displaying the history, the specific input content is displayed so that the user can select it. In the translating step, when specific input content is selected, that specific input content is translated into content in a different language.
 A speech translation program according to one aspect of the present disclosure causes a computer (not limited to a single computer or a single kind; a plurality of computers, or plural kinds, are also possible, and the same applies below) to function as an input unit for inputting a user's speech, a storage unit for storing the content of the input speech, a translation unit that translates the content of the input speech into content in a different language, an output unit that outputs the translated content as speech and/or text, and a history display unit that displays a history of the input content. The program causes the storage unit to store specific input content separately from the other input content in the history, either at the user's instruction or based on input frequency. It also causes the history display unit to display the specific input content so that the user can select it. Further, it causes the translation unit, when specific input content is selected, to translate that specific input content into content in a different language.
 As methods of acquiring information on the "attributes", examples include having the user fill in a user-information registration screen when starting to use the service of the speech translation device, or when installing the application (the speech translation program) on a computer such as an information terminal, and having the user answer a questionnaire about attributes when using the speech translation device.
 According to the present invention, a history of the input content of speech uttered by the user is kept, specific input content such as frequent phrases is stored from that history, and the specific input content is displayed so that the user can select it. By selecting a desired phrase from the specific input content, the user is spared the trouble of uttering frequent phrases each time, which reduces the burden on the user and improves convenience. Moreover, the occurrence of mistranslation can be prevented, so the accuracy of speech translation can also be improved easily and effectively. And because preventing mistranslation improves the accuracy of speech translation in this way, processing becomes faster, memory is saved, the amount of communication data is reduced, and the reliability of processing can be increased.
FIG. 1 is a system block diagram schematically showing a preferred embodiment of the network configuration and the like of a speech translation apparatus according to the present disclosure. FIG. 2 is a system block diagram schematically showing an example of the configuration of a user apparatus (information terminal) in the speech translation apparatus according to the present disclosure. FIG. 3 is a system block diagram schematically showing an example of the configuration of a server in the speech translation apparatus according to the present disclosure. FIG. 4 is a flowchart showing an example of (part of) the flow of processing in the speech translation apparatus according to the present disclosure. FIGS. 5(A) to 5(D) are plan views showing an example of the transition of the display screen on the information terminal.
 Hereinafter, embodiments of the present invention will be described in detail. The following embodiments are examples for explaining the present invention and are not intended to limit the present invention to those embodiments alone. The present disclosure can be modified in various ways without departing from its gist. Those skilled in the art can adopt embodiments in which each element described below is replaced by an equivalent, and such embodiments are also included in the scope of the present invention. Positional relationships such as up, down, left, and right, where indicated, are based on the illustrated arrangement unless otherwise noted. The various dimensional ratios in the drawings are not limited to the ratios shown.
(Device configuration)
 FIG. 1 is a system block diagram schematically showing a preferred embodiment of the network configuration and the like of a speech translation apparatus according to the present disclosure. In this example, the speech translation apparatus 100 includes a server 20 that is electronically connected via a network N to an information terminal 10 (user apparatus) used by users (a speaker and another speaker) (though the configuration is not limited to this).
 The information terminal 10 employs, for example, a user interface such as a touch panel and a highly visible display. The information terminal 10 here is a portable tablet terminal device, including mobile phones typified by smartphones, that has a function for communicating with the network N. The information terminal 10 further includes a processor 11, a storage resource 12, a voice input/output device 13, a communication interface 14, an input device 15, a display device 16, and a camera 17. By running the installed speech translation application software (at least part of the speech translation program according to an embodiment of the present disclosure), the information terminal 10 functions as part or all of the speech translation device according to an embodiment of the present disclosure.
 The processor 11 is composed of an arithmetic and logic unit and various registers (a program counter, data registers, instruction registers, general-purpose registers, and so on). The processor 11 interprets and executes the speech translation application software, which is the program P10 stored in the storage resource 12, and performs various kinds of processing. The speech translation application software serving as the program P10 can be distributed from the server 20 through the network N, for example, and may be installed and updated manually or automatically.
 The network N is a communication network in which wired networks (such as a local area network (LAN), a wide area network (WAN), and a value-added network (VAN)) and wireless networks (such as a mobile communication network, a satellite communication network, Bluetooth (registered trademark), WiFi (Wireless Fidelity), and HSDPA (High Speed Downlink Packet Access)) are mixed.
 The storage resource 12 is a logical device provided by the storage area of a physical device (for example, a computer-readable recording medium such as a semiconductor memory), and stores the operating system program, driver programs, various data, and the like used in the processing of the information terminal 10. Examples of the driver programs include an input/output device driver program for controlling the voice input/output device 13, an input device driver program for controlling the input device 15, and an output device driver program for controlling the display device 16. The voice input/output device 13 is, for example, a general-purpose microphone and a sound player capable of reproducing sound data.
 The communication interface 14 provides, for example, a connection interface with the server 20 and is composed of a wireless communication interface and/or a wired communication interface. The input device 15 provides an interface for accepting input operations by tap actions on icons, buttons, a virtual keyboard, and the like displayed on the display device 16; besides the touch panel, various input devices externally attached to the information terminal 10 can also serve as examples.
 The display device 16 provides various information as an image display interface to the user and, as needed, to the conversation partner; examples include an organic EL display, a liquid crystal display, and a CRT display. The camera 17 is for capturing still images and moving images of various subjects.
 The server 20 is constituted by, for example, a host computer with high arithmetic processing capability, and provides its server functions by running predetermined server programs on that host computer; it is composed of one or more host computers functioning, for example, as a speech recognition server, a translation server, and a speech synthesis server (shown as a single computer in the drawing, but not limited to this). Each server 20 includes a processor 21, a communication interface 22, and a storage resource 23 (storage unit).
 The processor 21 is composed of an arithmetic and logic unit for processing arithmetic operations, logical operations, bit operations, and the like, together with various registers (program counter, data registers, instruction registers, general-purpose registers, etc.); it interprets and executes the program P20 stored in the storage resource 23 and outputs predetermined processing results. The communication interface 22 is a hardware module for connecting to the information terminal 10 via the network N, for example a modulation/demodulation device such as an ISDN modem, an ADSL modem, a cable modem, an optical modem, or a soft modem.
 The storage resource 23 is a logical device provided by the storage area of a physical device (a computer-readable recording medium such as a disk drive or semiconductor memory), and stores one or more each of the program P20, various modules L20, various databases D20, and various models M20.
 The program P20 is the server program described above, which is the main program of the server 20. The various modules L20 are software modules (modularized subprograms) that are called and executed as needed while the program P20 runs, in order to carry out the series of information processing for requests and information sent from the information terminal 10. Examples of such modules L20 include a speech recognition module, a translation module, and a speech synthesis module.
 The various databases D20 include the corpora required for speech translation processing (for example, in the case of Japanese-English speech translation: a Japanese speech corpus, an English speech corpus, a Japanese character (vocabulary) corpus, an English character (vocabulary) corpus, a Japanese dictionary, an English dictionary, a Japanese-English bilingual dictionary, a Japanese-English bilingual corpus, and so on), a speech database described later, and a management database for managing information about users. The various models M20 include the acoustic models and language models used for the speech recognition described later.
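 As an illustration only, such resources might be organized per language pair roughly as follows. This is a minimal sketch assuming hypothetical resource names; it is not the actual layout of the storage resource 23.

```python
# Hypothetical registry mapping a language pair to the modules L20,
# databases D20, and models M20 it needs.  All names are illustrative.
RESOURCES = {
    "modules": {
        "speech_recognition": "asr_module",
        "translation": "mt_module",
        "speech_synthesis": "tts_module",
    },
    "databases": {
        ("ja", "en"): [
            "ja_speech_corpus", "en_speech_corpus",
            "ja_vocab_corpus", "en_vocab_corpus",
            "ja_dictionary", "en_dictionary",
            "ja_en_bilingual_dictionary", "ja_en_bilingual_corpus",
        ],
    },
    "models": {"acoustic": "acoustic_model", "language": "language_model"},
}

def load_resources(src: str, tgt: str) -> dict:
    """Collect everything needed for one language pair (here ja -> en)."""
    return {
        "modules": RESOURCES["modules"],
        "databases": RESOURCES["databases"][(src, tgt)],
        "models": RESOURCES["models"],
    }
```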
 An example of the operation and behavior of the speech translation processing in the speech translation apparatus 100 configured as described above is further described below.
(First Embodiment)
 FIG. 4 is a flowchart showing an example of (part of) the processing flow in the speech translation apparatus 100 of this embodiment. FIGS. 5(A) to 5(D) are plan views showing an example of display screen transitions on the information terminal 10. Here, a conversation is assumed in which the user of the information terminal 10 is a restaurant clerk who speaks Japanese and the conversation partner is a customer who speaks English, that is, the input language is Japanese and the translation language is English (though the invention is not limited to this).
 First, when the user (clerk) starts the application (step SU1), a customer language selection screen is displayed on the display device 16 (FIG. 5(A); step SJ1). This language selection screen shows a Japanese text T21 asking the customer about his or her language, an English text T22 to the same effect, and language buttons 61 indicating several expected representative languages (here, English, Chinese (two variants, e.g. by script), and Korean).
 At this time, as shown in FIG. 5(A), the Japanese text T21 and the English text T22 are displayed by the processor 11 and the display device 16 in separate regions of the screen of the information terminal 10, oriented in opposite directions (different directions; upside down relative to each other in the figure). As a result, when the user and the customer converse face to face, the user can easily read the Japanese text T21 while the customer can easily read the English text T22. Moreover, since the texts T21 and T22 are displayed in separate regions, there is the advantage that the two are clearly distinguished and even easier to read.
 The user can show the text T22 on the language selection screen to the customer and have the customer tap the English button, or can select the customer's language himself or herself. Once the customer's language has been selected in this way, a standby screen for Japanese and English voice input is displayed as the home screen (FIG. 5(B); step SJ2). This standby screen shows a text T23 asking which of the user's and customer's languages will be spoken, a Japanese input button 62a for Japanese voice input, and an English input button 62b for English voice input. The standby screen also shows a history button 63 for displaying the history of input contents, a language selection button 64 for returning to the language selection screen and switching the customer's language (redoing the language selection), and a settings button 65 for configuring various settings of the application software.
 FIG. 4 shows the flow branching on whether or not the user taps the history button 63 (step SU2); in normal speech translation processing, however, voice input can be performed directly from the standby screen shown in FIG. 5(B). The flow of speech translation processing in that case (that is, "No" in step SU2) is described first.
[Normal speech translation processing]
 On this standby screen, when the user (clerk) taps the Japanese input button 62a to select Japanese voice input, the terminal becomes ready for voice input. In this state, when the user utters something to convey to the customer, voice input is performed through the voice input/output device 13 (step SJ3). The processor 11 of the information terminal 10 generates an audio signal based on that voice input and transmits the signal to the server 20 through the communication interface 14 and the network N. In this way, the information terminal 10 itself, or the processor 11 and the voice input/output device 13, functions as the "input unit".
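 A minimal client-side sketch of this step follows; the endpoint URL and payload layout are assumptions for illustration and are not specified in this disclosure.

```python
# Hypothetical "input unit" client: forward a recorded utterance (step SJ3)
# to the server over the network N.
import requests  # third-party HTTP client

SERVER_URL = "https://translation-server.example.com/recognize"  # hypothetical

def send_utterance(audio_bytes: bytes, language: str = "ja") -> dict:
    """Send the raw audio signal to the server and return its JSON reply."""
    response = requests.post(
        SERVER_URL,
        files={"audio": ("utterance.wav", audio_bytes, "audio/wav")},
        data={"language": language},
        timeout=10,
    )
    response.raise_for_status()
    return response.json()  # e.g. {"recognized": "...", "translated": "..."}
```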
 The processor 21 of the server 20 receives the audio signal through the communication interface 22 and performs speech recognition processing (step SJ4). At this time, the processor 21 calls the necessary module L20, database D20, and model M20 (speech recognition module, Japanese speech corpus, acoustic model, language model, etc.) from the storage resource 23 and converts the "sound" of the input speech into a "reading" (characters). In this way, the processor 21, or the server 20 as a whole, functions as the "speech recognition server".
 When the input speech has been recognized, the processor 21 proceeds to multilingual translation processing, which translates the "reading" (characters) of the recognized speech into another language (step SJ5). At this time, the processor 21 calls the necessary modules L20 and databases D20 (translation module, Japanese character corpus, Japanese dictionary, English dictionary, Japanese-English bilingual dictionary, Japanese-English bilingual corpus, etc.) from the storage resource 23, appropriately rearranges the "reading" (character string) of the recognized input speech and converts it into Japanese phrases, clauses, sentences, and the like, extracts the English corresponding to that conversion result, and rearranges the extracted words according to English grammar to produce natural English phrases, clauses, sentences, and the like. In this way, the processor 21 also functions as the "translation unit", and the server 20 as a whole also functions as the "translation server". If the input speech was not recognized correctly, the voice can be input again (screen display not shown).
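 The following deliberately simplified stand-in illustrates the lookup idea behind step SJ5 for fixed phrases; a real translation module would segment, reorder, and apply target-language grammar using the bilingual dictionaries and corpora named above. The dictionary entries here are hypothetical.

```python
# Toy phrase dictionary standing in for the Japanese-English bilingual
# resources; entries are illustrative only.
JA_EN_DICT = {
    "いらっしゃいませ": "welcome",
    "何名様ですか": "how many people are in your party",
}

def translate(reading: str) -> str:
    """Map a recognized Japanese 'reading' to an English rendering."""
    # Exact-match lookup covers fixed phrases; a full translation module
    # would convert arbitrary readings into phrases, clauses, and sentences.
    return JA_EN_DICT.get(reading, f"<no translation for: {reading}>")
```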
 The processor 21 also stores the content of the recognized input speech in the storage resource 23. When the multilingual translation processing and the storage of the input speech content are complete, the processor 21 proceeds to speech synthesis processing (step SJ6). At this time, the processor 21 calls the necessary module L20, database D20, and model M20 (speech synthesis module, English speech corpus, acoustic model, language model, etc.) from the storage resource 23 and converts the English phrases, clauses, sentences, and the like that constitute the translation result into natural speech. In this way, the processor 21 also functions as the "speech synthesis unit", and the server 20 as a whole also functions as the "speech synthesis server".
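 Putting steps SJ4 to SJ6 together, the server-side flow might be orchestrated roughly as sketched below, with toy stand-ins for each module; the function names are assumptions, not this disclosure's API.

```python
# Hypothetical orchestration of the server-side pipeline SJ4 -> SJ5 -> SJ6.
history: list[str] = []  # stands in for the storage resource 23

def recognize(audio: bytes) -> str:
    # Toy stand-in for SJ4: a real module maps sound to a reading (characters).
    return "いらっしゃいませ"

def translate(reading: str) -> str:
    # Toy stand-in for SJ5 (see the dictionary sketch above).
    return "welcome"

def synthesize(text: str) -> bytes:
    # Toy stand-in for SJ6: a real module returns synthesized audio.
    return text.encode("utf-8")

def handle_utterance(audio: bytes) -> bytes:
    reading = recognize(audio)       # SJ4: sound -> reading
    history.append(reading)          # store the recognized input content
    translated = translate(reading)  # SJ5: multilingual translation
    return synthesize(translated)    # SJ6: speech for output in step SJ7
```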
 Next, the processor 21 generates an audio signal for voice output based on the synthesized speech and transmits it to the information terminal 10 through the communication interface 22 and the network N. The processor 11 of the information terminal 10 receives the audio signal through the communication interface 14 and performs voice output processing (step SJ7).
[Translation processing from the history display]
 On the other hand, when the user taps the history button on the standby screen shown in FIG. 5(B) to select a display of the input speech history so far ("Yes" in step SU2), the processor 11 of the information terminal 10 transmits a command signal for displaying the history to the server 20. On receiving that command signal, the processor 21 of the server 20 reads out the input speech contents stored in the storage resource 23 and displays, for example, the history display screen shown in FIG. 5(C) on the display device 16 (step SJ8). On this history display screen, the contents that have so far been input by voice and translated are displayed as text, for example phrase by phrase. On the same screen, above the list of texts, a display order selection button 66 is shown for switching the order of the list of input speech contents between, for example, "latest order" and "frequency order". By tapping the display order selection button 66, the user can switch between the "latest order" list and the "frequency order" list as desired.
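 A minimal sketch of the "latest order"/"frequency order" toggle, assuming the history is held as a simple list of recognized strings:

```python
from collections import Counter

def order_history(entries: list[str], mode: str = "latest") -> list[str]:
    """Return unique history entries, newest first or most frequent first."""
    if mode == "latest":
        seen: dict[str, None] = {}
        for e in reversed(entries):  # newest entries sit at the end
            seen.setdefault(e, None)
        return list(seen)
    if mode == "frequency":
        return [e for e, _ in Counter(entries).most_common()]
    raise ValueError(f"unknown mode: {mode}")
```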
 Furthermore, on the history display screen shown in FIG. 5(C), a pin-shaped design P, for example, is displayed attached to the text of each input speech content. By tapping this pin-shaped design P, the user can select, from among the input speech contents shown on the history display screen, those the user utters frequently or those that amount to fixed phrases, and thus "clip" them by pinning them, so to speak.
 For example, suppose the user taps the pin-shaped designs P of the input contents displayed as the texts T31, T32, and T33 (specific input contents) among the input contents listed in FIG. 5(C) ("Yes" in step SU3). The processor 11 of the information terminal 10 then moves the input contents of the texts T31, T32, and T33 to the upper region R1 of the screen and displays them together, while moving the other input contents to the lower region R2 of the screen and displaying them together, thereby visually distinguishing the two groups (step SJ9). In addition, near the upper region R1, a text T23 indicating that these are the input contents clipped with pins is clearly shown.
 Further, at this time, the processor 11 of the information terminal 10 transmits to the server 20 a command signal indicating that the input contents of the texts T31, T32, and T33 have been selected by the user. On receiving that command signal, the processor 21 of the server 20 stores the input contents (specific input contents) of the texts T31, T32, and T33 held in the storage resource 23 in a manner distinguished from the other input contents, for example by setting a flag on them.
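 A minimal sketch of how the storage unit might flag clipped entries and split them into the regions R1 and R2; the field names are assumptions, not this disclosure's schema.

```python
from dataclasses import dataclass, field

@dataclass
class HistoryEntry:
    text: str
    pinned: bool = False  # the flag set when the pin design P is tapped

@dataclass
class HistoryStore:
    entries: list[HistoryEntry] = field(default_factory=list)

    def pin(self, text: str) -> None:
        for e in self.entries:
            if e.text == text:
                e.pinned = True

    def unpin(self, text: str) -> None:  # tapping the x-mark design 67
        for e in self.entries:
            if e.text == text:
                e.pinned = False

    def split(self) -> tuple[list[HistoryEntry], list[HistoryEntry]]:
        """Pinned entries for region R1, the rest for region R2."""
        pinned = [e for e in self.entries if e.pinned]
        others = [e for e in self.entries if not e.pinned]
        return pinned, others
```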
 On the history display screen shown in FIG. 5(D), an x-mark design 67 is displayed, in place of the pin-shaped design P, attached to the input contents of each of the pinned texts T31, T32, and T33. By tapping the x-mark design 67 as needed, the user can unpin each of the texts T31, T32, and T33. In that case, in response to a command signal from the processor 11 of the information terminal 10, the processor 21 of the server 20 removes the flag from the input content that had been stored in the storage resource 23 with, for example, that flag set.
 Next, instead of uttering a question or the like to the customer, the user can select the desired input content from among the pinned texts T31, T32, and T33. For example, when the user taps and selects the text T31 ("Yes" in step SU4), a corresponding command signal is transmitted from the processor 11 of the information terminal 10 to the server 20. On receiving that command signal, the processor 21 of the server 20 sequentially executes multilingual translation processing (step SJ5), speech synthesis processing (step SJ6), and voice output processing (step SJ7) on the content of the selected text T31. The user can thus have the parallel translation of a desired phrase or the like (specific input content) output without performing any voice input.
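 A minimal sketch of step SU4, with toy stand-ins (as in the pipeline sketch above) for the translation and synthesis modules: tapping a clipped phrase re-runs translation and synthesis on the stored text, with no new voice input.

```python
def translate(text: str) -> str:      # toy stand-in for SJ5
    return f"[en] {text}"

def synthesize(text: str) -> bytes:   # toy stand-in for SJ6
    return text.encode("utf-8")

def on_pinned_phrase_tapped(phrase: str) -> bytes:
    """Handle a tap on a clipped phrase: SJ5 -> SJ6, skipping voice input."""
    translated = translate(phrase)    # translate the stored text directly
    return synthesize(translated)     # audio to be played back in step SJ7
```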
 On the other hand, when no input content to be pinned is selected in step SU3 ("No" in step SU3), or when no specific input content is selected in place of an utterance in step SU4 ("No" in step SU4), the processor 21 of the server 20 sequentially executes the normal speech translation processing shown in steps SJ3 to SJ7 described above. Specifically, when the user taps the close button 68 on the history display screen shown in FIG. 5(C) or FIG. 5(D), the standby screen shown in FIG. 5(B) is displayed again on the display device 16, and processing can return to normal speech translation.
(Second Embodiment)
 Once the user has pinned and clipped some specific input content such as frequent phrases or fixed phrases, the apparatus may be configured so that, when the history button 63 is selected on the standby screen shown in FIG. 5(B), the history display screen shown in FIG. 5(D) is displayed directly, without the history display screen shown in FIG. 5(C) being displayed first. In this case, in the flow shown in FIG. 4, once step SU2 is executed, steps SJ8 and SU3 are skipped and step SJ9 is executed.
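 A minimal routing sketch of this behavior, assuming the storage unit exposes (text, pinned) pairs; the screen names are illustrative.

```python
def on_history_button(entries: list[tuple[str, bool]]) -> str:
    """entries: (text, pinned) pairs held by the storage unit."""
    if any(pinned for _, pinned in entries):
        return "pinned_view"      # FIG. 5(D): skip SJ8/SU3, run SJ9 directly
    return "full_history_view"    # FIG. 5(C): normal history display (SJ8)
```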
(Third Embodiment)
 In this embodiment, when the user starts the speech translation application (step SU1 shown in FIG. 4), an information registration screen for entering information about the user is displayed on the display device 16 of the information terminal 10, for example before the standby screen for selecting the target language of speech translation (FIG. 5(B)) is displayed, or after the target language has been selected. The information about the user is not particularly limited, but includes attribute information such as the occupation, industry, business type, age, gender, birthplace, and place of residence of the user (or of the user's store).
 In this state, when the user enters the user information, the processor 11 of the information terminal 10 generates an information signal based on that input and transmits the signal to the server 20 through the communication interface 14 and the network N. In this way, the information terminal 10 itself, or the processor 11, also functions as the "information acquisition unit".
 When the processor 21 of the server 20 receives that information signal through the communication interface 22, processing moves on to step SJ2 and the subsequent steps shown in FIG. 4. Then, when the user selects the input contents to be clipped, for example those displayed as the texts T31, T32, and T33, in step SU3 ("Yes" in step SU3), the history display screen shown in FIG. 5(C) or FIG. 5(D) is displayed, as in the first or second embodiment. Meanwhile, the processor 21 of the server 20 distinguishes the input contents (specific input contents) of the texts T31, T32, and T33 held in the storage resource 23 from the other input contents, for example by setting a flag on them, and stores them anew in association with the user's attributes.
 Here, when a plurality of users use the speech translation application, specific input contents clipped in association with each user's attributes accumulate in the storage resource 23 over time. Accordingly, in this embodiment, when a user's attribute information has been input from the information terminal 10 and the history button 63 is tapped, the processors 11 and 21 display, on the history display screen shown in FIG. 5(C) or FIG. 5(D), the specific input contents clipped in association with attributes that match some (or all) of that user's attributes.
 At this time, it is particularly useful for the processor 21 to extract or narrow down the specific input contents based on, among the user's attributes, the occupation, industry, or business type of the user (or of the user's store). The processor 21 may also compile the specific input contents thus extracted or narrowed down by the user's attributes, together with the corresponding translation contents, into a library per attribute and store it in the storage resource 23. A library per attribute created in this way is even more useful if it is shared among a plurality of users.
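 A minimal sketch of attribute-linked storage and per-attribute libraries, assuming occupation and business type as the key; the data layout and sharing model are assumptions for illustration.

```python
from collections import defaultdict

# (occupation, business type) -> list of (source phrase, cached translation)
libraries: dict[tuple[str, str], list[tuple[str, str]]] = defaultdict(list)

def store_pinned(phrase: str, translation: str,
                 occupation: str, business: str) -> None:
    """Keep a clipped phrase under the user's attribute key."""
    libraries[(occupation, business)].append((phrase, translation))

def phrases_for_user(occupation: str, business: str) -> list[tuple[str, str]]:
    """What the history screen shows for a user with matching attributes.
    Because the key is shared, other users with the same attributes see
    the same library."""
    return libraries[(occupation, business)]
```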
 According to the speech translation apparatus 100 configured as described above, and the speech translation method and speech translation program using it, specific input contents such as frequent phrases and fixed phrases can be clipped, so to speak, and stored out of the history of the contents of the speech the user has uttered. The user can therefore easily call up frequent phrases, fixed phrases, and the like, saving the trouble of uttering them each time. As a result, the burden on the user can be reduced and convenience improved; moreover, the occurrence of mistranslation can be effectively prevented, so that the accuracy of speech translation can also be improved simply and effectively.
 Furthermore, by storing the clipped specific input contents in association with the user's attributes and displaying them on the history display screen, frequent phrases, fixed phrases, and the like suited to the user's attributes can be selected efficiently. As a result, the frequent phrases and fixed phrases a user needs become easier to find, further reducing the burden on the user and further improving convenience. In particular, when the user is a store clerk who converses with customers in the course of business, frequent phrases and the like can be expected to be even more standardized; by extracting or narrowing down the specific input contents based on the occupation, industry, or business type among the user's attributes, the user's convenience can be enhanced still further, and even higher accuracy and efficiency of speech translation can be achieved.
 As noted above, each of the above embodiments is an example for explaining the present invention and is not intended to limit the present invention to that embodiment. Various modifications of the present disclosure are possible without departing from its gist. For example, those skilled in the art can replace the resources (hardware resources or software resources) described in the embodiments with equivalents, and such replacements are also included in the scope of the present disclosure.
 In each of the above embodiments, an example was described in which the speech recognition, translation, and speech synthesis processes are executed by the server 20, but these processes may instead be configured to run on the information terminal 10. In that case, the modules L20 used for those processes may be stored in the storage resource 12 of the information terminal 10 or in the storage resource 23 of the server 20. Likewise, the database D20 such as the speech database and/or the models M20 such as the acoustic model may be stored in the storage resource 12 of the information terminal 10 or in the storage resource 23 of the server 20. Accordingly, the speech translation apparatus need not include the network N and the server 20.
 Furthermore, instead of the user manually pinning items from among the specific input contents displayed on the history display screen shown in, for example, FIG. 5(C) or FIG. 5(D), the processor 21 of the server 20 may, for example, extract input contents whose frequency is higher than a predetermined frequency and automatically generate a database or library in which they are clipped. In this case, the input contents extracted by the processor 21 based on input frequency can be displayed on the screen obtained by switching the display order selection button 66 shown in FIG. 5(C) or FIG. 5(D) to "frequency order". Moreover, together with the clipped specific input content, the translation result into a given language, once produced, may also be stored (in association with that specific input content). For example, in the flow shown in FIG. 4, when the user taps and selects the text T31 ("Yes" in step SU4), the multilingual translation processing (step SJ5) may be skipped and the speech synthesis processing (step SJ6) executed directly.
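 A minimal sketch of these two variations, assuming a hypothetical frequency threshold: inputs above the threshold are clipped automatically, and a cached translation lets step SJ5 be skipped on reuse.

```python
from collections import Counter

FREQUENCY_THRESHOLD = 3  # hypothetical value

def auto_clip(history: list[str]) -> set[str]:
    """Clip every input whose frequency meets the threshold."""
    counts = Counter(history)
    return {text for text, n in counts.items() if n >= FREQUENCY_THRESHOLD}

translation_cache: dict[str, str] = {}

def translate_with_cache(text: str) -> str:
    """Reuse a stored translation so SJ5 can be skipped the second time."""
    if text in translation_cache:
        return translation_cache[text]   # skip SJ5 entirely
    result = f"[en] {text}"              # toy stand-in for the translator
    translation_cache[text] = result
    return result
```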
 Of course, a gateway server or the like that converts the communication protocol between the information terminal 10 and the network N may be interposed between them. The information terminal 10 is not limited to a portable device, and may be, for example, a desktop personal computer, a notebook personal computer, a tablet personal computer, or a laptop personal computer.
 The speech translation apparatus according to the present disclosure may also comprise:
 an input unit for inputting a user's voice;
 a storage unit for storing the contents of the input voice;
 a translation unit that translates the contents of the input voice into contents of a different language;
 an output unit that outputs the translated contents as voice and/or text; and
 a history display unit that displays the history of the input contents,
 wherein the history display unit displays, attached to each input content in the history, a design indicating that the user can select a specific input content from the history;
 the storage unit, when the user selects the specific input content from the history using the design, stores the specific input content distinguished from the other input contents;
 the history display unit displays the specific input content and the other input contents in a visually distinguished manner, and displays the visually distinguished specific input contents so that the user can select a desired input content from among them; and
 the translation unit may, when the desired input content is selected by the user, translate the desired input content into contents of a different language.
 Furthermore, when the history display unit displays the specific input content and the other input contents in a visually distinguished manner, it may display, attached to the specific input content, a design indicating that the user can remove unnecessary input contents from the specific input contents.
 The apparatus may further comprise an information acquisition unit that acquires information about the user's attributes, wherein the storage unit stores the specific input content in association with the user's attributes, and the history display unit may switch the display of the history according to the user's attributes.
 The apparatus may further comprise a library creation unit that creates a library per attribute from the specific input contents stored in association with the user's attributes.
 Alternatively, the library per attribute may be one that can be shared between the user and other users.
 The speech translation method according to the present disclosure uses a speech translation apparatus comprising an input unit, a storage unit, a translation unit, an output unit, and a history display unit, and includes:
 a step of inputting a user's voice;
 a step of storing the contents of the input voice;
 a step of translating the contents of the input voice into contents of a different language;
 a step of outputting the translated contents as voice and/or text; and
 a step of displaying the history of the input contents,
 wherein in the step of displaying the history, a design indicating that the user can select a specific input content from the history is displayed attached to each input content in the history;
 in the storing step, when the user selects the specific input content from the history using the design, the specific input content is stored distinguished from the other input contents;
 in the step of displaying the history, the specific input content and the other input contents are displayed in a visually distinguished manner, and the visually distinguished specific input contents are displayed so that the user can select a desired input content from among them; and
 in the translating step, when the desired input content is selected by the user, the desired input content may be translated into contents of a different language.
 The speech translation program according to the present disclosure causes a computer to function as:
 an input unit for inputting a user's voice;
 a storage unit for storing the contents of the input voice;
 a translation unit that translates the contents of the input voice into contents of a different language;
 an output unit that outputs the translated contents as voice and/or text; and
 a history display unit that displays the history of the input contents,
 and causes the history display unit to display, attached to each input content in the history, a design indicating that the user can select a specific input content from the history;
 causes the storage unit, when the user selects the specific input content from the history using the design, to store the specific input content distinguished from the other input contents;
 causes the history display unit to display the specific input content and the other input contents in a visually distinguished manner, and to display the visually distinguished specific input contents so that the user can select a desired input content from among them; and
 may cause the translation unit, when the desired input content is selected by the user, to translate the desired input content into contents of a different language.
 According to the present invention, the burden on the user in speech translation processing can be reduced and convenience improved, and moreover the occurrence of mistranslation can be prevented so that the accuracy of speech translation can be improved simply and effectively. The invention can therefore be widely used in activities such as the design, manufacture, provision, and sale of programs, apparatuses, systems, and methods in fields that provide services related to conversation between people who cannot understand each other's language.
10 Information terminal
11 Processor
12 Storage resource
13 Voice input/output device
14 Communication interface
15 Input device
16 Display device
17 Camera
20 Server
21 Processor
22 Communication interface
23 Storage resource
61 Language buttons
62a Japanese input button
62b English input button
63 History button
64 Language selection button
65 Settings button
66 Display order selection button
67 X-mark design
68 Close button
100 Speech translation apparatus
D20 Databases
L20 Modules
M20 Models
N Network
P Pin-shaped design
P10 Program
P20 Program
R1 Upper region
R2 Lower region
T21, T22, T23 Texts
T31, T32, T33 Texts (specific input contents)

Claims (8)

  1.  A speech translation apparatus comprising:
     an input unit for inputting a user's voice;
     a storage unit for storing the contents of the input voice;
     a translation unit that translates the contents of the input voice into contents of a different language;
     an output unit that outputs the translated contents as voice and/or text; and
     a history display unit that displays the history of the input contents,
     wherein the storage unit stores, in accordance with an instruction from the user or based on input frequency, a specific input content from the history distinguished from the other input contents;
     the history display unit displays the specific input content so that the user can select it; and
     the translation unit, when the specific input content is selected, translates the specific input content into contents of a different language.
  2.  The speech translation apparatus according to claim 1, further comprising an information acquisition unit that acquires information about the user's attributes,
     wherein the storage unit stores the specific input content in association with the user's attributes.
  3.  The speech translation apparatus according to claim 1, wherein the history display unit switches the display of the history according to the user's attributes.
  4.  The speech translation apparatus according to claim 2, further comprising a library creation unit that creates a library per attribute from the specific input contents stored in association with the user's attributes.
  5.  The speech translation apparatus according to claim 4, wherein the library per attribute is one that can be shared between the user and other users.
  6.  A speech translation apparatus comprising:
     an input unit for inputting a user's voice;
     a storage unit for storing the contents of the input voice;
     a translation unit that translates the contents of the input voice into contents of a different language;
     an output unit that outputs the translated contents as voice and/or text; and
     a history display unit that displays the history of the input contents,
     wherein the history display unit displays, attached to each input content in the history, a design indicating that the user can select a specific input content from the history;
     the storage unit, when the user selects the specific input content from the history using the design, stores the specific input content distinguished from the other input contents;
     the history display unit displays the specific input content and the other input contents in a visually distinguished manner, and displays the visually distinguished specific input contents so that the user can select a desired input content from among them; and
     the translation unit, when the desired input content is selected by the user, translates the desired input content into contents of a different language.
  7.  A speech translation method using a speech translation apparatus comprising an input unit, a storage unit, a translation unit, an output unit, and a history display unit, the method comprising:
     a step of inputting a user's voice;
     a step of storing the contents of the input voice;
     a step of translating the contents of the input voice into contents of a different language;
     a step of outputting the translated contents as voice and/or text; and
     a step of displaying the history of the input contents,
     wherein in the storing step, a specific input content from the history is stored, in accordance with an instruction from the user or based on input frequency, distinguished from the other input contents;
     in the step of displaying the history, the specific input content is displayed so that the user can select it; and
     in the translating step, when the specific input content is selected, the specific input content is translated into contents of a different language.
  8.  A speech translation program causing a computer to function as:
     an input unit for inputting a user's voice;
     a storage unit for storing the contents of the input voice;
     a translation unit that translates the contents of the input voice into contents of a different language;
     an output unit that outputs the translated contents as voice and/or text; and
     a history display unit that displays the history of the input contents,
     the program causing the storage unit to store, in accordance with an instruction from the user or based on input frequency, a specific input content from the history distinguished from the other input contents;
     causing the history display unit to display the specific input content so that the user can select it; and
     causing the translation unit, when the specific input content is selected, to translate the specific input content into contents of a different language.
PCT/JP2017/000564 2016-01-13 2017-01-11 Speech translation device, speech translation method, and speech translation program WO2017122657A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2016004337A JP5998298B1 (en) 2016-01-13 2016-01-13 Speech translation device, speech translation method, and speech translation program
JP2016-004337 2016-09-05

Publications (1)

Publication Number Publication Date
WO2017122657A1

Family

ID=56997641

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2017/000564 WO2017122657A1 (en) 2016-01-13 2017-01-11 Speech translation device, speech translation method, and speech translation program

Country Status (2)

Country Link
JP (1) JP5998298B1 (en)
WO (1) WO2017122657A1 (en)


Also Published As

Publication number Publication date
JP5998298B1 (en) 2016-09-28
JP2017126152A (en) 2017-07-20


Legal Events

Date Code Title Description
121: EP — not used; entry reads: the EPO has been informed by WIPO that EP was designated in this application (ref document number 17738411; country of ref document: EP; kind code of ref document: A1)
NENP: non-entry into the national phase (ref country code: DE)
122: PCT application non-entry in European phase (ref document number 17738411; country of ref document: EP; kind code of ref document: A1)