CN111739511A - Speech translation device, speech translation method, and recording medium


Info

Publication number
CN111739511A
Authority
CN
China
Prior art keywords
speech
unit
speaker
voice
language
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010185150.XA
Other languages
Chinese (zh)
Inventor
古川博基
坂口敦
西川刚树
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Panasonic Holdings Corp
Original Assignee
Matsushita Electric Industrial Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Matsushita Electric Industrial Co Ltd filed Critical Matsushita Electric Industrial Co Ltd
Publication of CN111739511A

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/005 - Language recognition
    • G - PHYSICS
    • G01 - MEASURING; TESTING
    • G01S - RADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
    • G01S3/00 - Direction-finders for determining the direction from which infrasonic, sonic, ultrasonic, or electromagnetic waves, or particle emission, not having a directional significance, are being received
    • G01S3/80 - Direction-finders for determining the direction from which infrasonic, sonic, ultrasonic, or electromagnetic waves, or particle emission, not having a directional significance, are being received, using ultrasonic, sonic or infrasonic waves
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/225 - Feedback of the input speech

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Machine Translation (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The invention provides a speech translation device, a speech translation method, and a recording medium. A speech translation device (1) translates a conversation between a 1 st speaker, who speaks in a 1 st language, and a 2 nd speaker, who speaks in a 2 nd language different from the 1 st language, and comprises: a voice detection unit (22) that detects, from the voice input to a voice input unit (21), the voice sections in which the 1 st speaker and the 2 nd speaker speak; a display unit (27) that displays the result of translating the speech-recognized voice of a detected voice section from the 1 st language into the 2 nd language, and the result of translating from the 2 nd language into the 1 st language; and a speech instruction unit (25) that, after the 1 st speaker speaks, outputs via the display unit content prompting the 2 nd speaker to speak in the 2 nd language, and, after the 2 nd speaker speaks, outputs via the display unit content prompting the 1 st speaker to speak in the 1 st language.

Description

Speech translation device, speech translation method, and recording medium
Technical Field
The present application relates to a speech translation apparatus, a speech translation method, and a recording medium.
Background
For example, patent document 1 discloses a translation system including: a voice input unit that converts the voices uttered by a 1 st language speaker and by a 2 nd language speaker, who is the conversation partner of the 1 st language speaker, into voice data and outputs the voice data; an input switch operated to indicate the period during which the 1 st language speaker is speaking and the period during which the 1 st language speaker is not speaking; and a voice output unit that translates the input voice data, converts the translation result into voice, and outputs the voice.
(Prior art documents)
(Patent documents)
Patent document 1: Patent No. 3891023
However, in the technique disclosed in patent document 1, the input switch must be operated every time the 1 st speaker or the 2 nd speaker takes a turn in the conversation, so the operation becomes complicated. Because the switch has to be operated at every utterance, the period of use of the translation system becomes longer.
Further, when the 1 st speaker and the 2 nd speaker pass the translation system between them, the non-holder of the translation system generally does not understand how to operate it. Operating the translation system therefore takes time and effort, which also lengthens the period of use. In this way, the conventional translation system has the problem that its energy consumption increases as the period of use increases.
Disclosure of Invention
Therefore, an object of the present application is to provide a speech translation apparatus, a speech translation method, and a recording medium that, with simple operation, can suppress an increase in the energy consumption of the speech translation apparatus.
A speech translation apparatus according to an aspect of the present application is used for a conversation between a 1 st speaker and a 2 nd speaker, the 1 st speaker speaking in a 1 st language, the 2 nd speaker being the conversation partner of the 1 st speaker and speaking in a 2 nd language different from the 1 st language. The speech translation apparatus comprises: a voice detection unit that detects, from the voice input to a voice input unit, the voice sections in which the 1 st speaker and the 2 nd speaker speak; a display unit that displays the result of translating the speech-recognized voice of a detected voice section from the 1 st language into the 2 nd language, and the result of translating from the 2 nd language into the 1 st language; and a speech instruction unit that, after the 1 st speaker speaks, outputs via the display unit content prompting the 2 nd speaker to speak in the 2 nd language, and, after the 2 nd speaker speaks, outputs via the display unit content prompting the 1 st speaker to speak in the 1 st language.
Note that some of these specific embodiments may be implemented as a system, a method, an integrated circuit, a computer program, or a computer-readable recording medium such as a CD-ROM, or as any combination of a system, a method, an integrated circuit, a computer program, and a recording medium.
Effects of the invention
With the speech translation apparatus and the like of the present application, an increase in energy consumption of the speech translation apparatus can be suppressed by a simple operation.
Drawings
Fig. 1A shows an example of the appearance of the speech translation apparatus according to embodiment 1 and a scenario in which the 1 st speaker and the 2 nd speaker use the speech translation apparatus when the 1 st speaker speaks.
Fig. 1B shows an example of the appearance of the speech translation apparatus according to embodiment 1 and a scenario in which the 1 st speaker and the 2 nd speaker use the speech translation apparatus when the 2 nd speaker speaks.
Fig. 1C shows another example of a use scenario of the speech translation apparatus when the 1 st speaker and the 2 nd speaker are in conversation.
Fig. 2 is a block diagram showing a speech translation apparatus according to embodiment 1.
Fig. 3 is a flowchart showing the operation of the speech translation apparatus according to embodiment 1.
Fig. 4 is a block diagram showing a speech translation apparatus according to embodiment 2.
Fig. 5 is a flowchart showing the operation of the speech translation apparatus in embodiment 2.
Fig. 6 is a flowchart showing the operation of the speech translation apparatus according to the modification of embodiment 2.
Fig. 7 is a block diagram showing a speech translation apparatus according to embodiment 3.
Fig. 8 is a flowchart showing the operation of the speech translation apparatus according to embodiment 3.
Fig. 9 is a block diagram showing a speech translation apparatus according to a modification of embodiment 3.
Fig. 10 is a block diagram showing a speech translation apparatus according to embodiment 4.
Fig. 11 is a flowchart showing the operation of the speech translation apparatus according to embodiment 4.
Description of the symbols
1, 1a, 1b, 1c, 1d speech translation device
21 voice input unit
22 voice detection unit
23 voice recognition unit
24 priority utterance input unit
25 speech instruction unit
26 translation unit
27 display unit
28 voice output unit
31 sound source direction estimation unit
31a control unit
32 input switching unit
41 1 st beam forming unit
42 2 nd beam forming unit
Detailed Description
A speech translation apparatus according to an aspect of the present application is used for a conversation between a 1 st speaker and a 2 nd speaker, the 1 st speaker speaking in a 1 st language, the 2 nd speaker being the conversation partner of the 1 st speaker and speaking in a 2 nd language different from the 1 st language. The speech translation apparatus comprises: a voice detection unit that detects, from the voice input to a voice input unit, the voice sections in which the 1 st speaker and the 2 nd speaker speak; a display unit that displays the result of translating the speech-recognized voice of a detected voice section from the 1 st language into the 2 nd language, and the result of translating from the 2 nd language into the 1 st language; and a speech instruction unit that, after the 1 st speaker speaks, outputs via the display unit content prompting the 2 nd speaker to speak in the 2 nd language, and, after the 2 nd speaker speaks, outputs via the display unit content prompting the 1 st speaker to speak in the 1 st language.
Accordingly, by detecting the respective voice sections from the conversation between the 1 st speaker and the 2 nd speaker, it is possible to obtain a translation result of translating the detected voice from the 1 st language to the 2 nd language, and to obtain a translation result of translating the detected voice from the 2 nd language to the 1 st language. That is, the speech translation apparatus can automatically translate the language of the detected speech into another language in accordance with the utterances of the 1 st speaker and the 2 nd speaker without performing an input operation for translation.
The speech translation apparatus can output content prompting the 2 nd speaker to speak after the 1 st speaker speaks, and content prompting the 1 st speaker to speak after the 2 nd speaker speaks. Accordingly, the 1 st speaker and the 2 nd speaker can recognize when it is their turn to speak without performing an input operation each time they start speaking.
As described above, the speech translation apparatus does not need to perform an input operation for starting speech, an input operation for switching languages, and the like, and therefore has excellent operability. That is, since the operation of the speech translation apparatus is not complicated, an increase in the period of use can be suppressed.
Therefore, in the speech translation apparatus, by simplifying the operation, it is possible to suppress an increase in energy consumption of the speech translation apparatus.
In particular, in the speech translation apparatus, since the operation can be simplified, it is possible to suppress an erroneous operation.
A speech translation method according to another aspect of the present application is used for a conversation between a 1 st speaker and a 2 nd speaker, the 1 st speaker speaking in a 1 st language, the 2 nd speaker being the conversation partner of the 1 st speaker and speaking in a 2 nd language different from the 1 st language. The speech translation method includes: detecting, from the voice input to a voice input unit, the voice sections in which the 1 st speaker and the 2 nd speaker speak; performing voice recognition on the voice of a detected voice section and displaying, on a display unit, the result of translating the recognized voice from the 1 st language into the 2 nd language and the result of translating from the 2 nd language into the 1 st language; outputting, via the display unit, content prompting the 2 nd speaker to speak in the 2 nd language after the 1 st speaker speaks; and outputting, via the display unit, content prompting the 1 st speaker to speak in the 1 st language after the 2 nd speaker speaks.
This speech translation method can achieve the same operational effects as those of the speech translation apparatus described above.
A recording medium according to another aspect of the present application is a computer-readable non-transitory recording medium on which a program for causing a computer to execute a speech translation method is recorded.
The same effects as those of the above-described speech translation apparatus can be achieved even in such a recording medium.
The speech translation apparatus according to another aspect of the present application further includes a priority utterance input unit that, when the speech of the 1 st speaker or the 2 nd speaker has been voice-recognized, causes the next speech of that same speaker to be voice-recognized again with priority.
Accordingly, for example, when the 1 st speaker or the 2 nd speaker misspeaks during the conversation, or when an ambiguous voice is translated midway, operating the priority utterance input unit gives priority to the speaker who has just spoken, so that this speaker gets a chance to speak again. Therefore, even if the process has shifted to voice recognition of the other speaker's speech after the speech of one of the 1 st speaker and the 2 nd speaker was recognized, the priority utterance input unit can return the process to voice recognition of that speaker's speech. The speech translation apparatus can thus reliably obtain the voices of the 1 st speaker and the 2 nd speaker and output the translation results.
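As one illustration of this turn handling, the following minimal Python sketch (not part of the patent disclosure; the class, its method names, and the two-speaker state machine are assumptions) alternates the expected language after each utterance and hands the turn back when the priority key is pressed:

```python
# A minimal sketch (an assumption, not from the patent) of turn handling
# around the priority utterance input unit: the expected language normally
# alternates after each utterance, and pressing the priority key returns
# the turn to the speaker who has just spoken.

class TurnManager:
    def __init__(self, lang_1st: str, lang_2nd: str):
        self.langs = (lang_1st, lang_2nd)
        self.current = 0  # index of the speaker expected to speak next

    def expected_language(self) -> str:
        return self.langs[self.current]

    def utterance_finished(self) -> None:
        # Normal flow: after one speaker's utterance is translated,
        # the other speaker is prompted to speak.
        self.current = 1 - self.current

    def priority_pressed(self) -> None:
        # Priority key: hand the turn back so that the previous
        # speaker's speech is recognized again.
        self.current = 1 - self.current
```

For example, after the 1 st speaker's utterance, expected_language() returns the 2 nd language; pressing the priority key flips the state back so that the 1 st speaker's next utterance is again recognized in the 1 st language.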
A speech translation apparatus according to another aspect of the present application includes: a speech input unit to which speech of a conversation between the 1 st speaker and the 2 nd speaker is input; a voice recognition unit that performs voice recognition on the voice in the voice section detected by the voice detection unit to convert the voice into a text file; a translation section that translates the text file converted by the voice recognition section from the 1 st language into the 2 nd language, and from the 2 nd language into the 1 st language; and a voice output unit that outputs the result translated by the translation unit in voice.
In this way, after the input speech is recognized, the language of the speech can be translated into the other language. That is, the speech translation apparatus performs everything from obtaining the speech of the conversation between the 1 st speaker and the 2 nd speaker to outputting the translation result. The speech translation apparatus can therefore translate the respective speech of the 1 st speaker and the 2 nd speaker without communicating with an external server, and can be used even in an environment where communication with an external server is difficult.
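As an illustration, the on-device flow from speech input to translated output can be sketched as follows; the recognize, translate, and synthesize callables are placeholders for offline engines, which the patent does not name:

```python
from typing import Callable, Tuple

# A sketch of the on-device flow: voice section -> voice recognition
# (text file) -> translation -> synthesized speech for output. The
# recognize, translate, and synthesize backends are assumptions.

def process_utterance(audio,
                      src_lang: str,
                      dst_lang: str,
                      recognize: Callable,
                      translate: Callable,
                      synthesize: Callable) -> Tuple[str, str, object]:
    text = recognize(audio, lang=src_lang)                    # voice recognition unit 23
    translated = translate(text, src=src_lang, dst=dst_lang)  # translation unit 26
    speech = synthesize(translated, lang=dst_lang)            # speech for the voice output unit 28
    return text, translated, speech
```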
In the speech translation apparatus according to another aspect of the present application, a plurality of the voice input units are provided, and the speech translation apparatus further includes: a 1 st beam forming unit that performs signal processing on the voice input to at least some of the plurality of voice input units, thereby steering the pick-up directivity toward the sound source direction of the voice of the 1 st speaker; a 2 nd beam forming unit that performs signal processing on the voice input to at least some of the plurality of voice input units, thereby steering the pick-up directivity toward the sound source direction of the voice of the 2 nd speaker; an input switching unit that switches between obtaining the output signal of the 1 st beam forming unit and obtaining the output signal of the 2 nd beam forming unit; and a sound source direction estimation unit that estimates the sound source direction by performing signal processing on the voices input to the plurality of voice input units, wherein the speech instruction unit causes the input switching unit to switch to either the output signal of the 1 st beam forming unit or the output signal of the 2 nd beam forming unit.
Accordingly, the direction of the speaker relative to the speech translation apparatus can be estimated by the sound source direction estimation unit, and the input switching unit can switch to whichever of the output signals of the 1 st and 2 nd beam forming units suits the speaker's direction. Since the pick-up directivity of the beam forming units can thus be oriented toward the sound source direction, the speech translation apparatus can collect the speech of the 1 st speaker and the 2 nd speaker while reducing the surrounding noise.
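One conventional way to realize such beam forming units is delay-and-sum beamforming. The sketch below is an assumption for illustration; the patent does not prescribe a particular beamforming algorithm, array geometry, or steering angles:

```python
import numpy as np

# A sketch of delay-and-sum beamforming for the 1 st / 2 nd beam forming
# units (41, 42) and the input switching unit (32). The geometry, angles,
# and integer-sample delays are illustrative assumptions.

SOUND_SPEED = 343.0  # speed of sound in air, m/s

def delay_and_sum(channels: np.ndarray, sample_rate: int,
                  mic_positions: np.ndarray, steer_angle_rad: float) -> np.ndarray:
    """Steer the pick-up directivity of the array toward one direction."""
    direction = np.array([np.cos(steer_angle_rad), np.sin(steer_angle_rad)])
    out = np.zeros(channels.shape[1])
    for sig, pos in zip(channels, mic_positions):
        # Time-align each microphone for a far source in the look direction
        # (np.roll wraps around; acceptable for a short illustrative sketch).
        delay = int(round(np.dot(pos, direction) / SOUND_SPEED * sample_rate))
        out += np.roll(sig, -delay)
    return out / len(channels)

def switch_input(channels, sample_rate, mic_positions,
                 estimated_angle, angle_1st, angle_2nd):
    """Input switching unit: take the beam that faces the estimated source."""
    if abs(estimated_angle - angle_1st) <= abs(estimated_angle - angle_2nd):
        return delay_and_sum(channels, sample_rate, mic_positions, angle_1st)
    return delay_and_sum(channels, sample_rate, mic_positions, angle_2nd)
```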
In the speech translation apparatus according to another aspect of the present application, a plurality of the voice input units are provided, and the speech translation apparatus further includes: a sound source direction estimation unit that estimates the sound source direction by performing signal processing on the voices input to the plurality of voice input units; and a control unit that causes the 1 st language to be displayed in a region of the display unit corresponding to the position of the 1 st speaker relative to the speech translation apparatus, and causes the 2 nd language to be displayed in a region of the display unit corresponding to the position of the 2 nd speaker relative to the speech translation apparatus. The control unit compares the display direction, that is, the direction from the display unit of the speech translation apparatus toward the 1 st speaker or the 2 nd speaker whose language is displayed in one display region of the display unit, with the sound source direction estimated by the sound source direction estimation unit; it operates the voice recognition unit and the translation unit when the display direction substantially coincides with the estimated sound source direction, and stops the operation of the voice recognition unit and the translation unit when the display direction differs from the estimated sound source direction.
With this, when the display direction of the language displayed in a display region of the display unit substantially matches the sound source direction of the speech, it can be determined whether the speaker is the 1 st speaker speaking in the 1 st language or the 2 nd speaker speaking in the 2 nd language. The speech of the 1 st speaker can then be recognized in the 1 st language, and the speech of the 2 nd speaker in the 2 nd language. When the display direction differs from the sound source direction, translation of the input speech is stopped, which prevents the input speech from being mistranslated.
Accordingly, the speech translation apparatus can reliably perform voice recognition on speech in the 1 st language and speech in the 2 nd language, and can therefore translate the speech reliably. Since erroneous translation and the like are suppressed, an increase in the processing amount of the speech translation apparatus is also suppressed.
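As an illustration, the comparison performed by the control unit can be sketched as follows; the 30-degree tolerance for "substantially coincides" and the callable interfaces are assumptions:

```python
# A sketch of the control unit behaviour: recognition and translation run
# only when the display direction assigned to a language substantially
# coincides with the estimated sound source direction. The tolerance value
# is an illustrative assumption.

ANGLE_TOLERANCE_DEG = 30.0

def directions_coincide(display_dir_deg: float, source_dir_deg: float) -> bool:
    # Minimal angular difference, wrapped into [0, 180] degrees.
    diff = abs((display_dir_deg - source_dir_deg + 180.0) % 360.0 - 180.0)
    return diff <= ANGLE_TOLERANCE_DEG

def gate(display_dir_deg, source_dir_deg, run_recognition_and_translation, re_prompt):
    if directions_coincide(display_dir_deg, source_dir_deg):
        run_recognition_and_translation()  # operate units 23 and 26
    else:
        re_prompt()                        # stop translation; prompt the speaker again
```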
In the speech translation apparatus according to another aspect of the present application, when the control unit stops the voice recognition unit and the translation unit, the speech instruction unit outputs again the content prompting the operator to speak in the instructed language.
Accordingly, even when the display direction differs from the sound source direction, the content prompting an utterance is output again by the speech instruction unit, and the target speaker starts speaking. In this way, the speech translation apparatus can reliably obtain the speech of the target speaker and can translate it more reliably.
In the speech translation apparatus according to another aspect of the present application, when the display direction differs from the estimated sound source direction, the speech instruction unit outputs the content prompting an utterance in the instructed language again after a predetermined period has elapsed since the comparison by the control unit.
Accordingly, by waiting for the predetermined period after the comparison of the display direction with the sound source direction, mixing of the voices of the 1 st speaker and the 2 nd speaker into the input can be suppressed. After the predetermined period has elapsed, the content prompting an utterance is output again, and the target speaker starts speaking. The speech translation apparatus can thus obtain the speech of the target speaker more reliably, and can translate it more reliably.
In the speech translation apparatus according to another aspect of the present application, the speech input unit is provided in plurality, and the speech translation apparatus further includes: a 1 st beam forming unit configured to perform signal processing on a voice input to at least a part of the plurality of voice input units, thereby controlling a directivity of a collected voice to a sound source direction of the voice of the 1 st speaker; a 2 nd beam forming unit that performs signal processing on the voice input to at least some of the plurality of voice input units, thereby controlling the directivity of the collected voice to the sound source direction of the voice of the 2 nd speaker; and a sound source direction estimating unit that estimates a sound source direction by performing signal processing on the output signal of the 1 st beam forming unit and the output signal of the 2 nd beam forming unit.
In this way, the direction of the speaker with respect to the speech translation apparatus can be estimated by the sound source direction estimating unit. In this way, the sound source direction estimating unit can perform signal processing on the output signal of the 1 st beam forming unit and the output signal of the 2 nd beam forming unit suitable for the direction of the talker, and thus the calculation cost due to the signal processing can be reduced.
In the speech translation apparatus according to another aspect of the present application, at startup of the speech translation apparatus, the speech instruction unit outputs, via the display unit, the content prompting the 1 st speaker to speak in the 1 st language; after the speech uttered by the 1 st speaker has been translated from the 1 st language into the 2 nd language and the translation result has been displayed on the display unit, the speech instruction unit outputs, via the display unit, the content prompting the 2 nd speaker to speak in the 2 nd language.
Accordingly, by registering in advance that the 1 st speaker speaks in the 1 st language and that the 2 nd speaker speaks next in the 2 nd language, the content prompting the 1 st speaker to speak can be output in the 1 st language at startup, so that the 1 st speaker starts speaking. This suppresses erroneous translation caused by the 2 nd speaker speaking in the 2 nd language when the speech translation apparatus is started.
In the speech translation apparatus according to another aspect of the present application, after translation is started, the speech instruction unit causes the voice output unit to output the speech prompting an utterance up to a predetermined number of times, and after the speech prompting an utterance has been output the predetermined number of times, causes the display unit to output the message prompting an utterance.
Accordingly, by limiting the number of times the speech prompting an utterance is output to the predetermined number, an increase in the energy consumption of the speech translation apparatus can be suppressed.
In the speech translation apparatus according to another aspect of the present application, the voice recognition unit outputs the result of voice recognition of a speech together with a reliability score for that result, and when the reliability score obtained from the voice recognition unit is equal to or less than a threshold value, the speech instruction unit does not translate the speech and instead outputs the content prompting an utterance via at least one of the display unit and the voice output unit.
Accordingly, when the reliability score indicating the accuracy of the voice recognition is equal to or less than the threshold value, the content prompting an utterance is output again by the speech instruction unit, and the target speaker speaks again. The speech translation apparatus can therefore reliably perform voice recognition on the speech of the target speaker, and can translate it more reliably.
In particular, when the content prompting an utterance is output as speech from the voice output unit, the speaker readily notices that accurate voice recognition was not performed.
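As an illustration, this reliability-score gate can be sketched as follows; the threshold value of 0.6 and the callable interfaces are assumptions:

```python
# A sketch of the reliability-score gate: speech whose recognition score is
# at or below a threshold is not translated, and the speaker is prompted
# again. The threshold value and interfaces are illustrative assumptions.

CONFIDENCE_THRESHOLD = 0.6

def handle_recognition(text: str, score: float, translate, prompt_again):
    if score <= CONFIDENCE_THRESHOLD:
        prompt_again()   # via the display unit and/or the voice output unit
        return None      # the low-confidence speech is not translated
    return translate(text)
```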
Some of the specific embodiments described above may likewise be realized as a system, a method, an integrated circuit, a computer program, or a computer-readable recording medium such as a CD-ROM, or as any combination of a system, a method, an integrated circuit, a computer program, and a recording medium.
Each of the embodiments described below shows a specific example of the present application. The numerical values, shapes, materials, constituent elements, their arrangement positions and connection forms, steps, and the order of steps shown in the following embodiments are merely examples and are not intended to limit the present application. Among the constituent elements of the following embodiments, those not recited in the independent claims are described as optional constituent elements. The contents of all the embodiments may be combined.
A speech translation apparatus, a speech translation method, and a recording medium according to one embodiment of the present application will be specifically described below with reference to the drawings.
(Embodiment 1)
<Configuration: speech translation apparatus 1>
Fig. 1A shows an example of the appearance of the speech translation apparatus 1 according to embodiment 1 and a scenario in which the 1 st speaker and the 2 nd speaker use the speech translation apparatus 1 when the 1 st speaker speaks. Fig. 1B shows an example of the appearance of the speech translation apparatus 1 according to embodiment 1 and a scenario in which the 1 st speaker and the 2 nd speaker use the speech translation apparatus 1 when the 2 nd speaker speaks.
As shown in figs. 1A and 1B, the speech translation apparatus 1 is an apparatus that bidirectionally translates a conversation between a 1 st speaker, who speaks in a 1 st language, and a 2 nd speaker, who is the conversation partner of the 1 st speaker and speaks in a 2 nd language different from the 1 st language. That is, the speech translation apparatus 1 recognizes which of the two languages each utterance of the 1 st speaker and the 2 nd speaker is spoken in, and translates the content of the utterance into the other party's language. For example, the speech translation apparatus 1 translates the 1 st language spoken by the 1 st speaker into the 2 nd language and outputs the result, and translates the 2 nd language spoken by the 2 nd speaker into the 1 st language and outputs the result. The 1 st language and the 2 nd language are, for example, Japanese, English, French, German, Chinese, and the like.
Figs. 1A and 1B of the present embodiment show one 1 st speaker and one 2 nd speaker in conversation with each other. The apparatus can also be used for a conversation among a plurality of 1 st speakers and a plurality of 2 nd speakers.
The 1 st speaker and the 2 nd speaker may hold a face-to-face conversation through the speech translation apparatus 1, or may speak side by side, as shown in fig. 1C. Fig. 1C shows another example of a use scenario of the speech translation apparatus 1 when the 1 st speaker and the 2 nd speaker are in conversation. In these cases, the speech translation apparatus 1 can change its display mode accordingly. As shown in figs. 1A, 1B, and 1C, the speech translation apparatus 1 can thus be used in either landscape or portrait orientation.
The speech translation apparatus 1 is a portable terminal that can be carried by the 1 st speaker, such as a smartphone and a tablet terminal.
Fig. 2 is a block diagram showing the speech translation apparatus 1 according to embodiment 1.
As shown in fig. 2, the speech translation apparatus 1 includes: a voice input unit 21, a voice detection unit 22, a priority utterance input unit 24, an utterance instruction unit 25, a voice recognition unit 23, a translation unit 26, a display unit 27, a voice output unit 28, and a power supply unit 29.
[Voice input unit 21]
The voice input unit 21 is a microphone for inputting voice when the 1 st speaker and the 2 nd speaker perform a conversation, and is connected to be able to communicate with the voice detection unit 22. That is, the voice input unit 21 obtains (collects) voice, converts the obtained voice into an electric signal, and outputs an acoustic signal, which is the converted electric signal, to the voice detection unit 22. The acoustic signal obtained by the voice input unit 21 may be stored in a storage unit or the like.
The voice input unit 21 may also be configured as an adapter. In this case, the voice input unit 21 functions when a microphone is attached to the speech translation apparatus 1, and obtains the acoustic signal obtained by that microphone.
[Voice detection unit 22]
The voice detection unit 22 is a device that detects, from the voice input to the voice input unit 21, the voice sections in which the 1 st speaker and the 2 nd speaker speak, and is communicably connected to the voice input unit 21 and the voice recognition unit 23. Specifically, based on the volume indicated by the acoustic signal obtained from the voice input unit 21, the voice detection unit 22 treats the moment when the volume becomes large and the moment when the volume becomes small as the boundaries of a voice, and thereby detects the start time and the end time of a voice section in the acoustic signal (that is, it detects the end of an utterance). Here, a voice section corresponds to one utterance by a speaker and spans the period from the start time to the end time of that utterance.
The voice detection unit 22 thus detects the voice section in the acoustic signal, that is, extracts the voices of the 1 st speaker and the 2 nd speaker in the conversation from the acoustic signal, and outputs voice information indicating the detected voices to the voice recognition unit 23.
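As an illustration, the volume-based detection described above can be sketched as follows; the frame length, threshold, and hangover values are assumptions, not values given in the patent:

```python
import numpy as np

# A minimal sketch of the volume-based detection performed by the voice
# detection unit 22: a section starts when the frame volume rises above a
# threshold and ends after the volume has stayed low for a while. All
# parameter values are illustrative assumptions.

def detect_voice_sections(signal: np.ndarray, sample_rate: int,
                          frame_ms: int = 20, threshold: float = 0.01,
                          hangover_frames: int = 25):
    frame_len = int(sample_rate * frame_ms / 1000)
    sections, start, silence = [], None, 0
    for i in range(len(signal) // frame_len):
        frame = signal[i * frame_len:(i + 1) * frame_len]
        volume = np.sqrt(np.mean(frame.astype(float) ** 2))  # RMS of the frame
        if volume >= threshold:
            if start is None:
                start = i * frame_len        # volume became large: section start
            silence = 0
        elif start is not None:
            silence += 1
            if silence >= hangover_frames:   # volume stayed small: utterance ended
                sections.append((start, (i - silence + 1) * frame_len))
                start, silence = None, 0
    if start is not None:
        sections.append((start, (len(signal) // frame_len) * frame_len))
    return sections  # list of (start_sample, end_sample) voice sections
```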
[Speech instruction unit 25]
The speech instruction unit 25 is a device that outputs, via the display unit 27, content prompting the 2 nd speaker to speak in the 2 nd language after the 1 st speaker speaks, and content prompting the 1 st speaker to speak in the 1 st language after the 2 nd speaker speaks. That is, the speech instruction unit 25 outputs utterance instruction text information, i.e., content prompting the 1 st speaker or the 2 nd speaker to speak, to the display unit 27 at the appropriate timing so that the 1 st speaker and the 2 nd speaker can converse. The speech instruction unit 25 also outputs utterance instruction speech information, i.e., speech prompting the 1 st speaker or the 2 nd speaker to speak, to the speech output unit 28; this speech has the same content as the utterance instruction text information output to the display unit 27. The speech instruction unit 25 need not output the utterance instruction speech information to the speech output unit 28; prompting an utterance by speech is not essential.
Here, the utterance instruction text information is a text file showing content that prompts the 1 st speaker or the 2 nd speaker to speak. The utterance instruction speech information is speech indicating content that prompts the 1 st speaker or the 2 nd speaker to speak.
The speech instruction unit 25 also outputs an instruction command for causing the translation unit 26 to translate from the 1 st language into the 2 nd language, or from the 2 nd language into the 1 st language. For example, since the 2 nd speaker will speak after the 1 st speaker, the speech instruction unit 25 outputs to the voice recognition unit 23 an instruction command for recognizing the speech uttered by the 2 nd speaker in the 2 nd language, and outputs to the translation unit 26 an instruction command for translating the recognized speech from the 2 nd language into the 1 st language. The same applies after the 2 nd speaker speaks.
After one of the 1 st speaker and the 2 nd speaker speaks, the speech instruction unit 25 outputs the utterance instruction text information prompting the other speaker to speak to the display unit 27. When or after the translation result produced by the translation unit 26 is output, the speech instruction unit 25 outputs the utterance instruction text information to the display unit 27 and the utterance instruction speech information to the speech output unit 28.
When an instruction command is received from the priority utterance input unit 24, described later, the speech instruction unit 25 again outputs to the display unit 27 the utterance instruction text information prompting the speaker who has just spoken, and again outputs the utterance instruction speech information to the speech output unit 28.
When the speech translation apparatus 1 is started, the speech instruction unit 25 outputs content prompting the 1 st speaker to speak in the 1 st language via the display unit 27. That is, when the 1 st speaker is the holder of the speech translation apparatus 1, the speech instruction unit 25 prompts the 1 st speaker to start speaking. After the speech uttered by the 1 st speaker has been translated from the 1 st language into the 2 nd language and the translation result has been displayed on the display unit 27, the speech instruction unit 25 outputs content prompting the 2 nd speaker to speak in the 2 nd language via the display unit 27. The 2 nd speaker then speaks in the 2 nd language, and that speech is translated into the 1 st language. By repeating these operations, the conversation between the 1 st speaker and the 2 nd speaker proceeds smoothly.
After the translation is started, the speech instruction unit 25 causes the speech output unit 28 to output the speech prompting an utterance up to a predetermined number of times, since the 2 nd speaker may not hear the prompt or may not speak immediately. After the speech prompting an utterance has been output the predetermined number of times, the speech instruction unit 25 causes the display unit 27 to display a message prompting an utterance instead, in order to suppress power consumption.
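As an illustration, this prompt limiting can be sketched as follows; the limit of three spoken prompts and the display/speaker interfaces are assumptions:

```python
# A sketch of the prompt limiting described above: the spoken prompt is
# repeated at most a fixed number of times, after which only the on-screen
# message is shown to save power. The limit and interfaces are assumptions.

MAX_VOICE_PROMPTS = 3

class PromptLimiter:
    def __init__(self, display, voice_out):
        self.display = display
        self.voice_out = voice_out
        self.voice_prompts = 0

    def prompt(self, message: str) -> None:
        self.display.show(message)              # the message is always displayed
        if self.voice_prompts < MAX_VOICE_PROMPTS:
            self.voice_out.play(message)        # spoken prompt, limited count
            self.voice_prompts += 1
```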
The speech instructing unit 25 is connected to the speech recognizing unit 23, the preferential speech input unit 24, the translating unit 26, the display unit 27, and the speech output unit 28 so as to be able to communicate with each other.
[Priority utterance input unit 24]
The priority utterance input unit 24 is a device that, when the 1 st speaker or the 2 nd speaker has started speaking and the speech has been voice-recognized, allows that same speaker's next utterance to be voice-recognized again with priority by the voice recognition unit 23. That is, the priority utterance input unit 24 can give the speaker who has just spoken, i.e., the speaker whose utterance was voice-recognized, another chance to speak. In other words, even when the voice recognition of the speech uttered by one of the 1 st speaker and the 2 nd speaker has finished and the process has shifted to voice recognition of the other speaker's speech, the priority utterance input unit 24 can return the process to voice recognition of the speech of the first speaker.
The priority utterance input unit 24 is an operation input unit that receives input from the operator of the speech translation apparatus 1. For example, when the speaker who has started speaking misspeaks, or when an ambiguous voice is translated midway, or when no voice has been detected by the voice detection unit 22 for a predetermined interval or longer so that the speech translation apparatus 1 judges the utterance to have ended, the speaker who has just spoken may want to continue speaking. In such cases, the priority utterance input unit 24 causes the voice recognition unit 23 to preferentially recognize, and the translation unit 26 to translate, the speech uttered by that speaker. To this end, the priority utterance input unit 24 outputs to the speech instruction unit 25 an instruction command for outputting again the utterance instruction text information and the utterance instruction speech information, i.e., the content prompting an utterance. The operator is at least one of the 1 st speaker and the 2 nd speaker; in the present embodiment, the operator is the 1 st speaker.
In the present embodiment, the priority utterance input unit 24 is a touch sensor provided integrally with the display unit 27 of the speech translation apparatus 1. In this case, the display unit 27 can display an operation key that accepts a speaker's operation as the priority utterance input unit 24.
In the present embodiment, when the voice recognition is switched from the 1 st language to the 2 nd language, a priority key for the 1 st language is displayed on the display unit 27 as the priority utterance input unit 24, so that the 1 st language used before the switch can be preferentially voice-recognized and translated. Likewise, when the voice recognition is switched from the 2 nd language to the 1 st language, a priority key for the 2 nd language is displayed on the display unit 27. Such priority keys are displayed on the display unit 27 at least after a translation.
[Voice recognition unit 23]
The voice recognition unit 23 performs voice recognition on the voice in the voice section detected by the voice detection unit 22, and converts the voice into a text file. Specifically, when the voice information detected by the voice detection unit 22 is obtained, the voice recognition unit 23 performs voice recognition on the voice indicated by the voice information. For example, when the speech indicated by the speech information is the 1 st language, the speech is subjected to speech recognition in the 1 st language, and when the speech indicated by the speech information is the 2 nd language, the speech is subjected to speech recognition in the 2 nd language. When performing speech recognition of speech in the 1 st language, the speech recognition unit 23 generates a 1 st text file showing the content of the speech after the speech recognition, and outputs the generated 1 st text file to the translation unit 26. When speech recognition is performed on speech in the 2 nd language, the speech recognition unit 23 generates a 2 nd text file showing the content of the speech after the speech recognition, and outputs the generated 2 nd text file to the translation unit 26.
[Translation unit 26]
The translation unit 26 is a translation device that translates the text file converted by the voice recognition unit 23 from the 1 st language to the 2 nd language, and from the 2 nd language to the 1 st language. Specifically, the translation unit 26 translates the 1 st text file into the 2 nd language from the 1 st language when the 1 st text file is obtained as the text file from the voice recognition unit 23. That is, the translation unit 26 generates a 2 nd translation text file for translating the 1 st text file into the 2 nd language. When the 2 nd text file, which is a text file, is obtained from the voice recognition unit 23, the translation unit 26 translates the 2 nd text file from the 2 nd language to the 1 st language. That is, the translation unit 26 generates a 1 st translation text file for translating the 2 nd text file into the 1 st language.
Here, the content of the 1 st text file shown in the 1 st language coincides with the content of the 2 nd translated text file shown in the 2 nd language. The content of the 2 nd text file shown in the 2 nd language is identical to the content of the 1 st translated text file shown in the 1 st language.
When the translation unit 26 generates the 2 nd translation text file, the contents of the 2 nd translation text file are recognized, and the translated speech in the 2 nd language showing the contents of the recognized 2 nd translation text file is generated. When the translation unit 26 generates the 1 st translated text file, the content of the 1 st translated text file is recognized, and the translated speech in the 1 st language showing the content of the recognized 1 st translated text file is generated. The speech output unit 28 may generate the translated speech based on the 1 st translated text file and the 2 nd translated text file.
When the translation unit 26 generates the 2 nd translation text file or the 1 st translation text file, the generated 2 nd translation text file or the 1 st translation text file is output to the display unit 27. When the translation unit 26 generates the translated speech of the 2 nd language or the translated speech of the 1 st language, the generated translated speech of the 2 nd language or the translated speech of the 1 st language is output to the speech output unit 28.
The translation unit 26 is connected to be communicable with the speech instruction unit 25, the speech recognition unit 23, the display unit 27, and the speech output unit 28.
[Display unit 27]
The display unit 27 is a display such as a liquid crystal panel or an organic EL panel, and is communicably connected to the speech instruction unit 25 and the translation unit 26. The display unit 27 displays the result of translating the speech-recognized voice of the voice section detected by the voice detection unit 22 from the 1 st language into the 2 nd language, and likewise the result of translating from the 2 nd language into the 1 st language. The display unit 27 displays the 1 st text file, the 2 nd text file, the 1 st translated text file, and the 2 nd translated text file obtained from the translation unit 26. After or simultaneously with the display of these text files, the display unit 27 displays the utterance instruction text information, i.e., content prompting the 1 st speaker or the 2 nd speaker to speak.
The display unit 27 changes the screen layout for displaying the text files according to the positions of the 1 st speaker and the 2 nd speaker relative to the speech translation apparatus 1. For example, as shown in figs. 1A and 1B, when the 1 st speaker speaks, the display unit 27 displays the voice-recognized 1 st text file in the display area on the 1 st speaker's side, and displays the translated 2 nd translated text file in the display area on the 2 nd speaker's side. When the 2 nd speaker speaks, the display unit 27 displays the voice-recognized 2 nd text file in the display area on the 2 nd speaker's side, and displays the translated 1 st translated text file in the display area on the 1 st speaker's side. In these cases, the display unit 27 displays the characters of the 1 st text file and those of the 2 nd translated text file in mutually opposite orientations, and likewise displays the characters of the 2 nd text file and those of the 1 st translated text file in mutually opposite orientations, so that each text faces its reader. As shown in fig. 1C, when the 1 st speaker and the 2 nd speaker converse side by side, the display unit 27 displays the 1 st text file and the 2 nd text file with the characters in the same orientation.
[Voice output unit 28]
The speech output unit 28 is a speaker that obtains the translated speech, which is a result of the translation performed by the translation unit 26, from the translation unit 26 and outputs the obtained translated speech, and is connected to the translation unit 26 and the utterance instructing unit 25 so as to be communicable. That is, when the 1 st speaker speaks, the speech output unit 28 reproduces and outputs the translated speech having the same content as the 2 nd translated text file displayed on the display unit 27. When the 2 nd speaker speaks, the speech output unit 28 reproduces and outputs the translated speech having the same content as the 1 st translated text file displayed on the display unit 27.
When the speech output unit 28 obtains the utterance instruction speech information, it reproduces and outputs, to the 1 st speaker or the 2 nd speaker, the speech prompting an utterance indicated by that information. The speech output unit 28 outputs the translated speech of the 1 st or 2 nd translated text file, and then reproduces and outputs the speech indicated by the utterance instruction speech information.
[Power supply unit 29]
The power supply unit 29 is, for example, a primary battery or a secondary battery, and is electrically connected to the voice input unit 21, the voice detection unit 22, the priority utterance input unit 24, the utterance instruction unit 25, the voice recognition unit 23, the translation unit 26, the display unit 27, the voice output unit 28, and the like via wiring. The power supply unit 29 supplies electric power to the voice detection unit 22, the priority utterance input unit 24, the utterance instruction unit 25, the voice recognition unit 23, the translation unit 26, the display unit 27, the voice output unit 28, and the like.
<Operation>
The operation of the speech translation apparatus 1 having the above-described configuration will be described with reference to fig. 3.
Fig. 3 is a flowchart showing the operation of the speech translation apparatus 1 according to embodiment 1.
The speech translation apparatus 1 is set in advance so that the 1 st speaker speaks in the 1 st language and the 2 nd speaker speaks in the 2 nd language. Here, it is assumed that one of the 1 st speaker and the 2 nd speaker starts speaking first. The 1 st speaker activates the speech translation apparatus 1, and the speech translation apparatus 1 starts translating the conversation between the 1 st speaker and the 2 nd speaker.
First, as shown in fig. 3, the speech translation apparatus 1 is started up before the 1 st speaker and the 2 nd speaker begin to speak. The speech translation apparatus 1 obtains sound (S11) and generates an acoustic signal indicating the obtained sound. In the present embodiment, when one speaker starts speaking, the speech translation apparatus 1 obtains the speech uttered by that speaker. As shown in fig. 1A, when the one speaker is the 1 st speaker and says 「何をお探しですか？」 ("What are you looking for?"), the voice input unit 21 obtains the spoken voice. The voice input unit 21 converts the obtained voice into an electric signal and outputs the resulting acoustic signal to the voice detection unit 22.
Next, the voice detection unit 22 obtains the acoustic signal from the voice input unit 21 and detects the voice section of the one speaker from the voice indicated by the acoustic signal (S12), thereby extracting the detected voice as the voice of the one speaker. As an example, as shown in fig. 1A, the voice detection unit 22 detects the voice section of 「何をお探しですか？」 from the voice input to the voice input unit 21 and extracts the detected voice. The voice detection unit 22 outputs voice information indicating the extracted voice of the one speaker to the voice recognition unit 23.
The speech instructing unit 25 outputs an instruction command for performing speech recognition in the language in which one speaker is speaking to the speech recognizing unit 23, and outputs an instruction command for translating the speech recognized by the speech recognition from one language to the other language to the translating unit 26. That is, the speech instruction unit 25 outputs an instruction command for switching the language recognized by the speech recognition unit 23 so that the speech recognition unit 23 can recognize the language spoken by one of the speakers. The speech instruction unit 25 outputs an instruction command for switching the translation language so that the translation unit 26 can translate the language into a desired language based on the language recognized by the speech recognition unit 23.
For example, when the voice recognition unit 23 obtains the instruction command, the recognized language is switched from the 2 nd language to the 1 st language, or the recognized language is switched from the 1 st language to the 2 nd language. When the instruction command is obtained, the translation unit 26 switches the translation language from the 2 nd language to the 1 st language or from the 1 st language to the 2 nd language.
Next, when the instruction command and the voice information are obtained, the voice recognition unit 23 performs voice recognition on the voice indicated by the voice information (S13). For example, when the language of one speaker is the 1 st language, the speech recognition unit 23 selects the recognition language as the 1 st language, and performs speech recognition on the speech indicated by the speech information in the selected 1 st language. That is, the voice recognition unit 23 converts the voice indicated by the voice information into a text file of the 1 st language, and outputs the converted 1 st text file to the translation unit 26. When the language of one speaker is the 2 nd language, the voice recognition unit 23 selects the recognition language as the 2 nd language, and performs voice recognition on the voice indicated by the voice information in the selected 2 nd language. That is, the voice recognition unit 23 converts the voice indicated by the voice information into a text file of the 2 nd language, and outputs the converted 2 nd text file to the translation unit 26.
As an example, as shown in fig. 1A, the voice recognition unit 23 performs voice recognition on the voice 「何をお探しですか？」 indicated by the voice information, and converts it into the 1 st text file 「何をお探しですか？」.
Next, the translation unit 26 obtains the text file from the voice recognition unit 23 and translates it from one of the 1 st language and the 2 nd language into the other (S14). That is, when the text file is the 1 st text file in the 1 st language, the translation unit 26 translates it into the 2 nd language and generates the 2 nd translated text file as the translation result. When the text file is the 2 nd text file in the 2 nd language, the translation unit 26 translates it into the 1 st language and generates the 1 st translated text file as the translation result. As an example, as shown in fig. 1A, the translation unit 26 translates the 1 st text file 「何をお探しですか？」 in the 1 st language into the 2 nd language, generating the 2 nd translated text file "What are you looking for?".
Next, the translation unit 26 outputs the generated 2 nd translated text file in the 2 nd language or the 1 st translated text file in the 1 st language to the display unit 27. The display unit 27 displays the 2 nd translated text file or the 1 st translated text file (S15). As an example, as shown in fig. 1A, the display unit 27 displays the 2 nd translated text file "What are you looking for?".
When the 2 nd translated text file is generated, the translation unit 26 generates the translated speech in the 2 nd language by converting the 2 nd translated text file into speech. When the 1 st translated text file is generated, the translation unit 26 generates the translated speech in the 1 st language by converting the 1 st translated text file into speech. The translation unit 26 outputs the generated translated speech in the 2 nd language or in the 1 st language to the speech output unit 28. The speech output unit 28 outputs the translated speech in the 2 nd language or in the 1 st language (S16). As an example, as shown in fig. 1A, the speech output unit 28 outputs by speech the content of the 2 nd translated text file, "What are you looking for?". The processing in steps S15 and S16 may be executed at the same time, or in reverse order.
Next, the utterance instructing unit 25 determines whether or not an instruction command is obtained from the priority utterance input unit 24 (S17). For example, when one of the speakers wants to speak again, the operator of the speech translation apparatus 1 operates the priority utterance input unit 24. On receiving this operation, the priority utterance input unit 24 outputs an instruction command to the utterance instructing unit 25.
When the utterance instructing unit 25 obtains the instruction command from the priority utterance input unit 24 (yes in S17), the speech recognition unit 23 and the translation unit 26 can return to the processing of recognizing and translating the speech uttered by the one speaker, even when that processing has been completed or interrupted, or has already shifted to recognizing the speech uttered by the other speaker. The utterance instructing unit 25 again outputs utterance instruction text information, that is, content prompting the one speaker who has just spoken to speak, to the display unit 27, so that the speech uttered by that speaker is preferentially recognized. The display unit 27 displays the utterance instruction text information obtained from the utterance instructing unit 25 (S18). As an example, the display unit 27 displays the utterance instruction text information "もう一度発話して下さい" (Japanese for "Please speak again").
When the utterance instructing unit 25 receives the instruction command from the priority utterance input unit 24, it outputs utterance instruction speech information, that is, content prompting the one speaker to speak, to the speech output unit 28. The speech output unit 28 outputs the utterance instruction speech information obtained from the utterance instructing unit 25 by voice (S19). For example, the speech output unit 28 outputs the utterance instruction speech information "もう一度発話して下さい" by voice.
In this case, for the other speaker, the speech translation apparatus 1 may display a message such as "Thank you for your conference", may output it by voice, or may output nothing. The processing in steps S18 and S19 may be performed simultaneously, or in the reverse order.
The utterance instructing unit 25 may output the utterance instruction speech information to the speech output unit 28 only a predetermined number of times. After the utterance instruction speech information has been output the predetermined number of times, the utterance instructing unit 25 may cause the display unit 27 to display a message prompting the utterance instead.
Then, the speech translation apparatus 1 ends the processing. Accordingly, when the one speaker speaks again, the speech translation apparatus 1 can start the processing again from step S11.
When the instruction command is not obtained from the priority utterance input unit 24 (no in S17), the utterance instructing unit 25 outputs utterance instruction text information, that is, content prompting the other speaker to speak, to the display unit 27. This is the case where, for example, the one speaker does not need to speak again because the speech was correctly recognized. The display unit 27 displays the utterance instruction text information obtained from the utterance instructing unit 25 (S21). For example, as shown in fig. 1A, the display unit 27 displays the utterance instruction text information "Your Turn!".
When the instruction command is not obtained from the priority utterance input unit 24, the utterance instructing unit 25 outputs utterance instruction speech information, that is, content prompting the other speaker to speak, to the speech output unit 28. The speech output unit 28 outputs the utterance instruction speech information obtained from the utterance instructing unit 25 by voice (S22). As an example, the speech output unit 28 outputs the utterance instruction speech information "Your Turn!" by voice. The processing in steps S21 and S22 may be performed simultaneously, or in the reverse order.
The utterance instructing unit 25 may output the speech prompting the utterance to the speech output unit 28 only a predetermined number of times, and may thereafter output a message prompting the utterance to the display unit 27 instead.
Then, the speech translation apparatus 1 ends the processing. Accordingly, when one speaker speaks again, the speech translation apparatus 1 starts the process from step S11.
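The branch of steps S17 to S22 can be summarized as follows; the function and parameter names are hypothetical, and the prompt strings follow the examples given above:

```python
# Hypothetical control flow for steps S17-S22: when the priority
# utterance input unit 24 was operated, the same speaker is prompted
# again; otherwise the conversation partner is prompted.

def after_translation(priority_pressed: bool, display, speaker) -> None:
    if priority_pressed:                               # S17: yes
        prompt = "もう一度発話して下さい"                # "Please speak again"
        display.show(prompt)                           # S18
        speaker.speak(prompt)                          # S19
    else:                                              # S17: no
        prompt = "Your Turn!"
        display.show(prompt)                           # S21
        speaker.speak(prompt)                          # S22
```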
In this way, once the 1 st speaker has first operated the speech translation apparatus 1, the speech translation apparatus 1 can translate the conversation between the 1 st speaker and the 2 nd speaker.
Since the processing of the other speaker's speech directed to the one speaker is the same as described above, its description is omitted.
< Effect >
Next, the operation and effects of the speech translation apparatus 1 in the present embodiment will be described.
As described above, the speech translation apparatus 1 according to the present embodiment translates a conversation between a 1 st speaker, who speaks in a 1 st language, and a 2 nd speaker, who is the conversation partner of the 1 st speaker and speaks in a 2 nd language different from the 1 st language. The speech translation apparatus 1 includes: a voice detection unit 22 that detects voice sections of the voices uttered by the 1 st speaker and the 2 nd speaker from the voice input to the voice input unit 21; a display unit 27 that, when the voice in a voice section detected by the voice detection unit 22 is recognized, displays the translation result of translating the 1 st language indicated by the voice into the 2 nd language, and also displays the translation result of translating the 2 nd language into the 1 st language; and an utterance instructing unit 25 that outputs, via the display unit 27, content prompting the 2 nd speaker to speak in the 2 nd language after the 1 st speaker speaks, and outputs, via the display unit 27, content prompting the 1 st speaker to speak in the 1 st language after the 2 nd speaker speaks.
Accordingly, by detecting respective voice sections from the conversation of the 1 st speaker and the 2 nd speaker, it is possible to obtain a translation result of translating the detected voice from the 1 st language into the 2 nd language and a translation result of translating the detected voice from the 2 nd language into the 1 st language. That is, in the speech translation apparatus 1, even if the input operation for translation is not performed, the language of the detected speech can be automatically translated into another language in accordance with the utterances of the 1 st speaker and the 2 nd speaker, respectively.
Further, the speech translation apparatus 1 outputs content prompting the 2 nd speaker to speak after the 1 st speaker speaks, and content prompting the 1 st speaker to speak after the 2 nd speaker speaks. Accordingly, the 1 st speaker and the 2 nd speaker can each recognize the timing at which to start speaking, without performing any input operation.
In this way, the speech translation apparatus 1 offers good operability, since no input operation for starting speech, switching languages, or the like is required. That is, because the speech translation apparatus 1 is easy to operate, an increase in its period of use can be suppressed.
Therefore, through this simple operation, the speech translation apparatus 1 can suppress an increase in its energy consumption. In particular, since the operation is simplified, erroneous operations can also be suppressed.
Further, the speech translation method according to the present embodiment is a speech translation method for translating a conversation between a 1 st speaker, who speaks in a 1 st language, and a 2 nd speaker, who is the conversation partner of the 1 st speaker and speaks in a 2 nd language different from the 1 st language. The speech translation method includes: detecting voice sections of the voices uttered by the 1 st speaker and the 2 nd speaker from the voice input to the speech input unit 21; when the voice in a detected voice section is recognized, displaying on the display unit 27 the translation result of translating the 1 st language indicated by the voice into the 2 nd language, and also displaying the translation result of translating the 2 nd language into the 1 st language; and outputting, via the display unit 27, content prompting the 2 nd speaker to speak in the 2 nd language after the 1 st speaker speaks, and outputting, via the display unit 27, content prompting the 1 st speaker to speak in the 1 st language after the 2 nd speaker speaks.
This speech translation method can also achieve the same operational effects as those of the speech translation apparatus 1 described above.
The recording medium in the present embodiment is a computer-readable non-transitory recording medium on which a program for causing a computer to execute the speech translation method is recorded.
The same effects as those of the above-described speech translation apparatus 1 can be achieved also in this recording medium.
The speech translation apparatus 1 according to the present embodiment further includes the priority utterance input unit 24, which is configured so that, after the speech uttered by the 1 st speaker or the 2 nd speaker has been recognized, the speech uttered by the speaker who has just spoken can be preferentially recognized again.
Accordingly, for example, when a speaker such as the 1 st speaker or the 2 nd speaker misspeaks, or when ambiguous speech would otherwise be translated, operating the priority utterance input unit 24 gives priority to the speaker who has just spoken, so that this speaker gets a chance to speak again. Therefore, even when the speech recognition of the speech uttered by one of the 1 st speaker and the 2 nd speaker has been completed, or the processing has shifted to recognizing the speech of the other speaker, the processing can return to recognizing the speech uttered by the one speaker. Accordingly, the speech translation apparatus 1 can reliably obtain the speech of the 1 st speaker and the 2 nd speaker, and can therefore output the translation result of that speech.
The speech translation apparatus 1 according to the present embodiment further includes: a speech input unit 21 to which the speech of the conversation between the 1 st speaker and the 2 nd speaker is input; a voice recognition unit 23 that performs voice recognition on the voice in the voice section detected by the voice detection unit 22 and converts the voice into a text file; a translation unit 26 that translates the text file converted by the voice recognition unit 23 from the 1 st language into the 2 nd language and from the 2 nd language into the 1 st language; and a speech output unit 28 that outputs the result translated by the translation unit 26 as speech.
With this, after speech recognition is performed on the input speech, the language of the speech can be translated into another language. That is, the speech translation apparatus 1 can perform the processing from the acquisition of the speech of the conversation between the 1 st speaker and the 2 nd speaker to the output of the result of the speech translation. Therefore, the speech translation apparatus 1 can mutually translate the speech of the 1 st speaker and the 2 nd speaker during the conversation, even without communicating with the external server. The present invention can be applied even in an environment where communication between the speech translation apparatus 1 and an external server is difficult.
In the speech translation apparatus 1 of the present embodiment, the speech instructing unit 25 outputs the content prompting the 1 st speaker to speak in the 1 st language through the display unit 27 at the time of startup of the speech translation apparatus 1, translates the speech spoken by the 1 st speaker from the 1 st language to the 2 nd language, displays the translation result on the display unit 27, and then outputs the content prompting the 2 nd speaker to speak in the 2 nd language through the display unit 27.
Accordingly, since the content prompting the 1 st speaker to speak is output in the 1 st language when the speech translation apparatus 1 is started up, the 1 st speaker speaks first, and the 2 nd speaker is prompted in the 2 nd language only after the 1 st speaker has spoken in the 1 st language. Therefore, at startup of the speech translation apparatus 1, erroneous translation caused by the 2 nd speaker speaking first in the 2 nd language can be suppressed.
In the speech translation apparatus 1 according to the present embodiment, after translation starts, the utterance instructing unit 25 causes the speech output unit 28 to output the speech prompting an utterance a predetermined number of times, and after the predetermined number of outputs has ended, causes the display unit 27 to output a message prompting the utterance instead.
Accordingly, by stopping the voice prompts after the predetermined number of times, an increase in energy consumption of the speech translation apparatus 1 can be suppressed.
(embodiment mode 2)
< constitution >
The configuration of the speech translation apparatus 1a according to the present embodiment will be described with reference to fig. 4.
Fig. 4 is a block diagram showing the speech translation apparatus 1a in embodiment 2.
The present embodiment differs from embodiment 1 in that the sound source direction is estimated.
Other configurations in the present embodiment are the same as those in embodiment 1, and the same reference numerals are given to the same configurations, and detailed description thereof is omitted, unless otherwise specified.
As shown in fig. 4, the speech translation apparatus 1a includes a plurality of speech input units 21 and a sound source direction estimating unit 31 in addition to the speech detecting unit 22, the preferential speech input unit 24, the speech instructing unit 25, the speech recognizing unit 23, the translating unit 26, the display unit 27, the speech output unit 28, and the power supply unit 29.
[ multiple speech input units 21]
The plurality of voice input units 21 constitute a microphone array. Specifically, the microphone array is composed of two or more microphone units arranged apart from each other; it picks up speech and obtains an acoustic signal by converting the picked-up speech into an electric signal.
The plurality of voice input units 21 output the obtained acoustic signals to the sound source direction estimating unit 31. At least one of the plurality of voice input units 21 outputs an acoustic signal to the voice detection unit 22. In the present embodiment, one voice input unit 21 is connected to the voice detection unit 22 so as to be able to communicate with each other, and outputs an acoustic signal to the voice detection unit 22.
In the present embodiment, the speech translation apparatus 1a is provided with two speech input units 21. The two voice input units 21 are arranged at a distance from each other of no more than half the wavelength of the speech.
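This half-wavelength spacing keeps the inter-microphone phase difference unambiguous, i.e., it avoids spatial aliasing. A quick check, assuming an illustrative 4 kHz upper speech frequency that is not a value from the disclosure:

```python
# Quick check of the half-wavelength spacing constraint.
SPEED_OF_SOUND = 343.0   # m/s in air at roughly 20 degrees C

def max_mic_spacing(max_freq_hz: float) -> float:
    """Largest spacing d (metres) satisfying d <= lambda / 2."""
    wavelength = SPEED_OF_SOUND / max_freq_hz
    return wavelength / 2.0

print(max_mic_spacing(4000.0))   # ~0.043 m for speech content up to 4 kHz
```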
[ Sound Source Direction estimating section 31]
The sound source direction estimating unit 31 estimates the sound source direction by performing signal processing on the voices input to the plurality of voice input units 21. Specifically, when the sound information from the sound detection unit 22 and the acoustic signals from the plurality of sound input units 21 are obtained, the sound source direction estimation unit 31 calculates a time difference (phase difference) between the sounds arriving at each of the plurality of sound input units 21 constituting the microphone array, and estimates the sound source direction by, for example, a delay time estimation method. That is, as long as the voice detection unit 22 can detect the voice section, it means that the voice of the 1 st speaker or the 2 nd speaker is input to the voice input unit 21, and the sound source direction estimation unit 31 starts the estimation of the sound source direction using the acquisition of the voice information as a trigger.
The sound source direction estimating unit 31 outputs sound source direction information indicating the sound source direction, which is the estimation result, to the speech instructing unit 25.
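A minimal sketch of such delay-based estimation for two microphones follows; practical systems often use GCC-PHAT instead of plain correlation for robustness, and all parameter names here are illustrative:

```python
# Minimal sketch of delay-based direction estimation: the arrival-time
# difference between the two microphones is taken from the peak of
# their cross-correlation and converted to an angle with a far-field
# model. Illustrative only.

import numpy as np

def estimate_direction_deg(sig_a: np.ndarray, sig_b: np.ndarray,
                           fs: float, mic_distance: float,
                           c: float = 343.0) -> float:
    corr = np.correlate(sig_a, sig_b, mode="full")
    lag = int(np.argmax(corr)) - (len(sig_b) - 1)   # delay in samples
    tau = lag / fs                                  # time difference (s)
    # Far-field model: tau = d * sin(theta) / c
    sin_theta = np.clip(tau * c / mic_distance, -1.0, 1.0)
    return float(np.degrees(np.arcsin(sin_theta)))  # angle from broadside
```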
[ speaking instruction section 25]
The utterance instructing unit 25 includes a control unit 31a and controls the display state of the display unit 27. Specifically, the control unit 31a displays the 1 st language in the display area of the display unit 27 corresponding to the position of the 1 st speaker with respect to the speech translation apparatus 1a, and displays the 2 nd language in the display area of the display unit 27 corresponding to the position of the 2 nd speaker with respect to the speech translation apparatus 1a. For example, as shown in fig. 1A, the display area corresponding to the position of the 1 st speaker is the display area on the 1 st speaker side, displayed in Japanese, and the display area corresponding to the position of the 2 nd speaker is the display area on the 2 nd speaker side, displayed in English.
The control unit 31a compares the sound source direction estimated by the sound source direction estimating unit 31 with the display direction, that is, the direction from the display unit 27 of the speech translation apparatus 1a toward the 1 st speaker or the 2 nd speaker on whose side of the display area content is displayed. When the display direction substantially coincides with the sound source direction, the control unit 31a causes the voice recognition unit 23 and the translation unit 26 to operate. For example, as shown in fig. 1A, when the 1 st speaker speaks, the 1 st text file showing the content of the 1 st speaker's voice input to the speech translation apparatus 1a is displayed in the display area on the 1 st speaker side (the side facing the 1 st speaker). In this case, the display direction is the direction from the display unit 27 toward the 1 st speaker, and the sound source direction estimated by the sound source direction estimating unit 31 is also the direction from the display unit 27 toward the 1 st speaker.
When the display direction is different from the sound source direction, the control unit 31a stops the operations of the voice recognition unit 23 and the translation unit 26. When the 1 st speaker speaks, even if the 1 st text file showing the content of the 1 st speaker's voice is displayed in the display area on the 1 st speaker side, the display direction does not coincide with the estimated sound source direction in the case where the sound source direction estimated by the sound source direction estimating unit 31 is the direction from the display unit 27 toward the 2 nd speaker. For example, when the 1 st speaker continues speaking without operating the priority speech input unit 24 after the 1 st speaker speaks, the surrounding voice that is not related to the conversation may be collected in the speech input unit 21.
When the control unit 31a stops the operations of the speech recognition unit 23 and the translation unit 26, the utterance instructing unit 25 outputs again the content prompting an utterance in the instructed language. For example, when the display direction does not coincide with the estimated sound source direction, it is not known which speaker is speaking, so the speech recognition unit 23 cannot tell whether to recognize the speech in the 1 st language or the 2 nd language. In that case, even if the 1 st speaker speaks, the speech cannot be recognized and therefore cannot be translated. The control unit 31a therefore stops the operations of the voice recognition unit 23 and the translation unit 26.
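The comparison performed by the control unit 31a can be modeled as a simple gate; the 15-degree tolerance below is an assumed value, since the disclosure only requires that the directions "substantially coincide":

```python
# Hypothetical gate for the control unit 31a: recognition and
# translation run only when the estimated source direction matches the
# display direction within an assumed tolerance.

def directions_match(display_dir_deg: float, source_dir_deg: float,
                     tolerance_deg: float = 15.0) -> bool:
    return abs(display_dir_deg - source_dir_deg) <= tolerance_deg

# Usage sketch:
# if directions_match(display_dir, estimated_dir):
#     pipeline.handle_utterance(voice_info)   # let units 23 and 26 operate
# else:
#     prompt_again()                          # stop them and re-prompt
```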
< working >
The operation of the speech translation apparatus 1a having the above-described configuration will be described with reference to fig. 5.
Fig. 5 is a flowchart showing the operation of the speech translation apparatus 1a according to embodiment 2.
The same processing as in fig. 3 is denoted by the same reference numerals, and description thereof is omitted as appropriate.
The speech translation apparatus 1a acquires a sound (S11), and generates an acoustic signal indicating the acquired sound.
Next, the sound source direction estimating unit 31 determines whether or not the voice information is obtained from the voice detecting unit 22 (S12 a).
When the sound source direction estimating unit 31 does not obtain the speech information from the speech detecting unit 22 (no in S12a), this is because the speech detecting unit 22 has not detected any speech in the acoustic signal. That is, the 1 st speaker and the 2 nd speaker are not having a conversation. In this case, the process of step S12a is repeated.
When the sound source direction estimating unit 31 obtains the speech information from the speech detecting unit 22 (yes at S12a), it is the case where at least one of the 1 st speaker and the 2 nd speaker has spoken. In this case, the sound source direction estimating unit 31 calculates a time difference (phase difference) between the voices included in the acoustic signals obtained from the respective voice input units 21, and estimates the sound source direction (S31). The sound source direction estimating unit 31 outputs sound source direction information indicating the sound source direction as a result of the estimation to the speech instructing unit 25.
Next, the control unit 31a of the utterance instructing unit 25 determines whether or not the display direction substantially coincides with the estimated sound source direction (S32).
When the display direction is different from the sound source direction (no in S32), the control unit 31a stops the operations of the speech recognition unit 23 and the translation unit 26. When the control unit 31a stops the operation of the speech recognition unit 23 and the translation unit 26, the utterance instructing unit 25 outputs the content of prompting the utterance in the instructed language again.
Specifically, the speech instruction unit 25 outputs speech instruction text information for prompting one of the speakers to speak to the display unit 27. The display unit 27 displays the utterance instruction text information obtained from the utterance instruction unit 25 (S33).
The speech instruction unit 25 outputs speech instruction speech information for prompting one of the speakers to speak to the speech output unit 28. The speech output unit 28 outputs the speech instruction speech information obtained from the speech instruction unit 25 in speech (S34).
Then, the speech translation apparatus 1a ends the processing. Accordingly, when one speaker speaks again, the speech translation apparatus 1a starts the process from step S11.
When the display direction substantially matches the sound source direction (yes at S32), the control unit 31a causes the speech recognition unit 23 and the translation unit 26 to operate. Then, the speech translation apparatus 1a proceeds to step S13, and performs the same processing as in fig. 3.
< Effect >
Next, the operation and effects of the speech translation apparatus 1a in the present embodiment will be described.
As described above, in the speech translation apparatus 1a according to the present embodiment, a plurality of speech input units 21 are provided. The speech translation apparatus 1a further includes: a sound source direction estimating unit 31 that estimates a sound source direction by performing signal processing on the voices input to the plurality of voice input units 21; and a control unit 31a for displaying the language 1 in a display area of the display unit 27 corresponding to the position of the speaker 1 with respect to the speech translation apparatus 1a, and for displaying the language 2 in a display area of the display unit 27 corresponding to the position of the speaker 2 with respect to the speech translation apparatus 1 a. Then, the control unit 31a compares the display direction, which is the display direction from the display unit 27 of the speech translation apparatus 1a toward the 1 st speaker or the 2 nd speaker and which is the direction to be displayed on one side of the display area of the display unit 27, with the sound source direction estimated by the sound source direction estimation unit 31, and causes the speech recognition unit 23 and the translation unit 26 to operate when the display direction substantially matches the sound source direction, and causes the speech recognition unit 23 and the translation unit 26 to stop operating when the display direction does not match the sound source direction.
Accordingly, when the display direction of the language displayed in the display area of the display unit 27 substantially matches the sound source direction of the speech uttered by the speaker, the 1 st speaker speaking the 1 st language or the 2 nd speaker speaking the 2 nd language can be identified. In this case, the speech of the 1 st speaker can be recognized in the 1 st language and the speech of the 2 nd speaker can be recognized in the 2 nd language. When the display direction is different from the sound source direction, translation of the input speech is stopped, which prevents speech that should not be translated from being translated and prevents mistranslation.
Accordingly, the speech translation apparatus 1a can reliably perform speech recognition on the speech in the 1 st language and the speech in the 2 nd language, and can therefore reliably translate the speech. In this way, by suppressing misrecognition and the like, the speech translation apparatus 1a can suppress an increase in its processing amount.
In the speech translation apparatus 1a according to the present embodiment, when the control unit 31a stops the operation of the speech recognition unit 23 and the translation unit 26, the utterance instruction unit 25 outputs the content urging the utterance in the instructed language again.
Accordingly, even when the display direction is different from the sound source direction, the content for urging the speaker to speak is output again by the speaking instruction unit 25, and the speaker to be the target starts speaking. Therefore, the speech translation apparatus 1a can reliably obtain the speech of the target speaker, and thus can more reliably translate the speech.
The speech translation apparatus 1a according to the present embodiment can also achieve the same operational effects as those of embodiment 1.
(modification of embodiment 2)
In the case where other structures in the present modification are not described in particular, the same structures are given the same reference numerals as in embodiment 1, and detailed description thereof is omitted.
The operation of the speech translation apparatus 1a having such a configuration will be described with reference to fig. 6.
Fig. 6 is a flowchart showing the operation of the speech translation apparatus 1a in the modification of embodiment 2.
The same processing as in fig. 5 is denoted by the same reference numerals, and description thereof is omitted as appropriate.
In the processing of the speech translation apparatus 1a, after steps S11 to S31 are performed, if the determination in step S32 is no, the control unit 31a determines whether or not a predetermined period has elapsed since the comparison between the display direction and the sound source direction (S32a).
If the predetermined period has not elapsed since the comparison between the display direction and the sound source direction (no in S32a), the control unit 31a returns the process to step S32a.
When the predetermined period has elapsed after the comparison between the display direction and the sound source direction (yes at S32a), the control unit 31a proceeds to step S33 and performs the same processing as that in fig. 5.
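The wait of step S32a can be sketched as a simple debounce; the 2-second period is an assumed value, since the disclosure leaves the predetermined period unspecified:

```python
# Hedged sketch of step S32a: after a direction mismatch, a
# predetermined period is allowed to elapse before re-prompting, so
# that overlapping speech can die down first.

import time

def wait_then_reprompt(reprompt, wait_seconds: float = 2.0) -> None:
    time.sleep(wait_seconds)   # S32a: let the predetermined period elapse
    reprompt()                 # S33/S34: prompt again in the instructed language
```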
As described above, in the speech translation apparatus 1a according to the present modification, when the display direction is different from the sound source direction, the utterance instruction unit 25 outputs the content prompting the utterance in the instructed language again after the predetermined period of time has elapsed after the comparison by the control unit 31 a.
Accordingly, by leaving a predetermined period after the comparison between the display direction and the sound source direction, the voices of the 1 st speaker and the 2 nd speaker can be prevented from being input at the same time. Then, after the predetermined period, the content prompting the target speaker to speak is output again, and that speaker starts speaking. In this way, the speech translation apparatus 1a can obtain the speech of the target speaker more reliably, and can translate it more reliably.
The speech translation apparatus 1a according to the present modification can also achieve the same operational effects as those of embodiment 2.
(embodiment mode 3)
< constitution >
The configuration of the speech translation apparatus 1b according to the present embodiment will be described with reference to fig. 7.
Fig. 7 is a block diagram showing a speech translation apparatus 1b according to embodiment 3.
The present embodiment differs from embodiment 1 and the like in estimation of the sound source direction.
Other configurations in the present embodiment are the same as those in embodiment 1 and the like, and the same reference numerals are given to the same configurations, and detailed descriptions of the same configurations are omitted, unless otherwise specified.
The speech translation device 1b includes a plurality of speech input units 21, a 1 st beam forming unit 41, a 2 nd beam forming unit 42, and an input switching unit 32, in addition to the speech detection unit 22, the preferential utterance input unit 24, the utterance instruction unit 25, the speech recognition unit 23, the translation unit 26, the display unit 27, the speech output unit 28, the power supply unit 29, and the sound source direction estimation unit 31.
[ multiple speech input units 21]
The plurality of voice input units 21 constitute a microphone array. Each of the plurality of voice input units 21 outputs the obtained acoustic signal to the 1 st beam forming unit 41 and the 2 nd beam forming unit 42. In the present embodiment, two voice input units 21 are used as an example.
[ 1 st beam forming unit 41 and 2 nd beam forming unit 42]
The 1 st beam forming unit 41 performs signal processing on the acoustic signal of the voice input to at least some of the voice input units 21 among the plurality of voice input units 21, thereby controlling the directivity of the collected voice to the sound source direction of the voice of the 1 st speaker. The 2 nd beam forming unit 42 performs signal processing on the acoustic signal of the voice input to at least some of the voice input units 21 among the plurality of voice input units 21, thereby controlling the directivity of the collected voice to the sound source direction of the voice of the 2 nd speaker. In the present embodiment, the 1 st and 2 nd beam forming units 41 and 42 perform signal processing on acoustic signals obtained from each of the plurality of voice input units 21.
Accordingly, the 1 st and 2 nd beam forming units 41 and 42 control the directivity of the collected sound toward a predetermined direction, thereby suppressing the input of sounds from directions other than the predetermined direction. The predetermined direction is, for example, the sound source direction of the speech of the 1 st speaker or the 2 nd speaker when speaking.
In the present embodiment, the 1 st beam forming unit 41 is disposed on the 1 st speaker side and connected to be able to communicate with each of the plurality of voice input units 21, and the 2 nd beam forming unit 42 is disposed on the 2 nd speaker side and connected to be able to communicate with each of the plurality of voice input units 21. The 1 st and 2 nd beam forming units 41 and 42 perform signal processing on the acoustic signals obtained from each of the plurality of voice input units 21, respectively, and output the acoustic processing signals, which are the results of the signal processing, to the input switching unit 32.
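The directivity control described here corresponds, in its simplest form, to a delay-and-sum beamformer. A minimal two-microphone sketch with illustrative parameters:

```python
# A two-microphone delay-and-sum beamformer, the simplest realization
# of the described directivity control; steering angle, spacing, and
# all names are illustrative assumptions.

import numpy as np

def delay_and_sum(sig_a: np.ndarray, sig_b: np.ndarray, fs: float,
                  mic_distance: float, steer_deg: float,
                  c: float = 343.0) -> np.ndarray:
    # Delay that aligns a far-field wavefront arriving from steer_deg.
    tau = mic_distance * np.sin(np.radians(steer_deg)) / c
    shift = int(round(tau * fs))
    # np.roll wraps around; a real implementation would zero-pad instead.
    aligned_b = np.roll(sig_b, shift)
    return 0.5 * (sig_a + aligned_b)   # sum reinforces the steered direction
```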
[ speaking instruction section 25]
The utterance instructing unit 25 causes the input switching unit 32 to perform switching so as to obtain either the output signal of the 1 st beam forming unit 41 or the output signal of the 2 nd beam forming unit 42. Specifically, when the sound source direction information indicating the estimated sound source direction is obtained from the sound source direction estimating unit 31, the utterance instructing unit 25 compares the sound source direction indicated by the sound source direction information with the predetermined directions, that is, the directivities of the collected sound of the beam forming units, and selects the beam forming unit whose predetermined direction substantially coincides with or is closest to the sound source direction.
The speech instructing unit 25 outputs a switching command to the input switching unit 32 so as to output an output signal of the beam forming unit selected from the 1 st beam forming unit 41 and the 2 nd beam forming unit 42.
[ input switching part 32]
The input switching unit 32 is a device that obtains an output signal of the 1 st beam forming unit 41 and an output signal of the 2 nd beam forming unit 42 and switches the output signal to be output to the voice detecting unit 22. The input switching unit 32 switches the obtained signal to the output signal of the 1 st beam forming unit 41 or the output signal of the 2 nd beam forming unit 42. Specifically, the input switching unit 32 switches from the output signal of the 1 st beam forming unit 41 to the output signal of the 2 nd beam forming unit 42 or from the output signal of the 2 nd beam forming unit 42 to the output signal of the 1 st beam forming unit 41 by obtaining a switching command from the utterance instructing unit 25. The input switching unit 32 outputs the output signal of the 1 st beam forming unit 41 to the voice detecting unit 22 or outputs the output signal of the 2 nd beam forming unit 42 to the voice detecting unit 22 by a switching command.
The input switching unit 32 is connected to be able to communicate with the 1 st beam forming unit 41, the 2 nd beam forming unit 42, the voice detecting unit 22, and the utterance instructing unit 25.
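A minimal model of this switching follows; the class and method names are hypothetical:

```python
# Hypothetical input switching unit 32: it holds a selection that the
# utterance instructing unit 25 updates via a switch command, and
# routes the selected beamformer output to the voice detection unit 22.

class InputSwitch:
    def __init__(self) -> None:
        self.selected = 1                       # 1st beamformer by default

    def on_switch_command(self, target: int) -> None:
        self.selected = target                  # 1 or 2

    def route(self, bf1_output, bf2_output):
        return bf1_output if self.selected == 1 else bf2_output
```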
< working >
The operation of the speech translation apparatus 1b configured as described above will be described.
Fig. 8 is a flowchart showing the operation of the speech translation apparatus 1b according to embodiment 3.
The same processing as in fig. 5 and the like is denoted by the same reference numerals, and description thereof is appropriately omitted.
As shown in fig. 8, after the processing of steps S11, S12a, S31, and S32 in the processing of the speech translation apparatus 1b, if the control unit 31a determines that the display direction and the sound source direction substantially coincide with each other (yes in S32), the utterance instructing unit 25 outputs a switch command to the input switching unit 32 (S51).
Specifically, given the positions from which the 1 st speaker and the 2 nd speaker speak, with the two voice input units 21 the 1 st beam forming unit 41 is more sensitive to the 1 st speaker than to the 2 nd speaker, and the 2 nd beam forming unit 42 is more sensitive to the 2 nd speaker than to the 1 st speaker.
Therefore, if the display direction is the display area of the display unit 27 on the 1 st speaker side, the 1 st beam forming unit 41 can have high sensitivity to the 1 st speaker's speech, and thus the speech instructing unit 25 outputs a switching command to the input switching unit 32 so that the output signal of the 1 st beam forming unit 41 is output. In this case, when the input switching section 32 obtains a switching command, the output signal of the 1 st beamforming section 41 is output.
Further, if the display direction is the display region of the display unit 27 on the 2 nd speaker side, the 2 nd beam forming unit 42 can have high sensitivity to the 2 nd speaker's speech, and thus the speech instructing unit 25 outputs a switching command to the input switching unit 32 so that the output signal of the 2 nd beam forming unit 42 is output. In this case, when the input switching section 32 obtains a switching command, the output signal of the 2 nd beam forming section 42 is output.
Then, the speech translation apparatus 1b proceeds to step S12, and performs the same processing as in fig. 5.
< Effect >
Next, the operation and effects of the speech translation apparatus 1b in the present embodiment will be described.
As described above, in the speech translation apparatus 1b according to the present embodiment, a plurality of speech input units 21 are provided. The speech translation apparatus 1b further includes: a 1 st beam forming unit 41 that performs signal processing on the voice input to at least some of the voice input units 21 among the plurality of voice input units 21 to control the directivity of the collected voice to the sound source direction of the voice of the 1 st speaker; a 2 nd beam forming unit 42 that performs signal processing on the voice input to at least some of the voice input units 21, thereby controlling the directivity of the collected voice to the sound source direction of the voice of the 2 nd speaker; an input switching unit 32 for switching the obtained signal to an output signal of the 1 st beam forming unit 41 or an output signal of the 2 nd beam forming unit 42; and a sound source direction estimating unit 31 that estimates a sound source direction by performing signal processing on the voices input to the plurality of voice input units 21. Then, the speech instructing unit 25 causes the input switching unit 32 to perform switching to obtain either the output signal of the 1 st beam forming unit 41 or the output signal of the 2 nd beam forming unit 42.
In this way, the direction of the speaker with respect to the speech translation apparatus 1b can be estimated by the sound source direction estimating unit 31. Therefore, the input switching unit 32 can switch to whichever of the output signal of the 1 st beam forming unit 41 and the output signal of the 2 nd beam forming unit 42 is suited to the direction of the speaker. That is, since the directivity of the collected sound of the beam forming unit can be oriented in the sound source direction, the speech translation apparatus 1b can collect the speech of the 1 st speaker and the 2 nd speaker while reducing the surrounding noise.
The speech translation apparatus 1b according to the present embodiment can also achieve the same operational effects as those of embodiment 1 and the like.
(modification of embodiment 3)
The speech translation apparatus 1c according to the present modification will be described with reference to fig. 9.
Fig. 9 is a block diagram showing a speech translation apparatus 1c according to a modification of embodiment 3.
Other configurations in this modification are the same as those in embodiment 1 and the like, and the same reference numerals are given to the same configurations, and detailed descriptions of the same configurations are omitted, unless otherwise specified.
As shown in fig. 9, the 1 st beam former 41 and the 2 nd beam former 42 are connected so as to be able to communicate with each of the plurality of voice input units 21, and are connected so as to be able to communicate with the sound source direction estimator 31 and the input switch 32.
The acoustic signals from each of the plurality of voice input units 21 are input to the 1 st beam forming unit 41 and the 2 nd beam forming unit 42. The 1 st beam forming unit 41 and the 2 nd beam forming unit 42 perform signal processing on each of the inputted acoustic signals, and thereby output each of the acoustic processing signals, which is the result of the signal processing, to the sound source direction estimating unit 31 and the input switching unit 32.
That is, in the present modification, each of the plurality of voice input units 21 is connected so as to be able to communicate with the 1 st beam forming unit 41 and the 2 nd beam forming unit 42, but is not directly connected to the sound source direction estimating unit 31.
In this way, the acoustic signal having high directivity of the collected sound in the sound source direction of the speaker's voice is input to the sound source direction estimation unit 31 through the 1 st beam forming unit 41 and the 2 nd beam forming unit 42.
In the speech translation apparatus 1c according to the present modification, a plurality of speech input units 21 are provided. The speech translation apparatus 1c further includes: a 1 st beam forming unit 41 that performs signal processing on the voice input to at least some of the voice input units 21 among the plurality of voice input units 21 to control the directivity of the collected voice to the sound source direction of the voice of the 1 st speaker; a 2 nd beam forming unit 42 that performs signal processing on the voice input to at least some of the voice input units 21, thereby controlling the directivity of the collected voice to the sound source direction of the voice of the 2 nd speaker; and a sound source direction estimating unit 31 that estimates a sound source direction by performing signal processing on the output signal of the 1 st beam forming unit 41 and the output signal of the 2 nd beam forming unit 42.
Accordingly, the direction to the speaker can be estimated by the sound source direction estimating unit 31. Therefore, the sound source direction estimating unit 31 performs signal processing on the output signal of the 1 st beam forming unit 41 and the output signal of the 2 nd beam forming unit 42 suitable for the direction of the talker, and can reduce the calculation cost of the signal processing.
The speech translation apparatus 1c according to the present modification can also achieve the same operational effects as those of the above-described embodiment 1 and the like.
(embodiment mode 4)
< constitution >
The configuration of the speech translation apparatus 1d according to the present embodiment will be described with reference to fig. 10.
Fig. 10 is a block diagram showing a speech translation apparatus 1d according to embodiment 4.
The present embodiment differs from embodiment 1 and the like in that the speech translation apparatus 1d includes a score calculation unit 43.
The configuration in the present embodiment is the same as embodiment 1 and the like, and the same reference numerals are given to the same configurations, and detailed description thereof is omitted, unless otherwise specified.
As shown in fig. 10, the speech recognition unit 23 of the speech translation apparatus 1d includes a score calculation unit 43.
[ fraction calculating section 43]
The score calculating unit 43 calculates a reliability score for the result of the speech recognition, and outputs the recognition result and the calculated reliability score to the utterance instructing unit 25. The reliability score indicates the accuracy (similarity) of the voice recognition performed on the voice indicated by the voice information obtained from the voice detection unit 22. For example, the score calculating unit 43 compares the text file into which the voice indicated by the voice information was converted with the voice itself, and calculates a reliability score indicating the similarity between the text file and the voice.
The score calculating unit 43 may not be provided in the speech recognition unit 23, and may be another device independent of the speech recognition unit 23.
[ speaking instruction section 25]
The utterance instructing unit 25 determines the accuracy of the speech recognition by evaluating the reliability score obtained from the score calculating unit 43 of the speech recognition unit 23. Specifically, the utterance instructing unit 25 determines whether or not the reliability score is equal to or less than a threshold value. When the reliability score is equal to or less than the threshold value, the utterance instructing unit 25 does not cause the speech with that score to be translated, and instead outputs content prompting an utterance via at least one of the display unit 27 and the speech output unit 28. When the reliability score is higher than the threshold value, the utterance instructing unit 25 causes the speech to be translated.
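This thresholding can be sketched as a small gate; the 0.6 threshold is an assumed value, since the disclosure does not fix one:

```python
# Hypothetical reliability gate for the utterance instructing unit 25:
# translation proceeds only when the score from the score calculating
# unit 43 exceeds the threshold.

def should_translate(reliability_score: float, threshold: float = 0.6) -> bool:
    return reliability_score > threshold

# Usage sketch:
# if should_translate(score):
#     translated = translator.translate(text)    # proceed to step S14
# else:
#     display.show("もう一度発話して下さい")        # S18: prompt to speak again
```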
< working >
The operation of the speech translation apparatus 1d configured as described above will be described.
Fig. 11 is a flowchart showing the operation of the speech translation apparatus 1d according to embodiment 4.
The same processing as in fig. 3 is denoted by the same reference numerals, and description thereof is omitted as appropriate.
In the processing of the speech translation apparatus 1d, after the processing of steps S11 to S13, the score calculating unit 43 of the speech recognition unit 23 calculates the reliability score of the speech recognition result, and outputs the calculated reliability score to the utterance instructing unit 25 (S61).
Next, the utterance instructing unit 25 obtains the reliability score from the score calculating unit 43 of the speech recognition unit 23, and determines whether or not the obtained reliability score is equal to or less than a threshold (S62).
When the reliability score is equal to or less than the threshold value (yes in S62), the utterance instruction unit 25 does not translate the speech with the reliability score equal to or less than the threshold value, and outputs again utterance instruction text information as a content for urging the utterance via the display unit 27 (S18). Then, the speech translation apparatus 1d proceeds to step S19, and performs the same processing as in fig. 3 and the like.
When the reliability score is higher than the threshold value (no in S62), the speech instructing unit 25 proceeds to step S14 and performs the same processing as in fig. 3 and the like.
< Effect >
Next, the operation and effects of the speech translation apparatus 1d in the present embodiment will be described.
As described above, in the speech translation device 1d according to the present embodiment, the speech recognition unit 23 outputs the result of speech recognition and the reliability score of the result, and when the reliability score obtained from the speech recognition unit 23 is equal to or less than the threshold value, the utterance instructing unit 25 does not translate the speech having the reliability score equal to or less than the threshold value, but outputs the content of prompting the utterance via at least one of the display unit 27 and the speech output unit 28.
Accordingly, if the reliability score indicating the accuracy of the voice recognition is equal to or less than the threshold, the utterance instructing unit 25 outputs the content prompting an utterance again, and the target speaker speaks again. Therefore, the speech translation apparatus 1d can reliably perform speech recognition on the speech of the target speaker, and can translate the speech more reliably.
In particular, when the speech output unit 28 outputs the content prompting speech by speech, the speaker can easily notice that speech recognition is not performed correctly.
The speech translation apparatus 1d according to the present embodiment can also achieve the same operational effects as those of the above-described embodiment 1 and the like.
(other modifications, etc.)
The present application has been described above based on embodiments 1 to 4 and the modifications of embodiments 2 and 3, but the present application is not limited to these embodiments and modifications.
For example, in the speech translation apparatus, the speech translation method, and the recording medium according to embodiments 1 to 4 and the modifications of embodiments 2 and 3 described above, the voice of the 1 st speaker and the voices of one or more 2 nd speakers may be transmitted to a cloud server via a network and stored in the cloud server, or only the 1 st text file and the 2 nd text file obtained by recognizing those voices may be transmitted to the cloud server via the network and stored there.
In the speech translation apparatus, the speech translation method, and the recording medium according to embodiments 1 to 4 and the modifications of embodiments 2 and 3, the speech recognition unit and the translation unit need not be mounted on the speech translation apparatus. In this case, the speech recognition unit and the translation unit may be engines mounted on a cloud server. The speech translation apparatus may transmit the obtained speech information to the cloud server, and may obtain from the cloud server the text file, the translated text file, and the translated speech that result from the speech recognition and translation performed by the cloud server based on that speech information.
The speech translation method according to embodiments 1 to 4 and the modifications of embodiments 2 and 3 described above is realized by a program executed by a computer, and the program may be stored in a storage device.

The processing units included in the speech translation apparatus, the speech translation method, and the program thereof according to embodiments 1 to 4 and the modifications of embodiments 2 and 3 are typically realized as an LSI, which is an integrated circuit. These may each be formed as one chip, or a part or all of them may be formed as one chip.

The integration is not limited to LSI, and may be realized by a dedicated circuit or a general-purpose processor. An FPGA (Field Programmable Gate Array) that is programmable after LSI manufacturing, or a reconfigurable processor in which the connection and setting of circuit cells within the LSI can be reconfigured, may also be used.

In embodiments 1 to 4 and the modifications of embodiments 2 and 3 described above, each component may be configured by dedicated hardware or may be realized by executing a software program suitable for that component. Each component may also be realized by a program execution unit such as a CPU or a processor reading out and executing a software program recorded on a recording medium such as a hard disk or a semiconductor memory.

The numbers used above are examples for explaining the present application, and embodiments 1 to 4 and the modifications of embodiments 2 and 3 of the present application are not limited to these example numbers.
Further, the division of the functional blocks in the block diagrams is an example, and a plurality of functional blocks may be implemented as one functional block, or one functional block may be divided into a plurality of functional blocks, or some functions may be shifted to another functional block. Also, the functions of a plurality of functional blocks having similar functions may be processed in parallel or in a time-sharing manner by a single piece of hardware or software.
The order in which the steps in the flowchart are executed is an example for specifically explaining the present application, and therefore, the steps may be performed in an order other than the above. Further, a part of the above steps may be executed simultaneously (in parallel) with the other steps.
In addition, configurations obtained by applying various modifications conceivable by a person skilled in the art to embodiments 1 to 4 and the modifications of embodiments 2 and 3, and configurations realized by arbitrarily combining the constituent elements and functions of embodiments 1 to 4 and the modifications of embodiments 2 and 3 without departing from the scope of the present application, are also included in the present application.
The present application is applicable to a speech translation apparatus, a speech translation method, and a recording medium, which are used for intention mediation when a plurality of speakers speaking in different languages are in conversation.

Claims (13)

1. A speech translation apparatus for a conversation between a 1 st speaker and a 2 nd speaker, the 1 st speaker speaking in a 1 st language, the 2 nd speaker being a conversation partner of the 1 st speaker and speaking in a 2 nd language different from the 1 st language,
the speech translation apparatus includes:
a voice detection unit that detects voice sections of voices uttered by the 1 st speaker and the 2 nd speaker from the voice input to the voice input unit;
a display unit that, when the voice in the voice section detected by the voice detection unit is recognized, displays a translation result of translating the 1 st language indicated by the voice into the 2 nd language, and displays a translation result of translating the 2 nd language into the 1 st language; and
a speech instruction unit that outputs, via the display unit, content prompting the 2 nd speaker to speak in the 2 nd language after the 1 st speaker speaks, and outputs, via the display unit, content prompting the 1 st speaker to speak in the 1 st language after the 2 nd speaker speaks.
2. The speech translation apparatus according to claim 1,
wherein the speech translation apparatus further includes a priority speech input unit configured so that, after the speech uttered by the 1 st speaker or the 2 nd speaker has been recognized, the speech uttered by the 1 st speaker or the 2 nd speaker who has just spoken is preferentially recognized again.
3. The speech translation apparatus according to claim 1 or 2,
the speech translation apparatus further includes:
a speech input unit to which speech of a conversation between the 1 st speaker and the 2 nd speaker is input;
a voice recognition unit that performs voice recognition on the voice in the voice section detected by the voice detection unit to convert the voice into a text file;
a translation section that translates the text file converted by the voice recognition section from the 1 st language into the 2 nd language, and from the 2 nd language into the 1 st language; and
a voice output unit that outputs the result translated by the translation unit as speech.
4. The speech translation apparatus according to claim 3,
the voice input part is provided in plurality,
the speech translation apparatus further includes:
a 1 st beam forming unit configured to perform signal processing on a voice input to at least a part of the plurality of voice input units, thereby controlling a directivity of a collected voice to a sound source direction of the voice of the 1 st speaker;
a 2 nd beam forming unit that performs signal processing on the voice input to at least some of the plurality of voice input units, thereby controlling the directivity of the collected voice to the sound source direction of the voice of the 2 nd speaker;
an input switching unit for switching the obtained signal to an output signal of the 1 st beam forming unit or an output signal of the 2 nd beam forming unit; and
a sound source direction estimating unit that estimates a sound source direction by performing signal processing on the voices input to the plurality of voice input units,
the speaking instruction unit causes the input switching unit to perform switching to obtain either the output signal of the 1 st beam forming unit or the output signal of the 2 nd beam forming unit.
5. The speech translation apparatus according to claim 3,
the voice input part is provided in plurality,
the speech translation apparatus further includes:
a sound source direction estimating unit that estimates a sound source direction by performing signal processing on the voices input to the plurality of voice input units; and
a control unit that displays the 1st language in a display area of the display unit corresponding to the position of the 1st speaker relative to the speech translation apparatus, and displays the 2nd language in a display area of the display unit corresponding to the position of the 2nd speaker relative to the speech translation apparatus,
wherein the control unit:
compares a display direction, which is the direction from the display unit of the speech translation apparatus toward the 1st speaker or the 2nd speaker whose language is displayed in one display area of the display unit, with the sound source direction estimated by the sound source direction estimating unit,
causes the voice recognition unit and the translation unit to operate when the display direction substantially coincides with the estimated sound source direction, and
stops the operations of the voice recognition unit and the translation unit when the display direction differs from the estimated sound source direction.
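
Claim 5 gates recognition and translation on agreement between the display direction and the estimated sound source direction. A sketch of that comparison follows; the 20-degree tolerance is an assumed stand-in for "substantially coincides", which the claim leaves unquantified.

import numpy as np

def should_process(display_angle_rad, estimated_angle_rad,
                   tol_rad=np.deg2rad(20)):
    """True when the two directions agree within the tolerance, in which
    case the voice recognition unit and translation unit are run."""
    # Wrap the difference into (-pi, pi] before comparing magnitudes.
    diff = np.angle(np.exp(1j * (display_angle_rad - estimated_angle_rad)))
    return abs(diff) <= tol_rad
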
6. The speech translation apparatus according to claim 5,
wherein, when the control unit stops the operations of the voice recognition unit and the translation unit, the speech instruction unit outputs again content urging speech in the indicated language.
7. The speech translation apparatus according to claim 5,
wherein, when the display direction differs from the estimated sound source direction, the speech instruction unit outputs again content urging speech in the indicated language after a predetermined period has elapsed since the control unit started the comparison.
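
Claims 6 and 7 differ only in when the urging content is re-output: upon stopping the units, or once a predetermined period has elapsed since the comparison began. A combined polling sketch is given below; the period, poll interval, and callbacks are all assumptions.

import time

def reprompt_if_no_match(compare, prompt, period_s=5.0, poll_s=0.1):
    """Poll the direction comparison; if no match occurs within
    period_s of starting, output the urging content again."""
    started = time.monotonic()
    while time.monotonic() - started < period_s:
        if compare():
            return True      # a correctly-directed utterance arrived in time
        time.sleep(poll_s)
    prompt()                 # predetermined period elapsed: urge speech again
    return False
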
8. The speech translation apparatus according to claim 3,
wherein a plurality of the voice input units are provided,
the speech translation apparatus further includes:
a 1st beam forming unit that performs signal processing on the voice input to at least some of the plurality of voice input units, thereby directing the sound collection directivity toward the sound source direction of the voice of the 1st speaker;
a 2nd beam forming unit that performs signal processing on the voice input to at least some of the plurality of voice input units, thereby directing the sound collection directivity toward the sound source direction of the voice of the 2nd speaker; and
a sound source direction estimating unit that estimates the sound source direction by performing signal processing on the output signal of the 1st beam forming unit and the output signal of the 2nd beam forming unit.
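
Claim 8 estimates the direction from the beamformer outputs themselves rather than from the raw microphone signals. The simplest reading, offered here only as a sketch and not necessarily the patented estimator, is to attribute the utterance to whichever fixed beam carries more power.

import numpy as np

def doa_from_beams(beam1, beam2, angle1_rad, angle2_rad):
    """Return the steering angle of the more powerful beam as a coarse
    sound source direction estimate."""
    p1 = np.mean(np.square(beam1))
    p2 = np.mean(np.square(beam2))
    return angle1_rad if p1 >= p2 else angle2_rad
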
9. The speech translation apparatus according to claim 1 or 2,
wherein the speech instruction unit:
outputs, when the speech translation apparatus is activated, content urging the 1st speaker to speak in the 1st language via the display unit, and
outputs, after the speech uttered by the 1st speaker has been translated from the 1st language into the 2nd language and the translation result has been displayed on the display unit, content urging the 2nd speaker to speak in the 2nd language via the display unit.
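
Claim 9 fixes the prompting order: urge the 1st speaker on activation, then urge the conversation partner after each displayed translation. A small loop sketch follows, in which show and translate are placeholder callables and the prompt text is invented.

def prompting_loop(turns, show, translate, lang1="1st language", lang2="2nd language"):
    """Alternate urging content between speakers as in claim 9. turns is
    an iterable of (recognized_text, speaker_index) pairs."""
    langs = {1: lang1, 2: lang2}
    show("Please speak.", langs[1])            # activation: urge the 1st speaker
    for text, speaker in turns:
        other = 2 if speaker == 1 else 1
        show(translate(text, langs[speaker], langs[other]), langs[other])
        show("Please speak.", langs[other])    # then urge the conversation partner
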
10. The speech translation apparatus according to claim 3,
wherein the speech instruction unit:
causes, after translation is started, the voice output unit to output voice urging speech up to a predetermined number of times, and
causes, after the voice urging speech has been output the predetermined number of times, the display unit to output a message urging speech.
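
Claim 10 caps the number of voiced prompts before falling back to the display. A counter sketch follows; the cap of three is an assumed value for the claim's "predetermined number of times".

class PromptPolicy:
    """Voice the urging content up to max_voiced times after translation
    starts, then switch to a displayed message."""
    def __init__(self, speak, show, max_voiced=3):
        self.speak, self.show = speak, show
        self.voiced, self.max_voiced = 0, max_voiced

    def urge(self, text):
        if self.voiced < self.max_voiced:
            self.voiced += 1
            self.speak(text)   # audible prompt, counted against the cap
        else:
            self.show(text)    # cap reached: display the message instead
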
11. The speech translation apparatus according to claim 3,
wherein the voice recognition unit outputs a result of voice recognition of a voice together with a reliability score of the result, and
when the reliability score obtained from the voice recognition unit is equal to or less than a threshold value, the speech instruction unit outputs content urging speech via at least one of the display unit and the voice output unit, without translating the speech whose reliability score is equal to or less than the threshold value.
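
Claim 11 is a plain threshold gate on the recognizer's reliability score. A sketch follows, assuming a score in [0, 1] and an illustrative threshold of 0.6; the prompt text and names are invented.

def gate_on_reliability(text, score, translate, urge, threshold=0.6):
    """Translate only when the reliability score exceeds the threshold;
    otherwise urge re-speaking without translating, as in claim 11."""
    if score <= threshold:
        urge("The speech could not be recognized reliably. Please speak again.")
        return None
    return translate(text)
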
12. A speech translation method for a conversation between a 1st speaker and a 2nd speaker, the 1st speaker speaking in a 1st language, the 2nd speaker being a conversation partner of the 1st speaker and speaking in a 2nd language different from the 1st language,
the voice translation method comprises the following steps:
detecting voice sections of the voices uttered by the 1st speaker and the 2nd speaker from the voice input to the voice input unit,
displaying, on the display unit, a translation result obtained by performing voice recognition on the voice in the detected voice sections and translating the content indicated by the voice from the 1st language into the 2nd language, and likewise displaying a translation result of translation from the 2nd language into the 1st language, and
after the 1st speaker speaks, outputting via the display unit content urging the 2nd speaker to speak in the 2nd language, and, after the 2nd speaker speaks, outputting via the display unit content urging the 1st speaker to speak in the 1st language.
13. A computer-readable non-transitory recording medium having recorded thereon a program for causing a computer to execute the speech translation method according to claim 12.
CN202010185150.XA 2019-03-25 2020-03-17 Speech translation device, speech translation method, and recording medium Pending CN111739511A (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US201962823197P 2019-03-25 2019-03-25
US62/823197 2019-03-25
JP2019196078A JP7429107B2 (en) 2019-03-25 2019-10-29 Speech translation device, speech translation method and its program
JP2019-196078 2019-10-29

Publications (1)

Publication Number Publication Date
CN111739511A (en) 2020-10-02

Family

ID=72643263

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010185150.XA Pending CN111739511A (en) 2019-03-25 2020-03-17 Speech translation device, speech translation method, and recording medium

Country Status (2)

Country Link
JP (1) JP7429107B2 (en)
CN (1) CN111739511A (en)

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2001100788A (en) * 1999-09-30 2001-04-13 Sony Corp Speech processor, speech processing method and recording medium
JP2002135642A (en) * 2000-10-24 2002-05-10 Atr Onsei Gengo Tsushin Kenkyusho:Kk Speech translation system
JP3974412B2 (en) * 2001-01-24 2007-09-12 松下電器産業株式会社 Audio converter
JP2011248140A (en) * 2010-05-27 2011-12-08 Fujitsu Toshiba Mobile Communications Ltd Voice recognition device
JP6250209B1 (en) * 2017-03-27 2017-12-20 株式会社リクルートライフスタイル Speech translation device, speech translation method, and speech translation program

Also Published As

Publication number Publication date
JP7429107B2 (en) 2024-02-07
JP2020160429A (en) 2020-10-01

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: Osaka, Japan

Applicant after: Panasonic Holding Co., Ltd.

Address before: Osaka, Japan

Applicant before: Matsushita Electric Industrial Co., Ltd.