WO2014207874A1 - Electronic device, output method, and program - Google Patents

Electronic device, output method, and program Download PDF

Info

Publication number
WO2014207874A1
WO2014207874A1 PCT/JP2013/067716 JP2013067716W
Authority
WO
WIPO (PCT)
Prior art keywords
information
sound
sound information
audio
audio information
Prior art date
Application number
PCT/JP2013/067716
Other languages
French (fr)
Japanese (ja)
Inventor
谷内 謙一
Original Assignee
株式会社東芝 (Toshiba Corporation)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 株式会社東芝 (Toshiba Corporation)
Priority to PCT/JP2013/067716
Publication of WO2014207874A1

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272 - Voice signal separating
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04H - BROADCAST COMMUNICATION
    • H04H20/00 - Arrangements for broadcast or for distribution combined with broadcast
    • H04H20/10 - Arrangements for replacing or switching information during the broadcast or the distribution
    • H04H20/106 - Receiver-side switching

Definitions

  • Embodiments described herein relate generally to an electronic device, an output method, and a program.
  • the electronic device of the embodiment includes a separation unit, a conversion unit, and an output unit.
  • the separation unit separates the background sound information and the first sound information from the sound information.
  • the conversion unit converts the first sound information into second sound information corresponding to the first sound information.
  • the output unit mixes and outputs the background sound information and the second sound information.
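As a loose illustration only (not the patented implementation), the three units summarized above can be sketched as a simple pipeline; the function names and data shapes are hypothetical placeholders:

```python
def output_pipeline(sound_info, separate, convert, mix):
    """Sketch of the embodiment's three units.

    separate: splits sound_info into (background, first_speech)
    convert:  maps first_speech to second_speech (e.g. translated audio)
    mix:      combines background with second_speech for output
    """
    background, first_speech = separate(sound_info)
    second_speech = convert(first_speech)
    return mix(background, second_speech)
```

The point of the structure is that the background sound bypasses conversion entirely and is only recombined at the output stage.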
  • FIG. 1 is a block diagram showing a main signal processing system of a digital television as an example of the electronic apparatus according to the first embodiment.
  • FIG. 2 is a block diagram illustrating a configuration of a signal processing unit included in the digital television according to the first embodiment.
  • FIG. 3 is a flowchart illustrating a flow of output processing of sound information and image information by a signal processing unit included in the digital television according to the first embodiment.
  • FIG. 4 is a diagram illustrating an example of a setting screen for various information in the digital television according to the first embodiment.
  • FIG. 5 is a diagram illustrating a configuration of an information processing system having a notebook PC as an example of an electronic apparatus according to the second embodiment.
  • FIG. 6 is a sequence diagram illustrating a flow of output processing of sound information in the information processing system according to the second embodiment.
  • FIG. 7 is a diagram illustrating a hardware configuration of a PC that is an example of an electronic apparatus according to the third embodiment.
  • FIG. 8 is a block diagram illustrating a functional configuration of a PC according to the third embodiment.
  • FIG. 1 is a block diagram showing a main signal processing system of a digital television as an example of the electronic apparatus according to the first embodiment.
  • The satellite digital television broadcast signal received by the BS/CS digital broadcast receiving antenna 121 is supplied to the satellite digital broadcast tuner 202a provided in the broadcast input unit 202 via the input terminal 201.
  • the tuner 202a selects a broadcast signal of a desired channel based on a control signal from the control unit 205, and outputs the selected broadcast signal to a PSK (Phase Shift Keying) demodulator 202b.
  • the PSK demodulator 202b included in the broadcast input unit 202 demodulates the broadcast signal selected by the tuner 202a based on a control signal from the control unit 205, and obtains a transport stream (TS) including a desired program. The result is output to the TS decoder 202c.
  • A TS decoder 202c included in the broadcast input unit 202 performs TS decoding processing on the signal in which the transport stream (TS) is multiplexed, based on a control signal from the control unit 205, and outputs the digital video signal and sound signal (PES: Packetized Elementary Stream) of the desired program to the STD buffer in the signal processing unit 206.
  • the TS decoder 202c outputs section information transmitted by digital broadcasting to a section processing unit (not shown) in the signal processing unit 206.
  • The terrestrial digital television broadcast signal received by the terrestrial broadcast receiving antenna 122 is supplied to the terrestrial digital broadcast tuner 204a provided in the broadcast input unit 202 via the input terminal 203.
  • the tuner 204a can select a broadcast signal of a desired channel by a control signal from the control unit 205.
  • the tuner 204a outputs the broadcast signal to an OFDM (Orthogonal Frequency Division Multiplexing) demodulator 204b.
  • The OFDM demodulator 204b included in the broadcast input unit 202 demodulates the broadcast signal selected by the tuner 204a based on a control signal from the control unit 205, obtains a transport stream including the desired program, and outputs it to the TS decoder 204c.
  • A TS decoder 204c included in the broadcast input unit 202 performs TS decoding processing on the signal in which the transport stream (TS) is multiplexed, based on a control signal from the control unit 205, and outputs the digital video signal and sound signal of the desired program to the STD buffer in the signal processing unit 206.
  • the TS decoder 204c outputs section information transmitted by digital broadcasting to a section processing unit (not shown) in the signal processing unit 206.
  • The signal processing unit 206 selectively performs predetermined digital signal processing on the digital video signals and sound signals supplied from the TS decoder 202c and the TS decoder 204c when viewing television, and outputs the results to the graphic processing unit 207 and the audio output unit 208. Further, at the time of program recording, the signal processing unit 206 selectively performs predetermined digital signal processing on the digital video signals and sound signals supplied from the TS decoder 202c and the TS decoder 204c, and records the resulting signals in the round-recording storage device (for example, an HDD: Hard Disk Drive) 271 or, via the control unit 205, in the external storage device 226.
  • The round recording according to the present embodiment differs from reserved recording, in which recording is performed in units of program content selected by the user: in order to prevent the user from missing broadcasts, it is a method of recording all program content broadcast on a broadcast channel during a predetermined time period (which may be the entire day).
  • the recording time zone may be different for each broadcast channel.
  • During playback of a recorded program, the signal processing unit 206 also performs predetermined digital signal processing on the recorded program data (video signal and sound signal) read from the round-recording storage device 271 or the external storage device 226 via the control unit 205, and outputs the result to the graphic processing unit 207 and the audio output unit 208.
  • A section processing unit (not shown) included in the signal processing unit 206 outputs to the control unit 205 various data for acquiring a program, electronic program guide (EPG) information, program attribute information (program genre, etc.), and subtitle information and the like (service information: SI and PSI), obtained from the section information input from the TS decoders 202c and 204c.
  • The tuner 202a, the PSK demodulator 202b, the TS decoder 202c, the tuner 204a, the OFDM demodulator 204b, and the TS decoder 204c shown in FIG. 1 are provided in at least as many systems as are necessary for the round recording function.
  • For example, when the digital television 100 is an apparatus capable of recording all the terrestrial key stations in Tokyo, the digital television 100 includes seven or more each of the tuners 204a, the OFDM demodulators 204b, and the TS decoders 204c.
  • Various data for acquiring a program (such as key information for B-CAS descrambling), electronic program guide (EPG) information, program attribute information (program genre, etc.), and subtitle information and the like (service information: SI and PSI) are input to the control unit 205 from the signal processing unit 206. The control unit 205 generates screen information for displaying EPG information, caption information, and the like from the input information, and outputs the generated screen information to the graphic processing unit 207.
  • control unit 205 has a function of controlling program recording and program reservation recording.
  • When a program reservation is accepted, the control unit 205 generates screen information for displaying the EPG information on the display unit 214, and outputs the generated screen information to the graphic processing unit 207.
  • Reservation contents are set in a predetermined storage unit by user input via the operation unit 220 or the remote controller 221. The control unit 205 then controls the tuners 202a and 204a, the PSK demodulator 202b, the OFDM demodulator 204b, the TS decoders 202c and 204c, and the signal processing unit 206 so as to record the reserved program at the set time.
  • When the digital television 100 automatically records the programs of all channels that can be recorded by the round recording function, it performs the recording by controlling each device during a time period set separately from reservation recording.
  • The OSD (On Screen Display) signal generation unit 209 generates setting screen information (an OSD signal) for displaying a setting screen for setting various information, and outputs the generated setting screen information to the graphic processing unit 207.
  • the graphic processing unit 207 outputs the digital video signal output from the signal processing unit 206, the setting screen information generated by the OSD signal generation unit 209 and the screen information generated by the control unit 205 to the video processing unit 210.
  • the digital video signal output from the graphic processing unit 207 is supplied to the video processing unit 210.
  • The video processing unit 210 converts the input digital video signal into an analog video signal in a format that can be displayed on the display unit 214 or on an external device connected via the output terminal 211, and then outputs the analog video signal to the output terminal 211 or the display unit 214 for display.
  • The audio output unit 208 converts the input digital sound signal into an analog sound signal in a format that can be played back by the speaker 213, and then outputs the analog sound signal to the speaker 213 or to an external device connected via the output terminal 212 for playback.
  • The control unit 205 incorporates a CPU (Central Processing Unit) and the like, receives operation information from the operation unit 220 or operation information sent from the remote controller 221 via the light receiving unit 222, and controls each unit so that the operation content is reflected.
  • The control unit 205 uses a ROM (Read Only Memory) 205a that stores the control program executed by the CPU, a RAM (Random Access Memory) 205b that provides a work area for the CPU, and a non-volatile memory 205c that stores various setting information and control information.
  • the control unit 205 is connected to a card holder 225 in which a memory card 224 can be mounted via a card I / F (Interface) 223. As a result, the control unit 205 can transmit information to the memory card 224 attached to the card holder 225 via the card I / F 223.
  • control unit 205 is connected to the first LAN terminal 230 via the communication I / F 229. As a result, the control unit 205 can transmit information to and from the LAN compatible device connected to the first LAN terminal 230 via the communication I / F 229.
  • the control unit 205 is connected to the second LAN terminal 232 via the communication I / F 231. Accordingly, the control unit 205 can transmit information to and from various LAN-compatible devices connected to the second LAN terminal 232 via the communication I / F 231.
  • control unit 205 is connected to the USB terminal 234 via the USB I / F 233. Accordingly, the control unit 205 can transmit information to various devices (for example, the external storage device 226) connected to the USB terminal 234 via the USB I / F 233.
  • FIG. 2 is a block diagram illustrating a configuration of a signal processing unit included in the digital television according to the first embodiment.
  • The signal processing unit 206 decodes the video signal (image information reproduced in synchronization with the sound signal) input from the broadcast input unit 202 or the control unit 205 into a data format that can be processed by the video processing unit 210. The signal processing unit 206 further includes: an audio decoder 242 that decodes the sound signal input from the broadcast input unit 202 or the control unit 205 into a data format that can be processed by the audio output unit 208; a switch unit 248 that switches the output destination of the sound signal decoded by the audio decoder 242 to the separator 243 or the synchronization processing unit 247; a separator 243 that separates background sound information and first sound information from the sound signal (sound information) decoded by the audio decoder 242; a translator 244 that performs voice recognition processing to analyze the first voice information and acquire its content as text data, and translates that text data into a translated language (second language) different from the original language (first language) of the first voice information; a synthesizer 245 that synthesizes the second sound information from the text data translated into the translated language; a mixing unit 246 that mixes and outputs the background sound information and the second sound information; and a synchronization processing unit 247 that synchronizes the sound obtained by the mixing with the image information.
  • the translator 244 and the synthesizer 245 function as a conversion unit that converts the first speech information into second speech information in a translation language different from the original language of the first speech information.
  • In the present embodiment, an example is described in which the translator 244 and the synthesizer 245 convert the first speech information into second speech information in a translation language different from the original language of the first speech information; however, it is only necessary to convert the first speech information into second speech information corresponding to the first speech information.
  • For example, first voice information in the standard language may be converted into second voice information in a dialect, or first voice information in a natural voice may be converted into second voice information in a pseudo (synthetic) voice.
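Whatever conversion is chosen (translation, dialect mapping, pseudo voice), the conversion unit composes recognition, a text transformation, and synthesis. A hedged sketch, with the three stages passed in as hypothetical function arguments:

```python
def convert_speech(first_audio, recognize, transform_text, synthesize):
    """Generic conversion unit: ASR -> text transform -> TTS.

    recognize:      audio -> text (the voice recognition step)
    transform_text: text -> text (translation, dialect mapping, etc.)
    synthesize:     text -> audio (the synthesizer 245's role)
    """
    text = recognize(first_audio)
    converted = transform_text(text)
    return synthesize(converted)
```

Swapping in a different `transform_text` yields the dialect or pseudo-voice variants without changing the surrounding pipeline.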
  • the signal processing unit 206 includes a switch unit 248.
  • When conversion to the second audio information is instructed by a control signal from the control unit 205, the switch unit 248 outputs the sound information decoded by the audio decoder 242 to the separator 243, so that the sound information reaches the synchronization processing unit 247 via the separator 243, the translator 244, the synthesizer 245, and the mixing unit 246.
  • When conversion to the second audio information is not instructed, the switch unit 248 outputs the input sound information to the synchronization processing unit 247 without going through the separator 243, the translator 244, the synthesizer 245, and the mixing unit 246.
  • FIG. 3 is a flowchart illustrating a flow of output processing of sound information and image information by a signal processing unit included in the digital television according to the first embodiment.
  • FIG. 4 is a diagram illustrating an example of a setting screen for various information in the digital television according to the first embodiment.
  • When the control unit 205 instructs conversion to second audio information, the OSD signal generation unit 209 (an example of a display control unit) generates, before the output processing of sound information and image information by the signal processing unit 206, setting screen information of a setting screen on which the various settings used in that processing (such as the synchronization setting) can be made, and outputs it to the graphic processing unit 207.
  • The OSD signal generation unit 209 thereby causes the display unit 214 to display the setting screen.
  • Specifically, the OSD signal generation unit 209 causes the display unit 214 to display a setting screen 400 including: a slider 401, an example of a volume input image, with which the volume of each of the first sound information (original sound), the second sound information (translated sound), and the background sound information (background sound) can be input; a select box 402 with which the translation language, that is, the language of the second audio information, can be input; a radio button 403 with which it can be set whether to adjust the reproduction time of the second audio information or the reproduction time of the image information; and the like.
  • In the present embodiment, the OSD signal generation unit 209 displays on the display unit 214 the slider 401 with which the volume of each of the background sound information, the first sound information, and the second sound information can be input; however, it is only necessary to display a volume input image with which the volume of at least each of the first sound information and the second sound information can be input.
  • the audio decoder 242 first determines whether or not conversion to second audio information is instructed by the control signal from the control unit 205 (step S301). When conversion to the second audio information is instructed (step S301: Yes), the audio decoder 242 decodes the input audio information into a data format that can be processed by the audio output unit 208. Further, the separator 243 separates the first sound information and the background sound information from the sound information decoded by the sound decoder 242 (step S302).
  • the separator 243 first performs frequency analysis of the sound information and acquires a feature amount of the sound information.
  • the separator 243 may acquire a feature amount obtained by frequency analysis in an external device.
  • the separator 243 calculates a background sound base matrix representing the background sound using the feature amount acquired at a certain time.
  • the separator 243 estimates a first background sound component having non-stationaryness among the background sound components of the feature amount using the acquired feature amount and the calculated background sound base matrix.
  • the separator 243 estimates a representative component of the first background sound component within a predetermined time from the first background sound component estimated from one or more feature amounts acquired at a predetermined time including the past.
  • the separator 243 estimates the first speech component that is the speech component of the feature amount using the acquired feature amount. Further, the separator 243 creates a filter that extracts the spectrum of the sound or the spectrum of the background sound from the estimated first sound component and the representative component of the first background sound component. Next, the separator 243 separates the sound information into the first sound information and the background sound information using the created filter and the spectrum of the sound information.
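As a loose illustration of the filter-based separation described above, the following sketch applies a Wiener-style soft mask, built from estimated speech and background components, to a magnitude spectrogram. It assumes NumPy and omits the base-matrix and non-stationary-component estimation steps, so it is a simplified stand-in rather than the patented algorithm:

```python
import numpy as np

def separate_by_mask(spectrum, speech_estimate, background_estimate):
    """Split a magnitude spectrogram into speech and background parts.

    A soft mask is built from the estimated speech and background
    components, then applied to the mixed spectrum -- a simplified
    stand-in for the filter the separator 243 creates.
    """
    eps = 1e-12  # avoid division by zero in silent bins
    mask = speech_estimate / (speech_estimate + background_estimate + eps)
    speech_part = mask * spectrum
    background_part = (1.0 - mask) * spectrum
    return speech_part, background_part
```

Because the mask and its complement sum to one, the two parts add back up to the original spectrum, which keeps the later remixing step lossless at this level of abstraction.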
  • the translator 244 acquires text data from the first voice information separated from the sound information by the separator 243 by voice recognition processing (step S303). Further, the translator 244 acquires a translation language set in advance on the setting screen 400 shown in FIG. 4 (step S304). Then, the translator 244 translates the text data acquired from the first speech information into text data of a preset translation language by natural language processing (step S305).
  • the synthesizer 245 synthesizes speech information (second speech information in the translation language) from the text data translated by the translator 244 (text data in a preset translation language) (step S306).
  • Next, the mixing unit 246 acquires the synchronization setting indicating whether to adjust the reproduction time of the second audio information or the reproduction time of the image information (in this embodiment, the synchronization setting input on the setting screen 400 shown in FIG. 4) (step S307). Next, the mixing unit 246 determines whether or not the reproduction time of the synthesized second audio information differs from the reproduction time of the first audio information (step S308). If the reproduction time of the second audio information differs from the reproduction time of the first audio information (step S308: Yes), the mixing unit 246 determines, based on the acquired synchronization setting, whether or not to adjust the reproduction time of the second audio information (step S309).
  • In the present embodiment, the mixing unit 246 determines whether or not the reproduction time of the second audio information differs from the reproduction time of the first audio information; however, it may instead adjust the reproduction time of the second audio information or the reproduction time of the image information only when the difference between the two reproduction times is longer than a predetermined allowable time. In this way, when the difference between the reproduction time of the second audio information and the reproduction time of the first audio information is short, the image information can be viewed without adjusting the reproduction time of the second audio information or the reproduction time of the image information.
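The allowable-time variant amounts to a simple guard before any adjustment. A minimal sketch, with hypothetical names and an illustrative default tolerance:

```python
def needs_adjustment(audio_duration, video_duration, tolerance=0.5):
    """Adjust only when the durations differ by more than `tolerance` seconds.

    Mirrors the variant in which short mismatches between the second
    audio information and the image information are simply tolerated.
    """
    return abs(audio_duration - video_duration) > tolerance
```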
  • When the reproduction time of the second audio information is to be adjusted (step S309: Yes), the mixing unit 246 adjusts the reproduction time of the second audio information so that it becomes the same as the reproduction time of the image information reproduced in synchronization with the second audio information (in other words, the same as the reproduction time of the first audio information) (step S310). As a result, the second audio information and the image information can be reproduced in synchronization.
  • In the present embodiment, the mixing unit 246 determines, from among the input image information, the image information to be reproduced in synchronization with the second audio information by comparing the time stamp added to the second audio information with the time stamp added to the image information. Further, in the present embodiment, the mixing unit 246 adjusts the reproduction time of the second audio information so that it becomes the same as the reproduction time of the image information reproduced in synchronization with the second audio information; however, any adjustment that brings the difference between the reproduction time of the second audio information and the reproduction time of that image information to or below a predetermined allowable time may be used.
  • the translator 244 translates the text data acquired from the first audio information into a plurality of text data in a preset translation language.
  • the synthesizer 245 synthesizes a plurality of second speech information candidates from each of a plurality of text data in a preset translation language. That is, the translator 244 and the synthesizer 245 convert the first speech information into a plurality of second speech information candidates.
  • The mixing unit 246 then adjusts the reproduction time of the second audio information by selecting, from among the plurality of second audio information candidates, a candidate that can be reproduced in the same reproduction time as the image information reproduced in synchronization with the second audio information, and using the selected candidate as the second audio information.
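One plausible way to realize the candidate selection above is to pick the synthesized candidate whose duration is closest to the target image reproduction time. The data layout (audio paired with a duration in seconds) is a hypothetical illustration:

```python
def pick_candidate(candidates, target_duration):
    """Return the candidate whose duration is closest to target_duration.

    candidates: iterable of (audio, duration_seconds) pairs, e.g. several
    alternative translations synthesized by the synthesizer 245.
    """
    return min(candidates, key=lambda c: abs(c[1] - target_duration))
```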
  • In the present embodiment, the synthesizer 245 synthesizes a plurality of candidates for the second speech information from all of the plurality of text data in the preset translation language; however, the present invention is not limited to this. It is also possible to first select, from the plurality of text data in the preset translation language, text data from which second audio information reproducible in the same reproduction time as the synchronized image information can be synthesized, and then use the voice information synthesized from the selected text data as the second voice information.
  • In the present embodiment, the mixing unit 246 selects, as the second audio information, a second audio information candidate that can be reproduced in the same reproduction time as the image information from among a plurality of candidates; however, the present invention is not limited to this. For example, the reproduction time of the second audio information may be adjusted by controlling the audio output unit 208 to change the reproduction speed at which the second audio information is reproduced.
  • On the other hand, when the reproduction time of the image information is to be adjusted (step S309: No), the synchronization processing unit 247 adjusts the reproduction time of the image information reproduced in synchronization with the second audio information so that it becomes the same as the reproduction time of the second audio information (step S311).
  • Specifically, the synchronization processing unit 247 adjusts the reproduction time of the image information by controlling the video processing unit 210 to change the reproduction speed at which the image information reproduced in synchronization with the second audio information is reproduced. Thereby, the second audio information and the image information can be reproduced in synchronization.
  • In the present embodiment, the synchronization processing unit 247 adjusts the reproduction time of the image information by changing the reproduction speed at which the image information is reproduced; however, when the image information is moving image information, the reproduction time of the image information may instead be adjusted by thinning out some of the plurality of frames constituting the moving image information or by adding frames.
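Thinning out or duplicating frames to reach a target frame count can be sketched with even index resampling. This assumes frames are simply a list; real video handling would work on decoded picture buffers:

```python
def retime_frames(frames, target_count):
    """Drop or duplicate frames evenly so len(result) == target_count.

    Mirrors the idea of adjusting the video reproduction time by
    thinning out or adding frames rather than changing playback speed.
    """
    if target_count <= 0 or not frames:
        return []
    n = len(frames)
    # Map each output slot back to a source frame index.
    return [frames[min(n - 1, int(i * n / target_count))] for i in range(target_count)]
```

Shrinking the list shortens the reproduction time at a fixed frame rate; growing it lengthens the time, with some frames shown twice.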
  • In the present embodiment, either the reproduction time of the second audio information or the reproduction time of the image information reproduced in synchronization with the second audio information is adjusted; however, the present invention is not limited to this, as long as at least one of the two reproduction times is adjusted so that the reproduction time of the second audio information and the reproduction time of the image information reproduced in synchronization with it become the same.
  • For example, when the difference between the reproduction time of the second audio information and the reproduction time of the image information reproduced in synchronization with it is greater than a preset allowable value (such as when the reproduction time of the second audio information is twice or more, or half or less, the reproduction time of that image information), the two reproduction times may be made the same by adjusting both of them, rather than only one. Specifically, when the reproduction time of the second audio information is short, the reproduction time of the second audio information is lengthened and the reproduction time of the image information reproduced in synchronization with it is shortened; when the reproduction time of the second audio information is long, the reproduction time of the second audio information is shortened and the reproduction time of the image information is lengthened.
  • In the present embodiment, which of the reproduction time of the second audio information and the reproduction time of the image information reproduced in synchronization with it is adjusted is determined based on the synchronization setting; however, the present invention is not limited to this. Specifically, which of the two reproduction times is adjusted may be determined based on at least one of the type of the image reproduced from the image information and the difference between the reproduction time of the second audio information and the reproduction time of the image information.
  • For example, when the difference between the reproduction time of the second audio information and the reproduction time of the image information reproduced in synchronization with it is less than a preset allowable value, so that there is little possibility that the user will feel uncomfortable with the image reproduced from the image information even if its reproduction time is adjusted, it may be decided to adjust the reproduction time of the image information.
  • the image information is moving image information, or when the difference between the reproduction time of the second audio information and the reproduction time of the image information reproduced in synchronization with the second audio information is greater than a preset allowable value May be determined to adjust the reproduction time of the second audio information.
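The selection logic described in the preceding bullets can be sketched as follows. The function name and the 0.5-second allowable value are illustrative assumptions, not values given in the embodiment.

```python
def choose_adjustment(audio_len_s, image_len_s, is_moving_image,
                      allowable_diff_s=0.5):
    """Decide whose reproduction time to adjust so that the translated
    (second) audio and its synchronized image finish together.

    Returns "image" when stretching the image display is unlikely to
    look unnatural (still image and a small difference), and "audio"
    otherwise (moving image, or difference above the allowable value).
    The 0.5 s default is an assumed placeholder.
    """
    diff = abs(audio_len_s - image_len_s)
    if is_moving_image or diff > allowable_diff_s:
        return "audio"
    return "image"
```

For example, a still image shown 0.3 s longer than the translated speech would lead to adjusting the image display time, while a moving image always leads to adjusting the synthesized audio.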
  • Next, the mixing unit 246 adjusts the frequency of the second audio information based on the original language of the first audio information and the translation language of the second audio information (step S312). For example, when the original language of the first audio information is English and the translation language of the second audio information is Japanese, the mixing unit 246 lowers the frequency of the second audio information.
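As an illustration of lowering the frequency of the second audio information, a naive resampling sketch is shown below. The embodiment does not specify a concrete algorithm, so this is only one plausible approach; note that it also stretches the duration as a side effect, which would interact with the reproduction-time adjustment described above.

```python
def shift_frequency(samples, factor):
    """Resample a mono waveform so every frequency is multiplied by
    `factor` (factor < 1.0 lowers the pitch, as when replacing English
    speech with lower-pitched Japanese speech). Naive linear
    interpolation; the duration grows by 1/factor as a side effect.
    This algorithm is an illustrative assumption, not the patent's.
    """
    n_out = round(len(samples) / factor)
    out = []
    for i in range(n_out):
        pos = i * (len(samples) - 1) / (n_out - 1) if n_out > 1 else 0.0
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out
```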
  • Next, the mixing unit 246 acquires the volume input in advance for each of the first sound information, the second sound information, and the background sound information (in the present embodiment, the volume input for each of them on the setting screen 400 shown in FIG. 4) (step S313).
  • The mixing unit 246 then adjusts the volume of each of the first sound information, the second sound information, and the background sound information in accordance with the volume input in advance (step S314).
  • In the present embodiment, the mixing unit 246 adjusts the volume of each of the first sound information, the second sound information, and the background sound information according to the volume input in advance, but the present invention is not limited to this.
  • For example, the mixing unit 246 may adjust the volume of the second audio information according to the volume of the first audio information.
  • In this case, the mixing unit 246 can prevent the second voice information from becoming difficult to hear by making the volume of the first voice information smaller than the volume of the second voice information.
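The volume relationship described here (translated speech kept louder than the original speech) can be sketched as follows. The RMS measure and the 6 dB margin are illustrative assumptions, not parameters given in the embodiment.

```python
import math

def rms_db(samples):
    """RMS level of a waveform (samples in [-1.0, 1.0]) in decibels."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20.0 * math.log10(max(rms, 1e-9))

def second_voice_gain(first_voice, second_voice, margin_db=6.0):
    """Linear gain to apply to the translated (second) voice so that it
    plays `margin_db` louder than the original (first) voice, keeping
    the translation easy to hear. The 6 dB default is an assumption."""
    target_db = rms_db(first_voice) + margin_db
    return 10.0 ** ((target_db - rms_db(second_voice)) / 20.0)
```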
  • Next, the mixing unit 246 mixes (in other words, adds) the first audio information, the second audio information, and the background sound information (step S315).
  • In the present embodiment, the mixing unit 246 mixes the first sound information, the second sound information, and the background sound information, but it suffices that at least the second sound information and the background sound information are mixed and output.
  • Here, the mixing unit 246 mixes and outputs the background sound information and the second sound information reproduced in synchronization with the background sound information. In other words, the mixing unit 246 adjusts the timing of outputting the background sound information and the second sound information reproduced in synchronization with it, and outputs the background sound information and the second sound information in synchronization.
  • Specifically, the mixing unit 246 determines the background sound information to be reproduced in synchronization with the second sound information by comparing the time stamp added to the second sound information with the time stamp added to the background sound information.
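The time-stamp matching and mixing just described might look like the following sketch. The chunked (timestamp, samples) representation and the matching tolerance are assumptions made for illustration.

```python
def mix_synchronized(second_voice, background, tolerance_s=0.01):
    """Mix translated-speech chunks with the background-sound chunks
    whose time stamps match within `tolerance_s` seconds, adding the
    waveforms sample-wise and clipping to [-1.0, 1.0].

    Both arguments are lists of (timestamp_seconds, samples) pairs;
    this data layout is an assumed stand-in for the PES time stamps.
    """
    bg_by_slot = {round(ts / tolerance_s): s for ts, s in background}
    mixed = []
    for ts, voice in second_voice:
        # Unmatched voice chunks are mixed against silence.
        bg = bg_by_slot.get(round(ts / tolerance_s), [0.0] * len(voice))
        mixed.append((ts, [max(-1.0, min(1.0, v + b))
                           for v, b in zip(voice, bg)]))
    return mixed
```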
  • Furthermore, the mixing unit 246 may adjust the volume of the second audio information based on the original language of the first audio information and the translation language of the second audio information. For example, when the original language of the first audio information is English and the translation language of the second audio information is Japanese, the volume of the second audio information is set higher than the volume of the first audio information.
  • Next, the synchronization processing unit 247 executes synchronization processing for reproducing the image information and the second audio information in synchronization by delaying the image information output from the image decoder 241 to the video processing unit 210 by the conversion time required for the conversion from the first audio information to the second audio information (step S316).
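A minimal sketch of this synchronization step, assuming integer millisecond presentation time stamps: the video frames are simply delayed by the measured conversion time so that the image and the translated speech come out together.

```python
def delay_video_frames(frames_ms, conversion_delay_ms):
    """Shift each video frame's presentation time stamp by the time the
    first-to-second audio conversion took. `frames_ms` is a list of
    (pts_milliseconds, frame) pairs; this representation is an assumed
    simplification of the decoder output, not the patent's data format."""
    return [(pts + conversion_delay_ms, frame) for pts, frame in frames_ms]
```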
  • The audio output unit 208 then outputs the sound information obtained by mixing the first sound information, the second sound information, and the background sound information in the mixing unit 246 to the speaker 213 via the synchronization processing unit 247 (step S317).
  • The video processing unit 210 outputs the image information output from the image decoder 241 to the display unit 214 via the synchronization processing unit 247 (step S317).
  • As described above, in the present embodiment, the background sound information and the first sound information are separated from the input sound information, the first sound information is converted into the second sound information in a translation language different from the original language, and the background sound information and the second sound information are mixed and output, so that the first sound information is replaced with the second sound information. Therefore, when the second audio information converted from the first audio information is output, the second audio information can be prevented from becoming difficult to hear, and the background sound can be prevented from becoming inaudible.
  • The second embodiment is an example in which the separation of the background sound information and the first sound information from the input sound information and the conversion from the first sound information to the second sound information are performed by external servers.
  • FIG. 5 is a diagram illustrating a configuration of an information processing system having a notebook PC as an example of an electronic apparatus according to the second embodiment.
  • As shown in FIG. 5, the notebook PC (Personal Computer) 500 is connected, via a network such as the Internet, to a content server 510 that stores the content to be reproduced (content including at least sound information), a Web server 520 that exchanges various types of information with the notebook PC 500 via a browser executed on the notebook PC 500, a speech processing server 530 that performs the separation of the background sound information and the first speech information from the input sound information, the acquisition of text data from the first speech information, and the like, and a translation server 540 that translates the text data acquired from the first speech information into the translation language.
  • FIG. 6 is a sequence diagram showing a flow of sound information output processing in the information processing system according to the second embodiment.
  • First, the notebook PC 500 connects to the Web server 520 through a browser, and requests the Web server 520 to display the setting screen 400 (see FIG. 4) (step S601).
  • The Web server 520 transmits the screen information of the setting screen 400 to the notebook PC 500, which displays the setting screen 400 on its display unit (not shown) (step S602).
  • Next, the notebook PC 500 transmits the various settings set on the setting screen 400 (the volumes of the first voice information, the second voice information, and the background sound information, the translation language setting, the synchronization setting, etc.) to the Web server 520 (step S603). Furthermore, the notebook PC 500 selects the content to be output from the content stored in the content server 510 via the browser (step S604).
  • The Web server 520 requests the content server 510 to acquire the content selected on the notebook PC 500 (step S605), and acquires the content from the content server 510 (step S606).
  • Next, the Web server 520 transmits the sound information included in the acquired content to the speech processing server 530, and requests the separation of the first sound information and the background sound information from the sound information (step S607).
  • The speech processing server 530 separates the background sound information and the first speech information from the sound information and acquires text data from the first speech information, in the same manner as the separator 243 (see FIG. 2) and the translator 244 (see FIG. 2). The Web server 520 then acquires the first speech information, the background sound information, and the text data from the speech processing server 530 (step S608).
  • Next, the Web server 520 transmits the text data acquired from the speech processing server 530 and the translation language set on the setting screen 400 (see FIG. 4) to the translation server 540, and requests translation of the text data into the translation language (step S609).
  • The translation server 540 translates the text data into the translation language in the same manner as the translator 244 (see FIG. 2). The Web server 520 then acquires the text data (translation result) translated into the translation language from the translation server 540 (step S610).
  • Next, the Web server 520 transmits the text data translated into the translation language, the background sound information, the first sound information, and the various settings set on the setting screen 400 (see FIG. 4) (the volumes of the background sound information, the first sound information, and the second sound information, the synchronization setting, etc.) to the speech processing server 530, and requests the synthesis of the second sound information, various adjustments (for example, adjustment of the reproduction time of the second sound information and of the volumes of the first sound information, the second sound information, and the background sound information), and the mixing of the second sound information and the background sound information (step S611).
  • The speech processing server 530 performs the synthesis of the second sound information, the various adjustments, and the mixing of the second sound information and the background sound information in the same manner as the synthesizer 245 (see FIG. 2) and the mixing unit 246 (see FIG. 2). The Web server 520 then acquires the sound information obtained by mixing the second sound information and the background sound information (step S612).
  • The Web server 520 transmits, to the notebook PC 500, the content obtained by replacing the sound information included in the content acquired in step S606 with the sound information acquired from the speech processing server 530 (step S613).
  • As described above, according to the second embodiment, the notebook PC 500 does not need to perform the separation of the background sound information and the first sound information from the input sound information, the conversion from the first sound information to the second sound information, or the mixing of the background sound information and the second sound information, so the processing load on the notebook PC 500 can be reduced.
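The FIG. 6 sequence handled by the Web server 520 (steps S605 to S613) can be sketched as the following orchestration. The three server interfaces (`get`, `separate`, `translate`, `synthesize_and_mix`) are hypothetical stand-ins for the content server 510, the speech processing server 530, and the translation server 540, not APIs defined in the embodiment.

```python
class WebServerOrchestrator:
    """Sketch of the Web server 520 role in the FIG. 6 sequence
    (steps S605-S613). All method names on the injected server objects
    are assumed placeholders for illustration only."""

    def __init__(self, content_srv, speech_srv, translation_srv):
        self.content_srv = content_srv          # content server 510
        self.speech_srv = speech_srv            # speech processing server 530
        self.translation_srv = translation_srv  # translation server 540

    def produce_translated_content(self, content_id, settings):
        content = dict(self.content_srv.get(content_id))              # S605-S606
        voice, bg, text = self.speech_srv.separate(content["sound"])  # S607-S608
        translated = self.translation_srv.translate(
            text, settings["translation_language"])                   # S609-S610
        content["sound"] = self.speech_srv.synthesize_and_mix(
            translated, bg, voice, settings)                          # S611-S612
        return content                                                # S613
```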
  • FIG. 7 is a diagram illustrating a hardware configuration of a PC that is an example of an electronic apparatus according to the third embodiment.
  • The PC 700 includes a CPU 701, a ROM 702, a RAM 703, a display unit 704, an input unit 705, a storage control unit 706, a communication unit 707, a speaker 708, and a storage device 709.
  • The CPU 701 performs various processes in cooperation with various control programs stored in the ROM 702 or the like, using the RAM 703 as a work area, and comprehensively controls the operation of each unit constituting the PC 700.
  • The ROM 702 stores a program for controlling the PC 700, various setting information, and the like in a non-rewritable manner.
  • The RAM 703 is a volatile storage medium and functions as a work area for the CPU 701.
  • The display unit 704 has a display screen configured by an LCD (Liquid Crystal Display), an organic EL (Electro Luminescence) display, or the like, and displays processing progress, results, and the like according to the control of the CPU 701.
  • The speaker 708 outputs sound information according to the control of the CPU 701.
  • The input unit 705 has input devices such as a keyboard and a mouse, and notifies the CPU 701 of commands and information input by the user via the input devices.
  • The storage control unit 706 controls the operation of the storage device 709, and executes, in the storage device 709, processing corresponding to requests such as data writing and data reading input from the CPU 701.
  • The storage device 709 is a storage device having a recording medium such as a magnetic disk, a semiconductor memory, or an optical disk.
  • The communication unit 707 is a wireless communication interface that establishes communication with an external device (not shown) and transmits and receives data (for example, content including sound information and image information).
  • FIG. 8 is a block diagram showing a functional configuration of the PC according to the third embodiment.
  • The PC 700 implements an image decoder 710, an audio decoder 711, the separator 243, the translator 244, the synthesizer 245, and the mixing unit 246 by the CPU 701 executing a program stored in the ROM 702.
  • The image decoder 710 decodes image information included in the content received by the communication unit 707 (image information reproduced in synchronization with the sound information included in the content) into a data format that can be processed by the video processing unit 712.
  • The audio decoder 711 decodes the sound information included in the content received by the communication unit 707 into a data format that can be processed by the audio output unit 713.
  • The switch unit 248 switches the output destination of the sound signal decoded by the audio decoder to the separator 243 or the synchronization processing unit 247.
  • The separator 243 separates the background sound information and the first sound information from the sound information decoded by the audio decoder 711.
  • The translator 244 performs speech recognition processing that analyzes the first speech information and acquires the content of the first speech information as text data, and translates the text data from the original language (the first language, which is the language of the first speech information) into a translation language (a second language), which is a different language.
  • The synthesizer 245 synthesizes the second speech information based on the text data translated into the translation language.
  • The mixing unit 246 mixes and outputs the background sound information and the second sound information.
  • The synchronization processing unit 247 synchronizes and outputs the sound information obtained by the mixing unit 246 mixing the background sound information and the second sound information, and the image information reproduced in synchronization with that sound information.
  • The video processing unit 712 converts the image information output from the synchronization processing unit 247 into an analog video signal in a format that can be displayed on the display unit 704, and then outputs it to the display unit 704 for video display.
  • The audio output unit 713 converts the digital sound information output from the synchronization processing unit 247 into an analog sound signal in a format that can be reproduced by the speaker 708, and then outputs it to the speaker 708 for audio reproduction.
  • As described above, according to the first to third embodiments, when the second sound information converted from the first sound information is output, the second sound information can be prevented from becoming difficult to hear, and the background sound can be prevented from becoming inaudible.
  • The program executed by the electronic device of the present embodiment is provided by being incorporated in a ROM or the like in advance.
  • The program executed by the electronic device of the present embodiment may instead be provided as a file in an installable or executable format recorded on a computer-readable recording medium such as a CD-ROM, a flexible disk (FD), a CD-R, or a DVD (Digital Versatile Disk).
  • Furthermore, the program executed by the electronic device of the present embodiment may be provided by being stored on a computer connected to a network such as the Internet and downloaded via the network, or may be provided or distributed via a network such as the Internet.
  • The program executed by the electronic device of the present embodiment has a module configuration including the above-described units (the separator 243, the translator 244, the synthesizer 245, the mixing unit 246, and the synchronization processing unit 247); as actual hardware, a CPU (processor) reads the program from the ROM and executes it, whereby the above units are loaded onto the main storage.

Landscapes

  • Engineering & Computer Science (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

An electronic device according to an embodiment is equipped with a separation unit, a conversion unit, and an output unit. The separation unit separates background sound information and first speech information from sound information. The conversion unit converts the first speech information into second speech information corresponding to the first speech information. The output unit mixes and then outputs the background sound information and the second speech information.

Description

Electronic device, output method, and program

Embodiments described herein relate generally to an electronic device, an output method, and a program.

A technology is disclosed that converts the speech contained in the sound information of content such as moving images into text information, translates the text information into another language, synthesizes speech from the translated text information, and outputs the synthesized speech together with the content.
JP 2000-322077 A; JP 2000-92460 A
However, in the prior art, when the synthesized speech is output together with the content, the speech originally included in the sound information of the content and the synthesized speech are heard at the same time, so the synthesized speech is difficult to hear. There is also a method of making the synthesized speech easier to hear by lowering the volume of the sound information of the content, but with this method the volume of the background sound included in the sound information of the content also decreases, so the background sound becomes inaudible.

The electronic device of the embodiment includes a separation unit, a conversion unit, and an output unit. The separation unit separates background sound information and first speech information from sound information. The conversion unit converts the first speech information into second speech information corresponding to the first speech information. The output unit mixes and outputs the background sound information and the second speech information.
FIG. 1 is a block diagram showing the main signal processing system of a digital television as an example of the electronic apparatus according to the first embodiment.
FIG. 2 is a block diagram showing the configuration of a signal processing unit provided in the digital television according to the first embodiment.
FIG. 3 is a flowchart showing the flow of output processing of sound information and image information by the signal processing unit provided in the digital television according to the first embodiment.
FIG. 4 is a diagram showing an example of a setting screen for various information in the digital television according to the first embodiment.
FIG. 5 is a diagram showing the configuration of an information processing system having a notebook PC as an example of the electronic apparatus according to the second embodiment.
FIG. 6 is a sequence diagram showing the flow of output processing of sound information in the information processing system according to the second embodiment.
FIG. 7 is a diagram showing the hardware configuration of a PC as an example of the electronic apparatus according to the third embodiment.
FIG. 8 is a block diagram showing the functional configuration of the PC according to the third embodiment.

Hereinafter, an electronic device, an output method, and a program according to the embodiments will be described with reference to the accompanying drawings.
(First embodiment)

FIG. 1 is a block diagram showing the main signal processing system of a digital television as an example of the electronic apparatus according to the first embodiment. The satellite digital television broadcast signal received by the antenna 121 for BS/CS digital broadcast reception is supplied via the input terminal 201 to the satellite digital broadcast tuner 202a provided in the broadcast input unit 202.
The tuner 202a selects the broadcast signal of a desired channel according to a control signal from the control unit 205, and outputs the selected broadcast signal to the PSK (Phase Shift Keying) demodulator 202b.

The PSK demodulator 202b provided in the broadcast input unit 202 demodulates the broadcast signal selected by the tuner 202a according to a control signal from the control unit 205, obtains a transport stream (TS) containing the desired program, and outputs it to the TS decoder 202c.

The TS decoder 202c provided in the broadcast input unit 202 performs, according to a control signal from the control unit 205, TS decoding of the signal into which the transport stream (TS) is multiplexed, and outputs the PES (Packetized Elementary Stream) obtained by depacketing the digital video and sound signals of the desired program to an STD buffer (not shown) in the signal processing unit 206. The TS decoder 202c also outputs the section information carried by the digital broadcast to a section processing unit (not shown) in the signal processing unit 206.
The terrestrial digital television broadcast signal received by the antenna 122 for terrestrial broadcast reception is supplied via the input terminal 203 to the terrestrial digital broadcast tuner 204a provided in the broadcast input unit 202.

The tuner 204a can select the broadcast signal of a desired channel according to a control signal from the control unit 205. The tuner 204a outputs the broadcast signal to the OFDM (Orthogonal Frequency Division Multiplexing) demodulator 204b.

The OFDM demodulator 204b provided in the broadcast input unit 202 demodulates the broadcast signal selected by the tuner 204a according to a control signal from the control unit 205, obtains a transport stream containing the desired program, and outputs it to the TS decoder 204c.

The TS decoder 204c provided in the broadcast input unit 202 performs, according to a control signal from the control unit 205, TS decoding of the signal into which the transport stream (TS) is multiplexed, and outputs the PES obtained by depacketing the digital video and sound signals of the desired program to the STD buffer in the signal processing unit 206. The TS decoder 204c also outputs the section information carried by the digital broadcast to a section processing unit (not shown) in the signal processing unit 206.
During television viewing, the signal processing unit 206 selectively performs predetermined digital signal processing on the digital video and sound signals supplied from the TS decoder 202c and the TS decoder 204c, and outputs the results to the graphic processing unit 207 and the audio output unit 208. During program recording, the signal processing unit 206 records signals obtained by selectively performing predetermined digital signal processing on the digital video and sound signals supplied from the TS decoder 202c and the TS decoder 204c in the round-recording storage device (for example, an HDD: Hard Disk Drive) 271 and the external storage device 226 via the control unit 205.

Note that the round recording in the present embodiment differs from reserved recording, which records in units of program content selected by the user: to keep the user from missing programs, all program content broadcast in a predetermined time period (which may be the whole day) is recorded for each broadcast channel. The recording time period may differ for each broadcast channel.

During playback of a recorded program, the signal processing unit 206 performs predetermined digital signal processing on the recorded program data (video and sound signals) read from the round-recording storage device 271 or the external storage device 226 via the control unit 205, and outputs the results to the graphic processing unit 207 and the audio output unit 208.

A section processing unit (not shown) provided in the signal processing unit 206 outputs, to the control unit 205, various data for acquiring programs, electronic program guide (EPG) information, program attribute information (program genre, etc.), subtitle information, and the like (service information, SI and PSI) extracted from the section information input from the TS decoders 202c and 204c.
The tuner 202a, PSK demodulator 202b, TS decoder 202c, tuner 204a, OFDM demodulator 204b, and TS decoder 204c shown in FIG. 1 are provided in at least the number of systems required for the round recording function. For example, if the digital television 100 is a device capable of recording all the terrestrial key stations in Tokyo, it includes seven or more systems of the tuner 204a, the OFDM demodulator 204b, and the TS decoder 204c.

The control unit 205 receives from the signal processing unit 206 various data for acquiring programs (such as key information for B-CAS descrambling), electronic program guide (EPG) information, program attribute information (program genre, etc.), subtitle information, and the like (service information, SI and PSI). The control unit 205 generates screen information for displaying EPG information, subtitle information, and the like from the input information, and outputs the generated screen information to the graphic processing unit 207.

The control unit 205 also has a function of controlling program recording and reserved program recording. When accepting a program reservation, it generates screen information for displaying EPG information on the display unit 214, outputs the generated screen information to the graphic processing unit 207, and sets the reservation contents in predetermined storage means according to user input via the operation unit 220 or the remote controller 221. The control unit 205 then controls the tuners 202a and 204a, the PSK demodulator 202b, the OFDM demodulator 204b, the TS decoders 202c and 204c, and the signal processing unit 206 so as to record the reserved program at the set time.

When automatically recording the programs of all channels recordable by the round recording function, the digital television 100 controls each device to perform recording in a time period set separately from reservations.
 OSD(On Screen Display)信号生成部209は、各種情報を設定するための設定画面を表示するための設定画面情報(OSD信号)を生成して、生成した設定画面情報をグラフィック処理部207に出力する。 The OSD (On Screen Display) signal generation unit 209 generates setting screen information (OSD signal) for displaying a setting screen for setting various information, and outputs the generated setting screen information to the graphic processing unit 207. To do.
 グラフィック処理部207は、信号処理部206から出力されたデジタルの映像信号、OSD信号生成部209で生成される設定画面情報および制御部205により生成された画面情報を映像処理部210に出力する。 The graphic processing unit 207 outputs the digital video signal output from the signal processing unit 206, the setting screen information generated by the OSD signal generation unit 209 and the screen information generated by the control unit 205 to the video processing unit 210.
 グラフィック処理部207から出力されたデジタルの映像信号は、映像処理部210に供給される。映像処理部210は、入力されたデジタルの映像信号を、表示部214または出力端子211を介して接続された外部機器で表示可能なフォーマットのアナログ映像信号に変換した後、出力端子211または表示部214に出力して映像表示させる。 The digital video signal output from the graphic processing unit 207 is supplied to the video processing unit 210. The video processing unit 210 converts the input digital video signal into an analog video signal in a format that can be displayed on an external device connected via the display unit 214 or the output terminal 211, and then outputs the analog video signal to the output terminal 211 or the display unit. The video is output to 214 and displayed.
 The audio output unit 208 converts the input digital sound signal into an analog sound signal in a format that can be reproduced by the speaker 213, and then outputs it to the speaker 213 or to an external device connected via the output terminal 212 for audio reproduction.
 In the digital television 100 according to the present embodiment, the various operations described above are comprehensively controlled by the control unit 205. The control unit 205 incorporates a CPU (Central Processing Unit) and the like, receives operation information from the operation unit 220 or operation information transmitted from the remote controller 221 via the light receiving unit 222, and controls each unit so that the operation content is reflected.
 The control unit 205 uses a ROM (Read Only Memory) 205a that stores the control program executed by the CPU, a RAM (Random Access Memory) 205b that provides a work area for the CPU, and a nonvolatile memory 205c that stores various kinds of setting information, control information, and the like.
 The control unit 205 is also connected, via a card I/F (Interface) 223, to a card holder 225 into which a memory card 224 can be inserted. The control unit 205 can thereby exchange information, via the card I/F 223, with a memory card 224 inserted in the card holder 225.
 The control unit 205 is also connected to a first LAN terminal 230 via a communication I/F 229, and can thereby exchange information, via the communication I/F 229, with a LAN-compatible device connected to the first LAN terminal 230.
 The control unit 205 is likewise connected to a second LAN terminal 232 via a communication I/F 231, and can thereby exchange information, via the communication I/F 231, with various LAN-compatible devices connected to the second LAN terminal 232.
 The control unit 205 is also connected to a USB terminal 234 via a USB I/F 233, and can thereby exchange information, via the USB I/F 233, with various devices connected to the USB terminal 234 (for example, the external storage device 226).
 FIG. 2 is a block diagram showing the configuration of the signal processing unit of the digital television according to the first embodiment. The signal processing unit 206 includes: an image decoder 241 that decodes the video signal (image information reproduced in synchronization with the sound signal) input from the broadcast input unit 202 or the control unit 205 into a data format that the video processing unit 210 can process; an audio decoder 242 that decodes the sound signal input from the broadcast input unit 202 or the control unit 205 into a data format that the audio output unit 208 can process; a switch unit 248 that switches the output destination of the sound signal decoded by the audio decoder 242 to either the separator 243 or the synchronization processing unit 247; a separator 243 that separates background sound information and first audio information from the sound signal (sound information) decoded by the audio decoder 242; a translator 244 that performs speech recognition processing to analyze the first audio information and obtain its content as text data, and translates the text data from the original language of the first audio information (first language) into a translation language (second language) different from the original language; a synthesizer 245 that synthesizes second audio information from the text data translated into the translation language; a mixing unit 246 that mixes and outputs the background sound information and the second audio information; and a synchronization processing unit 247 that outputs, in synchronization, the sound information obtained by mixing the background sound information and the second audio information in the mixing unit 246 and the image information to be reproduced in synchronization with that sound information.
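As a rough sketch (the function names and the callable-based wiring are illustrative, not taken from the patent), the routing performed by the switch unit 248 and the downstream separate, translate, synthesize, and mix chain could look like:

```python
def process_sound(sound_info, convert_enabled, separate, translate, synthesize, mix):
    """Route decoded sound information the way switch unit 248 does:
    either through the conversion chain or straight through unchanged."""
    if not convert_enabled:
        # No conversion instructed: bypass the chain (output goes to sync unit 247).
        return sound_info
    background, first_speech = separate(sound_info)   # separator 243
    text = translate(first_speech)                    # translator 244 (recognize + translate)
    second_speech = synthesize(text)                  # synthesizer 245
    return mix(background, second_speech)             # mixing unit 246
```

Each stage is passed in as a callable so the bypass logic can be shown without committing to any particular recognition or synthesis backend.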
 In the present embodiment, the translator 244 and the synthesizer 245 function as a conversion unit that converts the first audio information into second audio information in a translation language different from the original language of the first audio information. Although this embodiment describes an example of conversion into a different translation language, it suffices for the conversion unit to convert the first audio information into any second audio information corresponding to it (in other words, second audio information output in place of the first audio information). For example, first audio information in the standard language may be converted into second audio information in a dialect, or first audio information of a voice may be converted into second audio information of an imitative sound. In the present embodiment, the signal processing unit 206 also includes the switch unit 248. When conversion to second audio information is instructed by a control signal from the control unit 205, the switch unit 248 outputs the sound information decoded by the audio decoder 242 to the separator 243, so that the sound information reaches the synchronization processing unit 247 via the separator 243, the translator 244, the synthesizer 245, and the mixing unit 246. When conversion to second audio information is not instructed by the control signal from the control unit 205, the switch unit 248 outputs the input sound information to the synchronization processing unit 247 directly, bypassing the separator 243, the translator 244, the synthesizer 245, and the mixing unit 246.
 Next, the processing for outputting sound information and image information will be described with reference to FIGS. 2 to 4. FIG. 3 is a flowchart showing the flow of the sound information and image information output processing performed by the signal processing unit of the digital television according to the first embodiment. FIG. 4 shows an example of a screen for setting various kinds of information in the digital television according to the first embodiment.
 In the present embodiment, when the control unit 205 instructs conversion to second audio information, the OSD signal generation unit 209 (an example of a display control unit) generates, prior to the output processing by the signal processing unit 206, setting screen information for a screen on which the following can be set: the respective volumes of the background sound information, the first audio information, and the second audio information; the translation language, i.e., the language of the second audio information; and the synchronization setting specifying which of the reproduction time of the second audio information and the reproduction time of the image information is to be adjusted. The OSD signal generation unit 209 outputs this setting screen information to the graphic processing unit 207, thereby causing the display unit 214 to display the setting screen.
 For example, as shown in FIG. 4, the OSD signal generation unit 209 causes the display unit 214 to display a setting screen 400 including sliders 401 (an example of volume input images) for entering the respective volumes of the first audio information (original audio), the second audio information (translated audio), and the background sound information (background sound); a select box 402 for entering the translation language of the second audio information; radio buttons 403 for setting which of the reproduction time of the second audio information and the reproduction time of the image information is to be adjusted; and so on.
 In the present embodiment, the OSD signal generation unit 209 causes the display unit 214 to display sliders 401 for entering the respective volumes of the background sound information, the first audio information, and the second audio information, but it suffices to display volume input images for entering at least the volumes of the first audio information and the second audio information.
 Returning to FIG. 3, the audio decoder 242 first determines whether conversion to second audio information has been instructed by a control signal from the control unit 205 (step S301). When conversion to second audio information has been instructed (step S301: Yes), the audio decoder 242 decodes the input sound information into a data format that the audio output unit 208 can process, and the separator 243 separates the first audio information and the background sound information from the sound information decoded by the audio decoder 242 (step S302).
 Specifically, the separator 243 first performs a frequency analysis of the sound information and obtains feature values of the sound information (the separator 243 may instead obtain feature values computed by frequency analysis in an external device). Next, the separator 243 calculates, from the feature values obtained over a fixed period, a background sound basis matrix representing the background sound. Using the obtained feature values and the calculated background sound basis matrix, the separator 243 then estimates a first background sound component, that is, the non-stationary part of the background sound component of the feature values. From the first background sound components estimated from one or more feature values obtained over a fixed period including the past, the separator 243 estimates a representative component of the first background sound component within that period. The separator 243 next uses the obtained feature values to estimate a first speech component, that is, the speech component of the feature values. From the estimated first speech component and the representative component of the first background sound component, the separator 243 creates a filter that extracts the speech spectrum or the background sound spectrum. Finally, using the created filter and the spectrum of the sound information, the separator 243 separates the sound information into the first audio information and the background sound information.
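The separator builds a filter from estimated speech and background components and applies it to the spectrum of the sound information. A minimal sketch of that final filtering step, assuming magnitude spectrograms held as NumPy arrays and substituting a generic Wiener-style soft mask for the patent's specific basis-matrix estimation:

```python
import numpy as np

def separate(mixture_spec, speech_est, background_est, eps=1e-10):
    """Split a mixture magnitude spectrogram into speech and background parts
    using a soft (Wiener-style) mask built from the component estimates."""
    # Mask values lie in [0, 1]; eps guards against division by zero.
    speech_mask = speech_est / (speech_est + background_est + eps)
    speech_spec = speech_mask * mixture_spec          # first audio information
    background_spec = mixture_spec - speech_spec      # background sound information
    return speech_spec, background_spec
```

The two outputs sum back to the mixture by construction, which mirrors the idea that the first audio information and the background sound information together make up the original sound information.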
 Next, the translator 244 obtains text data from the first audio information separated from the sound information by the separator 243, using speech recognition processing (step S303). The translator 244 then obtains the translation language set in advance on the setting screen 400 shown in FIG. 4 (step S304), and translates the text data obtained from the first audio information into text data in the preset translation language by natural language processing (step S305).
 The synthesizer 245 synthesizes audio information (second audio information in the translation language) from the text data translated by the translator 244 (text data in the preset translation language) (step S306).
 The mixing unit 246 obtains the synchronization setting indicating which of the reproduction time of the second audio information and the reproduction time of the image information is to be adjusted (in the present embodiment, the synchronization setting entered on the setting screen 400 shown in FIG. 4) (step S307). The mixing unit 246 then determines whether the reproduction time of the synthesized second audio information differs from the reproduction time of the first audio information (step S308). If the two reproduction times differ (step S308: Yes), the mixing unit 246 determines, based on the obtained synchronization setting, whether to adjust the reproduction time of the second audio information (step S309). Although the mixing unit 246 in this embodiment checks merely whether the two reproduction times differ, the reproduction time of the second audio information or of the image information may instead be adjusted only when the difference between them exceeds a predetermined allowable time. In that case, when the difference between the reproduction times of the second audio information and the first audio information is short, the image information can be viewed without adjusting either reproduction time.
 When the synchronization setting specifies adjusting the reproduction time of the second audio information (step S309: Yes), the mixing unit 246 adjusts the reproduction time of the second audio information so that it equals the reproduction time of the image information to be reproduced in synchronization with the second audio information (in other words, the image information corresponding to the second audio information; equivalently, so that the reproduction time of the second audio information equals that of the first audio information) (step S310). This makes it possible to reproduce the second audio information and the image information in synchronization. Moreover, since the reproduction time of the image information is not adjusted, when the image information is moving image information the user is prevented from perceiving the reproduced moving image as unnatural. In the present embodiment, the mixing unit 246 identifies, from the input image information, the image information to be reproduced in synchronization with the second audio information by comparing the time stamp attached to the second audio information with the time stamp attached to the image information. Also, although the mixing unit 246 in this embodiment adjusts the reproduction time of the second audio information (or of the image information) so that the two reproduction times become equal, it suffices to adjust the reproduction time of the second audio information (or of the image information) so that the difference between the two reproduction times becomes no greater than a predetermined allowable time.
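Step S310 can be sketched as computing a playback-speed factor that makes the second audio fit the image playback time; the `tolerance` parameter models the "allowable time" variant mentioned above (the function name and parameters are illustrative):

```python
def adjust_audio_duration(audio_duration, image_duration, tolerance=0.0):
    """Return the playback-speed factor that makes the second audio
    information fit the image playback time; 1.0 means no change."""
    if abs(audio_duration - image_duration) <= tolerance:
        return 1.0  # difference is within the allowable time: leave it alone
    # A factor > 1 speeds the audio up (shortens it); < 1 slows it down.
    return audio_duration / image_duration
```

For instance, a 12-second translated utterance over a 10-second shot yields a factor of 1.2, that is, play the audio 20% faster.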
 In the present embodiment, the translator 244 translates the text data obtained from the first audio information into a plurality of text data items in the preset translation language, and the synthesizer 245 synthesizes a plurality of second audio information candidates, one from each of those text data items. That is, the translator 244 and the synthesizer 245 convert the first audio information into a plurality of second audio information candidates. The mixing unit 246 then adjusts the reproduction time of the second audio information by selecting, from among the candidates, one that can be reproduced in the same reproduction time as the image information to be reproduced in synchronization with it, and adopting the selected candidate as the second audio information. Although the synthesizer 245 in this embodiment synthesizes candidates from all of the text data items in the preset translation language, the invention is not limited to this; based on the plurality of text data items (for example, on the number of characters in each), text data likely to yield second audio information reproducible in the same reproduction time as the synchronized image information may be selected first, and the audio information synthesized from the selected text data used as the second audio information.
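The candidate-selection idea can be sketched as picking, from several synthesized translations, the one whose duration is closest to the target playback time (a hypothetical `(duration, audio)` tuple representation is assumed; it is not the patent's data format):

```python
def pick_candidate(candidates, target_duration):
    """From synthesized second-audio candidates given as (duration, audio)
    tuples, pick the one whose playback time best matches the image's."""
    return min(candidates, key=lambda c: abs(c[0] - target_duration))
```

Selecting a close-fitting candidate first reduces or removes the need for any speed-based time adjustment afterwards.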
 In the present embodiment, the mixing unit 246 adjusts the reproduction time of the second audio information by selecting, from the plurality of candidates, a candidate reproducible in the same reproduction time as the image information; however, the invention is not limited to this. For example, the reproduction time of the second audio information may be adjusted by controlling the audio output unit 208 to change the playback speed at which the second audio information is reproduced.
 On the other hand, when the synchronization setting specifies adjusting the reproduction time of the image information (step S309: No), the synchronization processing unit 247 adjusts the reproduction time of the image information to be reproduced in synchronization with the second audio information so that it equals the reproduction time of the second audio information (step S311). In the present embodiment, the synchronization processing unit 247 adjusts the reproduction time of the image information by controlling the video processing unit 210 to change the playback speed at which that image information is reproduced. This makes it possible to reproduce the image information and the second audio information in synchronization.
 Although the synchronization processing unit 247 in this embodiment adjusts the reproduction time of the image information by changing its playback speed, the invention is not limited to this; for example, when the image information is moving image information, the reproduction time may be adjusted by thinning out some of the frames constituting the moving image information or by adding frames.
 Further, although the present embodiment adjusts either the reproduction time of the second audio information or the reproduction time of the image information reproduced in synchronization with it, the invention is not limited to this, as long as at least one of the two reproduction times is adjusted so that they become equal. Specifically, when the difference between the two reproduction times exceeds a preset allowable value (for example, when the reproduction time of the second audio information is at least twice, or no more than half, the reproduction time of the synchronized image information), adjusting only one of the two reproduction times is likely to make the viewer perceive the audio reproduced from the second audio information, or the image reproduced from the image information, as unnatural.
 In such a case, therefore, both the reproduction time of the second audio information and the reproduction time of the synchronized image information are adjusted so that the two become equal. For example, when the reproduction time of the second audio information is too short, it is lengthened while the reproduction time of the synchronized image information is shortened; conversely, when the reproduction time of the second audio information is too long, it is shortened while the reproduction time of the image information is lengthened. Because this keeps the change to each reproduction time to a minimum, the likelihood that the viewer perceives the reproduced audio or image as unnatural can be reduced.
 Further, although the present embodiment decides, based on the synchronization setting, which of the reproduction time of the second audio information and the reproduction time of the synchronized image information is to be adjusted, the invention is not limited to this. Specifically, the decision may be made based on at least one of the type of image reproduced from the image information and the difference between the reproduction times of the second audio information and the image information.
 For example, when adjusting the reproduction time of the image information is unlikely to make the user perceive the reproduced image as unnatural, such as when the image information is a still image signal or when the difference between the two reproduction times is no greater than a preset allowable value, it may be decided to adjust the reproduction time of the image information. On the other hand, when the image information is moving image information, or when the difference exceeds the preset allowable value, it may be decided to adjust the reproduction time of the second audio information.
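Setting the explicit synchronization setting aside, the decision logic of the last few paragraphs might be sketched as a small policy function. The 2x/0.5x thresholds come from the text above; the function name, the return labels, and the single `allowance` parameter are illustrative:

```python
def choose_adjust_target(image_is_still, audio_duration, image_duration, allowance):
    """Decide which playback time to adjust.
    A very large mismatch (>= 2x or <= 0.5x) splits the change across both;
    still images or small differences tolerate image-side adjustment;
    otherwise the audio side is adjusted."""
    ratio = audio_duration / image_duration
    if ratio >= 2.0 or ratio <= 0.5:
        return "both"
    if image_is_still or abs(audio_duration - image_duration) <= allowance:
        return "image"
    return "audio"
```

This keeps moving images at their natural speed whenever the audio can absorb the difference, which matches the stated goal of not making the reproduced video look unnatural.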
 When the reproduction time of the image information or of the second audio information has been adjusted, or when the reproduction times of the second audio information and the first audio information are equal (step S308: No), the mixing unit 246 adjusts the frequency of the second audio information based on the original language of the first audio information and the translation language of the second audio information (step S312). For example, when the original language of the first audio information is English and the translation language of the second audio information is Japanese, the mixing unit 246 lowers the frequency of the second audio information.
 Next, the mixing unit 246 obtains the volumes entered in advance for the first audio information, the second audio information, and the background sound information (in the present embodiment, the volumes entered for each of them on the setting screen 400 shown in FIG. 4) (step S313), and adjusts the volume of each in accordance with the entered volumes (step S314).
 In this embodiment the mixing unit 246 adjusts the volumes of the first audio information, the second audio information, and the background sound information according to volumes entered in advance, but the invention is not limited to this. For example, the mixing unit 246 may adjust the volume of the second audio information according to the volume of the first audio information. Alternatively, by making the volume of the first audio information smaller than that of the second audio information, the mixing unit 246 can prevent the second audio information from becoming hard to hear.
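One way to realize this volume relationship is to compute per-stream gains before mixing; the function name and the 0.5 attenuation factor are hypothetical, and only the constraint that the first audio stays quieter than the second comes from the text:

```python
def mixing_gains(first_volume: float,
                 second_volume: float,
                 background_volume: float,
                 prioritize_translation: bool = True):
    """Return (first, second, background) gains.

    When prioritizing the translated speech, the original speech is
    attenuated below the translated speech so it does not mask it.
    """
    if prioritize_translation and first_volume >= second_volume:
        first_volume = second_volume * 0.5  # assumed attenuation factor
    return first_volume, second_volume, background_volume
```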
 The mixing unit 246 then mixes (in other words, adds) the first audio information, the second audio information, and the background sound information, and outputs the result (step S315). In this embodiment the mixing unit 246 mixes all three, but it suffices to mix and output at least the second audio information and the background sound information. In doing so, the mixing unit 246 mixes the background sound information with the second audio information that is to be played back in synchronization with it. In other words, the mixing unit 246 adjusts the timing at which the background sound information and the second audio information played back in synchronization with it are output, and outputs the two in synchronization. To that end, the mixing unit 246 compares the time stamp attached to the second audio information with the time stamp attached to the background sound information, thereby determining, from the input background sound information, the background sound information to be played back in synchronization with the second audio information.
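Mixing in the "adding" sense above reduces to a sample-wise sum of time-aligned streams; this sketch assumes PCM samples as floats in [-1.0, 1.0] and that time-stamp alignment has already been done, and the function name is illustrative:

```python
def mix_streams(background, second_speech, first_speech=None):
    """Mix (sample-wise add) time-aligned PCM sample lists.

    The first (original) speech is optional, matching the note that
    mixing at least the second speech and background suffices.
    """
    n = min(len(background), len(second_speech))
    mixed = []
    for i in range(n):
        s = background[i] + second_speech[i]
        if first_speech is not None and i < len(first_speech):
            s += first_speech[i]
        mixed.append(max(-1.0, min(1.0, s)))  # clip to the valid range
    return mixed
```

Samples whose sum exceeds the representable range are clipped rather than wrapped.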
Furthermore, the mixing unit 246 may adjust the volume of the second audio information based on the original language of the first audio information and the translated language of the second audio information. For example, when the original language of the first audio information is English and the translated language of the second audio information is Japanese, the volume of the second audio information is set higher than the volume of the first audio information.
 The synchronization processing unit 247 outputs the image information output from the image decoder 241 to the video processing unit 210 after delaying it by the conversion time required to convert the first audio information into the second audio information, thereby executing synchronization processing that plays back the image information and the second audio information in synchronization (step S316).
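The frame delay of step S316 can be sketched as a simple delay line; the class name and the frame-count parameterization are assumptions for illustration (the embodiment delays by the audio conversion time, which would be converted to a frame count at the video frame rate):

```python
from collections import deque

class SyncDelay:
    """Delay video frames by a fixed number of frames so they emerge
    together with the translated (second) audio."""

    def __init__(self, delay_frames: int):
        self.buffer = deque()
        self.delay_frames = delay_frames

    def push(self, frame):
        """Accept one frame; return the delayed frame, or None while
        the delay line is still filling."""
        self.buffer.append(frame)
        if len(self.buffer) > self.delay_frames:
            return self.buffer.popleft()
        return None
```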
 The audio output unit 208 outputs, via the synchronization processing unit 247, the sound information obtained by mixing the first audio information, the second audio information, and the background sound information in the mixing unit 246 to the speaker 213 (step S317). Likewise, the video processing unit 210 outputs, via the synchronization processing unit 247, the image information output from the image decoder 241 to the display unit 214 (step S317).
 As described above, according to the digital television 100 of the first embodiment, the background sound information and the first audio information are separated from the input sound information, the first audio information is converted into second audio information in a translation language different from its source language, and the background sound information and the second audio information are mixed and output. Because the first audio information can thus be replaced by the second audio information in the output, the second audio information can be kept from becoming hard to hear when the second audio information converted from the first audio information is output. Moreover, since there is no need to lower the volume of the background sound information to make the second audio information easier to hear, the background sound is prevented from becoming inaudible.
(Second Embodiment)
 In this embodiment, an external device connected over a network to the electronic device that outputs sound information performs the separation of the background sound information and the first audio information from the input sound information, the conversion of the first audio information into the second audio information, and the mixing of the background sound information with the second audio information. In the following description, explanations of parts identical to the first embodiment are omitted.
 FIG. 5 is a diagram showing the configuration of an information processing system that includes a notebook PC as an example of the electronic apparatus according to the second embodiment. In this embodiment, as shown in FIG. 5, a notebook PC (Personal Computer) 500 is connected via a network such as the Internet to: a content server 510 that stores the content to be played back (content including at least sound information); a Web server 520 that exchanges various information with the notebook PC 500 via a browser running on the notebook PC 500; a speech processing server 530 that separates the background sound information and the first audio information from the input sound information and obtains text data from the first audio information; and a translation server 540 that translates the text data obtained from the first audio information into the translation language.
 FIG. 6 is a sequence diagram showing the flow of sound information output processing in the information processing system according to the second embodiment. First, the notebook PC 500 connects to the Web server 520 via a browser and requests the Web server 520 to display the setting screen 400 (see FIG. 4) (step S601).
 The Web server 520 transmits the screen information of the setting screen 400 to the notebook PC 500 and causes the setting screen 400 to be displayed on a display unit (not shown) of the notebook PC 500 (step S602).
 The notebook PC 500 transmits the various settings made on the setting screen 400 (the volumes of the first audio information, the second audio information, and the background sound information, the translation-language setting, the synchronization setting, and so on) to the Web server 520 (step S603). The notebook PC 500 then selects, via the browser, the content to be output from among the content stored in the content server 510 (step S604).
 The Web server 520 requests the content server 510 to provide the content selected on the notebook PC 500 (step S605) and acquires that content from the content server 510 (step S606).
 The Web server 520 transmits the sound information included in the acquired content to the speech processing server 530 and requests separation of the first audio information and the background sound information from that sound information (step S607). The speech processing server 530 separates the background sound information and the first audio information from the sound information and obtains text data from the first audio information, in the same manner as the separator 243 (see FIG. 2) and the translator 244 (see FIG. 2). The Web server 520 then acquires the first audio information, the background sound information, and the text data from the speech processing server 530 (step S608).
 The Web server 520 transmits the text data acquired from the speech processing server 530 and the translation language set on the setting screen 400 (see FIG. 4) to the translation server 540, and requests translation of the text data into the translation language (step S609). The translation server 540 translates the text data into the translation language in the same manner as the translator 244 (see FIG. 2). The Web server 520 then acquires the text data translated into the translation language (the translation result) from the translation server 540 (step S610).
 The Web server 520 transmits the translated text data, the background sound information, the first audio information, and the various settings made on the setting screen 400 (see FIG. 4) (the volumes of the background sound information, the first audio information, and the second audio information, the synchronization setting, and so on) to the speech processing server 530, and requests synthesis of the second audio information, various adjustments (for example, adjustment of the playback time of the second audio information and of the volumes and frequencies of the first audio information, the second audio information, and the background sound information), and mixing of the second audio information with the background sound information (step S611). The speech processing server 530 performs the synthesis of the second audio information, the various adjustments, and the mixing of the second audio information with the background sound information in the same manner as the synthesizer 245 (see FIG. 2) and the mixing unit 246 (see FIG. 2). The Web server 520 then acquires the sound information obtained by mixing the second audio information with the background sound information (step S612).
 The Web server 520 then transmits to the notebook PC 500 the content in which the sound information included in the content acquired in step S606 has been replaced with the sound information acquired from the speech processing server 530 (step S613).
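The server-side exchange in steps S607 through S613 amounts to the following orchestration; the function name and the injected callables are hypothetical stand-ins for the speech processing server 530 and the translation server 540, used only to keep the sketch self-contained:

```python
def replace_audio_track(content, separate, translate,
                        synthesize_and_mix, target_language):
    """Replace the audio track of `content` with a translated mix.

    separate(audio)            -> (background, first_speech, text)
    translate(text, language)  -> translated text
    synthesize_and_mix(text, background, first_speech) -> new audio
    """
    background, first_speech, text = separate(content["audio"])
    translated_text = translate(text, target_language)
    new_audio = synthesize_and_mix(translated_text, background, first_speech)
    return {**content, "audio": new_audio}
```

Dependency injection here mirrors the fact that the Web server 520 merely coordinates: the heavy processing happens in the two back-end servers.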
 As described above, according to the information processing system of the second embodiment, the notebook PC 500 that outputs the sound information does not itself need to separate the background sound information and the first audio information from the input sound information, convert the first audio information into the second audio information, or mix the background sound information with the second audio information, so the processing load on the notebook PC 500 can be reduced.
(Third Embodiment)
 In this embodiment, a PC, which is an example of the electronic device, performs the separation of the background sound information and the first audio information from the input sound information, the conversion of the first audio information into the second audio information, and the output of sound information obtained by mixing the background sound information with the second audio information. In the following description, explanations of parts identical to the first embodiment are omitted.
 FIG. 7 is a diagram showing the hardware configuration of a PC as an example of the electronic apparatus according to the third embodiment. As shown in FIG. 7, the PC 700 includes a CPU 701, a ROM 702, a RAM 703, a display unit 704, an input unit 705, a storage control unit 706, a communication unit 707, a speaker 708, and a storage device 709.
 Using the RAM 703 as a work area, the CPU 701 executes various processes in cooperation with various control programs stored in the ROM 702 and the like, and centrally controls the operation of each unit constituting the PC 700.
 The ROM 702 non-rewritably stores the programs for controlling the PC 700, various setting information, and the like. The RAM 703 is a volatile storage medium and functions as the work area of the CPU 701.
 The display unit 704 has a display screen composed of an LCD (Liquid Crystal Display), an organic EL (Electro Luminescence) display, or the like, and displays processing progress, results, and so on under the control of the CPU 701. The speaker 708 outputs sound information under the control of the CPU 701.
 The input unit 705 has input devices such as a keyboard and a mouse, and notifies the CPU 701 of commands and information entered by the user via those input devices.
 The storage control unit 706 controls the operation of the storage device 709 and causes the storage device 709 to execute processing in response to requests input from the CPU 701, such as writing and reading data. Here, the storage device 709 is a storage device having a recording medium such as a magnetic disk, a semiconductor memory, or an optical disk.
 The communication unit 707 is a wireless communication interface that establishes communication with an external device (not shown) and transmits and receives data (for example, content including sound information and image information).
 FIG. 8 is a block diagram showing the functional configuration of the PC according to the third embodiment. In this embodiment, the CPU 701 executes a program stored in the ROM 702 to realize an image decoder 710, an audio decoder 711, the separator 243, the translator 244, the synthesizer 245, the mixing unit 246, the synchronization processing unit 247, the switch unit 248, a video processing unit 712, and an audio output unit 713.
 The image decoder 710 decodes the image information included in the content received by the communication unit 707 (image information played back in synchronization with the sound information included in that content) into a data format that the video processing unit 712 can process. The audio decoder 711 decodes the sound information included in the content received by the communication unit 707 into a data format that the audio output unit 713 can process.
 The switch unit 248 switches the output destination of the sound information decoded by the audio decoder 711 to either the separator 243 or the synchronization processing unit 247. The separator 243 separates the background sound information and the first audio information from the sound information decoded by the audio decoder 711. The translator 244 performs speech recognition processing that analyzes the first audio information and obtains its content as text data, and translates that text data from the source language of the first audio information (the first language) into a different translation language (the second language). The synthesizer 245 synthesizes the second audio information based on the text data translated into the translation language. The mixing unit 246 mixes and outputs the background sound information and the second audio information. The synchronization processing unit 247 synchronizes and outputs the sound information obtained by the mixing unit 246 mixing the background sound information with the second audio information and the image information to be played back in synchronization with that sound information.
 The video processing unit 712 converts the image information output from the synchronization processing unit 247 into an analog video signal in a format displayable on the display unit 704, and then outputs it to the display unit 704 for display. The audio output unit 713 converts the digital sound information output from the synchronization processing unit 247 into an analog sound signal in a format reproducible by the speaker 708, and then outputs it to the speaker 708 for playback.
 Thus, the PC 700 according to the third embodiment can obtain the same effects as the first embodiment.
 As described above, according to the first to third embodiments, the second audio information can be kept from becoming hard to hear when the second audio information converted from the first audio information is output. The background sound can also be prevented from becoming inaudible.
 The program executed by the electronic device of each embodiment is provided preinstalled in a ROM or the like. The program executed by the electronic device of each embodiment may also be provided as a file in an installable or executable format recorded on a computer-readable recording medium such as a CD-ROM, a flexible disk (FD), a CD-R, or a DVD (Digital Versatile Disk).
 Furthermore, the program executed by the electronic device of each embodiment may be stored on a computer connected to a network such as the Internet and provided by being downloaded via the network. The program may also be provided or distributed via a network such as the Internet.
 The program executed by the electronic device of each embodiment has a module configuration including the units described above (the separator 243, the translator 244, the synthesizer 245, the mixing unit 246, and the synchronization processing unit 247). In actual hardware, a CPU (processor) reads the program from the ROM and executes it, whereby the units are loaded onto the main storage device and the separator 243, the translator 244, the synthesizer 245, the mixing unit 246, and the synchronization processing unit 247 are generated on the main storage device.
 While several embodiments of the present invention have been described, these embodiments are presented by way of example and are not intended to limit the scope of the invention. These novel embodiments can be carried out in various other forms, and various omissions, substitutions, and changes can be made without departing from the gist of the invention. Such embodiments and their modifications fall within the scope and gist of the invention, and within the inventions recited in the claims and their equivalents.
 100 Digital television
 206 Signal processing unit
 243 Separator
 244 Translator
 245 Synthesizer
 246 Mixing unit
 247 Synchronization processing unit
 500 Notebook PC
 510 Content server
 520 Web server
 530 Speech processing server
 540 Translation server
 700 PC

Claims (14)

  1.  An electronic device comprising:
     a separation unit that separates background sound information and first audio information from sound information;
     a conversion unit that converts the first audio information into second audio information corresponding to the first audio information; and
     an output unit that mixes and outputs the background sound information and the second audio information.
  2.  The electronic device according to claim 1, wherein the output unit mixes and outputs the background sound information and the second audio information played back in synchronization with the background sound information.
  3.  The electronic device according to claim 1, wherein image information played back in synchronization with the sound information is input to the output unit, and the output unit outputs the image information after delaying it by the conversion time required to convert the first audio information into the second audio information.
  4.  The electronic device according to claim 3, wherein, when the difference between the playback time of the second audio information and the playback time of the first audio information is longer than a predetermined allowable time, the output unit adjusts at least one of the playback time of the second audio information and the playback time of the image information played back in synchronization with the second audio information so that the difference between the playback time of the second audio information and the playback time of that image information becomes equal to or less than the predetermined allowable time.
  5.  The electronic device according to claim 4, wherein the conversion unit converts the first audio information into a plurality of candidates for the second audio information, and
     the output unit adjusts the length of the second audio information by selecting, from the plurality of candidates, the candidate whose playback time equals the playback time of the image information, and mixing the selected candidate, as the second audio information, with the background sound information for output.
  6.  The electronic device according to claim 1, wherein the output unit adjusts the volume of the second audio information according to the volume of the first audio information.
  7.  The electronic device according to claim 1, wherein the output unit mixes and outputs the background sound information, the second audio information, and the first audio information.
  8.  The electronic device according to claim 7, wherein the output unit makes the volume of the first audio information smaller than the volume of the second audio information.
  9.  The electronic device according to claim 7, further comprising a display control unit that causes a display unit to display a volume input image for entering the volumes of the background sound information, the first audio information, and the second audio information,
     wherein the output unit adjusts the volume of each of the background sound information, the first audio information, and the second audio information according to the volumes entered via the volume input image.
  10.  The electronic device according to claim 1, wherein the conversion unit converts the first audio information into the second audio information in a second language different from the first language of the first audio information.
  11.  The electronic device according to claim 10, wherein the display control unit causes the display unit to display a language input image for entering the second language, and
     the conversion unit converts the first audio information into the second audio information in the second language entered via the language input image.
  12.  The electronic device according to claim 10, wherein the output unit adjusts the volume of the second audio information based on the first language and the second language.
  13.  An output method executed by an electronic device, comprising:
     separating, by a separation unit, background sound information and first audio information from sound information;
     converting, by a conversion unit, the first audio information into second audio information corresponding to the first audio information; and
     mixing and outputting, by an output unit, the background sound information and the second audio information.
  14.  A program for causing a computer to function as:
     a separation unit that separates background sound information and first audio information from sound information;
     a conversion unit that converts the first audio information into second audio information corresponding to the first audio information; and
     an output unit that mixes and outputs the background sound information and the second audio information.
PCT/JP2013/067716 2013-06-27 2013-06-27 Electronic device, output method, and program WO2014207874A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/JP2013/067716 WO2014207874A1 (en) 2013-06-27 2013-06-27 Electronic device, output method, and program


Publications (1)

Publication Number Publication Date
WO2014207874A1 2014-12-31



Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2000322077A (en) * 1999-05-12 2000-11-24 Sony Corp Television device
JP2001238299A (en) * 2000-02-22 2001-08-31 Victor Co Of Japan Ltd Broadcast reception device
JP2009152782A (en) * 2007-12-19 2009-07-09 Toshiba Corp Content reproducing apparatus and content reproducing method
JP2010074574A (en) * 2008-09-19 2010-04-02 Toshiba Corp Electronic apparatus and sound adjusting method


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110827843A (en) * 2018-08-14 2020-02-21 Oppo广东移动通信有限公司 Audio processing method and device, storage medium and electronic equipment
CN110827843B (en) * 2018-08-14 2023-06-20 Oppo广东移动通信有限公司 Audio processing method and device, storage medium and electronic equipment

Similar Documents

Publication Publication Date Title
JP5201692B2 (en) System and method for applying closed captions
US8112783B2 (en) Method of controlling ouput time and output priority of caption information and apparatus thereof
JP5423425B2 (en) Image processing device
US8301457B2 (en) Method for selecting program and apparatus thereof
TW200522731A (en) Translation of text encoded in video signals
JP6399726B1 (en) Text content generation device, transmission device, reception device, and program
US20090149128A1 (en) Subtitle information transmission apparatus, subtitle information processing apparatus, and method of causing these apparatuses to cooperate with each other
JP4989271B2 (en) Broadcast receiver and display method
JP2006211488A (en) Video reproducing apparatus
JP5110978B2 (en) Transmitting apparatus, receiving apparatus, and reproducing apparatus
US20140119542A1 (en) Information processing device, information processing method, and information processing program product
JP2010016521A (en) Video image processing apparatus and video image processing method
JP6385236B2 (en) Video playback apparatus and video playback method
WO2014207874A1 (en) Electronic device, output method, and program
US8059941B2 (en) Multiplex DVD player
JP7001639B2 (en) system
JP2009260685A (en) Broadcast receiver
KR20150081706A (en) Apparatus for Displaying Images, Method for Processing Images, and Computer Readable Recording Medium
JP2006148839A (en) Broadcasting apparatus, receiving apparatus, and digital broadcasting system comprising the same
KR20060127630A (en) Apparatus and method for storing and reproducing broadcasting program
KR100781284B1 (en) The broadcasting receiver for generating additional information of audio files, and the method for controlling the same
JP2006050507A (en) Digital broadcast content display device and display method thereof
JP4968946B2 (en) Information processing apparatus, video display apparatus, and program
JP2009159270A (en) Video recording apparatus
KR20060130800A (en) Method and apparatus for making a learning data using by caption data on broadcasting stream

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 13887811

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 13887811

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: JP