WO2020251122A1

WO2020251122A1 - Electronic device for providing content translation service and control method therefor

Info

Publication number: WO2020251122A1
Application number: PCT/KR2019/013982
Authority: WO
Inventors: 이범석; 김상하; 유지상
Original assignee: 삼성전자주식회사
Priority date: 2019-06-12
Filing date: 2019-10-23
Publication date: 2020-12-17
Also published as: KR20200142282A

Abstract

An electronic device is disclosed. The present electronic device comprises: a communication interface comprising a circuit; a memory; and a processor connected to the communication interface and the memory to control the electronic device, wherein the processor executes at least one instruction stored in the memory to thereby receive content through the communication interface, acquire text data of a second language on the basis of voice data or subtitle data of a first language included in the content, and output the content on the basis of a play speed corresponding to the length of the acquired text data of the second language.

Description

Electronic device providing content translation service and control method thereof

The present disclosure relates to an electronic device that provides a translation service for audio or subtitles of content. More specifically, the present disclosure relates to an electronic device that adjusts the playback speed of content so that the time length of the translated voice or subtitle matches the existing content.

Conventionally, video content with a translation service could be provided by applying machine translation and/or text-to-speech (TTS) technology to the voice/subtitles of the video content and overlaying the resulting voice/subtitles on the video content.

However, in this case, the existing video content and the translated voice/subtitles did not synchronize with each other in time. This is because the lengths of the voice/subtitle and the translated voice/subtitle included in the existing video content may be different from each other.

An object of the present disclosure is to provide an electronic device that translates voice/subtitles included in content into other languages and outputs them together with content.

In particular, when the electronic device outputs the translated voice/subtitles together with the contents, the main object is to provide an electronic device that enables the content image to naturally match the translated voice/subtitles.

An electronic device according to an embodiment of the present disclosure includes a communication interface including a circuit, a memory including at least one instruction, and a processor connected to the communication interface and the memory to control the electronic device. The processor, by executing the at least one instruction, receives content through the communication interface, acquires text data of a second language based on voice data of a first language or caption data of the first language included in the content, and outputs the content based on a reproduction speed corresponding to the length of the acquired text data of the second language.

In this case, the processor may acquire text data of the first language based on voice data of the first language corresponding to a first section of the content or subtitle data of the first language corresponding to the first section, and may obtain the text data of the second language by translating the acquired text data of the first language.

In this case, the processor may determine a difference between the length of the text data of the first language and the length of the text data of the second language, and if the determined difference is greater than or equal to a threshold value, may obtain other text data of the second language by translating the acquired text data of the first language again.

In addition, the processor may output text data of the second language in a subtitle format or output voice data converted from text data of the second language in an audio format based on an input user command.

When the time corresponding to the length of the text data of the second language is longer than the time corresponding to the first section, the processor may output the content at a slower playback speed than the original playback speed, and when the time corresponding to the length of the text data of the second language is shorter than the time corresponding to the first section, the processor may output the content at a faster playback speed than the original playback speed. In either case, the text data of the second language may be output together with the content in the form of a subtitle.

In addition, the processor may convert the text data of the second language to obtain voice data of the second language. When the reproduction time of the voice data of the second language is longer than the time corresponding to the first section, the processor may output the content at a slower playback speed than the original playback speed, and when the playback time of the voice data of the second language is shorter than the time corresponding to the first section, the processor may output the content at a faster playback speed than the original playback speed. In either case, a voice corresponding to the voice data of the second language may be output together with the content.

Meanwhile, the processor may determine a characteristic of a speaker in the content based on the image data included in the content or the voice data of the first language included in the content, convert the text data of the second language into voice data corresponding to the determined characteristic of the speaker, and output a voice corresponding to the voice data together with the content.

In addition, the processor may identify the type of the content. When the identified type is a preset first type, the processor may output the content based on a playback speed corresponding to the length of the acquired text data of the second language, and when the identified type is a preset second type, the processor may output the content at the original playback speed.

In addition, the processor may identify whether a character (a person appearing in the content) is included in the image data of the content corresponding to the voice data of the first language or the subtitle data of the first language, and when a character is included in the image data, may output the content at a playback speed within a preset range from the original playback speed.

A method of controlling an electronic device according to an embodiment of the present disclosure includes acquiring text data of a second language based on voice data of a first language or subtitle data of the first language included in input content, and outputting the content based on a reproduction speed corresponding to the length of the acquired text data of the second language.

In this case, the acquiring of the text data of the second language may include acquiring text data of the first language based on the voice data of the first language corresponding to a first section of the content or the caption data of the first language corresponding to the first section, and translating the acquired text data of the first language to obtain the text data of the second language.

In this case, the control method may further include determining a difference between the length of the text data of the first language and the length of the text data of the second language, and, if the determined difference is greater than or equal to a threshold value, translating the acquired text data of the first language again to obtain other text data of the second language.

The control method may further include outputting the text data of the second language in a subtitle format, or outputting the voice data converted from the text data of the second language in an audio format, based on an input user command.

Meanwhile, in the outputting of the content, when the time corresponding to the length of the text data of the second language is longer than the time corresponding to the first section, the content may be output at a slower playback speed than the original playback speed, and when the time corresponding to the length of the text data of the second language is shorter than the time corresponding to the first section, the content may be output at a faster playback speed than the original playback speed. In either case, the text data of the second language may be output together with the content in the form of subtitles.

In addition, the control method may further include converting the text data of the second language to obtain voice data of the second language. In the outputting of the content, when the playback time of the voice data of the second language is longer than the time corresponding to the first section, the content may be output at a slower playback speed than the original playback speed, and when the playback time of the voice data of the second language is shorter than the time corresponding to the first section, the content may be output at a faster playback speed than the original playback speed. In either case, a voice corresponding to the voice data of the second language may be output together with the content.

Meanwhile, the control method may further include determining a characteristic of a speaker in the content based on image data included in the content or voice data of the first language included in the content, and converting the text data of the second language into voice data corresponding to the determined characteristic of the speaker. In the outputting of the content, a voice corresponding to the voice data may be output together with the content.

The control method may further include identifying the type of the content. In the outputting of the content, when the identified type is a preset first type, the content may be output based on a playback speed corresponding to the length of the acquired text data of the second language, and when the identified type is a preset second type, the content may be output at the original playback speed.

The control method may further include identifying whether a character (a person appearing in the content) is included in the image data of the content corresponding to the voice data of the first language or the subtitle data of the first language. In the outputting of the content, when a character is included in the image data, the content may be output at a reproduction speed within a preset range from the original reproduction speed.

A computer-readable medium according to an exemplary embodiment of the present disclosure stores computer instructions that, when executed by a processor of an electronic device, cause the electronic device to perform operations including acquiring text data of a second language based on voice data of a first language or caption data of the first language included in input content, and outputting the content based on a playback speed corresponding to the length of the acquired text data of the second language.

The electronic device according to the present disclosure has an effect of providing content synchronized in time with a translated voice/subtitle.

In addition, since the electronic device according to the present disclosure synchronizes the content with the translated voice/subtitle in consideration of the situation of the content, the image of the content synchronized with the translated voice/subtitle does not appear unnatural to the user.

FIG. 1 is a diagram for explaining a general operation of an electronic device according to the present disclosure;

FIG. 2 is a block diagram illustrating a configuration of an electronic device according to an embodiment of the present disclosure;

FIG. 3 is a block diagram illustrating a detailed configuration of an electronic device for describing various embodiments of the present disclosure;

FIG. 4 is a block diagram illustrating a software structure of an electronic device according to an embodiment of the present disclosure;

FIG. 5A is a diagram for explaining a content output process when the translated voice is longer than the original voice;

FIG. 5B is a diagram illustrating a content output process when the translated voice is shorter than the original voice;

FIG. 6 is a flowchart illustrating a method of controlling an electronic device according to an embodiment of the present disclosure;

FIG. 7 is an algorithm for explaining an example of obtaining translated text data from original content;

FIG. 8 is an algorithm for explaining an example of outputting the translated voice along with content, based on the length of that voice, when the translated text is output as a voice; and

FIG. 9 is an algorithm for explaining an example of outputting the translated text along with content, based on the length of the translated text, when the translated text is output as a subtitle.

Before describing the present disclosure in detail, a method of describing the present specification and drawings will be described.

First, the terms used in the specification and claims were selected from general terms in consideration of their functions in various embodiments of the present disclosure. However, these terms may vary depending on the intention of a person skilled in the art, legal or technical interpretation, the emergence of new technologies, and the like. In addition, some terms were arbitrarily selected by the applicant. These terms may be interpreted according to the meanings defined in the present specification, and if a term is not specifically defined, it may be interpreted based on the general contents of the present specification and common technical knowledge in the art.

In addition, the same reference numbers or reference numerals in each drawing attached to the present specification indicate parts or components that perform substantially the same function. For convenience of description and understanding, different embodiments will be described using the same reference numerals or symbols. That is, even if all components having the same reference numerals are shown in the plurality of drawings, the plurality of drawings do not mean one embodiment.

In addition, terms including ordinal numbers such as “first” and “second” may be used in the specification and claims to distinguish between components. These ordinal numbers are used to distinguish the same or similar constituent elements from each other, and the use of these ordinal numbers should not limit the meaning of the terms. For example, the order of use or arrangement of elements combined with such ordinal numbers should not be limited by the number. If necessary, each of the ordinal numbers may be used interchangeably.

In the present specification, expressions in the singular include plural expressions unless the context clearly indicates otherwise. In the present application, terms such as "comprise" or "include" are intended to designate the existence of features, numbers, steps, actions, components, parts, or combinations thereof described in the specification, and should be understood not to preclude in advance the presence or possible addition of one or more other features, numbers, steps, actions, components, parts, or combinations thereof.

In the exemplary embodiments of the present disclosure, terms such as "module", "unit", and "part" refer to components that perform at least one function or operation, and these components may be implemented as hardware or software, or as a combination of hardware and software. In addition, a plurality of "modules", "units", "parts", and the like may be integrated into at least one module or chip and implemented as at least one processor, except when each needs to be implemented as individual specific hardware.

Further, in the embodiment of the present disclosure, when a part is connected to another part, this includes not only a direct connection but also an indirect connection through another medium. In addition, the meaning that a part includes a certain component means that other components may be further included rather than excluding other components unless specifically stated to the contrary.

Hereinafter, the present disclosure will be described in detail with reference to the accompanying drawings.

FIG. 1 is a diagram for describing a general operation of an electronic device according to the present disclosure. Referring to FIG. 1, the electronic device of the present disclosure may provide a Korean voice 30 translated from an English voice 20 spoken by a speaker in the content 10.

In this case, the electronic device may provide the translated Korean voice 30 together with the image of the content 10. However, the playback time of the content 10 is 30 seconds, whereas the playback time of the translated Korean voice 30 is 25 seconds, so an unnatural mismatch arises between the audio and the video.

Therefore, the electronic device may adjust the playback speed of the content 10 so that the playback time of the content 10 becomes 25 seconds, equal to that of the translated Korean voice 30, and may output the content 10 together with the translated voice 30.
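The arithmetic of the FIG. 1 example can be sketched as follows. This is only an illustration under stated assumptions, not the disclosed implementation: the function name `playback_speed` is hypothetical, and the speed is assumed to be expressed as a multiple of the original playback speed.

```python
def playback_speed(section_seconds: float, translated_seconds: float) -> float:
    """Return the speed multiplier that makes a section last as long as its translation.

    A value above 1.0 plays the content faster than the original;
    a value below 1.0 plays it slower. (Hypothetical helper.)
    """
    if section_seconds <= 0 or translated_seconds <= 0:
        raise ValueError("durations must be positive")
    return section_seconds / translated_seconds


# FIG. 1 example: 30 seconds of content, 25 seconds of translated voice.
speed = playback_speed(30.0, 25.0)      # faster than the original
adjusted_duration = 30.0 / speed        # content now lasts as long as the voice
```

With these inputs the content is sped up so that it ends together with the 25-second translated voice; if the translated voice were longer than the content, the same ratio would fall below 1.0 and the content would be slowed down.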

As described above, the electronic device according to the present disclosure has an advantage in that it is possible to successfully synchronize content and the translated voice without making the speed of the translated voice (or subtitle) faster or slower.

Hereinafter, specific embodiments of the electronic device of the present disclosure will be described through the drawings.

FIG. 2 is a block diagram illustrating a configuration of an electronic device according to an embodiment of the present disclosure.

Referring to FIG. 2, the electronic device 100 may include a communication interface 110, a memory 120, and a processor 130. The electronic device 100 may be various display devices such as a smart phone, a TV, a desktop PC, a tablet PC, and a notebook PC. Further, the electronic device 100 may be implemented as a set-top box or a server.

The communication interface 110 is a component for the electronic device 100 to communicate with at least one external device to exchange signals/data. To this end, the communication interface 110 may include a circuit.

The communication interface 110 may include a wireless communication module, a wired input/output module, and a broadcast reception module.

To receive content from an external server or an external device, the wireless communication module may include at least one of a Wi-Fi communication module, a Bluetooth module, an Infrared Data Association (IrDA) module, a third generation (3G) mobile communication module, a fourth generation (4G) mobile communication module, and a fourth generation Long Term Evolution (LTE) communication module.

The wired input/output module may be implemented as a wired port such as an HDMI port, a DisplayPort port, an RGB port, a digital visual interface (DVI) port, a Thunderbolt port, or a component port. The input/output port may be implemented as an HDMI port or a Thunderbolt port to transmit image and audio signals together, or a first port for transmitting an image signal and a second port for transmitting an audio signal may be implemented separately.

The broadcast receiving module may receive a signal for broadcast content. The broadcast receiving module may be implemented in a form including a configuration such as a tuner, a demodulator, and an equalizer to receive broadcast content transmitted from a broadcasting station.

The content received through the communication interface 110 may include at least one of image data, audio data, caption data, and metadata. In this case, the image data may include a caption.

The memory 120 is a component for storing an operating system (OS) for controlling the overall operation of the components of the electronic device 100 and various data related to the components of the electronic device 100. The memory 120 may also include at least one instruction related to one or more components of the electronic device 100.

To this end, the memory 120 may be implemented as a nonvolatile memory (eg, a hard disk, a solid state drive (SSD), a flash memory), a volatile memory, or the like.

The memory 120 may store content received from the outside through the communication interface 110, content generated by itself in the electronic device 100, and the like. In addition, content received from the outside through the communication interface 110 may be temporarily stored in the memory 120. In this case, the temporarily stored content may be output in real time through the electronic device 100.

The processor 130 controls the overall operation of the electronic device 100. To this end, the processor 130 may include, in hardware, a central processing unit (CPU), a graphics processing unit (GPU), and the like, and may execute operations or data processing related to control of other components included in the electronic device 100.

The processor 130 may be implemented as a micro processing unit (MPU), or may correspond to a computer in which random access memory (RAM) and read only memory (ROM) are connected to a CPU or the like through a system bus.

The processor 130 may control not only the hardware components included in the electronic device 100 but also one or more software modules included in the electronic device 100, and a result of the processor 130 controlling the software modules may be derived as an operation of the hardware components.

Specifically, the processor 130 may control the electronic device 100 by executing at least one command stored in the memory 120 by being connected to the communication interface 110 and the memory 120.

Hereinafter, the operation of the electronic device 100 including the components of FIG. 2 will be described in more detail.

In an embodiment of the present disclosure, the processor 130 may first acquire content. Specifically, the processor 130 may receive content from various sources (e.g., external servers, external devices, broadcasting stations, etc.) through the communication interface 110, or may obtain content stored in the memory 120. The content may correspond to video content that includes audio data and video data made up of one or more images, and may be of various types, such as news, sports, movies/dramas, and documentaries.

The processor 130 may acquire text data of a second language based on voice data of the first language or subtitle data of the first language included in the obtained content. In this case, the first language and the second language mean languages of different countries or regions, respectively. For example, the first language may be English and the second language may be Korean, but are not limited thereto.

First, the processor 130 may obtain text data of a first language based on voice data of a first language corresponding to a first section of the content or subtitle data of a first language corresponding to the first section.

Here, the first section refers to a partial section of the entire time section in which the video data and audio data of the content are reproduced, and may be one of a plurality of unit time sections into which the translation of the audio data, or of the subtitles included in the video data of the content, is divided.

The first section may correspond to a time section in which audio/subtitles corresponding to a preset number of words or sentences are output or may correspond to a time section corresponding to a preset interval.

For example, the first section may correspond to a time section during a scene in which a speaker in the content utters a specific sentence among the entire time section in which the content is reproduced, that is, a time section in which a voice for a corresponding sentence is output.

When the object of translation is the voice of the first section, the processor 130 may obtain text data of the first language from the voice data of the first section, and when the object of translation is the subtitle of the first section, the processor 130 may obtain text data of the first language from the caption data of the first section.

Whether the subject of translation will be the voice of the first section or the subtitle may vary according to a user command input to the electronic device 100. That is, the translation mode of the electronic device 100 may be classified into a voice translation mode or a subtitle translation mode according to a user command, and the processor 130 may activate the mode according to the user command.

The translation mode of the electronic device 100 may be automatically set by the electronic device 100 according to a situation. For example, the processor 130 may determine whether subtitle data separately exists in the acquired content, and when the subtitle data exists, may activate the subtitle translation mode. Alternatively, if the caption data of the content does not exist but the audio data of the content does exist, the processor 130 may activate the voice translation mode.

However, even if caption data does not exist separately in the acquired content, captions may be extracted from the image data of the content. Therefore, even if caption data does not exist in the acquired content, the processor 130 may identify whether a caption exists in the image data. If there is no caption in the image data, the processor 130 may activate the voice translation mode. However, the setting of the translation mode by the processor 130 may be more diverse, and is not limited to the above-described examples.
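The mode selection described above can be sketched as a small decision function. This is a minimal sketch under stated assumptions: the names `TranslationMode` and `select_translation_mode` are hypothetical, and giving a user command precedence over automatic selection is an assumption, since the disclosure presents the two as alternatives without ordering them.

```python
from enum import Enum
from typing import Optional


class TranslationMode(Enum):
    SUBTITLE = "subtitle"   # translate subtitle/caption text
    VOICE = "voice"         # translate recognized speech


def select_translation_mode(
    has_subtitle_data: bool,
    has_caption_in_image: bool,
    has_audio_data: bool,
    user_choice: Optional[TranslationMode] = None,
) -> Optional[TranslationMode]:
    """Pick a translation mode for the content (hypothetical helper)."""
    # Assumption: an explicit user command overrides automatic selection.
    if user_choice is not None:
        return user_choice
    # Separate subtitle data, or a caption found in the image data,
    # activates the subtitle translation mode.
    if has_subtitle_data or has_caption_in_image:
        return TranslationMode.SUBTITLE
    # No subtitle source but audio is present: voice translation mode.
    if has_audio_data:
        return TranslationMode.VOICE
    return None  # nothing to translate
```

For instance, content with no separate subtitle data and no caption in the image, but with audio, would fall through to the voice translation mode.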

In the case of the voice translation mode, the processor 130 may perform speech recognition on the voice data using a speech recognition module (speech-to-text (STT) module) and obtain the result of the speech recognition as text data of the first language. Details of the speech recognition module (STT module) will be described later with reference to FIG. 4.

In the case of the caption translation mode, the processor 130 may directly acquire the caption data of the content as text data of the first language. However, if the caption data is not separately received or stored, the processor 130 may extract the caption data by recognizing a character from the image included in the image data and then extract the text data of the first language from the extracted caption data. In this case, a character recognition module can be used. A detailed description of the character recognition module will be described later with reference to FIG. 4.

After the text data of the first language is obtained according to the above-described embodiments, the processor 130 may obtain text data of the second language by translating the text data of the first language.

When translating text data of a first language into text data of a second language, the processor 130 may use a translation module. A detailed description of the translation module will be described later with reference to FIG. 4.

The processor 130 may output the text data of the second language, obtained by translating the text data of the first language, together with the content in the form of a caption or audio. When the text data of the second language is output in a subtitle format, a subtitle generation module may be used, and when it is output in an audio format, a text-to-speech (TTS) module may be used. These modules are described further below with reference to the drawings.

The electronic device 100 may provide the translation service in various modes, such as a caption providing mode that provides the text data of the second language in the form of a caption, a voice providing mode that provides the text data of the second language in audio form, and a comprehensive providing mode that provides both the caption and audio forms.

The processor 130 may change the translation providing mode according to a user command or a preset condition. For example, the processor 130 may activate any one of a caption providing mode/audio providing mode/comprehensive providing mode according to a user command. Alternatively, when the translation target is audio data of the first language, the audio providing mode may be activated, and when the translation target is the caption data of the first language, the caption providing mode may be activated. The translation providing mode may be variously set according to preset conditions, and is not limited to the above-described examples.

In providing the translated subtitle/voice as described above, the length of time between the original content and the translated subtitle/voice may not match. To solve this problem, the processor 130 of the electronic device 100 according to the present disclosure may control the reproduction speed of the content according to the length of text data of the second language.

Specifically, the processor 130 may adjust the reproduction speed of the image data of the content according to the length of the translated caption/audio. Alternatively, the processor 130 may adjust the reproduction speed of both the video data and the audio data of the content to match the length of the translated caption.

When adjusting the reproduction speed of the image data of the content, the processor 130 may decrease or increase the reproduction speed by increasing or decreasing the time interval between image frames in the image data.
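Changing the time interval between frames amounts to rescaling each frame's output time. The following is a minimal sketch, assuming frames are represented only by their output timestamps in seconds; the function name `retime_frames` is hypothetical.

```python
from typing import List


def retime_frames(timestamps: List[float], speed: float) -> List[float]:
    """Rescale frame output times so each frame gap is divided by `speed`.

    speed > 1.0 shortens the gaps (faster playback);
    speed < 1.0 lengthens them (slower playback).
    """
    if speed <= 0:
        raise ValueError("speed must be positive")
    if not timestamps:
        return []
    start = timestamps[0]
    # Keep the first frame where it is and scale every offset from it.
    return [start + (t - start) / speed for t in timestamps]


faster = retime_frames([0.0, 1.0, 2.0], 2.0)   # gaps halved
slower = retime_frames([0.0, 1.0, 2.0], 0.5)   # gaps doubled
```

No frame is added or removed in this approach; only the moment at which each existing frame is shown changes.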

However, the processor 130 may adjust the reproduction speed of the image data of the content while maintaining a constant time interval between image frames. Specifically, the processor 130 may adjust the playback speed by adding a new image frame or excluding an existing image frame.

For example, the processor 130 may lower the playback speed by adding new image frames between image frames. In this case, an image frame to be added may be generated through an interpolation technique for existing image frames. As a specific example, the third image frame added between the existing first image frame and the second image frame may be an image frame generated as a result of interpolation of the first image frame and the second image frame.
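Slowing playback by inserting interpolated frames can be sketched as below. This is an illustration only: frames are assumed to be flat lists of pixel values, the interpolation is assumed to be a simple per-pixel midpoint (the disclosure does not fix a particular interpolation technique), and both function names are hypothetical.

```python
from typing import List

Frame = List[float]


def blend(frame_a: Frame, frame_b: Frame) -> Frame:
    """Per-pixel midpoint of two equally sized frames (assumed interpolation)."""
    if len(frame_a) != len(frame_b):
        raise ValueError("frames must have the same size")
    return [(a + b) / 2 for a, b in zip(frame_a, frame_b)]


def slow_down_by_interpolation(frames: List[Frame]) -> List[Frame]:
    """Insert one interpolated frame between every pair of neighbouring frames."""
    if not frames:
        return []
    result = [frames[0]]
    for previous, following in zip(frames, frames[1:]):
        result.append(blend(previous, following))  # the added frame
        result.append(following)
    return result


stretched = slow_down_by_interpolation([[0.0, 0.0], [10.0, 20.0]])
```

Because the interval between output frames stays constant, turning n frames into 2n - 1 frames makes the section take roughly twice as long to play.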

For example, the processor 130 may increase the playback speed by excluding one or more of the existing image frames.

In this case, the processor 130 may preferentially exclude the overlapping image frames. In this case, output timing of the remaining image frames that are not excluded may be additionally adjusted.

Alternatively, the first image frame, the second image frame, and the third image frame that have already existed may be excluded, while a new fourth image frame and a fifth image frame may be added. In this case, the fourth image frame may correspond to an intermediate value between the first and second image frames, and the fifth image frame may correspond to an intermediate value between the second and third image frames.
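The replacement of three existing frames with two intermediate frames can be sketched as follows. The same assumptions apply as for any toy frame model: frames are flat lists of pixel values, the "intermediate value" is taken to be the per-pixel midpoint, and the function names are hypothetical.

```python
from typing import List

Frame = List[float]


def midpoint(frame_a: Frame, frame_b: Frame) -> Frame:
    """Per-pixel intermediate value of two equally sized frames."""
    if len(frame_a) != len(frame_b):
        raise ValueError("frames must have the same size")
    return [(a + b) / 2 for a, b in zip(frame_a, frame_b)]


def replace_three_with_two(first: Frame, second: Frame, third: Frame) -> List[Frame]:
    """Drop three frames and return the new fourth and fifth frames."""
    fourth = midpoint(first, second)   # between the first and second frames
    fifth = midpoint(second, third)    # between the second and third frames
    return [fourth, fifth]


shortened = replace_three_with_two([0.0], [10.0], [20.0])
```

At a constant frame interval, showing two frames in place of three shortens this stretch of video to two thirds of its original duration.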

As described above, there may be various ways of adjusting the playback speed, and the above description is for some examples, and the adjustment of the playback speed of the electronic device 100 according to the present disclosure is not limited thereto.

On the other hand, when the image data of the content is uniformly output at a playback speed corresponding to the length of the text data of the second language, there is a possibility that the image data of the content is reproduced too quickly or slowly, and it becomes unnatural.

Accordingly, the processor 130 may determine a difference between the length of text data of the first language and the length of text data of the second language. And, if the determined difference is greater than or equal to the threshold value, the processor 130 may re-translate the obtained text data of the first language to obtain the text data of the second language again.

If, in the original content, voices of two or more speakers (of the first language) overlap each other in at least some sections, the processor 130 may convert the voice data of the first language of each speaker into first text data and obtain second text data having a length similar to the length of the first text data.

Specifically, when the voices of two or more speakers overlap each other in at least a partial section of the original content, the processor 130 may set the threshold value for the difference in length between the text data of the first language and the text data of the second language to be smaller than in other cases. With the smaller threshold value, the translation can be repeated until text data of the second language that satisfies it is obtained.
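The re-translation loop with a stricter threshold for overlapping speakers might look like the following sketch. It assumes that text length is measured as a character count, that the translator can be asked for an alternative result by attempt index, and that the loop gives up after a fixed number of attempts; `translate_within_budget`, the threshold values, and `max_attempts` are all hypothetical.

```python
from typing import Callable


def translate_within_budget(
    source_text: str,
    translate: Callable[[str, int], str],  # hypothetical: (text, attempt index) -> translation
    overlapping_speakers: bool = False,
    threshold: int = 20,          # assumed default threshold, in characters
    overlap_threshold: int = 10,  # assumed smaller threshold for overlapping speech
    max_attempts: int = 5,        # assumed cap so the loop always terminates
) -> str:
    """Re-translate while the length difference is at or above the threshold."""
    limit = overlap_threshold if overlapping_speakers else threshold
    best = translate(source_text, 0)
    for attempt in range(1, max_attempts):
        if abs(len(best) - len(source_text)) < limit:
            break  # difference is below the threshold; accept this translation
        candidate = translate(source_text, attempt)
        # Keep whichever translation is closest in length to the source so far.
        if abs(len(candidate) - len(source_text)) < abs(len(best) - len(source_text)):
            best = candidate
    return best
```

If no attempt satisfies the threshold, the sketch returns the closest candidate seen; the disclosure does not specify this fallback.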

The length of the text data may be generally preset to be proportional to the capacity of the text data, but is not limited thereto. For example, the same capacity may be defined as having different lengths if the languages are different.

In the case of the subtitle providing mode, the processor 130 may compare the playback time of the first section of the original content, which includes video data or audio data containing the text data of the first language, with a time corresponding to the length of the text data of the second language. That is, the reproduction time of the original content may be adjusted so that the time corresponding to the length of the text data of the second language and the reproduction time of the first section become the same.

In this case, the time corresponding to the length of the text data may be conceptually defined/interpreted as a time required to read all texts of the text data. In this case, as the capacity of the text data increases, the time corresponding to the length of the text data may be preset to increase. Alternatively, various embodiments are possible, such as calculating a time corresponding to the length of the text data by adding all preset reading times for each character or word included in the text data.

As a specific example, when the time corresponding to the length of the text data of the second language is longer than the time corresponding to the first section (the reproduction time of the first section in the original content), the processor 130 may output the content at a slower playback speed than the original playback speed. When the time corresponding to the length of the text data of the second language is shorter than the time corresponding to the first section, the processor 130 may output the content at a faster playback speed than the original playback speed. In either case, the text data of the second language may be output in the form of subtitles together with the content (whose playback speed has been changed).
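For the subtitle providing mode, a minimal sketch of the comparison could sum a preset reading time per word and derive a speed factor from it. The per-word time of 0.4 seconds, whitespace word splitting, and the function names are assumptions for illustration only; a factor below 1.0 means slower than the original and a factor above 1.0 means faster.

```python
def reading_time_seconds(text: str, seconds_per_word: float = 0.4) -> float:
    """Estimate reading time by summing one preset time per whitespace-separated word."""
    return len(text.split()) * seconds_per_word


def subtitle_playback_speed(section_seconds: float, subtitle_text: str,
                            seconds_per_word: float = 0.4) -> float:
    """Speed factor that makes the section last as long as the subtitle takes to read."""
    needed = reading_time_seconds(subtitle_text, seconds_per_word)
    if section_seconds <= 0 or needed <= 0:
        return 1.0  # nothing to match against; keep the original speed
    return section_seconds / needed
```

Whitespace splitting does not suit every language; a per-character preset time, which the description also allows, would replace `text.split()` with a character count.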

In the case of the audio providing mode or the comprehensive providing mode, the processor 130 may compare the playback time of the first section of the original content, which includes video data or audio data containing the text data of the first language, with the playback time of the voice data corresponding to the text data of the second language. That is, the reproduction time of the original content may be adjusted so that the reproduction time of the audio data corresponding to the text data of the second language and the reproduction time of the first section become the same.

As a specific example, the processor 130 may first convert text data of the second language to obtain voice data of the second language. Then, if the playback time of the voice data of the second language is longer than the time corresponding to the first section, the content may be output at a slower playback speed than the original playback speed, and if the playback time of the voice data of the second language is shorter than the time corresponding to the first section, the content may be output at a faster playback speed than the original playback speed.

In this case, the processor 130 may output the voice corresponding to the voice data of the second language together with the content (whose playback speed has been changed). Specifically, the processor 130 may output image data of content whose playback speed is changed together with audio corresponding to audio data of a second language.

However, in this case, the playback speed of the content may of course vary depending on the characteristics of the voice data into which the text data of the second language is converted. This is because, even when the same text is converted, the converted voice data differs if the preset tone, intonation, or speech speed differs.

In this connection, the processor 130 may acquire voice data of the second language in consideration of an attribute of an image in the image data, or of the audio data, included in the content.

Specifically, the processor 130 may determine the characteristics of a speaker in the content based on image data included in the content or voice data of the first language included in the content, convert the text data of the second language into voice data corresponding to the determined characteristics, and output the voice corresponding to that voice data together with the content.

The characteristics of the speaker may mean gender, age, etc., or may correspond to the tone and intonation of the speech.

For example, when the person speaking in the image data included in the content is a young man, the processor 130 may convert text data of the second language into voice data of the young man among previously stored voice data. In this case, the processor 130 may use one or more convolutional neural networks (CNNs) that have been learned to identify a person in the image and recognize the age/gender of the identified person.

For example, when voice data included in the content corresponds to the voice of a young woman, the processor 130 may convert text data of the second language into voice data of the young woman among previously stored voice data. In this case, the processor 130 may use one or more voice feature models or Deep Neural Networks (DNNs) that have been learned to identify the sex/age of voices in the voice data.

Meanwhile, the processor 130 may separately store, in the memory 120, information on the characteristics of the speaker (e.g., a young woman) identified from a first portion of the video data or audio data of the original content, and may use the stored information when generating translated voice data for a subtitle and/or voice belonging to a portion other than the first portion.

If the voices of two or more speakers overlap each other in the original content, the processor 130 may identify a first ratio between the difference in playback start time and the difference in playback end time between the speakers' voice data of the first language. The processor 130 may then set the playback start (end) time of each piece of voice data of the second language so that the ratio between the difference in playback start time and the difference in playback end time between the pieces of voice data of the second language, each generated from the speakers' voice data of the first language, is equal to the identified first ratio or within a preset range of it.

Alternatively, the processor 130 may identify a second ratio between the difference in playback start time between the speakers' voice data of the first language and the total playback time of the speakers' voice data of the first language. The processor 130 may then set the playback start time of each piece of voice data of the second language so that the ratio between the difference in playback start time between the pieces of voice data of the second language, each generated from the speakers' voice data of the first language, and the total playback time of the voice data of the second language is equal to the identified second ratio or within a preset range of it.

In addition, the processor 130 may adjust the reproduction time of the original image data of the scene corresponding to the speakers' voices (of the first language) so that it corresponds to the time from when at least some of the speakers' voice data of the second language starts to be played back until all of the speakers' voice data of the second language has finished.

For example, suppose that the total playback time of the speakers' overlapping utterances within the original content is 10 seconds and the total playback time of the translated voices of those utterances is 5 seconds (this example is for ease of explanation; in practice it is desirable that the playback times before and after translation not differ as much as 10 seconds versus 5 seconds). In this case, if the time difference between the start points of the speakers' utterances in the original content is 2 seconds, the difference between the playback start points of the translated voices may be 1 second.
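The 10-second/5-second example follows from keeping the second ratio (start-time difference divided by total playback time) unchanged after translation. A small sketch, with the hypothetical function name `scaled_start_offset`:

```python
def scaled_start_offset(original_offset: float, original_total: float,
                        translated_total: float) -> float:
    """Start-time difference for the translated voices that preserves
    original_offset / original_total (the 'second ratio')."""
    if original_total <= 0:
        raise ValueError("original_total must be positive")
    return original_offset * translated_total / original_total


# Values from the example: 2 s offset within 10 s of speech, translated into 5 s.
offset = scaled_start_offset(2.0, 10.0, 5.0)  # 1.0
```

The disclosure also permits a result within a preset range of the exact ratio, which this sketch does not model.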

Meanwhile, the processor 130 may output content whose playback speed is adjusted according to the type of content or situation for each section.

As an example, the processor 130 may adjust the content playback speed according to a user command on whether to adjust the content playback speed according to the length of the text data of the second language. Specifically, when a user command not to adjust the playback speed of the content according to the translated subtitle/voice is input (or when a user command to adjust the playback speed of the content is not input), the processor 130 may keep the playback speed of the video data of the original content unchanged regardless of the length of the text data.

The processor 130 may adjust the reproduction speed of the content differently according to the type of the original content.

In this case, the processor 130 may first identify the type of original content. The processor 130 may analyze image data or audio data included in the content to identify whether the content is news, sports, drama, or another type.

Alternatively, the processor 130 may identify the type of the corresponding content through information on the content received from the external device.

In addition, when the identified type is a preset first type, the processor 130 may output the content based on the playback speed corresponding to the length of the text data of the second language, whereas when the identified type is a preset second type, the processor 130 may output the content at the original playback speed.

For example, even if the reproduction time of the translated voice is different from the reproduction time of the contents, the processor 130 may not adjust the reproduction time of the contents for sports or drama contents.

On the other hand, for news or advertisement content, the content playback time can be adjusted according to the playback time of the translated voice. Meanwhile, the types of contents for which the reproduction speed of the contents can be adjusted according to the translated subtitles/voices may be preset in various ways, and the contents are not limited to news or advertisement contents as in this example.
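The user-command and content-type conditions above can be combined into one small policy function. This is only a sketch: the type labels reuse the examples in the text (news and advertisements as the first type; sports and drama as the second type), and treating any unlisted type as non-adjustable is an assumption.

```python
ADJUSTABLE_TYPES = {"news", "advertisement"}  # example "first type" content
FIXED_TYPES = {"sports", "drama"}             # example "second type" content


def speed_for_content(content_type: str, user_allows_adjustment: bool,
                      requested_speed: float) -> float:
    """Return the playback speed factor to use (1.0 means original speed)."""
    if not user_allows_adjustment:
        return 1.0  # user did not ask for speed adjustment
    if content_type in ADJUSTABLE_TYPES:
        return requested_speed
    # FIXED_TYPES and, by assumption, any unknown type keep the original speed.
    return 1.0
```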

The processor 130 may analyze the image data of the content in units of image frames, and may change whether or not to adjust the playback time according to a scene in which the image frame is included.

As an example, the processor 130 may adjust the reproduction time of the image data of the content according to the reproduction time of the translated voice for a section including the image frame immediately before or immediately after a scene change. For a section unrelated to a scene change, however, the processor 130 may leave the reproduction time of the content unadjusted even if the reproduction time of the translated voice differs from the reproduction time of the image data of the original content.

For example, for a section including an image frame in which a person or character appears, the processor 130 may not adjust the playback time of the image data of the content, or may adjust the playback speed only within a preset range from the original playback speed. This is to prevent the user from feeling discomfort when video in which a person appears plays too fast or too slowly.

Specifically, the processor 130 may identify a character (a person, a cartoon character, etc.) in the image data that, among the image data of the content, corresponds to (is matched in time within the original content with) the audio data of the first language or the subtitle data of the first language.

In addition, when the corresponding image data includes a character, the image data may be output at the same reproduction speed as the original reproduction speed even if the length of the text data of the second language, acquired from the audio data of the first language or the caption data of the first language, does not correspond to the original reproduction speed of the corresponding image data.

Alternatively, the processor 130 may adjust the reproduction speed of the corresponding image data according to the length of the text data of the second language (which is a translation result) within a preset range from the original reproduction speed of the corresponding image data.
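Limiting the speed change for sections that show a person or character amounts to clamping the requested speed factor. In the sketch below, the 10% maximum deviation and the `lock_when_character` switch (which models the "same as the original speed" alternative) are assumed values and names, not figures from the disclosure.

```python
def clamp_speed(requested_speed: float, has_character: bool,
                max_deviation: float = 0.1,
                lock_when_character: bool = False) -> float:
    """Restrict the speed factor for sections in which a character appears."""
    if not has_character:
        return requested_speed  # no character on screen: no restriction
    if lock_when_character:
        return 1.0  # keep the original reproduction speed
    low, high = 1.0 - max_deviation, 1.0 + max_deviation
    return min(max(requested_speed, low), high)
```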

FIG. 3 is a block diagram illustrating a detailed configuration of an electronic device 100 for describing various embodiments of the present disclosure.

Referring to FIG. 3, the electronic device 100 may further include at least one of a display 140, an audio output unit 150, and a user interface 160 in addition to the communication interface 110, the memory 120, and the processor 130.

Through the display 140, the processor 130 may visually output image data and caption data of the original content.

In addition, the processor 130 may output image data of content whose playback speed is adjusted through the display 140. In this case, the translated text data of the second language may be output together in the form of a caption.

To this end, the display 140 may be implemented as a Liquid Crystal Display (LCD), a Plasma Display Panel (PDP), Organic Light Emitting Diodes (OLED), Transparent OLED (TOLED), Micro LED, or the like.

The display 140 may be implemented in the form of a touch screen capable of sensing a user's touch manipulation, and may be implemented as a flexible display that can be folded or bent.

Through the audio output unit 150, the processor 130 may output voice data of the original content or may output a translated voice obtained by converting text data of a second language into a voice form.

To this end, the audio output unit 150 may be implemented as a speaker (not shown) and/or a headphone/earphone output terminal (not shown).

Through the user interface 160, the processor 130 may receive a user command regarding a translation target, a translation providing method, and whether to adjust a content reproduction speed.

Specifically, the processor 130 may receive, through the user interface 160, a user command for selecting one of a subtitle translation mode in which the translation target is a subtitle of the original content and a voice translation mode in which the translation target is a voice of the original content.

In addition, the processor 130 may receive, through the user interface 160, a user command for selecting any one of a caption providing mode that provides the translated text data of the second language in the form of a caption, a voice providing mode that provides it in the form of a voice, and a comprehensive providing mode that provides both.

In addition, the processor 130 may receive, through the user interface 160, a user command for adjusting the reproduction speed of the image data of the content according to the translated caption/audio for the currently provided content.

In addition, the processor 130 may receive, through the user interface 160, a user command on whether to provide a translation service for voice/subtitles included in the original content, and may provide a voice/subtitle translated from the voice/subtitle of the original content only when a user command to provide the translation service is received.

To this end, the user interface 160 may include one or more buttons, a keyboard, and a mouse. In addition, the user interface 160 may include a touch panel implemented together with the display 140 or a separate touch pad (not shown). The user interface 160 may include a microphone to receive a user's command or information by voice, or may include a camera for recognizing a user's command or information in a motion form.

FIG. 4 is a block diagram illustrating a software structure of an electronic device 100 according to an embodiment of the present disclosure.

Referring to FIG. 4, the electronic device 100 may include at least one of a character recognition module 410, an STT module 420, a translation module 430, a caption generation module 440, a TTS module 450, and a content playback module 460.

If there is no separate caption data in the acquired content, the processor 130 may recognize a character from an image in the image data of the content using the character recognition module 410. In addition, after generating caption data of the original content through the recognized characters, the generated caption data may be identified as text data of the first language.

The character recognition module 410 may be implemented through Optical Character Recognition (OCR), pattern recognition, or a Convolutional Neural Network (CNN) that is learned to recognize characters.

The processor 130 may recognize the voice data through the STT module 420 when the obtained voice data of the content is to be translated. STT module 420 may include an acoustic model (Acoustic Model) and a language model (Language Model). The acoustic model may include information on characteristics of a speech signal corresponding to a phoneme or word, and the language model may include information corresponding to an arrangement order and relationship of one or more phonemes or words.

The processor 130 may extract various features from the input speech signal, such as linear predictive coefficients, cepstrum, Mel Frequency Cepstral Coefficients (MFCC), and filter bank energy, and may recognize a phoneme included in the speech signal by comparing the extracted features with the acoustic model.

In addition, as a result of modeling the linguistic order relationship of the recognized phonemes using the language model, the processor 130 may obtain text corresponding to a word or sentence and identify it as text data of the first language. In this case, the processor 130 may compare the acquired text with a pronunciation dictionary stored in connection with the STT module, correct/determine it, and then identify it as text data of the first language.

On the other hand, if there is subtitle data in the acquired content and the corresponding subtitle data is a target for translation, the processor 130 may identify the subtitle data directly as text data of the first language without using the character recognition module 410 and the STT module 420.

The processor 130 may obtain text data of the second language by translating the identified text data of the first language through the translation module 430. In this case, the translation module 430 may use statistical machine translation or neural machine translation, but is not limited thereto.

When text data of the second language is provided in the form of a caption, the processor 130 may generate a caption composed of the text data of the second language through the caption generation module 440. In this case, the processor 130 may modify text data of the second language with a preset font and size through the caption generation module 440 or may detect errors included in the text data of the second language.

When text data of the second language is provided in an audio form, the processor 130 may convert text data of the second language into voice data of the second language through the TTS module 450. In this case, the processor 130 may convert text data of the second language into speech in a voice preset in relation to the TTS module 450.

The processor 130 may also identify, among voices of various characteristics previously stored in relation to the TTS module 450, a voice suited to the age/gender of a person in the image of the original content or to the age/gender inferred from the voice of the original content, and obtain voice data of the second language in the identified voice.

The processor 130 may adjust a content playback speed according to the length of text data of the second language through the content playback module 460. In this case, the processor 130 may add a new image frame or exclude at least one of the existing image frames through the content reproduction module 460. In addition, the content for which the playback speed is adjusted may be outputted together with the translated subtitles/audio data.

Hereinafter, FIGS. 5A and 5B assume a case in which the voice of the original content is translated (voice translation mode) and provided in the form of a voice (voice providing mode). FIG. 5A is a diagram for explaining a process of outputting content when the translated voice is longer than the original voice, and FIG. 5B is a diagram for explaining a process of outputting content when the translated voice is shorter than the original voice.

Referring to FIG. 5A, a specific section of the original content is shown as being divided into an original image 510 and an original audio 520. Here, the original voice corresponds to the English text “This is one way for an airline to increase its name recognition: misspell your own name.”

Referring to FIG. 5A, the electronic device 100 may translate the above-described English text into the Korean text “This is one way for airlines to increase their recognition of names: incorrectly writing their own names.” and convert the Korean text into a translated voice 520'. In this case, the translated voice 520' may be a male voice set to fit the original voice 520 and the original image 510.

Referring to FIG. 5A, it can be seen that the playback time of the translated voice 520' is 2 minutes 36 seconds, which is 48 seconds longer than the playback time of the original voice 520, 1 minute 48 seconds. As a result, the electronic device 100 may slow the reproduction speed of the original image 510 so that its reproduction time increases to match the reproduction time of the translated voice 520', and may output the slowed image together with the translated voice 520'.

Referring to FIG. 5B, a specific section of the original content is shown divided into an original image 560 and an original audio 570. At this time, the original voice 570 corresponds to the English text "But painters apparently didn't have an F at their fingertips, resulting in Cathay Pacific instead of Pacific."

Referring to FIG. 5B, the electronic device 100 may translate the above-described English text into the Korean text “However, painters clearly did not put an F on their fingertips, and as a result, Cathay Pachiok was born instead of Pacific Ocean.” and convert the Korean text into a translated voice 570'. In this case, the translated voice 570' may be a male voice set to fit the original voice 570 and the original image 560.

Referring to FIG. 5B, it can be seen that the playback time of the translated voice 570' is 1 minute and 40 seconds, which is 30 seconds shorter than the playback time of the original voice 570, which is 2 minutes and 10 seconds. As a result, the electronic device 100 may speed up the reproduction speed of the original image 560 so that its reproduction time decreases to match the reproduction time of the translated voice 570', and may output the sped-up image together with the translated voice 570'.
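As a worked check of the figures quoted for FIGS. 5A and 5B, the sketch below converts the durations to seconds and computes the speed factor that would make the image last exactly as long as the translated voice. Expressing the adjustment as a single multiplicative factor is an assumption; the function names are hypothetical.

```python
def to_seconds(minutes: int, seconds: int) -> int:
    """Convert a minutes/seconds duration to whole seconds."""
    return minutes * 60 + seconds


def speed_factor(original_seconds: float, translated_seconds: float) -> float:
    """Factor by which the original image speed is multiplied (<1 slower, >1 faster)."""
    return original_seconds / translated_seconds


# FIG. 5A: original 1 min 48 s (108 s), translated 2 min 36 s (156 s), 48 s longer.
fig_5a = speed_factor(to_seconds(1, 48), to_seconds(2, 36))  # 108 / 156, about 0.69

# FIG. 5B: original 2 min 10 s (130 s), translated 1 min 40 s (100 s), 30 s shorter.
fig_5b = speed_factor(to_seconds(2, 10), to_seconds(1, 40))  # 130 / 100 = 1.3
```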

Meanwhile, the operations of the electronic device 100 described above may be performed through the electronic device 100 and one or more external devices, not the electronic device 100 alone.

For example, when the electronic device 100 is a TV or a smartphone, text data of the first language may be obtained by the electronic device 100 and then translated by an external device that is a server. In this case, when text data of the second language is received from the external device, the electronic device 100 may adjust the playback speed of the content according to the length of the text data of the second language, convert the text data of the second language into a voice/subtitle format, and output it together with the adjusted content.

As another example, when the electronic device 100 is a set-top box, the electronic device 100 may obtain text data of a first language from content received from outside, translate the text data of the first language to acquire text data of a second language, and then transmit the content whose playback speed is adjusted according to the text data of the second language to an external device that is a TV. In addition, the electronic device 100 may convert text data in a second language into an audio/subtitle format and transmit the data to an external device that is a TV. In this case, the adjusted content and the subtitle/audio converted from the text data of the second language may be output through an external device that is a TV.

As another example, when the electronic device 100 is a server, the electronic device 100 may obtain first text data from content, translate the first text data to obtain second text data, and transmit the content whose playback speed is adjusted according to the length of the second text data to an external device such as a TV or a smartphone, together with data on subtitles/voices converted from the text data of the second language. In this case, the adjusted content and the subtitle/audio converted from the text data of the second language may be output through an external device such as a TV or a smartphone.

In addition, various embodiments in which the electronic device 100 operates together with an external device are possible, and are not limited to the above-described examples.

Hereinafter, a method of controlling an electronic device according to the present disclosure will be described with reference to FIGS. 6 to 9.

FIG. 6 is a flowchart illustrating a method of controlling an electronic device according to an embodiment of the present disclosure.

Referring to FIG. 6, the control method may acquire text data of a second language based on voice data of a first language or subtitle data of a first language included in the input content (S610).

In this case, text data of the first language may be obtained based on voice data of the first language corresponding to the first section of the content or subtitle data of the first language corresponding to the first section. In addition, text data of the second language may be obtained by translating the obtained text data of the first language.

In this case, the present control method may determine a difference between the length of the text data of the first language and the length of the text data of the second language and, if the determined difference is greater than or equal to a threshold value, translate the obtained text data of the first language again to obtain other text data of the second language.

In addition, content may be output based on a reproduction speed corresponding to the length of the acquired text data of the second language (S620). Specifically, video data and/or audio data of content whose playback speed is adjusted at a playback rate corresponding to the length of text data of the second language may be output.

In this case, based on an input user command or a preset condition, text data of the second language may be output in the form of a caption, or voice data converted from the text data of the second language may be output in the form of audio. When the text data of the second language is output in the form of audio/subtitles, the 'translated content' may be provided by outputting it together with the content whose playback speed is adjusted.

As a specific example of adjusting the playback speed of the content, when the time corresponding to the length of the text data of the second language is longer than the time corresponding to the first section, which contains the voice/subtitle of the original content to be translated (matched with the text data of the first language), the content may be output at a slower playback speed than the original playback speed. If the time corresponding to the length of the text data of the second language is shorter than the time corresponding to the first section, the content may be output at a faster playback speed than the original playback speed. In this case, text data of the second language may be output together with the content in the form of a caption.

As another specific example, first, voice data of the second language may be obtained by converting text data of the second language. For example, text data reading 'Hello' can be converted into an audio signal containing information about the spoken voice 'Hello'.

In this case, based on the image data included in the original content or the voice data of the first language included in the original content, the characteristics of the speaker (age, gender, emotion, etc.) in the original content are determined, and text data of the second language is determined. It can be converted into voice data corresponding to the determined speaker's characteristics.

Then, if the playback time of the voice data of the second language is longer than the time corresponding to the first section, which contains the voice/subtitle of the original content to be translated (matched with the text data of the first language), the content may be output at a slower playback speed than the original playback speed. When the reproduction time of the voice data of the second language is shorter than the time corresponding to the first section, the content may be output at a faster reproduction speed than the original reproduction speed. In this case, a voice corresponding to voice data of the second language may be output together with the content.

On the other hand, the present control method can identify the type of content. If the identified type is a preset first type, the content may be output based on the playback speed corresponding to the length of the acquired text data of the second language, and when the identified type is a preset second type, the content may be output at the original playback speed.

In addition, the present control method may identify whether a character (a person appearing on screen) is included in the image data of the content corresponding to the audio data of the first language or the caption data of the first language. When the identification result shows that a character is included in the image data, the content may be output at a playback speed within a preset range from the original playback speed. That is, the reproduction speed of the image data can be adjusted according to the text data of the second language only within a preset range from the original reproduction speed.

FIG. 7 is an algorithm for explaining an example of obtaining translated text data from original content. The process of FIG. 7 corresponds to a specific example of step S610 of FIG. 6.

Referring to FIG. 7, first, image data and audio data of original content may be separated (S710). In this case, when caption data and/or metadata are additionally included, such data may also be separated.

In addition, it may be identified whether there is a subtitle in the original content (S720). Specifically, it may be identified whether subtitle data of the original content exists separately or whether subtitles can be extracted from the image data.

If there is a subtitle (S720-Y), after identifying the text corresponding to the subtitle, the identified text may be translated (S740).

On the other hand, if there is no caption (S720-N), it may be determined whether the speaker of the voice included in the voice data is one (S750).

If there is only one speaker (S750-Y), speech recognition for the corresponding voice is immediately performed (S760), and the result of the speech recognition may be translated (S740). However, if there is more than one speaker (S750-N), the voice data may be separated for each speaker (S770), and then the voices for each speaker may be recognized (S760) and translated (S740).
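The branching of FIG. 7 can be sketched as a single function whose collaborators are passed in as callables. All of the callables (`count_speakers`, `split_by_speaker`, `recognize`, `translate`) are hypothetical placeholders for the speaker analysis, STT, and translation modules; the sketch shows only the control flow, with the step labels from the text as comments.

```python
from typing import Any, Callable, List, Optional


def translate_section(
    subtitle: Optional[str],
    audio: Any,
    count_speakers: Callable[[Any], int],
    split_by_speaker: Callable[[Any], List[Any]],
    recognize: Callable[[Any], str],
    translate: Callable[[str], str],
) -> List[str]:
    """Return one translated text per subtitle or per speaker."""
    if subtitle is not None:               # S720-Y: a subtitle exists
        return [translate(subtitle)]       # S740
    if count_speakers(audio) == 1:         # S750-Y: a single speaker
        tracks = [audio]
    else:                                  # S750-N: several speakers
        tracks = split_by_speaker(audio)   # S770
    # S760 (speech recognition) followed by S740 (translation) for each track
    return [translate(recognize(track)) for track in tracks]
```

For instance, with stub callables that treat `audio` as a list of per-speaker strings, a two-speaker input yields two translated texts, one per speaker.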

FIG. 8 is an algorithm for explaining an example in which, when the translated text is output as a voice, the voice is output together with the content based on the length of the voice. FIG. 8 may be a specific example of step S620 of FIG. 6.

Referring to FIG. 8, in a state in which the translated text has been obtained through step S610 of FIG. 6 or the like (S801), it may be identified whether the length difference between the translated text and the original text (before translation) is greater than or equal to a threshold value (S810). If the difference is greater than or equal to the threshold value (S810-Y), another translated text in the same language as the previous translated text may be obtained (S820), and this may be repeated until the difference becomes less than the threshold value (S810-N).

When the length difference between the translated text and the original text is less than the threshold value (S810-N), the translated text may be converted to speech (S830).
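The length check and re-translation loop (S810, S820) can be sketched as below. Several points are assumptions for illustration: length is measured as a character count, alternative translations are supplied as an iterable of candidates, and the closest candidate is returned if none falls under the threshold (the disclosure does not say what happens in that case). The name `pick_translation` is hypothetical.

```python
from typing import Iterable, Optional, Tuple


def pick_translation(original_text: str,
                     candidates: Iterable[str],
                     threshold: int) -> str:
    """Return the first candidate whose length difference is below threshold."""
    best: Optional[Tuple[int, str]] = None
    for candidate in candidates:
        # Assumption: "length" is the number of characters.
        diff = abs(len(candidate) - len(original_text))
        if diff < threshold:
            return candidate                 # S810-N: proceed with this text
        if best is None or diff < best[0]:
            best = (diff, candidate)         # S810-Y: remember closest so far
    if best is None:
        raise ValueError("no translation candidates were provided")
    # Assumed fallback: use the closest candidate when none meets the threshold.
    return best[1]
```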

If the playback time of the converted voice is longer than the playback time of the original content (S840-Y), the playback speed of the original video may be made slower (S850). If the playback time of the converted voice is shorter than the playback time of the original content (S840-N, S870-Y), the playback speed of the original video may be made faster (S880). The video whose playback speed has been modified may then be output together with the converted (translated) voice (S860).

On the other hand, when the playback time of the converted voice is the same as the playback time of the original voice (S840-N, S870-N), the original video may be output as it is together with the translated voice (S890).
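One simple way to realize the comparison in steps S840 to S890 is to scale the video by the ratio of the two playback times. The ratio formula and the function name `video_speed_for_voice` are assumptions for illustration; the description only states that the video is slowed when the translated voice is longer and sped up when it is shorter.

```python
def video_speed_for_voice(original_seconds: float,
                          voice_seconds: float) -> float:
    """Return a speed factor so the video lasts as long as the translated voice.

    A result below 1.0 slows the video (S850), above 1.0 speeds it up (S880),
    and exactly 1.0 leaves it unchanged (S890).
    """
    if original_seconds <= 0 or voice_seconds <= 0:
        raise ValueError("playback times must be positive")
    return original_seconds / voice_seconds
```

For instance, a 10-second section paired with a 12.5-second translated voice gives a factor of 0.8, so the section is stretched to 12.5 seconds.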

FIG. 9 is an algorithm for explaining an example in which, when the translated text is output as a subtitle, the translated text is output together with the content based on the length of the translated text. FIG. 9 may also be a specific example of step S620 of FIG. 6.

Referring to FIG. 9, steps S910 and S920 may be the same as steps S810 and S820 of FIG. 8. However, in the case of FIG. 9, differently from FIG. 8, since translated text (subtitles) is provided instead of the translated voice, the process of converting the translated text into voice (S830) may not be included.

Referring to FIG. 9, after step S910 (and possibly step S920) has been performed (S910-N), if the translated text is longer than the original text (S930-Y), the playback speed of the original video may be made slower (S940). If the translated text is shorter than the original text (S930-N, S960-Y), the playback speed of the original video may be made faster (S970). The video whose playback speed has been modified may then be output together with the translated text (subtitle) (S950). In this case, the voice of the original content may also be output, and the playback speed of the voice data of the original content may or may not be modified according to a user command or a preset condition.

On the other hand, if the translated text and the original text have the same length (S930-N, S960-N), the original video may be output as it is together with the translated text (S890). In this case, the original voice may also be output.
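For the subtitle case, the text length has to be converted into a time before it can be compared with the section length. The sketch below assumes a fixed reading rate in characters per second; the rate of 15.0, the ratio formula, and the name `subtitle_playback_speed` are all hypothetical and are not taken from the disclosure.

```python
def subtitle_playback_speed(section_seconds: float,
                            translated_text: str,
                            chars_per_second: float = 15.0) -> float:
    """Return a speed factor so the section lasts as long as the reading time.

    chars_per_second is an assumed reading rate used to turn the text
    length into a time.
    """
    if section_seconds <= 0 or chars_per_second <= 0:
        raise ValueError("section length and reading rate must be positive")
    if not translated_text:
        # Nothing to read: keep the original playback speed.
        return 1.0
    reading_seconds = len(translated_text) / chars_per_second
    # Longer text gives a factor below 1.0 (slower); shorter text, above 1.0.
    return section_seconds / reading_seconds
```

Under these assumptions, a 60-character subtitle over a 2-second section yields a factor of 0.5, while a 30-character subtitle over the same section leaves the speed unchanged.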

Meanwhile, although the case of providing a translated voice and the case of providing a translated text have been described separately with reference to FIGS. 8 and 9, a translated voice and a translated text may of course be provided at the same time. In this case, it is desirable to adjust the playback speed of the content video according to the playback time of the translated voice, but the disclosure is not necessarily limited thereto.

The control method of the electronic device described above with reference to FIGS. 6 to 9 may be performed by the electronic device 100 illustrated and described with reference to FIGS. 2 and 3. Alternatively, it may be performed through a system including the electronic device 100 and one or more external devices.

Meanwhile, the various embodiments described above may be implemented in a recording medium that can be read by a computer or a similar device by using software, hardware, or a combination thereof.

According to a hardware implementation, the embodiments described in the present disclosure may be implemented using at least one of application specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field programmable gate arrays (FPGAs), processors, controllers, micro-controllers, microprocessors, and electrical units for performing other functions.

In some cases, the embodiments described herein may be implemented by the processor 130 itself. According to software implementation, embodiments such as procedures and functions described herein may be implemented as separate software modules. Each of the above-described software modules may perform one or more functions and operations described herein.

Meanwhile, computer instructions for performing a processing operation in the electronic device 100 according to the various embodiments of the present disclosure described above may be stored in a non-transitory computer-readable medium. When a computer instruction stored in such a non-transitory computer-readable medium is executed by a processor of a specific device, it causes the specific device to perform the processing operation of the electronic device 100 according to the various embodiments described above.

The non-transitory computer-readable medium refers to a medium that stores data semi-permanently and can be read by a device, rather than a medium that stores data for a short moment, such as registers, caches, and memory. Specific examples of non-transitory computer-readable media may include CD, DVD, hard disk, Blu-ray disk, USB, memory card, ROM, and the like.

In the above, preferred embodiments of the present disclosure have been illustrated and described. However, the present disclosure is not limited to the specific embodiments described above, and various modifications may be made by those of ordinary skill in the technical field to which the disclosure belongs without departing from the gist of the disclosure as claimed in the claims. Such modifications should not be understood separately from the technical idea or perspective of the present disclosure.

Claims

An electronic device comprising:

a communication interface including circuitry;

a memory including at least one instruction; and

a processor connected to the communication interface and the memory to control the electronic device,

wherein the processor, by executing the at least one instruction, is configured to:

receive content through the communication interface,

acquire text data of a second language based on voice data of a first language or subtitle data of the first language included in the content, and

output the content based on a playback speed corresponding to the length of the acquired text data of the second language.

The electronic device of claim 1,

wherein the processor is configured to:

acquire text data of the first language based on voice data of the first language corresponding to a first section of the content or subtitle data of the first language corresponding to the first section, and

translate the acquired text data of the first language to acquire the text data of the second language.

The electronic device of claim 2,

wherein the processor is configured to:

determine a difference between the length of the text data of the first language and the length of the text data of the second language, and

if the determined difference is greater than or equal to a threshold value, translate the acquired text data of the first language again to acquire other text data of the second language.

The electronic device of claim 1,

wherein the processor is configured to,

based on an input user command, output the text data of the second language in a subtitle format or output voice data converted from the text data of the second language in a voice format.

The electronic device of claim 2,

wherein the processor is configured to:

output the content at a playback speed slower than the original playback speed when the time corresponding to the length of the text data of the second language is longer than the time corresponding to the first section, and output the content at a playback speed faster than the original playback speed when the time corresponding to the length of the text data of the second language is shorter than the time corresponding to the first section, and

output the text data of the second language together with the content in a subtitle format.

The electronic device of claim 2,

wherein the processor is configured to:

convert the text data of the second language to acquire voice data of the second language,

output the content at a playback speed slower than the original playback speed when the playback time of the voice data of the second language is longer than the time corresponding to the first section, and output the content at a playback speed faster than the original playback speed when the playback time of the voice data of the second language is shorter than the time corresponding to the first section, and

output a voice corresponding to the voice data of the second language together with the content.

The electronic device of claim 1,

wherein the processor is configured to:

determine characteristics of a speaker in the content based on image data included in the content or the voice data of the first language included in the content,

convert the text data of the second language into voice data corresponding to the determined characteristics of the speaker, and

output a voice corresponding to the voice data together with the content.

The electronic device of claim 1,

wherein the processor is configured to:

identify a type of the content,

if the identified type is a preset first type, output the content based on the playback speed corresponding to the length of the acquired text data of the second language, and

if the identified type is a preset second type, output the content at an original playback speed.

The electronic device of claim 1,

wherein the processor is configured to:

identify whether a character (a person appearing on screen) is included in image data of the content corresponding to the voice data of the first language or the subtitle data of the first language, and

when a character is included in the image data, output the content at a playback speed within a preset range from an original playback speed.

A control method of an electronic device, the control method comprising:

acquiring text data of a second language based on voice data of a first language or subtitle data of the first language included in input content; and

outputting the content based on a playback speed corresponding to the length of the acquired text data of the second language.

The control method of claim 10,

wherein the acquiring of the text data of the second language comprises:

acquiring text data of the first language based on voice data of the first language corresponding to a first section of the content or subtitle data of the first language corresponding to the first section; and

translating the acquired text data of the first language to acquire the text data of the second language.

The control method of claim 11, further comprising:

determining a difference between the length of the text data of the first language and the length of the text data of the second language; and

if the determined difference is greater than or equal to a threshold value, translating the acquired text data of the first language again to acquire other text data of the second language.

The control method of claim 10, further comprising:

based on an input user command, outputting the text data of the second language in a subtitle format or outputting voice data converted from the text data of the second language in a voice format.

The control method of claim 11,

wherein the outputting of the content comprises:

outputting the content at a playback speed slower than the original playback speed when the time corresponding to the length of the text data of the second language is longer than the time corresponding to the first section, and outputting the content at a playback speed faster than the original playback speed when the time corresponding to the length of the text data of the second language is shorter than the time corresponding to the first section; and

outputting the text data of the second language together with the content in a subtitle format.

The control method of claim 11, further comprising:

converting the text data of the second language to acquire voice data of the second language,

wherein the outputting of the content comprises:

outputting the content at a playback speed slower than the original playback speed when the playback time of the voice data of the second language is longer than the time corresponding to the first section, and outputting the content at a playback speed faster than the original playback speed when the playback time of the voice data of the second language is shorter than the time corresponding to the first section; and

outputting a voice corresponding to the voice data of the second language together with the content.