CN117812434A - Display device and translation method of media file

Display device and translation method of media file

Info

Publication number
CN117812434A
Authority
CN
China
Prior art keywords
audio
text
translation
file
stream
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310553341.0A
Other languages
Chinese (zh)
Inventor
杜娟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Vidaa Netherlands International Holdings BV
Original Assignee
Vidaa Netherlands International Holdings BV
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Vidaa Netherlands International Holdings BV filed Critical Vidaa Netherlands International Holdings BV
Priority to CN202310553341.0A priority Critical patent/CN117812434A/en
Publication of CN117812434A publication Critical patent/CN117812434A/en
Pending legal-status Critical Current

Landscapes

  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The application provides a display device and a translation method for media files. When a media file is played, the display device detects the file category of the media file in response to an event or state indicating that the automatic translation function is turned on; it decodes and splits a video file to generate a first pure audio stream, a pure image stream and a pure subtitle stream, and decodes and splits an audio file to generate a second pure audio stream, a lyric stream and a poster. If the data to be translated is the first pure audio stream or the second pure audio stream, the audio in the audio data source is converted into semantic text and the semantic text is translated into translation text in the target language; if the data to be translated is a text file, a pure subtitle stream or a lyric stream, the text data source is translated into translation text in the target language. The controller controls the display to present the image data source on the user interface and outputs the translation text in the user interface by voice and/or text. The display device thus addresses the problems that a media file in a foreign language may not be translatable at all and that its translation accuracy is low.

Description

Display device and translation method of media file
Technical Field
The present disclosure relates to the field of display devices, and in particular, to a display device and a method for translating media files.
Background
When a user views a media file on a display device such as a television, if the media file is in a foreign language or another language the user cannot directly understand, the display device cannot translate it into the user's native language or a language the user is familiar with. The user therefore cannot directly follow the specific content being played.
To learn the content of a media file in a foreign or other language, the user may, in one possible approach, go to a theater to watch a translated version or search for one on the web. However, the media files shown in theaters are often incomplete, and even when a translated version exists, its showing times are limited. When searching the web for a translated version of a media file in a foreign or other language, the version may simply not be found, leaving the specific content of the media file unknown.
In another possible approach, media files in a foreign or other language may be translated by translation software or translation tools. However, such tools translate only the surface meaning of the language and lack analysis of the specific scene; moreover, they are limited to text-to-text translation and cannot translate speech into text. As a result, some media files cannot be translated at all, and translation accuracy is low.
Disclosure of Invention
Some embodiments of the present application provide a display device and a method for translating a media file, so as to solve the problems that a media file in a foreign language, or in another language version the user cannot directly understand, may not be translatable at all or may be translated with low accuracy, and that searching for a version subtitled in a language the user understands wastes time and effort.
In a first aspect, some embodiments of the present application provide a display device, including:
a display configured to display a user interface;
a controller configured to:
detecting a file category of a media file in response to an event or state in which the automatic translation function is turned on while the media file is played; the file categories comprise video files, audio files, picture files and text files;
if the file category is detected to be a video file, performing video decoding on the video file, and performing stream splitting on the decoded video file to generate a first pure audio stream, a pure image stream and a pure subtitle stream of the media file; if the file category is detected to be an audio file, performing audio decoding on the audio file, and performing stream splitting on the decoded audio file to generate a second pure audio stream, a lyric stream and a poster of the audio file;
if the data category is detected to be the first pure audio stream in the video file, or the second pure audio stream in the audio file, converting the audio in an audio data source into semantic text and translating the semantic text into translation text in a target language; the audio data source comprises the first pure audio stream or the second pure audio stream;
if the file category is detected to be the text file, or the data category is the pure subtitle stream in the video file or the lyric stream in the audio file, translating a text data source into translation text in the target language; the text data source comprises the text file, the pure subtitle stream or the lyric stream;
controlling the display to display an image data source in the user interface, wherein the image data source comprises the picture file, the pure image stream or the poster, and outputting the translation text in the user interface by voice and/or text.
In a second aspect, some embodiments of the present application provide a method for translating a media file, which may be applied to the display device of the first aspect, where the display device includes a display and a controller, and the method for translating a media file includes:
detecting a file category of a media file in response to an event or state in which the automatic translation function is turned on while the media file is played; the file categories comprise video files, audio files, picture files and text files;
if the file category is detected to be a video file, performing video decoding on the video file, and performing stream splitting on the decoded video file to generate a first pure audio stream, a pure image stream and a pure subtitle stream of the media file; if the file category is detected to be an audio file, performing audio decoding on the audio file, and performing stream splitting on the decoded audio file to generate a second pure audio stream, a lyric stream and a poster of the audio file;
if the data category is detected to be the first pure audio stream in the video file, or the second pure audio stream in the audio file, converting the audio in an audio data source into semantic text and translating the semantic text into translation text in a target language; the audio data source comprises the first pure audio stream or the second pure audio stream;
if the file category is detected to be the text file, or the data category is the pure subtitle stream in the video file or the lyric stream in the audio file, translating a text data source into translation text in the target language; the text data source comprises the text file, the pure subtitle stream or the lyric stream;
controlling the display to display an image data source in the user interface, wherein the image data source comprises the picture file, the pure image stream or the poster, and outputting the translation text in the user interface by voice and/or text.
According to the technical solutions above, some embodiments of the application provide a display device and a method for translating media files. When a media file is played, the display device detects its file category in response to an event or state in which the automatic translation function is turned on. If the file category is a video file, the display device performs video decoding on it and splits the decoded video file into a first pure audio stream, a pure image stream and a pure subtitle stream; if the file category is an audio file, it performs audio decoding on it and splits the decoded audio file into a second pure audio stream, a lyric stream and a poster. If the data category is detected to be the first pure audio stream in the video file or the second pure audio stream in the audio file, the audio in the audio data source is converted into semantic text, and the semantic text is translated into translation text in the target language; if the file category is detected to be a text file, or the data category is a pure subtitle stream in a video file or a lyric stream in an audio file, the text data source is translated into translation text in the target language. The controller controls the display to present the image data source in the user interface and outputs the translation text in the user interface by voice and/or text. The display device thus solves the problems that a media file in a foreign language, or in another language the user cannot directly understand, may not be translatable at all or may be translated with low accuracy, and that searching for a version with subtitles the user can understand wastes time and effort.
Drawings
In order to more clearly illustrate some embodiments of the present application or the technical solutions in the prior art, the drawings needed for the embodiments are briefly described below. Obviously, the drawings in the following description show only some embodiments of the present application, and a person skilled in the art may derive other drawings from them without inventive effort.
Fig. 1 is a schematic diagram of an operation scenario between a display device and a control device provided in some embodiments of the present application;
FIG. 2 is a block diagram of a hardware configuration of a display device provided in some embodiments of the present application;
FIG. 3 is a block diagram of a hardware configuration of a control device provided in some embodiments of the present application;
fig. 4 is a schematic diagram of software configuration in a display device according to some embodiments of the present application;
FIG. 5 is a general schematic flowchart of a display device performing media file translation according to some embodiments of the present application;
FIG. 6 is a flow chart illustrating a display device performing translation of a media file according to some embodiments of the present application;
FIG. 7 is a flowchart illustrating setting of translation categories by a display device according to some embodiments of the present application;
FIG. 8 is a schematic diagram of a translation class selection interface provided in some embodiments of the present application;
FIG. 9 is a flowchart of a display device determining a target language according to some embodiments of the present application;
FIG. 10 is a flow chart of a display device according to some embodiments of the present application for translating an audio data source into translated text in a target language;
fig. 11 is a schematic flow chart of extracting semantic audio information from effective audio information by a display device according to some embodiments of the present application;
fig. 12 is a schematic diagram of the effect of a display device displaying a reminder that audio does not meet the translation standard, according to some embodiments of the present application;
fig. 13 is a schematic diagram of the effect of a display device displaying a reminder that no voice information is present, according to some embodiments of the present application;
fig. 14 is a schematic flow chart of dividing semantic audio information into audio sentences with preset lengths by a display device according to some embodiments of the present application;
FIG. 15 is a flow chart of a display device according to some embodiments of the present application for translating a text data source into translated text in a target language;
FIG. 16 is a framework overview of a display device performing a media file translation method according to some embodiments of the present application;
FIG. 17 is a schematic flow chart of a display device according to some embodiments of the present application outputting translated text in a user interface in a speech and/or text manner;
Fig. 18 is a flowchart illustrating a method for translating a media file according to some embodiments of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of some embodiments of the present application more clear, the technical solutions of some embodiments of the present application will be clearly and completely described below with reference to specific embodiments of the present application and corresponding drawings. It will be apparent that the described embodiments are only some, but not all, of the embodiments of the present application.
It should be noted that the brief description of the terms in some embodiments of the present application is only for convenience in understanding the embodiments described below, and is not intended to limit the implementation of some embodiments of the present application. Unless otherwise indicated, these terms should be construed in their ordinary and customary meaning.
The terms first, second, third and the like in the description, in the claims and in the above-described figures are used for distinguishing between similar or identical objects or entities, and not necessarily for describing a particular sequential or chronological order, unless otherwise indicated. It is to be understood that the terms so used are interchangeable under appropriate circumstances.
The terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a product or apparatus that comprises a list of elements is not necessarily limited to all elements explicitly listed, but may include other elements not expressly listed or inherent to such product or apparatus.
The term "module" refers to any known or later developed hardware, software, firmware, artificial intelligence, fuzzy logic, or combination of hardware or/and software code that is capable of performing the function associated with that element.
Fig. 1 is a schematic diagram of an operation scenario between a display device and a control device according to some embodiments of the present application. As shown in fig. 1, a user may operate the display device 200 through the mobile terminal 300 and the control device 100.
In some embodiments, a software application matching the display device 200 may be installed on the mobile terminal 300, and connection and communication may be implemented through a network communication protocol, achieving one-to-one control operation and data communication. Audio and video content displayed on the mobile terminal 300 can also be transmitted to the display device 200, realizing a synchronous display function.
As also shown in fig. 1, the display device 200 is also in data communication with the server 400 via various communication means. The display device 200 may be allowed to establish communication connections via a Local Area Network (LAN), a Wireless Local Area Network (WLAN), and other networks.
In addition to the broadcast receiving television function, the display apparatus 200 may additionally provide smart network television functions with computer support, including, but not limited to, network television, smart television, Internet Protocol Television (IPTV), and the like.
Fig. 2 is a block diagram of a hardware configuration of the display device 200 of fig. 1 provided in some embodiments of the present application.
In some embodiments, display apparatus 200 includes at least one of a modem 210, a communicator 220, a detector 230, an external device interface 240, a controller 250, a display 260, an audio output interface 270, memory, a power supply, a user interface.
In some embodiments, the detector 230 is used to collect signals from the external environment or from interaction with the outside. For example, the detector 230 includes a light receiver, a sensor for capturing the intensity of ambient light; or an image collector, such as a camera, which may be used to collect external environment scenes, user attributes, or user interaction gestures; or a sound collector, such as a microphone, for receiving external sounds.
In some embodiments, the display 260 includes a display screen component for presenting pictures and a driving component for driving image display, and is used for receiving image signals output by the controller and displaying video content, image content, menu manipulation interfaces, user manipulation UI interfaces, and the like.
In some embodiments, communicator 220 is a component for communicating with external devices or servers 400 according to various communication protocol types.
In some embodiments, the controller 250 includes a processor, a video processor, an audio processor, a graphics processor, a RAM, a ROM, and first to nth interfaces for input/output. The controller 250 controls the operation of the display device and responds to user operations through various software control programs stored in the memory, and controls the overall operation of the display apparatus 200.
In some embodiments, the controller 250 and the modem 210 may be located in separate devices, i.e., the modem 210 may also be located in an external device to the main device in which the controller 250 is located, such as an external set-top box or the like.
In some embodiments, a user may input a user command through a Graphical User Interface (GUI) displayed on the display 260, and the user input interface receives the user input command through the GUI.
In some embodiments, user interface 280 is an interface that may be used to receive control inputs.
Fig. 3 is a block diagram of a hardware configuration of the control device in fig. 1 according to some embodiments of the present application. As shown in fig. 3, the control device 100 includes a controller 110, a communication interface 130, a user input/output interface, a memory, and a power supply.
The control device 100 is configured to control the display device 200: it receives the user's input operation instructions and converts them into instructions that the display device 200 can recognize and respond to, acting as an intermediary between the user and the display device 200.
In some embodiments, the control device 100 may be a smart device. Such as: the control apparatus 100 may install various applications for controlling the display apparatus 200 according to user's needs.
In some embodiments, as shown in fig. 1, a mobile terminal 300 or other intelligent electronic device may function similarly to the control device 100 after installing an application that manipulates the display device 200.
The controller 110 includes a processor 112, a RAM 113, a ROM 114, a communication interface 130, and a communication bus. The controller 110 is used to control the running and operation of the control device 100, the communication cooperation among its internal components, and the external and internal data processing functions.
The communication interface 130 enables communication of control signals and data signals with the display device 200 under the control of the controller 110. The communication interface 130 may include at least one of a WiFi chip 131, a bluetooth module 132, an NFC module 133, and other near field communication modules.
A user input/output interface 140, wherein the input interface includes at least one of a microphone 141, a touchpad 142, a sensor 143, keys 144, and other input interfaces.
In some embodiments, the control device 100 includes at least one of a communication interface 130 and an input-output interface 140. The control device 100 is provided with a communication interface 130, such as a WiFi, Bluetooth, or NFC module, which may encode a user input instruction according to the WiFi, Bluetooth, or NFC protocol and send it to the display device 200.
A memory 190 for storing various operation programs, data and applications for driving and controlling the control device 100 under the control of the controller. The memory 190 may store various control signal instructions input by a user.
A power supply 180 for providing operating power support for the various elements of the control device 100 under the control of the controller.
Fig. 4 is a schematic software configuration diagram of the display device in fig. 1 provided in some embodiments of the present application, and in some embodiments, the system of the display device 200 may be divided into three layers, namely, an application layer, a middleware layer, and a hardware layer from top to bottom.
The application layer mainly comprises the common applications on the television and an application framework (Application Framework). The common applications are mainly applications developed based on a browser, such as HTML5 apps, and native applications (Native APPs).
The application framework (Application Framework) is a complete program model with all the basic functions required by standard application software, such as file access and data exchange, and the interfaces for using these functions (toolbar, status bar, menu, dialog box).
Native applications (Native APPs) may support online or offline operation, message pushing, or local resource access.
The middleware layer includes middleware such as various television protocols, multimedia protocols, and system components. The middleware can use basic services (functions) provided by the system software to connect various parts of the application system or different applications on the network, so that the purposes of resource sharing and function sharing can be achieved.
The hardware layer mainly comprises the HAL interface, hardware, and drivers. The HAL interface is a unified interface to which all the television chips dock; the specific logic is implemented by each chip. The drivers mainly comprise: the audio driver, display driver, Bluetooth driver, camera driver, WiFi driver, USB driver, HDMI driver, sensor drivers (e.g., fingerprint sensor, temperature sensor, pressure sensor), power supply driver, and the like.
Based on the display device 200 described above, media files such as video and audio may be played through the display device 200. When a user views a media file on a display device such as a television, if the media file is in a foreign language or another language the user cannot directly understand, the display device cannot translate it into the user's native language or a familiar language, so the user cannot directly follow the specific content being played.
To understand the content of media files in a foreign language or another language the user cannot directly understand, the user may, in one possible approach, go to a theater to watch a translated version or search for one on the web. However, the media files shown in theaters are often incomplete, and even when a translated version exists, its showing times are limited. When searching the web, a translated version may not be found, leaving the specific content of the media file unknown.
For example, for a newly released movie in a foreign or other language, the user can only buy a ticket and watch it in a movie theater, because only the theater can provide the officially translated Chinese version. If tickets are sold out, or the theater does not schedule the new movie during the user's free time, the user has to look for a translated version on the web. Finding a clear native-language version takes a lot of time and effort, and it may not be found at all. The user is then still unable to directly understand the specific content of the new movie.
In another possible approach, media files in a foreign language or another language the user cannot directly understand may be translated by translation software or translation tools. However, such tools translate only the surface meaning of the language and lack analysis of the specific scene, so the translation accuracy is low.
For example, when a user is on a business trip or traveling, the television being watched is a local station, and the user cannot directly understand what is being said in the local language. This is especially true for people entirely unfamiliar with the foreign or other language: programs and media in that language mean nothing to them. Even a person with some knowledge of the language may be unable to grasp its true meaning because of unfamiliarity with local cultural differences, local speech habits, and the like. A translation tool such as translation software can only translate text in one language into text in another; lacking the semantic scene, its results may be inaccurate, i.e., the translation accuracy is low. Moreover, such tools are limited to text-to-text translation and cannot translate speech into text, so some media files cannot be translated at all.
To solve the problems that a media file in a foreign language, or in another language version the user cannot directly understand, may not be translatable at all or may be translated with low accuracy, and that finding a video subtitled in a language the user understands takes time and effort, some embodiments of the present application provide a display device 200. The display device 200 includes a controller 250 and a display 260 for displaying a user interface. According to the user's settings, the display device 200 can automatically translate media content in a language the user does not understand into voice or text in a target language the user does understand, and display it synchronously with the played media file. In this way, the display device 200 automatically translates the language of any foreign-language program being played into the target language, so the user can quickly understand the content of the displayed media file without resorting to a translation tool or searching the web for a source in the target language. Meanwhile, the user can choose different output forms for different scenarios.
FIG. 5 is a schematic flowchart of a display device performing media file translation according to some embodiments of the present application. As shown in fig. 5, in some embodiments, a function control for turning the translation function on or off may be provided in the display device 200. The function control controls whether the automatic translation function is started or stopped: when the control is turned on, the media file is prepared for translation; when it is turned off, translation of the media file stops. The function control may also govern each function module that executes the automatic translation function. Illustratively, when the function control is turned off, the user interfaces of the function modules are closed; when it is turned on, each module's settings may take their default values or the values set the last time the control was on. When the automatic translation function is turned on, each function module defaults to the on state.
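As a non-limiting illustration of this on/off behavior, the following Python sketch models a function control that restores either the defaults or the last-used values when turned on; all field names and default values are assumptions rather than details taken from the embodiments.

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class AutoTranslateControl:
    """Sketch of the function control: turning it on restores either the
    defaults or the values set the last time it was on; turning it off
    remembers the current settings.  Field names are illustrative."""
    enabled: bool = False
    defaults: Dict[str, str] = field(
        default_factory=lambda: {"category": "Audio & Text",
                                 "output": "voice+text"})
    last_values: Dict[str, str] = field(default_factory=dict)

    def turn_on(self) -> Dict[str, str]:
        self.enabled = True
        return dict(self.last_values) if self.last_values else dict(self.defaults)

    def turn_off(self, current: Dict[str, str]) -> None:
        self.enabled = False
        self.last_values = dict(current)   # remembered for the next turn-on
```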
The automatic translation of a media file can be carried out along the following lines: first set the translation category, then select the target language, and then select the output mode for the target language, so that the media file to be translated is output in the target language through the output mode the user selected.
Illustratively, the user sets the translation category of the media file. The translation categories may include a translated-audio category, a translated-text category, and an audio-plus-text translation category. When setting the translation category, an interface may be provided on which the user selects a translation mode: for example, translating only the audio of the media file, translating only the text, or translating audio and text simultaneously, so that different translation modes produce different output states. This is described in detail below.
After setting the translation category, the user can select the target language into which the media file is to be translated, that is, the language into which the user wants the foreign-language information heard or seen to be rendered. The display device 200 may create a list of candidate languages, TargetLanguageList, supported by the device according to its actual situation. The user may select one or more languages from the TargetLanguageList as target languages for translation. Other factors may also be taken into account, such as the speed of translating the media file's source language into each candidate language, the usage frequency of each candidate language, the user's preference priority for each candidate language, and the average accuracy of translating the source language into each candidate language; the target language is then selected according to one or more of these factors, as described in detail below.
After the target language is determined, an output mode for presenting it to the user may be set. The output modes of the target language may include, for example, a voice output mode, a text output mode, and a voice-plus-text output mode. In the voice output mode, the result of translating the media file into the target language is output only as speech, i.e., the final translation is broadcast by voice. Because voice playback occupies resources such as the playback audio of the display device terminal, in this case the user is prompted that the original audio of the media file will be replaced by audio in the target language. In the text output mode, the result of translating the media file into the target language is presented only as text, not as speech; for this scenario, the user can set the display position of the text, and the device can determine whether the original media file already has subtitles, as described later. The voice-plus-text output mode selects both output modes at once: the audio of the original media file is replaced by audio in the target language, and the text content can be configured according to the actual situation, as described in detail later.

It should be noted that fig. 5 is a general schematic diagram of the automatic translation process, and each part of fig. 5 is described in detail below with reference to specific embodiments. To facilitate understanding of the technical solutions in some embodiments of the present application, each step is detailed below with reference to specific embodiments and the accompanying drawings.

FIG. 6 is a schematic flowchart of a display device performing media file translation according to some embodiments of the present application. As shown in fig. 6, when the display device 200 performs translation of a media file, the following steps S1-S5 may be included:
Step S1: in response to an event or state in which the automatic translation function is turned on while a media file is played, the display device 200 detects the file category of the media file.
To detect the on or off state of the automatic translation function for a media file, an event that the function is turned on, or its corresponding state, may be detected when the media file is opened. The display device 200 may then send a broadcast notification to the application that is playing, alerting it that the automatic translation function is about to be turned on. Once the function is on, the display device 200 may begin detecting the file category of the media file so as to apply different translation processing to different file categories.
In some embodiments, the file categories include video files, audio files, picture files, and text files. It should be noted that, to achieve a better translation speed and a more accurate result, the display device 200 may be connected to a network so as to obtain more network resources during translation, thereby improving translation speed and accuracy. After step S1 is completed, the following step S2 may be executed.
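As a non-limiting illustration of step S1, the following Python sketch classifies a media file into one of the four file categories by its extension; the extension tables are assumptions, and a real implementation would more likely inspect container headers or MIME types rather than trust the suffix.

```python
from enum import Enum, auto
from pathlib import Path

class FileCategory(Enum):
    VIDEO = auto()
    AUDIO = auto()
    PICTURE = auto()
    TEXT = auto()
    UNKNOWN = auto()

# Illustrative extension tables only.
_EXTENSIONS = {
    FileCategory.VIDEO:   {".mp4", ".mkv", ".avi", ".ts"},
    FileCategory.AUDIO:   {".mp3", ".aac", ".flac", ".wav"},
    FileCategory.PICTURE: {".jpg", ".png", ".bmp"},
    FileCategory.TEXT:    {".txt", ".srt", ".lrc"},
}

def detect_file_category(path: str) -> FileCategory:
    """Map a media file path to one of the four categories of step S1."""
    suffix = Path(path).suffix.lower()
    for category, suffixes in _EXTENSIONS.items():
        if suffix in suffixes:
            return category
    return FileCategory.UNKNOWN
```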
Step S2: if the file category is detected to be a video file, the display apparatus 200 performs video decoding on the video file and splits the decoded video file into a first pure audio stream, a pure image stream and a pure subtitle stream of the media file; if the file category is detected to be an audio file, it performs audio decoding on the audio file and splits the decoded audio file into a second pure audio stream, a lyric stream and a poster of the audio file.
To enable the video file to be played on the display device, in some embodiments the display device 200 may perform video decoding on the video file and, after decoding is complete, split the decoded video file into a first pure audio stream, a pure image stream and a pure subtitle stream of the media file. In this way, the display device 200 may apply different translation processing to different types of media files.
For example, a media audio analysis module may be provided in the display device 200. When the media file opened by the user is a video file, the module is invoked to decode and split the video file and extract its audio, so that a complete first pure audio stream, pure image stream and pure subtitle stream are extracted from the video file. Similarly, the display device 200 may decode an audio file and, after decoding, split it into a second pure audio stream, a lyric stream and a poster. It should be noted that the pure image stream of a video file differs from the poster of an audio file: as fig. 5 shows, the pure image stream consists of frame-by-frame images, while the poster is a single image.
With continued reference to fig. 5, in some embodiments both video files and audio files are decoded and split to obtain pure audio streams; for ease of distinction, the pure audio stream split from a video file is called the first pure audio stream and the one split from an audio file the second pure audio stream. That is, in the embodiments of the present application a media file is divided into three kinds of data sources: first, text data sources, including text files, the pure subtitle stream of a video file, and the lyric stream of an audio file; second, audio data sources, comprising the pure audio streams of video files and of audio files; and third, image data sources, including the pure image stream of a video file, the poster of an audio file, and picture files. Different translation processing can then be applied to different data sources to improve the translation efficiency of the display device 200.
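The three-way taxonomy above can be sketched as follows; this is a minimal Python illustration, and the `demuxed` dictionary and its keys stand in for whatever the platform's actual demuxer returns.

```python
from dataclasses import dataclass
from typing import Any, List

@dataclass
class DataSources:
    """The three data-source groups a media file is split into."""
    text: List[Any]    # text file content, subtitle entries, or lyric lines
    audio: List[Any]   # first (video) or second (audio) pure audio stream
    image: List[Any]   # frame-by-frame images, a poster, or picture files

def group_streams(category: str, demuxed: dict) -> DataSources:
    """Route demuxed streams into the three groups described above."""
    if category == "video":
        return DataSources(text=demuxed["subtitles"],
                           audio=demuxed["audio"],    # first pure audio stream
                           image=demuxed["frames"])
    if category == "audio":
        return DataSources(text=demuxed["lyrics"],
                           audio=demuxed["audio"],    # second pure audio stream
                           image=[demuxed["poster"]])
    raise ValueError(f"unsupported category: {category}")
```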
To translate a text data source such as a text file, a pure subtitle stream or a lyric stream, a text translation module may be provided in the display device 200 to translate the text data source into translation text in the target language. In some embodiments, when the second pure audio stream of an audio file is translated, both the second pure audio stream and the corresponding lyric stream carry presentation time stamps (PTS), and within the same or a similar time range the audio and the lyrics are essentially synchronous and synonymous. Therefore, after the text translation module has translated both, the translation of the lyric stream can be used for semantic error correction of the translation of the second pure audio stream, improving its accuracy. The poster is a single image without a time stamp and needs no error correction; the display device 200 can simply display it. Similarly, the first pure audio stream of a video file is essentially synchronous and synonymous with its subtitle stream over the corresponding time range, so the translation of the subtitle stream can likewise be used for semantic error correction of the translation of the first pure audio stream, improving its accuracy.
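A minimal sketch of the PTS-based semantic error correction described above follows; for brevity a matching subtitle translation simply replaces the speech-derived one, whereas the embodiments describe correcting it, and the 500 ms tolerance is an assumed value.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class TimedText:
    pts_ms: int   # presentation time stamp (PTS) of the segment
    text: str     # translated text of the segment

def correct_audio_translation(audio_items: List[TimedText],
                              subtitle_items: List[TimedText],
                              tolerance_ms: int = 500) -> List[str]:
    """For each speech-derived translation, find the subtitle/lyric entry
    whose PTS lies within the tolerance and let it stand in for the
    corrected text; a real corrector would merge the two translations."""
    corrected = []
    for item in audio_items:
        nearby = [s for s in subtitle_items
                  if abs(s.pts_ms - item.pts_ms) <= tolerance_ms]
        best: Optional[TimedText] = min(
            nearby, key=lambda s: abs(s.pts_ms - item.pts_ms), default=None)
        corrected.append(best.text if best else item.text)
    return corrected
```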
FIG. 7 is a schematic flowchart of a display device setting the translation category according to some embodiments of the present application. As shown in fig. 7, the display device 200 may set the translation category of the media file. In some embodiments, the translation categories include a translated-audio category, a translated-text category, and an audio-plus-text translation category. After the category is set, the display device 200 detects the translation category entered by the user. If the category is the translated-audio category, the display device 200 translates the audio data source of the media file into translation text in the target language; if it is the translated-text category, it translates the text data source of the media file into translation text in the target language; if it is the audio-plus-text category, it translates both the audio data source and the text data source of the media file into translation text in the target language.
For example, FIG. 8 is a schematic diagram of a translation category selection interface provided in some embodiments of the present application. As shown in fig. 8, the interface may offer three translation modes: Audio Only, Text Only, and Audio & Text. Audio Only means translating only the audio of the media file: the display device 200 translates the audio data source into translation text in the target language. Text Only means translating only the text of the media file: the display device 200 translates the text data source into translation text in the target language. Audio & Text means translating audio and text simultaneously: the display device 200 translates both the audio data source and the text data source into translation text in the target language.
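As a non-limiting illustration, the three options of the selection interface can be mapped to the data sources to translate as follows; the `sources` layout is an assumption.

```python
from enum import Enum
from typing import Any, Dict, List

class TranslationCategory(Enum):
    AUDIO_ONLY = "Audio Only"
    TEXT_ONLY = "Text Only"
    AUDIO_AND_TEXT = "Audio & Text"

def sources_for_category(category: TranslationCategory,
                         sources: Dict[str, Any]) -> List[Any]:
    """Pick the data sources fed to the translator, mirroring the three
    options on the selection interface of fig. 8."""
    if category is TranslationCategory.AUDIO_ONLY:
        return [sources["audio"]]
    if category is TranslationCategory.TEXT_ONLY:
        return [sources["text"]]
    return [sources["audio"], sources["text"]]
```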
To determine the target language, the display device 200 may execute the following procedure. FIG. 9 is a schematic flowchart of a display device determining the target language according to some embodiments of the present application. As shown in fig. 9, the display device 200 may first create the list of candidate languages it supports, then obtain the candidate languages the user selects from the list and count them. If the number of selected candidate languages equals 1, that language is determined to be the target language. If it is greater than 1, the device may calculate the speed of translating the media file's source language into each candidate language, the usage frequency of each candidate language, the user's preference priority for each candidate language, and the average accuracy of translating the source language into each candidate language, and screen the target language according to one or more of translation speed, usage frequency, user preference priority, and average translation accuracy.
For example, a target language setting module may be provided in the display device 200 for the user to set a default target language, i.e., the language into which the user wants the foreign-language information heard or seen to be translated. The display device 200 may create the list of candidate languages, TargetLanguageList, it supports according to its actual situation, and the user may select one or more languages from the TargetLanguageList as target languages for translation. It will be appreciated that the selected target language should be one the user commands well, at least well enough to understand by listening or reading; whether the user can speak or write it does not matter. In other words, the target language only places requirements on the user's receptive abilities (listening and reading), not on the productive abilities (speaking and writing).
When the target language is determined from the candidate language list: if the user selected only one candidate language, that language is the target language, and translation always goes directly into it. If the user selected more than one, the translation speed into each candidate language, the usage frequency of each, the user's preference priority for each, and the average translation accuracy into each may be calculated, and the target language screened according to one or more of translation speed, usage frequency, user preference priority, and average translation accuracy.
For example, when calculating the speed of translating the source language into each candidate language, the time each translation takes may be compared: a short translation time means a faster speed, and a long time a slower one. If the user considers translation speed most important, the target language may be determined by it.
When the usage frequency of a candidate language is calculated, the more often it has been used over a period of time, the stronger the user's preference for it, and the more likely it is to be used again. By setting a start time from which uses are counted, the usage frequency of each candidate language can be computed, so that when the user takes usage frequency as the criterion for selecting the target language, it can serve as the basis of selection.
For preference priority, the user may set a preference priority for the candidate languages in the display device 200, expressing how strongly the user prefers each one; when the user takes preference priority as the criterion for selecting the target language, it serves as the basis of selection.
When calculating the average accuracy of translating the source language into each candidate language, the accuracy of every translation run can be recorded and then averaged. In some embodiments, each translation run feeds back the accuracy achieved for the chosen language; after several runs, an average accuracy is obtained for that language. The higher the average accuracy, the more likely the system recommends the language and the more acceptable it is to the user. For example, suppose the user's preference order is TL1 Chinese, TL2 English, TL3 Spanish, and the display device 200 achieves average accuracies of 80%, 85% and 95% for TL1, TL2 and TL3 respectively. If the user selects preference priority as the criterion, Chinese is preferred; if the user selects average translation accuracy as the criterion, TL3 Spanish is preferred instead. The target language may also be determined by combining the above factors, which this application does not specifically limit. After step S2 is completed, the following step S3 may be executed.
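The screening logic above can be sketched as follows; the field names, the single-criterion ranking, and the example numbers (mirroring the TL1/TL2/TL3 illustration) are assumptions, and the factors could equally be combined into a weighted score.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class CandidateLanguage:
    code: str
    translation_speed: float   # higher = faster (e.g. 1 / measured seconds)
    usage_frequency: int       # uses counted since the configured start time
    preference_rank: int       # 1 = most preferred by the user
    avg_accuracy: float        # running mean of per-run accuracy feedback

def pick_target_language(candidates: List[CandidateLanguage],
                         criterion: str = "avg_accuracy") -> CandidateLanguage:
    """Screen the target language from TargetLanguageList: a single
    candidate wins outright; otherwise rank by the user's chosen criterion."""
    if len(candidates) == 1:
        return candidates[0]
    if criterion == "preference_rank":          # smaller rank is better
        return min(candidates, key=lambda c: c.preference_rank)
    return max(candidates, key=lambda c: getattr(c, criterion))

# Mirrors the TL1/TL2/TL3 example: accuracies 0.80/0.85/0.95 make Spanish
# win under "avg_accuracy" even though Chinese has preference rank 1.
tl = [CandidateLanguage("zh", 1.0, 12, 1, 0.80),
      CandidateLanguage("en", 1.2, 7, 2, 0.85),
      CandidateLanguage("es", 0.9, 3, 3, 0.95)]
assert pick_target_language(tl, "avg_accuracy").code == "es"
assert pick_target_language(tl, "preference_rank").code == "zh"
```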
Step S3: if it is detected that the data category of the media file is a first pure audio stream in the video file or the data category is a second pure audio stream in the audio file, the display device 200 converts the audio in these audio data sources into semantic text and translates the semantic text into translated text in the target language.
In some embodiments, the audio data source includes the first pure audio stream of a video file, the second pure audio stream of an audio file, and other sound-like data. To translate the audio data source of a media file into translation text in the target language, the display device 200 converts the audio in the audio data source into semantic text and translates the semantic text into the target language. FIG. 10 is a schematic flowchart of a display device translating an audio data source into translation text in the target language according to some embodiments of the present application. As shown in fig. 10, the display device 200 may first analyze the audio data stream to extract the effective audio information in the audio data source, and then obtain the target audio data specification, which stipulates the relevant standards for audio data and may be formulated in combination with the specific content to be translated (its specifics are not limited in this application). With the specification obtained, the display device 200 may process the effective audio information according to the specification to extract the semantic audio information it contains. The semantic audio information is then divided into audio sentences of a preset length, semantic recognition is performed on the sentences to generate their semantic text, and finally the semantic text of the audio sentences is translated into translation text in the target language.
In conjunction with fig. 5, in some embodiments semantic error correction may be performed on the recognized semantic text to screen out semantically erroneous text, improving the accuracy of the semantic text and, in turn, the accuracy of its translation into the target language.
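Putting the fig. 10 path together, a minimal Python sketch follows; the 5-second fixed sentence length and the stub recognition and translation engines are assumptions, and real sentence splitting would follow pauses rather than fixed offsets.

```python
from typing import Callable, List

def translate_audio_source(
    pcm: bytes,
    target_lang: str,
    asr: Callable[[bytes], str] = lambda seg: "",       # speech-to-text stub
    mt: Callable[[str, str], str] = lambda t, lang: t,  # translation stub
) -> List[str]:
    """Sketch of the audio path: effective audio -> audio sentences of a
    preset length -> semantic text -> translated text in the target
    language.  `asr` and `mt` stand in for the recognition and
    translation engines."""
    SENTENCE_BYTES = 16000 * 2 * 5   # 5 s of 16 kHz 16-bit mono (assumed)
    sentences = [pcm[i:i + SENTENCE_BYTES]
                 for i in range(0, len(pcm), SENTENCE_BYTES)]
    semantic_texts = [asr(s) for s in sentences]
    # Semantic error correction would screen out mis-recognized text here.
    return [mt(t, target_lang) for t in semantic_texts]
```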
For example, if the data category to be translated is detected to be the first pure audio stream of a video file or the second pure audio stream of an audio file, the display device 200 may first analyze that audio data source as a data stream to extract its effective audio information, i.e., audio with some of the noise removed, which facilitates later audio extraction and processing. The target audio data specification, which stipulates the relevant standards for audio data, may be obtained before or after the data stream analysis, as long as it is available before the semantic audio information is extracted from the effective audio information.
In some embodiments, the specification may include, but is not limited to, the container format of the audio, the audio format, the maximum data capacity, the size of a single-packet data stream, the transmission frame rate range, the code rate range, the sampling rate, the compression mode, and the like. To keep the extraction of effective audio information accurate, the specification may be updated periodically or at particular moments: for example, when automatic translation starts, when the display device 200 is switched on or off, or at other points.
It should be noted that whichever form of update is adopted, the specification must be up to date before the effective audio information is processed. Format analysis and checking of the audio data can then be carried out against the target audio data specification, and different operation flows executed according to the result of the analysis and check.
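A minimal sketch of such a format check follows; the spec fields mirror the items listed above, but all names and example values are assumptions.

```python
from dataclasses import dataclass
from typing import Dict, Set, Tuple

@dataclass
class TargetAudioSpec:
    containers: Set[str]              # e.g. {"mp4", "ts"}
    formats: Set[str]                 # e.g. {"aac", "pcm"}
    max_capacity_bytes: int
    bitrate_range: Tuple[int, int]    # (min_bps, max_bps)
    sample_rates: Set[int]            # e.g. {16000, 44100, 48000}

def meets_spec(meta: Dict, spec: TargetAudioSpec) -> bool:
    """Format analysis and check of audio metadata against the (latest)
    target audio data specification; different operation flows follow
    from the boolean result."""
    lo, hi = spec.bitrate_range
    return (meta["container"] in spec.containers
            and meta["format"] in spec.formats
            and meta["size_bytes"] <= spec.max_capacity_bytes
            and lo <= meta["bitrate"] <= hi
            and meta["sample_rate"] in spec.sample_rates)
```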
After the display device 200 obtains the target audio data specification, it can process the effective audio information accordingly. FIG. 11 is a schematic flowchart of a display device extracting semantic audio information from effective audio information according to some embodiments of the present application. As shown in fig. 11, the display device 200 may first check the effective audio information against the target audio data specification. If the effective audio information is found not to meet the translation standard, the display device 200 may control the display 260 to display a reminder that it does not meet the standard; if it does meet the standard, the display device 200 may extract the voice information from the effective audio information, convert the voice information according to a preset format, and package the converted voice information.
FIG. 12 is a schematic diagram of the effect of a display device displaying a reminder that audio does not meet the translation standard, according to some embodiments of the present application. As shown in fig. 12, if the effective audio information is found not to meet the translation standard, the display device 200 may control the display 260 to show the reminder "The input audio information does not meet the specification requirements and cannot be translated", prompting the user that the audio information cannot be translated.
In some embodiments, the semantic audio information may be voice information, so if the audio information meets the translation standard, the display device extracts the voice information from the effective audio information when extracting the semantic audio information, converts it according to a preset format, and packages the converted voice information.
To support audio information extraction by the display device 200, in some embodiments a media audio preprocessing module may be provided in the display device 200 for processing operations such as audio information extraction. Once the audio information meets the target audio data specification, human voice extraction may be performed. To support this function, a voice extraction sub-module may further be provided in the display device; it extracts the human voice from audio containing various kinds of background music and noise, keeping the extracted voice as pure as possible so that subsequent procedures can accurately recognize it and identify the language, words, and so on corresponding to the source language.
Because the frequency of the human voice lies within a certain range, the voice information can be extracted in the following ways by combining frequency characteristics with related information. In some embodiments, the extraction is performed by the voice extraction sub-module; several methods of extracting the voice information are described below. It should be understood that extraction is not limited to these methods; other approaches may be used, and this application places no specific limit on them.
For example, when extracting voice information, it may first be checked whether the human voice and the other background sounds or music are placed in different tracks of the audio. If the voice occupies a separate track, the human voice track can be extracted directly according to the track details; if the track details are unclear, the voice track can still be identified by its frequency range. The non-voice tracks are then removed and the voice track kept, yielding an audio stream of pure human voice.
If the voice is not placed in a separate track, one may look for another reliable way to extract it. For example, for audio in which the human voice is mono and distributed equally to the left and right channels while the accompaniment is stereo, subtracting one channel from the other cancels the centered voice and yields a two-channel stereo accompaniment with the voice eliminated; subtracting that accompaniment from the original left and right channels then yields the purified human voice. In this way, background music or accompaniment can be removed and pure voice information finally obtained.
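The following is a minimal sketch of this channel-subtraction idea, assuming the audio has already been decoded into a NumPy float array of shape (num_samples, 2) with the voice panned dead center; the crude spectral subtraction in the last step is one possible way to realize the "subtract the accompaniment" step and is not prescribed by the embodiments above.

```python
import numpy as np

def extract_center_vocals(stereo: np.ndarray) -> np.ndarray:
    """Purify a center-panned mono voice out of a stereo mix (sketch)."""
    left, right = stereo[:, 0], stereo[:, 1]

    # Subtracting the channels cancels the center-panned voice, leaving a
    # voice-free accompaniment estimate (the classic "karaoke" signal).
    accompaniment = (left - right) / 2.0

    # The mid signal contains the voice plus the centered accompaniment.
    mid = (left + right) / 2.0

    # Remove the accompaniment from the mid signal via crude magnitude
    # spectral subtraction; plain time-domain subtraction would merely
    # reconstruct one of the original channels.
    mid_spec = np.fft.rfft(mid)
    acc_mag = np.abs(np.fft.rfft(accompaniment))
    vocal_mag = np.maximum(np.abs(mid_spec) - acc_mag, 0.0)
    vocal_spec = vocal_mag * np.exp(1j * np.angle(mid_spec))
    return np.fft.irfft(vocal_spec, n=len(mid))
```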
The two cases above, where the voice either occupies a separate track or can be separated by channel arithmetic, are comparatively simple; more often, the human voice is mixed directly with other sounds in the audio. In some embodiments, the audio range may be limited to the human voice band by a high-pass filter and a low-pass filter, after which filtering, noise reduction, voice recognition and the like may be performed by signal processing and pattern recognition methods.
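As one concrete (and assumed) realization of the high-pass/low-pass step, the sketch below band-limits the audio to the classic 300-3400 Hz telephony speech band using SciPy; the embodiments above do not fix the cutoff frequencies, so these values are illustrative only.

```python
import numpy as np
from scipy.signal import butter, sosfilt

def bandpass_voice(audio: np.ndarray, sample_rate: int,
                   low_hz: float = 300.0, high_hz: float = 3400.0) -> np.ndarray:
    """Limit audio to an assumed human-voice band with a band-pass filter,
    equivalent to cascading the high-pass and low-pass filters above."""
    sos = butter(4, [low_hz, high_hz], btype="bandpass",
                 fs=sample_rate, output="sos")
    return sosfilt(sos, audio)
```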
In other embodiments, the voice information may also be extracted with the help of artificial-intelligence learning. For example, the human voice in the first section of audio can be learned by an artificial-intelligence method; once features such as timbre, sound-wave frequency band, amplitude and sound complexity have been captured, an audio voice feature template for the media file is formed. After N voice feature templates have been formed, they are stored as the voice template library of the current media file. When voice purification is later performed on this media file, or on similar media files, matching is done against this template library; for example, when subsequent episodes of the same television show are played, the voice template library saved from earlier episodes can continue to be used. Once the library is established, incoming voice-band audio can be compared against it, and the audio whose feature similarity exceeds a certain threshold is extracted as the purified voice of the current audio stream. The threshold may be set in light of the audio data and related information; this application does not limit its specific value. If no matching voice is found in the template library, an intelligent learning module can be started to learn the voice template of the current audio and add the newly learned template to the media file's accumulated library; the voice extracted according to the new template then serves as the voice information extracted from the current audio information.
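A minimal sketch of the template-matching flow follows. The feature extractor here (an averaged, normalized magnitude spectrum) is a stand-in for the timbre/frequency-band/amplitude/complexity features named above, and the 0.85 threshold is an assumed value, since the embodiments leave the threshold open.

```python
import numpy as np

def voice_features(clip: np.ndarray, frame: int = 2048) -> np.ndarray:
    """Crude per-clip feature vector: the normalized average magnitude
    spectrum of fixed-size frames (assumes the clip spans several frames)."""
    frames = [clip[i:i + frame] for i in range(0, len(clip) - frame + 1, frame)]
    feats = np.mean([np.abs(np.fft.rfft(f)) for f in frames], axis=0)
    return feats / (np.linalg.norm(feats) + 1e-9)

def match_or_learn(clip: np.ndarray, template_library: list[np.ndarray],
                   threshold: float = 0.85) -> bool:
    """Return True if the clip matches a stored voice template (extract it as
    purified voice); otherwise learn it as a new template for this media file."""
    feats = voice_features(clip)
    best = max((float(np.dot(feats, t)) for t in template_library), default=0.0)
    if best >= threshold:
        return True                      # matched an accumulated template
    template_library.append(feats)       # intelligent-learning fallback
    return False
```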
In some embodiments, if no voice information is found in the entire media file, it may be determined that the file is pure music, pure noise, pure animal sound or the like; for such files, the voice extraction result is null, and the display 260 may be controlled to display a reminder that no human voice exists. Fig. 13 is a schematic diagram of the no-voice reminder displayed by the display device provided in some embodiments of the present application. As shown in fig. 13, when no voice information exists in the input audio, the reminder "no voice information exists in the input audio information, no translation is required" may be displayed, telling the user that audio without voice needs no translation.
After the semantic audio information, e.g. the voice information, has been extracted by the methods above, it can be converted into a preset format and the converted voice information packaged. For example, a media format conversion sub-module may be provided in the display device 200. It performs data stream analysis on the input audio information, extracts the useful data, and collects and converts that data into the preset format required by the target audio data specification, such as a target audio format, finally turning the audio stream into the target audio format for subsequent semantic extraction. The format conversion itself consists of filling the relevant information of the source audio into the new data structure required by the target audio format and encapsulating it in the target format's container.
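A sketch of the encapsulation step is shown below; the `TargetAudioFormat` structure and its fields are hypothetical placeholders for whatever container the target audio data specification actually prescribes.

```python
from dataclasses import dataclass

@dataclass
class TargetAudioFormat:
    """Hypothetical target-format container (fields are illustrative)."""
    codec: str = "pcm_s16le"
    sample_rate: int = 16000
    channels: int = 1
    payload: bytes = b""

def convert_to_target(source_info: dict, voice_pcm: bytes) -> TargetAudioFormat:
    """Fill the useful data of the source audio into the new data structure
    required by the target audio format and encapsulate it."""
    return TargetAudioFormat(
        sample_rate=source_info.get("sample_rate", 16000),
        channels=1,              # the purified voice is assumed to be mono
        payload=voice_pcm,       # extracted and format-converted voice data
    )
```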
To translate the extracted semantic audio information into text form, in some embodiments the display device 200 may also divide the semantic audio information into audio sentences of a preset length. The audio stream of the whole media file is dynamically broken into short sections of varying length, each forming an audio sentence; joined end to end in time, the audio sentences reconstitute the original audio stream of the whole media file. Since the semantic audio information is ultimately translated into the translated text of the target language, the accuracy of this division affects the accuracy of the translation.
Fig. 14 is a schematic flow chart of dividing semantic audio information into audio sentences of preset length by the display device according to some embodiments of the present application. As shown in fig. 14, the display device 200 may first traverse the semantic audio information to obtain its audio duration information; it then detects the continuous voice-free duration in that information. If the continuous voice-free duration exceeds a first duration threshold, a first mark is placed at the corresponding playing position of the media file, and a first punctuation mark is determined to exist at that position; if the continuous voice-free duration exceeds a second duration threshold, a second mark is placed at the playing position and a second punctuation mark is determined to exist there, where the second duration threshold is greater than the first. Finally, the semantic audio information is divided into audio sentences of preset length according to the first and second punctuation marks.
In some embodiments, dividing an audio sentence is not merely a matter of finding a cut-off point in the audio information; it also means finding a sentence-break position suitable for translation, e.g. a punctuation position. This application does not limit the duration or sentence capacity of the divided audio sentences, which can be set according to the characteristics of the audio information. Division methods are illustrated below.
In some embodiments, for a subtitled media file, the duration of one screen of subtitles may serve as the reference: the start and stop times of the current subtitle can be used as the start and stop times of the corresponding audio sentence, so that the semantic audio information within one screen-subtitle duration forms one audio sentence, which is then converted into the target audio format. Alternatively, pauses and tone in the semantic audio information can be detected, dividing sentence breaks the way a speaker naturally breaks sentences. With this approach it is necessary to determine whether the current semantic audio information contains voice, how long the continuous voice-free stretches are, and so on, and to divide the audio sentences according to those durations.
For example, after traversing the semantic audio information to obtain its audio duration information, if a continuous voice-free stretch exceeds a first duration threshold, say 1 second, it can be determined that the speech pauses there and a comma belongs at that position. If the continuous voice-free duration exceeds a second duration threshold, say 3 seconds, the sentence can be considered finished, a period belongs there, and that position can be marked as the cut-off point of an audio sentence. The semantic audio information is finally divided into audio sentences of preset length according to these commas and periods.
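A minimal sketch of this silence-driven punctuation pass follows, assuming the semantic audio information has been reduced to a per-frame voiced/unvoiced mask; the 1-second and 3-second values are the example thresholds above, not values fixed by the embodiments.

```python
import numpy as np

COMMA_SILENCE_S = 1.0    # example first duration threshold
PERIOD_SILENCE_S = 3.0   # example second duration threshold

def punctuate_by_silence(is_voiced: np.ndarray, hop_s: float) -> list[tuple[float, str]]:
    """Mark (timestamp, punctuation) pairs from a voiced/unvoiced frame mask.

    A voice-free run longer than the first threshold yields a comma; one
    longer than the second yields a period, which also serves as the
    cut-off point of an audio sentence.
    """
    marks, run_start = [], None
    for i, voiced in enumerate(is_voiced):
        if not voiced and run_start is None:
            run_start = i                       # a voice-free run begins
        elif voiced and run_start is not None:
            gap = (i - run_start) * hop_s       # length of the run in seconds
            t = run_start * hop_s
            if gap >= PERIOD_SILENCE_S:
                marks.append((t, "."))          # sentence finished here
            elif gap >= COMMA_SILENCE_S:
                marks.append((t, ","))          # pause within a sentence
            run_start = None
    return marks
```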
In some embodiments, the sentence length of an audio sentence may also be set according to actual requirements. For example, one sentence may form one audio sentence, or two or more sentences may be grouped into one; this application does not limit it. Once divided, the audio sentences can be translated. Because forced truncation leaves incomplete audio sentences and in turn harms translation accuracy, the division should weigh factors such as the display device's performance, the media file type, and the source and target languages. Doing so improves the accuracy of audio sentence division and addresses both the low translation accuracy of media files and the time and effort users waste searching for subtitled videos in a language they understand.
After the human-voice audio sentences have been divided, the display apparatus 200 may perform semantic recognition on them to generate the semantic text of each audio sentence. In some embodiments, to perform this audio-sentence-to-semantic-text conversion, the display apparatus 200 may be provided with an audio semantic recognition module whose input is the audio sentences obtained in the foregoing embodiments; the module recognizes the sound information in the speech of each audio sentence and converts it into semantic text.
It should be noted that before voice information can be recognized by the audio semantic recognition module, a great deal of speech-semantics collection, signal processing, pattern recognition and artificial-intelligence training must be applied to the module. For example, the sound features that may occur in various languages can be sampled and analyzed in advance, forming per-language collections of sound feature data as required by the target audio data specification. Features such as tone, volume, frequency and amplitude can be analyzed, and the feature collections are searched with the features of the audio to be analyzed so that the voice is matched to the language of maximum similarity. The characters corresponding to each sound in the audio are then identified specifically, finding the character of maximum similarity to each sound; through cyclic searching and matching, all matched characters are found and assembled into sentences and paragraphs, finally forming the complete audio-to-text semantic text. This completes the conversion of the human-voice audio sentences into semantic text. Finally, the semantic text of the audio sentences can be translated into the translated text of the target language, i.e. a text-to-text translation, for which see step S4 below. After step S3 has been executed, the following step S4 may be executed.
Step S4: if the file category of the media file is detected to be a text file, or the data category is a pure subtitle stream in a video file, or the data category is a lyric stream in an audio file, the display device 200 translates these text data sources into translated text in the target language.
In some embodiments, the text data sources include text files, pure subtitle streams, lyric streams, and other plain-text-like data. Fig. 15 is a schematic flow chart of a display device translating a text data source into translated text in the target language according to some embodiments of the present application. As shown in fig. 15, when the display device 200 performs this translation, it may first obtain the text data source, divide it into text sentences, and then input the individual text sentences to a text translation module, which translates the text data source into translated text in the target language.
To support translating a text data source into translated text in the target language on the display device 200, in some embodiments a semantic text translation module may be provided in the display device 200. This module translates a text data source into translated text in the target language; in this embodiment, all translation work may be carried out by the text translation module. The translation process is described below with a specific example.
Before the source language of the media file can be translated into the target language, the target language must be determined. The foregoing embodiments have already described how the target language is determined, so the description is not repeated here; the user may set the determination rule for the target language according to actual requirements. Once the target language is determined, the display device 200 may obtain the text data source and divide it into text sentences. The division method parallels the audio-sentence division: a text file may be split into text sentences by punctuation, while a pure subtitle stream may be split by timestamp. The length of a text sentence is not limited in this application; one sentence may form one text sentence, or two or more sentences may be grouped together. After the division, the text sentences are fed to the text translation module, which translates the text file, the pure subtitle stream of the video file, or the lyric stream of the audio file, i.e. the text data source, into translated text in the target language.
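The two division paths can be sketched as follows; the cue dictionary layout (`pts`, `text`) is an assumed representation of a timestamped subtitle stream, not a format taken from the embodiments.

```python
import re

def split_text_file(text: str) -> list[str]:
    """Divide a plain text file into text sentences by punctuation,
    keeping the terminating mark with each sentence (CJK and Western)."""
    parts = re.split(r"(?<=[。！？.!?])\s*", text)
    return [p for p in parts if p]

def split_subtitle_stream(cues: list[dict]) -> list[str]:
    """Divide a pure subtitle stream into text sentences by timestamp: each
    timestamped cue, e.g. {'pts': 12.3, 'text': '...'}, becomes one sentence
    handed to the text translation module."""
    return [cue["text"] for cue in sorted(cues, key=lambda c: c["pts"])]
```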
In order to translate the pure subtitle stream more accurately, a semantic translation module may be provided in the display device 200, and the display device 200 may identify the subtitle category of the pure subtitle stream, which in some embodiments may be text subtitles or picture subtitles. If the identified category is text subtitles, the display apparatus 200 may convert them into timestamped subtitle text through the semantic translation module, and the timestamped subtitle text can then be translated directly into the target language by the text translation module. If the identified category is picture subtitles, the display device 200 may extract the semantic information in the pictures to obtain timestamped subtitle text, thereby indirectly obtaining the subtitle text of the video file, and then translate that timestamped text into the target language through the text translation module. In this way different preprocessing is applied to different kinds of pure subtitle streams, both ending in timestamped subtitle text, which improves the translation accuracy of the pure subtitle stream.
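A dispatch sketch for the two subtitle categories appears below; `semantic_module` and `ocr_module` are placeholder objects standing in for the semantic translation module and the picture-semantic extraction, and their method names are assumptions.

```python
def preprocess_subtitle(cue, semantic_module, ocr_module):
    """Normalize one subtitle cue into timestamped subtitle text before
    handing it to the text translation module (sketch)."""
    if cue.kind == "text":
        # Text subtitles already carry characters; just attach timestamps.
        return semantic_module.to_timestamped_text(cue)
    if cue.kind == "picture":
        # Picture subtitles are rendered images; extract their semantic
        # (textual) content first, e.g. by OCR, keeping the timestamp.
        return ocr_module.extract_text(cue.image, pts=cue.pts)
    raise ValueError(f"unknown subtitle category: {cue.kind!r}")
```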
To further improve translation accuracy, in conjunction with fig. 5, in some embodiments the translation of the pure audio stream may be corrected using the subtitle translation of the same video file, or using the lyric translation of the same audio file. This screens out audio-stream translations that acquired errors or ambiguities during semantic extraction and translation, improving the accuracy of the pure-audio-stream translation.
The error correction can be done in various ways. For example, if a video file has both audio and subtitles, the translation of its pure subtitle stream can be used to correct the translation of its pure audio stream, on the principle that the audio version and subtitle version of the same video should be substantially the same, just as the audio version of a song should substantially match its lyrics; if an audio file has both audio and lyrics, the lyric translation is used to correct the pure-audio-stream translation. Error correction may, without limitation, take the following form: the subtitle translation carrying timestamp PTS is compared with the correspondingly obtained pure-audio-stream translation carrying the same timestamp PTS; the translated texts of audio and subtitle with identical PTS are compared for similarity of semantics and characters; identical or similar parts are intersected and stored as the agreed translation, while differing parts are stored as the disputed translation.
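A minimal comparison sketch follows, assuming both translations have been keyed by their shared PTS values; the 0.8 similarity threshold and the use of difflib as the similarity measure are assumptions, since the embodiments only require comparing whether the semantics and characters are similar.

```python
from difflib import SequenceMatcher

def compare_translations(audio_xlat: dict[float, str],
                         subtitle_xlat: dict[float, str],
                         threshold: float = 0.8):
    """Split PTS-aligned translations into agreed and disputed parts."""
    agreed, disputed = {}, {}
    for pts, audio_text in audio_xlat.items():
        sub_text = subtitle_xlat.get(pts)
        if sub_text is None:
            continue                                # no subtitle at this PTS
        ratio = SequenceMatcher(None, audio_text, sub_text).ratio()
        if ratio >= threshold:
            agreed[pts] = sub_text                  # stored as the same translated part
        else:
            disputed[pts] = (audio_text, sub_text)  # kept for backtracking correction
    return agreed, disputed
```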
In some embodiments, backtracking error correction may also be performed on the semantically differing parts. This can be done in many ways. One exemplary method simply substitutes the subtitle translation for the disputed part of the pure-audio-stream translation, i.e. the first pure audio stream's translation, wherever the two differ greatly, forming the final pure-audio-stream translation of the video. In other embodiments, the audio corresponding to the differing parts may be clipped out and sent back to the audio semantic extraction module for re-translation. The re-run of audio semantic recognition then receives not the whole previous audio sentence but only the short clip of one or two words corresponding to the differing part, which lowers the difficulty of local semantic recognition and raises the recognition accuracy for the suspect portion of the translation. Re-processing, i.e. re-recognizing, the potentially problematic portion of the pure audio stream improves the accuracy of the final pure-audio-stream translation. The same method can likewise be applied when correcting the pure-audio-stream translation with the lyric translation of an audio file, improving the accuracy of the audio translation.
Referring to fig. 5, in some embodiments, if a video file does not contain both the pure audio stream, i.e. the first pure audio stream, and the subtitle stream (it has only one of them or neither), or an audio file does not contain both the pure audio stream, i.e. the second pure audio stream, and the lyric stream, the error correction process may be skipped: after the text translation module outputs the translated text for the first or second pure audio stream, that translation may be passed directly to the "output translated text according to the translated-text output setting" stage.
In some embodiments, before the text subtitles and the picture-subtitle semantic information are translated into the target language, timestamped subtitle text is formed: translating text subtitles produces timestamped subtitle text, and extracting the semantic information of picture subtitles likewise generates timestamped subtitle text, which the text translation process then turns into translated text. To improve the accuracy of the final translated text, semantic error correction can be applied to the translated text generated during text translation; correcting the final translated text semantically ensures its accuracy.
Fig. 16 is a block diagram of a method for media file translation performed by a display device according to some embodiments of the present application. As shown in fig. 16 and as in the foregoing embodiments, the source-language media files can be divided into video files, audio files, picture files and text files. For video files, the display device 200 may perform video decoding and splitting to form a first pure audio stream, a pure image stream and a pure subtitle stream; audio files are likewise decoded and split to form a second pure audio stream, a lyric stream and a poster. The media file data thus fall into three kinds of data sources: the text data sources are the text files, the pure subtitle streams of video files and the lyric streams of audio files; the first pure audio stream of a video file and the second pure audio stream of an audio file are the audio data sources; and the pure image stream of a video file, the poster of an audio file and the picture files are the image data sources. Different translation processes are then executed for the different data sources.
In a specific implementation, for the audio data source, the audio is converted into semantic text and the semantic text is translated into the target language. For the text data source, the text is translated into the target language directly. For the pure subtitle stream, the subtitle category is identified and category-specific translation processes are applied. In conjunction with fig. 5, in the translation step four paths of data enter the text translation module: the timestamped lyric stream, the timestamped audio semantic text, the timestamped subtitle text and the untimestamped text file. Everything except the pure image stream, the poster and the picture files is thereby converted into translated text in the target language, which can be displayed according to the user's settings. After step S4 has been executed, the following step S5 may be executed.
Step S5: the control display 260 displays an image data source in the user interface, the image data source including a picture file or a plain image stream or poster, and outputs translation text in the user interface in a voice and/or text manner.
To present the translated text of the target language in the user interface of the display apparatus 200, the display apparatus 200 may, once the audio or text of the source language has been translated, let the user select how the translated text is output; that is, after the user selects an output mode, the display apparatus 200 outputs the translated text in that mode.
Fig. 17 is a schematic flow chart of a display device outputting translated text in the user interface by voice and/or text according to some embodiments of the present application. As shown in fig. 17, when outputting translated text by voice and/or text, the display device 200 may first obtain the translated text of the media file in the target language and then detect the output mode the user has selected. In some embodiments, the output modes include a voice output mode, a text output mode, and a voice-plus-text output mode. If the voice output mode is detected, the display apparatus 200 may convert the translated text into a voice translation and output the voice translation, or replace the original voice in the media file with the voice translation. If the text output mode is detected, the display device 200 may control the display 260 to show the translated text at a preset position, or replace the original text file, original subtitle stream or original lyric stream in the media file with the translated text. If the voice-plus-text output mode is detected, the display device 200 may do both: output the voice translation or replace the original voice with it, and display the translated text at the preset position or substitute it for the original text file, subtitle stream or lyric stream.
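The mode handling can be pictured as the dispatch sketch below; `media`, `display` and `tts` are placeholders for the media file state, the display controller and the text-to-audio module, and every method name on them is an assumption, not an interface from the embodiments.

```python
from enum import Enum, auto

class OutputMode(Enum):
    VOICE = auto()
    TEXT = auto()
    VOICE_PLUS_TEXT = auto()

def output_translation(mode: OutputMode, translated_text: str,
                       media, display, tts) -> None:
    """Route the translated text to the output path the user selected."""
    if mode in (OutputMode.VOICE, OutputMode.VOICE_PLUS_TEXT):
        voice = tts.synthesize(translated_text)
        if media.has_original_voice():
            media.replace_voice(voice)        # avoid clashing with source speech
        else:
            display.play_voice(voice)
    if mode in (OutputMode.TEXT, OutputMode.VOICE_PLUS_TEXT):
        if media.has_original_text():         # original text/subtitles/lyrics present
            media.replace_text(translated_text)
        else:
            display.show_text(translated_text, position="preset")
```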
For example, the voice output mode outputs the translated text only as speech; when the user selects it, the display device 200 converts the translated text into a voice translation. To support this, a text-to-audio module may be provided in the display device 200 to convert the target-language translated text into a voice translation and play it as speech. In some embodiments, the text-to-audio module may first determine the code of each character in the translated text, for example its variable-length character encoding, and use that code to look up the pronunciation audio of the corresponding character or word in an internal or external pronunciation audio library of the media terminal, such as the display device 200. Pronunciation audio preset in the display device 200 may be called the default built-in pronunciation audio library; it provides the languages whose voice broadcast the display device 200 supports by default.
In some embodiments, the user may also be allowed to download pronunciation audio for characters or words in a specified format from the internet; the display device 200 stores such downloads in the internal pronunciation audio library, after which they need not be downloaded again and are broadcast directly as internal pronunciation audio during playback. Pronunciation audio can also be fetched online in real time. Whichever way is used, as long as the voice translation for each character or word of the translated text can be obtained, the display device 200 can drive a speaker or earphones to broadcast it.
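A lookup sketch follows, assuming the built-in and downloaded pronunciation libraries are simple word-to-audio mappings; `fetch_online` is a placeholder for the real-time networked lookup and is not an actual API.

```python
def lookup_pronunciation(word: str,
                         built_in: dict[str, bytes],
                         downloaded: dict[str, bytes],
                         fetch_online=None) -> bytes | None:
    """Resolve pronunciation audio for one character or word, trying the
    default built-in library, then user-downloaded audio, then an optional
    real-time online fetch whose result is cached for later reuse."""
    if word in built_in:
        return built_in[word]
    if word in downloaded:
        return downloaded[word]
    if fetch_online is not None:
        audio = fetch_online(word)
        if audio:
            downloaded[word] = audio    # no need to download again next time
            return audio
    return None                         # no pronunciation available
```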
When the user selects the voice output mode, the translated text is not displayed as characters. Broadcast as speech, the voice translation may clash with the source-language speech; therefore, when original voice exists in the source-language media file, the original voice must be replaced with the voice translation, i.e. the source-language speech is replaced by the target-language voice translation.
The text output mode displays the translated text only as characters; when the user selects it, the translation is not played as speech. The display device 200 may show the translated text at the designated preset position. To avoid duplicate display, it may first be determined whether an original text file or original subtitle stream is shown for the source language: if so, the display apparatus 200 may replace it with the translated text; if not, the translated text is displayed at the preset position.
In some embodiments, the preset position may be adjustable; for example, the translated text may be dragged to a new position. When translated text is output in text mode, its font, color, transparency, 3D styling, character entry effects and so on can be configured. To improve the display, the content of the translated text can also be checked before output, e.g. punctuation checking and grammar checking. In some embodiments, a basic grammar checking module may be provided in the display apparatus 200 to correct grammatical errors in the translated text; to cover different languages, several basic grammar checking modules may be provided, so that one or more of them normalize the translated text and improve its display.
The voice-plus-text output mode displays the translated text as speech and characters simultaneously. If the user selects it, then when original voice exists in the source-language media file it is replaced with the voice translation, and when no original voice exists the voice translation is broadcast directly; at the same time, it is determined whether an original text file or original subtitle stream is displayed for the source language: if so, the display device 200 replaces it with the translated text, and if not, the translated text is displayed at the preset position.
Through these output modes, the user can see and/or hear a familiar language. With the display apparatus 200 described above, when a program in a foreign or otherwise unfamiliar language plays, it can be translated automatically into the target language the user selected, and the translated text output as speech or characters. The user's language barrier when watching such programs is thus removed: the built-in modules of the display device 200 cooperate to translate the media file automatically, and the user picks the most suitable of several output paths. The whole process supports not only text-to-text translation but also speech-to-text translation, solving the problem that a media file cannot be translated when it is in a foreign language or another language the user cannot directly understand, while also enriching the application scenarios and scope of the display device 200.
To meet different users' output requirements, in some embodiments the display device 200 may additionally be provided with a translation voice broadcast special-effects module. Its role is to post-process the voice translation in a user-customized way; for example, it can offer sound-effect selection and customization, adding background music or sound effects to the broadcast of the voice translation and thereby meeting the broadcast needs of different users.
For example, the user may record their own voice for certain words or phrases; when the voice translation is broadcast, "hello" may then be rendered either by the internal audio or by the audio the user recorded. When the user's recording is stored, a new file name can be created for it, and when the user selects their own recording for broadcast, the voice translation corresponding to the translated text is replaced by that recording.
In some embodiments, the user may also add favorite background music to the broadcast of the voice translation. Suppose the user opens a text-only English poem, which is automatically translated into a Chinese rendering. If the user selects voice output for the broadcast and chooses a particular piece as background music, the final effect is that the Chinese text is shown in the user interface accompanied by the relaxing background music, meeting the needs of different user scenarios.
Note that each functional module provided in the display device 200, for example the media audio analysis module, the markup language setting module, the media audio preprocessing module, the voice extraction sub-module and the media format conversion sub-module, may be kept off while the translation function is not in use and restarted once the translation function is turned on. When a module starts, either its default values may be applied or the values last set for it may be restored; this application does not specifically limit it. This saves memory resources of the display device 200 on the one hand and reduces the user's setup time for each module on the other.
In some embodiments, the automatic translation function may be turned off once translation is complete. In conjunction with fig. 5, when the display device 200 receives a shutdown event or off state of the automatic translation function, all processes and resources associated with automatic translation may be stopped and released, exiting automatic translation.
As can be seen from the above technical solution, the display device 200 provided in the above embodiments can detect the file category of a media file in response to the opening event or state of the automatic translation function during playback, the file categories including video files, audio files, picture files and text files. If a video file is detected, it is video-decoded and split into the media file's first pure audio stream, pure image stream and pure subtitle stream; if an audio file is detected, it is audio-decoded and split into the audio file's second pure audio stream, lyric stream and poster. If the data category is detected to be the first pure audio stream of a video file or the second pure audio stream of an audio file, the audio in the audio data source is converted into semantic text and the semantic text translated into the target language; if the file category is a text file, or the data category is a pure subtitle stream of a video file or a lyric stream of an audio file, the text data source is translated into the target language. The display is controlled to show the image data source in the user interface and to output the translated text there by voice and/or text. The display device 200 can thus translate a media file in a foreign or otherwise unfamiliar language into translated text of the target language, solving the problems that such a media file cannot be translated, that its translation accuracy is low, and that the user wastes time and effort searching for subtitled videos in a language the user understands.
Some embodiments of the present application also provide a method for translating a media file, applicable to the display device 200 of the above embodiments, where the display device 200 includes a controller 250 and a display 260 for displaying a user interface. Fig. 18 is a flowchart of a method for translating a media file according to some embodiments of the present application; as shown in fig. 18, in some embodiments the method may include the following steps S1-S5:
Step S1: in response to an opening event or state of the automatic translation function when the media file is played, the display device 200 detects the file category of the media file;
In some embodiments, the file categories include video files, audio files, picture files and text files. To achieve a better translation speed and more accurate results, the display device 200 may be connected to a network so that more network resources can be drawn on during translation, improving speed and accuracy. To start the translation function, in some embodiments a function control for turning translation on or off may be provided in the display device 200. This control starts or stops the automatic translation function: opening it prepares the media file for translation, and closing it stops translation of the media file.
Step S2: if the file category is detected to be a video file, the display apparatus 200 performs video decoding on the video file and splits the decoded video file to generate the first pure audio stream, the pure image stream and the pure subtitle stream of the media file; if the file category is detected to be an audio file, it performs audio decoding on the audio file and splits the decoded audio file to generate the second pure audio stream, the lyric stream and the poster of the audio file.
To enable the video file to be played on the display device, in some embodiments the display device 200 may video-decode the video file and, after decoding, split it to generate the media file's first pure audio stream, pure image stream and pure subtitle stream. In this way the display device 200 can run different translation processes for different kinds of media files. Similarly, the display device 200 may decode an audio file and, after decoding, split it to generate the media file's second pure audio stream, lyric stream and poster.
Step S3: if the display device 200 detects that the data category is the first pure audio stream in a video file or the second pure audio stream in an audio file, it converts the audio in the audio data source into semantic text and translates the semantic text into translated text in the target language;
In some embodiments, the audio data source comprises the first pure audio stream of the video file and the second pure audio stream of the audio file. To translate the audio data source of the media file into target-language text, the display device 200 converts the audio into semantic text and translates that text. It may first run a data-stream analysis on the audio data source to extract its valid audio information, and then obtain the target audio data specification, which states the relevant criteria for the audio data; the specification may be drawn up in connection with the specific translation content and is not limited in this application. With the specification in hand, the display device 200 extracts the valid audio information accordingly so as to obtain the semantic audio information within it. The semantic audio information is then divided into audio sentences of preset length, semantic recognition is performed on the audio sentences to generate their semantic text, and finally that semantic text is translated into the target language.
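The whole of step S3 can be summarized by the orchestration sketch below; every method on `device` is an illustrative name for the corresponding module described above, not an interface defined by these embodiments.

```python
def translate_audio_source(audio_source, device) -> list[str] | None:
    """End-to-end sketch of step S3: audio data source -> translated text."""
    valid = device.analyze_stream(audio_source)           # extract valid audio info
    spec = device.get_target_audio_spec()                 # target audio data specification
    if not device.meets_translation_standard(valid, spec):
        device.show_reminder("audio does not meet the translation standard")
        return None
    voice = device.extract_voice(valid, spec)             # purified semantic audio info
    sentences = device.split_into_audio_sentences(voice)  # preset-length audio sentences
    semantic_texts = [device.recognize(s) for s in sentences]
    return [device.translate(t, device.target_language) for t in semantic_texts]
```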
Step S4: if the display device 200 detects that the file category is a text file, or that the data category is a pure subtitle stream in a video file or a lyric stream in an audio file, it translates the text data source into translated text in the target language;
In some embodiments, the text data source comprises a text file, a pure subtitle stream or a lyric stream. To support this translation, a semantic text translation module may be provided in the display device 200; its function is to translate the text data source in the media file into translated text in the target language.
Step S5: the display device 200 controls the display 260 to display an image data source in the user interface, the image data source including the picture file, the pure image stream or the poster, and outputs the translated text in the user interface by voice and/or text.
To present the translated text of the target language in the user interface of the display apparatus 200, the display apparatus 200 may, once the audio or text of the source language has been translated, let the user select how the translated text is output; that is, after the user selects an output mode, the display apparatus 200 outputs the translated text in that mode.
When performing the step of outputting the translated text by voice and/or text, the display apparatus 200 may first obtain the translated text of the media file in the target language and then detect the output mode the user selected. In some embodiments, the output modes include a voice output mode, a text output mode and a voice-plus-text output mode. If the voice output mode is detected, the display apparatus 200 may convert the translated text into a voice translation and output it, or replace the original voice in the media file with the voice translation. If the text output mode is detected, the display device 200 may control the display 260 to show the translated text at a preset position, or replace the original text file, original subtitle stream or original lyric stream with the translated text. If the voice-plus-text output mode is detected, the display device 200 may do both: output the voice translation or replace the original voice with it, and show the translated text at the preset position or substitute it for the original text file, subtitle stream or lyric stream.
According to the technical scheme above, the media file translation method can translate a media file in a foreign language, or in another language the user cannot directly understand, into translated text of the target language, solving the problems that such a file cannot be translated, that its translation accuracy is low, and that the user wastes time and effort searching for subtitled videos in a language the user understands.
Identical or similar parts of the embodiments in this specification may be referred to one another and are not described again here.
It will be apparent to those skilled in the art that the techniques of the embodiments of the present invention may be implemented by software plus a necessary general-purpose hardware platform. Based on such understanding, the technical solutions of the embodiments of the present invention, or the parts contributing to the prior art, may essentially be embodied in the form of a software product stored in a storage medium such as ROM/RAM, a magnetic disk or an optical disk, including several instructions that cause a computer device (which may be a personal computer, a server, a network device or the like) to execute the methods of the embodiments, or of some parts of the embodiments, of the present invention.
Finally, it should be noted that the above embodiments are only intended to illustrate, not limit, the technical solutions of the present application. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments can still be modified, or some or all of their technical features can be replaced by equivalents, and such modifications and substitutions do not take the essence of the corresponding technical solutions outside the scope of the technical solutions of the embodiments of the present application.
The foregoing description has, for purposes of explanation, been presented in conjunction with specific embodiments. However, the illustrative discussions above are not intended to be exhaustive or to limit the embodiments to the precise forms disclosed. Many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles and the practical application, thereby enabling others skilled in the art to best utilize the various embodiments, with such modifications as suit the particular use contemplated.

Claims (10)

1. A display device, characterized by comprising:
a display configured to display a user interface;
A controller configured to:
detecting a file category of a media file in response to an opening event or state of an automatic translation function when the media file is played; the file categories comprise video files, audio files, picture files and text files;
if the file type is detected as the video file, video decoding is carried out on the video file; and performing streaming on the decoded video file to generate a first pure audio stream, a pure image stream, and a pure subtitle stream of the media file; if the file category is detected as the audio file, performing audio decoding on the audio file, and performing streaming on the decoded audio file to generate a second pure audio stream, a lyrics stream and a poster of the audio file;
if the data category is detected to be the first pure audio stream in the video file or the data category is detected to be the second pure audio stream in the audio file, converting the audio in an audio data source into semantic text, and translating the semantic text into translation text of a target language; the audio data source comprises the first pure audio stream or the second pure audio stream; if the file category is detected to be the text file, or the data category is the pure subtitle stream in the video file, or the data category is the lyrics stream in the audio file, translating a text data source into a translation text of a target language, wherein the text data source comprises the text file, the pure subtitle stream or the lyrics stream;
Controlling the display to display an image data source in a user interface, wherein the image data source comprises the picture file or the pure image stream or the poster, and outputting the translated text in the user interface in a voice and/or text mode.
2. The display device of claim 1, wherein the controller is further configured to:
creating an alternative language list supported by the display device;
acquiring a language to be selected in the candidate language list input by a user;
judging the number of the languages to be selected;
if the number of the languages to be selected is equal to 1, determining that the languages to be selected are target languages;
if the number of the languages to be selected is greater than 1, calculating a translation speed of translating the source language of the media file into each language to be selected, a frequency of use of each language to be selected, a user preference priority of each language to be selected, and an average translation accuracy of translating the source language into each language to be selected, and screening out the target language according to one or more of the translation speed, the frequency of use, the user preference priority and the average translation accuracy.
3. The display device of claim 1, wherein the controller is further configured to:
Setting a translation category of the media file, wherein the translation category comprises a translation audio category, a translation text category and an audio plus text translation category;
detecting a translation category of the media file input by a user;
if the translation category is detected to be the translation audio category, translating the audio data source in the media file into a translation text of a target language;
if the translation category is detected to be the translation text category, translating the text data source in the media file into a translation text of a target language;
if the translation category is detected to be the audio plus text translation category, the audio data source in the media file is translated and the text data source in the media file is translated into translation text in a target language.
4. The display device of claim 1, wherein the controller performs the steps of converting audio in the audio data source to semantic text and translating the semantic text to translated text in the target language, further configured to:
performing a data stream analysis on the audio data source to extract valid audio information in the audio data source;
Acquiring a target audio data specification;
extracting the effective audio information according to the target audio data specification to extract semantic audio information in the effective audio information;
dividing the semantic audio information into audio sentences with preset lengths;
executing semantic recognition on the audio statement to generate a semantic text of the audio statement;
and translating the semantic text of the audio sentence into a translated text of the target language.
5. The display device of claim 4, wherein the controller performs the step of extracting the valid audio information in accordance with the target audio data specification to extract semantic audio information in the valid audio information, further configured to:
detecting the effective audio information according to the target audio data specification;
if the effective audio information is detected to be not in accordance with the translation standard, controlling the display to display reminding information for reminding a user that the effective audio information is not in accordance with the translation standard;
and if the audio information is detected to be in accordance with the translation standard, extracting the voice information in the effective audio information, executing conversion on the voice information according to a preset format, and packaging the converted voice information.
6. The display device of claim 4, wherein the controller performs the step of dividing the semantic audio information into audio statements of a preset length, further configured to:
traversing the semantic audio information to obtain audio duration information in the semantic audio information;
detecting a continuous voice-free duration in the audio duration information;
if the continuous voice-free duration is detected to exceed a first duration threshold, marking a first mark at a playing position of the media file, and determining that a first punctuation mark exists at the position of the first mark;
if the continuous voice-free duration is detected to exceed a second duration threshold, marking a second mark at the playing position of the media file, and determining that a second punctuation mark exists at the position of the second mark; the second duration threshold is greater than the first duration threshold;
and dividing the semantic audio information into audio sentences with preset lengths according to the first punctuation marks and the second punctuation marks.
7. The display device of claim 1, wherein the controller performs the step of translating the text data source into translated text in the target language, further configured to:
Acquiring the text data source;
dividing the text data source into text sentences;
inputting the single-sentence text sentences into a text translation module to translate the text data source into translated text in the target language.
8. The display device of claim 1, wherein, when performing the step of outputting the translated text in the user interface in a voice and/or text manner, the controller is further configured to:
acquiring the translated text obtained by translating the media file into the target language;
detecting the output mode of the translated text selected by the user, the output modes comprising a voice output mode, a text output mode, and a voice-plus-text output mode;
if the output mode is detected to be the voice output mode, converting the translated text into a voice translation, and controlling the display to output the voice translation according to the voice output mode or replacing the original voice in the media file with the voice translation;
if the output mode is detected to be the text output mode, controlling the display to display the translated text at a preset position, or replacing the original text file, original subtitle stream, or original lyrics stream in the media file with the translated text;
and if the output mode is detected to be the voice-plus-text output mode, controlling the display to output the voice translation according to the voice output mode or replacing the original voice in the media file with the voice translation, and also controlling the display to display the translated text at a preset position or replacing the original text file, original subtitle stream, or original lyrics stream in the media file with the translated text.
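The three output modes reduce to two independent switches, as in this sketch; the mode names and the TTS/display callables are assumptions standing in for the device's actual interfaces:

```python
VOICE, TEXT, VOICE_AND_TEXT = "voice", "text", "voice+text"  # assumed names

def output_translation(mode, translated_text, synthesize, play, show_text):
    if mode in (VOICE, VOICE_AND_TEXT):
        # Voice path: synthesize a voice translation, then play it (it may
        # accompany or replace the original audio track).
        play(synthesize(translated_text))
    if mode in (TEXT, VOICE_AND_TEXT):
        # Text path: show the translated text at a preset position (or in
        # place of the original subtitles/lyrics).
        show_text(translated_text)

# Usage with print-based stand-ins for TTS and the display:
output_translation(VOICE_AND_TEXT, "Bonjour",
                   synthesize=lambda t: f"<speech:{t}>",
                   play=print, show_text=print)
```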
9. The display device of claim 1, wherein the controller is further configured to:
identifying the subtitle category of the pure subtitle stream, the subtitle category being a text subtitle or a picture subtitle;
if the subtitle category is the text subtitle, translating the text subtitle into the translated text;
and if the subtitle category is the picture subtitle, extracting the semantic information in the picture subtitle and translating the semantic information into the translated text.
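Picture subtitles (bitmap subtitle formats) carry no character data, so their text must first be recovered, typically by OCR; claim 9 only says "extracting the semantic information". A sketch under that OCR assumption, with `ocr_text` and `translate` as hypothetical stand-ins:

```python
def translate_subtitle(entry, target_language, ocr_text, translate):
    # entry: {"kind": "text", "text": ...} or {"kind": "picture", "bitmap": ...}
    if entry["kind"] == "text":
        return translate(entry["text"], target_language)
    if entry["kind"] == "picture":
        # Recover the semantic information from the bitmap before translating.
        return translate(ocr_text(entry["bitmap"]), target_language)
    raise ValueError(f"unknown subtitle category: {entry['kind']}")

# Usage with stand-in OCR and translation callables:
print(translate_subtitle({"kind": "text", "text": "Hi"}, "fr",
                         ocr_text=lambda bmp: "decoded",
                         translate=lambda t, lang: f"[{lang}] {t}"))
```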
10. A method of translating a media file, applied to the display device of any one of claims 1-9, the display device comprising a display and a controller, the method comprising:
detecting a file category of a media file in response to an automatic translation function being turned on, or being in an on state, when the media file is played, the file categories comprising video files, audio files, picture files, and text files;
if the file category is detected to be the video file, performing video decoding on the video file, and splitting the decoded video file to generate a first pure audio stream, a pure image stream, and a pure subtitle stream of the media file; if the file category is detected to be the audio file, performing audio decoding on the audio file, and splitting the decoded audio file to generate a second pure audio stream, a lyrics stream, and a poster of the audio file;
if the data category is detected to be the first pure audio stream in the video file, or the second pure audio stream in the audio file, converting the audio in an audio data source into a semantic text and translating the semantic text into a translated text of a target language, the audio data source comprising the first pure audio stream or the second pure audio stream;
if the file category is detected to be the text file, or the data category is the pure subtitle stream in the video file or the lyrics stream in the audio file, translating a text data source into the translated text of the target language, the text data source comprising the text file, the pure subtitle stream, or the lyrics stream;
and controlling the display to display an image data source in a user interface, the image data source comprising the picture file, the pure image stream, or the poster, and outputting the translated text in the user interface in a voice and/or text manner.
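Pulling the method together, this last sketch shows the category-to-stream routing of claim 10; the media files are mocked as dicts whose decoded streams are already attached, since real decoding and demultiplexing are codec-specific:

```python
def split_streams(media):
    # Route a decoded media file to its stream set by file category.
    cat = media["category"]
    if cat == "video":
        # video file -> first pure audio stream + pure image stream
        #               + pure subtitle stream
        return {"audio": media["audio"], "image": media["frames"],
                "text": media["subtitles"]}
    if cat == "audio":
        # audio file -> second pure audio stream + lyrics stream + poster
        return {"audio": media["audio"], "image": media["poster"],
                "text": media["lyrics"]}
    if cat == "text":
        return {"text": media["content"]}   # text file: translate directly
    return {"image": media["content"]}      # picture file: display only

# "audio" entries feed the speech pipeline of claim 4, "text" entries feed
# the per-sentence translation of claim 7, and "image" entries are shown
# unchanged in the user interface.
print(split_streams({"category": "text", "content": "Hello there."}))
```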
CN202310553341.0A (filed 2023-05-16, priority 2023-05-16) | Display device and translation method of media file | Pending | CN117812434A (en)

Priority Applications (1)

Application Number: CN202310553341.0A | Priority Date: 2023-05-16 | Filing Date: 2023-05-16 | Title: Display device and translation method of media file

Applications Claiming Priority (1)

Application Number: CN202310553341.0A | Priority Date: 2023-05-16 | Filing Date: 2023-05-16 | Title: Display device and translation method of media file

Publications (1)

Publication Number: CN117812434A | Publication Date: 2024-04-02

Family

ID: 90425586

Family Applications (1)

Application Number: CN202310553341.0A | Title: Display device and translation method of media file | Priority Date: 2023-05-16 | Filing Date: 2023-05-16

Country Status (1)

Country: CN | Link: CN117812434A (en)


Legal Events

Code: PB01 | Description: Publication
Code: SE01 | Description: Entry into force of request for substantive examination