WO2020124754A1 - Multimedia file translation method and apparatus, and translation playback device - Google Patents

Multimedia file translation method and apparatus, and translation playback device

Info

Publication number
WO2020124754A1
WO2020124754A1 · PCT/CN2019/073767 · CN2019073767W
Authority
WO
WIPO (PCT)
Prior art keywords
file
voice
original
new
multimedia
Prior art date
Application number
PCT/CN2019/073767
Other languages
English (en)
French (fr)
Inventor
郑勇
孙俊
王文祺
杨汉丹
杜志华
温平
王辉
Original Assignee
深圳市沃特沃德股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳市沃特沃德股份有限公司 filed Critical 深圳市沃特沃德股份有限公司
Publication of WO2020124754A1 publication Critical patent/WO2020124754A1/zh

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/04 - Segmentation; Word boundary detection
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/30 - Semantic analysis
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/26 - Speech to text systems
    • G - PHYSICS
    • G11 - INFORMATION STORAGE
    • G11B - INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B20/00 - Signal processing not specific to the method of recording or reproducing; Circuits therefor
    • G11B20/10 - Digital recording or reproducing
    • G11B20/12 - Formatting, e.g. arrangement of data block or words on the record carriers
    • G11B20/1217 - Formatting, e.g. arrangement of data block or words on the record carriers on discs
    • G11B20/1251 - Formatting, e.g. arrangement of data block or words on the record carriers on discs for continuous data, e.g. digitised analog information signals, pulse code modulated [PCM] data
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H - ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2240/00 - Data organisation or data communication aspects, specifically adapted for electrophonic musical tools or instruments
    • G10H2240/011 - Files or data streams containing coded musical information, e.g. for transmission
    • G10H2240/046 - File format, i.e. specific or non-standard musical file format used in or adapted for electrophonic musical instruments, e.g. in wavetables
    • G10H2240/071 - Wave, i.e. Waveform Audio File Format, coding, e.g. uncompressed PCM audio according to the RIFF bitstream format method

Definitions

  • The invention relates to the field of computer technology, and in particular to a multimedia file translation method and apparatus, and a translation playback device.
  • Prompt information may consist of characters in different languages. For example, if the user is a native Chinese speaker with poor English and the song is in English, the information the song provides to the user is limited even when the music player can display the English lyrics.
  • Multimedia audiovisual materials on the market are manually translated into different languages; the translated subtitles are then superimposed on the video picture, and the audio part is likewise dubbed manually and synchronized with the video picture.
  • Without prior human translation, multimedia video materials can only play subtitles and voices in their original languages, and it is then difficult for users to understand the playback content.
  • The main purpose of the present invention is to provide a multimedia file translation method and apparatus, and a translation playback device, which aim to solve the problem that users cannot understand or recognize video or audio content in other languages in a multimedia file.
  • the present invention proposes a multimedia file translation method, including:
  • the invention also provides a multimedia file translation device, including:
  • an acquisition module for acquiring an original voice file in a multimedia file;
  • a translation module for translating the original voice file to obtain a new voice file, where the language in the new voice file is a specified language;
  • a configuration module for configuring the loading attributes of the new voice file so that the new voice file is loaded synchronously when the multimedia file is played.
  • A translation playback device includes a memory, a processor, and an application program, where the application program is stored in the memory, is configured to be executed by the processor, and is configured to execute any one of the methods described above.
  • A multimedia file translation method obtains the original voice file in a multimedia file, translates it to obtain a new voice file in a specified language, and configures the loading attributes of the new voice file so that the new voice file is loaded synchronously when the multimedia file is played. An original voice file can thus be automatically converted into voice files in other languages without manual translation, which helps users understand and recognize the audio and video content in multimedia files in a better and more timely manner.
  • FIG. 1 is a schematic flowchart of a multimedia file translation method according to an embodiment of the invention
  • FIG. 2 is a schematic block diagram of a partial structure of a multimedia file translation apparatus according to an embodiment of the present invention
  • FIG. 3 is a schematic diagram of marking and detection on an audio file according to an embodiment of the invention.
  • An embodiment of the present invention provides a method for translating multimedia files. As shown in FIG. 1, the method includes the following steps:
  • the above method is applied to a translation playback device.
  • the above translation playback device is generally a video translation player, an audio translation player, and other intelligent translation playback devices.
  • a video translation player is used as an example for explanation, which has the function of playing video files and audio.
  • the multimedia files include original voice files, video files and header files.
  • the original voice file includes at least one of the first original voice file (that is, the original audio file) or the second original voice file (that is, the original voice text file).
  • This embodiment is described taking a file that includes both of the above two voice files as an example. It is worth mentioning that the original voice text file is the subtitle file in a multimedia file, and the original audio file is the sound file.
  • the original voice file may be an original voice file in a language other than the user's native language.
  • The new voice file is a file in a specified language that the user wants to view, which need not be the user's native language.
  • The new voice file includes the second new voice file (new audio file) and the first new voice file (new voice text file).
  • In step S30, after the new voice file is obtained, the new voice file must be loaded synchronously during playback so that the user can understand the content of the multimedia file.
  • the new voice file includes a new audio file and a new voice text file.
  • the original voice file includes the original audio file and the original voice text file.
  • Various display modes and usages are possible, which can further improve the user's learning and understanding of the video file. For example, displaying the new voice text file can help the user understand the multimedia file; playing the new voice text file and the original voice text file simultaneously can further help the user learn and recognize the language and pronunciation in the multimedia file; playing the new audio file can help the user understand the video file; and playing the new voice text file, the original voice text file, and the original audio file together can help the user learn and recognize the pronunciation in the multimedia file.
  • the above multimedia file includes an original audio file
  • the step S10 of obtaining the original voice text file in the multimedia file includes:
  • the voice segment between the start point and the end point of the voice of each character is used as an original audio file, where the original audio file is an original voice file.
  • the original voice file contains multiple audio objects, such as background noise, people's voices, or the sounds of animals and plants.
  • Voice activity detection (VAD) technology detects the endpoints of a person's voice in the audio file. Since a speaker does not emit sound continuously, the detected start point and end point of each voice signal delimit one original audio file.
  • The original audio files include continuous original audio files detected when multiple characters speak individually, that is, one person speaking continuously yields one original audio file and the next person speaking yields another.
  • They also include a continuous original audio file formed by combining the voices of multiple characters speaking at the same time.
  • The voice segments of each character differ. When a single person speaks, the original voice file is composed of one original audio file; because a single speaker's timbre and pitch vary little, the speech is easier to detect, the detected original audio file is more accurate, and the marked start and end points of the speech are less prone to error.
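The patent names VAD but does not specify an algorithm. The following is a minimal illustrative sketch, assuming a simple short-time-energy rule; the frame size, threshold, and function name are assumptions, not taken from the patent.

```python
# Minimal energy-based voice activity detection (VAD) sketch.
# A frame counts as speech when its mean squared amplitude exceeds a
# threshold; contiguous voiced frames form one segment, reported as
# (start_sample, end_sample), matching the patent's idea of marking
# the start and end points of each voice signal.

def detect_segments(samples, frame_len=160, threshold=0.01):
    """Return [(start_sample, end_sample), ...] of voiced regions."""
    segments = []
    in_speech = False
    start = 0
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[i:i + frame_len]
        energy = sum(s * s for s in frame) / frame_len
        if energy > threshold and not in_speech:
            in_speech, start = True, i
        elif energy <= threshold and in_speech:
            in_speech = False
            segments.append((start, i))
    if in_speech:
        segments.append((start, len(samples)))
    return segments

if __name__ == "__main__":
    # 1600 samples of silence, 1600 of "speech", 1600 of silence
    sig = [0.0] * 1600 + [0.5] * 1600 + [0.0] * 1600
    print(detect_segments(sig))  # [(1600, 3200)]
```

A production player would more likely use a trained VAD (e.g. the WebRTC VAD) rather than a fixed energy threshold, which fails under loud background noise.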
  • the original voice file also includes an original voice text file.
  • the step S10 of obtaining the original voice file in the multimedia file includes:
  • The original voice text file is the subtitle file among the original voice files. Since the original file contains one or both of the original voice text file and the original audio file, this embodiment converts the original audio file into an original voice text file through the above steps when only an original audio file is present. This solves the problem that speech in the original voice file may be too fast or contain non-standard pronunciation that is difficult for users to understand by sound alone; the user can then rely on the original voice text file for a preliminary understanding, further improving comprehension of the video file. If the original voice file already includes a voice text file, it can be obtained directly, saving acquisition time.
  • PCM (Pulse Code Modulation) recording converts analog signals such as sound into symbolized pulse trains before recording them.
  • The PCM signal is a digital signal composed of symbols such as 1 and 0. Compared with analog signals, it is less susceptible to interference and distortion from the transmission system, has a wide dynamic range, and delivers quite good sound quality. Moreover, the PCM track is separate from the video track and can be used for post-recording (dubbing).
  • An audio file in PCM format is a binary sequence formed from an analog audio signal through analog-to-digital conversion (A/D conversion), and the video translation player can decode it accurately.
  • the initial format of the original audio file includes multiple formats, such as PCM, WMV, MP4, DAT, and RM.
  • the format of the audio file parsed in this embodiment is preferably the PCM format.
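The A/D step behind PCM can be sketched as follows. This is an illustrative example only: the 16-bit depth, the clipping rule, and the `to_pcm16` name are assumptions; the patent says only that the player prefers PCM input.

```python
# Sketch of PCM quantization: analog-style float samples in [-1.0, 1.0]
# become little-endian 16-bit signed integers, the usual payload of an
# uncompressed PCM (e.g. WAV/RIFF) stream.

import struct

def to_pcm16(samples):
    """Quantize float samples to a bytes object of little-endian int16."""
    ints = []
    for s in samples:
        s = max(-1.0, min(1.0, s))   # clip to the valid analog range
        ints.append(int(s * 32767))  # scale to the int16 range
    return struct.pack("<%dh" % len(ints), *ints)

if __name__ == "__main__":
    data = to_pcm16([0.0, 0.5, -1.0, 1.0])
    print(struct.unpack("<4h", data))  # (0, 16383, -32767, 32767)
```

Converting the other container formats the patent lists (WMV, MP4, DAT, RM) to PCM additionally requires demuxing and codec decoding, which a real player would delegate to a media library.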
  • The above step S20 of translating the original voice file to obtain a new voice file whose language is the specified language includes:
  • the first new voice file is a new voice text file, that is, a translated subtitle file, and the user can understand the content of the video file through the translated subtitle to facilitate understanding.
  • The above step S20 of translating the original voice file to obtain a new voice file whose language is the specified language includes:
  • The above new audio file can be obtained by internal conversion from the new voice text file (subtitles), and multiple segments of new audio can be synthesized to obtain a complete new voice file.
  • During playback, the corresponding new audio file is played at the playback time of the original audio file. When playing the multimedia file, the original voice file and original voice text file can be replaced entirely by the new voice file and new voice text file, or only part of the original audio file can be replaced by the corresponding part of the new audio file, which is not described in detail here.
  • Playing the new audio file, the new voice text file, and the original voice text file can further help the user fully understand the video content. When the video translation player plays the original audio file, the new voice text file, and the original voice text file, the user can watch the video, the original voice text file (original subtitles), the new voice text file (translated subtitles), and the synchronously displayed picture (watching the speaker's mouth), which helps the user learn a new language.
  • step S30 of loading the new voice file synchronously includes:
  • the playback selection signal may be sent by the user or automatically selected by the video player.
  • The user may choose to play one or more files according to his or her grasp of the language in the multimedia file or personal interests, improving the user experience. If the user has a good grasp of the language in the original audio file, he or she can choose to play the original audio file and the new voice text file, improving listening ability in that language while watching. If the user's grasp of the language is weaker, he or she can choose to play the original audio file, the original voice text file, and the new voice text file, and use the original audio file, the original subtitles, the translated subtitles, and the synchronously displayed video (watching the speaker's mouth) to learn the utterances, sentences, and semantics of the language. When the user does not want to learn the language and only wants to understand the content of the multimedia file, he or she can choose to play the new audio file (translated voice) and the new voice text file to fully understand the multimedia content.
  • the step S30 of configuring the loading attributes of the new voice file to enable the new voice file to be synchronously loaded when the multimedia file is played further includes:
  • the corresponding original audio file is selected to be played.
  • the start time and end time of the new voice text file are the start time and end time of the original voice file
  • By default the video translation player plays the original audio file, the original voice text file, and the new voice text file.
  • The video translation player can also accept the user's choice: from the files it can play (original audio file, original voice text file, and new voice text file), the user selects only one or a few to play. Alternatively, if the play length of an original audio file is greater than the length of the new audio file synthesized after translation, the start time of each translated new audio file can be matched to the start time of the original audio file, and the video player automatically chooses to output the new audio file instead of the original audio file.
  • The display start time of the new voice text file (subtitles in the translated language) is the start time of the corresponding original voice text file (original subtitles), and its end time is the end time of the corresponding original voice text file.
  • the video translation player will choose to play the new audio file
  • The loading attributes are obtained by parsing the original voice file and the new audio file, and record loading time and playback information.
  • The multimedia file consists of multiple parts: the original voice file, the video file, and the header file. Before the video file is played, the header file is processed first.
  • The video file therefore has a synchronization time relative to the start of multimedia playback, that is, the time taken to play the header file.
  • There are generally K original voice files and M video files in the multimedia file.
  • An original voice file includes a plurality of original audio files and the intervals between them. The start time Ts11 and end time Te11 of each original audio file's playback are marked (see FIG. 3): on one time axis, starting from its beginning, the first voice segment spans Ts11 to Te11, the second spans Ts12 to Te12, and so on.
  • Adding the start time Ts11 and end time Te11 of each file to the synchronization time Toffset1 (obtained from header parsing, relative to multimedia playback) gives the playback times of the N first original audio files relative to the system: Toffset1+Ts11, Toffset1+Ts12, ..., Toffset1+Ts1n. After processing the first original audio file set, the other K-1 original audio file sets are processed separately to obtain the information of all original audio files in the K voice files and their playing times relative to the multimedia file, as shown below:
  • Time information of the original audio file of the first voice file Toffset1+Ts11, Toffset1+Ts12,..., Toffset1+Ts1n;
  • Time information of the original audio file of the second voice file Toffset2+Ts21, Toffset2+Ts22,..., Toffset2+Ts2n;
  • Time information of the original audio files of the Kth voice file: Toffsetk+Tsk1, Toffsetk+Tsk2, ..., Toffsetk+Tskl, where Toffsetk is the playback time of the Kth voice file relative to the system, Tsk1 is the start time of the first voice segment of the Kth original voice file, and Tskl is the start time of its last voice segment L. The multimedia file can thus be regarded as containing Y audio segments, whose start and end times relative to the system playback time are recorded.
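The timing bookkeeping above can be sketched directly: each voice file k carries a header offset Toffset and a list of segment times, and absolute playback times are offset plus segment time. The function name and data layout here are illustrative assumptions.

```python
# Sketch of the Toffset + Ts arithmetic in the time table above:
# segment times are relative to their own track; adding the track's
# header synchronization offset yields times relative to the system.

def absolute_times(toffset, segments):
    """segments: [(Ts, Te), ...] relative to the track.
    Returns [(Toffset + Ts, Toffset + Te), ...] relative to the system."""
    return [(toffset + ts, toffset + te) for ts, te in segments]

if __name__ == "__main__":
    # Track 1 starts 2.0 s into the multimedia file (header offset)
    print(absolute_times(2.0, [(0.5, 1.5), (3.0, 4.5)]))
    # [(2.5, 3.5), (5.0, 6.5)]
```

Applying this per track and concatenating the results gives the flat list of Y segments with system-relative start and end times that the text describes.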
  • The start time and end time of the new voice text file are the start time and end time of the corresponding original audio file. Let the Zth segment of the Nth voice file have start time ToffsetN+TSZ and end time ToffsetN+TEZ. The starting point of the corresponding video frame is ToffsetN, and the number of picture frames in which the subtitle continuously appears is (TEZ - TSZ) × video frame rate, where the frame rate is determined by the multimedia file's codec format, for example 30 frames per second.
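The frame-count rule above is a single multiplication. A minimal sketch, with the rounding to whole frames being an assumption (the patent states only the product):

```python
# Sketch of the subtitle duration rule: a caption spanning [ts, te]
# seconds at a given video frame rate covers (te - ts) * frame_rate
# picture frames.

def subtitle_frames(ts, te, frame_rate=30):
    """Number of video frames a subtitle spanning [ts, te] seconds covers."""
    return round((te - ts) * frame_rate)

if __name__ == "__main__":
    # A caption shown from 10.0 s to 12.5 s at 30 fps
    print(subtitle_frames(10.0, 12.5))  # 75
```

The same computation serves the variant described next, where the duration is the synthesized audio length Tr rather than the original subtitle interval.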
  • Alternatively, the start time of the new voice text file is the start time of the original voice text file, and its cut-off time is the cut-off time of the corresponding speech-synthesized voice segment. Let the start time of the Zth voice segment of the Nth audio file be ToffsetN+TSZ, and let the duration of the new audio file corresponding to the original audio file be Tr. The starting point of the corresponding video frame is ToffsetN, and the number of picture frames in which the subtitle continuously appears is Tr × video frame rate, where the frame rate is determined by the multimedia file's codec format, for example 30 frames per second.
  • The new audio file and the new voice text file can replace the original audio file and the original voice text file in whole or in part. The specific choice can be made by the user, or the system can select playback automatically to improve the user experience.
  • After the above step S20 of translating the original voice file to obtain a new voice file whose voice is in the specified language, the method further includes:
  • the search information being text or sentences in any one of the original audio file, the original speech text file, the new speech text file, and the new audio file;
  • A multimedia file translation method obtains the original voice file in a multimedia file, translates it to obtain a new voice file in a specified language, and configures the loading attributes of the new voice file so that the new voice file is loaded synchronously when the multimedia file is played. An original voice file can thus be automatically converted into a voice file in another language without manual translation, helping users understand and recognize the multimedia file in a better and more timely manner.
  • a multimedia file translation device including:
  • the obtaining module 10 is used to obtain the original voice file in the multimedia file
  • the translation module 20 is configured to translate the original voice file to obtain a new voice file, and the language in the new voice file is a specified language;
  • the configuration module 30 is configured to configure the loading attributes of the new voice file so that the multimedia file loads the new voice file synchronously during playback.
  • the above device is applied to a translation playback device.
  • the above translation playback device is generally a video translation player, an audio translation player, and other intelligent translation playback devices.
  • A video translation player is used as an example for explanation; it has the functions of playing video and audio files, displaying subtitles, and so on.
  • the multimedia files include original voice files, video files and header files.
  • The obtaining module 10 obtains the original voice file, which includes at least one of the first original voice file (that is, the original audio file) or the second original voice file (that is, the original voice text file). For better illustration, this embodiment is described taking a file that includes both voice files as an example.
  • the original voice text file is a subtitle file in a multimedia file
  • the original audio file is a sound file.
  • The original voice file may be in a language other than the user's native language.
  • The new voice file need not be in the user's native language; it may be a file in the specified language that the user wants to view, produced by the translation module 20, where the new voice file includes the second new voice file (new audio file) and the first new voice file (new voice text file).
  • The configuration module 30 configures the new voice file to be loaded synchronously during playback so that the user can understand the content of the multimedia file.
  • the new voice file includes a new audio file and a new voice text file.
  • the original voice file includes the original audio file and the original voice text file.
  • Various display modes and usages are possible, which can further improve the user's learning and understanding of the video file. For example, displaying the new voice text file can help the user understand the multimedia file; playing the new voice text file and the original voice text file simultaneously can further help the user learn and recognize the language and pronunciation in the multimedia file; playing the new audio file can help the user understand the video file; and playing the new voice text file, the original voice text file, and the original audio file together can help the user learn and recognize the pronunciation in the multimedia file.
  • the acquisition module 10 includes:
  • a first detection unit configured to detect the start and end points of the voice of each character in the multimedia file
  • the determining unit is configured to use a voice segment between the start point and the end point of the voice of each character as an original audio file, where the original audio file is a first original voice file.
  • the original voice file contains multiple audio objects, such as background noise, people's voices, or sounds of animals and plants.
  • Voice activity detection (VAD) technology detects the endpoints of a person's voice in the audio file. Since a speaker does not emit sound continuously, the detected start point and end point of each voice signal delimit one original audio file.
  • the above acquiring module 10 also includes:
  • the first conversion unit is configured to convert the original audio file into the original voice text file, wherein the original voice text file is a second original voice file.
  • The original voice text file is the subtitle file among the original voice files. Since the original file contains one or both of the original voice text file and the original audio file, in this embodiment, when only an original audio file is present, the first conversion unit converts the original audio file into an original voice text file. This solves the problem that speech in the original voice file may be too fast or contain non-standard pronunciation that is difficult to understand by sound alone; the user can then rely on the original voice text file for a preliminary understanding, further improving comprehension of the video file. If the original voice file already includes a voice text file, it can be obtained directly, saving acquisition time.
  • the above obtaining module 10 also includes:
  • a second detection unit configured to detect the original audio file format
  • a first judging unit used to judge whether the original audio file format is a PCM format
  • a second conversion unit for converting the original audio file format to the PCM format when the judgment is negative.
  • the first judgment unit is used to determine whether the format of the original audio file detected by the second detection unit is the PCM format.
  • If not, the video translation player converts the detected original audio file into a PCM-format voice file through the second conversion unit. PCM (Pulse Code Modulation) recording converts analog signals such as sound into symbolized pulse trains, which are then recorded.
  • The PCM signal is a digital signal composed of symbols such as 1 and 0. Compared with analog signals, it is less susceptible to interference and distortion from the transmission system, has a wide dynamic range, and delivers quite good sound quality. Moreover, the PCM track is separate from the video track and can be used for post-recording (dubbing).
  • the audio file in PCM format is a binary sequence formed by analog audio signal through analog-to-digital conversion (A/D conversion), and the video translation player can accurately decode it.
  • the initial format of the original audio file includes multiple formats, such as PCM, WMV, MP4, DAT, and RM.
  • the format of the audio file parsed in this embodiment is preferably the PCM format.
  • the above translation module 20 includes:
  • the translation unit is configured to translate the original voice text file to obtain a translated new voice text file, and the new voice text file is the first new voice file.
  • the first new voice file is a new voice text file, that is, a subtitle file translated by the translation unit, and the user can understand the content of the video file through the translated subtitle to facilitate understanding.
  • the above translation module 20 also includes:
  • the synthesizing unit is used to synthesize each of the new voice text files to obtain a new audio file, and the new audio file is a second new voice file.
  • The above new audio file can be converted from the new voice text file (subtitles) by the internal synthesis unit; multiple segments of new audio can be synthesized to obtain a complete new voice file.
  • The corresponding new audio file is played at the playback time of the original audio file. When playing multimedia files, the new voice file and the new voice text file can replace the original voice file and the original voice text file entirely, or part of the new audio file in the new voice file can replace the corresponding part of the original audio file.
  • the above multimedia file translation device also includes:
  • a first receiving module for receiving a play selection signal, where the play selection signal selects one or more of the new voice text file and the original voice text file, and selects at least one of the original audio file and the new audio file, to be played;
  • the first playing module is used for playing according to the playing selection signal.
  • The above play selection signal may be sent by the user or selected automatically by the video player, and the first playing module plays according to the play selection signal received by the first receiving module. The user may choose to play one or more files according to his or her mastery of the language in the multimedia file or personal interests, improving the user experience.
  • the above multimedia file translation device also includes:
  • a time obtaining module configured to obtain the playing time length of each of the original audio files and the corresponding playing time length of each of the new audio files
  • the judgment module is used to judge whether the playing time length of each original audio file is greater than the playing time length of the corresponding new audio file;
  • a selection module for deciding to play the corresponding new audio file if the original is longer, and to play the original audio file otherwise.
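The selection rule in the three modules above reduces to a per-segment comparison. A minimal sketch, where the function name and the choice to fall back to the original on equal lengths are assumptions:

```python
# Sketch of the selection module's rule: play the translated (new)
# audio for a segment only when the original segment is longer, so the
# synthesized speech fits inside the original slot; otherwise keep the
# original audio.

def choose_track(original_len, new_len):
    """Return 'new' if the original segment is strictly longer than the
    synthesized translation, else 'original'."""
    return "new" if original_len > new_len else "original"

if __name__ == "__main__":
    pairs = [(4.0, 3.5), (2.0, 2.6), (3.0, 3.0)]
    print([choose_track(o, n) for o, n in pairs])
    # ['new', 'original', 'original']
```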
  • the above multimedia file translation device also includes:
  • a second receiving module configured to receive search information, the search information being text or sentences in any one of the original audio file, the original voice text file, the new voice text file, and the new audio file;
  • the second playing module is configured to play the search result corresponding to the search signal.
  • The second receiving module receives, as search input, speech that has appeared in the original audio file or the new audio file, or a key sentence or a certain text that has appeared in the original voice text file or the new voice text file.
  • The second playing module then performs a character-by-character search over all the original voice text files, original audio files, translated new voice text files, and new audio files, matching each against the search input. It obtains the voice text files corresponding to the key sentences and words, together with the position information of the corresponding voice segment files (that is, original audio files or new audio files) and video frames within the multimedia files, and plays the multimedia file fragments corresponding to the key sentences and words.
  • For example, the retrieval finds one or more original voice text files and corresponding video files containing "I".
  • The user can choose to play the video file he or she wants to watch, including the translated video file, which makes it convenient for the user to locate content accurately.
  • a multimedia file translation apparatus obtains the original voice file from a multimedia file, translates it to obtain a new voice file in a specified language, and configures the loading attributes of the new voice file so that it is loaded synchronously when the multimedia file is played; this automatically converts an original voice file into a voice file in another language without manual translation, helping users understand and recognize the multimedia file better and in a timely manner.
  • a translation playback device includes a memory, a processor, and an application program, where the application program is stored in the memory and configured to be executed by the processor, and the application program is configured to perform any of the methods described above.
  • Translation playback devices include intelligent translation playback devices such as video translation players and language learning machines.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Theoretical Computer Science (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)
  • Machine Translation (AREA)

Abstract

The present invention discloses a translation method and apparatus for multimedia files, and a translation playback device. The method includes: obtaining an original voice file from a multimedia file; translating the original voice file to obtain a new voice file, where the language of the new voice file is a specified language; and configuring the loading attributes of the new voice file so that the new voice file is loaded synchronously when the multimedia file is played. This achieves automatic translation of the original voice file in a multimedia file.

Description

Translation method and apparatus for multimedia files, and translation playback device
Technical Field
The present invention relates to the field of computer technology, and in particular to a translation method and apparatus for multimedia files, and a translation playback device.
Background Art
With the rapid development of computer technology, more and more users play multimedia files with players. When a multimedia file is played, the prompt information corresponding to it usually needs to be displayed: when playing a song, the user may want the corresponding lyrics displayed at the same time, and when watching a movie, the user may want the corresponding subtitles displayed. The prompt information may be characters of different languages. For instance, if the user is a native Chinese speaker with poor English and the song is in English, the information the song can provide to the user is limited even if the music player can display the English lyrics.
Multimedia audio-visual materials currently on the market are first translated into different languages manually, after which the subtitles are overlaid onto the video frames; the audio is likewise synchronized to the video frames through manual translation first. This means that when a user watches multimedia material in another language that has not been translated manually beforehand, the material can only play subtitles and speech in that other language, and it is then difficult for the user to understand the played content.
Technical Problem
The main object of the present invention is to provide a translation method and apparatus for multimedia files, and a translation playback device, aiming to solve the problem that users cannot understand or recognize video or audio content in other languages in multimedia files.
Technical Solution
To achieve the above object, the present invention proposes a translation method for multimedia files, including:
obtaining an original voice file from a multimedia file;
translating the original voice file to obtain a new voice file, where the language of the new voice file is a specified language;
configuring the loading attributes of the new voice file so that the new voice file is loaded synchronously when the multimedia file is played.
The present invention also provides a translation apparatus for multimedia files, including:
an obtaining module, configured to obtain an original voice file from a multimedia file;
a translation module, configured to translate the original voice file to obtain a new voice file, where the language of the new voice file is a specified language;
a configuration module, configured to configure the loading attributes of the new voice file so that the new voice file is loaded synchronously when the multimedia file is played.
A translation playback device includes a memory, a processor, and an application program, where the application program is stored in the memory and configured to be executed by the processor, and the application program is configured to perform any of the methods described above.
Beneficial Effects
In the translation method for multimedia files of the embodiments of the present invention, the original voice file in a multimedia file is obtained and translated into a new voice file in the specified language, and the loading attributes of the new voice file are configured so that the new voice file is loaded synchronously when the multimedia file is played. An original voice file is thus automatically converted into a voice file in another language without manual translation, helping users understand and recognize the audio and video content in multimedia files better and in a timely manner.
Brief Description of the Drawings
Fig. 1 is a schematic flowchart of a translation method for multimedia files according to an embodiment of the present invention;
Fig. 2 is a schematic block diagram of part of a translation apparatus for multimedia files according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of an audio file with marked detection according to an embodiment of the present invention.
The realization of the object, the functional features, and the advantages of the present invention will be further described with reference to the accompanying drawings in conjunction with the embodiments.
Best Mode for Carrying Out the Invention
An embodiment of the present invention provides a translation method for multimedia files, as shown in Fig. 1, including the steps of:
S10. obtaining an original voice file from a multimedia file;
S20. translating the original voice file to obtain a new voice file, where the language of the new voice file is a specified language;
S30. configuring the loading attributes of the new voice file so that the new voice file is loaded synchronously when the multimedia file is played.
The above method is applied to a translation playback device, which is generally an intelligent translation playback device such as a video translation player or an audio translation player. This embodiment takes a video translation player as an example; it can play video files and audio files and display subtitles. As described in step S10, the multimedia file includes an original voice file, a video file, a header file, and so on. The original voice file includes at least one of a first original voice file (i.e., an original audio file) or a second original voice file (i.e., an original voice text file). For better illustration, this embodiment is described with both kinds of voice file included. It is worth mentioning that the original voice text file is the subtitle file in the multimedia file, and the original audio file is the sound file.
As described in step S20, the original voice file may be in a language other than the user's native language; the new voice file need not be in the user's native language either, and may be a file in the specified language that the user wants to watch. The new voice file includes a second new voice file (a new audio file) and a first new voice file (a new voice text file).
As described in step S30, after the new voice file is obtained, it needs to be loaded synchronously during playback so that the user can understand the content of the multimedia file.
In this embodiment, the new voice file includes a new audio file and a new voice text file, and the original voice file includes an original audio file and an original voice text file; these can be displayed and used in various combinations, further improving the user's learning and understanding of the video file. For example, displaying the new voice text file helps the user understand the multimedia file; playing the new voice text file and the original voice text file together further helps the user learn and recognize the language and pronunciation in the multimedia file. Playing the new audio file helps the user understand the video file, while playing the new voice text file, the original voice text file, and the original audio file helps the user learn and recognize the pronunciation in the multimedia file.
The multimedia file includes an original audio file, and step S10 of obtaining the original voice file from the multimedia file includes:
detecting the start point and end point of each character's speech in the original audio file;
taking the speech segment between the start point and the end point of each character's speech as an original audio file, where the original audio file is the original voice file.
In this embodiment, the original voice file contains multiple audio objects, such as background noise, character speech, or sounds made by animals and plants. During detection, only the characters' speech signals are detected; background noise, gunshots, or sounds made by animals and plants are not. Voice Activity Detection (VAD) technology is used to detect the endpoints of character speech in the audio file. Since sound is not emitted continuously throughout an audio file, the detected speech start point and end point delimit a continuous segment of the original voice file, which constitutes one original audio file (i.e., a first original voice file). An original audio file may be a continuous segment detected while one character speaks alone (one person's continuous utterance is one original audio file, and the next person's utterance is another), or a continuous segment formed by several characters' speech overlapping. In this embodiment, preferably, the characters' speech segments do not overlap, i.e., the speech of a single speaker forms one original audio file. Because a single speaker's timbre and pitch vary little, detection is easier and more accurate, and the marked start and end points of the speech are free of error.
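The endpoint detection described above can be sketched with a minimal energy-based VAD. This is an illustrative simplification, assuming 16-bit mono PCM samples and a fixed energy threshold; a real player would typically use a dedicated VAD model rather than a plain threshold.

```python
# Minimal sketch of energy-based voice activity detection (VAD): mark the
# start point where frame energy rises above a threshold, and the end point
# where it falls back below it.

def detect_speech_segments(samples, frame_len=160, threshold=500):
    """Return (start, end) sample indices of contiguous voiced regions."""
    segments, start = [], None
    for i in range(0, len(samples), frame_len):
        frame = samples[i:i + frame_len]
        energy = sum(abs(s) for s in frame) / max(len(frame), 1)
        voiced = energy >= threshold
        if voiced and start is None:
            start = i                      # speech start point
        elif not voiced and start is not None:
            segments.append((start, i))    # speech end point
            start = None
    if start is not None:
        segments.append((start, len(samples)))
    return segments

# Silence, then a loud burst, then silence again -> one detected segment.
audio = [0] * 320 + [2000] * 320 + [0] * 320
print(detect_speech_segments(audio))  # [(320, 640)]
```

Each returned pair corresponds to one original audio file in the terminology above; dividing the indices by the sample rate gives the start time Ts and end time Te of the segment.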
The original voice file also includes an original voice text file, and step S10 of obtaining the original voice file from the multimedia file includes:
converting the original audio file into the original voice text file, where the original voice text file is the original voice file.
As described above, the original voice text file is the subtitle file within the original voice file. Since the original file may contain either or both of the original voice text file and the original audio file, in this embodiment, when only the original audio file is present, it can be converted into an original voice text file through the above step. This addresses the case where the speech is too fast or mixed with non-standard pronunciation, making it hard for the user to understand by sound alone; the user can then rely on the original voice text file for an initial understanding, further improving comprehension of the video file. If both are present, the original voice text file can be obtained directly, saving acquisition time.
After the step of taking the speech segment between the start point and the end point of each character's speech as an original audio file, the method includes:
detecting the format of the original audio file;
judging whether the format of the original audio file is the PCM format;
if not, converting the format of the original audio file to the PCM format.
As described above, when the detected original audio file is not in PCM format, the video translation player preferably converts it into a PCM voice file. PCM (Pulse Code Modulation) converts analog signals such as sound into a symbolized pulse train, which is then recorded. A PCM signal is a digital signal composed of symbols such as [1] and [0]. Compared with an analog signal, it is less susceptible to noise and distortion in the transmission system, has a wide dynamic range, and yields quite good sound quality. Moreover, the PCM track is separate from the video track and can be used for post-recording. In addition, a PCM audio file is the binary sequence formed directly from an analog audio signal through analog-to-digital (A/D) conversion, which the video translation player can decode precisely. The initial format of the original audio file may be one of many, such as PCM, WMV, MP4, DAT, or RM; in this embodiment, the audio file format to be parsed is preferably PCM.
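The format check above can be sketched for the common case of PCM stored in a WAV container. This is a sketch, not the method's own implementation: it inspects the RIFF "fmt " chunk per the WAVE specification (audio format tag 1 means uncompressed PCM), and leaves the actual transcoding of non-PCM input to an external tool.

```python
# Sketch: decide whether a WAV byte stream already carries PCM audio by
# reading the audio-format tag in its "fmt " chunk.
import struct

def is_pcm_wav(data: bytes) -> bool:
    if data[:4] != b"RIFF" or data[8:12] != b"WAVE":
        return False
    pos = 12
    while pos + 8 <= len(data):
        chunk_id = data[pos:pos + 4]
        size = struct.unpack("<I", data[pos + 4:pos + 8])[0]
        if chunk_id == b"fmt ":
            audio_format = struct.unpack("<H", data[pos + 8:pos + 10])[0]
            return audio_format == 1  # 1 == uncompressed PCM
        pos += 8 + size + (size & 1)  # chunks are word-aligned
    return False

# Build a minimal PCM WAV header in memory to demonstrate.
fmt = struct.pack("<HHIIHH", 1, 1, 16000, 32000, 2, 16)
wav = (b"RIFF" + struct.pack("<I", 4 + 8 + len(fmt)) + b"WAVE"
       + b"fmt " + struct.pack("<I", len(fmt)) + fmt)
print(is_pcm_wav(wav))  # True
```

If the check returns False, the player would hand the file to its decoder to re-encode the audio as PCM before the later VAD and recognition steps.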
Step S20 of translating the original voice file to obtain a new voice file, the language of the new voice file being a specified language, includes:
translating the original voice text file to obtain a translated new voice text file, where the new voice text file is the first new voice file.
In this embodiment, the first new voice file is the new voice text file, i.e., the translated subtitle file; the user can learn the content of the video file through the translated subtitles, which aids understanding.
Step S20 of translating the original voice file to obtain a new voice file, the language of the new voice file being a specified language, includes:
performing speech synthesis on each new voice text file to obtain a new audio file, where the new audio file is the second new voice file.
In this embodiment, the new audio file (speech) can be obtained by internal conversion from the new voice text file (subtitles), and multiple new audio segments can be synthesized into one complete new voice file. While the multimedia file is playing, the new audio files are played according to the playback times of the corresponding original audio files. During playback, the new voice file and new voice text file may entirely replace the original voice file and original voice text file, or some new audio files may replace the corresponding original audio files in the original voice file; details are omitted here. When watching a video, playing the new audio file, the new voice text file, and the original voice text file further helps the user fully understand the video content; when the video translation player plays the original audio file, the new voice text file, and the original voice text file, the user can learn a new language through the video, the original voice text file (original subtitles), the new voice text file (translated subtitles), and the synchronized picture (watching the speaker's mouth movements).
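The translate-then-synthesize step above can be sketched as a small pipeline. Note that `translate()` and `synthesize()` here are stand-in stubs labeled as assumptions; a real device would call a machine-translation engine and a TTS engine at those points.

```python
# Sketch of step S20: translate each subtitle segment, synthesize each
# translated segment, then concatenate the segments into one new voice file.

def translate(text: str) -> str:
    # Stand-in stub: a real system would call an MT engine here.
    demo_dict = {"你好": "Hello", "再见": "Goodbye"}
    return demo_dict.get(text, text)

def synthesize(text: str) -> bytes:
    # Stand-in stub: a real TTS engine would return PCM audio; we return
    # a readable marker instead.
    return f"<audio:{text}>".encode()

def build_new_voice_file(original_subtitles):
    new_text = [translate(t) for t in original_subtitles]       # new voice text files
    new_audio = b"".join(synthesize(t) for t in new_text)       # new audio file
    return new_text, new_audio

texts, audio = build_new_voice_file(["你好", "再见"])
print(texts)  # ['Hello', 'Goodbye']
```

The per-segment structure matters: each synthesized segment keeps a one-to-one correspondence with its original audio segment, which is what allows the timing alignment described later.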
After step S30 of configuring the loading attributes of the new voice file so that the new voice file is loaded synchronously when the multimedia file is played, the method includes:
receiving a playback selection signal, where the playback selection signal is used to select one or more of the new voice text file and the original voice text file for playback and to select one of the original audio file and the new audio file for playback, and at least one of the new voice text file and the new audio file is played;
playing according to the playback selection signal.
In this embodiment, the playback selection signal may be issued by the user's choice or selected automatically by the video player. According to their command of the language in the multimedia file or their interests, users may choose to play one or more files themselves, improving the user experience. For example, a user with a good command of the language in the original audio file may choose to play the original audio file and the new voice text file, improving their listening ability while watching; a user with a weaker command may choose to play the original audio file, the original voice text file, and the new voice text file, learning the pronunciation, sentences, and semantics of the language through the original audio file, the original subtitles, the translated subtitles, and the synchronized picture (watching the speaker's mouth movements); a user who does not want to learn the language and only wants to understand the content may choose to play the new audio file (translated speech) and the new voice text file, fully understanding the content of the multimedia file.
Step S30 of configuring the loading attributes of the new voice file so that the new voice file is loaded synchronously when the multimedia file is played further includes:
obtaining the playback duration of each original audio file and the playback duration of each corresponding new audio file;
judging whether the playback duration of each original audio file is greater than the playback duration of the corresponding new audio file;
if greater, selecting the corresponding new audio file for playback;
if less, selecting the corresponding original audio file for playback.
In this embodiment, it is worth mentioning that when the original audio file is played, the display start and end times of the new voice text file (subtitles in the translated language) are the start and end times of the corresponding original voice file; the video translation player then plays the original audio file, the original voice text file, and the new voice text file, and may also accept the user's selection, for example to play only one or some of the files the player can play (the original audio file, the original voice text file, and the new voice text file). If the duration of the original audio file is greater than the duration of the synthesized new audio segment, i.e., the start time of each translated new audio segment can be aligned with the start time of the original audio file, the video player automatically outputs the new audio file and does not output the original audio file; the display start time of the new voice text file is the start time of the corresponding original voice text file (original subtitles), and its end time is the end time of the corresponding new voice text file. The video translation player then plays the new audio file, the new voice text file, and the original voice text file; after making this selection, it may likewise accept the user's selection to play only one or some of these files. If the duration of the original audio file equals the duration of the synthesized new audio segment, i.e., the start and end times of the original audio file correspond synchronously to those of the new audio file, the player may play one of the new audio file and the new voice text file together with one or more of the original voice file and the original voice text file, and may accept the user's selection of one or more files to play. For example, the multimedia file may be a GIF file.
It is worth mentioning that, in a specific embodiment, the loading attributes are the parsed timing and playback information of the original voice files and new audio files. Specifically, a multimedia file generally includes original voice files, video files, header files, and so on. Before a video file is played, the header file is played first, so the video file has a synchronization time relative to the playback of the multimedia file, namely the time spent playing the header file. A multimedia file generally contains K original voice files and M video files. One original voice file includes multiple original audio segments and the intervals between them, and the playback start time Ts11 and end time Te11 of each original audio segment are marked (see Fig. 3). On a timeline, starting from its origin, the segments are, in order: the first speech segment, Ts11 to Te11; the second speech segment, Ts12 to Te12; ...; and the N-th speech segment, Ts1n to Te1n. Adding the synchronization time Toffset1, obtained by parsing the header file, to the start time of each original audio segment gives the playback times of the N first original voice files in one original voice file relative to the system: Toffset1+Ts11, Toffset1+Ts12, ..., Toffset1+Ts1n. After the first original voice file has been processed, the remaining K-1 original voice files are processed in turn to obtain the information of all first original voice files in the K files and their times relative to the playback of the multimedia file, as follows:
Time information of the original audio segments of the first voice file: Toffset1+Ts11, Toffset1+Ts12, ..., Toffset1+Ts1n;
Time information of the original audio segments of the second voice file: Toffset2+Ts21, Toffset2+Ts22, ..., Toffset2+Ts2n;
Time information of the original audio segments of the K-th voice file: Toffsetk+Tsk1, Toffsetk+Tsk2, ..., Toffsetk+Tskl, where Toffsetk is the playback time of the K-th voice file relative to the system, Tsk1 is the start time of the first speech segment of the K-th original voice file, and Tskl is the start time of its last speech segment L. A multimedia file can thus be regarded as containing Y audio segments, and the start and end times of each of the Y segments relative to the system's playback time are recorded.
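The Toffset bookkeeping above reduces to simple addition; the sketch below makes the computation concrete. The offsets and per-segment start times are assumed inputs, standing in for the values obtained from header parsing and endpoint detection.

```python
# Sketch of the timing computation described above: each voice file k has a
# header-derived offset Toffset_k, each of its speech segments has a start
# time Ts relative to that file, and the absolute playback time relative to
# the system is Toffset_k + Ts.

def absolute_segment_times(offsets, segment_starts):
    """offsets[k] is Toffset for file k; segment_starts[k] lists its Ts values."""
    return [[toff + ts for ts in starts]
            for toff, starts in zip(offsets, segment_starts)]

# Two voice files: the first starts 2.0 s into the stream, the second at 60.0 s.
times = absolute_segment_times(
    [2.0, 60.0],
    [[0.5, 3.2, 7.9], [1.0, 4.4]],
)
print(times)  # [[2.5, 5.2, 9.9], [61.0, 64.4]]
```

The resulting nested list is exactly the "Toffsetk+Tsk1, Toffsetk+Tsk2, ..." tables enumerated in the text.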
Each audio segment is converted into an original voice text file, and the playback time information relative to the system is added to each original voice text file in one-to-one correspondence with each audio segment, yielding Y original voice text files. Translating the Y original voice text files yields Y new voice text files, which are synthesized into Y new audio files, each with its own duration Tr, where r is a positive integer and 0 < r < Y+1. The Y new audio files replace the original audio segments one by one, aligning the start time of each new audio file with that of the corresponding original segment, and the display of the new voice text file is synchronized with the new voice file and the video frames. When the duration of the original audio file differs from the output duration of the new audio file, there are two cases:
a) Only the original audio file is output, and the new audio file is not. The display start and end times of the new voice text file are the start and end times of the corresponding original audio file. Let the Z-th new audio segment of the N-th new voice file have start time ToffsetN+TSZ and end time ToffsetN+TEZ, with the starting point of the corresponding video frame at ToffsetN; the number of frames over which the subtitle remains on screen is (TEZ-TSZ) × the video frame rate, where the frame rate is determined by the codec format of the multimedia file, e.g., 30 frames per second.
b) The new audio file is output, and the original audio file is not. The display start time of the new voice text file is the start time of the corresponding original voice text file, and the end time is the end time of the corresponding synthesized speech segment. Let the Z-th speech segment of the N-th audio file have start time ToffsetN+TSZ, let the new audio file corresponding to the original audio segment have duration Tr, and let the starting point of the corresponding video frame be ToffsetN; the number of frames over which the subtitle remains on screen is Tr × the video frame rate (r = z), where the frame rate is determined by the codec format of the multimedia file, e.g., 30 frames per second.
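The subtitle-duration arithmetic in cases a) and b) above can be sketched directly; the 30 fps frame rate is only the example value the text gives, not a fixed constant.

```python
# Sketch of the frame-count computation above: the number of video frames a
# subtitle stays on screen is its display duration times the frame rate.

def subtitle_frames(start_s: float, end_s: float, fps: int = 30) -> int:
    """Case a): subtitle follows the original segment's start/end times."""
    return round((end_s - start_s) * fps)

def subtitle_frames_synth(duration_s: float, fps: int = 30) -> int:
    """Case b): subtitle follows the synthesized segment's duration Tr."""
    return round(duration_s * fps)

print(subtitle_frames(12.0, 14.5))  # 75 frames at 30 fps
print(subtitle_frames_synth(3.0))   # 90 frames at 30 fps
```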
When the duration of the original audio file is the same as the output duration of the new audio file, as with a GIF animation file, the new voice file and new voice text file may entirely or partially replace the original voice file and original voice text file; this may be chosen by the user or selected automatically by the system, improving the user experience.
After step S20 of translating the original voice file to obtain a new voice file, the language of the new voice file being a specified language, the method includes:
receiving search information, where the search information is text or a sentence in any one of the original audio file, the original voice text file, the new voice text file, and the new audio file;
playing the search results corresponding to the search information.
In this embodiment, the user inputs speech to be retrieved that has appeared in the original audio file or the new audio file, or a key sentence or word that has appeared in the original voice text file or the new voice text file. A character-by-character, file-by-file matching search is then performed over all original voice text files, original audio files, translated new voice text files, and new audio files to obtain the voice text files corresponding to the key sentences and words, as well as the positions of the corresponding speech segment files (i.e., original audio files or new audio files) and video frames in the multimedia file. In this way, the multimedia file segments corresponding to the key sentences and words can be played.
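The retrieval step above amounts to a substring match over timed subtitle segments, with each hit carrying the timing needed to play the corresponding clip. The segment record layout below is illustrative, not a format the method defines.

```python
# Sketch of the search step: match a query against (original or translated)
# voice text segments and return hits with their start/end times, so the
# player can seek to and play the matching clip of the multimedia file.

def search_segments(query, segments):
    """segments: list of dicts with 'text', 'start', and 'end' times (seconds)."""
    return [seg for seg in segments if query in seg["text"]]

subtitles = [
    {"text": "I am going home", "start": 4.0, "end": 6.5},
    {"text": "See you tomorrow", "start": 6.5, "end": 8.0},
    {"text": "I will call you", "start": 8.0, "end": 10.0},
]
hits = search_segments("I", subtitles)
print([(h["start"], h["end"]) for h in hits])  # [(4.0, 6.5), (8.0, 10.0)]
```

This mirrors the "I" example given for the retrieval module: the query matches two segments, and either the original or the translated clip at those times can then be played.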
In the translation method for multimedia files of this embodiment of the present invention, the original voice file in a multimedia file is obtained and translated into a new voice file in the specified language, and the loading attributes of the new voice text file are configured so that the new voice file is loaded synchronously when the multimedia file is played. An original voice file is thus automatically converted into a voice file in another language without manual translation, helping users understand and recognize the audio and video content in multimedia files better and in a timely manner.
Referring to Fig. 2, this embodiment provides a translation apparatus for multimedia files, including:
an obtaining module 10, configured to obtain an original voice file from a multimedia file;
a translation module 20, configured to translate the original voice file to obtain a new voice file, where the language of the new voice file is a specified language;
a configuration module 30, configured to configure the loading attributes of the new voice file so that the new voice file is loaded synchronously when the multimedia file is played.
The above apparatus is applied to a translation playback device, which is generally an intelligent translation playback device such as a video translation player or an audio translation player. This embodiment takes a video translation player as an example; it can play video files and audio files and display subtitles. For the obtaining module 10, the multimedia file includes an original voice file, a video file, a header file, and so on. The obtaining module 10 obtains the original voice file, which includes at least one of a first original voice file (i.e., an original audio file) or a second original voice file (i.e., an original voice text file). For better illustration, this embodiment is described with both kinds of voice file included. It is worth mentioning that the original voice text file is the subtitle file in the multimedia file, and the original audio file is the sound file.
For the translation module 20, the original voice file may be in a language other than the user's native language; the new voice file need not be in the user's native language, and the translation module 20 can produce a file in the specified language that the user wants to watch. The new voice file includes a second new voice file (a new audio file) and a first new voice file (a new voice text file).
For the configuration module 30, after the new voice file is obtained, the configuration module 30 loads the new voice file synchronously during playback so that the user can understand the content of the multimedia file.
In this embodiment, the new voice file includes a new audio file and a new voice text file, and the original voice file includes an original audio file and an original voice text file; these can be displayed and used in various combinations, further improving the user's learning and understanding of the video file. For example, displaying the new voice text file helps the user understand the multimedia file; playing the new voice text file and the original voice text file together further helps the user learn and recognize the language and pronunciation in the multimedia file. Playing the new audio file helps the user understand the video file, while playing the new voice text file, the original voice text file, and the original audio file helps the user learn and recognize the pronunciation in the multimedia file.
In this embodiment, the obtaining module 10 includes:
a first detection unit, configured to detect the start point and end point of each character's speech in the multimedia file;
a determination unit, configured to take the speech segment between the start point and end point of each character's speech as an original audio file, where the original audio file is the first original voice file.
As for the first detection unit, the original voice file contains multiple audio objects, such as background noise, character speech, or sounds made by animals and plants. During detection, only the characters' speech signals are detected; background noise, gunshots, or sounds made by animals and plants are not. Voice Activity Detection (VAD) technology is used to detect the endpoints of character speech in the audio file. Since sound is not emitted continuously throughout an audio file, the detected speech start point and end point delimit a continuous segment of the original voice file, which constitutes one original audio file (i.e., the first original voice file).
The obtaining module 10 further includes:
a first conversion unit, configured to convert the original audio file into the original voice text file, where the original voice text file is the second original voice file.
As described for the first conversion unit, the original voice text file is the subtitle file within the original voice file. Since the original file may contain either or both of the original voice text file and the original audio file, in this embodiment, when only the original audio file is present, the first conversion unit can convert it into an original voice text file. This addresses the case where the speech is too fast or mixed with non-standard pronunciation, making it hard for the user to understand by sound alone; the user can then rely on the original voice text file for an initial understanding, further improving comprehension of the video file. If both are present, the original voice text file can be obtained directly, saving acquisition time.
The obtaining module 10 further includes:
a second detection unit, configured to detect the format of the original audio file;
a first judgment unit, configured to judge whether the format of the original audio file is the PCM format;
a second conversion unit, configured to convert the format of the original audio file to the PCM format when the judgment result is negative.
In this embodiment, the first judgment unit judges whether the format detected by the second detection unit is the PCM format; preferably, the video translation player converts the detected original audio file into a PCM voice file through the second conversion unit. PCM (Pulse Code Modulation) converts analog signals such as sound into a symbolized pulse train, which is then recorded. A PCM signal is a digital signal composed of symbols such as [1] and [0]. Compared with an analog signal, it is less susceptible to noise and distortion in the transmission system, has a wide dynamic range, and yields quite good sound quality. Moreover, the PCM track is separate from the video track and can be used for post-recording. In addition, a PCM audio file is the binary sequence formed directly from an analog audio signal through analog-to-digital (A/D) conversion, which the video translation player can decode precisely. The initial format of the original audio file may be one of many, such as PCM, WMV, MP4, DAT, or RM; in this embodiment, the audio file format to be parsed is preferably PCM.
The translation module 20 includes:
a translation unit, configured to translate the original voice text file to obtain a translated new voice text file, where the new voice text file is the first new voice file.
As described above, the first new voice file is the new voice text file, i.e., the subtitle file translated by the translation unit; the user can learn the content of the video file through the translated subtitles, which aids understanding.
The translation module 20 further includes:
a synthesis unit, configured to perform speech synthesis on each new voice text file to obtain a new audio file, where the new audio file is the second new voice file.
The new audio file (speech) can be obtained from the new voice text file (subtitles) through conversion by the internal synthesis unit, and multiple new audio segments can be synthesized into one complete new voice file. While the multimedia file is playing, the new audio files are played according to the playback times of the corresponding original audio files. During playback, the new voice file and new voice text file may entirely replace the original voice file and original voice text file, or some new audio files may replace the corresponding original audio files in the original voice file.
The above translation apparatus for multimedia files further includes:
a first receiving module, configured to receive a playback selection signal, where the playback selection signal is used to select one or more of the new voice text file and the original voice text file for playback and to select one of the original audio file and the new audio file for playback, and at least one of the new voice text file and the new audio file is played;
a first playing module, configured to play according to the playback selection signal.
The playback selection signal may be issued by the user's choice or selected automatically by the video player, and the first playing module plays according to the playback selection signal received by the first receiving module. According to their command of the language in the multimedia file or their interests, users may choose to play one or more files themselves, improving the user experience.
The above translation apparatus for multimedia files further includes:
a time obtaining module, configured to obtain the playback duration of each original audio file and the playback duration of each corresponding new audio file;
a judgment module, configured to judge whether the playback duration of each original audio file is greater than the playback duration of the corresponding new audio file;
a selection module, configured to select the corresponding new audio file for playback if it is greater, and to select the corresponding original audio file for playback if it is less.
The above translation apparatus for multimedia files further includes:
a second receiving module, configured to receive search information, where the search information is text or a sentence in any one of the original audio file, the original voice text file, the new voice text file, and the new audio file;
a second playing module, configured to play the search results corresponding to the search information.
In this embodiment, the second receiving module receives the speech to be retrieved that has appeared in the original audio file or the new audio file, or a key sentence or word that has appeared in the original voice text file or the new voice text file. The second playing module then performs a character-by-character, file-by-file matching search over all original voice text files, original audio files, translated new voice text files, and new audio files to obtain the voice text files corresponding to the key sentences and words, as well as the positions of the corresponding speech segment files (i.e., original audio files or new audio files) and video frames in the multimedia file, and plays the multimedia file segments corresponding to those key sentences and words. For example, if the original voice text file is in English and the user inputs "I", the retrieval module retrieves one or more original voice text files containing "I" and the corresponding video files; the user may then choose to play the video file they want to watch, or to play the translated video file, making it convenient to search precisely for a highlight they want to rewatch after finishing a video.
Those skilled in the art will understand that the terminal of this embodiment and the method described in the above embodiments complement and adapt to each other; the details and descriptions given for the method are applicable to the terminal of this embodiment and, to avoid repetition, are not repeated here.
In the translation apparatus for multimedia files of this embodiment of the present invention, the original voice file in a multimedia file is obtained and translated into a new voice file in the specified language, and the loading attributes of the new voice text file are configured so that the new voice file is loaded synchronously when the multimedia file is played. An original voice file is thus automatically converted into a voice file in another language without manual translation, helping users understand and recognize the audio and video content in multimedia files better and in a timely manner.
In an embodiment, a translation playback device is also provided, including a memory, a processor, and an application program, where the application program is stored in the memory and configured to be executed by the processor, and the application program is configured to perform any of the methods described above. Translation playback devices include intelligent translation playback devices such as video translation players and language learning machines.

Claims (17)

  1. A translation method for multimedia files, comprising:
    obtaining an original voice file from a multimedia file;
    translating the original voice file to obtain a new voice file, wherein the language of the new voice file is a specified language;
    configuring loading attributes of the new voice file so that the new voice file is loaded synchronously when the multimedia file is played.
  2. The translation method for multimedia files according to claim 1, wherein the original voice file comprises an original audio file, and the step of obtaining the original voice file from the multimedia file comprises:
    detecting the start point and end point of each character's speech in the multimedia file;
    taking the speech segment between the start point and end point of each character's speech as an original audio file, wherein the original audio file is a first original voice file.
  3. The translation method for multimedia files according to claim 2, wherein the original voice file further comprises an original voice text file, and the step of obtaining the original voice file from the multimedia file comprises:
    converting the original audio file into the original voice text file, wherein the original voice text file is a second original voice file.
  4. The translation method for multimedia files according to claim 2, wherein after the step of taking the speech segment between the start point and end point of each character's speech as an original audio file, the method comprises:
    detecting the format of the original audio file;
    judging whether the format of the original audio file is the PCM format;
    if not, converting the format of the original audio file to the PCM format.
  5. The translation method for multimedia files according to claim 3, wherein the step of translating the original voice file to obtain a new voice file, the language of the new voice file being a specified language, comprises:
    translating the original voice text file to obtain a translated new voice text file, wherein the new voice text file is a first new voice file.
  6. The translation method for multimedia files according to claim 5, wherein the step of translating the original voice file to obtain a new voice file, the language of the new voice file being a specified language, comprises:
    performing speech synthesis on each new voice text file to obtain a new audio file, wherein the new audio file is a second new voice file.
  7. The translation method for multimedia files according to claim 6, wherein after the step of configuring the loading attributes of the new voice file so that the new voice file is loaded synchronously when the multimedia file is played, the method comprises:
    receiving a playback selection signal, wherein the playback selection signal is used to select one or more of the new voice text file and the original voice text file for playback and to select one of the original audio file and the new audio file for playback, and at least one of the new voice text file and the new audio file is played;
    playing according to the playback selection signal.
  8. The translation method for multimedia files according to claim 6, wherein after the step of translating the original voice file to obtain a new voice file, the language of the new voice file being a specified language, the method comprises:
    receiving search information, wherein the search information is text or a sentence in any one of the original audio file, the original voice text file, the new voice text file, and the new audio file;
    playing the search results corresponding to the search information.
  9. A translation apparatus for multimedia files, comprising:
    an obtaining module, configured to obtain an original voice file from a multimedia file;
    a translation module, configured to translate the original voice file to obtain a new voice file, wherein the language of the new voice file is a specified language;
    a configuration module, configured to configure loading attributes of the new voice file so that the new voice file is loaded synchronously when the multimedia file is played.
  10. The translation apparatus for multimedia files according to claim 9, wherein the multimedia file comprises an original audio file, and the obtaining module comprises:
    a first detection unit, configured to detect the start point and end point of each character's speech in the multimedia file;
    a determination unit, configured to take the speech segment between the start point and end point of each character's speech as an original audio file, wherein the original audio file is a first original voice file.
  11. The translation apparatus for multimedia files according to claim 10, wherein the obtaining module comprises:
    a first conversion unit, configured to convert the original audio file into the original voice text file, wherein the original voice text file is a second original voice file.
  12. The translation apparatus for multimedia files according to claim 10, wherein the obtaining module further comprises:
    a second detection unit, configured to detect the format of the original audio file;
    a first judgment unit, configured to judge whether the format of the original audio file is the PCM format;
    a second conversion unit, configured to convert the format of the original audio file to the PCM format when the judgment result is negative.
  13. The translation apparatus for multimedia files according to claim 11, wherein the translation module comprises:
    a translation unit, configured to translate the original voice text file to obtain a translated new voice text file, wherein the new voice text file is a first new voice file.
  14. The translation apparatus for multimedia files according to claim 13, wherein the translation module further comprises:
    a synthesis unit, configured to perform speech synthesis on each new voice text file to obtain a new audio file, wherein the new audio file is a second new voice file.
  15. The translation apparatus for multimedia files according to claim 14, wherein the translation apparatus further comprises:
    a first receiving module, configured to receive a playback selection signal, wherein the playback selection signal is used to select one or more of the new voice text file and the original voice text file for playback and to select one of the original audio file and the new audio file for playback, and at least one of the new voice text file and the new audio file is played;
    a first playing module, configured to play according to the playback selection signal.
  16. The translation apparatus for multimedia files according to claim 14, wherein the translation apparatus comprises:
    a second receiving module, configured to receive search information, wherein the search information is text or a sentence in any one of the original audio file, the original voice text file, the new voice text file, and the new audio file;
    a second playing module, configured to play the search results corresponding to the search information.
  17. A translation playback device, comprising a memory, a processor, and an application program, wherein the application program is stored in the memory and configured to be executed by the processor, and the application program is configured to perform the method according to any one of claims 1 to 8.
PCT/CN2019/073767 2018-12-17 2019-01-29 Translation method and apparatus for multimedia files, and translation playback device WO2020124754A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811543822.9A CN109658919A (zh) 2018-12-17 2018-12-17 Translation method and apparatus for multimedia files, and translation playback device
CN201811543822.9 2018-12-17

Publications (1)

Publication Number Publication Date
WO2020124754A1 true WO2020124754A1 (zh) 2020-06-25

Family

ID=66114701

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/073767 WO2020124754A1 (zh) 2018-12-17 2019-01-29 Translation method and apparatus for multimedia files, and translation playback device

Country Status (2)

Country Link
CN (1) CN109658919A (zh)
WO (1) WO2020124754A1 (zh)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110121097A (zh) * 2019-05-13 2019-08-13 深圳市亿联智能有限公司 Multimedia playback apparatus and method with accessibility function
CN110335610A (zh) * 2019-07-19 2019-10-15 北京硬壳科技有限公司 Control method for multimedia translation, and display
CN110471659B (zh) * 2019-08-16 2023-07-21 珠海格力电器股份有限公司 Multi-language implementation method and system, human-machine interface configuration software side, and device side
KR102178175B1 (ko) * 2019-12-09 2020-11-12 김경철 User terminal and control method therefor
CN114007116A (zh) * 2022-01-05 2022-02-01 凯新创达(深圳)科技发展有限公司 Video processing method and video processing apparatus

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100299147A1 (en) * 2009-05-20 2010-11-25 Bbn Technologies Corp. Speech-to-speech translation
CN103226947A (zh) * 2013-03-27 2013-07-31 广东欧珀移动通信有限公司 一种基于移动终端的音频处理方法及装置
CN104244081A (zh) * 2014-09-26 2014-12-24 可牛网络技术(北京)有限公司 视频的提供方法及装置
CN104683873A (zh) * 2013-11-27 2015-06-03 英业达科技有限公司 多媒体播放系统及其方法
CN104681049A (zh) * 2015-02-09 2015-06-03 广州酷狗计算机科技有限公司 提示信息的显示方法及装置
CN108289244A (zh) * 2017-12-28 2018-07-17 努比亚技术有限公司 视频字幕处理方法、移动终端及计算机可读存储介质

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1143252C (zh) * 1998-01-15 2004-03-24 英业达股份有限公司 Display method for interactive picture-synchronized subtitles
CN1063308C (zh) * 1998-01-15 2001-03-14 英业达股份有限公司 Display apparatus and display method for interactive picture-synchronized subtitles
KR20140121516A (ko) * 2013-04-05 2014-10-16 이현철 System and method for providing real-time interpreted subtitles
US10206014B2 (en) * 2014-06-20 2019-02-12 Google Llc Clarifying audible verbal information in video content
CN105704579A (zh) * 2014-11-27 2016-06-22 南京苏宁软件技术有限公司 Method and system for real-time automatic translation of subtitles during media playback
CN105848004A (zh) * 2016-05-16 2016-08-10 乐视控股(北京)有限公司 Subtitle playback method and subtitle playback apparatus
CN106303695A (zh) * 2016-08-09 2017-01-04 北京东方嘉禾文化发展股份有限公司 Audio translation multilingual text processing method and system
EP3542360A4 (en) * 2016-11-21 2020-04-29 Microsoft Technology Licensing, LLC METHOD AND DEVICE FOR AUTOMATIC SYNCHRONIZATION
CN207302623U (zh) * 2017-07-26 2018-05-01 安徽听见科技有限公司 Remote speech processing system

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100299147A1 (en) * 2009-05-20 2010-11-25 Bbn Technologies Corp. Speech-to-speech translation
CN103226947A (zh) * 2013-03-27 2013-07-31 广东欧珀移动通信有限公司 一种基于移动终端的音频处理方法及装置
CN104683873A (zh) * 2013-11-27 2015-06-03 英业达科技有限公司 多媒体播放系统及其方法
CN104244081A (zh) * 2014-09-26 2014-12-24 可牛网络技术(北京)有限公司 视频的提供方法及装置
CN104681049A (zh) * 2015-02-09 2015-06-03 广州酷狗计算机科技有限公司 提示信息的显示方法及装置
CN108289244A (zh) * 2017-12-28 2018-07-17 努比亚技术有限公司 视频字幕处理方法、移动终端及计算机可读存储介质

Also Published As

Publication number Publication date
CN109658919A (zh) 2019-04-19

Similar Documents

Publication Publication Date Title
US11887578B2 (en) Automatic dubbing method and apparatus
WO2020124754A1 (zh) Translation method and apparatus for multimedia files, and translation playback device
WO2020098115A1 (zh) Subtitle adding method and apparatus, electronic device, and computer-readable storage medium
US7047191B2 (en) Method and system for providing automated captioning for AV signals
EP1922720B1 (en) System and method for synchronizing sound and manually transcribed text
US20080195386A1 (en) Method and a Device For Performing an Automatic Dubbing on a Multimedia Signal
US20200211565A1 (en) System and method for simultaneous multilingual dubbing of video-audio programs
WO2014161282A1 (zh) Method and apparatus for adjusting the playback progress of a video file
CN111462553B (zh) Language learning method and system based on video dubbing and pronunciation-correction training
CN110867177A (zh) Voice playback system with selectable timbre, playback method therefor, and readable recording medium
JP2005064600A (ja) Information processing apparatus, information processing method, and program
CN101753915A (zh) Data processing device, data processing method, and program
JP2013025299A (ja) Transcription support system and transcription support method
KR101618777B1 (ko) Server for extracting text after file upload and synchronizing it with video or audio, and method therefor
US20060039682A1 (en) DVD player with language learning function
JP2008299032A (ja) Language teaching material and character data reproduction apparatus
WO2023276539A1 (ja) Voice conversion apparatus, voice conversion method, program, and recording medium
KR20190127202A (ko) Media playback apparatus providing sound effects for story content, and speech recognition server
KR101920653B1 (ko) Language learning method and language learning program through comparison-sound generation
JP2000206987A (ja) Speech recognition apparatus
CN112236816B (zh) Information processing device, information processing system, and video device
US20110165541A1 (en) Reviewing a word in the playback of audio data
KR102463283B1 (ko) Automatic translation system for video content for both hearing-impaired and non-impaired users
JP2000358202A (ja) Video/audio recording and reproducing apparatus and method for generating and recording secondary audio data in the apparatus
JP7288530B1 (ja) System and program

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19899867

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19899867

Country of ref document: EP

Kind code of ref document: A1