WO2022215026A1 - A method and system for translating a multimedia content - Google Patents

A method and system for translating a multimedia content

Info

Publication number
WO2022215026A1
WO2022215026A1 (PCT Application No. PCT/IB2022/053263)
Authority
WO
WIPO (PCT)
Prior art keywords
content
audio
translated
text
speech
Application number
PCT/IB2022/053263
Other languages
French (fr)
Inventor
Samhitha Jagannath
Satvik Jagannath
Akash NIDHI PS
Original Assignee
Samhitha Jagannath
Satvik Jagannath
Nidhi Ps Akash
Application filed by Samhitha Jagannath, Satvik Jagannath, Nidhi Ps Akash
Publication of WO2022215026A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/42 Data-driven translation
    • G06F40/47 Machine-assisted translation, e.g. using translation memory

Definitions

  • the present subject matter is generally related to multimedia content and more particularly, but not exclusively, to a method and system for translating a multimedia content.
  • the present disclosure discloses a method of translating a multimedia content.
  • the method comprises receiving a multimedia content from a user device.
  • the multimedia content is in a source language.
  • the method comprises extracting foreground content and background content from the multimedia content, translating the foreground content of the multimedia content from the source language to a target language using one or more techniques. Thereafter, the method comprises merging the translated foreground content with the background content for providing the translated multimedia content in the target language.
  • the present disclosure discloses a translation system for translating a multimedia content.
  • the translation system comprises a processor, and a memory communicatively coupled to the processor.
  • the processor receives a multimedia content from a user device.
  • the multimedia content is in a source language.
  • the processor extracts foreground content and background content from the multimedia content, translates the foreground content of the multimedia content from the source language to a target language using one or more techniques. Thereafter, the processor merges the translated foreground content with the background content for providing the translated multimedia content in the target language.
  • Fig.1 shows an exemplary environment for translating a multimedia content in accordance with some embodiments of the present disclosure.
  • Fig.2 shows a detailed block diagram of a translation system in accordance with some embodiments of the present disclosure.
  • Fig.3A-3B show exemplary flowcharts for translating audio content in accordance with some embodiments of the present disclosure.
  • Fig.4A-4C show exemplary flowcharts for translating video content in accordance with some embodiments of the present disclosure.
  • Fig.5 shows a flow chart illustrating a method of translating a multimedia content in accordance with some embodiments of the present disclosure.
  • Fig.6 shows a block diagram of an exemplary computer system for implementing embodiments consistent with the present disclosure.
  • any flow diagrams and timing diagrams herein represent conceptual views of illustrative devices embodying the principles of the present subject matter.
  • any flow charts, flow diagrams, state transition diagrams, pseudo code, and the like represent various processes which may be substantially represented in computer readable medium and executed by a computer or processor, whether or not such computer or processor is explicitly shown.
  • Embodiments of the present disclosure may relate to a method and translation system for translating a multimedia content.
  • multimedia content, such as audios and videos, may be translated from a source language to a target language.
  • a video may be translated from English to French.
  • the present disclosure may extract foreground and background content of multimedia content and translate the foreground content from the source language to the target language. Thereafter, the translated foreground content is merged with the background content in order to provide the translated multimedia content. Therefore, the present disclosure facilitates translation of the multimedia content with higher accuracy and realistic expression of the source language.
  • Fig.1 shows an exemplary environment for translating a multimedia content in accordance with some embodiments of the present disclosure.
  • the environment 100 may include a translation system 101 connected to a user device 103 through a communication network 105 for translating multimedia content.
  • the translation system 101 may be connected to the user device 103 through a wired communication interface or a wireless communication interface.
  • the user device 103 may be any computing device.
  • the user device 103 may include, a smart phone, a Personal Computer (PC), a tablet, a notebook, and the like.
  • the translation system 101 may be implemented on any computing device such as, a server, a High-Performance Computer (HPC), a smart phone, a tablet, a notebook, and the like.
  • the translation system 101 may be implemented on the user device 103 for translating the multimedia content.
  • the translation system 101 may include an I/O interface 107, a memory 109 and a processor 111 as explained in detail in subsequent figures of the detailed description.
  • the translation system 101 may receive multimedia content from a user associated with the user device 103 for translating the multimedia content from a source language to a target language.
  • the multimedia content may include audio content and video content.
  • the audio content may include a podcast, an audio book, a recording, and the like.
  • the video content may include any recorded video.
  • the translation system 101 may identify a plurality of attributes of the audio content and split the audio content into a plurality of audio chunks based on the plurality of attributes.
  • the plurality of attributes may include but not limited to, correctness of audio format, size of audio file and length of audio.
  • the translation system 101 may extract the audio content and video frames associated with the video content before identifying the plurality of attributes of the audio content and splitting the audio content into the plurality of audio chunks.
  • the translation system 101 may extract foreground content and background content from the multimedia content received from the user device 103.
  • the foreground content may include speech associated with the multimedia content.
  • the speech refers to vocal communication associated with the multimedia content.
  • the speech may include a singing voice in music, or a voice in a lecture, and the like.
  • the background content includes background sound.
  • the background sound may be any background noise or background music associated with the multimedia content.
  • the extracted foreground content is translated from the source language to the target language using a predefined technique.
  • the predefined technique may include text transcription, speech to text, and the like. For instance, the foreground content may be translated from Hindi to English.
  • the translation system 101 may convert the speech of the foreground content to text using a predefined technique.
  • the predefined technique may include, but is not limited to, audio transcription or speech-to-text, which produces a written, printed, or textual version of speech or audio.
  • the transcribed text is in the source language.
  • the text in the source language is translated from the source language to the target language using any known translation techniques.
  • the translation may be a timed or an untimed translation depending on the context of the multimedia content or requirement. Based on the translated text, speech is generated in the target language.
  • generating the speech in the target language based on the translated text includes determining characteristics of the speech in the source language and generating the speech in the target language based on the determined characteristic.
  • the characteristic of the speech may include, but not limited to, pitch correction, bitrate correction, speed correction, audio positioning correction, voice cloning, emotions, and expression adjustment.
  • the generated speech in the target language is merged with background sound to provide translated audio content.
  • This process makes the translation realistic as the background sound is retained as per original multimedia content.
  • the multimedia content is video content
  • the translated audio content is merged with the extracted video frames to provide translated video content.
  • the present disclosure may also include extracting visual text within the video content along with corresponding parameters on the video content.
  • the visual text of the video content may include, but not limited to, labels, text, banners, and the like.
  • the visual text may include a lecture written on a board in the video content.
  • the visual text within the video content is translated from the source language to the target language using known translation techniques.
  • the parameters of the visual text comprise positional information on the video content, font, and style of the visual text in the video content.
  • the translated visual text is rendered on the video content based on parameters of the visual text. That is, the style of original visual text is analyzed in order to recreate the original style on the translated text.
  • Fig.2 shows a detailed block diagram of a translation system in accordance with some embodiments of the present disclosure.
  • the translation system 101 may include the I/O interface 107, the memory 109, the processor 111, and modules 214.
  • the memory 109 includes data 200.
  • the I/O interface 107 may be configured to receive the multimedia content from the user associated with the user device 103 for translation of the multimedia content along with information regarding the target language to which the multimedia content is to be translated. Further, the I/O interface 107 may provide the translated multimedia content to the user device 103.
  • the processor 111 may be configured to receive the multimedia content through the I/O interface 107. Further, the processor 111 may retrieve data from the memory 109 and interact with the modules 214 to perform the translation of the multimedia content.
  • the memory 109 may store the data 200 received through the I/O interface 107, the modules 214 and the processor 111.
  • the data 200 may also include input data 201, foreground data 203, background data 205, translated data 207, and other data 209.
  • the input data 201 may include details about the multimedia content received from the user device 103.
  • the details may include type of multimedia content such as, audio or video content. Further, the details may include information about the source language associated with the multimedia content and the target language to which the multimedia content is to be translated. Further, the input data 201 may include extracted video frames of the video content, when the multimedia content is the video content.
  • the foreground data 203 may include information about the foreground content extracted from the multimedia content.
  • the information may be related to the type of speech such as, verbal communication in music, lecture, and the like.
  • the background data 205 may include information about the background sound/noise extracted from the multimedia content.
  • the background sound is free from the speech or verbal communication.
  • the translated data 207 may include the translated multimedia content in the target language.
  • the other data 209 may store data, including temporary data and temporary files, generated by the modules 214 for performing the various functions of the translation system 101.
  • the data 200 stored in the memory 109 may be processed by the modules 214 of the translation system 101.
  • the modules 214 may be communicatively coupled to the processor 111 configured in the translation system 101.
  • the modules 214 may be present outside the memory 109 as shown in Fig.2 and implemented as hardware.
  • the term modules may refer to an Application Specific Integrated Circuit (ASIC), an electronic circuit, a processor (shared, dedicated, or group) and a memory that execute one or more software or firmware programs, a combinational logic circuit, and/or other suitable components that provide the described functionality.
  • the modules 214 may include, for example, a receiving module 215, an extraction module 217, a translation module 219, a merging module 221, and other modules 223.
  • the other modules 223 may include an identification module for identifying the plurality of attributes associated with an audio associated with the audio content and the video content.
  • the plurality of attributes may include, but not limited to, correctness of audio format, size of audio file and length of audio.
  • the identification module may validate metadata associated with the audio with prestored metadata for audio.
  • the other modules 223 may include a splitting module for splitting the audio associated with the audio content and the video content into a plurality of audio chunks based on the plurality of attributes.
  • the other modules 223 may be used to perform various miscellaneous functionalities of the translation system 101. It will be appreciated that the aforementioned modules 214 may be represented as a single module or a combination of different modules. Furthermore, a person of ordinary skill in the art will appreciate that in an implementation, the one or more modules 214 may be stored in the memory 109, without limiting the scope of the disclosure. The said modules 214, when configured with the functionality defined in the present disclosure, will result in novel hardware.
  • the receiving module 215 may receive the multimedia content from the user device 103 along with information about the target language to which the multimedia content is to be translated.
  • the extraction module 217 may extract the foreground content and the background content from the multimedia content.
  • the foreground content may include speech associated with the multimedia content.
  • the speech refers to vocal communication associated with the multimedia content.
  • the speech may include a singing voice in music, or a voice in a lecture, and the like.
  • the background content includes background sound.
  • the background sound may be any background noise associated with the multimedia content.
  • the extraction module 217 may separate the different channels in the audio associated with the audio content and the video content, for each audio chunk, into foreground and background content.
  • the extraction module 217 may initially extract the audio content and video frames associated with video content.
  • the extraction module 217 may extract the visual text within the video content along with corresponding parameters on the video content.
  • the parameters of the visual text may include positional information on the video content, font, and style of the visual text in the video content.
  • the extraction of foreground content and background content is performed for the audio content associated with the video content.
  • the translation module 219 may receive the extracted foreground content from the extraction module 217 and may translate the foreground content associated with the multimedia content. That is, the translation module 219 may translate the foreground content from the source language to the target language using the predefined technique.
  • the predefined technique may include, but not limited to, text transcription, speech to text, and the like.
  • the foreground content may be translated from Hindi to English.
  • the translation module 219 may convert the speech of the foreground content to text using the predefined technique.
  • the predefined technique may include, but is not limited to, audio or speech transcription to text, which produces a written, printed, or textual version of speech or audio.
  • the transcribed text is in the source language.
  • the translation module 219 may translate the text in the source language to the target language using any known translation techniques.
  • the translation may be a timed or an untimed translation depending on context of the multimedia content or requirement.
  • the timed or untimed translation of the text involves adjusting for the time at which each exact word/speech was spoken in the audio content. This adjustment mechanism helps in factoring in silence in the audio, pauses in the speech, slow and fast speech styles, etc. Time-segmented text helps in perfectly aligning speech at the appropriate time and place in the translated audio. This enables perfect timing and synchronization with the audio.
  • the translation module 219 may generate the speech in the target language using predefined techniques such as, text-to-speech conversion.
  • the translation module 219 may determine characteristic of the speech in the source language and generate the speech in the target language based on the determined characteristic.
  • the characteristic of the speech may include, but not limited to, pitch correction, bitrate correction, speed correction, audio positioning correction, voice cloning, emotions, and expression adjustment.
  • the translation module 219 may translate the visual text within the video content from the source language to the target language using known translation techniques.
  • the merging module 221 may receive the generated speech in the target language from the translation module 219 and the extracted background sound, video frames and parameters of the visual text for the multimedia content from the extraction module 217.
  • the merging module 221 may merge the generated speech with the background sound to provide translated audio content in the target language. This process makes the translation realistic as the background sound is retained as per original multimedia content.
  • the merging module 221 may perform the merging in two stages. In the first stage, the generated speech is merged with the background sound to provide the translated audio content. In the second stage, the translated audio content is merged with the extracted video frames to provide the translated video content.
  • the merging module 221 may render the translated visual text received from the translation module 219, on the video content based on parameters of the visual text. That is, the style of original visual text is analyzed in order to recreate the original style on the translated text.
  • FIG.3A-3B show exemplary flowcharts for translating audio content in accordance with some embodiments of the present disclosure.
  • Fig.3A shows an exemplary flowchart for translating an audio content.
  • an audio file 301, such as a podcast, an audiobook, or any audio recording in a source language, may be received by the translation system 101.
  • the audio file 301 is separated into foreground content and background content. That is, the foreground content such as, the speech/voice and background content such as, the background sound and noise is extracted.
  • the extracted foreground content i.e., the speech is converted to text.
  • the text is translated from the source language to the target language.
  • the translated text is indicated as ‘A’.
  • the translation can be a timed or an untimed translation depending on the context of the audio or requirement.
  • the translated text ‘A’ is converted to speech in the target language.
  • the generated speech in the target language is merged with the background content obtained from step 302 in order to provide the translated audio content 307 in the target language.
  • Fig.3B shows an alternate flowchart for translating an audio content.
  • Steps 302-304 are the same as above.
  • the audio file 301 is received at step 308 for determining the characteristic of the speech in the source language.
  • the characteristic of the speech may include, but not limited to, pitch correction, bitrate correction, speed correction, audio positioning correction, voice cloning, emotions, and expression adjustment.
  • the translated text ‘A’ is converted to speech in the target language based on the determined characteristic.
  • the generated speech in the target language is merged with the background content obtained from step 302 in order to provide the translated audio content 311 with voice cloning in the target language.
  • Fig.4A-4C show exemplary flowcharts for translating video content in accordance with some embodiments of the present disclosure.
  • Fig.4A shows an exemplary flowchart for translating a video content.
  • a video file 400, such as a music video or any video recording in a source language, may be received by the translation system 101.
  • the audio, video frames, and visual text, if any, associated with the video file 400 may be extracted.
  • the audio associated with the video file 400 is separated into the foreground content and background content. That is, the foreground content such as, the speech/voice and background content such as, the background sound and noise is extracted.
  • the extracted foreground content i.e., the speech is converted to text.
  • the text is translated from the source language to the target language.
  • the translated text is indicated as ‘A’.
  • the translation can be a timed or an untimed translation depending on the context of the audio or requirement.
  • the translated text ‘A’ is converted to speech in the target language.
  • the generated speech in the target language is merged with the background content obtained from step 402 in order to provide the translated audio content in the target language.
  • the translated audio content is merged with the video frames obtained from step 401 to provide the translated video content indicated as ‘C’ in the target language.
  • Fig.4B shows an exemplary flowchart for translating a visual text in the video content.
  • the visual text associated with the video content is extracted along with corresponding parameters on the video content.
  • the visual text of the video content may include, but not limited to, labels, text, banners, and the like.
  • the visual text may include a lecture written on a board in the video content.
  • the parameters of the visual text comprise positional information on the video content, font, and style of the visual text in the video content.
  • the visual text within the video content is translated from the source language to the target language using known translation techniques.
  • the translated visual text is rendered on the video content based on parameters of the visual text.
  • Steps 401-404 are the same as above.
  • the audio associated with the video file 400 is received at step 408 for determining the characteristic of the speech in the audio in the source language.
  • the characteristic of the speech may include, but not limited to, pitch correction, bitrate correction, speed correction, audio positioning correction, voice cloning, emotions, and expression adjustment.
  • the translated text ‘A’ is converted to speech in the target language based on the determined characteristic.
  • the generated speech in the target language is merged with the background content obtained from step 402 in order to provide the translated audio content with voice cloning in the target language.
  • the translated audio content is merged with the video frames obtained from step 401 to provide the translated video content indicated as ‘C’ in the target language. (A consolidated sketch of this video flow is given at the end of this section.)
  • Fig.5 shows a flow chart illustrating a method of translating a multimedia content in accordance with some embodiments of the present disclosure.
  • the method 500 includes one or more blocks illustrating a method of translating a multimedia content.
  • the order in which the method 500 is described is not intended to be construed as a limitation, and any number of the described method blocks can be combined in any order to implement the method. Additionally, individual blocks may be deleted from the methods without departing from the spirit and scope of the subject matter described herein. Furthermore, the method can be implemented in any suitable hardware, software, firmware, or combination thereof.
  • the method 500 may include receiving, by the receiving module 215, the multimedia content from the user device 103.
  • the multimedia content is in the source language.
  • the multimedia content is one of an audio content and a video content.
  • the method 500 may include extracting, by the extraction module 217, the foreground content and the background content from the multimedia content.
  • the foreground content comprises speech and the background content comprises background sound.
  • the method 500 may include translating, by the translation module 219, the foreground content of the multimedia content from the source language to the target language using one or more techniques.
  • translating the foreground content includes converting speech of the foreground content to text using a predefined technique, where the text is in the source language, translating the text from the source language to the target language, and generating the speech in the target language based on the translated text.
  • generating the speech in the target language based on the translated text further comprises determining characteristic of the speech in the source language and generating the speech in the target language based on the determined characteristic.
  • the method 500 may include merging, by the merging module 221, the translated foreground content with the background content for providing the translated multimedia content in the target language.
  • the generated speech in the target language is merged with background sound to provide translated audio content.
  • the translated audio content is merged with video frames to provide translated video content.
  • the method 500 further includes extracting the audio content and video frames associated with the video content and visual text within the video content along with corresponding parameters on the video content, when the multimedia content is the video content.
  • the visual text within the video content is translated from the source language to the target language using predefined translation techniques. Further, the method 500 includes rendering the translated visual text based on the parameters of the visual text.
  • Fig.6 illustrates a block diagram of an exemplary computer system 600 for implementing embodiments consistent with the present disclosure.
  • the computer system 600 may be a system for translating the multimedia content.
  • the computer system 600 may include a central processing unit (“CPU” or “processor”) 602.
  • the processor 602 may comprise at least one data processor for executing program components for executing user or system-generated business processes.
  • the processor 602 may include specialized processing units such as integrated system (bus) controllers, memory management control units, floating point units, graphics processing units, digital signal processing units, etc.
  • the processor 602 may be disposed in communication with one or more input/output (I/O) devices (612 and 613) via I/O interface 601.
  • the I/O interface 601 may employ communication protocols/methods such as, without limitation, audio, analog, digital, stereo, IEEE-1394, serial bus, Universal Serial Bus (USB), infrared, PS/2, BNC, coaxial, component, composite, Digital Visual Interface (DVI), high-definition multimedia interface (HDMI), Radio Frequency (RF) antennas, S-Video, Video Graphics Array (VGA), IEEE 802.
  • the computer system 600 may communicate with one or more I/O devices 612 and 613.
  • the I/O devices 612 and 613 may be connected over, for example, Bluetooth or cellular protocols such as Code-Division Multiple Access (CDMA), High-Speed Packet Access (HSPA+), Global System for Mobile Communications (GSM), Long-Term Evolution (LTE), or the like.
  • the processor 602 may be disposed in communication with a communication network 609 via a network interface 603.
  • the network interface 603 may communicate with the communication network 609.
  • the network interface 603 may employ connection protocols including, without limitation, direct connect, Ethernet (e.g., twisted pair 10/100/1000 Base T), Transmission Control Protocol/Internet Protocol (TCP/IP), token ring, IEEE 802.11a/b/g/n/x, etc.
  • the communication network 609 may be used to receive the multimedia content from a user device 614.
  • the communication network 609 can be implemented as one of several types of networks, such as an intranet or a Local Area Network (LAN), within the organization.
  • the communication network 609 may either be a dedicated network or a shared network, which represents an association of several types of networks that use a variety of protocols, for example, Hypertext Transfer Protocol (HTTP), Transmission Control Protocol/Internet Protocol (TCP/IP), Wireless Application Protocol (WAP), etc., to communicate with each other.
  • the communication network 609 may include a variety of network devices, including routers, bridges, servers, computing devices, storage devices, etc.
  • the processor 602 may be disposed in communication with a memory 605 (e.g., RAM, ROM, etc.), as shown in Fig.6, via a storage interface 604.
  • the storage interface 604 may connect to memory 605 including, without limitation, memory drives, removable disc drives, etc., employing connection protocols such as Serial Advanced Technology Attachment (SATA), Integrated Drive Electronics (IDE), IEEE-1394, Universal Serial Bus (USB), fiber channel, Small Computer Systems Interface (SCSI), etc.
  • the memory drives may further include a drum, magnetic disc drive, magneto-optical drive, optical drive, Redundant Array of Independent Discs (RAID), solid-state memory devices, solid-state drives, etc.
  • the memory 605 may store a collection of program or database components, including, without limitation, user/application data 606, an operating system 607, a web browser 608, a mail client, a mail server, a web server, and the like.
  • the computer system 600 may store user/application data, such as the data, variables, records, etc. as described in this invention.
  • databases may be implemented as fault-tolerant, relational, scalable, secure databases such as Oracle® or Sybase®.
  • the operating system 607 may facilitate resource management and operation of the computer system 600.
  • Examples of operating systems include, without limitation, APPLE® MACINTOSH® OS X, UNIX®, UNIX-like system distributions (e.g., BERKELEY SOFTWARE DISTRIBUTION™ (BSD), FREEBSD™, NETBSD™, OPENBSD™, etc.), LINUX DISTRIBUTIONS™ (e.g., RED HAT™, UBUNTU™, KUBUNTU™, etc.), IBM™ OS/2, MICROSOFT™ WINDOWS™ (XP™, VISTA™/7/8/10, etc.), APPLE® IOS™, GOOGLE® ANDROID™, BLACKBERRY® OS, or the like.
  • a user interface may facilitate display, execution, interaction, manipulation, or operation of program components through textual or graphical facilities.
  • user interfaces may provide computer interaction interface elements on a display system operatively connected to the computer system 600, such as cursors, icons, check boxes, menus, windows, widgets, etc.
  • Graphical User Interfaces (GUIs) may be employed, including, without limitation, those of APPLE® MACINTOSH® operating systems, IBM™ OS/2, MICROSOFT™ WINDOWS™ (XP™, VISTA™/7/8/10, etc.), UNIX® X-Windows, web interface libraries (e.g., AJAX™, DHTML™, ADOBE® FLASH™, JAVASCRIPT™, JAVA™, etc.), or the like.
  • a computer-readable storage medium refers to any type of physical memory on which information or data readable by a processor may be stored.
  • a computer-readable storage medium may store instructions for execution by one or more processors, including instructions for causing the processor(s) to perform steps or stages consistent with the embodiments described herein.
  • the term “computer-readable medium” should be understood to include tangible items and exclude carrier waves and transient signals, i.e., be non-transitory. Examples include Random Access Memory (RAM), Read-Only Memory (ROM), volatile memory, non-volatile memory, hard drives, Compact Disc (CD) ROMs, Digital Video Discs (DVDs), flash drives, disks, and any other known physical storage media.
  • the present disclosure facilitates translation of the multimedia content with higher accuracy and realistic expression of the source language.
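As referenced in the Fig.4 walkthrough above, the video flow can be consolidated into the sketch below; every helper is a hypothetical stand-in named after its flowchart step, and none of the names come from the disclosure itself.

```python
def translate_video(video, target_lang):
    audio, frames, visual_text = extract(video)              # step 401: demux
    translated_audio = translate_speech(audio, target_lang)  # steps 402-404 and
                                                             # 408-409, as in Fig.3
    frames = render_translated_text(frames, visual_text, target_lang)  # Fig.4B
    return merge_video(translated_audio, frames)             # translated video 'C'
```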

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

Disclosed herein is a method and translation system (101) for translating multimedia content. The method includes receiving multimedia content from a user device (103). The multimedia content is in a source language. The method comprises extracting foreground content and background content from the multimedia content and translating the foreground content of the multimedia content from the source language to a target language using one or more techniques. Thereafter, the translated foreground content is merged with the background content for providing the translated multimedia content in the target language.

Description

A METHOD AND SYSTEM FOR TRANSLATING A MULTIMEDIA CONTENT
CROSS-REFERENCE TO RELATED APPLICATION:
This complete application is drafted by combining aspects of two Indian Patent Provisional applications (i.e., Application No. 202041043627, dated 7th October 2020, post-dated to 7th April 2021, and Application No. 202041043628, dated 7th October 2020, post-dated to 7th April 2021). The disclosure of both the aforementioned provisional applications is hereby incorporated by reference in its entirety.
TECHNICAL FIELD
[001] The present subject matter is generally related to multimedia content and more particularly, but not exclusively, to a method and system for translating a multimedia content.
BACKGROUND
[002] The consumption of audiovisual content, from traditional movies, documentaries, and Television (TV) shows to more recent online user-generated content found on social media platforms, has grown exponentially over the last few decades. With this consumption, and the increase in international business and social activities, the need to function across multiple languages has become important. For example, there are many movies and television shows that are liked by people of different countries and are hence dubbed in many languages. Further, there are many institutes providing their recorded tutorials in multiple languages. There are many public/professional speakers whose audios and videos are recorded and further translated into other languages for better reach among a global audience.
[003] Most such examples involve high-end translation efforts with multiple hardware systems for processing multimedia content, such as audios and videos, in another language. There are many bloggers, vloggers, students, teachers, small businessmen, etc., who may not have access to such high-end translating solutions. Among these high-end solutions, very few affordable and approachable systems or processes are available for translating multimedia content into other languages.
[004] Currently, existing systems for translation do not provide a very accurate and realistic translation. Furthermore, such systems do not retain background sounds/noise and mostly convert voice/audio into the target language without retaining expression of voice such as, but not limited to, originality of voice, deep breaths, pauses, and encouraging and laughing gestures, etc. In the existing systems of multimedia translation, either the background becomes noisy in translation and interferes with translation accuracy, or the background is clipped off entirely without providing realistic expression or surrounding background sounds to the translated content. Furthermore, most of these systems and applications provide a machine-like voice for translated content and do not retain the originality of the speaker’s voice.
[005] The information disclosed in this background of the disclosure section is only for enhancement of understanding of the general background of the invention and should not be taken as an acknowledgement or any form of suggestion that this information forms the prior art already known to a person skilled in the art.
SUMMARY
[006] In an embodiment, the present disclosure discloses a method of translating a multimedia content. The method comprises receiving a multimedia content from a user device. The multimedia content is in a source language. The method comprises extracting foreground content and background content from the multimedia content, translating the foreground content of the multimedia content from the source language to a target language using one or more techniques. Thereafter, the method comprises merging the translated foreground content with the background content for providing the translated multimedia content in the target language.
[007] Further, in an embodiment, the present disclosure discloses a translation system for translating a multimedia content. The translation system comprises a processor, and a memory communicatively coupled to the processor. The processor receives a multimedia content from a user device. The multimedia content is in a source language. The processor extracts foreground content and background content from the multimedia content, translates the foreground content of the multimedia content from the source language to a target language using one or more techniques. Thereafter, the processor merges the translated foreground content with the background content for providing the translated multimedia content in the target language.
[008] The foregoing summary is illustrative only and is not intended to be in any way limiting. In addition to the illustrative aspects, embodiments, and features described above, further aspects, embodiments, and features will become apparent by reference to the drawings and the following detailed description.
BRIEF DESCRIPTION OF THE ACCOMPANYING DRAWINGS
[009] The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate exemplary embodiments and, together with the description, explain the disclosed principles. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. The same numbers are used throughout the figures to reference like features and components. Some embodiments of system and/or methods in accordance with embodiments of the present subject matter are now described, by way of example only, and regarding the accompanying figures, in which:
[0010] Fig.1 shows an exemplary environment for translating a multimedia content in accordance with some embodiments of the present disclosure;
[0011] Fig.2 shows a detailed block diagram of a translation system in accordance with some embodiments of the present disclosure;
[0012] Fig.3A-3B show exemplary flowcharts for translating audio content in accordance with some embodiments of the present disclosure;
[0013] Fig.4A-4C show exemplary flowcharts for translating video content in accordance with some embodiments of the present disclosure;
[0014] Fig.5 shows a flow chart illustrating a method of translating a multimedia content in accordance with some embodiments of the present disclosure; and
[0015] Fig.6 shows a block diagram of an exemplary computer system for implementing embodiments consistent with the present disclosure.
[0016] It should be appreciated by those skilled in the art that any flow diagrams and timing diagrams herein represent conceptual views of illustrative devices embodying the principles of the present subject matter. Similarly, it will be appreciated that any flow charts, flow diagrams, state transition diagrams, pseudo code, and the like represent various processes which may be substantially represented in computer readable medium and executed by a computer or processor, whether or not such computer or processor is explicitly shown.
DETAILED DESCRIPTION
[0017] In the present document, the word “exemplary” is used herein to mean “serving as an example, instance, or illustration.” Any embodiment or implementation of the present subject matter described herein as “exemplary” is not necessarily to be construed as preferred or advantageous over other embodiments.
[0018] While the disclosure is susceptible to various modifications and alternative forms, specific embodiments thereof have been shown by way of example in the drawings and will be described in detail below. It should be understood, however, that it is not intended to limit the disclosure to the specific forms disclosed, but on the contrary, the disclosure is to cover all modifications, equivalents, and alternatives falling within the scope of the disclosure.
[0019] The terms “comprises,” “comprising,” “includes,” “including” or any other variations thereof, are intended to cover a non-exclusive inclusion, such that a setup, device, or method that comprises a list of components or steps does not include only those components or steps but may include other components or steps not expressly listed or inherent to such setup, device, or method. In other words, one or more elements in a system or apparatus preceded by “comprises... a” does not, without more constraints, preclude the existence of other elements or additional elements in the system or method.
[0020] Embodiments of the present disclosure may relate to a method and translation system for translating a multimedia content. Multimedia content, such as audios and videos, may be translated from a source language to a target language. For example, a video may be translated from English to French. The present disclosure may extract foreground and background content of multimedia content and translate the foreground content from the source language to the target language. Thereafter, the translated foreground content is merged with the background content in order to provide the translated multimedia content. Therefore, the present disclosure facilitates translation of the multimedia content with higher accuracy and realistic expression of the source language.
[0021] Fig.1 shows an exemplary environment for translating a multimedia content in accordance with some embodiments of the present disclosure.
[0022] As shown in Fig.1, the environment 100 may include a translation system 101 connected to a user device 103 through a communication network 105 for translating multimedia content. In an embodiment, the translation system 101 may be connected to the user device 103 through a wired communication interface or a wireless communication interface. In the present disclosure, the user device 103 may be any computing device. For instance, the user device 103 may include a smart phone, a Personal Computer (PC), a tablet, a notebook, and the like. A person skilled in the art would understand that any other computing device in the environment 100, not mentioned herein explicitly, may also be referred to as the user device 103. The translation system 101 may be implemented on any computing device such as, a server, a High-Performance Computer (HPC), a smart phone, a tablet, a notebook, and the like. In an embodiment, the translation system 101 may be implemented on the user device 103 for translating the multimedia content. In some implementations, the translation system 101 may include an I/O interface 107, a memory 109 and a processor 111 as explained in detail in subsequent figures of the detailed description.
[0023] As illustrated in Fig.1, the translation system 101 may receive multimedia content from a user associated with the user device 103 for translating the multimedia content from a source language to a target language. The multimedia content may include audio content and video content. For example, the audio content may include a podcast, an audio book, a recording, and the like. The video content may include any recorded video. A person skilled in the art would understand that the multimedia content may also include any other type not mentioned herein explicitly. Upon receiving the multimedia content, and when the multimedia content is audio, the translation system 101 may identify a plurality of attributes of the audio content and split the audio content into a plurality of audio chunks based on the plurality of attributes. The plurality of attributes may include, but is not limited to, correctness of audio format, size of audio file, and length of audio. On the other hand, when the multimedia content is the video content, the translation system 101 may extract the audio content and video frames associated with the video content before identifying the plurality of attributes of the audio content and splitting the audio content into the plurality of audio chunks.
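By way of illustration only, the attribute checks and the chunking step above could be sketched as follows in Python, assuming the pydub library; the format whitelist, size limit, and chunk length are placeholder assumptions, since the disclosure does not fix specific thresholds.

```python
from pydub import AudioSegment

SUPPORTED_FORMATS = {"mp3", "wav", "flac"}   # assumed format whitelist
MAX_FILE_BYTES = 500 * 1024 * 1024           # assumed size limit
CHUNK_MS = 30_000                            # assumed 30-second chunks

def validate_and_split(path: str, fmt: str, size_bytes: int) -> list[AudioSegment]:
    """Check the plurality of attributes, then split the audio into chunks."""
    if fmt not in SUPPORTED_FORMATS:          # correctness of audio format
        raise ValueError(f"unsupported audio format: {fmt}")
    if size_bytes > MAX_FILE_BYTES:           # size of audio file
        raise ValueError("audio file too large")
    audio = AudioSegment.from_file(path, format=fmt)
    if len(audio) == 0:                       # length of audio (milliseconds)
        raise ValueError("audio has zero length")
    # pydub indexes audio in milliseconds, so slicing yields fixed-length chunks.
    return [audio[i:i + CHUNK_MS] for i in range(0, len(audio), CHUNK_MS)]
```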
[0024] Further, the translation system 101 may extract foreground content and background content from the multimedia content received from the user device 103. The foreground content may include speech associated with the multimedia content. The speech refers to vocal communication associated with the multimedia content. For instance, the speech may include a singing voice in music, or a voice in a lecture, and the like. The background content, on the other hand, includes background sound. The background sound may be any background noise or background music associated with the multimedia content. The extracted foreground content is translated from the source language to the target language using a predefined technique. The predefined technique may include text transcription, speech to text, and the like. For instance, the foreground content may be translated from Hindi to English. Particularly, to translate the foreground content, the translation system 101 may convert the speech of the foreground content to text using a predefined technique. The predefined technique may include, but is not limited to, audio transcription or speech-to-text, which produces a written, printed, or textual version of speech or audio. The transcribed text is in the source language. Further, the text in the source language is translated from the source language to the target language using any known translation techniques. The translation may be a timed or an untimed translation depending on the context of the multimedia content or requirement. Based on the translated text, speech is generated in the target language. In an embodiment, generating the speech in the target language based on the translated text includes determining characteristics of the speech in the source language and generating the speech in the target language based on the determined characteristics. The characteristics of the speech may include, but are not limited to, pitch correction, bitrate correction, speed correction, audio positioning correction, voice cloning, emotions, and expression adjustment.
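One way to realize the foreground/background split described above is with an off-the-shelf source-separation model. The sketch below assumes the open-source Spleeter library and its two-stem (vocals/accompaniment) model; the disclosure itself does not name a particular separation technique.

```python
from spleeter.separator import Separator

def extract_foreground_background(audio_path: str, out_dir: str) -> None:
    # The 2-stems model splits the audio into vocals (the foreground speech
    # or singing voice) and accompaniment (background music and noise).
    separator = Separator("spleeter:2stems")
    # Writes <out_dir>/<name>/vocals.wav and <out_dir>/<name>/accompaniment.wav.
    separator.separate_to_file(audio_path, out_dir)
```

The vocals stem would then feed the speech-to-text, translation, and text-to-speech stages, while the accompaniment stem is held back for the merging step.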
[0025] Once the speech is translated, the generated speech in the target language is merged with the background sound to provide translated audio content. This process makes the translation realistic as the background sound is retained as per the original multimedia content. When the multimedia content is video content, the translated audio content is merged with the extracted video frames to provide translated video content. In an embodiment, when the multimedia content is the video content and includes text, the present disclosure may also include extracting visual text within the video content along with corresponding parameters on the video content. The visual text of the video content may include, but is not limited to, labels, text, banners, and the like. For instance, the visual text may include a lecture written on a board in the video content. The visual text within the video content is translated from the source language to the target language using known translation techniques. The parameters of the visual text comprise positional information on the video content, font, and style of the visual text in the video content. Thereafter, the translated visual text is rendered on the video content based on the parameters of the visual text. That is, the style of the original visual text is analyzed in order to recreate the original style on the translated text.
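For the visual-text path, a hedged sketch using OpenCV and Tesseract OCR is given below; translate_text() is a hypothetical stand-in for whichever translation service is used, and real font and style matching would be considerably richer than the simple overwrite shown here.

```python
import cv2
import pytesseract

def translate_visual_text(frame, translate_text):
    """Extract visual text with its positional parameters, translate, re-render."""
    # image_to_data returns each detected word together with its positional
    # information (left/top/width/height), mirroring the parameters above.
    data = pytesseract.image_to_data(frame, output_type=pytesseract.Output.DICT)
    for i, word in enumerate(data["text"]):
        if not word.strip():
            continue
        x, y = data["left"][i], data["top"][i]
        w, h = data["width"][i], data["height"][i]
        # Paint over the original text region, then draw the translation at
        # the same position so the layout of the frame is preserved.
        cv2.rectangle(frame, (x, y), (x + w, y + h), (255, 255, 255), -1)
        cv2.putText(frame, translate_text(word), (x, y + h),
                    cv2.FONT_HERSHEY_SIMPLEX, max(h / 30.0, 0.5), (0, 0, 0), 2)
    return frame
```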
[0026] Fig.2 shows a detailed block diagram of a translation system in accordance with some embodiments of the present disclosure.
[0027] As shown, the translation system 101 may include the I/O interface 107, the memory 109, the processor 111, and modules 214. The memory 109 includes data 200. The I/O interface 107 may be configured to receive the multimedia content from the user associated with the user device 103 for translation of the multimedia content along with information regarding the target language to which the multimedia content is to be translated. Further, the I/O interface 107 may provide the translated multimedia content to the user device 103.
[0028] The processor 111 may be configured to receive the multimedia content through the I/O interface 107. Further, the processor 111 may retrieve data from the memory 109 and interact with the modules 214 to perform the translation of the multimedia content. The memory 109 may store the data 200 received through the I/O interface 107, the modules 214 and the processor 111. In one embodiment, the data 200 may also include input data 201, foreground data 203, background data 205, translated data 207, and other data 209.
[0029] The input data 201 may include details about the multimedia content received from the user device 103. The details may include type of multimedia content such as, audio or video content. Further, the details may include information about the source language associated with the multimedia content and the target language to which the multimedia content is to be translated. Further, the input data 201 may include extracted video frames of the video content, when the multimedia content is the video content.
[0030] The foreground data 203 may include information about the foreground content extracted from the multimedia content. The information may be related to the type of speech such as, verbal communication in music, lecture, and the like.
[0031] The background data 205 may include information about the background sound/noise extracted from the multimedia content. The background sound is free from the speech or verbal communication.
[0032] The translated data 207 may include the translated multimedia content in the target language.
[0033] The other data 209 may store data, including temporary data and temporary files, generated by the modules 214 for performing the various functions of the translation system 101.
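Purely as an illustration of how the data 200 could be laid out in the memory 109, the following Python dataclass mirrors the components above; the field types are assumptions, since the disclosure describes the data only at the level of its named components.

```python
from dataclasses import dataclass, field

@dataclass
class Data200:
    # input data 201: content type, source/target languages, extracted frames
    input_data: dict = field(default_factory=dict)
    # foreground data 203: extracted speech (here, raw audio bytes)
    foreground_data: bytes = b""
    # background data 205: background sound/noise, free of speech
    background_data: bytes = b""
    # translated data 207: translated multimedia content in the target language
    translated_data: bytes = b""
    # other data 209: temporary data and files generated by the modules 214
    other_data: dict = field(default_factory=dict)
```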
[0034] In some embodiments, the data 200 stored in the memory 109 may be processed by the modules 214 of the translation system 101. In an example, the modules 214 may be communicatively coupled to the processor 111 configured in the translation system 101. The modules 214 may be present outside the memory 109 as shown in Fig.2 and implemented as hardware. As used herein, the term modules may refer to an Application Specific Integrated Circuit (ASIC), an electronic circuit, a processor (shared, dedicated, or group) and a memory that execute one or more software or firmware programs, a combinational logic circuit, and/or other suitable components that provide the described functionality.
[0035] In some embodiments, the modules 214 may include, for example, a receiving module 215, an extraction module 217, a translation module 219, a merging module 221, and other modules 223. The other module may include an identification module for identifying the plurality of attributes associated with an audio associated with the audio content and the video content. The plurality of attributes may include, but not limited to, correctness of audio format, size of audio file and length of audio. Particularly, the identification module may validate metadata associated with the audio with prestored metadata for audio. The other modules 223 may include a splitting module for splitting the audio associated with the audio content and the video content into a plurality of audio chunks based on the plurality of attributes.
[0036] The other modules 223 may be used to perform various miscellaneous functionalities of the translation system 101. It will be appreciated that aforementioned modules 214 may be represented as a single module or a combination of different modules. Furthermore, a person of ordinary skill in the art will appreciate that in an implementation, the one or more modules 214 may be stored in the memory 109, without limiting the scope of the disclosure. The said modules 214 when configured with the functionality defined in the present disclosure will result in a novel hardware.
[0037] In an embodiment, the receiving module 215 may receive the multimedia content from the user device 103 along with information about the target language to which the multimedia content is to be translated.
[0038] The extraction module 217 may extract the foreground content and the background content from the multimedia content. The foreground content may include speech associated with the multimedia content. The speech refers to vocal communication associated with the multimedia content. For instance, the speech may include a singing voice in music, or a voice in a lecture, and the like. The background content, on the other hand, includes background sound. The background sound may be any background noise associated with the multimedia content. Particularly, the extraction module 217 may separate the different channels in the audio associated with the audio content and the video content, for each audio chunk, into foreground and background content. However, in the case of video content, prior to foreground and background extraction, the extraction module 217 may initially extract the audio content and video frames associated with the video content. In case the video content includes visual text, the extraction module 217 may extract the visual text within the video content along with corresponding parameters on the video content. The parameters of the visual text may include positional information on the video content, font, and style of the visual text in the video content. Subsequently, the extraction of foreground content and background content is performed for the audio content associated with the video content.
[0039] The translation module 219 may receive the extracted foreground content from the extraction module 217 and may translate the foreground content associated with the multimedia content. That is, the translation module 219 may translate the foreground content from the source language to the target language using the predefined technique. The predefined technique may include, but is not limited to, text transcription, speech to text, and the like. For instance, the foreground content may be translated from Hindi to English. Particularly, to translate the foreground content, the translation module 219 may convert the speech of the foreground content to text using the predefined technique. The predefined technique may include, but is not limited to, audio or speech transcription to text, which produces a written, printed, or textual version of speech or audio. The transcribed text is in the source language. Further, the translation module 219 may translate the text in the source language to the target language using any known translation techniques. The translation may be a timed or an untimed translation depending on the context of the multimedia content or requirement. In an embodiment, the timed or untimed translation of the text involves adjusting for the time at which each exact word/speech was spoken in the audio content. This adjustment mechanism helps in factoring in silence in the audio, pauses in the speech, slow and fast speech styles, etc. Time-segmented text helps in perfectly aligning speech at the appropriate time and place in the translated audio. This enables perfect timing and synchronization with the audio.
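The time-segmentation in [0039] can be pictured with a small sketch: each transcribed segment keeps the interval in which it was spoken, and the synthesized translation is placed back at the same offset. The TimedSegment structure and the synthesize() callback are illustrative assumptions, not structures defined in the disclosure.

```python
from dataclasses import dataclass

@dataclass
class TimedSegment:
    start_ms: int          # when the words were spoken in the source audio
    end_ms: int
    source_text: str
    translated_text: str

def place_translated_speech(segments: list[TimedSegment], synthesize):
    """Return (offset_ms, audio) pairs aligned to the source timing."""
    placed = []
    for seg in segments:
        audio = synthesize(seg.translated_text)   # hypothetical TTS callback
        # Nothing is emitted between seg.end_ms and the next segment's start,
        # so silences and pauses of the original speech are preserved.
        placed.append((seg.start_ms, audio))
    return placed
```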
[0040] Further, based on the translated text, the translation module 219 may generate the speech in the target language using predefined techniques such as text-to-speech conversion. In an embodiment, for generating the speech in the target language based on the translated text, the translation module 219 may determine the characteristics of the speech in the source language and generate the speech in the target language based on the determined characteristics. The characteristics of the speech may include, but are not limited to, pitch correction, bitrate correction, speed correction, audio positioning correction, voice cloning, emotions, and expression adjustment. In an embodiment, when the multimedia content is video content that includes visual text, the translation module 219 may translate the visual text within the video content from the source language to the target language using known translation techniques.
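As a hedged illustration of the speed and pitch corrections mentioned above, the sketch below adjusts synthesized speech toward the duration and pitch of the corresponding source speech using librosa's time-stretch and pitch-shift effects. The function name, its inputs, and the assumption that a pitch offset in semitones has already been estimated from the source audio are all hypothetical; voice cloning and emotion transfer are not shown.

```python
import librosa

def match_speech_characteristics(tts_audio, sr, source_duration, pitch_offset_steps):
    """Hypothetical helper: nudge synthesized speech toward the source
    speaker's delivery by matching segment duration (speed correction)
    and shifting pitch (pitch correction)."""
    current_duration = len(tts_audio) / sr

    # Speed correction: rate > 1 shortens the clip, rate < 1 lengthens it,
    # so the synthesized segment fits the source segment's time slot.
    rate = current_duration / source_duration
    adjusted = librosa.effects.time_stretch(tts_audio, rate=rate)

    # Pitch correction: shift by the estimated offset in semitones.
    adjusted = librosa.effects.pitch_shift(adjusted, sr=sr, n_steps=pitch_offset_steps)
    return adjusted
```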
[0041] The merging module 221 may receive the generated speech in the target language from the translation module 219, and the extracted background sound, video frames, and parameters of the visual text from the extraction module 217. The merging module 221 may merge the generated speech with the background sound to provide the translated audio content in the target language. This makes the translation realistic, as the background sound of the original multimedia content is retained. Further, when the multimedia content is video content, the merging module 221 may perform the merging in two stages. In the first stage, the generated speech is merged with the background sound to provide the translated audio content. In the second stage, the translated audio content is merged with the extracted video frames to provide the translated video content. Further, when the video content includes visual text, the merging module 221 may render the translated visual text received from the translation module 219 on the video content based on the parameters of the visual text. That is, the style of the original visual text is analyzed in order to recreate it on the translated text.
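The two-stage merge can be illustrated, under the assumption that the intermediate files already exist, with two ffmpeg invocations: one mixing the generated speech with the extracted background sound, and one replacing the original audio track with the translated one while copying the video frames unchanged. This is a sketch, not the disclosed implementation; all file names are placeholders.

```python
import subprocess

# Stage 1: mix translated speech with the extracted background track.
subprocess.run([
    "ffmpeg", "-y",
    "-i", "translated_speech.wav",
    "-i", "background.wav",
    "-filter_complex", "amix=inputs=2:duration=longest",
    "translated_audio.wav",
], check=True)

# Stage 2: replace the original audio track with the translated one,
# copying the video stream so the frames are not re-encoded.
subprocess.run([
    "ffmpeg", "-y",
    "-i", "original_video.mp4",
    "-i", "translated_audio.wav",
    "-map", "0:v", "-map", "1:a",
    "-c:v", "copy", "-shortest",
    "translated_video.mp4",
], check=True)
```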
[0042] Fig.3A-3B show exemplary flowcharts for translating audio content in accordance with some embodiments of the present disclosure.
[0043] Fig.3A shows an exemplary flowchart for translating an audio content. As shown, an audio file 301, such as a podcast, an audiobook, or any audio recording in a source language, may be received by the translation system 101. At step 302, the audio file 301 is separated into foreground content and background content; that is, the foreground content, such as the speech/voice, and the background content, such as the background sound and noise, are extracted. At step 303, the extracted foreground content, i.e., the speech, is converted to text. At step 304, the text is translated from the source language to the target language; the translated text is indicated as ‘A’. The translation can be a timed or an untimed translation depending on the context of the audio or the requirement. At step 305, the translated text ‘A’ is converted to speech in the target language. At step 306, the generated speech in the target language is merged with the background content obtained from step 302 in order to provide the translated audio content 307 in the target language. Likewise, Fig.3B shows an alternate flowchart for translating an audio content. Steps 302-304 are the same as above. After step 304, the audio file 301 is received at step 308 for determining the characteristics of the speech in the source language. The characteristics of the speech may include, but are not limited to, pitch correction, bitrate correction, speed correction, audio positioning correction, voice cloning, emotions, and expression adjustment. At step 309, the translated text ‘A’ is converted to speech in the target language based on the determined characteristics. At step 310, the generated speech in the target language is merged with the background content obtained from step 302 in order to provide the translated audio content 311, with voice cloning, in the target language.
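A compact sketch of the Fig.3A pipeline is given below. Every helper it calls (separate_sources, transcribe, translate_text, synthesize_speech, mix) is a hypothetical placeholder for the corresponding step, shown only to make the data flow, including the timed (time-segmented) translation, concrete.

```python
def translate_audio(audio_path, source_lang, target_lang):
    # Step 302: split into foreground speech and background sound.
    foreground, background, sr = separate_sources(audio_path)

    # Step 303: speech to text, keeping per-segment timestamps so the
    # translated speech can later be aligned (timed translation).
    segments = transcribe(foreground, sr, language=source_lang)

    # Step 304: translate each timed segment ('A' in the flowchart).
    translated = [
        (seg.start, seg.end, translate_text(seg.text, source_lang, target_lang))
        for seg in segments
    ]

    # Step 305: synthesize target-language speech per segment.
    speech = [
        (start, synthesize_speech(text, target_lang))
        for start, _end, text in translated
    ]

    # Step 306: mix the synthesized speech back over the background,
    # placing each utterance at its original start time.
    return mix(speech, background, sr)
```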
[0044] Fig.4A-4C show exemplary flowcharts for translating video content in accordance with some embodiments of the present disclosure. Fig.4A shows an exemplary flowchart for translating a video content. As shown, a video file 400, such as a music video or any video recording in a source language, may be received by the translation system 101. At step 401, the audio, the video frames, and the visual text, if any, associated with the video file 400 may be extracted. At step 402, the audio associated with the video file 400 is separated into foreground content and background content; that is, the foreground content, such as the speech/voice, and the background content, such as the background sound and noise, are extracted. At step 403, the extracted foreground content, i.e., the speech, is converted to text. At step 404, the text is translated from the source language to the target language; the translated text is indicated as ‘A’. The translation can be a timed or an untimed translation depending on the context of the audio or the requirement. At step 405, the translated text ‘A’ is converted to speech in the target language. At step 406, the generated speech in the target language is merged with the background content obtained from step 402 in order to provide the translated audio content in the target language. At step 407, the translated audio content is merged with the video frames obtained from step 401 to provide the translated video content, indicated as ‘C’, in the target language.
[0045] Likewise, Fig.4B shows an exemplary flowchart for translating visual text in the video content. At step 409, the visual text associated with the video content is extracted along with its corresponding parameters on the video content. The visual text of the video content may include, but is not limited to, labels, text, banners, and the like. For instance, the visual text may include a written lecture on a board in the video content. The parameters of the visual text comprise positional information on the video content, and the font and style of the visual text. At step 410, the visual text within the video content is translated from the source language to the target language using known translation techniques. At step 411, the translated visual text is rendered on the video content based on the parameters of the visual text. That is, the style of the original visual text is analyzed in order to recreate it on the translated text. Similarly, Fig.4C shows an alternate exemplary flowchart for translating a video content. Steps 401-404 are the same as above. After step 404, the audio associated with the video file 400 is received at step 408 for determining the characteristics of the speech in the source language. The characteristics of the speech may include, but are not limited to, pitch correction, bitrate correction, speed correction, audio positioning correction, voice cloning, emotions, and expression adjustment. At step 405, performed after step 408, the translated text ‘A’ is converted to speech in the target language based on the determined characteristics. At step 406, the generated speech in the target language is merged with the background content obtained from step 402 in order to provide the translated audio content, with voice cloning, in the target language. At step 407, the translated audio content is merged with the video frames obtained from step 401 to provide the translated video content, indicated as ‘C’, in the target language.
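For the visual-text path of Fig.4B, one possible realization extracts text and its position from a video frame with OCR and repaints the translation at the same location. The sketch below assumes pytesseract and Pillow are available; translate_text is a hypothetical helper, and the font file is a stand-in for the detected font/style.

```python
import pytesseract
from PIL import Image, ImageDraw, ImageFont

frame = Image.open("frame.png")

# Step 409: extract visual text with positional parameters.
data = pytesseract.image_to_data(frame, output_type=pytesseract.Output.DICT)

draw = ImageDraw.Draw(frame)
font = ImageFont.truetype("DejaVuSans.ttf", 24)  # stand-in for the detected style

for i, text in enumerate(data["text"]):
    if not text.strip():
        continue
    x, y, w, h = (data[k][i] for k in ("left", "top", "width", "height"))

    # Step 410: translate the detected text (hypothetical MT helper).
    translated = translate_text(text, "hi", "en")

    # Step 411: paint over the original text and render the translation
    # at the same position; a full implementation would also match the
    # original font, size, and color.
    draw.rectangle([x, y, x + w, y + h], fill="white")
    draw.text((x, y), translated, font=font, fill="black")

frame.save("frame_translated.png")
```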
[0046] Fig.5 shows a flow chart illustrating a method of translating a multimedia content in accordance with some embodiments of the present disclosure.
[0047] As illustrated in Fig.5, the method 500 includes one or more blocks illustrating a method of translating a multimedia content. The order in which the method 500 is described is not intended to be construed as a limitation, and any number of the described method blocks can be combined in any order to implement the method. Additionally, individual blocks may be deleted from the methods without departing from the spirit and scope of the subject matter described herein. Furthermore, the method can be implemented in any suitable hardware, software, firmware, or combination thereof.
[0048] At block 501, the method 500 may include receiving, by the receiving module 215, the multimedia content from the user device 103. The multimedia content is in the source language. The multimedia content is one of an audio content and a video content.
[0049] At block 503, the method 500 may include extracting, by the extraction module 217, the foreground content and the background content from the multimedia content. The foreground content comprises speech and the background content comprises background sound.
[0050] At block 505, the method 500 may include translating, by the translation module 219, the foreground content of the multimedia content from the source language to the target language using one or more techniques. Particularly, translating the foreground content includes converting the speech of the foreground content to text using a predefined technique, where the text is in the source language, translating the text from the source language to the target language, and generating the speech in the target language based on the translated text. In an embodiment, generating the speech in the target language based on the translated text further comprises determining the characteristics of the speech in the source language and generating the speech in the target language based on the determined characteristics.
[0051] At block 507, the method 500 may include merging, by the merging module 221, the translated foreground content with the background content for providing the translated multimedia content in the target language. In an embodiment, the generated speech in the target language is merged with the background sound to provide the translated audio content. Further, when the multimedia content is video content, the translated audio content is merged with the video frames to provide the translated video content.
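As a hedged sketch of a timed merge at block 507, the snippet below overlays each synthesized utterance on the extracted background track at the instant the original utterance was spoken, using pydub. The segment start times and file names are hypothetical inputs; preserving these offsets is what retains the original pauses and silence.

```python
from pydub import AudioSegment

# Hypothetical inputs: the extracted background track and a list of
# (start_ms, wav_path) pairs, one synthesized segment per utterance.
background = AudioSegment.from_file("background.wav")
segments = [(1200, "seg1.wav"), (5400, "seg2.wav")]

translated_audio = background
for start_ms, path in segments:
    speech = AudioSegment.from_file(path)
    # Overlay each translated utterance at the time the original
    # utterance was spoken, preserving pauses and silence.
    translated_audio = translated_audio.overlay(speech, position=start_ms)

translated_audio.export("translated_audio.wav", format="wav")
```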
[0052] In an embodiment, when the multimedia content is video content, the method 500 further includes extracting the audio content and the video frames associated with the video content, and the visual text within the video content along with its corresponding parameters. The visual text within the video content is translated from the source language to the target language using predefined translation techniques. Further, the method 500 includes rendering the translated visual text based on the parameters of the visual text.
[0053] Fig.6 illustrates a block diagram of an exemplary computer system 600 for implementing embodiments consistent with the present disclosure. In an embodiment, the computer system 600 may be a system for translating the multimedia content. The computer system 600 may include a central processing unit (“CPU” or “processor”) 602. The processor 602 may comprise at least one data processor for executing program components for executing user or system-generated business processes. The processor 602 may include specialized processing units such as integrated system (bus) controllers, memory management control units, floating point units, graphics processing units, digital signal processing units, etc.
[0054] The processor 602 may be disposed in communication with one or more input/output (I/O) devices (612 and 613) via I/O interface 601. The I/O interface 601 may employ communication protocols/methods such as, without limitation, audio, analog, digital, stereo, IEEE-1394, serial bus, Universal Serial Bus (USB), infrared, PS/2, BNC, coaxial, component, composite, Digital Visual Interface (DVI), High-Definition Multimedia Interface (HDMI), Radio Frequency (RF) antennas, S-Video, Video Graphics Array (VGA), IEEE 802.11a/b/g/n/x, Bluetooth, cellular (e.g., Code-Division Multiple Access (CDMA), High-Speed Packet Access (HSPA+), Global System for Mobile Communications (GSM), Long-Term Evolution (LTE), or the like), etc. Using the I/O interface 601, the computer system 600 may communicate with the one or more I/O devices 612 and 613.
[0055] In some embodiments, the processor 602 may be disposed in communication with a communication network 609 via a network interface 603. The network interface 603 may communicate with the communication network 609. The network interface 603 may employ connection protocols including, without limitation, direct connect, Ethernet (e.g., twisted pair 10/100/1000 Base T), Transmission Control Protocol/Internet Protocol (TCP/IP), token ring, IEEE 802.11a/b/g/n/x, etc. The communication network 609 may be used to receive the multimedia content from a user device 614.
[0056] The communication network 609 can be implemented as one of several types of networks, such as an intranet or a Local Area Network (LAN), within the organization. The communication network 609 may either be a dedicated network or a shared network, which represents an association of several types of networks that use a variety of protocols, for example, Hypertext Transfer Protocol (HTTP), Transmission Control Protocol/Internet Protocol (TCP/IP), Wireless Application Protocol (WAP), etc., to communicate with each other. Further, the communication network 609 may include a variety of network devices, including routers, bridges, servers, computing devices, storage devices, etc.
[0057] In some embodiments, the processor 602 may be disposed in communication with a memory 605 (e.g., RAM, ROM, etc., as shown in Fig. 6) via a storage interface 604. The storage interface 604 may connect to the memory 605 including, without limitation, memory drives, removable disc drives, etc., employing connection protocols such as Serial Advanced Technology Attachment (SATA), Integrated Drive Electronics (IDE), IEEE-1394, Universal Serial Bus (USB), fiber channel, Small Computer Systems Interface (SCSI), etc. The memory drives may further include a drum, magnetic disc drive, magneto-optical drive, optical drive, Redundant Array of Independent Discs (RAID), solid-state memory devices, solid-state drives, etc.

[0058] The memory 605 may store a collection of program or database components, including, without limitation, a user/application 606, an operating system 607, a web browser 608, a mail client, a mail server, a web server, and the like. In some embodiments, the computer system 600 may store user/application data, such as the data, variables, records, etc., as described in this disclosure. Such databases may be implemented as fault-tolerant, relational, scalable, secure databases such as Oracle® or Sybase®.
[0059] The operating system 607 may facilitate resource management and operation of the computer system 600. Examples of operating systems include, without limitation, APPLE MACINTOSH® OS X, UNIX®, UNIX-like system distributions (e.g., BERKELEY SOFTWARE DISTRIBUTION™ (BSD), FREEBSD™, NETBSD™, OPENBSD™, etc.), LINUX DISTRIBUTIONS™ (e.g., RED HAT™, UBUNTU™, KUBUNTU™, etc.), IBM™ OS/2, MICROSOFT™ WINDOWS™ (XP™, VISTA™/7/8, 10, etc.), APPLE® IOS™, GOOGLE® ANDROID™, BLACKBERRY® OS, or the like. A user interface may facilitate display, execution, interaction, manipulation, or operation of program components through textual or graphical facilities. For example, user interfaces may provide computer interaction interface elements on a display system operatively connected to the computer system 600, such as cursors, icons, check boxes, menus, windows, widgets, etc. Graphical User Interfaces (GUIs) may be employed, including, without limitation, APPLE MACINTOSH® operating systems, IBM™ OS/2, MICROSOFT™ WINDOWS™ (XP™, VISTA™/7/8, 10, etc.), Unix® X-Windows, web interface libraries (e.g., AJAX™, DHTML™, ADOBE® FLASH™, JAVASCRIPT™, JAVA™, etc.), or the like.
[0060] Furthermore, one or more computer-readable storage media may be utilized in implementing embodiments consistent with the present disclosure. A computer-readable storage medium refers to any type of physical memory on which information or data readable by a processor may be stored. Thus, a computer-readable storage medium may store instructions for execution by one or more processors, including instructions for causing the processor(s) to perform steps or stages consistent with the embodiments described herein. The term "computer-readable medium" should be understood to include tangible items and exclude carrier waves and transient signals, i.e., it is non-transitory. Examples include Random Access Memory (RAM), Read-Only Memory (ROM), volatile memory, non-volatile memory, hard drives, Compact Disc Read-Only Memories (CD-ROMs), Digital Video Discs (DVDs), flash drives, disks, and any other known physical storage media.
[0061] In an embodiment, the present disclosure facilitates translation of multimedia content with higher accuracy and with a realistic rendering of the expression of the source language.
[0062] The terms "an embodiment", "embodiment", "embodiments", "the embodiment", "the embodiments", "one or more embodiments", "some embodiments", and "one embodiment" mean "one or more (but not all) embodiments of the invention(s)" unless expressly specified otherwise.
[0063] The terms "including", "comprising", “having” and variations thereof mean "including but not limited to", unless expressly specified otherwise. The enumerated listing of items does not imply that any or all the items are mutually exclusive, unless expressly specified otherwise.
[0064] The terms "a", "an" and "the" mean "one or more", unless expressly specified otherwise.
[0065] A description of an embodiment with several components in communication with each other does not imply that all such components are required. On the contrary, a variety of optional components are described to illustrate the wide variety of possible embodiments of the invention.
[0066] When a single device or article is described herein, it will be clear that more than one device/article (whether or not they cooperate) may be used in place of a single device/article. Similarly, where more than one device or article is described herein (whether or not they cooperate), it will be clear that a single device/article may be used in place of the more than one device or article, or a different number of devices/articles may be used instead of the shown number of devices or programs. The functionality and/or the features of a device may be alternatively embodied by one or more other devices which are not explicitly described as having such functionality/features. Thus, other embodiments of the invention need not include the device itself.

[0067] Finally, the language used in the specification has been principally selected for readability and instructional purposes, and it may not have been selected to delineate or circumscribe the inventive subject matter. It is therefore intended that the scope of the invention be limited not by this detailed description, but rather by any claims that issue on an application based hereon. Accordingly, the embodiments of the present invention are intended to be illustrative, but not limiting, of the scope of the invention, which is set forth in the following claims.
[0068] While various aspects and embodiments have been disclosed herein, other aspects and embodiments will be apparent to those skilled in the art. The various aspects and embodiments disclosed herein are for purposes of illustration and are not intended to be limiting, with the true scope and spirit being indicated by the following claims.
Referral Numerals:

CLAIMS:
1. A method for translating a multimedia content, the method comprising: receiving, by a translation system (101), multimedia content from a user device (103), wherein the multimedia content is in a source language; extracting, by the translation system (101), foreground content and background content from the multimedia content; translating, by the translation system (101), the foreground content of the multimedia content from the source language to a target language using one or more techniques; and merging, by the translation system (101), the translated foreground content with the background content for providing the translated multimedia content in the target language.
2. The method as claimed in claim 1, wherein the multimedia content is one of an audio content and a video content.
3. The method as claimed in claim 1, wherein the foreground content comprises speech and the background content comprises background sound.
4. The method as claimed in claim 1, wherein translating the foreground content comprises: converting speech of the foreground content to text using a predefined technique, wherein the text is in the source language; translating the text from the source language to the target language; and generating a speech in the target language based on the translated text.
5. The method as claimed in claim 4, wherein the generated speech in the target language is merged with background sound to provide translated audio content.
6. The method as claimed in claim 4, wherein generating the speech in the target language based on the translated text further comprises determining a characteristic of the speech in the source language and generating the speech in the target language based on the determined characteristic.
7. The method as claimed in claim 6, wherein the characteristic of the speech comprises pitch correction, bitrate correction, speed correction, audio positioning correction, voice cloning, emotions, and expression adjustment.
8. The method as claimed in claim 2 further comprising: identifying a plurality of attributes of the audio content; and splitting the audio content into a plurality of audio chunks based on the plurality of attributes.
9. The method as claimed in claim 8, wherein the plurality of attributes comprises correctness of audio format, size of audio file and length of audio.
10. The method as claimed in claim 5, wherein the translated audio content is merged with video frames to provide translated video content, when the multimedia content is video content.
11. The method as claimed in claim 1, wherein receiving the multimedia content further comprises extracting an audio content and video frames associated with a video content and visual text within the video content along with corresponding parameters on the video content, when the multimedia content is the video content.
12. The method as claimed in claim 11, wherein the visual text within the video content is translated from a source language to a target language using predefined translation techniques.
13. The method as claimed in claim 12 further comprising rendering the translated visual text based on parameters of the visual text.
14. The method as claimed in claim 13, wherein the parameters of the visual text comprise positional information on the video content, font, and style of the visual text in the video content.
15. A translation system (101) for translating multimedia content, comprising: a processor (111); and a memory (109) communicatively coupled to the processor (111), wherein the memory (109) stores processor instructions, which, on execution, causes the processor (111) to: receive multimedia content from a user device (103), wherein the multimedia content is in a source language; extract foreground content and background content from the multimedia content; translate the foreground content of the multimedia content from the source language to a target language using one or more techniques; and merge the translated foreground content with the background content for providing the translated multimedia content in the target language.
16. The translation system (101) as claimed in claim 15, wherein the multimedia content is one of an audio content and a video content.
17. The translation system (101) as claimed in claim 15, wherein the foreground content comprises speech and the background content comprises background sound.
18. The translation system (101) as claimed in claim 15, wherein the processor translates the foreground content by: converting speech of the foreground content to text using a predefined technique, wherein the text is in the source language; translating the text from the source language to the target language; and generating a speech in the target language based on the translated text.
19. The translation system (101) as claimed in claim 18, wherein the processor merges the generated speech in the target language with background sound to provide translated audio content.
20. The translation system (101) as claimed in claim 18, wherein the processor generates the speech in the target language based on the translated text by determining a characteristic of the speech in the source language and generating the speech in the target language based on the determined characteristic.
21. The translation system (101) as claimed in claim 20, wherein the characteristic of the speech comprises pitch correction, bitrate correction, speed correction, audio positioning correction, voice cloning, emotions, and expression adjustment.
22. The translation system (101) as claimed in claim 16, wherein the processor is configured to: identify a plurality of attributes of the audio content; and split the audio content into a plurality of audio chunks based on the plurality of attributes.
23. The translation system (101) as claimed in claim 22, wherein the plurality of attributes comprises correctness of audio format, size of audio file and length of audio.
24. The translation system (101) as claimed in claim 19, wherein the processor merges the translated audio content with video frames to provide translated video content, when the multimedia content is video content.
25. The translation system (101) as claimed in claim 15, wherein the processor is further configured to extract an audio content and video frames associated with a video content and visual text within the video content along with corresponding parameters on the video content, when the multimedia content is the video content.
26. The translation system (101) as claimed in claim 25, wherein the processor translates the visual text within the video content from a source language to a target language using predefined translation techniques.
27. The translation system (101) as claimed in claim 26, wherein the processor is further configured to render the translated visual text based on parameters of the visual text.
28. The translation system (101) as claimed in claim 27, wherein the parameters of the visual text comprise positional information on the video content, font, and style of the visual text in the video content.
PCT/IB2022/053263 2021-04-07 2022-04-07 A method and system for translating a multimedia content WO2022215026A1 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
IN202041043628 2021-04-07
IN202041043627 2021-04-07
IN202041043627 2021-04-07
IN202041043628 2021-04-07

Publications (1)

Publication Number Publication Date
WO2022215026A1

Family

ID=83546659

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2022/053263 WO2022215026A1 (en) 2021-04-07 2022-04-07 A method and system for translating a multimedia content

Country Status (1)

Country Link
WO (1) WO2022215026A1 (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080243473A1 (en) * 2007-03-29 2008-10-02 Microsoft Corporation Language translation of visual and audio input
WO2020181133A1 (en) * 2019-03-06 2020-09-10 Syncwords Llc System and method for simultaneous multilingual dubbing of video-audio programs



Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22784262

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 22784262

Country of ref document: EP

Kind code of ref document: A1