WO2023121850A1 - Automatic in-game subtitles and closed captions - Google Patents

Automatic in-game subtitles and closed captions

Info

Publication number
WO2023121850A1
Authority
WO
WIPO (PCT)
Prior art keywords
game
subtitle
overlay
audio stream
audio
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/US2022/051581
Other languages
English (en)
French (fr)
Inventor
Wei Liang
Ilia Blank
Patrick FOK
Le ZHANG
Michael Schmit
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
ATI Technologies ULC
Advanced Micro Devices Inc
Original Assignee
ATI Technologies ULC
Advanced Micro Devices Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by ATI Technologies ULC, Advanced Micro Devices Inc
Priority to KR1020247024720A (KR20240131376A)
Priority to JP2024535349A (JP2025504748A)
Priority to EP22912270.0A (EP4452432A4)
Priority to CN202280084788.1A (CN118414200A)
Publication of WO2023121850A1
Legal status: Ceased

Classifications

    • A HUMAN NECESSITIES
    • A63 SPORTS; GAMES; AMUSEMENTS
    • A63F CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F13/00 Video games, i.e. games using an electronically generated display having two or more dimensions
    • A63F13/50 Controlling the output signals based on the game progress
    • A63F13/53 Controlling the output signals based on the game progress involving additional visual information provided to the game scene, e.g. by overlay to simulate a head-up display [HUD] or displaying a laser sight in a shooting game
    • A HUMAN NECESSITIES
    • A63 SPORTS; GAMES; AMUSEMENTS
    • A63F CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F13/00 Video games, i.e. games using an electronically generated display having two or more dimensions
    • A63F13/30 Interconnection arrangements between game servers and game devices; Interconnection arrangements between game devices; Interconnection arrangements between game servers
    • A63F13/35 Details of game servers
    • A63F13/355 Performing operations on behalf of clients with restricted processing capabilities, e.g. servers transform changing game scene into an encoded video stream for transmitting to a mobile phone or a thin client
    • A HUMAN NECESSITIES
    • A63 SPORTS; GAMES; AMUSEMENTS
    • A63F CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F13/00 Video games, i.e. games using an electronically generated display having two or more dimensions
    • A63F13/40 Processing input control signals of video game devices, e.g. signals generated by the player or derived from the environment
    • A63F13/42 Processing input control signals of video game devices, e.g. signals generated by the player or derived from the environment by mapping the input signals into game commands, e.g. mapping the displacement of a stylus on a touch screen to the steering angle of a virtual vehicle
    • A63F13/424 Processing input control signals of video game devices, e.g. signals generated by the player or derived from the environment by mapping the input signals into game commands, e.g. mapping the displacement of a stylus on a touch screen to the steering angle of a virtual vehicle involving acoustic input signals, e.g. by using the results of pitch or rhythm extraction or voice recognition
    • A HUMAN NECESSITIES
    • A63 SPORTS; GAMES; AMUSEMENTS
    • A63F CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F13/00 Video games, i.e. games using an electronically generated display having two or more dimensions
    • A63F13/85 Providing additional services to players
    • A63F13/87 Communicating with other players during game play, e.g. by e-mail or chat
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/06 Decision making techniques; Pattern matching strategies
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/26 Recognition of special voice characteristics, e.g. for use in lie detectors; Recognition of animal voices
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/439 Processing of audio elementary streams
    • H04N21/4394 Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47 End-user applications
    • H04N21/478 Supplemental services, e.g. displaying phone caller identification, shopping application
    • H04N21/4781 Games
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47 End-user applications
    • H04N21/488 Data services, e.g. news ticker
    • H04N21/4884 Data services, e.g. news ticker for displaying subtitles
    • A HUMAN NECESSITIES
    • A63 SPORTS; GAMES; AMUSEMENTS
    • A63F CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F2300/00 Features of games using an electronically generated display having two or more dimensions, e.g. on a television screen, showing representations related to the game
    • A63F2300/30 Features of games using an electronically generated display having two or more dimensions, e.g. on a television screen, showing representations related to the game characterized by output arrangements for receiving control signals generated by the game device
    • A63F2300/303 Features of games using an electronically generated display having two or more dimensions, e.g. on a television screen, showing representations related to the game characterized by output arrangements for receiving control signals generated by the game device for displaying additional data, e.g. simulating a Head Up Display
    • A HUMAN NECESSITIES
    • A63 SPORTS; GAMES; AMUSEMENTS
    • A63F CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F2300/00 Features of games using an electronically generated display having two or more dimensions, e.g. on a television screen, showing representations related to the game
    • A63F2300/50 Features of games using an electronically generated display having two or more dimensions, e.g. on a television screen, showing representations related to the game characterized by details of game servers
    • A63F2300/57 Features of games using an electronically generated display having two or more dimensions, e.g. on a television screen, showing representations related to the game characterized by details of game servers details of game services offered to the player
    • A63F2300/572 Communication between players during game play of non game information, e.g. e-mail, chat, file transfer, streaming of audio and streaming of video

Definitions

  • Subtitles or closed captions for interactive content can provide a key accessibility feature for users with hearing impairments or difficult listening environments. Users that are deaf, hard of hearing, or affected by tinnitus or other hearing conditions may not be able to fully understand audio cues and spoken dialogue. Noisy environments can exacerbate the problem, such as when a user is using public transport, traversing crowded spaces, or is in proximity to construction, traffic, musical performances, or other sources of background noise. Conversely, in environments where silence must be maintained, such as at offices or libraries, or late at night when noise ordinances may be in effect, audio may need to be played at low volume or muted, rendering audio difficult to hear clearly.
  • While headphones may assist in hearing audio, headphones may be misplaced, forgotten, or incompatible with hearing aids or other devices.
  • Even when spoken dialogue is clearly audible to the user, it may be spoken in a foreign language or in a dialect or accent that is not readily understood by the user.
  • Subtitles or closed captions can assist the user in better understanding audio.
  • FIG. 1 is a block diagram that depicts a system for implementing automatic in-game subtitles, as described herein.
  • FIG. 2A is a diagram that depicts an example graphical user interface (GUI) of a video game application.
  • FIG. 2B is a diagram that depicts an example graphical user interface (GUI) of a video game application with automatic in-game subtitles.
  • FIG. 2C is a diagram that depicts an example graphical user interface (GUI) of a video game application with automatic in-game subtitles positioned in proximity to sound sources.
  • FIG. 3 is a flow diagram that depicts an approach for implementing automatic in-game subtitles.
  • An approach is provided for a gaming overlay application to provide automatic in-game subtitles and/or closed captions for video game applications.
  • the overlay application accesses an audio stream and a video stream generated by an executing game application.
  • the video stream comprises frames of image data that are rendered during the executing of the game application.
  • the overlay application processes the audio stream through a text conversion engine, which, in implementations, includes a speech-to-text engine, to generate at least one subtitle.
  • the overlay application determines a display position to associate with the at least one subtitle.
  • the overlay application generates a subtitle overlay comprising the at least one subtitle located at the associated display position.
  • the overlay application causes at least a portion of the video stream to be displayed with the subtitle overlay.
  • Techniques discussed herein enable a gaming overlay application to analyze real-time audio streams from a video game to generate subtitles to be displayed, even when the video game does not natively support subtitles.
  • Using various cues, such as multi-channel surround sound information and machine-learning-based voice profile matching, dialogue and audio cues are associated with specific characters, multiplayer users, or other elements shown in-game, and subtitles are positioned onscreen at a user-preferred location or in proximity to the associated sound source.
  • A user can thus quickly identify a speaker and their associated dialogue even if audio is difficult to hear or muted. This enables the user to react more quickly and efficiently to audio cues even with hearing impediments or challenging listening environments.
  • Subtitles are shown in a variety of contexts, including in cutscenes, in matching lobbies, or during gameplay.
  • FIG. 1 is a block diagram that depicts a system 100 for implementing automatic in-game subtitles, as described herein.
  • Subtitles include transcriptions or translations of dialogue or speech of a video, video game, etc. and descriptions of sound effects, musical cues or other relevant audio information from the video/video game.
  • references to subtitles also include closed captions or subtitles with additional context such as speaker identification and non-speech elements such as descriptions of sound effects and audio cues.
  • system 100 includes computing device 110, network 160, input/output (I/O) devices 170, and display 180.
  • computing device 110 includes processor 120, graphics processing unit (GPU) 122, data bus 124, and memory 130.
  • GPU 122 includes memory for storing one or more frame buffers 123 .
  • memory 130 stores game application 140 and gaming overlay application 150.
  • game application 140 outputs audio stream 142 and video stream 144.
  • Gaming overlay application 150 includes text conversion engine 152, subtitle compositor 154, voice profile database 156, and user preferences 158.
  • I/O devices 170 include microphone 172 and speakers 174.
  • Display 180 includes an interface to receive game graphics 182 from computing device 110.
  • game graphics 182 includes subtitle overlay 190.
  • the components of system 100 are only exemplary and any configuration of system 100 is usable according to the requirements of game application 140.
  • Game application 140 is executed on computing device 110 by one or more of processor 120, GPU 122, or other computing resources not specifically depicted.
  • Processor 120 is any type of general-purpose single-core or multi-core processor, or a specialized processor such as an application-specific integrated circuit (ASIC) or field-programmable gate array (FPGA). In implementations, more than one processor 120 is present.
  • GPU 122 is any type of specialized hardware for graphics processing, which is addressable using various graphics application programming interfaces (APIs) such as DirectX, Vulkan, OpenGL, and OpenCL.
  • GPU 122 includes frame buffers 123, where finalized video frames are stored before outputting to display 180.
  • Data bus 124 is any high-speed interconnect for communications between components of computing device 110, such as a Peripheral Component Interconnect (PCI) Express bus, an Infinity Fabric, or an Infinity Architecture.
  • Memory 130 is any type of memory, such as a random access memory (RAM) or other storage device.
  • game application 140 generates audio stream 142 and video stream 144, corresponding to real-time audio and video content.
  • audio stream 142 and video stream 144 are combined into a single audiovisual stream.
  • Audio stream 142 corresponds to internally generated in-game audio and in implementations includes multiple channels for surround sound and/or 3D positional audio information.
  • game application 140 supports multiplayer gaming via network 160.
  • voice chat streams from game participants are embedded in audio stream 142, either combined with existing in-game audio or as separate channels to be mixed by the operating system.
  • microphone 172 is used to record voice chat from participants.
  • While gaming overlay application 150 is depicted as receiving audio stream 142 from game application 140, in implementations, audio stream 142 is received from an audio mixer output provided by an operating system of computing device 110.
  • video stream 144 corresponds to in-game visuals which are generated by GPU 122 and exposed for access via a video capture service provided by GPU 122.
  • completed frame buffers 123 are buffered in memory 130 for access by a video streaming application.
  • gaming overlay application 150 is depicted as accessing video stream 144 from game application 140.
  • gaming overlay application 150 corresponds to any program that includes functionality to display an overlay on top of in-game video content. This includes programs provided by the manufacturer of GPU 122, such as Radeon Software Crimson ReLive Edition or GeForce Experience, gaming clients such as Steam with Steam Overlay, voice chat tools such as Discord, or operating system features such as Windows Xbox Game Bar.
  • Gaming overlay application 150 allows the user to enable options, such as displaying an in-game overlay for configuring video capture, video streaming, audio mixing, voice chat, game profile settings, friend lists, and other options.
  • Gaming overlay application 150 includes functionality for video and audio capture and streaming. In implementations, this functionality is utilized to capture audio stream 142 and video stream 144 from game application 140. In implementations, gaming overlay application 150 is further extended to support automatic in-game subtitles by implementing or accessing text conversion engine 152 and subtitle compositor 154. In implementations, text conversion engine 152 accesses audio stream 142 and generates text corresponding to detected speech or sound effects.
  • text conversion engine 152 includes a speech-to-text engine and a video game sound effect detection engine.
  • Example speech-to-text engines include DeepSpeech, Wav2Letter++, OpenSeq2Seq, Vosk, and ESPnet. By using alternative models that are trained with video game sound effects and other non-dialogue audio cues, the speech-to-text engines are also adaptable for use as video game sound effect detection engines.
  • audio stream 142 is loaded into buffers of a limited size for processing through text conversion engine 152.
  • the buffers are capped at a maximum size or length, such as no longer than 5 seconds, and buffers are split opportunistically according to pauses or breaks detected in audio stream 142.
  • dialogue is processed in buffers containing short dialogue phrases and processed for displaying as quickly as possible.
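The buffering described above can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: it assumes mono samples given as floats in [-1, 1], and the function name `split_audio`, the silence threshold, and the minimum pause length are hypothetical choices. Only the 5-second cap comes from the description.

```python
from typing import List, Sequence


def split_audio(samples: Sequence[float], sample_rate: int,
                max_seconds: float = 5.0,
                silence_threshold: float = 0.01,
                min_pause_seconds: float = 0.2) -> List[List[float]]:
    """Split samples into buffers at detected pauses, capped at max_seconds."""
    max_len = int(max_seconds * sample_rate)
    min_pause = max(1, int(min_pause_seconds * sample_rate))
    buffers: List[List[float]] = []
    current: List[float] = []
    quiet_run = 0  # consecutive samples below the (assumed) silence threshold
    for sample in samples:
        current.append(sample)
        quiet_run = quiet_run + 1 if abs(sample) < silence_threshold else 0
        # Split opportunistically once a pause follows some non-silent audio.
        at_pause = quiet_run >= min_pause and len(current) > quiet_run
        if at_pause or len(current) >= max_len:
            buffers.append(current)
            current = []
            quiet_run = 0
    if current:
        buffers.append(current)
    return buffers
```

Each returned buffer could then be handed to the text conversion engine independently, so a short phrase is transcribed as soon as the pause after it is detected rather than after the full cap elapses.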
  • subtitle compositor 154 determines display positions associated with the subtitles.
  • user preferences 158 define a preferred area of the screen for displaying subtitles, such as near the bottom of the screen.
  • video stream 144 is scanned for user interface elements of game application 140, such as health indicators or other in-game indicators that are preferably kept unobscured, and these areas are marked as exclusion areas or keep-out zones that should not display subtitles.
  • subtitle compositor 154 positions the subtitles in proximity to an in-game object associated with the in-game speaker, as described in conjunction with FIG. 2C below.
  • voices detected in audio stream 142 are matched to machine learned classifications stored in voice profile database 156.
  • spatial audio cues from audio stream 142 are utilized to triangulate a position of an in-game object associated with the in-game speaker.
  • While text conversion engine 152 and voice profile database 156 are shown as integral to gaming overlay application 150, in implementations, components of gaming overlay application 150 are implemented by a remote service (e.g., cloud server) that is accessed via network 160. This enables offloading of various tasks, such as text conversion, foreign language translation, and/or machine learning matching tasks to external cloud services.
  • subtitle overlay 190 is generated accordingly. Display characteristics of the subtitles, such as font color and size, are set according to one or more of user preferences 158, readability considerations, or speaker intent detected from audio stream 142 as discussed further herein.
  • subtitle overlay 190 is merged with data from one or more frame buffers 123 that are finalized prior to output to display 180, for example as one or more processing steps in a rendering pipeline within GPU 122, or by a desktop compositor of an operating system running on computing device 110. In this manner, subtitle support is provided via gaming overlay application 150 even when game application 140 does not natively support subtitles.
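One way to picture the merge of an overlay with a finalized frame is per-pixel alpha blending. The description does not say which blending method is used, so the sketch below is an assumption throughout: frames are modeled as rows of RGB tuples, the overlay as rows of RGBA tuples, and `blend_pixel` and `composite` are hypothetical names. A real rendering pipeline or desktop compositor would do this on the GPU.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]              # RGB pixel of a finalized frame
OverlayPixel = Tuple[int, int, int, int]  # RGBA pixel of the subtitle overlay


def blend_pixel(frame_px: Pixel, overlay_px: OverlayPixel) -> Pixel:
    """Alpha-blend one overlay pixel over one frame pixel."""
    r, g, b, a = overlay_px
    alpha = a / 255.0
    blended = [round(alpha * o + (1.0 - alpha) * f)
               for o, f in zip((r, g, b), frame_px)]
    return (blended[0], blended[1], blended[2])


def composite(frame: List[List[Pixel]],
              overlay: List[List[OverlayPixel]]) -> List[List[Pixel]]:
    """Return a new frame with the overlay blended on top, row by row."""
    return [[blend_pixel(f, o) for f, o in zip(frame_row, overlay_row)]
            for frame_row, overlay_row in zip(frame, overlay)]
```

Fully transparent overlay pixels leave the game frame untouched, which is why the game itself needs no knowledge of the subtitles.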
  • Display 280A represents a display of game application 140 when subtitle overlay 190 is not generated or is disabled, or when gaming overlay application 150 is not running. In these cases, no subtitles appear and only in-game elements are shown, including character 284A positioned to the left side of display 280A, character 284B positioned to the right side of display 280A, and user interface element 286 displaying gameplay status including user health and ammo.
  • subtitle overlay 290B is overlaid on top of game graphics 282 and includes the subtitles of “(Explosion sound from the right)” and “That doesn’t sound good. Let’s proceed down the left hallway instead.”
  • subtitle overlay 290B is positioned near the bottom of display 280B, which is set, in implementations, according to user preferences 158. Further, note that subtitle overlay 290B avoids placement of subtitles over user interface element 286, thereby maintaining visibility of vital in-game information.
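The placement behavior shown in FIG. 2B, a preferred bottom position that avoids user interface elements, might be sketched as below. The `Rect` type, the `place_subtitle` function, and the strategy of sliding the subtitle upward in fixed steps are all assumptions made for illustration; the description only states that exclusion areas should not display subtitles.

```python
from typing import List, NamedTuple, Optional


class Rect(NamedTuple):
    x: int
    y: int
    w: int
    h: int

    def overlaps(self, other: "Rect") -> bool:
        """True when the two rectangles share any area."""
        return (self.x < other.x + other.w and other.x < self.x + self.w
                and self.y < other.y + other.h and other.y < self.y + self.h)


def place_subtitle(screen_w: int, screen_h: int, sub_w: int, sub_h: int,
                   exclusions: List[Rect], margin: int = 20,
                   step: int = 10) -> Optional[Rect]:
    """Start bottom-center and move up until no exclusion area is covered."""
    x = (screen_w - sub_w) // 2
    y = screen_h - sub_h - margin
    while y >= 0:
        candidate = Rect(x, y, sub_w, sub_h)
        if not any(candidate.overlaps(area) for area in exclusions):
            return candidate
        y -= step  # hypothetical fallback: slide upward in fixed steps
    return None  # no free position found in this column
```

For example, with a heads-up display strip occupying the bottom 100 pixels of a 1920x1080 frame, the subtitle is lifted just above that strip instead of covering it.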
  • Subtitle overlays 290C and 290D are overlaid on top of game graphics 282.
  • Subtitle overlay 290C contains the subtitle “That doesn’t sound good. Let’s proceed down the left hallway instead.” Further, subtitle overlay 290C is positioned to be proximate to an in-game object (e.g., character 284A) associated with an in-game speaker and appears in a speech bubble.
  • Subtitle overlay 290D contains the closed caption “(Explosion sound)” and is positioned proximate to the right of display 280C. In this example, subtitle overlay 290D points offscreen since the explosion itself was determined to occur at a position to the right of the user that is not visible in game graphics 282.
  • The positions of audio sources in the game world are estimated according to positional cues in audio stream 142. For example, stereo audio panning position is used to determine whether an audio source is located to the left, right, or center of the user’s current viewpoint in the game world represented by video stream 144.
  • the position of audio sources is estimated with greater accuracy, such as in front, behind, above, or below the user’s current viewpoint.
  • multichannel or positional 3D audio in audio stream 142 indicates that the current in-game speaker is heard primarily from the left channels of speakers 174.
  • The in-game object associated with the in-game speaker is more likely to be character 284A, to the left, rather than character 284B, to the right.
  • audio stream 142 indicates that the explosion sound is heard primarily from the right channels of speakers 174.
  • the explosion itself is determined to be offscreen and further to the right.
  • These positional audio cues are factors used to determine the positioning of subtitle overlays 290C and 290D within the display such that they are proximate to their sound source or in-game object associated with the in-game speaker. For example, sounds heard primarily from center or rear surround channels indicate sound sources positioned in the front center or behind the user in a game world rendered by game application 140, whereas sounds heard primarily from height channels indicate sound sources positioned above the user.
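For the stereo case, the left, right, or center decision could be approximated by comparing channel energy. The sketch below is one simple possibility, not the method claimed in the patent: `estimate_pan` and `classify_direction` are hypothetical names, root-mean-square level is an assumed measure of channel loudness, and the width of the "center" band is arbitrary.

```python
import math
from typing import Sequence


def rms(channel: Sequence[float]) -> float:
    """Root-mean-square level of one channel (0.0 for an empty channel)."""
    if not channel:
        return 0.0
    return math.sqrt(sum(s * s for s in channel) / len(channel))


def estimate_pan(left: Sequence[float], right: Sequence[float]) -> float:
    """Return a pan value in [-1, 1]: -1 is fully left, +1 is fully right."""
    left_level, right_level = rms(left), rms(right)
    total = left_level + right_level
    if total == 0.0:
        return 0.0  # silence: treat as centered
    return (right_level - left_level) / total


def classify_direction(pan: float, center_band: float = 0.2) -> str:
    """Bucket a pan value into 'left', 'center', or 'right'."""
    if pan < -center_band:
        return "left"
    if pan > center_band:
        return "right"
    return "center"
```

Surround or height channels would extend the same idea to front/rear and above/below, at the cost of a more involved comparison across channel pairs.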
  • To illustrate an example process for implementing automatic in-game subtitles in a gaming overlay application, flow diagram 300 of FIG. 3 is described with respect to FIG. 1, FIG. 2B, and FIG. 2C.
  • Displays 280B and 280C reflect examples of display 180 after gaming overlay application 150 generates subtitle overlay 190 for displaying with game graphics 182.
  • Flow diagram 300 depicts an approach for implementing automatic in-game subtitles in a gaming overlay application.
  • blocks 302, 304, 306, 308, and 310 are performed by one or more processors.
  • Blocks 302, 304, 306, 308, and 310 are performed by a single processor of a computing device, similar to FIG. 1.
  • one or more of the blocks of flow diagram 300 are performed by one or more cloud servers or other computing devices distributed across a wireless or wired network.
  • an audio stream 142 and video stream 144 generated as the result of executing game application 140 are accessed.
  • a gaming overlay application executing on a processor receives the audio stream and video stream.
  • The processor executes gaming overlay application 150 concurrently with game application 140.
  • game application 140 executes on a remote server. For example, when using a cloud-based gaming streaming service, audio stream 142 and video stream 144 are received from a remote server via network 160.
  • the audio stream 142 is processed through a text conversion engine 152 to generate at least one subtitle.
  • text conversion engine 152 is part of gaming overlay application 150, and in other implementations, text conversion engine 152 is accessed using a cloud-based service via network 160.
  • a display position is determined to associate with the at least one subtitle from block 304.
  • subtitle compositor 154 uses one or more factors to determine the display position. One factor includes a user defined preference for subtitle location, such as near the bottom of the screen.
  • This user preference is retrieved from user preferences 158.
  • Another factor includes avoiding exclusion areas detected in video stream 144. For example, as previously described, video stream 144 is scanned for user interface elements generated by game application 140, and the portions of the display that include these user interface elements are marked as exclusion areas that should not include subtitles.
  • Yet another factor includes positioning the subtitle in proximity to the sound source or in-game speaker. For example, computer vision processing is performed to identify in-game characters, multiplayer users, and other objects within the video stream 144 that are potential sound sources associated with subtitles or closed captions. Once characters and objects are identified, the at least one subtitle from block 304 is matched to its most likely sound source and positioned proximate to its sound source within the video stream 144.
  • Matching to the most likely sound source for the at least one subtitle is based on various considerations. As discussed above, in implementations matching is based on triangulation using spatial audio cues from audio stream 142. Thus, in-game objects (e.g., characters) positioned in the in-game world consistent with the spatial audio cues are more strongly correlated with the sound source.
  • voice profile database 156 includes classifications such as age range, gender, and dialect.
  • traits analyzed from audio stream 142 and matched to voice profile database 156 are used to classify the in-game speaker as more or less likely to be a child, an adult, an elderly person, a male, a female, or a speaker with a regional dialect.
  • the computer vision processing described above is used to confirm whether a potential sound source, or in-game character, is consistent with the matched classifications.
  • If audio stream 142 is classified as likely to be “female” in voice profile database 156, and computer vision processing of the video stream 144 identifies a potential in-game character as likely to be a female character, then the match between the potential in-game character and the at least one subtitle is more strongly correlated.
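Combining the spatial and voice-classification considerations into a single ranking could look like the following sketch. Everything here is illustrative: the `Candidate` fields, the weights, and the linear scoring rule are assumptions, and the on-screen positions and trait labels are presumed to have been produced already by computer vision and voice profile matching.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Candidate:
    name: str
    screen_x: float  # normalized horizontal position: 0.0 left, 1.0 right
    traits: Dict[str, str] = field(default_factory=dict)  # e.g. {"gender": "female"}


def score(candidate: Candidate, audio_pan: float, voice_traits: Dict[str, str],
          spatial_weight: float = 1.0, trait_weight: float = 0.5) -> float:
    """Higher when position agrees with the pan and traits agree with the voice."""
    expected_x = (audio_pan + 1.0) / 2.0  # map pan [-1, 1] to screen [0, 1]
    spatial = 1.0 - abs(candidate.screen_x - expected_x)
    matches = sum(1 for key, value in voice_traits.items()
                  if candidate.traits.get(key) == value)
    return spatial_weight * spatial + trait_weight * matches


def most_likely_source(candidates: List[Candidate], audio_pan: float,
                       voice_traits: Dict[str, str]) -> Optional[Candidate]:
    """Pick the best-scoring candidate, or None when nothing is on screen."""
    if not candidates:
        return None
    return max(candidates, key=lambda c: score(c, audio_pan, voice_traits))
```

With a voice heard mostly from the left and classified as female, a female character on the left side of the frame outranks a male character on the right on both terms of the score.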
  • Yet another consideration includes matching audio stream 142 to a specific user.
  • game application 140 is a multiplayer game wherein participants use voice chat to communicate with other participants.
  • audio stream 142 includes multiple voice chat streams associated with specific users, and thus the user speaking at any given time is readily determined according to the originating voice chat stream.
  • If audio stream 142 is only available as a single mixed stream, then the other considerations described above are still usable to determine the in-game speaker. Further, since gaming overlay application 150 includes identifying information such as usernames or handles for each participant, the subtitles also include such identifying information when available.
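When each participant's voice chat arrives as its own stream, identifying the speaker reduces to checking which streams currently carry signal. The sketch below assumes a mapping from username to recent samples and a hypothetical level threshold; `active_speakers` and `label_subtitle` are illustrative names, not part of any real voice chat API.

```python
import math
from typing import Dict, List, Sequence


def stream_level(samples: Sequence[float]) -> float:
    """Root-mean-square level of one voice chat stream."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def active_speakers(chat_streams: Dict[str, Sequence[float]],
                    threshold: float = 0.02) -> List[str]:
    """Usernames whose stream exceeds the threshold, loudest first."""
    levels = {user: stream_level(samples)
              for user, samples in chat_streams.items()}
    talking = [user for user, level in levels.items() if level > threshold]
    return sorted(talking, key=lambda user: levels[user], reverse=True)


def label_subtitle(username: str, text: str) -> str:
    """Prefix a subtitle with the speaker's username or handle."""
    return f"{username}: {text}"
```

The returned usernames can be attached to the transcribed text so that each subtitle shows who said it.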
  • a subtitle overlay 190 is generated comprising the at least one subtitle from block 304 located at the associated display position from block 306.
  • subtitle compositor 154 generates subtitle overlay 190 along with various visual characteristics of the subtitles.
  • These visual characteristics include font attributes (e.g., italic, bold, outline), font color, font size, and speech bubble type.
  • Speech bubble type includes, for example, speech bubbles, floating text, or other text presentation methods.
  • Visual characteristics are set according to user preferences 158, for example user-preferred font size and color. Visual characteristics are also set according to readability considerations, for example by ensuring that the subtitles have high contrast against the colors in the associated area of video stream 144.
  • Visual characteristics are also set according to the in-game speaker, for example by mapping specific font colors for each in-game character.
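The readability consideration can be illustrated by choosing whichever text color contrasts most with the background behind the subtitle. The description does not name a contrast metric; the sketch below assumes the WCAG 2.x relative luminance and contrast ratio definitions, and `pick_text_color` with its black and white candidates is a hypothetical helper.

```python
from typing import Sequence, Tuple

RGB = Tuple[int, int, int]


def relative_luminance(color: RGB) -> float:
    """Relative luminance of an sRGB color, per the WCAG 2.x definition."""
    def linearize(channel: int) -> float:
        c = channel / 255.0
        return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4
    r, g, b = (linearize(v) for v in color)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(a: RGB, b: RGB) -> float:
    """Contrast ratio between two colors, from 1.0 (none) up to 21.0."""
    la, lb = relative_luminance(a), relative_luminance(b)
    lighter, darker = max(la, lb), min(la, lb)
    return (lighter + 0.05) / (darker + 0.05)


def pick_text_color(background: RGB,
                    candidates: Sequence[RGB] = ((255, 255, 255), (0, 0, 0))) -> RGB:
    """Choose the candidate text color with the highest contrast."""
    return max(candidates, key=lambda c: contrast_ratio(c, background))
```

In practice `background` would be something like the average color of the video region under the subtitle, and the candidate list could hold the per-character colors mentioned above.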
  • Visual characteristics are also set according to speaker intent detected from audio stream 142. For example, audio stream 142 is analyzed for loudness, speech tempo, syllable emphasis, voice pitch, and other elements to determine whether the in-game speaker is calm, and in this case the display characteristics use default values.
  • When the in-game speaker is instead determined not to be calm, the display characteristics emphasize this by using a bold font, a larger font size, or a speech bubble that is emphasized using spiked lines or other visual indicators.
  • the intent of the speaker is better understood in a visual manner.
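A toy version of the intent-to-style mapping is shown below. It is a sketch under stated assumptions: only two measured features are used, the thresholds are invented for illustration, and the `SubtitleStyle` fields and default values are hypothetical. A real system would likely use a trained classifier over more of the features listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubtitleStyle:
    bold: bool = False
    font_size: int = 24
    bubble: str = "round"


def style_for_speech(loudness_db: float, words_per_second: float,
                     loud_threshold_db: float = -10.0,
                     fast_threshold_wps: float = 3.5) -> SubtitleStyle:
    """Default style for calm speech, emphasized style otherwise."""
    default = SubtitleStyle()
    is_calm = (loudness_db < loud_threshold_db
               and words_per_second < fast_threshold_wps)
    if is_calm:
        return default
    # Emphasis: bold text, 25% larger font, spiked speech bubble.
    return SubtitleStyle(bold=True,
                         font_size=int(default.font_size * 1.25),
                         bubble="spiked")
```

User preferences could supply the default font size, with the emphasis applied as a relative change on top of it.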
  • a portion of video stream 144 is caused to be displayed with subtitle overlay 190.
  • this is performed by modifying a rendering pipeline within GPU 122, or using a desktop compositor of an operating system, among other methods.
  • display 180 outputs game graphics 182 with subtitle overlay 190.
  • the subtitle overlay 290B is placed according to a user preference for subtitle placement.
  • Subtitle overlays 290C and 290D are placed according to proximity to the sound source. In this manner, subtitle support is provided via gaming overlay application 150 even when game application 140 does not natively support subtitles.
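The five blocks of flow diagram 300 can be tied together in a short skeleton. This is only a structural sketch: the `Subtitle` type and the three callables (`to_text`, `choose_position`, `show`) are hypothetical stand-ins for the text conversion engine, the subtitle compositor's positioning logic, and the display step.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple


@dataclass
class Subtitle:
    text: str
    position: Tuple[int, int] = (0, 0)


def run_blocks(audio_chunk: Any, frame: Any,
               to_text: Callable[[Any], List[str]],
               choose_position: Callable[[str, Any], Tuple[int, int]],
               show: Callable[[Any, List[Subtitle]], Any]) -> Any:
    """One pass through blocks 302 to 310 for a chunk of audio and a frame."""
    # Block 302: audio_chunk and frame are the accessed audio and video data.
    # Block 304: convert the audio to one or more subtitles.
    subtitles = [Subtitle(text) for text in to_text(audio_chunk)]
    # Block 306: determine a display position for each subtitle.
    for subtitle in subtitles:
        subtitle.position = choose_position(subtitle.text, frame)
    # Block 308: the positioned subtitles form the overlay.
    overlay = subtitles
    # Block 310: cause the frame to be displayed with the overlay.
    return show(frame, overlay)
```

Because each step is passed in as a callable, any of them could run locally or be backed by a remote service without changing the overall flow.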

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Optics & Photonics (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Business, Economics & Management (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Game Theory and Decision Science (AREA)
  • Processing Or Creating Images (AREA)
  • Studio Devices (AREA)
  • User Interface Of Digital Computer (AREA)
  • Studio Circuits (AREA)
PCT/US2022/051581 2021-12-23 2022-12-01 Automatic in-game subtitles and closed captions Ceased WO2023121850A1 (en)

Priority Applications (4)

Application Number Priority Date Filing Date Title
KR1020247024720A KR20240131376A (ko) 2021-12-23 2022-12-01 Automatic in-game subtitles and closed captions
JP2024535349A JP2025504748A (ja) 2021-12-23 2022-12-01 Automatic in-game subtitles and closed captions
EP22912270.0A EP4452432A4 (en) 2021-12-23 2022-12-01 CLOSED SUBTITLES AND AUTOMATIC IN-GAME SUBTITLES
CN202280084788.1A CN118414200A (zh) 2021-12-23 2022-12-01 Automatic in-game subtitles and closed captions

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US17/561,477 US11857877B2 (en) 2021-12-23 2021-12-23 Automatic in-game subtitles and closed captions
US17/561,477 2021-12-23

Publications (1)

Publication Number Publication Date
WO2023121850A1 (en) 2023-06-29

Family

ID=86898719

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2022/051581 Ceased WO2023121850A1 (en) 2021-12-23 2022-12-01 Automatic in-game subtitles and closed captions

Country Status (6)

Country Link
US (2) US11857877B2
EP (1) EP4452432A4
JP (1) JP2025504748A
KR (1) KR20240131376A
CN (1) CN118414200A
WO (1) WO2023121850A1

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11857877B2 (en) * 2021-12-23 2024-01-02 Ati Technologies Ulc Automatic in-game subtitles and closed captions
US20240022682A1 (en) * 2022-07-13 2024-01-18 Sony Interactive Entertainment LLC Systems and methods for communicating audio data
GB2622405A (en) * 2022-09-15 2024-03-20 Sony Interactive Entertainment Inc Systems and methods for controlling dialogue complexity in video games
TWI891080B (zh) * 2023-10-05 2025-07-21 宏碁股份有限公司 Electronic device and video clip extraction method thereof

Citations (5)

Publication number Priority date Publication date Assignee Title
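KR20180087009A (ko) * 2017-01-24 2018-08-01 주식회사 소리자바 Subtitle providing system, terminal, and subtitle server using real-time audio streaming analysis
CN111556372A (zh) * 2020-04-20 2020-08-18 北京甲骨今声科技有限公司 Method and apparatus for adding subtitles to video and audio programs in real time
KR20200123988A (ko) * 2019-04-23 2020-11-02 주식회사 비포에이 Subtitle processing device for VR video content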
US20210136459A1 (en) * 2019-11-04 2021-05-06 Sling Media, L.L.C. System to correct closed captioning display using context from audio/video
KR20210151874A (ko) * 2019-05-02 2021-12-14 구글 엘엘씨 Automatically captioning audible portions of content on a computing device

Family Cites Families (27)

Publication number Priority date Publication date Assignee Title
US10987597B2 (en) * 2002-12-10 2021-04-27 Sony Interactive Entertainment LLC System and method for managing audio and video channels for video game players and spectators
US8620139B2 (en) * 2011-04-29 2013-12-31 Microsoft Corporation Utilizing subtitles in multiple languages to facilitate second-language learning
EP2525568B1 (en) * 2011-05-19 2017-11-15 EchoStar Technologies L.L.C. Automatic subtitle resizing
US8839292B1 (en) * 2011-12-13 2014-09-16 Google Inc. Systems and methods for rendering multiple applications on television screens
US10304458B1 (en) * 2014-03-06 2019-05-28 Board of Trustees of the University of Alabama and the University of Alabama in Huntsville Systems and methods for transcribing videos using speaker identification
EP3220374A4 (en) * 2014-11-12 2018-07-18 Fujitsu Limited Wearable device, display control method, and display control program
KR102202576B1 (ko) * 2014-12-12 2021-01-13 삼성전자주식회사 Device and method for controlling sound output
US9922095B2 (en) * 2015-06-02 2018-03-20 Microsoft Technology Licensing, Llc Automated closed captioning using temporal data
US10332506B2 (en) * 2015-09-02 2019-06-25 Oath Inc. Computerized system and method for formatted transcription of multimedia content
KR20170035502A (ko) * 2015-09-23 2017-03-31 삼성전자주식회사 Display apparatus and control method thereof
US10179291B2 (en) * 2016-12-09 2019-01-15 Microsoft Technology Licensing, Llc Session speech-to-text conversion
US10299008B1 (en) * 2017-11-21 2019-05-21 International Business Machines Corporation Smart closed caption positioning system for video content
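CN108491127B (zh) * 2018-03-12 2020-02-07 Oppo广东移动通信有限公司 Input method interface display method, apparatus, terminal, and storage medium
CN112154658B (zh) * 2018-05-29 2024-07-23 索尼公司 Image processing apparatus, image processing method, and storage medium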
US12451154B2 (en) * 2018-08-08 2025-10-21 Comcast Cable Communications, Llc Generating and/or displaying synchronized captions
EP3719613B1 (en) * 2019-04-01 2026-05-06 Nokia Technologies Oy Rendering captions for media content
US11094324B2 (en) * 2019-05-14 2021-08-17 Motorola Mobility Llc Accumulative multi-cue activation of domain-specific automatic speech recognition engine
US10885893B2 (en) * 2019-06-06 2021-01-05 Sony Corporation Textual display of aural information broadcast via frequency modulated signals
US20210074298A1 (en) * 2019-09-11 2021-03-11 Soundhound, Inc. Video conference captioning
US11295497B2 (en) * 2019-11-25 2022-04-05 International Business Machines Corporation Dynamic subtitle enhancement
US11557121B2 (en) * 2020-04-26 2023-01-17 Cloudinary Ltd. System, device, and method for generating and utilizing content-aware metadata
US11475895B2 (en) * 2020-07-06 2022-10-18 Meta Platforms, Inc. Caption customization and editing
US20230055421A1 (en) * 2020-09-16 2023-02-23 Meta Platforms, Inc. Caption customization and editing
US11418849B2 (en) * 2020-10-22 2022-08-16 Rovi Guides, Inc. Systems and methods for inserting emoticons within a media asset
US20240064485A1 (en) * 2020-11-30 2024-02-22 The Regents Of The University Of California Systems and methods for sound-enhanced meeting platforms
US12342102B2 (en) * 2021-11-19 2025-06-24 Apple Inc. Systems and methods for managing captions
US11857877B2 (en) * 2021-12-23 2024-01-02 Ati Technologies Ulc Automatic in-game subtitles and closed captions

Patent Citations (5)

Publication number Priority date Publication date Assignee Title
KR20180087009A (ko) * 2017-01-24 2018-08-01 주식회사 소리자바 Subtitle providing system, terminal, and subtitle server using real-time audio streaming analysis
KR20200123988A (ko) * 2019-04-23 2020-11-02 주식회사 비포에이 Subtitle processing device for VR video content
KR20210151874A (ko) * 2019-05-02 2021-12-14 구글 엘엘씨 Automatically captioning audible portions of content on a computing device
US20210136459A1 (en) * 2019-11-04 2021-05-06 Sling Media, L.L.C. System to correct closed captioning display using context from audio/video
CN111556372A (zh) * 2020-04-20 2020-08-18 北京甲骨今声科技有限公司 Method and apparatus for adding subtitles to video and audio programs in real time

Non-Patent Citations (1)

Title
See also references of EP4452432A4 *

Also Published As

Publication number Publication date
US20230201717A1 (en) 2023-06-29
EP4452432A4 (en) 2025-12-31
US12427413B2 (en) 2025-09-30
EP4452432A1 (en) 2024-10-30
US11857877B2 (en) 2024-01-02
US20240091640A1 (en) 2024-03-21
CN118414200A (zh) 2024-07-30
JP2025504748A (ja) 2025-02-19
KR20240131376A (ko) 2024-08-30

Similar Documents

Publication Publication Date Title
US12427413B2 (en) Automatic in-game subtitles and closed captions
Peng et al. Speechbubbles: Enhancing captioning experiences for deaf and hard-of-hearing people in group conversations
CN110473525B (zh) Method and apparatus for obtaining speech training samples
US6925438B2 (en) Method and apparatus for providing an animated display with translated speech
US11514924B2 (en) Dynamic creation and insertion of content
KR102136059B1 (ko) Subtitle generation system using graphic objects
US12141902B2 (en) System and methods for resolving audio conflicts in extended reality environments
US11600279B2 (en) Transcription of communications
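JP2023059937A (ja) Data interaction method, apparatus, electronic device, storage medium, and program
JP2025504748A5
JPWO2018037956A1 (ja) Information processing apparatus and information processing method
JP2023184519A (ja) Information processing system, information processing method, and computer program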
CN116582664A (zh) Intelligent interactive virtual display system based on naked-eye 3D
WO2010140254A1 (ja) Video/audio output device and sound localization method
WO2025075827A1 (en) Sonifying visual content for vision-impaired users
KR102583986B1 (ko) Method and system for speech-bubble representation of voice messages reflecting voice-based emotion classification
Yamamoto et al. Audiovisual emotion perception develops differently from audiovisual phoneme perception during childhood
WO2025025564A1 (zh) Avatar control method, apparatus, and related device
CN116403583A (zh) Voice data processing method and apparatus, non-volatile storage medium, and vehicle
US12562006B2 (en) System(s) and method(s) for training a sign language captioning model and subsequent use thereof
JP7425243B1 (ja) 情報処理装置及び情報処理方法
EP4535353A1 (en) Method and apparatus for rendering audio data using adaptive fonts
US20220351727A1 (en) Conversaton method, conversation system, conversation apparatus, and program
KR20190075765A (ko) Webtoon voice output system using automatic text-to-speech conversion
EP4623436A1 (en) Separation of conversational clusters in automatic speech recognition transcriptions

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22912270

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 2024535349

Country of ref document: JP

WWE Wipo information: entry into national phase

Ref document number: 202280084788.1

Country of ref document: CN

WWE Wipo information: entry into national phase

Ref document number: 202417052425

Country of ref document: IN

ENP Entry into the national phase

Ref document number: 20247024720

Country of ref document: KR

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 2022912270

Country of ref document: EP

Effective date: 20240723