WO2023045635A1 - Subtitle processing method and apparatus for multimedia files, electronic device, computer-readable storage medium, and computer program product - Google Patents

Subtitle processing method and apparatus for multimedia files, electronic device, computer-readable storage medium, and computer program product Download PDF

Info

Publication number
WO2023045635A1
WO2023045635A1 PCT/CN2022/113257 CN2022113257W
Authority
WO
WIPO (PCT)
Prior art keywords
segment
multimedia file
subtitle
content
subtitles
Prior art date
Application number
PCT/CN2022/113257
Other languages
English (en)
French (fr)
Inventor
何聃
龚淑宇
Original Assignee
腾讯科技(深圳)有限公司 (Tencent Technology (Shenzhen) Company Limited)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology (Shenzhen) Company Limited (腾讯科技(深圳)有限公司)
Publication of WO2023045635A1 publication Critical patent/WO2023045635A1/zh
Priority to US18/320,302 priority Critical patent/US20230291978A1/en

Links

Images

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47End-user applications
    • H04N21/488Data services, e.g. news ticker
    • H04N21/4884Data services, e.g. news ticker for displaying subtitles
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/435Processing of additional data, e.g. decrypting of additional data, reconstructing software from modules extracted from the transport stream
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N21/44008Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/83Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N21/845Structuring of content, e.g. decomposing content into time segments
    • H04N21/8456Structuring of content, e.g. decomposing content into time segments by decomposing the content in the time domain, e.g. in time segments

Definitions

  • The present application relates to the technical field of the Internet, and in particular to a subtitle processing method, apparatus, electronic device, computer-readable storage medium, and computer program product for multimedia files.
  • Subtitles can also describe, summarize, or outline the content of video files or audio files. For example, when users watch some foreign video files, they need subtitles to help them understand the content.
  • Embodiments of the present application provide a subtitle processing method, device, electronic device, computer-readable storage medium, and computer program product for multimedia files, which can accurately and efficiently realize the coordination between subtitles and multimedia files at the level of visual perception.
  • An embodiment of the present application provides a subtitle processing method for a multimedia file, the method being executed by an electronic device, including:
  • in response to a play trigger operation, playing a multimedia file, wherein the multimedia file is associated with multiple subtitles, and the type of the multimedia file includes a video file and an audio file; and
  • during playback of the multimedia file, sequentially displaying the multiple subtitles on the human-computer interaction interface, wherein the styles applied to the multiple subtitles are related to the content of the multimedia file.
  • An embodiment of the present application provides a subtitle processing device for a multimedia file, including:
  • the playback module is configured to play a multimedia file in response to a playback trigger operation, wherein the multimedia file is associated with multiple subtitles, and the type of the multimedia file includes a video file and an audio file;
  • the display module is configured to sequentially display the multiple subtitles on the human-computer interaction interface during the process of playing the multimedia file, wherein the applied styles of the multiple subtitles are related to the content of the multimedia file.
  • An embodiment of the present application provides an electronic device, including:
  • a memory configured to store executable instructions; and
  • a processor configured to implement the subtitle processing method for a multimedia file provided in the embodiments of the present application when executing the executable instructions stored in the memory.
  • An embodiment of the present application provides a computer-readable storage medium, which stores executable instructions, and when executed by a processor, implements the method for processing subtitles of a multimedia file provided in the embodiment of the present application.
  • An embodiment of the present application provides a computer program product, including a computer program or an instruction, which, when executed by a processor, implements the method for processing subtitles of a multimedia file provided in the embodiment of the present application.
  • In the embodiments of the present application, subtitles in styles related to the content of the multimedia file are displayed on the human-computer interaction interface, and diversified display effects of multimedia-file-related information are realized by enriching the expression forms of subtitles, so that the coordination between subtitles and multimedia files at the level of visual perception can be achieved accurately and efficiently.
  • FIG. 1 is a schematic structural diagram of a subtitle processing system 100 for a multimedia file provided in an embodiment of the present application
  • FIG. 2 is a schematic structural diagram of a terminal device 400 provided in an embodiment of the present application.
  • FIG. 3 is a schematic flowchart of a subtitle processing method for a multimedia file provided in an embodiment of the present application
  • FIG. 4A is a schematic diagram of an application scenario of a subtitle processing method for a multimedia file provided in an embodiment of the present application
  • FIG. 4B is a schematic diagram of an application scenario of a subtitle processing method for a multimedia file provided in an embodiment of the present application
  • FIG. 4C is a schematic diagram of the principle of segment division provided by the embodiment of the present application.
  • FIG. 4D to FIG. 4F are schematic diagrams of application scenarios of the subtitle processing method for multimedia files provided by the embodiments of the present application.
  • FIG. 5A is a schematic flowchart of a subtitle processing method for a multimedia file provided in an embodiment of the present application
  • FIG. 5B is a schematic flowchart of a subtitle processing method for a multimedia file provided in an embodiment of the present application.
  • FIG. 6A to FIG. 6C are schematic diagrams of application scenarios of the subtitle processing method for multimedia files provided by the embodiments of the present application.
  • FIG. 7 is a schematic diagram of video content dimensions provided by an embodiment of the present application.
  • FIG. 8 is a schematic diagram of the principle of character gender recognition provided by an embodiment of the present application.
  • FIG. 9 is a schematic diagram of the principle of character age recognition provided by an embodiment of the present application.
  • FIG. 10 is a schematic diagram of the principle of character emotion recognition provided by an embodiment of the present application.
  • FIG. 11 is a schematic diagram of the principle of video style recognition provided by an embodiment of the present application.
  • FIG. 12 is a schematic diagram of the training principle of the generative adversarial network model provided by an embodiment of the present application.
  • Subtitles: text serving various purposes that appears in videos such as movies and TV shows, and in audio such as dramas and songs, for example copyright notices, title subtitles, cast lists, explanatory subtitles (used to introduce the content of a multimedia file, such as displaying in text information about the people or scenery appearing in the file), lyrics subtitles, dialogue subtitles, and so on. Dialogue subtitles are synchronized with the speaking object and display the speech content of that object as text, to help users understand the content of video files or audio files (such as audio novels).
  • Multimedia files: in terms of data form, they include streaming media files and local files. A streaming media file is a multimedia file played through a streaming media protocol, a technology that compresses a series of multimedia data and transmits it over the network in segments, in the form of streams, realizing real-time transmission of audio and video for playback; this corresponds to the network playback scenario. A local file is a multimedia file that must be completely downloaded before it can be played, corresponding to the local playback scenario. In terms of carried content, multimedia files include video files and audio files.
  • Content features: include content features of static dimensions and content features of dynamic dimensions. Content features of static dimensions remain unchanged during playback of the multimedia file, such as the gender and age of an object; content features of dynamic dimensions change during playback, such as the mood and position of an object.
  • Style: also known as subtitle style, a set of attributes governing the visual appearance of subtitles. The attributes can include: font, color, font size, word spacing, bold, italic, underline, strikethrough, shadow offset and color, alignment, vertical margin, etc.
  • Local Binary Patterns (LBP): the basic idea is to compare each pixel with its surrounding pixels to capture the local image structure. If the central pixel value is greater than that of an adjacent pixel, the adjacent pixel is assigned 1; otherwise it is assigned 0. Each pixel thus yields an eight-bit binary representation, for example 11100111.
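As an illustration only (not part of the application), the eight-bit LBP code described above can be sketched in Python, assuming a grayscale image stored as a nested list and the eight neighbors read clockwise starting from the top-left:

```python
def lbp_code(image, y, x):
    """Compute the 8-bit Local Binary Pattern code for pixel (y, x).

    Following the description above: a neighbor is assigned 1 when the
    central pixel value is greater than the neighbor's value, else 0.
    Neighbors are read clockwise starting from the top-left.
    """
    center = image[y][x]
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    bits = ["1" if center > image[y + dy][x + dx] else "0"
            for dy, dx in offsets]
    return "".join(bits)

img = [[10, 20, 30],
       [40, 50, 60],
       [70, 80, 90]]
print(lbp_code(img, 1, 1))  # center 50 exceeds 10, 20, 30, 40 -> "11100001"
```

In a full pipeline this code would be computed for every interior pixel and the resulting codes summarized, e.g. by the local histogram statistics defined later in this section.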
  • Gabor (wavelet) feature: the feature obtained by transforming an image based on the Gabor function.
  • the Gabor transform belongs to the windowed Fourier transform.
  • the Gabor function can extract related features in different scales and directions in the frequency domain to represent texture.
  • a two-dimensional Gabor filter is the product of a Gaussian kernel function and a sinusoidal plane wave.
  • Principal Component Analysis (PCA): a statistical method that converts a group of possibly correlated variables into a group of linearly uncorrelated variables through an orthogonal transformation; the converted variables are called principal components.
  • Histogram of Oriented Gradients (HOG): a feature descriptor used for object detection in the fields of computer vision and image processing, computed as statistics of the orientation information of local image gradients. A typical implementation is as follows: first divide the image into multiple connected regions (also called cell units), then collect a histogram of the gradient or edge orientations of the pixels in each cell unit, and finally combine these histograms to form the feature descriptor.
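The per-cell orientation histogram step above can be sketched as follows (an illustrative, simplified fragment; the gradient maps `gx`/`gy`, the unsigned 0-180 degree orientation range, and the bin count are assumptions, not details from the application):

```python
import math

def cell_orientation_histogram(gx, gy, num_bins=9):
    """Histogram of gradient orientations for one cell unit, following the
    HOG steps above: each pixel votes its gradient magnitude into the bin
    of its orientation (0-180 degrees, unsigned)."""
    hist = [0.0] * num_bins
    bin_width = 180.0 / num_bins
    for row_x, row_y in zip(gx, gy):
        for dx, dy in zip(row_x, row_y):
            mag = math.hypot(dx, dy)                       # gradient magnitude
            ang = math.degrees(math.atan2(dy, dx)) % 180.0  # unsigned orientation
            hist[min(int(ang / bin_width), num_bins - 1)] += mag
    return hist

# Toy 2x2 cell: two horizontal-gradient pixels, two vertical-gradient pixels.
gx = [[1.0, 0.0], [0.0, 1.0]]
gy = [[0.0, 1.0], [1.0, 0.0]]
print(cell_orientation_histogram(gx, gy, num_bins=2))  # [2.0, 2.0]
```

A complete HOG descriptor would additionally normalize histograms over blocks of cells and concatenate them, which this fragment omits.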
  • Canonical Correlation Analysis (CCA): a multivariate statistical analysis method that uses the correlation between composite variables to reflect the overall correlation between two groups of indicators. Two representative composite variables U1 and V1 (linear combinations of the variables in the two variable groups, respectively) are extracted from the two groups of variables, and the correlation between these two composite variables is used to reflect the overall correlation between the two groups of indicators.
  • Local histogram statistics: the feature obtained by applying the histogram statistical method to multiple local binary pattern features, used to reflect the pixel distribution of the image. The histogram statistical method proceeds as follows: first divide the value range into a plurality of discrete intervals, then count the number of local binary pattern features falling in each interval.
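A minimal sketch of the histogram statistical method just described, assuming the local binary pattern features are given as integer codes in a known value range (all names are hypothetical):

```python
def local_histogram(values, num_bins, value_range=(0, 256)):
    """Count how many values fall into each of num_bins equal-width
    discrete intervals over value_range, as described above."""
    lo, hi = value_range
    width = (hi - lo) / num_bins
    hist = [0] * num_bins
    for v in values:
        # Clamp to the last bin so v == hi-1 (or rounding) stays in range.
        idx = min(int((v - lo) / width), num_bins - 1)
        hist[idx] += 1
    return hist

# Six 8-bit LBP codes binned into four intervals of width 64.
codes = [0, 15, 16, 128, 255, 255]
print(local_histogram(codes, 4))  # [3, 0, 1, 2]
```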
  • Local reconstruction residual weighted recognition: weighting the local sparse reconstruction representation results (that is, features obtained by a linear combination of a small number of local binary pattern features from the local features of the training set) by constructing a weighting matrix, and using the residual of the weighted results for classification and recognition.
  • Embodiments of the present application provide a subtitle processing method, device, electronic device, computer-readable storage medium, and computer program product for multimedia files, which can accurately and efficiently realize the coordination between subtitles and multimedia files at the level of visual perception.
  • the exemplary application of the electronic device provided by the embodiment of the present application is described below.
  • The electronic device provided by the embodiments of the present application can be implemented as various types of terminal devices such as a notebook computer, a tablet computer, a desktop computer, a set-top box, a mobile device (for example, a mobile phone, a portable music player, a personal digital assistant, a dedicated messaging device, or a portable game device), or a vehicle-mounted terminal, and can also be implemented cooperatively by a server and a terminal device.
  • an exemplary application when the electronic device is implemented as a terminal device will be described.
  • FIG. 1 is a schematic diagram of the architecture of a subtitle processing system 100 for multimedia files provided by an embodiment of the present application.
  • A terminal device 400 is connected to a server 200 through a network 300, where the network 300 may be a wide area network, a local area network, or a combination of the two.
  • the server 200 is the background server of the client 410 running on the terminal device 400.
  • For example, the server 200 can be the background server of a video website or an audio website, and sends the requested multimedia file (such as a streaming media file) to the terminal device 400 through the network 300, where the multimedia file is associated with multiple subtitles.
  • The client 410 running on the terminal device 400 can be various types of clients, such as a video playback client, an audio playback client, a browser, or an instant messaging client. When the client 410 receives a playback trigger operation (for example, the user's click on the play button displayed in the human-computer interaction interface), it plays the multimedia file received in real time from the server 200, and in the process of playing the multimedia file, sequentially displays multiple subtitles in the human-computer interaction interface, where the styles of the multiple subtitles are related to the content of the multimedia file (described in detail below).
  • The subtitle processing method for multimedia files provided by the embodiments of the present application may also be implemented by the terminal device alone. For example, a downloaded multimedia file (associated with multiple subtitles) is pre-stored locally on the terminal device 400; the client 410 then plays the locally stored multimedia file when receiving the play trigger operation, and in the process of playing the multimedia file, sequentially displays multiple subtitles on the human-computer interaction interface, where the styles applied to the multiple subtitles are related to the content of the multimedia file.
  • the embodiments of the present application can also be implemented by means of cloud technology (Cloud Technology).
  • Cloud technology refers to a hosting technology that unifies a series of resources such as hardware, software, and networks in a wide area network or local area network to realize the computing, storage, processing, and sharing of data. It is a general term for network technology, information technology, integration technology, management platform technology, and application technology based on the cloud computing business model; these resources can form a resource pool and be used on demand, which is flexible and convenient. Cloud computing technology will become an important support, since the background services of technical network systems require a large amount of computing and storage resources.
  • The server 200 shown in FIG. 1 may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, content delivery networks (CDN, Content Delivery Network), and big data and artificial intelligence platforms.
  • the terminal device 400 may be a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, etc., but is not limited thereto.
  • the terminal device 400 and the server 200 may be connected directly or indirectly through wired or wireless communication, which is not limited in this embodiment of the present application.
  • the terminal device 400 may also implement the subtitle processing method for a multimedia file provided in this embodiment of the present application by running a computer program.
  • The computer program can be a native program or software module in the operating system; a native application (APP, Application), that is, a program that needs to be installed in the operating system to run (i.e., the above-mentioned client 410, such as a video playback client, an audio playback client, or a browser); a mini program, that is, a program that only needs to be downloaded into the browser environment to run; or a mini program that can be embedded in any APP.
  • the above-mentioned computer program can be any form of application program, module or plug-in.
  • FIG. 2 is a schematic structural diagram of the terminal device 400 provided in an embodiment of the present application.
  • the terminal device 400 shown in FIG. 2 includes: at least one processor 420 , a memory 460 , at least one network interface 430 and a user interface 440 .
  • Various components in the terminal device 400 are coupled together via the bus system 450 .
  • the bus system 450 is used to realize connection and communication between these components.
  • the bus system 450 also includes a power bus, a control bus and a status signal bus. However, for clarity of illustration, the various buses are labeled as bus system 450 in FIG. 2 .
  • Processor 420 can be an integrated circuit chip with signal processing capability, such as a general-purpose processor, a digital signal processor (DSP, Digital Signal Processor), another programmable logic device, a discrete gate or transistor logic device, or discrete hardware components, where the general-purpose processor can be a microprocessor or any conventional processor.
  • Memory 460 includes volatile memory or nonvolatile memory, and may include both volatile and nonvolatile memory.
  • the non-volatile memory may be a read-only memory (ROM, Read Only Memory), and the volatile memory may be a random access memory (RAM, Random Access Memory).
  • the memory 460 described in the embodiment of the present application is intended to include any suitable type of memory.
  • Memory 460 optionally includes one or more storage devices located physically remote from processor 420 .
  • memory 460 is capable of storing data to support various operations, examples of which include programs, modules, and data structures, or subsets or supersets thereof, as exemplified below.
  • Operating system 461 including system programs for processing various basic system services and performing hardware-related tasks, such as framework layer, core library layer, driver layer, etc., for implementing various basic services and processing hardware-based tasks;
  • a network communication module 462, for reaching other computing devices via one or more (wired or wireless) network interfaces 430. Exemplary network interfaces 430 include: Bluetooth, Wireless Fidelity (WiFi), Universal Serial Bus (USB, Universal Serial Bus), etc.;
  • a presentation module 463, for enabling presentation of information via one or more output devices 441 (e.g., a display screen, speakers) associated with the user interface 440 (e.g., a user interface for operating peripherals and displaying content and information);
  • the input processing module 464 is configured to detect one or more user inputs or interactions from one or more of the input devices 442 and translate the detected inputs or interactions.
  • The subtitle processing device for multimedia files provided in the embodiments of the present application can be implemented in software, and can be provided as various software embodiments, including application programs, software, software modules, scripts, or code in various forms.
  • FIG. 2 shows the subtitle processing device 465 for multimedia files stored in the memory 460, which can be software in the form of programs and plug-ins, and includes a series of modules: a playback module 4651, a display module 4652, an acquisition module 4653, a conversion module 4654, a fusion module 4655, and a determination module 4656. These modules are logical, so they can be combined arbitrarily or further divided according to the functions realized. It should be pointed out that FIG. 2 shows all the above-mentioned modules at once for convenience of expression, but this should not be taken to exclude an implementation of the subtitle processing device 465 that includes only the playback module 4651 and the display module 4652. The function of each module is explained below.
  • Fig. 3 is a schematic flow chart of the subtitle processing method for a multimedia file provided by the embodiment of the present application, which will be described in conjunction with the steps shown in Fig. 3 .
  • the method shown in FIG. 3 can be executed by various forms of computer programs running on the terminal device 400, and is not limited to the above-mentioned client 410 running on the terminal device 400.
  • step 101 a multimedia file is played in response to a play trigger operation.
  • the multimedia file is associated with multiple subtitles, and each subtitle corresponds to a playing period on the playing time axis of the multimedia file.
  • A subtitle entry is the basic unit of subtitle display; it can be one or more lines of text, possibly multilingual, such as character dialogue, plot descriptions, and character introductions.
  • Each subtitle is set with a corresponding display time period, including the start display time and the end display time.
  • For example, the corresponding display time period can be 10:00-10:05; that is, a corresponding subtitle is displayed during the playback period matching the real-time playback progress of the multimedia file, and the subtitle is displayed with a style adapted to the content characteristics of at least one dimension of the multimedia file. In other words, subtitles associated with different multimedia files are displayed with different styles, enabling accurate and efficient coordination of subtitles and multimedia files at the level of visual perception.
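The matching of real-time playback progress to subtitle display periods described above can be sketched as follows (a hypothetical data model for illustration, not the application's actual format):

```python
from dataclasses import dataclass, field

@dataclass
class Subtitle:
    start: float        # start display time, seconds on the playback time axis
    end: float          # end display time, seconds
    text: str
    style: dict = field(default_factory=dict)  # e.g. {"font": "...", "color": "blue"}

def active_subtitles(subtitles, playback_time):
    """Return the subtitle entries whose display period covers the
    current playback progress."""
    return [s for s in subtitles if s.start <= playback_time < s.end]

subs = [
    Subtitle(600.0, 605.0, "How do you know", {"color": "blue"}),
    Subtitle(605.0, 609.0, "I just do", {"color": "blue"}),
]
# At 10:02 (602 s) only the first entry is on screen.
print([s.text for s in active_subtitles(subs, 602.0)])  # ['How do you know']
```

A player would call such a lookup (or, more efficiently, a sorted-index variant) on every render tick and draw the returned entries with their associated styles.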
  • The data form of a multimedia file may be a streaming media file (corresponding to the network playback scenario, in which the client requests the streaming media file from the server in real time and plays it in response to a playback trigger operation) or a local file (corresponding to the local playback scenario, in which the client plays a multimedia file pre-stored locally on the terminal device); the type of the multimedia file, that is, the carried content, includes video files and audio files.
  • Taking a multimedia file that is a video file as an example, assume that the subtitle "How do you know" corresponds to the playback period 10:00 to 10:05 on the playback time axis of the video file; that is, during the period from 10:00 to 10:05 of playback, the subtitle "How do you know" is displayed with a corresponding style applied, for example a style adapted to the attributes (such as age, gender, or mood) of the speaking object in the segment spanning the playback period from 10:00 to 10:05.
  • The format of a subtitle file may be a picture format or a text format. A subtitle file in picture format consists of an idx file and a sub file: idx is an index file that records the time codes at which subtitles appear (i.e., the above-mentioned playback periods) and the subtitle display attributes (i.e., the above-mentioned styles), while the sub file holds the subtitle data itself; because it is a picture format, it occupies a lot of space and can be compressed to save space.
  • The extension of a subtitle file in text format is usually ass, srt, sml, ssa, or sub (the same suffix as in the picture format above, but with a different data format); because it is in text format, it occupies less space. In some embodiments, the styles of subtitles, including original styles and new styles (that is, styles adapted to the content characteristics of at least one dimension of the multimedia file), can be recorded in files such as ass, srt, sml, ssa, or sub.
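For illustration, a time code line in an srt file of the kind mentioned above can be parsed into start and end display times in seconds (a simplified sketch; a real srt parser must also handle entry numbers and text lines):

```python
import re

def parse_srt_timecode(line):
    """Parse an srt time line such as '00:10:00,000 --> 00:10:05,000'
    into (start, end) in seconds on the playback time axis."""
    pat = r"(\d+):(\d+):(\d+),(\d+)\s*-->\s*(\d+):(\d+):(\d+),(\d+)"
    m = re.match(pat, line)
    if m is None:
        raise ValueError("not an srt time code line: " + line)
    h1, m1, s1, ms1, h2, m2, s2, ms2 = map(int, m.groups())
    to_sec = lambda h, mn, s, ms: h * 3600 + mn * 60 + s + ms / 1000.0
    return to_sec(h1, m1, s1, ms1), to_sec(h2, m2, s2, ms2)

print(parse_srt_timecode("00:10:00,000 --> 00:10:05,000"))  # (600.0, 605.0)
```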
  • step 102 during the process of playing the multimedia file, a plurality of subtitles are sequentially displayed on the human-computer interaction interface.
  • the style applied to the multiple subtitles (that is, the multiple subtitles associated with the multimedia file, which can be obtained, for example, by reading from the above-mentioned subtitle file) is related to the content of the multimedia file.
  • In some embodiments, the styles applied to the multiple subtitles can be the same (that is, the subtitle style remains unchanged during playback of the entire multimedia file). In that case, the above-mentioned sequential display of multiple subtitles in the human-computer interaction interface can be achieved in the following manner: multiple subtitles with the same style applied are sequentially displayed in the human-computer interaction interface, where the shared style is adapted to the content characteristics of at least one dimension of the multimedia file.
  • Taking a video file as an example, the human-computer interaction interface sequentially displays subtitles in a style adapted to the content characteristics of at least one dimension of the video file (such as the style, i.e., genre, of the video file). For example, when the style of the video file is comedy, the corresponding subtitle style can be: font Chinese Colorful Cloud (华文彩云), font size 4, color blue; that is, throughout playback of the video file, the subtitles all use the Chinese Colorful Cloud font at size 4 in blue. In other words, when the style of the video file is comedy, the corresponding subtitle style is also cartoonish and funny and fits the content of the video file well, so that the coordination of subtitles and video files at the level of visual perception can be realized accurately and efficiently.
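The genre-to-style correspondence in the comedy example above can be sketched as a simple lookup (the mapping entries and font names here are hypothetical placeholders, not styles specified by the application):

```python
# Hypothetical mapping from a recognized video style (genre) to a subtitle
# style, in the spirit of the comedy example above.
GENRE_TO_STYLE = {
    "comedy":      {"font": "STCaiyun", "size": 4, "color": "blue"},
    "documentary": {"font": "SimSun",   "size": 5, "color": "white"},
}
DEFAULT_STYLE = {"font": "SimHei", "size": 5, "color": "white"}

def style_for_genre(genre):
    """Return the subtitle style adapted to the recognized genre,
    falling back to a neutral default for unknown genres."""
    return GENRE_TO_STYLE.get(genre, DEFAULT_STYLE)

print(style_for_genre("comedy")["color"])  # blue
```

Because the style is fixed per file in this variant, the lookup happens once at playback start and the returned style is applied to every subtitle entry.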
  • The subtitles can be displayed at a fixed position on the human-computer interaction interface (for example, at the middle-lower part of the interface), or the display position can change dynamically; for example, for a video file, the subtitles can be displayed at positions in the human-computer interaction interface that avoid the objects appearing in the video frame, or displayed superimposed on the video frame. The embodiments of the present application impose no specific limits on this.
  • In some embodiments, the styles applied to the multiple subtitles can also differ, that is, the subtitle style changes during playback of the multimedia file, with each segment using a matching style. In that case, sequentially displaying multiple subtitles in the human-computer interaction interface can be implemented as follows: the multimedia file is divided into a plurality of segments, where the type of a segment can include at least one of the following: object segment, scene segment, plot segment; during playback of each segment of the multimedia file, the following processing is performed: based on a style adapted to the content characteristics of at least one dimension of the segment, at least one subtitle associated with the segment is sequentially displayed in the human-computer interaction interface. In this way, by dividing the multimedia file and displaying subtitles in styles related to the content of each resulting segment, the coordination between subtitles and the multimedia file at the level of visual perception can be further improved.
• the multimedia file can be divided into a plurality of object segments according to the objects (such as characters, animals, etc.) appearing in the multimedia file, where each object segment includes one object (for example, object segment A includes object A and object segment B includes object B, where object A and object B are two different objects with different object attributes; for example, object A is male and object B is female, or object A is a young man and object B is an elderly person). The following processing is then performed while playing each object segment of the multimedia file: based on a style adapted to the content characteristics of at least one dimension of the object segment (for example, object segment A), such as a style adapted to an object attribute of the object A it includes (assuming object A is identified as male, the adapted style could be a bold font at size five, so that the subtitle style is more masculine), at least one subtitle associated with object segment A is displayed sequentially in the human-computer interaction interface.
• the multimedia file can be divided into multiple scene segments according to different scenes (for example, a historical or geographical documentary can be divided into multiple different scene segments by scene), where each scene segment includes one scene and the scenes included in different scene segments may differ; for example, the scene included in scene segment A is a campus while the scene included in scene segment B is a park. The following processing is then performed while playing each scene segment of the multimedia file: based on a style adapted to the content characteristics of at least one dimension of the scene segment (for example, scene segment B), such as a style adapted to the scene included in scene segment B (assuming that scene is a seaside, the adapted style could be italics in blue, so that the subtitle style matches the seaside), at least one subtitle associated with scene segment B is displayed sequentially in the human-computer interaction interface.
• the multimedia file can be divided into multiple plot segments according to its content. For example, a video file can be divided into plot segments covering the beginning, development, climax and ending of the story, where each plot segment corresponds to one plot and the plots corresponding to different plot segments can differ; for example, plot segment A corresponds to the development stage of the story while plot segment B corresponds to the climax stage. The following processing is then performed while playing each plot segment of the multimedia file: based on a style adapted to the content characteristics of at least one dimension of the plot segment (for example, plot segment C, assumed to be a climax segment), such as a style adapted to a climax (for example, Chinese Amber at size three, a larger and weightier font suited to a climax), at least one subtitle associated with plot segment C is displayed sequentially in the human-computer interaction interface.
• the above-mentioned process of dividing the multimedia file is only a logical identification and division: the data form of the multimedia file does not change. That is, there is no need to physically split the multimedia file; corresponding markers are simply added on its playback time axis to logically divide the multimedia file into different segments.
• of course, the multimedia file may also be physically divided, which is not specifically limited in this embodiment of the present application.
• in addition to a single-type division of the multimedia file (that is, identifying multiple segments of one type; for example, dividing the multimedia file into multiple object segments only according to the objects appearing in it), a composite-type division can also be carried out, that is, identifying multiple segments of different types. For example, the multimedia file can be divided according to the objects and the scenes appearing in it at the same time, so that the resulting segments include both object segments and scene segments; the divided object segments and scene segments are then merged and de-duplicated. For example, when object segment A (assuming the corresponding time period is 10:00-12:00) and scene segment B (assuming the corresponding time period is also 10:00-12:00) overlap, only one of them is retained, so as to obtain the final division result. The embodiment of the present application does not specifically limit the division method.
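The logical division and de-duplication described above can be sketched as follows; the segment records, time periods and merge rule are illustrative assumptions, not part of the claimed method:

```python
# Sketch: logically dividing a multimedia file by keeping markers on the
# playback time axis, then merging and de-duplicating overlapping segments
# produced by a composite-type division (object segments plus scene segments).

def merge_and_dedup(segments):
    """Keep only one segment when two segments of different types
    cover the same playback period (e.g. an object segment and a
    scene segment both spanning 10:00-12:00)."""
    seen = set()
    result = []
    for seg in sorted(segments, key=lambda s: s["start"]):
        key = (seg["start"], seg["end"])
        if key in seen:
            continue  # an identical time span is already covered
        seen.add(key)
        result.append(seg)
    return result

segments = [
    {"type": "object", "start": 600, "end": 720},  # 10:00-12:00
    {"type": "scene",  "start": 600, "end": 720},  # exact overlap, dropped
    {"type": "scene",  "start": 720, "end": 900},
]
final = merge_and_dedup(segments)
```

Only the time-axis markers are touched; the multimedia data itself is never rewritten, matching the logical-division description above.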
• the styles of the at least one subtitle associated with the same segment may be the same, that is, the subtitle style does not change during the playback of that segment. In this case, the above-mentioned display, based on a style adapted to the content characteristics of at least one dimension of the segment, of at least one subtitle associated with the segment in the human-computer interaction interface can be realized in the following manner: obtain the content characteristics of the static dimension of the segment, and synchronously display at least one subtitle associated with the segment in the human-computer interaction interface, where the style applied to the subtitles remains unchanged during the playback of the segment. In this way, the computing resources and communication resources of the terminal device can be saved while the coordination between the subtitles and the multimedia file at the level of visual perception is still realized accurately and efficiently.
• the content characteristics of the static dimension of an object segment may include at least one of the following object attributes of the vocalizing object in the object segment: role type (for example, protagonist or villain), gender, age. For example, for object segment A, first obtain an object attribute (such as the gender) of the vocalizing object (such as object A) in object segment A (assuming the gender of object A is recognized as female), and then synchronously display at least one subtitle associated with object segment A in the human-computer interaction interface, where the style applied to the subtitles is adapted to a female speaker; for example, the style can be a rounded font in pink, so that the subtitle style is more feminine, and it remains unchanged during the playback of object segment A, that is, the subtitles are always displayed in the rounded font and in pink while object segment A is playing.
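The static-dimension style selection above can be sketched as a simple lookup; the attribute-to-style table and font names are illustrative assumptions:

```python
# Sketch: choosing one subtitle style per object segment from a static
# object attribute (gender here). The style is resolved once per segment
# and stays unchanged while the segment is playing.

STYLE_BY_GENDER = {
    "female": {"font": "YouYuan", "color": "pink"},
    "male":   {"font": "SimHei",  "color": "blue"},
}
DEFAULT_STYLE = {"font": "SimSun", "color": "white"}

def style_for_object_segment(gender):
    # Unrecognized attributes fall back to a neutral default style.
    return STYLE_BY_GENDER.get(gender, DEFAULT_STYLE)

style = style_for_object_segment("female")
```

Because the lookup happens once per segment rather than once per subtitle, this matches the resource-saving point made above.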
  • FIG. 4A is a schematic diagram of an application scenario of a subtitle processing method for a multimedia file provided in an embodiment of the present application.
• as shown in FIG. 4A, a certain object segment corresponds to the playing period from 40:30 to 40:40, and at least one subtitle associated with the object segment is synchronously displayed in the human-computer interaction interface; for example, subtitle 402 ("I'm so happy, I bought new clothes") is displayed at 40:30 and subtitle 403 ("But next month I'm going to eat dirt") is displayed at 40:40. The styles of subtitle 402 and subtitle 403 are adapted to a female speaker: their font style is cute, and the style remains unchanged during the playback of the object segment, that is, subtitle 402 and subtitle 403 apply the same style.
• the content characteristics of the static dimension of a plot segment may include the plot progress of the plot segment; for the same plot segment, the styles applied to all subtitles associated with it during its playback can be the same, for example, a style adapted to the plot progress. Similarly, the content characteristics of the static dimension of a scene segment can include the scene type of the scene segment; for the same scene segment, the styles applied to all subtitles associated with it during its playback can be the same, for example, a style adapted to the scene type.
• Figure 4B is a schematic diagram of an application scenario of the subtitle processing method for a multimedia file provided by the embodiment of the present application. As shown in FIG. 4B, utterance object 404 and utterance object 406 belong to different object segments, and the subtitle 405 ("What are we going to eat tonight") associated with the object segment of utterance object 404 applies a different style from the subtitle 407 ("How about a barbecue") associated with the object segment of utterance object 406: for example, the font of subtitle 405 is Fangzheng Shuti while the font of subtitle 407 is Chinese Colorful Cloud. That is, different styles are applied to the subtitles of different object segments; subtitle 405 applies the Fangzheng Shuti style adapted to a female speaker (a softer font style) while subtitle 407 applies the Chinese Colorful Cloud style adapted to a male speaker (a more masculine font style), which makes it convenient for the user to distinguish the different objects appearing in the video file.
• furthermore, any one or more of the plurality of segments can be further divided to obtain a plurality of sub-segments. Such refined division ensures that the subtitles remain related to the content of the multimedia file in real time during playback, thereby further improving the coordination between the subtitles and the multimedia file at the level of visual perception.
  • FIG. 4C is a schematic diagram of the principle of segment division provided by the embodiment of the present application.
• as shown in FIG. 4C, for scene segment 1 and plot segment 2 (which can be two adjacent segments, that is, after scene segment 1 finishes playing, plot segment 2 plays next), scene segment 1 can be further divided, according to the characters appearing in it, into three different character sub-segments, for example character sub-segment 1, character sub-segment 2 and character sub-segment 3, where different character sub-segments may contain different characters; for example, character sub-segment 1 includes character A, character sub-segment 2 includes character B, and character sub-segment 3 includes character C. Plot segment 2 can likewise be further divided, according to the scenes appearing in it, into two different scene sub-segments, for example scene sub-segment 1 and scene sub-segment 2.
• the above-mentioned sequential display, in the human-computer interaction interface, of at least one subtitle associated with a segment based on a style adapted to the content characteristics of at least one dimension of the segment can be realized in the following manner: divide the segment to obtain a plurality of sub-segments, where the plurality of sub-segments share the content characteristics of the static dimension of the segment (content characteristics of the static dimension remain unchanged during the playback of the segment) and also have content characteristics of a dynamic dimension (content characteristics of the dynamic dimension change during the playback of the segment), with different sub-segments having different content characteristics of the dynamic dimension; then perform the following processing during the playback of each sub-segment of the segment: display at least one subtitle associated with the sub-segment based on a style adapted to both the content characteristics of the static dimension and the content characteristics of the dynamic dimension possessed by the sub-segment.
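The combination of a static-dimension feature with a dynamic-dimension feature can be sketched as follows; the feature names and style tables are illustrative assumptions:

```python
# Sketch: a segment-level (static) feature fixes one style attribute,
# while a sub-segment-level (dynamic) feature varies another attribute
# across sub-segments of the same segment.

BASE_FONT = {"female": "YouYuan", "male": "SimHei"}    # static: gender
EMOTION_COLOR = {"sad": "gray", "happy": "orange"}     # dynamic: emotion

def style_for_sub_segment(static_gender, dynamic_emotion):
    return {
        "font": BASE_FONT.get(static_gender, "SimSun"),
        "color": EMOTION_COLOR.get(dynamic_emotion, "white"),
    }

# Same object segment, two sub-segments: the font (static) is fixed,
# the color (dynamic) follows the speaker's emotion.
s1 = style_for_sub_segment("female", "sad")
s2 = style_for_sub_segment("female", "happy")
```

The static attribute is identical across both calls, while the dynamic attribute differs, mirroring the sub-segment behavior described above.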
• the content characteristics of the static dimension of an object segment may include at least one of the following object attributes: the role type, gender and age of the vocalizing object in the object segment; the content characteristics of the dynamic dimension of the object segment may include the following object attribute: the emotion of the vocalizing object in the object segment. For example, taking the multimedia file as a video file, for an object segment in the video file (such as object segment A), first divide object segment A into multiple sub-segments, and then perform the following processing while playing each sub-segment of object segment A: display at least one subtitle associated with the sub-segment based on a style adapted to the content characteristics of the static dimension (such as the gender of the speaking object in object segment A) and the content characteristics of the dynamic dimension (such as the emotion of the speaking object in the current sub-segment), that is, the style applied to the at least one subtitle associated with the sub-segment is adapted to both the gender and the current emotion of the speaking object.
  • FIG. 4D is a schematic diagram of an application scenario of a subtitle processing method for a multimedia file provided by an embodiment of the present application.
• as shown in FIG. 4D, sub-segment 408 and sub-segment 409 are different sub-segments of the same object segment; the expression of utterance object 410 in sub-segment 408 is sad, while in sub-segment 409 the expression of utterance object 410 changes from sad to happy. Accordingly, the subtitle 411 ("Don't leave me") associated with sub-segment 408 applies a different style from the subtitle 412 ("I'm glad to see you again") associated with sub-segment 409: the font of subtitle 411 is Chinese official script at small size four, a smaller and more serious font suited to a sad emotion, while the font of subtitle 412 is Chinese Colorful Cloud at size four, a larger and more festive font suited to a happy emotion. In this way, for the same object segment, the subtitle style is adjusted as the emotion of the speaking object changes, so that the coordination between the subtitles and the content of the video file at the level of visual perception can be realized accurately and efficiently.
• the content characteristics of the static dimension of a plot segment may include the plot type of the plot segment, and the content characteristics of the dynamic dimension of the plot segment may include at least one of the following: the scene types of the different scenes appearing in the plot segment, and the object attributes of the different vocalizing objects appearing in the plot segment. For example, taking the multimedia file as a video file, for a certain plot segment in the video file (such as plot segment B), first divide plot segment B into a plurality of sub-segments, and then perform the following processing while playing each sub-segment of plot segment B: display at least one subtitle associated with the sub-segment based on a style adapted to the content characteristics of the static dimension (such as the plot type of plot segment B) and the content characteristics of the dynamic dimension (such as the scene appearing in the current sub-segment), that is, the style applied to the at least one subtitle associated with the sub-segment is adapted to both the plot type and the current scene.
  • FIG. 4E is a schematic diagram of an application scenario of a subtitle processing method for a multimedia file provided in the embodiment of the present application.
• as shown in FIG. 4E, sub-segment 413 and sub-segment 414 are different sub-segments of the same plot segment; the scene appearing in sub-segment 413 is a home, while the scene in sub-segment 414 switches from the home to outdoors. Correspondingly, the subtitle 415 ("Dad, how about going to climb the mountain?") associated with sub-segment 413 and the subtitle 416 ("Dad, wait for me") associated with sub-segment 414 apply different styles; for example, the font of subtitle 415 is bold while the font of subtitle 416 is Chinese Amber. In this way, for different sub-segments of the same plot segment, the subtitle style is adjusted correspondingly as the content characteristics of the dynamic dimension change.
• the content characteristics of the static dimension of a scene segment may include the type of the scene involved in the scene segment, and the content characteristics of the dynamic dimension of the scene segment may include at least one of the following: the object attributes of the different vocalizing objects appearing in the scene segment, and the types of the different plots appearing in the scene segment. For example, taking the multimedia file as a video file, for a certain scene segment in the video file (such as scene segment C), first divide scene segment C into multiple sub-segments, and then perform the following processing while playing each sub-segment of scene segment C: display at least one subtitle associated with the sub-segment based on a style adapted to the content characteristics of the static dimension (for example, the scene type involved in scene segment C) and the content characteristics of the dynamic dimension (such as the type of plot appearing in the current sub-segment), that is, the style applied to the at least one subtitle associated with the sub-segment is adapted to both the scene type and the current plot type.
  • FIG. 4F is a schematic diagram of an application scenario of a subtitle processing method for a multimedia file provided by the embodiment of the present application.
• as shown in FIG. 4F, sub-segment 417 and sub-segment 418 are different sub-segments of the same scene segment; the plot appearing in sub-segment 417 is in the development stage, while the plot in sub-segment 418 moves from the development stage into the climax stage. Correspondingly, the subtitle 419 ("The architecture of the Middle Ages is relatively simple") associated with sub-segment 417 applies a different style from the subtitle 420 ("The architecture of the Renaissance era is more modern") associated with sub-segment 418; for example, the font of subtitle 419 is Chinese Xingkai while the font of subtitle 420 is a small rounded font. In this way, the subtitle style is adjusted as the content characteristics of the dynamic dimension of different sub-segments change, so that users can more easily understand the video content from the changes in subtitle style.
• step 103A and step 104A will be described below in conjunction with the steps shown in FIG. 5A.
• in step 103A, the content characteristics of at least one dimension of the multimedia file are obtained.
• the content characteristics of at least one dimension of the multimedia file may include: style (for example, for a video file, the corresponding style types may include comedy, horror, suspense, cartoon, etc.; for an audio file, the corresponding style types may include pop, rock, etc.), objects (such as characters, animals, etc. appearing in the multimedia file), scenes, plots, and hues.
• the above-mentioned step 103A can be implemented in the following manner: call a content feature recognition model to perform content feature recognition on the content of the multimedia file to obtain the content characteristics of at least one dimension of the multimedia file, where the content feature recognition model is obtained by training based on sample multimedia files and labels annotated for the contents of the sample multimedia files.
• the content feature recognition model can be a separate style recognition model, scene recognition model, plot recognition model or hue recognition model, or a combined model (such as a model that can recognize the style and the scene of a multimedia file at the same time). The content feature recognition model can be a neural network model (such as a convolutional neural network, a deep convolutional neural network or a fully connected neural network), a decision tree model, a gradient boosting tree, a multilayer perceptron or a support vector machine; the embodiment of the present application does not specifically limit the type of the content feature recognition model.
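The per-dimension recognition above can be sketched as invoking one recognizer per dimension and collecting the results; the recognizer interface and the stub predictions are illustrative assumptions (a trained model would stand behind each call):

```python
# Sketch: collecting the content characteristics of a multimedia file
# by running one recognizer per dimension (style, scene, plot, hue, ...).

def recognize_content_features(file_frames, recognizers):
    """Map each dimension name to the output of its recognizer."""
    return {dim: model(file_frames) for dim, model in recognizers.items()}

# Stub recognizers standing in for trained models.
recognizers = {
    "style": lambda frames: "comedy",
    "scene": lambda frames: "seaside",
}
features = recognize_content_features([], recognizers)
```

Separate and combined models fit the same interface: a combined model would simply populate several dimensions from one call.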
• the above-mentioned step 103A can also be implemented in the following manner: perform the following processing on a target object appearing in the video file. First, preprocess the target video frame where the target object is located; for example, the target video frame can be cropped to a set size, or the target object in the target video frame can be rotated so that the target object is in a horizontal orientation, which facilitates subsequent processing. The sharpness of the target object included in each candidate target video frame can also be determined (for example, with the Sobel operator: a blurred image has less distinct edges, so its Sobel response is smaller; the Sobel operator is composed of two 3×3 convolution kernels used to compute the gray-scale weighted differences in the neighborhood of the central pixel), and the target video frame with the highest sharpness is selected for subsequent processing. Then perform feature extraction on the preprocessed target video frame to obtain the image features corresponding to the target video frame; for example, the Gabor wavelet features describing the image texture information in the target video frame can be extracted as its image features. The image features are then subjected to dimensionality reduction; for example, principal component analysis can be used to extract the principal components of the image features (for the image feature matrix X of the target video frame, the matrix XX^T is computed and eigen-decomposed, the eigenvectors corresponding to the largest L eigenvalues are retained and arranged by columns to form a decoding matrix D, and the features are then projected with D to obtain the reduced representation).
• the above-mentioned step 103A can also be implemented in the following manner: perform the following processing on a target object appearing in the video file. First extract the local binary pattern features corresponding to the target video frame where the target object is located and perform dimensionality reduction on them, for example with principal component analysis. Then extract the histogram of oriented gradients features corresponding to the target video frame and perform dimensionality reduction on them, again for example with principal component analysis. Then perform canonical correlation analysis on the dimension-reduced local binary pattern features and histogram of oriented gradients features (that is, extract two representative composite variables from the two groups of variables and use the correlation between the two composite variables to reflect the overall correlation between the two groups of indicators) to obtain an analysis result; for example, the canonical correlation coefficient (a quantitative index measuring the degree of linear correlation between two random vectors) between the local binary pattern features and the histogram of oriented gradients features can be calculated to mine the correlation between the two groups of features. Finally, perform regression on the analysis result (this can be linear regression, which uses a linear model to model the relationship between a single input variable and an output variable, or nonlinear regression, which models the relationship between multiple independent input variables and the output variable) to obtain an object attribute of the target object. For example, taking the object attribute age as an example, the probabilities of the analysis result mapping to different ages can be calculated through a linear model, and the age corresponding to the maximum probability is determined as the age of the target object.
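The fusion-and-regression step above can be sketched as follows. As assumptions: the two feature groups are fused by simple concatenation standing in for a full canonical correlation analysis, and the age bins and weight matrix are illustrative placeholders rather than trained values:

```python
import numpy as np

# Sketch: fuse dimension-reduced LBP and HOG features, then map the
# fused vector to per-age scores with a linear model; the age with the
# maximum score is taken as the target object's age.

def fuse(lbp, hog):
    # A full CCA would extract maximally correlated projections of the
    # two feature groups; concatenation is a minimal stand-in.
    return np.concatenate([lbp, hog])

def predict_age(features, W, ages):
    scores = W @ features            # one score per candidate age
    return ages[int(np.argmax(scores))]

rng = np.random.default_rng(1)
lbp, hog = rng.normal(size=8), rng.normal(size=8)
ages = [10, 25, 40, 60]
W = rng.normal(size=(len(ages), 16))  # untrained, illustrative weights
age = predict_age(fuse(lbp, hog), W, ages)
```

With trained weights, the scores would be calibrated probabilities over the age bins; the argmax rule matches the maximum-probability selection described above.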
• the above-mentioned step 103A can also be implemented in the following manner: perform the following processing on a target object appearing in the video file. First normalize the target video frame where the target object is located (that is, adjust the average grayscale and contrast of different target video frames to a fixed level, providing a relatively uniform image specification for subsequent processing), and partition the normalized target video frame to obtain multiple sub-areas; for example, the target video frame can be divided into multiple rectangles, each rectangle representing one sub-area. Then extract the local binary pattern features corresponding to each sub-area and statistically process the multiple local binary pattern features; for example, the histogram statistical method can be used to count them to obtain the local histogram statistical features corresponding to the target video frame. Then use the local feature library of the training set to perform local sparse reconstruction of the local histogram statistical features (that is, approximate the statistical features with a linear combination of a small number of local binary pattern features in the local feature library of the training set), so as to obtain the object attributes of the target object.
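The partition-and-histogram step above can be sketched as follows; the 8-neighbour LBP code, the 2×2 region grid and the frame size are illustrative assumptions:

```python
import numpy as np

# Sketch: split a normalized frame into rectangular sub-areas, compute a
# basic 8-neighbour LBP code per interior pixel, and concatenate the
# per-region histograms into the local histogram statistical feature.

def lbp_codes(gray):
    h, w = gray.shape
    codes = np.zeros((h - 2, w - 2), dtype=int)
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            center = gray[i, j]
            code = 0
            for bit, (di, dj) in enumerate(offsets):
                if gray[i + di, j + dj] >= center:
                    code |= 1 << bit       # set bit for brighter neighbour
            codes[i - 1, j - 1] = code
    return codes

def regional_histograms(gray, rows=2, cols=2):
    codes = lbp_codes(gray)
    h, w = codes.shape
    feats = []
    for r in range(rows):
        for c in range(cols):
            block = codes[r*h//rows:(r+1)*h//rows, c*w//cols:(c+1)*w//cols]
            hist, _ = np.histogram(block, bins=256, range=(0, 256))
            feats.append(hist)
    return np.concatenate(feats)

frame = np.random.default_rng(2).random((18, 18))
feature = regional_histograms(frame)
```

Each region contributes a 256-bin histogram, so the concatenated feature length is 256 times the number of regions.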
• when there are multiple objects in the video file, the target object can be determined from the multiple objects in any of the following ways: determine the object that appears for the longest time in the video file as the target object; determine the object that matches the user's preference as the target object (for example, determine the user's feature data according to the user's historical viewing records and determine the object with the highest similarity to the user's feature data as the object matching the user's preference); or determine an object in the video file that the user has interacted with (for example, an object the user has liked or reposted) as the target object.
• the object attributes (such as gender, age, emotion) of the target object can also be recognized from its voice: the gender of the target object can be determined from the fundamental frequency of the voice (for example, female voices usually have a relatively high fundamental frequency while male voices have a relatively low one); the age of the target object can be identified from the pitch (for example, children usually have tighter vocal cords and therefore a higher pitch, and as age increases the vocal cords loosen and the pitch gradually decreases); and the emotion of the target object can be determined from information such as speaking speed and volume (for example, when the target object is angry, the corresponding volume is higher and the speaking speed is relatively fast).
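The voice-based heuristics above can be sketched as simple threshold rules; all thresholds are illustrative assumptions, not calibrated values:

```python
# Sketch: pitch separates gender and age bands, while volume and
# speaking rate hint at emotion, following the description above.

def gender_from_pitch(pitch_hz):
    # Female voices typically have a higher fundamental frequency.
    return "female" if pitch_hz >= 165 else "male"

def age_band_from_pitch(pitch_hz):
    # Children's tighter vocal cords give a higher pitch that drops with age.
    if pitch_hz >= 250:
        return "child"
    return "adult" if pitch_hz >= 120 else "senior"

def emotion_from_delivery(volume_db, words_per_min):
    # Anger tends to pair higher volume with faster speech.
    if volume_db > 70 and words_per_min > 180:
        return "angry"
    return "calm"

g = gender_from_pitch(210)
e = emotion_from_delivery(75, 200)
```

A real implementation would replace these thresholds with a classifier over acoustic features, but the input signals (pitch, volume, speaking speed) are the ones named in the description.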
• in step 104A, based on the content characteristics of at least one dimension of the multimedia file, style conversion is performed on multiple original subtitles associated with the multimedia file to obtain multiple new subtitles.
• the plurality of new subtitles (whose styles can all be the same, for example all obtained by performing style conversion on the multiple original subtitles based on the recognized style of the multimedia file) serve as the multiple subtitles to be displayed on the human-computer interaction interface, that is, the multiple subtitles displayed sequentially on the human-computer interaction interface in step 102.
• the above-mentioned step 104A can be implemented in the following manner: based on the value corresponding to the content characteristics of at least one dimension of the multimedia file and the multiple original subtitles associated with the multimedia file, invoke a subtitle model to obtain the multiple new subtitles, where the subtitle model can be used as a generative model and trained together with a discriminative model to form a generative adversarial network.
• taking the multimedia file as a video file as an example, after the content characteristics of at least one dimension of the video file are obtained (such as the style of the video file, assumed to be comedy), the subtitle model can be invoked based on the value corresponding to the style of the video file and the multiple original subtitles associated with the video file (assuming the fonts of the multiple original subtitles are all italics) to obtain multiple new subtitles whose fonts are, for example, all rounded. A rounded font is cartoonish and adapted to comedy; that is, during the playback of the video file, the fonts of the multiple subtitles displayed in sequence on the human-computer interaction interface are all rounded.
• the subtitle model may also be obtained by training in other ways; for example, the subtitle model may be trained separately, and the embodiment of the present application does not specifically limit the training method of the subtitle model.
• the above-mentioned style conversion can be applied to subtitles in picture format; for example, a picture in which the subtitle content is in the original font (such as italics) can be converted into a picture in which the subtitle content is in a font matching the style of the video file (such as Chinese Colorful Cloud). Style conversion can also be performed directly on subtitles in text format, whose various attributes (such as font and font size) can be adjusted. For example, the text subtitles can first be encoded into matrix vectors, the matrix vectors can then be subjected to style conversion (for example, the matrix vector and the value corresponding to the style of the video file can be input into the subtitle model) to obtain new matrix vectors (that is, the matrix vectors corresponding to the new-style subtitles), and decoding can then be performed based on the new matrix vectors to obtain subtitles in the new style (that is, a style adapted to the style of the video file); finally, the new-style subtitles replace the original-style subtitles in the subtitle file. Subtitles in text format are more conducive to saving and updating, for example to correcting textual mistakes.
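The text-format path above can be sketched at the cue level; the genre-to-style table and the in-memory cue format are illustrative assumptions (the actual encoding, conversion and decoding would be performed by the subtitle model):

```python
# Sketch: a text-format subtitle cue keeps its text and timing, while its
# style attributes are rewritten to match the recognized style of the file.

STYLE_FOR_GENRE = {
    "comedy": {"font": "YouYuan", "size": 4, "color": "blue"},
    "horror": {"font": "LiSu",    "size": 3, "color": "gray"},
}

def convert_cue(cue, genre):
    new_cue = dict(cue)                       # text and timing are kept
    new_cue.update(STYLE_FOR_GENRE.get(genre, {}))
    return new_cue

original = {"start": 10.0, "end": 12.5, "text": "hello",
            "font": "KaiTi", "size": 5, "color": "white"}
converted = convert_cue(original, "comedy")
```

Because only attributes change, the converted cue can replace the original in the subtitle file while leaving the text available for later corrections, matching the saving-and-updating point above.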
• for the situation in which the styles of the multiple subtitles associated with the multimedia file differ (that is, the subtitle style changes during the playback of the entire multimedia file), step 103B and step 104B shown in FIG. 5B can be executed before step 102 shown in FIG. 3; they will be described below in conjunction with the steps shown in FIG. 5B.
• in step 103B, the content characteristics of at least one dimension of each segment in the multimedia file are acquired.
• the multimedia file can first be divided (for the specific division process, refer to the description above, which is not repeated here) to obtain multiple segments, where each segment is associated with at least one original subtitle; for example, segment 1 is associated with original subtitles 1 to 3, and segment 2 is associated with original subtitles 4 and 5. The content characteristics of at least one dimension of each segment are then obtained; the way the content characteristics of a segment are acquired is similar to the way the content characteristics of the multimedia file are acquired and can be realized with reference to the description above, which is not repeated here.
  • step 104B: the following processing is performed for each segment: based on the content feature of at least one dimension of the segment, style conversion processing is performed on at least one original subtitle associated with the segment to obtain at least one new subtitle.
  • the at least one new subtitle corresponding to each segment can be combined to obtain multiple new subtitles, where the order of the multiple new subtitles is the same as the order of the multiple original subtitles; the multiple new subtitles are the subtitles to be displayed on the human-computer interaction interface, that is, the subtitles displayed sequentially in step 102.
  • the process of performing style conversion on the at least one original subtitle associated with each segment obtained after division is similar to the process of performing style conversion on the multiple original subtitles associated with the multimedia file, and is not repeated here.
  • the following processing may be performed for each segment: the subtitle model is called based on the value corresponding to the content feature of at least one dimension of segment A and the at least one original subtitle associated with segment A, to obtain at least one new subtitle associated with segment A; the new subtitles can then replace the at least one original subtitle associated with segment A stored in the subtitle file. In this way, during subsequent playback, for example when segment A is played, the at least one new subtitle associated with segment A can be read from the subtitle file and displayed on the human-computer interaction interface.
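The per-segment flow above can be sketched as follows; the segment structure, the numeric feature values, and the `subtitle_model` callable are illustrative assumptions, not the patent's actual interfaces.

```python
# Sketch of per-segment style conversion (illustrative only): each
# segment carries its own content-feature value (e.g. an emotion code),
# so subtitles of different segments may receive different styles.
def restyle_segments(segments, subtitle_model):
    styled = {}
    for seg_id, seg in segments.items():
        value = seg["feature_value"]
        styled[seg_id] = [subtitle_model(value, s) for s in seg["subtitles"]]
    return styled

def toy_model(value, subtitle):          # stand-in for the trained model
    return f"{subtitle}[style={value}]"

segments = {
    "seg1": {"feature_value": 1, "subtitles": ["a", "b"]},   # e.g. happy
    "seg2": {"feature_value": 2, "subtitles": ["c"]},        # e.g. sad
}
out = restyle_segments(segments, toy_model)
print(out["seg1"])   # ['a[style=1]', 'b[style=1]']
print(out["seg2"])   # ['c[style=2]']
```

Because the feature value is read per segment, the same original subtitle text would be restyled differently in different segments, matching the behavior described above.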
  • the emotion of the target object may change during the playback of the video file, that is, the emotion belongs to the content feature of the dynamic dimension.
  • in different segments, the emotion of the target object may differ; therefore, when style conversion is performed based on the emotion of the target object, the styles of the new subtitles obtained by converting the original subtitles associated with different segments may differ.
  • for example, in segment 1 the emotion of the target object is happy and the font of the new subtitle obtained after style conversion is rounded; in segment 2 the emotion of the target object is sad and the font of the new subtitle obtained after style conversion is the "Chinese colorful cloud" font. That is, during playback of the video file, the subtitle style is adjusted as the emotion of the target object changes, so that coordination between the subtitles and the video file at the level of visual perception is realized accurately and efficiently.
  • segment A can also be divided again to obtain multiple sub-segments; the content feature of at least one dimension of each sub-segment in segment A is then acquired, and the following processing is performed for each sub-segment: the subtitle model is called based on the value corresponding to the content feature of at least one dimension of the sub-segment and the at least one original subtitle associated with the sub-segment, to obtain at least one new subtitle associated with the sub-segment.
  • in this way, the styles of the new subtitles associated with different sub-segments also differ, so the subtitle style changes even during playback of the same segment.
  • the style applied to the subtitles can also be adapted to the fused content feature obtained by fusing the content features of multiple dimensions of the segment. The above display, in the human-computer interaction interface, of at least one subtitle associated with the segment in a style adapted to the segment's content features can then be realized as follows: the content features of multiple dimensions of the segment are fused to obtain a fused content feature; based on the fused content feature, style conversion processing is performed on the at least one original subtitle associated with the segment to obtain at least one new subtitle, where the at least one new subtitle serves as the at least one subtitle to be displayed in the human-computer interaction interface.
  • taking a video file as an example of the multimedia file, the content features of multiple dimensions of the video file (such as its style and its hue) are first acquired, and the style and hue of the video file are then fused (for example, the value corresponding to the style and the value corresponding to the hue are summed) to obtain the fused content feature; the subtitle model is then called based on the value corresponding to the fused content feature and the at least one original subtitle associated with the segment to obtain at least one new subtitle, whose style is simultaneously adapted to both the style and the hue of the video file. By comprehensively considering content features of multiple dimensions of the video file, the presented subtitles better match the video content, further improving coordination between the subtitles and the video file at the level of visual perception.
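The summation-based fusion described above can be sketched as follows; the numeric feature codes and the toy subtitle model are assumptions for illustration, not the patent's actual implementation.

```python
# Sketch of multi-dimensional content-feature fusion by summation
# (illustrative only; feature values are assumed to be numeric codes).
def fuse_features(feature_values):
    """Fuse per-dimension content-feature values by summation."""
    return sum(feature_values)

def toy_model(value, subtitle):     # stand-in for the trained model
    return {"text": subtitle, "style_id": value}

style_value, hue_value = 3, 4       # e.g. codes for "cartoon", "warm"
fused = fuse_features([style_value, hue_value])
new_subs = [toy_model(fused, s) for s in ["hello", "world"]]
print(fused)                        # 7
print(new_subs[0])                  # {'text': 'hello', 'style_id': 7}
```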
  • the style applied to the subtitles can also be related to both the content of the multimedia file and the user's characteristic data. For example, the user's characteristic data can be determined from the user's play records, the user's degree of preference for the current multimedia file can then be calculated, and finally the subtitle style can be determined based on the preference degree together with the content feature of at least one dimension of the multimedia file.
  • for example, the value corresponding to the preference degree and the value of the content feature of at least one dimension of the multimedia file are fused (for example, the two values are added), and the subtitle model is called based on the fused value and the multiple original subtitles associated with the multimedia file to obtain multiple new subtitles. The style of the new subtitles is thus adapted to both the content of the multimedia file and the user's characteristic data; that is, for the same multimedia file, the subtitles displayed on different users' terminals can differ. In this way, by comprehensively considering the user's own factors and the content features of the multimedia file, coordination between the subtitles and the multimedia file at the level of visual perception is further improved.
  • with the subtitle processing method for multimedia files provided by the embodiment of the present application, subtitles in a style related to the content of the multimedia file are displayed on the human-computer interaction interface during playback; by enriching the expression forms of subtitles, diversified display effects of multimedia-related information are realized, and coordination between the subtitles and the multimedia file at the level of visual perception is achieved accurately and efficiently.
  • the embodiment of the present application provides a subtitle processing method for multimedia files that understands the content of the video file (for example, by mining the character attributes of persons appearing in the video file, the overall style of the video file, etc.) and generates subtitles of a relevant style in real time, so as to accurately and efficiently realize coordination between the subtitles and the video file at the level of visual perception.
  • the subtitle processing method provided by the embodiment of the present application can be applied to subtitle generation for major video websites: the content of the video file is recognized (including recognition of the video style and of the character attributes of persons appearing in the video, for example their age, gender, and emotion), and subtitles in a style related to the recognized content are generated in real time.
  • FIG. 6A to FIG. 6C are schematic diagrams of application scenarios of the subtitle processing method for multimedia files provided by the embodiment of the present application. The style of the video 601 shown in FIG. 6A is cartoon and the overall style is cute, so the subtitle 602 associated with the video 601 is also in this style.
  • the color of the subtitle 602 can also adapt to the dominant color of the background; for example, the color of the subtitle 602 can be blue;
  • the style of the video 603 shown in FIG. 6B is comedy and the overall style is funny; therefore, the subtitle 604 associated with the video 603 is also cartoonish, matching the style of the video 603;
  • the style of the video 605 shown in FIG. 6C is a hero movie and the overall style is more serious; therefore, the font style of the subtitle 606 associated with the video 605 is also more solemn. That is, videos of different styles correspond to subtitles of different styles, and the subtitle styles closely match the video styles, so that coordination between the subtitles and the video files at the level of visual perception can be realized accurately and efficiently.
  • the subtitle processing method for multimedia files mainly involves two parts: content understanding of the video file, and real-time generation of video subtitles of a relevant style based on the results of that content understanding.
  • the understanding process of video content will be described first below.
  • FIG. 7 is a schematic diagram of the video content dimensions provided by the embodiment of the present application.
  • character attributes: including the character's gender, age, emotion, etc.
  • video style: the types of video style can include cartoon, comedy, horror, suspense, etc.
  • the identification of character attributes includes identifying the character's gender, age, and emotion.
  • the recognition of a person's gender can adopt (but is not limited to) a face gender classification algorithm based on Adaboost and a support vector machine (SVM, Support Vector Machine). Adaboost is an iterative algorithm whose core idea is to train different classifiers (weak classifiers) on the same training set, and then combine these weak classifiers to form a stronger final classifier (a strong classifier).
  • the face gender classification algorithm based on Adaboost+SVM is mainly divided into two stages: (a) training stage: the training set is first preprocessed, Gabor filtering is then applied to obtain the Gabor wavelet features of the preprocessed training set, the Adaboost classifier is trained based on these features, and finally the SVM classifier is trained on the features after dimensionality reduction by the Adaboost classifier; (b) testing stage: the test set is first preprocessed, Gabor filtering is applied to obtain its Gabor wavelet features, dimensionality reduction is performed by the trained Adaboost classifier, and finally the trained SVM classifier is called on the reduced features for recognition, outputting the recognition result (that is, the person's gender).
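As a rough illustration of this two-stage pipeline (assuming scikit-learn is available): random vectors stand in for Gabor features of preprocessed face images, and Adaboost's feature importances stand in for its dimensionality-reduction role.

```python
# Sketch of the two-stage Adaboost + SVM classifier described above
# (illustrative only; not the patent's actual implementation).
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 40))
y = (X[:, 0] + X[:, 1] > 0).astype(int)       # toy binary "gender" labels

# (a) Training stage: fit Adaboost, keep only the most important
# features (dimensionality reduction), then train the SVM on them.
ada = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)
top = np.argsort(ada.feature_importances_)[-5:]
svm = SVC(kernel="rbf").fit(X[:, top], y)

# (b) Testing stage: apply the same feature selection, then classify.
X_test = rng.normal(size=(50, 40))
y_test = (X_test[:, 0] + X_test[:, 1] > 0).astype(int)
acc = svm.score(X_test[:, top], y_test)
print(round(acc, 2))
```

Because the labels depend only on the first two features, Adaboost's importances concentrate on them, and the SVM trained on the reduced features classifies well above chance.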
  • the age estimation of a person can adopt (but is not limited to) a face age estimation algorithm that fuses local binary pattern (LBP, Local Binary Patterns) features and histogram of oriented gradients (HOG, Histogram of Oriented Gradients) features. LBP is an operator used to describe the local texture features of an image and has significant advantages such as rotation invariance and grayscale invariance; HOG is a feature descriptor used for object detection in computer vision and image processing, obtained by computing and aggregating histograms of gradient orientations over local regions of the image.
  • the face age estimation algorithm mainly includes the following two stages: (a) training stage: first, face local statistical features closely related to age changes (such as LBP features and HOG features) are extracted from the training sample set; dimensionality reduction is then performed on the extracted features, for example using principal component analysis (PCA, Principal Component Analysis) on the extracted LBP and HOG features; the two reduced feature sets are then fused using the canonical correlation analysis (CCA, Canonical Correlation Analysis) method; and finally a support vector regression (SVR, Support Vector Regression) model is trained on the fusion result.
  • the SVR model is a regression algorithm model that creates an "interval" (margin) on both sides of its linear function: samples that fall inside the margin incur no loss, and only those outside it are included in the loss function; the model is then optimized by minimizing the margin width and the total loss. (b) testing stage: the LBP and HOG features of the test sample set are first extracted, the PCA method is used to reduce the dimensionality of the extracted LBP and HOG features, the CCA method fuses the two reduced feature sets, and finally the trained SVR model is called on the fusion result for age regression, outputting the estimated age.
  • the recognition of a character's emotion can adopt (but is not limited to) a facial expression recognition algorithm that combines LBP features with local sparse representation. As shown in Figure 10, the algorithm includes the following two stages: (a) training stage: the face images in the training set are first normalized, the normalized face images are then partitioned, the LBP feature of each face sub-region obtained by the partitioning is computed, and the feature vectors of each region are aggregated using local histogram statistics to form a training-set local feature library composed of specific local face features; (b) testing stage: the same face image normalization, face partitioning, LBP feature computation, and local histogram statistics are applied to the face images in the test set; finally, the local histogram statistical features of the test images are given a local sparse reconstruction representation using the training-set local feature library, and the final facial expression is classified and recognized by weighting the local sparse reconstruction residuals, outputting the recognition result.
  • the training stage can be processed offline, while the testing stage can be processed online.
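The classify-by-reconstruction-residual idea behind the sparse-representation step can be sketched as follows; ordinary least squares stands in for true sparse coding, and the toy vectors are not real LBP histogram features.

```python
# Sketch of classification by reconstruction residual (illustrative only):
# the test feature is reconstructed from each class's feature library,
# and the class with the smallest reconstruction residual wins.
import numpy as np

def residual(library, x):
    """Distance from x to the span of one class's feature library."""
    coef, *_ = np.linalg.lstsq(library.T, x, rcond=None)
    return float(np.linalg.norm(library.T @ coef - x))

lib_happy = np.array([[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]])   # class library
lib_sad   = np.array([[0.0, 0.0, 1.0], [0.0, 0.1, 0.9]])
x = np.array([0.95, 0.05, 0.0])              # test feature, "happy"-like

res = {"happy": residual(lib_happy, x), "sad": residual(lib_sad, x)}
label = min(res, key=res.get)                # smallest residual wins
print(label)                                 # happy
```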
  • the video style can be identified using a convolutional neural network (CNN, Convolutional Neural Networks) model, where the training data can come from video files provided by a video website together with style classification labels (generally annotated by operators). As shown in Figure 11, L consecutive frame images of the video (L is a positive integer greater than 1) are input into the trained convolutional neural network model; after convolution, pooling, and N dense blocks (for example 4 dense blocks, dense block 1 to dense block 4, where each dense block can be composed of multiple convolution blocks using the same number of output channels), the feature map corresponding to each frame image is obtained. The Gram matrix is then used to compute the correlation between feature maps (such as the feature maps after convolution processing) to represent the style information of the video; the correlation result output by the Gram matrix is passed through fully connected processing (for example, two fully connected layers), and finally the fully connected result is input to a regression function (such as the Softmax function), which outputs the probabilities corresponding to the different styles and thereby determines the video's style.
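The Gram-matrix style representation mentioned above can be sketched as follows: for a feature map of shape (channels, height, width), the Gram matrix holds channel-to-channel correlations, which characterize style rather than spatial content. The normalization by the number of spatial positions is a common convention assumed here, not necessarily the patent's.

```python
# Sketch of the Gram-matrix computation over a CNN feature map.
import numpy as np

def gram_matrix(feature_map):
    """feature_map: (C, H, W) array -> (C, C) normalized Gram matrix."""
    c, h, w = feature_map.shape
    f = feature_map.reshape(c, h * w)   # flatten spatial positions
    return f @ f.T / (h * w)            # channel-to-channel correlations

fmap = np.zeros((2, 2, 2))
fmap[0] = 1.0      # channel 0 is constant 1
fmap[1] = 2.0      # channel 1 is constant 2
g = gram_matrix(fmap)
print(g.tolist())  # [[1.0, 2.0], [2.0, 4.0]]
```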
  • the generation of subtitles can be realized by using a generative adversarial network (GAN, Generative Adversarial Networks) model.
  • a GAN contains two models, the generative model and the discriminative model, which compete against each other to achieve the final result.
  • Figure 12 is a schematic diagram of the training principle of the generative adversarial network model provided by the embodiment of the present application, and the specific algorithm flow is as follows:
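Since the specific flow of Figure 12 is not reproduced in this text, the following is only a generic illustration of the adversarial objective that generator/discriminator training optimizes (standard GAN losses, not necessarily the patent's exact formulation).

```python
# Generic GAN objective (illustrative only): the discriminator is
# penalized for misclassifying real/generated samples, while the
# generator is penalized when the discriminator rejects its output.
import math

def d_loss(d_real, d_fake):
    """Discriminator wants D(real) -> 1 and D(fake) -> 0."""
    return -math.log(d_real) - math.log(1.0 - d_fake)

def g_loss(d_fake):
    """Generator wants the discriminator to score its output as real."""
    return -math.log(d_fake)

print(round(d_loss(0.9, 0.1), 3))   # 0.211 -- confident, correct D
print(round(g_loss(0.9), 3))        # 0.105 -- G that fools D
print(round(g_loss(0.1), 3))        # 2.303 -- G that fails to fool D
```

Training alternates between minimizing `d_loss` over the discriminator's parameters and `g_loss` over the generator's, which is the competition described above.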
  • the subtitles in text format can be converted into image format first, and then the above-mentioned processing can be performed.
  • the subtitle style is more in line with the video content or the characteristics of the persons appearing in the video, providing a more immersive experience;
  • the subtitle style is automatically generated by an electronic device (such as a terminal device or a server), and there is no need to purchase the copyright of the subtitle library, which saves copyright costs.
  • the following continues to describe an exemplary structure of the subtitle processing device 465 for multimedia files provided by the embodiment of the present application implemented as software modules. The software modules of the subtitle processing device 465 stored in the memory 460 may include: a playback module 4651 and a display module 4652.
  • the playback module 4651 is configured to play a multimedia file in response to a playback trigger operation, where the multimedia file is associated with multiple subtitles and the types of multimedia files include video files and audio files; the display module 4652 is configured to sequentially display the multiple subtitles on the human-computer interaction interface during playback of the multimedia file, where the styles applied to the multiple subtitles are related to the content of the multimedia file.
  • the display module 4652 is further configured to sequentially display, in the human-computer interaction interface, multiple subtitles with the styles applied, wherein the styles are adapted to the content features of at least one dimension of the multimedia file, and the content features of the at least one dimension include: style, object, scene, plot, and hue.
  • the subtitle processing device 465 for multimedia files further includes an acquisition module 4653 configured to acquire the content feature of at least one dimension of the multimedia file, and a conversion module 4654 configured to perform, based on the content feature of the at least one dimension, style conversion processing on the multiple original subtitles associated with the multimedia file to obtain multiple new subtitles, where the multiple new subtitles serve as the multiple subtitles to be displayed in the human-computer interaction interface.
  • the conversion module 4654 is further configured to call the subtitle model based on the value corresponding to the content feature of the at least one dimension and the multiple original subtitles associated with the multimedia file to obtain multiple new subtitles, wherein the subtitle model serves as the generative model and forms a generative adversarial network with a discriminative model for training.
  • the multimedia file includes multiple segments, and the types of the segments include at least one of the following: object segments, scene segments, and plot segments; the display module 4652 is further configured to perform the following processing during the playback of each segment of the multimedia file: sequentially displaying, in the human-computer interaction interface, at least one subtitle associated with the segment based on a style adapted to the content feature of at least one dimension of the segment.
  • the acquisition module 4653 is further configured to acquire the content feature of the static dimension of the segment, wherein the content feature of the static dimension of an object segment includes at least one of the following object attributes of the speaking object in the object segment: role type, gender, age; the content feature of the static dimension of a scene segment includes the scene type of the scene segment; the content feature of the static dimension of a plot segment includes the plot progress of the plot segment; the display module 4652 is further configured to synchronously display, in the human-computer interaction interface, at least one subtitle associated with the segment based on a style adapted to the content feature of the static dimension of the segment, wherein the style remains unchanged during the playback of the segment.
  • the segment includes multiple sub-segments, the multiple sub-segments share the content feature of the static dimension of the segment as well as the content feature of the dynamic dimension of the segment, and different sub-segments have different content features of the dynamic dimension; the display module 4652 is further configured to perform the following processing during the playback of each sub-segment of the segment: displaying at least one subtitle associated with the sub-segment based on a style adapted to the content feature of the static dimension and the content feature of the dynamic dimension possessed by the sub-segment.
  • the content features of the static dimension of an object segment include at least one of the following object attributes of the speaking object in the object segment: role type, gender, age;
  • the content features of the dynamic dimension of an object segment include the following object attribute: the emotion of the speaking object in the object segment;
  • the content features of the static dimension of a plot segment include the plot type of the plot segment, and the content features of the dynamic dimension of a plot segment include at least one of the following:
  • the content features of the static dimension of a scene segment include the scene type involved in the scene segment; the content features of the dynamic dimension of a scene segment include at least one of the following: the object attributes of the different speaking objects appearing in the scene segment, and the types of the different plots appearing in the scene segment.
  • the subtitle processing device 465 for multimedia files further includes a fusion module 4655 configured to fuse the content features of multiple dimensions of the segment to obtain a fused content feature; the conversion module 4654 is further configured to perform style conversion processing on the at least one original subtitle associated with the segment based on the fused content feature to obtain at least one new subtitle, where the at least one new subtitle serves as the at least one subtitle to be displayed in the human-computer interaction interface.
  • the acquisition module 4653 is further configured to call the content feature recognition model to perform content feature recognition on the content of the multimedia file to obtain the content feature of at least one dimension of the multimedia file, wherein the content feature recognition model is trained based on sample multimedia files and labels annotating the content of the sample multimedia files.
  • the obtaining module 4653 is further configured to perform the following processing on a target object appearing in the video file: preprocessing the target video frame in which the target object is located; performing feature extraction on the preprocessed target video frame to obtain the image features corresponding to the target video frame; performing dimensionality reduction on the image features, and using a trained classifier to classify the reduced image features to obtain the object attributes of the target object.
  • the obtaining module 4653 is further configured to perform the following processing on a target object appearing in the video file: extracting the local binary pattern features corresponding to the target video frame in which the target object is located and performing dimensionality reduction on them; extracting the histogram of oriented gradients features corresponding to the target video frame and performing dimensionality reduction on them; performing canonical correlation analysis on the reduced local binary pattern features and histogram of oriented gradients features to obtain the analysis result; and performing regression on the analysis result to obtain the object attributes of the target object.
  • the acquisition module 4653 is further configured to perform the following processing on a target object appearing in the video file: normalizing the target video frame in which the target object is located, and partitioning the normalized target video frame to obtain multiple sub-regions; extracting the local binary pattern feature corresponding to each sub-region, and statistically aggregating the multiple local binary pattern features to obtain the local histogram statistical features corresponding to the target video frame; performing local sparse reconstruction representation on the local histogram statistical features using the training-set local feature library, and performing local reconstruction residual weighted identification on the sparse reconstruction representation results to obtain the object attributes of the target object.
  • the subtitle processing device 465 for multimedia files further includes a determination module 4656 configured to determine the target object from multiple objects in any of the following ways: determining the object that appears for the longest time in the video file as the target object; determining an object in the video file that matches the user's preference as the target object; and determining an object in the video file that the user interacts with as the target object.
  • the description of the device in the embodiment of the present application is similar to the implementation of the subtitle processing method for a multimedia file above, and has similar beneficial effects, so details are not repeated here.
  • for technical details not exhausted in the description of the subtitle processing apparatus for multimedia files provided by the embodiment of the present application, refer to the description of any one of FIG. 3, FIG. 5A, or FIG. 5B.
  • An embodiment of the present application provides a computer program product or computer program, where the computer program product or computer program includes computer instructions, and the computer instructions are stored in a computer-readable storage medium.
  • the processor of the computer device reads the computer instruction from the computer-readable storage medium, and the processor executes the computer instruction, so that the computer device executes the subtitle processing method for a multimedia file described above in the embodiment of the present application.
  • An embodiment of the present application provides a computer-readable storage medium storing executable instructions; when the executable instructions are executed by a processor, the processor is caused to execute the subtitle processing method for multimedia files provided by the embodiments of the present application, for example, the subtitle processing method for a multimedia file shown in FIG. 3, FIG. 5A, or FIG. 5B.
  • the computer-readable storage medium can be a memory such as FRAM, ROM, PROM, EPROM, EEPROM, flash memory, magnetic surface memory, optical disc, or CD-ROM; it can also be any of various devices including one of the above memories or any combination thereof.
  • executable instructions may take the form of programs, software, software modules, scripts, or code written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages, and its Can be deployed in any form, including as a stand-alone program or as a module, component, subroutine or other unit suitable for use in a computing environment.
  • executable instructions may, but need not, correspond to files in a file system, and may be stored as part of a file that holds other programs or data, for example, in one or more scripts within a Hyper Text Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple cooperating files (for example, files that store one or more modules, subroutines, or code sections).
  • executable instructions may be deployed to be executed on one computing device, on multiple computing devices located at one site, or on multiple computing devices distributed across multiple sites and interconnected by a communication network.
  • in summary, in the embodiments of the present application, subtitles in styles related to the content of multimedia files are displayed on the human-computer interaction interface; by enriching the expression forms of subtitles, diversified display effects of information related to multimedia files are realized, which can meet the diversified subtitle display requirements of different application scenarios of multimedia files while improving the effect of information dissemination and the users' viewing experience.

Abstract

The present application provides a subtitle processing method and apparatus for a multimedia file, an electronic device, a computer-readable storage medium, and a computer program product. The method includes: in response to a playback trigger operation, playing a multimedia file, where the multimedia file is associated with multiple subtitles and the types of multimedia files include video files and audio files; and during playback of the multimedia file, sequentially displaying the multiple subtitles on a human-computer interaction interface, where the styles applied to the multiple subtitles are related to the content of the multimedia file.

Description

Subtitle processing method and apparatus for multimedia files, electronic device, computer-readable storage medium, and computer program product
Cross-reference to related applications
This application is based on and claims priority to Chinese patent application No. 202111114803.6 filed on September 23, 2021, the entire contents of which are incorporated herein by reference.
Technical field
This application relates to the field of Internet technologies, and in particular to a subtitle processing method and apparatus for multimedia files, an electronic device, a computer-readable storage medium, and a computer program product.
Background
With the development of Internet technologies, especially Internet applications such as online video, online music, and online education, multimedia files serving as information carriers (for example, various types of video or audio files) are widely used, presenting information to users more quickly and conveniently. Subtitles play an indispensable role: besides presenting characters' dialogue, subtitles can describe, outline, or summarize the content of a video or audio file; for example, when watching foreign video files, users need subtitles to help understand the content.
However, in the solutions provided by the related art, the display style of subtitles is fixed during playback of a video or audio file. Taking video files as an example, different video files vary widely in style, which often leads to visual mismatch between the subtitles and the video, and can even prevent the subtitles from being displayed clearly. Moreover, although manually producing subtitles can ensure coordination between the subtitles and the video file, it cannot efficiently handle large numbers of video files.
That is, the related art has no effective solution for accurately and efficiently coordinating subtitles and video files at the level of visual perception.
Summary
Embodiments of the present application provide a subtitle processing method and apparatus for multimedia files, an electronic device, a computer-readable storage medium, and a computer program product, which can accurately and efficiently realize coordination between subtitles and multimedia files at the level of visual perception.
An embodiment of the present application provides a subtitle processing method for a multimedia file, executed by an electronic device, including:
in response to a playback trigger operation, playing a multimedia file, where the multimedia file is associated with multiple subtitles and the types of multimedia files include video files and audio files; and
during playback of the multimedia file, sequentially displaying the multiple subtitles on a human-computer interaction interface, where the styles applied to the multiple subtitles are related to the content of the multimedia file.
本申请实施例提供一种多媒体文件的字幕处理装置,包括:
播放模块,配置为响应于播放触发操作,播放多媒体文件,其中,所述多媒体文件关联有多条字幕,所述多媒体文件的类型包括视频文件和音频文件;
显示模块,配置为在播放所述多媒体文件的过程中,在人机交互界面中依次显示所述多条字幕,其中,所述多条字幕应用的样式与所述多媒体文件的内容相关。
本申请实施例提供一种电子设备,包括:
存储器,用于存储可执行指令;
处理器,用于执行所述存储器中存储的可执行指令时,实现本申请实施例提供的多媒体文件的字幕处理方法。
本申请实施例提供一种计算机可读存储介质,存储有可执行指令,当被处理器执行时,实现本申请实施例提供的多媒体文件的字幕处理方法。
本申请实施例提供一种计算机程序产品,包括计算机程序或指令,当被处理器执行时,实现本申请实施例提供的多媒体文件的字幕处理方法。
本申请实施例具有以下有益效果:
在播放多媒体文件的过程中,在人机交互界面中显示与多媒体文件的内容相关的样式的字幕,如此,通过丰富字幕的表现形式来实现多媒体文件相关信息的多样化的展示效果,能够准确且高效地实现字幕与多媒体文件在视觉感知层面的协调。
附图说明
图1是本申请实施例提供的多媒体文件的字幕处理系统100的架构示意图;
图2是本申请实施例提供的终端设备400的结构示意图;
图3是本申请实施例提供的多媒体文件的字幕处理方法的流程示意图;
图4A是本申请实施例提供的多媒体文件的字幕处理方法的应用场景示意图;
图4B是本申请实施例提供的多媒体文件的字幕处理方法的应用场景示意图;
图4C是本申请实施例提供的针对片段进行划分的原理示意图;
图4D至图4F是本申请实施例提供的多媒体文件的字幕处理方法的应用场景示意图;
图5A是本申请实施例提供的多媒体文件的字幕处理方法的流程示意图;
图5B是本申请实施例提供的多媒体文件的字幕处理方法的流程示意图;
图6A至图6C是本申请实施例提供的多媒体文件的字幕处理方法的应用场景示意图;
图7是本申请实施例提供的视频内容维度示意图;
图8是本申请实施例提供的人物性别识别原理示意图;
图9是本申请实施例提供的人物年龄识别原理示意图;
图10是本申请实施例提供的人物情绪识别原理示意图;
图11是本申请实施例提供的视频风格识别原理示意图;
图12是本申请实施例提供的生成式对抗网络模型的训练原理示意图。
具体实施方式
为了使本申请的目的、技术方案和优点更加清楚,下面将结合附图对本申请作进一步地详细描述,所描述的实施例不应视为对本申请的限制,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其它实施例,都属于本申请保护的范围。
除非另有定义,本文所使用的所有的技术和科学术语与属于本申请的技术领域的技术人员通常理解的含义相同。本文中所使用的术语只是为了描述本申请实施例的目的,不是旨在限制本申请。
可以理解的是，在本申请实施例中，涉及到用户信息等相关的数据，当本申请实施例运用到具体产品或技术中时，需要获得用户许可或者同意，且相关数据的收集、使用和处理需要遵守相关国家和地区的相关法律法规和标准。
对本申请实施例进行进一步详细说明之前,对本申请实施例中涉及的名词和术语进行说明,本申请实施例中涉及的名词和术语适用于如下的解释。
1)字幕,指电影、电视等视频、以及戏剧、歌曲等音频中出现的各种用途的文字,例如版权标识、片名字幕、演员表、说明字幕(用于介绍多媒体文件的内容,例如将多媒体文件中出现的人物或者景色的相关信息以文字方式显示)、歌词字幕、对话字幕等,其中,对话字幕与发声对象同步,用于将发声对象的语音内容以文字方式显示,以帮助用户理解视频文件或者音频文件(例如有声小说)的内容。
2)多媒体文件,就数据形式而言,包括流媒体(streaming media)文件和本地文件,其中,流媒体文件是采用流媒体协议播放的多媒体文件,流媒体协议是指将一连串的多媒体数据压缩后,以流的方式在网络中分段传送,实现在网络上实时传输影音以供播放的一种技术,对应于网络播放场景。本地文件是在播放前首先需要完整下载的多媒体文件,对应于本地播放场景;就承载的内容而言,包括视频文件和音频文件。
3)内容特征,包括静态维度的内容特征和动态维度的内容特征,其中,静态维度的内容特征在多媒体文件的播放过程中保持不变,例如对象的性别、年龄等;动态维度的内容特征在多媒体文件的播放过程中会发生变化,例如对象的情绪、位置等。
4)样式,又称字幕样式,与字幕的视觉相关的属性,通过相同属性的不同变换以及不同属性的组合,可以形成多种样式。例如属性可以包括:字体、颜色、字号、字间距、加粗、倾斜、下划线、删除线、阴影偏移与颜色、对齐方式、垂直边距等。
5)局部二值模式(LBP,Local Binary Patterns)，是一种用来描述图像局部纹理特征的算子，具有旋转不变性和灰度不变性等特点，其基本思想是将每个像素与它周围的像素相比较得到局部图像结构：若相邻像素值大于或等于中心像素值，则该相邻像素点赋值为1，否则赋值为0，最终对每个像素点都会得到一个二进制八位的表示，例如11100111。
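上述LBP的逐像素比较过程可以用如下纯Python代码示意（3×3邻域、从左上角开始顺时针取位的顺序为本示意所作的假设）：

```python
def lbp_code(patch):
    """计算3×3灰度邻域中心像素的8位LBP编码。

    按顺时针顺序遍历8个相邻像素：相邻像素值大于或等于
    中心像素值时该位记为1，否则记为0。
    """
    center = patch[1][1]
    # 顺时针：左上、上、右上、右、右下、下、左下、左
    coords = [(0, 0), (0, 1), (0, 2), (1, 2),
              (2, 2), (2, 1), (2, 0), (1, 0)]
    return "".join("1" if patch[r][c] >= center else "0" for r, c in coords)
```

例如，对邻域[[5, 9, 1], [4, 6, 7], [2, 3, 8]]（中心像素为6），得到编码01011000。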
6)小波(Gabor)特征,对图像基于Gabor函数进行变换所得到的特征,Gabor变换属于加窗傅立叶变换,Gabor函数可以在频域不同尺度、不同方向上提取相关的特征,用于表示纹理。在空间域,二维Gabor滤波器是一个高斯核函数和正弦平面波的乘积。
7)主成分分析方法(PCA,Principal Component Analysis),是一种统计方法,通过正交变换将一组可能存在相关性的变量转换为一组线性不相关的变量,转换后的这组变量称为主成分。
8)方向梯度直方图(HOG,Histogram of Oriented Gradient),是应用在计算机视觉和图像处理领域,用于目标检测的特征描述器。这项技术是用于计算局部图像梯度的方向信息的统计值。具体实现方式如下:首先将图像划分成多个连通区域(也称细胞单元),接着采集细胞单元中各像素点的梯度的或边缘的方向直方图,最后把这些直方图组合起来就可以构成特征描述器。
9)典型相关分析(CCA,Canonical Correlation Analysis),是利用综合变量之间的相关关系来反映两组指标之间的整体相关性的多元统计分析方法,其基本原理是:为了从总体上把握两组指标之间的相关关系,分别在两组变量中提取有代表性的两个综合变量U1和V1(分别为两个变量组中各变量的线性组合),利用这两个综合变量之间的相关关系来反映两组指标之间的整体相关性。
10)局部直方图统计(Local Histogram Statistics)特征,采用直方图统计方法对多个局部二值模式特征进行统计得到的特征,用于反映图像的像素分布。其中,直方图统计方法的过程如下:首先划分出多个离散的间隔,接着统计出分布在每个间隔上的局部二值模式特征的数量。
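上述直方图统计方法中“划分间隔并计数”的过程可以写成如下示意代码（纯Python实现，等宽间隔为示意性假设）：

```python
def histogram_counts(values, bins, lo, hi):
    """将[lo, hi)等宽划分为bins个间隔，统计落入每个间隔的特征数量。

    恰好等于hi的值计入最后一个间隔。
    """
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return counts
```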
11)局部稀疏重构表示,即使用训练集局部特征库中的少量的局部二值模式特征的 线性组合,来表示局部直方图统计特征,从而减小特征的维度。
12)局部重构残差加权识别处理,即通过构建加权矩阵,对局部稀疏重构表示结果(即使用训练集局部特征中的少量的局部二值模式特征进行线性组合得到的特征)进行加权处理,并利用残差对加权结果进行分类识别的过程。
本申请实施例提供一种多媒体文件的字幕处理方法、装置、电子设备、计算机可读存储介质及计算机程序产品,能够准确且高效地实现字幕与多媒体文件在视觉感知层面的协调。下面说明本申请实施例提供的电子设备的示例性应用,本申请实施例提供的电子设备可以实施为笔记本电脑,平板电脑,台式计算机,机顶盒,移动设备(例如,移动电话,便携式音乐播放器,个人数字助理,专用消息设备,便携式游戏设备)、车载终端等各种类型的终端设备,也可以由服务器和终端设备协同实施。下面,将说明电子设备实施为终端设备时的示例性应用。
参见图1,图1是本申请实施例提供的多媒体文件的字幕处理系统100的架构示意图,如图1所示,终端设备400通过网络300连接服务器200,网络300可以是广域网或者局域网,又或者是二者的组合。
服务器200是终端设备400上运行的客户端410的后台服务器,例如当客户端410为浏览器时,服务器200可以是某个视频网站或者音频网站的后台服务器,服务器200在接收到终端设备400发送的网络请求后,通过网络300向终端设备400发送所请求的多媒体文件(例如流媒体文件),其中,多媒体文件关联有多条字幕。
终端设备400上运行的客户端410可以是各种类型的客户端,例如视频播放客户端、音频播放客户端、浏览器、以及即时通信客户端等,客户端410在接收到播放触发操作(例如接收到用户针对人机交互界面中显示的播放按钮的点击操作)时,播放从服务器200中实时接收到的多媒体文件,以及在播放多媒体文件的过程中,在人机交互界面中依次显示多条字幕,其中,多条字幕应用的样式与多媒体文件的内容相关(将在下文进行具体说明)。
在一些实施例中,本申请实施例提供的多媒体文件的字幕处理方法也可以由终端设备独自实现,例如在终端设备400本地预先存储有已经下载好的多媒体文件(多媒体文件关联有多条字幕),则客户端410在接收到播放触发操作时,播放终端设备400本地存储的多媒体文件,以及在播放多媒体文件的过程中,在人机交互界面中依次显示多条字幕,其中,多条字幕应用的样式与多媒体文件的内容相关。
在另一些实施例中,本申请实施例还可以借助于云技术(Cloud Technology)实现,云技术是指在广域网或局域网内将硬件、软件、网络等系列资源统一起来,实现数据的计算、储存、处理和共享的一种托管技术。
云技术是基于云计算商业模式应用的网络技术、信息技术、整合技术、管理平台技术、以及应用技术等的总称，可以组成资源池，按需所用，灵活便利。云计算技术将变成重要支撑技术。网络系统的后台服务需要大量的计算、存储资源。
示例的,图1中示出的服务器200可以是独立的物理服务器,也可以是多个物理服务器构成的服务器集群或者分布式系统,还可以是提供云服务、云数据库、云计算、云函数、云存储、网络服务、云通信、中间件服务、域名服务、安全服务、内容分发网络(CDN,Content Delivery Network)、以及大数据和人工智能平台等基础云计算服务的云服务器。终端设备400可以是智能手机、平板电脑、笔记本电脑、台式计算机、智能音箱、智能手表等,但并不局限于此。终端设备400以及服务器200可以通过有线或无线通信方式进行直接或间接地连接,本申请实施例中不做限制。
在一些实施例中，终端设备400还可以通过运行计算机程序来实现本申请实施例提供的多媒体文件的字幕处理方法。例如，计算机程序可以是操作系统中的原生程序或软件模块；可以是本地（Native）应用程序（APP，Application），即需要在操作系统中安装才能运行的程序（即上述的客户端410），如视频播放客户端、音频播放客户端、浏览器等；也可以是小程序，即只需要下载到浏览器环境中就可以运行的程序；还可以是能够嵌入至任意APP中的小程序。总而言之，上述计算机程序可以是任意形式的应用程序、模块或插件。
下面继续对图1中示出的终端设备400的结构进行说明,参见图2,图2是本申请实施例提供的终端设备400的结构示意图。图2所示的终端设备400包括:至少一个处理器420、存储器460、至少一个网络接口430和用户接口440。终端设备400中的各个组件通过总线系统450耦合在一起。可理解,总线系统450用于实现这些组件之间的连接通信。总线系统450除包括数据总线之外,还包括电源总线、控制总线和状态信号总线。但是为了清楚说明起见,在图2中将各种总线都标为总线系统450。
处理器420可以是一种集成电路芯片,具有信号的处理能力,例如通用处理器、数字信号处理器(DSP,Digital Signal Processor),或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等,其中,通用处理器可以是微处理器或者任何常规的处理器等。
存储器460包括易失性存储器或非易失性存储器,也可包括易失性和非易失性存储器两者。其中,非易失性存储器可以是只读存储器(ROM,Read Only Memory),易失性存储器可以是随机存取存储器(RAM,Random Access Memory)。本申请实施例描述的存储器460旨在包括任意适合类型的存储器。存储器460可选地包括在物理位置上远离处理器420的一个或多个存储设备。
在一些实施例中,存储器460能够存储数据以支持各种操作,这些数据的示例包括程序、模块和数据结构或者其子集或超集,下面示例性说明。
操作系统461,包括用于处理各种基本系统服务和执行硬件相关任务的系统程序,例如框架层、核心库层、驱动层等,用于实现各种基础业务以及处理基于硬件的任务;
网络通信模块462,用于经由一个或多个(有线或无线)网络接口430到达其他计算设备,示例性的网络接口430包括:蓝牙、无线相容性认证(WiFi)、和通用串行总线(USB,Universal Serial Bus)等;
呈现模块463,用于经由一个或多个与用户接口440相关联的输出装置441(例如,显示屏、扬声器等)使得能够呈现信息(例如,用于操作外围设备和显示内容和信息的用户接口);
输入处理模块464,用于对一个或多个来自一个或多个输入装置442之一的一个或多个用户输入或互动进行检测以及翻译所检测的输入或互动。
在一些实施例中，本申请实施例提供的多媒体文件的字幕处理装置可以采用软件方式实现，并可以提供为应用程序、软件、软件模块、脚本或代码等各种形式的软件实施例。
图2示出了存储在存储器460中的多媒体文件的字幕处理装置465,其可以是程序和插件等形式的软件,并包括一系列的模块,包括播放模块4651、显示模块4652、获取模块4653、转换模块4654、融合模块4655和确定模块4656,这些模块是逻辑上的,因此根据所实现的功能可以进行任意的组合或进一步拆分。需要指出的是,在图2中为了表述方便,一次性示出了上述所有模块,但是不应视为在多媒体文件的字幕处理装置465排除了可以只包括播放模块4651和显示模块4652的实施,将在下文中说明各个模块的功能。
如前所述，本申请实施例提供的多媒体文件的字幕处理方法可以由各种类型的电子设备实施。参见图3，图3是本申请实施例提供的多媒体文件的字幕处理方法的流程示意图，将结合图3示出的步骤进行说明。
需要说明的是,图3示出的方法可以由终端设备400运行的各种形式的计算机程序执行,并不局限于上述终端设备400运行的客户端410,例如还可以是上文所述的操作系统461、软件模块、脚本和小程序,因此下文中以客户端的示例不应视为对本申请实施例的限定。
在步骤101中,响应于播放触发操作,播放多媒体文件。
这里，多媒体文件关联有多条字幕，且每条字幕在多媒体文件的播放时间轴上对应一个播放时段。“条”是字幕显示的基本单位，可以是一行或多行文本，包括如人物语言的多语种文本、情节、人物的介绍等。每条字幕被设置有对应的一个显示时间段，包括开始显示时间和结束显示时间，例如对于一条字幕A，对应的显示时间段可以为10:00-10:05，也就是说，可以根据多媒体文件的实时播放进度所处的播放时段，显示对应的一条字幕，且字幕是应用了与多媒体文件的至少一个维度的内容特征适配的样式，即对于不同的多媒体文件关联的字幕，对应显示的样式是不同的，从而能够准确且高效地实现字幕与多媒体文件在视觉感知层面的协调。
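上述“根据实时播放进度所处的播放时段显示对应字幕”的查找逻辑可以示意如下（时间以秒计，字幕数据均为示意性假设）：

```python
def subtitle_at(cues, t):
    """在字幕列表cues中查找播放进度t落入的显示时间段对应的字幕文本。

    cues的元素为(开始时间, 结束时间, 文本)；无匹配时返回None。
    """
    for start, end, text in cues:
        if start <= t <= end:
            return text
    return None
```

例如，cues = [(600, 605, "你怎么知道啊")]时，subtitle_at(cues, 602)返回该条字幕，而subtitle_at(cues, 610)返回None。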
此外,多媒体文件的数据形式可以包括流媒体文件(对应于网络播放场景,例如客户端响应于播放触发操作,实时向服务器请求流媒体文件并进行播放)和本地文件(对应于本地播放场景,例如客户端响应于播放触发操作,播放终端设备本地预先存储的多媒体文件),且多媒体文件的类型(即承载的内容)可以包括视频文件和音频文件。
示例的,以多媒体文件为视频文件为例,假设字幕“你怎么知道啊”在视频文件的播放时间轴上对应的播放时段为10:00至10:05,即在播放视频文件的第10:00至10:05的过程中,显示字幕“你怎么知道啊”,且字幕“你怎么知道啊”是应用了相应的样式的,例如应用了与播放时段为10:00至10:05的对象片段中发声对象的属性(例如年龄、性别、情绪等)适配的样式。
在一些实施例中,字幕文件的格式可以包括图片格式和文本格式,其中,图片格式的字幕文件由idx和sub文件组成,idx相当于索引文件,里面包括了字幕出现的时间码(即上述的播放时段)和字幕显示的属性(即上述的样式),sub文件就是字幕数据本身,由于是图片格式,占用的空间比较大,因此可以进行压缩处理,以节省空间。文本格式的字幕文件的扩展名通常是ass、srt、sml、ssa或sub(和上述图片格式的字幕后缀一样,但数据格式不同),因为是文本格式,所以占用的空间较小。
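以文本格式中常见的srt为例，其时间码一行形如“00:10:00,000 --> 00:10:05,000”，可以按如下方式解析为以秒为单位的显示时间段（示意实现）：

```python
def parse_srt_time(ts):
    """将srt时间戳'HH:MM:SS,mmm'转换为秒（浮点数）。"""
    hms, ms = ts.split(",")
    h, m, s = hms.split(":")
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000.0

def parse_srt_range(line):
    """解析一行srt时间码，返回(开始秒, 结束秒)。"""
    start, end = line.split(" --> ")
    return parse_srt_time(start), parse_srt_time(end)
```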
需要说明的是,对于文本格式的字幕,字幕的样式,包括原始的样式和新的样式(即与多媒体文件的至少一个维度的内容特征适配的样式)都可以记录在ass、srt、sml、ssa或sub等文件中。
在步骤102中,在播放多媒体文件的过程中,在人机交互界面中依次显示多条字幕。
这里,多条字幕(即与多媒体文件关联的多条字幕,例如可以通过从上述的字幕文件中读取得到)应用的样式与多媒体文件的内容相关。
在一些实施例中,多条字幕应用的样式可以是相同的(即字幕样式在整个多媒体文件的播放过程中保持不变),则可以通过以下方式实现上述的在人机交互界面中依次显示多条字幕:在人机交互界面中依次显示应用了同一样式的多条字幕;其中,字幕应用的同一样式与多媒体文件的至少一个维度的内容特征适配。
示例的，以多媒体文件为视频文件为例，在播放视频文件的过程中，在人机交互界面中依次显示应用了与视频文件的至少一个维度的内容特征（例如视频文件的风格）适配的样式（例如当视频文件的风格为喜剧时，对应的样式可以是华文彩云、四号、蓝色）的多条字幕，即在播放整个视频文件的过程中，字幕均是以字体为华文彩云、字号为四号、颜色为蓝色的样式进行显示的，也就是说，当视频文件的风格为喜剧时，对应的字幕样式的风格也是偏卡通、搞笑的，与视频文件的内容的贴合度较高，从而能够准确且高效地实现字幕与视频文件在视觉感知层面的协调。
需要说明的是,字幕可以是在人机交互界面的固定位置显示的(例如在人机交互界面的中下部位显示字幕),当然字幕显示的位置也可以是动态变化的,例如对于视频文件来说,字幕可以是在人机交互界面中避让视频画面中出现的对象的位置进行显示的,也可以是采用叠加在视频画面之上的方式显示字幕,本申请实施例对字幕显示的位置不做具体限定。
在另一些实施例中,多条字幕应用的样式也可以是不同的,即字幕样式在整个多媒体文件的播放过程中会发生变化,例如多条字幕分别应用所属片段的至少一个维度的内容特征适配的样式,则可以通过以下方式实现上述的在人机交互界面中依次显示多条字幕:对多媒体文件进行划分,得到多个片段,其中,片段的类型可以包括以下至少之一:对象片段、场景片段、情节片段;在播放多媒体文件的每个片段的过程中执行以下处理:基于与片段的至少一个维度的内容特征适配的样式,在人机交互界面中依次显示与片段关联的至少一条字幕,如此,通过对多媒体文件进行划分,并显示与多媒体文件划分后得到的每个片段的内容相关的样式的字幕,能够进一步提高字幕和多媒体文件在视觉感知层面的协调性。
示例的,可以按照多媒体文件中出现的对象(例如人物、动物等)将多媒体文件划分成多个对象片段,其中,每个对象片段包括一个对象(例如对象片段A包括的对象为对象A,对象片段B包括的对象为对象B,其中,对象A和对象B是两个不同的对象,两者的对象属性不同,例如对象A为男性、对象B为女性,或者对象A为青年人、对象B为老年人),接着在播放多媒体文件的每个对象片段的过程中执行以下处理:基于与对象片段(例如对象片段A)的至少一个维度的内容特征适配的样式,例如与对象片段A包括的对象A的对象属性(例如假设识别出对象A为男性)适配的样式,假设与男性适配的样式为黑体、五号,即字幕样式的风格是偏阳刚的,在人机交互界面中依次显示与对象片段A关联的至少一条字幕,即与对象片段A关联的至少一条字幕可以是以字体为黑体、字号为五号的样式进行显示的。
示例的,可以按照场景的不同将多媒体文件划分成多个场景片段(例如对于各种类型的历史、或者地理记录片,可以按照场景将纪录片划分成多个不同的场景片段),其中,每个场景片段包括一个场景,且不同场景片段包括的场景可以是不同的,例如场景片段A包括的场景为校园,场景片段B包括的场景为公园,接着在播放多媒体文件的每个场景片段的过程中执行以下处理:基于与场景片段(例如场景片段B)的至少一个维度的内容特征适配的样式,例如与场景片段B包括的场景适配的样式,假设场景片段B包括的场景为海边,则与海边适配的样式可以是楷体、蓝色,即字幕样式的风格是与海边适配的,在人机交互界面中依次显示与场景片段B关联的至少一条字幕,即与场景片段B关联的至少一条字幕可以是以字体为楷体、颜色为蓝色的样式进行显示的。
示例的,可以按照多媒体文件的内容将多媒体文件划分成多个情节片段,例如对于视频文件,可以划分成故事的发生、发展、高潮和结局等多个情节片段,其中,每个情节片段对应一个情节,且不同情节片段对应的情节可以是不同的,例如情节片段A对应故事的发展阶段,情节片段B对应故事的高潮阶段,接着在播放多媒体文件的每个情节片段的过程中执行以下处理:基于与情节片段(例如情节片段C)的至少一个维度的内容特征适配的样式,例如假设情节片段C为高潮片段,则与高潮片段适配的样式可以是华文琥珀、三号,字号较大、字体风格较为严肃,与高潮片段适配,在人机交互界面中依次显示与情节片段C关联的至少一条字幕,即与情节片段C关联的至少一条字幕可以是以字体为华文琥珀、字号为三号的样式进行显示的。
需要说明的是,上述对多媒体文件进行划分的过程仅仅是对多媒体文件的一种逻辑上的识别和划分,多媒体文件的数据形式不发生改变,即不需要分割多媒体文件,而仅仅是在多媒体文件的播放时间轴上添加相应的标记,以将多媒体文件从逻辑上划分成不同的片段。当然,也可以对多媒体文件进行分割,本申请实施例对此不做具体限定。
此外,还需要说明的是,除了可以进行单一类型的划分外,即识别出一种类型的多个片段,例如可以仅仅根据多媒体文件中出现的对象将多媒体文件划分成多个对象片段;还可以进行复合类型的划分,即识别出多个不同类型的片段,例如可以同时根据多媒体文件中出现的对象以及场景对多媒体文件进行划分,如此,划分得到的多个片段可以同时包括对象片段和场景片段,随后,对划分得到的对象片段和场景片段进行合并及去重处理,例如当对象片段A(假设对应的时段为10:00-12:00)和场景片段B(假设对应的时段也为10:00-12:00)重合时,仅保留一个,从而得到最终的划分结果,本申请实施例对划分方式不做具体限定。
在另一些实施例中,同一片段关联的至少一条字幕应用的样式可以是相同的,即在播放同一片段的过程中,字幕样式不会发生变化,则可以通过以下方式实现上述的基于与片段的至少一个维度的内容特征适配的样式,在人机交互界面中依次显示与片段关联的至少一条字幕:获取片段的静态维度的内容特征,在人机交互界面中同步显示与片段关联的至少一条字幕,其中,字幕应用的样式在片段的播放过程中保持不变,如此,能够在准确且高效地实现字幕与多媒体文件在视觉感知层面的协调的基础上,节约终端设备的计算资源和通信资源。
示例的,以片段的类型为对象片段为例,对象片段的静态维度的内容特征可以包括对象片段中发声对象的以下对象属性至少之一:角色类型(包括正派角色和反派角色)、性别、年龄;例如对于对象片段A,首先获取对象片段A中的发声对象(例如对象A)的对象属性(例如对象A的性别,假设识别出对象A的性别为女性),接着在人机交互界面中同步显示与对象片段A关联的至少一条字幕,其中,字幕应用的样式是与女性适配的,例如样式可以为幼圆、粉色,即字幕风格比较偏女性化、且在对象片段A的播放过程中保持不变,即在播放对象片段A的过程中,字幕始终是以字体为幼圆,颜色为粉色的样式进行显示的。
举例来说,以多媒体文件为视频文件为例,参见图4A,图4A是本申请实施例提供的多媒体文件的字幕处理方法的应用场景示意图,如图4A所示,对于某个对象片段(对应的播放时段为40:30至40:40),当识别出对象片段中的发声对象401的性别为女性时,在人机交互界面中同步显示与对象片段关联的至少一条字幕,例如在40:30显示字幕402(“好开心,买了新衣服”)、以及在40:40显示字幕403(“可是下个月要吃土了”),且字幕402和字幕403是应用了与女性适配的样式的,例如字幕402和字幕403的字体风格是偏可爱的,且样式在对象片段的播放过程中保持不变,即字幕402和字幕403应用的样式是相同的。
需要说明的是,对于情节片段,情节片段的静态维度的内容特征可以包括情节片段的情节进度,且针对同一个情节片段,在播放该情节片段的过程中,与该情节片段关联的所有字幕应用的样式可以是相同的,例如对于该情节片段关联的所有字幕均应用与情节进度适配的样式;而对于场景片段,场景片段的静态维度的内容特征可以包括场景片段的场景类型,且针对同一个场景片段,在播放该场景片段的过程中,与该场景片段关联的所有字幕应用的样式可以是相同的,例如对于该场景片段关联的所有字幕均应用与场景类型适配的样式。
此外，还需要说明的是，不同片段关联的字幕应用的样式可以是不同的，例如以对象片段为例，参见图4B，图4B是本申请实施例提供的多媒体文件的字幕处理方法的应用场景示意图，如图4B所示，发声对象404和发声对象406属于不同的对象片段包括的发声对象，且发声对象404所在的对象片段关联的字幕405（“晚上去吃什么呢”）与发声对象406所在的对象片段关联的字幕407（“烧烤怎么样”）应用的样式是不同的，例如字幕405的字体为方正舒体，而字幕407的字体为华文彩云，如此，针对不同的发声对象对应的字幕应用了不同的样式，例如字幕405应用的样式是与女性适配的方正舒体（字体风格比较柔和），而字幕407应用的样式是与男性适配的华文彩云（字体风格比较阳刚），从而便于用户对视频文件中出现的不同对象进行区分。
在一些实施例中,在对多媒体文件进行划分,得到多个片段之后,还可以针对多个片段中的任意一个或者多个片段进行再次划分,得到多个子片段,如此,通过对多媒体文件进行更加精细化的划分,能够保证在多媒体文件的传播过程中,字幕与多媒体文件的内容是实时相关的,从而进一步提高了字幕与多媒体文件在视觉感知层面的协调性。
示例的,参见图4C,图4C是本申请实施例提供的针对片段进行划分的原理示意图,如图4C所示,以对多媒体文件划分得到的多个片段中的场景片段1和情节片段2(场景片段1和情节片段2可以是相邻的两个片段,即在播放完场景片段1后,继续播放情节片段2)为例,针对场景片段1,可以按照场景片段1中出现的人物将场景片段1进一步划分成3个不同的人物子片段,例如包括人物子片段1、人物子片段2和人物子片段3,其中,不同的人物子片段包括的人物可以是不同的,例如人物子片段1包括人物A、人物子片段2包括人物B、人物子片段3包括人物C;针对情节片段2,也可以按照情节片段2中出现的场景将情节片段2进一步划分成2个不同的场景子片段,例如包括场景子片段1和场景子片段2,其中,不同的场景子片段包括的场景可以是不同的,例如场景子片段1的场景为校园,场景子片段2的场景为公园。以场景片段1为例,在划分得到3个人物子片段之后,针对每个人物子片段,在播放人物子片段的过程中,可以基于与子片段的至少一个维度的内容特征适配的样式,显示人物子片段关联的至少一条字幕,下面进行具体说明。
在一些实施例中,当同一片段关联的多条字幕应用的样式不同时,可以通过以下方式实现上述的基于与片段的至少一个维度的内容特征适配的样式,在人机交互界面中依次显示与片段关联的至少一条字幕:对片段进行划分,得到多个子片段,其中,多个子片段具有片段的静态维度的内容特征(静态维度的内容特征在片段的播放过程中保持不变)、以及片段的动态维度的内容特征(不同的动态维度的内容特征在片段的播放过程中会发生变化),且不同子片段具有的动态维度的内容特征不同;在播放片段的每个子片段的过程中执行以下处理:基于与子片段具有的静态维度的内容特征和动态维度的内容特征适配的样式,显示与子片段关联的至少一条字幕。
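上述“基于静态维度与动态维度的内容特征选择字幕样式”的匹配逻辑可以用一张查找表示意（表中的属性取值与字体、颜色映射均为示意性假设，并非本申请限定的对应关系）：

```python
# 示意性的(静态特征, 动态特征) -> 字幕样式映射表
STYLE_TABLE = {
    ("female", "happy"): {"font": "幼圆", "color": "粉色"},
    ("female", "sad"): {"font": "华文隶书", "color": "灰色"},
}

DEFAULT_STYLE = {"font": "宋体", "color": "白色"}

def style_for(static_attr, dynamic_attr):
    """按(静态特征, 动态特征)查找字幕样式，未命中时回退到默认样式。"""
    return STYLE_TABLE.get((static_attr, dynamic_attr), DEFAULT_STYLE)
```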
示例的,对于对象片段,对象片段的静态维度的内容特征可以包括以下对象属性至少之一:对象片段中发声对象的角色类型、性别、年龄;对象片段的动态维度的内容特征可以包括以下对象属性:对象片段中发声对象的情绪;例如以多媒体文件为视频文件为例,对于视频文件中的某个对象片段(例如对象片段A),首先将对象片段A划分成多个子片段,接着在播放对象片段A的每个子片段的过程中执行以下处理:基于与子片段具有的静态维度的内容特征(例如对象片段A中发声对象的性别)和动态维度的内容特征(例如发声对象在当前子片段中的情绪)适配的样式,显示与子片段关联的至少一条字幕,即与子片段关联的至少一条字幕应用的样式是与发声对象的性别、以及发声对象在当前子片段中的情绪适配的。
举例来说，以多媒体文件为视频文件为例，参见图4D，图4D是本申请实施例提供的多媒体文件的字幕处理方法的应用场景示意图，如图4D所示，子片段408和子片段409属于同一个对象片段的不同子片段，在子片段408中发声对象410的表情是悲伤的，而在子片段409中发声对象410的表情从悲伤变成开心，对应的，子片段408关联的字幕411（“不要离开我”）应用的样式与子片段409关联的字幕412（“很开心又见到你”）应用的样式是不同的，例如字幕411的字体为华文隶书、字号为小四，字号较小，字体风格偏严肃，与悲伤的情绪适配；而字幕412的字体为华文彩云、字号为四号，字号较大，字体风格偏喜庆，与开心的情绪适配，如此，针对同一个对象片段，字幕样式会随着发声对象情绪的变化而对应调整，从而能够准确且高效地实现字幕与视频文件的内容在视觉感知层面的协调。
示例的,对于情节片段,情节片段的静态维度的内容特征可以包括情节片段的情节类型,情节片段的动态维度的内容特征可以包括以下至少之一:情节片段中出现的不同场景的场景类型、情节片段中出现的不同发声对象的对象属性;例如以多媒体文件为视频文件为例,对于视频文件中的某个情节片段(例如情节片段B),首先将情节片段B划分成多个子片段,接着在播放情节片段B的每个子片段的过程中执行以下处理:基于与子片段具有的静态维度的内容特征(例如情节片段B的情节类型)和动态维度的内容特征(例如当前子片段中出现的场景类型)适配的样式,显示与子片段关联的至少一条字幕,即与子片段关联的至少一条字幕应用的样式是与情节片段B的情节类型、以及当前子片段中出现的场景类型适配的。
举例来说,以多媒体文件为视频文件为例,参见图4E,图4E是本申请实施例提供的多媒体文件的字幕处理方法的应用场景示意图,如图4E所示,子片段413和子片段414是属于同一个情节片段的不同子片段,在子片段413中出现的场景为家里,而在子片段414中出现的场景从家里切换为户外,对应的,子片段413关联的字幕415(“爸爸,去爬山好吗”)与子片段414关联的字幕416(“爸爸,等等我”)应用的样式是不同的,例如字幕415的字体为黑体,而字幕416的字体为华文琥珀,如此,针对同一个情节片段的不同子片段,字幕样式会随着不同子片段的动态维度的内容特征的变化而对应调整,从而用户能够根据字幕样式的变化更加容易理解视频内容。
示例的,对于场景片段,场景片段的静态维度的内容特征可以包括:场景片段涉及的场景的类型,场景片段的动态维度的内容特征可以包括以下至少之一:场景片段中出现的不同发声对象的对象属性,场景片段中出现的不同情节的类型;例如以多媒体文件为视频文件为例,对于视频文件中的某个场景片段(例如场景片段C),首先将场景片段C划分成多个子片段,接着在播放场景片段C的每个子片段的过程中执行以下处理:基于与子片段具有的静态维度的内容特征(例如场景片段C涉及的场景类型)和动态维度的内容特征(例如当前子片段中出现的情节的类型)适配的样式,显示与子片段关联的至少一条字幕,即与子片段关联的至少一条字幕应用的样式是与场景片段C涉及的场景类型、以及当前子片段中出现的情节的类型适配的。
举例来说,以多媒体文件为视频文件为例,参见图4F,图4F是本申请实施例提供的多媒体文件的字幕处理方法的应用场景示意图,如图4F所示,子片段417和子片段418是属于同一个场景片段的不同子片段,在子片段417中出现的情节的类型为发展阶段,而在子片段418中出现的情节的类型从发展阶段进入高潮阶段,对应的,子片段417关联的字幕419(“中古时代的建筑比较简陋”)与子片段418关联的字幕420(“复兴时代的建筑更加现代化”)应用的样式是不同的,例如字幕419的字体为华文行楷,而字幕420的字体为幼圆,如此,针对同一场景片段的不同子片段,字幕样式会随着不同子片段的动态维度的内容特征的变化而对应调整,从而用户能够根据字幕样式的变化更加容易理解视频内容。
下面对字幕样式的转换处理过程进行说明。
在一些实施例中，针对多媒体文件关联的多条字幕应用的样式相同的情况（即字幕样式在整个多媒体文件的播放过程中保持不变），可以在执行图3示出的步骤102之前，执行图5A示出的步骤103A和步骤104A，将结合图5A示出的步骤进行说明。
在步骤103A中,获取多媒体文件的至少一个维度的内容特征。
这里,多媒体文件的至少一个维度的内容特征可以包括:风格(例如对于视频文件,对应的风格的类型可以包括喜剧、恐怖、悬疑、卡通等;对于音频文件,对应的风格的类型可以包括流行、摇滚等)、对象(例如多媒体文件中出现的人物、动物等)、场景、情节、色调。
在一些实施例中,可以通过以下方式实现上述的步骤103A:调用内容特征识别模型对多媒体文件的内容进行内容特征识别处理,得到多媒体文件的至少一个维度的内容特征,其中,内容特征识别模型是基于样本多媒体文件、以及针对样本多媒体文件的内容标注的标签进行训练得到的。
示例的,内容特征识别模型可以是单独的风格识别模型、场景识别模型、情节识别模型和色调识别模型,也可以是组合模型(例如能够同时对多媒体文件的风格和场景进行识别的模型),内容特征识别模型可以是神经网络模型(例如卷积神经网络、深度卷积神经网络、或者全连接神经网络等)、决策树模型、梯度提升树、多层感知机、以及支持向量机等,本申请实施例对内容特征识别模型的类型不作具体限定。
在另一些实施例中，当多媒体文件为视频文件时，可以通过以下方式实现上述的步骤103A：针对视频文件中出现的目标对象执行以下处理：首先对目标对象所在的目标视频帧进行预处理，例如可以对目标视频帧进行裁剪，以将目标视频帧裁剪成设定的尺寸，或者对目标视频帧中的目标对象进行旋转，以使目标对象处于水平状态，从而方便后续的处理，此外，当获取到多张目标视频帧时，可以确定每张目标视频帧中包括的目标对象的清晰度（例如可以通过Sobel算子来确定清晰度，图像越模糊，边缘越不清晰，对应的Sobel算子响应值越小，其中，Sobel算子由两个3×3的卷积核构成，分别用于计算中心像素邻域的灰度加权差），并选取清晰度最高的目标视频帧执行后续的处理；接着对经过预处理后的目标视频帧进行特征提取，得到目标视频帧对应的图像特征，例如可以提取目标视频帧中用于描述图像纹理信息的小波（Gabor）特征，作为目标视频帧对应的图像特征；随后对图像特征进行降维处理，例如可以采用主成分分析方法提取图像特征的主成分特征分量，从而实现图像特征的降维（例如可以首先由目标视频帧的图像特征X计算得到矩阵XX^T，接着对矩阵XX^T作特征值分解，并保留最大的L个特征值所对应的特征向量，按列组成解码矩阵D，随后取解码矩阵D的转置得到编码矩阵，对图像特征X进行压缩，最后使用解码矩阵D重构图像的L个主成分特征分量，其中，X^T表示图像特征X的转置，L为大于或等于1的正整数）；最后通过训练好的分类器对经过降维处理后的图像特征进行分类处理，得到目标对象的对象属性，例如目标对象的性别。
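上文用于挑选最清晰目标视频帧的Sobel算子，在单个像素处的计算可示意如下（3×3灰度邻域；梯度幅值取|Gx|+|Gy|为简化假设）：

```python
# Sobel算子的两个3×3卷积核，分别响应水平方向与垂直方向的灰度跳变
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]

def sobel_magnitude(patch):
    """对3×3灰度邻域的中心像素计算Sobel梯度幅值：
    边缘越清晰（灰度跳变越大），返回值越大。"""
    gx = sum(SOBEL_X[i][j] * patch[i][j] for i in range(3) for j in range(3))
    gy = sum(SOBEL_Y[i][j] * patch[i][j] for i in range(3) for j in range(3))
    return abs(gx) + abs(gy)
```

例如，竖直边缘邻域[[0, 0, 10], [0, 0, 10], [0, 0, 10]]的幅值为40，而灰度均匀区域的幅值为0。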
在一些实施例中，当多媒体文件为视频文件时，还可以通过以下方式实现上述的步骤103A：针对视频文件中出现的目标对象执行以下处理：首先提取目标对象所在的目标视频帧对应的局部二值模式特征，并对局部二值模式特征进行降维处理，例如可以采用主成分分析方法对局部二值模式特征进行降维处理；接着提取目标视频帧对应的方向梯度直方图特征，并对方向梯度直方图特征进行降维处理，例如可以采用主成分分析方法对方向梯度直方图特征进行降维处理；随后对经过降维处理后的局部二值模式特征和方向梯度直方图特征进行典型相关分析处理（即分别在两组变量中提取有代表性的两个综合变量，利用这两个综合变量之间的相关关系来反映两组指标之间的整体相关性），得到分析结果，例如可以通过计算局部二值模式特征与方向梯度直方图特征之间的典型相关系数（一种用于度量两个随机向量间的线性关联程度大小的数量指标），来挖掘出局部二值模式特征与方向梯度直方图特征之间的相关性；最后对分析结果进行回归处理（包括线性回归和非线性回归，其中，线性回归是一种使用线性模型来建模单个输入变量和输出变量之间关系的技术，非线性回归是建模多个独立输入变量与输出变量之间的关系），得到目标对象的对象属性。例如以对象属性为年龄为例，可以通过线性模型计算出分析结果被映射到不同年龄分别对应的概率，将最大概率对应的年龄确定为目标对象的年龄。
在另一些实施例中,当多媒体文件为视频文件时,还可以通过以下方式实现上述的步骤103A:针对视频文件中出现的目标对象执行以下处理:首先对目标对象所在的目标视频帧进行规格化处理(即将不同目标视频帧的平均灰度和对比度调整到一个固定的级别,为后续处理提供一个较为统一的图像规格),并对经过规格化处理后的目标视频帧进行分区处理,得到多个子区域,例如可以将目标视频帧划分成多个矩形,每个矩形代表一个子区域;接着提取每个子区域对应的局部二值模式特征,并对多个局部二值模式特征进行统计处理,例如可以采用直方图统计方法对多个局部二值模式特征进行统计,得到目标视频帧对应的局部直方图统计特征;随后通过训练集局部特征库对局部直方图统计特征进行局部稀疏重构表示(即使用训练集局部特征库中的少量的局部二值模式特征的线性组合来表示局部直方图统计特征,从而可以减小特征的维度,进而减小后续计算的复杂度),并对局部稀疏重构表示结果进行局部重构残差加权识别处理(即通过构建加权矩阵,对局部稀疏重构表示结果进行加权处理,并利用残差对加权结果进行分类识别),得到目标对象的对象属性,例如目标对象的情绪。
在一些实施例中,承接上文,当视频文件中存在多个对象时,可以通过以下任意一种方式从多个对象中确定出目标对象:将视频文件中出现时间最长的对象确定为目标对象;将视频文件中符合用户偏好的对象(例如根据用户的历史观看记录确定出用户的用户特征数据,将与用户特征数据相似度最高的对象确定为符合用户偏好的对象)确定为目标对象;将视频文件中与用户互动相关的对象(例如用户曾经点赞或者转发过的对象)确定为目标对象。
此外,还需要说明的是,当多媒体文件为音频文件时,可以通过以下方式识别出音频文件中出现的目标对象的对象属性(例如性别、年龄、情绪等),例如可以根据声音的频率(女性发音的频率比较高,而男性发音的频率相对较低)来确定目标对象的性别;根据音调的高低(例如通常情况下小孩子的声带比较紧,因此音调较高,而随着年龄的增长,声带变得松弛,音调也会逐渐下降)来识别目标对象的年龄;根据说话的语速、音量等信息确定目标对象的情绪,例如当目标对象生气时,对应的音量会比较大、语速也相对较快。
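上述“根据声音频率区分性别”的判断可以用过零率（zero-crossing rate）作为主频高低的粗略代理来示意（阈值与判定规则均为示意性假设，实际系统通常采用基频估计等更可靠的方法）：

```python
def zero_crossing_rate(samples):
    """统计相邻采样点符号变化的比例：主频越高，过零率通常越大。"""
    crossings = sum(1 for a, b in zip(samples, samples[1:])
                    if (a < 0) != (b < 0))
    return crossings / (len(samples) - 1)
```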
在步骤104A中,基于多媒体文件的至少一个维度的内容特征,对多媒体文件关联的多条原始字幕进行样式转换处理,得到多条新字幕。
这里,多条新字幕(多条新字幕的样式可以是相同的,例如都是基于识别出的多媒体文件的风格对多媒体文件关联的多条原始字幕进行样式转换处理得到)用于作为在人机交互界面中待显示的多条字幕,即作为在步骤102中在人机交互界面中依次显示的多条字幕。
在一些实施例中,可以通过以下方式实现上述的步骤104A:基于多媒体文件的至少一个维度的内容特征对应的取值、以及多媒体文件关联的多条原始字幕调用字幕模型,得到多条新字幕,其中,字幕模型可以是作为生成模型,并与判别模型组成生成式对抗网络来进行训练得到的。
示例的,以多媒体文件为视频文件为例,在获取到视频文件的至少一个维度的内容特征(例如视频文件的风格,假设为喜剧)之后,可以基于视频文件的风格对应的取值、 以及多媒体文件关联的多条原始字幕(假设多条原始字幕的字体均为楷体)调用字幕模型,得到多条新字幕,例如假设样式转换处理后得到的多条新字幕的字体均是幼圆,字体风格与喜剧适配,是偏卡通的,即在视频文件的播放过程中,在人机交互界面中依次显示的多条字幕的字体均是幼圆。
在另一些实施例中,字幕模型也可以是通过其他方式训练得到的,例如可以对字幕模型进行单独训练,本申请实施例对字幕模型的训练方式不做具体限定。
此外,还需要说明的是,上述样式转换处理可以是针对图片格式的字幕的,例如可以将原始字体的图片(例如字幕内容的字体为楷体的图片)转换成与视频文件的风格适配的字体的图片,例如字幕内容的字体为华文彩云的图片;而对于文本格式的字幕,可以首先将文本格式的字幕转换成图片格式,再进行上述样式转换处理。
作为替换方案,可以直接针对文本格式的字幕进行样式转换处理,例如可以首先将原始样式的字幕的各种属性(例如字体、字号等)进行编码处理,得到对应的矩阵向量,接着对矩阵向量进行样式转换处理(例如可以将矩阵向量和视频文件的风格对应的取值输入字幕模型),得到新的矩阵向量(即与新样式的字幕对应的矩阵向量),随后基于新的矩阵向量进行解码处理,得到新样式的字幕(即与视频文件的风格适配的样式的字幕),最后使用新样式的字幕替换字幕文件中原始样式的字幕,并且文本格式的字幕更加利于保存和更新,例如修正文本错误。
在另一些实施例中,针对多媒体文件关联的多条字幕应用的样式不同的情况(即字幕样式在整个多媒体文件的播放过程中会发生变化),可以在执行图3示出的步骤102之前,执行图5B示出的步骤103B和步骤104B,将结合图5B示出的步骤进行说明。
在步骤103B中,获取多媒体文件中每个片段的至少一个维度的内容特征。
在一些实施例中,可以首先对多媒体文件进行划分(具体的划分过程可以参照上文的描述,本申请实施例在此不再赘述),得到多个片段,其中,每个片段关联有至少一条原始字幕,例如片段1关联有原始字幕1至原始字幕3、片段2关联有原始字幕4和原始字幕5,接着分别获取每个片段的至少一个维度的内容特征,片段的内容特征的获取方式与多媒体文件的内容特征的获取方式类似,可以参照上文多媒体文件的内容特征的获取方式实现,本申请实施例在此不再赘述。
在步骤104B中,针对每个片段执行以下处理:基于片段的至少一个维度的内容特征,对片段关联的至少一条原始字幕进行样式转换处理,得到至少一条新字幕。
这里,在对每个片段关联的至少一条原始字幕进行样式转换处理之后,可以将每个片段对应的至少一条新字幕组合得到多条新字幕,其中,多条新字幕的顺序是与多条原始字幕的顺序相同的,且多条新字幕作为在人机交互界面中待显示的多条字幕,即在步骤102中在人机交互界面中依次显示的多条字幕。
需要说明的是,对划分后得到的片段关联的至少一条原始字幕进行样式转换处理的过程,与对多媒体文件关联的多条原始字幕进行样式转换处理的过程是类似的,可以参考上述对多媒体文件关联的多条原始字幕进行样式转换处理的过程,本申请实施例在此不再赘述。
在一些实施例中,在获取到多媒体文件中的每个片段的至少一个维度的内容特征之后,可以针对每个片段执行以下处理:基于片段(例如片段A)的至少一个维度的内容特征对应的取值、以及片段A关联的至少一条原始字幕调用字幕模型,得到片段A关联的至少一条新字幕,接着还可以使用片段A关联的至少一条新字幕替换字幕文件中存储的片段A关联的至少一条原始字幕,如此,在后续多媒体文件的播放过程中,例如在播放至片段A时,可以从字幕文件中读取片段A关联的至少一条新字幕,并在人机交互界面中进行显示。
需要说明的是,以片段的内容特征为视频文件中出现的目标对象的情绪为例,在视频文件的播放过程中,目标对象的情绪可能会发生变化,即情绪属于动态维度的内容特征,在不同的片段中,目标对象的情绪可能是不同的,因此,在基于目标对象的情绪进行样式转换处理时,在经过样式转换处理后,不同片段关联的原始字幕经过转换处理得到的新字幕的样式可能是不同的,例如在片段1中目标对象的情绪为开心,经过样式转换处理得到的新字幕的字体为幼圆;在片段2中目标对象的情绪为悲伤,经过样式转换处理得到的新字幕的字体为华文彩云,也就是说,在视频文件的播放过程中,字幕样式会随着目标对象的情绪的变化而对应调整,从而准确且高效地实现了字幕与视频文件在视觉感知层面的协调。
此外,还需要说明的是,针对同一片段(例如片段A)关联的多条字幕应用的样式不同的情况,还可以对片段A进行再次划分,得到多个子片段,接着获取片段A中每个子片段的至少一个维度的内容特征,随后针对每个子片段执行以下处理:基于子片段的至少一个维度的内容特征对应的取值、以及子片段关联的至少一条原始字幕调用字幕模型,得到子片段关联的至少一条新字幕,如此,当不同子片段对应的至少一个维度的内容特征不同时,不同子片段关联的至少一条新字幕的样式也是不同的,从而实现在播放同一片段的过程中,字幕样式也会发生变化。
在一些实施例中,字幕应用的样式还可以是与片段的多个维度的内容特征经过融合处理后得到的融合内容特征适配的,则可以通过以下方式实现上述的基于与片段的至少一个维度的内容特征适配的样式,在人机交互界面中依次显示与片段关联的至少一条字幕:将片段的多个维度的内容特征进行融合处理,得到融合内容特征;基于融合内容特征对与片段关联的至少一条原始字幕进行样式转换处理,得到至少一条新字幕;其中,至少一条新字幕用于作为在人机交互界面中待显示的至少一条字幕。
示例的,以多媒体文件为视频文件为例,首先获取视频文件的多个维度的内容特征(例如视频文件的风格、视频文件的色调等),接着对视频文件的风格和视频文件的色调进行融合处理(例如对视频文件的风格对应的取值和视频文件的色调对应的取值进行求和),得到融合内容特征,随后基于融合内容特征对应的取值、以及片段关联的至少一条原始字幕调用字幕模型,得到至少一条新字幕,其中,新字幕应用的样式是同时与视频文件的风格、以及视频文件的色调适配的,如此,通过综合考虑视频文件的多个维度的内容特征,使得最终呈现的字幕能够与视频内容更加贴合,进一步提高了字幕和视频文件在视觉感知层面的协调性。
在另一些实施例中,字幕应用的样式还可以是同时与多媒体文件的内容、以及用户特征数据相关的,例如可以将用户(即观看者)对多媒体文件的情感进行量化,根据用户历史的观看记录确定出用户的用户特征数据,进而计算出用户对于当前多媒体文件的偏好程度,最后基于偏好程度和多媒体文件的至少一个维度的内容特征综合确定出字幕的样式,例如可以将偏好程度对应的取值和多媒体文件的至少一个维度的内容特征的取值进行融合处理(例如将两个取值进行相加),并基于融合处理得到的取值、以及多媒体文件关联的多条原始字幕调用字幕模型,得到多条新字幕,即多条新字幕的样式是同时与多媒体文件的内容、以及用户的用户特征数据适配的,也就是说,针对同一个多媒体文件,在不同用户的用户终端显示的字幕也可以是不同的,如此,通过综合考虑用户自身的因素、以及多媒体文件的内容特征,使得字幕与多媒体文件在视觉感知层面的协调性得到进一步的提高。
本申请实施例提供的多媒体文件的字幕处理方法，在播放多媒体文件的过程中，在人机交互界面中显示与多媒体文件的内容相关的样式的字幕，通过丰富字幕的表现形式来实现多媒体文件相关信息的多样化的展示效果，从而能够准确且高效地实现字幕与多媒体文件在视觉感知层面的协调。
下面,将说明本申请实施例在一个实际的视频文件播放场景中的示例性应用。
本申请实施例提供一种多媒体文件的字幕处理方法,可以针对视频文件的内容进行理解(例如挖掘视频文件中出现的人物的人物属性、视频文件的整体风格等),以实时生成相关样式的字幕,从而能够准确且高效地实现字幕与视频文件在视觉感知层面的协调。
本申请实施例提供的多媒体文件的字幕处理方法可以应用于各大视频网站的字幕生成,可根据视频文件的内容(包括视频文件的风格识别、以及视频文件中出现的人物的人物属性识别,例如识别人物的年龄、性别和情绪等),实时生成与识别出的视频文件的内容相关样式的字幕。
示例的,参见图6A至图6C,图6A至图6C是本申请实施例提供的多媒体文件的字幕处理方法的应用场景示意图,其中,图6A中示出的视频601的风格属于动画片,整体风格可爱卡通,因此,视频601关联的字幕602也是这种风格,此外,字幕602的颜色也可以与背景的主色调相适应,例如当背景为天空时,字幕602的颜色可以为蓝色;图6B中示出的视频603的风格属于喜剧片,整体风格偏搞笑,因此,视频603关联的字幕604也是偏卡通的,与视频603的风格适配;图6C中示出的视频605的风格属于英雄片,整体风格比较严肃,因此,视频605关联的字幕606的字体风格也是更加严肃、正经。也就是说,不同风格的视频对应的字幕的样式是不同的,且与视频的风格贴合度比较高,从而能够准确且高效地实现字幕与视频文件在视觉感知层面的协调。
本申请实施例提供的多媒体文件的字幕处理方法主要涉及两个部分:视频文件的内容理解、以及基于视频内容的理解结果实时生成相关样式的视频字幕,下面首先对视频内容的理解过程进行说明。
示例的,参见图7,图7是本申请实施例提供的视频内容维度示意图,如图7所示,本申请实施例针对视频内容的理解主要涉及以下几个维度:人物属性(包括视频中出现的人物的人物性别、年龄、情绪等)和视频风格(视频风格的类型可以包括卡通、喜剧、恐怖、悬疑等),下面首先对人物属性的识别过程进行说明。
(1)人物属性:
人物属性的识别包括人物性别的识别、人物年龄的识别和人物情绪的识别。
示例的,人物性别的识别可以采用(但不局限于)基于Adaboost和支持向量机(SVM,Support Vector Machine)的人脸性别分类算法,其中,Adaboost是一种迭代算法,其核心思想是针对同一个训练集训练不同的分类器(即弱分类器),然后把这些弱分类器集合起来,构成一个更强的最终分类器(即强分类器)。如图8所示,基于Adaboost+SVM的人脸性别分类算法主要分为两个阶段:(a)训练阶段:首先对训练集进行预处理,接着对经过预处理后的训练集进行Gabor滤波,得到经过预处理后的训练集的小波(Gabor)特征,随后基于经过预处理后的训练集的小波(Gabor)特征对Adaboost分类器进行训练,最后基于通过Adaboost分类器降维处理后的特征对SVM分类器进行训练;(b)测试阶段:首先对测试集进行预处理,接着对经过预处理后的测试集进行Gabor滤波,得到经过预处理后的测试集的小波(Gabor)特征,随后通过训练后的Adaboost分类器进行降维处理,最后基于降维后的特征调用训练好的SVM分类器进行识别处理,输出识别结果(即人物的性别)。
示例的，人物的年龄估计可以采用（但不局限于）融合局部二值化模式（LBP，Local Binary Patterns）和方向梯度直方图（HOG，Histogram of Oriented Gradient）特征的人脸年龄估计算法，其中，LBP是一种用来描述图像局部纹理特征的算子，具有旋转不变性和灰度不变性等显著的优点，HOG是一种在计算机视觉和图像处理中用来进行物体检测的特征描述子，可以通过计算和统计图像局部区域的梯度方向直方图得到。如图9所示，人脸年龄估计算法主要包括以下两个阶段：（a）训练阶段：首先提取出训练样本集中与年龄变化关系紧密的人脸的局部统计特征（例如LBP特征和HOG特征），接着对所提取的特征进行降维处理，例如可以使用主成分分析（PCA，Principal Component Analysis）的方法分别对所提取的LBP特征和HOG特征进行降维处理，随后使用典型相关分析（CCA，Canonical Correlation Analysis）的方法将两个降维后的特征进行融合，最后基于融合结果对支持向量机回归（SVR，Support Vector Regression）模型进行训练，其中，SVR模型是一种回归算法模型，其在线性函数两侧制造了一个“间隔带”，对于所有落入到间隔带内的样本，都不计算损失；只有间隔带之外的，才计入损失函数，之后再通过最小化间隔带的宽度与总损失来最优化模型；（b）测试阶段：首先提取出测试样本集的LBP特征和HOG特征，接着使用PCA的方法分别对所提取的LBP特征和HOG特征进行降维处理，随后使用CCA的方法将两个降维后的特征进行融合，最后基于融合结果调用训练好的SVR模型进行年龄回归处理，输出估计年龄结果。
示例的,人物情绪的识别可以采用(但不局限于)融合LBP特征和局部稀疏表示的人脸表情识别算法,如图10所示,该算法的步骤包括以下两个阶段:(a)训练阶段:首先对训练集中的人脸图像进行规格化处理,接着对规格化后的人脸图像进行人脸分区处理,随后对于分区处理后得到的每个人脸子区域计算该区域的LBP特征,并采用局部直方图统计方法整合该区域的特征向量,形成由特定人脸的局部特征组成的训练集局部特征库;(b)测试阶段:对于测试集中的人脸图像,同样进行人脸图像规格化、人脸分区、LBP特征计算和局部直方图统计操作;最后,对于测试集中的人脸图像的局部直方图统计特征,利用训练集局部特征库进行局部稀疏重构表示,并采用局部稀疏重构残差加权方法进行最终人脸表情的分类识别,输出识别结果。
需要说明的是,训练阶段可以是离线处理的,而测试阶段可以是在线处理的。
(2)视频风格
视频风格可以采用卷积神经网络(CNN,Convolutional Neural Networks)模型进行识别,其中,训练数据可以来自视频网站提供的视频文件,以及风格分类标签(一般由运营人员进行标识),如图11所示,将视频中连续的L(L为大于1的正整数)帧图像输入到训练后的卷积神经网络模型中,在经过卷积(Convolution)、池化(Pooling)、以及N个密集块(例如4个密集块,分别为密集块1至密集块4,其中,密集块可以由多个卷积块组成,且每块使用相同的输出通道数)处理得到每帧图像对应的特征图之后,采用Gram矩阵计算两两特征图(例如经过卷积处理后的特征图)之间的相关性来代表视频的风格信息,随后将Gram矩阵输出的相关性结果进行全连接处理(例如进行2次全连接处理),最后将全连接结果输入到回归函数(例如Softmax函数)上,输出不同风格分别对应的概率,将最大概率对应的风格确定为视频的风格。
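上述用Gram矩阵度量两两特征图相关性的计算可示意如下（以嵌套列表表示特征图的纯Python示意）：

```python
def gram_matrix(feature_maps):
    """对若干二维特征图先展平，再两两求内积：
    第(i, j)项度量特征图i与特征图j的共激活程度，用于刻画风格信息。"""
    flat = [[v for row in fm for v in row] for fm in feature_maps]
    n = len(flat)
    return [[sum(a * b for a, b in zip(flat[i], flat[j])) for j in range(n)]
            for i in range(n)]
```

例如两张互补的2×2特征图[[1, 0], [0, 1]]与[[0, 1], [1, 0]]，其Gram矩阵为[[2, 0], [0, 2]]，对角线为各特征图自身的能量，非对角线为两图的相关性。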
下面对字幕的生成过程进行说明。
字幕的生成可以采用生成式对抗网络(GAN,Generative Adversarial Networks)模型实现,其中,GAN中包含两个模型:生成模型(Generative Model)和判别模型(Discriminative Model),通过生成模型和判别模型相互对抗来达到最后的生成结果。
示例的,参见图12,图12是本申请实施例提供的生成式对抗网络模型的训练原理示意图,具体的算法流程如下:
(1)将原始字体图片x和转换的目标域c（目标域c与视频内容理解出的维度对应）结合输入到生成模型来生成假的字体图片x′，即基于原始字体图片x和目标域c对应的取值，生成与理解出的视频内容维度适配的字体图片，即x′=G(x,c)，其中，G为生成模型；
(2)将假的字体图片x′和原始字体图片x分别输入到判别模型，判别模型需要判断输入的字体图片是否真实，还需要判断字体图片来自哪个域；
(3)将生成的假的字体图片x′和原始字体图片x对应的域信息（即源域c′）结合起来输入到生成模型，要求能够重建出原始字体图片x。
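上述(1)至(3)步的一次前向过程可以用带桩（stub）模型的函数示意如下（G、D及域的编码方式均为示意性假设，重建损失以L1距离为例）：

```python
def gan_step(G, D, x, c, c_src):
    """按上文三步执行一次前向：生成目标域c的假字体图片、
    交由判别模型打分、再结合源域c_src重建原图并计算L1重建损失。"""
    fake = G(x, c)        # (1) x' = G(x, c)
    realness = D(fake)    # (2) 判别模型对假图片打分
    recon = G(fake, c_src)  # (3) 结合源域重建原始图片
    recon_loss = sum(abs(a - b) for a, b in zip(x, recon))
    return realness, recon_loss
```

例如用桩函数G = lambda x, c: [v + c for v in x]、D = sum验证：gan_step(G, D, [1, 2], 1, -1)返回(5, 0)，重建损失为0表示原图被完整重建。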
需要说明的是,如果原始的字幕是文本形式,例如srt、ass等类型的字幕文件,则可以首先将文本格式的字幕转换为图片格式,然后再进行上述处理。
本申请实施例提供的根据视频内容实时生成字幕的方案,具有以下有益效果:
(1)字幕样式与视频内容贴合度高,不突兀;
(2)字幕样式更加符合视频内容、或者视频中出现的角色的角色特征,更有沉浸感;
(3)字幕样式由电子设备(例如终端设备或者服务器)自动生成,不需要购买字幕库版权,节省了版权成本。
下面继续说明本申请实施例提供的多媒体文件的字幕处理装置465的实施为软件模块的示例性结构,在一些实施例中,如图2所示,存储在存储器460的多媒体文件的字幕处理装置465中的软件模块可以包括:播放模块4651和显示模块4652。
播放模块4651,配置为响应于播放触发操作,播放多媒体文件,其中,多媒体文件关联有多条字幕,多媒体文件的类型包括视频文件和音频文件;显示模块4652,配置为在播放多媒体文件的过程中,在人机交互界面中依次显示多条字幕,其中,多条字幕应用的样式与多媒体文件的内容相关。
在一些实施例中,显示模块4652,还配置为在人机交互界面中依次显示均应用有样式的多条字幕;其中,样式与多媒体文件的至少一个维度的内容特征适配,且至少一个维度的内容特征包括:风格、对象、场景、情节、色调。
在一些实施例中,多媒体文件的字幕处理装置465还包括获取模块4653,配置为获取多媒体文件的至少一个维度的内容特征;多媒体文件的字幕处理装置465还包括转换模块4654,配置为基于至少一个维度的内容特征,对多媒体文件关联的多条原始字幕进行样式转换处理,得到多条新字幕;其中,多条新字幕用于作为在人机交互界面中待显示的多条字幕。
在一些实施例中,转换模块4654,还配置为基于至少一个维度的内容特征对应的取值、以及多媒体文件关联的多条原始字幕调用字幕模型,得到多条新字幕;其中,字幕模型是作为生成模型,并与判别模型组成生成式对抗网络来进行训练得到的。
在一些实施例中,多媒体文件包括多个片段,片段的类型包括以下至少之一:对象片段、场景片段、情节片段;显示模块4652,还配置为在播放多媒体文件的每个片段的过程中执行以下处理:基于与片段的至少一个维度的内容特征适配的样式,在人机交互界面中依次显示与片段关联的至少一条字幕。
在一些实施例中,获取模块4653,还配置为获取片段的静态维度的内容特征,其中,对象片段的静态维度的内容特征包括对象片段中发声对象的以下对象属性至少之一:角色类型、性别、年龄;场景片段的静态维度的特征包括场景片段的场景类型;情节片段的静态维度的特征包括情节片段的情节进度;显示模块4652,还配置为基于与片段的静态维度的内容特征适配的样式,在人机交互界面中同步显示与片段关联的至少一条字幕,其中,样式在片段的播放过程中保持不变。
在一些实施例中，片段包括多个子片段，多个子片段具有片段的静态维度的内容特征、以及片段的动态维度的内容特征，且不同子片段具有的动态维度的内容特征不同；显示模块4652，还配置为在播放片段的每个子片段的过程中执行以下处理：基于与子片段具有的静态维度的内容特征和动态维度的内容特征适配的样式，显示与子片段关联的至少一条字幕。
在一些实施例中,对象片段的静态维度的内容特征包括以下对象属性至少之一:对象片段中发声对象的角色类型、性别、年龄;对象片段的动态维度的内容特征包括以下对象属性:对象片段中发声对象的情绪;情节片段的静态维度的内容特征包括情节片段的情节类型,情节片段的动态维度的内容特征包括以下至少之一:情节片段出现的不同场景的场景类型、情节片段出现的不同发声对象的对象属性;场景片段的静态维度的内容特征包括:场景片段涉及的场景的类型;场景片段的动态维度的内容特征包括以下至少之一:场景片段中出现的不同发声对象的对象属性,场景片段中出现的不同情节的类型。
在一些实施例中,当至少一个维度为多个维度时,多媒体文件的字幕处理装置465还包括融合模块4655,配置为将片段的多个维度的内容特征进行融合处理,得到融合内容特征;转换模块4654,还配置为基于融合内容特征对与片段关联的至少一条原始字幕进行样式转换处理,得到至少一条新字幕;其中,至少一条新字幕用于作为在人机交互界面中待显示的至少一条字幕。
在一些实施例中,获取模块4653,还配置为调用内容特征识别模型对多媒体文件的内容进行内容特征识别处理,得到多媒体文件的至少一个维度的内容特征;其中,内容特征识别模型是基于样本多媒体文件、以及针对样本多媒体文件的内容标注的标签进行训练得到的。
在一些实施例中,当多媒体文件为视频文件时,获取模块4653,还配置为针对视频文件中出现的目标对象执行以下处理:对目标对象所在的目标视频帧进行预处理;对经过预处理后的目标视频帧进行特征提取,得到目标视频帧对应的图像特征;对图像特征进行降维处理,并通过训练好的分类器对经过降维处理后的图像特征进行分类处理,得到目标对象的对象属性。
在一些实施例中，当多媒体文件为视频文件时，获取模块4653，还配置为针对视频文件中出现的目标对象执行以下处理：提取目标对象所在的目标视频帧对应的局部二值模式特征，并对局部二值模式特征进行降维处理；提取目标视频帧对应的方向梯度直方图特征，并对方向梯度直方图特征进行降维处理；对经过降维处理后的局部二值模式特征和方向梯度直方图特征进行典型相关分析处理，得到分析结果；对分析结果进行回归处理，得到目标对象的对象属性。
在一些实施例中,当多媒体文件为视频文件时,获取模块4653,还配置为针对视频文件中出现的目标对象执行以下处理:对目标对象所在的目标视频帧进行规格化处理,并对经过规格化处理后的目标视频帧进行分区处理,得到多个子区域;提取每个子区域对应的局部二值模式特征,并对多个局部二值模式特征进行统计处理,得到目标视频帧对应的局部直方图统计特征;通过训练集局部特征库对局部直方图统计特征进行局部稀疏重构表示,并对局部稀疏重构表示结果进行局部重构残差加权识别处理,得到目标对象的对象属性。
在一些实施例中,当视频文件中出现多个对象时,多媒体文件的字幕处理装置465还包括确定模块4656,配置为通过以下任意一种方式从多个对象中确定目标对象:将视频文件中出现时间最长的对象确定为目标对象;将视频文件中符合用户偏好的对象确定为目标对象;将视频文件中与用户互动相关的对象确定为目标对象。
需要说明的是,本申请实施例中关于装置的描述,与上文中多媒体文件的字幕处理方法的实现是类似的,并具有相似的有益效果,因此不做赘述。对于本申请实施例提供的多媒体文件的字幕处理装置中未尽的技术细节,可以根据图3、图5A、或图5B任一附图的说明而理解。
本申请实施例提供了一种计算机程序产品或计算机程序,该计算机程序产品或计算机程序包括计算机指令,该计算机指令存储在计算机可读存储介质中。计算机设备的处理器从计算机可读存储介质读取该计算机指令,处理器执行该计算机指令,使得该计算机设备执行本申请实施例上述的多媒体文件的字幕处理方法。
本申请实施例提供一种存储有可执行指令的计算机可读存储介质,其中存储有可执行指令,当可执行指令被处理器执行时,将引起处理器执行本申请实施例提供的多媒体文件的字幕处理方法,例如,如图3、图5A、或图5B示出的多媒体文件的字幕处理方法。
在一些实施例中,计算机可读存储介质可以是FRAM、ROM、PROM、EPROM、EEPROM、闪存、磁表面存储器、光盘、或CD-ROM等存储器;也可以是包括上述存储器之一或任意组合的各种设备。
在一些实施例中,可执行指令可以采用程序、软件、软件模块、脚本或代码的形式,按任意形式的编程语言(包括编译或解释语言,或者声明性或过程性语言)来编写,并且其可按任意形式部署,包括被部署为独立的程序或者被部署为模块、组件、子例程或者适合在计算环境中使用的其它单元。
作为示例，可执行指令可以但不一定对应于文件系统中的文件，可以被存储在保存其它程序或数据的文件的一部分中，例如，存储在超文本标记语言（HTML，Hyper Text Markup Language）文档中的一个或多个脚本中，存储在专用于所讨论的程序的单个文件中，或者，存储在多个协同文件（例如，存储一个或多个模块、子程序或代码部分的文件）中。
作为示例,可执行指令可被部署为在一个计算设备上执行,或者在位于一个地点的多个计算设备上执行,又或者,在分布在多个地点且通过通信网络互连的多个计算设备上执行。
综上所述,本申请实施例在播放多媒体文件的过程中,在人机交互界面中显示与多媒体文件的内容相关的样式的字幕,通过丰富字幕的表现形式来实现多媒体文件相关信息的多样化的展示效果,能够适用于多媒体文件的不同应用场景的多样化的字幕展示需求,同时提高了信息传播的效果和用户的观看体验。
以上所述,仅为本申请的实施例而已,并非用于限定本申请的保护范围。凡在本申请的精神和范围之内所作的任何修改、等同替换和改进等,均包含在本申请的保护范围之内。

Claims (18)

  1. 一种多媒体文件的字幕处理方法,所述方法由电子设备执行,所述方法包括:
    响应于播放触发操作,播放多媒体文件,其中,所述多媒体文件关联有多条字幕,所述多媒体文件的类型包括视频文件和音频文件,以及
    在播放所述多媒体文件的过程中,在人机交互界面中依次显示所述多条字幕,其中,所述多条字幕应用的样式与所述多媒体文件的内容相关。
  2. 根据权利要求1所述的方法,其中,所述在人机交互界面中依次显示所述多条字幕,包括:
    在人机交互界面中依次显示均应用有所述样式的所述多条字幕,其中,所述样式与所述多媒体文件的至少一个维度的内容特征适配,且所述至少一个维度的内容特征包括:风格、对象、场景、情节、色调。
  3. 根据权利要求2所述的方法,其中,在人机交互界面中依次显示所述多条字幕之前,还包括:
    获取所述多媒体文件的至少一个维度的内容特征;
    基于所述至少一个维度的内容特征,对所述多媒体文件关联的多条原始字幕进行样式转换处理,得到多条新字幕,其中,所述多条新字幕用于作为在所述人机交互界面中待显示的所述多条字幕。
  4. 根据权利要求3所述的方法,其中,所述基于所述至少一个维度的内容特征,对所述多媒体文件关联的多条原始字幕进行样式转换处理,得到多条新字幕,包括:
    基于所述至少一个维度的内容特征对应的取值、以及所述多媒体文件关联的多条原始字幕调用字幕模型,得到多条新字幕;
    其中,所述字幕模型是作为生成模型,并与判别模型组成生成式对抗网络来进行训练得到的。
  5. 根据权利要求1所述的方法,其中,
    所述多媒体文件包括多个片段,所述片段的类型包括以下至少之一:对象片段、场景片段、情节片段;
    所述在人机交互界面中依次显示所述多条字幕,包括:
    在播放所述多媒体文件的每个所述片段的过程中执行以下处理:
    基于与所述片段的至少一个维度的内容特征适配的样式,在人机交互界面中依次显示与所述片段关联的至少一条字幕。
  6. 根据权利要求5所述的方法,其中,所述基于与所述片段的至少一个维度的内容特征适配的样式,在人机交互界面中依次显示与所述片段关联的至少一条字幕,包括:
    获取所述片段的静态维度的内容特征,其中,所述对象片段的静态维度的内容特征包括所述对象片段中发声对象的以下对象属性至少之一:角色类型、性别、年龄;所述场景片段的静态维度的特征包括所述场景片段的场景类型;所述情节片段的静态维度的特征包括所述情节片段的情节进度;
    基于与所述片段的静态维度的内容特征适配的样式,在人机交互界面中同步显示与所述片段关联的至少一条字幕,其中,所述样式在所述片段的播放过程中保持不变。
  7. 根据权利要求5所述的方法,其中,
    所述片段包括多个子片段,所述多个子片段具有所述片段的静态维度的内容特征、以及所述片段的动态维度的内容特征,且不同子片段具有的动态维度的内容特征不同;
    所述基于与所述片段的至少一个维度的内容特征适配的样式,在人机交互界面中依次显示与所述片段关联的至少一条字幕,包括:
    在播放所述片段的每个子片段的过程中执行以下处理:
    基于与所述子片段具有的静态维度的内容特征和动态维度的内容特征适配的样式，显示与所述子片段关联的至少一条字幕。
  8. 根据权利要求7所述的方法,其中,
    所述对象片段的静态维度的内容特征包括以下对象属性至少之一:所述对象片段中发声对象的角色类型、性别、年龄;所述对象片段的动态维度的内容特征包括以下对象属性:所述对象片段中发声对象的情绪;
    所述情节片段的静态维度的内容特征包括所述情节片段的情节类型,所述情节片段的动态维度的内容特征包括以下至少之一:所述情节片段出现的不同场景的场景类型、所述情节片段出现的不同发声对象的对象属性;
    所述场景片段的静态维度的内容特征包括:所述场景片段涉及的场景的类型;所述场景片段的动态维度的内容特征包括以下至少之一:所述场景片段中出现的不同发声对象的对象属性,所述场景片段中出现的不同情节的类型。
  9. 根据权利要求5所述的方法,其中,
    当所述至少一个维度为多个维度时,所述基于与所述片段的至少一个维度的内容特征适配的样式,在人机交互界面中依次显示与所述片段关联的至少一条字幕,包括:
    将所述片段的多个维度的内容特征进行融合处理,得到融合内容特征;
    基于所述融合内容特征对与所述片段关联的至少一条原始字幕进行样式转换处理,得到至少一条新字幕,其中,所述至少一条新字幕用于作为在所述人机交互界面中待显示的所述至少一条字幕。
  10. 根据权利要求3所述的方法,其中,所述获取所述多媒体文件的至少一个维度的内容特征,包括:
    调用内容特征识别模型对所述多媒体文件的内容进行内容特征识别处理,得到所述多媒体文件的至少一个维度的内容特征;
    其中,所述内容特征识别模型是基于样本多媒体文件、以及针对所述样本多媒体文件的内容标注的标签进行训练得到的。
  11. 根据权利要求3所述的方法,其中,
    当所述多媒体文件为所述视频文件时,所述获取所述多媒体文件的至少一个维度的内容特征,包括:
    针对所述视频文件中出现的目标对象执行以下处理:
    对所述目标对象所在的目标视频帧进行预处理;
    对经过预处理后的所述目标视频帧进行特征提取,得到所述目标视频帧对应的图像特征;
    对所述图像特征进行降维处理,并通过训练好的分类器对经过降维处理后的所述图像特征进行分类处理,得到所述目标对象的对象属性。
  12. 根据权利要求3所述的方法,其中,
    当所述多媒体文件为所述视频文件时,所述获取所述多媒体文件的至少一个维度的内容特征,包括:
    针对所述视频文件中出现的目标对象执行以下处理:
    提取所述目标对象所在的目标视频帧对应的局部二值模式特征,并对所述局部二值模式特征进行降维处理;
    提取所述目标视频帧对应的方向梯度直方图特征,并对所述方向梯度直方图特征进行降维处理;
    对经过降维处理后的所述局部二值模式特征和所述方向梯度直方图特征进行典型相关分析处理，得到分析结果；
    对所述分析结果进行回归处理,得到所述目标对象的对象属性。
  13. 根据权利要求3所述的方法,其中,
    当所述多媒体文件为所述视频文件时,所述获取所述多媒体文件的至少一个维度的内容特征,包括:
    针对所述视频文件中出现的目标对象执行以下处理:
    对所述目标对象所在的目标视频帧进行规格化处理,并对经过规格化处理后的所述目标视频帧进行分区处理,得到多个子区域;
    提取每个所述子区域对应的局部二值模式特征,并对多个所述局部二值模式特征进行统计处理,得到所述目标视频帧对应的局部直方图统计特征;
    通过训练集局部特征库对所述局部直方图统计特征进行局部稀疏重构表示,并对局部稀疏重构表示结果进行局部重构残差加权识别处理,得到所述目标对象的对象属性。
  14. 根据权利要求11-13任一项所述的方法,其中,
    当所述视频文件中出现多个对象时,通过以下任意一种方式从所述多个对象中确定所述目标对象:
    将所述视频文件中出现时间最长的对象确定为所述目标对象;
    将所述视频文件中符合用户偏好的对象确定为所述目标对象;
    将所述视频文件中与用户互动相关的对象确定为所述目标对象。
  15. 一种多媒体文件的字幕处理装置,所述装置包括:
    播放模块,配置为响应于播放触发操作,播放多媒体文件,其中,所述多媒体文件关联有多条字幕,所述多媒体文件的类型包括视频文件和音频文件;
    显示模块,配置为在播放所述多媒体文件的过程中,在人机交互界面中依次显示所述多条字幕,其中,所述多条字幕应用的样式与所述多媒体文件的内容相关。
  16. 一种电子设备,所述电子设备包括:
    存储器,用于存储可执行指令;
    处理器,用于执行所述存储器中存储的可执行指令时,实现权利要求1至14任一项所述的多媒体文件的字幕处理方法。
  17. 一种计算机可读存储介质,存储有可执行指令,所述可执行指令被处理器执行时实现权利要求1至14任一项所述的多媒体文件的字幕处理方法。
  18. 一种计算机程序产品,包括计算机程序或指令,所述计算机程序或指令被处理器执行时实现权利要求1至14任一项所述的多媒体文件的字幕处理方法。
PCT/CN2022/113257 2021-09-23 2022-08-18 多媒体文件的字幕处理方法、装置、电子设备、计算机可读存储介质及计算机程序产品 WO2023045635A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US18/320,302 US20230291978A1 (en) 2021-09-23 2023-05-19 Subtitle processing method and apparatus of multimedia file, electronic device, and computer-readable storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111114803.6 2021-09-23
CN202111114803.6A CN114286154A (zh) 2021-09-23 2021-09-23 多媒体文件的字幕处理方法、装置、电子设备及存储介质

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/320,302 Continuation US20230291978A1 (en) 2021-09-23 2023-05-19 Subtitle processing method and apparatus of multimedia file, electronic device, and computer-readable storage medium

Publications (1)

Publication Number Publication Date
WO2023045635A1 true WO2023045635A1 (zh) 2023-03-30

Family

ID=80868555

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/113257 WO2023045635A1 (zh) 2021-09-23 2022-08-18 多媒体文件的字幕处理方法、装置、电子设备、计算机可读存储介质及计算机程序产品

Country Status (3)

Country Link
US (1) US20230291978A1 (zh)
CN (1) CN114286154A (zh)
WO (1) WO2023045635A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114286154A (zh) * 2021-09-23 2022-04-05 Tencent Technology (Shenzhen) Co., Ltd. Subtitle processing method and apparatus for multimedia file, electronic device, and storage medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101534377A (zh) * 2008-03-13 2009-09-16 ALi Corporation Method and system for automatically changing subtitle settings according to program content
US20110221873A1 (en) * 2010-03-12 2011-09-15 Mark Kenneth Eyer Extended Command Stream for Closed Caption Disparity
CN103200463A (zh) * 2013-03-27 2013-07-10 TVMining (Beijing) Media Technology Co., Ltd. Video summary generation method and device
CN108055592A (zh) * 2017-11-21 2018-05-18 Guangzhou Shiyuan Electronics Co., Ltd. Subtitle display method and apparatus, mobile terminal, and storage medium
CN108833990A (zh) * 2018-06-29 2018-11-16 Beijing Youku Technology Co., Ltd. Video subtitle display method and apparatus
CN110198468A (zh) * 2019-05-15 2019-09-03 Beijing QIYI Century Science & Technology Co., Ltd. Video subtitle display method and apparatus, and electronic device
CN111277910A (zh) * 2020-03-07 2020-06-12 MIGU Interactive Entertainment Co., Ltd. Bullet-screen comment display method and apparatus, electronic device, and storage medium
CN114286154A (zh) * 2021-09-23 2022-04-05 Tencent Technology (Shenzhen) Co., Ltd. Subtitle processing method and apparatus for multimedia file, electronic device, and storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9645985B2 (en) * 2013-03-15 2017-05-09 Cyberlink Corp. Systems and methods for customizing text in media content
KR101856192B1 (ko) * 2016-10-06 2018-06-20 Kakao Corp. Subtitle generation system, subtitle generation method, and content creation program
CN108924599A (zh) * 2018-06-29 2018-11-30 Beijing Youku Technology Co., Ltd. Video subtitle display method and apparatus
CN112084841B (zh) * 2020-07-27 2023-08-04 Qilu University of Technology Cross-modal multi-style image caption generation method and system


Also Published As

Publication number Publication date
CN114286154A (zh) 2022-04-05
US20230291978A1 (en) 2023-09-14

Similar Documents

Publication Publication Date Title
WO2021238631A1 (zh) Article information display method and apparatus, device, and readable storage medium
US20230012732A1 (en) Video data processing method and apparatus, device, and medium
US10242265B2 (en) Actor/person centric auto thumbnail
CN110446063A (zh) Video cover generation method and apparatus, and electronic device
CN104735468A (zh) Method and system for synthesizing images into a new video based on semantic analysis
CN109218629A (zh) Video generation method, storage medium, and apparatus
CN111783712A (zh) Video processing method, apparatus, device, and medium
CN111739027A (zh) Image processing method, apparatus, device, and readable storage medium
KR20220000758A (ko) Image detection device and operation method thereof
CN113761253A (zh) Video tag determination method, apparatus, device, and storage medium
CN117372570A (zh) Advertisement image generation method and apparatus
WO2023045635A1 (zh) Subtitle processing method and apparatus for multimedia file, electronic device, computer-readable storage medium, and computer program product
CN114359775A (zh) Key frame detection method, apparatus, device, storage medium, and program product
CN111488813A (zh) Video emotion annotation method and apparatus, electronic device, and storage medium
US11581020B1 (en) Facial synchronization utilizing deferred neural rendering
CN110636322A (zh) Multimedia data processing method and apparatus, intelligent terminal, and storage medium
CN112822539A (zh) Information display method and apparatus, server, and storage medium
US20220375223A1 (en) Information generation method and apparatus
CN115474088A (zh) Video processing method, computer device, and storage medium
CN112188116A (zh) Object-based video synthesis method, client, and system
CN111818364A (zh) Video fusion method, system, device, and medium
CN116665083A (zh) Video classification method and apparatus, electronic device, and storage medium
CN115935049A (zh) Artificial-intelligence-based recommendation processing method and apparatus, and electronic device
CN113762056A (zh) Singing video recognition method, apparatus, device, and storage medium
CN115909390A (zh) Vulgar content recognition method, apparatus, computer device, and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22871693

Country of ref document: EP

Kind code of ref document: A1