WO2020024353A1 - Video playback method and apparatus, terminal device, and storage medium - Google Patents

Video playback method and apparatus, terminal device, and storage medium Download PDF

Info

Publication number
WO2020024353A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
timestamp
audio
subtitle text
subtitle
Prior art date
Application number
PCT/CN2018/104047
Other languages
English (en)
French (fr)
Inventor
彭捷
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 filed Critical 平安科技(深圳)有限公司
Publication of WO2020024353A1 publication Critical patent/WO2020024353A1/zh

Links

Images

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/439Processing of audio elementary streams
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/26Speech to text systems
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/431Generation of visual interfaces for content selection or interaction; Content or additional data rendering
    • H04N21/4312Generation of visual interfaces for content selection or interaction; Content or additional data rendering involving specific graphical features, e.g. screen layout, special fonts or colors, blinking icons, highlights or animations
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/432Content retrieval operation from a local storage medium, e.g. hard-disk

Definitions

  • the present application relates to the field of multimedia, and in particular, to a video playing method, device, terminal device, and storage medium.
  • the embodiments of the present application provide a video playing method, device, terminal device, and storage medium, so as to efficiently and conveniently output subtitle text of a video, and also accurately retrieve the video through the subtitle text.
  • this application case provides a video playback method, including:
  • the subtitle text includes multiple timestamps corresponding to the playback time of the audio;
  • the multiple timestamps include the target timestamp
  • an example of the present application provides a video playback device, including:
  • An extraction module for extracting audio from a video and generating an audio file
  • a conversion module configured to convert the audio file into a file stream, and convert the file stream into subtitle text through speech recognition; the subtitle text includes multiple timestamps corresponding to the playback time of the audio;
  • a caption display module configured to display the caption text on the video playback interface according to the timestamp
  • a query module configured to receive a query instruction including a keyword, and query a target timestamp corresponding to the keyword in the subtitle text; the multiple timestamps include the target timestamp;
  • a playing module configured to play the audio and the video according to the target timestamp.
  • an example of the present application provides a terminal device including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor; when the processor executes the computer-readable instructions, the following steps are implemented:
  • the subtitle text includes multiple timestamps corresponding to the playback time of the audio;
  • the multiple timestamps include the target timestamp
  • the example of the present application provides one or more non-volatile readable storage media storing computer-readable instructions; when the computer-readable instructions are executed by one or more processors, the one or more processors perform the following steps:
  • the subtitle text includes multiple timestamps corresponding to the playback time of the audio;
  • the multiple timestamps include the target timestamp
  • FIG. 1 is a schematic diagram of an application environment of a video playing method according to an embodiment of the present application
  • FIG. 2 is a flowchart of a video playing method according to an embodiment of the present application.
  • FIG. 3 is a flowchart of step S20 of a video playing method in an embodiment of the present application.
  • FIG. 4 is a flowchart of step S203 of a video playing method in an embodiment of the present application.
  • FIG. 5 is a flowchart of step S40 of a video playing method in an embodiment of the present application.
  • FIG. 6 is a flowchart of step S50 of a video playing method in an embodiment of the present application.
  • FIG. 7 is a block diagram of a video playback device in an embodiment of the present application.
  • FIG. 8 is a block diagram of a conversion module of a video playback device in an embodiment of the present application.
  • FIG. 9 is a block diagram of a query module of a video playback device in an embodiment of the present application.
  • FIG. 10 is a schematic diagram of a terminal device in an embodiment of the present application.
  • the video playback method provided in this application can be applied in the application environment shown in FIG. 1, where a client (terminal device) communicates with a server through a network.
  • the clients include, but are not limited to, various personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices.
  • the server can be implemented by an independent server or a server cluster composed of multiple servers.
  • a video playback method is provided.
  • the method is applied to the server in FIG. 1 as an example, and includes the following steps:
  • S10 Extract audio from the video and generate audio files.
  • the server calls FFmpeg (Fast Forward MPEG), an open-source suite of computer-readable instructions that can record and convert digital audio and video and convert them into streams, to extract the audio from the video so that the audio and video are separated; the generated audio file includes, but is not limited to, the WAV format.
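The extraction step S10 can be sketched as an FFmpeg invocation; this is an illustrative command assembled under the assumption that the FFmpeg binary is installed and on the PATH, and the file names are placeholders, not part of the patent:

```python
# Sketch of step S10: strip the video stream and keep the audio as a WAV file.
import subprocess

def build_extract_command(video_path, audio_path):
    """Assemble an FFmpeg command that extracts the audio track."""
    return [
        "ffmpeg",
        "-i", video_path,        # input video
        "-vn",                   # drop the video stream
        "-acodec", "pcm_s16le",  # 16-bit PCM, the usual WAV codec
        "-y",                    # overwrite the output if it exists
        audio_path,
    ]

def extract_audio(video_path, audio_path):
    """Run the command; raises CalledProcessError if FFmpeg fails."""
    subprocess.run(build_extract_command(video_path, audio_path), check=True)
```

Separating the audio this way keeps the original video untouched while producing a standalone file for the speech recognition step.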
  • the audio file is converted into a file stream, and the file stream is converted into a subtitle text through speech recognition.
  • the subtitle text includes multiple timestamps corresponding to a playback time of the audio.
  • an audio file is first converted into a file stream (a file stream is also referred to as a character stream or a byte stream), and then the file stream is speech-recognized to convert the file stream into subtitle text.
  • the subtitle text can be divided into multiple text contents according to a preset rule, for example, by word, phrase, sentence, or paragraph, and a timestamp can be inserted at the division interval between text contents (that is, before or after each text content) to locate the time coordinate of each text content.
  • the preset rule is: each sentence is divided into a text content.
  • each sentence represents a text content, and punctuation marks can be used as the criterion for a "sentence"; for example, a period marks the separation of a sentence. The timestamp before each sentence represents the start time of the audio corresponding to the sentence, and the timestamp after each sentence represents the end time of the audio corresponding to the sentence.
  • the audio playback time points (such as the start time and end time of the audio) corresponding to all text content are located on the time axis of the audio extracted from the video, and each of the timestamps is at The audio time axis corresponds to an audio playback time point at the same time, and the audio playback time point corresponds to the text content associated with the time stamp.
  • when a blank audio segment lies between two text contents, the timestamp after the previous text content may be inserted at the end of that content (corresponding to the start of the blank audio segment), and the timestamp before the next text content may be inserted at the beginning of that content (corresponding to the end of the blank audio segment). In this case, the audio playback times corresponding to the two timestamps are not the same. Understandably, a timestamp may also be inserted only before or after each text content, and not necessarily both. The timestamp can also be set at another position in a text content, as long as it is associated with that content; in this case the timestamp is preferably set equal to the start playback time of the audio corresponding to the text content.
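The sentence-division rule and timestamp insertion described above can be sketched as follows; the bracketed timestamp format and the word timings are illustrative assumptions, not the patent's actual on-disk format:

```python
# Sketch of timestamp insertion: divide recognized text into sentences and
# wrap each text content with start/end timestamps from the audio time axis.
import re

def split_sentences(raw_text):
    """Divide subtitle text into text contents, one per sentence,
    using sentence-ending punctuation as the separator."""
    return [s.strip() for s in re.split(r"(?<=[.!?。])\s*", raw_text) if s.strip()]

def insert_timestamps(sentences):
    """sentences: list of (text, start_seconds, end_seconds).
    Returns subtitle text with a timestamp before and after each sentence."""
    lines = []
    for text, start, end in sentences:
        lines.append(f"[{start:.2f}] {text} [{end:.2f}]")
    return "\n".join(lines)
```

Each bracketed value corresponds to a point on the audio time axis, so a text content can later be mapped back to its playback position.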
  • the step S30 includes:
  • Acquiring the subtitle text, and acquiring a correspondence between the timestamp and the time axis of the video. That is, the video includes a time axis corresponding to the time axis of the audio, and the audio and the video are played synchronously by aligning the two time axes. Therefore, when the subtitle text is obtained, since the timestamps in the subtitle text correspond to the time axis of the audio, the subtitle text can also be aligned to the time axis of the video based on the timestamps, so that the subtitle text is displayed synchronously while the video is playing. The subtitle text is then displayed as Chinese subtitles at a first preset position of the playback interface of the video according to the correspondence between the timestamps and the time axis of the video.
  • the subtitle text may be displayed as Chinese subtitles at a first preset position of the video playback interface, and the first preset position may be above, below, or other specific positions of the video playback interface.
  • the form in which the Chinese subtitles are displayed on the playback interface can be set according to requirements, for example, font color, font size, font shape, shadow, bold, brightness, etc. can be set.
  • step S30 further includes:
  • the subtitle text can be translated, by calling an open-source translation interface, into languages other than Chinese, such as English, Japanese, or Korean.
  • Displaying the foreign-language subtitles at a second preset position of the playback interface of the video; that is, the translated languages other than Chinese may be displayed as foreign-language subtitles at a second preset position of the playback interface of the video.
  • the second preset position may be above, below, or other specific positions of the playback interface of the video.
  • the form in which the foreign language subtitles are displayed on the playback interface can be set according to requirements, for example, font color, font size, font shape, shadow, bold, brightness, etc. can be set.
  • the Chinese subtitles and the foreign-language subtitles can be displayed simultaneously; multiple open-source translation interfaces can also be invoked at the same time to translate the subtitle text into multiple foreign-language subtitles, and the Chinese subtitles and multiple foreign-language subtitles can then be displayed together. It is also possible to display only the foreign-language subtitles; that is, the selection of subtitle types can be modified according to user needs. Similarly, as above, the first preset position may be the same as or different from the second preset position.
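The display step S30 boils down to answering "which text content is active at the current playback time?" — a lookup over the sorted timestamps. The sketch below assumes a cue list of (start, end, text) tuples, which is an illustrative structure, not the patent's:

```python
# Sketch of step S30: find the subtitle line active at playback time t by
# binary-searching the sorted start timestamps of the cues.
import bisect

def current_subtitle(cues, t):
    """cues: list of (start, end, text) sorted by start time.
    Returns the text active at time t, or None if none is active."""
    starts = [c[0] for c in cues]
    i = bisect.bisect_right(starts, t) - 1  # last cue starting at or before t
    if i >= 0 and cues[i][0] <= t < cues[i][1]:
        return cues[i][2]
    return None
```

Because audio and video share one time axis, the same lookup serves both streams; the renderer only needs to re-query when the result changes.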
  • S40 Receive a query instruction including a keyword, and query a target timestamp corresponding to the keyword in the subtitle text; the multiple timestamps include the target timestamp.
  • the user can query the keyword in the subtitle text, and after the keyword is found, one or more text contents containing the keyword (each text content includes at least one keyword) and the target timestamp associated with each text content are displayed on the query interface. Understandably, there is a one-to-one correspondence between the target timestamp (included in the multiple timestamps), the time axis of the audio, and the time axis of the video.
  • the user can select the text content associated with the target timestamp on the query interface, and the target timestamp is both the audio playback time of the audio and the video playback time of the video; the audio playback time is then found on the audio time axis to start playing the audio, and the video playback time is found on the video time axis to start playing the video.
  • the video playback method of this embodiment converts the audio in a video into subtitle text through speech recognition and inserts timestamps for positioning in the subtitle text, so that when the playback position of the video needs to be retrieved, only the keyword in the subtitle text and its corresponding target timestamp need to be retrieved to accurately locate the playback position on the time axis of the video, which greatly improves the analysis and utilization of the video; the video retrieval of the present application is accurate in positioning and can efficiently and quickly output and display the subtitle text at the corresponding position on the video, greatly improving the user experience. This application can be used in scenarios such as court-trial video processing and training-video retrieval.
  • the step S20, that is, converting the audio file into a file stream and converting the file stream into subtitle text through speech recognition, wherein the subtitle text contains multiple timestamps corresponding to the audio playback time, includes the following steps:
  • S202. Convert the file stream into the subtitle text through the speech recognition interface; that is, in this step, the file stream converted from the audio file is transmitted to the speech recognition interface, and after the speech recognition interface performs speech recognition on the file stream, the file stream is converted into subtitle text.
  • the step S202 includes: causing the speech recognition interface to decode the file stream using acoustic characteristics, the relationship between words in context, and the mapping relationship between text and pronunciation, and obtaining the subtitle text generated after the file stream is decoded; wherein the acoustic characteristics include the transfer relationship between pronunciations and the relationship between pronunciation and sonic features. That is, after the file stream is decoded, subtitle text corresponding to the audio file is generated.
  • the above acoustic characteristics, the relationship between words in context, and the mapping relationship between text and pronunciation can be captured by establishing mathematical models over the various parameters and improving the models through continuous training. For example, an acoustic model can be built from the transfer relationship between pronunciations and the relationship between pronunciation and sonic characteristics; a language model can be built from the relationship of each word in context; and a dictionary model can be built from the mapping relationship between text and pronunciation. Thereafter, the established acoustic model, language model, and dictionary model are trained, and the file stream is decoded through the trained models to convert it into subtitle text.
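The interplay of the three models can be illustrated with a deliberately tiny sketch. The phoneme strings, dictionary entries, and bigram probabilities below are invented stand-ins for trained acoustic, dictionary, and language models, not the patent's actual implementation:

```python
# Toy decoder: for a phoneme sequence, choose the word maximizing
# acoustic score (dictionary pronunciation match) times language-model
# bigram probability. All tables are illustrative stand-ins.
import math

DICTIONARY = {            # text -> pronunciation (dictionary model)
    "hello": "HH AH L OW",
    "low":   "L OW",
}
BIGRAM = {                # previous word -> {word: probability} (language model)
    "<s>": {"hello": 0.8, "low": 0.2},
}

def acoustic_score(phonemes, pronunciation):
    """Stand-in acoustic model: 1.0 on an exact match, else 0.0."""
    return 1.0 if phonemes == pronunciation else 0.0

def decode_word(phonemes, prev_word="<s>"):
    """Return the word with the highest combined score."""
    best, best_score = None, -math.inf
    for word, pron in DICTIONARY.items():
        score = acoustic_score(phonemes, pron) * BIGRAM.get(prev_word, {}).get(word, 1e-6)
        if score > best_score:
            best, best_score = word, score
    return best
```

A real recognizer searches over whole lattices of hypotheses rather than single words, but the scoring principle — acoustic evidence weighted by linguistic context via a pronunciation dictionary — is the same.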
  • A timestamp is the time stamp corresponding to the audio playback time, and each text content is associated with at least one timestamp. Understandably, a timestamp can be inserted before and after each text content (the timestamp can also be set at another position in the text content, as long as it is associated with the content); preferably, a timestamp inserted before each text content represents the start playback time of the audio corresponding to that content, and a timestamp inserted after each text content represents the end time of the audio corresponding to that content.
  • the step S203 includes the following steps:
  • the subtitle text is divided into multiple text contents according to the preset rule.
  • the preset rule includes, but is not limited to, dividing the subtitle text by word, phrase, sentence, or paragraph.
  • the preset rule is: each paragraph is divided into a text content. In this case, each paragraph represents a text content, and a carriage return can be used as the criterion for the separation of a paragraph; a timestamp is then inserted before and after each paragraph (or elsewhere) and associated with the text content.
  • the timestamp inserted in the subtitle text corresponds to the playback time of the audio; understandably, the timestamp can also be set at a position in each text content other than before and after, as long as it can be associated with the text content. In this case, the timestamp is preferably set equal to the start playback time of the audio corresponding to the text content, and the timestamp is the playback time of the piece of audio corresponding to the text content.
  • the audio playback time of the audio corresponds to the video playback time of the video, so the same video playing time can also be found on the time axis of the video via the timestamp, to start playing the video corresponding to the text content.
  • the step S40, that is, receiving a query instruction including a keyword and querying the target timestamp corresponding to the keyword in the subtitle text, includes the following steps:
  • Receive a query instruction including the keyword, which is input by a user through voice input or through an input box on a query interface; that is, the user can enter a keyword in the input box of the query interface on the client and click a preset button (such as a search button), and the keyword is sent to the server along with the query instruction. Understandably, the user can also input the keyword by voice through the client's voice input device associated with the query interface.
  • the server displays the keyword on the query interface and allows the user to confirm, modify, or re-enter the keyword.
  • the query instruction containing the keyword is sent to the server.
  • S402. Retrieve the subtitle text containing timestamps from a database, and query the keyword in the subtitle text; that is, the database stores the subtitle text containing multiple text contents and the multiple timestamps associated with those text contents, and when a query instruction is received, the keyword is queried in each text content of the subtitle text.
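Step S402 can be sketched as a scan over the timestamped text contents; the (timestamp, text) pair structure and the sample lines are illustrative assumptions:

```python
# Sketch of step S402: scan the stored text contents and return every one
# containing the keyword together with its associated target timestamp.
def query_keyword(subtitle, keyword):
    """subtitle: list of (timestamp_seconds, text_content).
    Returns [(target_timestamp, text_content), ...] for every match."""
    return [(ts, text) for ts, text in subtitle if keyword in text]
```

The returned target timestamps are exactly the values later used in step S50 to seek the audio and video streams.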
  • each audio segment corresponding to each text content associated with each target timestamp can also be captured (or the first half of that audio segment), and the audio segments are displayed in an audio list.
  • the audio list may be displayed on the query interface in synchronization with the text content list; and when the display item in the text content list is clicked, the audio corresponding to the text content to which the item belongs The segment is also selected and displayed synchronously, and the audio segment can also automatically jump to a prominent position (or other position) in the middle of the audio list display interface.
  • when an audio segment in the audio list is clicked, not only does the audio segment start to play, but the corresponding text content in the text content list is also displayed and selected synchronously.
  • the above display mode allows the user to select and confirm the object to be queried among multiple text contents.
  • each video segment corresponding to each text content associated with each target timestamp can also be captured (or the first half of that video segment, or the video frame at the time of the target timestamp), and the video segments are displayed in a video list.
  • the video list may be displayed on the query interface in synchronization with the text content list and/or the audio list. Understandably, when only the text content list and the video list are displayed, the display manner may be the same as when only the text content list and the audio list are displayed, and details are not repeated here.
  • the video list, the text content list, and the audio list are simultaneously displayed on the query interface.
  • when a display item in the text content list is clicked, the audio segment and video segment corresponding to the text content to which the item belongs are selected for synchronous display, and the audio segment and video segment can also automatically jump to a prominent position (or another position) in the middle of the display interfaces of the audio list and video list. When an audio segment or a video segment in the audio list or the video list is clicked, not only do the audio segment and the video segment start to play simultaneously, but the corresponding text content in the text content list is also selected and displayed synchronously. The above display modes allow the user to select and confirm the object to be queried among multiple text contents.
  • the step S50 that is, playing the audio and the video according to the target timestamp, includes the following steps:
  • S501 Receive a playback instruction including a current playback time; the current playback time is the same as the time of the target timestamp;
  • when the user selects the text content associated with the target timestamp, a playback instruction including the current playback time may be sent to the server, and the current playback time included in the playback instruction (the audio playback time of the audio and the video playback time of the video) is the same as the time of the target timestamp; understandably, the user selects the text content associated with the target timestamp from the above text content list (the operation can be set to differ from the "click" operation in step S403 above).
  • when the target timestamp is not the start time of the audio corresponding to the text content, then when the text content is selected (in the text content list, video list, or audio list), the start playback time of the audio corresponding to the text content is set as the current playback time; in this case, the current playback time is not equal to the time of the target timestamp.
  • the audio can be retrieved from the database according to the current playback time, and the audio starts playing from the point on the audio time axis corresponding to the current playback time; at the same time, the video is retrieved from the database, and the video starts playing from the point on the video time axis corresponding to the current playback time.
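Step S50 can be sketched as follows; the Player class is a hypothetical stand-in for real audio/video players, and the fallback-to-content-start rule mirrors the case described above where the target timestamp is not the content's start time:

```python
# Sketch of step S50: resolve the current playback time for a selected text
# content and seek both streams to it. Audio and video share one time axis,
# so both players seek to the same point.
class Player:
    """Hypothetical stand-in for an audio or video player."""
    def __init__(self):
        self.position = 0.0

    def seek(self, t):
        self.position = t

def play_from_target(target_timestamp, content_start, audio_player, video_player):
    """If the target timestamp marks the content's start, play from it;
    otherwise fall back to the content's start time."""
    current = target_timestamp if target_timestamp == content_start else content_start
    audio_player.seek(current)
    video_player.seek(current)
    return current
```

Seeking both players to one resolved time is what keeps the audio, the video, and the displayed subtitle line in lockstep after a jump.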
  • a video playback device is provided, and the video playback device corresponds one-to-one to the video playback method in the above embodiment.
  • the video playback device includes an extraction module 110, a conversion module 120, a subtitle display module 130, a query module 140, and a playback module 150.
  • the detailed description of each function module is as follows:
  • An extraction module 110 for extracting audio from a video and generating an audio file
  • a conversion module 120 configured to convert the audio file into a file stream, and convert the file stream into subtitle text through speech recognition; the subtitle text includes multiple timestamps corresponding to the playback time of the audio;
  • a caption display module 130 configured to display the caption text on a playback interface of the video according to the timestamp
  • the query module 140 is configured to receive a query instruction including a keyword, and query a target timestamp corresponding to the keyword in the subtitle text; the multiple timestamps include the target timestamp;
  • the playing module 150 is configured to play the audio and the video according to the target timestamp.
  • the video playback device of this embodiment converts the audio in a video into subtitle text through speech recognition and inserts timestamps for positioning in the subtitle text, so that when the playback position of the video needs to be retrieved, only the keyword in the subtitle text and its corresponding target timestamp need to be retrieved to accurately locate the playback position on the time axis of the video, which greatly improves the analysis and utilization of the video; the video retrieval of the present application is accurate in positioning and can efficiently and quickly output and display the subtitle text at the corresponding position on the video, greatly improving the user experience. This application can be used in scenarios such as court-trial video processing and training-video retrieval.
  • the conversion module 120 includes:
  • a first conversion submodule 121 configured to convert the audio file into the file stream
  • a second conversion submodule 122 configured to convert the file stream into the subtitle text through the voice recognition interface
  • An inserting sub-module 123 is configured to insert a timestamp in the subtitle text according to a preset rule, and associate the inserted timestamp with the text content before the timestamp or after the timestamp.
  • the second conversion sub-module 122 is further configured to cause the speech recognition interface to decode the file stream through acoustic characteristics, a relationship between each word in a context, and a mapping relationship between text and pronunciation, and obtain The subtitle text generated after the file stream is decoded; wherein the acoustic characteristics include a transfer relationship between each pronunciation, and a relationship between the pronunciation and a sonic feature.
  • the inserting sub-module 123 is further configured to divide the subtitle text into multiple text contents according to the preset rule, wherein the preset rule includes dividing the subtitle text by word, phrase, sentence, or paragraph; to insert, before and/or after each text content, a timestamp associated with that content, and to associate the timestamp with the text content before or after it, wherein the timestamp inserted in the subtitle text corresponds to the playback time of the audio; and to store the subtitle text containing the timestamps in a database.
  • the query module 140 includes:
  • a receiving sub-module 141 configured to receive a query instruction including the keyword, which is input by a user through a voice on a query interface or through an input box;
  • a calling sub-module 142 configured to retrieve the subtitle text including a timestamp from a database, and query the keywords in the subtitle text;
  • a display submodule 143 configured to obtain all text content including the keyword in the subtitle text, and display all the text content and the target timestamp associated with each of the text content on the query interface on.
  • Each module in the video playback device can be implemented in whole or in part by software, hardware, and a combination thereof.
  • Each of the above modules may be embedded in the processor in the form of hardware or independent of the processor in the terminal device, or may be stored in the memory of the terminal device in the form of software to facilitate the processor to call and execute the operations corresponding to the above modules.
  • a terminal device (i.e., a computer device) is provided.
  • the terminal device may be a server, and the internal structure diagram may be as shown in FIG. 10.
  • the terminal device includes a processor, a memory, a network interface, and a database connected through a system bus.
  • the processor of the terminal device is used to provide computing and control capabilities.
  • the memory of the terminal device includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium stores an operating system, computer-readable instructions, and a database.
  • the internal memory provides an environment for operating the operating system and computer-readable instructions in a non-volatile storage medium.
  • the network interface of the terminal device is used to communicate with external terminals through a network connection.
  • the computer-readable instructions are executed by a processor to implement a video playing method.
  • a terminal device which includes a memory, a processor, and computer-readable instructions stored on the memory and executable on the processor.
  • the processor executes the computer-readable instructions, the following steps are implemented:
  • the terminal device in this embodiment converts the audio in a video into subtitle text through speech recognition and inserts timestamps for positioning in the subtitle text, so that when the playback position of the video needs to be retrieved, only the keyword in the subtitle text and its corresponding target timestamp need to be retrieved to accurately locate the playback position on the time axis of the video, which greatly improves the analysis and utilization of the video;
  • the video retrieval of this application is accurate in positioning, and the subtitle text can be output and displayed at the corresponding position on the video efficiently and quickly, greatly improving the user experience.
  • This application can be used in court trial video processing, training video retrieval and other scenarios.
  • one or more non-volatile readable storage media storing computer-readable instructions are provided; when the computer-readable instructions are executed by one or more processors, the one or more processors perform the following steps:
  • the computer-readable storage medium of this embodiment converts the audio in a video into subtitle text through speech recognition and inserts timestamps for positioning in the subtitle text, so that when the playback position of the video needs to be retrieved, only the keyword in the subtitle text and its corresponding target timestamp need to be retrieved to accurately locate the playback position on the time axis of the video, which greatly improves the analysis and utilization of the video.
  • the video retrieval positioning of this application is accurate, and the subtitle text can be output and displayed at the corresponding position on the video efficiently and quickly, which greatly improves the user experience.
  • This application can be used in court trial video processing, training video retrieval and other scenarios.
  • Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory can include random access memory (RAM) or external cache memory.
  • RAM is available in various forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), dual data rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), synchronous chain Synchlink DRAM (SLDRAM), memory bus (Rambus) direct RAM (RDRAM), direct memory bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM).

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Databases & Information Systems (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

This application discloses a video playback method, apparatus, terminal device, and storage medium. The method includes: extracting audio from a video and generating an audio file; converting the audio file into a file stream, and converting the file stream into subtitle text through speech recognition, where the subtitle text contains multiple timestamps corresponding to the playback time of the audio; displaying the subtitle text on the playback interface of the video according to the timestamps; receiving a query instruction containing a keyword, and looking up, in the subtitle text, a target timestamp corresponding to the keyword, where the multiple timestamps include the target timestamp; and playing the audio and the video according to the target timestamp. This application can efficiently and quickly output and display the subtitle text at the corresponding position on the video, and can accurately locate the playback position on the video's timeline, greatly improving the user experience.

Description

Video playback method, apparatus, terminal device, and storage medium
This application is based on, and claims priority to, Chinese invention patent application No. 201810861877.8, filed on August 1, 2018 and titled "Video playback method, apparatus, terminal device, and storage medium".
Technical Field
This application relates to the field of multimedia, and in particular to a video playback method, apparatus, terminal device, and storage medium.
Background
With the rapid development of multimedia technology, users can watch all kinds of videos on various playback terminals. At present, converting the speech in a video into subtitles is usually done by stenographers and subtitlers; that is, the subtitles of most videos are generated by manual transcription, which is inefficient and cumbersome. Meanwhile, people record videos in many scenarios, and when watching a video they may want to preview and seek in order to quickly browse the video and locate the content they are interested in. Currently, users mainly locate a playback position by manually dragging the progress bar, a process that is complicated, inefficient, and inaccurate, resulting in a poor user experience.
Summary
Embodiments of this application provide a video playback method, apparatus, terminal device, and storage medium, so as to efficiently and conveniently output the subtitle text of a video while also accurately searching the video through the subtitle text.
In a first aspect, this application provides a video playback method, including:
extracting audio from a video and generating an audio file;
converting the audio file into a file stream, and converting the file stream into subtitle text through speech recognition, where the subtitle text contains multiple timestamps corresponding to the playback time of the audio;
displaying the subtitle text on the playback interface of the video according to the timestamps;
receiving a query instruction containing a keyword, and looking up, in the subtitle text, a target timestamp corresponding to the keyword, where the multiple timestamps include the target timestamp;
playing the audio and the video according to the target timestamp.
In a second aspect, this application provides a video playback apparatus, including:
an extraction module, configured to extract audio from a video and generate an audio file;
a conversion module, configured to convert the audio file into a file stream and convert the file stream into subtitle text through speech recognition, where the subtitle text contains multiple timestamps corresponding to the playback time of the audio;
a subtitle display module, configured to display the subtitle text on the playback interface of the video according to the timestamps;
a query module, configured to receive a query instruction containing a keyword and look up, in the subtitle text, a target timestamp corresponding to the keyword, where the multiple timestamps include the target timestamp;
a playback module, configured to play the audio and the video according to the target timestamp.
In a third aspect, this application provides a terminal device, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, where the processor implements the following steps when executing the computer-readable instructions:
extracting audio from a video and generating an audio file;
converting the audio file into a file stream, and converting the file stream into subtitle text through speech recognition, where the subtitle text contains multiple timestamps corresponding to the playback time of the audio;
displaying the subtitle text on the playback interface of the video according to the timestamps;
receiving a query instruction containing a keyword, and looking up, in the subtitle text, a target timestamp corresponding to the keyword, where the multiple timestamps include the target timestamp;
playing the audio and the video according to the target timestamp.
In a fourth aspect, this application provides one or more non-volatile readable storage media storing computer-readable instructions, where the computer-readable instructions, when executed by one or more processors, cause the one or more processors to perform the following steps:
extracting audio from a video and generating an audio file;
converting the audio file into a file stream, and converting the file stream into subtitle text through speech recognition, where the subtitle text contains multiple timestamps corresponding to the playback time of the audio;
displaying the subtitle text on the playback interface of the video according to the timestamps;
receiving a query instruction containing a keyword, and looking up, in the subtitle text, a target timestamp corresponding to the keyword, where the multiple timestamps include the target timestamp;
playing the audio and the video according to the target timestamp.
Details of one or more embodiments of this application are set forth in the accompanying drawings and the description below; other features and advantages of this application will become apparent from the specification, the drawings, and the claims.
Brief Description of the Drawings
To describe the technical solutions of the embodiments of this application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of this application, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic diagram of the application environment of the video playback method in an embodiment of this application;
FIG. 2 is a flowchart of the video playback method in an embodiment of this application;
FIG. 3 is a flowchart of step S20 of the video playback method in an embodiment of this application;
FIG. 4 is a flowchart of step S203 of the video playback method in an embodiment of this application;
FIG. 5 is a flowchart of step S40 of the video playback method in an embodiment of this application;
FIG. 6 is a flowchart of step S50 of the video playback method in an embodiment of this application;
FIG. 7 is a block diagram of the video playback apparatus in an embodiment of this application;
FIG. 8 is a block diagram of the conversion module of the video playback apparatus in an embodiment of this application;
FIG. 9 is a block diagram of the query module of the video playback apparatus in an embodiment of this application;
FIG. 10 is a schematic diagram of the terminal device in an embodiment of this application.
Detailed Description
The technical solutions in the embodiments of this application will be described clearly and completely below with reference to the drawings in the embodiments of this application. Obviously, the described embodiments are only some, not all, of the embodiments of this application. Based on the embodiments in this application, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of this application.
The video playback method provided by this application can be applied in the application environment shown in FIG. 1, in which a client (terminal device) communicates with a server over a network. The client includes, but is not limited to, various personal computers, laptops, smartphones, tablets, and portable wearable devices. The server can be implemented as a standalone server or as a server cluster composed of multiple servers.
In one embodiment, as shown in FIG. 2, a video playback method is provided. Taking the method applied to the server in FIG. 1 as an example, it includes the following steps:
S10: extract audio from a video and generate an audio file. In one embodiment, the server extracts the audio from the video by invoking FFmpeg (Fast Forward MPEG, an open-source suite of computer-readable instructions that can record and convert digital audio and video and turn them into streams) commands, thereby separating the audio from the video; the generated audio file includes, but is not limited to, the wav format.
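The FFmpeg invocation described in step S10 can be sketched as follows; a minimal Python helper, assuming a standard `ffmpeg` binary on the PATH and hypothetical file names (`input.mp4`, `output.wav` are illustrative, not from the patent):

```python
def build_extract_cmd(video_path: str, audio_path: str) -> list:
    """Build an ffmpeg command that drops the video stream and writes
    mono 16 kHz PCM audio, a common input format for speech recognition."""
    return [
        "ffmpeg",
        "-i", video_path,        # input video file
        "-vn",                   # discard the video stream, keep audio only
        "-acodec", "pcm_s16le",  # 16-bit PCM, i.e. plain wav audio
        "-ar", "16000",          # 16 kHz sample rate
        "-ac", "1",              # single (mono) channel
        audio_path,
    ]

cmd = build_extract_cmd("input.mp4", "output.wav")
# The command could then be executed with subprocess.run(cmd, check=True).
```

The helper only constructs the argument list, so it can be tested without FFmpeg installed; the actual separation of audio from video happens when the command is run.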
S20: convert the audio file into a file stream, and convert the file stream into subtitle text through speech recognition; the subtitle text contains multiple timestamps corresponding to the playback time of the audio.
In this embodiment, the audio file is first converted into a file stream (also called a character stream or byte stream), and speech recognition is then performed on the file stream to convert it into subtitle text. The subtitle text can be divided into multiple text contents according to a preset rule, for example by character, word, sentence, or paragraph, and timestamps can be inserted at the division boundaries between the text contents (that is, before or after each text content) to mark the time coordinate of each text content. For example, after the audio file has been converted into subtitle text, the preset rule may be: divide each sentence into one text content. In this case, a timestamp can be inserted both before and after each sentence (each sentence represents one text content, and punctuation can define what counts as a "sentence"; for example, a full stop marks the end of one sentence), where the timestamp before a sentence represents the playback start time of the audio corresponding to that sentence, and the timestamp after it represents the playback end time. Understandably, the audio playback time points corresponding to all text contents (such as the playback start and end times of the audio) lie on the timeline of the audio extracted from the video; each timestamp corresponds to an audio playback time point of the same time on the audio timeline, and that time point corresponds to the text content associated with the timestamp.
Further, for two adjacent text contents, a single audio playback time point may be chosen after the former and before the latter for inserting a timestamp (this scheme applies when there is no blank audio segment between the two adjacent text contents; understandably, it can also be used when a blank segment exists, in which case one time point within the blank audio simply needs to be chosen). Since the timestamp inserted after the former text content and the timestamp placed before the latter refer to the same time point, the two timestamps then correspond to the same audio playback time. In another aspect of this embodiment, when a blank audio segment exists between two adjacent text contents, the timestamp after the former text content may be inserted at its end (corresponding to the very beginning of the blank segment), while the timestamp before the latter text content is inserted at its very beginning (corresponding to the end of the blank segment); in this case the two timestamps correspond to different audio playback times. Understandably, a timestamp may also be inserted only before or only after each text content, rather than necessarily both. Likewise, a timestamp may be placed at other positions within a text content, as long as it is associated with that text content; in that case, the timestamp is preferably set to the playback start time of the audio corresponding to that text content.
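The sentence-level timestamping scheme above can be sketched in a few lines; a simplified Python model, assuming the speech recognition step already yields each sentence together with its start and end time (real engines typically return word- or utterance-level offsets, and the bracketed marker format here is purely illustrative):

```python
def insert_timestamps(recognized):
    """recognized: list of (sentence, start_sec, end_sec) tuples.
    Return the subtitle text with a timestamp marker inserted before and
    after each sentence (text content), as the embodiment describes."""
    parts = []
    for sentence, start, end in recognized:
        # marker before = playback start time, marker after = end time
        parts.append(f"[{start:.1f}]{sentence}[{end:.1f}]")
    return "".join(parts)

subtitle = insert_timestamps([
    ("第一句话。", 0.0, 2.5),
    ("第二句话。", 2.5, 5.0),  # adjacent sentences share one time point
])
# subtitle == "[0.0]第一句话。[2.5][2.5]第二句话。[5.0]"
```

Note how the end marker of the first sentence and the start marker of the second refer to the same time point (2.5 s), matching the case where no blank audio separates the two text contents.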
S30: display the subtitle text on the playback interface of the video according to the timestamps.
In one aspect of this embodiment, step S30 includes:
obtaining the subtitle text, and obtaining the correspondence between the timestamps and the timeline of the video. That is, the video has a timeline corresponding to the timeline of the audio; by aligning the timelines of the audio and the video, the audio and the video are played synchronously. Therefore, when the subtitle text is obtained, since its timestamps correspond to the audio timeline, the subtitle text can likewise be aligned with the video timeline according to the timestamps, so that it is displayed synchronously while the video plays. According to the correspondence between the timestamps and the video timeline, the subtitle text is displayed as Chinese subtitles at a first preset position on the playback interface of the video. That is, the subtitle text can be displayed as Chinese subtitles at a first preset position, which may be above, below, or at another specific position on the playback interface; and the form in which the Chinese subtitles appear on the playback interface can be set as required, for example font color, font size, font shape, shadow, bold, brightness, and so on.
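Aligning the timestamped subtitle text with the video's timeline, as described above, amounts to looking up which text content covers the current playback time; a minimal sketch using the standard-library `bisect` module (the cue list and its times are illustrative assumptions):

```python
import bisect

def subtitle_at(cues, t):
    """cues: list of (start_sec, end_sec, text), sorted by start_sec.
    Return the subtitle text to display at playback time t, or None."""
    starts = [c[0] for c in cues]
    i = bisect.bisect_right(starts, t) - 1  # last cue starting at or before t
    if i >= 0 and cues[i][0] <= t < cues[i][1]:
        return cues[i][2]
    return None  # no text content covers this time (e.g. blank audio)

cues = [(0.0, 2.5, "hello"), (2.5, 5.0, "world")]
# subtitle_at(cues, 1.0) -> "hello"
# subtitle_at(cues, 3.0) -> "world"
# subtitle_at(cues, 6.0) -> None
```

A player would call such a lookup on every frame (or on each timestamp boundary) to keep the displayed subtitle in sync with the shared audio/video timeline.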
In another aspect of this embodiment, step S30 further includes:
obtaining the subtitle text and invoking a preset open-source translation interface to translate the subtitle text into foreign-language subtitles. That is, the subtitle text can be translated through an open-source translation interface into languages other than Chinese, such as English, Japanese, or Korean.
displaying the foreign-language subtitles at a second preset position on the playback interface of the video. That is, the translated subtitles in languages other than Chinese can be displayed as foreign-language subtitles at a second preset position on the playback interface, which may be above, below, or at another specific position; and the form in which the foreign-language subtitles appear on the playback interface can be set as required, for example font color, font size, font shape, shadow, bold, brightness, and so on.
Understandably, the Chinese subtitles and the foreign-language subtitles can be displayed at the same time; multiple open-source translation interfaces can also be invoked simultaneously to translate the subtitle text into multiple kinds of foreign-language subtitles, after which the Chinese subtitles and the multiple foreign-language subtitles are displayed together, or only the foreign-language subtitles are displayed. In other words, which kinds of subtitles to show can be modified according to user needs; likewise, following the above, the first preset position may be the same as or different from the second preset position.
S40: receive a query instruction containing a keyword, and look up, in the subtitle text, a target timestamp corresponding to the keyword; the multiple timestamps include the target timestamp.
That is, the user can search the subtitle text for a keyword; once the keyword is found, the one or more text contents containing it (each text content being associated with at least one target timestamp) are displayed on the query interface. Understandably, the target timestamp (which is among the multiple timestamps) has a one-to-one correspondence with both the audio timeline and the video timeline.
S50: play the audio and the video according to the target timestamp.
In this embodiment, the user can select, on the query interface, the text content associated with a target timestamp; the target timestamp then serves as the audio playback time of the audio and the video playback time of the video. The audio playback time is located on the audio timeline and the audio starts playing from there, and the video playback time is located on the video timeline and the video starts playing from there.
The video playback method of this embodiment converts the audio in a video into subtitle text through speech recognition and inserts timestamps for positioning into the subtitle text, so that when the playback position of the video needs to be retrieved, the playback position can be accurately located on the video timeline simply by searching the subtitle text for a keyword and its corresponding target timestamp, which greatly improves the analysis and utilization of the video. The video retrieval of this application is accurately positioned, and the subtitle text can be output and displayed efficiently and quickly at the corresponding position on the video, greatly improving the user experience. This application can be used in scenarios such as court trial video processing and training video retrieval.
In one embodiment, as shown in FIG. 3, step S20 — converting the audio file into a file stream and converting the file stream into subtitle text through speech recognition, where the subtitle text contains multiple timestamps corresponding to the playback time of the audio — includes the following steps:
S201: convert the audio file into the file stream. Since the audio file is in a format such as wav, it only needs to be converted into a file stream.
S202: convert the file stream into the subtitle text through the speech recognition interface. That is, in this step the file stream converted from the audio file is transmitted to the speech recognition interface, which performs speech recognition on the file stream and converts it into subtitle text.
Specifically, step S202 includes: having the speech recognition interface decode the file stream using acoustic features, the contextual relations between words, and the mapping between text and pronunciation, and obtaining the subtitle text generated from the decoded file stream, where the acoustic features include the transition relations between pronunciations and the relations between pronunciations and sound-wave features. That is, after the file stream is decoded, subtitle text corresponding to the audio file is generated.
The acoustic features, the contextual relations between words, and the mapping between text and pronunciation can each be captured by building mathematical models of the parameters and refining those models through continuous training. For example, an acoustic model can be built from the transition relations between pronunciations and the relations between pronunciations and sound-wave features; a language model can be built from the contextual relations between words; and a dictionary model can be built from the mapping between text and pronunciation. The acoustic model, language model, and dictionary model are then each trained, and the trained models are used to decode the file stream, converting it into subtitle text.
S203: insert timestamps into the subtitle text according to a preset rule, and associate each inserted timestamp with the text content before or after the timestamp, where the timestamps inserted into the subtitle text correspond to the playback time of the audio.
A timestamp is a time marker corresponding to an audio playback time, and each text content is associated with at least one timestamp. Understandably, timestamps can be inserted both before and after a text content (a timestamp can also be placed at other positions within a text content, as long as it is associated with that content). Preferably, a timestamp inserted before each text content represents the playback start time of the corresponding audio, and a timestamp inserted after it represents the playback end time.
In one embodiment, as shown in FIG. 4, step S203 includes the following steps:
S2031: divide the subtitle text into multiple text contents according to the preset rule, where the preset rule includes dividing the subtitle text by character, word, sentence, or paragraph. Understandably, the preset rule includes, but is not limited to, dividing the subtitle text by character, word, sentence, paragraph, and so on. For example, the preset rule may be: divide each paragraph into one text content. In this case, a timestamp can be inserted before and after (or at other positions of) each paragraph (each paragraph represents one text content, and a carriage return can define the boundary; for example, one carriage return marks the end of a paragraph), and the timestamp is associated with the text content.
S2032: insert, before or/and after each text content, a timestamp associated with that text content, and associate the timestamp with the text content before or after it, where the timestamps inserted into the subtitle text correspond to the playback time of the audio. Understandably, a timestamp can also be placed at positions other than the beginning or end of a text content, as long as it is associated with that content; in that case, the timestamp is preferably set to the playback start time of the audio corresponding to the text content, that is, the very beginning of the playback time of the audio segment corresponding to that content. Then, the audio corresponding to the text content can start playing simply by finding, on the audio timeline, the audio playback time equal to the timestamp. Likewise, since the audio is separated from the video, the audio playback time of the audio and the video playback time of the video correspond one-to-one, so the video corresponding to the text content can also start playing by finding, on the video timeline, the video playback time equal to the timestamp.
S2033: store the subtitle text containing the timestamps in a database. That is, since the subtitle text is divided into multiple text contents and the timestamps are associated with those text contents, what is stored in the database is the multiple text contents together with the multiple timestamps associated with them.
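Storing the divided text contents and their associated timestamps, as in step S2033, could look like the following sketch; SQLite is used purely for illustration (the patent does not name a specific database), and the rows are invented sample data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# one row per text content, with its start and end timestamps
conn.execute(
    "CREATE TABLE subtitle (id INTEGER PRIMARY KEY, "
    "content TEXT, start_ts REAL, end_ts REAL)"
)
rows = [
    ("第一句话。", 0.0, 2.5),
    ("第二句话。", 2.5, 5.0),
]
conn.executemany(
    "INSERT INTO subtitle (content, start_ts, end_ts) VALUES (?, ?, ?)", rows
)
conn.commit()

# With this layout, the keyword lookup of step S40 becomes a LIKE query
# that returns each matching text content with its target timestamp:
hits = conn.execute(
    "SELECT content, start_ts FROM subtitle WHERE content LIKE ?", ("%第二%",)
).fetchall()
# hits == [("第二句话。", 2.5)]
```

Keeping the start timestamp in its own column is what later lets a single indexed query map a keyword hit straight to a seek position on the timeline.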
In one embodiment, as shown in FIG. 5, step S40 — receiving a query instruction containing a keyword and looking up, in the subtitle text, the target timestamp corresponding to the keyword — includes the following steps:
S401: receive a query instruction containing the keyword, where the keyword is entered by the user on the query interface by voice input or through an input box. That is, the user can type a keyword into the input box on the client's query interface and click a preset button that triggers the query instruction (such as a search button), after which the keyword is sent to the server along with the query instruction. Understandably, the user can also speak the keyword through a voice input device associated with the client's query interface; after recognizing the voice input, the server displays the keyword on the query interface for the user to confirm, modify, or re-enter. Once the user confirms the entered keyword, the query instruction containing the keyword is sent to the server.
S402: retrieve, from the database, the subtitle text containing the timestamps, and search the subtitle text for the keyword. That is, when the database stores subtitle text comprising multiple text contents and the multiple timestamps associated with them, the keyword can be searched for within each text content of the subtitle text upon receiving the query instruction.
S403: obtain all the text contents in the subtitle text that contain the keyword, and display all the text contents, together with the target timestamp associated with each, on the query interface. That is, when one or more text contents are found in step S402 to contain the keyword, the one or more text contents containing the keyword are displayed on the query interface, generating a text content list. In the list, the items shown in each row (or column) include, but are not limited to: a summary or the full text of the text content; the target timestamp associated with the text content (when there are multiple target timestamps, only one may be shown, preferably the one equal to the playback start time of the audio corresponding to that text content); and the ordering of the text content.
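Steps S401–S403 above reduce to scanning the stored text contents for the keyword and collecting each hit together with its associated target timestamp; a minimal in-memory sketch (the data and field names are illustrative assumptions):

```python
def query_keyword(contents, keyword):
    """contents: list of dicts with 'text' and 'start_ts' keys.
    Return every text content containing the keyword, paired with the
    target timestamp shown in the query interface (the start time)."""
    return [
        (item["text"], item["start_ts"])
        for item in contents
        if keyword in item["text"]
    ]

contents = [
    {"text": "开庭审理开始。", "start_ts": 0.0},
    {"text": "被告人陈述。", "start_ts": 12.0},
    {"text": "审理结束。", "start_ts": 95.0},
]
# query_keyword(contents, "审理")
#   -> [("开庭审理开始。", 0.0), ("审理结束。", 95.0)]
```

Each returned pair is one row of the text content list: the matched text (or its summary) and the target timestamp from which playback can be started.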
Understandably, when the text content list is generated, the audio segments corresponding to the text contents associated with the target timestamps (or the remainder of the audio starting from each such segment) can be retrieved according to the target timestamps and displayed in an audio list. In one embodiment, the audio list can be shown on the query interface in sync with the text content list; when an item in the text content list is clicked, the audio segment corresponding to that item's text content is shown as selected in sync, and the segment can also jump automatically to a prominent position in the middle (or elsewhere) of the audio list display. Likewise, when an audio segment in the audio list is clicked, not only does the segment start playing, but the corresponding text content in the text content list is also shown as selected in sync. These display modes let the user select and confirm the object to be queried from among multiple text contents.
Likewise, when the text content list is generated, the video segments corresponding to the text contents associated with the target timestamps (or the remainder of the video starting from each such segment, or the video frame at the time of the target timestamp) can also be retrieved according to the target timestamps and displayed in a video list. In one embodiment, the video list can be shown on the query interface in sync with the text content list and/or the audio list. Understandably, when only the text content list and the video list are shown, the display can mirror the case of showing only the text content list and the audio list, which is not repeated here.
In one embodiment, the video list, the text content list, and the audio list are shown on the query interface in sync. When an item in the text content list is clicked, the audio segment and video segment corresponding to that item's text content are shown as selected in sync, and they can also jump automatically to a prominent position in the middle (or elsewhere) of the audio list and video list displays. Likewise, when an audio or video segment in the audio list or video list is clicked, not only do the audio segment and video segment start playing simultaneously, but the corresponding text content in the text content list is also shown as selected in sync. These display modes let the user select and confirm the object to be queried from among multiple text contents.
In one embodiment, as shown in FIG. 6, step S50 — playing the audio and the video according to the target timestamp — includes the following steps:
S501: receive a playback instruction containing a current playback time, where the current playback time equals the time of the target timestamp.
In this embodiment, after the user selects the text content associated with a target timestamp, a playback instruction containing the current playback time can be sent to the server, and the current playback time contained in the instruction (the current audio playback time of the audio and video playback time of the video) equals the time of the target timestamp. Understandably, to select the text content associated with a target timestamp, the user can select one item in the text content list (the operation can be set to differ from the "click" in step S403 above; for example, "click" may be set as a single left-click while "select" is a double left-click), thereby selecting that item's text content and its associated target timestamp; the user can also select an audio or video segment in the audio list or video list, thereby selecting the text content corresponding to that segment and its associated target timestamp. After the text content associated with a target timestamp is selected, the server receives the playback instruction containing the current playback time (corresponding to the selected target timestamp) and proceeds to step S502.
Understandably, in another embodiment, if the target timestamp is not the playback start time of the audio corresponding to the text content, then after the text content is selected (in the text content list, video list, or audio list), the playback start time of the audio corresponding to the text content is set as the current playback time, and the current playback time does not equal the time of the target timestamp.
S502: play the audio and the video from the current playback time.
That is, after the server receives the playback instruction containing the current playback time, it can retrieve the audio from the database according to the current playback time and start playing the audio from the time on the audio timeline corresponding to the current playback time; at the same time, it retrieves the video from the database and starts playing the video from the time on the video timeline corresponding to the current playback time.
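Seeking both streams as in step S502 simply means converting the target timestamp into an offset on each timeline; a sketch of that conversion, assuming for illustration that timestamps are stored as `HH:MM:SS` strings (the patent does not fix a timestamp format):

```python
def to_seconds(ts: str) -> int:
    """Convert an 'HH:MM:SS' timestamp into a seek offset in seconds."""
    h, m, s = (int(part) for part in ts.split(":"))
    return h * 3600 + m * 60 + s

offset = to_seconds("00:01:23")
# offset == 83; a player would then seek both the audio timeline and the
# video timeline to this offset and start playback from there, keeping
# the two streams synchronized.
```

Because the audio was separated from the video, the same offset is valid on both timelines, which is exactly why one target timestamp suffices to position both streams.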
It should be understood that the step numbers in the above embodiments do not imply an order of execution; the execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation of the embodiments of this application.
In one embodiment, as shown in FIG. 7, a video playback apparatus is provided, which corresponds one-to-one to the video playback method in the above embodiments. The video playback apparatus includes an extraction module 110, a conversion module 120, a subtitle display module 130, a query module 140, and a playback module 150. The functional modules are described in detail as follows:
the extraction module 110, configured to extract audio from a video and generate an audio file;
the conversion module 120, configured to convert the audio file into a file stream and convert the file stream into subtitle text through speech recognition, where the subtitle text contains multiple timestamps corresponding to the playback time of the audio;
the subtitle display module 130, configured to display the subtitle text on the playback interface of the video according to the timestamps;
the query module 140, configured to receive a query instruction containing a keyword and look up, in the subtitle text, a target timestamp corresponding to the keyword, where the multiple timestamps include the target timestamp;
the playback module 150, configured to play the audio and the video according to the target timestamp.
The video playback apparatus of this embodiment converts the audio in a video into subtitle text through speech recognition and inserts timestamps for positioning into the subtitle text, so that when the playback position of the video needs to be retrieved, the playback position can be accurately located on the video timeline simply by searching the subtitle text for a keyword and its corresponding target timestamp, which greatly improves the analysis and utilization of the video. The video retrieval of this application is accurately positioned, and the subtitle text can be output and displayed efficiently and quickly at the corresponding position on the video, greatly improving the user experience. This application can be used in scenarios such as court trial video processing and training video retrieval.
Preferably, as shown in FIG. 8, the conversion module 120 includes:
a first conversion submodule 121, configured to convert the audio file into the file stream;
a second conversion submodule 122, configured to convert the file stream into the subtitle text through the speech recognition interface;
an insertion submodule 123, configured to insert timestamps into the subtitle text according to a preset rule and associate each inserted timestamp with the text content before or after the timestamp.
Preferably, the second conversion submodule 122 is further configured to have the speech recognition interface decode the file stream using acoustic features, the contextual relations between words, and the mapping between text and pronunciation, and obtain the subtitle text generated from the decoded file stream, where the acoustic features include the transition relations between pronunciations and the relations between pronunciations and sound-wave features.
Preferably, the insertion submodule 123 is further configured to divide the subtitle text into multiple text contents according to the preset rule, where the preset rule includes dividing the subtitle text by character, word, sentence, or paragraph; insert, before or/and after each text content, a timestamp associated with that text content, and associate the timestamp with the text content before or after it, where the timestamps inserted into the subtitle text correspond to the playback time of the audio; and store the subtitle text containing the timestamps in a database.
Preferably, as shown in FIG. 9, the query module 140 includes:
a receiving submodule 141, configured to receive a query instruction containing the keyword, where the keyword is entered by the user on the query interface by voice input or through an input box;
a retrieval submodule 142, configured to retrieve, from the database, the subtitle text containing the timestamps and search the subtitle text for the keyword;
a display submodule 143, configured to obtain all the text contents in the subtitle text that contain the keyword and display all the text contents, together with the target timestamp associated with each, on the query interface.
For specific limitations on the video playback apparatus, refer to the limitations on the video playback method above, which are not repeated here. Each module in the above video playback apparatus can be implemented wholly or partly by software, hardware, or a combination of the two. The modules can be embedded in, or independent of, the processor of the terminal device in hardware form, or stored in the memory of the terminal device in software form, so that the processor can invoke and execute the operations corresponding to each module.
In one embodiment, a terminal device (that is, a computer device) is provided. The terminal device may be a server, and its internal structure may be as shown in FIG. 10. The terminal device includes a processor, a memory, a network interface, and a database connected through a system bus. The processor of the terminal device provides computing and control capability. The memory of the terminal device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer-readable instructions, and a database. The internal memory provides an environment for running the operating system and the computer-readable instructions in the non-volatile storage medium. The network interface of the terminal device communicates with external terminals through a network connection. The computer-readable instructions, when executed by the processor, implement a video playback method.
In one embodiment, a terminal device is provided, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, where the processor implements the following steps when executing the computer-readable instructions:
extracting audio from a video and generating an audio file; converting the audio file into a file stream, and converting the file stream into subtitle text through speech recognition, where the subtitle text contains multiple timestamps corresponding to the playback time of the audio; displaying the subtitle text on the playback interface of the video according to the timestamps; receiving a query instruction containing a keyword, and looking up, in the subtitle text, a target timestamp corresponding to the keyword, where the multiple timestamps include the target timestamp; and playing the audio and the video according to the target timestamp.
The terminal device of this embodiment converts the audio in a video into subtitle text through speech recognition and inserts timestamps for positioning into the subtitle text, so that when the playback position of the video needs to be retrieved, the playback position can be accurately located on the video timeline simply by searching the subtitle text for a keyword and its corresponding target timestamp, which greatly improves the analysis and utilization of the video. The video retrieval of this application is accurately positioned, and the subtitle text can be output and displayed efficiently and quickly at the corresponding position on the video, greatly improving the user experience. This application can be used in scenarios such as court trial video processing and training video retrieval.
In one embodiment, one or more non-volatile readable storage media storing computer-readable instructions are provided; the computer-readable instructions, when executed by one or more processors, cause the one or more processors to implement the following steps:
extracting audio from a video and generating an audio file; converting the audio file into a file stream, and converting the file stream into subtitle text through speech recognition, where the subtitle text contains multiple timestamps corresponding to the playback time of the audio; displaying the subtitle text on the playback interface of the video according to the timestamps; receiving a query instruction containing a keyword, and looking up, in the subtitle text, a target timestamp corresponding to the keyword, where the multiple timestamps include the target timestamp; and playing the audio and the video according to the target timestamp.
The computer-readable storage medium of this embodiment converts the audio in a video into subtitle text through speech recognition and inserts timestamps for positioning into the subtitle text, so that when the playback position of the video needs to be retrieved, the playback position can be accurately located on the video timeline simply by searching the subtitle text for a keyword and its corresponding target timestamp, which greatly improves the analysis and utilization of the video. The video retrieval of this application is accurately positioned, and the subtitle text can be output and displayed efficiently and quickly at the corresponding position on the video, greatly improving the user experience. This application can be used in scenarios such as court trial video processing and training video retrieval.
Those of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be accomplished by computer-readable instructions instructing the relevant hardware. The computer-readable instructions can be stored in a non-volatile computer-readable storage medium, and when executed may include the processes of the embodiments of the above methods. Any reference to the memory, storage, database, or other media used in the embodiments provided by this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
Those skilled in the art can clearly understand that, for convenience and brevity of description, the division into the above functional units and modules is only an example; in practical applications, the above functions can be assigned to different functional units and modules as needed, that is, the internal structure of the apparatus can be divided into different functional units or modules to complete all or part of the functions described above.
The above embodiments only illustrate the technical solutions of this application and do not limit them. Although this application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions recorded in the foregoing embodiments can still be modified, or some of their technical features can be equivalently replaced; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of this application, and shall all fall within the protection scope of this application.

Claims (20)

  1. A video playback method, characterized by comprising:
    extracting audio from a video and generating an audio file;
    converting the audio file into a file stream, and converting the file stream into subtitle text through speech recognition, wherein the subtitle text contains multiple timestamps corresponding to the playback time of the audio;
    displaying the subtitle text on the playback interface of the video according to the timestamps;
    receiving a query instruction containing a keyword, and looking up, in the subtitle text, a target timestamp corresponding to the keyword, wherein the multiple timestamps include the target timestamp;
    playing the audio and the video according to the target timestamp.
  2. The video playback method according to claim 1, wherein converting the audio file into a file stream and converting the file stream into subtitle text through speech recognition, the subtitle text containing multiple timestamps corresponding to the playback time of the audio, comprises:
    converting the audio file into the file stream;
    converting the file stream into the subtitle text through the speech recognition interface;
    inserting timestamps into the subtitle text according to a preset rule, and associating each inserted timestamp with the text content before or after the timestamp.
  3. The video playback method according to claim 2, wherein converting the file stream into the subtitle text through the speech recognition interface specifically comprises:
    having the speech recognition interface decode the file stream using acoustic features, the contextual relations between words, and the mapping between text and pronunciation, and obtaining the subtitle text generated from the decoded file stream, wherein the acoustic features include the transition relations between pronunciations and the relations between pronunciations and sound-wave features.
  4. The video playback method according to claim 2, wherein inserting timestamps into the subtitle text according to a preset rule and associating each inserted timestamp with the text content before or after the timestamp comprises:
    dividing the subtitle text into multiple text contents according to the preset rule, wherein the preset rule includes dividing the subtitle text by character, word, sentence, or paragraph;
    inserting, before or/and after each text content, a timestamp associated with that text content, and associating the timestamp with the text content before or after it, wherein the timestamps inserted into the subtitle text correspond to the playback time of the audio;
    storing the subtitle text containing the timestamps in a database.
  5. The video playback method according to claim 1, wherein receiving a query instruction containing a keyword and looking up, in the subtitle text, the target timestamp corresponding to the keyword comprises:
    receiving the query instruction containing the keyword, wherein the keyword is entered by a user on a query interface by voice input or through an input box;
    retrieving, from a database, the subtitle text containing the timestamps, and searching the subtitle text for the keyword;
    obtaining all the text contents in the subtitle text that contain the keyword, and displaying all the text contents, together with the target timestamp associated with each, on the query interface.
  6. A video playback apparatus, characterized by comprising:
    an extraction module, configured to extract audio from a video and generate an audio file;
    a conversion module, configured to convert the audio file into a file stream and convert the file stream into subtitle text through speech recognition, wherein the subtitle text contains multiple timestamps corresponding to the playback time of the audio;
    a subtitle display module, configured to display the subtitle text on the playback interface of the video according to the timestamps;
    a query module, configured to receive a query instruction containing a keyword and look up, in the subtitle text, a target timestamp corresponding to the keyword, wherein the multiple timestamps include the target timestamp;
    a playback module, configured to play the audio and the video according to the target timestamp.
  7. The video playback apparatus according to claim 6, wherein the conversion module comprises:
    a first conversion submodule, configured to convert the audio file into the file stream;
    a second conversion submodule, configured to convert the file stream into the subtitle text through the speech recognition interface;
    an insertion submodule, configured to insert timestamps into the subtitle text according to a preset rule and associate each inserted timestamp with the text content before or after the timestamp.
  8. The video playback apparatus according to claim 7, wherein the second conversion submodule is further configured to have the speech recognition interface decode the file stream using acoustic features, the contextual relations between words, and the mapping between text and pronunciation, and obtain the subtitle text generated from the decoded file stream, wherein the acoustic features include the transition relations between pronunciations and the relations between pronunciations and sound-wave features.
  9. The video playback apparatus according to claim 7, wherein the insertion submodule is further configured to divide the subtitle text into multiple text contents according to the preset rule, wherein the preset rule includes dividing the subtitle text by character, word, sentence, or paragraph; insert, before or/and after each text content, a timestamp associated with that text content, and associate the timestamp with the text content before or after it, wherein the timestamps inserted into the subtitle text correspond to the playback time of the audio; and store the subtitle text containing the timestamps in a database.
  10. The video playback apparatus according to claim 6, wherein the query module comprises:
    a receiving submodule, configured to receive a query instruction containing the keyword, wherein the keyword is entered by a user on a query interface by voice input or through an input box;
    a retrieval submodule, configured to retrieve, from a database, the subtitle text containing the timestamps and search the subtitle text for the keyword;
    a display submodule, configured to obtain all the text contents in the subtitle text that contain the keyword and display all the text contents, together with the target timestamp associated with each, on the query interface.
  11. A terminal device, comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, characterized in that the processor implements the following steps when executing the computer-readable instructions:
    extracting audio from a video and generating an audio file;
    converting the audio file into a file stream, and converting the file stream into subtitle text through speech recognition, wherein the subtitle text contains multiple timestamps corresponding to the playback time of the audio;
    displaying the subtitle text on the playback interface of the video according to the timestamps;
    receiving a query instruction containing a keyword, and looking up, in the subtitle text, a target timestamp corresponding to the keyword, wherein the multiple timestamps include the target timestamp;
    playing the audio and the video according to the target timestamp.
  12. The terminal device according to claim 11, wherein converting the audio file into a file stream and converting the file stream into subtitle text through speech recognition, the subtitle text containing multiple timestamps corresponding to the playback time of the audio, comprises:
    converting the audio file into the file stream;
    converting the file stream into the subtitle text through the speech recognition interface;
    inserting timestamps into the subtitle text according to a preset rule, and associating each inserted timestamp with the text content before or after the timestamp.
  13. The terminal device according to claim 12, wherein converting the file stream into the subtitle text through the speech recognition interface specifically comprises:
    having the speech recognition interface decode the file stream using acoustic features, the contextual relations between words, and the mapping between text and pronunciation, and obtaining the subtitle text generated from the decoded file stream, wherein the acoustic features include the transition relations between pronunciations and the relations between pronunciations and sound-wave features.
  14. The terminal device according to claim 12, wherein inserting timestamps into the subtitle text according to a preset rule and associating each inserted timestamp with the text content before or after the timestamp comprises:
    dividing the subtitle text into multiple text contents according to the preset rule, wherein the preset rule includes dividing the subtitle text by character, word, sentence, or paragraph;
    inserting, before or/and after each text content, a timestamp associated with that text content, and associating the timestamp with the text content before or after it, wherein the timestamps inserted into the subtitle text correspond to the playback time of the audio;
    storing the subtitle text containing the timestamps in a database.
  15. The terminal device according to claim 11, wherein receiving a query instruction containing a keyword and looking up, in the subtitle text, the target timestamp corresponding to the keyword comprises:
    receiving the query instruction containing the keyword, wherein the keyword is entered by a user on a query interface by voice input or through an input box;
    retrieving, from a database, the subtitle text containing the timestamps, and searching the subtitle text for the keyword;
    obtaining all the text contents in the subtitle text that contain the keyword, and displaying all the text contents, together with the target timestamp associated with each, on the query interface.
  16. One or more non-volatile readable storage media storing computer-readable instructions, characterized in that the computer-readable instructions, when executed by one or more processors, cause the one or more processors to perform the following steps:
    extracting audio from a video and generating an audio file;
    converting the audio file into a file stream, and converting the file stream into subtitle text through speech recognition, wherein the subtitle text contains multiple timestamps corresponding to the playback time of the audio;
    displaying the subtitle text on the playback interface of the video according to the timestamps;
    receiving a query instruction containing a keyword, and looking up, in the subtitle text, a target timestamp corresponding to the keyword, wherein the multiple timestamps include the target timestamp;
    playing the audio and the video according to the target timestamp.
  17. The non-volatile readable storage medium according to claim 16, wherein converting the audio file into a file stream and converting the file stream into subtitle text through speech recognition, the subtitle text containing multiple timestamps corresponding to the playback time of the audio, comprises:
    converting the audio file into the file stream;
    converting the file stream into the subtitle text through the speech recognition interface;
    inserting timestamps into the subtitle text according to a preset rule, and associating each inserted timestamp with the text content before or after the timestamp.
  18. The non-volatile readable storage medium according to claim 17, wherein converting the file stream into the subtitle text through the speech recognition interface specifically comprises:
    having the speech recognition interface decode the file stream using acoustic features, the contextual relations between words, and the mapping between text and pronunciation, and obtaining the subtitle text generated from the decoded file stream, wherein the acoustic features include the transition relations between pronunciations and the relations between pronunciations and sound-wave features.
  19. The non-volatile readable storage medium according to claim 17, wherein inserting timestamps into the subtitle text according to a preset rule and associating each inserted timestamp with the text content before or after the timestamp comprises:
    dividing the subtitle text into multiple text contents according to the preset rule, wherein the preset rule includes dividing the subtitle text by character, word, sentence, or paragraph;
    inserting, before or/and after each text content, a timestamp associated with that text content, and associating the timestamp with the text content before or after it, wherein the timestamps inserted into the subtitle text correspond to the playback time of the audio;
    storing the subtitle text containing the timestamps in a database.
  20. The non-volatile readable storage medium according to claim 16, wherein receiving a query instruction containing a keyword and looking up, in the subtitle text, the target timestamp corresponding to the keyword comprises:
    receiving the query instruction containing the keyword, wherein the keyword is entered by a user on a query interface by voice input or through an input box;
    retrieving, from a database, the subtitle text containing the timestamps, and searching the subtitle text for the keyword;
    obtaining all the text contents in the subtitle text that contain the keyword, and displaying all the text contents, together with the target timestamp associated with each, on the query interface.
PCT/CN2018/104047 2018-08-01 2018-09-05 视频播放方法、装置、终端设备及存储介质 WO2020024353A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810861877.8 2018-08-01
CN201810861877.8A CN109246472A (zh) 2018-08-01 2018-08-01 视频播放方法、装置、终端设备及存储介质

Publications (1)

Publication Number Publication Date
WO2020024353A1 true WO2020024353A1 (zh) 2020-02-06

Family

ID=65073382

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/104047 WO2020024353A1 (zh) 2018-08-01 2018-09-05 视频播放方法、装置、终端设备及存储介质

Country Status (2)

Country Link
CN (1) CN109246472A (zh)
WO (1) WO2020024353A1 (zh)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111988663A (zh) * 2020-08-28 2020-11-24 北京百度网讯科技有限公司 视频播放节点的定位方法、装置、设备以及存储介质
EP3817395A1 (en) * 2019-10-30 2021-05-05 Beijing Xiaomi Mobile Software Co., Ltd. Video recording method and apparatus, device, and readable storage medium
EP4206953A4 (en) * 2020-09-29 2024-01-10 Beijing Zitiao Network Technology Co., Ltd. METHOD AND APPARATUS FOR SEARCHING TARGET CONTENT, ELECTRONIC DEVICE AND RECORDING MEDIUM

Families Citing this family (35)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109886258A (zh) * 2019-02-19 2019-06-14 新华网(北京)科技有限公司 提供多媒体信息的关联信息的方法、装置及电子设备
CN109977239B (zh) * 2019-03-31 2023-08-18 联想(北京)有限公司 一种信息处理方法和电子设备
CN110035326A (zh) * 2019-04-04 2019-07-19 北京字节跳动网络技术有限公司 字幕生成、基于字幕的视频检索方法、装置和电子设备
CN112632321A (zh) * 2019-09-23 2021-04-09 北京国双科技有限公司 音频文件处理方法及装置、音频文件播放方法及装置
CN112584078B (zh) * 2019-09-27 2022-03-18 深圳市万普拉斯科技有限公司 视频通话方法、装置、计算机设备和存储介质
CN111128254B (zh) * 2019-11-14 2021-09-03 网易(杭州)网络有限公司 音频播放方法、电子设备及存储介质
CN111008300A (zh) * 2019-11-20 2020-04-14 四川互慧软件有限公司 一种在音视频中基于关键词的时间戳定位搜索方法
CN112825561A (zh) * 2019-11-21 2021-05-21 上海幻电信息科技有限公司 字幕显示方法、系统、计算机设备及可读存储介质
CN111428059A (zh) * 2020-03-19 2020-07-17 威比网络科技(上海)有限公司 关联音频的多媒体数据播放方法、装置、电子设备、存储介质
CN111639233B (zh) * 2020-05-06 2024-05-17 广东小天才科技有限公司 学习视频字幕添加方法、装置、终端设备和存储介质
CN111611302B (zh) * 2020-06-19 2023-06-16 中国人民解放军国防科技大学 一种飞行器试验数据与仿真数据的数据对准方法及装置
CN114501106A (zh) * 2020-08-04 2022-05-13 腾讯科技(深圳)有限公司 一种文稿显示控制方法、装置、电子设备和存储介质
CN114115668A (zh) * 2020-08-11 2022-03-01 深圳市万普拉斯科技有限公司 音频文件的展示方法、装置、计算机设备和存储介质
CN112163103A (zh) * 2020-09-29 2021-01-01 北京字跳网络技术有限公司 搜索目标内容的方法、装置、电子设备及存储介质
CN112233661B (zh) * 2020-10-14 2024-04-05 广州欢网科技有限责任公司 基于语音识别的影视内容字幕生成方法、系统及设备
CN114449333B (zh) * 2020-10-30 2023-09-01 华为终端有限公司 视频笔记生成方法及电子设备
CN112399269B (zh) * 2020-11-12 2023-06-20 广东小天才科技有限公司 视频分割方法、装置、设备及存储介质
CN113010704B (zh) * 2020-11-18 2022-03-29 北京字跳网络技术有限公司 一种会议纪要的交互方法、装置、设备及介质
CN112489683A (zh) * 2020-11-24 2021-03-12 广州市久邦数码科技有限公司 基于关键词语定位实现音频快进快退的方法和装置
CN112929758A (zh) * 2020-12-31 2021-06-08 广州朗国电子科技有限公司 一种多媒体内容字幕生成方法、设备以及存储介质
CN112686006A (zh) * 2021-01-04 2021-04-20 深圳前海微众银行股份有限公司 音频的识别文本校正方法、音频识别设备、装置和介质
CN112883235A (zh) * 2021-03-11 2021-06-01 深圳市一览网络股份有限公司 视频内容的搜索方法、装置、计算机设备及存储介质
CN113066498B (zh) * 2021-03-23 2022-12-30 上海掌门科技有限公司 信息处理方法、设备和介质
CN113099312A (zh) * 2021-03-30 2021-07-09 深圳市多科特文化传媒有限公司 教学视频播放系统
CN113259776B (zh) * 2021-04-14 2022-11-22 北京达佳互联信息技术有限公司 字幕与音源的绑定方法及装置
CN112995736A (zh) * 2021-04-22 2021-06-18 南京亿铭科技有限公司 语音字幕合成方法、装置、计算机设备及存储介质
CN113593567B (zh) * 2021-06-23 2022-09-09 荣耀终端有限公司 视频声音转文本的方法及相关设备
CN113378001B (zh) * 2021-06-28 2024-02-27 北京百度网讯科技有限公司 视频播放进度的调整方法及装置、电子设备和介质
CN113343675A (zh) * 2021-06-30 2021-09-03 北京搜狗科技发展有限公司 一种字幕生成方法、装置和用于生成字幕的装置
WO2023010402A1 (zh) * 2021-08-05 2023-02-09 深圳Tcl新技术有限公司 一种媒体文件播放方法、装置、计算机设备及存储介质
CN114339300B (zh) * 2021-12-28 2024-04-19 Oppo广东移动通信有限公司 字幕处理方法、装置、电子设备及计算机可读介质及产品
CN115277650B (zh) * 2022-07-13 2024-01-09 深圳乐播科技有限公司 投屏显示控制方法、电子设备及相关装置
CN115269920A (zh) * 2022-08-15 2022-11-01 北京字跳网络技术有限公司 交互方法、装置、电子设备和存储介质
CN115209211A (zh) * 2022-09-13 2022-10-18 北京达佳互联信息技术有限公司 字幕显示方法、装置、电子设备、存储介质及程序产品
CN117749965A (zh) * 2022-09-14 2024-03-22 北京字跳网络技术有限公司 字幕处理方法及装置

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0780777A1 (en) * 1995-12-21 1997-06-25 Hewlett-Packard Company Indexing of recordings
CN101382937A (zh) * 2008-07-01 2009-03-11 深圳先进技术研究院 基于语音识别的多媒体资源处理方法及其在线教学系统
CN103327397A (zh) * 2012-03-22 2013-09-25 联想(北京)有限公司 一种媒体文件的字幕同步显示方法及系统
CN103561217A (zh) * 2013-10-14 2014-02-05 深圳创维数字技术股份有限公司 一种生成字幕的方法及终端
CN104301771A (zh) * 2013-07-15 2015-01-21 中兴通讯股份有限公司 视频文件播放进度的调整方法及装置
CN104618807A (zh) * 2014-03-31 2015-05-13 腾讯科技(北京)有限公司 多媒体播放方法、装置及系统
CN105245917A (zh) * 2015-09-28 2016-01-13 徐信 一种多媒体语音字幕生成的系统和方法
CN106303303A (zh) * 2016-08-17 2017-01-04 北京金山安全软件有限公司 一种媒体文件字幕的翻译方法、装置及电子设备
CN106340291A (zh) * 2016-09-27 2017-01-18 广东小天才科技有限公司 一种双语字幕制作方法及系统

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10645460B2 (en) * 2016-12-30 2020-05-05 Facebook, Inc. Real-time script for live broadcast
CN108259971A (zh) * 2018-01-31 2018-07-06 百度在线网络技术(北京)有限公司 字幕添加方法、装置、服务器及存储介质

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0780777A1 (en) * 1995-12-21 1997-06-25 Hewlett-Packard Company Indexing of recordings
CN101382937A (zh) * 2008-07-01 2009-03-11 深圳先进技术研究院 基于语音识别的多媒体资源处理方法及其在线教学系统
CN103327397A (zh) * 2012-03-22 2013-09-25 联想(北京)有限公司 一种媒体文件的字幕同步显示方法及系统
CN104301771A (zh) * 2013-07-15 2015-01-21 中兴通讯股份有限公司 视频文件播放进度的调整方法及装置
CN103561217A (zh) * 2013-10-14 2014-02-05 深圳创维数字技术股份有限公司 一种生成字幕的方法及终端
CN104618807A (zh) * 2014-03-31 2015-05-13 腾讯科技(北京)有限公司 多媒体播放方法、装置及系统
CN105245917A (zh) * 2015-09-28 2016-01-13 徐信 一种多媒体语音字幕生成的系统和方法
CN106303303A (zh) * 2016-08-17 2017-01-04 北京金山安全软件有限公司 一种媒体文件字幕的翻译方法、装置及电子设备
CN106340291A (zh) * 2016-09-27 2017-01-18 广东小天才科技有限公司 一种双语字幕制作方法及系统

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3817395A1 (en) * 2019-10-30 2021-05-05 Beijing Xiaomi Mobile Software Co., Ltd. Video recording method and apparatus, device, and readable storage medium
CN111988663A (zh) * 2020-08-28 2020-11-24 北京百度网讯科技有限公司 视频播放节点的定位方法、装置、设备以及存储介质
KR20210042852A (ko) * 2020-08-28 2021-04-20 베이징 바이두 넷컴 사이언스 테크놀로지 컴퍼니 리미티드 비디오 재생 노드 위치 확정 방법, 장치, 전자 장비, 컴퓨터 판독가능 저장 매체 및 컴퓨터 프로그램
EP3855753A3 (en) * 2020-08-28 2021-08-11 Beijing Baidu Netcom Science Technology Co., Ltd. Method and apparatus for locating video playing node, device and storage medium
KR102436734B1 (ko) * 2020-08-28 2022-08-26 베이징 바이두 넷컴 사이언스 테크놀로지 컴퍼니 리미티드 비디오 재생 노드 위치 확정 방법, 장치, 전자 장비, 컴퓨터 판독가능 저장 매체 및 컴퓨터 프로그램
US11581021B2 (en) 2020-08-28 2023-02-14 Beijing Baidu Netcom Science Technology Co., Ltd. Method and apparatus for locating video playing node, device and storage medium
EP4206953A4 (en) * 2020-09-29 2024-01-10 Beijing Zitiao Network Technology Co., Ltd. METHOD AND APPARATUS FOR SEARCHING TARGET CONTENT, ELECTRONIC DEVICE AND RECORDING MEDIUM

Also Published As

Publication number Publication date
CN109246472A (zh) 2019-01-18

Similar Documents

Publication Publication Date Title
WO2020024353A1 (zh) 视频播放方法、装置、终端设备及存储介质
US11301644B2 (en) Generating and editing media
US11917344B2 (en) Interactive information processing method, device and medium
WO2018121001A1 (zh) 数字电视节目同声翻译输出方法、系统及智能终端
CN110430476B (zh) 直播间搜索方法、系统、计算机设备和存储介质
US8620139B2 (en) Utilizing subtitles in multiple languages to facilitate second-language learning
CN111968649A (zh) 一种字幕纠正方法、字幕显示方法、装置、设备及介质
WO2020133039A1 (zh) 对话语料中实体的识别方法、装置和计算机设备
CN110781328A (zh) 基于语音识别的视频生成方法、系统、装置和存储介质
JP6857983B2 (ja) メタデータ生成システム
CN111898388A (zh) 视频字幕翻译编辑方法、装置、电子设备及存储介质
CN110740275A (zh) 一种非线性编辑系统
CN112399269A (zh) 视频分割方法、装置、设备及存储介质
WO2022206198A1 (zh) 一种音频和文本的同步方法、装置、设备以及介质
KR20210138311A (ko) 언어 및 수어의 병렬 말뭉치 데이터의 생성 장치 및 방법
CN110059224B (zh) 投影仪设备的视频检索方法、装置、设备及存储介质
CN109376145B (zh) 影视对白数据库的建立方法、建立装置及存储介质
CN114449310A (zh) 视频剪辑方法、装置、计算机设备及存储介质
US20230326369A1 (en) Method and apparatus for generating sign language video, computer device, and storage medium
US20140129221A1 (en) Sound recognition device, non-transitory computer readable storage medium stored threreof sound recognition program, and sound recognition method
CN113438532B (zh) 视频处理、视频播放方法、装置、电子设备及存储介质
JP2007199315A (ja) コンテンツ提供装置
US11606629B2 (en) Information processing apparatus and non-transitory computer readable medium storing program
JP6382423B1 (ja) 情報処理装置、画面出力方法及びプログラム
TWI823815B (zh) 摘要產生方法及系統與電腦程式產品

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18928596

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 07.05.2021)

122 Ep: pct application non-entry in european phase

Ref document number: 18928596

Country of ref document: EP

Kind code of ref document: A1