CN111901615A - Live video playing method and device - Google Patents

Live video playing method and device

Info

Publication number
CN111901615A
CN111901615A (Application CN202010598110.8A)
Authority
CN
China
Prior art keywords
information
video
live
target
subtitle information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010598110.8A
Other languages
Chinese (zh)
Inventor
赵晓昆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202010598110.8A priority Critical patent/CN111901615A/en
Publication of CN111901615A publication Critical patent/CN111901615A/en
Pending legal-status Critical Current

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/21Server components or server architectures
    • H04N21/218Source of audio or video content, e.g. local disk arrays
    • H04N21/2187Live feed
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/431Generation of visual interfaces for content selection or interaction; Content or additional data rendering
    • H04N21/4312Generation of visual interfaces for content selection or interaction; Content or additional data rendering involving specific graphical features, e.g. screen layout, special fonts or colors, blinking icons, highlights or animations
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/439Processing of audio elementary streams
    • H04N21/4394Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47End-user applications
    • H04N21/488Data services, e.g. news ticker
    • H04N21/4884Data services, e.g. news ticker for displaying subtitles

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The application discloses a live video playing method and device, relating to the technical fields of speech recognition, video processing, and image processing. The specific implementation scheme is as follows: acquiring live broadcast data, wherein the live broadcast data comprises audio information and video stream information; recognizing text information corresponding to the audio information, and generating subtitle information from the text information; determining a start time point and an end time point corresponding to the subtitle information, and determining the video frames corresponding to those time points in the video stream information; and storing the correspondence between the subtitle information and the corresponding video frames in a preset database, so that the live video can be played according to that correspondence. Thus, while matching subtitle information to the live video stream, the subtitle information and the video stream need not be mixed into new video data, which reduces the playing delay of the live video stream.

Description

Live video playing method and device
Technical Field
The present application relates to the field of speech recognition technology, the field of video processing technology, and the field of image processing, and in particular, to a method and an apparatus for playing a live video.
Background
Live broadcasting is widely used for announcements, sports events, entertainment, and the like, and played an important role in education and sales during the epidemic. Because live broadcasting demands strong real-time performance and interactivity, the video stream cannot be post-processed, and the resulting absence of subtitles hampers viewers' understanding to some extent.
In the related art, subtitles are recognized and added manually, which delays the live broadcast data considerably and cannot meet the strict real-time requirements of live broadcasting.
Disclosure of Invention
The application provides a live video playing method and device. While matching subtitle information to a live video stream, the subtitle information and the video stream need not be mixed into new video data, so the playing delay of the live video stream is reduced.
According to a first aspect, a live video playing method is provided, which includes the following steps: acquiring live broadcast data, wherein the live broadcast data comprises audio information and video stream information; recognizing text information corresponding to the audio information, and generating subtitle information according to the text information; determining a starting time point and an ending time point corresponding to the subtitle information, and determining video frames corresponding to the starting time point and the ending time point in the video stream information; and storing the corresponding relation between the subtitle information and the corresponding video frame in a preset database so as to play the live video according to the corresponding relation.
According to a second aspect, there is provided a live video playing apparatus, comprising: a first acquisition module configured to acquire live broadcast data, the live broadcast data comprising audio information and video stream information; a first generation module configured to recognize text information corresponding to the audio information and generate subtitle information from the text information; a first determination module configured to determine a start time point and an end time point corresponding to the subtitle information and determine the video frames corresponding to those time points in the video stream information; and a storage module configured to store the correspondence between the subtitle information and the corresponding video frames in a preset database, so that the live video can be played according to that correspondence.
According to a third aspect, there is provided an electronic device comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a live video playback method as described in the first aspect.
According to a fourth aspect, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the live video playback method of the first aspect.
The live video playing method and device of the present application have the following beneficial effects:
live broadcast data comprising audio information and video stream information is acquired; text information corresponding to the audio information is recognized; subtitle information is generated from the text information; the subtitle information is time-aligned with the video stream information; and the correspondence between them is stored so that the live video can be played according to that correspondence. Thus, while matching subtitle information to the live video stream, the subtitle information and the video stream need not be mixed into new video data, which reduces the playing delay of the live video stream.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. Wherein:
fig. 1 is a schematic flowchart of a live video playing method according to a first embodiment of the present application;
fig. 2 is a flowchart illustrating a live video playing method according to a second embodiment of the present application;
fig. 3-1 is a schematic view of a live video playing scene according to a third embodiment of the present application;
fig. 3-2 is a schematic view of a live video playing scene according to a fourth embodiment of the present application;
fig. 4 is a schematic view of a live video playing scene according to a fifth embodiment of the present application;
fig. 5 is a schematic structural diagram of a live video playback device according to a sixth embodiment of the present application;
fig. 6 is a schematic structural diagram of a live video playback device according to a seventh embodiment of the present application;
fig. 7 is a schematic structural diagram of a live video playback device according to an eighth embodiment of the present application;
fig. 8 is a block diagram of an electronic device for implementing a method for playing a live video according to an embodiment of the present application.
Detailed Description
The following description of exemplary embodiments of the present application, taken in conjunction with the accompanying drawings, includes various details of those embodiments to aid understanding; these details are to be considered exemplary only. Those of ordinary skill in the art will therefore recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present application. Likewise, descriptions of well-known functions and constructions are omitted below for clarity and conciseness.
To reduce the delay in providing subtitle information for live video data, the subtitle information and the video stream are stored separately, so the video need not be regenerated from the video stream and the subtitle information.
Specifically, fig. 1 is a flowchart of a live video playing method according to an embodiment of the present application, and as shown in fig. 1, the method includes:
step 101, acquiring live broadcast data, wherein the live broadcast data comprises audio information and video stream information.
In this embodiment, to further reduce the delay, the live broadcast data may be acquired according to a preset acquisition period and cached in a video cache pool; after a cached chunk of live data has been retrieved and had subtitle information added, its memory in the video cache pool is released.
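The cache pool described above could be sketched as follows. This is a minimal illustration, not the patent's implementation; the class and method names are hypothetical.

```python
from collections import deque

class VideoCachePool:
    """Buffers live data per acquisition period; a chunk is dropped from
    memory once it has been taken out for subtitle processing."""

    def __init__(self):
        self._pool = deque()

    def cache(self, live_chunk):
        # One acquisition period's worth of live data (audio + video stream).
        self._pool.append(live_chunk)

    def take(self):
        # Pop the oldest cached chunk; removing it from the deque frees
        # its slot in the pool, as the embodiment describes.
        return self._pool.popleft() if self._pool else None
```

A producer thread would call `cache()` each acquisition period while the subtitle pipeline calls `take()`.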
And 102, identifying text information corresponding to the audio information, and generating subtitle information according to the text information.
In this embodiment, voice recognition may be performed on the audio information according to a voice recognition technology to obtain corresponding text information, and then subtitle information is generated according to the text information.
In actual execution, when generating the subtitle information from the text information, the size of the playing interface of the target playing device may be obtained and used to determine the font size of the subtitle information. The language of the audio information may also be recognized so that the subtitle information matches it; alternatively, the user may preset a subtitle language, in which case the text information is translated into subtitle information of that language.
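The subtitle-generation step could look roughly like the sketch below. The `recognize_speech` and `translate` callables are stand-ins for a real ASR engine and translation service (neither is specified by the patent), and the font-size formula is an illustrative assumption.

```python
def generate_subtitles(audio_info, interface_size, target_language=None,
                       recognize_speech=None, translate=None):
    """Recognize text from audio and wrap it as subtitle information.

    recognize_speech / translate: hypothetical hooks for an ASR engine
    and a translation service, as the embodiment leaves these open.
    """
    text = recognize_speech(audio_info)
    # Translate if the user preset a subtitle language.
    if target_language is not None:
        text = translate(text, target_language)
    # Scale font size with the playing interface, as described above
    # (the divisor 40 and floor of 12 are illustrative values).
    width, height = interface_size
    font_size = max(12, width // 40)
    return {"text": text, "font_size": font_size}
```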
And 103, determining a starting time point and an ending time point corresponding to the subtitle information, and determining video frames corresponding to the starting time point and the ending time point in the video stream information.
In this embodiment, the subtitle information is time-aligned with the video frame, a start time point and an end time point corresponding to the subtitle information are determined, and the video frame corresponding to the start time point and the end time point is determined in the video stream information.
In an embodiment of the present application, the audio information and the video stream information each include corresponding time information, and a start time point and an end time point of the subtitle information may be determined according to the time information of the corresponding audio information, and further, the video frame is aligned with the subtitle information according to the time information in the video stream information.
In another embodiment of the present application, the start and end time points of the subtitle information lie within the period corresponding to the current acquisition period in the live broadcast data (the start time of the current acquisition period is recorded, and its end is known). Therefore, the start time point at which the audio information corresponding to the subtitle information begins and the end time point at which it ends can be identified; the playing duration of each video frame in the video stream is determined; the sequence numbers of the video frames falling between the start and end time points in the current period are computed from that playing duration; and the video frames corresponding to the subtitle information are determined from those sequence numbers.
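The sequence-number computation in this embodiment reduces to simple arithmetic on the per-frame playing duration. A minimal sketch, with hypothetical parameter names:

```python
def frames_for_subtitle(start_s, end_s, frame_duration_s, period_start_s):
    """Return the sequence numbers of the video frames spanned by one subtitle.

    start_s / end_s: when the subtitle's audio begins and ends;
    frame_duration_s: playing duration of each frame (1 / frame rate);
    period_start_s: recorded start time of the current acquisition period.
    """
    # Offset into the current period, divided by the per-frame duration,
    # gives the first and last frame sequence numbers.
    first = int((start_s - period_start_s) / frame_duration_s)
    last = int((end_s - period_start_s) / frame_duration_s)
    return list(range(first, last + 1))
```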
The subtitle information in this embodiment may correspond to one sentence of the audio information, so there may be multiple pieces of subtitle information. The length of each piece, which may be understood as its number of characters, may be determined by the size of the video playing interface of the target playing device: the larger the interface, the longer the subtitle.
And 104, storing the corresponding relation between the subtitle information and the corresponding video frame in a preset database so as to play the live video according to the corresponding relation.
It should be emphasized that, in this embodiment, no new video stream is generated from the subtitle information and the corresponding video frames. Instead, the correspondence between them is stored in the preset database, and the live video is played according to that correspondence. Thus, adding subtitles only requires aligning the recognized subtitle information with the video frames; they need not be reprocessed into new video data.
The preset database may be in a cloud server, or on the target playback device side, or the like.
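One way the preset database could store this correspondence is a table mapping each subtitle to the frame sequence numbers it covers. The schema below is an illustrative assumption (the patent does not mandate any particular database or layout); SQLite stands in for whatever database sits on the cloud server or playback device.

```python
import sqlite3

def store_correspondence(conn, subtitle_text, first_frame, last_frame):
    """Record that one piece of subtitle information covers the frames
    numbered first_frame..last_frame (inclusive)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS subtitles "
        "(text TEXT, first_frame INTEGER, last_frame INTEGER)")
    conn.execute("INSERT INTO subtitles VALUES (?, ?, ?)",
                 (subtitle_text, first_frame, last_frame))
    conn.commit()

def subtitle_for_frame(conn, frame_no):
    """Look up the subtitle information matching a given video frame."""
    row = conn.execute(
        "SELECT text FROM subtitles WHERE ? BETWEEN first_frame AND last_frame",
        (frame_no,)).fetchone()
    return row[0] if row else None
```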
To sum up, after acquiring live data comprising audio information and video stream information, the live video playing method of this embodiment recognizes text information corresponding to the audio information, generates subtitle information from it, time-aligns the subtitle information with the video stream information, and stores the correspondence between them, so the live video can be played according to that correspondence. Thus, while matching subtitle information to the live video stream, the subtitle information and the video stream need not be mixed into new video data, which reduces the playing delay of the live video stream.
To make it clear to those skilled in the art how the live video is played, specific embodiments are described below:
in one embodiment of the present application, as shown in fig. 2, the step 104 includes:
step 201, in response to a live video playing instruction, acquiring a target video frame to be played.
The live video playing instruction can be triggered after a user opens a corresponding live application, and can also be triggered by user voice.
In response to a user's live video playing instruction, the target video frame to be played at the current time point is determined; for example, the target video frame may be understood as the video frame whose time point is closest to the current time point.
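Picking the frame closest to the current time point is a nearest-neighbor search over frame timestamps; a minimal sketch (the function name and the list-of-timestamps representation are assumptions for illustration):

```python
def target_frame(frame_timestamps, now):
    """Return the index of the frame whose timestamp is closest to `now`,
    i.e. the target video frame to be played."""
    return min(range(len(frame_timestamps)),
               key=lambda i: abs(frame_timestamps[i] - now))
```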
Step 202, querying a preset database, and determining target subtitle information corresponding to a target video frame.
In this embodiment, a preset database is queried to determine target subtitle information corresponding to a target video frame.
Step 203, playing the target video frame, and displaying the floating layer containing the target caption information on the target video frame.
In this embodiment, the target video frame is played and a floating layer containing the target subtitle information is displayed over it. When displaying the floating layer, so as not to interfere with the user's viewing, the video background region of the target video frame may be identified using an image recognition technique from the field of image processing, and a floating-layer display region determined within it: for example, the largest sub-region of the video background, or the sub-region with the fewest scene elements, may serve as the floating-layer display region.
The floating-layer display region may be irregular, i.e., adapted to the shape of the video background region as shown in fig. 3-1, or it may be a regular region of preset shape as shown in fig. 3-2. It may be implemented with a floating-layer compositing technique from video processing.
Of course, in this embodiment, to avoid interfering with the user's viewing of the live video data, the live interface size of the live device may also be obtained, and the floating layer containing the target subtitle information generated according to that size: the larger the live interface, the larger the area of the floating layer.
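Scaling the floating layer with the live interface can be sketched as below. The 80%/12% proportions and the bottom-centered placement are illustrative assumptions, not values from the patent.

```python
def floating_layer_rect(interface_w, interface_h):
    """Return (x, y, w, h) of a subtitle floating layer that grows with
    the live interface: a larger interface yields a larger layer."""
    layer_w = int(interface_w * 0.8)   # assumed: 80% of interface width
    layer_h = int(interface_h * 0.12)  # assumed: 12% of interface height
    # Center the layer horizontally, near the bottom of the interface.
    x = (interface_w - layer_w) // 2
    y = interface_h - layer_h - int(interface_h * 0.05)
    return x, y, layer_w, layer_h
```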
Therefore, as shown in fig. 4, in the embodiment of the present application, the subtitle information is stored independently from the video frame, and when the video frame is played, the subtitle information of the video frame is directly matched, and the subtitle information is displayed on the video frame in a floating layer manner.
In conclusion, the live video playing method does not need to add the subtitle information into the video frame for secondary rendering, and saves server resources and time.
According to the embodiment of the application, the application also provides a live video playing device. Fig. 5 is a schematic structural diagram of a live video playing apparatus according to an embodiment of the present application, and as shown in fig. 5, the apparatus includes: a first obtaining module 10, a first generating module 20, a first determining module 30 and a storing module 40, wherein,
a first obtaining module 10, configured to obtain live broadcast data, where the live broadcast data includes audio information and video stream information;
the first generating module 20 is configured to identify text information corresponding to the audio information, and generate subtitle information according to the text information;
a first determining module 30, configured to determine a start time point and an end time point corresponding to the subtitle information, and determine a video frame corresponding to the start time point and the end time point in the video stream information;
and the storage module 40 is configured to store a corresponding relationship between the subtitle information and the corresponding video frame in a preset database, so as to play a live video according to the corresponding relationship.
It should be noted that the foregoing explanation on the embodiment of the live video playing method is also applicable to the live video playing apparatus of the embodiment, and the implementation principle is similar, and is not described herein again.
To sum up, after acquiring live data comprising audio information and video stream information, the live video playing apparatus of this embodiment recognizes text information corresponding to the audio information, generates subtitle information from it, time-aligns the subtitle information with the video stream information, and stores the correspondence between them, so the live video can be played according to that correspondence. Thus, while matching subtitle information to the live video stream, the subtitle information and the video stream need not be mixed into new video data, which reduces the playing delay of the live video stream.
In order to implement the above embodiment, the present application further provides a live video playing device. As shown in fig. 6, on the basis of that shown in fig. 5, the apparatus further includes: a second obtaining module 50, a second determining module 60, a display module 70, wherein,
a second obtaining module 50, configured to obtain a target video frame to be played in response to a live video playing instruction;
a second determining module 60, configured to query a preset database and determine target subtitle information corresponding to a target video frame;
and a display module 70, configured to play the target video frame and display a floating layer containing the target subtitle information on the target video frame.
In one embodiment of the present application, as shown in fig. 7, on the basis of fig. 6, the apparatus further comprises: a third obtaining module 80 and a second generating module 90, wherein,
a third obtaining module 80, configured to obtain a live interface size of a live device;
and a second generating module 90, configured to generate a floating layer including the target subtitle information according to the size of the live interface.
It should be noted that the foregoing explanation on the embodiment of the live video playing method is also applicable to the live video playing apparatus of the embodiment, and the implementation principle is similar, and is not described herein again.
In summary, the live video playing device according to the embodiment of the present application does not need to add the subtitle information to the video frame for secondary rendering, thereby saving server resources and time.
According to an embodiment of the present application, an electronic device and a readable storage medium are also provided.
As shown in fig. 8, the electronic device is a block diagram of a method for playing a live video according to an embodiment of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the present application that are described and/or claimed herein.
As shown in fig. 8, the electronic apparatus includes: one or more processors 801, a memory 802, and interfaces for connecting the various components, including a high-speed interface and a low-speed interface. The various components are interconnected by different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions for execution within the electronic device, including instructions stored in or on the memory, to display graphical information for a GUI on an external input/output apparatus (such as a display device coupled to the interface). In other embodiments, multiple processors and/or multiple buses may be used, along with multiple memories, as desired. Likewise, multiple electronic devices may be connected, with each device providing a portion of the necessary operations (e.g., as a server array, a group of blade servers, or a multi-processor system). Fig. 8 illustrates an example with one processor 801.
The memory 802 is a non-transitory computer readable storage medium as provided herein. Wherein the memory stores instructions executable by at least one processor to cause the at least one processor to perform the methods of live video playback provided herein. A non-transitory computer readable storage medium of the present application stores computer instructions for causing a computer to perform a method of live video playback as provided herein.
The memory 802 serves as a non-transitory computer readable storage medium, and can be used for storing non-transitory software programs, non-transitory computer executable programs, and modules, such as program instructions/modules corresponding to the method for live video playing in the embodiment of the present application (for example, the first obtaining module 10, the first generating module 20, the first determining module 30, and the storage module 40 shown in fig. 5). The processor 801 executes various functional applications and data processing of the server by running non-transitory software programs, instructions and modules stored in the memory 802, that is, implements the method of live video playing in the above method embodiment.
The memory 802 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created from use of the electronic device for live video playback, and the like. Further, the memory 802 may include high speed random access memory and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory 802 optionally includes memory located remotely from the processor 801, which may be connected to a live video playback electronic device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device of the live video playing method may further include: an input device 803 and an output device 804. The processor 801, the memory 802, the input device 803, and the output device 804 may be connected by a bus or other means, and are exemplified by a bus in fig. 8.
The input device 803 may receive input numeric or character information and generate key signal inputs related to user settings and function controls of the electronic device on which the live video is played, and may be, for example, a touch screen, keypad, mouse, track pad, touch pad, pointing stick, one or more mouse buttons, trackball, or joystick. The output device 804 may include a display device, auxiliary lighting devices (e.g., LEDs), haptic feedback devices (e.g., vibration motors), and the like. The display device may include, but is not limited to, a liquid crystal display (LCD), a light-emitting diode (LED) display, and a plasma display. In some implementations, the display device may be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, application specific ASICs (application specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also known as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in a different order, and the present application is not limited thereto as long as the desired results of the technical solutions disclosed in the present application can be achieved.
The above-described embodiments should not be construed as limiting the scope of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (10)

1. A live video playing method comprises the following steps:
acquiring live broadcast data, wherein the live broadcast data comprises audio information and video stream information;
recognizing text information corresponding to the audio information, and generating subtitle information according to the text information;
determining a starting time point and an ending time point corresponding to the subtitle information, and determining video frames corresponding to the starting time point and the ending time point in the video stream information;
and storing the corresponding relation between the subtitle information and the corresponding video frame in a preset database so as to play the live video according to the corresponding relation.
2. The method of claim 1, wherein the playing the live video according to the correspondence comprises:
responding to a live video playing instruction, and acquiring a target video frame to be played;
querying the preset database, and determining target subtitle information corresponding to the target video frame;
and playing the target video frame, and displaying a floating layer containing the target subtitle information on the target video frame.
3. The method of claim 2, wherein before the displaying of the floating layer containing the target subtitle information on the target video frame, the method further comprises:
acquiring the size of a live broadcast interface of live broadcast equipment;
and generating a floating layer containing the target subtitle information according to the size of the live broadcast interface.
4. The method of claim 2, wherein before the displaying of the floating layer containing the target subtitle information on the target video frame, the method further comprises:
determining a video background area of the target video frame;
and determining a floating layer display area in the video background area.
5. The method of claim 4, wherein the displaying of the floating layer containing the target subtitle information on the target video frame comprises:
displaying the floating layer in the floating layer display area.
6. A live video playback device comprising:
a first acquisition module, configured to acquire live broadcast data, wherein the live broadcast data comprises audio information and video stream information;
a first generation module, configured to recognize text information corresponding to the audio information and generate subtitle information according to the text information;
a first determining module, configured to determine a start time point and an end time point corresponding to the subtitle information, and determine a video frame corresponding to the start time point and the end time point in the video stream information;
and a storage module, configured to store the corresponding relation between the subtitle information and the corresponding video frame in a preset database, so as to play the live video according to the corresponding relation.
7. The apparatus of claim 6, further comprising:
a second acquisition module, configured to acquire, in response to a live video playing instruction, a target video frame to be played;
a second determining module, configured to query the preset database and determine target subtitle information corresponding to the target video frame;
and a display module, configured to play the target video frame and display a floating layer containing the target subtitle information on the target video frame.
8. The apparatus of claim 7, further comprising:
a third acquisition module, configured to acquire the size of a live broadcast interface of live broadcast equipment;
and a second generation module, configured to generate a floating layer containing the target subtitle information according to the size of the live broadcast interface.
9. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the live video playback method of any of claims 1-5.
10. A non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to execute the live video playback method of any one of claims 1-5.
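As a non-authoritative illustration of the flow recited in claims 1 and 2 (not part of the patent disclosure), the indexing and lookup steps can be sketched in a few lines. The sketch assumes an SQLite table standing in for the "preset database", takes pre-recognized `(text, start_sec, end_sec)` segments in place of a real speech-recognition step, and uses hypothetical function and column names throughout:

```python
import sqlite3

def index_subtitles(segments, fps, db_path=":memory:"):
    """Store the subtitle-to-video-frame correspondence (claim 1 steps).

    `segments` is a list of (text, start_sec, end_sec) tuples, e.g. as
    produced by running speech recognition over the live audio information.
    """
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS subtitles "
        "(start_frame INTEGER, end_frame INTEGER, text TEXT)"
    )
    for text, start, end in segments:
        # Map the start/end time points of the subtitle onto frame indices
        # in the video stream, then persist the correspondence.
        conn.execute(
            "INSERT INTO subtitles VALUES (?, ?, ?)",
            (int(start * fps), int(end * fps), text),
        )
    conn.commit()
    return conn

def subtitle_for_frame(conn, frame_no):
    """Query the database for the subtitle covering a target frame (claim 2)."""
    row = conn.execute(
        "SELECT text FROM subtitles WHERE ? BETWEEN start_frame AND end_frame",
        (frame_no,),
    ).fetchone()
    return row[0] if row else None
```

For example, at `fps=30` a segment `("hello", 1.0, 2.5)` is stored as frames 30 through 75, so a playback request for frame 45 would retrieve `"hello"` as the target subtitle information to show in the floating layer.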
CN202010598110.8A 2020-06-28 2020-06-28 Live video playing method and device Pending CN111901615A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010598110.8A CN111901615A (en) 2020-06-28 2020-06-28 Live video playing method and device


Publications (1)

Publication Number Publication Date
CN111901615A (en) 2020-11-06

Family

ID=73207189

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010598110.8A Pending CN111901615A (en) 2020-06-28 2020-06-28 Live video playing method and device

Country Status (1)

Country Link
CN (1) CN111901615A (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1870728A (en) * 2005-05-23 2006-11-29 北京大学 Method and system for automatic subtilting
US20110063503A1 (en) * 2009-07-06 2011-03-17 Brand Steven M Synchronizing secondary content to a multimedia presentation
CN104918097A (en) * 2015-06-01 2015-09-16 无锡天脉聚源传媒科技有限公司 Subtitle generation method and device
CN105704538A (en) * 2016-03-17 2016-06-22 广东小天才科技有限公司 Method and system for generating audio and video subtitles
CN106331089A (en) * 2016-08-22 2017-01-11 乐视控股(北京)有限公司 Video play control method and system
CN108184135A (en) * 2017-12-28 2018-06-19 泰康保险集团股份有限公司 Method for generating captions and device, storage medium and electric terminal
CN108401192A (en) * 2018-04-25 2018-08-14 腾讯科技(深圳)有限公司 Video stream processing method, device, computer equipment and storage medium
JP6485977B2 (en) * 2017-12-25 2019-03-20 株式会社フェイス Subtitle production apparatus and subtitle production method

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112616062A (en) * 2020-12-11 2021-04-06 北京有竹居网络技术有限公司 Subtitle display method and device, electronic equipment and storage medium
CN112616062B (en) * 2020-12-11 2023-03-10 北京有竹居网络技术有限公司 Subtitle display method and device, electronic equipment and storage medium
CN112738635A (en) * 2020-12-28 2021-04-30 上海掌门科技有限公司 Method and equipment for playing video information
CN112738635B (en) * 2020-12-28 2023-05-05 上海掌门科技有限公司 Method and equipment for playing video information
CN112770146A (en) * 2020-12-30 2021-05-07 广州酷狗计算机科技有限公司 Method, device and equipment for setting content data and readable storage medium
CN112770146B (en) * 2020-12-30 2023-10-03 广州酷狗计算机科技有限公司 Method, device, equipment and readable storage medium for setting content data
CN112817463A (en) * 2021-01-20 2021-05-18 北京百度网讯科技有限公司 Method, equipment and storage medium for acquiring audio data by input method
CN115150631A (en) * 2021-03-16 2022-10-04 北京有竹居网络技术有限公司 Subtitle processing method, subtitle processing device, electronic equipment and storage medium
CN113066498A (en) * 2021-03-23 2021-07-02 上海掌门科技有限公司 Information processing method, apparatus and medium
CN113096643A (en) * 2021-03-25 2021-07-09 北京百度网讯科技有限公司 Video processing method and device
CN113099282A (en) * 2021-03-30 2021-07-09 腾讯科技(深圳)有限公司 Data processing method, device and equipment
CN113660501A (en) * 2021-08-11 2021-11-16 云知声(上海)智能科技有限公司 Method and device for matching subtitles
CN113891108A (en) * 2021-10-19 2022-01-04 北京有竹居网络技术有限公司 Subtitle optimization method and device, electronic equipment and storage medium
CN116017011A (en) * 2021-10-22 2023-04-25 成都极米科技股份有限公司 Subtitle synchronization method, playing device and readable storage medium for audio and video
CN116017011B (en) * 2021-10-22 2024-04-23 成都极米科技股份有限公司 Subtitle synchronization method, playing device and readable storage medium for audio and video
CN114040220A (en) * 2021-11-25 2022-02-11 京东科技信息技术有限公司 Live broadcasting method and device
WO2023093322A1 (en) * 2021-11-25 2023-06-01 京东科技信息技术有限公司 Live broadcast method and device
CN115695889A (en) * 2022-09-30 2023-02-03 聚好看科技股份有限公司 Display device and floating window display method
CN116233540A (en) * 2023-03-10 2023-06-06 北京富通亚讯网络信息技术有限公司 Parallel signal processing method and system based on video image recognition
CN116233540B (en) * 2023-03-10 2024-04-02 北京富通亚讯网络信息技术有限公司 Parallel signal processing method and system based on video image recognition

Similar Documents

Publication Publication Date Title
CN111901615A (en) Live video playing method and device
US20210321157A1 (en) Special effect processing method and apparatus for live broadcasting, and server
CN111163367B (en) Information searching method, device, equipment and medium based on playing video
CN111757158A (en) Audio and video synchronous playing method, device, equipment and storage medium
CN111277861B (en) Method and device for extracting hot spot segments in video
CN111225236B (en) Method and device for generating video cover, electronic equipment and computer-readable storage medium
CN111726682B (en) Video clip generation method, device, equipment and computer storage medium
JP7263660B2 (en) Video processing method, device, electronic device and storage medium
CN111935502A (en) Video processing method, video processing device, electronic equipment and storage medium
CN111654736B (en) Method and device for determining audio and video synchronization error, electronic equipment and storage medium
CN111935551A (en) Video processing method and device, electronic equipment and storage medium
CN111694983A (en) Information display method, information display device, electronic equipment and storage medium
CN110636338B (en) Video definition switching method and device, electronic equipment and storage medium
CN110532404B (en) Source multimedia determining method, device, equipment and storage medium
CN111970560B (en) Video acquisition method and device, electronic equipment and storage medium
CN111949820B (en) Video associated interest point processing method and device and electronic equipment
CN111669647B (en) Real-time video processing method, device and equipment and storage medium
CN111354334B (en) Voice output method, device, equipment and medium
CN110798736B (en) Video playing method, device, equipment and medium
CN113542888B (en) Video processing method and device, electronic equipment and storage medium
CN113873323B (en) Video playing method, device, electronic equipment and medium
EP3955585A1 (en) Video processing method and apparatus, and electronic device and storage medium
CN111901668B (en) Video playing method and device
CN113542802A (en) Video transition method and device
CN111970559A (en) Video acquisition method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
Application publication date: 20201106