CN113225614A - Video playing method, device, server and storage medium - Google Patents


Info

Publication number
CN113225614A
CN113225614A
Authority
CN
China
Prior art keywords
video
target
output
text data
result
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110427518.3A
Other languages
Chinese (zh)
Inventor
朱星龙
张恩勇
Current Assignee
Shenzhen Jiuzhou Electric Appliance Co Ltd
Original Assignee
Shenzhen Jiuzhou Electric Appliance Co Ltd
Priority date
Filing date
Publication date
Application filed by Shenzhen Jiuzhou Electric Appliance Co Ltd filed Critical Shenzhen Jiuzhou Electric Appliance Co Ltd
Priority to CN202110427518.3A priority Critical patent/CN113225614A/en
Publication of CN113225614A publication Critical patent/CN113225614A/en
Pending legal-status Critical Current

Classifications

    • H ELECTRICITY; H04 ELECTRIC COMMUNICATION TECHNIQUE; H04N PICTORIAL COMMUNICATION, e.g. TELEVISION; H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/44 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/435 Processing of additional data, e.g. decrypting of additional data, reconstructing software from modules extracted from the transport stream
    • H04N21/439 Processing of audio elementary streams
    • H04N21/4398 Processing of audio elementary streams involving reformatting operations of audio signals
    • H04N21/4884 Data services, e.g. news ticker, for displaying subtitles
    • H04N21/8547 Content authoring involving timestamps for synchronizing content

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computer Security & Cryptography (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The invention discloses a video playing method for a server, which comprises the following steps: receiving a result video sent by a sending end, wherein the result video is obtained by adding text data to target video data, the text data is obtained by converting voice information in the target video, and the target video is obtained by recording a target user; extracting the text data and the target video from the result video; converting the target video into an output video, and obtaining output subtitles based on the text data; adding the output subtitles to the output video to obtain a result video; and sending the result video to a receiving end so that the receiving end plays the result video with the output subtitles. The invention also discloses a video playing device, a server and a computer-readable storage medium. The user at the receiving end can acquire the information through the text data, so the user experience is better.

Description

Video playing method, device, server and storage medium
Technical Field
The present invention relates to the field of multimedia file processing, and in particular, to a video playing method, apparatus, server, and computer-readable storage medium.
Background
Video conferences and video calls provide users distributed in different places with an all-round perception and control environment comprising various media such as audio, video, pictures and text, and have become an indispensable technology of the modern information society.
In the existing video playing method, a plurality of users respectively record real-time audio and video data by using corresponding sending ends, and send the recorded real-time audio and video data to other users, so as to realize information exchange among different users.
However, the user experience of the existing video playing method is poor.
Disclosure of Invention
The invention mainly aims to provide a video playing method, a video playing device, a server and a computer readable storage medium, and aims to solve the technical problem that the user experience is poor when the existing video playing method is adopted in the prior art.
In order to achieve the above object, the present invention provides a video playing method for a server, the method comprising the following steps:
receiving a result video sent by a sending end, wherein the result video is obtained by adding text data into target video data, the text data is obtained by converting voice information in the target video, and the target video is obtained by recording a target user;
extracting the text data and the target video from the result video;
converting the target video into an output video, and obtaining an output subtitle based on the text data;
adding the output subtitles to the output video to obtain a result video;
and sending the result video to a receiving end so that the receiving end plays the result video and the output subtitles.
Optionally, the result video further includes a target timestamp of the text data in the target video; the step of obtaining an output subtitle based on the text data includes:
and obtaining the output caption based on the text data and the target timestamp.
Optionally, the result video includes a plurality of result videos corresponding to a plurality of target users, one result video corresponds to one target video, one target video corresponds to one text data, and one text data corresponds to one target timestamp; the step of converting the target video into an output video includes:
performing video merging on the plurality of target videos to obtain the output video;
the step of obtaining the output subtitle based on the text data and the target timestamp includes:
obtaining the output subtitle based on the plurality of text data and the plurality of target timestamps.
Optionally, the step of performing video merging on the multiple target videos to obtain the output video includes:
merging the video frames of the target videos to obtain a merged video frame with a first preset resolution;
obtaining the output video with the first preset resolution based on the merged video frame.
Optionally, before the step of obtaining the output subtitle based on the plurality of text data and the plurality of target timestamps, the method further includes:
acquiring position information of a video frame of each target video in the plurality of target videos in the merged video frame;
the step of obtaining the output subtitle based on the plurality of text data and the plurality of target timestamps includes:
obtaining an output subtitle based on the location information, the plurality of text data, and the plurality of target timestamps.
Optionally, before the step of sending the result video to a receiving end to enable the receiving end to play the result video and the output subtitles, the method further includes:
acquiring a second preset resolution of the receiving end;
performing resolution conversion on the result video to obtain a converted video with the second preset resolution, wherein the converted video comprises the output subtitles;
the step of sending the result video to a receiving end so that the receiving end plays the result video and the output subtitles comprises the following steps:
and sending the converted video to a receiving end so that the receiving end plays the converted video and the output subtitles.
Optionally, the step of adding the output subtitles to the output video to obtain a result video includes:
inserting the output subtitles into the output video in the form of supplemental enhancement information or vertical blanking interval information to obtain the result video.
In addition, to achieve the above object, the present invention further provides a video playing apparatus for a server, the apparatus including:
the receiving module is used for receiving a result video sent by a sending end, wherein the result video is obtained by adding text data into target video data, the text data is obtained by converting voice information in the target video, and the target video is obtained by recording a target user;
an extraction module, configured to extract the text data and the target video from the result video;
the conversion module is used for converting the target video into an output video and obtaining an output subtitle based on the text data;
the adding module is used for adding the output subtitles to the output video to obtain a result video;
and the sending module is used for sending the result video to a receiving end so that the receiving end plays the result video and the output subtitles.
In addition, to achieve the above object, the present invention further provides a server, including: the system comprises a memory, a processor and a video playing program stored on the memory and running on the processor, wherein the video playing program realizes the steps of the video playing method according to any item when being executed by the processor.
In addition, to achieve the above object, the present invention further provides a computer readable storage medium, having a video playing program stored thereon, where the video playing program, when executed by a processor, implements the steps of the video playing method as described in any one of the above.
The technical scheme of the invention provides a video playing method for a server, which comprises the following steps: receiving a result video sent by a sending end, wherein the result video is obtained by adding text data to target video data, the text data is obtained by converting voice information in the target video, and the target video is obtained by recording a target user; extracting the text data and the target video from the result video; converting the target video into an output video, and obtaining output subtitles based on the text data; adding the output subtitles to the output video to obtain a result video; and sending the result video to a receiving end so that the receiving end plays the result video with the output subtitles.
In the existing video playing method, when the receiving end plays recorded real-time audio and video, the sound of the audio data may be unclear, so that the user at the receiving end cannot hear the target user, cannot acquire the information, and the user experience is poor. With the video playing method of the invention, the voice information of the target user is converted into text data, output subtitles corresponding to the text data are obtained, and the output subtitles are played when the result video is played, so that the user at the receiving end can obtain the information through the output subtitles and the user experience is better.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below. It is obvious that the drawings in the following description are only some embodiments of the present invention; for those skilled in the art, other drawings can be obtained from the structures shown in these drawings without creative effort.
FIG. 1 is a schematic diagram of a server architecture of a hardware operating environment according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a video playing method according to a first embodiment of the present invention;
FIG. 3 is a block diagram of a video playing apparatus according to a first embodiment of the present invention;
the implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, fig. 1 is a schematic diagram of a server structure of a hardware operating environment according to an embodiment of the present invention.
Typically, the server comprises: at least one processor 301, a memory 302, and a video playback program stored on the memory and executable on the processor, the video playback program being configured to implement the steps of the video playback method as described above.
The processor 301 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and so on. The processor 301 may be implemented in at least one hardware form of a DSP (Digital Signal Processing), an FPGA (Field-Programmable Gate Array), and a PLA (Programmable Logic Array). The processor 301 may also include a main processor and a coprocessor, where the main processor is a processor for processing data in an awake state, and is also called a Central Processing Unit (CPU); a coprocessor is a low power processor for processing data in a standby state. In some embodiments, the processor 301 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content required to be displayed on the display screen. The processor 301 may further include an AI (Artificial Intelligence) processor for processing operations related to the video playback method, so that the video playback method model can be trained and learned autonomously, thereby improving efficiency and accuracy.
Memory 302 may include one or more computer-readable storage media, which may be non-transitory. Memory 302 may also include high speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory 302 is used to store at least one instruction for execution by processor 301 to implement a video playback method provided by method embodiments herein.
In some embodiments, the server may further include: a communication interface 303 and at least one peripheral device. The processor 301, the memory 302 and the communication interface 303 may be connected by a bus or signal lines. Various peripheral devices may be connected to communication interface 303 via a bus, signal line, or circuit board. Specifically, the peripheral device includes: at least one of radio frequency circuitry 304, a display screen 305, and a power source 306.
The communication interface 303 may be used to connect at least one peripheral device related to I/O (Input/Output) to the processor 301 and the memory 302. In some embodiments, processor 301, memory 302, and communication interface 303 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 301, the memory 302 and the communication interface 303 may be implemented on a single chip or circuit board, which is not limited in this embodiment.
The radio frequency circuit 304 is used for receiving and transmitting RF (Radio Frequency) signals, also called electromagnetic signals. The radio frequency circuit 304 communicates with communication networks and other communication devices via electromagnetic signals. The radio frequency circuit 304 converts an electrical signal into an electromagnetic signal for transmission, or converts a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 304 comprises: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so on. The radio frequency circuit 304 may communicate with other terminals via at least one wireless communication protocol. The wireless communication protocols include, but are not limited to: metropolitan area networks, mobile communication networks of various generations (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 304 may further include NFC (Near Field Communication) related circuits, which is not limited in this application.
The display screen 305 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display screen 305 is a touch display screen, the display screen 305 also has the ability to capture touch signals on or over its surface. The touch signal may be input to the processor 301 as a control signal for processing. At this point, the display screen 305 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, there may be one display screen 305, arranged as the front panel of the electronic device; in other embodiments, there may be at least two display screens 305, respectively disposed on different surfaces of the electronic device or in a folded design; in still other embodiments, the display screen 305 may be a flexible display screen disposed on a curved surface or a folded surface of the electronic device. The display screen 305 may even be arranged as a non-rectangular irregular figure, i.e. a shaped screen. The display screen 305 may be made of materials such as an LCD (Liquid Crystal Display) or an OLED (Organic Light-Emitting Diode).
The power supply 306 is used to power various components in the electronic device. The power source 306 may be alternating current, direct current, disposable or rechargeable. When the power source 306 includes a rechargeable battery, the rechargeable battery may support wired or wireless charging. The rechargeable battery may also be used to support fast charge technology.
Those skilled in the art will appreciate that the architecture shown in fig. 1 does not constitute a limitation on the server, which may include more or fewer components than those shown, combine some components, or use a different arrangement of components.
Furthermore, an embodiment of the present invention further provides a computer-readable storage medium on which a video playing program is stored; when the video playing program is executed by a processor, the steps of the video playing method described above are implemented, so a detailed description is omitted here, and the beneficial effects of the same method are likewise not repeated. For technical details not disclosed in the embodiments of the computer-readable storage medium, reference is made to the description of the method embodiments of the present application. The program instructions may be deployed to be executed on one server, on multiple servers at one site, or distributed across multiple sites interconnected by a communication network.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The computer-readable storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
Based on the hardware structure, the embodiment of the video playing method is provided.
Referring to fig. 2, fig. 2 is a schematic flowchart of a video playing method according to a first embodiment of the present invention; the method is used at a server and includes the following steps:
step S11: receiving a result video sent by a sending end, wherein the result video is obtained by adding text data into target video data, the text data is obtained by converting voice information in the target video, and the target video is obtained by recording a target user.
It should be noted that the main execution body of the method is a server, the server installs a video playing program, and when the server executes the video playing program, the video playing method of the invention is realized.
The video playing method is mainly used for instant video communication scenes such as video calls and video conferences. Such instant video communication scenes usually have no subtitle function, and in some specific scenes the voice of a user may be unclear (for example, when a plurality of users speak at the same time in a multi-user video conference, the overlapping audio content may make it difficult for the user at the receiving end to hear clearly).
It can be understood that the target users are all users participating in the video call (or video conference), the transmitting end is the transmitting end corresponding to all users participating in the video call (or video conference), and the receiving end is the receiving end corresponding to all users participating in the video call (or video conference); the structures of the sending end and the receiving end are described with reference to the structure of the server, and the structures are similar and are not described again here.
In the invention, the target video comprises the video picture of the target user and the audio recorded at the same time, that is, the target video comprises target audio. The recorded target audio is continuous, and not all of the information it contains is valid; the valid audio is the voice information contained in the target audio, and when the target audio is converted, only the valid audio (i.e., the voice information) is converted to obtain the text data.
Wherein the text data is inserted into the target video in the form of supplemental enhancement information or vertical blanking interval information to obtain the result video.
In the H264/H265 video compression standards, SEI (supplemental enhancement information) uses the normative structure of the coded stream to insert supplementary information into specific data areas; because the information is carried inside the video itself, video supplementary information can be delivered quickly and efficiently. Similarly, vertical blanking interval information is inserted into specific data areas by using the standard structure of the video coding, and since this information is also contained in the video, it can likewise carry video supplementary information quickly and efficiently.
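As a minimal illustration (not part of the patent), carrying caption text inside an H.264 stream can be sketched as an SEI NAL unit with a user_data_unregistered payload (NAL type 6, payloadType 5). The 16-byte UUID below is a placeholder, and a real encoder would additionally insert emulation-prevention bytes and a start code:

```python
# Sketch: wrap caption text as an H.264 SEI user_data_unregistered payload.
# PLACEHOLDER_UUID is a hypothetical identifier, not one defined by the patent.
PLACEHOLDER_UUID = bytes(16)

def build_sei_nal(text: str) -> bytes:
    payload = PLACEHOLDER_UUID + text.encode("utf-8")
    sei = bytearray([0x06, 0x05])   # NAL type 6 (SEI), payloadType 5
    size = len(payload)
    while size >= 255:              # payload size is coded in steps of 255
        sei.append(255)
        size -= 255
    sei.append(size)
    sei.extend(payload)
    sei.append(0x80)                # rbsp_trailing_bits
    return bytes(sei)
```

A decoder-side extractor would locate NAL units of type 6, check the UUID, and recover the UTF-8 text, which corresponds to the "extracting the text data from the result video" step.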
Step S12: and extracting the text data and the target video from the result video.
Step S13: and converting the target video into an output video, and obtaining an output subtitle based on the text data.
Wherein the result video further comprises a target timestamp of the text data in the target video; the step of obtaining an output subtitle based on the text data includes: obtaining the output subtitle based on the text data and the target timestamp. That is, in this embodiment, the text data is added to the output video in the form of subtitles. The output subtitle carries the target timestamp, and during playback each subtitle is displayed when the time corresponding to its target timestamp arrives.
It can be understood that the target timestamp is the playing time, within the target video, of the voice information corresponding to the text data. For example, if the voice information at 1 minute 10 seconds of the target video is "reporting from the Beijing conference room", the target timestamp of the corresponding text data is 1 minute 10 seconds.
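As an illustrative sketch (not from the patent), pairing each piece of text data with its target timestamp maps naturally onto a standard subtitle format such as SRT; the 3-second display duration here is an assumed value:

```python
# Sketch: turn (target timestamp in seconds, text data) pairs into
# SRT-style subtitle entries, so each caption appears at its timestamp.
def to_srt(entries, duration=3):
    """entries: list of (seconds_offset, text) -> SRT-formatted string."""
    def fmt(t):
        h, rem = divmod(int(t), 3600)
        m, s = divmod(rem, 60)
        return f"{h:02d}:{m:02d}:{s:02d},000"
    blocks = []
    for i, (start, text) in enumerate(sorted(entries), 1):
        blocks.append(f"{i}\n{fmt(start)} --> {fmt(start + duration)}\n{text}\n")
    return "\n".join(blocks)
```

For instance, `to_srt([(70, "reporting from the Beijing conference room")])` yields one entry starting at 00:01:10.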
Typically, in a video call or video conference scenario, there are a plurality of target users, namely: the result video comprises a plurality of result videos corresponding to a plurality of target users respectively, one result video corresponds to one target video, one target video corresponds to one text data, and one text data corresponds to one target timestamp. The step of converting the target video into an output video includes: performing video merging on the plurality of target videos to obtain the output video; correspondingly, the step of obtaining the output subtitle based on the text data and the target timestamp includes: obtaining the output subtitle based on the plurality of text data and the plurality of target timestamps.
Wherein the step of video merging the plurality of target videos to obtain the output video comprises: merging the video frames of the target videos to obtain a merged video frame with a first preset resolution; obtaining the output video with the first preset resolution based on the merged video frame.
It should be noted that, in general, the resolutions of the target videos sent by the sending ends may differ (for example, 1K, 2K or 4K), and they need to be merged into one output video whose resolution is the first preset resolution; preferably, the first preset resolution in this application is 8K. Here, 1K resolution is 1920 × 1080, 2K resolution is 2560 × 1440, 4K resolution is 3840 × 2160, and 8K resolution is 7680 × 4320. The output video is the video composed of the merged video frames.
In a specific application, the merged video frame has a plurality of different display areas, and the different display areas are used for displaying pictures of different target videos. For example, if the target video is 4 target videos, the merged video frame has 4 different display areas, and one display area is used for displaying the picture of one target video.
It can be understood that, when the number of target users is not more than 4, the merged video frame of the output video may be displayed on a single page, where one page of merged video frame is a picture containing the target videos of the multiple target users. When the number of target users exceeds 4, the merged video frames may be displayed on multiple pages (each page displays the merged video frame corresponding to 4 target videos, each page of merged video frame has the first preset resolution, and the multiple pages together cover the target videos of all the target users), and the users can perform a page-turning operation to switch between display pages; alternatively, the video frames of all the target users can be combined into a single-page merged video frame, where the whole page has the first preset resolution. The present invention does not limit the specific display mode.
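The four-user merging step above can be sketched as follows (an illustrative example, not the patent's implementation): four decoded frames of possibly different resolutions are tiled into one 2 × 2 merged frame at the first preset resolution. The nearest-neighbour resize is a simplification of real video scaling, and the 8K default follows the text:

```python
import numpy as np

TARGET_W, TARGET_H = 7680, 4320   # first preset resolution (8K)

def resize(frame, w, h):
    # Nearest-neighbour resize of an HxWx3 frame to h x w (simplified).
    ys = np.arange(h) * frame.shape[0] // h
    xs = np.arange(w) * frame.shape[1] // w
    return frame[ys][:, xs]

def merge_frames(frames, target_w=TARGET_W, target_h=TARGET_H):
    """Tile 4 HxWx3 frames into one target_h x target_w x 3 merged frame."""
    cell_w, cell_h = target_w // 2, target_h // 2
    out = np.zeros((target_h, target_w, 3), dtype=np.uint8)
    for i, f in enumerate(frames):
        r, c = divmod(i, 2)       # quadrant: row-major 2x2 grid
        out[r*cell_h:(r+1)*cell_h, c*cell_w:(c+1)*cell_w] = resize(f, cell_w, cell_h)
    return out
```

Each quadrant's offset within the merged frame is exactly the "position information" used later when placing subtitles.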
Wherein, before the step of obtaining the output subtitle based on the plurality of text data and the plurality of target timestamps, the method further comprises: acquiring position information of a video frame of each target video in the plurality of target videos in the merged video frame; accordingly, the step of obtaining the output subtitle based on the plurality of text data and the plurality of target timestamps includes: obtaining an output subtitle based on the location information, the plurality of text data, and the plurality of target timestamps.
The plurality of text data come from the target videos corresponding to the plurality of target users, and need to be integrated into one output subtitle based on the position information (one position information corresponding to one target video) and the plurality of target timestamps. In the merged video frame of the output video, the pictures corresponding to the target videos (i.e., the video frames of the target videos) occupy different display areas, and the positions of these display areas within the merged video frame constitute the position information.
For example, suppose there are two pieces of text data: text data a of target video A at 0 min 6 s, and text data b of target video B at 1 min 3 s; the display area of target video A is the left area and that of target video B is the right area, which constitutes the position information. An output subtitle is then obtained based on the position information, the plurality of text data and the plurality of target timestamps, with the following content: text data a at 0 min 6 s, displayed in the left area, and text data b at 1 min 3 s, displayed in the right area.
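The integration step in the example above can be sketched as combining (timestamp, text, display area) triples into one time-ordered subtitle track. The entry format below is an assumption for illustration, not the patent's data structure.

```python
# Sketch: merge per-video text data, target timestamps and display-area
# position information into a single output subtitle track.
def build_output_subtitle(items):
    """items: iterable of (timestamp_seconds, text, position) tuples.

    Returns subtitle entries sorted by timestamp, each tagged with the
    display area of its source target video.
    """
    return sorted(
        ({"time": t, "text": text, "area": pos} for t, text, pos in items),
        key=lambda entry: entry["time"],
    )

# The A/B example above: text data a at 0:06 (left), text data b at 1:03 (right).
subtitle = build_output_subtitle([(63, "text data b", "right"),
                                  (6, "text data a", "left")])
```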
Step S14: adding the output subtitles to the output video to obtain a resultant video.
Specifically, the step of adding the output subtitles to the output video to obtain a result video includes: inserting the output subtitles into the output video in a manner of supplemental enhancement information or vertical blanking interval information to obtain the resulting video.
Step S15: and sending the result video to a receiving end so that the receiving end plays the result video and the output subtitles.
In this embodiment, the result video is a video obtained by adding the output subtitle to the output video, so that the output subtitle is played along with the output video. When the receiving end plays the result video, each entry of the output subtitle is displayed automatically when the playback time reaches its corresponding target timestamp.
For example, suppose there are 4 target users, each corresponding to one target video; all 4 target videos acquired from the sending ends are 4K videos, and the content of the 4 corresponding pieces of voice information is as follows:
4K video signal | Information group | Time stamp | Caption content
1 | A | 00:00:04 | Hello, I am the Beijing conference room
2 | B | 00:00:02 | Hello, I am the Shanghai conference room
3 | C | 00:00:05 | Hello, I am the Guangzhou conference room
4 | D | 00:00:06 | Hello, I am the Shenzhen conference room
Here 1, 2, 3 and 4 are the four target videos of the four target users, and A, B, C and D are the names of the text data converted from the corresponding voice information. The sending ends insert the data of A and B into 4K video signals 1 and 2 as supplemental enhancement information, and insert the data of C and D into 4K video signals 3 and 4 as vertical blanking interval information, forming the new 4K video signals 1A, 2B, 3C and 4D, i.e. the plurality of result videos. In another embodiment, the server may instead obtain the corresponding text data directly through speech recognition, without requiring the sending end to extract the text data.
The server parses the text data in the four 4K videos 1A, 2B, 3C and 4D by identifying the supplemental enhancement information and the vertical blanking interval information, and obtains the output subtitle based on the target timestamps (00:00:04, 00:00:02, 00:00:05 and 00:00:06), the text data ("Hello, I am the Beijing conference room", "Hello, I am the Shanghai conference room", "Hello, I am the Guangzhou conference room" and "Hello, I am the Shenzhen conference room") and the position information of the text data (videos 1, 2, 3 and 4 are located at the upper left, upper right, lower left and lower right, respectively). The content of the output subtitle is as follows:
4K video signal | Position | Time stamp | Caption content
1A | Upper left | 00:00:04 | Hello, I am the Beijing conference room
2B | Upper right | 00:00:02 | Hello, I am the Shanghai conference room
3C | Lower left | 00:00:05 | Hello, I am the Guangzhou conference room
4D | Lower right | 00:00:06 | Hello, I am the Shenzhen conference room
At this time, the output video is an 8K video obtained by merging the four 4K videos; it contains the 4 video pictures, whose display areas are the upper left, upper right, lower left and lower right, respectively.
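The 4K-into-8K tiling above can be made concrete with pixel offsets, assuming the common UHD frame sizes 3840x2160 (4K) and 7680x4320 (8K); the function name is illustrative.

```python
# Sketch: top-left pixel offsets for tiling four 4K UHD frames into one
# 8K UHD merged frame, in the upper-left / upper-right / lower-left /
# lower-right layout described above.
TILE_W, TILE_H = 3840, 2160  # 4K UHD frame size

def quadrant_offsets():
    """Return (x, y) top-left offsets of the four 4K tiles in an 8K canvas."""
    return {
        "upper_left": (0, 0),
        "upper_right": (TILE_W, 0),
        "lower_left": (0, TILE_H),
        "lower_right": (TILE_W, TILE_H),
    }
```

Together the four tiles exactly cover the 7680x4320 merged frame.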
In addition, the output subtitle is inserted into the merged video as supplemental enhancement information, in the following manner:
[Figure BDA0003028688190000111: SEI insertion format, figure omitted]
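The patent does not give the byte layout of the insertion. As a hedged sketch, subtitle text could be carried in an H.264 "user data unregistered" SEI message (SEI NAL unit type 6, payload type 5, prefixed by a 16-byte UUID); the builder below omits emulation-prevention bytes, so it is not a fully compliant bitstream writer.

```python
# Sketch: wrap subtitle bytes in a minimal H.264 SEI NAL unit
# (user_data_unregistered). Emulation-prevention (0x03 insertion) is
# deliberately omitted for brevity.
def make_sei_nal(subtitle_utf8: bytes, uuid16: bytes) -> bytes:
    assert len(uuid16) == 16          # user_data_unregistered requires a UUID
    payload = uuid16 + subtitle_utf8
    out = bytearray([0x06, 0x05])     # NAL header (type 6 = SEI), payload type 5
    size = len(payload)
    while size >= 255:                # payload size is ff-coded
        out.append(0xFF)
        size -= 255
    out.append(size)
    out += payload
    out.append(0x80)                  # rbsp_trailing_bits
    return bytes(out)
```

A receiver (or the server here) would scan for SEI NAL units, match the UUID, and recover the subtitle text after the 16-byte prefix.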
The playing of the result video and the output subtitle is displayed as follows:
At playback time 00:00:02: [Figure BDA0003028688190000112, figure omitted]
At playback time 00:00:04: [Figure BDA0003028688190000113, figure omitted]
At playback time 00:00:05: [Figure BDA0003028688190000114, figure omitted]
At playback time 00:00:06: [Figure BDA0003028688190000115, figure omitted]
Further, before the step of sending the result video to a receiving end to enable the receiving end to play the result video and the output subtitles, the method further includes: acquiring a second preset resolution of the receiving end; performing resolution conversion on the result video to obtain a converted video with the second preset resolution, wherein the converted video comprises the output subtitles; the step of sending the result video to a receiving end so that the receiving end plays the result video and the output subtitles comprises the following steps: and sending the converted video to a receiving end so that the receiving end plays the converted video and the output subtitles.
It should be noted that the receiving end may not be able to play the result video at the first preset resolution directly; the server therefore converts it into the converted video, whose resolution is the second preset resolution corresponding to the receiving end (generally the display resolution of the receiving end), so that the receiving end can play the converted video and the output subtitle.
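The resolution-conversion step can be sketched as a nearest-neighbour resample of each frame to the receiving end's second preset resolution. This is a toy illustration on a frame represented as rows of pixel values; a real implementation would use a proper video scaler.

```python
# Sketch: nearest-neighbour resize of one frame (list of pixel rows) from
# the first preset resolution to the receiving end's second preset resolution.
def resize_frame(frame, out_w, out_h):
    in_h, in_w = len(frame), len(frame[0])
    return [
        [frame[y * in_h // out_h][x * in_w // out_w] for x in range(out_w)]
        for y in range(out_h)
    ]
```

Applied per frame, this converts, e.g., an 8K result video down to a 4K or 1080p converted video for the receiving end.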
The technical scheme of the invention provides a video playing method which is used for a server and comprises the following steps: receiving a result video sent by a sending end, wherein the result video is obtained by adding text data into target video data, the text data is obtained by converting voice information in the target video, and the target video is obtained by recording a target user; extracting the text data and the target video from the result video; converting the target video into an output video, and obtaining an output subtitle based on the text data; adding the output subtitles to the output video to obtain a resultant video; and sending the result video to a receiving end so that the receiving end plays the result video and the output subtitles.
In the existing video playing method, when the receiving end plays recorded real-time audio and video, the audio may be unclear, so the user at the receiving end cannot hear the target user and therefore cannot obtain the information, resulting in poor user experience. With the video playing method above, the voice information of the target user is converted into text data, an output subtitle corresponding to the text data is obtained, and the output subtitle is played together with the result video, so the user at the receiving end can obtain the information through the output subtitle, giving a good user experience.
Referring to fig. 3, fig. 3 is a block diagram of a first embodiment of a video playing apparatus according to the present invention; the apparatus is used for a server, and the apparatus includes:
the receiving module 10 is configured to receive a result video sent by a sending end, where the result video is obtained by adding text data to target video data, the text data is obtained by converting voice information in a target video, and the target video is obtained by recording a target user;
an extracting module 20, configured to extract the text data and the target video from the result video;
a conversion module 30, configured to convert the target video into an output video, and obtain an output subtitle based on the text data;
an adding module 40, configured to add the output subtitles to the output video to obtain a result video;
a sending module 50, configured to send the result video to a receiving end, so that the receiving end plays the result video and the output subtitles.
The above description is only an alternative embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications and equivalents of the present invention, which are made by the contents of the present specification and the accompanying drawings, or directly/indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. A video playing method, for a server, the method comprising the steps of:
receiving a result video sent by a sending end, wherein the result video is obtained by adding text data into target video data, the text data is obtained by converting voice information in the target video, and the target video is obtained by recording a target user;
extracting the text data and the target video from the result video;
converting the target video into an output video, and obtaining an output subtitle based on the text data;
adding the output subtitles to the output video to obtain a resultant video;
and sending the result video to a receiving end so that the receiving end plays the result video and the output subtitles.
2. The method of claim 1, wherein the result video further includes a target timestamp of the text data in the target video; the step of obtaining an output subtitle based on the text data includes:
and obtaining the output caption based on the text data and the target timestamp.
3. The method of claim 2, wherein the result video comprises a plurality of result videos corresponding to a plurality of target users, respectively, one result video corresponding to one target video, one target video corresponding to one text data, one text data corresponding to one target timestamp; the step of converting the target video into an output video includes:
performing video merging on the plurality of target videos to obtain the output video;
the step of obtaining the output subtitle based on the text data and the target timestamp includes:
obtaining the output subtitle based on the plurality of text data and the plurality of target timestamps.
4. The method of claim 3, wherein said step of video merging said plurality of target videos to obtain said output video comprises:
merging the video frames of the target videos to obtain a merged video frame with a first preset resolution;
obtaining the output video with the first preset resolution based on the merged video frame.
5. The method of claim 4, wherein prior to the step of obtaining the output subtitle based on the plurality of text data and the plurality of target timestamps, the method further comprises:
acquiring position information of a video frame of each target video in the plurality of target videos in the merged video frame;
the step of obtaining the output subtitle based on the plurality of text data and the plurality of target timestamps includes:
obtaining an output subtitle based on the location information, the plurality of text data, and the plurality of target timestamps.
6. The method of claim 5, wherein prior to the step of transmitting the resulting video to a receiving end for the receiving end to play the resulting video and the output subtitles, the method further comprises:
acquiring a second preset resolution of the receiving end;
performing resolution conversion on the result video to obtain a converted video with the second preset resolution, wherein the converted video comprises the output subtitles;
the step of sending the result video to a receiving end so that the receiving end plays the result video and the output subtitles comprises the following steps:
and sending the converted video to a receiving end so that the receiving end plays the converted video and the output subtitles.
7. The method of claim 6, wherein the step of adding the output subtitles to the output video to obtain a resultant video comprises:
inserting the output subtitles into the output video in a manner of supplemental enhancement information or vertical blanking interval information to obtain the resulting video.
8. A video playback apparatus, for a server, the apparatus comprising:
the receiving module is used for receiving a result video sent by a sending end, wherein the result video is obtained by adding text data into target video data, the text data is obtained by converting voice information in the target video, and the target video is obtained by recording a target user;
an extraction module, configured to extract the text data and the target video from the result video;
the conversion module is used for converting the target video into an output video and obtaining an output subtitle based on the text data;
the adding module is used for adding the output subtitles to the output video to obtain a result video;
and the sending module is used for sending the result video to a receiving end so that the receiving end plays the result video and the output subtitles.
9. A server, characterized in that the server comprises: memory, processor and a video playback program stored on the memory and running on the processor, the video playback program when executed by the processor implementing the steps of the video playback method as claimed in any one of claims 1 to 7.
10. A computer-readable storage medium, having a video playback program stored thereon, which when executed by a processor implements the steps of the video playback method according to any one of claims 1 to 7.
CN202110427518.3A 2021-04-20 2021-04-20 Video playing method, device, server and storage medium Pending CN113225614A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110427518.3A CN113225614A (en) 2021-04-20 2021-04-20 Video playing method, device, server and storage medium


Publications (1)

Publication Number Publication Date
CN113225614A true CN113225614A (en) 2021-08-06

Family

ID=77088096

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110427518.3A Pending CN113225614A (en) 2021-04-20 2021-04-20 Video playing method, device, server and storage medium

Country Status (1)

Country Link
CN (1) CN113225614A (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100039498A1 (en) * 2007-05-17 2010-02-18 Huawei Technologies Co., Ltd. Caption display method, video communication system and device
CN105700848A (en) * 2014-12-12 2016-06-22 三星电子株式会社 Device and method for controlling sound output
CN108134918A (en) * 2018-01-30 2018-06-08 苏州科达科技股份有限公司 Method for processing video frequency, device and multipoint video processing unit, conference facility
CN110740283A (en) * 2019-10-29 2020-01-31 杭州当虹科技股份有限公司 method for converting voice into character based on video communication
CN111107299A (en) * 2019-12-05 2020-05-05 视联动力信息技术股份有限公司 Method and device for synthesizing multi-channel video
CN112399133A (en) * 2016-09-30 2021-02-23 阿里巴巴集团控股有限公司 Conference sharing method and device
CN112532931A (en) * 2020-11-20 2021-03-19 北京搜狗科技发展有限公司 Video processing method and device and electronic equipment
CN112584078A (en) * 2019-09-27 2021-03-30 深圳市万普拉斯科技有限公司 Video call method, video call device, computer equipment and storage medium



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination