CN111031373A - Video playing method and device, electronic equipment and computer readable storage medium - Google Patents


Info

Publication number
CN111031373A
Authority
CN
China
Prior art keywords
video
video stream
user
information
playing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911336145.8A
Other languages
Chinese (zh)
Inventor
王骎
刘勇
齐萌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201911336145.8A priority Critical patent/CN111031373A/en
Publication of CN111031373A publication Critical patent/CN111031373A/en
Pending legal-status Critical Current

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/41Structure of client; Structure of client peripherals
    • H04N21/422Input-only peripherals, i.e. input devices connected to specially adapted client devices, e.g. global positioning system [GPS]
    • H04N21/42203Input-only peripherals, i.e. input devices connected to specially adapted client devices, e.g. global positioning system [GPS] sound input device, e.g. microphone
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47End-user applications
    • H04N21/472End-user interface for requesting content, additional data or services; End-user interface for interacting with content, e.g. for content reservation or setting reminders, for requesting event notification, for manipulating displayed content
    • H04N21/47202End-user interface for requesting content, additional data or services; End-user interface for interacting with content, e.g. for content reservation or setting reminders, for requesting event notification, for manipulating displayed content for requesting content on demand, e.g. video on demand
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47End-user applications
    • H04N21/488Data services, e.g. news ticker
    • H04N21/4882Data services, e.g. news ticker for displaying messages, e.g. warnings, reminders

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Databases & Information Systems (AREA)
  • Human Computer Interaction (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The application discloses a video playing method and apparatus, an electronic device, and a computer-readable storage medium, relating to the field of video playing. The specific implementation scheme is as follows: voice information input by a user is acquired during the playing of a first video stream; a target video stream is selected, according to the voice information, from a plurality of second video streams associated with the first video stream, where the first video stream and each second video stream are located on different branches of a target video that adopts a tree structure; and the target video stream is played. The embodiments of the application enrich the ways in which a user can interact with a playing video, strengthen the user's sense of participation while watching, and improve the viewing experience.

Description

Video playing method and device, electronic equipment and computer readable storage medium
Technical Field
The application relates to the field of computer technology, and in particular to video playing technology.
Background
At present, when a user needs to adjust what is being played while watching a video, the adjustment can usually only be made manually, for example by clicking a pop-up selection box. The existing way of adjusting video playing content is therefore limited, and the user experience is poor.
Disclosure of Invention
The embodiments of the application provide a video playing method, a video playing apparatus, an electronic device, and a computer-readable storage medium, so as to solve the problem that the existing way of adjusting video playing content is limited.
In order to solve the technical problem, the present application is implemented as follows:
in a first aspect, an embodiment of the present application provides a video playing method, including:
acquiring voice information input by a user in the playing process of the first video stream;
selecting a target video stream from a plurality of second video streams associated with the first video stream according to the voice information; the first video stream and each second video stream are positioned on different branches of a target video, and the target video adopts a tree structure;
and playing the target video stream.
Therefore, compared with the prior art, in which the playing content of a video can be adjusted only manually, this enriches the ways in which the user can interact with the playing video, strengthens the user's sense of participation while watching, and improves the viewing experience.
Optionally, the acquiring the voice information input by the user includes:
acquiring mark information of the first video stream; wherein the marking information comprises a video branch condition corresponding to the first video stream;
prompting the user for the video branch condition;
and receiving the voice information input by the user based on the prompt information.
Therefore, the user can conveniently select the required video stream to play.
Optionally, after obtaining the mark information of the first video stream and before selecting the target video stream, the method further includes:
asynchronously acquiring the plurality of second video streams according to the video branch condition;
preloading the plurality of second video streams;
wherein the target video stream is selected from a preloaded plurality of second video streams.
Therefore, by means of the preloading process, the fluency of video switching can be ensured, and therefore after the target video stream is selected based on user input, the currently played video stream can be smoothly switched to the target video stream for playing.
Optionally, the acquiring the voice information input by the user includes:
collecting environmental sound information, wherein the environmental sound information comprises the voice information and environmental noise information;
and inputting the environmental sound information into a pre-trained voice extraction model corresponding to the user to obtain the voice information.
Thus, by means of the pre-trained speech extraction model, the speech information input by the user can be accurately identified from the noisy environment sound.
Optionally, the prompting the user of the video branching condition includes:
prompting the user of the video branch condition in at least one of the following modes:
voice broadcast mode, text display mode.
Therefore, the effect of quickly and accurately prompting the user can be achieved by means of voice broadcasting and/or text display modes.
In a second aspect, an embodiment of the present application provides a video playing apparatus, including:
the first acquisition module is used for acquiring voice information input by a user in the playing process of the first video stream;
a selection module for selecting a target video stream from a plurality of second video streams associated with the first video stream according to the voice information; the first video stream and each second video stream are positioned on different branches of a target video, and the target video adopts a tree structure;
and the playing module is used for playing the target video stream.
Optionally, the first obtaining module includes:
an acquisition unit configured to acquire tag information of the first video stream; wherein the marking information comprises a video branch condition corresponding to the first video stream;
the prompting unit is used for prompting the video branch condition of a user;
and the receiving unit is used for receiving the voice information input by the user based on the prompt information.
Optionally, the apparatus further comprises:
a second obtaining module, configured to obtain the tag information of the first video stream, and then asynchronously obtain the plurality of second video streams according to the video branching condition;
and the loading module is used for preloading the plurality of second video streams so as to select a target video stream from the preloaded plurality of second video streams.
Optionally, the first obtaining module includes:
the system comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring environmental sound information, and the environmental sound information comprises the voice information and environmental noise information;
and the extracting unit is used for inputting the environmental sound information into a pre-trained voice extracting model corresponding to the user to obtain the voice information.
Optionally, the prompting unit is specifically configured to:
prompting the user of the video branch condition in at least one of the following modes:
voice broadcast mode, text display mode.
In a third aspect, an embodiment of the present application further provides an electronic device, including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a video playback method as described above.
In a fourth aspect, the present application further provides a non-transitory computer-readable storage medium storing computer instructions, where the computer instructions are configured to cause the computer to execute the video playing method described above.
One embodiment of the above application has the following advantages or benefits: it enriches the ways in which a user can interact with a playing video, strengthens the user's sense of participation while watching, and improves the viewing experience. The technical means adopted are as follows: according to voice information input by the user, a target video stream is selected from a plurality of second video streams associated with the currently played first video stream, where the first video stream and each second video stream are located on different branches of the target video, and the selected target video stream is played. This solves the technical problem that the existing way of adjusting video playing content is limited, and achieves the technical effect of enriching user interaction with the playing video, enhancing the sense of participation while watching, and improving the viewing experience.
Other effects of the above-described alternative will be described below with reference to specific embodiments.
Drawings
The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. Wherein:
fig. 1 is a flowchart of a video playing method according to an embodiment of the present application;
FIG. 2 is a schematic flow chart illustrating a video uploading and playing process according to an embodiment of the present application;
fig. 3 is a block diagram of a video playback device for implementing a video playback method according to an embodiment of the present application;
fig. 4 is a block diagram of an electronic device for implementing a video playing method according to an embodiment of the present application.
Detailed Description
The following description of the exemplary embodiments of the present application, taken in conjunction with the accompanying drawings, includes various details of the embodiments of the application for the understanding of the same, which are to be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present application. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Referring to fig. 1, fig. 1 is a flowchart of a video playing method provided in an embodiment of the present application, applied to an electronic device, and as shown in fig. 1, the method includes the following steps:
step 101: and acquiring voice information input by a user in the playing process of the first video stream.
In this embodiment, the user may input the voice information in a noisy environment. In order to accurately acquire the voice information input by the user, the acquisition process in step 101 may include: first, collecting environmental sound information, where the environmental sound information includes the voice information input by the user and environmental noise information (such as the sound of the playing video and other ambient noise); and then inputting the collected environmental sound information into a pre-trained voice extraction model corresponding to the user to obtain the voice information.
Understandably, the environmental sound information can be collected by a sound collector such as a microphone. The voice extraction model is mainly used to identify the voice information of the corresponding user and can be understood as a voice filtering module; the voice extraction models of different users are usually different. Thus, by means of the pre-trained voice extraction model, the voice information input by the user can be accurately identified from noisy environmental sound.
In one embodiment, taking user A as an example, the training process of the voice extraction model corresponding to user A may include the following. First, some common voice samples of user A are collected, together with dialogue audio VF containing those voice samples plus noise. Then, the collected voice samples are processed with a trained model for extracting the user's voiceprint features (also called a speaker encoder): a voice sample is input and a feature vector SV is output, with one feature vector SV per voice sample; when a plurality of feature vectors SV are obtained, the average of the L2-regularized SVs can be used as the voiceprint feature PF of user A. In one embodiment, the network used by the voiceprint-feature model may be a three-layer network based on Long Short-Term Memory (LSTM) trained with a Generalized Estimation Equation (GEE) loss; the input speech may take the form of a 1600 ms log-Mel spectrum of the audio, and the output feature vector SV may have a width of X. Finally, based on a pre-constructed basic model, the voiceprint feature PF and the dialogue audio VF are taken as input and the voice sample (i.e. the voice of user A with irrelevant noise removed) as output, and the voice extraction model corresponding to user A is obtained by training. For example, the basic model may be a network of time-dimension masks: a soft mask may be generated and multiplied with the magnitude spectrum of the noisy dialogue audio VF to produce an enhanced magnitude spectrum; the phase of the noisy audio is then attached to the enhanced magnitude spectrum, and the enhanced audio with irrelevant audio removed is obtained by the Inverse Short-Time Fourier Transform (ISTFT).
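The mask-and-ISTFT enhancement step described above can be sketched as follows. This is a minimal illustration only: the soft mask here is a random placeholder standing in for the model's output, and the sample rate, window size, and signal are illustrative assumptions rather than values from the patent.

```python
# Sketch of the time-dimension-mask enhancement step: multiply a soft mask
# with the noisy magnitude spectrum, reattach the noisy phase, invert with ISTFT.
import numpy as np
from scipy.signal import stft, istft

fs = 16000
noisy = np.random.randn(fs)                 # stand-in for the dialogue audio VF

f, t, spec = stft(noisy, fs=fs, nperseg=400)
magnitude, phase = np.abs(spec), np.angle(spec)

# Placeholder for the mask the trained basic model would output, values in [0, 1].
soft_mask = np.random.rand(*magnitude.shape)

enhanced_spec = (soft_mask * magnitude) * np.exp(1j * phase)  # reattach noisy phase
_, enhanced = istft(enhanced_spec, fs=fs, nperseg=400)        # enhanced audio
```

Because the mask is bounded by 1, each bin of the enhanced magnitude spectrum is no larger than the corresponding noisy bin, which is the attenuating behavior the trained mask is meant to have for non-target audio.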
Step 102: selecting a target video stream from a plurality of second video streams associated with the first video stream according to the voice information.
In this embodiment, the first video stream and each second video stream are located on different branches of the target video, and the target video adopts a tree structure. That the target video adopts a tree structure means that the video stream index of the target video uses a tree storage structure to realize a tree-shaped video playing path. The target video may include video streams on a plurality of different branches, and playback may switch between them, for example from the video stream of the current branch to a video stream of the next-level branch. The video streams of the different branches may be set according to predefined rules.
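The tree-shaped video stream index described above can be sketched as a simple node structure. The class and field names below are illustrative assumptions for exposition, not identifiers from the patent.

```python
# Hypothetical sketch of a tree-shaped video-stream index: each node is one
# branch of the target video, and its children are the next-level branches.
from dataclasses import dataclass, field


@dataclass
class VideoStreamNode:
    stream_id: str
    url: str
    branch_condition: str = ""              # e.g. a plot prompt shown to the user
    children: list["VideoStreamNode"] = field(default_factory=list)

    def add_branch(self, child: "VideoStreamNode") -> "VideoStreamNode":
        self.children.append(child)
        return child


# Root branch plays first; sub-branches correspond to different plot trends.
root = VideoStreamNode("stream_1", "https://example.com/s1.m3u8",
                       branch_condition="Choose ending 1 or ending 2")
ending1 = root.add_branch(VideoStreamNode("stream_2", "https://example.com/s2.m3u8"))
ending2 = root.add_branch(VideoStreamNode("stream_3", "https://example.com/s3.m3u8"))
```

Switching branches then amounts to walking from the current node to one of its children once the user's choice is known.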
For example, when a video uploader uploads a video, the video in the parallel video list to be uploaded can be uploaded in a dragging manner according to a custom rule such as a plot development trend, so as to form a tree-shaped video structure and control the video contents of different branches. For example, in a tree-like video structure, the video content on the root branch is the first playing part, and the video content on different subbranches corresponds to different plot trends.
Optionally, in this embodiment, the target video stream may be selected through recognition of the user's intention. The process of selecting the target video stream in step 102 may be: firstly, performing intention recognition on voice information input by a user to obtain the intention of the user; then, a target video stream is selected from a plurality of second video streams associated with the first video stream according to the user intent. When the intention recognition is performed, the voice information to be recognized (i.e. the voice information input by the user) can be input into the intention recognition model trained in advance, and the user intention matched with the voice information can be obtained. For the training process of the intention recognition model, an existing method may be adopted, and this embodiment does not limit this. Therefore, through the identification of the user intention, the video stream of the corresponding branch can be dynamically switched to be played according to the user intention, and the video playing of the personalized visual angle is presented to the user.
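The selection step can be sketched as follows. Simple keyword matching stands in here for the pre-trained intention recognition model the embodiment mentions; the branch table and phrases are made-up examples.

```python
# Illustrative selection of a target stream from the second video streams:
# match the recognized user intent text against per-branch keywords.
from typing import Optional


def select_target_stream(intent_text: str,
                         branches: dict[str, list[str]]) -> Optional[str]:
    """branches maps stream_id -> keywords describing that branch."""
    intent = intent_text.lower()
    for stream_id, keywords in branches.items():
        if any(kw in intent for kw in keywords):
            return stream_id
    return None  # no branch matched; keep playing the current stream


branches = {
    "stream_2": ["ending 1", "happy"],
    "stream_3": ["ending 2", "sad"],
}
choice = select_target_stream("play the happy ending", branches)
```

A real system would replace the keyword table with the intent model's output, but the control flow (recognized intent in, branch stream id out) is the same.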
Step 103: and playing the target video stream.
In one embodiment, taking the target video including video stream 1, video stream 2 and video stream 3 as an example, assuming that video stream 1 is the beginning part of the target video and is the description part of event a, and video stream 2 and video stream 3 are the video parts respectively associated with video stream 1 and located in the next branch, and represent different ends of event a, during the playing of video stream 1, one of video stream 2 and video stream 3 can be selected for playing by means of the voice information input by the user.
In another embodiment, taking the target video as a sports game video as an example: the game video includes a video stream 0 stored at the root node (here, a summary of the game), and a video stream 3 and a video stream 4 stored at different child nodes, where video stream 3 is the game content narrated in language 1 and video stream 4 is the game content narrated in language 2. During the playing of video stream 0, if the user wishes to hear language 1, video stream 3 can be played by inputting corresponding voice information, such as "watch video stream 3"; if the user wishes to hear language 2, video stream 4 can be played by inputting corresponding voice information, such as "watch video stream 4".
According to the video playing method, the target video stream can be selected from the plurality of second video streams associated with the currently played first video stream according to the voice information input by the user, the first video stream and each second video stream are located on different branches of the target video, and the selected target video stream is played. Therefore, compared with the prior art that the playing content of the video can be adjusted only in a manual mode, the interactive mode of the user and the playing video can be enriched, the participation sense of the user when watching the video is enhanced, and the watching experience of the user is improved.
In this embodiment, in order to facilitate the user to select a desired video stream for playing, the video stream in the target video may be marked, and the marking information includes indication information indicating that the video stream is a part of the interactive video, a video branch condition corresponding to the video stream, and the like, so as to obtain the marking information in the playing process of the video stream and prompt based on the marking information. The video branching condition may be set based on a user requirement, for example, a scenario branching condition or a video version branching condition, which is not limited in this embodiment.
Optionally, the process of acquiring the voice information input by the user in step 101 may include:
acquiring mark information of the first video stream; wherein the marking information comprises a video branch condition corresponding to the first video stream;
Prompting the user for the video branch condition;
and receiving the voice information input by the user based on the prompt information.
In a specific implementation process, when prompting the corresponding video branch condition, at least one of a voice broadcasting mode and a text display mode can be adopted for prompting. Therefore, the effect of quickly and accurately prompting the user can be achieved.
In one embodiment, take a target video that includes a video stream 1, a video stream 2, and a video stream 3, where video stream 1 is the beginning portion of the target video and describes an event A, and the video branching condition of video stream 1 is a scenario branching condition. The scenario branching condition is that video stream 1 has two associated video portions on the next-level branch, namely video stream 2 and video stream 3, which represent ending 1 and ending 2 of event A respectively. During the playing of video stream 1, the terminal may obtain the scenario branching condition of video stream 1 and prompt it, so that the user can select, based on the prompt information, whether to play video stream 2 or video stream 3.
Further, after obtaining the mark information of the first video stream, before selecting the target video stream, the method may further include: and asynchronously acquiring the plurality of second video streams according to the video branch condition, and preloading the plurality of second video streams. Thereafter, a target video stream may be selected from the preloaded plurality of second video streams. Therefore, by means of the preloading process, the fluency of video switching can be ensured, and therefore after the target video stream is selected based on user input, the currently played video stream can be smoothly switched to the target video stream for playing.
In one embodiment, to ensure the fluency of dynamic video switching, after the video stream of the parent node (i.e. the parent branch) is played to a certain progress, all the video streams of the corresponding child nodes (i.e. the child branches) may be asynchronously pulled for preloading, so that a target video stream may be selected from the preloaded video streams based on the voice information input by the user, and smoothly switched to the target video stream for playing.
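The asynchronous preload can be sketched as below. `fetch_stream` is a made-up stand-in for the real media pull; the point of the sketch is that all child-branch streams are fetched concurrently once the parent branch reaches its trigger point, so the later switch does not stall.

```python
# Hypothetical sketch of asynchronously pulling all child-node video streams
# for preloading, so the selected one can be switched to smoothly.
import asyncio


async def fetch_stream(stream_id: str) -> bytes:
    await asyncio.sleep(0)                  # placeholder for network I/O
    return f"data:{stream_id}".encode()


async def preload_children(child_ids: list[str]) -> dict[str, bytes]:
    # gather() starts all fetches concurrently rather than one after another.
    payloads = await asyncio.gather(*(fetch_stream(s) for s in child_ids))
    return dict(zip(child_ids, payloads))


cache = asyncio.run(preload_children(["stream_2", "stream_3"]))
```

After the user's voice input is resolved to a target stream, playback reads from `cache` instead of starting a fresh network fetch.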
In addition, in order to facilitate the user to select a desired video stream for playing, the user may be prompted to perform an input operation based on the preset branch node in this embodiment. That is, before acquiring the voice information input by the user, the method further includes: detecting whether a target video is played to a preset branch node or not; and prompting a user to execute the operation of inputting voice information under the condition that the target video is played to the preset branch node. It should be noted that the predetermined branch node may represent a switching point between video streams of different branches. For example, taking the target video including video stream 1, video stream 2 and video stream 3 as an example, assuming that video stream 1 is a beginning portion of the target video and is a description portion of event a, and video stream 2 and video stream 3 are video portions at different branches respectively associated with video stream 1 and represent different ends of event a, the preset branch node may be selected as a branch end point corresponding to video stream 1.
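The branch-node check above can be sketched as a simple progress comparison. The times, tolerance, and function name are illustrative assumptions, not values from the patent.

```python
# Illustrative detection of the preset branch node: once playback progress is
# within a small tolerance of the branch end point of the current stream, the
# player prompts the user to input voice information.

def should_prompt(position_s: float, branch_point_s: float,
                  tolerance_s: float = 0.5) -> bool:
    """True once playback is within tolerance_s seconds of the branch point."""
    return position_s >= branch_point_s - tolerance_s


branch_point = 120.0                        # e.g. the branch end point of video stream 1
```

In practice the player would poll `should_prompt` on each progress tick and, on the first True, issue the voice-broadcast or text prompt described earlier.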
The video uploading and playing process in the embodiment of the present application is described below with reference to fig. 2.
In the embodiment of the present application, as shown in fig. 2, for a video uploading end, a video uploading person (or a video creator) may record a video at first, and then edit the recorded video, for example, set video stream contents of different branches of the video according to a user-defined rule, and store a video stream index by using a tree-shaped storage structure, so as to implement a tree-shaped video playing path, and upload the edited video to a dynamic streaming media system. For a video viewer, the speech extraction model may be trained first, such as using the method described above; then, if the user carries out voice input when watching the video, acquiring environmental sound information comprising the voice information of the user, and extracting the voice information of the user in the environmental sound information by utilizing a pre-trained voice extraction model to obtain the voice information of the user; finally, intention recognition is carried out on the voice information of the user to obtain the intention of the user, a video stream scheduling strategy is determined according to the intention of the user by means of a dynamic streaming media system to select the target video stream, and the selected target video stream is pushed to a video watching end to be played.
Referring to fig. 3, fig. 3 is a block diagram of a video playing apparatus for implementing a video playing method according to an embodiment of the present application, and as shown in fig. 3, the video playing apparatus 30 includes:
a first obtaining module 31, configured to obtain voice information input by a user in a playing process of a first video stream;
a selection module 32, configured to select a target video stream from a plurality of second video streams associated with the first video stream according to the voice information; the first video stream and each second video stream are positioned on different branches of a target video, and the target video adopts a tree structure;
and a playing module 33, configured to play the target video stream.
Optionally, the first obtaining module includes:
an acquisition unit configured to acquire tag information of the first video stream; wherein the marking information comprises a video branch condition corresponding to the first video stream;
the prompting unit is used for prompting the video branch condition of a user;
and the receiving unit is used for receiving the voice information input by the user based on the prompt information.
Optionally, the apparatus further comprises:
a second obtaining module, configured to obtain the tag information of the first video stream, and then asynchronously obtain the plurality of second video streams according to the video branching condition;
and the loading module is used for preloading the plurality of second video streams so as to select a target video stream from the preloaded plurality of second video streams.
Optionally, the first obtaining module includes:
the system comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring environmental sound information, and the environmental sound information comprises the voice information and environmental noise information;
and the extracting unit is used for inputting the environmental sound information into a pre-trained voice extracting model corresponding to the user to obtain the voice information.
Optionally, the prompting unit is specifically configured to:
prompting the user of the video branch condition in at least one of the following modes:
voice broadcast mode, text display mode.
It can be understood that the video playing apparatus 30 according to the embodiment of the present application can implement each process implemented in the method embodiment shown in fig. 1 and achieve the same beneficial effects, and for avoiding repetition, details are not repeated here.
According to an embodiment of the present application, an electronic device and a readable storage medium are also provided.
Fig. 4 is a block diagram of an electronic device for implementing a video playing method according to an embodiment of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the present application that are described and/or claimed herein.
As shown in fig. 4, the electronic device includes: one or more processors 401, a memory 402, and interfaces for connecting the various components, including high-speed interfaces and low-speed interfaces. The various components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions for execution within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output apparatus (such as a display device coupled to the interface). In other embodiments, multiple processors and/or multiple buses may be used, along with multiple memories, as desired. Also, multiple electronic devices may be connected, with each device providing portions of the necessary operations (e.g., as a server array, a group of blade servers, or a multi-processor system). In fig. 4, one processor 401 is taken as an example.
Memory 402 is a non-transitory computer readable storage medium as provided herein. The memory stores instructions executable by at least one processor to cause the at least one processor to execute the video playing method provided by the application. The non-transitory computer-readable storage medium of the present application stores computer instructions for causing a computer to execute the video playback method provided by the present application.
The memory 402, as a non-transitory computer readable storage medium, may be used to store non-transitory software programs, non-transitory computer executable programs, and modules, such as the program instructions/modules corresponding to the video playing method in the embodiment of the present application (e.g., the first obtaining module 31, the selecting module 32, and the playing module 34 shown in fig. 3). The processor 401 executes the various functional applications and data processing of the server by running the non-transitory software programs, instructions, and modules stored in the memory 402, that is, implements the video playing method in the above-described method embodiment.
The memory 402 may include a storage program area and a storage data area, wherein the storage program area may store an operating system and an application program required by at least one function, and the storage data area may store data created according to the use of the electronic device, and the like. Further, the memory 402 may include high speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory 402 may optionally include memory located remotely from the processor 401; such remote memory may be connected to the electronic device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device of the video playing method may further include: an input device 403 and an output device 404. The processor 401, the memory 402, the input device 403 and the output device 404 may be connected by a bus or other means, and fig. 4 illustrates an example of a connection by a bus.
The input device 403 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic apparatus of the video playing method; such input devices include, for example, a touch screen, a keypad, a mouse, a track pad, a touch pad, a pointing stick, one or more mouse buttons, a track ball, and a joystick. The output device 404 may include a display device, auxiliary lighting devices (e.g., LEDs), haptic feedback devices (e.g., vibrating motors), and the like. The display device may include, but is not limited to, a Liquid Crystal Display (LCD), a Light Emitting Diode (LED) display, and a plasma display. In some implementations, the display device can be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, ASICs (application-specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also known as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
According to the technical solutions of the embodiments of the present application, the ways in which a user can interact with a playing video are enriched, the user's sense of participation while watching the video is enhanced, and the technical effect of improving the user's viewing experience is achieved.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present application can be achieved, and no limitation is imposed herein.
The above-described embodiments should not be construed as limiting the scope of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (12)

1. A video playback method, comprising:
acquiring voice information input by a user in the playing process of the first video stream;
selecting a target video stream from a plurality of second video streams associated with the first video stream according to the voice information; the first video stream and each second video stream are positioned on different branches of a target video, and the target video adopts a tree structure;
and playing the target video stream.
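For illustration only, and not as part of the claimed subject matter, the flow of claim 1 may be sketched as follows. All names, keywords, and stream URLs here are hypothetical; the sketch assumes the recognized speech has already been converted to text and that each branch of the tree-structured target video is keyed by a keyword:

```python
from dataclasses import dataclass, field

@dataclass
class VideoNode:
    """One branch of a tree-structured target video (hypothetical)."""
    stream_url: str
    prompt: str = ""                               # branch condition shown to the user
    children: dict = field(default_factory=dict)   # keyword -> VideoNode

def select_branch(node: VideoNode, voice_text: str):
    """Select the child (second video stream) whose keyword appears in the
    recognized speech; return None when no branch condition is met."""
    for keyword, child in node.children.items():
        if keyword in voice_text:
            return child
    return None

# A two-branch story where the viewer says "left" or "right".
root = VideoNode("intro.m3u8", "Say 'left' or 'right'", {
    "left": VideoNode("left.m3u8"),
    "right": VideoNode("right.m3u8"),
})

target = select_branch(root, "let's go left please")   # -> the "left" branch
```

A real implementation would feed `target.stream_url` to the player and repeat the selection at every branch point of the tree.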
2. The method of claim 1, wherein the obtaining the voice information input by the user comprises:
acquiring marking information of the first video stream; wherein the marking information comprises a video branch condition corresponding to the first video stream;
prompting the user for the video branch condition;
and receiving the voice information input by the user based on the prompt information.
3. The method of claim 2, wherein after the obtaining the tag information of the first video stream and before the selecting the target video stream, the method further comprises:
asynchronously acquiring the plurality of second video streams according to the video branch condition;
preloading the plurality of second video streams;
wherein the target video stream is selected from a preloaded plurality of second video streams.
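The asynchronous acquisition and preloading of claim 3 may be sketched, for illustration only, with a thread pool that fetches every candidate branch while the first video stream is still playing. `fetch_stream` is a hypothetical stand-in for downloading the opening segments of a stream:

```python
from concurrent.futures import Future, ThreadPoolExecutor

def fetch_stream(url: str) -> bytes:
    # Stand-in for an HTTP fetch of the first segments of a branch stream.
    return f"segments-of-{url}".encode()

def preload_branches(urls: list[str]) -> dict[str, Future]:
    """Kick off downloads of every candidate second video stream without
    blocking playback of the first video stream."""
    pool = ThreadPoolExecutor(max_workers=4)
    return {url: pool.submit(fetch_stream, url) for url in urls}

futures = preload_branches(["left.m3u8", "right.m3u8"])
# Later, once the user's voice input selects a branch, only that future is consumed,
# so the chosen branch can start without a loading gap:
chosen = futures["left.m3u8"].result()
```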
4. The method of claim 1, wherein the obtaining the voice information input by the user comprises:
collecting environmental sound information, wherein the environmental sound information comprises the voice information and environmental noise information;
and inputting the environmental sound information into a pre-trained voice extraction model corresponding to the user to obtain the voice information.
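Claim 4 relies on a pre-trained voice extraction model, whose internals the application does not specify. As a crude illustrative stand-in only (not the claimed model), frame-energy gating can separate a louder speech burst from steady environmental noise:

```python
import numpy as np

def extract_voice(ambient: np.ndarray, frame: int = 256, factor: float = 2.0) -> np.ndarray:
    """Hypothetical stand-in for the learned model: keep only frames whose
    energy rises well above the estimated noise floor; zero out the rest."""
    n = len(ambient) // frame * frame
    frames = ambient[:n].reshape(-1, frame)
    energy = (frames ** 2).mean(axis=1)
    floor = np.median(energy)            # rough noise-floor estimate
    mask = energy > factor * floor       # frames that likely contain speech
    return (frames * mask[:, None]).ravel()

# Synthetic check: low-level noise with one loud "speech" burst in the middle.
rng = np.random.default_rng(0)
ambient = 0.01 * rng.standard_normal(2048)
ambient[768:1024] += np.sin(np.linspace(0, 40 * np.pi, 256))
voice = extract_voice(ambient)           # noise-only frames are suppressed
```

A production system would instead run the environmental sound through the user-specific trained model the claim refers to; the gating above only illustrates the input/output shape of such a step.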
5. The method of claim 2, wherein said prompting the user for the video branch condition comprises:
prompting the user of the video branch condition in at least one of the following modes:
voice broadcast mode, text display mode.
6. A video playback apparatus, comprising:
the first acquisition module is used for acquiring voice information input by a user in the playing process of the first video stream;
a selection module for selecting a target video stream from a plurality of second video streams associated with the first video stream according to the voice information; the first video stream and each second video stream are positioned on different branches of a target video, and the target video adopts a tree structure;
and the playing module is used for playing the target video stream.
7. The apparatus of claim 6, wherein the first obtaining module comprises:
an acquisition unit configured to acquire tag information of the first video stream; wherein the marking information comprises a video branch condition corresponding to the first video stream;
the prompting unit is used for prompting the user of the video branch condition;
and the receiving unit is used for receiving the voice information input by the user based on the prompt information.
8. The apparatus of claim 7, further comprising:
a second obtaining module, configured to obtain the tag information of the first video stream, and then asynchronously obtain the plurality of second video streams according to the video branching condition;
and the loading module is used for preloading the plurality of second video streams so as to select a target video stream from the preloaded plurality of second video streams.
9. The apparatus of claim 6, wherein the first obtaining module comprises:
the system comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring environmental sound information, and the environmental sound information comprises the voice information and environmental noise information;
and the extracting unit is used for inputting the environmental sound information into a pre-trained voice extracting model corresponding to the user to obtain the voice information.
10. The apparatus according to claim 7, wherein the prompting unit is specifically configured to:
prompting the user of the video branch condition in at least one of the following modes:
voice broadcast mode, text display mode.
11. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-5.
12. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-5.
CN201911336145.8A 2019-12-23 2019-12-23 Video playing method and device, electronic equipment and computer readable storage medium Pending CN111031373A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911336145.8A CN111031373A (en) 2019-12-23 2019-12-23 Video playing method and device, electronic equipment and computer readable storage medium


Publications (1)

Publication Number Publication Date
CN111031373A true CN111031373A (en) 2020-04-17

Family

ID=70211536

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911336145.8A Pending CN111031373A (en) 2019-12-23 2019-12-23 Video playing method and device, electronic equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN111031373A (en)


Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105472456A (en) * 2015-11-27 2016-04-06 北京奇艺世纪科技有限公司 Video playing method and device
US20180070143A1 (en) * 2016-09-02 2018-03-08 Sony Corporation System and method for optimized and efficient interactive experience
US20180296916A1 (en) * 2017-04-14 2018-10-18 Penrose Studios, Inc. System and method for spatial and immersive computing
CN108769814A (en) * 2018-06-01 2018-11-06 腾讯科技(深圳)有限公司 Video interaction method, device and readable medium
CN109788350A (en) * 2019-01-18 2019-05-21 北京睿峰文化发展有限公司 It is a kind of that the seamless method and apparatus continuously played are selected based on video display plot
CN109982142A (en) * 2017-12-28 2019-07-05 优酷网络技术(北京)有限公司 Video broadcasting method and device


Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110809175A (en) * 2019-09-27 2020-02-18 腾讯科技(深圳)有限公司 Video recommendation method and device
CN112687275A (en) * 2020-12-25 2021-04-20 北京中科深智科技有限公司 Voice filtering method and filtering system
CN113423003A (en) * 2021-06-10 2021-09-21 山东云缦智能科技有限公司 Method for playing interactive video
CN113901190A (en) * 2021-10-18 2022-01-07 深圳追一科技有限公司 Man-machine interaction method and device based on digital human, electronic equipment and storage medium
CN114466201A (en) * 2022-02-21 2022-05-10 上海哔哩哔哩科技有限公司 Live stream processing method and device
CN114466201B (en) * 2022-02-21 2024-03-19 上海哔哩哔哩科技有限公司 Live stream processing method and device
CN114979770A (en) * 2022-06-28 2022-08-30 北京爱奇艺科技有限公司 Video playing method and device, electronic equipment and storage medium
CN114979770B (en) * 2022-06-28 2024-02-02 北京爱奇艺科技有限公司 Video playing method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN111031373A (en) Video playing method and device, electronic equipment and computer readable storage medium
CN110933487B (en) Method, device and equipment for generating click video and storage medium
CN112131988B (en) Method, apparatus, device and computer storage medium for determining virtual character lip shape
CN110751940B (en) Method, device, equipment and computer storage medium for generating voice packet
CN110085244B (en) Live broadcast interaction method and device, electronic equipment and readable storage medium
CN111225236B (en) Method and device for generating video cover, electronic equipment and computer-readable storage medium
CN112365877A (en) Speech synthesis method, speech synthesis device, electronic equipment and storage medium
CN111726682B (en) Video clip generation method, device, equipment and computer storage medium
WO2022000983A1 (en) Video processing method and apparatus, and electronic device and storage medium
CN111177453A (en) Method, device and equipment for controlling audio playing and computer readable storage medium
CN110647617B (en) Training sample construction method of dialogue guide model and model generation method
CN111538862A (en) Method and device for explaining video
CN111935502A (en) Video processing method, video processing device, electronic equipment and storage medium
CN112000781A (en) Information processing method and device in user conversation, electronic equipment and storage medium
CN111158924A (en) Content sharing method and device, electronic equipment and readable storage medium
CN112269867A (en) Method, device, equipment and storage medium for pushing information
CN112530419A (en) Voice recognition control method and device, electronic equipment and readable storage medium
CN112581933B (en) Speech synthesis model acquisition method and device, electronic equipment and storage medium
CN111883101B (en) Model training and speech synthesis method, device, equipment and medium
CN110674338B (en) Voice skill recommendation method, device, equipment and storage medium
CN105162839A (en) Data processing method, data processing device and data processing system
CN111970560A (en) Video acquisition method and device, electronic equipment and storage medium
CN111638787A (en) Method and device for displaying information
CN114422844B (en) Barrage material generation method, recommendation method, device, equipment, medium and product
CN111653263B (en) Volume adjusting method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (Application publication date: 20200417)