CN115396728A - Method and device for determining video playing multiple speed, electronic equipment and medium


Info

Publication number
CN115396728A
CN115396728A
Authority
CN
China
Prior art keywords
video; information corresponding; characteristic information; speed; video content
Prior art date
Legal status
Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Application number
CN202210994766.0A
Other languages
Chinese (zh)
Inventor
沈晓磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Vivo Mobile Communication Co Ltd
Original Assignee
Vivo Mobile Communication Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Vivo Mobile Communication Co Ltd filed Critical Vivo Mobile Communication Co Ltd
Priority: CN202210994766.0A
Publication: CN115396728A
Legal status: Pending

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40: Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; operations thereof
    • H04N 21/43: Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; client middleware
    • H04N 21/44: Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N 21/44008: Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • H04N 21/44016: Processing of video elementary streams involving splicing one content stream with another content stream, e.g. for substituting a video clip
    • H04N 21/60: Network structure or processes for video distribution between server and client or between remote clients; control signalling between clients, server and network components; transmission of management data between server and client
    • H04N 21/63: Control signaling related to video distribution between client, server and network components; network processes for video distribution between server and clients or between remote clients; communication protocols; addressing
    • H04N 21/637: Control signals issued by the client directed to the server or network components
    • H04N 21/6373: Control signals issued by the client directed to the server or network components for rate control, e.g. request to the server to modify its transmission rate

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

The application discloses a method and a device for determining a video playback speed, an electronic device, and a medium, and belongs to the field of image technology. The method comprises the following steps: acquiring first video content feature information corresponding to N video clips in a first video, where N is a positive integer; splicing the user behavior feature information of a target user with the first video content feature information corresponding to each of the N video clips to obtain video feature information corresponding to each of the N video clips; determining a first playback speed corresponding to each video clip based on the video feature information corresponding to that clip; and playing the first video based on the first playback speeds.

Description

Method and device for determining video playing speed, electronic equipment and medium
Technical Field
The application belongs to the technical field of artificial intelligence, and particularly relates to a method and a device for determining a video playback speed, an electronic device, and a medium.
Background
Nowadays, the number of videos on the internet keeps growing and video content is becoming richer. Because a user's level of interest differs across the different clips of a video, the various players currently on the market provide a variable-speed playback function to meet needs such as fast browsing and slow appreciation.
However, the variable-speed playback function of current players requires the user to continuously and actively trigger adjustments of the video playback speed while watching. The whole adjustment process is therefore overly cumbersome: the user must keep adjusting the playback speed on their own, which degrades the video-watching experience.
Disclosure of Invention
An embodiment of the present application provides a method, an apparatus, an electronic device, and a medium for determining a video playing speed, which enable the electronic device to adaptively adjust a playing speed of a current video segment during a process of a user watching a video.
In a first aspect, an embodiment of the present application provides a method for determining a video playback speed, where the method for determining a video playback speed includes: acquiring first video content characteristic information corresponding to N video segments in a first video, wherein N is a positive integer; respectively splicing the user behavior characteristic information of the target user with the first video content characteristic information corresponding to the N video clips to obtain video characteristic information corresponding to each video clip in the N video clips; determining a first playing speed corresponding to each video clip based on the video characteristic information corresponding to each video clip; and playing the first video based on the first playing double speed.
In a second aspect, an embodiment of the present application provides a device for determining a video playback speed, including an acquisition module and a processing module. The acquisition module is configured to acquire N video clips in a first video, where N is a positive integer, and is further configured to acquire video feature information corresponding to each of the N video clips, where the video feature information corresponding to any video clip includes user behavior feature information of a target user and video content feature information of that clip. The processing module is configured to determine a first playback speed corresponding to each video clip based on the video feature information corresponding to that clip, and is further configured to apply the first playback speed corresponding to each video clip to the corresponding clip.
In a third aspect, embodiments of the present application provide an electronic device, which includes a processor and a memory, where the memory stores a program or instructions executable on the processor, and the program or instructions, when executed by the processor, implement the steps of the method according to the first aspect.
In a fourth aspect, embodiments of the present application provide a readable storage medium, on which a program or instructions are stored, which when executed by a processor implement the steps of the method according to the first aspect.
In a fifth aspect, an embodiment of the present application provides a chip, where the chip includes a processor and a communication interface, where the communication interface is coupled to the processor, and the processor is configured to execute a program or instructions to implement the method according to the first aspect.
In a sixth aspect, embodiments of the present application provide a computer program product, stored on a storage medium, for execution by at least one processor to implement the method according to the first aspect.
In the embodiment of the application, first video content feature information corresponding to N video segments in a first video is acquired, where N is a positive integer; the user behavior feature information of a target user is spliced with the first video content feature information corresponding to each of the N video segments, yielding video feature information corresponding to each of the N video segments; a first playback speed corresponding to each video segment is determined based on the video feature information corresponding to that segment; and the first video is played based on the first playback speeds. Because the user behavior feature information of the target user is fused into the video content feature information of each video segment, the electronic device can set the playback speed of each segment in a personalized manner according to the user behavior feature information and the video content feature information, and the user does not need to adjust the playback speed manually.
Drawings
Fig. 1 is a schematic flowchart of a method for determining a video playback speed according to an embodiment of the present application;
fig. 2 is a schematic diagram of a model of a method for determining a video playing multiple speed according to an embodiment of the present application;
fig. 3 is a second schematic diagram of a model of a method for determining a video playback speed according to an embodiment of the present application;
fig. 4 is a third schematic model diagram of a method for determining a video playback speed according to an embodiment of the present application;
fig. 5 is a second flowchart illustrating a method for determining a video playback speed according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of an apparatus for determining a video playback speed according to an embodiment of the present application;
fig. 7 is a schematic hardware structure diagram of an electronic device according to an embodiment of the present application;
fig. 8 is a second schematic hardware structure diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly below with reference to the drawings in the embodiments of the present application; the described embodiments are obviously some, but not all, of the embodiments of the present application. All other embodiments that can be derived by one of ordinary skill in the art from the embodiments given herein fall within the scope of the present application.
The terms "first", "second", and the like in the description and claims of the present application are used to distinguish between similar objects and are not necessarily used to describe a particular order or sequence. It should be understood that data so used may be interchanged under appropriate circumstances, so that the embodiments of the application can operate in sequences other than those illustrated or described herein. Moreover, the terms "first", "second", etc. are generally used in a generic sense and do not limit the number of objects; for example, a first object may be one or more than one. In addition, "and/or" in the specification and claims denotes at least one of the connected objects, and the character "/" generally indicates an "or" relationship between the preceding and following objects.
The method, the apparatus, the electronic device and the readable storage medium for determining the video playback speed provided by the embodiments of the present application are described in detail below with reference to the accompanying drawings.
The method for determining the video playing speed provided by the embodiment of the application can be applied to watching of video scenes.
Nowadays, the number of videos on the internet keeps growing and video content is becoming richer. Because a user's level of interest differs across the different clips of a video, the various players currently on the market provide a variable-speed playback function to meet needs such as fast browsing and slow appreciation. However, the variable-speed playback function of current players requires the user to continuously and actively trigger adjustments of the video playback speed while watching. The whole adjustment process is therefore overly cumbersome: the user must keep adjusting the playback speed on their own, which degrades the video-watching experience.
For example, the variable-speed playback function of current players offers two adjustment modes: in one, the user actively taps to set a fixed value (e.g., 0.5x/0.75x/1.0x/1.5x/2.0x) and the current video clip is played at the speed corresponding to that value; in the other, the user long-presses the screen, which triggers a fixed value (e.g., 3.0x) for the duration of the press, and the current video clip is played at the speed corresponding to that value.
Therefore, current schemes for determining the video playback speed rely too heavily on the user's active input. Both modes have obvious drawbacks: each user input can only set a single fixed playback speed, and the speed cannot be adjusted automatically and promptly for different segments within the video.
In summary, how to determine the playback speed of the current video clip during viewing without requiring the user to trigger it manually is a technical problem that this application urgently needs to solve.
In the method for determining the video playback speed provided by this application, first video content feature information corresponding to N video segments in a first video is acquired, where N is a positive integer; the user behavior feature information of a target user is spliced with the first video content feature information corresponding to each of the N video segments, yielding video feature information corresponding to each of the N video segments; a first playback speed corresponding to each video segment is determined based on the video feature information corresponding to that segment; and the first video is played based on the first playback speeds. Because the user behavior feature information of the target user is fused into the video content feature information of each video segment, the electronic device can set the playback speed of each segment in a personalized manner according to the user behavior feature information and the video content feature information, and the user does not need to adjust the playback speed manually.
The execution main body of the method for determining the video playing double speed provided by the embodiment of the application can be a device for determining the video playing double speed, and the device for determining the video playing double speed can be an electronic device and also can be a functional module in the electronic device.
Fig. 1 shows a flowchart of a method for determining a video playback speed, which can be applied to an electronic device. As shown in fig. 1, the method for determining a video playback multiple speed provided by the embodiment of the present application may include steps 201 to 204 described below.
Step 201, obtaining first video content characteristic information corresponding to N video segments in a first video.
Wherein N is a positive integer.
In this embodiment of the application, the first video may be a video in any application in the electronic device, or a video stored locally in the electronic device.
In the embodiment of the application, the electronic device may regularly or randomly split the first video into N video segments according to the predetermined number of segments.
In this embodiment of the application, the electronic device may input the acquired first video to a video segmentation model for segmentation processing, so as to obtain N video segments.
In the embodiment of the present application, the video segmentation model is a trained video segmentation model.
For example, after the electronic device inputs the first video of duration T into the video segmentation model, it may sample N_f = T / I frame images according to the set video sampling interval I, and then group the N_f frame images into segments of M frames each, obtaining N_s = N_f / M video segments {S_1, S_2, …, S_{N_s}}.
It should be noted that each frame image in the finally obtained video segments is an RGB image of size H × W × 3 obtained by Resize and Crop processing.
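To make the sampling arithmetic concrete, the following is a minimal sketch of the segmentation described above (N_f = T / I frames, grouped into N_s = N_f / M segments); the function name and parameters are illustrative assumptions, not from the patent:

```python
from typing import List

def segment_video(duration_s: float, sample_interval_s: float,
                  frames_per_segment: int) -> List[List[float]]:
    """Sample N_f = T / I frame timestamps, then group them into
    N_s = N_f / M segments of M frames each (any remainder is dropped)."""
    n_f = int(duration_s / sample_interval_s)            # N_f = T / I
    timestamps = [i * sample_interval_s for i in range(n_f)]
    m = frames_per_segment
    n_s = n_f // m                                       # N_s = N_f / M
    return [timestamps[i * m:(i + 1) * m] for i in range(n_s)]

# e.g. a 60 s video sampled every 0.5 s, 24 frames per segment -> 5 segments
segments = segment_video(60.0, 0.5, 24)
assert len(segments) == 5 and len(segments[0]) == 24
```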
Optionally, in this embodiment of the application, in the process of "acquiring first video content feature information corresponding to N video segments in the first video" in step 201, the following steps 201a to 201d are included:
step 201a: and aiming at one frame of video clips in the N video clips, dividing the video images in the video clips to obtain X image blocks.
Step 201b: and acquiring image characteristic information corresponding to the X image blocks.
Step 201c: inputting the image feature information corresponding to the X image blocks into a video content feature extraction model for feature extraction, and obtaining second video content feature information corresponding to the X image blocks.
Step 201d: and performing feature fusion on the second video content feature information corresponding to the X image blocks to obtain first video content feature information.
Illustratively, one image block corresponds to one first key video content characteristic information.
Illustratively, as shown in FIG. 2, the video content feature extraction model M_video may be constructed by repeating a basic block N times, where the basic block consists of a multi-head attention module (Multi-Head Attention), a residual-and-normalization module (Add & Norm), and a feed-forward network module (Feed Forward), and the Multi-Head Attention in turn consists of 12 self-attention modules (Self-Attention).
Illustratively, the electronic device may input the above N video segments into the video content feature extraction model M_video; M_video extracts the second video content feature information of each video clip from the video content information of the N video clips, and feature fusion is performed on the second video content feature information to finally obtain the first video content feature information.
For example, the electronic device may first divide each frame image in one of the N video segments into X image blocks of a preset size and flatten the X image blocks into vectors of the same length, thereby obtaining the image feature information corresponding to the X image blocks. The vectors are then linearly transformed through a fully connected layer to obtain the vector F_video-in of length L_video-in corresponding to the video content information in the video segment.
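A minimal PyTorch sketch of this block-division and linear-projection step follows, assuming square P × P blocks and a ViT-style fully connected projection; the class name and all sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split frames into P x P blocks, flatten each, project with an FC layer."""
    def __init__(self, patch: int = 16, embed_dim: int = 768):
        super().__init__()
        self.patch = patch
        self.proj = nn.Linear(patch * patch * 3, embed_dim)  # the FC layer

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (B, 3, H, W) -> X = (H/P) * (W/P) blocks of P*P*3 values each
        b, c, h, w = frames.shape
        p = self.patch
        x = frames.unfold(2, p, p).unfold(3, p, p)       # (B, 3, H/P, W/P, P, P)
        x = x.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * p * p)
        return self.proj(x)                              # F_video-in: (B, X, embed_dim)

tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))
assert tokens.shape == (1, 196, 768)                     # (224/16)^2 = 196 blocks
```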
Optionally, in this embodiment of the application, in the process of "inputting the image feature information corresponding to the X image blocks into the video content feature extraction model for feature extraction to obtain the second video content feature information corresponding to the X image blocks" in the step 201c, "the following steps 201c1 to 201c3 are included:
step 201c1, based on the multi-head attention module, performing video content feature extraction on the image feature information corresponding to the X image blocks to obtain X first key video content feature information corresponding to the X image blocks.
Step 201c2, based on the residual-and-normalization module, calculating the mean and standard deviation corresponding to the X pieces of first key video content feature information, and, based on the mean and standard deviation, obtaining X pieces of third video content feature information corresponding to the X pieces of first key video content feature information.
Step 201c3, based on the feed-forward module, respectively fusing all the feature information within each of the X pieces of third video content feature information to obtain X pieces of second video content feature information.
Specifically, the image feature information corresponding to the X image blocks is first input into the Multi-Head Attention, whose 12 Self-Attention modules compute
Attention(Q, K, V) = softmax(Q · K^T / √d_k) · V,
weighting the features according to the correlations among them and thereby extracting the different pieces of first key video content feature information. Here Q (query), K (key), and V (value) are vectors obtained by remapping the input image feature information through an FC layer, and d_k denotes the length of the vectors.
Then, the different pieces of key video content feature information are input into the Add & Norm, which calculates the mean and standard deviation corresponding to the first key video content feature information and computes, based on them, the third video content feature information. The Add operation is a shortcut connection: it allows earlier features to be reused, alleviates the vanishing-gradient problem during training, and improves the effect of the model. Norm is a layer normalization (LayerNorm) operation on the features, which keeps the feature distribution consistent.
Then, the third video content feature information is input into the Feed Forward, where an FC layer, a ReLU activation function, and another FC layer fuse all the feature information within each of the X pieces of third video content feature information, thereby obtaining X pieces of more fully descriptive second video content feature information.
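Putting the three modules together, here is a hedged PyTorch sketch of one basic block of M_video (Multi-Head Attention with 12 heads, Add & Norm, Feed Forward); the repetition count and dimensions are illustrative assumptions:

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, dim: int = 768, heads: int = 12):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)          # Add & Norm after attention
        self.ffn = nn.Sequential(               # Feed Forward: FC -> ReLU -> FC
            nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
        self.norm2 = nn.LayerNorm(dim)          # Add & Norm after feed-forward

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a, _ = self.attn(x, x, x)               # softmax(Q K^T / sqrt(d_k)) V
        x = self.norm1(x + a)                   # residual "Add", then LayerNorm
        return self.norm2(x + self.ffn(x))

# M_video as the basic block repeated N times (N = 6 purely for illustration)
m_video = nn.Sequential(*[EncoderBlock() for _ in range(6)])
f_video = m_video(torch.randn(1, 196, 768))     # (batch, blocks, dim)
```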
Example 2: with reference to Example 1, each H × W × 3 RGB image in the N video clips is divided into (H / P) × (W / P) blocks of size P × P × 3. Next, each block is flattened into a vector of length P × P × 3, and a fully connected layer applies a linear transformation to obtain the vector F_video-in of length L_video-in. The electronic device then inputs the vector F_video-in into the video content feature extraction model M_video, which extracts the video content feature information and outputs F_video-out of length L_video-out, where L_video-out = L_user-out.
Step 202, respectively splicing the user behavior characteristic information of the target user and the first video content characteristic information corresponding to the N video segments to obtain the video characteristic information corresponding to each of the N video segments.
In the embodiment of the present application, the electronic device concatenates, at the corresponding positions via a Concatenate operation, the video content feature information F_video-out of each video clip with the user behavior feature information F_user-out, finally obtaining the video feature information corresponding to each of the N video clips.
In this embodiment of the present application, the video feature information corresponding to any video segment includes: user behavior characteristic information of the target user and video content characteristic information of any video segment. Wherein, any video clip is any video clip in the N video clips.
In the embodiment of the present application, the user behavior feature information and the video content feature information may be represented in the form of feature vectors.
In this embodiment of the application, the user behavior feature information of the target user includes user basic feature F_user-base information and learnable user attribute feature F_user-learn information.
In the embodiment of the present application, the above user basic feature F_user-base information may be feature information representing the user's gender, age, favorite type, and the like.
In one possible embodiment, the above-mentioned learnable user attribute feature F_user-learn information may be updated according to user behavior parameters.
In one possible embodiment, for the user basic feature F_user-base: after the electronic device collects the user attribute data (UserData), it encodes the user attribute data and, based on the encoded user attribute data and the first formula, computes the user basic feature F_user-base of length L_user-base.
Illustratively, the user attribute data may be as shown in table 1 below.
TABLE 1: user attribute data and their selectable values (for example gender, age, and favorite type; the table itself is rendered as an image in the original publication).
For example, when the electronic device encodes the user attribute data, the electronic device typically encodes and converts the collected user attribute data into user attribute feature vectors.
Illustratively, the electronic device converts the user attribute data into user attribute feature vectors in the following manner: the selectable values of each attribute are shown in Table 1, and the length of each attribute's feature vector is its number of selectable values; if the ith value of attribute a is selected, then V_{a,i} = 1, otherwise V_{a,i} = 0.
Illustratively, after encoding the user attribute data, the electronic device expresses it as a vector V and normalizes each attribute by the first formula, thereby obtaining a data distribution suitable for neural network input. All the normalized vectors are then encoded again to obtain the user basic feature F_user-base of length L_user-base.
Illustratively, the first formula is:
V_norm = (V - V_mean) / V_std
where V_norm is the normalized attribute feature, V_mean is the attribute mean statistically derived from the training data, and V_std is the standard deviation statistically derived from the training data.
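A small sketch of this one-hot encoding and normalization follows; the attribute names and values are invented placeholders, since Table 1 survives only as an image:

```python
import numpy as np

ATTRS = {"gender": ["male", "female"],
         "favorite_type": ["sports", "news", "drama", "games"]}

def encode_user(user: dict) -> np.ndarray:
    parts = []
    for attr, values in ATTRS.items():
        v = np.zeros(len(values))             # one slot per selectable value
        v[values.index(user[attr])] = 1.0     # V_{a,i} = 1 for the chosen value
        parts.append(v)
    return np.concatenate(parts)

def normalize(v, v_mean, v_std):
    return (v - v_mean) / v_std               # the first formula

V = encode_user({"gender": "female", "favorite_type": "games"})
# V_mean and V_std would be statistics over the training data:
f_user_base = normalize(V, np.full_like(V, 0.5), np.full_like(V, 0.5))
```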
In the embodiment of the application, the user behavior feature information is extracted by the user behavior feature extraction model M_user based on the user basic feature F_user-base information and the learnable user attribute feature F_user-learn information.
In a possible embodiment, the electronic device concatenates the user basic feature F_user-base and the learnable user attribute feature F_user-learn via Concatenate to obtain F_user-in.
Illustratively, after the electronic device inputs F_user-in into the user behavior feature extraction model M_user, the model M_user extracts the user behavior feature information from F_user-in, obtaining the user behavior feature information F_user-out of length L_user-out.
Example 1: as shown in fig. 3, after the electronic device inputs F_user-in into the user behavior feature extraction model M_user, F_user-in is first passed through a linear transformation (FC) layer and then converted, by a regularization layer (BatchNorm) and an activation function layer (e.g., a ReLU activation function), into a vector F_user-hidden of length L_user-hidden; it then passes through another FC layer, BatchNorm regularization, and again a ReLU activation function, and the user behavior feature information F_user-out, of final length L_user-out, is output.
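A hedged PyTorch sketch of M_user as just described (FC, BatchNorm, ReLU, applied twice), with the learnable F_user-learn modeled as a parameter concatenated to F_user-base; all dimensions are illustrative assumptions:

```python
import torch
import torch.nn as nn

class MUser(nn.Module):
    def __init__(self, base_dim: int = 32, learn_dim: int = 16,
                 hidden: int = 64, out: int = 128):
        super().__init__()
        self.f_user_learn = nn.Parameter(torch.zeros(learn_dim))  # learnable
        self.net = nn.Sequential(
            nn.Linear(base_dim + learn_dim, hidden),   # FC
            nn.BatchNorm1d(hidden), nn.ReLU(),         # -> F_user-hidden
            nn.Linear(hidden, out),                    # FC
            nn.BatchNorm1d(out), nn.ReLU())            # -> F_user-out

    def forward(self, f_user_base: torch.Tensor) -> torch.Tensor:
        learn = self.f_user_learn.expand(f_user_base.size(0), -1)
        f_user_in = torch.cat([f_user_base, learn], dim=1)   # Concatenate
        return self.net(f_user_in)

f_user_out = MUser()(torch.randn(4, 32))   # (batch, L_user-out) = (4, 128)
```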
Step 203, determining a first playing speed corresponding to each video segment based on the video characteristic information corresponding to each video segment.
And step 204, playing the first video based on the first playing double speed.
In this embodiment, the first playback speed corresponding to each video segment may be respectively applied to the respective corresponding video segments.
In the embodiment of the application, the electronic device applies the determined first play speed to each video clip, so that the video can automatically adjust the current play speed according to the first play speed when playing the video.
Example 3: for the N_i-th video segment S_{N_i}, if the first playback speed is determined to be V, then N/V frames are uniformly sampled from the segment's original N frames to realize playback at speed V.
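One plausible reading of Example 3 as code, uniformly sampling n/V of a segment's n frames (for V < 1 this repeats frames, slowing playback); the helper is an illustrative assumption:

```python
from typing import List, TypeVar

Frame = TypeVar("Frame")

def resample_for_speed(frames: List[Frame], speed: float) -> List[Frame]:
    n = len(frames)
    keep = max(1, int(n / speed))                        # n/V frames survive
    idx = [round(i * (n - 1) / max(1, keep - 1)) for i in range(keep)]
    return [frames[i] for i in idx]                      # uniform sampling

assert len(resample_for_speed(list(range(120)), 2.0)) == 60   # 2.0x -> half
```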
In the method for determining the video playback speed, first video content feature information corresponding to N video segments in a first video is acquired, where N is a positive integer; the user behavior feature information of a target user is spliced with the first video content feature information corresponding to each of the N video segments, yielding video feature information corresponding to each of the N video segments; a first playback speed corresponding to each video segment is determined based on the video feature information corresponding to that segment; and the first video is played based on the first playback speeds. Because the user behavior feature information of the target user is fused into the video content feature information of each video segment, the electronic device can set the playback speed of each segment in a personalized manner according to the user behavior feature information and the video content feature information, and the user does not need to adjust the playback speed manually.
Optionally, in this embodiment of the application, in the process of "determining the first playback speed corresponding to each video segment based on the video feature information corresponding to each video segment" in the step 203, "specifically, the following steps 203a and 203b are included:
step 203a, inputting the video characteristic information corresponding to each video segment into a video playing speed estimation model, and outputting the playing speed prediction parameters corresponding to each video segment.
Illustratively, the video playback speed estimation model M_speed includes an attention mechanism module (an SE module).
Optionally, in this embodiment of the application, the process of the step 203a includes the following steps 203a1 and 203a2:
step 203a1, for one frame of video clip of the N video clips, inputting the video feature information corresponding to the video clip into the attention mechanism module to obtain the key video feature information corresponding to the video clip.
Step 203a2, obtaining the playing speed information having a mapping relation with the key video feature information, and obtaining the playing speed prediction parameter corresponding to the video segment based on the playing multiple information and the key video feature information.
Exemplarily, as shown in fig. 4, for the ith video clip, the electronic device inputs the video feature information F_speed-in corresponding to the ith video clip into the video playback speed estimation model M_speed. Key feature extraction is first performed by the SE module, whose attention output is the key video feature information F_speed-se; F_speed-se then passes through an FC layer, BatchNorm regularization, and a ReLU activation function to obtain F_speed-hidden, and then through another FC layer to obtain F_speed-out. Finally, a Softmax function outputs the playback speed prediction parameter P_i corresponding to the ith video clip, i = 1, 2, …, N.
Further exemplarily, first, the user behavior feature information and the first video content feature information corresponding to the ith video segment are concatenated by a Concatenate operation to obtain the video feature information F_speed-in, a vector of shape (BatchSize, L_video-out + L_user-out).
Then, the video feature information F_speed-in is input into the SE module, which further fuses the user behavior feature information and the first video content feature information contained in it and assigns the features of each channel different weights representing their importance, thereby extracting the key video feature information corresponding to the ith video clip. Specifically, the input of the SE module is F_speed-in, of shape (BatchSize, L_video-out + L_user-out). F_speed-in passes through 2 FC layers and 1 Sigmoid layer to generate a weight W of shape (BatchSize, L_video-out + L_user-out) with values in the range 0 to 1, and finally F_speed-in is multiplied by W to obtain the weighted F_speed-in, i.e., the key video feature information F_speed-se, also of shape (BatchSize, L_video-out + L_user-out).
Then, the key video feature information F_speed-se is input into an FC layer, and BatchNorm regularization and a ReLU activation function map it to the playback speed dimension, yielding the abstract feature F_speed-hidden used to estimate the playback speed; another FC layer then produces the playback speed information F_speed-out, which has a mapping relationship with the key video feature information, and the speed estimation module outputs it. The final output F_speed-out is passed through a Softmax function to obtain the playback speed prediction parameter of the ith segment.
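A hedged PyTorch sketch of M_speed as described, with SE-style gating (two FC layers and a Sigmoid producing weights W in the range 0 to 1, multiplied onto F_speed-in) followed by FC, BatchNorm, ReLU, FC, and Softmax; all dimensions are illustrative assumptions:

```python
import torch
import torch.nn as nn

class MSpeed(nn.Module):
    def __init__(self, dim: int = 256, hidden: int = 128, levels: int = 5):
        super().__init__()
        self.se = nn.Sequential(                   # SE module: weights W in (0, 1)
            nn.Linear(dim, dim // 4), nn.ReLU(),
            nn.Linear(dim // 4, dim), nn.Sigmoid())
        self.head = nn.Sequential(
            nn.Linear(dim, hidden), nn.BatchNorm1d(hidden), nn.ReLU(),
            nn.Linear(hidden, levels))             # F_speed-out: one logit per level

    def forward(self, f_speed_in: torch.Tensor) -> torch.Tensor:
        f_speed_se = f_speed_in * self.se(f_speed_in)        # weighted F_speed-in
        return torch.softmax(self.head(f_speed_se), dim=1)   # P_i: X probabilities

# F_speed-in = Concatenate(F_video-out, F_user-out): (BatchSize, 128 + 128)
p = MSpeed()(torch.cat([torch.randn(4, 128), torch.randn(4, 128)], dim=1))
assert p.shape == (4, 5) and torch.allclose(p.sum(dim=1), torch.ones(4))
```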
It should be noted that the ith video segment is any one of the N video segments; in other words, the estimation process of the video playback speed estimation model described for the ith video segment applies to each of the N video segments, and the details are not repeated here.
Step 203b, determining a first playing speed multiple corresponding to each video segment from X preset playing speed multiple based on the playing speed multiple prediction parameters corresponding to each video segment.
Wherein X is a positive integer.
Illustratively, the X preset playback speeds may be playback speeds pre-stored in various applications for playing back video, for example, 0.5X/0.75X/1.0X/1.5X/2.0X.
It should be noted that the preset playback multiple speeds in different applications may be the same or different. For example, if the first video is a video in a first application, the X preset playback multiple speeds are preset playback multiple speeds corresponding to the first application.
Illustratively, the corresponding play speed prediction parameter for each video segment is used to indicate at least one preset play speed. Therefore, the electronic equipment can select one preset playing double speed from the preset playing double speeds indicated by the playing double speed prediction parameter corresponding to one video clip as the first playing double speed of the video clip.
Illustratively, the X preset playback speeds may also be set by the user. Specifically, with a minimum playback speed C_min, a maximum playback speed C_max, and X selectable speed levels, the playback speed of the ith level can be expressed (assuming, for example, uniform spacing) as
C_i = C_min + (i - 1) · (C_max - C_min) / (X - 1),
and the set of selectable playback speeds is denoted C = {C_i}.
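A small sketch of building the selectable set C under the uniform-spacing assumption above:

```python
from typing import List

def speed_levels(c_min: float, c_max: float, x: int) -> List[float]:
    if x == 1:
        return [c_min]
    return [c_min + i * (c_max - c_min) / (x - 1) for i in range(x)]

C = speed_levels(0.5, 2.0, 5)         # C_min = 0.5, C_max = 2.0, X = 5
assert C == [0.5, 0.875, 1.25, 1.625, 2.0]
```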
Optionally, in this embodiment of the present application, the play double-speed prediction parameter corresponding to any video segment of the N video segments includes: x probability values.
Illustratively, the X probability values correspond to X preset playback speeds.
Illustratively, any of the X probability values is used to characterize: the playing speed of any video clip is the probability of the preset playing speed corresponding to any probability value.
Optionally, in this embodiment of the application, in the step 203b "determining the first playback speed corresponding to each video segment from X preset playback speeds based on the playback speed prediction parameter corresponding to each video segment", the method specifically includes the following step 203b1:
step 203b1, determining the preset playing multiple speed corresponding to the maximum probability value in the X probability values corresponding to any video clip as the first playing multiple speed corresponding to any video clip.
Illustratively, the play double speed prediction parameter may be a vector of length X.
For example, for an ith video segment, the electronic device may use a preset playback speed corresponding to a maximum element in a playback speed prediction parameter vector corresponding to the ith video segment (i.e., an element with a maximum probability value in the vector) as a corresponding first playback speed of the ith video segment.
For example, the electronic device may select, by using a second formula, a preset playback multiple speed corresponding to a maximum probability value of X probability values corresponding to any one of the video clips from the X preset playback multiple speeds.
Illustratively, the second formula is:
V = C_j, where j = argmax_i P[i],
in which V is the first playback speed, C_j is the jth-level preset playback speed in the set C, and j is the position of the largest element of the prediction parameter vector P.
In this way, the playback speed corresponding to each segment of the video is computed from the extracted video content features and adaptively applied to the current video, so the user does not need to manually adjust the playback speed while watching.
Optionally, in this embodiment of the present application, after the step 204 "respectively apply the first playback speed corresponding to each video clip to the corresponding video clip", the method for determining a video playback speed provided by the present application further includes the following steps 301 and 302:
step 301, after detecting a first input of a user for adjusting a video playing speed of a first video segment, taking a playing speed prediction parameter corresponding to the first video segment and an actual video playing speed corresponding to the first input as a first training sample.
Step 302, training the video playing speed estimation model by using the first training sample to obtain the trained video playing speed estimation model.
Illustratively, the first video segment is one of the N video segments.
Illustratively, the first input may include: a touch input by the user on the display screen, a voice instruction input by the user, or a specific gesture input by the user, which may be determined according to actual use requirements; the embodiments of the present application are not limited thereto.
Illustratively, the first input mentioned above is an input for adjusting the current video playback multiple speed.
Illustratively, the first input may include: a first input to the video interface when the user is not satisfied with the video playback speed of the first video segment.
For example, the electronic device may process the first training sample based on the third formula, and train the video playing speed estimation model with the processed training sample, to obtain the trained video playing speed estimation model.
Illustratively, the third formula above may be a cross entropy loss function.
Specifically, the third formula is: L = -log P_i[V_i], where P_i is the playback speed prediction parameter corresponding to the first video clip and V_i is the actual playback speed corresponding to the first input.
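A minimal sketch of one online training step with this loss, reusing the MSpeed sketch above and assuming the user's chosen speed has been mapped to its index in the preset set C:

```python
import torch
import torch.nn.functional as F

def train_step(m_speed, optimizer, f_speed_in: torch.Tensor,
               chosen_idx: torch.Tensor) -> float:
    p = m_speed(f_speed_in)                               # P_i: (BatchSize, X)
    loss = F.nll_loss(torch.log(p + 1e-9), chosen_idx)    # L = -log P_i[V_i]
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```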
Optionally, in this embodiment of the present application, the method for determining a video playback speed provided by the present application further includes the following step 401:
step 401, in the case that it is detected that the behavior of the target user for adjusting the video playing speed within the predetermined time period satisfies the first condition, updating the behavior feature information of the target user based on the first information of the target user.
Illustratively, the first information includes: the behavior parameters corresponding to the target user's adjustments of the video playback speed within the predetermined time period, and the video information of the videos adjusted by the target user.
For example, the predetermined time period may be set by a user, or may be customized by the electronic device.
Illustratively, the first condition includes at least one of the following:
the number of times the user actively adjusts the video playback speed of the first video within the predetermined time period reaches a first threshold (for example, the number of times the user adjusts the playback speed of a certain video segment within the predetermined time period reaches a threshold A);
the duration for which the user adjusts the video playback speed of the first video within the predetermined time period reaches a second threshold.
Illustratively, the behavior of actively triggered variable-speed playback while the user watches videos, together with the content of the corresponding video segments, is continuously detected, and the learnable user attribute feature F_user-learn information is continuously updated and used as the latest F_user-learn information in subsequent viewing. When updating the F_user-learn information, only F_user-learn is treated as a learnable parameter; the remaining model parameters and F_user-base are all fixed, and training still uses the cross-entropy loss function. When the user's viewing behavior satisfies the first condition within the predetermined time period, that is, when the number of times the user actively adjusts the playback speed reaches the preset threshold, the learnable user attribute feature F_user-learn information is updated.
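A small sketch of this update regime, freezing every parameter except the learnable user embedding (reusing the MUser sketch above, where F_user-learn is `f_user_learn`):

```python
import torch

def user_update_optimizer(m_user, m_video, m_speed, lr: float = 1e-3):
    for module in (m_user, m_video, m_speed):
        for p in module.parameters():
            p.requires_grad = False            # fix all model parameters ...
    m_user.f_user_learn.requires_grad = True   # ... except F_user-learn
    return torch.optim.SGD([m_user.f_user_learn], lr=lr)
```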
In this way, through continuous training, whenever the playback speed recommended by the model on the electronic device fails to meet the user's needs, retraining can be triggered promptly and the latest model deployed, which largely reduces the need for the user to actively trigger playback speed adjustments while watching videos.
The method for determining the video playback speed provided by the present application will be exemplarily described below with one embodiment.
In this embodiment, a personalized, adaptive video playback speed adjustment function is designed based on the user behavior feature information and the video content feature information, so that the user does not need to adjust the playback speed manually while watching a video. The constructed user portrait helps in understanding the types of video segments the user is interested in, and the understanding of the video content helps locate content of interest to the user. Combining the two makes it possible to adaptively estimate the playback speed the user expects and to adjust it in time. Specifically, as shown in fig. 5, the method includes the following steps S1 to S6:
step S1: the electronic equipment acquires user portrait and original video content information.
Step S2: the electronic device encodes the user portrait into vector form through an encoder and extracts the user behavior feature information vector with the user behavior feature extraction model; meanwhile, it segments the original video acquired in step S1 to obtain N video clips.
Step S3: the electronic device inputs each video clip from step S2 into the video content feature extraction model, which analyzes and encodes it to obtain the video content feature information vector corresponding to each clip.
Step S4: the electronic device inputs the user behavior feature information vector from step S2 and the video content feature information vector from step S3 into the playback speed estimation model to obtain the playback speed corresponding to each video clip.
Step S5: the electronic device applies the playback speed corresponding to each video clip to that clip.
Step S6: the electronic device updates the user portrait from step S1, i.e., updates the user behavior feature information, according to the user's active interactions in adjusting the video playback speed.
In this way, abstract representations of the user and of the video content are obtained with the user behavior feature extraction model and the video content feature extraction model respectively, and their relationship to the video playback speed is established by the playback speed estimation model, thereby realizing a variable-speed playback function that requires no user interaction.
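Tying steps S1 to S6 together, a hedged end-to-end sketch that reuses the earlier illustrative pieces (segment tokens from PatchEmbed, MUser, MSpeed, select_speed); the mean pooling over blocks and all dimensions are assumptions, and MSpeed's input size must equal L_video-out + L_user-out:

```python
import torch

def estimate_segment_speeds(m_user, m_video, m_speed, f_user_base,
                            segment_tokens, levels):
    """S2-S5: user features -> per-clip content features -> P_i -> speed C_j."""
    for m in (m_user, m_video, m_speed):
        m.eval()                                    # BatchNorm in inference mode
    with torch.no_grad():
        f_user_out = m_user(f_user_base)            # S2: (1, L_user-out)
        speeds = []
        for tokens in segment_tokens:               # one tensor per video clip
            f_video_out = m_video(tokens).mean(dim=1)               # S3: pooled features
            f_speed_in = torch.cat([f_video_out, f_user_out], dim=1)  # Concatenate
            p = m_speed(f_speed_in)                 # S4: prediction parameters P_i
            speeds.append(select_speed(p[0].tolist(), levels))       # S5: V = C_j
    return speeds
```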
It should be noted that, for the method for determining the video playback speed provided in the embodiments of the present application, the execution subject may be a device for determining the video playback speed, an electronic device, or a functional module or entity in the electronic device. In the embodiments of the present application, a device for determining the video playback speed executing the method is taken as an example to describe the device provided herein.
Fig. 6 is a schematic diagram illustrating a possible structure of the apparatus for determining a video playback multiple speed according to the embodiment of the present application. As shown in fig. 6, the apparatus 700 for determining a video playback multiple speed may include: an obtaining module 701, a processing module 702, and a playing module 703: the obtaining module 701 is configured to obtain first video content feature information corresponding to N video segments in a first video, where N is a positive integer; the processing module 702 is configured to respectively splice the user behavior feature information of the target user with the first video content feature information corresponding to the N video clips, so as to obtain video feature information corresponding to each of the N video clips; the processing module 702 is configured to determine, based on the video feature information corresponding to each video segment, a first playback speed corresponding to each video segment; the playing module 703 is configured to play the first video based on the first playing double speed.
Optionally, in this embodiment of the application, the processing module 702 is specifically configured to, for one frame of video clip in the N video clips, divide a video image in the video clip to obtain X image blocks; the obtaining module 701 is further configured to obtain image feature information corresponding to the X image blocks; the processing module 702 is specifically configured to input the image feature information corresponding to the X image blocks into a video content feature extraction model for feature extraction, so as to obtain second video content feature information corresponding to the X image blocks; the processing module 702 is specifically configured to perform feature fusion on the second video content feature information corresponding to the X image blocks to obtain the first video content feature information.
Optionally, in this embodiment of the present application, the processing module 702 is specifically configured to: input the image feature information corresponding to the X image blocks into the video content feature extraction model, and then perform video content feature extraction on the image feature information corresponding to the X image blocks based on the multi-head attention module to obtain X pieces of first key video content feature information corresponding to the X image blocks; calculate, based on the residual-and-normalization module, the mean and standard deviation corresponding to the X pieces of first key video content feature information, and obtain, based on the mean and standard deviation, X pieces of third video content feature information corresponding to the X pieces of first key video content feature information; and fuse, based on the feed-forward module, all the feature information within each of the X pieces of third video content feature information to obtain X pieces of second video content feature information; wherein one image block corresponds to one piece of first key video content feature information.
Optionally, in this embodiment of the present application, the processing module 702 is specifically configured to: inputting the video characteristic information corresponding to each video segment into a video playing speed estimation model, and outputting a playing speed prediction parameter corresponding to each video segment; respectively determining a first playing speed multiple corresponding to each video clip from X preset playing speed multiple based on the playing speed multiple prediction parameters corresponding to each video clip; wherein X is a positive integer.
Optionally, in this embodiment of the present application, the processing module 702 is specifically configured to: inputting video characteristic information corresponding to one of the N video clips into an attention mechanism module to obtain key video characteristic information corresponding to the video clip; and acquiring play speed information which has a mapping relation with the key video characteristic information, and acquiring play speed prediction parameters corresponding to the video clips based on the play multiple information and the key video characteristic information.
The device for determining the video playback speed acquires first video content feature information corresponding to N video segments in a first video, where N is a positive integer; splices the user behavior feature information of a target user with the first video content feature information corresponding to each of the N video segments to obtain video feature information corresponding to each of the N video segments; determines a first playback speed corresponding to each video segment based on the video feature information corresponding to that segment; and plays the first video based on the first playback speeds. Because the user behavior feature information of the target user is fused into the video content feature information of each video segment, the electronic device can set the playback speed of each segment in a personalized manner according to the user behavior feature information and the video content feature information, and the user does not need to adjust the playback speed manually.
The device for determining the video playback speed in the embodiments of the present application may be an electronic device, or a component in an electronic device such as an integrated circuit or a chip. The electronic device may be a terminal or a device other than a terminal, for example a mobile phone, a tablet computer, a notebook computer, a palmtop computer, a vehicle-mounted electronic device, a Mobile Internet Device (MID), an Augmented Reality (AR)/Virtual Reality (VR) device, a robot, a wearable device, an ultra-mobile personal computer (UMPC), a netbook, or a Personal Digital Assistant (PDA); it may also be a server, a Network Attached Storage (NAS), a personal computer (PC), a television (TV), a teller machine, a self-service machine, and the like, which is not specifically limited in the embodiments of the present application.
The device for determining the video playback speed in the embodiments of the present application may be a device having an operating system. The operating system may be an Android operating system, an iOS operating system, or another possible operating system, which is not specifically limited in the embodiments of the present application.
The device for determining the video playing speed can implement each process implemented by the method embodiments of fig. 1 to 6, and is not described herein again to avoid repetition.
Optionally, as shown in fig. 7, an electronic device 800 is further provided in an embodiment of the present application, and includes a processor 801 and a memory 802, where the memory 802 stores a program or an instruction that can be executed on the processor 801, and when the program or the instruction is executed by the processor 801, the steps of the above-mentioned method for determining a video playback speed doubling can be implemented, and the same technical effects can be achieved, and are not described herein again to avoid repetition.
It should be noted that the electronic device in the embodiment of the present application includes the mobile electronic device and the non-mobile electronic device described above.
Fig. 8 is a schematic diagram of a hardware structure of an electronic device implementing an embodiment of the present application.
The electronic device 100 includes, but is not limited to: radio frequency unit 101, network module 102, audio output unit 103, input unit 104, sensor 105, display unit 106, user input unit 107, interface unit 108, memory 109, and processor 110.
Those skilled in the art will appreciate that the electronic device 100 may further comprise a power source (e.g., a battery) for supplying power to various components, and the power source may be logically connected to the processor 110 through a power management system, so as to implement functions of managing charging, discharging, and power consumption through the power management system. The electronic device structure shown in fig. 8 does not constitute a limitation of the electronic device, and the electronic device may include more or less components than those shown, or combine some components, or arrange different components, and thus, the description is omitted here.
The processor 110 is configured to: acquire first video content feature information corresponding to N video segments in a first video, where N is a positive integer; respectively splice the user behavior feature information of a target user with the first video content feature information corresponding to the N video segments, to obtain video feature information corresponding to each of the N video segments; determine a first playback speed corresponding to each video segment based on the video feature information corresponding to that segment; and play the first video based on the first playback speed.
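For illustration only, the "splicing" here can be read as vector concatenation. Below is a minimal Python sketch, under the assumption that the user behavior features and the per-segment content features are fixed-length vectors; the dimensions and names are hypothetical, not taken from this application.

import numpy as np

def build_segment_features(user_feat, segment_feats):
    # Concatenate the user behavior feature vector with each segment's
    # content feature vector, yielding one fused vector per segment.
    return [np.concatenate([user_feat, seg]) for seg in segment_feats]

# Hypothetical dimensions: a 16-dim user behavior profile, 128-dim content features.
user_feat = np.random.rand(16)
segment_feats = [np.random.rand(128) for _ in range(4)]  # N = 4 video segments
fused = build_segment_features(user_feat, segment_feats)
print(fused[0].shape)  # (144,)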
Optionally, in this embodiment of the application, the processor 110 is specifically configured to: for one of the N video segments, divide a frame of video image in the video segment to obtain X image blocks; acquire image feature information corresponding to the X image blocks; input the image feature information corresponding to the X image blocks into a video content feature extraction model for feature extraction, to obtain second video content feature information corresponding to the X image blocks; and perform feature fusion on the second video content feature information corresponding to the X image blocks to obtain the first video content feature information.
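As an illustration of the block-division step, a video frame can be cut into non-overlapping patches before feature extraction. A minimal sketch, assuming square 16x16 blocks (the block size is not specified in this application):

import numpy as np

def split_into_blocks(frame, block=16):
    # Split an (H, W, C) frame into non-overlapping block x block patches,
    # returning an (X, block*block*C) array of flattened image blocks.
    h, w, c = frame.shape
    h, w = h - h % block, w - w % block  # crop to a multiple of the block size
    return (frame[:h, :w]
            .reshape(h // block, block, w // block, block, c)
            .swapaxes(1, 2)
            .reshape(-1, block * block * c))

frame = np.random.rand(224, 224, 3)
blocks = split_into_blocks(frame)  # X = 196 blocks of 16 x 16 x 3
print(blocks.shape)                # (196, 768)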
Optionally, in this embodiment of the application, the processor 110 is specifically configured to: input the image feature information corresponding to the X image blocks into the video content feature extraction model, and perform video content feature extraction on the image feature information corresponding to the X image blocks based on a multi-head attention module, to obtain X pieces of first key video content feature information corresponding to the X image blocks; calculate a mean and a standard deviation corresponding to the X pieces of first key video content feature information based on a residual and normalization module, and acquire X pieces of third video content feature information corresponding to the X pieces of first key video content feature information based on the mean and the standard deviation; and fuse, based on a feed-forward module, all feature information in each of the X pieces of third video content feature information, to obtain X pieces of second video content feature information; where one image block corresponds to one piece of first key video content feature information.
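The multi-head attention module, the residual and normalization module, and the feed-forward module described above correspond to the stages of a standard transformer encoder block. The following PyTorch sketch shows one such block; the framework choice, dimensions, and layer sizes are assumptions, since this application specifies none of them.

import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    # Multi-head attention -> residual + layer normalization -> feed-forward,
    # roughly matching the three modules described above.
    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)  # normalizes using per-token mean and std
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                nn.Linear(4 * dim, dim))
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):
        key_feats, _ = self.attn(x, x, x)  # "first key video content features"
        x = self.norm1(x + key_feats)      # residual + normalization -> "third features"
        return self.norm2(x + self.ff(x))  # feed-forward fusion -> "second features"

tokens = torch.randn(1, 196, 768)    # X = 196 image blocks, 768-dim features
print(EncoderBlock()(tokens).shape)  # torch.Size([1, 196, 768])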
Optionally, in this embodiment of the application, the processor 110 is specifically configured to: input the video feature information corresponding to each video segment into a video playback speed estimation model, and output a playback speed prediction parameter corresponding to each video segment; and determine, from X preset playback speeds and based on the playback speed prediction parameter corresponding to each video segment, the first playback speed corresponding to that segment; where X is a positive integer.
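One plausible reading of this step, offered only as an assumption since the application does not fix the form of the prediction parameter, is that the model emits one score per preset playback speed and the highest-scoring preset is selected:

import numpy as np

PRESET_SPEEDS = [0.5, 1.0, 1.25, 1.5, 2.0]  # hypothetical X = 5 preset speeds

def pick_speed(prediction_params):
    # Treat the model's prediction parameters as one score per preset
    # speed, normalize with a softmax, and select the top preset.
    p = np.exp(prediction_params - prediction_params.max())
    p /= p.sum()
    return PRESET_SPEEDS[int(np.argmax(p))]

print(pick_speed(np.array([0.1, 0.3, 2.2, 0.4, 0.2])))  # 1.25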
Optionally, in this embodiment of the application, the processor 110 is specifically configured to: input the video feature information corresponding to one of the N video segments into an attention mechanism module to obtain key video feature information corresponding to the video segment; and acquire playback speed information that has a mapping relationship with the key video feature information, and acquire the playback speed prediction parameter corresponding to the video segment based on the playback speed information and the key video feature information.
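A minimal sketch of such an attention step, assuming attention pooling over a segment's fused feature tokens followed by a learned mapping from the pooled key features to per-preset speed scores; the layer sizes are hypothetical.

import torch
import torch.nn as nn

class SpeedHead(nn.Module):
    # Attention over a segment's feature tokens, then a linear mapping
    # from the pooled key features to playback speed prediction parameters.
    def __init__(self, dim=144, n_speeds=5):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # attention weight per token
        self.to_speed = nn.Linear(dim, n_speeds)

    def forward(self, tokens):  # tokens: (batch, T, dim)
        w = torch.softmax(self.score(tokens), dim=1)
        key_feat = (w * tokens).sum(dim=1)  # pooled key video features
        return self.to_speed(key_feat)      # speed prediction parameters

print(SpeedHead()(torch.randn(2, 10, 144)).shape)  # torch.Size([2, 5])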
In the electronic device for determining the video playback speed, the electronic device acquires first video content feature information corresponding to N video segments in a first video, where N is a positive integer; respectively splices the user behavior feature information of a target user with the first video content feature information corresponding to the N video segments, to obtain video feature information corresponding to each of the N video segments; determines a first playback speed corresponding to each video segment based on the video feature information corresponding to that segment; and plays the first video based on the first playback speed. Because the user behavior feature information of the target user is fused into the video content feature information of each video segment, the electronic device can set a personalized playback speed for each segment according to the user behavior feature information and the video content feature information, without the user manually adjusting the playback speed.
It should be understood that, in this embodiment of the present application, the input unit 104 may include a graphics processing unit (GPU) 1041 and a microphone 1042; the graphics processing unit 1041 processes image data of a still picture or video obtained by an image capture device (such as a camera) in a video capture mode or an image capture mode. The display unit 106 may include a display panel 1061, which may be configured in the form of a liquid crystal display, an organic light-emitting diode, or the like. The user input unit 107 includes at least one of a touch panel 1071 and other input devices 1072. The touch panel 1071, also referred to as a touch screen, may include two parts: a touch detection device and a touch controller. The other input devices 1072 may include, but are not limited to, a physical keyboard, function keys (such as volume control keys and switch keys), a trackball, a mouse, and a joystick, which are not described in detail here.
The memory 109 may be used to store software programs as well as various data. The memory 109 may mainly include a first storage area storing programs or instructions and a second storage area storing data, where the first storage area may store an operating system, an application program or instructions required by at least one function (such as a sound playing function or an image playing function), and the like. Further, the memory 109 may include volatile memory or non-volatile memory, or both. The non-volatile memory may be a read-only memory (ROM), a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), or a flash memory. The volatile memory may be a random access memory (RAM), a static RAM (SRAM), a dynamic RAM (DRAM), a synchronous DRAM (SDRAM), a double data rate SDRAM (DDR SDRAM), an enhanced SDRAM (ESDRAM), a synchlink DRAM (SLDRAM), or a direct rambus RAM (DRRAM). The memory 109 in the embodiments of the present application includes, but is not limited to, these and any other suitable types of memory.
The processor 110 may include one or more processing units. Optionally, the processor 110 integrates an application processor, which mainly handles operations related to the operating system, the user interface, application programs, and the like, and a modem processor, such as a baseband processor, which mainly handles wireless communication signals. It can be appreciated that the modem processor may alternatively not be integrated into the processor 110.
An embodiment of the present application further provides a readable storage medium on which a program or instructions are stored. When the program or instructions are executed by a processor, each process of the above method for determining the video playback speed is implemented, with the same technical effects; to avoid repetition, details are not described here again.
The processor is the processor in the electronic device described in the foregoing embodiments. The readable storage medium includes a computer-readable storage medium, such as a computer read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
An embodiment of the present application further provides a chip, including a processor and a communication interface coupled to the processor. The processor is configured to run a program or instructions to implement each process of the above method for determining the video playback speed, with the same technical effects; to avoid repetition, details are not described here again.
It should be understood that the chip mentioned in the embodiments of the present application may also be referred to as a system-level chip, a system chip, a chip system, or a system-on-chip.
An embodiment of the present application provides a computer program product stored in a storage medium. The program product is executed by at least one processor to implement each process of the above embodiment of the method for determining the video playback speed, with the same technical effects; to avoid repetition, details are not described here again.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that includes a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such a process, method, article, or apparatus. Without further limitation, an element defined by the statement "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes the element. Further, it should be noted that the scope of the methods and apparatuses in the embodiments of the present application is not limited to performing the functions in the order illustrated or discussed; the functions may also be performed in a substantially simultaneous manner or in a reverse order. For example, the described methods may be performed in an order different from that described, and various steps may be added, omitted, or combined. In addition, features described with reference to certain examples may be combined in other examples.
Through the description of the foregoing embodiments, those skilled in the art can clearly understand that the methods of the foregoing embodiments may be implemented by software plus a necessary general-purpose hardware platform, or by hardware, although in many cases the former is the better implementation. Based on such an understanding, the technical solutions of the present application may be embodied in the form of a computer software product stored in a storage medium (such as a ROM/RAM, a magnetic disk, or an optical disk), including instructions for enabling a terminal (such as a mobile phone, a computer, a server, or a network device) to execute the methods of the embodiments of the present application.
While the present embodiments have been described with reference to the accompanying drawings, it is to be understood that the invention is not limited to the precise embodiments described above, which are meant to be illustrative and not restrictive, and that various changes may be made therein by those skilled in the art without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (12)

1. A method for determining a video playback speed, the method comprising:
acquiring first video content feature information corresponding to N video segments in a first video, wherein N is a positive integer;
respectively splicing user behavior feature information of a target user with the first video content feature information corresponding to the N video segments, to obtain video feature information corresponding to each of the N video segments;
determining a first playback speed corresponding to each video segment based on the video feature information corresponding to each video segment;
and playing the first video based on the first playback speed.
2. The method according to claim 1, wherein the obtaining first video content feature information corresponding to N video segments in the first video includes:
for one of the N video segments, dividing a frame of video image in the video segment to obtain X image blocks;
acquiring image feature information corresponding to the X image blocks;
inputting image feature information corresponding to the X image blocks into a video content feature extraction model for feature extraction to obtain second video content feature information corresponding to the X image blocks;
and performing feature fusion on the second video content feature information corresponding to the X image blocks to obtain the first video content feature information.
3. The method according to claim 2, wherein the video content feature extraction model comprises a multi-head attention module, a residual and normalization module, and a feed-forward module, and the inputting the image feature information corresponding to the X image blocks into the video content feature extraction model for feature extraction to obtain the second video content feature information corresponding to the X image blocks comprises:
inputting the image feature information corresponding to the X image blocks into the video content feature extraction model, and performing video content feature extraction on the image feature information corresponding to the X image blocks based on the multi-head attention module, to obtain X pieces of first key video content feature information corresponding to the X image blocks;
calculating a mean and a standard deviation corresponding to the X pieces of first key video content feature information based on the residual and normalization module, and acquiring X pieces of third video content feature information corresponding to the X pieces of first key video content feature information based on the mean and the standard deviation;
fusing, based on the feed-forward module, all feature information in each of the X pieces of third video content feature information, to obtain the X pieces of second video content feature information;
wherein one image block corresponds to one piece of first key video content feature information.
4. The method according to claim 1, wherein the determining a first playback speed corresponding to each video segment based on the video feature information corresponding to each video segment comprises:
inputting the video feature information corresponding to each video segment into a video playback speed estimation model, and outputting a playback speed prediction parameter corresponding to each video segment;
determining, from X preset playback speeds and based on the playback speed prediction parameter corresponding to each video segment, the first playback speed corresponding to each video segment;
wherein X is a positive integer.
5. The method according to claim 4, wherein the inputting the video feature information corresponding to each video segment into a video playback speed estimation model and outputting the playback speed prediction parameter corresponding to each video segment comprises:
for one of the N video segments, inputting the video feature information corresponding to the video segment into an attention mechanism module to obtain key video feature information corresponding to the video segment;
and acquiring playback speed information that has a mapping relationship with the key video feature information, and acquiring the playback speed prediction parameter corresponding to the video segment based on the playback speed information and the key video feature information.
6. An apparatus for determining a video playback speed, the apparatus comprising an acquisition module, a processing module, and a playing module, wherein:
the acquisition module is used for acquiring first video content characteristic information corresponding to N video clips in a first video, wherein N is a positive integer;
the processing module is configured to splice user behavior feature information of a target user and first video content feature information corresponding to the N video segments acquired by the acquisition module, respectively, to obtain video feature information corresponding to each of the N video segments;
the processing module is configured to determine a first playback speed corresponding to each video segment based on the video feature information corresponding to each video segment;
the playing module is used for playing the first video based on the first playing double speed.
7. The apparatus of claim 6,
the processing module is specifically configured to divide a video image in the video clip to obtain X image blocks for one of the N video clips;
the acquisition module is further used for acquiring image characteristic information corresponding to the X image blocks;
the processing module is specifically configured to input the image feature information corresponding to the X image blocks into a video content feature extraction model for feature extraction, so as to obtain second video content feature information corresponding to the X image blocks;
the processing module is specifically configured to perform feature fusion on the second video content feature information corresponding to the X image blocks to obtain the first video content feature information.
8. The apparatus of claim 7,
the processing module is specifically configured to:
inputting the image feature information corresponding to the X image blocks into the video content feature extraction model, and performing video content feature extraction on the image feature information corresponding to the X image blocks based on a multi-head attention module, to obtain X pieces of first key video content feature information corresponding to the X image blocks;
calculating a mean and a standard deviation corresponding to the X pieces of first key video content feature information based on a residual and normalization module, and acquiring X pieces of third video content feature information corresponding to the X pieces of first key video content feature information based on the mean and the standard deviation;
fusing, based on a feed-forward module, all feature information in each of the X pieces of third video content feature information, to obtain the X pieces of second video content feature information;
wherein one image block corresponds to one piece of first key video content feature information.
9. The apparatus of claim 6,
the processing module is specifically configured to:
inputting the video feature information corresponding to each video segment into a video playback speed estimation model, and outputting a playback speed prediction parameter corresponding to each video segment;
determining, from X preset playback speeds and based on the playback speed prediction parameter corresponding to each video segment, the first playback speed corresponding to each video segment;
wherein X is a positive integer.
10. The apparatus of claim 9,
the processing module is specifically configured to:
for one of the N video segments, inputting the video feature information corresponding to the video segment into an attention mechanism module to obtain key video feature information corresponding to the video segment;
and acquiring playback speed information that has a mapping relationship with the key video feature information, and acquiring the playback speed prediction parameter corresponding to the video segment based on the playback speed information and the key video feature information.
11. An electronic device comprising a processor, a memory, and a program or instructions stored on the memory and executable on the processor, wherein the program or instructions, when executed by the processor, implement the steps of the method for determining a video playback speed according to any one of claims 1 to 5.
12. A readable storage medium, on which a program or instructions are stored, which when executed by a processor implement the steps of the method for determining a video playback speed according to any one of claims 1 to 5.
CN202210994766.0A 2022-08-18 2022-08-18 Method and device for determining video playing multiple speed, electronic equipment and medium Pending CN115396728A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210994766.0A CN115396728A (en) 2022-08-18 2022-08-18 Method and device for determining video playing multiple speed, electronic equipment and medium

Publications (1)

Publication Number Publication Date
CN115396728A true CN115396728A (en) 2022-11-25

Family

ID=84121125

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210994766.0A Pending CN115396728A (en) 2022-08-18 2022-08-18 Method and device for determining video playing multiple speed, electronic equipment and medium

Country Status (1)

Country Link
CN (1) CN115396728A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110225398A (en) * 2019-05-28 2019-09-10 腾讯科技(深圳)有限公司 Multimedia object playback method, device and equipment and computer storage medium
CN110490194A (en) * 2019-07-24 2019-11-22 广东工业大学 A kind of recognition methods of the multiple features segment fusion traffic sign of adaptive weight
CN113946706A (en) * 2021-05-20 2022-01-18 广西师范大学 Image description generation method based on reference preposition description
CN114218488A (en) * 2021-12-16 2022-03-22 中国建设银行股份有限公司 Information recommendation method and device based on multi-modal feature fusion and processor
CN114390217A (en) * 2022-01-17 2022-04-22 腾讯科技(深圳)有限公司 Video synthesis method and device, computer equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Ni Weijian; Guo Haoyu; Liu Tong; Zeng Qingtian: "A Shopping Basket Recommendation Method Based on a Multi-Head Self-Attention Neural Network", Data Analysis and Knowledge Discovery, no. 1 *

Similar Documents

Publication Publication Date Title
CN110084775B (en) Image processing method and device, electronic equipment and storage medium
EP1865719B1 (en) Information processing apparatus, information processing method, and program
EP3815042B1 (en) Image display with selective depiction of motion
CN110263213B (en) Video pushing method, device, computer equipment and storage medium
CN111612070B (en) Image description generation method and device based on scene graph
US20210165481A1 (en) Method and system of interactive storytelling with probability-based personalized views
CN112148980B (en) Article recommending method, device, equipment and storage medium based on user click
CN113590881A (en) Video clip retrieval method, and training method and device of video clip retrieval model
CN114429611B (en) Video synthesis method and device, electronic equipment and storage medium
CN111814538B (en) Method and device for identifying category of target object, electronic equipment and storage medium
US11847827B2 (en) Device and method for generating summary video
CN112734627A (en) Training method of image style migration model, and image style migration method and device
CN115396728A (en) Method and device for determining video playing multiple speed, electronic equipment and medium
US20220335250A1 (en) Methods and apparatuses for fine-grained style-based generative neural networks
CN116233554A (en) Video playing method, device, electronic equipment and storage medium
CN112261321B (en) Subtitle processing method and device and electronic equipment
CN113489899A (en) Special effect video recording method and device, computer equipment and storage medium
CN113115104A (en) Video processing method and device, electronic equipment and storage medium
CN111311483A (en) Image editing and training method and device, electronic equipment and storage medium
CN115272394A (en) Method, device and equipment for determining position of target object to be tracked and storage medium
CN114022814A (en) Video processing method and apparatus, electronic device, and computer-readable storage medium
CN114491151A (en) Video cover generation method and device and storage medium
CN114637887A (en) Article recommendation method, device and equipment and readable storage medium
CN114581814A (en) Video frame detection method and device, electronic equipment and storage medium
CN114187387A (en) Scene image generation method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination