CN110856042A - Video playing method and device, computer readable storage medium and computer equipment


Info

Publication number
CN110856042A
Authority
CN
China
Prior art keywords: video, target video, playing, key, target
Prior art date
Legal status
Granted
Application number
CN201911125894.6A
Other languages
Chinese (zh)
Other versions
CN110856042B (en)
Inventor
王瑞琛
王晓利
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN201911125894.6A priority Critical patent/CN110856042B/en
Publication of CN110856042A publication Critical patent/CN110856042A/en
Application granted granted Critical
Publication of CN110856042B publication Critical patent/CN110856042B/en
Current legal status: Active

Classifications

    • H — ELECTRICITY
    • H04 — ELECTRIC COMMUNICATION TECHNIQUE
    • H04N — PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/83Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N21/845Structuring of content, e.g. decomposing content into time segments
    • H04N21/8456Structuring of content, e.g. decomposing content into time segments by decomposing the content in the time domain, e.g. in time segments
    • H — ELECTRICITY
    • H04 — ELECTRIC COMMUNICATION TECHNIQUE
    • H04N — PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47End-user applications
    • H04N21/472End-user interface for requesting content, additional data or services; End-user interface for interacting with content, e.g. for content reservation or setting reminders, for requesting event notification, for manipulating displayed content
    • H04N21/47202End-user interface for requesting content, additional data or services; End-user interface for interacting with content, e.g. for content reservation or setting reminders, for requesting event notification, for manipulating displayed content for requesting content on demand, e.g. video on demand

Abstract

The application relates to a video playing method and device, a computer readable storage medium and computer equipment, wherein the method comprises the following steps: acquiring a target video; determining video key segments in the target video; playing the target video; when a video key segment is played, playing the video key segment at a first playing speed; when a non-key segment in the target video is played, playing the non-key segment at a second playing speed; the second playing speed is greater than the first playing speed. The scheme provided by the application can avoid consuming a large amount of unnecessary playing time in the process of playing the video.

Description

Video playing method and device, computer readable storage medium and computer equipment
Technical Field
The present application relates to the field of computer technologies, and in particular, to a video playing method and apparatus, a computer-readable storage medium, and a computer device.
Background
With the continuous development of internet technology and video technology, more and more users like to search the internet for videos to watch, and they usually search according to tags that interest them. For example, a user who likes food will look for videos tagged as food. Annotating personnel are therefore required to label videos accordingly.

Before tagging a video, an annotator usually plays it in a corresponding video client, either at normal speed or while manually controlling the playing speed according to actual requirements, and applies the corresponding tag after finishing watching. When the number of videos is large and the videos are long, this playing mode consumes a large amount of unnecessary playing time, which in turn lowers video annotation efficiency.
Disclosure of Invention
Based on this, it is necessary to provide a video playing method and apparatus, a computer-readable storage medium, and a computer device to solve the technical problem of consuming a large amount of unnecessary playing time when playing a video.
A video playback method, comprising:
acquiring a target video;
determining video key segments in the target video;
playing the target video;
when the video key segment is played, playing the video key segment at a first playing speed;
when playing a non-key segment in the target video, playing the non-key segment at a second playing speed; the second playing speed is greater than the first playing speed.
A video playback device, the device comprising:
the acquisition module is used for acquiring a target video;
the determining module is used for determining video key fragments in the target video;
the playing module is used for playing the target video;
the first adjusting module is used for playing the video key segment at a first playing speed when the video key segment is played;
the second adjusting module is used for playing the non-key segments at a second playing speed when the non-key segments in the target video are played; the second playing speed is greater than the first playing speed.
A computer-readable storage medium, storing a computer program which, when executed by a processor, causes the processor to perform the steps of the video playback method.
A computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of the video playback method.
According to the video playing method and device, the computer readable storage medium and the computer equipment, the video key segments in the target video are determined before the target video is played, and during playback the non-key segments are played quickly, at a speed higher than that of the video key segments. Even if the number of videos is large and the videos are long, no large amount of unnecessary playing time is spent on the non-key segments, so the playing time of the target video is effectively shortened and video annotation efficiency is improved.
Drawings
FIG. 1 is a diagram of an application environment of a video playing method in one embodiment;
FIG. 2 is a flow chart illustrating a video playing method according to an embodiment;
FIG. 3 is a diagram of video key segments and non-key segments in a target video, in one embodiment;
FIG. 4 is a diagram illustrating the playing of a target video in one embodiment;
FIG. 5 is a flowchart illustrating the steps of determining key segments of a video based on score values in one embodiment;
FIG. 6 is a schematic diagram of the structure of a VASNet model in one embodiment;
FIG. 7 is a flowchart illustrating the steps of tagging target videos in one embodiment;
FIG. 8 is a timing diagram illustrating a video playback method according to one embodiment;
FIG. 9 is a flowchart illustrating a video playing method according to another embodiment;
FIG. 10 is a block diagram showing the structure of a video playing device according to an embodiment;
FIG. 11 is a block diagram showing the structure of a video playing device according to another embodiment;
FIG. 12 is a block diagram of a computer device in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
Fig. 1 is a diagram of an application environment of a video playing method in an embodiment. Referring to fig. 1, the video playing method is applied to a video playing system. The video playing system includes a terminal 110 and a server 120, connected through a network. The terminal 110 acquires a target video; determines the video key segments in the target video; plays the target video; when a video key segment is played, plays it at a first playing speed; and when a non-key segment in the target video is played, plays the non-key segment at a second playing speed, the second playing speed being greater than the first playing speed. After the target video has been played, a corresponding tag is set for it, and it is then uploaded to the server 120. The server 120 recommends the target video to the terminal 130.
The terminals 110 and 130 may be desktop terminals or mobile terminals, and the mobile terminals may be at least one of a mobile phone, a tablet computer, a notebook computer, and the like. The server 120 may be implemented as a stand-alone server or a server cluster composed of a plurality of servers.
In one embodiment, as shown in FIG. 2, a video playback method is provided. The embodiment is mainly illustrated by applying the method to the terminal 110 in fig. 1. Referring to fig. 2, the video playing method specifically includes the following steps:
s202, acquiring a target video.
The target video may be a video composed of a series of frame images, and when the target video is decoded, a series of consecutive frame images may be obtained. At least two frame images in the target video contain key information, the theme of the target video can be determined through the key information, and the style or type of the target video can be determined through watching the key information. The frame rate of the target video may be greater than or equal to 30 frames/second, with 30 frames/second indicating that the video contains 30 consecutive images per second.
The key information may specifically include information about food, sports, scenery, adventure, entertainment, and the like; for example, it may be determined through the key information that the target video belongs to one or more of the food, sports, scenery, adventure, and entertainment categories preferred by the user.
In addition, the key information may also include information about a specific person or a vehicle, for example, the specific person or the vehicle in the target video may be determined by the key information, so as to track the specific person or the vehicle in the target video.
In one embodiment, when receiving the video annotation task, the terminal acquires the target video from the local or acquires the target video from a server on the network side according to the network link. The local may refer to the terminal itself, or a server directly connected to the terminal. The video annotation task may be a task of annotating or classifying videos, such as tagging a target video (i.e., a video tagging task), or editing video description information for a target video (i.e., a writing task of video description information), or identifying objects in a video. The object may be a person object or may be a non-person type object, such as a car.
For example, before the annotating personnel annotate the target video, the terminal can select the corresponding target video on the video selection page according to the selection operation of the annotating personnel, and then load the target video into the memory from the local storage. Or the terminal acquires a network link corresponding to the click operation according to the click operation of the annotation personnel, and acquires the target video from a server on the network side according to the network link.
In another embodiment, the terminal may acquire the target video in real time from the connected video capturing device. For example, a video producer shoots a target video in real time through a held video shooting device and then transmits the target video to a terminal through a data line or a network, and a annotator can obtain the target video transmitted by the video shooting device in real time through the terminal. For another example, a monitor in a road or a cell shoots a target video in real time, the target video is transmitted to a terminal through a laid transmission line, and a annotating person can acquire the target video transmitted by the monitor in real time through the terminal.
And S204, determining video key segments in the target video.
The video key segment refers to a video segment containing key information, and the theme of the whole target video can be determined from the key information it contains. In addition to video key segments, the target video also includes non-key segments; as shown in fig. 3, the video segments containing the person in the figure may be video key segments.
In one embodiment, the terminal detects key information contained in each frame image in the target video, and determines a video key segment according to the frame image containing the key information.
In one embodiment, the terminal segments a target video to obtain a plurality of video segments, detects whether each video segment contains key information, and if yes, determines that the video segment is a video key segment. Wherein, a plurality of video segments may refer to two or more video segments.
Specifically, the step of detecting whether each video clip contains key information may specifically include: and the terminal extracts frame images from each video clip at fixed intervals or randomly, detects whether the extracted frame images contain key information, and if so, determines the corresponding video clip as the video key clip.
In an embodiment, the step of detecting whether the key information is included may specifically include: and the terminal extracts image features in the frame image, compares the image features with preset image features, and determines to contain key information if the similarity reaches the preset similarity. Or the terminal scores the extracted image features through a machine learning model to obtain a score value of the corresponding frame image, and when the score value reaches a preset score condition, the frame image is determined to contain the key information. The machine learning model may include, but is not limited to, a VASNet model or other neural network models based on video summarization techniques, among others.
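As a rough illustration of the feature-comparison branch above, the following Python sketch checks an extracted frame feature against preset features by cosine similarity; the function names and the 0.8 threshold are illustrative assumptions, not values given in this application.

    import numpy as np

    def contains_key_information(frame_feature: np.ndarray,
                                 preset_features: list[np.ndarray],
                                 threshold: float = 0.8) -> bool:
        """Return True if the frame feature is close enough to any preset feature."""
        for preset in preset_features:
            # Cosine similarity between the extracted and preset feature vectors
            cos_sim = np.dot(frame_feature, preset) / (
                np.linalg.norm(frame_feature) * np.linalg.norm(preset) + 1e-8)
            if cos_sim >= threshold:
                return True
        return False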
In one embodiment, when a video key segment containing key information is determined, the terminal sets a first segment identifier for the video key segment, so that when a target video is played, whether the video key segment is played is determined through the first segment identifier. In addition, the terminal also sets a second segment identifier for the non-key segment in the target video, so that when the target video is played, whether the non-key segment is played or not is determined through the second segment identifier.
S206, playing the target video.
In one embodiment, the terminal loads a video annotation tool, and the target video is played through the video annotation tool.
The video annotation tool may be a client. Through the tool, the target video can be played at the normal playing speed, or automatically played quickly according to actual requirements. Moreover, after detecting key information the tool can automatically fetch a tag from a tag library and tag the target video; alternatively, it can select a corresponding tag according to the annotator's input operation and then tag the target video.
In addition, the video annotation tool may also refer to a web plug-in, through which the native player may be called to play the target video, and the target video is tagged according to the input annotation instruction.
In one embodiment, the terminal loads a native player through the video annotation tool to play the target video. The native player can be the system's own native player or an installed third-party native player. A native player, which may also be referred to as a Native player, is a player generated using the system's native code; for example, on the Android system, the native player may be a player generated using native Android JAVA code.
When playing the target video, the native player can preload and cache it, so that playback of a target video on the network side is smoother; under normal network conditions no buffering occurs, which greatly improves the user experience.
In one embodiment, if the video annotation tool is a web page plug-in, then when the target video is played, the terminal calls the WebView component through the tool to display interaction controls for the target video, and interaction with the target video is realized through these controls. For example, the annotator performs operations such as fast forward, pause, and playback at the view level of the WebView, so that the target video can be played quickly, paused, or played back. In addition, the video annotation tool can automatically accelerate the non-key segments of the target video without any manual operation by the annotator.
In one embodiment, when the target video is played, the video key segment and the non-key segment are marked in different presentation modes on the progress bar according to the first segment identification and the second segment identification. The progress bar is usually displayed below the video playing page in a rectangular bar shape, and can be used for displaying the playing progress and completion of the target video and the duration of the remaining unplayed target video in real time when the target video is played. The total length of the progress bar may represent the total duration of the target video. Each segment on the progress bar corresponds to each video segment in the target video, for example, the segment from 3 minutes to 6 minutes on the progress bar corresponds to the segment from 3 minutes to 6 minutes of the target video. The presentation may be such that the video key clips and non-key clips are marked with different colors.
As shown in fig. 4, on the progress bar, black portions are used to represent non-key clips of the target video, and light gray portions are used to represent video key clips of the target video.
S208, when the video key segment is played, the video key segment is played at a first playing speed.
The playing speed may refer to a frame rate, i.e., the frequency, in frames, at which images appear continuously on the display. The first playing speed may be the normal playing speed of the target video; for example, if 30 video frames are played per second during normal playback, the first playing speed is 30 frames/second. Alternatively, the first playing speed may be slightly higher than the normal playing speed, e.g., n times the normal speed with 1 < n < 1.5, or slightly lower, e.g., m times the normal speed with 0.8 < m < 1.
In one embodiment, when the terminal plays the target video, the currently played position is recorded in real time, whether the video key segment is played to the start position corresponding to the video key segment is judged according to the recorded position, and if the video key segment is played to the start position corresponding to the video key segment, the video key segment is played at the first playing speed. For example, as shown in fig. 4, when playing to the video key segment pointed by the arrow in the figure, the video key segment will be played at 30 frames/sec.
In one embodiment, if a non-key segment is currently being played at the second playing speed and playback reaches its end position, with a video key segment to be played next, the terminal need not drop immediately from the second playing speed to the first. It may instead decrease the playing speed gradually within the first a (1 < a < 90) frames of the video key segment, e.g., from the 1st frame to the 30th frame, until the speed falls to the first playing speed.
S210, when a non-key segment in the target video is played, playing the non-key segment at a second playing speed.
Wherein the second playing speed is greater than the first playing speed; for example, if the first playing speed is v1, the second playing speed is k × v1 with k > 1.
In one embodiment, when the terminal plays the target video, the currently played position is recorded in real time; whether playback has reached the start position of a non-key segment is judged according to the recorded position, and if so, the non-key segment is played at the second playing speed.
For example, as shown in fig. 4, if the video key segments are played at 30 frames/second, then when a non-key segment in the target video (such as the one pointed to by the arrow in the figure) is played, it will be played at 30 × k frames/second (k > 1), thereby realizing accelerated playing of the non-key segments.
In one embodiment, if a video key segment is currently being played at the first playing speed and playback reaches its end position, with a non-key segment to be played next, the terminal need not jump immediately from the first playing speed to the second. It may instead increase the playing speed gradually within the first a (1 < a < 90) frames of the non-key segment, e.g., from the 1st frame to the 30th frame, until the speed rises to the second playing speed.
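A minimal sketch of this ramping behavior, assuming a linear ramp over the first 30 frames of each new segment; the function name and the linear ramp shape are illustrative assumptions:

    def playback_speed(frame_in_segment: int, start_speed: float,
                       target_speed: float, ramp_frames: int = 30) -> float:
        """Speed for the given frame index (0-based) within the current segment."""
        if frame_in_segment >= ramp_frames:
            return target_speed
        t = frame_in_segment / ramp_frames          # 0.0 -> 1.0 across the ramp
        return start_speed + (target_speed - start_speed) * t

    # Entering a key segment from a non-key segment (30 fps normal, k = 2):
    # frame 0 plays at 60 fps, frame 15 at 45 fps, frame 30 and later at 30 fps.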
In another embodiment, when a target video is played through a video annotation tool, a terminal detects a skip playing instruction triggered on the video annotation tool in real time; when the non-key segment in the target video is played and the jump playing instruction is detected, the non-key segment is jumped to the next video key segment for playing, so that the playing time of the target video can be further effectively reduced.
When the video annotation tool is a client, a skip control for skipping playing can be arranged on the video annotation tool, and the currently played video clip can be directly skipped to the next video clip by triggering the skip control.
For example, as shown in fig. 3, a1, a2, and A3 represent 3 non-key clips in the target video, and B1, B2, and B3 represent 3 video key clips in the target video. When the non-key segment A1 is played and the jump control triggered by the user is detected, the terminal does not continue to play the non-key segment A1, jumps directly to the video key segment B1, and so on until the whole target video is played.
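The jump logic can be sketched as follows, assuming the key segments are known as (start, end) frame ranges; all names here are illustrative:

    def next_key_segment_start(key_segments: list[tuple[int, int]],
                               current_frame: int) -> int | None:
        """Return the start frame of the next key segment, or None if none remain."""
        for start, end in sorted(key_segments):
            if start > current_frame:
                return start
        return None

    # With key segments B1=(100, 200), B2=(400, 500), B3=(800, 900) and the
    # playhead at frame 50 (inside A1), the player jumps to frame 100 (B1).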
In the above embodiment, the video key segments in the target video are determined before the target video is played, and during playback the non-key segments are played at a speed higher than that of the video key segments. Even if the number of videos is large and the videos are long, no large amount of unnecessary playing time is spent on the non-key segments, so the playing time of the target video is effectively shortened and video annotation efficiency is improved.
In an embodiment, as shown in fig. 5, S204 may specifically include:
and S502, extracting the features of the frame images in the target video to obtain the corresponding image features.
Here, the frame image may refer to a video frame of a target video, and when the target video is decoded, a series of consecutive frame images may be obtained. The image feature may be a feature vector having a set dimension, for example, a feature vector of 2048 dimensions.
In one embodiment, the terminal performs feature extraction on a frame image in the target video through a feature extraction network to obtain a feature vector with a set dimension. The feature extraction network may include, but is not limited to, a residual network (ResNet), a convolutional neural network, and the like.
In one embodiment, the terminal decodes the target video to obtain a corresponding frame image; sampling the obtained frame image to obtain a sampled frame image; and extracting image features in the sampling frame image through a feature extraction network.
For example, the terminal sequentially inputs the sampled frame images into the residual network, performs feature extraction on them through the residual network, and takes the output of the last pooling layer of the residual network as the image feature of each sampled frame image. Each image feature may be a 2048-dimensional feature vector: each sampled frame image yields one 2048-dimensional feature vector after processing by the residual network, the feature vectors are then combined, and the combined feature vector is used as the input of the machine learning model.
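A minimal sketch of this step, assuming PyTorch and torchvision (libraries not named in this application): a pretrained ResNet-50 with its classification head removed, so the output of the final pooling layer serves as the 2048-dimensional feature vector per sampled frame.

    import torch
    import torchvision.models as models
    import torchvision.transforms as T

    resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
    backbone = torch.nn.Sequential(*list(resnet.children())[:-1])  # drop fc layer
    backbone.eval()

    preprocess = T.Compose([
        T.Resize(256), T.CenterCrop(224), T.ToTensor(),
        T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])

    @torch.no_grad()
    def extract_feature(frame_image) -> torch.Tensor:
        """PIL image -> 2048-dim feature vector from the last pooling layer."""
        x = preprocess(frame_image).unsqueeze(0)       # (1, 3, 224, 224)
        return backbone(x).flatten(1).squeeze(0)       # (2048,)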
In an embodiment, the step of sampling the obtained frame images may specifically include: the terminal samples the obtained frame images at fixed intervals; that is, each time a frame image is sampled, the terminal moves forward with the fixed interval as the step length, until all frame images have been sampled.
For example, assuming the target video has 3000 frame images in total, sampling is performed at a fixed interval of 15 frames: among the 3000 frames, the first sample takes the 1st frame; after moving forward 15 frames, the second sample takes the 16th frame; and so on, yielding 200 frame images in total.
In another embodiment, the step of sampling the obtained frame image may specifically include: the terminal equally divides the obtained frame images into a plurality of groups, and then extracts a specified number of frame images from each group. Wherein, at the time of the extraction in each group, random extraction may be performed.
For example, assuming that the number of frame images of the target video is 3000 frames in total, the 3000 frames are equally divided into 100 groups, and the number of frame images in each group is 30 frames. The terminal extracts 2 frames from 100 groups, respectively, thereby obtaining a frame image of 200 frames in total.
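The two sampling strategies can be sketched as follows (0-based frame indices; the interval and group counts match the examples above, and the function names are assumptions):

    import random

    def sample_fixed_interval(num_frames: int, interval: int = 15) -> list[int]:
        """Take every `interval`-th frame: 3000 frames -> indices 0, 15, ..., 2985."""
        return list(range(0, num_frames, interval))

    def sample_grouped(num_frames: int, num_groups: int = 100,
                       per_group: int = 2) -> list[int]:
        """Split the frames into equal groups and randomly draw `per_group` from each."""
        group_size = num_frames // num_groups        # 3000 frames -> 30 per group
        indices: list[int] = []
        for g in range(num_groups):
            start = g * group_size
            indices.extend(random.sample(range(start, start + group_size), per_group))
        return sorted(indices)                       # 100 groups x 2 = 200 samples

    # Both strategies reduce a 3000-frame video to 200 sampled frames.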
And S504, processing the image characteristics through a machine learning model to obtain a score value of the corresponding frame image.
The machine learning model may include, but is not limited to, the VASNet model or other neural network models based on video summarization techniques. As shown in fig. 6, the structure of the VASNet model mainly consists of two parts: one part is a soft self-attention network, i.e., the attention network in fig. 6; the other part is a fully connected network for regression, i.e., the regression network in fig. 6.
In one embodiment, the terminal combines the obtained image features to obtain a combined image feature, inputs the combined image feature into a machine learning model, and processes the image feature through the machine learning model to obtain a score value of the corresponding frame image. Taking a machine learning model as a VASNet model as an example for explanation:
The terminal processes the input combined image features through the soft self-attention network to compute a self-attention vector e_(t,i). Let the combined image feature be X = (x_0, x_1, ..., x_i, ..., x_n); the self-attention vector e_(t,i) is computed as follows:

e_(t,i) = s[(U x_i)^T (V x_t)], t ∈ [0, N), i ∈ [0, N)

where N is the number of frame images (i.e., the number of frames) in the target video, U and V are network weight matrices estimated during optimization along with the other parameters of the network, and s is a scale parameter, which may be set to 0.06.
Optionally, a self-attention vector e is calculated(t,i)The calculation can also be made with reference to the following way:
e(t,i)=Mtanh(Uxi+Vxt)
where M is an additional network weight learned by the VASNet model during training.
Then, using softmax in the soft self-attention network, the self-attention vector e is then used(t,i)Convert to attention weight α(t,i)The calculation used for the conversion is as follows:
Figure BDA0002276814630000101
therein, attention is paid to the weight α(t,i)Is the true probability, representing the importance of the input image feature relative to the expected frame-level score at time t.
The input combined image features are then processed using a linear transformation C, and the result of the processing is then combined with an attention vector α(t,i)Weighting and averaging to obtain a context vector ctAnd (4) performing final frame score regression. Wherein the linear transformation C is calculated, and computing the context vector CtThe calculation of (a) is as follows:
bi=Cxi
Then the context vector c_t is projected through a single-layer fully connected network with linear activation and a residual sum, followed by dropout and layer normalization, as follows:

k_t = norm(dropout(W c_t + x_t))

where C and W are network weight matrices learned during network training; to regularize the network, dropout is also applied to the attention weights in the regression network.
Finally, performing frame score regression y through two layers of neural networkst=m(kt) Processing, wherein the first layer has a relu activation function, and then a droout layer and a layer normalization layer (layer normalization) are carried out; and the second layer has a hidden unit activated by sigmoid.
Thereby obtaining a score vector y ═ y0,y1,......,yn) And y is (0, 1) and is the length n.
For example, 200 2048-dimensional image features (each image feature being a feature vector) are extracted from a target video and combined; the combined image features are input into the VASNet model, which outputs a 200-dimensional score vector y = (y_0, y_1, ..., y_n). Each dimension of the score vector is the score value of the corresponding sampled frame image and represents the importance of the current frame image; the larger the score value, the more likely the frame contains key information.
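A condensed sketch of this scoring pass, assuming PyTorch; it follows the soft self-attention plus regression structure described above, but the dropout rates and exact layer sizes are assumptions, and training is omitted:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class FrameScorer(nn.Module):
        def __init__(self, d: int = 2048, s: float = 0.06):
            super().__init__()
            self.U = nn.Linear(d, d, bias=False)   # weight matrix U
            self.V = nn.Linear(d, d, bias=False)   # weight matrix V
            self.C = nn.Linear(d, d, bias=False)   # linear transformation C
            self.W = nn.Linear(d, d)               # weight matrix W
            self.norm = nn.LayerNorm(d)
            self.drop = nn.Dropout(0.5)
            self.out = nn.Sequential(              # two-layer regression head
                nn.Linear(d, d), nn.ReLU(), nn.Dropout(0.5), nn.LayerNorm(d),
                nn.Linear(d, 1), nn.Sigmoid())
            self.s = s

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (N, d) combined image features of the N sampled frames
            e = self.s * (self.V(x) @ self.U(x).T)   # e[t, i] = s (U x_i)^T (V x_t)
            alpha = F.softmax(e, dim=1)              # attention weights alpha[t, i]
            c = alpha @ self.C(x)                    # c_t = sum_i alpha[t, i] * (C x_i)
            k = self.norm(self.drop(self.W(c) + x))  # k_t = norm(dropout(W c_t + x_t))
            return self.out(k).squeeze(-1)           # scores y_t in (0, 1), shape (N,)

    # scores = FrameScorer()(features)   # features: (200, 2048) -> scores: (200,)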
S506, when the score value reaches a preset score value condition, determining the video key segments in the target video according to the score value.
When the score value reaches a preset score value condition, the frame image is represented to contain key information.
In one embodiment, the terminal divides the target video into at least two video segments; respectively calculating the total score value of each video clip according to the obtained score values; in each video clip, when the total score value of the target video clip reaches a preset score value condition, determining the target video clip as a video key clip.
The preset score condition may be a preset score threshold; for example, target video segments whose total score value is greater than or equal to a (a being the preset score threshold) are used as video key segments. The preset score condition may also be a ranking of the score values; for example, the video segments whose total score values rank in the top 10% are used as video key segments. The preset score condition may also be that the sum of the total scores of the selected target video segments is maximized under a set total segment length; for example, with a total video length of 3000 frames, the selected target video segments total no more than 450 frames and the sum of their total scores is maximized.
In one embodiment, after the video segments are segmented, the terminal counts the score values of the corresponding frame images in each video segment to obtain the total score value of each video segment. For example, assuming that the total frame number of a target video is 3000, if 200 frames are extracted for score calculation to obtain 200 score values, then the target video is divided into 10 video segments, each video segment has 300 frames, and 20 frames in each video segment have corresponding score values, then the total score value of the video segment is calculated according to the 20 score values in each video segment.
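The per-segment aggregation in this example can be sketched as follows (the names and the plain-sum aggregation are assumptions):

    def segment_total_scores(sampled_indices: list[int], scores: list[float],
                             segment_bounds: list[tuple[int, int]]) -> list[float]:
        """Sum the scores of the sampled frames that fall inside each segment."""
        totals = []
        for start, end in segment_bounds:
            totals.append(sum(s for idx, s in zip(sampled_indices, scores)
                              if start <= idx < end))
        return totals

    # 3000-frame video, 10 segments of 300 frames, 200 sampled scores:
    # each segment's total is the sum of the ~20 sampled scores inside it.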
In one embodiment, after calculating the total score value of each video clip, the terminal determines whether the total score value of the video clip reaches a preset score condition, and if so, determines that the video clip reaching the preset score condition is a video key clip; if not, determining that the video clip which does not reach the preset score condition is a non-key clip.
When the target video is divided, the division may be based on the difference between adjacent frame images, or performed at transition positions according to whether a transition occurs between two adjacent frame images. A transition refers to a switch between paragraphs or between scenes in the target video. For example, if the 1st to 50th frames of the target video are scene pictures related to food and the 51st to 100th frames are scene pictures related to food culture, then a transition exists between the 50th frame and the 51st frame.
Therefore, the segmentation of the target video can be described in two ways:
in the method 1, the difference between adjacent frame images is used for segmentation.
In an embodiment, the step of dividing the target video into at least two video segments may specifically include: the terminal calculates the difference value between each adjacent frame image in the target video; in the target video, when the difference value of the target adjacent frame images reaches a preset difference threshold value, the target video is segmented according to the target adjacent frame images to obtain at least two video segments.
For a video, the similarity between different frame images may be larger or smaller; that is, there are smaller or larger differences between frame images. Points where the difference value between frame images is large (i.e., reaches the preset difference threshold) are taken as optimal change points, which ensures that the difference values between the frame images lying between optimal change points are small. There may be multiple optimal change points.
In one embodiment, the terminal computes the optimal change points in the target video through a dynamic programming algorithm, so that the differences between the frame images within each segment are small, and segments the target video at the optimal change points to obtain at least two video segments. The dynamic programming algorithm may be the Kernel Temporal Segmentation (KTS) algorithm, which divides the target video into a certain number of video segments according to the distance (e.g., the Euclidean distance, used to represent the difference between frame images) between adjacent frame images in the target video:

L_(m,n) = min over change points t_1 < t_2 < ... < t_m of Σ_{i=0}^{m} v_(t_i, t_(i+1))

where L_(m,n) means that, under the optimal change points, the sum of the difference values of the frame images within each video segment reaches the minimum; v_(t_i, t_(i+1)) denotes the within-segment difference of the segment between change points t_i and t_(i+1); m is the number of optimal change points; and n is the number of frame images in the target video. The number of segments is controlled with a penalty term g(m, n), so that the number m of optimal change points does not become too large, where g(m, n) = m(log(n/m) + 1).
Mode 2, segmentation is performed according to whether transition occurs between two adjacent frames of images.
In an embodiment, the step of dividing the target video into at least two video segments may specifically include: a terminal acquires a transition identifier of a target video; and segmenting the target video according to the transition identification to obtain a plurality of video segments.
A transition refers to a switch between video segments or between video scenes in the target video. The transition identifier is used to identify whether a transition occurs in the target video and the position where it occurs. In the target video, if a transition occurs between the k-th frame and the (k+1)-th frame, the frame images up to the k-th frame and the frame images from the (k+1)-th frame onward differ greatly; therefore, the target video can be divided according to the transition identifier.
In one embodiment, when the target video is produced, if a transition is set, a transition identifier is generated, and the transition identifier is used for identifying that the transition is set for the target video and determining the position of the transition.
In the above embodiment, the frame images in the target video are sampled, the image features of the sampled frame images are extracted, and the score value of each sampled frame image is calculated from the image features, so that the video key segments are determined according to the score values. This helps prevent a large amount of unnecessary playing time from being consumed on non-key segments when the target video is played, effectively shortens the playing time of the target video, and improves annotation efficiency. In addition, the target video is segmented according to the difference values between frame images or according to the transition identifier, so that the similarity between frame images within each segmented video segment is high. This ensures that the obtained video key segments are real key segments and the obtained non-key segments are real non-key segments, avoiding the situation in which, during playback, a segment containing key information is accelerated while a segment not containing key information is played without acceleration.
In one embodiment, as shown in fig. 7, the method may further include:
s702, when the target video is played, at least one video label corresponding to the target video is obtained.
The video tag may be an identifier for classifying or labeling the target video. Video tags can have multiple levels, such as first-level and second-level video tags; for example, the first-level video tag may be gourmet food, and the corresponding second-level tags may be regional cuisines such as Lu (Shandong), Sichuan, Cantonese, Min (Fujian), Su (Jiangsu), Zhejiang, Hunan, and Hui (Anhui) cuisine.
In one embodiment, the terminal may obtain at least one video tag corresponding to the target video from the tag library through the input selection instruction. Or, the terminal may receive an input editing instruction, and acquire at least one video tag corresponding to the target video, which is edited by the user.
In one embodiment, when the target video is played, the terminal acquires description information for the target video; storing the description information and the target video in local; or, the description information and the target video are sent to a server for storage.
The description information may refer to text information describing the target video, such as: the target video shows, through many facets of food, the cultural traits of ceremony, ethics, and interest that food brings to people's lives; from the target video one can sense a country with a long cultural tradition, and the positive attitude people hold toward life, family, and society.
In one embodiment, the terminal may obtain the description information corresponding to the target video from the information base through the input selection instruction. Or, the terminal may receive an input editing instruction, and obtain description information corresponding to the target video, which is edited by the user. Or, the terminal generates the description information corresponding to the target video according to the key information in each video key segment.
S704, classifying the target video according to at least one video label.
For example, when the video tag is gourmet food, tagging the target video with the gourmet-food tag indicates that the target video belongs to the gourmet-food category, thereby classifying it. When a user searches for gourmet-food videos, the target video can then be found. In addition, in the video recommendation page or the search result display page, besides the video identifier of the target video, its tags can be displayed, such as gourmet food, Hunan cuisine, chopped pepper fish head, and the like.
S706, when the classification of the target video is finished, uploading the classified target video to a server for storage; and the sent target video is used for indicating the server to recommend the target video to the user equipment when receiving the video acquisition instruction sent by the user equipment.
In one embodiment, when the classification of the target video is completed, the terminal uploads the classified target video to the server. When the server receives the classified target video, it stores the video in a video library. When a video acquisition instruction sent by user equipment is received, the server determines the user's viewing habits according to the user identifier carried in the instruction, acquires target videos of the corresponding types from the video library according to those habits, and sends the acquired target videos to the user equipment.

In an embodiment, the step of determining the user's viewing habits according to the user identifier carried in the video acquisition instruction may specifically include: the server acquires the corresponding user behavior data according to the user identifier and determines the user's viewing habits from the behavior data. For example, if the user prefers watching gourmet-food videos, the user equipment records each such viewing and sends the records as user behavior data to the server for storage. The server thus determines the user's viewing habits through big data analysis and then recommends target videos of the corresponding types to the user.
In the above embodiment, the target videos are classified according to the corresponding video tags, and then the classified target videos are sent to the server, so that the server can recommend the videos according to the preference of the user, and the click rate of the videos can be improved. In addition, description information is added to the target video, so that understanding of the user on the target video can be facilitated, the interest of the user in watching the target video is increased, and the click rate of the target video is improved.
In an embodiment, as shown in fig. 8, another video playing method is provided, where the video playing method specifically includes the following steps:
s802, the terminal obtains the target video.
S804, the terminal determines the video key segments in the target video.
S806, the terminal extracts the features of the frame images in the target video to obtain corresponding image features;
in one embodiment, S806 may specifically include: decoding the target video to obtain a corresponding frame image;
sampling the obtained frame image to obtain a sampled frame image; and extracting image features in the sampling frame image through a feature extraction network.
And S808, the terminal processes the image characteristics through a machine learning model to obtain a score value of the corresponding frame image.
When the score value reaches a preset score value condition, the frame image is represented to contain key information.
And S810, when the score value reaches a preset score condition, the terminal divides the target video into at least two video segments.
In one embodiment, S810 may specifically include: the terminal calculates the difference value between each adjacent frame image in the target video; in the target video, when the difference value of the target adjacent frame images reaches a preset difference threshold value, the target video is segmented according to the target adjacent frame images to obtain at least two video segments.
In another embodiment, S810 may specifically include: a terminal acquires a transition identifier of a target video; and segmenting the target video according to the transition identification to obtain at least two video segments.
And S812, the terminal respectively calculates the total score value of each video clip according to the obtained score values.
S814, the terminal determines the target video clip as a video key clip in each video clip when the total score value of the target video clip reaches a preset score value condition.
And S816, the terminal plays the target video.
S818, when the video key segment is played, the terminal plays the video key segment at the first playing speed.
S820, when a non-key segment in the target video is played, the terminal plays the non-key segment at a second playing speed.
In one embodiment, the target video is played through a video annotation tool; the method may further comprise: the terminal detects a skip playing instruction triggered on a video marking tool in real time; and when the non-key segment in the target video is played and the jump playing instruction is detected, jumping from the non-key segment to the next video key segment for playing.
S822, when the target video is played, the terminal acquires at least one video label corresponding to the target video.
In one embodiment, when the target video is played, the description information for the target video is acquired; and storing the description information and the target video in a local place.
S824, the terminal classifies the target video according to at least one video label.
And S826, when the target video classification is finished, the terminal uploads the classified target video and the description information to the server for storage.
S828, when the server receives the video acquisition instruction sent by the user equipment, the server analyzes the video acquisition instruction to obtain the user identifier.
And S830, the server acquires corresponding user behavior data according to the user identification.
S832, the server determines the user's viewing habits according to the user behavior data.

S834, the server acquires the corresponding classified target videos according to the viewing habits.
S836, the server sends the acquired target video to the user equipment sending the video acquisition instruction.
In the above embodiment, before the target video is played, the score values of the corresponding frame images in the target video are calculated, the video key segments in the target video are determined according to the score values, and during playback the non-key segments are played quickly, at a speed greater than that of the video key segments. Even if the number of videos is large and the videos are long, no large amount of unnecessary playing time is spent on the non-key segments, so the playing time of the target video is effectively shortened and video annotation efficiency is improved.
As an example, an embodiment of the present invention provides a video playing method, in which a video annotation tool in the video playing method applies an intelligent acceleration function, and performs accelerated playing on a non-key segment of a video through the intelligent acceleration function, so that an annotation worker can quickly implement a task of tagging a video and a task of writing video description information.
Video tagging task: after watching the video, the annotator marks a specific label on the video, thereby realizing the classification of the video. The label is selected from a label library, and can be edited by a labeling person. For the video labeling task, one video can be labeled with a plurality of labels, or one video can be only selected with one classification label.
The writing task of the video description information comprises the following steps: after watching the video, the annotator writes an objective text description for the whole video, and the description needs to be expressed based on the content of the video.
Therefore, for both annotation tasks, annotators must completely view and understand the video content before annotating. However, none of the existing annotation tools is optimized for this annotation scene, and for annotators performing the video classification task or writing video description information, using such tools reduces annotation efficiency. The embodiment of the present invention therefore provides a video playing method capable of improving annotation efficiency, which specifically includes the following steps:
s902, firstly, the video to be marked is obtained.
And S904, scoring the video clips through the neural network model to obtain video key clips and non-key clips.
S906, the non-key segments are played at an accelerated speed on the annotation tool.
And S908, after the video is played, marking the video.
The details of the embodiments of the present invention are specifically described below from two aspects:
(1) and acquiring a video key clip.
The embodiment of the invention adopts a neural network model of VASNet to score corresponding frame images in the video, thereby extracting video key segments according to the scores of the frame images.
The structure of the VASNet model mainly comprises two parts, wherein one part is a soft self-attention network, namely the attention network in fig. 6; the other part is a fully connected network for regression, i.e. the regression network in fig. 6.
Sampling frame images in the video to obtain sampled frame images, and then taking the feature vectors of the sampled frame images as the input of the VASNet model.
The feature vectors are obtained as follows: the video is first decoded into a series of frame images, and the frame images are then sampled (for example, every 15 frames) to obtain the sampled frame images. After the sampled frame images are obtained, a residual network (ResNet) is used to extract their features and obtain the feature vectors.

The residual network is an image classification model trained on the large image classification data set ImageNet; after training is complete, the residual network is used to extract image features of the frame images in the video. A frame image is input to the residual network and processed through it, and the output of the network's last pooling layer is selected as the required feature vector, a vector of dimension 2048. Each sampled frame image corresponds to one 2048-dimensional feature vector; the feature vectors are then combined, and the combined feature vector is used as the input of the VASNet model.
The output of the VASNet model is a vector, the elements of which represent the scores of the input feature vectors, respectively. For example, if the video itself has 3000 frames of images, sampling is performed at intervals of 15 to obtain 200 frames of sampled frame images, then 200 2048-dimensional feature vectors are extracted and input into the VASNet model, the model outputs a 200-dimensional score result, each dimension in the result corresponds to a score value of one sampled frame image and represents the importance degree of the sampled frame image, and the higher the score is, the more likely the sampled frame image contains key information.
After the score values of the sampled frame images are obtained, the KTS algorithm is used to divide the video into a plurality of video segments; then the segment score value and segment length (i.e., the number of frames in the segment) of each video segment are obtained from the score values of the sampled frame images, and the knapsack algorithm is used to select video key segments whose total segment length does not exceed 15% of the original video length, while ensuring that the total segment score of the selected key segments is maximal.
For example, suppose there are n video segments in total after segmentation, with segment scores a_1, a_2, ..., a_n, and m video segments are now to be selected from the n segments such that the sum of their segment lengths does not exceed 15% of the original video. The knapsack algorithm selects, from the n video segments, the set of segments whose total length does not exceed 15% of the original video and whose total segment score is largest; for example, it may select 4 video segments (whose combined length does not exceed 15% of the original video), say the 1st, 2nd, 5th, and n-th, such that the sum of segment scores a_1 + a_2 + a_5 + a_n is maximal.
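This selection step can be sketched as a classic 0/1 knapsack over segment lengths and scores; the function and variable names below are assumptions:

    def select_key_segments(lengths: list[int], scores: list[float],
                            total_frames: int) -> list[int]:
        """Pick segment indices maximizing total score within 15% of total length."""
        budget = int(total_frames * 0.15)
        # best[w] = (best total score, chosen segment indices) for capacity w
        best = [(0.0, [])] * (budget + 1)
        for i in range(len(lengths)):
            for w in range(budget, lengths[i] - 1, -1):   # 0/1 knapsack DP
                cand = best[w - lengths[i]]
                if cand[0] + scores[i] > best[w][0]:
                    best[w] = (cand[0] + scores[i], cand[1] + [i])
        return best[budget][1]   # indices of the selected key segments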
After the video key segments are selected, their information (which may be the start and end positions of each video key segment, or other identification marks) may be provided to the video annotation tool. When the tool plays the video, it can then determine from this information whether the current playing position belongs to a video key segment; if not, i.e., the position belongs to a non-key segment, the video is played at an accelerated speed, thereby realizing intelligent acceleration of the non-key segments in the video.
(2) Video intelligent acceleration
After the information of the video key segments is acquired, the intelligent acceleration function is realized on the video as follows: a set of acceleration strategies is designed according to whether the current position belongs to a video key segment, and the non-key parts are accelerated. As shown in fig. 3, the non-key segments of the video are played at an accelerated speed, while the video key segments are played at normal speed.
By implementing the above embodiment, the following advantageous effects can be achieved:
(1) Compared with playing and watching the video at the original speed from beginning to end, the method reduces the time spent watching the video and improves the annotation efficiency.
(2) Compared with the conventional way of watching a video with the fast forward, pause and rewind functions: if the whole video is played fast forward, the annotator may not understand the video accurately, and on encountering key segments of the video the annotator may need to pause and play back manually. The embodiments of the present application greatly reduce such unnecessary operations; the annotator only needs to watch the intelligently accelerated video once from beginning to end, which improves the annotation efficiency.
Figs. 2, 5 and 7-9 are schematic flowcharts of a video playing method according to an embodiment. It should be understood that although the various steps in the flowcharts of figs. 2, 5 and 7-9 are displayed in sequence as indicated by the arrows, these steps are not necessarily performed in that sequence. Unless explicitly stated otherwise herein, the steps are not strictly limited in order and may be performed in other orders. Moreover, at least some of the steps in figs. 2, 5 and 7-9 may include multiple sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different moments; their order of execution is not necessarily sequential, and they may be performed in turn or in alternation with other steps or with at least some of the sub-steps or stages of other steps.
As shown in fig. 10, in an embodiment, a video playing device is provided, which specifically includes: an obtaining module 1002, a determining module 1004, a playing module 1006, a first adjusting module 1008 and a second adjusting module 1010; wherein:
an obtaining module 1002, configured to obtain a target video;
a determining module 1004, configured to determine a video key segment in the target video;
a playing module 1006, configured to play the target video;
a first adjusting module 1008, configured to play the video key segment at a first play speed when the video key segment is played;
a second adjusting module 1010, configured to play the non-key segment at a second playing speed when playing the non-key segment in the target video; the second playing speed is greater than the first playing speed.
In the above embodiment, the video key segments in the target video are determined before the target video is played, and during playing the non-key segments are played at a speed higher than that of the video key segments. Even if there are many videos and each video is long, a large amount of unnecessary playing time is therefore not consumed on the non-key segments; the playing time of the target video is effectively shortened, and the efficiency of video annotation is improved.
In one embodiment, the determining module 1004 is further configured to:
carrying out feature extraction on frame images in a target video to obtain corresponding image features;
processing the image characteristics through a machine learning model to obtain a score value of a corresponding frame image;
and when the score value reaches a preset score condition, determining the video key segments in the target video according to the score value.
Reaching the preset score value condition indicates that the frame image contains key information.
In one embodiment, the determining module 1004 is further configured to:
dividing a target video into at least two video segments;
respectively calculating the total score value of each video clip according to the obtained score values;
in each video clip, when the total score value of the target video clip reaches a preset score value condition, determining the target video clip as a video key clip.
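For illustration, the total score value of each segment can be obtained by summing the sampled-frame score values that fall within it (the helper name is hypothetical; segments are taken here as lists of frame indices, as produced by the splitter sketched further below):

```python
def segment_totals(frame_scores, segments):
    """Sum the score values of the sampled frames falling in each video
    segment; `segments` holds one list of frame indices per segment."""
    return [sum(frame_scores[i] for i in seg) for seg in segments]

# e.g. segment_totals([0.5, 0.25, 0.75, 0.25], [[0, 1], [2, 3]])
# -> [0.75, 1.0]
```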
In one embodiment, the determining module 1004 is further configured to:
calculating difference values between adjacent frame images in the target video;
and when the difference value of target adjacent frame images in the target video reaches a preset difference threshold, segmenting the target video according to the target adjacent frame images to obtain at least two video segments, as sketched below.
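A sketch of this segmentation, under the assumption that the difference value is the mean absolute pixel difference between adjacent frames (the embodiment leaves the exact difference metric open):

```python
import numpy as np

def split_on_difference(frames, threshold):
    """Cut the video wherever the mean absolute pixel difference between
    adjacent frames reaches the threshold; returns lists of frame
    indices, one list per video segment."""
    boundaries = [0]
    for i in range(1, len(frames)):
        diff = np.mean(np.abs(frames[i].astype(np.int16) -
                              frames[i - 1].astype(np.int16)))
        if diff >= threshold:
            boundaries.append(i)   # frame i starts a new segment
    boundaries.append(len(frames))
    return [list(range(boundaries[j], boundaries[j + 1]))
            for j in range(len(boundaries) - 1)]
```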
In one embodiment, the determining module 1004 is further configured to:
acquiring a transition identifier of a target video;
and segmenting the target video according to the transition identification to obtain at least two video segments.
In one embodiment, the determining module 1004 is further configured to:
decoding the target video to obtain a corresponding frame image;
sampling the obtained frame image to obtain a sampled frame image;
and extracting image features in the sampling frame image through a feature extraction network.
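For illustration, the decode-and-sample step could be realized with OpenCV as follows (the every-15th-frame interval mirrors the example given earlier; the function name is hypothetical):

```python
import cv2

def sample_frames(video_path, interval=15):
    """Decode the video and keep every `interval`-th frame as an RGB
    array, yielding the sampled frame images described above."""
    cap = cv2.VideoCapture(video_path)
    sampled, idx = [], 0
    while True:
        ok, frame = cap.read()  # decoded BGR frame
        if not ok:
            break
        if idx % interval == 0:
            sampled.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        idx += 1
    cap.release()
    return sampled
```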
In the above embodiment, the frame images in the target video are sampled, the image features of the sampled frame images are extracted, and the score value of each sampled frame image is calculated from the image features, so that the video key segments are determined according to the score values. This helps prevent a large amount of unnecessary playing time from being consumed on non-key segments when the target video is played, effectively shortens the playing time of the target video, and improves the annotation efficiency of the video. In addition, the target video is segmented according to the difference values between frame images or according to transition identifiers, so that the frame images within each resulting video segment are highly similar. This ensures that the obtained video key segments are genuine key segments and the obtained non-key segments are genuine non-key segments, avoiding the situation where, during playback, a segment containing key information is accelerated while a segment containing no key information is played without acceleration.
In one embodiment, as shown in fig. 11, the apparatus further comprises: a classification module 1012; wherein:
the obtaining module 1002 is further configured to obtain at least one video tag corresponding to the target video when the target video is completely played;
a classification module 1012 for classifying the target video according to at least one video tag.
In one embodiment, as shown in fig. 11, the apparatus further comprises: an upload module 1014; wherein:
the uploading module 1014 is configured to upload the classified target video to a server for storage when the classification of the target video is completed; the target video is used for instructing the server to recommend the target video to user equipment when the server receives a video acquisition instruction sent by the user equipment.
In one embodiment, as shown in fig. 11, the apparatus further comprises: a save module 1016; wherein:
the obtaining module 1002 is further configured to obtain description information for the target video when the target video is completely played;
a storage module 1016, configured to store the description information and the target video locally; or,
the uploading module 1014 is further configured to send the description information and the target video to the server for storage.
In one embodiment, the target video is played through a video annotation tool; as shown in fig. 11, the apparatus further includes: a skip module 1018; wherein:
a skip module 1018 for detecting a skip play instruction triggered on the video annotation tool in real time; and when the non-key segment in the target video is played and the jump playing instruction is detected, jumping from the non-key segment to the next video key segment for playing.
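A sketch of the skip logic, assuming the key segments are given as sorted (start, end) intervals in playback time or frame indices:

```python
def jump_target(position, key_segments):
    """Given the current position in a non-key stretch, return the start
    of the next video key segment to jump to, or None if no key segment
    remains or the position is already inside a key segment."""
    for start, end in sorted(key_segments):
        if start <= position <= end:
            return None   # already playing a key segment; nothing to skip
        if position < start:
            return start  # the next video key segment begins here
    return None
```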
In the above embodiment, the target videos are classified according to their corresponding video tags, and the classified target videos are then sent to the server, so that the server can recommend videos according to user preferences, which can improve the click-through rate of the videos. In addition, adding description information to the target video helps users understand the target video, increases their interest in watching it, and thereby improves its click-through rate.
FIG. 12 is a diagram illustrating an internal structure of a computer device in one embodiment. The computer device may specifically be the terminal 110 in fig. 1. As shown in fig. 12, the computer device includes a processor, a memory, a network interface, an input device and a display screen connected through a system bus. The memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium of the computer device stores an operating system and may also store a computer program that, when executed by the processor, causes the processor to implement the video playing method. The internal memory may also store a computer program that, when executed by the processor, causes the processor to perform the video playing method. The display screen of the computer device may be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer device may be a touch layer covering the display screen, a key, trackball or touchpad arranged on the housing of the computer device, or an external keyboard, touchpad or mouse.
Those skilled in the art will appreciate that the architecture shown in fig. 12 is merely a block diagram of part of the structure related to the solution of the present application and does not limit the computer devices to which the solution applies; a particular computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
In one embodiment, the video playback apparatus provided in the present application may be implemented in the form of a computer program that is executable on a computer device such as the one shown in fig. 12. The memory of the computer device may store various program modules constituting the video playback apparatus, such as the obtaining module 1002, the determining module 1004, the playing module 1006, the first adjusting module 1008, and the second adjusting module 1010 shown in fig. 10. The computer program constituted by the respective program modules causes the processor to execute the steps in the video playback method of the embodiments of the present application described in the present specification.
For example, the computer device shown in fig. 12 may execute S202 through the obtaining module 1002 in the video playing apparatus shown in fig. 10. The computer device may perform S204 by the determination module 1004. The computer device may execute S206 via the play module 1006. The computer device may perform S208 by the first adjustment module 1008. The computer device may perform S210 through the second adjustment module 1010.
In one embodiment, a computer device is provided, comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of the video playback method described above. Here, the steps of the video playing method may be steps in the video playing methods of the above embodiments.
In one embodiment, a computer-readable storage medium is provided, in which a computer program is stored, which, when executed by a processor, causes the processor to perform the steps of the above-described video playback method. Here, the steps of the video playing method may be steps in the video playing methods of the above embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments may be implemented by a computer program. The program may be stored in a non-volatile computer-readable storage medium, and when executed may include the processes of the embodiments of the methods described above. Any reference to memory, storage, database or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination of technical features contains no contradiction, it should be considered within the scope of this specification.
The above embodiments express only several implementations of the present application, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the present application. It should be noted that those skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (15)

1. A video playback method, comprising:
acquiring a target video;
determining video key segments in the target video;
playing the target video;
when the video key segment is played, playing the video key segment at a first playing speed;
when playing the non-key segments in the target video, playing the non-key segments at a second playing speed; the second playing speed is greater than the first playing speed.
2. The method of claim 1, wherein the determining video key segments in the target video comprises:
performing feature extraction on the frame image in the target video to obtain corresponding image features;
processing the image characteristics through a machine learning model to obtain a score value corresponding to the frame image;
and when the score value reaches a preset score value condition, determining the video key segments in the target video according to the score value.
3. The method of claim 2, wherein the determining the video key segments in the target video according to the score values when the score values reach a preset score condition comprises:
segmenting the target video into at least two video segments;
respectively calculating the total score value of each video clip according to the obtained score values;
in each video clip, when the total score value of a target video clip reaches a preset score condition, determining the target video clip as a video key clip.
4. The method of claim 3, wherein the segmenting the target video into at least two video segments comprises:
calculating difference values between adjacent frame images in the target video;
and when the difference value of target adjacent frame images in the target video reaches a preset difference threshold, segmenting the target video according to the target adjacent frame images to obtain at least two video segments.
5. The method of claim 3, wherein the segmenting the target video into at least two video segments comprises:
acquiring a transition identifier of the target video;
and segmenting the target video according to the transition identification to obtain at least two video segments.
6. The method of claim 2, wherein the extracting the features of the frame image in the target video to obtain the corresponding image features comprises:
decoding the target video to obtain a corresponding frame image;
sampling the obtained frame image to obtain a sampled frame image;
and extracting image features in the sampling frame image through a feature extraction network.
7. The method according to any one of claims 1 to 6, further comprising:
when the target video is played completely, acquiring at least one video label corresponding to the target video;
and classifying the target video according to the at least one video label.
8. The method of claim 7, further comprising:
when the target video classification is finished, uploading the classified target video to a server for storage; the target video is used for indicating the server to recommend the target video to the user equipment when receiving a video acquisition instruction sent by the user equipment.
9. The method according to any one of claims 1 to 6, further comprising:
when the target video is completely played, acquiring description information for the target video;
storing the description information and the target video locally; or
sending the description information and the target video to a server for storage.
10. The method according to any one of claims 1 to 6, wherein the target video is played by a video annotation tool; the method further comprises the following steps:
detecting a skip playing instruction triggered on the video annotation tool in real time;
and when the non-key segment in the target video is played and the jump playing instruction is detected, jumping from the non-key segment to the next video key segment for playing.
11. A video playback apparatus, comprising:
the acquisition module is used for acquiring a target video;
the determining module is used for determining video key fragments in the target video;
the playing module is used for playing the target video;
the first adjusting module is used for playing the video key segment at a first playing speed when the video key segment is played;
the second adjusting module is used for playing the non-key segments at a second playing speed when the non-key segments in the target video are played; the second playing speed is greater than the first playing speed.
12. The apparatus of claim 11, wherein the determining module is further configured to:
performing feature extraction on the frame image in the target video to obtain corresponding image features;
processing the image characteristics through a machine learning model to obtain a score value corresponding to the frame image;
and when the score value reaches a preset score value condition, determining the video key segments in the target video according to the score value.
13. The apparatus of claim 12, wherein the determining module is further configured to:
segmenting the target video into at least two video segments;
respectively calculating the total score value of each video clip according to the obtained score values;
in each video clip, when the total score value of a target video clip reaches a preset score condition, determining the target video clip as a video key clip.
14. A computer-readable storage medium, storing a computer program which, when executed by a processor, causes the processor to carry out the steps of the method according to any one of claims 1 to 10.
15. A computer device comprising a memory and a processor, the memory storing a computer program that, when executed by the processor, causes the processor to perform the steps of the method according to any one of claims 1 to 10.
CN201911125894.6A 2019-11-18 2019-11-18 Video playing method and device, computer readable storage medium and computer equipment Active CN110856042B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911125894.6A CN110856042B (en) 2019-11-18 2019-11-18 Video playing method and device, computer readable storage medium and computer equipment


Publications (2)

Publication Number Publication Date
CN110856042A (en) 2020-02-28
CN110856042B CN110856042B (en) 2022-02-11

Family

ID=69601072

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911125894.6A Active CN110856042B (en) 2019-11-18 2019-11-18 Video playing method and device, computer readable storage medium and computer equipment

Country Status (1)

Country Link
CN (1) CN110856042B (en)



Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101025987A (en) * 2006-02-21 2007-08-29 广州市纽帝亚资讯科技有限公司 Video play fast forward/fast rewind method and device based on video content
US20080253737A1 (en) * 2007-03-30 2008-10-16 Masaru Kimura Video Player And Video Playback Control Method
US20170187779A1 (en) * 2015-12-28 2017-06-29 Canon Kabushiki Kaisha Providing apparatus, data providing method, and storage medium
CN108235123A (en) * 2016-12-15 2018-06-29 优视科技有限公司 Video broadcasting method and device
CN106911961A (en) * 2017-02-22 2017-06-30 北京小米移动软件有限公司 Multimedia data playing method and device
CN108924576A (en) * 2018-07-10 2018-11-30 武汉斗鱼网络科技有限公司 A kind of video labeling method, device, equipment and medium
CN109361955A (en) * 2018-11-15 2019-02-19 湖南快乐阳光互动娱乐传媒有限公司 Speed video broadcasting method and system based on point of interest
CN110166827A (en) * 2018-11-27 2019-08-23 深圳市腾讯信息技术有限公司 Determination method, apparatus, storage medium and the electronic device of video clip
CN109740499A (en) * 2018-12-28 2019-05-10 北京旷视科技有限公司 Methods of video segmentation, video actions recognition methods, device, equipment and medium
CN110267119A (en) * 2019-06-28 2019-09-20 北京奇艺世纪科技有限公司 The evaluation method and relevant device of video highlight degree

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Zhang Huanlong, Hu Shiqiang, Yang Guosheng: "Survey of video object tracking methods based on appearance model learning", Journal of Computer Research and Development *

Cited By (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11924513B2 (en) 2019-08-18 2024-03-05 Juhaokan Technology Co., Ltd. Display apparatus and method for display user interface
CN111294641A (en) * 2020-03-10 2020-06-16 北京奇艺世纪科技有限公司 Video playing method, system, device, electronic equipment and readable storage medium
CN111294641B (en) * 2020-03-10 2022-10-21 北京奇艺世纪科技有限公司 Video playing method, system, device, electronic equipment and readable storage medium
CN111428589A (en) * 2020-03-11 2020-07-17 新华智云科技有限公司 Identification method and system for transition
CN111428589B (en) * 2020-03-11 2023-05-30 新华智云科技有限公司 Gradual transition identification method and system
CN111428087A (en) * 2020-03-20 2020-07-17 腾讯科技(深圳)有限公司 Video interception method and device, computer equipment and storage medium
CN111428087B (en) * 2020-03-20 2022-10-25 腾讯科技(深圳)有限公司 Video interception method and device, computer equipment and storage medium
CN111432141A (en) * 2020-03-31 2020-07-17 北京字节跳动网络技术有限公司 Method, device and equipment for determining mixed-cut video and storage medium
CN113596537B (en) * 2020-04-30 2022-09-02 聚好看科技股份有限公司 Display device and playing speed method
CN113596590A (en) * 2020-04-30 2021-11-02 聚好看科技股份有限公司 Display device and play control method
CN113596551A (en) * 2020-04-30 2021-11-02 聚好看科技股份有限公司 Display device and play speed adjusting method
CN113596537A (en) * 2020-04-30 2021-11-02 聚好看科技股份有限公司 Display device and playing speed method
CN111787357A (en) * 2020-07-28 2020-10-16 联想(北京)有限公司 Video processing method and electronic equipment
CN111800673A (en) * 2020-07-31 2020-10-20 聚好看科技股份有限公司 Video playing method, display equipment and server
CN112565909A (en) * 2020-11-30 2021-03-26 维沃移动通信有限公司 Video playing method and device, electronic equipment and readable storage medium
CN112689161A (en) * 2020-12-22 2021-04-20 北京达佳互联信息技术有限公司 Multimedia file playing method, device, system, equipment and storage medium
CN115134648A (en) * 2021-03-26 2022-09-30 腾讯科技(深圳)有限公司 Video playing method, device, equipment and computer readable storage medium
CN113225596A (en) * 2021-04-28 2021-08-06 百度在线网络技术(北京)有限公司 Video processing method and device, electronic equipment and storage medium
CN113225596B (en) * 2021-04-28 2022-11-01 百度在线网络技术(北京)有限公司 Video processing method and device, electronic equipment and storage medium
CN115442661A (en) * 2021-06-01 2022-12-06 北京字跳网络技术有限公司 Video processing method, device, storage medium and computer program product
CN115442661B (en) * 2021-06-01 2024-03-19 北京字跳网络技术有限公司 Video processing method, apparatus, storage medium, and computer program product
CN114268847A (en) * 2021-12-15 2022-04-01 北京百度网讯科技有限公司 Video playing method and device, electronic equipment and storage medium
CN114040228A (en) * 2021-12-16 2022-02-11 中国电信股份有限公司 Video playing speed control method and system based on user habits

Also Published As

Publication number Publication date
CN110856042B (en) 2022-02-11

Similar Documents

Publication Publication Date Title
CN110856042B (en) Video playing method and device, computer readable storage medium and computer equipment
US11663827B2 (en) Generating a video segment of an action from a video
CN111683209B (en) Mixed-cut video generation method and device, electronic equipment and computer-readable storage medium
US8750681B2 (en) Electronic apparatus, content recommendation method, and program therefor
US8442389B2 (en) Electronic apparatus, reproduction control system, reproduction control method, and program therefor
US8804999B2 (en) Video recommendation system and method thereof
US9870797B1 (en) Generating and providing different length versions of a video
CN107707931B (en) Method and device for generating interpretation data according to video data, method and device for synthesizing data and electronic equipment
CN102207950B (en) Electronic installation and image processing method
CN112740713B (en) Method for providing key time in multimedia content and electronic device thereof
CN110781347A (en) Video processing method, device, equipment and readable storage medium
CN111143610A (en) Content recommendation method and device, electronic equipment and storage medium
CN111708941A (en) Content recommendation method and device, computer equipment and storage medium
CN111708915B (en) Content recommendation method and device, computer equipment and storage medium
US20170065888A1 (en) Identifying And Extracting Video Game Highlights
CN112989209B (en) Content recommendation method, device and storage medium
CN110769314B (en) Video playing method and device and computer readable storage medium
CN113515997B (en) Video data processing method and device and readable storage medium
US20230140369A1 (en) Customizable framework to extract moments of interest
JP2009201041A (en) Content retrieval apparatus, and display method thereof
Rathore et al. Generating 1 minute summaries of day long egocentric videos
de Matos et al. A multimodal hyperlapse method based on video and songs’ emotion alignment
CN117037009A (en) Video identification method, device, computer equipment and storage medium
CN112685596A (en) Video recommendation method and device, terminal and storage medium
US11758233B1 (en) Time marking chapters in media items at a platform using machine-learning

Legal Events

Date Code Title Description
PB01 Publication
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40022255

Country of ref document: HK

SE01 Entry into force of request for substantive examination
GR01 Patent grant