CN111209440B - Video playing method, device and storage medium


Info

Publication number
CN111209440B
Authority
CN
China
Prior art keywords
video, played, playing, video clip, type
Prior art date
Legal status (assumption; not a legal conclusion)
Active
Application number
CN202010033943.XA
Other languages
Chinese (zh)
Other versions
CN111209440A
Inventor
艾立超
Current Assignee
Shenzhen Yayue Technology Co ltd
Original Assignee
Shenzhen Yayue Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Shenzhen Yayue Technology Co ltd
Priority to CN202010033943.XA
Publication of CN111209440A
Application granted
Publication of CN111209440B


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval of video data
    • G06F16/735 Querying; Filtering based on additional data, e.g. user or group profiles
    • G06F16/74 Browsing; Visualisation therefor
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/7837 Retrieval using metadata automatically derived from the content, using objects detected or recognised in the video content
    • G06F16/7867 Retrieval using information manually generated, e.g. tags, keywords, comments, title and artist information, user ratings

Abstract

The embodiments of the present application disclose a video playing method, a video playing device and a storage medium. When a target video is played in a current playing mode, a video clip to be played of the target video can be acquired; the content of the video clip to be played is identified to obtain the target video clip type of the video clip to be played; a control level corresponding to the video clip to be played in the current playing mode is determined based on the target video clip type; and the playing of the video clip to be played is controlled according to the control level. Because the content of the target video is detected while the target video is played, and the playing is controlled according to the detection result, the video content can be effectively filtered and the viewing environment of the user is purified.

Description

Video playing method, device and storage medium
Technical Field
The present application relates to the field of multimedia technologies, and in particular, to a video playing method and apparatus, and a storage medium.
Background
In recent years, with the continuous development of internet technology and the economy, watching online videos on intelligent terminals such as smart televisions or smart boxes has become an important entertainment activity in people's daily lives. Current video playing clients can provide different viewing modes, and different types of videos can be watched in the different viewing modes. For example, in order to prevent teenagers and children from becoming addicted to the network, some video playing clients have launched a child mode, and dedicated child video playing clients have even appeared; through pre-review and classification, such clients can only be used to watch videos suitable for children. However, because the review is not timely or is missing, many non-child videos appear in the child mode or in the child video playing clients. In addition, the existing review and classification are generally performed only at the album level, and not every picture in every video of an album can be reviewed, so content unsuitable for children may appear in one or more places in an album classified as a children's film, and the user experience is poor.
Disclosure of Invention
In view of this, embodiments of the present application provide a video playing method, apparatus, and storage medium, which can effectively filter video content and purify a viewing environment of a user.
In one aspect, an embodiment of the present application provides a video playing method, including:
when a target video is played in a current playing mode, acquiring a video clip to be played of the target video;
identifying the content of the video clip to be played to obtain the target video clip type of the video clip to be played;
determining a control level corresponding to the video clip to be played in the current playing mode based on the type of the target video clip;
and controlling the playing of the video clip to be played based on the control level.
In an embodiment, the identifying the content of the video clip to be played to obtain the target video clip type of the video clip to be played includes:
performing feature extraction on the video content of the video clip to be played to obtain feature information of the video content;
determining the action type of an object in the video clip to be played according to the characteristic information;
and classifying the video clips to be played according to the action types to obtain the target video clip types of the video clips to be played.
In one embodiment, the feature information includes picture element feature information and motion feature information;
the extracting the characteristics of the video content of the video clip to be played to obtain the characteristic information of the video content includes:
determining a sampling frame from video frames of the video content;
extracting motion information of pixels in the video content according to the sampling frame and the adjacent video frame of the sampling frame to obtain an optical flow sequence corresponding to the video content;
performing convolution operation on the sampling frame according to a spatial flow convolution neural network of a preset action recognition network model, and extracting picture element characteristic information of the video content;
and performing a convolution operation on the optical flow sequence according to the time flow convolutional neural network of the preset action recognition network model to extract the motion characteristic information of the video content.
In an embodiment, the determining the action type of the object in the video clip to be played according to the feature information includes:
performing full-connection operation on the picture element characteristic information based on a spatial flow convolution neural network of the preset action recognition network model to obtain first type probability information;
performing full-connection operation on the motion characteristic information based on the time flow convolutional neural network of the preset action recognition network model to obtain second type probability information;
and fusing the first type probability information and the second type probability information to obtain the action type of the object in the video clip to be played.
In an embodiment, the identifying the content of the video clip to be played to obtain the target video clip type of the video clip to be played further includes:
identifying the audio content of the video clip to be played to obtain text information corresponding to the audio content;
identifying the text information according to a sensitive word set which needs to be controlled to be played in a current playing mode to obtain an audio identification result;
and classifying the video clips to be played according to the audio recognition result and the action type to obtain the video clip type.
In an embodiment, identifying the text information according to the sensitive word set whose playing needs to be controlled in the current playing mode to obtain an audio identification result includes:
when the text information has the matched sensitive words in the sensitive word set, acquiring the attributes of the sensitive words;
and acquiring the audio recognition result according to the attribute of the sensitive word.
In an embodiment, after identifying the content of the video segment to be played to obtain the target video segment type of the video segment to be played, the method further includes:
generating a type mark of the target video according to the type of the target video clip corresponding to the video clip to be played;
and marking the target video based on the type mark to obtain the marked target video.
In an embodiment, the identifying the content of the video segment to be played to obtain the target video segment type of the video segment to be played includes:
and when the target video is the marked target video, determining the type of the target video clip of the video clip to be played according to the type mark.
In an embodiment, controlling the playing of the video clip to be played according to the control level includes:
when the control level is a play limiting level, acquiring the time length of playing the video clip of the target video clip type within a preset time period by a user;
if the duration is less than a preset threshold, playing the video clip to be played;
if the duration is greater than or equal to the preset threshold, acquiring a next video segment of the video segments to be played according to the playing sequence of the video segments in the target video;
updating the next video clip to be a video clip to be played;
repeatedly executing the step of identifying the content of the video clip to be played to obtain the target video clip type of the video clip to be played, determining the control level corresponding to the target video clip type in the current playing mode, and controlling the playing of the video clip to be played according to the control level until the control level corresponding to the target video clip type is the allowed playing level.
In an embodiment, controlling the playing of the video clip to be played according to the control level includes:
and when the control level is a play-allowed level, playing the video clip to be played.
In an embodiment, controlling the playing of the video clip to be played according to the control level includes:
when the control level is a play prohibition level, prohibiting playing the video clip to be played;
acquiring a next video clip of the video clip to be played according to the playing sequence of the video clips in the target video;
updating the next video clip to be a video clip to be played;
and repeatedly executing the step of identifying the content of the video clip to be played to obtain the target video clip type of the video clip to be played, determining the control level corresponding to the target video clip type in the current playing mode, and controlling the playing of the video clip to be played according to the control level.
In an embodiment, controlling the playing of the video clip to be played according to the control level includes:
when the control level is a play limiting level, acquiring target play information according to a preset play condition corresponding to the target video clip type;
if the target playing information meets the preset playing condition, playing the video clip to be played;
if the target playing information does not meet the preset playing condition, acquiring a next video clip of the video clip to be played according to the playing sequence of the video clip in the target video;
updating the next video clip into a video clip to be played;
and repeatedly executing the step of identifying the content of the video clip to be played to obtain the target video clip type of the video clip to be played, determining the control level corresponding to the target video clip type in the current playing mode, and controlling the playing of the video clip to be played according to the control level.
In another aspect, an embodiment of the present application provides a video playing apparatus, including:
the video acquisition unit is used for acquiring a video clip to be played of the target video when the target video is played in a current playing mode;
the type acquisition unit is used for identifying the content of the video clip to be played to obtain the target video clip type of the video clip to be played;
the control unit is used for determining a control level corresponding to the video clip to be played in the current playing mode based on the type of the target video clip;
and the playing unit is used for controlling the playing of the video clip to be played according to the control level.
In another aspect, an embodiment of the present application provides a server, including: a processor and a memory; the memory stores a plurality of instructions, and the processor loads the instructions stored in the memory to execute the steps of the video playing method provided by any embodiment of the application.
In another aspect, the storage medium provided in the embodiments of the present application has a computer program stored thereon; when the computer program runs on a computer, the computer is caused to execute the steps in the video playing method provided in any of the embodiments of the present application.
According to the method and the device, when the target video is played in the current playing mode, the video clip to be played of the target video can be obtained; identifying the content of the video clip to be played to obtain the target video clip type of the video clip to be played; determining a control level corresponding to the video clip to be played in the current playing mode based on the type of the target video clip; and controlling the playing of the video clip to be played based on the control level. When the target video is played, the content of the target video is detected, and the playing of the target video is controlled according to the detection result, so that the video content is effectively filtered, and the viewing environment of a user is purified.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a schematic view of an application scenario of a video playing method according to an embodiment of the present application.
Fig. 2 is a schematic flowchart of a video playing method according to an embodiment of the present application.
Fig. 3 is a schematic structural diagram of a video playback device according to an embodiment of the present application.
FIG. 4 is a schematic diagram of a computer device provided in an embodiment of the present application.
Fig. 5a is a schematic view of a video album category according to an embodiment of the present application.
Fig. 5b is a schematic interaction diagram of a terminal and a server according to an embodiment of the present disclosure.
Fig. 5c is a schematic diagram of a motion recognition network model according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the invention provides a video playing method, a video playing device and a storage medium.
The process of identifying video content provided by the embodiments of the present application involves artificial intelligence technologies such as computer vision and machine learning, as follows:
the process of recognizing sampling frames and optical flow information by using a motion recognition network model relates to a Computer Vision technology, wherein Computer Vision technology (Computer Vision, CV) is a science for researching how to enable a machine to see, and further means that a camera and a Computer are used for replacing human eyes to perform machine Vision such as recognition, tracking and measurement on a target, and further performing image processing, so that the Computer processing becomes an image more suitable for human eyes to observe or is transmitted to an instrument to detect. As a scientific discipline, computer vision research-related theories and techniques attempt to build artificial intelligence systems that can acquire information from images or multidimensional data. Computer vision technologies generally include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/action recognition, three-dimensional object reconstruction, 3D technologies, virtual reality, augmented reality, synchronous positioning, map construction, and other technologies, and also include common biometric technologies such as face recognition, fingerprint recognition, and the like.
The training of the action recognition network model involves Machine Learning (ML). Machine learning is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other subjects. It specially studies how a computer can simulate or implement human learning behavior to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from instruction.
Referring to fig. 1, an embodiment of the present invention provides a video playing system, which includes at least a terminal and a server connected through a network.
The terminal can be a mobile phone, a tablet computer, a notebook computer and other devices, and also can be an intelligent terminal comprising a wearable device, an intelligent sound box, an intelligent television and the like. The terminal is provided with a client, and the client can be a video application client or a browser client and the like. The server may be a single server or a server cluster composed of a plurality of servers.
The terminal can obtain the video clip of the target video from the server and play the video clip in the client. When the terminal plays the target video in the current playing mode, the server can acquire a video clip to be played of the target video from the database according to a request of the terminal, and then identify the content of the video clip to be played to obtain the type of the target video clip of the video clip to be played; determining a control level corresponding to the video clip to be played in the current playing mode based on the type of the target video clip; and finally, sending the control level to a terminal through a network so that the client controls the playing of the video clip to be played based on the control level.
In an embodiment, the server may further generate a type tag of the target video according to a type of the target video segment corresponding to the video segment to be played, and the server may tag the target video based on the type tag to obtain and store the tagged target video. And when the terminal requests to play the marked target video, determining the target video clip type of the video clip to be played in the target video according to the type mark.
The above example of fig. 1 is only an example of a system architecture for implementing the embodiment of the present invention, and the embodiment of the present invention is not limited to the system architecture shown in fig. 1, and various embodiments of the present invention are proposed based on the system architecture.
The following are detailed below. The numbers in the following examples are not intended to limit the order of preference of the examples.
As shown in fig. 2, a video playing method is provided, which may be executed by a terminal or a server, or by the terminal and the server together; this embodiment is described by taking execution by the server as an example. The specific flow of the video playing method is as follows:
101. when a target video is played in a current playing mode, a video clip to be played of the target video is acquired.
The current play mode refers to a mode in which the client provides play services for the user at the current time, and different play services can be provided for the user in different play modes to realize different play functions, for example, the client has play rights to play different types of videos in different play modes.
In one embodiment, in order to meet the requirements of different users, the client may provide a child mode, a teenager mode and a normal mode, which respectively provide different video playing services. For example, the child mode may not play any video containing horror elements, the teenager mode may not play videos containing more intense horror elements, and the normal mode may play legally permitted videos containing horror elements. In addition, in order to prevent minors from becoming addicted to the network, the total time a user is allowed to use the client each day can be set differently for different playing modes; for example, the total daily usage time may be set to 1 hour in the child mode and 2 hours in the teenager mode, and may be unlimited in the normal mode.
The server can obtain the video clip to be played of the target video from the database of the server based on the request sent by the terminal.
In an embodiment, when the terminal plays the target video online, the server may send the target video to the terminal by video streaming. In video streaming, a video file is divided into compressed video clip packets by a special compression method; after buffering for several seconds or tens of seconds, the received compressed packets are decompressed and played without waiting for the whole file to finish downloading, while the rest of the video file (that is, the video clips to be played) continues to be downloaded from the server.
Correspondingly, when the user downloads the target video, the server can also send the target video to the terminal in a video streaming transmission mode, identify the content of the target video when the target video is downloaded, obtain the type and the control level of the target video fragment, and control the downloading of the video fragment according to the control level, so that the video content can be effectively filtered, and the watching environment of the user can be purified. For a specific downloading identification and control method, reference may be made to the identification and control method for playing in the embodiment of the present application, which is not described in detail again.
In an embodiment, to facilitate video clip management by the server, the video file may also be mapped into small TS (Transport Stream) fragment files under the HLS (HTTP Live Streaming) protocol and played through the HLS protocol without actually splitting the video file. The video file includes a moov box, which records the decoding information, timestamps, positions and other critical data of all subsequent audio frames and video frames, called index data. According to the audio and video frame index data listed in the moov box, the whole video file can be partitioned with key frames as boundaries; each segment corresponds to one TS file, and the correspondence is written into an index file. The video file is also associated with an m3u8 file and the index file. The m3u8 file is the playing address file used by the player and lists all TS fragment addresses, and the index file records the data correspondence between the actual video file and the TS fragments to be produced. When the playing client actually requests playback, the corresponding audio and video data are obtained through the correspondence in the index file and the playing address in the m3u8 file, and are assembled into a TS file in memory. For example, if 0 to 2 seconds of data of a target video file are requested, the 0 to 2 seconds of data are found through the corresponding records and combined into the MPEG-TS format to generate a TS fragment file.
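As an illustration of the m3u8 playing address file mentioned above, the following is a minimal Python sketch that builds an HLS media playlist listing the TS fragment addresses; the tags used (#EXTM3U, #EXTINF and so on) are standard HLS, while the segment names and durations are illustrative assumptions.

```python
# Minimal sketch of building an HLS media playlist (m3u8) for the TS
# fragments described above. Segment names and durations are illustrative
# assumptions, not values from this application.

def build_m3u8(segment_durations, uri_template="segment{index}.ts"):
    """Build an HLS media playlist listing one URI per TS fragment."""
    lines = [
        "#EXTM3U",
        "#EXT-X-VERSION:3",
        f"#EXT-X-TARGETDURATION:{int(max(segment_durations)) + 1}",
        "#EXT-X-MEDIA-SEQUENCE:0",
    ]
    for i, duration in enumerate(segment_durations):
        lines.append(f"#EXTINF:{duration:.3f},")    # per-segment duration
        lines.append(uri_template.format(index=i))  # TS fragment address
    lines.append("#EXT-X-ENDLIST")                  # VOD playlist: no more segments
    return "\n".join(lines)

print(build_m3u8([10.0, 10.0, 4.5]))
```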
In an embodiment, in order to facilitate the server to identify the content of the target video, the server may decode a video file (where the video file may be a compressed packet of a video segment or may be a TS fragment file) acquired from a database to obtain a video segment to be played, for example, the step "acquiring the video segment to be played of the target video" may include:
decapsulating the acquired video file to obtain a video data stream and an audio data stream;
and respectively decoding the video data stream and the audio data stream to obtain the video content and the audio content of the video clip to be played. The video content comprises a plurality of video frames, and the audio content comprises a group of audio frames.
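A minimal sketch of this decapsulation and decoding step is shown below, using the PyAV library (an FFmpeg binding) as one possible realization; the library choice and the file path are assumptions, not part of this application.

```python
# Minimal sketch of decapsulating a video file into a video data stream and
# an audio data stream and decoding both, assuming the PyAV library.
import av

def decode_clip(path):
    """Demux a video file into decoded video frames and audio frames."""
    container = av.open(path)                 # decapsulate into streams
    video_frames, audio_frames = [], []
    for frame in container.decode(video=0):   # decode the video data stream
        video_frames.append(frame.to_ndarray(format="rgb24"))
    container.seek(0)                         # rewind to read the audio stream
    for frame in container.decode(audio=0):   # decode the audio data stream
        audio_frames.append(frame.to_ndarray())
    container.close()
    return video_frames, audio_frames
```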
In an embodiment, an operation of the user on the user operation interface of the terminal may trigger the sending of a video acquisition request to the server to acquire the video clip to be played.
Referring to fig. 5b, in an embodiment, the server may also implement a search filtering function in addition to the video recognition and play control functions. The video recognition function is realized through the preset action recognition model, the voice recognition model and the preset video clip types in the server; the play control function is realized through the target video clip type and the current play mode; and the search filtering function is realized through the album types stored in the video database.
Specifically, the video albums may be categorized in advance, so that the user is limited to searching only the video albums that are allowed to be played in the current play mode, thereby implementing the search filtering function. For example, the piglet xx album may be categorized into a juvenile type in advance according to the requirements of the play modes. Of course, the album may also be categorized according to other criteria, for example into an animation type by artistic form, or by length or by title depending on the content form; other classifications may also be made. Referring to fig. 5a, the information about the piglet xx album can be saved in the video database of the server.
Referring to fig. 5b, the terminal includes a search module, a play module and a network request management module. The terminal can determine a target album to be played according to a search operation of the user in the user operation interface of the client, where the target album includes at least one video. When the user searches for the target album in the user operation page, the client sends a search request to the server, the search request carrying the target album name and the current playing mode information. The server determines, according to the current playing mode information, a candidate album set of albums that are allowed to be viewed in the current playing mode, and searches the candidate album set for the target album according to the target album name. After the target album is found, the detailed information of the target album is returned to the client, and the terminal displays a detail page of the target album in the client according to the detailed information, where the detail page includes a selection control corresponding to the at least one video. When a video selection operation of the user on a selection control is detected, the target video to be played is determined; identification information of the target video is acquired, and the target video is requested from the server according to the identification information. The identification information may be expressed as the episode number of the target video in the target album.
In another embodiment, when the terminal implements the video playing method of the present application, the content of the target video may be identified and controlled not only during online playing and downloading; when the terminal plays the downloaded target video offline, the video clip to be played in the target video may also be identified and controlled. The downloaded target video file is stored in the terminal in a streaming media file format, and the streaming media file includes a plurality of compressed video clip packets stored in playing order; when the target video is played, the compressed packet of the video clip to be played can be obtained in playing order, decompressed and played. Of course, to facilitate management of the video data by the terminal, the target video file stored in the terminal may not actually be split, and the terminal plays the target video file through the HLS protocol.
Having the terminal identify and control the target video content can relieve the calculation pressure on the server when the server needs to respond to video acquisition requests from a large number of terminals.
102. And identifying the content of the video clip to be played to obtain the type of the target video clip of the video clip to be played.
The video clip type is a type obtained by classification according to the emotion, purpose and proportion expressed by elements such as pictures, actions and sounds in a video clip, and is used for determining whether the video clip is suitable for playing in the current playing mode; video clip types may include entertainment, sports, horror, violence, pornography and the like. Accordingly, the target video clip type refers to the video clip type of the video clip to be played.
In an embodiment, identifying the content of the video clip to be played to obtain the target video clip type of the video clip to be played may specifically include the following steps:
performing feature extraction on the video content of the video clip to be played to obtain feature information of the video content;
determining the action type of an object in the video clip to be played according to the characteristic information;
and classifying the video clips to be played according to the action types to obtain the target video clip types of the video clips to be played.
The action type is information indicating the posture and behavior of an object, and may include "horse riding", "archery", "fighting", "reading" and the like.
In an embodiment, the feature information includes picture element feature information and motion feature information, and performing feature extraction on the video content of the video clip to be played to obtain the feature information of the video content may specifically include the following steps:
determining a sample frame from video frames of the video content;
extracting motion information of pixels in the video content according to the sampling frame and the adjacent video frame of the sampling frame to obtain an optical flow sequence corresponding to the video content;
carrying out convolution operation on the sampling frame according to a spatial flow convolution neural network of a preset action recognition network model, and extracting picture element characteristic information of the video content;
and performing motion feature extraction on the optical flow sequence according to a time flow convolution neural network of a preset action recognition network model to obtain motion feature information of the video clip to be played.
The sampling frame is a video frame selected from the video clip to be played and used for extracting picture element feature information. Picture elements include elements such as the objects in the picture (including the object issuing an action, the object receiving an action, and props used in the action), postures, and scenes (including the background).
In an embodiment, when the duration of the video clip to be played is short, the video picture is not switched, and the objects and scene in the picture do not change, the sampling frame may be a single video frame. The sampling frame may contain RGB three-channel information and RGB-D grey-scale map information.
In this embodiment, motion vectors for all pixels in the video frame may be obtained to obtain a dense optical flow. The optical flow of the video to be played can be calculated using the optical flow calculation interface in OpenCV. The optical flow sequence refers to a combination of a plurality of optical flow diagrams arranged in sequence.
In an embodiment, gradient calculation may be performed on L consecutive video frames starting from the sampling frame to obtain L optical flow maps, and the L optical flow maps are combined in order to obtain the optical flow sequence. A typical optical flow map has 2 channels, containing the motion information of the pixels on the x-axis and the y-axis.
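The following sketch illustrates this step with OpenCV's dense (Farneback) optical flow interface mentioned above; the parameter values are typical defaults rather than values specified in this application.

```python
# Sketch of extracting an optical flow sequence with OpenCV's dense
# (Farneback) optical flow; parameter values are common defaults.
import cv2

def optical_flow_sequence(frames, start, length):
    """Compute L consecutive dense optical flow maps starting at a sampling frame."""
    flows = []
    prev = cv2.cvtColor(frames[start], cv2.COLOR_BGR2GRAY)
    for i in range(start + 1, start + 1 + length):
        nxt = cv2.cvtColor(frames[i], cv2.COLOR_BGR2GRAY)
        # Each flow map has 2 channels: per-pixel motion on the x and y axes.
        flow = cv2.calcOpticalFlowFarneback(prev, nxt, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        flows.append(flow)
        prev = nxt
    return flows  # the ordered combination is the optical flow sequence
```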
The convolutional neural network of the preset action recognition network model can comprise a plurality of convolutional layers and a plurality of full-connection layers, wherein the convolutional layers are used for extracting object features, and the full-connection layers are used for feature fusion.
The convolutional layers are mainly used for extracting features from the input images (such as an optical flow map or a sampling frame). Each convolutional layer includes a plurality of convolution kernels, whose sizes may be determined according to the practical application; different convolution kernels have different weights and can be used to extract features of different dimensions, such as object, posture, scene, motion direction, motion amplitude and motion speed features, where the weights in the convolution kernels may be determined through training. In an embodiment, a plurality of feature maps of different dimensions can be obtained through the convolution operations of different convolution kernels.
In an embodiment, in order to improve the expression capability of the model, non-linear factors may also be introduced by adding an activation function. In an embodiment of the present invention, the activation functions are all "ReLU (rectified linear unit)" and the padding mode is all "same"; the "same" padding mode can be simply understood as padding the edges with zeros, where the number of zeros padded on the left (top) is the same as or less than the number padded on the right (bottom).
In one embodiment, in order to further reduce the amount of calculation, a down-sampling (pooling) operation may be performed on one or some of the convolution layers. The down-sampling operation is basically the same as the convolution operation, except that the down-sampling convolution kernel only takes the maximum value (max) or the average value (average) of the corresponding positions.
The number of convolutional layers can be adjusted according to actual requirements; the image features output by a previous convolutional layer serve as the input of the next convolutional layer for further feature extraction, and through the convolution operations of multiple layers the extracted features become more and more abstract.
The fully-connected layer can map the learned features to the label space and mainly plays the role of a classifier in the whole recognition model. Each node of the fully-connected layer is connected to all nodes output by the previous layer (such as the down-sampling layer in the convolutional part); a node of the fully-connected layer is called a neuron, and the number of neurons can be determined according to the requirements of the practical application. Similarly to the convolutional layers, a non-linear factor can optionally be added in the fully-connected layer through an activation function, for example the sigmoid (S-shaped) function.
In an embodiment, the action type probability information output by the fully-connected layer is in the form of a one-dimensional vector, where each element of the vector represents the probability that the action of the object belongs to a certain action type. For example, if the action recognition network model can recognize n action types (including "horse riding", "archery", "fighting", "reading" and the like), the vector output by the fully-connected layer has n elements, respectively representing the probabilities corresponding to the n action types.
As shown in fig. 5c, the spatial flow convolutional neural network and the temporal flow convolutional neural network of the preset action recognition network model are both classification neural networks. Taking AlexNet as an example, each may include 5 convolutional layers and 3 fully-connected layers, with a down-sampling operation performed after the 1st, 2nd and 5th convolutional layers. The down-sampling uses maximum pooling (max pooling), which takes the maximum of the feature points in a neighborhood; this avoids the blurring effect of average pooling and retains the most significant features. The stride used in AlexNet is smaller than the size of the pooling kernel, so the outputs of the pooling layers overlap, which improves the richness of the features and reduces the loss of information. The last fully-connected layer applies the softmax function, which maps the output neurons to real numbers between 0 and 1 that are normalized to sum to 1, so the probabilities of the multiple classes also sum exactly to 1.
It should be noted that, for convenience of description, in the embodiment of the present invention, the layer where the activation function is located is included in the convolution layer, it should be understood that the structure may also be considered to include the convolution layer, the layer where the activation function is located, the downsampling layer (i.e., the pooling layer), and the full connection layer, and of course, the preset action recognition network model may further include an input layer for inputting data and an output layer for outputting data, which are not described herein again.
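For concreteness, the following PyTorch sketch shows one possible realization of the AlexNet-style stream described above, with five convolutional layers, overlapping max pooling after the 1st, 2nd and 5th convolutional layers, three fully-connected layers and a softmax output; the layer sizes and the assumed value of L are illustrative, not values fixed by this application.

```python
# Sketch of the two-stream structure described above, in PyTorch;
# exact layer sizes and the value of L are illustrative assumptions.
import torch
import torch.nn as nn

class StreamCNN(nn.Module):
    """AlexNet-style stream: 5 conv layers (pooling after conv 1, 2, 5),
    3 fully-connected layers, softmax output over n action types."""
    def __init__(self, in_channels, num_actions):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 96, 11, stride=4), nn.ReLU(),
            nn.MaxPool2d(3, 2),   # stride < kernel: overlapping pooling
            nn.Conv2d(96, 256, 5, padding=2), nn.ReLU(),
            nn.MaxPool2d(3, 2),
            nn.Conv2d(256, 384, 3, padding=1), nn.ReLU(),
            nn.Conv2d(384, 384, 3, padding=1), nn.ReLU(),
            nn.Conv2d(384, 256, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(3, 2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(), nn.LazyLinear(4096), nn.ReLU(),
            nn.Linear(4096, 4096), nn.ReLU(),
            nn.Linear(4096, num_actions), nn.Softmax(dim=1),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

# Spatial stream takes a 3-channel RGB sampling frame; temporal stream
# takes a stacked optical flow sequence (2 channels per flow map, L maps).
spatial_stream = StreamCNN(in_channels=3, num_actions=101)
temporal_stream = StreamCNN(in_channels=2 * 10, num_actions=101)  # L = 10 assumed
```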
In an embodiment, determining the action type of the object in the video clip to be played according to the feature information may specifically include the following steps:
performing full-connection operation on the picture element characteristic information based on a spatial flow convolution neural network of the preset action recognition network model to obtain first type probability information;
performing full-connection operation on the motion characteristic information based on a time flow convolution neural network of the preset action recognition network model to obtain second type probability information;
and fusing the first type probability information and the second type probability information to obtain the action type of the object in the video clip to be played.
The first type probability information is the probability corresponding to the action type of the object predicted from the picture elements in the single static sampling frame. The spatial flow convolutional neural network fuses picture element features such as the posture of the object, props and the scene, and can predict the action of the object in the picture.
And the second type probability information is the probability corresponding to the action type obtained by prediction according to the pixel change in the video frame.
In an embodiment, the first type probability information and the second type probability information may be fused by using an SVM algorithm to obtain the probability information corresponding to the final action type, and the action type with the highest probability may be determined as the action type of the object in the video clip to be played.
An SVM (Support Vector Machine) is a generalized linear classifier that performs binary classification on data by supervised learning; its decision boundary is the maximum-margin hyperplane solved for the learning samples.
Based on the fused action type probability information, the action type in the video to be played can be determined.
In an embodiment, in the one-dimensional vector output by the SVM layer, the action type corresponding to the largest element is determined as the action type of the object in the video clip to be played.
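A minimal sketch of the fusion step follows; this application names an SVM for fusion, and one common realization is a classifier trained offline on the concatenated probability vectors (shown here with scikit-learn), with simple averaging as a fallback. All names are illustrative.

```python
# Sketch of fusing the first and second type probability information.
import numpy as np
from sklearn.svm import SVC

def fuse_and_decide(p_spatial, p_temporal, svm=None):
    """Return the index of the predicted action type."""
    if svm is not None:
        # SVM trained offline on concatenated (first, second) type probabilities.
        features = np.concatenate([p_spatial, p_temporal]).reshape(1, -1)
        return int(svm.predict(features)[0])
    fused = (p_spatial + p_temporal) / 2.0   # fallback: average fusion
    return int(np.argmax(fused))             # largest element = action type
```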
The preset action recognition network model can be obtained through multi-task training. Multi-task training is a machine learning method in which multiple related tasks are learned together based on a shared representation. At present most machine learning tasks are single-task learning; the advantage of multi-task learning is that several related tasks are learned together and share information, so that knowledge learned in the other tasks can be applied to the target task and improve its effect. In the present invention, the target task is the action type recognition task and the other task is a picture recognition task; the two tasks are learned simultaneously in one network, and the information learned from the picture recognition task is shared with the action recognition task, thereby improving the action recognition effect. The initial action recognition network model provides two fully-connected output layers. When two data sets are used to train the initial action recognition network model, one fully-connected output layer classifies the videos of one data set (such as UCF101) and the other fully-connected output layer classifies the pictures in the other data set; when the error is finally calculated by the back propagation (BP) algorithm, the outputs of the two fully-connected output layers are summed as the total loss, and the BP algorithm is executed to update the weights in the network model. The UCF101 data set contains 101 action types in total and 13320 video clips, with 5 major classes of actions: human-object interaction, limb movement, human-human interaction, playing musical instruments, and sports. The videos in the data set have been tagged with the correct action types.
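The following sketch illustrates the multi-task setup just described, assuming a shared backbone with two fully-connected output heads whose losses are summed before back propagation; the dataset wiring and class counts are assumptions.

```python
# Sketch of multi-task training: one shared backbone, two output heads
# (video action types and picture classes), losses summed before BP.
import torch
import torch.nn as nn

backbone = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
                         nn.AdaptiveAvgPool2d(1), nn.Flatten())
video_head = nn.Linear(64, 101)     # e.g. the 101 UCF101 action types
picture_head = nn.Linear(64, 1000)  # picture-recognition head (count assumed)
criterion = nn.CrossEntropyLoss()
params = (list(backbone.parameters()) + list(video_head.parameters())
          + list(picture_head.parameters()))
optimizer = torch.optim.SGD(params, lr=0.01)

def train_step(video_x, video_y, pic_x, pic_y):
    optimizer.zero_grad()
    loss = (criterion(video_head(backbone(video_x)), video_y)
            + criterion(picture_head(backbone(pic_x)), pic_y))  # total loss
    loss.backward()   # BP updates the shared weights
    optimizer.step()
    return loss.item()
```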
In an embodiment, the video clips to be played may be classified according to attributes of the action types, so as to obtain a target video clip type of the video clip to be played. For example, if the attribute of the "fighting" action is "violence", the target video clip type is the violence type. For example, if the attribute of the "read" action is "education", the target video clip type is education type. For another example, the attribute of the action of "kicking shuttlecock" is "entertainment", and the type of the target video clip is entertainment type.
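A minimal sketch of this attribute-based classification, using the examples above as illustrative mapping entries:

```python
# Sketch of classifying a clip by the attribute of its recognized action
# type; the mapping entries are illustrative, not a fixed taxonomy.
ACTION_ATTRIBUTES = {
    "fighting": "violence",
    "reading": "education",
    "kicking shuttlecock": "entertainment",
}

ATTRIBUTE_TO_CLIP_TYPE = {
    "violence": "violence type",
    "education": "education type",
    "entertainment": "entertainment type",
}

def clip_type_from_action(action_type):
    attribute = ACTION_ATTRIBUTES.get(action_type)
    return ATTRIBUTE_TO_CLIP_TYPE.get(attribute, "unclassified")
```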
In one embodiment, natural language processing techniques may be employed to understand the action attributes corresponding to an action type. Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies the theories and methods that enable effective communication between humans and computers in natural language. Natural language processing is a science that integrates linguistics, computer science and mathematics; research in this field involves natural language, i.e. the language people use every day, so it is closely related to the study of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, robotic question answering, knowledge graphs and the like.
In another embodiment, the action recognition network model may be trained by using a data set labeled with action attributes or video segment types in advance to obtain a trained action recognition network model, and the action attributes or the target video segment types corresponding to the action types are directly obtained by using the trained action recognition network model.
In another embodiment, when the video clip to be played is long and contains many actions, the video clip to be played may be divided into multiple segments by using a C3D convolutional network, action type recognition may be performed on each segment separately, and the results may finally be superimposed.
In another embodiment, the video content may also be recognized in the terminal, and the terminal may obtain the trained preset motion recognition model from the server through the network.
In another embodiment, the video clip to be played may also be classified by identifying the audio content in it. In this case, identifying the content of the video clip to be played to obtain the target video clip type of the video clip to be played may specifically include the following steps:
identifying the audio content of the video clip to be played to obtain text information corresponding to the audio content;
identifying the text information according to a sensitive word set which needs to be controlled to be played in a current playing mode to obtain an audio identification result;
and classifying the video clips to be played according to the audio recognition result and the action type to obtain the video clip type.
The sensitive word refers to a word unit which is forbidden to be played in the current playing mode.
In the embodiment, the server recognizes the speech information in the audio content, and the process of recognizing the speech information involves an artificial intelligence ASR technology.
Among these, ASR (Automatic Speech Recognition) is used to convert the vocabulary content in speech information into computer-readable input, such as keystrokes, binary codes or character sequences. ASR is one of the key technologies of Speech Technology. Enabling computers to listen, see, speak and feel is the development direction of future human-computer interaction, and speech is expected to become one of the most promising human-computer interaction modes.
In an embodiment, a voice assistant is disposed in the server, and the voice assistant includes a voice recognition engine, and the voice recognition engine may apply ASR technology to recognize voice information in the audio content and obtain text information corresponding to the voice information.
In an embodiment, recognizing the speech information by using the ASR technique to obtain the text information corresponding to the speech information includes the following steps: first, the speech information is input to a feature extraction module to extract appropriate acoustic feature parameters; then the extracted acoustic feature parameters are input into an acoustic model for classification and judgment, so as to obtain the text information corresponding to the speech information.
A Hidden Markov Model (HMM) may be employed as the acoustic model; the HMM needs to be trained before it can be used.
In another embodiment, a language model can be trained based on a deep neural network; the features of the speech information are extracted and input into the language model for classification and judgment to obtain the text information corresponding to the speech information.
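A minimal sketch of this ASR front end, using MFCCs extracted with librosa as the acoustic feature parameters; the acoustic model itself is left as a placeholder, and the sample rate and feature dimension are assumptions.

```python
# Sketch of the ASR front end described above: extract acoustic feature
# parameters (MFCCs here, via librosa) and hand them to an acoustic model.
# The acoustic model's decode() is a placeholder, not a real library call.
import librosa

def transcribe(audio_path, acoustic_model):
    y, sr = librosa.load(audio_path, sr=16000)          # speech information
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # acoustic features
    return acoustic_model.decode(mfcc)                  # -> text information
```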
In another embodiment, the recognition of the audio content may also be performed in a terminal, and the terminal may obtain the trained language model or acoustic model from a server through a network.
In an embodiment, identifying the text information according to the sensitive word set whose playing needs to be controlled in the current playing mode to obtain the audio identification result may include the following steps:
when the text information has the matched sensitive words in the sensitive word set, acquiring the attributes of the sensitive words;
and acquiring the audio recognition result according to the attribute of the sensitive word.
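A minimal sketch of this sensitive-word check; the word set and attribute table are illustrative assumptions for a given playing mode.

```python
# Sketch of matching text information against a sensitive word set and
# returning the matched words' attributes as the audio recognition result.
SENSITIVE_WORDS = {"gunfight": "violence", "ghost": "horror"}  # illustrative

def audio_recognition_result(text, sensitive_words=SENSITIVE_WORDS):
    """Return the attributes of sensitive words matched in the text."""
    matched = [attr for word, attr in sensitive_words.items() if word in text]
    return matched or None   # None: the audio requires no playing control
```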
In an embodiment, the video clips to be played may be classified according to the attributes of the action types to obtain an initial target video clip type of the video clip to be played. The initial target video clip type is then verified against the audio recognition result: if they are consistent, the initial target video clip type is used as the target video clip type; if they are inconsistent, the target video clip type is determined according to whichever of the initial target video clip type and the audio recognition result is subject to the stricter degree of control in the current mode. For example, if in the teenager mode the pornographic type is controlled more strictly than the horror type, the initial target video clip type is the horror type, and the audio recognition result detects a sensitive word with the pornographic attribute, the target video clip is determined to be of the pornographic type.
In an embodiment, for convenience of classification and management, the video clip types that may occur in the database may be divided in advance to obtain the preset video clip types. And when the video clip to be played is classified according to the action type and/or the audio recognition result, the video clip to be played is classified into one of the preset video clip types. The preset video clip types can be stored in a server or a terminal, and when the video clips to be played are classified according to the action types and/or the audio recognition results, the target video clip types can be determined from the preset video clip types.
In an embodiment, after the content of the video clip to be played is identified to obtain a target video clip type of the video clip to be played, a type mark of the target video can be generated according to the target video clip type corresponding to the video clip to be played; and marking the target video based on the type mark to obtain the marked target video.
Specifically, after the server identifies the content of the video segment to be played to obtain the target video segment type of the video segment to be played, the video segment type corresponding to the video segment of the target video may be written into an index file of a target video file to obtain a marked target video file, and then the target video in the database is replaced with the marked target video.
When another terminal, or the same terminal, requests to play the target video online again, the video clip type of the video clip to be played can be obtained directly according to the marked target video. This reduces the calculation load of the server and improves the playing speed.
In another embodiment, when the terminal executes the video playing method of the present application: if the video is played online, after identifying the target video clip type, the terminal may send the marked target video or the corresponding type mark to the server; if the downloaded video is played offline, after identifying the target video clip type, the terminal may mark the downloaded video and store it.
103. And determining the control level corresponding to the video clip to be played in the current playing mode based on the type of the target video clip.
The control level is used to represent the degree of control over the playing of a video clip. According to the different degrees of control of the playing modes over videos, the preset video clip types can be divided into three levels: "prohibited playing level", "limited playing level" and "allowed playing level". A video clip at the prohibited playing level cannot be played at all; for example, in the child mode, horror-type video clips are prohibited playing level clips. A video clip at the limited playing level is of a type whose viewing duration is limited or which requires parental accompaniment; for example, in the child mode the viewing duration of entertainment-type clips can be limited, and in the teenager mode horror-type and violence-type clips require parental accompaniment.
The mapping relationship among the preset video clip types, the play modes, and the control levels may be preset and stored in the server, and the server may determine the control level corresponding to the video clip to be played in the current play mode according to this mapping.
In an embodiment, when the terminal executes the video playing method of the present application, the mapping relationship among the preset video clip types, the play modes, and the control levels may be stored in the terminal, and the terminal determines the control level according to this mapping.
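As a sketch, the mapping might be a simple lookup table keyed by play mode and clip type. The entries below merely mirror the examples given above; the actual table contents are not specified by the patent.

```python
# Illustrative (play mode, clip type) -> control level mapping based on the
# examples above; the table contents are assumptions, not patent text.
CONTROL_MAP = {
    ("child", "horror"):        "prohibited",
    ("child", "entertainment"): "limited",
    ("teenager", "horror"):     "limited",
    ("teenager", "violence"):   "limited",
}

def control_level(mode: str, clip_type: str) -> str:
    # Types without an entry for the current mode default to the permitted level.
    return CONTROL_MAP.get((mode, clip_type), "allowed")
```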
104. And controlling the playing of the video clip to be played according to the control level.
In an embodiment, the server may send the video to be played and the corresponding control level to the terminal, so that the terminal controls the playing of the video clip to be played according to the control level.
The terminal may trigger different control instructions according to the control level and execute them to control the playing of the video clip to be played.
In another embodiment, the server may not send the control level to the terminal; instead, the server itself may determine whether the video clip to be played may be played, sending the clip to the terminal if it may be played and withholding it if it may not.
In an embodiment, the control level includes a play-permitted level, and controlling the playing of the video clip to be played according to the control level may include the following step:
when the control level is the play-permitted level, playing the video clip to be played.
In an embodiment, the control level further includes a play-prohibited level, and controlling the playing of the video clip to be played according to the control level may further include the following steps:
when the control level is the play-prohibited level, prohibiting playing of the video clip to be played;
acquiring a next video clip of the video clip to be played according to the playing sequence of the video clips in the target video;
updating the next video clip into a video clip to be played;
and repeatedly executing the step of identifying the content of the video clip to be played to obtain the target video clip type of the video clip to be played, determining the control level corresponding to the target video clip type in the current playing mode, and controlling the playing of the video clip to be played according to the control level.
In an embodiment, the control level further includes a play-limited level, and controlling the playing of the video clip to be played according to the control level may further include the following steps (a sketch of the full control loop follows these steps):
when the control level is the play-limited level, acquiring target play information according to a preset play condition corresponding to the target video clip type;
if the target playing information meets the preset playing condition, playing the video clip to be played;
if the target playing information does not meet the preset playing condition, acquiring a next video segment of the video segments to be played according to the playing sequence of the video segments in the target video;
updating the next video clip into a video clip to be played;
and repeatedly executing the step of identifying the content of the video clip to be played to obtain the target video clip type of the video clip to be played, determining the control level corresponding to the target video clip type in the current playing mode, and controlling the playing of the video clip to be played according to the control level.
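Taken together, the three levels amount to a simple per-clip control loop, sketched below. The helper callables are assumptions standing in for steps 102-104 above, and control_level is the lookup sketched earlier; none of these names appear in the patent.

```python
# Illustrative per-clip playback control loop combining the three levels.
# identify_clip_type, meets_play_condition, and play are assumed helpers
# passed in as callables; control_level is the lookup sketched earlier.
from typing import Callable, Iterable

def play_target_video(clips: Iterable, mode: str,
                      identify_clip_type: Callable,    # content identification (step 102)
                      meets_play_condition: Callable,  # limited-level condition check
                      play: Callable) -> None:
    for clip in clips:                                 # clips in playing order
        clip_type = identify_clip_type(clip)
        level = control_level(mode, clip_type)         # control level (step 103)
        if level == "prohibited":
            continue                                   # skip to the next clip
        if level == "limited" and not meets_play_condition(mode, clip_type):
            continue                                   # condition not met: skip
        play(clip)                                     # permitted, or condition met
```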
The target play information is information, in the current play mode, related to the preset play condition corresponding to the target video clip type.
For example, the server may obtain the user's play records within a preset time period according to the user account that sent the video acquisition request, and count the total duration of the target video clip type in those records as the target play information. When this duration is less than a preset threshold, the preset play condition is considered met; otherwise it is considered not met. The preset time period and preset threshold form a play condition set by the user; for example, in child mode the user may set the total time for watching entertainment clips in one day to at most 30 minutes, where "one day" is the preset time period and "30 minutes" is the preset threshold.
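One possible realization of this duration check is sketched below; the play-record query is an assumed helper passed in as a callable and is not named in the patent.

```python
# Illustrative duration check for the play-limited level: total watch time of
# a clip type within a preset period must stay below a preset threshold.
from typing import Callable, Iterable, Tuple

def duration_condition_met(user_account: str, clip_type: str,
                           get_play_records: Callable[[str, int], Iterable[Tuple[str, float]]],
                           period_hours: int = 24,     # "one day"
                           threshold_min: int = 30) -> bool:
    # get_play_records is an assumed helper returning (clip type, seconds) pairs.
    records = get_play_records(user_account, period_hours)
    watched_min = sum(sec for t, sec in records if t == clip_type) / 60
    return watched_min < threshold_min                 # met while under the threshold
```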
For another example, the server may send a play request to an account or device associated with the user account (e.g., a parent's device or account); the video clip to be played is played if the associated device allows it and skipped if it does not.
In another embodiment, when the terminal executes the video playing method of the present application, the terminal may trigger different control instructions according to the obtained control level, and the terminal executes the control instructions to control the playing of the video clip to be played.
As can be seen from the above, in the embodiment of the present application, when a target video is played in a current play mode, a video clip to be played of the target video may be obtained; identifying the content of the video clip to be played to obtain the target video clip type of the video clip to be played; determining a control level corresponding to the video clip to be played in the current playing mode based on the type of the target video clip; and controlling the playing of the video clip to be played based on the control level. When the target video is played, the content of the target video is detected, and the playing of the target video is controlled, so that the content of the video is effectively filtered, and the watching environment of a user is purified.
In order to implement the foregoing method better, an embodiment of the present invention further provides a video playing apparatus, where the video playing apparatus may be specifically integrated in a computer device, and the computer device may be a device such as a terminal or a server.
For example, in this embodiment the video playing apparatus is described in detail by taking its integration in a server as an example.
For example, as shown in FIG. 3, the video playing apparatus may include a video acquiring unit 201, a type obtaining unit 202, a control unit 203, and a playing unit 204, as follows:
(1) The video acquiring unit 201 is configured to acquire a to-be-played video clip of a target video when the target video is played in a current play mode.
(2) A type obtaining unit 202, configured to identify the content of the video segment to be played, so as to obtain a target video segment type of the video segment to be played.
In an embodiment, the type obtaining unit 202 may specifically include a feature extracting subunit, a feature fusing subunit, and a type determining subunit, as follows:
The feature extraction subunit is configured to perform feature extraction on the video content of the video clip to be played to obtain feature information of the video content;
the feature fusion subunit is configured to determine the action type of an object in the video clip to be played according to the feature information;
and the type determining subunit is configured to classify the video clip to be played according to the action type to obtain the target video clip type of the video clip to be played.
In an embodiment, the feature extraction subunit may be specifically configured to perform the following operations (an optical flow sketch follows these steps):
determining a sample frame from video frames of the video content;
extracting motion information of pixels in the video content according to the sampling frame and the adjacent video frame of the sampling frame to obtain an optical flow sequence corresponding to the video content;
performing convolution operation on the sampling frame according to a spatial flow convolution neural network of a preset action recognition network model, and extracting picture element characteristic information of the video content;
and performing a convolution operation on the optical flow sequence according to a time flow convolutional neural network of the preset action recognition network model, and extracting the motion characteristic information of the video content.
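As an illustration of the optical-flow step, dense flow between a sampling frame and its neighboring frame could be computed as below. OpenCV's Farneback method is an assumed choice; the patent does not name a specific optical flow algorithm.

```python
# Illustrative dense optical flow between a sampling frame and its neighbor.
# The Farneback algorithm is an assumption; the patent names no method.
import cv2
import numpy as np

def flow_at_sample(frames: list, sample_idx: int) -> np.ndarray:
    """Return an HxWx2 field of per-pixel (dx, dy) motion at the sampled frame."""
    prev = cv2.cvtColor(frames[sample_idx], cv2.COLOR_BGR2GRAY)
    nxt = cv2.cvtColor(frames[sample_idx + 1], cv2.COLOR_BGR2GRAY)
    return cv2.calcOpticalFlowFarneback(
        prev, nxt, None,
        pyr_scale=0.5, levels=3, winsize=15,
        iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
```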
In an embodiment, the feature fusion subunit may be specifically configured to perform the following operations (a two-stream fusion sketch follows these steps):
performing full-connection operation on the picture element characteristic information based on a spatial flow convolution neural network of the preset action recognition network model to obtain first type probability information;
performing full-connection operation on the motion characteristic information based on the time flow convolutional neural network of the preset action recognition network model to obtain second type probability information;
and fusing the first type probability information and the second type probability information to obtain the action type of the object in the video clip to be played.
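The two-stream structure and its late fusion can be sketched as follows. This is a PyTorch-style skeleton with assumed layer sizes and simple probability averaging; the patent fixes neither the framework nor the architectural details.

```python
# Skeleton of a two-stream action recognition model with late fusion.
# Layer sizes, the averaging fusion, and all names are illustrative assumptions.
import torch
import torch.nn as nn

class TwoStreamClassifier(nn.Module):
    def __init__(self, num_action_types: int):
        super().__init__()
        # Spatial stream: convolves sampled RGB frames (3 channels).
        self.spatial = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        # Temporal stream: convolves a 2-channel (dx, dy) optical flow field.
        self.temporal = nn.Sequential(
            nn.Conv2d(2, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        # Full-connection layers producing the two type-probability vectors.
        self.spatial_fc = nn.Linear(32, num_action_types)
        self.temporal_fc = nn.Linear(32, num_action_types)

    def forward(self, rgb: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
        p1 = self.spatial_fc(self.spatial(rgb)).softmax(dim=1)     # first type probabilities
        p2 = self.temporal_fc(self.temporal(flow)).softmax(dim=1)  # second type probabilities
        return (p1 + p2) / 2   # fuse the two probability vectors by averaging
```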
In an embodiment, the type obtaining unit 202 may be further configured to perform the following operations (a matching sketch follows these steps):
identifying the audio content of the video clip to be played to obtain text information corresponding to the audio content;
performing lexical division on the text information to obtain at least one target word unit;
when a target word unit matches a sensitive word in the sensitive word set, acquiring the attribute of the sensitive word;
acquiring the audio recognition result according to the attribute of the sensitive word;
and classifying the video clip to be played according to the audio recognition result and the action type to obtain the target video clip type.
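The audio branch above can be sketched as follows. The attribute table, the whitespace tokenizer, and the function name are assumptions; a production system would pair a speech recognizer with a proper word segmenter.

```python
# Illustrative sensitive-word matching over recognized audio text.
# SENSITIVE_WORDS and the whitespace tokenizer are assumptions for
# illustration; the patent does not define the word set or segmenter.
from typing import Optional

SENSITIVE_WORDS = {            # word -> attribute (illustrative entries)
    "gore": "horror",
    "nude": "pornographic",
}

def audio_recognition_result(text: str) -> Optional[str]:
    """Return the attribute of the first matched sensitive word, if any."""
    for unit in text.lower().split():      # lexical division into word units
        if unit in SENSITIVE_WORDS:
            return SENSITIVE_WORDS[unit]   # the attribute drives the audio result
    return None
```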
In an embodiment, the type obtaining unit 202 may be further configured to:
and when the target video is the marked target video, determining the type of the target video clip of the video clip to be played according to the type mark.
(3) A control unit 203, configured to determine, based on the type of the target video segment, a control level corresponding to the video segment to be played in the current play mode.
(4) A playing unit 204, configured to control playing of the video segment to be played according to the control level.
In an embodiment, the control level includes a play permitted level, and the play unit 204 may specifically be configured to:
and when the control level is a play-allowed level, playing the video clip to be played.
In an embodiment, the control level further includes a play prohibition level, and the play unit 204 may be further specifically configured to:
when the control level is a play prohibition level, prohibiting playing the video clip to be played;
acquiring a next video clip of the video clip to be played according to the playing sequence of the video clips in the target video;
updating the next video clip into a video clip to be played;
and repeatedly executing the step of identifying the content of the video clip to be played to obtain the target video clip type of the video clip to be played, determining the control level corresponding to the target video clip type in the current playing mode, and controlling the playing of the video clip to be played according to the control level.
In an embodiment, the control level further includes a play limit level, and the play unit 204 may be further specifically configured to:
when the control level is a play limiting level, acquiring target play information according to a preset play condition corresponding to the target video clip type;
if the target playing information meets the preset playing condition, playing the video clip to be played;
if the target playing information does not meet the preset playing condition, acquiring a next video segment of the video segments to be played according to the playing sequence of the video segments in the target video;
updating the next video clip into a video clip to be played;
and repeatedly executing the step of identifying the content of the video clip to be played to obtain the target video clip type of the video clip to be played, determining the control level corresponding to the target video clip type in the current playing mode, and controlling the playing of the video clip to be played according to the control level.
In specific implementations, the above units may be implemented as independent entities, or combined arbitrarily into one or several entities; for their specific implementation, reference may be made to the foregoing method embodiments, which are not repeated here.
As can be seen from the above, in the video playing apparatus of this embodiment, the video acquiring unit acquires the video clip to be played of the target video when the target video is played in the current play mode; the type obtaining unit identifies the content of the video clip to be played to obtain its target video clip type; the control unit determines, based on the target video clip type, the control level corresponding to the video clip to be played in the current play mode; and the playing unit controls the playing of the video clip to be played according to the control level. Because the content of the target video is detected during playback and playback is controlled according to the detection result, video content is effectively filtered and the user's viewing environment is purified.
The embodiment of the invention also provides computer equipment, and the computer equipment can be a terminal or a server.
For example, as shown in fig. 4, a schematic structural diagram of a computer device according to an embodiment of the present invention is shown, specifically:
the computer device may include components such as a processor 301 of one or more processing cores, memory 302 of one or more computer-readable storage media, a power supply 303, an input module 304, and a communication module 305. Those skilled in the art will appreciate that the computer device configuration illustrated in FIG. 4 does not constitute a limitation of computer devices, and may include more or fewer components than those illustrated, or some components may be combined, or a different arrangement of components. Wherein:
the processor 301 is a control center of the computer device, connects various parts of the entire computer device using various interfaces and lines, and performs various functions of the computer device and processes data by running or executing software programs and/or modules stored in the memory 302 and calling data stored in the memory 302. In some embodiments, processor 301 may include one or more processing cores; in some embodiments, processor 301 may integrate an application processor, which primarily handles operating systems, user interfaces, applications, etc., and a modem processor, which primarily handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 301.
The memory 302 may be used to store software programs and modules, and the processor 301 executes various functional applications and data processing by running the software programs and modules stored in the memory 302. The memory 302 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the data storage area may store data created according to the use of the computer device, and the like. Further, the memory 302 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device. Accordingly, the memory 302 may also include a memory controller to provide the processor 301 with access to the memory 302.
The computer device also includes a power supply 303 for supplying power to the various components. In some embodiments, the power supply 303 may be logically coupled to the processor 301 via a power management system, so that charging, discharging, and power consumption are managed through the power management system. The power supply 303 may also include one or more DC or AC power sources, a recharging system, a power failure detection circuit, a power converter or inverter, a power status indicator, and other such components.
The computer device may also include an input module 304, the input module 304 operable to receive input numeric or character information and generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control.
The computer device may also include a communication module 305. In some embodiments, the communication module 305 may include a wireless sub-module through which the computer device may wirelessly transmit over short distances to provide wireless broadband internet access. For example, the communication module 305 may be used to assist a user in sending and receiving e-mails, browsing web pages, accessing streaming media, and the like.
Although not shown, the computer device may further include a display unit and the like, which are not described in detail herein. Specifically, in this embodiment, the processor 301 in the computer device loads the executable file corresponding to the process of one or more application programs into the memory 302 according to the following instructions, and the processor 301 runs the application programs stored in the memory 302, thereby implementing various functions as follows:
when a target video is played in a current playing mode, acquiring a video clip to be played of the target video;
identifying the content of the video clip to be played to obtain the target video clip type of the video clip to be played;
determining a control level corresponding to the video clip to be played in the current playing mode based on the type of the target video clip;
and controlling the playing of the video clip to be played according to the control level.
The specific implementation of the above operations can be found in the foregoing embodiments and is not repeated here.
It will be understood by those skilled in the art that all or part of the steps of the methods of the above embodiments may be performed by instructions or by associated hardware controlled by the instructions, which may be stored in a computer readable storage medium and loaded and executed by a processor.
To this end, embodiments of the present application provide a computer-readable storage medium, in which a plurality of instructions are stored, where the instructions can be loaded by a processor to execute steps in any one of the video playing methods provided in the embodiments of the present application. For example, the instructions may perform the steps of:
when a target video is played in a current playing mode, acquiring a video clip to be played of the target video;
identifying the content of the video clip to be played to obtain the target video clip type of the video clip to be played;
determining a control level corresponding to the video clip to be played in the current playing mode based on the type of the target video clip;
and controlling the playing of the video clip to be played according to the control level.
Wherein the storage medium may include: Read-Only Memory (ROM), Random Access Memory (RAM), a magnetic disk or an optical disc, and the like.
Since the instructions stored in the storage medium can execute the steps of any video playing method provided in the embodiments of the present application, they can achieve the beneficial effects achievable by any video playing method provided in the embodiments of the present application, as detailed in the foregoing embodiments and not repeated here.
The video playing method, apparatus, and storage medium provided in the embodiments of the present application have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present application, and the description of the above embodiments is intended only to help understand the method and its core idea. Meanwhile, those skilled in the art may make changes to the specific embodiments and the application scope according to the idea of the present application. In summary, the content of this specification should not be construed as limiting the present application.

Claims (12)

1. A video playback method, comprising:
when a target video is played in a current playing mode, acquiring a video clip to be played of the target video, wherein the content of the video clip to be played comprises video content, and the video content comprises at least one video frame;
determining a sampling frame from video frames of the video content;
extracting motion information of pixels in the video content according to the sampling frame and the adjacent video frame of the sampling frame to obtain an optical flow sequence corresponding to the video content;
performing convolution operation on the sampling frame according to a spatial flow convolution neural network of a preset action recognition network model, and extracting picture element characteristic information of the video content;
performing a convolution operation on the optical flow sequence according to a time flow convolutional neural network of the preset action recognition network model, extracting motion characteristic information of the video content, and obtaining the characteristic information of the video content based on the picture element characteristic information and the motion characteristic information;
determining the action type of an object in the video clip to be played according to the characteristic information;
classifying the video clips to be played according to the action types to obtain target video clip types of the video clips to be played;
determining a control level corresponding to the video clip to be played in the current playing mode based on the type of the target video clip;
and controlling the playing of the video clip to be played according to the control level.
2. The video playing method according to claim 1, wherein the determining the action type of the object in the video segment to be played according to the feature information comprises:
performing full-connection operation on the picture element characteristic information based on a spatial flow convolution neural network of the preset action recognition network model to obtain first type probability information;
performing full-connection operation on the motion characteristic information based on the time flow convolutional neural network of the preset action recognition network model to obtain second type probability information;
and fusing the first type probability information and the second type probability information to obtain the action type of the object in the video clip to be played.
3. The video playing method according to claim 1, wherein the classifying the video segments to be played according to the action types to obtain a target video segment type of the video segments to be played further comprises:
identifying the audio content of the video clip to be played to obtain text information corresponding to the audio content;
identifying the text information according to a sensitive word set which needs to be controlled to be played in a current playing mode to obtain an audio identification result;
and classifying the video clips to be played according to the audio recognition result and the action type to obtain the types of the target video clips.
4. The video playing method of claim 3, wherein the identifying the text information according to the sensitive word set required to control playing in the current playing mode to obtain an audio identifying result comprises:
when the text information has the matched sensitive words in the sensitive word set, acquiring the attributes of the sensitive words;
and acquiring the audio recognition result according to the attribute of the sensitive word.
5. The video playing method according to claim 1, wherein after the classifying the video segments to be played to obtain the target video segment type of the video segments to be played, the method further comprises:
generating a type mark of the target video according to the type of the target video clip corresponding to the video clip to be played;
and marking the target video based on the type mark to obtain the marked target video.
6. The video playing method of claim 5, wherein the classifying the video segments to be played to obtain the target video segment type of the video segments to be played comprises:
and when the target video is the marked target video, determining the type of the target video clip of the video clip to be played according to the type mark.
7. The video playback method of claim 1, wherein the control level comprises a play-permitted level, and wherein controlling playback of the video segment to be played back according to the control level comprises:
and when the control level is a play-allowed level, playing the video clip to be played.
8. The video playback method of claim 1, wherein the control level further includes a no-play level, and wherein controlling playback of the video clip to be played back according to the control level further comprises:
when the control level is a play prohibition level, prohibiting playing the video clip to be played;
acquiring a next video clip of the video clip to be played according to the playing sequence of the video clips in the target video;
updating the next video clip into a video clip to be played;
and repeatedly executing a video playing control process from the step of determining sampling frames from the video frames of the video content to the step of controlling the playing of the video clip to be played according to the control level.
9. The video playing method of claim 1, wherein the controlling the level further comprises limiting a playing level, and the controlling the playing of the video segment to be played according to the controlling level comprises:
when the control level is a play limiting level, acquiring target play information according to a preset play condition corresponding to the target video clip type;
if the target playing information meets the preset playing condition, playing the video clip to be played;
if the target playing information does not meet the preset playing condition, acquiring a next video clip of the video clip to be played according to the playing sequence of the video clip in the target video;
updating the next video clip into a video clip to be played;
and repeatedly executing a video playing control process from the step of determining sampling frames from the video frames of the video content to the step of controlling the playing of the video clip to be played according to the control level.
10. A video playback apparatus, comprising:
the video playing device comprises a video acquiring unit, a video playing unit and a video playing unit, wherein the video acquiring unit is used for acquiring a video clip of a target video to be played when the target video is played in a current playing mode, the content of the video clip to be played comprises video content, and the video content comprises at least one video frame;
the type acquisition unit is used for determining a sampling frame from video frames of the video content;
extracting motion information of pixels in the video content according to the sampling frame and the adjacent video frame of the sampling frame to obtain an optical flow sequence corresponding to the video content;
performing convolution operation on the sampling frame according to a spatial flow convolution neural network of a preset action recognition network model, and extracting picture element characteristic information of the video content;
performing a convolution operation on the optical flow sequence according to a time flow convolutional neural network of the preset action recognition network model, extracting motion characteristic information of the video content, and obtaining the characteristic information of the video content based on the picture element characteristic information and the motion characteristic information;
determining the action type of an object in the video clip to be played according to the characteristic information;
classifying the video clips to be played according to the action types to obtain target video clip types of the video clips to be played;
the control unit is used for determining a control level corresponding to the video clip to be played in the current playing mode based on the type of the target video clip;
and the playing unit is used for controlling the playing of the video clip to be played according to the control level.
11. A server, comprising: a processor and a memory; the memory stores a plurality of instructions, and the processor loads the instructions stored in the memory to execute the steps of the video playing method according to any one of claims 1 to 9.
12. A storage medium having stored thereon a computer program, characterized in that, when the computer program runs on a computer, it causes the computer to execute the video playback method according to any one of claims 1 to 9.
CN202010033943.XA 2020-01-13 2020-01-13 Video playing method, device and storage medium Active CN111209440B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010033943.XA CN111209440B (en) 2020-01-13 2020-01-13 Video playing method, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010033943.XA CN111209440B (en) 2020-01-13 2020-01-13 Video playing method, device and storage medium

Publications (2)

Publication Number Publication Date
CN111209440A CN111209440A (en) 2020-05-29
CN111209440B true CN111209440B (en) 2023-04-14

Family

ID=70786686

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010033943.XA Active CN111209440B (en) 2020-01-13 2020-01-13 Video playing method, device and storage medium

Country Status (1)

Country Link
CN (1) CN111209440B (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111935502A (en) * 2020-06-28 2020-11-13 百度在线网络技术(北京)有限公司 Video processing method, video processing device, electronic equipment and storage medium
CN111783892B (en) * 2020-07-06 2021-10-01 广东工业大学 Robot instruction identification method and device, electronic equipment and storage medium
CN112423077A (en) * 2020-10-15 2021-02-26 深圳Tcl新技术有限公司 Video playing method, device, equipment and storage medium
CN114327033A (en) * 2021-03-16 2022-04-12 海信视像科技股份有限公司 Virtual reality equipment and media asset playing method
CN115695860A (en) * 2021-07-21 2023-02-03 华为技术有限公司 Method for recommending video clip, electronic device and server
CN113645503B (en) * 2021-07-26 2022-05-17 珠海格力电器股份有限公司 Application control method and device, electronic equipment and storage medium
CN113673427B (en) * 2021-08-20 2024-03-22 北京达佳互联信息技术有限公司 Video identification method, device, electronic equipment and storage medium
CN113891136A (en) * 2021-09-18 2022-01-04 深圳Tcl新技术有限公司 Video playing method and device, electronic equipment and storage medium
CN113965774A (en) * 2021-11-26 2022-01-21 小象(广州)商务有限公司 Live broadcasting method and live broadcasting system for video
CN116405629A (en) * 2021-12-27 2023-07-07 中兴通讯股份有限公司 Video content display method, device, electronic equipment and storage medium
CN114640888A (en) * 2022-03-09 2022-06-17 深圳市雷鸟网络传媒有限公司 Video playing method and device, computer equipment and computer readable storage medium
CN114745588A (en) * 2022-04-08 2022-07-12 泰州市华仕达机械制造有限公司 Microcomputer operating platform applying characteristic detection
CN115086772B (en) * 2022-06-10 2023-09-05 咪咕互动娱乐有限公司 Video desensitization method, device, equipment and storage medium
CN116055818A (en) * 2022-12-22 2023-05-02 北京奇艺世纪科技有限公司 Video playing method and device, electronic equipment and storage medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1402852A (en) * 1999-10-22 2003-03-12 动感天空公司 Object oriented video system
CN102480565A (en) * 2010-11-19 2012-05-30 Lg电子株式会社 Mobile terminal and method of managing video using metadata therein
CN107948732A (en) * 2017-12-04 2018-04-20 京东方科技集团股份有限公司 Playback method, video play device and the system of video
CN107995523A (en) * 2017-12-21 2018-05-04 广东欧珀移动通信有限公司 Video broadcasting method, device, terminal and storage medium
CN108184169A (en) * 2017-12-28 2018-06-19 广东欧珀移动通信有限公司 Video broadcasting method, device, storage medium and electronic equipment
CN108600862A (en) * 2018-04-03 2018-09-28 清华大学 The method for improving of mobile radio communication mobile audio-video service user QoE
CN109876444A (en) * 2019-03-21 2019-06-14 腾讯科技(深圳)有限公司 Method for exhibiting data and device, storage medium and electronic device
CN110209879A (en) * 2018-08-15 2019-09-06 腾讯科技(深圳)有限公司 A kind of video broadcasting method, device, equipment and storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040226048A1 (en) * 2003-02-05 2004-11-11 Israel Alpert System and method for assembling and distributing multi-media output
US8875025B2 (en) * 2010-07-15 2014-10-28 Apple Inc. Media-editing application with media clips grouping capabilities

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1402852A (en) * 1999-10-22 2003-03-12 动感天空公司 Object oriented video system
CN102480565A (en) * 2010-11-19 2012-05-30 Lg电子株式会社 Mobile terminal and method of managing video using metadata therein
CN107948732A (en) * 2017-12-04 2018-04-20 京东方科技集团股份有限公司 Playback method, video play device and the system of video
CN107995523A (en) * 2017-12-21 2018-05-04 广东欧珀移动通信有限公司 Video broadcasting method, device, terminal and storage medium
CN108184169A (en) * 2017-12-28 2018-06-19 广东欧珀移动通信有限公司 Video broadcasting method, device, storage medium and electronic equipment
CN108600862A (en) * 2018-04-03 2018-09-28 清华大学 The method for improving of mobile radio communication mobile audio-video service user QoE
CN110209879A (en) * 2018-08-15 2019-09-06 腾讯科技(深圳)有限公司 A kind of video broadcasting method, device, equipment and storage medium
CN109876444A (en) * 2019-03-21 2019-06-14 腾讯科技(深圳)有限公司 Method for exhibiting data and device, storage medium and electronic device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Application of Camtasia in Operation-Based Online Video Tutorial Recording; Niu Jing et al.; Computer Programming Skills & Maintenance; 2011-11-18; pp. 167-169 *

Also Published As

Publication number Publication date
CN111209440A (en) 2020-05-29

Similar Documents

Publication Publication Date Title
CN111209440B (en) Video playing method, device and storage medium
US20220237222A1 (en) Information determining method and apparatus, computer device, and storage medium
CN111258995B (en) Data processing method, device, storage medium and equipment
CN111611436A (en) Label data processing method and device and computer readable storage medium
CN113709384A (en) Video editing method based on deep learning, related equipment and storage medium
CN113766299B (en) Video data playing method, device, equipment and medium
CN112364810A (en) Video classification method and device, computer readable storage medium and electronic equipment
CN112818251B (en) Video recommendation method and device, electronic equipment and storage medium
CN111491187A (en) Video recommendation method, device, equipment and storage medium
CN113515942A (en) Text processing method and device, computer equipment and storage medium
CN115114395A (en) Content retrieval and model training method and device, electronic equipment and storage medium
CN113254711A (en) Interactive image display method and device, computer equipment and storage medium
CN113515669A (en) Data processing method based on artificial intelligence and related equipment
CN113395594A (en) Video processing method, device, equipment and medium
CN114419515A (en) Video processing method, machine learning model training method, related device and equipment
CN114339450A (en) Video comment generation method, system, device and storage medium
CN116665083A (en) Video classification method and device, electronic equipment and storage medium
CN113822127A (en) Video processing method, video processing device, video processing equipment and storage medium
Raturi Machine learning implementation for business development in real time sector
CN116229311B (en) Video processing method, device and storage medium
CN116977701A (en) Video classification model training method, video classification method and device
CN116955707A (en) Content tag determination method, device, equipment, medium and program product
CN116976327A (en) Data processing method, device, computer equipment and readable storage medium
CN117011745A (en) Data processing method, device, computer equipment and readable storage medium
CN114357301A (en) Data processing method, device and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right
TA01 Transfer of patent application right

Effective date of registration: 20221115

Address after: Room 1402, Floor 14, Block A, Haina Baichuan Headquarters Building, No. 6 Baoxing Road, Haibin Community, Xin'an Street, Bao'an District, Shenzhen, Guangdong 518101

Applicant after: Shenzhen Yayue Technology Co.,Ltd.

Address before: Floor 35, Tencent Building, No. 1 High-tech Zone, Nanshan District, Shenzhen, Guangdong 518057

Applicant before: TENCENT TECHNOLOGY (SHENZHEN) Co.,Ltd.

GR01 Patent grant
GR01 Patent grant