CN112287799A - Video processing method and device based on artificial intelligence and electronic equipment

Info

Publication number
CN112287799A
CN112287799A (application CN202011148760.9A)
Authority
CN
China
Prior art keywords: video, features, processing, feature, fusion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011148760.9A
Other languages
Chinese (zh)
Inventor
禹常隆
田植良
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202011148760.9A priority Critical patent/CN112287799A/en
Publication of CN112287799A publication Critical patent/CN112287799A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06V20/49 Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes

Abstract

The application provides an artificial intelligence based video processing method and apparatus, an electronic device, and a computer-readable storage medium, involving artificial intelligence technology and big data technology. The method includes: dividing a video into a plurality of video segments; performing feature extraction on each video segment to obtain content features of the video segment; acquiring operation data for the video segment, and performing feature mapping on the operation data to obtain operation features; fusing the content features and the operation features of the video segment to obtain fusion features; and recursively updating the fusion features of the plurality of video segments, and predicting, according to the updated fusion features, the video segments that include recommendation information. With the method and apparatus, the video segments that include recommendation information in a video can be accurately predicted, and the actual utilization of the computing resources of the electronic device is improved.

Description

Video processing method and device based on artificial intelligence and electronic equipment
Technical Field
The present application relates to artificial intelligence and big data technologies, and in particular, to a video processing method and apparatus based on artificial intelligence, an electronic device, and a computer-readable storage medium.
Background
In video scenarios, many videos include embedded recommendation information. Taking a short video platform as an example, some uploaders add advertisements to the short videos they shoot and upload the videos to the platform in order to increase their own profits, so that users of the platform are compelled to watch commercial advertisements while watching the short videos.
Presenting a video consumes computing resources of the electronic device (e.g., a background server of a short video platform). For videos that include recommendation information, however, the computing resources spent on sending or presenting the recommendation information are wasted; that is, the actual utilization of the computing resources of the electronic device may be low.
For this problem, the related art provides no effective solution.
Disclosure of Invention
The embodiments of the present application provide an artificial intelligence based video processing method and apparatus, an electronic device, and a computer-readable storage medium, which can accurately identify the video segments that include recommendation information in a video and help improve the actual utilization of the computing resources of the electronic device.
The technical solutions of the embodiments of the present application are implemented as follows:
An embodiment of the present application provides an artificial intelligence based video processing method, including:
dividing a video into a plurality of video segments;
performing feature extraction on each video segment to obtain content features of the video segment;
acquiring operation data for the video segment, and performing feature mapping on the operation data to obtain operation features;
fusing the content features and the operation features of the video segment to obtain fusion features;
and recursively updating the fusion features of the plurality of video segments, and predicting, according to the updated fusion features, the video segments that include recommendation information.
An embodiment of the present application provides an artificial intelligence based video processing apparatus, including:
a dividing module, configured to divide a video into a plurality of video segments;
a feature extraction module, configured to perform feature extraction on each video segment to obtain content features of the video segment;
a feature mapping module, configured to acquire operation data for the video segment and perform feature mapping on the operation data to obtain operation features;
a fusion module, configured to fuse the content features and the operation features of the video segment to obtain fusion features;
and a recursive update module, configured to recursively update the fusion features of the plurality of video segments and predict, according to the updated fusion features, the video segments that include recommendation information.
An embodiment of the present application provides an electronic device, including:
a memory for storing executable instructions;
and a processor, configured to implement the artificial intelligence based video processing method provided by the embodiments of the present application when executing the executable instructions stored in the memory.
An embodiment of the present application provides a computer-readable storage medium storing executable instructions that, when executed by a processor, implement the artificial intelligence based video processing method provided by the embodiments of the present application.
The embodiment of the application has the following beneficial effects:
the method comprises the steps of dividing a video into a plurality of video segments, determining content characteristics and operation characteristics of the video segments, and performing fusion processing to obtain fusion characteristics, so that the fusion characteristics comprise information of a plurality of dimensions of the video segments. On the basis, recursive updating processing is carried out on the fusion features of the video segments, the updated fusion features can accurately and effectively represent the video segments, and when the video segments comprising the recommendation information are predicted according to the updated fusion features, the prediction precision can be improved, so that the actual utilization rate of computing resources of the electronic equipment is improved.
Drawings
FIG. 1 is a schematic diagram of an architecture of an artificial intelligence based video processing system provided by an embodiment of the present application;
fig. 2 is a schematic architecture diagram of a terminal device provided in an embodiment of the present application;
FIG. 3A is a schematic flow chart of a method for processing video based on artificial intelligence according to an embodiment of the present application;
FIG. 3B is a schematic flowchart of an artificial intelligence based video processing method according to an embodiment of the present application;
FIG. 3C is a schematic flow chart of a method for processing video based on artificial intelligence according to an embodiment of the present application;
FIG. 3D is a schematic flow chart of a method for processing video based on artificial intelligence according to an embodiment of the present application;
FIG. 4 is a schematic diagram of determining multiple types of objects provided by an embodiment of the present application;
FIG. 5 is a schematic diagram of a recursive update process for a fused feature of a plurality of video segments according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a recursive update process for content and operational characteristics of a video segment according to an embodiment of the present application;
FIG. 7 is a schematic diagram of determining content characteristics provided by an embodiment of the present application;
fig. 8 is an architecture diagram of an advertisement prediction model provided in an embodiment of the present application.
Detailed Description
To make the objectives, technical solutions, and advantages of the present application clearer, the present application is described in further detail below with reference to the accompanying drawings. The described embodiments should not be considered as limiting the present application, and all other embodiments obtained by a person of ordinary skill in the art without creative effort shall fall within the protection scope of the present application.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict.
In the following description, the terms "first/second/third" are used only to distinguish similar objects and do not denote a particular order; it is understood that, where appropriate, "first/second/third" may be interchanged in a specific order or sequence, so that the embodiments of the application described herein can be practiced in an order other than that shown or described here. In the following description, the term "plurality" means at least two.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the application.
Before the embodiments of the present application are described in further detail, the terms and expressions referred to in the embodiments are explained as follows.
1) Artificial Intelligence (AI): theories, methods, techniques, and application systems that use a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain optimal results. Computer Vision (CV) technology is an important branch of artificial intelligence that studies the theories and techniques for building artificial intelligence systems capable of acquiring information from images or multidimensional data. In the embodiments of the present application, computer vision technology can be used to identify the video segments that include recommendation information in a video.
2) Content features: features used to represent the content included in a video segment. The embodiments of the present application do not limit the manner of determining the content features; for example, they may be obtained by performing feature extraction on the video segment through a neural network model.
3) Object: a subject that can perform a specific operation on a video; for example, an object may be a user account or a terminal device. The operations that can be performed on a video may be preset, for example, a play operation and a skip operation.
4) Operation data: the operation data of a video segment, namely the operations performed on the video segment by a plurality of objects.
5) Recursive update processing: a way of processing sequence data in which each element of the sequence is updated in combination with the elements adjacent to it. In the embodiments of the present application, a Recurrent Neural Network (RNN) model may be used to perform the recursive update processing.
6) Recommendation information: information for recommending specific content, such as goods, services, or entertainment and sports programs. Recommendation information embedded in a video usually differs from the video's original content. The form of the recommendation information is not limited here; it may be, for example, an advertisement.
7) Negative correlation: a numerical relationship between two variables A and B in which B is larger when A is smaller, and B is smaller when A is larger; A and B are then said to be negatively correlated.
8) Fully Connected Layer: essentially a matrix, used to perform a matrix multiplication (i.e., a linear transformation) on an input vector so as to extract and integrate the information in the input vector into an output vector. The matrix multiplication corresponds to a weighting process.
9) Database: similar to an electronic filing cabinet, i.e., a place where electronic files are stored; a user can add, query, update, and delete the data in the files. A database can also be understood as a collection of data that is stored together, can be shared by multiple users, has as little redundancy as possible, and is independent of any particular application. In the embodiments of the present application, the database is used to store videos, such as the videos of a video platform.
10) Big data: data sets that cannot be captured, managed, and processed with conventional software tools within a certain time range; massive, fast-growing, and diversified information assets whose stronger decision-making, insight, and process-optimization capabilities are realized only through new processing modes. Technologies suitable for big data include massively parallel processing databases, data mining, distributed file systems, distributed databases, cloud computing platforms, the Internet, and scalable storage systems. In the embodiments of the present application, big data technology can be used to process large batches of videos and to train models.
The embodiments of the present application provide an artificial intelligence based video processing method and apparatus, an electronic device, and a computer-readable storage medium, which can accurately predict the video segments that include recommendation information in a video and help improve the actual utilization of the computing resources of the electronic device. An exemplary application of the electronic device is described below; the electronic device may be implemented as various types of terminal devices or as a server.
Referring to fig. 1, fig. 1 is an architecture diagram of an artificial intelligence based video processing system 100 provided in an embodiment of the present application, a terminal device 400 is connected to a server 200 through a network 300, and the server 200 is connected to a database 500, where the network 300 may be a wide area network or a local area network, or a combination of both.
In some embodiments, taking the electronic device being a terminal device as an example, the artificial intelligence based video processing method provided by the embodiments of the present application may be implemented by the terminal device. For example, the terminal device 400 runs the client 410, which may be a video application such as a video player. The client 410 divides a locally stored video into a plurality of video segments, determines the content features and operation features of each segment, and fuses them into fusion features. The client 410 then recursively updates the fusion features of the segments and predicts, from the updated fusion features, the segments that include recommendation information. The client 410 can delete the predicted segments from the video, which improves the user experience when the video is played, saves local storage resources, and improves the actual utilization of the computing resources the client 410 consumes in playing the video.
It should be noted that the embodiments of the present application do not limit when the client 410 predicts and deletes the segments that include recommendation information; for example, the prediction may be performed when the video is stored, or when a play operation on the video is detected.
In some embodiments, taking the electronic device being a server as an example, the artificial intelligence based video processing method may also be implemented by the server. For example, the server 200 may be a server providing a video service (e.g., a background server of a short video platform); for the videos stored in the database 500 (e.g., videos uploaded to the platform by uploaders), the server 200 predicts the segments that include recommendation information and deletes them. After deletion, the video occupies less storage space, saving the storage resources of the server 200. On receiving a video request from the client 410 (e.g., a foreground client of the short video platform), the server 200 can also provide the video service, i.e., send the video, with the predicted recommendation-information segments deleted, to the client 410 for presentation (playing); video A in fig. 1 is an example. In this way, the actual utilization of the computing resources the server 200 consumes in sending the video, and of those the client 410 consumes in presenting it, can be improved.
It should be noted that the process of predicting the segments that include recommendation information may be integrated into a recommendation information prediction model, which the electronic device (such as the terminal device 400 or the server 200) calls to perform the prediction. To improve prediction accuracy, the model may be trained before use: for example, the server 200 trains the model on a data set and calls the trained model to predict the segments, or sends the trained model to the client 410 for local deployment so that the client 410 calls it. The data set comprises sample videos and, labeled in each sample video, the segments that include recommendation information, obtained for example through manual labeling.
In some embodiments, the terminal device 400 or the server 200 may implement the artificial intelligence based video processing method provided by the embodiments of the present application by running a computer program. The computer program may be a native program or a software module in an operating system; a native Application (APP), i.e., a program that must be installed in the operating system to run, such as a video application (corresponding to the client 410 above), specifically a video player; an applet, i.e., a program that only needs to be downloaded into a browser environment to run; or an applet embedded in any APP, such as an applet component embedded in a video application, which can be run or shut down under user control. In general, the computer program may be any form of application, module, or plug-in.
In some embodiments, the server 200 may be an independent physical server, may also be a server cluster or a distributed system formed by a plurality of physical servers, and may also be a cloud server that provides basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a network service, cloud communication, a middleware service, a domain name service, a security service, a CDN, and a big data and artificial intelligence platform, where the cloud service may be a video processing service that is called by the terminal device 400 to predict a video segment including recommendation information in a video. The terminal device 400 may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart television, a smart watch, and the like. The terminal device and the server may be directly or indirectly connected through wired or wireless communication, and the embodiment of the present application is not limited.
The following description takes the electronic device provided by the embodiments of the present application being a terminal device as an example; it can be understood that, when the electronic device is a server, parts of the structure shown in fig. 2 (such as the user interface, the presentation module, and the input processing module) may be omitted. Referring to fig. 2, fig. 2 is a schematic structural diagram of a terminal device 400 provided by an embodiment of the present application. The terminal device 400 shown in fig. 2 includes: at least one processor 410, a memory 450, at least one network interface 420, and a user interface 430. The components of the terminal device 400 are coupled together by a bus system 440, which is used to enable communication among them. Besides a data bus, the bus system 440 includes a power bus, a control bus, and a status signal bus; for clarity, however, the various buses are all labeled as the bus system 440 in fig. 2.
The processor 410 may be an integrated circuit chip with signal processing capabilities, such as a general-purpose processor, a Digital Signal Processor (DSP), another programmable logic device, a discrete gate or transistor logic device, or discrete hardware components; the general-purpose processor may be a microprocessor, any conventional processor, or the like.
The user interface 430 includes one or more output devices 431, including one or more speakers and/or one or more visual displays, that enable the presentation of media content. The user interface 430 also includes one or more input devices 432, including user interface components that facilitate user input, such as a keyboard, mouse, microphone, touch screen display, camera, other input buttons and controls.
The memory 450 may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid state memory, hard disk drives, optical disk drives, and the like. Memory 450 optionally includes one or more storage devices physically located remote from processor 410.
The memory 450 includes volatile memory or nonvolatile memory, and may include both. The nonvolatile memory may be a Read-Only Memory (ROM), and the volatile memory may be a Random Access Memory (RAM). The memory 450 described in the embodiments of the present application is intended to comprise any suitable type of memory.
In some embodiments, memory 450 is capable of storing data, examples of which include programs, modules, and data structures, or a subset or superset thereof, to support various operations, as exemplified below.
An operating system 451, including system programs for handling various basic system services and performing hardware-related tasks, such as a framework layer, a core library layer, a driver layer, etc., for implementing various basic services and handling hardware-based tasks;
a network communication module 452, configured to reach other computing devices via one or more (wired or wireless) network interfaces 420; exemplary network interfaces 420 include Bluetooth, Wireless Fidelity (WiFi), Universal Serial Bus (USB), and the like;
a presentation module 453 for enabling presentation of information (e.g., user interfaces for operating peripherals and displaying content and information) via one or more output devices 431 (e.g., display screens, speakers, etc.) associated with user interface 430;
an input processing module 454 for detecting one or more user inputs or interactions from one of the one or more input devices 432 and translating the detected inputs or interactions.
In some embodiments, the apparatus provided by the embodiments of the present application may be implemented in software, and fig. 2 shows an artificial intelligence based video processing apparatus 455 stored in a memory 450, which may be software in the form of programs and plug-ins, and the like, and includes the following software modules: a partitioning module 4551, a feature extraction module 4552, a feature mapping module 4553, a fusion module 4554, and a recursive update module 4555, which are logical and thus can be arbitrarily combined or further split depending on the functions implemented. The functions of the respective modules will be explained below.
The artificial intelligence based video processing method provided by the embodiment of the present application will be described in conjunction with exemplary applications and implementations of the electronic device provided by the embodiment of the present application.
Referring to fig. 3A, fig. 3A is a schematic flowchart of an artificial intelligence based video processing method according to an embodiment of the present application, and will be described with reference to the steps shown in fig. 3A.
In step 101, a video is divided into a plurality of video segments.
Here, a video to be processed is divided into a plurality of video segments. The embodiments of the present application do not limit the division method; for example, the video may be divided evenly according to a set division duration or a set number of video segments. For example, if the duration of the video to be processed is 1 minute and the division duration is 5 seconds, the video can be divided evenly into 12 video segments of 5 seconds each; if instead the set number of video segments is 10, the video can be divided evenly into 10 video segments of 6 seconds each.
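As a minimal sketch of this division step, the following Python snippet evenly splits a frame list into fixed-duration segments; the function name, the frame-list representation, and the frame rate are illustrative assumptions, not details fixed by the application.

```python
# Hypothetical sketch of step 101: even division by a set duration.
def split_video(frames, fps, segment_seconds=5):
    """Evenly divide a list of frames into segments of segment_seconds each."""
    frames_per_segment = int(fps * segment_seconds)
    return [frames[i:i + frames_per_segment]
            for i in range(0, len(frames), frames_per_segment)]

# A 1-minute video at an assumed 25 fps with a 5-second division
# duration yields the 12 segments from the example above.
segments = split_video(list(range(60 * 25)), fps=25)
assert len(segments) == 12 and len(segments[0]) == 125
```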
In addition, the source of the video to be processed is not limited in the embodiment of the present application, and for example, the video may be obtained from the outside, or may be a locally stored video.
In step 102, feature extraction processing is performed on the video segment to obtain content features of the video segment.
For each divided video segment, feature extraction is performed on the segment to obtain content features representing its content; the content features may be in vector form. The feature extraction may be implemented with a model built on artificial intelligence principles, such as a neural network model. It is worth mentioning that, after the content features are obtained, a linear transformation may further be applied to them to extract and integrate their useful information.
In some embodiments, the feature extraction on a video segment may be implemented in either of the following ways to obtain its content features: performing feature extraction on each image in the video segment to obtain image features, and fusing the image features of the plurality of images into the content features of the video segment; or performing feature extraction on a plurality of consecutive images in the video segment and taking the extracted image features as the content features of the video segment.
The embodiments of the present application thus provide two ways of determining content features. In the first way, feature extraction is performed separately on each image in the video segment to obtain image features, and the image features of all images in the segment are fused into the segment's content features; the fusion is not limited here and may be, for example, concatenation, summation, or weighted summation. A dedicated model, such as a two-dimensional Convolutional Neural Network (2D CNN) model, may be used for the per-image feature extraction.
In the second way, feature extraction is performed jointly on a plurality of consecutive images (for example, all images) in the video segment, and the extracted image features are taken as the segment's content features. Likewise, a dedicated model, such as a three-dimensional Convolutional Neural Network (3D CNN) model, may be used for the feature extraction over consecutive images. Offering both ways increases the flexibility of determining content features, since a suitable one can be selected for the actual application scenario.
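The two options might be sketched as follows in PyTorch; the specific backbones (resnet18 as the 2D CNN, r3d_18 as the 3D CNN), the mean fusion, and the 256-dimensional output are assumptions for illustration only.

```python
# Hypothetical sketch of the two content-feature options (randomly
# initialized models; one segment of 16 RGB frames at 112x112).
import torch
from torchvision.models import resnet18
from torchvision.models.video import r3d_18

frames = torch.randn(16, 3, 112, 112)

# Option 1: per-image features from a 2D CNN, fused (mean) into one
# content feature for the whole segment.
cnn2d = resnet18(num_classes=256)
content_feature_1 = cnn2d(frames).mean(dim=0)      # shape (256,)

# Option 2: a 3D CNN consumes the consecutive frames jointly.
cnn3d = r3d_18(num_classes=256)
clip = frames.permute(1, 0, 2, 3).unsqueeze(0)     # (1, 3, T, H, W)
content_feature_2 = cnn3d(clip).squeeze(0)         # shape (256,)
```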
In step 103, operation data for the video segment is obtained, and feature mapping processing is performed on the operation data to obtain an operation feature.
For each video segment in the video to be processed, besides determining the content features, the operations performed on the segment by specific objects are acquired as operation data; the operations that can be performed on a video segment, such as a play operation, a pause operation, and a skip operation, may be preset. For a skip operation, the start position and end position of the skip within the video can be acquired, from which the video segments the skip was performed on are determined. For example, a video with a duration of 1 minute is divided evenly into 10 segments; if a skip operation on the video starts at the 5th second and ends at the 10th second, it can be determined that the skip was performed on the 1st and 2nd video segments, where the ordinal numbers of the segments are counted in the order of the video's time axis, i.e., from the beginning of the video to its end. The sketch below illustrates this mapping.
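The following sketch shows the segment-ordinal computation from the example; it assumes positions are given in seconds and all segments have equal duration.

```python
# Hypothetical sketch: map a skip operation's start/end positions to
# the 1-based ordinals of the video segments it covers.
def skipped_segments(start_s, end_s, segment_seconds):
    first = int(start_s // segment_seconds)            # 0-based index
    last = int((end_s - 1e-9) // segment_seconds)      # inclusive end
    return list(range(first + 1, last + 2))            # 1-based ordinals

# A skip from second 5 to second 10 over 6-second segments covers the
# 1st and 2nd segments, as in the example above.
assert skipped_segments(5, 10, 6) == [1, 2]
```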
An object in the embodiments of the present application is an object that can perform the set operations on a video; it may be a terminal device or a user account of a video platform, without limitation here. Operation data for a video segment may be collected per user account or per terminal device; in the latter case it is assumed by default that the same terminal device is used by a fixed user, and different terminal devices can be distinguished by device identifiers such as the International Mobile Equipment Identity (IMEI). The objects whose operation data is to be acquired may be preset according to the actual application scenario; for example, on a video platform, the operation data of all registered user accounts for a video segment may be acquired, or only that of specific registered user accounts.
After the operation data for the video segment is acquired, feature mapping is performed on it to obtain numerical operation features convenient for subsequent processing by the electronic device. In practice, most users skip the recommendation information when watching a video that includes it; that is, a skip operation reflects the presence of recommendation information to a large extent. Therefore, when the operations a certain object performed on the video segment are feature-mapped, whether they are mapped to a first set value or a second set value may be decided by whether they include a skip operation; this, of course, does not limit the feature mapping. After feature mapping, the operation features may be in vector form (composed of multiple values) or numerical form (a single value).
In step 104, the content feature and the operation feature of the video segment are fused to obtain a fusion feature.
Here, the content features representing the content of the video segment and the operation features representing the operations performed on it are fused to obtain fusion features. The embodiments of the present application do not limit the fusion method; it may be, for example, concatenation or weighted summation.
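Both fusion options can be sketched briefly; the feature dimensions, the projection used to align dimensions for the weighted sum, and the weights 0.7/0.3 are illustrative assumptions.

```python
# Hypothetical sketch of step 104 for one video segment.
import torch

content = torch.randn(256)     # content feature
operation = torch.randn(8)     # operation feature

# Option 1: concatenation.
fused_concat = torch.cat([content, operation])       # shape (264,)

# Option 2: weighted summation; the operation feature is first
# projected to the content-feature dimension (an assumed extra step).
proj = torch.nn.Linear(8, 256)
fused_sum = 0.7 * content + 0.3 * proj(operation)    # shape (256,)
```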
In step 105, the fusion characteristics of the plurality of video segments are recursively updated, and the video segments including the recommendation information in the plurality of video segments are predicted according to the updated fusion characteristics.
After the fusion features of each video segment are obtained, the fusion features of all segments are ordered according to the video's time axis, and recursive update processing is applied to the ordered sequence. Recursive update processing here means updating a fusion feature A according to the fusion features adjacent to it (such as the previous fusion feature). Through recursive updating, the fusion features of the segments are updated according to the sequential relations between segments, so that the updated features represent the segments more accurately and effectively. Generally, in a video that includes recommendation information, the recommendation information occupies only a small part while most of the video is normal content, and the recommendation information has little or no relation to that content; recursive updating therefore also widens the gap, in the updated fusion features, between segments that include recommendation information and segments that do not.
After the recursive update, the updated fusion features of each video segment are obtained, and the segments that include recommendation information can then be predicted from them; the prediction manner is described in detail later. The use made of the predicted segments is not limited in the embodiments of the present application: for example, the predicted segments may be deleted from the video directly, or a deletion option may be provided and the segments deleted upon a confirmation operation on that option. This improves the actual utilization of the computing resources consumed in subsequently presenting the video, and improves the user's viewing experience.
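A minimal sketch of the recursive update and the subsequent prediction, assuming the RNN is a GRU and the predictor is a per-segment sigmoid classifier with a 0.5 threshold; all sizes and the classifier design are assumptions, since the application leaves the concrete prediction manner to a later description.

```python
# Hypothetical sketch of step 105: recursive update via an RNN, then
# per-segment prediction of recommendation-information segments.
import torch
import torch.nn as nn

fused = torch.randn(1, 12, 264)   # (batch, segments in time order, dim)

rnn = nn.GRU(input_size=264, hidden_size=128, batch_first=True)
updated, _ = rnn(fused)           # updated fusion features, (1, 12, 128)

classifier = nn.Linear(128, 1)
scores = torch.sigmoid(classifier(updated)).squeeze(-1)   # (1, 12)
is_recommendation = scores > 0.5  # predicted ad segments per position
```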
It should be noted that the above steps may be integrated into a recommendation information prediction model, with the electronic device executing the corresponding steps by calling the model. For example, step 105 alone, or steps 104 and 105 together, may be integrated into the model; this is not limited here. To improve its prediction accuracy, the model may undergo supervised learning, i.e., training, before it is actually called.
For example, the data set used for training may include a plurality of sample videos and, labeled in each sample video, the video segments that include recommendation information; the labels may be obtained through manual labeling, though other labeling approaches also apply. After the model processes a sample video, the difference (i.e., the loss value) between the predicted segments and the labeled segments is determined and propagated backward through the model, and the model's weight parameters are updated during the back propagation. The loss function used to compute the loss value is not limited and may be, for example, a cross-entropy loss function. Training on such a data set improves the model's accuracy in predicting the segments that include recommendation information.
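A training sketch under the supervised setup just described, reusing the GRU and classifier from the previous sketch; the binary form of cross-entropy and the Adam optimizer are assumptions, since the application does not fix a loss function or optimizer.

```python
# Hypothetical training sketch for the recommendation information
# prediction model (one labeled sample video with 12 segments).
import torch
import torch.nn as nn

rnn = nn.GRU(input_size=264, hidden_size=128, batch_first=True)
classifier = nn.Linear(128, 1)
criterion = nn.BCEWithLogitsLoss()       # a cross-entropy variant
optimizer = torch.optim.Adam(
    list(rnn.parameters()) + list(classifier.parameters()), lr=1e-3)

fused = torch.randn(1, 12, 264)          # fusion features of the sample
labels = torch.zeros(1, 12)
labels[0, 3] = 1.0                       # e.g. the 4th segment is labeled

for _ in range(10):                      # a few gradient steps
    logits = classifier(rnn(fused)[0]).squeeze(-1)
    loss = criterion(logits, labels)     # difference (loss value)
    optimizer.zero_grad()
    loss.backward()                      # back propagation
    optimizer.step()                     # update weight parameters
```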
As shown in fig. 3A, the embodiments of the present application divide a video into a plurality of video segments and determine fusion features from multiple dimensions of information about each segment. On this basis, recursively updating the fusion features further improves their accuracy, so predicting the segments that include recommendation information from the updated features improves prediction precision and, in turn, the actual utilization of the computing resources of the electronic device.
In some embodiments, referring to fig. 3B, fig. 3B is a flowchart illustrating an artificial intelligence based video processing method provided in an embodiment of the present application, and step 103 shown in fig. 3A may be implemented by steps 201 to 206, which will be described in conjunction with the steps.
In step 201, objects other than the target recommendation object of the video are taken as first objects.
In the embodiments of the present application, operation data of specific objects for a video segment may be acquired; the specific objects can be of three types, namely first objects, second objects, and third objects, described separately below. To determine the first objects, a target recommendation object of the video is determined first; it may be set manually or obtained in other ways.
Taking the object being a terminal device as an example: when the electronic device is a server, the server may, upon detecting a video request sent by a terminal device, take that terminal device as the target recommendation object of the requested video; when the electronic device is a terminal device, it may take itself as the target recommendation object of the video to be presented.
Taking the object being a user account as an example: when the electronic device is a server, the server may, upon detecting a video request sent by a user account, take that user account as the target recommendation object of the requested video; when the electronic device is a terminal device, it may take the user account logged into the running video application as the target recommendation object of the video to be presented by that application. Of course, the manner of determining the target recommendation object is not limited to these.
After the target recommendation object is determined, the objects other than it are taken as first objects. As an example, fig. 4 is a schematic diagram of determining multiple types of objects; taking objects being user accounts as an example, it shows all registered user accounts (for example, registered in a video application), specifically user accounts A, B, C, D, and E. When user account A is determined to be the target recommendation object, the remaining user accounts B, C, D, and E are all taken as first objects. The operations first objects perform on a video segment can serve as a global baseline and have reference value.
In step 202, the objects similar to the target recommendation object among the plurality of first objects are taken as second objects.
After the first objects are determined in step 201, the objects similar to the target recommendation object are taken as second objects. The embodiments of the present application do not limit how similarity between objects is judged; for example, all first objects are traversed, and a traversed first object whose information (for example, registration information in a video application) matches that of the target recommendation object is taken as a second object, where the information may include at least one of gender, city, and hobbies.
In fig. 4, taking the case where user account B is similar to user account A and user account C is also similar to user account A as an example, both user accounts B and C are taken as second objects.
In some embodiments, taking the objects similar to the target recommendation object among the plurality of first objects as second objects may be implemented as follows: constructing an initial vector whose dimension equals the number of videos in the database, where each value in the initial vector corresponds to one video in the database; for any object, determining the values in the initial vector corresponding to the videos on which that object has performed a play operation, and updating the determined values to a set value, obtaining the object's video play vector; and determining, among the plurality of first objects, the objects whose video play vectors have a similarity to the target recommendation object's greater than a similarity threshold, as second objects.
In the embodiments of the present application, the second objects may be determined starting from play operations on videos. First, an initial vector with the same dimension as the number of videos in the database is constructed, where each value corresponds to one video in the database; for convenience of processing, all values may be initialized to zero. The database, such as the database of a video application (video platform), stores all videos related to the application; of course, the videos may also be stored in other ways, such as in a distributed file system on a server or locally on a terminal device, and storage in a database is described only as an example.
For each object, the values in the initial vector corresponding to the videos on which the object has performed a play operation are determined and updated to a set value (such as 1), yielding the object's video play vector, which directly reflects the object's video preferences. Then all first objects are traversed, and a traversed first object whose video play vector has a similarity to the target recommendation object's greater than the similarity threshold is taken as an object similar to the target recommendation object, i.e., a second object. The similarity may be cosine similarity or another similarity, and the threshold can be set according to the actual application scenario. In this way, second objects whose video preferences are similar to the target recommendation object's can be determined.
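The play-vector construction and the similarity test might look as follows; the five-video database, the set value 1, and the 0.8 cosine threshold are illustrative assumptions.

```python
# Hypothetical sketch of determining second objects from play vectors.
import numpy as np

NUM_VIDEOS = 5                          # videos in the database

def play_vector(played_ids):
    v = np.zeros(NUM_VIDEOS)            # initial vector, one value per video
    v[list(played_ids)] = 1.0           # set value for played videos
    return v

target = play_vector({0, 1, 2})         # target recommendation object
candidate = play_vector({0, 1, 3})      # some first object

cos = target @ candidate / (np.linalg.norm(target) * np.linalg.norm(candidate))
is_second_object = cos > 0.8            # similarity threshold (assumed)
```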
In step 203, the objects having an attention relationship with the target recommendation object among the plurality of first objects are taken as third objects.
Here, the objects among all first objects that have an attention relationship with the target recommendation object may be taken as third objects. The attention relationship may be one-way (for example, the first object follows the target recommendation object, or the target recommendation object follows the first object) or mutual.
In fig. 4, taking the case where user account D has a mutual attention relationship with user account A and user account E also has one as an example, both user accounts D and E are taken as third objects.
It should be noted that the first, second, and third objects in the embodiments of the present application are not mutually exclusive; for example, one object may simultaneously be a first, second, and third object.
In step 204, the operations performed by the first object, the second object and the third object on the video segment are used as the operation data on the video segment.
After the objects of the different layers, i.e., the first, second, and third objects, are determined, the operations performed on the video segment by all first objects, all second objects, and all third objects are together taken as the operation data for the segment, so that the operation data includes feedback information from different layers.
In step 205, mapping the operation performed by the sample object for the video segment into an operand value, and constructing a sub-operation feature according to the operand values of the plurality of sample objects; wherein, the sample object is any one of the first object, the second object and the third object.
In the process of performing the feature mapping processing on the operation data, the operation performed on the video segment by each type of object in the operation data is separately processed, and for convenience of understanding, a process of processing the operation performed on the video segment by a sample object is described, where the sample object is any one of the first object, the second object, and the third object.
Each operation performed on the video segment by a sample object in the operation data is mapped to a specific numerical value, named an operand value for ease of distinction. In the embodiments of the present application, a skip operation reflects the presence of recommendation information to a large extent; therefore, the operations a sample object performed on the segment can be mapped to different operand values according to whether they include a skip operation.
After the operand values of each sample object are obtained, sub-operation features are constructed according to the operand values of all the sample objects, namely, the sub-operation features corresponding to the first objects are constructed according to the operand values of all the first objects, the sub-operation features corresponding to the second objects are constructed according to the operand values of all the second objects, and the sub-operation features corresponding to the third objects are constructed according to the operand values of all the third objects.
In some embodiments, mapping the operations performed by a sample object on the video segment to an operand value may be implemented as follows: when the operations performed by the sample object on the segment include a skip operation, taking the first set value as the sample object's operand value; and when they do not include a skip operation, taking the second set value as the sample object's operand value.
Here, the first set value and the second set value are different and can be set according to the actual application scenario, for example, a first set value of 1 and a second set value of 0. In this way, the mapping distinguishes values according to whether the performed operations include a skip operation; that is, the obtained operand value directly reflects whether the corresponding operations include a skip.
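The mapping rule reduces to a one-line check; 1 and 0 stand in for the first and second set values from the example.

```python
# Hypothetical sketch of the operand-value mapping.
def operand_value(operations, first_set=1, second_set=0):
    """Map the operations one object performed on a segment to a value."""
    return first_set if "skip" in operations else second_set

assert operand_value({"play", "skip"}) == 1
assert operand_value({"play", "pause"}) == 0
```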
In some embodiments, constructing the sub-operation features from the operand values of the plurality of sample objects may be implemented in either of the following ways: accumulating the operand values of the plurality of sample objects and taking the accumulated value as the sub-operation feature; or constructing a vector with the same dimension as the number of sample objects from their operand values and taking it as the sub-operation feature.
The embodiments of the present application thus provide two ways of constructing sub-operation features; for ease of understanding, the sample objects are taken to be first objects. In the first way, the operand values of all first objects are accumulated, and the accumulated value is taken as the sub-operation feature corresponding to the first objects; the resulting feature is in numerical form, and the accumulation may be summation or weighted summation. Sub-operation features obtained this way carry little information and incur little subsequent computation, making them suitable for scenarios where the computing capability of the electronic device is weak or the efficiency requirement on video processing is high (such as real-time video processing).
In the second way, a vector with the same dimension as the number of first objects is constructed from the operand values of all first objects and taken as the sub-operation feature corresponding to the first objects. That is, the feature obtained in the second way is in vector form, each of its values being the operand value of one first object. Compared with the first way, it carries more information and incurs more subsequent computation, making it suitable for scenarios where the computing capability of the electronic device is strong or the timeliness requirement on video processing is low (such as offline video processing).
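Both construction options can be shown in a few lines; plain summation stands in for the accumulation here, and the weighted variant is sketched after the next passage.

```python
# Hypothetical sketch: sub-operation feature from three first objects.
values = [1, 0, 1]                 # operand values of the objects

# Option 1: accumulate into a single number (numerical form).
sub_feature_scalar = sum(values)   # -> 2

# Option 2: a vector whose dimension equals the object count.
sub_feature_vector = list(values)  # -> [1, 0, 1]
```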
In some embodiments, accumulating the operand values of the plurality of sample objects into a sub-operation feature may be implemented as follows: dividing the number of videos on which the sample object has performed a skip operation by a video base to obtain the sample object's skipped-video ratio; determining a weight negatively correlated with the skipped-video ratio; and weighting the operand values of the plurality of sample objects according to their weights to obtain the sub-operation feature, where the video base is either the number of videos in the database or the number of videos on which the sample object has performed a play operation.
Here, the accumulation may be a weighted summation; for ease of understanding, the process is described with the sample objects being first objects. For each first object, the number of videos on which it has performed a skip operation is divided by the video base to obtain its skipped-video ratio, which directly reflects the skipping habit of the corresponding user: the smaller the ratio, the more likely it is that a video the first object did skip includes recommendation information (i.e., the user does not skip casually). The video base may be the number of videos in the database or the number of videos on which the first object has performed a play operation; since the latter can differ greatly between objects, using it as the base further improves the accuracy of the skipped-video ratio.
After the skipped-video ratio of a first object is determined, a weight negatively correlated with the ratio is determined; the specific negative correlation can be set according to the actual application scenario, for example, the weight can be the reciprocal of the skipped-video ratio. The operand values of all the first objects are then weighted and summed according to each first object's weight to obtain the sub-operation feature corresponding to the first objects. For example, if the first objects include A, B, and C, with operand values 1, 0, and 1 and weights 0.2, 0.5, and 0.3 respectively, the sub-operation feature corresponding to the first objects is 0.2 × 1 + 0.5 × 0 + 0.3 × 1 = 0.5. In this way, the weight of each sample object's operand value is adaptively adjusted according to that object's skipped-video ratio, so the weighted operand values better match the actual situation and the accuracy of the resulting sub-operation feature is improved.
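As a minimal sketch of this weighted accumulation, using the illustrative operand values and weights from the example above (the object names are assumptions for illustration):

```python
# Minimal sketch: numerical sub-operation feature via weighted summation.
operand_values = {"A": 1, "B": 0, "C": 1}   # 1 = skipped the segment, 0 = did not
weights = {"A": 0.2, "B": 0.5, "C": 0.3}    # negatively correlated with skip ratio

sub_operation_feature = sum(weights[o] * operand_values[o] for o in operand_values)
print(sub_operation_feature)  # 0.2*1 + 0.5*0 + 0.3*1 = 0.5
```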
In some embodiments, determining the weight negatively correlated with the skipped-video ratio may be implemented as follows: sort the plurality of sample objects according to a first numerical order of their skipped-video ratios; uniformly sample a set weight range according to the number of sample objects to obtain a plurality of weights to be assigned; and assign the weights in sequence to the sorted sample objects according to a second numerical order, where the first numerical order is opposite to the second numerical order.
The embodiment of the present application provides an example of such a negative correlation; for ease of understanding, the case where the sample object is the first object is used. First, all the first objects are sorted according to a first numerical order of their skipped-video ratios, where the first numerical order may be descending or ascending. Meanwhile, the set weight range is uniformly sampled (i.e., sampled at equal intervals) according to the number of first objects to obtain the weights to be assigned. For example, if the set weight range is [0, 1] and there are 3 first objects, uniform sampling yields the weights 0, 0.5, and 1.
Then, the weights to be assigned are allocated in sequence to the sorted first objects according to a second numerical order opposite to the first numerical order, so that each first object receives one weight. For example, if the first numerical order is descending and the second is ascending, the smallest weight is assigned to the first object ranked first, i.e., the one with the largest skipped-video ratio, and so on. This is only one example of determining the weights of sample objects and does not limit the embodiments of the present application.
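A small sketch of this rank-based weight assignment, assuming NumPy and a weight range of [0, 1] (the function and variable names are our own):

```python
import numpy as np

def assign_weights(skip_ratios, weight_range=(0.0, 1.0)):
    """Assign each sample object a weight negatively correlated with its
    skipped-video ratio: ratios sorted from largest to smallest receive
    weights sampled uniformly from the range, from smallest to largest."""
    ratios = np.asarray(skip_ratios, dtype=float)
    # Uniformly sample as many weights as there are sample objects.
    candidates = np.linspace(weight_range[0], weight_range[1], len(ratios))
    weights = np.empty_like(ratios)
    order = np.argsort(-ratios)      # indices from largest to smallest ratio
    weights[order] = candidates      # largest ratio gets the smallest weight
    return weights

print(assign_weights([0.9, 0.1, 0.5]))  # -> [0.  1.  0.5]
```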
In some embodiments, constructing a vector whose dimension equals the number of sample objects from the operand values of the plurality of sample objects may be implemented as follows: sort the operand values of the plurality of sample objects according to an object parameter of each sample object, and construct the vector from the sorted operand values. The object parameter includes any one of the registration time, the time of the most recent play operation, and the number of videos on which a play operation has been performed.
Besides accumulating operand values into a numerical sub-operation feature, a sub-operation feature in vector form can be constructed; for ease of understanding, the case where the sample object is the first object is used as an example. Since a vector is ordered, the order of the values in the vector may affect subsequent processing; the embodiment of the present application therefore sorts the operand values of all the first objects according to an object parameter of each first object, i.e., normalizes the order of the operand values. The object parameter may include, without limitation, the registration time (e.g., the registration time in a video application), the time of the most recent play operation, or the number of videos on which a play operation has been performed. The sorting direction is likewise not limited; for example, when the object parameter is the registration time, the operand values of all the first objects may be sorted from the earliest to the latest registration time.
Then, a vector whose dimension equals the number of first objects is constructed from the sorted operand values and used as the sub-operation feature corresponding to the first objects. For example, if the sorted operand values are 0, 0, 1, and 1, the constructed vector is [0, 0, 1, 1]. Note that, when constructing a sub-operation feature in vector form, the skipped-video ratio of each first object may also be determined and its operand value weighted by a weight negatively correlated with that ratio (i.e., the operand value of each first object is multiplied by its weight), with the vector then constructed from the weighted operand values.
In this way, the vectors corresponding to the first objects, the second objects, and the third objects follow a consistent ordering principle for their operand values, which makes it easier to subsequently learn effective rules from the vectors.
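A minimal sketch of the vector-form construction follows; the sample objects, dates, and operand values are illustrative assumptions:

```python
# Sketch: vector-form sub-operation feature, ordered by a fixed object
# parameter (here, registration time) so that the vectors built for the
# first, second, and third objects follow the same ordering principle.
objects = [
    {"id": "A", "registered": "2019-03-01", "operand": 0},
    {"id": "B", "registered": "2018-07-15", "operand": 1},
    {"id": "C", "registered": "2020-01-20", "operand": 1},
]

# Sort by registration time (earliest first), then take the operand values.
ordered = sorted(objects, key=lambda o: o["registered"])
sub_operation_feature = [o["operand"] for o in ordered]
print(sub_operation_feature)  # [1, 0, 1] -> order B, A, C
```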
In step 206, the sub-operation features corresponding to the first object, the sub-operation features corresponding to the second object, and the sub-operation features corresponding to the third object are merged into the operation features of the video segment.
After the sub-operation features corresponding to the first object, the second object, and the third object are obtained, these sub-operation features of different layers are fused together, for example by concatenation, to obtain the operation features of the video segment. Other ways, such as weighted summation, may of course also be used; fig. 3B shows concatenation only as an example.
As shown in fig. 3B, in the embodiment of the present application, by determining objects at different layers and fusing the sub-operation features of those layers into the operation features of the video segment, the comprehensiveness and hierarchy of the information contained in the operation features can be improved, making it easier to subsequently learn useful rules from the operation features.
In some embodiments, referring to fig. 3C, fig. 3C is a schematic flowchart of an artificial intelligence based video processing method provided in an embodiment of the present application, and step 104 shown in fig. 3A can be implemented by steps 301 to 302, which will be described in conjunction with the steps.
In step 301, the content feature and the operation feature of the video segment are recursively updated to obtain an updated content feature and an updated operation feature.
The embodiment of the present application provides an example of fusing the content features and the operation features of a video segment: the content feature and the sub-operation features are first sorted into a feature sequence, and the sequence is then recursively updated. The sorting order is not limited and may be set according to the actual application scenario; for example, when the operation feature is obtained by concatenating the sub-operation features corresponding to the first, second, and third objects, the sorting may follow the order content feature, sub-operation feature of the first object, sub-operation feature of the second object, sub-operation feature of the third object.
In step 302, the updated content feature and the updated operation feature are spliced to obtain a fusion feature.
Here, the updated content feature and the updated operation feature of the video segment are concatenated to obtain the fusion feature. Other ways of obtaining the fusion feature, such as weighted summation, may of course also be used; fig. 3C shows concatenation only as an example.
In fig. 3C, step 105 shown in fig. 3A can be implemented by steps 303 to 305, and will be described with reference to the respective steps.
In step 303, the fusion features of the plurality of video segments are recursively updated to obtain updated fusion features.
In step 304, the updated fusion features of the video segment are weighted to obtain the probability that the video segment includes recommendation information.
Here, the updated fusion feature of the video segment may be linearly transformed, i.e., weighted, to finally obtain a single value, which is the probability that the video segment includes recommendation information. Depending on the actual application scenario, the value could instead represent the probability that the video segment does not include recommendation information; only the former case is used as an example here.
In step 305, the video segments whose probability of including recommendation information is greater than the probability threshold are taken as the predicted video segments that include recommendation information.
Here, the probability threshold may be set in advance, for example to 0.5. After the probability of including recommendation information is obtained for each video segment, the video segments whose probability exceeds the threshold are taken as the predicted video segments that include recommendation information. For a given video, the number of predicted video segments including recommendation information may be 0, 1, or more.
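A small sketch of steps 304 and 305, assuming NumPy; the sigmoid used to map the weighted value to a probability, and all names, are our assumptions for illustration:

```python
import numpy as np

def predict_ad_segments(updated_fusion_features, w, b, threshold=0.5):
    """For each updated fusion feature, apply a linear weighting to obtain a
    scalar, map it to a probability (a sigmoid is assumed here), and keep
    the segments whose probability exceeds the threshold."""
    predicted = []
    for idx, feature in enumerate(updated_fusion_features):
        score = float(np.dot(w, feature) + b)   # step 304: weighting
        prob = 1.0 / (1.0 + np.exp(-score))     # assumed probability mapping
        if prob > threshold:                    # step 305: thresholding
            predicted.append(idx)
    return predicted

# Usage: indices of 5 random 8-dimensional features predicted to include ads.
feats = [np.random.randn(8) for _ in range(5)]
print(predict_ad_segments(feats, w=np.random.randn(8), b=0.0))
```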
In fig. 3C, after step 305, the predicted video segment including the recommendation information may also be deleted in step 306.
If video segments including recommendation information are predicted, they may be deleted from the video. To reduce adverse effects on the video's original content (i.e., the normal content other than the recommendation information) after deletion, such as visible discontinuities in the picture, the division duration in step 101 may be kept as short as practical, for example dividing the video into 5-second segments.
In an actual application scenario, taking the electronic device being a server as an example, the server may process videos uploaded by video uploaders, or videos requested via video requests sent by terminal devices, i.e., predict and delete the video segments that include recommendation information. Taking the electronic device being a terminal device as an example, the terminal device may process locally stored videos or videos to be played in the same way. Of course, video processing scenarios are not limited to these.
As shown in fig. 3C, in the embodiment of the present application, by deleting a video segment including recommendation information predicted in a video, a storage space occupied by the video can be reduced, and a storage resource of an electronic device can be saved. When the subsequent operations such as sending or playing are performed on the video, the actual utilization rate of the computing resources consumed by the electronic equipment can be improved.
In some embodiments, referring to fig. 3D, fig. 3D is a flowchart illustrating an artificial intelligence based video processing method provided in an embodiment of the present application, and step 105 shown in fig. 3A may be implemented by steps 401 to 404, which will be described in conjunction with the steps.
In step 401, the fusion features of the plurality of video clips are ordered according to the timeline order of the video.
Here, a mode of the recursive update processing will be explained as an example. After the fusion features of all the video clips in the video are obtained, the fusion features of all the video clips can be sequenced according to the time axis sequence of the video, namely the sequence from the beginning of the video to the end of the video.
In step 402, according to the fusion feature of any video segment and the intermediate feature of the previous video segment, the intermediate feature of any video segment is obtained through fusion.
As an example, the embodiment of the present application provides the schematic diagram of recursively updating the fusion features of a plurality of video segments shown in fig. 5; in fig. 5, the sorted fusion features are, in order, those of video segment 1, video segment 2, and video segment 3. For ease of understanding, video segment 2 is used to illustrate one way of fusing to obtain the intermediate feature, which in an RNN structure is also called the hidden layer state. First, the fusion feature of video segment 2 and the intermediate feature of the previous video segment (i.e., video segment 1) are weighted and summed, a bias term is added to the result, and the result is passed through an activation function, such as the hyperbolic tangent, to obtain the intermediate feature of video segment 2. Across different video segments, the weight parameters applied to the fusion features, the weight parameters applied to the intermediate feature of the previous segment, and the bias terms are all the same; that is, the video segments share these weight parameters and bias terms.
It should be noted that, for the first video segment after the sorting (e.g. video segment 1 in fig. 5), since there is no previous video segment, the intermediate feature of the previous video segment can be set to zero during the fusion process.
In step 403, the intermediate features are weighted to obtain updated fusion features of any video segment.
Here, taking video segment 2 in fig. 5 as an example, the intermediate feature of video segment 2 is weighted (a bias term may be added after the weighting; this bias term is distinct from the one in step 402) to obtain the updated fusion feature of video segment 2. Because the fusion features are updated according to the sequential correlation between video segments, the updated fusion features represent their corresponding video segments more accurately and effectively.
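Steps 402 and 403 together can be sketched in plain NumPy as follows; the parameter names and shapes are assumptions, and in the embodiment this recursion is realized inside the RNN model mentioned below:

```python
import numpy as np

def recursive_update(fusion_features, W_x, W_h, b, W_o, b_o):
    """Recursively update the time-ordered fusion features of a video's
    segments. W_x, W_h, b (the recursion) and W_o, b_o (the output
    weighting) are shared by all video segments, as described above."""
    hidden = np.zeros(W_h.shape[0])   # intermediate feature of the
                                      # (non-existent) previous segment
    updated = []
    for x in fusion_features:         # in timeline order
        # Step 402: weighted sum of this segment's fusion feature and the
        # previous intermediate feature, plus a bias term, then tanh.
        hidden = np.tanh(W_x @ x + W_h @ hidden + b)
        # Step 403: weight the intermediate feature (with its own bias
        # term) to obtain this segment's updated fusion feature.
        updated.append(W_o @ hidden + b_o)
    return updated

# Usage with assumed sizes: 3 segments, 8-dim fusion features, 4-dim hidden.
rng = np.random.default_rng(0)
feats = [rng.standard_normal(8) for _ in range(3)]
out = recursive_update(feats, rng.standard_normal((4, 8)),
                       rng.standard_normal((4, 4)), rng.standard_normal(4),
                       rng.standard_normal((8, 4)), rng.standard_normal(8))
print(len(out), out[0].shape)  # 3 (8,)
```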
It should be noted that the recursive update processing shown in steps 402 to 403 also applies to step 301. As an example, the embodiment of the present application provides the schematic diagram of recursively updating the content features and operation features of a video segment shown in fig. 6. Fig. 6 shows the four sorted features of a certain video segment: the content feature and the sub-operation features corresponding to the first, second, and third objects, named feature 1 through feature 4 for convenience of description. Taking feature 2 as an example, during the recursive update the intermediate feature corresponding to feature 2 is obtained by fusing feature 2 with the intermediate feature corresponding to the previous feature (i.e., feature 1); the intermediate feature corresponding to feature 2 is then weighted to obtain the updated feature 2, and so on.
In step 404, a video segment including recommendation information in a plurality of video segments is predicted according to the updated fusion features.
That is, the video segments including recommendation information are predicted according to the updated fusion feature of each video segment in the video.
It should be noted that the above steps (e.g., steps 402 to 404) can be integrated into an RNN model: the sorted fusion features serve as the input of the RNN model, and the RNN structure implements the recursive update processing. When training the RNN model, the weight parameters and bias terms used in the weighting of steps 402 and 403 can be updated to improve the effect of the recursive update; adding the bias terms helps the RNN model converge quickly, improving the training effect.
As shown in fig. 3D, in the embodiment of the present application, based on the RNN principle, the fusion features of all video segments in a video are recursively updated, so that the updated fusion features can more accurately and effectively represent the corresponding video segments, which is helpful for improving the prediction accuracy of the video segments including the recommendation information.
In the following, an exemplary application of the embodiment of the present application in an actual application scenario is described; for ease of understanding, the recommendation information is an advertisement and the objects are user accounts. In a video scenario, a video uploader on a video platform may add advertisements to a video to increase their own income, so users are forced to watch the embedded advertisements while watching the video. Videos with embedded advertisements degrade the viewing experience, and because the uploader promotes advertisements through an unauthorized channel (i.e., by adding them directly to the video), the video platform cannot obtain its due revenue.
The embodiment of the present application provides an artificial intelligence based video processing method that can be applied to video applications, for example in an application's background server, its foreground client, or various video players. Specifically, the video processing may be implemented by a video content presentation module, a user behavior collection and analysis module, and an advertisement prediction module, described below.
1) A video content presentation module.
For a video to be processed, the video is divided evenly into a plurality of video segments; the division standard may be set in advance, for example dividing the video into 5-second segments. Each resulting video segment is then modeled to obtain content features that represent the segment's content. As an example, the embodiment of the present application provides the schematic diagram of determining content features shown in fig. 7: feature extraction is performed on a plurality of consecutive images in the video segment through a 3D CNN model, and the extracted image features (expressed in vector form) are processed through two fully connected layers to obtain the content features of the video segment. A fully connected layer applies a linear transformation to the vector, i.e., a weighting; in deep learning models, more linear transformations (i.e., a deeper model) generally yield better results at the cost of more computation. Of course, the above process does not limit the embodiment of the present application; for example, the image features output by the 3D CNN model may be used directly as the content features.
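A minimal sketch of this content-feature path, assuming PyTorch; the kernel sizes, channel counts, and output dimension are illustrative assumptions rather than the embodiment's actual configuration:

```python
import torch
import torch.nn as nn

class ContentFeatureExtractor(nn.Module):
    """Sketch of the content-feature path: a small 3D CNN over a stack of
    consecutive frames, followed by two fully connected layers."""

    def __init__(self, feature_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1),  # (B, 3, T, H, W) -> (B, 16, T, H, W)
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((1, 4, 4)),             # collapse time, shrink space
        )
        self.fc = nn.Sequential(                          # two fully connected layers
            nn.Linear(16 * 4 * 4, 256),
            nn.ReLU(),
            nn.Linear(256, feature_dim),
        )

    def forward(self, clip):                              # clip: (B, 3, T, H, W)
        x = self.conv(clip).flatten(1)
        return self.fc(x)                                 # content feature: (B, feature_dim)

clip = torch.randn(1, 3, 16, 64, 64)  # one 16-frame RGB clip (assumed size)
print(ContentFeatureExtractor()(clip).shape)  # torch.Size([1, 128])
```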
2) The user behavior collection and analysis module.
In this module, the following information is collected:
First, for all user accounts other than the current user account of the video to be processed, determine whether each account has performed a skip operation on the video and, if so, the start and end positions of the skip. From these positions, it can be determined which video segments of the video have been skipped. The current user account of the video corresponds to the target recommendation object above, and the other user accounts correspond to the first objects. The current user account may be preset, or the account currently logged in to the video application may be used; this is not limited here.
Second, for each user account other than the current user account of the video to be processed, determine the ratio of the number of videos on which the account has performed a skip operation to the number of videos on which it has performed a play operation; for convenience of description, this ratio is referred to below as the skipped-video ratio.
Third, determine the user accounts (corresponding to the third objects) that have a mutual attention relationship (i.e., follow each other) with the current user account of the video to be processed.
From the collected information, three different user behavior features (corresponding to the sub-operation features) can be obtained for each video segment:
The global user skip behavior feature (i.e., the sub-operation feature corresponding to the first object above).
Here, for each video segment of the video to be processed, the user accounts that have skipped the segment are counted among all accounts other than the current user account; each such account is given an operand value of 1 (corresponding to the first set value above). The operand values of all accounts other than the current user account are then accumulated, and the resulting value is used as the segment's global user skip behavior feature. The accumulation may be, for example, a summation.
When a user account's skipped-video ratio is small (i.e., the user holding the account rarely skips), a video segment that this account does skip is more likely to be a segment containing an advertisement. Therefore, during accumulation, the operand values of different user accounts can be weighted according to their skipped-video ratios. For example, all user accounts other than the current user account may be sorted by skipped-video ratio, with the account having the largest ratio assigned a weight of 0, the account having the smallest ratio assigned a weight of 1, and the remaining accounts assigned weights uniformly spaced according to their rank, so that each account receives a weight.
During accumulation, each user account's weight is multiplied by its operand value to realize the weighting, and the weighted operand values of all accounts other than the current user account are finally summed to obtain the global user skip behavior feature. For example, if a user account's weight is 0.5 and its operand value is 1, its weighted operand value is 0.5 × 1 = 0.5.
The good-friend skip behavior feature (i.e., the sub-operation feature corresponding to the second object).
Here, a user account whose video preferences are similar to those of the current user account is regarded as a good friend, corresponding to the second object above. For example, if the current user account is account A and another user account is account B, the similarity can be calculated with the following formula:

account similarity = cos(video playback vector of account A, video playback vector of account B)

Here, cos is the cosine function, used to compute the cosine similarity between the video playback vectors of accounts A and B, which serves as the account similarity between the two accounts. A video playback vector is a one-hot style vector whose dimension is the number of videos in the database, each dimension corresponding to one video: if a user account has performed a play operation on a video, the corresponding value in the account's video playback vector is set to 1; otherwise it is set to 0.
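A small sketch of the account-similarity computation; the helper names and sample data are assumptions:

```python
import numpy as np

def play_vector(played_video_ids, num_videos_in_db):
    """One-hot style play vector: 1 where the account has played the video."""
    v = np.zeros(num_videos_in_db)
    v[list(played_video_ids)] = 1.0
    return v

def account_similarity(vec_a, vec_b):
    """Cosine similarity between the play vectors of two accounts."""
    denom = np.linalg.norm(vec_a) * np.linalg.norm(vec_b)
    return float(np.dot(vec_a, vec_b) / denom) if denom else 0.0

a = play_vector({0, 2, 3}, num_videos_in_db=5)
b = play_vector({2, 3, 4}, num_videos_in_db=5)
print(account_similarity(a, b))  # 2 / (sqrt(3) * sqrt(3)) ≈ 0.667
```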
After the account similarity between the current user account and each other user account is determined, the accounts whose similarity exceeds the similarity threshold are taken as the good friends of the current user account. Then, over all good friends of the current user account, the good-friend skip behavior feature of each video segment in the video to be processed is calculated in the same way as the global user skip behavior feature.
The followed-friend skip behavior feature (i.e., the sub-operation feature corresponding to the third object).
Here, a user account that has a mutual attention relationship with the current user account is referred to as a followed friend, corresponding to the third object above. Over all followed friends of the current user account, the followed-friend skip behavior feature of each video segment in the video to be processed is calculated in the same way as the global user skip behavior feature.
Thus, in the user behavior collection and analysis module, three features are obtained for each video segment of the video to be processed: the global user skip behavior feature, the good-friend skip behavior feature, and the followed-friend skip behavior feature.
3) The advertisement prediction module.
Here, for each video segment of the video to be processed, its content feature, global user skip behavior feature, good-friend skip behavior feature, and followed-friend skip behavior feature are concatenated into one sequence feature, and the sequence features of the video segments serve as the input of the advertisement prediction model (corresponding to the recommendation information prediction model above). The advertisement prediction model performs sequence labeling on the input and outputs, for each video segment, the probability that the segment includes an advertisement; the segments including advertisements can then be predicted from these probabilities, for example by taking the segments whose probability exceeds the probability threshold.
In the training phase, the advertisement prediction model may be trained on a supervised data set containing a plurality of sample videos, each with its advertisement-containing video segments annotated. After the sequence features of a sample video are processed by the model, the difference (i.e., the loss value) between the predicted advertisement segments and the annotated ones is determined; this difference is back-propagated through the model, and the model's weight parameters are updated during back propagation. The loss function is not limited and may be, for example, a cross-entropy loss.
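A sketch of one such training step, assuming PyTorch and a stand-in linear model (the actual model is the hierarchical RNN of fig. 8; binary cross-entropy is used here as the example loss, and all shapes are assumptions):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(132, 1), nn.Sigmoid())  # placeholder model
criterion = nn.BCELoss()                 # binary cross-entropy over segments
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_step(sequence_features, labels):
    """sequence_features: (num_segments, 132); labels: (num_segments,) floats,
    1.0 marking segments annotated as containing an advertisement."""
    probs = model(sequence_features).squeeze(-1)  # per-segment probabilities
    loss = criterion(probs, labels)               # loss value vs. annotation
    optimizer.zero_grad()
    loss.backward()                               # back propagation
    optimizer.step()                              # update weight parameters
    return loss.item()

loss = train_step(torch.randn(10, 132), torch.randint(0, 2, (10,)).float())
```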
As an example, the embodiment of the present application provides the architecture diagram of an advertisement prediction model shown in fig. 8, which adopts a hierarchical RNN architecture. In fig. 8, the features shown are the four features of a video segment described above: the content feature is in vector form, while the global user skip behavior feature, the good-friend skip behavior feature, and the followed-friend skip behavior feature are numerical and can each be regarded as a vector of dimension 1. After the four features are linearly transformed by fully connected layers, the first-layer RNN fuses the transformed features to generate one vector, which for ease of understanding is referred to as the fusion feature.
The fusion features of all the video segments are then treated as a new sequence feature and processed by the second-layer RNN to obtain the updated fusion features. The updated fusion feature of each video segment is processed by a multilayer perceptron (MLP) and a fully connected layer (i.e., two further linear transformations) to generate a score, which is the probability that the segment includes an advertisement (or, depending on the actual labeling, the probability that it does not).
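The hierarchical architecture can be sketched as follows, assuming PyTorch; the dimensions, the use of vanilla nn.RNN cells, and the layer names are illustrative assumptions:

```python
import torch
import torch.nn as nn

class HierarchicalAdPredictor(nn.Module):
    """Sketch of the two-level RNN of fig. 8."""

    def __init__(self, content_dim=128, hidden_dim=64):
        super().__init__()
        # Fully connected layers that linearly transform the four features.
        self.proj_content = nn.Linear(content_dim, hidden_dim)
        self.proj_scalar = nn.Linear(1, hidden_dim)  # for the three skip features
        self.rnn_features = nn.RNN(hidden_dim, hidden_dim, batch_first=True)
        self.rnn_segments = nn.RNN(hidden_dim, hidden_dim, batch_first=True)
        self.scorer = nn.Sequential(                 # MLP + fully connected layer
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1), nn.Sigmoid(),
        )

    def forward(self, content, skip_feats):
        # content: (num_segments, content_dim); skip_feats: (num_segments, 3)
        feats = torch.stack(
            [self.proj_content(content)]
            + [self.proj_scalar(skip_feats[:, i:i + 1]) for i in range(3)],
            dim=1,
        )  # (num_segments, 4, hidden_dim): four features per segment
        # First-layer RNN fuses each segment's four features into one vector.
        _, fused = self.rnn_features(feats)      # (1, num_segments, hidden_dim)
        # Second-layer RNN updates the fusion features along the timeline
        # (the hidden states are reinterpreted as one sequence of segments).
        updated, _ = self.rnn_segments(fused)    # (1, num_segments, hidden_dim)
        # Per-segment probability that the segment includes an advertisement.
        return self.scorer(updated.squeeze(0)).squeeze(-1)

model = HierarchicalAdPredictor()
probs = model(torch.randn(6, 128), torch.randn(6, 3))
print(probs.shape)  # torch.Size([6])
```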
The embodiment of the present application can achieve at least the following technical effects: 1) both the content features and the information fed back by users are used, so the advertisement prediction model takes the video's original content (i.e., the normal, non-advertisement content) into account, effectively improving the accuracy of identifying video segments that include advertisements; 2) advertisements in videos can be deleted automatically, improving the user's viewing experience and avoiding advertisement interference; 3) video uploaders are encouraged to publish advertisements through the channels specified by the video platform, which can increase the platform's legitimate income.
Continuing with the exemplary structure of the artificial intelligence based video processing device 455 provided by the embodiments of the present application as implemented as software modules, in some embodiments, as shown in fig. 2, the software modules stored in the artificial intelligence based video processing device 455 of the memory 450 may include: a dividing module 4551 configured to divide the video into a plurality of video segments; the feature extraction module 4552 is configured to perform feature extraction processing on the video segment to obtain content features of the video segment; the feature mapping module 4553 is configured to acquire operation data for a video clip, and perform feature mapping processing on the operation data to obtain an operation feature; the fusion module 4554 is configured to perform fusion processing on the content features and the operation features of the video clip to obtain fusion features; and a recursive update module 4555, configured to perform recursive update processing on the fusion features of the multiple video segments, and predict video segments including the recommendation information in the multiple video segments according to the obtained updated fusion features.
In some embodiments, the feature mapping module 4553 is further configured to: taking an object except the target recommendation object of the video as a first object; taking an object similar to the target recommendation object in the plurality of first objects as a second object; taking an object which has an attention relationship with the target recommendation object in the plurality of first objects as a third object; and taking the operation performed by the first object, the second object and the third object on the video segment as operation data on the video segment.
In some embodiments, the feature mapping module 4553 is further configured to: mapping the operation executed by the sample object aiming at the video clip into an operand value, and constructing a sub-operation characteristic according to the operand values of a plurality of sample objects; wherein the sample object is any one of a first object, a second object and a third object; and splicing the sub-operation features corresponding to the first object, the sub-operation features corresponding to the second object and the sub-operation features corresponding to the third object into the operation features of the video clip.
In some embodiments, the feature mapping module 4553 is further configured to: any one of the following processes is performed: accumulating the operand values of the plurality of sample objects, and taking the numerical value obtained by accumulation as a sub-operation characteristic; and constructing vectors with the dimension same as the number of the sample objects according to the operand values of the plurality of sample objects to serve as the sub-operation characteristics.
In some embodiments, the feature mapping module 4553 is further configured to: dividing the number of videos of which the sample objects are subjected to the skip operation by the video base number to obtain the skip video ratio; determining a weight negatively correlated to a skipped video fraction; weighting the operand values of the plurality of sample objects according to the weights of the sample objects to obtain sub-operation characteristics; the video base number is any one of the number of videos in the database and the number of videos of which the sample object has performed the playing operation.
In some embodiments, the feature mapping module 4553 is further configured to: sequencing the plurality of sample objects according to a first numerical sequence of the skip video percentage; uniformly sampling the set weight range according to the number of the sample objects to obtain a plurality of weights to be distributed; according to the second numerical sequence, sequentially distributing a plurality of weights to be distributed to the sequenced sample objects; wherein the first numerical order is opposite to the second numerical order.
In some embodiments, the feature mapping module 4553 is further configured to: sequencing the operand values of the plurality of sample objects according to the object parameters of the sample objects; constructing vectors with the same dimensionality as the number of sample objects according to the sequenced multiple operand values; the object parameter includes any one of registration time, time of last playing operation and number of videos having been played.
In some embodiments, the feature mapping module 4553 is further configured to: constructing initial vectors with dimensions the same as the number of videos in the database; wherein each value in the initial vector corresponds to a video in the database; aiming at any object, determining a numerical value corresponding to the video of which the playing operation is executed by any object in the initial vector, and updating the numerical value determined in the initial vector to a set numerical value to obtain a video playing vector; and determining an object with similarity greater than a similarity threshold value on the video playing vector with the target recommendation object from the plurality of first objects as a second object.
In some embodiments, the recursive update module 4555 is further configured to: sequencing the fusion characteristics of the video clips according to the time axis sequence of the video; for any video segment in the video, the following processing is performed: fusing to obtain the intermediate feature of any video clip according to the fusion feature of any video clip and the intermediate feature of the previous video clip; and performing weighting processing on the intermediate features to obtain updated fusion features of any video segment.
In some embodiments, the fusion module 4554 is further configured to: carrying out recursive updating processing on the content characteristics and the operation characteristics of the video clip to obtain updated content characteristics and updated operation characteristics; and splicing the updated content characteristics and the updated operation characteristics to obtain fusion characteristics.
In some embodiments, the feature extraction module 4552 is further configured to: any one of the following processes is performed: performing feature extraction processing on each image in the video clip to obtain image features, and fusing the image features of the plurality of images into content features of the video clip; and performing feature extraction processing on a plurality of continuous images in the video clip, and taking the extracted image features as the content features of the video clip.
In some embodiments, the recursive update module 4555 is further configured to: weighting the updated fusion characteristics of the video clips to obtain the probability that the video clips comprise recommendation information; and taking the video clip with the probability of the recommendation information in the video larger than the probability threshold value as the predicted video clip comprising the recommendation information.
In some embodiments, artificial intelligence based video processing device 455 further comprises: and the deleting module is used for deleting the predicted video clip comprising the recommendation information.
Embodiments of the present application provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the artificial intelligence based video processing method according to the embodiment of the present application.
Embodiments of the present application provide a computer-readable storage medium having stored thereon executable instructions that, when executed by a processor, cause the processor to perform a method provided by embodiments of the present application, for example, an artificial intelligence based video processing method as shown in fig. 3A, 3B, 3C, and 3D.
In some embodiments, the computer-readable storage medium may be memory such as FRAM, ROM, PROM, EPROM, EEPROM, flash, magnetic surface memory, optical disk, or CD-ROM; or may be various devices including one or any combination of the above memories.
In some embodiments, executable instructions may be written in any form of programming language (including compiled or interpreted languages), in the form of programs, software modules, scripts or code, and may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
By way of example, executable instructions may correspond, but do not necessarily have to correspond, to files in a file system, and may be stored in a portion of a file that holds other programs or data, for example in one or more scripts in a Hypertext Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, subprograms, or portions of code).
By way of example, executable instructions may be deployed to be executed on one computing device or on multiple computing devices at one site or distributed across multiple sites and interconnected by a communication network.
In summary, the following technical effects can be achieved through the embodiments of the present application:
1) The fusion features are determined from both the content features and the operation features of the video segments, so they include information from multiple dimensions of each segment. On this basis, the fusion features of the video segments are recursively updated so that, with reference to the normal content of the video, the updated fusion features represent the video segments more accurately and effectively. Predicting the video segments that include recommendation information from the updated fusion features therefore achieves higher accuracy.
2) By deleting the predicted video segment including the recommendation information in the video, the storage space occupied by the video can be reduced, and the storage resource of the electronic equipment can be saved. When the electronic equipment subsequently executes operations such as sending or playing the video, the actual utilization rate of the computing resources consumed by the electronic equipment can be improved.
3) In the process of determining the content features, feature extraction processing can be performed on a single image, or feature extraction processing can be performed on a plurality of continuous images together, so that the flexibility of feature extraction is improved.
4) By determining the objects (the global object, the similar objects and the attention object) of different levels and further fusing the sub-operation features of different levels into the operation features of the video clip, the comprehensiveness and the hierarchy of information contained in the operation features can be improved, and the effective data rule can be conveniently learned from the operation features subsequently.
5) The sub-operation features in the numerical form or the vector form can be constructed and are suitable for different scenes, for example, the sub-operation features in the numerical form can be suitable for scenes for real-time video processing, and the sub-operation features in the vector form can be suitable for scenes for offline video processing.
6) The embodiment of the present application can be integrated into a recommendation information prediction model, which can be deployed in electronic devices as a component, making deployment simple and the application range wide. For example, it can be deployed in a video player of a terminal device as a recommendation-information filtering function, with the user deciding whether to enable it.
The above description is only an example of the present application, and is not intended to limit the scope of the present application. Any modification, equivalent replacement, and improvement made within the spirit and scope of the present application are included in the protection scope of the present application.

Claims (15)

1. A method for artificial intelligence based video processing, the method comprising:
dividing a video into a plurality of video segments;
carrying out feature extraction processing on the video clip to obtain the content features of the video clip;
acquiring operation data aiming at the video clip, and performing feature mapping processing on the operation data to obtain operation features;
performing fusion processing on the content characteristics and the operation characteristics of the video clip to obtain fusion characteristics;
and carrying out recursive updating processing on the fusion characteristics of the video segments, and predicting the video segments comprising recommendation information in the video segments according to the obtained updated fusion characteristics.
2. The method of claim 1, wherein the obtaining operation data for the video clip comprises:
taking an object except for a target recommendation object of the video as a first object;
taking an object similar to the target recommendation object in the plurality of first objects as a second object;
taking an object which has an attention relationship with the target recommendation object in the plurality of first objects as a third object;
and taking the operation performed by the first object, the second object and the third object for the video segment as operation data for the video segment.
3. The method of claim 2, wherein the performing a feature mapping process on the operation data to obtain an operation feature comprises:
mapping operations executed by sample objects aiming at the video clips into operand values, and constructing sub-operation characteristics according to the operand values of a plurality of sample objects;
wherein the sample object is any one of the first object, the second object, and the third object;
and splicing the sub-operation features corresponding to the first object, the sub-operation features corresponding to the second object and the sub-operation features corresponding to the third object into the operation features of the video clip.
4. The method of claim 3, wherein constructing sub-operational features from the operand values of the plurality of sample objects comprises:
any one of the following processes is performed:
accumulating the operand values of the plurality of sample objects, and taking the value obtained by accumulation as a sub-operation characteristic;
and constructing vectors with the dimension same as the number of the sample objects according to the operand values of the sample objects to serve as sub-operation features.
5. The method according to claim 4, wherein the accumulating the operand values of the plurality of sample objects, and taking the accumulated operand values as the sub-operational features comprises:
dividing the number of videos of which the sample objects are subjected to skip operation by the video base number to obtain a skip video ratio;
determining a weight negatively correlated to the skipped video fraction;
weighting the operand values of the plurality of sample objects according to the weights of the sample objects to obtain sub-operation characteristics;
wherein the video base includes any one of the number of videos in the database and the number of videos of which the sample object has performed the play operation.
6. The method of claim 5, wherein determining the weight negatively correlated to the skipped video fraction comprises:
sorting the plurality of sample objects according to a first numerical order of the skip video fraction;
uniformly sampling a set weight range according to the number of the sample objects to obtain a plurality of weights to be distributed;
according to a second numerical sequence, sequentially distributing the weights to be distributed to the sequenced sample objects;
wherein the first numerical order is opposite to the second numerical order.
7. The method of claim 4, wherein constructing a vector having the same dimensions as the number of sample objects according to the operand values of the plurality of sample objects comprises:
sequencing the operand values of the plurality of sample objects according to the object parameters of the sample objects;
constructing vectors with the same dimensionality as the sample objects according to the sorted multiple operand values;
the object parameter includes any one of registration time, time of last playing operation and number of videos having been played.
8. The method according to claim 2, wherein the step of regarding an object similar to the target recommended object in the plurality of first objects as the second object comprises:
constructing initial vectors with dimensions the same as the number of videos in the database; wherein each value in the initial vector corresponds to a video in the database;
for any object, determining a numerical value corresponding to the video of which the playing operation is executed by the any object in the initial vector, and updating the numerical value determined in the initial vector to a set numerical value to obtain a video playing vector;
and determining an object with similarity greater than a similarity threshold value on a video playing vector with the target recommendation object from the plurality of first objects as a second object.
9. The method according to any one of claims 1 to 8, wherein said recursively updating the fused features of the plurality of video segments comprises:
sequencing the fusion characteristics of the video clips according to the time axis sequence of the video;
for any video segment in the video, performing the following processing:
according to the fusion feature of any video segment and the intermediate feature of the previous video segment, fusing to obtain the intermediate feature of any video segment;
and performing weighting processing on the intermediate features to obtain updated fusion features of any one video segment.
10. The method according to any one of claims 1 to 8, wherein the fusing the content feature and the operation feature of the video segment to obtain a fused feature comprises:
carrying out recursive updating processing on the content characteristics and the operation characteristics of the video clip to obtain updated content characteristics and updated operation characteristics;
and splicing the updated content characteristics and the updated operation characteristics to obtain fusion characteristics.
11. The method according to any one of claims 1 to 8, wherein the performing a feature extraction process on the video segment to obtain the content feature of the video segment includes:
any one of the following processes is performed:
performing feature extraction processing on each image in the video clip to obtain image features, and fusing the image features of the plurality of images into content features of the video clip;
and performing feature extraction processing on a plurality of continuous images in the video clip, and taking the extracted image features as the content features of the video clip.
12. The method according to any one of claims 1 to 8, wherein predicting the video segments including the recommendation information in the plurality of video segments according to the obtained updated fusion features comprises:
weighting the updated fusion characteristics of the video clips to obtain the probability that the video clips comprise recommendation information;
taking the video clips whose probability of including recommendation information is greater than the probability threshold as the predicted video clips including recommendation information;
the method further comprises the following steps:
deleting the predicted video segment including the recommendation information.
13. An artificial intelligence based video processing apparatus, the apparatus comprising:
the dividing module is used for dividing the video into a plurality of video segments;
the characteristic extraction module is used for carrying out characteristic extraction processing on the video clip to obtain the content characteristics of the video clip;
the feature mapping module is used for acquiring operation data aiming at the video clip and performing feature mapping processing on the operation data to obtain operation features;
the fusion module is used for carrying out fusion processing on the content characteristics and the operation characteristics of the video clips to obtain fusion characteristics;
and the recursive update module is used for performing recursive update processing on the fusion characteristics of the video segments and predicting the video segments comprising the recommendation information in the video segments according to the obtained updated fusion characteristics.
14. An electronic device, comprising:
a memory for storing executable instructions;
a processor for implementing the artificial intelligence based video processing method of any one of claims 1 to 12 when executing executable instructions stored in the memory.
15. A computer-readable storage medium storing executable instructions for implementing the artificial intelligence based video processing method of any one of claims 1 to 12 when executed by a processor.
CN202011148760.9A 2020-10-23 2020-10-23 Video processing method and device based on artificial intelligence and electronic equipment Pending CN112287799A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011148760.9A CN112287799A (en) 2020-10-23 2020-10-23 Video processing method and device based on artificial intelligence and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011148760.9A CN112287799A (en) 2020-10-23 2020-10-23 Video processing method and device based on artificial intelligence and electronic equipment

Publications (1)

Publication Number Publication Date
CN112287799A (en) 2021-01-29

Family

ID=74424001

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011148760.9A Pending CN112287799A (en) 2020-10-23 2020-10-23 Video processing method and device based on artificial intelligence and electronic equipment

Country Status (1)

Country Link
CN (1) CN112287799A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113468371A (en) * 2021-07-12 2021-10-01 公安部第三研究所 Method, system, device, processor and computer readable storage medium for realizing natural sentence image retrieval
CN117036834A (en) * 2023-10-10 2023-11-10 腾讯科技(深圳)有限公司 Data classification method and device based on artificial intelligence and electronic equipment
CN117036834B (en) * 2023-10-10 2024-02-23 腾讯科技(深圳)有限公司 Data classification method and device based on artificial intelligence and electronic equipment

Similar Documents

Publication Publication Date Title
CN110297848B (en) Recommendation model training method, terminal and storage medium based on federal learning
CN111078939B (en) Method, system and recording medium for extracting and providing highlight image in video content
CN110263244B (en) Content recommendation method, device, storage medium and computer equipment
Costa-Montenegro et al. Which App? A recommender system of applications in markets: Implementation of the service for monitoring users’ interaction
CN109086439A (en) Information recommendation method and device
CN111090756B (en) Artificial intelligence-based multi-target recommendation model training method and device
CN110413867B (en) Method and system for content recommendation
CN111818370B (en) Information recommendation method and device, electronic equipment and computer-readable storage medium
CN113326440B (en) Artificial intelligence based recommendation method and device and electronic equipment
CN112287799A (en) Video processing method and device based on artificial intelligence and electronic equipment
CN111552835B (en) File recommendation method, device and server
CN114741423A (en) Content recommendation method and device based on artificial intelligence
CN115329131A (en) Material label recommendation method and device, electronic equipment and storage medium
CN111858969B (en) Multimedia data recommendation method, device, computer equipment and storage medium
CN114817692A (en) Method, device and equipment for determining recommended object and computer storage medium
CN113836388A (en) Information recommendation method and device, server and storage medium
CN112269943A (en) Information recommendation system and method
CN111739649A (en) User portrait capturing method, device and system
CN114491093B (en) Multimedia resource recommendation and object representation network generation method and device
CN112269942B (en) Method, device and system for recommending object and electronic equipment
CN115935049A (en) Recommendation processing method and device based on artificial intelligence and electronic equipment
CN114268848A (en) Video generation method and device, electronic equipment and storage medium
CN115878839A (en) Video recommendation method and device, computer equipment and computer program product
CN113688334A (en) Content display method and device and computer readable storage medium
CN114816180A (en) Content browsing guiding method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
REG Reference to a national code (Ref country code: HK; Ref legal event code: DE; Ref document number: 40037364; Country of ref document: HK)
SE01 Entry into force of request for substantive examination