CN112203122B - Similar video processing method and device based on artificial intelligence and electronic equipment

Similar video processing method and device based on artificial intelligence and electronic equipment

Info

Publication number
CN112203122B
CN112203122B (application CN202011080112.4A, CN202011080112A)
Authority
CN
China
Prior art keywords
video
vector
processing
image
videos
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011080112.4A
Other languages
Chinese (zh)
Other versions
CN112203122A (en)
Inventor
刘刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202011080112.4A priority Critical patent/CN112203122B/en
Publication of CN112203122A publication Critical patent/CN112203122A/en
Application granted granted Critical
Publication of CN112203122B publication Critical patent/CN112203122B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N 21/25 Management operations performed by the server for facilitating the content distribution or administrating data related to end-users or client devices, e.g. end-user or client device authentication, learning user preferences for recommending movies
    • H04N 21/258 Client or end-user data management, e.g. managing client capabilities, user preferences or demographics, processing of multiple end-users preferences to derive collaborative data
    • H04N 21/25866 Management of end-user data
    • H04N 21/25891 Management of end-user data being end-user preferences
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N 21/25 Management operations performed by the server for facilitating the content distribution or administrating data related to end-users or client devices, e.g. end-user or client device authentication, learning user preferences for recommending movies
    • H04N 21/251 Learning process for intelligent management, e.g. learning user preferences for recommending movies
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/45 Management operations performed by the client for facilitating the reception of or the interaction with the content or administrating data related to the end-user or to the client device itself, e.g. learning user preferences for recommending movies, resolving scheduling conflicts
    • H04N 21/466 Learning process for intelligent management, e.g. learning user preferences for recommending movies
    • H04N 21/4668 Learning process for intelligent management, e.g. learning user preferences for recommending movies for recommending content, e.g. movies

Abstract

The application provides an artificial-intelligence-based similar video processing method and apparatus, an electronic device, and a computer-readable storage medium, relating to big data technology in the field of cloud technology. The method includes: performing multi-dimensional feature extraction processing on an image in a video, and performing fusion processing on the extracted multi-dimensional feature vectors to obtain an image vector of the image; performing feature extraction processing on audio in the video to obtain an audio vector; performing feature extraction processing on text in the video to obtain a text vector; performing fusion processing on the image vector, the audio vector and the text vector to obtain a video representation vector of the video; taking the vector similarity between the video representation vectors of any two videos as the video similarity between the two videos; and processing the two videos according to the video similarity. Through this application, the precision of video processing can be improved.

Description

Similar video processing method and device based on artificial intelligence and electronic equipment
Technical Field
The present application relates to artificial intelligence and big data technology, and in particular, to an artificial-intelligence-based similar video processing method and apparatus, an electronic device, and a computer-readable storage medium.
Background
Artificial intelligence (Artificial Intelligence, AI) is the theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain optimal results. Computer Vision (CV) is an important branch of artificial intelligence that mainly studies how to build an artificial intelligence system capable of acquiring information from images or multidimensional data.
For providers of video services, it is generally necessary to determine whether two videos are similar by means of computer vision technology and big data processing technology related to cloud technology, so as to maintain the operational ecology of the video service. In the scheme provided by the related art, the hash value of an image in the video is generally calculated by the pHash algorithm or the dHash algorithm, and the similarity between the hash values of the images in two videos is used as the video similarity between the two videos. However, this method is highly sensitive to cropping, translation or slight changes in the shooting angle of the image; it is overly limited, and the accuracy of the obtained video similarity is low.
Disclosure of Invention
The embodiment of the application provides a similar video processing method and device based on artificial intelligence, electronic equipment and a computer readable storage medium, which can improve the accuracy of similar video identification.
The technical scheme of the embodiment of the application is realized as follows:
the embodiment of the application provides a similar video processing method based on artificial intelligence, which comprises the following steps:
performing multi-dimensional feature extraction processing on an image in a video, and performing fusion processing on the extracted multi-dimensional feature vectors to obtain image vectors of the image;
performing feature extraction processing on the audio in the video to obtain an audio vector;
performing feature extraction processing on texts in the video to obtain text vectors;
carrying out fusion processing on the image vector, the audio vector and the text vector to obtain a video representation vector of the video;
taking the vector similarity between the video representation vectors of any two videos as the video similarity between the two videos;
and processing the two videos according to the video similarity.
In the above scheme, the method further comprises:
acquiring a plurality of historical recommendation records for a historical user; wherein each historical recommendation record comprises user characteristics of the historical user, a video representation vector of a recommended video and a triggering result of the historical user on the recommended video;
updating recommendation parameters of a recommendation model according to the plurality of historical recommendation records;
combining the user characteristics of the user to be recommended with the video representation vectors of the candidate videos respectively to obtain a plurality of prediction samples;
carrying out prediction processing on the prediction sample through the updated recommendation model to obtain a prediction triggering result of the candidate video corresponding to the prediction sample;
and screening videos to be recommended from the plurality of candidate videos according to the prediction triggering result, and executing recommendation operation for the videos to be recommended.
The embodiment of the application provides a similar video processing device based on artificial intelligence, which comprises:
the first feature extraction module is used for carrying out feature extraction processing on images in the video in multiple dimensions, and carrying out fusion processing on the extracted feature vectors in the multiple dimensions to obtain image vectors of the images;
the second feature extraction module is used for carrying out feature extraction processing on the audio in the video to obtain an audio vector;
the third feature extraction module is used for carrying out feature extraction processing on the text in the video to obtain a text vector;
the fusion module is used for carrying out fusion processing on the image vector, the audio vector and the text vector to obtain a video representation vector of the video;
The similarity determining module is used for taking the vector similarity between video representation vectors of any two videos as the video similarity between the two videos;
and the processing module is used for processing the two videos according to the video similarity.
An embodiment of the present application provides an electronic device, including:
a memory for storing executable instructions;
and the processor is used for realizing the similar video processing method based on artificial intelligence when executing the executable instructions stored in the memory.
The embodiment of the application provides a computer readable storage medium, which stores executable instructions for causing a processor to execute, so as to implement the similar video processing method based on artificial intelligence.
The embodiment of the application has the following beneficial effects:
Feature extraction processing is performed on the image in multiple dimensions, and the information of the multiple dimensions is fused to obtain the image vector; the image vector, the audio vector and the text vector are then fused to obtain the video representation vector of the video, so that the video representation vector represents the meaning of the video more accurately. The accuracy of the video similarity obtained from the video representation vectors is thereby improved, the processing precision is improved when videos are processed according to the video similarity, and the operational ecology of the video service is optimized.
Drawings
FIG. 1 is a schematic diagram of an architecture of an artificial intelligence based similar video processing system provided in an embodiment of the present application;
FIG. 2 is a schematic architecture diagram of a terminal device according to an embodiment of the present application;
FIG. 3A is a schematic flow chart of a similar video processing method based on artificial intelligence according to an embodiment of the present application;
FIG. 3B is a schematic flow chart of a similar video processing method based on artificial intelligence according to an embodiment of the present application;
FIG. 3C is a flowchart of a similar video processing method based on artificial intelligence according to an embodiment of the present application;
FIG. 3D is a flowchart of indexing and identifying non-original video according to an embodiment of the present application;
FIG. 4 is a schematic diagram of generating video representation vectors provided by embodiments of the present application;
FIG. 5 is a schematic diagram of generating video representation vectors provided by embodiments of the present application;
FIG. 6 is a schematic diagram of an index library management framework provided by an embodiment of the present application;
FIG. 7 is a schematic diagram of a similar video processing system based on artificial intelligence as provided by embodiments of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the present application will be described in further detail below with reference to the accompanying drawings. The described embodiments should not be construed as limiting the present application, and all other embodiments obtained by those skilled in the art without making any inventive effort fall within the scope of the present application.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is to be understood that "some embodiments" can be the same subset or different subsets of all possible embodiments and can be combined with one another without conflict.
In the following description, the terms "first", "second", "third" and the like are merely used to distinguish similar objects and do not represent a specific ordering of the objects; it should be understood that "first", "second" and "third" may be interchanged in a specific order or sequence where permitted, so that the embodiments of the application described herein can be implemented in an order other than that illustrated or described herein. In the following description, the term "plurality" refers to at least two.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the present application.
Before further describing embodiments of the present application in detail, the terms and expressions that are referred to in the embodiments of the present application are described, and are suitable for the following explanation.
1) Feature extraction: specific data (such as images, audio or text) is mapped to a vector space through a series of processes to obtain a vector-form representation of the data for processing by a computer.
2) Fusion processing: a manner of fusing multiple vectors, including but not limited to direct splicing, product (including inner product and outer product), averaging, and weighting (such as weighted averaging and weighted summation); the specific manner may be determined according to the actual application scenario.
3) Database (Database): similar to an electronic filing cabinet, namely a place for storing electronic files, a user can perform operations such as adding, inquiring, updating, deleting and the like on data in the files. A database is also understood to be a collection of data stored together in a manner that can be shared with multiple users, with as little redundancy as possible, independent of the application.
4) Index (Index): in order to accelerate retrieval of data in a database, a distributed storage structure is created, an index is used for pointing to one or more data in the database, and when the data volume stored in the database is huge, the index can greatly accelerate query speed. In embodiments of the present application, video representation vectors and indexes of video representation vectors may be stored in a specialized index library.
5) Machine Learning (ML): at the core of artificial intelligence, it is mainly studied how computers simulate or implement learning behaviors of human beings to acquire new knowledge or skills, and reorganize the existing knowledge structure to continuously improve the performance of the computer. A machine learning model, such as a neural network model, may be constructed based on machine learning principles.
In video services (such as video websites), a content creator may upload (publish) a large number of similar videos in order to promote their own benefit, for example by making simple editing modifications to the title, cover or watermark of a certain video, adding an advertisement at the head or tail of the video, or modifying the audio in the video, and may even directly copy videos uploaded by other creators. Such content occupies a large amount of traffic, which is detrimental to the healthy development of the whole content ecology.
In the scheme provided by the related art, the hash value of an image in the video is generally calculated by a pHash algorithm or a dHash algorithm, and the similarity between the hash values of the images in two videos is used as the video similarity between the two videos. However, this method is highly sensitive to cropping, translation or slight changes in the shooting angle of the image, so the effect of similarity recognition is poor and the approach is overly limited. If the scheme provided by the related art is applied, a content creator can easily bypass the restriction (for example, by slightly cropping the image or changing the viewing angle) and successfully upload similar videos.
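For reference, a dHash-style comparison of two frames can be sketched as follows; this is purely illustrative (it assumes Pillow and NumPy are available and uses placeholder image paths) and only serves to show why simple hash fingerprints break under cropping or small shifts:

```python
import numpy as np
from PIL import Image

def dhash(image_path: str, hash_size: int = 8) -> np.ndarray:
    """Difference hash: compare each pixel with its right neighbour on a tiny grayscale image."""
    img = Image.open(image_path).convert("L").resize((hash_size + 1, hash_size))
    pixels = np.asarray(img, dtype=np.int16)
    return (pixels[:, 1:] > pixels[:, :-1]).flatten()   # 64-bit boolean fingerprint

def hash_similarity(h1: np.ndarray, h2: np.ndarray) -> float:
    """1.0 means identical fingerprints; the value drops quickly under cropping or shifts."""
    return 1.0 - np.count_nonzero(h1 != h2) / h1.size

# Usage (paths are placeholders):
# sim = hash_similarity(dhash("video_a_frame.jpg"), dhash("video_b_frame.jpg"))
```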
The embodiment of the application provides a similar video processing method and device based on artificial intelligence, electronic equipment and a computer readable storage medium, which can improve the accuracy of similar video identification. An exemplary application of the electronic device provided by the embodiment of the present application is described below, where the electronic device provided by the embodiment of the present application may be implemented as various types of terminal devices, and may also be implemented as a server.
Referring to fig. 1, fig. 1 is a schematic architecture diagram of a similar video processing system 100 based on artificial intelligence according to an embodiment of the present application, where a terminal device 400 is connected to a server 200 through a network 300, and the server 200 is connected to a database 500, where the network 300 may be a wide area network or a local area network, or a combination of both.
In some embodiments, taking the case that the electronic device is a terminal device as an example, the similar video processing method based on artificial intelligence provided in the embodiments of the present application may be implemented by the terminal device. For example, the terminal device 400 locally stores a plurality of videos; for each video, it extracts an image vector, an audio vector and a text vector, and fuses them to obtain a video representation vector. Then, exhaustive combination processing is performed on the videos to obtain a plurality of video groups, the vector similarity between the video representation vectors of the two videos in each video group is taken as the video similarity between the two videos, and if the video similarity is greater than a set similarity threshold, the two videos are determined to be similar. Finally, the terminal device 400 may display the plurality of similar videos in a human-computer interaction interface so that the user can delete some of them, thereby implementing video deduplication and saving local storage resources of the terminal device 400; alternatively, the terminal device 400 may directly retain the video with the highest video parameters (such as frame rate, definition or resolution) among the plurality of similar videos and delete the others.
In some embodiments, taking an example that the electronic device is a server, the similar video processing method based on artificial intelligence provided in the embodiments of the present application may also be implemented by the server. For example, the server 200 may be a server that provides a video service (such as a background server of a video website), where the server 200 determines, when acquiring a video uploaded by the terminal device 400 via the client 410 (such as a video client), video similarity between the video and a plurality of videos stored in the database 500 (also by determining vector similarity between video representation vectors). If it is determined that the video similarity between the video uploaded by the terminal device 400 and a certain video in the database 500 is greater than the set similarity threshold, the server 200 determines the video uploaded by the terminal device 400 as a non-original video, and sends a re-uploading prompt (may also include a video similar to the video uploaded by the terminal device 400) to the terminal device 400, so that the operation ecology of the video service is improved, and the non-original video is prevented from occupying the recommended flow of the video service. As an example, in the interface of the client 410 of fig. 1, a video a sent by the terminal device 400 to the server 200, and a video B in the database 500, in which the video similarity with the video a is greater than the similarity threshold, are shown, and a re-upload prompt is also shown. Of course, the above scenario is not limited to the embodiment of the present application, for example, the server 200 may also calculate video similarity to the video in the database 500, and perform video deduplication, so as to alleviate the storage pressure of the database 500.
In some embodiments, the terminal device 400 or the server 200 may implement the artificial intelligence-based similar video processing method provided in the embodiments of the present application by running a computer program, for example, the computer program may be a native program or a software module in an operating system; may be a Native Application (APP), i.e., a program that needs to be installed in an operating system to run, such as client 410; the method can also be an applet, namely a program which can be run only by being downloaded into a browser environment; but also an applet that can be embedded in any APP. In general, the computer programs described above may be any form of application, module or plug-in.
In some embodiments, the server 200 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server that provides cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs, and basic cloud computing services such as big data and artificial intelligence platforms, where the cloud services may be similar video recognition services for the terminal device 400 to call. The terminal device 400 may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, etc. The terminal device and the server may be directly or indirectly connected through wired or wireless communication, which is not limited in the embodiments of the present application.
Taking the electronic device provided in the embodiment of the present application as an example of a terminal device, it can be understood that, in a case where the electronic device is a server, portions (such as a user interface, a presentation module, and an input processing module) in the structure shown in fig. 2 may be default. Referring to fig. 2, fig. 2 is a schematic structural diagram of a terminal device 400 provided in an embodiment of the present application, and the terminal device 400 shown in fig. 2 includes: at least one processor 410, a memory 450, at least one network interface 420, and a user interface 430. The various components in terminal device 400 are coupled together by bus system 440. It is understood that the bus system 440 is used to enable connected communication between these components. The bus system 440 includes a power bus, a control bus, and a status signal bus in addition to the data bus. But for clarity of illustration the various buses are labeled in fig. 2 as bus system 440.
The processor 410 may be an integrated circuit chip having signal processing capabilities, such as a general-purpose processor (for example, a microprocessor or any conventional processor), a digital signal processor (DSP, Digital Signal Processor), another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like.
The user interface 430 includes one or more output devices 431, including one or more speakers and/or one or more visual displays, that enable presentation of the media content. The user interface 430 also includes one or more input devices 432, including user interface components that facilitate user input, such as a keyboard, mouse, microphone, touch screen display, camera, other input buttons and controls.
Memory 450 may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid state memory, hard drives, optical drives, and the like. Memory 450 optionally includes one or more storage devices physically remote from processor 410.
Memory 450 includes volatile memory or nonvolatile memory, and may also include both volatile and nonvolatile memory. The nonvolatile memory may be a read-only memory (ROM, Read Only Memory), and the volatile memory may be a random access memory (RAM, Random Access Memory). The memory 450 described in the embodiments herein is intended to include any suitable type of memory.
In some embodiments, memory 450 is capable of storing data to support various operations, examples of which include programs, modules and data structures, or subsets or supersets thereof, as exemplified below.
An operating system 451 including system programs, e.g., framework layer, core library layer, driver layer, etc., for handling various basic system services and performing hardware-related tasks, for implementing various basic services and handling hardware-based tasks;
a network communication module 452 for accessing other computing devices via one or more (wired or wireless) network interfaces 420, exemplary network interfaces 420 including: Bluetooth, wireless fidelity (Wi-Fi), universal serial bus (USB, Universal Serial Bus), etc.;
a presentation module 453 for enabling presentation of information (e.g., a user interface for operating peripheral devices and displaying content and information) via one or more output devices 431 (e.g., a display screen, speakers, etc.) associated with the user interface 430;
an input processing module 454 for detecting one or more user inputs or interactions from one of the one or more input devices 432 and translating the detected inputs or interactions.
In some embodiments, the apparatus provided in the embodiments of the present application may be implemented in software. Fig. 2 shows a similar video processing apparatus 455 based on artificial intelligence stored in the memory 450, which may be software in the form of a program, a plug-in or the like, and includes the following software modules: a first feature extraction module 4551, a second feature extraction module 4552, a third feature extraction module 4553, a fusion module 4554, a similarity determination module 4555 and a processing module 4556. These modules are logical and thus may be combined arbitrarily or further split according to the functions implemented. The functions of the respective modules will be described below.
Similar video processing methods based on artificial intelligence provided by embodiments of the present application will be described in connection with exemplary applications and implementations of electronic devices provided by embodiments of the present application.
Referring to fig. 3A, fig. 3A is a schematic flow chart of a similar video processing method based on artificial intelligence according to an embodiment of the present application, and will be described with reference to the steps shown in fig. 3A.
In step 101, feature extraction processing of multiple dimensions is performed on an image in a video, and fusion processing is performed on the extracted feature vectors of the multiple dimensions to obtain an image vector of the image.
The images in a video contain visual information that can be perceived by humans, and this visual information is an important element for judging whether two images are similar; therefore, the images in the video are subjected to feature extraction processing, for example through an image classification model. The type of the image classification model is not limited, and may be a Visual Geometry Group (VGG) model, an Inception model, a ResNet model, or the like.
In an actual application scenario, the subjects of the images in different videos may be different, for example, faces, cats, dogs, automobiles or sceneries, and if feature extraction processing is performed through a single image classification model, the extracted feature vectors may not fully express the semantics of the images. Therefore, in the embodiment of the application, the image in the video can be subjected to feature extraction processing of multiple dimensions through image classification models for executing different classification tasks, namely, each image classification model is used for extracting a feature vector of one dimension. And then, carrying out fusion processing on the extracted feature vectors with multiple dimensions, such as weighting summation, so as to obtain the image vector of the image.
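A minimal sketch of this multi-dimensional extraction and fusion is given below; the three extractor functions are placeholders standing in for the separately trained classification models (they are assumptions for illustration, not models disclosed by the application), and the fusion shown is a simple weighted summation:

```python
import numpy as np

# Stand-in extractors; in practice each would be a trained model for one classification task.
def extract_object_features(image: np.ndarray) -> np.ndarray:
    return np.random.rand(128)   # placeholder for an object-detection embedding

def extract_scene_features(image: np.ndarray) -> np.ndarray:
    return np.random.rand(128)   # placeholder for a scene-recognition embedding

def extract_face_features(image: np.ndarray) -> np.ndarray:
    return np.random.rand(128)   # placeholder for a face-detection embedding

def image_vector(image: np.ndarray, weights=(0.4, 0.3, 0.3)) -> np.ndarray:
    """Fuse the per-dimension feature vectors into one image vector (weighted summation)."""
    feats = [extract_object_features(image),
             extract_scene_features(image),
             extract_face_features(image)]
    return sum(w * f for w, f in zip(weights, feats))

frame = np.zeros((224, 224, 3))   # dummy image
print(image_vector(frame).shape)  # -> (128,)
```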
In step 102, feature extraction processing is performed on audio in the video to obtain an audio vector.
The audio is also an element which can be perceived by human senses in the video, so that the audio in the video can be subjected to feature extraction processing through an audio feature extraction model to obtain an audio vector.
In step 103, feature extraction processing is performed on the text in the video, so as to obtain a text vector.
In addition to the image and the audio, the text in the video is also subjected to feature extraction processing, through a text feature extraction model, to obtain a text vector; the type of the text feature extraction model is not limited in the embodiments of the application, and may be, for example, a Bidirectional Encoder Representations from Transformers (BERT) model. It should be noted that the text in the video referred to herein may be text in an image of the video, or may be text added when the user uploads the video, such as title text. Optical character recognition (OCR, Optical Character Recognition) may be performed on the image to obtain the text in the image.
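As an illustration only, a text vector could be obtained with a pretrained BERT model via the Hugging Face transformers library roughly as follows; the checkpoint name and the mean pooling are assumptions, not requirements of the application:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")   # assumed checkpoint
model = AutoModel.from_pretrained("bert-base-chinese")

def text_vector(text: str) -> torch.Tensor:
    """Encode a title/OCR string and mean-pool the token embeddings into one text vector."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state   # (1, seq_len, 768)
    return hidden.mean(dim=1).squeeze(0)             # (768,)

print(text_vector("示例视频标题").shape)
```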
In addition, the above image classification model, audio feature extraction model, and text feature extraction model may be trained prior to use, for example by metric learning, and the training set may include a plurality of positive samples (each positive sample including two videos labeled as similar) and a plurality of negative samples (each negative sample including two videos labeled as dissimilar).
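A possible metric-learning setup is sketched below, under the assumption that each training pair is reduced to two raw feature vectors plus a similar/dissimilar label; the small encoder and the cosine embedding loss are illustrative choices, not the models described above:

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 128))  # placeholder encoder
criterion = nn.CosineEmbeddingLoss(margin=0.2)
optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-4)

def train_step(feat_a, feat_b, label):
    """label = +1 for a positive pair (similar videos), -1 for a negative pair."""
    za, zb = encoder(feat_a), encoder(feat_b)
    loss = criterion(za, zb, label)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Dummy batch of 4 pairs of raw 512-d features with labels.
a, b = torch.randn(4, 512), torch.randn(4, 512)
y = torch.tensor([1.0, -1.0, 1.0, -1.0])
print(train_step(a, b, y))
```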
In step 104, fusion processing is performed on the image vector, the audio vector and the text vector, so as to obtain a video representation vector of the video.
For a video, the image vector, the audio vector and the text vector which are obtained through feature extraction processing are fused to obtain a video representation vector of the video, and the video representation vector can comprehensively and fully represent the actual semantics of the video.
In some embodiments, the above-mentioned fusion processing of the image vector, the audio vector and the text vector may be implemented in such a manner that a video representation vector of the video is obtained: and performing at least one of splicing processing, averaging processing, product processing and weighting processing on the image vector, the audio vector and the text vector to obtain a video representation vector of the video.
In the embodiment of the present application, the fusion processing manner includes, but is not limited to, splicing processing, averaging processing, product processing and weighting processing, and these four manners may be optionally executed or may be executed in combination, for example, the image vector, the audio vector and the text vector are respectively subjected to averaging processing and weighting processing, and then the result of the averaging processing and the result of the weighting processing are subjected to averaging processing, so as to obtain a video representation vector of the video. Where the product process may be an inner product or an outer product and the weighting process may be a weighted average or a weighted sum. By the method, the flexibility of fusion processing can be improved, a specific mode can be selected according to the requirements on efficiency and precision in an actual similar video identification scene, for example, a mode with higher precision requirement can be selected, and a video representation vector with longer length can be obtained by directly splicing; and if the efficiency requirement is high, a weighted average mode can be selected to obtain a video representation vector with a shorter length so as to reduce the subsequent calculation amount.
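As a purely illustrative example with placeholder vectors (the dimensions and weights are assumptions), the trade-off between splicing, averaging and weighting can be seen as follows:

```python
import numpy as np

image_vec = np.random.rand(128)   # placeholder vectors; real ones come from the extractors above
audio_vec = np.random.rand(128)
text_vec = np.random.rand(128)

concatenated = np.concatenate([image_vec, audio_vec, text_vec])   # splicing: longer, more precise
averaged = (image_vec + audio_vec + text_vec) / 3                 # averaging: keeps the original length
weighted = 0.5 * image_vec + 0.2 * audio_vec + 0.3 * text_vec     # weighted sum (weights are assumptions)

video_repr = concatenated    # e.g. choose splicing when precision matters more than speed
print(video_repr.shape)
```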
In step 105, the vector similarity between the video representation vectors of any two videos is taken as the video similarity between the two videos.
When similar video identification is performed, the vector similarity between the video representation vectors of any two videos can be used as the video similarity between the two videos. The type of vector similarity in the embodiments of the present application is not limited, and may be, for example, cosine similarity or the like.
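For example, cosine similarity between two video representation vectors can be computed as follows (the vectors here are placeholders):

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

video_vec_a = np.random.rand(256)   # placeholder video representation vectors
video_vec_b = np.random.rand(256)
print(cosine_similarity(video_vec_a, video_vec_b))
```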
In step 106, the two videos are processed according to the video similarity.
After the video similarity between any two videos is obtained, the two videos are processed according to the video similarity. The specific processing manner is not limited. For example, in the embodiment of the application, exhaustive combination processing may be performed on a plurality of videos stored in the electronic device to obtain a plurality of video groups. Then, the video similarity between the two videos in each video group is determined; if the video similarity is greater than a set similarity threshold, the video with the lower video parameter among the two videos is deleted, and the storage resources of the electronic device are saved through this kind of video deduplication. The video parameter is any one of the grade of the publishing account of the video (applicable to the scenario where the electronic device is a background server of a video service such as a video website), and the frame rate, definition and resolution of the video.
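A rough sketch of this deduplication flow is shown below; the similarity threshold, the 'param' field used as the video parameter, and the data layout are all assumptions for illustration:

```python
from itertools import combinations
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def deduplicate(videos, threshold=0.9):
    """videos: list of dicts with 'id', 'vector' and a quality-like 'param' (e.g. resolution)."""
    removed = set()
    for a, b in combinations(videos, 2):                  # exhaustive pairwise combination
        if a["id"] in removed or b["id"] in removed:
            continue
        if cosine(a["vector"], b["vector"]) > threshold:
            worse = a if a["param"] < b["param"] else b   # drop the lower-parameter video
            removed.add(worse["id"])
    return [v for v in videos if v["id"] not in removed]

catalog = [{"id": i, "vector": np.random.rand(128), "param": p}
           for i, p in enumerate([720, 1080, 480])]
print([v["id"] for v in deduplicate(catalog)])
```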
In some embodiments, after step 104, further comprising: acquiring a plurality of history recommendation records aiming at a history user; the historical recommendation record comprises user characteristics of a historical user, video representation vectors of the recommended videos and triggering results of the historical user on the recommended videos; updating recommendation parameters of the recommendation model according to the plurality of historical recommendation records; combining the user characteristics of the user to be recommended with the video representation vectors of the candidate videos respectively to obtain a plurality of prediction samples; carrying out prediction processing on the prediction samples through the updated recommendation model to obtain a prediction triggering result of the candidate video corresponding to the prediction samples; and screening the video to be recommended from the plurality of candidate videos according to the prediction triggering result, and executing the recommendation operation for the video to be recommended.
In the embodiment of the application, the video representation vector can be used for similar video identification, and can also be used as a characteristic of the video itself to participate in a recall stage and a sorting stage of video recommendation. For example, a plurality of history recommended records for a history user (the same user or a plurality of users) are obtained, each history recommended record comprises a user characteristic of the history user, a video representing vector of a recommended video and a triggering result of the history user on the recommended video, wherein the user characteristic can be set according to actual application scenes, such as age, gender, living city and the like, and the triggering result comprises triggering (such as clicking) and non-triggering.
The recommendation parameters of the recommendation model, i.e. the weight parameters of the recommendation model, are updated according to the plurality of historical recommendation records. For example, for each historical recommendation record, prediction processing is performed on the user characteristics and the video representation vector in the historical recommendation record through the recommendation parameters of the recommendation model to obtain a predicted triggering result to be compared; the difference between the triggering result in the historical recommendation record and the predicted triggering result to be compared (i.e. the loss value, which may be calculated through the loss function of the recommendation model) is back-propagated in the recommendation model, and the recommendation parameters of the recommendation model are updated along the gradient descent direction during the back-propagation.
After the recommendation parameters are updated, video recommendation can be performed according to the recommendation model. For example, the user characteristics of the user to be recommended are combined with the video representation vectors of a plurality of candidate videos respectively to obtain a plurality of prediction samples, where the plurality of candidate videos may be all videos stored in the database, or some videos screened from all videos stored in the database by a set rule, such as the videos uploaded in the last 7 days; each prediction sample includes the user characteristics of the user to be recommended and the video representation vector of one candidate video. Then, prediction processing is performed on the prediction samples through the updated recommendation model to obtain the predicted triggering results of the candidate videos corresponding to the prediction samples, and the videos whose predicted triggering result is "triggered" are screened out from the plurality of candidate videos as the videos to be recommended. Finally, an operation of recommending the videos to be recommended to the user to be recommended is performed. Because the video representation vector can fully represent the actual semantics of the video, the accuracy of video recommendation can be improved and the user experience optimized in this way.
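As an illustration of this training-and-ranking usage (the small neural scoring model, the feature sizes and the dummy data are assumptions; the application does not fix a specific recommendation model), consider:

```python
import torch
import torch.nn as nn

USER_DIM, VIDEO_DIM = 16, 256
model = nn.Sequential(nn.Linear(USER_DIM + VIDEO_DIM, 64), nn.ReLU(),
                      nn.Linear(64, 1), nn.Sigmoid())          # predicted trigger probability
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
bce = nn.BCELoss()

def update(user_feats, video_vecs, triggered):
    """One update step from historical recommendation records (triggered: 0/1 labels)."""
    samples = torch.cat([user_feats, video_vecs], dim=1)
    loss = bce(model(samples).squeeze(1), triggered)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def recommend(user_feat, candidate_vecs, top_k=3):
    """Combine one user's features with each candidate video vector and rank by predicted score."""
    samples = torch.cat([user_feat.expand(candidate_vecs.size(0), -1), candidate_vecs], dim=1)
    scores = model(samples).squeeze(1)
    return torch.topk(scores, k=min(top_k, scores.numel())).indices.tolist()

# Dummy data: 8 historical records for training, then ranking 5 candidates for one user.
update(torch.randn(8, USER_DIM), torch.randn(8, VIDEO_DIM), torch.randint(0, 2, (8,)).float())
print(recommend(torch.randn(1, USER_DIM), torch.randn(5, VIDEO_DIM)))
```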
In some embodiments, the above-mentioned processing of two videos according to video similarity may be achieved by: performing pairwise combination processing on a plurality of videos to be recommended to obtain a plurality of video groups; when the video similarity between two videos in the video group is larger than a second similarity threshold value and the video parameters corresponding to the two videos are different, screening out the video with higher video parameters in the two videos; when the video similarity between two videos in the video group is larger than a second similarity threshold value and the video parameters corresponding to the two videos are the same, taking any one of the two videos as a screened video; performing recommendation operation for the screened video; the video parameter is any one of the grade of the issuing account number of the video, the frame rate, the definition and the resolution of the video.
In the embodiment of the present application, in the video recommendation link, exhaustive combination processing may be performed on a plurality of videos to be recommended to obtain a plurality of video groups. When the video similarity between the two videos in a video group is greater than a set similarity threshold (named the second similarity threshold here for ease of distinction) and the video parameters corresponding to the two videos are different, it indicates that the two videos differ in quality, so the video with the higher video parameter, i.e. the better video, is screened out; when the video similarity between the two videos in a video group is greater than the second similarity threshold and the video parameters corresponding to the two videos are the same, it indicates that there is no difference in quality between the two videos, so either of the two videos is selected as the screened video. The video parameter is any one of the grade of the publishing account of the video, and the frame rate, definition and resolution of the video; the higher the grade of the publishing account of the video, the higher the probability that the video is a high-quality video.
After the screening is completed for the plurality of video groups, a recommendation operation for the screened video is performed. By the method, the content of the recommended display can be scattered, and the recommendation of the video which is too similar is avoided.
In some embodiments, the above-mentioned processing of two videos according to video similarity may be achieved by: when the video similarity between the first video marked with the video tag and the second video not marked with the video tag is larger than a third similarity threshold value, taking the video tag of the first video as the video tag of the second video; the video tag is any one of video quality and video theme.
In the embodiment of the application, automatic labeling of video tags can be implemented according to the video similarity between videos. For example, when the video similarity between a first video labeled with a video tag and a second video not labeled with a video tag is greater than a set similarity threshold (named the third similarity threshold here for ease of distinction), the video tag of the first video is used as the video tag of the second video, where the video tag of the first video may be labeled manually or automatically; the video tag is either video quality or video theme (or video type). When the video tag is video quality, automatic mining of high-quality videos can be achieved; for example, in a video website there are a plurality of clip videos belonging to a popular movie or television series, and if the video tag of one clip video is "high-quality video", other clip videos that are also high quality can be mined in this way.
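A simple sketch of this tag propagation (the threshold value and the record structure are assumptions for illustration) is:

```python
import numpy as np

def propagate_tags(labeled, unlabeled, threshold=0.85):
    """Copy the video tag of a labeled video to any sufficiently similar unlabeled video."""
    def cos(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
    for video in unlabeled:
        for ref in labeled:
            if cos(video["vector"], ref["vector"]) > threshold:
                video["tag"] = ref["tag"]        # e.g. "high quality" or a topic label
                break
    return unlabeled

labeled = [{"vector": np.ones(8), "tag": "high quality"}]
unlabeled = [{"vector": np.ones(8) * 0.9, "tag": None}]
print(propagate_tags(labeled, unlabeled))
```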
As shown in fig. 3A, in the embodiment of the present application, feature extraction processing of multiple dimensions is performed on an image in a video to obtain an image vector, and the image vector, an audio vector and a text vector are fused into a video representation vector, so that the video representation vector can effectively and sufficiently represent actual semantics of the video, and finally obtained video similarity can also conform to similarity perceived by human senses, thereby improving accuracy of similar video recognition. When the video is processed according to the obtained video similarity, the processing precision can be effectively improved, and the method is applicable to various processing scenes, such as similar video filtering.
In some embodiments, referring to fig. 3B, fig. 3B is a schematic flow chart of a similar video processing method based on artificial intelligence according to an embodiment of the present application, and step 101 shown in fig. 3A may be implemented through steps 201 to 204, which will be described in connection with the steps.
In step 201, feature extraction processing is performed on the image through the multi-classification object detection model, so as to obtain an object feature vector.
In the embodiment of the application, the plurality of dimensions for performing feature extraction processing on the image may include a target dimension, a scene dimension and a face dimension. In the target dimension, feature extraction processing is performed on the image through the target detection model to obtain a target feature vector, where the classification task of the target detection model is multi-classification, and the predictable targets may include, without limitation, a face, a cat, a dog, an automobile and the like. The type of the target detection model is also not limited, and may be, for example, a Single Shot MultiBox Detector (SSD) model or a You Only Look Once (YOLO) model.
In step 202, feature extraction processing is performed on the image through the scene recognition model, so as to obtain a scene feature vector.
In the scene dimension, feature extraction processing is performed on the image through a scene recognition model to obtain a scene feature vector, where the classification task of the scene recognition model is multi-classification, and the recognizable scenes may include mountains, rivers, lakes, seas, common urban facilities and the like. The type of the scene recognition model is not limited, and may be, for example, a NetVLAD model.
In step 203, feature extraction processing is performed on the image through the two-classification face detection model, so as to obtain a face feature vector.
In the face dimension, feature extraction processing is performed on the image through a face detection model to obtain a face feature vector, where the classification task of the face detection model is binary classification, i.e. detecting whether a face exists in the image. The type of the face detection model is also not limited, and may be, for example, a Multi-Task Convolutional Neural Network (MTCNN) model.
In step 204, fusion processing is performed on the target feature vector, the scene feature vector and the face feature vector to obtain an image vector of the image.
Here, the manner of the fusion process also includes, but is not limited to, a splicing process, an averaging process, a product process, and a weighting process.
In some embodiments, the above-mentioned fusion processing of the target feature vector, the scene feature vector, and the face feature vector may be implemented in such a manner that an image vector of the image is obtained: according to the video theme of the video, determining weights of a target feature vector, a scene feature vector and a face feature vector; weighting the target feature vector, the scene feature vector and the face feature vector to obtain an image vector of the image; the video theme is any one of a target corresponding to the target detection model, a scene corresponding to the scene recognition model and a face corresponding to the face detection model; the weight of the vectors that conform to the video theme is greater than the weight of the vectors that do not conform to the video theme.
Here, the target feature vector, the scene feature vector, and the face feature vector may be weighted and summed to obtain an image vector of the image. Because the emphasis points of the images in different videos are different, the weights of the target feature vector, the scene feature vector and the face feature vector can be adjusted in a targeted manner according to the video theme, wherein the video theme is any one of the target, the scene and the face, for example, in short videos, most of the video theme is the face; the weight of the vectors that conform to the video theme is greater than the weight of the vectors that do not conform to the video theme.
It should be noted that, in the special case where the predictable targets of the target detection model include a face, when the video theme is a face, the weight of the face feature vector may be set to be greater than or equal to the weight of the target feature vector, which in turn is greater than the weight of the scene feature vector. In this way, the proportion of the information of each dimension in the image vector can be adjusted according to the actual video theme, further improving the accuracy of the image vector.
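A sketch of how the weights might be selected according to the video theme is shown below; the concrete weight values are illustrative assumptions, the only constraint being that the vector matching the theme receives the largest weight:

```python
import numpy as np

# Assumed per-theme weights for (target, scene, face); the vector matching the theme is weighted highest.
THEME_WEIGHTS = {
    "target": (0.6, 0.2, 0.2),
    "scene":  (0.2, 0.6, 0.2),
    "face":   (0.2, 0.2, 0.6),
}

def fuse_by_theme(target_vec, scene_vec, face_vec, theme: str) -> np.ndarray:
    wt, ws, wf = THEME_WEIGHTS[theme]
    return wt * target_vec + ws * scene_vec + wf * face_vec

print(fuse_by_theme(np.ones(4), np.zeros(4), np.full(4, 2.0), "face"))
```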
As shown in fig. 3B, in the embodiment of the present application, information in an image is extracted from three dimensions of a target, a scene and a face, so that an obtained image vector can fully and effectively represent an actual meaning of the image, and accuracy of similar video recognition can be further improved.
In some embodiments, referring to fig. 3C, fig. 3C is a schematic flow chart of a similar video processing method based on artificial intelligence according to an embodiment of the present application, and after step 101 shown in fig. 3A, in step 301, an image vector of a plurality of frame-extracted images in a video may be weighted to obtain a global image vector.
A video generally includes a plurality of images, and after obtaining an image vector of each image, the image vectors of the plurality of images may be fused to obtain an overall image vector.
For example, in a short video scenario, the video typically includes a cover image and a plurality of video frame images, and the cover image may be one of the video frame images. Because the number of video frame images is usually large, performing feature extraction processing on all of them would consume a large amount of computing resources and cause excessive processing delay; therefore, frame extraction processing may be performed on the video frame images to obtain a plurality of frame-extracted images, and only the frame-extracted images are subjected to subsequent feature extraction processing. After the image vectors of the frame-extracted images are obtained, they are weighted and summed to obtain the global image vector.
In some embodiments, prior to step 301, further comprising: performing traversing processing on a plurality of video frame images in the video, and performing the following processing on the traversed video frame images: when the difference of the traversed video frame image and the adjacent video frame image in the image parameters is larger than a difference threshold, taking the traversed video frame image as a frame extraction image obtained by frame extraction processing, and taking the video frame image separated from the traversed video frame image by a set interval as the frame extraction image obtained by frame extraction processing; wherein the image parameter is any one of brightness, contrast, definition and resolution.
For example, the traversing process may be performed on all video frame images in the video, and when the difference between the traversed video frame image and the previous video frame image in the image parameter is greater than a set difference threshold, it is proved that the traversed video frame image includes more critical information, so that the traversed video frame image is used as a frame extraction image obtained by frame extraction process, and in order to avoid the insufficient information amount caused by the too small number of frame extraction images, the video frame image separated from the traversed video frame image by a set interval is also used as a frame extraction image obtained by frame extraction process, where the image parameter is any one of brightness, contrast, definition and resolution.
It should be noted that the set interval may be a duration interval, such as 1 second, or a frame-number interval, such as 5 frames. The video frame images separated from the traversed video frame image by the set interval include both the video frame image before it and the video frame image after it; for example, with a set interval of 1 second, the video frame image 1 second before the traversed video frame image and the video frame image 1 second after it are both taken as frame-extracted images. In this way, the subsequent amount of calculation is prevented from becoming too large due to too many frame-extracted images, and the amount of information is prevented from being insufficient due to too few.
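One possible implementation of this frame selection rule is sketched below, using mean brightness as the image parameter and a 5-frame interval, both of which are assumptions for illustration:

```python
import numpy as np

def extract_key_frames(frames, diff_threshold=20.0, interval=5):
    """frames: list of grayscale frames (2-D arrays). Returns indices of extracted frames."""
    brightness = [float(f.mean()) for f in frames]
    selected = set()
    for i in range(1, len(frames)):
        # Keep frames whose brightness differs strongly from the previous frame...
        if abs(brightness[i] - brightness[i - 1]) > diff_threshold:
            selected.add(i)
            # ...and also keep the frames a fixed interval before and after them.
            if i - interval >= 0:
                selected.add(i - interval)
            if i + interval < len(frames):
                selected.add(i + interval)
    return sorted(selected)

clip = [np.full((4, 4), v, dtype=np.float32) for v in (10, 12, 80, 81, 83, 15, 16, 17)]
print(extract_key_frames(clip))
```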
In step 302, the global image vector is stitched with the image vector of the cover image of the video, so as to obtain an overall image vector.
Because the cover image of the video includes essence information of the video and can be mutually complemented with the frame image of the video, the global image vector obtained in step 301 and the image vector of the cover image of the video are fused to obtain an overall image vector. The manner of the fusion process includes at least one of a splicing process, an averaging process, a product process, and a weighting process, for example, a splicing process may be employed herein.
In fig. 3C, after step 102 shown in fig. 3A, in step 303, weighting processing may be performed on the audio vectors corresponding to the plurality of frame-extracted images to obtain an overall audio vector.
For example, after performing feature extraction processing on audio through VGGish, audio vectors corresponding to respective video frame images in video can be obtained. The fusion processing of the image vectors of the plurality of images in the video can be performed, and at the same time, the fusion processing of the audio vectors corresponding to the plurality of images (referred to herein as video frame images) can be performed, so as to obtain an overall audio vector. Also, the manner of the fusion processing here includes at least one of a splicing processing, an averaging processing, a product processing, and a weighting processing.
For example, on the basis that the video is subjected to frame extraction processing to obtain a plurality of frame extraction images, the audio vectors corresponding to the frame extraction images can be weighted and summed to obtain an overall audio vector.
In fig. 3C, after step 103 shown in fig. 3A, in step 304, a concatenation process may further be performed on the text vector of the title text, the text vector of the abstract text, and the text vectors of the texts in the plurality of frame-extracted images, to obtain an overall text vector.
Meanwhile, text vectors of a plurality of texts in the video can be fused, and an integral text vector is obtained. It should be noted that the text in the video is not necessarily located in the video frame image, and may be, for example, the title text and the abstract text that are uploaded by the user at the same time when the user uploads the video. Also, the manner of the fusion processing here includes at least one of a splicing processing, an averaging processing, a product processing, and a weighting processing.
For example, on the basis that the plurality of frame-extracted images are obtained by performing frame extraction processing on the video, feature extraction processing may be performed only on the title text, the abstract text and the texts in the frame-extracted images, and the resulting text vectors may be spliced to obtain the overall text vector.
In fig. 3C, step 104 shown in fig. 3A may be updated to step 305, and in step 305, the fusion processing is performed on the whole image vector, the whole audio vector and the whole text vector, so as to obtain a video representation vector of the video.
Also, the manner of the fusion processing here includes at least one of a splicing processing, an averaging processing, a product processing, and a weighting processing.
As shown in fig. 3C, in the embodiment of the present application, the frame extraction image is obtained through frame extraction processing, so that the workload of feature extraction processing and fusion processing can be reduced, thereby reducing the processing load of the electronic device, improving the processing efficiency, and simultaneously ensuring the accuracy of similar video recognition.
In some embodiments, referring to fig. 3D, fig. 3D is a schematic flow chart of indexing and identifying non-original video provided in the embodiments of the present application, and will be described with reference to the steps shown in fig. 3D.
In step 401, videos stored in the database within a set period of time preceding the real-time are used as stock videos, and indexes of the video representation vectors of the plurality of stock videos are established in a stock index space.
In the embodiment of the present application, in order to improve the efficiency of similar video identification, an index may be established for a video representation vector of a video, where the video representation vector and the index are 1:1, i.e. each video representation vector corresponds to a separate index. Meanwhile, in order to avoid adverse effects on the performance of the database caused by a large number of mixed reads and writes, the stock index space and the increment index space can be divided. For the stock index space, the video stored in the database in a set time period before the real-time can be used as the stock video, and the index of the video representing vector of each stock video is built in the stock index space, and the set time period can be set according to the timeliness requirement of similar video identification, for example, the set time period is set to 89 days.
In some embodiments, when establishing the indexes of the video representation vectors of the plurality of stock videos in the stock index space, the method further comprises: for the video representation vector of each stock video, performing any one of the following processes: taking the video representation vector of the stock video as the index of the video representation vector of the stock video; or performing hash processing on the video representation vector of the stock video, and taking the result obtained by the hash processing as the index of the video representation vector of the stock video.
The embodiment of the application provides two modes for establishing indexes, wherein the first mode is to directly take the video representation vector of the stock video as the index of the video representation vector; in the second mode, the video representation vector of the stock video is hashed, and the result of the hashing is used as an index of the video representation vector. The two ways are equally applicable to the subsequent process of establishing an index of video representation vectors of delta video.
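A minimal sketch of the two index-building modes (the hash function and the key layout are assumptions; any stable hash would serve the same purpose):

```python
import hashlib
import numpy as np

def index_key(video_vector, use_hash=False):
    """Build an index key for a stock video's representation vector.

    Mode 1 (use_hash=False): the vector itself is used as the index.
    Mode 2 (use_hash=True):  a hash of the vector is used as the index.
    """
    vec = np.asarray(video_vector, dtype=np.float32)
    if not use_hash:
        return vec.tobytes()                         # raw vector bytes as the key
    return hashlib.sha1(vec.tobytes()).hexdigest()   # hash result as the key

stock_index_space = {}
vector = np.random.rand(256).astype(np.float32)
stock_index_space[index_key(vector, use_hash=True)] = "stock_video_00001"
```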
In step 402, storing the new video as an incremental video to a database, and establishing an index of a video representation vector of the new video in an incremental index space; wherein the index in the delta index space is used to periodically migrate into the inventory index space.
Here, if a new video is acquired, the new video is stored as an incremental video to the database. For a new video, the divided stock index space is read-only, and the increment index space supports reading and writing at the same time, so that the problem of a large number of mixed reading and writing can be avoided. After the video representation vector of the new video is obtained, an index of the video representation vector of the new video is established in the delta index space.
In order to ensure that the indexes in the incremental index space are not too redundant, the indexes in the incremental index space may be periodically migrated into the stock index space, where the period (switching period) may be preset, for example, set to 1 day. Meanwhile, when the switching period arrives, the index of the video representation vector which no longer belongs to the stock video (i.e., the storage time of the video is no longer located in a set period of time before the real-time) in the stock index space may be deleted.
In some embodiments, the video representation vector of the stock video and the video representation vector of the new video are used to store into an index library, the index library comprising a stock index space and an delta index space; after storing the new video as the delta video to the database, further comprising: synchronously storing video representation vectors of the new video to a backup index library copy of the index library being used; when the switching period is reached, the indexes are re-established in the stock index space of the backup index library copy according to the video representation vector of the stock video in the backup index library copy, and the index library being used is switched with the backup index library copy.
In the embodiment of the application, the video representation vector and the index may be stored in the same database, or the video representation vector and the index may be separately stored in an additional index library, so as to accelerate the reading efficiency. For the latter case, the index library includes an inventory index space and an increment index space, and after obtaining a video representation vector of the inventory video, the video representation vector is stored into the index library to establish an index of the video representation vector in the inventory index space; after the video representation vector of the delta video is obtained, the video representation vector is stored to an index library to build an index of the video representation vector in the delta index space. In order to facilitate identifying whether the video representing vector in the index library corresponds to the stock video or the delta video, the storage time of the video representing vector corresponding to the video may also be stored in the index library.
In order to avoid that the performance of indexes is adversely affected due to the fact that indexes are deleted in a large number in an index library, a Double Buffer mechanism can be applied in the embodiment of the application. For example, a backup index library copy of the index library being used is created, and when a video representation vector of a new video is stored to the index library being used, the video representation vector is synchronously stored to the backup index library copy. When the switching period is reached, the indexes are re-established in the stock index space of the backup index library copy according to the video representation vector of the stock video in the backup index library copy, and the index library being used is switched with the backup index library copy. Therefore, timeliness of indexes in the in-use stock index space can be guaranteed, and seamless double-library switching without on-line perception can be realized.
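A minimal sketch of the Double Buffer idea described above (class and method names are hypothetical; persistence, locking and offline index rebuilding are omitted):

```python
class IndexLibrary:
    """Hypothetical index library holding a stock index space and a delta index space."""
    def __init__(self):
        self.stock_space = {}   # read-only while serving
        self.delta_space = {}   # supports reading and writing

    def rebuild_stock(self, retained_stock_vectors):
        # Re-establish the stock indexes from the stock videos still within the validity period
        self.stock_space = dict(retained_stock_vectors)
        self.delta_space = {}

class DoubleBufferedIndex:
    def __init__(self):
        self.active = IndexLibrary()    # the index library being used
        self.standby = IndexLibrary()   # the backup index library copy

    def add_new_video(self, video_id, vector):
        # Video representation vectors of new videos are written to both copies synchronously
        self.active.delta_space[video_id] = vector
        self.standby.delta_space[video_id] = vector

    def on_switching_period(self, retained_stock_vectors):
        # Rebuild the standby copy's stock space, then swap it with the library in use
        self.standby.rebuild_stock(retained_stock_vectors)
        self.active, self.standby = self.standby, self.active
```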
In step 403, the stock index space and the increment index space are read to determine the video similarity between the new video and the plurality of videos to be compared; the video to be compared comprises a plurality of stock videos and incremental videos except the new video.
When the similar video identification is carried out on the new video, the stock index space and the increment index space are read at the same time, namely, the vector similarity between the video representation vector of the new video and the video representation vector of each stock video is determined, and the vector similarity between the video representation vector of the new video and the video representation vector of the increment video except the new video is determined. In other words, all stock videos and all delta videos except the new video are used as videos to be compared of the new video.
In step 404, when the video similarity between the new video and any one of the videos to be compared is greater than the first similarity threshold, the new video is identified as non-original video.
When the video similarity between the new video and any one of the videos to be compared is greater than a set similarity threshold (named as a first similarity threshold for convenience of distinction), the new video is identified as a non-original video, and further processing is performed, for example, the new video, the video representation vector and the index of the new video can be deleted directly, and a re-uploading prompt can be sent to a device for uploading the new video. It should be noted that, in the embodiments of the present application, other functions may be implemented through indexing, and are not limited to identifying non-original video.
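As a sketch of the comparison in steps 403-404 (the threshold value and the use of cosine similarity are assumptions; any vector similarity measure could be substituted):

```python
import numpy as np

def is_non_original(new_vector, compared_vectors, first_similarity_threshold=0.9):
    """Return True if the new video is similar to any of the videos to be compared.

    new_vector:        (d,) video representation vector of the new video
    compared_vectors:  (n, d) vectors of the stock videos and other delta videos
    """
    new_vector = new_vector / np.linalg.norm(new_vector)
    compared = compared_vectors / np.linalg.norm(compared_vectors, axis=1, keepdims=True)
    similarities = compared @ new_vector            # cosine similarity per video
    return bool((similarities > first_similarity_threshold).any())
```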
As shown in fig. 3D, in the embodiment of the present application, by dividing the stock index space and the increment index space, read-write separation is implemented, so that efficiency of obtaining video similarity can be improved, and performance degradation of a database (or an index library) caused by performing a large amount of hybrid read-write is avoided.
In the following, an exemplary application of the embodiment of the present application in an actual application scenario will be described, where the embodiment of the present application may identify whether a video is an original video, and for the identified non-original video, a masking or deleting operation may be performed, so as to encourage the content producer to upload the original video, and ensure a good ecology of the video service. The following is a detailed description.
With the rapid development of the internet, the threshold for content production decreases, and the upload amount (distribution amount) of video increases at an exponential rate. Sources of such content include a wide variety of content producers, such as Professional Generated Content (PGC) from we-media and institutions, User Generated Content (UGC), and the like. For example, in video services that rely on instant messaging applications, the peak daily upload of videos from various sources has exceeded the million level.
With the rapid increase in the number of videos, higher requirements are placed on the auditing speed and auditing quality of the videos; in particular, if UGC cannot be audited and processed quickly, it cannot be distributed quickly, which greatly harms the user experience. Meanwhile, in order to increase their income, some content uploaders may upload a large number of similar videos, for example by editing, cutting and transforming the title text, cover image, watermark and video frame images of a certain video in various ways, so as to bypass the similar video identification and deduplication of the video service platform. Such transported videos prevent the videos of normal creators from being exposed and occupy a large amount of traffic, which is unfavorable for the healthy development of the whole content ecology.
From the perspective of the user consuming video, the user can feel the image, audio and text of the video, so similar video recognition is performed based on the three modes in the embodiment of the application, and as an example, a schematic diagram of generating video representation vectors as shown in fig. 4 is provided, and will be described with reference to fig. 4.
1) Visual modality. Here, the video frame images and the cover image in the video may be used as images containing visual information, that is, as images for performing feature extraction processing: the video frame images are the main body of the video content and contain the main content information, while the cover image is the essence of the video content, and the two complement each other. The embodiment of the application does not limit the model for extracting visual information, which may be, for example, VGG16, an Inception series model, or an image classification model such as ResNet; in fig. 4, an Inception-ResNet-v2 model is taken as an example. In training the model for extracting visual information, a metric learning manner may be adopted, the basic idea of which is that two images of the same class are closer together in vector space, while two images of different classes are farther apart. The loss function of the model for extracting visual information is not limited either, and may be, for example, a Contrastive Loss or a Triplet Loss.
In order to improve the information richness of the image vector, in the embodiment of the present application, as shown in fig. 5, feature extraction processing in three dimensions, namely target (object), scene and face, may be performed on the image in the video. For example, feature extraction processing is performed on the image through a target detection model (such as an SSD model or a YOLOv5 model) to obtain a target feature vector; feature extraction processing is performed on the image through a scene recognition model (such as a NetVLAD model) to obtain a scene feature vector; and feature extraction processing is performed on the image through a face detection model (such as an MTCNN model) to obtain a face feature vector. Because the contribution degrees of the three dimensions of information are inconsistent across different videos, after the feature extraction processing is completed, the target feature vector, the scene feature vector and the face feature vector can be fused through a Collaborative Gating mechanism, that is, subjected to weighted summation, to obtain the image vector. The collaborative gating mechanism is effectively a multi-channel switch used to adjust the weight of each dimension of information in the image: for example, if the video theme is a face, the weight of the face feature vector is correspondingly increased so that it is larger than the weights of the target feature vector and the scene feature vector; if the video theme is landscape, the weight of the scene feature vector is correspondingly increased so that it is larger than the weights of the target feature vector and the face feature vector. The video theme can be marked manually or identified automatically by a machine learning algorithm. This is particularly suitable for short video scenes, one major characteristic of which is being centered on people, so that the human face is a very important part of the video content; after the face feature vector is fused into the image vector, different videos of the same person can be identified.
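A simplified sketch of the gated fusion of the three dimensions (this is a plain softmax gate standing in for the Collaborative Gating mechanism; the layer sizes are assumptions):

```python
import torch
import torch.nn as nn

class SimpleGatedFusion(nn.Module):
    """Weighted fusion of target, scene and face feature vectors via a learned gate."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(3 * dim, 3)   # one logit per feature branch

    def forward(self, target_vec, scene_vec, face_vec):
        # Each input: (batch, dim); e.g. for a face-themed video the face weight grows large
        stacked = torch.stack([target_vec, scene_vec, face_vec], dim=1)        # (B, 3, dim)
        logits = self.gate(torch.cat([target_vec, scene_vec, face_vec], dim=-1))
        weights = torch.softmax(logits, dim=-1)                                # (B, 3)
        return (weights.unsqueeze(-1) * stacked).sum(dim=1)                    # image vector
```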
Because there is a time sequence correlation between the image vectors of different video frame images in the same video, in the embodiment of the application, a NetVLAD model is used as an aggregation network of image vectors, where the NetVLAD model is used to weight and sum the image vectors of multiple video frame images by a weight that can be learned, so as to obtain a global image vector.
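A compact sketch of a NetVLAD-style aggregation layer for the frame-image vectors (the cluster count is an assumption, and the output here is the flattened VLAD descriptor rather than a vector of the original dimension):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NetVLADPooling(nn.Module):
    """Aggregate per-frame image vectors into a single global descriptor."""
    def __init__(self, dim, num_clusters=8):
        super().__init__()
        self.centroids = nn.Parameter(torch.randn(num_clusters, dim))
        self.assign = nn.Linear(dim, num_clusters)   # learnable soft-assignment weights

    def forward(self, frame_vecs):
        # frame_vecs: (batch, num_frames, dim)
        soft_assign = F.softmax(self.assign(frame_vecs), dim=-1)      # (B, N, K)
        residuals = frame_vecs.unsqueeze(2) - self.centroids          # (B, N, K, dim)
        vlad = (soft_assign.unsqueeze(-1) * residuals).sum(dim=1)     # (B, K, dim)
        vlad = F.normalize(vlad, dim=-1)                              # intra-normalization
        return F.normalize(vlad.flatten(1), dim=-1)                   # (B, K * dim)
```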
2) An audio modality. In some videos, such as tutorial-like videos, the key information of the video may be represented through audio, so that in the embodiments of the present application, the audio information may be extracted. For example, an audio signal is separated from a video, the audio signal is converted by calculating mel-frequency cepstrum coefficient (Mel Frequency Cepstrum Coefficient, MFCC) characteristics, and then the converted audio signal is subjected to characteristic extraction processing by using a VGGish model, so as to obtain audio vectors corresponding to different video frame images. Similar to image vectors, a NetVLAD model may be employed herein as an aggregation network of audio vectors, where the NetVLAD model is used to weight sum audio vectors corresponding to multiple video frame images by weights that can be learned, resulting in a global audio vector (corresponding to the overall audio vector above).
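An illustrative sketch of the audio branch (MFCC windows computed with librosa are used as a stand-in for the VGGish features; the window length and MFCC count are assumptions):

```python
import numpy as np
import librosa

def audio_vectors(audio_path, sr=16000, window_seconds=1.0, n_mfcc=13):
    """Return one MFCC-based vector per audio window, mirroring the per-frame audio vectors."""
    y, sr = librosa.load(audio_path, sr=sr)
    hop = int(sr * window_seconds)
    vectors = []
    for start in range(0, len(y), hop):
        window = y[start:start + hop]
        if len(window) < hop // 2:
            break                                   # skip a very short tail window
        mfcc = librosa.feature.mfcc(y=window, sr=sr, n_mfcc=n_mfcc)   # (n_mfcc, t)
        vectors.append(mfcc.mean(axis=1))           # one vector per window of audio
    return np.stack(vectors) if vectors else np.zeros((0, n_mfcc), dtype=np.float32)
```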
3) Text modality. For example, the text in the video may include title text (not shown in fig. 4) and abstract text (e.g., in commodity-type videos, the abstract text may be the text related to a commodity link), as well as text in the video frame images. The title text and abstract text are typically uploaded manually by the uploader (producer) of the video, while the text in the video frame images can be recognized by OCR technology.
For the text in the video, feature extraction processing may be performed through a BERT model to obtain a text vector, where the vector output by the second-to-last layer of the BERT model may be used as the text vector. This is because the last layer of the BERT model is too close to the targets of the BERT model's own classification tasks, and if the vector output by the last layer were used as the text vector, there might be a certain deviation in the similar video recognition task of the embodiment of the present application. The training process of the BERT model is not limited, and may include a Pre-Training stage and a Fine-Tuning stage. In the pre-training stage, the training tasks may include a Masked Language Model (MLM) task and a Next Sentence Prediction (NSP) task, where the MLM task masks part of the input text and then predicts the masked text through the BERT model, and the NSP task determines, for input sentences A and B, whether sentence B is the next sentence of sentence A. The fine-tuning stage may include the Multi-Genre Natural Language Inference (MNLI) task for textual entailment recognition, the Named Entity Recognition (NER) task, and the Stanford Question Answering Dataset (SQuAD) task.
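A sketch of extracting such a text vector with the HuggingFace transformers library (the checkpoint name and the mean-pooling over tokens are assumptions; the disclosure only specifies using the second-to-last layer's output):

```python
import torch
from transformers import AutoTokenizer, AutoModel

def text_vector(text, model_name="bert-base-chinese"):
    """Encode text with BERT and take the second-to-last hidden layer as the text vector."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name, output_hidden_states=True)
    model.eval()
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        outputs = model(**inputs)
    second_to_last = outputs.hidden_states[-2]      # (1, seq_len, hidden); -1 would be the last layer
    return second_to_last.mean(dim=1).squeeze(0)    # mean-pool over tokens -> (hidden,)
```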
After obtaining the image vector of the visual modality, the audio vector of the audio modality and the text vector of the text modality, multi-modal fusion is performed to obtain the video representation vector; the multi-modal fusion manner includes, but is not limited to, direct splicing, product (including inner product and outer product) and weighted average. For example, if the accuracy requirement for similar video recognition is high, direct splicing can be adopted to obtain a longer video representation vector; if the efficiency requirement for similar video identification is higher, weighted averaging can be adopted to obtain a shorter video representation vector at the expense of a small part of the information, so that the subsequent calculation amount can be reduced. Taking direct splicing as an example of the multi-modal fusion, the global image vector of the video, the image vector of the cover image, the global audio vector, the text vector of the title text, the text vector of the abstract text and the text vector of the text in the video frame images can be spliced to obtain the video representation vector of the video. In this way, the video similarity between two videos can be equal to the vector similarity between the video representation vectors of the two videos.
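A sketch of the direct-splicing fusion and of computing the video similarity as the vector similarity (cosine similarity is used here as one possible choice of vector similarity):

```python
import numpy as np

def video_representation_vector(global_image, cover_image, global_audio,
                                title_text, abstract_text, frame_text):
    """Direct splicing of the modality vectors into one video representation vector."""
    return np.concatenate([global_image, cover_image, global_audio,
                           title_text, abstract_text, frame_text])

def video_similarity(vec_a, vec_b):
    """Video similarity taken as the cosine similarity of the two representation vectors."""
    denom = np.linalg.norm(vec_a) * np.linalg.norm(vec_b) + 1e-12
    return float(np.dot(vec_a, vec_b) / denom)
```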
By determining the video representation vector of the video, the video service platform can be helped to more accurately identify similar videos, so that video deduplication can be performed, for example, the video with the highest frame rate or resolution is reserved in a plurality of similar videos; when the manual auditing is involved, the video with the highest definition or resolution in a plurality of similar videos can be submitted to auditing personnel for auditing, and the rest videos can refer to the manual auditing result of the video, so that unnecessary manpower waste can be reduced; the video quality of the video can be automatically marked, for example, the video A is marked (such as artificial marking) as a high-quality video, and the video B similar to the video A can be automatically marked as a high-quality video; the video representation vector is used as the accurate description of video semantics, can be used as an important feature and is directly used for recall and sequencing phases of video recommendation; in addition, the video representation vector can be used for scattering of a recommendation display layer, dense and repeated recommendation is controlled, user experience can be effectively improved, for example, videos A and B to be recommended to a user are similar, videos with higher levels of issued accounts in the videos A and B are screened out, and finally the videos are recommended to the user.
In the process of determining the video representation vector, the conversion from the high-dimensional sparse feature vector to the low-dimensional dense feature vector can be realized, and the accuracy of similar video identification is improved. In the embodiment of the application, the management of the video representation vectors of a plurality of videos can be realized through a distributed vector retrieval service, wherein a Faiss library can be utilized to provide efficient similarity searching and clustering service for dense video representation vectors.
Since millions of videos may be newly added daily on the content link, it is necessary to add videos to the database in real time and to add the video representation vectors of the videos to an index library (here, a Faiss index library is taken as an example) for similarity calculation; that is, the Faiss index library includes the video representation vectors of the videos and the indexes of the video representation vectors. In the embodiment of the present application, the generation manner of the index is not limited; for example, the video representation vector of the video may be used as the index of the video, or hash processing (through a hash function of Faiss) may be performed on the video representation vector of the video, and the obtained hash value may be used as the index of the video. In the distributed retrieval process, in order to avoid the adverse influence on performance caused by a large number of mixed reads and writes of the Faiss index library, a read-write separation mechanism is applied in the embodiment of the application.
Specifically, as shown in fig. 6, two sets of indexes, a large one and a small one, are built in the Faiss index library, corresponding to the stock index space and the increment index space respectively, where the stock index space is read-only and the increment index space supports reading and writing at the same time. Taking the example that the validity period of similar video identification is videos within 3 months, the stock index space may include the indexes of the videos of the 89 days (corresponding to the set period above) before the real-time, and the increment index space may include the indexes of the videos stored on the current day. On the current day, each time 1 new video is put into storage, the index of the video is established in the increment index space; at the same time, both the stock index space and the increment index space are retrieved, and the obtained retrieval results are merged. It should be noted that a Faiss Proxy may be set for the Faiss index library, so as to perform read-write operations on the Faiss index library in response to external similar video identification requests (such as deduplication requests).
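A minimal sketch of this read-write separated retrieval with Faiss (flat inner-product indexes and L2-normalized vectors are assumptions; the real deployment is distributed and fronted by the Faiss Proxy):

```python
import numpy as np
import faiss

dim = 256
stock_index = faiss.IndexFlatIP(dim)   # stock index space: built offline, read-only
delta_index = faiss.IndexFlatIP(dim)   # increment index space: written during the day

def normalize(vectors):
    vectors = np.ascontiguousarray(vectors, dtype=np.float32)
    faiss.normalize_L2(vectors)        # inner product then equals cosine similarity
    return vectors

stock_index.add(normalize(np.random.rand(10000, dim)))    # existing stock vectors

def add_new_video(vector):
    delta_index.add(normalize(vector.reshape(1, -1)))      # only the delta space is written

def search_similar(query_vector, top_k=10):
    query = normalize(query_vector.reshape(1, -1))
    merged = []
    for index in (stock_index, delta_index):               # read both spaces
        if index.ntotal == 0:
            continue
        scores, ids = index.search(query, min(top_k, index.ntotal))
        merged.extend(zip(scores[0].tolist(), ids[0].tolist()))
    return sorted(merged, reverse=True)[:top_k]            # merge the retrieval results
```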
Similar video recognition scenes often have timeliness requirements, and the data in the Faiss index library needs to be periodically eliminated; for example, the index of the earliest 1 day in the Faiss index library is eliminated every day. Since deleting data in large amounts from the Faiss index library may adversely affect its performance, in the embodiment of the present application a Double Buffer switching mechanism is applied to perform data elimination. For example, for the Faiss index library being used (library 1 in fig. 6 as an example), a backup index library copy is created (library 2 in fig. 6 as an example), and the video representation vector stored to library 1 is synchronously stored to library 2. When the switching period arrives, for example, when a new day comes, library 2 is controlled to rebuild the index offline according to the video representation vectors of the videos of the last 89 days, and the use states of library 1 and library 2 are switched, that is, library 2 is put into use and library 1 serves as the standby index library copy. The index names in library 1 and library 2 may be stored in another database, such as a Redis database, and when switching between library 1 and library 2, the switching of indexes is accomplished by modifying the Redis index state.
For complex vector search scenarios with different purposes, there may be multiple Faiss index libraries, where the video representation vectors and validity periods in different Faiss index libraries are different, for example, in a global similar video recognition scenario, the corresponding Faiss index library includes video representation vectors of hundreds of millions of videos; in a scene of protecting videos issued by a large V head account, the number of video representation vectors included in a corresponding Faiss index library may be small, and only tens of thousands of video representation vectors are included; in the scene of protecting the new hot content, mainly considering the timeliness of the video, the number of video representation vectors included in the corresponding Faiss index library may be fewer, and the validity period is shorter, namely thousands of video representation vectors. For this case, in the embodiment of the present application, the abstract common component, i.e. the index library management module (Faiss Manager), is completely decoupled from the service, so that the index library management framework in fig. 6 can be used as a set of frameworks to manage multiple different Faiss index libraries, and implement multiplexing of vector retrieval and recall processes. Next, the respective modules included in the Faiss Manager are explained:
1) Version management: for managing different versions of the index, such as the indexes in library 1 and library 2 above, when a switching cycle arrives, the Faiss Proxy is notified to apply the new version of the index.
2) Model training: the resulting video representation vectors differ in terms of the versions used to manage the video representation vectors, e.g., the versions of the models used to perform feature extraction processing on the video.
3) Configuration management: the method is mainly used for recording addresses of equipment (such as a server) and modules (modules with independent functions and deployed on the server), wherein a plurality of modules can be deployed in one server, for example, a content reading module and a writing module are deployed in the server for storage, and the method is also used for recording configuration information of related storage size and table structure.
4) Delta sampling: used to extract portions (not shown in fig. 6) from the incremental index space, for analyzing and locating problems. For example, when a hot event occurs in real life, a creator A issues a video related to the hot event, a creator B issues a similar video after a period of time, and a creator C issues another similar video after a further period of time; the extracted samples can then be used to analyze such cases. The incremental sampling module is also used to record which part was extracted, the number of extracted indexes, the extraction time, and the like.
5) File management: for reading video representation vectors from a Cloud DataBase (CDB) and creating corresponding indexes. Here, the CDB may be used as a storage instance of the MySQL database for storing video representation vectors of a small number of videos, for example, video representation vectors of videos released on the same day, and may be stand-alone; the Faiss index library is a distributed storage, and is used for storing a large number of video representation vectors and corresponding indexes, for example, video representation vectors and corresponding indexes within several months, and the Faiss index library usually adopts a Solid State Disk (SSD) with higher speed. The FILE management module may also be used to store the index in pieces, i.e., in blocks, each block having multiple FILEs (FILEs). After the slicing storage, the file management module may also be used to record the original information of files from file_1 to file_n, including but not limited to file size, file save format, file version, validity period of the file, and creation time of the file.
The index library management framework shown in fig. 6 has the following advantages: it is highly abstract, a single set of components is multiplexed, reads and writes are consolidated, access is standardized, read-write separation of the large and small indexes is achieved so that performance can be guaranteed, the double buffers are switched seamlessly online, and the modules can be horizontally expanded.
Embodiments of the present application also provide a schematic diagram of an artificial intelligence based similar video processing system as shown in fig. 7, which will be described in connection with the respective parts shown in fig. 7:
1) Content production end: refers to a production end of content such as PGC, UGC or Multi-Channel Network (MCN), such as a mobile terminal. The content production end is used for communicating with the uplink and downlink content interface service, for example, uploading the video to the uplink and downlink content interface service through an interface provided by the uplink and downlink content interface service, and the uploaded video is used as a data source of subsequent video recommendation (namely content distribution). Before uploading the video through the content production end, the user can select the matched audio (such as music) in the video, clip the video, select the cover image, select the filter template, beautify the video, and the like, which is not limited. In the process of uploading the video, the content production end can send data such as uploading speed, whether uploading is blocked or not, whether uploading is failed or not and the like to the back end for statistical analysis.
2) Uplink and downlink content interface service:
(1) and communicating with a content production end, acquiring a video submitted by the front end, and storing a source file of the video into a content storage service.
(2) The meta information of the video submitted by the front end is stored in a meta information database, wherein the meta information includes, but is not limited to, title text, abstract text, cover image (link of cover image can be also used, and source file of cover image can be stored in a content storage service), author, code rate, file format, file size and release time of the video.
(3) The video is synchronized to a dispatch center service for subsequent video processing and streaming.
3) Meta information database: the metadata database is a core database of the video, and the metadata of the video issued by all content production ends is stored in the metadata database. In addition to the information uploaded by the content production end, the meta-information may also include information obtained by performing video processing, such as a label of whether original is created, and a video label obtained by labeling the video in a manual auditing process (may include multiple levels of classification, such as a video explaining xx-brand mobile phone, where the first level classification is science and technology, the second level classification is smart mobile phone, and the third level classification is domestic mobile phone, and may also have more detailed classification, such as xx-brand and a specific model of the mobile phone). That is, in the process of manual auditing, the information in the meta-information database is read, and meanwhile, the result obtained after manual auditing is returned into the meta-information database to update the meta-information.
4) Dispatch center service & manual auditing system:
(1) the dispatching center service is responsible for the whole dispatching process of video streaming, receives the video in storage through the uplink and downlink content interface service, and then acquires the meta information of the video from the meta information database.
(2) The processing of the video by the dispatching center service mainly includes machine processing by the dispatched machine processing system (such as the similar identification service) and manual auditing by the dispatched manual auditing system; the dispatching center service is also used for controlling the dispatching sequence and priority. The machine processing core includes various quality judgments such as low-quality filtering and video similarity recognition, where the result of the similarity recognition (such as whether an original mark is generated) is written into the meta-information database; videos identified as highly similar will not be subjected to repeated secondary auditing by staff, so that human resources can be saved.
(3) The manual auditing system is a carrier of manual service capability and is mainly used for auditing and filtering videos which cannot be judged by machines such as illegal information and the like, and can be used for labeling the videos. After the manual auditing system passes a certain video, the video is enabled and can be recommended to a content consumption end through a content distribution outlet service (such as a recommendation engine, a search engine or manual operation). The content distribution outlet service can recommend the video to the content consumption end through the form of Feeds stream.
5) Content storage service:
(1) for storing entity information other than meta information of the video, such as source files of the video.
(2) When constructing the video representation vector, temporary storage of the frame-extracted image and the audio in the video is provided, and repeated extraction is avoided.
6) Downloading a file system:
(1) the original source file of the video is downloaded from the content storage service, the speed and progress of the download is controlled, and the download file system generally comprises a group of parallel servers, and consists of related task scheduling and distribution clusters.
(2) For the downloaded source file, the frame extraction and audio extraction service is invoked to obtain the necessary key frames and audio from the source file for the subsequent construction of the video representation vector.
7) Frame extraction and audio extraction services: according to the feature construction manner of the video modality and the audio modality, frame extraction processing is performed on the source file of the video to obtain frame extraction images, and the audio in the source file is extracted. In the frame extraction process, if a uniform frame extraction strategy is adopted (for example, extracting 1 frame per second), the frame extraction frequency may be too high, which increases the burden and calculation amount of frame extraction; conversely, if a variable-length frame extraction strategy is adopted (for example, extracting one frame every 3 seconds, 5 seconds or 7 seconds), the frame extraction frequency may be insufficient, and the information content of the obtained frame extraction images is insufficient. Therefore, in the embodiment of the present application, scene-switching frame images with obvious brightness changes (for example, the brightness difference between two adjacent frames is greater than a difference threshold) are extracted from the source file of the video, and at the same time, based on the scene-switching frame images, video frame images at equal intervals (which may be a duration interval, for example, 1 second) are extracted and also used as frame extraction images, so as to fill in frames; a minimal sketch of such a strategy is given below. In this way, the burden and calculation amount of frame extraction are not excessive, and the obtained frame extraction images can contain comprehensive and rich information.
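The sketch below uses OpenCV; the brightness-difference threshold and the 1-second supplementary interval are illustrative values only:

```python
import cv2

def extract_frames(video_path, brightness_diff_threshold=30.0, interval_seconds=1.0):
    """Extract scene-switching frames plus equal-interval frames as frame filling."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
    interval = max(int(fps * interval_seconds), 1)
    frames, prev_brightness, idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        brightness = float(gray.mean())
        scene_switch = (prev_brightness is not None and
                        abs(brightness - prev_brightness) > brightness_diff_threshold)
        if scene_switch or idx % interval == 0:
            frames.append(frame)                  # keep as a frame extraction image
        prev_brightness = brightness
        idx += 1
    cap.release()
    return frames
```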
8) Vector generation service: the video representation vector of the video is constructed in the above multi-modal feature extraction and multi-modal fusion manner.
9) Distributed vector retrieval service: based on the constructed video representation vector, the indexes of the video representation vector are distributed managed and searched and matched, and particularly as shown in fig. 6, a distributed Faiss index library, a read-write separation mechanism and a Double Buffer mechanism can be adopted to manage massive indexes.
10) Similar video recognition service: the distributed vector retrieval service is called, and similar video identification for different scenes and purposes is realized according to different index libraries; the corresponding video representation vectors and validity periods differ for the global video similarity identification scene, the protection scene for videos of large-V head accounts, and the protection scene for new hot content (new hot videos). Here, whether a video is an original video may be identified by the similar video identification service; for example, if video B is determined to be similar to video A, which has an earlier storage time (uploading time), video B is determined to be a non-original video, and then an operation of deleting video B or shielding video B may be performed, so as to protect the benefit of the content producer of the original video and encourage the uploading of original videos.
11) Content consumption end: as the consumer side, it communicates with the uplink and downlink content interface service to obtain access entry information of a plurality of videos (for example, the Feeds information stream browsed by a content consumer includes a plurality of information stream cards, each information stream card includes the title, cover image and publishing account name of a video, and the information stream card is the access entry information). The content consumption end is also used for responding to a trigger operation (such as a click operation) of the user on the access entry information and communicating with the content storage service according to the access entry information, so as to obtain the video source file corresponding to the access entry information for playing. Of course, the content consumption end can also communicate with the content distribution outlet service to directly acquire the recommended videos.
In the process of downloading and playing the video by the content consumption end, the data such as behavior data of a user, whether the playing is blocked, the loading time of the video, the playing click quantity and the like can be sent to the back end for statistical analysis.
Through the embodiments of the present application, the following technical effects can be realized: 1) Information of multiple modalities is integrated, which can improve the accuracy of similar video identification, effectively resist the various modifications and edits made to videos by video transporters, and prevent videos uploaded by transporters from successfully bypassing similar video identification; this is applicable to various video scenes such as short video scenes. The modifications and edits to a video include, but are not limited to, geometric transformations of the cover image (such as rotation, scale change and flipping), signal processing (such as changing illuminance and contrast), image noise attacks (such as Gaussian noise, watercolor, motion blur and mosaic tiles), modification of the title text (such as word modification and word-order modification), and modification of the video frame images themselves (such as clipping and stitching). Meanwhile, the processing efficiency of the video link can be improved.
2) The video representation vector can be used as a description of video semantics directly for recall and sort phases of video recommendation.
3) Through similar video identification, the recommendation presentation layer can be scattered, recommendation is controlled to be dense and repeated, and user experience can be effectively improved.
Continuing with the description of the exemplary structure in which the artificial intelligence based similar video processing device 455 provided by the embodiments of the present application is implemented as software modules: in some embodiments, as shown in fig. 2, the software modules of the artificial intelligence based similar video processing device 455 stored in the memory 450 may include: the first feature extraction module 4551, configured to perform feature extraction processing in multiple dimensions on an image in a video, and perform fusion processing on the extracted feature vectors of the multiple dimensions to obtain an image vector of the image; the second feature extraction module 4552, configured to perform feature extraction processing on the audio in the video to obtain an audio vector; the third feature extraction module 4553, configured to perform feature extraction processing on the text in the video to obtain a text vector; the fusion module 4554, configured to perform fusion processing on the image vector, the audio vector and the text vector to obtain a video representation vector of the video; the similarity determining module 4555, configured to use the vector similarity between the video representation vectors of any two videos as the video similarity between the two videos; and the processing module 4556, configured to process the two videos according to the video similarity.
In some embodiments, the feature vectors of the plurality of dimensions include a target feature vector, a scene feature vector, and a face feature vector; the first feature extraction module 4551 is further configured to: performing feature extraction processing on the image through a multi-classification target detection model to obtain a target feature vector; carrying out feature extraction processing on the image through a scene recognition model to obtain a scene feature vector; and carrying out feature extraction processing on the image through the two-classification face detection model to obtain a face feature vector.
In some embodiments, the first feature extraction module 4551 is further configured to: according to the video theme of the video, determining weights of a target feature vector, a scene feature vector and a face feature vector; weighting the target feature vector, the scene feature vector and the face feature vector to obtain an image vector of the image; the video theme is any one of a target corresponding to the target detection model, a scene corresponding to the scene recognition model and a face corresponding to the face detection model; the weight of the vectors that conform to the video theme is greater than the weight of the vectors that do not conform to the video theme.
In some embodiments, the artificial intelligence based similar video processing device 455 further comprises: the image fusion module is used for carrying out fusion processing on the image vectors of the plurality of images in the video to obtain an overall image vector; the audio fusion module is used for carrying out fusion processing on audio vectors corresponding to a plurality of images in the video to obtain an overall audio vector; the text fusion module is used for carrying out fusion processing on text vectors of a plurality of texts in the video to obtain an overall text vector; fusion module 4554, further configured to: and carrying out fusion processing on the integral image vector, the integral audio vector and the integral text vector to obtain a video representation vector of the video.
In some embodiments, the image used for performing feature extraction processing in the video includes a cover image and a plurality of frame extraction images obtained through frame extraction processing; the image fusion module is also used for: weighting the image vectors of the plurality of frame extraction images to obtain global image vectors; and splicing the global image vector with the image vector of the cover image to obtain an overall image vector.
In some embodiments, the artificial intelligence based similar video processing device 455 further comprises: the frame extraction module is used for performing traversing processing on a plurality of video frame images in the video and performing the following processing on the traversed video frame images: when the difference of the traversed video frame image and the adjacent video frame image in the image parameters is larger than a difference threshold, taking the traversed video frame image as a frame extraction image obtained by frame extraction processing, and taking the video frame image separated from the traversed video frame image by a set interval as the frame extraction image obtained by frame extraction processing; wherein the image parameter is any one of brightness, contrast, definition and resolution.
In some embodiments, the audio fusion module is further to: and carrying out weighting processing on the audio vectors corresponding to the plurality of frame-drawing images to obtain an overall audio vector.
In some embodiments, the text in the video for feature extraction processing includes headline text, abstract text, and text in a plurality of frame-taking images; the text fusion module is also used for: and performing splicing processing on the text vector of the title text, the text vector of the abstract text and the text vectors of the texts in the plurality of frame-drawing images to obtain an overall text vector.
In some embodiments, fusion module 4554 is further to: and performing at least one of splicing processing, averaging processing, product processing and weighting processing on the image vector, the audio vector and the text vector to obtain a video representation vector of the video.
In some embodiments, the artificial intelligence based similar video processing device 455 further comprises: the first index establishing module is used for taking videos stored in the database and having time within a set time period before the real-time as stock videos, and establishing indexes of video representation vectors of a plurality of stock videos in the stock index space; the second index establishing module is used for storing the new video as an incremental video to the database and establishing an index of a video representation vector of the new video in an incremental index space; wherein the index in the incremental index space is used to periodically migrate into the inventory index space; the index reading module is used for reading the stock index space and the increment index space to determine the video similarity between the new video and the multiple videos to be compared; the video to be compared comprises a plurality of stock videos and incremental videos except the new video; and the non-original identification module is used for identifying the new video as a non-original video when the video similarity between the new video and any one video to be compared is greater than a first similarity threshold value.
In some embodiments, the video representation vector of the stock video and the video representation vector of the new video are used to store into an index library, the index library comprising a stock index space and an delta index space; the artificial intelligence based similar video processing device 455 further includes: the synchronous storage module is used for synchronously storing the video representation vector of the new video to the standby index library copy of the index library in use; and the switching module is used for reestablishing indexes in the stock index space of the standby index library copy according to the video representation vector of the stock video in the standby index library copy when the switching period is reached, and switching the index library being used with the standby index library copy.
In some embodiments, the first index building module is further to: for each stock video representation vector, performing any one of the following processes: taking the video representation vector of the stock video as an index of the video representation vector of the stock video; and carrying out hash processing on the video representation vector of the stock video, and taking the result obtained by the hash processing as an index of the video representation vector of the stock video.
In some embodiments, the artificial intelligence based similar video processing device 455 further comprises: the record acquisition module is used for acquiring a plurality of history recommendation records aiming at the history user; the historical recommendation record comprises user characteristics of a historical user, video representation vectors of the recommended videos and triggering results of the historical user on the recommended videos; the updating module is used for updating the recommendation parameters of the recommendation model according to the plurality of historical recommendation records; the combination module is used for respectively carrying out combination processing on the user characteristics of the user to be recommended and the video representation vectors of the candidate videos to obtain a plurality of prediction samples; the prediction module is used for carrying out prediction processing on the prediction samples through the updated recommendation model to obtain a prediction triggering result of the candidate video corresponding to the prediction samples; and the result screening module is used for screening the video to be recommended from the plurality of candidate videos according to the prediction triggering result and executing the recommendation operation for the video to be recommended.
In some embodiments, the processing module 4556 is further configured to: performing pairwise combination processing on a plurality of videos to be recommended to obtain a plurality of video groups; when the video similarity between two videos in the video group is larger than a second similarity threshold value and the video parameters corresponding to the two videos are different, screening out the video with higher video parameters in the two videos; when the video similarity between two videos in the video group is larger than a second similarity threshold value and the video parameters corresponding to the two videos are the same, taking any one of the two videos as a screened video; performing recommendation operation for the screened video; the video parameter is any one of the grade of the issuing account number of the video, the frame rate, the definition and the resolution of the video.
In some embodiments, the processing module 4556 is further configured to: when the video similarity between the first video marked with the video tag and the second video not marked with the video tag is larger than a third similarity threshold value, taking the video tag of the first video as the video tag of the second video; the video tag is any one of video quality and video theme.
Embodiments of the present application provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer readable storage medium, and the processor executes the computer instructions, so that the computer device executes the similar video processing method based on artificial intelligence according to the embodiment of the application.
Embodiments of the present application provide a computer readable storage medium having stored therein executable instructions that, when executed by a processor, cause the processor to perform a method provided by embodiments of the present application, for example, an artificial intelligence based similar video processing method as illustrated in fig. 3A, 3B, or 3C. It is noted that a computer includes various computing devices including a terminal device and a server.
In some embodiments, the computer readable storage medium may be FRAM, ROM, PROM, EPROM, EEPROM, flash memory, magnetic surface memory, optical disk, or CD-ROM; but may be a variety of devices including one or any combination of the above memories.
In some embodiments, the executable instructions may be in the form of programs, software modules, scripts, or code, written in any form of programming language (including compiled or interpreted languages, or declarative or procedural languages), and they may be deployed in any form, including as stand-alone programs or as modules, components, subroutines, or other units suitable for use in a computing environment.
As an example, the executable instructions may, but need not, correspond to files in a file system, may be stored as part of a file that holds other programs or data, for example, in one or more scripts in a hypertext markup language (HTML, hyper Text Markup Language) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code).
As an example, executable instructions may be deployed to be executed on one computing device or on multiple computing devices located at one site or, alternatively, distributed across multiple sites and interconnected by a communication network.
The foregoing is merely exemplary embodiments of the present application and is not intended to limit the scope of the present application. Any modifications, equivalent substitutions, improvements, etc. that are within the spirit and scope of the present application are intended to be included within the scope of the present application.

Claims (14)

1. A method for similar video processing based on artificial intelligence, the method comprising:
performing multi-dimensional feature extraction processing on an image in a video, and performing fusion processing on the extracted multi-dimensional feature vectors to obtain image vectors of the image;
performing feature extraction processing on the audio in the video to obtain an audio vector;
performing feature extraction processing on texts in the video to obtain text vectors;
carrying out fusion processing on the image vector, the audio vector and the text vector to obtain a video representation vector of the video;
determining videos stored in a database in a set time period before real-time as stock videos, and establishing indexes of video representation vectors of a plurality of stock videos in a stock index space;
Storing a new video as an incremental video to the database, and establishing an index of a video representation vector of the new video in an incremental index space; wherein the index in the incremental index space is for periodically migrating into the inventory index space; the video representation vector of the stock video and the video representation vector of the new video are used for being stored in an index library, and the index library comprises the stock index space and the increment index space; the stock index space is read-only, and the increment index space supports reading and writing at the same time;
synchronously storing video representation vectors of the new video to a backup index library copy of an index library being used;
when the switching period is reached, reestablishing an index in an inventory index space of the standby index library copy according to the video representation vector of the inventory video in the standby index library copy, and switching the index library in use with the standby index library copy;
reading the stock index space and the increment index space to determine video similarity between the new video and a plurality of videos to be compared; wherein the videos to be compared comprise a plurality of stock videos and incremental videos except the new video;
And processing the new video and the video to be compared according to the video similarity.
2. The method of claim 1, wherein:
the feature vectors of the multiple dimensions comprise a target feature vector, a scene feature vector and a face feature vector;
the feature extraction processing of multiple dimensions is performed on the image in the video, and the feature extraction processing comprises the following steps:
performing feature extraction processing on the image through a multi-classification target detection model to obtain the target feature vector;
performing feature extraction processing on the image through a scene recognition model to obtain the scene feature vector;
and carrying out feature extraction processing on the image through a two-class face detection model to obtain the face feature vector.
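As an informative sketch of the three extraction passes in claim 2 (not part of the claims): the target_model, scene_model, and face_model objects below are assumed stand-ins that each expose an embed(image) method returning a fixed-length feature vector; the claim itself does not fix any concrete model implementation.

import numpy as np

def extract_dimension_vectors(image, target_model, scene_model, face_model):
    # One pass per feature dimension: multi-classification target detection,
    # scene recognition, and two-class face detection.
    target_vec = np.asarray(target_model.embed(image), dtype=np.float32)
    scene_vec = np.asarray(scene_model.embed(image), dtype=np.float32)
    face_vec = np.asarray(face_model.embed(image), dtype=np.float32)
    return target_vec, scene_vec, face_vec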
3. The method according to claim 2, wherein the performing fusion processing on the extracted feature vectors of the multiple dimensions to obtain the image vector of the image comprises:
according to the video theme of the video, determining weights of the target feature vector, the scene feature vector and the face feature vector;
weighting the target feature vector, the scene feature vector and the face feature vector to obtain an image vector of the image;
wherein the video theme is any one of a target corresponding to the target detection model, a scene corresponding to the scene recognition model, and a face corresponding to the face detection model; and the weight of a vector conforming to the video theme is greater than the weight of a vector not conforming to the video theme.
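One possible reading of the theme-dependent weighting in claim 3, sketched for illustration only: the weight values are assumptions (the claim only requires that the vector conforming to the video theme receive the larger weight), and the three vectors are assumed to share one dimensionality.

import numpy as np

# Assumed weights: the feature dimension that matches the video theme dominates.
THEME_WEIGHTS = {
    "target": (0.6, 0.2, 0.2),
    "scene": (0.2, 0.6, 0.2),
    "face": (0.2, 0.2, 0.6),
}

def fuse_to_image_vector(target_vec, scene_vec, face_vec, video_theme):
    # Weighted fusion of the three per-dimension feature vectors into one image vector.
    w_t, w_s, w_f = THEME_WEIGHTS[video_theme]
    return (w_t * np.asarray(target_vec, dtype=np.float32)
            + w_s * np.asarray(scene_vec, dtype=np.float32)
            + w_f * np.asarray(face_vec, dtype=np.float32))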
4. The method according to claim 1, wherein the method further comprises:
carrying out fusion processing on image vectors of a plurality of images in the video to obtain an overall image vector;
carrying out fusion processing on audio vectors corresponding to the images in the video to obtain an overall audio vector;
carrying out fusion processing on text vectors of a plurality of texts in the video to obtain an overall text vector;
wherein the performing fusion processing on the image vector, the audio vector and the text vector to obtain the video representation vector of the video comprises:
performing fusion processing on the overall image vector, the overall audio vector and the overall text vector to obtain the video representation vector of the video.
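A compact, informative sketch of the per-modality aggregation in claim 4: averaging and concatenation are used here as concrete fusion choices purely for illustration; the claim leaves the fusion operations open.

import numpy as np

def video_representation_vector(image_vectors, audio_vectors, text_vectors):
    # Fuse the per-image, per-audio and per-text vectors into one overall
    # vector per modality (simple mean, an assumed fusion choice).
    overall_image = np.mean(np.asarray(image_vectors, dtype=np.float32), axis=0)
    overall_audio = np.mean(np.asarray(audio_vectors, dtype=np.float32), axis=0)
    overall_text = np.mean(np.asarray(text_vectors, dtype=np.float32), axis=0)
    # Fuse the three overall vectors into the video representation vector
    # (concatenation, an assumed fusion choice).
    return np.concatenate([overall_image, overall_audio, overall_text])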
5. The method of claim 4, wherein:
the images on which the feature extraction processing is performed in the video comprise a cover image and a plurality of frame extraction images obtained through frame extraction processing;
wherein the performing fusion processing on the image vectors of the plurality of images in the video to obtain the overall image vector comprises:
weighting the image vectors of the plurality of frame extraction images to obtain a global image vector;
and performing splicing processing on the global image vector and the image vector of the cover image to obtain the overall image vector.
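An illustrative sketch of claim 5, with uniform frame weights assumed (the claim does not specify how the weights are chosen):

import numpy as np

def overall_image_vector(frame_vectors, cover_vector, frame_weights=None):
    frame_vectors = np.asarray(frame_vectors, dtype=np.float32)
    if frame_weights is None:
        # Assumption: uniform weights over the frame extraction images.
        frame_weights = np.full(len(frame_vectors), 1.0 / len(frame_vectors), dtype=np.float32)
    frame_weights = np.asarray(frame_weights, dtype=np.float32)
    # Weighted fusion of the frame image vectors gives the global image vector.
    global_vec = frame_weights @ frame_vectors
    # Splicing it with the cover image vector gives the overall image vector.
    return np.concatenate([global_vec, np.asarray(cover_vector, dtype=np.float32)])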
6. The method of claim 5, wherein the method further comprises:
performing traversing processing on a plurality of video frame images in the video, and performing the following processing on the traversed video frame images:
when the difference in an image parameter between the traversed video frame image and an adjacent video frame image is larger than a difference threshold, taking the traversed video frame image as a frame extraction image obtained by the frame extraction processing, and
taking video frame images separated from the traversed video frame image by a set interval as frame extraction images obtained by the frame extraction processing;
wherein the image parameter is any one of brightness, contrast, definition and resolution.
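A small illustrative implementation of the frame extraction in claim 6; the numeric threshold and interval are assumptions, and brightness stands in for whichever image parameter (brightness, contrast, definition, or resolution) is used.

def extract_frames(frames, brightness, diff_threshold=20.0, interval=30):
    # frames: list of video frame images; brightness: one value per frame.
    selected = set()
    for i in range(1, len(frames)):
        # A frame whose image-parameter difference from its neighbour exceeds
        # the threshold is taken as a frame extraction image...
        if abs(brightness[i] - brightness[i - 1]) > diff_threshold:
            selected.add(i)
            # ...together with the frames a set interval away from it.
            for j in (i - interval, i + interval):
                if 0 <= j < len(frames):
                    selected.add(j)
    return [frames[i] for i in sorted(selected)]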
7. The method of claim 5, wherein:
the performing fusion processing on the audio vectors corresponding to the plurality of images in the video to obtain the overall audio vector comprises:
weighting the audio vectors corresponding to the plurality of frame extraction images to obtain the overall audio vector;
and the performing fusion processing on the text vectors of the plurality of texts in the video to obtain the overall text vector comprises:
performing splicing processing on the text vector of the title text, the text vector of the abstract text, and the text vectors of the texts in the plurality of frame extraction images to obtain the overall text vector.
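A sketch of claim 7 under the same illustrative assumptions (uniform weighting for the audio vectors, plain concatenation for the splicing of text vectors):

import numpy as np

def overall_audio_vector(frame_audio_vectors):
    # Weight (here: average) the audio vectors aligned with the frame extraction images.
    return np.mean(np.asarray(frame_audio_vectors, dtype=np.float32), axis=0)

def overall_text_vector(title_vec, abstract_vec, frame_text_vectors):
    # Splice the title, abstract and in-frame text vectors into one overall text vector.
    parts = [np.asarray(title_vec, dtype=np.float32), np.asarray(abstract_vec, dtype=np.float32)]
    parts += [np.asarray(v, dtype=np.float32) for v in frame_text_vectors]
    return np.concatenate(parts)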
8. The method of claim 1, wherein the performing fusion processing on the image vector, the audio vector and the text vector to obtain the video representation vector of the video comprises:
and performing at least one of splicing processing, averaging processing, product processing and weighting processing on the image vector, the audio vector and the text vector to obtain a video representation vector of the video.
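For reference, the four fusion operations named in claim 8 written out in Python; averaging and the element-wise product assume the three vectors share one dimensionality, and the weights in the weighted case are assumed equal.

import numpy as np

def fuse(image_vec, audio_vec, text_vec, mode="splice"):
    vecs = [np.asarray(v, dtype=np.float32) for v in (image_vec, audio_vec, text_vec)]
    if mode == "splice":      # splicing (concatenation)
        return np.concatenate(vecs)
    if mode == "average":     # element-wise mean
        return np.mean(vecs, axis=0)
    if mode == "product":     # element-wise product
        return vecs[0] * vecs[1] * vecs[2]
    if mode == "weighted":    # weighted sum, equal weights assumed
        return sum(w * v for w, v in zip((1 / 3, 1 / 3, 1 / 3), vecs))
    raise ValueError(mode)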
9. The method according to any one of claims 1 to 8, wherein said processing the new video and the video to be compared according to the video similarity comprises:
and when the video similarity between the new video and any one of the videos to be compared is larger than a first similarity threshold value, identifying the new video as a non-original video.
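An illustrative one-function reading of the originality check in claim 9 (the threshold value is an assumption):

def is_original(new_video_id, similarities, first_threshold=0.9):
    # similarities: mapping video_id -> video similarity with the new video.
    # The new video is identified as non-original as soon as any video to be
    # compared exceeds the first similarity threshold.
    return not any(sim > first_threshold
                   for vid, sim in similarities.items() if vid != new_video_id)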
10. The method according to any one of claims 1 to 8, wherein said processing the new video and the video to be compared according to the video similarity comprises:
performing pairwise combination processing on a plurality of videos to be recommended to obtain a plurality of video groups;
when the video similarity between two videos in a video group is larger than a second similarity threshold and the video parameters corresponding to the two videos are different, taking the video with the higher video parameter of the two videos as a screened video;
when the video similarity between two videos in the video group is larger than the second similarity threshold value and the video parameters corresponding to the two videos are the same, taking any one of the two videos as a screened video;
performing recommendation operation for the screened video;
wherein the video parameter is any one of the level of the publishing account of the video, and the frame rate, definition, and resolution of the video.
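An informative sketch of the recommendation-side screening in claim 10; the similarity function, the numeric threshold, and the representation of the video parameter as a single comparable number are assumptions.

from itertools import combinations

def screen_for_recommendation(videos, similarity, params, second_threshold=0.85):
    # videos: list of video ids; similarity(a, b) -> float; params: id -> number.
    dropped = set()
    for a, b in combinations(videos, 2):          # pairwise combination
        if similarity(a, b) > second_threshold:
            if params[a] == params[b]:
                dropped.add(b)                    # parameters equal: keep either one
            else:
                dropped.add(a if params[a] < params[b] else b)  # keep the higher one
    return [v for v in videos if v not in dropped]  # the screened videos to recommend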
11. The method according to any one of claims 1 to 8, wherein said processing the new video and the video to be compared according to the video similarity comprises:
when the video similarity between a first video marked with a video tag and a second video not marked with the video tag is larger than a third similarity threshold, taking the video tag of the first video as the video tag of the second video;
wherein the video tag is any one of a video quality and a video theme.
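A short illustrative sketch of the tag propagation in claim 11 (the threshold and data layout are assumptions):

def propagate_tags(tagged, untagged, similarity, third_threshold=0.8):
    # tagged: video_id -> video tag; untagged: iterable of video ids;
    # similarity(a, b) -> float. An untagged video inherits the tag of the
    # first tagged video whose similarity to it exceeds the third threshold.
    new_tags = {}
    for second in untagged:
        for first, tag in tagged.items():
            if similarity(first, second) > third_threshold:
                new_tags[second] = tag
                break
    return new_tags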
12. A similar video processing apparatus based on artificial intelligence, comprising:
the first feature extraction module is used for carrying out feature extraction processing on images in the video in multiple dimensions, and carrying out fusion processing on the extracted feature vectors in the multiple dimensions to obtain image vectors of the images;
the second feature extraction module is used for carrying out feature extraction processing on the audio in the video to obtain an audio vector;
the third feature extraction module is used for carrying out feature extraction processing on the text in the video to obtain a text vector;
the fusion module is used for carrying out fusion processing on the image vector, the audio vector and the text vector to obtain a video representation vector of the video;
the similarity determining module is used for determining videos whose storage time in the database is within a set time period before the current time as stock videos, and establishing indexes of the video representation vectors of a plurality of the stock videos in a stock index space; storing a new video as an incremental video in the database, and establishing an index of the video representation vector of the new video in an incremental index space; wherein the indexes in the incremental index space are periodically migrated into the stock index space; the video representation vectors of the stock videos and the video representation vector of the new video are stored in an index library, and the index library comprises the stock index space and the incremental index space; the stock index space is read-only, and the incremental index space supports reading and writing at the same time; synchronously storing the video representation vector of the new video to a backup index library copy of the index library in use; when a switching period is reached, re-establishing indexes in the stock index space of the backup index library copy according to the video representation vectors of the stock videos in the backup index library copy, and switching the index library in use with the backup index library copy; and reading the stock index space and the incremental index space to determine video similarity between the new video and a plurality of videos to be compared; wherein the videos to be compared comprise a plurality of the stock videos and incremental videos other than the new video;
and the processing module is used for processing the new video and the video to be compared according to the video similarity.
13. An electronic device, comprising:
a memory for storing executable instructions;
a processor for implementing the artificial intelligence based similar video processing method of any one of claims 1 to 11 when executing executable instructions stored in the memory.
14. A computer readable storage medium storing executable instructions for implementing the artificial intelligence based similar video processing method of any one of claims 1 to 11 when executed by a processor.
CN202011080112.4A 2020-10-10 2020-10-10 Similar video processing method and device based on artificial intelligence and electronic equipment Active CN112203122B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011080112.4A CN112203122B (en) 2020-10-10 2020-10-10 Similar video processing method and device based on artificial intelligence and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011080112.4A CN112203122B (en) 2020-10-10 2020-10-10 Similar video processing method and device based on artificial intelligence and electronic equipment

Publications (2)

Publication Number Publication Date
CN112203122A CN112203122A (en) 2021-01-08
CN112203122B (en) 2024-01-26

Family

ID=74013351

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011080112.4A Active CN112203122B (en) 2020-10-10 2020-10-10 Similar video processing method and device based on artificial intelligence and electronic equipment

Country Status (1)

Country Link
CN (1) CN112203122B (en)

Families Citing this family (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112836487B (en) * 2021-02-07 2023-01-24 四川封面传媒有限责任公司 Automatic comment method and device, computer equipment and storage medium
CN112905840A (en) * 2021-02-09 2021-06-04 北京有竹居网络技术有限公司 Video processing method, device, storage medium and equipment
CN113051236B (en) * 2021-03-09 2022-06-07 北京沃东天骏信息技术有限公司 Method and device for auditing video and computer-readable storage medium
CN112818955B (en) * 2021-03-19 2023-09-15 北京市商汤科技开发有限公司 Image segmentation method, device, computer equipment and storage medium
CN113177136B (en) * 2021-04-27 2022-04-22 桂林电子科技大学 Multi-mode music style classification method based on attention audio frequency and lyrics
CN113326735B (en) * 2021-04-29 2023-11-28 南京大学 YOLOv 5-based multi-mode small target detection method
CN113761282B (en) * 2021-05-11 2023-07-25 腾讯科技(深圳)有限公司 Video duplicate checking method and device, electronic equipment and storage medium
CN113052149B (en) * 2021-05-20 2021-08-13 平安科技(深圳)有限公司 Video abstract generation method and device, computer equipment and medium
CN113407488A (en) * 2021-06-17 2021-09-17 北京金山云网络技术有限公司 File storage conversion processing method and device, electronic equipment and storage medium
CN113591655A (en) * 2021-07-23 2021-11-02 上海明略人工智能(集团)有限公司 Video contrast loss calculation method, system, storage medium and electronic device
CN113672627B (en) * 2021-09-08 2023-08-18 湖南惠农科技有限公司 Method and device for constructing index of elastic search engine
CN114363660B (en) * 2021-12-24 2023-09-08 腾讯科技(武汉)有限公司 Video collection determining method and device, electronic equipment and storage medium
CN114579805B (en) * 2022-03-01 2023-03-28 北京赛思信安技术股份有限公司 Convolutional neural network similar video retrieval method based on attention mechanism
CN114610905B (en) * 2022-03-23 2024-04-26 腾讯科技(深圳)有限公司 Data processing method and related device
CN114842239B (en) * 2022-04-02 2022-12-23 北京医准智能科技有限公司 Breast lesion attribute prediction method and device based on ultrasonic video
CN115499707A (en) * 2022-09-22 2022-12-20 北京百度网讯科技有限公司 Method and device for determining video similarity
CN116320535B (en) * 2023-04-14 2024-03-22 北京百度网讯科技有限公司 Method, device, electronic equipment and storage medium for generating video
CN116186330B (en) * 2023-04-23 2023-07-11 之江实验室 Video deduplication method and device based on multi-mode learning
CN117292303B (en) * 2023-11-22 2024-03-08 北京小糖科技有限责任公司 Method and device for judging segmented video type and electronic equipment

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2012167568A1 (en) * 2011-11-23 2012-12-13 华为技术有限公司 Video advertisement broadcasting method, device and system
US9262511B2 (en) * 2012-07-30 2016-02-16 Red Lambda, Inc. System and method for indexing streams containing unstructured text data

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6741655B1 (en) * 1997-05-05 2004-05-25 The Trustees Of Columbia University In The City Of New York Algorithms and system for object-oriented content-based video search
KR20110095513A (en) * 2010-02-19 2011-08-25 연세대학교 산학협력단 The method for extracting feature and the apparatus thereof
CN104050628A (en) * 2013-03-11 2014-09-17 佳能株式会社 Image processing method and image processing device
CN108614838A (en) * 2016-12-13 2018-10-02 腾讯科技(北京)有限公司 A kind of user group index process method, apparatus and system
CN108376169A (en) * 2018-02-26 2018-08-07 众安信息技术服务有限公司 A kind of data processing method and device for on-line analytical processing
CN110196844A (en) * 2018-04-16 2019-09-03 腾讯科技(深圳)有限公司 A kind of data migration method, system and storage medium
CN108564052A (en) * 2018-04-24 2018-09-21 南京邮电大学 Multi-cam dynamic human face recognition system based on MTCNN and method
CN108829826A (en) * 2018-06-14 2018-11-16 清华大学深圳研究生院 A kind of image search method based on deep learning and semantic segmentation
CN108900769A (en) * 2018-07-16 2018-11-27 Oppo广东移动通信有限公司 Image processing method, device, mobile terminal and computer readable storage medium
CN110225373A (en) * 2019-06-13 2019-09-10 腾讯科技(深圳)有限公司 A kind of video reviewing method, device and electronic equipment
CN111368141A (en) * 2020-03-18 2020-07-03 腾讯科技(深圳)有限公司 Video tag expansion method and device, computer equipment and storage medium
CN111581437A (en) * 2020-05-07 2020-08-25 腾讯科技(深圳)有限公司 Video retrieval method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Su Xiaohan; Feng Hongcai; Wu Shiyao. Multi-modal video scene segmentation algorithm based on deep networks. Journal of Wuhan University of Technology (Information & Management Engineering Edition), 2020, (No. 03), full text. *

Also Published As

Publication number Publication date
CN112203122A (en) 2021-01-08

Similar Documents

Publication Publication Date Title
CN112203122B (en) Similar video processing method and device based on artificial intelligence and electronic equipment
US11526533B2 (en) Version history management
KR20210040891A (en) Method and Apparatus of Recommending Information, Electronic Device, Computer-Readable Recording Medium, and Computer Program
CN112749326B (en) Information processing method, information processing device, computer equipment and storage medium
CN116702737B (en) Document generation method, device, equipment, storage medium and product
US10083031B2 (en) Cognitive feature analytics
CN112231563B (en) Content recommendation method, device and storage medium
CN111368141B (en) Video tag expansion method, device, computer equipment and storage medium
CN112131449A (en) Implementation method of cultural resource cascade query interface based on elastic search
CN109344298A (en) A kind of method and device converting unstructured data to structural data
CN113704506A (en) Media content duplication eliminating method and related device
CN113011126B (en) Text processing method, text processing device, electronic equipment and computer readable storage medium
KR102523839B1 (en) System for generating customizing deep learining model using labelling object and operating method thereof
KR20110085829A (en) Social media contents sharing apparatus and method
CN116755688A (en) Component processing method, device, computer equipment and storage medium
CN115827978A (en) Information recommendation method, device, equipment and computer readable storage medium
CN112989167B (en) Method, device and equipment for identifying transport account and computer readable storage medium
CN115099229A (en) Plan model generation method, plan model generation device, electronic device and storage medium
KR20130082712A (en) System for providing personal information based on generation and consumption of content
CN116894089B (en) Digest generation method, digest generation device, digest generation apparatus, digest generation program, and digest generation program
US11804245B2 (en) Video data size reduction
CN113407714B (en) Aging-based data processing method and device, electronic equipment and storage medium
CN117875274A (en) Method for generating article content and electronic equipment
WO2023003555A1 (en) Automated generation of immersive interfaces
CN117370631A (en) Data processing method, device, electronic equipment, storage medium and program product

Legal Events

Date Code Title Description
PB01 Publication
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40037823

Country of ref document: HK

SE01 Entry into force of request for substantive examination
GR01 Patent grant