CN112818251A - Video recommendation method and device, electronic equipment and storage medium

Info

Publication number: CN112818251A (application granted; also published as CN112818251B)
Authority: CN (China)
Prior art keywords: video, vector, text, recommended, keyword
Legal status: Granted; Active
Application number: CN202110394131.2A, filed by Tencent Technology Shenzhen Co Ltd
Other languages: Chinese (zh)
Inventor: 徐程程 (Xu Chengcheng)
Current Assignee: Tencent Technology Shenzhen Co Ltd
Original Assignee: Tencent Technology Shenzhen Co Ltd

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G06F16/735 Querying; Filtering based on additional data, e.g. user or group profiles
    • G06F16/7867 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually, using information manually generated, e.g. tags, keywords, comments, title and artist information, manually generated time, location and usage information, user ratings
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking

Abstract

The embodiment of the application discloses a video recommendation method, a video recommendation device, electronic equipment and a storage medium, where the method comprises the following steps: collecting video data to be recommended and historical browsing video data; acquiring the video type and video description content of the video to be recommended from the video attribute information, where the video description content comprises a video description text and video keywords; performing feature extraction on the video type and the video description content to obtain a first vector corresponding to the video type and a second vector corresponding to the video description content; constructing a semantic text vector of the video description text, and fusing the first vector, the second vector and the semantic text vector to obtain a video vector of the video to be recommended; and determining a target video among the plurality of videos to be recommended based on the historical browsing videos and the video vectors of the videos to be recommended, and recommending the target video.

Description

Video recommendation method and device, electronic equipment and storage medium
Technical Field
The invention relates to the technical field of communication, in particular to a video recommendation method and device, electronic equipment and a storage medium.
Background
Short video is one of the most important forms of content distribution on today's internet: creators produce content, publish it to a platform, and the platform then recommends the content to users.
At present, short video applications generally use recommendation algorithms to recommend short videos that may be of interest to users, and a common approach is recommendation based on collaborative filtering. However, this approach mainly relies on a single type of data as the basis for similarity computation. Collaborative filtering considers only behavioral correlation, not content correlation, and cannot generalize to short videos that a user has not clicked on or been exposed to. As a result, videos recommended to users under current recommendation schemes are not accurate.
Disclosure of Invention
The embodiment of the invention provides a video recommendation method and device, electronic equipment and a storage medium, which can improve the accuracy of video recommendation.
The embodiment of the invention provides a video recommendation method, which comprises the following steps:
the method comprises the steps of collecting video data to be recommended and historical browsing video data, wherein the video data to be recommended comprises a plurality of videos to be recommended and video attribute information of each video to be recommended, the historical browsing video data comprises at least one historical browsing video, and a historical browsing video is a video browsed by a user in a historical period;
acquiring the video type and video description content of the video to be recommended from the video attribute information, wherein the video description content comprises a video description text and video keywords;
performing feature extraction on the video type and the video description content to obtain a first vector corresponding to the video type and a second vector corresponding to the video description content;
constructing a semantic text vector of the video description text, and fusing the first vector, the second vector and the semantic text vector to obtain a video vector of the video to be recommended;
and determining a target video among the plurality of videos to be recommended based on the historical browsing videos and the video vectors of the videos to be recommended, and recommending the target video.
Correspondingly, the embodiment of the present application further provides a video recommendation device, including:
the system comprises a collection module, a recommendation module and a recommendation module, wherein the collection module is used for collecting video data to be recommended and historical browse video data, the video data to be recommended comprises a plurality of videos to be recommended and video attribute information of each video to be recommended, the historical browse video data comprises at least one historical browse video, and the historical browse video is a video browsed by a user in a historical period;
the acquisition module is used for acquiring the video type and the video description content of the video to be recommended from the video attribute information, wherein the video description content comprises a video description text and video keywords;
the extraction module is used for extracting the characteristics of the video type and the video description content to obtain a first vector corresponding to the video type and a second vector corresponding to the video description content;
the construction module is used for constructing a semantic text vector of the video description text based on the video keywords;
the fusion module is used for fusing the first vector, the second vector and the semantic text vector to obtain a video vector of the video to be recommended;
and the recommending module is used for determining a target video in the videos to be recommended based on the historical browsing videos and the video vectors of the videos to be recommended and recommending the target video.
Optionally, in some embodiments of the present application, the extracting module includes:
the device comprises a first obtaining unit, a second obtaining unit and a third obtaining unit, wherein the first obtaining unit is used for obtaining a preset feature extraction model, and the feature extraction model comprises a first sub-network, a second sub-network and a third sub-network;
a first extraction unit, configured to perform feature extraction on the video type based on the first sub-network to obtain a first vector corresponding to the video type;
the second extraction unit is used for extracting the characteristics of the video keywords and the text keywords in the video description text based on the second sub-network to obtain keyword vectors;
and the third extraction unit is used for extracting the features of the video description text based on the third sub-network to obtain a text vector corresponding to the video description text.
Optionally, in some embodiments of the present application, the first extracting unit includes:
the first obtaining subunit is configured to obtain a type identifier corresponding to the video type;
an extracting subunit, configured to extract a weight value of the type identifier from a weight matrix corresponding to the first sub-network, to obtain a first weight value corresponding to the type identifier;
and the first constructing subunit is used for constructing a first vector corresponding to the video type according to the extracted first weight value.
Optionally, in some embodiments of the present application, the first constructing subunit is specifically configured to:
constructing a video type vector corresponding to each first weight value;
and carrying out average processing on the constructed video type vector to obtain a first vector corresponding to the video type.
Optionally, in some embodiments of the present application, the second extraction unit includes:
the second obtaining subunit is used for obtaining the keyword identification corresponding to the video keyword;
the second constructing subunit is used for constructing a video keyword vector corresponding to the video keyword based on the second sub-network and the keyword identifier;
the processing subunit is configured to perform convolution processing on the text keywords in the video description text by using the second sub-network to obtain text keyword vectors corresponding to the text keywords;
and the fusion subunit is used for fusing the video keyword vector and the text keyword vector to obtain a keyword vector.
Optionally, in some embodiments of the present application, the second building subunit is specifically configured to:
extracting the weight value of the keyword identifier from the weight matrix corresponding to the second sub-network to obtain a second weight value corresponding to the keyword identifier;
and constructing a video keyword vector corresponding to the video keyword according to the extracted second weight value.
Optionally, in some embodiments of the present application, the fusion subunit is specifically configured to:
carrying out average processing on the constructed second weight vector to obtain a video keyword vector corresponding to the video keyword;
and splicing the text keyword vector and the video keyword vector to obtain a keyword vector.
Optionally, in some embodiments of the present application, the building module includes:
the second acquisition unit is used for acquiring a preset semantic text construction model;
the word segmentation unit is used for segmenting words of the video description text to obtain a segmented description text;
and the construction unit is used for constructing a semantic text vector of the video description text by adopting the semantic text construction model based on the word segmentation description text and the video keywords.
Optionally, in some embodiments of the present application, the building unit is specifically configured to:
extracting word embedding vectors of text words in the word segmentation description text, wherein the word embedding vectors carry semantic information of the text word context;
extracting word vectors of the video keywords;
splicing the extracted word embedding vector and the word vector to obtain a spliced vector;
and inputting the spliced vector into the semantic text construction model to obtain a semantic text vector of the video description text.
In the embodiment of the application, video data to be recommended and historical browsing video data are first collected, where the video data to be recommended comprises a plurality of videos to be recommended and video attribute information of each video to be recommended, the historical browsing video data comprises at least one historical browsing video, and a historical browsing video is a video browsed by the user in a historical period. The video type and video description content of the video to be recommended are then obtained from the video attribute information, where the video description content comprises a video description text and video keywords. Next, feature extraction is performed on the video type and the video description content to obtain a first vector corresponding to the video type and a second vector corresponding to the video description content. A semantic text vector of the video description text is then constructed, and the first vector, the second vector and the semantic text vector are fused to obtain a video vector of the video to be recommended. Finally, a target video is determined among the plurality of videos to be recommended based on the historical browsing videos and the video vectors of the videos to be recommended, and the target video is recommended. Therefore, the scheme can improve the accuracy of video recommendation.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. It is apparent that the drawings in the following description show only some embodiments of the present application; other drawings can be obtained by those skilled in the art from these drawings without creative effort.
Fig. 1a is a scene schematic diagram of a video recommendation method provided in an embodiment of the present application;
fig. 1b is a schematic flowchart of a video recommendation method provided in an embodiment of the present application;
FIG. 1c is a schematic structural diagram of a feature extraction model provided herein;
FIG. 1d is a schematic diagram of a network parameter matrix of a second subnetwork as provided herein;
FIG. 1e is a schematic diagram of a bi-directional encoder provided herein;
fig. 1f is a schematic diagram of generating a video vector in the video recommendation method provided in the present application;
fig. 2 is another schematic flowchart of a video recommendation method provided in an embodiment of the present application;
fig. 3 is a schematic structural diagram of a video recommendation apparatus according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of an electronic device provided in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the drawings in the embodiments of the present application. It is apparent that the described embodiments are only a part of the embodiments of the present application, not all of them. All other embodiments obtained by a person skilled in the art based on the embodiments given herein without creative effort shall fall within the protection scope of the present application.
Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use that knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that machines gain the capabilities of perception, reasoning and decision making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, involving both hardware-level and software-level technology. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technology, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technology mainly includes computer vision, speech processing, natural language processing, and machine learning/deep learning.
Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies theories and methods that enable effective communication between humans and computers in natural language. Natural language processing is a science that integrates linguistics, computer science and mathematics; research in this field involves natural language, the language people use every day, so it is closely related to the study of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering, knowledge graphs, and the like.
Machine Learning (ML) is a multi-domain interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory and other disciplines. It specializes in how computers simulate or implement human learning behavior in order to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied across all fields of artificial intelligence.
Deep learning is a core part of machine learning and generally includes techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning and learning from demonstration. Deep learning is a newer research direction within machine learning: a machine learning method based on representation learning of data. An observation (e.g., an image) can be represented in many ways, such as a vector of per-pixel intensity values, or more abstractly as a series of edges, regions of particular shapes, and so on. Tasks (e.g., face recognition or facial expression recognition) are more easily learned from examples under certain specific representations. The benefit of deep learning is that unsupervised or semi-supervised feature learning and efficient hierarchical feature extraction algorithms replace manual feature engineering.
With research and progress in artificial intelligence technology, artificial intelligence has been developed and applied in many fields, such as smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, autonomous driving, drones, robots, smart healthcare, smart customer service, and the like.
The solution provided by the embodiments of the application involves artificial intelligence technologies such as natural language processing and deep learning, which are explained by the following embodiments.
The embodiment of the application provides a video recommendation method and device, electronic equipment and a storage medium.
The video recommendation device may be specifically integrated in a terminal or a server. The terminal may be, but is not limited to, a smart phone, a tablet computer, a laptop computer, a desktop computer, a smart speaker, a smart watch, and the like. The terminal and the server may be directly or indirectly connected through wired or wireless communication. The server may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN, and big data and artificial intelligence platforms, which is not limited herein.
For example, referring to fig. 1a, the present application provides a video recommendation system, which includes a server 10, a viewer terminal set 20 and a video master terminal 30, where the video recommendation apparatus is integrated in the server 10, and the viewer terminal set 20 includes a plurality of viewer terminals 20a. Specifically, after a video master uploads a short video produced by the video master through the video master terminal 30, the server 10 collects video data to be recommended (short videos produced by video masters) and historical browsing video data (the set of short videos played by the viewer terminals 20a), where the video data to be recommended includes a plurality of videos to be recommended and video attribute information of each video to be recommended, and the historical browsing video data includes at least one historical browsing video. Then, the server 10 obtains the video type and video description content of the video to be recommended from the video attribute information, where the video description content includes a video description text and video keywords. Next, the server 10 extracts features of the video type and the video description content to obtain a first vector corresponding to the video type and a second vector corresponding to the video description content. The server 10 then constructs a semantic text vector of the video description text, and fuses the first vector, the second vector and the semantic text vector to obtain a video vector of the video to be recommended. Finally, the server 10 determines a target video among the plurality of videos to be recommended based on the historical browsing videos and the video vectors of the videos to be recommended, and recommends the target video to the corresponding viewer terminal 20a so that the viewer terminal 20a plays it, thereby recommending the target video to the user.
According to this video recommendation scheme, the first vector corresponding to the video type and the second vector corresponding to the video description content are extracted, the semantic text vector of the video description text is constructed by using the video keywords, and finally the first vector, the second vector and the semantic text vector are fused to obtain the video vector of the video to be recommended, so that the video vector carries both category-level and semantic-level information of the video, which improves the accuracy of video recommendation.
Detailed descriptions are given below. It should be noted that the order in which the following embodiments are described is not intended to limit a preferred order of the embodiments.
A video recommendation method, comprising: the method comprises the steps of collecting video data to be recommended and historical browsing video data, obtaining video types and video description contents of videos to be recommended from video attribute information, carrying out feature extraction on the video types and the video description contents to obtain first vectors corresponding to the video types and second vectors corresponding to the video description contents, constructing semantic text vectors of video description texts, fusing the first vectors, the second vectors and the semantic text vectors to obtain video vectors of the videos to be recommended, determining target videos in the videos to be recommended based on the historical browsing videos and the video vectors of the videos to be recommended, and recommending the target videos.
Referring to fig. 1b, fig. 1b is a schematic flowchart illustrating a video recommendation method according to an embodiment of the present disclosure. The specific flow of the video recommendation method can be as follows:
101. and collecting video data to be recommended and historical browsing video data.
Video generally refers to various techniques that capture, record, process, store, transmit and reproduce a series of still images as electrical signals. Advances in networking technology have also enabled recorded video segments to be streamed over the internet and received and played by computers. Video data is a time-varying image stream that contains far richer information and content than other media can express. Transmitting information in the form of video expresses the content to be conveyed intuitively, vividly, truly and efficiently.
The video data to be recommended comprises a plurality of videos to be recommended and video attribute information of each video to be recommended, the historical browsing video data comprises at least one historical browsing video, and the videos to be recommended can be videos played on video websites or videos inserted in webpages. For example, the videos may be various movie videos, live videos, program videos, short videos, and the like, and the videos to be recommended may be obtained from a video website or a video database.
A historical browsing video is a video browsed by the user in a historical period. For example, if user A browsed video a in a historical time period, the date on which user A browsed video a (e.g., May 2), the browsing duration (e.g., 5 seconds) and the number of times browsed (e.g., 1) are recorded, for example in the form sketched below.
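Purely as an illustration (the patent does not define a storage format for these records), such a browsing record could be kept as a simple structure; all field names here are assumptions:

```python
from dataclasses import dataclass

@dataclass
class BrowsingRecord:
    user_id: str      # e.g. "A"
    video_id: str     # e.g. "a"
    date: str         # date of browsing, e.g. "May 2"
    duration_s: int   # browsing duration in seconds, e.g. 5
    view_count: int   # number of times browsed, e.g. 1

record = BrowsingRecord(user_id="A", video_id="a", date="May 2", duration_s=5, view_count=1)
```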
102. And acquiring the video type and the video description content of the video to be recommended from the video attribute information.
The video attribute information may carry the storage size occupied by the video, the video playing duration, the video type of the video to be recommended and the video description content, where the video description content includes a video description text and video keywords.
Specifically, the video description text is text that describes the video content. For example, the title of video A is "Challenge: eating 100 steamed buns", and its content introduction is "Today the big-stomach-king anchor surpasses himself and challenges himself to eat 100 steamed buns"; both the title and the content introduction of video A are then video description texts of video A. For another example, for short video B, its title is the video description text. It should be noted that a short video is one form of video and an internet content distribution format, generally video content of no more than 5 minutes distributed on new internet media.
The video keywords may be keywords in the video description text, keywords extracted from the video content, such as keywords extracted from the bullet screen or from the video subtitles, or keywords corresponding to video tags. For example, the video tag of video A is "comedy", the video tag of video B is "hot-blooded", and the video tag of video C is "campus". The keywords are selected according to the actual situation and are not described further here.
103. And performing feature extraction on the video type and the video description content to obtain a first vector corresponding to the video type and a second vector corresponding to the video description content.
For example, a pre-trained feature extraction model may be used to extract the first vector corresponding to the video type and the second vector corresponding to the video description content. The feature extraction model may include a first sub-network, a second sub-network and a third sub-network: the first sub-network is used to extract the first vector corresponding to the video type; the second sub-network is used to extract features of the video keywords and of the text keywords in the video description text to obtain a keyword vector; and the third sub-network is used to extract a text vector corresponding to the video description text. That is, optionally, in some embodiments, the step of "performing feature extraction on the video type and the video description content to obtain a first vector corresponding to the video type and a second vector corresponding to the video description content" may specifically include the following steps; a minimal code sketch of such a model is given after the list:
(11) acquiring a preset feature extraction model;
(12) performing feature extraction on the video type based on a first sub-network to obtain a first vector corresponding to the video type;
(13) extracting the characteristics of the video keywords and the text keywords in the video description text based on a second sub-network to obtain a keyword vector;
(14) and performing feature extraction on the video description text based on the third sub-network to obtain a text vector corresponding to the video description text.
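The patent itself contains no source code; purely as an illustration, the three-sub-network layout described above might be sketched in PyTorch as follows. All class names, vocabulary sizes and dimensions are assumptions, not part of the disclosure:

```python
import torch
import torch.nn as nn

class FeatureExtractionModel(nn.Module):
    """Hypothetical sketch of the three-sub-network feature extraction model."""

    def __init__(self, num_types=100, num_keywords=50000, dim=128):
        super().__init__()
        # First sub-network S1: a weight matrix mapping a type identifier to
        # the row vector in which its weight value lies (the first vector).
        self.type_embedding = nn.Embedding(num_types, dim)
        # Second sub-network S2: a keyword weight matrix plus a 1-D CNN over
        # the tokens of the video description text.
        self.keyword_embedding = nn.Embedding(num_keywords, dim)
        self.keyword_cnn = nn.Conv1d(dim, dim, kernel_size=3, padding=1)
        # Third sub-network S3: another CNN over the description text, with a
        # different depth than the CNN in S2.
        self.text_cnn = nn.Sequential(
            nn.Conv1d(dim, dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(dim, dim, kernel_size=3, padding=1),
        )

    def forward(self, type_ids, keyword_ids, token_embeddings):
        # type_ids: (T,)  keyword_ids: (K,)  token_embeddings: (L, dim)
        first_vector = self.type_embedding(type_ids).mean(dim=0)           # step (12)
        video_kw_vec = self.keyword_embedding(keyword_ids).mean(dim=0)
        conv_in = token_embeddings.t().unsqueeze(0)                        # (1, dim, L)
        text_kw_vec = self.keyword_cnn(conv_in).max(dim=2).values.squeeze(0)
        keyword_vector = torch.cat([video_kw_vec, text_kw_vec])           # step (13)
        text_vector = self.text_cnn(conv_in).mean(dim=2).squeeze(0)       # step (14)
        return first_vector, keyword_vector, text_vector
```

The averaging in `forward` mirrors the averaging described below for videos that have more than one type or keyword.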
Referring to fig. 1c, a feature extraction model provided by the present application includes three parts: a first sub-network S1, a second sub-network S2 and a third sub-network S3. The first sub-network S1 is used to extract the first vector corresponding to the video type. During training, an association between each video type and a type identifier may be pre-constructed, and a mapping is then established between the type identifiers and the network parameters of the first sub-network S1. For example, referring to fig. 1d, the network parameters of the first sub-network S1 form a 3x3 matrix, and the type identifier c corresponds to the weight value c in the matrix; the first vector corresponding to the video type is the vector formed by the row in which that weight value lies, that is, the vector <a, b, c>. During use, the type identifier corresponding to the video type is identified by the first sub-network, and the first vector corresponding to the video type is constructed based on the identified type identifier. That is, optionally, in some embodiments, the step of "extracting features of the video type based on the first sub-network to obtain the first vector corresponding to the video type" may specifically include:
(21) acquiring a type identifier corresponding to a video type;
(22) extracting a weight value of the type identifier from a weight matrix corresponding to the first sub-network to obtain a first weight value corresponding to the type identifier;
(23) and constructing a first vector corresponding to the video type according to the extracted first weight value.
Further, when the video to be recommended has only one video type, the video type vector corresponding to that weight value is determined as the first vector corresponding to the video type; when the video to be recommended has at least two video types, the video type vectors corresponding to the weight values are averaged to obtain the first vector corresponding to the video type. That is, optionally, in some embodiments, the step of "constructing the first vector corresponding to the video type according to the extracted first weight value" may specifically include:
(31) constructing a video type vector corresponding to each first weight value;
(32) and carrying out average processing on the constructed video type vector to obtain a first vector corresponding to the video type.
It is understood that, in this embodiment, the average of all the video type vectors is actually calculated. Taking the vector a = <1, 2, 3> and the vector b = <3, 1, 2> as an example, if the target vector is c, then c = (a + b)/2 = <2, 1.5, 2.5>.
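Expressed as a quick numerical check (a sketch only, using PyTorch):

```python
import torch

a = torch.tensor([1.0, 2.0, 3.0])    # video type vector for the first type
b = torch.tensor([3.0, 1.0, 2.0])    # video type vector for the second type
c = torch.stack([a, b]).mean(dim=0)  # average processing of the type vectors
print(c)  # tensor([2.0000, 1.5000, 2.5000]) -> the first vector <2, 1.5, 2.5>
```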
Further, in the second sub-network S2, the keywords of the video description content are encoded in two ways in the present application, which can also be understood as extracting features of the keywords in the video description content in two ways; specifically, the text keywords in the video description text are convolved. It should be noted that the second sub-network may include a Convolutional Neural Network (CNN). A CNN is a class of feedforward neural networks that involves convolution computation and has a deep structure, and it is one of the representative algorithms of deep learning. Convolutional neural networks have a representation learning capability and can perform shift-invariant classification of input information according to their hierarchical structure, so they are also called shift-invariant artificial neural networks (SIANN).

The input layer of a convolutional neural network can process multidimensional data. The input layer of a one-dimensional convolutional neural network receives a one-dimensional or two-dimensional array, where the one-dimensional array is usually a time or spectrum sample and the two-dimensional array may include multiple channels; the input layer of a two-dimensional convolutional neural network receives a two-dimensional or three-dimensional array; the input layer of a three-dimensional convolutional neural network receives a four-dimensional array. A convolutional neural network generally consists of an input layer, hidden layers and an output layer, and the hidden layers commonly include three types of structures: convolutional layers, pooling layers and fully-connected layers. A convolutional layer extracts features from the input data; it contains multiple convolution kernels, and each element of a convolution kernel corresponds to a weight coefficient and a bias vector, similar to a neuron of a feedforward neural network. Each neuron in a convolutional layer is connected to multiple neurons in a nearby region of the previous layer; the size of that region depends on the size of the convolution kernel and is known in the literature as the "receptive field", by analogy with the receptive fields of visual cortical cells. When a convolution kernel operates, it regularly sweeps over the input features, performs element-wise multiplication and summation over the receptive field, and superimposes the bias. After the convolutional layer performs feature extraction, the output feature map is passed to a pooling layer for feature selection and information filtering. The pooling layer contains a preset pooling function, whose role is to replace the value of a single point in the feature map with a statistic of its neighboring region. The pooling layer selects pooling regions in the same way a convolution kernel scans the feature map, controlled by pooling size, stride and padding. The fully-connected layer in a convolutional neural network is equivalent to the hidden layer in a traditional feedforward neural network.

The fully-connected layers are located in the last part of the hidden layers of the convolutional neural network and only pass signals to other fully-connected layers. In the fully-connected layers, the feature map loses its spatial topology, is expanded into vectors and passes through the activation function. The fully-connected layer is usually upstream of the output layer in a convolutional neural network, so its structure and working principle are the same as those of the output layer in a conventional feedforward neural network. In the convolutional neural network of the second sub-network S2, the fully-connected layer outputs a word vector for each text keyword in the video description text, and the word vector carries the meaning of the word.
It should be noted that the keyword identifiers corresponding to the video keywords are obtained, and video keyword vectors corresponding to the video keywords are constructed based on the second sub-network S2 and the keyword identifiers. The way the video keyword vectors are constructed is similar to the way the first vector corresponding to the video type is constructed above; please refer to the foregoing embodiment for details, which are not repeated here.
After the video keyword vector and the text keyword vector are obtained, they are fused to obtain the keyword vector. That is, the step of "extracting features of the video keywords and the text keywords in the video description text based on the second sub-network to obtain a keyword vector" may specifically include:
(41) acquiring a keyword identifier corresponding to a video keyword;
(42) constructing a video keyword vector corresponding to the video keyword based on the second sub-network and the keyword identifier;
(43) performing convolution processing on the text keywords in the video description text by adopting the second sub-network to obtain text keyword vectors corresponding to the text keywords;
(44) and fusing the video keyword vector and the text keyword vector to obtain a keyword vector.
Optionally, in some embodiments, the step "constructing a video keyword vector corresponding to the video keyword based on the second sub-network and the keyword identifier" may specifically include:
(51) extracting the weight value of the keyword identifier from the weight matrix corresponding to the second sub-network to obtain a second weight value corresponding to the keyword identifier;
(52) and constructing a video keyword vector corresponding to the video keyword according to the extracted second weight value.
Furthermore, the video keyword vector and the text keyword vector may be fused by vector splicing. When the video to be recommended has only one video keyword, the vector corresponding to the keyword identifier is spliced with the text keyword vector to obtain the keyword vector; when the video to be recommended has at least two video keywords, the constructed second weight vectors are averaged to obtain the video keyword vector corresponding to the video keywords, and the text keyword vector and the video keyword vector are then spliced to obtain the keyword vector, as in the sketch below.
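A minimal sketch of this splice, assuming (hypothetically) 128-dimensional vectors, two video keywords with identifiers 17 and 42, and a stand-in for the CNN output of step (43):

```python
import torch
import torch.nn as nn

keyword_embedding = nn.Embedding(50000, 128)  # weight matrix of the second sub-network (assumed size)
text_kw_vec = torch.randn(128)                # stand-in for the text keyword vector from step (43)

video_kw_vecs = keyword_embedding(torch.tensor([17, 42]))  # second weight values for two keyword IDs
video_kw_vec = video_kw_vecs.mean(dim=0)                   # average processing (at least two keywords)
keyword_vector = torch.cat([text_kw_vec, video_kw_vec])    # spliced keyword vector, shape (256,)
```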
In addition, the third sub-network S3 may be a convolutional neural network; specifically, a convolutional neural network may be used to perform feature extraction on the video description text. This convolutional neural network may be of the same kind as the one in the second sub-network S2, which improves the training efficiency of the network; however, the numbers of hidden layers in the two convolutional neural networks are different, so that the dimensionalities of the features extracted by the different sub-networks differ, which further improves the semantic richness of the subsequent video vector.
104. And constructing a semantic text vector of the video description text based on the video keywords, and fusing the first vector, the second vector and the semantic text vector to obtain a video vector of the video to be recommended.
In order to improve the semantic richness of the subsequent video vector, the present application may use a Transformer-based bidirectional encoder (BERT, Bidirectional Encoder Representations from Transformers) to construct the semantic text vector of the video description text, hereinafter referred to as the BERT model for convenience of description. Referring to fig. 1e, the input of the BERT model consists of the video keywords and the text words in the video description text, so the video description text needs to be segmented into words before the video vector is constructed. That is, optionally, in some embodiments, the step of "constructing a semantic text vector of the video description text based on the video keywords" may specifically include:
(61) acquiring a preset semantic text construction model;
(62) segmenting words of the video description text to obtain segmented description text;
(63) and constructing a semantic text vector of the video description text by adopting a semantic text construction model based on the word segmentation description text and the video keywords.
The BERT model is used as the text encoder: the segmented description text is input into the BERT model, and the vector at the [CLS] position is the vector of the whole sentence. It should be noted that, in the present application, the input of the BERT model is the result of splicing the word embedding vectors of the text words with the word vectors, and the semantic text vector of the video description text is output after processing by the BERT model.
It should be noted that the word embedding vector is also called an Embedding feature; the object of the Embedding feature in this embodiment is the video, that is, the feature is used to describe the video. A word embedding vector converts a word expressed in natural language into a vector or matrix that a computer can understand. Word embedding vectors can be extracted with a deep learning model, for example a convolutional neural network (CNN) model, a Long Short-Term Memory (LSTM) model, a recurrent neural network (RNN) model, a gated convolutional neural network (G-CNN) model, or the like; other possible deep learning models can also be used, without limitation here.
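The patent feeds BERT a custom splice of word embedding vectors and keyword word vectors, which is not reproduced here. As a rough, purely illustrative approximation of obtaining a sentence-level [CLS] vector, one could use the standard HuggingFace transformers API (the patent does not mention this library, and the model name is an assumption):

```python
import torch
from transformers import BertTokenizer, BertModel

# Assumed checkpoint; the patent's description texts are Chinese.
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertModel.from_pretrained("bert-base-chinese")

text = "挑战吃100个包子"  # segmented description text (plus keywords) would be fed here
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# The hidden state at position 0 corresponds to [CLS]: the sentence-level vector.
semantic_text_vector = outputs.last_hidden_state[0, 0]  # shape: (768,)
```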
After the first vector, the second vector and the semantic text vector are obtained, they may be spliced to obtain the video vector of the video to be recommended: for example, the first vector, the second vector and the semantic text vector are spliced, the splicing result is input into a feed-forward network, and the video vector of the video to be recommended is finally output, as shown in fig. 1f.
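A minimal sketch of this fusion step, with all dimensions assumed for illustration:

```python
import torch
import torch.nn as nn

dim1, dim2, dim3, out_dim = 128, 256, 768, 256  # assumed vector dimensions
feed_forward = nn.Sequential(
    nn.Linear(dim1 + dim2 + dim3, out_dim),
    nn.ReLU(),
    nn.Linear(out_dim, out_dim),
)

first_vector = torch.randn(dim1)          # from the first sub-network
keyword_vector = torch.randn(dim2)        # second vector, from the second sub-network
semantic_text_vector = torch.randn(dim3)  # from the BERT encoder

spliced = torch.cat([first_vector, keyword_vector, semantic_text_vector])
video_vector = feed_forward(spliced)      # video vector of the video to be recommended
```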
105. And determining a target video in the plurality of videos to be recommended based on the historical browsing videos and the video vectors of the videos to be recommended, and recommending the target video.
Specifically, the historical browsing videos may include videos watched by the user and/or marked videos. Watched videos are videos the user watched in the historical time period; marked videos are videos the user bookmarked or liked in the historical time period. On this basis, a corresponding weight is assigned to each historical browsing video; for example, a higher weight may be assigned to marked videos, the choice being made according to the actual situation.
Further, the method described above can be used to construct the video vector corresponding to the video to be recommended; the cosine similarity between the two video vectors (of the video to be recommended and of the historical browsing video) is then calculated, the videos to be recommended whose cosine similarity is greater than a preset value are determined as target videos, and the target videos are recommended to the user. Of course, the videos to be recommended whose similarity is greater than the preset value may instead be determined as candidate videos, the candidate videos ranked by cosine similarity, and the candidate video at the head of the ranked queue determined as the target video and recommended to the user; the choice is made according to the actual situation and is not described further here. Both variants are illustrated in the sketch below.
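Both variants might be sketched as follows (shapes, the preset value 0.8 and the top-10 cut-off are all assumptions):

```python
import torch
import torch.nn.functional as F

history_vectors = torch.randn(5, 256)      # video vectors of 5 historical browsing videos
candidate_vectors = torch.randn(100, 256)  # video vectors of 100 videos to be recommended
threshold = 0.8                            # assumed preset value

# Cosine similarity of every candidate against every history video,
# reduced to a single score per candidate (here: the maximum).
sims = F.cosine_similarity(candidate_vectors.unsqueeze(1), history_vectors.unsqueeze(0), dim=2)
scores = sims.max(dim=1).values

# Variant 1: recommend every candidate whose score exceeds the preset value.
target_ids = (scores > threshold).nonzero(as_tuple=True)[0]

# Variant 2: rank the candidates by score and take the head of the queue.
top_ids = scores.argsort(descending=True)[:10]
```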
In the embodiment of the application, after video data to be recommended and the user's historical browsing video data are collected, the video type and video description content of the video to be recommended are obtained from the video attribute information. Feature extraction is then performed on the video type and the video description content to obtain a first vector corresponding to the video type and a second vector corresponding to the video description content. Next, a semantic text vector of the video description text is constructed based on the video keywords, and the first vector, the second vector and the semantic text vector are fused to obtain the video vector of the video to be recommended. Finally, a target video is determined among the plurality of videos to be recommended based on the historical browsing videos and the video vectors of the videos to be recommended, and the target video is recommended. In this video recommendation scheme, the first vector corresponding to the video type and the second vector corresponding to the video description content are extracted, the semantic text vector of the video description text is constructed using the video keywords, and the first vector, the second vector and the semantic text vector are finally fused into the video vector of the video to be recommended, so that the video vector carries both category-level and semantic-level information, which improves the accuracy of video recommendation.
The method according to the examples is further described in detail below by way of example.
In this embodiment, the video recommendation apparatus will be described by taking an example in which the video recommendation apparatus is specifically integrated in a server.
Referring to fig. 2, a video recommendation method may specifically include the following processes:
201. the server collects video data to be recommended and historical browsing video data.
The video data to be recommended comprises a plurality of videos to be recommended and video attribute information of each video to be recommended, the historical browsing video data comprises at least one historical browsing video, a historical browsing video is a video browsed by the user in a historical period, and a video to be recommended may be a video played on a video website or a video embedded in a web page, for example various movie videos, live videos, program videos, short videos, and the like. The server may obtain the video data to be recommended and the user's historical browsing video data from a video website, or from a video database.
202. And the server acquires the video type and the video description content of the video to be recommended from the video attribute information.
The video attribute information may carry the storage size occupied by the video, the video playing duration, the video type of the video to be recommended and the video description content, where the video description content includes a video description text and video keywords.
203. The server extracts the characteristics of the video type and the video description content to obtain a first vector corresponding to the video type and a second vector corresponding to the video description content.
For example, the server may extract the first vector corresponding to the video type and the second vector corresponding to the video description content by using a pre-trained feature extraction model, where the feature extraction model may include a first sub-network, a second sub-network and a third sub-network. Further, the server extracts the first vector corresponding to the video type by using the first sub-network, performs feature extraction on the video keywords and the text keywords in the video description text by using the second sub-network to obtain the keyword vector, and extracts the text vector corresponding to the video description text by using the third sub-network.
204. The server constructs a semantic text vector of the video description text based on the video keywords, and fuses the first vector, the second vector and the semantic text vector to obtain a video vector of the video to be recommended.
The server may splice the first vector, the second vector and the semantic text vector to obtain a video vector of the video to be recommended, for example, splice the first vector, the second vector and the semantic text vector, then input a splicing result into a feed-forward network, and finally output the video vector of the video to be recommended.
205. The server determines a target video in the videos to be recommended based on the historical browsing videos and the video vectors of the videos to be recommended, and recommends the target video.
For example, specifically, the server may construct a video vector corresponding to a video to be recommended by using the above method, then calculate cosine similarity between two video vectors (the video to be recommended and the history browsing video), determine the video to be recommended with the cosine similarity larger than a preset value as a target video, and recommend the target video to the user.
In the embodiment of the application, after the server collects video data to be recommended and historical browsing video data, it obtains the video type and video description content of the video to be recommended from the video attribute information. The server then performs feature extraction on the video type and the video description content to obtain a first vector corresponding to the video type and a second vector corresponding to the video description content. Next, the server constructs a semantic text vector of the video description text based on the video keywords and fuses the first vector, the second vector and the semantic text vector to obtain the video vector of the video to be recommended. Finally, the server determines a target video among the plurality of videos to be recommended based on the historical browsing videos and the video vectors of the videos to be recommended, and recommends the target video. In this video recommendation scheme, the first vector corresponding to the video type and the second vector corresponding to the video description content are extracted, the semantic text vector of the video description text is constructed using the video keywords, and the first vector, the second vector and the semantic text vector are finally fused into the video vector of the video to be recommended, so that the video vector carries both category-level and semantic-level information, which improves the accuracy of video recommendation.
In order to better implement the video recommendation method of the embodiments of the present application, an embodiment of the present application further provides a video recommendation apparatus (referred to as a recommendation apparatus for short) based on the foregoing video recommendation method. The terms have the same meanings as in the video recommendation method above; for implementation details, refer to the description in the method embodiments.
Referring to fig. 3, fig. 3 is a schematic structural diagram of a video recommendation apparatus according to an embodiment of the present application. The recommendation apparatus may include an acquisition module 301, an obtaining module 302, an extraction module 303, a construction module 304, a fusion module 305 and a recommendation module 306, which may specifically be as follows:
the acquisition module 301 is configured to acquire video data to be recommended and historical browsing video data.
The video data to be recommended comprises a plurality of videos to be recommended and video attribute information of each video to be recommended, the historical browsing video data comprises at least one historical browsing video, a historical browsing video is a video browsed by the user in a historical period, and a video to be recommended may be a video played on a video website or a video embedded in a web page, for example various movie videos, live videos, program videos, short videos, and the like. The acquisition module 301 may acquire the video data to be recommended and the historical browsing video data from a video website, or from a video database.
An obtaining module 302, configured to obtain a video type and video description content of a video to be recommended from the video attribute information.
The video attribute information may carry the memory size occupied by the video, the video playing time, the video type of the video to be recommended and the video description content, where the video description content includes a video description text and video keywords.
The extracting module 303 is configured to perform feature extraction on the video type and the video description content to obtain a first vector corresponding to the video type and a second vector corresponding to the video description content.
For example, the extracting module 303 may specifically extract the first vector corresponding to the video type and the second vector corresponding to the video description content by using a pre-trained feature extraction model, where the feature extraction model may include a first sub-network, a second sub-network and a third sub-network. Further, the extracting module 303 extracts the first vector corresponding to the video type by using the first sub-network, performs feature extraction on the video keywords and the text keywords in the video description text by using the second sub-network to obtain the keyword vector, and extracts the text vector corresponding to the video description text by using the third sub-network.
Optionally, in some embodiments, the extracting module 303 may specifically include:
the first acquisition unit is used for acquiring a preset feature extraction model;
the first extraction unit is used for extracting the characteristics of the video type based on a first sub-network to obtain a first vector corresponding to the video type;
the second extraction unit is used for extracting the characteristics of the video keywords and the text keywords in the video description text based on a second sub-network to obtain keyword vectors;
and the third extraction unit is used for extracting the features of the video description text based on a third sub-network to obtain a text vector corresponding to the video description text.
Optionally, in some embodiments, the first extracting unit may specifically include:
the first obtaining subunit is used for obtaining a type identifier corresponding to the video type;
the extracting subunit is configured to extract a weight value of the type identifier from a weight matrix corresponding to the first subnetwork, so as to obtain a first weight value corresponding to the type identifier;
and the first constructing subunit is used for constructing a first vector corresponding to the video type according to the extracted first weight value.
Optionally, in some embodiments, the first constructing subunit may specifically be configured to: constructing a video type vector corresponding to each first weight value; and averaging the constructed video type vectors to obtain a first vector corresponding to the video type.
Optionally, in some embodiments, the second extracting unit may specifically include:
the second acquisition subunit is used for acquiring the keyword identification corresponding to the video keyword;
the second construction subunit is used for constructing a video keyword vector corresponding to the video keyword based on the second sub-network and the keyword identifier;
the processing subunit is used for performing convolution processing on the text keywords in the video description text by adopting a second sub-network to obtain text keyword vectors corresponding to the text keywords;
and the fusion subunit is used for fusing the video keyword vector and the text keyword vector to obtain a keyword vector.
Optionally, in some embodiments, the second construction subunit may be specifically configured to: extract the weight value of the keyword identifier from the weight matrix corresponding to the second sub-network to obtain a second weight value corresponding to the keyword identifier; and construct a video keyword vector corresponding to the video keywords according to the extracted second weight values.
Optionally, in some embodiments, the fusion subunit may be specifically configured to: average the vectors constructed from the second weight values to obtain the video keyword vector corresponding to the video keywords; and splice the text keyword vector and the video keyword vector to obtain the keyword vector.
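Putting these subunits together, a plausible sketch of the second sub-network follows: video keywords go through an embedding lookup and are averaged, the text keywords from the description are convolved, and the two results are spliced. The table size, convolution width and max-pooling choice are all assumptions:

```python
import torch
import torch.nn as nn

dim = 128
keyword_embedding = nn.Embedding(10000, dim)   # weight matrix of the second sub-network (size assumed)
text_kw_conv = nn.Conv1d(dim, dim, kernel_size=3, padding=1)

def keyword_vector(video_kw_ids, text_kw_ids):
    # Second weight values: one embedding row per video keyword, averaged
    # into the video keyword vector.
    video_kw_vec = keyword_embedding(video_kw_ids).mean(dim=0)        # (dim,)
    # Convolution over the embeddings of the text keywords in the description,
    # max-pooled over positions (pooling choice assumed).
    emb = keyword_embedding(text_kw_ids).T.unsqueeze(0)               # (1, dim, n)
    text_kw_vec = text_kw_conv(emb).max(dim=2).values.squeeze(0)      # (dim,)
    # Splice the two parts to obtain the keyword vector.
    return torch.cat([video_kw_vec, text_kw_vec])                     # (2 * dim,)

kw_vec = keyword_vector(torch.tensor([5, 42]), torch.tensor([7, 99, 123]))
```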
And the construction module 304 is used for constructing a semantic text vector of the video description text based on the video keywords.
In the present application, the construction module 304 may use a Bidirectional Encoder Representations from Transformers (BERT) model to build the semantic text vector of the video description text.
Optionally, in some embodiments, the construction module 304 may specifically include:
the second acquisition unit is used for acquiring a preset semantic text construction model;
the word segmentation unit is used for segmenting words of the video description text to obtain a segmented description text;
and the construction unit is used for constructing a semantic text vector of the video description text by adopting the semantic text construction model based on the segmented description text and the video keywords.
Optionally, in some embodiments, the construction unit may specifically be configured to: extract word embedding vectors of the text words in the segmented description text, where the word embedding vectors carry contextual semantic information of the text words; extract word vectors of the video keywords; splice the extracted word embedding vectors and word vectors to obtain a spliced vector; and input the spliced vector into the semantic text construction model to obtain the semantic text vector of the video description text.
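A minimal sketch of this construction under stated assumptions follows; the application names BERT, but for self-containedness a generic Transformer encoder stands in, and the mean-pooling into a single vector is an assumed choice:

```python
import torch
import torch.nn as nn

dim, vocab = 128, 30000
word_embedding = nn.Embedding(vocab, dim)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
    num_layers=2,
)

def semantic_text_vector(text_word_ids, keyword_ids):
    # Word embedding vectors of the segmented description text.
    text_emb = word_embedding(text_word_ids)                 # (n_words, dim)
    # Word vectors of the video keywords.
    kw_emb = word_embedding(keyword_ids)                     # (n_keywords, dim)
    # Splice the two and feed the spliced sequence to the encoder.
    spliced = torch.cat([text_emb, kw_emb]).unsqueeze(0)     # (1, n_words + n_keywords, dim)
    encoded = encoder(spliced)                               # contextualized vectors
    return encoded.mean(dim=1).squeeze(0)                    # pooled semantic text vector

sem_vec = semantic_text_vector(torch.arange(12), torch.tensor([7, 42]))
```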
And the fusion module 305 is configured to fuse the first vector, the second vector and the semantic text vector to obtain a video vector of the video to be recommended.
And the recommending module 306 is configured to determine a target video from the multiple videos to be recommended based on the historical browsing videos and the video vectors of the videos to be recommended, and recommend the target video.
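The excerpt does not fix the fusion operation or the matching rule between the historical browsing videos and the candidates; a hedged sketch, assuming concatenation for fusion and cosine similarity against the mean of the user's history, is:

```python
import numpy as np

def fuse(first_vec, second_vec, sem_vec):
    # One plausible fusion: splice the three parts into a single video vector.
    return np.concatenate([first_vec, second_vec, sem_vec])

def recommend(candidate_vecs, history_vecs, top_k=3):
    # Represent the user by the mean of the vectors of historically browsed
    # videos, then rank candidates by cosine similarity (assumed metric).
    user_vec = np.mean(history_vecs, axis=0)
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
    scores = [cos(v, user_vec) for v in candidate_vecs]
    return np.argsort(scores)[::-1][:top_k]   # indices of the target videos

# Toy usage with random vectors standing in for real video vectors.
rng = np.random.default_rng(0)
print(recommend(rng.normal(size=(10, 384)), rng.normal(size=(5, 384))))
```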
After the acquisition module 301 collects the video data to be recommended and the historical browsing video data of the user, the obtaining module 302 obtains the video type and the video description content of a video to be recommended from the video attribute information, and the extracting module 303 performs feature extraction on the video type and the video description content to obtain a first vector corresponding to the video type and a second vector corresponding to the video description content. The construction module 304 then constructs a semantic text vector of the video description text based on the video keywords, and the fusion module 305 fuses the first vector, the second vector and the semantic text vector to obtain the video vector of the video to be recommended. Finally, the recommending module 306 determines a target video from the plurality of videos to be recommended based on the historical browsing videos and the video vectors of the videos to be recommended, and recommends the target video. In this video recommendation scheme, the first vector corresponding to the video type and the second vector corresponding to the video description content are extracted, the semantic text vector of the video description text is constructed by using the video keywords, and the first vector, the second vector and the semantic text vector are finally fused to obtain the video vector of the video to be recommended.
In addition, an embodiment of the present application further provides an electronic device. Fig. 4 shows a schematic structural diagram of the electronic device. Specifically:
the electronic device may include a processor 401 having one or more processing cores, a memory 402 having one or more computer-readable storage media, a power supply 403, an input unit 404, and other components. Those skilled in the art will appreciate that the structure shown in fig. 4 does not constitute a limitation of the electronic device, which may include more or fewer components than those shown, combine some components, or arrange the components differently. Wherein:
the processor 401 is the control center of the electronic device. It connects the various parts of the electronic device through various interfaces and lines, and performs the functions of the electronic device and processes its data by running or executing the software programs and/or modules stored in the memory 402 and calling the data stored in the memory 402, thereby monitoring the electronic device as a whole. Optionally, the processor 401 may include one or more processing cores. Preferably, the processor 401 may integrate an application processor, which mainly handles the operating system, user interfaces and application programs, and a modem processor, which mainly handles wireless communication. It will be appreciated that the modem processor may alternatively not be integrated into the processor 401.
The memory 402 may be used to store software programs and modules, and the processor 401 executes various functional applications and data processing by running the software programs and modules stored in the memory 402. The memory 402 may mainly include a program storage area and a data storage area: the program storage area may store the operating system, application programs required by at least one function (such as a sound playing function or an image playing function), and the like; the data storage area may store data created according to the use of the electronic device, and the like. Furthermore, the memory 402 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device. Accordingly, the memory 402 may also include a memory controller to provide the processor 401 with access to the memory 402.
The electronic device further comprises a power supply 403 for supplying power to the various components. Preferably, the power supply 403 is logically connected to the processor 401 through a power management system, so that charging, discharging and power consumption management are realized through the power management system. The power supply 403 may also include one or more of a DC or AC power source, a recharging system, a power failure detection circuit, a power converter or inverter, a power status indicator, and other components.
The electronic device may further include an input unit 404, and the input unit 404 may be used to receive input numeric or character information and generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control.
Although not shown, the electronic device may further include a display unit and the like, which are not described in detail herein. Specifically, in this embodiment, the processor 401 in the electronic device loads the executable file corresponding to the process of one or more application programs into the memory 402 according to the following instructions, and the processor 401 runs the application program stored in the memory 402, thereby implementing various functions as follows:
collecting video data to be recommended and historical browsing video data of a user; obtaining the video type and the video description content of a video to be recommended from the video attribute information; performing feature extraction on the video type and the video description content to obtain a first vector corresponding to the video type and a second vector corresponding to the video description content; constructing a semantic text vector of the video description text based on the video keywords; fusing the first vector, the second vector and the semantic text vector to obtain a video vector of the video to be recommended; determining a target video among the plurality of videos to be recommended based on the historical browsing videos and the video vectors of the videos to be recommended; and recommending the target video to the user.
The above operations can be implemented in the foregoing embodiments, and are not described in detail herein.
According to the embodiments of the present application, after the video data to be recommended and the historical browsing video data are collected, the video type and the video description content of the video to be recommended are obtained from the video attribute information. Feature extraction is then performed on the video type and the video description content to obtain a first vector corresponding to the video type and a second vector corresponding to the video description content, and a semantic text vector of the video description text is constructed based on the video keywords. The first vector, the second vector and the semantic text vector are fused to obtain the video vector of the video to be recommended. Finally, a target video is determined among the videos to be recommended based on the historical browsing videos and the video vectors of the videos to be recommended, and the target video is recommended. In this video recommendation scheme, the first vector corresponding to the video type and the second vector corresponding to the video description content are extracted, the semantic text vector of the video description text is constructed by using the video keywords, and the first vector, the second vector and the semantic text vector are finally fused to obtain the video vector of the video to be recommended.
It will be understood by those skilled in the art that all or part of the steps of the methods of the above embodiments may be performed by instructions or by associated hardware controlled by the instructions, which may be stored in a computer readable storage medium and loaded and executed by a processor.
To this end, the present application provides a storage medium, in which a plurality of instructions are stored, where the instructions can be loaded by a processor to execute the steps in any one of the video recommendation methods provided in the present application. For example, the instructions may perform the steps of:
collecting video data to be recommended and historical browsing video data; obtaining the video type and the video description content of a video to be recommended from the video attribute information; performing feature extraction on the video type and the video description content to obtain a first vector corresponding to the video type and a second vector corresponding to the video description content; constructing a semantic text vector of the video description text based on the video keywords; fusing the first vector, the second vector and the semantic text vector to obtain a video vector of the video to be recommended; determining a target video among the videos to be recommended based on the historical browsing videos and the video vectors of the videos to be recommended; and recommending the target video.
The above operations can be implemented in the foregoing embodiments, and are not described in detail herein.
Wherein the storage medium may include: read Only Memory (ROM), Random Access Memory (RAM), magnetic or optical disks, and the like.
Since the instructions stored in the storage medium can execute the steps in any video recommendation method provided in the embodiments of the present application, beneficial effects that can be achieved by any video recommendation method provided in the embodiments of the present application can be achieved, which are detailed in the foregoing embodiments and will not be described herein again.
The video recommendation method and apparatus, the electronic device and the storage medium provided in the embodiments of the present application are described in detail above. Specific examples are used herein to illustrate the principles and implementations of the present application, and the descriptions of the above embodiments are only intended to help understand the method and its core ideas. Meanwhile, those skilled in the art may, according to the ideas of the present application, make changes to the specific implementations and the application scope. In summary, the content of this specification should not be construed as limiting the present application.

Claims (12)

1. A method for video recommendation, comprising:
collecting video data to be recommended and historical browsing video data, wherein the video data to be recommended comprises a plurality of videos to be recommended and video attribute information of each video to be recommended, the historical browsing video data comprises at least one historical browsing video, and the historical browsing video is a video browsed by a user during a historical period;
acquiring the video type and video description content of the video to be recommended from the video attribute information, wherein the video description content comprises a video description text and video keywords;
performing feature extraction on the video type and the video description content to obtain a first vector corresponding to the video type and a second vector corresponding to the video description content;
constructing a semantic text vector of the video description text based on the video keywords, and fusing the first vector, the second vector and the semantic text vector to obtain a video vector of the video to be recommended;
and determining a target video in the plurality of videos to be recommended based on the historical browsing videos and the video vectors of the videos to be recommended, and recommending the target video.
2. The method according to claim 1, wherein the performing feature extraction on the video type and the video description content to obtain a first vector corresponding to the video type and a second vector corresponding to the video description content comprises:
acquiring a preset feature extraction model, wherein the feature extraction model comprises a first sub-network, a second sub-network and a third sub-network;
performing feature extraction on the video type based on the first sub-network to obtain a first vector corresponding to the video type;
extracting the features of the video keywords and the text keywords in the video description text based on the second sub-network to obtain keyword vectors;
and performing feature extraction on the video description text based on the third sub-network to obtain a text vector corresponding to the video description text.
3. The method of claim 2, wherein the performing feature extraction on the video type based on the first sub-network to obtain a first vector corresponding to the video type comprises:
acquiring a type identifier corresponding to the video type;
extracting the weight value of the type identifier from the weight matrix corresponding to the first sub-network to obtain a first weight value corresponding to the type identifier;
and constructing a first vector corresponding to the video type according to the extracted first weight value.
4. The method according to claim 3, wherein the constructing a first vector corresponding to the video type according to the extracted first weight value comprises:
constructing a video type vector corresponding to each first weight value;
and carrying out average processing on the constructed video type vector to obtain a first vector corresponding to the video type.
5. The method of claim 2, wherein the performing feature extraction on the video keywords and text keywords in the video description text based on the second sub-network to obtain a keyword vector comprises:
acquiring a keyword identifier corresponding to the video keyword;
constructing a video keyword vector corresponding to the video keyword based on the second sub-network and the keyword identifier;
performing convolution processing on text keywords in the video description text by adopting the second sub-network to obtain text keyword vectors corresponding to the text keywords;
and fusing the video keyword vector and the text keyword vector to obtain a keyword vector.
6. The method of claim 5, wherein constructing a video keyword vector corresponding to the video keyword based on the second sub-network and a keyword identifier comprises:
extracting the weight value of the keyword identifier from the weight matrix corresponding to the second sub-network to obtain a second weight value corresponding to the keyword identifier;
and constructing a video keyword vector corresponding to the video keyword according to the extracted second weight value.
7. The method of claim 5, wherein fusing the video keyword vector and the text keyword vector to obtain a keyword vector comprises:
performing average processing on the vectors constructed from the second weight values to obtain the video keyword vector corresponding to the video keywords;
and splicing the text keyword vector and the video keyword vector to obtain a keyword vector.
8. The method according to any one of claims 1 to 7, wherein the constructing a semantic text vector of the video description text based on the video keywords comprises:
acquiring a preset semantic text construction model;
segmenting the video description text to obtain a segmented description text;
and constructing a semantic text vector of the video description text by adopting the semantic text construction model based on the segmented description text and the video keywords.
9. The method according to claim 8, wherein the constructing a semantic text vector of the video description text by using the semantic text construction model based on the segmented description text and the video keywords comprises:
extracting word embedding vectors of the text words in the segmented description text, wherein the word embedding vectors carry contextual semantic information of the text words;
extracting word vectors of the video keywords;
splicing the extracted word embedding vector and the word vector to obtain a spliced vector;
and inputting the spliced vector into the semantic text construction model to obtain a semantic text vector of the video description text.
10. A video recommendation apparatus, comprising:
an acquisition module, configured to collect video data to be recommended and historical browsing video data, wherein the video data to be recommended comprises a plurality of videos to be recommended and video attribute information of each video to be recommended, the historical browsing video data comprises at least one historical browsing video, and the historical browsing video is a video browsed by a user during a historical period;
an obtaining module, configured to obtain the video type and the video description content of the video to be recommended from the video attribute information, wherein the video description content comprises a video description text and video keywords;
an extraction module, configured to perform feature extraction on the video type and the video description content to obtain a first vector corresponding to the video type and a second vector corresponding to the video description content;
a construction module, configured to construct a semantic text vector of the video description text;
a fusion module, configured to fuse the first vector, the second vector and the semantic text vector to obtain a video vector of the video to be recommended;
and a recommendation module, configured to determine a target video among the plurality of videos to be recommended based on the historical browsing videos and the video vectors of the videos to be recommended, and to recommend the target video.
11. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the steps of the video recommendation method according to any one of claims 1-9 are implemented when the program is executed by the processor.
12. A storage medium having a computer program stored thereon, wherein the computer program when executed by a processor performs the steps of the video recommendation method according to any one of claims 1-9.
CN202110394131.2A 2021-04-13 2021-04-13 Video recommendation method and device, electronic equipment and storage medium Active CN112818251B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110394131.2A 2021-04-13 2021-04-13 Video recommendation method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112818251A 2021-05-18
CN112818251B 2021-07-09

Family

ID=75863573

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110394131.2A Active CN112818251B (en) 2021-04-13 2021-04-13 Video recommendation method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112818251B (en)

Patent Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130339373A1 (en) * 2012-01-10 2013-12-19 Ut-Battelle Llc Method and system of filtering and recommending documents
US20150100987A1 (en) * 2013-10-08 2015-04-09 The Echo Nest Corporation Systems, methods, and computer program products for providing contextually-aware video recommendation
CN105760544A (en) * 2016-03-16 2016-07-13 合网络技术(北京)有限公司 Video recommendation method and device
CN106227793A (en) * 2016-07-20 2016-12-14 合网络技术(北京)有限公司 A kind of video and the determination method and device of Video Key word degree of association
CN110019944A (en) * 2017-12-21 2019-07-16 飞狐信息技术(天津)有限公司 A kind of recommended method and system of video
US20190303499A1 (en) * 2018-03-28 2019-10-03 Cbs Interactive Inc. Systems and methods for determining video content relevance
CN110175264A (en) * 2019-04-23 2019-08-27 深圳市傲天科技股份有限公司 Construction method, server and the computer readable storage medium of video user portrait
WO2020230938A1 (en) * 2019-05-14 2020-11-19 주식회사 슈퍼갈땐슈퍼맨 Device for recommending receipt advertisements by using products purchased by customers in connection with pos terminal
CN110232152A (en) * 2019-05-27 2019-09-13 腾讯科技(深圳)有限公司 Content recommendation method, device, server and storage medium
CN112000872A (en) * 2019-05-27 2020-11-27 北京地平线机器人技术研发有限公司 Recommendation method based on user vector, training method of model and training device
CN110781347A (en) * 2019-10-23 2020-02-11 腾讯科技(深圳)有限公司 Video processing method, device, equipment and readable storage medium
CN111143610A (en) * 2019-12-30 2020-05-12 腾讯科技(深圳)有限公司 Content recommendation method and device, electronic equipment and storage medium
CN111708915A (en) * 2020-06-12 2020-09-25 腾讯科技(深圳)有限公司 Content recommendation method and device, computer equipment and storage medium
CN111611436A (en) * 2020-06-24 2020-09-01 腾讯科技(深圳)有限公司 Label data processing method and device and computer readable storage medium
CN112364204A (en) * 2020-11-12 2021-02-12 北京达佳互联信息技术有限公司 Video searching method and device, computer equipment and storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
ZHOU, Renjie et al.: "An Intelligent Video Tag Recommendation Method for Improving Video Popularity in Mobile Computing Environment", IEEE Access *
LIU, Wenjia: "Research on a Personalized Movie Recommendation System Based on Hybrid Collaborative Filtering", China Master's Theses Full-text Database, Information Science and Technology series *
CHEN, Liang et al.: "A Mobile Video Recommendation Strategy Based on the DNN Algorithm", Chinese Journal of Computers *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113207010A (en) * 2021-06-02 2021-08-03 清华大学 Model training method, live broadcast recommendation method, device and program product
CN113207010B (en) * 2021-06-02 2022-06-17 清华大学 Model training method, live broadcast recommendation method, device and storage medium
CN113792163A (en) * 2021-08-09 2021-12-14 北京达佳互联信息技术有限公司 Multimedia recommendation method and device, electronic equipment and storage medium
CN114302227B (en) * 2021-12-28 2024-04-26 北京国瑞数智技术有限公司 Method and system for collecting and analyzing network video based on container collection
CN114222186A (en) * 2022-01-28 2022-03-22 普春玲 Information pushing system and method based on big data deep mining

Also Published As

Publication number Publication date
CN112818251B 2021-07-09

Similar Documents

Publication Publication Date Title
Gao et al. Video captioning with attention-based LSTM and semantic consistency
CN112818251B (en) Video recommendation method and device, electronic equipment and storage medium
Zhao et al. Affective image content analysis: Two decades review and new perspectives
CN111931062B (en) Training method and related device of information recommendation model
CN111209440B (en) Video playing method, device and storage medium
CN111401344B (en) Face recognition method and device and training method and device of face recognition system
Zhu et al. Multi-modal deep analysis for multimedia
CN113762322A (en) Video classification method, device and equipment based on multi-modal representation and storage medium
CN110852256A (en) Method, device and equipment for generating time sequence action nomination and storage medium
CN111954087B (en) Method and device for intercepting images in video, storage medium and electronic equipment
CN111242019A (en) Video content detection method and device, electronic equipment and storage medium
CN113515669A (en) Data processing method based on artificial intelligence and related equipment
CN113392179A (en) Text labeling method and device, electronic equipment and storage medium
CN111324773A (en) Background music construction method and device, electronic equipment and storage medium
CN114201516A (en) User portrait construction method, information recommendation method and related device
Chaubey et al. Sentiment Analysis of Image with Text Caption using Deep Learning Techniques
CN111046655A (en) Data processing method and device and computer readable storage medium
CN116701706B (en) Data processing method, device, equipment and medium based on artificial intelligence
CN110516153B (en) Intelligent video pushing method and device, storage medium and electronic device
CN112633425B (en) Image classification method and device
CN112579884B (en) User preference estimation method and device
CN113761270A (en) Video recall method and device, electronic equipment and storage medium
CN113704544A (en) Video classification method and device, electronic equipment and storage medium
CN114329049A (en) Video search method and device, computer equipment and storage medium
CN112256917A (en) User interest identification method, device, equipment and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
REG Reference to a national code (ref country code: HK; ref legal event code: DE; ref document number: 40043844; country of ref document: HK)