CN113761270A - Video recall method and device, electronic equipment and storage medium - Google Patents

Video recall method and device, electronic equipment and storage medium

Info

Publication number
CN113761270A
CN113761270A
Authority
CN
China
Prior art keywords
video
query
title
recall
sentence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110536313.9A
Other languages
Chinese (zh)
Inventor
高黎明
廖东亮
黄炜杰
姚日恒
王艺如
黎功福
徐进
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202110536313.9A priority Critical patent/CN113761270A/en
Publication of CN113761270A publication Critical patent/CN113761270A/en
Pending legal-status Critical Current

Classifications

    • G06F16/73 - Information retrieval of video data; Querying
    • G06F16/783 - Retrieval of video data characterised by using metadata automatically derived from the content
    • G06F16/951 - Retrieval from the web; Indexing; Web crawling techniques
    • G06F16/953 - Retrieval from the web; Querying, e.g. by the use of web search engines
    • G06F40/242 - Natural language analysis; Lexical tools; Dictionaries
    • G06F40/279 - Natural language analysis; Recognition of textual entities
    • G06F40/30 - Natural language analysis; Semantic analysis
    • G06N3/045 - Neural networks; Architecture; Combinations of networks
    • G06N3/08 - Neural networks; Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Library & Information Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a video recall method, a video recall device, electronic equipment and a storage medium, wherein the method comprises the following steps: acquiring a video title and query information corresponding to a sample video, wherein the query information comprises a query sentence and a query path; detecting the same keywords between the video title and the query sentence to obtain the same keywords; constructing, according to the query path and the same keywords, a heterogeneous graph for representing the association relationship among the sample video, the query sentence and the query path; and training a preset video recall model based on the heterogeneous graph, the query sentence and the video title, and performing video recall through the trained video recall model.

Description

Video recall method and device, electronic equipment and storage medium
Technical Field
The invention relates to the technical field of computers, in particular to a video recall method and device, electronic equipment and a storage medium.
Background
With the development of network video platforms, the number of videos is continuously expanding, and users need to spend a lot of time searching for videos of interest among the massive number of videos.
The current video recall method recalls videos according to the search terms input by the user; specifically, the videos are recalled using the semantic association between the query sentence and the video title. However, the videos recalled by this method often differ from the videos the user expects; that is, the current video recall method has a low recall rate.
Disclosure of Invention
The application provides a video recall method, a video recall device, electronic equipment and a storage medium, which can improve the recall rate in video query.
The application provides a video recall method, which comprises the following steps:
acquiring a video title and query information corresponding to a sample video, wherein the query information comprises a query statement and a query path;
detecting the same keywords between the video title and the query sentence to obtain the same keywords;
constructing a heterogeneous graph for representing the association relationship among the sample video, the query sentence and the query path according to the query path and the same keywords;
and training a preset video recall model based on the heterogeneous graph, the query sentence and the video title, and performing video recall through the trained video recall model.
Correspondingly, the present application further provides a video recall apparatus, including:
the system comprises an acquisition module, a processing module and a processing module, wherein the acquisition module is used for acquiring a video title and query information corresponding to a sample video, and the query information comprises a query statement and a query path;
the detection module is used for detecting the same keywords between the video title and the query sentence to obtain the same keywords;
the construction module is used for constructing a heterogeneous graph for representing the association relationship among the sample video, the query sentence and the query path according to the query path and the same keywords;
the training module is used for training a preset video recall model based on the heterogeneous graph, the query sentence and the video title;
and the recall module is used for recalling the video through the trained video recall model.
Optionally, in some embodiments of the present invention, the training module includes:
the processing unit is used for performing graph embedding processing on the heterogeneous graph to obtain a node vector corresponding to each node in the heterogeneous graph;
the word segmentation unit is used for performing word segmentation processing on the query sentence and the video title respectively to obtain a query word set corresponding to the query sentence and a title word set corresponding to the video title;
and the training unit is used for training a preset video recall model based on the node vector, the query word set and the title word set to obtain a trained video recall model.
Optionally, in some embodiments of the present application, the training unit includes:
the first construction subunit is configured to construct, according to a first sub-network in the video recall model, the query word set and the title word set, a first semantic feature corresponding to the query statement and a second semantic feature corresponding to the video title;
a second constructing subunit, configured to construct, based on a second sub-network in the video recall model, the node vector, and the same keyword, a first heterogeneous graph feature corresponding to the query statement and a second heterogeneous graph feature corresponding to the video title;
and the training subunit is used for training a preset video recall model according to the first semantic feature, the second semantic feature, the first heterogeneous graph feature and the second heterogeneous graph feature to obtain a trained video recall model.
Optionally, in some embodiments of the present application, the second building subunit is specifically configured to:
inputting the node vector into a second sub-network in the video recall model to obtain a vector characteristic corresponding to each node in the heterogeneous graph, wherein the vector characteristic carries an association weight between the nodes;
obtaining vector features corresponding to the same keywords;
according to the association weight between the nodes, constructing corresponding query features of each node vector under different query statements, and fusing the query features to obtain first heterogeneous graph features corresponding to the query statements;
and constructing a second heterogeneous graph characteristic corresponding to the video title based on the vector characteristic corresponding to the same keyword.
Optionally, in some embodiments of the present application, the first building subunit is specifically configured to:
inputting the first word vector into a first sub-network of the video recall model to obtain a first word embedded representation corresponding to each query word in the query word set, and performing average pooling on the first word embedded representations to obtain a first semantic feature corresponding to the query statement;
and inputting the second word vector into the first sub-network of the video recall model to obtain a second word embedded representation corresponding to each title word in the title word set, and performing average pooling on the second word embedded representations to obtain a second semantic feature corresponding to the video title.
Optionally, in some embodiments of the present application, the training subunit is specifically configured to:
fusing the first semantic features and the first heterogeneous graph features to obtain query features corresponding to the query sentences;
fusing the second semantic features and the second heterogeneous graph features to obtain title features corresponding to the video title;
and training a preset video recall model according to the query features and the title features to obtain the trained video recall model.
Optionally, in some embodiments of the present application, the method further includes a building unit, where the building unit is specifically configured to:
reserving the keywords with the word frequency smaller than a preset value in the query sentence and the video title, and constructing a reference dictionary according to the reserved keywords;
and aligning the text lengths of the query sentence and the video title according to the reference dictionary.
Optionally, in some embodiments of the present application, the recall module is specifically configured to:
when a video searching operation is received, acquiring a video set to be recalled, a video searching sentence corresponding to the video searching operation and a video searching path, wherein the video set to be recalled comprises at least one video to be recalled, and the video to be recalled corresponds to a video title to be recalled;
and calculating the text similarity between the video title to be recalled and the video search sentence, and performing video recall through a trained video recall model based on the text similarity.
Optionally, in some embodiments of the present application, the detection module is specifically configured to:
counting the frequency of each keyword in the video title and the frequency of each keyword in the query sentence;
removing keywords with frequency greater than a preset value from the video title to obtain a processed video title, and removing keywords with frequency greater than a preset value from the query sentence to obtain a processed query sentence;
and detecting the same keywords between the processed video title and the processed query sentence to obtain the same keywords.
In this application, after the video title and the query information corresponding to a sample video are acquired, where the query information includes a query sentence and a query path, the same keywords between the video title and the query sentence are detected to obtain the same keywords; then, a heterogeneous graph for representing the association relationship among the sample video, the query sentence and the query path is constructed according to the query path and the same keywords; finally, a preset video recall model is trained based on the heterogeneous graph, the query sentence and the video title, and video recall is performed through the trained video recall model. Because the heterogeneous graph representing the association relationship among the sample video, the query sentence and the query path is constructed from the query path and the same keywords, and the preset video recall model is then trained based on the heterogeneous graph, the query sentence and the video title, subsequent recalls through the trained video recall model can exploit the association relationship among videos, query sentences and query paths, so the recall rate of video queries can be improved.
Drawings
In order to illustrate the technical solutions of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present application, and those skilled in the art can obtain other drawings based on them without creative effort.
FIG. 1a is a schematic view of a video recall method provided in the present application;
FIG. 1b is a schematic flow chart of a video recall method provided herein;
FIG. 1c is a schematic representation of a video recall method provided herein;
FIG. 1d is an architecture diagram of a video recall model in the video recall method provided herein;
FIG. 1e is a schematic diagram of an attention mechanism in a video recall method provided herein;
FIG. 1f is a block diagram of a gate unit in the video recall method provided herein;
FIG. 2 is a schematic flow chart of a video recall method provided herein;
FIG. 3 is a schematic structural diagram of a video recall device provided in the present application;
fig. 4 is a schematic structural diagram of an electronic device provided in the present application.
Detailed Description
The technical solutions in the present application will be described clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some of the embodiments of the present application, not all of them. All other embodiments obtained by a person skilled in the art based on the embodiments herein without creative effort shall fall within the protection scope of the present application.
Artificial Intelligence (AI) is a theory, method, technology and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems and mechatronics. Artificial intelligence software technology mainly comprises computer vision, speech processing, natural language processing, machine learning/deep learning, and the like.
Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies the theories and methods that enable effective communication between humans and computers in natural language. Natural language processing is a science integrating linguistics, computer science and mathematics; research in this field involves natural language, i.e., the language people use every day, so it is closely related to the study of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering, knowledge graphs, and the like.
Machine Learning (ML) is a multi-domain interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It specializes in studying how computers simulate or implement human learning behaviors to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve their performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in all fields of artificial intelligence.
Deep learning is a core part of machine learning and generally includes techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning and teaching-based learning. Deep learning is a new research direction in the field of machine learning: it is a machine learning method based on representation learning of data. An observation (e.g., an image) can be represented in many ways, such as a vector of intensity values for each pixel, or more abstractly as a series of edges or specially shaped regions. Tasks (e.g., face recognition or facial expression recognition) are learned more easily from examples using certain specific representations. The benefit of deep learning is that it replaces manual feature engineering with efficient unsupervised or semi-supervised feature learning and hierarchical feature extraction algorithms.
With the research and progress of artificial intelligence technology, the artificial intelligence technology is developed and applied in a plurality of fields, such as common smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, automatic driving, unmanned aerial vehicles, robots, smart medical care, smart customer service, and the like.
The scheme provided by the embodiment of the application relates to technologies such as artificial intelligence natural language processing and deep learning, and the following embodiment is used for explanation.
The application provides a video recall method and device, electronic equipment and a storage medium.
The video recall apparatus may be integrated in a server or a terminal. The server may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN, and big data and artificial intelligence platforms. The terminal may be, but is not limited to, a smartphone, a tablet computer, a laptop computer, a desktop computer, a smart speaker, a smart watch, and the like. The terminal and the server may be directly or indirectly connected through wired or wireless communication, which is not limited in this application.
For example, referring to fig. 1a, the video recall apparatus is integrated in a server. The server may obtain a video title and query information corresponding to a sample video, where the query information includes a query sentence and a query path. The server then detects the same keywords between the video title and the query sentence to obtain the same keywords, and constructs a heterogeneous graph representing the association relationship among the sample video, the query sentence and the query path according to the query path and the same keywords. Finally, the server may train a preset video recall model based on the heterogeneous graph, the query sentence and the video title, and perform video recall through the trained video recall model. For example, the server responds to a video recall request triggered by a terminal, recalls a target video through the trained video recall model, and sends the recalled target video to the terminal, which may display it through a web page or an applet.
Because a heterogeneous graph representing the association relationship among sample videos, query sentences and query paths is constructed from the query paths and the same keywords, and a preset video recall model is then trained based on the heterogeneous graph, the query sentences and the video titles, subsequent recalls through the trained video recall model can exploit the association relationship among videos, query sentences and query paths, so the recall rate of video queries can be improved.
The following are detailed below. It should be noted that the description sequence of the following embodiments is not intended to limit the priority sequence of the embodiments.
A video recall method, comprising: obtaining a video title and query information corresponding to a sample video, wherein the query information includes a query sentence and a query path; detecting the same keywords between the video title and the query sentence to obtain the same keywords; constructing, according to the query path and the same keywords, a heterogeneous graph for representing the association relationship among the sample video, the query sentence and the query path; and training a preset video recall model based on the heterogeneous graph, the query sentence and the video title, and recalling videos through the trained video recall model.
Referring to fig. 1b, fig. 1b is a schematic flow chart of a video recall method provided in the present application. The specific flow of the video recall method can be as follows:
101. Acquire a video title and query information corresponding to a sample video.
Video generally refers to various techniques for capturing, recording, processing, storing, transmitting, and reproducing a series of still images as electrical signals. Advances in networking technology have also enabled recorded segments of video to be streamed over the internet and received and played by computers. Video data is a time-varying stream of images that contains much richer information and content that other media cannot express. The information is transmitted in the form of video, and the content to be transmitted can be intuitively, vividly, truly and efficiently expressed.
The query information includes a query statement and a query path. The sample video may be a video played on a video website or a video embedded in a web page, for example various movie videos, live videos, program videos and short videos, and it may be acquired from a video website or from a video database.
The query information refers to the query information generated when a user queries a video; it includes a query statement and a query path. The query statement is the statement used when querying the video. For example, if a user inputs "zhang san XX television series" to query video A, then "zhang san XX television series" is the query statement corresponding to video A; and if the user reaches video A through a search engine on a website, the corresponding query path is: search engine - search results page - video A.
102. Detect the same keywords between the video title and the query sentence to obtain the same keywords.
The term "keyword" originates from the English word "keywords" and refers to the vocabulary a medium uses when building an index. Keyword search is one of the main methods of web search indexing: visitors search for the specific names of the products, services, companies and the like that they want to learn about.
The video title is text describing the video content, such as "xx engineer self-made helicopter", whose keywords may be "xx engineer", "self-made" and "helicopter". The query sentence is the sentence used by the user to query the video, such as "xx engineer, helicopter", "helicopter" or "xx engineer". Taking the query sentence "xx engineer and helicopter" as an example, the keywords shared by the query sentence and the video title are "xx engineer" and "helicopter", so "xx engineer" and "helicopter" are determined to be the same keywords between the video title and the query sentence.
It should be noted that keywords with no practical meaning, such as prepositions and modal particles, frequently appear in video titles and query sentences. Therefore, in some embodiments, these high-frequency keywords may be removed in order to improve the recall capability of the subsequent model. That is, the step "detecting the same keywords between the video title and the query sentence to obtain the same keywords" may specifically include:
(11) counting the frequency of each keyword in the video title and the frequency of each keyword in the query sentence;
(12) removing keywords with frequency greater than a preset value from the video titles to obtain processed video titles, and removing keywords with frequency greater than a preset value from the query sentences to obtain processed query sentences;
(13) and detecting the same keywords between the processed video title and the processed query sentence to obtain the same keywords.
For example, an N-Gram model can be used to count the frequency of occurrence of each keyword in the video title and in the query sentence. It should be noted that N-Gram is an algorithm based on a statistical language model. Its basic idea is to slide a window of size N over the content of the text, byte by byte, forming a sequence of byte fragments of length N. Each byte fragment is called a gram; the occurrence frequency of all grams is counted and filtered against a preset threshold to form a list of key grams, i.e., the vector feature space of the text, in which each gram is one feature dimension. The model is based on the assumption that the occurrence of the Nth word depends only on the preceding N-1 words and not on any other words, so the probability of a complete sentence is the product of the occurrence probabilities of its words; these probabilities can be obtained by directly counting how often N words co-occur in the corpus. The unigram (Uni-Gram), bigram (Bi-Gram) and trigram (Tri-Gram) are commonly used; in this application, their word-frequency thresholds are 20, 2 and 2, respectively.
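The counting and filtering just described can be made concrete with a minimal sketch. This is an illustration, not the patent's implementation: the whitespace tokenizer, the data layout and the reading of the 20/2/2 values as removal thresholds are assumptions.

```python
from collections import Counter

def ngrams(tokens, n):
    """Slide a window of size n over the tokens, yielding length-n fragments (grams)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def high_frequency_grams(corpus, thresholds=None):
    """Count Uni-/Bi-/Tri-Gram occurrences; return the grams above each threshold."""
    thresholds = thresholds or {1: 20, 2: 2, 3: 2}  # thresholds given in the text
    counts = {n: Counter() for n in thresholds}
    for text in corpus:
        tokens = text.split()  # placeholder tokenizer
        for n in thresholds:
            counts[n].update(ngrams(tokens, n))
    return {n: {g for g, c in counts[n].items() if c > t}
            for n, t in thresholds.items()}

def same_keywords(title_tokens, query_tokens, high_freq_unigrams):
    """Remove high-frequency keywords, then intersect title and query keywords."""
    title = {t for t in title_tokens if (t,) not in high_freq_unigrams}
    query = {t for t in query_tokens if (t,) not in high_freq_unigrams}
    return title & query
```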
103. Construct a heterogeneous graph for representing the association relationship among the sample video, the query statement and the query path according to the query path and the same keywords.
A heterogeneous graph (Heterogeneous Graph) is a special information network that contains multiple types of objects or multiple types of connections. A heterogeneous graph, denoted G = (V, E), is composed of an object set V and a connection set E. More than one type of node and edge may exist in a heterogeneous graph, so different types of nodes may have features or attributes of different dimensions.
In the present application, the query path may also be a meta-path. A meta-path is a composite relation connecting two objects and a widely used structure for capturing semantics. For example, for movie videos, the meta-path "movie video - actor - movie video" describes a relationship between two movie videos, and the meta-path "movie video - director - movie video" means that two movies are directed by the same director. Heterogeneity is an intrinsic property of a heterogeneous graph, i.e., it has various types of nodes and edges; for example, different types of nodes have different features, which may lie in different feature spaces.
Therefore, the heterogeneous graph representing the association relationship among the sample video, the query statement and the query path can be constructed according to the query path and the same keywords, where the video title of the sample video, the query statement and the same keywords are nodes in the heterogeneous graph, as shown in fig. 1c.
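The construction can be illustrated with a short sketch. It assumes the networkx library and toy click-log tuples; the node and edge type names are illustrative only, not taken from the patent.

```python
import networkx as nx

def build_hetero_graph(samples):
    """samples: iterable of (video_title, query_statement, query_path, same_keywords)."""
    g = nx.Graph()
    for title, query, path, keywords in samples:
        g.add_node(title, ntype="video")    # video title node
        g.add_node(query, ntype="query")    # query statement node
        g.add_node(path, ntype="path")      # query path node
        g.add_edge(query, path, etype="issued_via")  # query reached the video via this path
        g.add_edge(path, title, etype="reached")
        for kw in keywords:                 # same keywords bridge query and title
            g.add_node(kw, ntype="keyword")
            g.add_edge(query, kw, etype="contains")
            g.add_edge(title, kw, etype="contains")
    return g
```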
104. Train a preset video recall model based on the heterogeneous graph, the query sentence and the video title, and perform video recall through the trained video recall model.
For search products, the ranking of the recall results often impacts the user experience. Existing models use only a single calculation mode, computing either the relevance between the recall results and the query or the semantic relevance. A semantic model can compute the semantic relevance between the query and the title to determine the ranking of the recalled results, but it ignores the fact that the videos recalled this way often differ from the videos the user expects. Therefore, this application trains the video recall model with both semantic information and graph structure information, so that the subsequent video recall can use both kinds of information and the recall rate of video queries is improved.
It should be noted that the heterogeneous graph, the video title and the query sentence all carry a large amount of high-dimensional information that is inconvenient to use during model training, so the high-dimensional information needs to be converted into low-dimensional information. That is, optionally, in some embodiments, the step "training the preset video recall model based on the heterogeneous graph, the query sentence and the video title" may specifically include:
(21) carrying out graph embedding processing on the heterogeneous graph to obtain a node vector corresponding to each node in the heterogeneous graph;
(22) performing word segmentation processing on the query sentence and the video title respectively to obtain a query word set corresponding to the query sentence and a title word set corresponding to the video title;
(23) and training the preset video recall model based on the node vector, the query word set and the heading word set to obtain the trained video recall model.
Graph embedding, also called network representation learning, maps the nodes of a network into low-dimensional vectors based on the characteristics of the network, so that the similarity between nodes can be measured quantitatively, which makes the graph more convenient to use.
Optionally, a DeepWalk algorithm may be employed to perform graph embedding on the heterogeneous graph. DeepWalk is a graph-structure data mining algorithm that combines the random walk and Word2Vec algorithms. It can learn the hidden information of a network and represent each node in the graph as a vector containing latent information.
Optionally, a GCN (Graph Convolutional Network) may also be used to perform graph embedding processing on the heterogeneous graph. A graph convolutional neural network takes the neighborhood of each node as input, iteratively aggregates the neighborhood embeddings of a node by defining a convolution operation on the graph, and obtains a new embedding as a function of the embeddings from the previous iteration. Aggregating only local neighborhoods makes it scalable, and multiple iterations allow a node's learned embedding to describe its global neighborhood.
Whether the DeepWalk algorithm, the graph convolutional neural network, or another graph embedding method is used for feature extraction, a node feature set composed of the node features of all nodes in the graph can be obtained. Extracting the features of each node in the heterogeneous graph yields the node vector corresponding to each node; the node vector contains the attributes of the node and the topology or connection relationships between nodes.
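A compact DeepWalk-style sketch follows, assuming networkx and gensim; the walk count, walk length and embedding dimension are illustrative values, not taken from the patent.

```python
import random
import networkx as nx
from gensim.models import Word2Vec

def random_walk(g, start, length=10):
    """Truncated random walk starting from one node."""
    walk = [start]
    while len(walk) < length:
        neighbors = list(g.neighbors(walk[-1]))
        if not neighbors:
            break
        walk.append(random.choice(neighbors))
    return [str(n) for n in walk]

def deepwalk_embeddings(g, walks_per_node=10, length=10, dim=64):
    """Treat walks as sentences and learn node vectors with skip-gram Word2Vec."""
    walks = [random_walk(g, n, length)
             for n in g.nodes() for _ in range(walks_per_node)]
    model = Word2Vec(walks, vector_size=dim, window=5, min_count=0, sg=1)
    return {n: model.wv[str(n)] for n in g.nodes()}
```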
In addition, a language model can be used to segment the query sentence and the video title into words, and the preset video recall model is then trained with the node vectors, the query word set and the title word set to obtain the trained video recall model.
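The patent does not name a specific segmenter, so the following snippet with jieba (a common Chinese word segmentation library) is only an assumption used for illustration:

```python
import jieba

# Hypothetical query sentence and video title; jieba.cut is not the patent's method.
query_words = list(jieba.cut("张三 XX 电视剧"))   # query word set
title_words = list(jieba.cut("张三自制直升机"))   # title word set
```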
It should be noted that, before the query sentence and the video title are segmented with the language model, a reference dictionary may be constructed in advance and the text lengths of the query sentence and the video title aligned based on it; the word segmentation is then performed. That is, optionally, in some embodiments, the method further includes:
(31) reserving keywords with the word frequency smaller than a preset value in the query sentence and the video title, and constructing a reference dictionary according to the reserved keywords;
(32) the text lengths of the query sentence and the video title are aligned according to a reference dictionary.
For example, specifically, keywords with a word frequency less than 2 in the query sentence and the video title are reserved, and a reference dictionary is constructed from the reserved keywords. Then, the text lengths of the query sentence and the video title are aligned according to the reference dictionary: words not in the reference dictionary are represented by a preset mark (such as <unk>); a query sentence and/or video title whose text length is greater than the preset text length is truncated, and one shorter than the preset text length is padded with the preset mark. For example, if the preset text length is 5 and the query sentence "zhang san" and the video title "zhang san sitting on the airplane" are aligned, the query sentence is adjusted to "zhang san <unk> <unk> <unk>".
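A minimal sketch of this dictionary construction and alignment, under the settings stated above (frequency threshold 2, preset length 5, <unk> as the preset mark):

```python
from collections import Counter

def build_reference_dictionary(token_lists, preset_value=2):
    """Reserve keywords whose word frequency is smaller than the preset value."""
    counts = Counter(t for tokens in token_lists for t in tokens)
    return {t for t, c in counts.items() if c < preset_value}

def align(tokens, dictionary, preset_length=5, unk="<unk>"):
    """Replace out-of-dictionary words with the preset mark, then truncate or pad."""
    tokens = [t if t in dictionary else unk for t in tokens]
    tokens = tokens[:preset_length]                   # truncate overlong text
    tokens += [unk] * (preset_length - len(tokens))   # pad short text
    return tokens

# align(["zhang", "san"], {"zhang", "san"}) returns
# ["zhang", "san", "<unk>", "<unk>", "<unk>"], matching the example above.
```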
Furthermore, an end-to-end video recall model fusing semantic information and structural information is designed; its architecture is shown in fig. 1d. A first sub-network captures the semantic features of the query sentence and of the video title with a neural network, and a second sub-network constructs the heterogeneous graph features corresponding to the query sentence and to the video title from the node vectors under the query paths and the same keywords. The two parts of information are then fused through a gate mechanism, the model outputs a predicted relevance between the query sentence and the video title, and the video recall model is finally trained against the pre-labeled actual relevance between the query sentence and the video title. That is, optionally, in some embodiments, the step "train a preset video recall model based on the node vector, the query word set, and the title word set to obtain the trained video recall model" may specifically include:
(41) constructing, according to a first sub-network in the video recall model, the query word set and the title word set, a first semantic feature corresponding to the query sentence and a second semantic feature corresponding to the video title;
(42) constructing a first heterogeneous graph characteristic corresponding to a query statement and a second heterogeneous graph characteristic corresponding to a video title based on a second sub-network, a node vector and the same keyword in the video recall model;
(43) and training the preset video recall model according to the first semantic feature, the second semantic feature, the first heterogeneous graph feature and the second heterogeneous graph feature to obtain the trained video recall model.
For step (41), the first sub-network may be used to obtain the word embedding vectors corresponding to the query word set and to the title word set, and the obtained word embedding vectors are then average-pooled to obtain the semantic feature corresponding to the query sentence and the semantic feature corresponding to the video title. That is, optionally, in some embodiments, the step "construct a first semantic feature corresponding to the query sentence and a second semantic feature corresponding to the video title according to the first sub-network, the query word set and the title word set in the video recall model" may specifically include:
inputting the first word vectors into the first sub-network of the video recall model to obtain a first word embedded representation corresponding to each query word in the query word set, and performing average pooling on the first word embedded representations to obtain the first semantic feature corresponding to the query sentence; and inputting the second word vectors into the first sub-network of the video recall model to obtain a second word embedded representation corresponding to each title word in the title word set, and performing average pooling on the second word embedded representations to obtain the second semantic feature corresponding to the video title.
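A minimal PyTorch sketch of this semantic branch (word embeddings followed by average pooling); the vocabulary size and dimension are placeholders, not values from the patent:

```python
import torch
import torch.nn as nn

class SemanticEncoder(nn.Module):
    """First sub-network: word embeddings averaged into one semantic feature."""
    def __init__(self, vocab_size, dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)

    def forward(self, token_ids):          # token_ids: (batch, seq_len)
        word_repr = self.embed(token_ids)  # word embedded representations
        return word_repr.mean(dim=1)       # average pooling -> semantic feature
```

The same encoder would be applied to the query word set and the title word set to produce the first and second semantic features.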
For step (42), the heterogeneous graph and the same keywords serve as the input of the second sub-network. The heterogeneous graph attention neural network in the second sub-network learns a feature representation for each node vector, and the vector features of the same keywords in the heterogeneous graph are then obtained. Finally, the node vectors are fused through an attention-pooling mechanism to obtain the heterogeneous graph feature of the query sentence. Specifically, under each query path, the vector feature of each node is learned through the attention mechanism shown in fig. 1e. For each node on the graph, different vector features based on the query paths are learned, and the different vector features of the same node are aggregated through an attention mechanism into one vector feature containing the structural feature (i.e., the heterogeneous graph feature corresponding to the query sentence). For the video title, a heterogeneous graph feature describing the video title may be constructed from the vector features corresponding to the same keywords, i.e., the vector features corresponding to the same keywords are spliced to obtain the second heterogeneous graph feature corresponding to the video title. That is, optionally, in some embodiments, the step "constructing the first heterogeneous graph feature corresponding to the query sentence and the second heterogeneous graph feature corresponding to the video title based on the second sub-network, the heterogeneous graph and the same keywords in the video recall model" may specifically include the following steps (a short code sketch of the attention pooling follows the list):
(51) inputting the node vectors into the second sub-network in the video recall model to obtain the vector feature corresponding to each node in the heterogeneous graph;
(52) obtaining vector features corresponding to the same keywords;
(53) constructing query features corresponding to each node vector under different query statements according to the association weight between the nodes, and fusing the query features to obtain first heterogeneous graph features corresponding to the query statements;
(54) and constructing a second heterogeneous graph characteristic corresponding to the video title based on the vector characteristic corresponding to the same keyword.
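A sketch of the attention-pooling step in (53), assuming the per-path node features have already been produced by the graph attention network; the linear scoring layer is an assumption, not the patent's exact formulation:

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Aggregate per-query-path features into one heterogeneous graph feature."""
    def __init__(self, dim=128):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, path_features):                    # (num_paths, dim)
        weights = torch.softmax(self.score(path_features), dim=0)
        return (weights * path_features).sum(dim=0)      # fused (dim,) feature
```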
In this method, each word in a sentence can establish an association with any other word in the sentence, regardless of distance; the goal is that, when generating the representations, the model can refer to the weights of the different words (namely, the association weights).
After the first semantic feature, the second semantic feature, the first heterogeneous graph feature and the second heterogeneous graph feature are obtained, the video recall model may be trained based on them. Specifically, the semantic feature and graph structure feature of the query sentence may be fused, as may those of the video title, and the video recall model is then trained on the fusion results. That is, optionally, in some embodiments, the step "train the preset video recall model according to the first semantic feature, the second semantic feature, the first heterogeneous graph feature, and the second heterogeneous graph feature, to obtain the trained video recall model" may specifically include:
(61) fusing the first semantic features and the first heterogeneous graph features to obtain query features corresponding to query sentences, and fusing the second semantic features and the second heterogeneous graph features to obtain title features corresponding to video titles;
(62) and training the preset video recall model according to the query features and the title features to obtain the trained video recall model.
Specifically, referring to fig. 1f, a gate control unit fuses the first semantic feature with the first heterogeneous graph feature, and the second semantic feature with the second heterogeneous graph feature. The gate control unit can adaptively adjust the ratio of semantic information to graph structure information: if the similarity between the query sentence and the video title is high, the model focuses more on the semantic information, because the semantic matching model has an advantage in computing similarity; if the similarity is low, the model focuses more on the structural information and tends to search for the target video through the nodes of the heterogeneous graph.
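One common way to realize such a gate unit is sketched below; the exact formulation in fig. 1f may differ, so this is an assumption rather than the patent's design:

```python
import torch
import torch.nn as nn

class GateUnit(nn.Module):
    """Adaptively mix the semantic feature with the heterogeneous graph feature."""
    def __init__(self, dim=128):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, semantic, graph):
        g = torch.sigmoid(self.gate(torch.cat([semantic, graph], dim=-1)))
        return g * semantic + (1 - g) * graph  # g near 1 favors semantic information
```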
It can be understood that, after the trained video recall model is obtained, when a video search operation is received, a set of videos to be recalled, the video search sentence corresponding to the video search operation, and the video search path are acquired, where the set contains at least one video to be recalled and each video to be recalled corresponds to one video title to be recalled. The text similarity between each video title to be recalled and the video search sentence may then be calculated, and video recall is performed through the trained video recall model based on the text similarity. For example, when the text similarity is less than 50%, the target video is searched for through the nodes of the heterogeneous graph; when the text similarity is greater than or equal to 50%, the target video is recalled according to the semantic features.
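The recall-time branching can be sketched as follows; the similarity function and the two scoring hooks (score_semantic, score_graph) are hypothetical stand-ins for the model's two branches, and the 50% threshold follows the example above:

```python
from difflib import SequenceMatcher

def recall(search_sentence, candidate_titles, model, threshold=0.5):
    """Route each candidate to the semantic or graph branch by text similarity."""
    scores = []
    for title in candidate_titles:
        sim = SequenceMatcher(None, search_sentence, title).ratio()
        if sim >= threshold:
            scores.append(model.score_semantic(search_sentence, title))  # hypothetical hook
        else:
            scores.append(model.score_graph(search_sentence, title))     # hypothetical hook
    return scores
```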
According to the method and the device, after the video title and the query information corresponding to a sample video are obtained, the same keywords between the video title and the query sentence are detected to obtain the same keywords; a heterogeneous graph representing the association relationship among the sample video, the query sentence and the query path is then constructed according to the query path and the same keywords; finally, a preset video recall model is trained based on the heterogeneous graph, the query sentence and the video title, and video recall is performed through the trained video recall model. Because the recall model is trained on a heterogeneous graph built from the query paths and the same keywords, subsequent recalls can exploit the association relationship among videos, query sentences and query paths, which improves the recall rate of video queries.
The method according to the examples is further described in detail below by way of example.
In this embodiment, the video recall apparatus is specifically integrated in a server as an example.
Referring to fig. 2, a video recall method may specifically include the following steps:
201. The server acquires a video title and query information corresponding to a sample video.
The sample video can be a video played by a video website, or a video inserted in a web page, and the like. For example, various movie videos, live videos, program videos, short videos, and the like may be available, and the sample videos may be obtained from a video website or a video database.
The query information refers to query information generated when a user queries the video, and the query information includes a query statement and a query path, and the query statement is a statement used when the video is queried.
202. The server detects the same keywords between the video title and the query sentence to obtain the same keywords.
The term "keyword" originates from the English word "keywords" and refers to the vocabulary a medium uses when building an index. Keyword search is one of the main methods of web search indexing: visitors search for the specific names of the products, services, companies and the like that they want to learn about.
For example, specifically, the server may count the frequency of occurrence of each keyword in the video title and in the query sentence, remove keywords whose frequency is greater than a preset value from the video title to obtain the processed video title, and remove keywords whose frequency is greater than a preset value from the query sentence to obtain the processed query sentence. Finally, the server detects the same keywords between the processed video title and the processed query sentence to obtain the same keywords.
203. The server constructs a heterogeneous graph for representing the association relationship among the sample video, the query statement and the query path according to the query path and the same keywords.
204. The server trains a preset video recall model based on the heterogeneous graph, the query sentence and the video title, and performs video recall through the trained video recall model.
Since the heterogeneous graph, the video title and the query sentence all carry a large amount of high-dimensional information that is inconvenient to use during model training, the high-dimensional information needs to be converted into low-dimensional information. Specifically, the server performs graph embedding processing on the heterogeneous graph to obtain the node vector corresponding to each node in the heterogeneous graph; the server then performs word segmentation on the query sentence and the video title respectively to obtain the query word set corresponding to the query sentence and the title word set corresponding to the video title; finally, the server trains the preset video recall model based on the node vectors, the query word set and the title word set to obtain the trained video recall model. After the server obtains the trained model, when a video search operation is received, it acquires the set of videos to be recalled, the video search sentence corresponding to the operation and the video search path; the server can then calculate the text similarity between each video title to be recalled and the video search sentence, and recall videos through the trained video recall model based on the text similarity.
After the server obtains the video title and the query information corresponding to the sample video, it detects the same keywords between the video title and the query sentence to obtain the same keywords; the server then constructs a heterogeneous graph representing the association relationship among the sample video, the query sentence and the query path according to the query path and the same keywords; finally, the server trains a preset video recall model based on the heterogeneous graph, the query sentence and the video title, and performs video recall through the trained model. Because the recall model is trained on a heterogeneous graph built from the query paths and the same keywords, subsequent recalls can exploit the association relationship among videos, query sentences and query paths, which improves the recall rate of video queries.
In order to better implement the video recall method of the present application, the present application further provides a video recall apparatus (recall apparatus for short) based on the above method. The meanings of the terms are the same as in the video recall method above, and implementation details may refer to the description in the method embodiments.
Referring to fig. 3, fig. 3 is a schematic structural diagram of a video recall device according to the present application, where the recall device may include an obtaining module 301, a detecting module 302, a constructing module 303, a training module 304, and a recall module 305, which may specifically be as follows:
the obtaining module 301 is configured to obtain a video title and query information corresponding to the sample video.
The query information includes a query statement and a query path. The sample video may be a video played by a video website, or a video inserted in a web page, and the like, and may be, for example, various movie videos, live videos, program videos, short videos, and the like, and the sample video may be obtained from the video website or from a video database.
The query information refers to the query information generated when a user queries a video; it includes a query statement and a query path, and the query statement is the statement used when the video is queried.
The detecting module 302 is configured to detect the same keyword between the video title and the query sentence, and obtain the same keyword.
The constructing module 303 is configured to construct a heteromorphic graph for representing an association relationship among the sample video, the query statement, and the query path according to the query path and the same keyword;
In the present application, the query path may also be a meta-path. A meta-path is a composite relation connecting two objects and a widely used structure for capturing semantics. For example, for movie videos, the meta-path "movie video - actor - movie video" describes a relationship between two movie videos, and the meta-path "movie video - director - movie video" means that two movies are directed by the same director. Heterogeneity is an intrinsic property of a heterogeneous graph, i.e., it has various types of nodes and edges; for example, different types of nodes have different features, which may lie in different feature spaces.
And the training module 304 is used for training the preset video recall model based on the abnormal picture, the query sentence and the video title.
Since the heterogeneous graph, the video title and the query sentence all carry a large amount of high-dimensional information that is inconvenient to use during model training, the high-dimensional information needs to be converted into low-dimensional information. Specifically, the training module 304 performs graph embedding processing on the heterogeneous graph to obtain the node vector corresponding to each node in the heterogeneous graph; the training module 304 can then perform word segmentation on the query sentence and the video title respectively to obtain the query word set corresponding to the query sentence and the title word set corresponding to the video title; finally, the training module 304 can train the preset video recall model based on the node vectors, the query word set and the title word set to obtain the trained video recall model.
Optionally, in some embodiments, the training module 304 may specifically include:
the processing unit is used for carrying out graph embedding processing on the heterogeneous graph to obtain a node vector corresponding to each node in the heterogeneous graph;
the word segmentation unit is used for performing word segmentation processing on the query sentence and the video title respectively to obtain a query word set corresponding to the query sentence and a title word set corresponding to the video title;
and the training unit is used for training the preset video recall model based on the node vector, the query word set and the title word set to obtain the trained video recall model.
Optionally, in some embodiments, the training unit may specifically include:
the first construction subunit is used for constructing a first semantic feature corresponding to the query sentence and a second semantic feature corresponding to the video title according to the first sub-network, the query word set and the title word set in the video recall model;
the second construction subunit is used for constructing a first heterogeneous graph feature corresponding to the query sentence and a second heterogeneous graph feature corresponding to the video title based on a second sub-network in the video recall model, the node vectors, and the same keywords;
and the training subunit is used for training the preset video recall model according to the first semantic feature, the second semantic feature, the first heterogeneous graph feature, and the second heterogeneous graph feature to obtain the trained video recall model.
Optionally, in some embodiments, the second construction subunit may be specifically configured to: input the node vectors into the second sub-network of the video recall model to obtain a vector feature for each node in the heterogeneous graph, where the vector features carry the association weights between nodes; obtain the vector features corresponding to the same keywords; construct, according to the association weights between nodes, the query feature of each node vector under different query sentences and fuse these query features to obtain the first heterogeneous graph feature corresponding to the query sentence; and construct the second heterogeneous graph feature corresponding to the video title based on the vector features of the same keywords.
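The sketch below illustrates one plausible reading of this step: dot-product attention yields association weights between a query-side node and its neighbours, the weighted neighbour vectors are fused into the first heterogeneous graph feature, and the title-side feature is pooled from the shared-keyword vectors. The attention form and the mean pooling are assumptions, not the patent's stated network.

```python
import numpy as np

def attention_weights(query_vec, node_vecs):
    """Softmax over dot-product scores: the association weights between nodes."""
    scores = node_vecs @ query_vec
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()

def hetero_graph_feature(query_vec, neighbor_vecs):
    """Fuse per-node query features into one feature by attention-weighted averaging."""
    w = attention_weights(query_vec, neighbor_vecs)
    return (w[:, None] * neighbor_vecs).sum(axis=0)

rng = np.random.default_rng(0)
query_node = rng.normal(size=64)      # vector of a query-sentence node
neighbors = rng.normal(size=(5, 64))  # vectors of its neighbouring nodes
first_hetero_feature = hetero_graph_feature(query_node, neighbors)

# Title side: built from the vector features of the shared keywords, here a simple mean.
keyword_vecs = rng.normal(size=(3, 64))
second_hetero_feature = keyword_vecs.mean(axis=0)
```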
Optionally, in some embodiments, the first construction subunit may be specifically configured to: input the first word vectors into a first sub-network of the video recall model to obtain a first word embedded representation for each query word in the query word set, and perform average pooling on the first word embedded representations to obtain the first semantic feature corresponding to the query sentence; and input the second word vectors into a second sub-network of the video recall model to obtain a second word embedded representation for each title word in the title word set, and perform average pooling on the second word embedded representations to obtain the second semantic feature corresponding to the video title.
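A minimal sketch of the average-pooling step, assuming the word embedded representations are already available as one matrix row per word:

```python
import numpy as np

def semantic_feature(word_embeddings):
    """Average-pool word embedded representations into one sentence-level feature."""
    return np.mean(word_embeddings, axis=0)

query_word_embeddings = np.random.randn(6, 64)  # one row per query word
title_word_embeddings = np.random.randn(9, 64)  # one row per title word
first_semantic_feature = semantic_feature(query_word_embeddings)   # for the query sentence
second_semantic_feature = semantic_feature(title_word_embeddings)  # for the video title
```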
Optionally, in some embodiments, the training subunit may be specifically configured to: fuse the first semantic feature and the first heterogeneous graph feature to obtain the query feature corresponding to the query sentence; fuse the second semantic feature and the second heterogeneous graph feature to obtain the title feature corresponding to the video title; and train the preset video recall model according to the query features and the title features to obtain the trained video recall model.
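One way this fusion-and-training step could look is sketched below: concatenate the semantic and heterogeneous-graph features on each side, score query/title pairs by cosine similarity, and minimise a margin (hinge) loss against a non-matching title. The cosine scoring and hinge loss are assumptions; the patent only states that the model is trained on the fused features.

```python
import numpy as np

rng = np.random.default_rng(0)
first_semantic, first_hetero = rng.normal(size=64), rng.normal(size=64)    # query side
second_semantic, second_hetero = rng.normal(size=64), rng.normal(size=64)  # title side

def fuse(semantic, hetero):
    """Fuse a semantic feature with a heterogeneous-graph feature by concatenation."""
    return np.concatenate([semantic, hetero])

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

query_feature = fuse(first_semantic, first_hetero)
title_feature = fuse(second_semantic, second_hetero)  # matching title (positive)
negative_title = rng.normal(size=query_feature.size)  # non-matching title (negative)

margin = 0.2  # illustrative value
loss = max(0.0, margin - cosine(query_feature, title_feature)
                + cosine(query_feature, negative_title))
```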
Optionally, in some embodiments, the device further includes a construction unit, which may be specifically configured to: retain the keywords whose word frequency in the query sentence and the video title is smaller than a preset value, construct a reference dictionary from the retained keywords, and align the text lengths of the query sentence and the video title according to the reference dictionary.
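A minimal sketch of this construction unit, with an assumed frequency threshold, a padding id of 0, and a fixed aligned length (all illustrative):

```python
from collections import Counter

def build_reference_dictionary(texts, max_freq=100):
    """Keep keywords whose frequency is below the threshold and index them from 1."""
    counts = Counter(word for text in texts for word in text)
    kept = [word for word, count in counts.items() if count < max_freq]
    return {word: i + 1 for i, word in enumerate(kept)}  # id 0 reserved for padding

def align(words, dictionary, length=16):
    """Map words to dictionary ids, then pad or truncate to a fixed text length."""
    ids = [dictionary.get(word, 0) for word in words][:length]
    return ids + [0] * (length - len(ids))

texts = [["funny", "cat", "video"], ["cat", "compilation"]]
ref_dict = build_reference_dictionary(texts)
print(align(["funny", "cat"], ref_dict))  # fixed-length id sequence
```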
The recall module 305 is configured to perform video recall through the trained video recall model.
After the trained video recall model is obtained, the recall module 305, upon receiving a video search operation, acquires a set of videos to be recalled together with the video search sentence and the video search path corresponding to that operation. The recall module 305 may then calculate the text similarity between the title of each video to be recalled and the video search sentence, and perform video recall through the trained video recall model based on that similarity. That is, optionally, in some embodiments, the recall module 305 may be specifically configured to: when a video search operation is received, acquire a set of videos to be recalled, the video search sentence corresponding to the video search operation, and the video search path, where the set contains at least one video to be recalled and each video to be recalled corresponds to a video title to be recalled; and calculate the text similarity between the video title to be recalled and the video search sentence, and perform video recall through the trained video recall model based on the text similarity.
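The sketch below illustrates this recall step with Jaccard word overlap as the text similarity and a placeholder model_score standing in for the trained video recall model; both choices are assumptions, since the patent does not mandate a particular similarity measure.

```python
def jaccard(a, b):
    """Text similarity as word-set overlap between two sentences."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def recall_videos(search_sentence, candidate_titles, model_score, threshold=0.5):
    """Score each video title to be recalled and keep those above the threshold."""
    results = []
    for title in candidate_titles:
        similarity = jaccard(search_sentence, title)
        score = model_score(search_sentence, title, similarity)
        if score >= threshold:
            results.append((title, score))
    return sorted(results, key=lambda item: -item[1])

# Usage with a dummy scorer that simply returns the text similarity.
titles = ["funny cat video", "cooking pasta tutorial"]
print(recall_videos("funny cat clips", titles, lambda q, t, s: s, threshold=0.2))
```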
In summary, after the acquisition module 301 acquires the video title and query information corresponding to the sample video, the detection module 302 detects the same keywords between the video title and the query sentence. The construction module 303 then constructs, according to the query path and the same keywords, a heterogeneous graph representing the association relation among the sample video, the query sentence, and the query path. Finally, the training module 304 trains the preset video recall model based on the heterogeneous graph, the query sentence, and the video title, and the recall module 305 performs video recall through the trained model. Because the heterogeneous graph constructed from the query paths and the same keywords captures the association among videos, query sentences, and query paths, the trained model can exploit these associations when recalling videos, which improves the recall rate of video queries.
In addition, the present application also provides an electronic device. Fig. 4 shows a schematic structural diagram of the electronic device related to the present application. Specifically:
The electronic device may include a processor 401 with one or more processing cores, a memory 402 with one or more computer-readable storage media, a power supply 403, an input unit 404, and other components. Those skilled in the art will appreciate that the configuration shown in fig. 4 does not limit the electronic device, which may include more or fewer components than shown, combine some components, or arrange the components differently. Wherein:
The processor 401 is the control center of the electronic device. It connects the various parts of the device through interfaces and lines, and performs the device's functions and processes its data by running or executing the software programs and/or modules stored in the memory 402 and calling the data stored in the memory 402, thereby monitoring the device as a whole. Optionally, the processor 401 may include one or more processing cores. Preferably, the processor 401 integrates an application processor, which mainly handles the operating system, user interface, and application programs, and a modem processor, which mainly handles wireless communication. It will be appreciated that the modem processor may also not be integrated into the processor 401.
The memory 402 may be used to store software programs and modules; the processor 401 executes various functional applications and performs data processing by running the software programs and modules stored in the memory 402. The memory 402 may mainly include a program storage area and a data storage area: the program storage area may store the operating system, application programs required by at least one function (such as a sound playing function or an image playing function), and the like, while the data storage area may store data created through use of the electronic device. Further, the memory 402 may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. Accordingly, the memory 402 may also include a memory controller to provide the processor 401 with access to the memory 402.
The electronic device further includes a power supply 403 for supplying power to the various components. Preferably, the power supply 403 is logically connected to the processor 401 through a power management system, so that charging, discharging, and power consumption are managed through the power management system. The power supply 403 may also include one or more DC or AC power sources, a recharging system, power-failure detection circuitry, a power converter or inverter, power status indicators, and other such components.
The electronic device may further include an input unit 404, and the input unit 404 may be used to receive input numeric or character information and generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control.
Although not shown, the electronic device may further include a display unit and the like, which are not described in detail herein. Specifically, in this embodiment, the processor 401 in the electronic device loads the executable file corresponding to the process of one or more application programs into the memory 402 according to the following instructions, and the processor 401 runs the application program stored in the memory 402, thereby implementing various functions as follows:
acquiring a video title and query information corresponding to a sample video; detecting the same keywords between the video title and the query sentence to obtain the same keywords; constructing, according to the query path and the same keywords, a heterogeneous graph representing the association relation among the sample video, the query sentence, and the query path; training a preset video recall model based on the heterogeneous graph, the query sentence, and the video title; and performing video recall through the trained video recall model.
The above operations can be implemented in the foregoing embodiments, and are not described in detail herein.
According to the present application, after the video title and query information corresponding to a sample video are acquired, the same keywords between the video title and the query sentence are detected; a heterogeneous graph representing the association relation among the sample video, the query sentence, and the query path is then constructed according to the query path and the same keywords; finally, a preset video recall model is trained based on the heterogeneous graph, the query sentence, and the video title, and video recall is performed through the trained model. Because the heterogeneous graph constructed from the query paths and the same keywords captures the association among videos, query sentences, and query paths, the trained model can exploit these associations when recalling videos, which improves the recall rate of video queries.
It will be understood by those skilled in the art that all or part of the steps of the methods in the above embodiments may be performed by instructions, or by hardware controlled by those instructions, and that the instructions may be stored in a computer-readable storage medium and loaded and executed by a processor.
To this end, the present application provides a storage medium storing a plurality of instructions that can be loaded by a processor to perform the steps of any video recall method provided herein. For example, the instructions may perform the following steps:
acquiring a video title and query information corresponding to a sample video; detecting the same keywords between the video title and the query sentence to obtain the same keywords; constructing, according to the query path and the same keywords, a heterogeneous graph representing the association relation among the sample video, the query sentence, and the query path; training a preset video recall model based on the heterogeneous graph, the query sentence, and the video title; and performing video recall through the trained video recall model.
The above operations can be implemented in the foregoing embodiments, and are not described in detail herein.
Wherein the storage medium may include: read Only Memory (ROM), Random Access Memory (RAM), magnetic or optical disks, and the like.
Since the instructions stored in the storage medium can execute the steps of any video recall method provided by the present application, they can achieve the beneficial effects of any such method. For details, see the foregoing embodiments; they are not repeated here.
The video recall method, apparatus, electronic device, and storage medium provided by the present application are described in detail above. Specific examples are used herein to illustrate the principles and implementations of the application, and the description of the above embodiments is only intended to help in understanding the method and its core idea. Meanwhile, those skilled in the art may, following the idea of the present application, make changes to the specific embodiments and the scope of application. In summary, the content of this specification should not be construed as limiting the present application.

Claims (15)

1. A method for video recall, comprising:
acquiring a video title and query information corresponding to a sample video, wherein the query information comprises a query sentence and a query path;
detecting the same keywords between the video title and the query sentence to obtain the same keywords;
constructing a heterogeneous graph for representing the association relation among the sample video, the query sentence and the query path according to the query path and the same keywords;
and training a preset video recall model based on the heterogeneous graph, the query sentence and the video title, and performing video recall through the trained video recall model.
2. The method of claim 1, wherein training a preset video recall model based on the heterogeneous graph, the query sentence, and the video title comprises:
carrying out graph embedding processing on the heterogeneous graph to obtain a node vector corresponding to each node in the heterogeneous graph;
performing word segmentation processing on the query sentence and the video title respectively to obtain a query word set corresponding to the query sentence and a title word set corresponding to the video title;
and training a preset video recall model based on the node vector, the query word set and the title word set to obtain a trained video recall model.
3. The method of claim 2, wherein the training a preset video recall model based on the node vector, the query word set and the title word set to obtain a trained video recall model comprises:
according to a first sub-network in the video recall model, the query word set and the title word set, constructing a first semantic feature corresponding to the query sentence and a second semantic feature corresponding to the video title;
constructing a first heterogeneous graph feature corresponding to the query sentence and a second heterogeneous graph feature corresponding to the video title based on a second sub-network in the video recall model, the node vector and the same keywords;
and training a preset video recall model according to the first semantic feature, the second semantic feature, the first heterogeneous graph feature and the second heterogeneous graph feature to obtain a trained video recall model.
4. The method of claim 3, wherein constructing a first heterogeneous graph feature corresponding to the query sentence and a second heterogeneous graph feature corresponding to the video title based on a second sub-network in the video recall model, the heterogeneous graph, and the same keywords comprises:
inputting the node vector into a second sub-network in the video recall model to obtain a vector feature corresponding to each node in the heterogeneous graph, wherein the vector feature carries an association weight between the nodes;
obtaining vector features corresponding to the same keywords;
according to the association weight between the nodes, constructing the query feature corresponding to each node vector under different query sentences, and fusing the query features to obtain a first heterogeneous graph feature corresponding to the query sentence;
and constructing a second heterogeneous graph feature corresponding to the video title based on the vector features corresponding to the same keywords.
5. The method of claim 3, wherein constructing the first semantic feature corresponding to the query sentence and the second semantic feature corresponding to the video title according to the first sub-network in the video recall model, the query word set and the title word set comprises:
inputting the first word vector into a first sub-network of the video recall model to obtain a first word embedded representation corresponding to each query word in the query word set, and performing average pooling on the first word embedded representations to obtain a first semantic feature corresponding to the query sentence;
and inputting the second word vector into a second sub-network of the video recall model to obtain a second word embedded representation corresponding to each title word in the title word set, and performing average pooling on the second word embedded representations to obtain a second semantic feature corresponding to the video title.
6. The method of claim 3, wherein the training a preset video recall model according to the first semantic feature, the second semantic feature, the first heterogeneous graph feature, and the second heterogeneous graph feature to obtain a trained video recall model comprises:
fusing the first semantic feature and the first heterogeneous graph feature to obtain a query feature corresponding to the query sentence;
fusing the second semantic feature and the second heterogeneous graph feature to obtain a title feature corresponding to the video title;
and training a preset video recall model according to the query features and the title features to obtain the trained video recall model.
7. The method according to claim 2, wherein before performing the word segmentation processing on the query sentence and the video title respectively to obtain a query word set corresponding to the query sentence and a title word set corresponding to the video title, the method further comprises:
reserving the keywords with the word frequency smaller than a preset value in the query sentence and the video title, and constructing a reference dictionary according to the reserved keywords;
and aligning the text lengths of the query sentence and the video title according to the reference dictionary.
8. The method of any one of claims 1 to 7, wherein performing video recall through the trained video recall model comprises:
when a video searching operation is received, acquiring a video set to be recalled, a video searching sentence corresponding to the video searching operation and a video searching path, wherein the video set to be recalled comprises at least one video to be recalled, and the video to be recalled corresponds to a video title to be recalled;
and calculating the text similarity between the video title to be recalled and the video search sentence, and performing video recall through a trained video recall model based on the text similarity.
9. The method of any one of claims 1 to 8, wherein the detecting the same keywords between the video title and the query sentence to obtain the same keywords comprises:
counting the frequency of each keyword in the video title and the frequency of each keyword in the query sentence;
removing keywords with a frequency greater than a preset value from the video title to obtain a processed video title; and
removing keywords with a frequency greater than a preset value from the query sentence to obtain a processed query sentence;
and detecting the same keywords between the processed video title and the processed query sentence to obtain the same keywords.
10. A video recall device, comprising:
an acquisition module, which is used for acquiring a video title and query information corresponding to a sample video, wherein the query information comprises a query sentence and a query path;
the detection module is used for detecting the same keywords between the video title and the query sentence to obtain the same keywords;
the construction module is used for constructing a heterogeneous graph for representing the association relation among the sample video, the query sentence and the query path according to the query path and the same keywords;
the training module is used for training a preset video recall model based on the heterogeneous graph, the query sentence and the video title;
and the recall module is used for performing video recall through the trained video recall model.
11. The apparatus of claim 10, wherein the training module comprises:
the processing unit is used for carrying out graph embedding processing on the heterogeneous graph to obtain a node vector corresponding to each node in the heterogeneous graph;
the word segmentation unit is used for performing word segmentation processing on the query sentence and the video title respectively to obtain a query word set corresponding to the query sentence and a title word set corresponding to the video title;
and the training unit is used for training a preset video recall model based on the node vector, the query word set and the title word set to obtain a trained video recall model.
12. The apparatus of claim 11, wherein the training unit comprises:
the first construction subunit is configured to construct, according to a first sub-network in the video recall model, the query word set, and the title word set, a first semantic feature corresponding to the query sentence and a second semantic feature corresponding to the video title;
a second construction subunit, configured to construct, based on a second sub-network in the video recall model, the node vector, and the same keywords, a first heterogeneous graph feature corresponding to the query sentence and a second heterogeneous graph feature corresponding to the video title;
and the training subunit is used for training a preset video recall model according to the first semantic feature, the second semantic feature, the first heterogeneous graph feature and the second heterogeneous graph feature to obtain a trained video recall model.
13. The apparatus according to claim 12, wherein the second building subunit is specifically configured to:
inputting the node vector into a second sub-network in the video recall model to obtain a vector feature corresponding to each node in the heterogeneous graph, wherein the vector feature carries an association weight between the nodes;
obtaining vector features corresponding to the same keywords;
according to the association weight between the nodes, constructing the query feature corresponding to each node vector under different query sentences, and fusing the query features to obtain a first heterogeneous graph feature corresponding to the query sentence;
and constructing a second heterogeneous graph feature corresponding to the video title based on the vector features corresponding to the same keywords.
14. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the steps of the video recall method of any one of claims 1-8 are implemented when the program is executed by the processor.
15. A computer-readable storage medium, having a computer program stored thereon, wherein the computer program, when executed by a processor, performs the steps of the video recall method of any of claims 1-8.
CN202110536313.9A 2021-05-17 2021-05-17 Video recall method and device, electronic equipment and storage medium Pending CN113761270A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110536313.9A CN113761270A (en) 2021-05-17 2021-05-17 Video recall method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110536313.9A CN113761270A (en) 2021-05-17 2021-05-17 Video recall method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113761270A true CN113761270A (en) 2021-12-07

Family

ID=78787074

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110536313.9A Pending CN113761270A (en) 2021-05-17 2021-05-17 Video recall method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113761270A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114443904A (en) * 2022-01-20 2022-05-06 腾讯科技(深圳)有限公司 Video query method, video query device, computer equipment and computer readable storage medium
CN114443904B (en) * 2022-01-20 2024-02-02 腾讯科技(深圳)有限公司 Video query method, device, computer equipment and computer readable storage medium
CN114996294A (en) * 2022-05-26 2022-09-02 阿里巴巴(中国)有限公司 Reply generation method, electronic device and computer storage medium

Similar Documents

Publication Publication Date Title
CN112131366B (en) Method, device and storage medium for training text classification model and text classification
CN112131350B (en) Text label determining method, device, terminal and readable storage medium
CN111783903B (en) Text processing method, text model processing method and device and computer equipment
CN112818251B (en) Video recommendation method and device, electronic equipment and storage medium
CN112989212B (en) Media content recommendation method, device and equipment and computer storage medium
CN113128431B (en) Video clip retrieval method, device, medium and electronic equipment
CN113254711A (en) Interactive image display method and device, computer equipment and storage medium
CN113761270A (en) Video recall method and device, electronic equipment and storage medium
CN112749556B (en) Multi-language model training method and device, storage medium and electronic equipment
CN113011172A (en) Text processing method and device, computer equipment and storage medium
CN115114395A (en) Content retrieval and model training method and device, electronic equipment and storage medium
CN115293348A (en) Pre-training method and device for multi-mode feature extraction network
CN111324773A (en) Background music construction method and device, electronic equipment and storage medium
CN114942994A (en) Text classification method, text classification device, electronic equipment and storage medium
CN112988954B (en) Text classification method and device, electronic equipment and computer-readable storage medium
CN116977701A (en) Video classification model training method, video classification method and device
CN114443904B (en) Video query method, device, computer equipment and computer readable storage medium
CN113741759B (en) Comment information display method and device, computer equipment and storage medium
CN115168609A (en) Text matching method and device, computer equipment and storage medium
CN115269961A (en) Content search method and related device
CN115130461A (en) Text matching method and device, electronic equipment and storage medium
CN114282606A (en) Object identification method and device, computer readable storage medium and computer equipment
CN116756676A (en) Abstract generation method and related device
CN111666452A (en) Method and device for clustering videos
CN111897943A (en) Session record searching method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination