CN117150054A - Media asset recall method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN117150054A
Authority
CN
China
Prior art keywords
output result
text
media asset
feature vector
picture
Prior art date
Legal status
Pending
Application number
CN202311038448.8A
Other languages
Chinese (zh)
Inventor
丁隆乾
沙源
章婷婷
陈康霖
罗红
Current Assignee
China Mobile Communications Group Co Ltd
China Mobile Hangzhou Information Technology Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Hangzhou Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd and China Mobile Hangzhou Information Technology Co Ltd
Priority to CN202311038448.8A
Publication of CN117150054A

Classifications

    • G06F16/435 — Information retrieval of multimedia data; querying; filtering based on additional data, e.g. user or group profiles
    • G06F16/432 — Information retrieval of multimedia data; querying; query formulation
    • G06F16/434 — Query formulation using image data, e.g. images, photos, pictures taken by a user
    • G06N3/098 — Computing arrangements based on biological models; neural networks; learning methods; distributed learning, e.g. federated learning

Abstract

The invention relates to the technical field of artificial intelligence and provides a media asset recall method and device, electronic equipment and a storage medium. The method comprises: acquiring the poster picture and text description of the current media asset and generating a poster feature vector and a text feature vector; inputting the poster feature vector and the text feature vector into the first layer of a multi-modal cross joint learning network for intra-modal feature learning to obtain a first output result and a second output result respectively; inputting the first and second output results into the second layer of the joint learning network to capture the connection between the text and image modalities, obtaining a third and a fourth output result respectively; splicing the third and fourth output results and inputting the spliced result into the third layer of the joint learning network for joint learning of multi-modal features to obtain the joint representation of the current media asset; and, based on the joint representation, calculating the similarity between the current media asset and other media assets, sorting by similarity score, and determining a recall media asset set. According to the invention, multi-modal cross joint learning is performed through a stacked three-layer network structure, the influence of the image modality on user behaviour is taken into account, the diversity of recall samples is increased, and the recommendation effect can be improved.

Description

Media asset recall method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of artificial intelligence technologies, and in particular to a media asset recall method and device, an electronic device and a storage medium.
Background
A general-purpose recommendation system is a complex system in which data, algorithms and the service platform are tightly coupled; its real-time data processing, offline big-data processing and recommendation-algorithm service modules cooperate closely, each performing its own role, maintaining continuously updated databases of users and of items to be recommended, and returning recommendation results for the browsing requests generated by every user at every moment in every scene. By function, a recommendation system can be divided into two main stages: a recall layer and a sorting layer.
Recall plays a foundational, bridging and decisive role in the overall recommendation system. It is essentially an information funnel consisting mainly of efficient recall rules, algorithms or simple models, so that the recommendation system can quickly recall information that may be interesting and valuable to the user from a huge candidate set and narrow the search range of the sorting algorithm; it is also responsible for fusing the data retrieved by multiple recall paths into a compact candidate set.
In the field of film and television recommendation, recall is typically based on attribute features of users and media assets. However, most of these features are single-modality features such as categorical or numerical values, and the fact that user behaviour is influenced by factors from multiple modalities such as text, images and sound is not considered, so the diversity of video media asset recall cannot be guaranteed and the recall effect is limited.
Disclosure of Invention
In view of the problems existing in the prior art, the invention provides a media asset recall method, which comprises the following steps:
acquiring a poster picture and text description of a current media asset, and generating a poster feature vector and a text feature vector of the current media asset;
inputting the text feature vector and the poster feature vector into a first layer of a multi-mode cross joint learning network to perform intra-mode feature learning to obtain a first output result and a second output result respectively;
inputting the first output result and the second output result into a second layer of the multi-mode cross joint learning network to capture the connection between the text and the image modes, so as to obtain a third output result and a fourth output result respectively;
accumulating and splicing the third output result and the fourth output result, inputting the spliced result into a third layer of the multi-mode cross joint learning network to perform multi-mode feature joint learning, and obtaining the joint representation of the current media asset;
and calculating the similarity between the current media asset and other media assets based on the joint representation of the current media asset, and carrying out similarity score sorting to determine a recall media asset set.
According to the media asset recall method provided by the invention, the poster picture and text description of the current media asset are obtained, and the poster feature vector and text feature vector of the current media asset are generated, comprising the following steps:
Inputting the poster picture of the current media asset into a pre-trained regional suggestion network for regional detection to obtain a poster detection regional picture of the current media asset;
mapping the poster detection area picture into a fixed-length feature by adopting a pre-trained area feature aggregation network to obtain a poster feature vector;
inputting the text description of the current media asset into a pre-trained language characterization model for processing to obtain a text feature vector.
According to the media asset recall method provided by the invention, the first layer of the multi-modal cross joint learning network comprises a text Transformer and a picture Transformer; inputting the text feature vector and the poster feature vector into the first layer of the multi-modal cross joint learning network to perform intra-modal feature learning and obtain a first output result and a second output result respectively comprises the following steps:
generating a query, a key and a value corresponding to the text feature vector and a query, a key and a value corresponding to the poster feature vector through matrix operation respectively;
and respectively inputting the query, the key and the value corresponding to the text feature vector and the query, the key and the value corresponding to the poster feature vector into the text Transformer and the picture Transformer for intra-modal feature learning, so as to obtain a first output result output by the text Transformer and a second output result output by the picture Transformer.
According to the media asset recall method provided by the invention, the second layer of the multi-modal cross joint learning network comprises a text Cross-Transformer and a picture Cross-Transformer; inputting the first output result and the second output result into the second layer of the multi-modal cross joint learning network to capture the connection between the text and image modalities and obtain a third output result and a fourth output result respectively comprises the following steps:
generating a query, a key and a value corresponding to the first output result and a query, a key and a value corresponding to the second output result through matrix operation respectively;
and exchanging the keys and values corresponding to the first output result with the keys and values corresponding to the second output result, and inputting them respectively into the text Cross-Transformer and the picture Cross-Transformer to capture the connection between the text and image modalities, so as to obtain a third output result output by the text Cross-Transformer and a fourth output result output by the picture Cross-Transformer.
According to the media asset recall method provided by the invention, the third layer of the multi-modal cross joint learning network is a text-picture Co-Transformer; accumulating and splicing the third output result and the fourth output result and inputting the spliced result into the third layer of the multi-modal cross joint learning network for joint learning of multi-modal features to obtain the joint representation of the current media asset comprises the following steps:
accumulating and splicing the third output result and the fourth output result;
generating a query, a key and a value corresponding to the splicing result through matrix operation, and inputting them into the text-picture Co-Transformer for joint learning of multi-modal features to obtain an overall text representation and an overall picture representation output by the text-picture Co-Transformer;
multiplying the overall text representation and the overall picture representation to obtain the joint representation of the current media asset.
According to the media asset recall method provided by the invention, the text Transformer and the picture Transformer are learned under the supervision of a multi-modal contrastive loss function; the multi-modal contrastive loss function encourages the encoded features of positive sample pairs from different modalities to be similar while distinguishing them from the encoded features of negative sample pairs; calculating the similarity between the current media asset and other media assets and sorting similarity scores based on the joint representation of the current media asset to determine a recall media asset set comprises the following steps:
splicing the joint representation of the current media asset with the first output result output by the text Transformer and the second output result output by the picture Transformer to obtain a final feature vector of the current media asset;
And calculating the similarity between the current media asset and other media assets based on the final feature vector, and sorting similarity scores to determine a recall media asset set.
According to the media asset recall method provided by the invention, the similarity between the current media asset and other media assets is calculated and the similarity score is ordered based on the final feature vector, and a recall media asset set is determined, which comprises the following steps:
splicing the mapping value of the current media asset metadata feature into the final feature vector to obtain a target feature vector; the current media asset metadata features comprise at least one of information such as type, region, language and the like;
and calculating the similarity between the current media asset and other media assets according to the target feature vector, and sorting similarity scores to determine a recall media asset set.
The invention also provides a media asset recall device, which comprises:
the acquisition module is used for acquiring the poster picture and text description of the current media asset and generating a poster feature vector and a text feature vector of the current media asset;
the feature learning module is used for inputting the text feature vector and the poster feature vector into a first layer of the multi-modal cross joint learning network to perform intra-modal feature learning, so as to obtain a first output result and a second output result respectively;
The relation capturing module is used for inputting the first output result and the second output result into a second layer of the multi-mode cross joint learning network to capture the relation between the text and the image modes, so as to obtain a third output result and a fourth output result respectively;
the joint learning module is used for performing accumulation and splicing on the third output result and the fourth output result, inputting the spliced result into a third layer of the multi-mode cross joint learning network to perform joint learning of multi-mode characteristics, and obtaining joint representation of the current media asset;
and the ordering module is used for calculating the similarity between the current media asset and other media assets based on the joint representation of the current media asset and ordering similarity scores to determine a recall media asset set.
The invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the media recall method of any one of the above when executing the program.
The invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the media asset recall method of any one of the preceding claims.
According to the embodiment of the invention, the text feature vector and the poster feature vector are input into the first layer of the multi-modal cross joint learning network for intra-modal feature learning, the output results of the first layer are input into the second layer to capture the connection between the text and image modalities, and the output results of the second layer are accumulated, spliced and input into the third layer for joint learning of multi-modal features, so that the joint representation of the media asset can be obtained and similar media assets of the current media asset can be recalled quickly and accurately based on the media asset's poster picture and text description. Multi-modal cross joint learning is performed through a stacked three-layer network structure, the influence of the image modality on user behaviour is taken into account, and the diversity of recall samples is increased; by capturing the connection between the text and image modalities, the recommendation effect can be improved.
Drawings
In order to more clearly illustrate the invention or the technical solutions of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are some embodiments of the invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic flow chart of a media recall method according to an embodiment of the present invention;
FIG. 2 is a second flowchart of a media recall method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a media recall device according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
In a recommendation system, recall is foundational, bridging and decisive: it allows the recommendation system to quickly recall information that may be interesting and valuable to the user from a huge candidate set, narrows the search range of the sorting algorithm, and fuses the data retrieved along multiple paths into a compact candidate set. Currently popular similar-media-asset recall algorithms mainly include the following:
(1) Hot-list recall: a crawler crawls online popular-list data, which is then recommended to the user as the recall result.
(2) Popularity recall based on user behaviour: for example, a play-volume popularity list sorts media assets by the unique-visitor (UV) play data of all users over a period of time, forming popularity data for the current province's large-screen users and providing non-personalised recommendations for new users; a popularity list based on media asset statistics uses users' click and exposure behaviour data to count, per media asset and time period, page views (PV) and UV figures such as click volume, exposure volume and click-through rate, and sorts the media assets by click-through rate to give the recall result.
(3) Content-based recall: the relevance between media assets is discovered from the metadata (such as labels or categories) of the recommended media assets; only the similarity between media assets needs to be calculated, without considering the preferences of different users for different media assets, in order to compute the recall result.
(4) Collaborative filtering: collaborative filtering is a commonly used recommendation algorithm that mines a user's historical behaviour data to discover preference biases and predicts the products the user may like in order to make recommendations, i.e. the familiar "guess you like" and "people who watched this also like". It is realised mainly in two ways: recommending what people who share your preferences like, or recommending media assets similar to those you already like; accordingly, common collaborative filtering algorithms are divided into two types, user-based collaborative filtering (UserCF) and item-based collaborative filtering (ItemCF). They can be summarised as grouping like with like and performing recall accordingly.
(5) Embedding recall algorithms: an Embedding essentially represents an object (a word, a commodity, a news item, a film, etc.) with a low-dimensional dense vector, and can be seen intuitively as a smoothed one-hot encoding. Traditional recall algorithms are simple and highly interpretable but have their own limitations: neither collaborative filtering nor matrix factorisation incorporates features related to users, items and context. The Embedding approach takes these into account, so recall techniques began to evolve towards Embedding modelling. Embedding recall is mainly divided into i2i recall and u2i recall. In i2i recall, a recall model is first trained offline on users' historical behaviour and the item Embeddings are output and stored in a database; when an online user request arrives, the corresponding Embeddings are looked up from the database according to the user's historical behaviour, the n items with the highest similarity are retrieved, and the Top-n recall results are returned. The main difference of u2i recall is that similar items can be retrieved directly from the user Embedding. From Embedding recall, i2i recall based on content semantics gradually evolved; it suits semantic recommendation scenarios such as news recommendation and includes Google's classical word-vector method (word2vec), the dynamic word-vector method (BERT) and Facebook's character-level vector method (FastText), and a content-semantics model can be applied directly by replacing the "word sequence" with the "click/view sequence" to generate recall results. Graph Embedding evolved next, embedding the nodes of a graph structure; the resulting node Embeddings generally contain the global structural information of the graph and the local similarity information of neighbouring nodes. Besides capturing the correlations within user behaviour sequences, Graph Embedding also considers second-order neighbour relations in the graph structure, which improves the recall effect and suits Internet scenarios whose user behaviour data form a graph structure; common Graph Embedding methods include random walk (DeepWalk), EGES (Enhanced Graph Embedding with Side Information), Node2Vec, spectral-domain graph convolutional networks (GCN) and spatial-domain graph neural networks (GraphSAGE, Graph SAmple and aggreGatE). Finally come u2i recall methods based on deep learning; classical methods include Microsoft's deep semantic model (Deep Structured Semantic Models, DSSM), which, transplanted to the recommendation scenario, became the classical two-tower recall model, producing the user-side and item-side representations offline with the two towers independent of each other and fast to compute; YouTube's deep neural network (DNN) recall method, which quickly recalls hundreds of videos from millions while preserving the recall effect; and Airbnb's recall based on users' short-term and long-term interests.
However, the above media recall method has the following disadvantages:
(1) Hot-list recall, content-based recall, collaborative filtering and matrix factorisation do not incorporate features related to users, items and context, so the methods are simple and the recall effect is limited.
(2) Current deep learning models for video recommendation mainly take the attribute features of users and media assets as input; however, most of these features are single-modality features such as categorical or numerical values, and the influence on user behaviour of factors from multiple modalities such as text, images and sound is not considered, so the diversity of video media asset recall cannot be guaranteed.
In addition, other media asset recall methods usually compute over a single modality for retrieval, so the recall results are limited.
FIG. 1 is a flowchart of a media recall method according to an embodiment of the present invention. Referring to fig. 1, an embodiment of the present invention provides a media recall method, which specifically includes the following steps:
and step 101, acquiring a poster picture and text description of the current media asset, and generating a poster feature vector and a text feature vector of the current media asset.
It should be noted that, the execution body of the media recall method provided in the embodiment of the present invention may be a server or a computer device, for example, a mobile phone, a tablet computer, a notebook computer, a palm computer, a vehicle-mounted electronic device, a wearable device, an ultra-mobile personal computer (ultra-mobile personal computer, UMPC), a netbook, a personal digital assistant (personal digital assistant, PDA), or the like.
Taking an execution subject as a server as an example, the media asset recall method provided by the embodiment of the invention can be applied to the server, the server can acquire the poster picture and text description of the current media asset, and batch processing is performed on the media asset poster picture and text description information to generate the poster feature vector and the text feature vector of the current media asset. The text description may include text information such as a profile, a tag, and media asset metadata.
It should be noted that, the media asset recall method provided by the embodiment of the invention can be applied to the field of video for video recommendation, and similar media assets of the current media asset can be recalled based on poster pictures and text description of the current media asset; in addition, the media asset recall method provided by the embodiment of the invention can be applied to the field of electronic commerce for commodity recommendation, and similar commodities of the current commodity can be recalled based on commodity pictures and text description of the current commodity.
And 102, inputting the text feature vector and the poster feature vector into a first layer of a multi-mode cross joint learning network to perform intra-mode feature learning to obtain a first output result and a second output result respectively.
The multi-modal cross-joint learning network constructed by the embodiment of the invention can be composed of three parts, wherein a first layer of the multi-modal cross-joint learning network can be used for feature learning in a modality.
Specifically, the text feature vector and the poster feature vector can be respectively input into a corresponding network of the first layer of the multi-mode cross joint learning network to perform intra-mode feature learning, so that a first output result and a second output result output by the first layer of the multi-mode cross joint learning network are respectively obtained.
And step 103, inputting the first output result and the second output result into a second layer of the multi-mode cross joint learning network to capture the connection between the text and the image modes, and respectively obtaining a third output result and a fourth output result.
The multi-modal cross-joint learning network constructed by the embodiment of the invention can be composed of three parts, wherein the second layer of the multi-modal cross-joint learning network can capture the connection between text and image modes.
Specifically, the first output result and the second output result can be respectively input into a corresponding network of the second layer of the multi-mode cross joint learning network to capture the connection between the text and the image mode, so as to respectively obtain a third output result and a fourth output result output by the second layer of the multi-mode cross joint learning network.
And 104, accumulating and splicing the third output result and the fourth output result, inputting the spliced result into a third layer of the multi-mode cross joint learning network to perform joint learning of multi-mode features, and obtaining the joint representation of the current media asset.
The multi-modal cross joint learning network constructed by the embodiment of the invention can be composed of three parts, wherein a third layer of the multi-modal cross joint learning network can be used for joint learning of multi-modal characteristics.
Specifically, the third output result and the fourth output result output by the second layer of the multi-modal cross joint learning network can be accumulated and spliced to obtain a spliced result, and the spliced result is input into the third layer of the multi-modal cross joint learning network to perform multi-modal feature joint learning, so that the joint representation of the current media asset is obtained.
Unlike the popular single-stream or dual-stream network architectures, the invention combines the two popular architectures into a unified architecture by stacking three types of layers for semantic alignment and joint learning of multi-modal inputs; by capturing the connection between the text and image modalities, a better recommendation effect can be produced.
Step 105, calculating the similarity between the current media asset and other media assets and sorting similarity scores based on the joint representation of the current media asset, and determining a recall media asset set.
Finally, feature vectors of uniform size are output for the different media assets. Specifically, the cosine similarity between the current media asset and the other media asset samples in the media asset library can be calculated based on the joint representation of the current media asset, and the results are sorted by similarity score to form the recall media asset set, realising recall of similar media assets.
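As an illustration of this recall step, the following minimal sketch (names, vector dimensions and the Top-k cut-off are assumptions for illustration, not values from the patent) computes cosine similarity between the current media asset's representation and every candidate in a catalogue and keeps the highest-scoring ones:

    import torch
    import torch.nn.functional as F

    def recall_similar(current_vec: torch.Tensor, catalog_vecs: torch.Tensor, k: int = 50):
        """current_vec: (dim,); catalog_vecs: (num_assets, dim); returns top-k indices and scores."""
        sims = F.cosine_similarity(current_vec.unsqueeze(0), catalog_vecs, dim=1)  # one score per asset
        scores, indices = sims.topk(k)            # sorted by similarity score, highest first
        return indices, scores                    # the recall media asset set

    current = torch.randn(256)            # joint representation of the current media asset
    catalog = torch.randn(10000, 256)     # representations of the other media assets in the library
    recall_ids, recall_scores = recall_similar(current, catalog)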
In the embodiment of the invention, the text feature vector and the poster feature vector are input into the first layer of the multi-modal cross joint learning network for intra-modal feature learning, the output results of the first layer are input into the second layer to capture the connection between the text and image modalities, and the output results of the second layer are accumulated, spliced and input into the third layer for joint learning of multi-modal features, so that the joint representation of the media asset is obtained and similar media assets of the current media asset can be recalled quickly and accurately from its poster picture and text description. Multi-modal cross joint learning through the stacked three-layer network structure takes the influence of the image modality on user behaviour into account and increases the diversity of recall samples; capturing the connection between the text and image modalities improves the recommendation effect.
In an optional embodiment, the obtaining the poster picture and the text description of the current media asset, and generating the poster feature vector and the text feature vector of the current media asset, include: inputting the poster picture of the current media asset into a pre-trained regional suggestion network for regional detection to obtain a poster detection regional picture of the current media asset; mapping the poster detection area picture into a fixed-length feature by adopting a pre-trained area feature aggregation network to obtain a poster feature vector; inputting the text description of the current media asset into a pre-trained language characterization model for processing to obtain a text feature vector.
The region suggestion network may be an RPN (Region Proposal Network), the region feature aggregation network may be ROI (Region of Interest) Align, and the language characterisation model may be a simplified-Chinese BERT (Bidirectional Encoder Representations from Transformers) model.
Many pre-trained detection models are currently available, but these cannot be directly applied to poster detection due to the distribution differences between the data sets. In the embodiment of the invention, the RPN can be trained by adopting the media asset poster pictures and the manually marked poster theme area detection frames, wherein the media asset poster pictures can comprise movies, television shows, children, variety, sports and other categories.
After training is finished, a pre-trained RPN is obtained; the poster picture of the current media asset is input into the pre-trained RPN for region detection, yielding the poster detection region picture output by the RPN. ROI Align can then map the poster picture into fixed-length features (i.e. the poster feature vector) that can be input into the subsequent network for multi-modal learning.
Meanwhile, the text description of the current media asset can be input into a pre-trained simplified Chinese BERT model for processing, and a text feature vector is obtained.
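The following sketch shows one way the two feature extractors could be wired up from off-the-shelf components; the model names (bert-base-chinese, and a torchvision Faster R-CNN whose RPN and RoIAlign head stand in for the patent's pre-trained networks), sequence lengths and region counts are assumptions for illustration only:

    import torch
    import torchvision
    from transformers import BertModel, BertTokenizer

    # Text branch: a pre-trained simplified-Chinese BERT encodes the profile + tags.
    tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
    bert = BertModel.from_pretrained("bert-base-chinese").eval()

    def text_feature(description: str) -> torch.Tensor:
        tokens = tokenizer(description, return_tensors="pt", truncation=True, max_length=128)
        with torch.no_grad():
            out = bert(**tokens)
        return out.last_hidden_state.squeeze(0)              # (seq_len, 768) token features

    # Picture branch: RPN proposals + RoIAlign mapped to fixed-length region features.
    detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

    def poster_feature(poster: torch.Tensor, top_k: int = 8) -> torch.Tensor:
        """poster: (3, H, W) tensor in [0, 1]; returns (top_k, 1024) fixed-length region vectors."""
        with torch.no_grad():
            images, _ = detector.transform([poster])
            feature_maps = detector.backbone(images.tensors)
            proposals, _ = detector.rpn(images, feature_maps)          # region suggestion network
            region_feats = detector.roi_heads.box_roi_pool(            # RoIAlign -> fixed size
                feature_maps, [proposals[0][:top_k]], images.image_sizes)
            region_feats = detector.roi_heads.box_head(region_feats)   # flattened region vectors
        return region_feats

    poster_vec = poster_feature(torch.rand(3, 480, 320))
    text_vec = text_feature("科幻电影 太空冒险")    # profile + tags of the current media asset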
In an alternative embodiment, the first layer of the multi-modal cross joint learning network comprises a text Transformer and a picture Transformer; inputting the text feature vector and the poster feature vector into the first layer of the multi-modal cross joint learning network to perform intra-modal feature learning and obtain a first output result and a second output result respectively comprises the following steps: generating a query, a key and a value corresponding to the text feature vector and a query, a key and a value corresponding to the poster feature vector through matrix operations; and inputting the query, key and value corresponding to the text feature vector and the query, key and value corresponding to the poster feature vector into the text Transformer and the picture Transformer respectively for intra-modal feature learning, so as to obtain a first output result output by the text Transformer and a second output result output by the picture Transformer.
Specifically, the text feature vector can be multiplied by three different matrices to generate the query, key and value corresponding to the text feature vector, i.e. three different linear transformations are applied to the same text feature vector input to represent three different states of it; likewise, the poster feature vector can be multiplied by three different matrices to generate the query, key and value corresponding to the poster feature vector, again representing three different states of the same input.
The first layer of the multi-modal cross joint learning network comprises a Text Transformer and a Visual Transformer. The query, key and value corresponding to the text feature vector are input into the Text Transformer for intra-modal feature learning, giving the first output result output by the Text Transformer; the query, key and value corresponding to the poster feature vector are input into the Visual Transformer for intra-modal feature learning, giving the second output result output by the Visual Transformer.
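A minimal sketch of such an intra-modal layer is shown below; the dimensions, head count and the use of torch's built-in multi-head attention (whose internal projection matrices play the role of the three query/key/value matrices) are assumptions made for illustration:

    import torch
    import torch.nn as nn

    class IntraModalLayer(nn.Module):
        """Self-attention block: query, key and value are all projected from the same modality."""
        def __init__(self, dim: int = 768, heads: int = 8):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

        def forward(self, x: torch.Tensor) -> torch.Tensor:    # x: (batch, seq_len, dim)
            # the attention module's internal linear maps produce query/key/value from the same input
            h, _ = self.attn(x, x, x)
            x = self.norm1(x + h)
            return self.norm2(x + self.ff(x))

    text_transformer = IntraModalLayer()       # yields the "first output result"
    visual_transformer = IntraModalLayer()     # yields the "second output result"
    first_out = text_transformer(torch.randn(1, 32, 768))     # encoded text description
    second_out = visual_transformer(torch.randn(1, 8, 768))   # encoded poster regions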
In an alternative embodiment, the second layer of the multi-modal cross joint learning network comprises a text Cross-Transformer and a picture Cross-Transformer; inputting the first output result and the second output result into the second layer of the multi-modal cross joint learning network to capture the connection between the text and image modalities and obtain a third output result and a fourth output result respectively comprises the following steps: generating a query, a key and a value corresponding to the first output result and a query, a key and a value corresponding to the second output result through matrix operations; and exchanging the keys and values corresponding to the first output result with the keys and values corresponding to the second output result, inputting them respectively into the text Cross-Transformer and the picture Cross-Transformer to capture the connection between the text and image modalities, so as to obtain a third output result output by the text Cross-Transformer and a fourth output result output by the picture Cross-Transformer.
In the embodiment of the invention, the second layer of the multi-modal cross joint learning network comprises a Text Cross-Transformer and a Visual Cross-Transformer; this layer captures the connection between the text and image modalities by exchanging the keys and values of text and image in the multi-head attention mechanism. The input vectors are the outputs of the previous Transformer layer; the query, key and value are generated through matrix operations, the key and value of the text and the key and value of the picture are exchanged, and the subsequent computation is then carried out.
Specifically, the query, key and value corresponding to the first output result and the query, key and value corresponding to the second output result can first be generated through matrix operations. The keys and values of the two modalities are then exchanged: the query of the first output result, together with the key and value of the second output result, is input into the Text Cross-Transformer to capture the connection between the text and image modalities, giving the third output result; the query of the second output result, together with the key and value of the first output result, is input into the Visual Cross-Transformer, giving the fourth output result.
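A sketch of the key/value exchange, again with illustrative dimensions that are only assumptions, is given below; each Cross-Transformer takes its queries from its own modality and its keys and values from the other one:

    import torch
    import torch.nn as nn

    class CrossModalLayer(nn.Module):
        """Cross-attention block: queries from one modality, keys/values from the other."""
        def __init__(self, dim: int = 768, heads: int = 8):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.norm = nn.LayerNorm(dim)

        def forward(self, q_src: torch.Tensor, kv_src: torch.Tensor) -> torch.Tensor:
            h, _ = self.attn(q_src, kv_src, kv_src)     # key and value come from the other modality
            return self.norm(q_src + h)

    text_cross, visual_cross = CrossModalLayer(), CrossModalLayer()

    first_out = torch.randn(1, 32, 768)      # stand-in for the Text Transformer output
    second_out = torch.randn(1, 8, 768)      # stand-in for the Visual Transformer output
    third_out = text_cross(first_out, second_out)      # text queries over picture keys/values
    fourth_out = visual_cross(second_out, first_out)   # picture queries over text keys/values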
In an alternative embodiment, the third layer of the multi-modal cross joint learning network is a text-picture Co-Transformer; accumulating and splicing the third output result and the fourth output result and inputting the spliced result into the third layer of the multi-modal cross joint learning network for joint learning of multi-modal features to obtain the joint representation of the current media asset comprises the following steps: accumulating and splicing the third output result and the fourth output result; generating a query, a key and a value corresponding to the splicing result through matrix operation, and inputting them into the text-picture Co-Transformer for joint learning of multi-modal features to obtain an overall text representation and an overall picture representation output by the text-picture Co-Transformer; and multiplying the overall text representation and the overall picture representation to obtain the joint representation of the current media asset.
In the embodiment of the invention, the outputs of the Text/Visual Cross-Transformers of the previous layer are accumulated and spliced before being fed into the last layer, the Text-Visual Co-Transformer; the query, key and value corresponding to the splicing result are then generated through matrix operations and the subsequent computation is carried out, i.e. the query, key and value corresponding to the splicing result are input into the Text-Visual Co-Transformer for joint learning of the multi-modal features.
During inference, the outputs of the Co-Transformer are the overall text representation H_txt and the overall picture representation H_img; multiplying these two vectors H_txt and H_img gives the joint representation of the media asset.
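A sketch of the third layer follows; splicing along the sequence axis, mean pooling into H_txt and H_img, and the final element-wise product are assumptions about details the text leaves open:

    import torch
    import torch.nn as nn

    co_transformer = nn.TransformerEncoderLayer(d_model=768, nhead=8, batch_first=True)

    third_out = torch.randn(1, 32, 768)      # text-side output of the second layer
    fourth_out = torch.randn(1, 8, 768)      # picture-side output of the second layer

    joint_in = torch.cat([third_out, fourth_out], dim=1)    # splice the two modalities together
    joint_out = co_transformer(joint_in)                    # joint learning over both modalities

    h_txt = joint_out[:, :third_out.size(1)].mean(dim=1)    # overall text representation H_txt
    h_img = joint_out[:, third_out.size(1):].mean(dim=1)    # overall picture representation H_img
    joint_repr = h_txt * h_img                              # joint representation of the media asset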
FIG. 2 is a second flowchart of a media recall method according to an embodiment of the present invention. Referring to fig. 2, large screen media asset multi-modal characterization extraction and similar media asset mining can be performed based on a pre-trained multi-modal model. Firstly, batch processing can be carried out on media asset poster pictures and text description information, a poster suggestion region is generated through a pre-trained region suggestion network (RPN), and a poster picture main body feature vector is extracted; text feature vectors of the media asset profile + tags can be extracted by a pre-trained BERT model.
Unlike the popular single-stream or dual-stream Transformer architectures, embodiments of the invention combine the two popular architectures into a unified architecture by stacking three types of layers for semantic alignment and joint learning of multi-modal inputs. The text feature vector and the poster feature vector are input into the Text Transformer and the Visual Transformer respectively to learn the features within each single modality separately; the Text/Visual Cross-Transformer captures the relationship between the text and image modalities by exchanging the keys/values of text and image in the multi-head attention mechanism; finally, the text and image features are spliced and input into the Text-Visual Co-Transformer for joint learning of the multi-modal features.
For the multi-modal data of a single media asset, the embodiment of the invention outputs a single feature vector after the model computation, and auxiliary feature dimensions such as director, actors and region are then spliced into this vector. Finally, feature vectors of uniform size are output for the different media assets, the similarity between different media assets is calculated with cosine similarity, and the results are sorted by similarity score to realise recall of similar media assets, which improves the recall effect.
In an alternative embodiment, pretext tasks may be employed for self-supervised learning; for multi-modal paired feature learning, two masked modelling tasks may be adopted following BERT and VisualBERT, namely a masked language modelling task (Masked Language Model, MLM) and a masked region prediction task (Masked Region Prediction, MRP).
With continued reference to FIG. 2, specifically, for MLM about 15% of the text tokens and suggested regions are masked and the remaining inputs are used to reconstruct the masked information. Consistent with the scheme used in BERT, 80% of the selected 15% of tokens are replaced with [MASK], 10% are randomly replaced with other tokens and the remaining 10% are left unchanged; finally, the outputs corresponding to these 15% of tokens are classified to predict their true values. For MRP, the model directly regresses the masked region features, supervised with the features extracted by the pre-trained RPN under a mean squared error loss (MSE Loss).
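As an illustration of the 80/10/10 rule for MLM, the sketch below masks roughly 15% of a token sequence; the [MASK] id and vocabulary size are placeholders borrowed from a Chinese BERT vocabulary and are assumptions, not values given in the patent:

    import torch

    def mask_tokens(token_ids: torch.Tensor, mask_id: int = 103, vocab_size: int = 21128,
                    mask_prob: float = 0.15):
        labels = token_ids.clone()
        selected = torch.rand(token_ids.shape) < mask_prob     # ~15% of positions are covered
        labels[~selected] = -100                                # only covered positions are predicted
        roll = torch.rand(token_ids.shape)
        inputs = token_ids.clone()
        inputs[selected & (roll < 0.8)] = mask_id               # 80% of them become [MASK]
        random_ids = torch.randint(vocab_size, token_ids.shape)
        swap = selected & (roll >= 0.8) & (roll < 0.9)
        inputs[swap] = random_ids[swap]                         # 10% become a random token
        return inputs, labels                                   # the remaining 10% stay unchanged

    inputs, labels = mask_tokens(torch.randint(21128, (1, 32)))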
In an alternative embodiment, the text and picture Transformers are learned under the supervision of a multi-modal contrastive loss function; the multi-modal contrastive loss function encourages the encoded features of positive sample pairs from different modalities to be similar while distinguishing them from the encoded features of negative sample pairs. Calculating the similarity between the current media asset and other media assets and sorting similarity scores based on the joint representation of the current media asset to determine a recall media asset set comprises the following steps: splicing the joint representation of the current media asset with the first output result output by the text Transformer and the second output result output by the picture Transformer to obtain a final feature vector of the current media asset; and calculating the similarity between the current media asset and other media assets based on the final feature vector and sorting similarity scores to determine a recall media asset set.
In addition to the intra-modality losses, a contrastive learning loss between modalities may be employed to express the consistency between images and text. This form of contrastive loss function encourages the encoded features of positive sample pairs from different modalities to be similar while distinguishing them from the encoded features of negative sample pairs.
Specifically, for N pairs of picture text within a batch, there are N positive pairs of samples and other 2 (N-1) pairs of negative samples that do not match the picture text. The loss function formula is as follows:
where (x_i, x_j) denotes a picture-text pair, z_i denotes the feature vector extracted from the corresponding picture or text by the Transformer, sim(u, v) = u^T v / (‖u‖ ‖v‖) denotes the cosine similarity of the pair, and 1_[k ≠ i] is an indicator function that returns 1 only when k ≠ i.
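Assuming the loss takes the standard NT-Xent form implied by the description above (in-batch picture-text pairs, cosine similarity with a temperature τ, and the indicator excluding the anchor from the 2N candidates) — an assumption, since the exact equation is not given here — it can be written as:

    \ell(i, j) = -\log \frac{\exp\!\big(\mathrm{sim}(z_i, z_j)/\tau\big)}
                            {\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp\!\big(\mathrm{sim}(z_i, z_k)/\tau\big)},
    \qquad
    L = \frac{1}{2N} \sum_{i=1}^{N} \big[\ell(2i-1, 2i) + \ell(2i, 2i-1)\big]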
The Text Transformer and Visual Transformer are learned under the supervision of the contrastive loss function, and the results they output can be spliced with the joint representation of the current media asset to obtain the final feature vector of the current media asset. In the embodiment of the invention, because the Text Transformer and Visual Transformer are learned under the supervision of the contrastive loss function, splicing their output vectors onto the joint representation of the current media asset for retrieval is beneficial to the final retrieval effect.
In the embodiment of the invention, adopting the multi-modal contrastive loss function strengthens the correspondence between the image and text modalities; because of its supervision of the outputs of the Text Transformer and Visual Transformer, splicing these outputs onto the product of the two vectors H_img and H_txt output by the Text-Visual Co-Transformer can further improve the retrieval and recall effect.
In an optional embodiment, the calculating the similarity between the current asset and other assets and the sorting of similarity scores based on the final feature vector, determining a recall asset set includes: splicing the mapping value of the current media asset metadata feature into the final feature vector to obtain a target feature vector; the current media asset metadata features comprise at least one of information such as type, region, language and the like; and calculating the similarity between the current media asset and other media assets according to the target feature vector, and sorting similarity scores to determine a recall media asset set.
Specifically, after the results output by the Text Transformer and Visual Transformer are spliced with the joint representation of the current media asset, a fixed number of columns can be appended to splice in the mapping values of the media asset metadata features, such as the mapping values of at least one of type, region, language, director and actor information; the similarity between the current media asset and other media assets is then calculated from the target feature vector obtained after splicing, similarity scores are sorted, and the recall media asset set is determined.
In the embodiment of the invention, when the final retrieval computes cosine similarity to recall similar media assets, a fixed number of columns are added to the feature vector on top of the output vector of the network model and the mapping values of media asset metadata such as category, region and language are spliced in, which improves the final retrieval and recall effect.
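A toy sketch of the final target feature vector is given below; the specific metadata-to-number mappings and the vector dimensions are invented for illustration and are not part of the patent:

    import torch
    import torch.nn.functional as F

    GENRES = {"movie": 0.1, "series": 0.2, "sports": 0.3}       # hypothetical mapping values
    REGIONS = {"mainland": 0.1, "hongkong": 0.2, "other": 0.3}

    def target_vector(joint_repr, text_out_pooled, pic_out_pooled, genre, region):
        """Splice the joint representation, the two first-layer outputs and mapped metadata."""
        meta = torch.tensor([GENRES.get(genre, 0.0), REGIONS.get(region, 0.0)])
        return torch.cat([joint_repr, text_out_pooled, pic_out_pooled, meta], dim=0)

    vec_a = target_vector(torch.randn(256), torch.randn(64), torch.randn(64), "movie", "mainland")
    vec_b = target_vector(torch.randn(256), torch.randn(64), torch.randn(64), "series", "other")
    score = F.cosine_similarity(vec_a.unsqueeze(0), vec_b.unsqueeze(0)).item()   # similarity score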
According to the embodiment of the invention, in the stacked three-layer network structure the Text/Visual Transformer learns intra-modal features, the Text/Visual Cross-Transformer captures the relationship between the text and image modalities by exchanging key-value pairs in the multi-head attention mechanism, and the Text-Visual Co-Transformer jointly learns the multi-modal features. Unlike previous single-stream or dual-stream Transformer architectures, this realises multi-modal cross joint learning, takes the influence of the image modality on user behaviour into account and increases the diversity of recall samples; capturing the connection between the text and image modalities improves the recommendation effect.
The media recall device provided by the invention is described below, and the media recall device described below and the media recall method described above can be referred to correspondingly.
FIG. 3 is a schematic diagram of a media recall device according to an embodiment of the present invention. Referring to fig. 3, an embodiment of the present invention provides a media asset recall device, where the device may specifically include the following modules:
the acquisition module 301 is configured to acquire a poster picture and a text description of a current media asset, and generate a poster feature vector and a text feature vector of the current media asset;
the feature learning module 302 is configured to input the text feature vector and the poster feature vector into a first layer of a multi-modal cross joint learning network for intra-modal feature learning, so as to obtain a first output result and a second output result respectively;
the relationship capturing module 303 is configured to input the first output result and the second output result into the second layer of the multi-mode cross joint learning network to capture a relationship between text and image modes, so as to obtain a third output result and a fourth output result respectively;
the joint learning module 304 is configured to perform accumulation and splicing on the third output result and the fourth output result, and input a spliced result into a third layer of the multi-modal cross joint learning network to perform joint learning of multi-modal features, so as to obtain a joint representation of the current media asset;
The ranking module 305 is configured to calculate a similarity between the current media asset and other media assets based on the joint representation of the current media asset, and perform similarity score ranking to determine a recall media asset set.
In an alternative embodiment, the acquiring module is specifically configured to:
inputting the poster picture of the current media asset into a pre-trained regional suggestion network for regional detection to obtain a poster detection regional picture of the current media asset;
mapping the poster detection area picture into a fixed-length feature by adopting a pre-trained area feature aggregation network to obtain a poster feature vector;
inputting the text description of the current media asset into a pre-trained language characterization model for processing to obtain a text feature vector.
In an alternative embodiment, the first layer of the multi-modal cross joint learning network comprises a text Transformer and a picture Transformer; the feature learning module is specifically configured to:
generating a query, a key and a value corresponding to the text feature vector and a query, a key and a value corresponding to the poster feature vector through matrix operation respectively;
and respectively inputting the query, the key and the value corresponding to the text feature vector and the query, the key and the value corresponding to the poster feature vector into the text Transformer and the picture Transformer for intra-modal feature learning, so as to obtain a first output result output by the text Transformer and a second output result output by the picture Transformer.
In an alternative embodiment, the second layer of the multi-modal cross joint learning network comprises a text Cross-Transformer and a picture Cross-Transformer; the relation capturing module is specifically configured to:
generating a query, a key and a value corresponding to the first output result and a query, a key and a value corresponding to the second output result through matrix operation respectively;
and exchanging the keys and values corresponding to the first output result with the keys and values corresponding to the second output result, and inputting them respectively into the text Cross-Transformer and the picture Cross-Transformer to capture the connection between the text and image modalities, so as to obtain a third output result output by the text Cross-Transformer and a fourth output result output by the picture Cross-Transformer.
In an alternative embodiment, the third layer of the multi-modal cross joint learning network is a text-picture Co-Transformer; the joint learning module is specifically configured to:
accumulating and splicing the third output result and the fourth output result;
generating a query, a key and a value corresponding to the splicing result through matrix operation, and inputting them into the text-picture Co-Transformer for joint learning of multi-modal features to obtain an overall text representation and an overall picture representation output by the text-picture Co-Transformer;
multiplying the overall text representation and the overall picture representation to obtain the joint representation of the current media asset.
In an alternative embodiment, the text Transformer and picture Transformer are learned under the supervision of a multi-modal contrastive loss function; the multi-modal contrastive loss function encourages the encoded features of positive sample pairs from different modalities to be similar while distinguishing them from the encoded features of negative sample pairs; the sorting module is specifically configured to:
splicing the joint representation of the current media asset with the first output result output by the text Transformer and the second output result output by the picture Transformer to obtain a final feature vector of the current media asset;
And calculating the similarity between the current media asset and other media assets based on the final feature vector, and sorting similarity scores to determine a recall media asset set.
In an alternative embodiment, the sorting module is specifically configured to:
splice mapped values of the metadata features of the current media asset into the final feature vector to obtain a target feature vector, wherein the metadata features of the current media asset include at least one of type, region and language;
and calculate the similarity between the current media asset and other media assets according to the target feature vector, and sort the similarity scores to determine the recall media asset set.
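A sketch of splicing mapped metadata features onto the final feature vector; the vocabulary sizes and the 16-dimensional embedding tables are placeholder choices, since the patent only names type, region and language as possible metadata.

```python
import torch
import torch.nn as nn

# Hypothetical metadata vocabularies mapping categorical metadata to dense values.
type_emb, region_emb, lang_emb = nn.Embedding(50, 16), nn.Embedding(200, 16), nn.Embedding(100, 16)

def target_feature_vector(final_vec, type_id, region_id, lang_id):
    """Splice the mapped metadata features onto the final feature vector."""
    meta = torch.cat([type_emb(type_id), region_emb(region_id), lang_emb(lang_id)], dim=-1)
    return torch.cat([final_vec, meta], dim=-1)

target_vec = target_feature_vector(torch.rand(768),
                                   torch.tensor(3), torch.tensor(12), torch.tensor(7))
print(target_vec.shape)   # torch.Size([816])
```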
According to the embodiment of the invention, a stacked three-layer network structure is used: the Text/Visual Transformer learns intra-modal features; the Text/Visual Cross-Transformer captures the relationship between the text and image modalities by exchanging key-value pairs within the multi-head attention mechanism; and the Text-Visual Co-Transformer performs joint learning of multi-modal features. Unlike previous single-stream or dual-stream Transformer architectures, this design enables multi-modal cross joint learning, takes the influence of the image modality on user behavior into account, and increases the diversity of recall samples; capturing the relationship between the text and image modalities further improves the recommendation effect.
Fig. 4 illustrates a schematic diagram of the physical structure of an electronic device. As shown in Fig. 4, the electronic device may include: a processor 410, a communication interface (Communications Interface) 420, a memory 430 and a communication bus 440, wherein the processor 410, the communication interface 420 and the memory 430 communicate with each other via the communication bus 440. The processor 410 may invoke logic instructions in the memory 430 to perform a media asset recall method comprising:
acquiring a poster picture and a text description of a current media asset, and generating a poster feature vector and a text feature vector of the current media asset;
inputting the text feature vector and the poster feature vector into a first layer of a multi-modal cross joint learning network for intra-modal feature learning, so as to obtain a first output result and a second output result respectively;
inputting the first output result and the second output result into a second layer of the multi-modal cross joint learning network to capture the relationship between the text and image modalities, so as to obtain a third output result and a fourth output result respectively;
accumulating and splicing the third output result and the fourth output result, and inputting the spliced result into a third layer of the multi-modal cross joint learning network for joint learning of multi-modal features, so as to obtain a joint representation of the current media asset;
and calculating the similarity between the current media asset and other media assets based on the joint representation of the current media asset, and sorting the similarity scores to determine a recall media asset set.
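Purely as an illustration of the flow the processor executes, the sketch below chains these steps with hypothetical stand-in callables for the trained networks; shapes, pooling and the toy inputs are assumptions, not the patent's implementation.

```python
import torch
import torch.nn.functional as F

def media_asset_recall(assets, encode_poster, encode_text, layer1, layer2, layer3, top_k=10):
    """End-to-end sketch of the recall flow. Each callable is a hypothetical stand-in for a
    trained network; layer1/layer2 outputs are assumed to be (tokens, dim) sequences."""
    finals = []
    for asset in assets:
        poster_vec = encode_poster(asset["poster"])        # poster feature vector
        text_vec = encode_text(asset["description"])       # text feature vector
        first, second = layer1(text_vec, poster_vec)       # intra-modal feature learning
        third, fourth = layer2(first, second)              # cross-modal relationship capture
        joint = layer3(third, fourth)                      # joint representation, shape (dim,)
        finals.append(torch.cat([joint, first.mean(0), second.mean(0)]))
    finals = torch.stack(finals)                           # (num_assets, dim * 3)
    sims = F.cosine_similarity(finals[:1], finals)         # similarity to the current (first) asset
    sims[0] = float("-inf")                                # exclude the current asset itself
    return sims.topk(min(top_k, len(assets) - 1)).indices  # recall media asset set

# Toy stand-ins so the sketch runs end to end.
rand_seq = lambda: torch.rand(8, 256)
recalled = media_asset_recall(
    assets=[{"poster": None, "description": None} for _ in range(5)],
    encode_poster=lambda p: torch.rand(256),
    encode_text=lambda t: torch.rand(256),
    layer1=lambda t, p: (rand_seq(), rand_seq()),
    layer2=lambda a, b: (rand_seq(), rand_seq()),
    layer3=lambda a, b: torch.rand(256),
    top_k=3,
)
```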
Further, the logic instructions in the memory 430 described above may be implemented in the form of software functional units and, when sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the media asset recall method provided above, the method comprising:
acquiring a poster picture and a text description of a current media asset, and generating a poster feature vector and a text feature vector of the current media asset;
inputting the text feature vector and the poster feature vector into a first layer of a multi-modal cross joint learning network for intra-modal feature learning, so as to obtain a first output result and a second output result respectively;
inputting the first output result and the second output result into a second layer of the multi-modal cross joint learning network to capture the relationship between the text and image modalities, so as to obtain a third output result and a fourth output result respectively;
accumulating and splicing the third output result and the fourth output result, and inputting the spliced result into a third layer of the multi-modal cross joint learning network for joint learning of multi-modal features, so as to obtain a joint representation of the current media asset;
and calculating the similarity between the current media asset and other media assets based on the joint representation of the current media asset, and sorting the similarity scores to determine a recall media asset set.
The apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the invention without creative effort.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus a necessary general hardware platform, or, of course, by means of hardware. Based on this understanding, the foregoing technical solution, in essence or in the part contributing to the prior art, may be embodied in the form of a software product, which may be stored in a computer-readable storage medium such as a ROM/RAM, a magnetic disk or an optical disk, and which includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute the method described in the respective embodiments or in some parts of the embodiments.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solution of the present invention, not to limit it. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A media asset recall method, comprising:
acquiring a poster picture and a text description of a current media asset, and generating a poster feature vector and a text feature vector of the current media asset;
inputting the text feature vector and the poster feature vector into a first layer of a multi-modal cross joint learning network for intra-modal feature learning, so as to obtain a first output result and a second output result respectively;
inputting the first output result and the second output result into a second layer of the multi-modal cross joint learning network to capture the relationship between the text and image modalities, so as to obtain a third output result and a fourth output result respectively;
accumulating and splicing the third output result and the fourth output result, and inputting the spliced result into a third layer of the multi-modal cross joint learning network for joint learning of multi-modal features, so as to obtain a joint representation of the current media asset;
and calculating the similarity between the current media asset and other media assets based on the joint representation of the current media asset, and sorting the similarity scores to determine a recall media asset set.
2. The method of claim 1, wherein the acquiring a poster picture and a text description of the current media asset and generating a poster feature vector and a text feature vector of the current media asset comprises:
inputting the poster picture of the current media asset into a pre-trained region proposal network for region detection to obtain a poster detection region picture of the current media asset;
mapping the poster detection region picture to a fixed-length feature using a pre-trained region feature aggregation network to obtain the poster feature vector;
and inputting the text description of the current media asset into a pre-trained language representation model to obtain the text feature vector.
3. The method of claim 1, wherein the first layer of the multi-modal cross joint learning network comprises a text Transformer and a picture Transformer; and the inputting the text feature vector and the poster feature vector into the first layer of the multi-modal cross joint learning network for intra-modal feature learning to obtain a first output result and a second output result respectively comprises:
generating, by matrix operations, a query, a key and a value corresponding to the text feature vector and a query, a key and a value corresponding to the poster feature vector respectively;
and inputting the query, key and value corresponding to the text feature vector and the query, key and value corresponding to the poster feature vector into the text Transformer and the picture Transformer respectively for intra-modal feature learning, so as to obtain the first output result output by the text Transformer and the second output result output by the picture Transformer.
4. The method of claim 3, wherein the second layer of the multi-modal cross joint learning network comprises a text Cross-Transformer and a picture Cross-Transformer; and the inputting the first output result and the second output result into the second layer of the multi-modal cross joint learning network to capture the relationship between the text and image modalities to obtain a third output result and a fourth output result respectively comprises:
generating, by matrix operations, a query, a key and a value corresponding to the first output result and a query, a key and a value corresponding to the second output result respectively;
and exchanging the key and value corresponding to the first output result with the key and value corresponding to the second output result, and inputting them into the text Cross-Transformer and the picture Cross-Transformer respectively to capture the relationship between the text and image modalities, so as to obtain the third output result output by the text Cross-Transformer and the fourth output result output by the picture Cross-Transformer.
5. The method of claim 4, wherein the third layer of the multi-modal cross joint learning network is a text-picture Co-Transformer; and the accumulating and splicing the third output result and the fourth output result and inputting the spliced result into the third layer of the multi-modal cross joint learning network for joint learning of multi-modal features to obtain the joint representation of the current media asset comprises:
accumulating and splicing the third output result and the fourth output result;
generating, by matrix operations, a query, a key and a value corresponding to the spliced result, and inputting the query, key and value corresponding to the spliced result into the text-picture Co-Transformer for joint learning of multi-modal features, so as to obtain an overall text representation and an overall picture representation output by the text-picture Co-Transformer;
and multiplying the overall text representation by the overall picture representation to obtain the joint representation of the current media asset.
6. The method of claim 3, wherein the text Transformer and the picture Transformer are trained under the supervision of a multi-modal contrastive loss function, the multi-modal contrastive loss function being used to distinguish the encoded features of cross-modal positive sample pairs from the encoded features of negative sample pairs, so that the encoded features of positive sample pairs across modalities are similar; and the calculating the similarity between the current media asset and other media assets based on the joint representation of the current media asset and sorting the similarity scores to determine a recall media asset set comprises:
splicing the joint representation of the current media asset with the first output result output by the text Transformer and the second output result output by the picture Transformer to obtain a final feature vector of the current media asset;
and calculating the similarity between the current media asset and other media assets based on the final feature vector, and sorting the similarity scores to determine the recall media asset set.
7. The method of claim 6, wherein the calculating the similarity between the current media asset and other media assets based on the final feature vector and sorting the similarity scores to determine the recall media asset set comprises:
splicing mapped values of metadata features of the current media asset into the final feature vector to obtain a target feature vector, wherein the metadata features of the current media asset include at least one of type, region and language;
and calculating the similarity between the current media asset and other media assets according to the target feature vector, and sorting the similarity scores to determine the recall media asset set.
8. A media asset recall device, comprising:
an acquisition module, configured to acquire a poster picture and a text description of a current media asset and generate a poster feature vector and a text feature vector of the current media asset;
a feature learning module, configured to input the text feature vector and the poster feature vector into a first layer of a multi-modal cross joint learning network for intra-modal feature learning to obtain a first output result and a second output result respectively;
a relation capturing module, configured to input the first output result and the second output result into a second layer of the multi-modal cross joint learning network to capture the relationship between the text and image modalities, so as to obtain a third output result and a fourth output result respectively;
a joint learning module, configured to accumulate and splice the third output result and the fourth output result, and input the spliced result into a third layer of the multi-modal cross joint learning network for joint learning of multi-modal features, so as to obtain a joint representation of the current media asset;
and a sorting module, configured to calculate the similarity between the current media asset and other media assets based on the joint representation of the current media asset and sort the similarity scores to determine a recall media asset set.
9. An electronic device, comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the media asset recall method of any one of claims 1 to 7.
10. A non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the media asset recall method of any one of claims 1 to 7.
CN202311038448.8A 2023-08-16 2023-08-16 Media asset recall method and device, electronic equipment and storage medium Pending CN117150054A (en)

Publications (1)

Publication Number: CN117150054A; Publication Date: 2023-12-01

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination