WO2024061073A1 - Multimedia information generation method and apparatus, and computer-readable storage medium


Info

Publication number
WO2024061073A1
Authority
WO
WIPO (PCT)
Prior art keywords
features
information
content
item
target
Prior art date
Application number
PCT/CN2023/118512
Other languages
French (fr)
Chinese (zh)
Inventor
张政
刘银星
阮涛
吕晶晶
Original Assignee
北京沃东天骏信息技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京沃东天骏信息技术有限公司
Publication of WO2024061073A1


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40 - Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/48 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/483 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 - Details of database functions independent of the retrieved data types
    • G06F16/95 - Retrieval from the web
    • G06F16/953 - Querying, e.g. by the use of web search engines
    • G06F16/9535 - Search customisation based on user profiles and personalisation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/25 - Fusion techniques
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06Q - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 - Commerce
    • G06Q30/02 - Marketing; Price estimation or determination; Fundraising
    • G06Q30/0241 - Advertisements
    • G06Q30/0251 - Targeted advertisements
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06Q - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 - Commerce
    • G06Q30/06 - Buying, selling or leasing transactions
    • G06Q30/0601 - Electronic shopping [e-shopping]

Definitions

  • the present invention relates to the field of computer vision, and in particular, to a method and device for generating multimedia information, and a computer-readable storage medium.
  • interest product recall is usually at product granularity.
  • the system will select a product candidate set for the current user based on the user's historical browsing, searching, purchasing, adding to the shopping cart, etc., and select the optimal product based on the advertising system ranking model.
  • relevant advertisements will be generated.
  • advertisements are usually generated using templates: the main image of the product is inserted into the template, and the corresponding product advertisement is rendered and generated.
  • the generated advertisement recommendations are relatively simple.
  • Embodiments of the present invention provide a method and device for generating multimedia information, and a computer-readable storage medium, which can generate corresponding target multimedia information based on item information and content information, has diversity, and has good recommendation effects.
  • Embodiments of the present invention provide a method for generating multimedia information.
  • the method includes:
  • Feature extraction is performed based on the item information and content information to obtain item features corresponding to the item dimension and content features corresponding to the content dimension, and the item features and content features are collaborated and fused to obtain multiple sets of fusion features; each group of fusion features represents the fusion between different content modal combinations and different items;
  • the plurality of groups of fusion features are estimated by a preset recommendation model, and target item information and target content information corresponding to a group of fusion features with the highest estimated value are selected; the preset recommendation model is used to screen the fusion features;
  • Target multimedia information is generated based on the target item information and the target content information.
  • feature extraction is performed based on the item information and content information to obtain item features corresponding to the item dimension and content features corresponding to the content dimension, including:
  • the content multi-modal type includes at least two modalities among text information, image information and image sequence information;
  • Feature extraction is performed on the content information corresponding to the content multi-modal type to obtain the content features corresponding to the content dimensions.
  • the feature extraction of the content information corresponding to the content multi-modal type to obtain the content features corresponding to the content dimension includes: if the content multi-modal type is a text type, feature extraction is performed on the text information through the first encoding method to obtain text features;
  • the content multi-modal type is an image type or an image sequence type
  • feature extraction is performed on the image information and the image sequence information respectively through the second encoding method to obtain image features and behavioral features;
  • the content characteristics corresponding to the content dimension are determined according to at least one of the text characteristics, the image characteristics and the behavioral characteristics.
  • the content multi-modal type is a text type
  • feature extraction is performed on the text information through the first encoding method to obtain text features, including:
  • the content multi-modal type is a text type
  • feature extraction is performed on the text information to obtain text initial features
  • the text initial features include semantic expression information and word information
  • the text initial features are encoded to obtain the text features.
  • the content multi-modal type is an image type or an image sequence type
  • feature extraction is performed on the image information and the image sequence information through the second encoding method to obtain image features and behavioral features.
  • the content multi-modal type is an image type
  • the initial image features include scene information, content information and style information
  • the content multi-modal type is an image sequence type
  • feature extraction is performed on the image sequence information to obtain initial behavioral features;
  • the initial behavioral features include subject target information and key frame information;
  • the image initial features and the behavior initial features are respectively encoded to obtain the image features and the behavior features.
  • the item characteristics and the content characteristics are coordinated and fused to obtain multiple sets of fusion characteristics, including:
  • the item feature and the content feature are collaboratively processed to obtain a first item feature and a first content feature of the same probability distribution;
  • the first item feature includes a plurality of first sub-item features;
  • the first content feature includes a plurality of first sub-content features;
  • the plurality of first sub-item features are randomly combined to obtain multiple item combination features; the plurality of first sub-content features are randomly combined to obtain multiple content combination features; the content combination features include content features corresponding to at least two content multi-modal types;
  • the plurality of item combination features and the plurality of content combination features are fused to obtain the plurality of sets of fusion features.
  • the multiple sets of fusion features are estimated through a preset recommendation model, and the target item information and target content information corresponding to the set of fusion features with the highest estimated value are selected, including:
  • the set of fused features is decoded to obtain the target item information and the target content information.
  • generating target multimedia information based on the target item information and the target content information includes:
  • layout generation is performed on the target item information and the target content information through a preset layout generation model to obtain multiple layouts; the preset layout generation model represents adjusting the layout through items and content;
  • the multiple layouts are evaluated through the evaluation model and candidate layouts are determined; the evaluation model is used to evaluate and screen the layouts;
  • the target multimedia information is generated based on the optimal layout, the target item information and the target content information.
  • the target items and the target content are laid out through a preset layout generation model, and multiple layouts are obtained, including:
  • the preset layout generation model includes the stacking order of image layers and the text size range constraints in the text information
  • the initialization layout is adjusted through adjustment rules and the multiple layouts are determined; the adjustment rules are obtained through continuous training using the object's preference as an incentive.
  • before the multiple layouts are evaluated by the evaluation model to determine the candidate layouts, the method further includes:
  • the historical layout includes positive sample data and negative sample data
  • the initial evaluation model is trained by using the positive sample data and the negative sample data to determine the evaluation model.
  • the evaluation model is used to evaluate the multiple layouts and determine candidate layouts, including:
  • the corresponding layout is used as the candidate layout.
  • An embodiment of the present invention provides a device for generating multimedia information.
  • the device for generating multimedia information includes an acquisition part, a selection part and a generation part; wherein,
  • the acquisition part is configured to recall item information and content information in response to a received browsing request; perform feature extraction based on the item information and content information to obtain item features corresponding to the item dimension and content features corresponding to the content dimension,
  • the item features and the content features are collaborated and fused to obtain multiple sets of fusion features; each set of fusion features represents the fusion between different content modal combinations and different items;
  • the selection part is configured to estimate the multiple sets of fusion features through a preset recommendation model, and select the target item information and target content information corresponding to the set of fusion features with the highest estimated value; the preset recommendation model is used to screen the fusion features;
  • the generating part is configured to generate target multimedia information based on the target item information and the target content information.
  • An embodiment of the present invention provides a device for generating multimedia information, the device for generating multimedia information comprising:
  • a memory for storing executable instructions
  • a processor configured to execute executable instructions stored in the memory. When the executable instructions are executed, the processor executes the method for generating multimedia information.
  • An embodiment of the present invention provides a computer-readable storage medium, characterized in that executable instructions are stored therein, and when the executable instructions are executed by one or more processors, the processors execute the method for generating multimedia information.
  • Embodiments of the present invention provide a method and device for generating multimedia information, and a computer-readable storage medium.
  • the method includes: in response to a received browsing request, recalling item information and content information; performing feature extraction based on the item information and content information to obtain item features corresponding to the item dimension and content features corresponding to the content dimension, and collaborating and fusing the item features and the content features to obtain multiple sets of fusion features; each set of fusion features represents the fusion between different content modal combinations and different items.
  • the server vectorizes the item information and content information to obtain the item features corresponding to the item and the content features corresponding to the content; it converts the item features and content features located in different spaces into vectors in the same space and fuses them to obtain multiple sets of fusion features; since the fusion features carry both dimensions, the obtained fusion features are diverse.
  • the server estimates the multiple fusion features based on the preset recommendation model and obtains multiple estimated values. The higher the estimated value, the better the diversity of the fusion features, so the target item information and target content information corresponding to the set of fusion features with the highest estimated value have good diversity, and the target multimedia information generated from the target item information and target content information is therefore diverse. Finally, based on this method of generating multimedia information, the diversity of the target multimedia information can be improved, thereby ensuring that personalized recommendations are provided to users and improving the recommendation effect.
  • Figure 1 is an optional flowchart 1 of a method for generating multimedia information provided by an embodiment of the present invention;
  • Figure 2 is an optional flowchart 2 of a method for generating multimedia information provided by an embodiment of the present invention;
  • Figure 3 is an optional flowchart 3 of a method for generating multimedia information provided by an embodiment of the present invention;
  • Figure 4 is an optional flowchart 4 of a method for generating multimedia information provided by an embodiment of the present invention;
  • Figure 5 is an optional flowchart 5 of a method for generating multimedia information provided by an embodiment of the present invention;
  • Figure 6 is an optional flowchart 6 of a method for generating multimedia information provided by an embodiment of the present invention;
  • Figure 7 is an optional flowchart 7 of a method for generating multimedia information provided by an embodiment of the present invention;
  • Figure 8 is an optional flowchart 8 of a method for generating multimedia information provided by an embodiment of the present invention;
  • Figure 9 is an optional flowchart 9 of a method for generating multimedia information provided by an embodiment of the present invention;
  • Figure 10 is a schematic structural diagram 1 of a device for generating multimedia information provided by an embodiment of the present invention;
  • Figure 11 is a schematic structural diagram 2 of a device for generating multimedia information provided by an embodiment of the present invention.
  • the recall of items of interest is usually at SKU granularity.
  • the system will select a candidate set of items for the current user based on the user's historical browsing, search, purchase, additional purchase and other behaviors, and select the optimal one based on the advertising system ranking model.
  • related creatives are generated as advertising carriers, such as pictures, videos, copywriting and other creative content.
  • Item recall is an interest matching problem. First, the user's historical interest information is used to effectively mine items that were searched, browsed, clicked, and added to the shopping cart. Item recall must consider both the user's long-term preferences and real-time needs, and conducts multi-dimensional interest mining over this combination of short-term and long-term behaviors. At the same time, for promotional activities and hot information, related items in the same store and of the same product can be expanded, and information on hot items and similar items can be added for further exploration.
  • Another idea is to consider the similarity of the crowd, which is often called collaborative filtering: related users are effectively clustered through user portraits, and the interests and preferences of the same type of users (such as similar consumption habits and consistent brand preferences) are migrated, which can increase the ability to mine novel items of interest.
  • Model sorting is called "precise sorting" in the advertising system. Through model estimation, the best product is selected for display. The goal of sorting is usually to maximize revenue.
  • the core here is pCTR estimation, which selects the product with the best estimated effect (the highest click-through rate).
  • the current ranking model is usually based on the convolutional neural network (Convolutional Neural Network, CNN) model, which combines rich data features with deep learning, trains the model based on the posterior click-through rate of the data, and uses the learned parameters for online estimation of item and user characteristics.
  • Creatives, as advertising carriers, present the content of items.
  • the current method lacks the knowledge of related content.
  • it also lacks a unified expression and modeling method among them, which leads to poor product recall results.
  • In the model sorting stage, the current sorting method only considers the preference for a single product and lacks the ability to make combined predictions for multiple products. It also lacks corresponding modeling capabilities for different levels of product attributes, and model predictions are based only on product and user characteristics.
  • It lacks creative-dimension information, especially multi-modal prediction capabilities that use copywriting, pictures, videos and other content as features, resulting in poor recommendation results.
  • the current creative generation is after recall and sorting, and the generation method is mostly template nesting, which does not reflect the user's preference for creative elements.
  • Using templates as a carrier limits the diversity of creatives. At the same time, it also limits users' diverse needs for creatives and lacks personalized expression of user interests.
  • an embodiment of the present invention proposes a method for generating multimedia information.
  • intelligent creative generation is evolved into an element combination problem.
  • a unified expression of products and creative elements is established, breaking the underlying logic of traditional e-commerce advertisements that directly sell goods.
  • a creative-driven user conversion maximization idea is established; in the model sorting stage, the optimization of a single product is transformed into a combination optimization of multiple products, and multimodal information is integrated into model estimation; in the creative generation stage, real-time personalized creative generation is performed in the form of creative elements and product combinations to express user interests in a personalized way.
  • Figure 1 is an optional flowchart 1 of a method for generating multimedia information provided by an embodiment of the present invention. The steps will be explained in conjunction with Figure 1.
  • the item information is all items to be recommended by the terminal.
  • Item information contains multiple sub-item information; content information contains multiple sub-content information.
  • a browsing request refers to a request formed by the user entering the item information to be browsed in the search box on the application software browsing page or on the application web page browsing page. For example, after a user enters "photo frame" in the search box on the browsing page of a shopping platform, a request for browsing photo frames will be generated.
  • the server receives the browsing request sent by the terminal, responds to the browsing request, and recalls item information and content information from the item library and content library according to the object's historical browsing information.
  • the browsed item information is input into the sorting recommendation model (Deep & Cross Network, DCN) for extraction to obtain multiple pieces of item information;
  • the browsed image is input into the convolutional neural network (Convolutional Neural Network, CNN) for extraction to obtain the content information carrying the item information
  • the item information is removed from the content information carrying the item information to obtain the remaining content (equivalent to the content information).
  • each set of fusion features represents the fusion between different content modal combinations and different items.
  • Collaboration means processing multiple vectors located in different vector spaces so that they are mapped to the same vector space and satisfy the same probability distribution; fusion means forming fusion vectors from different combinations of multiple vectors located in the same space;
  • the vector can be the item characteristics and content characteristics in the present invention.
  • Collaboration must be carried out before fusion, and fusion can only be carried out after collaborative processing.
  • Item features are the displayed characteristics of a certain item, and content features are features obtained by extracting features from the picture description, video description, and text description of a certain item.
  • the item information can be the attribute characteristics of the item;
  • the content information can be images and copywriting excluding the item attributes, that is, some creative content to promote the item.
  • the server can perform feature extraction on the item information to obtain item features corresponding to the item dimension; identify the content information to obtain content multi-modal types; and perform feature extraction on the content information corresponding to the content multi-modal types to obtain content features corresponding to the content dimension.
  • the item features and content features are collaboratively processed to obtain the first item feature and the first content feature of the same probability distribution; the first item feature and the first content feature are fused to obtain multiple sets of fusion features.
  • Figure 3 is an optional flowchart 3 of a method for generating multimedia information provided by an embodiment of the present invention. As shown in Figure 3, performing feature extraction based on item information and content information to obtain the item features corresponding to the item dimension and the content features corresponding to the content dimension can be realized through S1021-S1023, as follows:
  • the server can perform feature extraction on the item information, convert the item information into features in vector form, and obtain item features corresponding to the item dimensions.
  • For example, the item features may be a 1024-dimensional floating-point array.
  • S1022 Identify the content information to obtain content information corresponding to the content multimodal type.
  • the server can identify the content information through a neural network model to obtain content information corresponding to the content multi-modal type, where the content multi-modal type includes text information, image information and image sequence information.
  • the neural network (NN) model is a complex network system formed by a large number of simple processing units (called neurons) that are widely connected to each other. It reflects many basic characteristics of human brain function and is a highly complex nonlinear dynamic learning system. Neural networks have large-scale parallelism, distributed storage and processing, self-organization, self-adaptation and self-learning capabilities, and are particularly suitable for imprecise and fuzzy information processing problems that require many factors and conditions to be considered simultaneously.
  • the server can perform corresponding feature extraction according to the multi-modal type of the content. If the content multi-modal type is a text type, feature extraction processing is performed on the text information through the first encoding method to obtain text features. If the content multi-modal type is an image type or an image sequence type, feature extraction processing is performed on the image information and image sequence information respectively through the second encoding method to obtain image features and behavioral features. Based on text features, image features and behavioral features, the content features corresponding to the content dimensions are determined.
  • the first encoding method mainly targets text information; the second encoding method mainly targets image information and image sequence information.
  • the image information may be an image, and the image sequence information may be a video.
  • the server extracts features from item information, vectorizes the item information, and obtains item features corresponding to the item; identifies content information to obtain content multimodal types; extracts features from content information corresponding to content multimodal types to obtain content features corresponding to content dimensions. Since item features and content features belong to features in different dimensions, the server obtains multidimensional features. When the target multimedia information is subsequently generated based on the multidimensional features, the target multimedia information has multidimensional information, thereby making the target multimedia information diverse.
  • Figure 4 is an optional flow diagram 4 of a method for generating multimedia information provided by an embodiment of the present invention.
  • S1023 can be implemented through S201-S203, as follows:
  • the content multi-modal type is a text type
  • the server extracts features from text information based on the content multimodal type being text type to obtain initial text features; and encodes the initial text features using a first encoding method to obtain text features.
  • text features are text initial features in vector form.
  • S201 can be implemented through S2011-S2012, as follows:
  • the content multi-modal type is a text type
  • the text initial features include semantic expression information and word information.
  • the server performs feature extraction on the text information based on the multi-modal type of the content being text type to obtain semantic expression information and word information.
  • Semantic expression information and word information are both initial features of text.
  • Figure 5 is an optional flow diagram 5 of a method for generating multimedia information provided by an embodiment of the present invention.
  • the server performs feature extraction on the copywriting information to obtain the semantic expression (equivalent to semantic expression information) and the word segmentation (equivalent to word information). Specifically, the semantic expression is obtained through the BERT method.
  • the server encodes the initial text features through the first encoding method to obtain vectorized text features.
  • the first encoding method is ConCat
  • the server uses ConCat to encode the semantic expression (equivalent to semantic expression information) and the word segmentation (equivalent to word information) to obtain a feature vector (equivalent to the text features).
  • the server performs feature extraction and encoding on the text information to obtain text features. During this process, the server converts text information into vectorized text features to facilitate subsequent collaboration and integration of item features and content features.
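  • For illustration only, the first encoding step can be sketched as concatenating a semantic embedding with a word-segmentation embedding; the helper functions, dimensions and example copywriting below are assumptions standing in for the BERT-based extraction described above, not part of the disclosed embodiment:

```python
import numpy as np

def bert_semantic_embedding(text: str) -> np.ndarray:
    """Stand-in for a BERT-style encoder producing the semantic expression information."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(768)

def word_segment_embedding(text: str) -> np.ndarray:
    """Stand-in for an averaged embedding of the word-segmentation result (word information)."""
    rng = np.random.default_rng(len(text.split()))
    return rng.standard_normal(128)

def encode_text(copywriting: str) -> np.ndarray:
    """First encoding method: concatenate (ConCat) the two initial text features."""
    semantic = bert_semantic_embedding(copywriting)
    words = word_segment_embedding(copywriting)
    return np.concatenate([semantic, words])  # vectorized text feature

text_feature = encode_text("Wooden photo frame, minimalist desktop decor")
print(text_feature.shape)  # (896,)
```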
  • the content multi-modal type is an image type or an image sequence type
  • the server performs feature extraction on the image information according to the content multi-modal type being the image type to obtain initial image features.
  • the content multi-modal type being an image sequence type
  • feature extraction is performed on the image sequence information to obtain initial behavioral features.
  • the initial features of the image and the initial features of the behavior are respectively coded to obtain the image features and the behavior features.
  • S202 can be implemented through S2021-S2023, as follows:
  • the initial image features include scene information, content information and style information.
  • the server can extract features from the image information according to the content multimodal type being the image type, and obtain scene information, content information and style information.
  • the scene information, content information and style information are all initial features of the image.
  • the image information can be a promotional picture of an item display; the server extracts features from the image information to obtain the scene (equivalent to scene information), the content and main body (the content and main body are equivalent to content information), and the color, style and layout (the color, style and layout are equivalent to style information).
  • Scene, content, subject, color, style and layout are all initial characteristics of an image.
  • the content multi-modal type is an image sequence type
  • the behavior initial features include subject target information and key frame information.
  • the server can perform feature extraction on the image sequence information based on the content multi-modal type being the image sequence type to obtain the subject target information and key frame information.
  • Subject target information and key frame information are both behavior initial features.
  • the server performs feature extraction on the image sequence information to obtain the key frames and highlight points (the key frames and highlight points are equivalent to key frame information), and the summary and subject target behavior actions (the summary and subject target behavior actions are equivalent to subject target information).
  • Key frames, highlight points, summaries, and subject target behavior actions are all behavior initial features.
  • the initial image features and the initial behavioral features are respectively encoded to obtain the image features and behavioral features.
  • the server encodes the initial image features through the second encoding method to obtain vectorized image features; it encodes the initial behavioral features to obtain vectorized behavioral features.
  • the second encoding method is One Hot.
  • the server performs feature encoding on the scene, content, subject, color, style and layout through One Hot to obtain a feature vector (equivalent to image features).
  • the server uses One Hot to perform feature encoding on key frames, highlights, summaries, and subject target behaviors to obtain feature vectors (equivalent to behavioral features).
  • the server performs feature extraction and encoding on the image information and image sequence information to obtain image features and behavioral features.
  • the server can convert image information and image sequence information into vectorized image features and vectorized behavioral features respectively, thereby obtaining multi-modal content features, making the content features diverse.
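  • A hedged sketch of the second encoding method (One Hot) applied to the categorical image initial features follows; the attribute vocabularies and example values are illustrative assumptions:

```python
import numpy as np

# Illustrative vocabularies for the image initial features (scene / content / style).
SCENES = ["indoor", "outdoor", "studio"]
SUBJECTS = ["product", "person", "text_overlay"]
STYLES = ["minimal", "festive", "luxury"]

def one_hot(value: str, vocab: list) -> np.ndarray:
    vec = np.zeros(len(vocab))
    vec[vocab.index(value)] = 1.0
    return vec

def encode_image(scene: str, subject: str, style: str) -> np.ndarray:
    """Second encoding method: One-Hot encode each categorical initial feature and concatenate."""
    return np.concatenate([
        one_hot(scene, SCENES),
        one_hot(subject, SUBJECTS),
        one_hot(style, STYLES),
    ])

image_feature = encode_image("studio", "product", "minimal")
print(image_feature)  # [0. 0. 1. 1. 0. 0. 1. 0. 0.]
```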
  • S203 Determine content features corresponding to the content dimension based on at least one of text features, image features, and behavioral features.
  • the server uses at least one of text features, image features, and behavioral features as content features corresponding to the content dimension.
  • the server can determine text features as content features corresponding to the content dimension; or, the server can determine image features as content features corresponding to the content dimension; or, the server can determine behavioral features as content features corresponding to the content dimension; or, the server can determine text features and image features as content features corresponding to the content dimension; or, the server can determine text features and behavioral features as content features corresponding to the content dimension; or, the server can determine image features and behavioral features as content features corresponding to the content dimension; or, the server can determine text features, image features, and behavioral features as content features corresponding to the content dimension.
  • the server can identify and extract features of the content information to obtain text features, image features, and behavior features.
  • the server can determine the content features corresponding to the content dimension based on one of the text features, image features, and behavior features; or, the server can determine the content features corresponding to the content dimension based on two of the text features, image features, and behavior features; or, the server can determine the content features corresponding to the content dimension based on three of the text features, image features, and behavior features. Since the content features have one or more multimodal features, the content features are diverse.
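  • As a small illustrative sketch, the content features of the content dimension can be assembled from whichever of the three modal features are available; the function name and dimensions below are assumptions:

```python
import numpy as np

def build_content_features(text_feat=None, image_feat=None, behavior_feat=None):
    """Content features of the content dimension = at least one of text / image / behavior features."""
    parts = [f for f in (text_feat, image_feat, behavior_feat) if f is not None]
    if not parts:
        raise ValueError("at least one modal feature is required")
    return np.concatenate(parts)

# e.g. only text and image features are available (no image-sequence / behavior feature)
content_feature = build_content_features(text_feat=np.ones(8), image_feat=np.zeros(4))
print(content_feature.shape)  # (12,)
```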
  • collaboration and fusion of item features and content features to obtain multiple sets of fusion features can be achieved through S301-S303, as follows:
  • the first item feature includes a plurality of first sub-item features; the first content feature includes a plurality of first sub-content features.
  • the server performs collaborative learning processing on the item features and the content features based on differences in their feature domains, maps the item features and content features to the same vector space, and obtains the first item feature and the first content feature of the same probability distribution.
  • collaborative processing is to process multiple vectors located in different vector spaces so that they are mapped to the same vector space and satisfy the same probability distribution; the technical means of collaborative processing and collaboration are consistent.
  • the server can randomly combine multiple first sub-item features to obtain multiple different item combination features.
  • the server randomly combines 12 first sub-item features (the 12 first sub-item features are different) to obtain 5 item combination features; the 5 item combination features respectively include 6, 8, 3, 5 and 9 first sub-item features. It should be noted that the five item combination features may contain the same first sub-item features or different first sub-item features.
  • the content combination features include content features corresponding to at least two content multi-modal types.
  • the server can randomly combine multiple first sub-content features to obtain multiple different content combination features.
  • the server randomly combines six first sub-content features (the six first sub-content features are different; specifically, either the content multi-modal types they contain are different or the content features themselves are different) to obtain two content combination features.
  • one content combination feature contains content features corresponding to three content multimodal types, including two text features, three image features and one behavior feature; the other content combination feature contains content features corresponding to two content multimodal types, including two text features and one image feature.
  • the server can fuse multiple item combination features and multiple content combination features to obtain multiple sets of fusion features; one set of fusion features includes at least one item combination feature and at least one content combination feature.
  • the server fuses the 5 item combination features and the 2 content combination features to obtain 3 groups of fusion features, namely the 1st group, the 2nd group and the 3rd group; among them, the 1st group of fusion features includes 3 first sub-item features; the 2nd group of fusion features includes 8 first sub-item features and two content multi-modal types, with 2 text features and 1 image feature; the 3rd group includes 13 first sub-item features and three content multi-modal types, with 4 text features, 4 image features and 1 behavior feature.
  • the server processes the item features and content features located in different vector spaces and maps them to the same vector space so that they satisfy the same probability distribution; in this way the two kinds of features lie in the same vector space, which facilitates the subsequent fusion of the two features.
  • the server randomly combines multiple first sub-item features to obtain multiple item feature combinations. Since each item combination contains multiple first sub-item features, the item feature combinations are diverse.
  • the server randomly combines multiple first sub-content features to obtain multiple content feature combinations. Since each content combination contains multiple first sub-content features, the content feature combinations are diverse.
  • the server randomly fuses the item feature combinations and the content feature combinations to obtain multiple sets of fusion features. Since the fusion features include item feature combinations and content feature combinations, the fusion features are diverse.
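  • A minimal sketch of the collaborate-then-fuse step follows, assuming linear projections into a shared vector space, mean-pooled random combinations, and concatenation-style fusion; the matrices, combination sizes and pooling choice are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def collaborate(features, projection):
    """Map features from their own space into the shared vector space and normalize per row."""
    z = features @ projection
    return (z - z.mean(axis=1, keepdims=True)) / (z.std(axis=1, keepdims=True) + 1e-8)

item_feats = rng.standard_normal((12, 64))    # 12 sub-item features
content_feats = rng.standard_normal((6, 96))  # 6 sub-content features (mixed modalities)

first_item = collaborate(item_feats, rng.standard_normal((64, 32)))
first_content = collaborate(content_feats, rng.standard_normal((96, 32)))

def random_combination(feats, size):
    """Randomly pick `size` sub-features and pool them into one combination feature."""
    idx = rng.choice(len(feats), size=size, replace=False)
    return feats[idx].mean(axis=0)

item_combos = [random_combination(first_item, k) for k in (6, 8, 3, 5, 9)]
content_combos = [random_combination(first_content, k) for k in (4, 3)]

# Fuse every item combination with every content combination.
fusion_features = [np.concatenate([i, c]) for i in item_combos for c in content_combos]
print(len(fusion_features), fusion_features[0].shape)  # 10 (64,)
```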
  • the server can input multiple sets of fusion features into a preset recommendation model for prediction, and obtain first estimated values corresponding to each of the multiple sets of fusion features. Based on multiple first estimated values, a set of fused features with the highest estimated value is selected from multiple sets of fused features. Decode a set of fused features to obtain target item information and target content information.
  • For example, the number of items is set to From1, and the range of From1 is (0, M); the number of creatives (i.e., content) is set to From2, and the range of From2 is (0, N). The server performs traversal exploration to obtain multiple items; for the multiple items, the vectors of the multiple items are fused; for the multiple creatives, the multi-modal features of the recall stage are fused into feature vectors; at the same time, the fused creative vectors (i.e., fusion features) are input into the estimation model (i.e., the preset recommendation model) to obtain CTR estimation values; the combination with the highest pCTR estimation value is selected for output as the overall estimation result (i.e., the target item information and target content information corresponding to the set of fusion features with the highest estimated value).
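  • For illustration, the estimation stage can be sketched as scoring each fused feature with a stand-in pCTR model and keeping the highest-scoring combination; the logistic scorer below is a placeholder, not the disclosed recommendation model:

```python
import numpy as np

rng = np.random.default_rng(1)
w = rng.standard_normal(64)  # stand-in pCTR model: a fixed logistic scorer

def estimate_pctr(fused: np.ndarray) -> float:
    return float(1.0 / (1.0 + np.exp(-fused @ w)))

fusion_features = [rng.standard_normal(64) for _ in range(10)]  # output of the fusion step
scores = [estimate_pctr(f) for f in fusion_features]

best_idx = int(np.argmax(scores))
best_combination = fusion_features[best_idx]  # later decoded into target item / content information
print(best_idx, round(scores[best_idx], 3))
```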
  • S103 can be implemented through S1031, S1032 and S1033, as follows:
  • the server estimates the multiple groups of fused features through a preset recommendation model to obtain first estimated values corresponding to each of the multiple groups of fused features.
  • For example, the server estimates three sets of fusion features through the preset recommendation model, and obtains the first estimated values corresponding to the three sets of fusion features, namely 0.7, 0.85 and 0.62.
  • the server selects a set of fusion features with the highest estimated value from multiple sets of fusion features based on multiple first estimated values.
  • the server selects the fusion feature with an estimated value of 0.85 from the first estimated values 0.7, 0.85 and 0.62 corresponding to the three sets of fusion features respectively.
  • the server can decode a set of fused features to convert the fused features into target item information and target content information.
  • the server decodes the selected set of fused features.
  • the target item information includes 3 items
  • the target content information includes three content multi-modal types, including 2 types of text, 3 types of images, and 1 type of image sequence.
  • the server estimates the multiple fusion features based on the preset recommendation model and obtains multiple estimated values. The higher the estimated value, the better the diversity of the fusion features, so the target item information and target content information corresponding to the set of fusion features with the highest estimated value have good diversity, and the target multimedia information generated from the target item information and target content information is therefore diverse.
  • S104 Generate target multimedia information based on the target item information and target content information.
  • the server can perform layout generation on target item information and target content information through a preset layout generation model to obtain multiple layouts.
  • Through the evaluation model, the multiple layouts are evaluated and candidate layouts are determined.
  • Through the layout optimization model, the optimal layout is selected from the candidate layouts.
  • Target multimedia information is generated based on the optimal layout, target item information and target content information.
  • the target multimedia information is sent to the terminal, so that the terminal displays the browsing page based on the target multimedia information.
  • Figure 6 is an optional flow diagram 6 of a multimedia information generation method provided by an embodiment of the present invention.
  • the traditional multimedia information generation process is: a user request (equivalent to a browsing request) is received, and the server recalls products (equivalent to item recall) to obtain product information; the product information is sorted by the model, and the Top-1 product information of the model ranking is selected as the recommended product information; template creatives are generated and the product information is fused to obtain multimedia information.
  • Figure 7 is an optional flow diagram 7 of a method for generating multimedia information provided by an embodiment of the present invention. As shown in Figure 7, data A/B (equivalent to target item information and target content information) are input to the server.
  • an initial layout is generated through a preset layout generation model (not shown in the figure). The text size, element positions, colors, and contrast of the initial layout are adjusted through adjustment rules to obtain multiple layouts. The multiple layouts are evaluated through the evaluation model (represented by +++ in Figure 7) to obtain evaluation results, which include pass and fail. If the evaluation result is "passed", the layout plan (equivalent to the candidate layouts) is output; the layout plan includes four layouts, namely 1, 2, 3 and 4. The layout plan is optimized through the layout optimization model to obtain the optimal style (the optimal layout); the optimal style includes copywriting, pictures, videos or middle pages. The target multimedia information is generated through a real-time multimedia information generation engine.
  • the server vectorizes the item information and content information to obtain item features corresponding to the items and content features corresponding to the content.
  • the server converts item features and content features in different spaces into vectors in the same space and fuses them to obtain multiple sets of fusion features. Since the fused feature is a feature with two dimensions, the fused feature is diverse.
  • the server estimates the multiple fusion features based on the preset recommendation model and obtains multiple estimated values. The higher the estimated value, the better the diversity of the fusion features, so the target item information and target content information corresponding to the selected set of fusion features with the highest estimated value have good diversity, and the target multimedia information generated based on the target item information and target content information is therefore diverse.
  • the diversity of target multimedia information can be improved, thereby ensuring that personalized recommendations are provided to users and improving the recommendation effect.
  • Figure 8 is an optional flow diagram 8 of a method for generating multimedia information provided by an embodiment of the present invention.
  • S104 can be implemented through S1041-S1045, as follows:
  • the preset layout generation model includes the stacking order of image layers and text size range constraints in text information.
  • the server can generate an initialization layout corresponding to the target item information and the target content information through a preset layout generation model, adjust the initialization layout through adjustment rules, and determine multiple layouts.
  • S1041 may be implemented by S401 and S402 as follows:
  • the server can input the target item information and the target content information into a preset layout generation model, and generate an initialization layout corresponding to the target item information and the target content information.
  • the initial layout refers to the arrangement and combination of the positions of target item information and target content information.
  • the adjustment rules are obtained through continuous training using the object's preference as an incentive. Specifically, they are based on reinforcement learning and use the object's preference as the incentive: if the click-through rate is higher after an adjustment, it is a positive incentive; if the click-through rate becomes lower, it is a negative incentive. The rules are obtained through repeated adjustment and learning.
  • the server adjusts the initial layout by adjusting rules to obtain multiple layouts.
  • the server can generate an initial layout corresponding to the target item information and target content information through the preset layout generation model, adjust the initial layout through adjustment rules, and determine multiple layouts of the target item information and target content information; by adjusting the initial layout through the adjustment rules and correcting unreasonable layout methods, the rationality of the layouts can be improved.
  • the adjusted layout of the server can still include multiple layouts, so that the adjusted layout still has diversity.
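  • A simplified sketch of the reward-driven adjustment idea follows, using click-through-rate feedback as the incentive; the layout attributes, candidate adjustments and reward signal are illustrative assumptions:

```python
import random

random.seed(0)

layout = {"font_size": 18, "image_x": 0, "contrast": 1.0}  # initialization layout (illustrative)

def adjust(layout):
    """Apply one candidate adjustment rule to produce a new layout."""
    new = dict(layout)
    rule = random.choice(["font_size", "image_x", "contrast"])
    if rule == "font_size":
        new["font_size"] = max(12, min(36, new["font_size"] + random.choice([-2, 2])))
    elif rule == "image_x":
        new["image_x"] += random.choice([-10, 10])
    else:
        new["contrast"] = round(new["contrast"] + random.choice([-0.1, 0.1]), 2)
    return new

def observed_ctr(layout) -> float:
    """Placeholder for the click-through rate observed after showing this layout."""
    return random.random()

# Positive incentive if CTR improves after an adjustment, negative otherwise.
layouts = [layout]
baseline = observed_ctr(layout)
for _ in range(5):
    candidate = adjust(layouts[-1])
    ctr = observed_ctr(candidate)
    if ctr > baseline:          # positive incentive: keep the adjusted layout
        layouts.append(candidate)
        baseline = ctr
print(len(layouts), layouts[-1])
```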
  • the evaluation model is used to evaluate and filter layouts.
  • the server can evaluate multiple layouts through an evaluation model and obtain evaluation results corresponding to the multiple layouts. If the evaluation result is characterized as successful, the corresponding layout will be used as a candidate layout. If the evaluation result is characterized as failure, the corresponding layout will be deleted.
  • S1042 can be implemented through S501 and S502, as follows:
  • the server can use the evaluation model to evaluate the rationality of multiple layouts and obtain corresponding evaluation results of the multiple layouts. Evaluation results include success and failure.
  • if the evaluation result of a layout indicates success, the layout passes and is used as a candidate layout.
  • the server can evaluate multiple layouts through the evaluation model and obtain evaluation results corresponding to the multiple layouts.
  • the evaluation results represent local rationality.
  • the server filters the layouts, removes unreasonable layouts, and determines candidate layouts. Since the candidate layout is the filtered result after removing unreasonable layouts, the server selects the candidate layout as a more reasonable layout.
  • S601, S602 and S603 are also implemented before S1042, as follows:
  • the server may obtain historical target multimedia information.
  • the historical layout includes positive sample data and negative sample data.
  • the server can identify and analyze the historical target multimedia information to obtain the historical layout corresponding to the historical target multimedia information.
  • the server trains the initial evaluation model through the positive sample data and negative sample data of the historical layout until the evaluation result output by the model meets the preset threshold, saves the model, and obtains the evaluation model.
  • the server trains the initial evaluation model through historical target multimedia information to determine the evaluation model, which can ensure the evaluation accuracy of the evaluation model.
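  • A minimal sketch of training the evaluation model as a binary classifier over historical layouts follows, where positive samples come from historical target multimedia information and negative samples from rejected layouts; the synthetic features and plain logistic-regression loop are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic layout feature vectors: positive samples around +1, negative samples around -1.
X = np.vstack([rng.normal(+1.0, 1.0, (50, 8)), rng.normal(-1.0, 1.0, (50, 8))])
y = np.array([1] * 50 + [0] * 50)

w, b = np.zeros(8), 0.0
for _ in range(300):  # plain logistic-regression training loop
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    w -= 0.1 * (X.T @ (p - y) / len(y))
    b -= 0.1 * float(np.mean(p - y))

def evaluate_layout(feat: np.ndarray) -> bool:
    """Evaluation result: True = pass (candidate layout), False = fail (discard)."""
    return 1.0 / (1.0 + np.exp(-(feat @ w + b))) > 0.5

print(evaluate_layout(rng.normal(+1.0, 1.0, 8)))  # likely True for a positive-like layout
```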
  • the server may input candidate layouts into the layout optimization model, and select an optimal layout from the candidate layouts.
  • the optimal layout is obtained by using the layout optimization model to evaluate the multiple candidate layouts and obtain the index evaluation values corresponding to the multiple candidate layouts; from the multiple index evaluation values, the candidate layout with the highest index evaluation value is selected as the optimal layout.
  • the three candidate layouts are respectively evaluated with indexes through the layout optimization model, and the index evaluation values corresponding to the three candidate layouts are obtained.
  • the index evaluation value of the first candidate layout is 0.5
  • the index evaluation value of the second candidate layout is 0.7
  • the index evaluation value of the third candidate layout is 0.8
  • the third candidate layout with an index evaluation value of 0.8 is regarded as the optimal layout.
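  • A small sketch of the layout-optimization selection step, mirroring the 0.5 / 0.7 / 0.8 example above (the layout names are placeholders):

```python
candidate_layouts = ["layout_1", "layout_2", "layout_3"]
index_values = [0.5, 0.7, 0.8]  # index evaluation values from the layout optimization model

optimal_layout = candidate_layouts[index_values.index(max(index_values))]
print(optimal_layout)  # layout_3
```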
  • the server can arrange the target item information and the target content information according to the optimal layout to generate the target multimedia information.
  • S1045 Send the target multimedia information to the terminal, so that the terminal can display a browsing page based on the target multimedia information.
  • the server sends the target multimedia information to the terminal.
  • the terminal can display a browsing page based on target multimedia information.
  • the server can generate multiple layouts of target item information and target content information according to the preset layout generation model and adjustment rules, and filter the multiple layouts through the evaluation model and layout optimization model to determine the optimal layout.
  • the optimal layout is determined after removing unreasonable layouts, then the target multimedia information obtained through the optimal layout is more reasonable and accurate, thus improving the accuracy of the target multimedia information.
  • When the server recommends the target multimedia information, the target multimedia information will be more in line with the user's needs, can provide the user with personalized recommendations, and achieves a good recommendation effect.
  • Figure 9 is an optional flow diagram 9 that provides a method for generating multimedia information according to an embodiment of the present invention.
  • the server receives a user request (equivalent to a browsing request), and carries out interest product recall (equivalent to item information recall) and creative element recall (equivalent to content recall) to obtain item information and content information.
  • Multi-product selection (equivalent to fusion feature selection) is performed through cross-modal CTR estimation to obtain the optimal product content combination; among them, the multiple modalities include text (equivalent to text features), style, pictures (equivalent to image features), and video (equivalent to behavioral features).
  • Through the preset layout generation model (not shown in the figure) and the adjustment rules, element planning is performed on the product content combination to generate the layout in real time, and the final target multimedia information is determined and sent to the user (equivalent to the terminal).
  • the server can vectorize the item information and content information to obtain the item characteristics corresponding to the item and the content characteristics corresponding to the content.
  • the server converts item features and content features in different spaces into vectors in the same space and fuses them to obtain multiple fusion features. Since the fused feature is a feature with two dimensions, the fused feature has diversity.
  • the server estimates multiple fusion features based on the preset recommendation model and obtains multiple estimated values. Among them, the higher the estimated value, the better the diversity of the fusion features, and the better the diversity of the optimal product content combination corresponding to the set of fusion features with the highest estimated value.
  • the server uses the preset layout generation model and adjustment rules to perform element planning on the optimal product content combination to generate the layout in real time to determine the final target multimedia information. Since the optimal product content combination is diverse, the target multimedia information generated based on the optimal product content combination is also diverse.
  • the embodiment of the present invention also provides a multimedia information generation device, as shown in Figure 10.
  • Figure 10 is a schematic structural diagram 1 of a multimedia information generation device provided by an embodiment of the present invention. As shown in Figure 10, the multimedia information generation device 10 includes: an acquisition part 1001, a selection part 1002 and a generation part 1003; wherein,
  • the acquisition part 1001 is configured to recall item information and content information in response to a received browsing request; perform feature extraction based on the item information and content information to obtain item features corresponding to the item dimension and content features corresponding to the content dimension , and collaborate and fuse the item features and the content features to obtain multiple sets of fusion features; each set of fusion features represents the fusion between different content modal combinations and different items;
  • the selection part 1002 is configured to estimate the multiple sets of fusion features through a preset recommendation model, and select the target item information and target content information corresponding to the set of fusion features with the highest estimated value; the preset recommendation model is used to screen the fusion features;
  • the generating part 1003 is configured to generate target multimedia information based on the target item information and the target content information.
  • the acquisition part 1001 is configured to perform feature extraction on the item information to obtain the item features corresponding to the item dimension; identify the content information to obtain content information corresponding to a content multi-modal type, the content multi-modal type including at least two modalities among text information, image information and image sequence information; and perform feature extraction on the content information corresponding to the content multi-modal type to obtain the content features corresponding to the content dimension.
  • the device for generating multimedia information further includes a determining part 1004; wherein,
  • the acquisition part 1001 is configured to perform feature extraction on the text information through the first encoding method to obtain text features if the content multi-modal type is a text type; if the content multi-modal type is an image type or image sequence type, then perform feature extraction on the image information and the image sequence information respectively through the second encoding method to obtain image features and behavioral features;
  • the determining part 1004 is configured to determine the content characteristics corresponding to the content dimension according to at least one of the text characteristics, the image characteristics and the behavioral characteristics.
  • the acquisition part 1001 is configured to, if the content multi-modal type is a text type, perform feature extraction on the text information to obtain text initial features, the text initial features including semantic expression information and word information; and encode the text initial features through the first encoding method to obtain the text features.
  • the acquisition part 1001 is configured to, if the content multimodal type is an image type, perform feature extraction on the image information to obtain image initial features; the image initial features include scene information, content information and style information; if the content multimodal type is an image sequence type, perform feature extraction on the image sequence information to obtain behavior initial features; the behavior initial features include subject target information and key frame information; and through the second encoding method, encode the image initial features and the behavior initial features respectively to obtain the image features and the behavior features.
  • the acquisition part 1001 is configured to perform collaborative processing on the item characteristics and the content characteristics to obtain the first item characteristics and the first content characteristics of the same probability distribution;
  • the first item feature includes a plurality of first sub-item features;
  • the first content feature includes a plurality of first sub-content features;
  • the multiple first sub-item features are randomly combined to obtain multiple item combination features;
  • the plurality of first sub-content features are randomly combined to obtain multiple content combination features;
  • the content combination features include content features corresponding to at least two content multi-modal types; and the multiple item combination features and the multiple content combination features are fused to obtain the multiple sets of fusion features.
  • the acquisition part 1001 is configured to input the multiple sets of fusion features into the preset recommendation model for estimation to obtain a first estimated value corresponding to each set of fusion features; select, based on the multiple first estimated values, the set of fusion features with the highest estimated value from the multiple sets of fusion features; and decode that set of fusion features to obtain the target item information and the target content information.
  • the acquisition part 1001 is configured to perform layout generation on the target item information and the target content information through a preset layout generation model to obtain multiple layouts;
  • the preset layout generation model represents adjusting the layout according to the items and the content;
  • the determination part 1004 is configured to evaluate the multiple layouts and determine candidate layouts through an evaluation model; the evaluation model is configured to evaluate and screen layouts;
  • the selection part 1002 is configured to select an optimal layout from the candidate layouts through a layout optimization model;
  • the generating part 1003 is configured to generate the target multimedia information based on the optimal layout, the target item information and the target content information.
  • the generation part 1003 is configured to generate an initialization layout corresponding to the target item information and the target content information through a preset layout generation model; the preset layout generation model includes the stacking order of image layers and constraints on the text size range in the text information;
  • the determination part 1004 is configured to adjust the initial layout and determine the multiple layouts through adjustment rules; the adjustment rules are obtained through continuous training using the object's preference as an incentive.
  • the acquisition part 1001 is configured to, before the multiple layouts are evaluated through the evaluation model and candidate layouts are determined, obtain historical target multimedia information and identify it to obtain historical layouts; the historical layouts include positive sample data and negative sample data;
  • the determination part 1004 is configured to train an initial evaluation model through the positive sample data and the negative sample data, and determine the evaluation model.
  • the acquisition part 1001 is configured to evaluate the multiple layouts through the evaluation model and obtain the evaluation results corresponding to the multiple layouts;
  • the determining part 1004 is configured to use the corresponding layout as the candidate layout if the evaluation result is characterized as successful.
  • the server can vectorize the item information and the content information to obtain the item features corresponding to the items and the content features corresponding to the content.
  • the server converts the item features and content features, which lie in different spaces, into vectors in the same space and fuses them to obtain multiple sets of fusion features. Because each fusion feature carries both dimensions, the fusion features are diverse. Based on this diversity, the server estimates the multiple sets of fusion features with the preset recommendation model and obtains multiple estimated values; a higher estimated value indicates better diversity of the fusion features, so the optimal product-content combination corresponding to the selected set of fusion features with the highest estimated value has better diversity.
  • the server uses the preset layout generation model and the adjustment rules to perform element planning on the optimal product-content combination and generate the layout in real time, thereby determining the final target multimedia information. Because the optimal product-content combination is diverse, the target multimedia information generated from it is also diverse.
  • the processing described above can be allocated to different program modules as needed, that is, the internal structure of the device can be divided into different program modules to complete all or part of the processing described above.
  • the multimedia information generation device provided by the above embodiments and the multimedia information generation method embodiments belong to the same concept. The specific implementation process and beneficial effects can be found in the method embodiments and will not be described again here. For technical details not disclosed in the device embodiment, please refer to the description of the method embodiment of the present invention for understanding.
  • the embodiment of the present invention also provides a multimedia information generation device, as shown in Figure 11.
  • Figure 11 is a schematic structural diagram of a multimedia information generation device provided by the embodiment of the present invention.
  • the multimedia information generation device 11 includes a processor 1101 and a memory 1102; the memory 1102 stores one or more programs executable by the processor, and when the one or more programs are executed, the processor 1101 performs any of the multimedia information generation methods of the above embodiments.
  • embodiments of the present invention may be provided as methods, systems, or computer program products. Accordingly, the invention may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, magnetic disk storage, optical storage, etc.) embodying computer-usable program code therein.
  • These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing apparatus to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flowchart processes and/or one or more block diagram blocks.
  • These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus, causing a series of operational steps to be performed on the computer or other programmable apparatus to produce computer-implemented processing, such that the instructions executed on the computer or other programmable apparatus provide steps for implementing the functions specified in one or more flowchart processes and/or one or more block diagram blocks.
  • Embodiments of the present invention provide a method and device for generating multimedia information, and a computer-readable storage medium.
  • the method includes: recalling item information and content information in response to a received browsing request; performing feature extraction based on the item information and content information to obtain item features corresponding to the item dimension and content features corresponding to the content dimension, and collaborating and fusing the item features and content features to obtain multiple sets of fusion features; estimating the multiple sets of fusion features through a preset recommendation model and selecting the target item information and target content information corresponding to the set of fusion features with the highest estimated value; and generating target multimedia information based on the target item information and target content information.
  • the above scheme extracts and combines features of item information and content information to obtain multiple fusion features.
  • the generated target multimedia information is diverse and has good recommendation effects.
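As an illustration only, the recall–fuse–estimate–select flow summarized above can be sketched in a few lines of Python. The toy hash-based embeddings and the linear scorer below merely stand in for the preset recommendation model described in the embodiments; all names, dimensions and data are assumptions, not the patented implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def embed(text, dim=8):
    """Toy deterministic embedding: hash the string into a fixed-size vector."""
    state = np.random.default_rng(abs(hash(text)) % (2**32))
    return state.normal(size=dim)

def fuse(item_vec, content_vec):
    """Fuse one item feature and one content feature into a joint vector."""
    return np.concatenate([item_vec, content_vec])

def estimate(fused, weights):
    """Stand-in for the preset recommendation model: a linear score squashed to (0, 1)."""
    return float(1.0 / (1.0 + np.exp(-fused @ weights)))

items = ["item_A", "item_B"]                  # recalled item information
contents = ["copy_1", "image_2", "video_3"]   # recalled content information

weights = rng.normal(size=16)                 # 8-dim item part + 8-dim content part
candidates = [(i, c, fuse(embed(i), embed(c))) for i in items for c in contents]
scored = [(i, c, estimate(f, weights)) for i, c, f in candidates]

best_item, best_content, best_score = max(scored, key=lambda t: t[2])
print(f"selected combination: {best_item} + {best_content} (estimated value {best_score:.3f})")
```

In this toy version, the combination with the highest estimated value plays the role of the optimal product-content combination that is then passed to layout generation.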

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Finance (AREA)
  • Accounting & Taxation (AREA)
  • Strategic Management (AREA)
  • Development Economics (AREA)
  • General Engineering & Computer Science (AREA)
  • Library & Information Science (AREA)
  • Economics (AREA)
  • General Business, Economics & Management (AREA)
  • Marketing (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Game Theory and Decision Science (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of the present invention provide a multimedia information generation method and apparatus, and a computer-readable storage medium. The method comprises: recalling article information and content information in response to a received browsing request; performing feature extraction on the basis of the article information and the content information to obtain article features corresponding to article dimensions and content features corresponding to content dimensions, and performing collaboration and fusion on the article features and the content features to obtain a plurality of groups of fused features; estimating the plurality of groups of fused features by means of a preset recommendation model, and selecting target article information and target content information corresponding to a group of fused features having the highest estimated value; and generating target multimedia information on the basis of the target article information and the target content information. According to the solution, feature extraction and combination are performed on article information and content information to obtain a plurality of fused features, such that generated target multimedia information is diversified, and the recommendation effect is good.

Description

Multimedia information generation method and apparatus, and computer-readable storage medium
Cross-reference to related applications
The present invention is based on, and claims priority to, Chinese patent application No. 202211139046.2 filed on September 19, 2022, the entire content of which is hereby incorporated into the present invention by reference.
Technical field
The present invention relates to the field of computer vision, and in particular, to a method and device for generating multimedia information, and a computer-readable storage medium.
Background
In current e-commerce advertising systems, interest product recall is usually performed at product granularity. The system selects a product candidate set for the current user based on the user's historical browsing, searching, purchasing and add-to-cart behavior, selects the optimal product with the advertising system's ranking model, and generates the related advertisement only after the product has been determined. Advertisements are currently generated mostly from templates: the product's main image is inserted into a template and rendered into the corresponding product advertisement. Although this is automated, the generated advertisement recommendations are rather uniform.
Summary of the invention
Embodiments of the present invention provide a method and device for generating multimedia information, and a computer-readable storage medium, which can generate corresponding target multimedia information based on item information and content information, with good diversity and a good recommendation effect.
The technical solution of the present invention is implemented as follows:
An embodiment of the present invention provides a method for generating multimedia information. The method includes:
recalling item information and content information in response to a received browsing request;
performing feature extraction based on the item information and content information to obtain item features corresponding to the item dimension and content features corresponding to the content dimension, and collaborating and fusing the item features and the content features to obtain multiple sets of fusion features, each set of fusion features representing the fusion between a different combination of content modalities and a different item;
estimating the multiple sets of fusion features through a preset recommendation model, and selecting the target item information and target content information corresponding to the set of fusion features with the highest estimated value, the preset recommendation model being configured to screen the fusion features;
generating target multimedia information based on the target item information and the target content information.
In the above solution, performing feature extraction based on the item information and content information to obtain the item features corresponding to the item dimension and the content features corresponding to the content dimension includes:
performing feature extraction on the item information to obtain the item features corresponding to the item dimension;
identifying the content information to obtain content information corresponding to a content multi-modal type, the content multi-modal type including at least two modalities among text information, image information and image sequence information;
performing feature extraction on the content information corresponding to the content multi-modal type to obtain the content features corresponding to the content dimension.
In the above solution, performing feature extraction on the content information corresponding to the content multi-modal type to obtain the content features corresponding to the content dimension includes: if the content multi-modal type is a text type, performing feature extraction on the text information through a first encoding method to obtain text features;
if the content multi-modal type is an image type or an image sequence type, performing feature extraction on the image information and the image sequence information respectively through a second encoding method to obtain image features and behavioral features;
determining the content features corresponding to the content dimension according to at least one of the text features, the image features and the behavioral features.
In the above solution, if the content multi-modal type is a text type, performing feature extraction on the text information through the first encoding method to obtain text features includes:
if the content multi-modal type is a text type, performing feature extraction on the text information to obtain text initial features, the text initial features including semantic expression information and word information;
encoding the text initial features through the first encoding method to obtain the text features.
In the above solution, if the content multi-modal type is an image type or an image sequence type, performing feature extraction on the image information and the image sequence information respectively through the second encoding method to obtain image features and behavioral features includes:
if the content multi-modal type is an image type, performing feature extraction on the image information to obtain image initial features, the image initial features including scene information, content information and style information;
if the content multi-modal type is an image sequence type, performing feature extraction on the image sequence information to obtain behavior initial features, the behavior initial features including subject target information and key frame information;
encoding the image initial features and the behavior initial features respectively through the second encoding method to obtain the image features and the behavioral features.
In the above solution, collaborating and fusing the item features and the content features to obtain multiple sets of fusion features includes:
performing collaborative processing on the item features and the content features to obtain first item features and first content features following the same probability distribution, the first item features including multiple first sub-item features and the first content features including multiple first sub-content features;
randomly combining the multiple first sub-item features to obtain multiple item combination features;
randomly combining the multiple first sub-content features to obtain multiple content combination features, the content combination features containing content features corresponding to at least two content multi-modal types;
fusing the multiple item combination features and the multiple content combination features to obtain the multiple sets of fusion features.
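As a rough illustration of this collaboration-then-fusion step, the following sketch standardizes item and content features into a shared space, draws random sub-feature combinations, and concatenates each item combination with each content combination. The normalization, mean-pooling and concatenation choices are assumptions made for the example only, not the specific collaborative processing of the embodiments.

```python
import numpy as np

rng = np.random.default_rng(42)

def collaborate(features):
    """Map features into a shared space with roughly the same distribution (zero mean, unit variance)."""
    mean = features.mean(axis=0, keepdims=True)
    std = features.std(axis=0, keepdims=True) + 1e-8
    return (features - mean) / std

item_feats = collaborate(rng.normal(loc=5.0, scale=3.0, size=(4, 8)))      # first sub-item features
content_feats = collaborate(rng.normal(loc=-2.0, scale=0.5, size=(6, 8)))  # first sub-content features

# randomly combine sub-features (mean-pooled here), then fuse every item combination
# with every content combination by concatenation
item_combos = [item_feats[rng.choice(4, size=2, replace=False)].mean(axis=0) for _ in range(3)]
content_combos = [content_feats[rng.choice(6, size=3, replace=False)].mean(axis=0) for _ in range(4)]

fusion_groups = [np.concatenate([ic, cc]) for ic in item_combos for cc in content_combos]
print(f"{len(fusion_groups)} fusion feature groups, each of dimension {fusion_groups[0].shape[0]}")
```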
In the above solution, estimating the multiple sets of fusion features through the preset recommendation model and selecting the target item information and target content information corresponding to the set of fusion features with the highest estimated value includes:
inputting the multiple sets of fusion features into the preset recommendation model for estimation to obtain a first estimated value corresponding to each set of fusion features;
selecting, based on the multiple first estimated values, the set of fusion features with the highest estimated value from the multiple sets of fusion features;
decoding that set of fusion features to obtain the target item information and the target content information.
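A minimal sketch of this estimate-and-select step is given below, with a logistic scorer standing in for the preset recommendation model and a lookup of the (item, content) pair standing in for the decoding step; both are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# toy fusion feature groups, each tagged with the (item, content) pair it encodes
fusion_groups = {("item_A", "copy_1"): rng.normal(size=16),
                 ("item_A", "video_2"): rng.normal(size=16),
                 ("item_B", "copy_1"): rng.normal(size=16)}
weights = rng.normal(size=16)               # stand-in for the trained recommendation model

def first_estimate(fused):
    """Estimated value (a pCTR-style score) for one set of fusion features."""
    return float(1.0 / (1.0 + np.exp(-fused @ weights)))

scores = {pair: first_estimate(vec) for pair, vec in fusion_groups.items()}
best_pair = max(scores, key=scores.get)     # set of fusion features with the highest estimate
target_item, target_content = best_pair     # "decoding" back to item and content information
print(target_item, target_content, round(scores[best_pair], 3))
```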
In the above solution, generating target multimedia information based on the target item information and the target content information includes:
performing layout generation on the target item information and the target content information through a preset layout generation model to obtain multiple layouts, the preset layout generation model representing adjusting the layout according to the items and the content;
evaluating the multiple layouts through an evaluation model to determine candidate layouts, the evaluation model being used to evaluate and screen layouts;
selecting an optimal layout from the candidate layouts through a layout optimization model;
generating the target multimedia information based on the optimal layout, the target item information and the target content information.
In the above solution, performing layout generation on the target item and the target content through the preset layout generation model to obtain multiple layouts includes:
generating, through the preset layout generation model, an initialization layout corresponding to the target item information and the target content information, the preset layout generation model including the stacking order of image layers and constraints on the text size range in the text information;
adjusting the initialization layout through adjustment rules to determine the multiple layouts, the adjustment rules being obtained through continuous training with the object's degree of preference as the incentive.
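To make the layout stage concrete, here is a small sketch in which a random proposal generator stands in for the preset layout generation model, a font-size rule stands in for the evaluation model, and a simple preference score stands in for the layout optimization model; all three, and the layout fields used, are assumptions for illustration only.

```python
import random

random.seed(7)

def generate_layouts(item, content, n=5):
    """Stand-in for the preset layout generation model: propose n candidate layouts."""
    return [{"item": item, "content": content,
             "image_xy": (random.randint(0, 100), random.randint(0, 100)),
             "font_size": random.randint(12, 48)} for _ in range(n)]

def evaluate(layout):
    """Stand-in for the evaluation model: keep layouts whose text size stays in a readable range."""
    return 16 <= layout["font_size"] <= 36

def prefer(layout):
    """Stand-in for the layout optimization model: score candidates, higher is better."""
    return -abs(layout["font_size"] - 24)

layouts = generate_layouts("item_A", "copy_1")
candidates = [layout for layout in layouts if evaluate(layout)]
optimal = max(candidates, key=prefer) if candidates else layouts[0]
print("optimal layout:", optimal)
```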
In the above solution, before the multiple layouts are evaluated through the evaluation model and candidate layouts are determined, the method further includes:
obtaining historical target multimedia information;
identifying the historical target multimedia information to obtain historical layouts, the historical layouts including positive sample data and negative sample data;
training an initial evaluation model with the positive sample data and the negative sample data to determine the evaluation model.
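The evaluation model can be read as a binary classifier trained on positive and negative historical layouts. The sketch below trains a plain logistic-regression scorer on two made-up layout features (a font size and an overlap ratio); the features, data and training loop are illustrative assumptions, not the model used in the embodiments.

```python
import numpy as np

rng = np.random.default_rng(1)

# historical layouts reduced to two toy features: [font size, element overlap ratio]
# label 1 = positive sample (layout of served multimedia information), 0 = negative sample
X = np.vstack([rng.normal([24.0, 0.10], [4.0, 0.05], size=(50, 2)),   # positives
               rng.normal([40.0, 0.60], [6.0, 0.20], size=(50, 2))])  # negatives
y = np.concatenate([np.ones(50), np.zeros(50)])

mu, sigma = X.mean(axis=0), X.std(axis=0)
Xn = (X - mu) / sigma                                # standardize before training

w, b = np.zeros(2), 0.0
for _ in range(500):                                 # plain logistic-regression training loop
    p = 1.0 / (1.0 + np.exp(-(Xn @ w + b)))
    w -= 0.1 * (Xn.T @ (p - y)) / len(y)
    b -= 0.1 * float((p - y).mean())

new_layout = (np.array([22.0, 0.15]) - mu) / sigma
score = 1.0 / (1.0 + np.exp(-(new_layout @ w + b)))
print(f"evaluation score for the new layout: {score:.2f}")  # near 1.0 -> keep as candidate
```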
In the above solution, evaluating the multiple layouts through the evaluation model to determine candidate layouts includes:
evaluating the multiple layouts through the evaluation model to obtain an evaluation result corresponding to each of the multiple layouts;
if an evaluation result is characterized as successful, taking the corresponding layout as a candidate layout.
An embodiment of the present invention provides a device for generating multimedia information. The device includes an acquisition part, a selection part and a generation part; wherein,
the acquisition part is configured to recall item information and content information in response to a received browsing request; perform feature extraction based on the item information and content information to obtain item features corresponding to the item dimension and content features corresponding to the content dimension; and collaborate and fuse the item features and the content features to obtain multiple sets of fusion features, each set of fusion features representing the fusion between a different combination of content modalities and a different item;
the selection part is configured to estimate the multiple sets of fusion features through a preset recommendation model and select the target item information and target content information corresponding to the set of fusion features with the highest estimated value, the preset recommendation model being configured to screen the fusion features;
the generation part is configured to generate target multimedia information based on the target item information and the target content information.
An embodiment of the present invention provides a device for generating multimedia information. The device includes:
a memory for storing executable instructions;
a processor for executing the executable instructions stored in the memory, wherein when the executable instructions are executed, the processor performs the method for generating multimedia information.
An embodiment of the present invention provides a computer-readable storage medium storing executable instructions, wherein when the executable instructions are executed by one or more processors, the processors perform the method for generating multimedia information.
Embodiments of the present invention provide a method and device for generating multimedia information, and a computer-readable storage medium. The method includes: recalling item information and content information in response to a received browsing request; performing feature extraction based on the item information and content information to obtain item features corresponding to the item dimension and content features corresponding to the content dimension, and collaborating and fusing the item features and the content features to obtain multiple sets of fusion features, each set of fusion features representing the fusion between a different combination of content modalities and a different item; estimating the multiple sets of fusion features through a preset recommendation model and selecting the target item information and target content information corresponding to the set of fusion features with the highest estimated value, the preset recommendation model being configured to screen the fusion features; and generating target multimedia information based on the target item information and the target content information. In the above solution, the server first vectorizes the item information and the content information to obtain the item features corresponding to the items and the content features corresponding to the content, converts the item features and content features, which lie in different spaces, into vectors in the same space, and fuses them to obtain multiple sets of fusion features; because each fusion feature carries both dimensions, the fusion features are diverse. The server then estimates the multiple fusion features with the preset recommendation model and obtains multiple estimated values; a higher estimated value indicates better diversity of the fusion features, so the target item information and target content information corresponding to the set of fusion features with the highest estimated value are also more diverse, and the target multimedia information generated from them is diverse as well. Finally, this method of generating multimedia information improves the diversity of the target multimedia information, thereby ensuring that personalized recommendations are provided to users and improving the recommendation effect.
Brief description of the drawings
Figure 1 is a first optional flow diagram of a method for generating multimedia information provided by an embodiment of the present invention;
Figure 2 is a second optional flow diagram of a method for generating multimedia information provided by an embodiment of the present invention;
Figure 3 is a third optional flow diagram of a method for generating multimedia information provided by an embodiment of the present invention;
Figure 4 is a fourth optional flow diagram of a method for generating multimedia information provided by an embodiment of the present invention;
Figure 5 is a fifth optional flow diagram of a method for generating multimedia information provided by an embodiment of the present invention;
Figure 6 is a sixth optional flow diagram of a method for generating multimedia information provided by an embodiment of the present invention;
Figure 7 is a seventh optional flow diagram of a method for generating multimedia information provided by an embodiment of the present invention;
Figure 8 is an eighth optional flow diagram of a method for generating multimedia information provided by an embodiment of the present invention;
Figure 9 is a ninth optional flow diagram of a method for generating multimedia information provided by an embodiment of the present invention;
Figure 10 is a first schematic structural diagram of a device for generating multimedia information provided by an embodiment of the present invention;
Figure 11 is a second schematic structural diagram of a device for generating multimedia information provided by an embodiment of the present invention.
Detailed description
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings of the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
In order to enable those skilled in the art to better understand the solution of the present invention, the present invention is further described in detail below with reference to the accompanying drawings and specific embodiments.
In current e-commerce advertising systems, the recall of items of interest is usually at SKU granularity. The system selects an item candidate set for the current user based on the user's historical browsing, searching, purchasing and add-to-cart behavior, and selects the optimal item with the advertising system's ranking model. Only after the item has been determined is the related creative generated (the advertising carrier: pictures, videos, copywriting and other creative content). The process includes several parts:
(1) Item recall
Item recall is an interest matching problem. On the one hand, the user's historical interest information is used to effectively mine related search, browsing, click and add-to-cart items. Item recall must consider both the user's long-term preferences and real-time needs, and this combination of short-term and long-term behavior supports multi-dimensional interest mining. At the same time, for promotional activities and trending information, related items can be extended to same-store and same-category items, and hot items and similar items can be added for further mining.
Another idea is to consider the similarity of crowds, commonly known as collaborative filtering: related users are effectively clustered through user profiles, and the interest preferences of users of the same type (for example, with similar consumption habits and consistent brand preferences) are transferred, which can increase the ability to mine novel items of interest.
(2) Model ranking
Model ranking is called "fine ranking" in the advertising system. Through model estimation, the single best product is selected for display. The goal of ranking is usually to maximize revenue, and the core here is pCTR estimation, that is, estimating which product will perform best (have the highest click-through rate).
Current ranking models are usually based on convolutional neural networks (Convolutional Neural Network, CNN), combining rich data features with deep learning. The model is trained on the posterior click-through rate of the data, and the learned parameters are used for online estimation. One basis for this is that e-commerce scenarios have a large amount of user access data, and the offline data and the online data follow the same normal distribution; based on this inference, it is only necessary to ensure that the model itself can take as input and learn the key item and user features.
In e-commerce advertising scenarios, a relatively mature feature system has already been established, such as the user's PIN and device information, context information, and item categories and attributes. This information has been widely used as basic features in various scenarios.
(3) Creative generation
The creative, as the advertising carrier, presents the item content. There are currently two usual approaches: one is to display creatives predefined by the merchant; the other is to generate creatives from creative element content (such as item pictures, templates, benefit points and selling points).
As the key outlet of advertising content, creative generation now has fully automatic generation capability, and creatives produced by merchants themselves can also be uniformly integrated and optimized.
Taking display pictures as an example, creatives are currently generated mostly from templates: the product's main image is inserted into a template and rendered into the corresponding item creative. Although this is automated, AI capabilities are not fully integrated or reflected, and there is much room for improving the effect.
It should be noted that the items here may be commodities.
In the interest product recall stage, product recall only targets the product itself; it does not consider the user's preference for creative forms and creative types, nor the synergistic relationship between the product itself and the creative content. The current approach lacks a unified expression and modeling of the related content, which leads to poor product recall results. In the model ranking stage, the current ranking only considers the preference for a single product, lacks the ability to estimate combinations of multiple products, lacks modeling of product attributes at different levels, and takes only product and user features as input, missing the creative dimension, especially the ability to make multi-modal estimates with copywriting, pictures and videos as features, which leads to poor recommendation results. In the creative generation stage, creative generation currently happens after recall and ranking, and the generation method is mostly template nesting, which does not reflect the user's preference for creative elements. Using templates as the carrier limits the diversity of the creatives, limits users' need for individually tailored creatives, and lacks personalized expression of user interests.
In view of the above problems, an embodiment of the present invention proposes a method for generating multimedia information. By adding the recall of creative elements in the product recall stage, intelligent creative generation evolves into an element combination problem; based on the idea of vectorized collaborative matching, a unified expression of products and creative elements is established, breaking the underlying logic of traditional e-commerce advertising that sells goods directly and, relying on the combination of the advertising ecosystem and the information-flow ecosystem, establishing a creative-driven approach to maximizing user conversion. In the model ranking stage, the selection of a single product becomes the combined selection of multiple products, and multi-modal information is fused into the model estimation. In the creative generation stage, real-time personalized creatives are generated from creative elements and product combinations, giving user interests a personalized expression.
Figure 1 is a first optional flow diagram of a method for generating multimedia information provided by an embodiment of the present invention, and the description proceeds with the steps shown in Figure 1.
S101. Recall item information and content information in response to a received browsing request.
In some embodiments of the present invention, the item information is all items to be recommended to the terminal. The item information contains multiple pieces of sub-item information, and the content information contains multiple pieces of sub-content information. A browsing request is a request formed when the user enters the information of the items to be browsed into the search box of an application browsing page or a web browsing page. For example, after a user enters "photo frame" in the search box on the browsing page of a shopping platform, a request for browsing photo frames is formed.
In some embodiments of the present invention, the server receives the browsing request sent by the terminal and, in response to the browsing request, recalls item information and content information from the item library and the content library according to the object's historical browsing information.
Exemplarily, as shown in Figure 2, based on the object's historical browsing information (equivalent to the actor), multiple commodities (equivalent to items) in the item library are input into the ranking recommendation model (Deep & Cross Network, DCN) for extraction to obtain multiple pieces of item information; the browsed images are input into a convolutional neural network (Convolutional Neural Network, CNN) for extraction to obtain content information carrying item information; the item information is then removed from the content information carrying item information to obtain Click (equivalent to the content information).
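As a rough illustration of this recall step, the sketch below retrieves the nearest items and creative contents to a user-interest vector by cosine similarity. The random embeddings merely stand in for the outputs of the DCN item model and the CNN content encoder mentioned above; the library sizes, dimensions and function names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

# toy embeddings standing in for the outputs of the DCN item model and the CNN content encoder
item_library = {f"item_{i}": rng.normal(size=16) for i in range(100)}
content_library = {f"creative_{i}": rng.normal(size=16) for i in range(200)}
user_interest = rng.normal(size=16)   # aggregated from the object's historical browsing behaviour

def recall(library, query, k=5):
    """Nearest-neighbour recall by cosine similarity."""
    def cosine(vec):
        return float(query @ vec / (np.linalg.norm(query) * np.linalg.norm(vec) + 1e-8))
    return sorted(library, key=lambda name: cosine(library[name]), reverse=True)[:k]

recalled_items = recall(item_library, user_interest)
recalled_contents = recall(content_library, user_interest)
print(recalled_items)
print(recalled_contents)
```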
S102. Perform item and content feature extraction based on the item information and content information to obtain item features corresponding to the item dimension and content features corresponding to the content dimension, and collaborate and fuse the item features and content features to obtain multiple sets of fusion features.
In some embodiments of the present invention, each set of fusion features represents the fusion between a different combination of content modalities and a different item. Collaboration processes multiple vectors located in different vector spaces so that they are mapped into the same vector space and follow the same probability distribution; fusion forms fused vectors by combining, in different ways, multiple vectors located in the same space. The vectors may be the item features and content features of the present invention. Collaboration must be carried out before fusion; only after collaborative processing can fusion be performed. Item features are the displayed characteristics of an item, and content features are features obtained by extracting features from the picture descriptions, video descriptions and text descriptions of an item.
Exemplarily, the item information may be the attribute characteristics of the item, and the content information may be the images and copywriting other than the item attributes, that is, the creative content used to promote the item.
In some embodiments of the present invention, the server may perform feature extraction on the item information to obtain the item features corresponding to the item dimension; identify the content information to obtain the content multi-modal type; and perform feature extraction on the content information corresponding to the content multi-modal type to obtain the content features corresponding to the content dimension. The server performs collaborative processing on the item features and content features to obtain first item features and first content features following the same probability distribution, and fuses the first item features and the first content features to obtain multiple sets of fusion features.
In some embodiments of the present invention, Figure 3 is a third optional flow diagram of a method for generating multimedia information provided by an embodiment of the present invention. As shown in Figure 3, performing feature extraction based on the item information and content information to obtain the item features corresponding to the item dimension and the content features corresponding to the content dimension can be implemented through S1021-S1023, as follows:
S1021. Perform feature extraction on the item information to obtain the item features corresponding to the item dimension.
In some embodiments of the present invention, the server may perform feature extraction on the item information and convert the item information into features in vector form to obtain the item features corresponding to the item dimension. An item feature is a 1024-dimensional floating-point array.
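One simple way to picture turning item information into a fixed-length vector is a hashing-trick encoder, sketched below with the 1024-dimensional size mentioned above. The attribute names and the hashing scheme are assumptions for illustration, not the encoder used in the embodiments.

```python
import numpy as np

DIM = 1024  # matches the 1024-dimensional floating-point array mentioned above

def vectorize_item(attributes, dim=DIM):
    """Hashing-trick vectorization of item attributes into a fixed-length float array."""
    vec = np.zeros(dim, dtype=np.float32)
    for key, value in attributes.items():
        index = hash(f"{key}={value}") % dim
        vec[index] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

item = {"category": "photo frame", "brand": "acme", "color": "walnut", "price_band": "mid"}
item_feature = vectorize_item(item)
print(item_feature.shape, item_feature.dtype)   # (1024,) float32
```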
S1022. Identify the content information to obtain the content information corresponding to the content multi-modal type.
In some embodiments of the present invention, the server may identify the content information through a neural network model to obtain the content information corresponding to the content multi-modal type, where the content multi-modal type includes at least two modalities among text information, image information and image sequence information. A neural network (Neural Networks, NN) model is a complex network system formed by a large number of simple processing units (called neurons) that are widely interconnected. It reflects many basic characteristics of human brain function and is a highly complex nonlinear dynamic learning system. Neural networks have large-scale parallelism, distributed storage and processing, self-organization, self-adaptation and self-learning capabilities, and are particularly suitable for imprecise and fuzzy information processing problems that require many factors and conditions to be considered simultaneously.
S1023. Perform feature extraction on the content information corresponding to the content multi-modal type to obtain the content features corresponding to the content dimension.
In some embodiments of the present invention, the server may perform the corresponding feature extraction according to the content multi-modal type. If the content multi-modal type is a text type, feature extraction is performed on the text information through a first encoding method to obtain text features. If the content multi-modal type is an image type or an image sequence type, feature extraction is performed on the image information and the image sequence information respectively through a second encoding method to obtain image features and behavioral features. The content features corresponding to the content dimension are determined according to the text features, the image features and the behavioral features. The first encoding method mainly operates on text information; the second encoding method mainly operates on image information and image sequence information. For example, the image information may be an image and the image sequence information may be a video.
It can be understood that the server performs feature extraction on the item information and vectorizes it to obtain the item features corresponding to the items; identifies the content information to obtain the content multi-modal type; and performs feature extraction on the content information corresponding to the content multi-modal type to obtain the content features corresponding to the content dimension. Since item features and content features belong to different dimensions, the server obtains multi-dimensional features. When the target multimedia information is subsequently generated based on these multi-dimensional features, the target multimedia information carries multi-dimensional information and is therefore diverse.
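The modality-dependent branch described above can be pictured as a small dispatch function: text goes through one encoder, images and frame sequences through another. The two placeholder encoders and the dictionary format below are assumptions made for illustration only.

```python
import numpy as np

def first_encoding(text):
    """Placeholder for the first encoding method applied to text information."""
    return np.array([len(text), text.count(" ") + 1, 0.0, 1.0], dtype=float)

def second_encoding(pixels):
    """Placeholder for the second encoding method applied to images or image sequences."""
    return np.array([pixels.mean(), pixels.std(), pixels.min(), pixels.max()], dtype=float)

def encode_content(content):
    """Dispatch by content multi-modal type: text vs. image vs. image sequence."""
    if content["type"] == "text":
        return first_encoding(content["data"])
    if content["type"] in ("image", "image_sequence"):
        return second_encoding(np.asarray(content["data"], dtype=float))
    raise ValueError(f"unsupported modality: {content['type']}")

print(encode_content({"type": "text", "data": "limited time offer"}))
print(encode_content({"type": "image", "data": [[0, 255], [128, 64]]}))
```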
In some embodiments of the present invention, Figure 4 is a fourth optional flow diagram of a method for generating multimedia information provided by an embodiment of the present invention. As shown in Figure 4, S1023 can be implemented through S201-S203, as follows:
S201. If the content multi-modal type is a text type, perform feature extraction on the text information through the first encoding method to obtain text features.
In some embodiments of the present invention, when the content multi-modal type is a text type, the server performs feature extraction on the text information to obtain text initial features, and encodes the text initial features through the first encoding method to obtain the text features.
It should be noted that the text features are the text initial features in vector form.
In some embodiments of the present invention, S201 can be implemented through S2011-S2012, as follows:
S2011. If the content multi-modal type is a text type, perform feature extraction on the text information to obtain text initial features.
In some embodiments of the present invention, the text initial features include semantic expression information and word information.
In some embodiments of the present invention, when the content multi-modal type is a text type, the server performs feature extraction on the text information to obtain semantic expression information and word information. Both the semantic expression information and the word information are text initial features.
Exemplarily, Figure 5 is a fifth optional flow diagram of a method for generating multimedia information provided by an embodiment of the present invention. As shown in Figure 5, the server performs feature extraction on the copywriting information to obtain the semantic expression (equivalent to the semantic expression information) and the word segmentation (equivalent to the word information). Specifically, the semantic expression is obtained by means of Bert.
S2012. Encode the text initial features through the first encoding method to obtain the text features.
In some embodiments of the present invention, the server encodes the text initial features through the first encoding method to obtain vectorized text features.
Exemplarily, as shown in Figure 5, the first encoding method is ConCat: the server uses ConCat to encode the semantic expression (equivalent to the semantic expression information) and the word segmentation (equivalent to the word information) to obtain a feature vector (equivalent to the text features).
It can be understood that the server performs feature extraction and encoding on the text information to obtain the text features. In this process, the server converts the text information into vectorized text features, which facilitates the subsequent collaboration and fusion of the item features and content features.
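A minimal sketch of this "concatenate semantic and word features" step is given below. The placeholder embedding functions only mimic the shape of a Bert-style sentence vector and averaged word vectors; a real system would obtain these from its own models, and the dimensions and tokenization used here are assumptions.

```python
import numpy as np

def semantic_embedding(text, dim=8):
    """Placeholder for a Bert-style sentence embedding (a real system would call the model here)."""
    state = np.random.default_rng(abs(hash(text)) % (2**32))
    return state.normal(size=dim)

def word_embedding(tokens, dim=8):
    """Placeholder for averaged word-segmentation features."""
    return np.mean([semantic_embedding(token, dim) for token in tokens], axis=0)

copy_text = "limited time offer on walnut photo frames"
tokens = copy_text.split()                            # stand-in for word segmentation
text_feature = np.concatenate([semantic_embedding(copy_text), word_embedding(tokens)])  # the ConCat step
print(text_feature.shape)   # (16,)
```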
S202、若内容多模态类型为图像类型或图像序列类型,则通过第二编码方式分别对图像信息和图像序列信息进行特征提取,得到图像特征和行为特征。S202. If the content multi-modal type is an image type or an image sequence type, perform feature extraction on the image information and image sequence information respectively through the second encoding method to obtain image features and behavioral features.
在本发明的一些实施例中,服务器根据内容多模态类型为图像类型,则对图像信息进行特征提取,得到图像初始特征。根据内容多模态类型为图像序列类型,则对图像序列信息进行特征提取,得到行为初始特征。通过第二编码方式,对图像初始特征和行为初始特征分别进行编码处理,得到图像特征和行为特征。In some embodiments of the present invention, the server performs feature extraction on the image information according to the content multi-modal type being the image type to obtain initial image features. According to the content multi-modal type being an image sequence type, feature extraction is performed on the image sequence information to obtain initial behavioral features. Through the second coding method, the initial features of the image and the initial features of the behavior are respectively coded to obtain the image features and the behavior features.
在本发明的一些实施例中,S202可以通过S2021-S2023实现,如下:In some embodiments of the present invention, S202 can be implemented through S2021-S2023, as follows:
S2021、若内容多模态类型为图像类型,则对图像信息进行特征提取,得到图像初始特征。S2021. If the content multi-modal type is an image type, perform feature extraction on the image information to obtain initial image features.
在本发明的一些实施例中,图像初始特征包括场景信息、内容信息和风格信息。In some embodiments of the present invention, the initial image features include scene information, content information and style information.
在本发明的一些实施例中,服务器可以根据内容多模态类型为图像类型,对图像信息进行特征提取,得到场景信息、内容信息和风格信息。场景信息、内容信息和风格信息都是图像初始特征。In some embodiments of the present invention, the server can extract features from the image information according to the content multimodal type being the image type, and obtain scene information, content information and style information. The scene information, content information and style information are all initial features of the image.
Exemplarily, as shown in Figure 5, the image information may be a promotional picture displaying an item; the server performs feature extraction on the image information to obtain the scene (equivalent to the scene information), the content and the subject (the content and the subject are equivalent to the content information), and the color, style and layout (the color, style and layout are equivalent to the style information). The scene, content, subject, color, style and layout are all initial image features.
S2022、若内容多模态类型为图像序列类型,则对图像序列信息进行特征提取,得到行为初始特征。S2022. If the content multi-modal type is an image sequence type, perform feature extraction on the image sequence information to obtain initial behavioral features.
在本发明的一些实施例中,行为初始特征包括主体目标信息和关键帧信息。In some embodiments of the present invention, the behavior initial features include subject target information and key frame information.
In some embodiments of the present invention, the server may perform feature extraction on the image sequence information, based on the content multi-modal type being the image sequence type, to obtain the subject target information and the key frame information. The subject target information and the key frame information are both initial behavioral features.
Exemplarily, as shown in Figure 5, the server performs feature extraction on the image sequence information to obtain key frames and highlight points (the key frames and highlight points are equivalent to the key frame information), and summaries and subject target behavior actions (the summaries and subject target behavior actions are equivalent to the subject target information). The key frames, highlight points, summaries and subject target behavior actions are all initial behavioral features, and all of them belong to the content list.
S2023、通过第二编码方式,对图像初始特征和行为初始特征分别进行编码处理,得到图像特征和行为特征。S2023. Through the second encoding method, the initial image features and the initial behavioral features are respectively encoded to obtain the image features and behavioral features.
在本发明的一些实施例中,服务器通过第二编码方式,对图像初始特征进行编码处理,得到向量化后的图像特征;对行为初始特征进行编码处理,得到向量化后的行为特征。In some embodiments of the present invention, the server encodes the initial image features through the second encoding method to obtain vectorized image features; it encodes the initial behavioral features to obtain vectorized behavioral features.
示例性的,如图5所示,第二编码方式是One Hot,服务器通过One Hot对场景、内容、主体、颜色、风格和布局进行特征编码,得到特征向量(相当于图像特征)。服务器通过One Hot对关键帧、精彩点、摘要、主体目标行为动作进行特征编码,得到特征向量(相当于行为特征)。For example, as shown in Figure 5, the second encoding method is One Hot. The server performs feature encoding on the scene, content, subject, color, style and layout through One Hot to obtain a feature vector (equivalent to image features). The server uses One Hot to perform feature encoding on key frames, highlights, summaries, and subject target behaviors to obtain feature vectors (equivalent to behavioral features).
It can be understood that the server performs feature extraction and encoding on the image information and the image sequence information to obtain the image features and the behavioral features. The server can convert the image information and the image sequence information into vectorized image features and vectorized behavioral features respectively, thereby obtaining multi-modal content features and making the content features diverse.
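As a small illustration of the One Hot encoding step in S2023, the sketch below one-hot encodes two categorical image attributes and concatenates them into an image feature vector. The category vocabularies (`SCENES`, `STYLES`) are assumptions for illustration; the real attribute sets are not disclosed in the text.

```python
import numpy as np

SCENES = ["indoor", "outdoor", "studio"]   # hypothetical scene categories
STYLES = ["minimal", "vivid", "retro"]     # hypothetical style categories

def one_hot(value: str, categories: list[str]) -> np.ndarray:
    vec = np.zeros(len(categories))
    if value in categories:
        vec[categories.index(value)] = 1.0
    return vec

def encode_image_feature(scene: str, style: str) -> np.ndarray:
    # Each categorical attribute (scene, content, subject, color, style, layout, ...)
    # is one-hot encoded, and the pieces are concatenated into one image feature vector.
    return np.concatenate([one_hot(scene, SCENES), one_hot(style, STYLES)])

print(encode_image_feature("outdoor", "vivid"))   # e.g. [0. 1. 0. 0. 1. 0.]
```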
S203、根据文本特征、图像特征和行为特征中的至少一种,确定内容维度对应的内容特征。S203. Determine content features corresponding to the content dimension based on at least one of text features, image features, and behavioral features.
在本发明的一些实施例中,服务器将文本特征、图像特征和行为特征中的至少一种作为内容维度对应的内容特征。In some embodiments of the present invention, the server uses at least one of text features, image features, and behavioral features as content features corresponding to the content dimension.
Exemplarily, the server may determine the text features as the content features corresponding to the content dimension; or the server may determine the image features as the content features corresponding to the content dimension; or the server may determine the behavioral features as the content features corresponding to the content dimension; or the server may determine the text features and the image features as the content features corresponding to the content dimension; or the server may determine the text features and the behavioral features as the content features corresponding to the content dimension; or the server may determine the image features and the behavioral features as the content features corresponding to the content dimension; or the server may determine the text features, the image features and the behavioral features as the content features corresponding to the content dimension.
可以理解的是,服务器可以对内容信息进行识别和特征提取,得到文本特征、图像特征和行为特征。服务器可以根据文本特征、图像特征和行为特征中的一种特征确定内容维度对应的内容特征;或者,根据文本特征、图像特征和行为特征中的两种特征确定内容维度对应的内容特征;或者,服务器可以根据文本特征、图像特征和行为特征中的三种特征确定内容维度对应的内容特征。由于,内容特征具有一种或者多种多模态特征,因此,内容特征具有多样性。It is understandable that the server can identify and extract features of the content information to obtain text features, image features, and behavior features. The server can determine the content features corresponding to the content dimension based on one of the text features, image features, and behavior features; or, the server can determine the content features corresponding to the content dimension based on two of the text features, image features, and behavior features; or, the server can determine the content features corresponding to the content dimension based on three of the text features, image features, and behavior features. Since the content features have one or more multimodal features, the content features are diverse.
在本发明的一些实施例中,将物品特征和内容特征进行协同和融合,得到多组融合特征可以通过S301-S303实现,如下:In some embodiments of the present invention, collaboration and fusion of item features and content features to obtain multiple sets of fusion features can be achieved through S301-S303, as follows:
S301、对物品特征和内容特征,进行协同处理,得到同一概率分布的第一物品特征和第一内容特征。S301. Coordinately process item features and content features to obtain first item features and first content features with the same probability distribution.
在本发明的一些实施例中,第一物品特征包括多个第一子物品特征;第一内容特征包括多个第一子内容特征。In some embodiments of the present invention, the first item feature includes a plurality of first sub-item features; the first content feature includes a plurality of first sub-content features.
In some embodiments of the present invention, to account for the differences between feature domains, the server performs collaborative learning on the item features and the content features, and maps the item features and the content features into the same vector space to obtain the first item features and the first content features that follow the same probability distribution.
需要说明的是,协同处理是对多个位于不同向量空间的向量进行处理,使其映射到同一个向量空间,使之满足相同的概率分布;协同处理与协同的技术手段一致。It should be noted that collaborative processing is to process multiple vectors located in different vector spaces so that they are mapped to the same vector space and satisfy the same probability distribution; the technical means of collaborative processing and collaboration are consistent.
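The sketch below is one minimal way to realize this collaborative processing: both feature matrices are projected into a shared space and standardized so they follow the same zero-mean, unit-variance distribution. The random projection matrices stand in for learned projection layers; nothing about the actual model architecture is disclosed here.

```python
import numpy as np

def collaborate(item_feats: np.ndarray, content_feats: np.ndarray,
                dim: int = 64, seed: int = 0) -> tuple[np.ndarray, np.ndarray]:
    """Map both feature sets into one shared `dim`-dimensional space and
    standardize them so they share the same probability distribution."""
    rng = np.random.default_rng(seed)
    w_item = rng.standard_normal((item_feats.shape[1], dim))       # stand-in projection
    w_content = rng.standard_normal((content_feats.shape[1], dim)) # stand-in projection
    both = np.vstack([item_feats @ w_item, content_feats @ w_content])
    both = (both - both.mean(axis=0)) / (both.std(axis=0) + 1e-8)  # same distribution
    n_item = item_feats.shape[0]
    return both[:n_item], both[n_item:]   # first item features, first content features

first_item, first_content = collaborate(np.random.rand(12, 32), np.random.rand(6, 20))
```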
S302、对多个第一子物品特征进行随机组合,得到多个物品组合特征。S302. Randomly combine multiple first sub-item features to obtain multiple item combination features.
在本发明的一些实施例中,服务器可以对多个第一子物品特征进行随机组合,得到多个不一样的物品组合特征。In some embodiments of the present invention, the server can randomly combine multiple first sub-item features to obtain multiple different item combination features.
Exemplarily, the server randomly combines 12 first sub-item features (the 12 first sub-item features are different from one another) to obtain 5 item combination features, which respectively contain 6 first sub-item features, 8 first sub-item features, 3 first sub-item features, 5 first sub-item features and 9 first sub-item features. It should be noted that the 5 item combination features may contain the same first sub-item features or different first sub-item features.
S303、对多个第一子内容特征进行随机组合,得到多个内容组合特征。S303. Randomly combine multiple first sub-content features to obtain multiple content combination features.
在本发明的一些实施例中,内容组合特征包含至少两种内容多模态类型对应的内容特征。In some embodiments of the present invention, the content combination features include content features corresponding to at least two content multi-modal types.
在本发明的一些实施例中,服务器可以对多个第一子内容特征进行随机组合,得到多个不一样的内容组合特征。In some embodiments of the present invention, the server can randomly combine multiple first sub-content features to obtain multiple different content combination features.
Exemplarily, the server randomly combines 6 first sub-content features (the 6 first sub-content features are different from one another, either in the content multi-modal types they contain or in the content features themselves) to obtain 2 content combination features. One of the content combination features contains content features corresponding to three content multi-modal types, namely 2 text features, 3 image features and 1 behavioral feature; the other content combination feature contains content features corresponding to two content multi-modal types, namely 2 text features and 1 image feature.
S304、对多个物品组合特征和多个内容组合特征进行融合,得到多组融合特征。S304. Fusion of multiple item combination features and multiple content combination features to obtain multiple sets of fusion features.
在本发明的一些实施例中,服务器可以对多个物品组合特征和多个内容组合特征进行融合,得到多组融合特征;一组融合特征包括至少一个物品组合特征和至少一个内容组合特征。In some embodiments of the present invention, the server can fuse multiple item combination features and multiple content combination features to obtain multiple sets of fusion features; one set of fusion features includes at least one item combination feature and at least one content combination feature.
Exemplarily, the server fuses the 5 item combination features and the 2 content combination features to obtain 3 groups of fused features, namely the first group, the second group and the third group. The first group of fused features includes 3 first sub-item features and three content multi-modal types, with 2 text features, 3 image features and 1 behavioral feature; the second group of fused features includes 8 first sub-item features and two content multi-modal types, with 2 text features and 1 image feature; the third group includes 13 first sub-item features and three content multi-modal types, with 4 text features, 4 image features and 1 behavioral feature.
It can be understood that the server processes the item features and the content features located in different vector spaces so that they are mapped into the same vector space and follow the same probability distribution; in this way the two kinds of features lie in the same vector space, which facilitates the subsequent fusion of the two kinds of features. The server randomly combines the multiple first sub-item features to obtain multiple item feature combinations; since each item combination contains multiple first sub-item features, the item feature combinations are diverse. The server randomly combines the multiple first sub-content features to obtain multiple content feature combinations; since each content combination contains multiple first sub-content features, the content feature combinations are diverse. The server fuses the item feature combinations and the content feature combinations to obtain multiple groups of fused features; since the fused features include both the item feature combinations and the content feature combinations, the fused features are diverse.
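The following sketch ties S302-S304 together: random subsets of sub-features are drawn, and each item group is fused with each content group. Mean pooling followed by concatenation is an assumed fusion operator chosen only for illustration; the application does not specify the fusion function.

```python
import random
import numpy as np

rng_np = np.random.default_rng(0)
first_sub_item_feats = [rng_np.standard_normal(64) for _ in range(12)]     # dummy data
first_sub_content_feats = [rng_np.standard_normal(64) for _ in range(6)]   # dummy data

def random_combinations(features, n_groups, seed=0):
    """Draw `n_groups` random subsets of the sub-features (subset sizes chosen at random)."""
    rng = random.Random(seed)
    return [rng.sample(features, rng.randint(1, len(features))) for _ in range(n_groups)]

def fuse(item_group, content_group):
    """One group of fused features: mean-pooled item vector concatenated with mean-pooled content vector."""
    return np.concatenate([np.mean(item_group, axis=0), np.mean(content_group, axis=0)])

item_groups = random_combinations(first_sub_item_feats, n_groups=5)
content_groups = random_combinations(first_sub_content_feats, n_groups=2, seed=1)
fused_groups = [fuse(ig, cg) for ig in item_groups for cg in content_groups]
```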
S103、通过预设的推荐模型,对多组融合特征进行预估,选择预估值最高的一组融合特征对应的目标物品信息和目标内容信息。S103. Estimate multiple groups of fusion features through a preset recommendation model, and select the target item information and target content information corresponding to a group of fusion features with the highest estimated value.
在本发明的一些实施例中,服务器可以将多组融合特征输入预设的推荐模型中进行预估,得到多组融合特征各自对应的第一预估值。基于多个第一预估值,从多组融合特征中,选择预估值最高的一组融合特征。对一组融合特征进行解码处理,得到目标物品信息和目标内容信息。In some embodiments of the present invention, the server can input multiple sets of fusion features into a preset recommendation model for prediction, and obtain first estimated values corresponding to each of the multiple sets of fusion features. Based on multiple first estimated values, a set of fused features with the highest estimated value is selected from multiple sets of fused features. Decode a set of fused features to obtain target item information and target content information.
Exemplarily, the number of items From1 is set, where From1 ranges over (0, M), and the number of creatives (i.e., contents) From2 is set, where From2 ranges over (0, N). The server performs a traversal exploration to obtain multiple items; for the multiple items, the vectors of the multiple items are fused; for the multiple creatives, the multi-modal features from the recall stage are fused into feature vectors; at the same time, the fused creative vectors (i.e., the fused features) are input into the estimation model (i.e., the preset recommendation model) to obtain CTR estimates; the combination with the highest pCTR estimate is selected for output as the overall estimation result (i.e., the target item information and the target content information corresponding to the group of fused features with the highest estimated value).
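The traversal described above can be sketched as the exhaustive search below: item subsets of size up to From1 and creative subsets of size up to From2 are enumerated, each pair is fused and scored, and the highest-scoring combination is kept. `predict_ctr` is a stand-in for the preset recommendation model, and mean pooling is an assumed fusion operator; both are illustrative, not disclosed implementation details.

```python
import itertools
import numpy as np

def best_combination(item_vecs, creative_vecs, predict_ctr, m=2, n=2):
    """Enumerate item/creative subsets, fuse each pair, and keep the highest pCTR combination."""
    best, best_score = None, -1.0
    for k1 in range(1, m + 1):
        for items in itertools.combinations(range(len(item_vecs)), k1):
            item_vec = np.mean([item_vecs[i] for i in items], axis=0)
            for k2 in range(1, n + 1):
                for creatives in itertools.combinations(range(len(creative_vecs)), k2):
                    creative_vec = np.mean([creative_vecs[j] for j in creatives], axis=0)
                    score = predict_ctr(np.concatenate([item_vec, creative_vec]))
                    if score > best_score:
                        best, best_score = (items, creatives), score
    return best, best_score

item_vecs = [np.random.rand(16) for _ in range(4)]
creative_vecs = [np.random.rand(16) for _ in range(3)]
w = np.random.rand(32)
best, score = best_combination(item_vecs, creative_vecs, lambda v: 1 / (1 + np.exp(-v @ w)))
```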
在本发明的一些实施例中,S103可以通过S1031、S1032和S1033实现,如下:In some embodiments of the present invention, S103 can be implemented through S1031, S1032 and S1033, as follows:
S1031、将多组融合特征输入预设的推荐模型中进行预估,得到多组融合特征各自对应的第一预估值。S1031. Input multiple sets of fusion features into the preset recommendation model for prediction, and obtain the first estimated values corresponding to each of the multiple sets of fusion features.
在本发明的一些实施例中,服务器通过预设的推荐模型对多组融合特征进行预估,得到多组融合特征各自对应的第一预估值。In some embodiments of the present invention, the server estimates the multiple groups of fused features through a preset recommendation model to obtain first estimated values corresponding to each of the multiple groups of fused features.
Exemplarily, the server estimates 3 groups of fused features through the preset recommendation model, and obtains first estimated values of 0.7, 0.85 and 0.62 corresponding to the 3 groups of fused features respectively.
S1032、基于多个第一预估值,从多组融合特征中,选择预估值最高的一组融合特征。S1032. Based on multiple first estimated values, select a set of fusion features with the highest estimated value from multiple sets of fusion features.
在本发明的一些实施例中,服务器通过多个第一预估值,从多组融合特征中,选择预估值最高的一组融合特征。In some embodiments of the present invention, the server selects a set of fusion features with the highest estimated value from multiple sets of fusion features based on multiple first estimated values.
示例性的,服务器从3组融合特征分别对应的第一预估值0.7、0.85和0.62中,选择预估值0.85的融合特征。For example, the server selects the fusion feature with an estimated value of 0.85 from the first estimated values 0.7, 0.85 and 0.62 corresponding to the three sets of fusion features respectively.
S1033、对一组融合特征进行解码处理,得到目标物品信息和目标内容信息。S1033. Decode a set of fusion features to obtain target item information and target content information.
在本发明的一些实施例中,服务器可以通过对一组融合特征进行解码处理,将融合特征转化为目标物品信息和目标内容信息。In some embodiments of the present invention, the server can decode a set of fused features to convert the fused features into target item information and target content information.
Exemplarily, the server decodes the first group of fused features to obtain the target item information and the target content information; the target item information includes 3 items, and the target content information includes three content multi-modal types, with 2 pieces of text, 3 images and 1 image sequence.
It can be understood that the server estimates the multiple fused features according to the preset recommendation model to obtain multiple estimated values. A higher estimated value indicates better diversity of the fused features, so the target item information and the target content information corresponding to the group of fused features with the highest estimated value have better diversity, and the target multimedia information generated according to the target item information and the target content information is therefore diverse.
S104、基于目标物品信息和目标内容信息,生成目标多媒体信息。S104. Generate target multimedia information based on the target item information and target content information.
在本发明的一些实施例中,服务器可以通过预设的布局生成模型,对目标物品信息和目标内容信息进行布局生成,得到多个布局。通过评价模型,对多个布局进行评估,确定候选布局。通过布局优选模型,从候选布局中,选择最优布局。根据最优布局、目标物品信息和目标内容信息,生成目标多媒体信息。将目标多媒体信息发送至终端,供终端基于目标多媒体信息进行浏览页面的展示。In some embodiments of the present invention, the server can perform layout generation on target item information and target content information through a preset layout generation model to obtain multiple layouts. Through the evaluation model, multiple layouts are evaluated and candidate layouts are determined. Through the layout optimization model, the optimal layout is selected from the candidate layouts. Target multimedia information is generated based on the optimal layout, target item information and target content information. The target multimedia information is sent to the terminal, so that the terminal displays the browsing page based on the target multimedia information.
Exemplarily, Figure 6 is an optional flow diagram 6 of a multimedia information generation method provided by an embodiment of the present invention. As shown in Figure 6, the traditional multimedia information generation process is as follows: a user request (equivalent to a browsing request) is received, and the server recalls products (equivalent to item recall) to obtain product information; the product information is ranked by a model, and the product information corresponding to the Top1 model is selected as the recommended product information; templated creatives are generated, and the product information is fused to obtain the multimedia information. Figure 7 is an optional flow diagram 7 of a multimedia information generation method provided by an embodiment of the present invention. As shown in Figure 7, data A/B (equivalent to the target item information and the target content information) are input into the online learning module of the server, and an initialized layout (not shown in the figure) is generated through the preset layout generation model. The initial layout is adjusted through the adjustment rules in terms of text size, element position, color and contrast to obtain multiple layouts. The multiple layouts are evaluated through the evaluation model (indicated by +++ in Figure 7) to obtain evaluation results, which include pass and fail; if the evaluation result is "pass", the layout plan (equivalent to the candidate layouts) is output, where the layout plan includes four layouts, namely 1, 2, 3 and 4. The layout plan is optimized through the layout optimization model to obtain the optimal style (the optimal layout), which may be copy, a picture, a video or an intermediate page. The target multimedia information is generated through the real-time multimedia information generation engine.
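The Figure 7 layout stage can be summarized by the structural sketch below: generate an initialized layout, apply adjustment rules, keep the layouts that pass evaluation, and select the preferred one. Every callable (`generate`, `adjust_rules`, `evaluate`, `prefer`) is a toy stand-in for the corresponding model; the concrete models are not disclosed in the text.

```python
def build_layout(items, contents, generate, adjust_rules, evaluate, prefer):
    initial = generate(items, contents)                       # initialized layout
    layouts = [rule(initial) for rule in adjust_rules]        # text size / position / color / contrast tweaks
    candidates = [l for l in layouts if evaluate(l) == "pass"]
    return max(candidates, key=prefer)                        # layout optimization model picks the best

# Toy stand-ins, only to show the call shape:
best = build_layout(
    ["item"], ["copy"],
    generate=lambda i, c: {"elements": i + c, "font": 20},
    adjust_rules=[lambda l, d=d: {**l, "font": l["font"] + d} for d in (-4, 0, 4)],
    evaluate=lambda l: "pass" if l["font"] >= 16 else "fail",
    prefer=lambda l: l["font"],
)
```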
It can be understood that the server vectorizes the item information and the content information to obtain the item features corresponding to the items and the content features corresponding to the contents. The server converts the item features and the content features in different spaces into vectors in the same space and fuses them to obtain multiple groups of fused features. Since a fused feature carries both dimensions, the fused features are diverse. The server estimates the multiple fused features according to the preset recommendation model to obtain multiple estimated values. A higher estimated value indicates better diversity of the fused features, so the target item information and the target content information corresponding to the group of fused features with the highest estimated value have better diversity, and the target multimedia information generated according to the target item information and the target content information is therefore diverse. Finally, the multimedia information generation method can improve the diversity of the target multimedia information, thereby ensuring that personalized recommendations are provided to users and improving the recommendation effect.
在本发明的一些实施例中,图8为本发明实施例提供一种多媒体信息的生成方法的一个可选的流程示意图八,如图8所示,S104可以通过S1041-S1045实现,如下:In some embodiments of the present invention, Figure 8 is an optional flow diagram 8 of a method for generating multimedia information provided by an embodiment of the present invention. As shown in Figure 8, S104 can be implemented through S1041-S1045, as follows:
S1041、通过预设的布局生成模型,对目标物品信息和目标内容信息进行布局生成,得到多个布局。S1041. Use a preset layout generation model to generate layouts for target item information and target content information to obtain multiple layouts.
在本发明的一些实施例中,预设的布局生成模型包括图像层的先后叠放顺序和文本信息中文字大小范围约束。In some embodiments of the present invention, the preset layout generation model includes the stacking order of image layers and text size range constraints in text information.
在本发明的一些实施例中,服务器可以通过预设的布局生成模型,生成目标物品信息和目标内容信息对应的初始化布局。通过调整规则,对初始化布局进行调整,确定多个布局。In some embodiments of the present invention, the server can generate an initialization layout corresponding to the target item information and the target content information through a preset layout generation model. By adjusting the rules, adjust the initial layout and determine multiple layouts.
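As a small illustration of the constraints named for the preset layout generation model (image-layer stacking order and a text size range), the check below validates a candidate layout. The field names and numeric bounds are assumptions introduced only for this sketch.

```python
def layout_is_valid(image_layers, text_elements, min_pt=12, max_pt=48):
    """Check that image layers keep their stacking order and text sizes stay inside a range."""
    order_ok = all(a["z"] <= b["z"] for a, b in zip(image_layers, image_layers[1:]))
    size_ok = all(min_pt <= t["font_pt"] <= max_pt for t in text_elements)
    return order_ok and size_ok

print(layout_is_valid([{"z": 0}, {"z": 1}], [{"font_pt": 18}, {"font_pt": 30}]))  # True
```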
在本发明的一些实施例中,S1041可以通过S401和S402实现,如下:In some embodiments of the present invention, S1041 may be implemented by S401 and S402 as follows:
S401、通过预设的布局生成模型,生成目标物品信息和目标内容信息对应的初始化布局。S401. Generate an initialization layout corresponding to the target item information and target content information through a preset layout generation model.
在本发明的一些实施例中,服务器可以将目标物品信息和目标内容信息输入预设的布局生成模型中,生成目标物品信息和目标内容信息对应的初始化布局。初始化布局是指对目标物品信息和目标内容信息的位置进行排列组合得到的。In some embodiments of the present invention, the server can input the target item information and the target content information into a preset layout generation model, and generate an initialization layout corresponding to the target item information and the target content information. The initial layout refers to the arrangement and combination of the positions of target item information and target content information.
S402、通过调整规则,对初始化布局进行调整,确定多个布局。S402. Adjust the initial layout by adjusting rules to determine multiple layouts.
In some embodiments of the present invention, the adjustment rules are obtained through continuous training with the preference degree of the object as the incentive. Specifically, based on reinforcement learning, the preference degree of the object serves as the incentive: if the click-through rate is higher after an adjustment, the incentive is positive; if the click-through rate becomes lower, the incentive is negative. The rules are obtained through repeated adjustment and learning.
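A toy sketch of this reinforcement signal, assuming each adjustment rule carries a scalar weight that is nudged up when the click-through rate improves and down when it drops; the weight, learning rate and update form are illustrative assumptions, not the disclosed training procedure.

```python
def update_rule_weight(weight: float, ctr_after: float, ctr_before: float, lr: float = 0.1) -> float:
    """Positive reward when the adjustment raises CTR, negative reward otherwise."""
    reward = 1.0 if ctr_after > ctr_before else -1.0
    return weight + lr * reward

w = 0.5
w = update_rule_weight(w, ctr_after=0.032, ctr_before=0.028)   # w -> 0.6
```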
在本发明的一些实施例中,服务器通过调整规则,对初始化布局进行调整,得到多个布局。In some embodiments of the present invention, the server adjusts the initial layout by adjusting rules to obtain multiple layouts.
It can be understood that the server can generate the initialized layout corresponding to the target item information and the target content information through the preset layout generation model, adjust the initialized layout through the adjustment rules, and determine multiple layouts of the target item information and the target content information. Adjusting the initialized layout through the adjustment rules corrects unreasonable layout patterns and can improve the rationality of the layouts. The layouts after adjustment can still include multiple layouts, so the adjusted layouts remain diverse.
S1042、通过评价模型,对多个布局进行评估,确定候选布局。S1042. Evaluate multiple layouts through an evaluation model to determine candidate layouts.
在本发明的一些实施例中,评价模型用于对布局进行评价筛选。In some embodiments of the present invention, the evaluation model is used to evaluate and filter layouts.
在本发明的一些实施例中,服务器可以通过评价模型,对多个布局进行评估,得到多个布局各自对应的评估结果。若评估结果表征为成功,则将其对应的布局作为候选布局,若评估结果表征为失败,则将其对应的布局删除。In some embodiments of the present invention, the server can evaluate multiple layouts through an evaluation model and obtain evaluation results corresponding to the multiple layouts. If the evaluation result is characterized as successful, the corresponding layout will be used as a candidate layout. If the evaluation result is characterized as failure, the corresponding layout will be deleted.
在本发明的一些实施例中,S1042可以通过S501和S502实现,如下:In some embodiments of the present invention, S1042 can be implemented through S501 and S502, as follows:
S501、通过评价模型,对多个布局进行评估,得到多个布局各自对应的评估结果。S501. Use the evaluation model to evaluate multiple layouts and obtain evaluation results corresponding to the multiple layouts.
在本发明的一些实施例中,服务器可以通过评价模型,对多个布局进行合理性评估,得到多个布局各自对应的评估结果。评价结果包括成功和失败。In some embodiments of the present invention, the server can use the evaluation model to evaluate the rationality of multiple layouts and obtain corresponding evaluation results of the multiple layouts. Evaluation results include success and failure.
S502、若评估结果表征为成功,则将其对应的布局作为候选布局。S502. If the evaluation result indicates success, use the corresponding layout as a candidate layout.
In some embodiments of the present invention, when the evaluation result of a layout indicates success, meaning that the layout passes, the server may take that layout as a candidate layout.
It can be understood that the server can evaluate the multiple layouts through the evaluation model and obtain the evaluation results corresponding to the multiple layouts, and the evaluation results represent local rationality. According to the evaluation results, the server filters the layouts, removes unreasonable layouts and determines the candidate layouts. Since the candidate layouts are the result of filtering out unreasonable layouts, the candidate layouts selected by the server are layouts of higher rationality.
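A minimal sketch of this S501-S502 filtering step: layouts whose evaluation result is "success" become candidates and the rest are dropped. `evaluate` is a stand-in for the evaluation model.

```python
def filter_candidates(layouts, evaluate):
    """Keep only layouts whose evaluation result is success; delete the rest."""
    return [layout for layout in layouts if evaluate(layout) == "success"]

candidates = filter_candidates([{"id": 1}, {"id": 2}], lambda l: "success" if l["id"] == 1 else "failure")
```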
在本发明的一些实施例中,在S1042之前还执行S601、S602和S603实现,如下:In some embodiments of the present invention, S601, S602 and S603 are also implemented before S1042, as follows:
S601、获取历史目标多媒体信息。S601. Obtain historical target multimedia information.
在本发明的一些实施例中,服务器可以获取历史目标多媒体信息。In some embodiments of the present invention, the server may obtain historical target multimedia information.
S602、对历史目标多媒体信息进行识别,得到历史布局。S602. Identify the historical target multimedia information and obtain the historical layout.
在本发明的一些实施例中,历史布局包括正样本数据和负样本数据。In some embodiments of the invention, the historical layout includes positive sample data and negative sample data.
在本发明的一些实施例中,服务器可以对历史目标多媒体信息进行识别解析,得到历史目标多媒体信息对应的历史布局。In some embodiments of the present invention, the server can identify and analyze the historical target multimedia information to obtain the historical layout corresponding to the historical target multimedia information.
S603、通过正样本数据和负样本数据对初始评价模型进行训练,确定评价模型。S603. Train the initial evaluation model through positive sample data and negative sample data to determine the evaluation model.
在本发明的一些实施例中,服务器通过历史布局的正样本数据和负样本数据对初始评价模型进行训练,直到模型输出的评估结果满足预设阈值,保存模型,得到评价模型。In some embodiments of the present invention, the server trains the initial evaluation model through the positive sample data and negative sample data of the historical layout until the evaluation result output by the model meets the preset threshold, saves the model, and obtains the evaluation model.
可以理解的是,服务器通过历史目标多媒体信息对初始评价模型进行训练,确定评价模型,可以保证评价模型的评估准确性。It is understandable that the server trains the initial evaluation model through historical target multimedia information to determine the evaluation model, which can ensure the evaluation accuracy of the evaluation model.
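The training step in S601-S603 can be sketched as an ordinary binary classifier fitted on vectorized historical layouts, with positive samples labeled 1 and negative samples labeled 0. Logistic regression, the random features and the 0.5 threshold are illustrative assumptions; the actual model and training criterion are not disclosed.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical training data: each row is a vectorized historical layout.
X = np.random.rand(200, 16)
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)        # 1 = positive sample, 0 = negative sample

evaluation_model = LogisticRegression().fit(X, y)

# A layout "passes" when its predicted success probability clears a preset threshold.
passes = evaluation_model.predict_proba(X[:1])[0, 1] >= 0.5
```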
S1043、通过布局优选模型,从候选布局中,选择最优布局。S1043. Select the optimal layout from the candidate layouts through the layout optimization model.
In some embodiments of the present invention, the server may input the candidate layouts into the layout optimization model and select the optimal layout from the candidate layouts. The optimal layout is determined by evaluating the multiple candidate layouts with the layout optimization model to obtain the index evaluation values corresponding to the multiple candidate layouts, and selecting, from the multiple index evaluation values, the candidate layout with the highest index evaluation value as the optimal layout.
示例性的,有3个候选布局,通过布局优选模型对3个侯选布局分别进行指标评价,得到3个候选布局对应的指标评价值。第一候选布局的指标评价值为0.5、第二候选布局的指标评价值为0.7、第三候选布局的指标评价值为0.8;将指标评价值为0.8的第三侯选布局作为最优布局。For example, there are three candidate layouts, and the three candidate layouts are respectively evaluated with indexes through the layout optimization model, and the index evaluation values corresponding to the three candidate layouts are obtained. The index evaluation value of the first candidate layout is 0.5, the index evaluation value of the second candidate layout is 0.7, and the index evaluation value of the third candidate layout is 0.8; the third candidate layout with an index evaluation value of 0.8 is regarded as the optimal layout.
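A one-function sketch of this preference step, where `index_score` stands in for the layout optimization model (assumed, not disclosed) and the candidate with the highest index evaluation value is returned.

```python
def choose_optimal_layout(candidates, index_score):
    scores = [index_score(c) for c in candidates]        # e.g. 0.5, 0.7, 0.8
    return candidates[scores.index(max(scores))]         # highest index evaluation value wins

optimal = choose_optimal_layout(["layout_a", "layout_b", "layout_c"], {"layout_a": 0.5, "layout_b": 0.7, "layout_c": 0.8}.get)
```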
S1044、基于最优布局、目标物品信息和目标内容信息,生成目标多媒体信息。S1044. Generate target multimedia information based on the optimal layout, target item information and target content information.
在本发明的一些实施例中,服务器可以将目标物品信息和目标内容信息按照最优布局进行排列,生成目标多媒体信息。In some embodiments of the present invention, the server can arrange the target item information and the target content information according to the optimal layout to generate the target multimedia information.
S1045、将目标多媒体信息发送至终端,供终端基于目标多媒体信息进行浏览页面的展示。S1045: Send the target multimedia information to the terminal, so that the terminal can display a browsing page based on the target multimedia information.
在本发明的一些实施例中,服务器将目标多媒体信息发送至终端。终端可以展示基于目标多媒体信息进行浏览页面。In some embodiments of the present invention, the server sends the target multimedia information to the terminal. The terminal can display a browsing page based on target multimedia information.
可以理解的是,服务器可以根据预设的布局生成模型和调整规则生成目标物品信息和目标内容信息的多个布局,通过评价模型和布局优选模型,对多个布局进行筛选,确定最优布局。其中,最优布局为去除不合理的布局后确定的,那么通过最优布局得到的目标多媒体信息就比较合理和准确,因此,提高目标多媒体信息的准确性。那么服务器将目标多媒体信息进行推荐时,目标多媒体信息会更符合用户的需求,可以提供给用户个性化的推荐,推荐效果好。 It can be understood that the server can generate multiple layouts of target item information and target content information according to the preset layout generation model and adjustment rules, and filter the multiple layouts through the evaluation model and layout optimization model to determine the optimal layout. Among them, the optimal layout is determined after removing unreasonable layouts, then the target multimedia information obtained through the optimal layout is more reasonable and accurate, thus improving the accuracy of the target multimedia information. Then when the server recommends the target multimedia information, the target multimedia information will be more in line with the user's needs, and can provide the user with personalized recommendations, and the recommendation effect is good.
In some embodiments of the present invention, Figure 9 is an optional flow diagram 9 of a multimedia information generation method provided by an embodiment of the present invention. As shown in Figure 9, the server receives a user request (equivalent to a browsing request); it performs interest product recall (equivalent to item information recall) and creative element recall (equivalent to content recall) to obtain the item information and the content information. Vectorized collaborative modeling is performed on the item information and the content information to obtain the item features and the content features. The item features and the content features are fused to obtain the fused features (not shown in Figure 9; they are obtained before being input into the cross-modal CTR estimation). Multi-product preference (equivalent to fused feature preference) is performed through cross-modal CTR estimation to obtain the optimal product-content combination, where the multiple modalities include text (equivalent to the text features), style, pictures (equivalent to the image features) and video (equivalent to the behavioral features). Through the preset layout generation model (not shown in the figure) and the adjustment rules, element planning is performed on the product-content combination to generate the layout in real time, and the final target multimedia information is determined and sent to the user (equivalent to the terminal).
可以理解的是,首先,服务器可以将物品信息和内容信息进行向量化表示,得到物品对应物品特征和内容对应的内容特征。服务器将不同空间下的物品特征和内容特征转化为同一空间下的向量进行融合,得到多个融合特征。由于融合特征是具有两个维度的特征,因此,融合特征具有多样性。基于融合特征具有多样性的特性,服务器根据预设的推荐模型对多个融合特征进行预估,得到多个预估值。其中,预估值越高代表融合特征的多样性越好,对应选择预估值最高的一组融合特征对应的最优商品内容组合的多样性就越好。其次,服务器通过预设的布局生成模型和调整规则,对最优商品内容组合进行元素规划实时生成布局,确定最终目标多媒体信息。由于,最优商品内容组合具有多样性,根据最优商品内容组合生成的目标多媒体信息也具有多样性。It can be understood that, first, the server can vectorize the item information and content information to obtain the item characteristics corresponding to the item and the content characteristics corresponding to the content. The server converts item features and content features in different spaces into vectors in the same space and fuses them to obtain multiple fusion features. Since the fused feature is a feature with two dimensions, the fused feature has diversity. Based on the diversity of fusion features, the server estimates multiple fusion features based on the preset recommendation model and obtains multiple estimated values. Among them, the higher the estimated value, the better the diversity of the fusion features, and the better the diversity of the optimal product content combination corresponding to the set of fusion features with the highest estimated value. Secondly, the server uses the preset layout generation model and adjustment rules to perform element planning on the optimal product content combination to generate the layout in real time to determine the final target multimedia information. Since the optimal product content combination is diverse, the target multimedia information generated based on the optimal product content combination is also diverse.
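Putting the stages of Figure 9 together, the structural sketch below shows one possible orchestration of the flow described above. Every argument after `request` is a stand-in callable for the corresponding stage; none of these names or signatures comes from the application itself.

```python
def generate_multimedia(request, recall_items, recall_creatives, extract, collaborate,
                        fuse_groups, predict_ctr, plan_layout):
    items, creatives = recall_items(request), recall_creatives(request)          # recall stage
    item_feats, content_feats = collaborate(extract(items), extract(creatives))  # vectorized collaborative modeling
    groups = fuse_groups(item_feats, content_feats)                              # groups of fused features
    best = max(groups, key=predict_ctr)                                          # cross-modal CTR preference
    return plan_layout(best)                                                     # element planning / real-time layout
```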
Based on the multimedia information generation method of the above embodiments, an embodiment of the present invention further provides a multimedia information generation apparatus. As shown in Figure 10, Figure 10 is a first structural schematic diagram of a multimedia information generation apparatus provided by an embodiment of the present invention. The multimedia information generation apparatus 10 includes an acquisition part 1001, a selection part 1002 and a generation part 1003, wherein:
the acquisition part 1001 is configured to recall item information and content information in response to a received browsing request; perform feature extraction based on the item information and the content information to obtain item features corresponding to the item dimension and content features corresponding to the content dimension; and collaborate and fuse the item features and the content features to obtain multiple groups of fused features, where each group of fused features represents a fusion between a different combination of content modalities and different items;
the selection part 1002 is configured to estimate the multiple groups of fused features through a preset recommendation model, and select the target item information and the target content information corresponding to the group of fused features with the highest estimated value, where the preset recommendation model represents performing preference selection on the fused features;
所述生成部分1003,被配置为基于所述目标物品信息和所述目标内容信息,生成目标多媒体信息。The generating part 1003 is configured to generate target multimedia information based on the target item information and the target content information.
In some embodiments of the present invention, the acquisition part 1001 is configured to perform feature extraction on the item information to obtain the item features corresponding to the item dimension; identify the content information to obtain content information corresponding to content multi-modal types, where the content multi-modal types include at least two modalities among text information, image information and image sequence information; and perform feature extraction on the content information corresponding to the content multi-modal types to obtain the content features corresponding to the content dimension.
在本发明的一些实施例中,所述多媒体信息的生成装置还包括确定部分1004;其中,In some embodiments of the present invention, the device for generating multimedia information further includes a determining part 1004; wherein,
所述获取部分1001,被配置为若所述内容多模态类型为文本类型,则通过第一编码方式对所述文本信息进行特征提取,得到文本特征;若所述内容多模态类型为图像类型或图像序列类型,则通过第二编码方式分别对所述图像信息和所述图像序列信息进行特征提取,得到图像特征和行为特征; The acquisition part 1001 is configured to perform feature extraction on the text information through the first encoding method to obtain text features if the content multi-modal type is a text type; if the content multi-modal type is an image type or image sequence type, then perform feature extraction on the image information and the image sequence information respectively through the second encoding method to obtain image features and behavioral features;
所述确定部分1004,被配置为根据所述文本特征、所述图像特征和所述行为特征中的至少一种,确定内容维度对应的所述内容特征。The determining part 1004 is configured to determine the content characteristics corresponding to the content dimension according to at least one of the text characteristics, the image characteristics and the behavioral characteristics.
在本发明的一些实施例中,所述获取部分1001,被配置为若所述内容多模态类型为文本类型,则对所述文本信息进行特征提取,得到文本初始特征;所述文本初始特征包括语义表达信息和词语信息;通过所述第一编码方式,对所述文本初始特征进行编码处理,得到所述文本特征。In some embodiments of the present invention, the acquisition part 1001 is configured to perform feature extraction on the text information to obtain initial text features if the content multi-modal type is a text type; the text initial features It includes semantic expression information and word information; through the first encoding method, the text initial features are encoded to obtain the text features.
在本发明的一些实施例中,所述获取部分1001,被配置为若所述内容多模态类型为图像类型,则对所述图像信息进行特征提取,得到图像初始特征;所述图像初始特征包括场景信息、内容信息和风格信息;若所述内容多模态类型为图像序列类型,则对所述图像序列信息进行特征提取,得到行为初始特征;所述行为初始特征包括主体目标信息和关键帧信息;通过所述第二编码方式,对所述图像初始特征和所述行为初始特征分别进行编码处理,得到所述图像特征和所述行为特征。In some embodiments of the present invention, the acquisition part 1001 is configured to, if the content multimodal type is an image type, perform feature extraction on the image information to obtain image initial features; the image initial features include scene information, content information and style information; if the content multimodal type is an image sequence type, perform feature extraction on the image sequence information to obtain behavior initial features; the behavior initial features include subject target information and key frame information; and through the second encoding method, encode the image initial features and the behavior initial features respectively to obtain the image features and the behavior features.
In some embodiments of the present invention, the acquisition part 1001 is configured to coordinately process the item features and the content features to obtain first item features and first content features that follow the same probability distribution, where the first item features include multiple first sub-item features and the first content features include multiple first sub-content features; randomly combine the multiple first sub-item features to obtain multiple item combination features; randomly combine the multiple first sub-content features to obtain multiple content combination features, where each content combination feature contains content features corresponding to at least two content multi-modal types; and fuse the multiple item combination features and the multiple content combination features to obtain the multiple groups of fused features.
In some embodiments of the present invention, the acquisition part 1001 is configured to input the multiple groups of fused features into the preset recommendation model for estimation to obtain the first estimated values corresponding to the multiple groups of fused features; select, based on the multiple first estimated values, the group of fused features with the highest estimated value from the multiple groups of fused features; and decode the group of fused features to obtain the target item information and the target content information.
In some embodiments of the present invention, the acquisition part 1001 is configured to perform layout generation on the target item information and the target content information through a preset layout generation model to obtain multiple layouts, where the preset layout generation model represents adjusting the layout according to the items and the contents;
所述确定部分1004,被配置为通过评价模型,对所述多个布局进行评估,确定候选布局;所述评价模型被配置为对布局进行评价筛选;The determination part 1004 is configured to evaluate the multiple layouts and determine candidate layouts through an evaluation model; the evaluation model is configured to evaluate and screen layouts;
所述选择部分1002,被配置为通过布局优选模型,从所述候选布局中,选择最优布局;The selection part 1002 is configured to select an optimal layout from the candidate layouts through a layout optimization model;
所述生成部分1003,被配置为基于所述最优布局、所述目标物品信息和所述目标内容信息,生成所述目标多媒体信息。The generating part 1003 is configured to generate the target multimedia information based on the optimal layout, the target item information and the target content information.
In some embodiments of the present invention, the generation part 1003 is configured to generate the initialized layout corresponding to the target item information and the target content information through the preset layout generation model, where the preset layout generation model includes constraints on the stacking order of image layers and on the range of text sizes in the text information;
所述确定部分1004,被配置为通过调整规则,对所述初始化布局进行调整,确定所述多个布局;所述调整规则是将对象的偏好程度作为激励,通过不断训练得到的。The determination part 1004 is configured to adjust the initial layout and determine the multiple layouts through adjustment rules; the adjustment rules are obtained through continuous training using the object's preference as an incentive.
In some embodiments of the present invention, before the multiple layouts are evaluated through the evaluation model and the candidate layouts are determined, the acquisition part 1001 is configured to obtain historical target multimedia information, and identify the historical target multimedia information to obtain historical layouts, where the historical layouts include positive sample data and negative sample data;
所述确定部分1004,被配置为通过所述正样本数据和所述负样本数据对初始评价模型进行训练,确定所述评价模型。The determination part 1004 is configured to train an initial evaluation model through the positive sample data and the negative sample data, and determine the evaluation model.
在本发明的一些实施例中,所述获取部分1001,被配置为通过所述评价模型,对所述多个布局进行评估,得到所述多个布局各自对应的评估结果;In some embodiments of the present invention, the acquisition part 1001 is configured to evaluate the multiple layouts through the evaluation model and obtain the evaluation results corresponding to the multiple layouts;
所述确定部分1004,被配置为若所述评估结果表征为成功,则将其对应的布局作为所述候选布局。The determining part 1004 is configured to use the corresponding layout as the candidate layout if the evaluation result is characterized as successful.
可以理解的是,首先,服务器可以将物品信息和内容信息进行向量化表示,得到物品对应物品特征和内容对应的内容特征。服务器将不同空间下的物品特征和内容特征转化为同一空间下的向量进行融合,得到多个融合特征。由于融合特征是具有两个维度的特征,因此,融合特征具有多样性。基于融合特征具有多样性的特性,服务器根据预设的推荐模型对多个融合特征进行预估,得到多个预估值;其中,预估值越高代表融合特征的多样性越好,对应选择预估值最高的一组融合特征对应的最优商品内容组合的多样性就越好。其次,服务器通过预设的布局生成模型和调整规则,对最优商品内容组合进行元素规划实时生成布局,确定最终目标多媒体信息。由于,最优商品内容组合具有多样性,根据最优商品内容组合生成的目标多媒体信息也具有多样性。It can be understood that, first, the server can vectorize the item information and content information to obtain the item characteristics corresponding to the item and the content characteristics corresponding to the content. The server converts item features and content features in different spaces into vectors in the same space and fuses them to obtain multiple fusion features. Since the fused feature is a feature with two dimensions, the fused feature has diversity. Based on the diversity of fusion features, the server estimates multiple fusion features based on the preset recommendation model and obtains multiple estimates; among them, the higher the estimate, the better the diversity of the fusion features, and the corresponding selection The set of fusion features with the highest estimated value corresponds to a better diversity of optimal product content combinations. Secondly, the server uses the preset layout generation model and adjustment rules to perform element planning on the optimal product content combination to generate the layout in real time to determine the final target multimedia information. Since the optimal product content combination is diverse, the target multimedia information generated based on the optimal product content combination is also diverse.
It should be noted that, when multimedia information is generated, the division into the above program modules is used only as an example for description. In practical applications, the above processing may be allocated to different program modules as needed, that is, the internal structure of the apparatus may be divided into different program modules to complete all or part of the processing described above. In addition, the multimedia information generation apparatus provided by the above embodiments and the embodiments of the multimedia information generation method belong to the same concept; for the specific implementation process and beneficial effects, refer to the method embodiments, which are not repeated here. For technical details not disclosed in this apparatus embodiment, refer to the description of the method embodiments of the present invention.
Based on the multimedia information generation method of the above embodiments, an embodiment of the present invention further provides a multimedia information generation apparatus. As shown in Figure 11, Figure 11 is a second structural schematic diagram of a multimedia information generation apparatus provided by an embodiment of the present invention. The multimedia information generation apparatus 11 includes a processor 1101 and a memory 1102; the memory 1102 stores one or more programs executable by the processor, and when the one or more programs are executed, the processor 1101 performs any one of the multimedia information generation methods of the foregoing embodiments.
本领域内的技术人员应明白,本发明的实施例可提供为方法、系统、或计算机程序产品。因此,本发明可采用硬件实施例、软件实施例、或结合软件和硬件方面的实施例的形式。而且,本发明可采用在一个或多个其中包含有计算机可用程序代码的计算机可用存储介质(包括但不限于磁盘存储器和光学存储器等)上实施的计算机程序产品的形式。Those skilled in the art will appreciate that embodiments of the present invention may be provided as methods, systems, or computer program products. Accordingly, the invention may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, magnetic disk storage, optical storage, etc.) embodying computer-usable program code therein.
本发明是参照根据本发明实施例的方法、设备(系统)、和计算机程序产品的流程图和/或方框图来描述的。应理解可由计算机程序指令实现流程图和/或方框图中的每一流程和/或方框、以及流程图和/或方框图中的流程和/或方框的结合。可提供这些计算机程序指令到通用计算机、专用计算机、嵌入式处理机或其他可编程数据处理设备的处理器以产生一个机器,使得通过计算机或其他可编程数据处理设备的处理器执行的指令产生用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的装置。The invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each process and/or block in the flowchart illustrations and/or block diagrams, and combinations of processes and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce a use A device for realizing the functions specified in one process or multiple processes of the flowchart and/or one block or multiple blocks of the block diagram.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operation steps are performed on the computer or other programmable device to produce computer-implemented processing, and the instructions executed on the computer or other programmable device thus provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
以上所述,仅为本发明的较佳实施例而已,并非用于限定本发明的保护范围。The above descriptions are only preferred embodiments of the present invention and are not intended to limit the scope of the present invention.
Industrial Applicability
An embodiment of the present invention provides a multimedia information generation method and apparatus, and a computer-readable storage medium. The method includes: recalling item information and content information in response to a received browsing request; performing feature extraction based on the item information and the content information to obtain item features corresponding to the item dimension and content features corresponding to the content dimension, and collaborating and fusing the item features and the content features to obtain multiple groups of fused features; estimating the multiple groups of fused features through a preset recommendation model, and selecting the target item information and target content information corresponding to the group of fused features with the highest estimated value; and generating target multimedia information based on the target item information and the target content information. With the above scheme, feature extraction and combination are performed on the item information and the content information to obtain multiple fused features, so the generated target multimedia information is diverse and the recommendation effect is good.

Claims (24)

  1. A method for generating multimedia information, comprising:
    recalling item information and content information in response to a received browsing request;
    performing feature extraction based on the item information and the content information to obtain item features corresponding to an item dimension and content features corresponding to a content dimension, and collaborating and fusing the item features and the content features to obtain multiple groups of fused features, wherein each group of fused features represents a fusion between a different combination of content modalities and a different item;
    estimating the multiple groups of fused features through a preset recommendation model, and selecting target item information and target content information corresponding to a group of fused features with the highest estimated value, wherein the preset recommendation model is used to screen the fused features;
    generating target multimedia information based on the target item information and the target content information.
  2. The method according to claim 1, wherein performing feature extraction based on the item information and the content information to obtain the item features corresponding to the item dimension and the content features corresponding to the content dimension comprises:
    performing feature extraction on the item information to obtain the item features corresponding to the item dimension;
    identifying the content information to obtain content information corresponding to content multi-modal types, wherein the content multi-modal types include at least two modalities among text information, image information and image sequence information;
    performing feature extraction on the content information corresponding to the content multi-modal types to obtain the content features corresponding to the content dimension.
  3. The method according to claim 2, wherein performing feature extraction on the content information corresponding to the content multi-modal types to obtain the content features corresponding to the content dimension comprises:
    in a case where the content multi-modal type is a text type, performing feature extraction on the text information through a first encoding method to obtain text features;
    in a case where the content multi-modal type is an image type or an image sequence type, performing feature extraction on the image information and the image sequence information respectively through a second encoding method to obtain image features and behavioral features;
    determining the content features corresponding to the content dimension according to at least one of the text features, the image features and the behavioral features.
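A minimal sketch of the modality routing described in claim 3 above, assuming simple hashed placeholder encoders: text goes through a "first" encoding path, while single images and image sequences go through a "second" path that also yields behavioral features. The encoder functions, dictionary keys and feature dimensions are illustrative assumptions, not the encoders of the embodiments.

```python
# Hypothetical illustration: route each content modality to its own encoder.
from typing import Dict, List

def encode_text(text: str, dim: int = 8) -> List[float]:
    # Placeholder "first encoding method": character-level hashed bag-of-words.
    vec = [0.0] * dim
    for ch in text:
        vec[ord(ch) % dim] += 1.0
    return vec

def encode_image(pixels: List[int], dim: int = 8) -> List[float]:
    # Placeholder "second encoding method" for a single image.
    vec = [0.0] * dim
    for i, p in enumerate(pixels):
        vec[i % dim] += float(p)
    return vec

def encode_sequence(frames: List[List[int]], dim: int = 8) -> List[float]:
    # Behavioral features: average the per-frame encodings of an image sequence.
    acc = [0.0] * dim
    for f in frames:
        for i, v in enumerate(encode_image(f, dim)):
            acc[i] += v
    return [v / max(len(frames), 1) for v in acc]

def content_features(content: Dict) -> Dict[str, List[float]]:
    feats = {}
    if "text" in content:
        feats["text"] = encode_text(content["text"])
    if "image" in content:
        feats["image"] = encode_image(content["image"])
    if "image_sequence" in content:
        feats["behavior"] = encode_sequence(content["image_sequence"])
    return feats

print(content_features({"text": "red sneakers", "image": [1, 2, 3, 4],
                        "image_sequence": [[1, 2], [3, 4]]}))
```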
  4. The method according to claim 3, wherein, in the case where the content multi-modal type is the text type, performing feature extraction on the text information through the first encoding method to obtain the text features comprises:
    in the case where the content multi-modal type is the text type, performing feature extraction on the text information to obtain initial text features, wherein the initial text features include semantic expression information and word information;
    encoding the initial text features through the first encoding method to obtain the text features.
  5. The method according to claim 3, wherein, in the case where the content multi-modal type is the image type or the image sequence type, performing feature extraction on the image information and the image sequence information respectively through the second encoding method to obtain the image features and the behavioral features comprises:
    in the case where the content multi-modal type is the image type, performing feature extraction on the image information to obtain initial image features, wherein the initial image features include scene information, content information and style information;
    in the case where the content multi-modal type is the image sequence type, performing feature extraction on the image sequence information to obtain initial behavioral features, wherein the initial behavioral features include subject target information and key frame information;
    encoding the initial image features and the initial behavioral features respectively through the second encoding method to obtain the image features and the behavioral features.
  6. The method according to claim 1, wherein collaborating and fusing the item features and the content features to obtain the multiple groups of fused features comprises:
    performing collaborative processing on the item features and the content features to obtain first item features and first content features of the same probability distribution, wherein the first item features include multiple first sub-item features, and the first content features include multiple first sub-content features;
    randomly combining the multiple first sub-item features to obtain multiple item combination features;
    randomly combining the multiple first sub-content features to obtain multiple content combination features, wherein the content combination features include content features corresponding to at least two content multi-modal types;
    fusing the multiple item combination features and the multiple content combination features to obtain the multiple groups of fused features.
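The collaboration-and-fusion step of claim 6 above can be illustrated as follows, under the assumption that "the same probability distribution" is approximated by a softmax normalization and that fusion is plain concatenation; both are stand-ins chosen only for the example.

```python
# Sketch: normalize item and content sub-features, form random combinations on
# each side, then fuse every item combination with every content combination.
import math
import random
from itertools import product
from typing import List

def softmax(xs: List[float]) -> List[float]:
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def random_combinations(sub_feats: List[List[float]], k: int) -> List[List[float]]:
    # Draw k random subsets and concatenate each subset into one combined vector.
    combos = []
    for _ in range(k):
        chosen = random.sample(sub_feats, random.randint(1, len(sub_feats)))
        combos.append([v for feat in chosen for v in feat])
    return combos

def fuse_groups(item_subs: List[List[float]], content_subs: List[List[float]],
                k: int = 2) -> List[List[float]]:
    item_subs = [softmax(f) for f in item_subs]        # same probability distribution
    content_subs = [softmax(f) for f in content_subs]
    item_combos = random_combinations(item_subs, k)
    content_combos = random_combinations(content_subs, k)
    # Fuse by concatenating each (item combination, content combination) pair.
    return [ic + cc for ic, cc in product(item_combos, content_combos)]

groups = fuse_groups([[1.0, 2.0], [0.5, 0.5]], [[3.0, 1.0], [2.0, 2.0]])
print(len(groups), "fused feature groups")
```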
  7. The method according to any one of claims 1 to 6, wherein estimating the multiple groups of fused features through the preset recommendation model and selecting the target item information and the target content information corresponding to the group of fused features with the highest estimated value comprises:
    inputting the multiple groups of fused features into the preset recommendation model for estimation to obtain a first estimated value corresponding to each of the multiple groups of fused features;
    selecting, based on the multiple first estimated values, the group of fused features with the highest estimated value from the multiple groups of fused features;
    decoding the group of fused features to obtain the target item information and the target content information.
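A small sketch of the selection step in claim 7 above, assuming a fixed linear scorer in place of the preset recommendation model and carrying the originating item/content identifiers alongside each fused group so that "decoding" can be shown as a simple lookup; all names are illustrative.

```python
# Sketch: score every fused group, keep the highest-scoring one, and map it
# back to the item/content identifiers it was built from.
from typing import Dict, List, Tuple

def recommendation_model(fused: List[float], weights: List[float]) -> float:
    # Placeholder model: a fixed linear scorer standing in for the preset model.
    return sum(w * x for w, x in zip(weights, fused))

def select_best(groups: List[Tuple[List[float], Dict]],
                weights: List[float]) -> Dict:
    scored = [(recommendation_model(feat, weights), meta) for feat, meta in groups]
    best_score, best_meta = max(scored, key=lambda p: p[0])
    # "Decoding" here simply returns the metadata that produced the winning group.
    return {"score": best_score, **best_meta}

groups = [([0.2, 0.9, 0.1], {"item_id": "sku-1", "content_id": "c-7"}),
          ([0.8, 0.3, 0.5], {"item_id": "sku-2", "content_id": "c-3"})]
print(select_best(groups, weights=[1.0, 0.5, 0.25]))
```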
  8. The method according to any one of claims 1 to 6, wherein generating the target multimedia information based on the target item information and the target content information comprises:
    performing layout generation on the target item information and the target content information through a preset layout generation model to obtain multiple layouts, wherein the preset layout generation model is used to adjust the layout according to the item and the content;
    evaluating the multiple layouts through an evaluation model to determine candidate layouts, wherein the evaluation model is used to evaluate and screen layouts;
    selecting an optimal layout from the candidate layouts through a layout optimization model;
    generating the target multimedia information based on the optimal layout, the target item information and the target content information.
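The layout stage of claim 8 above, sketched with placeholder models: a generator that varies font size and layer order, an evaluation model reduced to a readability rule, and a preference score standing in for the layout optimization model. None of these stand-ins is the disclosed implementation.

```python
# Sketch: generate several layouts, screen them with an evaluation model,
# then pick the best surviving candidate with a separate preference score.
import random
from typing import Dict, List

def generate_layouts(item: Dict, content: Dict, n: int = 4) -> List[Dict]:
    # Placeholder layout generation: vary font size and image layer order.
    return [{"font_size": random.randint(12, 32),
             "layers": random.sample(["image", "text", "logo"], 3)}
            for _ in range(n)]

def evaluation_model(layout: Dict) -> bool:
    # Placeholder screening rule: reject unreadable font sizes.
    return 14 <= layout["font_size"] <= 28

def preference_score(layout: Dict) -> float:
    # Placeholder "layout optimization" score: prefer text above the image layer.
    return 1.0 if layout["layers"].index("text") < layout["layers"].index("image") else 0.0

def build_multimedia(item: Dict, content: Dict) -> Dict:
    layouts = generate_layouts(item, content)
    candidates = [l for l in layouts if evaluation_model(l)]
    best = max(candidates or layouts, key=preference_score)
    return {"layout": best, "item": item, "content": content}

print(build_multimedia({"sku": 1}, {"text": "Autumn sale", "image": "banner.jpg"}))
```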
  9. The method according to claim 8, wherein performing layout generation on the target item and the target content through the preset layout generation model to obtain the multiple layouts comprises:
    generating an initialization layout corresponding to the target item information and the target content information through the preset layout generation model, wherein the preset layout generation model includes constraints on the stacking order of image layers and on the text size range in the text information;
    adjusting the initialization layout through adjustment rules to determine the multiple layouts, wherein the adjustment rules are obtained through continuous training with an object's preference degree serving as an incentive.
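One possible, simplified reading of the adjustment rules in claim 9 above is greedy hill-climbing against a learned preference signal; the sketch below uses a hand-written preference function as that signal, which is an assumption made purely for illustration and not the trained rules of the embodiments.

```python
# Sketch: start from an initialization layout and keep adjustments that the
# preference signal (the "incentive") rewards.
import random
from typing import Dict, List

def initial_layout() -> Dict:
    # Initialization respecting the constraints named in the claim:
    # an image-layer stacking order and a bounded text size range.
    return {"font_size": 12, "layers": ["image", "text", "logo"]}

def preference(layout: Dict) -> float:
    # Placeholder preference signal (would come from object feedback in practice).
    return -abs(layout["font_size"] - 20)

def adjust(layout: Dict, rounds: int = 10) -> List[Dict]:
    layouts, current = [dict(layout)], dict(layout)
    for _ in range(rounds):
        proposal = dict(current)
        proposal["font_size"] = max(10, min(36, proposal["font_size"] + random.choice([-2, 2])))
        if preference(proposal) >= preference(current):   # keep changes the incentive rewards
            current = proposal
        layouts.append(dict(current))
    return layouts

print(adjust(initial_layout())[-1])
```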
  10. The method according to claim 8, wherein, before evaluating the multiple layouts through the evaluation model to determine the candidate layouts, the method further comprises:
    obtaining historical target multimedia information;
    identifying the historical target multimedia information to obtain historical layouts, wherein the historical layouts include positive sample data and negative sample data;
    training an initial evaluation model with the positive sample data and the negative sample data to determine the evaluation model.
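The training step of claim 10 above can be illustrated with a tiny logistic-regression trainer over positive and negative historical layout samples; the model family and the two-dimensional layout features are assumptions made for the example only.

```python
# Sketch: fit a small binary classifier on historical layouts labelled as
# positive (1) or negative (0) samples, then use it to evaluate new layouts.
import math
from typing import List, Tuple

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def train_eval_model(samples: List[Tuple[List[float], int]],
                     lr: float = 0.1, epochs: int = 200) -> List[float]:
    # samples: (layout feature vector, label); last weight acts as the bias.
    dim = len(samples[0][0])
    w = [0.0] * (dim + 1)
    for _ in range(epochs):
        for x, y in samples:
            pred = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + w[-1])
            err = pred - y
            for i in range(dim):
                w[i] -= lr * err * x[i]
            w[-1] -= lr * err
    return w

def evaluate(w: List[float], x: List[float]) -> bool:
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + w[-1]) > 0.5

history = [([0.9, 0.2], 1), ([0.8, 0.3], 1), ([0.1, 0.9], 0), ([0.2, 0.8], 0)]
model = train_eval_model(history)
print(evaluate(model, [0.85, 0.25]))   # expected: True (resembles the positive samples)
```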
  11. The method according to claim 8, wherein evaluating the multiple layouts through the evaluation model to determine the candidate layouts comprises:
    evaluating the multiple layouts through the evaluation model to obtain an evaluation result corresponding to each of the multiple layouts;
    in a case where an evaluation result is characterized as successful, taking the layout corresponding thereto as a candidate layout.
  12. An apparatus for generating multimedia information, comprising an acquisition part, a selection part and a generation part, wherein:
    the acquisition part is configured to recall item information and content information in response to a received browsing request; perform feature extraction based on the item information and the content information to obtain item features corresponding to an item dimension and content features corresponding to a content dimension; and collaborate and fuse the item features and the content features to obtain multiple groups of fused features, wherein each group of fused features represents a fusion between a different combination of content modalities and a different item;
    the selection part is configured to estimate the multiple groups of fused features through a preset recommendation model, and select target item information and target content information corresponding to a group of fused features with the highest estimated value, wherein the preset recommendation model is used to select preferred fused features;
    the generation part is configured to generate target multimedia information based on the target item information and the target content information.
  13. The apparatus according to claim 12, wherein the acquisition part is further configured to: perform feature extraction on the item information to obtain the item features corresponding to the item dimension; identify the content information to obtain content information corresponding to content multi-modal types, wherein the content multi-modal types include at least two modalities among text information, image information and image sequence information; and perform feature extraction on the content information corresponding to the content multi-modal types to obtain the content features corresponding to the content dimension.
  14. The apparatus according to claim 13, wherein the apparatus for generating multimedia information further comprises a determination part;
    the acquisition part is further configured to: in a case where the content multi-modal type is a text type, perform feature extraction on the text information through a first encoding method to obtain text features; and in a case where the content multi-modal type is an image type or an image sequence type, perform feature extraction on the image information and the image sequence information respectively through a second encoding method to obtain image features and behavioral features;
    the determination part is configured to determine the content features corresponding to the content dimension according to at least one of the text features, the image features and the behavioral features.
  15. The apparatus according to claim 14, wherein the acquisition part is further configured to: in the case where the content multi-modal type is the text type, perform feature extraction on the text information to obtain initial text features, wherein the initial text features include semantic expression information and word information; and encode the initial text features through the first encoding method to obtain the text features.
  16. The apparatus according to claim 14, wherein the acquisition part is further configured to: in the case where the content multi-modal type is the image type, perform feature extraction on the image information to obtain initial image features, wherein the initial image features include scene information, content information and style information; in the case where the content multi-modal type is the image sequence type, perform feature extraction on the image sequence information to obtain initial behavioral features, wherein the initial behavioral features include subject target information and key frame information; and encode the initial image features and the initial behavioral features respectively through the second encoding method to obtain the image features and the behavioral features.
  17. The apparatus according to claim 12, wherein the acquisition part is further configured to: perform collaborative processing on the item features and the content features to obtain first item features and first content features of the same probability distribution, wherein the first item features include multiple first sub-item features and the first content features include multiple first sub-content features; randomly combine the multiple first sub-item features to obtain multiple item combination features; randomly combine the multiple first sub-content features to obtain multiple content combination features, wherein the content combination features include content features corresponding to at least two content multi-modal types; and fuse the multiple item combination features and the multiple content combination features to obtain the multiple groups of fused features.
  18. The apparatus according to any one of claims 12 to 17, wherein the acquisition part is further configured to input the multiple groups of fused features into the preset recommendation model for estimation to obtain a first estimated value corresponding to each of the multiple groups of fused features;
    the selection part is further configured to select, based on the multiple first estimated values, a group of fused features with the highest estimated value from the multiple groups of fused features;
    the acquisition part is further configured to decode the group of fused features to obtain the target item information and the target content information.
  19. The apparatus according to any one of claims 12 to 17, wherein the acquisition part is further configured to perform layout generation on the target item information and the target content information through a preset layout generation model to obtain multiple layouts, wherein the preset layout generation model is used to adjust the layout according to the item and the content;
    the determination part is further configured to evaluate the multiple layouts through an evaluation model to determine candidate layouts, wherein the evaluation model is used to evaluate and screen layouts;
    the selection part is further configured to select an optimal layout from the candidate layouts through a layout optimization model;
    the generation part is further configured to generate the target multimedia information based on the optimal layout, the target item information and the target content information.
  20. The apparatus according to claim 19, wherein the generation part is further configured to generate an initialization layout corresponding to the target item information and the target content information through the preset layout generation model, wherein the preset layout generation model includes constraints on the stacking order of image layers and on the text size range in the text information;
    the determination part is further configured to adjust the initialization layout through adjustment rules to determine the multiple layouts, wherein the adjustment rules are obtained through continuous training with an object's preference degree serving as an incentive.
  21. The apparatus according to claim 19, wherein the acquisition part is further configured to, before the multiple layouts are evaluated through the evaluation model to determine the candidate layouts, obtain historical target multimedia information, and identify the historical target multimedia information to obtain historical layouts, wherein the historical layouts include positive sample data and negative sample data;
    the determination part is further configured to train an initial evaluation model with the positive sample data and the negative sample data to determine the evaluation model.
  22. The apparatus according to claim 19, wherein the acquisition part is further configured to evaluate the multiple layouts through the evaluation model to obtain an evaluation result corresponding to each of the multiple layouts;
    the determination part is further configured to, in a case where an evaluation result is characterized as successful, take the layout corresponding thereto as a candidate layout.
  23. An apparatus for generating multimedia information, comprising:
    a memory configured to store executable instructions;
    a processor configured to implement, when executing the executable instructions stored in the memory, the method for generating multimedia information according to any one of claims 1 to 11.
  24. A computer-readable storage medium storing executable instructions which, when executed, are used to cause a processor to perform the method for generating multimedia information according to any one of claims 1 to 11.
PCT/CN2023/118512 2022-09-19 2023-09-13 Multimedia information generation method and apparatus, and computer-readable storage medium WO2024061073A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211139046.2 2022-09-19
CN202211139046.2A CN117786193A (en) 2022-09-19 2022-09-19 Method and device for generating multimedia information and computer readable storage medium

Publications (1)

Publication Number Publication Date
WO2024061073A1 true WO2024061073A1 (en) 2024-03-28

Family

ID=90383903

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/118512 WO2024061073A1 (en) 2022-09-19 2023-09-13 Multimedia information generation method and apparatus, and computer-readable storage medium

Country Status (2)

Country Link
CN (1) CN117786193A (en)
WO (1) WO2024061073A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190147366A1 (en) * 2017-11-13 2019-05-16 International Business Machines Corporation Intelligent Recommendations Implemented by Modelling User Profile Through Deep Learning of Multimodal User Data
CN110489582A (en) * 2019-08-19 2019-11-22 腾讯科技(深圳)有限公司 Personalization shows the generation method and device, electronic equipment of image
CN112131848A (en) * 2019-06-25 2020-12-25 北京沃东天骏信息技术有限公司 Method and device for generating document information, storage medium and electronic equipment
CN113570416A (en) * 2021-07-30 2021-10-29 北京达佳互联信息技术有限公司 Method and device for determining delivered content, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN117786193A (en) 2024-03-29

Similar Documents

Publication Publication Date Title
CN111382309B (en) Short video recommendation method based on graph model, intelligent terminal and storage medium
TW201915790A (en) Generating document for a point of interest
CN110554782B (en) Expression input image synthesis method and system
WO2021139415A1 (en) Data processing method and apparatus, computer readable storage medium, and electronic device
CN109783539A (en) Usage mining and its model building method, device and computer equipment
CN116468460B (en) Consumer finance customer image recognition system and method based on artificial intelligence
CN113051468B (en) Movie recommendation method and system based on knowledge graph and reinforcement learning
Wang A survey of online advertising click-through rate prediction models
Wu et al. Product design award prediction modeling: Design visual aesthetic quality assessment via DCNNs
Wang et al. Multifunctional product marketing using social media based on the variable-scale clustering
Ahamed et al. A recommender system based on deep neural network and matrix factorization for collaborative filtering
CN115238191A (en) Object recommendation method and device
CN115329215A (en) Recommendation method and system based on self-adaptive dynamic knowledge graph in heterogeneous network
CN114862506A (en) Financial product recommendation method based on deep reinforcement learning
Gelli et al. Learning subjective attributes of images from auxiliary sources
CN113610610A (en) Session recommendation method and system based on graph neural network and comment similarity
CN113344648A (en) Advertisement recommendation method and system based on machine learning
CN116823321B (en) Method and system for analyzing economic management data of electric business
CN114817692A (en) Method, device and equipment for determining recommended object and computer storage medium
CN117251622A (en) Method, device, computer equipment and storage medium for recommending objects
WO2024061073A1 (en) Multimedia information generation method and apparatus, and computer-readable storage medium
Jabeen The use of AI in marketing: Its impact and future
Hanafi et al. Word Sequential Using Deep LSTM and Matrix Factorization to Handle Rating Sparse Data for E‐Commerce Recommender System
Low et al. Recent developments in recommender systems
CN114090848A (en) Data recommendation and classification method, feature fusion model and electronic equipment

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23867362

Country of ref document: EP

Kind code of ref document: A1