CN114090848A - Data recommendation and classification method, feature fusion model and electronic equipment - Google Patents

Data recommendation and classification method, feature fusion model and electronic equipment

Info

Publication number
CN114090848A
CN114090848A (application CN202111243283.9A)
Authority
CN
China
Prior art keywords
data
node
feature
target
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111243283.9A
Other languages
Chinese (zh)
Inventor
杨粟森
雷陈奕
王国鑫
唐海红
Current Assignee
Alibaba China Co Ltd
Original Assignee
Alibaba China Co Ltd
Priority date
Filing date
Publication date
Application filed by Alibaba China Co Ltd filed Critical Alibaba China Co Ltd
Priority to CN202111243283.9A
Publication of CN114090848A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/906 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Abstract

Embodiments of the present application provide a data recommendation and classification method, a feature fusion model, and an electronic device. The data recommendation method comprises the following steps: determining target data; determining, among a plurality of data items and according to association information between the data items, at least one associated data item related to the target data; fusing the feature information of the target data with the feature information of the at least one associated data item to obtain a fusion feature of the target data; and recommending at least one recommendation data item to the user based on the fusion feature of the target data. In this technical solution, because the fusion feature of the target data incorporates the feature information of other associated data, recommendations made from it are more diverse and also incorporate posterior information.

Description

Data recommendation and classification method, feature fusion model and electronic equipment
Technical Field
The application relates to the technical field of computers, in particular to a data recommendation and classification method, a feature fusion model and electronic equipment.
Background
Data recommendation, such as text, video, and music recommendation, is now widely used. Currently, a recommendation system typically analyzes user preferences from the user's historical data and then recommends items based on those preferences. Alternatively, based on the text the user is currently browsing, the video being watched, or the music being played, the system recommends related text, videos, or music.
In the prior art, such schemes only recommend data similar to what the user is already consuming, so the recommendations lack diversity.
Disclosure of Invention
The present application provides a data recommendation and classification method, a feature fusion model, and an electronic device that solve, or at least partially solve, the above problems.
In one embodiment of the present application, a data recommendation method is provided. The method comprises the following steps:
determining target data;
determining, among a plurality of data items, at least one associated data item related to the target data according to association information between the data items;
fusing the feature information of the target data and the feature information of the at least one piece of associated data to obtain a fusion feature of the target data;
and recommending at least one piece of recommendation data for the user based on the fusion characteristics of the target data.
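The four claimed steps can be sketched end to end. Everything below (the function names, the weighted-average fusion, the dot-product scorer) is an illustrative assumption, not the patent's actual algorithm:

```python
# Hypothetical sketch of the four-step flow: determine target, find
# associated data, fuse features, recommend. All names are illustrative.

def find_associated(target_id, assoc_info, k=3):
    """Return up to k data ids associated with the target, strongest first."""
    neighbors = assoc_info.get(target_id, {})  # {data_id: association weight}
    return sorted(neighbors, key=neighbors.get, reverse=True)[:k]

def fuse(target_feat, assoc_feats, alpha=0.5):
    """Blend the target feature with the mean of the associated features."""
    if not assoc_feats:
        return list(target_feat)
    dim = len(target_feat)
    mean = [sum(f[i] for f in assoc_feats) / len(assoc_feats) for i in range(dim)]
    return [alpha * t + (1 - alpha) * m for t, m in zip(target_feat, mean)]

def recommend(fused, candidates, n=2):
    """Rank candidate items by dot-product similarity to the fused feature."""
    score = lambda c: sum(a * b for a, b in zip(fused, candidates[c]))
    return sorted(candidates, key=score, reverse=True)[:n]
```

Because the fused vector is pulled toward the associated data's features, the top-ranked candidates need not resemble the target alone, which is the diversity effect the claims describe.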
In another embodiment of the present application, a data recommendation method is also provided. The method comprises the following steps:
responding to the operation of a user on the interactive interface, and outputting first multimedia data;
determining at least one second multimedia data related to the first multimedia data;
fusing the feature information of the first multimedia data and the feature information of the at least one second multimedia data to obtain a fusion feature of the first multimedia data;
determining recommended data based on the fusion characteristics of the first multimedia data;
and outputting the recommended data for the user when the output condition is met.
In yet another embodiment of the present application, a data classification method is also provided. The method comprises the following steps:
determining at least one second multimedia data related to the first multimedia data;
fusing the feature information of the first multimedia data and the feature information of the at least one second multimedia data to obtain a fusion feature of the first multimedia data;
and determining the classification of the first multimedia data according to the fusion characteristics of the first multimedia data.
In yet another embodiment of the present application, a relationship-graph-based feature fusion model is provided. The relationship-graph-based feature fusion model comprises the following modules:
a node sampling module, configured to sample, for a first node in the relationship graph, nodes in the relationship graph to obtain at least one second node related to the first node;
a node feature information determining module, configured to assign corresponding embedded features to the first node and the at least one second node according to the multi-modal information of the first node, the multi-modal information of the at least one second node, and the edge information between the first node and the at least one second node; to determine the feature information of the first node according to the multi-modal information of the first node and its corresponding embedded feature; and to determine the feature information of the at least one second node according to the multi-modal information of the at least one second node and its corresponding embedded feature;
the feature fusion module is used for inputting the feature information of the first node and the feature information of the at least one second node into a feature fusion model and executing the feature fusion model to obtain the fusion feature of the first node;
an optimization module, configured to execute a graph reconstruction task and a masked-node feature reconstruction task according to the fusion feature of a first sample node to obtain an execution result for each task, and to optimize the parameters in the feature fusion model according to the execution result for each task.
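As a rough intuition for the feature fusion module only, the sketch below lets the first node's feature attend over itself and its neighbors with one softmax-weighted step; this attention form is an assumption, not the patent's stated architecture:

```python
import math

# Illustrative fusion step: the first node's feature attends over its own
# and its neighbors' features via one softmax-weighted sum. This is an
# assumed simplification, not the patent's exact module.

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def fuse_node(first_feat, neighbor_feats):
    feats = [first_feat] + neighbor_feats
    # attention score: dot product of the first node with every node
    scores = [sum(a * b for a, b in zip(first_feat, f)) for f in feats]
    weights = softmax(scores)
    dim = len(first_feat)
    return [sum(w * f[i] for w, f in zip(weights, feats)) for i in range(dim)]
```

With no neighbors the fusion degenerates to the node's own feature, which matches the idea that associated features only add to, and never replace, the target's own representation.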
In yet another embodiment of the present application, an electronic device is also provided. The electronic device comprises a memory and a processor, wherein the memory stores a program, and the processor, coupled with the memory, executes the program stored in the memory to implement the steps of the data recommendation methods described above, or the steps of the short video recommendation method.
An embodiment of the present application also provides a computer-readable storage medium storing a computer program which, when executed by a computer, implements the steps of the data recommendation methods described above, or the steps of the short video recommendation method.
An embodiment of the present application also provides a computer program product. The computer program product includes a computer program that, when executed by a computer, causes the computer to implement the steps of the data recommendation methods described above, or the steps of the short video recommendation method.
According to the technical solutions provided by the embodiments of the present application, the feature information of the target data is fused with the feature information of at least one associated data item to obtain the fusion feature of the target data, and data is then recommended to the user based on this fusion feature, which incorporates the feature information of other associated data; recommendation diversity is thereby improved, and posterior information is fused in.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings required for describing them are briefly introduced below. Obviously, the drawings in the following description show some embodiments of the present application, and those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a flow chart illustrating a method for providing data recommendation according to an embodiment of the present application;
fig. 2 is a schematic diagram illustrating that a user clicks a video C to perform full-screen playing and displays a recommended video through a corresponding operation in the present embodiment;
FIG. 3 shows an example of a relationship diagram and a node sampling diagram mentioned in the embodiments of the present application;
FIG. 4 is a schematic diagram illustrating a node characteristic information determination process in an embodiment of the present application;
FIG. 5 is a diagram illustrating an example of an encoding module mentioned in an embodiment of the present application;
FIG. 6 shows a schematic diagram of one Encoder layer of FIG. 5;
FIG. 7 shows a schematic diagram of a training process of a PMGT;
FIG. 8 is a flow chart illustrating a data recommendation method according to another embodiment of the present application;
FIG. 9a is a flow chart illustrating a short video recommendation method according to another embodiment of the present application;
FIG. 9b is a flow chart illustrating a data classification method according to an embodiment of the present application;
FIG. 10 is a schematic diagram illustrating a structure of a feature fusion model provided by an embodiment of the present application;
FIG. 11 is a flowchart illustrating a training method of a feature fusion model according to an embodiment of the present disclosure;
FIG. 12 is a schematic structural diagram of a data recommendation system provided by an embodiment of the present application;
FIG. 13 is a schematic structural diagram of a data recommendation device according to an embodiment of the present application;
FIG. 14 is a schematic structural diagram of a data recommendation device according to another embodiment of the present application;
fig. 15 is a schematic structural diagram of a short video recommendation apparatus according to an embodiment of the present application;
fig. 16 shows a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Before describing the contents of the embodiments of the present application, a brief description of terms or concepts that will appear below will be provided to assist in understanding.
Multimedia (Multimedia) data refers to data that combines multiple media types (e.g., text, graphics, images, sound, video, etc.).
Modality (Modality/Modal): each source or form of information may be called a modality. For example, human touch, hearing, vision, and smell; information media such as voice, video, and text; and a wide variety of sensors, such as radar, infrared, and accelerometers, may each be referred to as a modality. The notion can also be defined very broadly: two different languages may be considered two modalities, and even data sets collected under two different situations may be considered two modalities, and so on. Modality herein carries two meanings: first, a feature representation; second, a media type. Multimodal information refers to data comprising multiple feature representations or multiple media types. For example, the multimodal information of a short video may include, but is not limited to: visual information (e.g., frame information of the video), textual information (e.g., the video title), and audio information (e.g., the background music of the video). The multimodal information of a web page may include text information, picture information (illustrations in the text), and so on.
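A minimal illustration of multi-modal information, with hypothetical field names and toy feature vectors; the per-modality features are simply concatenated here, which is only one possible combination scheme:

```python
# Hypothetical multi-modal record for a short video; field names and
# values are illustrative, not from the patent.

short_video = {
    "visual": [0.1, 0.4],  # e.g. pooled frame features
    "text":   [0.7],       # e.g. title embedding
    "audio":  [0.2, 0.0],  # e.g. background-music features
}

def concat_modalities(info, order=("visual", "text", "audio")):
    """Combine per-modality features into a single vector by concatenation."""
    vec = []
    for m in order:
        vec.extend(info.get(m, []))
    return vec
```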
Relationship graph (also called a homogeneous graph): nodes in the relationship graph represent data items (short videos, web pages, music, articles, and the like), an edge indicates that two nodes are related, the weight of the edge represents the strength of the relation, and each node is characterized by multiple modal features.
Pre-training (pre-training): the process of training a model with training samples and training tasks, and then saving the trained model, or the representations it outputs, for use in downstream tasks.
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings. Some of the flows described in the specification, claims, and figures of the present application include operations that appear in a particular order; these operations may be performed out of that order or in parallel. Sequence numbers such as 101 and 102 merely distinguish the operations and do not by themselves indicate any order of execution. The flows may also include more or fewer operations, performed sequentially or in parallel. Descriptions such as "first" and "second" herein distinguish different modules, models, devices, and so on; they neither represent a sequential order nor require the "first" and "second" items to be of different types. In addition, the embodiments described below are only a part of the embodiments of the present application, not all of them. All other embodiments obtained by a person skilled in the art from these embodiments without creative effort shall fall within the protection scope of the present application.
At present, short video applications are popular. Users can upload short videos they have made through a client application and watch other users' short videos through it. The content of a short video may be a funny clip, a product recommendation, a knowledge-sharing video, and the like. While the user watches a short video, the server side can recommend other short videos, so that a recommended short video plays automatically after the current one finishes. As long as the user does not close the playing interface, the client application keeps playing the short videos the server recommends. However, in the prior art, the next short video is recommended based only on the currently played one, so only similar short videos are recommended. For example, a user currently watching a colored-clay handicraft video will very likely be recommended more handicraft videos or colored-clay product videos, when the user actually wants to watch something else. The user is then not interested in the recommended short videos, which shows that the recommendation diversity of the prior art is low.
To improve recommendation diversity, the present application provides the technical solutions in the following embodiments. Although these solutions are introduced through short video recommendation, they can be applied to short video recommendation, commodity recommendation, text recommendation, music recommendation, picture recommendation, and the like. The core of the technical solutions provided by the following embodiments is: besides the features of the target data (such as the short video the user is currently watching or the music currently playing), the features of other data related to the target data (such as short videos the user watched historically, music played previously, or short videos watched by other users who historically watched the same short videos) are fused in, and the fused result serves as the fusion feature of the target data. The fusion feature can be used for data recommendation and also for data classification (such as commodity classification, short video classification, text classification, or music classification).
Fig. 1 shows a flowchart of a data recommendation method according to an embodiment of the present application. The execution subject of the method provided in this embodiment may be a server or a client, which this embodiment does not specifically limit. The server (e.g., server 22 in the embodiment shown in fig. 12) may be a physical server, a server cluster, or a virtual or cloud server built on a server or cluster, and so on. Of course, some steps of the method may be executed by a server and others by a client (e.g., the client corresponding to reference numeral 21 in the embodiment shown in fig. 12). For example, steps 101 and 102 below may be executed by a client and steps 103 and 104 by a server; or step 101 by a client and steps 102 to 104 by a server; or steps 101 to 103 by a client and step 104 by a server; and so on. Specifically, the data recommendation method includes:
101. determining target data;
102. determining at least one associated data related to the target data in the plurality of data according to the associated information among the data;
103. fusing the feature information of the target data and the feature information of the at least one piece of associated data to obtain a fusion feature of the target data;
104. and recommending at least one piece of recommendation data for the user based on the fusion characteristics of the target data.
In the foregoing 101, the target data may be any one of the text, images, video, and audio browsed by the user through the client, or a combination of several of them, which this embodiment does not limit. For example, in the scenario shown in fig. 2, a user opens a data page of an application in which a plurality of short videos are displayed. The user clicks one of the short videos, and the interface jumps to a playing interface of that short video, for example playing video C full screen. At this time, the short video C being played full screen is determined to be the target data in this embodiment.
In 102, the plurality of data may be determined based on historical data associated with the user. Such as the user historically browsing, playing, commenting, collecting, adding to a shopping cart or adding to the data in a song list, etc. The plurality of data in the embodiment may include all data that the user has historically browsed, played, commented, collected, added to a shopping cart or added to a song list, etc.; or the plurality of data may include only a portion of data selected from all data historically viewed, played, commented, collected, added to a shopping cart, or added to a song list by the user. In this embodiment, the selection rule is not particularly limited.
Further, the plurality of data may also be determined based on the historical data of other users with whom the user has an association. Users with an association may be: users who have browsed the same video or web page, users who have played the same music, users who have made the same or similar evaluations of the same data (such as a video, web page, music, or picture), users who have collected the same video or goods, and so on, which this embodiment does not limit. For example, the plurality of data may be determined from the historical data of at least one associated user: a part of all the data that the at least one associated user has historically browsed, played, commented on, collected, etc. is extracted as the plurality of data.
Of course, the plurality of data may also be determined by integrating historical data related to the user and historical data of at least one associated user associated with the user, which is not limited in this embodiment.
Association information exists between two data items that have an association relationship. In one implementable example, two data items are associated when they share the same tag; the association information between them may then include the shared tags and their number. Two data items may also have a user-behavior-sequence relationship: for example, a user watches a first short video and then clicks and watches a second short video, which shows that the two short videos are related. The association information may then further include a relation identifier representing that the two items were played in succession. As another example, the same user behavior may exist for both items: the user collected a first audio and also collected a second audio, indicating that the two audios are related. The association information may then further include a relation identifier and a user-behavior identifier representing that the two items share the same user behavior; and so on. Associations from other angles may be further mined, which are not listed here; in specific implementations this can be determined according to the actual scene.
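One of the association rules above (shared tags) can be sketched as follows; the function name, the `min_common` threshold, and the record layout are illustrative assumptions:

```python
# Sketch of the shared-tag association rule from the text; `min_common`
# and the returned record layout are illustrative assumptions.

def tag_association(tags_a, tags_b, min_common=1):
    """Report whether two items are associated via common tags."""
    common = set(tags_a) & set(tags_b)
    if len(common) >= min_common:
        return {"associated": True,
                "common_tags": sorted(common),
                "count": len(common)}
    return {"associated": False, "common_tags": [], "count": 0}
```

The other rules in the text (consecutive playback, shared collection behavior) would add further identifiers to the same kind of association record.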
In the above 103, different types of data have different corresponding feature information. For example, if the target data includes only text data, its feature information may include only text features extracted from the text data. If the target data includes only audio data, its feature information may include only audio features extracted from the audio data. Likewise, if the target data includes only image data, its feature information may include only image features extracted from the image data. However, when the target data is multimedia data, such as a short video or a movie, its information may include, but is not limited to: text information (e.g., a short video title or movie title), video information, audio information, and so on. For multimedia data, the feature information needs to be obtained by fusing the features extracted from each modality's information.
Similarly, when the associated data only contains data of one modality, the feature information is the feature extracted from the data of the modality; however, when the associated data includes data of multiple modalities, the feature information of the associated data needs to be obtained by fusing features extracted from the information of the modalities.
It can be seen that, when the target data and the associated data include data of multiple modalities, the "fusion" in step 103 of this embodiment comprises two steps: one is the fusion of the multi-modal features of the target data and, likewise, of each associated data item; the other is the fusion of the feature information of the target data with the feature information of the at least one associated data item.
In this embodiment, the feature information of the at least one associated data item is merged into the feature information of the target data, so that the target data carries both its own features and the features of the associated data. Posterior information is thus merged in, which better serves data recommendation.
Here, it should be noted that: for better fusion, features representing the relationship between the target data and each associated data may be fused in the above fusion step, so that the target data not only has its own features and features with associated data, but also implicitly has features representing the relationship between the target data and each associated data, such as embedded features based on location and role. This section will be described in more detail below, and reference will be made to the corresponding section below.
As can be seen from the above, the fusion feature of the target data in step 104 may have a feature of associated data in addition to its own feature, and may further have a feature of representing the relationship between the target data and each associated data. In step 104, data recommendation is performed based on the fusion characteristics, so that the recommendation diversity can be obviously and effectively improved, and not only data similar to the target data characteristics can be recommended, but also data similar to the associated data can be recommended.
In specific implementations, the fusion feature of the target data can be used as the input of a recommendation model. The recommendation model has two stages: a recall stage, in which at least one candidate data item can be recalled directly based on the fusion feature of the target data; and a ranking stage, in which a ranking operation is executed based on the fusion feature of the target data and the features of each candidate. Further, the recommendation model may comprise two models corresponding to the two stages, i.e., a recall model for the recall stage and a ranking model for the ranking stage. The fusion feature of the target data is input into the recall model, which is executed to recall at least one candidate data item; then the fusion feature of the target data and the features of the at least one candidate are input into the ranking model, which is executed to obtain a ranking result. The top N candidates in the ranking result can be taken as the data recommended to the user, where N may be 1, 2, 5, 10, or more, which this embodiment does not limit.
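The recall-then-rank flow can be sketched as below; the cosine scorer, the threshold value, and the function names are assumptions made for illustration:

```python
import math

# Two-stage sketch: recall filters a candidate pool by similarity to the
# fused feature, then ranking orders the survivors. The cosine scorer
# and threshold are illustrative assumptions.

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def recall(fused, pool, threshold=0.3):
    """Recall stage: keep candidates whose similarity clears the threshold."""
    return [c for c, feat in pool.items() if cosine(fused, feat) >= threshold]

def rank(fused, pool, candidates, top_n=2):
    """Ranking stage: order recalled candidates by similarity, best first."""
    return sorted(candidates, key=lambda c: cosine(fused, pool[c]),
                  reverse=True)[:top_n]
```

In a production system the two stages would typically be separate learned models, as the text notes; the shared input in both cases is the fusion feature.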
In summary, in the technical solution provided by this embodiment, at least one associated data item related to the target data is determined according to the association information between data items; the feature information of the target data is then fused with the feature information of the at least one associated data item, the result is taken as the fusion feature of the target data, and data is recommended to the user based on it. The feature consulted during recommendation is therefore not that of a single target data item: the fusion feature carries the features of the target data, the features of the associated data, and features of the relationships between them, so the recommendation result may include data similar to the target data as well as data similar to the associated data, and so on.
Referring to an implementation solution shown in fig. 3, after a plurality of data are determined, a relationship graph may be constructed according to the association information between the data, and then node sampling is performed based on the relationship graph to sample the at least one association data. For example, in this embodiment, the step 102 "determining at least one related data related to the target data in the plurality of data according to the related information between the data" may be implemented by:
1021. constructing a relational graph according to the associated information among the data; the relationship graph comprises a plurality of nodes and side information reflecting the relationship among the nodes; different nodes correspond to different data, and the corresponding node of the target data in the relation graph is a target node;
1022. for the target node, sampling nodes in the relation graph to sample at least one associated node related to the target node; and the data corresponding to the relevant node is relevant data relevant to the target data.
Referring to the relationship graph shown in FIG. 3, assume that node h0 in the graph corresponds to the target data in this embodiment, and h1~h16 correspond to the plurality of data. An edge in the relationship graph, i.e. a line connecting two nodes, indicates that a relationship exists between the two nodes, and the edge information may comprise the association information between the data corresponding to the two nodes. Taking the example that each node in the graph corresponds to a different short video, the short videos carry label information; if the two short videos corresponding to two nodes have a certain number (e.g., 1, 2, 5 or more) of the same labels, the two nodes are associated with each other, and an "edge" is established between them. The edge information may be the weight of the edge, and the weight is related to the number of common labels of the two short videos corresponding to the two nodes. The feature of a node is the feature of the short video corresponding to that node, such as the multi-modal features of the short video. What needs to be added here is: in order to alleviate the problem of large variance in the number of labels, a scaling operation may be added in this embodiment.
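The label-based edge construction described above can be sketched as follows. The tag sets, the function name, and the rule "edge weight = number of shared labels" are illustrative assumptions consistent with the description (the variance-scaling step is omitted):

```python
from itertools import combinations

def build_relation_graph(tags_by_data, min_common=1):
    """Link two data items with an edge when they share at least
    `min_common` labels; the edge information is the edge weight,
    taken here as the number of shared labels."""
    edges = {}
    for a, b in combinations(sorted(tags_by_data), 2):
        common = len(set(tags_by_data[a]) & set(tags_by_data[b]))
        if common >= min_common:
            edges[(a, b)] = common
    return edges

videos = {
    "v1": {"cat", "funny", "pets"},
    "v2": {"cat", "pets"},
    "v3": {"travel"},
}
graph = build_relation_graph(videos)  # v1 and v2 share 2 labels; v3 is isolated
```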
The above 1022 may be implemented by using a corresponding sampling algorithm. An example of a sampling algorithm that "samples, for the target node, nodes in the relationship graph to sample at least one associated node related to the target node" includes:
S1, obtaining the target number and the sampling number;

S2, in the first sampling iteration, taking the target node as the sampling origin, and sampling at least one neighbor node adjacent to the origin in the relationship graph;

S3, judging whether the number of completed sampling iterations reaches the target number;

S4, when the number of sampling iterations is not greater than the target number, entering the next sampling iteration; in the next sampling iteration, taking any neighbor node in the at least one neighbor node as the sampling origin, and sampling at least one neighbor node adjacent to that origin in the relationship graph;

S5, when the number of sampling iterations is greater than the target number, determining the sampling number of neighbor nodes as the associated nodes from the neighbor nodes sampled in the target number of sampling iterations.
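Steps S1-S5 can be sketched as the following simplified iteration. For illustration, every frontier node is expanded deterministically; the edge-weight-proportional random choice and the partial-frontier sampling described in this embodiment are omitted:

```python
def sample_neighbors(adj, target, depth, sample_size):
    """Iteratively expand neighbor sets up to `depth` hops from `target`
    (a simplified sketch of S1-S5: `depth` plays the role of the target
    number, `sample_size` the sampling number)."""
    frontier = [target]
    collected = []
    for _ in range(depth):
        nxt = [n for origin in frontier for n in adj.get(origin, [])]
        collected.extend(nxt)
        frontier = nxt
    # keep at most `sample_size` distinct neighbors, excluding the target itself
    uniq = list(dict.fromkeys(n for n in collected if n != target))
    return uniq[:sample_size]

adj = {"h0": ["h1", "h2"], "h1": ["h0", "h3"], "h2": ["h0"]}
nodes = sample_neighbors(adj, "h0", depth=2, sample_size=4)
```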
This embodiment provides an efficient parallel sampling algorithm. Specifically, starting from the target node, fixed-size node sets are iteratively sampled up to a fixed depth, where the probability that a node is sampled is proportional to the weight of its edge. For ease of understanding, the above process is explained below using the relationship graph shown in fig. 3 as an example.
As shown in the relationship graph of fig. 3, assume that the target node is h0, the target number (i.e., the sampling depth) is 2, and the sampling number is 4.
Step 11, first sampling iteration, where B = 1; the sampling origin is h0.

Step 12, obtain the neighbor nodes of the sampling origin to obtain a neighbor node set N(hi), where hi is the sampling origin. For example, in fig. 3, when the sampling origin is h0, the neighbor nodes of h0 include h8, h9, h1, h4 and h14, giving the neighbor node set N(h0) = {h8, h9, h1, h4, h14}.
Step 13, next sampling iteration, where B = B + 1.

Step 14, judge that B is not greater than 2 and enter the second sampling iteration, replacing the target node with each node in N(h0): each such node is used as a sampling origin, and at least one neighbor node adjacent to each sampling origin is sampled in the relationship graph. Alternatively, part of the nodes in N(h0) can be randomly sampled first, e.g., h9, h1 and h4 are sampled, the three nodes are used as sampling origins, and at least one neighbor node adjacent to each sampling origin is sampled in the relationship graph to obtain the corresponding neighbor node sets. Return to step 13 until B is greater than the target number (i.e., 2).
Taking the case of replacing the target node with each node in N(h0) as an example, the neighbor node sets obtained are:

for target node h8, the neighbor node set N(h8);

for target node h9, the neighbor node set N(h9);

for target node h1, the neighbor node set N(h1);

for target node h4, the neighbor node set N(h4).
Step 15, when B is greater than 2, determine the sampling number (e.g., 4) of neighbor nodes as the associated nodes from all the neighbor nodes sampled in the two sampling iterations.
That is, from N(h0) and the neighbor node sets obtained in the second sampling iteration, 4 neighbor nodes are sampled as the associated nodes, e.g., h1, h2, h3 and h4.
As can be seen from the above example, duplicate nodes inevitably occur in the acquired neighbor node sets; for example, h4, h6 and h14 are each sampled twice. In an implementable solution, in step S5 and step 15 above, the associated nodes may be determined based on the number of occurrences (or frequency of occurrence) of each neighbor node and the distance between each neighbor node and the target node. That is, in this embodiment, step S5 "determining the sampling number of neighbor nodes as associated nodes from the neighbor nodes sampled in the target number of sampling iterations" may include:
s51, acquiring the occurrence frequency of each neighbor node in the neighbor node set acquired in each sampling iteration;
s52, acquiring the distance between each neighbor node in the neighbor node set acquired in each sampling iteration and the target node;
S53, according to the number of occurrences of each neighbor node and the distance between each neighbor node and the target node, selecting the sampling number of neighbor nodes as the associated nodes from the neighbor nodes sampled in the target number of sampling iterations.
More specifically, the product of the number of occurrences (or frequency of occurrence) of a neighbor node and the distance between that neighbor node and the target node may be calculated; this product is referred to as the importance. All the neighbor nodes sampled in the target number of sampling iterations are then sorted according to the importance of each neighbor node. For example, the top-k nodes are taken as the associated nodes of the target node, where k equals the sampling number.
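The importance score and top-k selection can be sketched as follows, taking the product literally as defined above (occurrence count × hop distance); breaking ties by node name is an added assumption:

```python
from collections import Counter

def rank_neighbors(sampled, hops, k):
    """Score each sampled neighbor as occurrence count x hop distance
    (the "importance" above), sort descending, keep the top-k."""
    counts = Counter(sampled)
    importance = {n: counts[n] * hops[n] for n in counts}
    ranked = sorted(importance, key=lambda n: (-importance[n], n))
    return ranked[:k]

sampled = ["h8", "h4", "h4", "h2", "h9"]     # h4 was sampled twice
hops = {"h8": 1, "h4": 1, "h2": 2, "h9": 1}  # hop distance from the target node
top = rank_neighbors(sampled, hops, 3)       # h4: 2x1, h2: 1x2, h8/h9: 1x1
```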
Again taking the example shown in fig. 3, all the neighbor nodes sampled in the target number of sampling iterations include: h8, h9, h1, h4, h14, h7, h6, h3, h10, h2, h15, h16, h14, h13, h12. Among them:

h8 occurs 1 time and is 1 hop from h0, so the importance of h8 is 1 × 1;

h4 occurs 2 times and is 1 hop from h0, so the importance of h4 is 2 × 1;

h2 occurs 1 time and is 2 hops from h0, so the importance of h2 is 1 × 2; and so on.

Similarly, the importance scores of the other neighbor nodes can be calculated in turn.
It should be noted that, in the foregoing, each node corresponds to one piece of data. The associated node sampled by the method corresponds to associated data.
That is, based on the above, when a plurality of associated data related to the target data are sampled, the method provided by this embodiment may further include the following steps:
s6, determining the importance of a plurality of related data;
s7, configuring corresponding embedding characteristics for the plurality of associated data according to the importance degrees;
s8, configuring embedded features for the target data;
s9, acquiring the characteristics of at least one mode of the target data and the characteristics of at least one mode of any one of the plurality of associated data;
s10, determining feature information of the target data according to the features of at least one mode of the target data and the embedded features corresponding to the target data;
and S11, determining the characteristic information of the associated data according to the characteristics of at least one mode of any associated data in the associated data and the embedded characteristics corresponding to the associated data.
In the above S6, the calculation of the importance of the associated data can be referred to above.
In S7, the plurality of related data may be sorted according to the importance of each related data. In this embodiment, the embedded features configured for each associated data may include location features and role features. For example, according to the location identifier (such as the first, second, third, etc.) of the associated data in the ranking, the associated data may be configured with the location feature based on the location identifier. The role played by the associated data with respect to the target data is an associated role, not a master role, and thus, the associated data may be configured with a role characteristic based on the "associated role". The role characteristics of the multiple associated data can be the same, all being "associated roles". In specific implementation, the first role identifier (e.g. 1) may be associated with a main role (i.e. a role corresponding to the target data), and the second role identifier (e.g. 0) may be associated with an associated role (i.e. a non-main role).
In S8, in order to facilitate subsequent calculation, the feature information of the target data and the feature information of the multiple pieces of associated data may be spliced into one combined feature in this embodiment. As can be seen from the above, the plurality of related data are sorted according to the importance of each related data, and a data sequence in which the plurality of related data are arranged in the order of the sorting result is obtained. At the time of splicing, target data may be added to the data sequence, which may be arranged at the head or the tail of the queue. In this case, the target data has a location identifier (e.g., first, or last (e.g., 5 th)) in the data sequence, so that the target data is configured with a location characteristic based on the location identifier corresponding to the location of the target data.
In the above S9, for the multi-modal information (including the features of a plurality of modalities), reference can be made to the explanation above. For example, a short video carries rich modal information, including but not limited to: video frame information, title text information and audio frame information, i.e., information of three modalities. Multi-modal information brings several benefits: it can mine the content representation of the data and improve recall relevance, and in scenarios where the user's historical behavior is sparse, it can effectively boost the exposure of cold-start data. Effectively integrating multi-modal information into recommendation scenarios therefore has very high application value.
The multi-modal features mentioned above are extracted from the multi-modal information. Taking a short video as an example, the title features can be extracted from the title information of the short video; the video frame features can be extracted from the video frames of the short video; and the audio frame features can be extracted from the audio information of the short video. The title features extracted from the short video title can be regarded as the features corresponding to a first modality; the video frame features extracted from the short video frames can be regarded as the features corresponding to a second modality; and the audio frame features extracted from the audio frames of the short video can be regarded as the features corresponding to a third modality. In practical implementation, the TF-IDF method or a BERT network can be used to extract the title features from the title information of the short video; the features of each frame of the video can be extracted using an Inception-v4 network, and the visual feature of the short video can then be obtained by averaging all the frame features.
In a specific embodiment, the above mentioned embedding feature configured for the target data and the embedding feature configured for the at least one associated data may include but are not limited to: location features and role features. When the target data in the present embodiment has a multi-modal feature, correspondingly, in step S10 of the present embodiment, "determining feature information of the target data according to the multi-modal feature of the target data and the embedded feature corresponding to the target data" may include:
s1001, determining corresponding weight for the feature of each mode in the multi-mode features of the target data;
s1002, determining content characteristics of the target data according to the multi-modal characteristics of the target data and the weight of each modal characteristic in the multi-modal characteristics;
s1003, aggregating the content characteristics of the target data and the position characteristics and role characteristics corresponding to the target data to obtain the characteristic information of the target data.
In the above step S1001, the influence ratio of the features corresponding to different modalities of the target data on the target data may be calculated first; then, the weight of each modal corresponding feature in the multi-modal features of the target data is determined according to the influence proportion of the features corresponding to different modalities on the target data. The algorithm for calculating the influence weight is not limited in this embodiment, and may be determined by combining the behavior data of the user, or by analyzing the multi-modal features of the target data and the multi-modal features of the at least one piece of associated data to determine the influence weight of the features of different modalities on the target data. Alternatively, in the present embodiment, an attention mechanism (Self-attention) may be used to calculate a score for each modal feature of the target data, the score corresponding to each modal feature may be used as its weight, and the weight may also be determined based on the score corresponding to each modal feature. For the content of attention mechanism, reference may be made to related documents, and this embodiment is not described in detail.
In the above S1002, the content characteristics of the target data may be obtained by weighting and adding the characteristics of all the modalities of the target data.
Based on the content that the above-mentioned embedded features include the location features and the role features, in this embodiment S1003, the content features, the location features, and the role features of the target data may be added and aggregated to obtain the feature information of the target data.
For ease of understanding, the above process is described below with reference to the example shown in fig. 4. FIG. 4 shows the target data as t; the multi-modal features of the target data t include the first modal feature m_t^1, the second modal feature m_t^2 and the third modal feature m_t^3. The weight corresponding to each modal feature can be determined through the attention mechanism, e.g., the first modal feature m_t^1 corresponds to the weight α_t^1, the second modal feature m_t^2 corresponds to the weight α_t^2, and the third modal feature m_t^3 corresponds to the weight α_t^3. The content feature M_t of the target data t is then M_t = α_t^1·m_t^1 + α_t^2·m_t^2 + α_t^3·m_t^3. P_t shown in FIG. 4 is the position feature of the target data t, and R_t is the role feature of the target data t. The feature information of the target data t is h_t = M_t + P_t + R_t.
Similarly, the characteristic information of the associated data can also be determined by the method.
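The aggregation in steps S1001-S1003 can be sketched as follows. Using each modality's mean activation as its raw attention score is a stand-in assumption for the self-attention scoring described above; only the softmax weighting, the weighted sum into the content feature, and the element-wise addition of position and role features follow the text:

```python
import math

def fuse_node_feature(modal_feats, pos_feat, role_feat):
    """Steps S1001-S1003 in miniature: score each modality, softmax the
    scores into weights, weight-sum the modal features into the content
    feature M_t, then add the position feature P_t and role feature R_t."""
    raw = [sum(m) / len(m) for m in modal_feats]     # stand-in attention scores
    mx = max(raw)
    exps = [math.exp(r - mx) for r in raw]
    weights = [e / sum(exps) for e in exps]          # alpha_t^i
    dim = len(modal_feats[0])
    content = [sum(w * m[i] for w, m in zip(weights, modal_feats))
               for i in range(dim)]                  # M_t
    return [c + p + r for c, p, r in zip(content, pos_feat, role_feat)]

# two identical toy modal features -> equal weights, content = [1.0, 1.0]
h_t = fuse_node_feature([[1.0, 1.0], [1.0, 1.0]], [0.5, 0.5], [0.0, 0.0])
```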
Further, in step 103 "fusing the feature information of the target data and the feature information of the at least one piece of associated data to obtain a fused feature of the target data" in this embodiment may specifically include:
1031. determining feature similarity of feature information of the target data and feature information of the at least one piece of associated data;
1032. determining attention weight corresponding to the at least one piece of associated data based on the feature similarity;
1033. and performing fusion coding on the feature information of the target data and the feature information of the at least one piece of associated data according to the attention weight corresponding to the at least one piece of associated data to obtain the fusion feature of the target data.
The method for determining the similarity between the feature information of the two data in step 1031 is not limited in this embodiment, and for example, the feature similarity between the two data may be determined by calculating the distance (e.g., cosine distance) between the feature information of the two data.
In this embodiment, the reason for determining the attention weight corresponding to each associated data based on the feature similarity in step 1032 is that, when features are fused, the attention mechanism focuses on the associated nodes with high attention weight. In general, the associated nodes similar to the target node are attended to more easily, which reduces the diversity of subsequent recommendations. To improve diversity, this embodiment adds an attention mechanism oriented toward diversity, i.e., the attention weight of the associated nodes with low feature similarity is increased. That is, step 1032 in this embodiment may specifically be: arranging the plurality of associated data in descending order according to the feature similarity corresponding to each of the plurality of associated data; and configuring an attention weight for each associated data in that order, wherein the attention weight corresponding to associated data ranked later (less similar) is higher than the attention weight corresponding to associated data ranked earlier.
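One way the descending-similarity ranking with larger weights for later positions might look is sketched below; the linear rank-based weighting is an illustrative assumption, as this embodiment does not fix a concrete formula:

```python
def diversity_attention_weights(similarities):
    """Rank associated data by feature similarity in descending order and
    give later (less similar) positions larger attention weights, so the
    fusion attends more to dissimilar neighbors."""
    n = len(similarities)
    order = sorted(range(n), key=lambda i: -similarities[i])
    total = n * (n + 1) / 2
    weights = [0.0] * n
    for rank, idx in enumerate(order):
        weights[idx] = (rank + 1) / total  # rank 0 (most similar) -> smallest weight
    return weights

w = diversity_attention_weights([0.9, 0.2, 0.5])  # least similar item gets most weight
```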
Further, the solution provided by this embodiment may use a Transformer, a model built mainly on the attention mechanism. For example, as shown in fig. 5, the Transformer may include an encoding module. The encoding module includes a plurality of encoding layers (Encoders), each having 2 sub-layers (as shown in fig. 6). In the example of FIG. 5, the Transformer includes a 6-layer Encoder. As shown in fig. 6, the above-mentioned attention mechanism may be provided in each Encoder layer, or only in some Encoder layers, i.e., as mentioned above, the attention weight of the associated nodes with low feature similarity is increased. The Multi-Head Attention mechanism is composed of a plurality of self-attention mechanisms. As can be seen from FIG. 6, the Encoder block contains a Multi-Head Attention and also an Add & Norm layer, where Add indicates a Residual Connection, used to prevent network degradation, and Norm denotes Layer Normalization, used to normalize the activation values of each layer. The specific structure of the feed-forward network in fig. 6 is not limited in this embodiment, and reference may be made to related literature. In this embodiment, step 103 "fusing the feature information of the target data and the feature information of the at least one piece of associated data to obtain a fused feature of the target data" may be implemented using a Transformer. In implementation, the attention mechanism for improving diversity is added to the attention mechanism of the Transformer, i.e., the attention weight between data with dissimilar features is increased. In this way, the diversity and richness of the representation can be improved through the encoding module in the Transformer.
As shown in fig. 6, the determination of the attention weight in the above steps can be realized by the Multi-Head Attention mechanism in the encoding layer of the Transformer. The idea of Multi-Head Attention is simple: multiple groups of Q, K, V may be defined, which attend to different associated data with low feature similarity. As shown in FIG. 6, the multi-head attention mechanism resides in an encoding layer (Encoder) of the Transformer. The encoding module comprising the plurality of encoding layers is the feature fusion model mentioned immediately below.
That is, in step 103 "fusing the feature information of the target data and the feature information of the at least one piece of associated data to obtain a fused feature of the target data", according to the embodiment of the present application, the method may include the following steps:
1034. acquiring a feature fusion model;
1035. inputting the feature information of the target data and the feature information of the at least one piece of associated data into the feature fusion model, and executing the feature fusion model to obtain fusion features of the target data;
and the feature fusion model is obtained through a pre-training process. As can be seen from the above, the feature fusion model in step 1034 can be a Transformer, and more specifically, the encoding module in a Transformer.
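At the heart of each Encoder layer sits scaled dot-product attention over the node-feature sequence. The minimal single-head sketch below shows only that core computation; the Q/K/V projection matrices, multiple heads, Add & Norm, the feed-forward sub-layer, and the diversity re-weighting are all omitted:

```python
import math

def scaled_dot_attention(q, k, v):
    """Single-head scaled dot-product attention: each query attends over
    all keys, softmax-normalizes the scores, and mixes the values."""
    d = len(q[0])
    out = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        mx = max(scores)
        exps = [math.exp(s - mx) for s in scores]
        attn = [e / sum(exps) for e in exps]  # attention weights, sum to 1
        out.append([sum(a * vj[i] for a, vj in zip(attn, v)) for i in range(d)])
    return out

# feature information of the target node and one associated node (toy vectors)
seq = [[1.0, 0.0], [0.0, 1.0]]
fused = scaled_dot_attention(seq, seq, seq)  # self-attention over the sequence
```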
Further, the method provided by this embodiment may further include: and pre-training the feature fusion model. That is, the method provided by this embodiment may further include the following steps:
105. obtaining a sample graph, wherein the sample graph comprises sample nodes and side information reflecting the relation between the sample nodes;
106. for a first sample node in the sample graph, sampling the sample node in the sample graph to sample at least one second sample node related to the first sample node;
107. respectively determining feature information of the first sample node and feature information of at least one second sample node according to the feature of at least one mode of the first sample node and the feature of at least one mode of the at least one second sample node;
108. inputting the feature information of the first sample node and the feature information of the at least one second sample node into the feature fusion model to obtain the fusion feature of the first sample node and the fusion feature of the at least one second sample node;
109. executing a graph reconstruction task and a masking sample node feature reconstruction task based on the fusion feature of the first sample node and the fusion feature of the at least one second sample node to obtain an execution result corresponding to each task;
110. and optimizing parameters in the feature fusion model according to the execution result corresponding to each task.
As shown in fig. 7, the present embodiment provides a frame structure diagram of a feature fusion system based on a relational graph. Referring to fig. 7, the feature fusion system includes: the system comprises a node sampling layer, a node characteristic information determining layer, a characteristic fusion layer and an optimization layer. The feature fusion model in the present embodiment may include the feature fusion layer in fig. 7. The feature fusion layer may be a plurality of layers as shown in fig. 5, i.e., a plurality of coding layers, each including a plurality of sub-layers as shown in fig. 6.
Referring to fig. 7, the training process of the present embodiment may include, but is not limited to, the following four parts:
construction of part (1), relationship diagram
Part (2), node sampling and determination of node characteristic information
Part (3), feature fusion model
And (4) optimizing parameters in the feature fusion model based on the reconstruction task.
The process of constructing the relationship diagram in section (1) is not limited in this embodiment. For example, based on the labels included in the corresponding data of all sample nodes, when two data include 1, 2, 3, or more same labels, the side information of the corresponding node of the two data (for example, the connection line between the two nodes in fig. 7) may be established; the side information may include an edge weight, which may be the number of common labels. Through the above process, a relationship diagram as shown in fig. 7 can be established. The relationship graph is used for training the feature fusion model, and in order to distinguish the above "relationship graph", the relationship graph for training is referred to as a "sample graph". What needs to be added here is: in practical application, the relationship graph can be constructed based on historical behavior data of the user, for example, if the user browses data A and data B, the data A and the data B are related to each other, and the two data respectively correspond to the side information of the node.
In part (2), after the sample graph is constructed, sampling is performed for the first sample node h0 in the sample graph (i.e., any sample node in the sample graph) to obtain at least one second sample node related to the first sample node h0. In the example shown in fig. 7, the second sample nodes associated with the first sample node h0 are: h1, h2, h3 and h4.
And combining the first sample node and at least one second sample node to form a node sequence. Wherein the position (or the order of arrangement) of the at least one second sample node in the node sequence may be determined based on the importance. For example, the importance of each second sample node may be calculated according to the frequency of the second sample node being collected and the distance (e.g., 1 hop or 2 hops) of the second sample node from the first sample node. Specifically, the contents of the sampling frequency and the importance of the second sample node can be referred to the above related contents, which are not described herein again.
The first sample node can be arranged at the head or the tail of the node sequence. For example, as shown in FIG. 7, the node sequence is {h0, h1, h2, h3, h4}.
After the node sequence is formed, corresponding embedding features can be configured for each node (such as the first sample node and at least one second sample node) in the node sequence based on the position of each node in the node sequence and the role attribute corresponding to each node (such as the first sample node is the principal attribute, and the second sample node is the associated attribute related to the first sample node). Here, it should be noted that: the configuration of the embedded features can be seen in the above related text.
Then, feature information of the first sample node can be determined based on the features of the at least one modality of the first sample node and the embedded features corresponding to the first sample node. That is, the feature information of the first sample node includes an embedded feature in addition to the feature item corresponding to the feature of the at least one modality. Similarly, according to at least one modal characteristic of the associated data and the embedded characteristic corresponding to the associated data, characteristic information of the associated data is determined. For the determination of the characteristic information of the sample node, reference may be made to the corresponding contents in the above and fig. 4, which is not described herein again.
And in the part (3), the feature information of the first sample node and the feature information of at least one second sample node obtained in the part (2) are input into a feature fusion model, and the feature fusion model is executed to obtain the fusion feature of the first sample node. In addition, the feature fusion model can also obtain the fusion feature of the second sample node.
In the part (4), the fusion characteristics of the first sample node and the fusion characteristics of the second sample node obtained in the part (3) are used for executing the reconstruction task. The reconstruction task may include: a graph structure reconstruction task (i.e., GSR in fig. 7) and a feature reconstruction task of a mask node (i.e., NFR in fig. 7). Wherein, the graph structure reconstruction task can be simply understood as: and (4) obtaining the difference between the graph structure reconstructed by the fusion characteristics of each sample node and the graph structure of the sample graph according to the part (3), and optimizing the parameters in the characteristic fusion model through the reconstruction difference. The feature reconstruction task of the masking node can be simply understood as: masking partial nodes in the node sequence (such as masking 1 node), and decoding the characteristic information of the masked nodes through a decoding process; and optimizing parameters in the feature fusion model according to the decoded feature information of the masked node and the difference of the actual feature information of the masked node. Wherein the characteristic information of the masked node in the node sequence may be set to 0. In a specific training optimization process, the graph structure reconstruction task and the feature reconstruction task of the masking node can be respectively represented by corresponding objective functions; the values of the target functions corresponding to the graph structure reconstruction task and the feature reconstruction task of the masking node can be respectively calculated based on the fusion features of the sample nodes obtained in the step (3); and (4) integrating the values of the objective functions respectively corresponding to the two tasks, and optimizing the parameters in the feature fusion model.
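The graph structure reconstruction (GSR) objective can be sketched as follows, assuming the edge indicator is predicted by a dot product of fused node features and penalized by squared error; both choices are illustrative stand-ins, not the patent's objective function:

```python
def graph_reconstruction_loss(fused, edges):
    """GSR sketch: score each node pair by the dot product of their fused
    features and penalize the squared gap to the observed edge indicator
    (1 = edge exists in the sample graph, 0 = no edge)."""
    nodes = sorted(fused)
    loss = 0.0
    for i, a in enumerate(nodes):
        for b in nodes[i + 1:]:
            score = sum(x * y for x, y in zip(fused[a], fused[b]))
            target = 1.0 if (a, b) in edges or (b, a) in edges else 0.0
            loss += (score - target) ** 2
    return loss

# toy fused features that reconstruct the sample graph {h0-h1} perfectly
fused = {"h0": [1.0, 0.0], "h1": [1.0, 0.0], "h2": [0.0, 1.0]}
loss = graph_reconstruction_loss(fused, {("h0", "h1")})
```

Minimizing this quantity (together with the masked-node feature reconstruction loss) is what drives the parameter optimization described above.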
In this embodiment, the objective function corresponding to each task may be referred to as a loss function in some specific embodiments. The present embodiment is not limited to the specific expression of the objective function. Specific functional expressions will be listed below in conjunction with specific scenarios. In addition, the optimization process of the parameters in the feature fusion model is not specifically limited in this embodiment, and can be implemented by referring to the content in the corresponding literature.
And repeating the processes of the part (2), the part (3) and the part (4) until the training is finished. For example, after the reconstruction task in the part (4) is executed, when the difference between the reconstructed graph structure and the graph structure of the sample graph meets a first preset requirement, and the difference between the reconstructed feature information of the masking node and the actual feature information of the node meets the first preset requirement, it may be determined that the training is completed. In essence, it is desirable to minimize the graph structure differences and the differences in the masked node characteristic information. The goal of minimizing the variance is reached and the training is completed.
Multimodal information of the data has proven effective in improving accuracy in recommendation scenarios. For example, a short video comprises video frame features, title features, audio frame features, and the like; in a recommendation scenario, this multi-modal information can be used to mine the content representation of the short video, thereby improving recall relevance. In scenarios where user behavior is sparse, the multimodal information can effectively increase the exposure of cold-start short videos. In this embodiment, the training process of the feature fusion system based on the relationship graph as shown in fig. 7 can be understood as: a process of learning node representations by considering both the multi-modal features of the sample nodes (i.e., data) and the relationships between the nodes (e.g., side information).
The technical solution provided by this embodiment will be described below with reference to specific implementation schemes. The relationship graph referred to herein may be expressed as G = (V, E), which provides a unified view of the multi-modal features of the nodes and the relationships between them. Here, V denotes the set of nodes and E denotes the set of edges. f_h^i denotes the feature of the i-th modality of node h, where each node h has m modalities. In the following description, h is used to represent the target node for sampling, i.e., the node used as the sampling reference in the sampling process; the nodes sampled in the sampling process are neighbor nodes of the target node h.
The MCN algorithm mentioned above can be expressed as follows:

Input: a relationship graph G, a number of sampling iterations B, a sampling depth K, a per-step sample size, and the number of associated nodes S;

Output: the sampled associated node sequence C_h.
During each sampling step, only neighbor nodes at the same depth relative to the target node h are sampled; the sampling depth K corresponds to the target number mentioned above. Alternatively, in another implementable example, during each sampling, neighbor nodes within the sampling depth K of the target node h (including depth K) are sampled. For example, if the sampling depth K is 2, the sampled neighbor node set includes: the neighbor nodes 1 hop from the target node h, and the neighbor nodes 2 hops from the target node h. In the relationship graph G, N_h denotes the single-hop neighbor nodes (or single-hop associated nodes) of h, and ω_ht denotes the side information between h and t (e.g., the weight of the edge, where ω_ht > 0). For node h, C_h denotes the associated nodes sampled by the MCN sampling algorithm. Given a relationship graph G and the associated nodes of each node, the goal of the relationship-graph-based feature fusion system is to obtain a node representation that captures the multi-modal features and node relationships of the node. The learned node representation can then be applied directly to downstream tasks. Downstream tasks may include, but are not limited to: data recommendation, data classification (e.g., video classification, picture classification, etc.), and so on.
The feature fusion system based on the relationship graph in this embodiment is an example of the system framework shown in fig. 7, where the feature information of the nodes participating in fusion may include multi-modal information of the nodes. Therefore, the feature fusion system based on the relationship graph in this embodiment may also be referred to as a pre-training graph Transformer using multi-modal information, corresponding to English: Pre-training Graph Transformer with Multimodal side information, abbreviated PMGT.
For each node h in the relationship graph G, there are one or more associated nodes related to node h in G. In this embodiment, the feature information of the associated nodes is fused into the feature information of node h, which helps enrich the representation of node h, i.e., the fusion feature of node h.
The MCN algorithm mentioned above includes a plurality of sampling iterations, each of which samples associated nodes relative to the target node within the predefined sampling depth K. Let C_h^(k-1) denote the set of associated nodes sampled in the (k-1)-th step of a sampling iteration. For each node t ∈ C_h^(k-1), in the k-th step of the sampling iteration, the single-hop neighbors N_t of node t are sampled; the probability of sampling a node t' ∈ N_t is proportional to the side information ω_tt' between nodes t and t'. Note that a node may appear in C_h^(k) multiple times. In the MCN sampling algorithm, the associated nodes are selected from the sets of associated nodes sampled over the multiple sampling iterations, which may be considered in combination with the following two points: 1) the sampling frequency of the node; 2) the number of sampling steps (or the distance) between the target node h and a node in the associated node set, i.e., the number of hops between that node and the target node h.
For each node t ∈ V\{h}, in specific implementation, the importance of t to the target node h in the k-th sampling step (k ≤ K) can be calculated by the following formula (1):

s_t^(k) = c_t^(k) / k, (1)

where c_t^(k) denotes the number of times t appears in C_h^(k), and s_t^(k) is the importance of node t in the k-th sampling step. That is, node t is considered more relevant to the target node h if t is sampled more frequently and has a smaller sampling step (distance) to the target node h. The importance of node t to the target node h is then defined by the following formula (2):

S_t = Σ_{k=1}^{K} s_t^(k), (2)

where S_t is the importance of node t to the target node h. All sampled nodes in V\{h} can then be sorted in descending order of their importance, and the top-ranked S nodes (e.g., 2, 5, 6, or more) are selected as the associated nodes of the target node h.
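Under the assumptions above, the MCN sampling and importance ranking can be sketched as follows. This is a minimal illustration, not the patented implementation: the graph structure, iteration counts, and helper names are made up, while the per-step count c_t^(k) and the score S_t = Σ_k c_t^(k)/k follow formulas (1) and (2):

```python
import random
from collections import Counter

def mcn_sample(graph, h, B=10, K=2, n=3, S=2, seed=0):
    """Sample S associated nodes for target node h.

    graph: dict mapping node -> {neighbor: edge_weight (side information)}.
    Runs B sampling iterations; each iteration walks K steps, drawing n
    neighbors per frontier node with probability proportional to edge weight.
    """
    rng = random.Random(seed)
    counts = [Counter() for _ in range(K + 1)]  # counts[k] holds c_t^(k)
    for _ in range(B):
        frontier = [h]
        for k in range(1, K + 1):
            nxt = []
            for t in frontier:
                nbrs = list(graph[t].items())
                if not nbrs:
                    continue
                nodes, weights = zip(*nbrs)
                nxt.extend(rng.choices(nodes, weights=weights, k=n))
            counts[k].update(nxt)
            frontier = nxt
    # S_t = sum_k c_t^(k) / k  (formula (2)); the target itself is excluded
    score = Counter()
    for k in range(1, K + 1):
        for t, c in counts[k].items():
            if t != h:
                score[t] += c / k
    return [t for t, _ in score.most_common(S)]

graph = {
    'h': {'a': 3.0, 'b': 1.0},
    'a': {'h': 3.0, 'c': 2.0},
    'b': {'h': 1.0, 'c': 1.0},
    'c': {'a': 2.0, 'b': 1.0},
}
assoc = mcn_sample(graph, 'h', B=20, K=2, n=2, S=2)
```

Frequently reached close neighbors score highest because counts at small k are divided by a smaller step number.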
After the associated nodes are sampled, the target node h and its associated nodes C_h can be concatenated into a node sequence: I_h = [h, h_1, h_2, ..., h_S], where h_j is the j-th associated node and 1 ≤ j ≤ S. For each node t ∈ I_h, the corresponding content feature M_t is obtained by aggregating its multi-modal features through an attention mechanism, as shown in formula (3):

x_t^i = W_i f_t^i + b_i,
α_t = softmax[tanh(X_t) W_s + b_s],
M_t = Σ_{i=1}^{m} α_t^i x_t^i, (3)

where X_t = [x_t^1; ...; x_t^m], W_i and b_i respectively denote the weight matrix and bias term of the i-th modality of node t in the node sequence I_h, W_s and b_s ∈ R^{1×m} denote the weight matrix and bias term of the attention mechanism, Σ is the aggregation operation, and f_t^i denotes the feature of the i-th modality of node t.
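As an illustrative sketch only (the dimensions, shapes, and initialization are assumptions, not the patented implementation), the modality aggregation of formula (3) can be written in numpy as:

```python
import numpy as np

def aggregate_modalities(feats, Ws_list, bs_list, w_s, b_s):
    """Fuse the m modality features of one node into a content feature M_t.

    feats: list of m arrays, feats[i] has shape (d_i,).
    Ws_list[i]: (d, d_i) projection for modality i; bs_list[i]: (d,).
    w_s: (d,) attention weight vector; b_s: scalar bias (assumed shapes).
    """
    # Project each modality feature into a common d-dimensional space.
    X = np.stack([W @ f + b for W, f, b in zip(Ws_list, feats, bs_list)])  # (m, d)
    scores = np.tanh(X) @ w_s + b_s          # (m,) attention logits
    alpha = np.exp(scores - scores.max())
    alpha = alpha / alpha.sum()              # softmax over the m modalities
    return alpha @ X                         # (d,) attention-weighted aggregation

rng = np.random.default_rng(0)
d, dims = 4, [3, 5]                          # two modalities with different sizes
feats = [rng.normal(size=di) for di in dims]
Ws = [rng.normal(size=(d, di)) for di in dims]
bs = [np.zeros(d) for _ in dims]
M_t = aggregate_modalities(feats, Ws, bs, rng.normal(size=d), 0.0)
```

The attention weights let modalities with more informative projections contribute more to M_t than a plain average would.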
A node's position in the node sequence I_h reflects its importance to the target node h. As can be seen from the above, the plurality of associated nodes related to the target node h are ranked by importance; thus, a node's order in I_h is important when learning the node representation. The following position-ID (i.e., sequence-number identity) embedding is used to encode the order information of a node in the node sequence I_h:

P_t = P-Embedding[p(t)], (4)

where p(t) denotes the position of node t in the node sequence I_h, and P_t denotes the position feature of node t embedded based on the position ID.
The purpose of training the PMGT is to derive the fusion feature of a node (i.e., the representation of the node) based on the input feature information of the node and the feature information of its associated nodes. Intuitively, the target node and its associated nodes should play different roles in pre-training. To identify role differences, the following role-based embedding is added for each node t ∈ I_h in this embodiment:

R_t = R-Embedding[r(t)], (5)

where r(t) denotes the role corresponding to node t, and R_t denotes the role feature of node t embedded based on its own role. In specific implementation, the role of the target node may be set as "master role" or "target", and the role of an associated node may be set as "associated role", "context", "non-master role", or the like, which is not limited in this embodiment.
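A minimal sketch of how the content, position, and role embeddings of formulas (3)-(5) could be combined into the input feature of each node in I_h. The additive combination, shapes, and random embedding tables are assumptions in the style of Transformer input embeddings, not the embodiment's definitive design:

```python
import numpy as np

def build_input_features(M, seq_len, d, seed=0):
    """Combine content features with position and role embeddings.

    M: (seq_len, d) content features M_t of the node sequence I_h
       (target node first, then the S associated nodes).
    Returns (seq_len, d) input features: M_t + P_t + R_t.
    """
    rng = np.random.default_rng(seed)
    P = rng.normal(scale=0.02, size=(seq_len, d))  # P-Embedding: one row per position
    R = rng.normal(scale=0.02, size=(2, d))        # R-Embedding: role 0 = target, 1 = context
    roles = np.array([0] + [1] * (seq_len - 1))    # target node sits first in I_h
    return M + P + R[roles]

M = np.zeros((4, 8))                # I_h = [h, h1, h2, h3], feature size d = 8
H0 = build_input_features(M, 4, 8)
```

Because position and role rows differ, two nodes with identical content features still receive distinct inputs.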
The second algorithm referred to in this application is a model training algorithm, or model optimization algorithm. Also taking the PMGT mentioned above as an example, the algorithm is expressed as follows:

Input: a sample graph (such as the relationship graph G) and the multi-modal features of all nodes in graph G;

Output: the parameter set Θ of the PMGT and the pre-trained node representations.
As shown in fig. 7, the above training process more specifically trains the encoding module in the Transformer; the main objective is to optimize the encoding module (the parameters in each encoding layer, Encoder). The encoding module in the PMGT may be referred to as a Transformer-based graph encoder.
Essentially, a Transformer is used to model the interaction between a node and its associated nodes. Given the node fusion features H^(l-1) of the (l-1)-th layer, the output of the l-th layer of the Transformer is defined as follows:

H^l = FFN(U^l H^(l-1) W_V^l), (7)

where W_V^l represents a weight matrix, U^l is the attention weight matrix defined below, and FFN(·) is a feed-forward network. Here, for convenience, other sub-networks are omitted from formula (7).
For the target node h, there may be some sampled nodes in C_h whose representations are similar to that of h. It is assumed that all sampled associated nodes are related to the target node. It is desirable that the graph encoder in this embodiment capture the diversity of the associated nodes, for which it needs to focus on associated nodes that are related to the target node but not very similar to it. To achieve this goal, this embodiment designs a diversity-promoting attention mechanism and incorporates it into the network framework of the Transformer.
U_2^l = softmax(Q^l (K^l)^T / √d), with Q^l = H^(l-1) W_Q^l and K^l = H^(l-1) W_K^l,
U_1^l = softmax(E − S S^T ⊘ (‖S‖_2 ‖S‖_2^T) + I), with S = H^(l-1) W_S^l,
U^l = β U_1^l + (1 − β) U_2^l, (8)

where W_Q^l, W_K^l, and W_S^l are weight matrices, E ∈ R^((S+1)×(S+1)) is a matrix in which all elements are 1, ‖S‖_2 ∈ R^((S+1)×1) denotes the l_2 row norms of S (so S S^T ⊘ (‖S‖_2 ‖S‖_2^T) is the row-wise cosine-similarity matrix, with ⊘ denoting element-wise division), and I ∈ R^((S+1)×(S+1)) is the identity matrix. The greater the similarity between two different nodes, the smaller their attention weight in U_1. The purpose of adding I in the definition of U_1 is to let each node retain its own information. β is a constant (0 ≤ β ≤ 1) that balances the contributions of the two attention weights.
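The following numpy sketch illustrates one plausible reading of this diversity-promoting attention; the exact matrix forms are reconstructed assumptions, not the patent's definitive formulas. Standard scaled dot-product attention is blended with a similarity-penalizing term built from the all-ones matrix E, the row-wise cosine similarities, and the identity I:

```python
import numpy as np

def row_softmax(A):
    A = A - A.max(axis=1, keepdims=True)
    E = np.exp(A)
    return E / E.sum(axis=1, keepdims=True)

def diversity_attention(H, Wq, Wk, Ws, beta=0.5):
    """Blend standard attention with a diversity-promoting attention term.

    H: (S+1, d) node features of the sequence; beta in [0, 1] balances
    the diversity term U1 against the standard term U2.
    """
    d = H.shape[1]
    Q, K, Sm = H @ Wq, H @ Wk, H @ Ws
    U2 = row_softmax(Q @ K.T / np.sqrt(d))          # standard attention weights
    norms = np.linalg.norm(Sm, axis=1, keepdims=True) + 1e-9
    cos = (Sm @ Sm.T) / (norms @ norms.T)           # row-wise cosine similarity
    n = H.shape[0]
    E, I = np.ones((n, n)), np.eye(n)
    U1 = row_softmax(E - cos + I)                   # more similar pair -> smaller weight
    return beta * U1 + (1 - beta) * U2

rng = np.random.default_rng(1)
H = rng.normal(size=(4, 6))
W = [rng.normal(size=(6, 6)) for _ in range(3)]
U = diversity_attention(H, *W, beta=0.3)
```

Since both U1 and U2 are row-stochastic, their convex combination U remains a valid attention weight matrix.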
After obtaining the output H^L at the last layer of the encoder, the row of H^L corresponding to the target node can be taken as the representation of the target node h (i.e., the fusion feature of the target node h); for simplicity, it is denoted by h. H^L is then used in the subsequent pre-training work, i.e., in the optimization process.
In the optimization process, this embodiment implements optimization through reconstruction tasks. The PMGT model is pre-trained with two objectives: 1) graph structure reconstruction and 2) masked node feature reconstruction. To ensure that the learned node representations can capture the graph structure, this embodiment defines the following loss function:

L_GSR = − log σ(h^T h_u) − Q · E_{u_n ∼ P_n}[log σ(− h^T h_{u_n})], (9)

where u is a node adjacent to the target node h, h_u and h_{u_n} are the learned representations of u and of a negative sample u_n, σ(·) is the sigmoid function, and P_n and Q represent the negative sample distribution and the number of negative samples, respectively.
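A small numerical sketch of this negative-sampling structure loss, assuming the standard form reconstructed above (names, shapes, and the toy vectors are illustrative only):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gsr_loss(h, h_pos, h_negs):
    """Graph structure reconstruction loss for one (target, neighbor) pair.

    h, h_pos: (d,) representations of the target node and an adjacent node.
    h_negs: (Q, d) representations of Q nodes drawn from the negative
    distribution P_n; summing over them equals Q times the empirical mean.
    Adjacent pairs are pulled together, negative samples pushed apart.
    """
    pos = -np.log(sigmoid(h @ h_pos) + 1e-12)
    neg = -np.sum(np.log(sigmoid(-(h_negs @ h)) + 1e-12))
    return pos + neg

rng = np.random.default_rng(2)
h = rng.normal(size=8)
negs = rng.normal(size=(3, 8))
loss_close = gsr_loss(h, h, negs)    # "neighbor" identical to target
loss_far = gsr_loss(h, -h, negs)     # "neighbor" pointing the opposite way
```

The loss is smaller when the neighbor's representation aligns with the target's, which is exactly the behavior the graph structure reconstruction task rewards.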
The masked node feature reconstruction task focuses on capturing the multi-modal features in the learned node representations. In the embodiment of the application, the masked node feature reconstruction task is designed to reconstruct the features of the masked nodes from the other, non-masked nodes in the node sequence I_h.
In this embodiment, no masking operation is applied to the target node h. Instead, 20% of the nodes in I_h\{h} are randomly selected to be masked. If a node t is selected, t is replaced with: 1) the [MASK] node 80% of the time, 2) a random node 10% of the time, and 3) the unchanged node t 10% of the time. The feature of the [MASK] node is set to 0. Then, the fusion features of the unmasked nodes and the features of the masked nodes (set to 0) in the masked node sequence I_h are input to a Decoder in the Transformer, which outputs the reconstructed feature information of the masked nodes in the node sequence. For the masked node feature reconstruction task, the embodiment of the present application defines the corresponding feature reconstruction loss function as follows:
L_NFR = Σ_{t ∈ M_h} Σ_{i=1}^{m} ‖ W_r^i h_t^L − f_t^i ‖_2^2, (10)

where M_h denotes the set of masked nodes in the node sequence I_h, h_t^L is the fusion feature (node representation) of node t in H^L, W_r^i is the weight matrix for feature reconstruction of the i-th modality, and f_t^i is the feature of the i-th modality of node t.
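As an illustrative sketch of formula (10) — the linear per-modality reconstruction heads, toy shapes, and identity-matrix setup are assumptions for demonstration:

```python
import numpy as np

def nfr_loss(H_masked, masked_ids, Wr_list, feats):
    """Masked node feature reconstruction loss of formula (10).

    H_masked: (seq_len, d) final-layer fusion features H^L.
    masked_ids: indices of masked nodes (the set M_h).
    Wr_list[i]: (d_i, d) reconstruction head for modality i.
    feats[i]: (seq_len, d_i) ground-truth features of modality i.
    """
    loss = 0.0
    for t in masked_ids:
        for Wr, f in zip(Wr_list, feats):
            diff = Wr @ H_masked[t] - f[t]
            loss += float(diff @ diff)      # squared l2 norm per node and modality
    return loss

d, n = 4, 3
H = np.eye(n, d)                            # toy fusion features, one-hot rows
Wr = [np.eye(4, d)]                         # identity head: perfect reconstruction
feats = [H.copy()]                          # ground truth equals W_r h_t exactly
zero_loss = nfr_loss(H, [1, 2], Wr, feats)
bad_loss = nfr_loss(H, [1, 2], Wr, [H + 1.0])   # ground truth shifted by 1
```

The loss vanishes exactly when the heads reproduce every masked node's modality features, and grows with the squared reconstruction error otherwise.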
The model parameters of the PMGT may be learned by minimizing a combined objective function, i.e., the combined objective function shown in formula (11) below:

L = L_GSR + L_NFR. (11)
When the pre-trained PMGT is applied to a downstream task (such as a data recommendation task, or data classification such as video classification and picture classification), the fusion features of data output by the PMGT can be input directly into the downstream task; alternatively, in order to suit the scenario corresponding to the downstream task, the pre-trained PMGT can be further adjusted and trained with sample data from that scenario.
Specifically, the reconstruction tasks in this embodiment are designed to further capture the topological relationships and node feature information of the relationship graph. The reconstruction tasks may include two tasks. One is the graph structure reconstruction task, whose goal is: among the fusion features of the target nodes obtained by the encoding module, the fusion features (or node representations) of adjacent nodes should be similar, while the fusion features of non-adjacent nodes should be far apart. The other is the masked node feature reconstruction task, whose goal is: the fusion features of the nodes obtained by the encoding module can be restored by the decoding module, so that the pre-trained short video representation contains modal information.
What needs to be added here is: for the content of the decoding module (Decoder), reference can be made to the corresponding literature, and this document is not limited thereto.
The pre-trained PMGT of the embodiment can be accessed into a recommendation model of a recommendation scene, and the PMGT is further trained by using behavior data of a user so as to achieve the maximum improvement of the effect. Or after the fusion characteristics of the target data are output by using the pre-trained PMGT, the fusion characteristics of the target data are input into a recommendation model, so that corresponding recommendation data are obtained through the recommendation model.
In the traditional method for fusing multi-modal information, only the individual modality representations of the target data (such as a short video) are extracted and then concatenated and fused. Recalling based on such a representation only recalls data similar in modality to the target data itself. In the method of this embodiment, not only the multi-modal features of the target data are fused, but also the multi-modal features of at least one piece of associated data related to the target data; therefore, recalling according to a fusion feature that incorporates the features of other data can recall both data similar to the target data and data related to the target data.
Based on the above, it can be concluded that the core of the present application is: establishing correlations among data in the form of a relationship graph to guide each piece of data in fusing its own multi-modal information, while simultaneously capturing the topological relationships of the relationship graph to further fuse the features of related data, so as to help recommendation.
In fact, when extended to a specific recommendation scenario, multiple kinds of correlation information between users and short videos can be captured directly by establishing multiple relationship graphs: between users and short videos, between users and users, and between short videos and short videos; the feature fusion model (such as the PMGT) can then be pre-trained in multiple dimensions.
The data recommendation method provided by the embodiment can be applied to a multimedia data recommendation scene. For example, recommendation of short videos in social applications, recommendation of songs in music APPs, recommendation of music videos in music APPs, recommendation of commodities in e-commerce APPs, recommendation of commodity recommendation videos in e-commerce APPs, and the like. The multimedia data may include: text, picture, audio, video, etc. That is, as shown in fig. 8, another embodiment of the present application provides a data recommendation method, including:
201. responding to the operation of a user on the interactive interface, and outputting first multimedia data;
202. determining at least one second multimedia data related to the first multimedia data;
203. fusing the feature information of the first multimedia data and the feature information of the at least one second multimedia data to obtain a fusion feature of the first multimedia data;
204. determining recommended data based on the fusion characteristics of the first multimedia data;
205. and outputting the recommended data for the user when the output condition is met.
The execution subject of each step of the method provided by this embodiment may be the server 22 in the embodiment shown in fig. 12; or the execution subject of the partial step in the method may be the server 22 in fig. 12, and the execution subject of the partial step may be the client 21 in fig. 12.
In the above 201, the operation performed by the user on the interactive interface may be a "click" operation or a "slide" operation for a short video playing control on the interface, which is not limited in this embodiment.
In an implementation manner, the step 202 "determining at least one second multimedia data related to the first multimedia data" in the present embodiment may include:
2021. determining at least one second multimedia data based on historical data associated with the user;
2022. obtaining multi-modal characteristics of the first multimedia data, multi-modal characteristics of the at least one second multimedia data, and behavior data of the user;
2023. determining associated information among the multimedia data according to the multi-modal characteristics of the first multimedia data, the multi-modal characteristics of the at least one second multimedia data and the behavior data of the user;
2024. constructing a relational graph according to the associated information among the multimedia data; the relationship graph comprises a plurality of nodes and side information reflecting the relationship among the nodes; the relationship graph comprises: a first node corresponding to the first multimedia data, and at least one second node corresponding to the at least one second multimedia data;
2025. and for the first node, sampling nodes in the relational graph to sample at least one associated node related to the first node, wherein data corresponding to the associated node is second multimedia data related to the first multimedia data.
In a specific implementation example, in step 203, "fusing the feature information of the first multimedia data and the feature information of the at least one second multimedia data to obtain a fused feature of the first multimedia data" in the present embodiment includes:
2031. configuring corresponding embedding characteristics for the first multimedia data and the at least one second multimedia data respectively;
2032. determining feature information of the first multimedia data according to the multi-modal features of the first multimedia data and the embedded features corresponding to the first multimedia data;
2033. determining feature information of the at least one second multimedia data according to the multi-modal features of the at least one second multimedia data and the embedded features corresponding to the at least one second multimedia data;
2034. inputting the feature information of the first multimedia data and the feature information of the at least one second multimedia data into a feature fusion model, and executing the feature fusion model to obtain the fusion features of the first multimedia data;
and obtaining the feature fusion model through a pre-training process.
In particular, the above feature fusion model may be the encoding module in the PMGT above. Furthermore, step 2025 and steps 2031 to 2034 can be implemented by the PMGT. As can be seen from the description of the PMGT and the structure shown in fig. 7, after the constructed relationship graph and the target node (i.e., the node corresponding to the first multimedia data) are input to the PMGT, the PMGT can complete the sampling of the associated nodes (i.e., the at least one related second multimedia data), the determination of the feature information of the first multimedia data and of the at least one second multimedia data, and the fusion of that feature information, and finally output the fusion feature of the first multimedia data.
In 204, since the fusion feature of the first multimedia data is fused with not only the information corresponding to the feature of the first multimedia data, but also the information corresponding to the feature of at least one second multimedia data, when recommending based on the fusion feature, in addition to the recommendation data similar to the first multimedia data, the recommendation data that is not similar to the first multimedia data but related to the first multimedia data may be recalled. Therefore, by adopting the method provided by the embodiment, the diversity of data recommendation can be effectively improved.
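As a purely illustrative sketch of recall based on fusion features (the item names and vectors are made up; this is not the recommendation model of the embodiment), candidates can be ranked by similarity to the fusion feature of the first multimedia data:

```python
import numpy as np

def recall_top_k(query, catalog, k=2):
    """Rank catalog items by cosine similarity to the query fusion feature.

    query: (d,) fusion feature of the first multimedia data.
    catalog: dict name -> (d,) fusion feature of a candidate item.
    """
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))
    scored = sorted(((cos(query, v), name) for name, v in catalog.items()),
                    reverse=True)
    return [name for _, name in scored[:k]]

query = np.array([1.0, 0.0, 0.0])
catalog = {
    'similar_clip': np.array([0.9, 0.1, 0.0]),
    'related_clip': np.array([0.6, 0.8, 0.0]),
    'unrelated_clip': np.array([0.0, 0.0, 1.0]),
}
top = recall_top_k(query, catalog, k=2)
```

Because the fusion feature also encodes associated data, items that are related but not modality-identical (like 'related_clip' here) can still be recalled alongside near-duplicates.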
The 205 may include, but is not limited to: and the output condition is met when the first multimedia data is played, and the output condition is met when the user triggers the switching operation. For example, the user selects a short video to play in full screen, and after the short video is played, the output condition is satisfied. Or, after the user "slides up" on a short video full screen playing interface (as shown in fig. 2), the output condition is satisfied.
After the output condition is satisfied, the determined recommendation data can be output on the client device of the user, such as screen display, playing, speaker playing, and the like.
The technical solution provided by the application can be applied to short video recommendation scenarios and multimedia data classification scenarios. For example, fig. 9a shows a flow diagram of an embodiment for a short video recommendation scenario. As shown in fig. 9a, the short video recommendation method includes:
301. displaying the first short video;
302. determining at least one second short video related to the first short video;
303. fusing the feature information of the first short video and the feature information of the at least one second short video to obtain the fusion feature of the first short video;
304. determining a recommended object according to the fusion characteristics of the first short video;
305. and when the display condition is met, displaying the recommended object.
As another example, an embodiment of a data classification method is shown in FIG. 9 b. As shown in fig. 9b, the data classification method includes:
301', determining at least one second multimedia data related to the first multimedia data;
302', fusing the feature information of the first multimedia data and the feature information of the at least one second multimedia data to obtain a fused feature of the first multimedia data;
303' determining a classification to which the first multimedia data belongs according to the fusion characteristics of the first multimedia data.
The execution subject of each step of the embodiment shown in fig. 9b may be the server 22 in the embodiment shown in fig. 12; or the execution subject of partial steps in the method may be the server 22 in fig. 12, and the execution subject of partial steps may be the client 21 in fig. 12; or the execution subject of all the steps of the method is the client. Multimedia data is classified by the client.
The classification to which the multimedia data belongs can be used in data search and home page recommendation. In an implementation, step 303' may input the fusion feature of the first multimedia data into a classification model and execute the classification model to obtain the corresponding classification. The classification model may be a deep neural network model, a convolutional neural network model, or the like, which is not limited in this embodiment.
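A minimal sketch of step 303' with a linear softmax classifier; the weights, class names, and input vector are illustrative assumptions, not the embodiment's classification model:

```python
import numpy as np

def classify(fusion_feature, W, b, labels):
    """Map a fusion feature to a class label via a linear softmax layer."""
    logits = W @ fusion_feature + b
    probs = np.exp(logits - logits.max())
    probs = probs / probs.sum()              # softmax over the classes
    return labels[int(np.argmax(probs))], probs

labels = ['sports', 'music', 'food']
W = np.array([[2.0, 0.0],
              [0.0, 2.0],
              [-1.0, -1.0]])
b = np.zeros(3)
cls, probs = classify(np.array([1.5, 0.2]), W, b, labels)
```

In practice the linear layer would be replaced by the deep or convolutional network mentioned above, trained on labeled fusion features.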
The specific implementation of the above steps can refer to the corresponding content in the above, and is not described in detail here.
As shown in fig. 10, another embodiment of the present application provides a feature fusion model based on a relational graph. The relationship graph-based feature fusion model may include: the system comprises a sampling module 11, a feature information determining module 12, a feature fusion module 13 and an optimization module 14. The sampling module 11 is configured to sample, for a first node in a relational graph, the node in the relational graph to obtain at least one second node related to the first node. A feature information determining module 12, configured to configure corresponding embedded features for the first node and the at least one second node according to the multi-modal information of the first node, the multi-modal information of the at least one second node, and the side information between the first node and the at least one second node; determining feature information of the first node according to the multi-modal information of the first node and the embedded feature corresponding to the first node; and determining the characteristic information of the at least one second node according to the multi-modal information of the at least one second node and the embedded characteristic corresponding to the at least one second node. And a feature fusion module 13, configured to input the feature information of the first node and the feature information of the at least one second node into a feature fusion model, and execute the feature fusion model to obtain a fusion feature of the first node. The optimization module 14 is configured to execute a graph reconstruction task and a masking node feature reconstruction task according to the fusion feature of the first sample node, so as to obtain an execution result corresponding to each task; and optimizing parameters in the feature fusion model according to the execution result corresponding to each task.
Correspondingly, the application also provides an embodiment of the training method of the feature fusion model based on the relational graph. The execution subject of each step of the method provided by this embodiment may be a server or a client. Because the performance requirement of the model training process on the equipment is high, the execution subject of the embodiment is a server side, which is common. As shown in fig. 11, the method includes:
401. obtaining a sample graph, wherein the sample graph comprises sample nodes and side information reflecting the relation between the sample nodes;
402. for a first sample node in the sample graph, sampling the sample node in the sample graph to sample at least one second sample node related to the first sample node;
403. respectively determining feature information of the first sample node and feature information of at least one second sample node according to the multi-modal features of the first sample node and the multi-modal features of the at least one second sample node;
404. inputting the feature information of the first sample node and the feature information of the at least one second sample node into the feature fusion model to obtain the fusion feature of the first sample node;
405. executing a graph reconstruction task and a masking sample node characteristic reconstruction task based on the fusion characteristics of the first sample node to obtain execution results corresponding to the tasks;
406. and optimizing parameters in the feature fusion model according to the execution result corresponding to each task.
The scheme provided by the embodiment of the application can be applied to the system architecture shown in fig. 12. As shown in fig. 12, the present embodiment provides a service system. The service system includes: a client 21 and a server 22. The server 22 may be a server, a server cluster formed by a plurality of servers, or a virtual server on the server cluster, and the like, which is not limited in this embodiment. In particular, the method comprises the following steps of,
the client 21 is configured to display or output target data, and send a recommendation request to the server based on the target data, where the recommendation request may carry a user identifier.
The server 22 may have three tasks, one of which is used to construct a relationship graph, for example, a relationship graph related to a user is constructed mainly by the user, and the relationship graph includes nodes corresponding to data (such as video, music, text, etc.) of browsing, collecting, commenting, enjoying, etc. in history of the user. Or a relation graph which is constructed by mainly belonging to the same class of users on the platform and is related to the class of users. The relationship graph comprises nodes corresponding to data (such as videos, music, texts, and the like) which are historically browsed, collected, praised, appreciated, concerned, and the like by the users belonging to the same category. Or constructing a relation graph containing nodes respectively corresponding to all the data on the platform by taking all the data on the platform as the main. Wherein two data connected by an edge in the relational graph may exist: relationships with the same label, there is a user sequential behavior relationship, relationships with the same user behavior (e.g., both data, are appreciated by the user), and so on. The second task is a model training task, such as the PMGT pre-training process mentioned above. The third task is a recommendation task and is used for receiving a recommendation request sent by a client and acquiring a corresponding relation graph based on a user identifier and target data carried in the recommendation request; then, calling a pre-trained PMGT, inputting the relation graph and the target data into the PMGT, and outputting the fusion characteristic of the target data; then, based on the fusion characteristics of the target data, executing a recommendation task to obtain at least one recommendation datum; and feeding back the at least one recommendation data to the client device.
Fig. 13 shows an example of a structure of a data recommendation device according to an embodiment of the present application. As shown in the figure, the data recommendation device includes: a determination module 31, a fusion module 32 and a recommendation module 33. The determining module 31 is configured to determine target data, and further configured to determine at least one associated data related to the target data in the plurality of data according to associated information between the data. The fusion module 32 is configured to fuse the feature information of the target data and the feature information of the at least one piece of associated data to obtain a fusion feature of the target data. The recommending module 33 is configured to recommend at least one piece of recommended data for the user based on the fusion feature of the target data.
Further, when determining, from among the plurality of data according to the associated information among the data, at least one piece of associated data related to the target data, the determining module 31 is specifically configured to:
construct a relationship graph according to the associated information among the data, where the relationship graph includes a plurality of nodes and edge information reflecting the relationships among the nodes; different nodes correspond to different data, and the node corresponding to the target data in the relationship graph is the target node;
and, for the target node, sample nodes in the relationship graph to obtain at least one associated node related to the target node, where the data corresponding to an associated node is associated data related to the target data.
Still further, when sampling nodes in the relationship graph for the target node to obtain at least one associated node related to the target node, the determining module is specifically configured to:
acquiring a target number of iterations and a sampling number;
in one sampling iteration, taking the target node as the sampling origin, and sampling at least one neighbor node adjacent to the sampling origin in the relationship graph;
judging whether the number of sampling iterations reaches the target number;
when the number of sampling iterations does not exceed the target number, entering the next sampling iteration; in the next sampling iteration, taking any one of the at least one neighbor node as the sampling origin, and sampling at least one neighbor node adjacent to that sampling origin in the relationship graph;
and when the number of sampling iterations exceeds the target number, determining, from among the neighbor nodes sampled over the target number of sampling iterations, the sampling number of neighbor nodes as the associated nodes.
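The iterative neighbor-sampling procedure above can be sketched as follows. The function name and the use of uniform random sampling are assumptions; the application does not fix a particular sampling distribution:

```python
import random

def sample_associated_nodes(adj, target_node, target_times, sample_num, seed=0):
    """Iteratively sample neighbors around `target_node`.

    adj: adjacency mapping {node: set_of_neighbors} of the relationship graph.
    Each of the `target_times` iterations expands from a node sampled in the
    previous iteration; afterwards, up to `sample_num` distinct neighbors are
    returned as the associated nodes.
    """
    rng = random.Random(seed)
    frontier = [target_node]
    collected = []
    for _ in range(target_times):
        origin = rng.choice(frontier)            # any node of the current frontier
        neighbors = sorted(adj.get(origin, ()))  # deterministic order before sampling
        if not neighbors:
            break
        picked = rng.sample(neighbors, min(len(neighbors), sample_num))
        collected.extend(n for n in picked if n != target_node)
        frontier = picked
    # keep at most `sample_num` distinct associated nodes, preserving order
    distinct = []
    for n in collected:
        if n not in distinct:
            distinct.append(n)
    return distinct[:sample_num]
```

The data items corresponding to the returned nodes are the associated data whose features are fused with those of the target data in the next step.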
Still further, when a plurality of pieces of associated data related to the target data are sampled, the determining module is further configured to: determine the importance of each piece of associated data; configure a corresponding embedded feature for each piece of associated data according to its importance; configure an embedded feature for the target data; acquire features of at least one modality of the target data and features of at least one modality of each piece of associated data; determine the feature information of the target data according to the features of at least one modality of the target data and the embedded feature corresponding to the target data; and determine the feature information of each piece of associated data according to the features of at least one modality of that piece of associated data and the embedded feature corresponding to it.
Further, the embedded features include a position feature and a role feature; the target data has multi-modal features, i.e., features of multiple modalities. Correspondingly, when determining the feature information of the target data according to the multi-modal features of the target data and the embedded feature corresponding to the target data, the determining module 31 is specifically configured to:
determining a weight for the feature of each modality among the multi-modal features of the target data;
determining a content feature of the target data according to the multi-modal features of the target data and the weight of the feature of each modality;
and aggregating the content feature of the target data with the position feature and role feature corresponding to the target data to obtain the feature information of the target data.
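The per-modality weighting and aggregation steps above might look like the following sketch, assuming softmax-normalized modality weights and element-wise addition as the aggregation (both plausible choices the application leaves open):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def node_feature_info(modal_feats, modal_scores, position_feat, role_feat):
    """Weight each modality's feature, sum into a content feature, then
    aggregate with the position and role embeddings by element-wise addition.

    modal_feats: list of equal-length feature vectors, one per modality.
    modal_scores: one raw importance score per modality.
    """
    weights = softmax(modal_scores)
    dim = len(modal_feats[0])
    content = [sum(w * f[i] for w, f in zip(weights, modal_feats))
               for i in range(dim)]
    return [c + p + r for c, p, r in zip(content, position_feat, role_feat)]
```

With equal modality scores the content feature reduces to the mean of the modality features, which is a useful sanity check for the weighting step.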
Further, when fusing the feature information of the target data and the feature information of the at least one piece of associated data to obtain the fusion feature of the target data, the fusion module 32 is configured to:
determining the feature similarity between the feature information of the target data and the feature information of the at least one piece of associated data;
determining the attention weight corresponding to the at least one piece of associated data based on the feature similarity;
and performing fusion coding on the feature information of the target data and the feature information of the at least one piece of associated data according to the attention weights corresponding to the at least one piece of associated data, to obtain the fusion feature of the target data.
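The similarity-then-attention-weight fusion described above can be sketched as a dot-product attention step. The exact similarity measure and fusion coding are not specified by the application, so this sketch assumes dot-product similarity, a softmax over the scores, and additive fusion:

```python
import math

def attention_fuse(target_feat, assoc_feats):
    """Fuse a target feature with associated features via attention.

    Each associated feature is scored by its dot-product similarity to the
    target feature; the scores are softmax-normalized into attention
    weights, and the weighted sum of associated features is added to the
    target feature to form the fusion feature.
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    scores = [dot(target_feat, f) for f in assoc_feats]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    fused = list(target_feat)
    for w, f in zip(weights, assoc_feats):
        for i, v in enumerate(f):
            fused[i] += w * v
    return fused, weights
```

Associated data that is more similar to the target thus contributes more to the fusion feature, which matches the attention-weight behavior described above.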
Alternatively, when fusing the feature information of the target data and the feature information of the at least one piece of associated data to obtain the fusion feature of the target data, the fusion module 32 is configured to:
acquiring a feature fusion model;
inputting the feature information of the target data and the feature information of the at least one piece of associated data into the feature fusion model, and executing the feature fusion model to obtain fusion features of the target data;
and obtaining the feature fusion model through a pre-training process.
Further, the data recommendation device provided by this embodiment may further include a training module, where the training module is configured to:
obtaining a sample graph, wherein the sample graph comprises sample nodes and edge information reflecting the relationships between the sample nodes;
for a first sample node in the sample graph, sampling the sample node in the sample graph to sample at least one second sample node related to the first sample node;
respectively determining feature information of the first sample node and feature information of at least one second sample node according to the feature of at least one mode of the first sample node and the feature of at least one mode of the at least one second sample node;
inputting the feature information of the first sample node and the feature information of the at least one second sample node into the feature fusion model to obtain the fusion feature of the first sample node and the fusion feature of the at least one second sample node;
executing a graph reconstruction task and a masked sample node feature reconstruction task based on the fusion feature of the first sample node and the fusion feature of the at least one second sample node, to obtain an execution result corresponding to each task;
and optimizing parameters in the feature fusion model according to the execution result corresponding to each task.
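The two pre-training tasks above can be sketched as a combined loss: binary cross-entropy for graph (edge) reconstruction and squared error for masked-feature reconstruction. The loss forms and names here are assumptions for illustration; the application does not specify them:

```python
import math

def pretraining_loss(fused, pos_edges, neg_edges, masked_nodes, originals):
    """Combined loss of the two self-supervised pre-training tasks.

    Graph reconstruction: the dot product of two fused node features, passed
    through a sigmoid, predicts whether an edge exists (binary cross-entropy
    against sampled positive and negative edges).
    Masked feature reconstruction: a masked node's fused feature should
    recover its original feature (squared error).
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))

    graph_loss = 0.0
    for u, v in pos_edges:   # edges that exist should score high
        graph_loss -= math.log(sigmoid(dot(fused[u], fused[v])) + 1e-12)
    for u, v in neg_edges:   # sampled non-edges should score low
        graph_loss -= math.log(1.0 - sigmoid(dot(fused[u], fused[v])) + 1e-12)

    mask_loss = sum(
        sum((a - b) ** 2 for a, b in zip(fused[n], originals[n]))
        for n in masked_nodes
    )
    return graph_loss + mask_loss
```

Parameters of the feature fusion model would then be optimized to minimize this combined execution result over sampled subgraphs.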
Here, it should be noted that: the content of each module in the data recommendation device provided in this embodiment, which is not described in detail in the above embodiments, may refer to the corresponding content in the above embodiments, and is not described in detail here. In addition, the data recommendation device provided in this embodiment may further include, in addition to the functions described above, functions corresponding to other parts or all of the steps in the embodiments, for which reference may be specifically made to corresponding contents in the embodiments, which is not described herein again.
Fig. 14 is a block diagram illustrating a data recommendation apparatus according to another embodiment of the present application. As shown in fig. 14, the data recommendation apparatus includes: an output module 41, a determining module 42, a fusion module 43, and a recommending module 44. The output module 41 is configured to output first multimedia data in response to an operation performed by the user on the interactive interface. The determining module 42 is configured to determine at least one piece of second multimedia data related to the first multimedia data. The fusion module 43 is configured to fuse the feature information of the first multimedia data and the feature information of the at least one piece of second multimedia data to obtain a fusion feature of the first multimedia data. The recommending module 44 is configured to determine recommended data according to the fusion feature of the first multimedia data. The output module is further configured to output the recommended data for the user when an output condition is met.
Further, the determining module 42, when determining at least one second multimedia data related to the first multimedia data, is specifically configured to:
determining at least one second multimedia data based on historical data associated with the user; obtaining multi-modal features of the first multimedia data, multi-modal features of the at least one second multimedia data, and behavior data of the user; determining associated information among the multimedia data according to the multi-modal features of the first multimedia data, the multi-modal features of the at least one second multimedia data, and the behavior data of the user; constructing a relationship graph according to the associated information among the multimedia data, where the relationship graph includes a plurality of nodes and edge information reflecting the relationships among the nodes, including a first node corresponding to the first multimedia data and at least one second node corresponding to the at least one second multimedia data; and, for the first node, sampling nodes in the relationship graph to obtain at least one associated node related to the first node, where the data corresponding to the associated node is second multimedia data related to the first multimedia data.
Further, when the fusion module 43 fuses the feature information of the first multimedia data and the feature information of the at least one second multimedia data to obtain the fusion feature of the first multimedia data, it is specifically configured to:
configuring corresponding embedding characteristics for the first multimedia data and the at least one second multimedia data respectively; determining feature information of the first multimedia data according to the multi-modal features of the first multimedia data and the embedded features corresponding to the first multimedia data; determining feature information of the at least one second multimedia data according to the multi-modal features of the at least one second multimedia data and the embedded features corresponding to the at least one second multimedia data; inputting the feature information of the first multimedia data and the feature information of the at least one second multimedia data into a feature fusion model, and executing the feature fusion model to obtain the fusion features of the first multimedia data; and obtaining the feature fusion model through a pre-training process.
Here, it should be noted that: the content of each module in the data recommendation device provided in this embodiment, which is not described in detail in the above embodiments, may refer to the corresponding content in the above embodiments, and is not described in detail here. In addition, the data recommendation device provided in this embodiment may further include, in addition to the functions described above, functions corresponding to other parts or all of the steps in the embodiments, for which reference may be specifically made to corresponding contents in the embodiments, which is not described herein again.
Fig. 15 shows a schematic structural diagram of a short video recommendation apparatus according to an embodiment of the present application. As shown in fig. 15, the apparatus includes: a display module 51, a determining module 52, a fusion module 53, and a recommending module 54. The display module 51 is configured to display a first short video. The determining module 52 is configured to determine at least one second short video related to the first short video. The fusion module 53 is configured to fuse the feature information of the first short video and the feature information of the at least one second short video to obtain a fusion feature of the first short video. The recommending module 54 is configured to determine a recommended object according to the fusion feature of the first short video. The display module is further configured to display the recommended object when a display condition is met.
Here, it should be noted that: the content of each module in the short video recommendation device provided in this embodiment, which is not described in detail in the foregoing embodiments, may refer to the corresponding content in the foregoing embodiments, and is not described in detail herein. In addition, the short video recommendation apparatus provided in this embodiment may further include, in addition to the functions described above, functions corresponding to other parts or all of the steps in the embodiments, for which reference may be specifically made to corresponding contents in the embodiments, and details are not described herein again.
Fig. 16 is a schematic structural diagram of an electronic device according to an embodiment of the present application. As shown in fig. 16, the electronic apparatus includes: a memory 61 and a processor 62. The memory 61 may be configured to store other various data to support operations on the electronic device. Examples of such data include instructions for any application or method operating on the electronic device. The memory 61 may be implemented by any type or combination of volatile or non-volatile memory devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
The memory 61 is configured to store one or more computer instructions;
the processor 62 is coupled to the memory 61 and configured to execute one or more computer instructions stored in the memory 61 to implement the steps in the data recommendation method or the short video recommendation method provided in the foregoing embodiments.
Further, as shown in fig. 16, the electronic device further includes: a communication component 63, a power component 65, and a display 66. Only some components are shown schematically in fig. 16, which does not mean that the electronic device includes only the components shown in fig. 16.
Accordingly, embodiments of the present application also provide a computer-readable storage medium storing a computer program, where the computer program can implement the steps or functions in the data recommendation method or the short video recommendation method provided in the foregoing embodiments when executed by a computer.
Embodiments of the present application further provide a computer program product, which includes a computer program, and when the computer program is executed by a processor, the processor is enabled to implement the steps or functions in the data recommendation method or the short video recommendation method provided in the foregoing embodiments.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (14)

1. A method for recommending data, comprising:
determining target data;
determining at least one associated data related to the target data in the plurality of data according to the associated information among the data;
fusing the feature information of the target data and the feature information of the at least one piece of associated data to obtain a fusion feature of the target data;
and recommending at least one piece of recommendation data for the user based on the fusion characteristics of the target data.
2. The data recommendation method according to claim 1, wherein determining, from among a plurality of data according to the associated information among the data, at least one piece of associated data related to the target data comprises:
constructing a relationship graph according to the associated information among the data, wherein the relationship graph comprises a plurality of nodes and edge information reflecting the relationships among the nodes; different nodes correspond to different data, and the node corresponding to the target data in the relationship graph is a target node;
and, for the target node, sampling nodes in the relationship graph to obtain at least one associated node related to the target node, wherein the data corresponding to the associated node is associated data related to the target data.
3. The data recommendation method of claim 2, wherein sampling nodes in the relationship graph for the target node to obtain at least one associated node related to the target node comprises:
acquiring a target number of iterations and a sampling number;
in one sampling iteration, taking the target node as the sampling origin, and sampling at least one neighbor node adjacent to the sampling origin in the relationship graph;
judging whether the number of sampling iterations reaches the target number;
when the number of sampling iterations does not exceed the target number, entering the next sampling iteration; in the next sampling iteration, taking any one of the at least one neighbor node as the sampling origin, and sampling at least one neighbor node adjacent to that sampling origin in the relationship graph;
and when the number of sampling iterations exceeds the target number, determining, from among the neighbor nodes sampled over the target number of sampling iterations, the sampling number of neighbor nodes as the associated nodes.
4. The data recommendation method of claim 3, wherein, when a plurality of pieces of associated data related to the target data are sampled, the method further comprises:
determining the importance of each of the plurality of pieces of associated data;
configuring a corresponding embedded feature for each piece of associated data according to its importance;
configuring an embedded feature for the target data;
acquiring features of at least one modality of the target data and features of at least one modality of each of the plurality of pieces of associated data;
determining feature information of the target data according to the features of at least one modality of the target data and the embedded feature corresponding to the target data;
and determining feature information of each piece of associated data according to the features of at least one modality of that piece of associated data and the embedded feature corresponding to it.
5. The data recommendation method according to claim 4, wherein the embedded features comprise position features and role features, the target data has multi-modal features, and the multi-modal features comprise features of multiple modalities; and
determining the feature information of the target data according to the multi-modal features of the target data and the embedded feature corresponding to the target data comprises:
determining a weight for the feature of each modality among the multi-modal features of the target data;
determining a content feature of the target data according to the multi-modal features of the target data and the weight of the feature of each modality;
and aggregating the content feature of the target data with the position feature and role feature corresponding to the target data to obtain the feature information of the target data.
6. The data recommendation method according to any one of claims 1 to 5, wherein fusing the feature information of the target data and the feature information of the at least one piece of associated data to obtain a fused feature of the target data comprises:
determining feature similarity of feature information of the target data and feature information of the at least one piece of associated data;
determining attention weight corresponding to the at least one piece of associated data based on the feature similarity;
and performing fusion coding on the feature information of the target data and the feature information of the at least one piece of associated data according to the attention weight corresponding to the at least one piece of associated data to obtain the fusion feature of the target data.
7. The data recommendation method according to any one of claims 1 to 5, wherein fusing the feature information of the target data and the feature information of the at least one piece of associated data to obtain a fused feature of the target data comprises:
acquiring a feature fusion model;
inputting the feature information of the target data and the feature information of the at least one piece of associated data into the feature fusion model, and executing the feature fusion model to obtain fusion features of the target data;
and obtaining the feature fusion model through a pre-training process.
8. The data recommendation method of claim 7, further comprising:
obtaining a sample graph, wherein the sample graph comprises sample nodes and edge information reflecting the relationships between the sample nodes;
for a first sample node in the sample graph, sampling the sample node in the sample graph to sample at least one second sample node related to the first sample node;
respectively determining feature information of the first sample node and feature information of the at least one second sample node according to features of at least one modality of the first sample node and features of at least one modality of the at least one second sample node;
inputting the feature information of the first sample node and the feature information of the at least one second sample node into the feature fusion model to obtain the fusion feature of the first sample node and the fusion feature of the at least one second sample node;
executing a graph reconstruction task and a masked sample node feature reconstruction task based on the fusion feature of the first sample node and the fusion feature of the at least one second sample node, to obtain an execution result corresponding to each task;
and optimizing parameters in the feature fusion model according to the execution result corresponding to each task.
9. A method for recommending data, comprising:
responding to the operation of a user on the interactive interface, and outputting first multimedia data;
determining at least one second multimedia data related to the first multimedia data;
fusing the feature information of the first multimedia data and the feature information of the at least one second multimedia data to obtain a fusion feature of the first multimedia data;
determining recommended data based on the fusion characteristics of the first multimedia data;
and outputting the recommended data for the user when the output condition is met.
10. The method of claim 9, wherein determining at least one second multimedia data related to the first multimedia data comprises:
determining at least one second multimedia data based on historical data associated with the user;
obtaining multi-modal characteristics of the first multimedia data, multi-modal characteristics of the at least one second multimedia data, and behavior data of the user;
determining associated information among the multimedia data according to the multi-modal characteristics of the first multimedia data, the multi-modal characteristics of the at least one second multimedia data and the behavior data of the user;
constructing a relationship graph according to the associated information among the multimedia data, wherein the relationship graph comprises a plurality of nodes and edge information reflecting the relationships among the nodes; the relationship graph comprises: a first node corresponding to the first multimedia data, and at least one second node corresponding to the at least one second multimedia data;
and for the first node, sampling nodes in the relational graph to sample at least one associated node related to the first node, wherein data corresponding to the associated node is second multimedia data related to the first multimedia data.
11. The data recommendation method according to claim 9 or 10, wherein fusing the feature information of the first multimedia data and the feature information of the at least one second multimedia data to obtain a fused feature of the first multimedia data comprises:
configuring corresponding embedding characteristics for the first multimedia data and the at least one second multimedia data respectively;
determining feature information of the first multimedia data according to the multi-modal features of the first multimedia data and the embedded features corresponding to the first multimedia data;
determining feature information of the at least one second multimedia data according to the multi-modal features of the at least one second multimedia data and the embedded features corresponding to the at least one second multimedia data;
inputting the feature information of the first multimedia data and the feature information of the at least one second multimedia data into a feature fusion model, and executing the feature fusion model to obtain the fusion features of the first multimedia data;
and obtaining the feature fusion model through a pre-training process.
12. A method of data classification, comprising:
determining at least one second multimedia data related to the first multimedia data;
fusing the feature information of the first multimedia data and the feature information of the at least one second multimedia data to obtain a fusion feature of the first multimedia data;
and determining the classification of the first multimedia data according to the fusion characteristics of the first multimedia data.
13. A relationship graph-based feature fusion system, comprising:
the node sampling layer is used for sampling nodes in the relational graph aiming at a first node in the relational graph to obtain at least one second node related to the first node;
a node feature information determining layer, configured to configure corresponding embedded features for the first node and the at least one second node according to the multi-modal information of the first node, the multi-modal information of the at least one second node, and the edge information between the first node and the at least one second node, respectively; determine feature information of the first node according to the multi-modal information of the first node and the embedded feature corresponding to the first node; and determine feature information of the at least one second node according to the multi-modal information of the at least one second node and the embedded feature corresponding to the at least one second node;
the characteristic fusion layer is used for fusing the characteristic information of the first node and the characteristic information of the at least one second node to obtain the fusion characteristic of the first node;
the optimization layer is used for executing a graph reconstruction task and a masked node feature reconstruction task according to the fusion feature of the first node to obtain an execution result corresponding to each task; and optimizing parameters in the feature fusion model according to the execution result corresponding to each task.
14. An electronic device comprising a memory and a processor; wherein the memory is used for storing programs; the processor is coupled with the memory and is used for executing the program stored in the memory to realize the steps in the data recommendation method of any one of the claims 1-8; or implementing the steps in the data recommendation method of any of the preceding claims 9-11; or to implement the steps in the data classification method of claim 12 above.
CN202111243283.9A 2021-10-25 2021-10-25 Data recommendation and classification method, feature fusion model and electronic equipment Pending CN114090848A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111243283.9A CN114090848A (en) 2021-10-25 2021-10-25 Data recommendation and classification method, feature fusion model and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111243283.9A CN114090848A (en) 2021-10-25 2021-10-25 Data recommendation and classification method, feature fusion model and electronic equipment

Publications (1)

Publication Number Publication Date
CN114090848A true CN114090848A (en) 2022-02-25

Family

ID=80297605

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111243283.9A Pending CN114090848A (en) 2021-10-25 2021-10-25 Data recommendation and classification method, feature fusion model and electronic equipment

Country Status (1)

Country Link
CN (1) CN114090848A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116561446A (en) * 2023-07-10 2023-08-08 中国传媒大学 Multi-mode project recommendation method, system and device and storage medium
CN116561446B (en) * 2023-07-10 2023-10-20 中国传媒大学 Multi-mode project recommendation method, system and device and storage medium

CN114625954A (en) Information recommendation method, model training method, information characterization method, device and equipment
Xiong et al. An intelligent film recommender system based on emotional analysis
Gu et al. A holistic view on positive and negative implicit feedback for micro-video recommendation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination