CN115114462A - Model training method and device, multimedia recommendation method and device and storage medium
- Publication number: CN115114462A
- Application number: CN202210470106.2A
- Authority: CN (China)
- Prior art keywords: multimedia, matrix, target, feature, training
- Legal status: Pending
Classifications
- G06F16/435: Information retrieval of multimedia data; querying; filtering based on additional data, e.g. user or group profiles
- G06F16/9535: Retrieval from the web; querying by web search engines; search customisation based on user profiles and personalisation
Abstract
The training method of the multimedia feature extraction model comprises the following steps: determining a plurality of training samples from a plurality of multimedia, each training sample comprising an anchor multimedia, a first multimedia, and a second multimedia, where the anchor multimedia is any one of the plurality of multimedia, the first object set corresponding to the first multimedia intersects the anchor object set corresponding to the anchor multimedia, and the second object set corresponding to the second multimedia is disjoint from the anchor object set; sorting the plurality of training samples to obtain a target sample matrix; inputting the target sample matrix into a preset multimedia feature extraction model and performing feature extraction processing to obtain a multimedia feature matrix; determining triplet loss information according to the multimedia feature matrix; and training the preset multimedia feature extraction model based on the triplet loss information to obtain a target multimedia feature extraction model. Because object information is introduced into the generation of the multimedia-side training samples, the accuracy of the target multimedia feature extraction model can be improved.
Description
Technical Field
The invention relates to the technical field of computers, in particular to a model training method and device, a multimedia recommendation method and device and a storage medium.
Background
In the field of multimedia recommendation (e.g., advertisement recommendation), existing multimedia recommendation systems generate multimedia-side features and object-side features independently, i.e., separately. Taking the multimodal feature embedding commonly used among multimedia-side features as an example, the corresponding multimedia-side feature generation model is generally trained with the multimodal material of the multimedia itself (video frames, title, audio, etc.) as input. After the multimedia-side feature generation model converges, in the testing stage, the multimodal material of a multimedia under test is input, the output of some intermediate layer of the model is taken, according to service requirements, as a multi-dimensional feature embedding of the whole multimodal material, and this embedding, together with the object features generated by the object-side feature generation model, is used as input to the subsequent models in the recommendation pipeline to obtain the recommended multimedia.
Therefore, in existing multimedia recommendation technology, multimedia-side feature generation is unrelated to the object side; object information and multimedia material are not fully exploited, so the multimedia recommendation accuracy is low and multimedia meeting an object's requirements cannot be pushed to that object.
Disclosure of Invention
In view of the above technical problems, the present application provides a model training method and apparatus, a multimedia recommendation method and apparatus, and a storage medium.
According to an aspect of the present application, there is provided a training method of a multimedia feature extraction model, including:
determining a plurality of training samples from a plurality of multimedia, each training sample comprising an anchor multimedia, a first multimedia, and a second multimedia; the anchor multimedia is any one of the plurality of multimedia; a first object set corresponding to the first multimedia intersects an anchor object set corresponding to the anchor multimedia; and a second object set corresponding to the second multimedia is disjoint from the anchor object set;
sorting the plurality of training samples to obtain a target sample matrix;
inputting the target sample matrix into a preset multimedia feature extraction model, and performing feature extraction processing to obtain a multimedia feature matrix, wherein the number of rows of the multimedia feature matrix is the same as that of the target sample matrix;
determining triplet loss information according to the multimedia feature matrix;
and training the preset multimedia feature extraction model based on the triplet loss information until a preset condition is met, to obtain a target multimedia feature extraction model.
According to another aspect of the present application, there is provided a multimedia recommendation method including:
determining an object to be recommended and a target multimedia corresponding to the object to be recommended;
inputting the target multimedia into a target multimedia feature extraction model, and performing feature extraction processing on the target multimedia to obtain multimedia features corresponding to the target multimedia, wherein the target multimedia feature extraction model is obtained according to the training method;
inputting the multimedia features corresponding to the target multimedia into a prediction model, and performing prediction processing on a plurality of preset multimedia according to the multimedia features corresponding to the target multimedia to obtain respective prediction probabilities of the plurality of preset multimedia;
determining multimedia to be recommended from the preset multimedia according to the respective prediction probabilities of the preset multimedia;
and recommending the multimedia to be recommended to the object to be recommended.
According to another aspect of the present application, there is provided a multimedia feature extraction model training apparatus, including:
a determining module for determining a plurality of training samples from a plurality of multimedia, each training sample comprising an anchor multimedia, a first multimedia, and a second multimedia; the anchor multimedia is any one of the plurality of multimedia; a first object set corresponding to the first multimedia intersects an anchor object set corresponding to the anchor multimedia; and a second object set corresponding to the second multimedia is disjoint from the anchor object set;
a sorting module for sorting the plurality of training samples to obtain a target sample matrix;
a feature extraction module for inputting the target sample matrix into a preset multimedia feature extraction model and performing feature extraction processing to obtain a multimedia feature matrix, wherein the number of rows of the multimedia feature matrix is the same as that of the target sample matrix;
a triplet loss determination module for determining triplet loss information according to the multimedia feature matrix;
and a model training module for training the preset multimedia feature extraction model based on the triplet loss information until a preset condition is met, to obtain a target multimedia feature extraction model.
According to another aspect of the present application, there is provided an electronic device including: a processor; and a memory for storing processor-executable instructions; wherein the processor is configured to execute the above training method of the multimedia feature extraction model or the above multimedia recommendation method.
According to another aspect of the present application, there is provided a non-transitory computer-readable storage medium having computer program instructions stored thereon, wherein the computer program instructions, when executed by a processor, implement the above training method of the multimedia feature extraction model or the above multimedia recommendation method.
In the above training samples, each training sample comprises an anchor multimedia, a first multimedia and a second multimedia; the first object set corresponding to the first multimedia intersects the anchor object set corresponding to the anchor multimedia, and the second object set corresponding to the second multimedia is disjoint from the anchor object set, so the anchor multimedia and the first multimedia have an object-based correlation while the anchor multimedia and the second multimedia do not. In other words, object information is integrated into the training samples. Performing model training based on such training samples combines the object information with the multimedia material and fully mines the association between them, yielding more accurate multimedia features and thereby improving the accuracy of subsequent multimedia recommendation. In addition, in the model training process each multimedia is treated as a whole unit, which ensures the robustness of multimedia feature extraction and allows effective adaptation to different recommendation scenarios.
Other features and aspects of the present application will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments, features, and aspects of the application and, together with the description, serve to explain the principles of the application.
Fig. 1 is a schematic diagram illustrating an application system according to an embodiment of the present application.
Fig. 2 is a flowchart illustrating a method for training a multimedia feature extraction model according to an embodiment of the present application.
Fig. 3 is a flowchart illustrating a method for determining a plurality of training samples from a plurality of multimedia according to an embodiment of the present application.
Fig. 4 is a flowchart illustrating a method for performing a ranking process on a plurality of training samples to obtain a target sample matrix according to an embodiment of the present application.
Fig. 5 is a flowchart illustrating a method for inputting a target sample matrix into a preset multimedia feature extraction model and performing feature extraction processing to obtain a multimedia feature matrix according to an embodiment of the present application.
Fig. 6 is a flowchart illustrating a method for determining triplet loss information according to a multimedia feature matrix according to an embodiment of the present application.
Fig. 7 is a flowchart illustrating a method for grouping multimedia features in a multimedia feature matrix to obtain a target feature matrix according to an embodiment of the present application.
Fig. 8 is a flowchart illustrating a method for training a multimedia feature extraction model according to another embodiment of the present application.
Fig. 9 is a data processing diagram illustrating a training method of a multimedia feature extraction model according to an embodiment of the present application.
Fig. 10 is a flowchart illustrating a multimedia recommendation method according to an embodiment of the present application.
Fig. 11 is a block diagram illustrating a multimedia feature extraction model training apparatus according to an embodiment of the present application.
FIG. 12 shows a block diagram of an electronic device for multimedia feature extraction model training or multimedia recommendation provided according to an embodiment of the present application.
Detailed Description
Various exemplary embodiments, features and aspects of the present application will be described in detail below with reference to the accompanying drawings. In the drawings, like reference numbers indicate functionally identical or similar elements. While the various aspects of the embodiments are presented in drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
The word "exemplary" is used exclusively herein to mean "serving as an example, embodiment, or illustration. Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
Furthermore, in the following detailed description, numerous specific details are set forth in order to provide a better understanding of the present application. It will be understood by those skilled in the art that the present application may be practiced without some of these specific details. In some instances, methods, means, elements and circuits that are well known to those skilled in the art have not been described in detail so as not to obscure the present application.
Referring to fig. 1, fig. 1 is a schematic diagram illustrating an application system according to an embodiment of the present application. The application system can be used for the training method of the multimedia feature extraction model. As shown in fig. 1, the application system may include at least a server 01 and a terminal 02.
In this embodiment of the present application, the server 01 may be used for multimedia resource processing, for example, multimedia resource search processing, and the server 01 may include an independent physical server, or may be a server cluster or a distributed system formed by a plurality of physical servers, or may be a cloud server that provides basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a Network service, cloud communication, a middleware service, a domain name service, a security service, a CDN (Content Delivery Network), and a big data and artificial intelligence platform.
In the embodiment of the application, the terminal 02 can receive and display the target multimedia resource. The terminal 02 may include a smart phone, a desktop computer, a tablet computer, a notebook computer, an intelligent voice interaction device, an intelligent appliance, a digital assistant, an Augmented Reality (AR)/Virtual Reality (VR) device, an intelligent wearable device, a vehicle-mounted terminal, an aircraft, and other types of physical devices. The physical device may also include software running on it, such as an application program. The operating system running on terminal 02 in this embodiment of the present application may include, but is not limited to, Android, iOS, Linux, Windows, and the like.
In the embodiment of the present disclosure, the terminal 02 and the server 01 may be directly or indirectly connected by a wired or wireless communication method, and the present disclosure is not limited thereto.
In a specific embodiment, when the server 01 is a distributed system, the distributed system may be a blockchain system. When the distributed system is a blockchain system, it may be formed by a plurality of nodes (computing devices in any form in an access network, such as servers and user terminals) that form a Peer-to-Peer (P2P) network; the P2P protocol is an application-layer protocol running on top of the Transmission Control Protocol (TCP). In a distributed system, any machine, such as a server or a terminal, can join to become a node, and a node comprises a hardware layer, a middle layer, an operating system layer and an application layer. Specifically, the functions of each node in the blockchain system may include:
routing, a basic function that a node has, is used to support communication between nodes.
Besides the routing function, the node may also have the following functions:
2) the application is used for being deployed in a block chain, realizing specific services according to actual service requirements, recording data related to the realization functions to form recorded data, carrying a digital signature in the recorded data to represent a source of task data, and sending the recorded data to other nodes in the block chain system, so that the recorded data is added to a temporary block when the other nodes verify the source and the integrity of the recorded data.
It should be noted that the specific implementation manners of the present application involve data related to user information; when the following embodiments of the present application are applied to specific products or technologies, user permission or consent needs to be obtained, and the collection, use and processing of the related data need to comply with the relevant laws, regulations and standards of the relevant countries and regions.
The method can be used for model training based on Machine Learning (ML) technology. Machine learning is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other subjects. It specializes in studying how a computer simulates or realizes human learning behavior to acquire new knowledge or skills and to reorganize existing knowledge structures so as to continuously improve its performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from instruction.
Fig. 2 shows a flowchart of a training method of a multimedia feature extraction model according to an embodiment of the present application. The multimedia feature extraction model may refer to a model for extracting multimedia features in a multimedia recommendation system; the multimedia recommendation system may include an advertisement recommendation system, a short video recommendation system, a music recommendation system, and the like, and may be a two-tower model or a single-tower model, which is not limited in this disclosure.
With reference to fig. 2 to fig. 9, a method for training a multimedia feature extraction model provided in an embodiment of the present specification may include:
step S11: determining a plurality of training samples from a plurality of multimedia, each training sample comprising an anchor multimedia, a first multimedia, and a second multimedia; the anchor multimedia is any one of a plurality of multimedia; a first object set corresponding to the first multimedia is intersected with an anchor point object set corresponding to the anchor point multimedia; the second set of objects corresponding to the second multimedia is disjoint from the set of anchor objects.
In the embodiments of the present specification, the type of multimedia may be video, audio, image, or text, and the present disclosure does not limit this. Each training sample comprises three multimedia, which may also be referred to as a triplet sample, wherein the anchor multimedia in the training sample may be referred to as an anchor sample, the first multimedia may also be referred to as a positive sample of the anchor multimedia, and the second multimedia may also be referred to as a negative sample of the anchor multimedia. An anchor multimedia may be determined from a plurality of multimedia, each of the plurality of multimedia may serve as the anchor multimedia.
The anchor object set may be a set of objects corresponding to anchor multimedia, and the objects corresponding to the anchor multimedia may be referred to as anchor objects; the first set of objects may be a set of objects corresponding to a first multimedia, which may be referred to as a first object; the second set of objects may be a set of objects corresponding to a second multimedia, which may be referred to as a second object. Wherein the object may be a user.
In one example, if an object satisfies a preset correspondence condition, the object may be an object corresponding to the multimedia. The preset correspondence condition may be that a preset operation has been performed on the multimedia, which the present disclosure does not limit; the preset operation may be clicking a watch button, a forward button, or a purchase button, and so on. For example, if an object clicks the purchase button corresponding to a multimedia, the object may be an object corresponding to that multimedia.
In another example, if the object satisfies the preset tag condition, the object may be an object corresponding to multimedia; the preset tag condition may be that the object classification tag matches a multimedia classification tag of the multimedia. For example, if the multimedia is a funny video, the multimedia classification tag of the multimedia is "fun", the classification tag to which the object belongs may be a tag selected by the object itself, and if the classification tag selected by the object is "fun", the object classification tag of the object may be "fun", and the object may be an object corresponding to the multimedia.
In this embodiment of the present specification, within the same training sample, the first object set corresponding to the first multimedia intersects the anchor object set corresponding to the anchor multimedia, which indicates that at least one object has performed the preset operation on both the anchor multimedia and the first multimedia; it further indicates that an object-based association exists between the anchor multimedia and the first multimedia. Within the same training sample, the second object set corresponding to the second multimedia is disjoint from the anchor object set, which indicates that no object has performed the preset operation on both the anchor multimedia and the second multimedia, and further that no object-based association exists between the anchor multimedia and the second multimedia.
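For illustration only, the positive/negative criterion described above reduces to plain set operations over object sets; the following Python sketch is not part of the claimed method, and the function name is hypothetical:
```python
def classify_candidate(anchor_objects: set, candidate_objects: set) -> str:
    """Classify a candidate multimedia against an anchor multimedia.

    A candidate whose object set intersects the anchor object set is a
    positive (first multimedia); a disjoint object set makes it a
    negative (second multimedia).
    """
    return "positive" if anchor_objects & candidate_objects else "negative"
```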
In one example, an object may be determined first, and an anchor multimedia and a first multimedia are determined from the multimedia corresponding to that object; then the anchor object set corresponding to the anchor multimedia is determined, the multimedia corresponding to each object in the anchor object set are determined, and any multimedia other than those is used as the second multimedia, thereby obtaining a training sample. Other training samples can be obtained in the same way.
Step S12: sorting the plurality of training samples to obtain a target sample matrix.
In this embodiment of the present specification, the sorting process may arrange the P multimedia of the plurality of training samples into P columns (P being the total number of multimedia in the plurality of training samples), so as to obtain a target sample matrix with 1 row and P columns (1 row × P columns); the P pieces of column information of the target sample matrix correspond one-to-one to the P multimedia of the plurality of training samples.
In other examples, the row and column values of the target sample matrix may be other values, which is not limited by the present disclosure as long as the target sample matrix can represent multiple multimedia of multiple training samples.
Step S13: inputting the target sample matrix into a preset multimedia feature extraction model, and performing feature extraction processing to obtain a multimedia feature matrix, wherein the number of rows of the multimedia feature matrix is the same as that of the target sample matrix.
In an embodiment of the present specification, the preset multimedia feature extraction model may perform feature extraction processing on each multimedia of the target sample matrix to obtain a multimedia feature corresponding to each multimedia in the target sample matrix, where the multimedia feature may be a CLS feature (CLS emb). The multimedia features corresponding to each multimedia in the target sample matrix form a multimedia feature matrix.
The number of rows of the multimedia feature matrix is the same as the number of rows of the target sample matrix. In one example, the target sample matrix and the multimedia feature matrix are both 1 row × P column matrices, P column information of the target sample matrix corresponds to P multimedia one by one, and P column information of the multimedia feature matrix corresponds to P multimedia features one by one.
Step S14: determining triplet loss information according to the multimedia feature matrix.
In this embodiment, the multimedia features corresponding to each training sample may be substituted into a triplet loss function to obtain the triplet loss information (triplet loss). The triplet loss information may be determined in an unsupervised manner. In one example, the multimedia features corresponding to the training samples may be obtained from the multimedia feature matrix, and the triplet loss information determined based on these multimedia features.
In the embodiment of the present specification, a preset triplet loss function may be obtained, whose mathematical expression is as follows:
L=max(d(a,p)-d(a,n)+margin,0)
wherein d(a, p) represents the distance between the anchor multimedia and the first multimedia of the same training sample, d(a, n) represents the distance between the anchor multimedia and the second multimedia of the same training sample, and margin is a constant greater than 0.
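For illustration, this is the standard triplet margin loss; a minimal PyTorch sketch follows, with the margin value (0.2) and Euclidean distance assumed rather than specified by this application:
```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.2):
    """L = max(d(a, p) - d(a, n) + margin, 0), averaged over a batch."""
    d_ap = F.pairwise_distance(anchor, positive)  # d(a, p)
    d_an = F.pairwise_distance(anchor, negative)  # d(a, n)
    return torch.clamp(d_ap - d_an + margin, min=0).mean()
```
torch.nn.TripletMarginLoss provides the same computation as a ready-made module.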
Step S15: training the preset multimedia feature extraction model based on the triplet loss information until a preset condition is met, to obtain a target multimedia feature extraction model.
In this embodiment of the present disclosure, if the triplet loss information does not satisfy the preset condition, the model parameters of the preset multimedia feature extraction model may be adjusted based on the triplet loss information, the adjusted model is used as a new preset multimedia feature extraction model, and the process returns to step S13 until the obtained triplet loss information satisfies the preset condition. The preset condition may be that the value of the triplet loss information falls within a preset range.
In this embodiment of the present specification, each training sample includes an anchor multimedia, a first multimedia, and a second multimedia, where the first object set corresponding to the first multimedia intersects the anchor object set corresponding to the anchor multimedia, and the second object set corresponding to the second multimedia is disjoint from the anchor object set. It can be seen that a correlation based on the anchor objects (the objects in the anchor object set) exists between the anchor multimedia and the first multimedia, while no object-based correlation exists between the anchor multimedia and the second multimedia; that is, the training samples integrate object information (object behaviors, object preferences, and the like). Performing model training based on such training samples combines the object information with the multimedia material, integrates the object information into the multimedia-side feature generation process, and fully mines the association between the object information and the multimedia material, so that more accurate multimedia features are obtained and the accuracy of subsequent multimedia recommendation can be improved.
In the model training process, each multimedia is treated entirely as a whole unit, which ensures the robustness of multimedia feature extraction, makes training less susceptible to data-quality issues (such as image-text mismatches and primary/secondary labeling errors), and allows effective adaptation to different recommendation scenarios.
Because object information is diverse (for example, diversity caused by changes in an object's interests), and the training samples in this embodiment are determined based on different object information, the training samples can embody this diversity; thus the diversity of the multimedia to be recommended can be improved in subsequent actual recommendation, and the exposure of various multimedia can be increased while the requirements of the object to be recommended are still met.
As the object information is updated, the correspondence between objects and multimedia is updated accordingly, so the training samples change accordingly and multimedia features that more fully embody the object information are obtained. Therefore, the target multimedia feature extraction model obtained by the embodiment of the specification can produce more accurate multimedia features, improving the accuracy of subsequent recommendation so that multimedia meeting the requirements of the object to be recommended is recommended to that object.
The model is trained based on the triplet loss information, which directly targets the final recommendation tendency; in the application scenario of multimedia recommendation, this loss is more direct than other losses and can therefore bring a better effect to the subsequent recommendation model.
As shown in fig. 3, in one possible implementation manner, step S11 may include:
step S111: an anchor multimedia is determined from the plurality of multimedia.
In one example, the anchor multimedia may be randomly determined from the plurality of multimedia. In another example, the anchor multimedia may be a multimedia among the plurality of multimedia that satisfies an anchor multimedia condition, where the anchor multimedia condition may be that the number of times it has been browsed reaches a preset number. In other examples, the anchor multimedia may also be a multimedia whose corresponding object satisfies an anchor object condition, where the anchor object condition may be that the total number of multimedia browsed by the object reaches a preset value; objects satisfying the anchor object condition may be determined as candidate objects, and one multimedia may be randomly determined from the multimedia corresponding to the candidate objects as the anchor multimedia.
Step S112: determining an anchor object set, a candidate multimedia set, and a second multimedia set corresponding to the anchor multimedia, wherein the anchor object set is the set of objects that have performed the preset operation on the anchor multimedia, the candidate multimedia set includes the multimedia on which each anchor object in the anchor object set has performed the preset operation, and the second multimedia set includes the multimedia among the plurality of multimedia other than the candidate multimedia set.
In this embodiment of the present specification, the anchor object set, the candidate multimedia set, and the second multimedia set corresponding to each anchor multimedia may be determined based on that anchor multimedia. The preset operation may be clicking a watch button, a forward button, or a purchase button in the multimedia, or browsing the multimedia for a duration reaching a preset duration, which is not limited in the present disclosure. For example, if an object clicks the purchase button corresponding to a multimedia, the object may be an object corresponding to that multimedia.
Step S113: screening a first multimedia set from the candidate multimedia set, wherein the first multimedia set includes the multimedia in the candidate multimedia set other than the anchor multimedia.
In an embodiment of the present specification, the candidate multimedia set corresponding to the anchor multimedia includes the anchor multimedia, and the anchor multimedia can be removed from the candidate multimedia set, so as to obtain the first multimedia set by screening.
Step S114: arranging the anchor multimedia, the first multimedia, and the second multimedia according to a preset multimedia arrangement order to obtain a plurality of training samples, where the first multimedia is any multimedia in the first multimedia set and the second multimedia is any multimedia in the second multimedia set.
In the embodiment of the present specification, the training samples corresponding to each anchor multimedia are determined one by one.
In one example, each multimedia in the first set of multimedia can be used to construct a training sample, and each multimedia in the second set of multimedia can be used to construct a training sample. The first multimedia in the training sample may be randomly determined from the first set of multimedia and the second multimedia may be randomly determined from multimedia other than the candidate set of multimedia.
In another example, a subset may be randomly determined from a second set of multimedia, from which the second multimedia may be randomly determined.
In an embodiment of this specification, a training sample may be obtained from an anchor point multimedia, a first multimedia corresponding to the anchor point multimedia, and a second multimedia corresponding to the anchor point multimedia, and a multimedia ranking order of three multimedia in the training sample may be: anchor multimedia, first multimedia, and second multimedia.
In one example, take the preset operation as clicking a view button. The total set of all multimedia is denoted C. Each object may correspond to a multimedia list that includes all multimedia the object clicked to view within a preset time (e.g., one week). Objects whose multimedia lists contain 2 or more multimedia are determined as candidate objects, and the set of candidate objects is denoted U. For a candidate object u_i, the corresponding multimedia list includes n_i multimedia, denoted {c_i1, c_i2, ..., c_in_i}.
A multimedia c_m is randomly determined from the set C as the anchor multimedia; all multimedia lists containing c_m are found, and all other multimedia in those lists (excluding c_m itself) are merged. The resulting set is the first multimedia set POS_m corresponding to the current anchor multimedia. The second multimedia set corresponding to the current anchor multimedia is NEG_m = C \ (POS_m ∪ {c_m}), i.e., the total set with POS_m and c_m itself removed.
Training samples (triplet samples) are constructed based on c_m, POS_m and NEG_m; each triplet contains 3 multimedia, where the 1st position is fixed as c_m, the 2nd position is any multimedia in POS_m, and the 3rd position is any multimedia in NEG_m. Because the number of multimedia in NEG_m is often very large, using all of them would result in severe data imbalance, so n_used negative samples (e.g., n_used = 10) can be randomly selected from NEG_m; the number of training samples corresponding to c_m is therefore |POS_m| · n_used.
Training sample data is constructed for at least one anchor multimedia, giving the training sample set T = {(a_i, pos_ij, neg_ijk)}, where a denotes an anchor, i = 0, ..., |C|-1, j = 0, ..., |POS_i|-1, and k = 0, ..., |NEG_i|-1; |POS_i| and |NEG_i| respectively denote the number of multimedia in POS_i and NEG_i. Thus, embodiments of the present description may obtain a plurality of training samples.
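A sketch of this sampling procedure under the stated assumptions (per-object click lists, n_used = 10); all names are illustrative and not part of the application:
```python
import random

def build_triplets(click_lists, all_media, n_used=10):
    """Build (anchor, positive, negative) triplets from per-object click lists.

    click_lists: one list of multimedia ids per candidate object
                 (objects with at least 2 clicked multimedia).
    all_media:   the total multimedia set C, as a Python set.
    """
    triplets = []
    for c_m in all_media:                        # each multimedia can be an anchor
        pos_m = set()                            # POS_m: co-clicked multimedia
        for lst in click_lists:
            if c_m in lst:
                pos_m.update(lst)
        pos_m.discard(c_m)
        neg_m = list(all_media - pos_m - {c_m})  # NEG_m = C \ (POS_m ∪ {c_m})
        if not pos_m or not neg_m:
            continue
        # subsample negatives to avoid severe data imbalance
        negs = random.sample(neg_m, min(n_used, len(neg_m)))
        triplets.extend((c_m, p, n) for p in pos_m for n in negs)
    return triplets
```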
In one possible implementation manner, in each training sample, the anchor point multimedia, the first multimedia and the second multimedia are arranged according to a preset multimedia arrangement sequence; two adjacent training samples in the target sample matrix are different training samples.
In this embodiment of the present specification, based on step S114, it can be ensured that the arrangement order of the three multimedia is the same in different training samples. In step S12, the plurality of training samples may be randomly sorted in units of training samples; in the random sorting process, one training sample is arranged adjacent to another training sample. The adjacent arrangement may be along a matrix row or along a matrix column, which is not limited in this disclosure; each row of information of the target sample matrix corresponds to at least one training sample, or each column of information corresponds to at least one training sample. In one example, the target sample matrix is a 1 row × P column matrix, and its row information corresponds to all training samples.
In this embodiment, in step S12, the plurality of training samples may be sorted, and the obtained target sample matrix makes it possible in step S14 to determine the triplet loss information with the multimedia features corresponding to each training sample as a unit.
In a possible implementation manner, as shown in fig. 6, step S14 may include:
step S141: and grouping the multimedia features in the multimedia feature matrix to obtain a target feature matrix, wherein the target feature matrix is sample features corresponding to a plurality of training samples, and each sample feature is a multimedia feature corresponding to the anchor multimedia, the first multimedia and the second multimedia.
In this embodiment, an arrangement order of multimedia features in the multimedia feature matrix may correspond to an arrangement order of multimedia in the target sample matrix, and the target sample matrix is obtained by sorting the multimedia in units of training samples, and the multimedia belonging to the same training sample in the target sample matrix may be adjacently arranged. In step S141, the three multimedia corresponding to the training samples may be taken as an integral unit, and the multimedia feature matrix is subjected to grouping processing to obtain sample features corresponding to each training sample, where each sample feature includes multimedia features of the three multimedia in the training samples.
Step S142: determining triplet state loss information according to a plurality of sample characteristics of the target characteristic matrix.
In the embodiment of the present specification, a plurality of sample features may be substituted into the triplet loss function to obtain triplet loss information (triplet loss).
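Under this grouping, the loss computation amounts to slicing the target feature matrix column-wise; a sketch assuming the sample features are held in a (Q, 3, D) tensor:
```python
import torch

def triplet_loss_from_groups(target_features, margin=0.2):
    """target_features: (Q, 3, D) tensor whose columns hold the features of
    the anchor, first (positive) and second (negative) multimedia."""
    anchor, positive, negative = target_features.unbind(dim=1)
    return torch.nn.TripletMarginLoss(margin=margin)(anchor, positive, negative)
```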
In a possible implementation manner, as shown in fig. 4, step S12 may include:
step S121: and carrying out sample interval arrangement processing on the plurality of training samples to obtain a reference sample matrix, wherein a plurality of rows of information of the reference sample matrix respectively correspond to the plurality of training samples.
In this embodiment, the inter-sample permutation may be performed by arranging a plurality of training samples into a plurality of rows to obtain a reference sample matrix (Q rows × 3 columns), and at the same time, the permutation processing of three multimedia inside the training samples is not changed in the process of the inter-sample permutation processing. Each row information of the reference sample matrix corresponds to a training sample, and the number of rows of the reference sample matrix is equal to the number Q of training samples. Each row of the reference sample matrix comprises three columns of information corresponding to the three multimedia of the training sample, wherein a first column may characterize the anchor multimedia, a second column may characterize the second multimedia, and a third column may characterize the third multimedia. In the reference sample matrix, the arrangement between the plurality of training samples may be random. In the examples of the present specification, Q3 ═ P.
Step S122: performing sample row-column conversion processing on the reference sample matrix to obtain the target sample matrix, wherein the number of rows of the target sample matrix is one.
In this embodiment, the sample row-column conversion processing may convert the reference sample matrix (Q rows × 3 columns) into a 1 row × P column matrix: the 1st column of the s-th row of the reference sample matrix is converted into the (3s-2)-th column, the 2nd column of the s-th row into the (3s-1)-th column, and the 3rd column of the s-th row into the (3s)-th column, thereby obtaining the 1 row × P column target sample matrix.
In this embodiment, in step S13, the target sample matrix is input into a preset multimedia feature extraction model, so as to obtain a multimedia feature matrix with 1 row × P columns. As shown in fig. 9, the dotted arrow on the left side of fig. 9 indicates that the arrangement order of the multimedia features in the multimedia feature matrix may correspond to the arrangement order of the multimedia in the target sample matrix.
As shown in fig. 7, in a possible implementation manner, step S141 may include:
step S1411: and performing characteristic row-column conversion processing on the multimedia characteristic matrix based on a plurality of training samples to obtain a target characteristic matrix, wherein the characteristic row-column conversion processing is the reverse conversion processing of the sample row-column conversion processing, and the row number of the target characteristic matrix is the number of the plurality of training samples.
In this embodiment, the grouping processing in step S141 may include performing characteristic row-column conversion processing on the multimedia characteristic matrix; the multimedia feature matrix can be subjected to feature row-column conversion processing based on the conversion relation between the reference sample matrix and the target sample matrix to obtain a target feature matrix. As shown in fig. 9, the dotted arrow on the right side of fig. 9 indicates that the arrangement order of the multimedia features in the target feature matrix may correspond to the arrangement order of the multimedia in the reference sample matrix. In one example, the feature row-column conversion process may be converting a multimedia feature matrix of 1 row × P columns into a matrix of Q rows × 3 columns, and the row-column after conversion may be determined according to the quotient (t) and remainder (w) of P ÷ 3 in the pth column in the multimedia feature matrix. Wherein, the row where the p column is converted is the (t +1) th row; under the condition that w is not equal to 0, the column where the p-th column is converted is the w-th column; in the case where w is equal to 0, the column in which the p-th column is converted is the 3 rd column, thereby obtaining a target feature matrix of Q rows × 3 columns.
The Q pieces of row information in the target feature matrix may correspond to Q training samples one-to-one. Three columns of information may be included in each row of information of the object feature matrix, the three columns of information corresponding to three multimedia features, respectively, wherein a first multimedia feature corresponds to the anchor multimedia, a second multimedia feature corresponds to the first multimedia, and a third multimedia feature corresponds to the second multimedia.
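Both conversions are plain reshapes; a sketch with assumed sizes (Q = 4 training samples, 768-dimensional features):
```python
import torch

Q, D = 4, 768
reference = torch.arange(Q * 3).reshape(Q, 3)  # reference sample matrix, Q x 3
target = reference.reshape(1, Q * 3)           # sample row-column conversion, 1 x P

# ... the model maps the 1 x P samples to a 1 x P feature matrix ...
features = torch.randn(1, Q * 3, D)

# feature row-column conversion: inverse reshape back to Q rows x 3 columns,
# so that row q again holds the three features of training sample q
grouped = features.reshape(Q, 3, D)
```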
Referring to fig. 5, in a possible implementation manner, the preset multimedia feature extraction model includes a coding model, a fusion model, and a segmentation model.
Step S13 may include:
step S131: and inputting the target sample matrix into the coding model, and coding the target sample matrix to obtain a single-mode characteristic matrix.
In this embodiment, the format of each multimedia in the target sample matrix may be a single-modality identifier sequence format. The number of the target sample matrixes can be one or more; in the case that the number of the target sample matrix is one, the modality of the identifier sequence may be one of a text modality, a video modality, an audio modality and the like; in the case where the number of the target sample matrix is plural, the modalities of the identifier sequence may include plural ones of a text modality, a video frame modality, and an audio modality.
In this embodiment of the present specification, before step S131, after the target sample matrix is obtained, decomposition processing and sequence processing may be performed on the multimedia in the target sample matrix: each multimedia is decomposed to obtain two types of single-modal data (text-modality data and video-modality data); then sequence processing is performed on the single-modal data to obtain the identifier sequences of the two single modalities of each multimedia. For example, a long sentence of the text modality can be processed into a word sequence, and the video modality can be processed into a video frame sequence; a target sample matrix in the text-modality identifier sequence format and a target sample matrix in the video-modality identifier sequence format are thereby obtained.
In other examples, before step S131, decomposition processing and sequence processing may be performed on all multimedia of all training samples to obtain training samples in the identifier sequence format; these training samples are then sorted to obtain target sample matrices in the single-modal identifier sequence format.
In step S131, the coding model may be a single-modal coding model, which codes the multimedia in the single-modal identifier sequence format in the target sample matrix to obtain the single-modal feature corresponding to each multimedia. Target sample matrices in different modality identifier sequence formats can be input into the coding models corresponding to their respective modalities. For the text modality, common text coding models include Text-RCNN, BERT, and the like; for the video modality, common video coding models include C3D, EfficientNet, Video Swin, and the like. For each identifier sequence, the coding model of the corresponding modality may output a single-modal feature of fixed dimension (single-modal embedding), which in one example may be 768-dimensional. Taking the case where one multimedia is decomposed into two types of single-modal data (e.g., text-modality data and video-modality data) as an example, in step S131 the corresponding text-modality feature and video-modality feature can be obtained for each multimedia, yielding the text-modality feature matrix and video-modality feature matrix corresponding to the target sample matrix.
Step S132: inputting the single-modal feature matrix into the fusion model, and performing fusion processing on the single-modal feature matrix to obtain a fusion-modality feature matrix.
In this embodiment of the present disclosure, the fusion model may be a fusion Transformer model, generally a multi-layer Transformer structure, used to perform fusion transformation on the input single-modal features to obtain fusion-modality features. The fusion-modality feature matrix may include the fusion-modality feature corresponding to each multimedia in the target sample matrix, with one multimedia corresponding to one fusion-modality feature.
In the embodiment of the present specification, the parameters of the fusion model may be adjusted correspondingly according to the number of types of input single-modal feature matrices, so as to ensure that the fusion model can process one single-modal feature matrix or several single-modal feature matrices simultaneously.
In an example, referring to fig. 9, taking two single modalities (text modality and video modality) as an example, the text modality feature matrix and the video modality feature matrix obtained in step S131 are input to the fusion model together, and modality interaction is performed on the text modality feature matrix and the video modality feature matrix to obtain a fusion modality feature matrix.
In another example, taking a modality (text modality) as an example, the text modality feature matrix obtained in step S131 is input to the fusion model, and modality interaction is performed on the text modality feature matrix to obtain a fusion modality feature matrix.
Step S133: inputting the fusion-modality feature matrix into the segmentation model, and segmenting each fusion-modality feature of the fusion-modality feature matrix to obtain the multimedia feature matrix.
In an embodiment of the present specification, a fusion-modality feature may include a multimedia feature (cls emb) and modality-corresponding features. Taking the case where a multimedia is decomposed into two types of single-modal data (e.g., text-modality data and video-modality data) as an example, the first position (position 0) of the fusion-modality feature is the cls flag bit (the multimedia feature cls emb), which aggregates the information of all text and video frames and can be regarded as a comprehensive representation of the multimedia; it is followed in order by the features corresponding to the text modality and the features corresponding to the visual modality.
In this embodiment, the segmentation model may perform segmentation processing on the fusion-modality features to obtain the multimedia feature corresponding to each multimedia. The multimedia feature matrix may include the multimedia features corresponding to each multimedia in the target sample matrix.
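A structural sketch of the three sub-models (coding, fusion, segmentation); every module choice and dimension below is an assumption for illustration, not the architecture mandated by this application:
```python
import torch
import torch.nn as nn

class MultimediaFeatureExtractor(nn.Module):
    """Sketch: per-modality encoders -> fusion Transformer -> CLS split."""

    def __init__(self, d_model=768, n_layers=4):
        super().__init__()
        self.text_encoder = nn.Embedding(30000, d_model)     # stand-in for BERT etc.
        self.frame_encoder = nn.Linear(2048, d_model)        # stand-in for a video CNN
        self.cls = nn.Parameter(torch.zeros(1, 1, d_model))  # cls flag bit
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, text_ids, frame_feats):
        text = self.text_encoder(text_ids)       # (B, Lt, D) text-modality features
        video = self.frame_encoder(frame_feats)  # (B, Lv, D) video-modality features
        cls = self.cls.expand(text.size(0), -1, -1)
        fused = self.fusion(torch.cat([cls, text, video], dim=1))
        # segmentation: position 0 is the multimedia feature (cls emb), followed
        # by the text-modality features and then the visual-modality features
        cls_emb = fused[:, 0]
        text_emb = fused[:, 1 : 1 + text.size(1)]
        video_emb = fused[:, 1 + text.size(1) :]
        return cls_emb, text_emb, video_emb
```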
With reference to fig. 8, in a possible implementation manner, the method may further include:
step S16: and determining the monomodal loss information according to the monomodal feature matrix.
In this embodiment, after step S131, the single-mode feature in the single-mode feature matrix may be substituted into the single-mode loss function to obtain single-mode loss information. The single mode loss functions corresponding to different modes are different. In one example, mask text modeling loss information (MLM) may be determined from a text modality feature matrix, mask video frame modeling loss information (MFM) may be determined from a video modality feature matrix, and both the mask text modeling loss information and the mask video frame modeling loss information may be determined in an unsupervised manner.
Step S17: and determining reference loss information according to the fusion modality feature matrix, wherein the reference loss information comprises one or more of cross entropy loss information and contrast loss information.
In an embodiment of the present specification, a feature of a fusion modality may include a multimedia feature (cls emb) and a modality-corresponding feature. In the segmentation process of step S133, the segmentation model may also obtain the corresponding fusion feature of the multimedia.
In one example, one multimedia is decomposed into two types of single-modal data (e.g., text-modality data and video-modality data); the fusion-modality features then include the multimedia feature (cls emb), the features corresponding to the text modality, and the features corresponding to the visual modality. In step S17, in addition to the multimedia feature cls emb, the features corresponding to the text modality and to the visual modality may be obtained, and each is averaged over the number dimension, yielding a text fusion feature (fused text emb) and a video fusion feature (fused video emb). Thus each multimedia yields a multimedia feature cls emb, a text fusion feature fused text emb, and a video fusion feature fused video emb; based on the target sample matrix, a multimedia feature matrix, a text fusion feature matrix, and a video fusion feature matrix can be obtained. Further, cross-entropy loss information can be determined according to the multimedia feature matrix; it may be secondary-industry classification cross-entropy loss information and may be determined in a supervised manner. The contrastive loss information can be determined according to the text fusion feature matrix and the video fusion feature matrix; it may be image-text matching contrastive learning loss information and may be determined in an unsupervised manner. The multimedia feature cls emb may be 10-dimensional, the text fusion feature fused text emb may be 768-dimensional, and the video fusion feature fused video emb may be 768-dimensional, which the present disclosure does not limit.
In another example, a multimedia is decomposed into one type of single-modal data (e.g., text-modality data); the fusion-modality feature then includes the multimedia feature (cls emb) and the features corresponding to the text modality. Each multimedia yields a multimedia feature cls emb and a text fusion feature fused text emb, and a multimedia feature matrix and a text fusion feature matrix can be obtained based on the target sample matrix. Further, cross-entropy loss information may be determined from the multimedia feature matrix.
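Continuing the previous sketch, the averaging over the number dimension can be a simple mean pooling (an assumption, not prescribed by the application):
```python
# text_emb: (B, Lt, D), video_emb: (B, Lv, D) from the extractor sketch above
fused_text_emb = text_emb.mean(dim=1)    # (B, D) text fusion feature
fused_video_emb = video_emb.mean(dim=1)  # (B, D) video fusion feature
```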
Step S15 may include:
step S151: training a preset multimedia feature extraction model based on the triplet state loss information and the target loss information until a preset condition is met to obtain the target multimedia feature extraction model, wherein the target loss information comprises one or more of monomodal loss information and reference loss information.
In this embodiment, the triplet loss information may be combined with any loss information in the target loss information to determine the target multimedia feature extraction model. The target loss information may be one or more of text modeling loss information, mask video frame modeling loss information, cross entropy loss information, and contrast loss information.
In the embodiment of the present specification, weighting processing may be performed on the triplet state loss information and the target loss information, different weights may be set for different losses according to needs, so as to obtain overall loss information, perform gradient calculation and return, and train a preset multimedia feature extraction model.
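A sketch of the weighted combination; the weight values are illustrative assumptions:
```python
def overall_loss(triplet, mlm, mfm, ce, itc,
                 weights=(1.0, 0.5, 0.5, 0.3, 0.3)):
    """Weighted sum of triplet, masked-text, masked-frame, cross-entropy
    and contrastive losses; call .backward() on the result to train."""
    w = weights
    return w[0] * triplet + w[1] * mlm + w[2] * mfm + w[3] * ce + w[4] * itc
```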
The embodiment of the specification performs model training based on various loss function information including triplet loss information, so that a more accurate target multimedia feature extraction model can be obtained, and the accuracy of subsequent multimedia recommendation is improved.
In the embodiment of the specification, training samples based on object information (such as object behaviors or object preferences) are introduced. The essence of the embodiments of the present specification is that, given a multimedia (which can be viewed as the current behavior of the object), the model learns which multimedia are more likely to co-occur with it (and should be recommended) and which are less likely to co-occur with it (and should not be recommended); in the training sample, the first multimedia may be recommended together with the anchor multimedia, and the second multimedia may not. This supervision is not tied to any specific modality and treats each multimedia as a whole unit, so it is more robust, relatively less susceptible to data-quality problems (such as image-text mismatches or first- and second-level category labeling errors), and applicable to different recommendation scenarios. Training the model on the triplet loss information targets the final recommendation tendency directly; in the broader context of multimedia recommendation, this loss is more direct than the other losses and can therefore bring a better effect to the downstream recommendation model.
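The triplet loss itself is standard; over a grouped target feature matrix of shape (num_samples, 3, dim), a minimal PyTorch sketch might look as follows, with the margin value being an assumption:

```python
import torch
import torch.nn.functional as F

def triplet_loss(target_features: torch.Tensor, margin: float = 0.2) -> torch.Tensor:
    # target_features: (num_samples, 3, dim), one (anchor, first, second)
    # multimedia feature triple per training sample.
    anchor, positive, negative = target_features.unbind(dim=1)
    d_pos = F.pairwise_distance(anchor, positive)  # anchor <-> first multimedia
    d_neg = F.pairwise_distance(anchor, negative)  # anchor <-> second multimedia
    # Require the anchor to be closer to the co-occurring multimedia than to
    # the non-co-occurring one by at least `margin`.
    return F.relu(d_pos - d_neg + margin).mean()
```

Up to defaults, this matches PyTorch's built-in torch.nn.TripletMarginLoss.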
Referring to fig. 10, an embodiment of the present disclosure provides a multimedia recommendation method applied to a multimedia recommendation system. In one example, the multimedia recommendation system may be a single-tower model, which may include a prediction model and a multimedia-side target multimedia feature extraction model.
The multimedia recommendation method provided by the embodiment of the specification comprises the following steps:
step S21: and determining an object to be recommended and a target multimedia corresponding to the object to be recommended.
In this embodiment of the present specification, the object to be recommended may be any object, and the target multimedia may be a multimedia on which the object to be recommended has performed a preset operation. In one example, when any object is detected performing a preset operation, that object may be determined as the object to be recommended, and the multimedia on which it performed the preset operation may be determined as the target multimedia. The preset operation may be clicking a viewing button, a forwarding button, or a purchase button in the multimedia, or the browsing duration of the multimedia reaching a preset duration, which is not limited by the present disclosure.
Step S22: inputting the target multimedia into a target multimedia feature extraction model, and performing feature extraction processing on the target multimedia to obtain multimedia features corresponding to the target multimedia, wherein the target multimedia feature extraction model is obtained according to the training method of the multimedia feature extraction model.
In this embodiment of the present specification, before step S22, the target multimedia is decomposed to obtain at least one type of single-modality data; the single-modality data is then serialized to obtain at least one single-modality identifier sequence.
In this embodiment, the target multimedia feature extraction model may include a target coding model, a target fusion model, and a target segmentation model. In step S22, each single-modality identifier sequence may be input into the target coding model of the corresponding modality to obtain at least one single-modality feature; the at least one single-modality feature is input into the target fusion model to obtain fusion modality features; and the fusion modality features are input into the target segmentation model to obtain the multimedia feature (cls emb) corresponding to the target multimedia.
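A hedged sketch of how the three sub-models could be wired together at inference time; the encoder and fusion architectures, and the reading of "segmentation" as slicing out the CLS position, are assumptions for the example:

```python
import torch
import torch.nn as nn

class TargetFeatureExtractor(nn.Module):
    def __init__(self, text_encoder: nn.Module, video_encoder: nn.Module,
                 fusion_model: nn.Module):
        super().__init__()
        self.text_encoder = text_encoder    # target coding model, text modality
        self.video_encoder = video_encoder  # target coding model, video modality
        self.fusion_model = fusion_model    # target fusion model

    def forward(self, text_ids: torch.Tensor, video_ids: torch.Tensor) -> torch.Tensor:
        text_feat = self.text_encoder(text_ids)    # single-modality features
        video_feat = self.video_encoder(video_ids)
        fused = self.fusion_model(torch.cat([text_feat, video_feat], dim=1))
        # Target segmentation model: slice the multimedia feature (cls emb)
        # out of the fusion modality features.
        return fused[:, 0, :]
```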
Step S23: inputting the multimedia characteristics corresponding to the target multimedia into the prediction model, and performing prediction processing on a plurality of preset multimedia according to the multimedia characteristics corresponding to the target multimedia to obtain respective prediction probabilities of the plurality of preset multimedia.
In this embodiment, the preset multimedia may belong to a preset database corresponding to the prediction model. In the single-tower model, the prediction model performs prediction processing on the plurality of preset multimedia according to the multimedia features, obtaining a prediction probability for each preset multimedia: the probability that the object to be recommended will perform the target operation on that preset multimedia. Optionally, the prediction processing may be click-rate prediction, with the target operation being clicking a viewing button and the prediction probability being a click-rate prediction probability; or it may be conversion-rate prediction, with the target operation being clicking a purchase button and the prediction probability being a conversion-rate prediction probability; the present disclosure is not limited thereto.
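One plausible form of this prediction step, assuming the prediction model scores the concatenation of the target multimedia feature with each preset multimedia's feature (the embodiment does not fix the input layout):

```python
import torch

def predict_probabilities(prediction_model, target_emb, preset_embs):
    # target_emb:  (dim,) feature of the target multimedia.
    # preset_embs: (num_presets, dim) features of the preset database.
    expanded = target_emb.unsqueeze(0).expand(preset_embs.size(0), -1)
    logits = prediction_model(torch.cat([expanded, preset_embs], dim=-1))
    # (num_presets,) click-rate or conversion-rate prediction probabilities.
    return torch.sigmoid(logits).squeeze(-1)
```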
Step S24: and determining the multimedia to be recommended from the preset multimedia according to the respective prediction probabilities of the preset multimedia.
In one example, the plurality of preset multimedia may be sorted in order of their respective prediction probabilities (click-rate prediction probabilities or conversion-rate prediction probabilities), and the preset multimedia with the highest prediction probability may be determined as the multimedia to be recommended.
In other examples, the multimedia price information (ECPM, effective cost per mille) of each of the plurality of preset multimedia may be determined according to its click-rate prediction probability, conversion-rate prediction probability, and multimedia bid, and the preset multimedia with the largest multimedia price information may be determined as the multimedia to be recommended. The multimedia bid may be the price that a producer of the multimedia pays the multimedia platform to promote the multimedia on that platform.
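The embodiment does not spell out the exact price formula; one common ECPM form is pCTR x pCVR x bid x 1000, illustrated below with made-up candidates:

```python
def ecpm(p_ctr: float, p_cvr: float, bid: float) -> float:
    # Assumed formula: expected revenue per thousand impressions.
    return 1000 * p_ctr * p_cvr * bid

# (id, click-rate probability, conversion-rate probability, bid)
candidates = [("mm_a", 0.03, 0.10, 2.0), ("mm_b", 0.05, 0.04, 4.0)]
best = max(candidates, key=lambda c: ecpm(c[1], c[2], c[3]))
print(best[0], ecpm(best[1], best[2], best[3]))  # mm_b 8.0
```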
Step S25: and recommending the multimedia to be recommended to the object to be recommended.
In the embodiment of the specification, after the multimedia to be recommended is determined, it can be pushed to the object to be recommended by means of short messages, voice, pop-up windows, and the like.
In the embodiment of the present specification, in a training sample of the target multimedia feature extraction model, the first object set corresponding to the first multimedia intersects the anchor object set corresponding to the anchor multimedia, while the second object set corresponding to the second multimedia is disjoint from the anchor object set. The anchor multimedia and the first multimedia are therefore correlated through anchor objects (objects in the anchor object set), whereas the anchor multimedia and the second multimedia have no object-based correlation. In other words, object information (object behaviors, object preferences, and the like) is blended into the training samples; training on such samples combines the object information with the multimedia material, injects it into the multimedia-side feature generation process, and fully mines the association between the two, yielding more accurate multimedia features. The target multimedia feature extraction model of this embodiment therefore produces more accurate multimedia features and improves the accuracy of subsequent recommendation.
In another example, the multimedia recommendation system may be a two-tower model, which may include a prediction model, a multimedia-side target multimedia feature extraction model, and an object-side object feature extraction model. The target multimedia feature extraction model may generate multimedia features from the target multimedia, and the object feature extraction model may generate object features from object information (information of the object to be recommended). The prediction model of the two-tower model can combine the multimedia features and the object features into full-scale features and perform prediction processing on the plurality of preset multimedia based on them, obtaining a prediction probability for each preset multimedia.
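A hedged sketch of the two-tower variant; how the full-scale features are formed (here, concatenation) and how presets are scored against them are assumptions, since the embodiment leaves the prediction model's internals open:

```python
import torch
import torch.nn as nn

class TwoTowerRecommender(nn.Module):
    def __init__(self, multimedia_tower: nn.Module, object_tower: nn.Module,
                 prediction_head: nn.Module):
        super().__init__()
        self.multimedia_tower = multimedia_tower  # multimedia-side extractor
        self.object_tower = object_tower          # object-side extractor
        self.prediction_head = prediction_head    # prediction model

    def forward(self, target_multimedia, object_info, preset_embs):
        mm_feat = self.multimedia_tower(target_multimedia)  # multimedia features
        obj_feat = self.object_tower(object_info)           # object features
        full = torch.cat([mm_feat, obj_feat], dim=-1)       # full-scale features
        query = self.prediction_head(full)                  # (batch, dim)
        # Score every preset multimedia against the combined representation.
        return torch.sigmoid(query @ preset_embs.t())       # (batch, num_presets)
```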
With reference to fig. 11, an embodiment of the present specification further provides a multimedia feature extraction model training apparatus, which may include:
a determining module 10, configured to determine a plurality of training samples from a plurality of multimedia, each training sample including an anchor multimedia, a first multimedia, and a second multimedia; the anchor multimedia is any one of the plurality of multimedia; a first object set corresponding to the first multimedia intersects an anchor object set corresponding to the anchor multimedia; a second object set corresponding to the second multimedia is disjoint from the anchor object set;
the sequencing module 20 is configured to sequence a plurality of training samples to obtain a target sample matrix;
the feature extraction module 30 is configured to input the target sample matrix into a preset multimedia feature extraction model, and perform feature extraction processing to obtain a multimedia feature matrix, where a number of rows of the multimedia feature matrix is the same as a number of rows of the target sample matrix;
a triplet loss determining module 40, configured to determine triplet loss information according to the multimedia feature matrix;
and a model training module 50, configured to train the preset multimedia feature extraction model based on the triplet loss information until a preset condition is met, to obtain the target multimedia feature extraction model.
In one possible implementation, the triplet loss determining module 40 may include:
a grouping unit, configured to group the multimedia features in the multimedia feature matrix to obtain a target feature matrix, wherein the target feature matrix comprises sample features corresponding to the plurality of training samples, and each sample feature comprises the multimedia features corresponding to an anchor multimedia, a first multimedia, and a second multimedia;
and a triplet loss determining unit, configured to determine the triplet loss information according to the plurality of sample features of the target feature matrix.
In one possible implementation manner, in each training sample, the anchor point multimedia, the first multimedia and the second multimedia are arranged according to a preset multimedia arrangement sequence;
two adjacent training samples in the target sample matrix are different training samples.
In one possible implementation, the sequencing module 20 may include:
an inter-sample arrangement unit, configured to perform inter-sample arrangement processing on the plurality of training samples to obtain a reference sample matrix, wherein the rows of the reference sample matrix respectively correspond to the plurality of training samples;
and a sample row-column conversion unit, configured to perform sample row-column conversion processing on the reference sample matrix to obtain the target sample matrix, wherein the number of rows of the target sample matrix is one;
the grouping unit may include:
and a feature row-column conversion subunit, configured to perform feature row-column conversion processing on the multimedia feature matrix based on the plurality of training samples to obtain the target feature matrix, wherein the feature row-column conversion processing is the inverse of the sample row-column conversion processing, and the number of rows of the target feature matrix equals the number of training samples.
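These row-column conversions amount to reshapes; a minimal numeric sketch in PyTorch, with illustrative shapes:

```python
import torch

num_samples, dim = 4, 8
# Reference sample matrix: one row per training sample, each row holding the
# (anchor, first, second) multimedia in the preset arrangement order.
reference = torch.arange(num_samples * 3).reshape(num_samples, 3)
# Sample row-column conversion: flatten to a single row so the whole batch
# passes through the feature extraction model in one forward pass.
target_samples = reference.reshape(1, -1)             # (1, 3 * num_samples)
# The model yields one multimedia feature per position of that single row.
features = torch.randn(target_samples.numel(), dim)   # multimedia feature matrix
# Feature row-column conversion (the inverse reshape): regroup per sample.
target_features = features.reshape(num_samples, 3, dim)
```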
In one possible implementation, the preset multimedia feature extraction model comprises a coding model, a fusion model, and a segmentation model;
the feature extraction module 30 may include:
an encoding unit, configured to input the target sample matrix into the coding model and encode it to obtain a single-modality feature matrix;
a fusion unit, configured to input the single-modality feature matrix into the fusion model and fuse it to obtain a fusion modality feature matrix;
and a segmentation unit, configured to input the fusion modality feature matrix into the segmentation model and segment each fusion modality feature of the fusion modality feature matrix to obtain the multimedia feature matrix.
In one possible implementation, the apparatus may further include:
a single-modality loss determining module, configured to determine single-modality loss information according to the single-modality feature matrix;
a reference loss determining module, configured to determine reference loss information according to the fusion modality feature matrix, wherein the reference loss information comprises one or more of cross-entropy loss information and contrast loss information;
the model training module 50 includes:
and a model training unit, configured to train the preset multimedia feature extraction model based on the triplet loss information and target loss information until the preset condition is met, to obtain the target multimedia feature extraction model, wherein the target loss information comprises one or more of the single-modality loss information and the reference loss information.
In one possible implementation, the determining module 10 may include:
a first determining unit, configured to determine an anchor multimedia from the plurality of multimedia;
a second determining unit, configured to determine an anchor object set, a candidate multimedia set, and a second multimedia set corresponding to the anchor multimedia, wherein the anchor object set is a set of objects that have performed a preset operation on the anchor multimedia, and the candidate multimedia set comprises the multimedia on which each anchor object in the anchor object set has performed the preset operation; the second multimedia set comprises multimedia of the plurality of multimedia other than the candidate multimedia set;
a screening unit, configured to screen a first multimedia set from the candidate multimedia set, wherein the first multimedia set comprises multimedia in the candidate multimedia set other than the anchor multimedia;
and an arrangement unit, configured to arrange the anchor multimedia, the first multimedia, and the second multimedia according to a preset multimedia arrangement sequence to obtain the plurality of training samples, wherein the first multimedia is any multimedia in the first multimedia set and the second multimedia is any multimedia in the second multimedia set.
With regard to the apparatus in the above-described embodiment, the specific manner in which the respective modules and units perform operations has been described in detail in the embodiment related to the method, and will not be elaborated upon here.
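As a concrete illustration of the determining module's sample construction from behavior logs, a hedged sketch follows; the log format, helper name, and random selection policy are assumptions:

```python
import random

def build_training_samples(interactions: dict, all_multimedia: list, num_samples: int):
    # interactions: multimedia id -> set of object ids that performed the
    # preset operation on it (assumed log format).
    samples = []
    for _ in range(num_samples):
        anchor = random.choice(list(interactions))
        anchor_objects = interactions[anchor]  # anchor object set
        # Candidate set: other multimedia sharing at least one anchor object.
        first_set = [m for m, objs in interactions.items()
                     if m != anchor and objs & anchor_objects]
        # Second multimedia set: no object overlap with the anchor object set.
        second_set = [m for m in all_multimedia
                      if m != anchor and not (interactions.get(m, set()) & anchor_objects)]
        if first_set and second_set:
            # Preset multimedia arrangement order: (anchor, first, second).
            samples.append((anchor, random.choice(first_set), random.choice(second_set)))
    return samples
```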
FIG. 12 shows a block diagram of an electronic device for multimedia feature extraction model training or multimedia recommendation provided according to an embodiment of the present application. The electronic device may be a server, and its internal structure may be as shown in fig. 12. The electronic device includes a processor, a memory, and a network interface connected by a system bus. The processor of the electronic device is configured to provide computing and control capabilities. The memory of the electronic device includes a non-volatile storage medium and an internal memory; the non-volatile storage medium stores an operating system and a computer program, and the internal memory provides an environment for their operation. The network interface of the electronic device is used to connect and communicate with external terminals over a network. The computer program, when executed by the processor, implements the training method of the multimedia feature extraction model or the multimedia recommendation method.
Those skilled in the art will appreciate that the structure shown in fig. 12 is a block diagram of only a portion of the structure relevant to the present disclosure and does not constitute a limitation on the electronic device to which the present disclosure is applied; a particular electronic device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
In an exemplary embodiment, there is also provided an electronic device including: a processor; a memory for storing the processor-executable instructions; wherein the processor is configured to execute the instructions to implement a multimedia recommendation method or a training method of a multimedia feature extraction model as in the embodiments of the present application.
In an exemplary embodiment, there is also provided a storage medium; when instructions in the storage medium are executed by a processor of an electronic device, the electronic device is enabled to execute the training method of the multimedia feature extraction model or the multimedia recommendation method in the embodiments of the present application.
In an exemplary embodiment, a computer program product containing instructions is also provided, which when run on a computer causes the computer to perform the training method of the multimedia feature extraction model or the multimedia recommendation method in the embodiments of the present application.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing relevant hardware; the program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus Direct RAM (RDRAM), Direct Rambus Dynamic RAM (DRDRAM), and Rambus Dynamic RAM (RDRAM).
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
It will be understood that the present application is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the application is limited only by the appended claims.
Claims (11)
1. A training method of a multimedia feature extraction model is characterized by comprising the following steps:
determining a plurality of training samples from a plurality of multimedia, each training sample comprising an anchor multimedia, a first multimedia, and a second multimedia; the anchor multimedia is any one of the plurality of multimedia; a first object set corresponding to the first multimedia intersects an anchor object set corresponding to the anchor multimedia; a second object set corresponding to the second multimedia is disjoint from the anchor object set;
sequencing the training samples to obtain a target sample matrix;
inputting the target sample matrix into a preset multimedia feature extraction model, and performing feature extraction processing to obtain a multimedia feature matrix, wherein the number of rows of the multimedia feature matrix is the same as that of the target sample matrix;
determining triplet loss information according to the multimedia feature matrix;
and training the preset multimedia feature extraction model based on the triplet loss information until a preset condition is met, to obtain a target multimedia feature extraction model.
2. The method of claim 1, wherein the determining triplet loss information according to the multimedia feature matrix comprises:
grouping multimedia features in the multimedia feature matrix to obtain a target feature matrix, wherein the target feature matrix comprises sample features corresponding to the plurality of training samples, and each sample feature comprises the multimedia features corresponding to the anchor multimedia, the first multimedia, and the second multimedia;
determining the triplet loss information according to a plurality of sample features of the target feature matrix.
3. The method as claimed in claim 2, wherein the step of performing sequencing processing on the plurality of training samples to obtain the target sample matrix comprises:
performing inter-sample arrangement processing on the plurality of training samples to obtain a reference sample matrix, wherein a plurality of rows of the reference sample matrix respectively correspond to the plurality of training samples;
and performing sample row-column conversion processing on the reference sample matrix to obtain the target sample matrix, wherein the number of rows of the target sample matrix is one.
4. The method for training the multimedia feature extraction model according to claim 3, wherein the grouping the multimedia features in the multimedia feature matrix to obtain a target feature matrix comprises:
performing feature row-column conversion processing on the multimedia feature matrix based on the plurality of training samples to obtain the target feature matrix, wherein the feature row-column conversion processing is the inverse of the sample row-column conversion processing, and the number of rows of the target feature matrix is the number of the plurality of training samples.
5. The method for training a multimedia feature extraction model according to claim 1, wherein the preset multimedia feature extraction model comprises a coding model, a fusion model and a segmentation model;
the step of inputting the target sample matrix into the preset multimedia feature extraction model for feature extraction processing to obtain a multimedia feature matrix comprises:
inputting the target sample matrix into the coding model, and encoding the target sample matrix to obtain a single-modality feature matrix;
inputting the single-modality feature matrix into the fusion model, and performing fusion processing on the single-modality feature matrix to obtain a fusion modality feature matrix;
and inputting the fusion modality feature matrix into the segmentation model, and performing segmentation processing on each fusion modality feature of the fusion modality feature matrix to obtain the multimedia feature matrix.
6. The method of claim 5, further comprising:
determining single-modality loss information according to the single-modality feature matrix;
determining reference loss information according to the fusion modality feature matrix, wherein the reference loss information comprises one or more of cross entropy loss information and contrast loss information;
the training of the preset multimedia feature extraction model based on the triplet loss information until a preset condition is met to obtain a target multimedia feature extraction model comprises:
training the preset multimedia feature extraction model based on the triplet loss information and target loss information until the preset condition is met to obtain the target multimedia feature extraction model, wherein the target loss information comprises one or more of the single-modality loss information and the reference loss information.
7. The method of claim 1, wherein determining a plurality of training samples from a plurality of multimedia comprises:
determining the anchor multimedia from the plurality of multimedia;
determining the anchor object set, a candidate multimedia set and a second multimedia set corresponding to the anchor multimedia, wherein the anchor object set is a set of objects that have performed a preset operation on the anchor multimedia, and the candidate multimedia set comprises the multimedia on which each anchor object in the anchor object set has performed the preset operation; the second multimedia set comprises multimedia of the plurality of multimedia other than the candidate multimedia set;
screening a first multimedia set from the candidate multimedia set, wherein the first multimedia set comprises multimedia in the candidate multimedia set except the anchor multimedia;
arranging the anchor multimedia, the first multimedia and the second multimedia according to a preset multimedia arrangement sequence to obtain the plurality of training samples; wherein the first multimedia is any multimedia in the first multimedia set, and the second multimedia is any multimedia in the second multimedia set.
8. A method for multimedia recommendation, comprising:
determining an object to be recommended and a target multimedia corresponding to the object to be recommended;
inputting the target multimedia into a target multimedia feature extraction model, and performing feature extraction processing on the target multimedia to obtain multimedia features corresponding to the target multimedia, wherein the target multimedia feature extraction model is obtained according to the training method of any one of claims 1 to 7;
inputting the multimedia characteristics corresponding to the target multimedia into a prediction model, and performing prediction processing on a plurality of preset multimedia according to the multimedia characteristics corresponding to the target multimedia to obtain respective prediction probabilities of the plurality of preset multimedia;
determining multimedia to be recommended from the preset multimedia according to the respective prediction probabilities of the preset multimedia;
and recommending the multimedia to be recommended to the object to be recommended.
9. A training device for a multimedia feature extraction model is characterized by comprising:
a determining module for determining a plurality of training samples from a plurality of multimedia, each training sample comprising an anchor multimedia, a first multimedia, and a second multimedia; the anchor multimedia is any one of the plurality of multimedia; a first object set corresponding to the first multimedia intersects an anchor object set corresponding to the anchor multimedia; a second object set corresponding to the second multimedia is disjoint from the anchor object set;
a sequencing module for sequencing the plurality of training samples to obtain a target sample matrix;
a feature extraction module for inputting the target sample matrix into a preset multimedia feature extraction model and performing feature extraction processing to obtain a multimedia feature matrix, wherein the number of rows of the multimedia feature matrix is the same as that of the target sample matrix;
a triplet loss determining module for determining triplet loss information according to the multimedia feature matrix;
and a model training module for training the preset multimedia feature extraction model based on the triplet loss information until a preset condition is met, to obtain a target multimedia feature extraction model.
10. An electronic device, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to perform the method of training a multimedia feature extraction model of any one of claims 1 to 7, or the processor is configured to perform the method of multimedia recommendation of claim 8.
11. A non-transitory computer readable storage medium having stored thereon computer program instructions, wherein the computer program instructions, when executed by a processor, implement the method for training a multimedia feature extraction model according to any one of claims 1 to 7, or wherein the computer program instructions, when executed by a processor, implement the method for multimedia recommendation according to claim 8.